

Covariate Adjustment in Experiments with Matched Pairs

Yichong Zhang acknowledges the financial support from the NSFC under the grant No. 72133002 and a Lee Kong Chian fellowship. Any and all errors are our own.

Yuehao Bai
Department of Economics
University of Southern California
[email protected]
   Liang Jiang
Fanhai International School of Finance
Fudan University
[email protected]
   Joseph P. Romano
Departments of Economics & Statistics
Stanford University
[email protected]
   Azeem M. Shaikh
Department of Economics
University of Chicago
[email protected]
   Yichong Zhang
School of Economics
Singapore Management University
[email protected]
Abstract

This paper studies inference on the average treatment effect (ATE) in experiments in which treatment status is determined according to “matched pairs” and it is additionally desired to adjust for observed, baseline covariates to gain further precision. By a “matched pairs” design, we mean that units are sampled i.i.d. from the population of interest, paired according to observed, baseline covariates and finally, within each pair, one unit is selected at random for treatment. Importantly, we presume that not all observed, baseline covariates are used in determining treatment assignment. We study a broad class of estimators based on a “doubly robust” moment condition that permits us to study estimators with both finite-dimensional and high-dimensional forms of covariate adjustment. We find that estimators with finite-dimensional, linear adjustments need not lead to improvements in precision relative to the unadjusted difference-in-means estimator. This phenomenon persists even if the adjustments are interacted with treatment; in fact, doing so leads to no changes in precision. However, gains in precision can be ensured by including fixed effects for each of the pairs. Indeed, we show that this adjustment leads to the minimum asymptotic variance of the corresponding ATE estimator among all finite-dimensional, linear adjustments. We additionally study an estimator with a regularized adjustment, which can accommodate high-dimensional covariates. We show that this estimator leads to improvements in precision relative to the unadjusted difference-in-means estimator and also provide conditions under which it leads to the “optimal” nonparametric, covariate adjustment. A simulation study confirms the practical relevance of our theoretical analysis, and the methods are employed to reanalyze data from an experiment using a “matched pairs” design to study the effect of macroinsurance on microenterprise.

KEYWORDS: Experiment, matched pairs, covariate adjustment, randomized controlled trial, treatment assignment, LASSO

JEL classification codes: C12, C14

1 Introduction

This paper studies inference on the average treatment effect in experiments in which treatment status is determined according to “matched pairs.” By a “matched pairs” design, we mean that units are sampled i.i.d. from the population of interest, paired according to observed, baseline covariates and finally, within each pair, one unit is selected at random for treatment. This method is used routinely in all parts of the sciences. Indeed, commands to facilitate its implementation are included in popular software packages, such as sampsi in Stata. References to a variety of specific examples can be found, for instance, in the following surveys of various field experiments: Donner and Klar (2000), Glennerster and Takavarasha (2013), and Rosenberger and Lachin (2015). See also Bruhn and McKenzie (2009), who, based on a survey of selected development economists, report that 56% of researchers have used such a design at some point. Bai et al. (2022) develop methods for inference on the average treatment effect in such experiments based on the difference-in-means estimator. In this paper, we pursue the goal of improving upon the precision of this estimator by exploiting observed, baseline covariates that are not used in determining treatment status.

To this end, we study a broad class of estimators for the average treatment effect based on a “doubly robust” moment condition. The estimators in this framework are distinguished via different “working models” for the conditional expectations of potential outcomes under treatment and control given the observed, baseline covariates. Importantly, because of the double-robustness, these “working models” need not be correctly specified in order for the resulting estimator to be consistent. In this way, the framework permits us to study both finite-dimensional and high-dimensional forms of covariate adjustment without imposing unreasonable restrictions on the conditional expectations themselves. Under high-level conditions on the “working models” and their corresponding estimators and a requirement that pairs are formed so that units within pairs are suitably “close” in terms of the baseline covariates, we derive the limiting distribution of the covariate-adjusted estimator of the average treatment effect. We further construct an estimator for the variance of the limiting distribution and provide conditions under which it is consistent for this quantity.

Using our general framework, we first consider finite-dimensional, linear adjustments. For this class of estimators, our main findings are summarized as follows. First, we find that estimators with such adjustments are not guaranteed to be weakly more efficient than the unadjusted difference-in-means estimator. This finding echoes similar findings by Yang and Tsiatis (2001) and Tsiatis et al. (2008) in settings in which treatment is determined by i.i.d. coin flips, and Freedman (2008) in a finite population setting in which treatment is determined according to complete randomization. See Negi and Wooldridge (2021) for a succinct treatment of that literature. Moreover, we find that this phenomenon persists even if the adjustments are interacted with treatment. In fact, doing so leads to no changes in precision. In this sense, our results diverge from those in settings with complete randomization and treated fraction one half, where adjustments based on the uninteracted and interacted linear adjustments both guarantee gains in precision. Last, we show that estimators with both uninteracted and interacted linear adjustments with pair fixed effects are guaranteed to be weakly more efficient than the unadjusted difference-in-means estimator.

We then use our framework to consider high-dimensional adjustments based on 1\ell_{1} penalization. Specifically, we first obtain an intermediate estimator by using the LASSO to estimate the “working model” for the relevant conditional expectations. When the treatment is determined according to “matched pairs,” however, this estimator need not be more precise than the unadjusted difference-in-means estimator. Therefore, following Cohen and Fogarty (2023), we consider, in an additional step, an estimator based on the finite-dimensional, linear adjustment described above that uses the predicted values for the “working model” as the covariates and includes fixed effects for each of the pairs. We show that the resulting estimator improves upon both the intermediate estimator and the unadjusted difference-in-means estimator in terms of precision. Moreover, we provide conditions under which the refitted adjustments attain the relevant efficiency bound derived by Armstrong (2022).

Concurrent with our paper, Cytrynbaum (2023) considers covariate adjustment in experiments in which units are grouped into tuples with possibly more than two units, rather than pairs. Both our paper and Cytrynbaum (2023) find that finite-dimensional, linear regression adjustments with pair fixed effects are guaranteed to improve precision relative to the unadjusted difference-in-means estimator, and show that such adjustments are indeed optimal among all linear adjustments. However, Cytrynbaum (2023) does not pursue more general forms of covariate adjustments, including the regularized adjustments described above. Our results on such adjustments permit us to study nonparametric adjustments as well as high-dimensional adjustments using covariates whose dimension diverges rapidly with the sample size.

The remainder of our paper is organized as follows. In Section 2, we describe our setup and notation. In particular, there we describe the precise sense in which we require that units in each pair are “close” in terms of their baseline covariates. In Section 3, we introduce our general class of estimators based on a “doubly robust” moment condition. Under certain high-level conditions on the “working models” and their corresponding estimators, we derive the limiting behavior of the covariate-adjusted estimator. In Section 4, we use our general framework to study a variety of estimators with finite-dimensional, linear covariate adjustment. In Section 5, we use our general framework to study covariate adjustment based on the regularized regression. In Section 6, we examine the finite-sample behavior of tests based on these different estimators via a small simulation study. We find that covariate adjustment can lead to considerable gains in precision. Finally, in Section 7, we apply our methods to reanalyze data from an experiment using a “matched pairs” design to study the effect of macroinsurance on microenterprise. Proofs of all results and some details for simulations are given in the Online Supplement.

2 Setup and Notation

Let Yi𝐑Y_{i}\in\mathbf{R} denote the (observed) outcome of interest for the iith unit, Di{0,1}D_{i}\in\{0,1\} be an indicator for whether the iith unit is treated, and Xi𝐑kxX_{i}\in\mathbf{R}^{k_{x}} and Wi𝐑kwW_{i}\in\mathbf{R}^{k_{w}} denote observed, baseline covariates for the iith unit; XiX_{i} and WiW_{i} will be distinguished below through the feature that only the former will be used in determining treatment assignment. Further denote by Yi(1)Y_{i}(1) the potential outcome of the iith unit if treated and by Yi(0)Y_{i}(0) the potential outcome of the iith unit if not treated. The (observed) outcome and potential outcomes are related to treatment status by the relationship

Yi=Yi(1)Di+Yi(0)(1Di).Y_{i}=Y_{i}(1)D_{i}+Y_{i}(0)(1-D_{i})~{}. (1)

For a random variable indexed by ii, AiA_{i}, it will be useful to denote by A(n)A^{(n)} the random vector (A1,,A2n)(A_{1},\ldots,A_{2n}). Denote by PnP_{n} the distribution of the observed data Z(n)Z^{(n)}, where Zi=(Yi,Di,Xi,Wi)Z_{i}=(Y_{i},D_{i},X_{i},W_{i}), and by QnQ_{n} the distribution of U(n)U^{(n)}, where Ui=(Yi(1),Yi(0),Xi,Wi)U_{i}=(Y_{i}(1),Y_{i}(0),X_{i},W_{i}). Note that PnP_{n} is determined by (1), QnQ_{n}, and the mechanism for determining treatment assignment. We assume throughout that U(n)U^{(n)} consists of 2n2n i.i.d. observations, i.e., Qn=Q2nQ_{n}=Q^{2n}, where QQ is the marginal distribution of UiU_{i}. We therefore state our assumptions below in terms of assumptions on QQ and the mechanism for determining treatment assignment. Indeed, we will not make reference to PnP_{n} in the sequel, and all operations are understood to be under QQ and the mechanism for determining the treatment assignment. Our object of interest is the average effect of the treatment on the outcome of interest, which may be expressed in terms of this notation as

Δ(Q)=E[Yi(1)Yi(0)].\Delta(Q)=E[Y_{i}(1)-Y_{i}(0)]~{}. (2)

We now describe our assumptions on QQ. We restrict QQ to satisfy the following mild requirement:

Assumption 2.1.

The distribution QQ is such that

  1. (a)

    0<E[Var[Yi(d)|Xi]]0<E[\mathrm{Var}[Y_{i}(d)|X_{i}]] for d{0,1}d\in\{0,1\}.

  2. (b)

    E[Yi2(d)]<E[Y_{i}^{2}(d)]<\infty for d{0,1}d\in\{0,1\}.

  3. (c)

    E[Yi(d)|Xi=x]E[Y_{i}(d)|X_{i}=x] and E[Yi2(d)|Xi=x]E[Y_{i}^{2}(d)|X_{i}=x] are Lipschitz for d{0,1}d\in\{0,1\}.

Next, we describe our assumptions on the mechanism determining treatment assignment. In order to describe these assumptions more formally, we require some further notation to define the relevant pairs of units. The nn pairs may be represented by the sets

{π(2j1),π(2j)} for j=1,,n,\{\pi(2j-1),\pi(2j)\}\text{ for }j=1,\ldots,n~{},

where π=πn(X(n))\pi=\pi_{n}(X^{(n)}) is a permutation of 2n2n elements. Because of its possible dependence on X(n)X^{(n)}, π\pi encompasses a broad variety of different ways of pairing the 2n2n units according to the observed, baseline covariates X(n)X^{(n)}. Given such a π\pi, we assume that treatment status is assigned as described in the following assumption:

Assumption 2.2.

Treatment status is assigned so that (Y(n)(1),Y(n)(0),W(n))D(n)|X(n)(Y^{(n)}(1),Y^{(n)}(0),W^{(n)})\perp\!\!\!\perp D^{(n)}|X^{(n)} and, conditional on X(n)X^{(n)}, (Dπ(2j1),Dπ(2j)),j=1,,n(D_{\pi(2j-1)},D_{\pi(2j)}),j=1,\ldots,n are i.i.d. and each uniformly distributed over the values in {(0,1),(1,0)}\{(0,1),(1,0)\}.

Following Bai et al. (2022), our analysis will additionally require some discipline on the way in which pairs are formed. Let 2\|\cdot\|_{2} denote the Euclidean norm. We will require that units in each pair are “close” in the sense described by the following assumption:

Assumption 2.3.

The pairs used in determining treatment status satisfy

1n1jnXπ(2j)Xπ(2j1)2rP0\frac{1}{n}\sum_{1\leq j\leq n}\|X_{\pi(2j)}-X_{\pi(2j-1)}\|_{2}^{r}\stackrel{{\scriptstyle P}}{{\rightarrow}}0

for r{1,2}r\in\{1,2\}.

It will at times be convenient to require further that units in consecutive pairs are also “close” in terms of their baseline covariates. One may view this requirement, which is formalized in the following assumption, as “pairing the pairs” so that they are “close” in terms of their baseline covariates.

Assumption 2.4.

The pairs used in determining treatment status satisfy

1n1jn2Xπ(4jk)Xπ(4j)22P0\frac{1}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}\|X_{\pi(4j-k)}-X_{\pi(4j-\ell)}\|_{2}^{2}\stackrel{{\scriptstyle P}}{{\rightarrow}}0

for any k{2,3}k\in\{2,3\} and {0,1}\ell\in\{0,1\}.

Bai et al. (2022) provide results to facilitate constructing pairs satisfying Assumptions 2.32.4 under weak assumptions on QQ. In particular, given pairs satisfying Assumption 2.3, it is frequently possible to “re-order” them so that Assumption 2.4 is satisfied. See Theorem 4.3 in Bai et al. (2022) for further details. As in Bai et al. (2022), we highlight the fact that Assumption 2.4 will only be used to enable consistent estimation of relevant variances.
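To fix ideas, the following is a minimal sketch, in Python, of how such a design might be simulated when XiX_{i} is scalar: units are sorted by XiX_{i}, adjacent units are paired, and treatment is randomized within each pair. The data-generating process and all variable names here are hypothetical and purely illustrative; sorting on a scalar covariate is one simple way of producing pairs consistent with Assumptions 2.3–2.4.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                       # number of pairs; 2n units in total

# Hypothetical DGP: X is used to form pairs, W is observed but not used in the design.
X = rng.uniform(size=2 * n)
W = rng.normal(size=2 * n)
Y0 = X + 0.5 * W + rng.normal(size=2 * n)          # potential outcome under control
Y1 = 1.0 + X + 0.5 * W + rng.normal(size=2 * n)    # potential outcome under treatment

# Form pairs by sorting on X: adjacent units are paired (Assumption 2.3), and
# consecutive pairs are then also close in X (Assumption 2.4).
pi = np.argsort(X)                                 # pairs are {pi[2j], pi[2j+1]}

# Within each pair, one unit is treated at random (Assumption 2.2).
D = np.zeros(2 * n, dtype=int)
for j in range(n):
    treated = rng.integers(2)                      # uniform over {(0,1),(1,0)}
    D[pi[2 * j + treated]] = 1

Y = Y1 * D + Y0 * (1 - D)                          # observed outcome, as in (1)
```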

Remark 2.1.

Under this setup, Bai et al. (2022) consider the unadjusted difference-in-means estimator

Δ^nunadj=1n1i2nDiYi1n1i2n(1Di)Yi\displaystyle\hat{\Delta}_{n}^{\rm unadj}=\frac{1}{n}\sum_{1\leq i\leq 2n}D_{i}Y_{i}-\frac{1}{n}\sum_{1\leq i\leq 2n}(1-D_{i})Y_{i} (3)

and show that it is consistent and asymptotically normal with limiting variance

σunadj2(Q)=12Var[E[Yi(1)Yi(0)|Xi]]+E[Var[Yi(1)|Xi]]+E[Var[Yi(0)|Xi]].\displaystyle\sigma_{\mathrm{unadj}}^{2}(Q)=\frac{1}{2}\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)|X_{i}]]+E[\operatorname*{Var}[Y_{i}(1)|X_{i}]]+E[\operatorname*{Var}[Y_{i}(0)|X_{i}]]~{}.

We note that Δ^nunadj\hat{\Delta}_{n}^{\rm unadj} is the unadjusted estimator because it does not use information in WiW_{i} in either the design or analysis stage. If both XiX_{i} and WiW_{i} are used to form pairs in the “matched pairs” design, then the difference-in-means estimator, which we refer to as Δ^nideal\hat{\Delta}_{n}^{\rm ideal}, has limiting variance

σideal2(Q)=12Var[E[Yi(1)Yi(0)|Xi,Wi]]+E[Var[Yi(1)|Xi,Wi]]+E[Var[Yi(0)|Xi,Wi]].\displaystyle\sigma^{2}_{\mathrm{ideal}}(Q)=\frac{1}{2}\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(1)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]]~{}.

In this case, Δ^nideal\hat{\Delta}_{n}^{\rm ideal} achieves the efficiency bound derived by Armstrong (2022), and we can see that

σunadj2(Q)σideal2(Q)\displaystyle\sigma_{\mathrm{unadj}}^{2}(Q)-\sigma^{2}_{\mathrm{ideal}}(Q) =12E[Var[E[Yi(1)+Yi(0)|Xi,Wi]|Xi]]0.\displaystyle=\frac{1}{2}E[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]|X_{i}]]\geq 0~{}.

For related results for parameters other than the average treatment effect, see Bai et al. (2023a). We note, however, that it is not always practical to form pairs using both XiX_{i} and WiW_{i} for two reasons. First, the covariate WiW_{i} may only be collected along with the outcome variable and therefore may not be available at the design stage. Second, the quality of pairing decreases with the dimension of matching variables. Indeed, it is common in practice to match on some but not all baseline covariates. Such considerations motivate our analysis below.   
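On the hypothetical simulated data from the sketch above, the unadjusted difference-in-means estimator in (3) is a single line; it uses neither WiW_{i} nor the baseline covariates beyond their role in the design itself.

```python
# Difference-in-means estimator (3); by construction each arm contains exactly n units.
delta_unadj = Y[D == 1].mean() - Y[D == 0].mean()
```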

3 Main Results

To accommodate various forms of covariate-adjusted estimators of Δ(Q)\Delta(Q) in a single framework, it is useful to note that it follows from Assumption 2.2 that for any d{0,1}d\in\{0,1\} and any function md,n:𝐑kx×𝐑kw𝐑m_{d,n}:\mathbf{R}^{k_{x}}\times\mathbf{R}^{k_{w}}\rightarrow\mathbf{R} such that E[|md,n(Xi,Wi)|]<E[|m_{d,n}(X_{i},W_{i})|]<\infty,

E[2I{Di=d}(Yimd,n(Xi,Wi))+md,n(Xi,Wi)]=E[Yi(d)].E\left[2I\{D_{i}=d\}(Y_{i}-m_{d,n}(X_{i},W_{i}))+m_{d,n}(X_{i},W_{i})\right]=E[Y_{i}(d)]~{}. (4)

We note that (4) is just the augmented inverse propensity score weighted moment for E[Yi(d)]E[Y_{i}(d)] in which the propensity score is 1/21/2 and the conditional mean model is md,n(Xi,Wi)m_{d,n}(X_{i},W_{i}). Such a moment is also “doubly robust.” As the propensity score for the “matched pairs” design is exactly one half, we do not require the conditional mean model to be correctly specified, i.e., md,n(Xi,Wi)=E[Yi(d)|Xi,Wi]m_{d,n}(X_{i},W_{i})=E[Y_{i}(d)|X_{i},W_{i}]. See, for instance, Robins et al. (1995). Intuitively, md,nm_{d,n} is the “working model” which researchers use to estimate E[Yi(d)|Xi,Wi]E[Y_{i}(d)|X_{i},W_{i}], and can be arbitrarily misspecified because of (4). Although md,nm_{d,n} will be identical across n1n\geq 1 for the examples in Section 4, the notation permits md,nm_{d,n} to depend on the sample size nn in anticipation of the high-dimensional results in Section 5. Based on the moment condition in (4), our proposed estimator of Δ(Q)\Delta(Q) is given by

Δ^n=μ^n(1)μ^n(0),\hat{\Delta}_{n}=\hat{\mu}_{n}(1)-\hat{\mu}_{n}(0)~{}, (5)

where, for d{0,1}d\in\{0,1\},

μ^n(d)=12n1i2n(2I{Di=d}(Yim^d,n(Xi,Wi))+m^d,n(Xi,Wi))\hat{\mu}_{n}(d)=\frac{1}{2n}\sum_{1\leq i\leq 2n}(2I\{D_{i}=d\}(Y_{i}-\hat{m}_{d,n}(X_{i},W_{i}))+\hat{m}_{d,n}(X_{i},W_{i})) (6)

and m^d,n\hat{m}_{d,n} is a suitable estimator of the “working model” md,nm_{d,n} in (4).

By some simple algebra (we thank the referee for this excellent point), we have

Δ^n=1n1i2nDiY~i1n1i2n(1Di)Y~i,\displaystyle\hat{\Delta}_{n}=\frac{1}{n}\sum_{1\leq i\leq 2n}D_{i}\tilde{Y}_{i}-\frac{1}{n}\sum_{1\leq i\leq 2n}(1-D_{i})\tilde{Y}_{i}~{}, (7)

where

Y~i\displaystyle\tilde{Y}_{i} =Yi12(m^1,n(Xi,Wi)+m^0,n(Xi,Wi)).\displaystyle=Y_{i}-\frac{1}{2}(\hat{m}_{1,n}(X_{i},W_{i})+\hat{m}_{0,n}(X_{i},W_{i}))~{}. (8)

This means that our regression-adjusted estimator can be viewed as a difference-in-means estimator applied to the “adjusted” outcome Y~i\tilde{Y}_{i}.
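As a concrete illustration, the sketch below computes Δ^n\hat{\Delta}_{n} in (5)–(6) from generic fitted working-model values and, equivalently, via the adjusted-outcome representation (7)–(8). The function names are ours, and the inputs m1_hat and m0_hat stand for whatever fitted values m^d,n(Xi,Wi)\hat{m}_{d,n}(X_{i},W_{i}) the researcher has formed; with both set to zero, either function reduces to the unadjusted difference-in-means estimator.

```python
import numpy as np

def adjusted_ate(Y, D, m1_hat, m0_hat):
    """Covariate-adjusted estimator (5)-(6); m1_hat and m0_hat hold the fitted
    working-model values, one entry per unit."""
    mu1 = np.mean(2 * (D == 1) * (Y - m1_hat) + m1_hat)
    mu0 = np.mean(2 * (D == 0) * (Y - m0_hat) + m0_hat)
    return mu1 - mu0

def adjusted_ate_dim(Y, D, m1_hat, m0_hat):
    """Equivalent difference-in-means form (7)-(8) based on the adjusted outcomes."""
    Y_tilde = Y - 0.5 * (m1_hat + m0_hat)
    return Y_tilde[D == 1].mean() - Y_tilde[D == 0].mean()
```

Since each arm contains exactly n units, the two functions return identical values.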

We require some new discipline on the behavior of md,nm_{d,n} for d{0,1}d\in\{0,1\} and n1n\geq 1:

Assumption 3.1.

The functions md,nm_{d,n} for d{0,1}d\in\{0,1\} and n1n\geq 1 satisfy

  1. (a)

    For d{0,1}d\in\{0,1\},

    lim infnE[Var[Yi(d)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]]>0.\liminf_{n\to\infty}E\left[\operatorname*{Var}\left[Y_{i}(d)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Bigg{|}X_{i}\right]\right]>0~{}.
  2. (b)

    For d{0,1}d\in\{0,1\},

    limλlim supnE[md,n2(Xi,Wi)I{|md,n(Xi,Wi)|>λ}]=0.\lim_{\lambda\to\infty}\limsup_{n\to\infty}E[m_{d,n}^{2}(X_{i},W_{i})I\{|m_{d,n}(X_{i},W_{i})|>\lambda\}]=0~{}.
  3. (c)

    E[md,n(Xi,Wi)|Xi=x]E[m_{d,n}(X_{i},W_{i})|X_{i}=x], E[md,n2(Xi,Wi)|Xi=x]E[m_{d,n}^{2}(X_{i},W_{i})|X_{i}=x], E[md,n(Xi,Wi)Yi(d)|Xi=x]E[m_{d,n}(X_{i},W_{i})Y_{i}(d)|X_{i}=x] for d{0,1}d\in\{0,1\}, and E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi=x]E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}=x] are Lipschitz uniformly over n1n\geq 1.

Assumption 3.1(a) is an assumption to rule out degenerate situations. Assumption 3.1(b) is a mild uniform integrability assumption on the “working models.” If md,n()md()m_{d,n}(\cdot)\equiv m_{d}(\cdot) for d{0,1}d\in\{0,1\}, then it is satisfied as long as E[md2(Xi,Wi)]<E[m_{d}^{2}(X_{i},W_{i})]<\infty. Assumption 3.1(c) ensures that units that are “close” in terms of the observed covariates are also “close” in terms of potential outcomes, uniformly across n1n\geq 1.

Theorem 3.1 below establishes the limit in distribution of Δ^n\hat{\Delta}_{n}. We note that the theorem depends on high-level conditions on md,n()m_{d,n}(\cdot) and m^d,n()\hat{m}_{d,n}(\cdot). In the sequel, these conditions will be verified in several examples.

Theorem 3.1.

Suppose QQ satisfies Assumption 2.1, the treatment assignment mechanism satisfies Assumptions 2.22.3, and md,n()m_{d,n}(\cdot) for d{0,1}d\in\{0,1\} and n1n\geq 1 satisfy Assumption 3.1. Further suppose m^d,n()\hat{m}_{d,n}(\cdot) satisfies

12n1i2n(2Di1)(m^d,n(Xi,Wi)md,n(Xi,Wi))P0.\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)(\hat{m}_{d,n}(X_{i},W_{i})-m_{d,n}(X_{i},W_{i}))\stackrel{{\scriptstyle P}}{{\rightarrow}}0~{}. (9)

Then, Δ^n\hat{\Delta}_{n} defined in (5) satisfies

n(Δ^nΔ(Q))σn(Q)dN(0,1),\frac{\sqrt{n}(\hat{\Delta}_{n}-\Delta(Q))}{\sigma_{n}(Q)}\stackrel{{\scriptstyle d}}{{\rightarrow}}N(0,1)~{}, (10)

where σn2(Q)=σ1,n2(Q)+σ2,n2(Q)+σ3,n2(Q)\sigma_{n}^{2}(Q)=\sigma_{1,n}^{2}(Q)+\sigma_{2,n}^{2}(Q)+\sigma_{3,n}^{2}(Q) with

σ1,n2(Q)\displaystyle\sigma_{1,n}^{2}(Q) =12E[Var[E[Yi(1)+Yi(0)|Xi,Wi](m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]]\displaystyle=\frac{1}{2}E[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))|X_{i}]]
σ2,n2(Q)\displaystyle\sigma_{2,n}^{2}(Q) =12Var[E[Yi(1)Yi(0)|Xi,Wi]]\displaystyle=\frac{1}{2}\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)|X_{i},W_{i}]]
σ3,n2(Q)\displaystyle\sigma_{3,n}^{2}(Q) =E[Var[Yi(1)|Xi,Wi]]+E[Var[Yi(0)|Xi,Wi]].\displaystyle=E[\operatorname*{Var}[Y_{i}(1)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]]~{}.

In order to facilitate the use of Theorem 3.1 for inference about Δ(Q)\Delta(Q), we next provide a consistent estimator of σn(Q)\sigma_{n}(Q). Define

τ^n2\displaystyle\hat{\tau}_{n}^{2} =1n1jn(Y~π(2j1)Y~π(2j))2\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)})^{2}
λ^n\displaystyle\hat{\lambda}_{n} =2n1jn2(Y~π(4j3)Y~π(4j2))(Y~π(4j1)Y~π(4j))(Dπ(4j3)Dπ(4j2))(Dπ(4j1)Dπ(4j)),\displaystyle=\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\tilde{Y}_{\pi(4j-3)}-\tilde{Y}_{\pi(4j-2)})(\tilde{Y}_{\pi(4j-1)}-\tilde{Y}_{\pi(4j)})(D_{\pi(4j-3)}-D_{\pi(4j-2)})(D_{\pi(4j-1)}-D_{\pi(4j)})~{},

where Y~i\tilde{Y}_{i} is defined in (8). The variance estimator is given by

σ^n2=τ^n212(λ^n+Δ^n2).\hat{\sigma}_{n}^{2}=\hat{\tau}_{n}^{2}-\frac{1}{2}(\hat{\lambda}_{n}+\hat{\Delta}_{n}^{2})~{}. (11)

The variance estimator in (11), in particular its component λ^n\hat{\lambda}_{n}, is analogous to the “pairs of pairs” variance estimator in Bai et al. (2022). Such a variance estimator has also been used by Abadie and Imbens (2008) in a related setting. Note that it can be shown, arguing as in Remark 3.9 of Bai et al. (2022), that σ^n2\hat{\sigma}_{n}^{2} in (11) is nonnegative.
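The following sketch, continuing the hypothetical simulation and the permutation pi from Section 2, computes the adjusted outcomes, τ^n2\hat{\tau}_{n}^{2}, λ^n\hat{\lambda}_{n}, the variance estimator (11), and a confidence interval based on the normal approximation (see Theorem 3.2 below). It assumes the pairs are already ordered so that consecutive pairs are “close” (Assumption 2.4), as they are when pairing by sorting on a scalar XiX_{i}.

```python
import numpy as np
from scipy.stats import norm

def matched_pairs_variance(Y, D, m1_hat, m0_hat, pi):
    """Adjusted ATE estimate and variance estimator (11); pi defines the pairs
    {pi[2j], pi[2j+1]}, assumed ordered so that consecutive pairs are close."""
    n = len(Y) // 2
    Y_t = Y - 0.5 * (m1_hat + m0_hat)                # adjusted outcomes (8)
    delta_hat = Y_t[D == 1].mean() - Y_t[D == 0].mean()

    dY = Y_t[pi[0::2]] - Y_t[pi[1::2]]               # within-pair differences
    dD = D[pi[0::2]] - D[pi[1::2]]                   # equals +1 or -1
    tau2 = np.mean(dY ** 2)                          # tau_hat_n^2

    m = n // 2                                       # lambda_hat_n: products across consecutive pairs
    lam = (2.0 / n) * np.sum(dY[0:2 * m:2] * dY[1:2 * m:2] * dD[0:2 * m:2] * dD[1:2 * m:2])

    sigma2 = tau2 - 0.5 * (lam + delta_hat ** 2)     # equation (11)
    return delta_hat, sigma2

# Usage on the hypothetical data; zero adjustments reproduce the unadjusted estimator.
m1_hat = m0_hat = np.zeros_like(Y)
delta_hat, sigma2 = matched_pairs_variance(Y, D, m1_hat, m0_hat, pi)
n_pairs = len(Y) // 2
se = np.sqrt(sigma2 / n_pairs)
ci = (delta_hat - norm.ppf(0.975) * se, delta_hat + norm.ppf(0.975) * se)   # alpha = 0.05
```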

Theorem 3.2 below establishes the consistency of this estimator and its implications for inference about Δ(Q)\Delta(Q). In the statement of the theorem, we make use of the following notation: for any scalars aa and bb, [a±b][a\pm b] is understood to be [ab,a+b][a-b,a+b].

Theorem 3.2.

Suppose QQ satisfies Assumption 2.1, the treatment assignment mechanism satisfies Assumptions 2.22.4, and md,n()m_{d,n}(\cdot) for d{0,1}d\in\{0,1\} and n1n\geq 1 satisfy Assumption 3.1. Further suppose m^d,n()\hat{m}_{d,n}(\cdot) satisfies (9) and

12n1i2n(m^d,n(Xi,Wi)md,n(Xi,Wi))2P0.\frac{1}{2n}\sum_{1\leq i\leq 2n}(\hat{m}_{d,n}(X_{i},W_{i})-m_{d,n}(X_{i},W_{i}))^{2}\stackrel{{\scriptstyle P}}{{\rightarrow}}0~{}. (12)

Then,

σ^nσn(Q)P1.\frac{\hat{\sigma}_{n}}{\sigma_{n}(Q)}\stackrel{{\scriptstyle P}}{{\rightarrow}}1~{}.

Hence, (10) holds with σ^n\hat{\sigma}_{n} in place of σn(Q)\sigma_{n}(Q). In particular, for any α(0,1)\alpha\in(0,1),

P{Δ(Q)[Δ^n±σ^nΦ1(1α2)]}1α,P\left\{\Delta(Q)\in\left[\hat{\Delta}_{n}\pm\hat{\sigma}_{n}\Phi^{-1}\left(1-\frac{\alpha}{2}\right)\right]\right\}\rightarrow 1-\alpha~{},

where Φ\Phi is the standard normal c.d.f.

Remark 3.1.

Based on (7), it is natural to estimate σn2(Q)\sigma_{n}^{2}(Q) using the usual estimator of the limiting variance of the difference-in-means estimator, i.e.,

σ^diff,n2\displaystyle\hat{\sigma}_{\mathrm{diff},n}^{2} =1n1i2nDi(Y~i(1n1i2nDiY~i))2+1n1i2n(1Di)(Y~i(1n1i2n(1Di)Y~i))2.\displaystyle=\frac{1}{n}\sum_{1\leq i\leq 2n}D_{i}\left(\tilde{Y}_{i}-\left(\frac{1}{n}\sum_{1\leq i\leq 2n}D_{i}\tilde{Y}_{i}\right)\right)^{2}+\frac{1}{n}\sum_{1\leq i\leq 2n}(1-D_{i})\left(\tilde{Y}_{i}-\left(\frac{1}{n}\sum_{1\leq i\leq 2n}(1-D_{i})\tilde{Y}_{i}\right)\right)^{2}.

However, it can be shown that σ^diff,n2=σdiff,n2(Q)+oP(1)\hat{\sigma}_{\mathrm{diff},n}^{2}=\sigma_{\mathrm{diff},n}^{2}(Q)+o_{P}(1), where

σdiff,n2(Q)=Var[Yi(1)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))]+Var[Yi(0)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))].\sigma_{\mathrm{diff},n}^{2}(Q)=\operatorname*{Var}\left[Y_{i}(1)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\right]+\operatorname*{Var}\left[Y_{i}(0)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\right]~{}.

Furthermore,

σdiff,n2(Q)σn2(Q)\displaystyle\sigma_{\mathrm{diff},n}^{2}(Q)-\sigma_{n}^{2}(Q) =12Var[E[Yi(1)+Yi(0)(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]]0,\displaystyle=\frac{1}{2}\operatorname*{Var}\left[E[Y_{i}(1)+Y_{i}(0)-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))|X_{i}]\right]\geq 0~{},

where the inequality is strict unless

E[Yi(1)+Yi(0)(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]=E[Yi(1)+Yi(0)(m1,n(Xi,Wi)+m0,n(Xi,Wi))]\displaystyle E[Y_{i}(1)+Y_{i}(0)-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))|X_{i}]=E[Y_{i}(1)+Y_{i}(0)-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))]

with probability one. In this sense, the usual estimator of the limiting variance of the difference-in-means estimator is conservative.   
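For comparison, and continuing the same hypothetical variables, the conservative estimator σ^diff,n2\hat{\sigma}_{\mathrm{diff},n}^{2} of this remark can be computed as follows; numpy's .var() uses the 1/n normalization, matching the display above.

```python
# Usual difference-in-means variance estimator applied to the adjusted outcomes;
# it is conservative for sigma_n^2(Q) in the sense described in this remark.
Y_tilde = Y - 0.5 * (m1_hat + m0_hat)
sigma2_diff = Y_tilde[D == 1].var() + Y_tilde[D == 0].var()
```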

Remark 3.2.

An important and immediate implication of Theorem 3.1 is that σn2(Q)\sigma_{n}^{2}(Q) is minimized when

E[Yi(0)+Yi(1)|Xi,Wi]E[Yi(0)+Yi(1)|Xi]=\displaystyle E[Y_{i}(0)+Y_{i}(1)|X_{i},W_{i}]-E[Y_{i}(0)+Y_{i}(1)|X_{i}]=
m0,n(Xi,Wi)+m1,n(Xi,Wi)E[m0,n(Xi,Wi)+m1,n(Xi,Wi)|Xi]\displaystyle\hskip 56.9055ptm_{0,n}(X_{i},W_{i})+m_{1,n}(X_{i},W_{i})-E[m_{0,n}(X_{i},W_{i})+m_{1,n}(X_{i},W_{i})|X_{i}]

with probability one. In other words, the “working model” for E[Yi(0)+Yi(1)|Xi,Wi]E[Y_{i}(0)+Y_{i}(1)|X_{i},W_{i}], given by m0,n(Xi,Wi)+m1,n(Xi,Wi)m_{0,n}(X_{i},W_{i})+m_{1,n}(X_{i},W_{i}), need only be correct “on average” over the variables that are not used in determining the pairs. For such a choice of m0,n(Xi,Wi)m_{0,n}(X_{i},W_{i}) and m1,n(Xi,Wi)m_{1,n}(X_{i},W_{i}), σn2(Q)\sigma_{n}^{2}(Q) in Theorem 3.1 becomes simply

12Var[E[Yi(1)Yi(0)|Xi,Wi]]+E[Var[Yi(1)|Xi,Wi]]+E[Var[Yi(0)|Xi,Wi]],\frac{1}{2}\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)\Big{|}X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(1)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]]~{},

which agrees with the variance obtained in Bai et al. (2022) when both XiX_{i} and WiW_{i} are used in determining the pairs. Such a variance also achieves the efficiency bound derived by Armstrong (2022).   

Remark 3.3.

Following Bai et al. (2023b), it is straightforward to extend the analysis in this paper to the case with multiple treatment arms and where treatment status is determined using a “matched tuples” design, but we do not pursue this further in this paper.   

Remark 3.4.

Following Bai et al. (2022), we conjecture that it is possible to establish the validity of a randomization test based on the test statistic studentized by a randomized version of (11). We emphasize that the validity of the randomization test depends crucially on the choice of studentization in the test statistic. See, for instance, Remark 3.16 in Bai et al. (2022). Such tests have been studied in finite-population settings with covariate adjustments by Zhao and Ding (2021). We leave a detailed analysis of randomization tests for future work.

4 Linear Adjustments

In this section, we consider linearly covariate-adjusted estimators of Δ(Q)\Delta(Q) based on a set of regressors generated by Xi𝐑kxX_{i}\in\mathbf{R}^{k_{x}} and Wi𝐑kwW_{i}\in\mathbf{R}^{k_{w}}. To this end, define ψi=ψ(Xi,Wi)\psi_{i}=\psi(X_{i},W_{i}), where ψ:𝐑kx×𝐑kw𝐑p\psi:\mathbf{R}^{k_{x}}\times\mathbf{R}^{k_{w}}\to\mathbf{R}^{p}. We impose the following assumptions on the function ψ\psi:

Assumption 4.1.

The function ψ\psi is such that

  1. (a)

    no component of ψ\psi is constant and E[Var[ψi|Xi]]E[\operatorname*{Var}[\psi_{i}|X_{i}]] is non-singular.

  2. (b)

    Var[ψi]<\operatorname*{Var}[\psi_{i}]<\infty.

  3. (c)

    E[ψi|Xi=x]E[\psi_{i}|X_{i}=x], E[ψiψi|Xi=x]E[\psi_{i}\psi_{i}^{\prime}|X_{i}=x], and E[ψiYi(d)|Xi=x]E[\psi_{i}Y_{i}(d)|X_{i}=x] for d{0,1}d\in\{0,1\} are Lipschitz.

Assumption 4.1 is analogous to Assumption 2.1. Note, in particular, that Assumption 4.1(a) rules out situations where ψi\psi_{i} is a function of XiX_{i} only. See Remark 4.3 for a discussion of the behavior of the covariate-adjusted estimators in such situations.

4.1 Linear Adjustments without Pair Fixed Effects

Consider the following linear regression model:

Yi=α+ΔDi+ψiβ+ϵi.Y_{i}=\alpha+\Delta D_{i}+\psi_{i}^{\prime}\beta+\epsilon_{i}~{}. (13)

Let α^nnaive\hat{\alpha}_{n}^{\rm naive}, Δ^nnaive\hat{\Delta}_{n}^{\rm naive}, and β^nnaive\hat{\beta}_{n}^{\rm naive} denote the OLS estimators of α\alpha, Δ\Delta, and β\beta in (13). We call these estimators naïve because the corresponding regression adjustment is subject to Freedman’s critique and can lead to an adjusted estimator that is less efficient than the simple difference-in-means estimator Δ^nunadj\hat{\Delta}_{n}^{\rm unadj}.
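A minimal sketch of this naïve adjustment on the hypothetical simulated data: regress YiY_{i} on a constant, DiD_{i}, and ψi\psi_{i} by OLS and read off the coefficient on DiD_{i}. The choice ψi=(Xi,Wi)\psi_{i}=(X_{i},W_{i})^{\prime} is an illustrative assumption.

```python
import numpy as np

psi = np.column_stack([X, W])                    # hypothetical choice of psi_i = psi(X_i, W_i)
Z = np.column_stack([np.ones(2 * n), D, psi])    # constant, D_i, psi_i as in (13)
coef = np.linalg.lstsq(Z, Y, rcond=None)[0]
delta_naive = coef[1]                            # OLS coefficient on D_i
```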

It follows from direct calculation that

Δ^nnaive=1n1i2n(Yiψiβ^nnaive)(2Di1).\hat{\Delta}_{n}^{\rm naive}=\frac{1}{n}\sum_{1\leq i\leq 2n}(Y_{i}-\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm naive})(2D_{i}-1)~{}.

Therefore, Δ^nnaive\hat{\Delta}_{n}^{\rm naive} satisfies (5)–(6) with

m^d,n(Xi,Wi)=ψiβ^nnaive.\hat{m}_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm naive}~{}.

Theorem 4.1 establishes (9) and (12) for a suitable choice of md,n(Xi,Wi)m_{d,n}(X_{i},W_{i}) for d{0,1}d\in\{0,1\} and, as a result, the limiting distribution of Δ^nnaive\hat{\Delta}_{n}^{\rm naive} and the validity of the variance estimator.

Theorem 4.1.

Suppose QQ satisfies Assumption 2.1 and the treatment assignment mechanism satisfies Assumptions 2.22.3. Further suppose ψ\psi satisfies Assumption 4.1. Then, as nn\to\infty,

β^nnaivePβnaive=Var[ψi]1Cov[ψi,Yi(1)+Yi(0)].\hat{\beta}_{n}^{\rm naive}\stackrel{{\scriptstyle P}}{{\to}}\beta^{\rm naive}=\operatorname*{Var}[\psi_{i}]^{-1}\operatorname*{Cov}[\psi_{i},Y_{i}(1)+Y_{i}(0)]~{}.

Moreover, (9), (12), and Assumption 3.1 are satisfied with

md,n(Xi,Wi)=ψiβnaivem_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\beta^{\rm naive}

for d{0,1}d\in\{0,1\} and n1n\geq 1.

Remark 4.1.

Freedman (2008) studies regression adjustment based on (13) when treatment is assigned by complete randomization instead of a “matched pairs” design. In such settings, Lin (2013) proposes adjustment based on the following linear regression model:

Yi=α+ΔDi+(ψiψ¯n)γ+Di(ψiψ¯n)η+ϵi,Y_{i}=\alpha+\Delta D_{i}+(\psi_{i}-\bar{\psi}_{n})^{\prime}\gamma+D_{i}(\psi_{i}-\bar{\psi}_{n})^{\prime}\eta+\epsilon_{i}~{}, (14)

where

ψ¯n=12n1i2nψi.\bar{\psi}_{n}=\frac{1}{2n}\sum_{1\leq i\leq 2n}\psi_{i}~{}.

Let α^nint,Δ^nint,γ^nint,η^nint\hat{\alpha}_{n}^{\rm int},\hat{\Delta}_{n}^{\rm int},\hat{\gamma}_{n}^{\rm int},\hat{\eta}_{n}^{\rm int} denote the OLS estimators for α,Δ,γ,η\alpha,\Delta,\gamma,\eta in (14). It is straightforward to show Δ^nint\hat{\Delta}_{n}^{\rm int} satisfies (5)–(6) with

m^1,n(Xi,Wi)\displaystyle\hat{m}_{1,n}(X_{i},W_{i}) =(ψiμ^ψ,n(1))(γ^nint+η^nint)\displaystyle=(\psi_{i}-\hat{\mu}_{\psi,n}(1))^{\prime}(\hat{\gamma}_{n}^{\rm int}+\hat{\eta}_{n}^{\rm int})
m^0,n(Xi,Wi)\displaystyle\hat{m}_{0,n}(X_{i},W_{i}) =(ψiμ^ψ,n(0))γ^nint,\displaystyle=(\psi_{i}-\hat{\mu}_{\psi,n}(0))^{\prime}\hat{\gamma}_{n}^{\rm int}~{},

where

μ^ψ,n(d)=1n1i2nI{Di=d}ψi.\hat{\mu}_{\psi,n}(d)=\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\psi_{i}~{}.

It can be shown using similar arguments to those used to establish Theorem 4.1 that (9) and Assumption 3.1 are satisfied with

md,n(Xi,Wi)=(ψiE[ψi])Var[ψi]1Cov[ψi,Yi(d)]m_{d,n}(X_{i},W_{i})=(\psi_{i}-E[\psi_{i}])^{\prime}\operatorname*{Var}[\psi_{i}]^{-1}\operatorname*{Cov}[\psi_{i},Y_{i}(d)]

for d{0,1}d\in\{0,1\} and n1n\geq 1. It thus follows by inspecting the expression for σn2(Q)\sigma^{2}_{n}(Q) in Theorem 3.1 that the limiting variance of Δ^nint\hat{\Delta}_{n}^{\rm int} is the same as that of Δ^nnaive\hat{\Delta}_{n}^{\rm naive} based on (13).   
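For completeness, a short sketch of the interacted specification (14) on the same hypothetical data; by the discussion above, its limiting variance coincides with that of the uninteracted adjustment.

```python
psi_c = psi - psi.mean(axis=0)                              # psi_i - psi_bar_n
Z_int = np.column_stack([np.ones(2 * n), D, psi_c, D[:, None] * psi_c])
delta_int = np.linalg.lstsq(Z_int, Y, rcond=None)[0][1]     # coefficient on D_i in (14)
```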

Remark 4.2.

Note that Δ^nnaive\hat{\Delta}_{n}^{\rm naive} is the ordinary least squares estimator for Δ\Delta in the linear regression

Yiψiβ^nnaive=α+ΔDi+ϵi.\displaystyle Y_{i}-\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm naive}=\alpha+\Delta D_{i}+\epsilon_{i}~{}.

Furthermore, Theorem 4.1 implies that its limiting variance is σnaive2(Q)\sigma_{\mathrm{naive}}^{2}(Q), given by σn2(Q)\sigma_{n}^{2}(Q) in Theorem 3.1 with md(Xi,Wi)=ψiβnaivem_{d}(X_{i},W_{i})=\psi_{i}^{\prime}\beta^{\rm naive}. The usual heteroskedasticity-robust estimator of the limiting variance of Δ^nnaive\hat{\Delta}_{n}^{\rm naive} is, however, simply σ^diff,n2\hat{\sigma}_{\mathrm{diff},n}^{2} defined in Remark 3.1 with m^d,n(Xi,Wi)=ψiβ^nnaive\hat{m}_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm naive}. It thus follows that σ^diff,n2\hat{\sigma}_{\mathrm{diff},n}^{2} is conservative for σnaive2(Q)\sigma_{\mathrm{naive}}^{2}(Q) in the sense described therein. It is, of course, possible to estimate σnaive2(Q)\sigma_{\mathrm{naive}}^{2}(Q) consistently using σ^n2\hat{\sigma}_{n}^{2} proposed in Theorem 3.2 with m^d,n(Xi,Wi)=ψiβ^nnaive\hat{m}_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm naive}, but σnaive2(Q)\sigma_{\mathrm{naive}}^{2}(Q) is not guaranteed to be smaller than the limiting variance of the unadjusted estimator, i.e., σunadj2(Q)\sigma_{\mathrm{unadj}}^{2}(Q), so the linear adjustment without pair fixed effects can harm the precision of the estimator. Evidence of this phenomenon is provided in our simulations in Section 6.

4.2 Linear Adjustments with Pair Fixed Effects

Remark 4.1 implies that in “matched pairs” designs, including interaction terms in the linear regression does not lead to an estimator with lower limiting variance than the one based on the linear regression without interaction terms. It is therefore interesting to study whether there exists a linearly covariate-adjusted estimator with lower limiting variance than the ones based on (13) and (14) as well as the difference-in-means estimator. To that end, consider instead the following linear regression model:

Yi=ΔDi+ψiβ+1jnθjI{i{π(2j1),π(2j)}}+ϵi.Y_{i}=\Delta D_{i}+\psi_{i}^{\prime}\beta+\sum_{1\leq j\leq n}\theta_{j}I\{i\in\{\pi(2j-1),\pi(2j)\}\}+\epsilon_{i}~{}. (15)

Let Δ^npfe\hat{\Delta}_{n}^{\rm pfe}, β^npfe\hat{\beta}_{n}^{\rm pfe}, and θ^j,n\hat{\theta}_{j,n}, 1jn1\leq j\leq n denote the OLS estimators of Δ\Delta, β\beta, and θj\theta_{j}, 1jn1\leq j\leq n in (15), where “pfe” stands for pair fixed effect. It follows from the Frisch-Waugh-Lovell theorem that

Δ^npfe=1n1i2n(Yiψiβ^npfe)(2Di1).\hat{\Delta}_{n}^{\rm pfe}=\frac{1}{n}\sum_{1\leq i\leq 2n}(Y_{i}-\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm pfe})(2D_{i}-1)~{}.

Therefore, Δ^npfe\hat{\Delta}_{n}^{\rm pfe} satisfies (5)–(6) with

m^d,n(Xi,Wi)=ψiβ^npfe.\hat{m}_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm pfe}~{}.
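In practice, the pair fixed effects in (15) can be partialled out by demeaning all variables within pairs, which is exactly the Frisch-Waugh-Lovell step. The sketch below, continuing the hypothetical data and the permutation pi from earlier sections, computes Δ^npfe\hat{\Delta}_{n}^{\rm pfe} and β^npfe\hat{\beta}_{n}^{\rm pfe} in this way; the helper name pair_demean is ours.

```python
import numpy as np

def pair_demean(v, pi):
    """Subtract within-pair means, i.e., partial out the pair fixed effects in (15)."""
    out = np.empty(len(v), dtype=float)
    means = 0.5 * (v[pi[0::2]] + v[pi[1::2]])
    out[pi[0::2]] = v[pi[0::2]] - means
    out[pi[1::2]] = v[pi[1::2]] - means
    return out

Y_dm = pair_demean(Y, pi)
D_dm = pair_demean(D, pi)
psi_dm = np.column_stack([pair_demean(psi[:, k], pi) for k in range(psi.shape[1])])

# Regression of demeaned Y on demeaned D and psi (no constant: absorbed by the pair effects)
coef_pfe = np.linalg.lstsq(np.column_stack([D_dm, psi_dm]), Y_dm, rcond=None)[0]
delta_pfe, beta_pfe = coef_pfe[0], coef_pfe[1:]
```

With the fitted values ψiβ^npfe\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm pfe} used as m^d,n(Xi,Wi)\hat{m}_{d,n}(X_{i},W_{i}), the variance estimator sketched in Section 3 applies directly.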

Theorem 4.2 establishes (9) and (12) for a suitable choice of md,n(Xi,Wi),d{0,1}m_{d,n}(X_{i},W_{i}),d\in\{0,1\} and, as a result, the limiting distribution of Δ^npfe\hat{\Delta}_{n}^{\rm pfe} and the validity of the variance estimator.

Theorem 4.2.

Suppose QQ satisfies Assumption 2.1 and the treatment assignment mechanism satisfies Assumptions 2.22.3. Then, as nn\to\infty,

β^npfePβpfe=(2E[Var[ψi|Xi]])1E[Cov[ψi,Yi(1)+Yi(0)|Xi]].\hat{\beta}_{n}^{\rm pfe}\stackrel{{\scriptstyle P}}{{\to}}\beta^{\rm pfe}=(2E[\operatorname*{Var}[\psi_{i}|X_{i}]])^{-1}E[\operatorname*{Cov}[\psi_{i},Y_{i}(1)+Y_{i}(0)|X_{i}]]~{}.

Moreover, (9), (12), and Assumption 3.1 are satisfied with

md,n(Xi,Wi)=ψiβpfem_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\beta^{\rm pfe}

for d{0,1}d\in\{0,1\} and n1n\geq 1.

Remark 4.3.

When ψ\psi is restricted to be a function of XiX_{i} only, Δ^npfe\hat{\Delta}_{n}^{\rm pfe} coincides to first order with the unadjusted difference-in-means estimator Δ^nunadj\hat{\Delta}_{n}^{\rm unadj} defined in (3). To see this, suppose further that ψ\psi is Lipschitz and that Var[Yi(d)|Xi=x],d{0,1}\text{Var}[Y_{i}(d)|X_{i}=x],d\in\{0,1\} are bounded. The proof of Theorem 4.2 reveals that Δ^npfe\hat{\Delta}_{n}^{\rm pfe} and β^npfe\hat{\beta}_{n}^{\rm pfe} coincide with the OLS estimators of the intercept and slope parameters in a linear regression of (Yπ(2j)Yπ(2j1))(Dπ(2j)Dπ(2j1))(Y_{\pi(2j)}-Y_{\pi(2j-1)})(D_{\pi(2j)}-D_{\pi(2j-1)}) on a constant and (ψπ(2j)ψπ(2j1))(Dπ(2j)Dπ(2j1))(\psi_{\pi(2j)}-\psi_{\pi(2j-1)})(D_{\pi(2j)}-D_{\pi(2j-1)}). Using this observation, it follows by arguing as in Section S.1.1 of Bai et al. (2022) that

n(Δ^npfeΔ(Q))=n(Δ^nunadjΔ(Q))+oP(1).\sqrt{n}(\hat{\Delta}_{n}^{\rm pfe}-\Delta(Q))=\sqrt{n}(\hat{\Delta}_{n}^{\rm unadj}-\Delta(Q))+o_{P}(1)~{}.

See also Remark 3.8 of Bai et al. (2022).   

Remark 4.4.

Note that the expression for σn2(Q)\sigma_{n}^{2}(Q) in Theorem 3.1 depends on md,n(Xi,Wi),d{0,1}m_{d,n}(X_{i},W_{i}),d\in\{0,1\} only through σ1,n2(Q)\sigma_{1,n}^{2}(Q). With this in mind, consider the class of all linearly covariate-adjusted estimators based on ψi\psi_{i}, i.e., md,n(Xi,Wi)=ψiβ(d)m_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\beta(d). For this specification of md,n(Xi,Wi),d{0,1}m_{d,n}(X_{i},W_{i}),d\in\{0,1\},

σ1,n2(Q)=E[(E[Yi(1)+Yi(0)|Xi,Wi]E[Yi(1)+Yi(0)|Xi](ψiE[ψi|Xi])(β(1)+β(0)))2].\sigma_{1,n}^{2}(Q)=E[(E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-E[Y_{i}(1)+Y_{i}(0)|X_{i}]-(\psi_{i}-E[\psi_{i}|X_{i}])^{\prime}(\beta(1)+\beta(0)))^{2}]~{}.

It follows that among all such linear adjustments, σn2(Q)\sigma_{n}^{2}(Q) in (10) is minimized when

β(1)+β(0)=2βpfe.\beta(1)+\beta(0)=2\beta^{\rm pfe}~{}.

This observation implies that the linear adjustment with pair fixed effects, i.e., Δ^npfe\hat{\Delta}_{n}^{\rm pfe}, yields the optimal linear adjustment in the sense of minimizing σn2(Q)\sigma_{n}^{2}(Q). Its limiting variance is, in particular, weakly smaller than the limiting variance of the unadjusted difference-in-means estimator defined in (3). On the other hand, the covariate-adjusted estimators based on (13) or (14), i.e., Δ^nnaive\hat{\Delta}_{n}^{\rm naive} and Δ^nint\hat{\Delta}_{n}^{\rm int}, are in general not optimal among all linearly covariate-adjusted estimators based on ψi\psi_{i}. In fact, the limiting variances of these two estimators may even be larger than that of the unadjusted difference-in-means estimator.   

Remark 4.5.

A “matched pairs” design is essentially a nonparametric way to adjust for XiX_{i}. Projecting ψi\psi_{i} on the pair dummies in (15) is equivalent to pairwise demeaning, which effectively removes E(ψi|Xi)E(\psi_{i}|X_{i}) from ψi\psi_{i}. This is key to the optimality of Δ^npfe\hat{\Delta}_{n}^{\rm pfe} among all linearly adjusted estimators. Following the same logic, we expect that replacing the pair dummies with sieve bases of XiX_{i} in (15) would still effectively remove E(ψi|Xi)E(\psi_{i}|X_{i}) from ψi\psi_{i}, so that the resulting adjusted estimator is asymptotically equivalent to Δ^npfe\hat{\Delta}_{n}^{\rm pfe} and thus linearly optimal.

Remark 4.6.

Remark 4.2 also applies here with βnaive\beta^{\rm naive} replaced by βpfe\beta^{\rm pfe}. Even though Δ^npfe\hat{\Delta}_{n}^{\rm pfe} can be computed via OLS estimation of (15), we emphasize that the usual heteroskedasticity-robust standard errors, which naïvely treat the data (including treatment status) as if they were i.i.d., need not be consistent for the limiting variance derived in our analysis.

Remark 4.7.

One can also consider the estimator based on the following linear regression model:

Yi=ΔDi+(ψiψ¯n)γ+Di(ψiμ^ψ,n(1))η+1jnθjI{i{π(2j1),π(2j)}}+ϵi.Y_{i}=\Delta D_{i}+(\psi_{i}-\bar{\psi}_{n})^{\prime}\gamma+D_{i}(\psi_{i}-\hat{\mu}_{\psi,n}(1))^{\prime}\eta+\sum_{1\leq j\leq n}\theta_{j}I\{i\in\{\pi(2j-1),\pi(2j)\}\}+\epsilon_{i}~{}. (16)

Let Δ^nintpfe,γ^nintpfe,η^nintpfe\hat{\Delta}_{n}^{\rm int-pfe},\hat{\gamma}_{n}^{\rm int-pfe},\hat{\eta}_{n}^{\rm int-pfe} denote the OLS estimators for Δ,γ,η\Delta,\gamma,\eta in (16). It is straightforward to show Δ^nintpfe\hat{\Delta}_{n}^{\rm int-pfe} satisfies (5)–(6) with

m^1,n(Xi,Wi)\displaystyle\hat{m}_{1,n}(X_{i},W_{i}) =(ψiμ^ψ,n(1))η^nintpfe\displaystyle=(\psi_{i}-\hat{\mu}_{\psi,n}(1))^{\prime}\hat{\eta}_{n}^{\rm int-pfe}
m^0,n(Xi,Wi)\displaystyle\hat{m}_{0,n}(X_{i},W_{i}) =(ψiμ^ψ,n(0))(η^nintpfeγ^nintpfe).\displaystyle=(\psi_{i}-\hat{\mu}_{\psi,n}(0))^{\prime}(\hat{\eta}_{n}^{\rm int-pfe}-\hat{\gamma}_{n}^{\rm int-pfe})~{}.

Following similar arguments to those used in the proof of Theorem 4.1, we can establish that (9) and Assumption 3.1 are satisfied with

m1,n(Xi,Wi)\displaystyle m_{1,n}(X_{i},W_{i}) =(ψiE[ψi])ηintpfe\displaystyle=(\psi_{i}-E[\psi_{i}])^{\prime}\eta^{\rm int-pfe}
m0,n(Xi,Wi)\displaystyle m_{0,n}(X_{i},W_{i}) =(ψiE[ψi])(ηintpfeγintpfe),\displaystyle=(\psi_{i}-E[\psi_{i}])^{\prime}(\eta^{\rm int-pfe}-\gamma^{\rm int-pfe})~{},

where

γintpfe\displaystyle\gamma^{\rm int-pfe} =(E[Var[ψi|Xi]])1E[Cov[ψi,Yi(1)Yi(0)|Xi]],\displaystyle=(E[\operatorname*{Var}[\psi_{i}|X_{i}]])^{-1}E[\mathrm{Cov}[\psi_{i},Y_{i}(1)-Y_{i}(0)|X_{i}]]~{},
ηintpfe\displaystyle\eta^{\rm int-pfe} =(E[Var[ψi|Xi]])1E[Cov[ψi,Yi(1)|Xi]].\displaystyle=(E[\operatorname*{Var}[\psi_{i}|X_{i}]])^{-1}E[\mathrm{Cov}[\psi_{i},Y_{i}(1)|X_{i}]]~{}.

Because 2ηintpfeγintpfe=2βpfe2\eta^{\rm int-pfe}-\gamma^{\rm int-pfe}=2\beta^{\rm pfe}, it follows from Remark 4.4 that the limiting variance of Δ^nintpfe\hat{\Delta}_{n}^{\rm int-pfe} is identical to the limiting variance of Δ^npfe\hat{\Delta}_{n}^{\rm pfe}.   

Remark 4.8.

Wu and Gagnon-Bartsch (2021) consider covariate adjustment for paired experiments under the design-based framework, in which the covariates are treated as deterministic, so the cross-sectional dependence between units in the same pair arising from the closeness of their covariates plays no role in their analysis. We differ from them by considering the sampling-based framework, in which the covariates are treated as random and the pairs are formed by matching and therefore affect statistical inference. Under their framework, Wu and Gagnon-Bartsch (2021) point out that covariate adjustments may have a positive or negative effect on estimation accuracy depending on how they are estimated. This is consistent with our findings in this section. Specifically, we show that when the regression adjustments are estimated by a linear regression with pair fixed effects, the resulting ATE estimator is guaranteed to weakly improve upon the difference-in-means estimator in terms of efficiency. However, this improvement is not guaranteed if the adjustments are estimated without pair fixed effects.

Remark 4.9.

If we choose ψi\psi_{i} as a set of sieve basis functions with increasing dimension, then under suitable regularity conditions, the linear adjustments both with and without pair fixed effects achieve the same limiting variance as Δ^nideal\hat{\Delta}^{\rm ideal}_{n}, and thus, the efficiency bound. In fact, if ψi\psi_{i} contains sieve bases, then the linear adjustment without pair fixed effects can approximate the true specification E[Yi(1)+Yi(0)|Xi,Wi]E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}] in the sense that E[Yi(1)+Yi(0)|Xi,Wi]=ψiβnaive+RiE[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]=\psi_{i}^{\prime}\beta^{\rm naive}+R_{i} and E[Ri2]=o(1)E[R_{i}^{2}]=o(1). This property implies σ1,n2(Q)\sigma_{1,n}^{2}(Q) in Theorem 3.1 equals zero. Similarly, the linear adjustment with pair fixed effects can approximate the true specification E[Yi(1)+Yi(0)|Xi,Wi]E[Yi(1)+Yi(0)|Xi]E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-E[Y_{i}(1)+Y_{i}(0)|X_{i}] in the sense that E[Yi(1)+Yi(0)|Xi,Wi]E[Yi(1)+Yi(0)|Xi]=ψ~iβnaive+R~iE[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-E[Y_{i}(1)+Y_{i}(0)|X_{i}]=\tilde{\psi}_{i}^{\prime}\beta^{\rm naive}+\tilde{R}_{i} and E[R~i2]=o(1)E[\tilde{R}_{i}^{2}]=o(1). This property again implies σ1,n2(Q)\sigma_{1,n}^{2}(Q) in Theorem 3.1 equals zero. Therefore, in both cases, the adjusted estimator achieves the minimum variance. In the next section, we consider 1\ell_{1}-regularized adjustments which may be viewed as providing a way to choose the relevant sieve bases in a data-driven manner.   

5 Regularized Adjustments

In this section, we study covariate adjustments based on the 1\ell_{1}-regularized linear regression. Such settings can arise if the covariates WiW_{i} are high-dimensional or if the dimension of WiW_{i} is fixed but the regressors include many sieve basis functions of XiX_{i} and WiW_{i}. To accommodate situations where the dimension of WiW_{i} increases with nn, we add a subscript and denote it by Wn,iW_{n,i} instead. Let kw,nk_{w,n} denote the dimension of Wn,iW_{n,i}. For n1n\geq 1, let ψn,i=ψn(Xi,Wn,i)\psi_{n,i}=\psi_{n}(X_{i},W_{n,i}), where ψn:𝐑kx×𝐑kw,n𝐑pn\psi_{n}:\mathbf{R}^{k_{x}}\times\mathbf{R}^{k_{w,n}}\to\mathbf{R}^{p_{n}} and pnp_{n} will be permitted below to be possibly much larger than nn.

In what follows, we consider a two-step method in the spirit of Cohen and Fogarty (2023). In the first step, an intermediate estimator, Δ^nr\hat{\Delta}_{n}^{\rm r}, is obtained using (5) with “working models” md,n(Xi,Wi)m_{d,n}(X_{i},W_{i}) for d{0,1}d\in\{0,1\} obtained through 1\ell_{1}-regularized linear regression. As explained further below in Theorem 5.1, when md,n(Xi,Wi)m_{d,n}(X_{i},W_{i}) is approximately correctly specified, such an estimator is optimal in the sense that it minimizes the limiting variance in Theorem 3.1. When this is not the case, however, for reasons analogous to those put forward in Remark 4.2, it need not have a limiting variance weakly smaller than that of the unadjusted difference-in-means estimator. In a second step, we therefore consider an estimator obtained by refitting a version of (15) in which the covariates ψi\psi_{i} are replaced by the regularized estimates of md,n(Xi,Wi)m_{d,n}(X_{i},W_{i}) for d{0,1}d\in\{0,1\}. The resulting estimator, Δ^nrefit\hat{\Delta}_{n}^{\rm refit}, has a limiting variance weakly smaller than that of the intermediate estimator and thus remains optimal under approximately correct specification in the same sense. Moreover, its limiting variance is weakly smaller than that of the unadjusted difference-in-means estimator. Wager et al. (2016) also consider high-dimensional regression adjustments in randomized experiments using the LASSO. We differ from their work by considering the “matched pairs” design and, more importantly, by discussing when and how regularized adjustments can improve estimation efficiency upon the difference-in-means estimator.
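As a rough sketch of this two-step procedure on the hypothetical simulated data, under simplifying assumptions: a cross-validated LASSO from scikit-learn stands in for the weighted-penalty problem (17) introduced below, the basis psi_n is an arbitrary illustrative choice, and pair_demean is reused from the sketch in Section 4.2. This is not the exact procedure whose theory is developed in this section; it only illustrates the two steps.

```python
import numpy as np
from sklearn.linear_model import LassoCV

psi_n = np.column_stack([X, W, X * W, X ** 2, W ** 2])   # hypothetical basis psi_{n,i}

# Step 1: arm-specific l1-regularized "working models"; a cross-validated LASSO is a
# simplified stand-in for the weighted-penalty problem (17).
m_hat = {}
for d in (0, 1):
    fit = LassoCV(cv=5).fit(psi_n[D == d], Y[D == d])
    m_hat[d] = fit.predict(psi_n)                        # fitted values for all 2n units

# Intermediate estimator via the adjusted-outcome form (7)-(8)
Y_tilde = Y - 0.5 * (m_hat[1] + m_hat[0])
delta_r = Y_tilde[D == 1].mean() - Y_tilde[D == 0].mean()

# Step 2: refit a linear adjustment with pair fixed effects, using the two sets of
# predicted values as the covariates (as in (15) with those predictions as psi_i).
preds = np.column_stack([m_hat[1], m_hat[0]])
preds_dm = np.column_stack([pair_demean(preds[:, k], pi) for k in range(2)])
Z_refit = np.column_stack([pair_demean(D, pi), preds_dm])
delta_refit = np.linalg.lstsq(Z_refit, pair_demean(Y, pi), rcond=None)[0][0]
```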

Before proceeding, we introduce some additional notation that will be required in our formal description of the methods. We denote by ψn,i,l\psi_{n,i,l} the llth components of ψn,i\psi_{n,i}. For a vector a𝐑ka\in\mathbf{R}^{k} and 0p0\leq p\leq\infty, recall that

ap=(1lk|al|p)1/p,\|a\|_{p}=\Big{(}\sum_{1\leq l\leq k}|a_{l}|^{p}\Big{)}^{1/p}~{},

where it is understood that a0=1lkI{al0}\|a\|_{0}=\sum_{1\leq l\leq k}I\{a_{l}\neq 0\} and a=sup1lk|al|\|a\|_{\infty}=\sup_{1\leq l\leq k}|a_{l}|. Using this notation, we further define

Ξn=sup(x,w)supp(Xi)×supp(Wn,i)ψn(x,w).\Xi_{n}=\sup_{(x,w)\in\mathrm{supp}(X_{i})\times\mathrm{supp}(W_{n,i})}\|\psi_{n}(x,w)\|_{\infty}~{}.

For d{0,1}d\in\{0,1\}, define

(α^d,nr,β^d,nr)argmina𝐑,b𝐑pn1n1i2n:Di=d(Yiaψn,ib)2+λd,nrΩ^n(d)b1,(\hat{\alpha}_{d,n}^{\rm r},\hat{\beta}_{d,n}^{\rm r})\in\operatorname*{argmin}_{a\in\mathbf{R},b\in\mathbf{R}^{p_{n}}}\frac{1}{n}\sum_{1\leq i\leq 2n:D_{i}=d}(Y_{i}-a-\psi_{n,i}^{\prime}b)^{2}+\lambda_{d,n}^{\rm r}\|\hat{\Omega}_{n}(d)b\|_{1}~{}, (17)

where λd,nr\lambda_{d,n}^{\rm r} is a penalty parameter that will be disciplined by the assumptions below, Ω^n(d)=diag(ω^1(d),,ω^pn(d))\hat{\Omega}_{n}(d)=\operatorname*{diag}(\hat{\omega}_{1}(d),\cdots,\break\hat{\omega}_{p_{n}}(d)) is a diagonal matrix, and ω^n,l(d)\hat{\omega}_{n,l}(d) is the penalty loading for the llth regressor. Let Δ^nr\hat{\Delta}_{n}^{\rm r} denote the estimator in (5) with m^d,n(Xi,Wn,i)=ψn,iβ^d,nr\hat{m}_{d,n}(X_{i},W_{n,i})=\psi_{n,i}^{\prime}\hat{\beta}_{d,n}^{\rm r} for d{0,1}d\in\{0,1\}.

We now proceed with the statement of our assumptions. The first assumption collects a variety of moment conditions that will be used in our formal analysis:

Assumption 5.1.
  1. (a)

    There exist nonrandom quantities (αd,nr,βd,nr)(\alpha_{d,n}^{\rm r},\beta_{d,n}^{\rm r}) such that with ϵn,ir(d)\epsilon_{n,i}^{\rm r}(d) defined as

    ϵn,ir(d)=Yi(d)αd,nrψn,iβd,nr,\displaystyle\epsilon_{n,i}^{\rm r}(d)=Y_{i}(d)-\alpha_{d,n}^{\rm r}-\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r}~{},

    we have

    Ωn(d)1E[ψn,iϵn,ir(d)]+|E[ϵn,ir(d)]|=o(λd,nr),\|\Omega_{n}(d)^{-1}E[\psi_{n,i}\epsilon_{n,i}^{\rm r}(d)]\|_{\infty}+|E[\epsilon_{n,i}^{\rm r}(d)]|=o\left(\lambda_{d,n}^{\rm r}\right)~{}, (18)

    where Ωn(d)=diag(ωn,1(d),,ωn,pn(d))\Omega_{n}(d)=\operatorname*{diag}(\omega_{n,1}(d),\cdots,\omega_{n,p_{n}}(d)) and ωn,l2(d)=Var[ψn,i,lϵn,ir(d)]\omega_{n,l}^{2}(d)=\operatorname*{Var}[\psi_{n,i,l}\epsilon_{n,i}^{\rm r}(d)].

  2. (b)

    For some q>2q>2 and constant C1C_{1},

    supn1max1lpnE[|ψn,i,lq||Xi]\displaystyle\sup_{n\geq 1}\max_{1\leq l\leq p_{n}}E[|\psi_{n,i,l}^{q}||X_{i}] C1,\displaystyle\leq C_{1}~{},
    supn1|ψn,iβd,nr|\displaystyle\sup_{n\geq 1}|\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r}| C1,\displaystyle\leq C_{1}~{},
    supn1|E[Yi(d)|Xi,Wn,i]|\displaystyle\sup_{n\geq 1}|E[Y_{i}(d)|X_{i},W_{n,i}]| C1,\displaystyle\leq C_{1}~{},

    with probability one.

  3. (c)

    For some c¯\underaccent{\bar}{c} and c¯\bar{c}, we require that

    0<c¯lim infnmin1lpnω^n,l(d)/ωn,l(d)lim supnmax1lpnω^n,l(d)/ωn,l(d)c¯<.0<\underline{c}\leq\liminf_{n\rightarrow\infty}\min_{1\leq l\leq p_{n}}\hat{\omega}_{n,l}(d)/\omega_{n,l}(d)\leq\limsup_{n\rightarrow\infty}\max_{1\leq l\leq p_{n}}\hat{\omega}_{n,l}(d)/\omega_{n,l}(d)\leq\bar{c}<\infty. (19)
  4. (d)

    For some c0c_{0}, σ¯\underaccent{\bar}{\sigma}, σ¯\bar{\sigma}, the following statements hold with probability one:

    0<σ¯2lim infnmind{0,1},1lpnωn,l2(d)lim supnmaxd{0,1},1lpnωn,l2(d)σ¯2<,0<\underaccent{\bar}{\sigma}^{2}\leq\liminf_{n\rightarrow\infty}~{}\min_{d\in\{0,1\},1\leq l\leq p_{n}}\omega_{n,l}^{2}(d)\leq\limsup_{n\rightarrow\infty}~{}\max_{d\in\{0,1\},1\leq l\leq p_{n}}\omega_{n,l}^{2}(d)\leq\bar{\sigma}^{2}<\infty~{},
    supn1maxd{0,1}E[(ψn,iβd,nr)2]c0\displaystyle\sup_{n\geq 1}\max_{d\in\{0,1\}}E[(\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r})^{2}]\leq c_{0} <,\displaystyle<\infty~{},
    maxd{0,1},1lpn12n1i2nE[ϵn,i4(d)|Xi]c0\displaystyle\max_{d\in\{0,1\},1\leq l\leq p_{n}}\frac{1}{2n}\sum_{1\leq i\leq 2n}E[\epsilon_{n,i}^{4}(d)|X_{i}]\leq c_{0} <,\displaystyle<\infty~{},
    supn1maxd{0,1}E[ϵn,i4(d)]c0\displaystyle\sup_{n\geq 1}~{}\max_{d\in\{0,1\}}E[\epsilon_{n,i}^{4}(d)]\leq c_{0} <,\displaystyle<\infty~{},
    mind{0,1}Var[Yi(d)ψn,i(β1,nr+β0,nr)/2]σ¯2\displaystyle\min_{d\in\{0,1\}}\operatorname*{Var}[Y_{i}(d)-\psi_{n,i}^{\prime}(\beta_{1,n}^{\rm r}+\beta_{0,n}^{\rm r})/2]\geq\underaccent{\bar}{\sigma}^{2} >0,\displaystyle>0~{},
    min1lpn1n1i2nI{Di=d}Var[ψn,i,lϵn,i(d)|Xi]σ¯2\displaystyle\min_{1\leq l\leq p_{n}}\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\operatorname*{Var}[\psi_{n,i,l}\epsilon_{n,i}(d)|X_{i}]\geq\underaccent{\bar}{\sigma}^{2} >0,\displaystyle>0~{},
    min1lpnVar[E[ψn,i,lϵn,i(d)|Xi]]σ¯2\displaystyle\min_{1\leq l\leq p_{n}}\operatorname*{Var}[E[\psi_{n,i,l}\epsilon_{n,i}(d)|X_{i}]]\geq\underaccent{\bar}{\sigma}^{2} >0.\displaystyle>0~{}.
Remark 5.1.

It is instructive to note that (18) in Assumption 5.1(a) is the subgradient condition for an 1\ell_{1}-penalized regression of the outcome Yi(d)Y_{i}(d) on ψn,i\psi_{n,i} when the penalty is of order o(λd,nr)o(\lambda_{d,n}^{\rm r}). Specifically, if pnnp_{n}\ll n, then this condition holds automatically with βd,nr\beta_{d,n}^{\rm r} equal to the coefficients of a linear projection of Yi(d)Y_{i}(d) onto (1,ψn,i)(1,\psi_{n,i}^{\prime}). When pnnp_{n}\gg n, but E[Yi(d)|Xi,Wi]E[Y_{i}(d)|X_{i},W_{i}] is approximately correctly specified in the sense that the approximation error Rn,i(d)=E[Yi(d)|Xi,Wi]αd,nrψn,iβd,nrR_{n,i}(d)=E[Y_{i}(d)|X_{i},W_{i}]-\alpha_{d,n}^{\rm r}-\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r} is sufficiently small, then (18) also holds. However, approximately correct specification is not necessary for (18). For example, suppose Wn,i=(Wn,i,1,,Wn,i,pn)W_{n,i}=(W_{n,i,1},\cdots,W_{n,i,p_{n}}) is a pnp_{n}-dimensional vector of independent standard normal random variables, Wn,iW_{n,i} is independent of XiX_{i}, ψn,i=(Xi,Wn,i)\psi_{n,i}=(X_{i}^{\prime},W_{n,i}^{\prime})^{\prime}, and

Yi(d)=αd,nr+ψn,iβd,nr+l=1pnWn,i,l21pn+un,i(d),\displaystyle Y_{i}(d)=\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r}+\sum_{l=1}^{p_{n}}\frac{W_{n,i,l}^{2}-1}{\sqrt{p_{n}}}+u_{n,i}(d)~{},

where E(un,i(d)|Xi,Wn,i)=0E(u_{n,i}(d)|X_{i},W_{n,i})=0. Then, Assumption 5.1(a) holds with ϵn,ir(d)=l=1pnWn,i,l21pn+un,i(d)\epsilon_{n,i}^{\rm r}(d)=\sum_{l=1}^{p_{n}}\frac{W_{n,i,l}^{2}-1}{\sqrt{p_{n}}}+u_{n,i}(d). We can impose a sparse restriction on βd,nr\beta_{d,n}^{\rm r} so that it further satisfies Assumption 5.3(b) below. On the other hand, the linear regression adjustment is not approximately correctly specified because Rn,i(d)=E(Yi(d)|Xi,Wn,i)(αd,nr+ψn,iβd,nr)=l=1pnWn,i,l21pnR_{n,i}(d)=E(Y_{i}(d)|X_{i},W_{n,i})-(\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r})=\sum_{l=1}^{p_{n}}\frac{W_{n,i,l}^{2}-1}{\sqrt{p_{n}}}, and we have ERn,i2(d)=20ER_{n,i}^{2}(d)=2\nrightarrow 0.   

Remark 5.2.

Assumption 5.1(b) and 5.1(d) are standard in the high-dimensional estimation literature; see, for instance, Belloni et al. (2017). The last four inequalities in Assumption 5.1(d), in particular, permit us to apply the high-dimensional central limit theorem in Chernozhukov et al. (2017, Theorem 2.1).   

Remark 5.3.

The penalty loadings in Assumption 5.1(c) can be computed by an iterative procedure proposed by Belloni et al. (2017). We provide more detail in Algorithm 5.1 below. We can then verify (19) under “matched pairs” designs following arguments similar to those in Belloni et al. (2017).   

Our analysis will, as before, also require some discipline on the way in which pairs are formed. For this purpose, Assumption 2.3 will suffice, but we will need an additional Lipschitz-like condition:

Assumption 5.2.

For some L>0L>0 and any x1x_{1} and x2x_{2} in the support of XiX_{i}, we have

|(Ψ(x1)Ψ(x2))βd,nr|Lx1x22.|(\Psi(x_{1})-\Psi(x_{2}))^{\prime}\beta_{d,n}^{\rm r}|\leq L||x_{1}-x_{2}||_{2}~{}.

We next specify our restrictions on the penalty parameter λd,nr\lambda_{d,n}^{\rm r}.

Assumption 5.3.
  1. (a)

    For some n\ell\ell_{n}\rightarrow\infty,

    λd,nr=nnΦ1(10.12log(n)pn).\lambda_{d,n}^{\rm r}=\frac{\ell\ell_{n}}{\sqrt{n}}\Phi^{-1}\left(1-\frac{0.1}{2\log(n)p_{n}}\right)~{}.
  2. (b)

    Ξn2(logpn)7/n0\Xi_{n}^{2}(\log p_{n})^{7}/n\to 0 and (nsnlogpn)/n0(\ell\ell_{n}s_{n}\log p_{n})/\sqrt{n}\to 0, where

    sn=maxd{0,1}βd,nr0.s_{n}=\max_{d\in\{0,1\}}\|\beta_{d,n}^{\rm r}\|_{0}. (20)

We note that Assumption 5.3(b) permits pnp_{n} to be much greater than nn. It also requires sparsity in the sense that sn=o(n)s_{n}=o(\sqrt{n}).

Finally, as is common in the analysis of 1\ell_{1}-penalized regression, we require a “restricted eigenvalue” condition. This assumption permits us to apply Bickel et al. (2009, Lemma 4.1) and establish the error bounds for |α^d,nrαd,nr|+β^d,nrβd,nr1|\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}|+||\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}||_{1} and 1n1i2nI{Di=d}(α^d,nrαd,nr+ψn,i(β^d,nrβd,nr))2\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\left(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\right)^{2}.

Assumption 5.4.

For some \kappa_{1}>0, \kappa_{2}<\infty, and \ell_{n}\to\infty, the following statements hold with probability approaching one:

infd{0,1},v𝐑pn+1:v0(sn+1)n(v22)1v(1n1i2nI{Di=d}ψ˘n,iψ˘n,i)v\displaystyle\inf_{d\in\{0,1\},v\in\mathbf{R}^{p_{n}+1}:\|v\|_{0}\leq(s_{n}+1)\ell_{n}}(\|v\|_{2}^{2})^{-1}v^{\prime}\Bigg{(}\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\breve{\psi}_{n,i}\breve{\psi}_{n,i}^{\prime}\Bigg{)}v κ1\displaystyle\geq\kappa_{1}
supd{0,1},v𝐑pn+1:v0(sn+1)n(v22)1v(1n1i2nI{Di=d}ψ˘n,iψ˘n,i)v\displaystyle\sup_{d\in\{0,1\},v\in\mathbf{R}^{p_{n}+1}:\|v\|_{0}\leq(s_{n}+1)\ell_{n}}(\|v\|_{2}^{2})^{-1}v^{\prime}\Bigg{(}\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\breve{\psi}_{n,i}\breve{\psi}_{n,i}^{\prime}\Bigg{)}v κ2\displaystyle\leq\kappa_{2}
infd{0,1},v𝐑pn+1:v0(sn+1)n(v22)1v(1n1i2nI{Di=d}E[ψ˘n,iψ˘n,i|Xi])v\displaystyle\inf_{d\in\{0,1\},v\in\mathbf{R}^{p_{n}+1}:\|v\|_{0}\leq(s_{n}+1)\ell_{n}}(\|v\|_{2}^{2})^{-1}v^{\prime}\Bigg{(}\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\breve{\psi}_{n,i}\breve{\psi}_{n,i}^{\prime}|X_{i}]\Bigg{)}v κ1\displaystyle\geq\kappa_{1}
supd{0,1},v𝐑pn+1:v0(sn+1)n(v22)1v(1n1i2nI{Di=d}E[ψ˘n,iψ˘n,i|Xi])v\displaystyle\sup_{d\in\{0,1\},v\in\mathbf{R}^{p_{n}+1}:\|v\|_{0}\leq(s_{n}+1)\ell_{n}}(\|v\|_{2}^{2})^{-1}v^{\prime}\Bigg{(}\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\breve{\psi}_{n,i}\breve{\psi}_{n,i}^{\prime}|X_{i}]\Bigg{)}v κ2,\displaystyle\leq\kappa_{2}~{},

where ψ˘n,i=(1,ψn,i)\breve{\psi}_{n,i}=(1,\psi_{n,i}^{\prime})^{\prime}.
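Assumption 5.4 restricts eigenvalues over sparse vectors and therefore cannot be verified exactly in practice. As a rough diagnostic, one can inspect the extreme eigenvalues of randomly drawn principal submatrices of the within-arm Gram matrix; the sketch below does this under our own choice of support size k (standing in for (s_{n}+1)\ell_{n}, which is unknown), and it is a heuristic check rather than a formal test.

```python
# Heuristic diagnostic for Assumption 5.4: extreme eigenvalues of randomly chosen
# k x k principal submatrices of the Gram matrix (1/n) sum_{D_i = d} psi_i psi_i'.
import numpy as np

def sparse_eig_range(psi_breve, D, d, k, n_draws=200, seed=0):
    rng = np.random.default_rng(seed)
    Z = psi_breve[D == d]                       # rows (1, psi_{n,i}') with D_i = d
    G = Z.T @ Z / Z.shape[0]
    lo, hi = np.inf, -np.inf
    for _ in range(n_draws):
        S = rng.choice(G.shape[0], size=k, replace=False)
        w = np.linalg.eigvalsh(G[np.ix_(S, S)])
        lo, hi = min(lo, w[0]), max(hi, w[-1])
    return lo, hi                               # rough analogues of kappa_1, kappa_2
```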

Using these assumptions, the following theorem characterizes the behavior of Δ^nr\hat{\Delta}_{n}^{\rm r}:

Theorem 5.1.

Suppose QQ satisfies Assumption 2.1 and the treatment assignment mechanism satisfies Assumptions 2.22.3. Further suppose Assumptions 5.15.4 hold. Then, (9), (12), and Assumption 3.1 are satisfied with m^d,n(Xi,Wn,i)=α^d,nr+ψn,iβ^d,nr\hat{m}_{d,n}(X_{i},W_{n,i})=\hat{\alpha}_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\hat{\beta}_{d,n}^{\rm r} and

md,n(Xi,Wn,i)=αd,nr+ψn,iβd,nrm_{d,n}(X_{i},W_{n,i})=\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r}

for d\in\{0,1\} and n\geq 1. Denote the asymptotic variance of \hat{\Delta}_{n}^{\rm r} by \sigma_{n}^{\rm r,2}. If the regularized adjustment is approximately correctly specified, i.e., E[Y_{i}(d)|X_{i},W_{n,i}]=\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\beta^{\rm r}_{d,n}+R_{n,i}(d) with \max_{d\in\{0,1\}}E[R_{n,i}^{2}(d)]=o(1), then \sigma_{n}^{\rm r,2} achieves the minimum variance, i.e.,

limnσnr,2=σ22(Q)+σ32(Q).\displaystyle\lim_{n\to\infty}\sigma_{n}^{\rm r,2}=\sigma_{2}^{2}(Q)+\sigma_{3}^{2}(Q)~{}.
Remark 5.4.

We recommend employing the iterative estimation procedure outlined by Belloni et al. (2017) to compute \hat{\beta}_{d,n}^{\rm r}, in which the m-th step's penalty loadings are estimated based on the (m-1)-th step's LASSO estimates. Formally, this iterative procedure is described by the following algorithm:

Algorithm 5.1.
  1. Step 0: Set ϵ^n,ir,(0)(d)=Yi\hat{\epsilon}_{n,i}^{{\rm r},(0)}(d)=Y_{i} if Di=dD_{i}=d.

  2. \vdots

  3. Step m: Compute \hat{\omega}_{n,l}^{(m)}(d)=\sqrt{\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\psi_{n,i,l}^{2}(\hat{\epsilon}_{n,i}^{{\rm r},(m-1)}(d))^{2}}, compute (\hat{\alpha}_{d,n}^{{\rm r},(m)},\hat{\beta}_{d,n}^{{\rm r},(m)}) following (17) with the \hat{\omega}_{n,l}^{(m)}(d) as the penalty loadings, and set \hat{\epsilon}_{n,i}^{{\rm r},(m)}(d)=Y_{i}-\hat{\alpha}_{d,n}^{{\rm r},(m)}-\psi_{n,i}^{\prime}\hat{\beta}_{d,n}^{{\rm r},(m)} if D_{i}=d.

  4. \vdots

  5. Step MM: \ldots

  6. Step M+1M+1: Set β^d,nr=β^d,nr,(M)\hat{\beta}_{d,n}^{\rm r}=\hat{\beta}_{d,n}^{{\rm r},(M)}.

As suggested by Belloni et al. (2017), we set MM to be 15. We note that R package hdm has a built-in option for this iterative procedure. For this choice of penalty loadings, arguments similar to those in Belloni et al. (2017) can be used to verify (19) under “matched pairs” designs.   
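The sketch below mirrors Algorithm 5.1 for one treatment arm in Python, assuming that (17) is a weighted \ell_{1}-penalized least-squares problem in which the loadings multiply the penalty; the mapping between \lambda_{d,n}^{\rm r} and the `alpha` argument below depends on the normalization in (17), which we do not reproduce, so the code is only illustrative (in practice the built-in iterative option of the R package hdm can be used instead).

```python
# Sketch of Algorithm 5.1 for arm d: iterate penalty loadings and a weighted LASSO.
# The weighted penalty sum_l omega_l |beta_l| is implemented by rescaling columns.
import numpy as np
from sklearn.linear_model import Lasso

def iterative_lasso(Y, psi, D, d, lam, M=15):
    y, X = Y[D == d], psi[D == d]
    eps = y.copy()                                                  # Step 0
    for _ in range(M):
        omega = np.sqrt(np.mean(X**2 * eps[:, None]**2, axis=0))    # penalty loadings
        omega = np.maximum(omega, 1e-12)
        fit = Lasso(alpha=lam, fit_intercept=True, max_iter=10_000).fit(X / omega, y)
        beta = fit.coef_ / omega                                    # undo the rescaling
        eps = y - fit.intercept_ - X @ beta                         # Step m residuals
    return fit.intercept_, beta
```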

Remark 5.5.

When the 1\ell_{1}-regularized adjustment is approximately correctly specified, Theorem 5.1 shows Δ^nr\hat{\Delta}_{n}^{\rm r} achieves the minimum variance derived in Remark 3.2, and thus, is guaranteed to be weakly more efficient than the difference-in-means estimator (Δ^nunadj\hat{\Delta}_{n}^{\rm unadj}). When Wn,iW_{n,i} is fixed dimensional and ψn,i\psi_{n,i} consists of sieve basis functions of (Xi,Wn,i)(X_{i},W_{n,i}), the approximately correct specification usually holds. Specifically, under regularity conditions such as the smoothness of E(Yi(d)|Xi,Wn,i)E(Y_{i}(d)|X_{i},W_{n,i}), we can approximate E(Yi(d)|Xi,Wn,i)E(Y_{i}(d)|X_{i},W_{n,i}) by αd,nr+ψn,iβd,nr\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\beta^{\rm r}_{d,n} and βd,nr\beta^{\rm r}_{d,n} is automatically sparse in the sense that βd,nr0n||\beta^{\rm r}_{d,n}||_{0}\ll n. This means our regularized regression adjustment can select relevant sieve bases in nonparametric regression adjustments in a data-driven manner and automatically minimize the limiting variance of the corresponding ATE estimator.   

Remark 5.6.

When the dimension of \psi_{n,i} is ultra-high (i.e., p_{n}\gg n) and the regularized adjustment is not approximately correctly specified, \hat{\Delta}_{n}^{\rm r} is subject to Freedman's (2008) critique: in theory, it can be less efficient than \hat{\Delta}_{n}^{\rm unadj}. To overcome this problem, we consider an additional step in which we treat the regularized adjustments (\psi_{n,i}^{\prime}\hat{\beta}_{1,n}^{\rm r},\psi_{n,i}^{\prime}\hat{\beta}_{0,n}^{\rm r}) as a two-dimensional covariate and refit a linear regression with pair fixed effects. Such a procedure has also been studied by Cohen and Fogarty (2023) in the setting with low-dimensional covariates and complete randomization. In fact, this strategy can improve upon general initial regression adjustments as long as (9), (12), and Assumption 3.1 are satisfied.

Theorem 5.2 below shows the “refit” estimator for the ATE is weakly more efficient than both \hat{\Delta}_{n}^{\rm unadj} and \hat{\Delta}_{n}^{\rm r}. To state the results, define \Gamma_{n,i}=(\psi_{n,i}^{\prime}\beta_{1,n}^{\rm r},\psi_{n,i}^{\prime}\beta_{0,n}^{\rm r})^{\prime}, \hat{\Gamma}_{n,i}=(\psi_{n,i}^{\prime}\hat{\beta}_{1,n}^{\rm r},\psi_{n,i}^{\prime}\hat{\beta}_{0,n}^{\rm r})^{\prime}, and \hat{\Delta}_{n}^{\rm refit} as the estimator in (15) with \psi_{i} replaced by \hat{\Gamma}_{n,i}. Note that \hat{\Delta}_{n}^{\rm refit} remains numerically the same if we include the intercept \hat{\alpha}_{d,n}^{\rm r} in the definition of \hat{\Gamma}_{n,i}. Following Remark 4.3, \hat{\Delta}_{n}^{\rm refit} is the intercept in the linear regression of (D_{\pi(2j-1)}-D_{\pi(2j)})(Y_{\pi(2j-1)}-Y_{\pi(2j)}) on a constant and (D_{\pi(2j-1)}-D_{\pi(2j)})(\hat{\Gamma}_{n,\pi(2j-1)}-\hat{\Gamma}_{n,\pi(2j)}). Replacing \hat{\Gamma}_{n,i} by \hat{\Gamma}_{n,i}+(\hat{\alpha}_{1,n}^{\rm r},\hat{\alpha}_{0,n}^{\rm r})^{\prime} does not change the regression estimators.
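A minimal sketch of this refitting step is given below (in Python), assuming that units 2j-1 and 2j form pair j and that beta1 and beta0 are the first-step regularized coefficients; it uses the pairwise-difference representation quoted above, to which the pair-fixed-effects regression is numerically equivalent.

```python
# Sketch: the refit estimator via the pairwise-difference regression above.
# Units 2j-1 and 2j are assumed to form pair j; the intercept is Delta_hat^refit.
import numpy as np

def refit_ate(Y, D, psi, beta1, beta0):
    gamma = np.column_stack([psi @ beta1, psi @ beta0])     # Gamma_hat_{n,i}
    dD = D[0::2] - D[1::2]                                  # D_{pi(2j-1)} - D_{pi(2j)}
    dY = (Y[0::2] - Y[1::2]) * dD
    dG = (gamma[0::2] - gamma[1::2]) * dD[:, None]
    Z = np.column_stack([np.ones(len(dY)), dG])             # constant + adjustments
    coef, *_ = np.linalg.lstsq(Z, dY, rcond=None)
    return coef[0]
```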

The following assumption will be employed to control Γn,i\Gamma_{n,i} in our subsequent analysis:

Assumption 5.5.

For some \kappa_{1}>0 and \kappa_{2}<\infty,

infn1infv𝐑2v22vE[Var[Γn,i|Xi]]vκ1\displaystyle\inf_{n\geq 1}\inf_{v\in\mathbf{R}^{2}}||v||_{2}^{-2}v^{\prime}E[\operatorname*{Var}[\Gamma_{n,i}|X_{i}]]v\geq\kappa_{1}
supn1supv𝐑2v22vE[Var[Γn,i|Xi]]vκ2.\displaystyle\sup_{n\geq 1}\sup_{v\in\mathbf{R}^{2}}||v||_{2}^{-2}v^{\prime}E[\operatorname*{Var}[\Gamma_{n,i}|X_{i}]]v\leq\kappa_{2}~{}.

The following theorem characterizes the behavior of Δ^nrefit\hat{\Delta}_{n}^{\rm refit}:

Theorem 5.2.

Suppose QQ satisfies Assumption 2.1 and the treatment assignment mechanism satisfies Assumptions 2.22.3. Further suppose Assumptions 5.15.5 hold. Then, (9), (12), and Assumption 3.1 are satisfied with m^d,n(Xi,Wn,i)=Γ^n,iβ^nrefit\hat{m}_{d,n}(X_{i},W_{n,i})=\hat{\Gamma}_{n,i}^{\prime}\hat{\beta}_{n}^{\rm refit} and

md,n(Xi,Wn,i)=Γn,iβnrefitm_{d,n}(X_{i},W_{n,i})=\Gamma_{n,i}^{\prime}\beta_{n}^{\rm refit}

for d{0,1}d\in\{0,1\} and n1n\geq 1, where βnrefit=(2E[Var[Γn,i|Xi]])1E[Cov[Γn,i,Yi(1)+Yi(0)|Xi]]\beta_{n}^{\rm refit}=(2E[\operatorname*{Var}[\Gamma_{n,i}|X_{i}]])^{-1}E[\operatorname*{Cov}[\Gamma_{n,i},Y_{i}(1)+Y_{i}(0)|X_{i}]]. In addition, denote the asymptotic variance of Δ^nrefit\hat{\Delta}_{n}^{\rm refit} as σnrefit,2\sigma_{n}^{\rm refit,2}. Then, σnunadj,2σnrefit,2\sigma_{n}^{\rm unadj,2}\geq\sigma_{n}^{\rm refit,2} and σnr,2σnrefit,2\sigma_{n}^{\rm r,2}\geq\sigma_{n}^{\rm refit,2}.

Remark 5.7.

It is possible to further relax the full rank condition in Assumption 5.5 by running a ridge regression or by truncating the minimum eigenvalue of the Gram matrix in the refitting step.
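One concrete way to implement the truncation mentioned here is sketched below (the truncation level tau is our own illustrative choice):

```python
# Sketch: floor the eigenvalues of the 2x2 Gram matrix used in the refitting step
# before inverting, as one way to relax the full-rank requirement in Assumption 5.5.
import numpy as np

def truncated_inverse(G, tau=1e-4):
    w, V = np.linalg.eigh(G)          # G symmetric positive semi-definite
    w = np.maximum(w, tau)
    return V @ np.diag(1.0 / w) @ V.T
```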

6 Simulations

In this section, we conduct Monte Carlo experiments to assess the finite-sample performance of the inference methods proposed in the paper. In all cases, we follow Bai et al. (2022) to consider tests of the hypothesis that

\displaystyle H_{0}:\Delta(Q)=\Delta_{0}\text{ versus }H_{1}:\Delta(Q)\neq\Delta_{0}

with Δ0=0\Delta_{0}=0 at nominal level α=0.05\alpha=0.05.

6.1 Data Generating Processes

We generate potential outcomes for d{0,1}d\in\{0,1\} and 1i2n1\leq i\leq 2n by the equation

Yi(d)=μd+md(Xi,Wi)+σd(Xi,Wi)ϵd,i,d=0,1,Y_{i}(d)=\mu_{d}+m_{d}(X_{i},W_{i})+\sigma_{d}(X_{i},W_{i})\epsilon_{d,i},~{}d=0,1, (21)

where \mu_{d}, m_{d}(X_{i},W_{i}), \sigma_{d}(X_{i},W_{i}), and \epsilon_{d,i} are specified in each model as follows. In each of the specifications, (X_{i},W_{i},\epsilon_{0,i},\epsilon_{1,i}) are i.i.d. across i. The number of pairs n is either 100 or 200. The number of replications is 10,000.

Model 1

(Xi,Wi)=(Φ(Vi1),Φ(Vi2))\left(X_{i},W_{i}\right)^{\top}=\left(\Phi\left(V_{i1}\right),\Phi\left(V_{i2}\right)\right)^{\top}, where Φ()\Phi(\cdot) is the standard normal distribution function and

ViN((00),(1ρρ1)),V_{i}\sim N\left(\left(\begin{array}[]{l}0\\ 0\end{array}\right),\left(\begin{array}[]{ll}1&\rho\\ \rho&1\end{array}\right)\right),

m0(Xi,Wi)=γ(Wi12)m_{0}\left(X_{i},W_{i}\right)=\gamma\left(W_{i}-\frac{1}{2}\right); m1(Xi,Wi)=m0(Xi,Wi)m_{1}\left(X_{i},W_{i}\right)=m_{0}\left(X_{i},W_{i}\right); ϵd,iN(0,1)\epsilon_{d,i}\sim N(0,1) for d=0,1d=0,1; σ0(Xi,Wi)=σ1(Xi,Wi)=1\sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=1. We set γ=4\gamma=4 and ρ=0.2\rho=0.2.
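For concreteness, one draw of the covariates and potential outcomes of Model 1 according to (21) can be generated as follows (a Python sketch; function and variable names are ours):

```python
# Sketch: simulate Model 1 potential outcomes following (21).
import numpy as np
from scipy.stats import norm

def simulate_model1(n_pairs, gamma=4.0, rho=0.2, mu0=0.0, mu1=0.0, seed=0):
    rng = np.random.default_rng(seed)
    N = 2 * n_pairs
    V = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=N)
    X, W = norm.cdf(V[:, 0]), norm.cdf(V[:, 1])
    m = gamma * (W - 0.5)                       # m_0 = m_1 in Model 1
    Y0 = mu0 + m + rng.standard_normal(N)       # sigma_0 = sigma_1 = 1
    Y1 = mu1 + m + rng.standard_normal(N)
    return X, W, Y0, Y1
```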

Model 2

\left(X_{i},W_{i}\right)^{\top}=\left(\Phi\left(V_{i1}\right),V_{i1}V_{i2}\right)^{\top}, where V_{i} is the same as in Model 1. m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}\left(W_{i}-\rho\right)+\gamma_{2}\left(\Phi^{-1}\left(X_{i}\right)^{2}-1\right). \epsilon_{d,i}\sim N(0,1) for d=0,1; \sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=1. \left(\gamma_{1},\gamma_{2}\right)^{\top}=\left(1,2\right)^{\top} and \rho=0.2.

Model 3

The same as in Model 2, except that m0(Xi,Wi)=m1(Xi,Wi)=γ1(Wiρ)+γ2(Φ(Wi)12)+γ3(Φ1(Xi)21)m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}\left(W_{i}-\rho\right)+\gamma_{2}\left(\Phi\left(W_{i}\right)-\frac{1}{2}\right)+\gamma_{3}\left(\Phi^{-1}\left(X_{i}\right)^{2}-1\right) with (γ1,γ2,γ3)=(14,1,2)\left(\gamma_{1},\gamma_{2},\gamma_{3}\right)^{\top}=\left(\frac{1}{4},1,2\right)^{\top}.

Model 4

\left(X_{i},W_{i}\right)^{\top}=\left(V_{i1},V_{i1}V_{i2}\right)^{\top}, where V_{i} is the same as in Model 1. m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}\left(W_{i}-\rho\right)+\gamma_{2}\left(\Phi\left(W_{i}\right)-\frac{1}{2}\right)+\gamma_{3}\left(X_{i}^{2}-1\right). \epsilon_{d,i}\sim N(0,1) for d=0,1; \sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=1. \left(\gamma_{1},\gamma_{2},\gamma_{3}\right)^{\top}=\left(2,1,2\right)^{\top} and \rho=0.2.

Model 5

The same as in Model 4, except that m1(Xi,Wi)=m0(Xi,Wi)+(Φ(Xi)12)m_{1}\left(X_{i},W_{i}\right)=m_{0}\left(X_{i},W_{i}\right)+\left(\Phi\left(X_{i}\right)-\frac{1}{2}\right).

Model 6

The same as in Model 5, except that σ0(Xi,Wi)=σ1(Xi,Wi)=(Φ(Xi)+0.5)\sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=\left(\Phi\left(X_{i}\right)+0.5\right).

Model 7

Xi=(Vi1,Vi2)X_{i}=\left(V_{i1},V_{i2}\right)^{\top} and Wi=(Vi1Vi3,Vi2Vi4)W_{i}=\left(V_{i1}V_{i3},V_{i2}V_{i4}\right)^{\top}, where ViN(0,Σ)V_{i}\sim N(0,\Sigma) with dim(Vi)=4\text{dim}(V_{i})=4 and Σ\Sigma consisting of 1 on the diagonal and ρ\rho on all other elements. m0(Xi,Wi)=m1(Xi,Wi)=γ1(Wiρ)+γ2(Φ(Wi)12)+γ3(Xi121)m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}^{\prime}\left(W_{i}-\rho\right)+\gamma_{2}^{\prime}\left(\Phi\left(W_{i}\right)-\frac{1}{2}\right)+\gamma_{3}\left(X_{i1}^{2}-1\right) with γ1=(2,2),γ2=(1,1),γ3=1\gamma_{1}=\left(2,2\right)^{\top},\gamma_{2}=\left(1,1\right)^{\top},\gamma_{3}=1. ϵd,iN(0,1)\epsilon_{d,i}\sim N(0,1) for d=0,1d=0,1; σ0(Xi,Wi)=σ1(Xi,Wi)=1\sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=1. ρ=0.2\rho=0.2.

Model 8

The same as in Model 7, except that m1(Xi,Wi)=m0(Xi,Wi)+(Φ(Xi1)12)m_{1}\left(X_{i},W_{i}\right)=m_{0}\left(X_{i},W_{i}\right)+\left(\Phi\left(X_{i1}\right)-\frac{1}{2}\right).

Model 9

The same as in Model 8, except that \sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=\left(\Phi\left(X_{i1}\right)+0.5\right).

Model 10

Xi=(Φ(Vi1),,Φ(Vi4))X_{i}=\left(\Phi\left(V_{i1}\right),\cdots,\Phi\left(V_{i4}\right)\right)^{\top} and Wi=(Vi1Vi5,Vi2Vi6)W_{i}=\left(V_{i1}V_{i5},V_{i2}V_{i6}\right)^{\top}, where ViN(0,Σ)V_{i}\sim N(0,\Sigma) with dim(Vi)=6\text{dim}(V_{i})=6 and Σ\Sigma consisting of 1 on the diagonal and ρ\rho on all other elements. m0(Xi,Wi)=m1(Xi,Wi)=γ1(Wiρ)+γ2(Φ(Wi)12)+γ3((Φ1(Xi1)2,Φ1(Xi2)2)1)m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}^{\prime}\left(W_{i}-\rho\right)+\gamma_{2}^{\prime}\left(\Phi\left(W_{i}\right)-\frac{1}{2}\right)+\gamma_{3}^{\prime}\left(\left(\Phi^{-1}\left(X_{i1}\right)^{2},\Phi^{-1}\left(X_{i2}\right)^{2}\right)^{\top}-1\right) with γ1=(1,1),γ2=(12,12),γ3=(12,12)\gamma_{1}=\left(1,1\right)^{\top},\gamma_{2}=\left(\frac{1}{2},\frac{1}{2}\right)^{\top},\gamma_{3}=\left(\frac{1}{2},\frac{1}{2}\right)^{\top}. ϵd,iN(0,1)\epsilon_{d,i}\sim N(0,1) for d=0,1d=0,1; σ0(Xi,Wi)=σ1(Xi,Wi)=1\sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=1.

Model 11

The same as in Model 10, except that m1(Xi,Wi)=m0(Xi,Wi)+14j=14(Xij12)m_{1}\left(X_{i},W_{i}\right)=m_{0}\left(X_{i},W_{i}\right)+\frac{1}{4}\sum_{j=1}^{4}\left(X_{ij}-\frac{1}{2}\right).

Model 12

Xi=(Φ(Vi1),,Φ(Vi4))X_{i}=\left(\Phi\left(V_{i1}\right),\cdots,\Phi\left(V_{i4}\right)\right)^{\top} and Wi=(Vi1Vi41,,Vi40Vi80)W_{i}=\left(V_{i1}V_{i41},\cdots,V_{i40}V_{i80}\right)^{\top}, where ViN(0,Σ)V_{i}\sim N(0,\Sigma) with dim(Vi)=80\text{dim}(V_{i})=80. Σ\Sigma is the Toeplitz matrix

Σ=(10.50.520.5790.510.50.5780.520.510.5770.5790.5780.5771).\displaystyle\Sigma=\begin{pmatrix}1&0.5&0.5^{2}&\cdots&0.5^{79}\\ 0.5&1&0.5&\cdots&0.5^{78}\\ 0.5^{2}&0.5&1&\cdots&0.5^{77}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0.5^{79}&0.5^{78}&0.5^{77}&\cdots&1\end{pmatrix}.

m0(Xi,Wi)=m1(Xi,Wi)=γ1Wi+γ2(Φ1(Xi)21)m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}^{\prime}W_{i}+\gamma_{2}^{\prime}\left(\Phi^{-1}\left(X_{i}\right)^{2}-1\right), γ1=(112,122,,1402)\gamma_{1}=\left(\frac{1}{1^{2}},\frac{1}{2^{2}},\cdots,\frac{1}{40^{2}}\right)^{\top} with dim(γ1)=40\text{dim}(\gamma_{1})=40, and γ2=(18,18,18,18)\gamma_{2}=\left(\frac{1}{8},\frac{1}{8},\frac{1}{8},\frac{1}{8}\right)^{\top} with dim(γ2)=4\text{dim}(\gamma_{2})=4. ϵd,iN(0,1)\epsilon_{d,i}\sim N(0,1) for d=0,1d=0,1; σ0(Xi,Wi)=σ1(Xi,Wi)=1\sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=1.
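The Toeplitz covariance matrix and the high-dimensional regressors of Model 12 can be constructed as in the following sketch (Python; names are ours):

```python
# Sketch: covariates of Model 12 with the Toeplitz covariance Sigma_{jk} = 0.5^{|j-k|}.
import numpy as np
from scipy.linalg import toeplitz
from scipy.stats import norm

def simulate_model12_covariates(n_units, seed=0):
    rng = np.random.default_rng(seed)
    Sigma = toeplitz(0.5 ** np.arange(80))                 # 80 x 80
    V = rng.multivariate_normal(np.zeros(80), Sigma, size=n_units)
    X = norm.cdf(V[:, :4])                                 # X_i = (Phi(V_i1),...,Phi(V_i4))
    W = V[:, :40] * V[:, 40:80]                            # W_i = (V_i1 V_i41,...,V_i40 V_i80)
    return X, W
```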

Model 13

The same as in Model 12, except that m0(Xi,Wi)=m1(Xi,Wi)=γ1Wi+γ2(Φ(Wi)12)+γ3(Φ1(Xi)21)m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}^{\prime}W_{i}+\gamma_{2}^{\prime}\left(\Phi\left(W_{i}\right)-\frac{1}{2}\right)+\gamma_{3}^{\prime}\left(\Phi^{-1}\left(X_{i}\right)^{2}-1\right), γ1=(112,,1402)\gamma_{1}=\left(\frac{1}{1^{2}},\cdots,\frac{1}{40^{2}}\right)^{\top}, γ2=18(112,,1402)\gamma_{2}=\frac{1}{8}\left(\frac{1}{1^{2}},\cdots,\frac{1}{40^{2}}\right)^{\top}, and γ3=(18,18,18,18)\gamma_{3}=\left(\frac{1}{8},\frac{1}{8},\frac{1}{8},\frac{1}{8}\right)^{\top} with dim(γ1)=dim(γ2)=40\text{dim}(\gamma_{1})=\text{dim}(\gamma_{2})=40 and dim(γ3)=4\text{dim}(\gamma_{3})=4.

Model 14

The same as in Model 13, except that m1(Xi,Wi)=m0(Xi,Wi)+j=141j2(Xij12)m_{1}\left(X_{i},W_{i}\right)=m_{0}\left(X_{i},W_{i}\right)+\sum_{j=1}^{4}\frac{1}{j^{2}}\left(X_{ij}-\frac{1}{2}\right).

Model 15

The same as in Model 14, except that σ0(Xi,Wi)=σ1(Xi,Wi)=(Xi1+0.5)\sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=\left(X_{i1}+0.5\right).

It is worth noting that Models 1, 2, 3, 4, 7, 10, 12, and 13 imply homogeneous treatment effects because m_{1}\left(X_{i},W_{i}\right)=m_{0}\left(X_{i},W_{i}\right). Among them, E[Y_{i}(d)|X_{i},W_{i}]-E[Y_{i}(d)|X_{i}] is linear in W_{i} in Models 1, 2, and 12. Models 5, 8, 11, and 14 have heterogeneous but homoscedastic treatment effects. In Models 6, 9, and 15, however, the implied treatment effects are both heterogeneous and heteroscedastic. Models 12–15 contain high-dimensional covariates.

We follow Bai et al. (2022) to match pairs. Specifically, if \text{dim}\left(X_{i}\right)=1, we match pairs by sorting X_{i}, i=1,\ldots,2n. If \text{dim}\left(X_{i}\right)>1, we match pairs by the permutation \pi calculated using the R package nbpMatching. For more details, see Bai et al. (2022, Section 4). After matching the pairs, we flip coins to randomly select one unit within each pair for treatment, with the other serving as control.
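For the scalar case, the pairing and within-pair randomization can be sketched as follows (Python; the multivariate case relies on the optimal non-bipartite matching of the R package nbpMatching and is not reproduced here):

```python
# Sketch: match pairs by sorting a scalar X and flip a coin within each pair.
import numpy as np

def matched_pairs_assignment(X, seed=0):
    rng = np.random.default_rng(seed)
    order = np.argsort(X)                      # adjacent units in the sort form pairs
    D = np.zeros(len(X), dtype=int)
    for j in range(len(X) // 2):
        pair = order[2 * j: 2 * j + 2]
        D[rng.permutation(pair)[0]] = 1        # one unit per pair is treated at random
    return order, D                            # order plays the role of pi
```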

6.2 Estimation and Inference

We set μ0=0\mu_{0}=0 and μ1=Δ\mu_{1}=\Delta, where Δ=0\Delta=0 and 1/41/4 are used to illustrate the size and power, respectively. Rejection probabilities in percentage points are presented. To further illustrate the efficiency gains obtained by regression adjustments, in Figure 1, we plot the average standard error reduction in percentage relative to the standard error of the estimator without adjustments for various estimation methods.

Specifically, we consider the following adjusted estimators.

  1. (i)

    unadj: the estimator with no adjustments. In this case, our standard error is identical to the adjusted standard error proposed by Bai et al. (2022).

  2. (ii)

    naïve: the linear adjustments with regressors WiW_{i} but without pair dummies.

  3. (iii)

    naïve2: the linear adjustments with regressors X_{i} and W_{i} but without pair dummies.

  4. (iv)

    pfe: the linear adjustments with regressors W_{i} and pair dummies (see the sketch below).

  5. (v)

    refit: refit the 1\ell_{1}-regularized adjustments by linear regression with pair dummies.

See Section C in the Online Supplement for the regressors used in the regularized adjustments.
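As a point of reference for (iv), the sketch below computes the “pfe” point estimate through the pairwise-difference regression suggested by Remark 4.3 (with W_{i} in place of the regularized adjustments); pairs are assumed to be stored consecutively, and the associated standard errors are not reproduced here.

```python
# Sketch: the "pfe" point estimate via within-pair differencing; W is a (2n, k) array.
import numpy as np

def pfe_estimate(Y, D, W):
    W = np.asarray(W).reshape(len(Y), -1)
    dD = D[0::2] - D[1::2]
    dY = (Y[0::2] - Y[1::2]) * dD
    dW = (W[0::2] - W[1::2]) * dD[:, None]
    Z = np.column_stack([np.ones(len(dY)), dW])
    coef, *_ = np.linalg.lstsq(Z, dY, rcond=None)
    return coef[0]                             # intercept = ATE estimate
```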

For Models 1–11, we examine the performance of estimators (i)–(v). For Models 12–15, we assess the performance of estimators (i) and (v) in high-dimensional settings. Note that the adjustments are misspecified for almost all the models. The only exception is Model 1, for which the linear adjustment in W_{i} is correctly specified because m_{d}(X_{i},W_{i}) is just a linear function of W_{i}.

6.3 Simulation Results

Tables 1 and 3 report the rejection probabilities (size and power) of the different methods at the 0.05 level for Models 1–11 when n is 100 and 200, respectively. Several patterns emerge. First, for all the estimators, the rejection rates under H_{0} are close to the nominal level even when n=100 and with misspecified adjustments. This result is expected because all the estimators take into account the dependence structure arising in the “matched pairs” design, consistent with the findings in Bai et al. (2022).

Second, in terms of power, “pfe” is more powerful than “unadj”, “naïve”, and “naïve2” for all eleven models, as predicted by our theory. This finding confirms that “pfe” is the optimal linear adjustment and does not degrade the precision of the ATE estimator. In contrast, we observe that “naïve” and “naïve2” in Model 3 are even less powerful than the unadjusted estimator “unadj”. Figure 1 further confirms that these two methods inflate the estimation standard error. This result echoes Freedman's critique (Freedman, 2008) that careless regression adjustments may degrade the estimation precision. Our “pfe” estimator addresses this issue because it is proven to be weakly more efficient than the unadjusted estimator.

Third, the improvement in power for “pfe” is mainly due to the reduction of estimation standard errors, which can exceed 50% as shown in Figure 1 for Models 4–9. This means that the length of the confidence interval for the “pfe” estimator is less than half of that for the “unadj” estimator. Note the standard error of the “unadj” estimator is the one proposed by Bai et al. (2022), which has already been adjusted to account for the cross-sectional dependence created in pair matching. The additional reduction of more than 50% is therefore produced purely by the regression adjustment. For Models 10–11, the reduction of standard errors achieved by “pfe” is more than 40% as well. For Model 1, the linear regression is correctly specified, so all three adjusted methods achieve the global minimum asymptotic variance and maximum power. For Model 2, m_{d}(X_{i},W_{i})-E[m_{d}(X_{i},W_{i})|X_{i}]=\gamma_{1}(W_{i}-E[W_{i}|X_{i}]), so the linear adjustment \gamma_{1}W_{i} satisfies the conditions in Theorem 3.1. Therefore, “pfe”, as the best linear adjustment, is also the best adjustment globally, achieving the global minimum asymptotic variance and maximum power. In contrast, “naïve” and “naïve2” are not the best linear adjustments and are therefore less powerful than “pfe” because of the omitted pair dummies.

Finally, the “refit” method has the best power for most models, as it automatically achieves the global minimum asymptotic variance when the dimension of W_{i} is fixed.

Tables 2 and 4 report the size and power of the “refit” adjustments when both W_{i} and X_{i} are high-dimensional. We see that the size under the null is close to the nominal 5%, while the power of the adjusted estimator is higher than that of the unadjusted one. Figure 1 further illustrates that the reduction of the standard error is more than 30% for all high-dimensional models.

Table 1: Rejection probabilities for Models 1-11 when n=100n=100
H0H_{0}: Δ=0\Delta=0 H1H_{1}: Δ=1/4\Delta=1/4
Model unadj naïve naïve2 pfe refit unadj naïve naïve2 pfe refit
1 5.47 5.57 5.63 5.76 5.84 22.48 43.89 43.95 43.91 43.92
2 4.96 5.26 5.30 5.47 5.32 23.32 28.02 27.96 37.21 33.12
3 4.99 5.28 5.24 5.48 5.27 32.19 27.88 27.96 37.34 36.29
4 5.31 5.28 5.28 5.48 5.79 11.78 27.88 28.03 37.34 43.28
5 5.43 5.09 5.08 5.49 5.78 11.87 27.72 27.88 36.69 43.08
6 5.28 5.43 5.41 5.58 5.79 11.78 26.67 26.72 34.71 40.29
7 5.64 5.63 5.62 5.98 6.04 9.24 34.55 34.65 37.96 42.08
8 5.63 5.54 5.51 6.03 6.17 9.28 34.11 34.42 37.22 41.29
9 5.74 5.69 5.76 6.19 5.89 8.99 32.39 32.30 35.42 38.75
10 5.24 5.78 5.73 6.05 6.04 14.27 30.80 30.75 32.02 32.51
11 5.19 5.78 5.72 6.07 5.95 14.36 30.60 30.49 32.21 32.81
Table 2: Rejection probabilities for Models 12-15 when n=100n=100
H0H_{0}: Δ=0\Delta=0 H1H_{1}: Δ=1/4\Delta=1/4
Model unadj refit unadj refit
12 5.35 6.12 22.01 42.56
13 5.31 6.11 21.47 42.47
14 5.24 6.07 21.39 41.14
15 5.31 6.23 20.73 38.67
Table 3: Rejection probabilities for Models 1-11 when n=200n=200
H0H_{0}: Δ=0\Delta=0 H1H_{1}: Δ=1/4\Delta=1/4
Model unadj naïve naïve2 pfe refit unadj naïve naïve2 pfe refit
1 5.08 5.04 5.10 5.21 5.31 38.94 70.35 70.36 70.32 70.30
2 5.69 5.28 5.28 5.24 5.40 40.31 49.25 49.32 65.36 57.87
3 5.44 5.29 5.30 5.35 5.41 56.89 49.43 49.51 64.96 62.42
4 5.45 5.29 5.29 5.35 5.20 18.55 49.43 49.67 64.96 69.96
5 5.45 5.24 5.18 5.19 5.29 18.41 48.65 48.80 64.11 69.09
6 5.62 5.32 5.31 5.35 5.43 18.19 46.71 46.67 61.09 65.98
7 5.24 5.51 5.46 5.34 5.49 11.86 60.73 60.63 65.14 69.24
8 5.23 5.49 5.47 5.35 5.65 11.84 60.00 60.10 64.93 68.02
9 5.30 5.58 5.57 5.66 5.81 11.90 57.25 57.28 61.61 64.88
10 5.34 5.19 5.15 5.25 5.31 23.95 55.49 55.44 56.64 56.43
11 5.41 5.36 5.32 5.34 5.41 23.88 55.01 55.05 56.31 56.18
Table 4: Rejection probabilities for Models 12-15 when n=200n=200
H0H_{0}: Δ=0\Delta=0 H1H_{1}: Δ=1/4\Delta=1/4
Model unadj refit unadj refit
12 4.97 5.22 38.91 68.10
13 4.95 5.19 38.04 68.06
14 5.01 5.24 37.65 66.69
15 5.15 5.40 36.61 63.79
Figure 1: Average Standard Error Reduction in Percentage under H1H_{1} when n=200n=200

Notes: The figure plots average standard error reduction in percentage achieved by regression adjustments relative to “unadj” under H1H_{1} for Models 1-15 when n=200n=200.

7 Empirical Illustration

In this section, we revisit the randomized experiment with a matched pairs design conducted in Groh and McKenzie (2016). In the paper, they examined the impact of macroinsurance on microenterprises. Here, we apply the covariate adjustment methods developed in this paper to their data and reinvestigate the average effect of macroinsurance on three outcome variables: the microenterprises’ monthly profits, revenues, and investment.

The subjects in the experiment are microenterprise owners, who were the clients of the largest microfinance institution in Egypt. In the randomization, after an exact match on gender and the institution's branch code, those clients were grouped into pairs by applying an optimal greedy algorithm to 13 additional matching variables. Within each pair, a macroinsurance product was then offered to one randomly assigned client, and the other acted as a control. Based on the pair identities and all the matching variables, we re-order the pairs in our sample according to the procedure described in Section 5.1 of Jiang et al. (2022). The resulting sample contains 2824 microenterprise owners, that is, 1412 pairs of them. See Groh and McKenzie (2016) and Jiang et al. (2022) for more details.

Table 5 reports the ATEs with the standard errors (in parentheses) estimated by different methods. Among them, “GM” corresponds to the method used in Groh and McKenzie (2016), who estimated the effect by a regression with regressors including some baseline variables, a dummy for missing observations, and dummies for the pairs. Specifically, for profits and revenues, the regressors are the baseline value of the outcome of interest, a dummy for missing observations, and pair dummies; for investment, the regressors only include pair dummies. The standard errors for the “GM” ATE estimate are calculated by the usual heteroskedasticity-consistent estimator. The “GM” results in Table 5 were obtained by applying the Stata code provided by Groh and McKenzie (2016). The description of the other methods is similar to that in Section 6.2. Specifically: (i) X_{i} includes gender and 13 additional matching variables for all adjustments. Three of the matching variables are continuous, and the others are dummies. (ii) To maintain comparability, we keep X_{i} and W_{i} consistent across all adjustments except for “refit” for each outcome variable. For profits and revenue, W_{i} includes the baseline value of the outcome of interest, a dummy for whether the firm is above the 95th percentile of the control firms' distribution of the outcome variable, and a dummy for missing observations. For investment, W_{i} includes all the covariates used for the first two outcome variables. (iii) For “refit”, we intentionally expand the dimension of W_{i}. In addition to the baseline values used in the other adjustments and the dummy variables for missing observations, the W_{i} used in “refit” also includes the interactions of the original continuous W_{i} variables with the three continuous variables and the first three discrete variables in X_{i}. (iv) All the continuous variables in X_{i} and W_{i} are standardized initially when the regression-adjusted estimators are employed. The results in this table prompt the following observations.

First, aligning with our theoretical and simulation findings, we observe that the standard errors associated with the covariate-adjusted ATEs, particularly those for the “naïve2” and “pfe” estimates, are generally lower than those for the ATE estimate without any adjustment. This pattern is consistent across nearly all the outcome variables. To illustrate, when examining the revenue outcome, the standard errors for the “pfe” estimates are 10.2% smaller than those for the unadjusted ATE estimate.

Second, the standard errors of the “refit” estimates are consistently smaller than those of the unadjusted ATE estimate across all the outcome variables. For example, when profits are the outcome variable, the “refit” estimates exhibit standard errors 7.5% smaller than those of the unadjusted ATE estimate. Moreover, compared with those of the “pfe” estimates, the standard errors of “refit” are slightly smaller.

Table 5: Impacts of Macroinsurance for Microenterprises
Y n unadj GM naïve naïve2 pfe refit
Profits 1322 -85.65 -50.88 -41.69 -50.97 -51.60 -55.13
(49.43) (46.46) (47.22) (45.49) (46.94) (45.71)
Revenue 1318 -838.60 -660.16 -611.75 -610.80 -635.80 -600.97
(319.02) (284.02) (286.93) (282.93) (286.50) (284.60)
Investment 1410 -66.60 -66.60 -49.37 -50.72 -67.31 -58.77
(118.93) (118.66) (119.23) (118.97) (118.88) (118.84)

Notes: The table reports the ATE estimates of the effect of macroinsurance for microenterprises. Standard errors are in parentheses.

8 Conclusion

This paper considers covariate adjustment for the estimation of the average treatment effect in “matched pairs” designs when covariates other than the matching variables are available. When the dimension of these covariates is low, we suggest estimating the average treatment effect by a linear regression of the outcome on treatment status and covariates, controlling for pair fixed effects. We show that this estimator is no worse than the simple difference-in-means estimator in terms of efficiency. When the dimension of these covariates is high, we suggest a two-step estimation procedure: in the first step, we run \ell_{1}-regularized regressions of the outcome on covariates for the treated and control groups separately and obtain the fitted values for both potential outcomes, and in the second step, we estimate the average treatment effect by refitting a linear regression of the outcome on treatment status and the regularized adjustments from the first step, controlling for pair fixed effects. We show that the final estimator is no worse than the simple difference-in-means estimator in terms of efficiency. When the conditional mean models are approximately correctly specified, this estimator further achieves the minimum variance as if all relevant covariates had been used to form pairs in the design stage of the experiment. We take the choice of variables used in forming pairs as given and focus on how to obtain more efficient estimators of the average treatment effect in the analysis stage. Our paper is therefore silent on the important question of how to choose the relevant matching variables in the design stage. This topic is left for future research.

Appendix A Proofs of Main Results

In the appendix, we use a_{n}\lesssim b_{n} to denote that there exists a constant c>0 such that a_{n}\leq cb_{n}.

A.1 Proof of Theorem 3.1

Step 1: Decomposition by recursive conditioning

To begin, note

μ^n(1)\displaystyle\hat{\mu}_{n}(1) =12n1i2n(2Di(Yi(1)m^1,n(Xi,Wi))+m^1,n(Xi,Wi))\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}(Y_{i}(1)-\hat{m}_{1,n}(X_{i},W_{i}))+\hat{m}_{1,n}(X_{i},W_{i}))
=12n1i2n(2DiYi(1)(2Di1)m^1,n(Xi,Wi))\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}Y_{i}(1)-(2D_{i}-1)\hat{m}_{1,n}(X_{i},W_{i}))
=12n1i2n(2DiYi(1)(2Di1)m1,n(Xi,Wi))+oP(n1/2)\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}Y_{i}(1)-(2D_{i}-1)m_{1,n}(X_{i},W_{i}))+o_{P}(n^{-1/2})
=12n1i2n(2DiYi(1)Dim1,n(Xi,Wi)(1Di)m1,n(Xi,Wi))+oP(n1/2),\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}Y_{i}(1)-D_{i}m_{1,n}(X_{i},W_{i})-(1-D_{i})m_{1,n}(X_{i},W_{i}))+o_{P}(n^{-1/2})~{}, (22)

where the third equality follows from (9). Similarly,

μ^n(0)=12n1i2n(2(1Di)Yi(0)Dim0,n(Xi,Wi)(1Di)m0,n(Xi,Wi))+oP(n1/2).\hat{\mu}_{n}(0)=\frac{1}{2n}\sum_{1\leq i\leq 2n}(2(1-D_{i})Y_{i}(0)-D_{i}m_{0,n}(X_{i},W_{i})-(1-D_{i})m_{0,n}(X_{i},W_{i}))+o_{P}(n^{-1/2})~{}. (23)

It follows from (22)–(23) that

Δ^n=1n1i2nDiϕ1,n,i1n1i2n(1Di)ϕ0,n,i+oP(n1/2),\hat{\Delta}_{n}=\frac{1}{n}\sum_{1\leq i\leq 2n}D_{i}\phi_{1,n,i}-\frac{1}{n}\sum_{1\leq i\leq 2n}(1-D_{i})\phi_{0,n,i}+o_{P}(n^{-1/2})~{}, (24)

where

ϕ1,n,i\displaystyle\phi_{1,n,i} =Yi(1)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))\displaystyle=Y_{i}(1)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))
ϕ0,n,i\displaystyle\phi_{0,n,i} =Yi(0)12(m1,n(Xi,Wi)+m0,n(Xi,Wi)).\displaystyle=Y_{i}(0)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))~{}.

Next, consider

𝕃n=12n1i2n(2Di1)E[m1,n(Xi,Wi)+m0,n(Xi,Wi)|Xi].\displaystyle\mathbb{L}_{n}=\frac{1}{2\sqrt{n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)E[m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i})|X_{i}]~{}.

For simplicity, define Md,n(Xi)=E[md,n(Xi,Wi)|Xi]M_{d,n}(X_{i})=E[m_{d,n}(X_{i},W_{i})|X_{i}] for d{0,1}d\in\{0,1\}. It follows from Assumption 2.2 that E[𝕃n|X(n)]=0E[\mathbb{L}_{n}|X^{(n)}]=0. On the other hand,

Var[𝕃n|X(n)]\displaystyle\operatorname*{Var}[\mathbb{L}_{n}|X^{(n)}] =14n1jn(M1,n(Xπ(2j1))+M0,n(Xπ(2j1))(M1,n(Xπ(2j))+M0,n(Xπ(2j))))2\displaystyle=\frac{1}{4n}\sum_{1\leq j\leq n}\left(M_{1,n}(X_{\pi(2j-1)})+M_{0,n}(X_{\pi(2j-1)})-(M_{1,n}(X_{\pi(2j)})+M_{0,n}(X_{\pi(2j)}))\right)^{2}
1n1jn|M1,n(Xπ(2j1))M1,n(Xπ(2j))|2+1n1jn|M0,n(Xπ(2j1))M0,n(Xπ(2j))|2\displaystyle\lesssim\frac{1}{n}\sum_{1\leq j\leq n}|M_{1,n}(X_{\pi(2j-1)})-M_{1,n}(X_{\pi(2j)})|^{2}+\frac{1}{n}\sum_{1\leq j\leq n}|M_{0,n}(X_{\pi(2j-1)})-M_{0,n}(X_{\pi(2j)})|^{2}
P0,\displaystyle\stackrel{{\scriptstyle P}}{{\to}}0~{},

where the inequality follows from (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}) and the convergence follows from Assumptions 2.3 and 3.1(c). By Markov’s inequality and the fact that E[𝕃n|X(n)]=0E[\mathbb{L}_{n}|X^{(n)}]=0, for any ϵ>0\epsilon>0,

P{|𝕃n|>ϵ|X(n)}Var[𝕃n|X(n)]ϵ2P0.P\{|\mathbb{L}_{n}|>\epsilon|X^{(n)}\}\leq\frac{\operatorname*{Var}[\mathbb{L}_{n}|X^{(n)}]}{\epsilon^{2}}\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Since conditional probabilities are bounded by one, the dominated convergence theorem implies P\{|\mathbb{L}_{n}|>\epsilon\}\to 0 for every \epsilon>0, and hence \mathbb{L}_{n}=o_{P}(1). This fact, together with (24), implies

\sqrt{n}(\hat{\Delta}_{n}-\Delta(Q))=A_{n}-B_{n}+C_{n}-D_{n}+o_{P}(1)~{},

where

An\displaystyle A_{n} =1n1i2n(Diϕ1,n,iE[Diϕ1,n,i|X(n),D(n)])\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}\left(D_{i}\phi_{1,n,i}-E[D_{i}\phi_{1,n,i}|X^{(n)},D^{(n)}]\right)
Bn\displaystyle B_{n} =1n1i2n((1Di)ϕ0,n,iE[(1Di)ϕ0,n,i|X(n),D(n)])\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}\left((1-D_{i})\phi_{0,n,i}-E[(1-D_{i})\phi_{0,n,i}|X^{(n)},D^{(n)}]\right)
Cn\displaystyle C_{n} =1n1i2nDi(E[Yi(1)|Xi]E[Yi(1)])\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}D_{i}(E[Y_{i}(1)|X_{i}]-E[Y_{i}(1)])
Dn\displaystyle D_{n} =1n1i2n(1Di)(E[Yi(0)|Xi]E[Yi(0)]).\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}(1-D_{i})(E[Y_{i}(0)|X_{i}]-E[Y_{i}(0)])~{}.

Note that conditional on X(n)X^{(n)} and D(n)D^{(n)}, AnA_{n} and BnB_{n} are independent while CnC_{n} and DnD_{n} are constants.

Step 2: Conditional central limit theorems

We first analyze the limiting behavior of AnA_{n}. Define

sn2=1i2nDiVar[ϕ1,n,i|Xi].s_{n}^{2}=\sum_{1\leq i\leq 2n}D_{i}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]~{}.

Note by Assumption 2.2 that s_{n}^{2}=n\operatorname*{Var}[A_{n}|X^{(n)},D^{(n)}]. We proceed to verify the Lindeberg condition for A_{n} conditional on X^{(n)} and D^{(n)}, i.e., we show that for every \epsilon>0,

1sn21i2nE[|Di(ϕ1,n,iE[ϕ1,n,i|Xi])|2I{|Di(ϕ1,n,iE[ϕ1,n,i|Xi])|>ϵsn}|X(n),D(n)]P0.\frac{1}{s_{n}^{2}}\sum_{1\leq i\leq 2n}E[|D_{i}(\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}])|^{2}I\{|D_{i}(\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}])|>\epsilon s_{n}\}|X^{(n)},D^{(n)}]\stackrel{{\scriptstyle P}}{{\to}}0~{}. (25)

To that end, first note Lemma B.2 implies

sn2nE[Var[ϕ1,n,i|Xi]]P1.\frac{s_{n}^{2}}{nE[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]}\stackrel{{\scriptstyle P}}{{\to}}1~{}. (26)

(26) and Assumption 3.1(a) imply that for all λ>0\lambda>0,

P{ϵsn>λ}P1.P\{\epsilon s_{n}>\lambda\}\stackrel{{\scriptstyle P}}{{\to}}1~{}. (27)

Furthermore, for some c>0c>0,

P{sn2n>c}1.P\left\{\frac{s_{n}^{2}}{n}>c\right\}\to 1~{}. (28)

Next, note for any λ>0\lambda>0 and δ1>0\delta_{1}>0, the left-hand side of (25) can be written as

1sn2/n1n1i2n:Di=1E[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>ϵsn}|X(n),D(n)]\displaystyle\frac{1}{s_{n}^{2}/n}\frac{1}{n}\sum_{1\leq i\leq 2n:D_{i}=1}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\epsilon s_{n}\}|X^{(n)},D^{(n)}]
1sn2/n1n1i2nE[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>ϵsn}|X(n),D(n)]\displaystyle\leq\frac{1}{s_{n}^{2}/n}\frac{1}{n}\sum_{1\leq i\leq 2n}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\epsilon s_{n}\}|X^{(n)},D^{(n)}]
1c1n1i2nE[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>λ}|X(n),D(n)]+oP(1)\displaystyle\leq\frac{1}{c}\frac{1}{n}\sum_{1\leq i\leq 2n}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\lambda\}|X^{(n)},D^{(n)}]+o_{P}(1)
2c12n1i2nE[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>λ}|Xi]+oP(1),\displaystyle\leq\frac{2}{c}\frac{1}{2n}\sum_{1\leq i\leq 2n}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\lambda\}|X_{i}]+o_{P}(1)~{}, (29)

where the first inequality follows by inspection, the second follows from (27)–(28), and the last follows from Assumption 2.2. We then argue

12n1i2nE[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>λ}|Xi]=E[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>λ}]+oP(1).\frac{1}{2n}\sum_{1\leq i\leq 2n}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\lambda\}|X_{i}]\\ =E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\lambda\}]+o_{P}(1)~{}. (30)

To this end, we once again verify the Lindeberg condition in Lemma 11.4.2 of Lehmann and Romano (2005). Note

|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>λ}|ϕ1,n,iE[ϕ1,n,i|Xi]|2.|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\lambda\}\leq|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}~{}.

Therefore, in light of Lemma B.1, we only need to verify

limγlim supnE[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|2>γ}]=0,\lim_{\gamma\to\infty}\limsup_{n\to\infty}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}>\gamma\}]=0~{}, (31)

which follows immediately from Lemma B.3.

Another application of (31) implies (25). Lindeberg’s central limit theorem and (26) then imply that

supt𝐑|P{An/E[Var[ϕ1,n,i|Xi]]t|X(n),D(n)}Φ(t)|P0.\sup_{t\in\mathbf{R}}|P\{A_{n}/\sqrt{E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]}\leq t|X^{(n)},D^{(n)}\}-\Phi(t)|\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Similar arguments lead to

supt𝐑|P{Bn/E[Var[ϕ0,n,i|Xi]]t|X(n),D(n)}Φ(t)|P0.\sup_{t\in\mathbf{R}}|P\{B_{n}/\sqrt{E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]}\leq t|X^{(n)},D^{(n)}\}-\Phi(t)|\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Step 3: Combining conditional and unconditional components

Meanwhile, it follows from the same arguments as those in (S.22)–(S.25) of Bai et al. (2022) that

CnDndN(0,12E[(E[Yi(1)|Xi]E[Yi(1)](E[Yi(0)|Xi]E[Yi(0)]))2]).C_{n}-D_{n}\stackrel{{\scriptstyle d}}{{\to}}N\left(0,\frac{1}{2}E\left[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(1)]-(E[Y_{i}(0)|X_{i}]-E[Y_{i}(0)]))^{2}\right]\right)~{}.

To establish (10), define νn2=ν1,n2+ν0,n2+ν22\nu_{n}^{2}=\nu_{1,n}^{2}+\nu_{0,n}^{2}+\nu_{2}^{2}, where

ν1,n2\displaystyle\nu_{1,n}^{2} =E[Var[ϕ1,n,i|Xi]]\displaystyle=E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]
ν0,n2\displaystyle\nu_{0,n}^{2} =E[Var[ϕ0,n,i|Xi]]\displaystyle=E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]
\displaystyle\nu_{2}^{2} \displaystyle=\frac{1}{2}E\left[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(1)]-(E[Y_{i}(0)|X_{i}]-E[Y_{i}(0)]))^{2}\right]

Note

n(Δ^nΔ(Q))νn=Anν1,nν1,nνnBnν0,nν0,nνn+CnDnν2ν2νn.\frac{\sqrt{n}(\hat{\Delta}_{n}-\Delta(Q))}{\nu_{n}}=\frac{A_{n}}{\nu_{1,n}}\frac{\nu_{1,n}}{\nu_{n}}-\frac{B_{n}}{\nu_{0,n}}\frac{\nu_{0,n}}{\nu_{n}}+\frac{C_{n}-D_{n}}{\nu_{2}}\frac{\nu_{2}}{\nu_{n}}~{}.

Further note νn,ν1,n,ν0,n,ν2\nu_{n},\nu_{1,n},\nu_{0,n},\nu_{2} are all constants conditional on X(n)X^{(n)} and D(n)D^{(n)}. Suppose by contradiction that n(Δ^nΔ(Q))νn\frac{\sqrt{n}(\hat{\Delta}_{n}-\Delta(Q))}{\nu_{n}} does not converge in distribution to N(0,1)N(0,1). Then, there exists ϵ>0\epsilon>0 and a subsequence {nk}\{n_{k}\} such that

supt𝐑|P{nk(Δ^nkΔ(Q))/νnkt}Φ(t)|ϵ.\sup_{t\in\mathbf{R}}|P\{\sqrt{n}_{k}(\hat{\Delta}_{n_{k}}-\Delta(Q))/\nu_{n_{k}}\leq t\}-\Phi(t)|\to\epsilon~{}. (32)

Because the sequences \nu_{1,n_{k}} and \nu_{0,n_{k}} are bounded by Assumption 3.1(b), there is a further subsequence, which with some abuse of notation we still denote by \{n_{k}\}, along which \nu_{1,n_{k}}\to\nu_{1}^{\ast} and \nu_{0,n_{k}}\to\nu_{0}^{\ast} for some \nu_{1}^{\ast},\nu_{0}^{\ast}\geq 0. Then, \nu_{1,n_{k}}/\nu_{n_{k}},\nu_{0,n_{k}}/\nu_{n_{k}},\nu_{2}/\nu_{n_{k}} all converge to constants. Therefore, it follows from Lemma S.1.2 of Bai et al. (2022) that

nk(Δ^nkΔ(Q))/νnkdN(0,1),\sqrt{n}_{k}(\hat{\Delta}_{n_{k}}-\Delta(Q))/\nu_{n_{k}}\stackrel{{\scriptstyle d}}{{\to}}N(0,1)~{},

a contradiction to (32). Therefore, the desired convergence in Theorem 3.1 follows.

Step 4: Rearranging the variance formula

To conclude the proof with the variance formula as stated in the theorem, note

Var[Yi(0)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]\displaystyle\operatorname*{Var}\Big{[}Y_{i}(0)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Big{|}X_{i}\Big{]}
=Var[E[Yi(0)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi,Wi]|Xi]\displaystyle=\operatorname*{Var}\Big{[}E\Big{[}Y_{i}(0)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
+E[Var[Yi(0)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi,Wi]|Xi]\displaystyle\hskip 30.00005pt+E\Big{[}\operatorname*{Var}\Big{[}Y_{i}(0)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
=Var[E[Yi(1)+Yi(0)2|Xi,Wi]12(m1,n(Xi,Wi)+m0,n(Xi,Wi))E[Yi(1)Yi(0)2|Xi,Wi]|Xi]\displaystyle=\operatorname*{Var}\Big{[}E\Big{[}\frac{Y_{i}(1)+Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))-E\Big{[}\frac{Y_{i}(1)-Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
+E[Var[Yi(0)|Xi,Wi]|Xi]\displaystyle\hskip 30.00005pt+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]|X_{i}]
=Var[E[Yi(1)+Yi(0)2|Xi,Wi]12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]\displaystyle=\operatorname*{Var}\Big{[}E\Big{[}\frac{Y_{i}(1)+Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Big{|}X_{i}\Big{]}
+Var[E[Yi(1)Yi(0)2|Xi,Wi]|Xi]\displaystyle\hskip 30.00005pt+\operatorname*{Var}\Big{[}E\Big{[}\frac{Y_{i}(1)-Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
2Cov[E[Yi(1)+Yi(0)2|Xi,Wi]12(m1,n(Xi,Wi)+m0,n(Xi,Wi)),E[Yi(1)Yi(0)2|Xi,Wi]|Xi]\displaystyle\hskip 30.00005pt-2\mathrm{Cov}\Big{[}E\Big{[}\frac{Y_{i}(1)+Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i})),E\Big{[}\frac{Y_{i}(1)-Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
+E[Var[Yi(0)|Xi,Wi]|Xi],\displaystyle\hskip 30.00005pt+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]|X_{i}]~{}, (33)

where the first equality follows from the law of total variance, the second one follows by direct calculation, and the last one follows by expanding the variance of the sum. Similarly,

Var[Yi(1)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]\displaystyle\operatorname*{Var}\Big{[}Y_{i}(1)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Big{|}X_{i}\Big{]}
=Var[E[Yi(1)+Yi(0)2|Xi,Wi]12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]\displaystyle=\operatorname*{Var}\Big{[}E\Big{[}\frac{Y_{i}(1)+Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Big{|}X_{i}\Big{]}
+Var[E[Yi(1)Yi(0)2|Xi,Wi]|Xi]\displaystyle\hskip 30.00005pt+\operatorname*{Var}\Big{[}E\Big{[}\frac{Y_{i}(1)-Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
+2Cov[E[Yi(1)+Yi(0)2|Xi,Wi]12(m1,n(Xi,Wi)+m0,n(Xi,Wi)),E[Yi(1)Yi(0)2|Xi,Wi]|Xi]\displaystyle\hskip 30.00005pt+2\mathrm{Cov}\Big{[}E\Big{[}\frac{Y_{i}(1)+Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i})),E\Big{[}\frac{Y_{i}(1)-Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
+E[Var[Yi(1)|Xi,Wi]|Xi].\displaystyle\hskip 30.00005pt+E[\operatorname*{Var}[Y_{i}(1)|X_{i},W_{i}]|X_{i}]~{}. (34)

It follows that

σn2(Q)\displaystyle\sigma_{n}^{2}(Q) =12E[Var[E[Yi(1)+Yi(0)|Xi,Wi](m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]]\displaystyle=\frac{1}{2}E[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))|X_{i}]]
+12E[Var[E[Yi(1)Yi(0)|Xi,Wi]|Xi]]+12Var[E[Yi(1)Yi(0)|Xi]]\displaystyle\hskip 30.00005pt+\frac{1}{2}E[\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)|X_{i},W_{i}]|X_{i}]]+\frac{1}{2}\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)|X_{i}]]
\displaystyle\hskip 30.00005pt+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(1)|X_{i},W_{i}]]
=12E[Var[E[Yi(1)+Yi(0)|Xi,Wi](m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]]\displaystyle=\frac{1}{2}E[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))|X_{i}]]
+12E[(E[Yi(1)Yi(0)|Xi,Wi]E[Yi(1)Yi(0)|Xi])2]\displaystyle\hskip 30.00005pt+\frac{1}{2}E[(E[Y_{i}(1)-Y_{i}(0)|X_{i},W_{i}]-E[Y_{i}(1)-Y_{i}(0)|X_{i}])^{2}]
+12E[(E[Yi(1)Yi(0)|Xi]E[Yi(1)Yi(0)])2]\displaystyle\hskip 30.00005pt+\frac{1}{2}E[(E[Y_{i}(1)-Y_{i}(0)|X_{i}]-E[Y_{i}(1)-Y_{i}(0)])^{2}]
+E[(Yi(0)E[Yi(0)|Xi,Wi])2]+E[(Yi(1)E[Yi(1)|Xi,Wi])2]\displaystyle\hskip 30.00005pt+E[(Y_{i}(0)-E[Y_{i}(0)|X_{i},W_{i}])^{2}]+E[(Y_{i}(1)-E[Y_{i}(1)|X_{i},W_{i}])^{2}]
=12E[Var[E[Yi(1)+Yi(0)|Xi,Wi](m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]]\displaystyle=\frac{1}{2}E[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))|X_{i}]]
+12Var[E[Yi(1)Yi(0)|Xi,Wi]]+E[Var[Yi(0)|Xi,Wi]]+E[Var[Yi(1)|Xi,Wi]],\displaystyle\hskip 30.00005pt+\frac{1}{2}\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(1)|X_{i},W_{i}]]~{},

where the first equality follows by definition, the second one follows from (33)–(34), the third one again follows by definition, and the last one follows because by the law of iterated expectations,

E[(E[Yi(1)Yi(0)|Xi,Wi]E[Yi(1)Yi(0)|Xi])(E[Yi(1)Yi(0)|Xi]E[Yi(1)Yi(0)])]=0.E[(E[Y_{i}(1)-Y_{i}(0)|X_{i},W_{i}]-E[Y_{i}(1)-Y_{i}(0)|X_{i}])(E[Y_{i}(1)-Y_{i}(0)|X_{i}]-E[Y_{i}(1)-Y_{i}(0)])]=0~{}.

The conclusion then follows.  

A.2 Proof of Theorem 3.2

Theorem 3.1 implies Δ^nPΔ(Q)\hat{\Delta}_{n}\stackrel{{\scriptstyle P}}{{\to}}\Delta(Q). Next, we show

\hat{\tau}_{n}^{2}-\left(E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]+E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]+E[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(0)|X_{i}])^{2}]\right)\stackrel{{\scriptstyle P}}{{\to}}0~{}. (35)

To that end, define

Y̊i=Yi12(m1,n(Xi,Wi)+m0,n(Xi,Wi)).\mathring{Y}_{i}=Y_{i}-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))~{}.

Note

τ^n2\displaystyle\hat{\tau}_{n}^{2} =1n1jn(Y̊π(2j1)Y̊π(2j)+(Y~π(2j1)Y~π(2j)(Y̊π(2j1)Y̊π(2j))))2\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}\left(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}+(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)}-(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}))\right)^{2}
=1n1jn(Y̊π(2j1)Y̊π(2j))2+1n1jn(Y~π(2j1)Y~π(2j)(Y̊π(2j1)Y̊π(2j)))2\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)})^{2}+\frac{1}{n}\sum_{1\leq j\leq n}(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)}-(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}))^{2}
+2n1jn(Y~π(2j1)Y~π(2j)(Y̊π(2j1)Y̊π(2j)))(Y̊π(2j1)Y̊π(2j)).\displaystyle\hskip 30.00005pt+\frac{2}{n}\sum_{1\leq j\leq n}(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)}-(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}))(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)})~{}.

Therefore, to establish (35), we first show

\frac{1}{n}\sum_{1\leq j\leq n}(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)})^{2}-\left(E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]+E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]+E[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(0)|X_{i}])^{2}]\right)\stackrel{{\scriptstyle P}}{{\to}}0 (36)

and

1n1jn(Y~π(2j1)Y~π(2j)(Y̊π(2j1)Y̊π(2j)))2P0.\frac{1}{n}\sum_{1\leq j\leq n}(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)}-(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}))^{2}\stackrel{{\scriptstyle P}}{{\to}}0~{}. (37)

(37) immediately follows from repeated applications of the inequality (ab)22(a2+b2)(a-b)^{2}\leq 2(a^{2}+b^{2}) and (12). To verify (36), note

1n1jn(Y̊π(2j1)Y̊π(2j))2=1n1i2nY̊i22n1jnY̊π(2j1)Y̊π(2j).\frac{1}{n}\sum_{1\leq j\leq n}(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)})^{2}=\frac{1}{n}\sum_{1\leq i\leq 2n}\mathring{Y}_{i}^{2}-\frac{2}{n}\sum_{1\leq j\leq n}\mathring{Y}_{\pi(2j-1)}\mathring{Y}_{\pi(2j)}~{}.

It follows from similar arguments to those in the proof of Lemma B.2 below that

\frac{1}{n}\sum_{1\leq i\leq 2n}\mathring{Y}_{i}^{2}-\left(E[\phi_{1,n,i}^{2}]+E[\phi_{0,n,i}^{2}]\right)\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Similarly, it follows from the proof of the same lemma that

2n1jnY̊π(2j1)Y̊π(2j)2E[E[ϕ1,n,i|Xi]E[ϕ0,n,i|Xi]]P0.\frac{2}{n}\sum_{1\leq j\leq n}\mathring{Y}_{\pi(2j-1)}\mathring{Y}_{\pi(2j)}-2E[E[\phi_{1,n,i}|X_{i}]E[\phi_{0,n,i}|X_{i}]]\stackrel{{\scriptstyle P}}{{\to}}0~{}.

To establish (36), note

E[ϕ1,n,i2]+E[ϕ0,n,i2]2E[E[ϕ1,n,i|Xi]E[ϕ0,n,i|Xi]]\displaystyle E[\phi_{1,n,i}^{2}]+E[\phi_{0,n,i}^{2}]-2E[E[\phi_{1,n,i}|X_{i}]E[\phi_{0,n,i}|X_{i}]]
=E[Var[ϕ1,n,i|Xi]]+E[Var[ϕ0,n,i|Xi]]+E[E[ϕ1,n,i|Xi]2]+E[E[ϕ0,n,i|Xi]2]2E[E[ϕ1,n,i|Xi]E[ϕ0,n,i|Xi]]\displaystyle=E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]+E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]+E[E[\phi_{1,n,i}|X_{i}]^{2}]+E[E[\phi_{0,n,i}|X_{i}]^{2}]-2E[E[\phi_{1,n,i}|X_{i}]E[\phi_{0,n,i}|X_{i}]]
=E[Var[ϕ1,n,i|Xi]]+E[Var[ϕ0,n,i|Xi]]+E[(E[ϕ1,n,i|Xi]E[ϕ0,n,i|Xi])2]\displaystyle=E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]+E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]+E[(E[\phi_{1,n,i}|X_{i}]-E[\phi_{0,n,i}|X_{i}])^{2}]
=E[Var[ϕ1,n,i|Xi]]+E[Var[ϕ0,n,i|Xi]]+E[(E[Yi(1)|Xi]E[Yi(0)|Xi])2],\displaystyle=E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]+E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]+E[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(0)|X_{i}])^{2}]~{},

where the last equality follows from the definition of ϕ1,n,i\phi_{1,n,i} and ϕ0,n,i\phi_{0,n,i}. It then follows from the Cauchy-Schwarz inequality that

\left|\frac{1}{n}\sum_{1\leq j\leq n}(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)}-(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}))(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)})\right|\\ \leq\left(\frac{1}{n}\sum_{1\leq j\leq n}(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)})^{2}\right)^{1/2}\left(\frac{1}{n}\sum_{1\leq j\leq n}(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)}-(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}))^{2}\right)^{1/2}\stackrel{{\scriptstyle P}}{{\to}}0~{},

which, together with (36)–(37) as well as Assumptions 2.1(b) and 3.1(b), imply (35).

Next, we show

λ^nPE[(E[Yi(1)|Xi]E[Yi(0)|Xi])2].\hat{\lambda}_{n}\stackrel{{\scriptstyle P}}{{\to}}E[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(0)|X_{i}])^{2}]~{}. (38)

Note

λ^n2n1jn2(Y̊π(4j3)Y̊π(4j2))(Y̊π(4j1)Y̊π(4j))(Dπ(4j3)Dπ(4j2))(Dπ(4j1)Dπ(4j))\displaystyle\hat{\lambda}_{n}-\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\mathring{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-2)})(\mathring{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j)})(D_{\pi(4j-3)}-D_{\pi(4j-2)})(D_{\pi(4j-1)}-D_{\pi(4j)}) (39)
=2n1jn2(Y~π(4j3)Y̊π(4j3)(Y~π(4j2)Y̊π(4j2)))(Y̊π(4j1)Y̊π(4j))\displaystyle=\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\tilde{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-3)}-(\tilde{Y}_{\pi(4j-2)}-\mathring{Y}_{\pi(4j-2)}))(\mathring{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j)})
×(Dπ(4j3)Dπ(4j2))(Dπ(4j1)Dπ(4j))\displaystyle\hskip 50.00008pt\times(D_{\pi(4j-3)}-D_{\pi(4j-2)})(D_{\pi(4j-1)}-D_{\pi(4j)})
\displaystyle\hskip 30.00005pt+\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\mathring{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-2)})((\tilde{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j-1)})-(\tilde{Y}_{\pi(4j)}-\mathring{Y}_{\pi(4j)}))
×(Dπ(4j3)Dπ(4j2))(Dπ(4j1)Dπ(4j))\displaystyle\hskip 50.00008pt\times(D_{\pi(4j-3)}-D_{\pi(4j-2)})(D_{\pi(4j-1)}-D_{\pi(4j)})
+2n1jn2(Y~π(4j3)Y̊π(4j3)(Y~π(4j2)Y̊π(4j2)))\displaystyle\hskip 30.00005pt+\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\tilde{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-3)}-(\tilde{Y}_{\pi(4j-2)}-\mathring{Y}_{\pi(4j-2)}))
\displaystyle\hskip 50.00008pt\times((\tilde{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j-1)})-(\tilde{Y}_{\pi(4j)}-\mathring{Y}_{\pi(4j)}))(D_{\pi(4j-3)}-D_{\pi(4j-2)})(D_{\pi(4j-1)}-D_{\pi(4j)})~{}.

In what follows, we show

2n1jn2(Y̊π(4j3)Y̊π(4j2))2=OP(1)\displaystyle\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\mathring{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-2)})^{2}=O_{P}(1) (40)
2n1jn2(Y̊π(4j1)Y̊π(4j))2=OP(1)\displaystyle\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\mathring{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j)})^{2}=O_{P}(1) (41)
2n1jn2(Y~π(4j3)Y̊π(4j3)(Y~π(4j2)Y̊π(4j2)))2=oP(1)\displaystyle\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\tilde{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-3)}-(\tilde{Y}_{\pi(4j-2)}-\mathring{Y}_{\pi(4j-2)}))^{2}=o_{P}(1) (42)
\displaystyle\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}((\tilde{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j-1)})-(\tilde{Y}_{\pi(4j)}-\mathring{Y}_{\pi(4j)}))^{2}=o_{P}(1) (43)
2n1jn2(Y̊π(4j3)Y̊π(4j2))(Y̊π(4j1)Y̊π(4j))(Dπ(4j3)Dπ(4j2))(Dπ(4j1)Dπ(4j))\displaystyle\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\mathring{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-2)})(\mathring{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j)})(D_{\pi(4j-3)}-D_{\pi(4j-2)})(D_{\pi(4j-1)}-D_{\pi(4j)})
PE[(E[Yi(1)|Xi]E[Yi(0)|Xi])2].\displaystyle\hskip 50.00008pt\stackrel{{\scriptstyle P}}{{\to}}E[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(0)|X_{i}])^{2}]~{}. (44)

To establish (40)–(41), note they follow directly from (36) and Assumptions 2.1(b) and 3.1(b). Next, note (42) follows from repeated applications of the inequality (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}) and (12). (43) can be established by similar arguments. (44) follows from similar arguments to those in the proof of Lemma S.1.7 of Bai et al. (2022), with the uniform integrability arguments replaced by arguments similar to those in the proof of Lemma B.2, together with Assumptions 2.12.4 and 3.1. (39)–(44) imply (38) immediately.

Finally, note we have shown

σ^n2σn2P0.\hat{\sigma}_{n}^{2}-\sigma_{n}^{2}\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Assumption 3.1(a) implies σn2\sigma_{n}^{2} is bounded away from zero, so

σ^nσnP1.\frac{\hat{\sigma}_{n}}{\sigma_{n}}\stackrel{{\scriptstyle P}}{{\to}}1~{}.

The conclusion of the theorem then follows.  

A.3 Proof of Theorem 4.1

We will apply the Frisch-Waugh-Lovell theorem to obtain an expression for β^nnaive\hat{\beta}_{n}^{\rm naive}. Consider the linear regression of ψi\psi_{i} on 11 and DiD_{i}. Define

μ^ψ,n(d)=1n1i2nψiI{Di=d}\hat{\mu}_{\psi,n}(d)=\frac{1}{n}\sum_{1\leq i\leq 2n}\psi_{i}I\{D_{i}=d\}

for d{0,1}d\in\{0,1\} and

Δ^ψ,n=μ^ψ,n(1)μ^ψ,n(0).\hat{\Delta}_{\psi,n}=\hat{\mu}_{\psi,n}(1)-\hat{\mu}_{\psi,n}(0)~{}.

The iith residual based on the OLS estimation of this linear regression model is given by

ψ~i=ψiμ^ψ,n(0)Δ^ψ,nDi.\tilde{\psi}_{i}=\psi_{i}-\hat{\mu}_{\psi,n}(0)-\hat{\Delta}_{\psi,n}D_{i}~{}.

β^nnaive\hat{\beta}_{n}^{\rm naive} is then given by the OLS estimator of the coefficient in the linear regression of YiY_{i} on ψ~i\tilde{\psi}_{i}. Note

12n1i2nψ~iψ~i\displaystyle\frac{1}{2n}\sum_{1\leq i\leq 2n}\tilde{\psi}_{i}\tilde{\psi}_{i}^{\prime} =12n1i2n(ψiμ^ψ,n(1))(ψiμ^ψ,n(1))Di+12n1i2n(ψiμ^ψ,n(0))(ψiμ^ψ,n(0))(1Di)\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(\psi_{i}-\hat{\mu}_{\psi,n}(1))(\psi_{i}-\hat{\mu}_{\psi,n}(1))^{\prime}D_{i}+\frac{1}{2n}\sum_{1\leq i\leq 2n}(\psi_{i}-\hat{\mu}_{\psi,n}(0))(\psi_{i}-\hat{\mu}_{\psi,n}(0))^{\prime}(1-D_{i})
=12n1i2nψiψi12μ^ψ,n(1)μ^ψ,n(1)12μ^ψ,n(0)μ^ψ,n(0).\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}\psi_{i}\psi_{i}^{\prime}-\frac{1}{2}\hat{\mu}_{\psi,n}(1)\hat{\mu}_{\psi,n}(1)^{\prime}-\frac{1}{2}\hat{\mu}_{\psi,n}(0)\hat{\mu}_{\psi,n}(0)^{\prime}~{}.

It follows from Assumption 4.1(b) and the weak law of large numbers that

12n1i2nψiψiPE[ψiψi].\frac{1}{2n}\sum_{1\leq i\leq 2n}\psi_{i}\psi_{i}^{\prime}\stackrel{{\scriptstyle P}}{{\to}}E[\psi_{i}\psi_{i}^{\prime}]~{}.

On the other hand, it follows from Assumptions 2.2–2.3 and 4.1(b)–(c) as well as similar arguments to those in the proof of Lemma S.1.5 of Bai et al. (2022) that

μ^ψ,n(d)PE[ψi]\hat{\mu}_{\psi,n}(d)\stackrel{{\scriptstyle P}}{{\to}}E[\psi_{i}]

for d{0,1}d\in\{0,1\}. Therefore,

12n1i2nψ~iψ~iPVar[ψi].\frac{1}{2n}\sum_{1\leq i\leq 2n}\tilde{\psi}_{i}\tilde{\psi}_{i}^{\prime}\stackrel{{\scriptstyle P}}{{\to}}\operatorname*{Var}[\psi_{i}]~{}.

Next,

12n1i2nψ~iYi\displaystyle\frac{1}{2n}\sum_{1\leq i\leq 2n}\tilde{\psi}_{i}Y_{i} =12n1i2n(ψiμ^ψ,n(1))Yi(1)Di+12n1i2n(ψiμ^ψ,n(0))Yi(0)(1Di)\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(\psi_{i}-\hat{\mu}_{\psi,n}(1))Y_{i}(1)D_{i}+\frac{1}{2n}\sum_{1\leq i\leq 2n}(\psi_{i}-\hat{\mu}_{\psi,n}(0))Y_{i}(0)(1-D_{i})

It follows from arguments similar to those above, together with Assumptions 2.1(b), 2.2–2.3, and 4.1(b)–(c), that

12n1i2nψ~iYiPCov[ψi,Yi(1)+Yi(0)].\frac{1}{2n}\sum_{1\leq i\leq 2n}\tilde{\psi}_{i}Y_{i}\stackrel{{\scriptstyle P}}{{\to}}\operatorname*{Cov}[\psi_{i},Y_{i}(1)+Y_{i}(0)]~{}.

The convergence of β^nnaive\hat{\beta}_{n}^{\rm naive} therefore follows from the continuous mapping theorem and Assumption 4.1(a).
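
As a purely illustrative numerical check of the Frisch-Waugh-Lovell argument above (not part of the formal proof), the following Python/NumPy sketch, with hypothetical variable names and a simulated data-generating process, computes \hat{\beta}_{n}^{\rm naive} both by residualizing \psi_{i} on 1 and D_{i} and by the full regression of Y_{i} on 1, D_{i}, and \psi_{i}, and verifies that the two coincide.

import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 3                            # n pairs (2n units); k = dim(psi_i); hypothetical sizes
psi = rng.normal(size=(2 * n, k))        # observed covariates psi_i
D = np.zeros(2 * n, dtype=int)
D[::2] = rng.integers(0, 2, size=n)      # within each pair, one unit is treated at random
D[1::2] = 1 - D[::2]
Y = D * (1.0 + psi @ np.ones(k)) + (1 - D) * (psi @ np.full(k, 0.5)) + rng.normal(size=2 * n)

# Frisch-Waugh-Lovell route: residualize psi on (1, D), then regress Y on the residuals.
mu1, mu0 = psi[D == 1].mean(axis=0), psi[D == 0].mean(axis=0)
psi_tilde = psi - mu0 - np.outer(D, mu1 - mu0)           # psi_i - mu_hat(0) - Delta_hat * D_i
beta_fwl = np.linalg.lstsq(psi_tilde, Y, rcond=None)[0]

# Direct route: coefficients on psi in the OLS regression of Y on (1, D, psi).
X_full = np.column_stack([np.ones(2 * n), D, psi])
beta_full = np.linalg.lstsq(X_full, Y, rcond=None)[0][2:]
assert np.allclose(beta_fwl, beta_full)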

To see (12) is satisfied, note

12n1i2n(m^d,n(Xi,Wi)md,n(Xi,Wi))2=(β^nnaiveβnaive)(12n1i2nψiψi)(β^nnaiveβnaive).\frac{1}{2n}\sum_{1\leq i\leq 2n}(\hat{m}_{d,n}(X_{i},W_{i})-m_{d,n}(X_{i},W_{i}))^{2}=(\hat{\beta}_{n}^{\rm naive}-\beta^{\rm naive})^{\prime}\left(\frac{1}{2n}\sum_{1\leq i\leq 2n}\psi_{i}\psi_{i}^{\prime}\right)(\hat{\beta}_{n}^{\rm naive}-\beta^{\rm naive})~{}.

(12) then follows from the fact that \hat{\beta}_{n}^{\rm naive}\stackrel{P}{\to}\beta^{\rm naive}, Assumption 4.1(b), and the weak law of large numbers. To establish (9), first note

12n1i2n(2Di1)(m^d,n(Xi,Wi)md,n(Xi,Wi))=12nΔ^ψ,n(β^nnaiveβnaive).\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)(\hat{m}_{d,n}(X_{i},W_{i})-m_{d,n}(X_{i},W_{i}))=\frac{1}{\sqrt{2}}\sqrt{n}\hat{\Delta}_{\psi,n}^{\prime}(\hat{\beta}_{n}^{\rm naive}-\beta^{\rm naive})~{}.

In what follows, we establish

nΔ^ψ,n=OP(1),\sqrt{n}\hat{\Delta}_{\psi,n}=O_{P}(1)~{}, (45)

from which (9) follows immediately because β^nnaiveβnaive=oP(1)\hat{\beta}_{n}^{\rm naive}-\beta^{\rm naive}=o_{P}(1). Note by Assumption 2.2 that E[nΔ^ψ,n|X(n)]=0E[\sqrt{n}\hat{\Delta}_{\psi,n}|X^{(n)}]=0. Also note

nΔ^ψ,n=FnGn+Hn,\sqrt{n}\hat{\Delta}_{\psi,n}=F_{n}-G_{n}+H_{n}~{},

where

Fn\displaystyle F_{n} =1n1i2n(ψiE[ψi|Xi])Di,\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}(\psi_{i}-E[\psi_{i}|X_{i}])D_{i}~{},
Gn\displaystyle G_{n} =1n1i2n(ψiE[ψi|Xi])(1Di),and\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}(\psi_{i}-E[\psi_{i}|X_{i}])(1-D_{i})~{},\quad\text{and}
Hn\displaystyle H_{n} =1n1jn(E[ψπ(2j1)|Xπ(2j1)]E[ψπ(2j)|Xπ(2j)])(Dπ(2j1)Dπ(2j)).\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq j\leq n}(E[\psi_{\pi(2j-1)}|X_{\pi(2j-1)}]-E[\psi_{\pi(2j)}|X_{\pi(2j)}])(D_{\pi(2j-1)}-D_{\pi(2j)})~{}.
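
For completeness, the decomposition can be verified directly: writing \hat{\Delta}_{\psi,n}=\frac{1}{n}\sum_{1\leq i\leq 2n}\psi_{i}(2D_{i}-1) and inserting \psi_{i}=(\psi_{i}-E[\psi_{i}|X_{i}])+E[\psi_{i}|X_{i}] gives

\sqrt{n}\hat{\Delta}_{\psi,n}=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}(\psi_{i}-E[\psi_{i}|X_{i}])(2D_{i}-1)+\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}E[\psi_{i}|X_{i}](2D_{i}-1)~{},

where the first sum equals F_{n}-G_{n} because 2D_{i}-1=D_{i}-(1-D_{i}), and the second sum equals H_{n} because, within pair j, 2D_{\pi(2j-1)}-1=-(2D_{\pi(2j)}-1)=D_{\pi(2j-1)}-D_{\pi(2j)}.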

We will argue that F_{n}, G_{n}, and H_{n} are all O_{P}(1). Since this can be carried out separately for each entry of F_{n}, G_{n}, and H_{n}, we assume without loss of generality that k_{\psi}=1. First, it follows from Assumptions 2.2–2.3 and 4.1(c) as well as similar arguments to those in the proof of Lemma S.1.4 of Bai et al. (2022) that

Var[Fn|X(n),D(n)]=1n1i2nVar[ψi|Xi]DiPE[Var[ψi|Xi]]>0.\mathrm{Var}[F_{n}|X^{(n)},D^{(n)}]=\frac{1}{n}\sum_{1\leq i\leq 2n}\mathrm{Var}[\psi_{i}|X_{i}]D_{i}\stackrel{{\scriptstyle P}}{{\to}}E[\mathrm{Var}[\psi_{i}|X_{i}]]>0~{}.

It then follows from arguments based on the Lindeberg central limit theorem, similar to those in the proof of Lemma S.1.4 of Bai et al. (2022), that F_{n}=O_{P}(1). Similar arguments establish G_{n}=O_{P}(1). Finally, we show H_{n}=O_{P}(1). Note that E[H_{n}|X^{(n)}]=0 and by Assumptions 2.2–2.3 and 4.1(c),

Var[Hn|X(n)]=1n1jn(E[ψπ(2j1)|Xπ(2j1)]E[ψπ(2j)|Xπ(2j)])2P0.\mathrm{Var}[H_{n}|X^{(n)}]=\frac{1}{n}\sum_{1\leq j\leq n}(E[\psi_{\pi(2j-1)}|X_{\pi(2j-1)}]-E[\psi_{\pi(2j)}|X_{\pi(2j)}])^{2}\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Therefore, for any fixed ϵ>0\epsilon>0, Markov’s inequality implies

P{|HnE[Hn|X(n)]|>ϵ|X(n)}Var[Hn|X(n)]ϵ2P0.P\{|H_{n}-E[H_{n}|X^{(n)}]|>\epsilon|X^{(n)}\}\leq\frac{\mathrm{Var}[H_{n}|X^{(n)}]}{\epsilon^{2}}\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Since probabilities are bounded and therefore uniformly integrable, we have that

P{|HnE[Hn|X(n)]|>ϵ}0.P\{|H_{n}-E[H_{n}|X^{(n)}]|>\epsilon\}\to 0~{}.

Therefore, (45) follows. Finally, it is straightforward to see Assumption 3.1 is implied by Assumption 4.1.  

A.4 Proof of Theorem 4.2

By the Frisch-Waugh-Lovell theorem, β^npfe\hat{\beta}_{n}^{\rm pfe} is equal to the OLS estimator in the linear regression of {(Yπ(2j1)Yπ(2j),Yπ(2j)Yπ(2j1)):1jn}\{(Y_{\pi(2j-1)}-Y_{\pi(2j)},Y_{\pi(2j)}-Y_{\pi(2j-1)}):1\leq j\leq n\} on {(2Dπ(2j1)1,2Dπ(2j)1):1jn}\{(2D_{\pi(2j-1)}-1,2D_{\pi(2j)}-1):1\leq j\leq n\} and {(ψπ(2j1)ψπ(2j),ψπ(2j)ψπ(2j1)):1jn}\{(\psi_{\pi(2j-1)}-\psi_{\pi(2j)},\psi_{\pi(2j)}-\psi_{\pi(2j-1)}):1\leq j\leq n\}. To apply the Frisch-Waugh-Lovell theorem again, we study the linear regression of {(ψπ(2j1)ψπ(2j),ψπ(2j)ψπ(2j1)):1jn}\{(\psi_{\pi(2j-1)}-\psi_{\pi(2j)},\psi_{\pi(2j)}-\psi_{\pi(2j-1)}):1\leq j\leq n\} on {(2Dπ(2j1)1,2Dπ(2j)1):1jn}\{(2D_{\pi(2j-1)}-1,2D_{\pi(2j)}-1):1\leq j\leq n\}. The OLS estimator of the regression coefficient in such a regression equals

Δ^ψ,n=1n1jn(Dπ(2j1)Dπ(2j))(ψπ(2j1)ψπ(2j)).\displaystyle\hat{\Delta}_{\psi,n}=\frac{1}{n}\sum_{1\leq j\leq n}(D_{\pi(2j-1)}-D_{\pi(2j)})(\psi_{\pi(2j-1)}-\psi_{\pi(2j)})~{}.

The residual is therefore {(ψπ(2j1)ψπ(2j)(2Dπ(2j1)1)Δ^ψ,n,ψπ(2j)ψπ(2j1)(2Dπ(2j)1)Δ^ψ,n):1jn}\{(\psi_{\pi(2j-1)}-\psi_{\pi(2j)}-(2D_{\pi(2j-1)}-1)\hat{\Delta}_{\psi,n},\psi_{\pi(2j)}-\psi_{\pi(2j-1)}-(2D_{\pi(2j)}-1)\hat{\Delta}_{\psi,n}):1\leq j\leq n\}. β^npfe\hat{\beta}_{n}^{\rm pfe} equals the OLS estimator of the coefficient in the linear regression of {(Yπ(2j1)Yπ(2j),Yπ(2j)Yπ(2j1)):1jn}\{(Y_{\pi(2j-1)}-Y_{\pi(2j)},Y_{\pi(2j)}-Y_{\pi(2j-1)}):1\leq j\leq n\} on those residuals. Define

δY,j\displaystyle\delta_{Y,j} =(Dπ(2j1)Dπ(2j))(Yπ(2j1)Yπ(2j))and\displaystyle=(D_{\pi(2j-1)}-D_{\pi(2j)})(Y_{\pi(2j-1)}-Y_{\pi(2j)})\quad\text{and}
δψ,j\displaystyle\delta_{\psi,j} =(Dπ(2j1)Dπ(2j))(ψπ(2j1)ψπ(2j))\displaystyle=(D_{\pi(2j-1)}-D_{\pi(2j)})(\psi_{\pi(2j-1)}-\psi_{\pi(2j)})

By construction, \hat{\Delta}_{\psi,n}=\frac{1}{n}\sum_{1\leq j\leq n}\delta_{\psi,j}. A moment's thought reveals that \hat{\beta}_{n}^{\rm pfe} further equals the least squares coefficient in the linear regression of \delta_{Y,j} on \delta_{\psi,j}-\hat{\Delta}_{\psi,n} for 1\leq j\leq n; a numerical illustration of this reduction is given after (46) below. It follows from Assumptions 2.1(b)–(c), 2.2–2.3, and 4.1(b)–(c) as well as similar arguments to those in the proof of Lemma S.1.5 of Bai et al. (2022) that

Δ^ψ,n\displaystyle\hat{\Delta}_{\psi,n} P0and\displaystyle\stackrel{{\scriptstyle P}}{{\to}}0\quad\text{and} (46)
1n1jnδY,j\displaystyle\frac{1}{n}\sum_{1\leq j\leq n}\delta_{Y,j} PΔ(Q).\displaystyle\stackrel{{\scriptstyle P}}{{\to}}\Delta(Q)~{}.
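
As a purely illustrative check of the reduction described above (with hypothetical variable names and simulated data), the following Python/NumPy sketch computes the coefficients on \psi_{i} from one implementation of the pair-fixed-effects regression of Y_{i} on D_{i}, \psi_{i}, and pair indicators, and verifies that they coincide with the least squares coefficients in the regression of \delta_{Y,j} on \delta_{\psi,j}-\hat{\Delta}_{\psi,n}.

import numpy as np

rng = np.random.default_rng(1)
n, k = 400, 2                                          # n pairs; hypothetical sizes
psi = rng.normal(size=(2 * n, k))
D = np.zeros(2 * n)
D[::2] = rng.integers(0, 2, size=n)
D[1::2] = 1 - D[::2]
Y = 0.7 * D + psi @ np.array([1.0, -0.5]) + rng.normal(size=2 * n)

# (a) coefficients on psi in the OLS regression of Y on D, psi, and a dummy for each pair
pair_id = np.repeat(np.arange(n), 2)
pair_dummies = np.eye(n)[pair_id]                      # (2n, n) matrix of pair fixed effects
X = np.column_stack([D, psi, pair_dummies])
beta_pfe = np.linalg.lstsq(X, Y, rcond=None)[0][1:1 + k]

# (b) within-pair-difference regression of delta_Y on delta_psi - Delta_hat
sign = D[::2] - D[1::2]                                # D_{pi(2j-1)} - D_{pi(2j)}, in {-1, 1}
delta_Y = sign * (Y[::2] - Y[1::2])
delta_psi = sign[:, None] * (psi[::2] - psi[1::2])
x = delta_psi - delta_psi.mean(axis=0)                 # subtract Delta_hat_{psi, n}
beta_pairdiff = np.linalg.lstsq(x, delta_Y, rcond=None)[0]
assert np.allclose(beta_pfe, beta_pairdiff)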

Next, note that

1n1jnδψ,jδψ,j\displaystyle\frac{1}{n}\sum_{1\leq j\leq n}\delta_{\psi,j}\delta_{\psi,j}^{\prime}
=1n1jn(ψπ(2j1)ψπ(2j))(ψπ(2j1)ψπ(2j))\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{\pi(2j-1)}-\psi_{\pi(2j)})(\psi_{\pi(2j-1)}-\psi_{\pi(2j)})^{\prime}
=1n1i2nψiψi1n1jn(ψπ(2j1)ψπ(2j)+ψπ(2j)ψπ(2j1)).\displaystyle=\frac{1}{n}\sum_{1\leq i\leq 2n}\psi_{i}\psi_{i}^{\prime}-\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{\pi(2j-1)}\psi_{\pi(2j)}^{\prime}+\psi_{\pi(2j)}\psi_{\pi(2j-1)}^{\prime})~{}. (47)

For convenience, we introduce the following notation:

μd(Xi)\displaystyle\mu_{d}(X_{i}) =E[Yi(d)|Xi]\displaystyle=E[Y_{i}(d)|X_{i}]
Ψ(Xi)\displaystyle\Psi(X_{i}) =E[ψi|Xi]\displaystyle=E[\psi_{i}|X_{i}]
ξd(Xi)\displaystyle\xi_{d}(X_{i}) =E[ψiYi(d)|Xi].\displaystyle=E[\psi_{i}Y_{i}(d)|X_{i}]~{}.

The first term in (47) converges in probability to 2E[ψiψi]2E[\psi_{i}\psi_{i}^{\prime}] by the weak law of large numbers. For the second term, we have that

E[1n1jn(ψπ(2j1)ψπ(2j)+ψπ(2j)ψπ(2j1))|X(n)]\displaystyle E\Big{[}\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{\pi(2j-1)}\psi_{\pi(2j)}^{\prime}+\psi_{\pi(2j)}\psi_{\pi(2j-1)}^{\prime})\Big{|}X^{(n)}\Big{]}
=1n1i2nΨ(Xi)Ψ(Xi)1n1jn(Ψ(Xπ(2j1))Ψ(Xπ(2j)))(Ψ(Xπ(2j1))Ψ(Xπ(2j)))\displaystyle=\frac{1}{n}\sum_{1\leq i\leq 2n}\Psi(X_{i})\Psi(X_{i})^{\prime}-\frac{1}{n}\sum_{1\leq j\leq n}(\Psi(X_{\pi(2j-1)})-\Psi(X_{\pi(2j)}))(\Psi(X_{\pi(2j-1)})-\Psi(X_{\pi(2j)}))^{\prime}
P2E[Ψ(Xi)Ψ(Xi)],\displaystyle\stackrel{{\scriptstyle P}}{{\to}}2E[\Psi(X_{i})\Psi(X_{i})^{\prime}]~{},

where the convergence in probability holds because of Assumptions 2.2–2.3 and 4.1(c). It follows from Assumptions 2.2–2.3 and 4.1(b)–(c) as well as similar arguments to those in the proof of Lemma S.1.6 of Bai et al. (2022) that

|1n1jn(ψπ(2j1)ψπ(2j)+ψπ(2j)ψπ(2j1))E[1n1jn(ψπ(2j1)ψπ(2j)+ψπ(2j)ψπ(2j1))|X(n)]|P0.\Big{|}\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{\pi(2j-1)}\psi_{\pi(2j)}^{\prime}+\psi_{\pi(2j)}\psi_{\pi(2j-1)}^{\prime})-E\Big{[}\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{\pi(2j-1)}\psi_{\pi(2j)}^{\prime}+\psi_{\pi(2j)}\psi_{\pi(2j-1)}^{\prime})\Big{|}X^{(n)}\Big{]}\Big{|}\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Therefore,

1n1jnδψ,jδψ,jP2E[Var[ψi|Xi]].\frac{1}{n}\sum_{1\leq j\leq n}\delta_{\psi,j}\delta_{\psi,j}^{\prime}\stackrel{{\scriptstyle P}}{{\to}}2E[\operatorname*{Var}[\psi_{i}|X_{i}]]~{}.

We now turn to

1n1jnδψ,jδY,j=1n1jn(ψπ(2j1)ψπ(2j))(Yπ(2j1)Yπ(2j)).\frac{1}{n}\sum_{1\leq j\leq n}\delta_{\psi,j}\delta_{Y,j}=\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{\pi(2j-1)}-\psi_{\pi(2j)})(Y_{\pi(2j-1)}-Y_{\pi(2j)})~{}.

Note that

E[ψπ(2j1)Yπ(2j1)|X(n)]\displaystyle E[\psi_{\pi(2j-1)}Y_{\pi(2j-1)}|X^{(n)}] =12ξ1(Xπ(2j1))+12ξ0(Xπ(2j1))\displaystyle=\frac{1}{2}\xi_{1}(X_{\pi(2j-1)})+\frac{1}{2}\xi_{0}(X_{\pi(2j-1)})
E[ψπ(2j1)Yπ(2j)|X(n)]\displaystyle E[\psi_{\pi(2j-1)}Y_{\pi(2j)}|X^{(n)}] =12Ψ(Xπ(2j1))(μ1(Xπ(2j))+μ0(Xπ(2j))).\displaystyle=\frac{1}{2}\Psi(X_{\pi(2j-1)})(\mu_{1}(X_{\pi(2j)})+\mu_{0}(X_{\pi(2j)}))~{}.

It follows from Assumptions 2.1(b)–(c), 2.2–2.3, and 4.1(b)–(c) as well as similar arguments to those in the proof of Lemma S.1.6 of Bai et al. (2022) that

1n1jnδψ,jδY,jPE[ψi(Yi(1)+Yi(0))]E[Ψ(Xi)(μ1(Xi)+μ0(Xi))].\frac{1}{n}\sum_{1\leq j\leq n}\delta_{\psi,j}\delta_{Y,j}\stackrel{{\scriptstyle P}}{{\to}}E[\psi_{i}(Y_{i}(1)+Y_{i}(0))]-E[\Psi(X_{i})(\mu_{1}(X_{i})+\mu_{0}(X_{i}))]~{}.

The convergence in probability of β^npfe\hat{\beta}_{n}^{\rm pfe} now follows from Assumption 4.1(a) and the continuous mapping theorem. (9)–(12) can be established using similar arguments to those in the proof of Theorem 4.1. Finally, it is straightforward to see Assumption 3.1 is implied by Assumption 4.1.  

A.5 Proof of Theorem 5.1

We divide the proof into three steps. In the first step, we show

|α^d,nrαd,nr|+β^d,nrβd,nr1=OP(snλnr).|\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}|+\|\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}\|_{1}=O_{P}\left(s_{n}\lambda_{n}^{\rm r}\right)~{}. (48)

In the second step, we show (9), (12), and Assumption 3.1 hold. In the third step, we show the asymptotic variance achieves the minimum under the approximately correct specification condition in Theorem 5.1.

Step 1: Proof of (48)

Note that

1n1i2nI{Di=d}(Yi(d)α^d,nrψn,iβ^d,nr)2+λd,nrΩ^n(d)β^d,nr1\displaystyle\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(Y_{i}(d)-\hat{\alpha}_{d,n}^{\rm r}-\psi_{n,i}^{\prime}\hat{\beta}_{d,n}^{\rm r})^{2}+\lambda_{d,n}^{\rm r}\|\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r}\|_{1}
1n1i2nI{Di=d}(Yi(d)αd,nrψn,iβd,nr)2+λd,nrΩ^n(d)βd,nr1.\displaystyle\leq\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(Y_{i}(d)-\alpha_{d,n}^{\rm r}-\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r})^{2}+\lambda_{d,n}^{\rm r}\|\hat{\Omega}_{n}(d)\beta_{d,n}^{\rm r}\|_{1}~{}.

Rearranging the terms, we then have

1n1i2nI{Di=d}(α^d,nrαd,nr+ψn,i(β^d,nrβd,nr))2+λd,nrΩ^n(d)β^d,nr1\displaystyle\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}))^{2}+\lambda_{d,n}^{\rm r}\|\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r}\|_{1}
(2n1i2nI{Di=d}ϵn,i(d)ψn,i)(β^d,nrβd,nr)+(2n1i2nI{Di=d}ϵn,i(d))(α^d,nrαd,nr)\displaystyle\leq\left(\frac{2}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\epsilon_{n,i}(d)\psi_{n,i}^{\prime}\right)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})+\left(\frac{2}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\epsilon_{n,i}(d)\right)(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r})
+λd,nrΩ^n(d)βd,nr1\displaystyle+\lambda_{d,n}^{\rm r}\|\hat{\Omega}_{n}(d)\beta_{d,n}^{\rm r}\|_{1} (49)

Next, define

𝕌n(d)=Ωn1(d)1n1i2nI{Di=d}(ψn,iϵn,i(d)E[ψn,iϵn,i(d)])\mathbb{U}_{n}(d)=\Omega_{n}^{-1}(d)\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\psi_{n,i}\epsilon_{n,i}(d)-E[\psi_{n,i}\epsilon_{n,i}(d)])

and

n(d)={𝕌n(d)6σ¯σ¯log(2npn)n,|1ni[2n]I{Di=d}ϵn,i(d)E[ϵn,i(d)]|log(2npn)n}.\mathcal{E}_{n}(d)=\left\{\|\mathbb{U}_{n}(d)\|_{\infty}\leq\frac{6\bar{\sigma}}{\underaccent{\bar}{\sigma}}\sqrt{\frac{\log(2np_{n})}{n}},~{}\left|\frac{1}{n}\sum_{i\in[2n]}I\{D_{i}=d\}\epsilon_{n,i}(d)-E[\epsilon_{n,i}(d)]\right|\leq\sqrt{\frac{\log(2np_{n})}{n}}\right\}~{}.

Lemma B.4 implies P{n(d)}1P\{\mathcal{E}_{n}(d)\}\to 1 for d{0,1}d\in\{0,1\}.

On the event n(d)\mathcal{E}_{n}(d), we have

|(2n1i2nI{Di=d}ϵn,i(d)ψn,i)(β^d,nrβd,nr)|\displaystyle\left|\left(\frac{2}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\epsilon_{n,i}(d)\psi_{n,i}^{\prime}\right)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\right|
Ωn1(d)2n1i2nI{Di=d}ϵn,i(d)ψn,iΩn(d)(β^d,nrβd,nr)1\displaystyle\leq\left\|\Omega_{n}^{-1}(d)\frac{2}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\epsilon_{n,i}(d)\psi_{n,i}\right\|_{\infty}\|\Omega_{n}(d)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\|_{1}
2𝕌n(d)Ωn(d)(β^d,nrβd,nr)1+Ωn1(d)2n1i2nI{Di=d}E[ϵn,i(d)ψn,i]Ωn(d)(β^d,nrβd,nr)1\displaystyle\leq 2\|\mathbb{U}_{n}(d)\|_{\infty}\|\Omega_{n}(d)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\|_{1}+\left\|\Omega_{n}^{-1}(d)\frac{2}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\epsilon_{n,i}(d)\psi_{n,i}]\right\|_{\infty}\|\Omega_{n}(d)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\|_{1}
2𝕌n(d)Ωn(d)(β^d,nrβd,nr)1+Ωn1(d)2E[ϵn,i(d)ψn,i]Ωn(d)(β^d,nrβd,nr)1\displaystyle\leq 2\|\mathbb{U}_{n}(d)\|_{\infty}\|\Omega_{n}(d)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\|_{1}+\left\|\Omega_{n}^{-1}(d)2E[\epsilon_{n,i}(d)\psi_{n,i}]\right\|_{\infty}\|\Omega_{n}(d)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\|_{1}
(12σ¯σ¯n+dn)λd,nrΩn(d)(β^d,nrβd,nr)1,\displaystyle\leq\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}\right)\lambda_{d,n}^{\rm r}\|\Omega_{n}(d)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\|_{1}~{},

where dn=o(1)d_{n}=o(1) and the last inequality follows from (18) and the fact that

λd,nrnlog(2npn)n.\lambda_{d,n}^{\rm r}\geq\ell\ell_{n}\sqrt{\frac{\log(2np_{n})}{n}}~{}.

Next, define

δ^d,n=β^d,nrβd,nr\hat{\delta}_{d,n}=\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}

and let Sd,nS_{d,n} be the support of βd,nr\beta_{d,n}^{\rm r}. Then, we have

Ω^n(d)β^d,nr1\displaystyle\|\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r}\|_{1} =(Ω^n(d)β^d,nr)Sd,n1+(Ω^n(d)β^d,nr)Sd,nc1=(Ω^n(d)β^d,nr)Sd,n1+(Ω^n(d)δ^d,n)Sd,nc1,\displaystyle=\|(\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r})_{S_{d,n}}\|_{1}+\|(\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r})_{S_{d,n}^{c}}\|_{1}=\|(\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r})_{S_{d,n}}\|_{1}+\|(\hat{\Omega}_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}^{c}}\|_{1}~{},
Ω^n(d)βd,nr1\displaystyle\|\hat{\Omega}_{n}(d)\beta_{d,n}^{\rm r}\|_{1} =(Ω^n(d)βd,nr)Sd,n1(Ω^n(d)β^d,nr)Sd,n1+(Ω^n(d)δ^d,nr)Sd,n1,\displaystyle=\|(\hat{\Omega}_{n}(d)\beta_{d,n}^{\rm r})_{S_{d,n}}\|_{1}\leq\|(\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r})_{S_{d,n}}\|_{1}+\|(\hat{\Omega}_{n}(d)\hat{\delta}_{d,n}^{\rm r})_{S_{d,n}}\|_{1}~{},

and thus,

(12σ¯σ¯n+dn)Ωn(d)δ^d,n1+Ω^n(d)βd,nr1Ω^n(d)β^d,nr1\displaystyle\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}\right)\|\Omega_{n}(d)\hat{\delta}_{d,n}\|_{1}+\|\hat{\Omega}_{n}(d)\beta_{d,n}^{\rm r}\|_{1}-\|\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r}\|_{1}
=(12σ¯σ¯n+dn)(Ωn(d)δ^d,n)Sd,n1+(12σ¯σ¯n+dn)(Ωn(d)δ^d,n)Sd,nc1+Ω^n(d)βd,nr1Ω^n(d)β^d,nr1\displaystyle=\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}\right)\|(\Omega_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}}\|_{1}+\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}\right)\|(\Omega_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}^{c}}\|_{1}+\|\hat{\Omega}_{n}(d)\beta_{d,n}^{\rm r}\|_{1}-\|\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r}\|_{1}
(12σ¯σ¯n+dn)(Ωn(d)δ^d,n)Sd,n1+(12σ¯σ¯n+dn)(Ωn(d)δ^d,n)Sd,nc1+(Ω^n(d)δ^d,nr)Sd,n1(Ω^n(d)δ^d,nr)Sd,nc1.\displaystyle\leq\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}\right)\|(\Omega_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}}\|_{1}+\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}\right)\|(\Omega_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}^{c}}\|_{1}+\|(\hat{\Omega}_{n}(d)\hat{\delta}_{d,n}^{\rm r})_{S_{d,n}}\|_{1}-\|(\hat{\Omega}_{n}(d)\hat{\delta}_{d,n}^{\rm r})_{S_{d,n}^{c}}\|_{1}~{}.

Further define \breve{\delta}_{d,n}=(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r},\hat{\delta}_{d,n}^{\prime})^{\prime} and \breve{S}_{d,n}=\{1,S_{d,n}+1\} (for example, if S_{d,n}=\{2,4,9\}, then \breve{S}_{d,n}=\{1,3,5,10\}), and recall \breve{\psi}_{n,i}=(1,\psi_{n,i}^{\prime})^{\prime}. Then, together with (A.5), we have

0\displaystyle 0 1n1i2nI{Di=d}(ψ˘n,iδ˘d,n)2\displaystyle\leq\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\breve{\psi}_{n,i}^{\prime}\breve{\delta}_{d,n})^{2}
λd,nr[(12σ¯σ¯n+dn+c¯)(Ωn(d)δ^d,n)Sd,n1(c¯12σ¯σ¯ndn)(Ωn(d)δ^d,n)Sd,nc1]\displaystyle\leq\lambda_{d,n}^{\rm r}\left[\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)\|(\Omega_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}}\|_{1}-\left(\underline{c}-\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}-d_{n}\right)\|(\Omega_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}^{c}}\|_{1}\right]
+λd,nr(1/n+dn)|α^d,nrαd,nr|\displaystyle+\lambda_{d,n}^{\rm r}(1/\ell_{n}+d_{n})|\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}|
λd,nr[(12σ¯σ¯n+dn+c¯)σ¯(δ^d,n)Sd,n1(c¯12σ¯σ¯ndn)σ¯(δ^d,n)Sd,nc1]\displaystyle\leq\lambda_{d,n}^{\rm r}\left[\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)\bar{\sigma}\|(\hat{\delta}_{d,n})_{S_{d,n}}\|_{1}-\left(\underline{c}-\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}-d_{n}\right)\underaccent{\bar}{\sigma}\|(\hat{\delta}_{d,n})_{S_{d,n}^{c}}\|_{1}\right]
+λd,nr(1/n+dn)|α^d,nrαd,nr|\displaystyle+\lambda_{d,n}^{\rm r}(1/\ell_{n}+d_{n})|\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}|
λd,nr[(12σ¯σ¯n+dn+c¯)σ¯(δ˘d,n)S˘d,n1(c¯12σ¯σ¯ndn)σ¯(δ˘d,n)S˘d,nc1].\displaystyle\leq\lambda_{d,n}^{\rm r}\left[\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)\bar{\sigma}\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}}\|_{1}-\left(\underline{c}-\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}-d_{n}\right)\underaccent{\bar}{\sigma}\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}^{c}}\|_{1}\right]~{}. (50)

Define

𝒞n={u𝐑pn+1:uS˘d,nc12σ¯c¯σ¯c¯uS˘d,n1}.\mathcal{C}_{n}=\left\{u\in\mathbf{R}^{p_{n}+1}:\|u_{\breve{S}_{d,n}^{c}}\|_{1}\leq\frac{2\bar{\sigma}\bar{c}}{\underaccent{\bar}{\sigma}\underline{c}}\|u_{\breve{S}_{d,n}}\|_{1}\right\}~{}.

For sufficiently large n, on the event \mathcal{E}_{n}(d) we have \breve{\delta}_{d,n}\in\mathcal{C}_{n}. It follows from Bickel et al. (2009) and Assumption 5.4 that

infu𝒞n(uS˘d,n1)2(sn+1)u(1n1i2nI{Di=d}ψ˘n,iψ˘n,i)u0.25κ12.\inf_{u\in\mathcal{C}_{n}}(\|u_{\breve{S}_{d,n}}\|_{1})^{-2}(s_{n}+1)u^{\prime}\left(\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\breve{\psi}_{n,i}\breve{\psi}_{n,i}^{\prime}\right)u\geq 0.25\kappa_{1}^{2}~{}.

Therefore, we have

0.25κ12(δ˘d,n)S˘d,n12\displaystyle 0.25\kappa_{1}^{2}\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}}\|_{1}^{2} 1n1i2nI{Di=d}(ψ˘n,iδ˘d,n)2\displaystyle\leq\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\breve{\psi}_{n,i}^{\prime}\breve{\delta}_{d,n})^{2}
(12σ¯σ¯n+dn+c¯)λd,nr(sn+1)(δ˘d,n)Sd,n1,\displaystyle\leq\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)\lambda_{d,n}^{\rm r}(s_{n}+1)\|(\breve{\delta}_{d,n})_{S_{d,n}}\|_{1}~{},

which implies

(δ˘d,n)Sd,n14(12σ¯σ¯n+dn+c¯)(sn+1)λd,nr/κ12.\|(\breve{\delta}_{d,n})_{S_{d,n}}\|_{1}\leq 4\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)(s_{n}+1)\lambda_{d,n}^{\rm r}/\kappa_{1}^{2}~{}.

We then have

|α^d,nrαd,nr|+β^d,nrβd,nr1\displaystyle|\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}|+\|\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}\|_{1} (δ˘d,n)S˘d,n1+(δ˘d,n)S˘d,nc1\displaystyle\leq\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}}\|_{1}+\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}^{c}}\|_{1}
(1+2σ¯c¯σ¯c¯)(δ˘d,n)S˘d,n1\displaystyle\leq\left(1+\frac{2\bar{\sigma}\bar{c}}{\underaccent{\bar}{\sigma}\underline{c}}\right)\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}}\|_{1}
4(1+2σ¯c¯σ¯c¯)(12σ¯σ¯n+dn+c¯)(sn+1)λd,nr/κ12.\displaystyle\leq 4\left(1+\frac{2\bar{\sigma}\bar{c}}{\underaccent{\bar}{\sigma}\underline{c}}\right)\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)(s_{n}+1)\lambda_{d,n}^{\rm r}/\kappa_{1}^{2}~{}.

Then, (48) holds because P\{\mathcal{E}_{n}(d)\}\to 1.
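
To illustrate how an estimator of this penalized form can be computed in practice, the following sketch (Python with NumPy and scikit-learn) is purely schematic: the quantities omega and lam below are simple stand-ins for \hat{\Omega}_{n}(d) and \lambda_{d,n}^{\rm r}, and scikit-learn's Lasso normalizes its objective differently from the display above, so the mapping of penalty levels is only indicative. The key point is that, with diagonal penalty loadings, penalizing \|\hat{\Omega}_{n}(d)\beta\|_{1} is equivalent to a standard Lasso after rescaling the columns of \psi_{n,i}.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 200, 50                                  # n pairs, p = dim(psi_{n,i}); hypothetical sizes
psi = rng.normal(size=(2 * n, p))
D = np.zeros(2 * n)
D[::2] = rng.integers(0, 2, size=n)
D[1::2] = 1 - D[::2]
Y = 2.0 + psi[:, 0] - 0.5 * psi[:, 1] + rng.normal(size=2 * n)

treated = D == 1                                # units with D_i = d, here d = 1
omega = psi[treated].std(axis=0)                # stand-in diagonal penalty loadings (illustrative)
lam = 0.1 * np.sqrt(np.log(2 * n * p) / n)      # penalty level of the order used in the text

# Penalizing ||omega * b||_1 is equivalent to a standard Lasso on the rescaled design psi / omega.
psi_scaled = psi[treated] / omega
# sklearn's Lasso minimizes (1/(2m))||y - Xw||^2 + alpha||w||_1 with m = #observations, so alpha
# matches the display above only up to this normalization; this is illustrative only.
fit = Lasso(alpha=lam, fit_intercept=True).fit(psi_scaled, Y[treated])
beta_hat = fit.coef_ / omega                    # undo the rescaling to recover the coefficient on psi
alpha_hat = fit.intercept_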

Step 2: Verifying (9), (12), and Assumption 3.1

By (A.5), on n(d)\mathcal{E}_{n}(d) we have

1n1i2nI{Di=d}(α^d,nrαd,nr+ψn,iδ^d,n)2\displaystyle\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\hat{\delta}_{d,n})^{2} λd,nr(12σ¯σ¯n+dn+c¯)σ¯(δ˘d,n)S˘d,n1\displaystyle\leq\lambda_{d,n}^{\rm r}\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)\bar{\sigma}\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}}\|_{1}
4(12σ¯σ¯n+dn+c¯)2σ¯(sn+1)λd,nr,2/κ12.\displaystyle\leq 4\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)^{2}\bar{\sigma}(s_{n}+1)\lambda_{d,n}^{\rm r,2}/\kappa_{1}^{2}~{}.

Because P{n(d)}1P\{\mathcal{E}_{n}(d)\}\to 1, we have

1n1i2nI{Di=d}(α^d,nrαd,nr+ψn,i(β^d,nβd,n))2=OP(sn(λnr)2)=oP(1),\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}(\hat{\beta}_{d,n}-\beta_{d,n}))^{2}=O_{P}\left(s_{n}(\lambda_{n}^{\rm r})^{2}\right)=o_{P}(1)~{},

which implies (12) holds.

Next, we show (9) for β^d,nr\hat{\beta}_{d,n}^{\rm r}. First note

|12n1i2n(2Di1)(m^d,n(Xi,Wn,i)md,n(Xi,Wn,i))|\displaystyle\left|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)(\hat{m}_{d,n}(X_{i},W_{n,i})-m_{d,n}(X_{i},W_{n,i}))\right|
\displaystyle=\left|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\right|
\displaystyle\leq\left\|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}\right\|_{\infty}\|\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}\|_{1}~{}.

Next, note that it follows from Assumption 2.2 that conditional on X(n)X^{(n)} and Wn(n)W_{n}^{(n)},

{Dπ(2j1)Dπ(2j):1jn}\{D_{\pi(2j-1)}-D_{\pi(2j)}:1\leq j\leq n\}

is a sequence of independent Rademacher random variables. Therefore, Hoeffding’s inequality implies

P{12n1i2n(2Di1)ψn,i>t|X(n),Wn(n)}\displaystyle P\left\{\left\|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}\right\|_{\infty}>t\Bigg{|}X^{(n)},W_{n}^{(n)}\right\}
\displaystyle\leq\sum_{1\leq l\leq p_{n}}P\left\{\left|\frac{1}{\sqrt{2n}}\sum_{1\leq j\leq n}(\psi_{n,\pi(2j-1),l}-\psi_{n,\pi(2j),l})(D_{\pi(2j-1)}-D_{\pi(2j)})\right|>t\Bigg|X^{(n)},W_{n}^{(n)}\right\}
\displaystyle\leq\sum_{1\leq l\leq p_{n}}2\exp\left(-\frac{t^{2}}{\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{n,\pi(2j-1),l}-\psi_{n,\pi(2j),l})^{2}}\right)~{}.

Define

νn2=max1lpn1n1i2nψn,i,l2.\nu_{n}^{2}=\max_{1\leq l\leq p_{n}}\frac{1}{n}\sum_{1\leq i\leq 2n}\psi_{n,i,l}^{2}~{}.

We then have

P{12n1i2n(2Di1)ψn,i>νn2log(pnn)|X(n),Wn(n)}(pnn)1.P\left\{\left\|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}\right\|_{\infty}>\nu_{n}\sqrt{2\log(p_{n}\vee n)}\Bigg{|}X^{(n)},W_{n}^{(n)}\right\}\leq(p_{n}\vee n)^{-1}~{}. (51)
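
A small simulation (Python/NumPy, with hypothetical sizes) illustrating the conditional Hoeffding bound behind (51): holding the within-pair differences of \psi_{n,i} fixed and redrawing only the Rademacher signs D_{\pi(2j-1)}-D_{\pi(2j)}, the sup-norm of the normalized sum is of the order of a sub-Gaussian benchmark of size \sqrt{2\log p_{n}} times the largest per-coordinate scale.

import numpy as np

rng = np.random.default_rng(3)
n, p = 2000, 500                               # n pairs, p_n regressors; hypothetical sizes
psi_diff = rng.normal(size=(n, p))             # within-pair differences of psi_{n,i}, held fixed
reps = 200
sup_norms = np.empty(reps)
for r in range(reps):
    signs = rng.choice([-1.0, 1.0], size=n)    # D_{pi(2j-1)} - D_{pi(2j)}: i.i.d. Rademacher
    sup_norms[r] = np.abs(psi_diff.T @ signs / np.sqrt(2 * n)).max()

# Each coordinate of the normalized sum is sub-Gaussian with variance proxy (1/(2n)) sum_j diff^2,
# so the maximum over p coordinates is of order sigma * sqrt(2 log p).
sigma = np.sqrt(((psi_diff ** 2).sum(axis=0) / (2 * n)).max())
print("average sup-norm over draws:", sup_norms.mean())
print("sub-Gaussian benchmark:", sigma * np.sqrt(2 * np.log(p)))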

Next, we determine the order of νn2\nu_{n}^{2}. Note

E[νn2]\displaystyle E[\nu_{n}^{2}] max1lpn2E[ψn,i,l2]+2E[12n1i2n(ψn,i,l2E[ψn,i,l2])]\displaystyle\leq\max_{1\leq l\leq p_{n}}2E[\psi_{n,i,l}^{2}]+2E\left[\frac{1}{2n}\sum_{1\leq i\leq 2n}(\psi_{n,i,l}^{2}-E[\psi_{n,i,l}^{2}])\right]
1+E[max1lpn|12n1i2neiψn,i,l2|]\displaystyle\lesssim 1+E\left[\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}e_{i}\psi_{n,i,l}^{2}\right|\right]
1+ΞnE[max1lpn|12n1i2neiψn,i,l|]\displaystyle\lesssim 1+\Xi_{n}E\left[\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}e_{i}\psi_{n,i,l}\right|\right]
1+ΞnE[supfn|12n1i2nf(ei,ψn,i,l)|]\displaystyle\lesssim 1+\Xi_{n}E\left[\sup_{f\in\mathcal{F}_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}f(e_{i},\psi_{n,i,l})\right|\right]

where \{e_{i}:1\leq i\leq 2n\} is an i.i.d. sequence of Rademacher random variables,

n={f:𝐑×𝐑pn𝐑,f(e,ψ)=eψl,1lpn},\mathcal{F}_{n}=\{f:\mathbf{R}\times\mathbf{R}^{p_{n}}\mapsto\mathbf{R},f(e,\psi)=e\psi_{l},1\leq l\leq p_{n}\}~{},

and ψl\psi_{l} is the llth element of ψ\psi. Note the second inequality follows from Lemma 2.3.1 of van der Vaart and Wellner (1996), the third inequality follows from Theorem 4.12 of Ledoux and Talagrand (1991) and the definition of Ξn\Xi_{n}, and the last follows from Assumption 5.1. Note also n\mathcal{F}_{n} has an envelope F=ΞnF=\Xi_{n} and

supn1supfnE[f2]<\sup_{n\geq 1}\sup_{f\in\mathcal{F}_{n}}E[f^{2}]<\infty

because of Assumption 5.1. Because the cardinality of n\mathcal{F}_{n} is pnp_{n}, for any ϵ<1\epsilon<1 we have that

supQ:Q is a discrete distribution with finite support𝒩(ϵFQ,2,,L2(Q))pnϵ,\sup_{Q:Q\text{ is a discrete distribution with finite support}}\mathcal{N}(\epsilon\|F\|_{Q,2},\mathcal{F},L_{2}(Q))\leq\frac{p_{n}}{\epsilon}~{},

where 𝒩(ϵ,,L2(Q))\mathcal{N}(\epsilon,\mathcal{F},L_{2}(Q)) is the covering number for class \mathcal{F} under the metric L2(Q)L_{2}(Q) using balls of radius ϵ\epsilon. Therefore, Corollary 5.1 of Chernozhukov et al. (2014) implies

E[supfn|12n1i2neiψn,i,l|]logpnn+Ξnlogpnn=o(Ξn1).E\left[\sup_{f\in\mathcal{F}_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}e_{i}\psi_{n,i,l}\right|\right]\lesssim\sqrt{\frac{\log p_{n}}{n}}+\frac{\Xi_{n}\log p_{n}}{n}=o(\Xi_{n}^{-1})~{}.

Therefore, \nu_{n}=O_{P}(1). Combined with (51), this implies

12n1i2n(2Di1)ψn,i=OP(log(pnn)).\left\|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}\right\|_{\infty}=O_{P}\left(\sqrt{\log(p_{n}\vee n)}\right)~{}.

In light of (48) and Assumption 5.3, we have

12n1i2n(2Di1)ψn,iβ^d,nrβd,nr1=OP(snnlog(pnn)n)=oP(1).\left\|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}\right\|_{\infty}\|\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}\|_{1}=O_{P}\left(\frac{s_{n}\ell\ell_{n}\log(p_{n}\vee n)}{\sqrt{n}}\right)=o_{P}(1)~{}.

Next, note that Assumptions 3.1(a) and 3.1(b) follow from Assumption 5.1, and Assumption 3.1(c) follows from Assumptions 5.1 and 5.2.

Step 3: Asymptotic variance

Suppose the true specification is approximately sparse as specified in Theorem 5.1. Let Y~i(d)=Yi(d)μd(Xi)\tilde{Y}_{i}(d)=Y_{i}(d)-\mu_{d}(X_{i}), ψ~n,i=ψn,iE[ψn,i|Xi]\tilde{\psi}_{n,i}=\psi_{n,i}-E[\psi_{n,i}|X_{i}], and R~n,i(d)=Rn,i(d)E[Rn,i(d)|Xi]\tilde{R}_{n,i}(d)=R_{n,i}(d)-E[R_{n,i}(d)|X_{i}]. Then, we have

E[(E[Y~i(1)+Y~i(0)|Wn,i,Xi]ψ~n,i(β1,nr+β0,nr))2]=E[(R~n,i(1)+R~n,i(0))2]=o(1).\displaystyle E\left[(E[\tilde{Y}_{i}(1)+\tilde{Y}_{i}(0)|W_{n,i},X_{i}]-\tilde{\psi}_{n,i}^{\prime}(\beta_{1,n}^{\rm r}+\beta_{0,n}^{\rm r}))^{2}\right]=E[(\tilde{R}_{n,i}(1)+\tilde{R}_{n,i}(0))^{2}]=o(1)~{}.

This concludes the proof.  

A.6 Proof of Theorem 5.2

We divide the proof into three steps. In the first step, we show

β^nrefitβnrefit=oP(1).\displaystyle\hat{\beta}_{n}^{\rm refit}-\beta_{n}^{\rm refit}=o_{P}(1)~{}. (52)

In the second step, we show (9), (12), and Assumption 3.1 hold. In the third step, we show that σnna,2σnrefit,2\sigma_{n}^{\rm na,2}\geq\sigma_{n}^{\rm refit,2} and σnr,2σnrefit,2\sigma_{n}^{\rm r,2}\geq\sigma_{n}^{\rm refit,2}.

Step 1: Proof of (52)

Let

Δ^Γ,n\displaystyle\hat{\Delta}_{\Gamma,n} =1n1jn(Dπ(2j1)Dπ(2j))(Γn,π(2j1)Γn,π(2j)),\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}(D_{\pi(2j-1)}-D_{\pi(2j)})(\Gamma_{n,\pi(2j-1)}-\Gamma_{n,\pi(2j)})~{},
Δ^Γ^,n\displaystyle\hat{\Delta}_{\hat{\Gamma},n} =1n1jn(Dπ(2j1)Dπ(2j))(Γ^n,π(2j1)Γ^n,π(2j)),\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}(D_{\pi(2j-1)}-D_{\pi(2j)})(\hat{\Gamma}_{n,\pi(2j-1)}-\hat{\Gamma}_{n,\pi(2j)})~{},
δΓ,j\displaystyle\delta_{\Gamma,j} =(Dπ(2j1)Dπ(2j))(Γn,π(2j1)Γn,π(2j)),\displaystyle=(D_{\pi(2j-1)}-D_{\pi(2j)})(\Gamma_{n,\pi(2j-1)}-\Gamma_{n,\pi(2j)})~{},
δΓ^,j\displaystyle\delta_{\hat{\Gamma},j} =(Dπ(2j1)Dπ(2j))(Γ^n,π(2j1)Γ^n,π(2j)).\displaystyle=(D_{\pi(2j-1)}-D_{\pi(2j)})(\hat{\Gamma}_{n,\pi(2j-1)}-\hat{\Gamma}_{n,\pi(2j)})~{}.

Then, by the proof of Theorem 4.2, \hat{\beta}_{n}^{\rm refit} equals the least squares coefficient in the linear regression of \delta_{Y,j} on \delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n}; a schematic implementation of this refitting regression is given at the end of this step. Now, for any u\in\mathbf{R}^{2} such that \|u\|_{2}=1, we have

|(1n1jn((δΓ^,jΔ^Γ^,n)u)2)1/2(1n1jn((δΓ,jΔ^Γ,n)u)2)1/2|\displaystyle\left|\left(\frac{1}{n}\sum_{1\leq j\leq n}((\delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n})^{\prime}u)^{2}\right)^{1/2}-\left(\frac{1}{n}\sum_{1\leq j\leq n}((\delta_{\Gamma,j}-\hat{\Delta}_{\Gamma,n})^{\prime}u)^{2}\right)^{1/2}\right|
(1n1jn((δΓ^,jδΓ,j)u)2((Δ^Γ^,nΔ^Γ,n)u)2)1/2\displaystyle\leq\left(\frac{1}{n}\sum_{1\leq j\leq n}((\delta_{\hat{\Gamma},j}-\delta_{\Gamma,j})^{\prime}u)^{2}-((\hat{\Delta}_{\hat{\Gamma},n}-\hat{\Delta}_{\Gamma,n})^{\prime}u)^{2}\right)^{1/2}
(2n1i2nΓ^n,i+(α^1,nr,α^0,nr)Γn,i(α1,nr,α0,nr)22)1/2\displaystyle\leq\left(\frac{2}{n}\sum_{1\leq i\leq 2n}\left\|\hat{\Gamma}_{n,i}+(\hat{\alpha}_{1,n}^{\rm r},\hat{\alpha}_{0,n}^{\rm r})^{\prime}-\Gamma_{n,i}-(\alpha_{1,n}^{\rm r},\alpha_{0,n}^{\rm r})^{\prime}\right\|_{2}^{2}\right)^{1/2}
d{0,1}12n1i2n(α^d,nrαd,nr+ψn,i(β^d,nrβd,nr))2=oP(1),\displaystyle\lesssim\sum_{d\in\{0,1\}}\frac{1}{2n}\sum_{1\leq i\leq 2n}(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}))^{2}=o_{P}(1)~{},

where the second inequality is by the fact that

δΓ,j=(Dπ(2j1)Dπ(2j))(Γn,π(2j1)+(α1,nr,α0,nr)Γn,π(2j)(α1,nr,α0,nr)),\displaystyle\delta_{\Gamma,j}=(D_{\pi(2j-1)}-D_{\pi(2j)})(\Gamma_{n,\pi(2j-1)}+(\alpha_{1,n}^{\rm r},\alpha_{0,n}^{\rm r})^{\prime}-\Gamma_{n,\pi(2j)}-(\alpha_{1,n}^{\rm r},\alpha_{0,n}^{\rm r})^{\prime})~{},
δΓ^,j=(Dπ(2j1)Dπ(2j))(Γ^n,π(2j1)+(α^1,nr,α^0,nr)Γ^n,π(2j)(α^1,nr,α^0,nr)),\displaystyle\delta_{\hat{\Gamma},j}=(D_{\pi(2j-1)}-D_{\pi(2j)})(\hat{\Gamma}_{n,\pi(2j-1)}+(\hat{\alpha}_{1,n}^{\rm r},\hat{\alpha}_{0,n}^{\rm r})^{\prime}-\hat{\Gamma}_{n,\pi(2j)}-(\hat{\alpha}_{1,n}^{\rm r},\hat{\alpha}_{0,n}^{\rm r})^{\prime})~{},

and the last equality is by the proof of Theorem 5.1. This implies

1n1jn(δΓ^,jΔ^Γ^,n)(δΓ^,jΔ^Γ^,n)2E[Var[Γn,i|Xi]]\displaystyle\frac{1}{n}\sum_{1\leq j\leq n}(\delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n})(\delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n})^{\prime}-2E[\operatorname*{Var}[\Gamma_{n,i}|X_{i}]]
=1n1jn(δΓ^,jΔ^Γ^,n)(δΓ^,jΔ^Γ^,n)1n1jn(δΓ,jΔ^Γ,n)(δΓ,jΔ^Γ,n)\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}(\delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n})(\delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n})^{\prime}-\frac{1}{n}\sum_{1\leq j\leq n}(\delta_{\Gamma,j}-\hat{\Delta}_{\Gamma,n})(\delta_{\Gamma,j}-\hat{\Delta}_{\Gamma,n})^{\prime}
+1n1jn(δΓ,jΔ^Γ,n)(δΓ,jΔ^Γ,n)2E[Var[Γn,i|Xi]]=oP(1),\displaystyle+\frac{1}{n}\sum_{1\leq j\leq n}(\delta_{\Gamma,j}-\hat{\Delta}_{\Gamma,n})(\delta_{\Gamma,j}-\hat{\Delta}_{\Gamma,n})^{\prime}-2E[\operatorname*{Var}[\Gamma_{n,i}|X_{i}]]=o_{P}(1)~{},

where the last equality holds due to the same argument as used in the proof of Theorem 4.2. Similarly, we can show that

1n1jnδY,j(δΓ^,jΔ^Γ^,n)E[Cov[Γn,i,Yi(1)+Yi(0)|Xi]]=oP(1),\displaystyle\frac{1}{n}\sum_{1\leq j\leq n}\delta_{Y,j}(\delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n})-E[\operatorname*{Cov}[\Gamma_{n,i},Y_{i}(1)+Y_{i}(0)|X_{i}]]=o_{P}(1)~{},

which leads to (52).
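
The refitting step itself is only a two-dimensional least squares problem on within-pair differences of the first-stage fitted values. The following is a schematic Python/NumPy sketch (with hypothetical names; Gamma_hat stands for first-stage fitted values such as (\psi_{n,i}^{\prime}\hat{\beta}_{1,n}^{\rm r},\psi_{n,i}^{\prime}\hat{\beta}_{0,n}^{\rm r}) computed, for instance, as in the Lasso sketch above).

import numpy as np

def refit_coefficient(Y, D, Gamma_hat):
    # OLS of delta_Y on delta_Gamma_hat - mean(delta_Gamma_hat), as described at the start of Step 1.
    # Y and D are length-2n arrays ordered so that units (2j-1, 2j) form pair j;
    # Gamma_hat is a (2n, 2) array of first-stage fitted values for d = 1 and d = 0.
    sign = D[::2] - D[1::2]                            # D_{pi(2j-1)} - D_{pi(2j)}
    delta_Y = sign * (Y[::2] - Y[1::2])
    delta_G = sign[:, None] * (Gamma_hat[::2] - Gamma_hat[1::2])
    x = delta_G - delta_G.mean(axis=0)                 # subtract Delta_hat_{Gamma_hat, n}
    return np.linalg.lstsq(x, delta_Y, rcond=None)[0]  # hat{beta}_n^{refit}, a 2-vector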

Step 2: Verifying (9), (12), and Assumption 3.1

We first show (9). We have

12n1i2n(2Di1)(m^d,n(Xi,Wn,i)md,n(Xi,Wn,i))\displaystyle\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)(\hat{m}_{d,n}(X_{i},W_{n,i})-m_{d,n}(X_{i},W_{n,i}))
=12n1i2n(2Di1)(Γ^n,iΓn,i)β^nrefit+12n1i2n(2Di1)Γn,i(β^nrefitβnrefit)\displaystyle=\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)(\hat{\Gamma}_{n,i}-\Gamma_{n,i})^{\prime}\hat{\beta}_{n}^{\rm refit}+\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\Gamma_{n,i}^{\prime}(\hat{\beta}_{n}^{\rm refit}-\beta_{n}^{\rm refit})
=(12n1i2n(2Di1)ψn,i(β^1,nrβ1,nr),12n1i2n(2Di1)ψn,i(β^0,nrβ0,nr))β^nrefit\displaystyle=\begin{pmatrix}\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}(\hat{\beta}_{1,n}^{\rm r}-\beta_{1,n}^{\rm r}),\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}(\hat{\beta}_{0,n}^{\rm r}-\beta_{0,n}^{\rm r})\end{pmatrix}\hat{\beta}_{n}^{\rm refit}
\displaystyle+\begin{pmatrix}\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}\beta_{1,n}^{\rm r},\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}\beta_{0,n}^{\rm r}\end{pmatrix}(\hat{\beta}_{n}^{\rm refit}-\beta_{n}^{\rm refit})
=oP(1),\displaystyle=o_{P}(1)~{},

where the last equality holds by (52) and the facts that

(12n1i2n(2Di1)ψn,i(β^1,nrβ1,nr),12n1i2n(2Di1)ψn,i(β^0,nrβ0,nr))=oP(1)\displaystyle\begin{pmatrix}\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}(\hat{\beta}_{1,n}^{\rm r}-\beta_{1,n}^{\rm r}),\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}(\hat{\beta}_{0,n}^{\rm r}-\beta_{0,n}^{\rm r})\end{pmatrix}=o_{P}(1)

as shown in Theorem 5.1 and

(12n1i2n(2Di1)ψn,iβ1,nr,12n1i2n(2Di1)ψn,iβ0,nr)=OP(1).\displaystyle\begin{pmatrix}\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}\beta_{1,n}^{\rm r},\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}\beta_{0,n}^{\rm r}\end{pmatrix}=O_{P}(1)~{}.

Next, we show (12). We note that

12n1i2n(m^d,n(Xi,Wn,i)md,n(Xi,Wn,i))2\displaystyle\frac{1}{2n}\sum_{1\leq i\leq 2n}(\hat{m}_{d,n}(X_{i},W_{n,i})-m_{d,n}(X_{i},W_{n,i}))^{2}
12n1i2n((Γ^n,iΓn,i)β^nrefit)2+12n1i2n(Γn,i(β^nrefitβnrefit))2\displaystyle\lesssim\frac{1}{2n}\sum_{1\leq i\leq 2n}((\hat{\Gamma}_{n,i}-\Gamma_{n,i})^{\prime}\hat{\beta}_{n}^{\rm refit})^{2}+\frac{1}{2n}\sum_{1\leq i\leq 2n}(\Gamma_{n,i}^{\prime}(\hat{\beta}_{n}^{\rm refit}-\beta_{n}^{\rm refit}))^{2}
12n1i2n((ψn,i(β1,nrβ^1,nr))2+(ψn,i(β0,nrβ^0,nr))2)β^nrefit22\displaystyle\lesssim\frac{1}{2n}\sum_{1\leq i\leq 2n}((\psi_{n,i}^{\prime}(\beta_{1,n}^{\rm r}-\hat{\beta}_{1,n}^{\rm r}))^{2}+(\psi_{n,i}^{\prime}(\beta_{0,n}^{\rm r}-\hat{\beta}_{0,n}^{\rm r}))^{2})||\hat{\beta}_{n}^{\rm refit}||^{2}_{2}
+12n1i2n[(ψn,iβ1,nr)2+(ψn,iβ0,nr)2]β^nrefitβnrefit22\displaystyle+\frac{1}{2n}\sum_{1\leq i\leq 2n}[(\psi_{n,i}^{\prime}\beta_{1,n}^{\rm r})^{2}+(\psi_{n,i}^{\prime}\beta_{0,n}^{\rm r})^{2}]||\hat{\beta}_{n}^{\rm refit}-\beta_{n}^{\rm refit}||_{2}^{2}
d=0,112n1i2n[(αd,nrα^d,nr+ψn,i(βd,nrβ^d,nr))2+(αd,nrα^d,nr)2]β^nrefit22+oP(1)\displaystyle\lesssim\sum_{d=0,1}\frac{1}{2n}\sum_{1\leq i\leq 2n}\left[(\alpha_{d,n}^{\rm r}-\hat{\alpha}_{d,n}^{\rm r}+\psi_{n,i}^{\prime}(\beta_{d,n}^{\rm r}-\hat{\beta}_{d,n}^{\rm r}))^{2}+(\alpha_{d,n}^{\rm r}-\hat{\alpha}_{d,n}^{\rm r})^{2}\right]||\hat{\beta}_{n}^{\rm refit}||^{2}_{2}+o_{P}(1)
=oP(1).\displaystyle=o_{P}(1)~{}.

Last, Assumption 3.1 can be verified in the same manner as in the proof of Theorem 5.1.

Step 3: Asymptotic variance

Recall \sigma_{2}^{2}(Q) and \sigma_{3}^{2}(Q) defined in Theorem 3.1. As we have already verified (9) for \hat{m}_{d,n}(X_{i},W_{n,i})=\hat{\Gamma}_{n,i}^{\prime}\hat{\beta}_{n}^{\rm refit} and m_{d,n}(X_{i},W_{n,i})=\Gamma_{n,i}^{\prime}\beta_{n}^{\rm refit}, we have, for b\in\{\rm{unadj,r,refit}\}, that

σnb,2σ22(Q)σ32(Q)=12E[Var[E[Yi(1)+Yi(0)|Xi,Wn,i]Γn,iγb|Xi]]\displaystyle\sigma_{n}^{\rm b,2}-\sigma_{2}^{2}(Q)-\sigma_{3}^{2}(Q)=\frac{1}{2}E\left[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{n,i}]-\Gamma_{n,i}^{\prime}\gamma^{b}|X_{i}]\right]

with

γunadj=(0,0),γr=(1,1),andγrefit=βnrefit.\displaystyle\gamma^{\rm unadj}=(0,0)^{\prime}~{},\quad\gamma^{\rm r}=(1,1)^{\prime}~{},\quad\text{and}\quad\gamma^{\rm refit}=\beta_{n}^{\rm refit}~{}.

In addition, we note that

12E[Var[E[Yi(1)+Yi(0)|Xi,Wn,i]Γn,iγ|Xi]]\displaystyle\frac{1}{2}E\left[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{n,i}]-\Gamma_{n,i}^{\prime}\gamma|X_{i}]\right]

is minimized at γ=βnrefit\gamma=\beta_{n}^{\rm refit}, which leads to the desired result.  

Appendix B Auxiliary Lemmas

Lemma B.1.

Suppose ϕn,n1\phi_{n},n\geq 1 is a sequence of random variables satisfying

limλlim supnE[|ϕn|I{|ϕn|>λ}]=0.\lim_{\lambda\to\infty}\limsup_{n\to\infty}E[|\phi_{n}|I\{|\phi_{n}|>\lambda\}]=0~{}. (53)

Suppose XX is another random variable defined on the same probability space with ϕn,n1\phi_{n},n\geq 1. Then,

limγlim supnE[E[|ϕn||X]I{E[|ϕn||X]>γ}]=0.\lim_{\gamma\to\infty}\limsup_{n\to\infty}E[E[|\phi_{n}||X]I\{E[|\phi_{n}||X]>\gamma\}]=0~{}. (54)
Proof.

Fix ϵ>0\epsilon>0. We will show there exists γ>0\gamma>0 so that

lim supnE[E[|ϕn||X]I{E[|ϕn||X]>γ}]<ϵ.\limsup_{n\to\infty}E[E[|\phi_{n}||X]I\{E[|\phi_{n}||X]>\gamma\}]<\epsilon~{}. (55)

First note the event {E[|ϕn||X]>γ}\{E[|\phi_{n}||X]>\gamma\} is measurable with respect to the σ\sigma-algebra generated by XX, and therefore

E[E[|ϕn||X]I{E[|ϕn||X]>γ}]=E[|ϕn|I{E[|ϕn||X]>γ}].E[E[|\phi_{n}||X]I\{E[|\phi_{n}||X]>\gamma\}]=E[|\phi_{n}|I\{E[|\phi_{n}||X]>\gamma\}]~{}. (56)

Next, by Theorem 10.3.5 of Dudley (1989), (53) implies that there exists a δ>0\delta>0 such that for any sequence of events AnA_{n} such that lim supnP{An}<δ\limsup_{n\to\infty}P\{A_{n}\}<\delta, we have

lim supnE[|ϕn|I{An}]<ϵ.\limsup_{n\to\infty}E[|\phi_{n}|I\{A_{n}\}]<\epsilon~{}. (57)

In light of the previous result, note

P{E[|ϕn||X]>γ}\displaystyle P\{E[|\phi_{n}||X]>\gamma\} E[E[|ϕn||X]]γ=E[|ϕn|]γ\displaystyle\leq\frac{E[E[|\phi_{n}||X]]}{\gamma}=\frac{E[|\phi_{n}|]}{\gamma}

By Theorem 10.3.5 of Dudley (1989) again, (53) implies lim supnE[|ϕn|]<\limsup_{n\to\infty}E[|\phi_{n}|]<\infty, so by choosing γ\gamma large enough, we can make sure

\limsup_{n\to\infty}P\{E[|\phi_{n}||X]>\gamma\}<\delta~{}.

(55) then follows from (56)–(57).  

Lemma B.2.

Suppose Assumptions 2.1–2.3 and 3.1 hold. Then,

sn2nE[Var[ϕ1,n,i|Xi]]P1.\frac{s_{n}^{2}}{nE[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]}\stackrel{{\scriptstyle P}}{{\to}}1~{}.
Proof.

To begin, note it follows from Assumption 2.2 and Qn=Q2nQ_{n}=Q^{2n} that

1n1i2nDiVar[ϕ1,n,i|Xi]=12n1i2nVar[ϕ1,n,i|Xi]+12n1i2n:Di=1Var[ϕ1,n,i|Xi]12n1i2n:Di=0Var[ϕ1,n,i|Xi].\frac{1}{n}\sum_{1\leq i\leq 2n}D_{i}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]=\frac{1}{2n}\sum_{1\leq i\leq 2n}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]\\ +\frac{1}{2n}\sum_{1\leq i\leq 2n:D_{i}=1}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]-\frac{1}{2n}\sum_{1\leq i\leq 2n:D_{i}=0}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]~{}. (58)

Next,

|12n1i2n:Di=1Var[ϕ1,n,iXi]12n1i2n:Di=0Var[ϕ1,n,iXi]|12n1jn|Var[ϕ1,n,π(2j1)Xπ(2j1)]Var[ϕ1,n,π(2j)Xπ(2j)]|.\left|\frac{1}{2n}\sum_{1\leq i\leq 2n:D_{i}=1}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]-\frac{1}{2n}\sum_{1\leq i\leq 2n:D_{i}=0}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]\right|\\ \leq\frac{1}{2n}\sum_{1\leq j\leq n}|\operatorname*{Var}[\phi_{1,n,\pi(2j-1)}|X_{\pi(2j-1)}]-\operatorname*{Var}[\phi_{1,n,\pi(2j)}|X_{\pi(2j)}]|~{}. (59)

In what follows, we will show

1n1jn|Cov[Yπ(2j1)(1),m1,n(Xπ(2j1),Wπ(2j1))|Xπ(2j1)]]Cov[Yπ(2j)(1),m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]]|P0.\frac{1}{n}\sum_{1\leq j\leq n}|\operatorname*{Cov}[Y_{\pi(2j-1)}(1),m_{1,n}(X_{\pi(2j-1)},W_{\pi(2j-1)})|X_{\pi(2j-1)}]]\\ -\operatorname*{Cov}[Y_{\pi(2j)}(1),m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]]|\stackrel{{\scriptstyle P}}{{\to}}0~{}.

To that end, first note from Assumptions 2.3 and 3.1(c) that

1n1jn|E[Yπ(2j1)(1)m1,n(Xπ(2j1),Wπ(2j1))|Xπ(2j1)]E[Yπ(2j)(1)m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]|\displaystyle\frac{1}{n}\sum_{1\leq j\leq n}|E[Y_{\pi(2j-1)}(1)m_{1,n}(X_{\pi(2j-1)},W_{\pi(2j-1)})|X_{\pi(2j-1)}]-E[Y_{\pi(2j)}(1)m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]|
1n1jn|Xπ(2j1)Xπ(2j)|P0.\displaystyle\lesssim\frac{1}{n}\sum_{1\leq j\leq n}|X_{\pi(2j-1)}-X_{\pi(2j)}|\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Next, note

1n1jn|E[Yπ(2j1)(1)|Xπ(2j1)]E[m1,n(Xπ(2j1),Wπ(2j1))|Xπ(2j1)]\displaystyle\frac{1}{n}\sum_{1\leq j\leq n}|E[Y_{\pi(2j-1)}(1)|X_{\pi(2j-1)}]E[m_{1,n}(X_{\pi(2j-1)},W_{\pi(2j-1)})|X_{\pi(2j-1)}]
E[Yπ(2j)(1)|Xπ(2j)]E[m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]|\displaystyle\hskip 30.00005pt-E[Y_{\pi(2j)}(1)|X_{\pi(2j)}]E[m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]|
1n1jn|E[Yπ(2j1)(1)|Xπ(2j1)]||E[m1,n(Xπ(2j1),Wπ(2j1))|Xπ(2j1)]E[m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]|\displaystyle\leq\frac{1}{n}\sum_{1\leq j\leq n}|E[Y_{\pi(2j-1)}(1)|X_{\pi(2j-1)}]||E[m_{1,n}(X_{\pi(2j-1)},W_{\pi(2j-1)})|X_{\pi(2j-1)}]-E[m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]|
+1n1jn|E[Yπ(2j1)(1)|Xπ(2j1)]E[Yπ(2j)(1)|Xπ(2j)]||E[m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]|\displaystyle\hskip 30.00005pt+\frac{1}{n}\sum_{1\leq j\leq n}|E[Y_{\pi(2j-1)}(1)|X_{\pi(2j-1)}]-E[Y_{\pi(2j)}(1)|X_{\pi(2j)}]||E[m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]|
(1n1jn|E[Yπ(2j1)(1)|Xπ(2j1)]|2)1/2\displaystyle\leq\left(\frac{1}{n}\sum_{1\leq j\leq n}|E[Y_{\pi(2j-1)}(1)|X_{\pi(2j-1)}]|^{2}\right)^{1/2}
×(1n1jn|E[m1,n(Xπ(2j1),Wπ(2j1))|Xπ(2j1)]E[m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]|2)1/2\displaystyle\hskip 50.00008pt\times\left(\frac{1}{n}\sum_{1\leq j\leq n}|E[m_{1,n}(X_{\pi(2j-1)},W_{\pi(2j-1)})|X_{\pi(2j-1)}]-E[m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]|^{2}\right)^{1/2}
+(1n1jn|E[m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]|2)1/2\displaystyle\hskip 30.00005pt+\left(\frac{1}{n}\sum_{1\leq j\leq n}|E[m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]|^{2}\right)^{1/2}
×(1n1jn|E[Yπ(2j1)(1)|Xπ(2j1)]E[Yπ(2j)(1)|Xπ(2j)]|2)1/2\displaystyle\hskip 50.00008pt\times\left(\frac{1}{n}\sum_{1\leq j\leq n}|E[Y_{\pi(2j-1)}(1)|X_{\pi(2j-1)}]-E[Y_{\pi(2j)}(1)|X_{\pi(2j)}]|^{2}\right)^{1/2}
(1n1i2n|E[Yi(1)|Xi]|2)1/2(1n1jn|Xπ(2j1)Xπ(2j)|2)1/2\displaystyle\lesssim\left(\frac{1}{n}\sum_{1\leq i\leq 2n}|E[Y_{i}(1)|X_{i}]|^{2}\right)^{1/2}\left(\frac{1}{n}\sum_{1\leq j\leq n}|X_{\pi(2j-1)}-X_{\pi(2j)}|^{2}\right)^{1/2}
+(1n1i2n|E[m1,n(Xi,Wi)|Xi]|2)1/2(1n1jn|Xπ(2j1)Xπ(2j)|2)1/2P0,\displaystyle\hskip 30.00005pt+\left(\frac{1}{n}\sum_{1\leq i\leq 2n}|E[m_{1,n}(X_{i},W_{i})|X_{i}]|^{2}\right)^{1/2}\left(\frac{1}{n}\sum_{1\leq j\leq n}|X_{\pi(2j-1)}-X_{\pi(2j)}|^{2}\right)^{1/2}\stackrel{{\scriptstyle P}}{{\to}}0~{},

where the first inequality follows from the triangle inequality, the second follows from the Cauchy-Schwarz inequality, and the last follows from Assumptions 2.1(c) and 3.1(c). To see that the convergence holds, first note that, because

E[|E[Yi(1)|Xi]|2]E[E[Yi2(1)|Xi]]=E[Yi2(1)]<,E[|E[Y_{i}(1)|X_{i}]|^{2}]\leq E[E[Y_{i}^{2}(1)|X_{i}]]=E[Y_{i}^{2}(1)]<\infty~{},

the weak law of large numbers implies

1n1i2n|E[Yi(1)|Xi]|2P2E[|E[Yi(1)|Xi]|2]<.\frac{1}{n}\sum_{1\leq i\leq 2n}|E[Y_{i}(1)|X_{i}]|^{2}\stackrel{{\scriptstyle P}}{{\to}}2E[|E[Y_{i}(1)|X_{i}]|^{2}]<\infty~{}.

On the other hand,

12n1i2n|E[m1,n(Xi,Wi)|Xi]|212n1i2nE[m1,n2(Xi,Wi)|Xi].\frac{1}{2n}\sum_{1\leq i\leq 2n}|E[m_{1,n}(X_{i},W_{i})|X_{i}]|^{2}\leq\frac{1}{2n}\sum_{1\leq i\leq 2n}E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]~{}.

Assumption 3.1(b) and Lemma B.1 imply

limλlim supnE[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>λ}]=0.\lim_{\lambda\to\infty}\limsup_{n\to\infty}E[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\lambda\}]=0~{}.

Therefore, Lemma 11.4.2 of Lehmann and Romano (2005) implies

12n1i2nE[m1,n2(Xi,Wi)|Xi]E[E[m1,n2(Xi,Wi)|Xi]]P0.\frac{1}{2n}\sum_{1\leq i\leq 2n}E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]-E[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]]\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Finally, note E[E[m1,n2(Xi,Wi)|Xi]]=E[m1,n2(Xi,Wi)]E[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]]=E[m_{1,n}^{2}(X_{i},W_{i})] is bounded for n1n\geq 1 by Assumption 3.1(b), so

1n1i2n|E[m1,n(Xi,Wi)|Xi]|2=OP(1).\frac{1}{n}\sum_{1\leq i\leq 2n}|E[m_{1,n}(X_{i},W_{i})|X_{i}]|^{2}=O_{P}(1)~{}.

The desired convergence therefore follows.

Similar arguments applied termwise imply the right-hand side of (59) is oP(1)o_{P}(1). (58)–(59) then imply

\frac{s_{n}^{2}}{n}-\frac{1}{2n}\sum_{1\leq i\leq 2n}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]\stackrel{P}{\to}0~{}. (60)

Next, we argue

\frac{1}{2n}\sum_{1\leq i\leq 2n}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]-E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]\stackrel{P}{\to}0~{}. (61)

To establish (61), we verify the uniform integrability condition in Lemma 11.4.2 of Lehmann and Romano (2005). To that end, we will repeatedly use the inequality

|1jkaj|I{|1jkaj|>λ}\displaystyle\left|\sum_{1\leq j\leq k}a_{j}\right|I\left\{\left|\sum_{1\leq j\leq k}a_{j}\right|>\lambda\right\} 1jkk|aj|I{|aj|>λk}\displaystyle\leq\sum_{1\leq j\leq k}k|a_{j}|I\left\{|a_{j}|>\frac{\lambda}{k}\right\} (62)
|ab|I{|ab|>λ}\displaystyle|ab|I\{|ab|>\lambda\} |a|2I{|a|>λ}+|b|2I{|b|>λ}.\displaystyle\leq|a|^{2}I\{|a|>\sqrt{\lambda}\}+|b|^{2}I\{|b|>\sqrt{\lambda}\}~{}. (63)
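
For completeness, both inequalities follow from elementary arguments. For (62), on the event \{|\sum_{1\leq j\leq k}a_{j}|>\lambda\} we must have \max_{1\leq j\leq k}|a_{j}|>\lambda/k, so that

\left|\sum_{1\leq j\leq k}a_{j}\right|I\left\{\left|\sum_{1\leq j\leq k}a_{j}\right|>\lambda\right\}\leq k\max_{1\leq j\leq k}|a_{j}|I\left\{\max_{1\leq j\leq k}|a_{j}|>\frac{\lambda}{k}\right\}\leq\sum_{1\leq j\leq k}k|a_{j}|I\left\{|a_{j}|>\frac{\lambda}{k}\right\}~{}.

For (63), if |ab|>\lambda then \max(|a|,|b|)>\sqrt{\lambda}, and |ab|\leq\max(|a|^{2},|b|^{2}), so the left-hand side of (63) is bounded by |a|^{2}I\{|a|>\sqrt{\lambda}\}+|b|^{2}I\{|b|>\sqrt{\lambda}\}.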

Note

E[|Var[ϕ1,n,iXi]E[Var[ϕ1,n,iXi]]|I{|Var[ϕ1,n,iXi]E[Var[ϕ1,n,iXi]]|>λ}]\displaystyle E[|\operatorname*{Var}[\phi_{1,n,i}|X_{i}]-E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]|I\{|\operatorname*{Var}[\phi_{1,n,i}|X_{i}]-E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]|>\lambda\}]
E[|Var[ϕ1,n,iXi]|I{|Var[ϕ1,n,iXi]|>λ2}]+E[Var[ϕ1,n,i|Xi]]I{E[Var[ϕ1,n,i|Xi]]>λ2}\displaystyle\lesssim E\left[|\operatorname*{Var}[\phi_{1,n,i}|X_{i}]|I\left\{|\operatorname*{Var}[\phi_{1,n,i}|X_{i}]|>\frac{\lambda}{2}\right\}\right]+E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]I\left\{E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]>\frac{\lambda}{2}\right\}
E[E[ϕ1,n,i2|Xi]I{E[ϕ1,n,i2|Xi]>λ2}]+E[ϕ1,n,i2]I{E[ϕ1,n,i2]>λ2},\displaystyle\leq E\left[E[\phi_{1,n,i}^{2}|X_{i}]I\left\{E[\phi_{1,n,i}^{2}|X_{i}]>\frac{\lambda}{2}\right\}\right]+E[\phi_{1,n,i}^{2}]I\left\{E[\phi_{1,n,i}^{2}]>\frac{\lambda}{2}\right\}~{},

where in the second inequality we use the fact that the variance of a random variable is bounded by its second moment. Note Assumption 3.1 implies E[ϕ1,n,i2]E[\phi_{1,n,i}^{2}] is bounded for n1n\geq 1, and therefore

limλlim supnE[ϕ1,n,i2]I{E[ϕ1,n,i2]>λ2}=0.\lim_{\lambda\to\infty}\limsup_{n\to\infty}E[\phi_{1,n,i}^{2}]I\left\{E[\phi_{1,n,i}^{2}]>\frac{\lambda}{2}\right\}=0~{}.

On the other hand

E[E[ϕ1,n,i2|Xi]I{E[ϕ1,n,i2|Xi]>λ2}]\displaystyle E\left[E[\phi_{1,n,i}^{2}|X_{i}]I\left\{E[\phi_{1,n,i}^{2}|X_{i}]>\frac{\lambda}{2}\right\}\right] (64)
E[E[Yi2(1)|Xi]I{E[Yi2(1)|Xi]>λ12}]+E[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>λ3}]\displaystyle\lesssim E\left[E[Y_{i}^{2}(1)|X_{i}]I\left\{E[Y_{i}^{2}(1)|X_{i}]>\frac{\lambda}{12}\right\}\right]+E\left[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\lambda}{3}\right\}\right]
+E[E[m0,n2(Xi,Wi)|Xi]I{E[m0,n2(Xi,Wi)|Xi]>λ3}]\displaystyle\hskip 30.00005pt+E\left[E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\lambda}{3}\right\}\right]
+E[|E[Yi(1)m1,n(Xi,Wi)|Xi]|I{|E[Yi(1)m1,n(Xi,Wi)|Xi]|>λ12}]\displaystyle\hskip 30.00005pt+E\left[|E[Y_{i}(1)m_{1,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[Y_{i}(1)m_{1,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{12}\right\}\right]
+E[|E[Yi(1)m0,n(Xi,Wi)|Xi]|I{|E[Yi(1)m0,n(Xi,Wi)|Xi]|>λ12}]\displaystyle\hskip 30.00005pt+E\left[|E[Y_{i}(1)m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[Y_{i}(1)m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{12}\right\}\right]
+E[|E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi]|I{|E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi]|>λ6}].\displaystyle\hskip 30.00005pt+E\left[|E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{6}\right\}\right]~{}.

It follows from Assumptions 2.1(b) and 3.1(b) together with Lemma B.1 that

limλlim supnE[E[Yi2(1)|Xi]I{E[Yi2(1)|Xi]>λ12}]\displaystyle\lim_{\lambda\to\infty}\limsup_{n\to\infty}E\left[E[Y_{i}^{2}(1)|X_{i}]I\left\{E[Y_{i}^{2}(1)|X_{i}]>\frac{\lambda}{12}\right\}\right] =0\displaystyle=0
limλlim supnE[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>λ3}]\displaystyle\lim_{\lambda\to\infty}\limsup_{n\to\infty}E\left[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\lambda}{3}\right\}\right] =0\displaystyle=0
limλlim supnE[E[m0,n2(Xi,Wi)|Xi]I{E[m0,n2(Xi,Wi)|Xi]>λ3}]\displaystyle\lim_{\lambda\to\infty}\limsup_{n\to\infty}E\left[E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\lambda}{3}\right\}\right] =0.\displaystyle=0~{}.

For the last term in (64), note

E[|E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi]|I{|E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi]|>λ6}]\displaystyle E\left[|E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{6}\right\}\right]
E[E[|m1,n(Xi,Wi)m0,n(Xi,Wi)||Xi]I{E[|m1,n(Xi,Wi)m0,n(Xi,Wi)||Xi]>λ6}].\displaystyle\leq E\left[E[|m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})||X_{i}]I\left\{E[|m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})||X_{i}]>\frac{\lambda}{6}\right\}\right]~{}.

Meanwhile,

E[E[|m1,n(Xi,Wi)m0,n(Xi,Wi)|I{|m1,n(Xi,Wi)m0,n(Xi,Wi)|>λ}]\displaystyle E\left[E[|m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|I\left\{|m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|>\lambda\right\}\right]
E[m1,n2(Xi,Wi)I{|m1,n(Xi,Wi)|>λ}]+E[m0,n2(Xi,Wi)I{|m0,n(Xi,Wi)|>λ}].\displaystyle\leq E[m_{1,n}^{2}(X_{i},W_{i})I\{|m_{1,n}(X_{i},W_{i})|>\sqrt{\lambda}\}]+E[m_{0,n}^{2}(X_{i},W_{i})I\{|m_{0,n}(X_{i},W_{i})|>\sqrt{\lambda}\}]~{}.

It then follows from the previous two inequalities, Assumption 3.1(b), and Lemma B.1 that

limλlim supnE[|E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi]|I{|E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi]|>λ6}]=0.\lim_{\lambda\to\infty}\limsup_{n\to\infty}E\left[|E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{6}\right\}\right]=0~{}.

Similar arguments establish

limλlim supnE[|E[Yi(1)m1,n(Xi,Wi)|Xi]|I{|E[Yi(1)m1,n(Xi,Wi)|Xi]|>λ12}]\displaystyle\lim_{\lambda\to\infty}\limsup_{n\to\infty}E\left[|E[Y_{i}(1)m_{1,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[Y_{i}(1)m_{1,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{12}\right\}\right] =0\displaystyle=0
limλlim supnE[|E[Yi(1)m0,n(Xi,Wi)|Xi]|I{|E[Yi(1)m0,n(Xi,Wi)|Xi]|>λ12}]\displaystyle\lim_{\lambda\to\infty}\limsup_{n\to\infty}E\left[|E[Y_{i}(1)m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[Y_{i}(1)m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{12}\right\}\right] =0.\displaystyle=0~{}.

Therefore, (61) follows. The conclusion then follows from (60)–(61) and Assumption 3.1(a).  

Lemma B.3.

Suppose Assumptions 2.1–2.3 and 3.1 hold. Then,

limγlim supnE[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|2>γ}]=0.\lim_{\gamma\to\infty}\limsup_{n\to\infty}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}>\gamma\}]=0~{}.
Proof.

Note

E[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|2>γ}]\displaystyle E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}>\gamma\}]
E[(ϕ1,n,i2+E[ϕ1,n,i|Xi]2)I{ϕ1,n,i2+E[ϕ1,n,i|Xi]2>γ2}]\displaystyle\lesssim E\left[(\phi_{1,n,i}^{2}+E[\phi_{1,n,i}|X_{i}]^{2})I\left\{\phi_{1,n,i}^{2}+E[\phi_{1,n,i}|X_{i}]^{2}>\frac{\gamma}{2}\right\}\right]
E[ϕ1,n,i2I{ϕ1,n,i2>γ4}]+E[E[ϕ1,n,i|Xi]2I{E[ϕ1,n,i|Xi]2>γ4}].\displaystyle\lesssim E\left[\phi_{1,n,i}^{2}I\left\{\phi_{1,n,i}^{2}>\frac{\gamma}{4}\right\}\right]+E\left[E[\phi_{1,n,i}|X_{i}]^{2}I\left\{E[\phi_{1,n,i}|X_{i}]^{2}>\frac{\gamma}{4}\right\}\right]~{}.

where the first inequality follows from (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}) and the second inequality follows from (62). Next, note

E[E[ϕ1,n,i|Xi]2I{E[ϕ1,n,i|Xi]2>γ4}]\displaystyle E\left[E[\phi_{1,n,i}|X_{i}]^{2}I\left\{E[\phi_{1,n,i}|X_{i}]^{2}>\frac{\gamma}{4}\right\}\right]
E[E[Yi(1)|Xi]2I{E[Yi(1)|Xi]2>γ24}]+E[E[m1,n(Xi,Wi)|Xi]2I{E[m1,n(Xi,Wi)|Xi]2>γ6}]\displaystyle\lesssim E\left[E[Y_{i}(1)|X_{i}]^{2}I\left\{E[Y_{i}(1)|X_{i}]^{2}>\frac{\gamma}{24}\right\}\right]+E\left[E[m_{1,n}(X_{i},W_{i})|X_{i}]^{2}I\left\{E[m_{1,n}(X_{i},W_{i})|X_{i}]^{2}>\frac{\gamma}{6}\right\}\right]
+E[E[m0,n(Xi,Wi)|Xi]2I{E[m0,n(Xi,Wi)|Xi]2>γ6}]\displaystyle\hskip 30.00005pt+E\left[E[m_{0,n}(X_{i},W_{i})|X_{i}]^{2}I\left\{E[m_{0,n}(X_{i},W_{i})|X_{i}]^{2}>\frac{\gamma}{6}\right\}\right]
+E[|E[Yi(1)|Xi]E[m1,n(Xi,Wi)|Xi]|I{|E[Yi(1)|Xi]E[m1,n(Xi,Wi)|Xi]|>γ24}]\displaystyle\hskip 30.00005pt+E\left[|E[Y_{i}(1)|X_{i}]E[m_{1,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[Y_{i}(1)|X_{i}]E[m_{1,n}(X_{i},W_{i})|X_{i}]|>\frac{\gamma}{24}\right\}\right]
+E[|E[Yi(1)|Xi]E[m0,n(Xi,Wi)|Xi]|I{|E[Yi(1)|Xi]E[m0,n(Xi,Wi)|Xi]|>γ24}]\displaystyle\hskip 30.00005pt+E\left[|E[Y_{i}(1)|X_{i}]E[m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[Y_{i}(1)|X_{i}]E[m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\gamma}{24}\right\}\right]
+E[|E[m1,n(Xi,Wi)|Xi]E[m0,n(Xi,Wi)|Xi]|I{|E[m1,n(Xi,Wi)|Xi]E[m0,n(Xi,Wi)|Xi]|>γ12}]\displaystyle\hskip 30.00005pt+E\left[|E[m_{1,n}(X_{i},W_{i})|X_{i}]E[m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{1,n}(X_{i},W_{i})|X_{i}]E[m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\gamma}{12}\right\}\right]
E[E[Yi2(1)|Xi]I{E[Yi2(1)|Xi]>γ24}]+E[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>γ6}]\displaystyle\lesssim E\left[E[Y_{i}^{2}(1)|X_{i}]I\left\{E[Y_{i}^{2}(1)|X_{i}]>\frac{\gamma}{24}\right\}\right]+E\left[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\gamma}{6}\right\}\right]
+E[E[m0,n2(Xi,Wi)|Xi]I{E[m0,n2(Xi,Wi)|Xi]>γ6}]\displaystyle\hskip 30.00005pt+E\left[E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\gamma}{6}\right\}\right]
+E[|E[Yi(1)|Xi]|I{|E[Yi(1)|Xi]|>γ24}]\displaystyle\hskip 30.00005pt+E\left[|E[Y_{i}(1)|X_{i}]|I\left\{|E[Y_{i}(1)|X_{i}]|>\sqrt{\frac{\gamma}{24}}\right\}\right]
+E[|E[m1,n(Xi,Wi)|Xi]|I{|E[m1,n(Xi,Wi)|Xi]|>γ24}]\displaystyle\hskip 30.00005pt+E\left[|E[m_{1,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{1,n}(X_{i},W_{i})|X_{i}]|>\sqrt{\frac{\gamma}{24}}\right\}\right]
+E[|E[m0,n(Xi,Wi)|Xi]|I{|E[m0,n(Xi,Wi)|Xi]|>γ24}]\displaystyle\hskip 30.00005pt+E\left[|E[m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{0,n}(X_{i},W_{i})|X_{i}]|>\sqrt{\frac{\gamma}{24}}\right\}\right]
+E[|E[m1,n(Xi,Wi)|Xi]|I{|E[m1,n(Xi,Wi)|Xi]|>γ12}]\displaystyle\hskip 30.00005pt+E\left[|E[m_{1,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{1,n}(X_{i},W_{i})|X_{i}]|>\sqrt{\frac{\gamma}{12}}\right\}\right]
+E[|E[m0,n(Xi,Wi)|Xi]|I{|E[m0,n(Xi,Wi)|Xi]|>γ12}]\displaystyle\hskip 30.00005pt+E\left[|E[m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{0,n}(X_{i},W_{i})|X_{i}]|>\sqrt{\frac{\gamma}{12}}\right\}\right]
E[E[Yi2(1)|Xi]I{E[Yi2(1)|Xi]>γ24}]+E[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>γ6}]\displaystyle\leq E\left[E[Y_{i}^{2}(1)|X_{i}]I\left\{E[Y_{i}^{2}(1)|X_{i}]>\frac{\gamma}{24}\right\}\right]+E\left[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\gamma}{6}\right\}\right]
+E[E[m0,n2(Xi,Wi)|Xi]I{E[m0,n2(Xi,Wi)|Xi]>γ6}]\displaystyle\hskip 30.00005pt+E\left[E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\gamma}{6}\right\}\right]
+E[E[Yi2(1)|Xi]I{E[Yi2(1)|Xi]>γ24}]\displaystyle\hskip 30.00005pt+E\left[E[Y_{i}^{2}(1)|X_{i}]I\left\{E[Y_{i}^{2}(1)|X_{i}]>\frac{\gamma}{24}\right\}\right]
+E[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>γ24}]\displaystyle\hskip 30.00005pt+E\left[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\gamma}{24}\right\}\right]
+E[E[m0,n2(Xi,Wi)|Xi]I{E[m0,n2(Xi,Wi)|Xi]>γ24}]\displaystyle\hskip 30.00005pt+E\left[E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]>\sqrt{\frac{\gamma}{24}}\right\}\right]
+E[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>γ12}]\displaystyle\hskip 30.00005pt+E\left[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\sqrt{\frac{\gamma}{12}}\right\}\right]
+E[E[m0,n2(Xi,Wi)|Xi]I{E[m0,n2(Xi,Wi)|Xi]>γ12}],\displaystyle\hskip 30.00005pt+E\left[E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]>\sqrt{\frac{\gamma}{12}}\right\}\right]~{},

where the first inequality follows from (62), the second one follows from the conditional Jensen’s inequality and (63), and the third one follows again from the conditional Jensen’s inequality. It then follows from Lemma B.1 together with Assumptions 2.1(b) and 3.1(b) that

limγlim supnE[E[ϕ1,n,i|Xi]2I{E[ϕ1,n,i|Xi]2>γ4}]=0.\lim_{\gamma\to\infty}\limsup_{n\to\infty}E\left[E[\phi_{1,n,i}|X_{i}]^{2}I\left\{E[\phi_{1,n,i}|X_{i}]^{2}>\frac{\gamma}{4}\right\}\right]=0~{}.

Similar arguments lead to

limγlim supnE[ϕ1,n,i2I{ϕ1,n,i2>γ4}]=0.\lim_{\gamma\to\infty}\limsup_{n\to\infty}E\left[\phi_{1,n,i}^{2}I\left\{\phi_{1,n,i}^{2}>\frac{\gamma}{4}\right\}\right]=0~{}.

The conclusion then follows.  

Lemma B.4.

Suppose Assumptions in Theorem 5.1 hold. Then,

P{|1ni[2n]I{Di=d}ϵn,i(d)E[ϵn,i(d)]|log(2npn)n}1P\left\{\left|\frac{1}{n}\sum_{i\in[2n]}I\{D_{i}=d\}\epsilon_{n,i}(d)-E[\epsilon_{n,i}(d)]\right|\leq\sqrt{\frac{\log(2np_{n})}{n}}\right\}\rightarrow 1

and

P{Ωn1(d)1n1i2nI{Di=d}(ψn,iϵn,i(d)E[ψn,iϵn,i(d)])6σ¯σ¯log(2npn)n}1.P\left\{\left\|\Omega_{n}^{-1}(d)\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\psi_{n,i}\epsilon_{n,i}(d)-E[\psi_{n,i}\epsilon_{n,i}(d)])\right\|_{\infty}\leq\frac{6\bar{\sigma}}{\underaccent{\bar}{\sigma}}\sqrt{\frac{\log(2np_{n})}{n}}\right\}\to 1~{}.
Proof.

For the first result, we note that

|1ni[2n]I{Di=d}ϵn,i(d)E[ϵn,i(d)]|\displaystyle\left|\frac{1}{n}\sum_{i\in[2n]}I\{D_{i}=d\}\epsilon_{n,i}(d)-E[\epsilon_{n,i}(d)]\right| |1ni[2n]I{Di=d}(ϵn,i(d)E[ϵn,i(d)|Xi])|\displaystyle\leq\left|\frac{1}{n}\sum_{i\in[2n]}I\{D_{i}=d\}(\epsilon_{n,i}(d)-E[\epsilon_{n,i}(d)|X_{i}])\right|
+|1ni[2n](I{Di=d}1/2)(E[ϵn,i(d)|Xi]E[ϵn,i(d)])|\displaystyle+\left|\frac{1}{n}\sum_{i\in[2n]}(I\{D_{i}=d\}-1/2)(E[\epsilon_{n,i}(d)|X_{i}]-E[\epsilon_{n,i}(d)])\right|
+|12ni[2n](E[ϵn,i(d)|Xi]E[ϵn,i(d)])|.\displaystyle+\left|\frac{1}{2n}\sum_{i\in[2n]}(E[\epsilon_{n,i}(d)|X_{i}]-E[\epsilon_{n,i}(d)])\right|.

The first term on the RHS of the above display is O_{P}(1/\sqrt{n}) by Chebyshev's inequality conditional on (X^{(n)},D^{(n)}), because its summands are then independent with mean zero; the second term is O_{P}(1/\sqrt{n}) by Chebyshev's inequality conditional on X^{(n)}, using the pair structure of the design. The last term on the RHS is also O_{P}(1/\sqrt{n}) by Chebyshev's inequality. Because \sqrt{\log(2np_{n})/n} is of larger order than 1/\sqrt{n}, this implies the desired result.
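As a minimal sketch of the bound for the first term (assuming, for the purpose of this sketch only, that \operatorname*{Var}[\epsilon_{n,i}(d)|X_{i}]\leq c for a fixed constant c, and using that D^{(n)} is independent of the \epsilon_{n,i}(d) conditional on X^{(n)}),

\displaystyle\operatorname*{Var}\left[\frac{1}{n}\sum_{i\in[2n]}I\{D_{i}=d\}(\epsilon_{n,i}(d)-E[\epsilon_{n,i}(d)|X_{i}])\bigg{|}X^{(n)},D^{(n)}\right]=\frac{1}{n^{2}}\sum_{i\in[2n]}I\{D_{i}=d\}\operatorname*{Var}[\epsilon_{n,i}(d)|X_{i}]\leq\frac{c}{n}~{},

so that Chebyshev's inequality conditional on (X^{(n)},D^{(n)}) delivers the stated O_{P}(1/\sqrt{n}) rate.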

For the second result, define

\displaystyle\mathcal{E}_{n,0}(d)=\left\{\max_{d\in\{0,1\}}\frac{1}{2n}\sum_{1\leq i\leq 2n}E[\epsilon_{n,i}^{4}(d)|X_{i}]\leq c_{0}<\infty~{},\ \min_{1\leq l\leq p_{n}}\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\operatorname*{Var}[\psi_{n,i,l}\epsilon_{n,i}(d)|X_{i}]\geq\underaccent{\bar}{\sigma}^{2}>0\right\}~{},
n,1(d)={1n1i2nI{Di=d}(ψn,iϵn,i(d)E[ψn,iϵn,i(d)|Xi])2.04σ¯log(2npn)/n},\displaystyle\mathcal{E}_{n,1}(d)=\left\{\left\|\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\psi_{n,i}\epsilon_{n,i}(d)-E[\psi_{n,i}\epsilon_{n,i}(d)|X_{i}])\right\|_{\infty}\leq 2.04\overline{\sigma}\sqrt{\log(2np_{n})/n}\right\}~{},
n,2(d)={1n1i2nI{Di=d}(E[ψn,iϵn,i(d)|Xi]E[ψn,iϵn,i(d)])3.96σ¯log(2npn)/n},\displaystyle\mathcal{E}_{n,2}(d)=\left\{\left\|\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(E[\psi_{n,i}\epsilon_{n,i}(d)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(d)])\right\|_{\infty}\leq 3.96\overline{\sigma}\sqrt{\log(2np_{n})/n}\right\}~{},
n,3(d)={max1lpn|12n1i2n(E[ψn,i,l2ϵn,i2(d)|Xi]E[ψn,i,l2ϵn,i2(d)])|1/20.01σ¯},\displaystyle\mathcal{E}_{n,3}(d)=\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}(E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]-E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)])\right|^{1/2}\leq 0.01\overline{\sigma}\right\}~{},

and

n,4(d)={max1lpn|12n1i2n(2I{Di=d}1)(E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2])|1/20.01σ¯}.\displaystyle\mathcal{E}_{n,4}(d)=\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}(2I\{D_{i}=d\}-1)(E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}])\right|^{1/2}\leq 0.01\overline{\sigma}\right\}~{}.

We aim to show that P\{\mathcal{E}_{n,1}(d)\}\to 1 and P\{\mathcal{E}_{n,2}(d)\}\to 1. Granting these two results for the moment and letting C=6\bar{\sigma}/\underaccent{\bar}{\sigma}, we have

P{n(d)}\displaystyle P\{\mathcal{E}_{n}(d)\} =1P{nc(d)}\displaystyle=1-P\{\mathcal{E}_{n}^{c}(d)\}
1P{1n1i2nI{Di=d}(ψn,iϵn,i(d)E[ψn,iϵn,i(d)])Cσ¯log(2npn)n}\displaystyle\geq 1-P\left\{\left\|\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\psi_{n,i}\epsilon_{n,i}(d)-E[\psi_{n,i}\epsilon_{n,i}(d)])\right\|_{\infty}\geq C\underaccent{\bar}{\sigma}\sqrt{\frac{\log(2np_{n})}{n}}\right\}
=1P{1n1i2nI{Di=d}(ψn,iϵn,i(d)E[ψn,iϵn,i(d)])6σ¯log(2npn)n}\displaystyle=1-P\left\{\left\|\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\psi_{n,i}\epsilon_{n,i}(d)-E[\psi_{n,i}\epsilon_{n,i}(d)])\right\|_{\infty}\geq 6\bar{\sigma}\sqrt{\frac{\log(2np_{n})}{n}}\right\}
1P{n,1c(d)}P{n,2c(d)}1.\displaystyle\geq 1-P\{\mathcal{E}_{n,1}^{c}(d)\}-P\{\mathcal{E}_{n,2}^{c}(d)\}\to 1~{}.

First, we show P{n,3(d)}1P\{\mathcal{E}_{n,3}(d)\}\to 1. Let

tn=Clog(npn)Ξn2n0t_{n}=C\sqrt{\frac{\log(np_{n})\Xi_{n}^{2}}{n}}\to 0

for some sufficiently large constant C>0, and let \{e_{i}\}_{1\leq i\leq 2n} be a sequence of i.i.d. Rademacher random variables independent of everything else. Then, for any fixed t>0, we have

(14max1lpnVar[E[ψn,i,l2ϵn,i2(d)|Xi]]2nt2)P{max1lpn|12n1i2n[E[ψn,i,l2ϵn,i2(d)|Xi]E[ψn,i,l2ϵn,i2(d)]]|t}\displaystyle\left(1-\frac{4\max_{1\leq l\leq p_{n}}\operatorname*{Var}[E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]]}{2nt^{2}}\right)P\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}\left[E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]-E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)]\right]\right|\geq t\right\}
2P{max1lpn|12n1i2n4eiE[ψn,i,l2ϵn,i2(d)|Xi]|t}\displaystyle\leq 2P\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}4e_{i}E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]\right|\geq t\right\}
=o(1)+2E[P{max1lpn|12n1i2n4eiE[ψn,i,l2ϵn,i2(d)|Xi]|t|X(n)}I{n,0(d)}]\displaystyle=o(1)+2E\left[P\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}4e_{i}E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]\right|\geq t\bigg{|}X^{(n)}\right\}I\{\mathcal{E}_{n,0}(d)\}\right]
o(1)+pnexp(nt2Ξn2c)=o(1),\displaystyle\lesssim o(1)+p_{n}\exp\left(-\frac{nt^{2}}{\Xi_{n}^{2}c}\right)=o(1),

where the first inequality is by van der Vaart and Wellner (1996, Lemma 2.3.7), and the second inequality is by Hoeffding's inequality conditional on X^{(n)} and the fact that, on \mathcal{E}_{n,0}(d),

12n1i2n(E[ψn,i,l2ϵn,i2(d)|Xi])2\displaystyle\frac{1}{2n}\sum_{1\leq i\leq 2n}(E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}])^{2} 12n1i2nE[ψn,i,l4|Xi]E[ϵn,i4(d)|Xi]\displaystyle\leq\frac{1}{2n}\sum_{1\leq i\leq 2n}E[\psi_{n,i,l}^{4}|X_{i}]E[\epsilon_{n,i}^{4}(d)|X_{i}]
Ξn22n1i2nE[ψn,i,l2|Xi]E[ϵn,i4(d)|Xi]Ξn2Cc0,\displaystyle\leq\frac{\Xi_{n}^{2}}{2n}\sum_{1\leq i\leq 2n}E[\psi_{n,i,l}^{2}|X_{i}]E[\epsilon_{n,i}^{4}(d)|X_{i}]\leq\Xi_{n}^{2}Cc_{0},

where CC is a fixed constant, and the last equality is by the fact that log(pn)Ξn2=o(n)\log(p_{n})\Xi_{n}^{2}=o(n). Furthermore, we note that

4max1lpnVar[E[ψn,i,l2ϵn,i2(d)|Xi]]2n\displaystyle\frac{4\max_{1\leq l\leq p_{n}}\operatorname*{Var}[E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]]}{2n}
max1lpnE[E[ψn,i,l4|Xi]E[ϵn,i4(d)|Xi]]n\displaystyle\lesssim\frac{\max_{1\leq l\leq p_{n}}E\left[E[\psi_{n,i,l}^{4}|X_{i}]E[\epsilon_{n,i}^{4}(d)|X_{i}]\right]}{n}
Ξn2max1lpnE[E[ψn,i,l2|Xi]E[ϵn,i4(d)|Xi]]n\displaystyle\lesssim\frac{\Xi_{n}^{2}\max_{1\leq l\leq p_{n}}E\left[E[\psi_{n,i,l}^{2}|X_{i}]E[\epsilon_{n,i}^{4}(d)|X_{i}]\right]}{n}
Ξn2E[ϵn,i4(d)]n\displaystyle\lesssim\frac{\Xi_{n}^{2}E[\epsilon_{n,i}^{4}(d)]}{n}
=o(1).\displaystyle=o(1).

Therefore, we have

P{max1lpn|12n1i2n[E[ψn,i,l2ϵn,i2(d)|Xi]E[ψn,i,l2ϵn,i2(d)]]|t}=o(1)\displaystyle P\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}\left[E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]-E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)]\right]\right|\geq t\right\}=o(1)

for any fixed t>0t>0, which is the desired result.
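In particular, taking the fixed value t=(0.01\overline{\sigma})^{2}, which is the threshold appearing in the definition of \mathcal{E}_{n,3}(d), the display above gives

\displaystyle P\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}\left[E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]-E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)]\right]\right|>(0.01\overline{\sigma})^{2}\right\}=o(1)~{},

which is exactly P\{\mathcal{E}_{n,3}(d)\}\to 1.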

Next, we show P{n,4(d)}1P\{\mathcal{E}_{n,4}(d)\}\to 1. Define an,i,l=E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2]a_{n,i,l}=E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}]. Then, we have

P{max1lpn|12n1i2n(2I{Di=d}1)(E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2])|>t|X(n)}I{n,0(d)}\displaystyle P\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}(2I\{D_{i}=d\}-1)(E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}])\right|>t\bigg{|}X^{(n)}\right\}I\{\mathcal{E}_{n,0}(d)\}
1lpnP{|12n1jn(I{Dπ(2j1)=d}I{Dπ(2j)=d})(an,π(2j1),lan,π(2j),l)|>t|X(n)}I{n,0(d)}\displaystyle\leq\sum_{1\leq l\leq p_{n}}P\left\{\left|\frac{1}{2n}\sum_{1\leq j\leq n}(I\{D_{\pi(2j-1)}=d\}-I\{D_{\pi(2j)}=d\})(a_{n,\pi(2j-1),l}-a_{n,\pi(2j),l})\right|>t\bigg{|}X^{(n)}\right\}I\{\mathcal{E}_{n,0}(d)\}
1lpnexp(2nt21n1jn(an,π(2j1),lan,π(2j),l)2)I{n,0(d)}\displaystyle\leq\sum_{1\leq l\leq p_{n}}\exp\left(-\frac{2nt^{2}}{\frac{1}{n}\sum_{1\leq j\leq n}(a_{n,\pi(2j-1),l}-a_{n,\pi(2j),l})^{2}}\right)I\{\mathcal{E}_{n,0}(d)\}
exp(log(pn)2nt2Ξn2c2),\displaystyle\leq\exp\left(\log(p_{n})-\frac{2nt^{2}}{\Xi_{n}^{2}c^{2}}\right)~{},

where, conditional on X^{(n)}, \{I\{D_{\pi(2j-1)}=d\}-I\{D_{\pi(2j)}=d\}\}_{1\leq j\leq n} is a sequence of i.i.d. Rademacher random variables, the second-to-last inequality is by Hoeffding's inequality, and the last inequality follows from the fact that, on \mathcal{E}_{n,0}(d),

(1n1jn(an,π(2j1),lan,π(2j),l)2)1/2\displaystyle\left(\frac{1}{n}\sum_{1\leq j\leq n}(a_{n,\pi(2j-1),l}-a_{n,\pi(2j),l})^{2}\right)^{1/2}
(1n1jn(E[ψn,π(2j1),l2ϵn,i2(d)|Xπ(2j1)])2)1/2+(1n1jn(E[ψn,π(2j),l2ϵn,i2(d)|Xπ(2j)])2)1/2\displaystyle\leq\left(\frac{1}{n}\sum_{1\leq j\leq n}(E[\psi_{n,\pi(2j-1),l}^{2}\epsilon_{n,i}^{2}(d)|X_{\pi(2j-1)}])^{2}\right)^{1/2}+\left(\frac{1}{n}\sum_{1\leq j\leq n}(E[\psi_{n,\pi(2j),l}^{2}\epsilon_{n,i}^{2}(d)|X_{\pi(2j)}])^{2}\right)^{1/2}
(2n1i2n(E[ψn,i,l2ϵn,i2(d)|Xi])2)1/2\displaystyle\leq\left(\frac{2}{n}\sum_{1\leq i\leq 2n}(E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}])^{2}\right)^{1/2}
Ξnc.\displaystyle\leq\Xi_{n}c~{}.

By letting t=Clog(pn)Ξn2nt=C\sqrt{\frac{\log(p_{n})\Xi_{n}^{2}}{n}} for some sufficiently large CC and noting that P{n,0(d)}1P\{\mathcal{E}_{n,0}(d)\}\rightarrow 1, we have

max1lpn|1i2n(2I{Di=d}1)(E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2])2n|=Op(log(pn)Ξn2n),\displaystyle\max_{1\leq l\leq p_{n}}\left|\sum_{1\leq i\leq 2n}\frac{(2I\{D_{i}=d\}-1)(E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}])}{2n}\right|=O_{p}\left(\sqrt{\frac{\log(p_{n})\Xi_{n}^{2}}{n}}\right)~{},

and thus, P{n,4(d)}1P\{\mathcal{E}_{n,4}(d)\}\to 1.
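As a small check of this step (assuming, as we may for this sketch, that C is chosen with 2C^{2}>c^{2} and that p_{n}\to\infty), substituting t=C\sqrt{\log(p_{n})\Xi_{n}^{2}/n} into the exponential bound above gives

\displaystyle\exp\left(\log(p_{n})-\frac{2nt^{2}}{\Xi_{n}^{2}c^{2}}\right)=\exp\left(\log(p_{n})\left(1-\frac{2C^{2}}{c^{2}}\right)\right)=p_{n}^{1-2C^{2}/c^{2}}\to 0~{},

and since \log(p_{n})\Xi_{n}^{2}=o(n), the O_{p}(\sqrt{\log(p_{n})\Xi_{n}^{2}/n}) bound in the preceding display is eventually below the fixed threshold (0.01\overline{\sigma})^{2} appearing in the definition of \mathcal{E}_{n,4}(d), with probability approaching one.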

Next, we show P{n,1(d)}1P\{\mathcal{E}_{n,1}(d)\}\to 1. We note that, for d{0,1}d\in\{0,1\}, conditional on (D(n),X(n))(D^{(n)},X^{(n)}), {ψn,iϵn,i(d)}1i2n\{\psi_{n,i}\epsilon_{n,i}(d)\}_{1\leq i\leq 2n} are independent. In what follows, we couple

𝕌n=1n1i2nI{Di=d}(ψn,iϵn,i(d)E[ψn,iϵn,i(d)|Xi])\mathbb{U}_{n}=\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\psi_{n,i}\epsilon_{n,i}(d)-E[\psi_{n,i}\epsilon_{n,i}(d)|X_{i}])

with a centered Gaussian random vector as in Theorem 2.1 in Chernozhukov et al. (2017). Specifically, let Z=(Z_{1},\ldots,Z_{p_{n}}) be a Gaussian random vector in R^{p_{n}} with E[Z_{l}]=0 for 1\leq l\leq p_{n} and \operatorname*{Var}[Z]=\operatorname*{Var}[\mathbb{U}_{n}|X^{(n)},D^{(n)}], which satisfies the conditions of that theorem. In particular, on \mathcal{E}_{n,0}(d)\cap\mathcal{E}_{n,3}(d)\cap\mathcal{E}_{n,4}(d),

E[ZZ]\displaystyle E[ZZ^{\prime}] =1n21i2nI{Di=d}E[ϵn,i2(d)ψn,iψn,i|Xi]\displaystyle=\frac{1}{n^{2}}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\epsilon_{n,i}^{2}(d)\psi_{n,i}\psi_{n,i}^{\prime}|X_{i}]
1n(1n1i2nI{Di=d}E[ϵn,i(d)ψn,i|Xi])(1n1i2nI{Di=d}E[ϵn,i(d)ψn,i|Xi])\displaystyle-\frac{1}{n}\left(\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\epsilon_{n,i}(d)\psi_{n,i}|X_{i}]\right)\left(\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\epsilon_{n,i}(d)\psi_{n,i}|X_{i}]\right)^{\prime}

and

max1lpnE[Zl2]\displaystyle\max_{1\leq l\leq p_{n}}E[Z_{l}^{2}] max1lpn1i2nI{Di=d}E[ϵn,i2(d)ψn,i,l2|Xi]n2\displaystyle\leq\frac{\max_{1\leq l\leq p_{n}}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]}{n^{2}}
σ¯2n+max1lpn1i2nI{Di=d}(E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2])n2\displaystyle\leq\frac{\overline{\sigma}^{2}}{n}+\frac{\max_{1\leq l\leq p_{n}}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}])}{n^{2}}
σ¯2n+max1lpn1i2n(2I{Di=d}1)(E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2])2n2\displaystyle\leq\frac{\overline{\sigma}^{2}}{n}+\frac{\max_{1\leq l\leq p_{n}}\sum_{1\leq i\leq 2n}(2I\{D_{i}=d\}-1)(E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}])}{2n^{2}}
+max1lpn1i2n(E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2])2n2\displaystyle+\frac{\max_{1\leq l\leq p_{n}}\sum_{1\leq i\leq 2n}(E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}])}{2n^{2}}
1.02σ¯2n.\displaystyle\leq\frac{1.02\overline{\sigma}^{2}}{n}~{}.

Further define q(1α)q(1-\alpha) as the (1α)(1-\alpha) quantile of ||Z||||Z||_{\infty}. Then, we have

q(11/n)1.02σ¯(2log(2pn)+2log(n))n2.04σ¯log(2npn)/n,\displaystyle q(1-1/n)\leq\frac{1.02\overline{\sigma}(\sqrt{2\log(2p_{n})}+\sqrt{2\log(n)})}{\sqrt{n}}\leq 2.04\overline{\sigma}\sqrt{\log(2np_{n})/n},

where the first inequality is by the last display in the proof of Lemma E.2 in Chetverikov and Sørensen (2022) and the second inequality is by the fact that a+b2(a+b)\sqrt{a}+\sqrt{b}\leq\sqrt{2(a+b)} for a,b>0a,b>0. Therefore, we have

\displaystyle P\{\mathcal{E}_{n,1}^{c}(d)\}\leq P\{\mathcal{E}_{n,1}^{c}(d),\mathcal{E}_{n,0}(d),\mathcal{E}_{n,3}(d),\mathcal{E}_{n,4}(d)\}+o(1)
\displaystyle=E\left[P\{\mathcal{E}_{n,1}^{c}(d)|D^{(n)},X^{(n)}\}I\{\mathcal{E}_{n,0}(d),\mathcal{E}_{n,3}(d),\mathcal{E}_{n,4}(d)\}\right]+o(1)
\displaystyle\leq E\left[P\{||Z||_{\infty}\geq 2.04\overline{\sigma}\sqrt{\log(2np_{n})/n}|D^{(n)},X^{(n)}\}I\{\mathcal{E}_{n,0}(d),\mathcal{E}_{n,3}(d),\mathcal{E}_{n,4}(d)\}\right]+o(1)
\displaystyle\leq E\left[P\{||Z||_{\infty}\geq q(1-1/n)|D^{(n)},X^{(n)}\}\right]+o(1)=o(1)~{},

where the second inequality is by Theorem 2.1 in Chernozhukov et al. (2017).
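For completeness, the elementary arithmetic behind the second inequality in the display bounding q(1-1/n) is the following; it uses only \sqrt{a}+\sqrt{b}\leq\sqrt{2(a+b)} and \log(2p_{n})+\log(n)=\log(2np_{n}):

\displaystyle\sqrt{2\log(2p_{n})}+\sqrt{2\log(n)}\leq\sqrt{2\left(2\log(2p_{n})+2\log(n)\right)}=2\sqrt{\log(2p_{n})+\log(n)}=2\sqrt{\log(2np_{n})}~{},

so that 1.02\overline{\sigma}(\sqrt{2\log(2p_{n})}+\sqrt{2\log(n)})/\sqrt{n}\leq 2.04\overline{\sigma}\sqrt{\log(2np_{n})/n}.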

Finally, we turn to n,2(d)\mathcal{E}_{n,2}(d) with d=1d=1. We have

1n1i2nI{Di=1}(E[ψn,iϵn,i(1)|Xi]E[ψn,iϵn,i(1)])\displaystyle\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=1\}(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])
=12n1i2n(E[ψn,iϵn,i(1)|Xi]E[ψn,iϵn,i(1)])+12n1i2n(2Di1)(E[ψn,iϵn,i(1)|Xi]E[ψn,iϵn,i(1)]).\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])+\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}-1)(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)]). (65)

Note that \{E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)]\}_{1\leq i\leq 2n} is a sequence of independent, centered random vectors and

max1lpnE[(E[ψn,i,lϵn,i(1)|Xi]E[ψn,i,lϵn,i(1)])2]σ¯2.\displaystyle\max_{1\leq l\leq p_{n}}E[(E[\psi_{n,i,l}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i,l}\epsilon_{n,i}(1)])^{2}]\leq\overline{\sigma}^{2}.

Following Theorem 2.1 in Chernozhukov et al. (2017), Lemma E.2 in Chetverikov and Sørensen (2022), and similar arguments to the ones above, we have

\displaystyle P\left\{\left\|\frac{1}{2n}\sum_{1\leq i\leq 2n}(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])\right\|_{\infty}\leq\overline{\sigma}\sqrt{2\log(2np_{n})/n}\right\}\to 1~{}. (66)

For the second term on the RHS of (65), we define gn,i,l=E[ψn,i,lϵn,i(1)|Xi]E[ψn,i,lϵn,i(1)]g_{n,i,l}=E[\psi_{n,i,l}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i,l}\epsilon_{n,i}(1)]. We have

\displaystyle P\left\{\left\|\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}-1)(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])\right\|_{\infty}>t\bigg{|}X^{(n)}\right\}
1lpnP{|12n1jn(Dπ(2j1)Dπ(2j))(gn,π(2j1),lgn,π(2j),l)|>t|X(n)}\displaystyle\leq\sum_{1\leq l\leq p_{n}}P\left\{\left|\frac{1}{2n}\sum_{1\leq j\leq n}(D_{\pi(2j-1)}-D_{\pi(2j)})(g_{n,\pi(2j-1),l}-g_{n,\pi(2j),l})\right|>t\bigg{|}X^{(n)}\right\}
1lpnexp(2nt21n1jn(gn,π(2j1),lgn,π(2j),l)2),\displaystyle\leq\sum_{1\leq l\leq p_{n}}\exp\left(-\frac{2nt^{2}}{\frac{1}{n}\sum_{1\leq j\leq n}(g_{n,\pi(2j-1),l}-g_{n,\pi(2j),l})^{2}}\right)~{},

where, conditional on X(n)X^{(n)}, {(Dπ(2j1)Dπ(2j))}1jn\{(D_{\pi(2j-1)}-D_{\pi(2j)})\}_{1\leq j\leq n} is a sequence of i.i.d. Rademacher random variables and the last inequality is by Hoeffding’s inequality. In addition, on n,3(1)\mathcal{E}_{n,3}(1), we have

(1n1jn(gn,π(2j1),lgn,π(2j),l)2)1/2\displaystyle\left(\frac{1}{n}\sum_{1\leq j\leq n}(g_{n,\pi(2j-1),l}-g_{n,\pi(2j),l})^{2}\right)^{1/2}
(1n1jn(E[ψn,π(2j1),lϵn,i(1)|Xπ(2j1)])2)1/2+(1n1jn(E[ψn,π(2j),lϵn,i(1)|Xπ(2j)])2)1/2\displaystyle\leq\left(\frac{1}{n}\sum_{1\leq j\leq n}(E[\psi_{n,\pi(2j-1),l}\epsilon_{n,i}(1)|X_{\pi(2j-1)}])^{2}\right)^{1/2}+\left(\frac{1}{n}\sum_{1\leq j\leq n}(E[\psi_{n,\pi(2j),l}\epsilon_{n,i}(1)|X_{\pi(2j)}])^{2}\right)^{1/2}
(2n1i2n(E[ψn,i,lϵn,i(1)|Xi])2)1/2\displaystyle\leq\left(\frac{2}{n}\sum_{1\leq i\leq 2n}(E[\psi_{n,i,l}\epsilon_{n,i}(1)|X_{i}])^{2}\right)^{1/2}
(2n1i2nE[ψn,i,l2ϵn,i2(1)|Xi])1/2\displaystyle\leq\left(\frac{2}{n}\sum_{1\leq i\leq 2n}E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(1)|X_{i}]\right)^{1/2}
(2n1i2n[E[ψn,i,l2ϵn,i2(1)|Xi]E[ψn,i,l2ϵn,i2(1)]])1/2+2σ¯\displaystyle\leq\left(\frac{2}{n}\sum_{1\leq i\leq 2n}\left[E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(1)|X_{i}]-E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(1)]\right]\right)^{1/2}+2\overline{\sigma}
2.02σ¯.\displaystyle\leq 2.02\overline{\sigma}.

Therefore, we have

\displaystyle P\left\{\left\|\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}-1)(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])\right\|_{\infty}>2.02\sqrt{\frac{\log(np_{n})\overline{\sigma}^{2}}{n}}\right\}
\displaystyle\leq P\left\{\left\|\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}-1)(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])\right\|_{\infty}>2.02\sqrt{\frac{\log(np_{n})\overline{\sigma}^{2}}{n}},\mathcal{E}_{n,3}(1)\right\}+o(1)
\displaystyle\leq E\left[P\left\{\left\|\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}-1)(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])\right\|_{\infty}>2.02\sqrt{\frac{\log(np_{n})\overline{\sigma}^{2}}{n}}\bigg{|}X^{(n)}\right\}I\{\mathcal{E}_{n,3}(1)\}\right]+o(1)
\displaystyle=o(1)~{}. (67)

Combining (65), (66), (67), and the fact that \sqrt{2}+2.02\leq 1.42+2.02=3.44\leq 3.96, we have P\{\mathcal{E}_{n,2}(1)\}\to 1. The same argument applies to \mathcal{E}_{n,2}(0).

Appendix C Details for Simulations

The regressors in the LASSO-based adjustment are as follows; a short code sketch illustrating the construction for Models 1-6 is given after the list.

  1. (i)

    For Models 1-6, we use {1,Xi,Wi,Xi2,Wi2,XiWi,(XiX~)I{Xi>X~},(WiW~)I{Wi>W~},(XiX~)2I{Xi>X~},(WiW~)2I{Wi>W~}}\{1,X_{i},W_{i},X_{i}^{2},W_{i}^{2},X_{i}W_{i},(X_{i}-\tilde{X})I\{X_{i}>\tilde{X}\},(W_{i}-\tilde{W})I\{W_{i}>\tilde{W}\},(X_{i}-\tilde{X})^{2}I\{X_{i}>\tilde{X}\},(W_{i}-\tilde{W})^{2}I\{W_{i}>\tilde{W}\}\} where X~\tilde{X} and W~\tilde{W} are the sample medians of {Xi}i[2n]\{X_{i}\}_{i\in[2n]} and {Wi}i[2n]\{W_{i}\}_{i\in[2n]}, respectively.

  2. (ii)

    For Models 7-9, we use \{1,X_{i},W_{i},X_{i}^{2},W_{i}^{2},X_{i1}W_{i1},X_{i2}W_{i1},X_{i1}W_{i2},X_{i2}W_{i2},(X_{ij}-\tilde{X_{j}})I\{X_{ij}>\tilde{X_{j}}\},(X_{ij}-\tilde{X_{j}})^{2}I\{X_{ij}>\tilde{X_{j}}\},(W_{ij}-\tilde{W_{j}})I\{W_{ij}>\tilde{W_{j}}\},(W_{ij}-\tilde{W_{j}})^{2}I\{W_{ij}>\tilde{W_{j}}\}\}, where \tilde{X_{j}} and \tilde{W_{j}}, for j=1,2, are the sample medians of \{X_{ij}\}_{i\in[2n]} and \{W_{ij}\}_{i\in[2n]}, respectively.

  3. (iii)

    For Models 10-11, we use {1,Xi,Wi,Xi2,Wi2,Xi1Wi1,Xi2Wi2,Xi3Wi1,Xi4Wi2,(XijXj~)I{Xij>Xj~},(XijXj~)2I{Xij>Xj~},(WijWj~)I{Wij>Wj~},(WijWj~)2I{Wij>Wj~}}\{1,X_{i},W_{i},X_{i}^{2},W_{i}^{2},X_{i1}W_{i1},X_{i2}W_{i2},X_{i3}W_{i1},X_{i4}W_{i2},(X_{ij}-\tilde{X_{j}})I\{X_{ij}>\tilde{X_{j}}\},(X_{ij}-\tilde{X_{j}})^{2}I\{X_{ij}>\tilde{X_{j}}\},(W_{ij}-\tilde{W_{j}})I\{W_{ij}>\tilde{W_{j}}\},(W_{ij}-\tilde{W_{j}})^{2}I\{W_{ij}>\tilde{W_{j}}\}\} where Xj~\tilde{X_{j}},for j=1,2,3,4j=1,2,3,4, and Wj~\tilde{W_{j}}, for j=1,2j=1,2, are the sample medians of {Xij}i[2n]\{X_{ij}\}_{i\in[2n]} and {Wij}i[2n]\{W_{ij}\}_{i\in[2n]}, respectively.

  4. (iv)

    Models 12-15 already contain high-dimensional covariates. We just use XiX_{i} and WiW_{i} as the LASSO regressors.
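To make the construction above concrete, the following is a minimal Python sketch (using numpy) of how the Model 1-6 regressor set could be assembled from the covariate vectors \{X_{i}\}_{i\in[2n]} and \{W_{i}\}_{i\in[2n]}; the function name build_lasso_regressors is purely illustrative and is not taken from any replication code.

import numpy as np

def build_lasso_regressors(x, w):
    # Regressors for Models 1-6:
    # {1, X, W, X^2, W^2, XW, (X - X~)I{X > X~}, (W - W~)I{W > W~},
    #  (X - X~)^2 I{X > X~}, (W - W~)^2 I{W > W~}},
    # where X~ and W~ are the sample medians of X and W.
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    x_med, w_med = np.median(x), np.median(w)
    hinge_x = (x - x_med) * (x > x_med)   # (X - X~) I{X > X~}
    hinge_w = (w - w_med) * (w > w_med)   # (W - W~) I{W > W~}
    return np.column_stack([
        np.ones_like(x), x, w, x**2, w**2, x * w,
        hinge_x, hinge_w, hinge_x**2, hinge_w**2,
    ])

The resulting 2n-by-10 matrix would then play the role of the regressors \psi_{n,i} in the regularized adjustment; Models 7-11 extend the same idea coordinate by coordinate using the medians \tilde{X_{j}} and \tilde{W_{j}}.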

References

  • Abadie and Imbens (2008) Abadie, A. and Imbens, G. W. (2008). Estimation of the Conditional Variance in Paired Experiments. Annales d’Économie et de Statistique 175–187.
  • Armstrong (2022) Armstrong, T. B. (2022). Asymptotic Efficiency Bounds for a Class of Experimental Designs. ArXiv:2205.02726 [stat], URL http://arxiv.org/abs/2205.02726.
  • Bai et al. (2023a) Bai, Y., Liu, J., Shaikh, A. M. and Tabord-Meehan, M. (2023a). On the Efficiency of Finely Stratified Experiments. ArXiv:2307.15181 [econ, math, stat], URL http://arxiv.org/abs/2307.15181.
  • Bai et al. (2023b) Bai, Y., Liu, J. and Tabord-Meehan, M. (2023b). Inference for Matched Tuples and Fully Blocked Factorial Designs. ArXiv:2206.04157 [econ, math, stat], URL http://arxiv.org/abs/2206.04157.
  • Bai et al. (2022) Bai, Y., Romano, J. P. and Shaikh, A. M. (2022). Inference in Experiments With Matched Pairs. Journal of the American Statistical Association, 117 1726–1737. URL https://doi.org/10.1080/01621459.2021.1883437.
  • Belloni et al. (2017) Belloni, A., Chernozhukov, V., Fernández-Val, I. and Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, 85 233–298.
  • Bickel et al. (2009) Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37 1705–1732. URL https://projecteuclid.org/journals/annals-of-statistics/volume-37/issue-4/Simultaneous-analysis-of-Lasso-and-Dantzig-selector/10.1214/08-AOS620.full.
  • Bruhn and McKenzie (2009) Bruhn, M. and McKenzie, D. (2009). In pursuit of balance: Randomization in practice in development field experiments. American Economic Journal: Applied Economics, 1 200–232.
  • Chernozhukov et al. (2014) Chernozhukov, V., Chetverikov, D. and Kato, K. (2014). Gaussian approximation of suprema of empirical processes. The Annals of Statistics, 42 1564–1597.
  • Chernozhukov et al. (2017) Chernozhukov, V., Chetverikov, D. and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. The Annals of Probability, 45 2309–2352. URL https://projecteuclid.org/journals/annals-of-probability/volume-45/issue-4/Central-limit-theorems-and-bootstrap-in-high-dimensions/10.1214/16-AOP1113.full.
  • Chetverikov and Sørensen (2022) Chetverikov, D. and Sørensen, J. R.-V. (2022). Analytic and Bootstrap-after-Cross-Validation Methods for Selecting Penalty Parameters of High-Dimensional M-Estimators. ArXiv:2104.04716 [econ, math, stat], URL http://arxiv.org/abs/2104.04716.
  • Cohen and Fogarty (2023) Cohen, P. L. and Fogarty, C. B. (2023). No-harm calibration for generalized oaxaca-blinder estimators. Biometrika. Forthcoming.
  • Cytrynbaum (2023) Cytrynbaum, M. (2023). Covariate adjustment in stratified experiments.
  • Donner and Klar (2000) Donner, A. and Klar, N. (2000). Design and analysis of cluster randomization trials in health research, vol. 27. Arnold London.
  • Dudley (1989) Dudley, R. M. (1989). Real Analysis and Probability. Wadsworth and Brook/Cole.
  • Freedman (2008) Freedman, D. A. (2008). On regression adjustments to experimental data. Advances in Applied Mathematics, 40 180–193. URL http://www.sciencedirect.com/science/article/pii/S019688580700005X.
  • Glennerster and Takavarasha (2013) Glennerster, R. and Takavarasha, K. (2013). Running randomized evaluations: A practical guide. Princeton University Press.
  • Groh and McKenzie (2016) Groh, M. and McKenzie, D. (2016). Macroinsurance for microenterprises: A randomized experiment in post-revolution egypt. Journal of Development Economics, 118 13–25.
  • Jiang et al. (2022) Jiang, L., Liu, X., Phillips, P. C. and Zhang, Y. (2022). Bootstrap inference for quantile treatment effects in randomized experiments with matched pairs. Review of Economics and Statistics. Forthcoming.
  • Ledoux and Talagrand (1991) Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Classics in Mathematics, Springer-Verlag, Berlin Heidelberg. URL https://www.springer.com/gp/book/9783642202117.
  • Lehmann and Romano (2005) Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses. 3rd ed. Springer, New York.
  • Lin (2013) Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. Annals of Applied Statistics, 7 295–318. URL https://projecteuclid.org/euclid.aoas/1365527200.
  • Negi and Wooldridge (2021) Negi, A. and Wooldridge, J. M. (2021). Revisiting regression adjustment in experiments with heterogeneous treatment effects. Econometric Reviews, 40 504–534. URL https://doi.org/10.1080/07474938.2020.1824732.
  • Robins et al. (1995) Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1995). Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data. Journal of the American Statistical Association, 90 106–121. URL https://www.jstor.org/stable/2291134.
  • Rosenberger and Lachin (2015) Rosenberger, W. F. and Lachin, J. M. (2015). Randomization in clinical trials: Theory and Practice. John Wiley & Sons.
  • Tsiatis et al. (2008) Tsiatis, A. A., Davidian, M., Zhang, M. and Lu, X. (2008). Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Statistics in Medicine, 27 4658–4677.
  • van der Vaart and Wellner (1996) van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes with Applications to Statistics. Springer-Verlag, New York.
  • Wager et al. (2016) Wager, S., Du, W., Taylor, J. and Tibshirani, R. J. (2016). High-dimensional regression adjustments in randomized experiments. Proceedings of the National Academy of Sciences, 113 12673–12678.
  • Wu and Gagnon-Bartsch (2021) Wu, E. and Gagnon-Bartsch, J. A. (2021). Design-based covariate adjustments in paired experiments. Journal of Educational and Behavioral Statistics, 46 109–132.
  • Yang and Tsiatis (2001) Yang, L. and Tsiatis, A. A. (2001). Efficiency Study of Estimators for a Treatment Effect in a Pretest–Posttest Trial. The American Statistician, 55 314–321. URL https://doi.org/10.1198/000313001753272466.
  • Zhao and Ding (2021) Zhao, A. and Ding, P. (2021). Covariate-adjusted Fisher randomization tests for the average treatment effect. Journal of Econometrics, 225 278–294. URL https://www.sciencedirect.com/science/article/pii/S0304407621001457.