
A unified framework for covariate adjustment under stratified randomization

Fuyi Tu1,2,   Wei Ma2,   Hanzhong Liu3

1 School of Science, Chongqing University of Posts and Telecommunications, Chongqing, China
2 Institute of Statistics and Big Data, Renmin University of China, Beijing, China
3 Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing, China
Correspondence: [email protected]
Abstract

Randomization, as a key technique in clinical trials, can eliminate sources of bias and produce comparable treatment groups. In randomized experiments, the treatment effect is a parameter of general interest. Researchers have explored the validity of using linear models to estimate the treatment effect, perform covariate adjustment, and thus improve estimation efficiency. However, the relationship between covariates and outcomes is not necessarily linear and is often intricate. Advances in statistical theory and related computer technology allow us to use nonparametric and machine learning methods to better estimate the relationship between covariates and outcomes and thus obtain further efficiency gains. However, theoretical studies on how to draw valid inferences when using nonparametric and machine learning methods under stratified randomization are yet to be conducted. In this paper, we discuss a unified framework for covariate adjustment and corresponding statistical inference under stratified randomization and present a detailed proof of the validity of using local linear kernel-weighted least squares regression for covariate adjustment in treatment effect estimators as a special case. In the case of high-dimensional data, we additionally propose an algorithm for statistical inference using machine learning methods under stratified randomization, which makes use of sample splitting to alleviate the requirements on the asymptotic properties of machine learning methods. Finally, we compare the performances of treatment effect estimators using different machine learning methods under various data generation scenarios, to guide practical research.

Key words: Machine learning; Nonparametric regression; Stratified randomization; Sample splitting

1 Introduction

As a method of preventing selection bias and producing valid statistical inference, randomization has been extensively used in clinical trials. Although simple randomization tends to balance both known and unknown covariates on average, as stated in Cornfield et al., (1959), severe imbalances of important covariates among treatment groups may still occur in practice. Researchers therefore seek methods that better balance a few pre-specified important covariates. Various stratified randomization methods, such as stratified block randomization (Zelen, 1974), stratified biased coin randomization (Efron, 1971), and minimization (Taves, 1974; Pocock and Simon, 1975), have been proposed. In stratified randomization, a set of covariates is used to form strata, and subjects are then assigned to treatment groups with a probability related to their strata. Stratified randomization further improves the balance of important covariates among groups and yields more acceptable and efficient estimators, especially under small sample sizes.

It has been widely suggested in the statistical literature that covariates used in the randomization stage also be included in the analysis stage, as otherwise tests may be invalid (Kahan and Morris, 2012). Stratified analysis provides an aggregate test over strata by pooling the analysis results of all strata. It takes into account the interaction between outcomes and strata induced by stratified randomization, and leads to valid and efficient inference. As in most randomized controlled trials, we focus on the inference of the average treatment effect, which characterizes the effect of a typical treatment on subjects. Bugni et al., (2018, 2019) studied the asymptotic properties of the naive difference-in-means estimator and the ordinary-least-squares (OLS) estimator obtained from regressing outcomes on indicators of strata under stratified randomization. However, as proposed by Ma et al., (2022), Liu et al., (2023), and Ye et al., (2022b), adjusting for additional covariates through linear regressions, such as OLS or lasso (Tibshirani, 1996), further improves the efficiency of the treatment effect estimator. The variance is reduced by projecting outcomes onto the space spanned by linear functions of covariates and further analyzing the residuals. When there is no strong evidence of a linear relationship between covariates and outcomes, the aforementioned projection may be limited in improving the efficiency of the treatment effect estimator. Wang et al., (2023) adapted the adjustment methods to parametric regressions, treated the estimator of the parameter of interest as an M-estimator, and studied its properties.

As pointed out by Tsiatis et al., (2008), we can consider using nonparametric and machine learning methods to further improve the efficiency of treatment effect estimators, especially in the presence of high-dimensional covariates. This direction of research has been explored in recent years. For example, Williams et al., (2022) evaluated the performance of four machine learning methods for ordinal and time-to-event outcomes and demonstrated that the use of machine learning methods generally improves estimation efficiency, particularly when the sample size is sufficiently large. Additionally, Liu et al., (2023) showed that consistent and efficient estimates can be obtained when covariate information is used reasonably in high-dimensional cases.

However, drawing valid inferences when using nonparametric and machine learning methods under stratified randomization is challenging, especially in the case of high-dimensional data. In this paper, we first derive the asymptotic property of an “oracle” estimator, in which the true projection function of the outcome on the space spanned by arbitrary functions of covariates is plugged in and which is thus expected to perform best. We then outline the restriction on the convergence rate of the estimated projection function under which we can obtain a valid empirical treatment effect estimator. Additionally, we discuss the optimal choice of the projection function in terms of the efficiency of the treatment effect estimator, and provide insights into the use of different covariate adjustment tools, based on a certain degree of theoretical justification. Next, we suggest the use of statistical methods to obtain a consistent estimate of the optimal projection function. When the covariates have low dimensionality, we can use general nonparametric methods such as local linear kernel-weighted least squares regression (abbreviated as local linear kernel in the following text); when the covariates have relatively high dimensionality, we can use the lasso or machine learning methods, such as random forest (Breiman, 2001), neural networks, and debiased/double machine learning methods (Chernozhukov et al., 2018). Finally, we evaluate the empirical performances of the proposed estimators with different estimation methods for the projection function in a simulation study.

2 Framework and notation

We consider an experiment with two treatments under stratified randomization and follow the framework and notation of Ma et al., (2022) and Liu et al., (2023). Let A_{i},\ i=1,\dots,n, denote the indicator for treatment assignment, i.e., A_{i}=1 if the ith unit is assigned to the treatment group and A_{i}=0 otherwise. The target proportion of treated units is \pi=P(A_{i}=1)\in(0,1). We assume that the observed outcome Y_{i} is a function of the treatment assignment A_{i} and the potential outcomes under treatment (Y_{i}(1)) and control (Y_{i}(0)): Y_{i}=A_{i}Y_{i}(1)+(1-A_{i})Y_{i}(0). During the experiment, units are stratified into K strata, with B_{i}\in\{1,\dots,K\} denoting the specific stratum that unit i falls into. To rule out empty strata, we assume that the probability of units being assigned to each stratum is positive, i.e., p_{[k]}=P(B_{i}=k)>0, for k\in\{1,\dots,K\} and i\in\{1,\dots,n\}. Additionally, we collect a p-dimensional vector of baseline covariates for each unit, denoted as X_{i}=(X_{i1},\dots,X_{ip})^{\textnormal{T}}. The covariates can be either low-dimensional or high-dimensional, and the estimation method changes accordingly. We use the subscript 1 or 0 to indicate whether the assigned group is the treatment group or the control group. In the randomized experiment, n_{1}=\sum_{i=1}^{n}A_{i} units are assigned to the treatment group and n_{0}=\sum_{i=1}^{n}(1-A_{i}) units are assigned to the control group. Moreover, we use the subscript [k] to index the statistics in stratum k; e.g., the numbers of units, treated units, and control units in stratum k are n_{[k]}=\sum_{i\in[k]}1, n_{[k]1}=\sum_{i\in[k]}A_{i}, and n_{[k]0}=\sum_{i\in[k]}(1-A_{i}), respectively, where i\in[k] indexes units in stratum k. The proportion of stratum sizes and the proportion of treated units in stratum k are denoted as p_{n[k]}=n_{[k]}/n and \pi_{n[k]}=n_{[k]1}/n_{[k]}, respectively. Our parameter of interest is the average treatment effect: \tau=E\{Y_{i}(1)-Y_{i}(0)\}.

Let \mathcal{R}_{2}=\{V:\max_{k=1,\dots,K}\textnormal{Var}\{V|B_{i}=k\}>0\} be the set of random variables with at least one positive stratum-specific variance. Throughout this paper, the following requirements are imposed on the data-generating process and randomization mechanism.

Assumption 1.

\{Y_{i}(1),Y_{i}(0),X_{i}\}_{i=1}^{n} are independent and identically distributed (i.i.d.) samples from the population distribution of \{Y(1),Y(0),X\}. Moreover, E\{Y_{i}^{2}(a)\}<\infty,\ Y_{i}(a)\in\mathcal{R}_{2},\ a=0,1.

Assumption 2.

Conditional on B^{(n)}=\{B_{1},\dots,B_{n}\}, the treatment assignments A^{(n)}=\{A_{1},\dots,A_{n}\} and \{Y_{i}(1),Y_{i}(0),X_{i}\}_{i=1}^{n} are independent.

Assumption 3.

In each stratum, the proportion of treated units \pi_{n[k]} converges in probability to \pi.

The above assumptions were also made by Bugni et al., (2019), Ma et al., (2022) and Liu et al., (2023). Here, we place no restrictions on the relationship among the A_{i}'s, indicating that they can be correlated with each other, which is a common case under stratified randomization. When applying different estimation methods to the baseline covariates, additional assumptions are made for X_{i}. Assumption 2 holds under simple and restricted randomization (Rosenberger and Lachin, 2015), and is widely satisfied by existing stratified randomization methods and covariate-adaptive randomization methods, including the stratified biased-coin design (Efron, 1971), stratified block design (Zelen, 1974), stratified adaptive biased-coin design (Wei, 1978), Pocock and Simon’s minimization (Pocock and Simon, 1975), and the methods proposed by Hu and Hu, (2012).

Notation. The mean and variance of a random variable V are denoted as \mu_{V}=E(V) and \sigma_{V}^{2}=\textnormal{Var}(V), respectively. Let \tilde{V}=V-E(V|B) be the variable V centered at its stratum-specific mean. For random variables r_{i}(a),\ i=1,\dots,n,\ a\in\{0,1\}, the sample means under treatment and control are denoted as \bar{r}_{1}=(1/n_{1})\sum_{i=1}^{n}A_{i}r_{i}(1) and \bar{r}_{0}=(1/n_{0})\sum_{i=1}^{n}(1-A_{i})r_{i}(0), respectively, and the stratum-specific sample means are denoted as \bar{r}_{[k]1}=(1/n_{[k]1})\sum_{i\in[k]}A_{i}r_{i}(1) and \bar{r}_{[k]0}=(1/n_{[k]0})\sum_{i\in[k]}(1-A_{i})r_{i}(0), respectively. The variance of the treatment effect estimator consists of the variation of the (transformed) potential outcomes r_{i}(1) and r_{i}(0) and the sum of variations from treatment effect heterogeneity within each stratum:

\varsigma_{r}^{2}(\pi)=\frac{1}{\pi}\sigma^{2}_{r_{i}(1)-E\{r_{i}(1)|B_{i}\}}+\frac{1}{1-\pi}\sigma^{2}_{r_{i}(0)-E\{r_{i}(0)|B_{i}\}},
\varsigma_{Hr}^{2}=\sum_{k=1}^{K}p_{[k]}\Big(\big[E\{r_{i}(1)|B_{i}=k\}-E\{r_{i}(1)\}\big]-\big[E\{r_{i}(0)|B_{i}=k\}-E\{r_{i}(0)\}\big]\Big)^{2}.

We express the sample analogs as

\hat{\varsigma}^{2}_{r}(\pi) = \frac{1}{\pi}\sum_{k=1}^{K}p_{n[k]}\bigg[\frac{1}{n_{[k]1}}\sum_{i\in[k]}A_{i}\Big\{\hat{r}_{i}(1)-\frac{1}{n_{[k]1}}\sum_{j\in[k]}A_{j}\hat{r}_{j}(1)\Big\}^{2}\bigg] + \frac{1}{1-\pi}\sum_{k=1}^{K}p_{n[k]}\bigg[\frac{1}{n_{[k]0}}\sum_{i\in[k]}(1-A_{i})\Big\{\hat{r}_{i}(0)-\frac{1}{n_{[k]0}}\sum_{j\in[k]}(1-A_{j})\hat{r}_{j}(0)\Big\}^{2}\bigg],
\hat{\varsigma}^{2}_{Hr} = \sum_{k=1}^{K}p_{n[k]}\bigg[\Big\{\frac{1}{n_{[k]1}}\sum_{j\in[k]}A_{j}\hat{r}_{j}(1)-\frac{1}{n_{1}}\sum_{i=1}^{n}A_{i}\hat{r}_{i}(1)\Big\} - \Big\{\frac{1}{n_{[k]0}}\sum_{j\in[k]}(1-A_{j})\hat{r}_{j}(0)-\frac{1}{n_{0}}\sum_{i=1}^{n}(1-A_{i})\hat{r}_{i}(0)\Big\}\bigg]^{2},

where \hat{r}_{i}(a) is the observed or estimated value of r_{i}(a). We define the L_{2} norm of a function f as ||f||_{L_{2}}=\{E(f^{2})\}^{1/2}.
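To fix ideas, the following is a minimal numpy sketch of the plug-in variance components \hat{\varsigma}^{2}_{r}(\pi) and \hat{\varsigma}^{2}_{Hr}; the function name and the array-based interface (r_hat for the observed transformed outcomes \hat{r}_{i}(A_{i}), A for treatment indicators, B for stratum labels) are illustrative assumptions and not part of the paper.

```python
import numpy as np

def variance_estimators(r_hat, A, B, pi):
    """Plug-in estimators of varsigma_r^2(pi) and varsigma_Hr^2.

    r_hat : transformed outcomes r_hat_i(A_i) evaluated on the observed arm
    A     : 0/1 treatment indicators
    B     : stratum labels
    pi    : target proportion of treated units
    """
    var_r, var_Hr = 0.0, 0.0
    rbar1 = r_hat[A == 1].mean()              # overall treated mean of r_hat
    rbar0 = r_hat[A == 0].mean()              # overall control mean of r_hat
    for k in np.unique(B):
        in_k = (B == k)
        p_nk = in_k.mean()                    # stratum proportion p_{n[k]}
        r1 = r_hat[in_k & (A == 1)]           # treated r_hat in stratum k
        r0 = r_hat[in_k & (A == 0)]           # control r_hat in stratum k
        var_r += p_nk * (np.mean((r1 - r1.mean()) ** 2) / pi
                         + np.mean((r0 - r0.mean()) ** 2) / (1 - pi))
        var_Hr += p_nk * ((r1.mean() - rbar1) - (r0.mean() - rbar0)) ** 2
    return var_r, var_Hr
```

Dividing the sum of the two components by n gives a plug-in variance for the treatment effect estimators discussed below.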

3 Linear adjustment

Under stratified randomization, the average treatment effect can be consistently estimated by aggregating the treatment effect estimates in each stratum, and a naive estimator is

\hat{\tau}=\sum_{k=1}^{K}p_{n[k]}(\bar{Y}_{[k]1}-\bar{Y}_{[k]0}).
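As a simple illustration, here is a minimal numpy sketch of this stratified difference-in-means estimator; the function name and the assumed arrays Y, A, and B are introduced only for this sketch.

```python
import numpy as np

def naive_stratified_estimator(Y, A, B):
    """Stratified difference-in-means estimator of the average treatment effect."""
    tau_hat = 0.0
    for k in np.unique(B):
        in_k = (B == k)
        p_nk = in_k.mean()                    # p_{n[k]} = n_{[k]} / n
        ybar1 = Y[in_k & (A == 1)].mean()     # \bar{Y}_{[k]1}
        ybar0 = Y[in_k & (A == 0)].mean()     # \bar{Y}_{[k]0}
        tau_hat += p_nk * (ybar1 - ybar0)
    return tau_hat
```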

To further improve estimation efficiency, we consider different methods for covariate adjustment. Linear regression, as a concise and highly interpretable model, is widely accepted in practice and its theoretical properties have been intensively studied. Therefore, a number of recent works on statistical inference under randomization have considered linear regressions for covariate adjustment (e.g., Lin, 2013; Bugni et al., 2018, 2019; Liu and Yang, 2020; Li and Ding, 2020; Ma et al., 2022; Ye et al., 2022a).

Let \bar{X}_{[k]}=n_{[k]}^{-1}\sum_{i\in[k]}X_{i}. Liu et al., (2023) proposed a general regression-adjusted treatment effect estimator:

\hat{\tau}_{\mathrm{gen}}=\sum_{k=1}^{K}p_{n[k]}\left[\left\{\bar{Y}_{[k]1}-\left(\bar{X}_{[k]1}-\bar{X}_{[k]}\right)^{\mathrm{T}}\hat{\beta}_{[k]}(1)\right\}-\left\{\bar{Y}_{[k]0}-\left(\bar{X}_{[k]0}-\bar{X}_{[k]}\right)^{\mathrm{T}}\hat{\beta}_{[k]}(0)\right\}\right],

where \hat{\beta}_{[k]}(1) and \hat{\beta}_{[k]}(0) are the estimated regression-adjustment vectors for the treatment and control groups, respectively. This general estimator is applicable to both low-dimensional cases, using OLS, and high-dimensional cases, using lasso, and the asymptotic properties of the corresponding estimators have been derived. In high-dimensional cases, the general estimator can also be extended to the use of ridge regression (Hoerl and Kennard, 1970) or elastic net regression (Zou and Hastie, 2005).
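To make the estimator concrete, below is a minimal numpy sketch of \hat{\tau}_{\mathrm{gen}} in which \hat{\beta}_{[k]}(a) is obtained by within-arm OLS in each stratum, i.e., the low-dimensional case; the function name and the array interface are assumptions made for this sketch.

```python
import numpy as np

def regression_adjusted_estimator(Y, A, B, X):
    """Stratum-specific OLS-adjusted estimator tau_hat_gen (low-dimensional X)."""
    tau_hat = 0.0
    for k in np.unique(B):
        in_k = (B == k)
        p_nk = in_k.mean()                                 # p_{n[k]}
        Xk, Yk, Ak = X[in_k], Y[in_k], A[in_k]
        Xbar = Xk.mean(axis=0)                             # \bar{X}_{[k]}
        est = 0.0
        for a, sign in ((1, 1.0), (0, -1.0)):
            arm = (Ak == a)
            Xa, Ya = Xk[arm], Yk[arm]
            # beta_hat_{[k]}(a): OLS slope of Y on X within arm a of stratum k,
            # computed by regressing centred Y on centred X
            Xc = Xa - Xa.mean(axis=0)
            beta, *_ = np.linalg.lstsq(Xc, Ya - Ya.mean(), rcond=None)
            est += sign * (Ya.mean() - (Xa.mean(axis=0) - Xbar) @ beta)
        tau_hat += p_nk * est
    return tau_hat
```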

4 Nonlinear adjustment

The idea behind the regression-adjusted estimator is to project outcomes onto the space spanned by linear functions of the covariates and then further analyze the residuals to reduce the variance. Treatment effect estimators that are more efficient can be obtained by appropriately estimating the projection of the outcomes on the (potentially nonlinear) space spanned by the covariates. We can therefore use other methods, such as those based on partial linear models, generalized linear models, nonparametric models and machine-learning methods, for covariate adjustment.

4.1 Oracle estimator

We start by considering the oracle case where the space \Omega spanned by functions of X with finite variance is given. Conditional on B_{i}=k, let h_{[k]}(X,1) and h_{[k]}(X,0) denote the projections of Y(1) and Y(0) onto \Omega, respectively. Following the idea of the regression-adjusted estimator, our proposed oracle estimator is:

\hat{\tau}_{\text{oracle}} = \sum_{k=1}^{K}p_{n[k]}\Big[\Big\{\bar{Y}_{[k]1}-\frac{1}{n_{[k]1}}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})h_{[k]}(X_{i},1)\Big\} - \Big\{\bar{Y}_{[k]0}+\frac{1}{n_{[k]0}}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})h_{[k]}(X_{i},0)\Big\}\Big].
Remark 1.

If we consider \Omega as a space spanned by linear functions of X, then h_{[k]}(X,a)=X^{\text{T}}\beta_{[k]}(a), and we have

\frac{1}{n_{[k]1}}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})h_{[k]}(X_{i},1) = \frac{1}{n_{[k]1}}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})X_{i}^{\text{T}}\beta_{[k]}(1)
= \bigg(\frac{1}{n_{[k]1}}\sum_{i\in[k]}A_{i}X_{i}^{\text{T}}-\frac{\pi_{n[k]}}{n_{[k]1}}\sum_{i\in[k]}X_{i}^{\text{T}}\bigg)\beta_{[k]}(1)
= \bar{X}_{[k]1}^{\text{T}}\beta_{[k]}(1)-\frac{1}{n_{[k]}}\sum_{i\in[k]}X_{i}^{\text{T}}\beta_{[k]}(1)
= (\bar{X}_{[k]1}-\bar{X}_{[k]})^{\text{T}}\beta_{[k]}(1).

Similarly, we have n_{[k]0}^{-1}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})h_{[k]}(X_{i},0)=(\bar{X}_{[k]0}-\bar{X}_{[k]})^{\text{T}}\beta_{[k]}(0). Thus, our proposed estimator is identical to the oracle regression-adjusted stratum-specific estimator proposed by Liu et al., (2023).

Let r_{i}(a)=Y_{i}(a)-[(1-\pi)h_{[k]}(X_{i},1)+\pi h_{[k]}(X_{i},0)],\ i\in[k]. We establish the following proposition for the oracle estimator. The detailed proof is given in Section 2 of the Supplementary Materials.

Proposition 1.

Suppose that r_{i}(a)\in\mathcal{R}_{2},\ E\{h^{2}_{[k]}(X_{i},a)\}<\infty,\ a=0,1,\ k=1,\dots,K. Under Assumptions 1–3,

\sqrt{n}(\hat{\tau}_{\text{oracle}}-\tau)\stackrel{d}{\rightarrow}\mathcal{N}(0,\varsigma^{2}_{r}(\pi)+\varsigma^{2}_{Hr}).
Remark 2.

If we assume that h_{[k]}(X_{i},1) and h_{[k]}(X_{i},0) are the same across strata, then we can replace h_{[k]}(X_{i},1) and h_{[k]}(X_{i},0) by h(X_{i},1) and h(X_{i},0) in the estimator and transformed outcomes, respectively, and Proposition 1 still holds.

4.2 Empirical estimator

Plugging in the estimates of h_{[k]}(X,a), our empirical treatment effect estimator adjusting for baseline covariates is defined as (Liu et al., 2023):

\hat{\tau}_{\text{emp}} = \sum_{k=1}^{K}p_{n[k]}\Big[\Big\{\bar{Y}_{[k]1}-\frac{1}{n_{[k]1}}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})\hat{h}_{[k]}(X_{i},1)\Big\} - \Big\{\bar{Y}_{[k]0}+\frac{1}{n_{[k]0}}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})\hat{h}_{[k]}(X_{i},0)\Big\}\Big], (1)

where \hat{h}_{[k]}(X_{i},a),\ a=0,1, are projection functions estimated by different methods.
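The following is a minimal numpy sketch of \hat{\tau}_{\text{emp}} in Equation (1); the function name and the arrays h1 and h0, holding the fitted values \hat{h}_{[k]}(X_{i},1) and \hat{h}_{[k]}(X_{i},0) produced by whichever adjustment method is used, are assumptions made for this sketch.

```python
import numpy as np

def empirical_estimator(Y, A, B, h1, h0):
    """tau_hat_emp of Equation (1) with plug-in projection estimates.

    h1, h0 : arrays of h_hat_{[k]}(X_i, 1) and h_hat_{[k]}(X_i, 0) evaluated
             at every unit i (fitted beforehand by any adjustment method).
    """
    tau_hat = 0.0
    for k in np.unique(B):
        in_k = (B == k)
        p_nk = in_k.mean()                                 # p_{n[k]}
        Ak, Yk = A[in_k], Y[in_k]
        n_k1, n_k0 = Ak.sum(), (1 - Ak).sum()
        pi_nk = n_k1 / len(Ak)                             # pi_{n[k]}
        term1 = Yk[Ak == 1].mean() - ((Ak - pi_nk) * h1[in_k]).sum() / n_k1
        term0 = Yk[Ak == 0].mean() + ((Ak - pi_nk) * h0[in_k]).sum() / n_k0
        tau_hat += p_nk * (term1 - term0)
    return tau_hat
```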

The discussion presented by Tsiatis et al., (2008) implied that estimates of h_{[k]}(X_{i},a),\ a=0,1, are desirable if they can estimate h_{[k]}(X_{i},a) well, but no formal conditions were given on the estimation error. To study the asymptotic properties of \hat{\tau}_{\text{emp}} under stratified randomization, the following conditions were outlined by Liu et al., (2023).

Assumption 4.

For k=1,\dots,K and a=0,1,

\sqrt{n}\Big[\big\{\bar{\hat{h}}_{[k]1}(\cdot,a)-\bar{h}_{[k]1}(\cdot,a)\big\}-\big\{\bar{\hat{h}}_{[k]0}(\cdot,a)-\bar{h}_{[k]0}(\cdot,a)\big\}\Big]=o_{P}(1), (2)
\frac{1}{n_{[k]}}\sum_{i\in[k]}\Big\{\hat{h}_{[k]}(X_{i},a)-h_{[k]}(X_{i},a)\Big\}^{2}=o_{P}(1), (3)

where \bar{\hat{h}}_{[k]1}(\cdot,a) and \bar{\hat{h}}_{[k]0}(\cdot,a) respectively denote the sample means of \hat{h}_{[k]}(X_{i},a) in the treatment group and control group within stratum k.

The first equation in Assumption 4, i.e., Equation (2), implies that the difference between the oracle estimator and the empirical estimator is negligible at the rate of o_{P}(n^{-1/2}); an assumption of a similar form has also been used for the regression-adjusted estimation of quantile treatment effects (Assumption (i) in Jiang et al., (2023)). Equation (3) controls the estimation error of the projection function and serves as a guarantee of consistent variance estimation. Based on this assumption, we can draw valid inference for the average treatment effect. According to the low or high dimensionality of the covariates, a wide range of methods can be applied to estimate h_{[k]}(X_{i},a),\ a=0,1. If we consider the space spanned by arbitrary measurable functions of X with finite variance, then nonparametric methods can be applied. Additionally, if we have prior information on the form of h_{[k]}(X_{i},a), we can adopt parametric regressions. For linear adjustments, these conditions have been verified for lasso and linear regressions. For nonlinear adjustments, Wang et al., (2023) considered the M-estimation of the parameter of interest and proved relevant asymptotic properties. However, the theoretical properties of covariate-adjusted estimators using more general nonparametric or machine learning methods are not clear and remain to be explored; see the discussion in Section 8. Meanwhile, Assumption 4 is a high-level assumption: regardless of whether the setting is low-dimensional or high-dimensional, as long as the adjustment method satisfies this assumption, we can obtain a consistent treatment effect estimator and make valid inference. One of the main contributions of this paper is the verification that this assumption is satisfied by the local linear kernel in low-dimensional cases. In high-dimensional cases, this assumption may be hard to meet, and we thus propose a sample splitting algorithm for machine learning methods, which is another contribution of this paper. In brief, this algorithm separates the samples used for the estimation of \hat{h}_{[k]}(\cdot,a) from the evaluation samples X_{i} in \hat{h}_{[k]}(X_{i},a), and hence simplifies the required assumption. Detailed information about the sample splitting process and the simplified assumption can be found in Section 6.3.

Denote by \hat{r}_{i}(a)=Y_{i}(a)-[(1-\pi)\hat{h}_{[k]}(X_{i},1)+\pi\hat{h}_{[k]}(X_{i},0)] the estimated value of r_{i}(a). The asymptotic normality and a consistent asymptotic variance estimator of \hat{\tau}_{\text{emp}} are provided in the following theorem by Liu et al., (2023), thus justifying the Wald-type inference of the average treatment effect.

Theorem 1.

Suppose that r_{i}(a)\in\mathcal{R}_{2},\ E\{h^{2}_{[k]}(X_{i},a)\}<\infty,\ a=0,1,\ k=1,\dots,K. Under Assumptions 1–4,

\sqrt{n}(\hat{\tau}_{\text{emp}}-\tau)\stackrel{d}{\rightarrow}\mathcal{N}(0,\varsigma^{2}_{r}(\pi)+\varsigma^{2}_{Hr}),\qquad\hat{\varsigma}^{2}_{r}(\pi)+\hat{\varsigma}^{2}_{Hr}\stackrel{P}{\rightarrow}\varsigma^{2}_{r}(\pi)+\varsigma^{2}_{Hr}.
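As an illustration of the Wald-type inference justified by Theorem 1, a minimal sketch follows; the variance components are assumed to be computed as in the earlier variance sketch, and the function name is introduced only for this example.

```python
import numpy as np
from scipy.stats import norm

def wald_confidence_interval(tau_hat, var_r, var_Hr, n, level=0.95):
    """Wald-type confidence interval for the average treatment effect."""
    se = np.sqrt((var_r + var_Hr) / n)       # standard error of tau_hat
    z = norm.ppf(0.5 + level / 2)            # normal quantile, 1.96 for level = 0.95
    return tau_hat - z * se, tau_hat + z * se
```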

4.3 Optimal choice of transformed outcomes

There are different choices of the adjustment term in the transformed outcomes, but not all of them improve the efficiency of the treatment effect estimator. The choice that achieves the optimal efficiency of the treatment effect estimator was established in the following theorem by Liu et al., (2023), and we give a more detailed explanation here.

Theorem 2.

Conditional on B=k, r(a)=Y(a)-[(1-\pi)E\{Y(1)|X,B=k\}+\pi E\{Y(0)|X,B=k\}] has the minimum variance among the set of all transformed outcomes of the form Y(a)-[(1-\pi)h_{[k]}(X,1)+\pi h_{[k]}(X,0)]. In other words, the minimum variance of the treatment effect estimator is obtained when h_{[k]}(X,a)=E\{Y(a)|X,B=k\}.

Remark 3.

The distance between the estimated function and the true function is often called the generalization error. As widely acknowledged, the generalization error of an estimated function can be decomposed into estimation error and approximation error (e.g., Barron, 1994; Niyogi and Girosi, 1996; Pinkus, 2012). The estimation error is the distance between the estimated function and the optimal function that we can achieve in a restricted function space, and is determined by the estimating method. The optimal function is the projection of the outcomes on the restricted function space. The estimation error results from the fact that we are estimating functions from finite samples. A larger sample size results in a smaller estimation error, i.e., it is more likely that the estimated function will approach the optimal function. However, the sample sizes required to achieve similar estimation errors vary from method to method, and the convergence rates of nonparametric regressions were demonstrated by Stone, (1980, 1982). The approximation error is the distance between the true function and the optimal function. It decreases as the range of the function space increases. Theoretically, the approximation error can be eliminated if the function space is extended to nearly the full space.

To conduct valid and efficient inference, we need to minimize the distance between the estimated function and the true function, i.e., control both the estimation error and the approximation error. Assumption 4 imposes detailed requirements on the convergence rate of the estimating methods, which controls the estimation error and yields valid estimates. To optimize efficiency, we need to further eliminate the approximation error, which is determined by the function space spanned by the estimating methods. If the function space contains the true function (i.e., E\{Y(a)|X,B=k\},\ a=0,1, in Theorem 2), then the corresponding treatment effect estimator has optimal efficiency. However, it is often the case that we cannot know exactly which function space contains the true function, especially in high-dimensional cases. Therefore, we can only consider the plausibility of the function spaces of different estimating methods, such as the linear function space for linear regressions, the sieve space for artificial neural networks (ANNs), and the spline space for smoothing methods, in the context of the current data, and compare their practical performances. In the following sections, we introduce commonly used estimating methods and demonstrate their empirical efficiencies through numerical simulations.

5 Nonparametric methods for low-dimensional cases

If a few covariates are known to be strongly correlated with the outcomes, according to domain knowledge or external data, then we can use these key covariates to make efficient statistical inferences without a heavy computational burden. Below are several models commonly used in low-dimensional settings. Note that other machine learning methods, such as neural networks and random forest, can also be applied as elaborated in Section 6. We conduct the simulation study for machine learning methods in both low-dimensional and high-dimensional settings.

5.1 Local linear kernel

Although Tsiatis et al., (2008) suggested a general strategy for estimating the projection function with the flexible use of modeling methods for covariate-adjusted estimators under simple randomization, they did not give theoretical details. Several nonparametric statistical tools, including kernel, spline, and orthogonal series methods, have been proposed and widely used for estimating h_{[k]}(X_{i},a). Among these tools, the local linear kernel has better asymptotic behavior (Fan, 1992). In this paper, we give a formal justification for the use of the local linear kernel smoother of h_{[k]}(X_{i},a) under stratified randomization. Consider the problem that for i\in[k],

\operatorname{Minimize}\sum_{j\in[k]}\left\{Y_{j}(a)-\alpha-\beta^{T}\left(X_{j}-X_{i}\right)\right\}^{2}K_{H}\left(X_{j}-X_{i}\right)\cdot\mathds{1}_{A_{j}=a}, (4)

where H is a d\times d symmetric positive definite matrix depending on n, K_{H}(u)=|H|^{-1/2}K(H^{-1/2}u) with K being a d-dimensional kernel such that \int K(u)du=1, and |\cdot| denotes the determinant of a matrix. \mathds{1}_{A_{j}=a} is an indicator function that equals 1 if A_{j}=a and 0 otherwise. H^{1/2} is called the bandwidth matrix. Then, \hat{h}_{[k]}(X_{i},a)=\hat{\alpha} is the local linear kernel smoother of h_{[k]}(X_{i},a). We make the following general assumptions about the local linear kernel.

Assumption 5.

(i) Conditional on B_{i}=k, Y_{i}(a)=h_{[k]}(X_{i},a)+\nu^{1/2}(X_{i})\varepsilon_{i}, where \nu(x)=\text{Var}(Y|X=x)>0 is continuous, E\{h^{2}_{[k]}(X_{i},a)\}<\infty,\ a=0,1,\ k=1,\dots,K, and the probability density function of the X_{i}'s has a compact support set on R^{d}. The \varepsilon_{i}'s are mutually i.i.d. random variables with zero mean and unit variance and are independent of X_{i}. All second-order derivatives of h_{[k]}(\cdot,a) are continuous.

(ii) The kernel K is a compactly supported, bounded kernel such that \int uu^{\text{T}}K(u)du=\mu_{2}(K)I, where \mu_{2}(K)\neq 0 is a scalar and I is the d\times d identity matrix. Additionally, all odd-order moments of K vanish, i.e., \int u_{1}^{l_{1}}\cdots u_{d}^{l_{d}}K(u)du=0 for all nonnegative integers l_{1},\dots,l_{d} such that their sum is odd.

(iii) n^{-1}|H|^{-1} and each entry of H tend to zero as n\to\infty, with H remaining symmetric and positive definite. Moreover, there is a fixed constant L such that the condition number of H is at most L for all n.

Then based on Theorem 1, we establish the following theorem.

Theorem 3.

Under Assumptions 1–3 and 5, for k=1,\dots,K and a=0,1, \hat{h}_{[k]}(X_{i},a) satisfies Assumption 4. Therefore, the adjusted treatment effect estimators obtained using the local linear kernel allow for valid statistical inference.

Remark 4.

The detailed proof for local linear kernel can be found in Section 3 of the Supplementary Materials. Gelman and Imbens, (2019) stated that in regression discontinuity analysis, higher-order polynomials are no better than local linear or quadratic polynomials mainly for three reasons, two of which also apply to our situations. First, if we rewrite the polynomial regression estimator as the weighted average of the outcomes, we will find that the higher-order terms actually make no contribution to our estimated function of interest, which is the first component of the estimator. Second, the use of higher-order polynomial regression requires extra knowledge of the existence of higher-order derivatives of the function to be estimated, and there is no universal method with which to make a good determination yet. Additionally, the superiority of local linear kernel is also demonstrated through a simulation study in Section 7.
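To make the smoother in (4) concrete, the following is a minimal numpy sketch of a local linear kernel fit at a single point; the Gaussian product kernel, the scalar bandwidth (H^{1/2}=hI), and the function name are illustrative assumptions made for this sketch (Assumption 5 asks for a compactly supported kernel, which could be substituted directly).

```python
import numpy as np

def local_linear_fit(x0, Xa, Ya, bandwidth):
    """Local linear kernel estimate h_hat(x0) from one arm's data in one stratum.

    Xa        : (n, d) covariates of the units with A_j = a in stratum k
    Ya        : their observed outcomes
    bandwidth : scalar h, corresponding to H^{1/2} = h * I
    """
    U = (Xa - x0) / bandwidth                          # H^{-1/2}(X_j - x0)
    w = np.exp(-0.5 * (U ** 2).sum(axis=1))            # Gaussian kernel weights (up to a constant)
    D = np.column_stack([np.ones(len(Xa)), Xa - x0])   # design matrix [1, (X_j - x0)^T]
    W = np.diag(w)
    coef = np.linalg.solve(D.T @ W @ D, D.T @ W @ Ya)  # weighted least squares solution
    return coef[0]                                     # alpha_hat = h_hat(x0)
```

Evaluating this at each X_{i} in stratum k, using the data from arm a, yields the plug-in values \hat{h}_{[k]}(X_{i},a) used in \hat{\tau}_{\text{emp}}.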

5.2 Spline smoothing

As a nonparametric method competing with kernel methods, spline smoothing has commonly been investigated in the literature. Unlike the local linear weighted regression idea of the kernel, spline smoothing adds a penalty when minimizing the sum of squares, to avoid over-fitting. One of the most frequently adopted penalties is the integral of the squared second-order derivative of the estimated function, which corresponds to the cubic spline. Additionally, despite the different forms and concepts of kernel and spline methods, Silverman, (1984) showed the asymptotic equivalence of spline smoothing and a kernel method with a bandwidth depending on the local density of design points. Our simulation results suggest that spline smoothing and kernel methods perform similarly in some scenarios, while the local linear kernel is superior in others. Thus, we do not explore the theoretical justification of spline smoothing-adjusted estimators in this paper.

6 Machine-learning methods for high-dimensional cases

In the era of big data, a massive amount of information can be collected, and it is difficult for us to manually determine the most relevant covariates. Additionally, the interactions between covariates are difficult to capture with simple functional forms. Therefore, traditional statistical methods may not be suitable in this situation, and high-dimensional methods such as the lasso and other machine learning methods are needed.

In general, the high-dimensional methods commonly used nowadays perform well in prediction, which is what they were mainly designed for, but they lack guarantees of estimation efficiency and valid statistical inference. The theoretical properties of estimators adjusted using high-dimensional methods such as tree-based methods and neural networks are difficult to verify and remain to be uncovered, especially under stratified randomization. However, the ability of high-dimensional methods to capture nonlinear features and interactions is evident. In this paper, we mainly report a comparison of the estimators adjusted using different high-dimensional techniques in the numerical study and present the pros and cons of the estimators.

6.1 Penalized regression

In the presence of high-dimensional covariates, linear regression is prone to multi-collinearity or overfitting. In practice, penalized regressions are usually adopted to address these problems. When the exact or weak sparsity assumption on the population projection coefficients is reasonable, the lasso can be adopted, and Liu et al., (2023) proved that the lasso-adjusted estimators satisfy Assumption 4. The fundamental work on the concentration inequality and restricted eigenvalue under stratified randomization in Liu et al., (2023) can be used for other penalized regressions, such as the adaptive lasso (Zou, 2006; Huang et al., 2008), ridge regression (Hoerl and Kennard, 1970), and elastic net regression (Zou and Hastie, 2005). Moreover, although the debiased lasso (Zhang and Zhang, 2014) usually has a performance similar to that of the lasso, it has a higher computational cost, and it is thus not recommended here. When there are many weak predictors, ridge regression can be chosen.
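For illustration, here is a minimal scikit-learn sketch of stratum- and arm-specific lasso fits that produce the plug-in estimates \hat{h}_{[k]}(X_{i},a) used in \hat{\tau}_{\text{emp}}; the function name, the cross-validated penalty choice, and the array interface are assumptions made for this sketch rather than the specific implementation of the paper.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_projection_estimates(Y, A, B, X):
    """Estimate h_{[k]}(X_i, a) by a stratum- and arm-specific lasso fit."""
    h1, h0 = np.zeros(len(Y)), np.zeros(len(Y))
    for k in np.unique(B):
        in_k = (B == k)
        for a, out in ((1, h1), (0, h0)):
            arm = in_k & (A == a)
            model = LassoCV(cv=5).fit(X[arm], Y[arm])   # cross-validated penalty
            out[in_k] = model.predict(X[in_k])          # h_hat_{[k]}(X_i, a) for i in stratum k
    return h1, h0
```

The returned arrays h1 and h0 can be passed directly to the empirical estimator sketched in Section 4.2.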

6.2 General machine learning methods

With the abundance of available data and the increase of model complexity, we need the help of computers to perform data-intensive and complex operations in addition to model training and estimation, i.e., machine learning. Nowadays, there are many machine learning methods available to researchers, among which the two types of method commonly used in medical research are tree-based methods and neural networks (Garg and Mago, 2021). Both types of method can handle data with high dimensionality and complex interactions through stepwise deconstruction. However, as has been pointed out, no single method can solve problems in a one-size-fits-all manner (e.g., Peel, 2010; Pedersen et al., 2020; De Cristofaro, 2021). In practice, we should consider the type of problem to be solved, the structure of covariates, and the parameters of interest in selecting the most appropriate method. The good approximability of machine learning methods effectively reduces the approximation error.

Tree-based methods hierarchically partition the covariate space, recursively dividing the entire space into small regions. They can handle categorical and ordinal covariates in a simple way and automatically select covariates in steps and reduce model complexity. However, the single-tree structure tends to have insufficient prediction accuracy. The prediction performance has thus been improved using ensemble trees, commonly through bagging and boosting. Random forest, proposed by Breiman, (2001), is a substantial modification of bagging (Breiman, 1996) in ensemble learning that constructs de-correlated trees on different bootstrap samples of the data and averages the results. Random forest is widely used in randomized experiments. For example, Wu and Gagnon-Bartsch, (2018) proposed a “leave-one-out potential outcomes” estimator by imputing potential outcomes using random forest, and Wager and Athey, (2018) estimated and inferred heterogeneous treatment effects using random forest. Boosting is a sequential process that continuously trains weak classifiers or regressors and adjusts the weights of samples and classifiers or regressors in each iteration, to reduce the prediction error. The tree-based methods that are widely used in practice in combination with boosting are the gradient boosting decision tree (Friedman, 2002), AdaBoost (Freund and Schapire, 1997), and XGBoost (Chen and Guestrin, 2016).

Benefiting from the rapid development of computer technology, ANNs are appealing machine learning methods that can approximate a wide variety of (nonlinear) functions, especially in the case of high-dimensional data (e.g., White, 1992; Yarotsky, 2018; Schmidt-Hieber, 2020). ANNs have been adopted in many areas of research, including pattern recognition, decision-making, and pharmaceutical research. In recent years, ANNs have also been introduced to improve the efficiency of estimating treatment effects. Farrell et al., (2021) used ANNs to estimate nuisance functions in the estimation procedure of treatment effects, based on efficient influence functions. Chen et al., (2024) considered a more general framework of treatment effects with propensity score functions estimated using ANNs.

6.3 Sample splitting

Machine learning methods can estimate projection functions well and make better predictions than traditional methods, but the fitted functions may have plenty of parameters and be complex in form. Additionally, substantial biases induced by machine learning methods are inevitable, and the restrictions on the estimation error in Assumption 4 are thus difficult to verify and may not be satisfied for various machine learning methods. In this case, we can use sample splitting for the separate estimation of the projection function and the treatment effect, which relaxes the consistency conditions and thus makes most machine learning methods feasible (Chernozhukov et al., 2018).

Sample splitting is a common technique in statistics. This technique usually involves dividing data into two parts, one part for the inference of parameters or functions of interest and the other part for validation or estimating nuisance parameters (Picard and Berk, 1990). Through sample splitting, we gain accuracy and robustness of the inference at the cost of prediction efficiency owing to the reduction of the sample size (Rinaldo et al., 2019). To regain full efficiency, Chernozhukov et al., (2018) proposed a “cross-fitting” process, which includes dividing data evenly into M parts, estimating the parameter of interest and its variance on each inference-estimation data pair, and averaging them to obtain the final estimator. As the sample size increases, the estimators obtained from the inference-estimation data pairs become asymptotically independent, and their variances can thus be aggregated for inference. The adoption of additional orthogonalization to reduce the regularization bias is known as the double/debiased machine learning method, which was proposed by Chernozhukov et al., (2018). This method has gained prevalence in research on high-dimensional statistical inference (e.g., Kallus et al., 2019; Bodory et al., 2022).

Moreover, the asymptotic properties of the methods described in Section 6.2 and the double/debiased machine learning method are deduced under the assumption that the treatment assignments and thus the outcomes are independent. However, under stratified randomization, there may be correlations between outcomes and between treatment assignment indicators within each stratum because of the treatment assignment procedure, which may lead to the invalidity of the aforementioned statistical inference. This motivates us to combine the form of the variance estimator in Theorem 1 with the sample splitting technique, where we expect to obtain a valid statistical inference using machine learning methods under stratified randomization. The proposed algorithm is described in Algorithm 1.

Algorithm 1 Sample splitting algorithm for estimating the average treatment effect under stratified randomization.
1: Take an (almost) even M-fold random partition (I_{m})_{m=1}^{M} of the observed indices \{1,\dots,n\}; for m=1,\dots,M-1, the size of fold m is \lfloor n/M\rfloor, and the size of the last fold is n-(M-1)\lfloor n/M\rfloor. Define the complementary set of I_{m} as I_{m}^{c}=\{i:i\in\{1,\dots,n\},\ i\notin I_{m}\}
2: For each fold m\in\{1,\dots,M\}, construct an estimator \hat{h}_{[k]m}(\cdot,a) of h_{[k]}(\cdot,a) based on the data indexed by I_{m}^{c}
3: Plug \hat{h}_{[k]m}(\cdot,a) into the oracle estimator and use the data indexed by I_{m} for the inference of the average treatment effect. That is, for each m\in\{1,\dots,M\}, obtain a treatment effect estimator \hat{\tau}_{m} as in Equation (1) and a variance estimator \hat{\sigma}^{2}_{m} by Theorem 1
4: Aggregate the treatment effect estimators:
\hat{\tau}_{\text{ss}}=\frac{1}{M}\sum_{m=1}^{M}\hat{\tau}_{m}.
The variance estimator of \sqrt{n}\hat{\tau}_{\text{ss}} is
\hat{\sigma}^{2}_{\text{ss}}=\frac{1}{M}\sum_{m=1}^{M}\hat{\sigma}^{2}_{m}.

Indeed, this algorithm incorporates the sample splitting technique into the general framework of inference under stratified randomization. Using this algorithm, Assumption 4 on the projection function can be reduced to the following second moment convergence assumption.

Assumption 6.

Denote the M-fold random partition of the observed indices \{1,\dots,n\} as (I_{m})_{m=1}^{M}. For fold m=1,\dots,M,

E\Big[\big\{\hat{h}_{[k]m}(\tilde{X}_{i},a)-h_{[k]}(\tilde{X}_{i},a)\big\}^{2}\mid B_{i}=k,\{Y_{j},X_{j},A_{j},B_{j}\}_{j\in I_{m}^{c}}\Big]=o_{P}(1), (5)

where \tilde{X}_{i} has the same distribution as \{X_{i}\mid B_{i}=k\} and is independent of \{Y_{j},X_{j},A_{j},B_{j}\}_{j\in I_{m}^{c}}, and \hat{h}_{[k]m}(\cdot,a) is the projection function estimated using the samples not in fold m.

The above assumption can be satisfied by many machine learning methods under certain assumptions (Chernozhukov et al., 2018), and we establish the following theorem.

Theorem 4.

Suppose that r_{i}(a)\in\mathcal{R}_{2},\ E\{h^{2}_{[k]}(X_{i},a)\}<\infty,\ a=0,1,\ i\in[k]\cap I_{m},\ k=1,\dots,K,\ m=1,\dots,M. Then under Assumptions 1–3 and 6,

\sqrt{n}(\hat{\tau}_{ss}-\tau)\stackrel{d}{\rightarrow}\mathcal{N}(0,\varsigma^{2}_{r}(\pi)+\varsigma^{2}_{Hr}),\qquad\hat{\sigma}^{2}_{ss}\stackrel{P}{\rightarrow}\varsigma^{2}_{r}(\pi)+\varsigma^{2}_{Hr},

where \hat{\tau}_{ss} and \hat{\sigma}^{2}_{ss} are defined in Algorithm 1.

From the subsequent simulation results (Table 4), we can see that all the treatment effect estimators obtained using Algorithm 1 enjoy unbiasedness and validity.
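To illustrate Algorithm 1 end to end, the following sketch reuses the empirical_estimator and variance_estimators functions from the earlier sketches; the choice of learner (a random forest), the number of folds M=5, and the stratum-common fit of \hat{h} (cf. Remark 2) are illustrative assumptions, and a stratum-specific fit would follow Algorithm 1 more literally.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def sample_splitting_estimator(Y, A, B, X, pi, M=5, seed=0):
    """Sketch of Algorithm 1 with a generic machine learning learner."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    folds = np.array_split(rng.permutation(n), M)     # (almost) even random partition
    taus, sigmas = [], []
    for I_m in folds:
        mask = np.zeros(n, dtype=bool)
        mask[I_m] = True                              # fold I_m: used for inference
        train = ~mask                                 # complement I_m^c: used to fit h_hat
        h1, h0 = np.zeros(n), np.zeros(n)
        for a, out in ((1, h1), (0, h0)):
            fit_idx = train & (A == a)
            model = RandomForestRegressor(n_estimators=200, random_state=0)
            model.fit(X[fit_idx], Y[fit_idx])         # stratum-common h_hat_m(., a)
            out[mask] = model.predict(X[mask])        # evaluated on fold I_m only
        tau_m = empirical_estimator(Y[mask], A[mask], B[mask], h1[mask], h0[mask])
        r_hat = Y[mask] - ((1 - pi) * h1[mask] + pi * h0[mask])
        var_r, var_Hr = variance_estimators(r_hat, A[mask], B[mask], pi)
        taus.append(tau_m)
        sigmas.append(var_r + var_Hr)
    return np.mean(taus), np.mean(sigmas)             # tau_hat_ss and sigma_hat_ss^2
```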

7 Simulation study

In this section, we examine the empirical performance of estimators with different estimating methods for h_{[k]}(\cdot,a),\ a=0,1. We consider four low-dimensional data-generating models and four corresponding high-dimensional data-generating models with continuous outcomes. For a\in\{0,1\} and 1\leq i\leq n, the potential outcomes are generated according to

Y_{i}(a)=g_{a}(X_{i})+\sigma_{a}\varepsilon_{a,i},

where X_{i},\ g_{a}(X_{i}),\ i=1,\dots,n, are specified below. In each model, (X_{i},\varepsilon_{0,i},\varepsilon_{1,i}),\ 1\leq i\leq n, are i.i.d., and we set \sigma_{0}=1,\ \sigma_{1}=3. Both \varepsilon_{0,i} and \varepsilon_{1,i} follow the standard normal distribution. For the high-dimensional data-generating models with p covariates, only a few covariates are truly correlated with the outcomes, and we generate additional independent covariates to mimic practical settings.

Here, we present the simulation results of the estimators under simple randomization, stratified block randomization, and minimization. The sample size n is 1000. The block size used in stratified block randomization is 6. A biased-coin probability of 0.75 and equal weights are used in minimization. The bias, standard deviation (SD) of the treatment effect estimators, standard error (SE) estimators, and empirical coverage probability (CP) of the 95\% confidence interval are computed using 2000 replications.

7.1 Low-dimensional data-generating models

Model 1:

g_{0}(X_{i}) = \mu_{0}+\sum_{j=1}^{4}\beta_{0j}X_{ij},
g_{1}(X_{i}) = \mu_{1}+\sum_{j=1}^{4}\beta_{1j}X_{ij},

with \mu_{0}=1,\ \mu_{1}=4,\ \beta_{0}^{\textnormal{T}}=(75,35,125,80), and \beta_{1}^{\textnormal{T}}=(100,80,60,40). X_{i} is a four-dimensional vector, X_{i1}\sim\textup{Beta}(3,4), X_{i2}\sim\textup{Unif}[-2,2], X_{i3} takes values in \{-1,1\} with equal probability, X_{i4} takes values in \{3,5\} with probability 0.6, 0.4, and they are independent of each other. The variable used for randomization is an additional variable that takes a value in \{1,2,3,4\} with probability 0.2, 0.3, 0.3, 0.2 and is independent of X_{ij}. Model 1 considers the regular linear model as a baseline for the other nonlinear models.
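A minimal numpy sketch of the Model 1 data-generating process follows; the function name and the use of a single random seed are illustrative choices for this sketch.

```python
import numpy as np

def generate_model1(n, seed=0):
    """Potential outcomes for Model 1 (linear baseline), sigma_0 = 1, sigma_1 = 3."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([
        rng.beta(3, 4, n),                                   # X_{i1} ~ Beta(3, 4)
        rng.uniform(-2, 2, n),                               # X_{i2} ~ Unif[-2, 2]
        rng.choice([-1, 1], n),                              # X_{i3} in {-1, 1}, equal probability
        rng.choice([3, 5], n, p=[0.6, 0.4]),                 # X_{i4} in {3, 5}
    ])
    S = rng.choice([1, 2, 3, 4], n, p=[0.2, 0.3, 0.3, 0.2])  # stratification variable
    beta0 = np.array([75, 35, 125, 80])
    beta1 = np.array([100, 80, 60, 40])
    Y0 = 1 + X @ beta0 + 1 * rng.standard_normal(n)          # mu_0 = 1, sigma_0 = 1
    Y1 = 4 + X @ beta1 + 3 * rng.standard_normal(n)          # mu_1 = 4, sigma_1 = 3
    return X, S, Y0, Y1
```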

Model 2:

g_{0}(X_{i}) = \mu_{0}+\beta_{01}\log(X_{i1}+1)+\beta_{02}X_{i1}^{2}+\beta_{03}\exp(X_{i2})+\beta_{04}/(X_{i2}+3),
g_{1}(X_{i}) = \mu_{1}+\beta_{11}\exp(X_{i1}+2)+\beta_{12}/(X_{i1}+1)+\beta_{13}X_{i2}^{2},

with \mu_{0}=-3,\ \mu_{1}=0,\ \beta_{0}^{\textnormal{T}}=(10,24,15,20), and \beta_{1}^{\textnormal{T}}=(20,17,10). X_{i} is a two-dimensional vector, X_{i1}\sim\textup{Beta}(3,4), X_{i2}\sim\textup{Unif}[-2,2], and they are independent of each other. The variable used for randomization is an additional variable that takes a value in \{1,2,3,4\} with probability 0.2, 0.3, 0.3, 0.2 and is independent of X_{ij}. Model 2 is an additive but nonlinear model of the covariates.

Model 3:

g_{0}(X_{i}) = \mu_{0}+\beta_{01}X_{i1}X_{i2}/(X_{i1}+X_{i2}+2)+\beta_{02}X_{i1}^{2}(X_{i2}+X_{i3}),
g_{1}(X_{i}) = \mu_{1}+\beta_{11}(X_{i2}+X_{i4})+\beta_{12}X_{i2}^{2}/\exp(X_{i1}+2),

with \mu_{0}=5,\ \mu_{1}=2,\ \beta_{0}^{\textnormal{T}}=(42,83), and \beta_{1}^{\textnormal{T}}=(30,75). X_{i} is a four-dimensional vector, X_{i1}\sim\textup{Beta}(3,4), X_{i2}\sim\textup{Unif}[-2,2], X_{i3}\sim\mathcal{N}(0,1), X_{i4}\sim\textup{Unif}[0,2], and they are independent of each other. The variable used for randomization is an additional variable that takes a value in \{1,2\} with probability 0.4, 0.6 and is independent of X_{ij}. Model 3 is a nonlinear model including interaction terms of the covariates and is hence more complex.

Model 4:

g_{0}(X_{i}) = \mu_{0}+(\beta_{01}X_{i1}+\beta_{02}X_{i2})S_{i}+\beta_{03}\log(X_{i1}+1)\mathds{1}_{S_{i}=1},
g_{1}(X_{i}) = \mu_{1}+(\beta_{11}X_{i1}+\beta_{12}X_{i2})S_{i}+\beta_{13}\exp(X_{i2})\mathds{1}_{S_{i}=-1},

with \mu_{0}=5,\ \mu_{1}=5,\ \beta_{0}^{\textnormal{T}}=(20,30,50), and \beta_{1}^{\textnormal{T}}=(20,30,65). X_{i} is a two-dimensional vector, X_{i1}\sim\textup{Beta}(3,4), X_{i2}\sim\textup{Unif}[-2,2], and they are independent of each other. \mathds{1}_{S_{i}=1} is an indicator function that equals 1 if S_{i}=1 and 0 otherwise; \mathds{1}_{S_{i}=-1} is defined likewise. S_{i} is the variable used for randomization; it is an additional variable that takes a value in \{1,-1\} with equal probability and is independent of X_{ij}. Model 4 further takes the interaction between covariates and stratum into consideration.

7.2 Low-dimensional simulation results

For the low-dimensional data-generating models, we apply linear regression, local linear kernel (kernel), natural spline (nspline), neural networks (nnet), and random forest (rf) to obtain \hat{h}_{[k]}(X_{i},a). In the result tables, \hat{\tau} denotes the stratum-common estimators (\hat{h}_{[k]}(X_{i},a) are the same for all strata k,\ k=1,\dots,K) and \tilde{\tau} denotes the stratum-specific estimators (\hat{h}_{[k]}(X_{i},a) can be different in each stratum).

From Table 1, we see that there is little difference among the treatment effect estimators when different randomization methods are used. Under all considered scenarios, the treatment effect estimators obtained using different covariate-adjustment methods have small biases. When the data-generating model is linear (Model 1), the linear regression-adjusted estimators have the optimal efficiency, whereas \hat{\tau}_{\text{kernel}} and \hat{\tau}_{\text{nspline}} have the same efficiency and are followed by \hat{\tau}_{\text{nnet}} and then \hat{\tau}_{\text{rf}}. When the data-generating model is nonlinear (Models 2, 3 and 4), the \hat{\tau}_{\text{linear}} estimators always have relatively large standard errors. Therefore, there can be a loss of efficiency when stubbornly using linear models for covariate adjustment, especially when there is strong evidence of a nonlinear relationship between covariates and outcomes. In this case, we should consider using other methods for covariate adjustment.

For each data-generating model, the fitting performance differs among the adjusting methods. For example, in Model 2, \hat{\tau}_{\text{kernel}} and \hat{\tau}_{\text{nspline}} have the minimum standard deviations, followed by \hat{\tau}_{\text{nnet}} and then \hat{\tau}_{\text{rf}}. However, in the cases of Models 3 and 4, \hat{\tau}_{\text{rf}} has the largest standard deviations among the treatment effect estimators adjusted using nonlinear methods. In Model 3, the standard deviation of \hat{\tau}_{\text{rf}} is even greater than that of \hat{\tau}_{\text{linear}}. A comparison between stratum-common and stratum-specific estimators reveals that the stratum-specific estimators have smaller standard deviations than the corresponding stratum-common estimators only if the data-generating model is stratum-specific (Model 4).

From the perspective of statistical inference, all treatment effect estimators have the desired coverage probability, except the estimator adjusted using random forest. In particular, in Model 4, the coverage probabilities of \hat{\tau}_{\text{rf}} are below 0.9, indicating that the proposed assumptions are not satisfied. This inspires us to further use other statistical techniques to reach valid statistical inferences. Similar conclusions under unequal allocation (\pi=2/3) are shown in Table 2.

Table 1: Simulated biases, standard deviations, standard errors, and coverage probabilities for different estimators and randomization methods under equal allocation (\pi=1/2) and low-dimensional data-generating models.
Complete Rand. Stratified Block Rand. Minimization
Model Estimator Bias SD SE CP Bias SD SE CP Bias SD SE CP
1 \hat{\tau}_{\text{linear}} 0.02 3.00 2.91 0.95 -0.08 2.89 2.91 0.95 -0.03 2.98 2.91 0.95
\tilde{\tau}_{\text{linear}} 0.02 3.00 2.91 0.95 -0.08 2.89 2.91 0.95 -0.03 2.98 2.91 0.95
\hat{\tau}_{\text{kernel}} 0.02 3.00 2.91 0.95 -0.08 2.89 2.91 0.95 -0.03 2.98 2.91 0.95
\tilde{\tau}_{\text{kernel}} 0.02 3.00 2.91 0.95 -0.08 2.89 2.91 0.95 -0.03 2.98 2.91 0.95
\hat{\tau}_{\text{nspline}} 0.02 3.00 2.91 0.95 -0.07 2.89 2.91 0.95 -0.03 2.98 2.91 0.95
\tilde{\tau}_{\text{nspline}} 0.02 3.00 2.91 0.95 -0.07 2.89 2.91 0.95 -0.03 2.98 2.91 0.95
\hat{\tau}_{\text{nnet}} 0.02 3.01 2.91 0.95 -0.07 2.89 2.91 0.95 -0.03 2.99 2.91 0.95
\tilde{\tau}_{\text{nnet}} 0.00 3.01 2.93 0.95 -0.08 2.90 2.92 0.95 -0.04 3.00 2.92 0.95
\hat{\tau}_{\text{rf}} 0.00 4.14 3.97 0.94 -0.12 4.14 3.97 0.94 -0.07 4.20 3.97 0.94
\tilde{\tau}_{\text{rf}} 0.01 4.37 4.05 0.92 -0.09 4.40 4.04 0.93 -0.06 4.46 4.04 0.92
2 \hat{\tau}_{\text{linear}} -0.02 1.49 1.52 0.95 -0.03 1.51 1.52 0.95 0.04 1.52 1.52 0.95
\tilde{\tau}_{\text{linear}} -0.03 1.49 1.52 0.95 -0.04 1.51 1.52 0.95 0.03 1.52 1.52 0.96
\hat{\tau}_{\text{kernel}} 0.02 1.27 1.28 0.95 0.00 1.27 1.28 0.95 0.06 1.29 1.28 0.95
\tilde{\tau}_{\text{kernel}} 0.07 1.27 1.28 0.95 0.05 1.27 1.28 0.95 0.11 1.30 1.28 0.95
\hat{\tau}_{\text{nspline}} 0.00 1.27 1.28 0.95 -0.02 1.26 1.28 0.95 0.04 1.29 1.28 0.95
\tilde{\tau}_{\text{nspline}} 0.00 1.27 1.28 0.95 -0.02 1.26 1.28 0.95 0.03 1.29 1.28 0.95
\hat{\tau}_{\text{nnet}} -0.02 1.29 1.30 0.95 -0.03 1.29 1.30 0.95 0.02 1.31 1.30 0.95
\tilde{\tau}_{\text{nnet}} -0.06 1.35 1.36 0.95 -0.08 1.36 1.36 0.95 -0.01 1.37 1.36 0.95
\hat{\tau}_{\text{rf}} 0.00 1.28 1.27 0.95 -0.02 1.29 1.27 0.94 0.04 1.31 1.27 0.95
\tilde{\tau}_{\text{rf}} 0.04 1.32 1.27 0.94 0.01 1.32 1.27 0.94 0.08 1.35 1.27 0.94
3 \hat{\tau}_{\text{linear}} 0.00 1.39 1.41 0.96 -0.06 1.39 1.41 0.96 -0.03 1.37 1.41 0.95
\tilde{\tau}_{\text{linear}} -0.02 1.39 1.40 0.96 -0.08 1.40 1.40 0.95 -0.04 1.37 1.40 0.95
\hat{\tau}_{\text{kernel}} 0.08 1.21 1.19 0.94 0.04 1.18 1.19 0.95 0.05 1.19 1.19 0.95
\tilde{\tau}_{\text{kernel}} 0.13 1.21 1.19 0.94 0.10 1.19 1.19 0.95 0.10 1.20 1.19 0.94
\hat{\tau}_{\text{nspline}} 0.01 1.39 1.38 0.95 -0.05 1.41 1.38 0.95 -0.02 1.38 1.38 0.95
\tilde{\tau}_{\text{nspline}} 0.00 1.40 1.38 0.95 -0.05 1.40 1.38 0.94 -0.03 1.40 1.38 0.94
\hat{\tau}_{\text{nnet}} 0.02 1.29 1.29 0.95 -0.05 1.29 1.29 0.95 -0.03 1.28 1.29 0.95
\tilde{\tau}_{\text{nnet}} -0.03 1.31 1.31 0.95 -0.07 1.31 1.31 0.95 -0.05 1.31 1.31 0.95
\hat{\tau}_{\text{rf}} 0.00 1.40 1.22 0.92 -0.03 1.39 1.22 0.91 -0.02 1.40 1.22 0.91
\tilde{\tau}_{\text{rf}} -0.01 1.48 1.26 0.91 -0.04 1.48 1.26 0.90 -0.03 1.49 1.26 0.90
4 \hat{\tau}_{\text{linear}} -0.11 3.67 3.71 0.95 0.03 3.82 3.71 0.95 -0.06 3.76 3.71 0.95
\tilde{\tau}_{\text{linear}} -0.12 3.61 3.67 0.96 0.01 3.8 3.67 0.94 -0.10 3.72 3.67 0.94
\hat{\tau}_{\text{kernel}} -0.05 3.63 3.57 0.94 0.11 3.73 3.57 0.94 0.00 3.68 3.56 0.94
\tilde{\tau}_{\text{kernel}} -0.05 3.45 3.47 0.95 0.11 3.56 3.47 0.95 -0.02 3.53 3.47 0.94
\hat{\tau}_{\text{nspline}} -0.11 3.64 3.59 0.95 0.10 3.72 3.59 0.94 -0.03 3.66 3.59 0.94
\tilde{\tau}_{\text{nspline}} -0.07 3.45 3.47 0.95 0.09 3.56 3.47 0.95 -0.03 3.53 3.47 0.95
\hat{\tau}_{\text{nnet}} -0.11 3.59 3.61 0.95 0.08 3.70 3.61 0.95 -0.05 3.65 3.61 0.94
\tilde{\tau}_{\text{nnet}} -0.08 3.45 3.47 0.95 0.08 3.56 3.48 0.95 -0.04 3.53 3.47 0.94
\hat{\tau}_{\text{rf}} -0.14 3.69 3.05 0.89 0.05 3.79 3.06 0.89 -0.05 3.75 3.05 0.88
\tilde{\tau}_{\text{rf}} -0.09 3.47 3.44 0.95 0.08 3.58 3.44 0.94 -0.05 3.54 3.44 0.94
  • Abbreviations: SD, standard deviation; SE, standard error; CP, coverage probability; Rand.: randomization; nspline: natural spline; nnet: neural network; rf: random forest.

Table 2: Simulated biases, standard deviations, standard errors, and coverage probabilities for different estimators and randomization methods under unequal allocation ($\pi=2/3$) and low-dimensional data-generating models.
Complete Rand. Stratified Block Rand. Minimization
Model Estimator Bias SD SE CP Bias SD SE CP Bias SD SE CP
1 τ^linear\hat{\tau}_{\text{linear}} 0.01 3.00 2.91 0.94 0.03 2.93 2.91 0.95 0.08 2.86 2.91 0.96
τ~linear\tilde{\tau}_{\text{linear}} 0.00 2.99 2.91 0.94 0.03 2.93 2.91 0.95 0.08 2.86 2.91 0.96
τ^kernel\hat{\tau}_{\text{kernel}} 0.01 3.00 2.91 0.94 0.03 2.93 2.91 0.95 0.08 2.86 2.91 0.96
τ~kernel\tilde{\tau}_{\text{kernel}} 0.00 2.99 2.91 0.94 0.03 2.93 2.91 0.95 0.08 2.86 2.91 0.96
τ^nspline\hat{\tau}_{\text{nspline}} 0.01 3.00 2.91 0.94 0.03 2.93 2.91 0.95 0.08 2.86 2.91 0.96
τ~nspline\tilde{\tau}_{\text{nspline}} 0.00 2.99 2.91 0.94 0.03 2.93 2.91 0.95 0.08 2.86 2.91 0.96
τ^nnet\hat{\tau}_{\text{nnet}} 0.00 3.00 2.91 0.94 0.03 2.94 2.91 0.95 0.08 2.86 2.91 0.96
τ~nnet\tilde{\tau}_{\text{nnet}} 0.00 3.02 2.93 0.94 0.04 2.96 2.93 0.95 0.09 2.89 2.93 0.95
τ^rf\hat{\tau}_{\text{rf}} -0.05 4.34 4.03 0.94 0.06 4.28 4.02 0.93 0.13 4.24 4.02 0.94
τ~rf\tilde{\tau}_{\text{rf}} 0.00 4.73 4.12 0.91 0.10 4.69 4.11 0.91 0.17 4.64 4.10 0.92
2 τ^linear\hat{\tau}_{\text{linear}} 0.04 1.61 1.55 0.94 0.01 1.56 1.55 0.96 0.02 1.51 1.55 0.96
τ~linear\tilde{\tau}_{\text{linear}} 0.09 1.62 1.55 0.94 0.06 1.56 1.55 0.96 0.07 1.51 1.54 0.96
τ^kernel\hat{\tau}_{\text{kernel}} 0.02 1.31 1.28 0.94 0.00 1.30 1.28 0.95 0.01 1.27 1.28 0.95
τ~kernel\tilde{\tau}_{\text{kernel}} 0.07 1.31 1.28 0.94 0.05 1.30 1.28 0.95 0.06 1.27 1.28 0.95
τ^nspline\hat{\tau}_{\text{nspline}} 0.02 1.31 1.28 0.94 0.00 1.30 1.28 0.95 0.01 1.27 1.28 0.95
τ~nspline\tilde{\tau}_{\text{nspline}} 0.02 1.31 1.28 0.94 0.00 1.30 1.28 0.95 0.01 1.27 1.28 0.95
τ^nnet\hat{\tau}_{\text{nnet}} 0.02 1.33 1.29 0.94 0.00 1.31 1.29 0.95 0.01 1.29 1.29 0.95
τ~nnet\tilde{\tau}_{\text{nnet}} 0.07 1.43 1.37 0.94 0.04 1.39 1.36 0.95 0.04 1.35 1.36 0.95
τ^rf\hat{\tau}_{\text{rf}} 0.01 1.33 1.27 0.94 0.00 1.31 1.27 0.94 0.00 1.29 1.27 0.95
τ~rf\tilde{\tau}_{\text{rf}} 0.06 1.39 1.28 0.93 0.04 1.36 1.27 0.93 0.04 1.33 1.28 0.95
3 τ^linear\hat{\tau}_{\text{linear}} 0.01 1.61 1.59 0.95 -0.04 1.6 1.59 0.95 -0.07 1.64 1.59 0.94
τ~linear\tilde{\tau}_{\text{linear}} -0.01 1.62 1.58 0.94 -0.05 1.61 1.58 0.94 -0.10 1.65 1.58 0.93
τ^kernel\hat{\tau}_{\text{kernel}} 0.11 1.20 1.19 0.95 0.12 1.24 1.19 0.94 0.07 1.25 1.19 0.94
τ~kernel\tilde{\tau}_{\text{kernel}} 0.18 1.23 1.18 0.94 0.19 1.27 1.18 0.93 0.14 1.27 1.18 0.93
τ^nspline\hat{\tau}_{\text{nspline}} 0.04 1.62 1.52 0.93 -0.02 1.62 1.52 0.93 -0.06 1.66 1.52 0.92
τ~nspline\tilde{\tau}_{\text{nspline}} 0.02 1.64 1.51 0.93 -0.02 1.62 1.51 0.93 -0.06 1.68 1.51 0.92
τ^nnet\hat{\tau}_{\text{nnet}} 0.01 1.39 1.38 0.95 -0.01 1.43 1.38 0.94 -0.06 1.43 1.38 0.94
τ~nnet\tilde{\tau}_{\text{nnet}} -0.05 1.46 1.43 0.94 -0.08 1.49 1.43 0.94 -0.14 1.51 1.43 0.93
τ^rf\hat{\tau}_{\text{rf}} 0.01 1.55 1.25 0.89 -0.03 1.55 1.25 0.89 -0.06 1.58 1.25 0.87
τ~rf\tilde{\tau}_{\text{rf}} 0.01 1.68 1.31 0.88 -0.05 1.67 1.31 0.88 -0.07 1.71 1.31 0.86
4 τ^linear\hat{\tau}_{\text{linear}} -0.04 3.78 3.75 0.95 -0.11 3.69 3.75 0.95 0.13 3.90 3.75 0.94
τ~linear\tilde{\tau}_{\text{linear}} -0.06 3.59 3.57 0.95 -0.11 3.54 3.57 0.95 0.12 3.71 3.58 0.94
τ^kernel\hat{\tau}_{\text{kernel}} -0.01 3.75 3.65 0.94 -0.10 3.67 3.65 0.95 0.17 3.86 3.66 0.94
τ~kernel\tilde{\tau}_{\text{kernel}} -0.06 3.49 3.47 0.95 -0.08 3.42 3.47 0.95 0.16 3.61 3.47 0.94
τ^nspline\hat{\tau}_{\text{nspline}} -0.01 3.76 3.67 0.94 -0.13 3.65 3.67 0.95 0.14 3.87 3.67 0.94
τ~nspline\tilde{\tau}_{\text{nspline}} -0.07 3.49 3.47 0.95 -0.09 3.43 3.47 0.95 0.15 3.61 3.47 0.94
τ^nnet\hat{\tau}_{\text{nnet}} -0.03 3.72 3.70 0.95 -0.11 3.63 3.70 0.96 0.15 3.85 3.70 0.94
τ~nnet\tilde{\tau}_{\text{nnet}} -0.07 3.49 3.47 0.95 -0.10 3.43 3.47 0.94 0.15 3.61 3.48 0.94
τ^rf\hat{\tau}_{\text{rf}} -0.04 3.79 3.11 0.89 -0.11 3.70 3.11 0.90 0.13 3.90 3.12 0.88
τ~rf\tilde{\tau}_{\text{rf}} -0.08 3.51 3.44 0.94 -0.10 3.43 3.44 0.94 0.15 3.63 3.45 0.94
  • Abbreviations: SD, standard deviation; SE, standard error; CP, coverage probability; Rand.: randomization; nspline: natural spline; nnet: neural network; rf: random forest.

7.3 High-dimensional data-generating models

We consider high-dimensional data-generating models to examine whether the treatment effect estimators and variance estimators obtained using different methods, such as random forest and neural network, remain consistent in the high-dimensional case and to compare the efficiencies of the estimators. We generate additional covariates for Models 1–4 so that there are $p$ covariates in total.

Model 5: Model 5 is based on Model 1 and shares the same underlying model. The additional covariates are independent of $X_{ij}$ and follow a multivariate normal distribution with zero mean and a covariance matrix whose off-diagonal elements are all $0.2$ and whose diagonal elements are $1$.

Model 6: Model 6 is based on Model 2 and shares the same underlying model. The additional covariates are first generated as in Model 5; we then randomly choose $\lfloor p/3\rfloor$ of the additional covariates and multiply each of them by $X_{i1}$ or $X_{i2}$ with equal probability to obtain the final high-dimensional covariates.

Model 7: Model 7 is based on Model 3 and shares the same underlying model. The additional covariates are independent of $X_{ij}$ and follow a multivariate normal distribution with zero mean and a covariance matrix that is a symmetric Toeplitz matrix whose first row is a geometric sequence with initial value $1$ and common ratio $0.5$.

Model 8: Model 8 is based on Model 4 and shares the same underlying model. The additional covariates are generated as in Model 6. A sketch of generating the additional covariates for Models 5–8 is given below.
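The following is a minimal sketch of how the additional covariates described above could be generated. It reflects our reading of the model descriptions; the function names and the way the base covariates $X_{i1}, X_{i2}$ are passed in are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)

def extra_covariates_model5(n, p_extra):
    # compound-symmetric covariance: 1 on the diagonal, 0.2 off the diagonal
    cov = np.full((p_extra, p_extra), 0.2)
    np.fill_diagonal(cov, 1.0)
    return rng.multivariate_normal(np.zeros(p_extra), cov, size=n)

def extra_covariates_model6(n, p_extra, p_total, x1, x2):
    # start from Model 5-style covariates, then multiply floor(p_total/3)
    # randomly chosen additional covariates by X_{i1} or X_{i2}
    # (chosen with equal probability) to create interaction-type covariates
    z = extra_covariates_model5(n, p_extra)
    cols = rng.choice(p_extra, size=p_total // 3, replace=False)
    for j in cols:
        base = x1 if rng.random() < 0.5 else x2
        z[:, j] = z[:, j] * base
    return z

def extra_covariates_model7(n, p_extra):
    # symmetric Toeplitz covariance with first row 1, 0.5, 0.25, ...
    cov = toeplitz(0.5 ** np.arange(p_extra))
    return rng.multivariate_normal(np.zeros(p_extra), cov, size=n)
```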

7.4 High-dimensional simulation results

Here, we present the simulation results of the high-dimensional methods with $p=200$. The randomization settings are the same as those in the low-dimensional simulations. We apply the lasso, random forest, gradient boosting regression tree (gbrt), recursive partitioning and regression tree (rpart), and a neural network (nnet) with one hidden layer to all covariates to estimate $\hat{h}_{[k]}(X_{i},a)$; a sketch of fitting these learners is given below.
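As an illustration, the adjustment functions could be fitted with off-the-shelf learners as sketched below. This is a minimal sketch using scikit-learn analogues of the methods named above (for instance, a single regression tree standing in for rpart and a one-hidden-layer MLP for nnet); the tuning parameters and the exact implementations used in the simulations are not specified here and are our assumptions.

```python
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

def fit_adjustment(X, Y, A, a, method="lasso"):
    """Fit an adjustment function h(x, a): regress outcomes on covariates in arm a.

    X, Y, A are the covariates, outcomes, and treatment indicators of the units
    used for fitting (e.g., the units in stratum k, or the whole sample for
    stratum-common adjustment); a is the arm (0 or 1).
    """
    learners = {
        "lasso": LassoCV(cv=5),
        "rf": RandomForestRegressor(n_estimators=500),
        "gbrt": GradientBoostingRegressor(),
        "rpart": DecisionTreeRegressor(min_samples_leaf=20),
        "nnet": MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000),
    }
    model = learners[method]
    mask = (A == a)
    model.fit(X[mask], Y[mask])
    return model.predict  # h_hat(x, a) = model.predict(x)
```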

Table 3: Simulated biases, standard deviations, standard errors, and coverage probabilities for different estimators and randomization methods under equal allocation ($\pi=1/2$) and high-dimensional data-generating models.
Complete Rand. Stratified Block Rand. Minimization
Model Estimator Bias SD SE CP Bias SD SE CP Bias SD SE CP
5 τ^lasso\hat{\tau}_{\text{lasso}} 0.07 2.95 2.92 0.95 0.03 2.96 2.91 0.95 -0.01 2.98 2.91 0.94
τ~lasso\tilde{\tau}_{\text{lasso}} 0.09 3.01 2.98 0.95 0.02 3.02 2.97 0.94 -0.01 3.03 2.97 0.94
τ^rf\hat{\tau}_{\text{rf}} -0.13 3.53 3.22 0.92 -0.21 3.50 3.21 0.93 -0.21 3.49 3.21 0.93
τ~rf\tilde{\tau}_{\text{rf}} -0.09 4.95 3.99 0.89 -0.16 4.85 3.98 0.90 -0.15 4.82 3.97 0.90
τ^gbrt\hat{\tau}_{\text{gbrt}} 0.08 3.06 3.01 0.95 0.04 3.06 3.01 0.94 0.01 3.04 3.01 0.95
τ~gbrt\tilde{\tau}_{\text{gbrt}} 0.08 3.15 3.06 0.94 0.03 3.13 3.06 0.94 0.02 3.11 3.05 0.94
τ^rpart\hat{\tau}_{\text{rpart}} -0.21 3.31 3.29 0.95 -0.34 3.31 3.29 0.94 -0.35 3.37 3.29 0.94
τ~rpart\tilde{\tau}_{\text{rpart}} 0.88 4.30 4.21 0.94 0.69 4.27 4.17 0.94 0.78 4.29 4.17 0.94
τ^nnet\hat{\tau}_{\text{nnet}} -1.33 6.97 5.48 0.87 -1.54 7.06 5.48 0.86 -1.47 7.08 5.50 0.87
τ~nnet\tilde{\tau}_{\text{nnet}} -0.40 8.70 6.72 0.86 -0.42 8.68 6.71 0.87 -0.60 8.58 6.70 0.87
6 τ^lasso\hat{\tau}_{\text{lasso}} -0.09 1.59 1.55 0.94 -0.09 1.56 1.55 0.95 -0.06 1.54 1.55 0.95
τ~lasso\tilde{\tau}_{\text{lasso}} -0.42 1.68 1.56 0.92 -0.41 1.64 1.55 0.93 -0.39 1.63 1.56 0.93
τ^rf\hat{\tau}_{\text{rf}} -0.22 1.41 1.28 0.92 -0.22 1.35 1.28 0.93 -0.20 1.37 1.28 0.93
τ~rf\tilde{\tau}_{\text{rf}} -0.48 1.55 1.29 0.88 -0.48 1.47 1.29 0.90 -0.44 1.49 1.29 0.89
τ^gbrt\hat{\tau}_{\text{gbrt}} -0.04 1.32 1.29 0.94 -0.03 1.27 1.29 0.96 -0.02 1.30 1.29 0.95
τ~gbrt\tilde{\tau}_{\text{gbrt}} -0.18 1.36 1.29 0.93 -0.19 1.31 1.29 0.95 -0.17 1.32 1.29 0.94
τ^rpart\hat{\tau}_{\text{rpart}} -0.11 1.38 1.36 0.95 -0.11 1.34 1.35 0.96 -0.09 1.35 1.35 0.95
τ~rpart\tilde{\tau}_{\text{rpart}} -0.01 1.50 1.46 0.94 -0.02 1.42 1.45 0.96 0.01 1.46 1.46 0.95
τ^nnet\hat{\tau}_{\text{nnet}} 0.08 2.49 1.92 0.87 0.09 2.46 1.92 0.87 0.24 2.46 1.92 0.87
τ~nnet\tilde{\tau}_{\text{nnet}} -0.73 2.53 1.97 0.86 -0.73 2.55 1.96 0.85 -0.70 2.62 1.96 0.85
7 τ^lasso\hat{\tau}_{\text{lasso}} 0.00 1.67 1.68 0.95 0.00 1.69 1.68 0.95 -0.09 1.62 1.68 0.96
τ~lasso\tilde{\tau}_{\text{lasso}} -0.02 1.70 1.71 0.95 -0.01 1.73 1.70 0.95 -0.10 1.64 1.71 0.96
τ^rf\hat{\tau}_{\text{rf}} 0.26 1.67 1.22 0.84 0.28 1.70 1.22 0.84 0.20 1.61 1.22 0.86
τ~rf\tilde{\tau}_{\text{rf}} 0.31 1.77 1.27 0.83 0.32 1.80 1.27 0.82 0.24 1.70 1.27 0.85
τ^gbrt\hat{\tau}_{\text{gbrt}} -0.01 1.64 1.50 0.92 0.01 1.68 1.49 0.92 -0.08 1.59 1.50 0.93
τ~gbrt\tilde{\tau}_{\text{gbrt}} -0.02 1.66 1.42 0.91 -0.01 1.70 1.41 0.90 -0.09 1.63 1.42 0.91
τ^rpart\hat{\tau}_{\text{rpart}} -0.05 1.59 1.50 0.93 -0.01 1.61 1.50 0.94 -0.07 1.58 1.50 0.94
τ~rpart\tilde{\tau}_{\text{rpart}} -0.02 1.64 1.52 0.93 -0.04 1.67 1.52 0.93 -0.10 1.65 1.52 0.93
τ^nnet\hat{\tau}_{\text{nnet}} 0.47 2.62 1.88 0.84 0.44 2.68 1.88 0.84 0.31 2.67 1.88 0.82
τ~nnet\tilde{\tau}_{\text{nnet}} 0.88 2.77 2.00 0.81 0.94 2.79 2.00 0.82 0.90 2.67 2.00 0.83
8 τ^lasso\hat{\tau}_{\text{lasso}} -0.07 3.76 3.77 0.94 -0.03 3.74 3.77 0.95 0.04 3.85 3.77 0.94
τ~lasso\tilde{\tau}_{\text{lasso}} -0.13 3.65 3.69 0.94 -0.05 3.67 3.70 0.95 0.04 3.79 3.70 0.94
τ^rf\hat{\tau}_{\text{rf}} 0.49 3.62 2.83 0.87 0.50 3.55 2.83 0.88 0.55 3.68 2.84 0.86
τ~rf\tilde{\tau}_{\text{rf}} 0.59 3.54 3.36 0.93 0.65 3.50 3.37 0.94 0.71 3.63 3.37 0.92
τ^gbrt\hat{\tau}_{\text{gbrt}} 0.04 3.67 3.36 0.92 0.05 3.58 3.36 0.93 0.07 3.71 3.36 0.92
τ~gbrt\tilde{\tau}_{\text{gbrt}} -0.09 3.45 3.47 0.95 0.01 3.39 3.47 0.95 0.09 3.55 3.47 0.94
τ^rpart\hat{\tau}_{\text{rpart}} 0.03 3.81 3.44 0.92 0.02 3.75 3.45 0.92 0.07 3.91 3.45 0.91
τ~rpart\tilde{\tau}_{\text{rpart}} -0.08 3.49 3.51 0.94 0.01 3.43 3.51 0.95 0.09 3.58 3.51 0.94
τ^nnet\hat{\tau}_{\text{nnet}} 0.09 4.84 3.79 0.87 -0.05 4.76 3.81 0.89 0.19 4.84 3.80 0.87
τ~nnet\tilde{\tau}_{\text{nnet}} 1.25 4.96 3.79 0.86 1.25 4.80 3.79 0.87 1.21 4.98 3.79 0.86
  • Abbreviations: SD, standard deviation; SE, standard error; CP, coverage probability; Rand.: randomization; rf: random forest; gbrt: gradient boosting regression tree; rpart: recursive partitioning and regression tree; nnet: neural network.

From Table 3, we see that in the high-dimensional cases, the treatment effect estimators still behave similarly under different randomization methods. Under the considered scenarios, they all have small biases, except for $\hat{\tau}_{\text{nnet}}$. When the data-generating model is a high-dimensional linear model (Model 5), $\hat{\tau}_{\text{lasso}}$ has standard deviations similar to those of $\hat{\tau}_{\text{linear}}$ under the corresponding low-dimensional model (Model 1), indicating that the lasso makes good predictions and that $\hat{\tau}_{\text{lasso}}$ achieves the optimal efficiency. All other treatment effect estimators have larger standard deviations than $\hat{\tau}_{\text{lasso}}$. When the data-generating model is nonlinear (Models 6, 7, and 8), the tree-based methods fit well. The performances of the treatment effect estimators vary from model to model. For example, among the stratum-common estimators, $\hat{\tau}_{\text{gbrt}}$ has the smallest standard deviations in Model 6, $\hat{\tau}_{\text{rpart}}$ has the smallest standard deviations in Model 7, and $\hat{\tau}_{\text{rf}}$ has the smallest standard deviations in Model 8. In contrast, $\hat{\tau}_{\text{nnet}}$ always has relatively large standard deviations, suggesting that, unlike in the low-dimensional cases, the neural network does not estimate the true model as well as the tree-based methods. Additionally, as in the low-dimensional cases, the stratum-specific estimators have smaller standard deviations than the stratum-common estimators only when the data-generating model is stratum-specific (Model 8).

From the perspective of inference, no machine-learning-adjusted estimator always attains a valid 95% coverage probability. However, for nonlinear data-generating models (Models 6, 7, and 8), the machine-learning-adjusted estimators have smaller standard deviations than the lasso-adjusted estimators. This suggests that the machine learning methods converge to certain functions, but because the same samples are reused for fitting the adjustment functions and for estimating the treatment effect, complex dependencies are introduced, and the convergence properties required by Assumption 4 may not be satisfied. This motivates the use of the sample splitting technique to alleviate these requirements, as discussed in Section 6.3.
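The following is a minimal sketch of sample splitting with cross-fitting for the adjustment functions, assuming a generic scikit-learn learner and stratum-common adjustment; it illustrates the idea of fitting on one fold and predicting on the held-out fold, and is not a verbatim implementation of Algorithm 1 (for instance, the splitting here is done on the whole sample rather than within strata). The cross-fitted values can then be plugged into the covariate-adjusted estimator in place of the in-sample fits.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def crossfit_adjustment(X, Y, A, learner, n_splits=2, random_state=0):
    """Cross-fitted predictions hhat[i, a] of the adjustment function.

    For each fold, the learner is trained on the remaining fold(s), separately
    within each treatment arm, and used to predict for the held-out units, so
    that no unit's outcome is used when fitting its own adjustment value.
    """
    n = len(Y)
    hhat = np.zeros((n, 2))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, test_idx in kf.split(X):
        for a in (0, 1):
            fit_idx = train_idx[A[train_idx] == a]
            model = clone(learner).fit(X[fit_idx], Y[fit_idx])
            hhat[test_idx, a] = model.predict(X[test_idx])
    return hhat
```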

Table 4: Simulated biases, standard deviations, standard errors, and coverage probabilities for different sample splitting estimators and randomization methods under equal allocation (π=1/2\pi=1/2) and high-dimensional data-generating models.
Complete Rand. Stratified Block Rand. Minimization
Model Estimator Bias SD SE CP Bias SD SE CP Bias SD SE CP
5 τ^lassoss\hat{\tau}_{\text{lasso}}^{\text{ss}} 0.04 2.97 2.89 0.95 0.04 2.98 2.89 0.94 0.00 3.01 2.89 0.94
τ~lassoss\tilde{\tau}_{\text{lasso}}^{\text{ss}} 0.07 3.08 3.00 0.95 0.04 3.09 3.00 0.94 0.01 3.10 3.00 0.94
τ^rfss\hat{\tau}_{\text{rf}}^{\text{ss}} 0.10 3.81 3.67 0.94 0.03 3.77 3.66 0.94 0.02 3.74 3.66 0.94
τ~rfss\tilde{\tau}_{\text{rf}}^{\text{ss}} 0.10 5.45 5.27 0.94 0.06 5.39 5.25 0.94 0.04 5.33 5.25 0.95
τ^gbrtss\hat{\tau}_{\text{gbrt}}^{\text{ss}} 0.08 3.02 2.94 0.94 0.03 3.06 2.94 0.94 0.01 3.03 2.94 0.94
τ~gbrtss\tilde{\tau}_{\text{gbrt}}^{\text{ss}} 0.12 3.67 3.54 0.94 0.02 3.64 3.52 0.94 0.07 3.59 3.52 0.94
τ^rpartss\hat{\tau}_{\text{rpart}}^{\text{ss}} 0.12 3.40 3.33 0.94 0.07 3.41 3.32 0.94 0.00 3.42 3.32 0.94
τ~rpartss\tilde{\tau}_{\text{rpart}}^{\text{ss}} 0.17 4.88 4.75 0.94 0.01 4.88 4.73 0.94 0.10 4.80 4.72 0.94
τ^nnetss\hat{\tau}_{\text{nnet}}^{\text{ss}} 0.14 6.97 6.72 0.94 0.34 6.81 6.71 0.95 -0.06 6.86 6.72 0.94
τ~nnetss\tilde{\tau}_{\text{nnet}}^{\text{ss}} 0.14 8.62 8.37 0.95 0.24 8.67 8.35 0.94 0.12 8.65 8.36 0.94
6 τ^lassoss\hat{\tau}_{\text{lasso}}^{\text{ss}} -0.03 1.61 1.55 0.94 -0.02 1.57 1.55 0.95 0.01 1.57 1.55 0.95
τ~lassoss\tilde{\tau}_{\text{lasso}}^{\text{ss}} -0.03 1.80 1.74 0.94 -0.01 1.72 1.73 0.95 0.02 1.73 1.73 0.95
τ^rfss\hat{\tau}_{\text{rf}}^{\text{ss}} -0.02 1.44 1.38 0.94 -0.02 1.37 1.38 0.95 0.01 1.39 1.38 0.95
τ~rfss\tilde{\tau}_{\text{rf}}^{\text{ss}} -0.01 1.58 1.52 0.94 -0.01 1.51 1.52 0.95 0.01 1.53 1.52 0.95
τ^gbrtss\hat{\tau}_{\text{gbrt}}^{\text{ss}} -0.02 1.34 1.29 0.94 -0.02 1.28 1.29 0.95 0.00 1.31 1.29 0.95
τ~gbrtss\tilde{\tau}_{\text{gbrt}}^{\text{ss}} -0.02 1.47 1.42 0.94 -0.02 1.40 1.41 0.95 0.01 1.42 1.41 0.95
τ^rpartss\hat{\tau}_{\text{rpart}}^{\text{ss}} -0.02 1.40 1.35 0.94 -0.01 1.35 1.35 0.95 0.00 1.38 1.35 0.94
τ~rpartss\tilde{\tau}_{\text{rpart}}^{\text{ss}} -0.01 1.55 1.50 0.94 -0.02 1.48 1.50 0.96 0.03 1.53 1.50 0.94
τ^nnetss\hat{\tau}_{\text{nnet}}^{\text{ss}} -0.03 2.32 2.28 0.95 -0.05 2.36 2.28 0.94 0.04 2.35 2.28 0.95
τ~nnetss\tilde{\tau}_{\text{nnet}}^{\text{ss}} -0.06 2.40 2.37 0.95 -0.04 2.41 2.36 0.95 0.01 2.38 2.36 0.94
7 τ^lassoss\hat{\tau}_{\text{lasso}}^{\text{ss}} 0.01 1.68 1.68 0.95 0.02 1.71 1.68 0.95 -0.08 1.63 1.68 0.95
τ~lassoss\tilde{\tau}_{\text{lasso}}^{\text{ss}} 0.01 1.72 1.72 0.95 0.02 1.75 1.71 0.95 -0.08 1.66 1.72 0.95
τ^rfss\hat{\tau}_{\text{rf}}^{\text{ss}} 0.01 1.68 1.67 0.95 0.01 1.70 1.66 0.95 -0.06 1.63 1.67 0.95
τ~rfss\tilde{\tau}_{\text{rf}}^{\text{ss}} 0.02 1.78 1.77 0.95 0.02 1.81 1.77 0.95 -0.07 1.73 1.77 0.95
τ^gbrtss\hat{\tau}_{\text{gbrt}}^{\text{ss}} -0.01 1.57 1.57 0.95 0.01 1.61 1.57 0.95 -0.07 1.55 1.57 0.95
τ~gbrtss\tilde{\tau}_{\text{gbrt}}^{\text{ss}} 0.00 1.63 1.64 0.95 0.01 1.67 1.64 0.95 -0.07 1.60 1.64 0.96
τ^rpartss\hat{\tau}_{\text{rpart}}^{\text{ss}} -0.02 1.64 1.62 0.95 0.03 1.64 1.62 0.95 -0.06 1.61 1.62 0.95
τ~rpartss\tilde{\tau}_{\text{rpart}}^{\text{ss}} 0.01 1.68 1.67 0.95 -0.02 1.73 1.67 0.94 -0.07 1.65 1.68 0.95
τ^nnetss\hat{\tau}_{\text{nnet}}^{\text{ss}} -0.04 2.33 2.34 0.95 -0.01 2.40 2.34 0.95 -0.09 2.26 2.34 0.95
τ~nnetss\tilde{\tau}_{\text{nnet}}^{\text{ss}} 0.02 2.52 2.47 0.95 0.05 2.57 2.47 0.93 -0.14 2.44 2.47 0.95
8 τ^lassoss\hat{\tau}_{\text{lasso}}^{\text{ss}} -0.03 3.79 3.77 0.94 -0.01 3.76 3.77 0.95 0.06 3.88 3.77 0.94
τ~lassoss\tilde{\tau}_{\text{lasso}}^{\text{ss}} -0.07 3.67 3.69 0.94 0.01 3.68 3.70 0.95 0.09 3.81 3.70 0.94
τ^rfss\hat{\tau}_{\text{rf}}^{\text{ss}} -0.02 3.67 3.67 0.95 -0.02 3.61 3.67 0.95 0.05 3.75 3.67 0.95
τ~rfss\tilde{\tau}_{\text{rf}}^{\text{ss}} -0.06 3.54 3.54 0.94 0.01 3.49 3.54 0.95 0.06 3.63 3.55 0.94
τ^gbrtss\hat{\tau}_{\text{gbrt}}^{\text{ss}} -0.01 3.72 3.71 0.94 0.02 3.65 3.71 0.95 0.02 3.80 3.71 0.94
τ~gbrtss\tilde{\tau}_{\text{gbrt}}^{\text{ss}} -0.08 3.47 3.47 0.94 0.01 3.42 3.47 0.95 0.09 3.56 3.48 0.94
τ^rpartss\hat{\tau}_{\text{rpart}}^{\text{ss}} 0.02 3.82 3.83 0.95 0.03 3.77 3.83 0.95 0.02 3.95 3.83 0.94
τ~rpartss\tilde{\tau}_{\text{rpart}}^{\text{ss}} -0.06 3.52 3.53 0.94 0.02 3.48 3.53 0.95 0.09 3.64 3.53 0.94
τ^nnetss\hat{\tau}_{\text{nnet}}^{\text{ss}} -0.03 4.50 4.49 0.94 -0.02 4.44 4.49 0.95 0.03 4.70 4.50 0.93
τ~nnetss\tilde{\tau}_{\text{nnet}}^{\text{ss}} -0.03 4.26 4.21 0.94 -0.04 4.18 4.21 0.95 0.12 4.39 4.22 0.94
  • Abbreviations: SD, standard deviation; SE, standard error; CP, coverage probability; Rand.: randomization; rf: random forest; gbrt: gradient boosting regression tree; rpart: recursive partitioning and regression tree; nnet: neural network; ss: sample splitting.

Table 4 presents the simulation results obtained using the sample splitting and cross-fitting techniques (see Algorithm 1). First, with these techniques, all treatment effect estimators tend to have smaller biases under the considered scenarios. In particular, in Model 8 under complete randomization, the absolute bias of $\tilde{\tau}_{\text{nnet}}^{\text{ss}}$ is 0.03, which is $2.4\%$ of the absolute bias of the corresponding $\tilde{\tau}_{\text{nnet}}$ obtained without sample splitting (Table 3). In addition, the treatment effect estimators have standard deviations similar to those obtained without sample splitting, indicating that cross-fitting recovers the efficiency that would otherwise be lost by splitting the sample. There is still no single covariate adjustment method that always yields the best treatment effect estimator; however, $\hat{\tau}_{\text{nnet}}^{\text{ss}}$ and $\tilde{\tau}_{\text{nnet}}^{\text{ss}}$ continue to have the largest standard deviations.

In terms of statistical inference, we see that all treatment effect estimators have coverage probabilities of approximately $95\%$. In other words, sample splitting effectively alleviates the requirements on the covariate adjustment methods (Assumption 4) and thus allows valid inference.

8 Discussion and Practical Recommendations

In this paper, we adopted the framework of Liu et al., (2023) and established the asymptotic properties of the oracle treatment effect estimator when the forms of the covariate adjustment functions are known. We reviewed the assumptions presented in Liu et al., (2023) that adjustment methods need to satisfy for valid inference once the estimated covariate adjustment functions are plugged in, and we presented a detailed verification of these assumptions for local linear kernel regression. Based on the asymptotic distribution, valid confidence intervals and tests can be constructed for the average treatment effect. In the case of high-dimensional covariates, we proposed estimators that combine machine learning methods with the sample splitting technique and that exhibit efficient and robust performance in practice.

Based on the theoretical verifications and simulation results, and considering simplicity, robustness, and efficiency, our recommendations for practitioners are as follows. If a strong linear relationship between covariates and outcomes is expected, we recommend linear regression or the lasso for covariate adjustment, because these methods readily lead to a valid and efficient treatment effect estimator. Other methods should be considered when the relationship between covariates and outcomes is complex. In low-dimensional cases, the local linear kernel or smoothing spline can be used to adjust for covariates while retaining valid inference. In high-dimensional cases, an appropriate machine learning method can be chosen in light of background information and the distributions of covariates and outcomes; moreover, sample splitting and cross-fitting should be used to obtain valid inference and regain efficiency. A sketch of the resulting covariate-adjusted estimator, given fitted adjustment functions, follows.
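As a summary of the workflow, the sketch below computes the stratum-common covariate-adjusted estimator from fitted (or cross-fitted) adjustment values, following the form of the oracle estimator analyzed in Appendix B. This is a minimal illustration; the variable names are ours and the sketch omits the variance estimator.

```python
import numpy as np

def adjusted_tau(Y, A, strata, hhat):
    """Covariate-adjusted treatment effect estimator under stratified randomization.

    Y      : outcomes
    A      : treatment indicators (1 = treatment, 0 = control)
    strata : stratum labels B_i
    hhat   : array of shape (n, 2) with fitted adjustment values hhat[i, a]
    """
    n = len(Y)
    tau = 0.0
    for k in np.unique(strata):
        idx = strata == k
        nk = idx.sum()
        pi_k = A[idx].mean()                       # pi_{n[k]}
        y1 = Y[idx][A[idx] == 1].mean()            # \bar{Y}_{[k]1}
        y0 = Y[idx][A[idx] == 0].mean()            # \bar{Y}_{[k]0}
        adj1 = np.sum((A[idx] - pi_k) * hhat[idx, 1]) / (nk * pi_k)
        adj0 = np.sum((A[idx] - pi_k) * hhat[idx, 0]) / (nk * (1 - pi_k))
        tau += (nk / n) * ((y1 - adj1) - (y0 + adj0))
    return tau
```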

Independently of and concurrently with our work, Rafi, (2023) and Bannick et al., (2023) also studied how to improve the efficiency of the treatment effect estimator under stratified randomization and used sample splitting and cross-fitting to alleviate the requirements on the estimation methods. Although we share similar ideas, the estimators, adjustment methods, and contributions considered are different. Rafi, (2023) explored the semiparametric efficiency bound and imposed a second-moment convergence assumption on the estimated function under sample splitting to establish the consistency and efficiency of the treatment effect estimator. Inspired by his proofs, we provided a theoretical justification of our proposed sample splitting estimator based on Assumption 6. It should be noted that our sample splitting algorithm differs slightly from that of Rafi, (2023): our algorithm splits the whole sample, whereas Rafi, (2023) performed the sample splitting within each stratum. Importantly, although our theoretical results are inspired by Rafi, (2023), the sample splitting algorithm and estimator studied here were introduced independently in our work. Moreover, Rafi, (2023) proved that, with sample splitting and cross-fitting, the estimator adjusted by Nadaraya-Watson kernel regression can achieve the efficiency bound. In this paper, we proved that the treatment effect estimator adjusted by local linear kernel regression attains the corresponding asymptotic property without sample splitting. By comparing the asymptotic variances, we conclude that our proposed estimator adjusted by the local linear kernel also achieves the efficiency bound while retaining high computational efficiency. In addition, we provide a consistent variance estimator for the average treatment effect, thereby solving the problem of constructing valid inference procedures that was left open in Rafi, (2023). Bannick et al., (2023) considered the Donsker condition for parametric and nonparametric methods in low-dimensional cases, extending Guo and Basse, (2023) and Zhang and Zheng, (2020), whereas we directly verified the asymptotic properties for kernel regression. In high-dimensional cases, Bannick et al., (2023) also imposed an $L_{2}$ condition and established the theoretical properties of the cross-fitted estimator, although the exact forms of the condition and the estimator are slightly different from ours. In their numerical experiments, Bannick et al., (2023) applied their results to the generalized linear model and random forest, whereas we considered a broad class of adjustment methods, all of which achieved good empirical performance.

References

  • Bannick et al., (2023) Bannick, M. S., Shao, J., Liu, J., Du, Y., Yi, Y., and Ye, T. (2023). A general form of covariate adjustment in randomized clinical trials. arXiv preprint arXiv:2306.10213.
  • Barron, (1994) Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Mach. Learn., 14(1):115–133.
  • Bodory et al., (2022) Bodory, H., Huber, M., and Lafférs, L. (2022). Evaluating (weighted) dynamic treatment effects by double machine learning. Econom. J., 25(3):628–648.
  • Breiman, (1996) Breiman, L. (1996). Bagging predictors. Mach. Learn., 24(2):123–140.
  • Breiman, (2001) Breiman, L. (2001). Random forests. Mach. Learn., 45(1):5–32.
  • Bugni et al., (2018) Bugni, F. A., Canay, I. A., and Shaikh, A. M. (2018). Inference under covariate-adaptive randomization. J. Am. Stat. Assoc., 113(524):1784–1796.
  • Bugni et al., (2019) Bugni, F. A., Canay, I. A., and Shaikh, A. M. (2019). Inference under covariate-adaptive randomization with multiple treatments. Quant. Econ., 10:1747–1785.
  • Chen and Guestrin, (2016) Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.
  • Chen et al., (2024) Chen, X., Liu, Y., Ma, S., and Zhang, Z. (2024). Causal inference of general treatment effects using neural networks with a diverging number of confounders. Journal of Econometrics, 238(1):105555.
  • Chernozhukov et al., (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. Econom. J., 21:C1–C68.
  • Cornfield et al., (1959) Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., and Wynder, E. L. (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. J. Natl. Cancer Inst., 22:173–203.
  • De Cristofaro, (2021) De Cristofaro, E. (2021). A critical overview of privacy in machine learning. IEEE Secur. Priv., 19(4):19–27.
  • Efron, (1971) Efron, B. (1971). Forcing a sequential experiment to be balanced. Biometrika, 58(3):403–417.
  • Fan, (1992) Fan, J. (1992). Design-adaptive nonparametric regression. J. Am. Stat. Assoc., 87(420):998–1004.
  • Farrell et al., (2021) Farrell, M. H., Liang, T., and Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica, 89(1):181–213.
  • Freund and Schapire, (1997) Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119–139.
  • Friedman, (2002) Friedman, J. H. (2002). Stochastic gradient boosting. Comput. Stat. Data Anal., 38(4):367–378.
  • Garg and Mago, (2021) Garg, A. and Mago, V. (2021). Role of machine learning in medical research: A survey. Comput. Sci. Rev., 40:100370.
  • Gelman and Imbens, (2019) Gelman, A. and Imbens, G. (2019). Why high-order polynomials should not be used in regression discontinuity designs. J. Bus. Econ. Stat., 37(3):447–456.
  • Guo and Basse, (2023) Guo, K. and Basse, G. (2023). The generalized oaxaca-blinder estimator. Journal of the American Statistical Association, 118(541):524–536.
  • Hoerl and Kennard, (1970) Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67.
  • Hu and Hu, (2012) Hu, Y. and Hu, F. (2012). Asymptotic properties of covariate-adaptive randomization. Ann. Stat., 40(3):1794–1815.
  • Huang et al., (2008) Huang, J., Ma, S., and Zhang, C.-H. (2008). Adaptive lasso for sparse high-dimensional regression models. Stat. Sin., 18:1603–1618.
  • Jiang et al., (2023) Jiang, L., Phillips, P. C., Tao, Y., and Zhang, Y. (2023). Regression-adjusted estimation of quantile treatment effects under covariate-adaptive randomizations. Journal of Econometrics, 234(2):758–776.
  • Kahan and Morris, (2012) Kahan, B. C. and Morris, T. P. (2012). Improper analysis of trials randomised using stratified blocks or minimisation. Stat. Med., 31(4):328–340.
  • Kallus et al., (2019) Kallus, N., Mao, X., and Uehara, M. (2019). Localized debiased machine learning: Efficient inference on quantile treatment effects and beyond. arXiv preprint arXiv:1912.12945.
  • Li and Ding, (2020) Li, X. and Ding, P. (2020). Rerandomization and regression adjustment. J. R. Stat. Soc.: Ser. B (Stat. Methodol.), 82:241–268.
  • Lin, (2013) Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. Ann. Appl. Stat., 7:295–318.
  • Liu et al., (2023) Liu, H., Tu, F., and Ma, W. (2023). Lasso-adjusted treatment effect estimation under covariate-adaptive randomization. Biometrika, 110(2):431–447.
  • Liu and Yang, (2020) Liu, H. and Yang, Y. (2020). Regression-adjusted average treatment effect estimates in stratified randomized experiments. Biometrika, 107(4):935–948.
  • Ma et al., (2022) Ma, W., Tu, F., and Liu, H. (2022). Regression analysis for covariate-adaptive randomization: A robust and efficient inference perspective. Stat. Med., 41(29):5645–5661.
  • Niyogi and Girosi, (1996) Niyogi, P. and Girosi, F. (1996). On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Comput., 8(4):819–842.
  • Pedersen et al., (2020) Pedersen, M., Verspoor, K., Jenkinson, M., Law, M., Abbott, D. F., and Jackson, G. D. (2020). Artificial intelligence for clinical decision support in neurology. Brain Commun., 2(2).
  • Peel, (2010) Peel, L. (2010). Estimating network parameters for selecting community detection algorithms. In 13th International Conference on Information Fusion, pages 1–8. IEEE.
  • Picard and Berk, (1990) Picard, R. R. and Berk, K. N. (1990). Data splitting. Am. Stat., 44(2):140–147.
  • Pinkus, (2012) Pinkus, A. (2012). N-widths in Approximation Theory, volume 7. Springer Science & Business Media, New York.
  • Pocock and Simon, (1975) Pocock, S. J. and Simon, R. (1975). Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics, 31(1):103–115.
  • Rafi, (2023) Rafi, A. (2023). Efficient semiparametric estimation of average treatment effects under covariate adaptive randomization. arXiv preprint arXiv:2305.08340.
  • Rinaldo et al., (2019) Rinaldo, A., Wasserman, L., and G’Sell, M. (2019). Bootstrapping and sample splitting for high-dimensional, assumption-lean inference. Ann. Stat., 47(6):3438–3469.
  • Rosenberger and Lachin, (2015) Rosenberger, W. F. and Lachin, J. M. (2015). Randomization in Clinical Trials: Theory and Practice. John Wiley & Sons, New Jersey, 2nd edition.
  • Ruppert and Wand, (1994) Ruppert, D. and Wand, M. P. (1994). Multivariate locally weighted least squares regression. Ann. Stat., 22(3):1346–1370.
  • Schmidt-Hieber, (2020) Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with relu activation function. Ann. Stat., 48(4):1875–1897.
  • Silverman, (1984) Silverman, B. W. (1984). Spline smoothing: The equivalent variable kernel method. Ann. Stat., 12(3):898–916.
  • Stone, (1980) Stone, C. J. (1980). Optimal rates of convergence for nonparametric estimators. Ann. Stat., 8(6):1348–1360.
  • Stone, (1982) Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Stat., 10(4):1040–1053.
  • Taves, (1974) Taves, D. R. (1974). Minimization: A new method of assigning patients to treatment and control groups. Clin. Pharmacol. Ther., 15(5):443–453.
  • Tibshirani, (1996) Tibshirani, R. J. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc.: Ser. B (Stat. Methodol.), 58:267–288.
  • Tsiatis et al., (2008) Tsiatis, A. A., Davidian, M., Zhang, M., and Lu, X. (2008). Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Stat. Med., 27(23):4658–4677.
  • Wager and Athey, (2018) Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Stat. Assoc., 113(523):1228–1242.
  • Wang et al., (2023) Wang, B., Susukida, R., Mojtabai, R., Amin-Esmaeili, M., and Rosenblum, M. (2023). Model-robust inference for clinical trials that improve precision by stratified randomization and covariate adjustment. J. Am. Stat. Assoc., 118(542):1152–1163.
  • Wei, (1978) Wei, L. J. (1978). An application of an urn model to the design of sequential controlled clinical trials. J. Am. Stat. Assoc., 73:559–563.
  • White, (1992) White, H. (1992). Artificial Neural Networks: approximation and learning theory. Blackwell, Oxford.
  • Williams et al., (2022) Williams, N., Rosenblum, M., and Díaz, I. (2022). Optimising precision and power by machine learning in randomised trials with ordinal and time-to-event outcomes with an application to covid-19. J. R. Stat. Soc.: Ser. A (Stat. Soc.), 185(4):2156–2178.
  • Wu and Gagnon-Bartsch, (2018) Wu, E. and Gagnon-Bartsch, J. A. (2018). The loop estimator: Adjusting for covariates in randomized experiments. Eval. Rev., 42:458–488.
  • Yarotsky, (2018) Yarotsky, D. (2018). Optimal approximation of continuous functions by very deep relu networks. In Conference on Learning Theory, pages 639–649. PMLR.
  • Ye et al., (2022a) Ye, T., Shao, J., Yi, Y., and Zhao, Q. (2022a). Toward better practice of covariate adjustment in analyzing randomized clinical trials. J. Am. Stat. Assoc., pages 1–13.
  • Ye et al., (2022b) Ye, T., Yi, Y., and Shao, J. (2022b). Inference on the average treatment effect under minimization and other covariate-adaptive randomization methods. Biometrika, 109(1):33–47.
  • Zelen, (1974) Zelen, M. (1974). The randomization and stratification of patients to clinical trials. Journal of Chronic Diseases, 27(7):365–375.
  • Zhang and Zhang, (2014) Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242.
  • Zhang and Zheng, (2020) Zhang, Y. and Zheng, X. (2020). Quantile treatment effects and bootstrap inference under covariate-adaptive randomization. Quantitative Economics, 11(3):957–982.
  • Zou, (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. J. Am. Stat. Assoc., 101(476):1418–1429.
  • Zou and Hastie, (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc.: Ser. B (Stat. Methodol.), 67:301–320.

Appendix A Useful Lemmas

We first introduce the following lemmas that are useful for our proofs.

Lemma 1.

Let $V_{i}=f(Y_{i}(1),Y_{i}(0),B_{i},X_{i})$ for some measurable function $f(\cdot)$ such that $E(|V_{i}|)<\infty$. Under Assumptions 1–3,

\[
\frac{1}{n}\sum_{i=1}^{n}A_{i}V_{i}\xrightarrow{P}\pi E(V_{1}).
\]
Lemma 2.

Under Assumptions 1–3, we have

\[
\frac{n_{1}}{n}\xrightarrow{P}\pi,\quad \pi_{n[k]}=\frac{n_{[k]1}}{n_{[k]}}\xrightarrow{P}\pi,\quad \frac{n_{[k]1}}{n}\xrightarrow{P}\pi p_{[k]},\quad p_{n[k]}=\frac{n_{[k]}}{n}\xrightarrow{P}p_{[k]},
\]
\[
\frac{n_{0}}{n}\xrightarrow{P}1-\pi,\quad \frac{n_{[k]0}}{n_{[k]}}\xrightarrow{P}1-\pi,\quad \frac{n_{[k]0}}{n}\xrightarrow{P}(1-\pi)p_{[k]}.
\]
Lemma 3.

Let $V_{i}=f(Y_{i}(1),Y_{i}(0),B_{i},X_{i})$ for some measurable function $f(\cdot)$ such that $E(V_{i}^{2})<\infty$. Under Assumptions 1–3,

\[
\sum_{k=1}^{K}p_{n[k]}\cdot\frac{1}{n_{[k]1}}\sum_{i\in[k]}A_{i}(V_{i}-\bar{V}_{[k]1})^{2}\xrightarrow{P}\sigma^{2}_{V_{i}-E(V_{i}\mid B_{i})},
\]
\[
\sum_{k=1}^{K}p_{n[k]}\cdot\frac{1}{n_{[k]0}}\sum_{i\in[k]}(1-A_{i})(V_{i}-\bar{V}_{[k]0})^{2}\xrightarrow{P}\sigma^{2}_{V_{i}-E(V_{i}\mid B_{i})}.
\]

The above lemmas are the same as those in Liu et al., (2023), and Lemma 1 is a generalized version of a result proved in Bugni et al., (2019) for $V_{i}=f(Y_{i}(1),Y_{i}(0),B_{i})$ (see their Lemma C.4). Lemma 2 follows directly from the weak law of large numbers and Lemma 1. Lemma 3 can be obtained from the proof of Lemma 7 in Ma et al., (2022). We therefore omit the proofs of these lemmas.

Appendix B Proof of Proposition 1

Proof.

Let $h_{[k]}(X_{i})=(1-\pi)h_{[k]}(X_{i},1)+\pi h_{[k]}(X_{i},0)$, $\bar{h}_{[k]1}=n_{[k]1}^{-1}\sum_{i\in[k]}A_{i}h_{[k]}(X_{i})$, and $\bar{h}_{[k]0}=n_{[k]0}^{-1}\sum_{i\in[k]}(1-A_{i})h_{[k]}(X_{i})$. Then

\begin{align}
\hat{\tau}_{\text{oracle}}
&=\sum_{k=1}^{K}p_{n[k]}\bigg[\Big\{\bar{Y}_{[k]1}-\sum_{i\in[k]}\frac{A_{i}-\pi_{n[k]}}{n_{[k]}\pi_{n[k]}}h_{[k]}(X_{i},1)\Big\}
-\Big\{\bar{Y}_{[k]0}+\sum_{i\in[k]}\frac{A_{i}-\pi_{n[k]}}{n_{[k]}(1-\pi_{n[k]})}h_{[k]}(X_{i},0)\Big\}\bigg] \tag{6}\\
&=\sum_{k=1}^{K}\frac{p_{n[k]}}{n_{[k]}}\sum_{i\in[k]}\Big\{\frac{A_{i}Y_{i}}{\pi_{n[k]}}-\frac{A_{i}-\pi_{n[k]}}{\pi_{n[k]}}h_{[k]}(X_{i},1)
-\frac{(1-A_{i})Y_{i}}{1-\pi_{n[k]}}-\frac{A_{i}-\pi_{n[k]}}{1-\pi_{n[k]}}h_{[k]}(X_{i},0)\Big\}\notag\\
&=\sum_{k=1}^{K}\frac{p_{n[k]}}{n_{[k]}}\sum_{i\in[k]}\Big[\frac{A_{i}}{\pi_{n[k]}}\big\{Y_{i}-h_{[k]}(X_{i})\big\}-\frac{1-A_{i}}{1-\pi_{n[k]}}\big\{Y_{i}-h_{[k]}(X_{i})\big\}\Big]\notag\\
&\quad+\sum_{k=1}^{K}\frac{p_{n[k]}}{n_{[k]}}\sum_{i\in[k]}\Big[\frac{A_{i}}{\pi_{n[k]}}(\pi_{n[k]}-\pi)\big\{h_{[k]}(X_{i},1)-h_{[k]}(X_{i},0)\big\}
-\frac{1-A_{i}}{1-\pi_{n[k]}}(\pi_{n[k]}-\pi)\big\{h_{[k]}(X_{i},1)-h_{[k]}(X_{i},0)\big\}\Big]\notag\\
&=\sum_{k=1}^{K}p_{n[k]}\Big[\bar{Y}_{[k]1}-\bar{h}_{[k]1}-\big\{\bar{Y}_{[k]0}-\bar{h}_{[k]0}\big\}\Big]\notag\\
&\quad+\sum_{k=1}^{K}p_{n[k]}(\pi_{n[k]}-\pi)\frac{1}{n_{[k]1}}\sum_{i\in[k]}A_{i}\big\{h_{[k]}(X_{i},1)-h_{[k]}(X_{i},0)\big\}\notag\\
&\quad-\sum_{k=1}^{K}p_{n[k]}(\pi_{n[k]}-\pi)\frac{1}{n_{[k]0}}\sum_{i\in[k]}(1-A_{i})\big\{h_{[k]}(X_{i},1)-h_{[k]}(X_{i},0)\big\}.\notag
\end{align}

The equalities in equation (6) hold because

\[
\bar{Y}_{[k]1}=\frac{1}{n_{[k]1}}\sum_{i\in[k]}A_{i}Y_{i}=\frac{1}{n_{[k]}\pi_{n[k]}}\sum_{i\in[k]}A_{i}Y_{i},\qquad
\bar{Y}_{[k]0}=\frac{1}{n_{[k]0}}\sum_{i\in[k]}(1-A_{i})Y_{i}=\frac{1}{n_{[k]}(1-\pi_{n[k]})}\sum_{i\in[k]}(1-A_{i})Y_{i}.
\]

For $i\in[k]$, define the transformed outcomes

\[
r_{i}(a)=Y_{i}(a)-\big[(1-\pi)h_{[k]}(X_{i},1)+\pi h_{[k]}(X_{i},0)\big],\qquad a=0,1,
\]

then $\hat{\tau}_{\text{oracle}}$ is the stratified difference-in-means estimator applied to the transformed outcomes $r_{i}(a)$. Because the same adjustment term is subtracted from both potential outcomes, $r_{i}(1)-r_{i}(0)=Y_{i}(1)-Y_{i}(0)$, and hence

\[
E\{r_{i}(1)-r_{i}(0)\}=\sum_{k=1}^{K}p_{[k]}E\{Y_{i}(1)-Y_{i}(0)\mid B_{i}=k\}=E\{Y_{i}(1)-Y_{i}(0)\}=\tau.
\]

Since $E\{Y^{2}_{i}(a)\}<\infty$ and $E\{h^{2}_{[k]}(X_{i},a)\}<\infty$, we have $E\{r^{2}_{i}(a)\}<\infty$. As a result, according to Proposition 1 in Liu et al., (2023), $\sum_{k=1}^{K}p_{n[k]}\big[\bar{Y}_{[k]1}-\bar{h}_{[k]1}-\{\bar{Y}_{[k]0}-\bar{h}_{[k]0}\}\big]$ is asymptotically normal with mean $\tau$ and variance $\zeta^{2}_{r}(\pi)+\zeta^{2}_{Hr}$. It thus suffices to show that the last two terms in equation (6) are negligible, that is, $o_{P}(n^{-1/2})$.

Under the second-moment conditions on $h_{[k]}(X_{i},a)$, applying Proposition 1 in Liu et al., (2023) to each stratum with outcomes $h_{[k]}(X_{i},1)-h_{[k]}(X_{i},0)$, we have $(1/n_{[k]1})\sum_{i\in[k]}A_{i}\{h_{[k]}(X_{i},1)-h_{[k]}(X_{i},0)\}-(1/n_{[k]0})\sum_{i\in[k]}(1-A_{i})\{h_{[k]}(X_{i},1)-h_{[k]}(X_{i},0)\}=O_{P}(n^{-1/2})$. Together with $\pi_{n[k]}-\pi=o_{P}(1)$ and $p_{n[k]}\leq 1$, the last two terms in equation (6) are $o_{P}(n^{-1/2})$, as desired. ∎

Appendix C Proof of Local Linear Kernel

We denote $K_{H}(u)=|H|^{-1/2}K(H^{-1/2}u)$, where $K(\cdot)$ is the symmetric density kernel function used in the local linear kernel smoother and $H$ is a $d\times d$ symmetric positive definite matrix depending on $n$; $H^{1/2}$ is called the bandwidth matrix. Let $D_{g}(x)$ denote the $d\times 1$ vector of first-order partial derivatives and $\mathcal{H}_{g}(x)$ denote the $d\times d$ Hessian matrix of a sufficiently smooth $d$-variate function $g$ at $x$. Let $\bm{1}$ denote a generic matrix having each entry equal to $1$. If $U_{n}$ is a random matrix, then $O_{P}(U_{n})$ and $o_{P}(U_{n})$ are to be taken componentwise.

Lemma 4.

Suppose that

(i) $X_{1},\dots,X_{n}$ are i.i.d. $d$-dimensional random variables with continuous probability density function $f(\cdot)$, and $f(\cdot)$ has compact support in $R^{d}$;

(ii) $Y_{i}=m(X_{i})+\nu^{1/2}(X_{i})\varepsilon_{i},\ i=1,\dots,n$, where $\nu(x)=\operatorname{Var}(Y\mid X=x)>0$ is continuous, the $\varepsilon_{i}$'s are mutually independent random variables with $E(\varepsilon_{i})=0$ and $\operatorname{Var}(\varepsilon_{i})=1$, and the $\varepsilon_{i}$'s are independent of the $X_{i}$'s; all second-order derivatives of $m(\cdot)$ are continuous;

(iii) the kernel $K$ is a compactly supported, bounded kernel such that $\int uu^{\text{T}}K(u)\,du=\mu_{2}(K)I$, where $\mu_{2}(K)\neq 0$ is a scalar and $I$ is the $d\times d$ identity matrix; in addition, all odd-order moments of $K$ vanish, that is, $\int u_{1}^{l_{1}}\cdots u_{d}^{l_{d}}K(u)\,du=0$ for all nonnegative integers $l_{1},\dots,l_{d}$ whose sum is odd;

(iv) $n^{-1}|H|^{-1}$ and each entry of $H$ tend to zero as $n\to\infty$, with $H$ remaining symmetric and positive definite; moreover, there is a fixed constant $L$ such that the condition number of $H$ is at most $L$ for all $n$.

Denote by $\hat{m}_{H}(x)$ the local linear smoother of $m(\cdot)$ at the point $x$ based on $X_{1},\dots,X_{n}$, using the symmetric density kernel and bandwidth matrix $H^{1/2}$. Then

\[
E\left[\left\{\hat{m}_{H}(x)-m(x)\right\}^{2}\mid X_{1},\dots,X_{n}\right]\xrightarrow{P}0.
\]
Proof.

Denote

\[
M_{0}=(m(X_{1}),\dots,m(X_{n}))^{\text{T}},\qquad Y=(Y_{1},\dots,Y_{n})^{\text{T}},
\]
\[
W_{x}=\operatorname{diag}\big(K_{H}(x-X_{1}),\dots,K_{H}(x-X_{n})\big),
\]
\[
N_{x}=\begin{pmatrix}1&(X_{1}-x)^{\text{T}}\\ \vdots&\vdots\\ 1&(X_{n}-x)^{\text{T}}\end{pmatrix},\qquad
V=\begin{pmatrix}\nu(X_{1})&&\\ &\ddots&\\ &&\nu(X_{n})\end{pmatrix}.
\]

By the definition of the local linear kernel smoother, for a given $x$, we have

\[
\hat{m}_{H}(x)=e_{1}^{\text{T}}(N_{x}^{\text{T}}W_{x}N_{x})^{-1}N_{x}^{\text{T}}W_{x}Y,
\]

where $e_{1}$ is the $(d+1)$-dimensional vector whose first element is $1$ and whose remaining elements are $0$. Taking the expectation with respect to $Y$, we have

\[
E\left\{\hat{m}_{H}(x)\mid X_{1},\ldots,X_{n}\right\}=e_{1}^{\text{T}}\left(N_{x}^{\text{T}}W_{x}N_{x}\right)^{-1}N_{x}^{\text{T}}W_{x}M_{0}. \tag{7}
\]

Let $Q_{m}(x)$ be the $n\times 1$ vector given by

\[
Q_{m}(x)=\left[\left(X_{1}-x\right)^{\text{T}}\mathcal{H}_{m}(x)\left(X_{1}-x\right),\ldots,\left(X_{n}-x\right)^{\text{T}}\mathcal{H}_{m}(x)\left(X_{n}-x\right)\right]^{\text{T}}. \tag{8}
\]

Then Taylor's expansion implies that

\[
M_{0}=N_{x}\begin{pmatrix}m(x)\\ D_{m}(x)\end{pmatrix}+\frac{1}{2}Q_{m}(x)+R_{m}(x), \tag{9}
\]

where $R_{m}(x)$ is a vector of Taylor series remainder terms. As stated in Ruppert and Wand, (1994), when $R_{m}(x)$ is pre-multiplied by $e_{1}^{\text{T}}(N_{x}^{\text{T}}W_{x}N_{x})^{-1}N_{x}^{\text{T}}W_{x}$, the resulting scalar is $o_{P}\{\textnormal{tr}(H)\}$. Then, by equations (7)–(9),

\begin{align*}
&E\left\{\hat{m}_{H}(x)-m(x)\mid X_{1},\ldots,X_{n}\right\}\\
&=e_{1}^{\text{T}}\left(N_{x}^{\text{T}}W_{x}N_{x}\right)^{-1}N_{x}^{\text{T}}W_{x}\left\{N_{x}\begin{pmatrix}m(x)\\ D_{m}(x)\end{pmatrix}+\frac{1}{2}Q_{m}(x)+R_{m}(x)\right\}-m(x)\\
&=e_{1}^{\text{T}}\left(N_{x}^{\text{T}}W_{x}N_{x}\right)^{-1}N_{x}^{\text{T}}W_{x}N_{x}\begin{pmatrix}m(x)\\ D_{m}(x)\end{pmatrix}
+e_{1}^{\text{T}}\left(N_{x}^{\text{T}}W_{x}N_{x}\right)^{-1}N_{x}^{\text{T}}W_{x}\left\{\frac{1}{2}Q_{m}(x)+R_{m}(x)\right\}-m(x)\\
&=e_{1}^{\text{T}}\begin{pmatrix}m(x)\\ D_{m}(x)\end{pmatrix}-m(x)+e_{1}^{\text{T}}\left(N_{x}^{\text{T}}W_{x}N_{x}\right)^{-1}N_{x}^{\text{T}}W_{x}\left\{\frac{1}{2}Q_{m}(x)+R_{m}(x)\right\}\\
&=e_{1}^{\text{T}}\left(N_{x}^{\text{T}}W_{x}N_{x}\right)^{-1}N_{x}^{\text{T}}W_{x}\left\{\frac{1}{2}Q_{m}(x)+R_{m}(x)\right\}.
\end{align*}

Using standard results from density estimation (e.g., Ruppert and Wand, 1994), if $x\notin\{X_{1},\dots,X_{n}\}$,

\begin{align}
n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)&=f(x)+o_{P}(1), \tag{13}\\
n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)&=\mu_{2}(K)HD_{f}(x)+o_{P}(H\bm{1}), \tag{14}\\
n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)\left(X_{i}-x\right)^{\text{T}}&=\mu_{2}(K)f(x)H+o_{P}(H). \tag{15}
\end{align}

If $x=X_{q}\in\{X_{1},\dots,X_{n}\}$ for some $q=1,\ldots,n$,

\begin{align}
n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)&=\frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{i\neq q}K_{H}\left(X_{i}-x\right)+\frac{1}{n}K_{H}(0)=f(x)+o_{P}(1), \tag{16}\\
n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)&=\frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{i\neq q}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)=\mu_{2}(K)HD_{f}(x)+o_{P}(H\bm{1}), \tag{17}\\
n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)\left(X_{i}-x\right)^{\text{T}}&=\frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{i\neq q}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)\left(X_{i}-x\right)^{\text{T}}=\mu_{2}(K)f(x)H+o_{P}(H). \tag{18}
\end{align}

It follows from above that

\begin{align}
\big(n^{-1}N_{x}^{\text{T}}W_{x}N_{x}\big)^{-1}
&=\left[\begin{array}{cc}
n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)&n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)^{\text{T}}\\
n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)&n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)\left(X_{i}-x\right)^{\text{T}}
\end{array}\right]^{-1} \tag{21}\\
&=\left(\begin{array}{cc}A_{11}&A_{12}\\ A_{21}&A_{22}\end{array}\right)^{-1}
=\left(\begin{array}{cc}A^{11}&A^{12}\\ A^{21}&A^{22}\end{array}\right),\notag
\end{align}

where we use the following definitions:

\[
A_{11}=n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right),\qquad A_{12}=n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)^{\text{T}},
\]
\[
A_{21}=n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right),\qquad A_{22}=n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)\left(X_{i}-x\right)^{\text{T}},
\]

and $A^{11},A^{12},A^{21},A^{22}$ are the corresponding blocks of the inverse matrix $\left(\begin{array}{cc}A_{11}&A_{12}\\ A_{21}&A_{22}\end{array}\right)^{-1}$. By the block matrix inversion formula,

\[
A^{11}=\left(A_{11}-A_{12}A_{22}^{-1}A_{21}\right)^{-1},\qquad A^{22}=\left(A_{22}-A_{21}A_{11}^{-1}A_{12}\right)^{-1},
\]
\[
A^{12}=-A_{11}^{-1}A_{12}A^{22},\qquad A^{21}=-A_{22}^{-1}A_{21}A^{11}.
\]

Simple algebra gives

\begin{align*}
A_{12}A_{22}^{-1}A_{21}
&=\left\{\mu_{2}(K)D_{f}(x)^{\text{T}}H+o_{P}\left(\bm{1}^{\text{T}}H\right)\right\}\cdot\left\{\mu_{2}(K)f(x)H+o_{P}(H)\right\}^{-1}\cdot\left\{\mu_{2}(K)HD_{f}(x)+o_{P}(H\bm{1})\right\}\\
&=\left\{\mu_{2}(K)D_{f}(x)^{\text{T}}H+o_{P}(\bm{1}^{\text{T}}H)\right\}\cdot\left\{\mu_{2}(K)^{-1}f(x)^{-1}H^{-1}+o_{P}\left(H^{-1}\right)\right\}\cdot\left\{\mu_{2}(K)HD_{f}(x)+o_{P}(H\bm{1})\right\}\\
&=\left\{f(x)^{-1}D_{f}(x)^{\text{T}}+o_{P}\left(D_{f}(x)^{\text{T}}+f(x)^{-1}\bm{1}^{\text{T}}+\bm{1}^{\text{T}}\right)\right\}\cdot\left\{\mu_{2}(K)HD_{f}(x)+o_{P}(H\bm{1})\right\}\\
&=\mu_{2}(K)D_{f}(x)^{\text{T}}HD_{f}(x)+o_{P}\{\textnormal{tr}(H)\},\\[4pt]
A^{11}
&=\left[f(x)+o_{P}(1)-\mu_{2}(K)D_{f}(x)^{\text{T}}HD_{f}(x)+o_{P}\{\textnormal{tr}(H)\}\right]^{-1}\\
&=\{f(x)+o_{P}(1)\}^{-1},\quad\text{as each entry of }H\to 0,\\
&=f(x)^{-1}+o_{P}(1),\\[4pt]
A_{21}A_{11}^{-1}A_{12}
&=\left\{\mu_{2}(K)HD_{f}(x)+o_{P}(H\bm{1})\right\}\cdot\{f(x)+o_{P}(1)\}^{-1}\cdot\left\{\mu_{2}(K)D_{f}(x)^{\text{T}}H+o_{P}\left(\bm{1}^{\text{T}}H\right)\right\}\\
&=\left\{f(x)\mu_{2}(K)HD_{f}(x)+o_{P}\left(\mu_{2}(K)HD_{f}(x)+f(x)H\bm{1}+H\bm{1}\right)\right\}\cdot\left\{\mu_{2}(K)D_{f}(x)^{\text{T}}H+o_{P}\left(\bm{1}^{\text{T}}H\right)\right\}\\
&=f(x)\mu_{2}(K)^{2}HD_{f}(x)D_{f}(x)^{\text{T}}H+o_{P}\left(H^{2}\right),\\[4pt]
A^{22}
&=\left\{\mu_{2}(K)f(x)H+o_{P}(H)-f(x)\mu_{2}(K)^{2}HD_{f}(x)D_{f}(x)^{\text{T}}H+o_{P}(H^{2})\right\}^{-1}\\
&=\left\{\mu_{2}(K)f(x)H+o_{P}(H)\right\}^{-1}\\
&=\left\{\mu_{2}(K)f(x)H\right\}^{-1}+o_{P}\left(H^{-1}\right),\\[4pt]
A^{12}
&=-\left\{f(x)^{-1}+o_{P}(1)\right\}\left\{\mu_{2}(K)D_{f}(x)^{\text{T}}H+o_{P}(\bm{1}^{\text{T}}H)\right\}\cdot\big[\left\{\mu_{2}(K)f(x)H\right\}^{-1}+o_{P}\left(H^{-1}\right)\big]\\
&=-\left\{\mu_{2}(K)f(x)^{-1}D_{f}(x)^{\text{T}}H+o_{P}\left(D_{f}(x)^{\text{T}}H\right)\right\}\cdot\left\{\mu_{2}(K)^{-1}f(x)^{-1}H^{-1}+o_{P}\left(H^{-1}\right)\right\}\\
&=-D_{f}(x)^{\text{T}}f(x)^{-2}+o_{P}(\bm{1}^{\text{T}}),
\end{align*}

and, by a similar deduction, $A^{21}=-D_{f}(x)f(x)^{-2}+o_{P}(\bm{1})$. Therefore,

\[
\big(n^{-1}N_{x}^{\text{T}}W_{x}N_{x}\big)^{-1}=\left[\begin{array}{cc}
f(x)^{-1}+o_{P}(1)&-D_{f}(x)^{\text{T}}f(x)^{-2}+o_{P}(\bm{1}^{\text{T}})\\
-D_{f}(x)f(x)^{-2}+o_{P}(\bm{1})&\{\mu_{2}(K)f(x)H\}^{-1}+o_{P}(H^{-1})
\end{array}\right],
\]
\[
n^{-1}N_{x}^{\text{T}}W_{x}Q_{m}(x)=\left[\begin{array}{c}
n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)^{\text{T}}\mathcal{H}_{m}(x)\left(X_{i}-x\right)\\
n^{-1}\sum_{i=1}^{n}\left\{K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)^{\text{T}}\mathcal{H}_{m}(x)\left(X_{i}-x\right)\right\}\left(X_{i}-x\right)
\end{array}\right],
\]
\begin{align*}
&n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left\{\left(X_{i}-x\right)^{\text{T}}\mathcal{H}_{m}(x)\left(X_{i}-x\right)\right\}\left(X_{i}-x\right)\\
&=\int K(u)\left\{\left(H^{1/2}u\right)^{\text{T}}\mathcal{H}_{m}(x)\left(H^{1/2}u\right)\right\}\left(H^{1/2}u\right)f\left(x+H^{1/2}u\right)du+o_{P}\left(H^{3/2}\bm{1}\right)\\
&=O_{P}\left(H^{3/2}\bm{1}\right).
\end{align*}

Thus,

\begin{align*}
&E\left\{\hat{m}_{H}(x)\mid X_{1},\ldots,X_{n}\right\}-m(x)\\
&=\frac{1}{2}e_{1}^{\text{T}}\left(N_{x}^{\text{T}}W_{x}N_{x}\right)^{-1}N_{x}^{\text{T}}W_{x}Q_{m}(x)+o_{P}\{\textnormal{tr}(H)\}\\
&=\frac{1}{2}f(x)^{-1}E\Big\{n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)\left(X_{i}-x\right)^{\text{T}}\mathcal{H}_{m}(x)\left(X_{i}-x\right)\Big\}+o_{P}\{\textnormal{tr}(H)\}\\
&=\frac{1}{2}f(x)^{-1}\left\{\int K(u)\left(H^{1/2}u\right)^{\text{T}}\mathcal{H}_{m}(x)\left(H^{1/2}u\right)f\left(x+H^{1/2}u\right)du\right\}+o_{P}\{\textnormal{tr}(H)\}\\
&=\frac{1}{2}\textnormal{tr}\left\{H^{1/2}\mathcal{H}_{m}(x)H^{1/2}\int K(u)uu^{\text{T}}du\right\}+o_{P}\{\textnormal{tr}(H)\}\\
&=\frac{1}{2}\mu_{2}(K)\textnormal{tr}\left\{H\mathcal{H}_{m}(x)\right\}+o_{P}\{\textnormal{tr}(H)\}.
\end{align*}

For the variance, we have

\[
\operatorname{Var}\left\{\hat{m}_{H}(x)\mid X_{1},\ldots,X_{n}\right\}=e_{1}^{\text{T}}\left(N_{x}^{\text{T}}W_{x}N_{x}\right)^{-1}N_{x}^{\text{T}}W_{x}VW_{x}N_{x}\left(N_{x}^{\text{T}}W_{x}N_{x}\right)^{-1}e_{1}.
\]

The upper-left entry of $n^{-1}N_{x}^{\text{T}}W_{x}VW_{x}N_{x}$ is

\begin{align*}
n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)^{2}\nu\left(X_{i}\right)
&=|H|^{-1/2}\int K^{2}(u)\nu\left(x+H^{1/2}u\right)f\left(x+H^{1/2}u\right)du\left\{1+o_{P}(1)\right\}\\
&=|H|^{-1/2}R(K)\nu(x)f(x)\left\{1+o_{P}(1)\right\},
\end{align*}

where $R(K)=\int K(u)^{2}du$. The upper-right block of $n^{-1}N_{x}^{\text{T}}W_{x}VW_{x}N_{x}$ is

\begin{align*}
n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)^{2}\left(X_{i}-x\right)^{\text{T}}\nu\left(X_{i}\right)
&=|H|^{-1/2}\int K^{2}(u)u^{\text{T}}H^{1/2}\nu\left(x+H^{1/2}u\right)f\left(x+H^{1/2}u\right)du\left\{1+o_{P}(1)\right\}\\
&=O_{P}\left(|H|^{1/2}\right),
\end{align*}

and the lower-right block is

n1i=1nKH(Xix)2(Xix)(Xix)Tν(Xi)\displaystyle n^{-1}\sum_{i=1}^{n}K_{H}\left(X_{i}-x\right)^{2}\left(X_{i}-x\right)\left(X_{i}-x\right)^{T}\nu\left(X_{i}\right)
=\displaystyle= |H|1/2H1/2{K2(u)uuT𝑑u}H1/2ν(x)f(x)+oP(|H|1/2H).\displaystyle|H|^{-1/2}H^{1/2}\left\{\int K^{2}(u)uu^{T}du\right\}H^{1/2}\nu(x)f(x)+o_{P}\left(|H|^{-1/2}H\right).

Using equation (21) again and the above equations, we have

Var{m^H(x)X1,,Xn}=n1|H|1/2{R(K)ν(x)/f(x)}{1+oP(1)}.\operatorname{Var}\left\{\hat{m}_{H}(x)\mid X_{1},\ldots,X_{n}\right\}=n^{-1}|H|^{-1/2}\{R(K)\nu(x)/f(x)\}\left\{1+o_{P}(1)\right\}. (27)

Then

E[{m^(x)m(x)}2|X1,,Xn]\displaystyle E\left[\left\{\hat{m}(x)-m(x)\right\}^{2}|X_{1},\dots,X_{n}\right]
=\displaystyle= n1|H|1/2{R(K)ν(x)/f(x)}+14μ2(K)2tr2{Hm(x)}+oP{n1|H|1/2+tr2(H)}\displaystyle n^{-1}|H|^{-1/2}\{R(K)\nu(x)/f(x)\}+\frac{1}{4}\mu_{2}(K)^{2}\textnormal{tr}^{2}\left\{H\mathcal{H}_{m}(x)\right\}+o_{P}\{n^{-1}|H|^{-1/2}+\textnormal{tr}^{2}(H)\}
P\displaystyle\stackrel{{\scriptstyle P}}{{\rightarrow}} 0,\displaystyle 0,

under Assumption (iii). ∎
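As a concrete illustration of the quantities analyzed above, the following is a minimal numpy sketch (not the implementation used in this paper) of the local linear kernel-weighted least squares fit $\hat{m}_{H}(x)=e_{1}^{\text{T}}(N_{x}^{\text{T}}W_{x}N_{x})^{-1}N_{x}^{\text{T}}W_{x}Y$, assuming a Gaussian product kernel and a diagonal bandwidth matrix $H$; the function name local_linear_fit and the toy data are ours.

```python
import numpy as np

def local_linear_fit(x, X, Y, H):
    """Local linear kernel-weighted least squares estimate of m(x),
    i.e. e1' (Nx' Wx Nx)^{-1} Nx' Wx Y, with a Gaussian product kernel.

    x : (d,) evaluation point
    X : (n, d) covariate matrix
    Y : (n,) outcomes
    H : (d, d) bandwidth matrix (assumed diagonal here for simplicity)
    """
    n, d = X.shape
    diff = X - x                                      # rows are (X_i - x)^T
    U = diff @ np.diag(1.0 / np.sqrt(np.diag(H)))     # H^{-1/2}(X_i - x) for diagonal H
    # K_H(X_i - x) = |H|^{-1/2} K(H^{-1/2}(X_i - x)) with the Gaussian product kernel K
    w = np.exp(-0.5 * np.sum(U**2, axis=1)) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.prod(np.diag(H))))
    Nx = np.hstack([np.ones((n, 1)), diff])           # design matrix with rows (1, (X_i - x)^T)
    G = Nx.T @ (w[:, None] * Nx)                      # Nx' Wx Nx
    b = Nx.T @ (w * Y)                                # Nx' Wx Y
    return np.linalg.solve(G, b)[0]                   # e1' beta = m_hat_H(x)

# Toy usage: a smooth bivariate regression function with additive noise
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
Y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)
print(local_linear_fit(np.array([0.5, 0.5]), X, Y, H=np.diag([0.05, 0.05])))
```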

C.1 Proof of Theorem 3

Proof.
n[{h^¯[k]1(,a)h¯[k]1(,a)}{h^¯[k]0(,a)h¯[k]0(,a)}]\displaystyle\quad\sqrt{n}\Big{[}\big{\{}\bar{\hat{h}}_{[k]1}(\cdot,a)-\bar{h}_{[k]1}(\cdot,a)\big{\}}-\big{\{}\bar{\hat{h}}_{[k]0}(\cdot,a)-\bar{h}_{[k]0}(\cdot,a)\big{\}}\Big{]} (28)
=\displaystyle= n[1n[k]1i[k]Ai{h^[k](Xi,a)h[k](Xi,a)}1n[k]0i[k](1Ai){h^[k](Xi,a)h[k](Xi,a)}]\displaystyle\sqrt{n}\left[\frac{1}{n_{[k]1}}\sum_{i\in[k]}A_{i}\{\hat{h}_{[k]}(X_{i},a)-h_{[k]}(X_{i},a)\}-\frac{1}{n_{[k]0}}\sum_{i\in[k]}(1-A_{i})\{\hat{h}_{[k]}(X_{i},a)-h_{[k]}(X_{i},a)\}\right]
=\displaystyle= 1npn[k]πn[k](1πn[k])i[k](Aiπn[k]){h^[k](Xi,a)h[k](Xi,a)}.\displaystyle\frac{1}{\sqrt{n}p_{n[k]}\pi_{n[k]}(1-\pi_{n[k]})}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})\{\hat{h}_{[k]}(X_{i},a)-h_{[k]}(X_{i},a)\}.

Decomposing $\hat{h}_{[k]}(X_{i},a)-h_{[k]}(X_{i},a)$ into a variance term and a bias term, conditional on $X_{1},\dots,X_{n}$, and applying a Taylor expansion, we have

h^[k](Xi,a)h[k](Xi,a)\displaystyle\hat{h}_{[k]}(X_{i},a)-h_{[k]}(X_{i},a) (31)
=\displaystyle= e1T(NiTWiNi)1NiTWi(Y(a)h[k](a))+e1T(NiTWiNi)1NiTWih[k](a)h[k](Xi,a)\displaystyle e_{1}^{\text{T}}(N_{i}^{\text{T}}W_{i}N_{i})^{-1}N_{i}^{\text{T}}W_{i}(Y(a)-h_{[k]}(a))+e_{1}^{\text{T}}(N_{i}^{\text{T}}W_{i}N_{i})^{-1}N_{i}^{\text{T}}W_{i}h_{[k]}(a)-h_{[k]}(X_{i},a)
=\displaystyle= e1T(NiTWiNi)1NiTWiε(ν)+e1T(NiTWiNi)1NiTWi\displaystyle e_{1}^{\text{T}}(N_{i}^{\text{T}}W_{i}N_{i})^{-1}N_{i}^{\text{T}}W_{i}\varepsilon(\nu)+e_{1}^{\text{T}}(N_{i}^{\text{T}}W_{i}N_{i})^{-1}N_{i}^{\text{T}}W_{i}
{Ni[h[k](Xi,a)Dh[k](Xi,a)]+12Qh[k](Xi,a)+Rh[k](Xi,a)}h[k](Xi,a)\displaystyle\cdot\left\{N_{i}\left[\begin{array}[]{c}h_{[k]}(X_{i},a)\\ D_{h_{[k]}}(X_{i},a)\end{array}\right]+\frac{1}{2}Q_{h_{[k]}}(X_{i},a)+R_{h_{[k]}}(X_{i},a)\right\}-h_{[k]}(X_{i},a)
=\displaystyle= e1T(NiTWiNi)1NiTWiε(ν)+12e1T(NiTWiNi)1NiTWiQh[k](Xi,a)+oP(tr(H)),\displaystyle e_{1}^{\text{T}}(N_{i}^{\text{T}}W_{i}N_{i})^{-1}N_{i}^{\text{T}}W_{i}\varepsilon(\nu)+\frac{1}{2}e_{1}^{\text{T}}(N_{i}^{\text{T}}W_{i}N_{i})^{-1}N_{i}^{\text{T}}W_{i}Q_{h_{[k]}}(X_{i},a)+o_{P}(\textnormal{tr}(H)), (32)

where

Wi=diag(KH(XiX1),,KH(XiXn)),W_{i}=diag(K_{H}(X_{i}-X_{1}),\dots,K_{H}(X_{i}-X_{n})),
Ni=(1(X1Xi)T1(XnXi)T),N_{i}=\begin{pmatrix}1&(X_{1}-X_{i})^{\text{T}}\\ \vdots&\vdots\\ 1&(X_{n}-X_{i})^{\text{T}}\end{pmatrix},
Y(a)=(Y1(a),,Yn(a))T,h[k](a)=(h[k](X1,a),,h[k](Xn,a))T,Y(a)=(Y_{1}(a),\dots,Y_{n}(a))^{\text{T}},\quad h_{[k]}(a)=(h_{[k]}(X_{1},a),\dots,h_{[k]}(X_{n},a))^{\text{T}},
ε(ν)=(ν(X1)1/2ε1,,ν(Xn)1/2εn)T,\varepsilon(\nu)=(\nu(X_{1})^{1/2}\varepsilon_{1},\dots,\nu(X_{n})^{1/2}\varepsilon_{n})^{\text{T}},
Qh[k](Xi,a)=((X1Xi)Th[k](Xi,a)(X1Xi),,(XnXi)Th[k](Xi,a)(XnXi))T,Q_{h_{[k]}}(X_{i},a)=\Big{(}(X_{1}-X_{i})^{\text{T}}\mathcal{H}_{h_{[k]}}(X_{i},a)(X_{1}-X_{i}),\dots,(X_{n}-X_{i})^{\text{T}}\mathcal{H}_{h_{[k]}}(X_{i},a)(X_{n}-X_{i})\Big{)}^{T},

and $R_{h_{[k]}}(X_{i},a)$ is the vector of Taylor series remainder terms. Here $\mathcal{H}_{h_{[k]}}(X_{i},a)$ denotes the $d\times d$ Hessian matrix of $h_{[k]}(\cdot,a)$ evaluated at $X_{i}$.

Following the proof for the ordinary-least-squares-adjusted estimators, we write the variance term and the bias term each as a product of an $O_{P}(1)$ term and an $o_{P}(1)$ term. For the variance term, by the derivation of Lemma 4 and equations (13)–(18), we have

e1T(NiTWiNi)1NiTWiε(ν)\displaystyle e_{1}^{\text{T}}(N_{i}^{\text{T}}W_{i}N_{i})^{-1}N_{i}^{\text{T}}W_{i}\varepsilon(\nu)
=\displaystyle= e1T(n1NiTWiNi)1{n1NiTWiε(ν)}\displaystyle e_{1}^{\text{T}}(n^{-1}N_{i}^{\text{T}}W_{i}N_{i})^{-1}\{n^{-1}N_{i}^{\text{T}}W_{i}\varepsilon(\nu)\}
=\displaystyle= e1T[f(Xi)1+oP(1)Df(Xi)Tf(Xi)2+oP(𝟏T)Df(Xi)f(Xi)2+oP(𝟏){μ2(K)f(Xi)H}1+oP(H1)]\displaystyle e_{1}^{\text{T}}\left[\begin{array}[]{cc}f(X_{i})^{-1}+o_{P}(1)&-D_{f}(X_{i})^{\text{T}}f(X_{i})^{-2}+o_{P}(\bm{1^{\text{T}}})\\ -D_{f}(X_{i})f(X_{i})^{-2}+o_{P}(\bm{1})&\{\mu_{2}(K)f(X_{i})H\}^{-1}+o_{P}(H^{-1})\end{array}\right]
[n1j=1nKH(XjXi)ν1/2(Xj)εjn1j=1nKH(XjXi)(XjXi)ν1/2(Xj)εj]\displaystyle\cdot\left[\begin{array}[]{c}n^{-1}\sum_{j=1}^{n}K_{H}(X_{j}-X_{i})\nu^{1/2}(X_{j})\varepsilon_{j}\\ n^{-1}\sum_{j=1}^{n}K_{H}(X_{j}-X_{i})(X_{j}-X_{i})\nu^{1/2}(X_{j})\varepsilon_{j}\end{array}\right]
=\displaystyle= f(Xi)1n1j=1nKH(XjXi)ν1/2(Xj)εj\displaystyle f(X_{i})^{-1}n^{-1}\sum_{j=1}^{n}K_{H}(X_{j}-X_{i})\nu^{1/2}(X_{j})\varepsilon_{j}
f(Xi)2Df(Xi)Tn1j=1nKH(XjXi)(XjXi)ν1/2(Xj)εj+oP(1).\displaystyle-f(X_{i})^{-2}D_{f}(X_{i})^{\text{T}}n^{-1}\sum_{j=1}^{n}K_{H}(X_{j}-X_{i})(X_{j}-X_{i})\nu^{1/2}(X_{j})\varepsilon_{j}+o_{P}(1).

Denote

ξ1\displaystyle\xi_{1} =\displaystyle= n1/2i[k](Aiπn[k])f(Xi)1n1j=1nKH(XjXi)ν1/2(Xj)εj\displaystyle n^{-1/2}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})f(X_{i})^{-1}n^{-1}\sum_{j=1}^{n}K_{H}(X_{j}-X_{i})\nu^{1/2}(X_{j})\varepsilon_{j}
=\displaystyle= n1j=1n{n1/2i[k](Aiπn[k])f(Xi)1KH(XjXi)}ν1/2(Xj)εj,\displaystyle n^{-1}\sum_{j=1}^{n}\big{\{}n^{-1/2}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})f(X_{i})^{-1}K_{H}(X_{j}-X_{i})\big{\}}\nu^{1/2}(X_{j})\varepsilon_{j},

where $f(X_{i})>0$, $K_{H}(X_{j}-X_{i})>0$, and $\nu(X_{j})>0$, and both $A_{i}-\pi_{n[k]}$ and $f(X_{i})$ are bounded. Then

E(ξ1)=E{E(ξ1|X1,,Xn)}=0,E(\xi_{1})=E\{E(\xi_{1}|X_{1},\dots,X_{n})\}=0,
Var(ξ1)\displaystyle\text{Var}(\xi_{1}) =\displaystyle= E{Var(ξ1|X1,,Xn,A(n))}+Var{E(ξ1|X1,,Xn,A(n))}\displaystyle E\{\text{Var}(\xi_{1}|X_{1},\dots,X_{n},A^{(n)})\}+\text{Var}\{E(\xi_{1}|X_{1},\dots,X_{n},A^{(n)})\}
=\displaystyle= E{Var(ξ1|X1,,Xn,A(n))}\displaystyle E\{\text{Var}(\xi_{1}|X_{1},\dots,X_{n},A^{(n)})\}
=\displaystyle= E(Var[n1j=1n{n1/2i[k](Aiπn[k])f(Xi)1KH(XjXi)}\displaystyle E\Big{(}\text{Var}\big{[}n^{-1}\sum_{j=1}^{n}\big{\{}n^{-1/2}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})f(X_{i})^{-1}K_{H}(X_{j}-X_{i})\Big{\}}
ν1/2(Xj)εj|X1,,Xn,A(n)])\displaystyle\cdot\nu^{1/2}(X_{j})\varepsilon_{j}|X_{1},\dots,X_{n},A^{(n)}\big{]}\Big{)}
=\displaystyle= E[n2j=1nn1{i[k](Aiπn[k])f(Xi)1KH(XjXi)}2ν(Xj)]\displaystyle E\Big{[}n^{-2}\sum_{j=1}^{n}n^{-1}\Big{\{}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})f(X_{i})^{-1}K_{H}(X_{j}-X_{i})\Big{\}}^{2}\nu(X_{j})\Big{]}
=\displaystyle= E(n1j=1nn1|H|1n1[i[k](Aiπn[k])f(Xi)1K{H1/2(XjXi)}]2ν(Xj)).\displaystyle E\Big{(}n^{-1}\sum_{j=1}^{n}n^{-1}|H|^{-1}\cdot n^{-1}\Big{[}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})f(X_{i})^{-1}K\{H^{-1/2}(X_{j}-X_{i})\}\Big{]}^{2}\nu(X_{j})\Big{)}.

There exists a constant $C_{\xi}$ such that $0\leq n^{-1}[\sum_{i\in[k]}(A_{i}-\pi_{n[k]})f(X_{i})^{-1}K\{H^{-1/2}(X_{j}-X_{i})\}]^{2}\leq C_{\xi}$, and $n^{-1}|H|^{-1}\to 0$, so $\text{Var}(\xi_{1})\to 0$ as $n\to\infty$. It follows that

1npn[k]πn[k](1πn[k])i[k](Aiπn[k])f(Xi)1n1j=1nKH(XjXi)ν1/2(Xj)εjP0.\frac{1}{\sqrt{n}p_{n[k]}\pi_{n[k]}(1-\pi_{n[k]})}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})f(X_{i})^{-1}n^{-1}\sum_{j=1}^{n}K_{H}(X_{j}-X_{i})\nu^{1/2}(X_{j})\varepsilon_{j}\stackrel{{\scriptstyle P}}{{\rightarrow}}0.

By a similar argument, we have

Cn1i[k](Aiπn[k])f(Xi)2Df(Xi)Tn1j=1nKH(XjXi)(XjXi)ν1/2(Xj)εjP0,C_{n}^{-1}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})f(X_{i})^{-2}D_{f}(X_{i})^{\text{T}}n^{-1}\sum_{j=1}^{n}K_{H}\left(X_{j}-X_{i}\right)(X_{j}-X_{i})\nu^{1/2}(X_{j})\varepsilon_{j}\stackrel{{\scriptstyle P}}{{\rightarrow}}0,

where $C_{n}=\sqrt{n}p_{n[k]}\pi_{n[k]}(1-\pi_{n[k]})$. Moreover, $n^{-1/2}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})=O_{P}(1)$, so $n^{-1/2}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})\cdot o_{P}(1)=o_{P}(1)$. Then

1npn[k]πn[k](1πn[k])i[k](Aiπn[k])e1T(NiTWiNi)1NiTWiε(ν)P0.\frac{1}{\sqrt{n}p_{n[k]}\pi_{n[k]}(1-\pi_{n[k]})}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})e_{1}^{\text{T}}(N_{i}^{\text{T}}W_{i}N_{i})^{-1}N_{i}^{\text{T}}W_{i}\varepsilon(\nu)\stackrel{{\scriptstyle P}}{{\rightarrow}}0.

For the second term of equation (31), i.e., the bias term, we have

e1T(NiTWiNi)1NiTWiQh[k](Xi,a)\displaystyle e_{1}^{\text{T}}(N_{i}^{\text{T}}W_{i}N_{i})^{-1}N_{i}^{\text{T}}W_{i}Q_{h_{[k]}}(X_{i},a)
=\displaystyle= e1T(n1NiTWiNi)1(n1NiTWiQh[k](Xi,a))\displaystyle e_{1}^{\text{T}}(n^{-1}N_{i}^{\text{T}}W_{i}N_{i})^{-1}\cdot(n^{-1}N_{i}^{\text{T}}W_{i}Q_{h_{[k]}}(X_{i},a))
=\displaystyle= e1T[f(Xi)1+oP(1)Df(Xi)Tf(Xi)2+oP(𝟏T)Df(Xi)f(Xi)2+oP(𝟏){μ2(K)f(Xi)H}1+oP(H1)]\displaystyle e_{1}^{\text{T}}\left[\begin{array}[]{cc}f(X_{i})^{-1}+o_{P}(1)&-D_{f}(X_{i})^{\text{T}}f(X_{i})^{-2}+o_{P}(\bm{1^{\text{T}}})\\ -D_{f}(X_{i})f(X_{i})^{-2}+o_{P}(\bm{1})&\{\mu_{2}(K)f(X_{i})H\}^{-1}+o_{P}(H^{-1})\end{array}\right]
[n1j=1nKH(XjXi)(XjXi)Th[k](Xi,a)(XjXi)n1j=1n{KH(XjXi)(XjXi)Th[k](Xi,a)(XjXi)}(XjXi)]\displaystyle\cdot{\left[\begin{array}[]{c}n^{-1}\sum_{j=1}^{n}K_{H}\left(X_{j}-X_{i}\right)\left(X_{j}-X_{i}\right)^{\text{T}}\mathcal{H}_{h_{[k]}}(X_{i},a)\left(X_{j}-X_{i}\right)\\ n^{-1}\sum_{j=1}^{n}\left\{K_{H}\left(X_{j}-X_{i}\right)\left(X_{j}-X_{i}\right)^{\text{T}}\mathcal{H}_{h_{[k]}}(X_{i},a)\left(X_{j}-X_{i}\right)\right\}\left(X_{j}-X_{i}\right)\end{array}\right]}
=\displaystyle= n1f(Xi)1j=1nKH(XjXi)(XjXi)Th[k](Xi,a)(XjXi)+OP(Df(Xi)H3/2𝟏)\displaystyle n^{-1}f(X_{i})^{-1}\sum_{j=1}^{n}K_{H}\left(X_{j}-X_{i}\right)\left(X_{j}-X_{i}\right)^{\text{T}}\mathcal{H}_{h_{[k]}}(X_{i},a)\left(X_{j}-X_{i}\right)+O_{P}(D_{f}(X_{i})H^{3/2}\bm{1})
=\displaystyle= f(Xi)1[{K(u)(H1/2u)Th[k](Xi,a)(H1/2u)f(Xi+H1/2u)𝑑u}+oP{tr(H)}]\displaystyle f(X_{i})^{-1}\left[\left\{\int K(u)\left(H^{1/2}u\right)^{T}\mathcal{H}_{h_{[k]}}(X_{i},a)\left(H^{1/2}u\right)f\left(X_{i}+H^{1/2}u\right)du\right\}+o_{P}\{\textnormal{tr}(H)\}\right]
=\displaystyle= f(Xi)1[tr{H1/2h[k](Xi,a)H1/2K(u)uuT𝑑u}+oP{tr(H)}]\displaystyle f(X_{i})^{-1}\left[\textnormal{tr}\left\{H^{1/2}\mathcal{H}_{h_{[k]}}(X_{i},a)H^{1/2}\int K(u)uu^{T}du\right\}+o_{P}\{\textnormal{tr}(H)\}\right]
=\displaystyle= f(Xi)1μ2(K)tr{Hh[k](Xi,a)}+oP{tr(H)}.\displaystyle f(X_{i})^{-1}\mu_{2}(K)\textnormal{tr}\left\{H\mathcal{H}_{h_{[k]}}(X_{i},a)\right\}+o_{P}\{\textnormal{tr}(H)\}.

Let $C_{H}=\max_{i,j\in\{1,\dots,d\}}H_{ij}$. Because each entry of $H$ tends to 0, $C_{H}$ tends to 0 as well. Recall that all second-order derivatives of $h_{[k]}(\cdot,a)$ are continuous and that the density function of the $X_{i}$'s has compact support, so the elements of $\mathcal{H}_{h_{[k]}}(X_{i},a)$ are also bounded. Then there exists a constant $C_{\mathcal{H}}$ such that

|tr{Hh[k](Xi,a)}|d2CHC.|\textnormal{tr}\left\{H\mathcal{H}_{h_{[k]}}(X_{i},a)\right\}|\leq d^{2}C_{H}C_{\mathcal{H}}.

Therefore,

n1/2i[k]Aif(Xi)1e1T(NiTWiNi)1NiTWiQh[k](Xi,a)\displaystyle n^{-1/2}\sum_{i\in[k]}A_{i}f(X_{i})^{-1}e_{1}^{\text{T}}(N_{i}^{\text{T}}W_{i}N_{i})^{-1}N_{i}^{\text{T}}W_{i}Q_{h_{[k]}}(X_{i},a)
=\displaystyle= μ2(K)n1/2i[k]Aif(Xi)1[tr{Hh[k](Xi,a)}+oP{tr(H)}],\displaystyle\mu_{2}(K)n^{-1/2}\sum_{i\in[k]}A_{i}f(X_{i})^{-1}\left[\textnormal{tr}\left\{H\mathcal{H}_{h_{[k]}}(X_{i},a)\right\}+o_{P}\{\textnormal{tr}(H)\}\right],

where $n^{-1/2}\sum_{i\in[k]}A_{i}f(X_{i})^{-1}\textnormal{tr}\{H\mathcal{H}_{h_{[k]}}(X_{i},a)\}\leq n^{-1/2}\sum_{i\in[k]}A_{i}f(X_{i})^{-1}d^{2}C_{H}C_{\mathcal{H}}$. By the central limit theorem in Bugni et al. (2018), we have $n^{-1/2}\sum_{i\in[k]}A_{i}f(X_{i})^{-1}=O_{P}(1)$. Thus,

n1/2i[k]Aif(Xi)1e1T(NiTWiNi)1NiTWiQh[k](Xi,a)=oP(1).n^{-1/2}\sum_{i\in[k]}A_{i}f(X_{i})^{-1}e_{1}^{\text{T}}(N_{i}^{\text{T}}W_{i}N_{i})^{-1}N_{i}^{\text{T}}W_{i}Q_{h_{[k]}}(X_{i},a)=o_{P}(1).

By a similar argument, we have

n1/2i[k](1Ai)f(Xi)1e1T(NiTWiNi)1NiTWiQh[k](Xi,a)=oP(1).n^{-1/2}\sum_{i\in[k]}(1-A_{i})f(X_{i})^{-1}e_{1}^{\text{T}}(N_{i}^{\text{T}}W_{i}N_{i})^{-1}N_{i}^{\text{T}}W_{i}Q_{h_{[k]}}(X_{i},a)=o_{P}(1).

As a consequence,

1npn[k]πn[k](1πn[k])i[k](Aiπn[k]){h^[k](Xi,a)h[k](Xi,a)}=oP(1),\frac{1}{\sqrt{n}p_{n[k]}\pi_{n[k]}(1-\pi_{n[k]})}\sum_{i\in[k]}(A_{i}-\pi_{n[k]})\{\hat{h}_{[k]}(X_{i},a)-h_{[k]}(X_{i},a)\}=o_{P}(1),

that is, equation (2) holds for the local linear kernel estimator under Assumption 5.

To prove equation (3), recall that $\{X_{i}\}_{i=1}^{n}$ are i.i.d. and that, conditional on $\{B_{1},\dots,B_{n}\}$, $\{A_{1},\dots,A_{n}\}$ are independent of $\{X_{i}\}_{i=1}^{n}$. Similar to the proof of Lemma B.2 in Bugni et al. (2018), by rearranging the order of units with respect to the treatment assignment, we can construct quantities $\tilde{\hat{h}}_{[k]}(X_{i},a)$ such that $\tilde{\hat{h}}_{[k]}(X_{i},a)\stackrel{d}{=}\hat{h}_{[k]}(X_{i},a)\mid B_{i}=k$, $a=0,1$, and the $\tilde{\hat{h}}_{[k]}(X_{i},a)$'s do not depend on $B^{(n)}$ and $A^{(n)}$. Then, for any $\varepsilon>0$, by Markov's inequality, we have

(1n[k]i[k]{h^~[k](Xi,a)h[k](Xi,a)}2>εX1,,Xn)\displaystyle\mathbb{P}\Big{(}\frac{1}{n_{[k]}}\sum_{i\in[k]}\left\{\tilde{\hat{h}}_{[k]}(X_{i},a)-h_{[k]}(X_{i},a)\right\}^{2}>\varepsilon\mid X_{1},\dots,X_{n}\Big{)}
\displaystyle\leq E[1n[k]i[k]{h^~[k](Xi,a)h[k](Xi,a)}2X1,,Xn]/ε\displaystyle E\left[\frac{1}{n_{[k]}}\sum_{i\in[k]}\left\{\tilde{\hat{h}}_{[k]}(X_{i},a)-h_{[k]}(X_{i},a)\right\}^{2}\mid X_{1},\dots,X_{n}\right]/\varepsilon
𝑃\displaystyle\xrightarrow{P} 0,\displaystyle 0,

by Lemma 4. Thus,

(1n[k]i[k]{h^~[k](Xi,a)h[k](Xi,a)}2>ε)𝑃0.\mathbb{P}\Big{(}\frac{1}{n_{[k]}}\sum_{i\in[k]}\left\{\tilde{\hat{h}}_{[k]}(X_{i},a)-h_{[k]}(X_{i},a)\right\}^{2}>\varepsilon\Big{)}\xrightarrow{P}0.

Hence, $n_{[k]}^{-1}\sum_{i\in[k]}\{\hat{h}_{[k]}(X_{i},a)-h_{[k]}(X_{i},a)\}^{2}=o_{P}(1)$, i.e., equation (3) holds. ∎
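Before turning to the sample-splitting result, the following is a schematic numpy sketch of how a local linear estimate of $h_{[k]}(\cdot,a)$ could be evaluated at the units of stratum $k$. It reflects one plausible construction (fitting on the stratum-$k$ units observed under arm $a$, using any pointwise smoother such as the local_linear_fit sketch above), not necessarily the exact definition of $\hat{h}_{[k]}(\cdot,a)$ given in the main text; the function names are ours.

```python
import numpy as np

def h_hat_stratum(X, Y, A, B, k, a, local_fit):
    """Schematic evaluation of an estimate of h_{[k]}(x, a) at every unit of stratum k.

    X : (n, d) covariates; Y : (n,) observed outcomes; A : (n,) treatment indicators;
    B : (n,) stratum labels; k : stratum of interest; a : arm (0 or 1);
    local_fit(x, X_train, Y_train) : any pointwise smoother, e.g. a local linear fit.
    Returns a dict mapping each unit index i with B_i = k to the fitted value at X_i.
    """
    fit_mask = (B == k) & (A == a)          # stratum-k units observed under arm a
    eval_idx = np.where(B == k)[0]          # all stratum-k units
    return {i: local_fit(X[i], X[fit_mask], Y[fit_mask]) for i in eval_idx}

# Example: plug in the local linear sketch above with a fixed bandwidth matrix H0
# h_hat = h_hat_stratum(X, Y, A, B, k=1, a=1,
#                       local_fit=lambda x, Xt, Yt: local_linear_fit(x, Xt, Yt, H0))
```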

Appendix D Proof of Theorem 4

Proof.

Let $n_{[k]m*}=\sum_{i\in[k]\cap I_{m}}1$ denote the number of units in stratum $k$ and fold $m$, and let $n_{[k]ma}=\sum_{i\in[k]\cap I_{m}}\mathds{1}_{A_{i}=a}$, $a=0,1$, denote the number of units in stratum $k$ and fold $m$ that receive treatment $a$. For simplicity, suppose that each fold contains the same number of units $n_{m}$, so that $n=Mn_{m}$. Denote $p_{n[k]m}=n_{[k]m*}/n_{m}=(Mn_{[k]m*})/n$ and $\pi_{n[k]m}=n_{[k]m1}/n_{[k]m*}$; then $n_{[k]m*}/n_{[k]}\stackrel{P}{\to}1/M$. A minimal sketch of this per-stratum, per-fold bookkeeping is given immediately after equation (39) below. We then have

k=1Kpn[k]Y¯[k]11Mm=1Mk=1Kpn[k]m1n[k]m1i[k]ImAiYi\displaystyle\sum_{k=1}^{K}p_{n[k]}\bar{Y}_{[k]1}-\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]m}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}Y_{i} (39)
=\displaystyle= k=1Kpn[k]m=1Mn[k]m1n[k]11n[k]m1i[k]ImAiYi1Mm=1Mk=1Kpn[k]m1n[k]m1i[k]ImAiYi\displaystyle\sum_{k=1}^{K}p_{n[k]}\sum_{m=1}^{M}\frac{n_{[k]m1}}{n_{[k]1}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}Y_{i}-\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]m}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}Y_{i}
=\displaystyle= m=1Mk=1Kpn[k]n[k]m1n[k]11n[k]m1i[k]ImAiYim=1Mk=1Kpn[k]n[k]mn[k]1n[k]m1i[k]ImAiYi\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m1}}{n_{[k]1}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}Y_{i}-\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}Y_{i}
=\displaystyle= m=1Mk=1Kpn[k](n[k]m1n[k]1n[k]mn[k])1n[k]m1i[k]ImAiYi.\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}(\frac{n_{[k]m1}}{n_{[k]1}}-\frac{n_{[k]m*}}{n_{[k]}})\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}Y_{i}.
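As referenced above, the bookkeeping quantities $n_{[k]m*}$, $n_{[k]ma}$, $p_{n[k]m}$, and $\pi_{n[k]m}$ can be computed from arrays of stratum labels, fold labels, and treatment indicators as in the following minimal numpy sketch; the array names and the function fold_stratum_counts are ours for illustration only.

```python
import numpy as np

def fold_stratum_counts(B, fold, A, k, m):
    """Within-stratum, within-fold bookkeeping used in the proof of Theorem 4.

    B    : (n,) stratum labels B_i
    fold : (n,) fold labels in {1, ..., M}
    A    : (n,) binary treatment indicators A_i
    k, m : stratum and fold of interest
    """
    n_m = np.sum(fold == m)                     # number of units in fold m (n_m)
    in_km = (B == k) & (fold == m)
    n_km_star = np.sum(in_km)                   # n_{[k]m*}
    n_km1 = np.sum(in_km & (A == 1))            # n_{[k]m1}
    n_km0 = np.sum(in_km & (A == 0))            # n_{[k]m0}
    p_nkm = n_km_star / n_m                     # p_{n[k]m} = n_{[k]m*} / n_m
    pi_nkm = n_km1 / n_km_star                  # pi_{n[k]m} = n_{[k]m1} / n_{[k]m*}
    return n_km_star, n_km1, n_km0, p_nkm, pi_nkm
```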

By a similar argument, we have

k=1Kpn[k]1n[k]1Aih[k](Xi,1)1Mm=1Mk=1Kpn[k]m1n[k]m1i[k]ImAih^[k]m(Xi,1)\displaystyle\sum_{k=1}^{K}p_{n[k]}\frac{1}{n_{[k]1}}A_{i}h_{[k]}(X_{i},1)-\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]m}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\hat{h}_{[k]m}(X_{i},1) (40)
=\displaystyle= k=1Kpn[k]m=1Mn[k]m1n[k]11n[k]m1i[k]ImAih[k](Xi,1)\displaystyle\sum_{k=1}^{K}p_{n[k]}\sum_{m=1}^{M}\frac{n_{[k]m1}}{n_{[k]1}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}h_{[k]}(X_{i},1)
1Mm=1Mk=1Kpn[k]m1n[k]m1i[k]ImAih^[k]m(Xi,1)\displaystyle-\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]m}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\hat{h}_{[k]m}(X_{i},1)
=\displaystyle= m=1Mk=1Kpn[k]n[k]m1n[k]11n[k]m1i[k]ImAih[k](Xi,1)\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m1}}{n_{[k]1}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}h_{[k]}(X_{i},1)
m=1Mk=1Kpn[k]n[k]mn[k]1n[k]m1i[k]ImAih^[k]m(Xi,1)\displaystyle-\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\hat{h}_{[k]m}(X_{i},1)
=\displaystyle= m=1Mk=1Kpn[k](n[k]m1n[k]1n[k]mn[k])1n[k]m1i[k]ImAih[k](Xi,1)\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}(\frac{n_{[k]m1}}{n_{[k]1}}-\frac{n_{[k]m*}}{n_{[k]}})\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}h_{[k]}(X_{i},1)
+m=1Mk=1Kpn[k]n[k]mn[k]1n[k]m1i[k]ImAi{h[k](Xi,1)h^[k]m(Xi,1)},\displaystyle+\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\{h_{[k]}(X_{i},1)-\hat{h}_{[k]m}(X_{i},1)\},

and

1Mm=1Mk=1Kpn[k]m1n[k]m1i[k]Imπn[k]mh^[k]m(Xi,1)\displaystyle\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]m}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}\pi_{n[k]m}\hat{h}_{[k]m}(X_{i},1) (41)
k=1Kpn[k]1n[k]1i[k]πn[k]h[k](Xi,1)\displaystyle-\sum_{k=1}^{K}p_{n[k]}\frac{1}{n_{[k]1}}\sum_{i\in[k]}\pi_{n[k]}h_{[k]}(X_{i},1)
=\displaystyle= m=1Mk=1Kpn[k]n[k]mn[k]1n[k]m1i[k]Imπn[k]mh^[k]m(Xi,1)\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}\pi_{n[k]m}\hat{h}_{[k]m}(X_{i},1)
k=1Kpn[k]m=1Mn[k]m1n[k]11n[k]m1i[k]Imπn[k]πn[k]mπn[k]mh[k](Xi,1)\displaystyle-\sum_{k=1}^{K}p_{n[k]}\sum_{m=1}^{M}\frac{n_{[k]m1}}{n_{[k]1}}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}\frac{\pi_{n[k]}}{\pi_{n[k]m}}\pi_{n[k]m}h_{[k]}(X_{i},1)
=\displaystyle= m=1Mk=1Kpn[k]n[k]mn[k]1n[k]m1i[k]Imπn[k]mh^[k]m(Xi,1)\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}\pi_{n[k]m}\hat{h}_{[k]m}(X_{i},1)
m=1Mk=1Kπn[k]n[k]mn[k]1n[k]m1i[k]Imπn[k]mh[k](Xi,1)\displaystyle-\sum_{m=1}^{M}\sum_{k=1}^{K}\pi_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}\pi_{n[k]m}h_{[k]}(X_{i},1)
=\displaystyle= m=1Mk=1Kpn[k]n[k]mπn[k]mn[k]1n[k]m1i[k]Im{h^[k]m(Xi,1)h[k](Xi,1)}\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}\pi_{n[k]m}}{n_{[k]}}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}\{\hat{h}_{[k]m}(X_{i},1)-h_{[k]}(X_{i},1)\}

Because the folds are mutually exclusive and the fold partition process is independent of the covariates, strata, treatments, and outcomes, we have $\pi_{n[k]m}\stackrel{P}{\to}\pi$ as $n\to\infty$. Moreover, $h_{[k]}(X_{i},1)=E(Y_{i}|B_{i}=k,A_{i}=1)$ implies $(1/n_{[k]m1})\sum_{i\in[k]\cap I_{m}}A_{i}\{Y_{i}-h_{[k]}(X_{i},1)\}=O_{P}(1/\sqrt{n})$. Together with $\pi_{n[k]}\stackrel{P}{\to}\pi$ and equations (39), (40), and (41), we have

k=1Kpn[k]Y¯[k]1k=1Kpn[k]1n[k]1(Aiπn[k])h[k](Xi,1)\displaystyle\sum_{k=1}^{K}p_{n[k]}\bar{Y}_{[k]1}-\sum_{k=1}^{K}p_{n[k]}\frac{1}{n_{[k]1}}(A_{i}-\pi_{n[k]})h_{[k]}(X_{i},1)
{1Mm=1Mk=1Kpn[k]m1n[k]m1i[k]ImAiYi\displaystyle-\Big{\{}\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]m}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}Y_{i}
1Mm=1Mk=1Kpn[k]m1n[k]m1i[k]Im(Aiπn[k]m)h^[k]m(Xi,1)}\displaystyle-\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]m}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}(A_{i}-\pi_{n[k]m})\hat{h}_{[k]m}(X_{i},1)\Big{\}}
=\displaystyle= m=1Mk=1Kpn[k](n[k]m1n[k]1n[k]mn[k])1n[k]m1i[k]ImAiYi\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}(\frac{n_{[k]m1}}{n_{[k]1}}-\frac{n_{[k]m*}}{n_{[k]}})\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}Y_{i}
m=1Mk=1Kpn[k](n[k]m1n[k]1n[k]mn[k])1n[k]m1i[k]ImAih[k](Xi,1)\displaystyle-\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}(\frac{n_{[k]m1}}{n_{[k]1}}-\frac{n_{[k]m*}}{n_{[k]}})\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}h_{[k]}(X_{i},1)
+m=1Mk=1Kpn[k]n[k]mn[k]1n[k]m1i[k]ImAi{h[k](Xi,1)h^[k]m(Xi,1)}\displaystyle+\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\{h_{[k]}(X_{i},1)-\hat{h}_{[k]m}(X_{i},1)\}
+m=1Mk=1Kpn[k]n[k]mπn[k]mn[k]1n[k]m1i[k]Im{h^[k]m(Xi,1)h[k](Xi,1)}\displaystyle+\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}\pi_{n[k]m}}{n_{[k]}}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}\{\hat{h}_{[k]m}(X_{i},1)-h_{[k]}(X_{i},1)\}
=\displaystyle= m=1Mk=1Kpn[k](n[k]m1n[k]1n[k]mn[k])1n[k]m1i[k]ImAi{Yih[k](Xi,1)}\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}(\frac{n_{[k]m1}}{n_{[k]1}}-\frac{n_{[k]m*}}{n_{[k]}})\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\{Y_{i}-h_{[k]}(X_{i},1)\}
+m=1Mk=1Kpn[k]n[k]mn[k]1n[k]m1i[k]Im(Aiπn[k]m){h[k](Xi,1)h^[k]m(Xi,1)}\displaystyle+\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}(A_{i}-\pi_{n[k]m})\{h_{[k]}(X_{i},1)-\hat{h}_{[k]m}(X_{i},1)\}
=\displaystyle= m=1Mk=1Kpn[k]n[k]m(πn[k]mπn[k])n[k]11n[k]m1i[k]ImAi{Yih[k](Xi,1)}\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}(\pi_{n[k]m}-\pi_{n[k]})}{n_{[k]1}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\{Y_{i}-h_{[k]}(X_{i},1)\}
+m=1Mk=1Kpn[k]n[k]mn[k]1n[k]m1i[k]Im(Aiπn[k]m){h[k](Xi,1)h^[k]m(Xi,1)}\displaystyle+\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}(A_{i}-\pi_{n[k]m})\{h_{[k]}(X_{i},1)-\hat{h}_{[k]m}(X_{i},1)\}
=\displaystyle= m=1Mk=1Kpn[k]n[k]m(πn[k]mπ)n[k]11n[k]m1i[k]ImAi{Yih[k](Xi,1)}\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}(\pi_{n[k]m}-\pi)}{n_{[k]1}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\{Y_{i}-h_{[k]}(X_{i},1)\}
+m=1Mk=1Kpn[k]n[k]m(ππn[k])n[k]11n[k]m1i[k]ImAi{Yih[k](Xi,1)}\displaystyle+\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}(\pi-\pi_{n[k]})}{n_{[k]1}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\{Y_{i}-h_{[k]}(X_{i},1)\}
+m=1Mk=1Kpn[k]n[k]mn[k]1n[k]m1i[k]Im(Aiπn[k]m){h[k](Xi,1)h^[k]m(Xi,1)}\displaystyle+\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}(A_{i}-\pi_{n[k]m})\{h_{[k]}(X_{i},1)-\hat{h}_{[k]m}(X_{i},1)\}
=\displaystyle= m=1Mk=1Kpn[k]n[k]mn[k]1n[k]m1i[k]Im(Aiπn[k]m){h[k](Xi,1)h^[k]m(Xi,1)}+oP(1/n).\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}(A_{i}-\pi_{n[k]m})\{h_{[k]}(X_{i},1)-\hat{h}_{[k]m}(X_{i},1)\}+o_{P}({1}/{\sqrt{n}}).

By a similar argument, we can establish the corresponding result for the control group:

k=1Kpn[k]Y¯[k]0+k=1Kpn[k]1n[k]0(Aiπn[k])h[k](Xi,0)\displaystyle\sum_{k=1}^{K}p_{n[k]}\bar{Y}_{[k]0}+\sum_{k=1}^{K}p_{n[k]}\frac{1}{n_{[k]0}}(A_{i}-\pi_{n[k]})h_{[k]}(X_{i},0)
{1Mm=1Mk=1Kpn[k]m1n[k]m0i[k]Im(1Ai)Yi\displaystyle-\Big{\{}\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]m}\frac{1}{n_{[k]m0}}\sum_{i\in[k]\cap I_{m}}(1-A_{i})Y_{i}
+1Mm=1Mk=1Kpn[k]m1n[k]m0i[k]Im(Aiπn[k]m)h^[k]m(Xi,0)}\displaystyle+\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]m}\frac{1}{n_{[k]m0}}\sum_{i\in[k]\cap I_{m}}(A_{i}-\pi_{n[k]m})\hat{h}_{[k]m}(X_{i},0)\Big{\}}
=\displaystyle= m=1Mk=1Kpn[k]n[k]mn[k]1n[k]m0i[k]Im{(1Ai)(1πn[k]m)}\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\cdot\frac{1}{n_{[k]m0}}\sum_{i\in[k]\cap I_{m}}\{(1-A_{i})-(1-\pi_{n[k]m})\}
{h[k](Xi,0)h^[k]m(Xi,0)}+oP(1/n).\displaystyle\cdot\{h_{[k]}(X_{i},0)-\hat{h}_{[k]m}(X_{i},0)\}+o_{P}({1}/{\sqrt{n}}).

Within stratum $k$ and fold $m$, let $\bar{\hat{h}}_{[k]m1}(\cdot,a)$ and $\bar{\hat{h}}_{[k]m0}(\cdot,a)$ respectively denote the sample means of $\hat{h}_{[k]m}(X_{i},a)$ in the treatment group and the control group, and let $\bar{h}_{[k]m1}(\cdot,a)$ and $\bar{h}_{[k]m0}(\cdot,a)$ respectively denote the sample means of $h_{[k]}(X_{i},a)$ in the treatment group and the control group. Then,

m=1Mk=1Kpn[k]n[k]mn[k]1n[k]m1i[k]Im(Aiπn[k]m){h[k](Xi,1)h^[k]m(Xi,1)}\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}\big{(}A_{i}-\pi_{n[k]m}\big{)}\big{\{}h_{[k]}(X_{i},1)-\hat{h}_{[k]m}(X_{i},1)\big{\}}
m=1Mk=1Kpn[k]n[k]mn[k]1n[k]m0i[k]Im{(1Ai)(1πn[k]m)}{h[k](Xi,0)h^[k]m(Xi,0)}\displaystyle-\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\cdot\frac{1}{n_{[k]m0}}\sum_{i\in[k]\cap I_{m}}\big{\{}(1-A_{i})-(1-\pi_{n[k]m})\big{\}}\big{\{}h_{[k]}(X_{i},0)-\hat{h}_{[k]m}(X_{i},0)\big{\}}
=\displaystyle= m=1Mk=1Kpn[k]n[k]mn[k](1πn[k]m)[{h^¯[k]m1(,1)h¯[k]m1(,1)}{h^¯[k]m0(,1)h¯[k]m0(,1)}]\displaystyle-\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}(1-\pi_{n[k]m})\Big{[}\big{\{}\bar{\hat{h}}_{[k]m1}(\cdot,1)-\bar{h}_{[k]m1}(\cdot,1)\big{\}}-\big{\{}\bar{\hat{h}}_{[k]m0}(\cdot,1)-\bar{h}_{[k]m0}(\cdot,1)\big{\}}\Big{]}
m=1Mk=1Kpn[k]n[k]mn[k]πn[k]m[{h^¯[k]m1(,0)h¯[k]m1(,0)}{h^¯[k]m0(,0)h¯[k]m0(,0)}],\displaystyle-\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}}{n_{[k]}}\pi_{n[k]m}\Big{[}\big{\{}\bar{\hat{h}}_{[k]m1}(\cdot,0)-\bar{h}_{[k]m1}(\cdot,0)\big{\}}-\big{\{}\bar{\hat{h}}_{[k]m0}(\cdot,0)-\bar{h}_{[k]m0}(\cdot,0)\big{\}}\Big{]},

As a consequence, to prove that $\sqrt{n}\hat{\tau}_{\text{ss}}$ and $\sqrt{n}\hat{\tau}_{\text{oracle}}$ have the same asymptotic distribution, it suffices to prove that the sample-splitting version of Assumption 4, i.e., Assumption 7 below, holds.

Assumption 7.

For $k=1,\dots,K$, $m=1,\dots,M$, and $a=0,1$,

n[{h^¯[k]m1(,a)h¯[k]m1(,a)}{h^¯[k]m0(,a)h¯[k]m0(,a)}]=oP(1),\sqrt{n}\Big{[}\big{\{}\bar{\hat{h}}_{[k]m1}(\cdot,a)-\bar{h}_{[k]m1}(\cdot,a)\big{\}}-\big{\{}\bar{\hat{h}}_{[k]m0}(\cdot,a)-\bar{h}_{[k]m0}(\cdot,a)\big{\}}\Big{]}=o_{P}(1), (42)
1n[k]mi[k]Im{h^[k]m(Xi,a)h[k](Xi,a)}2=oP(1).\frac{1}{n_{[k]m*}}\sum_{i\in[k]\cap I_{m}}\Big{\{}\hat{h}_{[k]m}(X_{i},a)-h_{[k]}(X_{i},a)\Big{\}}^{2}=o_{P}(1). (43)

For Equation (42), we have

n[{h^¯[k]m1(,a)h¯[k]1(,a)}{h^¯[k]m0(,a)h¯[k]0(,a)}]\displaystyle\sqrt{n}\Big{[}\big{\{}\bar{\hat{h}}_{[k]m1}(\cdot,a)-\bar{h}_{[k]1}(\cdot,a)\big{\}}-\big{\{}\bar{\hat{h}}_{[k]m0}(\cdot,a)-\bar{h}_{[k]0}(\cdot,a)\big{\}}\Big{]} (44)
=\displaystyle= n([1n[k]m1i[k]ImAi{h^[k]m(Xi,a)h[k](Xi,a)}]\displaystyle\sqrt{n}\Big{(}\big{[}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\{\hat{h}_{[k]m}(X_{i},a)-h_{[k]}(X_{i},a)\}\big{]}
[1n[k]m0i[k]Im(1Ai){h^[k]m(Xi,a)h[k](Xi,a)}])\displaystyle-\big{[}\frac{1}{n_{[k]m0}}\sum_{i\in[k]\cap I_{m}}(1-A_{i})\{\hat{h}_{[k]m}(X_{i},a)-h_{[k]}(X_{i},a)\}\big{]}\Big{)}
=\displaystyle= nn[k]mi[k]Im(Aiπn[k]m1Ai1πn[k]m){h^[k]m(Xi,a)h[k](Xi,a)}\displaystyle\frac{\sqrt{n}}{n_{[k]m*}}\sum_{i\in[k]\cap I_{m}}\big{(}\frac{A_{i}}{\pi_{n[k]m}}-\frac{1-A_{i}}{1-\pi_{n[k]m}}\big{)}\{\hat{h}_{[k]m}(X_{i},a)-h_{[k]}(X_{i},a)\}
=\displaystyle= Mpn[k]mni[k]ImAiπn[k]mπn[k]m(1πn[k]m){h^[k]m(Xi,a)h[k](Xi,a)}.\displaystyle\frac{M}{p_{n[k]m}\sqrt{n}}\sum_{i\in[k]\cap I_{m}}\frac{A_{i}-\pi_{n[k]m}}{\pi_{n[k]m}(1-\pi_{n[k]m})}\{\hat{h}_{[k]m}(X_{i},a)-h_{[k]}(X_{i},a)\}.

For $a=1$,

(44)=Mpn[k]mni[k]Im𝟙Ai=1πn[k]mπn[k]m(1πn[k]m){h^[k]m(Xi,a)h[k](Xi,a)}.\eqref{eqn::goal1expand}=\frac{M}{p_{n[k]m}\sqrt{n}}\sum_{i\in[k]\cap I_{m}}\frac{\mathds{1}_{A_{i}=1}-\pi_{n[k]m}}{\pi_{n[k]m}(1-\pi_{n[k]m})}\{\hat{h}_{[k]m}(X_{i},a)-h_{[k]}(X_{i},a)\}.

For $a=0$,

(44)\displaystyle\eqref{eqn::goal1expand} =\displaystyle= Mpn[k]mni[k](1Ai)(1πn[k]m)πn[k]m(1πn[k]m){h^[k]m(Xi,a)h[k](Xi,a)}\displaystyle-\frac{M}{p_{n[k]m}\sqrt{n}}\sum_{i\in[k]}\frac{(1-A_{i})-(1-\pi_{n[k]m})}{\pi_{n[k]m}(1-\pi_{n[k]m})}\{\hat{h}_{[k]m}(X_{i},a)-h_{[k]}(X_{i},a)\}
=\displaystyle= Mpn[k]mni[k]Im𝟙Ai=0(1πn[k]m)πn[k]m(1πn[k]m){h^[k]m(Xi,a)h[k](Xi,a)}.\displaystyle-\frac{M}{p_{n[k]m}\sqrt{n}}\sum_{i\in[k]\cap I_{m}}\frac{\mathds{1}_{A_{i}=0}-(1-\pi_{n[k]m})}{\pi_{n[k]m}(1-\pi_{n[k]m})}\{\hat{h}_{[k]m}(X_{i},a)-h_{[k]}(X_{i},a)\}.

Let

πn[k]ma={πn[k]m,if a = 11πn[k]m,if a = 0.,Δi[k]ma=h^[k]m(Xi,a)h[k](Xi,a).\pi_{n[k]ma}=\begin{cases}\pi_{n[k]m},&\mbox{if a = 1}\\ 1-\pi_{n[k]m},&\mbox{if a = 0}.\end{cases},\quad\Delta_{i[k]ma}=\hat{h}_{[k]m}(X_{i},a)-h_{[k]}(X_{i},a).

To prove Equation (42), it suffices to show

1ni[k]Im𝟙Ai=aπn[k]maπn[k]maΔi[k]ma=oP(1).\frac{1}{\sqrt{n}}\sum_{i\in[k]\cap I_{m}}\frac{\mathds{1}_{A_{i}=a}-\pi_{n[k]ma}}{\pi_{n[k]ma}}\Delta_{i[k]ma}=o_{P}(1).

Let

Tn[k]m=1ni=1n𝟙Ai=aπn[k]maπn[k]ma{h^[k]m(Xi,a)h[k](Xi,a)}𝟙i[k]Im.T_{n[k]m}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\mathds{1}_{A_{i}=a}-\pi_{n[k]ma}}{\pi_{n[k]ma}}\{\hat{h}_{[k]m}(X_{i},a)-h_{[k]}(X_{i},a)\}\cdot\mathds{1}_{i\in[k]\cap I_{m}}.

Denote $A^{(n)}=(A_{1},A_{2},\dots,A_{n})$ and $B^{(n)}=(B_{1},B_{2},\dots,B_{n})$. Suppose that $W_{[k]}^{\prime}$ and $W_{[k]}^{\prime\prime}$ are independent copies of $W_{[k]}=\{Y_{i}(1),Y_{i}(0),X_{i}\}_{i\in[k]}$, $k=1,\dots,K$, such that $(W_{[k]}^{\prime},W_{[k]}^{\prime\prime})\perp\!\!\!\perp(W_{[k]},A^{(n)})$ and $W_{[k]}^{\prime}\perp\!\!\!\perp W_{[k]}^{\prime\prime}$. Let $\mathbb{N}_{[k]ma}=(n_{[k]},n_{[k]m*},n_{[k]ma})$. By Lemma 1 and the law of large numbers, we have $n_{[k]ma}/n_{[k]m*}\stackrel{P}{\to}\pi(a)$, where $\pi(1)=\pi$ and $\pi(0)=1-\pi$.

Let $\breve{\Delta}_{i[k]ma}$ denote the statistic obtained from $\Delta_{i[k]ma}$ by replacing the units from $I_{m}^{c}$ (the units used for estimating $\hat{h}_{[k]m}(\cdot,a)$) with units in $W_{[k]}^{\prime}$ and replacing the units from $I_{m}$ with units in $W_{[k]}^{\prime\prime}$, and define

T˘n[k]m=1n{i=1n[k]ma1πn[k]maπn[k]maΔ˘i[k]mai=n[k]ma+1n[k]mΔ˘i[k]ma},\breve{T}_{n[k]m}=\frac{1}{\sqrt{n}}\Big{\{}\sum_{i=1}^{n_{[k]ma}}\frac{1-\pi_{n[k]ma}}{\pi_{n[k]ma}}\breve{\Delta}_{i[k]ma}-\sum_{i=n_{[k]ma}+1}^{n_{[k]m*}}\breve{\Delta}_{i[k]ma}\Big{\}},

then $\breve{\Delta}_{i[k]ma}\stackrel{d}{=}\Delta_{i[k]ma}$ and $\breve{T}_{n[k]m}\stackrel{d}{=}T_{n[k]m}$. Conditional on $\{W_{[k]}^{\prime},A^{(n)}\}$, the $\breve{\Delta}_{i[k]ma}$'s are independent across $i\in[k]$ by the independence assumptions, so it remains to show that $\breve{T}_{n[k]m}=o_{P}(1)$. A simple calculation gives

E{T˘n[k]m[k]ma,W[k]}\displaystyle E\{\breve{T}_{n[k]m}\mid\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\}
=\displaystyle= 1n[i=1n[k]ma1πn[k]maπn[k]maE{Δ˘i[k]ma|[k]ma,W[k]}i=n[k]ma+1n[k]mE{Δ˘i[k]ma|[k]ma,W[k]}]\displaystyle\frac{1}{\sqrt{n}}\Big{[}\sum_{i=1}^{n_{[k]ma}}\frac{1-\pi_{n[k]ma}}{\pi_{n[k]ma}}E\{\breve{\Delta}_{i[k]ma}|\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\}-\sum_{i=n_{[k]ma}+1}^{n_{[k]m*}}E\{\breve{\Delta}_{i[k]ma}|\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\}\Big{]}
=\displaystyle= 1n(n[k]maπn[k]man[k]m)E{Δ˘i[k]ma|[k]ma,W[k]}\displaystyle\frac{1}{\sqrt{n}}\Big{(}\frac{n_{[k]ma}}{\pi_{n[k]ma}}-n_{[k]m*}\Big{)}E\{\breve{\Delta}_{i[k]ma}|\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\}
=\displaystyle= 0.\displaystyle 0.

In addition, we have

Var{T˘n[k]m[k]ma,W[k]}\displaystyle\text{Var}\{\breve{T}_{n[k]m}\mid\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\} (45)
=\displaystyle= 1n[{1πn[k]maπn[k]ma}2n[k]maVar{Δ˘i[k]ma|[k]ma,W[k]}\displaystyle\frac{1}{n}\bigg{[}\Big{\{}\frac{1-\pi_{n[k]ma}}{\pi_{n[k]ma}}\}^{2}n_{[k]ma}Var\{\breve{\Delta}_{i[k]ma}|\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\Big{\}}
+(n[k]mn[k]ma)Var{Δ˘i[k]ma|[k]ma,W[k]}]\displaystyle+(n_{[k]m*}-n_{[k]ma})Var\{\breve{\Delta}_{i[k]ma}|\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\}\bigg{]}
=\displaystyle= 1n(12πn[k]maπn[k]ma2n[k]ma+n[k]m)Var{Δ˘i[k]ma|[k]ma,W[k]}\displaystyle\frac{1}{n}\Big{(}\frac{1-2\pi_{n[k]ma}}{\pi_{n[k]ma}^{2}}n_{[k]ma}+n_{[k]m*}\Big{)}Var\{\breve{\Delta}_{i[k]ma}|\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\}
\displaystyle\leq 1n(12πn[k]maπn[k]ma2n[k]ma+n[k]m)E{Δ˘i[k]ma2|[k]ma,W[k]}\displaystyle\frac{1}{n}\Big{(}\frac{1-2\pi_{n[k]ma}}{\pi_{n[k]ma}^{2}}n_{[k]ma}+n_{[k]m*}\Big{)}E\{\breve{\Delta}_{i[k]ma}^{2}|\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\}
=\displaystyle= (12πn[k]maπn[k]ma2n[k]man+n[k]mn)E{Δ˘i[k]ma2|[k]ma,W[k]}\displaystyle\Big{(}\frac{1-2\pi_{n[k]ma}}{\pi_{n[k]ma}^{2}}\cdot\frac{n_{[k]ma}}{n}+\frac{n_{[k]m*}}{n}\Big{)}E\{\breve{\Delta}_{i[k]ma}^{2}|\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\}
=\displaystyle= (1πn[k]maπn[k]man[k]mn)E{Δ˘i[k]ma2|[k]ma,W[k]}.\displaystyle\Big{(}\frac{1-\pi_{n[k]ma}}{\pi_{n[k]ma}}\cdot\frac{n_{[k]m*}}{n}\Big{)}E\{\breve{\Delta}_{i[k]ma}^{2}|\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\}.

In (45), $(1-\pi_{n[k]ma})/\pi_{n[k]ma}=O_{P}(1)$ and $0<n_{[k]m*}/n<1$. Together with Assumption 6, we have

Var{T˘n[k]m[k]ma,W[k]}OP(1)E{Δ˘i[k]ma2[k]ma,W[k]}=op(1).\text{Var}\{\breve{T}_{n[k]m}\mid\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\}\leq O_{P}(1)\cdot E\{\breve{\Delta}_{i[k]ma}^{2}\mid\mathbb{N}_{[k]ma},W_{[k]}^{\prime}\}=o_{p}(1).

Then, for any $\varepsilon>0$, by Markov's inequality, we have

P(|T˘n[k]m|>ε[k]ma,W[k])ε2E(T˘n[k]m2[k]ma,W[k])=oP(1).P(|\breve{T}_{n[k]m}|>\varepsilon\mid\mathbb{N}_{[k]ma},W_{[k]}^{\prime})\leq\varepsilon^{-2}E(\breve{T}_{n[k]m}^{2}\mid\mathbb{N}_{[k]ma},W_{[k]}^{\prime})=o_{P}(1).

Therefore, by the extension of the dominated convergence theorem to convergence in probability, we have

limnP(|T˘n[k]m|>ε)\displaystyle\lim_{n\to\infty}P(|\breve{T}_{n[k]m}|>\varepsilon) =\displaystyle= limnE(𝟙|T˘n[k]m|>ε)\displaystyle\lim_{n\to\infty}E(\mathds{1}_{|\breve{T}_{n[k]m}|>\varepsilon})
=\displaystyle= limnE{E(𝟙|T˘n[k]m|>ε[k]ma,W[k])}\displaystyle\lim_{n\to\infty}E\{E(\mathds{1}_{|\breve{T}_{n[k]m}|>\varepsilon}\mid\mathbb{N}_{[k]ma},W_{[k]}^{\prime})\}
=\displaystyle= limnE(P(|T˘n[k]m|>ε[k]ma,W[k]))\displaystyle\lim_{n\to\infty}E(P(|\breve{T}_{n[k]m}|>\varepsilon\mid\mathbb{N}_{[k]ma},W_{[k]}^{\prime}))
=\displaystyle= 0.\displaystyle 0.

Moreover, if Assumption 6 holds, then because the $X_{i}$'s are independent and identically distributed, equation (43) holds by the law of large numbers. In conclusion, under Assumption 6, both equations in Assumption 7 hold.
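For intuition about the sample-splitting construction whose conditions were just verified, the following is a schematic sketch of cross-fitting $\hat{h}_{[k]m}(\cdot,a)$: for each fold $m$, stratum $k$, and arm $a$, a learner is fit on the out-of-fold units $I_{m}^{c}$ and evaluated on the in-fold units $I_{m}$. The learner (scikit-learn's LinearRegression) is only a placeholder for whatever machine learning method is used, the function name cross_fit_h is ours, and the sketch assumes every stratum-arm cell is nonempty in each training split; the paper's algorithm should be consulted for the exact construction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # placeholder learner; any ML method could be used

def cross_fit_h(X, Y, A, B, fold, M, K):
    """Schematic cross-fitting of h_hat_{[k]m}(X_i, a) for every unit i.

    X : (n, d) covariates; Y : (n,) outcomes; A : (n,) treatments;
    B : (n,) strata in {1,...,K}; fold : (n,) fold labels in {1,...,M}.
    Returns an (n, 2) array whose column a holds the out-of-fold prediction
    h_hat_{[k]m}(X_i, a) for unit i in stratum k and fold m.
    """
    n = len(Y)
    h_hat = np.full((n, 2), np.nan)
    for m in range(1, M + 1):
        for k in range(1, K + 1):
            in_fold = (fold == m) & (B == k)               # units in I_m within stratum k
            for a in (0, 1):
                train = (fold != m) & (B == k) & (A == a)  # out-of-fold units I_m^c, stratum k, arm a
                model = LinearRegression().fit(X[train], Y[train])
                h_hat[in_fold, a] = model.predict(X[in_fold])
    return h_hat
```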

Next, we prove the consistency of the variance estimator. Note that

k=1Kpn[k][1n[k]1i[k]Ai{r^i(1)1n[k]1j[k]Ajr^j(1)}2]\displaystyle\sum_{k=1}^{K}p_{n[k]}\bigg{[}\frac{1}{n_{[k]1}}\sum_{i\in[k]}A_{i}\Big{\{}\hat{r}_{i}(1)-\frac{1}{n_{[k]1}}\sum_{j\in[k]}A_{j}\hat{r}_{j}(1)\Big{\}}^{2}\bigg{]}
1Mm=1Mk=1Kpn[k]m1n[k]m1i[k]ImAi{r^i(1)1n[k]m1j[k]ImAjr^j(1)}2\displaystyle-\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]m}\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\Big{\{}\hat{r}_{i}(1)-\frac{1}{n_{[k]m1}}\sum_{j\in[k]\cap I_{m}}A_{j}\hat{r}_{j}(1)\Big{\}}^{2}
=\displaystyle= m=1Mk=1Kpn[k](n[k]m1n[k]1n[k]mn[k])1n[k]m1\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}(\frac{n_{[k]m1}}{n_{[k]1}}-\frac{n_{[k]m*}}{n_{[k]}})\frac{1}{n_{[k]m1}}
i[k]ImAi[{r^i(1)1n[k]1j[k]Ajr^j(1)}2{r^i(1)1n[k]m1j[k]ImAjr^j(1)}2]\displaystyle\cdot\sum_{i\in[k]\cap I_{m}}A_{i}\bigg{[}\Big{\{}\hat{r}_{i}(1)-\frac{1}{n_{[k]1}}\sum_{j\in[k]}A_{j}\hat{r}_{j}(1)\Big{\}}^{2}-\Big{\{}\hat{r}_{i}(1)-\frac{1}{n_{[k]m1}}\sum_{j\in[k]\cap I_{m}}A_{j}\hat{r}_{j}(1)\Big{\}}^{2}\bigg{]}
=\displaystyle= m=1Mk=1Kpn[k](n[k]m1n[k]1n[k]mn[k]){1n[k]m1j[k]ImAjr^j(1)1n[k]1j[k]Ajr^j(1)}\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}(\frac{n_{[k]m1}}{n_{[k]1}}-\frac{n_{[k]m*}}{n_{[k]}})\Big{\{}\frac{1}{n_{[k]m1}}\sum_{j\in[k]\cap I_{m}}A_{j}\hat{r}_{j}(1)-\frac{1}{n_{[k]1}}\sum_{j\in[k]}A_{j}\hat{r}_{j}(1)\Big{\}}
1n[k]m1i[k]ImAi{2r^i(1)1n[k]1j[k]Ajr^j(1)1n[k]m1j[k]ImAjr^j(1)}\displaystyle\cdot\frac{1}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\Big{\{}2\hat{r}_{i}(1)-\frac{1}{n_{[k]1}}\sum_{j\in[k]}A_{j}\hat{r}_{j}(1)-\frac{1}{n_{[k]m1}}\sum_{j\in[k]\cap I_{m}}A_{j}\hat{r}_{j}(1)\Big{\}}
=\displaystyle= m=1Mk=1Kpn[k]n[k]m(πn[k]mπn[k])n[k]1[{2n[k]m1i[k]ImAir^i(1)}{1n[k]m1j[k]ImAjr^j(1)\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]}\frac{n_{[k]m*}(\pi_{n[k]m}-\pi_{n[k]})}{n_{[k]1}}\cdot\bigg{[}\Big{\{}\frac{2}{n_{[k]m1}}\sum_{i\in[k]\cap I_{m}}A_{i}\hat{r}_{i}(1)\Big{\}}\cdot\Big{\{}\frac{1}{n_{[k]m1}}\sum_{j\in[k]\cap I_{m}}A_{j}\hat{r}_{j}(1)
1n[k]1j[k]Ajr^j(1)}{1n[k]m1j[k]ImAjr^j(1)}2+{1n[k]1j[k]Ajr^j(1)}2]\displaystyle-\frac{1}{n_{[k]1}}\sum_{j\in[k]}A_{j}\hat{r}_{j}(1)\Big{\}}-\Big{\{}\frac{1}{n_{[k]m1}}\sum_{j\in[k]\cap I_{m}}A_{j}\hat{r}_{j}(1)\Big{\}}^{2}+\Big{\{}\frac{1}{n_{[k]1}}\sum_{j\in[k]}A_{j}\hat{r}_{j}(1)\Big{\}}^{2}\bigg{]}
=\displaystyle= oP(1).\displaystyle o_{P}(1).

By a similar argument, we have

k=1Kpn[k][1n[k]0i[k](1Ai){r^i(0)1n[k]0j[k](1Aj)r^j(0)}2]1Mm=1Mk=1Kpn[k]m\displaystyle\sum_{k=1}^{K}p_{n[k]}\bigg{[}\frac{1}{n_{[k]0}}\sum_{i\in[k]}(1-A_{i})\Big{\{}\hat{r}_{i}(0)-\frac{1}{n_{[k]0}}\sum_{j\in[k]}(1-A_{j})\hat{r}_{j}(0)\Big{\}}^{2}\bigg{]}-\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]m}
1n[k]m0i[k]Im(1Ai){r^i(0)1n[k]m0j[k]Im(1Aj)r^j(0)}2=oP(1).\displaystyle\quad\cdot\frac{1}{n_{[k]m0}}\sum_{i\in[k]\cap I_{m}}(1-A_{i})\Big{\{}\hat{r}_{i}(0)-\frac{1}{n_{[k]m0}}\sum_{j\in[k]\cap I_{m}}(1-A_{j})\hat{r}_{j}(0)\Big{\}}^{2}=o_{P}(1).

Using the techniques developed by Liu et al. (2023), Assumption 7 implies that

k=1Kpn[k][1n[k]1i[k]Ai{r^i(1)1n[k]1j[k]Ajr^j(1)}2]\displaystyle\sum_{k=1}^{K}p_{n[k]}\bigg{[}\frac{1}{n_{[k]1}}\sum_{i\in[k]}A_{i}\Big{\{}\hat{r}_{i}(1)-\frac{1}{n_{[k]1}}\sum_{j\in[k]}A_{j}\hat{r}_{j}(1)\Big{\}}^{2}\bigg{]}
+k=1Kpn[k][1n[k]0i[k](1Ai){r^i(0)1n[k]0j[k](1Aj)r^j(0)}2]\displaystyle+\sum_{k=1}^{K}p_{n[k]}\bigg{[}\frac{1}{n_{[k]0}}\sum_{i\in[k]}(1-A_{i})\Big{\{}\hat{r}_{i}(0)-\frac{1}{n_{[k]0}}\sum_{j\in[k]}(1-A_{j})\hat{r}_{j}(0)\Big{\}}^{2}\bigg{]}
=\displaystyle= ςr2(π)+op(1).\displaystyle\varsigma_{r}^{2}(\pi)+o_{p}(1).

Let

nm1=iImAi,nm0=iIm(1Ai),πn=n1n,πnm=nm1nm,n_{m1}=\sum_{i\in I_{m}}A_{i},\ n_{m0}=\sum_{i\in I_{m}}(1-A_{i}),\ \pi_{n}=\frac{n_{1}}{n},\ \pi_{nm}=\frac{n_{m1}}{n_{m}},
r^¯m1=1nm1iImAir^i(1),r^¯m0=1nm0iIm(1Ai)r^i(0),\bar{\hat{r}}_{m1}=\frac{1}{n_{m1}}\sum_{i\in I_{m}}A_{i}\hat{r}_{i}(1),\ \bar{\hat{r}}_{m0}=\frac{1}{n_{m0}}\sum_{i\in I_{m}}(1-A_{i})\hat{r}_{i}(0),
r^¯[k]m1=1n[k]m1j[k]ImAjr^j(1),r^¯[k]m0=1n[k]m0j[k]Im(1Aj)r^j(0).\bar{\hat{r}}_{[k]m1}=\frac{1}{n_{[k]m1}}\sum_{j\in[k]\cap I_{m}}A_{j}\hat{r}_{j}(1),\ \bar{\hat{r}}_{[k]m0}=\frac{1}{n_{[k]m0}}\sum_{j\in[k]\cap I_{m}}(1-A_{j})\hat{r}_{j}(0).

Since the folds are mutually exclusive and the fold partition process is independent of the covariates, strata, treatments, and outcomes, we can show that $\pi_{n}\stackrel{P}{\to}\pi$, $\pi_{nm}\stackrel{P}{\to}\pi$, $\bar{\hat{r}}_{[k]m1}\stackrel{P}{\to}E\{r_{i}(1)\mid B_{i}=k\}$, $\bar{\hat{r}}_{[k]m0}\stackrel{P}{\to}E\{r_{i}(0)\mid B_{i}=k\}$, $\bar{\hat{r}}_{m1}\stackrel{P}{\to}E\{r_{i}(1)\}$, and $\bar{\hat{r}}_{m0}\stackrel{P}{\to}E\{r_{i}(0)\}$. Thus,

1Mm=1Mk=1Kpn[k]m{(r^¯[k]m1r^¯m1)(r^¯[k]m0r^¯m0)}2=ςHr2+op(1).\displaystyle\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}p_{n[k]m}\big{\{}(\bar{\hat{r}}_{[k]m1}-\bar{\hat{r}}_{m1})-(\bar{\hat{r}}_{[k]m0}-\bar{\hat{r}}_{m0})\big{\}}^{2}=\varsigma^{2}_{Hr}+o_{p}(1).

From the above results, we conclude that $\hat{\sigma}_{ss}^{2}\stackrel{P}{\to}\varsigma^{2}_{r}(\pi)+\varsigma^{2}_{Hr}$, which completes the proof. ∎