
Asymptotic inference with flexible covariate adjustment under rerandomization and stratified rerandomization

Bingkai Wang^1 and Fan Li^{2,3}
^1 Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI, USA
^2 Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
^3 Center for Methods in Implementation and Prevention Science, Yale School of Public Health, New Haven, CT, USA
Abstract

Rerandomization is an effective treatment allocation procedure to control for baseline covariate imbalance. For estimating the average treatment effect, rerandomization has been previously shown to improve the precision of the unadjusted and the linearly-adjusted estimators over simple randomization without compromising consistency. However, it remains unclear whether such results apply more generally to the class of M-estimators, including the g-computation formula with generalized linear regression and doubly-robust methods, and more broadly, to efficient estimators with data-adaptive machine learners. In this paper, using a super-population framework, we develop the asymptotic theory for a more general class of covariate-adjusted estimators under rerandomization and its stratified extension. We prove that the asymptotic linearity and the influence function remain identical for any M-estimator under simple randomization and rerandomization, but rerandomization may lead to a non-Gaussian asymptotic distribution. We further explain, drawing examples from several common M-estimators, that asymptotic normality can be achieved if rerandomization variables are appropriately adjusted for in the final estimator. These results are extended to stratified rerandomization. Finally, we study the asymptotic theory for efficient estimators based on data-adaptive machine learners, and prove their efficiency optimality under rerandomization and stratified rerandomization. Our results are demonstrated via simulations and re-analyses of a cluster-randomized experiment that used stratified rerandomization.

Keywords: covariate adjustment, doubly-robust estimator, M-estimation, machine learning, influence function, randomized trials.

1 Introduction

In randomized experiments, rerandomization (Morgan and Rubin, 2012) refers to a restricted randomization procedure that discards treatment allocation schemes corresponding to large baseline imbalance. That is, treatment assignments are randomly re-generated until the balance statistics fall within a pre-specified threshold. In biomedical research, such a randomization procedure is also referred to as covariate-constrained randomization (Raab and Butcher, 2001; Moulton, 2004) and is often applied to cluster randomization. Compared with simple randomization, which assigns treatment by independent coin flips, rerandomization can provide effective design-based control of imbalance associated with a number of covariates, possibly of different types, and improve the study power. Compared to covariate-adaptive randomization, such as stratified randomization (Zelen, 1974) for balancing discrete baseline variables, rerandomization can easily handle continuous baseline variables and features continuously-valued design parameters to control the randomization space, thereby offering additional flexibility. Given these potential advantages, rerandomization is increasingly popular in economics (Bruhn and McKenzie, 2009) and biomedical research (Ivers et al., 2012; Turner et al., 2017a).

Under rerandomization, the statistical properties of the unadjusted and linearly-adjusted estimators for the average treatment effect have been extensively studied under a finite-population design-based framework. Li et al. (2018) first established the design-based asymptotic theory for the unadjusted estimator (i.e., the two-sample difference-in-means estimator) under rerandomization. Li and Ding (2020) generalized their results to covariate-adjusted estimators based on linear regression, and recommended the combination of rerandomization and regression adjustment to optimize statistical precision. Rerandomization has also been extended to accommodate tiers of covariates with varying importance (Morgan and Rubin, 2015), sequential rerandomization in batches (Zhou et al., 2018), high-dimensional settings with a diverging number of covariates (Wang and Li, 2022), split-plot designs (Shi et al., 2022), cluster rerandomization (Lu et al., 2023), stratified rerandomization (Wang et al., 2023c), as well as rerandomization based on p-values from regression-based balance assessment (Zhao and Ding, 2024). A notable feature of all the previous works is the focus on the unadjusted and linearly-adjusted estimators, a common choice when adopting the finite-population design-based framework for statistical inference.

However, the analysis of randomized experiments in practice can involve more general covariate adjustment strategies than linear regression (Benkeser et al., 2021), and the validity and efficiency of such general estimators remain unexplored in the existing literature on rerandomization. For example, with binary outcomes, the g-computation estimator based on logistic regression can target the causal risk ratio estimand and effectively leverage baseline covariates for precision gain (Colantuoni and Rosenblum, 2015). When the outcomes are missing at random, doubly-robust estimators (Robins et al., 2007) offer two opportunities for consistent estimation of the average treatment effect; i.e., consistency holds when either the missingness propensity score model or the outcome model is correctly specified, but not necessarily both. In cluster-randomized experiments, mixed-effects models are routinely used to simultaneously adjust for covariates and account for the intracluster correlation (Turner et al., 2017b; Wang et al., 2021). Finally, data-adaptive machine learners have shown promise to maximally leverage baseline information and achieve full efficiency in randomized experiments (Chernozhukov et al., 2018). Only for selected scenarios under cluster randomization have previous simulations empirically demonstrated that rerandomization can improve precision for mixed-effects model and generalized estimating equations estimators (Li et al., 2016, 2017). However, to the best of our knowledge, there has been no formal development of the asymptotic theory for this more general class of covariate-adjusted estimators under rerandomization; thus, few recommendations are currently available for rerandomized experiments when inference involves flexible covariate adjustment.

In this paper, we introduce formal asymptotic results for the class of M-estimators as well as efficient estimators (with nuisance models estimated via data-adaptive machine learners) under rerandomization and stratified rerandomization. First, we prove that, under standard regularity conditions for M-estimators, the asymptotic linearity and the influence function remain identical under simple randomization and rerandomization, but rerandomization may lead to a non-Gaussian asymptotic distribution. This result justifies the consistency of M-estimators under rerandomization, and obviates the need to re-derive the influence function under rerandomization. Next, we survey several common M-estimators in randomized experiments, and show that asymptotic normality can be achieved if the rerandomization variables are appropriately adjusted for in the final estimator. This result clarifies when rerandomization is ignorable during the analysis stage, in which case one can conveniently draw inference under rerandomization as if simple randomization were carried out. We then develop parallel results for stratified rerandomization, a restricted randomization procedure combining stratification and rerandomization to achieve stronger control of chance imbalance. Finally, we turn to efficient estimators that are motivated by the efficient influence functions coupled with data-adaptive machine learners. We prove that, as long as the rerandomization variables are adjusted for in the machine learners, the efficient estimators continue to be optimally efficient without the need to invoke any additional regularity assumptions. In deriving all of the above asymptotic results, the key challenge is to address the correlation among individual observations introduced by (stratified) rerandomization, for which we develop general technical arguments to accommodate flexible covariate adjustment.

Throughout, we adopt a super-population sampling-based framework, which is an effective framework to study more general covariate-adjusted estimators in randomized experiments (Wang et al., 2023b). We refer to Robins (2002) and Ding et al. (2017) for a comparison between the super-population sampling-based and the finite-population design-based frameworks. When the unadjusted and linear regression estimators are considered under rerandomization and stratified rerandomization, our results can be viewed as the sampling-based counterparts of earlier design-based results (Li et al., 2018; Li and Ding, 2020; Lu et al., 2023; Wang et al., 2023c). Importantly, our results go beyond linear regression and address a wider class of covariate-adjusted estimators under rerandomization and stratified rerandomization.

The remainder of the paper is organized as follows. In the next section, we introduce the super-population framework, randomization procedures, and M-estimation. Section 3 characterizes the asymptotic distributions for M-estimators under rerandomization, and Section 4 surveys familiar M-estimators and provides sufficient conditions to achieve asymptotic normality under rerandomization. Section 5 presents parallel asymptotic results for stratified rerandomization. Section 6 introduces the asymptotic theory for estimators based on the efficient influence function under rerandomization and stratified rerandomization. We report simulations in Section 7 and re-analyses of a completed cluster-randomized experiment in Section 8, for which the R code is available at https://github.com/BingkaiWang/Rerandomization. Section 9 concludes.

2 Notation, rerandomization, and M-estimation

2.1 Assumptions and estimands

We consider a randomized experiment with n individuals. Each individual i (i=1,\dots,n) has an outcome Y_{i}, a non-missing indicator of the outcome R_{i} (R_{i}=1 if Y_{i} is observed and 0 if missing), a treatment assignment indicator A_{i} (A_{i}=1 if assigned treatment and 0 otherwise), and a vector of baseline covariates \boldsymbol{X}_{i}. The observed data are (\boldsymbol{O}_{1},\dots,\boldsymbol{O}_{n}) with \boldsymbol{O}_{i}=(R_{i}Y_{i},R_{i},A_{i},\boldsymbol{X}_{i}). We adopt the potential outcomes framework and define Y_{i}(a) as the potential outcome if individual i were assigned to treatment group a, a\in\{0,1\}. Similarly, we denote R_{i}(a) as the potential non-missing indicator given treatment a. We assume causal consistency such that Y_{i}=A_{i}Y_{i}(1)+(1-A_{i})Y_{i}(0) and R_{i}=A_{i}R_{i}(1)+(1-A_{i})R_{i}(0). Denoting the complete data vector for each individual as \boldsymbol{W}_{i}=(Y_{i}(1),Y_{i}(0),R_{i}(1),R_{i}(0),\boldsymbol{X}_{i}), we make the following assumptions on (\boldsymbol{W}_{1},\dots,\boldsymbol{W}_{n}).

Assumption 1 (Super-population).

Each complete data vector \boldsymbol{W}_{i}, i=1,\dots,n, is an independent and identically distributed draw from an unknown distribution \mathcal{P}^{\boldsymbol{W}} on \boldsymbol{W}=(Y(1),Y(0),R(1),R(0),\boldsymbol{X}) with finite second moments.

Assumption 2 (Missing at random).

For each a=0,1, R(a) is independent of Y(a) given \boldsymbol{X}, and P(R(a)=1|\boldsymbol{X}) is uniformly bounded away from 0.

Assumption 1 is a standard condition for making inference under a sampling-based framework. It postulates a notional super-population of units, from which the observed sample is randomly drawn and for which the target estimands may be defined. We invoke Assumption 2 as a standard condition to accommodate missing outcomes in randomized experiments. This assumption is required for the doubly-robust estimators in Theorems 2, 4, 5, and 6, but is not required for other results without missing data.

Our goal is to estimate the average treatment effect, defined in a general form as

\Delta^{*}=f(E[Y(1)],E[Y(0)]),

where f(x,y) is a pre-specified function defining the scale of the effect measure. For example, f(x,y)=x-y corresponds to the treatment effect on the difference scale, whereas f(x,y)=x/y corresponds to the risk ratio scale, which is a standard choice when the outcome is binary. Finally, by Assumption 1, the expectation operator in the estimand definition (and throughout the paper) is defined with respect to the super-population distribution \mathcal{P}^{\boldsymbol{W}}.

2.2 Simple randomization and rerandomization

Simple randomization assigns treatment via independent coin flips, that is, A_{i} is independently determined by a Bernoulli distribution \mathcal{P}^{A} with P(A_{i}=1)=\pi\in(0,1). By design, each A_{i} is independent of (\boldsymbol{W}_{1},\dots,\boldsymbol{W}_{n}), and the observed data (\boldsymbol{O}_{1},\dots,\boldsymbol{O}_{n}) are independent and identically distributed given Assumption 1.

Rerandomization improves upon simple randomization by controlling imbalance on a subset of measured baseline covariates, which we denote as \boldsymbol{X}^{r} with \boldsymbol{X}^{r}\subset\boldsymbol{X} (Morgan and Rubin, 2012); we refer to \boldsymbol{X}^{r} as the rerandomization variables. Rerandomization involves the following three steps. First, we independently generate A_{1}^{*},\dots,A_{n}^{*} from a Bernoulli distribution with P(A_{i}^{*}=1)=\pi\in(0,1), as in simple randomization. In the second step, we compute the imbalance statistic on \boldsymbol{X}^{r}_{i} and its variance estimator as

\boldsymbol{I} = \frac{1}{N_{1}}\sum_{i=1}^{n}A^{*}_{i}\boldsymbol{X}^{r}_{i}-\frac{1}{N_{0}}\sum_{i=1}^{n}(1-A_{i}^{*})\boldsymbol{X}^{r}_{i},
\widehat{Var}(\boldsymbol{I}) = \frac{1}{N_{1}N_{0}}\sum_{i=1}^{n}(\boldsymbol{X}^{r}_{i}-\overline{\boldsymbol{X}^{r}})(\boldsymbol{X}^{r}_{i}-\overline{\boldsymbol{X}^{r}})^{\top},

where N_{1}=\sum_{i=1}^{n}A_{i}^{*}, N_{0}=n-N_{1}, and \overline{\boldsymbol{X}^{r}}=n^{-1}\sum_{i=1}^{n}\boldsymbol{X}^{r}_{i}. Lastly, given a pre-specified balance threshold t>0, we check whether \boldsymbol{I}^{\top}\{\widehat{Var}(\boldsymbol{I})\}^{-1}\boldsymbol{I}<t. If so, the final treatment assignment (A_{1},\dots,A_{n}) is set to (A_{1}^{*},\dots,A_{n}^{*}); otherwise, we repeat the above steps until we obtain the first randomization scheme that satisfies the balance criterion.
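To make the procedure concrete, the following is a minimal R sketch of the three steps above for a single accepted assignment. It is our own illustration, not the paper's accompanying code: the function name rerandomize and the inputs (an n-by-q numeric matrix Xr of rerandomization variables, the treatment probability pi, and the threshold t) are assumptions for this example. The default threshold follows the \alpha-quantile rule discussed below, with an acceptance probability of about 0.2.

```r
# Minimal sketch of rerandomization with the Mahalanobis balance criterion.
# Xr: n x q matrix of rerandomization variables; pi: P(A = 1); t: balance threshold.
rerandomize <- function(Xr, pi = 0.5, t = qchisq(0.2, df = ncol(Xr))) {
  n <- nrow(Xr)
  S <- crossprod(sweep(Xr, 2, colMeans(Xr)))   # sum_i (X_i - Xbar)(X_i - Xbar)^T
  repeat {
    A <- rbinom(n, 1, pi)                      # step 1: simple randomization
    N1 <- sum(A); N0 <- n - N1
    if (N1 == 0 || N0 == 0) next               # guard against degenerate draws
    I <- colMeans(Xr[A == 1, , drop = FALSE]) -
         colMeans(Xr[A == 0, , drop = FALSE])  # step 2: imbalance statistic
    V <- S / (N1 * N0)                         # variance estimator of I
    M <- drop(t(I) %*% solve(V) %*% I)         # Mahalanobis distance
    if (M < t) return(A)                       # step 3: accept if criterion is met
  }
}
```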

The above rerandomization procedure considers the Mahalanobis distance to define the balance criterion, and \boldsymbol{I}^{\top}\{\widehat{Var}(\boldsymbol{I})\}^{-1}\boldsymbol{I} follows a chi-squared distribution asymptotically. The balance threshold t can then be selected to adjust the rejection rate. For example, letting q denote the dimension of \boldsymbol{X}^{r}, taking t to be the \alpha-quantile of \mathcal{X}^{2}_{q} leads to an approximate rejection rate of 1-\alpha. In general, a smaller t corresponds to stronger control over imbalance; if t\rightarrow+\infty, then rerandomization reduces to simple randomization. From a statistical perspective, rerandomization introduces correlation among (A_{1},\dots,A_{n}) and hence correlation among the observed data (\boldsymbol{O}_{1},\dots,\boldsymbol{O}_{n}), leading to a key complication in studying large-sample results of treatment effect estimators.

Beyond the Mahalanobis distance, rerandomization can also be based more generally on \boldsymbol{I}^{\top}\widehat{\mathbf{H}}^{-1}\boldsymbol{I}<t, where \widehat{\mathbf{H}} is a positive definite matrix (e.g., a diagonal matrix collecting the sample variance of each component of \boldsymbol{I}). For our main theorems, we focus on rerandomization using the Mahalanobis distance, which leads to the most interpretable results. Nevertheless, we show how our results can be extended to accommodate such a more general distance function in Remarks 1 and 4.

2.3 M-estimation

M-estimators (van der Vaart, 1998, Section 5) are a wide class of estimators defined as solutions to estimating equations. Specifically, let \boldsymbol{\theta}=(\Delta,\boldsymbol{\beta})\in\mathbb{R}^{l+1} be an (l+1)-dimensional vector of parameters, where \Delta\in\mathbb{R} is the parameter of interest and \boldsymbol{\beta}\in\mathbb{R}^{l} is an l-dimensional vector of nuisance parameters. An M-estimator \widehat{\boldsymbol{\theta}}=(\widehat{\Delta},\widehat{\boldsymbol{\beta}}) for \boldsymbol{\theta} is the solution to

\sum_{i=1}^{n}\boldsymbol{\psi}(\boldsymbol{O}_{i};\boldsymbol{\theta})=\boldsymbol{0},

where \boldsymbol{\psi} is a pre-specified (l+1)-dimensional estimating function. For example, \boldsymbol{\psi} is the score function if \widehat{\boldsymbol{\theta}} is obtained by maximum likelihood estimation. The M-estimator for \Delta^{*} is \widehat{\Delta}. For a=0,1, we denote \boldsymbol{O}(a)=(R(a)Y(a),R(a),a,\boldsymbol{X}) and write ||\cdot||_{2} for the L_{2} matrix norm. Throughout, we assume the following regularity conditions for M-estimation.

Assumption 3 (Regularity conditions).

(1) \boldsymbol{\theta}\in\boldsymbol{\Theta}, a compact subset of \mathbb{R}^{l+1}.
(2) E[||\boldsymbol{\psi}(\boldsymbol{O}(a);\boldsymbol{\theta})||_{2}^{2}]<\infty for all \boldsymbol{\theta}\in\boldsymbol{\Theta} and a\in\{0,1\}.
(3) There exists a unique solution \underline{\boldsymbol{\theta}}=(\underline{\Delta},\underline{\boldsymbol{\beta}}), an inner point of \boldsymbol{\Theta}, to the equations \pi E[\boldsymbol{\psi}(\boldsymbol{O}(1);\boldsymbol{\theta})]+(1-\pi)E[\boldsymbol{\psi}(\boldsymbol{O}(0);\boldsymbol{\theta})]=\boldsymbol{0}.
(4) The function \boldsymbol{\theta}\mapsto\boldsymbol{\psi}(\boldsymbol{o};\boldsymbol{\theta}) is twice continuously differentiable for every \boldsymbol{o} in the support of \boldsymbol{O} and is dominated by an integrable function.
(5) There exists an integrable function v(\boldsymbol{o}) that dominates the first and second derivatives of \boldsymbol{\psi}(\boldsymbol{o};\boldsymbol{\theta}) for ||\boldsymbol{\theta}-\underline{\boldsymbol{\theta}}||_{2}<C with some C>0.
(6) E[||\frac{\partial}{\partial\boldsymbol{\theta}}\boldsymbol{\psi}(\boldsymbol{O}(a);\boldsymbol{\theta})\big|_{\boldsymbol{\theta}=\underline{\boldsymbol{\theta}}}||_{2}^{2}]<\infty for a\in\{0,1\}, and \pi E[\frac{\partial}{\partial\boldsymbol{\theta}}\boldsymbol{\psi}(\boldsymbol{O}(1);\boldsymbol{\theta})\big|_{\boldsymbol{\theta}=\underline{\boldsymbol{\theta}}}]+(1-\pi)E[\frac{\partial}{\partial\boldsymbol{\theta}}\boldsymbol{\psi}(\boldsymbol{O}(0);\boldsymbol{\theta})\big|_{\boldsymbol{\theta}=\underline{\boldsymbol{\theta}}}] is invertible.

The conditions in Assumption 3 are largely moment and continuity assumptions that rule out irregular or degenerate M-estimators, and are similar to those invoked in Section 5.3 of van der Vaart (1998). It is important to note that when the M-estimator is defined based on a working regression model, Assumption 3 does not necessarily imply that the working model is correctly specified. For example, if the analysis of covariance (ANCOVA, Example 1 in Section 4) is used, Assumption 3(2)-(6) simply reduce to assuming that \boldsymbol{W} has finite fourth moments and invertible covariance matrices, rather than assuming that the ANCOVA model is the true data-generating process.

Under simple randomization, it has been established that Assumptions 1 and 3 yield \widehat{\Delta}\xrightarrow{p}\underline{\Delta}, i.e., \widehat{\Delta} converges in probability to a limit quantity \underline{\Delta}, and \sqrt{n}(\widehat{\Delta}-\underline{\Delta})\xrightarrow{d}N(0,V), i.e., \sqrt{n}-rate asymptotic normality of \widehat{\Delta} (van der Vaart, 1998). To achieve consistent estimation, i.e., \widehat{\Delta}\xrightarrow{p}\Delta^{*}, one needs to choose an appropriate working model (e.g., ANCOVA in randomized experiments) or make additional assumptions (e.g., Assumption 2 in the presence of missing outcomes) such that \Delta^{*}=\underline{\Delta}; we refer to Section 4 for detailed examples. In the next section, we formally describe the asymptotic behavior of \widehat{\Delta} under rerandomization.

3 Asymptotics for M-estimators under rerandomization

We state our first asymptotic result below.

Theorem 1.

Given Assumptions 1 and 3, and under either simple randomization or rerandomization, we have \widehat{\Delta}\xrightarrow{p}\underline{\Delta} and asymptotic linearity, i.e.,

\sqrt{n}(\widehat{\Delta}-\underline{\Delta})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}IF(\boldsymbol{O}_{i})+o_{p}(1), \qquad (1)

where the influence function IF(\boldsymbol{O}_{i}) is the first entry of -\mathbf{B}^{-1}\boldsymbol{\psi}(\boldsymbol{O}_{i};\underline{\boldsymbol{\theta}}) with
\mathbf{B}=\pi E\left[\frac{\partial}{\partial\boldsymbol{\theta}}\boldsymbol{\psi}(\boldsymbol{O}(1);\boldsymbol{\theta})\big|_{\boldsymbol{\theta}=\underline{\boldsymbol{\theta}}}\right]+(1-\pi)E\left[\frac{\partial}{\partial\boldsymbol{\theta}}\boldsymbol{\psi}(\boldsymbol{O}(0);\boldsymbol{\theta})\big|_{\boldsymbol{\theta}=\underline{\boldsymbol{\theta}}}\right].

Furthermore, under rerandomization with Mahalanobis distance as the balance criterion,

\sqrt{n}(\widehat{\Delta}-\underline{\Delta})\xrightarrow{d}\sqrt{V}\left(\sqrt{1-R^{2}}\ z+\sqrt{R^{2}}\ r_{q,t}\right), \qquad (2)

where z\sim N(0,1), r_{q,t}\sim D_{1}|(\boldsymbol{D}^{\top}\boldsymbol{D}<t) for \boldsymbol{D}\sim N(\boldsymbol{0},\mathbf{I}_{q}) with D_{1} being the first element of \boldsymbol{D} and r_{q,t} being independent of z, V is the asymptotic variance of \widehat{\Delta} under simple randomization, and

R^{2}=\frac{\pi(1-\pi)}{V}Cov[IF\{\boldsymbol{O}(1)\}-IF\{\boldsymbol{O}(0)\},\boldsymbol{X}^{r}]\,Var(\boldsymbol{X}^{r})^{-1}\,Cov[\boldsymbol{X}^{r},IF\{\boldsymbol{O}(1)\}-IF\{\boldsymbol{O}(0)\}].

Consistent estimators for V and R^{2} are provided in the Supplementary Material.

A central finding in Theorem 1 is that the convergence in probability and asymptotic linearity properties remain identical under simple randomization and rerandomization. As a result, if an M-estimator is consistent for the average treatment effect (\underline{\Delta}=\Delta^{*}) under simple randomization, then it remains consistent under rerandomization. Furthermore, the influence function of an M-estimator remains unchanged under rerandomization.

In addition, Equation (2) provides the asymptotic distribution of an M-estimator under rerandomization. This result is similar to Theorem 1 of Li et al. (2018) and Theorem 1 of Li and Ding (2020) (derived under a finite-population design-based framework) if \Delta^{*}=E[Y(1)-Y(0)] and \widehat{\Delta} is the unadjusted difference-in-means estimator or the linearly-adjusted estimator. Under a super-population sampling-based framework, our new contribution in Theorem 1 is to formally provide the asymptotic theory for a wider class of estimators as well as a variety of estimands under rerandomization.

Different from simple randomization, which leads to asymptotic normality of an M-estimator, rerandomization may correspond to a non-normal asymptotic distribution. Specifically, the M-estimator converges weakly to a distribution given by the sum of two independent components: a normal variate \sqrt{V}\sqrt{1-R^{2}}\,z and an independent component \sqrt{V}\sqrt{R^{2}}\,r_{q,t}, where r_{q,t} is a standard normal variate truncated by \boldsymbol{D}^{\top}\boldsymbol{D}<t. Here, R^{2}\in[0,1] quantifies the variance of the linear projection of \widehat{\Delta} on \boldsymbol{I}, scaled by V, representing the fraction of variance in the M-estimator that can be explained by \boldsymbol{X}^{r}. Our R^{2} shares the same interpretation as the one defined in Li et al. (2018), except that we use the influence function to handle M-estimators. According to Morgan and Rubin (2012) and Li et al. (2018), the density of \sqrt{V}\left(\sqrt{1-R^{2}}\,z+\sqrt{R^{2}}\,r_{q,t}\right) is symmetric about zero and bell-shaped, resembling a normal distribution. However, its variance is V\{1-(1-v_{q,t})R^{2}\}, where v_{q,t}=P(\mathcal{X}^{2}_{q+2}<t)/P(\mathcal{X}^{2}_{q}<t)<1, which implies that V\{1-(1-v_{q,t})R^{2}\}\leq V. As a result, rerandomization does not increase the asymptotic variance of an M-estimator compared to simple randomization. In fact, under a design-based framework, Theorem 2 of Li et al. (2018) provided insights into this variance reduction (as well as the corresponding change in confidence interval lengths): the reduction in asymptotic variance is non-decreasing in R^{2} and non-increasing in both t and q. As a result of Theorem 1, this observation continues to hold for M-estimators.

For statistical inference, we propose consistent estimators \widehat{V} and \widehat{R}^{2} in Section A of the Supplementary Material. The 95% confidence interval for \Delta^{*} can then be approximated by the following Monte Carlo approach. We first generate a large number of draws (m=10{,}000) of z and r_{q,t}, e.g., via rejection sampling. We then compute \widehat{\Delta}+\sqrt{\widehat{V}/n}\left(\sqrt{1-\widehat{R}^{2}}\,z+\sqrt{\widehat{R}^{2}}\,r_{q,t}\right) for each draw, and take the 2.5\% and 97.5\% quantiles of the resulting empirical distribution to obtain the final confidence interval estimator \widehat{CI}. As m,n\rightarrow\infty, we have asymptotically nominal coverage in the sense that P(\Delta^{*}\in\widehat{CI})\rightarrow 0.95.
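A minimal R sketch of this Monte Carlo interval is given below. The function name rerand_ci and the inputs Delta_hat, V_hat, and R2_hat are our own illustrative labels: the latter two stand for the consistent estimators \widehat{V} and \widehat{R}^{2}, whose construction is in the Supplementary Material and is not reproduced here.

```r
# Monte Carlo 95% confidence interval under rerandomization (a sketch).
# n: sample size; q: dimension of X^r; t: balance threshold; m: number of draws.
rerand_ci <- function(Delta_hat, V_hat, R2_hat, n, q, t, m = 10000) {
  z <- rnorm(m)
  # Rejection sampling of r_{q,t}: first element of D ~ N(0, I_q) given D'D < t.
  r <- numeric(0)
  while (length(r) < m) {
    D <- matrix(rnorm(m * q), nrow = m, ncol = q)
    keep <- rowSums(D^2) < t
    r <- c(r, D[keep, 1])
  }
  r <- r[1:m]
  draws <- Delta_hat + sqrt(V_hat / n) * (sqrt(1 - R2_hat) * z + sqrt(R2_hat) * r)
  quantile(draws, c(0.025, 0.975))   # 2.5% and 97.5% quantiles of the draws
}
```

The acceptance rate of the rejection sampler depends on t; for small thresholds, more batches of draws are generated until m accepted values of r_{q,t} are available.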

Remark 1.

When rerandomization is based on a more general balance criterion \boldsymbol{I}^{\top}\widehat{\mathbf{H}}^{-1}\boldsymbol{I}<t, Theorem 1 remains the same, except that the asymptotic distribution in (2) is now \sqrt{V}\sqrt{1-R^{2}}\,z+Cov[IF\{\boldsymbol{O}(1)\}-IF\{\boldsymbol{O}(0)\},\boldsymbol{X}^{r}]\mathbf{V}_{I}^{-1/2}\boldsymbol{R}_{\mathbf{H},t}, where \boldsymbol{R}_{\mathbf{H},t}\sim\boldsymbol{D}|(\boldsymbol{D}^{\top}\mathbf{V}_{I}^{1/2}\underline{\mathbf{H}}^{-1}\mathbf{V}_{I}^{1/2}\boldsymbol{D}<t) and \mathbf{V}_{I}=\{\pi(1-\pi)\}^{-1}Var(\boldsymbol{X}^{r}), with \underline{\mathbf{H}} being the probability limit of n\widehat{\mathbf{H}}. Compared to Equation (2), the first component of the asymptotic distribution remains the same, but the second component becomes more complex and depends on the choice of \widehat{\mathbf{H}}. Nevertheless, as long as \underline{\mathbf{H}} is positive definite, rerandomization still leads to no increase in the asymptotic variance. In addition, the same sampling-based confidence interval can still be constructed with a Monte Carlo approach via rejection sampling of \boldsymbol{R}_{\mathbf{H},t}. Similar results were first pointed out by Lu et al. (2023) under a design-based framework in the context of cluster rerandomization (for unadjusted and linearly-adjusted estimators); we extend their results to a wider class of M-estimators (proof in the Supplementary Material).

Remark 2.

Theorem 1 can be further extended to handle rerandomization with tiers of covariates (Morgan and Rubin, 2015). To elaborate, for b=1,\dots,B, let \boldsymbol{I}_{b}=\mathbf{G}_{b}\boldsymbol{I} be any subvector of \boldsymbol{I}, with \mathbf{G}_{b} being the corresponding transformation matrix. In addition, \widehat{\mathbf{H}}_{b} and t_{b}>0 are the weighting matrix and threshold defining the balance criterion for the b-th tier of covariates. The second step of rerandomization is changed to accepting a randomization if \boldsymbol{I}_{b}^{\top}\widehat{\mathbf{H}}_{b}^{-1}\boldsymbol{I}_{b}<t_{b} for all b=1,\dots,B. Under this design, Theorem 1 still holds with the asymptotic distribution described in Remark 1, except that here \boldsymbol{R}_{\mathbf{H},t}\sim\boldsymbol{D}|(\boldsymbol{D}^{\top}\mathbf{V}_{I}^{1/2}\mathbf{G}_{b}^{\top}\underline{\mathbf{H}}_{b}^{-1}\mathbf{G}_{b}\mathbf{V}_{I}^{1/2}\boldsymbol{D}<t_{b}:b=1,\dots,B), with \underline{\mathbf{H}}_{b} defined as the probability limit of n\widehat{\mathbf{H}}_{b}.

4 Examples

In this section, we survey common M-estimators used for analyzing randomized experiments and provide important special cases where R^{2}=0, meaning that the M-estimator has fully accounted for rerandomization (in other words, its variance, V, cannot be further explained by \boldsymbol{X}^{r}). Therefore, in these cases, each M-estimator remains asymptotically normal under rerandomization. These examples serve as important clarifications on when standard inference may be carried out ignoring rerandomization. Among the four examples, the estimators in Examples 1, 2, and 4 are model-robust and hence consistent under arbitrary working model misspecification under simple randomization; see the discussions in Tsiatis et al. (2008), Colantuoni and Rosenblum (2015), and Wang et al. (2021); this model-robustness property carries over to rerandomization, as implied by Theorem 1. Example 3 focuses on experiments with missing outcomes, and we additionally require Assumption 2 to establish the double-robustness property under simple randomization (Robins et al., 2007); this property also continues to hold under rerandomization as a result of Theorem 1.

Example 1 (ANCOVA estimator).

In randomized experiments, the ANCOVA estimator considers the working model E[Y|A,\boldsymbol{X}]=\beta_{0}+\beta_{A}A+\boldsymbol{\beta}_{X}^{\top}\boldsymbol{X} and uses ordinary least squares to obtain estimators (\widehat{\beta}_{0},\widehat{\beta}_{A},\widehat{\boldsymbol{\beta}}_{X}) for the parameters (\beta_{0},\beta_{A},\boldsymbol{\beta}_{X}). In this context, \widehat{\Delta}=\widehat{\beta}_{A} is the ANCOVA estimator for the average treatment effect estimand \Delta^{*}=E[Y(1)-Y(0)].
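As a concrete illustration (a minimal sketch, assuming a data frame dat with columns Y, A, and covariates X1 and X2; the column names are ours, not from the paper's code), the ANCOVA estimator can be computed as:

```r
# ANCOVA estimator: coefficient of A from an OLS fit of Y on A and covariates.
fit <- lm(Y ~ A + X1 + X2, data = dat)
Delta_hat <- coef(fit)["A"]
```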

Example 2 (G-computation estimator with working logistic regression).

When the outcome is binary, the logistic regression model \text{logit}\{P(Y=1|A,\boldsymbol{X})\}=\beta_{0}+\beta_{A}A+\boldsymbol{\beta}_{X}^{\top}\boldsymbol{X} is commonly used to analyze randomized experiments. With maximum likelihood estimators (\widehat{\beta}_{0},\widehat{\beta}_{A},\widehat{\boldsymbol{\beta}}_{X}), we construct \widehat{p}_{i}(a)=\text{logit}^{-1}(\widehat{\beta}_{0}+\widehat{\beta}_{A}a+\widehat{\boldsymbol{\beta}}_{X}^{\top}\boldsymbol{X}_{i}) for a=0,1. The g-computation estimator for the average treatment effect on any scale \Delta^{*}=f(E[Y(1)],E[Y(0)]) is \widehat{\Delta}=f\left(n^{-1}\sum_{i=1}^{n}\widehat{p}_{i}(1),\,n^{-1}\sum_{i=1}^{n}\widehat{p}_{i}(0)\right).
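A minimal R sketch of this g-computation estimator on the risk ratio scale, f(x,y)=x/y, is shown below (again assuming a hypothetical data frame dat with binary Y, treatment A, and covariates X1 and X2):

```r
# G-computation with a working logistic regression model.
fit <- glm(Y ~ A + X1 + X2, family = binomial, data = dat)
p1 <- predict(fit, newdata = transform(dat, A = 1), type = "response")  # predictions under A = 1
p0 <- predict(fit, newdata = transform(dat, A = 0), type = "response")  # predictions under A = 0
Delta_hat <- mean(p1) / mean(p0)   # risk ratio; use mean(p1) - mean(p0) for the difference scale
```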

Example 3 (DR-WLS estimator for handling missing outcomes).

In the presence of missing outcomes, the doubly-robust weighted least squares (DR-WLS) estimator involves the following steps. First, we fit a logistic regression working model for missingness, \text{logit}\{P(R=1|A,\boldsymbol{X})\}=\alpha_{0}+\alpha_{A}A+\boldsymbol{\alpha}_{X}^{\top}\boldsymbol{X}, and obtain the estimated missingness propensity score \widehat{P}(R=1|A,\boldsymbol{X})=\text{logit}^{-1}(\widehat{\alpha}_{0}+\widehat{\alpha}_{A}A+\widehat{\boldsymbol{\alpha}}_{X}^{\top}\boldsymbol{X}). Second, we fit an outcome regression working model, E[Y|A,\boldsymbol{X}]=g^{-1}(\beta_{0}+\beta_{A}A+\boldsymbol{\beta}_{X}^{\top}\boldsymbol{X}), with canonical link function g, using the data with R_{i}=1 and weighting each observation by the inverse estimated propensity score \{\widehat{P}(R=1|A_{i},\boldsymbol{X}_{i})\}^{-1}. We denote the fitted model as \widehat{E}[Y|A,\boldsymbol{X}]=g^{-1}(\widehat{\beta}_{0}+\widehat{\beta}_{A}A+\widehat{\boldsymbol{\beta}}_{X}^{\top}\boldsymbol{X}). Finally, we apply a g-computation estimator \widehat{\Delta}=f\left(n^{-1}\sum_{i=1}^{n}\widehat{E}[Y|1,\boldsymbol{X}_{i}],\,n^{-1}\sum_{i=1}^{n}\widehat{E}[Y|0,\boldsymbol{X}_{i}]\right) to estimate the average treatment effect estimand \Delta^{*}=f(E[Y(1)],E[Y(0)]). We assume missingness at random (Assumption 2) and that at least one of the two working models is correctly specified.
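The following R sketch traces these three steps for a continuous outcome on the difference scale (a minimal illustration with hypothetical column names Y, R, A, X1, X2, where Y may be NA when R is 0):

```r
# DR-WLS estimator (sketch): missingness model, weighted outcome model, g-computation.
ps_fit <- glm(R ~ A + X1 + X2, family = binomial, data = dat)    # missingness propensity model
w      <- 1 / fitted(ps_fit)                                      # inverse propensity weights
out_fit <- glm(Y ~ A + X1 + X2, family = gaussian,                # use binomial for binary Y
               data = dat, weights = w, subset = (R == 1))        # weighted outcome model on R = 1
m1 <- predict(out_fit, newdata = transform(dat, A = 1), type = "response")
m0 <- predict(out_fit, newdata = transform(dat, A = 0), type = "response")
Delta_hat <- mean(m1) - mean(m0)                                  # g-computation, difference scale
```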

Example 4 (Mixed-ANCOVA estimator under cluster randomization).

When rerandomization is at the cluster level, the outcomes for each cluster form an N_{i}-dimensional vector \boldsymbol{Y}_{i}=(Y_{i1},\dots,Y_{iN_{i}}), where N_{i} is the size of cluster i. In addition, the covariates \boldsymbol{X}_{i} are a collection of individual covariates \{\boldsymbol{X}_{i1},\dots,\boldsymbol{X}_{iN_{i}}\}. For simplicity, we assume non-informative cluster sizes, i.e., N_{i} is independent of all (Y_{ij},A_{i},\boldsymbol{X}_{ij}), and all (Y_{ij},A_{i},\boldsymbol{X}_{ij}) are identically distributed but can have arbitrary within-cluster correlation. Rerandomization is then based on \boldsymbol{X}_{i}^{r}, a summary function of (\boldsymbol{X}_{i1}^{r},\dots,\boldsymbol{X}_{iN_{i}}^{r}), e.g., N_{i}^{-1}\sum_{j=1}^{N_{i}}\boldsymbol{X}_{ij}^{r}. The mixed-ANCOVA estimator involves fitting the linear mixed model Y_{ij}=\beta_{0}+\beta_{A}A_{i}+\boldsymbol{\beta}_{X}^{\top}\boldsymbol{X}_{ij}+\delta_{i}+\varepsilon_{ij}, where \delta_{i}\sim N(0,\tau^{2}) is the random intercept and \varepsilon_{ij}\sim N(0,\sigma^{2}) is the independent error term. Under maximum likelihood estimation, we can construct a model-robust estimator \widehat{\Delta}=\widehat{\beta}_{A} for the average treatment effect estimand \Delta^{*}=E[Y_{ij}(1)-Y_{ij}(0)].
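A minimal sketch of this estimator using the lme4 package is shown below (assuming a long-format data frame dat with one row per individual, a cluster identifier column cluster, and illustrative column names of our choosing):

```r
# Mixed-ANCOVA estimator: linear mixed model with a random intercept per cluster.
library(lme4)
fit <- lmer(Y ~ A + X1 + X2 + (1 | cluster), data = dat, REML = FALSE)  # maximum likelihood fit
Delta_hat <- fixef(fit)["A"]
```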

Each estimator described in Examples 1-4 can be cast as an M-estimator (with the corresponding estimating function provided in the Supplementary Material), and its consistency and asymptotic linearity properties carry over directly from simple randomization to rerandomization. Furthermore, Theorem 2 clarifies, for each estimator, when asymptotic normality holds under rerandomization and thus when inference can proceed by ignoring rerandomization.

Theorem 2.

Assume Assumptions 1-3 and that \boldsymbol{X} includes \boldsymbol{X}^{r} as a subset for covariate adjustment. Then, for the ANCOVA, logistic regression, DR-WLS, and mixed-ANCOVA estimators described in Examples 1-4, we have \sqrt{n}(\widehat{\Delta}-\Delta^{*})\xrightarrow{d}N(0,V) under rerandomization if at least one of the following conditions holds: (1) \pi=0.5 and \Delta^{*}=E[Y(1)-Y(0)] is the estimand on the difference scale, or (2) the outcome regression model further includes a treatment-by-covariate interaction term \boldsymbol{\beta}_{AX^{r}}A\boldsymbol{X}^{r} for \boldsymbol{X}^{r}, or (3) the outcome regression model is correctly specified. In other words, R^{2}=0 under any of conditions (1), (2), or (3).

Remark 3.

While the ANCOVA and mixed-ANCOVA estimators target \Delta^{*}=E[Y(1)-Y(0)] through the treatment coefficient, the working models can be adapted to estimate treatment effect estimands on other scales through a g-computation procedure (as described in Examples 2 and 3). When treatment-by-covariate interaction terms are included, they become the ANCOVA2 (Tsiatis et al., 2008) and mixed-ANCOVA2 (Wang et al., 2021) estimators, for which g-computation is needed for consistent estimation under both simple randomization and rerandomization. For the DR-WLS estimator, Theorem 2 does not require adjusting for \boldsymbol{X}^{r} in the missingness model to obtain asymptotic normality, although doing so is still recommended since including \boldsymbol{X}^{r} may further improve efficiency.

Theorem 2 highlights the benefit of adjusting for the baseline variables that are used for rerandomization. In this case, inference with many estimators under rerandomization can be performed as if simple randomization had been used in the design stage; thus rerandomization becomes ignorable in the analysis stage. This result has two implications. First, the choice of the threshold t or the weighting matrix \widehat{\mathbf{H}} has no asymptotic impact for the estimators considered in Examples 1-4. Therefore, the question of how one should select these design parameters to optimize asymptotic precision is immaterial in large samples, as the design parameters do not affect the asymptotic distribution. Second, many existing results about variance under simple randomization naturally extend to rerandomization. For example, for estimating the average treatment effect on the difference scale, the ANCOVA estimator under condition (1) or (2) of Theorem 2 does not reduce asymptotic precision by adjusting for baseline covariates under simple randomization (Tsiatis et al., 2008); the same result holds under rerandomization as long as the adjustment set includes \boldsymbol{X}^{r}. As another example, when condition (1) holds, not only do the ANCOVA and mixed-ANCOVA estimators provide consistent point estimates, but their model-based variance estimators are also consistent (Wang et al., 2019, 2021); these properties carry over to rerandomization by Theorem 2.

In contrast, if \boldsymbol{X}^{r} is not included in the adjustment set for the above example estimators, Theorem 2 may not hold, and the asymptotic distribution of the estimator is non-normal according to Theorem 1. Intuitively, this is the scenario where \boldsymbol{X}^{r} can further explain the variance of \widehat{\Delta}, so that R^{2}>0. If inference is then performed based on asymptotic results derived under simple randomization (e.g., using a normal approximation), the resulting confidence interval may suffer from power loss, although the inference remains valid. Given the above discussion, we recommend adjusting for the rerandomization variables when using the M-estimators described in Examples 1-4, so that the conventional inferential procedure based on asymptotic normality is justified by Theorem 2.

5 Extensions to stratified rerandomization

5.1 Defining stratified rerandomization

In practice, we may want to control the treatment imbalance on \boldsymbol{X}^{r} at level t and, at the same time, eliminate the treatment imbalance on certain categorical variables. We refer to this design as stratified rerandomization (Wang et al., 2023c). Compared to rerandomization as defined in Section 2.2, this design further ensures a perfectly balanced assignment within each stratum. To formally define this scheme, we first introduce stratified randomization without rerandomization. Stratified (block) randomization refers to a randomization scheme that achieves exact balance within each pre-specified stratum. Let S\in\boldsymbol{X} be a categorical baseline variable encoding the randomization strata, e.g., S taking values in \mathcal{S}=\{\textup{male smoker},\ \textup{female smoker},\ \textup{male non-smoker},\ \textup{female non-smoker}\} if the randomization strata are defined by sex and smoking status. Within each randomization stratum, a size-k permutation block with \pi k ones and (1-\pi)k zeros in random order is independently sampled to assign treatment to the first k participants (1 for treatment and 0 for control). When a permutation block is exhausted, a new one is sampled to assign treatment to the next k participants. The block size k is chosen such that \pi k is an integer. Unlike simple randomization, where treatment assignments are independent, stratified randomization directly introduces correlation among the treatment assignments and hence among the observed data (\boldsymbol{O}_{1},\dots,\boldsymbol{O}_{n}).

Stratified rerandomization combines stratified randomization and rerandomization through the following steps. First, we generate \widetilde{A}_{1},\dots,\widetilde{A}_{n} by stratified randomization based on the randomization strata \{S_{1},\dots,S_{n}\} and parameter \pi\in(0,1). Second, we compute

\widetilde{\boldsymbol{I}} = \frac{1}{\widetilde{N}_{1}}\sum_{i=1}^{n}\widetilde{A}_{i}\boldsymbol{X}_{i}^{r}-\frac{1}{\widetilde{N}_{0}}\sum_{i=1}^{n}(1-\widetilde{A}_{i})\boldsymbol{X}_{i}^{r},
\widehat{Var}(\widetilde{\boldsymbol{I}}) = \frac{n}{\widetilde{N}_{1}\widetilde{N}_{0}}\left[\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{X}^{r}_{i}{\boldsymbol{X}^{r}_{i}}^{\top}-\sum_{s\in\mathcal{S}}\frac{1}{n}\sum_{i=1}^{n}I\{S_{i}=s\}\overline{\boldsymbol{X}_{s}^{r}}\,\overline{\boldsymbol{X}_{s}^{r}}^{\top}\right],

where \widetilde{N}_{1}=\sum_{i=1}^{n}\widetilde{A}_{i}, \widetilde{N}_{0}=n-\widetilde{N}_{1}, and \overline{\boldsymbol{X}_{s}^{r}}=(\sum_{i=1}^{n}I\{S_{i}=s\})^{-1}\sum_{i=1}^{n}\boldsymbol{X}_{i}^{r}I\{S_{i}=s\}. In the last step, if \widetilde{\boldsymbol{I}}^{\top}\{\widehat{Var}(\widetilde{\boldsymbol{I}})\}^{-1}\widetilde{\boldsymbol{I}}<t for a pre-specified balance threshold t, we set the final treatment assignment (A_{1},\dots,A_{n}) to (\widetilde{A}_{1},\dots,\widetilde{A}_{n}); otherwise, we return to the first step and repeat until we obtain the first randomization scheme that satisfies the balance criterion.
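For concreteness, the R sketch below implements these steps; it is our own illustration (the function strat_rerandomize and its arguments are not from the paper's code) and assumes S is a factor of stratum labels with every stratum non-empty, Xr is an n-by-q matrix, \pi k is an integer, and pi = 0.5 within every stratum.

```r
# Sketch of stratified rerandomization: permuted-block assignment within strata,
# followed by a balance check on X^r using the stratified variance estimator.
strat_rerandomize <- function(S, Xr, pi = 0.5, k = 4, t = qchisq(0.2, df = ncol(Xr))) {
  n <- length(S)
  block <- rep(c(1, 0), times = c(round(pi * k), k - round(pi * k)))  # one permutation block
  repeat {
    A <- integer(n)
    for (s in levels(S)) {                              # stratified (block) randomization
      idx <- which(S == s)
      n_blocks <- ceiling(length(idx) / k)
      assign_s <- unlist(lapply(seq_len(n_blocks), function(b) sample(block)))
      A[idx] <- assign_s[seq_along(idx)]
    }
    N1 <- sum(A); N0 <- n - N1
    I <- colMeans(Xr[A == 1, , drop = FALSE]) - colMeans(Xr[A == 0, , drop = FALSE])
    # Stratified variance estimator of the imbalance statistic
    Xs_bar <- rowsum(Xr, S) / as.vector(table(S))       # stratum means of X^r (level order)
    V <- (n / (N1 * N0)) * (crossprod(Xr) / n -
           crossprod(Xs_bar * sqrt(as.vector(table(S)) / n)))
    if (drop(t(I) %*% solve(V) %*% I) < t) return(A)    # accept if balance criterion is met
  }
}
```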

Compared to rerandomization defined in Section 2.2, stratified rerandomization involves two changes: the first step replaces simple randomization by stratified randomization, and the variance estimator \widehat{Var}(\widetilde{\boldsymbol{I}}) is updated to reflect the true variance of the imbalance statistic under stratification, which is smaller than under simple randomization. This design is asymptotically equivalent to the first stratified rerandomization criterion (overall Mahalanobis distance) of Wang et al. (2023c), since we assume \pi is constant across randomization strata, the most common setting in randomized experiments. Without loss of generality, we assume S\notin\boldsymbol{X}^{r}; otherwise, rerandomization is effectively controlling for imbalance on \boldsymbol{X}^{r}\setminus S. Under stratified rerandomization, we next develop the large-sample properties of M-estimators for the general estimand \Delta^{*}.

5.2 Asymptotics for M-estimators under stratified rerandomization

Theorem 3.

Given Assumptions 1 and 3, and under stratified rerandomization, we have \widehat{\Delta}\xrightarrow{p}\underline{\Delta} and asymptotic linearity as in Equation (1), and

\sqrt{n}(\widehat{\Delta}-\underline{\Delta})\xrightarrow{d}\sqrt{\widetilde{V}}\left\{\sqrt{1-\widetilde{R}^{2}}\ z+\sqrt{\widetilde{R}^{2}}\ r_{q,t}\right\}, \qquad (3)

where z\sim N(0,1), r_{q,t}\sim D_{1}|(\boldsymbol{D}^{\top}\boldsymbol{D}<t) for \boldsymbol{D}\sim N(\boldsymbol{0},\mathbf{I}_{q}) with D_{1} being the first element of \boldsymbol{D} and r_{q,t} being independent of z, and

\widetilde{V} = V-\pi(1-\pi)E\left[E[IF\{\boldsymbol{O}(1)\}-IF\{\boldsymbol{O}(0)\}|S]^{2}\right],
\widetilde{R}^{2} = \frac{\pi(1-\pi)}{\widetilde{V}}\boldsymbol{C}^{\top}E[Var(\boldsymbol{X}^{r}|S)]^{-1}\boldsymbol{C},

with \boldsymbol{C}=E[Cov[\boldsymbol{X}^{r},IF\{\boldsymbol{O}(1)\}-IF\{\boldsymbol{O}(0)\}|S]] and V being the asymptotic variance under simple randomization. Consistent estimators for \widetilde{V} and \widetilde{R}^{2} are provided in the Supplementary Material.

Theorem 3 parallels Theorem 1 in terms of the convergence in probability, asymptotic linearity, and weak convergence results. The major difference lies in the constant factors of the asymptotic distribution. In particular, \widetilde{V} substitutes for V in Equation (3) to account for the variance reduction from stratification (Wang et al., 2023b). Likewise, compared to Theorem 1, \widetilde{R}^{2} is also updated by replacing the marginal covariance with the expectation of the conditional covariance. Despite these changes, the properties of the asymptotic distribution remain similar to those under rerandomization, and the statistical inference considerations discussed in Section 3 still apply to stratified rerandomization. Finally, if we consider the unadjusted estimator, the difference estimand \Delta^{*}=E[Y(1)-Y(0)], and fixed strata categories, Theorem 3 becomes the counterpart of Theorem 3 of Wang et al. (2023c) under a sampling-based framework. Our Theorem 3, however, covers a wider class of M-estimators and addresses a more general class of estimands \Delta^{*} (e.g., allowing for ratio estimands).

Remark 4.

If rerandomization is based on \boldsymbol{I}^{\top}\widehat{\mathbf{H}}^{-1}\boldsymbol{I}<t, Theorem 3 remains the same with the asymptotic distribution in Equation (3) updated to \sqrt{\widetilde{V}}\sqrt{1-\widetilde{R}^{2}}\,z+\boldsymbol{C}^{\top}\widetilde{\mathbf{V}}_{I}^{-1/2}\widetilde{\boldsymbol{R}}_{\mathbf{H},t}, where \widetilde{\boldsymbol{R}}_{\mathbf{H},t}\sim\boldsymbol{D}|(\boldsymbol{D}^{\top}\widetilde{\mathbf{V}}_{I}^{1/2}\underline{\mathbf{H}}^{-1}\widetilde{\mathbf{V}}_{I}^{1/2}\boldsymbol{D}<t), \widetilde{\mathbf{V}}_{I}=\{\pi(1-\pi)\}^{-1}E[Var(\boldsymbol{X}^{r}|S)], and \underline{\mathbf{H}} is the probability limit of n\widehat{\mathbf{H}}. As long as \underline{\mathbf{H}} is positive definite, stratified rerandomization does not lead to any asymptotic precision reduction relative to stratified or simple randomization.

Remark 5.

Like rerandomization, stratified rerandomization can also accommodate tiers of covariates in a similar way as in Remark 2. However, under our framework, Theorem 3 is not implied by Theorem 1 with tiers of covariates (by specifying \boldsymbol{I}_{b} as I\{S_{i}=s\} for each s\in\mathcal{S} and letting t_{b}\rightarrow 0). This is because \sum_{s\in\mathcal{S}}I\{S_{i}=s\}=1, and so the covariance matrix of \{I\{S_{i}=s\}:s\in\mathcal{S}\} is rank deficient. Nevertheless, if we enforce \sum_{i=1}^{n}A_{i}=\pi n in rerandomization, as done in Li et al. (2018), then rerandomization with tiers of covariates accommodates stratified randomization as a special case (Wang et al., 2023c); these schemes all fit into stratified rerandomization with tiers of covariates under our framework.

5.3 Continued examples

For the estimators in Examples 1-4 of Section 4, the model-robustness or double-robustness properties remain the same under stratified rerandomization. Theorem 4 below thus extends Theorem 2 to stratified rerandomization.

Theorem 4.

Assume Assumptions 1-3 and that \boldsymbol{X} includes \boldsymbol{X}^{r} as a subset and S as dummy variables for covariate adjustment. Then, for the ANCOVA, logistic regression, DR-WLS, or mixed-ANCOVA estimator in Examples 1-4, we have \sqrt{n}(\widehat{\Delta}-\Delta^{*})\xrightarrow{d}N(0,V) under stratified rerandomization if at least one of the following conditions holds: (1) \pi=0.5 and \Delta^{*}=E[Y(1)-Y(0)], or (2) the outcome regression model further includes treatment-by-covariate interaction terms for both \boldsymbol{X}^{r} and S, or (3) the outcome regression model is correctly specified. In other words, \widetilde{R}^{2}=0 and \widetilde{V}=V under condition (1), (2), or (3).

Theorem 4 clarifies that adjusting for both \boldsymbol{X}^{r} and S in the working models yields asymptotic normality of the considered M-estimators under stratified rerandomization. In this case, the asymptotic distribution of each estimator is no different from that under simple randomization, and asymptotic results developed for these estimators under simple randomization directly apply, even though the actual assignment comes from a stratified rerandomization procedure. Among the three conditions, conditions (1) and (3) remain the same as in Theorem 2, but condition (2) is modified to further include treatment-by-stratum interaction terms (intuitively, because the stratum variable is part of the randomization procedure). As with Theorem 2, the threshold t and weighting matrix \widehat{\mathbf{H}} do not contribute to the asymptotic distribution, suggesting that the choice of these design parameters in general does not affect the asymptotic inference of the considered treatment effect estimators. Given the appeal of applying a normal approximation for inference, we maintain our recommendation that all variables used in the randomization procedure (the rerandomization variables and the dummy stratum variables) be adjusted for in the analysis.

6 Efficient inference with machine learning

Beyond traditional M-estimators based on parametric working models, data-adaptive machine learning methods have also been studied in randomized experiments under simple randomization, typically through the vehicle of the efficient influence function (van der Laan et al., 2011; Chernozhukov et al., 2018). While these estimators provide flexible covariate adjustment and achieve the asymptotic efficiency lower bound under simple randomization, their properties under rerandomization or stratified rerandomization remain unknown. In this section, we turn our focus to a class of efficient estimators based on the efficient influence function and obtain their asymptotic distributions under rerandomization and stratified rerandomization. We focus on the setting where outcomes are missing at random (Assumption 2), but our theory can be straightforwardly extended to accommodate other settings (e.g., cluster-randomized experiments) where efficient estimation has been investigated under simple randomization (Wang et al., 2023a).

6.1 Efficient inference under rerandomization

Under rerandomization, we study the efficient estimator that uses double machine learning (DML) to estimate the nuisance functions with cross-fitting (Chernozhukov et al., 2018). Let \widehat{\eta}_{a}(\boldsymbol{X};\mathcal{D}) and \widehat{\kappa}_{a}(\boldsymbol{X};\mathcal{D}) denote pre-specified estimators of E[Y(a)|\boldsymbol{X}] and E[R(a)|\boldsymbol{X}], trained on data \mathcal{D} and evaluated at \boldsymbol{X}. For cross-fitting, we randomly partition the index set \{1,\dots,n\} into K folds of approximately equal size. Specifically, letting \mathcal{I}_{k} denote the index set for the k-th fold, we have \{1,\dots,n\}=\cup_{k=1}^{K}\mathcal{I}_{k} and |\mathcal{I}_{k}|=K^{-1}n. If K does not divide n, we instead require ||\mathcal{I}_{k}|-K^{-1}n|\leq 1. Defining \mathcal{O}_{a,-k}=\{(R_{i}Y_{i},R_{i},\boldsymbol{X}_{i}):i=1,\dots,n;i\notin\mathcal{I}_{k};A_{i}=a\} as the training data for fold k within treatment group a, we specify the nuisance function estimators as \widehat{\eta}_{a}(\boldsymbol{X}_{i})=\sum_{k=1}^{K}I\{i\in\mathcal{I}_{k}\}\widehat{\eta}_{a}(\boldsymbol{X}_{i};\mathcal{O}_{a,-k}) and \widehat{\kappa}_{a}(\boldsymbol{X}_{i})=\sum_{k=1}^{K}I\{i\in\mathcal{I}_{k}\}\widehat{\kappa}_{a}(\boldsymbol{X}_{i};\mathcal{O}_{a,-k}). Next, the efficient estimator for E[Y(a)] is constructed based on the efficient influence function as

\widehat{\mu}_{a}^{\textup{dml}}=\frac{1}{n}\sum_{i=1}^{n}\left[\frac{I\{A_{i}=a\}}{\pi^{a}(1-\pi)^{1-a}}\frac{R_{i}}{\widehat{\kappa}_{a}(\boldsymbol{X}_{i})}\{Y_{i}-\widehat{\eta}_{a}(\boldsymbol{X}_{i})\}+\widehat{\eta}_{a}(\boldsymbol{X}_{i})\right], \qquad (4)

and we output \widehat{\Delta}^{\textup{dml}}=f(\widehat{\mu}_{1}^{\textup{dml}},\widehat{\mu}_{0}^{\textup{dml}}). For this estimator, we assume the following conditions.

Assumption 4 (Conditions for nuisance function estimators).

For a=0,1, we assume: (1) E[\{\widehat{\eta}_{a}(\boldsymbol{X};\mathcal{D}_{n})-E[Y(a)|\boldsymbol{X}]\}^{2}]^{1/2}=o(n^{-1/4}) and E[\{\widehat{\kappa}_{a}(\boldsymbol{X};\mathcal{D}_{n})-E[R(a)|\boldsymbol{X}]\}^{2}]^{1/2}=o(n^{-1/4}) for \mathcal{D}_{n}=\{R_{i}(a)Y_{i}(a),R_{i}(a),\boldsymbol{X}_{i}\}_{i=1}^{n}; (2) \{\widehat{\kappa}_{a}(\boldsymbol{X};\mathcal{D}_{n})\}^{-1} and E[Y^{2}(a)|\boldsymbol{X}] are uniformly bounded.

Assumption 4(1) requires that the nuisance function estimators converge to their targets in L_{2}-norm at a rate faster than n^{1/4} when the training data are independent and identically distributed. This rate can be achieved by existing machine learning methods, including random forests (Wager and Athey, 2018), deep neural networks (Farrell et al., 2021), and the highly adaptive lasso (Benkeser and Van Der Laan, 2016), under mild regularity conditions. Of note, for the outcome model, the training data \mathcal{D}_{n} are used to learn E[R(a)Y(a)|R(a),\boldsymbol{X}], which yields E[Y(a)|R(a)=1,\boldsymbol{X}] and thus E[Y(a)|\boldsymbol{X}] under Assumption 2. In Assumption 4(2), we assume uniform boundedness of several components of \mathcal{P}^{\boldsymbol{W}} and of the estimators so that the remainder terms can be appropriately controlled. Overall, Assumption 4 resembles the standard assumptions for efficient inference made in Chernozhukov et al. (2018), and importantly, no extra condition is specifically assumed to address rerandomization. Theorem 5 below provides the asymptotic results for \widehat{\Delta}^{\textup{dml}}.

Theorem 5.

Given Assumptions 1, 2, and 4, and under rerandomization, we have consistency, i.e., \widehat{\Delta}^{\textup{dml}}\xrightarrow{p}\Delta^{*}, and asymptotic linearity, i.e., \sqrt{n}(\widehat{\Delta}^{\textup{dml}}-\Delta^{*})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}EIF(\boldsymbol{O}_{i})+o_{p}(1) with

EIF(\boldsymbol{O}_{i})=\sum_{a=0}^{1}f_{a}^{\prime}\left[\frac{I\{A_{i}=a\}}{\pi^{a}(1-\pi)^{1-a}}\frac{R_{i}}{E[R_{i}(a)|\boldsymbol{X}_{i}]}\{Y_{i}-E[Y_{i}(a)|\boldsymbol{X}_{i}]\}+E[Y_{i}(a)|\boldsymbol{X}_{i}]-E[Y_{i}(a)]\right], \qquad (5)

where f_{a}^{\prime} is the partial derivative of f with respect to E[Y(a)].

Furthermore, \sqrt{n}(\widehat{\Delta}^{\textup{dml}}-\Delta^{*})\xrightarrow{d}\sqrt{V}\left(\sqrt{1-R^{2}}\ z+\sqrt{R^{2}}\ r_{q,t}\right) as described in Equation (2), with V being the asymptotic variance under simple randomization and R^{2} defined as in Theorem 1 by setting IF(\boldsymbol{O}_{i})=EIF(\boldsymbol{O}_{i}). Additionally, R^{2}=0 if the adjustment set \boldsymbol{X} includes the rerandomization variables \boldsymbol{X}^{r}. Consistent estimators for V and R^{2} are provided in the Supplementary Material.

Theorem 5 establishes the validity of efficient estimation of the average treatment effect estimand under rerandomization. Moreover, the overall structure of its asymptotic expansion resembles that for M-estimators in Theorem 1, with the only difference being that the influence function is given in Equation (5), which is the efficient influence function under simple randomization. If covariate adjustment includes the rerandomization variables, then $\widehat{\Delta}^{\textup{dml}}$ is asymptotically normal with asymptotic variance $V$, the usual efficiency lower bound given by the variance of $EIF(\boldsymbol{O}_{i})$ under simple randomization. In contrast, if $\boldsymbol{X}$ does not include all rerandomization variables, then $V$ does not fully account for the precision gain from rerandomization, leading to a positive $R^{2}$ and a non-normal asymptotic distribution.
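For inference that reflects the actual randomization procedure (the CP-True coverage in Section 7), quantiles of the non-normal limit in Theorem 5 can be simulated. The sketch below assumes the Mahalanobis acceptance criterion, under which $r_{q,t}$ can be drawn as the first coordinate of a $q$-dimensional standard Gaussian truncated to $\{D^{\top}D\leq t\}$ (Li et al., 2018); the function name and the plug-in values in the usage line are illustrative.

import numpy as np

def limit_quantiles(V, R2, q, t, alpha=0.05, n_draws=500_000, seed=0):
    # Quantiles of sqrt(V) * (sqrt(1 - R2) * z + sqrt(R2) * r_{q,t}), where z is
    # standard normal and r_{q,t} is drawn, under the Mahalanobis criterion, as
    # the first coordinate of D ~ N(0, I_q) conditional on D'D <= t.
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((n_draws, q))
    accepted = D[(D ** 2).sum(axis=1) <= t, 0]
    r = rng.choice(accepted, size=n_draws, replace=True)
    z = rng.standard_normal(n_draws)
    draws = np.sqrt(V) * (np.sqrt(1 - R2) * z + np.sqrt(R2) * r)
    return np.quantile(draws, [alpha / 2, 1 - alpha / 2])

# With plug-in estimates of V and R^2, a 95% interval for Delta is
# [Delta_hat - q_hi / sqrt(n), Delta_hat - q_lo / sqrt(n)].
q_lo, q_hi = limit_quantiles(V=1.0, R2=0.3, q=2, t=1.0)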

Compared with parametric M-estimators, efficient estimators have the advantage of optimal asymptotic efficiency, but their machine learning components often require a large sample size to avoid over-fitting. We recommend using efficient estimators when the sample size per arm is at least 200, and caution against data-adaptive fitting with small sample sizes, e.g., $n\leq 50$.

Remark 6.

With no missing outcomes, we have $R_{i}\equiv 1$, and the efficient estimator simplifies by setting $\widehat{\kappa}_{a}(\boldsymbol{X};\mathcal{D}_{n})\equiv 1$. In this case, Assumption 4(1) is relaxed to only requiring the consistency condition $E[\{\widehat{\eta}_{a}(\boldsymbol{X};\mathcal{D}_{n})-E[Y(a)|\boldsymbol{X}]\}^{2}]=o(1)$, without the need to regulate the rate of convergence. Under this modified Assumption 4, Theorem 5 still holds and extends a special case of Chernozhukov et al. (2018) from simple randomization to rerandomization.

6.2 Results under stratified rerandomization

When using stratified rerandomization, the efficient estimator in Section 6.1 requires modification to achieve our target asymptotic results. This is because stratification introduces additional correlation among the observed data, which cannot be handled by the standard cross-fitting procedure. To overcome this challenge, we perform cross-fitting within each stratum $s\in\mathcal{S}$ and treatment group $a\in\{0,1\}$, an approach proposed by Rafi (2023) to address stratification. Specifically, for each $s\in\mathcal{S}$ and $a\in\{0,1\}$, we randomly partition $\{i=1,\dots,n:S_{i}=s,A_{i}=a\}$ into $K$ folds and denote $\mathcal{I}_{ask}$ as the index set of the $k$-th fold. As before, we require roughly equal fold sizes across $k$ given $a$ and $s$. We next define the $k$-th fold in stratum $s$ as $\mathcal{I}_{sk}=\mathcal{I}_{1sk}\cup\mathcal{I}_{0sk}$ and define the training data for this fold as $\mathcal{O}_{as,-k}=\{(R_{i}Y_{i},R_{i},\boldsymbol{X}_{i}):i=1,\dots,n;\ i\notin\mathcal{I}_{sk};\ A_{i}=a;\ S_{i}=s\}$. The nuisance function estimators are updated to $\widehat{\eta}_{a}(\boldsymbol{X}_{i})=\sum_{k=1}^{K}\sum_{s\in\mathcal{S}}I\{i\in\mathcal{I}_{sk},S_{i}=s\}\widehat{\eta}_{a}(\boldsymbol{X}_{i};\mathcal{O}_{as,-k})$ and $\widehat{\kappa}_{a}(\boldsymbol{X}_{i})=\sum_{k=1}^{K}\sum_{s\in\mathcal{S}}I\{i\in\mathcal{I}_{sk},S_{i}=s\}\widehat{\kappa}_{a}(\boldsymbol{X}_{i};\mathcal{O}_{as,-k})$. Then $\widehat{\mu}_{a}^{\textup{dml}}$ is defined as in Equation (4) and $\widehat{\Delta}^{\textup{dml}}=f(\widehat{\mu}_{1}^{\textup{dml}},\widehat{\mu}_{0}^{\textup{dml}})$.
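A minimal sketch of this within-stratum, within-arm fold assignment is given below; the function name and interface are illustrative.

import numpy as np

def stratified_arm_folds(A, S, K, seed=0):
    # Assign cross-fitting folds separately within each (stratum, arm) cell so
    # that fold sizes are roughly equal across k given a and s. The training
    # data O_{as,-k} for unit i are then the units sharing its stratum and arm
    # but carrying a different fold label.
    rng = np.random.default_rng(seed)
    A, S = np.asarray(A), np.asarray(S)
    fold = np.empty(len(A), dtype=int)
    for s in np.unique(S):
        for a in (0, 1):
            idx = np.flatnonzero((S == s) & (A == a))
            fold[idx] = rng.permutation(len(idx)) % K   # balanced labels 0, ..., K-1
    return fold

Based on this modified cross-fitting scheme, Theorem 6 below gives the counterpart of Theorem 5 under stratified rerandomization.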

Theorem 6.

Given Assumptions 1, 2, and 4, and under stratified rerandomization, we have the same consistency and asymptotic linearity as in Theorem 5. Furthermore, $\sqrt{n}(\widehat{\Delta}^{\textup{dml}}-\Delta^{*})\xrightarrow{d}\sqrt{\widetilde{V}}\left(\sqrt{1-\widetilde{R}^{2}}\,z+\sqrt{\widetilde{R}^{2}}\,r_{q,t}\right)$ as described in Equation (3), with $\widetilde{V}$ and $\widetilde{R}^{2}$ now defined by setting $IF(\boldsymbol{O}_{i})=EIF(\boldsymbol{O}_{i})$. Additionally, $\widetilde{R}^{2}=0$ if the adjustment set $\boldsymbol{X}$ includes the rerandomization variables $\boldsymbol{X}^{r}$ and the dummy stratum variables $S$. Consistent estimators for $\widetilde{V}$ and $\widetilde{R}^{2}$ are provided in the Supplementary Material.

7 Simulations

We conduct two simulation studies to demonstrate our asymptotic results. The first simulation focuses on continuous outcomes with the difference estimand $\Delta^{*}=E[Y(1)-Y(0)]$, and the second focuses on binary outcomes with the ratio estimand $\Delta^{*}=E[Y(1)]/E[Y(0)]$. Both simulations consider complete outcomes and outcomes missing at random (Assumption 2), under three randomization procedures: simple randomization, rerandomization, and stratified rerandomization.

7.1 Simulation design

For the first simulation with continuous outcomes, we set $n=400$ and independently generate three baseline covariates $X_{i1}\sim\mathcal{N}(1,1)$, $S_{i}|X_{i1}\sim\mathcal{B}(0.4+0.2I\{X_{i1}<1\})$, and $X_{i2}|(X_{i1},S_{i})\sim\mathcal{N}(0,1)$ for $i=1,\dots,n$, where $\mathcal{N}$ and $\mathcal{B}$ denote the normal and Bernoulli distributions, respectively. Next, we independently generate

Y_{i}(a)\sim\mathcal{N}\left(4aS_{i}X_{i2}^{2}+2e^{X_{i1}}+|X_{i2}|,\,1\right),
R_{i}(a)\sim\mathcal{B}\left\{\textup{expit}(0.6+0.6a+X_{i2}+S_{i})\right\},

for $a=0,1$ and $i=1,\dots,n$, where $\textup{expit}(x)=1/(1+e^{-x})$. For treatment allocation, we set the rerandomization variables as $\boldsymbol{X}^{r}_{i}=(X_{i1},X_{i2})$ and the stratification variable as $S_{i}$. Then $(A_{1},\dots,A_{n})$ are generated under simple randomization, rerandomization, and stratified rerandomization as described in Sections 2.2 and 5.1, using the Mahalanobis distance with threshold $t=1$ (corresponding to an acceptance rate of approximately $40\%$). Denoting $Y_{i}=A_{i}Y_{i}(1)+(1-A_{i})Y_{i}(0)$ and $R_{i}=A_{i}R_{i}(1)+(1-A_{i})R_{i}(0)$, the observed data are $\{Y_{i},A_{i},X_{i1},X_{i2},S_{i}:i=1,\dots,n\}$ under the no-missing-outcome scenario, and $\{R_{i}Y_{i},R_{i},A_{i},X_{i1},X_{i2},S_{i}:i=1,\dots,n\}$ under the missing-outcome scenario. We repeat the above procedure to generate 1000 simulated data sets.
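For concreteness, a sketch of this data-generating and allocation step is given below. The rerandomization step accepts a coin-flip allocation only when the Mahalanobis balance statistic falls below $t$; the function names are illustrative, and details not stated above (e.g., the use of the sample covariance of $\boldsymbol{X}^{r}$ in the balance statistic) are assumptions.

import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_data(n=400, seed=0):
    # Covariates, potential outcomes, and missingness indicators of the first simulation.
    rng = np.random.default_rng(seed)
    X1 = rng.normal(1.0, 1.0, n)
    S = rng.binomial(1, 0.4 + 0.2 * (X1 < 1))
    X2 = rng.normal(0.0, 1.0, n)
    Y = {a: rng.normal(4 * a * S * X2 ** 2 + 2 * np.exp(X1) + np.abs(X2), 1.0) for a in (0, 1)}
    R = {a: rng.binomial(1, expit(0.6 + 0.6 * a + X2 + S)) for a in (0, 1)}
    return X1, X2, S, Y, R

def rerandomize(Xr, t=1.0, pi=0.5, seed=0):
    # Redraw coin-flip assignments until the Mahalanobis balance statistic of the
    # difference in covariate means falls below the threshold t.
    rng = np.random.default_rng(seed)
    n = Xr.shape[0]
    Sigma_inv = np.linalg.inv(np.cov(Xr, rowvar=False))
    while True:
        A = rng.binomial(1, pi, n)
        n1 = A.sum()
        if n1 == 0 or n1 == n:
            continue
        diff = Xr[A == 1].mean(axis=0) - Xr[A == 0].mean(axis=0)
        M = (n1 * (n - n1) / n) * diff @ Sigma_inv @ diff
        if M <= t:
            return A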

For each simulated data set, we implement the following three estimators under the no-missing-outcome scenario. The unadjusted estimator is the difference in mean outcomes between treatment groups and ignores covariates. The ANCOVA estimator (Example 1) adjusts for all covariates $\boldsymbol{X}_{i}=(X_{i1},X_{i2},S_{i})$. The ML estimator refers to the efficient estimator described in Section 6 with $R_{i}=\widehat{\kappa}_{a}(\boldsymbol{X}_{i})=1$, as discussed in Remark 6. For the outcome regression model $\widehat{\eta}_{a}(\boldsymbol{X}_{i})$, we adopt an ensemble learner of generalized linear models, regression trees, and neural networks via SuperLearner (van der Laan et al., 2007). Under the outcome-missing-at-random scenario, we instead compute the DR-WLS estimator (described in Example 3) and the efficient DML estimator (described in Section 6); for the latter, we use the same ensemble learners to estimate the two nuisance functions. Of note, since the outcome missingness is generated by a generalized linear model, the missingness propensity score model is correctly specified in both the DR-WLS and DML estimators.
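A sketch of the first two estimators is given below (the ML, DR-WLS, and DML estimators additionally require fitted nuisance functions and are omitted). The ANCOVA specification shown here is a plain least-squares regression of the outcome on treatment and covariates; minor details of Example 1, such as covariate centering or interaction terms, are not reproduced.

import numpy as np

def unadjusted_estimate(Y, A):
    # Difference in mean outcomes between treatment groups, ignoring covariates.
    return Y[A == 1].mean() - Y[A == 0].mean()

def ancova_estimate(Y, A, X):
    # Least-squares fit of Y on (1, A, X); the coefficient on A estimates the
    # average treatment effect for the difference estimand.
    design = np.column_stack([np.ones(len(Y)), A, X])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef[1]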

For each estimator, we report the following performance metrics: bias, empirical standard error (ESE), average of standard error estimators assuming simple randomization (ASE, i.e., no adjustment is performed for rerandomization or stratified rerandomization), coverage probability based on normal approximations and standard error estimators under simple randomization (CP-Normal), and the coverage probability based on the derived asymptotic distribution reflecting the actual randomization procedure (CP-True).
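These metrics can be computed from the per-replication point estimates and standard errors as sketched below; the interface is illustrative, and CP-True replaces the normal quantile with quantiles of the derived limiting distribution (for example, via the sketch following Theorem 5).

import numpy as np
from scipy.stats import norm

def summarize(estimates, ses, truth, alpha=0.05):
    # Bias, ESE, ASE, and CP-Normal across simulation replications, given point
    # estimates and standard errors computed assuming simple randomization.
    estimates, ses = np.asarray(estimates), np.asarray(ses)
    z = norm.ppf(1 - alpha / 2)
    covered = np.abs(estimates - truth) <= z * ses
    return {"Bias": estimates.mean() - truth,
            "ESE": estimates.std(ddof=1),
            "ASE": ses.mean(),
            "CP-Normal": covered.mean()}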

In the second simulation study with binary outcomes, the data-generating process is the same as in the first simulation except that $Y_{i}(a)\sim\mathcal{B}\left\{\textup{expit}(-4+4aS_{i}X_{i2}^{2}+e^{X_{i1}}-|X_{i2}|)\right\}$. Here, the unadjusted estimator is the ratio of mean outcomes between treatment groups. The outcome model used in the ANCOVA and DR-WLS estimators is replaced by GLM2, which denotes the logistic regression in Example 2 with all treatment-by-covariate interaction terms. The ML and DML estimators are also modified to target the ratio estimand, while the nuisance function estimators remain the same as in the first simulation study.

7.2 Simulation results

Table 1 summarizes the results of the first simulation study. Across settings, all estimators have negligible bias and achieve nominal coverage probability based on our derived asymptotic distributions in Theorems 1-6, thereby empirically supporting our theoretical results. For the unadjusted estimator, rerandomization and stratified rerandomization improve precision, reflected by an ESE roughly $20\%$ smaller than the ASE and a CP-Normal close to 1. Under rerandomization and stratified rerandomization, the ANCOVA, ML, DR-WLS, and DML estimators perform similarly to their counterparts under simple randomization, and the corresponding coverage probabilities are all close to 0.95; this is expected because asymptotic normality holds for these covariate-adjusted estimators under rerandomization. These results further support Theorems 2, 4, 5, and 6, by which the ANCOVA, ML, DR-WLS, and DML estimators fully account for the variance reduction brought by rerandomization and stratified rerandomization (and hence $R^{2}=0$). Of note, the empirical standard error tends to be slightly larger than the average of the standard error estimators for the adjusted estimators (especially when there are missing outcomes); this is driven by outlier estimates from a few simulated data sets, which have negligible impact on the coverage probability. When we repeat the simulation with $n=1000$, this variance underestimation issue disappears (results omitted for brevity). Finally, in terms of finite-sample efficiency, parametric working models improve precision over the unadjusted estimator, and the machine learning estimators often lead to further variance reduction.

Table 1: Simulation results for continuous outcomes. ESE: empirical standard error. ASE: average of standard error estimators assuming simple randomization. CP-Normal: coverage probability based on normal approximations and standard error estimators assuming simple randomization. CP-True: coverage probability based on the derived asymptotic distribution (which accounts for the actual randomization procedure).
Missing Outcomes? Randomization Estimator Bias ESE ASE CP-Normal CP-True
No Simple randomization Unadjusted 0.00 1.23 1.19 0.95 0.95
ANCOVA 0.01 0.82 0.79 0.95 0.96
ML -0.00 0.43 0.40 0.94 0.95
Rerandomization Unadjusted -0.01 0.95 1.19 0.99 0.94
ANCOVA -0.00 0.83 0.79 0.95 0.95
ML 0.02 0.45 0.41 0.94 0.94
Stratified rerandomization Unadjusted 0.02 0.92 1.19 0.99 0.94
ANCOVA 0.02 0.82 0.79 0.94 0.95
ML 0.01 0.51 0.47 0.95 0.94
Yes Simple randomization DR-WLS -0.00 1.00 0.94 0.95 0.94
DML 0.05 0.69 0.61 0.96 0.95
Rerandomization DR-WLS 0.03 0.97 0.94 0.95 0.95
DML 0.07 0.87 0.63 0.95 0.95
Stratified rerandomization DR-WLS 0.06 0.99 0.94 0.96 0.95
DML 0.10 0.82 0.69 0.95 0.94

Table 2 presents the results for the second simulation, with similar overall findings to those from the first simulation. Here, since the ANCOVA estimator is replaced by logistic regression with treatment-by-covariate interactions, the GLM2 estimator is asymptotically normal as stated in Theorem 2(2) and Theorem 4(2).

Table 2: Simulation results for binary outcomes. ESE: empirical standard error. ASE: average of standard error estimators assuming simple randomization. CP-Normal: coverage probability based on normal approximations and standard error estimators assuming simple randomization. CP-True: coverage probability based on the derived asymptotic distribution (which accounts for the actual randomization procedure).
Missing Outcomes? Randomization Estimator Bias ESE ASE CP-Normal CP-True
No Simple randomization Unadjusted 0.02 0.25 0.24 0.94 0.93
GLM2 0.01 0.19 0.18 0.94 0.94
ML 0.01 0.19 0.18 0.94 0.94
Rerandomization Unadjusted 0.02 0.21 0.24 0.97 0.94
GLM2 0.01 0.19 0.18 0.94 0.94
ML 0.02 0.20 0.19 0.94 0.94
Stratified rerandomization Unadjusted 0.01 0.21 0.24 0.97 0.93
GLM2 0.00 0.19 0.18 0.94 0.94
ML 0.01 0.19 0.18 0.95 0.94
Yes Simple randomization DR-WLS 0.02 0.21 0.21 0.95 0.95
DML 0.03 0.26 0.22 0.96 0.96
Rerandomization DR-WLS 0.02 0.22 0.21 0.94 0.94
DML 0.03 0.23 0.21 0.94 0.94
Stratified rerandomization DR-WLS 0.02 0.22 0.21 0.94 0.94
DML 0.03 0.28 0.23 0.95 0.95

8 Data application

The Effectiveness of Group Focused Psychosocial Support for Adults Affected by Humanitarian Crises (GroupPMPlus) study is a cluster-randomized experiment in Nepal designed to improve the mental health of people affected by humanitarian emergencies such as pandemics, war, and environmental disasters (Jordans et al., 2021). The cluster-level intervention was Group Problem Management Plus, a psychological treatment of 5 weekly sessions (versus standard care). In this study, 72 wards (the smallest administrative units in Nepal, representing clusters) were enrolled, consisting of 609 individuals in total. Stratified rerandomization was used for equal treatment allocation of clusters: stratification was based on gender (all individuals in the same ward have the same gender), and rerandomization was based on three binary cluster-level covariates: high or low access to mental health services, high or low disaster risk, and rural/urban status. However, we are unable to obtain detailed rerandomization parameters from the published report, including the weighting matrix or balance threshold. Therefore, we carry out our analysis assuming simple randomization: this choice will lead to conservative confidence intervals for an unadjusted analysis but has no impact on the covariate-adjusted estimators, as supported by our theory and simulations. The primary outcome was a continuous measure of psychological distress at the 3-month follow-up, evaluated by the General Health Questionnaire (GHQ-12). The baseline variables we adjust for include the baseline GHQ-12 score and all variables used in stratified rerandomization. For this study, we implement the mixed-ANCOVA estimator (Example 4) with individual-level data (Wang et al., 2021), and the unadjusted, ANCOVA, and ML estimators with cluster-level means (implemented as in the first simulation) to estimate the average treatment effect. For all estimators, we compute their point estimates, standard errors assuming simple randomization, and confidence intervals under normal approximation.

The results are summarized in Table 3. While all estimators have similar point estimates, their standard error estimates differ. For the unadjusted estimator, since we are unaware of the detailed rerandomization parameters, 0.78 should be a conservative estimate, leading to failure to reject the null at the 5% level. In contrast, the standard error estimates for all three covariate-adjusted estimators should have fully accounted for the precision gain from stratified rerandomization as clarified by our theoretical results. For these estimators, it is important to highlight that validity of the standard error estimator is achieved without knowing the exact rerandomization parameters, which further endorses the recommendation of adjusting for stratification and rerandomization variables. Among the three covariate-adjusted estimators, the ANCOVA estimator appears to have the highest precision, while the machine learning estimator leads to the least variance reduction. This may be because either the sample size is relatively limited for machine learning methods to demonstrate asymptotic efficiency gain or the true data-generating distribution is nearly linear in covariates.

Table 3: Data analysis results for the GroupPMPlus study.
Estimator Estimate Standard error 95% confidence interval
Unadjusted -1.25 0.78 (-2.77,  0.28)
ANCOVA -1.45 0.56 (-2.55, -0.35)
Mixed-ANCOVA -1.35 0.62 (-2.57, -0.13)
ML -1.44 0.64 (-2.69, -0.18)

9 Discussion

Covariate adjustment in randomized experiments can be achieved at the design stage, e.g., via rerandomization, and at the analysis stage, e.g., by outcome modeling. In this paper, our primary contribution is to clarify the impact of (stratified) rerandomization for a wide class of estimators, and to further demonstrate when rerandomization becomes ignorable (so that conventional asymptotic normality results apply) because the subsequent analysis adjusts for the rerandomization variables. These results provide important clarifications to earlier simulation findings, for example, in cluster-randomized experiments (Li et al., 2016, 2017) where mixed-model ANCOVA was evaluated under rerandomization. We have also extended the theoretical development to efficient machine learning estimators, which maximally leverage baseline covariates to optimize the asymptotic efficiency gain in randomized experiments. Importantly, through the lens of the super-population framework, our results expand the existing rerandomization theory (Li et al., 2018; Li and Ding, 2020; Lu et al., 2023; Wang et al., 2023c; Zhao and Ding, 2024) from unadjusted and linearly-adjusted estimators to more general covariate-adjusted estimators. Therefore, our results have wide implications for randomized experiments, especially given the diversity of covariate-adjustment methods used in current practice (Pirondini et al., 2022).

A practical question for conducting rerandomization is how to select the design parameters, e.g., the rerandomization variables, the weighting matrix, and the balance threshold. Our general recommendation is to balance only prognostic covariates via the Mahalanobis distance and to choose $t$ such that the rejection rate is no larger than 95%.
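As a point of reference, under the Mahalanobis criterion the balance statistic is asymptotically chi-squared with $q$ degrees of freedom, where $q$ is the number of rerandomization variables, so a target acceptance probability maps to a threshold through a chi-squared quantile; the sketch below is illustrative.

from scipy.stats import chi2

def threshold_for_acceptance(p_accept, q):
    # Threshold t such that the asymptotic acceptance probability of the
    # Mahalanobis criterion (chi-squared with q degrees of freedom) equals p_accept.
    return chi2.ppf(p_accept, df=q)

# Two rerandomization variables with a 40% acceptance rate give t of about 1.02,
# consistent with the threshold t = 1 used in the simulations of Section 7.
t = threshold_for_acceptance(0.40, q=2)

We retain our recommendation to adjust for the rerandomization variables during analysis, for maximum efficiency gain and convenience in statistical inference. Finally, for our example estimators, although the choices of the weighting matrix and $t$ have no impact asymptotically, they could have finite-sample implications. Alternatively, $t$ may be specified in a data-driven fashion, i.e., $t_{n}=f_{n}(\boldsymbol{X}_{1}^{r},\dots,\boldsymbol{X}_{n}^{r})$. Asymptotic analysis with an arbitrary $t_{n}$ is a challenging problem that requires special convergence results for conditional quantiles; this extension is left for future research.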

References

  • Benkeser et al., (2021) Benkeser, D., Díaz, I., Luedtke, A., Segal, J., Scharfstein, D., and Rosenblum, M. (2021). Improving precision and power in randomized trials for COVID-19 treatments using covariate adjustment, for binary, ordinal, and time-to-event outcomes. Biometrics, 77(4):1467–1481.
  • Benkeser and Van Der Laan, (2016) Benkeser, D. and Van Der Laan, M. (2016). The highly adaptive lasso estimator. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 689–696. IEEE.
  • Bruhn and McKenzie, (2009) Bruhn, M. and McKenzie, D. (2009). In pursuit of balance: Randomization in practice in development field experiments. American Economic Journal: Applied Economics, 1(4):200–232.
  • Chernozhukov et al., (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.
  • Colantuoni and Rosenblum, (2015) Colantuoni, E. and Rosenblum, M. (2015). Leveraging prognostic baseline variables to gain precision in randomized trials. Statistics in Medicine, 34(18):2602–2617.
  • Ding et al., (2017) Ding, P., Li, X., and Miratrix, L. W. (2017). Bridging finite and super population causal inference. Journal of Causal Inference, 5(2):20160027.
  • Farrell et al., (2021) Farrell, M. H., Liang, T., and Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica, 89(1):181–213.
  • Ivers et al., (2012) Ivers, N. M., Halperin, I. J., Barnsley, J., Grimshaw, J. M., Shah, B. R., Tu, K., Upshur, R., and Zwarenstein, M. (2012). Allocation techniques for balance at baseline in cluster randomized trials: a methodological review. Trials, 13:1–9.
  • Jordans et al., (2021) Jordans, M. J., Kohrt, B. A., Sangraula, M., Turner, E. L., Wang, X., Shrestha, P., Ghimire, R., van’t Hof, E., Bryant, R. A., Dawson, K. S., et al. (2021). Effectiveness of Group Problem Management Plus, a brief psychological intervention for adults affected by humanitarian disasters in Nepal: A cluster randomized controlled trial. PLoS Medicine, 18(6):e1003621.
  • Li et al., (2016) Li, F., Lokhnygina, Y., Murray, D. M., Heagerty, P. J., and DeLong, E. R. (2016). An evaluation of constrained randomization for the design and analysis of group-randomized trials. Statistics in Medicine, 35(10):1565–1579.
  • Li et al., (2017) Li, F., Turner, E. L., Heagerty, P. J., Murray, D. M., Vollmer, W. M., and DeLong, E. R. (2017). An evaluation of constrained randomization for the design and analysis of group-randomized trials with binary outcomes. Statistics in Medicine, 36(24):3791–3806.
  • Li and Ding, (2020) Li, X. and Ding, P. (2020). Rerandomization and regression adjustment. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(1):241–268.
  • Li et al., (2018) Li, X., Ding, P., and Rubin, D. B. (2018). Asymptotic theory of rerandomization in treatment–control experiments. Proceedings of the National Academy of Sciences, 115(37):9157–9162.
  • Lu et al., (2023) Lu, X., Liu, T., Liu, H., and Ding, P. (2023). Design-based theory for cluster rerandomization. Biometrika, 110(2):467–483.
  • Morgan and Rubin, (2012) Morgan, K. L. and Rubin, D. B. (2012). Rerandomization to improve covariate balance in experiments. Annals of Statistics, 40(2):1263–1282.
  • Morgan and Rubin, (2015) Morgan, K. L. and Rubin, D. B. (2015). Rerandomization to balance tiers of covariates. Journal of the American Statistical Association, 110(512):1412–1421.
  • Moulton, (2004) Moulton, L. H. (2004). Covariate-based constrained randomization of group-randomized trials. Clinical Trials, 1(3):297–305.
  • Pirondini et al., (2022) Pirondini, L., Gregson, J., Owen, R., Collier, T., and Pocock, S. (2022). Covariate adjustment in cardiovascular randomized controlled trials: its value, current practice, and need for improvement. JACC: Heart Failure, 10(5):297–305.
  • Raab and Butcher, (2001) Raab, G. M. and Butcher, I. (2001). Balance in cluster randomized trials. Statistics in Medicine, 20(3):351–365.
  • Rafi, (2023) Rafi, A. (2023). Efficient semiparametric estimation of average treatment effects under covariate adaptive randomization. arXiv preprint arXiv:2305.08340.
  • Robins et al., (2007) Robins, J., Sued, M., Lei-Gomez, Q., and Rotnitzky, A. (2007). Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Statistical Science, 22(4):544–559.
  • Robins, (2002) Robins, J. M. (2002). Covariance adjustment in randomized experiments and observational studies: Comment. Statistical Science, 17(3):309–321.
  • Shi et al., (2022) Shi, W., Zhao, A., and Liu, H. (2022). Rerandomization and covariate adjustment in split-plot designs. arXiv preprint arXiv:2209.12385.
  • Tsiatis et al., (2008) Tsiatis, A., Davidian, M., Zhang, M., and Lu, X. (2008). Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Statistics in Medicine, 27(23):4658–4677.
  • Turner et al., (2017a) Turner, E. L., Li, F., Gallis, J. A., Prague, M., and Murray, D. M. (2017a). Review of recent methodological developments in group-randomized trials: part 1—design. American Journal of Public Health, 107(6):907–915.
  • Turner et al., (2017b) Turner, E. L., Prague, M., Gallis, J. A., Li, F., and Murray, D. M. (2017b). Review of recent methodological developments in group-randomized trials: part 2—analysis. American Journal of Public Health, 107(7):1078–1086.
  • van der Laan et al., (2007) van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1).
  • van der Laan et al., (2011) van der Laan, M. J., Rose, S., et al. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data, volume 10. Springer.
  • van der Vaart, (1998) van der Vaart, A. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
  • Wager and Athey, (2018) Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242.
  • Wang et al., (2021) Wang, B., Harhay, M. O., Small, D. S., Morris, T. P., and Li, F. (2021). On the mixed-model analysis of covariance in cluster-randomized trials. arXiv preprint arXiv:2112.00832.
  • Wang et al., (2019) Wang, B., Ogburn, E. L., and Rosenblum, M. (2019). Analysis of covariance in randomized trials: More precision and valid confidence intervals, without model assumptions. Biometrics, 75(4):1391–1400.
  • Wang et al., (2023a) Wang, B., Park, C., Small, D. S., and Li, F. (2023a). Model-robust and efficient covariate adjustment for cluster-randomized experiments. Journal of the American Statistical Association.
  • Wang et al., (2023b) Wang, B., Susukida, R., Mojtabai, R., Amin-Esmaeili, M., and Rosenblum, M. (2023b). Model-robust inference for clinical trials that improve precision by stratified randomization and covariate adjustment. Journal of the American Statistical Association, 118(542):1152–1163.
  • Wang et al., (2023c) Wang, X., Wang, T., and Liu, H. (2023c). Rerandomization in stratified randomized experiments. Journal of the American Statistical Association, 118(542):1295–1304.
  • Wang and Li, (2022) Wang, Y. and Li, X. (2022). Rerandomization with diminishing covariate imbalance and diverging number of covariates. The Annals of Statistics, 50(6):3439–3465.
  • Zelen, (1974) Zelen, M. (1974). The randomization and stratification of patients to clinical trials. Journal of Chronic Diseases, 27(7):365 – 375.
  • Zhao and Ding, (2024) Zhao, A. and Ding, P. (2024). No star is good news: A unified look at rerandomization based on p-values from covariate balance tests. Journal of Econometrics, 241(1):105724.
  • Zhou et al., (2018) Zhou, Q., Ernst, P. A., Morgan, K. L., Rubin, D. B., and Zhang, A. (2018). Sequential rerandomization. Biometrika, 105(3):745–752.