Asymptotic inference with flexible covariate adjustment under rerandomization and stratified rerandomization

Bingkai Wang¹ and Fan Li^2,3 ¹Department of Biostatistics School of Public Health University of Michigan Ann Arbor MI USA ²Department of Biostatistics Yale School of Public Health New Haven CT USA ³Center for Methods in Implementation and Prevention Science Yale School of Public Health New Haven CT USA

Abstract

Rerandomization is an effective treatment allocation procedure to control for baseline covariate imbalance. For estimating the average treatment effect, rerandomization has been previously shown to improve the precision of the unadjusted and the linearly-adjusted estimators over simple randomization without compromising consistency. However, it remains unclear whether such results apply more generally to the class of M-estimators, including the g-computation formula with generalized linear regression and doubly-robust methods, and more broadly, to efficient estimators with data-adaptive machine learners. In this paper, using a super-population framework, we develop the asymptotic theory for a more general class of covariate-adjusted estimators under rerandomization and its stratified extension. We prove that the asymptotic linearity and the influence function remain identical for any M-estimator under simple randomization and rerandomization, but rerandomization may lead to a non-Gaussian asymptotic distribution. We further explain, drawing examples from several common M-estimators, that asymptotic normality can be achieved if rerandomization variables are appropriately adjusted for in the final estimator. These results are extended to stratified rerandomization. Finally, we study the asymptotic theory for efficient estimators based on data-adaptive machine learners, and prove their efficiency optimality under rerandomization and stratified rerandomization. Our results are demonstrated via simulations and re-analyses of a cluster-randomized experiment that used stratified rerandomization.

Keywords: covariate adjustment, doubly-robust estimator, M-estimation, machine learning, influence function, randomized trials.

1 Introduction

In randomized experiments, rerandomization (Morgan and Rubin,, 2012) refers to a restricted randomization procedure that discards treatment allocation schemes corresponding to large baseline imbalance. That is, treatment assignments will be randomly re-generated until the balance statistics fall within a pre-specified threshold. In biomedical research, such a randomization procedure is also referred to as covariate-constrained randomization (Raab and Butcher,, 2001; Moulton,, 2004) and often applied to cluster randomization. Compared with simple randomization that assigns treatment by independent coin flips, rerandomization can provide effective design-based control of imbalance associated with a number of covariates, possibly of different types, and improve the study power. Compared to covariate-adaptive randomization, such as stratified randomization (Zelen,, 1974) for balancing discrete baseline variables, rerandomization can easily handle continuous baseline variables and features continuously-valued design parameters to control for the randomization space, thereby offering additional flexibility. Given these potential advantages, rerandomization is increasingly popular in economics (Bruhn and McKenzie,, 2009) and biomedical research (Ivers et al.,, 2012; Turner et al., 2017a, ).

Under rerandomization, the statistical properties of the unadjusted and linearly-adjusted estimators for the average treatment effect have been extensively studied under a finite-population design-based framework. Li et al., (2018) first established the design-based asymptotic theory for the unadjusted estimator (i.e., the two-sample difference-in-means estimator) under rerandomization. Li and Ding, (2020) generalized their results to covariate-adjusted estimators based on linear regression, and recommended the combination of rerandomization and regression adjustment to optimize statistical precision. Rerandomization has also been extended to accommodate tiers of covariates with varying importance (Morgan and Rubin,, 2015), sequential rerandomization in batches (Zhou et al.,, 2018), high-dimensional settings with a diverging number of covariates (Wang and Li,, 2022), split-plot designs (Shi et al.,, 2022), cluster rerandomization (Lu et al.,, 2023), stratified rerandomization (Wang et al., 2023c, ), as well as rerandomization based on p-values from regression-based balance assessment (Zhao and Ding,, 2024). A notable feature of all the previous works is the focus on the unadjusted and linearly-adjusted estimators—a common choice when adopting the finite-population design-based framework for statistical inference.

However, analysis of randomized experiments in practice can involve more general covariate adjustment strategies than linear regression (Benkeser et al.,, 2021), and the validity and efficiency of such general estimators remain unexplored in the existing literature on rerandomization. For example, with binary outcomes, the g-computation estimator based on logistic regression can target the causal risk ratio estimand and effectively leverage baseline covariates for precision gain (Colantuoni and Rosenblum,, 2015). When the outcomes are missing at random, doubly-robust estimators (Robins et al.,, 2007) offer two opportunities for consistent estimation of the average treatment effect; i.e., consistency holds when either the missingness propensity score model or the outcome model is correctly specified but not necessarily both. In cluster-randomized experiments, mixed-effects models are routinely used to simultaneously adjust for covariates and account for the intracluster correlation (Turner et al., 2017b, ; Wang et al.,, 2021). Finally, data-adaptive machine learners have shown promise to maximally leverage baseline information to achieve full efficiency in randomized experiments (Chernozhukov et al.,, 2018). Only for selected scenarios under cluster randomization, previous simulations have empirically demonstrated that rerandomization can improve precision for mixed-effects models and generalized estimating equations estimators (Li et al.,, 2016, 2017). However, to the best of our knowledge, there has been no formal development of the asymptotic theory for this more general class of covariate-adjusted estimators under rerandomization; thus few recommendations are currently available for rerandomized experiments when inference involves flexible covariate adjustment.

In this paper, we introduce formal asymptotic results for the class of M-estimators as well as efficient estimators (nuisance models estimated via data-adaptive machine learners) under rerandomization and stratified rerandomization. First, we prove that, under standard regularity conditions for M-estimators, the asymptotic linearity and the influence function remain identical under simple randomization and rerandomization, but rerandomization may lead to a non-Gaussian asymptotic distribution. This result justifies the consistency of M-estimators under rerandomization, and obviates the need to re-derive the influence function under rerandomization. Next, we survey several common M-estimators in randomized experiments, and show that asymptotic normality can be achieved if rerandomization variables are appropriately adjusted for in the final estimator. This result clarifies when rerandomization is ignorable during the analysis stage, in which case one can conveniently draw inference under rerandomization as if simple randomization were carried out. Then, we develop parallel results to stratified rerandomization—a restricted randomization procedure combining stratification and rerandomization to achieve stronger control of chance imbalance. Finally, we turn to efficient estimators that are motivated by the efficient influence functions coupled with data-adaptive machine learners. We prove that, as long as the rerandomization variables are adjusted for in the machine learners, the efficient estimators continue to be optimally efficient without the need to invoke any additional regularity assumptions. For deriving all of the above asymptotic results, the key challenge is to address correlation among individual observations introduced by (stratified) rerandomization, for which we developed general technical arguments to accommodate flexible covariate adjustment.

Throughout, we adopt a super-population sampling-based framework, which is an effective framework to study more general covariate-adjusted estimators in randomized experiments (Wang et al., 2023b, ). We refer to Robins, (2002) and Ding et al., (2017) for a comparison between the super-population sampling-based and finite-population design-based framework. When the unadjusted and linear regression estimators are considered under rerandomization and stratified randomization, our results can be viewed as the sampling-based counterparts of earlier design-based results (Li et al.,, 2018; Li and Ding,, 2020; Lu et al.,, 2023; Wang et al., 2023c, ). Importantly, our results go beyond linear regression and address a wider class of covariate-adjusted estimators under rerandomization and stratified rerandomization.

The remainder of the paper is organized as follows. In the next section, we introduce the super-population framework, randomization procedures, and M-estimation. Section 3 characterizes the asymptotic distributions for M-estimators under rerandomization, and Section 4 surveys familiar M-estimators and provides sufficient conditions to achieve asymptotic normality under rerandomization. Section 5 presents parallel asymptotic results for stratified rerandomization. Section 6 introduces the asymptotic theory for estimators based on the efficient influence function under rerandomization and stratified rerandomization. We report simulations in Section 7 and re-analyses of a completed cluster-randomized experiment in Section 8, for which the R code is available at https://github.com/BingkaiWang/Rerandomization. Section 9 concludes.

2 Notation, rerandomization, and M-estimation

2.1 Assumptions and estimands

We consider a randomized experiment with $n$ individuals. Each individual $i\ (i=1,\dots,n)$ has an outcome $Y_{i}$ , a non-missing indicator of outcome $R_{i}$ ( $R_{i}=1$ if $Y_{i}$ observed and $0$ if missing), a treatment assignment indicator $A_{i}$ ( $A_{i}=1$ if assigned treatment and $0$ otherwise), and a vector of baseline covariates $\boldsymbol{X}_{i}$ . The observed data are $(\boldsymbol{O}_{1},\dots,\boldsymbol{O}_{n})$ with $\boldsymbol{O}_{i}=(R_{i}Y_{i},R_{i},A_{i},\boldsymbol{X}_{i})$ . We adopt the potential outcomes framework and define $Y_{i}(a)$ as the potential outcome if individual $i$ were assigned to treatment group $a,a\in\{0,1\}$ . Similarly, we denote $R_{i}(a)$ as the potential non-missing indicator given treatment $a$ . We assume causal consistency such that $Y_{i}=A_{i}Y_{i}(1)+(1-A_{i})Y_{i}(0)$ and $R_{i}=A_{i}R_{i}(1)+(1-A_{i})R_{i}(0)$ . Denoting the complete data vector for each individual as $\boldsymbol{W}_{i}=(Y_{i}(1),Y_{i}(0),R_{i}(1),R_{i}(0),\boldsymbol{X}_{i})$ , we make the following assumptions on $(\boldsymbol{W}_{1},\dots,\boldsymbol{W}_{n})$ .

Assumption 1 (Super-population).

Each complete data vector $\boldsymbol{W}_{i},i=1,\dots,n$ is an independent and identically distributed draw from an unknown distribution $\mathcal{P}^{\boldsymbol{W}}$ on $\boldsymbol{W}=(Y(1),Y(0),R(1),R(0),\boldsymbol{X})$ with finite second moments.

Assumption 2 (Missing at random).

For each $a=0,1$ , $R(a)$ is independent of $Y(a)$ given $\boldsymbol{X}$ and $P(R(a)=1|\boldsymbol{X})$ is uniformly bounded away from 0.

Assumption 1 is a standard condition for making inference under a sampling-based framework. It postulates a notional super-population of units, from which the observed sample is randomly drawn and for which the target estimands may be defined. We invoke Assumption 2 as a standard condition to accommodate missing outcomes in randomized experiments. This assumption is required for the doubly-robust estimators in Theorems 2, 4, 5, and 6, but is not required for other results without missing data.

Our goal is to estimate the average treatment effect, defined in a general form as

\Delta^{*}=f(E[Y(1)],E[Y(0)])

where $f(x,y)$ is a pre-specified function defining the scale of effect measure. For example, $f(x,y)=x-y$ corresponds to the treatment effect on the difference scale, whereas $f(x,y)=x/y$ corresponds to the risk ratio scale, which is a standard choice when the outcome is binary. Finally, by Assumption 1, the expectation operator in the estimand definition (and throughout the paper) is defined with respect to the super-population distribution $\mathcal{P}^{W}$ .

2.2 Simple randomization and rerandomization

Simple randomization assigns treatment via independent coin flips, that is, $A_{i}$ is independently determined by a Bernoulli distribution $\mathcal{P}^{A}$ with $P(A_{i}=1)=\pi\in(0,1)$ . By design, each $A_{i}$ is independent of $(\boldsymbol{W}_{1},\dots,\boldsymbol{W}_{n})$ , and the observed data $(\boldsymbol{O}_{1},\dots,\boldsymbol{O}_{n})$ are independent and identically distributed given Assumption 1.

Rerandomization improves simple randomization by controlling imbalance on a subset of measured baseline covariates, which we denote as $\boldsymbol{X}^{r}$ with $\boldsymbol{X}^{r}\subset\boldsymbol{X}$ (Morgan and Rubin,, 2012); we refer to $\boldsymbol{X}^{r}$ as rerandomization variables. Rerandomization involves the following three steps. First, we independently generate $A_{1}^{*},\dots,A_{n}^{*}$ from a Bernoulli distribution with $P(A_{i}^{*}=1)=\pi\in(0,1)$ as in simple randomization. In the second step, we compute the imbalance statistic on $\boldsymbol{X}^{r}_{i}$ and its variance estimator as

	$\displaystyle\boldsymbol{I}$	$\displaystyle=\frac{1}{N_{1}}\sum_{i=1}^{n}A^{}_{i}\boldsymbol{X}^{r}_{i}-\frac{1}{N_{0}}\sum_{i=1}^{n}(1-A_{i}^{})\boldsymbol{X}^{r}_{i},$
	$\displaystyle\widehat{Var}(\boldsymbol{I})$	$\displaystyle=\frac{1}{N_{1}N_{0}}\sum_{i=1}^{n}(\boldsymbol{X}^{r}_{i}-\overline{\boldsymbol{X}^{r}})(\boldsymbol{X}^{r}_{i}-\overline{\boldsymbol{X}^{r}})^{\top}$

where $N_{1}=\sum_{i=1}^{n}A_{i}^{*}$ , $N_{0}=n-N_{1}^{*}$ and $\overline{\boldsymbol{X}^{r}}=n^{-1}\sum_{i=1}^{n}\boldsymbol{X}^{r}_{i}$ . Lastly, given a pre-specified balance threshold $t>0$ , we check whether $\boldsymbol{I}^{\top}\{\widehat{Var}(\boldsymbol{I})\}^{-1}\boldsymbol{I}<t$ . If true, the final treatment assignment $(A_{1},\dots,A_{n})$ is set to be $(A_{1}^{*},\dots,A_{n}^{*})$ ; otherwise, we repeat the steps until we obtain the first randomization scheme that satisfies the balance criterion.

The above rerandomization procedure considers the Mahalanobis distance to define the balance criterion, and $\boldsymbol{I}^{\top}\{\widehat{Var}(\boldsymbol{I})\}^{-1}\boldsymbol{I}$ follows a chi-squared distribution asymptotically. The balance threshold $t$ can then be selected to adjust the rejection rate. For example, letting $q$ denote the dimension of $\boldsymbol{X}^{r}$ , then $t$ being the $\alpha$ -quantile of $\mathcal{X}^{2}_{q}$ leads to an approximate rejection rate $1-\alpha$ . In general, smaller $t$ corresponds to stronger control over imbalance; if $t\rightarrow+\infty$ , then rerandomization reduces to simple randomization. From a statistical perspective, rerandomization introduces correlation among $(A_{1},\dots,A_{n})$ and hence correlation among observed data $(\boldsymbol{O}_{1},\dots,\boldsymbol{O}_{n})$ , leading to a key complication in studying large-sample results of treatment effect estimators.

Beyond the Mahalanobis distance, rerandomization can also be more generally based on $\boldsymbol{I}^{\top}\widehat{\mathbf{H}}^{-1}\boldsymbol{I}<t$ , where $\widehat{\mathbf{H}}$ is a positive definite matrix (e.g., a diagonal matrix collecting the sample variance for each component of $\boldsymbol{I}$ ). For our main theorems, we focus on rerandomization using the Mahalanobis distance, which leads to the most interpretable results. Nevertheless, we show how our results can be extended to accommodate such a more general distance function in Remarks 1 and 4.

2.3 M-estimation

M-estimator (van der Vaart,, 1998, Section 5) refers to a wide class of estimators that are solutions to estimating equations. Specifically, let $\boldsymbol{\theta}=(\Delta,\boldsymbol{\beta})\in\mathbb{R}^{l+1}$ be an $(l+1)$ -dimensional vector of parameters, where $\Delta\in\mathbb{R}$ is the parameter of interest and $\boldsymbol{\beta}\in\mathbb{R}^{l}$ is an $l$ -dimensional vector of nuisance parameters. An M-estimator $\widehat{\boldsymbol{\theta}}=(\widehat{\Delta},\widehat{\boldsymbol{\beta}})$ for $\boldsymbol{\theta}$ is the solution to

\sum_{i=1}^{n}\boldsymbol{\psi}(\boldsymbol{O}_{i};\boldsymbol{\theta})=\boldsymbol{0},

where $\boldsymbol{\psi}$ is a pre-specified $(l+1)$ -dimensional estimating function. For example, $\boldsymbol{\psi}$ is the score function if $\widehat{\boldsymbol{\theta}}$ is obtained by maximum likelihood estimation. The M-estimator for $\Delta^{*}$ is $\widehat{\Delta}$ . For $a=0,1$ , we denote $\boldsymbol{O}(a)=(R(a)Y(a),R(a),a,\boldsymbol{X})$ and $||\cdot||_{2}$ as the $L_{2}$ matrix norm. Throughout, we assume the following regularity conditions for M-estimation.

Assumption 3 (Regularity conditions).

(1) $\boldsymbol{\theta}\in\boldsymbol{\Theta}$ , a compact subset of $\mathbb{R}^{l+1}$ .
(2) $E[||\boldsymbol{\psi}(\boldsymbol{O}(a);\boldsymbol{\theta})||_{2}^{2}]<\infty$ for all $\boldsymbol{\theta}\in\boldsymbol{\Theta}$ and $a\in\{0,1\}$ . (3) There exists a unique solution $\underline{\boldsymbol{\theta}}=(\underline{\Delta},\underline{\boldsymbol{\beta}})$ , an inner point of $\boldsymbol{\Theta}$ , to the equations $\pi E[\boldsymbol{\psi}(\boldsymbol{O}(1);\boldsymbol{\theta})]+(1-\pi)E[\boldsymbol{\psi}(\boldsymbol{O}(0);\boldsymbol{\theta})]=\boldsymbol{0}$ . (4) The function $\boldsymbol{\theta}\mapsto\boldsymbol{\psi}(\boldsymbol{o};\boldsymbol{\theta})$ is twice continuously differentiable for every $\boldsymbol{o}$ in the support of $\boldsymbol{O}$ and is dominated by an integrable function. (5) There exist an integrable function $v(\boldsymbol{o})$ that dominates the first and second derivatives of $\boldsymbol{\psi}(\boldsymbol{o};\boldsymbol{\theta})$ in $||\boldsymbol{\theta}-\underline{\boldsymbol{\theta}}||_{2}<C$ for some $C>0$ . (6) $E[||\frac{\partial}{\partial\boldsymbol{\theta}}\boldsymbol{\psi}(\boldsymbol{O}(a);\boldsymbol{\theta})\big{|}_{\boldsymbol{\theta}=\underline{\boldsymbol{\theta}}}||_{2}^{2}]<\infty$ for $a\in\{0,1\}$ , and $\pi E[\frac{\partial}{\partial\boldsymbol{\theta}}\boldsymbol{\psi}(\boldsymbol{O}(1);\boldsymbol{\theta})\big{|}_{\boldsymbol{\theta}=\underline{\boldsymbol{\theta}}}]+(1-\pi)E[\frac{\partial}{\partial\boldsymbol{\theta}}\boldsymbol{\psi}(\boldsymbol{O}(0);\boldsymbol{\theta})\big{|}_{\boldsymbol{\theta}=\underline{\boldsymbol{\theta}}}]$ is invertible.

Assumption 3 are largely moment and continuity assumptions to rule out irregular or degenerate M-estimators, which are similar to those invoked in Section 5.3 of van der Vaart, (1998). It is important to note that when the M-estimator is defined based on a working regression model, Assumption 3 does not necessarily imply the working model is correctly specified. For example, if the analysis of covariance (ANCOVA, Example 1 in Section 4) is used, Assumption 3 (2)-(6) simply reduce to assuming $\boldsymbol{W}$ has finite fourth moments and invertible covariance matrices, rather than assuming the ANCOVA model is the true data generating process.

Under simple randomization, it has been established that Assumptions 1 and 3 yield $\widehat{\Delta}\xrightarrow{p}\underline{\Delta}$ , i.e., $\widehat{\Delta}$ converge in probability to a limit quantity $\underline{\Delta}$ , and $\sqrt{n}(\widehat{\Delta}-\underline{\Delta})\xrightarrow{d}N(0,V)$ , i.e., $\sqrt{n}$ -rate asymptotic normality of $\widehat{\Delta}$ (van der Vaart,, 1998). To achieve consistent estimation, i.e., $\widehat{\Delta}\xrightarrow{p}\Delta^{*}$ , one needs to choose appropriate working models (e.g., ANCOVA in randomized experiments) or make additional assumptions (e.g., Assumption 2 in the presence of missing outcomes) such that $\Delta^{*}=\underline{\Delta}$ ; we refer to Section 4 for detailed examples. In the next section, we formally describe the asymptotic behavior of $\widehat{\Delta}$ under rerandomization.

3 Asymptotics for M-estimators under rerandomization

We state our first asymptotic result below.

Theorem 1.

Given Assumptions 1 and 3, and under either simple randomization or rerandomization, we have $\widehat{\Delta}\xrightarrow{p}\underline{\Delta}$ and asymptotic linearity, i.e.,

\sqrt{n}(\widehat{\Delta}-\underline{\Delta})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}IF(\boldsymbol{O}_{i})+o_{p}(1),

(1)

where the influence function $IF(\boldsymbol{O}_{i})$ is the first entry of $-\mathbf{B}^{-1}\boldsymbol{\psi}(\boldsymbol{O}_{i};\underline{\boldsymbol{\theta}})$ with
$\mathbf{B}=\pi E\left[\frac{\partial}{\partial\boldsymbol{\theta}}\boldsymbol{\psi}(\boldsymbol{O}(1);\boldsymbol{\theta})\big{|}_{\boldsymbol{\theta}=\underline{\boldsymbol{\theta}}}\right]+(1-\pi)E\left[\frac{\partial}{\partial\boldsymbol{\theta}}\boldsymbol{\psi}(\boldsymbol{O}(0);\boldsymbol{\theta})\big{|}_{\boldsymbol{\theta}=\underline{\boldsymbol{\theta}}}\right]$ .

Furthermore, under rerandomization with Mahalanobis distance as the balance criterion,

\sqrt{n}(\widehat{\Delta}-\underline{\Delta})\xrightarrow{d}\sqrt{V}\left(\sqrt{1-R^{2}}\ z+\sqrt{R^{2}}\ r_{q,t}\right),

(2)

where $z\sim N(0,1)$ , $r_{q,t}\sim D_{1}|(\boldsymbol{D}^{\top}\boldsymbol{D}<t)$ for $\boldsymbol{D}\sim N(\boldsymbol{0},\mathbf{I}_{q})$ with $D_{1}$ being the first element of $\boldsymbol{D}$ and $r_{q,t}$ being independent of $z$ , $V$ is the asymptotic variance of $\widehat{\Delta}$ under simple randomization, and

R^{2}=\frac{\pi(1-\pi)}{V}Cov[IF\{\boldsymbol{O}(1)\}-IF\{\boldsymbol{O}(0)\},\boldsymbol{X}^{r}]Var(\boldsymbol{X}^{r})^{-1}Cov[\boldsymbol{X}^{r},IF\{\boldsymbol{O}(1)\}-IF\{\boldsymbol{O}(0)\}].

Consistent estimators for $V$ and $R^{2}$ are provided in the Supplementary Material.

A central finding in Theorem 1 is that the convergence in probability and asymptotic linearity properties remain identical under simple randomization and rerandomization. As a result, if an M-estimator is consistent to the average treatment effect ( $\underline{\Delta}=\Delta^{*}$ ) under simple randomization, then it remains consistent under rerandomization. Furthermore, the influence function of an M-estimator remains unchanged under rerandomization.

In addition, Equation (2) provides the asymptotic distribution of an M-estimator under rerandomization. This result is similar to Theorem 1 of Li et al., (2018) and Theorem 1 of Li and Ding, (2020) (derived under a finite-population design-based framework) if $\Delta^{*}=E[Y(1)-Y(0)]$ and $\widehat{\Delta}$ is the unadjusted difference-in-means estimator or the linearly-adjusted estimator. Under a super-population sampling-based framework, our new contribution in Theorem 1 is to formally provide the asymptotic theory for a wider class of estimators as well as a variety of estimands under rerandomization.

Different from simple randomization that leads to asymptotic normality of an M-estimator, rerandomization may correspond to a non-normal asymptotic distribution. Specifically, the M-estimator converges weakly to a distribution given by a summation of two independent components—a normal variate $\sqrt{V}\sqrt{1-R^{2}}\ z$ and an independent truncated normal variate $\sqrt{V}\sqrt{R^{2}}\ r_{q,t}$ by $\boldsymbol{D}^{\top}\boldsymbol{D}<t$ . Here, $R^{2}\in[0,1]$ quantifies the variance of linear projection of $\widehat{\Delta}$ on $I$ scaled by $V$ , representing the fraction of variance in the M-estimator that can be explained by $\boldsymbol{X}^{r}$ . Our $R^{2}$ shares the same interpretation as it is defined in Li et al., (2018), except that we use the influence function to handle M-estimators. According to Morgan and Rubin, (2012); Li et al., (2018), the density of $\sqrt{V}\left(\sqrt{1-R^{2}}\ z+\sqrt{R^{2}}\ r_{q,t}\right)$ is symmetric about zero and bell-shaped, resembling a normal distribution. However, its variance is $V\{1-(1-v_{q,t})R^{2}\}$ , where $v_{q,t}=P(\mathcal{X}^{2}_{q+2}<t)/P(\mathcal{X}^{2}_{q}<t)<1$ , which implies that $V\{1-(1-v_{q,t})R^{2}\}\leq V$ . As a result, rerandomization does not increase the asymptotic variance of an M-estimator, compared to simple randomization. In fact, under a design-based framework, Theorem 2 of Li et al., (2018) provided insights into this variance reduction (as well as the corresponding change in the confidence interval lengths): the difference in asymptotic variance is non-decreasing in $R^{2}$ and non-increasing in both $t$ and $q$ . As a result of Theorem 1, this observation continues to hold for M-estimators.

For statistical inference, we propose consistent estimators $\widehat{V}$ and $\widehat{R}^{2}$ in Section A of the Supplementary Material. The 95% confidence interval for $\widehat{\Delta}$ can then be approximated by the following Monte Carlo approach. We first generate a large number of draws ( $m=10,000$ ) of $z$ and $r_{q,t}$ , e.g., via rejection sampling. We then compute interval $\widehat{\Delta}+\sqrt{\widehat{V}/n}\left(\sqrt{1-\widehat{R}^{2}}\ z+\sqrt{\widehat{R}^{2}}\ r_{q,t}\right)$ for each draw, and take the $2.5\%$ and $97.5\%$ quantiles of its empirical distribution to obtain the final confidence interval estimator $\widehat{CI}$ . As $m,n\rightarrow\infty$ , we have asymptotic nominal coverage such that $P(\Delta^{*}\in\widehat{CI})\rightarrow 0.95$ .

Remark 1.

When rerandomization is based on a more general balance criterion $\boldsymbol{I}^{\top}\widehat{\mathbf{H}}^{-1}\boldsymbol{I}<t$ , Theorem 1 remains the same, except that the asymptotic distribution in (2) is now $\sqrt{V}\sqrt{1-R^{2}}\ z+Cov[IF\{\boldsymbol{O}(1)\}-IF\{\boldsymbol{O}(0)\},\boldsymbol{X}^{r}]\mathbf{V}_{I}^{-1/2}\boldsymbol{R}_{\mathbf{H},t}$ , where $\boldsymbol{R}_{\mathbf{H},t}\sim\boldsymbol{D}|(\boldsymbol{D}^{\top}\mathbf{V}_{I}^{1/2}\underline{\mathbf{H}}^{-1}\mathbf{V}_{I}^{1/2}\boldsymbol{D}<t)$ and $\mathbf{V}_{I}=\{\pi(1-\pi)\}^{-1}Var(\boldsymbol{X}^{r})$ with $\underline{\mathbf{H}}$ being the probability limit of $n\widehat{\mathbf{H}}$ . Compared to Equation (2), the first component of the asymptotic distribution remains the same, but the second component becomes more complex and depends on the choice of $\widehat{\mathbf{H}}$ . Nevertheless, as long as $\underline{\mathbf{H}}$ is positive definite, rerandomization still leads to no increase in the asymptotic variance. In addition, the same sampling-based confidence interval can still be constructed with a Monte Carlo approach via rejection sampling of $\boldsymbol{R}_{\mathbf{H},t}$ . Similar results were firstly pointed out by Lu et al., (2023) under a design-based framework in the context of cluster rerandomization (for unadjusted and linearly-adjusted estimators); we extend their results to a wider class of M-estimators (proof in the Supplementary Material).

Remark 2.

Theorem 1 can be further extended to handle rerandomization given tiers of covariates (Morgan and Rubin,, 2015). To elaborate, for $b=1,\dots,B$ , let $\boldsymbol{I}_{b}=\mathbf{G}_{b}\boldsymbol{I}$ be any subvector of $\boldsymbol{I}$ with $\mathbf{G}_{b}$ being the transformation matrix. In addition, $\widehat{\mathbf{H}}_{b}$ and $t_{b}>0$ are the weighting matrix and threshold defining the balance criterion for the $b$ -th tier of covariates. The second step of rerandomization is changed to accepting randomization if $\boldsymbol{I}_{b}^{\top}\widehat{\mathbf{H}}_{b}^{-1}\boldsymbol{I}_{b}<t_{b}$ for all $b=1,\dots,B$ . Under this design, Theorem 1 still holds with the asymptotic distribution described in Remark 1, except that here $\boldsymbol{R}_{\mathbf{H},t}\sim\boldsymbol{D}|(\boldsymbol{D}^{\top}\mathbf{V}_{I}^{1/2}\mathbf{G}_{b}\underline{\mathbf{H}}_{b}^{-1}\mathbf{G}_{b}^{\top}\mathbf{V}_{I}^{1/2}\boldsymbol{D}<t_{b}:b=1,\dots,B)$ , with $\underline{\mathbf{H}}_{b}$ defined as the probability limit of $n\widehat{\mathbf{H}}_{b}$ .

4 Examples

In this section, we survey common M-estimators used for analyzing randomized experiments and provide important special cases where $R^{2}=0$ , meaning that the M-estimator has fully accounted for rerandomization (in other words, its variance, $V$ , cannot be further explained by $\boldsymbol{X}^{r}$ ). Therefore, in these cases, each M-estimator remains asymptotically normal under rerandomization. These examples serve as important clarifications on when standard inference may be carried out ignoring rerandomization. Among the four examples, estimators in Examples 1, 2, and 4 are model-robust and hence consistent under arbitrary working model misspecification under simple randomization; see discussions in Tsiatis et al., (2008); Colantuoni and Rosenblum, (2015); Wang et al., (2021); this model-robustness property is carried under rerandomization as implied by Theorem 1. Example 3 focuses on experiments with missing outcomes, and we additionally require Assumption 2 to establish the double robustness property under simple randomization (Robins et al.,, 2007); this property remains to hold under rerandomization as a result of Theorem 1.

Example 1 (ANCOVA estimator).

In randomized experiments, the ANCOVA estimator considers the working model $E[Y|A,\boldsymbol{X}]=\beta_{0}+\beta_{A}A+\boldsymbol{\beta}_{\boldsymbol{X}}^{\top}\boldsymbol{X}$ and uses ordinary least squares to obtain estimators $(\widehat{\beta}_{0},\widehat{\beta}_{A},\widehat{\boldsymbol{\beta}}_{X})$ for parameters $(\beta_{0},\beta_{A},\boldsymbol{\beta}_{X})$ . In this context, $\widehat{\Delta}=\widehat{\beta}_{A}$ is the ANCOVA estimator for the average treatment effect estimand, $\Delta^{*}=E[Y(1)-Y(0)]$ .

Example 2 (G-computation estimator with working logistic regression).

When the outcome is binary, the logistic regression model, $\text{logit}\{P(Y=1|A,\boldsymbol{X})\}=\beta_{0}+\beta_{A}A+\boldsymbol{\beta}_{\boldsymbol{X}}^{\top}\boldsymbol{X}$ , is commonly used to analyze randomized experiments. With maximum likelihood estimators $(\widehat{\beta}_{0},\widehat{\beta}_{A},\widehat{\boldsymbol{\beta}}_{X})$ , we construct $\widehat{p}_{i}(a)=\text{logit}^{-1}(\widehat{\beta}_{0}+\widehat{\beta}_{A}a+\widehat{\boldsymbol{\beta}}_{\boldsymbol{X}}^{\top}\boldsymbol{X}_{i})$ for $a=0,1$ . The g-computation estimator for the average treatment effect on any scale $\Delta^{*}=f(E[Y(1)],E[Y(0)])$ is $\widehat{\Delta}=f(n^{-1}\sum_{i=1}^{n}\widehat{p}_{i}(1),n^{-1}\sum_{i=1}^{n}\widehat{p}_{i}(0))$ .

Example 3 (DR-WLS estimator for handling missing outcomes).

In the presence of missing outcomes, the doubly-robust weighted least square (DR-WLS) estimator involves the following steps. First, we fit a logistic regression working model to predict missingness, i.e., $\text{logit}\{P(R=1|\boldsymbol{X})\}=\alpha_{0}+\alpha_{A}A+\boldsymbol{\alpha}_{X}\boldsymbol{X}$ , and obtain the missingness propensity score as $\widehat{P}(R=1|A,\boldsymbol{X})=\text{logit}^{-1}(\widehat{\alpha}_{0}+\widehat{\alpha}_{A}A+\widehat{\boldsymbol{\alpha}}_{X}\boldsymbol{X})$ . Second, we fit an outcome regression working model, $E[Y|A,\boldsymbol{X}]=g^{-1}(\beta_{0}+\beta_{A}A+\boldsymbol{\beta}_{X}\boldsymbol{X})$ , using data with $R_{i}=1$ and canonical link function $g$ , weighted by the inverse propensity score $\{\widehat{P}(R=1|A_{i},\boldsymbol{X}_{i})\}^{-1}$ . We denote the model fit as $\widehat{E}[Y|A,\boldsymbol{X}]=g^{-1}(\widehat{\beta}_{0}+\widehat{\beta}_{A}A+\widehat{\boldsymbol{\beta}}_{X}\boldsymbol{X})$ . Finally, we apply a g-computation estimator $\widehat{\Delta}=f\left(n^{-1}\sum_{i=1}^{n}\widehat{E}[Y|1,\boldsymbol{X}_{i}],n^{-1}\sum_{i=1}^{n}\widehat{E}[Y|0,\boldsymbol{X}_{i}]\right)$ to estimate the average treatment effect estimand, $\Delta^{*}=f(E[Y(1)],E[Y(0)])$ . We assume missing at random (Assumption 2) and at least one of the two working models is correctly specified.

Example 4 (Mixed-ANCOVA estimator under cluster randomization).

When rerandomization is at the cluster level, the outcomes for each cluster become an $N_{i}$ -dimensional vector $\boldsymbol{Y}_{i}=(Y_{i1},\dots,Y_{iN_{i}})$ , where $N_{i}$ is the size of cluster $i$ . In addition, the covariates $\boldsymbol{X}_{i}$ is a collection of individual covariates $\{\boldsymbol{X}_{i1},\dots,\boldsymbol{X}_{iN_{i}}\}$ . For simplicity, we assume non-informative cluster sizes, i.e., $N_{i}$ is independent of all $(Y_{ij},A_{i},\boldsymbol{X}_{ij})$ , and all $(Y_{ij},A_{i},\boldsymbol{X}_{ij})$ are identically distributed, but they can have arbitrary with-cluster correlation. Then, rerandomization is based on $\boldsymbol{X}_{i}^{r}$ being a summary function of $(\boldsymbol{X}_{i1}^{r},\dots,\boldsymbol{X}_{iN_{i}}^{r})$ , e.g., ${N_{i}}^{-1}\sum_{j=1}^{N_{i}}\boldsymbol{X}_{ij}^{r}$ . The mixed-ANCOVA estimator involves fitting the linear mixed model $Y_{ij}={\beta}_{0}+{\beta}_{A}A_{i}+{\boldsymbol{\beta}}_{X}\boldsymbol{X}_{ij}+\delta_{i}+\varepsilon_{ij}$ , where $\delta_{i}\sim N(0,\tau^{2})$ is the random intercept and $\varepsilon_{ij}\sim N(0,\sigma^{2})$ is the independent error. Under maximum likelihood estimation, we can construct a model-robust estimator $\widehat{\Delta}=\widehat{\beta}_{A}$ for the average treatment effect estimand $\Delta^{*}=E[Y_{ij}(1)-Y_{ij}(0)]$ .

Each estimator described in Examples 1-4 can be cast as an M-estimator (with the corresponding estimating function provided in Supplementary Material) and its consistency and asymptotic linearity properties directly carry from simple randomization to rerandomization. Furthermore, Theorem 2 clarifies, for each estimator, when asymptotic normality holds under rerandomization and thus inference can proceed by ignoring rerandomization.

Theorem 2.

Assume Assumptions 1-3 and that $\boldsymbol{X}$ includes $\boldsymbol{X}^{r}$ as a subset for covariate adjustment. Then, for the ANCOVA, logistic regression, DR-WLS, and mixed-ANCOVA estimators described in Examples 1-4, we have $\sqrt{n}(\widehat{\Delta}-\Delta^{*})\xrightarrow{d}N(0,V)$ under rerandomization if at least one of the following conditions holds: (1) $\pi=0.5$ and $\Delta^{*}=E[Y(1)-Y(0)]$ is the estimand on the difference scale, or (2) the outcome regression model further includes a treatment-by-covariate interaction term $\boldsymbol{\beta}_{AX^{r}}A\boldsymbol{X}^{r}$ for $\boldsymbol{X}^{r}$ , or (3) the outcome regression model is correctly specified. In other words, $R^{2}=0$ under any of conditions (1), (2), or (3).

Remark 3.

While the ANCOVA and mixed-ANCOVA estimators target $\Delta^{*}=E[Y(1)-Y(0)]$ through the treatment coefficient, the working models can be adapted to estimate treatment effect estimands on other scales through a g-computation procedure (as described in Examples 2 and 3). When treatment-by-covariate interaction terms are included, they become the ANCOVA2 (Tsiatis et al.,, 2008) and mixed-ANCOVA2 estimators (Wang et al.,, 2021), for which g-computation is needed for consistent estimation under simple randomization and rerandomization. For the DR-WLS estimator, Theorem 2 does not require adjusting for $\boldsymbol{X}^{r}$ in the missingness model to obtain asymptotic normality, although this is still recommended as including $\boldsymbol{X}^{r}$ may further improve efficiency.

Theorem 2 highlights the benefit of adjusting for baseline variables that are used for rerandomization. In this case, inference with many estimators under rerandomization can be performed as if simple randomization were used in the design stage; thus rerandomization becomes ignorable in the analysis stage. This result has two implications. First, the choice of threshold $t$ or the weighting matrix $\widehat{\mathbf{H}}$ has no impact asymptotically for the considered estimators in Examples 1-4. Therefore, the question of how one should select these design parameters for optimizing the asymptotic precision is immaterial in large samples as the design parameters do not impact the asymptotic distribution. Second, many existing results about variance under simple randomization naturally extend to rerandomization. For example, for estimating the average treatment effect on the difference scale, the ANCOVA estimator under condition (1) or (2) in Theorem 2 does not reduce the asymptotic precision by adjusting for baseline covariates under simple randomization (Tsiatis et al.,, 2008); the same result holds under rerandomization as long as the adjustment set included $\boldsymbol{X}^{r}$ . As another example, when condition (1) holds, not only ANCOVA and mixed-ANCOVA provide consistent point estimates, their model-based variance estimators are also consistent (Wang et al.,, 2019, 2021); such properties hold under rerandomization by Theorem 2.

On the contrary, if $\boldsymbol{X}^{r}$ is not included in the adjustment set for the above example estimators, Theorem 2 may not hold, and the asymptotic distribution of the estimator is non-normal, according to Theorem 1. Intuitively, this is the scenario where $\boldsymbol{X}^{r}$ can further explain the variance of $\widehat{\Delta}$ so that $R^{2}>0$ . If inference is performed based on asymptotic results derived under simple randomization (e.g., using normal approximation), then the resulting confidence interval may suffer from power loss, although the inference remains valid. Given the above discussion, we recommend adjusting for rerandomization variables when using M-estimators described in Examples 1-4, such that conventional inferential procedure based on asymptotic normality can be justified by Theorem 2.

5 Extensions to stratified rerandomization

5.1 Defining stratified rerandomization

In practice, we may want to control the treatment imbalance on $\boldsymbol{X}^{r}$ at level $t$ and, meanwhile, eliminate the treatment imbalance on certain categorical variables. We refer to this design as stratified rerandomization (Wang et al., 2023c, ). Compared to rerandomization defined in Section 2.2, this design further ensures a perfectly balanced assignment within each stratum. To formally define this scheme, we first introduce stratified randomization without rerandomization. Stratified (block) randomization refers to a randomization scheme that achieves exact balance within each pre-specified stratum. Let $S\in\boldsymbol{X}$ be a categorical baseline variable encoding the randomization strata, e.g., $S$ taking values in $\mathcal{S}=\{\textup{male smoker},\ \textup{female smoker},\ \textup{male non-smoker},\ \textup{female non-smoker}\}$ if the randomization strata are defined by sex and smoking status. Within each randomization stratum, a size- $k$ permutation block with $\pi k$ ones and $(1-\pi)k$ zeros in random order is independently sampled to assign the treatment for the first $k$ participants (1 for treatment and 0 for control). When a permutation block is exhausted, a new one is sampled to assign the treatment for the next $k$ participants. The block size $k$ is chosen to make $\pi k$ an integer. Unlike simple randomization where treatment assignment is independent, stratified randomization directly introduces correlation among treatment assignment and observed data $(\boldsymbol{O}_{1},\dots,\boldsymbol{O}_{n})$ .

Stratified rerandomization combines stratified randomization and rerandomization through the following steps. First, we generate $\widetilde{A}_{1},\dots,\widetilde{A}_{n}$ by stratified randomization based on randomization strata $\{S_{1},\dots,S_{n}\}$ and parameter $\pi\in(0,1)$ . Second, we compute

	$\displaystyle\widetilde{\boldsymbol{I}}$	$\displaystyle=\frac{1}{\widetilde{N}_{1}}\sum_{i=1}^{n}\widetilde{A}_{i}\boldsymbol{X}_{i}^{r}-\frac{1}{\widetilde{N}_{0}}\sum_{i=1}^{n}(1-\widetilde{A}_{i})\boldsymbol{X}_{i}^{r},$
	$\displaystyle\widehat{Var}(\widetilde{\boldsymbol{I}})$	$\displaystyle=\frac{n}{\widetilde{N}_{1}\widetilde{N}_{0}}\left[\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{X}^{r}_{i}\boldsymbol{X}^{r}_{i}{}^{\top}-\sum_{s\in\mathcal{S}}\frac{1}{n}\sum_{i=1}^{n}I\{S_{i}=s\}\overline{\boldsymbol{X}_{s}^{r}}\overline{\boldsymbol{X}_{s}^{r}}^{\top}\right],$

where $\widetilde{N}_{1}=\sum_{i=1}^{n}\widetilde{A}_{i}$ , $\widetilde{N}_{0}=n-\widetilde{N}_{1}$ , and $\overline{\boldsymbol{X}_{s}^{r}}=(\sum_{i=1}^{n}I\{S_{i}=s\})^{-1}\sum_{i=1}^{n}\boldsymbol{X}_{i}^{r}I\{S_{i}=s\}$ . In the last step, if $\widetilde{\boldsymbol{I}}^{\top}\{\widehat{Var}(\widetilde{\boldsymbol{I}})\}^{-1}\widetilde{\boldsymbol{I}}<t$ for a pre-specified balance threshold $t$ , we set the final treatment assignment $(A_{1},\dots,A_{n})$ to be $(\widetilde{A}_{1},\dots,\widetilde{A}_{n})$ ; otherwise, we return to the first step to obtain the first randomization scheme that satisfies the balance criterion.

Compared to rerandomization defined in Section 2.2, stratified rerandomization involves two changes: the first step replaces simple randomization by stratified randomization, and the variance estimator $\widehat{Var}(\widetilde{\boldsymbol{I}})$ is updated to reflect the true variance of the imbalance statistic under stratification, which is smaller than under simple randomization. This design is asymptotically equivalent to the first stratified rerandomization criterion (overall Mahalanobis distance) of Wang et al., 2023c since we assume $\pi$ is constant across randomization strata—the most common setting in randomized experiments. Without loss of generality, we assume $S\notin\boldsymbol{X}^{r}$ ; otherwise, rerandomization is effectively controlling for imbalance on $\boldsymbol{X}^{r}\setminus S$ . Under stratified rerandomization, we next develop the large-sample properties for M-estimators regarding the general estimand $\Delta^{*}$ .

5.2 Asymptotics for M-estimators under stratified rerandomization

Theorem 3.

Given Assumptions 1 and 3, and under stratified rerandomization, we have $\widehat{\Delta}\xrightarrow{p}\underline{\Delta}$ and asymptotic linearity as in Equation (1), and

\displaystyle\sqrt{n}(\widehat{\Delta}-\underline{\Delta})\xrightarrow{d}\sqrt{\widetilde{V}}\left\{\sqrt{1-\widetilde{R}^{2}}\ z+\sqrt{\widetilde{R}^{2}}\ r_{q,t}\right\},

(3)

	$\displaystyle\widetilde{V}$	$\displaystyle=V-\pi(1-\pi)E\left[E[IF\{\boldsymbol{O}(1)\}-IF\{\boldsymbol{O}(0)\}\|S]^{2}\right],$
	$\displaystyle\widetilde{R}^{2}$	$\displaystyle=\frac{\pi(1-\pi)}{\widetilde{V}}\boldsymbol{C}^{\top}E[Var(\boldsymbol{X}^{r}\|S)]^{-1}\boldsymbol{C}.$

with $\boldsymbol{C}=E[Cov[\boldsymbol{X}^{r},IF\{\boldsymbol{O}(1)\}-IF\{\boldsymbol{O}(0)\}|S]]$ and $V$ being the asymptotic variance under simple randomization. Consistent estimators for $\widetilde{V}$ and $\widetilde{R}^{2}$ are provided in the Supplementary Material.

Theorem 3 parallels Theorem 1 in terms of convergence in probability, asymptotic linearity, and weak convergence results. The major difference lies in the constant factors of the asymptotic distribution. In particular, $\widetilde{V}$ substitutes $V$ in Equation (3) to account for the variance reduction from stratification (Wang et al., 2023b, ). Likewise, compared to Theorem 1, $\widetilde{R}^{2}$ is also updated by replacing marginal covariance with the expectation of conditional covariance. Despite these changes, the properties of the asymptotic distribution remain similar to those under rerandomization, and statistical inference considerations under rerandomization discussed in Section 3 still apply to stratified rerandomization. Finally, if we consider the unadjusted estimator, the difference estimand $\Delta^{*}=E[Y(1)-Y(0)]$ , and fixed strata categories, Theorem 3 becomes the counterpart of Theorem 3 of Wang et al., 2023c under a sampling-based framework. Theorem 3, however, includes a wider class of M-estimators and addresses a more general class of estimands $\Delta^{*}$ (e.g. to allow for ratio estimands).

Remark 4.

If rerandomization is based on $\boldsymbol{I}^{\top}\widehat{\mathbf{H}}^{-1}\boldsymbol{I}<t$ , Theorem 3 remains the same with the asymptotic distribution in Equation (3) updated to $\sqrt{\widetilde{V}}\sqrt{1-\widetilde{R}^{2}}\ z+\boldsymbol{C}^{\top}\widetilde{\mathbf{V}}_{I}^{-1/2}\widetilde{\boldsymbol{R}}_{\mathbf{H},t}$ , where $\widetilde{\boldsymbol{R}}_{\mathbf{H},t}\sim\boldsymbol{D}|(\boldsymbol{D}^{\top}\widetilde{\mathbf{V}}_{I}^{1/2}\underline{\mathbf{H}}^{-1}\widetilde{\mathbf{V}}_{I}^{1/2}\boldsymbol{D}<t)$ and $\widetilde{\mathbf{V}}_{I}=\{\pi(1-\pi)\}^{-1}E[Var(\boldsymbol{X}^{r}|S)]$ , and $\underline{\mathbf{H}}$ is the probability limit of $n\widehat{\mathbf{H}}$ . As long as $\underline{\mathbf{H}}$ is positive definite, stratified rerandomization does not lead to asymptotic precision reduction over stratified or simple randomization.

Remark 5.

Like rerandomization, stratified rerandomization can also accommodate tiers of covariates in a similar way as in Remark 2. However, under our framework, Theorem 3 is not implied by Thoerem 1 with tiers of covariates (by specifying $\boldsymbol{I}_{b}$ to $I\{S_{i}=s\}$ for each $s\in\mathcal{S}$ and letting $t_{b}\rightarrow 0$ ). This is because $\sum_{s\in\mathcal{S}}I\{S_{i}=s\}=1$ and so the variance of $\{I\{S_{i}=s\}:s\in\mathcal{S}\}$ is rank deficient. Nevertheless, if we enforce $\sum_{i=1}^{n}A_{i}=\pi n$ in rerandomization as done in Li et al., (2018), then rerandomization with tiers of covariates accommodates stratified randomization as a special case (Wang et al., 2023c, ); these schemes all fit into stratified rerandomization with tiers of covariates under our framework.

5.3 Continued examples

For Examples 1-4 in Section 4, their model-robustness or double-robustness property remains the same under stratified rerandomization. We thus extend Theorem 3 for stratified rerandomization given below.

Theorem 4.

Assume Assumptions 1-3 and that $\boldsymbol{X}$ includes $\boldsymbol{X}^{r}$ as a subset and $S$ as dummy variables for covariate adjustment. Then, for the ANCOVA, logistic regression, DR-WLS, or mixed-ANCOVA estimator in Examples 1-4, we have $\sqrt{n}(\widehat{\Delta}-\Delta^{*})\xrightarrow{d}N(0,V)$ under stratified rerandomization if at least one of the following conditions holds: (1) $\pi=0.5$ and $\Delta^{*}=E[Y(1)-Y(0)]$ , or (2) the outcome regression model further includes treatment-by-covariate interaction terms for both $\boldsymbol{X}^{r}$ and $S$ , or (3) the outcome regression model is correctly specified. In other words, $\widetilde{R}^{2}=0$ and $\widetilde{V}=V$ under condition (1), (2), or (3).

Theorem 4 clarifies that adjusting for both $\boldsymbol{X}^{r}$ and $S$ in the working models offers asymptotic normality of the considered M-estimators under stratified rerandomization. In this case, the asymptotic distribution for each estimator is no different from that under simple randomization, and asymptotic results developed for those estimators under simple randomization can directly apply even though the actual assignment comes from a stratified rerandomization procedure. Among the three conditions, conditions (1) and (3) remain the same as Theorem 3, but condition (2) is modified to further include additional treatment-by-stratum interaction terms (as intuitively, stratum variable is part of the randomization procedure). Similar to Theorem 3, the threshold $t$ and weighting matrix $\widehat{\mathbf{H}}$ do not contribute to the asymptotic distribution, suggesting that the choice of these design parameters in general does not impact the asymptotic inference of the considered treatment effect estimators. Due to the appeal of applying normal approximation for inference, we maintain our recommendation that all rerandomization variables (including rerandomization variables and dummy stratum variables) be adjusted for in the analysis.

6 Efficient inference with machine learning

Beyond traditional M-estimators based on parametric working models, data-adaptive machine learning models have also been studied in randomized experiments under simple randomization, typically through the vehicle of efficient influence function (van der Laan et al.,, 2011; Chernozhukov et al.,, 2018). While these estimators provide flexible covariate adjustment and achieve the asymptotic efficiency lower bound under simple randomization, their properties under rerandomization or stratified rerandomization remain unknown. In this section, we turn our focus to a class of efficient estimators via the efficient influence function and obtain their asymptotic distribution under rerandomization and stratified rerandomization. We focus on the setting where outcomes are missing at random (Assumption 2), but our theory can be straightforwardly extended to accommodate other settings (e.g., cluster-randomized experiments), where efficient estimation has been investigated under simple randomization (Wang et al., 2023a, ).

6.1 Efficient inference under rerandomization

Under rerandomization, we study the efficient estimator that uses double machine learning (DML) for estimating nuisance functions with cross-fitting (Chernozhukov et al.,, 2018). Let $\widehat{\eta}_{a}(\boldsymbol{X};\mathcal{D}),\widehat{\kappa}_{a}(\boldsymbol{X};\mathcal{D})$ denote the pre-specified estimators for $E[Y(a)|\boldsymbol{X}],E[R(a)|\boldsymbol{X}]$ trained on data $\mathcal{D}$ and evaluated at $\boldsymbol{X}$ . For cross-fitting, we randomly partition the index set $\{1,\dots,n\}$ into $K$ folds with approximately equal sizes. Specifically, let $\mathcal{I}_{k}$ denote the index set for the $k$ -th fold, we have $\{1,\dots,n\}=\cup_{k=1}^{K}\mathcal{I}_{k}$ and $|\mathcal{I}_{k}|=K^{-1}n$ . If $K$ does not divide $n$ , we require $||\mathcal{I}_{k}|-K^{-1}n|\leq 1$ instead. Defining $\mathcal{O}_{a,-k}=\{(R_{i}Y_{i},R_{i},\boldsymbol{X}_{i}):i=1,\dots,n;i\notin\mathcal{I}_{k};A_{i}=a\}$ as the training data for fold $k$ within treatment group $a$ , we specify the nuisance function estimators as $\widehat{\eta}_{a}(\boldsymbol{X}_{i})=\sum_{k=1}^{K}I\{i\in\mathcal{I}_{k}\}\widehat{\eta}_{a}(\boldsymbol{X}_{i};\mathcal{O}_{a,-k})$ and $\widehat{\kappa}_{a}(\boldsymbol{X}_{i})=\sum_{k=1}^{K}I\{i\in\mathcal{I}_{k}\}\widehat{\kappa}_{a}(\boldsymbol{X}_{i};\mathcal{O}_{a,-k})$ . Next, the efficient estimator for $E[Y(a)]$ is constructed based on the efficient influence function as

\widehat{\mu}_{a}^{\textup{dml}}=\frac{1}{n}\sum_{i=1}^{n}\left[\frac{I\{A_{i}=a\}}{\pi^{a}(1-\pi)^{1-a}}\frac{R_{i}}{\widehat{\kappa}_{a}(\boldsymbol{X}_{i})}\{Y_{i}-\widehat{\eta}_{a}(\boldsymbol{X}_{i})\}+\widehat{\eta}_{a}(\boldsymbol{X}_{i})\right],

(4)

and output $\widehat{\Delta}^{\textup{dml}}=f(\widehat{\mu}_{1}^{\textup{dml}},\widehat{\mu}_{0}^{\textup{dml}})$ . For this estimator, we assume the following conditions.

Assumption 4 (Conditions for nuisance function estimators).

For $a=0,1$ , we assume (1) $E[\left\{\widehat{\eta}_{a}(\boldsymbol{X};\mathcal{D}_{n})-E[Y(a)|\boldsymbol{X}]\right\}^{2}]^{1/2}=o(n^{-1/4})$ and $E[\left\{\widehat{\kappa}_{a}(\boldsymbol{X};\mathcal{D}_{n})-E[R(a)|\boldsymbol{X}]\right\}^{2}]^{1/2}=o(n^{-1/4})$ for $\mathcal{D}_{n}=\{R_{i}(a)Y_{i}(a),R_{i}(a),\boldsymbol{X}_{i}\}_{i=1}^{n}$ ; (2) $\{\widehat{\kappa}_{a}(\boldsymbol{X};\mathcal{D}_{n})\}^{-1}$ and $E[Y^{2}(a)|\boldsymbol{X}]$ are uniformly bounded.

Assumption 4(1) requires that the nuisance function estimators converge to their target in $L_{2}$ -norm with a rate faster than $n^{1/4}$ , if the training data consist of independent and identically distributed data. This rate can be achieved by existing machine learning methods, including random forests (Wager and Athey,, 2018), deep neural network (Farrell et al.,, 2021), and highly-adpative lasso (Benkeser and Van Der Laan,, 2016), under mild regularity conditions. Of note, for the outcome model, the training data $\mathcal{D}_{n}$ are used to learn $E[R(a)Y(a)|R(a),\boldsymbol{X}]$ , which yields $E[Y(a)|R(a)=1,\boldsymbol{X}]$ and thus $E[Y(a)|\boldsymbol{X}]$ under Assumption 2. In Assumption 4(2), we assume uniform boundedness on several components of $\mathcal{P}^{W}$ and the estimators so that the remainder terms can be appropriately controlled. Overall, Assumption 4 resembles the standard assumptions for efficient inference made in Chernozhukov et al., (2018), and importantly, no extra condition is specifically assumed for addressing rerandomization. Theorem 5 below provides the asymptotic results for $\widehat{\Delta}^{\textup{dml}}$ .

Theorem 5.

Given Assumptions 1, 2, and 4, and under rerandomization, we have consistency, i.e., $\widehat{\Delta}^{\textup{dml}}\xrightarrow{p}\Delta^{*}$ , and asymptotic linearity, i.e., $\sqrt{n}(\widehat{\Delta}^{\textup{dml}}-\Delta^{*})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}EIF(\boldsymbol{O}_{i})+o_{p}(1)$ with

EIF(\boldsymbol{O}_{i})=\sum_{a=0}^{1}f_{a}^{\prime}\left[\frac{I\{A_{i}=a\}}{\pi^{a}(1-\pi)^{1-a}}\frac{R_{i}}{E[R_{i}(a)|\boldsymbol{X}_{i}]}\{Y_{i}-E[Y_{i}(a)|\boldsymbol{X}_{i}]\}+E[Y_{i}(a)|\boldsymbol{X}_{i}]-E[Y_{i}(a)]\right],

(5)

where $f_{a}^{\prime}$ is the partial derivative of $f$ with respect to $E[Y(a)]$ .

Furthermore, $\sqrt{n}(\widehat{\Delta}^{\textup{dml}}-\Delta^{*})\xrightarrow{d}\sqrt{V}\left(\sqrt{1-R^{2}}\ z+\sqrt{R^{2}}\ r_{q,t}\right)$ as described in Equation (2) with $V$ being the asymptotic variance under simple randomization and $R^{2}$ defined in Theorem 1 by setting $IF(\boldsymbol{O}_{i})=EIF(\boldsymbol{O}_{i})$ . Additionally, $R^{2}=0$ if the adjustment set $\boldsymbol{X}$ includes the rerandomization variables $\boldsymbol{X}^{r}$ . Consistent estimators for $V$ and $R^{2}$ are provided in the Supplementary Material.

Theorem 5 justifies the validity of efficient estimation of the average treatment effect estimand under rerandomization. Additionally, the overall structure of its asymptotic expansion resembles that for M-estimators in Theorem 1, with the only difference being that the influence function is given in Equation (5), which is the efficient influence function under simple randomization. If covariate adjustment includes the rerandomization variables, then $\widehat{\Delta}^{\textup{dml}}$ is asymptotically normal with asymptotic variance $V$ —the usual efficiency lower bound given by the variance of $EIF(\boldsymbol{O}_{i})$ under simple randomization. On the contrary, if $\boldsymbol{X}$ does not include all rerandomization variables, then $V$ does not fully account for the precision gain from rerandomization, which leads to a positive $R^{2}$ and non-normal asymptotic distribution.

Comparing efficient estimators with parametric M-estimators, efficient estimators have the advantage of optimal efficiency, while machine learners often require a large sample size to avoid over-fitting. We recommend using efficient estimators if the sample size per arm is no less than 200, and caution against data-adaptive fitting with small sample sizes, e.g., $n\leq 50$ .

Remark 6.

With no missing outcomes, we have $R_{i}\equiv 1$ , and the efficient estimator is simplified by setting $\widehat{\kappa}_{a}(\boldsymbol{X};\mathcal{D}_{n})\equiv 1$ . In this case, Assumption 4(1) is relaxed to only requiring a consistency assumption $E[\left\{\widehat{\eta}_{a}(\boldsymbol{X};\mathcal{D}_{n})-E[Y(a)|\boldsymbol{X}]\right\}^{2}]=o(1)$ without the need for regulating rate of convergence. Under this modified Assumption 4, Theorem 5 still holds and extends a special case of Chernozhukov et al., (2018) from simple randomization to rerandomization.

6.2 Results under stratified rerandomization

When using stratified rerandomization, the efficient estimator in Section 6.1 requires modification to achieve our target asymptotic results. This is because stratification introduces additional correlation among observed data, which cannot be handled by the standard cross-fitting procedure. To overcome this challenge, we perform cross-fitting within each stratum $s\in\mathcal{S}$ and treatment group $a\in\{0,1\}$ , an approach proposed by Rafi, (2023) to address stratification. Specifically, for each $s\in\mathcal{S}$ and $a\in\{0,1\}$ , we randomly partition $\{i=1,\dots,n:S_{i}=s,A_{i}=a\}$ into $k$ folds and denote $\mathcal{I}_{ask}$ as the index set of the $k$ -th fold. Similarly, we require roughly equal fold sizes across $k$ given $a$ and $s$ . We next define the $k$ -th fold in stratum $s$ as $\mathcal{I}_{sk}=\mathcal{I}_{1sk}\cup\mathcal{I}_{0sk}$ and define the training data for this fold as $\mathcal{O}_{as,-k}=\{(R_{i}Y_{i},Y_{i},\boldsymbol{X}_{i}):i=1,\dots,n;i\notin\mathcal{I}_{sk};A_{i}=a;S_{i}=s\}$ . The nuisance function estimators are updated to $\widehat{\eta}_{a}(\boldsymbol{X}_{i})=\sum_{k=1}^{K}\sum_{s\in\mathcal{S}}I\{i\in\mathcal{I}_{sk},S_{i}=s\}\widehat{\eta}_{a}(\boldsymbol{X}_{i};\mathcal{O}_{as,-k})$ and $\widehat{\kappa}_{a}(\boldsymbol{X}_{i})=\sum_{k=1}^{K}\sum_{s\in\mathcal{S}}I\{i\in\mathcal{I}_{sk},S_{i}=s\}\widehat{\kappa}_{a}(\boldsymbol{X}_{i};\mathcal{O}_{as,-k})$ . Then $\widehat{\mu}_{a}^{\textup{dml}}$ is defined as in Equation (4) and $\widehat{\Delta}^{\textup{dml}}=f(\widehat{\mu}_{1}^{\textup{dml}},\widehat{\mu}_{0}^{\textup{dml}})$ . Based on this modified cross-fitting scheme, Theorem 6 gives the counterpart of Theorem 5 under stratified rerandomization.

Theorem 6.

Given Assumptions 1,2, and 4, and under stratified rerandomization, we have the same consistency and asymptotic linearity as in Theorem 5. Furthermore, $\sqrt{n}(\widehat{\Delta}^{\textup{dml}}-\Delta^{*})\xrightarrow{d}\sqrt{\widetilde{V}}\left(\sqrt{1-\widetilde{R}^{2}}\ z+\sqrt{\widetilde{R}^{2}}\ r_{q,t}\right)$ as described in Equation (3) with $\widetilde{V}$ and $\widetilde{R}^{2}$ now defined by setting $IF(\boldsymbol{O}_{i})=EIF(\boldsymbol{O}_{i})$ . Additionally, $\widetilde{R}^{2}=0$ if the adjustment set $\boldsymbol{X}$ includes the rerandomization variables $\boldsymbol{X}^{r}$ and the dummy stratum variables $S$ . Consistent estimators for $\widetilde{V}$ and $\widetilde{R}^{2}$ are provided in the Supplementary Material.

7 Simulations

We conduct two simulation studies to demonstrate our asymptotic results. The first simulation focuses on continuous outcomes with a difference estimand $\Delta^{*}=E[Y(1)-Y(0)]$ , and the second simulation focuses on binary outcomes with a ratio estimand $\Delta^{*}=E[Y(1)]/E[Y(0)]$ . Both simulations consider complete outcomes and outcomes under missing at random (Assumption 2), under three randomization procedures: simple randomization, rerandomization, and stratified rerandomization.

7.1 Simulation design

For the first simulation with continuous outcomes, we set $n=400$ and independently generate three baseline covariates $X_{i1}\sim\mathcal{N}(1,1)$ , $S_{i}|X_{i1}\sim\mathcal{B}(0,0.4+0.2I\{X_{i1}<1\})$ , and $X_{i2}|(X_{i1},S_{i})\sim\mathcal{N}(0,1)$ for $i=1,\dots,n$ , where $\mathcal{N},\mathcal{B}$ represent normal and Bernoulli distributions, respectively. Next, we independently generate

	$\displaystyle Y_{i}(a)$	$\displaystyle\sim\mathcal{N}\left(4aS_{i}X_{i2}^{2}+2e^{X_{i1}}+\|X_{i2}\|,1\right),$
	$\displaystyle R_{i}(a)$	$\displaystyle\sim\mathcal{B}\left\{\textup{expit}(0.6+0.6a+X_{i2}+S_{i})\right\},$

for $a=0,1$ and $i=1,\dots,n$ , where $\textup{expit}(x)=1/(1+e^{-x})$ . For treatment allocation, we set the randomization variables as $\boldsymbol{X}^{r}_{i}=(X_{i1},X_{i2})$ and stratification variable as $S_{i}$ . Then $(A_{1},\dots,A_{n})$ are generated under simple randomization, rerandomization, and stratified rerandomization as described in Sections 2.2 and 5.1 with Mahalanobis distance and threshold $t=1$ (corresponding to an acceptance rate of approximately $40\%$ ). Denoting $Y_{i}=A_{i}Y_{i}(1)+(1-A_{i})Y_{i}(0)$ and $R_{i}=A_{i}R_{i}(1)+(1-A_{i})R_{i}(0)$ , the observed data are $\{Y_{i},A_{i},X_{i1},X_{i2},S_{i}:i=1,\dots,n\}$ under the no missing outcome scenario, and $\{R_{i}Y_{i},R_{i},A_{i},X_{i1},X_{i2},S_{i}:i=1,\dots,n\}$ under the missing outcome scenario. We repeat the above procedure to generate $1000$ simulated data sets.

For each simulated data set, we implement the following three estimators under the no missing outcome scenario. The unadjusted estimator is given by the difference in mean outcomes between treatment groups and ignores covariates. The ANCOVA estimator (Example 1) adjusts for all covariates $\boldsymbol{X}_{i}=(X_{i1},X_{i2},S_{i})$ . The ML estimator refers to the efficient estimator described in Section 6 but setting $R_{i}=\widehat{\kappa}_{a}(\boldsymbol{X}_{i})=1$ as discussed in Remark 6. For the outcome regression model $\widehat{\eta}_{a}(\boldsymbol{X}_{i})$ , we adopt an ensemble learner of generalized linear models, regression trees, and neural networks via SuperLearner (van der Laan et al.,, 2007). Under the outcome missing at random scenario, we instead compute the DR-WLS estimator (described in Example 3) and the efficient DML estimator (as described in Section 6); we use the same ensemble learners for estimating the two nuisance functions for the latter. Of note, since the outcome missingness is generated by a generalized linear model, the missingness propensity score model is correctly specified in both the DR-WLS and DML estimators.

For each estimator, we report the following performance metrics: bias, empirical standard error (ESE), average of standard error estimators assuming simple randomization (ASE^∗, i.e., no adjustment is performed for rerandomization or stratified rerandomization), coverage probability based on normal approximations and standard error estimators under simple randomization (CP-Normal), and the coverage probability based on the derived asymptotic distribution reflecting the actual randomization procedure (CP-True).

In the second simulation study with binary outcomes, the data generating process is the same as the first simulation except that $Y_{i}(a)\sim\mathcal{B}\left\{\textup{expit}(-4+4aS_{i}X_{i2}^{2}+e^{X_{i1}}-|X_{i2}|)\right\}$ . Here, the unadjusted estimator is the ratio of mean outcomes between treatment groups. The outcome model used in the ANCOVA estimator and DR-WLS estimator is substituted by GLM2, which denotes logistic regression in Example 2 with all treatment-by-covariate interaction terms. The ML and DML estimators are also modified to target the ratio estimand, but the nuisance function estimators remain the same as in the first simulation study.

7.2 Simulation results

Table 1 summarizes the results of the first simulation study. Across different settings, all estimators have negligible bias and achieve nominal coverage probability based on our derived asymptotic distributions in Theorems 1-6, thereby empirically supporting our theoretical results. For the unadjusted estimator, we observe that rerandomization and stratified rerandomization can improve precision, reflected by ESE being $\sim 20\%$ smaller than ASE^∗ and CP-Normal being close to 1. Under rerandomization and stratified rerandomization, the ANCOVA, ML, DR-WLS, and DML estimators all have similar performance compared to simple randomization, and the corresponding coverage probabilities are all close to 0.95; this is expected because asymptotic normality holds for these covariate-adjusted estimators under rerandomization. These results further support Theorems 2, 4, 5, and 6, by which the ANCOVA, ML, DR-WLS, and DML estimators have been shown to fully account for the variance reduction brought by the adaptive randomization procedure (and hence $R^{2}=0$ ). Of note, we observe that the empirical standard error tends to be slightly larger than the average standard error estimators for adjusted estimators (especially when there are missing outcomes); this is driven by outlier estimates from a few simulated data sets, which have negligible impact on the coverage probability. We have repeated the simulation with $n=1000$ , and this variance underestimation issue disappears (results omitted for brevity). Finally, in terms of finite-sample efficiency, we observe that parametric working models can improve precision over the unadjusted estimator, and machine learning estimators often lead to further variance reduction.

Table 1: Simulation results for continuous outcomes. ESE: empirical standard error. ASE^∗: average of standard error estimators assuming simple randomization. CP-Normal: coverage probability based on normal approximations and standard error estimators assuming simple randomization. CP-True: coverage probability based on the derived asymptotic distribution (which accounts for the actual randomization procedure).

Missing

Outcomes?

Randomization

Estimator

Bias

ESE

ASE^∗

CP-Normal

CP-True

Simple randomization

Unadjusted

0.00

1.23

1.19

0.95

ANCOVA

0.01

0.82

0.79

0.95

0.96

-0.00

0.43

0.40

0.94

0.95

Rerandomization

Unadjusted

-0.01

0.95

1.19

0.99

0.94

ANCOVA

-0.00

0.83

0.79

0.95

0.02

0.45

0.41

0.94

Stratified rerandomization

Unadjusted

0.02

0.92

1.19

0.99

0.94

ANCOVA

0.02

0.82

0.79

0.94

0.95

0.01

0.51

0.47

0.95

0.94

Yes

Simple randomization

DR-WLS

-0.00

1.00

0.94

0.95

0.94

DML

0.05

0.69

0.61

0.96

0.95

Rerandomization

DR-WLS

0.03

0.97

0.94

0.95

DML

0.07

0.87

0.63

0.95

Stratified rerandomization

DR-WLS

0.06

0.99

0.94

0.96

0.95

DML

0.10

0.82

0.69

0.95

0.94

Table 2 presents the results for the second simulation, with similar overall findings to those from the first simulation. Here, since the ANCOVA estimator is replaced by logistic regression with treatment-by-covariate interactions, the GLM2 estimator is asymptotically normal as stated in Theorem 2(2) and Theorem 4(2).

Table 2: Simulation results for binary outcomes. ESE: empirical standard error. ASE^∗: average of standard error estimators assuming simple randomization. CP-Normal: coverage probability based on normal approximations and standard error estimators assuming simple randomization. CP-True: coverage probability based on the derived asymptotic distribution (which accounts for the actual randomization procedure).

Missing

Outcomes?

Randomization

Estimator

Bias

ESE

ASE^∗

CP-Normal

CP-True

Simple randomization

Unadjusted

0.02

0.25

0.24

0.94

0.93

GLM2

0.01

0.19

0.18

0.94

0.01

0.19

0.18

0.94

Rerandomization

Unadjusted

0.02

0.21

0.24

0.97

0.94

GLM2

0.01

0.19

0.18

0.94

0.02

0.20

0.19

0.94

Stratified rerandomization

Unadjusted

0.01

0.21

0.24

0.97

0.93

GLM2

0.00

0.19

0.18

0.94

0.01

0.19

0.18

0.95

0.94

Yes

Simple randomization

DR-WLS

0.02

0.21

0.95

DML

0.03

0.26

0.22

0.96

Rerandomization

DR-WLS

0.02

0.22

0.21

0.94

DML

0.03

0.23

0.21

0.94

Stratified rerandomization

DR-WLS

0.02

0.22

0.21

0.94

DML

0.03

0.28

0.23

0.95

8 Data application

The Effectiveness of Group Focused Psychosocial Support for Adults Affected by Humanitarian Crises (GroupPMPlus) study is a cluster-randomized experiment in Nepal designed to improve the mental health of people affected by humanitarian emergencies such as pandemics, war, and environmental disasters (Jordans et al.,, 2021). The cluster-level intervention was Group Problem Management Plus, a psychological treatment of 5 weekly sessions (versus standard care). In this study, 72 wards (the smallest administrative units in Nepal, representing clusters) were enrolled, consisting of 609 individuals in total. Stratified rerandomization was used for equal treatment allocation on clusters: stratification was based on gender (all individuals in the same ward have the same gender), and rerandomization was based on three binary cluster-level covariates: high or low access to mental health service, high or low disaster risk, and rural/urban status. However, we are unable to obtain detailed rerandomization parameters from the published report, including the weighting matrix or balance threshold. Therefore, we carry out our analysis assuming simple randomization: this choice will lead to conservative confidence intervals for an unadjusted analysis but has no impact on the covariate-adjusted estimators, as supported by our theory and simulations. The primary outcome was a continuous measure of psychological distress at the 3-month follow-up evaluated by the General Health Questionnaire (GHQ-12). The baseline variables we adjust for include the baseline GHQ-12 score and all variables used in stratified rerandomization. For this study, we implement the mixed-ANCOVA estimator (Example 4) with individual-level data (Wang et al.,, 2021), and the unadjusted, ANCOVA, and ML estimators with cluster-level means (implemented as in the first simulation) to estimate the average treatment effect. For all estimators, we compute their point estimates, standard errors assuming simple randomization, and confidence intervals under normal approximation.

The results are summarized in Table 3. While all estimators have similar point estimates, their standard error estimates differ. For the unadjusted estimator, since we are unaware of the detailed rerandomization parameters, 0.78 should be a conservative estimate, leading to failure to reject the null at the 5% level. In contrast, the standard error estimates for all three covariate-adjusted estimators should have fully accounted for the precision gain from stratified rerandomization as clarified by our theoretical results. For these estimators, it is important to highlight that validity of the standard error estimator is achieved without knowing the exact rerandomization parameters, which further endorses the recommendation of adjusting for stratification and rerandomization variables. Among the three covariate-adjusted estimators, the ANCOVA estimator appears to have the highest precision, while the machine learning estimator leads to the least variance reduction. This may be because either the sample size is relatively limited for machine learning methods to demonstrate asymptotic efficiency gain or the true data-generating distribution is nearly linear in covariates.

Table 3: Data analysis results for the GroupPMPlus study.

Estimator	Estimate	Standard error	95% confidence interval
Unadjusted	-1.25	0.78	(-2.77, 0.28)
ANCOVA	-1.45	0.56	(-2.55, -0.35)
Mixed-ANCOVA	-1.35	0.62	(-2.57, -0.13)
ML	-1.44	0.64	(-2.69, -0.18)

9 Discussion

Covariate adjustment in randomized experiments can be achieved at the design stage, e.g., via rerandomization, and at the analysis stage, e.g., by outcome modeling. In this paper, our primary contribution is to clarify the impact of (stratified) rerandomization for a wide class of estimators, and to further demonstrate when rerandomization can be ignorable (and hence conventional asymptotic normality results can apply) when the subsequent analysis adjusts for the rerandomization variables. These results provide important clarifications to earlier simulation findings, for example, in cluster-randomized experiments (Li et al.,, 2016, 2017) where mixed-model ANCOVA was evaluated under rerandomization. We have also extended the theoretical development to efficient machine learning estimators, which maximally leverage baseline covariates to optimize asymptotic efficiency gain in randomized experiments. Importantly, through the lens of the super-population framework, our results expanded the existing rerandomization theory (Li et al.,, 2018; Li and Ding,, 2020; Lu et al.,, 2023; Wang et al., 2023c, ; Zhao and Ding,, 2024) from unadjusted and linear-adjusted estimators to more general covariate-adjusted estimators. Therefore, our results have wide implications for randomized experiments, especially given the diversity of covariate-adjustment methods used in current practice (Pirondini et al.,, 2022).

A practical question for conducting rerandomization is how to select design parameters, e.g., rerandomization variables, weighting matrix, and balance threshold. Our general recommendation is to balance only prognostic covariates via the Mahalanobis distance and choosing $t$ such that the rejection rate is no larger than 95%. We retain our recommendation to adjust for rerandomization variables during analysis for maximum efficiency gain and convenience in statistical inference. Finally, with our example estimators, although the choices of weighting matrix and $t$ have no impact asymptotically, they could have finite-sample implications. Alternatively, $t$ may be specified in a data-driven fashion, i.e., $t_{n}=f_{n}(\boldsymbol{X}_{1}^{r},\dots,\boldsymbol{X}_{n}^{r})$ . Asymptotic analysis with an arbitrary $t_{n}$ is a challenging problem that requires special convergence results for conditional quantiles; this extension will be left for future research.

References

Benkeser et al., (2021) Benkeser, D., Díaz, I., Luedtke, A., Segal, J., Scharfstein, D., and Rosenblum, M. (2021). Improving precision and power in randomized trials for covid-19 treatments using covariate adjustment, for binary, ordinal, and time-to-event outcomes. Biometrics, 77(4):1467–1481.
Benkeser and Van Der Laan, (2016) Benkeser, D. and Van Der Laan, M. (2016). The highly adaptive lasso estimator. In 2016 IEEE international conference on data science and advanced analytics (DSAA), pages 689–696. IEEE.
Bruhn and McKenzie, (2009) Bruhn, M. and McKenzie, D. (2009). In pursuit of balance: Randomization in practice in development field experiments. American economic journal: applied economics, 1(4):200–232.
Chernozhukov et al., (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.
Colantuoni and Rosenblum, (2015) Colantuoni, E. and Rosenblum, M. (2015). Leveraging prognostic baseline variables to gain precision in randomized trials. Statistics in medicine, 34(18):2602–2617.
Ding et al., (2017) Ding, P., Li, X., and Miratrix, L. W. (2017). Bridging finite and super population causal inference. Journal of Causal Inference, 5(2):20160027.
Farrell et al., (2021) Farrell, M. H., Liang, T., and Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica, 89(1):181–213.
Ivers et al., (2012) Ivers, N. M., Halperin, I. J., Barnsley, J., Grimshaw, J. M., Shah, B. R., Tu, K., Upshur, R., and Zwarenstein, M. (2012). Allocation techniques for balance at baseline in cluster randomized trials: a methodological review. Trials, 13:1–9.
Jordans et al., (2021) Jordans, M. J., Kohrt, B. A., Sangraula, M., Turner, E. L., Wang, X., Shrestha, P., Ghimire, R., van’t Hof, E., Bryant, R. A., Dawson, K. S., et al. (2021). Effectiveness of group problem management plus, a brief psychological intervention for adults affected by humanitarian disasters in nepal: A cluster randomized controlled trial. PLoS Medicine, 18(6):e1003621.
Li et al., (2016) Li, F., Lokhnygina, Y., Murray, D. M., Heagerty, P. J., and DeLong, E. R. (2016). An evaluation of constrained randomization for the design and analysis of group-randomized trials. Statistics in medicine, 35(10):1565–1579.
Li et al., (2017) Li, F., Turner, E. L., Heagerty, P. J., Murray, D. M., Vollmer, W. M., and DeLong, E. R. (2017). An evaluation of constrained randomization for the design and analysis of group-randomized trials with binary outcomes. Statistics in medicine, 36(24):3791–3806.
Li and Ding, (2020) Li, X. and Ding, P. (2020). Rerandomization and regression adjustment. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(1):241–268.
Li et al., (2018) Li, X., Ding, P., and Rubin, D. B. (2018). Asymptotic theory of rerandomization in treatment–control experiments. Proceedings of the National Academy of Sciences, 115(37):9157–9162.
Lu et al., (2023) Lu, X., Liu, T., Liu, H., and Ding, P. (2023). Design-based theory for cluster rerandomization. Biometrika, 110(2):467–483.
Morgan and Rubin, (2012) Morgan, K. L. and Rubin, D. B. (2012). Rerandomization to improve covariate balance in experiments. Annals of Statistics, 40(2):1263–1282.
Morgan and Rubin, (2015) Morgan, K. L. and Rubin, D. B. (2015). Rerandomization to balance tiers of covariates. Journal of the American Statistical Association, 110(512):1412–1421.
Moulton, (2004) Moulton, L. H. (2004). Covariate-based constrained randomization of group-randomized trials. Clinical trials, 1(3):297–305.
Pirondini et al., (2022) Pirondini, L., Gregson, J., Owen, R., Collier, T., and Pocock, S. (2022). Covariate adjustment in cardiovascular randomized controlled trials: its value, current practice, and need for improvement. Heart Failure, 10(5):297–305.
Raab and Butcher, (2001) Raab, G. M. and Butcher, I. (2001). Balance in cluster randomized trials. Statistics in medicine, 20(3):351–365.
Rafi, (2023) Rafi, A. (2023). Efficient semiparametric estimation of average treatment effects under covariate adaptive randomization. arXiv preprint arXiv:2305.08340.
Robins et al., (2007) Robins, J., Sued, M., Lei-Gomez, Q., and Rotnitzky, A. (2007). Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Statist. Sci., 22(4):544–559.
Robins, (2002) Robins, J. M. (2002). Covariance adjustment in randomized experiments and observational studies: Comment. Statistical Science, 17(3):309–321.
Shi et al., (2022) Shi, W., Zhao, A., and Liu, H. (2022). Rerandomization and covariate adjustment in split-plot designs. arXiv preprint arXiv:2209.12385.
Tsiatis et al., (2008) Tsiatis, A., Davidian, M., Zhang, M., and Lu, X. (2008). Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Stat Med, 27(23):4658–4677.
(25) Turner, E. L., Li, F., Gallis, J. A., Prague, M., and Murray, D. M. (2017a). Review of recent methodological developments in group-randomized trials: part 1—design. American journal of public health, 107(6):907–915.
(26) Turner, E. L., Prague, M., Gallis, J. A., Li, F., and Murray, D. M. (2017b). Review of recent methodological developments in group-randomized trials: part 2—analysis. American Journal of Public Health, 107(7):1078–1086.
van der Laan et al., (2007) van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1).
van der Laan et al., (2011) van der Laan, M. J., Rose, S., et al. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data, volume 10. Springer.
van der Vaart, (1998) van der Vaart, A. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Wager and Athey, (2018) Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242.
Wang et al., (2021) Wang, B., Harhay, M. O., Small, D. S., Morris, T. P., and Li, F. (2021). On the mixed-model analysis of covariance in cluster-randomized trials. arXiv preprint arXiv:2112.00832.
Wang et al., (2019) Wang, B., Ogburn, E. L., and Rosenblum, M. (2019). Analysis of covariance in randomized trials: More precision and valid confidence intervals, without model assumptions. Biometrics, 75(4):1391–1400.
(33) Wang, B., Park, C., Small, D. S., and Li, F. (2023a). Model-robust and efficient covariate adjustment for cluster-randomized experiments. Journal of the American Statistical Association.
(34) Wang, B., Susukida, R., Mojtabai, R., Amin-Esmaeili, M., and Rosenblum, M. (2023b). Model-robust inference for clinical trials that improve precision by stratified randomization and covariate adjustment. Journal of the American Statistical Association, 118(542):1152–1163.
(35) Wang, X., Wang, T., and Liu, H. (2023c). Rerandomization in stratified randomized experiments. Journal of the American Statistical Association, 118(542):1295–1304.
Wang and Li, (2022) Wang, Y. and Li, X. (2022). Rerandomization with diminishing covariate imbalance and diverging number of covariates. The Annals of Statistics, 50(6):3439–3465.
Zelen, (1974) Zelen, M. (1974). The randomization and stratification of patients to clinical trials. Journal of Chronic Diseases, 27(7):365 – 375.
Zhao and Ding, (2024) Zhao, A. and Ding, P. (2024). No star is good news: A unified look at rerandomization based on p-values from covariate balance tests. Journal of Econometrics, 241(1):105724.
Zhou et al., (2018) Zhou, Q., Ernst, P. A., Morgan, K. L., Rubin, D. B., and Zhang, A. (2018). Sequential rerandomization. Biometrika, 105(3):745–752.