
Multivariate Tie-breaker Designs

Tim P. Morrison
Stanford University
   Art B. Owen
Stanford University
(October 2024)
Abstract

In a tie-breaker design (TBD), subjects with high values of a running variable are given some (usually desirable) treatment, subjects with low values are not, and subjects in the middle are randomized. TBDs are intermediate between regression discontinuity designs (RDDs) and randomized controlled trials (RCTs). TBDs allow a tradeoff between the resource allocation efficiency of an RDD and the statistical efficiency of an RCT. We study a model where the expected response is one multivariate regression for treated subjects and another for control subjects. We propose a prospective D-optimality, analogous to Bayesian optimal design, to understand design tradeoffs without reference to a specific data set. For given covariates, we show how to use convex optimization to choose treatment probabilities that optimize this criterion. We can incorporate a variety of constraints motivated by economic and ethical considerations. In our model, D-optimality for the treatment effect coincides with D-optimality for the whole regression, and, without constraints, an RCT is globally optimal. We show that a monotonicity constraint favoring more deserving subjects induces sparsity in the number of distinct treatment probabilities. We apply the convex optimization solution to a semi-synthetic example involving triage data from the MIMIC-IV-ED database.

1 Introduction

There are many important settings where a costly or scarce intervention can be made for some but not all subjects. Examples include giving a scholarship to some students (Angrist et al., 2014), intervening to prevent child abuse in some families (Krantz, 2022), programs to counter juvenile delinquency (Lipsey et al., 1981) and sending some university students to remedial English classes (Aiken et al., 1998). There are also lower-stakes settings where a company might be able to offer a perk, such as a free service upgrade, to some but not all of its customers. In settings such as these, there is usually a priority ordering of subjects, perhaps based on how deserving they are or on how much they or the investigator might gain from the intervention. We can represent that priority order in terms of a real-valued running variable $x_{i}$ for subject $i$.

To maximize the immediate short-term value from the limited intervention, one can assign it only to subjects with $x_{i}\geqslant t$ for some threshold $t$. The difficulty with this “greedy” solution is that such an allocation makes it difficult to estimate the causal effect of the treatment. It is possible to use a regression discontinuity design (RDD) in this case, comparing subjects with $x_{i}$ somewhat larger than $t$ to those with $x_{i}$ somewhat smaller than $t$. For details on the RDD, see the comprehensive recent survey by Cattaneo and Titiunik (2022). The RDD can give a consistent nonparametric estimate of the treatment effect at $x=t$ (Hahn et al., 2001), but not at other values of $x$. The treatment effect can be studied at other values of $x$ under an assumed regression model, but then the variance of the regression coefficients can be large due to the strong dependence between the running variable and the treatment (Gelman and Imbens, 2019).

The RDD is commonly used to analyze observational data for which the investigator has no control over the treatment cutoff. In the settings we consider, the investigators assign the treatment and can therefore employ some randomization. The motivating context makes it costly or even ethically difficult to use a randomized controlled trial (RCT) where the treatment is assigned completely at random without regard to $x_{i}$. The tie-breaker design (TBD) is a compromise between the RCT and the RDD. It is a triage where subjects with large values of $x_{i}$ get the treatment, those with small values get the control level, and those in between are randomized to either treatment or control.

While the tie-breaker design has been known since Campbell (1969), it has not been subjected to much analysis. Most of the theory for the TBD has been for the setting with a scalar $x$ and models that are linear in $x$ and the treatment (Owen and Varian, 2020; Li and Owen, 2022). In this paper we study the TBD for a vector-valued predictor. Our first motivation is that many use cases for TBDs will include multiple covariates. Second, although multivariate nonparametric regression models are out of the scope of this paper, we believe that TBD regressions are a useful first step in that direction. Third, Gelman et al. (2020) counsel against RDDs that do not adjust for pre-treatment variables. The multivariate regression setting supports such adjustments.

In a tie-breaker design we have a vector of covariates for subject $i$ and must choose a two-level treatment variable. We then face an atypical experimental design problem where some of the predictors are fixed and not settable by us while one of them is subject to randomization. This is known as the ‘marginally restricted design’ problem after Cook and Thibodeau (1980), who studied D-optimality in such a setting. We consider two such settings. In one setting, the investigator already has the covariate values. The second setting is one step removed, where we consider random covariates that an investigator might later have.

The paper is organized as follows. Section 2 outlines our notation and the regression model. Section 3 introduces our notions of efficiency and D-optimality in the multivariate regression model. Theorem 1 shows that D-optimality for the treatment effect parameters is equivalent to D-optimality for the full regression. To study efficiency for future subjects, we use a prospective D-optimality criterion, adapted from Bayesian optimal design, that maximizes expected future information. Theorem 2 then shows that the RCT is prospectively D-optimal. We also discuss an example involving Gaussian covariates and a symmetric design, which provides relevant intuition. Section 4 finds an expression for the expected short-term gain, with particular focus on this Gaussian case. When the running variable is linear in the covariates, the best linear combination for statistical efficiency is the “last” eigenvector of the covariance matrix of covariates while the best linear combination for short-term gain is the true treatment effect. Section 5 presents a design strategy based on convex optimization to choose treatment probabilities for given covariates and compares the effects of various practical constraints. In particular, we show that a monotonicity constraint in the treatment probabilities yields solutions with a few distinct treatment probability levels. This is consistent with some past results in constrained experimental design (Cook and Fedorov, 1995), but both our proof and the theorem particulars are distinct. Section 6 illustrates this procedure on a hospital data set from MIMIC-IV-ED about which emergency room patients should receive intensive care. Section 7 has a brief discussion of some additional context for our results.

1.1 Tie-breaker designs

The earliest tie-breaker design of which we are aware (Campbell, 1969) had a discrete running variable, and literally broke a tie by randomizing the treatment assignments for those subjects with $x=t$. For an imperfectly measured running variable, we might consider $|x-t|<\Delta$ to not be meaningfully large for some $\Delta>0$, so randomizing in that window is like breaking ties. This is the approach from Boruch (1975).

When we randomize for $x$ with $|x-t|<\Delta$, then setting $\Delta=0$ provides an RDD while $\Delta\to\infty$ provides an RCT. The TBD transitions smoothly between these extremes. Many of the early comparisons among these methods use the two-line regression model

$$Y_{i}=\beta_{0}+\beta_{1}X_{i}+\gamma_{0}Z_{i}+\gamma_{1}Z_{i}X_{i}+\varepsilon_{i}. \tag{1}$$

Here, $Y_{i}$ is the response for subject $i$, $Z_{i}\in\{-1,1\}$ is a binary treatment variable, and $\varepsilon_{i}$ is an IID error term. Goldberger (1972) finds for $X_{i}\sim\mathcal{N}(0,1)$ that the RDD has about 2.75 times the variance of an RCT. Jacob et al. (2012) consider polynomials in $X$ up to third degree, with or without interactions with $Z$, for Gaussian and for uniform random variables. Neither paper considers tie-breakers. Cappelleri et al. (1994) study the two-line model above (absent the $X$-$Z$ interaction) and compare the RDD, the RCT and three intermediate TBD designs. They study the sample size required to reject $\gamma=0$ with sufficient power for those designs and three different sizes of $\gamma$. Larger $\Delta$ brings greater power.
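Goldberger's factor is easy to reproduce numerically. The sketch below is our own illustration (not code from the cited papers): it builds the design matrix of model (1) for $X_{i}\sim\mathcal{N}(0,1)$ under an RDD with threshold $t=0$ and under an RCT, and compares the resulting variances of $\hat{\gamma}_{0}$; the population ratio is $1/(1-2/\pi)\approx 2.75$.

```python
import numpy as np

# Our illustration of Goldberger's (1972) comparison under model (1):
# Var(gamma_0-hat) for an RDD (Z = sign(X)) versus an RCT (Z random).
rng = np.random.default_rng(0)
n = 200_000
X = rng.standard_normal(n)

def var_gamma0(Z, X):
    # Columns of U are (1, X, Z, Z*X); with sigma^2 = 1, Var(gamma_0-hat)
    # is the diagonal entry of (U'U)^{-1} for the Z column.
    U = np.column_stack([np.ones_like(X), X, Z, Z * X])
    return np.linalg.inv(U.T @ U)[2, 2]

Z_rdd = np.where(X >= 0, 1.0, -1.0)      # treat exactly when X >= t = 0
Z_rct = rng.choice([-1.0, 1.0], size=n)  # completely randomized assignment
ratio = var_gamma0(Z_rdd, X) / var_gamma0(Z_rct, X)  # roughly 2.75
```

The inflation arises because, under the RDD, the columns $Z$ and $X$ (and likewise $ZX$ and the intercept) are strongly correlated.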

Owen and Varian (2020) find an expression for the variance of the parameters in the two-line model for $x$ with a symmetric uniform distribution about a threshold $t=0$, and also for a Gaussian case. The randomization window is $|x|\leqslant\Delta$, with a larger $\Delta$ yielding smaller variance but a lesser short-term gain. Their designs have only three levels (0, 50 and 100 percent) for the treatment probability given $x$. They show that there is no advantage to using other levels or a sliding scale when $x$ has a symmetric distribution and 50% of the subjects will get the treatment.

Li and Owen (2023) study the two-line regression for a general distribution of $x$ and an arbitrary proportion of treated subjects. They constrain the global treatment probability and a measure of short-term gain. They find that there exists a D-optimal design under these constraints with a treatment probability that is constant within at most four intervals of $x$ values. Moreover, with the addition of a monotonicity constraint, there exists an optimal solution with just two levels corresponding to $x<t^{\prime}$ and $x>t^{\prime}$ for some $t^{\prime}$.

Kluger and Owen (2023) consider tie-breaker designs for nonparametric regression with a real-valued running variable $x$. A tie-breaker design can support causal inference about the treatment effect at points within the randomization window, not just at the threshold $x=t$. At the threshold, the widely used kernel regression methods for the RDD become much more efficient if one has sampled from a tie-breaker design, because they can use interpolation to $x=t$ instead of the extrapolation from the left and right sides that an RDD requires.

1.2 Marginally constrained experimental design

The tie-breaker setup is a special case of the marginally constrained experimentation of Cook and Thibodeau (1980). Some more general constraints are considered in Lopez-Fidalgo and Garcet-Rodriguez (2004). Some of our findings are special cases of more general results in Nachtsheim (1989). Heavlin and Finnegan (1998) consider a sequential design setting in which the columns of a design matrix correspond to steps in semiconductor fabrication, with each step constrained by the prior ones. The marginally constrained literature generally considers choosing $n_{i}$ values of the settable variables for level $i$ of the fixed variables. In our setting, we cannot assume that $n_{i}>1$.

The TBD setting often has a monotonicity constraint on treatment probabilities that we have not seen in the marginally constrained design literature. Under that constraint a more “deserving” subject should never have a lower treatment probability than a less deserving subject has. This and other constraints in the TBD will often lead to optimal designs with a small number of different treatment levels.

1.3 Sequential experimentation

We anticipate that the tie-breaker design could be used in sequential experimentation. In our motivating applications, the response is measured long enough after the treatment (e.g., six years in some educational settings) that bandit methods (Slivkins, 2019) are not appropriate. There is related work on adaptive experimental design. See for instance Metelkina and Pronzato (2017). The problem we focus on is designing the first experiment that one might use, and a full sequential analysis is not in the scope of this paper.

2 Setup

Given $n$ subjects with $d$ covariates each, we let $X\in\mathbb{R}^{n\times d}$ be the design matrix and $X_{i}\in\mathbb{R}^{d}$ be the variables for subject $i$. We write $\tilde{X}=[1\quad X]\in\mathbb{R}^{n\times(d+1)}$ to denote the matrix with an intercept column prepended, and $\tilde{X}_{i}=[1\quad X_{i}^{\top}]^{\top}\in\mathbb{R}^{d+1}$ to denote its $i$'th row. For ease of notation, we zero-index $\tilde{X}$ so that $\tilde{X}_{i0}=1$ and $\tilde{X}_{ij}=X_{ij}$ for $j=1,\dots,d$.

We are interested in the effect of some treatment $Z_{i}\in\{-1,1\}$ on a future response $Y_{i}\in\mathbb{R}$ for subject $i$, where $Z_{i}=1$ corresponds to treatment and $Z_{i}=-1$ to control. The design problem is to choose probabilities $p_{i}\in[0,1]$ and then take $\mathbb{P}(Z_{i}=1)=p_{i}$. This differs from the common experimental design framework in which the covariates $X_{i}$ can also be chosen. In Section 5 we will show how to get optimal $p_{i}$ by convex optimization.

To get more general insights into the design problem, in Section 3 we also consider a random data framework. The predictors are to be sampled with $X_{i}\stackrel{\mathrm{iid}}{\sim}P_{X}$. This allows us to relate design findings to the properties of $P_{X}$ rather than to a specific matrix $X$. After the $X_{i}$ are observed, $Z_{i}$ will be set randomly and then $Y_{i}$ observed. The analyst is unable to alter $P_{X}$ at all but can choose any function $p(X)=\mathbb{P}(Z=1\mid X)$ that satisfies the imposed constraints.

We work with the following linear model:

$$Y_{i}=\tilde{X}_{i}^{\top}\tilde{\beta}+Z_{i}\tilde{X}_{i}^{\top}\tilde{\gamma}+\varepsilon_{i} \tag{2}$$

for $\tilde{\beta},\tilde{\gamma}\in\mathbb{R}^{d+1}$, where the $\varepsilon_{i}$ are IID noise terms with mean zero and variance $\sigma^{2}>0$. We use the same notational convention of writing $\tilde{\beta}=[\beta_{0}\quad\beta^{\top}]^{\top}$ and $\tilde{\gamma}=[\gamma_{0}\quad\gamma^{\top}]^{\top}$ for $\beta,\gamma\in\mathbb{R}^{d}$ to separate out the intercept term. We consider $\tilde{\gamma}$ to be the parameter of greatest interest because it captures the treatment effect of $Z$.
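To make the notation concrete, here is a small simulation of our own (the parameter values are arbitrary illustrations, not from the paper) that draws data from model (2) under a completely randomized $Z_{i}$ and recovers $\delta=[\tilde{\beta}^{\top}\ \tilde{\gamma}^{\top}]^{\top}$ by least squares:

```python
import numpy as np

# Simulate from model (2) and recover delta = (beta-tilde, gamma-tilde) by OLS.
rng = np.random.default_rng(4)
n, d = 5_000, 2
X = rng.standard_normal((n, d))
Xt = np.column_stack([np.ones(n), X])        # X-tilde: intercept column prepended
beta_t = np.array([1.0, 2.0, -1.0])          # beta-tilde = (beta_0, beta), illustrative
gamma_t = np.array([0.5, 0.0, 1.5])          # gamma-tilde = (gamma_0, gamma), illustrative
Z = rng.choice([-1.0, 1.0], size=n)          # RCT assignment, for illustration
Y = Xt @ beta_t + Z * (Xt @ gamma_t) + rng.standard_normal(n)

U = np.column_stack([Xt, Z[:, None] * Xt])   # regressors for (beta-tilde, gamma-tilde)
delta_hat = np.linalg.lstsq(U, Y, rcond=None)[0]
```

With $n=5000$ and $\sigma^{2}=1$, the estimate `delta_hat` lies close to the true $(\tilde{\beta},\tilde{\gamma})$.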

Equation (2) generalizes the two-line model (1) studied by Li and Owen (2023) and Owen and Varian (2020). The latter authors describe some computational methods for the model (2) but most of their theory is for model (1).

Though it makes a strong parametric assumption, this multi-line regression is a helpful and practical model to inform treatment assignment at the design stage. Because our scenario of interest can include regions of the covariate space with $p(x)=0$, we do not have overlap and cannot rely on classical semiparametric methods for causal inference. Nonparametric generalizations of (2), such as spline models, may prove more flexible in highly nonlinear settings, but we do not explore them here.

In Section 3, we also consider multivariate tie-breaker designs (TBDs), which we define as follows: the treatment $Z_{i}$ is assigned via

$$\mathbb{P}(Z_{i}=1\mid X_{i})=\begin{cases}1,&X_{i}^{\top}\eta\geqslant u\\ p,&X_{i}^{\top}\eta\in(\ell,u)\\ 0,&X_{i}^{\top}\eta\leqslant\ell\end{cases} \tag{3}$$

for parameters $\ell<u$, $p\in(0,1)$ and $\eta\in\mathbb{R}^{d}$. That is, we assign treatment to subject $i$ whenever $X_{i}^{\top}\eta$ is at or above some upper cutoff $u$, do not assign treatment whenever it is at or below some lower cutoff $\ell$, and randomize at some fixed probability $p$ in the middle. For the case $u=\ell$, we take $\mathbb{P}(Z_{i}=1\mid X_{i})=\mathds{1}\{X_{i}^{\top}\eta\geqslant u\}$, which has a mild asymmetry in offering the treatment to those subjects, if any, that have $X_{i}^{\top}\eta=\ell=u$.

In equation (3) we ignore the intercept and just consider $X_{i}$ instead of $\tilde{X}_{i}$, since the intercept would merely shift everything by a constant. Here, we use $\eta\in\mathbb{R}^{d}$ instead of $\gamma$ from (2) to reflect that the vector we treat on need not be the same as the true $\gamma$, which is unknown.

In practice, $\eta$ will encode whatever constraints on randomization a practitioner may encounter. While linear constraints do not encompass all possibilities, they are simple and flexible enough to account for a great many that one could feasibly seek to impose; for example, constraints on a heart rate below some cutoff or a total SAT score that is sufficiently high can both be covered by (3). One could also set $\ell$ and $u$ to be known or estimated quantiles of a covariate or a linear combination of them if, for example, one wants deterministic treatment assignment for the top 5% of candidates. More generally, we show in Section 5 how to accommodate any convex constraints on treatment assignment.

The assignment (3) greatly generalizes the one in Owen and Varian (2020), which had $d=1$, $\ell=-u$, $p=1/2$, and $P_{X}$ either $\mathcal{U}(-1,1)$ or $\mathcal{N}(0,1)$. In analogy to the one-dimensional case, we refer to the case with $u=\ell$ as an RDD. We refer to any choice of $(\ell,u,\eta)$ for which $\mathbb{P}(X^{\top}\eta\in(\ell,u))=1$ as an RCT.
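The rule (3) is straightforward to implement. The following sketch is ours (the function name and interface are hypothetical, not from the paper); it draws independent $Z_{i}\in\{-1,1\}$ from the scores $X_{i}^{\top}\eta$:

```python
import numpy as np

def tbd_assign(X, eta, lo, up, p, rng):
    """Tie-breaker assignment (3): treat (Z=+1) when X_i'eta >= up, control
    (Z=-1) when X_i'eta <= lo, randomize with P(Z=+1) = p in (lo, up).
    Setting lo == up recovers the RDD convention P(Z=1) = 1{X_i'eta >= up}."""
    score = X @ eta
    prob = np.where(score >= up, 1.0, np.where(score <= lo, 0.0, p))
    return np.where(rng.random(len(prob)) < prob, 1, -1)

rng = np.random.default_rng(1)
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.1, 0.3]])
Z = tbd_assign(X, eta=np.array([1.0, 0.0]), lo=-1.0, up=1.0, p=0.5, rng=rng)
# Z[0] is deterministically +1, Z[1] is deterministically -1, Z[2] is randomized.
```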

3 Efficiency and D-optimality

We assume that the covariates $X_{i}\in\mathbb{R}^{d}$ have yet to be observed and are drawn from some distribution $P_{X}$ with a finite and invertible covariance matrix $\Sigma$. The aim is to devise a treatment assignment scheme $p(X)$ with $Z_{i}$ assigned independently via $\mathbb{P}(Z_{i}=1\mid X_{i})=p(X_{i})$ that is optimal in some sense. Random allocation of $Z$ has advantages of fairness and supports some randomization-based inference.

Section 3.1 briefly outlines efficiency and D-optimality in the fixed $Z$ setting to introduce relevant formulas. In Section 3.2, we then outline a notion of prospective D-optimality used when $X$ is random and we must choose a distribution of $Z$ given $X$. We model our approach on the Bayesian optimal design literature (Chaloner and Verdinelli, 1995), in which the model parameters are treated as random, which we describe in more detail in that section.

The treatment of $X$ as random permits a theoretical discussion about what is expected to happen depending on the distributional properties of the unseen $X$ data. However, the case in which the covariates are actually fixed is also covered by the procedure in Section 3.2, since it corresponds to a point mass distribution on $X$. We return to this case in Section 5, in which we show how to use convex optimization to obtain prospectively D-optimal $p(X_{i})$.

3.1 D-optimality for nonrandom X and Z

In this brief section, we outline our notions of efficiency and D-optimality in the setting in which $X$ is fixed and the treatments $Z_{i}$ are assigned non-randomly. While this setting is not the focus of the paper, it lets us motivate various definitions and derive a property of D-optimality in the multivariate model that will be useful going forward.

We begin by conditioning on $X$ and $Z$. Let $D\in\mathbb{R}^{n\times n}$ be the diagonal matrix whose diagonal entries are $D_{ii}=Z_{i}$. We can write the linear model (2) in matrix form as $Y=U\delta+\varepsilon$, where $U=[\tilde{X}\quad D\tilde{X}]$ and $\delta=[\tilde{\beta}^{\top}\quad\tilde{\gamma}^{\top}]^{\top}$.

In the general model (2), conditionally on $X$ and $Z$ we have

$$\mathrm{Var}(\hat{\delta}\mid X,Z)=\begin{bmatrix}\mathrm{Var}(\hat{\beta}\mid X,Z)&\mathrm{Cov}(\hat{\beta},\hat{\gamma}\mid X,Z)\\ \mathrm{Cov}(\hat{\gamma},\hat{\beta}\mid X,Z)&\mathrm{Var}(\hat{\gamma}\mid X,Z)\end{bmatrix}\equiv\sigma^{2}\begin{bmatrix}((U^{\top}U)^{-1})_{11}&((U^{\top}U)^{-1})_{12}\\ ((U^{\top}U)^{-1})_{21}&((U^{\top}U)^{-1})_{22}\end{bmatrix},$$

where we assume here that $U$ is full rank. Because $\sigma^{2}$ is merely a multiplicative factor independent of all relevant parameters, there is no loss of generality in taking $\sigma^{2}=1$ going forward. The treatment effect vector $\tilde{\gamma}$ is our parameter of primary interest, so we want to minimize a measure of the magnitude of $((U^{\top}U)^{-1})_{22}$. When $X$ is fixed, a standard choice would be to assign $\{Z_{1},\ldots,Z_{n}\}$ to minimize the D-optimality criterion

$$\det\bigl(((U^{\top}U)^{-1})_{22}\bigr)=\prod_{j=1}^{d+1}\lambda_{j}\bigl(\mathrm{Var}(\hat{\gamma}\mid Z)\bigr),$$

where $\lambda_{j}(\cdot)$ is the $j$th eigenvalue of its argument. D-optimality is the most studied design choice among many alternatives (Atkinson et al., 2007, Chapter 10). It has the convenient property of being invariant under reparametrizations. This criterion is actually a particular case of DS-optimality, in which only the parameters corresponding to a subset $S$ of the indices are of interest. See Section 10.3 of Atkinson et al. (2007). Much of our theory and discussion will generalize to a broader class of criteria that we discuss in Section 3.2.

Before proceeding to the setting in which $X$ and $Z$ are random, we note a convenient property of D-optimality under the model (2), which we state as the following simple theorem.

Theorem 1.

For data following model (2), assume that $U$ is full rank. Then the D-optimality criteria for $\tilde{\gamma}$, $\tilde{\beta}$ and $\delta$ are equivalent.

Proof.

D-optimality for $\tilde{\gamma}$ comes from minimizing $\det(((U^{\top}U)^{-1})_{22})$ while D-optimality for $\delta$ comes from maximizing $\det(U^{\top}U)$. To show that these are equivalent, we write

$$U^{\top}U=\begin{bmatrix}\tilde{X}^{\top}\tilde{X}&\tilde{X}^{\top}D\tilde{X}\\ \tilde{X}^{\top}D\tilde{X}&\tilde{X}^{\top}D^{2}\tilde{X}\end{bmatrix}=\begin{bmatrix}\tilde{X}^{\top}\tilde{X}&\tilde{X}^{\top}D\tilde{X}\\ \tilde{X}^{\top}D\tilde{X}&\tilde{X}^{\top}\tilde{X}\end{bmatrix}\equiv\begin{bmatrix}A&B\\ B&A\end{bmatrix} \tag{4}$$

using $D^{2}=I$ in the second equality. In the decomposition above, $A$ and $B$ are symmetric with $A$ invertible, and so, from properties of block matrices,

$$\det(U^{\top}U)=\det(A)\det(A-BA^{-1}B)\quad\text{and}\quad((U^{\top}U)^{-1})_{22}=(A-BA^{-1}B)^{-1},$$

from which

$$\det(U^{\top}U)=\frac{\det(\tilde{X}^{\top}\tilde{X})}{\det(((U^{\top}U)^{-1})_{22})}. \tag{5}$$

Because $\tilde{X}^{\top}\tilde{X}$ does not depend on $Z$, minimizing $\det(((U^{\top}U)^{-1})_{22})$ is equivalent to maximizing $\det(U^{\top}U)$. Finally, D-optimality for $\tilde{\beta}$ and $\tilde{\gamma}$ are equivalent by symmetry. ∎

The equivalence (5) is well known in the DS-optimality literature (see, e.g., Equation 10.7 of Atkinson et al. (2007)). In our setting, only the right half of the columns of $U$ can be changed. Then the numerator in Equation (5) is conveniently fixed, and so optimizing the denominator alone optimizes the ratio.
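The identity (5) is easy to check numerically for any full-rank design. A quick sketch (our own verification, not code from the paper):

```python
import numpy as np

# Numerical check of (5): det(U'U) = det(Xt'Xt) / det(((U'U)^{-1})_{22}).
rng = np.random.default_rng(2)
n, d = 40, 3
X = rng.standard_normal((n, d))
Xt = np.column_stack([np.ones(n), X])        # X-tilde with intercept column
Z = rng.choice([-1.0, 1.0], size=n)
U = np.column_stack([Xt, Z[:, None] * Xt])   # U = [X-tilde, D X-tilde], D = diag(Z)

M = U.T @ U
block22 = np.linalg.inv(M)[d + 1:, d + 1:]   # ((U'U)^{-1})_{22}
lhs = np.linalg.det(M)
rhs = np.linalg.det(Xt.T @ Xt) / np.linalg.det(block22)
```

The two determinants agree to floating-point accuracy for any choice of $Z$.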

The simple structure of the model (2) has made D-optimal estimation of $\tilde{\gamma}$ equivalent to D-optimality for $\delta$ and for $\tilde{\beta}$. By the same token, a design that is A-optimal for $\tilde{\gamma}$, minimizing $\mathrm{tr}(\mathrm{Var}(\hat{\tilde{\gamma}}))$, is also A-optimal for $\delta$ and for $\tilde{\beta}$. Lemma 1 of Nachtsheim (1989) includes our setting and shows that D-optimality for $\delta$ is equivalent to D-optimality for $\tilde{\gamma}$. It does not apply to $\tilde{\beta}$ for our problem, nor does that result consider other criteria such as A-optimality.

The theory of marginally restricted D-optimality in Cook and Thibodeau (1980) describes some settings where the D-optimal design for all variables is a tensor product of the given empirical design for the fixed variables and a randomized design for the settable variables. Such a design is simply an RCT on the settable variables. By their Lemma 1, this holds when the regression model is a Kronecker product of functions of fixed variables times functions of settable variables. In their Lemma 3, this holds when the regression model has an intercept plus a sum of functions of settable variables and a sum of functions of fixed variables. Neither of those applies to model (2), but Lemma 2 of Nachtsheim (1989) does. The TBD designs we consider usually have constraints on the short-term gain or monotonicity constraints, and those generally make RCTs non-optimal.

3.2 Prospective D-optimality

In this section, we modify the approach in Section 3.1 to account for the randomness in $X$ and $Z$. To do so, we adopt a prospective D-optimality criterion to apply to the setting where both $X_{i}$ and $Z_{i}$ have yet to be observed. We derive our approach from ideas in Bayesian optimal design, which we briefly summarize below based on Chapter 18.2 of Atkinson et al. (2007), omitting details that will not play a role in our setting.

Bayesian optimal design often arises when the variance of the parameter estimate depends on the unknown true value of the parameter $\theta$, as is the usual case for models where the expected response is a nonlinear function of $\theta$. In this case, the information matrix $M=M(\theta)$ is usually constructed by linearizing the model form and then taking the expected outer product of the gradient with itself under a design measure on the predictors. This $M(\theta)$ is random because $\theta$ has a prior distribution. In our case, $M=U^{\top}U$, since this is the information matrix for the multivariate regression.

The favored approach in Bayesian optimal design is to choose the design to minimize $\mathbb{E}(\log\det(M^{-1}))$, where the expectation is over the prior distribution on the parameters (Chaloner and Verdinelli, 1995; Dette, 1996). That expectation can be quite expensive to compute. Many of the examples in the literature optimize the design over a grid, which is reasonable when the dimensions of $X$ and $\theta$ are both small, but those methods do not scale well to larger problems.

Table 18.1 of Atkinson et al. (2007) lists four additional criteria along with the above one. They are $\log\mathbb{E}(\det(M^{-1}))$, $\log\det(\mathbb{E}(M^{-1}))$, $\log(\mathbb{E}(\det(M))^{-1})$, and $\log(\det\mathbb{E}(M))^{-1}$. The objective becomes more tractable each time a nonlinear operation is taken out of the expectation. When the logarithm is the final step, the criterion is equivalent to the one without that logarithm.

We choose the last of those four quantities (choice V in their Table 18.1) for our definition of prospective D-optimality.

Definition 1.

[Prospective D-optimality] For random predictors $X$, a design function $\mathbb{P}(Z=1\mid x)=p(x)$ is prospectively D-optimal if it maximizes

$$\det(\mathbb{E}(U^{\top}U))=\det(\mathbb{E}[\mathrm{Var}(\hat{\delta}\mid X,Z)^{-1}]),$$

where the expectation is with respect to $X$ and $Z$.

We could analogously define prospective D-optimality for $\tilde{\beta}$ or $\tilde{\gamma}$ as minimizing $\det((\mathbb{E}[U^{\top}U]^{-1})_{11})$ or $\det((\mathbb{E}[U^{\top}U]^{-1})_{22})$, respectively. By Theorem 1, prospective D-optimality in the sense of Definition 1 is equivalent to these conditions in our model, so the three notions all align.

Our choice is known as EW D-optimality in Bayesian design for generalized linear models (GLMs) (Yang et al., 2016; Bu et al., 2020). The E is for expectation and the W refers to a weight matrix arising in GLMs. It is valued for its significantly reduced computational cost. While a computationally efficient design is not necessarily the best one, Yang et al. (2016) and Yang et al. (2017) find in a range of simulation studies that EW D-optimal designs tend to have strong performance under the more standard Bayesian D-optimality criterion as well.

Since our prospective D-optimality criterion uses the expected value of $U^{\top}U$, it depends only on the $p_{i}$ and not on the joint distribution of the $Z_{i}$. It is therefore important that the $Z_{i}$ be independent in order to distinguish, say, an RCT with $\mathbb{P}(Z_{i}=1)=1/2$ from an allocation that takes all $Z_{i}=Z_{1}$ where $Z_{1}=1$ with probability $1/2$. In addition, Morrison et al. (2024) show in a similar problem that, when the $Z_{i}$ are sampled independently, the design that minimizes $\det(\mathbb{E}[M])$ also minimizes $\mathbb{E}[\det(M)]$ asymptotically as $n\to\infty$. Kluger and Owen (2023) consider some stratified sampling methods that incorporate negative correlations among the $Z_{i}$, which makes $U^{\top}U$ come even closer to its expectation than it does under independent sampling.

Under sampling with $X_{i}\sim P_{X}$, define

$$\tilde{\Sigma}=\mathbb{E}\Bigl(\frac{1}{n}\tilde{X}^{\top}\tilde{X}\Bigr)=\begin{bmatrix}1&\mathbb{E}[X_{\bullet}]^{\top}\\ \mathbb{E}[X_{\bullet}]&\mathbb{E}[X_{\bullet}X_{\bullet}^{\top}]\end{bmatrix},$$

where the bullet subscript denotes an arbitrary subject with $X_{\bullet}\sim P_{X}$ and $\mathbb{P}(Z_{\bullet}=1\mid X_{\bullet})=p(X_{\bullet})$. In addition,

$$\mathbb{E}\Bigl(\frac{1}{n}(\tilde{X}^{\top}D\tilde{X})_{jk}\Bigr)=\mathbb{E}\Bigl(\frac{1}{n}\sum_{i=1}^{n}Z_{i}\tilde{X}_{ij}\tilde{X}_{ik}\Bigr)=\mathbb{E}[\tilde{X}_{\bullet j}\tilde{X}_{\bullet k}(2p(X_{\bullet})-1)].$$

Now let $N$ be the matrix with

$$N_{jk}=\mathbb{E}[\tilde{X}_{\bullet j}\tilde{X}_{\bullet k}(2p(X_{\bullet})-1)]. \tag{6}$$

Under our sampling assumptions,

$$\mathbb{E}\Bigl(\frac{1}{n}U^{\top}U\Bigr)=\begin{bmatrix}\tilde{\Sigma}&N\\ N&\tilde{\Sigma}\end{bmatrix}. \tag{7}$$

The right-hand side of (7) represents the expected information per observation in our tie-breaker design. Using these formulas, we obtain the following desirable result for prospective D-optimality.

Theorem 2.

Under the model (2) with $X_{i}\sim P_{X}$, the RCT $p(X_{\bullet})=1/2$ is prospectively D-optimal among all designs $p(X_{\bullet})$. Moreover, this is the unique D-optimal design of the form (3).

The proofs of Theorem 2 and of all subsequent theorems are presented in Appendix A. Theorem 2 does not require the $X_{i}$ to be independent, though that would be the usual model.

The theorem establishes that the RCT is prospectively D-optimal among all randomization schemes $\mathbb{P}(Z=1\mid X_{\bullet})=p(X_{\bullet})\in[0,1]$. It is not necessarily the unique optimum in this larger class. For instance, if

$$\mathbb{E}[X_{\bullet j}X_{\bullet k}(2p(X_{\bullet})-1)]=0$$

for all $j$ and $k$, then the function $p(\cdot)$ would provide the same efficiency as an RCT, since it would make the matrix $N$ in (7) vanish.
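Theorem 2 can be illustrated numerically. The sketch below is ours (it assumes Gaussian covariates and one particular TBD of the form (3) with $\eta=e_{1}$, window $(-1,1)$ and $p=1/2$); it estimates $\tilde{\Sigma}$ and $N$ by Monte Carlo and evaluates the determinant of the block matrix in (7), with the RCT attaining the larger value:

```python
import numpy as np

# Monte Carlo estimate of the prospective D-criterion det E[U'U / n] via (7),
# comparing an RCT (which makes N = 0) with a TBD of the form (3).
rng = np.random.default_rng(3)
m, d = 500_000, 2
X = rng.standard_normal((m, d))
Xt = np.column_stack([np.ones(m), X])

def det_criterion(p):
    # p[i] = p(X_i); Sigma-tilde and N are estimated as in (6) and (7).
    Sigma = Xt.T @ Xt / m
    N = Xt.T @ (Xt * (2.0 * p - 1.0)[:, None]) / m
    return np.linalg.det(np.block([[Sigma, N], [N, Sigma]]))

score = X[:, 0]                              # running variable: eta = e_1
p_tbd = np.where(score >= 1.0, 1.0, np.where(score <= -1.0, 0.0, 0.5))
p_rct = np.full(m, 0.5)                      # 2p - 1 = 0, so N vanishes exactly
```

Here `det_criterion(p_rct)` exceeds `det_criterion(p_tbd)`, as Theorem 2 predicts: the TBD makes $N$ nonzero, which can only reduce the determinant.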

Though we frame Theorem 2 via prospective D-optimality, the result also holds for a broader class of criteria that we call prospectively monotone. The basic idea of such criteria is that they depend only on the bottom-right submatrix of $\mathbb{E}[(U^{\top}U)^{-1}]$ and that they encourage this submatrix to be small in the standard ordering on positive semi-definite matrices (i.e., $\Sigma_{1}\preceq\Sigma_{2}$ if and only if $\Sigma_{2}-\Sigma_{1}$ is PSD). The precise definition is as follows.

Definition 2.

[Prospective monotonicity] For random predictors, we say that a criterion is prospectively monotone for δ^\hat{\delta} if it depends only on the matrix 𝔼[Var(γ~^X,Z)1]\mathbb{E}[\text{Var}(\hat{\tilde{\gamma}}\!\mid\!X,Z)^{-1}] and if the criterion increases whenever this matrix increases in the standard ordering on positive semi-definite matrices.

Because these are the only properties of prospective D-optimality used in Theorem 2, the same result holds immediately for any prospectively monotone criterion. Examples include prospective A-optimality, which maximizes Tr(𝔼[Var(γ~^X,Z)1])\text{Tr}(\mathbb{E}[\text{Var}(\hat{\tilde{\gamma}}\!\mid\!X,Z)^{-1}]), and prospective C-optimality, which maximizes c𝔼[Var(γ~^X,Z)1]cc^{\top}\mathbb{E}[\text{Var}(\hat{\tilde{\gamma}}\!\mid\!X,Z)^{-1}]c for some preset vector cc (the latter being useful when a particular linear combination of the elements of γ~^\hat{\tilde{\gamma}} is of primary interest).

3.3 Symmetric Distributions with Symmetric Designs

We can gain particular insights, both theoretical and practical, by considering a special case satisfying two conditions. First, PXP_{X} has a symmetric density, i.e., fX(x)=fX(x)f_{X}(\vec{x})=f_{X}(-\vec{x}) for xd\vec{x}\in\mathbb{R}^{d}. This includes the special case of Gaussian covariates, which we consider in more detail below. Second, we assume that p=1/2p=1/2 and that the randomization window is symmetric about zero with width Δ0\Delta\geqslant 0, which we call a symmetric design. That is, we restrict (3) to simply

\mathbb{P}(Z_{i}=1\mid X_{i})=\begin{cases}1,&X_{i}^{\top}\eta\geqslant\Delta\\ 1/2,&|X_{i}^{\top}\eta|<\Delta\\ 0,&X_{i}^{\top}\eta\leqslant-\Delta.\end{cases} (8)

For j=k=0j=k=0 (i.e., both terms are intercepts), equation (6) reduces to

N_{00}=\mathbb{E}\bigl[\mathds{1}\{X_{\bullet}^{\top}\eta\geqslant\Delta\}-\mathds{1}\{X_{\bullet}^{\top}\eta\leqslant-\Delta\}\bigr]=0

since we are integrating an odd function with respect to a symmetric density. Likewise, when both j,k1j,k\geqslant 1 we have Njk=0N_{jk}=0. The only cases that remain are the first row and first column of NN, besides the top-left entry. Thus, we can write

N=[0αα0d×d]\displaystyle N=\begin{bmatrix}0&\alpha^{\top}\\ \alpha&\textbf{0}_{d\times d}\end{bmatrix} (9)

where αd\alpha\in\mathbb{R}^{d} with

αj\displaystyle\alpha_{j} =𝔼[Xj(𝟙{XηΔ}𝟙{XηΔ})]\displaystyle=\mathbb{E}\bigl{[}X_{\bullet j}\bigl{(}\mathds{1}\{X_{\bullet}^{\top}\eta\geqslant\Delta\}-\mathds{1}\{X_{\bullet}^{\top}\eta\leqslant-\Delta\}\bigr{)}\bigr{]}
=2𝔼[Xj𝟙{XηΔ}].\displaystyle=2\mathbb{E}[X_{\bullet j}\mathds{1}\{X_{\bullet}^{\top}\eta\geqslant\Delta\}]. (10)
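For a quick numerical sanity check of (10) (ours, not from the paper): in one dimension with X ~ N(0,1) and η = 1, α has the closed form 2φ(Δ) for the standard normal density φ, which a Monte Carlo estimate reproduces. The function names below are our own.

```python
import math
import random

def alpha_mc(delta, n=200_000, seed=0):
    # Monte Carlo estimate of alpha = 2 E[X 1{X >= Delta}] for X ~ N(0, 1), eta = 1
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        if x >= delta:
            total += x
    return 2.0 * total / n

def alpha_exact(delta):
    # Closed form 2 * phi(Delta), with phi the standard normal density
    return 2.0 * math.exp(-delta ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

for delta in (0.0, 0.5, 1.0):
    print(f"Delta = {delta:.1f}: MC {alpha_mc(delta):.4f} vs exact {alpha_exact(delta):.4f}")
```

At Δ = 0 this gives α² = 2/π, the e⁰ case of the (2/π)e^{−Δ²/ηᵀΣη} factor appearing in Theorem 4 below.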

We note that α=α(Δ,η)\alpha=\alpha(\Delta,\eta) depends on the width Δ\Delta and the treatment assignment vector η\eta, but we suppress that dependence for notational ease. From (9), we can compute explicitly that

N\tilde{\Sigma}^{-1}N=\begin{bmatrix}\alpha^{\top}\Sigma^{-1}\alpha&0\\ 0&\alpha\alpha^{\top}\end{bmatrix},

so our criterion becomes

det(Σ~)det(Σ~NΣ~1N)\displaystyle\det(\tilde{\Sigma})\det(\tilde{\Sigma}-N\tilde{\Sigma}^{-1}N) =det(Σ)(1αΣ1α)det(Σαα)\displaystyle=\det(\Sigma)(1-\alpha^{\top}\Sigma^{-1}\alpha)\det(\Sigma-\alpha\alpha^{\top})
=(1αΣ1α)2det(Σ)2.\displaystyle=(1-\alpha^{\top}\Sigma^{-1}\alpha)^{2}\det(\Sigma)^{2}.

In the last line we use the formula det(A+cd)=det(A)(1+dA1c)\det(A+cd^{\top})=\det(A)(1+d^{\top}A^{-1}c) for the determinant of a rank-one update of an invertible matrix and we also note that det(Σ~)=det(Σ)\det(\tilde{\Sigma})=\det(\Sigma). Let W=Σ1/2W=\Sigma^{1/2} so that Var(W1x)=I\mathrm{Var}(W^{-1}x)=I. The efficiency therefore only depends on α\alpha through αΣ1α=W1α2\alpha^{\top}\Sigma^{-1}\alpha=\|W^{-1}\alpha\|^{2}.
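The rank-one determinant update can be checked numerically; a minimal plain-Python sketch of ours for a 2×2 example (all names hypothetical):

```python
# Numerical check of the matrix determinant lemma used above:
# det(A + c d^T) = det(A) (1 + d^T A^{-1} c), for invertible A.

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]

A = [[2.0, 1.0], [1.0, 3.0]]
c = [0.5, -1.0]
d = [1.0, 2.0]

# Left side: det(A + c d^T)
A_cd = [[A[i][j] + c[i] * d[j] for j in range(2)] for i in range(2)]
lhs = det2(A_cd)

# Right side: det(A) (1 + d^T A^{-1} c)
Ainv = inv2(A)
Ainv_c = [sum(Ainv[i][j] * c[j] for j in range(2)) for i in range(2)]
rhs = det2(A) * (1.0 + sum(d[i] * Ainv_c[i] for i in range(2)))

print(lhs, rhs)  # both approximately 2.5
```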

We could also ask whether we can do better by changing our randomization scheme to allow

(Zi=1Xi)={1,XiηΔp,|Xiη|<Δ0,XiηΔ\displaystyle\mathbb{P}(Z_{i}=1\!\mid\!X_{i})=\begin{cases}1,&X_{i}^{\top}\eta\geqslant\Delta\\ p,&|X_{i}^{\top}\eta|<\Delta\\ 0,&X_{i}^{\top}\eta\leqslant-\Delta\end{cases} (11)

for some other p1/2p\neq 1/2. While this may be a reasonable choice in practice when treatment cannot be assigned equally, it cannot provide any efficiency benefit in the symmetric case, as Theorem 3 below shows. Just as an RCT is most efficient globally, if one uses the three-level rule (11) then the best choice for the middle level is 1/21/2, and that choice is unique under a reasonable assumption.

Theorem 3.

If PXP_{X} is symmetric and the randomization window is symmetric around zero with width Δ\Delta, then a prospectively D-optimal design of the form (11) has p=1/2p=1/2. Moreover, this design is unique provided that (|Xη|Δ)>0\mathbb{P}(|X_{\bullet}^{\top}\eta|\leqslant\Delta)>0.

An informative example is the case in which PX=𝒩(0,Σ)P_{X}=\mathcal{N}(0,\Sigma) for some covariance matrix Σ\Sigma. In this case, we can compute the efficiency explicitly as a function of Δ\Delta.

Theorem 4.

Let PXP_{X} be the 𝒩(0,Σ)\mathcal{N}(0,\Sigma) distribution for a positive definite matrix Σ\Sigma. For XiiidPXX_{i}\stackrel{{\scriptstyle\mathrm{iid}}}{{\sim}}P_{X} and ZiZ_{i} sampled independently from (3) for a nonzero vector ηd\eta\in\mathbb{R}^{d} and a threshold Δ0\Delta\geqslant 0

\det\bigl(\mathbb{E}\bigl(\mathrm{Var}(\hat{\delta}\mid Z)\bigr)^{-1}\bigr)=\Bigl(1-\frac{2}{\pi}e^{-\Delta^{2}/(\eta^{\top}\Sigma\eta)}\Bigr)^{2}\det(\Sigma)^{2}.

From Theorem 4 we find that the efficiency ratio between Δ=\Delta=\infty (the RCT) and Δ=0\Delta=0 (the RDD) is (12/π)27.57(1-{2}/{\pi})^{-2}\approx 7.57. The result in Goldberger (1972) gives a ratio of (12/π)1(1-2/\pi)^{-1} for the variance of the slope in the case d=1d=1; our ratio is the square of that because our determinant criterion incorporates both the intercept and the slope. Notably, we obtain the same efficiency ratio for all d1d\geqslant 1.
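The efficiency formula of Theorem 4 is easy to evaluate numerically; a minimal sketch of ours (function name hypothetical) recovering the RCT/RDD ratio, with the constant det(Σ)² factor dropped:

```python
import math

def d_efficiency(delta, eta_sigma_eta=1.0):
    # (1 - (2/pi) exp(-Delta^2 / eta' Sigma eta))^2, per Theorem 4, without det(Sigma)^2
    return (1.0 - (2.0 / math.pi) * math.exp(-delta ** 2 / eta_sigma_eta)) ** 2

rct = d_efficiency(float("inf"))  # Delta = infinity: everyone randomized (RCT)
rdd = d_efficiency(0.0)           # Delta = 0: no randomization (RDD)
print(f"RCT/RDD efficiency ratio: {rct / rdd:.2f}")  # 7.57
```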

In this multivariate setting we see that for any fixed Δ>0\Delta>0, the most efficient design minimizes ηΣη\eta^{\top}\Sigma\eta over unit vectors η\eta, so the optimal η\eta is the eigenvector corresponding to the smallest eigenvalue of Σ\Sigma. This represents the least “distribution-aware” choice, i.e., the last principal component vector, which aligns with our intuition that we gain more information by randomizing as much as possible.

4 Short-term Gain

We turn now to the other arm of the tradeoff, the short-term gain. In our motivating problems, the response is defined so that larger values of YiY_{i} denote better outcomes. Now 𝔼[Yi]=𝔼[X~iβ~]+T~i\mathbb{E}[Y_{i}]=\mathbb{E}[\tilde{X}_{i}^{\top}\tilde{\beta}]+\tilde{T}_{i} where T~i=𝔼[ZiX~iγ~]\tilde{T}_{i}=\mathbb{E}[Z_{i}\tilde{X}_{i}^{\top}\tilde{\gamma}]. The first term in 𝔼[Yi]\mathbb{E}[Y_{i}] is not affected by the treatment allocation, so any consideration of short-term gain can be expressed in terms of T~i\tilde{T}_{i}. Further, T~i=𝔼[Ziγ0]+Ti\tilde{T}_{i}=\mathbb{E}[Z_{i}\gamma_{0}]+T_{i} for Ti=𝔼[ZiXiγ]T_{i}=\mathbb{E}[Z_{i}X_{i}^{\top}\gamma]. Now 𝔼[Ziγ0]\mathbb{E}[Z_{i}\gamma_{0}] depends on our design only via the expected proportion of treated subjects. This will often be fixed by a budget constraint, and even when it is not, it does not depend on where specifically we assign the treatment. Therefore, for design purposes we may focus on TiT_{i}.

Under the model (2), with treatment assignment as in (11) but with a general randomization window (,u)(\ell,u) in place of (Δ,Δ)(-\Delta,\Delta),

T_{i}=\mathbb{E}\bigl[X_{i}^{\top}\gamma\bigl(\mathds{1}\{X_{i}^{\top}\eta\geqslant u\}-\mathds{1}\{X_{i}^{\top}\eta\leqslant\ell\}+(2p-1)\mathds{1}\{X_{i}^{\top}\eta\in(\ell,u)\}\bigr)\bigr]. (12)

If η=γ\eta=\gamma, so that we assign treatment using the true treatment effect vector, then equation (12) shows that the best expected gain comes from taking u==0u=\ell=0, which is an RDD centered at the mean. Moreover, the expected gain decreases monotonically as we increase uu beyond 0 or decrease \ell below 0. This matches our intuition that we must sacrifice some short-term gain to improve on statistical efficiency. Ordinarily ηγ\eta\neq\gamma, and a poor choice of η\eta could break this monotonicity.

In the Gaussian case considered in Section 3.3, we can likewise derive an explicit formula for the expected gain as a function of uu, η\eta, and γ\gamma. Letting T=ZXγT_{\bullet}=Z_{\bullet}X_{\bullet}^{\top}\gamma, we have

\mathbb{E}[T_{\bullet}] =\mathbb{E}[X_{\bullet}^{\top}\gamma(\mathds{1}\{X_{\bullet}^{\top}\eta\geqslant u\}-\mathds{1}\{X_{\bullet}^{\top}\eta\leqslant-u\})]
=\gamma^{\top}\mathbb{E}[X_{\bullet}(\mathds{1}\{X_{\bullet}^{\top}\eta\geqslant u\}-\mathds{1}\{X_{\bullet}^{\top}\eta\leqslant-u\})]
=\gamma^{\top}\alpha.

Using the formula (23) for α\alpha in the Gaussian case, this is simply

\mathbb{E}[T_{\bullet}]=\sqrt{\frac{2}{\pi}}\,\frac{\gamma^{\top}\Sigma\eta}{\sqrt{\eta^{\top}\Sigma\eta}}\,e^{-u^{2}/(2\eta^{\top}\Sigma\eta)}.

We expect intuitively that η=γ\eta=\gamma will maximize 𝔼[T]\mathbb{E}[T_{\bullet}]. To verify this we start by choosing u=u(η)u=u(\eta) in a way that keeps the proportion of data in the three treatment regions constant. We do so by taking u=u(η)=u0ηΣηu=u(\eta)=u_{0}\sqrt{\eta^{\top}\Sigma\eta} for some u00u_{0}\geqslant 0, and then

𝔼[T]=2πγΣηηΣη eu02/2.\mathbb{E}[T_{\bullet}]=\sqrt{\frac{2}{\pi}}\frac{\gamma^{\top}\Sigma\eta}{\sqrt{\eta^{\top}\Sigma\eta}}\text{ }e^{{-u_{0}^{2}}/{2}}.

Let γw=Σ1/2γ\gamma_{w}=\Sigma^{1/2}\gamma and ηw=Σ1/2η\eta_{w}=\Sigma^{1/2}\eta, using the same matrix square root in both cases. Then

\frac{\gamma^{\top}\Sigma\eta}{\sqrt{\eta^{\top}\Sigma\eta}}=\frac{\gamma_{w}^{\top}\eta_{w}}{\|\eta_{w}\|}

is maximized by taking ηw=γw\eta_{w}=\gamma_{w} or equivalently η=γ\eta=\gamma. Any positive scaling of η=γ\eta=\gamma leaves this criterion invariant.

Working under the normalization ηΣη=1\eta^{\top}\Sigma\eta=1, we can summarize our results in the Gaussian case as

\det\Bigl(\mathbb{E}\Bigl[\frac{1}{n}U^{\top}U\Bigr]\Bigr) =\Bigl(1-\frac{2}{\pi}e^{-u_{0}^{2}}\Bigr)^{2}\det(\Sigma)^{2},\quad\text{and} (13)
\mathbb{E}[T_{\bullet}] =\sqrt{\frac{2}{\pi}}\,\gamma^{\top}\Sigma\eta\,e^{-u_{0}^{2}/2}. (14)

With our normalization, u2=u02ηΣη=u02u^{2}=u_{0}^{2}\eta^{\top}\Sigma\eta=u_{0}^{2}. Equations (13) and (14) quantify the tradeoff between efficiency and short-term gain that comes from choosing u0u_{0}. Greater randomization through larger u0u_{0} increases efficiency and, assuming that the sign of η\eta is properly chosen, decreases the short-term gain.
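A small sketch of ours (function names hypothetical) tabulating (13) and (14) under the normalizations ηᵀΣη = 1 and γᵀΣη = 1 (i.e., η = γ up to scale), with the constant det(Σ)² factor dropped:

```python
import math

def efficiency(u0):
    # Equation (13) without the det(Sigma)^2 factor
    return (1.0 - (2.0 / math.pi) * math.exp(-u0 ** 2)) ** 2

def gain(u0):
    # Equation (14) with gamma' Sigma eta = 1
    return math.sqrt(2.0 / math.pi) * math.exp(-u0 ** 2 / 2.0)

print("u0    efficiency  gain")
for u0 in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"{u0:.1f}   {efficiency(u0):.3f}       {gain(u0):.3f}")
```

Efficiency rises and gain falls monotonically in u₀, tracing out the tradeoff curve.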

5 Convex Formulation

In this section, we return to the setting where x1,,xndx_{1},\ldots,x_{n}\in\mathbb{R}^{d} are fixed values but ZiZ_{i} are not yet assigned. We use lowercase xix_{i} to emphasize that they are non-random. We assume that any subset of the xix_{i} of size dd has full rank, as would be the case almost surely when they are drawn IID from a distribution whose covariance matrix is of full rank. The design problem is to choose pi=(Zi=1)p_{i}=\mathbb{P}(Z_{i}=1).

For given xix_{i}, the design matrix in (2) is

U=[u1(Z1)u2(Z2)un(Zn)],forui(1)=ui+[x~ix~i]andui(1)=ui[x~ix~i].U=\begin{bmatrix}u_{1}(Z_{1})^{\top}\\ u_{2}(Z_{2})^{\top}\\ \vdots\\ u_{n}(Z_{n})^{\top}\end{bmatrix},\quad\text{for}\quad u_{i}(1)=u_{i+}\equiv\begin{bmatrix}\tilde{x}_{i}\\ \tilde{x}_{i}\end{bmatrix}\quad\text{and}\quad u_{i}(-1)=u_{i-}\equiv\begin{bmatrix}\phantom{-}\tilde{x}_{i}\\ -\tilde{x}_{i}\end{bmatrix}.

Introducing pi+=pip_{i+}=p_{i} and pi=1pip_{i-}=1-p_{i} we get

𝔼(UU)\displaystyle\mathbb{E}(U^{\top}U) =i=1n𝔼(ui(Zi)ui(Zi))=i=1n(pi+ui+ui++piuiui).\displaystyle=\sum_{i=1}^{n}\mathbb{E}(u_{i}(Z_{i})u_{i}(Z_{i})^{\top})=\sum_{i=1}^{n}(p_{i+}u_{i+}u_{i+}^{\top}+p_{i-}u_{i-}u_{i-}^{\top}). (15)

Our design criterion is to choose pi±p_{i\pm} to minimize

logdet(𝔼(UU))=logdet(i=1ns{+,}pisuisuis).-\log\det(\mathbb{E}(U^{\top}U))=-\log\det\Biggl{(}\sum_{i=1}^{n}\sum_{s\in\{+,-\}}p_{is}u_{is}u_{is}^{\top}\Biggr{)}.

This problem is only well-defined for ndn\geqslant d, since otherwise the matrix does not have full rank for any choice of pnp\in\mathbb{R}^{n} and the determinant is always zero. This criterion is convex in {pis1in,s{+,}}\{p_{is}\!\mid\!1\leqslant i\leqslant n,s\in\{+,-\}\} by a direct match with Section 7.5.2 of Boyd and Vandenberghe (2004) over the convex domain with 0pi±10\leqslant p_{i\pm}\leqslant 1 and pi++pi=1p_{i+}+p_{i-}=1 for all ii.

It will be simpler for us to optimize over q2p1[1,1]nq\equiv 2p-1\in[-1,1]^{n}, in which case

logdet(𝔼(UU))=logdet([i=1nx~ix~ii=1nqix~ix~ii=1nqix~ix~ii=1nx~ix~i]).\displaystyle-\log\det(\mathbb{E}(U^{\top}U))=-\log\det\left(\begin{bmatrix}\sum_{i=1}^{n}\tilde{x}_{i}\tilde{x}_{i}^{\top}&\sum_{i=1}^{n}q_{i}\tilde{x}_{i}\tilde{x}_{i}^{\top}\\ \sum_{i=1}^{n}q_{i}\tilde{x}_{i}\tilde{x}_{i}^{\top}&\sum_{i=1}^{n}\tilde{x}_{i}\tilde{x}_{i}^{\top}\end{bmatrix}\right). (16)

Absent any other constraints, we have seen that the RCT (qi=0q_{i}=0, pi=1/2p_{i}={1}/{2} for all ini\leqslant n) always minimizes (16). The constrained optimization of (16) can be cast as both a semi-definite program (Boyd and Vandenberghe, 2004) and a mixed integer second-order cone program (Sagnol and Harman, 2015).
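As a concrete illustration (a sketch of ours, not the paper's CVXR/MOSEK implementation), the criterion (16) can be evaluated directly for d = 1 and compared across designs; the helper names below are our own:

```python
import math

def det(mat):
    # Determinant via Gaussian elimination with partial pivoting
    m = [row[:] for row in mat]
    n = len(m)
    sign, out = 1.0, 1.0
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(m[r][k]))
        if piv != k:
            m[k], m[piv] = m[piv], m[k]
            sign = -sign
        out *= m[k][k]
        for r in range(k + 1, n):
            f = m[r][k] / m[k][k]
            for c in range(k, n):
                m[r][c] -= f * m[k][c]
    return sign * out

def neg_log_det_criterion(xs, q):
    # -log det of the block matrix in (16), with x_tilde_i = (1, x_i), i.e. d = 1
    A = [[0.0, 0.0], [0.0, 0.0]]  # sum_i x_tilde_i x_tilde_i^T
    B = [[0.0, 0.0], [0.0, 0.0]]  # sum_i q_i x_tilde_i x_tilde_i^T
    for xi, qi in zip(xs, q):
        xt = (1.0, xi)
        for a in range(2):
            for b in range(2):
                A[a][b] += xt[a] * xt[b]
                B[a][b] += qi * xt[a] * xt[b]
    M = [A[0] + B[0], A[1] + B[1], B[0] + A[0], B[1] + A[1]]
    return -math.log(det(M))

xs = [-2.0 + 0.2 * i for i in range(21)]     # fixed grid of running variable values
rct = neg_log_det_criterion(xs, [0.0] * 21)  # RCT: q_i = 0, i.e. p_i = 1/2
rdd = neg_log_det_criterion(xs, [1.0 if x >= 0 else -1.0 for x in xs])  # RDD
print(rct < rdd)  # True: the RCT attains the smaller (better) criterion value
```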

This setting is close to the usual design measure relaxation. Instead of choosing a single point (x~i,Zi)(\tilde{x}_{i},Z_{i}) for observation ii, we make a random choice between (x~i,1)(\tilde{x}_{i},1) and (x~i,1)(\tilde{x}_{i},-1) for that point. The difference here is that we have the union of nn such tiny design problems.

5.1 Some useful convex constraints

We outline a few reasonable (and convex) constraints that one could impose on this problem:

Budget constraint: In practice we have a fixed budget for the treatments. For instance the number of scholarships or customer perks to give out may be fixed for economic reasons. We can impose this constraint in expectation by setting (1/n)i=1npi=μ(1/n)\sum_{i=1}^{n}p_{i}=\mu, where μ\mu is some fixed average treatment rate.

Monotonicity constraint: It may be reasonable to require that pip_{i} is nondecreasing in some running variable ri=f(x~i)r_{i}=f(\tilde{x}_{i}). For example, a university may require that an applicant’s probability of receiving a scholarship can only stay constant or increase with their score, which is some combination of applicant variables. We can encode this as a convex constraint by first permuting the data matrix so that r(1)r(2)r(n)r_{(1)}\leqslant r_{(2)}\leqslant\cdots\leqslant r_{(n)} and then forcing p(1)p(2)p(n)p_{(1)}\leqslant p_{(2)}\leqslant\cdots\leqslant p_{(n)}. Note that the formulation (3) satisfies this monotonicity constraint, in which case ri=x~iηr_{i}=\tilde{x}_{i}^{\top}\eta.

Gain constraint: One may also want to impose that the expected gain is at least some fraction of its highest possible value, i.e.

i=1npix~iηρi=1n|x~iη|.\displaystyle\sum_{i=1}^{n}p_{i}\tilde{x}_{i}^{\top}\eta\geqslant\rho\sum_{i=1}^{n}|\tilde{x}_{i}^{\top}\eta|. (17)

The left-hand side of (17) is the expected gain for this choice of pip_{i}, whereas the right-hand side is the highest possible gain, which corresponds to the RDD (Zi=1)=𝟙{x~iη0}\mathbb{P}(Z_{i}=1)=\mathds{1}\{\tilde{x}_{i}^{\top}\eta\geqslant 0\}. Because γ\gamma is typically not known exactly, (17) computes the anticipated gain under the sampling direction η\eta we use. If an analyst has access to a better estimate of gain, such as a set of estimated treatment effects {τ^(x~1),,τ^(x~n)}\{\hat{\tau}(\tilde{x}_{1}),\ldots,\hat{\tau}(\tilde{x}_{n})\} fit from prior data, they could replace (17) by i=1npiτ^(x~i)ρi=1n|τ^(x~i)|\sum_{i=1}^{n}p_{i}\hat{\tau}(\tilde{x}_{i})\geqslant\rho\sum_{i=1}^{n}|\hat{\tau}(\tilde{x}_{i})|, which is again a linear constraint on qq.

Covariate balance constraint: The expected total value of the jjth variable for treated subjects is ipix~ij\sum_{i}p_{i}\tilde{x}_{ij}. The constraint

ipi(x~ijΔj)0\sum_{i}p_{i}(\tilde{x}_{ij}-\Delta_{j})\leqslant 0

keeps the expected sample average of the jjth variable among treated subjects to at most Δj\Delta_{j} regardless of the actual number of treated subjects. We note that sufficiently stringent covariate balance constraints may not be compatible with gain or monotonicity constraints.
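All of the constraints above are linear in pp and hence also in the variable q=2p1q=2p-1 used in (16). A small sketch of ours (hypothetical data and names) verifying the budget and monotonicity conversions:

```python
def to_q(p):
    # Map treatment probabilities p in [0, 1] to q = 2p - 1 in [-1, 1]
    return [2.0 * pi - 1.0 for pi in p]

p = [0.0, 0.25, 0.5, 0.75, 1.0]   # hypothetical monotone treatment probabilities
q = to_q(p)

# Budget: (1/n) sum_i p_i = mu  <=>  (1/n) sum_i q_i = 2 mu - 1
mu = sum(p) / len(p)
assert abs(sum(q) / len(q) - (2.0 * mu - 1.0)) < 1e-12

# Monotonicity: p nondecreasing in the running variable  <=>  q nondecreasing
assert all(qa <= qb for qa, qb in zip(q, q[1:]))
print("budget and monotonicity constraints translate directly to q")
```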

5.2 Piecewise constant optimal designs

Though we delay discussion of a particular example to Section 6, we prove here an empirically observed phenomenon regarding the monotonicity constraint: optimal solutions under a monotonicity constraint display only a few distinct levels in the assigned treatment probabilities. This is consistent with the one-dimensional theory of Li and Owen (2023), though interestingly that paper observed the same behavior even with only a budget constraint. It also used an approach that does not obviously extend to d>1d>1.

Theorem 5.

Consider the optimization problem

min𝑞logdet(i=1n[x~ix~iqix~ix~iqix~ix~ix~ix~i])s.t.q[1,1]n,qπ(1)qπ(2)qπ(n),q¯=μ.\displaystyle\begin{split}\underset{q}{\min}&-\log\det\left(\sum_{i=1}^{n}\begin{bmatrix}\tilde{x}_{i}\tilde{x}_{i}^{\top}&q_{i}\tilde{x}_{i}\tilde{x}_{i}^{\top}\\ q_{i}\tilde{x}_{i}\tilde{x}_{i}^{\top}&\tilde{x}_{i}\tilde{x}_{i}^{\top}\end{bmatrix}\right)\\ \mathrm{s.t.}\ \ &q\in[-1,1]^{n},\\ &q_{\pi(1)}\leqslant q_{\pi(2)}\leqslant\cdots\leqslant q_{\pi(n)},\\ &\overline{q}=\mu.\end{split} (18)

Suppose that the non-intercept covariate vectors x1,,xnx_{1},\ldots,x_{n} are drawn IID from a density on d\mathbb{R}^{d}. Then with probability one, any solution to (18) has at most (d+22)+2\binom{d+2}{2}+2 distinct levels.

These upper bounds closely resemble those in Cook and Fedorov (1995), who study constraints on a design of x~1,,x~n\tilde{x}_{1},\ldots,\tilde{x}_{n} themselves rather than on a treatment variable. However, theirs is an existence result for some solution with this level of sparsity, whereas our result holds for any solution with probability one. Moreover, their result derives from Carathéodory’s theorem, whereas ours comes from a close analysis of the KKT conditions.

The upper bounds in Theorem 5 are not tight. In our experience, there have typically been no more than five or six distinct levels even for dd as high as ten.

6 MIMIC-IV-ED Example

In this section we detail a simulation based on a real data set of emergency department (ED) patients. The MIMIC-IV-ED database (Johnson et al., 2021) provided via PhysioNet (Goldberger et al., 2000) includes data on ED admissions at the Beth Israel Deaconess Medical Center between 2011 and 2019.

Emergency departments face heavy resource constraints, particularly in the limited human attention and beds available. It is thus important to ensure patients are triaged appropriately so that the patients in most urgent need of care are assigned to intensive care units (ICUs). In practice, this is often done via a scoring method such as the Emergency Severity Index (ESI), in which patients receive a score in {1,2,3,4,5}\{1,2,3,4,5\}, with 11 indicating the highest severity and 55 indicating the lowest severity. MIMIC-IV-ED contains these values as acuity scores, along with a vector of vital signs and other relevant information about each patient.

Such a setting provides a very natural potential use case for tie-breaker designs. Patients arrive with an assortment of covariates, and hospitals acting under resource constraints must decide whether to put them in an ICU. A hospital or researcher may be interested in the treatment effect of an ICU bed; for example, a practical implication of such a question is whether to expand the ICU or allocate resources elsewhere (Phua et al., 2020). It is also of interest (Chang et al., 2017) to understand which types of patient benefit more or less from ICU beds to improve this costly resource allocation. Obviously, it is both unethical and counterproductive to assign some ICU beds by an RCT. Patients with high acuity scores must be sent to the ICU, and those with low acuity scores may be exposed unnecessarily to bad outcomes such as increased risk of acquiring a hospital-based infection (Kumar et al., 2018; Vranas et al., 2018). However, it may be possible to randomize “in the middle”, e.g., by randomizing for patients with an intermediate acuity score such as 33, or with scores in a range where there is uncertainty as to whether the ICU would be beneficial for that patient. Because such patients are believed to have similar severities, this would limit ethical concerns and allow for greater information gain.

Figure 1: Efficiency and gain of the MIMIC-IV-ED simulation as a function of the width of the randomization window Δ\Delta.

The triage data set contains several vital signs for patients. Of these, we use all quantitative ones, which are: temperature, heart rate (HR), respiration rate (RR), oxygen saturation (O2O_{2} Sat.), and systolic and diastolic blood pressure (SBP and DBP). There is also an acuity score for each patient, as described above. The data set contains 448,972 entries, but to arrive at a more realistic sample for a prospective analysis, we randomly select 200200 subjects among those with no missing or blatantly inaccurate entries. Our optimization was done using the CVXR package (Fu et al., 2020) and the MOSEK solver (ApS, 2022).

To carry out a full analysis of the sort described in this paper, we need a vector η\eta, as in (3). In practice, one could assume a model of the form (2) and take η=γ^\eta=\hat{\gamma} for some estimate of γ\gamma formed via prior data. Since we do not have any YY values (which in practice could be some measure of survival, length of stay, or subsequent readmission), we will construct η\eta via the acuity scores, using the reasonable assumption that treatment benefit increases with more severe acuity scores.

We collapse acuity scores of {1,2}\{1,2\} into a group (Y=1Y=1) and acuity scores of {3,4,5}\{3,4,5\} into another (Y=0Y=0) and perform a logistic regression using these binary groups. The covariates used are the vital signs and their squares, the latter to allow for non-monotonic effects, e.g., the acuity score might be lower for both abnormally low and abnormally high heart rates. All covariates were scaled to mean zero and variance one. For pure quadratic terms, the squares of the scaled covariates were themselves scaled to have mean zero and variance one. We also considered an ordered categorical regression model but preferred the logistic regression for ease of interpretability. Our estimated coefficients η^j\hat{\eta}_{j} are in Table 1.

Int.     Temp.    Temp.²   HR       HR²      RR       RR²
−0.74    −0.32    0.22     −0.03    0.67     −0.03    0.54
O₂ Sat.  O₂ Sat.² SBP      SBP²     DBP      DBP²
0.03     0.36     0.01     0.17     −0.11    −0.13
Table 1: These are the coefficients η^j\hat{\eta}_{j} that define a quadratic running variable for the MIMIC data. The intercept is followed by a sum of pure quadratics in temperature, heart rate, respiration rate, O2O_{2} saturation, systolic blood pressure and diastolic blood pressure.

Figure 1 presents the efficiency/gain tradeoff as we vary the size of the randomization window Δ\Delta in (8). For ease of visualization, we plot the logs of both quantities. As expected, we get a clear monotone increase in efficiency and decrease in gain as we increase Δ\Delta, moving from an RDD to an RCT. It should be noted that our efficiency criterion, because it only uses information in the XX values, is robust to a poor choice of η\eta, whereas our gain definition is constrained by the assumption that η\eta is a reasonably accurate stand-in for the true treatment effect γ\gamma.

In practice, it is hard to say what constitutes a “good” value of efficiency on the scale of our D-optimality criterion. Hence, as in Owen and Varian (2020), a pragmatic approach is to first stipulate that the gain is at least some fraction of its highest possible value, and then pick the largest Δ\Delta satisfying that requirement, to maximize efficiency. A more qualitative choice based on results like Figure 1, such as picking the right endpoint of a sharp efficiency jump or the left endpoint of a sharp gain decline, would also be sensible.

Figure 2: Optimal solutions for MIMIC-IV-ED treatment probabilities under various constraints. The treatment constraint imposed p¯=0.2\overline{p}=0.2 for the average treatment rate, and the gain constraint imposed ρ=0.7\rho=0.7, i.e., at least 70%70\% of the maximum possible gain.

As we see in Figure 2, the treatment constraint causes most pip_{i} to be at or near zero or one. Adding the gain constraint pushes most of the treatment probabilities to zero for low values of the running variable and to one for high values. This scenario most closely resembles the RDD, with some deviations to boost efficiency. Indeed, the optimal solution necessarily tends towards the RDD solution as the gain constraint tightens. Finally, the monotonicity constraint further pushes the higher values of pp to the positive values of the running variable and vice versa, since we lose the opportunity to counterbalance high and low probabilities at one extreme with their opposites at the other. The right two panels display only a few discrete levels in the treatment probability, consistent with Theorem 5.

7 Discussion

In this paper, we add to a growing body of work demonstrating the benefits of tie-breaker designs. Though RCTs are often infeasible, opportunities for small doses of randomization may present themselves in a wide variety of real-world settings, in which case treatment effects can be learned more efficiently. This phenomenon is analogous to similar causal inference findings about merging observational and experimental data (Rosenman et al., 2021, 2023; Colnet et al., 2024).

The convex optimization framework in Section 5 is more general and conveniently only relies on knowing sample data rather than population parameters. It is also simple to implement and allows one to incorporate natural economic and ethical constraints with ease.

Multivariate tie-breaker designs are a natural option in situations in which there is no clear univariate running variable. For example, subjects may possess a vector of covariates, many of which could be associated with heterogeneous treatment effects in some unknown way of interest. Of course, two-line models and their multivariate analogs are not nearly as complicated as many of the models found in practice. Our view is to use them as a working model by which to decide on treatment allocations, in which case a more flexible model could be used upon full data acquisition as appropriate.

8 Acknowledgements

We thank John Cherian, Anav Sood, Harrison Li and Dan Kluger for helpful discussions. We also thank Balasubramanian Narasimhan for helpful input on the convex optimization problem, Michael Baiocchi and Minh Nguyen of the Stanford School of Medicine for discussions about triage to hospital intensive care units, and an anonymous reviewer for helpful comments. This work was supported by the NSF under grants IIS-1837931 and DMS-2152780. T. M. is supported by a B. C. and E. J. Eaves Stanford Graduate Fellowship.

References

  • Aiken et al. (1998) L. Aiken, S. West, D. Schwalm, J. Carroll, and C. Hsiung. Comparison of a randomized and two quasi-experimental designs in a single outcome evaluation: Efficacy of a university-level remedial writing program. Evaluation Review, 22(2):207–244, 1998.
  • Angrist et al. (2014) J. Angrist, S. Hudson, and A. Pallais. Leveling up: Early results from a randomized evaluation of post-secondary aid. Technical report, National Bureau of Economic Research, 2014. URL http://www.nber.org/papers/w20800.pdf.
  • ApS (2022) MOSEK ApS. MOSEK Fusion API for Python 9.3.18, 2022. URL https://docs.mosek.com/latest/rmosek/index.html.
  • Atkinson et al. (2007) A. Atkinson, A. Donev, and R. Tobias. Optimum experimental designs, with SAS, volume 34 of Oxford Statistical Science Series. Oxford University Press, Oxford, 2007.
  • Boruch (1975) R. F. Boruch. Coupling randomized experiments and approximations to experiments in social program evaluation. Sociological Methods & Research, 4(1):31–53, 1975.
  • Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
  • Bu et al. (2020) X. Bu, D. Majumdar, and J. Yang. D-optimal designs for multinomial logistic models. The Annals of Statistics, 48(2):983–1000, 2020.
  • Campbell (1969) D. T. Campbell. Reforms as experiments. American Psychologist, 24(4):409, 1969.
  • Cappelleri et al. (1994) J. C. Cappelleri, R. B. Darlington, and W. M. K. Trochim. Power analysis of cutoff-based randomized clinical trials. Evaluation Review, 18(2):141–152, 1994.
  • Cattaneo and Titiunik (2022) M. D. Cattaneo and R. Titiunik. Regression discontinuity designs. Annual Review of Economics, 14:821–851, 2022.
  • Chaloner and Verdinelli (1995) K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, pages 273–304, 1995.
  • Chang et al. (2017) D. Chang, D. Dacosta, and M. Shapiro. Priority levels in medical intensive care at an academic public hospital. JAMA Intern Med., 177(2):280–281, 2017.
  • Colnet et al. (2024) B. Colnet, I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse, and S. Yang. Causal inference methods for combining randomized trials and observational studies: a review. Statistical Science, 39(1):165–191, 2024.
  • Cook and Fedorov (1995) D. Cook and V. Fedorov. Constrained optimization of experimental design. Statistics, 26(2):129–148, 1995.
  • Cook and Thibodeau (1980) R. D. Cook and L. Thibodeau. Marginally restricted D-optimal designs. Journal of the American Statistical Association, pages 366–371, 1980.
  • Dette (1996) H. Dette. A note on Bayesian c- and D-optimal designs in nonlinear regression models. The Annals of Statistics, 24(3):1225–1234, 1996.
  • Fu et al. (2020) A. Fu, N. Balasubramanian, and S. Boyd. CVXR: An R package for disciplined convex optimization. Journal of Statistical Software, 94(14):1–34, 2020. doi: 10.18637/jss.v094.i14.
  • Gelman and Imbens (2019) A. Gelman and G. Imbens. Why high-order polynomials should not be used in regression discontinuity designs. Journal of Business & Economic Statistics, 37(3):447–456, 2019.
  • Gelman et al. (2020) A. Gelman, J. Hill, and A. Vehtari. Regression and other stories. Cambridge University Press, 2020.
  • Goldberger (1972) A. Goldberger. Selection bias in evaluating treatment effects: Some formal illustrations. Technical report discussion paper, Institute for Research on Poverty, University of Wisconsin–Madison, 1972.
  • Goldberger et al. (2000) A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. Ch. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. Physiobank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000.
  • Hahn et al. (2001) J. Hahn, P. Todd, and W. Van der Klaauw. Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1):201–209, 2001.
  • Heavlin and Finnegan (1998) W. D. Heavlin and G. P. Finnegan. Columnwise construction of response surface designs. Statistica Sinica, pages 185–206, 1998.
  • Jacob et al. (2012) R. Jacob, P. Zhu, M.-A. Somers, and H. Bloom. A practical guide to regression discontinuity. Working paper, MDRC, 2012.
  • Johnson et al. (2021) A. Johnson, L. Bulgarelli, T. Pollard, L. A. Celi, R. Mark, and S. Horng. MIMIC-IV-ED. https://doi.org/10.13026/77z6-9w59, 2021.
  • Kluger and Owen (2023) D. M. Kluger and A. B. Owen. Kernel regression analysis of tie-breaker designs. Electronic Journal of Statistics, 17:243–290, 2023.
  • Krantz (2022) C. Krantz. Modeling the outcomes of a longitudinal tie-breaker regression discontinuity design to assess an in-home training program for families at risk of child abuse and neglect. PhD thesis, University of Oklahoma, 2022.
  • Kumar et al. (2018) S. Kumar, B. Shankar, S. Arya, M. Deb, and H. Chellani. Healthcare associated infections in neonatal intensive care unit and its correlation with environmental surveillance. Journal of Infection and Public Health, 11(2):275–279, 2018.
  • Li and Owen (2022) H. Li and A. B. Owen. A general characterization of optimality in tie-breaker designs. Technical report, arXiv:2202.12511, Stanford University, 2022.
  • Li and Owen (2023) H. Li and A. B. Owen. A general characterization of optimality in tie-breaker designs. The Annals of Statistics, 51(3):1030–1057, 2023.
  • Lipsey et al. (1981) M. W. Lipsey, D. S. Cordray, and D. E. Berger. Evaluation of a juvenile diversion program: Using multiple lines of evidence. Evaluation Review, 5(3):283–306, 1981.
  • Lopez-Fidalgo and Garcet-Rodriguez (2004) J. Lopez-Fidalgo and S. A. Garcet-Rodriguez. Optimal experimental designs when some independent variables are not subject to control. Journal of the American Statistical Association, 99(468):1190–1199, 2004.
  • Metelkina and Pronzato (2017) A. Metelkina and L. Pronzato. Information-regret compromise in covariate-adaptive treatment allocation. The Annals of Statistics, 45 (5):2046–2073, 2017.
  • Morrison et al. (2024) T. Morrison, M. Nguyen, M. Baiocchi, and A. B. Owen. Constrained design of a binary instrument in a partially linear model. Technical report, arXiv:2406.05592, 2024.
  • Nachtsheim (1989) C. J. Nachtsheim. On the design of experiments in the presence of fixed covariates. Journal of Statistical Planning and Inference, 22(2):203–212, 1989.
  • Owen and Varian (2020) A. B. Owen and H. Varian. Optimizing the tie-breaker regression discontinuity design. Electronic Journal of Statistics, 14(2):4004–4027, 2020. doi: 10.1214/20-EJS1765.
  • Phua et al. (2020) J. Phua, M. Hashmi, and R. Haniffa. ICU beds: less is more? Not sure. Intensive Care Med., 46:1600–1602, 2020.
  • Rosenman et al. (2021) E. T. R. Rosenman, M. Baiocchi, H. Banack, and A. B. Owen. Propensity score methods for merging observational and experimental datasets. Statistics in Medicine, 2021.
  • Rosenman et al. (2023) E. T. R. Rosenman, G. Basse, A. B. Owen, and M. Baiocchi. Combining observational and experimental datasets using shrinkage estimators. Biometrics, 79(4):2961–2973, 2023.
  • Sagnol and Harman (2015) G. Sagnol and R. Harman. Computing exact $D$-optimal designs by mixed integer second-order cone programming. The Annals of Statistics, 43(5):2198–2224, 2015.
  • Slivkins (2019) A. Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1–2):1–286, 2019.
  • Vranas et al. (2018) K. Vranas, J. Jopling, J. Scott, O. Badawi, M. Harhay, C. Slatore, M. Ramsey, M. Breslow, A. Milstein, and M. Kerlin. The association of ICU acuity with outcomes of patients at low risk of dying. Crit Care Med., 46(3):347–353, 2018.
  • Yang et al. (2016) J. Yang, A. Mandal, and D. Majumdar. Optimal designs for $2^{k}$ factorial experiments with binary response. Statistica Sinica, 26(1):385–411, 2016.
  • Yang et al. (2017) J. Yang, L. Tong, and A. Mandal. D-optimal designs with ordered categorical data. Statistica Sinica, 27(4):1879–1902, 2017.

Appendix A Proofs

Proof of Theorem 2.

Using $\mathbb{E}(\mathrm{Var}(\hat{\delta}\mid X,Z)^{-1})$ from (7), we have

\begin{align*}
\det\begin{bmatrix}\tilde{\Sigma}&N\\ N&\tilde{\Sigma}\end{bmatrix}&=\det(\tilde{\Sigma})\det(\tilde{\Sigma}-N\tilde{\Sigma}^{-1}N)\\
&=\det(\tilde{\Sigma})^{2}\det(I-\tilde{\Sigma}^{-1/2}N\tilde{\Sigma}^{-1}N\tilde{\Sigma}^{-1/2})\\
&=\det(\tilde{\Sigma}^{2})\det(I-A)
\end{align*}

where $A=\tilde{\Sigma}^{-1/2}N\tilde{\Sigma}^{-1}N\tilde{\Sigma}^{-1/2}$ is symmetric and positive semi-definite. Now $\det(I-A)\leqslant 1$ with equality if and only if $A=0$, which occurs if and only if $N=0$. Therefore, any design giving $N=0$ is prospectively D-optimal. From the formula (6) for $N_{jk}$, we see right away that the RCT $p(X_{\bullet})=1/2$ satisfies this, which proves the first half of the theorem.
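The determinant identity and the bound $\det(I-A)\leqslant 1$ above can be checked numerically. Below is a small sketch (not part of the proof) using an arbitrary random positive definite $\tilde{\Sigma}$ and a small symmetric $N$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Arbitrary positive definite Sigma-tilde and a small symmetric N.
B = rng.standard_normal((d, d))
S = B @ B.T + d * np.eye(d)
M = rng.standard_normal((d, d))
N = 0.1 * (M + M.T)

# Symmetric inverse square root of S via its eigendecomposition.
w, V = np.linalg.eigh(S)
S_inv_half = V @ np.diag(w ** -0.5) @ V.T

A = S_inv_half @ N @ np.linalg.inv(S) @ N @ S_inv_half  # positive semi-definite
lhs = np.linalg.det(np.block([[S, N], [N, S]]))
rhs = np.linalg.det(S) ** 2 * np.linalg.det(np.eye(d) - A)

print(np.isclose(lhs, rhs))               # the determinant identity holds
print(bool(lhs < np.linalg.det(S) ** 2))  # det(I - A) < 1 here since A != 0
```

Any design with $N=0$ attains the bound, which is the content of the first half of the theorem.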

If we restrict to designs of the form (3), then for general $(\ell,u,p,\eta)$ we have

\[
N_{jk}=\mathbb{E}\bigl[\tilde{X}_{\bullet j}\tilde{X}_{\bullet k}\bigl(\mathds{1}\{X_{\bullet}^{\top}\eta\geqslant u\}-\mathds{1}\{X_{\bullet}^{\top}\eta\leqslant\ell\}+q\,\mathds{1}\{X_{\bullet}^{\top}\eta\in(\ell,u)\}\bigr)\bigr]
\]

where $q\equiv 2p-1$. Consider a tuple $(\ell,u,p,\eta)$ for which $N=0$. Then considering in particular the intercept entry of $N$ with $j=k=0$ (so that $\tilde{X}_{\bullet j}=\tilde{X}_{\bullet k}=1$), we must have

\begin{equation}
p_{u}-p_{\ell}+qp_{m}=0\tag{19}
\end{equation}

where $p_{u}=\mathbb{P}(X_{\bullet}^{\top}\eta\geqslant u)$, $p_{\ell}=\mathbb{P}(X_{\bullet}^{\top}\eta\leqslant\ell)$, and $p_{m}=\mathbb{P}(X_{\bullet}^{\top}\eta\in(\ell,u))$. If we freeze $k=0$ and vary $j$ from $1$ to $d$, we obtain

\[
\mathbb{E}\bigl[X_{\bullet j}\bigl(\mathds{1}\{X_{\bullet}^{\top}\eta\geqslant u\}-\mathds{1}\{X_{\bullet}^{\top}\eta\leqslant\ell\}+q\,\mathds{1}\{X_{\bullet}^{\top}\eta\in(\ell,u)\}\bigr)\bigr]=0
\]

for all $j$. Taking the linear combination of these equations with coefficients $\eta_{j}$ gives

\[
\mathbb{E}\bigl[X_{\bullet}^{\top}\eta\bigl(\mathds{1}\{X_{\bullet}^{\top}\eta\geqslant u\}-\mathds{1}\{X_{\bullet}^{\top}\eta\leqslant\ell\}+q\,\mathds{1}\{X_{\bullet}^{\top}\eta\in(\ell,u)\}\bigr)\bigr]=0.
\]

Let $f(u)=\mathbb{E}[X_{\bullet}^{\top}\eta\mid X_{\bullet}^{\top}\eta\geqslant u]$, and analogously let $f(\ell)=\mathbb{E}[X_{\bullet}^{\top}\eta\mid X_{\bullet}^{\top}\eta\leqslant\ell]$ and $f(m)=\mathbb{E}[X_{\bullet}^{\top}\eta\mid X_{\bullet}^{\top}\eta\in(\ell,u)]$. Then

\begin{equation}
f(u)p_{u}-f(\ell)p_{\ell}+f(m)qp_{m}=0.\tag{20}
\end{equation}

Next, equations (19) and (20) give respectively that

\begin{align*}
f(\ell)p_{\ell}&=f(\ell)p_{u}+f(\ell)qp_{m}\\
&=f(u)p_{u}+f(m)qp_{m},
\end{align*}

i.e., that

\[
(f(u)-f(\ell))p_{u}=(f(\ell)-f(m))qp_{m}.
\]

By definition, $f(u)>f(m)>f(\ell)$, so the left-hand side is non-negative and $f(\ell)-f(m)$ is negative. This implies that $qp_{m}\leqslant 0$. Similarly, we have

\[
f(u)p_{u}=f(u)p_{\ell}-f(u)qp_{m}=f(\ell)p_{\ell}-f(m)qp_{m}
\]

and so

\[
(f(u)-f(\ell))p_{\ell}=(f(u)-f(m))qp_{m}.
\]

Therefore $qp_{m}\geqslant 0$, and so $qp_{m}=0$. This leaves three possibilities: $q=0$, $p_{m}=0$, or both. If $p_{m}=0$, then (19) implies that $p_{u}=p_{\ell}=1/2$, which is impossible given (20) because $f(u)>f(\ell)$. So it must be that $q=0$, whereupon (19) again implies that $p_{u}=p_{\ell}$ and (20) then forces $p_{u}=p_{\ell}=0$. In summary, a prospectively D-optimal design must satisfy

\begin{align*}
&\mathbb{P}(X_{\bullet}^{\top}\eta<\ell)=\mathbb{P}(X_{\bullet}^{\top}\eta>u)=0,\quad\text{and}\\
&\mathbb{P}(Z=1\mid X_{\bullet}^{\top}\eta\in(\ell,u))=\frac{1}{2}.\qquad\qed
\end{align*}
Proof of Theorem 3.

Let $q=2p-1$. The off-diagonal block matrix $N=N(q)$ in (7) can now be written as

\[
N_{jk}=\mathbb{E}\bigl[\tilde{X}_{\bullet j}\tilde{X}_{\bullet k}\bigl(\mathds{1}\{X_{\bullet}^{\top}\eta\geqslant u\}-\mathds{1}\{X_{\bullet}^{\top}\eta\leqslant-u\}+q\,\mathds{1}\{|X_{\bullet}^{\top}\eta|<u\}\bigr)\bigr].
\]

That is, we can write $N=N_{0}+qN_{1}$, where $N_{0}$ is as in (9) and $N_{1}$ has $(j,k)$ entry equal to $\mathbb{E}[\tilde{X}_{\bullet j}\tilde{X}_{\bullet k}\mathds{1}\{|X_{\bullet}^{\top}\eta|<u\}]$. Note that $N_{1}$ is block diagonal, the exact opposite of $N_{0}$. Let

\begin{equation}
f(q)=\log\det\bigl(\tilde{\Sigma}-(N_{0}+qN_{1})\tilde{\Sigma}^{-1}(N_{0}+qN_{1})\bigr).\tag{21}
\end{equation}

To prove the theorem, we will show that $f'(0)=0$ and $f''(q)\leqslant 0$ for $q\in[-1,1]$, implying that $q=0$ (i.e., $p=1/2$) is a global maximizer of $f$ on this interval. Let

\begin{equation}
\begin{split}
A&=-N_{1}\tilde{\Sigma}^{-1}N_{1},\\
B&=-(N_{1}\tilde{\Sigma}^{-1}N_{0}+N_{0}\tilde{\Sigma}^{-1}N_{1}),\quad\text{and}\\
C&=\tilde{\Sigma}-N_{0}\tilde{\Sigma}^{-1}N_{0},
\end{split}\tag{22}
\end{equation}

so that $f(q)=\log\det(q^{2}A+qB+C)$. Call a $(d+1)\times(d+1)$ block matrix “block off-diagonal” if it is zero in the top-left entry and zero in the bottom-right $d\times d$ block, as is the case for $N_{0}$. The product of two block off-diagonal matrices is block diagonal (with blocks of size $1\times 1$ and $d\times d$), and the product of a block off-diagonal matrix and such a block diagonal matrix is block off-diagonal. Thus, $A$ and $C$ are both block diagonal, whereas $B$ is block off-diagonal. Differentiating $f$, we obtain

\[
f'(q)=\mathrm{tr}\bigl((q^{2}A+qB+C)^{-1}(2qA+B)\bigr)
\]

so that $f'(0)=\mathrm{tr}(C^{-1}B)$. As noted, $C$ is block diagonal and $B$ is block off-diagonal, so the product $C^{-1}B$ is block off-diagonal and thus $f'(0)=0$. It simplifies some expressions to let $M_{1}=2qA+B$ and $M_{2}=(q^{2}A+qB+C)^{-1}$. Then $f'(q)=\mathrm{tr}(M_{2}M_{1})$ and

\[
f''(q)=\mathrm{tr}(-M_{1}M_{2}M_{1}M_{2}+2M_{2}A).
\]

For $q\in[-1,1]$, $M_{2}$ is the upper-left block of the inverse of the covariance matrix in (7), so it is positive semi-definite. Then $M_{2}^{1/2}M_{1}M_{2}M_{1}M_{2}^{1/2}$ is positive semi-definite as well and thus

\[
\mathrm{tr}(-M_{1}M_{2}M_{1}M_{2})=-\mathrm{tr}(M_{2}^{1/2}M_{1}M_{2}M_{1}M_{2}^{1/2})\leqslant 0.
\]

In addition, $A$ is negative semi-definite, so

\[
\mathrm{tr}(2M_{2}A)=2\,\mathrm{tr}(M_{2}^{1/2}AM_{2}^{1/2})\leqslant 0.
\]

Therefore, $f''(q)\leqslant 0$ everywhere, so $q=0$ is in fact a global optimum.

If $\mathbb{P}(|X_{\bullet}^{\top}\eta|\leqslant u)>0$, then $N_{1}\neq 0$, so $A$ is nonzero and the trace inequality $\mathrm{tr}(2M_{2}A)\leqslant 0$ is strict. Then $f''(q)<0$ for all $q\in[-1,1]$, so $f$ is strictly concave and $q=0$ is the unique global maximizer on this interval. ∎
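As a numerical illustration of the theorem (a sketch with arbitrary $\eta$ and $u$, not part of the proof), one can estimate $\tilde{\Sigma}$, $N_{0}$ and $N_{1}$ from a symmetrized Gaussian sample and verify that $f(q)$ from (21) is maximized at $q=0$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 2000
eta = np.array([1.0, 0.5])   # arbitrary illustrative direction
u = 0.8                      # arbitrary illustrative threshold

# Symmetrize the sample so its empirical law is exactly symmetric,
# matching the population block structure used in the proof.
x = rng.standard_normal((n, d))
x = np.vstack([x, -x])
Xt = np.hstack([np.ones((2 * n, 1)), x])   # X-tilde = (1, x)

s = x @ eta
w0 = (s >= u).astype(float) - (s <= -u).astype(float)  # treated minus control
w1 = (np.abs(s) < u).astype(float)                     # randomization window

Sigma = Xt.T @ Xt / (2 * n)
N0 = Xt.T @ (Xt * w0[:, None]) / (2 * n)
N1 = Xt.T @ (Xt * w1[:, None]) / (2 * n)

def f(q):
    Nq = N0 + q * N1
    return np.linalg.slogdet(Sigma - Nq @ np.linalg.inv(Sigma) @ Nq)[1]

grid = np.linspace(-1, 1, 201)
vals = np.array([f(q) for q in grid])
print(grid[np.argmax(vals)])   # maximized at q = 0, i.e., p = 1/2
```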

Proof of Theorem 4.

To prove Theorem 4, we begin with two lemmas. We write $\varphi$ for the $\mathcal{N}(0,1)$ probability density function. We start our study of efficiency by finding an expression for $\alpha_{j}$.

Lemma 1.

Let $P_{X}$ be the $\mathcal{N}(0,I_{d})$ distribution, let $\alpha_{j}$ be given by (3.3), and let $Z_{i}$ be sampled according to (3) for a nonzero vector $\eta\in\mathbb{R}^{d}$ and $u\geqslant 0$. Then
\[
\alpha_{j}=2\frac{\eta_{j}}{\|\eta\|}\varphi\Bigl(\frac{u}{\|\eta\|}\Bigr)
\]
for $j=1,\dots,d$.

Proof.

The result is easy if $\eta_{j}=0$. Without loss of generality, assume that $\eta_{j}>0$. Let $x_{-j}$ and $\eta_{-j}$ be the vectors in $\mathbb{R}^{d-1}$ formed by removing the $j$th component from the vectors $x$ and $\eta$, respectively. Using $\varphi'(t)=-t\varphi(t)$,

\begin{align*}
\mathbb{E}[x_{j}\mathds{1}\{x^{\top}\eta\geqslant u\}]&=\mathbb{E}\bigl[x_{j}\mathds{1}\{x_{j}\geqslant(u-x_{-j}^{\top}\eta_{-j})/\eta_{j}\}\bigr]\\
&=\mathbb{E}\bigl[\varphi\bigl((u-x_{-j}^{\top}\eta_{-j})/\eta_{j}\bigr)\bigr]
\end{align*}

and applying the same identity a second time, along with the symmetry of $\varphi$, we get

\begin{align*}
\alpha_{j}&=\mathbb{E}[x_{j}(\mathds{1}\{x^{\top}\eta\geqslant u\}-\mathds{1}\{x^{\top}\eta\leqslant-u\})]\\
&=\mathbb{E}\bigl[\varphi\bigl((u-x_{-j}^{\top}\eta_{-j})/\eta_{j}\bigr)+\varphi\bigl((u+x_{-j}^{\top}\eta_{-j})/\eta_{j}\bigr)\bigr].
\end{align*}

Now let $\tilde{u}_{j}=u/\eta_{j}$ and $\tilde{z}_{j}=x_{-j}^{\top}\eta_{-j}/\eta_{j}\sim\mathcal{N}(0,\tau^{2})$ with $\tau^{2}=\|\eta_{-j}\|^{2}/\eta_{j}^{2}$. Then we get

\begin{align*}
\alpha_{j}&=\frac{1}{\sqrt{2\pi}}\frac{1}{\sqrt{2\pi\tau^{2}}}\int_{-\infty}^{\infty}\Bigl(e^{-\frac{1}{2}(\tilde{u}_{j}-\tilde{z}_{j})^{2}}+e^{-\frac{1}{2}(\tilde{u}_{j}+\tilde{z}_{j})^{2}}\Bigr)e^{-\tilde{z}_{j}^{2}/(2\tau^{2})}\,\mathrm{d}\tilde{z}_{j}\\
&=\frac{1}{2\pi\tau}\Bigl(\frac{\sqrt{2\pi\tau^{2}}}{\sqrt{\tau^{2}+1}}e^{\frac{-\tilde{u}_{j}^{2}}{2(\tau^{2}+1)}}+\frac{\sqrt{2\pi\tau^{2}}}{\sqrt{\tau^{2}+1}}e^{\frac{-\tilde{u}_{j}^{2}}{2(\tau^{2}+1)}}\Bigr)\\
&=\sqrt{\frac{2}{\pi}}\frac{\eta_{j}}{\|\eta\|_{2}}e^{\frac{-u^{2}}{2\|\eta\|_{2}^{2}}}.\qquad\qed
\end{align*}
Lemma 2.

Let $P_{X}$ be the $\mathcal{N}(0,\Sigma)$ distribution for a positive definite matrix $\Sigma$, let $\alpha_{j}$ be given by (3.3), and let $Z_{i}$ be sampled according to (3) for a nonzero vector $\eta\in\mathbb{R}^{d}$ and $u\geqslant 0$. Then

\begin{equation}
\alpha=\sqrt{\frac{2}{\pi}}\frac{\Sigma\eta}{\sqrt{\eta^{\top}\Sigma\eta}}\,e^{\frac{-u^{2}}{2\eta^{\top}\Sigma\eta}}=2\frac{\Sigma\eta}{\sqrt{\eta^{\top}\Sigma\eta}}\,\varphi\biggl(\frac{u}{\sqrt{\eta^{\top}\Sigma\eta}}\biggr).\tag{23}
\end{equation}
Proof.

For the general case of $x\sim\mathcal{N}(0,\Sigma)$, with $\Sigma$ any positive definite matrix, we define $W=\Sigma^{1/2}$ and write $x=Wz$. Then $z\sim\mathcal{N}(0,I_{d})$ and $x_{j}=W_{j}^{\top}z$, where $W_{j}$ is the $j$th row of the symmetric matrix $W$, and so

\begin{align*}
\alpha_{j}&=\mathbb{E}[x_{j}(\mathds{1}\{x^{\top}\eta\geqslant u\}-\mathds{1}\{x^{\top}\eta\leqslant-u\})]\\
&=\mathbb{E}[W_{j}^{\top}z(\mathds{1}\{z^{\top}W\eta\geqslant u\}-\mathds{1}\{z^{\top}W\eta\leqslant-u\})]\\
&=W_{j}^{\top}\mathbb{E}[z(\mathds{1}\{z^{\top}W\eta\geqslant u\}-\mathds{1}\{z^{\top}W\eta\leqslant-u\})].
\end{align*}

This reduces the problem to the case $\Sigma=I_{d}$ with $\eta$ replaced by $W\eta$, so Lemma 1 gives

\[
\alpha=W\biggl(\sqrt{\frac{2}{\pi}}\frac{W\eta}{\|W\eta\|_{2}}\,e^{\frac{-u^{2}}{2\|W\eta\|_{2}^{2}}}\biggr)=\sqrt{\frac{2}{\pi}}\frac{\Sigma\eta}{\sqrt{\eta^{\top}\Sigma\eta}}\,e^{\frac{-u^{2}}{2\eta^{\top}\Sigma\eta}},
\]
using $W^{2}=\Sigma$ and $\|W\eta\|_{2}^{2}=\eta^{\top}\Sigma\eta$. This is (23). $\qed$
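Formula (23) can be checked by Monte Carlo. The sketch below (not from the paper; the parameter values are arbitrary) compares the empirical $\alpha=\mathbb{E}[x(\mathds{1}\{x^{\top}\eta\geqslant u\}-\mathds{1}\{x^{\top}\eta\leqslant-u\})]$ to the closed form; taking $\Sigma=I_{d}$ would likewise check Lemma 1:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 1_000_000
eta = np.array([1.0, -1.0, 2.0])  # arbitrary nonzero direction
u = 0.5                           # arbitrary threshold

# Arbitrary positive definite covariance, and x ~ N(0, Sigma).
B = rng.standard_normal((d, d))
Sigma = B @ B.T + np.eye(d)
x = rng.standard_normal((n, d)) @ np.linalg.cholesky(Sigma).T

s = x @ eta
w = (s >= u).astype(float) - (s <= -u).astype(float)
alpha_mc = (x * w[:, None]).mean(axis=0)

v = np.sqrt(eta @ Sigma @ eta)
alpha_exact = 2 * Sigma @ eta / v * np.exp(-0.5 * (u / v) ** 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(alpha_mc - alpha_exact)))  # small Monte Carlo error
```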

We can now conclude the proof of Theorem 4. By Lemma 2, we have

\begin{equation}
\alpha=\sqrt{\frac{2}{\pi}}\frac{\Sigma\eta}{\sqrt{\eta^{\top}\Sigma\eta}}\,e^{\frac{-u^{2}}{2\eta^{\top}\Sigma\eta}}=2\frac{\Sigma\eta}{\sqrt{\eta^{\top}\Sigma\eta}}\,\varphi\biggl(\frac{u}{\sqrt{\eta^{\top}\Sigma\eta}}\biggr).\tag{24}
\end{equation}

Therefore,

\[
\det\Bigl(\mathbb{E}\Bigl(\frac{1}{n}U^{\top}U\Bigr)\Bigr)=(1-\alpha^{\top}\Sigma^{-1}\alpha)^{2}\det(\Sigma)^{2}=\Bigl(1-\frac{2}{\pi}e^{\frac{-u^{2}}{\eta^{\top}\Sigma\eta}}\Bigr)^{2}\det(\Sigma)^{2}.\qquad\qed
\]
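The last equality above uses $\alpha^{\top}\Sigma^{-1}\alpha=(2/\pi)e^{-u^{2}/\eta^{\top}\Sigma\eta}$, which follows by substituting (24) and cancelling $\Sigma\Sigma^{-1}$. A quick numerical confirmation (with arbitrary $\Sigma$, $\eta$ and $u$):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
B = rng.standard_normal((d, d))
Sigma = B @ B.T + np.eye(d)   # arbitrary positive definite Sigma
eta = rng.standard_normal(d)  # arbitrary direction
u = 1.3                       # arbitrary threshold

v2 = eta @ Sigma @ eta
alpha = np.sqrt(2 / np.pi) * Sigma @ eta / np.sqrt(v2) * np.exp(-u**2 / (2 * v2))

lhs = alpha @ np.linalg.solve(Sigma, alpha)
rhs = (2 / np.pi) * np.exp(-u**2 / v2)
print(np.isclose(lhs, rhs))  # True
```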
Proof of Theorem 5.

In the optimization problem (18), $\pi$ is a fixed permutation in $S_{n}$ corresponding to a monotonicity constraint, and so it is no loss of generality to take $\pi$ to be the identity and omit it. If $\mu=-1$ or $1$, then clearly the only feasible solution is the constant vector $q=\mu$ (all control or all treated, respectively), so we ignore those cases henceforth.

Because we aim to show that optimal solutions take few distinct levels in $q$, it will be easier to reparametrize in terms of the increments $c_{i}=q_{i}-q_{i-1}$ for $i\geqslant 2$, with $c_{1}=q_{1}$, and show sparsity in $c$. Letting

\[
G_{0}=\begin{bmatrix}\tilde{X}^{\top}\tilde{X}&0\\ 0&\tilde{X}^{\top}\tilde{X}\end{bmatrix},\qquad G_{k}=\sum_{i=k}^{n}\begin{bmatrix}0&\tilde{x}_{i}\tilde{x}_{i}^{\top}\\ \tilde{x}_{i}\tilde{x}_{i}^{\top}&0\end{bmatrix},
\]

and $G(c)=G_{0}+c_{1}G_{1}+\cdots+c_{n}G_{n}$, we can rewrite this problem as

\begin{equation}
\begin{split}
\min_{c}\quad&-\log\det(G(c))\\
\text{such that}\quad&c_{1}\geqslant-1,\\
&c_{k}\geqslant 0\text{ for all }2\leqslant k\leqslant n,\\
&c^{\top}1\leqslant 1,\\
&c_{1}+\frac{n-1}{n}c_{2}+\frac{n-2}{n}c_{3}+\cdots+\frac{1}{n}c_{n}=\mu.
\end{split}\tag{25}
\end{equation}

The first three constraints correspond to $q\in[-1,1]^{n}$ and the monotonicity condition, while the last constraint corresponds to $\overline{q}=\mu$. Let $v$ be the vector with $v_{k}=(n-k+1)/n$, so that the last constraint in (25) is $c^{\top}v=\mu$. Moreover, let

\[
A=\begin{bmatrix}-1&0&\ldots&0\\ 0&-1&\ldots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&-1\\ 1&1&\ldots&1\end{bmatrix}\in\mathbb{R}^{(n+1)\times n}\qquad\text{and}\qquad d=\begin{bmatrix}1\\ 0\\ \vdots\\ 0\\ 1\end{bmatrix}\in\mathbb{R}^{n+1}
\]

so that the inequality constraints can be written simply as $Ac-d\leqslant 0$. Slater's condition holds by considering the interior point

\[
c=\Bigl(\mu-(n-1)\epsilon,\ \frac{n}{n-1}\epsilon,\ \frac{n}{n-2}\epsilon,\ \dots,\ \frac{n}{2}\epsilon,\ n\epsilon\Bigr)
\]

for any $\epsilon>0$ sufficiently small, so the KKT conditions are both necessary and sufficient. The Lagrangian is

\[
L(c;\lambda,\nu)=-\log\det(G(c))+\sum_{i=1}^{n+1}\lambda_{i}(a_{i}^{\top}c-d_{i})+\nu(c^{\top}v-\mu),
\]

where $a_{i}$ is a column vector formed from the $i$th row of $A$. The differential is then

\begin{align*}
\frac{\partial L}{\partial c_{k}}&=-\mathrm{tr}(G(c)^{-1}G_{k})+\sum_{i=1}^{n+1}\lambda_{i}a_{ik}+\nu v_{k}\\
&=-2\sum_{i=k}^{n}\tilde{x}_{i}^{\top}G(c)^{-1}_{12}\tilde{x}_{i}-\lambda_{k}+\lambda_{n+1}+\frac{n-k+1}{n}\nu.
\end{align*}

Here, $G(c)^{-1}_{12}\in\mathbb{R}^{(d+1)\times(d+1)}$ is the off-diagonal block of $G(c)^{-1}$, which is symmetric since the off-diagonal block of $G(c)$ itself is symmetric.
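The interior point exhibited above for Slater's condition can also be verified numerically. The sketch below (with arbitrary illustrative values of $n$, $\mu$ and $\epsilon$; not part of the proof) checks strict feasibility of the inequality constraints and the equality $c^{\top}v=\mu$:

```python
import numpy as np

n, mu, eps = 6, 0.3, 1e-3   # arbitrary illustrative values

# Proposed interior point c = (mu - (n-1) eps, n/(n-1) eps, ..., n/2 eps, n eps).
c = np.array([mu - (n - 1) * eps] + [n / (n - k + 1) * eps for k in range(2, n + 1)])
v = np.array([(n - k + 1) / n for k in range(1, n + 1)])  # v_k = (n - k + 1)/n

print(bool(c[0] > -1))              # c_1 > -1 strictly
print(bool(np.all(c[1:] > 0)))      # c_k > 0 strictly for k >= 2
print(bool(c.sum() < 1))            # c^T 1 < 1 strictly
print(bool(np.isclose(c @ v, mu)))  # equality constraint c^T v = mu
```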

The stationarity condition is that the partial derivative above is zero for all $k$. Suppose first that, at the optimal $c^{*}$, the components $c_{1}^{*},c_{2}^{*},\ldots,c_{m}^{*}$ are all nonzero, where $m$ is thus far unspecified. Later, we relax the condition that it is the first $m$ components of $c^{*}$ specifically that are nonzero.

The complementary slackness condition implies that $c_{k}^{*}\lambda_{k}=0$ for all $1\leqslant k\leqslant n$, so in order for this to hold we must have $\lambda_{1}=\lambda_{2}=\cdots=\lambda_{m}=0$. The stationarity condition for these terms is then

\begin{equation}
2\sum_{i=k}^{n}\tilde{x}_{i}^{\top}G(c^{*})^{-1}_{12}\tilde{x}_{i}=\lambda_{n+1}^{*}+\frac{n-k+1}{n}\nu^{*},\qquad\text{for all }1\leqslant k\leqslant m.\tag{26}
\end{equation}

The right-hand side of (26) is linear in $k$, with slope and intercept determined by $\lambda_{n+1}^{*}$ and $\nu^{*}$. These constitute two free parameters on the right-hand side. For a fixed matrix $G(c)$, equation (26) therefore holds with probability zero whenever $m\geqslant 3$, since $x_{1},\ldots,x_{n}$ are sampled IID from a density. However, $c^{*}$ is not fixed but rather an implicit function of the data matrix $X$. Thus, we instead show that, for $m$ sufficiently large, with probability one there is no symmetric matrix $M\in\mathbb{R}^{(d+1)\times(d+1)}$ and pair $(\lambda_{n+1}^{*},\nu^{*})$ with

\begin{equation}
\sum_{i=k}^{n}\tilde{x}_{i}^{\top}M\tilde{x}_{i}=\lambda_{n+1}^{*}+\frac{n-k+1}{n}\nu^{*},\qquad\text{for all }1\leqslant k\leqslant m.\tag{27}
\end{equation}

Here, the logic is analogous: there are $\binom{d+2}{2}$ free parameters for an arbitrary symmetric matrix $M\in\mathbb{R}^{(d+1)\times(d+1)}$, as well as two additional parameters from $\lambda_{n+1}^{*}$ and $\nu^{*}$. Thus, if $m\geqslant\binom{d+2}{2}+3$, then with probability one no such $M$ and $(\lambda_{n+1}^{*},\nu^{*})$ exist. The same argument applies to any size-$m$ subset of $\{\tilde{x}_{1},\ldots,\tilde{x}_{n}\}$, not just the first $m$, so the result follows by a union bound.