
To ignore dependencies is perhaps not a sin

Douglas P. Wiens Mathematical & Statistical Sciences, University of Alberta, Edmonton, Canada, T6G 2G1

January 30, 2025
Abstract

We present a result according to which certain functions of covariance matrices are maximized at scalar multiples of the identity matrix. In a statistical context in which such functions measure loss, this says that the least favourable form of dependence is in fact independence, so that a procedure optimal for i.i.d. data can be minimax. In particular, the ordinary least squares (ols) estimate of a correctly specified regression response is minimax among generalized least squares (gls) estimates, when the maximum is taken over certain classes of error covariance structures and the loss function possesses a natural monotonicity property. An implication is that it can be not only safe, but optimal to ignore such departures from the usual assumption of i.i.d. errors. We then consider regression models in which the response function is possibly misspecified, and show that ols is minimax if the design is uniform on its support, but that this often fails otherwise. We go on to investigate the interplay between minimax gls procedures and minimax designs, leading us to extend, to robustness against dependencies, an existing observation – that robustness against model misspecifications is increased by splitting replicates into clusters of observations at nearby locations.

keywords:
design, induced matrix norm, Loewner ordering, particle swarm optimization, robustness.
MSC:
[2010] Primary 62G35, Secondary 62K05
journal: U. Alberta preprint series

1 Introduction and summary

When carrying out a study, whether observational or designed, that calls for a regression analysis, the investigator may be faced with questions regarding possible correlations or heteroscedasticity within the data. If there are such departures from the assumptions underlying the use of the ordinary least squares (ols) estimates of the regression parameters, then the use of generalized least squares (gls) might be called for. In its pure form, as envisioned by Aitken (1935), this calls for the use of the inverse of the covariance matrix $C$, i.e. the precision matrix, of the random errors. This is inconvenient, since $C$ is rarely known and, even if there is some prior knowledge of its structure, before the study is carried out there are no data from which accurate estimates of its elements might be made. If a consistent estimate $\hat{C}^{-1}$ of the precision matrix does exist, then one can employ 'feasible generalized least squares' estimation - see e.g. Fomby et al. (1984). An example is the Cochrane-Orcutt procedure (Cochrane and Orcutt (1949)), which can be applied iteratively in AR(1) models. Otherwise a positive definite 'pseudo precision' matrix $P$ might be employed. With data $y$ and design matrix $X$ this leads to the estimate

\hat{\theta}_{\text{gls}}=\arg\min_{\theta}\left\|P^{1/2}\left(y-X\theta\right)\right\|^{2}=\left(X^{\prime}PX\right)^{-1}X^{\prime}Py.  (1)
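For concreteness, here is a minimal numerical sketch of the estimate (1). It assumes a NumPy environment; the function name and the example data are illustrative only and are not taken from the paper, whose computations were done in MATLAB. Taking P equal to the identity recovers ordinary least squares.

```python
import numpy as np

def gls_estimate(X, y, P):
    """Generalized least squares estimate (1): (X'PX)^{-1} X'P y."""
    XtP = X.T @ P
    return np.linalg.solve(XtP @ X, XtP @ y)

# Illustrative example: straight-line regression with a pseudo precision matrix.
rng = np.random.default_rng(0)
n = 10
x = np.linspace(-1, 1, n)
X = np.column_stack([np.ones(n), x])          # regressors f(x) = (1, x)'
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
P = np.eye(n)                                  # P = I_n gives the ols estimate
theta_hat = gls_estimate(X, y, P)
```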

In Wiens (2024b) a similar problem was addressed, pertaining to designed experiments whose data are to be analyzed by ols. A lemma, restated below as Lemma 1, was used to show that certain commonly employed loss functions, taking covariance matrices as their arguments and increasing with respect to the Loewner ordering by positive semidefiniteness, are maximized at scalar multiples of the identity matrix. This has the somewhat surprising statistical interpretation that the least favourable form of dependence is in fact independence. The lemma was used to show that the assumption of independent and homoscedastic errors at the design stage of an experiment is in fact a minimax strategy, within broad classes of alternate covariance structures.

In this article we study the implications of the lemma in the problem of choosing between ols and gls. We first show that, when the form of the regression response is accurately modelled, it can be safe, and indeed optimal – in a minimax sense – to ignore possible departures from independence and homoscedasticity, varying over certain large classes of such departures. This is because the common functions measuring the loss incurred by gls, when the covariance matrix of the errors is $C$, are maximized when $C$ is a multiple of the identity matrix. But in that case the best gls estimate is ols, i.e. ols is a minimax procedure.

We then consider the case of misspecified regression models, in which bias becomes a component of the integrated mean squared prediction error (imspe). The imspe is maximized over $C$ and over the departures from the fitted linear model. We show that, if a gls with (pseudo) precision matrix $P$ is employed, then the variance component of this maximum continues to be minimized by $P=I$, i.e. by ols, but the bias generally does not and, depending upon the design, ols can fail to be a minimax procedure. We show however that if the design is uniform on its support then ols is minimax. Otherwise, ols can fail to be minimax when the design emphasizes bias reduction over variance reduction to a sufficiently large extent.

We also construct minimax designs – minimizing the maximum imspe over the design – and combine them with minimax choices of $P$. These designs are often uniform on their supports, and so ols is a minimax procedure in this context. The design uniformity is attained by replacing the replicates that are a feature of 'classically optimal' designs minimizing variance alone by clusters of observations at nearby design points.

A summary of our findings is that, if a sensible design is chosen, then ols is at least 'almost' a minimax gls procedure, often exactly so. We conclude that, for Loewner-increasing loss functions, and for covariance matrices $C$ varying over the classes covered by Lemma 1, the simplicity of ols makes it a robust and attractive alternative to gls.

The computations for this article were carried out in matlab; the code is available on the author’s personal website.

2 A useful lemma

Suppose that $\left\|\cdot\right\|_{M}$ is a matrix norm, induced by the vector norm $\left\|\cdot\right\|_{V}$, i.e.

\left\|C\right\|_{M}=\sup_{\left\|x\right\|_{V}=1}\left\|Cx\right\|_{V}.

We use the subscript '$M$' when referring to an arbitrary matrix norm, but adopt special notation in the following cases:

(i) For the Euclidean norm $\left\|x\right\|_{V}=(x^{\prime}x)^{1/2}$, the matrix norm is denoted $\left\|C\right\|_{E}$ and is the spectral norm, i.e. the square root of the maximum eigenvalue of $C^{\prime}C$. This is the maximum eigenvalue of $C$ if $C$ is a covariance matrix, i.e. is symmetric and positive semidefinite.

(ii) For the sup norm $\left\|x\right\|_{V}=\max_{i}\left|x_{i}\right|$, the matrix norm $\left\|C\right\|_{\infty}$ is $\max_{i}\sum_{j}\left|c_{ij}\right|$, the maximum absolute row sum.

(iii) For the 1-norm $\left\|x\right\|_{V}=\sum_{i}\left|x_{i}\right|$, the matrix norm $\left\|C\right\|_{1}$ is $\max_{j}\sum_{i}\left|c_{ij}\right|$, the maximum absolute column sum. For symmetric matrices, $\left\|C\right\|_{1}=\left\|C\right\|_{\infty}$.
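As a quick illustration of cases (i)-(iii), the following sketch (assuming NumPy; the matrix is made up) evaluates the three induced norms for a small covariance matrix.

```python
import numpy as np

C = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])   # an arbitrary symmetric p.s.d. example

norm_E   = np.linalg.norm(C, 2)        # sqrt of max eigenvalue of C'C; equals max eigenvalue of C here
norm_inf = np.linalg.norm(C, np.inf)   # maximum absolute row sum
norm_1   = np.linalg.norm(C, 1)        # maximum absolute column sum
print(norm_E, norm_inf, norm_1)        # norm_1 == norm_inf since C is symmetric
```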

Now suppose that the loss function in a statistical problem is $\mathcal{L}\left(C\right)$, where $C$ is an $n\times n$ covariance matrix and $\mathcal{L}\left(\cdot\right)$ is non-decreasing in the Loewner ordering:

A\preceq B\Rightarrow\mathcal{L}\left(A\right)\leq\mathcal{L}\left(B\right).

Here $A\preceq B$ means that $B-A\succeq 0$, i.e. is positive semidefinite (p.s.d.).

The following lemma is established in Wiens (2024b).

Lemma 1.

For $\eta^{2}>0$, covariance matrix $C$ and induced norm $\left\|C\right\|_{M}$, define

\mathcal{C}_{M}=\left\{C\,\middle|\,C\succeq 0\text{ and }\left\|C\right\|_{M}\leq\eta^{2}\right\}.  (2)

For the norm $\left\|\cdot\right\|_{E}$ an equivalent definition is

\mathcal{C}_{E}=\left\{C\,\middle|\,0\preceq C\preceq\eta^{2}I_{n}\right\}.

Then:

(i) In any such class $\mathcal{C}_{M}$, $\max_{\mathcal{C}_{M}}\mathcal{L}\left(C\right)=\mathcal{L}\left(\eta^{2}I_{n}\right)$.

(ii) If $\mathcal{C}^{\prime}\subseteq\mathcal{C}_{M}$ and $\eta^{2}I_{n}\in\mathcal{C}^{\prime}$, then $\max_{\mathcal{C}^{\prime}}\mathcal{L}\left(C\right)=\mathcal{L}\left(\eta^{2}I_{n}\right)$.

A consequence of (i) of this lemma is that if one is carrying out a statistical procedure with loss function $\mathcal{L}\left(C\right)$, then a version of the procedure which minimizes $\mathcal{L}\left(\eta^{2}I_{n}\right)$ is minimax as $C$ varies over $\mathcal{C}_{M}$.

An interpretation of the lemma is that, in attempting to maximize loss by altering the correlations or increasing the variances of $C$, one should always choose the latter. But the procedures discussed in this article do not depend on the particular value of $\eta^{2}$ – its only role is to ensure that $\mathcal{C}_{M}$ is large enough to contain the departures of interest.

3 Generalized least squares regression estimates when the response is correctly specified

Consider the linear model

y=X\theta+\varepsilon  (3)

for $X_{n\times p}$ of rank $p$. Suppose that the random errors ε have covariance matrix $C\in\mathcal{C}_{M}$. If $C$ is known then the 'best linear unbiased estimate' is $\hat{\theta}_{\text{blue}}=\left(X^{\prime}C^{-1}X\right)^{-1}X^{\prime}C^{-1}y$. In the more common case that the covariances are at best only vaguely known, an attractive possibility is to use the generalized least squares estimate (1) for a given positive definite (pseudo) precision matrix $P$. If $P=C^{-1}$ then the blue is returned. A diagonal $P$ gives 'weighted least squares' (wls). Here we propose choosing $P$ according to the minimax principle, i.e. to minimize the maximum value of an appropriate function $\mathcal{L}\left(C\right)$ of the covariance matrix of the estimate, as $C$ varies over $\mathcal{C}_{M}$.

For brevity we drop the 'pseudo' and call $P$ a precision matrix. Since $\hat{\theta}_{\text{gls}}$ is invariant under multiplication of $P$ by a scalar, we assume throughout that

tr\left(P\right)=n.  (4)

The covariance matrix of $\hat{\theta}_{\text{gls}}$ is

\text{cov}\left(\hat{\theta}_{\text{gls}}\,\middle|\,C,P\right)=\left(X^{\prime}PX\right)^{-1}X^{\prime}PCPX\left(X^{\prime}PX\right)^{-1}.

Viewed as a function of $C$ this is non-decreasing in the Loewner ordering, so that if a function Φ is non-decreasing in this ordering, then

\mathcal{L}\left(C\,\middle|\,P\right)=\Phi\left\{\text{cov}\left(\hat{\theta}_{\text{gls}}\,\middle|\,C,P\right)\right\}

is also non-decreasing and the conclusions of the lemma hold:

\max_{\mathcal{C}_{M}}\mathcal{L}\left(C\,\middle|\,P\right)=\mathcal{L}\left(\eta^{2}I_{n}\,\middle|\,P\right)=\Phi\left\{\eta^{2}\left(X^{\prime}PX\right)^{-1}X^{\prime}P^{2}X\left(X^{\prime}PX\right)^{-1}\right\}.

But this last expression is minimized by $P=I_{n}$, i.e. by the ols estimate $\hat{\theta}_{\text{ols}}=\left(X^{\prime}X\right)^{-1}X^{\prime}y$, with minimum value

\max_{\mathcal{C}_{M}}\mathcal{L}\left(C\,\middle|\,I_{n}\right)=\Phi\left\{\eta^{2}\left(X^{\prime}X\right)^{-1}\right\}.

This follows from the monotonicity of Φ and the inequality

\eta^{2}\left(X^{\prime}PX\right)^{-1}X^{\prime}P^{2}X\left(X^{\prime}PX\right)^{-1}-\eta^{2}\left(X^{\prime}X\right)^{-1}=\eta^{2}\left(X^{\prime}PX\right)^{-1}X^{\prime}P\left\{I_{n}-X\left(X^{\prime}X\right)^{-1}X^{\prime}\right\}PX\left(X^{\prime}PX\right)^{-1}\succeq 0,

which uses the fact that $I_{n}-X\left(X^{\prime}X\right)^{-1}X^{\prime}$ is idempotent, hence positive semidefinite.
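The inequality above is easy to check numerically. Here is a small sketch (assumptions: NumPy, a made-up design matrix, and the trace criterion) verifying that the maximized loss $\Phi\{\eta^{2}(X^{\prime}PX)^{-1}X^{\prime}P^{2}X(X^{\prime}PX)^{-1}\}$ is never below $\Phi\{\eta^{2}(X^{\prime}X)^{-1}\}$ over randomly generated precision matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eta2 = 10, 1.0
X = np.column_stack([np.ones(n), np.linspace(-1, 1, n)])

def max_loss(P):
    """trace of eta^2 (X'PX)^{-1} X'P^2 X (X'PX)^{-1}, the worst case over C_M."""
    A = np.linalg.solve(X.T @ P @ X, X.T @ P)      # (X'PX)^{-1} X'P
    return eta2 * np.trace(A @ A.T)

ols_value = max_loss(np.eye(n))
for _ in range(1000):
    L = np.tril(rng.normal(size=(n, n))) + n * np.eye(n)  # well-conditioned lower triangle
    P = L @ L.T
    P *= n / np.trace(P)                                  # normalization tr(P) = n
    assert max_loss(P) >= ols_value - 1e-9                # ols is minimax
```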

It is well-known that if $0\preceq\Sigma_{1}\preceq\Sigma_{2}$ then the $i$th largest eigenvalue $\lambda_{i}$ of $\Sigma_{2}$ dominates that of $\Sigma_{1}$, for all $i$. It follows that Φ is non-decreasing in the Loewner ordering in the cases:

(i) $\Phi\left(\Sigma\right)=tr\left(\Sigma\right)=\sum_{i}\lambda_{i}\left(\Sigma\right)$;

(ii) $\Phi\left(\Sigma\right)=\det\left(\Sigma\right)=\prod_{i}\lambda_{i}\left(\Sigma\right)$;

(iii) $\Phi\left(\Sigma\right)=\max_{i}\lambda_{i}\left(\Sigma\right)$;

(iv) $\Phi\left(\Sigma\right)=tr\left(K\Sigma\right)$ for $K\succeq 0$.

Thus if loss is measured in any of these ways and $C\in\mathcal{C}_{M}$ then $\hat{\theta}_{\text{ols}}$ is minimax for $\mathcal{C}_{M}$ in the class of gls estimates.

Minimax procedures are sometimes criticized for dealing optimally with an overly pessimistic least favourable case – see Huber (1972) for a discussion; such criticism certainly does not apply here.

In each of the following examples, we posit a particular covariance structure for $C$, a norm $\left\|C\right\|_{M}$, a bound $\eta^{2}$ and a class $\mathcal{C}^{\prime}$ for which $C\in\mathcal{C}^{\prime}\subseteq\mathcal{C}_{M}$. In each case $\eta^{2}I_{n}\in\mathcal{C}^{\prime}$, so that statement (ii) of the lemma applies and $\hat{\theta}_{\text{ols}}$ is minimax for $\mathcal{C}^{\prime}$ (and for all of $\mathcal{C}_{M}$ as well) and with respect to any of the criteria (i) – (iv).

Example 1: Independent, heteroscedastic errors. Suppose that $C=\operatorname{diag}(\sigma_{1}^{2},...,\sigma_{n}^{2})$. Then the discussion above applies if $\mathcal{C}^{\prime}$ is the subclass of diagonal members of $\mathcal{C}_{E}$ for $\eta^{2}=\max_{i}\sigma_{i}^{2}$.

Example 2: Equicorrelated errors. Suppose that the researcher fears that the observations are possibly weakly correlated, and so considers $C=\sigma^{2}\left(\left(1-\rho\right)I_{n}+\rho 1_{n}1_{n}^{\prime}\right)$, with $\left|\rho\right|\leq\rho_{\max}$. If $\rho\geq 0$ then $\left\|C\right\|_{1}=\left\|C\right\|_{\infty}=\left\|C\right\|_{E}=\sigma^{2}\left\{1+\left(n-1\right)\left|\rho\right|\right\}$, and we take $\eta^{2}=\sigma^{2}\left\{1+\left(n-1\right)\rho_{\max}\right\}$. If $\mathcal{C}^{\prime}$ is the subclass of $\mathcal{C}_{1}$ or $\mathcal{C}_{\infty}$ or $\mathcal{C}_{E}$ defined by the equicorrelation structure, then minimaxity of $\hat{\theta}_{\text{ols}}$ for any of these classes follows. If $\rho<0$ then this continues to hold for $\mathcal{C}_{1}=\mathcal{C}_{\infty}$, and for $\mathcal{C}_{E}$ if $\eta^{2}=\sigma^{2}\left(1+\rho_{\max}\right)$.

Example 3: ma(1) errors. Assume first that the random errors are homoscedastic but are possibly serially correlated, following an ma(1) model with $\operatorname{corr}(\varepsilon_{i},\varepsilon_{j})=\rho I\left(\left|i-j\right|=1\right)$ and with $\left|\rho\right|\leq\rho_{\max}$. Then $\left\|C\right\|_{1}=\left\|C\right\|_{\infty}\leq\sigma^{2}\left(1+2\rho_{\max}\right)=\eta^{2}$, and in the discussion above we may take $\mathcal{C}^{\prime}$ to be the subclass – containing $\eta^{2}I_{n}$ – defined by $c_{ij}=0$ if $\left|i-j\right|>1$. If the errors are instead heteroscedastic, then $\sigma^{2}$ is replaced by $\max_{i}\sigma_{i}^{2}$.

Example 4: ar(1) errors. It is known – see for instance Trench (1999), p. 182 – that the eigenvalues of an ar(1) autocorrelation matrix with autocorrelation parameter ρ are bounded, and that the maximum eigenvalue $\lambda\left(\rho\right)$ has $\lambda^{\ast}=\max_{\rho}\lambda\left(\rho\right)>\lambda\left(0\right)=1$. Then, again under homoscedasticity, the covariance matrix $C$ has $\left\|C\right\|_{E}\leq\sigma^{2}\lambda^{\ast}=\eta^{2}$, and the discussion above applies when $\mathcal{C}^{\prime}$ is the subclass defined by the autocorrelation structure.

Example 5: All of the above. If $\mathcal{C}$ is the union of the classes of covariance structures employed in Examples 1-4, then the maximum loss over $\mathcal{C}$ is attained at $\eta_{0}^{2}I_{n}$, where $\eta_{0}^{2}$ is the maximum of those in these four examples. Then $\hat{\theta}_{\text{ols}}$ is minimax robust against the union of these classes, since $\eta_{0}^{2}I_{n}$ is in each of them.
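To make Examples 2-4 concrete, here is a brief sketch (assuming NumPy; the parameter values are illustrative only) that evaluates the norms of equicorrelated, ma(1) and ar(1) covariance matrices and compares them with the bounds used above.

```python
import numpy as np

n, sigma2, rho = 8, 1.0, 0.3
i, j = np.indices((n, n))

# Example 2: equicorrelation; all three norms equal sigma^2 {1 + (n-1) rho} for rho >= 0.
C_eq = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))
print(np.linalg.norm(C_eq, 2), sigma2 * (1 + (n - 1) * rho))

# Example 3: ma(1); row/column sums are bounded by sigma^2 (1 + 2 rho).
C_ma = sigma2 * (np.eye(n) + rho * (np.abs(i - j) == 1))
print(np.linalg.norm(C_ma, np.inf), sigma2 * (1 + 2 * rho))

# Example 4: ar(1); the spectral norm stays bounded as rho varies over (-1, 1).
C_ar = sigma2 * rho ** np.abs(i - j)
print(np.linalg.norm(C_ar, 2))
```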

3.1 Inference from GLS estimates when $C=\sigma^{2}I_{n}$

In the next section we consider biased regression models, and investigate the performance of the gls estimate (1) with $P\neq I_{n}$ even though $C$ is a multiple $\sigma^{2}$ of the identity matrix. A caveat to the use of this estimate in correctly specified models (3) is given by the following result. It was established in Wiens (2000) for wls estimates, but holds for gls estimates as well.

Theorem 1.

Suppose that data $y_{n\times 1}$ obey the linear model (3) with $C=\sigma^{2}I_{n}$ and that a gls estimate (1), with $P\neq I_{n}$, is employed. Let $H:n\times n$ be the projector onto the column space of $(X\vdots PX):n\times 2p$. Then an unbiased estimate of $\sigma^{2}$ is

S^{2}=\left\|\left(I_{n}-H\right)y\right\|^{2}/\left(n-rk\left(H\right)\right).

The vector $\left(I_{n}-H\right)y$ is uncorrelated with $\hat{\theta}_{\text{gls}}$. If the errors are normally distributed, then $\left(n-rk\left(H\right)\right)S^{2}/\sigma^{2}\sim\chi_{n-rk\left(H\right)}^{2}$, independently of $\hat{\theta}_{\text{gls}}$.

The projector $H$ will typically have rank $2p$ when $P\neq I_{n}$, and so $p$ degrees of freedom are lost in the estimation of $\sigma^{2}$ and subsequent normal-theory inferences.
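A small sketch of the variance estimate in Theorem 1 (assuming NumPy; the data, the precision matrix and the function name are illustrative, not the author's code): the projector H is built from the columns of (X ⋮ PX) and S² is the normalized squared residual.

```python
import numpy as np

def s2_gls(X, y, P):
    """Unbiased variance estimate of Theorem 1 under C = sigma^2 I_n."""
    Z = np.hstack([X, P @ X])                  # n x 2p matrix (X : PX)
    H = Z @ np.linalg.pinv(Z)                  # projector onto col(X : PX)
    rk = np.linalg.matrix_rank(Z)
    resid = (np.eye(len(y)) - H) @ y           # (I_n - H) y
    return resid @ resid / (len(y) - rk)

rng = np.random.default_rng(2)
n = 12
X = np.column_stack([np.ones(n), np.linspace(-1, 1, n)])
P = np.diag(np.linspace(0.5, 1.5, n)); P *= n / np.trace(P)
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)
print(s2_gls(X, y, P))
```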

4 Minimax precision matrices in misspecified response models

Working in finite design spaces $\chi=\left\{x_{i}\right\}_{i=1}^{N}\subset\mathbb{R}^{d}$, and with $p$-dimensional regressors $f\left(x\right)$, Wiens (2018) studied design problems for possibly misspecified regression models

Y\left(x\right)=f^{\prime}\left(x\right)\theta+\psi\left(x\right)+\varepsilon,  (5)

with the unknown contaminant ψ ranging over a class Ψ and satisfying, for identifiability of θ, the orthogonality condition

\sum_{x\in\chi}f\left(x\right)\psi\left(x\right)=0_{p\times 1},  (6)

as well as a bound

\sum_{x\in\chi}\psi^{2}\left(x\right)\leq\tau^{2}.  (7)

For designs ξ placing mass $\xi_{i}$ on $x_{i}\in\chi$, he took $\hat{\theta}=\hat{\theta}_{\text{ols}}$, loss function imspe:

\mathcal{I}\left(\psi,\xi\right)=\sum_{x\in\chi}E\left[f^{\prime}\left(x\right)\hat{\theta}-E\left\{Y\left(x\right)\right\}\right]^{2},

and found designs minimizing the maximum, over ψ, of $\mathcal{I}\left(\psi,\xi\right)$.

In Wiens (2018) the random errors $\varepsilon_{i}$ were assumed to be i.i.d.; now suppose that they instead have covariance matrix $C\in\mathcal{C}_{M}$ and take $\hat{\theta}=\hat{\theta}_{\text{gls}}$ with precision matrix $P$. Using (6), and emphasizing the dependence on $C$ and $P$, $\mathcal{I}\left(\psi,\xi\right)$ decomposes as

\mathcal{I}\left(\psi,\xi\,\middle|\,C,P\right)=\sum_{x\in\chi}f^{\prime}\left(x\right)\text{cov}\left(\hat{\theta}\,\middle|\,C,P\right)f\left(x\right)+\sum_{x\in\chi}f^{\prime}\left(x\right)b_{\psi,P}b_{\psi,P}^{\prime}f\left(x\right)+\sum_{x\in\chi}\psi^{2}\left(x\right).  (8)

Here $b_{\psi,P}=E(\hat{\theta})-\theta$ is the bias. Denote by $\psi_{X}$ the $n\times 1$ vector consisting of the values of ψ corresponding to the rows of $X$, so that $b_{\psi,P}=\left(X^{\prime}PX\right)^{-1}X^{\prime}P\psi_{X}$.

To express these quantities in terms of the design, define a set of $n\times N$ indicator matrices

\mathcal{J}=\left\{J\in\left\{0,1\right\}^{n\times N}\,\middle|\,J^{\prime}J\text{ is diagonal with trace }n\right\}.

There is a one-one correspondence between $\mathcal{J}$ and the set of $n$-point designs on χ. Given $J$, with

J^{\prime}J\equiv D=\operatorname{diag}\left(n_{1},...,n_{N}\right),

the $j$th column of $J$ contains $n_{j}$ ones, specifying the number of occurrences of $x_{j}$ in a design, which thus has design vector $\xi=n^{-1}J^{\prime}1_{n}=\left(n_{1}/n,...,n_{N}/n\right)^{\prime}$. Conversely, a design determines $J$ by $J_{ij}=I\left(\text{the }i\text{th row of }X\text{ is }f^{\prime}\left(x_{j}\right)\right)$. The rank $q$ of $D$ is the number of support points of the design, assumed $\geq p$.

Define $F_{N\times p}$ to be the matrix with rows $\left\{f^{\prime}\left(x_{i}\right)\right\}_{i=1}^{N}$. Then $X=JF$ and, correspondingly, $\psi_{X}=J\bar{\psi}$ for $\bar{\psi}=\left(\psi\left(x_{1}\right),...,\psi\left(x_{N}\right)\right)^{\prime}$.
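The correspondence between designs and indicator matrices can be sketched as follows (assuming NumPy; the design space, regressors and counts are illustrative only).

```python
import numpy as np

N, n = 5, 6
x_grid = np.linspace(-1, 1, N)                  # design space chi
F = np.column_stack([np.ones(N), x_grid])       # rows f'(x_i) for straight-line regression
counts = np.array([2, 0, 1, 0, 3])              # n_j observations at each x_j (sums to n)

# Build J: n_j ones in column j, one 1 per row.
J = np.zeros((n, N))
row = 0
for j, nj in enumerate(counts):
    J[row:row + nj, j] = 1
    row += nj

D = J.T @ J                                     # diag(n_1, ..., n_N)
X = J @ F                                       # n x p design matrix
xi = J.T @ np.ones(n) / n                       # design vector (n_1/n, ..., n_N/n)
```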

The proofs for this section are in the appendix.

Theorem 2.

For η as at (2) and τ as at (7), define $\nu=\tau^{2}/\left(\tau^{2}+\eta^{2}\right)$; this is the relative importance to the investigator of errors due to bias rather than to variation. Then for a design ξ and precision matrix $P$, the maximum of $\mathcal{I}\left(\psi,\xi\,\middle|\,C,P\right)$ as $C$ varies over $\mathcal{C}_{M}$ and ψ varies over Ψ is $\left(\tau^{2}+\eta^{2}\right)\times$

\mathcal{I}_{\nu}\left(\xi,P\right)=\left(1-\nu\right)\mathcal{I}_{0}\left(\xi,P\right)+\nu\mathcal{I}_{1}\left(\xi,P\right),  (9)

where

\mathcal{I}_{0}\left(\xi,P\right)=tr\left\{\left(Q^{\prime}UQ\right)^{-1}\left(Q^{\prime}VQ\right)\left(Q^{\prime}UQ\right)^{-1}\right\},  (10a)
\mathcal{I}_{1}\left(\xi,P\right)=ch_{\max}\left\{\left(Q^{\prime}UQ\right)^{-1}Q^{\prime}U^{2}Q\left(Q^{\prime}UQ\right)^{-1}\right\},  (10b)
U_{N\times N}=J^{\prime}PJ\text{ and }V_{N\times N}=J^{\prime}P^{2}J.

Here the columns of $Q_{N\times p}$ form an orthogonal basis for the column space ${col}\left(F\right)$, $J$ is the indicator matrix of the design ξ, and $ch_{\max}$ denotes the maximum eigenvalue of a matrix.
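A compact sketch of the criterion (9)-(10) follows (assuming NumPy; the function name is mine, not the author's, and it can be applied to the J and F constructed in the earlier sketch).

```python
import numpy as np

def I_nu(J, F, P, nu):
    """Maximum imspe criterion (9): (1 - nu) * I_0 + nu * I_1, per (10a)-(10b)."""
    Q, _ = np.linalg.qr(F)                     # orthonormal basis of col(F)
    U = J.T @ P @ J
    V = J.T @ P @ P @ J
    QUQ_inv = np.linalg.inv(Q.T @ U @ Q)
    I0 = np.trace(QUQ_inv @ (Q.T @ V @ Q) @ QUQ_inv)
    M = QUQ_inv @ (Q.T @ U @ U @ Q) @ QUQ_inv
    I1 = np.linalg.eigvalsh(M).max()           # ch_max of a symmetric matrix
    return (1 - nu) * I0 + nu * I1
```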

Remark 1.

We assume throughout that the design is such that $X^{\prime}PX\succ 0$, implying that $Q^{\prime}UQ\succ 0$.

Remark 2.

An investigator might decide beforehand to use ols, and then design to minimize $\mathcal{I}_{\nu}\left(\xi,P\right)=\mathcal{I}_{\nu}\left(\xi,I_{n}\right)$. This is a well-studied problem, solved for numerous response models under the assumption of i.i.d. errors – see Wiens (2015) for a review. By virtue of Theorem 2 these designs enjoy the additional property of being minimax against departures $C\in\mathcal{C}_{M}$.

4.1 Simulations

Table 1.  Minimax precision matrices; multinomial designs: means of performance measures ± 1 standard error.

Response        | N  | ν  | %I_n | I_ν(ξ,I_n)   | I_ν(ξ,P^ν)   | T1(%)       | T2(%)      | T3(%)
linear, n=10    | 11 | .5 | 1    | 3.34 ± .11   | 3.19 ± .11   | 4.16 ± .15  | 3.23 ± .12 | 12.29 ± .46
                | 11 | 1  | 1    | 3.72 ± .16   | 3.27 ± .14   | 11.81 ± .35 | 9.18 ± .31 | 14.40 ± .52
                | 51 | .5 | 27   | 11.19 ± .21  | 11.05 ± .21  | 1.24 ± .07  | .85 ± .05  | 4.10 ± .22
                | 51 | 1  | 27   | 9.80 ± .23   | 9.35 ± .22   | 4.34 ± .22  | 2.96 ± .16 | 4.82 ± .26
quadratic, n=15 | 11 | .5 | 0    | 5.99 ± 1.25  | 5.57 ± 1.06  | 4.62 ± .15  | 3.15 ± .11 | 14.16 ± .50
                | 11 | 1  | 0    | 7.30 ± 1.82  | 6.07 ± 1.27  | 13.04 ± .38 | 8.57 ± .38 | 16.22 ± .57
                | 51 | .5 | 4    | 12.61 ± .46  | 12.40 ± .45  | 1.58 ± .07  | .99 ± .05  | 5.75 ± .27
                | 51 | 1  | 4    | 10.69 ± .54  | 10.03 ± .51  | 5.95 ± .24  | 3.54 ± .17 | 6.72 ± .31
cubic, n=20     | 11 | .5 | 0    | 9.71 ± 1.63  | 9.15 ± 1.52  | 4.87 ± .15  | 3.25 ± .11 | 14.90 ± .51
                | 11 | 1  | 0    | 12.98 ± 2.53 | 11.41 ± 2.22 | 13.54 ± .38 | 8.86 ± .29 | 16.92 ± .58
                | 51 | .5 | 0    | 21.67 ± 2.43 | 21.21 ± 2.38 | 1.87 ± .08  | 1.14 ± .05 | 7.24 ± .30
                | 51 | 1  | 0    | 20.76 ± 2.9  | 19.29 ± 2.81 | 7.39 ± .26  | 3.75 ± .16 | 8.45 ± .35

Table 2.  Minimax precision matrices; symmetrized designs: means of performance measures ± 1 standard error.

Response        | N  | ν  | %I_n | I_ν(ξ,I_n)   | I_ν(ξ,P^ν)   | T1(%)       | T2(%)      | T3(%)
linear, n=10    | 11 | .5 | 20   | 2.10 ± .03   | 2.05 ± .03   | 2.45 ± .11  | 1.64 ± .07 | 8.46 ± .41
                | 11 | 1  | 20   | 1.83 ± .04   | 1.66 ± .04   | 8.51 ± .31  | 4.99 ± .21 | 10.06 ± .46
                | 51 | .5 | 80   | 10.01 ± .29  | 9.97 ± .29   | .40 ± .05   | .23 ± .03  | 1.45 ± .18
                | 51 | 1  | 80   | 7.77 ± .29   | 7.67 ± .29   | 1.48 ± .16  | .67 ± .07  | 1.66 ± .20
quadratic, n=15 | 11 | .5 | 0    | 2.35 ± .08   | 2.26 ± .07   | 3.49 ± .07  | 2.17 ± .07 | 14.10 ± .50
                | 11 | 1  | 0    | 2.01 ± .09   | 1.74 ± .08   | 14.40 ± .37 | 8.57 ± .22 | 18.05 ± .58
                | 51 | .5 | 85   | 10.58 ± .70  | 10.53 ± .68  | .25 ± .04   | .15 ± .02  | 1.02 ± .16
                | 51 | 1  | 85   | 7.54 ± .80   | 7.39 ± .74   | 1.03 ± .15  | .49 ± .07  | 1.19 ± .19
cubic, n=20     | 11 | .5 | 3    | 2.64 ± .19   | 2.55 ± .19   | 3.56 ± .12  | 1.99 ± .07 | 15.18 ± .56
                | 11 | 1  | 3    | 2.39 ± .27   | 2.11 ± .26   | 15.19 ± .44 | 9.09 ± .28 | 19.69 ± .70
                | 51 | .5 | 58   | 11.97 ± 1.91 | 11.80 ± 1.80 | .45 ± .04   | .24 ± .02  | 2.11 ± .19
                | 51 | 1  | 58   | 8.44 ± 2.14  | 7.94 ± 1.80  | 2.23 ± .18  | .86 ± .07  | 2.47 ± .21

In (9), var $=\mathcal{I}_{0}$ and bias $=\mathcal{I}_{1}$ are the components of the imspe due to variation and to bias, respectively. That $\mathcal{I}_{0}\left(\xi,P\right)$ is minimized by ols for any design was established in §3. If ols is to be a minimax procedure for a particular design and some $\nu\in(0,1]$, any increase in bias must be outweighed by a proportional decrease in var. We shall present theoretical and numerical evidence that for many designs this is not the case, and for these pairs $\left(\xi,\nu\right)$ ols is not minimax.

We first present the results of a simulation study, in which the designs exhibit no particular structure. To find $P$ minimizing (9) we use the fact that, by virtue of the Choleski decomposition, any positive definite matrix can be represented as $P=LL^{\prime}$, for a lower triangular $L$. We thus express $\mathcal{I}_{\nu}\left(\xi,LL^{\prime}\right)$ as a function of the vector $l_{n\left(n+1\right)/2\times 1}$ consisting of the elements in the lower triangle of $L$, and minimize over $l$ using a nonlinear constrained minimizer. The constraint – recall (4) – is that $l^{\prime}l=n$.
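A minimal sketch of this parametrization is given below, assuming NumPy and SciPy; the optimizer choice, starting point and function names are my own, and I_nu is the criterion sketched after Theorem 2, not the author's MATLAB implementation.

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

def P_from_l(l, n):
    """Rebuild P = LL' from the stacked lower-triangular entries l."""
    L = np.zeros((n, n))
    L[np.tril_indices(n)] = l
    return L @ L.T

def minimax_P(J, F, nu, n):
    # I_nu: the criterion (9) as sketched after Theorem 2.
    obj = lambda l: I_nu(J, F, P_from_l(l, n), nu)
    con = NonlinearConstraint(lambda l: l @ l, n, n)    # l'l = n, i.e. tr(P) = n
    l0 = np.eye(n)[np.tril_indices(n)]                  # start at P = I_n
    res = minimize(obj, l0, constraints=[con], method="SLSQP")
    return P_from_l(res.x, n)
```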

Of course we cannot guarantee that this yields an absolute minimum, but the numerical evidence is compelling. In any event, the numerical results give a negative answer to the question of whether or not ols is necessarily minimax – the minimizing $P$ is often, but not always, the identity matrix.

In our simulation study we set the design space to be $\chi=\left\{-1=x_{1}<\cdots<x_{N}=1\right\}$, with the $x_{i}$ equally spaced. We chose regressors $f\left(x\right)=\left(1,x\right)^{\prime}$, $\left(1,x,x^{2}\right)^{\prime}$ or $\left(1,x,x^{2},x^{3}\right)^{\prime}$, corresponding to linear, quadratic, or cubic regression. For various values of $n$ and $N$ we first randomly generated probability distributions $\left(p_{1},p_{2},...,p_{N}\right)$ and then generated a multinomial$\left(n;p_{1},p_{2},...,p_{N}\right)$ vector; this is $n\xi$. For each such design we computed the minimizing $P$, and both components of the minimized value of $\mathcal{I}_{\nu}\left(\xi,P\right)$. This was done for $\nu=.5,1$. We took $n$ equal to five times the number of regression parameters. Denote by $P^{\nu}$ the minimizing $P$. Of course $P^{0}=I_{n}$. In each case we compared three quantities:

T_{1}=100\frac{\mathcal{I}_{\nu}\left(\xi,P^{0}\right)-\mathcal{I}_{\nu}\left(\xi,P^{\nu}\right)}{\mathcal{I}_{\nu}\left(\xi,P^{0}\right)}, the percent reduction in $\mathcal{I}_{\nu}$ achieved by $P^{\nu}$;
T_{2}=100\frac{\mathcal{I}_{0}\left(\xi,P^{\nu}\right)-\mathcal{I}_{0}\left(\xi,P^{0}\right)}{\mathcal{I}_{0}\left(\xi,P^{0}\right)}, the percent increase, relative to ols, in var;
T_{3}=100\frac{\mathcal{I}_{1}\left(\xi,P^{0}\right)-\mathcal{I}_{1}\left(\xi,P^{\nu}\right)}{\mathcal{I}_{1}\left(\xi,P^{0}\right)}, the percent decrease, relative to ols, in bias.

The means, and standard errors based on 500 runs, of the performance measures using these 'multinomial' designs are given in Table 1. The percentages of times that $P^{\nu}=I_{n}$ was minimax are also given. When $\nu=1$ the percent reduction in the bias ($T_{3}$) can be significant, but is accompanied by an often sizeable increase in the variance ($T_{2}$). When $\nu=.5$ the reduction $T_{1}$ is typically quite modest.

These multinomial designs, mimicking those which might arise in observational studies, are not required to be symmetric. We re-ran the simulations after symmetrizing the designs by averaging them with their reflections across $x=0$ and then applying a rounding mechanism which preserved symmetry. The resulting designs gave substantially reduced losses both for $P=I_{n}$ (ols) and $P=P^{\nu}$ (gls), and were much more likely to be optimized by $P^{\nu}=I_{n}$. The differences between the means of $\mathcal{I}_{\nu}\left(\xi,I_{n}\right)$ and $\mathcal{I}_{\nu}\left(\xi,P^{\nu}\right)$ were generally statistically insignificant, and the values of $T_{1}$, $T_{2}$ and $T_{3}$ showed only very modest benefits to gls. See Table 2.

A practitioner might understandably conclude that, even though $P^{\nu}$ is minimax, its benefits are outweighed by the computational complexity of its implementation. This is bolstered by Theorem 1, which continues to hold with the modification that $S^{2}$ now follows a scaled non-central $\chi^{2}$ distribution, with a non-centrality parameter depending on $\psi_{X}^{\prime}\left(I_{n}-H\right)\psi_{X}$.

4.2 Theoretical complements

In Theorem 3 below, we show that the experimenter can often design in such a way that $P=I_{n}$ is a minimax precision matrix, so that ols is a minimax procedure. In particular, this holds if the design is uniform on its support, i.e. places an equal number of observations at each of several points of the design space.

Suppose that a design ξ places $n_{i}\geq 0$ observations at $x_{i}\in\chi$. Let $J_{+}:n\times q$ be the result of retaining only the non-zero columns of $J$, so that $JJ^{\prime}=J_{+}J_{+}^{\prime}$, and $D_{+}\equiv J_{+}^{\prime}J_{+}$ is the diagonal matrix containing the positive $n_{i}$. If the columns removed have labels $j_{1},...,j_{N-q}$ then let $Q_{+}:q\times p$ be the result of removing these rows from $Q$, so that $JQ=J_{+}Q_{+}$ and $Q^{\prime}DQ=Q_{+}^{\prime}D_{+}Q_{+}$. Now define $\alpha=n/tr\left(D_{+}^{-1}\right)$ and

P_{0}=\alpha J_{+}D_{+}^{-2}J_{+}^{\prime},  (11)

with $tr\left(P_{0}\right)=n$. Note that

rk\left(P_{0}\right)=rk\left(J_{+}D_{+}^{-1}\right)=rk\left(D_{+}^{-1}J_{+}^{\prime}J_{+}D_{+}^{-1}\right)=rk\left(D_{+}^{-1}\right)=q,

so that $P_{0}$ is positive definite iff $q=n$. This is relevant in part (ii) of Theorem 3, where we deal with the possible rank deficiency of $P_{0}$ by introducing

P_{\varepsilon}\equiv\left(P_{0}+\varepsilon I_{n}\right)/\left(1+\varepsilon\right);  (12)

for $\varepsilon>0$, $P_{\varepsilon}$ is positive definite with $tr\left(P_{\varepsilon}\right)=n$.
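For illustration, the construction (11)-(12) can be coded directly; a sketch assuming NumPy and the indicator matrix J of the earlier sketch (function and variable names are mine):

```python
import numpy as np

def P0_Peps(J, eps=1e-3):
    """Build P_0 = alpha J_+ D_+^{-2} J_+' of (11) and its regularization P_eps of (12)."""
    n, _ = J.shape
    keep = J.sum(axis=0) > 0                 # non-zero columns of J
    J_plus = J[:, keep]
    d_plus = J_plus.sum(axis=0)              # the positive n_i
    alpha = n / np.sum(1.0 / d_plus)         # alpha = n / tr(D_+^{-1})
    P0 = alpha * (J_plus / d_plus**2) @ J_plus.T
    P_eps = (P0 + eps * np.eye(n)) / (1 + eps)
    return P0, P_eps
```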

Theorem 3.

(i) Suppose that $q\leq N$ and the design is uniform on $q$ points of χ, with $k\geq 1$ observations at each $x_{i}$. Then $n=kq$, $D_{+}=kI_{q}$, and $P=I_{n}$ is a minimax precision matrix:

\mathcal{I}_{\nu}\left(\xi,I_{n}\right)=\min_{P\succ 0}\mathcal{I}_{\nu}\left(\xi,P\right);  (13)

thus ols is minimax within the class of gls methods. In particular this holds if $P_{0}=I_{n}$, where $P_{0}$ is defined at (11).
(ii) Suppose that a design ξ places mass on $q\leq N$ points of χ, that $P_{0}\neq I_{n}$, and that neither of the following holds:

\left(Q_{+}^{\prime}Q_{+}\right)^{-1}\left(Q_{+}^{\prime}D_{+}^{-1}Q_{+}\right)\left(Q_{+}^{\prime}Q_{+}\right)^{-1}=\left(Q_{+}^{\prime}D_{+}Q_{+}\right)^{-1},  (14a)
ch_{\max}\left\{\left(Q_{+}^{\prime}D_{+}Q_{+}\right)^{-1}\left(Q_{+}^{\prime}D_{+}^{2}Q_{+}\right)\left(Q_{+}^{\prime}D_{+}Q_{+}\right)^{-1}\right\}=ch_{\max}\left\{\left(Q_{+}^{\prime}Q_{+}\right)^{-1}\right\}.  (14b)

Then in particular $D_{+}$ is not a multiple of $I_{q}$ and so the design is non-uniform. With $P_{\varepsilon}$ as defined at (12), there is $\nu_{0}\in\left(0,1\right)$ for which, for each $\nu\in(\nu_{0},1]$, $\mathcal{I}_{\nu}\left(\xi,P_{\varepsilon}\right)<\mathcal{I}_{\nu}\left(\xi,I_{n}\right)$. Thus ols is not minimax for such $\left(\xi,\nu\right)$.

Remark 3.

The requirement of Theorem 3(ii) that (14a) and (14b) fail excludes more designs than those which are uniform on their supports, and is a condition on $Q$ as well as on the design. For instance if $Q_{+}\left(Q_{+}^{\prime}Q_{+}\right)^{-1/2}\equiv A_{q\times p}$ is block-diagonal: $A=\oplus_{i=1}^{m}A_{i}$, where $A_{i}:q_{i}\times p_{i}$ ($\sum q_{i}=q$, $\sum p_{i}=p$) satisfies $A_{i}^{\prime}A_{i}=I_{p_{i}}$, and if $D_{+}=\oplus_{i=1}^{m}k_{i}I_{q_{i}}$, then

A^{\prime}D_{+}^{-1}A=\left(A^{\prime}D_{+}A\right)^{-1},  (15)
\left(A^{\prime}D_{+}A\right)^{-1}A^{\prime}D_{+}^{2}A\left(A^{\prime}D_{+}A\right)^{-1}=I_{p}.  (16)

Equation (15) gives (14a), and (16) asserts the equality of the two matrices in (14b), hence of their maximum eigenvalues. These equations are satisfied even though the design is non-uniform if the $k_{i}$ are not all equal.

In Tables 1 and 2, uniform designs account for $100\%$ and $95\%$, respectively, of the cases in which $P^{\nu}=I_{n}$ is optimal. Common exceptions in Table 2 are designs which are uniform apart from having points added or removed at $x=0$ to maintain symmetry. Those designs for which $I_{n}$ is not optimal all meet the conditions of Theorem 3(ii). This was checked numerically: since (14a) implies that $\mathcal{I}_{0}\left(\xi,P_{0}\right)-\mathcal{I}_{0}\left(\xi,I_{n}\right)=0$, and (14b) implies that $\mathcal{I}_{1}\left(\xi,I_{n}\right)-\mathcal{I}_{1}\left(\xi,P_{0}\right)=0$, their failure is verified by checking that each of these differences is positive.

5 Minimax precision matrices and minimax designs

Table 3.  Minimax designs and precision matrices: performance measures (T1 = T2 = T3 = 0 if $P^{\nu}=I_{n}$).

Response        | N  | ν  | I_ν(ξ,I_n) | I_ν(ξ,P^ν) | T1(%) | T2(%) | T3(%)
linear, n=10    | 11 | .5 | 1.60 | 1.60 | 0     | 0     | 0
                | 11 | 1  | 1.10 | 1.10 | 0     | 0     | 0
                | 51 | .5 | 6.14 | 6.14 | 0     | 0     | 0
                | 51 | 1  | 5.10 | 5.10 | 0     | 0     | 0
quadratic, n=15 | 11 | .5 | 1.61 | 1.53 | 4.67  | 2.81  | 16.06
                | 11 | 1  | 1.12 | 1.00 | 10.84 | 10.78 | 10.84
                | 51 | .5 | 5.80 | 5.80 | 0     | 0     | 0
                | 51 | 1  | 3.40 | 3.40 | 0     | 0     | 0
cubic, n=20     | 11 | .5 | 1.55 | 1.53 | 1.46  | 1.13  | 6.04
                | 11 | 1  | 1.12 | 1.00 | 11.03 | 4.84  | 11.03
                | 51 | .5 | 5.56 | 5.56 | 0     | 0     | 0
                | 51 | 1  | 2.55 | 2.55 | 0     | 0     | 0
Figure 1: Minimax design frequencies for a cubic model; $n=20$.

We investigated the interplay between minimax precision matrices and minimax designs. To this end (9) was minimized over both ξ and $P$. To minimize over ξ we employed particle swarm optimization (Kennedy and Eberhart (1995)). The algorithm searches over continuous designs ξ, and so each such design to be evaluated was first rounded so that $n\xi$ had integer values. Then $J$, and the corresponding minimax precision matrix $P^{\nu}=P^{\nu}\left(J\right)$, were computed and the loss returned. The final output is an optimal pair $\left\{J^{\nu},P^{\nu}\right\}$. Using a genetic algorithm yielded the same results but was many times slower.

The results, using the same parameters as in Tables 1 and 2, are shown in Table 3. We note that in all cases the use of the minimax design gives significantly smaller losses, both using ols and gls. In eight of the twelve cases studied it turns out that the minimax design is uniform on its support and so the choice $P^{\nu}=I_{n}$ is minimax. In the remaining cases – all in line with (ii) of Theorem 3 – minimax precision results in only a marginal improvement. Of the two factors – ξ and $P$ – explaining the decrease in $\mathcal{I}_{\nu}$, the design is by far the greater contributor.

See Figure 1 for some representative plots of the minimax designs for a cubic response. These reflect several common features of robust designs. One is that the designs using $\nu=1$, i.e. aimed at minimization of the bias alone, tend to be more uniform than those using $\nu=.5$. This reflects the fact – following from (6) and exploited in (i) of Theorem 3 – that when a uniform design on all of χ is implemented, then the bias using ols vanishes. As well, when the design space is sufficiently rich as to allow for clusters of nearby design points to replace replicates, then this invariably occurs. See Wiens (2024a), Fang and Wiens (2000) and Heo et al. (2001) for examples and discussions. Such clusters form near the support points of the classically I-optimal designs, minimizing variance alone. See for instance Studden (1977) who showed that the I-optimal design for cubic regression places masses $.1545$, $.3455$ at each of $\pm 1$, $\pm.477$ – a situation well approximated by the design in (c) of Figure 1, whose clusters around these points account for masses of $.15$ and $.35$ each. As ν increases the clusters become more diffuse, as in (d) of Figure 1. A result of our findings in this article is that an additional benefit to such clustering is that it allows ols to be a minimax gls procedure.

6 Continuous design spaces

If the design space is continuous – an interval, or hypercube, for instance – then the problem of finding a minimax design is somewhat more complicated. We continue to work with the alternate models (5), but the constraints (6) and (7) now have their finite sums replaced by Lebesgue integrals over χ. As shown in Wiens (1992), in order that a design ξ have finite maximum loss it is necessary that it be absolutely continuous, i.e. possess a density. Otherwise it places positive mass on some set of Lebesgue measure zero, on which any $\psi\left(x\right)$ can be modified without affecting these integrals. Such modifications can be done in such a way as to make the integrated squared bias unbounded. Thus static, discrete designs are not admissible.

A remedy, detailed by Waite and Woods (2022), is to choose design points randomly, from an appropriate density. In the parlance of game theory, this precludes Nature, assumed malevolent, from knowing the support points of ξ and replying with a $\psi\left(\cdot\right)$ modified as above. They also recommend designs concentrated near the I-optimal design points, leading to (c) and (d) of Figure 1, but with randomly chosen clusters. The resulting designs are always uniform on their supports, and so ols is minimax in all cases, in contrast to the situation illustrated in (a) and (b) of Figure 1. See Wiens (2024a) for guidelines and further examples, and Waite (2024) for extensions allowing for randomized replication.

Appendix: Derivations

Proof of Theorem 2.

In the notation of the theorem, (8) becomes

\mathcal{I}\left(\psi,\xi\,\middle|\,C,P\right)=tr\left\{F\text{cov}\left(\hat{\theta}\,\middle|\,C,P\right)F^{\prime}\right\}+\bar{\psi}^{\prime}J^{\prime}PJF\left(F^{\prime}J^{\prime}PJF\right)^{-1}F^{\prime}F\left(F^{\prime}J^{\prime}PJF\right)^{-1}F^{\prime}J^{\prime}PJ\bar{\psi}+\bar{\psi}^{\prime}\bar{\psi}.  (A.1)

As in §3, and taking $K=F^{\prime}F$ in (iv) of that section, for $C\in\mathcal{C}_{M}$ the trace in (A.1) is maximized by $C=\eta^{2}I_{n}$, with

tr\left\{F\text{cov}\left(\hat{\theta}\,\middle|\,\eta^{2}I_{n},P\right)F^{\prime}\right\}=\eta^{2}tr\left\{F\left(F^{\prime}J^{\prime}PJF\right)^{-1}\left(F^{\prime}J^{\prime}P^{2}JF\right)\left(F^{\prime}J^{\prime}PJF\right)^{-1}F^{\prime}\right\}.  (A.2)

Extend the orthogonal basis for ${col}\left(F\right)$ – formed by the columns of $Q$ – by appending to $Q$ the matrix $Q_{\ast}:N\times\left(N-p\right)$, whose columns form an orthogonal basis for the orthogonal complement ${col}\left(F\right)^{\perp}$. Then $(Q\vdots Q_{\ast}):N\times N$ is an orthogonal matrix and we have that $F=QR$ for a non-singular $R$. If the construction is carried out by the Gram-Schmidt method, then $R$ is upper triangular.
Constraint (6) dictates that ψ lie in ${col}\left(Q_{\ast}\right)$. A maximizing ψ will satisfy (7) with equality, hence $\bar{\psi}=\tau Q_{\ast}\beta$ for some $\beta_{\left(N-p\right)\times 1}$ with unit norm. Combining these observations along with (A.1) and (A.2) yields that $\max_{\psi,C}\mathcal{I}\left(\psi,\xi\,\middle|\,C,P\right)$ is given by

\eta^{2}tr\left\{Q\left(Q^{\prime}UQ\right)^{-1}\left(Q^{\prime}VQ\right)\left(Q^{\prime}UQ\right)^{-1}Q^{\prime}\right\}+\tau^{2}\max_{\left\|\beta\right\|=1}\left\{\beta^{\prime}Q_{\ast}^{\prime}UQ\left(Q^{\prime}UQ\right)^{-1}Q^{\prime}Q\left(Q^{\prime}UQ\right)^{-1}Q^{\prime}UQ_{\ast}\beta+1\right\}.  (A.3)

Here and elsewhere we use that $trAB=trBA$, and that such products have the same non-zero eigenvalues. Then (A.3) becomes $\left(\tau^{2}+\eta^{2}\right)$ times $\mathcal{I}_{\nu}\left(\xi,P\right)$, given by

\mathcal{I}_{\nu}\left(\xi,P\right)=\left(1-\nu\right)tr\left\{\left(Q^{\prime}UQ\right)^{-1}\left(Q^{\prime}VQ\right)\left(Q^{\prime}UQ\right)^{-1}\right\}+\nu\left\{ch_{\max}Q_{\ast}^{\prime}UQ\left(Q^{\prime}UQ\right)^{-1}\cdot\left(Q^{\prime}UQ\right)^{-1}Q^{\prime}UQ_{\ast}+1\right\}.  (A.4)

The maximum eigenvalue is also that of

\left(Q^{\prime}UQ\right)^{-1}Q^{\prime}UQ_{\ast}\cdot Q_{\ast}^{\prime}UQ\left(Q^{\prime}UQ\right)^{-1}=\left(Q^{\prime}UQ\right)^{-1}Q^{\prime}U\left(I_{N}-QQ^{\prime}\right)UQ\left(Q^{\prime}UQ\right)^{-1}=\left(Q^{\prime}UQ\right)^{-1}Q^{\prime}U^{2}Q\left(Q^{\prime}UQ\right)^{-1}-I_{p};

this in (A.4) gives (9). ∎

The proof of Theorem 3 requires a preliminary result.

Lemma 2.

(i) For a fixed design ξ and any $P\succ 0$, $\mathcal{I}_{0}\left(\xi,P\right)\geq\mathcal{I}_{0}\left(\xi,I_{n}\right)$ and $\mathcal{I}_{1}\left(\xi,P\right)\geq ch_{\max}\left\{\left(Q_{+}^{\prime}Q_{+}\right)^{-1}\right\}$.
If neither of the equations (14a), (14b) holds, then:
(ii) $\mathcal{I}_{0}\left(\xi,P_{0}\right)>\mathcal{I}_{0}\left(\xi,I_{n}\right)$ and $\mathcal{I}_{1}\left(\xi,I_{n}\right)>\mathcal{I}_{1}\left(\xi,P_{0}\right)$;
(iii) With $P_{\varepsilon}$ as defined in Theorem 3(ii), and for sufficiently small $\varepsilon>0$, $\Delta_{0}\left(\varepsilon\right)=\mathcal{I}_{0}\left(\xi,P_{\varepsilon}\right)-\mathcal{I}_{0}\left(\xi,I_{n}\right)>0$ and $\Delta_{1}\left(\varepsilon\right)=\mathcal{I}_{1}\left(\xi,I_{n}\right)-\mathcal{I}_{1}\left(\xi,P_{\varepsilon}\right)>0$.

Proof of Lemma 2.

(i) From (10a), $\mathcal{I}_{0}\left(\xi,P\right)-\mathcal{I}_{0}\left(\xi,I_{n}\right)$ is the trace of

\left(Q^{\prime}J^{\prime}PJQ\right)^{-1}\left(Q^{\prime}J^{\prime}P^{2}JQ\right)\left(Q^{\prime}J^{\prime}PJQ\right)^{-1}-\left(Q^{\prime}J^{\prime}JQ\right)^{-1}=\left(Q^{\prime}J^{\prime}PJQ\right)^{-1}Q^{\prime}J^{\prime}P\left\{I_{n}-JQ\left(Q^{\prime}J^{\prime}JQ\right)^{-1}Q^{\prime}J^{\prime}\right\}PJQ\left(Q^{\prime}J^{\prime}PJQ\right)^{-1},

which is $\succeq 0$ (since the matrix in braces is idempotent, hence p.s.d.) with non-negative trace. For the second inequality first note that

\left(Q^{\prime}UQ\right)^{-1}Q^{\prime}U^{2}Q\left(Q^{\prime}UQ\right)^{-1}-\left(Q_{+}^{\prime}Q_{+}\right)^{-1}=\left(Q_{+}^{\prime}J_{+}^{\prime}PJ_{+}Q_{+}\right)^{-1}Q_{+}^{\prime}J_{+}^{\prime}PJ_{+}J_{+}^{\prime}PJ_{+}Q_{+}\left(Q_{+}^{\prime}J_{+}^{\prime}PJ_{+}Q_{+}\right)^{-1}-\left(Q_{+}^{\prime}Q_{+}\right)^{-1}=\left(Q_{+}^{\prime}J_{+}^{\prime}PJ_{+}Q_{+}\right)^{-1}Q_{+}^{\prime}J_{+}^{\prime}PJ_{+}\left\{I_{q}-Q_{+}\left(Q_{+}^{\prime}Q_{+}\right)^{-1}Q_{+}^{\prime}\right\}J_{+}^{\prime}PJ_{+}Q_{+}\left(Q_{+}^{\prime}J_{+}^{\prime}PJ_{+}Q_{+}\right)^{-1}

is p.s.d., so that by Weyl's Monotonicity Theorem (Bhatia (1997)),

\mathcal{I}_{1}\left(\xi,P\right)=ch_{\max}\left\{\left(Q^{\prime}UQ\right)^{-1}Q^{\prime}U^{2}Q\left(Q^{\prime}UQ\right)^{-1}\right\}\geq ch_{\max}\left\{\left(Q_{+}^{\prime}Q_{+}\right)^{-1}\right\}.

(ii) We use the following identities, which follow from (10a) and (10b), expressed in the notation preceding the statement of Theorem 3:

\mathcal{I}_{0}\left(\xi,I_{n}\right)=tr\left\{\left(Q_{+}^{\prime}D_{+}Q_{+}\right)^{-1}\right\},  (A.5a)
\mathcal{I}_{1}\left(\xi,I_{n}\right)=ch_{\max}\left\{\left(Q_{+}^{\prime}D_{+}Q_{+}\right)^{-1}\left(Q_{+}^{\prime}D_{+}^{2}Q_{+}\right)\left(Q_{+}^{\prime}D_{+}Q_{+}\right)^{-1}\right\},  (A.5b)
\mathcal{I}_{0}\left(\xi,P_{0}\right)=tr\left\{\left(Q_{+}^{\prime}Q_{+}\right)^{-1}\left(Q_{+}^{\prime}D_{+}^{-1}Q_{+}\right)\left(Q_{+}^{\prime}Q_{+}\right)^{-1}\right\},  (A.5c)
\mathcal{I}_{1}\left(\xi,P_{0}\right)=ch_{\max}\left\{\left(Q_{+}^{\prime}Q_{+}\right)^{-1}\right\}.  (A.5d)

To prove (ii) we show that if either inequality fails then one of (14a), (14b) holds – a contradiction. First note that

\mathcal{I}_{0}\left(\xi,P_{0}\right)-\mathcal{I}_{0}\left(\xi,I_{n}\right)=tr\left\{\left(Q_{+}^{\prime}Q_{+}\right)^{-1}\left(Q_{+}^{\prime}D_{+}^{-1}Q_{+}\right)\left(Q_{+}^{\prime}Q_{+}\right)^{-1}-\left(Q_{+}^{\prime}D_{+}Q_{+}\right)^{-1}\right\}=tr\left\{\left(Q_{+}^{\prime}Q_{+}\right)^{-1}Q_{+}^{\prime}D_{+}^{-1/2}\left[I_{q}-D_{+}^{1/2}Q_{+}\left(Q_{+}^{\prime}D_{+}Q_{+}\right)^{-1}Q_{+}^{\prime}D_{+}^{1/2}\right]D_{+}^{-1/2}Q_{+}\left(Q_{+}^{\prime}Q_{+}\right)^{-1}\right\},

which is non-negative. If the first inequality fails, so that $\mathcal{I}_{0}\left(\xi,P_{0}\right)=\mathcal{I}_{0}\left(\xi,I_{n}\right)$, then the trace of the p.s.d. matrix in the display above is zero, hence all eigenvalues are zero and the matrix is the zero matrix. This is (14a).
That $\mathcal{I}_{1}\left(\xi,I_{n}\right)-\mathcal{I}_{1}\left(\xi,P_{0}\right)\geq 0$ is the second inequality in (i), applied with $P=I_{n}$ and combined with (A.5d). If the second inequality of (ii) fails, then $\mathcal{I}_{1}\left(\xi,I_{n}\right)=\mathcal{I}_{1}\left(\xi,P_{0}\right)$ and their evaluations at (A.5b) and (A.5d) give (14b).
For (iii), that $\Delta_{0}\left(\varepsilon\right)>0$ and $\Delta_{1}\left(\varepsilon\right)>0$ for sufficiently small ε follow from (ii) and the continuity of $\mathcal{I}_{0}\left(\xi,P_{\varepsilon}\right)$ and $\mathcal{I}_{1}\left(\xi,P_{\varepsilon}\right)$ as functions of ε: $\Delta_{0}\left(\varepsilon\right)\rightarrow\mathcal{I}_{0}\left(\xi,P_{0}\right)-\mathcal{I}_{0}\left(\xi,I_{n}\right)>0$ and $\Delta_{1}\left(\varepsilon\right)\rightarrow\mathcal{I}_{1}\left(\xi,I_{n}\right)-\mathcal{I}_{1}\left(\xi,P_{0}\right)>0$ as $\varepsilon\rightarrow 0$. ∎

Proof of Theorem 3.

(i) From the first inequality in Lemma 2 (i),

\mathcal{I}_{0}\left(\xi,I_{n}\right)=\min_{P\succ 0}\mathcal{I}_{0}\left(\xi,P\right).

If $P=I_{n}$ then $U=J^{\prime}PJ=J^{\prime}J=D$, with $D_{+}=kI_{q}$, so that from (A.5b), and the second inequality in Lemma 2(i),

\mathcal{I}_{1}\left(\xi,I_{n}\right)=ch_{\max}\left\{\left(Q_{+}^{\prime}Q_{+}\right)^{-1}\right\}=\min_{P\succ 0}\mathcal{I}_{1}\left(\xi,P\right).

Now (13) is immediate. If $P_{0}=I_{n}$ then $q=rk\left(P_{0}\right)=n$, so that all $n$ observations are made at distinct points, hence $D_{+}=I_{n}$ and the design is uniform on its support. (ii) By Lemma 2(iii) there is $\varepsilon_{0}>0$ for which $\Delta_{0}\left(\varepsilon\right)>0$ and $\Delta_{1}\left(\varepsilon\right)>0$ when $0<\varepsilon\leq\varepsilon_{0}$. For ε in this range,

\mathcal{I}_{\nu}\left(\xi,I_{n}\right)-\mathcal{I}_{\nu}\left(\xi,P_{\varepsilon}\right)=\nu\left(\Delta_{0}\left(\varepsilon\right)+\Delta_{1}\left(\varepsilon\right)\right)-\Delta_{0}\left(\varepsilon\right)>0,

for $\nu\in(\nu_{0},1]$ and $\nu_{0}\equiv\Delta_{0}\left(\varepsilon\right)/\left(\Delta_{0}\left(\varepsilon\right)+\Delta_{1}\left(\varepsilon\right)\right)$. ∎

Acknowledgements

This work was carried out with the support of the Natural Sciences and Engineering Research Council of Canada.

References

  • Aitken, A. C. (1935), "On Least Squares and Linear Combinations of Observations," Proceedings of the Royal Society of Edinburgh, 55, 42–48.
  • Bhatia, R. (1997), Matrix Analysis, Berlin: Springer.
  • Cochrane, D. and Orcutt, G. H. (1949), "Application of Least Squares Regression to Relationships Containing Autocorrelated Error Terms," Journal of the American Statistical Association, 44, 32–61.
  • Fang, Z. and Wiens, D. P. (2000), "Integer-Valued, Minimax Robust Designs for Estimation and Extrapolation in Heteroscedastic, Approximately Linear Models," Journal of the American Statistical Association, 95, 807–818.
  • Fomby, T. B., Johnson, S. R., and Hill, R. C. (1984), "Feasible Generalized Least Squares Estimation," in Advanced Econometric Methods, New York, NY: Springer.
  • Heo, G., Schmuland, B., and Wiens, D. P. (2001), "Restricted Minimax Robust Designs for Misspecified Regression Models," The Canadian Journal of Statistics, 29, 117–128.
  • Huber, P. J. (1972), "The 1972 Wald Lecture. Robust Statistics: A Review," The Annals of Mathematical Statistics, 43, 1041–1067.
  • Kennedy, J. and Eberhart, R. (1995), "Particle Swarm Optimization," in Proceedings of ICNN'95 - International Conference on Neural Networks, Perth, WA, Australia, vol. 4, pp. 1942–1948.
  • Studden, W. J. (1977), "Optimal Designs for Integrated Variance in Polynomial Regression," in Statistical Decision Theory and Related Topics II, eds. Gupta, S. S. and Moore, D., New York: Academic Press, pp. 411–420.
  • Trench, W. F. (1999), "Asymptotic Distribution of the Spectra of a Class of Generalized Kac-Murdock-Szegö Matrices," Linear Algebra and Its Applications, 294, 181–192.
  • Waite, T. W. (2024), "Replication in Random Translation Designs," Statistics and Probability Letters, in press, https://doi.org/10.1016/j.spl.2024.110229.
  • Waite, T. W. and Woods, D. C. (2022), "Minimax Efficient Random Experimental Design Strategies With Application to Model-Robust Design for Prediction," Journal of the American Statistical Association, 117, 1452–1465.
  • Wiens, D. P. (1992), "Minimax Designs for Approximately Linear Regression," Journal of Statistical Planning and Inference, 31, 353–371.
  • — (2000), "Robust Weights and Designs for Biased Regression Models: Least Squares and Generalized M-Estimation," Journal of Statistical Planning and Inference, 83, 395–412.
  • — (2015), "Robustness of Design," in Handbook of Design and Analysis of Experiments.
  • — (2018), "I-Robust and D-Robust Designs on a Finite Design Space," Statistics and Computing, 28, 241–258.
  • — (2024a), "Jittering and Clustering: Strategies for the Construction of Robust Designs," Statistics and Computing, in press, https://doi.org/10.1007/s11222-024-10436-2.
  • — (2024b), "A Note on Minimax Robustness of Designs Against Correlated or Heteroscedastic Responses," Biometrika, in press, https://doi.org/10.1093/biomet/asae001.