
Sequential online subsampling for thinning experimental designs

Luc Pronzato (CNRS, Université Côte d'Azur, I3S, France, [email protected]; corresponding author) and HaiYing Wang (Department of Statistics, University of Connecticut, USA, [email protected])
Abstract

We consider a design problem where experimental conditions (design points $X_i$) are presented in the form of a sequence of i.i.d. random variables, generated with an unknown probability measure $\mu$, and only a given proportion $\alpha\in(0,1)$ can be selected. The objective is to select good candidates $X_i$ on the fly and maximize a concave function $\Phi$ of the corresponding information matrix. The optimal solution corresponds to the construction of an optimal bounded design measure $\xi_\alpha^*\leq\mu/\alpha$, with the difficulty that $\mu$ is unknown and $\xi_\alpha^*$ must be constructed online. The construction proposed relies on the definition of a threshold $\tau$ on the directional derivative of $\Phi$ at the current information matrix, the value of $\tau$ being fixed by a certain quantile of the distribution of this directional derivative. Combination with recursive quantile estimation yields a nonlinear two-time-scale stochastic approximation method. It can be applied to very long design sequences since only the current information matrix and estimated quantile need to be stored. Convergence to an optimum design is proved. Various illustrative examples are presented.

Keywords: Active learning, data thinning, design of experiments, sequential design, subsampling

AMS subject classifications: 62K05, 62L05, 62L20, 68Q32

1 Introduction

Consider a rather general parameter estimation problem in a model with independent observations $Y_i=Y_i(x_i)$ conditionally on the experimental variables $x_i$, with $x_i$ in some set $\mathscr{X}$. Suppose that for any $x\in\mathscr{X}$ there exist a measurable set $\mathscr{Y}_x\subseteq\mathds{R}$ and a $\sigma$-finite measure $\mu_x$ on $\mathscr{Y}_x$ such that $Y(x)$ has the density $\varphi_{x,\overline{\boldsymbol{\theta}}}$ with respect to $\mu_x$, with $\overline{\boldsymbol{\theta}}$ the true value of the model parameters $\boldsymbol{\theta}$ to be estimated, $\boldsymbol{\theta}\in\mathds{R}^p$. In particular, this covers the case of regression models, with $\mu_x$ the Lebesgue measure on $\mathscr{Y}_x=\mathds{R}$ and $Y(x)=\eta(\overline{\boldsymbol{\theta}},x)+\varepsilon(x)$, where the $\varepsilon(x_i)$ are independently distributed with zero mean and known variance $\sigma_i^2$ (or unknown but constant variance $\sigma^2$), and the case of generalized linear models, with $\varphi_{x,\overline{\boldsymbol{\theta}}}$ in the exponential family and logistic regression as a special case. Denoting by $\hat{\boldsymbol{\theta}}^n$ the estimated value of $\boldsymbol{\theta}$ from the data $(X_i,Y_i)$, $i=1,\ldots,n$, under rather weak conditions on the $x_i$ and $\varphi_{x,\overline{\boldsymbol{\theta}}}$ (see below), we have

\[
\sqrt{n}\,(\hat{\boldsymbol{\theta}}^n-\overline{\boldsymbol{\theta}})\stackrel{\rm d}{\rightarrow}\mathscr{N}\left(\boldsymbol{0},\mathbf{M}^{-1}(\xi,\overline{\boldsymbol{\theta}})\right)\ \mbox{ as } n\rightarrow\infty\,, \tag{1.1}
\]

where $\mathbf{M}(\xi,\boldsymbol{\theta})$ denotes the (normalized) Fisher information matrix for parameters $\boldsymbol{\theta}$ and (asymptotic) design $\xi$ (that is, a probability measure on $\mathscr{X}$),

\begin{align*}
\mathbf{M}(\xi,\boldsymbol{\theta}) &= \lim_{n\rightarrow\infty}\frac{1}{n}\,\mathsf{E}_{x_1,\ldots,x_n,\boldsymbol{\theta}}\left\{\sum_{i=1}^{n}\frac{\partial\log\varphi_{x_i,\boldsymbol{\theta}}(Y_i)}{\partial\boldsymbol{\theta}}\,\sum_{j=1}^{n}\frac{\partial\log\varphi_{x_j,\boldsymbol{\theta}}(Y_j)}{\partial\boldsymbol{\theta}^{\top}}\right\}\\
&= \int_{\mathscr{X}}\left[\int_{\mathscr{Y}_x}\frac{\partial\log\varphi_{x,\boldsymbol{\theta}}(y)}{\partial\boldsymbol{\theta}}\,\frac{\partial\log\varphi_{x,\boldsymbol{\theta}}(y)}{\partial\boldsymbol{\theta}^{\top}}\,\varphi_{x,\boldsymbol{\theta}}(y)\,\mu_x(\mathrm{d}y)\right]\xi(\mathrm{d}x)\,.
\end{align*}

This is true in particular for randomized designs such that the $x_i$ are independently sampled from $\xi$, and for asymptotically discrete designs, such that $\xi$ is a discrete measure on $\mathscr{X}$ and the empirical design measure $\xi_n=(1/n)\sum_{i=1}^{n}\delta_{x_i}$ converges strongly to $\xi$; see Pronzato and Pázman (2013). The former case corresponds to the situation considered here. The choice of $\mu_x$ is somewhat arbitrary, provided that $\int_{\mathscr{Y}_x}\varphi_{x,\overline{\boldsymbol{\theta}}}(y)\,\mu_x(\mathrm{d}y)=1$ for all $x$, and we shall assume that $\mu_x(\mathrm{d}y)\equiv 1$. We can then write

\[
\mathbf{M}(\xi,\boldsymbol{\theta})=\int_{\mathscr{X}}\mathscr{M}(x,\boldsymbol{\theta})\,\xi(\mathrm{d}x)\,,\ \mbox{ where }\ \mathscr{M}(x,\boldsymbol{\theta})=\int_{\mathscr{Y}_x}\frac{\partial\log\varphi_{x,\boldsymbol{\theta}}(y)}{\partial\boldsymbol{\theta}}\,\frac{\partial\log\varphi_{x,\boldsymbol{\theta}}(y)}{\partial\boldsymbol{\theta}^{\top}}\,\varphi_{x,\boldsymbol{\theta}}(y)\,\mathrm{d}y
\]

denotes the elementary information matrix at $x$.

Taking motivation from (1.1), optimal experimental design (approximate theory) aims at choosing a measure $\xi^*$ that minimizes a scalar function of the asymptotic covariance matrix $\mathbf{M}^{-1}(\xi,\overline{\boldsymbol{\theta}})$ of $\hat{\boldsymbol{\theta}}^n$, or equivalently, that maximizes a function $\Phi$ of $\mathbf{M}(\xi,\overline{\boldsymbol{\theta}})$. For a nonlinear model, $\mathscr{M}(x,\boldsymbol{\theta})$ and $\mathbf{M}(\xi,\boldsymbol{\theta})$ depend on the model parameters $\boldsymbol{\theta}$. Since $\overline{\boldsymbol{\theta}}$ is unknown, the standard approach is local and consists in constructing an optimal design for a nominal value $\boldsymbol{\theta}_0$ of $\boldsymbol{\theta}$. This is the point of view we shall adopt here, although sequential estimation of $\boldsymbol{\theta}$ is possible; see Section 6. When $\boldsymbol{\theta}$ is fixed at some $\boldsymbol{\theta}_0$, there is fundamentally no difference with experimental design in a linear model, for which $\mathscr{M}(x,\boldsymbol{\theta})$ and $\mathbf{M}(\xi,\boldsymbol{\theta})$ do not depend on $\boldsymbol{\theta}$. For example, in the linear regression model

\[
Y(X_i)=\mathbf{f}^{\top}(X_i)\,\overline{\boldsymbol{\theta}}+\varepsilon_i\,,
\]

where the errors $\varepsilon_i$ are independent and identically distributed (i.i.d.), with a density $\varphi_\varepsilon$ with respect to the Lebesgue measure having finite Fisher information for location $I_\varepsilon=\int_{\mathds{R}}\left\{[\varphi'_\varepsilon(t)]^2/\varphi_\varepsilon(t)\right\}\mathrm{d}t<\infty$ ($I_\varepsilon=1/\sigma^2$ for normal errors $\mathscr{N}(0,\sigma^2)$), we have $\mathscr{M}(x)=I_\varepsilon\,\mathbf{f}(x)\mathbf{f}^{\top}(x)$ and $\mathbf{M}(\xi)=I_\varepsilon\int_{\mathscr{X}}\mathbf{f}(x)\mathbf{f}^{\top}(x)\,\xi(\mathrm{d}x)$. Polynomial regression provides typical examples of such a situation and will be used for illustration in Section 4. The construction of an optimal design measure $\xi^*$ maximizing $\Phi[\mathbf{M}(\xi,\boldsymbol{\theta}_0)]$ usually relies on the application of a specialized algorithm to a discretization of the design space $\mathscr{X}$; see, e.g., Pronzato and Pázman (2013, Chap. 9).
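As a simple illustration (ours, not from the paper), the information matrix of a discrete design measure follows directly from this expression; the three equally weighted points used below form the D-optimal design on $[-1,1]$ for quadratic regression, a case revisited in Section 4.1.

```python
import numpy as np

def M_of_design(points, weights, f, I_eps=1.0):
    # M(xi) = I_eps * sum_i w_i f(x_i) f(x_i)^T for a discrete design measure xi
    return I_eps * sum(w * np.outer(f(x), f(x)) for x, w in zip(points, weights))

# quadratic regression f(x) = (1, x, x^2)^T; D-optimal design on [-1, 1]
f = lambda x: np.array([1.0, x, x * x])
M = M_of_design([-1.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3], f)
print(np.log(np.linalg.det(M)))   # log-det of the corresponding information matrix
```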

With the rapid development of connected sensors and the pervasive use of computers, there are more and more situations where massive amounts of data $(X_i,Y_i)$, $i=1,\ldots,N$, are available to construct models. When $N$ is very large, using all the data to construct $\hat{\boldsymbol{\theta}}^N$ is not feasible, and selecting the most informative subset through the construction of an $n$-point optimal design, $n\ll N$, over the discrete set $\mathscr{X}_N=\{X_i,\,i=1,\ldots,N\}$ is not feasible either. The objective of this paper is to present a method to explore $\mathscr{X}_N$ sequentially and select a proportion $n=\lfloor\alpha N\rfloor$ of the $N$ data points to be used to estimate $\boldsymbol{\theta}$. Each candidate $X_i$ is considered only once, which allows very large datasets to be processed: when the $X_i$ are i.i.d. and are received sequentially, they can be selected on the fly, which makes the method applicable to data streaming; when the $N$ data points are available simultaneously, a random permutation allows $\mathscr{X}_N$ to be processed as an i.i.d. sequence. When $N$ exceeds the storage capacity and the i.i.d. assumption is not tenable, interleaving or scrambling techniques can be used. Since de-scrambling is not necessary here (the objective is only to randomize the sequence), a simple random selection in a fixed-size buffer may be sufficient; an example is presented in Section 4.3.

The method is based on the construction of an optimal bounded design measure and draws on the paper (Pronzato, 2006). In that paper, the sequential selection of the $X_i$ relies on a threshold set on the directional derivative of the design criterion, given by the $(1-\alpha)$-quantile of the distribution of this derivative. At stage $k$, all previous $X_i$, $i=1,\ldots,k$, are used for the estimation of the quantile $C_k$ that defines the threshold for the possible selection of the candidate $X_{k+1}$. In the present paper, we combine this approach with the recursive estimation of $C_k$, following (Tierney, 1983): as a result, the construction is fully sequential and only requires recording the current value of the information matrix $\mathbf{M}_k$ and of the estimated quantile $\widehat{C}_k$ of the distribution of the directional derivative. It relies on a reinterpretation of the approach in (Pronzato, 2006) as a stochastic approximation method for the solution of the necessary and sufficient optimality conditions for a bounded design measure, which we combine with another stochastic approximation method for quantile estimation to obtain a two-time-scale stochastic approximation scheme.

The paper is organized as follows. Section 2 introduces the notation and assumptions and recalls the main results on optimal bounded design measures. Section 3 presents our subsampling algorithm, based on a two-time-scale stochastic approximation procedure, and contains the main result of the paper. Several illustrative examples are presented in Section 4. We are not aware of any other method for thinning experimental designs that is applicable to data streaming; nevertheless, in Section 5 we compare our algorithm with an exchange method and with the IBOSS algorithm of Wang et al. (2019) in the case where the $N$ design points are available and can be processed simultaneously. Section 6 concludes and suggests a few directions for further developments. A series of technical results is provided in the Appendix.

2 Optimal bounded design measures

2.1 Notation and assumptions

Suppose that $X$ is distributed with the probability measure $\mu$ on $\mathscr{X}\subseteq\mathds{R}^d$, a subset of $\mathds{R}^d$ with nonempty interior, with $d\geq 1$. For any $\xi\in\mathscr{P}^+(\mathscr{X})$, the set of positive measures on $\mathscr{X}$ (not necessarily of mass one), we denote $\mathbf{M}(\xi)=\int_{\mathscr{X}}\mathscr{M}(x)\,\xi(\mathrm{d}x)$ where, for all $x$ in $\mathscr{X}$, $\mathscr{M}(x)\in\mathbb{M}^{\geq}$, the set (cone) of symmetric non-negative definite $p\times p$ matrices. We assume that $p>1$ in the rest of the paper (the optimal selection of information in the case $p=1$ forms a variant of the secretary problem, for which an asymptotically optimal solution can be derived; see Albright and Derman (1972); Pronzato (2001)).

We denote by $\Phi:\mathbb{M}^{\geq}\rightarrow\mathds{R}\cup\{-\infty\}$ the design criterion we wish to maximize, and by $\lambda_{\min}(\mathbf{M})$ and $\lambda_{\max}(\mathbf{M})$ the minimum and maximum eigenvalues of $\mathbf{M}$, respectively; we shall use the $\ell_2$ norm for vectors and the Frobenius norm for matrices, $\|\mathbf{M}\|=\operatorname{trace}^{1/2}[\mathbf{M}\mathbf{M}^{\top}]$; all vectors are column vectors. For any $t\in\mathds{R}$, we denote $[t]^+=\max\{t,0\}$ and, for any $t\in\mathds{R}^+$, $\lfloor t\rfloor$ denotes the largest integer not exceeding $t$. For $0\leq\ell\leq L$ we denote by $\mathbb{M}^{\geq}_{\ell,L}$ the (convex) set defined by

\[
\mathbb{M}^{\geq}_{\ell,L}=\{\mathbf{M}\in\mathbb{M}^{\geq}:\ \ell<\lambda_{\min}(\mathbf{M})\ \mbox{ and }\ \lambda_{\max}(\mathbf{M})<L\}\,,
\]

and by $\mathbb{M}^{>}$ the open cone of symmetric positive definite $p\times p$ matrices. We make the following assumptions on $\Phi$.

H$_\Phi$

$\Phi$ is strictly concave on $\mathbb{M}^{>}$, linearly differentiable and increasing for the Loewner ordering; its gradient $\nabla_\Phi(\mathbf{M})$ is well defined in $\mathbb{M}^{\geq}$ for any $\mathbf{M}\in\mathbb{M}^{>}$ and satisfies $\|\nabla_\Phi(\mathbf{M})\|<A(\ell)$ and $\lambda_{\min}[\nabla_\Phi(\mathbf{M})]>a(L)$ for any $\mathbf{M}\in\mathbb{M}^{\geq}_{\ell,L}$, for some $a(L)>0$ and $A(\ell)<\infty$; moreover, $\nabla_\Phi$ satisfies the following Lipschitz condition: for all $\mathbf{M}_1$ and $\mathbf{M}_2$ in $\mathbb{M}^{\geq}$ such that $\lambda_{\min}(\mathbf{M}_i)>\ell>0$, $i=1,2$, there exists $K_\ell<\infty$ such that $\|\nabla_\Phi(\mathbf{M}_2)-\nabla_\Phi(\mathbf{M}_1)\|<K_\ell\,\|\mathbf{M}_2-\mathbf{M}_1\|$.

The criterion $\Phi_0(\mathbf{M})=\log\det(\mathbf{M})$ and the criteria $\Phi_q(\mathbf{M})=-\operatorname{trace}(\mathbf{M}^{-q})$, $q\in(-1,\infty)$, $q\neq 0$, with $\Phi_q(\mathbf{M})=-\infty$ if $\mathbf{M}$ is singular, which are often used in optimal design (in particular with $q$ a positive integer), satisfy H$_\Phi$; see, e.g., Pukelsheim (1993, Chap. 6). Their gradients are $\nabla_{\Phi_0}(\mathbf{M})=\mathbf{M}^{-1}$ and $\nabla_{\Phi_q}(\mathbf{M})=q\,\mathbf{M}^{-(q+1)}$, $q\neq 0$; the constants $a(L)$ and $A(\ell)$ are respectively given by $a(L)=1/L$, $A(\ell)=\sqrt{p}/\ell$ for $\Phi_0$ and $a(L)=q/L^{q+1}$, $A(\ell)=q\sqrt{p}/\ell^{q+1}$ for $\Phi_q$. The Lipschitz condition follows from the fact that the criteria are twice differentiable on $\mathbb{M}^{>}$. The positively homogeneous versions $\Phi_0^+(\mathbf{M})=\det^{1/p}(\mathbf{M})$ and $\Phi_q^+(\mathbf{M})=[(1/p)\operatorname{trace}(\mathbf{M}^{-q})]^{-1/q}$, which satisfy $\Phi^+(a\,\mathbf{M})=a\,\Phi^+(\mathbf{M})$ for any $a>0$ and any $\mathbf{M}\in\mathbb{M}^{\geq}$, and $\Phi^+(\mathbf{I}_p)=1$, with $\mathbf{I}_p$ the $p\times p$ identity matrix, could be considered too; see Pukelsheim (1993, Chaps. 5, 6). The strict concavity of $\Phi$ implies that, for any convex subset $\widehat{\mathbb{M}}$ of $\mathbb{M}^{>}$, there exists a unique matrix $\mathbf{M}^*$ maximizing $\Phi(\mathbf{M})$ with respect to $\mathbf{M}\in\widehat{\mathbb{M}}$.
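For concreteness, a minimal sketch (ours, not from the paper) of these two criteria and their gradients; the function names are illustrative and the $\Phi_q$ version assumes an integer $q>0$.

```python
import numpy as np

def Phi0(M):
    # D-optimality criterion Phi_0(M) = log det(M)
    return np.linalg.slogdet(M)[1]

def grad_Phi0(M):
    # gradient of log det(M) is M^{-1}
    return np.linalg.inv(M)

def Phiq(M, q):
    # Phi_q(M) = -trace(M^{-q}), here for integer q > 0
    return -np.trace(np.linalg.matrix_power(np.linalg.inv(M), q))

def grad_Phiq(M, q):
    # gradient of Phi_q: q * M^{-(q+1)}
    return q * np.linalg.matrix_power(np.linalg.inv(M), q + 1)
```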

We denote by $F_\Phi(\mathbf{M},\mathbf{M}')$ the directional derivative of $\Phi$ at $\mathbf{M}$ in the direction $\mathbf{M}'$,

\[
F_\Phi(\mathbf{M},\mathbf{M}')=\lim_{\gamma\rightarrow 0^+}\frac{\Phi[(1-\gamma)\mathbf{M}+\gamma\mathbf{M}']-\Phi(\mathbf{M})}{\gamma}=\operatorname{trace}[\nabla_\Phi(\mathbf{M})(\mathbf{M}'-\mathbf{M})]\,,
\]

and we make the following assumptions on $\mu$ and $\mathscr{M}$.

H$_\mu$

$\mu$ has a bounded positive density $\varphi$ with respect to the Lebesgue measure on every open subset of $\mathscr{X}$.

H$_{\mathscr{M}}$

(i) $\mathscr{M}$ is continuous on $\mathscr{X}$ and satisfies $\int_{\mathscr{X}}\|\mathscr{M}(x)\|^2\,\mu(\mathrm{d}x)<B<\infty$;

(ii) for any $\mathscr{X}_\epsilon\subset\mathscr{X}$ of measure $\mu(\mathscr{X}_\epsilon)=\epsilon>0$, $\lambda_{\min}\left\{\int_{\mathscr{X}_\epsilon}\mathscr{M}(x)\,\mu(\mathrm{d}x)\right\}>\ell_\epsilon$ for some $\ell_\epsilon>0$.

Since all the designs considered will be formed of points sampled from $\mu$, we shall identify $\mathscr{X}$ with the support of $\mu$: $\mathscr{X}=\{x\in\mathds{R}^d:\mu(\mathscr{B}_d(x,\epsilon))>0\ \forall\epsilon>0\}$, with $\mathscr{B}_d(x,\epsilon)$ the open ball with center $x$ and radius $\epsilon$. Notice that H$_{\mathscr{M}}$-(i) implies that $\lambda_{\max}[\mathbf{M}(\mu)]<\sqrt{B}$ and $\|\mathbf{M}(\mu)\|<\sqrt{p\,B}$.

Our sequential selection procedure will rely on the estimation of the $(1-\alpha)$-quantile $C_{1-\alpha}(\mathbf{M})$ of the distribution $F_{\mathbf{M}}(z)$ of the directional derivative $Z_{\mathbf{M}}(X)=F_\Phi[\mathbf{M},\mathscr{M}(X)]$ when $X\sim\mu$, and we shall assume that H$_{\mu,\mathscr{M}}$ below is satisfied. It implies in particular that $C_{1-\alpha}(\mathbf{M})$ is uniquely defined by $F_{\mathbf{M}}[C_{1-\alpha}(\mathbf{M})]=1-\alpha$.

H$_{\mu,\mathscr{M}}$

For all $\mathbf{M}\in\mathbb{M}^{\geq}_{\ell,L}$, $F_{\mathbf{M}}$ has a uniformly bounded density $\varphi_{\mathbf{M}}$; moreover, for any $\alpha\in(0,1)$, there exists $\epsilon_{\ell,L}>0$ such that $\varphi_{\mathbf{M}}[C_{1-\alpha}(\mathbf{M})]>\epsilon_{\ell,L}$ and $\varphi_{\mathbf{M}}$ is continuous at $C_{1-\alpha}(\mathbf{M})$.

H$_{\mu,\mathscr{M}}$ is more restrictive than necessary (we only need the existence and boundedness of $\varphi_{\mathbf{M}}$, and its positiveness and continuity at $C_{1-\alpha}(\mathbf{M})$), but it is satisfied in many common situations; see Section 4 for examples. Let us emphasize that H$_\mu$ and H$_{\mathscr{M}}$ are not enough to guarantee the existence of a density $\varphi_{\mathbf{M}}$, since $\operatorname{trace}[\nabla_\Phi(\mathbf{M})\mathscr{M}(x)]$ may remain constant over subsets of $\mathscr{X}$ having positive measure. Assuming the existence of $\varphi_{\mathbf{M}}$ and the continuity of $\varphi$ on $\mathscr{X}$ is also insufficient, since $\varphi_{\mathbf{M}}$ is generally not continuous when $Z_{\mathbf{M}}(x)$ is not differentiable in $x$, and $\varphi_{\mathbf{M}}$ is not necessarily bounded.

2.2 Optimal design

As mentioned in the introduction, when the cardinality of $\mathscr{X}_N$ is very large, one may wish to select only $n$ candidates $X_i$ among the $N$ available, say a fraction $n=\lfloor\alpha N\rfloor$ with $\alpha\in(0,1)$. For any $n\leq N$, we denote by $\mathbf{M}_{n,N}^*$ a design matrix (not necessarily unique) obtained by selecting $n$ points optimally within $\mathscr{X}_N$; that is, $\mathbf{M}_{n,N}^*$ gives the maximum of $\Phi(\mathbf{M}_n)$ with respect to $\mathbf{M}_n=(1/n)\sum_{j=1}^{n}\mathscr{M}(X_{i_j})$, where the $X_{i_j}$ are $n$ distinct points in $\mathscr{X}_N$. Note that this forms a difficult combinatorial problem, unfeasible for large $n$ and $N$. If one assumes that the $X_i$ are i.i.d., with $\mu$ their probability measure on $\mathscr{X}$, for large $N$ the optimal selection of $n=\lfloor\alpha N\rfloor$ points amounts to constructing an optimal bounded design measure $\xi_\alpha^*$ maximizing $\Phi[\mathbf{M}(\xi_\alpha)]$ over measures $\xi_\alpha\leq\mu/\alpha$ (in the sense $\xi_\alpha(\mathcal{A})\leq\mu(\mathcal{A})/\alpha$ for any $\mu$-measurable set $\mathcal{A}$, which makes $\xi_\alpha$ absolutely continuous with respect to $\mu$). Indeed, Lemma A.1 in Appendix A indicates that $\limsup_{N\rightarrow\infty}\Phi(\mathbf{M}_{\lfloor\alpha N\rfloor,N}^*)=\Phi[\mathbf{M}(\xi_\alpha^*)]$. Also, under H$_\Phi$, $\mathsf{E}\{\Phi(\mathbf{M}_{n,N}^*)\}\leq\Phi[\mathbf{M}(\xi_{n/N}^*)]$ for all $N\geq n>0$; see Pronzato (2006, Lemma 3).

A key result is that, when all subsets of $\mathscr{X}$ with constant $Z_{\mathbf{M}}(x)$ have zero measure, $Z_{\mathbf{M}(\xi_\alpha^*)}(x)=F_\Phi[\mathbf{M}(\xi_\alpha^*),\mathscr{M}(x)]$ separates two sets $\mathscr{X}_\alpha^*$ and $\mathscr{X}\setminus\mathscr{X}_\alpha^*$, with $F_\Phi[\mathbf{M}(\xi_\alpha^*),\mathscr{M}(x)]\geq C_{1-\alpha}^*$ and $\xi_\alpha^*=\mu/\alpha$ on $\mathscr{X}_\alpha^*$, and $F_\Phi[\mathbf{M}(\xi_\alpha^*),\mathscr{M}(x)]\leq C_{1-\alpha}^*$ and $\xi_\alpha^*=0$ on $\mathscr{X}\setminus\mathscr{X}_\alpha^*$, for some constant $C_{1-\alpha}^*$; moreover, $\int_{\mathscr{X}}F_\Phi[\mathbf{M}(\xi_\alpha^*),\mathscr{M}(x)]\,\xi_\alpha^*(\mathrm{d}x)=\int_{\mathscr{X}_\alpha^*}F_\Phi[\mathbf{M}(\xi_\alpha^*),\mathscr{M}(x)]\,\mu(\mathrm{d}x)=0$; see Wynn (1982), Fedorov (1989) and Fedorov and Hackl (1997, Chap. 4). (The condition mentioned in those references is that $\mu$ has no atoms, but the example in Section 4.4.2 will show that this is not sufficient; extension to arbitrary measures is considered in Sahm and Schwabe (2001).)

For $\alpha\in(0,1)$, denote

\[
\mathbb{M}(\alpha)=\left\{\mathbf{M}(\xi_\alpha)=\int_{\mathscr{X}}\mathscr{M}(x)\,\xi_\alpha(\mathrm{d}x):\ \xi_\alpha\in\mathscr{P}^+(\mathscr{X}),\ \xi_\alpha\leq\frac{\mu}{\alpha},\ \int_{\mathscr{X}}\xi_\alpha(\mathrm{d}x)=1\right\}\,.
\]

In (Pronzato, 2006), it is shown that, for any $\mathbf{M}\in\mathbb{M}^{>}$,

\[
\mathbf{M}^+(\mathbf{M},\alpha)=\arg\max_{\mathbf{M}'\in\mathbb{M}(\alpha)}F_\Phi(\mathbf{M},\mathbf{M}')=\frac{1}{\alpha}\int_{\mathscr{X}}\mathbb{I}_{\{F_\Phi[\mathbf{M},\mathscr{M}(x)]\geq C_{1-\alpha}\}}\,\mathscr{M}(x)\,\mu(\mathrm{d}x)\,, \tag{2.1}
\]

where, for any proposition $\mathcal{A}$, $\mathbb{I}_{\{\mathcal{A}\}}=1$ if $\mathcal{A}$ is true and is zero otherwise, and $C_{1-\alpha}=C_{1-\alpha}(\mathbf{M})$ is a $(1-\alpha)$-quantile of $F_\Phi[\mathbf{M},\mathscr{M}(X)]$ when $X\sim\mu$, which satisfies

\[
\int_{\mathscr{X}}\mathbb{I}_{\{F_\Phi[\mathbf{M},\mathscr{M}(x)]\geq C_{1-\alpha}(\mathbf{M})\}}\,\mu(\mathrm{d}x)=\alpha\,. \tag{2.2}
\]

Therefore, $\mathbf{M}_\alpha^*=\mathbf{M}(\xi_\alpha^*)$ is the optimal information matrix in $\mathbb{M}(\alpha)$ (unique since $\Phi$ is strictly concave) if and only if it satisfies $\max_{\mathbf{M}'\in\mathbb{M}(\alpha)}F_\Phi(\mathbf{M}_\alpha^*,\mathbf{M}')=0$, or equivalently $\mathbf{M}_\alpha^*=\mathbf{M}^+(\mathbf{M}_\alpha^*,\alpha)$, and the constant $C_{1-\alpha}^*$ equals $C_{1-\alpha}(\mathbf{M}_\alpha^*)$; see (Pronzato, 2006, Th. 5); see also Pronzato (2004). Note that $C_{1-\alpha}^*\leq 0$ since $\int_{\mathscr{X}}F_\Phi[\mathbf{M}(\xi_\alpha^*),\mathscr{M}(x)]\,\xi_\alpha^*(\mathrm{d}x)=0$ and $F_\Phi[\mathbf{M}(\xi_\alpha^*),\mathscr{M}(x)]\geq C_{1-\alpha}^*$ on the support of $\xi_\alpha^*$.
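To make this characterization concrete, here is a small Monte Carlo sketch (ours, not part of the paper's sequential method) that approximates $\mathbf{M}_\alpha^*$ by a damped fixed-point (vertex-direction) iteration toward $\mathbf{M}=\mathbf{M}^+(\mathbf{M},\alpha)$, with $\mathbf{M}^+$ and the quantile of (2.2) estimated on a large stored i.i.d. sample from $\mu$; names such as `M_plus` and `grad_Phi` are illustrative.

```python
import numpy as np

def M_plus(M, Mx_all, alpha, grad_Phi):
    # directional derivative of Phi at M in direction M(x_k), up to a constant
    z = np.einsum('ij,kji->k', grad_Phi(M), Mx_all)   # trace[grad_Phi(M) M(x_k)]
    C = np.quantile(z, 1 - alpha)                      # empirical (1-alpha)-quantile, cf. (2.2)
    return Mx_all[z >= C].mean(axis=0)                 # empirical version of (2.1)

rng = np.random.default_rng(0)
X = rng.standard_normal(100_000)                       # mu = N(0,1), quadratic regression example
f = lambda x: np.array([1.0, x, x * x])
Mx_all = np.stack([np.outer(f(x), f(x)) for x in X])
grad_Phi = np.linalg.inv                               # D-optimality: grad_Phi(M) = M^{-1}

alpha, M = 0.5, Mx_all.mean(axis=0)                    # start from M(mu)
for k in range(200):
    M = M + (M_plus(M, Mx_all, alpha, grad_Phi) - M) / (k + 2)   # damped fixed-point step
print(np.log(np.linalg.det(M)))                        # approximates Phi[M(xi_alpha^*)], see Section 4.1
```

Unlike Algorithm 1 of Section 3, this sketch stores the whole sample, so it is only usable for moderate $N$.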

3 Sequential construction of an optimal bounded design measure

3.1 A stochastic approximation problem

Suppose that the $X_i$ are i.i.d. with measure $\mu$. The solution of $\mathbf{M}=\mathbf{M}^+(\mathbf{M},\alpha)$, $\alpha\in(0,1)$, with respect to $\mathbf{M}$ by stochastic approximation yields the iterations

\[
\begin{array}{rcl}
n_{k+1} &=& n_k+\mathbb{I}_{\{F_\Phi[\mathbf{M}_{n_k},\mathscr{M}(X_{k+1})]\geq C_{1-\alpha}(\mathbf{M}_{n_k})\}}\,,\\[2mm]
\mathbf{M}_{n_{k+1}} &=& \mathbf{M}_{n_k}+\dfrac{1}{n_k+1}\,\mathbb{I}_{\{F_\Phi[\mathbf{M}_{n_k},\mathscr{M}(X_{k+1})]\geq C_{1-\alpha}(\mathbf{M}_{n_k})\}}\left[\mathscr{M}(X_{k+1})-\mathbf{M}_{n_k}\right]\,.
\end{array} \tag{3.4}
\]

Note that $\mathsf{E}\left\{\mathbb{I}_{\{F_\Phi[\mathbf{M},\mathscr{M}(X)]\geq C_{1-\alpha}(\mathbf{M})\}}\left[\mathscr{M}(X)-\mathbf{M}\right]\right\}=\alpha\,[\mathbf{M}^+(\mathbf{M},\alpha)-\mathbf{M}]$. The almost sure (a.s.) convergence of $\mathbf{M}_{n_k}$ in (3.4) to the matrix $\mathbf{M}(\xi_\alpha^*)$ that maximizes $\Phi(\mathbf{M})$ with respect to $\mathbf{M}\in\mathbb{M}(\alpha)$ is proved in (Pronzato, 2006) under rather weak assumptions on $\Phi$, $\mathscr{M}$ and $\mu$.

The construction (3.4) requires the calculation of the $(1-\alpha)$-quantile $C_{1-\alpha}(\mathbf{M}_{n_k})$ for all $n_k$, see (2.2), which is not feasible when $\mu$ is unknown and has a prohibitive computational cost when $\mu$ is known. For that reason, it is proposed in (Pronzato, 2006) to replace $C_{1-\alpha}(\mathbf{M}_{n_k})$ by the empirical quantile $\widetilde{C}_{\alpha,k}(\mathbf{M}_{n_k})$ based on the empirical measure $\mu_k=(1/k)\sum_{i=1}^{k}\delta_{X_i}$ of the $X_i$ observed up to stage $k$. This construction preserves the a.s. convergence of $\mathbf{M}_{n_k}$ to $\mathbf{M}(\xi_\alpha^*)$ in (3.4), but its computational cost and storage requirement increase with $k$, which makes it ill-suited to situations with very large $N$. The next section considers the recursive estimation of $C_{1-\alpha}(\mathbf{M}_{n_k})$ and contains the main result of the paper.

3.2 Recursive quantile estimation

The idea is to plug a recursive estimator of the $(1-\alpha)$-quantile $C_{1-\alpha}(\mathbf{M}_{n_k})$ into (3.4). Under mild assumptions, for random variables $Z_i$ that are i.i.d. with distribution function $F$ such that the solution of the equation $F(z)=1-\alpha$ is unique, the recursion

\[
\widehat{C}_{k+1}=\widehat{C}_k+\frac{\beta}{k+1}\left(\mathbb{I}_{\{Z_{k+1}\geq\widehat{C}_k\}}-\alpha\right) \tag{3.5}
\]

with $\beta>0$ converges a.s. to the quantile $C_{1-\alpha}$ such that $F(C_{1-\alpha})=1-\alpha$. Here, we shall use a construction based on (Tierney, 1983). In that paper, a clever dynamical choice of $\beta=\beta_k$ is shown to provide the optimal asymptotic rate of convergence of $\widehat{C}_k$ towards $C_{1-\alpha}$, with $\sqrt{k}(\widehat{C}_k-C_{1-\alpha})\stackrel{\rm d}{\rightarrow}\mathscr{N}(0,\alpha(1-\alpha)/f^2(C_{1-\alpha}))$ as $k\rightarrow\infty$, where $f(z)=\mathrm{d}F(z)/\mathrm{d}z$ is the p.d.f. of the $Z_i$; note that this coincides with the asymptotic behavior of the sample (empirical) quantile. The only conditions on $F$ are that $f(z)$ exists for all $z$ and is uniformly bounded, and that $f$ is continuous and positive at the unique root $C_{1-\alpha}$ of $F(z)=1-\alpha$.

There is a noticeable difference, however, with the estimation of $C_{1-\alpha}(\mathbf{M}_{n_k})$: in our case we need to estimate a quantile of $Z_k(X)=F_\Phi[\mathbf{M}_{n_k},\mathscr{M}(X)]$ for $X\sim\mu$, with the distribution of $Z_k(X)$ evolving with $k$. For that reason, we shall impose a faster dynamic on the evolution of $\widehat{C}_k$, and replace (3.5) by

\[
\widehat{C}_{k+1}=\widehat{C}_k+\frac{\beta_k}{(k+1)^q}\left(\mathbb{I}_{\{Z_k(X_{k+1})\geq\widehat{C}_k\}}-\alpha\right) \tag{3.6}
\]

for some $q\in(0,1)$. The combination of (3.6) with (3.4) yields a particular nonlinear two-time-scale stochastic approximation scheme. There exist advanced results on the convergence of linear two-time-scale stochastic approximation; see Konda and Tsitsiklis (2004); Dalal et al. (2018). To the best of our knowledge, however, there are few results on convergence for nonlinear schemes. Convergence is shown in (Borkar, 1997) under the assumption of boundedness of the iterates, using the ODE method of Ljung (1977); sufficient conditions for stability are provided in (Lakshminarayanan and Bhatnagar, 2017), also using the ODE approach. In the proof of Theorem 3.1 we provide justifications for our construction, based on the analyses and results in the references mentioned above.
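As a minimal illustration of the basic recursion (3.5) (our sketch, with a fixed distribution $F$ and a constant $\beta$, before the modifications introduced in (3.6)):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, C = 0.1, 2.0, 0.0
for k in range(200_000):
    z = rng.standard_normal()                      # Z_{k+1} drawn from a fixed F (here N(0,1))
    C += beta / (k + 1) * (float(z >= C) - alpha)  # recursion (3.5)
print(C)   # close to the 0.9-quantile of N(0,1), about 1.2816
```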

The construction is summarized in Algorithm 1 below. The presence of the small number $\epsilon_1$ is only due to technical reasons: setting $z_{k+1}=+\infty$ when $n_k/k<\epsilon_1$ in (3.9) has the effect of always selecting $X_{k+1}$ when fewer than $\epsilon_1\,k$ points have been selected previously; it ensures that $n_{k+1}/k>\epsilon_1$ for all $k$ and thus that $\mathbf{M}_{n_k}$ always belongs to $\mathbb{M}^{\geq}_{\ell,L}$ for some $\ell>0$ and $L<\infty$; see Lemma B.1 in the Appendix.

  • Algorithm 1: sequential selection ($\alpha$ given).

  • 0)

    Choose $k_0\geq p$, $q\in(1/2,1)$, $\gamma\in(0,q-1/2)$, and $0<\epsilon_1\ll\alpha$.

  • 1)

    Initialization: select $X_1,\ldots,X_{k_0}$, compute $\mathbf{M}_{n_{k_0}}=(1/k_0)\sum_{i=1}^{k_0}\mathscr{M}(X_i)$. If $\mathbf{M}_{n_{k_0}}$ is singular, increase $k_0$ and select the next points until $\mathbf{M}_{n_{k_0}}$ has full rank. Set $k=n_k=k_0$, the number of points selected.

    Compute $\zeta_i=Z_{k_0}(X_i)$ for $i=1,\ldots,k_0$ and order the $\zeta_i$ as $\zeta_{1:k_0}\leq\zeta_{2:k_0}\leq\cdots\leq\zeta_{k_0:k_0}$; denote $k_0^+=\lceil(1-\alpha/2)\,k_0\rceil$ and $k_0^-=\max\{\lfloor(1-3\,\alpha/2)\,k_0\rfloor,1\}$.

    Initialize $\widehat{C}_{k_0}$ at $\zeta_{\lceil(1-\alpha)\,k_0\rceil:k_0}$; set $\beta_0=k_0/(k_0^+-k_0^-)$, $h=\zeta_{k_0^+:k_0}-\zeta_{k_0^-:k_0}$, $h_{k_0}=h/k_0^\gamma$ and $\widehat{f}_{k_0}=\left[\sum_{i=1}^{k_0}\mathbb{I}_{\{|\zeta_i-\widehat{C}_{k_0}|\leq h_{k_0}\}}\right]/(2\,k_0\,h_{k_0})$.

  • 2)

    Iteration $k+1$: collect $X_{k+1}$ and compute $Z_k(X_{k+1})=F_\Phi[\mathbf{M}_{n_k},\mathscr{M}(X_{k+1})]$.

    \[
    \begin{array}{ll}
    \mbox{If } n_k/k<\epsilon_1, & \mbox{ set } z_{k+1}=+\infty\,;\\
    \mbox{otherwise}, & \mbox{ set } z_{k+1}=Z_k(X_{k+1})\,.
    \end{array} \tag{3.9}
    \]

    If $z_{k+1}\geq\widehat{C}_k$, update $n_k$ into $n_{k+1}=n_k+1$ and $\mathbf{M}_{n_k}$ into

    \[
    \mathbf{M}_{n_{k+1}}=\mathbf{M}_{n_k}+\frac{1}{n_k+1}\left[\mathscr{M}(X_{k+1})-\mathbf{M}_{n_k}\right]\,; \tag{3.10}
    \]

    otherwise, set $n_{k+1}=n_k$.

  • 3)

    Compute $\beta_k=\min\{1/\widehat{f}_k,\,\beta_0\,k^\gamma\}$; update $\widehat{C}_k$ into $\widehat{C}_{k+1}$ using (3.6).

    Set $h_{k+1}=h/(k+1)^\gamma$ and update $\widehat{f}_k$ into

    \[
    \widehat{f}_{k+1}=\widehat{f}_k+\frac{1}{(k+1)^q}\left[\frac{1}{2\,h_{k+1}}\,\mathbb{I}_{\{|Z_k(X_{k+1})-\widehat{C}_k|\leq h_{k+1}\}}-\widehat{f}_k\right]\,.
    \]

  • 4)

    $k\leftarrow k+1$, return to Step 2.

Note that $\widehat{C}_k$ is updated whatever the value of $Z_k(X_{k+1})$. Recursive quantile estimation by (3.6) follows (Tierney, 1983). To ensure a faster dynamic for the evolution of $\widehat{C}_k$ than for $\mathbf{M}_{n_k}$, we take $q<1$ instead of $q=1$ in (Tierney, 1983), and the construction of $\widehat{f}_k$ and the choices of $\beta_k$ and $h_k$ are modified accordingly. Following the same arguments as in the proof of Proposition 1 of (Tierney, 1983), the a.s. convergence of $\widehat{C}_k$ to $C_{1-\alpha}$ in the modified version of (3.5) is proved in Theorem C.1 (Appendix C).
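The following Python sketch (ours) summarizes a straightforward implementation of Algorithm 1 for a generic criterion; the initialization does not handle a singular $\mathbf{M}_{n_{k_0}}$, and the names `sequential_selection`, `M_of_x` and `grad_Phi` are illustrative, not from the paper.

```python
import numpy as np

def sequential_selection(X_stream, M_of_x, grad_Phi, alpha,
                         k0=None, q=5/8, gamma=1/10, eps1=0.0):
    """One pass over X_stream; returns the final matrix M_{n_k} and the selected indices."""
    Z = lambda x: np.trace(grad_Phi(M) @ (M_of_x(x) - M))   # directional derivative Z_k(x)
    stream = enumerate(X_stream)
    idx, x = next(stream)
    p = M_of_x(x).shape[0]
    k0 = k0 or 5 * p                                        # Step 0: k0 >= p (here 5p by default)
    first = [(idx, x)] + [next(stream) for _ in range(k0 - 1)]
    M = np.mean([M_of_x(xx) for _, xx in first], axis=0)    # Step 1: select the first k0 points
    selected = [i for i, _ in first]
    n = k = len(first)
    zeta = np.sort([Z(xx) for _, xx in first])
    kp = int(np.ceil((1 - alpha / 2) * k))
    km = max(int(np.floor((1 - 1.5 * alpha) * k)), 1)
    C = zeta[int(np.ceil((1 - alpha) * k)) - 1]             # initial quantile estimate
    beta0, h = k / (kp - km), zeta[kp - 1] - zeta[km - 1]
    hk = h / k**gamma
    f_hat = np.mean(np.abs(zeta - C) <= hk) / (2 * hk)      # initial density estimate
    for idx, x in stream:                                   # Steps 2-4
        zval = Z(x)
        z = np.inf if n / k < eps1 else zval                # forced-selection rule (3.9)
        if z >= C:                                          # select the candidate
            n += 1
            M = M + (M_of_x(x) - M) / n                     # information-matrix update (3.10)
            selected.append(idx)
        k += 1
        beta_k = min(1.0 / f_hat, beta0 * k**gamma)
        C_old = C
        C = C_old + beta_k / k**q * (float(zval >= C_old) - alpha)           # quantile update (3.6)
        hk = h / k**gamma
        f_hat += (float(abs(zval - C_old) <= hk) / (2 * hk) - f_hat) / k**q  # density update
    return M, selected

# usage on the setting of Example 1 (quadratic regression, standard normal inputs)
rng = np.random.default_rng(0)
f = lambda x: np.array([1.0, x, x * x])
M, sel = sequential_selection(rng.standard_normal(100_000),
                              lambda x: np.outer(f(x), f(x)),
                              np.linalg.inv, alpha=0.5)
print(np.log(np.linalg.det(M)), len(sel))   # Phi close to 1.6354, about 50 000 points selected
```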

The next theorem establishes the convergence of the combined stochastic approximation schemes with two time-scales.

Theorem 3.1.

Under H$_\Phi$, H$_\mu$, H$_{\mathscr{M}}$ and H$_{\mu,\mathscr{M}}$, the normalized information matrix $\mathbf{M}_{n_k}$ corresponding to the $n_k$ candidates selected after $k$ iterations of Algorithm 1 converges a.s. to the optimal matrix $\mathbf{M}_\alpha^*$ in $\mathbb{M}(\alpha)$ as $k\rightarrow\infty$.

Proof. Our analysis is based on (Borkar, 1997). We denote by $\mathscr{F}_k$ the increasing sequence of $\sigma$-fields generated by the $X_i$. According to (3.6), we can write $\widehat{C}_{k+1}=\widehat{C}_k+[\beta_k/(k+1)^q]\,V_{k+1}$ with $V_{k+1}=\mathbb{I}_{\{Z_k(X_{k+1})\geq\widehat{C}_k\}}-\alpha$. Therefore, $\mathsf{E}\{V_{k+1}|\mathscr{F}_k\}=\int_{\mathscr{X}}[\mathbb{I}_{\{Z_k(x)\geq\widehat{C}_k\}}-\alpha]\,\mu(\mathrm{d}x)$ and $\mathsf{var}\{V_{k+1}|\mathscr{F}_k\}=F_k(\widehat{C}_k)[1-F_k(\widehat{C}_k)]$, with $F_k$ the distribution function of $Z_k(X)$. From Lemma B.1 (Appendix B) and H$_{\mu,\mathscr{M}}$, $F_k$ has a well-defined density $f_k$ for all $k$, with $f_k(t)>0$ for all $t$ and $f_k$ bounded. The first part of the proof of Theorem C.1 applies (see Appendix C): $\widehat{f}_k$ is a.s. bounded and $\beta_k$ is bounded away from zero a.s. Therefore, $\sum_k\beta_k/(k+1)^q\rightarrow\infty$ a.s. and $(k+1)^q/[\beta_k(k+1)]\rightarrow 0$ a.s.; also, $\sum_k[\beta_k/(k+1)^q]^2<\infty$ since $q-\gamma>1/2$.

The o.d.e. associated with (3.6), for a fixed matrix $\mathbf{M}$ and thus a fixed $Z(\cdot)$, such that $Z(X)=F_\Phi[\mathbf{M},\mathscr{M}(X)]$ has the distribution function $F$ and density $f$, is

\[
\frac{\mathrm{d}C(t)}{\mathrm{d}t}=1-F[C(t)]-\alpha=F(C_{1-\alpha})-F[C(t)]\,,
\]

where $C_{1-\alpha}=C_{1-\alpha}(\mathbf{M})$ satisfies $F(C_{1-\alpha})=1-\alpha$. Consider the Lyapunov function $L(C)=[F(C)-F(C_{1-\alpha})]^2$. It satisfies $\mathrm{d}L[C(t)]/\mathrm{d}t=-2\,f[C(t)]\,L[C(t)]\leq 0$, with $\mathrm{d}L[C(t)]/\mathrm{d}t=0$ if and only if $C=C_{1-\alpha}$. Moreover, $C_{1-\alpha}$ is Lipschitz continuous in $\mathbf{M}$; see Lemma D.1 in Appendix D. The conditions of Theorem 1.1 in (Borkar, 1997) are thus satisfied concerning the iterations for $\widehat{C}_k$.

Denote $\widehat{\mathbf{M}}_k=\mathbf{M}_{n_k}$ and $\rho_k=n_k/k$, so that (3.9) implies $k\,\rho_k\geq\epsilon_1\,(k-1)$ for all $k$; see Lemma B.1 in the Appendix. They satisfy

\[
\rho_{k+1}=\rho_k+\frac{R_{k+1}}{k+1}\ \mbox{ and }\ \widehat{\mathbf{M}}_{k+1}=\widehat{\mathbf{M}}_k+\frac{\boldsymbol{\Omega}_{k+1}}{k+1}\,, \tag{3.11}
\]

where $R_{k+1}=\mathbb{I}_{\{Z_k(X_{k+1})\geq\widehat{C}_k\}}-\rho_k$ and $\boldsymbol{\Omega}_{k+1}=(1/\rho_{k+1})\,\mathbb{I}_{\{Z_k(X_{k+1})\geq\widehat{C}_k\}}\left[\mathscr{M}(X_{k+1})-\widehat{\mathbf{M}}_k\right]$. We have $\mathsf{E}\{R_{k+1}|\mathscr{F}_k\}=\int_{\mathscr{X}}\mathbb{I}_{\{Z_k(x)\geq\widehat{C}_k\}}\,\mu(\mathrm{d}x)-\rho_k$ and $\mathsf{var}\{R_{k+1}|\mathscr{F}_k\}=F_k(\widehat{C}_k)[1-F_k(\widehat{C}_k)]$, with $F_k$ the distribution function of $Z_k(X)$, which, from H$_{\mu,\mathscr{M}}$, has a well-defined density $f_k$ for all $k$. Also,

\[
\mathsf{E}\{\boldsymbol{\Omega}_{k+1}|\mathscr{F}_k\}=\frac{\mathcal{I}_k}{\rho_k+\frac{1-\rho_k}{k+1}}=\frac{1+1/k}{\rho_k+1/k}\,\mathcal{I}_k\,,
\]

where $\mathcal{I}_k=\int_{\mathscr{X}}\mathbb{I}_{\{Z_k(x)\geq\widehat{C}_k\}}\left[\mathscr{M}(x)-\widehat{\mathbf{M}}_k\right]\mu(\mathrm{d}x)$. Denote $\Delta_{k+1}=\boldsymbol{\Omega}_{k+1}-\mathcal{I}_k/\rho_k$, so that

\[
\widehat{\mathbf{M}}_{k+1}=\widehat{\mathbf{M}}_k+\frac{1}{k+1}\,\frac{\mathcal{I}_k}{\rho_k}+\frac{\Delta_{k+1}}{k+1}\,. \tag{3.12}
\]

We get $\mathsf{E}\{\Delta_{k+1}|\mathscr{F}_k\}=\dfrac{\rho_k-1}{\rho_k\,(k\,\rho_k+1)}\,\mathcal{I}_k$ and

\begin{align*}
\mathsf{var}\{\{\Delta_{k+1}\}_{i,j}|\mathscr{F}_k\} &= \mathsf{var}\{\{\boldsymbol{\Omega}_{k+1}\}_{i,j}|\mathscr{F}_k\}\\
&= \frac{(k+1)^2}{(k\,\rho_k+1)^2}\left[\int_{\mathscr{X}}\mathbb{I}_{\{Z_k(x)\geq\widehat{C}_k\}}\,\{\mathscr{M}(x)-\widehat{\mathbf{M}}_k\}_{i,j}^2\,\mu(\mathrm{d}x)-\{\mathcal{I}_k\}_{i,j}^2\right]\,,
\end{align*}

where (3.9) implies that $\rho_k>\epsilon_1/2$, and therefore $(k+1)(1-\rho_k)/[\rho_k\,(k\,\rho_k+1)]<4/\epsilon_1^2$, and $\mathsf{var}\{\{\Delta_{k+1}\}_{i,j}|\mathscr{F}_k\}$ is a.s. bounded from H$_{\mathscr{M}}$-(i). This implies that $\sum_k\Delta_{k+1}/(k+1)<\infty$ a.s. The limiting o.d.e.s associated with (3.11) and (3.12) are

\begin{align*}
\frac{\mathrm{d}\rho(t)}{\mathrm{d}t} &= \int_{\mathscr{X}}\mathbb{I}_{\{F_\Phi[\widehat{\mathbf{M}}(t),\mathscr{M}(x)]\geq C_{1-\alpha}[\widehat{\mathbf{M}}(t)]\}}\,\mu(\mathrm{d}x)-\rho(t)=\alpha-\rho(t)\,,\\
\frac{\mathrm{d}\widehat{\mathbf{M}}(t)}{\mathrm{d}t} &= \frac{1}{\rho(t)}\int_{\mathscr{X}}\mathbb{I}_{\{F_\Phi[\widehat{\mathbf{M}}(t),\mathscr{M}(x)]\geq C_{1-\alpha}[\widehat{\mathbf{M}}(t)]\}}\left[\mathscr{M}(x)-\widehat{\mathbf{M}}(t)\right]\mu(\mathrm{d}x)\\
&= \frac{\alpha}{\rho(t)}\left\{\mathbf{M}^+[\widehat{\mathbf{M}}(t),\alpha]-\widehat{\mathbf{M}}(t)\right\}\,,
\end{align*}

where $\mathbf{M}^+[\widehat{\mathbf{M}}(t),\alpha]$ is defined by (2.1). The first equation implies that $\rho(t)$ converges exponentially fast to $\alpha$, with $\rho(t)=\alpha+[\rho(0)-\alpha]\exp(-t)$; the second equation gives

\[
\frac{\mathrm{d}\Phi[\widehat{\mathbf{M}}(t)]}{\mathrm{d}t}=\operatorname{trace}\left[\nabla_\Phi[\widehat{\mathbf{M}}(t)]\,\frac{\mathrm{d}\widehat{\mathbf{M}}(t)}{\mathrm{d}t}\right]=\frac{\alpha}{\rho(t)}\,\max_{\mathbf{M}'\in\mathbb{M}(\alpha)}F_\Phi[\widehat{\mathbf{M}}(t),\mathbf{M}']\geq 0\,,
\]

with a strict inequality if $\widehat{\mathbf{M}}(t)\neq\mathbf{M}_\alpha^*$, the optimal matrix in $\mathbb{M}(\alpha)$. The conditions of Theorem 1.1 in (Borkar, 1997) are thus satisfied, and $\widehat{\mathbf{M}}_k$ converges to $\mathbf{M}_\alpha^*$ a.s.

Remark 3.1.

(i)

Algorithm 1 does not require the knowledge of $\mu$ and has minimum storage requirements: apart from the current matrix $\mathbf{M}_{n_k}$, we only need to update the scalar variables $\widehat{C}_k$ and $\widehat{f}_k$. Its complexity is $\mathcal{O}(d^3\,N)$ in general, considering that the complexity of the calculation of $F_\Phi[\mathbf{M},\mathscr{M}(X)]$ is $\mathcal{O}(d^3)$. It can be reduced to $\mathcal{O}(d^2\,N)$ when $\mathscr{M}(X)$ has rank one and $\mathbf{M}_{n_k}^{-1}$ is updated instead of $\mathbf{M}_{n_k}$ (see remark (iii) below), for D-optimality and $\Phi_q$-optimality with $q$ integer; see Section 2.1. Very long sequences $(X_i)$ can thus be processed.

(ii)

Numerical simulations indicate that we do not need to take $q<1$ in Algorithm 1: (3.6) with $q=1$ yields satisfactory performance, provided the step size obeys Kesten's rule and does not decrease at each iteration.

(iii)

The substitution of $\operatorname{trace}[\nabla_\Phi(\mathbf{M})\mathscr{M}(X)]$ for $F_\Phi[\mathbf{M},\mathscr{M}(X)]=\operatorname{trace}\{\nabla_\Phi(\mathbf{M})[\mathscr{M}(X)-\mathbf{M}]\}$ everywhere does not change the behavior of the algorithm. When $\nabla_\Phi(\mathbf{M})$ only depends on $\mathbf{M}^{-1}$ (which is often the case for classical design criteria, see the discussion following the presentation of H$_\Phi$), and if $\mathscr{M}(X)$ is a low-rank matrix, it may be preferable to update $\mathbf{M}_{n_k}^{-1}$ instead of $\mathbf{M}_{n_k}$, thereby avoiding matrix inversions. For example, if $\mathscr{M}(X_{k+1})=I_\varepsilon\,\mathbf{f}(X_{k+1})\mathbf{f}^{\top}(X_{k+1})$, then, instead of applying (3.10), it is preferable to use the update (a code sketch is given after this remark)

\[
\mathbf{M}_{n_{k+1}}^{-1}=\left(1+\frac{1}{n_k}\right)\left[\mathbf{M}_{n_k}^{-1}-\frac{I_\varepsilon\,\mathbf{M}_{n_k}^{-1}\mathbf{f}(X_{k+1})\mathbf{f}^{\top}(X_{k+1})\mathbf{M}_{n_k}^{-1}}{n_k+I_\varepsilon\,\mathbf{f}^{\top}(X_{k+1})\mathbf{M}_{n_k}^{-1}\mathbf{f}(X_{k+1})}\right]\,.
\]

Low-rank updates of the Cholesky decomposition of the matrix can be considered too.

(iv)

Algorithm 1 can be adapted to the case where the number of iterations is fixed (equal to the size $N$ of the candidate set $\mathscr{X}_N$) and the number of candidates $n$ to be selected is imposed. A straightforward modification is to introduce truncation and forced selection: we run the algorithm with $\alpha=n/N$ and, at Step 2, we set $z_{k+1}=-\infty$ (reject $X_{k+1}$) if $n_k\geq n$ and set $z_{k+1}=+\infty$ (select $X_{k+1}$) if $n-n_k\geq N-k$. However, this may induce the selection of points $X_i$ carrying little information when $k$ approaches $N$ in case $n_k$ is excessively small. For that reason, adaptation of $\alpha$ to $n_k$, obtained by substituting $\alpha_k=(n-n_k)/(N-k)$ for the constant $\alpha$ everywhere, seems preferable. This is illustrated by an example in Section 4.2.

(v)

The case when $\mu$ has discrete components (atoms), or more precisely when there exist subsets of $\mathscr{X}$ of positive measure where $Z_{\mathbf{M}}(x)$ is constant (see Section 4.4.2), requires additional technical developments which we do not detail here.

A first difficulty is that H$_{\mathscr{M}}$-(ii) may not be satisfied when the matrices $\mathscr{M}(x)$ do not have full rank, unless we only consider large enough $\epsilon$. Unless $\epsilon_1$ in (3.9) is large enough, Lemma B.1 is not valid, and other arguments are required in the proof of Theorem 3.1. Possible remedies may consist (a) in adding a regularization matrix $\gamma\,\mathbf{I}_p$ with a small $\gamma$ to all matrices $\mathscr{M}(x)$ (which amounts to considering optimal design for Bayesian estimation with a vague prior; see, e.g., Pilz (1983)), or (b) in replacing the condition in (3.9) by [if $\lambda_{\min}(\mathbf{M}_{n_k})<\epsilon_1$, set $z_{k+1}=+\infty$].

A second difficulty is that $C_{1-\alpha}(\mathbf{M}_\alpha^*)$ may correspond to a point of discontinuity of the distribution function of $F_\Phi[\mathbf{M}_\alpha^*,\mathscr{M}(X)]$. The estimated value $\widehat{f}_k$ of the density of $F_\Phi[\mathbf{M}_{n_k},\mathscr{M}(X)]$ at $\widehat{C}_k$ (Step 3 of Algorithm 1) may then increase to infinity and $\beta_k$ tend to zero in (3.6). This can be avoided by setting $\beta_k=\max\{\epsilon_2,\min(1/\widehat{f}_k,\beta_0\,k^\gamma)\}$ for some $\epsilon_2>0$.

In (Pronzato, 2006), where empirical quantiles are used, measures needed to be taken to avoid the acceptance of too many points, for instance based on the adaptation of $\alpha$ through $\alpha_k=(n-n_k)/(N-k)$, see remark (iv) above, or via the addition of the extra condition [if $n_k/k>\alpha$, set $z_{k+1}=-\infty$] to (3.9) in case $n$ is not specified. Such measures do not appear to be necessary when quantiles are estimated by (3.6); see the examples in Section 4.4.   $\triangleleft$
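As an illustration of remark (iii), here is a minimal sketch (ours) of the inverse update, checked against the direct update (3.10) in the rank-one case; `update_inverse` and its arguments are illustrative names.

```python
import numpy as np

def update_inverse(M_inv, n_k, f_x, I_eps=1.0):
    """Sherman-Morrison form of (3.10) when M(X_{k+1}) = I_eps f(x) f(x)^T:
    returns M_{n_{k+1}}^{-1} from M_{n_k}^{-1} without any matrix inversion."""
    Mf = M_inv @ f_x
    denom = n_k + I_eps * float(f_x @ Mf)
    return (1.0 + 1.0 / n_k) * (M_inv - I_eps * np.outer(Mf, Mf) / denom)

# quick check against the direct update of M itself
rng = np.random.default_rng(0)
M = np.eye(3) + 0.1 * np.ones((3, 3)); n = 10; fx = rng.standard_normal(3)
M_new = M + (np.outer(fx, fx) - M) / (n + 1)
assert np.allclose(np.linalg.inv(M_new), update_inverse(np.linalg.inv(M), n, fx))
```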

4 Examples

We always take $k_0=5\,p$, $q=5/8$, $\gamma=1/10$ in Algorithm 1 (our simulations indicate that these choices are not critical); we also set $\epsilon_1=0$.

4.1 Example 1: quadratic regression with normal independent variables

Take $\mathscr{M}(x)=\mathbf{f}(x)\mathbf{f}^{\top}(x)$, with $\mathbf{f}(x)=(1,\,x,\,x^2)^{\top}$ and $\Phi(\mathbf{M})=\log\det(\mathbf{M})$, and let the $X_i$ be i.i.d. standard normal variables $\mathscr{N}(0,1)$. The D-optimal design for $x$ in an interval $[t,t']$ corresponds to $\xi^*=(1/3)\left(\delta_t+\delta_{(t+t')/2}+\delta_{t'}\right)$. In the data thinning problem, the optimal solution corresponds to the selection of $X_i$ in the union of three intervals; that is, with the notation of Section 2, $\mathscr{X}_\alpha^*=(-\infty,-a]\cup[-b,b]\cup[a,\infty)$. The values of $a$ and $b$ are obtained by solving the pair of equations $\int_0^b\varphi(x)\,\mathrm{d}x+\int_a^\infty\varphi(x)\,\mathrm{d}x=\alpha/2$ and $\operatorname{trace}[\mathbf{M}^{-1}(\xi)\mathscr{M}(a)]=\operatorname{trace}[\mathbf{M}^{-1}(\xi)\mathscr{M}(b)]$, with $\varphi$ the standard normal density and $\mathbf{M}(\xi)=[\int_{-\infty}^{-a}\mathscr{M}(x)\,\varphi(x)\,\mathrm{d}x+\int_{-b}^{b}\mathscr{M}(x)\,\varphi(x)\,\mathrm{d}x+\int_{a}^{\infty}\mathscr{M}(x)\,\varphi(x)\,\mathrm{d}x]/\alpha$.
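For reference, a small numerical sketch (ours, assuming SciPy is available) that solves this pair of equations by combining quadrature with a root finder; it reproduces the values of $a$ and $b$ reported below.

```python
import numpy as np
from scipy import integrate, optimize, stats

alpha = 0.5
phi = stats.norm.pdf
f = lambda x: np.array([1.0, x, x * x])

def M_xi(a, b):
    # information matrix of mu restricted to (-inf,-a] U [-b,b] U [a,inf), renormalized
    def sym_int(g):   # integral of an even function g over the symmetric region
        inner, _ = integrate.quad(lambda x: g(x) * phi(x), -b, b)
        tail, _ = integrate.quad(lambda x: g(x) * phi(x), a, np.inf)
        return inner + 2.0 * tail
    mass = sym_int(lambda x: 1.0)
    m2, m4 = sym_int(lambda x: x**2) / mass, sym_int(lambda x: x**4) / mass
    return np.array([[1.0, 0.0, m2], [0.0, m2, 0.0], [m2, 0.0, m4]])

def equations(v):
    a, b = v
    mass_eq = (1.0 - stats.norm.cdf(a)) + (stats.norm.cdf(b) - 0.5) - alpha / 2.0
    Minv = np.linalg.inv(M_xi(a, b))
    deriv_eq = f(a) @ Minv @ f(a) - f(b) @ Minv @ f(b)   # equal directional derivatives at a and b
    return [mass_eq, deriv_eq]

a, b = optimize.fsolve(equations, x0=[1.0, 0.3])
print(a, b)   # about 1.028 and 0.248 for alpha = 1/2
```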

We set the horizon $N$ at $100\,000$ and consider the two cases $\alpha=1/2$ and $\alpha=1/10$. In each case we keep $\alpha$ constant but apply the rule of Remark 3.1-(iv) (truncation/forced selection) to select exactly $n=50\,000$ and $n=10\,000$ design points, respectively. For $\alpha=1/2$, we have $a\simeq 1.0280$, $b\simeq 0.2482$, and $\Phi_\alpha^*=\Phi(\mathbf{M}_\alpha^*)=\max_{\mathbf{M}\in\mathbb{M}(\alpha)}\Phi(\mathbf{M})\simeq 1.6354$, $C_{1-\alpha}(\mathbf{M}_\alpha^*)\simeq-1.2470$; when $\alpha=1/10$, we have $a\simeq 1.8842$, $b\simeq 0.0507$, and $\Phi_\alpha^*\simeq 3.2963$, $C_{1-\alpha}(\mathbf{M}_\alpha^*)\simeq-0.8513$. The figures below present results obtained for one simulation (i.e., one random set $\mathscr{X}_N$), but they are rather typical in the sense that different $\mathscr{X}_N$ yield similar behaviors.

Figure 1 shows a smoothed histogram (Epanechnikov kernel, bandwidth equal to $1/1000$ of the range of the $X_i$ in $\mathscr{X}_N$) of the design points selected by Algorithm 1, for $\alpha=1/2$ (left) and $\alpha=1/10$ (right). There is good agreement with the theoretical optimal density, which corresponds to a truncation of the normal density at the values indicated by the vertical dotted lines.

Figure 1: Smoothed histogram of the $X_i$ selected by Algorithm 1; the vertical dotted lines indicate the positions of $-a,-b,b,a$ that define the set $\mathscr{X}_\alpha^*=(-\infty,-a]\cup[-b,b]\cup[a,\infty)$ where $\xi_\alpha^*=\mu/\alpha$; $N=100\,000$. Left: $\alpha=1/2$ ($n=50\,000$); Right: $\alpha=1/10$ ($n=10\,000$).

Figure 2 presents the evolution of $\Phi(\mathbf{M}_{n_k})$ as a function of $k$, together with the optimal value $\Phi_\alpha^*$ (horizontal line), for the two choices of $\alpha$ considered (the two panels show some similarity since the same set $\mathscr{X}_N$ is used for both). Convergence of $\Phi(\mathbf{M}_{n_k})$ to $\Phi_\alpha^*$ is fast in both cases; the presence of steps in the evolution of $\Phi(\mathbf{M}_{n_k})$, more visible in the right panel, is due to long subsequences of consecutively rejected samples.

Figure 2: Evolution of $\Phi(\mathbf{M}_{n_k})$ obtained with Algorithm 1 as a function of $k$ (log scale); the horizontal line indicates the optimal value $\Phi_\alpha^*$; $N=100\,000$. Left: $\alpha=1/2$ ($n=50\,000$); Right: $\alpha=1/10$ ($n=10\,000$).

Figure 3 shows the behavior of the final directional derivative $F_\Phi[\mathbf{M}_{n_N},\mathscr{M}(x)]$, after observation of all $X_i$ in $\mathscr{X}_N$, together with the value of its estimated quantile $\widehat{C}_N$ (horizontal solid line). The theoretical value $C_{1-\alpha}(\mathbf{M}_\alpha^*)$ (horizontal dashed line) and the values $-a,-b,b,a$ where $F_\Phi[\mathbf{M}_\alpha^*,\mathscr{M}(x)]=C_{1-\alpha}(\mathbf{M}_\alpha^*)$ (vertical dashed lines) are also shown ($\widehat{C}_N$ and $C_{1-\alpha}(\mathbf{M}_\alpha^*)$ are indistinguishable in the right panel). Although the figure indicates that $F_\Phi[\mathbf{M}_{n_N},\mathscr{M}(x)]$ differs significantly from $F_\Phi[\mathbf{M}_\alpha^*,\mathscr{M}(x)]$, they are close enough to allow selection of the most informative $X_i$, as illustrated by Figures 1 and 2.

Figure 3: $F_\Phi[\mathbf{M}_{n_N},\mathscr{M}(x)]=\operatorname{trace}[\mathbf{M}_{n_N}^{-1}\,\mathscr{M}(x)]-3$ as a function of $x$ (solid line); the horizontal solid (respectively, dashed) line indicates the value of $\widehat{C}_N$ (respectively, $C_{1-\alpha}(\mathbf{M}_\alpha^*)$); the vertical lines indicate the positions of $-a,-b,b,a$ where $F_\Phi[\mathbf{M}_\alpha^*,\mathscr{M}(x)]=C_{1-\alpha}(\mathbf{M}_\alpha^*)$; $N=100\,000$. Left: $\alpha=1/2$ ($n=50\,000$); Right: $\alpha=1/10$ ($n=10\,000$).

Figure 4 shows $\|\mathbf{M}_{n_k}-\mathbf{M}_\alpha^*\|$ (Frobenius norm) as a function of $k$ (log scale), averaged over $1\,000$ independent repetitions with random samples $\mathscr{X}_N$ of size $N=10\,000$, for $\alpha=1/2$. It suggests that $\|\mathbf{M}_{n_k}-\mathbf{M}_\alpha^*\|=\mathcal{O}(1/\sqrt{k})$ for large $k$, although the conditions in (Konda and Tsitsiklis, 2004) are not satisfied since the scheme we consider is nonlinear. This convergence rate is significantly faster than the rate suggested by Dalal et al. (2018). These investigations require further developments and will be pursued elsewhere.

Figure 4: Evolution of $\log_{10}\|\mathbf{M}_{n_k}-\mathbf{M}_\alpha^*\|$ as a function of $\log_{10}k$ (values averaged over 1\,000 random samples $\mathscr{X}_N$); the dashed line has slope $-1/2$ ($\alpha=1/2$: $n=5\,000$, $N=10\,000$).

4.2 Example 2: multilinear regression with normal independent variables

Take $\mathscr{M}(X)=XX^{\top}$, with $X=(x_1,\,x_2,\ldots,x_d)^{\top}$, $d>1$, and $\Phi(\mathbf{M})=\log\det(\mathbf{M})$, the vectors $X_i$ being i.i.d. $\mathscr{N}(\boldsymbol{0},\mathbf{I}_d)$ (so that $p=d$). Denote by $\varphi(\mathbf{x})=(2\pi)^{-d/2}\exp(-\|\mathbf{x}\|^2/2)$ the probability density of $X$. For symmetry reasons, for any $\alpha\in(0,1)$ the optimal (normalized) information matrix is $\mathbf{M}_\alpha^*=\rho_\alpha\,\mathbf{I}_d$, with $\Phi(\mathbf{M}_\alpha^*)=d\,\log\rho_\alpha$, where

\begin{align*}
\rho_\alpha=\frac{1}{\alpha}\int_{\|\mathbf{x}\|\geq R_\alpha}x_1^2\,\varphi(\mathbf{x})\,\mathrm{d}\mathbf{x}
&=\frac{1}{d\,\alpha}\int_{\|\mathbf{x}\|\geq R_\alpha}\|\mathbf{x}\|^2\,\varphi(\mathbf{x})\,\mathrm{d}\mathbf{x}\\
&=\frac{1}{d\,\alpha}\int_{r\geq R_\alpha}r^2\,\frac{1}{(2\pi)^{d/2}}\,\exp(-r^2/2)\,S_d(1)\,r^{d-1}\,\mathrm{d}r\,,
\end{align*}

with $S_d(1)=2\,\pi^{d/2}/\Gamma(d/2)$ the surface area of the $d$-dimensional unit sphere, and $R_\alpha$ the solution of

\[
\alpha=\int_{\|\mathbf{x}\|\geq R_\alpha}\varphi(\mathbf{x})\,\mathrm{d}\mathbf{x}=\int_{r\geq R_\alpha}\frac{1}{(2\pi)^{d/2}}\,\exp(-r^2/2)\,S_d(1)\,r^{d-1}\,\mathrm{d}r\,.
\]

Since $F_\Phi[\mathbf{M},\mathscr{M}(X)]=\operatorname{trace}[\mathbf{M}^{-1}\mathscr{M}(X)]-d$, we get $F_\Phi[\mathbf{M}_\alpha^*,\mathscr{M}(X)]=\|X\|^2/\rho_\alpha-d$, $C_{1-\alpha}(\mathbf{M}_\alpha^*)=R_\alpha^2/\rho_\alpha-d\leq 0$, and $\Phi_\alpha^*=\Phi(\mathbf{M}_\alpha^*)$ is differentiable with respect to $\alpha$, with $\mathrm{d}\Phi_\alpha^*/\mathrm{d}\alpha=C_{1-\alpha}(\mathbf{M}_\alpha^*)/\alpha$; see Pronzato (2004, Th. 4). Closed-form expressions are available for $d=2$, with $R_\alpha=\sqrt{-2\log\alpha}$ and $\rho_\alpha=1-\log\alpha$; $R_\alpha$ and $\rho_\alpha$ can easily be computed numerically for any $d>2$ and $\alpha\in(0,1)$. One may notice that, from a result by Harman (2004), the design matrix $\mathbf{M}_\alpha^*$ is optimal for any other orthogonally invariant criterion $\Phi$.
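A short numerical sketch (ours, assuming SciPy) computes $R_\alpha$ and $\rho_\alpha$ for any $d$, using the fact that $\|X\|^2$ follows a $\chi^2_d$ distribution; it recovers the closed form for $d=2$.

```python
import numpy as np
from scipy.stats import chi2

def R_rho(d, alpha):
    R2 = chi2.ppf(1.0 - alpha, df=d)        # alpha = P(||X||^2 >= R_alpha^2)
    # E[||X||^2 ; ||X||^2 >= c] = d * P(chi^2_{d+2} >= c), hence
    rho = chi2.sf(R2, df=d + 2) / alpha
    return np.sqrt(R2), rho

print(R_rho(2, 0.1))    # matches sqrt(-2 log 0.1) ~ 2.146 and 1 - log 0.1 ~ 3.303
print(R_rho(3, 0.001))  # the case d = 3, alpha = 1/1000 used for Figure 5
```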

For the linear model with intercept, such that $\mathscr{M}'(X)=\mathbf{f}(X)\mathbf{f}^{\top}(X)$ with $\mathbf{f}(X)=[1,\ X^{\top}]^{\top}$, the optimal matrix is

\[
{\mathbf{M}'}_\alpha^*=\left(\begin{array}{cc}1&\boldsymbol{0}^{\top}\\ \boldsymbol{0}&\mathbf{M}_\alpha^*\end{array}\right)
\]

with $\mathbf{M}_\alpha^*=\rho_\alpha\,\mathbf{I}_d$ the optimal matrix for the model without intercept. The same design is thus optimal for both models. Also, when the $X_i$ are i.i.d. $\mathscr{N}(\boldsymbol{0},\boldsymbol{\Sigma})$, the optimal matrix $\mathbf{M}_{\Sigma,\alpha}^*$ for $\Phi(\cdot)=\log\det(\cdot)$ simply equals $\boldsymbol{\Sigma}^{1/2}\,\mathbf{M}_\alpha^*\,\boldsymbol{\Sigma}^{1/2}$.

Again, we present results obtained for one random set $\mathscr{X}_N$. Figure 5 shows the evolution of $\Phi(\mathbf{M}_{n_k})$ as a function of $k$ for $d=3$ with $\alpha=1/1\,000$ and $N=100\,000$ when we want to select exactly $100$ points: the blue dashed line corresponds to the combination of truncation and forced selection; the red solid line corresponds to the adaptation of $\alpha$ according to $\alpha_k=(n-n_k)/(N-k)$; see Remark 3.1-(iv). The final values, for $k=N$, are indicated by a triangle and a star, respectively; we only show the evolution of $\Phi(\mathbf{M}_{n_k})$ for $k$ between 10\,000 and 100\,000 since the curves coincide for smaller $k$ (they are based on the same $\mathscr{X}_N$). In the first case, the late forced selection of uninformative $X_i$ yields a significant decrease of $\Phi(\mathbf{M}_{n_k})$, whereas the adaptation of $\alpha$ anticipates the need to be less selective in order to reach the target number $n$ of selected points.

Figure 5: Evolution of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) obtained with Algorithm 1 as a function of kk (log scale): d=3d=3, N=100 000N=100\,000, exactly n=100n=100 points are collected using truncation/forced selection (blue dashed line and \triangledown) or adaptation of α\alpha (red solid line and \bigstar); see Remark 3.1-(iv).

Figure 2 has illustrated the convergence of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) to Φα\Phi_{\alpha}^{*} for a fixed α\alpha as kk\rightarrow\infty, but in fact what really matters is that nkn_{k} tends to infinity: indeed, Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) does not converge to Φα\Phi_{\alpha}^{*} if we fix nk=nn_{k}=n and let kk tend to infinity, so that α=n/k\alpha=n/k tends to zero (see also Section 5.1). This is illustrated on the left panel of Figure 6, where d=25d=25 and, from left to right, α\alpha equals 0.5 (magenta dotted line), 0.1, 0.05 and 0.01 (red solid line). Since the optimal value Φα\Phi_{\alpha}^{*} depends on α\alpha, here we present the evolution with kk of the D-efficiency [det(𝐌nk)/det(𝐌α)]1/p=exp[(Φ(𝐌nk)Φα)/d][\det(\mathbf{M}_{n_{k}})/\det(\mathbf{M}_{\alpha}^{*})]^{1/p}=\exp[(\Phi(\mathbf{M}_{n_{k}})-\Phi_{\alpha}^{*})/d]. The right panel is for fixed α=0.1\alpha=0.1 and varying dd, with, from left to right, d=5d=5 (red solid line), 10, 20, 30 and 50 (cyan solid line). As one may expect, performance (slightly) deteriorates as dd increases due to the increasing variability of Z𝐌(X)Z_{\mathbf{M}}(X), with 𝗏𝖺𝗋[Z𝐌(X)]=𝗏𝖺𝗋[X𝐌X]=2trace(𝐌2)\mathsf{var}[Z_{\mathbf{M}}(X)]=\mathsf{var}[X^{\top}\mathbf{M}X]=2\,\operatorname{trace}(\mathbf{M}^{2}).

Figure 6: Evolution of D-efficiency of 𝐌nk\mathbf{M}_{n_{k}} obtained with Algorithm 1 as a function of kk (log scale); the horizontal line indicates the optimal value 1. Left: d=25d=25 and α=0.5\alpha=0.5 (magenta dotted line), 0.1 (black), 0.05 (blue) and 0.01 (red solid line). Right: α=0.1\alpha=0.1 and d=5d=5 (red solid line), 10 (blue), 20 (black), 30 (magenta) and 50 (cyan solid line).

4.3 Example 3: processing a non i.i.d. sequence

When the design points XiX_{i} are not from an i.i.d. sequence, Algorithm 1 cannot be used directly and some preprocessing is required. When storage of the whole sequence 𝒳N{\mathscr{X}}_{N} is possible, a random permutation can be applied to 𝒳N{\mathscr{X}}_{N} before using Algorithm 1. When NN is too large for that, for instance in the context of data streaming, and the sequence possesses a structured time-dependence, one may try to identify the dependence model through time-series analysis and use forecasting to decide which design points should be selected. The data thinning mechanism is then totally dependent on the model of the sequence, and the investigation of the techniques to be used is beyond the scope of this paper. Below we present examples where a simple scrambling method, Algorithm 2, is applied to the sequence prior to selection by Algorithm 1: its output sequence X~k\widetilde{X}_{k} is used as input for Algorithm 1. We do not study the properties of this preprocessing in conjunction with the convergence properties of Algorithm 1, which would require further developments.

  • Algorithm 2: random scrambling in a buffer.

  • 1)

    Initialization: choose the buffer size BB, set k=1k=1 and 𝒳(1)={X1,,XB}{\mathscr{X}}^{(1)}=\{X_{1},\ldots,X_{B}\}.

  • 2)

    Draw X~k\widetilde{X}_{k} by uniform sampling within 𝒳(k){\mathscr{X}}^{(k)}.

  • 3)

    Set 𝒳(k+1)=𝒳(k){X~k}{XB+k}{\mathscr{X}}^{(k+1)}={\mathscr{X}}^{(k)}\setminus\{\widetilde{X}_{k}\}\cup\{X_{B+k}\}, kk+1k\leftarrow k+1, return to Step 2.
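A minimal implementation sketch of Algorithm 2 is given below (in Python, as an illustration only); it is written as a generator whose output can be fed directly to Algorithm 1. The handling of the end of a finite sequence, where the buffer is simply emptied by uniform draws, is our own convention and is not specified in the algorithm above.

```python
# Sketch of Algorithm 2: random scrambling in a buffer of size B.
import random

def buffer_scramble(stream, B, seed=0):
    rng = random.Random(seed)
    stream = iter(stream)
    buffer = [next(stream) for _ in range(B)]        # Step 1: fill the buffer
    while buffer:
        j = rng.randrange(len(buffer))               # Step 2: uniform draw within the buffer
        yield buffer[j]
        try:
            buffer[j] = next(stream)                 # Step 3: replace the drawn point by the next one
        except StopIteration:
            buffer.pop(j)                            # end of a finite sequence: flush the buffer

# Example: scramble the monotone sequence X_i = i/N of Section 4.3 with B = alpha*N.
N, alpha = 100_000, 0.1
scrambled = buffer_scramble((i / N for i in range(1, N + 1)), B=int(alpha * N))
print([round(next(scrambled), 4) for _ in range(5)])
```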

Direct calculation shows that the probability that X~k\widetilde{X}_{k} equals XiX_{i} is

𝖯𝗋𝗈𝖻{X~k=Xi}={1B(11B)k1 for 1iB1B(11B)k1+Bi for B+1iB+k10 for B+k1<i\displaystyle\mathsf{Prob}\{\widetilde{X}_{k}=X_{i}\}=\left\{\begin{array}[]{ll}\frac{1}{B}\,\left(1-\frac{1}{B}\right)^{k-1}&\mbox{ for }1\leq i\leq B\\ \frac{1}{B}\,\left(1-\frac{1}{B}\right)^{k-1+B-i}&\mbox{ for }B+1\leq i\leq B+k-1\\ 0&\mbox{ for }B+k-1<i\end{array}\right.

showing the limits of randomization via Algorithm 2 (in particular, the first points of the sequence XiX_{i} will tend to appear first among the X~k\widetilde{X}_{k}). However, the method gives satisfactory results when the buffer size BB is large enough, its performance improving as BB increases.

As an illustration, we consider the same quadratic regression model as in Example 1, with Φ(𝐌)=logdet(𝐌)\Phi(\mathbf{M})=\log\det(\mathbf{M}), in the extreme case where Xi=x(ti)X_{i}=x(t_{i}) with x(t)x(t) a simple function of tt.

First, we consider the extremely unfavorable situation where x(t)=tx(t)=t. When tt is uniformly distributed on 𝒯=[0,T]{\mathscr{T}}=[0,T], the optimal design ξα\xi_{\alpha}^{*} selects all points associated with tt in [0,ta][tb,Ttb][Tta,T][0,t_{a}]\cup[t_{b},T-t_{b}]\cup[T-t_{a},T], for some ta<tbt_{a}<t_{b} in 𝒯{\mathscr{T}} satisfying 2ta+T2tb=αT2\,t_{a}+T-2\,t_{b}=\alpha\,T. For α=1/10\alpha=1/10, we get ta/T0.03227t_{a}/T\simeq 0.03227 and tb/T0.48227t_{b}/T\simeq 0.48227. The horizontal black line in Figure 7 indicates the optimal value Φα\Phi_{\alpha}^{*} when T=1T=1. The blue dotted line shows Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) when Algorithm 1 is directly applied to the points Xi=i/NX_{i}=i/N, i=1,,N=100 000i=1,\ldots,N=100\,000. The red solid line corresponds to the case where randomization via Algorithm 2, with buffer size B=αN=10 000B=\alpha N=10\,000, is applied first; the dotted curve in magenta is for B=3αNB=3\,\alpha N. The positive effects of randomization through Algorithm 2 and the influence of the buffer size are visible in the figure. Here, the monotonicity of the XiX_{i}, which inhibits early exploration of their range of variation, prevents convergence to the optimum.

Figure 7: Evolution of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) obtained with Algorithm 1 as a function of kk (log scale), the horizontal line indicates the optimal value Φα\Phi_{\alpha}^{*}; N=100 000N=100\,000, α=1/10\alpha=1/10; direct processing of the points Xi=i/NX_{i}=i/N (dotted blue line), preprocessing with Algorithm 2 with B=αNB=\alpha N (red solid line) and B=3αNB=3\,\alpha N (magenta dotted line).

We now consider the more favorable case where Xi=sin(2πνi/N)X_{i}=\sin(2\pi\nu i/N), i=1,,N=100 000i=1,\ldots,N=100\,000, with ν=5\nu=5. The left panel of Figure 8 shows the same information as Figure 7, when Algorithm 1 is applied directly to the XiX_{i} (blue dotted line) and after preprocessing with Algorithm 2 with B=αNB=\alpha N (red line) and B=αN/10B=\alpha N/10 (magenta dotted line). The early exploration of the range of variability of the XiX_{i}, possible here thanks to the periodicity of x(t)x(t), makes the randomization through Algorithm 2 efficient enough to allow Algorithm 1 to behave correctly when B=αNB=\alpha N (red line). The situation improves when BB is increased, but naturally deteriorates if BB is too small (magenta dotted line). The right panel shows the points X~k\widetilde{X}_{k} produced by Algorithm 2 (with B=αN=10 000B=\alpha N=10\,000) which are selected by Algorithm 1. The effect of randomization is visible. For k<5 000k<5\,000 all points in the buffer are in the interval [sin(2πν(k+B)/N),1][1,1][\sin(2\pi\nu(k+B)/N),1]\subset[-1,1] and the points selected by Algorithm 1 are near the end points or the center of this moving interval. For larger kk, randomization is strong enough to maintain the presence of suitable candidates in the buffer for selection by Algorithm 1.

Figure 8: Left: Evolution of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) obtained with Algorithm 1 as a function of kk (log scale), the horizontal line indicates the optimal value Φα\Phi_{\alpha}^{*}; N=100 000N=100\,000, α=1/10\alpha=1/10; direct processing of the points Xi=sin(2πνi/N)X_{i}=\sin(2\pi\nu i/N) (dotted blue line), preprocessing with Algorithm 2 with B=αNB=\alpha N (red solid line) and B=αN/10B=\alpha N/10 (magenta dotted line). Right: XiX_{i}, i=1,,N=100 000i=1,\ldots,N=100\,000 (black) and points X~k\widetilde{X}_{k} produced by Algorithm 2 with B=αNB=\alpha N which are selected by Algorithm 1 (red).

4.4 Examples with Z𝐌(x)Z_{\mathbf{M}}(x) constant on subsets of positive measure

Here we consider situations where Hμ,M is violated due to the existence of subsets of 𝒳{\mathscr{X}} of positive measure on which Z𝐌(x)Z_{\mathbf{M}}(x) is constant. The model is the same as in Section 4.2, with X=(x1,x2,,xd)X=(x_{1},\ x_{2},\ldots,x_{d})^{\top}, (X)=XX{\mathscr{M}}(X)=XX^{\top} and Φ(𝐌)=logdet(𝐌)\Phi(\mathbf{M})=\log\det(\mathbf{M}).

4.4.1 Example 4: μ\mu has discrete components

This is Example 11 in (Pronzato,, 2006), where d=2d=2, μ=(1/2)μ𝒩+(1/2)μd\mu=(1/2)\,\mu_{\mathscr{N}}+(1/2)\,\mu_{d}, with μ𝒩\mu_{\mathscr{N}} corresponding to the standard normal distribution in dimension two and μd\mu_{d} the discrete measure that puts weight 1/41/4 at each one of the points (±1,±1)(\pm 1,\pm 1). Denote by (r){\mathscr{B}}(r) the closed ball centered at the origin 𝟎\boldsymbol{0} with radius rr, by μ[r]\mu[r] the measure that coincides with μ\mu on the complement (r)¯\overline{{\mathscr{B}}(r)} of this ball and is zero on (r){\mathscr{B}}(r), and let 𝖾=exp(1)\mathsf{e}=\exp(1). The optimal matrix is 𝐌α=𝐌(ξα)\mathbf{M}_{\alpha}^{*}=\mathbf{M}(\xi_{\alpha}^{*}), with ξα\xi_{\alpha}^{*} the probability measure defined by:

ξα={1αμ[2log(2α)] if 0<α12𝖾,1αμ[2]+1α[α1/(2𝖾)]μd if 12𝖾<α12𝖾+12,1αμ[2log(2α1)] if 12𝖾+12<α<1,\displaystyle\xi_{\alpha}^{*}=\left\{\begin{array}[]{ll}\frac{1}{\alpha}\,\mu[\sqrt{-2\,\log(2\,\alpha)}]&\mbox{ if }0<\alpha\leq\frac{1}{2\mathsf{e}}\,,\\ \frac{1}{\alpha}\,\mu[\sqrt{2}]+\frac{1}{\alpha}\,[\alpha-1/(2\,\mathsf{e})]\mu_{d}&\mbox{ if }\frac{1}{2\mathsf{e}}<\alpha\leq\frac{1}{2\mathsf{e}}+\frac{1}{2}\,,\\ \frac{1}{\alpha}\,\mu[\sqrt{-2\,\log(2\,\alpha-1)}]&\mbox{ if }\frac{1}{2\mathsf{e}}+\frac{1}{2}<\alpha<1\,,\end{array}\right.

with associated Φ\Phi-values

Φ(𝐌α)={2log[1log(2α)] if 0<α12𝖾,2log(1+12𝖾α) if 12𝖾<α12𝖾+12,2log(1(2α1)log(2α1)2α) if 12𝖾+12<α1.\displaystyle\Phi(\mathbf{M}_{\alpha}^{*})=\left\{\begin{array}[]{ll}2\,\log[1-\log(2\,\alpha)]&\mbox{ if }0<\alpha\leq\frac{1}{2\mathsf{e}}\,,\\ 2\,\log\left(1+\frac{1}{2\mathsf{e}\,\alpha}\right)&\mbox{ if }\frac{1}{2\mathsf{e}}<\alpha\leq\frac{1}{2\mathsf{e}}+\frac{1}{2}\,,\\ 2\,\log\left(1-\frac{(2\,\alpha-1)\,\log(2\,\alpha-1)}{2\,\alpha}\right)&\mbox{ if }\frac{1}{2\mathsf{e}}+\frac{1}{2}<\alpha\leq 1\,.\end{array}\right.

Figure 9 shows the evolution of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) as a function of kk for α=0.5\alpha=0.5 (left) and α=0.02\alpha=0.02 (right). Note that α<1/(2𝖾)\alpha<1/(2\mathsf{e}) in the second case, whereas 1/(2𝖾)<α1/(2𝖾)+1/21/(2\mathsf{e})<\alpha\leq 1/(2\mathsf{e})+1/2 in the first one, so that in the first case ξα\xi_{\alpha}^{*} is neither zero nor μ/α\mu/\alpha on the four points (±1,±1)(\pm 1,\pm 1). Figure 9 shows that Algorithm 1 nevertheless behaves satisfactorily in both cases.

Figure 9: Evolution of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) obtained with Algorithm 1 as a function of kk (log scale) when d=2d=2 and μ=(1/2)μ𝒩+(1/2)μd\mu=(1/2)\,\mu_{\mathscr{N}}+(1/2)\,\mu_{d} with α=0.5\alpha=0.5 (left) and α=0.02\alpha=0.02 (right); the horizontal line indicates the optimal value Φα\Phi_{\alpha}^{*}.

4.4.2 Example 5: the distribution of Z𝐌α(X)Z_{\mathbf{M}_{\alpha}^{*}}(X) has discrete components

Let U[𝒮d(𝟎,r)]U[{\mathscr{S}}_{d}(\boldsymbol{0},r)] denote the uniform probability measure on the dd-dimensional sphere 𝒮d(𝟎,r){\mathscr{S}}_{d}(\boldsymbol{0},r) with center 𝟎\boldsymbol{0} and radius rr. The probability measure of the XiX_{i} is μ=(1/3)i=13U[𝒮d(𝟎,ri)]\mu=(1/3)\sum_{i=1}^{3}U[{\mathscr{S}}_{d}(\boldsymbol{0},r_{i})], the mixture of distributions on three nested spheres with radii r1>r2>r3>0r_{1}>r_{2}>r_{3}>0. The optimal bounded measure is

ξα={U[𝒮d(𝟎,r1)]if 0<α13,13αU[𝒮d(𝟎,r1)]+α1/3αU[𝒮d(𝟎,r2)]if 13<α23,13α{U[𝒮d(𝟎,r1)]+U[𝒮d(𝟎,r2)]}+α2/3αU[𝒮d(𝟎,r3)]if 23<α<1,\displaystyle\xi_{\alpha}^{*}=\left\{\begin{array}[]{ll}U[{\mathscr{S}}_{d}(\boldsymbol{0},r_{1})]&\mbox{if }0<\alpha\leq\frac{1}{3}\,,\\ \frac{1}{3\alpha}\,U[{\mathscr{S}}_{d}(\boldsymbol{0},r_{1})]+\frac{\alpha-1/3}{\alpha}U[{\mathscr{S}}_{d}(\boldsymbol{0},r_{2})]&\mbox{if }\frac{1}{3}<\alpha\leq\frac{2}{3}\,,\\ \frac{1}{3\alpha}\,\left\{U[{\mathscr{S}}_{d}(\boldsymbol{0},r_{1})]+U[{\mathscr{S}}_{d}(\boldsymbol{0},r_{2})]\right\}+\frac{\alpha-2/3}{\alpha}U[{\mathscr{S}}_{d}(\boldsymbol{0},r_{3})]&\mbox{if }\frac{2}{3}<\alpha<1\,,\end{array}\right.

with associated Φ\Phi-values

Φ(𝐌α)={dlog(r12/d)if 0<α13,dlog(r12/3+(α1/3)r22αd)if 13<α23,dlog((r12+r22)/3+(α2/3)r32αd)if 23<α<1.\displaystyle\Phi(\mathbf{M}_{\alpha}^{*})=\left\{\begin{array}[]{ll}d\,\log(r_{1}^{2}/d)&\mbox{if }0<\alpha\leq\frac{1}{3}\,,\\ d\,\log\left(\frac{r_{1}^{2}/3+(\alpha-1/3)r_{2}^{2}}{\alpha d}\right)&\mbox{if }\frac{1}{3}<\alpha\leq\frac{2}{3}\,,\\ d\,\log\left(\frac{(r_{1}^{2}+r_{2}^{2})/3+(\alpha-2/3)r_{3}^{2}}{\alpha d}\right)&\mbox{if }\frac{2}{3}<\alpha<1\,.\end{array}\right.

Notice that for α(0,1/3)\alpha\in(0,1/3) (respectively, α(1/3,2/3)\alpha\in(1/3,2/3)) ξα0\xi_{\alpha}^{*}\neq 0 and ξαμ/α\xi_{\alpha}^{*}\neq\mu/\alpha on 𝒮d(𝟎,r1){\mathscr{S}}_{d}(\boldsymbol{0},r_{1}) (respectively, on 𝒮d(𝟎,r2){\mathscr{S}}_{d}(\boldsymbol{0},r_{2})) although μ\mu is atomless.

The left panel of Figure 10 gives the evolution with kk of the D-efficiency [det(𝐌nk)/det(𝐌α)]1/p=exp[(Φ(𝐌nk)Φα)/d][\det(\mathbf{M}_{n_{k}})/\det(\mathbf{M}_{\alpha}^{*})]^{1/p}=\exp[(\Phi(\mathbf{M}_{n_{k}})-\Phi_{\alpha}^{*})/d], for α=0.5\alpha=0.5 (red solid line) and α=0.2\alpha=0.2 (blue dashed line) when d=5d=5. The right panel shows the evolution of the ratio nk/kn_{k}/k for those two situations, with the limiting value α\alpha indicated by a horizontal line. Although assumption Hμ,M is violated, Algorithm 1 continues to perform satisfactorily.

Figure 10: Left: Evolution of D-efficiency of 𝐌nk\mathbf{M}_{n_{k}} obtained with Algorithm 1 as a function of kk (log scale) for α=0.5\alpha=0.5 (red solid line) and α=0.2\alpha=0.2 (blue dashed line); the horizontal line indicates the optimal value 1; d=5d=5. Right: evolution of the ratio nk/kn_{k}/k in the same simulations.

5 Comparison with other methods

5.1 Case nn fixed with large NN: comparison with an exchange method

The convergence of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) to Φα\Phi_{\alpha}^{*} in Algorithm 1 relies on the fact that nkn_{k} grows like 𝒪(αk){\mathcal{O}}(\alpha\,k) for some α>0\alpha>0; see Theorem 3.1. If the number nn of points to be selected is fixed, Algorithm 1 does not provide any performance guarantee when applied to a sequence of length NN\rightarrow\infty (the situation is different when p=1p=1, for which an asymptotically optimal construction is available; see Pronzato, (2001)). In that case, a method of the exchange type may look more promising, although large values of NN entail serious difficulties. Typically, the algorithm is initialized with an nn-point design chosen within 𝒳N{\mathscr{X}}_{N}, and at each iteration a temporarily selected XiX_{i} is replaced by a better point in 𝒳N{\mathscr{X}}_{N}. Fedorov's (1972) algorithm considers all n×(Nn)n\times(N-n) possible replacements at each iteration ((Nn)(N-n) instead of NN since we do not allow repetitions in the present context); its computational cost is prohibitive for large NN. The variants suggested by Cook and Nachtsheim, (1980), or the DETMAX algorithm of Mitchell, (1974), still require the maximization of a function g(Xj)g(X_{j}) with respect to Xj𝒳NX_{j}\in{\mathscr{X}}_{N} at each iteration, which remains infeasible for very large NN. Below, we consider a simplified version where all NN points are examined successively, and a replacement is accepted when it improves the current criterion value.

  • Algorithm 3: sequential exchange (nn fixed).

  • 1)

    Initialization: select X1,,XnX_{1},\ldots,X_{n}, set k=nk=n and 𝒳k={X1,,Xk}{\mathscr{X}}_{k}^{*}=\{X_{1},\ldots,X_{k}\}, compute 𝐌n,k=(1/k)i=1k(Xi)\mathbf{M}_{n,k}=(1/k)\,\sum_{i=1}^{k}{\mathscr{M}}(X_{i}) and Φ(𝐌n,k)\Phi(\mathbf{M}_{n,k}).

  • 2)

    Iteration k+1k+1: collect Xk+1X_{k+1}. If Xk+1𝒳kX_{k+1}\in{\mathscr{X}}_{k}^{*}, set Δ(k)(Xi,Xk+1)=0\Delta^{(k)}(X_{i^{*}},X_{k+1})=0; otherwise compute

    Δ(k)(Xi,Xk+1)=maxXi𝒳k[Φ{𝐌n,k+(1/n)[(Xk+1)(Xi)]}Φ(𝐌n,k)].\displaystyle\Delta^{(k)}(X_{i^{*}},X_{k+1})=\max_{X_{i}\in{\mathscr{X}}_{k}^{*}}\left[\Phi\{\mathbf{M}_{n,k}+(1/n)[{\mathscr{M}}(X_{k+1})-{\mathscr{M}}(X_{i})]\}-\Phi(\mathbf{M}_{n,k})\right]\,.

    If Δ(k)(Xi,Xk+1)>0\Delta^{(k)}(X_{i^{*}},X_{k+1})>0, set 𝒳k+1=𝒳k{Xi}Xk+1{\mathscr{X}}_{k+1}^{*}={\mathscr{X}}_{k}^{*}\setminus\{X_{i^{*}}\}\cup X_{k+1}, update 𝐌n,k\mathbf{M}_{n,k} into 𝐌n,k+1=𝐌n,k+(1/n)[(Xk+1)(Xi)]\mathbf{M}_{n,k+1}=\mathbf{M}_{n,k}+(1/n)[{\mathscr{M}}(X_{k+1})-{\mathscr{M}}(X_{i^{*}})], compute Φ(𝐌n,k+1)\Phi(\mathbf{M}_{n,k+1});

    otherwise, set 𝒳k+1=𝒳k{\mathscr{X}}_{k+1}^{*}={\mathscr{X}}_{k}^{*}, 𝐌n,k+1=𝐌n,k\mathbf{M}_{n,k+1}=\mathbf{M}_{n,k}.

  • 3)

    If k+1=Nk+1=N stop; otherwise, kk+1k\leftarrow k+1, return to Step 2.

Remark 5.1.

When (x){\mathscr{M}}(x) has rank one, with (x)=𝐟(x)𝐟(x){\mathscr{M}}(x)=\mathbf{f}(x)\mathbf{f}^{\top}(x) and Φ(𝐌)=logdet(𝐌)\Phi(\mathbf{M})=\log\det(\mathbf{M}) or Φ(𝐌)=det1/p(𝐌)\Phi(\mathbf{M})=\det^{1/p}(\mathbf{M}) (DD-optimal design), Δ(k)(Xi,Xk+1)>0\Delta^{(k)}(X_{i^{*}},X_{k+1})>0 is equivalent to

\displaystyle\mathbf{f}^{\top}(X_{k+1})\mathbf{M}_{n,k}^{-1}\mathbf{f}(X_{k+1})-\mathbf{f}^{\top}(X_{i^{*}})\mathbf{M}_{n,k}^{-1}\mathbf{f}(X_{i^{*}})+\delta^{(k)}(X_{i^{*}},X_{k+1})>0\,, (5.1)

where

\displaystyle\delta^{(k)}(X,X_{k+1})=\frac{[\mathbf{f}^{\top}(X_{k+1})\mathbf{M}_{n,k}^{-1}\mathbf{f}(X)]^{2}-[\mathbf{f}^{\top}(X_{k+1})\mathbf{M}_{n,k}^{-1}\mathbf{f}(X_{k+1})][\mathbf{f}^{\top}(X)\mathbf{M}_{n,k}^{-1}\mathbf{f}(X)]}{n}\,, (5.2)

see Fedorov, (1972, p. 164). As for Algorithm 1 (see Remark 3.1-(iii)), we may update 𝐌n,k1\mathbf{M}_{n,k}^{-1} instead of 𝐌n,k\mathbf{M}_{n,k} to avoid matrix inversions. For large enough nn, the term (5.2) is negligible and the condition is almost \mathbf{f}^{\top}(X_{k+1})\mathbf{M}_{n,k}^{-1}\mathbf{f}(X_{k+1})>\mathbf{f}^{\top}(X_{i^{*}})\mathbf{M}_{n,k}^{-1}\mathbf{f}(X_{i^{*}}); that is, FΦ[𝐌n,k,(Xk+1)]>FΦ[𝐌n,k,(Xi)]F_{\Phi}[\mathbf{M}_{n,k},{\mathscr{M}}(X_{k+1})]>F_{\Phi}[\mathbf{M}_{n,k},{\mathscr{M}}(X_{i^{*}})]. This is the condition we use in the example below. It does not guarantee in general that Φ(𝐌n,k+1)>Φ(𝐌n,k)\Phi(\mathbf{M}_{n,k+1})>\Phi(\mathbf{M}_{n,k}) (since δ(k)(Xi,Xk+1)0\delta^{(k)}(X_{i},X_{k+1})\leq 0 from the Cauchy–Schwarz inequality), but no significant difference was observed compared with the use of the exact condition (5.1). Algorithm 3 has complexity 𝒪(nd3N){\mathcal{O}}(nd^{3}\,N) in general (the additional factor nn compared with Algorithm 1 is due to the calculation of the maximum over all XiX_{i} in 𝒳k{\mathscr{X}}_{k}^{*} at Step 2).   \triangleleft
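For illustration, a minimal sketch of Algorithm 3 with the simplified acceptance condition of Remark 5.1 is given below (Python; D-optimality with rank-one M(x) = f(x) f(x)^T). It is only an illustration of the rule, not the code used for the experiments: for readability the inverse of M_{n,k} is recomputed after each accepted exchange rather than updated recursively, and the point removed is the one with the smallest value of f^T M^{-1} f.

```python
# Sketch of the simplified sequential exchange of Remark 5.1 (D-optimality, rank-one case).
import numpy as np

def sequential_exchange(F, n):
    """F: (N, p) array whose rows are f(X_i); returns the indices of the n selected points."""
    sel = list(range(n))                                        # Step 1: take the first n points
    M = F[sel].T @ F[sel] / n
    Minv = np.linalg.inv(M)
    for k in range(n, F.shape[0]):                              # Step 2: inspect the remaining points
        lev = np.einsum('ij,jl,il->i', F[sel], Minv, F[sel])    # f(X_i)^T M^{-1} f(X_i), i in sel
        i_star = int(np.argmin(lev))
        lev_new = F[k] @ Minv @ F[k]
        if lev_new > lev[i_star]:                               # simplified acceptance condition
            M += (np.outer(F[k], F[k]) - np.outer(F[sel[i_star]], F[sel[i_star]])) / n
            sel[i_star] = k
            Minv = np.linalg.inv(M)
    return sel

# Example: d = 3 linear regression without intercept, N = 10_000 points, n = 100 selected.
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 3))
idx = sequential_exchange(X, 100)
print(np.linalg.slogdet(X[idx].T @ X[idx] / 100)[1])            # log det M_{n,N}
```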

Neither Algorithm 1 with α=n/N\alpha=n/N and 𝐌n,k=𝐌nk\mathbf{M}_{n,k}=\mathbf{M}_{n_{k}} nor Algorithm 3 ensures that Φ(𝐌n,N)\Phi(\mathbf{M}_{n,N}) tends to Φα=Φ(𝐌α)\Phi_{\alpha}^{*}=\Phi(\mathbf{M}_{\alpha}^{*}) as NN\rightarrow\infty. Also, we can expect to have Φ(𝐌n,k)Φn/k\Phi(\mathbf{M}_{n,k})\lesssim\Phi_{n/k}^{*} for all kk with Algorithm 3, since under HΦ the matrix 𝐌n,k\mathbf{M}_{n,k}^{*} corresponding to the optimal selection of nn distinct points among 𝒳k{\mathscr{X}}_{k} satisfies 𝖤{Φ(𝐌n,k)}Φn/k\mathsf{E}\{\Phi(\mathbf{M}_{n,k}^{*})\}\leq\Phi_{n/k}^{*} for all kn>0k\geq n>0; see Pronzato, (2006, Lemma 3).

Example 6: nn fixed and NN large

We consider the same situation as in Example 2 (Section 4.2), with X=(x1,x2,,xd)X=(x_{1},\ x_{2},\ldots,x_{d})^{\top}, (X)=XX{\mathscr{M}}(X)=XX^{\top}, Φ(𝐌)=logdet(𝐌)\Phi(\mathbf{M})=\log\det(\mathbf{M}); the XiX_{i} are i.i.d. 𝒩(𝟎,𝐈d){\mathscr{N}}(\boldsymbol{0},\mathbf{I}_{d}), with p=d=3p=d=3. We still take k0=5pk_{0}=5\,p, q=5/8q=5/8, γ=1/10\gamma=1/10 in Algorithm 1. We have 𝖤{𝐌nk0}=𝐌(μ)\mathsf{E}\{\mathbf{M}_{n_{k_{0}}}\}=\mathbf{M}(\mu) in Algorithm 1, and, when nn is large enough, 𝐌n,n𝐌(μ)\mathbf{M}_{n,n}\simeq\mathbf{M}(\mu) at Step 1 of Algorithm 3, with 𝐌(μ)=𝐈d\mathbf{M}(\mu)=\mathbf{I}_{d} and therefore Φ(𝐌n,n)0\Phi(\mathbf{M}_{n,n})\simeq 0.

We consider two values of nn, n=100n=100 and n=1 000n=1\,000, with α=103\alpha=10^{-3} (that is, with N=100 000N=100\,000 and N=1 000 000N=1\,000\,000, respectively). Figure 11 shows the evolutions of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) (kk0k\geq k_{0}, Algorithm 1, red solid line) and Φ(𝐌n,k)\Phi(\mathbf{M}_{n,k}) (knk\geq n, Algorithm 3, blue dashed line) as functions of kk in those two cases (n=100n=100, left; n=1 000n=1\,000, right). In order to select nn points exactly, adaptation of α\alpha is used in Algorithm 1, see Remark 3.1-(iv). The value of nn is too small for Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) to approach Φα\Phi_{\alpha}^{*} (indicated by the horizontal black line) in the first case, whereas n=1 000n=1\,000 is large enough on the right panel; Algorithm 3 performs similarly in both cases and is superior to Algorithm 1 for n=100n=100; the magenta curve with triangles shows Φn/k\Phi_{n/k}^{*}, knk\geq n, with Φn/kΦ(𝐌n,k)\Phi_{n/k}^{*}\gtrsim\Phi(\mathbf{M}_{n,k}) for all kk, as expected.   \triangleleft

In case it is possible to store the NN points XiX_{i}, we can replay both algorithms on the same data set in order to increase the final value of Φ\Phi for the sample selected. For Algorithm 3, we can simply run the algorithm again on a set 𝒳N(2){\mathscr{X}}_{N}^{(2)} — starting with k=1k=1 at Step 1 since nn points have already been selected — with 𝒳N(2)=𝒳N{\mathscr{X}}_{N}^{(2)}={\mathscr{X}}_{N} or corresponding to a random permutation of it. Series of runs on sets 𝒳N(2),𝒳N(3),{\mathscr{X}}_{N}^{(2)},{\mathscr{X}}_{N}^{(3)},\ldots can be concatenated: the fact that Φ\Phi can only increase implies convergence for an infinite sequence of runs, but generally to a local maximum only; see the discussion in (Cook and Nachtsheim,, 1980, Sect. 2.4). When applied to Example 6, this method was not able to improve the design obtained in the first run of Algorithm 3, with a similar behavior with or without permutations in the construction of the 𝒳N(i){\mathscr{X}}_{N}^{(i)}.

Algorithm 1 requires a more subtle modification since points are selected without replacement. First, we run Algorithm 1 with α\alpha fixed at n/Nn/N on a set 𝒳mN=𝒳N𝒳N(2)𝒳N(m){\mathscr{X}}_{mN}={\mathscr{X}}_{N}\cup{\mathscr{X}}_{N}^{(2)}\cup\cdots\cup{\mathscr{X}}_{N}^{(m)}, where the replications 𝒳N(i){\mathscr{X}}_{N}^{(i)} are all identical to 𝒳N{\mathscr{X}}_{N} or correspond to random permutations of it. The values of 𝐌nmN\mathbf{M}_{n_{mN}} and C^mN\widehat{C}_{mN} are then used in a second stage, where the NN points X1,,XNX_{1},\ldots,X_{N} in 𝒳N{\mathscr{X}}_{N} are inspected sequentially: starting at k=0k=0 and nk=0n_{k}=0, a new point Xk+1X_{k+1} is selected if nk<nn_{k}<n and FΦ[𝐌nmN,(Xk+1)]>C^mNF_{\Phi}[\mathbf{M}_{n_{mN}},{\mathscr{M}}(X_{k+1})]>\widehat{C}_{mN} (or if nnkNk+1n-n_{k}\geq N-k+1, see Remark 3.1-(iv)). The set 𝒳N{\mathscr{X}}_{N} is thus used m+1m+1 times in total. The idea is that for mm large enough, we can expect 𝐌nmN\mathbf{M}_{n_{mN}} to be close to 𝐌α\mathbf{M}_{\alpha}^{*} and C^mN\widehat{C}_{mN} to be close to the true quantile C1α(𝐌α)C_{1-\alpha}(\mathbf{M}_{\alpha}^{*}), whereas the optimal rule for selection is FΦ[𝐌α,(Xk+1)]>C1α(𝐌α)F_{\Phi}[\mathbf{M}_{\alpha}^{*},{\mathscr{M}}(X_{k+1})]>C_{1-\alpha}(\mathbf{M}_{\alpha}^{*}). Note that the quantile of the directional derivative is not estimated in this second phase, and updating of 𝐌nk\mathbf{M}_{n_{k}} is only used to follow the evolution of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) on plots.
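A sketch of this second phase is given below (Python), under the assumption that the first, multi-pass phase has already produced the matrix M_hat = M_{n_mN} and the threshold C_hat = C^_{mN}, and for the setting of Example 6 where M(x) = x x^T and Phi = log det, so that F_Phi[M, M(x)] = x^T M^{-1} x - p.

```python
# Sketch of the second phase: a single pass over the N stored points, using the matrix M_hat
# and the threshold C_hat obtained from the first (multi-pass) run of Algorithm 1.
import numpy as np

def second_phase(X, n, M_hat, C_hat):
    """X: (N, p) array of design points; returns the indices of the selected points."""
    N, p = X.shape
    Minv = np.linalg.inv(M_hat)
    selected = []
    for k in range(N):
        if len(selected) == n:
            break
        if n - len(selected) >= N - k:                 # forced selection, cf. Remark 3.1-(iv)
            selected.append(k)
        elif X[k] @ Minv @ X[k] - p > C_hat:           # F_Phi[M_hat, M(X_k)] > C_hat
            selected.append(k)
    return selected
```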

Example 6 (continued)

The black dotted line in Figure 11 shows the evolution of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) as a function of kk in the second phase (for kk large enough to have Φ(𝐌nk)>0\Phi(\mathbf{M}_{n_{k}})>0): we have taken m=9m=9 for n=100n=100 (left), so that (m+1)N=1 000 000(m+1)N=1\,000\,000 points are used in total (the same NN points being used 10 times), and m=1m=1 for n=1 000n=1\,000 (right), with 2 000 0002\,000\,000 points used (the same points being used twice). Figure 12 shows the evolution of C^k\widehat{C}_{k} for k=1,,mNk=1,\ldots,mN, for n=100n=100, N=100 000N=100\,000, m=9m=9 (left), and n=1 000n=1\,000, N=1 000 000N=1\,000\,000, m=1m=1 (right); the horizontal black line indicates the value of C1α(𝐌α)C_{1-\alpha}(\mathbf{M}_{\alpha}^{*}). The left panel indicates that n=100n=100 is too small to estimate C1α(𝐌α)C_{1-\alpha}(\mathbf{M}_{\alpha}^{*}) correctly with Algorithm 1 (note that m=4m=4 would have been enough), which is consistent with the behavior of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) observed in Figure 11-left (red solid line). The right panel of Figure 12 shows that C^k\widehat{C}_{k} has converged before inspection of the 1 000 0001\,000\,000 points in 𝒳N{\mathscr{X}}_{N}, which explains the satisfactory behavior of Algorithm 1 in Figure 11-right. Notice the similarity between the left and right panels of Figure 12, due to the fact that the same value α=103\alpha=10^{-3} is used in both. Here the 𝒳N(i){\mathscr{X}}_{N}^{(i)} are constructed by random permutations of the points in 𝒳N{\mathscr{X}}_{N}, but the behavior is similar without permutations.

Figure 11: Evolutions of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) (kk0k\geq k_{0}, Algorithm 1, red solid line) Φ(𝐌n,k)\Phi(\mathbf{M}_{n,k}) (knk\geq n, Algorithm 3, blue dashed line), and Φn/k\Phi_{n/k}^{*} (magenta curve with triangles) as functions of kk; the horizontal black line corresponds to Φα\Phi_{\alpha}^{*}; the black dotted curve shows the evolution of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}}) as a function of kk when the selection is based on FΦ[𝐌nmN,(Xk+1)]>C^mNF_{\Phi}[\mathbf{M}_{n_{mN}},{\mathscr{M}}(X_{k+1})]>\widehat{C}_{mN}, with 𝐌nmN\mathbf{M}_{n_{mN}} and C^mN\widehat{C}_{mN} obtained with Algorithm 1 applied to 𝒳mN=𝒳N𝒳N(2)𝒳N(m){\mathscr{X}}_{mN}={\mathscr{X}}_{N}\cup{\mathscr{X}}_{N}^{(2)}\cup\cdots\cup{\mathscr{X}}_{N}^{(m)}. Left: n=100n=100, N=100 000N=100\,000, m=9m=9; Right: n=1 000n=1\,000, N=1 000 000N=1\,000\,000, m=1m=1.
Figure 12: Evolution of C^k\widehat{C}_{k} in Algorithm 1 when applied to 𝒳mN=𝒳N𝒳N(2)𝒳N(m){\mathscr{X}}_{mN}={\mathscr{X}}_{N}\cup{\mathscr{X}}_{N}^{(2)}\cup\cdots\cup{\mathscr{X}}_{N}^{(m)}, the horizontal black line corresponds to C1α(𝐌α)C_{1-\alpha}(\mathbf{M}_{\alpha}^{*}). Left: n=100n=100, N=100 000N=100\,000, m=9m=9; Right: n=1 000n=1\,000, N=1 000 000N=1\,000\,000, m=1m=1.

5.2 Comparison with IBOSS

IBOSS (Information-Based Optimal Subdata Selection, Wang et al., (2019)) is a selection procedure motivated by D-optimality and developed in the context of multilinear regression with intercept, where (X)=𝐟(X)𝐟(X){\mathscr{M}}(X)=\mathbf{f}(X)\mathbf{f}^{\top}(X) with 𝐟(X)=[1,X]\mathbf{f}(X)=[1,\ X^{\top}]^{\top}. All points XiX_{i} in 𝒳N{\mathscr{X}}_{N} are processed simultaneously: the dd coordinates of the XiX_{i} are examined successively; for each k=1,,dk=1,\ldots,d, the rr points with the largest kk-th coordinate and the rr points with the smallest kk-th coordinate are selected (and removed from 𝒳N{\mathscr{X}}_{N}), where r=n/(2d)r=n/(2d), possibly with suitable rounding, when exactly nn points have to be selected. The design selected is sensitive to the order in which coordinates are inspected. The necessity to find the largest and smallest coordinate values yields a complexity of 𝒪(dN){\mathcal{O}}(d\,N); parallelization with simultaneous sorting of each coordinate is possible. As for any design selection algorithm, the matrix 𝐌n,N\mathbf{M}_{n,N} obtained with IBOSS satisfies 𝖤{Φ(𝐌n,N)}Φn/N\mathsf{E}\{\Phi(\mathbf{M}_{n,N})\}\leq\Phi_{n/N}^{*} for all Nn>0N\geq n>0 (Pronzato,, 2006, Lemma 3). The asymptotic performance of IBOSS (the behavior of 𝐌n,N\mathbf{M}_{n,N} and 𝐌n,N1\mathbf{M}_{n,N}^{-1}) for nn fixed and NN tending to infinity is investigated in (Wang et al.,, 2019) for XX following a multivariate normal or lognormal distribution. The next property concerns the situation where nn is a fraction of NN, with NN\rightarrow\infty, and the components of XX are independent.
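A minimal sketch of the selection rule just described is as follows (Python; an illustration only, not the reference implementation of Wang et al., (2019); a full sort is used for each coordinate for simplicity, whereas partial selection would achieve the O(dN) complexity mentioned above).

```python
# Sketch of the IBOSS selection rule: for each coordinate in turn, keep the r points with the
# largest and the r points with the smallest value of that coordinate among the points still
# available (r = n/(2d), assumed >= 1; exact rounding to n points is not handled here).
import numpy as np

def iboss(X, n):
    """X: (N, d) array of covariates; n: target subsample size; returns selected indices."""
    N, d = X.shape
    r = n // (2 * d)
    available = np.ones(N, dtype=bool)
    selected = []
    for k in range(d):
        idx = np.where(available)[0]
        order = idx[np.argsort(X[idx, k])]                       # indices sorted by coordinate k
        chosen = np.concatenate([order[:r], order[order.size - r:]])
        selected.extend(chosen.tolist())
        available[chosen] = False
    return selected

rng = np.random.default_rng(1)
X = rng.standard_normal((100_000, 3))
print(len(iboss(X, 100)))                                        # 96 points (r = 16, d = 3)
```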

Theorem 5.1.

Suppose that the XiX_{i} are i.i.d. with μ\mu satisfying Hμ and, moreover, that their components {Xi}k\{X_{i}\}_{k} are independent, with φk\varphi_{k} the p.d.f. of {X1}k\{X_{1}\}_{k} for k=1,,dk=1,\ldots,d. Suppose, without any loss of generality, that coordinates are inspected in the order 1,,d1,\ldots,d. Then, for any α(0,1]\alpha\in(0,1], the matrix 𝐕n,N=(1/n)j=1nXijXij\mathbf{V}_{n,N}=(1/n)\sum_{j=1}^{n}X_{i_{j}}X_{i_{j}}^{\top} corresponding to the nn points XijX_{i_{j}} selected by IBOSS satisfies 𝐕n,N𝐕α𝖨𝖡𝖮𝖲𝖲\mathbf{V}_{n,N}\rightarrow\mathbf{V}_{\alpha}^{\mathsf{IBOSS}} a.s. when n=αNn=\lfloor\alpha N\rfloor and NN\rightarrow\infty, with

\displaystyle\{\mathbf{V}_{\alpha}^{\mathsf{IBOSS}}\}_{k,k} = \frac{1}{\alpha}\,\left[\mathsf{E}[\{X\}_{k}^{2}]-\pi_{k}\,s_{k}(\alpha)\right]\,,\ k=1,\ldots,d\,, (5.3)
\displaystyle\{\mathbf{V}_{\alpha}^{\mathsf{IBOSS}}\}_{k,k^{\prime}} = \frac{1}{\alpha}\,\left[\mathsf{E}[\{X\}_{k}]\,\mathsf{E}[\{X\}_{k^{\prime}}]-\frac{\pi_{k}\,\pi_{k^{\prime}}}{1-\alpha}\,m_{k}(\alpha)\,m_{k^{\prime}}(\alpha)\right]\,,\ k\neq k^{\prime}\,, (5.4)

where 𝖤[{X}k]=xφk(x)dx\mathsf{E}[\{X\}_{k}]=\int_{-\infty}^{\infty}x\,\varphi_{k}(x)\,\mbox{\rm d}x, 𝖤[{X}k2]=x2φk(x)dx\mathsf{E}[\{X\}_{k}^{2}]=\int_{-\infty}^{\infty}x^{2}\,\varphi_{k}(x)\,\mbox{\rm d}x, πk=(1α)[d(k1)α]/(dkα)\pi_{k}=(1-\alpha)[d-(k-1)\alpha]/(d-k\alpha),

sk(α)=qk(α2[d(k1)α])qk(1α2[d(k1)α])x2φk(x)dx and mk(α)=qk(α2[d(k1)α])qk(1α2[d(k1)α])xφk(x)dx,\displaystyle s_{k}(\alpha)=\int_{q_{k}\left(\frac{\alpha}{2[d-(k-1)\alpha]}\right)}^{q_{k}\left(1-\frac{\alpha}{2[d-(k-1)\alpha]}\right)}x^{2}\,\varphi_{k}(x)\,\mbox{\rm d}x\ \mbox{ and }\ m_{k}(\alpha)=\int_{q_{k}\left(\frac{\alpha}{2[d-(k-1)\alpha]}\right)}^{q_{k}\left(1-\frac{\alpha}{2[d-(k-1)\alpha]}\right)}x\,\varphi_{k}(x)\,\mbox{\rm d}x\,,

with qk()q_{k}(\cdot) the quantile function for φk\varphi_{k}, satisfying qk(t)φk(u)du=t\int_{-\infty}^{q_{k}(t)}\varphi_{k}(u)\,\mbox{\rm d}u=t for any t(0,1]t\in(0,1].

Proof. By construction, IBOSS asymptotically first selects all points such that {X}1\{X\}_{1} does not belong to 1=(q1[α/(2d)],q1[1α/(2d)]){\mathcal{I}}_{1}=(q_{1}[\alpha/(2d)],q_{1}[1-\alpha/(2d)]), then, among the remaining points, all those such that {X}22=(q2[α/(2d(1α/d))],q2[1α/(2d(1α/d))])\{X\}_{2}\not\in{\mathcal{I}}_{2}=(q_{2}[\alpha/(2d(1-\alpha/d))],q_{2}[1-\alpha/(2d(1-\alpha/d))]). By induction, all points such that {X}kk=(qk[α/(2[d(k1)α])],qk[1α/(2[d(k1)α])])\{X\}_{k}\not\in{\mathcal{I}}_{k}=(q_{k}[\alpha/(2[d-(k-1)\alpha])],q_{k}[1-\alpha/(2[d-(k-1)\alpha])]) are selected at stage k{3,,d}k\in\{3,\ldots,d\}. Denote 𝐱=(x1,,xd)\mathbf{x}=(x_{1},\ldots,x_{d})^{\top}. We have

{𝐕α𝖨𝖡𝖮𝖲𝖲}k,k\displaystyle\{\mathbf{V}_{\alpha}^{\mathsf{IBOSS}}\}_{k,k} =\displaystyle= 1α𝒳=1dxk2φ(𝐱)d𝐱=1α[𝒳xk2φ(𝐱)d𝐱=1dxk2φ(𝐱)d𝐱]\displaystyle\frac{1}{\alpha}\int_{{\mathscr{X}}\setminus\prod_{\ell=1}^{d}{\mathcal{I}}_{\ell}}x_{k}^{2}\,\varphi(\mathbf{x})\,\mbox{\rm d}\mathbf{x}=\frac{1}{\alpha}\left[\int_{{\mathscr{X}}}x_{k}^{2}\,\varphi(\mathbf{x})\,\mbox{\rm d}\mathbf{x}-\int_{\prod_{\ell=1}^{d}{\mathcal{I}}_{\ell}}x_{k}^{2}\,\varphi(\mathbf{x})\,\mbox{\rm d}\mathbf{x}\right]
=\displaystyle= 1α[𝖤[{X}k2](kPr{{X}})kx2φk(x)dx].\displaystyle\frac{1}{\alpha}\left[\mathsf{E}[\{X\}_{k}^{2}]-\left(\prod_{\ell\neq k}\Pr\{\{X\}_{\ell}\in{\mathcal{I}}_{\ell}\}\right)\,\int_{{\mathcal{I}}_{k}}x^{2}\,\varphi_{k}(x)\,\mbox{\rm d}x\right]\,.

Direct calculation gives Pr{Xk=1dk}=1α\Pr\{X\in\prod_{k=1}^{d}{\mathcal{I}}_{k}\}=1-\alpha and

kPr{{X}}=1αPr{{X}kk}=1α1αd(k1)α=πk,\displaystyle\prod_{\ell\neq k}\Pr\{\{X\}_{\ell}\in{\mathcal{I}}_{\ell}\}=\frac{1-\alpha}{\Pr\{\{X\}_{k}\in{\mathcal{I}}_{k}\}}=\frac{1-\alpha}{1-\frac{\alpha}{d-(k-1)\alpha}}=\pi_{k}\,,

which proves (5.3). Similarly,

{𝐕α𝖨𝖡𝖮𝖲𝖲}k,k\displaystyle\{\mathbf{V}_{\alpha}^{\mathsf{IBOSS}}\}_{k,k^{\prime}} =\displaystyle= 1α𝒳=1dxkxkφ(𝐱)d𝐱=1α[𝒳xkxkφ(𝐱)d𝐱=1dxkxkφ(𝐱)d𝐱]\displaystyle\frac{1}{\alpha}\int_{{\mathscr{X}}\setminus\prod_{\ell=1}^{d}{\mathcal{I}}_{\ell}}x_{k}\,x_{k^{\prime}}\,\varphi(\mathbf{x})\,\mbox{\rm d}\mathbf{x}=\frac{1}{\alpha}\left[\int_{{\mathscr{X}}}x_{k}\,x_{k^{\prime}}\,\varphi(\mathbf{x})\,\mbox{\rm d}\mathbf{x}-\int_{\prod_{\ell=1}^{d}{\mathcal{I}}_{\ell}}x_{k}\,x_{k^{\prime}}\,\varphi(\mathbf{x})\,\mbox{\rm d}\mathbf{x}\right]
= \frac{1}{\alpha}\left[\mathsf{E}[\{X\}_{k}]\,\mathsf{E}[\{X\}_{k^{\prime}}]-\left(\prod_{\ell\neq k,\,\ell\neq k^{\prime}}\Pr\{\{X\}_{\ell}\in{\mathcal{I}}_{\ell}\}\right)\,m_{k}(\alpha)\,m_{k^{\prime}}(\alpha)\right]\,,

with

k,kPr{{X}}=1αPr{{X}kk}Pr{{X}kk}=πkπk1α,\displaystyle\prod_{\ell\neq k,\,\ell\neq k^{\prime}}\Pr\{\{X\}_{\ell}\in{\mathcal{I}}_{\ell}\}=\frac{1-\alpha}{\Pr\{\{X\}_{k}\in{\mathcal{I}}_{k}\}\,\Pr\{\{X\}_{k^{\prime}}\in{\mathcal{I}}_{k^{\prime}}\}}=\frac{\pi_{k}\,\pi_{k^{\prime}}}{1-\alpha}\,,

which proves (5.4) and concludes the proof.    
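The limiting matrix of Theorem 5.1 only involves one-dimensional integrals and is therefore easy to evaluate numerically; the sketch below transcribes (5.3) and (5.4) directly (identical marginal densities are assumed for simplicity, with standard normal marginals in the illustration, for which the off-diagonal terms vanish by symmetry).

```python
# Numerical evaluation of V_alpha^IBOSS from Theorem 5.1, assuming identical marginals.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def V_iboss(alpha, d, pdf, ppf, mean, second_moment):
    V = np.empty((d, d))
    pi, m, s = np.empty(d), np.empty(d), np.empty(d)
    for k in range(1, d + 1):
        a_k = alpha / (2 * (d - (k - 1) * alpha))
        lo, up = ppf(a_k), ppf(1 - a_k)
        pi[k-1] = (1 - alpha) * (d - (k - 1) * alpha) / (d - k * alpha)
        s[k-1] = quad(lambda x: x**2 * pdf(x), lo, up)[0]                # s_k(alpha)
        m[k-1] = quad(lambda x: x * pdf(x), lo, up)[0]                   # m_k(alpha)
        V[k-1, k-1] = (second_moment - pi[k-1] * s[k-1]) / alpha         # (5.3)
    for k in range(d):
        for kp in range(d):
            if k != kp:
                V[k, kp] = (mean**2 - pi[k] * pi[kp] * m[k] * m[kp] / (1 - alpha)) / alpha  # (5.4)
    return V

print(V_iboss(0.1, 3, norm.pdf, norm.ppf, 0.0, 1.0))
```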

A key difference between IBOSS and Algorithm 1 is that IBOSS is nonsequential and therefore cannot be used in the streaming setting. Also, IBOSS is motivated by D-optimal design and may not perform well for other criteria, whereas Algorithm 1 converges to the optimal solution when n=αNn=\lfloor\alpha N\rfloor and NN\rightarrow\infty for any criterion satisfying HΦ. Moreover, IBOSS strongly relies on the assumption that (X)=𝐟(X)𝐟(X){\mathscr{M}}(X)=\mathbf{f}(X)\mathbf{f}^{\top}(X) with 𝐟(X)=[1,X]\mathbf{f}(X)=[1,\ X^{\top}]^{\top} and, as the next example illustrates, it can perform poorly in other situations, in particular when the coordinates of the points it processes are functionally dependent.

Example 7: quadratic regression on [0,1][0,1]

Take 𝐟(X)=[X,X2]\mathbf{f}(X)=[X,\ X^{2}]^{\top}, with XX uniformly distributed in [0,1][0,1] and Φ(𝐌)=logdet(𝐌)\Phi(\mathbf{M})=\log\det(\mathbf{M}). For αα0.754160\alpha\leq\alpha_{*}\simeq 0.754160, the optimal measure ξα\xi_{\alpha}^{*} equals μ/α\mu/\alpha on [1/2a,1/2+b][1(αab),1][1/2-a,1/2+b]\cup[1-(\alpha-a-b),1] for some a>ba>b (which are determined by the two equations FΦ[𝐌α,(1/2a)]=FΦ[𝐌α,(1/2+b)]=FΦ[𝐌α,(1(αab))]F_{\Phi}[\mathbf{M}_{\alpha}^{*},{\mathscr{M}}(1/2-a)]=F_{\Phi}[\mathbf{M}_{\alpha}^{*},{\mathscr{M}}(1/2+b)]=F_{\Phi}[\mathbf{M}_{\alpha}^{*},{\mathscr{M}}(1-(\alpha-a-b))]). For αα\alpha\geq\alpha_{*}, ξα=μ/α\xi_{\alpha}^{*}=\mu/\alpha on [1α,1][1-\alpha,1]. When n=αNn=\lfloor\alpha N\rfloor and NN\rightarrow\infty, the matrix 𝐌n,N𝖨𝖡𝖮𝖲𝖲\mathbf{M}_{n,N}^{\mathsf{IBOSS}} obtained with IBOSS applied to the points 𝐟(Xi)\mathbf{f}(X_{i}) converges to 𝐌α𝖨𝖡𝖮𝖲𝖲=𝐌(ξα𝖨𝖡𝖮𝖲𝖲)\mathbf{M}_{\alpha}^{\mathsf{IBOSS}}=\mathbf{M}(\xi_{\alpha}^{\mathsf{IBOSS}}), with ξα𝖨𝖡𝖮𝖲𝖲=μ/α\xi_{\alpha}^{\mathsf{IBOSS}}=\mu/\alpha on [0,α/2][1α/2,1][0,\alpha/2]\cup[1-\alpha/2,1]. The left panel of Figure 13 shows det(𝐌α)\det(\mathbf{M}_{\alpha}^{*}) (red solid line) and det(𝐌α𝖨𝖡𝖮𝖲𝖲)\det(\mathbf{M}_{\alpha}^{\mathsf{IBOSS}}) (blue dotted line) as functions of α[0,1]\alpha\in[0,1]. We have det(𝐌α𝖨𝖡𝖮𝖲𝖲)=(1/960)α2(α4+2540α+26α28α3)\det(\mathbf{M}_{\alpha}^{\mathsf{IBOSS}})=(1/960)\alpha^{2}(\alpha^{4}+25-40\alpha+26\alpha^{2}-8\alpha^{3}), which tends to 0 as α0\alpha\rightarrow 0.   \triangleleft
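The closed-form expression for det(M_alpha^IBOSS) given above is easily checked by symbolic computation; the following sketch (using sympy) verifies it directly from the definition of xi_alpha^IBOSS.

```python
# Symbolic check of det(M_alpha^IBOSS) for quadratic regression on [0,1], f(X) = [X, X^2]^T,
# where xi_alpha^IBOSS = mu/alpha on [0, alpha/2] U [1 - alpha/2, 1].
import sympy as sp

alpha, x = sp.symbols('alpha x', positive=True)
a = alpha / 2

def I(k):  # integral of x^k over the selected region [0, a] U [1 - a, 1]
    return sp.integrate(x**k, (x, 0, a)) + sp.integrate(x**k, (x, 1 - a, 1))

detM = (I(2) * I(4) - I(3)**2) / alpha**2          # det of (1/alpha) * [[I2, I3], [I3, I4]]
target = sp.Rational(1, 960) * alpha**2 * (alpha**4 + 25 - 40*alpha + 26*alpha**2 - 8*alpha**3)
print(sp.simplify(detM - target))                  # prints 0
```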

The next examples show that IBOSS performs more comparably to Algorithm 1 for multilinear regression with intercept, where (X)=𝐟(X)𝐟(X){\mathscr{M}}(X)=\mathbf{f}(X)\mathbf{f}^{\top}(X) with 𝐟(X)=[1,X]\mathbf{f}(X)=[1,\ X^{\top}]^{\top}. Its performance may nevertheless be significantly poorer than that of Algorithm 1.

Example 8: multilinear regression with intercept, Φ(𝐌)=logdet(𝐌)\Phi(\mathbf{M})=\log\det(\mathbf{M})
XX is uniformly distributed in [1,1]2[-1,1]^{2}.

Direct calculation shows that, for any α[0,1]\alpha\in[0,1], the optimal measure ξα\xi_{\alpha}^{*} equals μ/α\mu/\alpha on [1,1]22(𝟎,Rα)[-1,1]^{2}\setminus{\mathscr{B}}_{2}(\boldsymbol{0},R_{\alpha}), with 2(𝟎,r){\mathscr{B}}_{2}(\boldsymbol{0},r) the open ball centered at the origin with radius rr. Here, Rα=2(1α)/πR_{\alpha}=2\sqrt{(1-\alpha)/\pi} when α1π/4\alpha\geq 1-\pi/4, and Rα>1R_{\alpha}>1 is the solution of 1+πR2/4R21R2arcsin(1/R)=α1+\pi R^{2}/4-\sqrt{R^{2}-1}-R^{2}\,\arcsin(1/R)=\alpha when α(0,1π/4)\alpha\in(0,1-\pi/4). The associated optimal matrix is diagonal, 𝐌α=diag{1,ρα,ρα}\mathbf{M}_{\alpha}^{*}=\operatorname{diag}\{1,\rho_{\alpha},\rho_{\alpha}\}, with

\displaystyle\rho_{\alpha}=\left\{\begin{array}{ll}\frac{1}{2\alpha}\,\left[2/3+\pi R_{\alpha}^{4}/8-(R_{\alpha}^{4}/2)\,\arcsin(1/R_{\alpha})-\sqrt{R_{\alpha}^{2}-1}\,(R_{\alpha}^{2}+2)/6\right]&\!\mbox{if }0<\alpha\leq 1-\pi/4\,,\\ \frac{1}{2\alpha}\,[2/3-2(1-\alpha)^{2}/\pi]&\!\mbox{if }1-\pi/4<\alpha\leq 1\,.\end{array}\right.

Extension to d>2d>2 is possible but involves complicated calculations.

When n=αNn=\lfloor\alpha N\rfloor and NN\rightarrow\infty, the matrix 𝐌n,N𝖨𝖡𝖮𝖲𝖲\mathbf{M}_{n,N}^{\mathsf{IBOSS}} obtained with IBOSS converges to 𝐌α𝖨𝖡𝖮𝖲𝖲=𝐌(ξα𝖨𝖡𝖮𝖲𝖲)\mathbf{M}_{\alpha}^{\mathsf{IBOSS}}=\mathbf{M}(\xi_{\alpha}^{\mathsf{IBOSS}}), with ξα𝖨𝖡𝖮𝖲𝖲=μ/α\xi_{\alpha}^{\mathsf{IBOSS}}=\mu/\alpha on [1,1]2([1+a,1a]×[1+b,1b])[-1,1]^{2}\setminus([-1+a,1-a]\times[-1+b,1-b]), with a=α/2a=\alpha/2 and b=α/(2α)ab=\alpha/(2-\alpha)\geq a. The matrix 𝐌α𝖨𝖡𝖮𝖲𝖲\mathbf{M}_{\alpha}^{\mathsf{IBOSS}} is diagonal, 𝐌α𝖨𝖡𝖮𝖲𝖲=diag{1,Dα,1,Dα,2}\mathbf{M}_{\alpha}^{\mathsf{IBOSS}}=\operatorname{diag}\{1,D_{\alpha,1},D_{\alpha,2}\}, where 𝐕α𝖨𝖡𝖮𝖲𝖲=diag{Dα,1,Dα,2}\mathbf{V}_{\alpha}^{\mathsf{IBOSS}}=\operatorname{diag}\{D_{\alpha,1},D_{\alpha,2}\} is the matrix in Theorem 5.1 with Dα,1=(85α+α2)/12D_{\alpha,1}=(8-5\alpha+\alpha^{2})/12 and Dα,2=(811α+4α2)/[3(2α)2]D_{\alpha,2}=(8-11\alpha+4\alpha^{2})/[3(2-\alpha)^{2}]. The right panel of Figure 13 shows det(𝐌α)\det(\mathbf{M}_{\alpha}^{*}) (red solid line) and det(𝐌α𝖨𝖡𝖮𝖲𝖲)\det(\mathbf{M}_{\alpha}^{\mathsf{IBOSS}}) (blue dashed line) as functions of α[0,1]\alpha\in[0,1]. Note that det(𝐌α)1\det(\mathbf{M}_{\alpha}^{*})\rightarrow 1 whereas det(𝐌α𝖨𝖡𝖮𝖲𝖲)4/9\det(\mathbf{M}_{\alpha}^{\mathsf{IBOSS}})\rightarrow 4/9 when α0\alpha\rightarrow 0. The problem is due to the selection by IBOSS of points having one coordinate in the central part of the interval.   \triangleleft

Figure 13: det(𝐌α)\det(\mathbf{M}_{\alpha}^{*}) (red solid line) and det(𝐌α𝖨𝖡𝖮𝖲𝖲)\det(\mathbf{M}_{\alpha}^{\mathsf{IBOSS}}) (blue dotted line) as functions of α[0,1]\alpha\in[0,1]. Left: quadratic regression on [0,1][0,1]; Right: multilinear regression with intercept on [1,1]2[-1,1]^{2}.
XX is normally distributed 𝒩(𝟎,𝐈d){\mathscr{N}}(\boldsymbol{0},\mathbf{I}_{d}).

The expression of the optimal matrix 𝐌α\mathbf{M}_{\alpha}^{*} has been derived in Section 4.2; the asymptotic value for NN\rightarrow\infty of the matrix 𝐌αN,N\mathbf{M}_{\lfloor\alpha N\rfloor,N} is

𝐌α𝖨𝖡𝖮𝖲𝖲=(1𝟎𝟎𝐕α𝖨𝖡𝖮𝖲𝖲),\displaystyle\mathbf{M}_{\alpha}^{\mathsf{IBOSS}}=\left(\begin{array}[]{cc}1&\boldsymbol{0}^{\top}\\ \boldsymbol{0}&\mathbf{V}_{\alpha}^{\mathsf{IBOSS}}\\ \end{array}\right)\,,

where the expression of 𝐕α𝖨𝖡𝖮𝖲𝖲\mathbf{V}_{\alpha}^{\mathsf{IBOSS}} (here a diagonal matrix) is given in Theorem 5.1. Figure 14 shows the D-efficiency det1/(d+1)(𝐌α𝖨𝖡𝖮𝖲𝖲)/det1/(d+1)(𝐌α)\det^{1/(d+1)}(\mathbf{M}_{\alpha}^{\mathsf{IBOSS}})/\det^{1/(d+1)}(\mathbf{M}_{\alpha}^{*}) as a function of α(0,1]\alpha\in(0,1] for d=3d=3 (left) and d=25d=25 (right), showing that the performance of IBOSS deteriorates as dd increases. We also performed series of simulations for d=25d=25, with 100 independent repetitions of selections of n=αNn=\lfloor\alpha N\rfloor points within 𝒳N{\mathscr{X}}_{N} (N=10 000N=10\,000) based on IBOSS and Algorithm 1. Due to the small value of NN, we apply Algorithm 1 to replications 𝒳mN=𝒳N𝒳N(2)𝒳N(m){\mathscr{X}}_{mN}={\mathscr{X}}_{N}\cup{\mathscr{X}}_{N}^{(2)}\cup\cdots\cup{\mathscr{X}}_{N}^{(m)} of 𝒳N{\mathscr{X}}_{N}, see Section 5.1, with m=99m=99 for α<0.1\alpha<0.1, m=9m=9 for 0.1α<0.50.1\leq\alpha<0.5 and m=4m=4 for α0.5\alpha\geq 0.5. The colored areas on Figure 14 show the variability range for efficiency, corresponding to the empirical mean ±\pm 2 standard deviations obtained for the 100 repetitions, for IBOSS (green, bottom) and Algorithm 1 (magenta, top); note that variability decreases as n=αNn=\lfloor\alpha N\rfloor increases. The approximation of 𝐌n,N\mathbf{M}_{n,N} obtained with IBOSS by the asymptotic matrix 𝐌α𝖨𝖡𝖮𝖲𝖲\mathbf{M}_{\alpha}^{\mathsf{IBOSS}} is quite accurate although NN is rather small; Algorithm 1 (incorporating mm repetitions of 𝒳N{\mathscr{X}}_{N}) performs significantly better than IBOSS although the setting is particularly favorable to IBOSS — it is significantly slower than IBOSS, however, when mm is large.

Figure 14: D-efficiency of IBOSS (blue solid line) as a function of α(0,1]\alpha\in(0,1] for d=3d=3 (left) and d=25d=25 (right). The envelopes on the right panel show the empirical mean efficiency ±\pm 2 standard deviations obtained for 100 independent repetitions with n=αNn=\lfloor\alpha N\rfloor and N=10 000N=10\,000 for IBOSS (green, bottom) and Algorithm 1 (magenta, top).

6 Conclusions and further developments

We have proposed a sequential subsampling method for experimental design (Algorithm 1) that converges to the optimal solution when the length of the sequence tends to infinity and a fixed proportion of design points is selected. Since the method only needs to keep the memory of the current information matrix associated with the design problem (or its inverse), and to update a pair of scalar variables (an estimated quantile, and an estimate of the p.d.f. value at the quantile), it can be applied to sequences of arbitrary length and is suitable for data streaming.

We have not tried to optimize the choice of initialization and tuning parameters in Algorithm 1. Although it does not seem critical (the same tuning has been used in all the examples presented), there is certainly room for improvement, in particular concerning β0\beta_{0} and C^k0\widehat{C}_{k_{0}} (for instance, using the information that C1α0C_{1-\alpha}^{*}\leq 0 whereas C^k0>0\widehat{C}_{k_{0}}>0 for small α\alpha with the initialization we use).

We have only considered the case of linear models, where the information matrix does not depend on unknown parameters (equivalent to local optimum design in case of a nonlinear model), but extension to online parameter estimation in a nonlinear model with (x)=(x,𝜽){\mathscr{M}}(x)={\mathscr{M}}(x,\boldsymbol{\theta}) would not require important modifications. Denote by 𝜽^n\hat{\boldsymbol{\theta}}^{n} the estimated value of the parameters after observation at the nn design points selected, Xi1,,XinX_{i_{1}},\ldots,X_{i_{n}}, say. Then, we can use 𝐌nk0=(1/k0)i=1k0(Xi,𝜽^k0)\mathbf{M}_{n_{k_{0}}}=(1/k_{0})\,\sum_{i=1}^{k_{0}}{\mathscr{M}}(X_{i},\hat{\boldsymbol{\theta}}^{k_{0}}) at Step 1 of Algorithm 1, and 𝐌nk+1\mathbf{M}_{n_{k}+1} given by (3.10) can be replaced by 𝐌nk+1=[1/(nk+1)][j=1nk(Xij,𝜽^nk)+(Xk+1,𝜽^nk)]\mathbf{M}_{n_{k}+1}=[1/(n_{k}+1)]\,[\sum_{j=1}^{n_{k}}{\mathscr{M}}(X_{i_{j}},\hat{\boldsymbol{\theta}}^{n_{k}})+{\mathscr{M}}(X_{k+1},\hat{\boldsymbol{\theta}}^{n_{k}})] at Step 2. Recursive estimation can be used for k>k0k>k_{0} to reduce computational cost. For instance for maximum likelihood estimation, with the notation of Section 1, we can update 𝜽^nk\hat{\boldsymbol{\theta}}^{n_{k}} as

𝜽^nk+1=𝜽^nk+1nk+1𝐌nk+11logφXk+1,θ(Yk+1)θ|θ=𝜽^nk\displaystyle\hat{\boldsymbol{\theta}}^{n_{k}+1}=\hat{\boldsymbol{\theta}}^{n_{k}}+\frac{1}{n_{k}+1}\,\mathbf{M}_{n_{k}+1}^{-1}\frac{\partial\log\varphi_{X_{k+1},\theta}(Y_{k+1})}{\partial\theta}\bigg{|}_{\theta=\hat{\boldsymbol{\theta}}^{n_{k}}}

when Xk+1X_{k+1} is selected; see Ljung and Söderström, (1983); Tsypkin, (1983). A further simplification would be to update 𝐌nk\mathbf{M}_{n_{k}} as 𝐌nk+1=𝐌nk+[1/(nk+1)][(Xk+1,𝜽^nk)𝐌nk]\mathbf{M}_{n_{k+1}}=\mathbf{M}_{n_{k}}+[1/(n_{k}+1)]\,[{\mathscr{M}}(X_{k+1},\hat{\boldsymbol{\theta}}^{n_{k}})-\mathbf{M}_{n_{k}}]. When the XiX_{i} are i.i.d. with μ\mu satisfying Hμ, the strong consistency of 𝜽^nk\hat{\boldsymbol{\theta}}^{n_{k}} holds with such recursive schemes under rather general conditions when all XiX_{i} are selected. Showing that this remains true when only a proportion α\alpha is selected by Algorithm 1 requires technical developments outside the scope of this paper, but we anticipate that 𝐌nk𝐌α,𝜽¯\mathbf{M}_{n_{k}}\rightarrow\mathbf{M}_{\alpha,\overline{\boldsymbol{\theta}}}^{*} a.s., with 𝐌α,𝜽¯\mathbf{M}_{\alpha,\overline{\boldsymbol{\theta}}}^{*} the optimal matrix for the true value 𝜽¯\overline{\boldsymbol{\theta}} of the model parameters.
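As an illustration of these recursive updates, a possible form of one recursion step is sketched below (Python) for the particular case of logistic regression, where P(Y=1|x,theta) = 1/(1+exp(-x^T theta)), the score of one observation is (y - p(x,theta)) x and M(x,theta) = p(x,theta)(1 - p(x,theta)) x x^T; the simplified update of M_{n_k} mentioned above is used. This is our own example, the construction in the text being general.

```python
# Sketch of one recursive step (logistic regression): first update the information matrix with
# the simplified rule M <- M + [M(x, theta) - M]/(n+1), then update theta with the score.
import numpy as np

def recursive_step(theta, M, n_sel, x, y):
    """x: selected design point (1d array); y in {0, 1}; n_sel: number of points already used."""
    p_hat = 1.0 / (1.0 + np.exp(-(x @ theta)))
    M = M + (p_hat * (1.0 - p_hat) * np.outer(x, x) - M) / (n_sel + 1)
    theta = theta + np.linalg.solve(M, (y - p_hat) * x) / (n_sel + 1)
    return theta, M, n_sel + 1
```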

Algorithm 1 can be viewed as an adaptive version of the treatment allocation method presented in (Metelkina and Pronzato,, 2017): consider the selection or rejection of XiX_{i} as the allocation of individual ii to treatment 1 (selection) or 2 (rejection), with respective contributions 1(Xi)=(Xi){\mathscr{M}}_{1}(X_{i})={\mathscr{M}}(X_{i}) or 2(Xi)=0{\mathscr{M}}_{2}(X_{i})=0 to the collection of information; count a cost of one for allocation to treatment 1 and zero for rejection. Then, the doubly-adaptive sequential allocation (4.6) of Metelkina and Pronzato, (2017) that optimizes a compromise between information and cost exactly coincides with Algorithm 1 where C^k\widehat{C}_{k} is frozen to a fixed CC, i.e., without Step 3. In that sense, the two-time-scale stochastic approximation procedure of Algorithm 1 opens the way to the development of adaptive treatment allocation procedures where the proportion of individuals allocated to the poorest treatment could be adjusted online to a given target.

Finally, the designs obtained with the proposed thinning procedure are model-based: when the model is wrong, ξα\xi_{\alpha}^{*} is no longer optimal for the true model. Model-robustness issues are not considered in the paper and would require specific developments, following for instance the approach in (Wiens,, 2005; Nie et al.,, 2018).

Appendix A Maximum of Φ(𝐌nk)\Phi(\mathbf{M}_{n_{k}})

The property below is stated without proof in (Pronzato,, 2006). We provide here a formal proof based on results on conditional value-at-risk by Rockafellar and Uryasev, (2000) and Pflug, (2000).

Lemma A.1.

Suppose that nk/kαn_{k}/k\rightarrow\alpha as kk\rightarrow\infty. Then, under HΦ and HM, for any choice of nkn_{k} points XiX_{i} among kk points i.i.d. with μ\mu, we have lim supkΦ(𝐌nk,k)Φ(𝐌α)\limsup_{k\rightarrow\infty}\Phi(\mathbf{M}_{n_{k},k})\leq\Phi(\mathbf{M}_{\alpha}^{*}) a.s., where 𝐌α\mathbf{M}_{\alpha}^{*} maximizes Φ(𝐌)\Phi(\mathbf{M}) with respect to 𝐌𝕄(α)\mathbf{M}\in\mathbb{M}(\alpha).

Proof. Denote by 𝐌nk,k\mathbf{M}_{n_{k},k}^{*} the matrix that corresponds to choosing nkn_{k} distinct candidates that maximize Φ(𝐌nk,k)\Phi(\mathbf{M}_{n_{k},k}). The concavity of Φ\Phi implies

Φ(𝐌nk,k)Φ(𝐌α)+trace[Φ(𝐌α)(𝐌nk,k𝐌α)].\displaystyle\Phi(\mathbf{M}_{n_{k},k}^{*})\leq\Phi(\mathbf{M}_{\alpha}^{*})+\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*})(\mathbf{M}_{n_{k},k}^{*}-\mathbf{M}_{\alpha}^{*})]\,. (A.1)

The rest of the proof consists in deriving an upper bound on the second term on the right-hand side of (A.1).

Denote zi=trace[Φ(𝐌α)(Xi)]z_{i}=\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*}){\mathscr{M}}(X_{i})] for all i=1,,ki=1,\ldots,k and let the zi:kz_{i:k} denote the version sorted by decreasing values. Since Φ\Phi is increasing for Loewner ordering, Φ(𝐌)Φ(𝐌+𝐳𝐳)\Phi(\mathbf{M})\leq\Phi(\mathbf{M}+\mathbf{z}\mathbf{z}^{\top}) for any 𝐌𝕄\mathbf{M}\in\mathbb{M}^{\geq} and any 𝐳p\mathbf{z}\in\mathds{R}^{p}, and concavity implies Φ(𝐌+𝐳𝐳)Φ(𝐌)+𝐳Φ(𝐌)𝐳\Phi(\mathbf{M}+\mathbf{z}\mathbf{z}^{\top})\leq\Phi(\mathbf{M})+\mathbf{z}^{\top}\nabla_{\Phi}(\mathbf{M})\mathbf{z}, showing that Φ(𝐌)𝕄\nabla_{\Phi}(\mathbf{M})\in\mathbb{M}^{\geq}. Therefore, zi:k0z_{i:k}\geq 0 for all ii.

First, we may notice that trace[Φ(𝐌α)𝐌nk,k](1/nk)i=1nkzi:k\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*})\mathbf{M}_{n_{k},k}^{*}]\leq(1/n_{k})\,\sum_{i=1}^{n_{k}}z_{i:k} and that

trace[Φ(𝐌α)𝐌α]=1α𝒳𝕀{trace[Φ(𝐌α)(x)]c1α}trace[Φ(𝐌α)(x)]μ(dx)\displaystyle\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*})\mathbf{M}_{\alpha}^{*}]=\frac{1}{\alpha}\,\int_{\mathscr{X}}\mathbb{I}_{\{\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*}){\mathscr{M}}(x)]\geq c_{1-\alpha}\}}\,\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*}){\mathscr{M}}(x)]\,\mu(\mbox{\rm d}x)

with c1α0c_{1-\alpha}\geq 0 and such that 𝒳𝕀{trace[Φ(𝐌α)(x)]c1α}μ(dx)=α\int_{\mathscr{X}}\mathbb{I}_{\{\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*}){\mathscr{M}}(x)]\geq c_{1-\alpha}\}}\,\mu(\mbox{\rm d}x)=\alpha; see (2.2).

Following Rockafellar and Uryasev, (2000); Pflug, (2000), we then define the functions g(x;\beta,a)=a+(1/\beta)\,\left[\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\beta}^{*}){\mathscr{M}}(x)]-a\right]^{+}, x𝒳x\in{\mathscr{X}}, β(0,1)\beta\in(0,1), aa\in\mathds{R}. We can then write, for any β(0,1)\beta\in(0,1),

trace[Φ(𝐌β)𝐌β]=𝖤{g(X;β,c1β)}=infa𝖤{g(X;β,a)}c1β,\displaystyle\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\beta}^{*})\mathbf{M}_{\beta}^{*}]=\mathsf{E}\{g(X;\beta,c_{1-\beta})\}=\inf_{a}\mathsf{E}\{g(X;\beta,a)\}\geq c_{1-\beta}\,, (A.2)

and

1nki=1nkzi:k=𝖤μk{g(X;αk,znk:k)}=infa𝖤μk{g(X;αk,a)},\displaystyle\frac{1}{n_{k}}\,\sum_{i=1}^{n_{k}}z_{i:k}=\mathsf{E}_{\mu_{k}}\{g(X;\alpha_{k},z_{n_{k}:k})\}=\inf_{a}\mathsf{E}_{\mu_{k}}\{g(X;\alpha_{k},a)\}\,,

where αk=nk/k(α/2,1]\alpha_{k}=n_{k}/k\in(\alpha/2,1] for all kk larger than some k1k_{1} and where 𝖤μk{}\mathsf{E}_{\mu_{k}}\{\cdot\} denotes expectation for the empirical measure μk=(1/k)i=1kδXi\mu_{k}=(1/k)\,\sum_{i=1}^{k}\delta_{X_{i}}.

Next, we construct an upper bound on znk:kz_{n_{k}:k}. For k>k1k>k_{1}, the matrix 𝐌k=(1/k)i=1k(Xi)\mathbf{M}_{k}=(1/k)\,\sum_{i=1}^{k}{\mathscr{M}}(X_{i}) satisfies

trace[Φ(𝐌α)𝐌k]=(1/k)i=1kzi:k(nk/k)znk:k>(α/2)znk:k.\displaystyle\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*})\mathbf{M}_{k}]=(1/k)\,\sum_{i=1}^{k}z_{i:k}\geq(n_{k}/k)\,z_{n_{k}:k}>(\alpha/2)\,z_{n_{k}:k}\,. (A.3)

Now, 𝐌α=𝐌(ξα)\mathbf{M}_{\alpha}^{*}=\mathbf{M}(\xi_{\alpha}^{*}) with ξα=μ/α\xi_{\alpha}^{*}=\mu/\alpha on a set 𝒳α𝒳{\mathscr{X}}_{\alpha}^{*}\subset{\mathscr{X}} and ξα=0\xi_{\alpha}^{*}=0 elsewhere, and μ(𝒳α)=αξα(𝒳α)=αξα(𝒳)=α\mu({\mathscr{X}}_{\alpha}^{*})=\alpha\,\xi_{\alpha}^{*}({\mathscr{X}}_{\alpha}^{*})=\alpha\,\xi_{\alpha}^{*}({\mathscr{X}})=\alpha. HM-(ii) then implies that λmin(𝐌α)=(1/α)λmin[𝒳α(x)μ(dx)]>α/α\lambda_{\min}(\mathbf{M}_{\alpha}^{*})=(1/\alpha)\,\lambda_{\min}[\int_{{\mathscr{X}}_{\alpha}^{*}}{\mathscr{M}}(x)\,\mu(\mbox{\rm d}x)]>\ell_{\alpha}/\alpha, and HΦ implies that Φ(𝐌α)<A(α/α)<\|\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*})\|<A(\ell_{\alpha}/\alpha)<\infty. Therefore trace[Φ(𝐌α)𝐌(μ)]<Aα=A(α/α)pB\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*})\mathbf{M}(\mu)]<A_{\alpha}=A(\ell_{\alpha}/\alpha)\sqrt{pB} from HM-(i). Since trace[Φ(𝐌α)𝐌k]\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*})\mathbf{M}_{k}] tends to trace[Φ(𝐌α)𝐌(μ)]\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*})\mathbf{M}(\mu)] a.s. as kk\rightarrow\infty, (A.3) implies that there exists a.s. k2k_{2} such that, for all k>k2k>k_{2}, znk:k<Aα/(4α)z_{n_{k}:k}<A_{\alpha}/(4\alpha).

To summarize, (A.1) implies

Φ(𝐌nk,k)\displaystyle\Phi(\mathbf{M}_{n_{k},k}^{*}) \displaystyle\leq Φ(𝐌α)+𝖤μk{g(X;αk,znk:k)}𝖤{g(X;α,c1α)}\displaystyle\Phi(\mathbf{M}_{\alpha}^{*})+\mathsf{E}_{\mu_{k}}\{g(X;\alpha_{k},z_{n_{k}:k})\}-\mathsf{E}\{g(X;\alpha,c_{1-\alpha})\}
\displaystyle\leq Φ(𝐌α)+|𝖤μk{g(X;αk,znk:k)}𝖤{g(X;αk,c1αk)}|\displaystyle\Phi(\mathbf{M}_{\alpha}^{*})+\left|\mathsf{E}_{\mu_{k}}\{g(X;\alpha_{k},z_{n_{k}:k})\}-\mathsf{E}\{g(X;\alpha_{k},c_{1-\alpha_{k}})\}\right|
+|trace[Φ(𝐌αk)𝐌αk]trace[Φ(𝐌α)𝐌α]|.\displaystyle+\left|\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha_{k}}^{*})\mathbf{M}_{\alpha_{k}}^{*}]-\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*})\mathbf{M}_{\alpha}^{*}]\right|\,.

The last term tends to zero as kk tends to infinity, due to (A.2) and the continuity of conditional value-at-risk; see (Rockafellar and Uryasev,, 2002, Prop. 13). Since c1αktrace[Φ(𝐌αk)𝐌αk]c_{1-\alpha_{k}}\leq\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha_{k}}^{*})\mathbf{M}_{\alpha_{k}}^{*}], see (A.2), and αkα\alpha_{k}\rightarrow\alpha, for all kk large enough we have c1αk2trace[Φ(𝐌α)𝐌α]c_{1-\alpha_{k}}\leq 2\,\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*})\mathbf{M}_{\alpha}^{*}]. Denote a¯=max{Aα/(4α),2trace[Φ(𝐌α)𝐌α]}\bar{a}=\max\{A_{\alpha}/(4\alpha),2\,\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}_{\alpha}^{*})\mathbf{M}_{\alpha}^{*}]\}. The second term can then be rewritten as

|𝖤μk{g(X;αk,znk:k)}𝖤{g(X;αk,c1αk)}|\displaystyle\left|\mathsf{E}_{\mu_{k}}\{g(X;\alpha_{k},z_{n_{k}:k})\}-\mathsf{E}\{g(X;\alpha_{k},c_{1-\alpha_{k}})\}\right| =\displaystyle= |infa[0,a¯]𝖤μk{g(X;αk,a)}infa[0,a¯]𝖤{g(X;αk,a)}|\displaystyle\left|\inf_{a\in[0,\bar{a}]}\mathsf{E}_{\mu_{k}}\{g(X;\alpha_{k},a)\}-\inf_{a\in[0,\bar{a}]}\mathsf{E}\{g(X;\alpha_{k},a)\}\right|
maxa[0,a¯]|𝖤μk{g(X;αk,a)}𝖤{g(X;αk,a)}|.\displaystyle\leq\max_{a\in[0,\bar{a}]}\left|\mathsf{E}_{\mu_{k}}\{g(X;\alpha_{k},a)\}-\mathsf{E}\{g(X;\alpha_{k},a)\}\right|\,.

The functions g(;t,a)g(\cdot;t,a) with t(α/2,1]t\in(\alpha/2,1], a[0,a¯]a\in[0,\bar{a}], form a Glivenko-Cantelli class; see (van der Vaart,, 1998, p. 271). It implies that maxa[0,a¯]|𝖤μk{g(X;αk,a)}𝖤{g(X;αk,a)}|0\max_{a\in[0,\bar{a}]}\left|\mathsf{E}_{\mu_{k}}\{g(X;\alpha_{k},a)\}-\mathsf{E}\{g(X;\alpha_{k},a)\}\right|\rightarrow 0 a.s., which concludes the proof.    

The class of functions g(;t,a)g(\cdot;t,a) is in fact Donsker (van der Vaart,, 1998, p. 271). The strict concavity of Φ()\Phi(\cdot) implies that optimal matrices are unique, and as a complement to Lemma A.1 we get 𝐌αk,k𝐌α=𝒪p(1/k)\|\mathbf{M}_{\lfloor\alpha k\rfloor,k}^{*}-\mathbf{M}_{\alpha}^{*}\|={\mathcal{O}}_{p}(1/\sqrt{k}). Note that when an optimal bounded design measure ξα\xi_{\alpha}^{*} is known, a selection procedure such that nk/kαn_{k}/k\rightarrow\alpha and Φ(𝐌nk,k)Φ(𝐌α)\Phi(\mathbf{M}_{n_{k},k})\rightarrow\Phi(\mathbf{M}_{\alpha}^{*}) a.s. is straightforwardly available: simply select the points that belong to the set 𝒳α{\mathscr{X}}_{\alpha}^{*} on which ξα=μ/α\xi_{\alpha}^{*}=\mu/\alpha.

Appendix B Non degeneracy of 𝐌nk\mathbf{M}_{n_{k}}

To invoke Hμ,M in order to ensure the existence of a density φ𝐌nk\varphi_{\mathbf{M}_{n_{k}}} having the required properties for all kk (which is essential for the convergence of Algorithm 1, see Theorem 3.1), we need to guarantee that 𝐌nk𝕄,L\mathbf{M}_{n_{k}}\in\mathbb{M}^{\geq}_{\ell,L} for all kk, for some \ell and LL. This is the object of the following lemma.

Lemma B.1.

Under HM, when ϵ1>0\epsilon_{1}>0 in Algorithm 1, nk+1/k>ϵ1n_{k+1}/k>\epsilon_{1} for all kk and there exists a.s. >0\ell>0 and L<L<\infty such that 𝐌nk𝕄,L\mathbf{M}_{n_{k}}\in\mathbb{M}^{\geq}_{\ell,L} for all k>k0k>k_{0}.

Proof. Since the first k0k_{0} points are selected, we have nk/k=1>ϵ1n_{k}/k=1>\epsilon_{1} for kk0k\leq k_{0}. Let kk_{*} be the first kk for which nk/k<ϵ1n_{k}/k<\epsilon_{1}. It implies that nk=nk1>(k1)ϵ1n_{k_{*}}=n_{k_{*}-1}>(k_{*}-1)\,\epsilon_{1}, and (3.9) implies that nk+1=nk+1n_{k_{*}+1}=n_{k_{*}}+1. Therefore, nk+1/k>ϵ1+(1ϵ1)/k>ϵ1n_{k_{*}+1}/k_{*}>\epsilon_{1}+(1-\epsilon_{1})/k_{*}>\epsilon_{1}, and nk/(k1)>ϵ1n_{k}/(k-1)>\epsilon_{1} for all k>1k>1.

If the nkn_{k} points were chosen randomly, nk>(k1)ϵ1n_{k}>(k-1)\,\epsilon_{1} would be enough to obtain that, from HM, λmin(𝐌nk)>ϵ1/2\lambda_{\min}(\mathbf{M}_{n_{k}})>\ell_{\epsilon_{1}}/2 and λmax(𝐌nk)<B/2\lambda_{\max}(\mathbf{M}_{n_{k}})<\sqrt{B}/2 for all kk larger than some k1k_{1}. However, here the situation is more complicated since points are accepted or rejected according to a sequential decision rule, and a more sophisticated argument is required. An expedient solution is to consider the worst possible choices of nkn_{k} points, which yield the smallest value of λmin(𝐌nk)\lambda_{\min}(\mathbf{M}_{n_{k}}) and the largest value of λmax(𝐌nk)\lambda_{\max}(\mathbf{M}_{n_{k}}). This approach is used in Lemma B.2 presented below, which allows us to conclude the proof.    

Lemma B.2.

Under HM, any matrix 𝐌nk\mathbf{M}_{n_{k}} obtained by choosing nkn_{k} points out of kk independently distributed with μ\mu and such that nk/k>ϵ>0n_{k}/k>\epsilon>0 satisfies liminfkλmin(𝐌nk)>\lim\inf_{k\rightarrow\infty}\lambda_{\min}(\mathbf{M}_{n_{k}})>\ell and limsupkλmax(𝐌nk)<L\lim\sup_{k\rightarrow\infty}\lambda_{\max}(\mathbf{M}_{n_{k}})<L a.s. for some >0\ell>0 and L<L<\infty.

Proof. We first construct a lower bound on liminfkλmin(𝐌nk)\lim\inf_{k\rightarrow\infty}\lambda_{\min}(\mathbf{M}_{n_{k}}). Consider the criterion Φ+(𝐌)=λmin(𝐌)\Phi_{\infty}^{+}(\mathbf{M})=\lambda_{\min}(\mathbf{M}), and denote by 𝐌nk,k\mathbf{M}_{n_{k},k}^{*} the nkn_{k}-point design matrix that minimizes Φ+\Phi_{\infty}^{+} over the design space formed by kk points XiX_{i} i.i.d. with μ\mu. We can write 𝐌nk,k=(1/nk)i=1nk(Xki)\mathbf{M}_{n_{k},k}^{*}=(1/n_{k})\,\sum_{i=1}^{n_{k}}{\mathscr{M}}(X_{k_{i}}), where the kik_{i} correspond to the indices of positive uiu_{i} in the minimization of f(𝐮)=Φ+[i=1kui(Xi)]f(\mathbf{u})=\Phi_{\infty}^{+}[\sum_{i=1}^{k}u_{i}\,{\mathscr{M}}(X_{i})] with respect to 𝐮=(u1,,uk)\mathbf{u}=(u_{1},\ldots,u_{k}) under the constraints ui{0,1}u_{i}\in\{0,1\} for all ii and iui=nk\sum_{i}u_{i}=n_{k}. Obviously, any matrix 𝐌nk\mathbf{M}_{n_{k}} obtained by choosing nkn_{k} distinct points XiX_{i} among X1,,XkX_{1},\ldots,X_{k} satisfies λmin(𝐌nk)λmin(𝐌nk,k)\lambda_{\min}(\mathbf{M}_{n_{k}})\geq\lambda_{\min}(\mathbf{M}_{n_{k},k}^{*}).

For any 𝐌𝕄\mathbf{M}\in\mathbb{M}^{\geq}, denote 𝒰(𝐌)={𝐮p:𝐮=1,𝐌𝐮=λmin(𝐌)𝐮}{\mathscr{U}}(\mathbf{M})=\{\mathbf{u}\in\mathds{R}^{p}:\,\|\mathbf{u}\|=1\,,\ \mathbf{M}\mathbf{u}=\lambda_{\min}(\mathbf{M})\mathbf{u}\}. Then, for any 𝐮𝒰(𝐌nk,k)\mathbf{u}\in{\mathscr{U}}(\mathbf{M}_{n_{k},k}^{*}), 𝐮𝐌nk,k𝐮=λmin(𝐌nk,k)=min𝐯p:𝐯=1𝐯𝐌nk,k𝐯=(1/nk)i=1nkzi:k(𝐮)\mathbf{u}^{\top}\mathbf{M}_{n_{k},k}^{*}\mathbf{u}=\lambda_{\min}(\mathbf{M}_{n_{k},k}^{*})=\min_{\mathbf{v}\in\mathds{R}^{p}:\,\|\mathbf{v}\|=1}\mathbf{v}^{\top}\mathbf{M}_{n_{k},k}^{*}\mathbf{v}=(1/n_{k})\,\sum_{i=1}^{n_{k}}z_{i:k}(\mathbf{u}), where the zi:k(𝐮)z_{i:k}(\mathbf{u}) correspond to the values of 𝐮(Xi)𝐮\mathbf{u}^{\top}{\mathscr{M}}(X_{i})\mathbf{u} sorted by increasing order for i=1,,ki=1,\ldots,k. For any m{1,,nk1}m\in\{1,\ldots,n_{k}-1\}, we thus have

λmin(𝐌nk,k)1mi=1mzi:k(𝐮)λmin(𝐌m,k),\displaystyle\lambda_{\min}(\mathbf{M}_{n_{k},k}^{*})\geq\frac{1}{m}\,\sum_{i=1}^{m}z_{i:k}(\mathbf{u})\geq\lambda_{\min}(\mathbf{M}_{m,k}^{*})\,,

showing that the worst situation corresponds to the smallest admissible nkn_{k}; that is, we only have to consider the case when nk/kϵn_{k}/k\rightarrow\epsilon as kk\rightarrow\infty.

Since Φ+\Phi_{\infty}^{+} is concave, for any 𝐌𝕄\mathbf{M}^{\prime}\in\mathbb{M}^{\geq} we have

λmin(𝐌)λmin(𝐌nk,k)+FΦ+(𝐌nk,k,𝐌),\displaystyle\lambda_{\min}(\mathbf{M}^{\prime})\leq\lambda_{\min}(\mathbf{M}_{n_{k},k}^{*})+F_{\Phi_{\infty}^{+}}(\mathbf{M}_{n_{k},k}^{*},\mathbf{M}^{\prime})\,, (B.1)

where FΦ+(𝐌,𝐌)=min𝐮𝒰(𝐌)𝐮(𝐌𝐌)𝐮F_{\Phi_{\infty}^{+}}(\mathbf{M},\mathbf{M}^{\prime})=\min_{\mathbf{u}\in{\mathscr{U}}(\mathbf{M})}\mathbf{u}^{\top}(\mathbf{M}^{\prime}-\mathbf{M})\mathbf{u} is the directional derivative of Φ+\Phi_{\infty}^{+} at 𝐌\mathbf{M} in the direction 𝐌\mathbf{M}^{\prime}.

For any α(0,1)\alpha\in(0,1) and any ξαμ/α\xi_{\alpha}\leq\mu/\alpha, there exists a set 𝒳α𝒳{\mathscr{X}}_{\alpha}\subset{\mathscr{X}} such that ξα(1α)μ\xi_{\alpha}\geq(1-\alpha)\mu on 𝒳α{\mathscr{X}}_{\alpha} and μ(𝒳α)α2\mu({\mathscr{X}}_{\alpha})\geq\alpha^{2}. Indeed, any set 𝒵{\mathscr{Z}} on which ξα<(1α)μ\xi_{\alpha}<(1-\alpha)\mu is such that ξα(𝒵)<(1α)μ(𝒵)(1α)\xi_{\alpha}({\mathscr{Z}})<(1-\alpha)\,\mu({\mathscr{Z}})\leq(1-\alpha); therefore, taking 𝒳α=𝒳𝒵{\mathscr{X}}_{\alpha}={\mathscr{X}}\setminus{\mathscr{Z}}, we get μ(𝒳α)αξα(𝒳α)α2\mu({\mathscr{X}}_{\alpha})\geq\alpha\,\xi_{\alpha}({\mathscr{X}}_{\alpha})\geq\alpha^{2}. Denote αk=nk/k\alpha_{k}=n_{k}/k, with αk>ϵ\alpha_{k}>\epsilon and αkϵ\alpha_{k}\rightarrow\epsilon as kk\rightarrow\infty, and take any 𝐌=𝐌(ξαk)𝕄(αk)\mathbf{M}^{\prime}=\mathbf{M}(\xi_{\alpha_{k}})\in\mathbb{M}(\alpha_{k}). Applying HM-(ii) to the set 𝒳αk{\mathscr{X}}_{\alpha_{k}} defined above, we get

λmin(𝐌)=λmin(𝒳(x)ξαk(dx))\displaystyle\lambda_{\min}(\mathbf{M}^{\prime})=\lambda_{\min}\left(\int_{{\mathscr{X}}}{\mathscr{M}}(x)\,\xi_{\alpha_{k}}(\mbox{\rm d}x)\right) \displaystyle\geq λmin(𝒳αk(x)ξαk(dx))\displaystyle\lambda_{\min}\left(\int_{{\mathscr{X}}_{\alpha_{k}}}{\mathscr{M}}(x)\,\xi_{\alpha_{k}}(\mbox{\rm d}x)\right)
\displaystyle\geq (1αk)λmin(𝒳αk(x)μ(dx))>(1αk)αk2.\displaystyle(1-\alpha_{k})\,\lambda_{\min}\left(\int_{{\mathscr{X}}_{\alpha_{k}}}{\mathscr{M}}(x)\,\mu(\mbox{\rm d}x)\right)>(1-\alpha_{k})\,\ell_{\alpha_{k}^{2}}\,.

For kk larger than some k1k_{1} we have αk(ϵ,2ϵ)\alpha_{k}\in(\epsilon,2\epsilon), and therefore λmin(𝐌)>cϵ=(12ϵ)ϵ2>0\lambda_{\min}(\mathbf{M}^{\prime})>c_{\epsilon}=(1-2\epsilon)\,\ell_{\epsilon^{2}}>0. The inequality (B.1) thus gives, for k>k1k>k_{1},

cϵ<λmin(𝐌nk,k)+min𝐮𝒰(𝐌nk,k)min𝐌𝕄(αk)𝐮(𝐌𝐌nk,k)𝐮.\displaystyle c_{\epsilon}<\lambda_{\min}(\mathbf{M}_{n_{k},k}^{*})+\min_{\mathbf{u}\in{\mathscr{U}}(\mathbf{M}_{n_{k},k}^{*})}\min_{\mathbf{M}^{\prime}\in\mathbb{M}(\alpha_{k})}\mathbf{u}^{\top}(\mathbf{M}^{\prime}-\mathbf{M}_{n_{k},k}^{*})\mathbf{u}\,. (B.2)

The rest of the proof follows from results on conditional value-at-risk by Rockafellar and Uryasev, (2000) and Pflug, (2000). For a fixed 𝐮p\mathbf{u}\in\mathds{R}^{p}, 𝐮𝟎\mathbf{u}\neq\boldsymbol{0}, and α(0,1)\alpha\in(0,1), we have

min𝐌𝕄(α)𝐮𝐌𝐮=1α𝒳𝕀{𝐮(x)𝐮aα(𝐮)}[𝐮(x)𝐮]μ(dx),\displaystyle\min_{\mathbf{M}^{\prime}\in\mathbb{M}(\alpha)}\mathbf{u}^{\top}\mathbf{M}^{\prime}\mathbf{u}=\frac{1}{\alpha}\,\int_{\mathscr{X}}\mathbb{I}_{\{\mathbf{u}^{\top}{\mathscr{M}}(x)\mathbf{u}\leq a_{\alpha}(\mathbf{u})\}}\,[\mathbf{u}^{\top}{\mathscr{M}}(x)\mathbf{u}]\,\mu(\mbox{\rm d}x)\,,

where the α\alpha-quantile aα(𝐮)a_{\alpha}(\mathbf{u}) satisfies 𝒳𝕀{𝐮(x)𝐮aα(𝐮)}μ(dx)=α\int_{\mathscr{X}}\mathbb{I}_{\{\mathbf{u}^{\top}{\mathscr{M}}(x)\mathbf{u}\leq a_{\alpha}(\mathbf{u})\}}\,\mu(\mbox{\rm d}x)=\alpha. For any aa\in\mathds{R} and 𝐮p\mathbf{u}\in\mathds{R}^{p}, denote

h(x;α,a,𝐮)=a1α[a𝐮(x)𝐮]+,x𝒳.\displaystyle h(x;\alpha,a,\mathbf{u})=a-\frac{1}{\alpha}\,[a-\mathbf{u}^{\top}{\mathscr{M}}(x)\mathbf{u}]^{+}\,,\ x\in{\mathscr{X}}\,.

We can write min𝐌𝕄(α)𝐮𝐌𝐮=𝖤{h(X;α,aα(𝐮),𝐮)}=supa𝖤{h(X;α,a,𝐮)}\min_{\mathbf{M}^{\prime}\in\mathbb{M}(\alpha)}\mathbf{u}^{\top}\mathbf{M}^{\prime}\mathbf{u}=\mathsf{E}\{h(X;\alpha,a_{\alpha}(\mathbf{u}),\mathbf{u})\}=\sup_{a\in\mathds{R}}\mathsf{E}\{h(X;\alpha,a,\mathbf{u})\}, where the expectation is with respect to XX distributed with μ\mu (Rockafellar and Uryasev,, 2000). Also, from Pflug, (2000), for any 𝐮𝒰(𝐌nk,k)\mathbf{u}\in{\mathscr{U}}(\mathbf{M}_{n_{k},k}^{*}) we can write 𝐮𝐌nk,k𝐮=𝖤μk{h(X;αk,znk:k(𝐮),𝐮)}=supa𝖤μk{h(X;αk,a,𝐮)}\mathbf{u}^{\top}\mathbf{M}_{n_{k},k}^{*}\mathbf{u}=\mathsf{E}_{\mu_{k}}\{h(X;\alpha_{k},z_{n_{k}:k}(\mathbf{u}),\mathbf{u})\}=\sup_{a\in\mathds{R}}\mathsf{E}_{\mu_{k}}\{h(X;\alpha_{k},a,\mathbf{u})\}, where 𝖤μk{}\mathsf{E}_{\mu_{k}}\{\cdot\} denotes expectation for the empirical measure μk=(1/k)i=1kδXi\mu_{k}=(1/k)\,\sum_{i=1}^{k}\delta_{X_{i}}.
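These two representations can be checked numerically. The following standalone sketch is only illustrative (the scalar sample zi plays the role of u⊤ℳ(Xi)u, and the Gamma distribution is an arbitrary stand-in for the law of this quantity under μ): the average of the nk smallest values, which is the minimum of u⊤M′u over nk-point subsets, coincides with the variational form supa 𝖤μk{h(X;αk,a,u)}.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, k = 0.2, 5_000
z = rng.gamma(shape=2.0, scale=1.0, size=k)   # z_i stands for u' M(X_i) u >= 0

# minimum over n_k-point subsets = average of the n_k smallest values
n_k = int(np.floor(alpha * k))
alpha_k = n_k / k
z_sorted = np.sort(z)
subset_min = z_sorted[:n_k].mean()

# variational (CVaR-type) form: sup_a { a - (1/alpha_k) * E_{mu_k}[(a - z)^+] },
# the supremum being attained at an order statistic of the sample
def h_mean(a):
    return a - np.mean(np.maximum(a - z, 0.0)) / alpha_k

variational = max(h_mean(a) for a in z_sorted)

print(subset_min, variational)   # identical up to rounding error
```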

Now, from HM-(i), for any 𝐮p\mathbf{u}\in\mathds{R}^{p} with 𝐮=1\|\mathbf{u}\|=1,

(1α)aα(𝐮)𝒳𝕀{𝐮(x)𝐮>aα(𝐮)}[𝐮(x)𝐮]μ(dx)<B.\displaystyle(1-\alpha)\,a_{\alpha}(\mathbf{u})\leq\int_{\mathscr{X}}\mathbb{I}_{\{\mathbf{u}^{\top}{\mathscr{M}}(x)\mathbf{u}>a_{\alpha}(\mathbf{u})\}}\,[\mathbf{u}^{\top}{\mathscr{M}}(x)\mathbf{u}]\,\mu(\mbox{\rm d}x)<\sqrt{B}\,. (B.3)

We also have (knk)znk:k(𝐮)i=nk+1kzi:k(𝐮)i=1kzi:k(𝐮)=k(𝐮𝐌k𝐮)kλmax(𝐌k)(k-n_{k})\,z_{n_{k}:k}(\mathbf{u})\leq\sum_{i=n_{k}+1}^{k}z_{i:k}(\mathbf{u})\leq\sum_{i=1}^{k}z_{i:k}(\mathbf{u})=k\,(\mathbf{u}^{\top}\mathbf{M}_{k}\mathbf{u})\leq k\,\lambda_{\max}(\mathbf{M}_{k}), with 𝐌k𝐌(μ)\mathbf{M}_{k}\rightarrow\mathbf{M}(\mu) a.s. as kk\rightarrow\infty. Denote zϵ¯=2B/(12ϵ)\overline{z_{\epsilon}}=2\,\sqrt{B}/(1-2\epsilon); since αkϵ\alpha_{k}\rightarrow\epsilon, from HM-(i) there exists a.s. k2k_{2} such that, for all k>k2k>k_{2}, znk:k(𝐮)<zϵ¯z_{n_{k}:k}(\mathbf{u})<\overline{z_{\epsilon}} and, from (B.3), aαk(𝐮)<zϵ¯a_{\alpha_{k}}(\mathbf{u})<\overline{z_{\epsilon}}.

Therefore, for large enough kk, for any 𝐮𝒰(𝐌nk,k)\mathbf{u}\in{\mathscr{U}}(\mathbf{M}_{n_{k},k}^{*}),

min𝐌𝕄(αk)𝐮(𝐌𝐌nk,k)𝐮\displaystyle\min_{\mathbf{M}^{\prime}\in\mathbb{M}(\alpha_{k})}\mathbf{u}^{\top}(\mathbf{M}^{\prime}-\mathbf{M}_{n_{k},k}^{*})\mathbf{u} =\displaystyle= 𝖤{h(X;αk,aαk(𝐮),𝐮)}𝖤μk{h(X;αk,znk:k(𝐮),𝐮)}\displaystyle\mathsf{E}\{h(X;\alpha_{k},a_{\alpha_{k}}(\mathbf{u}),\mathbf{u})\}-\mathsf{E}_{\mu_{k}}\{h(X;\alpha_{k},z_{n_{k}:k}(\mathbf{u}),\mathbf{u})\}
\displaystyle\leq supa[0,zϵ¯]|𝖤{h(X;αk,a,𝐮)}𝖤μk{h(X;αk,a,𝐮)}|.\displaystyle\sup_{a\in[0,\overline{z_{\epsilon}}]}\left|\mathsf{E}\{h(X;\alpha_{k},a,\mathbf{u})\}-\mathsf{E}_{\mu_{k}}\{h(X;\alpha_{k},a,\mathbf{u})\}\right|\,.

The functions h(;α,a,𝐮)h(\cdot;\alpha,a,\mathbf{u}) with α(ϵ,2ϵ)\alpha\in(\epsilon,2\epsilon), a[0,zϵ¯]a\in[0,\overline{z_{\epsilon}}] and 𝐮p\mathbf{u}\in\mathds{R}^{p}, 𝐮=1\|\mathbf{u}\|=1, form a Glivenko-Cantelli class; see (van der Vaart, 1998, p. 271). This implies that there exists a.s. k3k_{3} such that

max𝐮p:𝐮=1supa[0,zϵ¯]|𝖤{h(X;αk,a,𝐮)}𝖤μk{h(X;αk,a,𝐮)}|<cϵ/2,k>k3.\displaystyle\max_{\mathbf{u}\in\mathds{R}^{p}:\|\mathbf{u}\|=1}\sup_{a\in[0,\overline{z_{\epsilon}}]}\left|\mathsf{E}\{h(X;\alpha_{k},a,\mathbf{u})\}-\mathsf{E}_{\mu_{k}}\{h(X;\alpha_{k},a,\mathbf{u})\}\right|<c_{\epsilon}/2\,,\quad\forall k>k_{3}\,.

Therefore, from (B.2), λmin(𝐌nk,k)>cϵ/2\lambda_{\min}(\mathbf{M}_{n_{k},k}^{*})>c_{\epsilon}/2 for all k>k3k>k_{3}, which concludes the first part of the proof.

We construct now an upper bound on limsupkλmax(𝐌nk)\lim\sup_{k\rightarrow\infty}\lambda_{\max}(\mathbf{M}_{n_{k}}) following steps similar to the above developments but exploiting now the convexity of the criterion 𝐌1/Φ+(𝐌)=λmax(𝐌)\mathbf{M}\rightarrow 1/\Phi_{-\infty}^{+}(\mathbf{M})=\lambda_{\max}(\mathbf{M}). Its directional derivative is F1/Φ+(𝐌,𝐌)=max𝐮𝒰(𝐌)𝐮(𝐌𝐌)𝐮F_{1/\Phi_{-\infty}^{+}}(\mathbf{M},\mathbf{M}^{\prime})=\max_{\mathbf{u}\in{\mathscr{U}}(\mathbf{M})}\mathbf{u}^{\top}(\mathbf{M}^{\prime}-\mathbf{M})\mathbf{u}, with 𝒰(𝐌)={𝐮p:𝐮=1,𝐌𝐮=λmax(𝐌)𝐮}{\mathscr{U}}(\mathbf{M})=\{\mathbf{u}\in\mathds{R}^{p}:\,\|\mathbf{u}\|=1\,,\ \mathbf{M}\mathbf{u}=\lambda_{\max}(\mathbf{M})\mathbf{u}\}. Denote by 𝐌nk,k\mathbf{M}_{n_{k},k}^{*} the nkn_{k}-point design matrix that maximizes 1/Φ+1/\Phi_{-\infty}^{+} over the design space formed by kk points XiX_{i} i.i.d. with μ\mu. We can write 𝐌nk,k=(1/nk)i=1nk(Xki)\mathbf{M}_{n_{k},k}^{*}=(1/n_{k})\,\sum_{i=1}^{n_{k}}{\mathscr{M}}(X_{k_{i}}), where the kik_{i} correspond to the indices of positive uiu_{i} in the maximization of f(𝐮)=λmax[i=1kui(Xi)]f(\mathbf{u})=\lambda_{\max}[\sum_{i=1}^{k}u_{i}\,{\mathscr{M}}(X_{i})] with respect to 𝐮=(u1,,uk)\mathbf{u}=(u_{1},\ldots,u_{k}) under the constraints ui{0,1}u_{i}\in\{0,1\} for all ii and iui=nk\sum_{i}u_{i}=n_{k}. Any matrix 𝐌nk\mathbf{M}_{n_{k}} obtained by selecting nkn_{k} distinct points XiX_{i} among X1,,XkX_{1},\ldots,X_{k} satisfies λmax(𝐌nk)λmax(𝐌nk,k)\lambda_{\max}(\mathbf{M}_{n_{k}})\leq\lambda_{\max}(\mathbf{M}_{n_{k},k}^{*}).

For any 𝐮𝒰(𝐌nk,k)\mathbf{u}\in{\mathscr{U}}(\mathbf{M}_{n_{k},k}^{*}) we can write 𝐮𝐌nk,k𝐮=λmax(𝐌nk,k)=max𝐯p:𝐯=1𝐯𝐌nk,k𝐯=(1/nk)i=1nkzi:k(𝐮)\mathbf{u}^{\top}\mathbf{M}_{n_{k},k}^{*}\mathbf{u}=\lambda_{\max}(\mathbf{M}_{n_{k},k}^{*})=\max_{\mathbf{v}\in\mathds{R}^{p}:\,\|\mathbf{v}\|=1}\mathbf{v}^{\top}\mathbf{M}_{n_{k},k}^{*}\mathbf{v}=(1/n_{k})\,\sum_{i=1}^{n_{k}}z_{i:k}(\mathbf{u}), where the zi:k(𝐮)z_{i:k}(\mathbf{u}) correspond to the values of 𝐮(Xi)𝐮\mathbf{u}^{\top}{\mathscr{M}}(X_{i})\mathbf{u} sorted by decreasing order for i=1,,ki=1,\ldots,k. For any m{1,,nk1}m\in\{1,\ldots,n_{k}-1\}, we thus have

λmax(𝐌nk,k)1mi=1mzi:k(𝐮)λmax(𝐌m,k),\displaystyle\lambda_{\max}(\mathbf{M}_{n_{k},k}^{*})\leq\frac{1}{m}\,\sum_{i=1}^{m}z_{i:k}(\mathbf{u})\leq\lambda_{\max}(\mathbf{M}_{m,k}^{*})\,,

showing that the worst case corresponds to the smallest admissible nkn_{k}, and we can restrict our attention to the case when αk=nk/kϵ\alpha_{k}=n_{k}/k\rightarrow\epsilon as kk\rightarrow\infty.

The convexity of 1/Φ+1/\Phi_{-\infty}^{+} implies that, for any 𝐌𝕄\mathbf{M}^{\prime}\in\mathbb{M}^{\geq},

λmax(𝐌)λmax(𝐌nk,k)+F1/Φ+(𝐌nk,k,𝐌).\displaystyle\lambda_{\max}(\mathbf{M}^{\prime})\geq\lambda_{\max}(\mathbf{M}_{n_{k},k}^{*})+F_{1/\Phi_{-\infty}^{+}}(\mathbf{M}_{n_{k},k}^{*},\mathbf{M}^{\prime})\,. (B.4)

Take 𝐌𝕄(αk)\mathbf{M}^{\prime}\in\mathbb{M}(\alpha_{k}), corresponding to some ξαk\xi_{\alpha_{k}}. From HM-(i),

λmax(𝐌)=λmax[𝒳(x)ξαk(dx)]1αkλmax[𝐌(μ)]<Bαk.\displaystyle\lambda_{\max}(\mathbf{M}^{\prime})=\lambda_{\max}\left[\int_{{\mathscr{X}}}{\mathscr{M}}(x)\,\xi_{\alpha_{k}}(\mbox{\rm d}x)\right]\leq\frac{1}{\alpha_{k}}\,\lambda_{\max}[\mathbf{M}(\mu)]<\frac{\sqrt{B}}{\alpha_{k}}\,.

Therefore, there exists some k1k_{1} such that, for all k>k1k>k_{1}, λmax(𝐌)<2B/ϵ\lambda_{\max}(\mathbf{M}^{\prime})<2\sqrt{B}/\epsilon, and (B.4) gives

2Bϵλmax(𝐌nk,k)+max𝐮𝒰(𝐌nk,k)max𝐌𝕄(αk)𝐮(𝐌𝐌nk,k)𝐮.\displaystyle\frac{2\,\sqrt{B}}{\epsilon}\geq\lambda_{\max}(\mathbf{M}_{n_{k},k}^{*})+\max_{\mathbf{u}\in{\mathscr{U}}(\mathbf{M}_{n_{k},k}^{*})}\max_{\mathbf{M}^{\prime}\in\mathbb{M}(\alpha_{k})}\mathbf{u}^{\top}(\mathbf{M}^{\prime}-\mathbf{M}_{n_{k},k}^{*})\mathbf{u}\,.

For aa\in\mathds{R}, α(0,1)\alpha\in(0,1) and 𝐮p\mathbf{u}\in\mathds{R}^{p}, denote h(x;α,a,𝐮)=a+(1/α)[𝐮(x)𝐮a]+h(x;\alpha,a,\mathbf{u})=a+(1/\alpha)[\mathbf{u}^{\top}{\mathscr{M}}(x)\mathbf{u}-a]^{+}, x𝒳x\in{\mathscr{X}}. We have λmax(𝐌nk,k)=(1/nk)i=1nkzi:k(𝐮)=𝖤μk{h(X;αk,znk:k(𝐮),𝐮)}=infa𝖤μk{h(X;αk,a,𝐮)}\lambda_{\max}(\mathbf{M}_{n_{k},k}^{*})=(1/n_{k})\sum_{i=1}^{n_{k}}z_{i:k}(\mathbf{u})=\mathsf{E}_{\mu_{k}}\{h(X;\alpha_{k},z_{n_{k}:k}(\mathbf{u}),\mathbf{u})\}=\inf_{a}\mathsf{E}_{\mu_{k}}\{h(X;\alpha_{k},a,\mathbf{u})\}, 𝐮𝒰(𝐌nk,k)\mathbf{u}\in{\mathscr{U}}(\mathbf{M}_{n_{k},k}^{*}), with znk:k(𝐮)z_{n_{k}:k}(\mathbf{u}) satisfying 0nkznk:k(𝐮)i=1nkzi:k(𝐮)<i=1kzi:k(𝐮)=kλmax(𝐌k)0\leq n_{k}\,z_{n_{k}:k}(\mathbf{u})\leq\sum_{i=1}^{n_{k}}z_{i:k}(\mathbf{u})<\sum_{i=1}^{k}z_{i:k}(\mathbf{u})=k\,\lambda_{max}(\mathbf{M}_{k}). Also, for any α(0,1)\alpha\in(0,1) and 𝐮p\mathbf{u}\in\mathds{R}^{p}, 𝐮𝟎\mathbf{u}\neq\boldsymbol{0}, max𝐌𝕄(α)𝐮𝐌𝐮=𝖤{h(X;α,aα(𝐮),𝐮)}=infa𝖤{h(X;α,a,𝐮)}\max_{\mathbf{M}^{\prime}\in\mathbb{M}(\alpha)}\mathbf{u}^{\top}\mathbf{M}^{\prime}\mathbf{u}=\mathsf{E}\{h(X;\alpha,a_{\alpha}(\mathbf{u}),\mathbf{u})\}=\inf_{a}\mathsf{E}\{h(X;\alpha,a,\mathbf{u})\}, where aα(𝐮)a_{\alpha}(\mathbf{u}) satisfies 𝒳𝕀{𝐮(x)𝐮aα(𝐮)}μ(dx)=α\int_{\mathscr{X}}\mathbb{I}_{\{\mathbf{u}^{\top}{\mathscr{M}}(x)\mathbf{u}\geq a_{\alpha}(\mathbf{u})\}}\,\mu(\mbox{\rm d}x)=\alpha, and HM-(i) implies that 0aα(𝐮)(1/α)𝒳𝕀{𝐮(x)𝐮aα(𝐮)}𝐮(x)𝐮μ(dx)<B/α0\leq a_{\alpha}(\mathbf{u})\leq(1/\alpha)\int_{\mathscr{X}}\mathbb{I}_{\{\mathbf{u}^{\top}{\mathscr{M}}(x)\mathbf{u}\geq a_{\alpha}(\mathbf{u})\}}\,\mathbf{u}^{\top}{\mathscr{M}}(x)\mathbf{u}\,\mu(\mbox{\rm d}x)<\sqrt{B}/\alpha. Since αk=nk/kϵ\alpha_{k}=n_{k}/k\rightarrow\epsilon and 𝐌k𝐌(μ)\mathbf{M}_{k}\rightarrow\mathbf{M}(\mu) a.s., there exists a.s. k2k_{2} such that, for all k>k2k>k_{2}, 0aαk(𝐮)<2B/ϵ0\leq a_{\alpha_{k}}(\mathbf{u})<2\sqrt{B}/\epsilon and 0znk:k(𝐮)<2B/ϵ0\leq z_{n_{k}:k}(\mathbf{u})<2\sqrt{B}/\epsilon. This implies that, for 𝐮𝒰(𝐌nk,k)\mathbf{u}\in{\mathscr{U}}(\mathbf{M}_{n_{k},k}^{*}) and k>k2k>k_{2},

max𝐌𝕄(αk)𝐮(𝐌𝐌nk,k)𝐮\displaystyle\max_{\mathbf{M}^{\prime}\in\mathbb{M}(\alpha_{k})}\mathbf{u}^{\top}(\mathbf{M}^{\prime}-\mathbf{M}_{n_{k},k}^{*})\mathbf{u} =\displaystyle= 𝖤{h(X;αk,aαk(𝐮),𝐮)}𝖤μk{h(X;αk,znk:k(𝐮),𝐮)}\displaystyle\mathsf{E}\{h(X;\alpha_{k},a_{\alpha_{k}}(\mathbf{u}),\mathbf{u})\}-\mathsf{E}_{\mu_{k}}\{h(X;\alpha_{k},z_{n_{k}:k}(\mathbf{u}),\mathbf{u})\}
\displaystyle\leq supa[0,2B/ϵ]|𝖤{h(X;αk,a,𝐮)}𝖤μk{h(X;αk,a,𝐮)}|.\displaystyle\sup_{a\in[0,2\sqrt{B}/\epsilon]}\left|\mathsf{E}\{h(X;\alpha_{k},a,\mathbf{u})\}-\mathsf{E}_{\mu_{k}}\{h(X;\alpha_{k},a,\mathbf{u})\}\right|\,.

The rest of the proof is similar to the case above for λmin\lambda_{\min}, using the fact that the functions h(;α,a,𝐮)h(\cdot;\alpha,a,\mathbf{u}) with α(ϵ,2ϵ)\alpha\in(\epsilon,2\epsilon), a[0,2B/ϵ]a\in[0,2\sqrt{B}/\epsilon] and 𝐮p\mathbf{u}\in\mathds{R}^{p}, 𝐮=1\|\mathbf{u}\|=1, form a Glivenko-Cantelli class.    

Appendix C Convergence of C^k\widehat{C}_{k}

We consider the convergence properties of (3.6) when the matrix 𝐌k\mathbf{M}_{k} is fixed, that is,

C^k+1=C^k+βk(k+1)q(𝕀{Zk+1C^k}α),\displaystyle\widehat{C}_{k+1}=\widehat{C}_{k}+\frac{\beta_{k}}{(k+1)^{q}}\,\left(\mathbb{I}_{\{Z_{k+1}\geq\widehat{C}_{k}\}}-\alpha\right)\,, (C.1)

where the ZkZ_{k} have a fixed distribution with uniformly bounded density ff such that f(C1α)>0f(C_{1-\alpha})>0. We follow the arguments of Tierney, (1983). The construction of βk\beta_{k} is as in Algorithm 1, with βk=min{1/f^k,β0kγ}\beta_{k}=\min\{1/\widehat{f}_{k},\beta_{0}\,k^{\gamma}\} and f^k\widehat{f}_{k} following the recursion

f^k+1=f^k+1(k+1)q[12hk+1𝕀{|Zk+1C^k|hk+1}f^k]\displaystyle\widehat{f}_{k+1}=\widehat{f}_{k}+\frac{1}{(k+1)^{q}}\,\left[\frac{1}{2\,h_{k+1}}\,\mathbb{I}_{\{|Z_{k+1}-\widehat{C}_{k}|\leq h_{k+1}\}}-\widehat{f}_{k}\right] (C.2)

with hk=h/kγh_{k}=h/k^{\gamma}.
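A direct transcription of (C.1)–(C.2) is given below as a minimal sketch (the function name, the initial values Ĉ1 = 0 and f̂1 = 1, the tuning constants and the normal test stream are illustrative choices, not prescriptions of the paper).

```python
import numpy as np

def recursive_quantile(z_stream, alpha, beta0=1.0, h=1.0, q=0.9, gamma=0.35):
    """Recursions (C.1)-(C.2): streaming estimation of the (1-alpha)-quantile.

    Tuning: 1/2 < q <= 1 and 0 < gamma < q - 1/2, with
    beta_k = min{1/f_hat_k, beta0 * k^gamma} and h_k = h / k^gamma.
    """
    C_hat, f_hat = 0.0, 1.0                      # arbitrary initializations C_hat_1, f_hat_1
    for k, z in enumerate(z_stream, start=1):    # z plays the role of Z_{k+1}
        h_next = h / (k + 1) ** gamma
        beta_k = min(1.0 / max(f_hat, 1e-12), beta0 * k ** gamma)
        near = abs(z - C_hat) <= h_next                                       # indicator in (C.2)
        C_new = C_hat + beta_k / (k + 1) ** q * (float(z >= C_hat) - alpha)   # (C.1)
        f_hat = f_hat + (near / (2.0 * h_next) - f_hat) / (k + 1) ** q        # (C.2)
        C_hat = C_new
    return C_hat

# usage: (1-alpha)-quantile of a standard normal stream; the limit is about 1.2816
rng = np.random.default_rng(1)
print(recursive_quantile(rng.standard_normal(200_000), alpha=0.1))
```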

Theorem C.1.

Let α(0,1)\alpha\in(0,1), β0>0\beta_{0}>0, h>0h>0, 1/2<q11/2<q\leq 1, 0<γ<q1/20<\gamma<q-1/2. Let FF be a distribution function such that f(t)=dF(t)/dtf(t)=\mbox{\rm d}F(t)/\mbox{\rm d}t exists for all tt, is uniformly bounded, and is strictly positive in a neighborhood of C1αC_{1-\alpha}, the unique value of CC such that F(C)=1αF(C)=1-\alpha. Let (Zi)(Z_{i}) be an i.i.d. sequence distributed with FF and define C^k\widehat{C}_{k} and f^k\widehat{f}_{k} by (C.1) and (C.2) respectively, with βk=min{1/f^k,β0kγ}\beta_{k}=\min\{1/\widehat{f}_{k},\beta_{0}\,k^{\gamma}\} and hk=h/kγh_{k}=h/k^{\gamma}. Then, C^kC1α\widehat{C}_{k}\rightarrow C_{1-\alpha} a.s. when kk\rightarrow\infty.

Proof. We first show that f^k\widehat{f}_{k} is a.s. bounded. From the mean-value theorem, there exists a tkt_{k} in [C^khk+1,C^k+hk+1][\widehat{C}_{k}-h_{k+1},\widehat{C}_{k}+h_{k+1}] such that Pr{|Zk+1C^k|hk+1}=F(C^k+hk+1)F(C^khk+1)=2hk+1f(tk)\Pr\{|Z_{k+1}-\widehat{C}_{k}|\leq h_{k+1}\}=F(\widehat{C}_{k}+h_{k+1})-F(\widehat{C}_{k}-h_{k+1})=2\,h_{k+1}\,f(t_{k}). Denote ωk+1=𝕀{|Zk+1C^k|hk+1}2hk+1f(tk)\omega_{k+1}=\mathbb{I}_{\{|Z_{k+1}-\widehat{C}_{k}|\leq h_{k+1}\}}-2\,h_{k+1}\,f(t_{k}). We can write

f^k+1=(1Bk)f^k+Ak+Ak\displaystyle\widehat{f}_{k+1}=(1-B_{k})\,\widehat{f}_{k}+A_{k}+A^{\prime}_{k}

where Bk=1/[(k+1)q]B_{k}=1/[(k+1)^{q}], Ak=ωk+1/[2hk+1(k+1)q]A_{k}=\omega_{k+1}/[2\,h_{k+1}\,(k+1)^{q}] and Ak=Bkf(tk)A^{\prime}_{k}=B_{k}\,f(t_{k}). Therefore,

f^k+1=f^1i=1k(1Bi)+j=1k(Aj+Aj)i=j+1k(1Bi).\displaystyle\widehat{f}_{k+1}=\widehat{f}_{1}\,\prod_{i=1}^{k}(1-B_{i})+\sum_{j=1}^{k}(A_{j}+A^{\prime}_{j})\,\prod_{i=j+1}^{k}(1-B_{i})\,.

We have i=1k(1Bi)<exp(i=1kBi)0\prod_{i=1}^{k}(1-B_{i})<\exp(-\sum_{i=1}^{k}B_{i})\rightarrow 0 as kk\rightarrow\infty since q1q\leq 1. Next, for hk=h/kγh_{k}=h/k^{\gamma} and 0<γ<q1/20<\gamma<q-1/2, k1/[hkkq]2<\sum_{k}1/[h_{k}\,k^{q}]^{2}<\infty, j=1kAj\sum_{j=1}^{k}A_{j} forms an 2{\mathscr{L}}^{2}-bounded martingale and therefore converges a.s. to some limit. Lemma 2 of Albert and Gardner, (1967, p. 190) then implies that j=1kAji=j+1k(1Bi)0\sum_{j=1}^{k}A_{j}\,\prod_{i=j+1}^{k}(1-B_{i})\rightarrow 0 a.s. as kk\rightarrow\infty. Consider now the term Tk=j=1kAji=j+1k(1Bi)T_{k}=\sum_{j=1}^{k}A^{\prime}_{j}\,\prod_{i=j+1}^{k}(1-B_{i}). Since ff is bounded, Aj<f¯BjA^{\prime}_{j}<\bar{f}\,B_{j} for some f¯<\bar{f}<\infty and

Tk<f¯j=1kBji=j+1k(1Bi)=f¯[1i=1k(1Bi)]<f¯,\displaystyle T_{k}<\bar{f}\,\sum_{j=1}^{k}B_{j}\,\prod_{i=j+1}^{k}(1-B_{i})=\bar{f}\,\left[1-\prod_{i=1}^{k}(1-B_{i})\right]<\bar{f}\,,

where the equality follows from Albert and Gardner, (1967, Lemma 1, p. 189). This shows that f^k\widehat{f}_{k} is a.s. bounded. Therefore, βk=min{1/f^k,β0kγ}\beta_{k}=\min\{1/\widehat{f}_{k},\beta_{0}\,k^{\gamma}\} is a.s. bounded away from zero.

We consider now the convergence of (C.1). Following Tierney, (1983), define

Dk=βk(k+1)q{𝕀{Zk+1C^k}[1F(C^k)]} and Ek=βk(k+1)qF(C^k)(1α)C^kC1α.\displaystyle D_{k}=\frac{\beta_{k}}{(k+1)^{q}}\,\left\{\mathbb{I}_{\{Z_{k+1}\geq\widehat{C}_{k}\}}-[1-F(\widehat{C}_{k})]\right\}\mbox{ and }E_{k}=\frac{\beta_{k}}{(k+1)^{q}}\,\frac{F(\widehat{C}_{k})-(1-\alpha)}{\widehat{C}_{k}-C_{1-\alpha}}\,.

Denote by k{\mathscr{F}}_{k} the increasing sequence of σ\sigma-fields generated by the ZiZ_{i}; we have 𝖤{Dk|k}=0\mathsf{E}\{D_{k}|{\mathscr{F}}_{k}\}=0 and 𝖤{Dk2|k}=βk2F(C^k)[1F(C^k)]/(k+1)2q\mathsf{E}\{D_{k}^{2}|{\mathscr{F}}_{k}\}=\beta_{k}^{2}\,F(\widehat{C}_{k})\,[1-F(\widehat{C}_{k})]/(k+1)^{2q}. We can rewrite (C.1) as C^k+1C1α=(C^kC1α)(1Ek)+Dk\widehat{C}_{k+1}-C_{1-\alpha}=(\widehat{C}_{k}-C_{1-\alpha})\,(1-E_{k})+D_{k}, which gives

𝖤{(C^k+1C1α)2|k}=(C^kC1α)2(1Ek)2+βk2(k+1)2qF(C^k)[1F(C^k)].\displaystyle\mathsf{E}\{(\widehat{C}_{k+1}-C_{1-\alpha})^{2}|{\mathscr{F}}_{k}\}=(\widehat{C}_{k}-C_{1-\alpha})^{2}\,(1-E_{k})^{2}+\frac{\beta_{k}^{2}}{(k+1)^{2q}}\,F(\widehat{C}_{k})\,[1-F(\widehat{C}_{k})]\,.

Ek0E_{k}\geq 0 for all kk, and [F(C^k)(1α)]/(C^kC1α)[F(\widehat{C}_{k})-(1-\alpha)]/(\widehat{C}_{k}-C_{1-\alpha}) is bounded since ff is bounded; moreover, βkβ0kγ\beta_{k}\leq\beta_{0}\,k^{\gamma} with γ<q\gamma<q, so that βk/(k+1)q0\beta_{k}/(k+1)^{q}\rightarrow 0 and therefore Ek0E_{k}\rightarrow 0. Since 0<γ<q1/20<\gamma<q-1/2, kβk2/(k+1)2q<\sum_{k}\beta_{k}^{2}/(k+1)^{2q}<\infty. The theorem of Robbins and Siegmund, (1971) then implies that C^k\widehat{C}_{k} converges a.s. to some limit and that k(C^kC1α)2[1(1Ek)2]<\sum_{k}(\widehat{C}_{k}-C_{1-\alpha})^{2}\,[1-(1-E_{k})^{2}]<\infty a.s.; since Ek0E_{k}\rightarrow 0, we obtain k(C^kC1α)2Ek<\sum_{k}(\widehat{C}_{k}-C_{1-\alpha})^{2}\,E_{k}<\infty a.s. Since q1q\leq 1, βk\beta_{k} is a.s. bounded away from zero, and ff is strictly positive in a neighborhood of C1αC_{1-\alpha}, we obtain that kEk=\sum_{k}E_{k}=\infty, implying that C^kC1α\widehat{C}_{k}\rightarrow C_{1-\alpha} a.s., which concludes the proof.

Tierney, (1983) uses q=1q=1; the continuity of ff at C1αC_{1-\alpha} then implies that f^kf(C1α)\widehat{f}_{k}\rightarrow f(C_{1-\alpha}) a.s., and his construction also achieves the optimal rate of convergence of C^k\widehat{C}_{k} to C1αC_{1-\alpha}, with k(C^kC1α)d𝒩(0,α(1α)/f2(C1α))\sqrt{k}(\widehat{C}_{k}-C_{1-\alpha})\stackrel{{\scriptstyle\rm d}}{{\rightarrow}}{\mathscr{N}}(0,\alpha(1-\alpha)/f^{2}(C_{1-\alpha})) as kk\rightarrow\infty.

Appendix D Lipschitz continuity of C1α(𝐌)C_{1-\alpha}(\mathbf{M})

Lemma D.1.

Under HΦ and Hμ,M, the (1α)(1-\alpha)-quantile C1α(𝐌)C_{1-\alpha}(\mathbf{M}) of the distribution F𝐌F_{\mathbf{M}} of Z𝐌(X)=FΦ[𝐌,(X)]Z_{\mathbf{M}}(X)=F_{\Phi}[\mathbf{M},{\mathscr{M}}(X)] is a Lipschitz continuous function of 𝐌𝕄,L\mathbf{M}\in\mathbb{M}_{\ell,L}^{\geq}.

Proof. For any 𝐀𝕄>\mathbf{A}\in\mathbb{M}^{>}, define the random variable T𝐀(X)=trace[𝐀(X)]T_{\mathbf{A}}(X)=\operatorname{trace}[\mathbf{A}{\mathscr{M}}(X)] and denote G𝐀G_{\mathbf{A}} its distribution function and Q1α(𝐀)Q_{1-\alpha}(\mathbf{A}) the associated (1α)(1-\alpha)-quantile. We have Z𝐌(X)=TΦ(𝐌)(X)trace[Φ(𝐌)𝐌]Z_{\mathbf{M}}(X)=T_{\nabla_{\Phi}(\mathbf{M})}(X)-\operatorname{trace}[\nabla_{\Phi}(\mathbf{M})\mathbf{M}], and therefore

C1α(𝐌)=Q1α[Φ(𝐌)]trace[Φ(𝐌)𝐌].\displaystyle C_{1-\alpha}(\mathbf{M})=Q_{1-\alpha}[\nabla_{\Phi}(\mathbf{M})]-\operatorname{trace}[\nabla_{\Phi}(\mathbf{M})\mathbf{M}]\,. (D.1)

We first show that trace[Φ(𝐌)𝐌]\operatorname{trace}[\nabla_{\Phi}(\mathbf{M})\mathbf{M}] is Lipschitz continuous in 𝐌\mathbf{M}. Indeed, for any 𝐌\mathbf{M}, 𝐌\mathbf{M}^{\prime} in 𝕄,L\mathbb{M}_{\ell,L}^{\geq}, we have

|trace[Φ(𝐌)𝐌]trace[Φ(𝐌)𝐌]|\displaystyle\left|\operatorname{trace}[\nabla_{\Phi}(\mathbf{M}^{\prime})\mathbf{M}^{\prime}]-\operatorname{trace}[\nabla_{\Phi}(\mathbf{M})\mathbf{M}]\right| \displaystyle\leq 𝐌Φ(𝐌)Φ(𝐌)+Φ(𝐌)𝐌𝐌\displaystyle\|\mathbf{M}^{\prime}\|\,\|\nabla_{\Phi}(\mathbf{M}^{\prime})-\nabla_{\Phi}(\mathbf{M})\|+\|\nabla_{\Phi}(\mathbf{M})\|\,\|\mathbf{M}^{\prime}-\mathbf{M}\| (D.2)
<\displaystyle< [LpK+A()]𝐌𝐌,\displaystyle[L\sqrt{p}\,K_{\ell}+A(\ell)]\,\|\mathbf{M}^{\prime}-\mathbf{M}\|\,,

where we used HΦ and the fact that 𝐌,𝐌𝕄,L\mathbf{M},\mathbf{M}^{\prime}\in\mathbb{M}_{\ell,L}^{\geq}.

Consider now G𝐀G_{\mathbf{A}} and G𝐀G_{\mathbf{A}^{\prime}} for two matrices 𝐀\mathbf{A} and 𝐀\mathbf{A}^{\prime} in 𝕄>\mathbb{M}^{>}. We have

G𝐀(t)G𝐀(t)=𝒳(𝕀{trace[𝐀(x)]t}𝕀{trace[𝐀(x)]t})μ(dx),\displaystyle G_{\mathbf{A}^{\prime}}(t)-G_{\mathbf{A}}(t)=\int_{\mathscr{X}}\left(\mathbb{I}_{\{\operatorname{trace}[\mathbf{A}^{\prime}{\mathscr{M}}(x)]\leq t\}}-\mathbb{I}_{\{\operatorname{trace}[\mathbf{A}{\mathscr{M}}(x)]\leq t\}}\right)\,\mu(\mbox{\rm d}x)\,,

and therefore

|G𝐀(t)G𝐀(t)|\displaystyle\left|G_{\mathbf{A}^{\prime}}(t)-G_{\mathbf{A}}(t)\right| \displaystyle\leq 𝖯𝗋𝗈𝖻{min{trace[𝐀(X)],trace[𝐀(X)]}t\displaystyle\mathsf{Prob}\left\{\min\{\operatorname{trace}[\mathbf{A}^{\prime}{\mathscr{M}}(X)],\operatorname{trace}[\mathbf{A}{\mathscr{M}}(X)]\}\leq t\right.
max{trace[𝐀(X)],trace[𝐀(X)]}}\displaystyle\hskip 56.9055pt\left.\leq\max\{\operatorname{trace}[\mathbf{A}^{\prime}{\mathscr{M}}(X)],\operatorname{trace}[\mathbf{A}{\mathscr{M}}(X)]\}\right\}
𝖯𝗋𝗈𝖻{trace[(𝐀𝐀𝐀𝐈p)(X)]ttrace[(𝐀+𝐀𝐀𝐈p)(X)]},\displaystyle\leq\mathsf{Prob}\left\{\operatorname{trace}[(\mathbf{A}-\|\mathbf{A}-\mathbf{A}^{\prime}\|\mathbf{I}_{p}){\mathscr{M}}(X)]\leq t\leq\operatorname{trace}[(\mathbf{A}+\|\mathbf{A}-\mathbf{A}^{\prime}\|\mathbf{I}_{p}){\mathscr{M}}(X)]\right\}\,,

with 𝐈p\mathbf{I}_{p} the p×pp\times p identity matrix. Since 𝐀λmin(𝐀)𝐈p𝕄\mathbf{A}-\lambda_{\min}(\mathbf{A})\,\mathbf{I}_{p}\in\mathbb{M}^{\geq}, denoting b1=1𝐀𝐀/λmin(𝐀)b_{1}=1-\|\mathbf{A}-\mathbf{A}^{\prime}\|/\lambda_{\min}(\mathbf{A}) and b2=1+𝐀𝐀/λmin(𝐀)b_{2}=1+\|\mathbf{A}-\mathbf{A}^{\prime}\|/\lambda_{\min}(\mathbf{A}), we obtain

|G𝐀(t)G𝐀(t)|\displaystyle\left|G_{\mathbf{A}^{\prime}}(t)-G_{\mathbf{A}}(t)\right| \displaystyle\leq 𝖯𝗋𝗈𝖻{b1trace[𝐀(X)]tb2trace[𝐀(X)]}\displaystyle\mathsf{Prob}\left\{b_{1}\,\operatorname{trace}[\mathbf{A}{\mathscr{M}}(X)]\leq t\leq b_{2}\,\operatorname{trace}[\mathbf{A}{\mathscr{M}}(X)]\right\} (D.3)
=𝖯𝗋𝗈𝖻{trace[𝐀(X)]tb1trace[𝐀(X)]tb2}\displaystyle=\mathsf{Prob}\left\{\operatorname{trace}[\mathbf{A}{\mathscr{M}}(X)]\leq\frac{t}{b_{1}}\bigwedge\operatorname{trace}[\mathbf{A}{\mathscr{M}}(X)]\geq\frac{t}{b_{2}}\right\}
=G𝐀(t/b1)G𝐀(t/b2).\displaystyle=G_{\mathbf{A}}(t/b_{1})-G_{\mathbf{A}}(t/b_{2})\,.

In the rest of the proof we show that Q1α[Φ(𝐌)]Q_{1-\alpha}[\nabla_{\Phi}(\mathbf{M})] is Lipschitz continuous in 𝐌\mathbf{M}. Take two matrices 𝐌,𝐌𝕄,L\mathbf{M},\mathbf{M}^{\prime}\in\mathbb{M}_{\ell,L}^{\geq}, and consider the associated (1α)(1-\alpha)-quantiles Q1α[Φ(𝐌)]Q_{1-\alpha}[\nabla_{\Phi}(\mathbf{M})] and Q1α[Φ(𝐌)]Q_{1-\alpha}[\nabla_{\Phi}(\mathbf{M}^{\prime})], which we shall respectively denote Q1αQ_{1-\alpha} and Q1αQ^{\prime}_{1-\alpha} to simplify notation. From Hμ,M, the p.d.f. ψ𝐌\psi_{\mathbf{M}} associated with GΦ(𝐌)G_{\nabla_{\Phi}(\mathbf{M})} is continuous at Q1αQ_{1-\alpha} and satisfies ψ𝐌(Q1α)>ϵ,L\psi_{\mathbf{M}}(Q_{1-\alpha})>\epsilon_{\ell,L}. From the identities

Q1αψ𝐌(z)dz=Q1αψ𝐌(z)dz=1α,\displaystyle\int_{-\infty}^{Q_{1-\alpha}}\psi_{\mathbf{M}}(z)\,\mbox{\rm d}z=\int_{-\infty}^{Q^{\prime}_{1-\alpha}}\psi_{\mathbf{M}^{\prime}}(z)\,\mbox{\rm d}z=1-\alpha\,,

we deduce

|Q1αQ1αψ𝐌(z)dz|=|Q1α[ψ𝐌(z)ψ𝐌(z)]dz|=|GΦ(𝐌)(Q1α)GΦ(𝐌)(Q1α)|.\displaystyle\left|\int_{Q_{1-\alpha}}^{Q^{\prime}_{1-\alpha}}\psi_{\mathbf{M}}(z)\,\mbox{\rm d}z\right|=\left|\int_{-\infty}^{Q^{\prime}_{1-\alpha}}[\psi_{\mathbf{M}^{\prime}}(z)-\psi_{\mathbf{M}}(z)]\,\mbox{\rm d}z\right|=\left|G_{\nabla_{\Phi}(\mathbf{M}^{\prime})}(Q^{\prime}_{1-\alpha})-G_{\nabla_{\Phi}(\mathbf{M})}(Q^{\prime}_{1-\alpha})\right|\,. (D.4)

From HΦ, when substituting Φ(𝐌){\nabla_{\Phi}(\mathbf{M})} for 𝐀\mathbf{A} and Φ(𝐌){\nabla_{\Phi}(\mathbf{M}^{\prime})} for 𝐀\mathbf{A}^{\prime} in b1b_{1} and b2b_{2}, we get b1>B1=1K𝐌𝐌/a(L)b_{1}>B_{1}=1-K_{\ell}\|\mathbf{M}^{\prime}-\mathbf{M}\|/a(L) and b2<B2=1+K𝐌𝐌/a(L)b_{2}<B_{2}=1+K_{\ell}\|\mathbf{M}^{\prime}-\mathbf{M}\|/a(L), showing that Q1αQ1αQ^{\prime}_{1-\alpha}\rightarrow Q_{1-\alpha} as 𝐌𝐌0\|\mathbf{M}^{\prime}-\mathbf{M}\|\rightarrow 0. Therefore, there exists some β1\beta_{1} such that, for 𝐌𝐌<β1\|\mathbf{M}^{\prime}-\mathbf{M}\|<\beta_{1} we have

|Q1αQ1αψ𝐌(z)dz|>12|Q1αQ1α|ϵ,L.\displaystyle\left|\int_{Q_{1-\alpha}}^{Q^{\prime}_{1-\alpha}}\psi_{\mathbf{M}}(z)\,\mbox{\rm d}z\right|>\frac{1}{2}\,\left|Q^{\prime}_{1-\alpha}-Q_{1-\alpha}\right|\,\epsilon_{\ell,L}\,. (D.5)

Using (D.3), we also obtain for 𝐌𝐌\|\mathbf{M}^{\prime}-\mathbf{M}\| smaller than some β2\beta_{2}

|GΦ(𝐌)(Q1α)GΦ(𝐌)(Q1α)|\displaystyle\left|G_{\nabla_{\Phi}(\mathbf{M}^{\prime})}(Q^{\prime}_{1-\alpha})-G_{\nabla_{\Phi}(\mathbf{M})}(Q^{\prime}_{1-\alpha})\right| \displaystyle\leq GΦ(𝐌)(Q1α/B1)GΦ(𝐌)(Q1α/B2)\displaystyle G_{\nabla_{\Phi}(\mathbf{M})}(Q^{\prime}_{1-\alpha}/B_{1})-G_{\nabla_{\Phi}(\mathbf{M})}(Q^{\prime}_{1-\alpha}/B_{2})
<\displaystyle< 2ψ𝐌(Q1α)(Q1αB1Q1αB2)\displaystyle 2\psi_{\mathbf{M}}(Q^{\prime}_{1-\alpha})\,\left(\frac{Q^{\prime}_{1-\alpha}}{B_{1}}-\frac{Q^{\prime}_{1-\alpha}}{B_{2}}\right)
< 4𝐌𝐌ψ𝐌(Q1α)Q1αa(L)K(a2(L)/K2𝐌𝐌2).\displaystyle<\ 4\|\mathbf{M}^{\prime}-\mathbf{M}\|\,\psi_{\mathbf{M}}(Q^{\prime}_{1-\alpha})\,Q^{\prime}_{1-\alpha}\,\frac{a(L)}{K_{\ell}\left(a^{2}(L)/K_{\ell}^{2}-\|\mathbf{M}^{\prime}-\mathbf{M}\|^{2}\right)}\,.

Therefore, when 𝐌𝐌<a(L)/(K2)\|\mathbf{M}^{\prime}-\mathbf{M}\|<a(L)/(K_{\ell}\sqrt{2}),

|GΦ(𝐌)(Q1α)GΦ(𝐌)(Q1α)|<κ𝐌𝐌\displaystyle\left|G_{\nabla_{\Phi}(\mathbf{M}^{\prime})}(Q^{\prime}_{1-\alpha})-G_{\nabla_{\Phi}(\mathbf{M})}(Q^{\prime}_{1-\alpha})\right|<\kappa\,\|\mathbf{M}-\mathbf{M}^{\prime}\|

with κ=8φ¯𝐌Q1αK/a(L)\kappa=8\bar{\varphi}_{\mathbf{M}}\,Q^{\prime}_{1-\alpha}\,K_{\ell}/a(L), where φ¯𝐌\bar{\varphi}_{\mathbf{M}} is the upper bound on φ𝐌\varphi_{\mathbf{M}} in Hμ,M. Using (D.4) and (D.5) we thus obtain, for 𝐌𝐌\|\mathbf{M}^{\prime}-\mathbf{M}\| small enough,

|Q1α[Φ(𝐌)]Q1α[Φ(𝐌)]|<2κ/ϵ,L𝐌𝐌,\displaystyle\left|Q_{1-\alpha}[\nabla_{\Phi}(\mathbf{M}^{\prime})]-Q_{1-\alpha}[\nabla_{\Phi}(\mathbf{M})]\right|<2\kappa/\epsilon_{\ell,L}\,\|\mathbf{M}-\mathbf{M}^{\prime}\|\,,

which, combined with (D.2) and (D.1), completes the proof.    
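Relation (D.1) also gives a direct way to evaluate C1−α(𝐌) from a sample. The sketch below is a standalone illustration rather than the paper's algorithm: it takes Φ as the D-optimality criterion log det, for which ∇Φ(M) is the inverse of M and trace[∇Φ(M)M] = p, and uses the linear-regression information matrices ℳ(x) = xx⊤ with standard normal covariates.

```python
import numpy as np

def C_quantile(M, xs, alpha, info):
    """Empirical version of (D.1) for Phi = log det (D-optimality), where
    nabla_Phi(M) = inv(M):  C_{1-alpha}(M) = Q_{1-alpha}[inv(M)] - trace[inv(M) M]."""
    grad = np.linalg.inv(M)
    T = np.array([np.trace(grad @ info(x)) for x in xs])     # T_A(X_i) with A = inv(M)
    return np.quantile(T, 1 - alpha) - np.trace(grad @ M)    # here trace[inv(M) M] = p

# illustration: M(x) = x x^T (linear regression), X ~ N(0, I_3), M taken as M(mu_k)
rng = np.random.default_rng(2)
p, k, alpha = 3, 20_000, 0.25
info = lambda x: np.outer(x, x)
xs = rng.standard_normal((k, p))
M = np.mean([info(x) for x in xs], axis=0)
print(C_quantile(M, xs, alpha, info))
```

Lemma D.1 guarantees that this quantity varies Lipschitz-continuously with 𝐌 over 𝕄≥ℓ,L, which is the regularity exploited in the convergence analysis.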

Acknowledgements

The work of the first author was partly supported by project INDEX (INcremental Design of EXperiments) ANR-18-CE91-0007 of the French National Research Agency (ANR).

References

  • Albert and Gardner, (1967) Albert, A. and Gardner, L. (1967). Stochastic Approximation and Nonlinear Regression. MIT Press, Cambridge, MA.
  • Albright and Derman, (1972) Albright, S. and Derman, C. (1972). Asymptotically optimal policies for the stochastic sequential assignment problem. Management Science, 19(1):46–51.
  • Borkar, (1997) Borkar, V. (1997). Stochastic approximation with two time scales. Systems & Control Letters, 29:291–294.
  • Cook and Nachtsheim, (1980) Cook, R. and Nachtsheim, C. (1980). A comparison of algorithms for constructing exact dd-optimal designs. Technometrics, 22(3):315–324.
  • Dalal et al., (2018) Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. (2018). Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. Proc. of Machine Learning Research, 75:1–35.
  • Fedorov, (1972) Fedorov, V. (1972). Theory of Optimal Experiments. Academic Press, New York.
  • Fedorov, (1989) Fedorov, V. (1989). Optimal design with bounded density: optimization algorithms of the exchange type. Journal of Statistical Planning and Inference, 22:1–13.
  • Fedorov and Hackl, (1997) Fedorov, V. and Hackl, P. (1997). Model-Oriented Design of Experiments. Springer, Berlin.
  • Harman, (2004) Harman, R. (2004). Lower bounds on efficiency ratios based on ϕp\phi_{p}-optimal designs. In Di Bucchianico, A., Läuter, H., and Wynn, H., editors, mODa’7 – Advances in Model–Oriented Design and Analysis, Proceedings of the 7th Int. Workshop, Heeze (Netherlands), pages 89–96, Heidelberg. Physica Verlag.
  • Kesten, (1958) Kesten, H. (1958). Accelerated stochastic approximation. The Annals of Mathematical Statistics, 29(1):41–59.
  • Konda and Tsitsiklis, (2004) Konda, V. and Tsitsiklis, J. (2004). Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 14(2):796–819.
  • Lakshminarayanan and Bhatnagar, (2017) Lakshminarayanan, C. and Bhatnagar, S. (2017). A stability criterion for two timescale stochastic approximation. Automatica, 79:108–114.
  • Ljung, (1977) Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551–575.
  • Ljung and Söderström, (1983) Ljung, L. and Söderström, T. (1983). Theory and Practice of Recursive Identification. MIT Press, Cambridge, MA.
  • Metelkina and Pronzato, (2017) Metelkina, A. and Pronzato, L. (2017). Information-regret compromise in covariate-adaptive treatment allocation. The Annals of Statistics, 45(5):2046–2073.
  • Mitchell, (1974) Mitchell, T. (1974). An algorithm for the construction of “DD-optimal” experimental designs. Technometrics, 16:203–210.
  • Nie et al., (2018) Nie, R., Wiens, D., and Zhai, Z. (2018). Minimax robust active learning for approximately specified regression models. Canadian Journal of Statistics, 46(1):104–122.
  • Pflug, (2000) Pflug, G. (2000). Some remarks on the value-at-risk and the conditional value-at-risk. In Uryasev, S., editor, Probabilistic Constrained Optimization, pages 272–281. Springer.
  • Pilz, (1983) Pilz, J. (1983). Bayesian Estimation and Experimental Design in Linear Regression Models, volume 55. Teubner-Texte zur Mathematik, Leipzig. (also Wiley, New York, 1991).
  • Pronzato, (2001) Pronzato, L. (2001). Optimal and asymptotically optimal decision rules for sequential screening and resource allocation. IEEE Transactions on Automatic Control, 46(5):687–697.
  • Pronzato, (2004) Pronzato, L. (2004). A minimax equivalence theorem for optimum bounded design measures. Statistics & Probability Letters, 68:325–331.
  • Pronzato, (2006) Pronzato, L. (2006). On the sequential construction of optimum bounded designs. Journal of Statistical Planning and Inference, 136:2783–2804.
  • Pronzato and Pázman, (2013) Pronzato, L. and Pázman, A. (2013). Design of Experiments in Nonlinear Models. Asymptotic Normality, Optimality Criteria and Small-Sample Properties. Springer, LNS 212, New York.
  • Pukelsheim, (1993) Pukelsheim, F. (1993). Optimal Experimental Design. Wiley, New York.
  • Robbins and Siegmund, (1971) Robbins, H. and Siegmund, D. (1971). A convergence theorem for non negative almost supermartingales and some applications. In Rustagi, J., editor, Optimization Methods in Statistics, pages 233–257. Academic Press, New York.
  • Rockafellar and Uryasev, (2000) Rockafellar, R. and Uryasev, S. (2000). Optimization of conditional value-at-risk. Journal of Risk, 2:21–42.
  • Rockafellar and Uryasev, (2002) Rockafellar, R. and Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26:1443–1471.
  • Sahm and Schwabe, (2001) Sahm, M. and Schwabe, R. (2001). A note on optimal bounded designs. In Atkinson, A., Bogacka, B., and Zhigljavsky, A., editors, Optimum Design 2000, chapter 13, pages 131–140. Kluwer, Dordrecht.
  • Tierney, (1983) Tierney, L. (1983). A space-efficient recursive procedure for estimating a quantile of an unknown distribution. SIAM Journal on Scientific and Statistical Computing, 4(4):706–711.
  • Tsypkin, (1983) Tsypkin, Y. (1983). Optimality in identification of linear plants. International Journal of Systems Science, 14(1):59–74.
  • van der Vaart, (1998) van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
  • Wang et al., (2019) Wang, H., Yang, M., and Stufken, J. (2019). Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association, 114(525):393–405.
  • Wiens, (2005) Wiens, D. (2005). Robustness in spatial studies II: minimax design. Environmetrics, 16(2):205–217.
  • Wynn, (1982) Wynn, H. (1982). Optimum submeasures with applications to finite population sampling. In Gupta, S. and Berger, J., editors, Statistical Decision Theory and Related Topics III. Proc. 3rd Purdue Symp., vol. 2, pages 485–495. Academic Press, New York.