Sequential online subsampling for thinning experimental designs
Abstract
We consider a design problem where experimental conditions (design points ) are presented in the form of a sequence of i.i.d. random variables, generated with an unknown probability measure , and only a given proportion can be selected. The objective is to select good candidates on the fly and maximize a concave function of the corresponding information matrix. The optimal solution corresponds to the construction of an optimal bounded design measure , with the difficulty that is unknown and must be constructed online. The construction proposed relies on the definition of a threshold on the directional derivative of at the current information matrix, the value of being fixed by a certain quantile of the distribution of this directional derivative. Combination with recursive quantile estimation yields a nonlinear two-time-scale stochastic approximation method. It can be applied to very long design sequences since only the current information matrix and estimated quantile need to be stored. Convergence to an optimum design is proved. Various illustrative examples are presented.
Keywords: Active learning, data thinning, design of experiments, sequential design, subsampling
AMS subject classifications: 62K05, 62L05, 62L20, 68Q32
1 Introduction
Consider a rather general parameter estimation problem in a model with independent observations conditionally on the experimental variables , with in some set . Suppose that for any there exists a measurable set and a -finite measure on such that has the density with respect to , with the true value of the model parameters to be estimated, . In particular, this covers the case of regression models, with the Lebesgue measure on and , where the are independently distributed with zero mean and known variance (or unknown but constant variance ), and the case of generalized linear models, with in the exponential family and logistic regression as a special case. Denoting by the estimated value of from data , , under rather weak conditions on the and , see below, we have
(1.1)
where denotes the (normalized) Fisher information matrix for parameters and (asymptotic) design (that is, a probability measure on ),
This is true in particular for randomized designs such that the are independently sampled from , and for asymptotically discrete designs, such that is a discrete measure on and the empirical design measure converges strongly to ; see Pronzato and Pázman, (2013). The former case corresponds to the situation considered here. The choice of is somewhat arbitrary, provided that for all , and we shall assume that . We can then write
denotes the elementary information matrix at .
Taking motivation from (1.1), optimal experimental design (approximate theory) aims at choosing a measure that minimizes a scalar function of the asymptotic covariance matrix of , or equivalently, that maximizes a function of . For a nonlinear model and depend on the model parameters . Since is unknown, the standard approach is local, and consists in constructing an optimal design for a nominal value of . This is the point of view we shall adopt here — although sequential estimation of is possible, see Section 6. When is fixed at some , there is fundamentally no difference with experimental design in a linear model for which and do not depend on . For example, in the linear regression model
where the errors are independent and identically distributed (i.i.d.), with a density with respect to the Lebesgue measure having finite Fisher information for location ( for normal errors ), then , . Polynomial regression provides typical examples of such a situation and will be used for illustration in Section 4. The construction of an optimal design measure maximizing usually relies on the application of a specialized algorithm to a discretization of the design space ; see, e.g., Pronzato and Pázman, (2013, Chap. 9).
With the rapid development of connected sensors and the pervasive usage of computers, there exist more and more situations where massive amounts of data , , are available to construct models. When is very large, using all the data to construct is then unfeasible, and selecting the most informative subset through the construction of an -point optimal design, , over the discrete set is also not feasible. The objective of this paper is to present a method to explore sequentially and select a proportion of the data points to be used to estimate . Each candidate is considered only once, which allows very large datasets to be processed: when the are i.i.d. and are received sequentially, they can be selected on the fly, which makes the method applicable to data streaming; when data points are available simultaneously, a random permutation allows to be processed as an i.i.d. sequence. When is too large for the storage capacity and the i.i.d. assumption is not tenable, interleaving or scrambling techniques can be used. Since de-scrambling is not necessary here (the objective is only to randomize the sequence), a simple random selection in a fixed-size buffer may be sufficient; an example is presented in Section 4.3.
The method is based on the construction of an optimal bounded design measure and draws on the paper (Pronzato,, 2006). In that paper, the sequential selection of the relies on a threshold set on the directional derivative of the design criterion, given by the -quantile of the distribution of this derivative. At stage , all previous , , are used for the estimation of the quantile that defines the threshold for the possible selection of the candidate . In the present paper, we combine this approach with the recursive estimation of , following (Tierney,, 1983): as a result, the construction is fully sequential and only requires recording the current value of the information matrix and of the estimated quantile of the distribution of the directional derivative. It relies on a reinterpretation of the approach in (Pronzato,, 2006) as a stochastic approximation method for the solution of the necessary and sufficient optimality conditions for a bounded design measure, which we combine with another stochastic approximation method for quantile estimation to obtain a two-time-scale stochastic approximation scheme.
The paper is organized as follows. Section 2 introduces the notation and assumptions and recalls main results on optimal bounded design measures. Section 3 presents our subsampling algorithm based on a two-time-scale stochastic approximation procedure and contains the main result of the paper. Several illustrative examples are presented in Section 4. We are not aware of any other method for thinning experimental designs that is applicable to data streaming; nevertheless, in Section 5 we compare our algorithm with an exchange method and with the IBOSS algorithm of Wang et al., (2019) in the case where the design points are available and can be processed simultaneously. Section 6 concludes and suggests a few directions for further developments. A series of technical results are provided in the Appendix.
2 Optimal bounded design measures
2.1 Notation and assumptions
Suppose that is distributed with the probability measure on , a subset of with nonempty interior, with . For any , the set of positive measure on (not necessarily of mass one), we denote where, for all in , , the set (cone) of symmetric non-negative definite matrices. We assume that in the rest of the paper (the optimal selection of information in the case forms a variant of the secretary problem for which an asymptotically optimal solution can be derived, see Albright and Derman, (1972); Pronzato, (2001)).
We denote by the design criterion we wish to maximize, and by and the minimum and maximum eigenvalues of , respectively; we shall use the norm for vectors and Frobenius norm for matrices, ; all vectors are column vectors. For any , we denote and, for any , denotes the largest integer smaller than . For we denote by the (convex) set defined by
and by the open cone of symmetric positive definite matrices. We make the following assumptions on .
- HΦ
-
is strictly concave on , linearly differentiable and increasing for Loewner ordering; its gradient is well defined in for any and satisfies and for any , for some and ; moreover, satisfies the following Lipschitz condition: for all and in such that , , there exists such that .
The criterion and criteria , , , with if is singular, which are often used in optimal design (in particular with a positive integer) satisfy HΦ; see, e.g., Pukelsheim, (1993, Chap. 6). Their gradients are and , ; the constants and are respectively given by , for and , for . The Lipschitz condition follows from the fact that the criteria are twice differentiable on . The positively homogeneous versions and , which satisfy for any and any , and , with the identity matrix, could be considered too; see Pukelsheim, (1993, Chaps. 5, 6). The strict concavity of implies that, for any convex subset of , there exists a unique matrix maximizing with respect to .
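For concreteness, and since the displayed formulas are not reproduced here, the following standard examples, written in notation of our own choosing, illustrate the kind of criteria covered by HΦ:

```latex
% Standard examples (notation ours): D-optimality and a trace-type criterion.
\[
  \Phi_D(M) = \log\det M, \qquad \nabla\Phi_D(M) = M^{-1},
\]
\[
  \Phi_A(M) = -\,\mathrm{trace}\bigl(M^{-1}\bigr), \qquad \nabla\Phi_A(M) = M^{-2};
\]
% both are concave, increasing for the Loewner ordering on the cone of positive
% definite matrices, and twice differentiable on that cone.
```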
We denote by the directional derivative of at in the direction ,
and we make the following assumptions on and .
- Hμ
-
has a bounded positive density with respect to the Lebesgue measure on every open subset of .
- HM
-
(i) is continuous on and satisfies ;
(ii) for any of measure , for some .
Since all the designs considered will be formed by points sampled from , we shall identify with the support of : , with the open ball with center and radius . Notice that HM-(i) implies that and .
Our sequential selection procedure will rely on the estimation of the -quantile of the distribution of the directional derivative when , and we shall assume that Hμ,M below is satisfied. It implies in particular that is uniquely defined by .
- Hμ,M
-
For all , has a uniformly bounded density ; moreover, for any , there exists such that and is continuous at .
Hμ,M is stronger than strictly necessary (we only need the existence and boundedness of , and its positiveness and continuity at ), but it is satisfied in many common situations; see Section 4 for examples. Let us emphasize that Hμ and HM are not enough to guarantee the existence of a density , since may remain constant over subsets of having positive measure. Assuming the existence of and the continuity of on is also insufficient, since is generally not continuous when is not differentiable in , and is not necessarily bounded.
2.2 Optimal design
As mentioned in the introduction, when the cardinality of is very large, one may wish to select only candidates among the available, a fraction say, with . For any , we denote by a design matrix (not necessarily unique) obtained by selecting points optimally within ; that is, gives the maximum of with respect to , where the are distinct points in . Note that this forms a difficult combinatorial problem, unfeasible for large and . If one assumes that the are i.i.d., with their probability measure on , for large the optimal selection of points amounts to constructing an optimal bounded design measure , such that is maximum and (in the sense for any -measurable set , which makes absolutely continuous with respect to ). Indeed, Lemma A.1 in Appendix A indicates that . Also, under HΦ, for all ; see Pronzato, (2006, Lemma 3).
A key result is that, when all subsets of with constant have zero measure, separates two sets and , with and on , and and on , for some constant ; moreover, ; see Wynn, (1982); Fedorov, (1989) and Fedorov and Hackl, (1997, Chap. 4). (The condition mentioned in those references is that has no atoms, but the example in Section 4.4.2 will show that this is not sufficient; extension to arbitrary measures is considered in (Sahm and Schwabe,, 2001).)
For , denote
In (Pronzato,, 2006), it is shown that, for any ,
(2.1)
where, for any proposition , if is true and is zero otherwise, and is an -quantile of when and satisfies
(2.2)
Therefore, is the optimum information matrix in (unique since is strictly concave) if and only if it satisfies , or equivalently , and the constant equals ; see (Pronzato,, 2006, Th. 5); see also Pronzato, (2004). Note that since and on the support of .
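Since the displayed formulas of this section are not reproduced above, we add the following hedged restatement, in notation of our own choosing (which may differ from the original): write M(x) for the elementary information matrix at x, M(ξ) for the information matrix of a design measure ξ, α for the proportion to be selected, and F_Φ(M, x) = trace[∇Φ(M)(M(x) − M)] for the directional derivative of Φ at M in the direction M(x). The separation result then takes the form

```latex
% Hedged restatement (notation ours): the optimal bounded measure xi*, subject to
% alpha * xi* <= mu, saturates its bound where the directional derivative is
% large and vanishes where it is small,
\[
  \frac{\mathrm{d}\xi^{*}}{\mathrm{d}\mu}(x) \;=\;
  \begin{cases}
    1/\alpha, & F_\Phi\bigl(M(\xi^{*}),\,x\bigr) > C^{*},\\[2pt]
    0,        & F_\Phi\bigl(M(\xi^{*}),\,x\bigr) < C^{*},
  \end{cases}
\]
% with C^* the (1-alpha)-quantile of F_Phi(M(xi*), X) for X distributed with mu;
% for D-optimality and M(x) = h(x) h(x)^T one has
% F_Phi(M, x) = h(x)^T M^{-1} h(x) - p, with p the dimension of h(x).
```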
3 Sequential construction of an optimal bounded design measure
3.1 A stochastic approximation problem
Suppose that the are i.i.d. with . The solution of , , with respect to by stochastic approximation yields the iterations
(3.4)
Note that . The almost sure (a.s.) convergence of in (3.4) to that maximizes with respect to is proved in (Pronzato,, 2006) under rather weak assumptions on , and .
The construction (3.4) requires the calculation of the -quantile for all , see (2.2), which is not feasible when is unknown and has a prohibitive computational cost when we know . For that reason, it is proposed in (Pronzato,, 2006) to replace by the empirical quantile that uses the empirical measure of the that have been observed up to stage . This construction preserves the a.s. convergence of to in (3.4), but its computational cost and storage requirement increase with , which makes it ill-suited to situations with very large . The next section considers the recursive estimation of and contains the main result of the paper.
3.2 Recursive quantile estimation
The idea is to plug a recursive estimator of the -quantile in (3.4). Under mild assumptions, for random variables that are i.i.d. with distribution function such that the solution of the equation is unique, the recursion
(3.5)
with converges a.s. to the quantile such that . Here, we shall use a construction based on (Tierney,, 1983). In that paper, a clever dynamical choice of is shown to provide the optimal asymptotic rate of convergence of towards , with as , where is the p.d.f. of the — note that it coincides with the asymptotic behavior of the sample (empirical) quantile. The only conditions on are that exists for all and is uniformly bounded, and that is continuous and positive at the unique root of .
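To fix ideas, here is a minimal sketch (not the construction analyzed in the paper; the step-size and bandwidth choices are ours) of a Robbins–Monro-type recursive quantile estimator with a Tierney-style step size driven by a running density estimate at the current quantile:

```python
import numpy as np

def recursive_quantile(samples, alpha, c0=1.0, h0=1.0):
    """Estimate the alpha-quantile of an i.i.d. stream by stochastic approximation.

    q is moved by (alpha - 1{z <= q}) scaled by 1/(n * max(f_hat, c_n)), where
    f_hat is a running estimate of the density at the current quantile estimate
    (Tierney-style step size) and c_n a vanishing safeguard.
    """
    q = float(samples[0])       # crude initialization at the first observation
    f_hat = 0.0                 # running density estimate at the current quantile
    for n, z in enumerate(samples[1:], start=1):
        h_n = h0 * n ** (-0.25)                      # shrinking kernel bandwidth
        f_hat += (float(abs(z - q) <= h_n) / (2.0 * h_n) - f_hat) / n
        c_n = c0 / np.sqrt(n)                        # keeps the step size bounded
        step = 1.0 / (n * max(f_hat, c_n))
        q += step * (alpha - float(z <= q))
    return q

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
print(recursive_quantile(z, alpha=0.9))
```

For a long standard normal stream the final estimate should approach the 0.9-quantile of the standard normal distribution, approximately 1.2816.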
There is a noticeable difference, however, with the estimation of : in our case we need to estimate a quantile of for , with the distribution of evolving with . For that reason, we shall impose faster dynamics on the evolution of , and replace (3.5) by
(3.6)
for some . The combination of (3.6) with (3.4) yields a particular nonlinear two-time-scale stochastic approximation scheme. There exist advanced results on the convergence of linear two-time-scale stochastic approximation, see Konda and Tsitsiklis, (2004); Dalal et al., (2018). To the best of our knowledge, however, there are few results on convergence for nonlinear schemes. Convergence is shown in (Borkar,, 1997) under the assumption of boundedness of the iterates using the ODE method of Ljung, (1977); sufficient conditions for stability are provided in (Lakshminarayanan and Bhatnagar,, 2017), also using the ODE approach. In the proof of Theorem 3.1 we provide justifications for our construction, based on the analyses and results in the references mentioned above.
The construction is summarized in Algorithm 1 below. The presence of the small number is only due to technical reasons: setting when in (3.9) has the effect of always selecting when less than points have been selected previously; it ensures that for all and thus that always belongs to for some and ; see Lemma B.1 in Appendix.
-
Algorithm 1: sequential selection ( given).
-
0)
Choose , , , and .
-
1)
Initialization: select , compute . If is singular, increase and select the next points until has full rank. Set , the number of points selected.
Compute , for and order the as ; denote and .
Initialize at ; set , , and .
-
2)
Iteration : collect and compute .
(3.9) If , update into and into
(3.10) otherwise, set .
-
3)
Compute ; update using (3.6).
Set and update into
-
4)
, return to Step 2.
Note that is updated whatever the value of . Recursive quantile estimation by (3.6) follows (Tierney,, 1983). To ensure faster dynamics for the evolution of than for , we take instead of in (Tierney,, 1983), and the construction of and the choices of and are modified accordingly. Following the same arguments as in the proof of Proposition 1 of (Tierney,, 1983), the a.s. convergence of to in the modified version of (3.5) is proved in Theorem C.1 (Appendix C).
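As an indicative illustration only, the following Python sketch mimics the structure of Algorithm 1 for D-optimality with rank-one elementary information matrices; the parameter names, default values and precise forms of the step sizes are our own choices, not necessarily those of the algorithm above.

```python
import numpy as np

def sequential_select(xs, h, alpha, n0=20, q_exp=0.75, c0=1.0, h0=1.0):
    """Online thinning in the spirit of Algorithm 1 (D-optimality sketch).

    Keeps only the running information matrix M of the selected points, a
    recursive estimate q of the (1-alpha)-quantile of the directional
    derivative, and a running density estimate f_hat at that quantile.
    """
    p = len(h(xs[0]))
    # Step 1: accept the first n0 points unconditionally (assumed to give a
    # nonsingular M) and initialize q from their directional derivatives.
    M = sum(np.outer(h(xs[i]), h(xs[i])) for i in range(n0)) / n0
    Minv = np.linalg.inv(M)
    selected = list(range(n0))
    d0 = [h(xs[i]) @ Minv @ h(xs[i]) - p for i in range(n0)]
    q = float(np.quantile(d0, 1.0 - alpha))
    f_hat = 0.0

    for k in range(n0, len(xs)):
        hk = h(xs[k])
        d = hk @ Minv @ hk - p              # directional derivative for log det
        # Step 2: selection; the "or" clause is a crude safeguard forcing
        # selection whenever we are behind the target proportion alpha.
        if d > q or len(selected) < alpha * k:
            selected.append(k)
            M += (np.outer(hk, hk) - M) / len(selected)   # running average
            Minv = np.linalg.inv(M)         # a rank-one update would also do
        # Step 3: recursive quantile update, on a faster time scale than 1/k.
        bw = h0 * (k + 1) ** (-0.25)        # shrinking bandwidth (our choice)
        f_hat += (float(abs(d - q) <= bw) / (2 * bw) - f_hat) / (k + 1 - n0)
        step = (k + 1) ** (-q_exp) / max(f_hat, c0 / np.sqrt(k + 1))
        q += step * ((1.0 - alpha) - float(d <= q))
    return selected, M
```

The directional derivative used here corresponds to the log-determinant criterion; other criteria satisfying HΦ only require changing that line together with the gradient involved.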
The next theorem establishes the convergence of the combined stochastic approximation schemes with two time-scales.
Theorem 3.1.
Under HΦ, Hμ, HM and Hμ,M, the normalized information matrix corresponding to the candidates selected after iterations of Algorithm 1 converges a.s. to the optimal matrix in as .
Proof. Our analysis is based on (Borkar,, 1997). We denote by the increasing sequence of -fields generated by the . According to (3.6), we can write with . Therefore, and , with the distribution function of . From Lemma B.1 (Appendix B) and Hμ,M, has a well defined density for all , with for all and bounded. The first part of the proof of Theorem C.1 applies (see Appendix C): is a.s. bounded and is bounded away from zero a.s. Therefore, a.s. and a.s.; also, since .
The o.d.e. associated with (3.6), for a fixed matrix and thus a fixed , such that has the distribution function and density , is
where satisfies . Consider the Lyapunov function . It satisfies , with if and only if . Moreover, is Lipschitz continuous in ; see Lemma D.1 in Appendix D. The conditions for Theorem 1.1 in (Borkar,, 1997) are thus satisfied concerning the iterations for .
Denote and , so that (3.9) implies for all ; see Lemma B.1 in Appendix. They satisfy
(3.11)
where , and . We have and , with the distribution function of , which, from Hμ,M, has a well defined density for all . Also,
where . Denote , so that
(3.12)
We get and
where (3.9) implies that , and therefore , and is a.s. bounded from HM-(i). This implies that a.s. The limiting o.d.e. associated with (3.11) and (3.12) are
where is defined by (2.1). The first equation implies that converges exponentially fast to , with ; the second equation gives
with a strict inequality if , the optimal matrix in . The conditions of Theorem 1.1 in (Borkar,, 1997) are thus satisfied, and converges to a.s.
Remark 3.1.
- (i)
-
Algorithm 1 does not require the knowledge of and has minimal storage requirements: apart from the current matrix , we only need to update the scalar variables and . Its complexity is in general, considering that the complexity of the calculation of is . It can be reduced to when has rank one and is updated instead of (see remark (iii) below), for D-optimality and -optimality with integer; see Section 2.1. Very long sequences can thus be processed.
- (ii)
-
Numerical simulations indicate that we do not need to take in Algorithm 1: (3.6) with yields satisfactory performance, provided the step-size obeys Kesten's rule (Kesten, 1958) and does not decrease at each iteration.
- (iii)
-
The substitution of for everywhere does not change the behavior of the algorithm. When only depends on (which is often the case for classical design criteria, see the discussion following the presentation of HΦ), and if is a low rank matrix, it may be preferable to update instead of , thereby avoiding matrix inversions. For example, if , then, instead of updating (3.10), it is preferable to update the following
Low-rank updates of the Cholesky decomposition of the matrix can be considered too; a sketch of such a rank-one update of the inverse is given after this remark.
- (iv)
-
Algorithm 1 can be adapted to the case where the number of iterations is fixed (equal to the size of the candidate set ) and the number of candidates to be selected is imposed. A straightforward modification is to introduce truncation and forced selection: we run the algorithm with and, at Step 2, we set (reject ) if and set (select ) if . However, this may induce the selection of points carrying little information when approaches in case is excessively small. For that reason, adaptation of to , obtained by substituting for the constant everywhere, seems preferable. This is illustrated by an example in Section 4.2.
- (v)
-
The case when has discrete components (atoms), or more precisely when there exist subsets of of positive measure where is constant (see Section 4.4.2), requires additional technical developments which we do not detail here.
A first difficulty is that HM-(ii) may not be satisfied when the matrices do not have full rank, unless we only consider large enough . Unless in (3.9) is large enough, Lemma B.1 is not valid, and other arguments are required in the proof of Theorem 3.1. Possible remedies may consist (a) in adding a regularization matrix with a small to all matrices (which amounts to considering optimal design for Bayesian estimation with a vague prior; see, e.g., Pilz, (1983)), or (b) in replacing the condition in (3.9) by [If , set ].
A second difficulty is that may correspond to a point of discontinuity of the distribution function of . The estimated value of the density of at (Step 3 of Algorithm 1) may then increase to infinity and tend to zero in (3.6). This can be avoided by setting for some .
In (Pronzato,, 2006), where empirical quantiles are used, measures needed to be taken to avoid the acceptance of too many points, for instance based on the adaptation of through , see remark (iv) above, or via the addition of the extra condition [if , set ] to (3.9) in case is not specified. Such measures do not appear to be necessary when quantiles are estimated by (3.6); see the examples in Section 4.4.
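In connection with remark (iii), and assuming that the information matrix is updated as a convex combination M ← (1 − γ)M + γ h hᵀ as in the sketch of Algorithm 1 given earlier (an assumption on our part, since the display (3.10) is not reproduced here), the inverse can be propagated without any matrix inversion via the Sherman–Morrison formula:

```python
import numpy as np

def inv_after_convex_update(Minv, h, gamma):
    """Return the inverse of (1 - gamma) * M + gamma * h h^T given Minv = M^{-1}.

    Uses (1-g)M + g hh^T = (1-g)(M + w hh^T) with w = g/(1-g) and the
    Sherman-Morrison identity, so the cost is O(p^2) instead of O(p^3).
    """
    w = gamma / (1.0 - gamma)
    u = Minv @ h
    Minv_r1 = Minv - w * np.outer(u, u) / (1.0 + w * (h @ u))
    return Minv_r1 / (1.0 - gamma)
```

The directional derivative needed at Step 2 is then available as h @ (Minv @ h) at the same O(p^2) cost.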
4 Examples
We always take , , in Algorithm 1 (our simulations indicate that these choices are not critical); we also set .
4.1 Example 1: quadratic regression with normal independent variables
Take , with and , and let the be i.i.d. standard normal variables . The D-optimal design for in an interval corresponds to . In the data thinning problem, the optimal solution corresponds to the selection of in the union of three intervals; that is, with the notation of Section 2, . The values of and are obtained by solving the pair of equations and , with the standard normal density and .
We set the horizon at and consider the two cases and . In each case we keep constant but apply the rule of Remark 3.1-(iv) (truncation/forced selection) to select exactly and design points, respectively. For , we have , , and , ; when , we have , , and , . The figures below present results obtained for one simulation (i.e., one random set ), but they are rather typical in the sense that different yield similar behaviors.
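Purely as a usage illustration of the hypothetical sequential_select sketch from Section 3 (the numbers below are indicative and do not reproduce the experiments of this section), the setting of this example can be run as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.standard_normal(100_000)                  # i.i.d. N(0,1) design variables
h = lambda x: np.array([1.0, x, x * x])            # quadratic regression regressors
selected, M = sequential_select(xs, h, alpha=0.1)  # keep roughly 10% of the points
print(len(selected) / len(xs))                     # achieved selection proportion
print(np.linalg.slogdet(M)[1])                     # log-determinant of the final M
```

The histogram of xs[selected] should then concentrate in the tails and around the center of the normal density, in line with the three-interval structure of the optimal solution and with Figure 1.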
Figure 1 shows a smoothed histogram (Epanechnikov kernel, bandwidth equal to of the range of the in ) of the design points selected by Algorithm 1, for (left) and (right). There is good agreement with the theoretical optimal density, which corresponds to a truncation of the normal density at values indicated by the vertical dotted lines.


Figure 2 presents the evolution of as a function of , together with the optimal value (horizontal line), for the two choices of considered (the figures show some similarity on the two panels since the same set is used for both). Convergence of to is fast in both cases; the presence of steps in the evolution of , more visible on the right panel, is due to long subsequences of consecutively rejected samples.


Figure 3 shows the behavior of the final directional derivative , after observation of all in , together with the value of its estimated quantile (horizontal solid line). The theoretical values (horizontal dashed line) and the values where (vertical dashed lines) are also shown ( and are indistinguishable on the right panel). Although the figure indicates that differs significantly from , they are close enough to allow selection of the most informative , as illustrated by Figures 1 and 2.


Figure 4 shows (Frobenius norm) as a function of (log scale), averaged over independent repetitions with random samples of size , for . It suggests that for large , although the conditions in (Konda and Tsitsiklis,, 2004) are not satisfied since the scheme we consider is nonlinear. This convergence rate is significantly faster than what is suggested by Dalal et al., (2018). These investigations require further developments and will be pursued elsewhere.

4.2 Example 2: multilinear regression with normal independent variables
Take , with , , and , the vectors being i.i.d. (so that ). Denote by the probability density of . For symmetry reasons, for any the optimal (normalized) information matrix is , with , where
with , the surface area of the -dimensional unit sphere, and the solution of
Since , we get , , and is differentiable with respect to , with ; see Pronzato, (2004, Th. 4). Closed-form expressions are available for , with and ; and can easily be computed numerically for any and . One may notice that, from a result by Harman, (2004), the design matrix is optimal for any other orthogonally invariant criterion .
For the linear model with intercept, such that with , the optimal matrix is
with the optimal matrix for the model without intercept. The same design is thus optimal for both models. Also, when the are i.i.d. , the optimal matrix for simply equals .
Again, we present results obtained for one random set . Figure 5 shows the evolution of as a function of for with and when we want to select exactly points: the blue dashed line is when we combine truncation and forced selection; the red solid line is when we adapt according to ; see Remark 3.1-(iv) — the final values, for , are indicated by a triangle and a star, respectively; we only show the evolution of for between 10 000 and 100 000 since the curves coincide for smaller (they are based on the same ). In the first case, the late forced selection of uninformative yields a significant decrease of , whereas adaptation of anticipates the need to be less selective in order to reach the target number of selected points.

Figure 2 has illustrated the convergence of to for a fixed as , but in fact what really matters is that tends to infinity: indeed, does not converge to if we fix and let tend to infinity, so that tends to zero (see also Section 5.1). This is illustrated on the left panel of Figure 6, where and, from left to right, equals 0.5 (magenta dotted line), 0.1, 0.05 and 0.01 (red solid line). Since the optimal value depends on , here we present the evolution with of the D-efficiency . The right panel is for fixed and varying , with, from left to right, (red solid line), 10, 20, 30 and 50 (cyan solid line). As one may expect, performance (slightly) deteriorates as increases due to the increasing variability of , with .


4.3 Example 3: processing a non i.i.d. sequence
When the design points are not from an i.i.d. sequence, Algorithm 1 cannot be used directly and some preprocessing is required. When storage of the whole sequence is possible, a random permutation can be applied to before using Algorithm 1. When is too large for that, for instance in the context of data streaming, and the sequence possesses a structured time-dependence, one may try to identify the dependence model through time series analysis and use forecasting to decide which design points should be selected. The data thinning mechanism is then totally dependent on the model of the sequence, and the investigation of the techniques to be used is beyond the scope of this paper. Examples of the application of a simple scrambling method to the sequence prior to selection by Algorithm 1 are presented below. The method corresponds to Algorithm 2 below; its output sequence is used as input for Algorithm 1. We do not study the properties of the method in conjunction with the convergence properties of Algorithm 1, which would require further developments.
-
Algorithm 2: random scrambling in a buffer.
-
1)
Initialization: choose the buffer size , set and .
-
2)
Draw by uniform sampling within .
-
3)
Set , , return to Step 2.
Direct calculation shows that the probability that equals is
showing the limits of randomization via Algorithm 2 (in particular, the first points of the sequence will tend to appear first among the ). However, the method gives satisfactory results when the buffer is large enough, and its performance improves as increases.
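A minimal sketch of such a buffer-based scrambler is given below (our own rendering of Algorithm 2; in particular, the final flush of the buffer is an implementation choice of ours that the listing above leaves unspecified):

```python
import numpy as np

def buffer_scramble(stream, buffer_size, seed=0):
    """Emit a scrambled version of `stream`: once the buffer is full, a uniformly
    chosen buffer element is emitted and replaced by the next incoming point."""
    rng = np.random.default_rng(seed)
    buf = []
    for x in stream:
        if len(buf) < buffer_size:       # fill the buffer first
            buf.append(x)
            continue
        j = int(rng.integers(len(buf)))  # uniform draw within the buffer
        yield buf[j]
        buf[j] = x                       # replace the emitted point
    rng.shuffle(buf)                     # flush the remaining points at the end
    yield from buf
```

Its output can be fed directly to the selection step, for instance as sequential_select(list(buffer_scramble(xs, 1000)), h, alpha) with the hypothetical sketch of Section 3.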
As an illustration, we consider the same quadratic regression model as in Example 1, with , in the extreme case where with a simple function of .
First, we consider the extremely unfavorable situation where . When is uniformly distributed on , the optimal design selects all points associated with in , for some in satisfying . For , we get and . The horizontal black line in Figure 7 indicates the optimal value when . The blue dotted line shows when Algorithm 1 is directly applied to the points , . The red line is when randomization via Algorithm 2, with buffer size , is applied first; the dotted curve in magenta is for . The positive effects of randomization through Algorithm 2 and the influence of the buffer size are visible on the figure. Here, the monotonicity of the , which inhibits early exploration of their range of variation, prevents convergence to the optimum.

We now consider the more favorable case where , , with . The left panel of Figure 8 shows the same information as Figure 7, when Algorithm 1 is applied directly to the (blue dotted line) and after preprocessing with Algorithm 2 with (red line) and (magenta dotted line). The early exploration of the range of variability of the , possible here thanks to the periodicity of , makes the randomization through Algorithm 2 efficient enough to allow Algorithm 1 to behave correctly when (red line). The situation improves when is increased, but naturally deteriorates if is too small (magenta dotted line). The right panel shows the points produced by Algorithm 2 (with ) which are selected by Algorithm 1. The effect of randomization is visible. For all points in the buffer are in the interval and the points selected by Algorithm 1 are near the end points or the center of this moving interval. For larger , randomization is strong enough to maintain the presence of suitable candidates in the buffer for selection by Algorithm 1.


4.4 Examples with constant on subsets of positive measure
Here we consider situations where Hμ,M is violated due to the existence of subsets of of positive measure on which is constant. The model is the same as in Section 4.2, with , and .
4.4.1 Example 4: has discrete components
This is Example 11 in (Pronzato,, 2006), where , , with corresponding to the normal distribution and the discrete measure that puts weight at each one of the points . Denote by the closed ball centered at the origin with radius , by the measure equal to on its complement , and let . The optimal matrix is , with the probability measure defined by:
with associated -values
Figure 9 shows the evolution of as a function of for (left) and (right). Note that in the second case, but in the first one, so that is neither zero nor on the four points . Figure 9 shows that Algorithm 1 nevertheless behaves satisfactorily in both cases.


4.4.2 Example 5: the distribution of has discrete components
Let denote the uniform probability measure on the -dimensional sphere with center and radius . The probability measure of the is , the mixture of distributions on three nested spheres with radii . The optimal bounded measure is
with associated -values
Notice that for (respectively, ) and on (respectively, on ) although is atomless.
The left panel of Figure 10 gives the evolution with of the D-efficiency , for (red solid line) and (blue dashed line) when . The right panel shows the evolution of the ratio for those two situations, with the limiting value indicated by a horizontal line. Although assumption Hμ,M is violated, Algorithm 1 continues to perform satisfactorily.


5 Comparison with other methods
5.1 Case fixed with large : comparison with an exchange method
The convergence of to in Algorithm 1 relies on the fact that grows like for some ; see Theorem 3.1. If the number of points to be selected is fixed, Algorithm 1 does not provide any performance guarantee when applied to a sequence of length (the situation is different when where an asymptotically optimal construction is available; see Pronzato, (2001)). In that case, a method of the exchange type may look more promising, although large values of entail serious difficulties. Typically, the algorithm is initialized by a point design chosen within , and at each iteration a temporarily selected is replaced by a better point in . Fedorov’s (1972) algorithm considers all possible replacements at each iteration ( instead of since we do not allow repetitions in the present context); its computational cost is prohibitive for large . The variants suggested by Cook and Nachtsheim, (1980), or the DETMAX algorithm of Mitchell, (1974), still require the maximization of a function with respect to at each iteration, which remains unfeasible for very large . Below, we consider a simplified version where all points are examined successively, and replacement is accepted when it improves the current criterion value.
-
Algorithm 3: sequential exchange ( fixed).
-
1)
Initialization: select , set and , compute and .
-
2)
Iteration : collect . If , set ; otherwise compute
If , set , update into , compute ;
otherwise, set , .
-
3)
If stop; otherwise, , return to Step 2.
Remark 5.1.
When has rank one, with and or (-optimal design), is equivalent to
(5.1)
where
(5.2)
see Fedorov, (1972, p. 164). As for Algorithm 1 (see Remark 3.1-(iii)), we may update instead of to avoid matrix inversions. For large enough , the term (5.2) is negligible and the condition is almost ; that is, . This is the condition we use in the example below. It does not guarantee in general that (since from the Cauchy–Schwarz inequality), but no significant difference was observed compared with the use of the exact condition (5.1). Algorithm 3 has complexity in general (the additional factor compared with Algorithm 1 is due to the calculation of the maximum over all in at Step 2).
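For illustration, here is a sketch (ours) of the simplified exchange just described, for D-optimality with rank-one regressors and using the approximate acceptance condition of this remark, i.e., neglecting the term (5.2):

```python
import numpy as np

def sequential_exchange(xs, h, k):
    """Keep k points; each new point replaces the currently selected point with
    the smallest value of h^T M^{-1} h whenever its own value is larger
    (approximate acceptance test, the correction term (5.2) being neglected)."""
    H = np.array([h(x) for x in xs[:k]], dtype=float)    # initial k-point design
    idx = list(range(k))
    M = H.T @ H / k
    Minv = np.linalg.inv(M)
    for t in range(k, len(xs)):
        ht = h(xs[t])
        d_new = ht @ Minv @ ht
        d_old = np.einsum('ij,jk,ik->i', H, Minv, H)     # h_i^T Minv h_i, all i
        j = int(np.argmin(d_old))
        if d_new > d_old[j]:                             # approximate improvement
            M += (np.outer(ht, ht) - np.outer(H[j], H[j])) / k
            Minv = np.linalg.inv(M)                      # rank-two update possible
            H[j] = ht
            idx[j] = t
    return idx, M
```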
Neither Algorithm 1 with and nor Algorithm 3 ensures that tends to as . Also, we can expect to have for all with Algorithm 3, since under HΦ the matrix corresponding to the optimal selection of distinct points among satisfies for all ; see Pronzato, (2006, Lemma 3).
Example 6: fixed and large
We consider the same situation as in Example 2 (Section 4.2), with , , ; the are i.i.d. , with . We still take , , in Algorithm 1. We have in Algorithm 1, and, when is large enough, at Step 1 of Algorithm 3, with and therefore .
We consider two values of , and , with (that is, with and , respectively). Figure 11 shows the evolutions of (, Algorithm 1, red solid line) and (, Algorithm 3, blue dashed line) as functions of in those two cases (, left; , right). In order to select points exactly, adaptation of is used in Algorithm 1, see Remark 3.1-(iv). The value of is too small for to approach (indicated by the horizontal black line) in the first case, whereas is large enough on the right panel; Algorithm 3 performs similarly in both cases and is superior to Algorithm 1 for ; the magenta curve with triangles shows , , with for all , as expected.
In case it is possible to store the points , we can replay both algorithms on the same data set in order to increase the final value of for the sample selected. For Algorithm 3, we can simply run the algorithm again on a set — starting with at Step 1 since points have already been selected — with or corresponding to a random permutation of it. Series of runs on sets can be concatenated: the fact that can only increase implies convergence for an infinite sequence of runs, but generally to a local maximum only; see the discussion in (Cook and Nachtsheim,, 1980, Sect. 2.4). When applied to Example 6, this method was not able to improve the design obtained in the first run of Algorithm 3, with a similar behavior with or without permutations in the construction of the .
Algorithm 1 requires a more subtle modification since points are selected without replacement. First, we run Algorithm 1 with fixed at on a set , where the replications are all identical to or correspond to random permutations of it. The values of and are then used in a second stage, where the points in are inspected sequentially: starting at and , a new point is selected if and (or if , see Remark 3.1-(iv)). The set is thus used times in total. The idea is that for large enough, we can expect to be close to and to be close to the true quantile , whereas the optimal rule for selection is . Note that the quantile of the directional derivative is not estimated in this second phase, and updating of is only used to follow the evolution of on plots.
Example 6 (continued)
The black-dotted line in Figure 11 shows the evolution of as a function of in the second phase (for large enough to have ): we have taken for (left), so that points are used in total (but 10 times the same), and for (right), with points used (twice the same). Figure 12 shows the evolution of for , for , , (left), and , , (right); the horizontal black line indicates the value of . The left panel indicates that is too small to estimate correctly with Algorithm 1 (note that would have been enough), which is consistent with the behavior of observed in Figure 11-left (red solid line). The right panel of Figure 12 shows that has converged before inspection of the points in , which explains the satisfactory behavior of Algorithm 1 in Figure 11-right. Notice the similarity between the left and right panels of Figure 12 due to the fact that the same value is used in both. Here the are constructed by random permutations of the points in , but the behavior is similar without.




5.2 Comparison with IBOSS
IBOSS (Information-Based Optimal Subdata Selection, Wang et al., (2019)) is a selection procedure motivated by D-optimality developed in the context of multilinear regression with intercept, where with . All points in are processed simultaneously: the coordinates of the are examined successively; for each , the points with largest -th coordinate and the points having smallest -th coordinate are selected (and removed from ), where , possibly with suitable rounding, when exactly points have to be selected. The design selected is sensitive to the order in which the coordinates are inspected. The necessity to find the largest or smallest coordinate values yields a complexity of ; parallelization with simultaneous sorting of each coordinate is possible. As for any design selection algorithm, the matrix obtained with IBOSS satisfies for all (Pronzato,, 2006, Lemma 3). The asymptotic performance of IBOSS (the behavior of and ) for fixed and tending to infinity is investigated in (Wang et al.,, 2019) for following a multivariate normal or lognormal distribution. The next property concerns the situation where is a fraction of , with and the components of are independent.
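For reference, a compact sketch of the IBOSS selection rule (our rendering; Wang et al., (2019) use a partition-based selection of the r largest and smallest values, which achieves the stated cost per covariate, whereas the full sort below is used only for brevity):

```python
import numpy as np

def iboss(X, k):
    """IBOSS-style subdata selection for multilinear regression with intercept:
    for each covariate in turn, keep the r = k/(2d) not-yet-selected points with
    the largest values and the r points with the smallest values."""
    n, d = X.shape
    r = k // (2 * d)
    if r == 0:
        raise ValueError("k must be at least 2*d")
    available = np.ones(n, dtype=bool)
    selected = []
    for j in range(d):
        idx = np.flatnonzero(available)
        order = np.argsort(X[idx, j])                     # a partial sort suffices
        chosen = np.concatenate([idx[order[:r]], idx[order[-r:]]])
        selected.extend(chosen.tolist())
        available[chosen] = False                         # remove from the pool
    return np.array(selected)
```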
Theorem 5.1.
Suppose that the are i.i.d. with satisfying Hμ and, moreover, that their components are independent, with the p.d.f. of for . Suppose, without any loss of generality, that coordinates are inspected in the order . Then, for any , the matrix corresponding to the points selected by IBOSS satisfies a.s. when and , with
(5.3)
(5.4)
where , , ,
with the quantile function for , satisfying for any .
Proof. By construction, IBOSS asymptotically first selects all points such that does not belong to , then, among remaining points, all those such that . By induction, all points such that are selected at stage . Denote . We have
Direct calculation gives and
which proves (5.3). Similarly,
with
which proves (5.4) and concludes the proof.
A key difference between IBOSS and Algorithm 1 is that IBOSS is nonsequential and therefore cannot be used in the streaming setting. Also, IBOSS is motivated by D-optimal design and may not perform well for other criteria, whereas Algorithm 1 converges to the optimal solution when and for any criterion satisfying HΦ. Moreover, IBOSS strongly relies on the assumption that with and, as the next example illustrates, it can perform poorly in other situations, in particular when the are functionally dependent.
Example 7: quadratic regression on
Take , with uniformly distributed in and . For , the optimal measure equals on for some (which are determined by the two equations ). For , on . When , the matrix obtained with IBOSS applied to the points converges to , with on . The left panel of Figure 13 shows (red solid line) and (blue dotted line) as functions of . We have , which tends to 0 as .
The next examples show that IBOSS performs more comparably to Algorithm 1 for multilinear regression with intercept, where with . Its performance may nevertheless be significantly poorer than that of Algorithm 1.
Example 8: multilinear regression with intercept,
is uniformly distributed in .
Direct calculation shows that, for any , the optimal measure equals on , with the open ball centered at the origin with radius . Here, when , and is solution of when . The associated optimal matrix is diagonal, , with
Extension to is possible but involves complicated calculations.
When and , the matrix obtained with IBOSS converges to when and , with on , with and . The matrix is diagonal, , where is the matrix in Theorem 5.1 with and . The right panel of Figure 13 shows (red solid line) and (blue dashed line) as functions of . Note that whereas when . The problem is due to selection by IBOSS of points having one coordinate in the central part of the interval.


is normally distributed .
The expression of the optimal matrix has been derived in Section 4.2; the asymptotic value for of the matrix is
where the expression of (here a diagonal matrix) is given in Theorem 5.1. Figure 14 shows the D-efficiency as a function of for (left) and (right), showing that the performance of IBOSS deteriorates as increases. We also performed series of simulations for , with 100 independent repetitions of selections of points within () based on IBOSS and Algorithm 1. Due to the small value of , we apply Algorithm 1 to replications of , see Section 5.1, with for , for and for . The colored areas on Figure 14 show the variability range for efficiency, corresponding to the empirical mean ± 2 standard deviations obtained for the 100 repetitions, for IBOSS (green, bottom) and Algorithm 1 (magenta, top); note that variability decreases as increases. The approximation of obtained with IBOSS by the asymptotic matrix is quite accurate although is rather small; Algorithm 1 (incorporating repetitions of ) performs significantly better than IBOSS although the setting is particularly favorable to IBOSS — it is significantly slower than IBOSS, however, when is large.


6 Conclusions and further developments
We have proposed a sequential subsampling method for experimental design (Algorithm 1) that converges to the optimal solution when the length of the sequence tends to infinity and a fixed proportion of design points is selected. Since the method only needs to keep in memory the current information matrix associated with the design problem (or its inverse), and to update a pair of scalar variables (an estimated quantile and an estimate of the p.d.f. value at that quantile), it can be applied to sequences of arbitrary length and is suitable for data streaming.
We have not tried to optimize the choice of initialization and tuning parameters in Algorithm 1. Although this does not seem critical (the same tuning has been used in all the examples presented), there is certainly room for improvement, in particular concerning and (for instance, using the information that whereas for small with the initialization we use).
We have only considered the case of linear models, where the information matrix does not depend on unknown parameters (equivalent to locally optimal design in the case of a nonlinear model), but extension to online parameter estimation in a nonlinear model with would not require major modifications. Denote by the estimated value of the parameters after observation at the design points selected, , say. Then, we can use at Step 1 of Algorithm 1, and given by (3.10) can be replaced by at Step 2. Recursive estimation can be used for to reduce the computational cost. For instance, for maximum likelihood estimation, with the notation of Section 1, we can update as
when is selected; see Ljung and Söderström, (1983); Tsypkin, (1983). A further simplification would be to update as . When the are i.i.d. with satisfying Hμ, the strong consistency of holds with such recursive schemes under rather general conditions when all are selected. Showing that this remains true when only a proportion is selected by Algorithm 1 requires technical developments outside the scope of this paper, but we anticipate that a.s., with the optimal matrix for the true value of the model parameters.
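As a simple stand-in for the recursive schemes alluded to above (the displayed update is not reproduced here), a recursive least-squares step for the linear case illustrates the kind of low-cost update that can replace full re-estimation at each selected point; the names and the forgetting factor are ours:

```python
import numpy as np

def recursive_lsq_step(theta, P, h_x, y, lam=1.0):
    """One recursive least-squares update for y = h(x)^T theta + noise.
    P plays the role of the inverse of the (unnormalized) information matrix;
    lam is an optional forgetting factor (lam = 1 gives ordinary RLS)."""
    Ph = P @ h_x
    gain = Ph / (lam + h_x @ Ph)
    theta = theta + gain * (y - h_x @ theta)       # prediction-error correction
    P = (P - np.outer(gain, Ph)) / lam             # Sherman-Morrison downdate
    return theta, P
```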
Algorithm 1 can be viewed as an adaptive version of the treatment allocation method presented in (Metelkina and Pronzato,, 2017): consider the selection or rejection of as the allocation of individual to treatment 1 (selection) or 2 (rejection), with respective contributions or to the collection of information; count a cost of one for allocation to treatment 1 and zero for rejection. Then, the doubly-adaptive sequential allocation (4.6) of Metelkina and Pronzato, (2017) that optimizes a compromise between information and cost exactly coincides with Algorithm 1 where is frozen to a fixed , i.e., without Step 3. In that sense, the two-time-scale stochastic approximation procedure of Algorithm 1 opens the way to the development of adaptive treatment allocation procedures where the proportion of individuals allocated to the poorest treatment could be adjusted online to a given target.
Finally, the designs obtained with the proposed thinning procedure are model-based: when the model is wrong, is no longer optimal for the true model. Model-robustness issues are not considered in the paper and would require specific developments, following for instance the approach in (Wiens,, 2005; Nie et al.,, 2018).
Appendix A Maximum of
The property below is stated without proof in (Pronzato,, 2006). We provide here a formal proof based on results on conditional value-at-risk by Rockafellar and Uryasev, (2000) and Pflug, (2000).
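For the reader's convenience (the displayed definitions of this appendix are not reproduced), recall the standard variational representation of conditional value-at-risk from Rockafellar and Uryasev, (2000), written here in notation of our own choosing:

```latex
% Rockafellar--Uryasev representation of CVaR at level beta for a random loss Z
% (notation ours), with (z)_+ := max(z, 0):
\[
  \mathrm{CVaR}_{\beta}(Z)
  \;=\; \min_{C \in \mathbb{R}}
  \Bigl\{\, C + \tfrac{1}{1-\beta}\,\mathbb{E}\bigl[(Z - C)_{+}\bigr] \Bigr\},
\]
% the minimum being attained at C equal to the beta-quantile (value-at-risk) of Z;
% when the distribution of Z is continuous, CVaR_beta(Z) equals the expectation
% of Z conditional on Z exceeding that quantile.
```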
Lemma A.1.
Suppose that as . Then, under HΦ and HM, for any choice of points among points i.i.d. with , we have a.s., where maximizes with respect to .
Proof. Denote by the matrix that corresponds to choosing distinct candidates that maximize . The concavity of implies
(A.1)
The rest of the proof consists in deriving an upper bound on the second term on the right-hand side of (A.1).
Denote for all and let the denote the version sorted by decreasing values. Since is increasing for Loewner ordering, for any and any , and concavity implies , showing that . Therefore, for all .
Following Rockafellar and Uryasev, (2000); Pflug, (2000), we then define the functions , , , . We can then write, for any ,
(A.2)
and
where for all larger than some and where denotes expectation for the empirical measure .
Next, we construct an upper bound on . For , the matrix satisfies
(A.3)
Now, with on a set and elsewhere, and . HM-(ii) then implies that , and HΦ implies that . Therefore from HM-(i). Since tends to a.s. as , (A.3) implies that there exists a.s. such that, for all , .
To summarize, (A.1) implies
The last term tends to zero as tends to infinity, due to (A.2) and the continuity of conditional value-at-risk; see (Rockafellar and Uryasev,, 2002, Prop. 13). Since , see (A.2), and , for all large enough we have . Denote . The second term can then be rewritten as
The functions with , , form a Glivenko-Cantelli class; see (van der Vaart,, 1998, p. 271). It implies that a.s., which concludes the proof.
The class of functions is in fact Donsker (van der Vaart,, 1998, p. 271). The strict concavity of implies that optimal matrices are unique, and as a complement to Lemma A.1 we get . Note that when an optimal bounded design measure is known, a selection procedure such that and a.s. is straightforwardly available: simply select the points that belong to the set on which .
Appendix B Non degeneracy of
To invoke Hμ,M in order to ensure the existence of a density having the required properties for all (which is essential for the convergence of Algorithm 1, see Theorem 3.1), we need to guarantee that for all , for some and . This is the object of the following lemma.
Lemma B.1.
Under HM, when in Algorithm 1, for all and there exists a.s. and such that for all .
Proof. Since the first points are selected, we have for . Let be the first for which . It implies that , and (3.9) implies that . Therefore, , and for all .
If the points were chosen randomly, would be enough to obtain that, from HM, and for all larger than some . However, here the situation is more complicated since points are accepted or rejected according to a sequential decision rule, and a more sophisticated argument is required. An expedient solution is to consider the worst possible choices of points, which yield the smallest value of and the largest value of . This approach is used in Lemma B.2 below, which allows us to conclude the proof.
Lemma B.2.
Under HM, any matrix obtained by choosing points out of independently distributed with and such that satisfies and a.s. for some and .
Proof. We first construct a lower bound on . Consider the criterion , and denote by the -point design matrix that minimizes over the design space formed by points i.i.d. with . We can write , where the correspond to the indices of positive in the minimization of with respect to under the constraints for all and . Obviously, any matrix obtained by choosing distinct points among satisfies .
For any , denote . Then, for any , , where the correspond to the values of sorted by increasing order for . For any , we thus have
showing that the worst situation corresponds to the smallest admissible ; that is, we only have to consider the case when as .
Since is concave, for any we have
(B.1)
where is the directional derivative of at in the direction .
For any and any , there exists a set such that on and . Indeed, any set on which is such that ; therefore, taking , we get . Denote , with and as , and take any . Applying HM-(ii) to the set defined above, we get
For larger than some we have , and therefore . The inequality (B.1) thus gives, for ,
(B.2)
The rest of the proof follows from results on conditional value-at-risk by Rockafellar and Uryasev, (2000) and Pflug, (2000). For a fixed , , and , we have
where the -quantile satisfies . For any and , denote
We can write , where the expectation is with respect to distributed with (Rockafellar and Uryasev,, 2000). Also, from Pflug, (2000), for any we can write , where denotes expectation for the empirical measure .
Now, from HM-(i), for any with ,
(B.3)
We also have , with a.s. as . Denote ; since , from HM-(i) there exists a.s. such that, for all , and, from (B.3), .
Therefore, for large enough , for any ,
The functions with , and , , form a Glivenko-Cantelli class; see (van der Vaart,, 1998, p. 271). This implies that there exists a.s. such that
Therefore, from (B.2), for all , which concludes the first part of the proof.
We construct now an upper bound on following steps similar to the above developments but exploiting now the convexity of the criterion . Its directional derivative is , with . Denote by the -point design matrix that maximizes over the design space formed by points i.i.d. with . We can write , where the correspond to the indices of positive in the maximization of with respect to under the constraints for all and . Any matrix obtained by selecting distinct points among satisfies .
For any we can write , where the correspond to the values of sorted by decreasing order for . For any , we thus have
showing that the worst case corresponds to the smallest admissible , and we can restrict our attention to the case when as .
The convexity of implies that, for any ,
(B.4)
Take , corresponding to some . From HM-(i),
Therefore, there exists some such that, for all , , and (B.4) gives
For , and , denote , . We have , , with satisfying . Also, for any and , , , where satisfies , and HM-(i) implies that . Since and a.s., there exists a.s. such that, for all , and . This implies that, for and ,
The rest of the proof is similar to the case above for , using the fact that the functions with , and , , form a Glivenko-Cantelli class.
Appendix C Convergence of
We consider the convergence properties of (3.6) when the matrix is fixed, that is,
(C.1)
where the have a fixed distribution with uniformly bounded density such that . We follow the arguments of Tierney, (1983). The construction of is like in Algorithm 1, with and following the recursion
(C.2)
with .
Theorem C.1.
Proof. We first show that is a.s. bounded. From the mean-value theorem, there exists a in such that . Denote . We can write
where , and . Therefore,
We have as since . Next, for and , , forms an -bounded martingale and therefore converges a.s. to some limit. Lemma 2 of Albert and Gardner, (1967, p. 190) then implies that a.s. as . Consider now the term . Since is bounded, for some and
where the equality follows from Albert and Gardner, (1967, Lemma 1, p. 189). This shows that is a.s. bounded. Therefore, is a.s. bounded away from zero.
We consider now the convergence of (C.1). Following Tierney, (1983), define
Denote by the increasing sequence of -fields generated by the ; we have and . We can rewrite (C.1) as , which gives
for all , is bounded since is bounded, and therefore . Since and , . The theorem of Robbins and Siegmund, (1971) then implies that converges a.s. to some limit and that a.s.; since , we obtain a.s. Since , is a.s. bounded away from zero, and is strictly positive in a neighborhood of , we obtain that , implying that a.s., which concludes the proof.
Tierney, (1983) uses ; the continuity of at then implies that a.s., and his construction also achieves the optimal rate of convergence of to , with as .
Appendix D Lipschitz continuity of
Lemma D.1.
Under HΦ and Hμ,M, the -quantile of the distribution of is a Lipschitz continuous function of .
Proof. For any , define the random variable and denote its distribution function and the associated -quantile. We have , and therefore
(D.1)
We first show that is Lipschitz continuous in . Indeed, for any , in , we have
(D.2)
where we used HΦ and the fact that .
Consider now and for two matrices and in . We have
and therefore
with the identity matrix. Since , denoting and , we obtain
(D.3)
In the rest of the proof we show that is Lipschitz continuous in . Take two matrices , and consider the associated -quantiles and , which we shall respectively denote and to simplify notation. From Hμ,M, the p.d.f. associated with is continuous at and satisfies . From the identities
we deduce
(D.4)
From HΦ, when substituting for and for in and , we get and , showing that as . Therefore, there exists some such that, for we have
(D.5)
Using (D.3), we also obtain for smaller than some
Therefore, when ,
with , where is the upper bound on in Hμ,M. Using (D.4) and (D.5) we thus obtain, for small enough,
Acknowledgements
The work of the first author was partly supported by project INDEX (INcremental Design of EXperiments) ANR-18-CE91-0007 of the French National Research Agency (ANR).
References
- Albert and Gardner, (1967) Albert, A. and Gardner, L. (1967). Stochastic Approximation and Nonlinear Regression. MIT Press, Cambridge, MA.
- Albright and Derman, (1972) Albright, S. and Derman, C. (1972). Asymptotically optimal policies for the stochastic sequential assignment problem. Management Science, 19(1):46–51.
- Borkar, (1997) Borkar, V. (1997). Stochastic approximation with two time scales. Systems & Control Letters, 29:291–294.
- Cook and Nachtsheim, (1980) Cook, R. and Nachtsheim, C. (1980). A comparison of algorithms for constructing exact -optimal designs. Technometrics, 22(3):315–324.
- Dalal et al., (2018) Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. (2018). Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. Proc. of Machine Learning Research, 75:1–35.
- Fedorov, (1972) Fedorov, V. (1972). Theory of Optimal Experiments. Academic Press, New York.
- Fedorov, (1989) Fedorov, V. (1989). Optimal design with bounded density: optimization algorithms of the exchange type. Journal of Statistical Planning and Inference, 22:1–13.
- Fedorov and Hackl, (1997) Fedorov, V. and Hackl, P. (1997). Model-Oriented Design of Experiments. Springer, Berlin.
- Harman, (2004) Harman, R. (2004). Lower bounds on efficiency ratios based on -optimal designs. In Di Bucchianico, A., Läuter, H., and Wynn, H., editors, mODa’7 – Advances in Model–Oriented Design and Analysis, Proceedings of the 7th Int. Workshop, Heeze (Netherlands), pages 89–96, Heidelberg. Physica Verlag.
- Kesten, (1958) Kesten, H. (1958). Accelerated stochastic approximation. The Annals of Mathematical Statistics, 29(1):41–59.
- Konda and Tsitsiklis, (2004) Konda, V. and Tsitsiklis, J. (2004). Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 14(2):796–819.
- Lakshminarayanan and Bhatnagar, (2017) Lakshminarayanan, C. and Bhatnagar, S. (2017). A stability criterion for two timescale stochastic approximation. Automatica, 79:108–114.
- Ljung, (1977) Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551–575.
- Ljung and Söderström, (1983) Ljung, L. and Söderström, T. (1983). Theory and Practice of Recursive Identification. MIT Press, Cambridge, MA.
- Metelkina and Pronzato, (2017) Metelkina, A. and Pronzato, L. (2017). Information-regret compromise in covariate-adaptive treatment allocation. The Annals of Statistics, 45(5):2046–2073.
- Mitchell, (1974) Mitchell, T. (1974). An algorithm for the construction of “-optimal” experimental designs. Technometrics, 16:203–210.
- Nie et al., (2018) Nie, R., Wiens, D., and Zhai, Z. (2018). Minimax robust active learning for approximately specified regression models. Canadian Journal of Statistics, 46(1):104–122.
- Pflug, (2000) Pflug, G. (2000). Some remarks on the value-at-risk and the conditional value-at-risk. In Uryasev, S., editor, Probabilistic Constrained Optimization, pages 272–281. Springer.
- Pilz, (1983) Pilz, J. (1983). Bayesian Estimation and Experimental Design in Linear Regression Models, volume 55. Teubner-Texte zur Mathematik, Leipzig. (also Wiley, New York, 1991).
- Pronzato, (2001) Pronzato, L. (2001). Optimal and asymptotically optimal decision rules for sequential screening and resource allocation. IEEE Transactions on Automatic Control, 46(5):687–697.
- Pronzato, (2004) Pronzato, L. (2004). A minimax equivalence theorem for optimum bounded design measures. Statistics & Probability Letters, 68:325–331.
- Pronzato, (2006) Pronzato, L. (2006). On the sequential construction of optimum bounded designs. Journal of Statistical Planning and Inference, 136:2783–2804.
- Pronzato and Pázman, (2013) Pronzato, L. and Pázman, A. (2013). Design of Experiments in Nonlinear Models. Asymptotic Normality, Optimality Criteria and Small-Sample Properties. Springer, LNS 212, New York.
- Pukelsheim, (1993) Pukelsheim, F. (1993). Optimal Experimental Design. Wiley, New York.
- Robbins and Siegmund, (1971) Robbins, H. and Siegmund, D. (1971). A convergence theorem for non negative almost supermartingales and some applications. In Rustagi, J., editor, Optimization Methods in Statistics, pages 233–257. Academic Press, New York.
- Rockafellar and Uryasev, (2000) Rockafellar, R. and Uryasev, S. (2000). Optimization of conditional value-at-risk. Journal of Risk, 2:21–42.
- Rockafellar and Uryasev, (2002) Rockafellar, R. and Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26:1443–1471.
- Sahm and Schwabe, (2001) Sahm, M. and Schwabe, R. (2001). A note on optimal bounded designs. In Atkinson, A., Bogacka, B., and Zhigljavsky, A., editors, Optimum Design 2000, chapter 13, pages 131–140. Kluwer, Dordrecht.
- Tierney, (1983) Tierney, L. (1983). A space-efficient recursive procedure for estimating a quantile of an unknown distribution. SIAM Journal on Scientific and Statistical Computing, 4(4):706–711.
- Tsypkin, (1983) Tsypkin, Y. (1983). Optimality in identification of linear plants. International Journal of Systems Science, 14(1):59–74.
- van der Vaart, (1998) van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
- Wang et al., (2019) Wang, H., Yang, M., and Stufken, J. (2019). Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association, 114(525):393–405.
- Wiens, (2005) Wiens, D. (2005). Robustness in spatial studies II: minimax design. Environmetrics, 16(2):205–217.
- Wynn, (1982) Wynn, H. (1982). Optimum submeasures with applications to finite population sampling. In Gupta, S. and Berger, J., editors, Statistical Decision Theory and Related Topics III. Proc. 3rd Purdue Symp., vol. 2, pages 485–495. Academic Press, New York.