Optimal Recovery from Inaccurate Data in Hilbert Spaces:
Regularize, but what of the Parameter?
Abstract
In Optimal Recovery, the task of learning a function from observational data is tackled deterministically by adopting a worst-case perspective tied to an explicit model assumption made on the functions to be learned. Working in the framework of Hilbert spaces, this article considers a model assumption based on approximability. It also incorporates observational inaccuracies modeled via additive errors bounded in $\ell_2$. Earlier works have demonstrated that regularization provides algorithms that are optimal in this situation, but did not fully identify the desired hyperparameter. This article fills the gap in both a local scenario and a global scenario. In the local scenario, which amounts to the determination of Chebyshev centers, the semidefinite recipe of Beck and Eldar (legitimately valid in the complex setting only) is complemented by a more direct approach, with the proviso that the observational functionals have orthonormal representers. In the said approach, the desired parameter is the solution to an equation that can be resolved via standard methods. In the global scenario, where linear algorithms rule, the parameter elusive in the works of Micchelli et al. is found as the byproduct of a semidefinite program. Additionally and quite surprisingly, in the case of observational functionals with orthonormal representers, it is established that any regularization parameter is optimal.
Key words and phrases: Regularization, Chebyshev center, semidefinite programming, S-procedure, hyperparameter selection.
AMS classification: 41A65, 46N40, 90C22, 90C47.
1 Introduction
1.1 Background on Optimal Recovery
This article is concerned with a central problem in Data Science, namely: a function $f$ is acquired through point evaluations
(1) $\qquad y_i = f(x^{(i)}), \qquad i = 1, \ldots, m,$
and these data should be used to learn $f$—or to recover it, with the terminology preferred in this article.
and these data should be used to learn —or to recover it, with the terminology preferred in this article. Importantly, the evaluation points are considered fixed entities in our scenario: they cannot be chosen in a favorable way, as in Information-Based Complexity [Novak and Woźniakowski, 2008], nor do they occur as independent realizations of a random variable, as in Statistical Learning Theory [Hastie, Tibshirani, and Friedman, 2009]. In particular, without an underlying probability distribution, the performance of the recovery process cannot be assessed via generalization error. Instead, it is assessed via a notion of worst-case error, central to the theory of Optimal Recovery [Micchelli and Rivlin, 1977].
To outline this theory, we make the framework slightly more abstract. Precisely, given a normed space $X$, the unknown function is replaced by an element $f \in X$. This element is accessible only through a priori information expressing an educated belief about $f$ and a posteriori information akin to (1). In other words, our partial knowledge about $f$ is summed up via
• the fact that $f \in \mathcal{K}$ for a subset $\mathcal{K}$ of $X$ called a model set;
• the observational data $y_i = \lambda_i(f)$, $i = 1, \ldots, m$, for some linear functionals $\lambda_1, \ldots, \lambda_m$ making up the observation map $\Lambda : X \to \mathbb{R}^m$.
We wish to approximate $f$ by some $z \in X$ produced using this partial knowledge of $f$. Since the error $\|f - z\|$ involves the unknown $f$, which is only accessible via $\mathcal{K}$ and $y = \Lambda f$, we take a worst-case perspective leading to the local worst-case error
(2) $\qquad \mathrm{lwce}_y(z) := \sup\big\{ \|f - z\| : f \in \mathcal{K},\ \Lambda f = y \big\}.$
Our objective consists in finding an element $z$ that minimizes $\mathrm{lwce}_y(z)$. Such a $z$ can be described, almost tautologically, as a center of a smallest ball containing the set $\{ f \in \mathcal{K} : \Lambda f = y \}$. It is called a Chebyshev center of this set of model- and data-consistent elements. This remark, however, does not come with any practical construction of a Chebyshev center.
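For intuition only, here is a minimal Python sketch (added for this exposition and not part of the original argument): it samples a handful of hypothetical model- and data-consistent elements of $\mathbb{R}^3$ and computes the center and radius of their smallest enclosing ball with CVXPY, which is exactly the Chebyshev center and Chebyshev radius of this finite sample.

```python
import numpy as np
import cvxpy as cp

# A few hypothetical model- and data-consistent elements, here just random points in R^3.
rng = np.random.default_rng(0)
F = rng.standard_normal((20, 3))

z = cp.Variable(3)   # candidate center
r = cp.Variable()    # radius of the enclosing ball

# Chebyshev center of the sample = center of a smallest ball containing all of its elements.
constraints = [cp.norm(z - F[i], 2) <= r for i in range(F.shape[0])]
cp.Problem(cp.Minimize(r), constraints).solve()

print("Chebyshev center of the sample:", z.value)
print("Chebyshev radius of the sample:", r.value)
```

In the problem at hand, the consistent set is of course not finite, which is why the explicit constructions of Section 3 are needed.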
The term local was used above to make a distinction with the global worst-case error of a recovery map $\Delta : \mathbb{R}^m \to X$, defined as
(3) $\qquad \mathrm{gwce}(\Delta) := \sup_{f \in \mathcal{K}} \|f - \Delta(\Lambda f)\|.$
The minimal value of $\mathrm{gwce}(\Delta)$ over all recovery maps $\Delta$ is called the intrinsic error (of the observation map $\Lambda$ over the model set $\mathcal{K}$) and the maps that achieve this minimal value are called globally optimal recovery maps. Our objective consists in constructing such maps—of course, the map that assigns to $y$ a Chebyshev center of $\{ f \in \mathcal{K} : \Lambda f = y \}$ is one of them, but it may be impractical. By contrast, for model sets that are convex and symmetric, the existence of linear maps among the set of globally optimal recovery maps is guaranteed by fundamental results from Optimal Recovery in at least two settings: when $X$ is a Hilbert space and when $X$ is an arbitrary normed space but the full recovery of $f$ gives way to the recovery of a quantity of interest $Q(f)$, $Q$ being a linear functional. We refer the reader to [Foucart, To appear, Chapter 9] for details.
1.2 The specific problem
The problem solved in this article is a quintessential Optimal Recovery problem—its specificity lies in the particular model set and in the incorporation of errors in the observation process. The underlying normed space is a Hilbert space and is therefore denoted by $H$ from now on. Reproducing kernel Hilbert spaces, whose usage is widespread in Data Science [Schölkopf and Smola, 2002], are of particular interest as point evaluations of type (1) make perfect sense there.
Concerning the model set, we concentrate on an approximation-based choice that is increasingly scrutinized, see e.g. [Maday, Patera, Penn, and Yano, 2015], [DeVore, Petrova, and Wojtaszczyk, 2017] and [Cohen, Dahmen, Mula, and Nichols, 2020]. Depending on a linear subspace $\mathcal{V}$ of $H$ and on a parameter $\epsilon > 0$, it takes the form
$\mathcal{K} = \big\{ f \in H : \mathrm{dist}(f, \mathcal{V}) \le \epsilon \big\}.$
Binev, Cohen, Dahmen, DeVore, Petrova, and Wojtaszczyk [2017] completely solved the Optimal Recovery problem with exact data in this situation (locally and globally). Precisely, they showed that the solution to
(4) $\qquad f^\star := \underset{f \in H}{\mathrm{argmin}}\ \mathrm{dist}(f, \mathcal{V}) \qquad \text{subject to } \Lambda f = y,$
which clearly belongs to the model- and data-consistent set $\{ f \in \mathcal{K} : \Lambda f = y \}$, turns out to be its Chebyshev center. Moreover, with $P_{\mathcal{V}}$ and $P_{\mathcal{V}^\perp}$ denoting the orthogonal projectors onto $\mathcal{V}$ and onto the orthogonal complement of $\mathcal{V}$, the fact that $\mathrm{dist}(f, \mathcal{V}) = \|P_{\mathcal{V}^\perp} f\|$ makes the optimization program (4) tractable. It can actually be seen that $y \mapsto f^\star$ is a linear map. This is a significant advantage because this map can then be precomputed in an offline stage knowing only $\mathcal{V}$ and $\Lambda$, so that the program (4) need not be solved afresh for each new data vector $y$ arriving in an online stage.
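For a concrete feel of the exact-data program (4) in a finite-dimensional setting, here is a minimal Python sketch (with hypothetical dimensions and randomly generated $\mathcal{V}$ and $\Lambda$, so it is an illustration rather than the authors' implementation): it solves (4) through the KKT system of the equivalent quadratic program and verifies the linearity of the map $y \mapsto f^\star$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 10, 4, 3                                   # ambient dimension, observations, dim of V

V, _ = np.linalg.qr(rng.standard_normal((n, k)))     # orthonormal basis of the subspace V
Lam = rng.standard_normal((m, n))                    # observation map Lambda (rows = functionals)
y = rng.standard_normal(m)                           # exact data

P_perp = np.eye(n) - V @ V.T                         # projector onto the orthogonal complement of V

def recover(y):
    # Minimize ||P_perp f||^2 subject to Lam f = y, via the KKT system
    # [2 P_perp  Lam^T] [f ]   [0]
    # [ Lam       0   ] [mu] = [y]
    KKT = np.block([[2 * P_perp, Lam.T], [Lam, np.zeros((m, m))]])
    rhs = np.concatenate([np.zeros(n), y])
    return np.linalg.solve(KKT, rhs)[:n]

f_star = recover(y)
print("data fit ||Lam f* - y|| :", np.linalg.norm(Lam @ f_star - y))
print("distance of f* to V     :", np.linalg.norm(P_perp @ f_star))
print("linearity check         :", np.linalg.norm(recover(2 * y) - 2 * f_star))
```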
Concerning the observation process, instead of exact data $y_i = \lambda_i(f)$, it is now assumed that
$y_i = \lambda_i(f) + e_i, \qquad i = 1, \ldots, m,$
for some unknown error vector $e \in \mathbb{R}^m$. This error vector is not modeled as random noise but through the deterministic $\ell_2$-bound $\|e\|_2 \le \eta$. Although other $\ell_p$-norms can be considered for the optimal recovery of $Q(f)$ when $Q$ is a linear functional on an arbitrary normed space (see [Ettehad and Foucart, 2021]), here the arguments rely critically on $\mathbb{R}^m$ being endowed with the $\ell_2$-norm. It will simply be written as $\|\cdot\|$ below, hoping that it does not create confusion with the Hilbert norm on $H$.
For our specific problem, the worst-case recovery errors (2) and (3) need to be adjusted. The local worst-case recovery error at $y$ for $z \in H$ becomes
$\mathrm{lwce}_y(z) := \sup\big\{ \|f - z\| : f \in \mathcal{K},\ \|\Lambda f - y\| \le \eta \big\}.$
As for the global worst-case error of a recovery map $\Delta : \mathbb{R}^m \to H$, it reads
$\mathrm{gwce}(\Delta) := \sup\big\{ \|f - \Delta(\Lambda f + e)\| : f \in \mathcal{K},\ \|e\| \le \eta \big\}.$
Note that both worst-case errors are infinite if one can find a nonzero $h$ in $\mathcal{V} \cap \ker \Lambda$. Indeed, the element $f + t h$, $t \in \mathbb{R}$, obeys $\mathrm{dist}(f + t h, \mathcal{V}) \le \epsilon$ and $\|\Lambda(f + t h) - y\| \le \eta$ whenever $f$ does, so for instance $\|(f + t h) - z\|$ can be made arbitrarily large. Thus, we always make the assumption that
(5) $\qquad \mathcal{V} \cap \ker \Lambda = \{0\}.$
We keep in mind that the latter forces $\dim \mathcal{V} \le m$, as can be seen by dimension arguments. With $\Lambda^*$ denoting the Hermitian adjoint of $\Lambda$, another assumption that we sometimes make reads
(6) $\qquad \Lambda \Lambda^* = \mathrm{Id}.$
This is not extremely stringent: assuming the surjectivity of $\Lambda$ is quite natural, otherwise certain observations need not be collected; then the map $\Lambda$ can be preprocessed into another map $\widetilde{\Lambda}$ satisfying $\widetilde{\Lambda} \widetilde{\Lambda}^* = \mathrm{Id}$ by setting $\widetilde{\Lambda} = (\Lambda \Lambda^*)^{-1/2} \Lambda$. Incidentally, if $u_1, \ldots, u_m \in H$ represent the Riesz representers of the observation functionals $\lambda_1, \ldots, \lambda_m$, characterized by $\lambda_i(f) = \langle u_i, f \rangle$ for all $f \in H$, then the assumption (6) is equivalent to the orthonormality of the system $(u_1, \ldots, u_m)$. In a reproducing kernel Hilbert space with kernel $K$, if the $\lambda_i$'s are point evaluations at some points $x^{(i)}$, so that $u_i = K(\cdot, x^{(i)})$, then (6) is equivalent to $K(x^{(i)}, x^{(j)}) = \delta_{i,j}$ for all $i, j$. This occurs e.g. for the Paley–Wiener space of functions with Fourier transform supported on $[-\pi, \pi]$ when the evaluation points come from an integer grid, since the kernel is given by $K(x, x') = \mathrm{sinc}(x - x') = \sin(\pi(x - x'))/(\pi(x - x'))$, $x, x' \in \mathbb{R}$.
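As a quick numerical sanity check of the Paley–Wiener example (the integer grid below is an arbitrary choice), the following Python snippet evaluates the sinc kernel at integer points and confirms that the Gram matrix of the representers is the identity, i.e., that the system of representers is orthonormal and (6) holds.

```python
import numpy as np

grid = np.arange(-5, 6)                          # integer evaluation points
X, Xp = np.meshgrid(grid, grid, indexing="ij")
gram = np.sinc(X - Xp)                           # np.sinc(t) = sin(pi*t)/(pi*t): the Paley-Wiener kernel

# Orthonormality of the representers amounts to the Gram matrix being the identity.
print(np.allclose(gram, np.eye(grid.size)))      # prints True
```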
1.3 Main results
There are previous works on Optimal Recovery in Hilbert spaces in the presence of observation error bounded in $\ell_2$. Notably, [Beck and Eldar, 2007] dealt with the local setting, while [Melkman and Micchelli, 1979] and [Micchelli, 1993] dealt with the global setting. These works underline the importance of regularization, which is prominent in many other settings [Chen and Haykin, 2002]. They establish that the optimal recovery maps are obtained by solving the unconstrained program
(7) $\qquad \underset{f \in H}{\mathrm{minimize}}\ (1 - \tau) \|f - P_{\mathcal{V}} f\|^2 + \tau \|\Lambda f - y\|^2$
for some parameter $\tau \in (0, 1)$. It is the precise choice of this regularization parameter which is the purpose of this article. Assuming from now on that $H$ is finite-dimensional (it is likely that the results are still valid in the infinite-dimensional case, but then it is unclear how to solve (8) and (9) numerically, so the infinite-dimensional case is not given proper scrutiny in the rest of the article), we provide an (almost) complete picture of the local and global Optimal Recovery solutions, as summarized in the four points below, three of them being new:
L1. With $H$ restricted here to be a complex Hilbert space, the Chebyshev center of the set $\{ f \in \mathcal{K} : \|\Lambda f - y\| \le \eta \}$ is the minimizer of (7) for the choice $\tau_\sharp = d_\sharp/(c_\sharp + d_\sharp)$, where $c_\sharp, d_\sharp \ge 0$ are solutions to the semidefinite program of Beck and Eldar recalled in Section 3.1. (In the statement of this semidefinite program and elsewhere, the notation $A \succeq 0$ means that an operator $A$ is positive semidefinite on $H$, i.e., that $\langle A f, f \rangle \ge 0$ for all $f \in H$.)
L2. With the observation map satisfying $\Lambda \Lambda^* = \mathrm{Id}$, but with $H$ now allowed to be a real or complex Hilbert space, the Chebyshev center of the same set is the minimizer of (7) for a parameter $\tau_\sharp$ characterized as the solution to a scalar equation (8) involving the smallest eigenvalue of a $\tau$-dependent self-adjoint operator.
G1. A globally optimal recovery map is provided by the linear map sending $y$ to the minimizer of (7) with parameter $\tau_\star = b_\star/(a_\star + b_\star)$, where $a_\star, b_\star \ge 0$ are solutions to the semidefinite program (9).
G2. With the observation map satisfying $\Lambda \Lambda^* = \mathrm{Id}$, the map sending $y$ to the minimizer of (7) is a globally optimal recovery map for every choice of the parameter $\tau \in (0, 1)$.
Before entering the technicalities, a few comments are in order to put these results in context. Item L1 is the result of [Beck and Eldar, 2007] (see Corollary 3.2 there) adapted to our situation. It relies on an extension of the S-lemma involving two quadratic constraints. This extension is valid in the complex finite-dimensional setting, but not necessarily in the real setting, hence the restriction on $H$ (this does not preclude the validity of the result in the real setting, though). It is worth pointing out the nonlinearity of the map that sends $y$ to the above Chebyshev center. Incidentally, we can safely talk about the Chebyshev center, because it is known [Garkavi, 1962] that a bounded set in a uniformly convex Banach space has exactly one Chebyshev center. A sketch of the argument adapted to our situation is presented in the appendix.
For item L2, working with an observation map satisfying $\Lambda \Lambda^* = \mathrm{Id}$ allows us to construct the Chebyshev center even in the setting of a real Hilbert space. This is possible because our argument does not rely on the extension of the S-lemma—it just uses the obvious implication. As for equation (8), it is easily solved using the bisection method or the Newton/secant method. Moreover, it gives some insight on the value of the optimal parameter $\tau_\sharp$. For instance, the proof reveals that $\tau_\sharp$ always lies between $1/2$ and $\epsilon^2/(\epsilon^2 + \eta^2)$. When $\epsilon > \eta$, say, the optimal parameter should then satisfy $\tau_\sharp \ge 1/2$, which is somewhat intuitive: $\epsilon > \eta$ means that there is more model mismatch than data mismatch, so the regularization should penalize model fidelity less than data fidelity by taking $1 - \tau_\sharp \le \tau_\sharp$, i.e., $\tau_\sharp \ge 1/2$. As an aside, we point out that, here too, the map that sends $y$ to the Chebyshev center is not a linear map—if it were, then the optimal parameter should be independent of $y$.
In contrast, the globally optimal recovery map of item G1 is linear. It is one of several globally optimal recovery maps, since the locally optimal one (which is nonlinear) is also globally optimal. However, as revealed in the reproducible files accompanying this article (MATLAB and Python files illustrating the findings of this article are located at https://github.com/foucart/COR), it is in general the only regularization map that turns out to be globally optimal. The fact that regularization produces globally optimal recovery maps was recognized by Micchelli, who wrote in the abstract of [Micchelli, 1993] that "the regularization parameter must be chosen with care". However, a recipe for selecting the parameter was not given there, except on a specific example. The closest to a nonexhaustive search is found in [Plaskota, 1996, Lemma 2.6.2] in a particular case, but even this result does not translate into a numerically tractable recipe. The selection stemming from (9) does, at least when $H$ is finite-dimensional, which is assumed here. Semidefinite programs can indeed be solved in MATLAB using CVX [Grant and Boyd, 2014] and in Python using CVXPY [Diamond and Boyd, 2016].
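To illustrate how such a semidefinite selection can be carried out in practice, here is a minimal CVXPY sketch. It is not taken from the repository mentioned above, and it assumes a plausible form for the program (9)—minimizing $a\epsilon^2 + b\eta^2$ over $a, b \ge 0$ subject to $a P_{\mathcal{V}^\perp} + b \Lambda^* \Lambda \succeq \mathrm{Id}$, with the parameter then read off as $\tau_\star = b_\star/(a_\star + b_\star)$—so both the constraint and the final formula should be checked against the precise statements of Section 4.1; all inputs are randomly generated placeholders.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n, m, k = 8, 4, 2
eps, eta = 1.0, 0.3                                     # approximability level and noise level

V, _ = np.linalg.qr(rng.standard_normal((n, k)))        # orthonormal basis of V
P_perp = np.eye(n) - V @ V.T                            # projector onto the complement of V
Lam = rng.standard_normal((m, n))                       # observation map
G = Lam.T @ Lam
P_perp, G = 0.5 * (P_perp + P_perp.T), 0.5 * (G + G.T)  # enforce exact symmetry for the PSD constraint

a = cp.Variable(nonneg=True)
b = cp.Variable(nonneg=True)

# Assumed form of the semidefinite program: minimize a*eps^2 + b*eta^2
# subject to a*P_perp + b*Lam^T Lam >= Id in the positive semidefinite sense.
problem = cp.Problem(cp.Minimize(a * eps**2 + b * eta**2),
                     [a * P_perp + b * G >> np.eye(n)])
problem.solve()

tau_star = b.value / (a.value + b.value)                # assumed recipe for the parameter of item G1
print("optimal value of the semidefinite program:", problem.value)
print("selected regularization parameter tau_star:", tau_star)
```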
Finally, a surprise arises in item G2. Working with an observation map satisfying $\Lambda \Lambda^* = \mathrm{Id}$, the latter indeed reveals that the regularization parameter does not need to be chosen with care after all, since regularization maps are globally optimal no matter how the parameter is chosen. The precise interpretation of the choices $\tau = 0$ and $\tau = 1$ will be elucidated later.
The rest of this article is organized as follows. Section 2 gathers some auxiliary results that are used in the proofs of the main results. Section 3 elucidates item L1 and establishes item L2—in other words, it is concerned with local optimality. Section 4, which is concerned with global optimality, is the place where items G1 and G2 are proved. Lastly, a short appendix containing some side information is included after the bibliography.
2 Technical Preparation
This section establishes (or recalls) a few results that we isolate here in order not to disrupt the flow of subsequent arguments.
2.1 S-lemma and S-procedure
Loosely speaking, the S-procedure is a relaxation technique expressing the fact that a quadratic inequality is a consequence of some quadratic constraints. In case of a single quadratic constraint, the relaxation turns out to be exact. This result, known as the S-lemma, can be stated as follows: given quadratic functions $q_0$ and $q_1$ defined on $\mathbb{K}^n$, with $\mathbb{K} = \mathbb{R}$ or $\mathbb{K} = \mathbb{C}$,
$\big[\, q_0(x) \ge 0 \text{ whenever } q_1(x) \ge 0 \,\big] \iff \big[\, q_0 \ge c\, q_1 \text{ for some } c \ge 0 \,\big],$
provided $q_1(\bar{x}) > 0$ for some $\bar{x}$. With more than one quadratic constraint, say $q_1, \ldots, q_k$ with $k \ge 2$, the validity of $q_0 \ge 0$ on the constraint set is still a consequence of $q_0 \ge c_1 q_1 + \cdots + c_k q_k$ holding for some $c_1, \ldots, c_k \ge 0$, but the reverse implication does not hold anymore. There is a subtlety when $k = 2$, as the reverse implication holds for $\mathbb{K} = \mathbb{C}$ but not for $\mathbb{K} = \mathbb{R}$, see [Pólik and Terlaky, 2007, Section 3]. However, if the quadratic constraints do not feature linear terms, then the reverse implication holds for $\mathbb{K} = \mathbb{R}$ also when $k = 2$. Since this result of [Polyak, 1998, Theorem 4.1] is to be invoked later, we state it formally below.
Theorem 1.
Suppose that $n \ge 3$ and that quadratic functions $q_0, q_1, q_2$ on $\mathbb{R}^n$ take the form $q_i(x) = \langle A_i x, x \rangle + \alpha_i$ for symmetric matrices $A_0, A_1, A_2$ and scalars $\alpha_0, \alpha_1, \alpha_2$. Then
$\big[\, q_0(x) \ge 0 \text{ whenever } q_1(x) \ge 0 \text{ and } q_2(x) \ge 0 \,\big] \iff \big[\, q_0 \ge c_1 q_1 + c_2 q_2 \text{ for some } c_1, c_2 \ge 0 \,\big],$
provided $q_1(\bar{x}) > 0$ and $q_2(\bar{x}) > 0$ for some $\bar{x} \in \mathbb{R}^n$ and $\mu_1 A_1 + \mu_2 A_2 \succ 0$ for some $\mu_1, \mu_2 \in \mathbb{R}$.
2.2 Regularization
In this subsection, we take a closer look at the regularization program (7). The result below shows that its solution depends linearly on $y$. In fact, the result covers a slightly more general program and the linearity claim follows by taking $A = P_{\mathcal{V}^\perp}$, $B = \Lambda$, $a = 0$, and $b = y$.
Proposition 2.
Let $A, B$ be linear maps from $H$ into other Hilbert spaces containing points $a, b$, respectively. For $\tau \in (0, 1)$, the optimization program
(10) $\qquad \underset{f \in H}{\mathrm{minimize}}\ (1 - \tau) \|A f - a\|^2 + \tau \|B f - b\|^2$
has solutions $f_\tau$ characterized by
(11) $\qquad \big( (1 - \tau) A^* A + \tau B^* B \big) f_\tau = (1 - \tau) A^* a + \tau B^* b.$
Moreover, if $\ker A \cap \ker B = \{0\}$, then $f_\tau$ is uniquely given by
(12) $\qquad f_\tau = \big( (1 - \tau) A^* A + \tau B^* B \big)^{-1} \big( (1 - \tau) A^* a + \tau B^* b \big).$
Proof.
The program (10) can be interpreted as a standard least squares problem, namely as
$\underset{f \in H}{\mathrm{minimize}}\ \left\| \begin{bmatrix} \sqrt{1-\tau}\, A \\ \sqrt{\tau}\, B \end{bmatrix} f - \begin{bmatrix} \sqrt{1-\tau}\, a \\ \sqrt{\tau}\, b \end{bmatrix} \right\|^2.$
According to the normal equations, its solutions are characterized by
$\begin{bmatrix} \sqrt{1-\tau}\, A \\ \sqrt{\tau}\, B \end{bmatrix}^* \begin{bmatrix} \sqrt{1-\tau}\, A \\ \sqrt{\tau}\, B \end{bmatrix} f_\tau = \begin{bmatrix} \sqrt{1-\tau}\, A \\ \sqrt{\tau}\, B \end{bmatrix}^* \begin{bmatrix} \sqrt{1-\tau}\, a \\ \sqrt{\tau}\, b \end{bmatrix},$
which is a rewriting of (11). Next, if $\ker A \cap \ker B = \{0\}$, then, for any $h \in H$,
$\big\langle \big( (1-\tau) A^* A + \tau B^* B \big) h, h \big\rangle = (1-\tau) \|A h\|^2 + \tau \|B h\|^2 \ge 0,$
with equality only possible when $A h = 0$ and $B h = 0$, i.e., $h = 0$. This shows that $(1-\tau) A^* A + \tau B^* B$ is positive definite, and hence invertible, which allows us to write (12) as a consequence of (11). ∎
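As a small numerical check of Proposition 2 (with hypothetical finite-dimensional maps $A$, $B$ and points $a$, $b$ standing in for those of program (10)), the sketch below solves the normal equations (12) and compares the result with the stacked least-squares formulation used in the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 6, 4, 3
A = rng.standard_normal((p, n)); a = rng.standard_normal(p)
B = rng.standard_normal((q, n)); b = rng.standard_normal(q)
tau = 0.35

# Normal equations (12): ((1-tau) A^T A + tau B^T B) f = (1-tau) A^T a + tau B^T b.
M = (1 - tau) * A.T @ A + tau * B.T @ B
f_tau = np.linalg.solve(M, (1 - tau) * A.T @ a + tau * B.T @ b)

# Cross-check against the equivalent stacked (weighted) least-squares problem.
stacked = np.vstack([np.sqrt(1 - tau) * A, np.sqrt(tau) * B])
target = np.concatenate([np.sqrt(1 - tau) * a, np.sqrt(tau) * b])
f_ls, *_ = np.linalg.lstsq(stacked, target, rcond=None)

print("agreement between (12) and least squares:", np.allclose(f_tau, f_ls))
```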
The expression (12) is not always the most convenient one. Under extra conditions on $A$ and $B$, we shall see that $f_\tau$, $\tau \in (0, 1)$, can in fact be expressed as a convex combination of the elements $f_0$ and $f_1$ interpreted as
$f_0 := \underset{f :\, A f = a}{\mathrm{argmin}}\ \|B f - b\| \qquad \text{and} \qquad f_1 := \underset{f :\, B f = b}{\mathrm{argmin}}\ \|A f - a\|.$
(Intuitively, the solution to the program (10), written as the minimization of $\|A f - a\|^2 + \frac{\tau}{1-\tau} \|B f - b\|^2$, becomes, as $\tau \to 0$, the minimizer of $\|B f - b\|$ subject to $A f = a$. This explains the interpretation of $f_0$. A similar argument explains the interpretation of $f_1$.)
The requirements that $a \in \mathrm{ran}\, A$ and $b \in \mathrm{ran}\, B$ need to be imposed for $f_0$ and $f_1$ to even exist, and the condition $\ker A \cap \ker B = \{0\}$ easily guarantees that $f_0$ and $f_1$ are unique. They obey
(13) |
For instance, the first identity reflects the constraint in the optimization program defining $f_0$, while the second is obtained by expanding the squared objective around the minimizer. At this point, we are ready to establish our claim under extra conditions on $A$ and $B$, namely that they are orthogonal projections. These conditions will be in place when the observation map satisfies $\Lambda \Lambda^* = \mathrm{Id}$. Indeed, in view of $\|\Lambda f - y\| = \|\Lambda^* \Lambda f - \Lambda^* y\|$ for any $f \in H$, the regularization program (7) also reads
$\underset{f \in H}{\mathrm{minimize}}\ (1 - \tau) \|P_{\mathcal{V}^\perp} f - 0\|^2 + \tau \|\Lambda^* \Lambda f - \Lambda^* y\|^2,$
where both $P_{\mathcal{V}^\perp}$ and $\Lambda^* \Lambda$ are orthogonal projections. The result below will then be applied with $P = P_{\mathcal{V}^\perp}$, $Q = \Lambda^* \Lambda$, $v = 0$, and $w = \Lambda^* y$.
Proposition 3.
Let $P, Q$ be two orthogonal projectors on $H$ such that $\ker P \cap \ker Q = \{0\}$ and let $v \in \mathrm{ran}\, P$, $w \in \mathrm{ran}\, Q$, with associated elements $f_0 := \mathrm{argmin}\{\|Q f - w\| : P f = v\}$ and $f_1 := \mathrm{argmin}\{\|P f - v\| : Q f = w\}$. For $\tau \in (0, 1)$, the solution $f_\tau$ to the optimization program
(14) $\qquad \underset{f \in H}{\mathrm{minimize}}\ (1 - \tau) \|P f - v\|^2 + \tau \|Q f - w\|^2$
satisfies
(15) $\qquad f_\tau = (1 - \tau)\, f_0 + \tau\, f_1.$
Moreover, one has
(16) $\qquad \|P f_\tau - v\| = \tau\, \|P f_1 - v\| \qquad \text{and} \qquad \|Q f_\tau - w\| = (1 - \tau)\, \|Q f_0 - w\|.$
Proof.
Taking the extra conditions on and into account, the identities (13) read
In this form, they imply that
(17) |
In a similar fashion, by exchanging the roles of and , and consequently also of and , we have , i.e., . Subtracting (17) from the latter yields , in other words . Then, the element belongs to , so that . In summary, we have established that
(18) |
From here, we can deduce the two parts of the proposition. For the first part, we notice that
which shows that satisfies the relation (11) characterizing the minimizer of (14), so that , as announced in (15). For the second part, we notice that
so the first equality of (16) follows by taking the norm. The second equality of (16) is derived in a similar fashion. ∎
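The identities of Proposition 3 can be observed numerically. The Python sketch below (with randomly generated projectors and points, so purely illustrative) computes the regularized solution $f_\tau$ via the normal equations, computes $f_0$ and $f_1$ as the constrained minimizers appearing in the statement, and checks (15) and (16).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8                                                 # ambient dimension (hypothetical)

def random_projector(dim):
    U, _ = np.linalg.qr(rng.standard_normal((n, dim)))
    return U @ U.T

P, Q = random_projector(5), random_projector(5)       # orthogonal projectors with ker P ∩ ker Q = {0}
v = P @ rng.standard_normal(n)                        # a point in ran P
w = Q @ rng.standard_normal(n)                        # a point in ran Q

def kernel_basis(proj):
    # orthonormal basis of ker(proj): eigenvectors associated with the eigenvalue 0
    vals, vecs = np.linalg.eigh(proj)
    return vecs[:, vals < 0.5]

# f0 = argmin ||Q f - w|| subject to P f = v, parametrized as f = v + (basis of ker P) g
NP = kernel_basis(P)
f0 = v + NP @ np.linalg.lstsq(Q @ NP, w - Q @ v, rcond=None)[0]
# f1 = argmin ||P f - v|| subject to Q f = w
NQ = kernel_basis(Q)
f1 = w + NQ @ np.linalg.lstsq(P @ NQ, v - P @ w, rcond=None)[0]

tau = 0.3
f_tau = np.linalg.solve((1 - tau) * P + tau * Q, (1 - tau) * v + tau * w)   # normal equations (11)

print("check of (15):", np.linalg.norm(f_tau - ((1 - tau) * f0 + tau * f1)))
print("check of (16):", np.linalg.norm(P @ f_tau - v) - tau * np.linalg.norm(P @ f1 - v),
      np.linalg.norm(Q @ f_tau - w) - (1 - tau) * np.linalg.norm(Q @ f0 - w))
```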
We complement Proposition 3 with a few additional pieces of information.
Remark.
Under the assumptions of Proposition 3, the solution to (14) is also solution to
Indeed, at , the squared objective function equals , while at an arbitrary , it satisfies
In the case , , , and , the choice is quite relevant, since the above optimization program becomes equivalent to
Its solution is clearly in the model- and data-consistent set . In fact, this could have been a natural guess for its Chebyshev center, but item L2 reveals the invalidity of such a guess. Nonetheless, the special parameter will make a reappearance in the argument leading to item L2.
Remark.
Remark.
Considering again the case $P = P_{\mathcal{V}^\perp}$, $Q = \Lambda^* \Lambda$, $v = 0$, and $w = \Lambda^* y$, Proposition 3 implies that $f_\tau \in \mathcal{V} + \mathrm{span}\{u_1, \ldots, u_m\}$ for any $\tau \in (0, 1)$, given that the latter holds for $f_0$ and for $f_1$. For $f_0$, this is because the constraint of the optimization program defining $f_0$ imposes $f_0 \in \mathcal{V}$. For $f_1$, this is a result established e.g. in [Foucart, Liao, Shahrampour, and Wang, 2020, Theorem 2]. The said result also provides an efficient way to compute the solution of (7) even when $H$ is infinite-dimensional, as stated in the appendix.
3 Local Optimality
Our goal in this section is to determine locally optimal recovery maps. In other words, the section is concerned with Chebyshev centers. We start by considering the situation of an arbitrary observation map $\Lambda$, but with a restriction on the space $H$. Next, lifting this restriction on $H$, we refine the result in the particular case of an observation map satisfying $\Lambda \Lambda^* = \mathrm{Id}$.
3.1 Arbitrary observations
In this subsection, we reproduce a result from [Beck and Eldar, 2007], albeit with different notation, and explain how it implies the statement of item L1. The result in question, namely Corollary 3.2, relies on the S-procedure with two constraints, and as such cannot be claimed in the real setting.
Theorem 4.
Let be a complex Hilbert space. Let be two linear maps from into other Hilbert spaces containing points , respectively. Suppose the existence of such that and and the existence of such that is positive definite. Then the Chebyshev center of equals , where are solutions to
s.to | ||||
and |
The statement made in item L1 is of course derived by taking , , , and . Theorem 4 is indeed applicable, as satisfies the strict feasibility conditions, while the positive definiteness condition is not only fulfilled for some , but for all , since , with equality only possible if , i.e., if thanks to the assumption (5). We also note that, by virtue of (12), the element defined above is nothing else than the regularized solution with parameter .
3.2 Orthonormal observations
In this subsection, we place ourselves in the situation of an observation map satisfying $\Lambda \Lambda^* = \mathrm{Id}$ and we provide a proof of the statements made in item L2. In fact, we prove some slightly more general results and L2 follows by taking $P = P_{\mathcal{V}^\perp}$, $Q = \Lambda^* \Lambda$, $v = 0$, and $w = \Lambda^* y$. Note that we must separate the cases where $P = \mathrm{Id}$ (corresponding to $\mathcal{V} = \{0\}$) and where $P$ is a proper orthogonal projection (corresponding to $\mathcal{V} \ne \{0\}$). We emphasize that, in each of these two cases, the optimal parameter is not independent of $y$. Therefore, in view of (15) and of the linear dependence of $f_0$ and $f_1$ on $y$, the regularized solution does not depend linearly on $y$. In other words, the locally optimal recovery map is not a linear map. The following two simple lemmas will be used to deal with both cases.
Lemma 5.
Let be two linear maps from into other Hilbert spaces containing points , respectively. Given , let
If the orthogonality conditions
(20) |
are fulfilled, then is the Chebyshev center of the set i.e., for any ,
(21) |
Proof.
First, writing , we easily see that the right-hand side of (21) reduces to . Second, let us remark that the orthogonality conditions guarantee that both satisfy and . For instance, we have
(22) |
where the latter inequality reflects the feasibility of . Therefore, the left-hand side of (21) is bounded below by
(23) |
i.e., by the right-hand side of (21). ∎
The next lemma somehow relates to the S-procedure. However, it does not involve the coveted (and usually invalid) equivalence, but only the straightforward implication.
Lemma 6.
Let be two linear maps from into other Hilbert spaces containing points , respectively. Given and , suppose that
(24) |
and that there exist such that
(25) |
as well as
(26) |
Then, one has
(27) |
Proof.
By writing the variable in the optimization program (27) as , the constraints on transform into constraints on . Thanks to (24), the latter constraints read
Combining these constraints—specifically, multiplying the first by , the second by , and summing—implies that
where (25) and (26) were exploited in the last step. In other words, one has , i.e., , under the constraints on , proving that is indeed a maximizer in (27). ∎
3.2.1 The case $P = \mathrm{Id}$
As mentioned earlier, the case $P = \mathrm{Id}$ corresponds to the choice $\mathcal{V} = \{0\}$, i.e., to a model set being an origin-centered ball in $H$, and to regularizations being classical Tikhonov regularizations. The arguments are slightly less involved here than for the case $P \ne \mathrm{Id}$. Here is the main result.
Theorem 7.
Let be an orthogonal projector on with and let , . The solution to the regularization program (14) with parameter
is the Chebyshev center of the set .
Proof.
Before separating two cases, we remark that is implicitly assumed for the above set to be nonempty. Now, we first consider the case . Defining with , our objective is to find and for which conditions (24), (25), and (26) of Lemma 6 are fulfilled, so that is a maximizer appearing in Lemma 5, and then to verify that the orthogonality conditions (20) hold, so that is indeed the required Chebyshev center. We take any , with a normalization will be decided later, and , . In this way, since , condition (25) is automatic, and condition (26) follows from the characterization (11) written here as . This characterization also allows us to deduce (20) only from , which holds because the spaces and are orthogonal. The remaining condition (24) now reads and . Recalling from Proposition 3 that , while taking into account that here and that thanks to (18), we have and . Thus, condition (24) reads
The latter is justified by our choice of , while the former can simply be achieved by normalizing , so long as , i.e., , which is our implicit assumption for nonemptiness of the set under consideration.
Next, we consider the case . We note that this implies that belongs to the set —we are going to show that is actually the Chebyshev center of this set. In other words, since , this means that with is the Chebyshev center. To this end, we shall establish that, for any ,
On the one hand, the right-hand side is obviously bounded above by . On the other hand, selecting with , we define to obtain and . Thus, the left-hand side is bounded below by
This proves that the left-hand side is larger than or equal to the right-hand side, as required. ∎
3.2.2 The case $P \ne \mathrm{Id}$
We now assume that $P$ is a proper orthogonal projection, i.e., that $P \ne \mathrm{Id}$, which corresponds to the case $\mathcal{V} \ne \{0\}$. The main result is stated below. To apply it in practice, the optimal parameter needs to be computed by solving an equation involving the smallest eigenvalue of a self-adjoint operator depending on $\tau$. This can be done using an all-purpose routine. We could also devise our own bisection method, Newton method (since the derivative is accessible, see the appendix), or secant method.
Theorem 8.
Let be two orthogonal projectors on such that and let , . Consider to be a (often unique) between and such that
(28) |
where is precomputed as . Then the solution of the regularization program (14) with parameter is the Chebyshev center of the set .
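Theorem 8 thus calls for the numerical solution of the scalar equation (28), whose left-hand side involves the smallest eigenvalue of a $\tau$-dependent self-adjoint operator. The Python sketch below only illustrates the suggested strategy on a toy stand-in (two random symmetric matrices replace the actual operator and the bracketing endpoints are placeholders), using Brent's method from SciPy; it does not reproduce the specific operator of (28).

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(5)
n = 6
A = rng.standard_normal((n, n)); A = A @ A.T        # toy symmetric matrices standing in for the
B = rng.standard_normal((n, n)); B = B @ B.T        # tau-dependent self-adjoint operator of (28)

def smallest_eigenvalue(tau):
    return np.linalg.eigvalsh((1 - tau) * A + tau * B)[0]

# Toy scalar equation: smallest_eigenvalue(tau) = target, with a target chosen so that
# the two (placeholder) endpoints bracket a solution.
t_lo, t_hi = 0.2, 0.8
target = 0.5 * (smallest_eigenvalue(t_lo) + smallest_eigenvalue(t_hi))

tau_sharp = brentq(lambda tau: smallest_eigenvalue(tau) - target, t_lo, t_hi)
print("parameter found by Brent's method:", tau_sharp)
```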
Remark.
The proof of Theorem 8 requires an additional result that gives information about the norms of the projections and when is an eigenvector of the positive semidefinite operator . This result will be applied for the eigenvector associated with the smallest eigenvalue.
Lemma 9.
Let be two orthogonal projectors on . For , let be an eigenvector of corresponding to an eigenvalue . Then
(29) |
Proof.
Remark.
Because , , and are all nonnegative, Lemma 9 implicitly guarantees that and have the same sign as . These quantities are nonnegative when , , and is the smallest eigenvalue—the case of application of the lemma. Indeed, taking with (which is possible because ), one has
i.e., . The inequality , i.e., , is obtained in a similar fashion. These inequalities sum up to give . The latter is in fact (strictly) positive when , since either or is smaller than , so that .
With the above result at hand, we are ready to fully justify the main result of this subsection.
Proof of Theorem 8.
Let us temporarily take for granted the existence of a solution to (28). Defining , our objective is again to find and for which conditions (24), (25), and (26) of Lemma 6 are fulfilled, so that is a maximizer appearing in Lemma 5, and then to verify that the orthogonality conditions (20) hold, so that is indeed the required Chebyshev center. Writing , we choose to be a (so far unnormalized) eigenvector of corresponding to the eigenvalue . Setting and , conditions (25) is swiftly verified, since , , and
Then, the characterization of the regularization solution , see (11), allows us to validate condition (26) via
The orthogonality conditions (20) are also swiftly verified: the second one follows from the first one using (11); the first one holds because, while is an eigenvector of corresponding to its smallest eigenvalue, is an eigenvector corresponding to the largest eigenvalue (i.e., to one), since it is invariant when applying both and . Thus, it remains to verify that the two conditions of (24) are fulfilled. In view of the orthogonality conditions (20), they read
(32) |
Now, invoking Proposition 3, as well as Lemma 9, the two conditions of (24) become
(33) | ||||
(34) |
After some simplification work, starting by forming the combinations (33)$+$(34) and (33)$-$(34), these two conditions are seen to be equivalent to
(35) | ||||
(36) |
These two conditions can be fulfilled: the latter is the condition that defined , i.e., (28), while the former is simply guaranteed by properly normalizing the eigenvector .
Before establishing the existence , we point out that its uniqueness holds when , i.e., when there is no such that and —such an would solve the regularization program for any . Indeed, if were two solutions to (28), then the previous argument would imply that and are both Chebyshev centers, which could only happen if they were equal, i.e., if by (15). Now, for the existence of , it will be justified by the fact that the function
is continuous between and and takes values of different signs there. To see the difference in sign, notice that by the remark after Lemma 9—this is where the assumption is critical—so that
To see the continuity, we need the continuity of the smallest eigenvalue as a function of and the nonvanishing of the denominator between and . The former is a consequence of Weyl’s inequality, yielding
The latter is less immediate. We start by using (16) and recalling the very definition of to write
Therefore, if the denominator vanished for some , we would have
This would force and to have the same sign, contrary to the assumption that runs between and . Thus, the nonvanishing of the denominator is explained, concluding the proof. ∎
Remark.
The above arguments contain the value of the minimal local worst-case error, i.e., of the Chebyshev radius of the set . Indeed, we recall from the proof of Lemma 5 that this radius equals , whose value was derived in (35). This expression can be simplified with the help of (36) by noticing that
As a consequence, we deduce that the Chebyshev radius satisfies
4 Global Optimality
Our goal in this section is to uncover some favorable globally optimal recovery maps—favorable in the sense that they are linear maps. We start by considering the situation of an arbitrary observation map before moving to the particular case where it satisfies $\Lambda \Lambda^* = \mathrm{Id}$.
4.1 Arbitrary observations
In this subsection, we first recall a standard lower bound for the global worst-case error. This lower bound, already exploited e.g. in [Micchelli, 1993], shall be expressed as the minimal value of a certain semidefinite program. This expression will allow us to demonstrate that the lower bound is achieved by the regularization map
$\Delta_\tau : y \in \mathbb{R}^m \mapsto \underset{f \in H}{\mathrm{argmin}}\ (1 - \tau) \|f - P_{\mathcal{V}} f\|^2 + \tau \|\Lambda f - y\|^2$
for some parameter $\tau \in (0, 1)$ to be explicitly determined. Here is a precise formulation of the result.
Theorem 10.
Given the approximability set and the uncertainty set , define where are solutions to
Then the regularization map is a globally optimal recovery map over and , i.e.,
(37) |
The proof relies on three lemmas given below, the first of which introducing the said lower bound.
Lemma 11.
For any recovery map , one has , where
Proof.
As a reminder, the global worst-case error of is defined by
For any such that and , since satisfies and satisfies , we have
Taking the supremum over leads to the required inequality . ∎
The second lemma expresses the square of the lower bound as the minimal value of a semidefinite program. In passing, the square of the global worst-case error of a linear recovery map is also related to the minimal value of a semidefinite program.
Lemma 12.
One has
(38) |
Moreover, if a recovery map is linear, one also has
(39) |
Proof.
The first semidefinite characterization is based on the version of the S-procedure stated in Theorem 1. Precisely, we write the square of the lower bound as
The validity of Theorem 1 is ensured by the facts that and for and that . Note that the resulting constraint decouples as for all , i.e., , and . Taking the minimal value of under the latter constraint, namely , leads to the expression of given in (38).
As for (39), we start by remarking that the linearity of the recovery map allows us to write
The latter constraint can be expressed in terms of the combined variable as
(40) |
Although the proviso of Theorem 1 is not fulfilled here, the constraint (40) is still a consequence of (but is not equivalent to) the existence of such that
The latter can also be written as the existence of such that, for all ,
Therefore, we obtain the inequality (instead of the equality)
s.to | |||||
and |
The variable can be eliminated from this optimization program by assigning it the value , thus arriving at the semidefinite program announced in (39). ∎
The third and final lemma relates the constraints of (38) and (39): while the constraint of (39) with any regularization map implies the constraint of (38), see the appendix, we need the partial converse that the constraint of (38) implies the constraint of (39) for a specific regularization map .
Lemma 13.
If , then setting yields
Proof.
We recall from Proposition 2 adapted to the current situation that, for any ,
We now notice that the hypothesis is equivalent to . With our particular choice of , this reads . It follows that
The inverse appearing above can be written as
and since and always have the same nonzero eigenvalues, we derive that
Writing the latter as
and multiplying on both sides by yields
Taking the expressions of and into account, we conclude that
as announced. ∎
With the above three lemmas at hand, the main result of this subsection follows easily.
Proof of Theorem 10.
Remark.
When , so that , we obtain and , resulting in a minimal global worst-case error equal to and achieved for the regularization map . This result can be seen directly from for any , while .
4.2 Orthonormal observations
In this subsection, we demonstrate that the use of orthonormal observations guarantees, rather unexpectedly, that regularization provides optimal recovery maps even without a careful parameter selection. The main result reads as follows.
Theorem 14.
Given the approximability set $\mathcal{K}$ and the uncertainty set $\{e : \|e\| \le \eta\}$, if $\Lambda \Lambda^* = \mathrm{Id}$, then all the regularization maps $\Delta_\tau$, $\tau \in (0, 1)$, are optimal recovery maps, i.e., for all $\tau \in (0, 1)$,
(41) |
The proof strategy consists in establishing that the constraints in (38) and in (39) with are in fact equivalent for any . This yields the inequality , which proves the required result, given that was introduced as a lower bound on for every . While the constraint in (39) implies the constraint in (38) for any observation map (see the appendix), the reverse implication relies on the fact that , e.g. via the identity derived in Proposition 3. The following realization is also a crucial point of our argument.
Lemma 15.
Assume that . For , let be an eigenvector of associated with an eigenvalue . For any , one has
-
•
if , then
-
•
if , then
Proof.
Multiplying the eigenequation defining on the left by , we obtain
(42) |
According to (19), we have , , , and . Thus, the relation (42) specified to and to yields
(43) | ||||
(44) |
Subtracting (44) from (43) yields . Therefore, we derive that provided . In this case, the equations (43)-(44) reduce to . In view of , we arrive at for any . The relation follows from the eigenequation rewritten as .
It remains to deal with the case . Notice that this case is not vacuous, as it is equivalent to , which is nontrivial by a dimension argument involving assumption (5). To see this equivalence, notice that clearly implies , while the latter eigenequation forces , hence and , i.e., and . We now consider such an eigenvector associated with the eigenvalue : in view of , we remark that and that . We deduce that and in turn that . ∎
We are now ready to establish the main result of this subsection.
Proof of Theorem 14.
Let be fixed throughout. As announced earlier, our objective is to establish that, thanks to , the condition implies the condition
or equivalently the condition
The equivalence of these conditions is seen as follows: the former implies the latter by multiplying on the left by and on the right by , while the latter implies the former under the assumption by multiplying on the left by and on the right by . As a matter of fact, according to a classical result about Schur complements, see e.g. [Boyd and Vandenberghe, 2004, Section A.5.5], the latter is further equivalent to
Thus, considering , our objective is to prove the nonnegativity of the inner product
Let us decompose , , and as , , and , where , , and belong to the space spanned by eigenvectors of corresponding to eigenvalues and where , , and belong to the eigenspace of corresponding to the eigenvalue , i.e., . We take notice of the fact that the spaces and are orthogonal. With this decomposition, the above inner product becomes
where we have set
We first remark that the terms in are all zero: first, it is clear that ; then, one has and is obtained similarly; next, Lemma 15 ensures that and is obtained similarly; last, writing where the are orthogonal eigenvectors of corresponding to eigenvalues , we derive from Lemma 15 that
and is obtained similarly. As a result, we have .
We now turn to the quantity . Exploiting Lemma 15 again, we write
At this point, we can bound from below as
This shows that since the condition ensures that for every .
Finally, Lemma 15 also helps us to bound the quantity from below according to
This allows us to obtain since the condition ensures that and . Altogether, we have shown that , which concludes the proof. ∎
Remark.
The value of the minimal global worst-case error can, in general, be computed by solving the semidefinite program (38) characterizing the lower bound . In the case where , it can also be computed without resorting to semidefinite programming. Precisely, if denotes the (unique) between and such that
(45) |
and if denotes , then we claim that, for any ,
Indeed, since we now know that the global worst-case error equals its lower bound independently of and since and are feasible for the semidefinite program (38) characterizing , we obtain
(46) |
Moreover, going back to the proof of Theorem 8, we recognize that the choice of here corresponds to the instance there. This instance comes with being equal to zero and with being equal to a properly normalized eigenvector of corresponding to the eigenvalue . The identities (32) now read and , i.e., . Setting and , which satisfy and , the very definition of the global worst-case error yields
(47) | ||||
Together, the inequalities (46) and (47) justify our claim about the value of the global worst-case error. In passing, it is worth noticing that the above argument reveals that and are extremal in the defining expression for the global worst-case error of the regularization map independently of the parameter .
References
- Beck and Eldar [2007] A. Beck and Y. C. Eldar. Regularization in regression with bounded noise: a Chebyshev center approach. SIAM Journal on Matrix Analysis and Applications, 29(2):606–625, 2007.
- Binev et al. [2017] P. Binev, A. Cohen, W. Dahmen, R. DeVore, G. Petrova, and P. Wojtaszczyk. Data assimilation in reduced modeling. SIAM/ASA Journal on Uncertainty Quantification, 5(1):1–29, 2017.
- Boyd and Vandenberghe [2004] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- Chen and Haykin [2002] Z. Chen and S. Haykin. On different facets of regularization theory. Neural Computation, 14(12):2791–2846, 2002.
- Cohen et al. [2020] A. Cohen, W. Dahmen, O. Mula, and J. Nichols. Nonlinear reduced models for state and parameter estimation. arXiv preprint arXiv:2009.02687, 2020.
- DeVore et al. [2017] R. DeVore, G. Petrova, and P. Wojtaszczyk. Data assimilation and sampling in Banach spaces. Calcolo, 54(3):963–1007, 2017.
- Diamond and Boyd [2016] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
- Ettehad and Foucart [2021] M. Ettehad and S. Foucart. Instances of computational optimal recovery: dealing with observation errors. SIAM/ASA Journal on Uncertainty Quantification, 9(4):1438–1456, 2021.
- Foucart [To appear] S. Foucart. Mathematical Pictures at a Data Science Exhibition. Cambridge University Press, To appear.
- Foucart et al. [2020] S. Foucart, C. Liao, S. Shahrampour, and Y. Wang. Learning from non-random data in Hilbert spaces: an optimal recovery perspective. arXiv preprint arXiv:2006.03706, 2020.
- Garkavi [1962] A. L. Garkavi. On the optimal net and best cross-section of a set in a normed space. Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, 26(1):87–106, 1962.
- Grant and Boyd [2014] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.
- Hastie et al. [2009] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, second edition, 2009.
- Maday et al. [2015] Y. Maday, A. T. Patera, J. D. Penn, and M. Yano. A parameterized-background data-weak approach to variational data assimilation: formulation, analysis, and application to acoustics. International Journal for Numerical Methods in Engineering, 102(5):933–965, 2015.
- Melkman and Micchelli [1979] A. A. Melkman and C. A. Micchelli. Optimal estimation of linear operators in Hilbert spaces from inaccurate data. SIAM Journal on Numerical Analysis, 16(1):87–105, 1979.
- Micchelli [1993] C. A. Micchelli. Optimal estimation of linear operators from inaccurate data: a second look. Numerical Algorithms, 5(8):375–390, 1993.
- Micchelli and Rivlin [1977] C. A. Micchelli and T. J. Rivlin. A survey of optimal recovery. In Optimal Estimation in Approximation Theory, pages 1–54. Springer, 1977.
- Novak and Woźniakowski [2008] E. Novak and H. Woźniakowski. Tractability of Multivariate Problems: Linear Information. European Mathematical Society, 2008.
- Plaskota [1996] L. Plaskota. Noisy Information and Computational Complexity. Cambridge University Press, 1996.
- Pólik and Terlaky [2007] I. Pólik and T. Terlaky. A survey of the S-lemma. SIAM Review, 49(3):371–418, 2007.
- Polyak [1998] B. T. Polyak. Convexity of quadratic transformations and its use in control and optimization. Journal of Optimization Theory and Applications, 99(3):553–583, 1998.
- Schölkopf and Smola [2002] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
Appendix
This additional section collects justifications for a few facts that were mentioned but not explained in the main text. These facts are: the uniqueness of a Chebyshev center for the model- and data-consistent set (see Section 1.3), the efficient computation of the solution to (7) when $\Lambda \Lambda^* = \mathrm{Id}$ (see the remarks after Proposition 3), the form of the Newton method when solving equation (28) (see Section 3.2.2), and the reason why the constraint of (39) always implies the constraint of (38) (see Sections 4.1 and 4.2).
Uniqueness of the Chebyshev center.
Let be two Chebyshev centers, i.e., minimizers of and let be the value of the minimum. Consider such that . Then
Thus, equality must hold all the way through. This implies that , i.e., that , as expected.
Computation of the regularized solution.
Let be a basis for and let denote the Riesz representers of the observation functionals , which form an orthonormal basis for under the assumption that . With representing the cross-gramian with entries , the solution to the regularization program (7) is given, even when is infinite dimensional, by
where the coefficient vectors and are computed according to
This is fairly easy to see for and it has been established in [Foucart, Liao, Shahrampour, and Wang, 2020, Theorem 2] for , so the general result follows from Proposition 3. Alternatively, it can be obtained by replicating the steps from the proof of the case with minor changes.
Newton method.
Equation (28) takes the form , where
Newton method produces a sequence converging to a solution using the recursion
(48) |
In order to apply this method, we need the ability to compute the derivative of with respect to . Setting , this essentially reduces to the computation of , which is performed via the argument below. Note that the argument is not rigorous, as we take for granted the differentiability of the eigenvalue and of a normalized eigenvector associated with it. However, nothing prevents us from applying the scheme (48) using the expression for given in (49) below and agreeing that a solution has been found if the output satisfies for some prescribed tolerance . Now, the argument starts from the identities
which we differentiate to obtain
By taking the inner product with in the first identity and using the second identity, we derive
According to Lemma 9, this expression can be transformed, after some work, into
(49) |
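To make the recursion (48) concrete, here is a minimal Python sketch; it replaces the derivative (49) by a central finite difference and applies the iteration to a toy eigenvalue-based function standing in for the actual $F$, so it only illustrates the scheme, with the stopping rule based on a prescribed tolerance as suggested above.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 6
A = rng.standard_normal((n, n)); A = A @ A.T        # toy symmetric matrices: F below is only a
B = rng.standard_normal((n, n)); B = B @ B.T        # stand-in for the actual function of (28)

def F(tau):
    # toy function with a root at tau = 0.5 by construction
    lam = np.linalg.eigvalsh((1 - tau) * A + tau * B)[0]
    return lam - np.linalg.eigvalsh(0.5 * A + 0.5 * B)[0]

def newton(F, tau0, tol=1e-10, h=1e-6, max_iter=50):
    """Recursion (48), with a central finite difference replacing the derivative (49)."""
    tau = tau0
    for _ in range(max_iter):
        dF = (F(tau + h) - F(tau - h)) / (2 * h)    # numerical surrogate for F'(tau)
        tau = tau - F(tau) / dF
        if abs(F(tau)) < tol:                        # agree that a solution has been found
            break
    return tau

print("Newton iterate:", newton(F, tau0=0.45))      # the iterates approach 0.5 for this toy F
```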