Statistical Limit Theorems in Distributionally Robust Optimization
Abstract
The goal of this paper is to develop methodology for the systematic analysis of asymptotic statistical properties of data-driven DRO formulations based on their corresponding non-DRO counterparts. We illustrate our approach in various settings, including both phi-divergence and Wasserstein uncertainty sets. Different types of asymptotic behavior are obtained depending on the rate at which the uncertainty radius decreases to zero as a function of the sample size, and on the geometry of the uncertainty sets.
1 Introduction
The statistical analysis of Empirical Risk Minimization (ERM) estimators is a well investigated topic both in statistics (e.g., [18]) and stochastic optimization (e.g., [17]). In recent years, there has been significant interest in the investigation of distributionally robust optimization (DRO) estimators (e.g., [13]). The goal of this paper is to develop methodology for the study of asymptotic statistical properties of data-driven DRO formulations based on their corresponding non-DRO counterparts.
Our objective is to illustrate the main conceptual strategies for the statistical development, emphasizing qualitative features, for instance, the different types of behavior arising from the interaction between the distributional uncertainty size and the sample size. Consequently, in order to keep the discussion easily accessible, we do not necessarily work under the most general assumptions under which our results apply.
To set the stage, let us introduce some notation. We use $\mathfrak{P}(\Xi)$ to denote the set of Borel probability measures supported on a closed (nonempty) set $\Xi \subseteq \mathbb{R}^d$. Let $\xi^1,\xi^2,\ldots$ be a sequence of independent identically distributed (i.i.d.) random vectors viewed as realizations (or i.i.d. copies) of a random vector $\xi$ having distribution $P_* \in \mathfrak{P}(\Xi)$. Consider the corresponding empirical measure $P_N := N^{-1}\sum_{i=1}^N \delta_{\xi^i}$, where $\delta_{\xi}$ denotes the Dirac measure of mass one at the point $\xi$. The sample mean of a function $h:\Xi\to\mathbb{R}$ is $\mathbb{E}_{P_N}[h(\xi)] = N^{-1}\sum_{i=1}^N h(\xi^i)$. By the Strong Law of Large Numbers we have that $\mathbb{E}_{P_N}[h(\xi)]$ converges with probability one (w.p.1) to $\mathbb{E}_{P_*}[h(\xi)]$, provided the expectation is well defined. (Throughout our discussion, every function whose expectation is considered will be assumed to be Borel measurable, so we will not be concerned with making this assumption repeatedly.)
By the Central Limit Theorem,
$N^{1/2}\big(\mathbb{E}_{P_N}[h(\xi)] - \mathbb{E}_{P_*}[h(\xi)]\big) \Rightarrow \mathcal{N}\big(0,\mathrm{Var}_{P_*}(h(\xi))\big),$
where "$\Rightarrow$" denotes weak convergence (convergence in distribution) and $\mathcal{N}(0,\sigma^2)$ represents the normal distribution with mean zero and variance $\sigma^2$, provided this variance is finite.
We consider a loss function of the form $\ell:\Theta\times\Xi\to\mathbb{R}$, with $\Theta\subseteq\mathbb{R}^m$ being the parameter space. Unless stated otherwise, we assume that the set $\Theta$ is compact and $\ell(\cdot,\xi)$ is continuous on $\Theta$. We define
$f_N(\theta) := \mathbb{E}_{P_N}[\ell(\theta,\xi)] = N^{-1}\sum_{i=1}^N \ell(\theta,\xi^i)$ and $f(\theta) := \mathbb{E}_{P_*}[\ell(\theta,\xi)].$   (1.1)
So, the standard ERM formulation takes the form
$\min_{\theta\in\Theta} f_N(\theta),$   (1.2)
and is viewed as an empirical counterpart of the "true" (or limiting) form
$\min_{\theta\in\Theta} f(\theta).$   (1.3)
The statistical properties, such as consistency and asymptotic normality, of the ERM estimates have been widely studied in significant generality as the sample size $N\to\infty$. These types of results hold under structural properties of the loss function and natural stability assumptions (to be reviewed) which guarantee a functional Central Limit Theorem for $f_N$. Our goal is to present a largely parallel development for the distributionally robust counterpart of (1.2).
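To fix ideas, here is a minimal numerical sketch of the ERM formulation (1.2) (our own illustrative example: the quadratic loss, the Gaussian data-generating distribution, and the grid standing in for $\Theta$ are choices made for this illustration, not taken from the paper):

```python
import numpy as np

def erm_estimate(xi, theta_grid):
    """Empirical risk minimization over a finite grid of candidate parameters.

    xi         : array of shape (N,), i.i.d. sample from the unknown distribution P_*
    theta_grid : array of candidate parameter values (a crude stand-in for Theta)
    Returns the grid point minimizing the sample-average loss and the optimal value.
    """
    # loss ell(theta, xi) = (theta - xi)^2, so the population minimizer is E[xi]
    losses = (theta_grid[:, None] - xi[None, :]) ** 2   # shape (grid, N)
    f_N = losses.mean(axis=1)                           # empirical objective f_N(theta)
    j = np.argmin(f_N)
    return theta_grid[j], f_N[j]

rng = np.random.default_rng(0)
xi = rng.normal(loc=1.0, scale=2.0, size=500)           # hypothetical data
theta_grid = np.linspace(-3.0, 5.0, 401)
theta_hat, v_hat = erm_estimate(xi, theta_grid)
print(theta_hat, v_hat)  # theta_hat should be close to 1.0 and v_hat close to Var(xi) = 4.0
```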
More precisely, (1.2) can be endowed with distributional robustness by defining a set of probability measures, referred to as the ambiguity set, $\mathfrak{M}_N(\delta)\subseteq\mathfrak{P}(\Xi)$, whose elements are seen as "reasonable" (according to some criterion) perturbations of the empirical measure. The parameter $\delta\ge 0$ is the uncertainty size and the family of sets $\{\mathfrak{M}_N(\delta)\}_{\delta\ge 0}$ is typically nondecreasing in $\delta$ (in the inclusion partial order sense). The ambiguity set can be defined around any reference probability measure, but unless stated otherwise, we will center the ambiguity set around $P_N$. In this paper we deal with ambiguity sets of the form
$\mathfrak{M}_N(\delta) := \{Q\in\mathfrak{P}(\Xi) : \mathcal{D}(Q,P_N)\le\delta\},$   (1.4)
where $\mathcal{D}(Q,P)$ is a divergence between probability measures $Q$ and $P$. Specifically, we consider the phi-divergence and Wasserstein distance cases.
In order to state the DRO version of (1.2) we define
$f_N^{\delta}(\theta) := \sup_{Q\in\mathfrak{M}_N(\delta)} \mathbb{E}_{Q}[\ell(\theta,\xi)],$   (1.5)
where $\{\delta_N\}$ is a monotonically decreasing sequence tending to zero as $N\to\infty$. The DRO version of (1.2) takes the form
$\min_{\theta\in\Theta} f_N^{\delta_N}(\theta).$   (1.6)
The aim of this paper is to investigate asymptotic statistical properties of the optimal value and optimal solutions of the DRO problem (1.6). In particular, under natural assumptions (to be discussed), we will show that both in phi-divergence and Wasserstein DRO formulations, there are typically (but not always) three types of cases for the limiting asymptotic statistics, depending on the rate of convergence of $\delta_N$ to zero. These can be seen both in terms of the value function error and in terms of the optimal solution error (assuming the optimal solution of the limiting version of the problem is unique and sufficient regularity conditions are in place).
Intuitively, if $\delta_N$ is smaller than a certain (to be characterized) critical rate relative to the canonical parametric statistical error rate $N^{-1/2}$, then the DRO effect is negligible compared to the statistical error implicit in a sample of size $N$. If $\delta_N$ decreases to zero right at the critical rate, the DRO effect is comparable with this statistical error and can be quantified in the form of an asymptotic bias. If $\delta_N$ is bigger than the critical rate, the DRO effect overwhelms the statistical noise. These critical rates depend on the sensitivity of the optimal value function with respect to a small change in the size of uncertainty.
Our objective is to provide accessible principles that can be used to obtain explicit limiting distributions for the errors, both for value functions and optimizers, in these three cases; see Theorems 3.1 and 3.2 for general principles, Theorems 4.1 and 4.2 for the application of these principles to value functions of phi-divergence and Wasserstein DRO, respectively, and Theorems 5.1 and 5.2 for the corresponding application to phi-divergence and Wasserstein DRO optimal solutions, respectively.
It is important to note that it is common in the data-driven DRO literature to suggest choosing $\delta_N$ in order to enforce that $P_*$ is inside $\mathfrak{M}_N(\delta_N)$ with high probability. Such a selection will typically fall in the third case, that is, this choice will induce a bias that is substantially larger than the standard statistical noise. Therefore, prescriptions corresponding to the third case should be adopted only if the optimizer perceives that the out-of-sample environment is substantially different from the observed (empirical) environment due to errors or fluctuations that fall outside of standard statistical noise.
The rest of the paper is organized as follows. In Section 2 we quickly review the elements of the statistical analysis of Empirical Risk Minimization (ERM) – also known as Empirical Optimization or Sample Average Approximation – which corresponds to the case $\delta_N = 0$. Then, in Section 3, we follow a discussion parallel to that of Section 2 and present assumptions for the data-driven DRO version of the problem. The objective is to use these assumptions to obtain a flexible and disciplined approach that can be systematically applied to various DRO formulations. Then, in Section 4 we discuss the application of this approach to the explicit development of asymptotics for the optimal value in phi-divergence and Wasserstein DRO and, finally, in Section 5, we develop the corresponding explicit results for the associated optimal solutions.
We use the following notation throughout the paper. For a sequence of random variables $\{X_N\}$ and a positive sequence $\{a_N\}$, by writing $X_N = o_p(a_N)$ we mean that $X_N/a_N$ tends in probability to zero as $N\to\infty$. In particular, $X_N = o_p(1)$ means that $X_N$ tends in probability to zero. The notation $Q \ll P$ means that $Q$ is absolutely continuous with respect to $P$. Unless stated otherwise, probabilistic statements like "almost every" (a.e.) are made with respect to the probability measure $P_*$. By saying that a function $h:\Xi\to\mathbb{R}$ is integrable we mean that $\mathbb{E}_{P_*}|h(\xi)|$ is finite. It is said that a mapping $g:\mathbb{R}^k\to\mathbb{R}^l$ is directionally differentiable at a point $x$ if the directional derivative
$g'(x,d) := \lim_{t\downarrow 0} \dfrac{g(x+td) - g(x)}{t}$   (1.7)
exists for every direction $d$. We will use the term $r_N(\theta)$, $\theta\in\Theta$, to denote a (generic) random field such that
$\sup_{\theta\in\Theta} |r_N(\theta)| = o_p(N^{-1/2}).$   (1.8)
2 Statistics of ERM: Review
In addition to the population objective function $f(\theta)$, introduced in (1.1), we also let
$\vartheta_* := \min_{\theta\in\Theta} f(\theta)$ and $S_* := \arg\min_{\theta\in\Theta} f(\theta)$   (2.1)
be the optimal value and the set of optimal solutions of the population version of the optimization problem, respectively.
As defined in (1.1), $f_N(\theta)$ is the objective function of the ERM version of the problem, and
$\hat\vartheta_N := \min_{\theta\in\Theta} f_N(\theta)$ and $\hat\theta_N \in \arg\min_{\theta\in\Theta} f_N(\theta)$   (2.2)
are the respective optimal value and an optimal solution of the ERM problem. We will now quickly review the development of the asymptotic statistics of the optimal value in ERM and then we will discuss the corresponding results for optimal solutions.
2.1 Asymptotics of the Optimal Value
In order to analyze the statistical error in the difference $\hat\vartheta_N - \vartheta_*$ between the optimal values, we start by enforcing a functional Central Limit Theorem (CLT) for $f_N$. In particular, one imposes assumptions which guarantee an expansion of the form (recall that $r_N(\theta)$ denotes a random field satisfying condition (1.8))
$f_N(\theta) = f(\theta) + N^{-1/2} G_N(\theta) + r_N(\theta),$   (2.3)
where we have functional weak convergence
$G_N \Rightarrow G$   (2.4)
in the uniform topology on compact sets, with $G$ being a mean zero Gaussian random field with covariance function
$\mathrm{Cov}\big(G(\theta),G(\theta')\big) = \mathrm{Cov}_{P_*}\big(\ell(\theta,\xi),\ell(\theta',\xi)\big).$   (2.5)
There are several ways to enforce (2.3); a simple set of sufficient conditions satisfying this is given next (cf., [18, example 19.7]).
Assumption 2.1
(i) For some $\theta_0\in\Theta$ the expectation $\mathbb{E}_{P_*}[\ell(\theta_0,\xi)^2]$ is finite. (ii) There is a measurable function $C:\Xi\to\mathbb{R}_+$ such that $\mathbb{E}_{P_*}[C(\xi)^2]$ is finite and
$|\ell(\theta,\xi) - \ell(\theta',\xi)| \le C(\xi)\,\|\theta-\theta'\|$   (2.6)
for all $\theta,\theta'\in\Theta$ and a.e. $\xi\in\Xi$.
In particular, under this assumption it follows that the expectation function $f(\theta)$ and the variance
$\sigma^2(\theta) := \mathrm{Var}_{P_*}\big(\ell(\theta,\xi)\big)$   (2.7)
are finite valued and continuous on $\Theta$. Furthermore, since the set $\Theta$ is compact, it follows that the optimal value $\hat\vartheta_N$ of the ERM problem converges to $\vartheta_*$ in probability (in fact almost surely). Moreover, it is not difficult to show from (2.3) that the distance from $\hat\theta_N$ to the set $S_*$ converges in probability to zero (actually, the convergence occurs almost surely) as $N\to\infty$. Finally, since the functional $\mathcal{V}(g) := \min_{\theta\in\Theta} g(\theta)$, mapping continuous functions $g\in C(\Theta)$ to the real line, is directionally differentiable, the following classical result is a direct consequence of the (functional) Delta Theorem (cf., [14]).
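For reference, the directional derivative of this min-functional has the classical Danskin-type form (a standard fact, recorded here for completeness in the notation just introduced):

```latex
% Directional derivative of V(g) := min_{theta in Theta} g(theta) on C(Theta),
% in a direction h in C(Theta); S(g) denotes the set of minimizers of g.
\mathcal{V}'_{g}(h) \;=\; \min_{\theta \in S(g)} h(\theta),
\qquad
S(g) \;:=\; \operatorname*{arg\,min}_{\theta \in \Theta} g(\theta).
```

Applying the functional Delta Theorem with $g = f$ and direction $G$ then gives precisely the limit appearing in (2.8) below.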
Proposition 2.1
Under Assumption 2.1,
$N^{1/2}\big(\hat\vartheta_N - \vartheta_*\big) \Rightarrow \min_{\theta\in S_*} G(\theta)$   (2.8)
as $N\to\infty$. In particular, if $S_* = \{\theta_*\}$ is a singleton, i.e. $\theta_*$ is the unique optimal solution of the true problem, then $N^{1/2}(\hat\vartheta_N - \vartheta_*)$ converges in distribution to normal $\mathcal{N}(0,\sigma^2(\theta_*))$.
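As a quick numerical illustration of Proposition 2.1 (our own toy example, not taken from the paper), consider the quadratic loss $\ell(\theta,\xi)=(\theta-\xi)^2$ with Gaussian data, for which $\theta_*$ is the mean and $\vartheta_*$ the variance; the scaled optimal-value error should then be approximately $\mathcal{N}(0,\sigma^2(\theta_*))$ with $\sigma^2(\theta_*)=\mathrm{Var}[(\xi-\theta_*)^2]$:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sd, N, reps = 1.0, 2.0, 500, 1000
theta_grid = np.linspace(-3.0, 5.0, 401)   # crude stand-in for the compact set Theta
vartheta_star = sd ** 2                    # min_theta E[(theta - xi)^2] = Var(xi)

errors = []
for _ in range(reps):
    xi = rng.normal(mu, sd, size=N)
    f_N = ((theta_grid[:, None] - xi[None, :]) ** 2).mean(axis=1)
    errors.append(np.sqrt(N) * (f_N.min() - vartheta_star))
errors = np.array(errors)

# Proposition 2.1 predicts an approximately N(0, sigma^2(theta_*)) distribution with
# sigma^2(theta_*) = Var[(xi - mu)^2] = 2 * sd**4 for Gaussian data.
print(errors.mean(), errors.var(), 2 * sd ** 4)
```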
2.2 Asymptotics of Optimal Solutions
We assume now that $S_* = \{\theta_*\}$ is a singleton, i.e., $\theta_*$ is the unique optimal solution of the true (population) problem (1.3). We also assume that for a.e. $\xi$, the function $\ell(\cdot,\xi)$ is continuously differentiable. (Unless stated otherwise, all first and second order derivatives will be taken with respect to the vector $\theta$.) As argued in the previous section, the asymptotics of the optimal value are governed by the asymptotics of the objective function. On the other hand, the asymptotics of the optimal solutions can be derived from the asymptotics of the gradients of the objective function.
Let us consider the following parametrization of problem (1.3):
$\min_{\theta\in\Theta}\ \{ f(\theta) + u^{\top}\theta \},$   (2.9)
with parameter vector $u\in\mathbb{R}^m$. Denote by $\bar\theta(u)$ an optimal solution of the above problem (2.9), viewed as a function of the vector $u$. Of course, we have that $\bar\theta(0) = \theta_*$.
Assumption 2.2 (uniform second order growth)
There is a neighborhood $V$ of $\theta_*$ and a positive constant $c$ such that for every $u$ in a neighborhood of $0\in\mathbb{R}^m$, problem (2.9) has an optimal solution $\bar\theta(u)\in V$ and
$f(\theta) + u^{\top}\theta \ \ge\ f(\bar\theta(u)) + u^{\top}\bar\theta(u) + c\,\|\theta - \bar\theta(u)\|^2$   (2.10)
for all $\theta\in V$.
The following assumption can be viewed as a counterpart of Assumption 2.1 applied to the gradients of the objective function.
Assumption 2.3
(i) For some $\theta_0\in\Theta$ the expectation $\mathbb{E}_{P_*}\|\nabla\ell(\theta_0,\xi)\|^2$ is finite. (ii) There is a measurable function $K:\Xi\to\mathbb{R}_+$ such that $\mathbb{E}_{P_*}[K(\xi)^2]$ is finite and
$\|\nabla\ell(\theta,\xi) - \nabla\ell(\theta',\xi)\| \le K(\xi)\,\|\theta-\theta'\|$   (2.11)
for all $\theta,\theta'$ in a neighborhood of $\theta_*$ and a.e. $\xi\in\Xi$.
By the functional CLT it follows that
(2.12) |
where we have a functional weak convergence in the uniform topology on a closed neighborhood of , with being a continuous mean zero Gaussian random field with covariance function
It follows from (2.12) that
(2.13) |
Also since tends in probability to zero, we have
(2.14) |
Thus we have the following result from [15, Theorem 2.1], where the respective regularity conditions are ensured by the above property (2.14): under Assumptions 2.2 and 2.3,
$\hat\theta_N = \bar\theta(u_N) + o_p(N^{-1/2}),$ where $u_N := \nabla f_N(\theta_*) - \nabla f(\theta_*)$.   (2.15)
The above result reduces the analysis of asymptotic properties of the optimal solutions to an investigation of the asymptotic behavior of the optimal solutions of the finite dimensional problem (2.9). By the (finite dimensional) Central Limit Theorem, $N^{1/2} u_N$ converges in distribution to normal $\mathcal{N}(0,\Sigma)$ with covariance matrix $\Sigma := \mathrm{Cov}_{P_*}\big(\nabla\ell(\theta_*,\xi)\big)$. Moreover, if the mapping $\bar\theta(\cdot)$ is directionally differentiable at $u=0$ (in the Hadamard sense), then by the finite dimensional Delta Theorem it follows from (2.15) that
$N^{1/2}\big(\hat\theta_N - \theta_*\big) \Rightarrow \bar\theta'(0,Z),$   (2.16)
where $Z\sim\mathcal{N}(0,\Sigma)$. In particular, if $\bar\theta'(0,\cdot)$ is linear (i.e., $\bar\theta(\cdot)$ is differentiable at $u=0$ with Jacobian matrix $J$), then $N^{1/2}(\hat\theta_N - \theta_*)$ converges in distribution to normal with null mean vector and covariance matrix $J\Sigma J^{\top}$.
Directional differentiability of optimal solutions of parameterized problems is well investigated. For example, if $\theta_*$ is an interior point of $\Theta$, $f(\cdot)$ is twice continuously differentiable at $\theta_*$ and the Hessian matrix $H := \nabla^2 f(\theta_*)$ is nonsingular, then the uniform second order growth condition (Assumption 2.2) holds, and $\bar\theta(\cdot)$ is differentiable at $u=0$ with Jacobian $J = -H^{-1}$. When $\theta_*$ is on the boundary of the set $\Theta$, the sensitivity analysis of the parameterized problem (2.9) is more delicate and involves a certain measure of the curvature of the set $\Theta$ at the point $\theta_*$. This is discussed extensively in [6]. We also refer to [17, sections 5.1.3 and 7.1.5] for a basic summary of such results.
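In this smooth interior-point case, combining (2.16) with $J = -H^{-1}$ recovers the familiar M-estimation (sandwich) limit; a compact summary in the notation used above:

```latex
% Smooth interior case: J = -H^{-1}, so J * Sigma * J^T = H^{-1} Sigma H^{-1}
N^{1/2}\big(\hat{\theta}_N - \theta_*\big)
  \;\Rightarrow\;
  \mathcal{N}\!\big(0,\; H^{-1}\,\Sigma\,H^{-1}\big),
\qquad
H := \nabla^{2} f(\theta_*),
\quad
\Sigma := \operatorname{Cov}_{P_*}\!\big(\nabla \ell(\theta_*,\xi)\big).
```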
It is worthwhile to note at this point that the regularity conditions of Assumptions 2.2 and 2.3 address different properties of the considered setting. Assumption 2.2 deals with the limiting optimization problem and is of a deterministic nature. The uniform second order growth condition was introduced in [15], and in a more general form was discussed in [6, section 5.1.3]. On the other hand, Assumption 2.3 is related to the stochastic behavior of the ERM problem (1.2).
3 Statistics of DRO: General Principles
We now provide sufficient conditions for the development of DRO statistical principles based on assumptions which are parallel to those imposed in the ERM section. Define
$\hat\vartheta_N^{\delta_N} := \min_{\theta\in\Theta} f_N^{\delta_N}(\theta)$ and $\hat\theta_N^{\delta_N} \in \arg\min_{\theta\in\Theta} f_N^{\delta_N}(\theta),$   (3.1)
the optimal value and an optimal solution of the DRO problem (1.6) (recall the definition of $f_N^{\delta}$ in (1.5)).
3.1 DRO Asymptotics of the Optimal Value
Similar to the ERM case, in the DRO setting we will typically have an expansion of the form
$f_N^{\delta_N}(\theta) = f_N(\theta) + \delta_N^{\gamma}\, R_N(\theta) + o_p(\delta_N^{\gamma})$   (3.2)
for some $\gamma>0$, where $R_N$ converges in probability in the uniform topology over $\Theta$ to a continuous deterministic process $R$,
$\sup_{\theta\in\Theta} |R_N(\theta) - R(\theta)| \to 0$ in probability.   (3.3)
Since $P_N \in \mathfrak{M}_N(\delta)$, it follows then that $R \ge 0$. We will characterize $\gamma$ and $R$ explicitly in the next sections in the context of phi-divergence and Wasserstein DRO formulations under suitable conditions on the distributional uncertainty set – in addition to Assumption 2.1 (which is clearly independent of the distributional uncertainty). The following result, which is immediate from the application of the functional Delta Theorem, summarizes the type of behavior that we expect in DRO formulations depending on the geometry of the distributional uncertainty set and the rate of decay to zero of the uncertainty size $\delta_N$. Recall that $G$ denotes a mean zero Gaussian random field with covariance function (2.5).
Theorem 3.1
Suppose that Assumption 2.1 and
conditions (3.2) - (3.3) hold. Then there are three types of asymptotic
behavior of the DRO optimal value:
(a) If $N^{1/2}\delta_N^{\gamma} \to 0$, then
$\hat\vartheta_N^{\delta_N} = \hat\vartheta_N + o_p(N^{-1/2}),$   (3.4)
and hence
$N^{1/2}\big(\hat\vartheta_N^{\delta_N} - \vartheta_*\big) \Rightarrow \min_{\theta\in S_*} G(\theta),$   (3.5)
which coincides with (2.8) and thus the DRO formulation has no asymptotic impact.
(b) If $N^{1/2}\delta_N^{\gamma} \to \beta \in (0,\infty)$, then
$N^{1/2}\big(\hat\vartheta_N^{\delta_N} - \vartheta_*\big) \Rightarrow \min_{\theta\in S_*}\big\{ G(\theta) + \beta\, R(\theta)\big\},$   (3.6)
so the DRO formulation introduces an explicit and quantifiable asymptotic bias which can be interpreted as a regularization term.
(c) If $N^{1/2}\delta_N^{\gamma} \to \infty$, then (the right hand side of (3.7) is a deterministic number; therefore convergence in distribution "$\Rightarrow$" there is the same as convergence in probability)
$\delta_N^{-\gamma}\big(\hat\vartheta_N^{\delta_N} - \vartheta_*\big) \Rightarrow \min_{\theta\in S_*} R(\theta),$   (3.7)
so the bias term induced by the DRO formulation is larger than the statistical error.
Proof. Part (a). By (3.2) and (3.3) we have that in the considered case
where $r_N$ is the generic term satisfying (1.8). Thus (3.4) follows.
Part (b). By (3.2) and (3.3) in the considered case we can write
Under Assumption 2.1, by the functional CLT we have that converges in distribution to , and hence (3.6) follows by the Delta Theorem.
Part (c) may appear somewhat different because the right hand side is deterministic but, under case (c) note that we can simply write
so case (c) also follows from the standard analysis since converges uniformly to in probability (thus it converges weakly in the uniform topology).
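To see the three regimes of Theorem 3.1 in a toy example (our own illustration; the loss, the variance-type regularizer standing in for $f_N^{\delta}$, and the radii below are illustrative choices anticipating the phi-divergence expansion of Section 4.1, for which $\gamma = 1/2$), one can compare the scaled value error under radii that shrink faster than, exactly at, and slower than the critical rate:

```python
import numpy as np

rng = np.random.default_rng(2)

def dro_value(xi, theta_grid, delta):
    """Approximate DRO optimal value via a first-order (variance-regularized) stand-in
    for f_N^delta(theta): empirical mean loss plus sqrt(2*delta) times its std. dev."""
    losses = (theta_grid[:, None] - xi[None, :]) ** 2
    f_N = losses.mean(axis=1)
    sd_N = losses.std(axis=1)
    return np.min(f_N + np.sqrt(2.0 * delta) * sd_N)

mu, sd = 1.0, 2.0
vartheta_star = sd ** 2                     # true optimal value for the quadratic loss
theta_grid = np.linspace(-3.0, 5.0, 401)

for N in (200, 800, 3200):
    xi = rng.normal(mu, sd, size=N)
    for label, delta in (("(a) delta ~ N^-2  ", 1.0 / N ** 2),
                         ("(b) delta ~ N^-1  ", 1.0 / N),
                         ("(c) delta ~ N^-1/2", 1.0 / np.sqrt(N))):
        err = np.sqrt(N) * (dro_value(xi, theta_grid, delta) - vartheta_star)
        print(N, label, round(float(err), 3))
# Case (a) behaves like the ERM error, case (b) adds an O(1) positive bias,
# and in case (c) the scaled error grows with N.
```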
3.2 DRO Asymptotics of the Optimal Solutions
As in the ERM development, in addition to Assumption 2.2, it is convenient to guarantee that for all $N$ large enough, $f_N^{\delta_N}(\cdot)$ is differentiable in a neighborhood of $\theta_*$ and
$\nabla f_N^{\delta_N}(\theta) = \nabla f_N(\theta) + \delta_N^{\gamma}\,\nabla R_N(\theta) + o_p(\delta_N^{\gamma})$   (3.8)
for some $\gamma>0$, where $\nabla R_N$ converges in probability to $\nabla R$ uniformly on a closed neighborhood of $\theta_*$. In consequence, we obtain the following analog of Theorem 3.1, which follows from the finite dimensional Delta Theorem. Recall that $\bar\theta(u)$ is an optimal solution of problem (2.9) and $\bar\theta'(0,\cdot)$ is its directional derivative at $u=0$.
Theorem 3.2
Suppose that: Assumptions 2.2
and 2.3 hold, conditions (3.2) - (3.3) are satisfied,
identity (3.8) holds with
converging in probability to uniformly around a closed neighborhood
of , and that is
directionally differentiable at (in the Hadamard sense). Let with covariance matrix . Then the DRO optimal solutions can have three types of asymptotic
behavior:
(A) If $N^{1/2}\delta_N^{\gamma} \to 0$, then
$\hat\theta_N^{\delta_N} = \hat\theta_N + o_p(N^{-1/2}),$   (3.9)
thus
$N^{1/2}\big(\hat\theta_N^{\delta_N} - \theta_*\big) \Rightarrow \bar\theta'(0,Z).$   (3.10)
(B) If $N^{1/2}\delta_N^{\gamma} \to \beta \in (0,\infty)$, then
$N^{1/2}\big(\hat\theta_N^{\delta_N} - \theta_*\big) \Rightarrow \bar\theta'\big(0,\, Z + \beta\,\nabla R(\theta_*)\big).$   (3.11)
(C) If $N^{1/2}\delta_N^{\gamma} \to \infty$, then
$\delta_N^{-\gamma}\big(\hat\theta_N^{\delta_N} - \theta_*\big) \Rightarrow \bar\theta'\big(0,\, \nabla R(\theta_*)\big).$   (3.12)
4 General Principle in Action: Optimal Values
In this section, we apply the general principle to the asymptotics of the value function in two of the main types of DRO formulations, namely, phi-divergence and Wasserstein DRO.
4.1 The Phi-Divergence Case
We recall the definition of the distributional uncertainty set for the phi-divergence case. Consider a convex lower semi-continuous function $\phi:\mathbb{R}\to\mathbb{R}_+\cup\{+\infty\}$ such that $\phi(1)=0$ and $\phi(t)=+\infty$ for $t<0$. For probability measures $Q,P$ such that $Q$ is absolutely continuous with respect to $P$ with the corresponding density $dQ/dP$, the $\phi$-divergence is defined as (cf., [7],[12])
$D_{\phi}(Q\|P) := \mathbb{E}_{P}\big[\phi\big(dQ/dP\big)\big].$   (4.1)
In particular, for $\phi(t) = t\log t - t + 1$, $t\ge 0$, this becomes the Kullback–Leibler (KL) divergence of $Q$ from $P$. The ambiguity set associated with $D_{\phi}$ is defined as
$\mathfrak{M}_N(\delta) := \{Q\in\mathfrak{P}(\Xi) : D_{\phi}(Q\|P_N)\le\delta\}.$   (4.2)
By duality arguments the corresponding distributionally robust functional can be written in the form (cf., [2], [3], [16])
$f_N^{\delta}(\theta) = \inf_{\lambda\ge 0,\,\mu\in\mathbb{R}} \Big\{ \lambda\delta + \mu + \lambda\,\mathbb{E}_{P_N}\big[\phi^{*}\big((\ell(\theta,\xi)-\mu)/\lambda\big)\big] \Big\},$   (4.3)
where $\phi^{*}(s) := \sup_{t\ge 0}\{st - \phi(t)\}$ is the convex conjugate of $\phi$. Using this representation we can obtain an asymptotic expansion for (4.3) as a function of $\delta$. This expansion can be helpful to suggest the form of the expansion in (3.2) and (3.8). For this, we need to assume certain regularity properties of $\phi$ around $t=1$.
Assumption 4.1
Assume that $\phi(\cdot)$ is two times continuously differentiable in a neighborhood of $t=1$ with $\phi''(1)>0$.
Under this condition we have the following expansion, which is obtained, in order to simplify our exposition, under the assumption that the probability measure has compact support. See also the results in [11], which provide additional correction terms for a fixed uncertainty size. The uniform feature of the statement below is helpful in the statistical analysis. Our development here will also be used in the expansion of the optimal solutions.
Proposition 4.1
Suppose that Assumption 4.1 holds and that $|Z|\le M$ for some constant $M$. Then, for any $\varepsilon>0$,
$\sup_{Q:\, D_{\phi}(Q\|P)\le\delta} \mathbb{E}_{Q}[Z] = \mathbb{E}_{P}[Z] + \sqrt{\tfrac{2\delta}{\phi''(1)}}\,\sqrt{\mathrm{Var}_{P}(Z)} + o(\sqrt{\delta})$   (4.4)
uniformly over Borel probability measures $P$ supported on $\Xi$ such that $\mathrm{Var}_{P}(Z)\ge\varepsilon$. Moreover, there is $\delta_0>0$ such that for all $\delta\in(0,\delta_0)$ the worst case distribution attaining the supremum in (4.4)
is unique.
Proof. Note that we can write
where the sup is taken over the set of positive random variables satisfying the specified moment constraints. We may assume that for simplicity since we can always center the objective function around . In turn, by letting , the previous optimization problem is equivalent to
(4.5) |
Since and , then is feasible for any provided that and
In turn, since is two times continuously differentiable at , we have that
as uniformly over compact sets. Therefore, we conclude that there exists such that for any
Since can be chosen to be arbitrarily small, we conclude an asymptotic lower bound which retrieves (4.4). For the upper bound, we apply the duality result (4.3) in the form corresponding to (4.5), we obtain
(4.6) |
We will plug in
into (4.6) to obtain our upper bound. Using that and that is convex with , we have that the family of (continuous) functions
converges uniformly on compact sets to
Therefore we obtain that
These estimates, which are uniform given that , yield the estimate in the proposition. The uniqueness is standard, it
follows from the local strong convexity of at the
origin.
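A small numerical check of an expansion of this form (our own illustration, assuming scipy is available; we take the chi-square-type divergence $\phi(t)=(t-1)^2$, a finitely supported reference measure $P$, and solve the inner maximization directly as a convex program over the probability weights, comparing the result with the first-order term $\sqrt{2\delta/\phi''(1)}\,\mathrm{sd}_P(Z)$ of (4.4)):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 30
Z = rng.normal(size=n)                 # values of the integrand on the support points
p = np.full(n, 1.0 / n)                # reference measure P (uniform weights)

def worst_case(delta):
    """max_q  sum(q*Z)  s.t.  q >= 0, sum(q) = 1, sum(p*phi(q/p)) <= delta,
    with phi(t) = (t - 1)^2 (a chi-square-type divergence, phi''(1) = 2)."""
    cons = [{"type": "eq",   "fun": lambda q: q.sum() - 1.0},
            {"type": "ineq", "fun": lambda q: delta - np.sum(p * (q / p - 1.0) ** 2)}]
    res = minimize(lambda q: -q @ Z, x0=p, bounds=[(0.0, 1.0)] * n,
                   constraints=cons, method="SLSQP")
    return -res.fun

for delta in (1e-4, 1e-3, 1e-2):
    first_order = np.sqrt(2.0 * delta / 2.0) * Z.std()   # sqrt(2*delta/phi''(1)) * sd_P(Z)
    print(delta, worst_case(delta) - Z.mean(), first_order)
# The exact worst-case increase and the first-order term agree to leading order as delta -> 0.
```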
Recall that $\sigma^2(\theta) = \mathrm{Var}_{P_*}\big(\ell(\theta,\xi)\big)$, and that $G$ is a mean zero Gaussian random field. Expansion (4.4) immediately yields, at least when $\ell(\theta,\cdot)$ is $P_*$-essentially bounded, that
$f_N^{\delta}(\theta) = f_N(\theta) + \sqrt{\tfrac{2\delta}{\phi''(1)}}\,\sigma_N(\theta) + o(\sqrt{\delta}),$ where $\sigma_N^2(\theta) := \mathrm{Var}_{P_N}\big(\ell(\theta,\xi)\big)$.   (4.7)
Consequently, we obtain the following result.
Theorem 4.1
Suppose that $\ell(\theta,\xi)$ is $P_*$-essentially bounded, that Assumption 2.1 and Assumption 4.1 hold, and that $\sigma^2(\theta)>0$ for all $\theta\in\Theta$. Then, we have the following types of asymptotic behavior of the DRO optimal values.
(a-phi) If $N\delta_N \to 0$, then
$N^{1/2}\big(\hat\vartheta_N^{\delta_N} - \vartheta_*\big) \Rightarrow \min_{\theta\in S_*} G(\theta).$   (4.8)
(b-phi) If $N\delta_N \to \beta \in (0,\infty)$, then
$N^{1/2}\big(\hat\vartheta_N^{\delta_N} - \vartheta_*\big) \Rightarrow \min_{\theta\in S_*}\Big\{ G(\theta) + \sqrt{2\beta/\phi''(1)}\;\sigma(\theta)\Big\}.$   (4.9)
(c-phi) If $N\delta_N \to \infty$, then
$\delta_N^{-1/2}\big(\hat\vartheta_N^{\delta_N} - \vartheta_*\big) \Rightarrow \min_{\theta\in S_*} \sqrt{2/\phi''(1)}\;\sigma(\theta),$   (4.10)
so the bias term induced by the DRO formulation dominates the statistical error.
Proof. The proof of this theorem is quite standard (cf., [17, proof of Theorem 5.7]). For the sake of completeness we briefly outline the proof of case (b-phi). Note that our assumptions imply Assumption 2.1, and hence $\sigma(\cdot)$ is a continuous function of $\theta$. Therefore there is a compact neighborhood $V$ of $S_*$ such that $\sigma(\theta)>0$ for all $\theta\in V$. We can restrict the minimization to $V$, for which the expansion (4.7) holds.
Consider the space of continuous functions equipped with the sup-norm, and functional , mapping into the real
line. This functional is directionally differentiable in the Hadamard sense
with the directional derivative at a point given by
, where
. We have that and , where . By
the functional CLT and (4.7) it follows that converges in distribution (weakly) to . We can apply now the
functional Delta Theorem to conclude (4.9).
Given that $\phi$ is only assumed to satisfy Assumption 4.1, without imposing any growth condition, situations such as the (c-phi) case require imposing stronger moment conditions than just assuming finite variance. This can be seen in the KL-divergence case, in which $\phi(t)=t\log t - t + 1$. For fixed $\delta>0$, the population version of the dual requires that $\ell(\theta,\xi)$ has a finite moment generating function in a neighborhood of the origin. Therefore, if $\delta_N$ converges to zero sufficiently slowly and $\ell(\theta,\xi)$ has infinite moments of sufficiently high order, an expansion such as (4.7) may not hold. However, if $\ell(\theta,\xi)$ is essentially bounded, it follows that expansion (4.7) holds (with $\phi''(1)=1$ in the KL case).
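For concreteness, in the KL case the dual representation (4.3) specializes to a well-known exponential form (a standard identity, recorded here for reference; it is finite only when the moment generating function below exists):

```latex
% KL case: phi(t) = t log t - t + 1, with convex conjugate phi^*(s) = e^{s} - 1, so that
\sup_{Q:\, D_{\mathrm{KL}}(Q\|P)\,\le\,\delta} \mathbb{E}_{Q}[Z]
  \;=\;
  \inf_{\lambda > 0}\Big\{ \lambda\,\delta \;+\; \lambda \log \mathbb{E}_{P}\big[e^{Z/\lambda}\big] \Big\}.
```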
On the other hand, the result in [8, Theorem 2] provides a more general statement for the case (b-phi), since it does not require compact support (although it requires $\phi$ to be three times continuously differentiable). The following example shows that the smoothness of $\phi$ is important in deriving the asymptotics in the previous result.
Example 4.1
Consider , . In that case (e.g., [16, Example 3.12]), for and essentially bounded ,
(4.11) |
where
(4.12) |
Note that and as tends to one,
(4.13) |
provided is essentially bounded.
Suppose that is bounded on , and hence
(4.14) |
Consider with . Then the first term in (4.14) is of order , and by (4.13) the second term is . Consequently in that case and hence this corresponds to case (a) in Theorem 3.1. This shows that the assumption of smoothness (differentiability) of is essential for derivation of the asymptotics of . Here some additional terms in the asymptotics of appear when is of order , rather than .
4.2 The Wasserstein Distance Case
We use $\mathfrak{P}(\Xi\times\Xi)$ to denote the set of Borel probability measures on the product space $\Xi\times\Xi$. Let $c:\Xi\times\Xi\to[0,\infty]$ be a lower semi-continuous function such that $c(z,\xi)=0$ if and only if $z=\xi$. This function measures the marginal cost of transporting a unit of mass from a source location $z$ to a target location $\xi$. The optimal transport cost between $Q$ and $P$ is given by
$W_c(Q,P) := \inf\big\{ \mathbb{E}_{\pi}[c(Z,\xi)] : \pi\in\mathfrak{P}(\Xi\times\Xi),\ \pi_Z = Q,\ \pi_{\xi} = P \big\},$   (4.15)
where $\mathbb{E}_{\pi}$ is the expectation under a joint distribution $\pi$ of $(Z,\xi)$, and $\pi_Z$ and $\pi_{\xi}$ denote the marginal distributions of $Z$ and $\xi$, respectively. It turns out that the infimum is always attained, thus we write 'min' instead of 'inf'. Let $\|\cdot\|$ be a norm on the space $\mathbb{R}^d$. An important special case corresponds to the choice $c(z,\xi) = \|z-\xi\|^p$ for some $p\ge 1$, in which case $W_c(Q,P)^{1/p}$ is the so-called $p$-Wasserstein distance. The reader is referred to the text of Villani [19] for more background on optimal transport.
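The optimal transport cost (4.15) between two finitely supported measures is a finite-dimensional linear program over couplings; the following sketch (our own illustration, assuming scipy is available) computes it on the real line for the cost $c(z,\xi)=|z-\xi|^{p}$:

```python
import numpy as np
from scipy.optimize import linprog

def ot_cost(x, q, y, pw, power=2):
    """Optimal transport cost between discrete measures sum_i q_i*delta_{x_i} and
    sum_j pw_j*delta_{y_j}: minimize sum_ij pi_ij*|x_i - y_j|**power over couplings pi."""
    n, m = len(x), len(y)
    C = np.abs(x[:, None] - y[None, :]) ** power          # cost matrix c(z, xi)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                                    # row sums of pi equal q
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                                    # column sums of pi equal pw
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([q, pw])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

x = np.array([0.0, 1.0, 2.0]); q = np.array([0.2, 0.5, 0.3])
y = np.array([0.5, 1.5]);      pw = np.array([0.6, 0.4])
print(ot_cost(x, q, y, pw, power=2))
```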
For any given $\theta\in\Theta$ and $\delta>0$ we have the following dual result (cf., [9], [4], [10]), assuming that $\ell(\theta,\cdot)$ is upper semi-continuous and $P$-integrable:
$\sup_{Q:\, W_c(Q,P)\le\delta} \mathbb{E}_{Q}[\ell(\theta, Z)] = \inf_{\lambda\ge 0}\Big\{ \lambda\delta + \mathbb{E}_{P}\big[\bar\ell_{\lambda}(\theta,\xi)\big]\Big\},$   (4.16)
where
$\bar\ell_{\lambda}(\theta,\xi) := \sup_{z\in\Xi}\big\{ \ell(\theta,z) - \lambda\, c(z,\xi)\big\}.$   (4.17)
Throughout the rest of our discussion, we will choose $c(z,\xi) := \|z-\xi\|^{p}$ for some $p>1$ and therefore write $W_p$ for this choice of cost function. Further, we use $\|\cdot\|_*$ to denote the dual norm, namely, $\|u\|_* := \sup_{\|v\|\le 1} u^{\top} v$.
As in the phi-divergence case, assuming that is fixed and has compact support, for example, we can obtain an asymptotic expansion for (4.16) as a function of . By writing we mean .
Proposition 4.2
Suppose that is continuously differentiable and the mapping
(4.18) |
is bounded on compact sets. Then, for any ,
uniformly over such that .
Proof. The proof of this result is similar to the one given in the phi-divergence case. We start by observing that
where the optimization in the right hand side is taken over random variables . We let and note that
Next, we can obtain a lower bound by considering a specific form of suggested by the formal asymptotic limit as . Note that
and the equality is achieved if we choose any which is a constant multiple of
(The function can be selected in a measurable way using the uniformization technique of Jankov-von Neumann.) Next, if , then
and
Letting we have that and therefore
with
The denominator is well defined since and the random variable is essentially bounded uniformly over the family and . Since the gradient is continuous, then it is uniformly continuous over compact sets and, consequently, uniformly over in compact sets,
as . This yields that
uniformly over and . For the upper bound, we can apply the duality representation, just as we did in the phi-divergence case. Using duality, we have that
Once again, we select a specific choice given by
The fact that follows because . We then obtain
Next, we argue that the family of functions
converges uniformly on compact sets to the function . Let us consider the sup over and note because (4.18) is bounded on compact sets, there exists a constant independent of such that
By selecting small enough (depending only on and ) we see that the right hand side can be made arbitrarily negative uniformly over as . So, it suffices to consider only , in this case, since is continuous, then it is uniformly continuous on compacts. So, we can write, in terms of the (uniform) modulus of the continuity function
where . In conclusion, we have that
Further, the range
in the upper and lower envelopes above can be further constrained to be
compact (independent of and , but depending on
). From the above expressions, we deduce the required
uniform convergence of on compacts. The asymptotic upper bound then
follows from these estimates.
Similar results have appeared in the literature (cf., [1]). An important difference, which is useful in our analysis, is that the above result is uniform over a suitable class of probability measures and integrands.
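As a sanity check on the dual representation (4.16)–(4.17) (our own illustration with an illustrative linear loss, for which the inner supremum is available in closed form): for $\ell(\theta,z)=\theta z$ on the real line with cost $c(z,\xi)=|z-\xi|$, the dual reduces to $\mathbb{E}_{P_N}[\theta\xi] + \delta\,|\theta|$, i.e. the worst-case expectation equals the empirical loss plus a norm-type regularization term; the code below evaluates the dual numerically and compares it with this closed form.

```python
import numpy as np

rng = np.random.default_rng(4)
xi = rng.normal(1.0, 2.0, size=1000)     # empirical sample defining P_N
theta, delta = 0.7, 0.3

# Dual (4.16)-(4.17) for ell(theta, z) = theta*z and c(z, xi) = |z - xi|:
# sup_z {theta*z - lam*|z - xi|} equals theta*xi if lam >= |theta|, and +inf otherwise.
def dual_objective(lam):
    return np.inf if lam < abs(theta) else lam * delta + np.mean(theta * xi)

lams = np.append(np.linspace(0.0, 5.0, 2001), abs(theta))
dual = min(dual_objective(l) for l in lams)

closed_form = np.mean(theta * xi) + delta * abs(theta)   # empirical loss + delta*|theta|
print(dual, closed_form)                                 # the two values coincide
```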
In order to write the expansion of $f_N^{\delta}$ we clarify that here the gradient $\nabla\ell(\theta,\xi)$ is taken with respect to $\xi$. Under suitable boundedness and smoothness assumptions, the previous result yields
(4.19) |
We collect the precise statement of our result next. The proof is similar to that of Theorem 4.1 and thus omitted.
Theorem 4.2
Suppose is continuously differentiable, that
(4.20) |
is locally bounded, that has compact support, is Lipschitz continuous and
(4.21) |
Then, we have the following types of behavior of optimal values .
(a-W) If , then
(4.22) |
(b-W) If , then
(c-W) If , then
A completely analogous situation to Example 4.1 can also be constructed in Wasserstein DRO to show that both differentiability of and are important in deriving the asymptotics in the critical case. The case in which was covered in [5] under suitable quadratic growth conditions and the existence of second moments.
5 General Principle in Action: Optimal Solutions
We complete our discussion in this section by considering optimal solutions for phi-divergence and Wasserstein DRO problems. A key observation is that in both the phi-divergence case and the Wasserstein DRO case the uncertainty set is compact in the weak topology and therefore, if Assumption 2.3 holds, the function $f_N^{\delta}(\cdot)$ is differentiable and its gradient has expansion (3.8). In fact, the derivative can be shown to exist if we are able to argue that, for sufficiently small $\delta$, the worst case measure is unique. This is precisely the strategy that we will pursue in this section. Throughout the section we impose the condition that $S_* = \{\theta_*\}$ is a singleton. Recall that $\Sigma = \mathrm{Cov}_{P_*}\big(\nabla\ell(\theta_*,\xi)\big)$.
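The envelope-type identity behind this strategy can be stated compactly (a standard Danskin-type formula; the notation is ours, and it presumes that the worst-case measure $Q^{*}_{\theta}$ is attained, is unique, and that differentiation under the expectation is justified):

```latex
% If Q^*_theta is the unique maximizer defining f_N^delta(theta), then the DRO
% objective is differentiable in theta and its gradient is the expected gradient
% of the loss under the worst-case measure:
\nabla_{\theta} f_{N}^{\delta}(\theta)
  \;=\;
  \nabla_{\theta}\, \mathbb{E}_{Q^{*}_{\theta}}\big[\ell(\theta,\xi)\big]
  \;=\;
  \mathbb{E}_{Q^{*}_{\theta}}\big[\nabla_{\theta}\ell(\theta,\xi)\big].
```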
5.1 The Phi-Divergence Case
Theorem 5.1
Proof. Applying the centering and scaling used to obtain (4.5) we obtain
where
(5.1) |
and
It suffices to show that
uniformly over some region for some . Note that the optimization region in (5.1) is compact in the weak topology and therefore, by Danskin’s Theorem (see [17, sections 5.1.3 and 7.1.5], Section 7), we have that is directionally differentiable and by the uniqueness of the optimal for sufficiently small we have that
We can precisely characterize from Proposition 4.1 over a region for which we can guarantee . Note that such can be found assuming that (for some random but finite almost surely because of the Strong Law of Large Numbers and continuity since . We have, uniformly over , for ,
On the other hand, defining
we have that
where
We obtain
It follows that
uniformly over , and
uniformly in probability (in fact almost surely) as . Uniform convergence in probability over follows from these observations.
5.2 The Wasserstein Distance Case
As in Proposition 4.2, in order to simplify the exposition, we assume that has a compact support. We also let be the norm for . This choice, in particular, satisfies that for any such that , the set
(5.2) |
This will help us argue, in the presence of Lipschitz gradients, that the worst case adversarial distribution is unique when the distributional uncertainty $\delta$ is sufficiently small and this, in turn, will help guarantee differentiability. In this section, we use $\nabla_{\theta}$ and $\nabla_{\xi}$ to denote the derivatives with respect to $\theta$ and $\xi$, respectively. The derivative with respect to all of the arguments is simply denoted by $\nabla$.
Theorem 5.2
Suppose that Assumptions 2.2 and 2.3 hold.
Further, assume that
conditions (4.20) - (4.21) hold and that is
directionally differentiable at (in the Hadamard sense). Let , where is the covariance matrix of . Then, we have the following.
(A-W) If
, then
(5.3) |
(B-W) If , then
(5.4) |
(C-W) If , then
(5.5) |
Proof. Most of the work has already been done in the proof of Proposition 4.2. We have that
where
(5.6) |
It suffices to show uniform convergence of to in some neighborhood for some .
From the proof of Proposition 4.2 we can collect several facts, note that we are assuming that is -Lipschitz, which guarantees (4.20).
I) First,
in probability.
II) Moreover, we also saw that there exists a random (finite with probability one) such that
with for all uniformly over for small enough so that . Note that such exists by continuity since we assume that .
III) Finally, also on the set from II), since is Lipschitz for all sufficiently small, we have that the maximizer inside the expectation is unique because of (5.2) and it converges uniformly on compacts in the variable and over to
(5.7) |
where
Next, by Danskin’s Theorem, see [17, sections 5.1.3 and 7.1.5], Section 7, because the uncertainty set is compact in the weak topology, we have that is differentiable by uniqueness of . The most convenient representation to see this is (5.6). It is also direct that is differentiable everywhere. Moreover, since
Danskin’s Theorem also applies and we have that
where is given in (5.7). So, we have (using ) instead of ,
Since this expression converges (in probability) uniformly over compact sets in $\theta$ and $\xi$, and it is bounded almost surely, the required uniform convergence in probability follows from the assumed Lipschitz continuity.
A similar result is obtained in [5] under quadratic growth conditions and
the existence of second moments (thus relaxing compactness assumptions).
However, [5] primarily focuses on the case in which the optimal
solution lies in the interior of the feasible region. Our discussion here can
be used in combination with the analysis in [5] to deal with boundary cases.
Acknowledgement
J. Blanchet’s research was partially
supported by the Air Force Office of Scientific Research (AFOSR) under award
number FA9550-20-1-0397, with additional support from NSF 1915967 and 2118199.
The research of A. Shapiro was partially supported by (AFOSR) Grant FA9550-22-1-0244.
References
- [1] D. Bartl, S. Drapeau, J. Obłój, and J. Wiesel. Sensitivity analysis of Wasserstein distributionally robust optimization problems. Proceedings of the Royal Society A, 477(2256), 2021.
- [2] G. Bayraksan and D. K. Love. Data-driven stochastic programming using phi-divergences. Tutorials in Operations Research, INFORMS, pages 1563–1581, 2015.
- [3] A. Ben-Tal and M. Teboulle. Penalty functions and duality in stochastic programming via phi-divergence functionals. Mathematics of Operations Research, 12:224–240, 1987.
- [4] J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56:830–857, 2019.
- [5] J. Blanchet, K. Murthy, and N. Si. Confidence regions in Wasserstein distributionally robust estimation. Biometrika, 109:295–315, 2022.
- [6] J. Frédéric Bonnans and Alexander Shapiro. Perturbation Analysis of Optimization Problems. Springer Series in Operations Research. Springer, 2000.
- [7] I. Csiszár. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8:85–108, 1963.
- [8] John C. Duchi, Peter W. Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 46(3):946–969, 2021.
- [9] P. M. Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171:115–166, 2018.
- [10] R. Gao and A. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
- [11] H. Lam. Robust sensitivity analysis for stochastic systems. Math. of Oper. Research, 41:1248–1275, 2016.
- [12] T. Morimoto. Markov processes and the h-theorem. J. Phys. Soc. Jap., 18(3):328–333, 1963.
- [13] H. Rahimian and S. Mehrotra. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659, 2019.
- [14] A. Shapiro. Asymptotic analysis of stochastic programs. Annals of Operations Research, 30:169–186, 1991.
- [15] A. Shapiro. Asymptotic behavior of optimal solutions in stochastic programming. Mathematics of Operations Research, 18:829 – 845, 1993.
- [16] A. Shapiro. Distributionally robust stochastic programming. SIAM J. Optimization, 27:2258–2275, 2017.
- [17] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia, 2009.
- [18] A.W. van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, 1998.
- [19] C. Villani. Topics in Optimal Transportation. American Mathematical Society, Graduate Studies in Mathematics, Vol. 58, 2003.