Sensitivity analysis of Wasserstein distributionally robust optimization problems
Abstract.
We consider sensitivity of a generic stochastic optimization problem to model uncertainty. We take a non-parametric approach and capture model uncertainty using Wasserstein balls around the postulated model. We provide explicit formulae for the first-order correction to both the value function and the optimizer and further extend our results to optimization under linear constraints. We present applications to statistics, machine learning, mathematical finance and uncertainty quantification. In particular, we provide an explicit first-order approximation for square-root LASSO regression coefficients and deduce coefficient shrinkage compared to the ordinary least squares regression. We consider robustness of call option pricing and deduce a new Black-Scholes sensitivity, a non-parametric version of the so-called Vega. We also compute sensitivities of optimized certainty equivalents in finance and propose measures to quantify robustness of neural networks to adversarial examples.
Key words and phrases:
Robust stochastic optimization, Sensitivity analysis, Uncertainty quantification, Non-parametric uncertainty, Wasserstein metric
1. Introduction
We consider a generic stochastic optimization problem
(1)   $\inf_{a \in \mathcal{A}} \int_{\mathcal{S}} f(x,a)\, \mu(\mathrm{d}x),$
where $\mathcal{A}$ is the set of actions or choices, $f$ is the loss function and $\mu$ is a probability measure over the space of states $\mathcal{S}$. Such problems are found across the whole of applied mathematics. The measure $\mu$ is the crucial input and it could represent, e.g., a dynamic model of the system, as is often the case in mathematical finance or mathematical biology, or the empirical measure of observed data points, or the training set, as is the case in statistics and machine learning applications. In virtually all cases, there is a certain degree of uncertainty around the choice of $\mu$ coming from modelling choices and simplifications, incomplete information, data errors, finite sample error, etc. It is thus very important to understand the influence of changes in $\mu$ on (1), both on its value and on its optimizer. Often, the choice of $\mu$ is done in two stages: first a parametric family of models is adopted and then the values of the parameters are fixed. Sensitivity analysis of (1) with changing parameters is a classical topic explored in parametric programming and statistical inference, e.g., Armacost and Fiacco, (1974); Vogel, (2007); Bonnans and Shapiro, (2013). It also underscores a lot of progress in the field of uncertainty quantification, see Ghanem et al., (2017). Considering $\mu$ as an abstract parameter, the mathematical programming literature looked into qualitative and quantitative stability of (1). We refer to Dupacova, (1990); Romisch, (2003) and the references therein. When $\mu$ represents data samples, there has been a considerable interest in the optimization community in designing algorithms which are robust and, in particular, do not require excessive hyperparameter tuning, see Asi and Duchi, (2019) and the references therein.
A more systematic approach to model uncertainty in (1) is offered by the distributionally robust optimization problem
(2)   $V(\delta) := \inf_{a \in \mathcal{A}} \sup_{\nu \in B_\delta(\mu)} \int_{\mathcal{S}} f(x,a)\, \nu(\mathrm{d}x),$
where $B_\delta(\mu)$ is a ball of radius $\delta$ around $\mu$ in the space of probability measures, as specified below. Such problems greatly generalize more classical robust optimization and have been studied extensively in operations research and machine learning in particular; we refer the reader to Rahimian and Mehrotra, (2019) and the references therein. Our goal in this paper is to understand the behaviour of these problems for small $\delta$. Our main results compute the first-order behaviour of $V(\delta)$ and its optimizer for small $\delta$. This offers a measure of sensitivity to errors in model choice and/or specification, as well as pointing to the abstract direction, in the space of models, in which the change is most pronounced. We use examples to show that our results can be applied across a wide spectrum of science.
This paper is organised as follows. We first present the main results and then, in section 3, explore their applications. Further discussion of our results and the related literature is found in section 4, which is then followed by the proofs. Online appendix Bartl et al., 2021a contains many supplementary results and remarks, as well as some more technical arguments from the proofs.
2. Main results
Take , endow with the Euclidean norm and write for the interior of a set . Assume that is a closed convex subset of . Let denote the set of all (Borel) probability measures on . Further fix a seminorm on and denote by its (extended) dual norm, i.e., . In particular, for we also have . For , we define the -Wasserstein distance as
where is the set of all probability measures on with first marginal and second marginal . Denote the Wasserstein ball
of size around . Note that, taking a suitable probability space and a random variable , we have the following probabilistic representation of :
where the supremum is taken over all satisfying almost surely and .
Wasserstein distances and optimal transport techniques have proved to be powerful and versatile tools in a multitude of applications, from economics Chiappori et al., (2010); Carlier and Ekeland, (2010) to image recognition Peyré and Cuturi, (2019). The idea to use Wasserstein balls to represent model uncertainty was pioneered in
Pflug and Wozabal, (2007) in the context of investment problems.
When sampling from a measure with a finite moment, the empirical measures converge to the true distribution and Wasserstein balls around the empirical measures have the interpretation of confidence sets, see Fournier and Guillin, (2014). In this setup, the radius $\delta$ can then be chosen as a function of a given confidence level and the sample size. This yields finite-sample guarantees and asymptotic consistency, see Mohajerin Esfahani and Kuhn, (2018); Obłój and Wiesel, (2021), and justifies the use of the Wasserstein metric to capture model uncertainty. The value in (2) has a dual representation, see Gao and Kleywegt, (2016); Blanchet and Murthy, (2019), which has led to significant new developments in distributionally robust optimization, e.g., Mohajerin Esfahani and Kuhn, (2018); Blanchet et al., 2019a ; Kuhn et al., (2019); Shafieezadeh-Abadeh et al., (2019).
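As a small numerical aside (ours, not part of the original text), the convergence of empirical measures in Wasserstein distance can be illustrated directly; the sketch below uses scipy.stats.wasserstein_distance, which computes the one-dimensional 1-Wasserstein distance, with a large reference sample as a stand-in for the true distribution.

```python
# Minimal illustration (ours): decay of the 1-Wasserstein distance between
# an empirical measure and the data-generating law.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
reference = rng.normal(size=100_000)  # proxy for the true distribution mu

for n in [10, 100, 1_000, 10_000]:
    dists = [wasserstein_distance(rng.normal(size=n), reference) for _ in range(20)]
    print(f"n = {n:6d}   average W_1(empirical, reference) ~ {np.mean(dists):.4f}")
```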
Naturally, other choices for the distance on the space of measures are also possible: such as the Kullback-Leibler divergence, see Lam, (2016) for general sensitivity results and Calafiore, (2007) for applications in portfolio optimization, or the Hellinger distance, see Lindsay, (1994) for a statistical robustness analysis. We refer to section 4
for a more detailed analysis of the state of the art in these fields. Both of these approaches have good analytic properties and often lead to theoretically appealing closed-form solutions. However, they are also very restrictive since any measure in the neighborhood of has to be absolutely continuous with respect to . In particular, if is the empirical measure of observations then measures in its neighborhood have to be supported on those fixed points. To obtain meaningful results it is thus necessary to impose additional structural assumptions, which are often hard to justify solely on the basis of the data at hand and, equally importantly, create another layer of model uncertainty themselves. We refer to (Gao and Kleywegt,, 2016, sec. 1.1) for further discussion of potential issues with -divergences. The Wasserstein distance, while harder to handle analytically, is more versatile and does not require any such additional assumptions.
Throughout the paper we take the convention that continuity and closure are understood w.r.t. . We assume that is convex and closed and that the seminorm is strictly convex in the sense that for two elements with and , we have (note that this is satisfied for every -norm for ). We fix , let so that , and fix such that the boundary of has –zero measure and . Denote by the set of optimizers for in (2).
Assumption 1.
The loss function satisfies
-
•
is differentiable on for every . Moreover is continuous and for every there is such that for all and with .
-
•
For all sufficiently small we have and for every sequence such that and such that for all there is a subsequence which converges to some .
The above assumption is not restrictive: the first part merely ensures existence of , while the second part is satisfied as soon as either is compact or is coercive, which is the case in most examples of interest, see (Bartl et al., 2021a, , Lemma 23) for further comments.
Theorem 2.
If Assumption 1 holds then is given by
Remark 3.
Inspecting the proof, defining
we obtain . This means that for small there is no first-order gain from optimizing over all in the definition of when compared with restricting simply to , as in .
The above result naturally extends to computing sensitivities of robust problems, i.e., , see (Bartl et al., 2021a, Corollary 13), as well as to the case of stochastic optimization under linear constraints, see (Bartl et al., 2021a, Theorem 15). We recall that .
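To illustrate Theorem 2 numerically, the following sketch is our own and rests on assumptions: since the displayed statement above is not reproduced, we take the expansion to read $V(\delta) \approx V(0) + \delta\,\Upsilon$ with $\Upsilon = \big(\int \|\nabla_x f(x,a^\star)\|_*^q\,\mu(\mathrm{d}x)\big)^{1/q}$, $1/p+1/q=1$, for an optimizer $a^\star$ of the non-robust problem, and we use a deterministic shift of the atoms (cf. Remark 3 and the discussion in section 4) as the perturbation. The toy case is $f(x,a)=(x-a)^2$, $p=q=2$, with an empirical baseline measure.

```python
# Toy check (ours) of the assumed first-order expansion V(delta) ~ V(0) + delta * Upsilon
# for f(x, a) = (x - a)^2 on the real line, p = q = 2, mu an empirical measure.
import numpy as np

rng = np.random.default_rng(1)
x = 1.3 * rng.standard_normal(500) + 0.7        # support points of mu
a_star = x.mean()                               # optimizer of the non-robust problem
V0 = np.var(x)                                  # V(0) = inf_a E_mu[(X - a)^2]

grad = 2.0 * (x - a_star)                       # d f / d x evaluated at a_star
upsilon = np.sqrt(np.mean(grad ** 2))           # assumed formula for the sensitivity

for delta in [0.1, 0.01, 0.001]:
    # Shift every atom along the normalised gradient; this produces a measure
    # within W_2-distance delta of mu, so it gives a lower bound on V(delta).
    x_shifted = x + delta * grad / np.sqrt(np.mean(grad ** 2))
    V_delta_lb = np.var(x_shifted)
    print(f"delta = {delta:5.3f}:  (V - V0)/delta >= {(V_delta_lb - V0) / delta:.4f},  Upsilon = {upsilon:.4f}")
```

In this toy example the printed ratios approach the assumed $\Upsilon$ (here twice the standard deviation of the atoms) as $\delta$ decreases.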
Assumption 4.
Suppose the is twice continuously differentiable, and
-
•
for some , , all and all close to .
-
•
The function is twice continuously differentiable in the neighbourhood of and the matrix is invertible.
Theorem 5.
Suppose and such that as and Assumptions 1 and 4 are satisfied. Then, if or -a.e.,
where is the unique function satisfying , see (Bartl et al., 2021a, , Lemma 7). In particular, if .
Above and throughout the convention is that , , and . The assumed existence and convergence of optimizers holds, e.g., with suitable convexity of in , see (Bartl et al., 2021a, , Lemma 22) for a worked out setting. In line with the financial economics practice, we gave our sensitivities letter symbols, and , loosely motivated by , the Greek for Model, and , the Hebrew for control.
3. Applications
We now illustrate the universality of Theorems 2 and 5 by considering their applications in a number of different fields. Unless otherwise stated, , and means .
3.1. Financial Economics
We start with the simple example of risk-neutral pricing of a call option written on an underlying asset . Here, are the maturity and the strike respectively, and is the distribution of . We set interest rates and dividends to zero for simplicity. In the Black and Scholes, (1973) model, is a log-normal distribution, i.e., is Gaussian with mean and variance . In this case, is given by the celebrated Black-Scholes formula. Note that this example is particularly simple since is independent of . However, to ensure risk-neutral pricing, we have to impose a linear constraint on the measures in , giving
(3) |
To compute its sensitivity we encode the constraint using a Lagrangian and apply Theorem 2, see (Bartl et al., 2021a, , Rem. 11, Thm. 15). For , letting and , the resulting formula, see (Bartl et al., 2021a, , Example 18), is given by
Let us specialise to the log-normal distribution of the Black-Scholes model above and denote the quantity in (3) as . It may be computed exactly using methods from Bartl et al., (2020) and Figure 1 compares the exact value and the first order approximation.
[Figure 1. The exact value of the robust call price compared with its first-order approximation.]
We have , where and is the cdf of the distribution. It is also insightful to compare with a parametric sensitivity. If, instead of Wasserstein balls, we consider perturbations within the log-normal family, the resulting sensitivity is known as the Black-Scholes Vega and is given by . We plot the two sensitivities in Figure 2. It is remarkable how, for the range of strikes of interest, the non-parametric model sensitivity traces out the usual shape of but shifted upwards to account for the idiosyncratic risk of departure from the log-normal family. More generally, given a book of options with payoff at time , with , we could say that the book is -neutral if the sensitivity was the same for and for . In analogy to the standard Delta-Vega hedging, one could develop a non-parametric, model-agnostic Delta-Upsilon hedging. We believe these ideas offer potential for exciting industrial applications and we leave them to further research.
[Figure 2. The non-parametric sensitivity compared with the Black-Scholes Vega across strikes.]
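As a rough numerical companion to the discussion above (ours, not the paper's code; the parameter values are placeholders), the sketch below computes the classical Black-Scholes Vega, $S_0\varphi(d_1)\sqrt{T}$, alongside the sensitivity obtained by applying the assumed unconstrained Theorem 2 formula with $p=q=2$ to $f(x)=(x-K)^+$ under the log-normal baseline, which gives $\sqrt{\Phi(d_2)}$. The sensitivity discussed above additionally imposes the martingale constraint from (3) and therefore differs from this simplified unconstrained quantity.

```python
# Illustrative comparison (ours): Black-Scholes Vega vs. an unconstrained
# Wasserstein sensitivity of the call price under the assumed Theorem 2 formula.
import numpy as np
from scipy.stats import norm

S0, T, sigma = 1.0, 1.0, 0.2          # spot, maturity, volatility (assumed values)

def bs_vega(K):
    d1 = (np.log(S0 / K) + 0.5 * sigma**2 * T) / (sigma * np.sqrt(T))
    return S0 * norm.pdf(d1) * np.sqrt(T)

def wasserstein_sensitivity(K):
    # Unconstrained p = q = 2 case: Upsilon = sqrt(P(S_T > K)) for f(x) = (x - K)^+.
    d2 = (np.log(S0 / K) - 0.5 * sigma**2 * T) / (sigma * np.sqrt(T))
    return np.sqrt(norm.cdf(d2))

for K in [0.8, 0.9, 1.0, 1.1, 1.2]:
    print(f"K = {K:.1f}:  Vega = {bs_vega(K):.4f}   Upsilon (unconstrained) = {wasserstein_sensitivity(K):.4f}")
```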
We turn now to the classical notion of optimized Certainty Equivalent (OCE) of Ben Tal and Teboulle, (1986). It is a decision theoretic criterion designed to split a liability between today and tomorrow’s payments. It is also a convex risk measure in the sense of Artzner et al., (1999) and covers many of the popular risk measures such as expected shortfall or entropic risk, see Ben Tal and Teboulle, (2007). We fix a convex monotone function which is bounded from below and . Here represent the payoff of a financial position and is the negative of a utility function, or a loss function. We take and refer to (Bartl et al., 2021a, , Lemma 22) for generic sufficient conditions for Assumptions 1 and 4 to hold in this setup. The OCE corresponds to in (1) for and , . Theorems 2 and 5 yield the sensitivities
where, for simplicity, we took for the latter.
A related problem considers hedging strategies which minimise the expected loss of the hedged position, i.e., , where and represent today and tomorrow’s traded prices. We compute as
Further, we can combine loss minimization with OCE and consider , . This gives as the infimum over of
The above formulae capture non-parametric sensitivity to model uncertainty for examples of key risk measurements in financial economics. To the best of our knowledge this has not been achieved before.
Finally, we consider briefly the classical mean-variance optimization of Markowitz, (1952). Here represents the loss distribution across the assets and , are the relative investment weights. The original problem is to minimise the sum of the expectation and the standard deviation of the returns , with . Using the ideas in (Pflug et al., 2012, Example 2) and considering measures on , we can recast the problem as (1). Whilst Pflug et al., (2012) focused on the asymptotic regime , their non-asymptotic statements are related to our Theorem 2 and either result could be used here to obtain that .
3.2. Neural Networks
We specialise now to quantifying robustness of neural networks (NN) to adversarial examples. This has been an important topic in machine learning since Szegedy et al., (2013) observed that NNs consistently misclassify inputs formed by applying small worst-case perturbations to a data set. This has produced a number of works offering either explanations of these effects or algorithms for creating such adversarial examples, e.g. Goodfellow et al., (2014); Li et al., (2019); Carlini and Wagner, (2017); Wong and Kolter, (2017); Weng et al., (2018); Araujo et al., (2019); Mangal et al., (2019), to name just a few. The main focus of research in this area, see Bastani et al., (2016), has been on faster algorithms for finding adversarial examples, typically leading to an overfit to these examples without any significant generalisation properties. The viewpoint has been mainly pointwise, e.g., Szegedy et al., (2013), with some generalisations to probabilistic robustness, e.g., Mangal et al., (2019).
In contrast, we propose a simple metric for measuring robustness of neural networks which is independent of the architecture employed and the algorithms for identifying adversarial examples. In fact, Theorem 2 offers a simple and intuitive way to formalise robustness of neural networks: for simplicity consider a -layer neural network trained on a given distribution of pairs , i.e. solve
where the is taken over , for a given activation function , where the composition above is understood componentwise. Set . Data perturbations are captured by and (2) offers a robust training procedure. The first order quantification of the NN sensitivity to adversarial data is then given by
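As a minimal sketch of how such a robustness measure can be estimated in practice (our own illustration; the network, the data and the precise form of the displayed quantity are all assumptions), take the sensitivity to be the $L^2(\mu)$-norm of the gradient of the training loss with respect to the inputs, in line with Theorem 2 for $p=q=2$, and estimate it over the training set for a one-hidden-layer ReLU network with squared loss.

```python
# Sketch (ours): estimate a robustness measure for a small ReLU network as the
# L^2(mu)-norm of the input-gradient of the loss over the training set.
# All weights and data below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(2)
d, h, n = 5, 16, 200
W1, b1 = rng.standard_normal((h, d)) / np.sqrt(d), np.zeros(h)
W2, b2 = rng.standard_normal((1, h)) / np.sqrt(h), np.zeros(1)
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

def loss_input_gradient(x, target):
    z = W1 @ x + b1                      # pre-activations
    hidden = np.maximum(z, 0.0)          # hidden layer (ReLU)
    out = (W2 @ hidden + b2)[0]          # network output g(x)
    # d loss / d x for loss = (g(x) - y)^2, using the ReLU subgradient 1_{z > 0}
    return 2.0 * (out - target) * (W2[0] * (z > 0)) @ W1

grad_norms_sq = np.array([np.sum(loss_input_gradient(X[i], y[i]) ** 2) for i in range(n)])
robustness = np.sqrt(grad_norms_sq.mean())   # smaller value = less sensitive to data perturbations
print(f"estimated sensitivity to adversarial data: {robustness:.4f}")
```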
A similar viewpoint, capturing robustness to adversarial examples through optimal transport lens, has been recently adopted by other authors. The dual formulation of (2) was used by Shafieezadeh-Abadeh et al., (2019) to reduce the training of neural networks to tractable linear programs. Sinha et al., (2020) modified (2) to consider a penalised problem to propose new stochastic gradient descent algorithms with inbuilt robustness to adversarial data.
3.3. Uncertainty Quantification
In the context of UQ the measure represents input parameters of a (possibly complicated) operation in a physical, engineering or economic system. We consider the so-called reliability or certification problem: for a given set of undesirable outcomes, one wants to control , for a set of probability measures . The distributionally robust adversarial classification problem considered recently by Ho-Nguyen and Wright, (2020) is also of this form, with Wasserstein balls around an empirical measure of samples. Using the dual formulation of Blanchet and Murthy, (2019), they linked the problem to minimization of the conditional value-at-risk and proposed a reformulation, and numerical methods, in the case of linear classification. We propose instead a regularised version of the problem and look for
for a given safety level . We thus consider the average distance to the undesirable set, , and not just its probability. The quantity could then be used to quantify the implicit uncertainty of the certification problem, where higher corresponds to less uncertainty. Taking statistical confidence bounds of the empirical measure in Wasserstein distance into account, see Fournier and Guillin, (2014), would then determine the minimum number of samples needed to estimate the empirical measure.
Assume that is convex. Then the distance function is differentiable everywhere except at the boundary of , with for and for all . Further, assume is absolutely continuous w.r.t. the Lebesgue measure on . Theorem 2, using (Bartl et al., 2021a, Remark 11), gives a first-order expansion for the above problem
In the special case this simplifies to
and the minimal measure pushes every point not contained in in the direction of the orthogonal projection. This recovers the intuition of (Chen et al., 2018, Theorem 1), which in turn relies on (Gao and Kleywegt, 2016, Corollary 2, Example 7). Note however that our result holds for general measures . We also note that such an approximation could provide an ansatz for dimension reduction, by identifying the dimensions for which the partial derivatives are negligible and then projecting onto the corresponding lower-dimensional subspace (thus providing a simpler surrogate for ). This would be an alternative to a basis expansion (e.g., orthogonal polynomials) used in UQ and would exploit the interplay of the properties of and simultaneously.
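To make this concrete, here is a small sketch of ours (all specifics are assumptions): with the undesirable set $R$ a Euclidean ball, $f(x)=\operatorname{dist}(x,R)$ and $p=q=2$, the assumed Theorem 2 formula yields $\Upsilon=\sqrt{\mu(R^c)}$, since the gradient of the distance function has unit norm outside $R$ and vanishes inside; the code compares this with the gain obtained by pushing the atoms outside $R$ radially outwards within a $W_2$-budget $\delta$.

```python
# Sketch (ours): first-order expansion of the average distance to an undesirable
# set R = {|x| <= r} over a W_2-ball around an empirical measure, p = q = 2.
import numpy as np

rng = np.random.default_rng(3)
r, n, delta = 1.5, 5_000, 0.05
X = rng.standard_normal((n, 2))                       # atoms of the baseline measure mu
norms = np.linalg.norm(X, axis=1)
outside = norms > r

V0 = np.maximum(norms - r, 0.0).mean()                # average distance to R under mu
upsilon = np.sqrt(outside.mean())                     # assumed formula: sqrt(mu(R^c))

# First-order optimal perturbation: move each atom outside R radially outwards.
grad = np.zeros_like(X)
grad[outside] = X[outside] / norms[outside, None]     # unit outward normals
shift = delta * grad / np.sqrt(np.mean(np.sum(grad ** 2, axis=1)))
V_shifted = np.maximum(np.linalg.norm(X + shift, axis=1) - r, 0.0).mean()
print(f"(V(delta) - V(0)) / delta ~ {(V_shifted - V0) / delta:.4f},  Upsilon = {upsilon:.4f}")
```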
3.4. Statistics
We discuss two applications of our results in the realm of statistics. We start by highlighting the link between our results and the so-called influence curves (IC) in robust statistics. For a functional its IC is defined as
Computing the IC, if it exists, is in general hard and closed-form solutions may be unachievable. However, for the so-called M-estimators, defined as optimizers for ,
for some (e.g., for the median), we have
under suitable assumptions on , see (Huber and Ronchetti,, 1981, section 3.2.1). In comparison, writing for the optimizer for , Theorem 5 yields
(4) |
under Assumption 4 and normalisation . To investigate the connection let us Taylor-expand around to obtain
Choosing and integrating both sides over and dividing by , we obtain the asymptotic equality
for by (4). We conclude that considering the average directional derivative of IC in the direction of gives our first-order sensitivity. For an interesting conjecture regarding the comparison of influence functions and sensitivities in KL-divergence we refer to (Lam,, 2018, Section 7.3) and (Lam,, 2016, Section 3.4.2).
Our second application in statistics exploits the representation of the LASSO/Ridge regressions as robust versions of the standard linear regression. We consider and . If instead of the Euclidean metric we take , for some and , in the definition of the Wasserstein distance, then Blanchet et al., 2019a showed that
(5) |
holds, where . The case is the ordinary least squares regression. For , the RHS for is directly related to the Ridge regression, while the limiting case is called the square-root LASSO regression, a regularised variant of linear regression well known for its good empirical performance. Closed-form solutions to (5) do not exist in general and it is a common practice to use numerical routines to solve it approximately. Theorem 5 offers instead an explicit first-order approximation of for small . We denote by the ordinary least squares estimator and by the identity matrix. Note that the first order condition on implies that for all . In particular, and , where we assume the system is overdetermined so that is invertible. Letting a direct computation, see Bartl et al., 2021a , yields
(6) |
For , and for , and hence is approximately
(7) |
respectively. (Inspecting the proof, we see that Theorem 5 still holds in the case , since does not have zero components -a.s., which are the only points of discontinuity of .) This corresponds to parameter shrinkage: proportional for square-root Ridge and a shift towards zero for square-root LASSO. To the best of our knowledge these are the first such results and we stress that our formulae are valid in a general context and, in particular, parameter shrinkage depends on the direction through the factor. Figure 3 compares the first-order approximation with the actual results and shows a remarkable fit.
[Figure 3. First-order approximations of the square-root LASSO and square-root Ridge coefficients compared with the numerically computed values.]
Furthermore, our results agree with what is known in the canonical test case for the (standard) Ridge and LASSO, see Tibshirani, (1996), when is the empirical measure of i.i.d. observations, the data is centred and the covariates are orthogonal, i.e., . In that case, (7) simplifies to
where is the usual coefficient of determination.
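The shrinkage effect is easy to observe numerically. The sketch below is ours, with synthetic placeholder data, and assumes that the right-hand side of (5) takes, in the square-root LASSO case, the familiar form $\sqrt{\mathbb{E}_\mu[(Y-\theta\cdot X)^2]}+\delta\|\theta\|_1$ (the display itself is not reproduced above); it solves this problem for a few small $\delta$ and compares the resulting coefficients with the ordinary least squares estimator.

```python
# Sketch (ours): coefficient shrinkage of the square-root LASSO relative to OLS,
# solving  min_theta  sqrt(mean((Y - X theta)^2)) + delta * ||theta||_1  numerically.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d = 200, 3
X = rng.standard_normal((n, d))
theta_true = np.array([1.0, -0.5, 0.0])
Y = X @ theta_true + 0.3 * rng.standard_normal(n)

theta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

def sqrt_lasso_objective(theta, delta):
    return np.sqrt(np.mean((Y - X @ theta) ** 2)) + delta * np.sum(np.abs(theta))

print("OLS:        ", np.round(theta_ols, 4))
for delta in [0.01, 0.05, 0.1]:
    res = minimize(sqrt_lasso_objective, theta_ols, args=(delta,), method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-10, "maxiter": 20_000})
    print(f"delta={delta:5.2f}:", np.round(res.x, 4))
```

For small $\delta$ the coefficients move towards zero relative to OLS, in line with the shrinkage described after (7).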
The case of is naturally of particular importance in statistics and data science and we continue to consider it in the next subsection. In particular, we characterise the asymptotic distribution of , where and is the optimizer of the non-robust problem for the data-generating measure. This recovers the central limit theorem of Blanchet et al., 2019b , a link we explain further in section 4.2.
3.5. Out-of-sample error
A benchmark of paramount importance in optimization is the so-called out-of-sample error, also known as the prediction error in statistical learning. Consider the setup above when is the empirical measure of i.i.d. observations sampled from the “true” distribution and take, for simplicity, , with . Our aim is to compute the optimal which solves the original problem (1). However, we only have access to the training set, encoded via . Suppose we solve the distributionally robust optimization problem (2) for and denote the robust optimizer . Then the out-of-sample error
quantifies the error from using as opposed to the true optimizer .
While this expression seems to be hard to compute explicitly for finite samples, Theorem 5 offers a way to find the asymptotic distribution of (a suitably rescaled version of) the out-of-sample error. We suppose the assumptions in Theorem 5 are satisfied and note that the first-order condition for gives . Then, a second-order Taylor expansion gives
(8) |
for some , (coordinate-wise) between and . Now we write
where we define as the optimizer of the non-robust problem (1) with replaced by . In particular the -method for M-estimators implies that
(9) |
where and denotes the convergence in distribution. On the other hand, for a fixed , Theorem 5 applied to yields
(10) | ||||
(11) |
where
Almost surely (w.r.t. sampling of ), we know that in as , see Fournier and Guillin, (2014), and under the regularity and growth assumptions on in (Bartl et al., 2021a, , eq. (27)) we check that a.s., see (Bartl et al., 2021a, , Example 28) for details. In particular, taking and combining the above with (9) we obtain
(12) |
This recovers the central limit theorem of Blanchet et al., 2019b , as discussed in more detail in section 4.2 below. Together, (8) and (11) give us the a.s. asymptotic behaviour of the out-of-sample error
(13) |
These results also extend and complement (Anderson and Philpott,, 2019, Prop. 17). Anderson and Philpott, (2019) investigate when the distributionally robust optimizers yield, on average, better performance than the simple in-sample optimizer . To this end, they consider the expectation, over the realisations of the empirical measure of
This is closely related to the out-of-sample error and our derivations above can be easily modified. The first order term in the Taylor expansion no longer vanishes and, instead of (8), we now have
which holds, e.g., if for any , there exists such that for all , . Combined with (10), this gives asymptotics in small for a fixed . For quadratic and taking , we recover the result in (Anderson and Philpott, 2019, Prop. 17), see (Bartl et al., 2021a, Example 28) for details.
4. Further discussion and literature review
We start with an overview of related literature and then focus specifically on a comparison of our results with the CLT of Blanchet et al., 2019b mentioned above.
4.1. Discussion of related literature
Let us first remark that, while Theorem 2 bears some superficial similarity to a classical maximum theorem, which is usually concerned with continuity properties of , in this work we are instead interested in the exact first derivative of the function . Indeed, the convergence follows, for all satisfying , directly from the definition of convergence in the Wasserstein metric (see e.g. (Villani, 2008, Def. 6.8)). In conclusion, the main issue is to quantify the rate of this convergence by calculating the first derivative .
Our work investigates model uncertainty broadly conceived: it includes errors related to the choice of models from a particular (parametric or not) class of models as well as the mis-specification of such class altogether (or indeed, its absence). In decision theoretic literature, these aspects are sometimes referred to as model ambiguity and model mis-specification respectively, see Hansen and Marinacci, (2016). However, seeing our main problem (2) in decision theoretic terms is not necessarily helpful as we think of as given and not coming from some latent expected utility type of problem. In particular, our actions are just constants.
In our work we decided to capture the uncertainty in the specification of using neighborhoods in the Wasserstein distance. As already mentioned, other choices are possible and have been used in the past. Possibly the most often used alternative is relative entropy, or the Kullback-Leibler divergence. In particular, it has been used in this context in economics, see Hansen and Sargent, (2008). To the best of our knowledge, the only comparable study of sensitivities with respect to relative entropy balls is Lam, (2016), see also Lam, (2018) allowing for additional marginal constraints. However, this only considered the specific case where the reward function is independent of the action. Its main result is
where is a ball of radius centred around in KL-divergence, and denote the variance and kurtosis of under the measure respectively. In particular, the first-order sensitivity involves the function itself. In contrast, our Theorem 2 states and involves the first derivative . In the trivial case of a point mass we recover the intuitive sensitivity , while the results of Lam, (2016) do not apply in this case. We also note that Lam, (2016) requires exponential moments of the function under the baseline measure , while we only require polynomial moments. In particular, in applications in econometrics (or any field in which typically has fat tails), the scope of application of the corresponding results might then be decisively different. We remark, however, that this requirement can be substantially weakened (to the existence of polynomial moments) when replacing KL-divergences by -divergences, see e.g. Atar et al., (2015); Glasserman and Xu, (2014). We expect a sensitivity analysis similar to Lam, (2016) to hold in this setting. However, to the best of our knowledge no explicit results seem to be available in the literature.
To understand the relative technical difficulties and merits it is insightful to go into the details of the statements. In fact, in the case of relative entropy and the one-period setup we are considering, the exact form of the optimizing density can be determined exactly (see (Lam, 2016, Proposition 3.1)) up to a one-dimensional Lagrange parameter. This is well known and is the reason behind the usual elegant formulae obtained in this context. But this then reduces the problem in Lam, (2016) to a one-dimensional problem, which can be well approximated via a Taylor approximation. In contrast, when we consider balls in the Wasserstein distance, the form of the optimizing measure is not known (apart from some degenerate cases). In fact, a key insight of our results is that the optimizing measure can be approximated by a deterministic shift in the direction (this is, in general, not exact but only true as a first-order approximation). The reason for these contrasting starting points of the analyses is the fact that Wasserstein balls contain a more heterogeneous set of measures, while in the case of relative entropy, exponentiating will always do the trick. We remark however that this is no longer true for the finite-horizon problems considered in (Lam, 2016, Section 3.2), where the worst-case measure is found using an elaborate fixed-point equation.
A point which further emphasizes that the topology introduced by the Wasserstein metric is less tractable is the fact that
where is the relative entropy and
for some normalising constant , see, e.g., Carlier et al., (2017). This is known as the entropic optimal transport formulation and has received considerable interest in the ML community in recent years (see e.g. Peyré and Cuturi, (2019)). In particular, the Wasserstein distance can be approximated by relative entropy, but only with respect to reference measures on the product space. As we consider optimization over above, this amounts to changing the reference measure. In consequence, the topological structure imposed by Wasserstein distances is more intricate than that induced by relative entropy, but also more flexible.
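For readers less familiar with the entropic formulation, here is a brief sketch of ours of the scheme just described: the Sinkhorn iteration below computes an entropy-regularized transport cost between two discrete measures on the line and, as the regularization parameter decreases, the transport cost approaches the exact squared 2-Wasserstein distance (which, in one dimension, is obtained by matching sorted atoms).

```python
# Sketch (ours): entropic regularization of optimal transport via Sinkhorn iterations,
# compared with the exact 1-d squared 2-Wasserstein cost (monotone coupling).
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = np.sort(rng.standard_normal(n))               # atoms of mu (uniform weights)
y = np.sort(1.0 + 0.8 * rng.standard_normal(n))   # atoms of nu (uniform weights)
a = b = np.full(n, 1.0 / n)
C = (x[:, None] - y[None, :]) ** 2                # squared-distance cost matrix

exact = np.mean((x - y) ** 2)                     # exact W_2^2 in 1-d: match sorted atoms

for eps in [1.0, 0.3, 0.1]:
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(5_000):                        # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    print(f"eps = {eps:4.2f}:  entropic transport cost = {np.sum(plan * C):.4f}   (exact W_2^2 = {exact:.4f})")
```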
The other well studied distance is the Hellinger distance. Lindsay, (1994) calculates influence curves for the minimum Hellinger distance estimator on a countable sample space. Their main result is that for the choice (where is a collection of parametric densities)
the product of the inverse Fisher information matrix and the score function, which is the same as for the classical maximum likelihood estimator. Denote by the empirical measure of data samples and by the corresponding minimum Hellinger distance estimator for . In particular this result implies the same CLT as for M-estimators given by
where . As we discuss in the next section, our Theorem 5 yields a similar CLT, namely
Thus the Wasserstein worst-case approach leads to a shift of the mean of the normal distribution in the direction
compared to the non-robust case. In the simple case with standard deviation we obtain the MLE . We can directly compute (for ) that
Thus the robust approach introduces a shift of (of order 1 if multiplied by the inverse Fisher information) to account for a possibly higher variance in the underlying data. In particular, in our approach, the so-called neutral spaces considered, e.g., in (Komorowski et al., 2011, eq. (21)) as
should also take this shift into account, i.e., their definition should be adjusted to
Lastly, let us mention another situation when our approach provides directly interpretable insights in the context of a parametric family of models. Namely, if one considers a family of models such that the worst-case model in the Wasserstein ball remains in , i.e., , then considering (the first order approximation to) model uncertainty over Wasserstein balls actually reduces to considerations within the parametric family. While uncommon, such a situation would arise, e.g., for a scale-location family , with and a linear/quadratic .
4.2. Link to the CLT of Blanchet et al., 2019b
As observed in section 3.5 above, Theorem 5 allows us to recover the main results in Blanchet et al., 2019b . We explain this now in detail. Set , , . Let denote the empirical measure of i.i.d. samples from . We impose the assumptions on and from Blanchet et al., 2019b , including Lipschitz continuity of gradients of and strict convexity. These, in particular, imply that the optimizers and , as defined in section 3.5, are well defined and unique, and further as . (Blanchet et al., 2019b, Thm. 1) implies that, as ,
(14) |
where . We note that for we have
Thus
and (14) agrees with (12), which is justified by the Lipschitz growth assumptions on , and from Blanchet et al., 2019b , see (Bartl et al., 2021a, eq. (27)). In particular, Theorem 5 implies (14) as a special case. While this connection is insightful to establish (we thank Jose Blanchet for pointing out the possible link and encouraging us to explore it), it is also worth stressing that the proofs in Blanchet et al., 2019b pass through the dual formulation and are thus substantially different to ours. Furthermore, while Theorem 5 holds under milder assumptions on than those in Blanchet et al., 2019b , the last argument in our reasoning above requires the stronger assumptions on . It is thus not clear if our results could help to significantly weaken the assumptions in the central limit theorems of Blanchet et al., 2019b .
5. Proofs
We consider the case and here. For the general case and additional details we refer to Bartl et al., 2021a . When clear from the context, we do not indicate the space over which we integrate.
Proof of Theorem 2.
For every let denote those which satisfy
As the infimum in the definition of is attained (see (Villani,, 2008, Theorem 4.1, p.43)) one has .
We start by showing the “” inequality in the statement. For any one has with equality for . Therefore, differentiating and using both Fubini’s theorem and Hölder’s inequality, we obtain that
Any choice converges in -Wasserstein distance on ) to the pushforward measure of under the mapping , which we denote . This can be seen by, e.g., considering the coupling between and . Now note that and the growth assumption on implies
(15) |
for some and all , . In particular for all and small , for another constant . As further is continuous for every , the -Wasserstein convergence of to implies that
for every for , see (Bartl et al., 2021a, , Lemma 21). Dominated convergence (in ) then yields “” in the statement of the theorem.
We turn now to the opposite “” inequality. As for every there is no loss in generality in assuming that the right hand side is not equal to zero. Now take any, for notational simplicity not relabelled, subsequence of which attains the liminf in and pick . By assumption, for a (again not relabelled) subsequence, one has . Further note that which implies
Now define , where
for with the convention . Note that the integral is well defined since, as before in (15), one has for some and the latter is integrable under . Using that it further follows that
In particular and we can use it to estimate from below the supremum over giving
For any , with , the inner integral converges to
The last equality follows from the definition of and a simple calculation. To justify the convergence, first note that for all by continuity of and since . Moreover, as before in (15), one has for some , hence for some and all . The latter is integrable under , hence convergence of the integrals follows from the dominated convergence theorem. This concludes the proof. ∎
Proof of Theorem 5.
We first show that
(16) | ||||
for all . We start with the “-inequality. For any we have
Let and recall that converge to . Let denote the set of which attain the value: . By (Bartl et al., 2021a, , Lemma 29) the function is (one-sided) directionally differentiable at for all small and thus for all
Then, using Lagrange multipliers to encode the optimality of in , we obtain
where we used a minimax argument as well as Fubini’s theorem. We note that the functions above satisfy the assumptions of Theorem 2 for a fixed . In particular using exactly the same arguments as in the proof of Theorem 2 (i.e., Hölder’s inequality and a specific transport attaining the supremum) we obtain by exchanging the order of and that
(17) | |||
For the infimum can be computed explicitly and equals
For the general case we refer to (Bartl et al., 2021a, , Lemma 30), noting that by assumption , we see that the RHS above is equal to the RHS in (16).
The proof of the “-inequality in (16) follows by the very same arguments. Indeed, (Bartl et al., 2021a, , Lemma 29) implies that
for all and we can write
From here on, we argue as in the “-inequality and conclude that indeed (16) holds.
By assumption the matrix is invertible. Therefore, in a small neighborhood of , the mapping is invertible. In particular and by the first order condition . Applying the chain rule and using (16) gives
This completes the proof. ∎
References
- Anderson and Philpott, (2019) Anderson, E. J. and Philpott, A. B. (2019). Improving sample average approximation using distributional robustness. Optimization Online.
- Araujo et al., (2019) Araujo, A., Pinot, R., Negrevergne, B., Meunier, L., Chevaleyre, Y., Yger, F., and Atif, J. (2019). Robust neural networks using randomized adversarial training. arXiv:1903.10219.
- Armacost and Fiacco, (1974) Armacost, R. L. and Fiacco, A. V. (1974). Computational experience in sensitivity analysis for nonlinear programming. Math. Program., 6(1):301–326.
- Artzner et al., (1999) Artzner, P., Delbaen, F., Eber, J., and Heath, D. (1999). Coherent measures of risk. Math. Finance, 9(3):203–228.
- Asi and Duchi, (2019) Asi, H. and Duchi, J. C. (2019). The importance of better models in stochastic optimization. Proc. Natl. Acad. Sci. USA, 116(46):22924–22930.
- Atar et al., (2015) Atar, R., Chowdhary, K., and Dupuis, P. (2015). Robust bounds on risk-sensitive functionals via rényi divergence. SIAM/ASA J. Uncertain. Quantif., 3(1):18–33.
- (7) Bartl, D., Drapeau, S., Obłój, J., and Wiesel, J. (2021a). Appendix to sensitivity analysis of Wasserstein distributionally robust optimization problems.
- (8) Bartl, D., Drapeau, S., Obłój, J., and Wiesel, J. (2021b). Sensitivity analysis of Wasserstein distributionally robust optimization problems. TBC.
- Bartl et al., (2020) Bartl, D., Drapeau, S., and Tangpi, L. (2020). Computational aspects of robust optimized certainty equivalents and option pricing. Math. Finance, 30(1):287–309.
- Bastani et al., (2016) Bastani, O., Ioannou, Y., Lampropoulos, L., Vytiniotis, D., Nori, A., and Criminisi, A. (2016). Measuring neural net robustness with constraints. In Advances in neural information processing systems, pages 2613–2621.
- Ben Tal and Teboulle, (1986) Ben Tal, A. and Teboulle, M. (1986). Expected utility, penalty functions, and duality in stochastic nonlinear programming. Manag. Sci., 32(11):1445–1466.
- Ben Tal and Teboulle, (2007) Ben Tal, A. and Teboulle, M. (2007). An old-new concept of convex risk measures: the optimized certainty equivalent. Math. Finance, 17(3):449–476.
- Black and Scholes, (1973) Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. J. Political Econ, 81(3):637–654.
- (14) Blanchet, J., Kang, Y., and Murthy, K. (2019a). Robust Wasserstein profile inference and applications to machine learning. J. Appl. Probab., 56(3):830–857.
- Blanchet and Murthy, (2019) Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Math. Oper. Res., 44(2):565–600.
- (16) Blanchet, J., Murthy, K., and Si, N. (2019b). Confidence regions in Wasserstein distributionally robust estimation. arXiv:1906.01614.
- Bonnans and Shapiro, (2013) Bonnans, J. F. and Shapiro, A. (2013). Perturbation Analysis of Optimization Problems. Springer Science & Business Media.
- Brezis, (2010) Brezis, H. (2010). Functional analysis, Sobolev spaces and partial differential equations. Springer Science & Business Media.
- Calafiore, (2007) Calafiore, G. C. (2007). Ambiguous risk measures and optimal robust portfolios. SIAM J. Optim., 18(3):853–877.
- Carlier et al., (2017) Carlier, G., Duval, V., Peyré, G., and Schmitzer, B. (2017). Convergence of entropic schemes for optimal transport and gradient flows. SIAM J. Math. Anal., 49(2):1385–1418.
- Carlier and Ekeland, (2010) Carlier, G. and Ekeland, I. (2010). Matching for teams. Econom. Theory, 42:397–418.
- Carlini and Wagner, (2017) Carlini, N. and Wagner, D. (2017). Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE.
- Chen et al., (2018) Chen, Z., Kuhn, D., and Wiesemann, W. (2018). Data-driven chance constrained programs over Wasserstein balls. arXiv:1809.00210.
- Chiappori et al., (2010) Chiappori, P.-A., McCann, R. J., and Nesheim, L. (2010). Hedonic price equilibria, stable matching, and optimal transport: Equivalence, topology, and uniqueness. Econom. Theory, 42:317–354.
- Dupacova, (1990) Dupacova, J. (1990). Stability and sensitivity analysis for stochastic programming. Ann. Oper. Res., 27(1-4):115–142.
- Fournier and Guillin, (2014) Fournier, N. and Guillin, A. (2014). On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields, 162(3-4):707–738.
- Gao and Kleywegt, (2016) Gao, R. and Kleywegt, A. J. (2016). Distributionally robust stochastic optimization with Wasserstein distance. arXiv:1604.02199.
- Ghanem et al., (2017) Ghanem, R., Higdon, D., and Owhadi, H., editors (2017). Handbook of Uncertainty Quantification. Springer International Publishing.
- Glasserman and Xu, (2014) Glasserman, P. and Xu, X. (2014). Robust risk measurement and model risk. Quant. Finance, 14(1):29–58.
- Goodfellow et al., (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv:1412.6572.
- Hansen and Marinacci, (2016) Hansen, L. P. and Marinacci, M. (2016). Ambiguity aversion and model misspecification: An economic perspective. Stat. Sci., 31(4):511–515.
- Hansen and Sargent, (2008) Hansen, L. P. and Sargent, T. (2008). Robustness. Princeton university press.
- Ho-Nguyen and Wright, (2020) Ho-Nguyen, N. and Wright, S. J. (2020). Adversarial classification via distributional robustness with wasserstein ambiguity. arXiv preprint arXiv:2005.13815.
- Huber and Ronchetti, (1981) Huber, P. and Ronchetti, E. (1981). Robust statistics. Wiley Series in Probability and Mathematical Statistics. New York, NY, USA, Wiley-IEEE, 52:54.
- Komorowski et al., (2011) Komorowski, M., Costa, M. J., Rand, D. A., and Stumpf, M. P. (2011). Sensitivity, robustness, and identifiability in stochastic chemical kinetics models. Proc. Natl. Acad. Sci. USA, 108(21):8645–8650.
- Kuhn et al., (2019) Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. (2019). Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pages 130–166. INFORMS.
- Lam, (2016) Lam, H. (2016). Robust sensitivity analysis for stochastic systems. Math. Oper. Res., 41(4):1248–1275.
- Lam, (2018) Lam, H. (2018). Sensitivity to serial dependency of input processes: A robust approach. Management Science, 64(3):1311–1327.
- Li et al., (2019) Li, L., Zhong, Z., Li, B., and Xie, T. (2019). Robustra: training provable robust neural networks over reference adversarial space. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4711–4717. AAAI Press.
- Lindsay, (1994) Lindsay, B. G. (1994). Efficiency versus robustness: the case for minimum hellinger distance and related methods. Ann. Stat., 22(2):1081–1114.
- Mangal et al., (2019) Mangal, R., Nori, A. V., and Orso, A. (2019). Robustness of neural networks: a probabilistic and practical approach. In Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results, pages 93–96. IEEE Press.
- Markowitz, (1952) Markowitz, H. (1952). Portfolio selection. J. Finance, 7(1):77–91.
- Mohajerin Esfahani and Kuhn, (2018) Mohajerin Esfahani, P. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Math. Program., 171(1-2, Ser. A):115–166.
- Obłój and Wiesel, (2021) Obłój, J. and Wiesel, J. (2021). Robust estimation of superhedging prices. Ann. Stat., 49(1):508–530.
- Peyré and Cuturi, (2019) Peyré, G. and Cuturi, M. (2019). Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607.
- Pflug and Wozabal, (2007) Pflug, G. and Wozabal, D. (2007). Ambiguity in portfolio selection. Quant. Finance, 7(4):435–442.
- Pflug et al., (2012) Pflug, G. C., Pichler, A., and Wozabal, D. (2012). The 1/n investment strategy is optimal under high model ambiguity. J. Banking Finance, 36(2):410–417.
- Rahimian and Mehrotra, (2019) Rahimian, H. and Mehrotra, S. (2019). Distributionally robust optimization: A review. arXiv.org:1908.05659.
- Romisch, (2003) Romisch, W. (2003). Stability of stochastic programming problems. In Stochastic programming, pages 483–554. Elsevier Sci. B. V., Amsterdam.
- Shafieezadeh-Abadeh et al., (2019) Shafieezadeh-Abadeh, S., Kuhn, D., and Esfahani, P. M. (2019). Regularization via mass transportation. J. Mach. Learn. Res., 20(103):1–68.
- Sinha et al., (2020) Sinha, A., Namkoong, H., Volpi, R., and Duchi, J. (2020). Certifying some distributional robustness with principled adversarial training. arXiv:1710.10571v5.
- Szegedy et al., (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2013). Intriguing properties of neural networks. arXiv:1312.6199.
- Terkelsen, (1973) Terkelsen, F. (1973). Some minimax theorems. Math. Scand., 31(2):405–413.
- Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol., 58(1):267–288.
- Villani, (2008) Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media.
- Vogel, (2007) Vogel, S. (2007). Stability results for stochastic programming problems. Optimization, 19(2):269–288.
- Weng et al., (2018) Weng, T.-W., Zhang, H., Chen, P.-Y., Yi, J., Su, D., Gao, Y., Hsieh, C.-J., and Daniel, L. (2018). Evaluating the robustness of neural networks: An extreme value theory approach. arXiv:1801.10578.
- Wong and Kolter, (2017) Wong, E. and Kolter, J. Z. (2017). Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv:1711.00851.
Appendix A Preliminaries
We recall and further explain the setting from the main body of the paper Bartl et al., 2021b . Take , endow with the Euclidean norm . Throughout the paper we take the convention that topological properties, such as continuity or closure, are understood w.r.t. . We let denote respectively the interior, the closure, the boundary and the complement of a set . We denote the set of all probability measures on by . For a variable , we will denote the optimizer by and the set of optimizers by .
Fix a seminorm on and denote by its (extended) dual norm, i.e. . Let us define the equivalence relation if and only if . Furthermore let us set and write . With this notation, the quotient space is a normed space for . Furthermore, by the triangle inequality for and equivalence of norms on , there exists such that and for all . As is Hausdorff, this immediately implies that is Hausdorff as well. Furthermore we conclude, that is continuous and is lower semicontinuous w.r.t. (as the supremum over continuous functions ). Lastly we make the convention that denotes the ball of radius around in . As our setup is slightly non-standard, we state the following lemmas for completeness:
Lemma 6.
For every we have that .
Proof.
As is convex and closed, this follows directly from the bipolar theorem. ∎
Lemma 7.
Assume that is strictly convex. Then the following hold:
-
(i)
For all there exists such that and . If , then is unique.
-
(ii)
The map is continuous.
Proof.
Fix . The existence of in (i) follows from Lemma 6. Assume towards a contradiction that there exists another with , and . Defining we have . On the other hand, by the Hausdorff property of , we have and thus, by strict convexity of , . Using again Lemma 6, we conclude , a contradiction.
For (ii) we assume towards a contradiction that for some sequence in we have , but . As remarked above, we have , in particular after taking a subsequence. Recalling that and is lower semicontinuous, we conclude that and in particular by Lemma 6 and (i). Finally
which leads to a contradiction. ∎
Lemma 8.
If is strictly convex, then is strictly convex as well.
Proof.
Fix . We first note that
is uniquely defined. Indeed, this follows from applying the exact same arguments as in the proof of Lemma 7, adjusting for . Take now such that and . Set and note that . Then . This shows the claim. ∎
Let denote the state space which is a closed convex subset of . Fix and take so that . For probability measures and on , we define their -Wasserstein distance as
where is the set of all probability measures with first marginal and second marginal . In the proofs we sometimes also use the -Wasserstein distance with respect to the Euclidean norm given by
Recall that for some constant , which in turn implies that . A Wasserstein ball of size around is denoted
From now on, we fix such that and . Let denote the action (decision) space which is a convex and closed subset of . We consider robust stochastic optimization problem [2]:
In accordance with our conventions, we write for an optimizer: and for the set of such optimizers. We also let denote the set of measures such that and sometimes write for if is fixed.
Appendix B Discussion, extensions and proofs related to Theorem 2
We complement now the discussion of Theorem 2. We start with some remarks, extensions and further examples before proceeding with the proofs, including a complete proof of Theorem 2 for general seminorms .
B.1. Discussion and extensions of Theorem 2
Remark 9.
Remark 10.
Let . In addition to Assumption 1, suppose that is twice continuously differentiable and that for every there is such that for all and all with . Then, the same arguments as in the proof of Theorem 2 but with a second-order Taylor expansion yield
for small , where denotes the largest eigenvalue of the Hessian taken w.r.t. the norm and is such that .
In particular, this means that if the term in front of is the same order of magnitude as the term in front of , then the first order approximation is quite accurate for small . Note that larger implies smaller and therefore a smaller term in front of the term.
Remark 11.
We believe that Assumption 1 lists natural sufficient conditions for differentiability of in zero. In particular all these conditions are used in the proof of Theorem 2. Relaxing Assumption 1 seems to require a careful analysis of the interplay between (the space explored by balls around) and the functions . We state here a straightforward extension to the case where is only weakly differentiable and leave more fundamental extensions (e.g., to manifolds) for future research.
Specifically, in case that the baseline distribution is absolutely continuous w.r.t. the Lebesgue measure and , Theorem 2 remains true if we merely assume that has a weak derivative (in the Sobolev sense) on for all and replace by the weak derivative of in Assumption 1. More concretely the first point of Assumption 1 should read:
-
•
The weak derivative of is continuous at every point , where is a Lebesgue-null set, and for every there is such that for all and .
Proof of Remark 11.
For notational simplicity we only consider the case . Note that by, e.g., Brezis, (2010)[Theorem 8.2] we can assume that is continuous and satisfies
for all and all . Furthermore
(18) |
where means that is absolutely continuous w.r.t. the Lebesgue measure. Indeed, let us take and set , where denotes the multivariate normal distribution with covariance , and denotes the convolution operator. For every , by convexity of and the triangle inequality for , we have
By assumption and one can check that as . Hence, for every there exists small such that . As further , this shows (18). The proof of the remark now follows by the exact same arguments as in the proof of Theorem 2. ∎
A natural example, which highlights the importance of Remark 11 is the following:
Example 12.
We let be a model for a vector of returns and assume that is absolutely continuous with respect to Lebesgue measure. Let further and let denote a portfolio. We then consider the average value at risk at level of the portfolio wealth , which can be written as
where is the value at risk at level defined as
We note that the average value at risk is an example for an optimized certainty equivalent (OCE), when choosing in (Bartl et al., 2021b, , p. 3). We can thus rewrite the optimization problem
as
Set and assume that there exists a unique minimiser of . Then is given by . The robust version of reads
Note that the function is weakly differentiable with weak derivative . In conclusion has weak derivative
which is continuous at except on the lower-dimensional set , which is in particular a Lebesgue null set. Remark 11 thus yields
and thus
Comparing with (Bartl et al.,, 2020, Table 1), we see that this approximation is actually exact for .
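As a numerical companion to this example, the sketch below is ours and rests on assumptions: the portfolio, the return distribution and the constants are placeholders, the portfolio $a$ is held fixed rather than optimized, and we use the assumed unconstrained first-order formula of Theorem 2 with $p=q=2$, which in this case reads $\Upsilon = \|a\|_2\,\alpha^{-1/2}$ for the average value at risk at level $\alpha$. The Monte Carlo check pushes the tail atoms in the direction of larger losses within a $W_2$-budget $\delta$.

```python
# Sketch (ours): Monte Carlo check of an assumed first-order expansion for the
# robust average value at risk of a fixed portfolio a, p = q = 2.
import numpy as np

rng = np.random.default_rng(6)
alpha, n = 0.05, 200_000
a = np.array([0.6, 0.4])                              # fixed portfolio weights (placeholder)
X = rng.multivariate_normal([0.01, 0.02], [[0.04, 0.01], [0.01, 0.09]], size=n)
loss = -X @ a                                         # portfolio loss -a.x

def avar(losses):
    var = np.quantile(losses, 1 - alpha)              # value at risk at level alpha
    return var + np.mean(np.maximum(losses - var, 0.0)) / alpha

V0 = avar(loss)
upsilon = np.linalg.norm(a) / np.sqrt(alpha)          # assumed sensitivity ||a|| * alpha^(-1/2)

delta = 0.01
# First-order perturbation: push the tail atoms in the direction -a (worse losses),
# normalised so that the shifted measure stays within W_2-distance delta of mu.
tail = loss > np.quantile(loss, 1 - alpha)
shift = np.zeros_like(X)
shift[tail] = -delta * a / (np.linalg.norm(a) * np.sqrt(np.mean(tail)))
V_shifted = avar(-(X + shift) @ a)
print(f"(V(delta) - V(0)) / delta ~ {(V_shifted - V0) / delta:.3f},  Upsilon = {upsilon:.3f}")
```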
We now mention two extensions of Theorem 2. The first one concerns the derivative of for .
Corollary 13.
Fix and in addition to the assumptions of Theorem 2 assume that
-
•
for small enough and for every sequence such that and such that there is a subsequence which converges to some .
-
•
there exists such that for all and every with one has for all and some constant .
Then
where we recall that is the set of all for which .
Remark 14.
The second extension of Theorem 2 offers a more specific sensitivity result by including additional constraints on the ball of measures considered. Let and let be a family of functions and assume that is calibrated to in the sense that . Consider the set
and the corresponding optimization problem
We have the following result.
Theorem 15 (Sensitivity of under linear constraints).
In addition to the assumptions of Theorem 2, assume that there is some small such that for every one has for all and some constant . Further assume that , , are continuously differentiable with , and that the non-degeneracy condition
(19) |
holds. Then
Remark 16.
Note that if is a norm and has full support, the above non-degeneracy condition (19) can be assumed without loss of generality. Indeed, as the unit circle is compact and the function is continuous, the infimum in (19) is attained. In particular, if
then -a.s. for some in the unit circle. As has full support this implies that on . Thus are linearly dependent functions on . Deleting all linearly dependent coordinates and calling the resulting vector , we have for every . Moreover, the non-degeneracy condition (19) holds for .
Remark 17.
We can relax the conditions of Theorem 15 in the spirit of Remark 11: more specifically, assume that the baseline distribution is absolutely continuous w.r.t. the Lebesgue measure and . Then Theorem 15 remains true if we merely assume that and have a weak derivative (in the Sobolev sense) on for all and replace and by the weak derivative of and of respectively. More concretely the assumption should read:
-
•
The weak derivatives of and of are continuous at every point , where is a Lebesgue-null set, and for every there is such that and for all , and .
Example 18 (Martingale constraints).
Let , , , , and let and , i.e., corresponds to the measures satisfying the martingale (barycentre preservation) constraint . Clearly the assumptions on of Theorem 15 are satisfied. It remains to solve the optimization problem over and plug in the optimizer. We then obtain
i.e., is the standard deviation of under . In line with the previous remark, this result extends to the case of the call option pricing discussed in the main body of the paper.
Example 19 (Covariance constraints).
Let , , , . Further let for some and , i.e., we want to optimize over measures satisfying the covariance constraint . Assume that there exists no such that -a.s. . Clearly the assumptions on of Theorem 15 are satisfied. Note that
so in particular the optimal in the definition of is given by
Plugging this in gives
It follows that
Example 20 (Calibration).
Consider the function , the discrete measure formalises grid points for which option data is available, is the set of maturities and strikes of interest and , for a given compact set , is a class of parametric models (e.g., Heston). A Wasserstein ball around can then be seen as a plausible formalisation of market data uncertainty. Derivatives in and correspond to classical pricing sensitivities, which are readily available for most common parametric models. These only have to be evaluated for one model. Changing the class of parametric models and computing the sensitivity in Theorem 2 could then yield insights into when a calibration procedure can be considered reasonably robust.
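As a minimal illustration of how Theorem 2 sensitivities can be evaluated for one concrete model, the sketch below considers a single call payoff f(ξ) = (ξ − K)⁺ under a Black–Scholes (lognormal) baseline with p = q = 2. It assumes the first-order correction is the L²(μ)-norm of ∂_ξ f, which here reduces to the square root of the probability of finishing in the money, √N(d₂); the parameters and the restriction to a single strike are purely illustrative and are not part of the calibration setup described above (which would require the constrained version of the result).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical illustration (p = q = 2): first-order model-uncertainty sensitivity of a
# call price under a Black-Scholes baseline. Assumption (hedged): the correction equals
# (E_mu[|f'(xi)|^2])^(1/2); for f(xi) = (xi - K)^+ we have f'(xi) = 1_{xi > K}, so the
# sensitivity is sqrt(Q(S_T > K)) = sqrt(N(d2)) under the lognormal baseline.

S0, K, sigma, T, r = 100.0, 105.0, 0.2, 1.0, 0.0   # illustrative parameters

d2 = (np.log(S0 / K) + (r - 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
sensitivity_cf = np.sqrt(norm.cdf(d2))             # closed-form sqrt(N(d2))

rng = np.random.default_rng(1)
ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * rng.normal(size=500_000))
sensitivity_mc = np.sqrt(np.mean((ST > K).astype(float)))  # Monte Carlo counterpart

print(f"sqrt(N(d2))             : {sensitivity_cf:.4f}")
print(f"Monte Carlo sensitivity : {sensitivity_mc:.4f}")
```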
B.2. Proofs and auxiliary results related to Theorem 2
Proof of Theorem 2.
We now present a complete proof of Theorem 2 for general state space and semi-norm . All the essential ideas have already been outlined in Bartl et al., 2021b, but for the convenience of the reader we repeat all of the steps as opposed to only detailing where the general case differs from the one treated in Bartl et al., 2021b.
Step 1: Let us first assume that . For every let denote those which satisfy
Note that the dual norm is lower semicontinuous, which implies that the infimum in the definition of is attained (see (Villani, 2008, Theorem 4.1, p. 43)); in particular one has .
We start by showing the “” inequality in the statement. For any one has with equality for . Therefore, differentiating and using Fubini’s theorem, we obtain that
Now recall that for every , whence for any and , we have that
where we used Hölder’s inequality to obtain the last inequality. By definition of the last integral is smaller than and we end up with
It remains to show that the last term converges to the integral under . To that end, note that any choice converges (in on ) to the pushforward measure of under the mapping , which we denote . This can be seen by, e.g., considering the coupling between and . Now note that, together with the growth restriction on of Assumption 1, implies
(20)
for some and all , . Recall that there furthermore exists such that , in particular for all and small , for another constant . As Assumption 1 further yields continuity of for every , the -Wasserstein convergence of to implies that
for every , see Lemma 21. Dominated convergence (in ) then yields “” in the statement of the theorem.
We now turn to the opposite “” inequality. As for every , there is no loss of generality in assuming that the right hand side is not equal to zero. Now take any, for notational simplicity not relabelled, subsequence of which attains the liminf in and pick . By the second part of Assumption 1, for a (again not relabelled) subsequence, one has . Further note that , which implies
By Lemma 7 there exists a function such that for every . Now define
for with the convention . Note that the integral is well defined since, as before in (20), one has for some and the latter is integrable under . Using that it further follows that
In particular and we can use it to estimate the supremum over from below, giving
For any , with , the inner integral converges to
The last equality follows from the definition of and a simple calculation. To justify the convergence, first note that
for all by continuity of and since . Moreover, as before in (20), one has
for some and all .
The latter is integrable under , hence convergence of the integrals follows from the dominated convergence theorem.
Step 2: We now extend the proof to the case where is closed convex and its boundary has zero measure under .
Note that the proof of the “”-inequality remains unchanged. We modify the proof of the “”-inequality as follows: let us first define
for all , so that in particular . We now redefine
Then and in particular as in Step 1. Noting that
the remaining steps of the proof follow as in Step 1. This concludes the proof. ∎
Lemma 21.
Let , let and assume that is continuous and, for some constant , satisfies for all and all in a neighborhood of . Let be a sequence of probability measures which converges to some w.r.t. and be a sequence which converges to . Then as .
Proof.
Let be a small neighborhood of such that for all and . The measures converge in to the measure . As and similarly for , the claim follows from (Villani, 2008, Lemma 4.3, p. 43). ∎
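For intuition, the following small numerical illustration of Lemma 21 works in dimension one; the function g, the baseline distribution and the sequence a_n below are illustrative choices, not taken from the paper. It checks that integrals of a continuous function with the stated growth, evaluated along empirical measures and a converging parameter sequence, approach the limiting integral.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Illustrative check of Lemma 21 in dimension one: mu_n = empirical measure of n standard
# normal samples, a_n = 1 + 1/n -> a = 1, and g(a, xi) = a * xi**2 (continuous, with
# |g(a, xi)| <= c (1 + |xi|^2) for a near 1). Expected limit: E[g(1, xi)] = 1.

rng = np.random.default_rng(2)
ref = rng.normal(size=400_000)                   # large reference sample standing in for mu

for n in [10**2, 10**3, 10**4, 10**5]:
    sample = rng.normal(size=n)                  # support of the empirical measure mu_n
    a_n = 1.0 + 1.0 / n
    integral_n = np.mean(a_n * sample**2)        # integral of g(a_n, .) with respect to mu_n
    w1 = wasserstein_distance(sample, ref)       # empirical proxy for W_1(mu_n, mu)
    print(f"n={n:>6}  integral={integral_n:.4f}  (target 1.0)   W1 proxy={w1:.4f}")
```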
The following lemma relates to the financial economics applications described in Bartl et al., 2021b. We focus on a sufficient condition for the second part of Assumption 1. For this, we assume that does not contain any redundant assets, i.e., for every . If satisfies this condition, we call it non-degenerate. Note that this condition is slightly stronger than no-arbitrage. However, if satisfies no-arbitrage, then one can always delete the redundant dimensions in , similarly to the remark after Theorem 15, so that the modified measure satisfies for every .
Lemma 22.
Assume that is convex, increasing, bounded from below and satisfies the first part of Assumption 1. Furthermore assume that is non-degenerate in the above sense. Then for every there exists an optimizer for , i.e.,
Furthermore, if is strictly convex, the optimizer of is unique and as . In particular, Assumption 1 is satisfied.
Proof.
The first statement is trivially true if is constant, so assume otherwise in the following. Moreover, note that by the first part of Assumption 1 we have for all . Now fix , and let be a minimizing sequence, i.e.,
If is bounded, then after passing to a subsequence there is a limit, and Fatou’s lemma shows that this limit is a minimizer. It remains to argue why is bounded. Heading for a contradiction, assume that as . After passing to a (not relabeled) subsequence, there is with such that as . By our assumption we have . As is bounded below this shows that
as , a contradiction.
To prove the second claim, note that strict convexity of readily implies that admits a unique minimizer . Now, heading for a contradiction, assume that there exists a subsequence converging to zero such that does not converge to . The exact same reasoning as above shows that is bounded, hence (possibly after passing to a not relabeled subsequence) there is a limit . Using Fatou’s lemma once more implies
On the other hand, plugging into implies
which follows from as and that any converges in to by definition. This gives the desired contradiction. ∎
In analogy to the above result, the following summarizes simple sufficient conditions for the second part of Assumption 1.
Lemma 23.
Assume that either is compact or that is coercive, in the sense that if . Moreover, assume that is continuous, such that for some . Then the second part of Assumption 1 is satisfied.
Proof.
Let us first note that for fixed the function is lower semicontinuous as a supremum of continuous functions for . Next we note that . Indeed, if is compact, this directly follows from lower semicontinuity of . Otherwise, the fact that for all and coercivity imply that any minimising sequence is bounded. Lastly, we show that any accumulation point of such a sequence is an element of . By the above we can assume (by taking a subsequence without relabelling if necessary) that . If , then
for any . This contradicts for all and concludes the proof. ∎
Proof of Corollary 13.
We start with the “”-inequality. First, note that for any , , and , we have
This implies that
(21)
Note that the assumption implies ( for some new constant ). To simplify notation let us thus define and recall that is compact w.r.t. by Lemma 24, hence there is such that (after passing to a subsequence) w.r.t. as . The same arguments as in the proof of Theorem 2 show that (21) (divided by ) converges to when . So, to conclude the “”-part, all that is left to do is show that , which follows as
We now turn to the proof of the “”-inequality. To that end, let be a sequence of optimizers, i.e. for all . Then by assumption there exists such that (after passing to a subsequence) . Let be arbitrary. As (by the triangle inequality) we have
As further (trivially) we conclude
as , where the last equality follows from the exact same arguments as presented in the proof of Theorem 2. As was arbitrary, the claim follows. ∎
Proof of Theorem 15.
We start by showing the easier estimate
(22)
To that end, let and be arbitrary. Then . Moreover, as , it further follows that . Therefore (22) is a consequence of Theorem 2 (applied to the function ).
To show the other direction, i.e. that
(23)
pick a (not relabeled) subsequence of which converges to the liminf. For , there is another (again not relabeled) subsequence which converges to some . From now on we stick to this subsequence. In a first step, notice that
(24)
Indeed, this follows from a minimax theorem (see (Terkelsen, 1973, Cor. 2, p. 411)) and appropriate compactness of as stated in Lemma 24. For notational simplicity let be an optimizer for (24). Then
(25)
where we used that . Now, in case that is uniformly bounded for all small , after passing to a subsequence, it converges to some . Then it follows from the exact same arguments as used in the proof of Theorem 2 that
which shows (23). It remains to argue why is bounded for small . By (25) and the estimate “” we have
The second term converges to (see the proof of Theorem 2); in particular it is bounded for all small . On the other hand, by (19) and the continuity as well as growth of , the first term is larger than for some . By (22) this implies that must be bounded for small . ∎
We have used the following lemma:
Lemma 24.
Let be such that and let be a probability measure on . Then the -Wasserstein ball is compact w.r.t. .
Proof.
We recall that is lower semicontinuous and that there exists such that for all . As by assumption, an application of Prokhorov’s theorem shows that is weakly precompact (recall the convention that continuity is defined for ). Hence, for every sequence of measures in there exists a subsequence, which we also call , and a measure such that converges weakly to . As is weakly lower semicontinuous (see (Villani, 2008, Lemma 4.3, p. 43)), this implies . Applying the same argument to the tight sequence defined via
we conclude that there exists another subsequence of which also converges in . This concludes the proof. ∎
Appendix C Discussion, proofs and auxiliary results related to Theorem 5
C.1. Further discussion of Theorem 5
We note that a natural way to compute the sensitivity of would be to combine Theorem 2 with the chain rule and differentiation of the function . This cannot, however, be rigorously justified, as the following remark demonstrates.
Remark 25.
Let us point out that it is not true that is differentiable for under the sole assumption that is sufficiently smooth and .
To give an example, let , , and take and . A quick computation shows (independently of ). In particular and for all and is clearly not differentiable in .
Instead, we use a more involved argument, combining differentiability of with a Lagrangian approach. This however requires slightly stricter growth assumptions than the ones imposed in Assumption 1, which are specified in Assumption 4.
Example 26.
We provide detailed computations behind the square-root LASSO/Ridge regression example discussed in Bartl et al., 2021b. We consider , . We fix norms , , for some , and . We recall that then (5) holds and we can apply our methodology for . In general we have
and
Recalling the convention that is given by
we conclude
where is the identity matrix. Recall furthermore that for all and in particular . Set now
Then and for . As does not depend on the last coordinate, we also write simply for . As we have in particular
In conclusion
Let us now specialise to the typical statistical context and let equal the empirical measure of data samples, i.e., for some points and . Let us write and . Then in particular
and we recover the notation common in statistics. In particular, . If we now assume that (and hence ), then we can easily compute
Note that, under the assumption that , is defined as
Thus in the case we have
Furthermore, in the case we have
(26)
We also have
so (26) simplifies to
Remark 27.
While is not strictly convex, the above example can still be adapted to cover this case under the additional assumption that has no entries which are equal to zero. Indeed, we note that is continuous (even constant) at every point except if a component of is equal to zero. Thus the proof of Lemma 30 still applies if we assume that has -a.s. no components which are equal to zero instead of merely assuming that -a.s.
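Returning to Example 26, the following sketch illustrates the shrinkage effect numerically. It is only a sketch: it uses the square-root Ridge objective √(empirical MSE) + δ‖θ‖₂ as a stand-in for the robust regression problem, so both the choice of the ℓ₂ penalty and the identification of the robust problem with this penalized form are assumptions for illustration rather than the formula in (26), and the data-generating process is synthetic.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch (hedged): square-root Ridge as a stand-in for the robust regression problem,
#   theta_delta = argmin_theta  sqrt(mean((y - X theta)^2)) + delta * ||theta||_2 .
# The penalized form and the l2 penalty are illustrative assumptions; the paper's example
# fixes (semi)norms whose exact form is not reproduced here. We compare theta_delta with
# the OLS solution and report the finite-difference slope (theta_delta - theta_ols)/delta.

rng = np.random.default_rng(3)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.5 * rng.normal(size=n)

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

def objective(theta, delta):
    return np.sqrt(np.mean((y - X @ theta) ** 2)) + delta * np.linalg.norm(theta)

delta = 1e-2
theta_delta = minimize(objective, theta_ols, args=(delta,), method="BFGS").x

print("OLS coefficients        :", np.round(theta_ols, 4))
print("penalized coefficients  :", np.round(theta_delta, 4))
print("finite-difference slope :", np.round((theta_delta - theta_ols) / delta, 4))
```

In typical runs the finite-difference slope points roughly opposite to the OLS coefficients, which is the shrinkage behaviour the example describes.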
Example 28.
We provide further details and discussion to complement the out-of-sample error example in Bartl et al., 2021b. First, we recall the remainder term obtained therein:
Recall that in holds a.s. We suppose that Assumptions 1 and 4 hold, and that for any , there exists such that the following hold uniformly for all :
(27)
Recall from (9) that we already know that a.s. Under the above integrability assumption, Lemma 21 gives
with analogous convergence for the other two terms in . We conclude that a.s. and that (12) and (13) hold.
We now show how the arguments above can be adapted to extend and complement (Anderson and Philpott, 2019, Prop. 17). Therein, the authors study the VRS, which is the expectation over realisations of of
If VRS then, on average, the robust problem offers an improved performance, i.e., it finds a better approximation to the true optimizer than the classical non-robust problem. If we work with the difference above, then we look at a first-order Taylor expansion and obtain
which holds under the first condition in (27). This can be compared with (Anderson and Philpott, 2019, Lemma 1), which was derived under a Lipschitz continuity assumption on . For the quadratic case of (Anderson and Philpott, 2019, Prop. 17) we have , where we took for notational simplicity. We then have and . Specialising (10) to this setting, with , gives
While our results work for , see Remark 9, we can formally let . The last term then converges to , which recovers (Anderson and Philpott, 2019, Prop. 17), taking into account that
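A rough numerical counterpart of this discussion can be obtained by Monte Carlo. The sketch below is a heuristic, not the construction of Anderson and Philpott, (2019) or of this paper: it uses a newsvendor-type loss, replaces the exact robust problem by the first-order surrogate objective E_{μ_N}[f(a, ·)] + δ (E_{μ_N}[|∂_ξ f(a, ·)|²])^{1/2}, and estimates a VRS-type quantity by resampling datasets from a known distribution; all distributions, costs and parameters are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch (hedged): Monte Carlo estimate of a VRS-type quantity for the newsvendor loss
#   f(a, xi) = h * max(a - xi, 0) + b * max(xi - a, 0),
# comparing the empirical minimizer a_N with a robustified a_N^delta obtained from the
# first-order surrogate  E_muN[f(a, .)] + delta * (E_muN[|f_xi(a, .)|^2])^(1/2).
# The surrogate, the loss and the distributions are illustrative, not taken from the paper.

rng = np.random.default_rng(4)
h, b = 1.0, 3.0                      # holding / backorder costs (illustrative)
N, delta, reps = 30, 0.1, 500
true_sample = rng.exponential(scale=1.0, size=200_000)   # proxy for the true measure

def emp_loss(a, xi):
    return np.mean(h * np.maximum(a - xi, 0.0) + b * np.maximum(xi - a, 0.0))

def surrogate(a, xi, d):
    grad_sq = np.mean(np.where(xi < a, h**2, b**2))      # E[|weak derivative in xi|^2]
    return emp_loss(a, xi) + d * np.sqrt(grad_sq)

gaps = []
for _ in range(reps):
    xi = rng.exponential(scale=1.0, size=N)              # one observed dataset
    a_plain = minimize_scalar(lambda a: emp_loss(a, xi), bounds=(0, 10), method="bounded").x
    a_robust = minimize_scalar(lambda a: surrogate(a, xi, delta), bounds=(0, 10), method="bounded").x
    gaps.append(emp_loss(a_plain, true_sample) - emp_loss(a_robust, true_sample))

print(f"estimated VRS-type gap: {np.mean(gaps):+.4f} (positive favors the robustified choice)")
```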
C.2. Proofs and auxiliary results related to Theorem 5
Lemma 29.
Let be differentiable such that is continuous, fix , and assume that for some we have that and for some , all and all close to . Further fix and recall that is the set of maximizing measures given the strategy . Then the (one-sided) directional derivative of at in the direction is given by
Proof.
Fix . We start by showing that
(28)
To that end, let and be arbitrary. By definition of one has . Moreover implies that . Note that the assumption implies
Therefore, by dominated convergence, one has
and as was arbitrary, this shows (28).
We proceed to show that
(29)
For every sufficiently small let be such that . The existence of such is guaranteed by Lemma 24, which also guarantees that (possibly after passing to a subsequence) there is such that in . We claim that . By Lemma 21 one has
On the other hand, for any choice one has
This implies and in particular . At this point expand
so that
where we used for the last inequality. Recall that converges to in and by assumption for all close to . In particular
for sufficiently small. As furthermore is continuous, we conclude by Lemma 21 that
Lastly, by Fubini’s theorem and dominated convergence (in )
as , which ultimately shows (29). ∎
Lemma 30.
Let and let be measurable such that and such that -a.s. Then we have that
(30)
where was defined in Lemma 7.
Proof.
First recall that is continuous and satisfies for every . Now define
Similarly, define by replacing in the definition of by . As in the proof of Theorem 2 we compute
This remains true when and are replaced by and , respectively. Moreover, Hölder’s inequality implies that
The first of these two inequalities immediately implies that the left hand side in (30) is larger than the right hand side.
To show the other inequality, note that is continuous and satisfies for , hence for all such that . Consequently one quickly computes for all such that . By dominated convergence we conclude that
and the claim follows. ∎
Let us lastly give the proof of Theorem 5 for general seminorms.
Proof of Theorem 5.
Recall the convention that and , as well as . Further recall that and converge to as . In order to show
we first show that for every
(31)
where we recall that is the -th coordinate of the vector . We start with the “”-inequality in (31). For any , the fundamental theorem of calculus implies that
Moreover, by Lemma 29 the function is (one-sided) directionally differentiable at for all small and thus for all
(32)
where we recall is the set of all for which . We now encode the optimality of in via a Lagrange multiplier to obtain
(33)
In a similar manner, we trivially have
(34)
for any , as . Applying (32) and then (33), (34) we thus conclude for
(35)
As in the proof of Lemma 29 we note that is compact in and both terms inside the grow at most as by Assumption 4. Thus, using (Terkelsen, 1973, Cor. 2, p. 411), we can interchange the infimum and supremum in the last line above. Recall that
whence (35) is equal to
For every fixed we can follow the arguments in the proof of Theorem 2 to see that, when divided by , the term inside the infimum converges to
(36)
as . Note that following these arguments requires the following properties, which are a direct consequence of Assumptions 1 and Assumption 4:
• is differentiable on ,
• is differentiable on for every ,
• is continuous,
• is continuous,
• for every there is such that for all and with ,
• for every there is such that for all and with ,
• for all sufficiently small we have and for every sequence such that and such that for all there is a subsequence which converges to some .
Suppose first that -a.s. Then the right hand side of (31) is equal to zero. Moreover, taking in (36), we also have that , which proves that indeed the left hand side in (31) is smaller than the right hand side.
Now suppose that -a.s. Then, using the inequality “” and Lemma 30 to compute the last term (noting that by assumption), we conclude that indeed the left hand side in (31) is smaller than the right hand side.
The reverse “”-inequality in (31) follows by the very same arguments. Indeed, Lemma 29 implies that
for all and we can write
From here on we argue as in the “”-inequality to conclude that (31) holds.
By assumption the matrix is invertible. Therefore, in a small neighborhood of , the mapping is invertible. In particular
where the second equality holds by the first order condition for optimality of . Applying the chain rule and using (31) gives
This completes the proof. ∎