Sensitivity analysis of Wasserstein distributionally robust optimization problems
Abstract.
We consider sensitivity of a generic stochastic optimization problem to model uncertainty. We take a non-parametric approach and capture model uncertainty using Wasserstein balls around the postulated model. We provide explicit formulae for the first-order correction to both the value function and the optimizer and further extend our results to optimization under linear constraints. We present applications to statistics, machine learning, mathematical finance and uncertainty quantification. In particular, we provide an explicit first-order approximation for square-root LASSO regression coefficients and deduce coefficient shrinkage compared to the ordinary least squares regression. We consider robustness of call option pricing and deduce a new Black-Scholes sensitivity, a non-parametric version of the so-called Vega. We also compute sensitivities of optimized certainty equivalents in finance and propose measures to quantify robustness of neural networks to adversarial examples.
Key words and phrases:
Robust stochastic optimization, Sensitivity analysis, Uncertainty quantification, Non-parametric uncertainty, Wasserstein metric
1. Introduction
We consider a generic stochastic optimization problem
(1)   $\inf_{a \in \mathcal{A}} \int_{\mathcal{S}} f(x,a)\, \mu(\mathrm{d}x),$
where $\mathcal{A}$ is the set of actions or choices, $f$ is the loss function and $\mu$ is a probability measure over the space of states $\mathcal{S}$. Such problems are found across the whole of applied mathematics. The measure $\mu$ is the crucial input and it could represent, e.g., a dynamic model of the system, as is often the case in mathematical finance or mathematical biology, or the empirical measure of observed data points, or the training set, as is the case in statistics and machine learning applications. In virtually all cases, there is a certain degree of uncertainty around the choice of $\mu$ coming from modelling choices and simplifications, incomplete information, data errors, finite sample error, etc. It is thus very important to understand the influence of changes in $\mu$ on (1), both on its value and on its optimizer. Often, the choice of $\mu$ is done in two stages: first a parametric family of models is adopted and then the values of the parameters are fixed. Sensitivity analysis of (1) with changing parameters is a classical topic explored in parametric programming and statistical inference, e.g., Armacost and Fiacco, (1974); Vogel, (2007); Bonnans and Shapiro, (2013). It also underscores a lot of progress in the field of uncertainty quantification, see Ghanem et al., (2017). Considering $\mu$ as an abstract parameter, the mathematical programming literature looked into qualitative and quantitative stability of (1). We refer to Dupacova, (1990); Romisch, (2003) and the references therein. When $\mu$ represents data samples, there has been a considerable interest in the optimization community in designing algorithms which are robust and, in particular, do not require excessive hyperparameter tuning, see Asi and Duchi, (2019) and the references therein.
A more systematic approach to model uncertainty in (1) is offered by the distributionally robust optimization problem
(2)   $V(\delta) := \inf_{a \in \mathcal{A}} \sup_{\nu \in B_\delta(\mu)} \int_{\mathcal{S}} f(x,a)\, \nu(\mathrm{d}x),$
where $B_\delta(\mu)$ is a ball of radius $\delta$ around $\mu$ in the space of probability measures, as specified below. Such problems greatly generalize more classical robust optimization and have been studied extensively in operations research and machine learning in particular; we refer the reader to Rahimian and Mehrotra, (2019) and the references therein. Our goal in this paper is to understand the behaviour of these problems for small $\delta$. Our main results compute the first-order behaviour of $V(\delta)$ and its optimizer for small $\delta$. This offers a measure of sensitivity to errors in model choice and/or specification, as well as pointing to the abstract direction, in the space of models, in which the change is most pronounced. We use examples to show that our results can be applied across a wide spectrum of science.
This paper is organised as follows. We first present the main results and then, in section 3, explore their applications. Further discussion of our results and the related literature is found in section 4, which is then followed by the proofs. Online appendix Bartl et al., 2021a contains many supplementary results and remarks, as well as some more technical arguments from the proofs.
2. Main results
Take , endow with the Euclidean norm and write for the interior of a set . Assume that is a closed convex subset of . Let denote the set of all (Borel) probability measures on . Further fix a seminorm on and denote by its (extended) dual norm, i.e., . In particular, for we also have . For , we define the -Wasserstein distance as
where is the set of all probability measures on with first marginal and second marginal . Denote the Wasserstein ball
of size around . Note that, taking a suitable probability space and a random variable , we have the following probabilistic representation of :
where the supremum is taken over all satisfying almost surely and .
Wasserstein distances and optimal transport techniques have proved to be powerful and versatile tools in a multitude of applications, from economics Chiappori et al., (2010); Carlier and Ekeland, (2010) to image recognition Peyré and Cuturi, (2019). The idea to use Wasserstein balls to represent model uncertainty was pioneered in
Pflug and Wozabal, (2007) in the context of investment problems.
When sampling from a measure with a finite moment, the empirical measures converge to the true distribution and Wasserstein balls around the empirical measures have the interpretation of confidence sets, see Fournier and Guillin, (2014). In this setup, the radius $\delta$ can then be chosen as a function of a given confidence level and the sample size. This yields finite-sample guarantees and asymptotic consistency, see Mohajerin Esfahani and Kuhn, (2018); Obłój and Wiesel, (2021), and justifies the use of the Wasserstein metric to capture model uncertainty. The value in (2) has a dual representation, see Gao and Kleywegt, (2016); Blanchet and Murthy, (2019), which has led to significant new developments in distributionally robust optimization, e.g., Mohajerin Esfahani and Kuhn, (2018); Blanchet et al., 2019a ; Kuhn et al., (2019); Shafieezadeh-Abadeh et al., (2019).
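As a small numerical aside (ours, not part of the original text), the convergence of empirical measures in Wasserstein distance can be illustrated directly; the sketch below uses scipy.stats.wasserstein_distance, which computes the one-dimensional 1-Wasserstein distance, with a large reference sample as a stand-in for the true distribution.

```python
# Minimal illustration (ours): decay of the 1-Wasserstein distance between
# an empirical measure and the data-generating law.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
reference = rng.normal(size=100_000)  # proxy for the true distribution mu

for n in [10, 100, 1_000, 10_000]:
    dists = [wasserstein_distance(rng.normal(size=n), reference) for _ in range(20)]
    print(f"n = {n:6d}   average W_1(empirical, reference) ~ {np.mean(dists):.4f}")
```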
Naturally, other choices for the distance on the space of measures are also possible: such as the Kullback-Leibler divergence, see Lam, (2016) for general sensitivity results and Calafiore, (2007) for applications in portfolio optimization, or the Hellinger distance, see Lindsay, (1994) for a statistical robustness analysis. We refer to section 4
for a more detailed analysis of the state of the art in these fields. Both of these approaches have good analytic properties and often lead to theoretically appealing closed-form solutions. However, they are also very restrictive since any measure in the neighborhood of has to be absolutely continuous with respect to . In particular, if is the empirical measure of observations then measures in its neighborhood have to be supported on those fixed points. To obtain meaningful results it is thus necessary to impose additional structural assumptions, which are often hard to justify solely on the basis of the data at hand and, equally importantly, create another layer of model uncertainty themselves. We refer to (Gao and Kleywegt,, 2016, sec. 1.1) for further discussion of potential issues with -divergences. The Wasserstein distance, while harder to handle analytically, is more versatile and does not require any such additional assumptions.
Throughout the paper we take the convention that continuity and closure are understood w.r.t. . We assume that is convex and closed and that the seminorm is strictly convex in the sense that for two elements with and , we have (note that this is satisfied for every -norm for ). We fix , let so that , and fix such that the boundary of has –zero measure and . Denote by the set of optimizers for in (2).
Assumption 1.
The loss function satisfies
-
•
is differentiable on for every . Moreover is continuous and for every there is such that for all and with .
-
•
For all sufficiently small we have and for every sequence such that and such that for all there is a subsequence which converges to some .
The above assumption is not restrictive: the first part merely ensures existence of , while the second part is satisfied as soon as either is compact or is coercive, which is the case in most examples of interest, see (Bartl et al., 2021a, , Lemma 23) for further comments.
Theorem 2.
If Assumption 1 holds then is given by
Remark 3.
Inspecting the proof, defining
we obtain . This means that for small there is no first-order gain from optimizing over all in the definition of when compared with restricting simply to , as in .
The above result naturally extends to computing sensitivities of robust problems, i.e., , see (Bartl et al., 2021a, Corollary 13), as well as to the case of stochastic optimization under linear constraints, see (Bartl et al., 2021a, Theorem 15). We recall that .
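To illustrate Theorem 2 numerically, the following sketch is our own and rests on assumptions: since the displayed statement above is not reproduced, we take the expansion to read $V(\delta) \approx V(0) + \delta\,\Upsilon$ with $\Upsilon = \big(\int \|\nabla_x f(x,a^\star)\|_*^q\,\mu(\mathrm{d}x)\big)^{1/q}$, $1/p+1/q=1$, for an optimizer $a^\star$ of the non-robust problem, and we use a deterministic shift of the atoms (cf. Remark 3 and the discussion in section 4) as the perturbation. The toy case is $f(x,a)=(x-a)^2$, $p=q=2$, with an empirical baseline measure.

```python
# Toy check (ours) of the assumed first-order expansion V(delta) ~ V(0) + delta * Upsilon
# for f(x, a) = (x - a)^2 on the real line, p = q = 2, mu an empirical measure.
import numpy as np

rng = np.random.default_rng(1)
x = 1.3 * rng.standard_normal(500) + 0.7        # support points of mu
a_star = x.mean()                               # optimizer of the non-robust problem
V0 = np.var(x)                                  # V(0) = inf_a E_mu[(X - a)^2]

grad = 2.0 * (x - a_star)                       # d f / d x evaluated at a_star
upsilon = np.sqrt(np.mean(grad ** 2))           # assumed formula for the sensitivity

for delta in [0.1, 0.01, 0.001]:
    # Shift every atom along the normalised gradient; this produces a measure
    # within W_2-distance delta of mu, so it gives a lower bound on V(delta).
    x_shifted = x + delta * grad / np.sqrt(np.mean(grad ** 2))
    V_delta_lb = np.var(x_shifted)
    print(f"delta = {delta:5.3f}:  (V - V0)/delta >= {(V_delta_lb - V0) / delta:.4f},  Upsilon = {upsilon:.4f}")
```

In this toy example the printed ratios approach the assumed $\Upsilon$ (here twice the standard deviation of the atoms) as $\delta$ decreases.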
Assumption 4.
Suppose the is twice continuously differentiable, and
-
•
for some , , all and all close to .
-
•
The function is twice continuously differentiable in the neighbourhood of and the matrix is invertible.
Theorem 5.
Suppose and such that as and Assumptions 1 and 4 are satisfied. Then, if or -a.e.,
where is the unique function satisfying , see (Bartl et al., 2021a, , Lemma 7). In particular, if .
Above and throughout the convention is that , , and . The assumed existence and convergence of optimizers holds, e.g., with suitable convexity of in , see (Bartl et al., 2021a, , Lemma 22) for a worked out setting. In line with the financial economics practice, we gave our sensitivities letter symbols, and , loosely motivated by , the Greek for Model, and , the Hebrew for control.
3. Applications
We now illustrate the universality of Theorems 2 and 5 by considering their applications in a number of different fields. Unless otherwise stated, , and means .
3.1. Financial Economics
We start with the simple example of risk-neutral pricing of a call option written on an underlying asset . Here, are the maturity and the strike respectively, and is the distribution of . We set interest rates and dividends to zero for simplicity. In the Black and Scholes, (1973) model, is a log-normal distribution, i.e., is Gaussian with mean and variance . In this case, is given by the celebrated Black-Scholes formula. Note that this example is particularly simple since is independent of . However, to ensure risk-neutral pricing, we have to impose a linear constraint on the measures in , giving
(3) |
To compute its sensitivity we encode the constraint using a Lagrangian and apply Theorem 2, see (Bartl et al., 2021a, , Rem. 11, Thm. 15). For , letting and , the resulting formula, see (Bartl et al., 2021a, , Example 18), is given by
Let us specialise to the log-normal distribution of the Black-Scholes model above and denote the quantity in (3) as . It may be computed exactly using methods from Bartl et al., (2020) and Figure 1 compares the exact value and the first order approximation.
[Figure 1. The exact value of the robust call price compared with its first-order approximation.]
We have , where and is the cdf of the distribution. It is also insightful to compare with a parametric sensitivity. If, instead of Wasserstein balls, we consider perturbations within the log-normal family, the resulting sensitivity is known as the Black-Scholes Vega and is given by . We plot the two sensitivities in Figure 2. It is remarkable how, for the range of strikes of interest, the non-parametric model sensitivity traces out the usual shape of but shifted upwards to account for the idiosyncratic risk of departure from the log-normal family. More generally, given a book of options with payoff at time , with , we could say that the book is -neutral if the sensitivity was the same for and for . In analogy to the standard Delta-Vega hedging, one could develop a non-parametric, model-agnostic Delta-Upsilon hedging. We believe these ideas offer potential for exciting industrial applications and we leave them to further research.
[Figure 2. The non-parametric sensitivity compared with the Black-Scholes Vega across strikes.]
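As a rough numerical companion to the discussion above (ours, not the paper's code; the parameter values are placeholders), the sketch below computes the classical Black-Scholes Vega, $S_0\varphi(d_1)\sqrt{T}$, alongside the sensitivity obtained by applying the assumed unconstrained Theorem 2 formula with $p=q=2$ to $f(x)=(x-K)^+$ under the log-normal baseline, which gives $\sqrt{\Phi(d_2)}$. The sensitivity discussed above additionally imposes the martingale constraint from (3) and therefore differs from this simplified unconstrained quantity.

```python
# Illustrative comparison (ours): Black-Scholes Vega vs. an unconstrained
# Wasserstein sensitivity of the call price under the assumed Theorem 2 formula.
import numpy as np
from scipy.stats import norm

S0, T, sigma = 1.0, 1.0, 0.2          # spot, maturity, volatility (assumed values)

def bs_vega(K):
    d1 = (np.log(S0 / K) + 0.5 * sigma**2 * T) / (sigma * np.sqrt(T))
    return S0 * norm.pdf(d1) * np.sqrt(T)

def wasserstein_sensitivity(K):
    # Unconstrained p = q = 2 case: Upsilon = sqrt(P(S_T > K)) for f(x) = (x - K)^+.
    d2 = (np.log(S0 / K) - 0.5 * sigma**2 * T) / (sigma * np.sqrt(T))
    return np.sqrt(norm.cdf(d2))

for K in [0.8, 0.9, 1.0, 1.1, 1.2]:
    print(f"K = {K:.1f}:  Vega = {bs_vega(K):.4f}   Upsilon (unconstrained) = {wasserstein_sensitivity(K):.4f}")
```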
We turn now to the classical notion of optimized Certainty Equivalent (OCE) of Ben Tal and Teboulle, (1986). It is a decision theoretic criterion designed to split a liability between today and tomorrow’s payments. It is also a convex risk measure in the sense of Artzner et al., (1999) and covers many of the popular risk measures such as expected shortfall or entropic risk, see Ben Tal and Teboulle, (2007). We fix a convex monotone function which is bounded from below and . Here represent the payoff of a financial position and is the negative of a utility function, or a loss function. We take and refer to (Bartl et al., 2021a, , Lemma 22) for generic sufficient conditions for Assumptions 1 and 4 to hold in this setup. The OCE corresponds to in (1) for and , . Theorems 2 and 5 yield the sensitivities
where, for simplicity, we took for the latter.
A related problem considers hedging strategies which minimise the expected loss of the hedged position, i.e., , where and represent today and tomorrow’s traded prices. We compute as
Further, we can combine loss minimization with OCE and consider , . This gives as the infimum over of
The above formulae capture non-parametric sensitivity to model uncertainty for examples of key risk measurements in financial economics. To the best of our knowledge this has not been achieved before.
Finally, we consider briefly the classical mean-variance optimization of Markowitz, (1952). Here represents the loss distribution across the assets and , are the relative investment weights. The original problem is to minimise the sum of the expectation and the standard deviation of the returns , with . Using the ideas in (Pflug et al., 2012, Example 2) and considering measures on , we can recast the problem as (1). Whilst Pflug et al., (2012) focused on the asymptotic regime , their non-asymptotic statements are related to our Theorem 2 and either result could be used here to obtain that .
3.2. Neural Networks
We specialise now to quantifying robustness of neural networks (NN) to adversarial examples. This has been an important topic in machine learning since Szegedy et al., (2013) observed that NNs consistently misclassify inputs formed by applying small worst-case perturbations to a data set. This has produced a number of works offering either explanations of these effects or algorithms for creating such adversarial examples, e.g. Goodfellow et al., (2014); Li et al., (2019); Carlini and Wagner, (2017); Wong and Kolter, (2017); Weng et al., (2018); Araujo et al., (2019); Mangal et al., (2019), to name just a few. The main focus of research in this area, see Bastani et al., (2016), has been on faster algorithms for finding adversarial examples, typically leading to an overfit to these examples without any significant generalisation properties. The viewpoint has been mainly pointwise, e.g., Szegedy et al., (2013), with some generalisations to probabilistic robustness, e.g., Mangal et al., (2019).
In contrast, we propose a simple metric for measuring robustness of neural networks which is independent of the architecture employed and the algorithms for identifying adversarial examples. In fact, Theorem 2 offers a simple and intuitive way to formalise robustness of neural networks: for simplicity consider a -layer neural network trained on a given distribution of pairs , i.e. solve
where the is taken over , for a given activation function , where the composition above is understood componentwise. Set . Data perturbations are captured by and (2) offers a robust training procedure. The first order quantification of the NN sensitivity to adversarial data is then given by
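As a minimal sketch of how such a robustness measure can be estimated in practice (our own illustration; the network, the data and the precise form of the displayed quantity are all assumptions), take the sensitivity to be the $L^2(\mu)$-norm of the gradient of the training loss with respect to the inputs, in line with Theorem 2 for $p=q=2$, and estimate it over the training set for a one-hidden-layer ReLU network with squared loss.

```python
# Sketch (ours): estimate a robustness measure for a small ReLU network as the
# L^2(mu)-norm of the input-gradient of the loss over the training set.
# All weights and data below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(2)
d, h, n = 5, 16, 200
W1, b1 = rng.standard_normal((h, d)) / np.sqrt(d), np.zeros(h)
W2, b2 = rng.standard_normal((1, h)) / np.sqrt(h), np.zeros(1)
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

def loss_input_gradient(x, target):
    z = W1 @ x + b1                      # pre-activations
    hidden = np.maximum(z, 0.0)          # hidden layer (ReLU)
    out = (W2 @ hidden + b2)[0]          # network output g(x)
    # d loss / d x for loss = (g(x) - y)^2, using the ReLU subgradient 1_{z > 0}
    return 2.0 * (out - target) * (W2[0] * (z > 0)) @ W1

grad_norms_sq = np.array([np.sum(loss_input_gradient(X[i], y[i]) ** 2) for i in range(n)])
robustness = np.sqrt(grad_norms_sq.mean())   # smaller value = less sensitive to data perturbations
print(f"estimated sensitivity to adversarial data: {robustness:.4f}")
```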
A similar viewpoint, capturing robustness to adversarial examples through optimal transport lens, has been recently adopted by other authors. The dual formulation of (2) was used by Shafieezadeh-Abadeh et al., (2019) to reduce the training of neural networks to tractable linear programs. Sinha et al., (2020) modified (2) to consider a penalised problem to propose new stochastic gradient descent algorithms with inbuilt robustness to adversarial data.
3.3. Uncertainty Quantification
In the context of UQ the measure represents input parameters of a (possibly complicated) operation in a physical, engineering or economic system. We consider the so-called reliability or certification problem: for a given set of undesirable outcomes, one wants to control , for a set of probability measures . The distributionally robust adversarial classification problem considered recently by Ho-Nguyen and Wright, (2020) is also of this form, with Wasserstein balls around an empirical measure of samples. Using the dual formulation of Blanchet and Murthy, (2019), they linked the problem to minimization of the conditional value-at-risk and proposed a reformulation, and numerical methods, in the case of linear classification. We propose instead a regularised version of the problem and look for
for a given safety level . We thus consider the average distance to the undesirable set, , and not just its probability. The quantity could then be used to quantify the implicit uncertainty of the certification problem, where higher corresponds to less uncertainty. Taking statistical confidence bounds of the empirical measure in Wasserstein distance into account, see Fournier and Guillin, (2014), would then determine the minimum number of samples needed to estimate the empirical measure.
Assume that is convex. Then the distance function is differentiable everywhere except at the boundary of , with for and for all . Further, assume is absolutely continuous w.r.t. the Lebesgue measure on . Theorem 2, using (Bartl et al., 2021a, Remark 11), gives a first-order expansion for the above problem
In the special case this simplifies to
and the minimal measure pushes every point not contained in in the direction of the orthogonal projection. This recovers the intuition of (Chen et al., 2018, Theorem 1), which in turn relies on (Gao and Kleywegt, 2016, Corollary 2, Example 7). Note however that our result holds for general measures . We also note that such an approximation could provide an ansatz for dimension reduction, by identifying the dimensions for which the partial derivatives are negligible and then projecting onto the corresponding lower-dimensional subspace (thus providing a simpler surrogate for ). This would be an alternative to a basis expansion (e.g., orthogonal polynomials) used in UQ and would exploit the interplay of the properties of and simultaneously.
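To make this concrete, here is a small sketch of ours (all specifics are assumptions): with the undesirable set $R$ a Euclidean ball, $f(x)=\operatorname{dist}(x,R)$ and $p=q=2$, the assumed Theorem 2 formula yields $\Upsilon=\sqrt{\mu(R^c)}$, since the gradient of the distance function has unit norm outside $R$ and vanishes inside; the code compares this with the gain obtained by pushing the atoms outside $R$ radially outwards within a $W_2$-budget $\delta$.

```python
# Sketch (ours): first-order expansion of the average distance to an undesirable
# set R = {|x| <= r} over a W_2-ball around an empirical measure, p = q = 2.
import numpy as np

rng = np.random.default_rng(3)
r, n, delta = 1.5, 5_000, 0.05
X = rng.standard_normal((n, 2))                       # atoms of the baseline measure mu
norms = np.linalg.norm(X, axis=1)
outside = norms > r

V0 = np.maximum(norms - r, 0.0).mean()                # average distance to R under mu
upsilon = np.sqrt(outside.mean())                     # assumed formula: sqrt(mu(R^c))

# First-order optimal perturbation: move each atom outside R radially outwards.
grad = np.zeros_like(X)
grad[outside] = X[outside] / norms[outside, None]     # unit outward normals
shift = delta * grad / np.sqrt(np.mean(np.sum(grad ** 2, axis=1)))
V_shifted = np.maximum(np.linalg.norm(X + shift, axis=1) - r, 0.0).mean()
print(f"(V(delta) - V(0)) / delta ~ {(V_shifted - V0) / delta:.4f},  Upsilon = {upsilon:.4f}")
```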
3.4. Statistics
We discuss two applications of our results in the realm of statistics. We start by highlighting the link between our results and the so-called influence curves (IC) in robust statistics. For a functional its IC is defined as
Computing the IC, if it exists, is in general hard and closed-form solutions may be unachievable. However, for the so-called M-estimators, defined as optimizers for ,
for some (e.g., for the median), we have
under suitable assumptions on , see (Huber and Ronchetti,, 1981, section 3.2.1). In comparison, writing for the optimizer for , Theorem 5 yields
(4) |
under Assumption 4 and normalisation . To investigate the connection let us Taylor-expand around to obtain
Choosing and integrating both sides over and dividing by , we obtain the asymptotic equality
for by (4). We conclude that considering the average directional derivative of IC in the direction of gives our first-order sensitivity. For an interesting conjecture regarding the comparison of influence functions and sensitivities in KL-divergence we refer to (Lam,, 2018, Section 7.3) and (Lam,, 2016, Section 3.4.2).
Our second application in statistics exploits the representation of the LASSO/Ridge regressions as robust versions of the standard linear regression. We consider and . If instead of the Euclidean metric we take , for some and , in the definition of the Wasserstein distance, then Blanchet et al., 2019a showed that
(5) |
holds, where . The case is the ordinary least squares regression. For , the RHS for is directly related to the Ridge regression, while the limiting case is called the square-root LASSO regression, a regularised variant of linear regression well known for its good empirical performance. Closed-form solutions to (5) do not exist in general and it is a common practice to use numerical routines to solve it approximately. Theorem 5 offers instead an explicit first-order approximation of for small . We denote by the ordinary least squares estimator and by the identity matrix. Note that the first order condition on implies that for all . In particular, and , where we assume the system is overdetermined so that is invertible. Letting a direct computation, see Bartl et al., 2021a , yields
(6) |
For , and for , and hence is approximately
(7) |
respectively. (Inspecting the proof, we see that Theorem 5 still holds in the case , since does not have zero components -a.s., which are the only points of discontinuity of .) This corresponds to parameter shrinkage: proportional for square-root Ridge and a shift towards zero for square-root LASSO. To the best of our knowledge these are the first such results and we stress that our formulae are valid in a general context and, in particular, parameter shrinkage depends on the direction through the factor. Figure 3 compares the first-order approximation with the actual results and shows a remarkable fit.
[Figure 3. First-order approximations of the square-root LASSO and square-root Ridge coefficients compared with the numerically computed values.]
Furthermore, our results agree with what is known in the canonical test case for the (standard) Ridge and LASSO, see Tibshirani, (1996), when is the empirical measure of i.i.d. observations, the data is centred and the covariates are orthogonal, i.e., . In that case, (7) simplifies to
where is the usual coefficient of determination.
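The shrinkage effect is easy to observe numerically. The sketch below is ours, with synthetic placeholder data, and assumes that the right-hand side of (5) takes, in the square-root LASSO case, the familiar form $\sqrt{\mathbb{E}_\mu[(Y-\theta\cdot X)^2]}+\delta\|\theta\|_1$ (the display itself is not reproduced above); it solves this problem for a few small $\delta$ and compares the resulting coefficients with the ordinary least squares estimator.

```python
# Sketch (ours): coefficient shrinkage of the square-root LASSO relative to OLS,
# solving  min_theta  sqrt(mean((Y - X theta)^2)) + delta * ||theta||_1  numerically.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d = 200, 3
X = rng.standard_normal((n, d))
theta_true = np.array([1.0, -0.5, 0.0])
Y = X @ theta_true + 0.3 * rng.standard_normal(n)

theta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

def sqrt_lasso_objective(theta, delta):
    return np.sqrt(np.mean((Y - X @ theta) ** 2)) + delta * np.sum(np.abs(theta))

print("OLS:        ", np.round(theta_ols, 4))
for delta in [0.01, 0.05, 0.1]:
    res = minimize(sqrt_lasso_objective, theta_ols, args=(delta,), method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-10, "maxiter": 20_000})
    print(f"delta={delta:5.2f}:", np.round(res.x, 4))
```

For small $\delta$ the coefficients move towards zero relative to OLS, in line with the shrinkage described after (7).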
The case of is naturally of particular importance in statistics and data science and we continue to consider it in the next subsection. In particular, we characterise the asymptotic distribution of , where and is the optimizer of the non-robust problem for the data-generating measure. This recovers the central limit theorem of Blanchet et al., 2019b , a link we explain further in section 4.2.
3.5. Out-of-sample error
A benchmark of paramount importance in optimization is the so-called out-of-sample error, also known as the prediction error in statistical learning. Consider the setup above when is the empirical measure of i.i.d. observations sampled from the “true” distribution and take, for simplicity, , with . Our aim is to compute the optimal which solves the original problem (1). However, we only have access to the training set, encoded via . Suppose we solve the distributionally robust optimization problem (2) for and denote the robust optimizer . Then the out-of-sample error
quantifies the error from using as opposed to the true optimizer .
While this expression seems to be hard to compute explicitly for finite samples, Theorem 5 offers a way to find the asymptotic distribution of (a suitably rescaled version of) the out-of-sample error. We suppose the assumptions in Theorem 5 are satisfied and note that the first-order condition for gives . Then, a second-order Taylor expansion gives
(8) |
for some , (coordinate-wise) between and . Now we write
where we define as the optimizer of the non-robust problem (1) with replaced by . In particular the -method for M-estimators implies that
(9) |
where and denotes the convergence in distribution. On the other hand, for a fixed , Theorem 5 applied to yields
(10) | ||||
(11) |
where
Almost surely (w.r.t. sampling of ), we know that in as , see Fournier and Guillin, (2014), and under the regularity and growth assumptions on in (Bartl et al., 2021a, , eq. (27)) we check that a.s., see (Bartl et al., 2021a, , Example 28) for details. In particular, taking and combining the above with (9) we obtain
(12) |
This recovers the central limit theorem of Blanchet et al., 2019b , as discussed in more detail in section 4.2 below. Together, (8) and (11) give us the a.s. asymptotic behaviour of the out-of-sample error
(13) |
These results also extend and complement (Anderson and Philpott,, 2019, Prop. 17). Anderson and Philpott, (2019) investigate when the distributionally robust optimizers yield, on average, better performance than the simple in-sample optimizer . To this end, they consider the expectation, over the realisations of the empirical measure of
This is closely related to the out-of-sample error and our derivations above can be easily modified. The first order term in the Taylor expansion no longer vanishes and, instead of (8), we now have
which holds, e.g., if for any , there exists such that for all , . Combined with (10), this gives asymptotics in small for a fixed . For quadratic and taking , we recover the result in (Anderson and Philpott, 2019, Prop. 17), see (Bartl et al., 2021a, Example 28) for details.
4. Further discussion and literature review
We start with an overview of related literature and then focus specifically on a comparison of our results with the CLT of Blanchet et al., 2019b mentioned above.
4.1. Discussion of related literature
Let us first remark that, while Theorem 2 bears some superficial similarity to a classical maximum theorem, which is usually concerned with continuity properties of , in this work we are instead interested in the exact first derivative of the function . Indeed, the convergence follows, for all satisfying , directly from the definition of convergence in the Wasserstein metric (see e.g. (Villani, 2008, Def. 6.8)). In conclusion, the main issue is to quantify the rate of this convergence by calculating the first derivative .
Our work investigates model uncertainty broadly conceived: it includes errors related to the choice of models from a particular (parametric or not) class of models as well as the mis-specification of such class altogether (or indeed, its absence). In decision theoretic literature, these aspects are sometimes referred to as model ambiguity and model mis-specification respectively, see Hansen and Marinacci, (2016). However, seeing our main problem (2) in decision theoretic terms is not necessarily helpful as we think of as given and not coming from some latent expected utility type of problem. In particular, our actions are just constants.
In our work we decided to capture the uncertainty in the specification of using neighborhoods in the Wasserstein distance. As already mentioned, other choices are possible and have been used in the past. Possibly the most often used alternative is relative entropy, or the Kullback-Leibler divergence. In particular, it has been used in this context in economics, see Hansen and Sargent, (2008). To the best of our knowledge, the only comparable study of sensitivities with respect to relative entropy balls is Lam, (2016), see also Lam, (2018) allowing for additional marginal constraints. However, this only considered the specific case where the reward function is independent of the action. Its main result is
where is a ball of radius centred around in KL-divergence, and denote the variance and kurtosis of under the measure respectively. In particular, the first-order sensitivity involves the function itself. In contrast, our Theorem 2 states and involves the first derivative . In the trivial case of a point mass we recover the intuitive sensitivity , while the results of Lam, (2016) do not apply in this case. We also note that Lam, (2016) requires exponential moments of the function under the baseline measure , while we only require polynomial moments. In particular, in applications in econometrics (or any field in which typically has fat tails), the scope of application of the corresponding results might then be decisively different. We remark, however, that this requirement can be substantially weakened (to the existence of polynomial moments) when replacing KL-divergences by -divergences, see e.g. Atar et al., (2015); Glasserman and Xu, (2014). We expect a sensitivity analysis similar to Lam, (2016) to hold in this setting. However, to the best of our knowledge no explicit results seem to be available in the literature.
To understand the relative technical difficulties and merits it is insightful to go into the details of the statements. In fact, in the case of relative entropy and the one-period setup we are considering, the exact form of the optimizing density can be determined exactly (see (Lam, 2016, Proposition 3.1)) up to a one-dimensional Lagrange parameter. This is well known and is the reason behind the usual elegant formulae obtained in this context. But this then reduces the problem in Lam, (2016) to a one-dimensional problem, which can be well approximated via a Taylor approximation. In contrast, when we consider balls in the Wasserstein distance, the form of the optimizing measure is not known (apart from some degenerate cases). In fact, a key insight of our results is that the optimizing measure can be approximated by a deterministic shift in the direction (this is, in general, not exact but only true as a first-order approximation). The reason for these contrasting starting points of the analyses is the fact that Wasserstein balls contain a more heterogeneous set of measures, while in the case of relative entropy, exponentiating will always do the trick. We remark however that this is no longer true for the finite-horizon problems considered in (Lam, 2016, Section 3.2), where the worst-case measure is found using an elaborate fixed-point equation.
A point which further emphasizes that the topology introduced by the Wasserstein metric is less tractable is the fact that
where is the relative entropy and
for some normalising constant , see, e.g., Carlier et al., (2017). This is known as the entropic optimal transport formulation and has received considerable interest in the ML community in recent years (see e.g. Peyré and Cuturi, (2019)). In particular, the Wasserstein distance can be approximated by relative entropy, but only with respect to reference measures on the product space. As we consider optimization over above, this amounts to changing the reference measure. In consequence, the topological structure imposed by Wasserstein distances is more intricate than that induced by relative entropy, but also more flexible.
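For readers less familiar with the entropic formulation, here is a brief sketch of ours of the scheme just described: the Sinkhorn iteration below computes an entropy-regularized transport cost between two discrete measures on the line and, as the regularization parameter decreases, the transport cost approaches the exact squared 2-Wasserstein distance (which, in one dimension, is obtained by matching sorted atoms).

```python
# Sketch (ours): entropic regularization of optimal transport via Sinkhorn iterations,
# compared with the exact 1-d squared 2-Wasserstein cost (monotone coupling).
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = np.sort(rng.standard_normal(n))               # atoms of mu (uniform weights)
y = np.sort(1.0 + 0.8 * rng.standard_normal(n))   # atoms of nu (uniform weights)
a = b = np.full(n, 1.0 / n)
C = (x[:, None] - y[None, :]) ** 2                # squared-distance cost matrix

exact = np.mean((x - y) ** 2)                     # exact W_2^2 in 1-d: match sorted atoms

for eps in [1.0, 0.3, 0.1]:
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(5_000):                        # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    print(f"eps = {eps:4.2f}:  entropic transport cost = {np.sum(plan * C):.4f}   (exact W_2^2 = {exact:.4f})")
```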
The other well studied distance is the Hellinger distance. Lindsay, (1994) calculates influence curves for the minimum Hellinger distance estimator on a countable sample space. Their main result is that for the choice (where is a collection of parametric densities)
the product of the inverse Fisher information matrix and the score function, which is the same as for the classical maximum likelihood estimator. Denote by the empirical measure of data samples and by the corresponding minimum Hellinger distance estimator for . In particular this result implies the same CLT as for M-estimators given by
where . As we discuss in the next section, our Theorem 5 yields a similar CLT, namely
Thus the Wasserstein worst-case approach leads to a shift of the mean of the normal distribution in the direction
compared to the non-robust case. In the simple case with standard deviation we obtain the MLE . We can directly compute (for ) that
Thus the robust approach introduces a shift of (of order 1 if multiplied by the inverse Fisher information) to account for a possibly higher variance in the underlying data. In particular, in our approach, the so-called neutral spaces considered, e.g., in (Komorowski et al., 2011, eq. (21)) as
should also take this shift into account, i.e., their definition should be adjusted to
Lastly, let us mention another situation when our approach provides directly interpretable insights in the context of a parametric family of models. Namely, if one considers a family of models such that the worst-case model in the Wasserstein ball remains in , i.e., , then considering (the first order approximation to) model uncertainty over Wasserstein balls actually reduces to considerations within the parametric family. While uncommon, such a situation would arise, e.g., for a scale-location family , with and a linear/quadratic .
4.2. Link to the CLT of Blanchet et al., 2019b
As observed in section 3.5 above, Theorem 5 allows us to recover the main results in Blanchet et al., 2019b . We explain this now in detail. Set , , . Let denote the empirical measure of i.i.d. samples from . We impose the assumptions on and from Blanchet et al., 2019b , including Lipschitz continuity of gradients of and strict convexity. These, in particular, imply that the optimizers and , as defined in section 3.5, are well defined and unique, and further as . (Blanchet et al., 2019b, Thm. 1) implies that, as ,
(14) |
where . We note that for we have
Thus
and (14) agrees with (12), which is justified by the Lipschitz growth assumptions on , and from Blanchet et al., 2019b , see (Bartl et al., 2021a, eq. (27)). In particular, Theorem 5 implies (14) as a special case. While this connection is insightful to establish (we thank Jose Blanchet for pointing out the possible link and encouraging us to explore it), it is also worth stressing that the proofs in Blanchet et al., 2019b pass through the dual formulation and are thus substantially different to ours. Furthermore, while Theorem 5 holds under milder assumptions on than those in Blanchet et al., 2019b , the last argument in our reasoning above requires the stronger assumptions on . It is thus not clear if our results could help to significantly weaken the assumptions in the central limit theorems of Blanchet et al., 2019b .
5. Proofs
We consider the case and here. For the general case and additional details we refer to Bartl et al., 2021a . When clear from the context, we do not indicate the space over which we integrate.
Proof of Theorem 2.
For every let denote those which satisfy
As the infimum in the definition of is attained (see (Villani,, 2008, Theorem 4.1, p.43)) one has .
We start by showing the “” inequality in the statement. For any one has with equality for . Therefore, differentiating and using both Fubini’s theorem and Hölder’s inequality, we obtain that
Any choice converges in -Wasserstein distance on ) to the pushforward measure of under the mapping , which we denote . This can be seen by, e.g., considering the coupling between and . Now note that and the growth assumption on implies
(15) |
for some and all , . In particular for all and small , for another constant . As further is continuous for every , the -Wasserstein convergence of to implies that
for every for , see (Bartl et al., 2021a, , Lemma 21). Dominated convergence (in ) then yields “” in the statement of the theorem.
We turn now to the opposite “” inequality. As for every there is no loss in generality in assuming that the right hand side is not equal to zero. Now take any, for notational simplicity not relabelled, subsequence of which attains the liminf in and pick . By assumption, for a (again not relabelled) subsequence, one has . Further note that which implies
Now define , where
for with the convention . Note that the integral is well defined since, as before in (15), one has for some and the latter is integrable under . Using that it further follows that
In particular and we can use it to estimate from below the supremum over giving
For any , with , the inner integral converges to
The last equality follows from the definition of and a simple calculation. To justify the convergence, first note that for all by continuity of and since . Moreover, as before in (15), one has for some , hence for some and all . The latter is integrable under , hence convergence of the integrals follows from the dominated convergence theorem. This concludes the proof. ∎
Proof of Theorem 5.
We first show that
(16) | ||||
for all . We start with the “-inequality. For any we have
Let and recall that converge to . Let denote the set of which attain the value: . By (Bartl et al., 2021a, , Lemma 29) the function is (one-sided) directionally differentiable at for all small and thus for all
Then, using Lagrange multipliers to encode the optimality of in , we obtain
where we used a minimax argument as well as Fubini’s theorem. We note that the functions above satisfy the assumptions of Theorem 2 for a fixed . In particular using exactly the same arguments as in the proof of Theorem 2 (i.e., Hölder’s inequality and a specific transport attaining the supremum) we obtain by exchanging the order of and that
(17) | |||
For the infimum can be computed explicitly and equals
For the general case we refer to (Bartl et al., 2021a, , Lemma 30), noting that by assumption , we see that the RHS above is equal to the RHS in (16).
The proof of the “-inequality in (16) follows by the very same arguments. Indeed, (Bartl et al., 2021a, , Lemma 29) implies that
for all and we can write
From here on, we argue as in the “-inequality and conclude that indeed (16) holds.
By assumption the matrix is invertible. Therefore, in a small neighborhood of , the mapping is invertible. In particular and by the first order condition . Applying the chain rule and using (16) gives
This completes the proof. ∎
References
- Anderson and Philpott, (2019) Anderson, E. J. and Philpott, A. B. (2019). Improving sample average approximation using distributional robustness. Optimization Online.
- Araujo et al., (2019) Araujo, A., Pinot, R., Negrevergne, B., Meunier, L., Chevaleyre, Y., Yger, F., and Atif, J. (2019). Robust neural networks using randomized adversarial training. arXiv:1903.10219.
- Armacost and Fiacco, (1974) Armacost, R. L. and Fiacco, A. V. (1974). Computational experience in sensitivity analysis for nonlinear programming. Math. Program., 6(1):301–326.
- Artzner et al., (1999) Artzner, P., Delbaen, F., Eber, J., and Heath, D. (1999). Coherent measures of risk. Math. Finance, 9(3):203–228.
- Asi and Duchi, (2019) Asi, H. and Duchi, J. C. (2019). The importance of better models in stochastic optimization. Proc. Natl. Acad. Sci. USA, 116(46):22924–22930.
- Atar et al., (2015) Atar, R., Chowdhary, K., and Dupuis, P. (2015). Robust bounds on risk-sensitive functionals via rényi divergence. SIAM/ASA J. Uncertain. Quantif., 3(1):18–33.
- (7) Bartl, D., Drapeau, S., Obłój, J., and Wiesel, J. (2021a). Appendix to sensitivity analysis of Wasserstein distributionally robust optimization problems.
- (8) Bartl, D., Drapeau, S., Obłój, J., and Wiesel, J. (2021b). Sensitivity analysis of Wasserstein distributionally robust optimization problems. TBC.
- Bartl et al., (2020) Bartl, D., Drapeau, S., and Tangpi, L. (2020). Computational aspects of robust optimized certainty equivalents and option pricing. Math. Finance, 30(1):287–309.
- Bastani et al., (2016) Bastani, O., Ioannou, Y., Lampropoulos, L., Vytiniotis, D., Nori, A., and Criminisi, A. (2016). Measuring neural net robustness with constraints. In Advances in neural information processing systems, pages 2613–2621.
- Ben Tal and Teboulle, (1986) Ben Tal, A. and Teboulle, M. (1986). Expected utility, penalty functions, and duality in stochastic nonlinear programming. Manag. Sci., 32(11):1445–1466.
- Ben Tal and Teboulle, (2007) Ben Tal, A. and Teboulle, M. (2007). An old-new concept of convex risk measures: the optimized certainty equivalent. Math. Finance, 17(3):449–476.
- Black and Scholes, (1973) Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. J. Political Econ, 81(3):637–654.
- (14) Blanchet, J., Kang, Y., and Murthy, K. (2019a). Robust Wasserstein profile inference and applications to machine learning. J. Appl. Probab., 56(3):830–857.
- Blanchet and Murthy, (2019) Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Math. Oper. Res., 44(2):565–600.
- (16) Blanchet, J., Murthy, K., and Si, N. (2019b). Confidence regions in Wasserstein distributionally robust estimation. arXiv:1906.01614.
- Bonnans and Shapiro, (2013) Bonnans, J. F. and Shapiro, A. (2013). Perturbation Analysis of Optimization Problems. Springer Science & Business Media.
- Brezis, (2010) Brezis, H. (2010). Functional analysis, Sobolev spaces and partial differential equations. Springer Science & Business Media.
- Calafiore, (2007) Calafiore, G. C. (2007). Ambiguous risk measures and optimal robust portfolios. SIAM J. Optim., 18(3):853–877.
- Carlier et al., (2017) Carlier, G., Duval, V., Peyré, G., and Schmitzer, B. (2017). Convergence of entropic schemes for optimal transport and gradient flows. SIAM J. Math. Anal., 49(2):1385–1418.
- Carlier and Ekeland, (2010) Carlier, G. and Ekeland, I. (2010). Matching for teams. Econom. Theory, 42:397–418.
- Carlini and Wagner, (2017) Carlini, N. and Wagner, D. (2017). Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE.
- Chen et al., (2018) Chen, Z., Kuhn, D., and Wiesemann, W. (2018). Data-driven chance constrained programs over Wasserstein balls. arXiv:1809.00210.
- Chiappori et al., (2010) Chiappori, P.-A., McCann, R. J., and Nesheim, L. (2010). Hedonic price equilibria, stable matching, and optimal transport: Equivalence, topology, and uniqueness. Econom. Theory, 42:317–354.
- Dupacova, (1990) Dupacova, J. (1990). Stability and sensitivity analysis for stochastic programming. Ann. Oper. Res., 27(1-4):115–142.
- Fournier and Guillin, (2014) Fournier, N. and Guillin, A. (2014). On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields, 162(3-4):707–738.
- Gao and Kleywegt, (2016) Gao, R. and Kleywegt, A. J. (2016). Distributionally robust stochastic optimization with Wasserstein distance. arXiv:1604.02199.
- Ghanem et al., (2017) Ghanem, R., Higdon, D., and Owhadi, H., editors (2017). Handbook of Uncertainty Quantification. Springer International Publishing.
- Glasserman and Xu, (2014) Glasserman, P. and Xu, X. (2014). Robust risk measurement and model risk. Quant. Finance, 14(1):29–58.
- Goodfellow et al., (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv:1412.6572.
- Hansen and Marinacci, (2016) Hansen, L. P. and Marinacci, M. (2016). Ambiguity aversion and model misspecification: An economic perspective. Stat. Sci., 31(4):511–515.
- Hansen and Sargent, (2008) Hansen, L. P. and Sargent, T. (2008). Robustness. Princeton university press.
- Ho-Nguyen and Wright, (2020) Ho-Nguyen, N. and Wright, S. J. (2020). Adversarial classification via distributional robustness with wasserstein ambiguity. arXiv preprint arXiv:2005.13815.
- Huber and Ronchetti, (1981) Huber, P. and Ronchetti, E. (1981). Robust statistics. Wiley Series in Probability and Mathematical Statistics. New York, NY, USA, Wiley-IEEE, 52:54.
- Komorowski et al., (2011) Komorowski, M., Costa, M. J., Rand, D. A., and Stumpf, M. P. (2011). Sensitivity, robustness, and identifiability in stochastic chemical kinetics models. Proc. Natl. Acad. Sci. USA, 108(21):8645–8650.
- Kuhn et al., (2019) Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. (2019). Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pages 130–166. INFORMS.
- Lam, (2016) Lam, H. (2016). Robust sensitivity analysis for stochastic systems. Math. Oper. Res., 41(4):1248–1275.
- Lam, (2018) Lam, H. (2018). Sensitivity to serial dependency of input processes: A robust approach. Management Science, 64(3):1311–1327.
- Li et al., (2019) Li, L., Zhong, Z., Li, B., and Xie, T. (2019). Robustra: training provable robust neural networks over reference adversarial space. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4711–4717. AAAI Press.
- Lindsay, (1994) Lindsay, B. G. (1994). Efficiency versus robustness: the case for minimum hellinger distance and related methods. Ann. Stat., 22(2):1081–1114.
- Mangal et al., (2019) Mangal, R., Nori, A. V., and Orso, A. (2019). Robustness of neural networks: a probabilistic and practical approach. In Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results, pages 93–96. IEEE Press.
- Markowitz, (1952) Markowitz, H. (1952). Portfolio selection. J. Finance, 7(1):77–91.
- Mohajerin Esfahani and Kuhn, (2018) Mohajerin Esfahani, P. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Math. Program., 171(1-2, Ser. A):115–166.
- Obłój and Wiesel, (2021) Obłój, J. and Wiesel, J. (2021). Robust estimation of superhedging prices. Ann. Stat., 49(1):508–530.
- Peyré and Cuturi, (2019) Peyré, G. and Cuturi, M. (2019). Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607.
- Pflug and Wozabal, (2007) Pflug, G. and Wozabal, D. (2007). Ambiguity in portfolio selection. Quant. Finance, 7(4):435–442.
- Pflug et al., (2012) Pflug, G. C., Pichler, A., and Wozabal, D. (2012). The 1/n investment strategy is optimal under high model ambiguity. J. Banking Finance, 36(2):410–417.
- Rahimian and Mehrotra, (2019) Rahimian, H. and Mehrotra, S. (2019). Distributionally robust optimization: A review. arXiv.org:1908.05659.
- Romisch, (2003) Romisch, W. (2003). Stability of stochastic programming problems. In Stochastic programming, pages 483–554. Elsevier Sci. B. V., Amsterdam.
- Shafieezadeh-Abadeh et al., (2019) Shafieezadeh-Abadeh, S., Kuhn, D., and Esfahani, P. M. (2019). Regularization via mass transportation. J. Mach. Learn. Res., 20(103):1–68.
- Sinha et al., (2020) Sinha, A., Namkoong, H., Volpi, R., and Duchi, J. (2020). Certifying some distributional robustness with principled adversarial training. arXiv:1710.10571v5.
- Szegedy et al., (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2013). Intriguing properties of neural networks. arXiv:1312.6199.
- Terkelsen, (1973) Terkelsen, F. (1973). Some minimax theorems. Math. Scand., 31(2):405–413.
- Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol., 58(1):267–288.
- Villani, (2008) Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media.
- Vogel, (2007) Vogel, S. (2007). Stability results for stochastic programming problems. Optimization, 19(2):269–288.
- Weng et al., (2018) Weng, T.-W., Zhang, H., Chen, P.-Y., Yi, J., Su, D., Gao, Y., Hsieh, C.-J., and Daniel, L. (2018). Evaluating the robustness of neural networks: An extreme value theory approach. arXiv:1801.10578.
- Wong and Kolter, (2017) Wong, E. and Kolter, J. Z. (2017). Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv:1711.00851.
Appendix A Preliminaries
We recall and further explain the setting from the main body of the paper Bartl et al., 2021b . Take , endow with the Euclidean norm . Throughout the paper we take the convention that topological properties, such as continuity or closure, are understood w.r.t. . We let denote respectively the interior, the closure, the boundary and the complement of a set . We denote the set of all probability measures on by . For a variable , we will denote the optimizer by and the set of optimizers by .
Fix a seminorm on and denote by its (extended) dual norm, i.e. . Let us define the equivalence relation if and only if . Furthermore let us set and write . With this notation, the quotient space is a normed space for . Furthermore, by the triangle inequality for and equivalence of norms on , there exists such that and for all . As is Hausdorff, this immediately implies that is Hausdorff as well. Furthermore we conclude, that is continuous and is lower semicontinuous w.r.t. (as the supremum over continuous functions ). Lastly we make the convention that denotes the ball of radius around in . As our setup is slightly non-standard, we state the following lemmas for completeness:
Lemma 6.
For every we have that .
Proof.
As is convex and closed, this follows directly from the bipolar theorem. ∎
Lemma 7.
Assume that is strictly convex. Then the following hold:
-
(i)
For all there exists such that and . If , then is unique.
-
(ii)
The map is continuous.
Proof.
Fix . The existence of in (i) follows from Lemma 6. Assume towards a contradiction that there exists another with , and . Defining we have . On the other hand, by the Hausdorff property of , we have and thus, by strict convexity of , . Using again Lemma 6, we conclude , a contradiction.
For (ii) we assume towards a contradiction that for some sequence in we have , but . As remarked above, we have , in particular after taking a subsequence. Recalling that and is lower semicontinuous, we conclude that and in particular by Lemma 6 and (i). Finally
which leads to a contradiction. ∎
Lemma 8.
If is strictly convex, then is strictly convex as well.
Proof.
Fix . We first note that
is uniquely defined. Indeed, this follows from applying the exact same arguments as in the proof of Lemma 7, adjusting for . Take now such that and . Set and note that . Then . This shows the claim. ∎
Let denote the state space which is a closed convex subset of . Fix and take so that . For probability measures and on , we define their -Wasserstein distance as
where is the set of all probability measures with first marginal and second marginal . In the proofs we sometimes also use the -Wasserstein distance with respect to the Euclidean norm given by
Recall that for some constant , which in turn implies that . A Wasserstein ball of size around is denoted
From now on, we fix such that and . Let denote the action (decision) space which is a convex and closed subset of . We consider robust stochastic optimization problem [2]:
In accordance with our conventions, we write for an optimizer: and for the set of such optimizers. We also let denote the set of measures such that and sometimes write for if is fixed.
Appendix B Discussion, extensions and proofs related to Theorem 2
We complement now the discussion of Theorem 2. We start with some remarks, extensions and further examples before proceeding with the proofs, including a complete proof of Theorem 2 for general seminorms .
B.1. Discussion and extensions of Theorem 2
Remark 9.
Remark 10.
Let . In addition to Assumption 1, suppose that is twice continuously differentiable and that for every there is such that for all and all with . Then, the same arguments as in the proof of Theorem 2 but with a second-order Taylor expansion yield
for small , where denotes the largest eigenvalue of the Hessian taken w.r.t. the norm and is such that .
In particular, this means that if the term in front of is the same order of magnitude as the term in front of , then the first order approximation is quite accurate for small . Note that larger implies smaller and therefore a smaller term in front of the term.
Remark 11.
We believe that Assumption 1 lists natural sufficient conditions for differentiability of in zero. In particular all these conditions are used in the proof of Theorem 2. Relaxing Assumption 1 seems to require a careful analysis of the interplay between (the space explored by balls around) and the functions . We state here a straightforward extension to the case where is only weakly differentiable and leave more fundamental extensions (e.g., to manifolds) for future research.
Specifically, in case that the baseline distribution is absolutely continuous w.r.t. the Lebesgue measure and , Theorem 2 remains true if we merely assume that has a weak derivative (in the Sobolev sense) on for all and replace by the weak derivative of in Assumption 1. More concretely the first point of Assumption 1 should read:
-
•
The weak derivative of is continuous at every point , where is a Lebesgue-null set, and for every there is such that for all and .
Proof of Remark 11.
For notational simplicity we only consider the case . Note that by, e.g., Brezis, (2010)[Theorem 8.2] we can assume that is continuous and satisfies
for all and all . Furthermore
(18) |
where means that is absolutely continuous w.r.t. the Lebesgue measure. Indeed, let us take and set , where denotes the multivariate normal distribution with covariance , and denotes the convolution operator. For every , by convexity of and the triangle inequality for , we have
By assumption and one can check that as . Hence, for every there exists small such that . As further , this shows (18). The proof of the remark now follows by the exact same arguments as in the proof of Theorem 2. ∎
A natural example, which highlights the importance of Remark 11 is the following:
Example 12.
We let be a model for a vector of returns and assume that is absolutely continuous with respect to Lebesgue measure. Let further and let denote a portfolio. We then consider the average value at risk at level of the portfolio wealth , which can be written as
where is the value at risk at level defined as
We note that the average value at risk is an example for an optimized certainty equivalent (OCE), when choosing in (Bartl et al., 2021b, , p. 3). We can thus rewrite the optimization problem
as
Set and assume that there exists a unique minimiser of . Then is given by . The robust version of reads
Note that the function is weakly differentiable with weak derivative . In conclusion has weak derivative
which is continuous at except on the lower-dimensional set , which is in particular a Lebesgue null set. Remark 11 thus yields
and thus
Comparing with (Bartl et al.,, 2020, Table 1), we see that this approximation is actually exact for .
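As a numerical companion to this example, the sketch below is ours and rests on assumptions: the portfolio, the return distribution and the constants are placeholders, the portfolio $a$ is held fixed rather than optimized, and we use the assumed unconstrained first-order formula of Theorem 2 with $p=q=2$, which in this case reads $\Upsilon = \|a\|_2\,\alpha^{-1/2}$ for the average value at risk at level $\alpha$. The Monte Carlo check pushes the tail atoms in the direction of larger losses within a $W_2$-budget $\delta$.

```python
# Sketch (ours): Monte Carlo check of an assumed first-order expansion for the
# robust average value at risk of a fixed portfolio a, p = q = 2.
import numpy as np

rng = np.random.default_rng(6)
alpha, n = 0.05, 200_000
a = np.array([0.6, 0.4])                              # fixed portfolio weights (placeholder)
X = rng.multivariate_normal([0.01, 0.02], [[0.04, 0.01], [0.01, 0.09]], size=n)
loss = -X @ a                                         # portfolio loss -a.x

def avar(losses):
    var = np.quantile(losses, 1 - alpha)              # value at risk at level alpha
    return var + np.mean(np.maximum(losses - var, 0.0)) / alpha

V0 = avar(loss)
upsilon = np.linalg.norm(a) / np.sqrt(alpha)          # assumed sensitivity ||a|| * alpha^(-1/2)

delta = 0.01
# First-order perturbation: push the tail atoms in the direction -a (worse losses),
# normalised so that the shifted measure stays within W_2-distance delta of mu.
tail = loss > np.quantile(loss, 1 - alpha)
shift = np.zeros_like(X)
shift[tail] = -delta * a / (np.linalg.norm(a) * np.sqrt(np.mean(tail)))
V_shifted = avar(-(X + shift) @ a)
print(f"(V(delta) - V(0)) / delta ~ {(V_shifted - V0) / delta:.3f},  Upsilon = {upsilon:.3f}")
```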
We now mention two extensions of Theorem 2. The first one concerns the derivative of for .
Corollary 13.
Fix and in addition to the assumptions of Theorem 2 assume that
-
•
for small enough and for every sequence such that and such that there is a subsequence which converges to some .
-
•
there exists such that for all and every with one has for all and some constant .
Then
where we recall that is the set of all for which .
Remark 14.
The second extension of Theorem 2 offers a more specific sensitivity result by including additional constraints on the ball of measures considered. Let and let be a family of functions and assume that is calibrated to in the sense that . Consider the set
and the corresponding optimization problem
We have the following result.
Theorem 15 (Sensitivity of under linear constraints).
In addition to the assumptions of Theorem 2, assume that there is some small such that for every one has for all and some constant . Further assume that , , are continuously differentiable with , and that the non-degeneracy condition
(19) |
holds. Then
Remark 16.
Note that if is a norm and has full support, the above non-degeneracy condition (19) can be assumed without loss of generality. Indeed, as the unit circle is compact and the function is continuous, the infimum in (19) is attained. In particular, if
then -a.s. for some in the unit circle. As has full support this implies that on . Thus are linearly dependent functions on . Deleting all linearly dependent coordinates and calling the resulting vector , we have for every . Moreover, the non-degeneracy condition (19) holds for .
Remark 17.
We can relax the conditions of Theorem 15 in the spirit of Remark 11: more specifically, assume that the baseline distribution is absolutely continuous w.r.t. the Lebesgue measure and . Then Theorem 15 remains true if we merely assume that and have a weak derivative (in the Sobolev sense) on for all and replace and by the weak derivative of and of respectively. More concretely the assumption should read:
-
•
The weak derivatives of and of are continuous at every point , where is a Lebesgue-null set, and for every there is such that and for all , and .
Example 18 (Martingale constraints).
Let , , , , and let and , i.e., corresponds to the measures satisfying the martingale (barycentre preservation) constraint . Clearly the assumptions on of Theorem 15 are satisfied. It remains to solve the optimization problem over and plug in the optimizer. We then obtain
i.e., is the standard deviation of under . In line with the previous remark, this result extends to the case of the call option pricing discussed in the main body of the paper.
Example 19 (Covariance constraints).
Let , , , . Further let for some and , i.e., we want to optimize over measures satisfying the covariance constraint . Assume that there exists no such that -a.s. . Clearly the assumptions on of Theorem 15 are satisfied. Note that
so in particular the optimal in the definition of is given by
Plugging this in gives
It follows that
Example 20 (Calibration).
Consider the function , the discrete measure formalises grid points for which option data is available, is the set of maturities and strikes of interest and , for a given compact set , is a class of parametric models (e.g., Heston). A Wasserstein ball around can then be seen as a plausible formalisation of market data uncertainty. Derivatives in and correspond to classical pricing sensitivities, which are readily available for most common parametric models. These only have to be evaluated for one model. Changing the class of parametric models and computing the sensitivity in Theorem 2 could then yield insights into when a calibration procedure can be considered reasonably robust.
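As a minimal illustration of how Theorem 2 sensitivities can be evaluated for one concrete model, the sketch below considers a single call payoff f(ξ) = (ξ − K)⁺ under a Black–Scholes (lognormal) baseline with p = q = 2. It assumes the first-order correction is the L²(μ)-norm of ∂_ξ f, which here reduces to the square root of the probability of finishing in the money, √N(d₂); the parameters and the restriction to a single strike are purely illustrative and are not part of the calibration setup described above (which would require the constrained version of the result).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical illustration (p = q = 2): first-order model-uncertainty sensitivity of a
# call price under a Black-Scholes baseline. Assumption (hedged): the correction equals
# (E_mu[|f'(xi)|^2])^(1/2); for f(xi) = (xi - K)^+ we have f'(xi) = 1_{xi > K}, so the
# sensitivity is sqrt(Q(S_T > K)) = sqrt(N(d2)) under the lognormal baseline.

S0, K, sigma, T, r = 100.0, 105.0, 0.2, 1.0, 0.0   # illustrative parameters

d2 = (np.log(S0 / K) + (r - 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
sensitivity_cf = np.sqrt(norm.cdf(d2))             # closed-form sqrt(N(d2))

rng = np.random.default_rng(1)
ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * rng.normal(size=500_000))
sensitivity_mc = np.sqrt(np.mean((ST > K).astype(float)))  # Monte Carlo counterpart

print(f"sqrt(N(d2))             : {sensitivity_cf:.4f}")
print(f"Monte Carlo sensitivity : {sensitivity_mc:.4f}")
```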
B.2. Proofs and auxiliary results related to Theorem 2
Proof of Theorem 2.
We now present a complete proof of Theorem 2 for general state space and semi-norm . All the essential ideas have already been outlined in Bartl et al., 2021b, but for the convenience of the reader we repeat all of the steps as opposed to only detailing where the general case differs from the one treated in Bartl et al., 2021b.
Step 1: Let us first assume that . For every let denote those which satisfy
Note that the dual norm is lower semicontinuous, which implies that the infimum in the definition of is attained (see (Villani, 2008, Theorem 4.1, p. 43)); in particular one has .
We start by showing the “” inequality in the statement. For any one has with equality for . Therefore, differentiating and using Fubini’s theorem, we obtain that
Now recall that for every , whence for any and , we have that
where we used Hölder’s inequality to obtain the last inequality. By definition of the last integral is smaller than and we end up with
It remains to show that the last term converges to the integral under . To that end, note that any choice converges (in on ) to the pushforward measure of under the mapping , which we denote . This can be seen by, e.g., considering the coupling between and . Now note that, together with the growth restriction on of Assumption 1, implies
(20)
for some and all , . Recall that there furthermore exists such that , in particular for all and small , for another constant . As Assumption 1 further yields continuity of for every , the -Wasserstein convergence of to implies that
for every , see Lemma 21. Dominated convergence (in ) then yields “” in the statement of the theorem.
We now turn to the opposite “” inequality. As for every , there is no loss of generality in assuming that the right hand side is not equal to zero. Now take any, for notational simplicity not relabelled, subsequence of which attains the liminf in and pick . By the second part of Assumption 1, for a (again not relabelled) subsequence, one has . Further note that , which implies
By Lemma 7 there exists a function such that for every . Now define
for with the convention . Note that the integral is well defined since, as before in (20), one has for some and the latter is integrable under . Using that it further follows that
In particular and we can use it to estimate the supremum over from below, giving
For any , with , the inner integral converges to
The last equality follows from the definition of and a simple calculation. To justify the convergence, first note that
for all by continuity of and since . Moreover, as before in (20), one has
for some and all .
The latter is integrable under , hence convergence of the integrals follows from the dominated convergence theorem.
Step 2: We now extend the proof to the case where is closed convex and its boundary has zero measure under .
Note that the proof of the “”-inequality remains unchanged. We modify the proof of the “”-inequality as follows: let us first define
for all , so that in particular . We now redefine
Then and in particular as in Step 1. Noting that
the remaining steps of the proof follow as in Step 1. This concludes the proof. ∎
Lemma 21.
Let , let and assume that is continuous and, for some constant , satisfies for all and all in a neighborhood of . Let be a sequence of probability measures which converges to some w.r.t. and be a sequence which converges to . Then as .
Proof.
Let be a small neighborhood of such that for all and . The measures converge in to the measure . As and similarly for , the claim follows from (Villani, 2008, Lemma 4.3, p. 43). ∎
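For intuition, the following small numerical illustration of Lemma 21 works in dimension one; the function g, the baseline distribution and the sequence a_n below are illustrative choices, not taken from the paper. It checks that integrals of a continuous function with the stated growth, evaluated along empirical measures and a converging parameter sequence, approach the limiting integral.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Illustrative check of Lemma 21 in dimension one: mu_n = empirical measure of n standard
# normal samples, a_n = 1 + 1/n -> a = 1, and g(a, xi) = a * xi**2 (continuous, with
# |g(a, xi)| <= c (1 + |xi|^2) for a near 1). Expected limit: E[g(1, xi)] = 1.

rng = np.random.default_rng(2)
ref = rng.normal(size=400_000)                   # large reference sample standing in for mu

for n in [10**2, 10**3, 10**4, 10**5]:
    sample = rng.normal(size=n)                  # support of the empirical measure mu_n
    a_n = 1.0 + 1.0 / n
    integral_n = np.mean(a_n * sample**2)        # integral of g(a_n, .) with respect to mu_n
    w1 = wasserstein_distance(sample, ref)       # empirical proxy for W_1(mu_n, mu)
    print(f"n={n:>6}  integral={integral_n:.4f}  (target 1.0)   W1 proxy={w1:.4f}")
```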
The following lemma relates to the financial economics applications described in Bartl et al., 2021b. We focus on a sufficient condition for the second part of Assumption 1. For this, we assume that does not contain any redundant assets, i.e., for every . If satisfies this condition, we call it non-degenerate. Note that this condition is slightly stronger than no-arbitrage. However, if satisfies no-arbitrage, then one can always delete the redundant dimensions in , similarly to the remark after Theorem 15, so that the modified measure satisfies for every .
Lemma 22.
Assume that is convex, increasing, bounded from below and satisfies the first part of Assumption 1. Furthermore assume that is non-degenerate in the above sense. Then for every there exists an optimizer for , i.e.,
Furthermore, if is strictly convex, the optimizer of is unique and as . In particular, Assumption 1 is satisfied.
Proof.
The first statement is trivially true if is constant, so assume otherwise in the following. Moreover, note that by the first part of Assumption 1 we have for all . Now fix , and let be a minimizing sequence, i.e.,
If is bounded, then after passing to a subsequence there is a limit, and Fatou’s lemma shows that this limit is a minimizer. It remains to argue why is bounded. Heading for a contradiction, assume that as . After passing to a (not relabeled) subsequence, there is with such that as . By our assumption we have . As is bounded below this shows that
as , a contradiction.
To prove the second claim, note that strict convexity of readily implies that admits a unique minimizer . Now, heading for a contradiction, assume that there exists a subsequence converging to zero such that does not converge to . The exact same reasoning as above shows that is bounded, hence (possibly after passing to a not relabeled subsequence) there is a limit . Using Fatou’s lemma once more implies
On the other hand, plugging into implies
which follows from as and that any converges in to by definition. This gives the desired contradiction. ∎
In analogy to the above result, the following summarizes simple sufficient conditions for the second part of Assumption 1.
Lemma 23.
Assume that either is compact or that is coercive, in the sense that if . Moreover, assume that is continuous, such that for some . Then the second part of Assumption 1 is satisfied.
Proof.
Let us first note that for fixed the function is lower semicontinuous as a supremum of continuous functions for . Next we note that . Indeed, if is compact, this directly follows from lower semicontinuity of . Otherwise, the fact that for all and coercivity imply that any minimising sequence is bounded. Lastly, we show that any accumulation point of such a sequence is an element of . By the above we can assume (by taking a subsequence without relabelling if necessary) that . If , then
for any . This contradicts for all and concludes the proof. ∎
Proof of Corollary 13.
We start with the “”-inequality. First, note that for any , , and , we have
This implies that
(21)
Note that the assumption implies ( for some new constant ). To simplify notation let us thus define and recall that is compact w.r.t. by Lemma 24, hence there is such that (after passing to a subsequence) w.r.t. as . The same arguments as in the proof of Theorem 2 show that (21) (divided by ) converges to when . So, to conclude the “”-part, all that is left to do is show that , which follows as
We now turn to the proof of the “”-inequality. To that end, let be a sequence of optimizers, i.e. for all . Then by assumption there exists such that (after passing to a subsequence) . Let be arbitrary. As (by the triangle inequality) we have
As further (trivially) we conclude
as , where the last equality follows from the exact same arguments as presented in the proof of Theorem 2. As was arbitrary, the claim follows. ∎
Proof of Theorem 15.
We start by showing the easier estimate
(22)
To that end, let and be arbitrary. Then . Moreover, as , it further follows that . Therefore (22) is a consequence of Theorem 2 (applied to the function ).
To show the other direction, i.e. that
(23)
pick a (not relabeled) subsequence of which converges to the liminf. For , there is another (again not relabeled) subsequence which converges to some . From now on we stick to this subsequence. In a first step, notice that
(24)
Indeed, this follows from a minimax theorem (see (Terkelsen, 1973, Cor. 2, p. 411)) and appropriate compactness of as stated in Lemma 24. For notational simplicity let be an optimizer for (24). Then
(25)
where we used that . Now, in case that is uniformly bounded for all small , after passing to a subsequence, it converges to some . Then it follows from the exact same arguments as used in the proof of Theorem 2 that
which shows (23). It remains to argue why is bounded for small . By (25) and the estimate “” we have
The second term converges to (see the proof of Theorem 2); in particular it is bounded for all small . On the other hand, by (19) and the continuity as well as growth of , the first term is larger than for some . By (22) this implies that must be bounded for small . ∎
We have used the following lemma:
Lemma 24.
Let be such that and let be a probability measure on . Then the -Wasserstein ball is compact w.r.t. .
Proof.
We recall that is lower semicontinuous and that there exists such that for all . As by assumption, an application of Prokhorov’s theorem shows that is weakly precompact (recall the convention that continuity is defined for ). Hence, for every sequence of measures in there exists a subsequence, which we also call , and a measure such that converges weakly to . As is weakly lower semicontinuous (see (Villani, 2008, Lemma 4.3, p. 43)), this implies . Applying the same argument to the tight sequence defined via
we conclude that there exists another subsequence of which also converges in . This concludes the proof. ∎
Appendix C Discussion, proofs and auxiliary results related to Theorem 5
C.1. Further discussion of Theorem 5
We note that a natural way to compute the sensitivity of would be to combine Theorem 2 with the chain rule and differentiation of the function . This cannot, however, be rigorously justified, as the following remark demonstrates.
Remark 25.
Let us point out that it is not true that is differentiable for under the sole assumption that is sufficiently smooth and .
To give an example, let , , and take and . A quick computation shows (independently of ). In particular and for all and is clearly not differentiable in .
Instead, we use a more involved argument, combining differentiability of with a Lagrangian approach. This however requires slightly stricter growth assumptions than the ones imposed in Assumption 1, which are specified in Assumption 4.
Example 26.
We provide detailed computations behind the square-root LASSO/Ridge regression example discussed in Bartl et al., 2021b. We consider , . We fix norms , , for some , and . We recall that then (5) holds and we can apply our methodology for . In general we have
and
Recalling the convention that is given by
we conclude
where is the identity matrix. Recall furthermore that for all and in particular . Set now
Then and for . As does not depend on the last coordinate, we also write simply for . As we have in particular
In conclusion
Let us now specialise to the typical statistical context and let equal the empirical measure of data samples, i.e., for some points and . Let us write and . Then in particular
and we recover the notation common in statistics. In particular, . If we now assume that (and hence ), then we can easily compute
Note that, under the assumption that , is defined as
Thus in the case we have
Furthermore, in the case we have
(26)
We also have
so (26) simplifies to
Remark 27.
While is not strictly convex, the above example can still be adapted to cover this case under the additional assumption that has no entries which are equal to zero. Indeed, we note that is continuous (even constant) at every point except if a component of is equal to zero. Thus the proof of Lemma 30 still applies if we assume that has -a.s. no components which are equal to zero instead of merely assuming that -a.s.
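Returning to Example 26, the following sketch illustrates the shrinkage effect numerically. It is only a sketch: it uses the square-root Ridge objective √(empirical MSE) + δ‖θ‖₂ as a stand-in for the robust regression problem, so both the choice of the ℓ₂ penalty and the identification of the robust problem with this penalized form are assumptions for illustration rather than the formula in (26), and the data-generating process is synthetic.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch (hedged): square-root Ridge as a stand-in for the robust regression problem,
#   theta_delta = argmin_theta  sqrt(mean((y - X theta)^2)) + delta * ||theta||_2 .
# The penalized form and the l2 penalty are illustrative assumptions; the paper's example
# fixes (semi)norms whose exact form is not reproduced here. We compare theta_delta with
# the OLS solution and report the finite-difference slope (theta_delta - theta_ols)/delta.

rng = np.random.default_rng(3)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.5 * rng.normal(size=n)

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

def objective(theta, delta):
    return np.sqrt(np.mean((y - X @ theta) ** 2)) + delta * np.linalg.norm(theta)

delta = 1e-2
theta_delta = minimize(objective, theta_ols, args=(delta,), method="BFGS").x

print("OLS coefficients        :", np.round(theta_ols, 4))
print("penalized coefficients  :", np.round(theta_delta, 4))
print("finite-difference slope :", np.round((theta_delta - theta_ols) / delta, 4))
```

In typical runs the finite-difference slope points roughly opposite to the OLS coefficients, which is the shrinkage behaviour the example describes.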
Example 28.
We provide further details and discussion to complement the out-of-sample error example in Bartl et al., 2021b. First, we recall the remainder term obtained therein:
Recall that in holds a.s. We suppose that Assumptions 1 and 4 hold, and that for any , there exists such that the following hold uniformly for all :
(27)
Recall from (9) that we already know that a.s. Under the above integrability assumption, Lemma 21 gives
with analogous convergence for the other two terms in . We conclude that a.s. and that (12) and (13) hold.
We now show how the arguments above can be adapted to extend and complement (Anderson and Philpott, 2019, Prop. 17). Therein, the authors study the VRS, which is the expectation over realisations of of
If VRS then, on average, the robust problem offers an improved performance, i.e., it finds a better approximation to the true optimizer than the classical non-robust problem. If we work with the difference above, then we look at a first-order Taylor expansion and obtain
which holds under the first condition in (27). This can be compared with (Anderson and Philpott, 2019, Lemma 1), which was derived under a Lipschitz continuity assumption on . For the quadratic case of (Anderson and Philpott, 2019, Prop. 17) we have , where we took for notational simplicity. We then have and . Specialising (10) to this setting, with , gives
While our results work for , see Remark 9, we can formally let . The last term then converges to , which recovers (Anderson and Philpott, 2019, Prop. 17), taking into account that
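A rough numerical counterpart of this discussion can be obtained by Monte Carlo. The sketch below is a heuristic, not the construction of Anderson and Philpott, (2019) or of this paper: it uses a newsvendor-type loss, replaces the exact robust problem by the first-order surrogate objective E_{μ_N}[f(a, ·)] + δ (E_{μ_N}[|∂_ξ f(a, ·)|²])^{1/2}, and estimates a VRS-type quantity by resampling datasets from a known distribution; all distributions, costs and parameters are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch (hedged): Monte Carlo estimate of a VRS-type quantity for the newsvendor loss
#   f(a, xi) = h * max(a - xi, 0) + b * max(xi - a, 0),
# comparing the empirical minimizer a_N with a robustified a_N^delta obtained from the
# first-order surrogate  E_muN[f(a, .)] + delta * (E_muN[|f_xi(a, .)|^2])^(1/2).
# The surrogate, the loss and the distributions are illustrative, not taken from the paper.

rng = np.random.default_rng(4)
h, b = 1.0, 3.0                      # holding / backorder costs (illustrative)
N, delta, reps = 30, 0.1, 500
true_sample = rng.exponential(scale=1.0, size=200_000)   # proxy for the true measure

def emp_loss(a, xi):
    return np.mean(h * np.maximum(a - xi, 0.0) + b * np.maximum(xi - a, 0.0))

def surrogate(a, xi, d):
    grad_sq = np.mean(np.where(xi < a, h**2, b**2))      # E[|weak derivative in xi|^2]
    return emp_loss(a, xi) + d * np.sqrt(grad_sq)

gaps = []
for _ in range(reps):
    xi = rng.exponential(scale=1.0, size=N)              # one observed dataset
    a_plain = minimize_scalar(lambda a: emp_loss(a, xi), bounds=(0, 10), method="bounded").x
    a_robust = minimize_scalar(lambda a: surrogate(a, xi, delta), bounds=(0, 10), method="bounded").x
    gaps.append(emp_loss(a_plain, true_sample) - emp_loss(a_robust, true_sample))

print(f"estimated VRS-type gap: {np.mean(gaps):+.4f} (positive favors the robustified choice)")
```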
C.2. Proofs and auxiliary results related to Theorem 5
Lemma 29.
Let be differentiable such that is continuous, fix , and assume that for some we have that and for some , all and all close to . Further fix and recall that is the set of maximizing measures given the strategy . Then the (one-sided) directional derivative of at in the direction is given by
Proof.
Fix . We start by showing that
(28)
To that end, let and be arbitrary. By definition of one has . Moreover implies that . Note that the assumption implies
Therefore, by dominated convergence, one has
and as was arbitrary, this shows (28).
We proceed to show that
(29)
For every sufficiently small let be such that . The existence of such is guaranteed by Lemma 24, which also guarantees that (possibly after passing to a subsequence) there is such that in . We claim that . By Lemma 21 one has
On the other hand, for any choice one has
This implies and in particular . At this point expand
so that
where we used for the last inequality. Recall that converges to in and by assumption for all close to . In particular
for sufficiently small. As furthermore is continuous, we conclude by Lemma 21 that
Lastly, by Fubini’s theorem and dominated convergence (in )
as , which ultimately shows (29). ∎
Lemma 30.
Let and let be measurable such that and such that -a.s. Then we have that
(30)
where was defined in Lemma 7.
Proof.
First recall that is continuous and satisfies for every . Now define
Similarly, define by replacing in the definition of by . As in the proof of Theorem 2 we compute
This remains true when and are replaced by and , respectively. Moreover, Hölder’s inequality implies that
The first of these two inequalities immediately implies that the left hand side in (30) is larger than the right hand side.
To show the other inequality, note that is continuous and satisfies for , hence for all such that . Consequently one quickly computes for all such that . By dominated convergence we conclude that
and the claim follows. ∎
Let us lastly give the proof of Theorem 5 for general seminorms.
Proof of Theorem 5.
Recall the convention that and , as well as . Further recall that and converge to as . In order to show
we first show that for every
(31)
where we recall that is the -th coordinate of the vector . We start with the “”-inequality in (31). For any , the fundamental theorem of calculus implies that
Moreover, by Lemma 29 the function is (one-sided) directionally differentiable at for all small and thus for all
(32)
where we recall is the set of all for which . We now encode the optimality of in via a Lagrange multiplier to obtain
(33)
In a similar manner, we trivially have
(34)
for any , as . Applying (32) and then (33), (34) we thus conclude for
(35)
As in the proof of Lemma 29 we note that is compact in and both terms inside the grow at most as by Assumption 4. Thus, using (Terkelsen, 1973, Cor. 2, p. 411), we can interchange the infimum and supremum in the last line above. Recall that
whence (35) is equal to
For every fixed we can follow the arguments in the proof of Theorem 2 to see that, when divided by , the term inside the infimum converges to
(36)
as . Note that following these arguments requires the following properties, which are a direct consequence of Assumptions 1 and Assumption 4:
• is differentiable on ,
• is differentiable on for every ,
• is continuous,
• is continuous,
• for every there is such that for all and with ,
• for every there is such that for all and with ,
• for all sufficiently small we have and for every sequence such that and such that for all there is a subsequence which converges to some .
Suppose first that -a.s. Then the right hand side of (31) is equal to zero. Moreover, taking in (36), we also have that , which proves that indeed the left hand side in (31) is smaller than the right hand side.
Now suppose that -a.s. Then, using the inequality “” and Lemma 30 to compute the last term (noting that by assumption), we conclude that indeed the left hand side in (31) is smaller than the right hand side.
The reverse “”-inequality in (31) follows by the very same arguments. Indeed, Lemma 29 implies that
for all and we can write
From here on we argue as in the “”-inequality to conclude that (31) holds.
By assumption the matrix is invertible. Therefore, in a small neighborhood of , the mapping is invertible. In particular
where the second equality holds by the first order condition for optimality of . Applying the chain rule and using (31) gives
This completes the proof. ∎