
Sensitivity analysis of Wasserstein distributionally robust optimization problems

Daniel Bartl (Department of Mathematics, University of Vienna), Samuel Drapeau (School of Mathematical Sciences and Shanghai Advanced Institute of Finance (SAIF/CAFR), Shanghai Jiao Tong University), Jan Obłój and Johannes Wiesel (Department of Statistics, Columbia University)
Abstract.

We consider sensitivity of a generic stochastic optimization problem to model uncertainty. We take a non-parametric approach and capture model uncertainty using Wasserstein balls around the postulated model. We provide explicit formulae for the first order correction to both the value function and the optimizer and further extend our results to optimization under linear constraints. We present applications to statistics, machine learning, mathematical finance and uncertainty quantification. In particular, we provide an explicit first-order approximation of the square-root LASSO regression coefficients and deduce coefficient shrinkage compared to the ordinary least squares regression. We consider robustness of call option pricing and deduce a new Black-Scholes sensitivity, a non-parametric version of the so-called Vega. We also compute sensitivities of optimized certainty equivalents in finance and propose measures to quantify robustness of neural networks to adversarial examples.

Key words and phrases:
Robust stochastic optimization, Sensitivity analysis, Uncertainty quantification, Non-parametric uncertainty, Wasserstein metric
Support from the European Research Council under the EU’s 7th FP / ERC grant agreement no. 335421, the Vienna Science and Technology Fund (WWTF) project MA16-021 and the Austrian Science Fund (FWF) project P28661 as well as the National Science Foundation of China, Grant Numbers 11971310 and 11671257, Shanghai Jiao Tong University, Grant “Assessment of Risk and Uncertainty in Finance” number AF0710020 are gratefully acknowledged. We thank Jose Blanchet, Peyman Mohajerin Esfahani, Daniel Kuhn and Mike Giles for helpful comments on an earlier version of the paper.

1. Introduction

We consider a generic stochastic optimization problem

(1) infa𝒜𝒮f(x,a)μ(dx),\inf_{a\in\mathcal{A}}\int_{\mathcal{S}}f\left(x,a\right)\,\mu(dx),

where 𝒜\mathcal{A} is the set of actions or choices, ff is the loss function and μ\mu is a probability measure over the space of states 𝒮\mathcal{S}. Such problems are found across the whole of applied mathematics. The measure μ\mu is the crucial input and it could represent, e.g., a dynamic model of the system, as is often the case in mathematical finance or mathematical biology, or the empirical measure of observed data points, or the training set, as is the case in statistics and machine learning applications. In virtually all cases, there is a certain degree of uncertainty around the choice of μ\mu coming from modelling choices and simplifications, incomplete information, data errors, finite sample error, etc. It is thus very important to understand the influence of changes in μ\mu on (1), both on its value and on its optimizer. Often, the choice of μ\mu is done in two stages: first a parametric family of models is adopted and then the values of the parameters are fixed. Sensitivity analysis of (1) with changing parameters is a classical topic explored in parametric programming and statistical inference, e.g., Armacost and Fiacco, (1974); Vogel, (2007); Bonnans and Shapiro, (2013). It also underpins a lot of progress in the field of uncertainty quantification, see Ghanem et al., (2017). Considering μ\mu as an abstract parameter, the mathematical programming literature looked into qualitative and quantitative stability of (1). We refer to Dupacova, (1990); Romisch, (2003) and the references therein. When μ\mu represents data samples, there has been considerable interest in the optimization community in designing algorithms which are robust and, in particular, do not require excessive hypertuning, see Asi and Duchi, (2019) and the references therein.

A more systematic approach to model uncertainty in (1) is offered by the distributionally robust optimization problem

(2) V(δ):=infa𝒜V(δ,a):=infa𝒜supνBδ(μ)𝒮f(x,a)ν(dx),\displaystyle V(\delta)\mathrel{\mathop{:}}=\inf_{a\in\mathcal{A}}V(\delta,a)\mathrel{\mathop{:}}=\inf_{a\in\mathcal{A}}\sup_{\nu\in B_{\delta}\left(\mu\right)}\int_{\mathcal{S}}f\left(x,a\right)\,\nu(dx),

where Bδ(μ)B_{\delta}\left(\mu\right) is a ball of radius δ\delta around μ\mu in the space of probability measures, as specified below. Such problems greatly generalize more classical robust optimization and have been studied extensively in operations research and machine learning; we refer the reader to Rahimian and Mehrotra, (2019) and the references therein. Our goal in this paper is to understand the behaviour of these problems for small δ\delta. Our main results compute first-order behaviour of V(δ)V(\delta) and its optimizer for small δ\delta. This offers a measure of sensitivity to errors in model choice and/or specification, and identifies the abstract direction, in the space of models, in which the change is most pronounced. We use examples to show that our results can be applied across a wide spectrum of science.

This paper is organised as follows. We first present the main results and then, in section 3, explore their applications. Further discussion of our results and the related literature is found in section 4, which is then followed by the proofs. Online appendix Bartl et al., 2021a contains many supplementary results and remarks, as well as some more technical arguments from the proofs.

2. Main results

Take d,kd,k\in\mathbb{N}, endow d\mathbb{R}^{d} with the Euclidean norm |||\cdot| and write Γo{\Gamma}^{o} for the interior of a set Γ\Gamma. Assume that 𝒮\mathcal{S} is a closed convex subset of d\mathbb{R}^{d}. Let 𝒫(𝒮)\mathcal{P}(\mathcal{S}) denote the set of all (Borel) probability measures on 𝒮\mathcal{S}. Further fix a seminorm \|\cdot\| on d\mathbb{R}^{d} and denote by \|\cdot\|_{\ast} its (extended) dual norm, i.e., y:=sup{x,y:x1}\|y\|_{\ast}:=\sup\{\langle x,y\rangle:\|x\|\leq 1\}. In particular, for =||\|\cdot\|=|\cdot| we also have =||\|\cdot\|_{\ast}=|\cdot|. For μ,ν𝒫(𝒮)\mu,\nu\in\mathcal{P}(\mathcal{S}), we define the pp-Wasserstein distance as

Wp(μ,ν)=inf{𝒮×𝒮xypπ(dx,dy):πCpl(μ,ν)}1/p,W_{p}(\mu,\nu)=\inf\left\{\int_{\mathcal{S}\times\mathcal{S}}\|x-y\|_{\ast}^{p}\,\pi(dx,dy)\colon\pi\in\mathrm{Cpl}(\mu,\nu)\right\}^{1/p},

where Cpl(μ,ν)\mathrm{Cpl}(\mu,\nu) is the set of all probability measures π\pi on 𝒮×𝒮\mathcal{S}\times\mathcal{S} with first marginal π1:=π(×𝒮)=μ\pi_{1}:=\pi(\cdot\times\mathcal{S})=\mu and second marginal π2:=π(𝒮×)=ν\pi_{2}:=\pi(\mathcal{S}\times\cdot)=\nu. Denote the Wasserstein ball

Bδ(μ)={ν𝒫(𝒮):Wp(μ,ν)δ}B_{\delta}(\mu)=\left\{\nu\in\mathcal{P}(\mathcal{S}):W_{p}(\mu,\nu)\leq\delta\right\}

of size δ0\delta\geq 0 around μ\mu. Note that, taking a suitable probability space (Ω,𝔽,)(\Omega,\mathbb{F},\mathbb{P}) and a random variable XμX\sim\mu, we have the following probabilistic representation of V(δ,a)V(\delta,a):

supνBδ(μ)𝒮f(x,a)ν(dx)=supZ𝔼[f(X+Z,a)]\sup_{\nu\in B_{\delta}\left(\mu\right)}\int_{\mathcal{S}}f\left(x,a\right)\,\nu(dx)=\sup_{Z}\mathbb{E}_{\mathbb{P}}[f(X+Z,a)]

where the supremum is taken over all ZZ satisfying X+Z𝒮X+Z\in\mathcal{S} almost surely and 𝔼[Zp]δp\mathbb{E}_{\mathbb{P}}[\|Z\|_{\ast}^{p}]\leq\delta^{p}. Wasserstein distances and optimal transport techniques have proved to be powerful and versatile tools in a multitude of applications, from economics Chiappori et al., (2010); Carlier and Ekeland, (2010) to image recognition Peyré and Cuturi, (2019). The idea to use Wasserstein balls to represent model uncertainty was pioneered in Pflug and Wozabal, (2007) in the context of investment problems. When sampling from a measure with a finite pthp^{\textrm{th}} moment, the measures converge to the true distribution and Wasserstein balls around the empirical measures have the interpretation of confidence sets, see Fournier and Guillin, (2014). In this setup, the radius δ\delta can then be chosen as a function of a given confidence level α\alpha and the sample size NN. This yields finite samples guarantees and asymptotic consistency, see Mohajerin Esfahani and Kuhn, (2018); Obłój and Wiesel, (2021), and justifies the use of the Wasserstein metric to capture model uncertainty. The value V(δ,a)V(\delta,a) in (2) has a dual representation, see Gao and Kleywegt, (2016); Blanchet and Murthy, (2019), which has led to significant new developments in distributionally robust optimization, e.g., Mohajerin Esfahani and Kuhn, (2018); Blanchet et al., 2019a ; Kuhn et al., (2019); Shafieezadeh-Abadeh et al., (2019).
Naturally, other choices for the distance on the space of measures are also possible, such as the Kullback-Leibler divergence, see Lam, (2016) for general sensitivity results and Calafiore, (2007) for applications in portfolio optimization, or the Hellinger distance, see Lindsay, (1994) for a statistical robustness analysis. We refer to section 4 for a more detailed analysis of the state of the art in these fields. Both of these approaches have good analytic properties and often lead to theoretically appealing closed-form solutions. However, they are also very restrictive since any measure in the neighborhood of μ\mu has to be absolutely continuous with respect to μ\mu. In particular, if μ\mu is the empirical measure of NN observations then measures in its neighborhood have to be supported on those fixed NN points. To obtain meaningful results it is thus necessary to impose additional structural assumptions, which are often hard to justify solely on the basis of the data at hand and, equally importantly, create another layer of model uncertainty themselves. We refer to (Gao and Kleywegt, 2016, sec. 1.1) for further discussion of potential issues with ϕ\phi-divergences. The Wasserstein distance, while harder to handle analytically, is more versatile and does not require any such additional assumptions.

Throughout the paper we take the convention that continuity and closure are understood w.r.t. |||\cdot|. We assume that 𝒜k\mathcal{A}\subset\mathbb{R}^{k} is convex and closed and that the seminorm \|\cdot\| is strictly convex in the sense that for two elements x,ydx,y\in\mathbb{R}^{d} with ||x=y=1||x\|=\|y\|=1 and xy0\|x-y\|\neq 0, we have 12x+12y<1\|\frac{1}{2}x+\frac{1}{2}y\|<1 (note that this is satisfied for every lsl^{s}-norm |x|s:=(i=1d|xi|s)1/s|x|_{s}:=(\sum_{i=1}^{d}|x_{i}|^{s})^{1/s} for s>1s>1). We fix p(1,)p\in(1,\infty), let q:=p/(p1)q:=p/(p-1) so that 1/p+1/q=11/p+1/q=1, and fix μ𝒫(𝒮)\mu\in\mathcal{P}(\mathcal{S}) such that the boundary of 𝒮d\mathcal{S}\subset\mathbb{R}^{d} has μ\mu–zero measure and 𝒮|x|pμ(dx)<\int_{\mathcal{S}}|x|^{p}\,\mu(dx)<\infty. Denote by 𝒜δ\mathcal{A}^{\star}_{\delta} the set of optimizers for V(δ)V(\delta) in (2).

Assumption 1.

The loss function f:𝒮×𝒜f\colon\mathcal{S}\times\mathcal{A}\to\mathbb{R} satisfies

  • xf(x,a)x\mapsto f(x,a) is differentiable on 𝒮o{\mathcal{S}}^{o} for every a𝒜a\in\mathcal{A}. Moreover (x,a)xf(x,a)(x,a)\mapsto\nabla_{x}f(x,a) is continuous and for every r>0r>0 there is c>0c>0 such that |xf(x,a)|c(1+|x|p1)|\nabla_{x}f(x,a)|\leq c(1+|x|^{p-1}) for all x𝒮x\in\mathcal{S} and a𝒜a\in\mathcal{A} with |a|r|a|\leq r.

  • For all δ0\delta\geq 0 sufficiently small we have 𝒜δ\mathcal{A}^{\star}_{\delta}\neq\emptyset and for every sequence (δn)n(\delta_{n})_{n\in\mathbb{N}} such that limnδn=0\lim_{n\to\infty}\delta_{n}=0 and (an)n(a^{\star}_{n})_{n\in\mathbb{N}} such that an𝒜δna^{\star}_{n}\in\mathcal{A}^{\star}_{\delta_{n}} for all nn\in\mathbb{N} there is a subsequence which converges to some a𝒜0a^{\star}\in\mathcal{A}^{\star}_{0}.

The above assumption is not restrictive: the first part merely ensures existence of xf(,a)Lq(μ)\|\nabla_{x}f(\cdot,a^{\star})\|_{L^{q}(\mu)}, while the second part is satisfied as soon as either 𝒜\mathcal{A} is compact or V(0,)V(0,\cdot) is coercive, which is the case in most examples of interest, see (Bartl et al., 2021a, , Lemma  23) for further comments.

Theorem 2.

If Assumption 1 holds then V(0)V^{\prime}(0) is given by

Υ:=limδ0V(δ)V(0)δ=infa𝒜0(𝒮xf(x,a)qμ(dx))1/q.\Upsilon:=\lim_{\delta\to 0}\frac{V(\delta)-V(0)}{\delta}=\inf_{a^{\star}\in\mathcal{A}^{\star}_{0}}\Big{(}\int_{\mathcal{S}}\|\nabla_{x}f(x,a^{\star})\|^{q}\,\mu(dx)\Big{)}^{1/q}.
Remark 3.

Inspecting the proof, defining

V~(δ)=infa𝒜0supνBδ(μ)𝒮f(x,a)ν(dx)\tilde{V}(\delta)=\inf_{a^{\star}\in\mathcal{A}^{\star}_{0}}\sup_{\nu\in B_{\delta}\left(\mu\right)}\int_{\mathcal{S}}f\left(x,a^{\star}\right)\,\nu(dx)

we obtain V~(0)=V(0)\tilde{V}^{\prime}(0)=V^{\prime}(0). This means that for small δ>0\delta>0 there is no first-order gain from optimizing over all a𝒜a\in\mathcal{A} in the definition of V(δ)V(\delta) when compared with restricting simply to a𝒜0a^{\star}\in\mathcal{A}^{\star}_{0}, as in V~(δ)\tilde{V}(\delta).

The above result naturally extends to computing sensitivities of robust problems, i.e., V(r)V^{\prime}(r), see (Bartl et al., 2021a, , Corollary 13), as well as to the case of stochastic optimization under linear constraints, see (Bartl et al., 2021a, , Theorem 15). We recall that V(0,a)=𝒮f(x,a)μ(dx)V(0,a)=\int_{\mathcal{S}}f(x,a)\,\mu(dx).
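To illustrate Theorem 2 concretely, the following minimal Python sketch (our own illustration, not part of the paper) takes f(x,a)=(x-a)^2, \|\cdot\|=|\cdot|, p=q=2 and 𝒮=\mathbb{R}, with μ\mu the empirical measure of a simulated sample. In this quadratic case the robust value is available in closed form, V(\delta)=(\hat{\sigma}+\delta)^2 with \hat{\sigma} the standard deviation under μ\mu, so the finite-difference slope at δ=0 can be checked against Υ=2\hat{\sigma} from the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=10_000)   # sample representing mu

# Non-robust problem: f(x, a) = (x - a)^2, minimised at a* = mean
a_star = x.mean()
sigma = np.sqrt(np.mean((x - a_star) ** 2))

# Theorem 2 (p = q = 2): Upsilon = (int |grad_x f(x, a*)|^2 mu(dx))^{1/2} = 2 * sigma
upsilon = np.sqrt(np.mean((2.0 * (x - a_star)) ** 2))

# For this quadratic f the robust value is known in closed form:
# V(delta) = (sigma + delta)^2, so V'(0) = 2 * sigma.
def V(delta):
    return (sigma + delta) ** 2

delta = 1e-4
finite_diff = (V(delta) - V(0.0)) / delta

print(upsilon, 2 * sigma, finite_diff)   # all three agree up to O(delta)
```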

Assumption 4.

Suppose ff is twice continuously differentiable, a𝒜0𝒜oa^{\star}\in\mathcal{A}^{\star}_{0}\cap{\mathcal{A}}^{o} and

  • i=1k|aixf(x,a)|c(1+|x|p1ε)\sum_{i=1}^{k}|\nabla_{a_{i}}\nabla_{x}f(x,a)|\leq c(1+|x|^{p-1-\varepsilon}) for some ε>0\varepsilon>0, c>0c>0, all xdx\in\mathbb{R}^{d} and all aa close to aa^{\star}.

  • The function aV(0,a)a\mapsto V(0,a) is twice continuously differentiable in the neighbourhood of aa^{\star} and the matrix a2V(0,a)\nabla^{2}_{a}V(0,a^{\star}) is invertible.

Theorem 5.

Suppose a𝒜0a^{\star}\in\mathcal{A}^{\star}_{0} and aδ𝒜δa^{\star}_{\delta}\in\mathcal{A}^{\star}_{\delta} such that aδaa^{\star}_{\delta}\to a^{\star} as δ0\delta\to 0 and Assumptions 1 and 4 are satisfied. Then, if xf(x,a)0\nabla_{x}f(x,a^{\star})\neq 0 or xaf(x,a)=0\nabla_{x}\nabla_{a}f(x,a^{\star})=0 μ\mu-a.e.,

\displaystyle\beth :=limδ0aδaδ=(𝒮xf(x,a)qμ(dx))1q1\displaystyle:=\lim_{\delta\to 0}\frac{a^{\star}_{\delta}-a^{\star}}{\delta}=-\Big{(}\int_{\mathcal{S}}\|\nabla_{x}f(x,a^{\star})\|^{q}\,\mu(dx)\Big{)}^{\frac{1}{q}-1}
(a2V(0,a))1𝒮xaf(x,a)h(xf(x,a))xf(x,a)1qμ(dx),\displaystyle\quad\ \cdot(\nabla^{2}_{a}V(0,a^{\star}))^{-1}\int_{\mathcal{S}}\frac{\nabla_{x}\nabla_{a}f(x,a^{\star})\,h(\nabla_{x}f(x,a^{\star}))}{\|\nabla_{x}f(x,a^{\star})\|^{1-q}}\,\mu(dx),

where h:d{0}{xd:x=1}h\colon\mathbb{R}^{d}\setminus\{0\}\to\{x\in\mathbb{R}^{d}\ :\ \|x\|_{\ast}=1\} is the unique function satisfying ,h()=\langle\cdot,h(\cdot)\rangle=\|\cdot\|, see (Bartl et al., 2021a, , Lemma 7). In particular, h()=/||h(\cdot)=\cdot/|\cdot| if =||\|\cdot\|=|\cdot|.

Above and throughout the convention is that xf(x,a)d×1,\nabla_{x}f(x,a)\in\mathbb{R}^{d\times 1}, aixf(x,a)d×1\nabla_{a_{i}}\nabla_{x}f(x,a)\in\mathbb{R}^{d\times 1}, af(x,a)k×1\nabla_{a}f(x,a)\in\mathbb{R}^{k\times 1}, xaf(x,a)k×d\nabla_{x}\nabla_{a}f(x,a)\in\mathbb{R}^{k\times d} and 0/0=00/0=0. The assumed existence and convergence of optimizers holds, e.g., with suitable convexity of ff in aa, see (Bartl et al., 2021a, Lemma 22) for a worked out setting. In line with the financial economics practice, we gave our sensitivities letter symbols, Υ\Upsilon and \beth, loosely motivated by Υπo´δειγμα\Upsilon\pi\acute{o}\delta\varepsilon\iota\gamma\mu\alpha, the Greek for Model, and \beth, after the Hebrew for control.

3. Applications

We now illustrate the universality of Theorems 2 and 5 by considering their applications in a number of different fields. Unless otherwise stated, 𝒮=d\mathcal{S}=\mathbb{R}^{d}, 𝒜=k\mathcal{A}=\mathbb{R}^{k} and \int means 𝒮\int_{\mathcal{S}}.

3.1. Financial Economics

We start with the simple example of risk-neutral pricing of a call option written on an underlying asset (St)tT(S_{t})_{t\leq T}. Here, T,K>0T,K>0 are the maturity and the strike respectively, f(x,a)=(S0xK)+f(x,a)=(S_{0}x-K)^{+} and μ\mu is the distribution of ST/S0S_{T}/S_{0}. We set interest rates and dividends to zero for simplicity. In the Black and Scholes, (1973) model, μ\mu is a log-normal distribution, i.e., log(ST/S0)𝒩(σ2T/2,σ2T)\log(S_{T}/S_{0})\sim\mathcal{N}(-\sigma^{2}T/2,\sigma^{2}T) is Gaussian with mean σ2T/2-\sigma^{2}T/2 and variance σ2T\sigma^{2}T. In this case, V(0)V(0) is given by the celebrated Black-Scholes formula. Note that this example is particularly simple since ff is independent of aa. However, to ensure risk-neutral pricing, we have to impose a linear constraint on the measures in Bδ(μ)B_{\delta}(\mu), giving

(3) supνBδ(μ):xν(dx)=1(S0xK)+ν(dx).\sup_{\nu\in B_{\delta}(\mu)\colon\int x\nu(dx)=1}\int(S_{0}x-K)^{+}\nu(dx).

To compute its sensitivity we encode the constraint using a Lagrangian and apply Theorem 2, see (Bartl et al., 2021a, , Rem. 11, Thm. 15). For p=2p=2, letting k=K/S0k=K/S_{0} and μk=μ([k,))\mu_{k}=\mu([k,\infty)), the resulting formula, see (Bartl et al., 2021a, , Example 18), is given by

Υ\displaystyle\Upsilon =S0(𝟏xkμk)2μ(dx)=S0μk(1μk).\displaystyle=S_{0}\sqrt{\int\Big{(}\mathbf{1}_{x\geq k}-\mu_{k}\Big{)}^{2}\mu(dx)}=S_{0}\sqrt{\mu_{k}(1-\mu_{k})}.

Let us specialise to the log-normal distribution of the Black-Scholes model above and denote the quantity in (3) as BS(δ)\mathcal{R}BS(\delta). It may be computed exactly using methods from Bartl et al., (2020) and Figure 1 compares the exact value and the first order approximation.

Figure 1. DRO value \mathcal{R}BS(\delta) vs the first order (FO) approximation \mathcal{R}BS(0)+\Upsilon\delta, S_{0}=T=1, K=1.2, \sigma=0.2.

We have Υ=S0Φ(d)(1Φ(d))\Upsilon=S_{0}\sqrt{\Phi(d_{-})(1-\Phi(d_{-}))}, where d=log(S0/K)σ2T/2σTd_{-}=\frac{\log(S_{0}/K)-\sigma^{2}T/2}{\sigma\sqrt{T}} and Φ\Phi is the cdf of the 𝒩(0,1)\mathcal{N}(0,1) distribution. It is also insightful to compare Υ\Upsilon with a parametric sensitivity. If, instead of Wasserstein balls, we consider {𝒩(σ~2T/2,σ~2T):|σσ~|δ}\{\mathcal{N}(-\tilde{\sigma}^{2}T/2,\tilde{\sigma}^{2}T):|\sigma-\tilde{\sigma}|\leq\delta\}, then the resulting sensitivity is known as the Black-Scholes Vega and given by 𝒱=S0Φ(d+σT)\mathcal{V}=S_{0}\Phi^{\prime}(d_{-}+\sigma\sqrt{T}). We plot the two sensitivities in Figure 2. It is remarkable how, for the range of strikes of interest, the non-parametric model sensitivity Υ\Upsilon traces out the usual shape of 𝒱\mathcal{V} but shifted upwards to account for the idiosyncratic risk of departure from the log-normal family. More generally, given a book of options with payoff f=f+ff=f^{+}-f^{-} at time TT, with f+,f0f^{+},f^{-}\geq 0, we could say that the book is Υ\Upsilon-neutral if the sensitivity Υ\Upsilon were the same for f+f^{+} and for ff^{-}. In analogy to the standard Delta-Vega hedging, one could develop a non-parametric, model-agnostic Delta-Upsilon hedging. We believe these ideas offer potential for exciting industrial applications and we leave them to further research.

Figure 2. Black-Scholes model: Υ\Upsilon vs 𝒱\mathcal{V}, S0=T=1S_{0}=T=1, σ=0.2\sigma=0.2.
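The comparison in Figure 2 is straightforward to reproduce. The sketch below (our own, with the same parameters as in the figure) evaluates Υ=S_{0}\sqrt{\Phi(d_{-})(1-\Phi(d_{-}))} and the Black-Scholes Vega 𝒱=S_{0}\Phi^{\prime}(d_{-}+\sigma\sqrt{T}) across a range of strikes.

```python
import numpy as np
from scipy.stats import norm

S0, T, sigma = 1.0, 1.0, 0.2
strikes = np.linspace(0.5, 1.5, 11)

d_minus = (np.log(S0 / strikes) - 0.5 * sigma**2 * T) / (sigma * np.sqrt(T))

# Non-parametric first-order sensitivity (Wasserstein, p = 2, martingale constraint)
upsilon = S0 * np.sqrt(norm.cdf(d_minus) * (1.0 - norm.cdf(d_minus)))

# Classical parametric sensitivity: Black-Scholes Vega, as given in the text
vega = S0 * norm.pdf(d_minus + sigma * np.sqrt(T))

for K, u, v in zip(strikes, upsilon, vega):
    print(f"K={K:.2f}  Upsilon={u:.4f}  Vega={v:.4f}")
```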

We turn now to the classical notion of optimized certainty equivalent (OCE) of Ben Tal and Teboulle, (1986). It is a decision theoretic criterion designed to split a liability between today and tomorrow’s payments. It is also a convex risk measure in the sense of Artzner et al., (1999) and covers many of the popular risk measures such as expected shortfall or entropic risk, see Ben Tal and Teboulle, (2007). We fix a convex monotone function l:l\colon\mathbb{R}\to\mathbb{R} which is bounded from below and g:dg\colon\mathbb{R}^{d}\to\mathbb{R}. Here gg represents the payoff of a financial position and ll is the negative of a utility function, or a loss function. We take =||\|\cdot\|=|\cdot| and refer to (Bartl et al., 2021a, Lemma 22) for generic sufficient conditions for Assumptions 1 and 4 to hold in this setup. The OCE corresponds to VV in (1) for f(x,a)=l(g(x)a)+af(x,a)=l(g(x)-a)+a and 𝒜=\mathcal{A}=\mathbb{R}, 𝒮=d\mathcal{S}=\mathbb{R}^{d}. Theorems 2 and 5 yield the sensitivities

Υ=infa𝒜0(|l(g(x)a)g(x)|qμ(dx))1/q,\displaystyle\Upsilon=\inf_{a^{\star}\in\mathcal{A}^{\star}_{0}}\left(\int\big{|}l^{\prime}\big{(}g(x)-a^{\star}\big{)}\nabla g(x)\big{|}^{q}\,\mu(dx)\right)^{1/q},
\displaystyle\beth=\Big{(}\int|l^{\prime}(g(x)-a^{\star})\,\nabla g(x)|^{2}\,\mu(dx)\Big{)}^{-1/2}
l′′(g(x)a)l(g(x)a)(g(x))2μ(dx)l′′(g(x)a)μ(dx),\displaystyle\qquad\qquad\cdot\frac{\int l^{\prime\prime}(g(x)-a^{\star})\,l^{\prime}(g(x)-a^{\star})\,(\nabla g(x))^{2}\,\mu(dx)}{\int l^{\prime\prime}(g(x)-a^{\star})\,\mu(dx)},

where, for simplicity, we took p=q=2p=q=2 for the latter.
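As a concrete instance of these OCE formulae, the following sketch (our own; the choices g(x)=x, l(y)=e^{y} and a Gaussian μ\mu are illustrative assumptions) computes a^{\star} by a one-dimensional minimization and then evaluates Υ\Upsilon and \beth by Monte Carlo, with p=q=2.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=50_000)          # sample representing mu, with g(x) = x

l = np.exp                                      # loss l(y) = e^y:  l' = l'' = exp
def oce_objective(a):                           # f(x, a) = l(g(x) - a) + a
    return np.mean(l(x - a)) + a

a_star = minimize_scalar(oce_objective).x       # non-robust optimizer (a* = log E[e^X])

lp  = np.exp(x - a_star)                        # l'(g(x) - a*) * grad g(x), grad g = 1
lpp = np.exp(x - a_star)                        # l''(g(x) - a*)

upsilon = np.sqrt(np.mean(lp ** 2))             # first-order correction to the value
beth = np.mean(lpp * lp) / (np.sqrt(np.mean(lp ** 2)) * np.mean(lpp))  # optimizer shift

print(a_star, upsilon, beth)
```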
A related problem considers hedging strategies which minimise the expected loss of the hedged position, i.e., f(x,a)=l(g(x)+a,xx0)f(x,a)=l\left(g(x)+\langle a,x-x_{0}\rangle\right), where 𝒜=k\mathcal{A}=\mathbb{R}^{k} and (x0,x)(x_{0},x) represent today and tomorrow’s traded prices. We compute Υ\Upsilon as

infa𝒜0(|l(g(x)+a,xx0)(g(x)+a)|qμ(dx))1/q.\displaystyle\inf_{a^{\star}\in\mathcal{A}^{\star}_{0}}\left(\int\big{|}l^{\prime}\big{(}g(x)+\langle a^{\star},x-x_{0}\rangle\big{)}(\nabla g(x)+a^{\star})\big{|}^{q}\,\mu(dx)\right)^{1/q}.

Further, we can combine loss minimization with OCE and consider a=(H,m)k×a=(H,m)\in\mathbb{R}^{k}\times\mathbb{R}, f(x,(H,m))=l(g(x)+H,xx0+m)mf(x,(H,m))=l(g(x)+\langle H,x-x_{0}\rangle+m)-m. This gives V(0)V^{\prime}(0) as the infimum over (H,m)𝒜0(H^{\star},m^{\star})\in\mathcal{A}^{\star}_{0} of

(|l(g(x)+H,xx0+m)(g(x)+H)|qμ(dx))1/q.\left(\int\big{|}l^{\prime}\big{(}g(x)+\langle H^{\star},x-x_{0}\rangle+m^{\star}\big{)}\left(\nabla g(x)+H^{\star}\right)\big{|}^{q}\,\mu(dx)\right)^{1/q}.

The above formulae capture non-parametric sensitivity to model uncertainty for examples of key risk measurements in financial economics. To the best of our knowledge this has not been achieved before.

Finally, we consider briefly the classical mean-variance optimization of Markowitz, (1952). Here μ\mu represents the loss distribution across the assets and ada\in\mathbb{R}^{d}, i=1dai=1\sum_{i=1}^{d}a_{i}=1 are the relative investment weights. The original problem is to minimise the sum of the expectation and γ\gamma times the standard deviation of the returns a,X\langle a,X\rangle, with XμX\sim\mu. Using the ideas in (Pflug et al., 2012, Example 2) and considering measures on d×d\mathbb{R}^{d}\times\mathbb{R}^{d}, we can recast the problem as (1). Whilst Pflug et al., (2012) focused on the asymptotic regime δ\delta\to\infty, their non-asymptotic statements are related to our Theorem 2 and either result could be used here to obtain that V(δ)V(0)+1γ2δV(\delta)\approx V(0)+\sqrt{1-\gamma^{2}}\delta.

3.2. Neural Networks

We specialise now to quantifying robustness of neural networks (NN) to adversarial examples. This has been an important topic in machine learning since Szegedy et al., (2013) observed that NN consistently misclassify inputs formed by applying small worst-case perturbations to a data set. This produced a number of works offering either explanations for these effects or algorithms to create such adversarial examples, e.g. Goodfellow et al., (2014); Li et al., (2019); Carlini and Wagner, (2017); Wong and Kolter, (2017); Weng et al., (2018); Araujo et al., (2019); Mangal et al., (2019) to name just a few. The main focus of research works in this area, see Bastani et al., (2016), has been on faster algorithms for finding adversarial examples, typically leading to an overfit to these examples without any significant generalisation properties. The viewpoint has been mainly pointwise, e.g., Szegedy et al., (2013), with some generalisations to probabilistic robustness, e.g., Mangal et al., (2019).

In contrast, we propose a simple metric for measuring robustness of neural networks which is independent of the architecture employed and the algorithms for identifying adversarial examples. In fact, Theorem 2 offers a simple and intuitive way to formalise robustness of neural networks: for simplicity consider a 1-layer neural network trained on a given distribution μ\mu of pairs (x,y)(x,y), i.e. (A1,A2,b1,b2)(A^{\star}_{1},A_{2}^{\star},b_{1}^{\star},b_{2}^{\star}) solve

inf\displaystyle\inf\int |y((A2()+b2)σ(A1()+b1))(x)|pμ(dx,dy),\displaystyle|y-\left((A_{2}(\cdot)+b_{2})\circ\sigma\circ(A_{1}(\cdot)+b_{1})\right)(x)|^{p}\,\mu(dx,dy),

where the inf\inf is taken over a=(A1,A2,b1,b2)𝒜=k×d×d×k×k×da=(A_{1},A_{2},b_{1},b_{2})\in\mathcal{A}=\mathbb{R}^{k\times d}\times\mathbb{R}^{d\times k}\times\mathbb{R}^{k}\times\mathbb{R}^{d}, for a given activation function σ:\sigma:\mathbb{R}\to\mathbb{R}, with the composition above understood componentwise. Set f(x,y;A,b):=|y(A2()+b2)σ(A1()+b1)(x)|pf(x,y;A,b):=|y-(A_{2}(\cdot)+b_{2})\circ\sigma\circ(A_{1}(\cdot)+b_{1})(x)|^{p}. Data perturbations are captured by νBδp(μ)\nu\in B_{\delta}^{p}(\mu) and (2) offers a robust training procedure. The first order quantification of the NN sensitivity to adversarial data is then given by

(|f(x,y;A,b)|qμ(dx,dy))1/q.\displaystyle\left(\int|\nabla f(x,y;A^{\star},b^{\star})|^{q}\,\mu(dx,dy)\right)^{1/q}.
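A minimal numpy sketch of this sensitivity (our own illustration; the weights below are random stand-ins for a trained network and tanh is an arbitrary activation choice) evaluates, for p=q=2, the L^{2}(μ)-norm of the gradient of ff with respect to the data point (x,y) over a finite sample.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, N = 5, 16, 1_000

# Stand-ins for trained 1-layer network weights (in practice, use the fitted ones)
A1, b1 = rng.normal(size=(k, d)), rng.normal(size=k)
A2, b2 = rng.normal(size=(d, k)), rng.normal(size=d)

X = rng.normal(size=(N, d))                     # training inputs
Y = rng.normal(size=(N, d))                     # training targets

sigma, dsigma = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

grad_sq = np.empty(N)
for i in range(N):
    z = A1 @ X[i] + b1
    y_hat = A2 @ sigma(z) + b2
    r = Y[i] - y_hat                            # residual, f = |r|^2 (p = 2)
    J = A2 @ np.diag(dsigma(z)) @ A1            # Jacobian d y_hat / d x
    grad_x = -2.0 * J.T @ r                     # gradient of f in x
    grad_y = 2.0 * r                            # gradient of f in y
    grad_sq[i] = np.sum(grad_x ** 2) + np.sum(grad_y ** 2)

sensitivity = np.sqrt(grad_sq.mean())           # (int |grad f|^q dmu)^{1/q}, q = 2
print(sensitivity)
```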

A similar viewpoint, capturing robustness to adversarial examples through optimal transport lens, has been recently adopted by other authors. The dual formulation of (2) was used by Shafieezadeh-Abadeh et al., (2019) to reduce the training of neural networks to tractable linear programs. Sinha et al., (2020) modified (2) to consider a penalised problem infa𝒜supν𝒫(𝒮)𝒮f(x,a)ν(dx)γWp(μ,ν)\inf_{a\in\mathcal{A}}\sup_{\nu\in\mathcal{P}(\mathcal{S})}\int_{\mathcal{S}}f\left(x,a\right)\,\nu(dx)-\gamma W_{p}(\mu,\nu) to propose new stochastic gradient descent algorithms with inbuilt robustness to adversarial data.

3.3. Uncertainty Quantification

In the context of UQ the measure μ\mu represents input parameters of a (possibly complicated) operation GG in a physical, engineering or economic system. We consider the so-called reliability or certification problem: for a given set EE of undesirable outcomes, one wants to control supν𝒫ν(G(x)E)\sup_{\nu\in\mathcal{P}}\nu(G(x)\in E), for a set of probability measures 𝒫\mathcal{P}. The distributionally robust adversarial classification problem considered recently by Ho-Nguyen and Wright, (2020) is also of this form, with Wasserstein balls 𝒫\mathcal{P} around an empirical measure of NN samples. Using the dual formulation of Blanchet and Murthy, (2019), they linked the problem to minimization of the conditional value-at-risk and proposed a reformulation, and numerical methods, in the case of linear classification. We propose instead a regularised version of the problem and look for

δ(α):=sup{δ0:infνBδ(μ)d(G(x),E)ν(dx)α}\displaystyle\delta(\alpha):=\sup\left\{\delta\geq 0:\ \inf_{\nu\in B_{\delta}(\mu)}\int d(G(x),E)\,\nu(dx)\geq\alpha\right\}

for a given safety level α\alpha. We thus consider the average distance to the undesirable set, d(G(x),E):=infeE|G(x)e|d(G(x),E):=\inf_{e\in E}|G(x)-e|, and not just its probability. The quantity δ(α)\delta(\alpha) could then be used to quantify the implicit uncertainty of the certification problem, where higher δ\delta corresponds to less uncertainty. Taking statistical confidence bounds of the empirical measure in Wasserstein distance into account, see Fournier and Guillin, (2014), δ\delta would then determine the minimum number of samples needed to estimate the empirical measure.

Assume that EE is convex. Then xd(x,E)x\mapsto d(x,E) is differentiable everywhere except on the boundary of EE, with xd(x,E)=0\nabla_{x}d(x,E)=0 for xEox\in{E}^{o} and |xd(x,E)|=1|\nabla_{x}d(x,E)|=1 for all xE¯cx\in\bar{E}^{c}. Further, assume μ\mu is absolutely continuous w.r.t. Lebesgue measure on 𝒮\mathcal{S}. Theorem 2, using (Bartl et al., 2021a, Remark 11), gives a first-order expansion for the above problem

infνBδ(μ)d(G(x),E)ν(dx)=d(G(x),E)μ(dx)\displaystyle\inf_{\nu\in B_{\delta}(\mu)}\int d(G(x),E)\,\nu(dx)=\int d(G(x),E)\,\mu(dx)
(|xd(G(x),E)xG(x)|qμ(dx))1/qδ+o(δ).\displaystyle\qquad-\left(\int|\nabla_{x}d(G(x),E)\nabla_{x}G(x)|^{q}\,\mu(dx)\right)^{1/q}\delta+o(\delta).

In the special case xG(x)=cI\nabla_{x}G(x)=cI this simplifies to

d(G(x),E)μ(dx)c(μ(G(x)E))1/qδ+o(δ)\displaystyle\int d(G(x),E)\,\mu(dx)-c\left(\mu(G(x)\notin E)\right)^{1/q}\delta+o(\delta)

and the minimal measure ν\nu pushes every point G(x)G(x) not contained in EE in the direction of the orthogonal projection. This recovers the intuition of (Chen et al., 2018, Theorem 1), which in turn relies on (Gao and Kleywegt, 2016, Corollary 2, Example 7). Note however that our result holds for general measures μ\mu. We also note that such an approximation could provide an ansatz for dimension reduction, by identifying the dimensions for which the partial derivatives are negligible and then projecting GG onto the corresponding lower-dimensional subspace (thus providing a simpler surrogate for GG). This would be an alternative to a basis expansion (e.g., orthogonal polynomials) used in UQ and would exploit the interplay of properties of GG and μ\mu simultaneously.
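The expansion is easy to evaluate numerically. In the sketch below (our own; the choices of GG, E=[c,\infty) and a Gaussian μ\mu are purely illustrative) we take p=q=2 and estimate the zeroth-order term and the first-order coefficient by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(3)
N, q, delta = 100_000, 2.0, 0.05

X = rng.normal(size=(N, 2))                      # sample representing mu on R^2

c = 4.0                                          # undesirable (convex) set E = [c, infinity)
G = lambda x: np.sum(x ** 2, axis=1)             # system response (illustrative choice)
gradG = lambda x: 2.0 * x                        # its gradient

g = G(X)
dist = np.maximum(c - g, 0.0)                    # d(G(x), E)
# chain rule: the gradient of x -> d(G(x), E) is -grad G(x) on {G(x) < c}, zero inside E
grad_norm = np.linalg.norm(gradG(X), axis=1) * (g < c)

zeroth = dist.mean()
first = np.mean(grad_norm ** q) ** (1.0 / q)

print(zeroth, zeroth - first * delta)            # average distance and its first-order expansion
```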

3.4. Statistics

We discuss two applications of our results in the realm of statistics. We start by highlighting the link between our results and the so-called influence curves (IC) in robust statistics. For a functional μT(μ)\mu\mapsto T(\mu) its IC is defined as

IC(y)=limt0T(tδy+(1t)μ)T(μ)t.\displaystyle\mathrm{IC}(y)=\lim_{t\to 0}\frac{T(t\delta_{y}+(1-t)\mu)-T(\mu)}{t}.

Computing the IC\mathrm{IC}, if it exists, is in general hard and closed form solutions may be unachievable. However, for the so-called M-estimators, defined as optimizers for V(0)V(0),

T(μ):=argminaf(x,a)μ(dx),\displaystyle T(\mu):=\mathrm{argmin}_{a}\int f(x,a)\mu(dx),

for some ff (e.g., f(x,a)=|xa|f(x,a)=|x-a| for the median), we have

IC(y)=af(y,T(μ))a2f(s,T(μ))μ(ds),\displaystyle\mathrm{IC}(y)=\frac{\nabla_{a}f(y,T(\mu))}{-\int\nabla^{2}_{a}f(s,T(\mu))\,\mu(ds)},

under suitable assumptions on ff, see (Huber and Ronchetti,, 1981, section 3.2.1). In comparison, writing TδT^{\delta} for the optimizer for V(δ)V(\delta), Theorem 5 yields

(4) limδ0TδT(μ)δ=xaf(x,T(μ))xf(x,T(μ))μ(dx)a2f(s,T(μ))μ(ds),\displaystyle\lim_{\delta\to 0}\frac{T^{\delta}-T(\mu)}{\delta}=\frac{\int\nabla_{x}\nabla_{a}f(x,T(\mu))\nabla_{x}f(x,T(\mu))\,\mu(dx)}{-\int\nabla^{2}_{a}f(s,T(\mu))\,\mu(ds)},

under Assumption 4 and normalisation xf(x,T(μ))Lp(μ)=1\|\nabla_{x}f(x,T(\mu))\|_{L^{p}(\mu)}=1. To investigate the connection let us Taylor-expand IC(y)\mathrm{IC}(y) around xx to obtain

IC(y)IC(x)=\displaystyle\mathrm{IC}(y)-\mathrm{IC}(x)= axf(x,T(μ))a2f(s,T(μ))μ(ds)(yx).\displaystyle\frac{\nabla_{a}\nabla_{x}f(x,T(\mu))}{-\int\nabla^{2}_{a}f(s,T(\mu))\,\mu(ds)}(y-x).

Choosing y=x+δxf(x,T(μ))y=x+\delta\nabla_{x}f(x,T(\mu)), integrating both sides over μ\mu and dividing by δ\delta, we obtain the asymptotic equality

IC(x+δxf(x,T(μ)))IC(x)δμ(dx)TδT(μ)δ\displaystyle\int\frac{\mathrm{IC}(x+\delta\nabla_{x}f(x,T(\mu)))-\mathrm{IC}(x)}{\delta}\,\mu(dx)\approx\frac{T^{\delta}-T(\mu)}{\delta}

for δ0\delta\to 0 by (4). We conclude that considering the average directional derivative of IC in the direction of xf(x,T(μ))\nabla_{x}f(x,T(\mu)) gives our first-order sensitivity. For an interesting conjecture regarding the comparison of influence functions and sensitivities in KL-divergence we refer to (Lam, 2018, Section 7.3) and (Lam, 2016, Section 3.4.2).

Our second application in statistics exploits the representation of the LASSO/Ridge regressions as robust versions of the standard linear regression. We consider 𝒜=k\mathcal{A}=\mathbb{R}^{k} and 𝒮=k+1\mathcal{S}=\mathbb{R}^{k+1}. If instead of the Euclidean metric we take (x,y)=|x|r𝟏{y=0}+𝟏{y0}\|(x,y)\|_{\ast}=|x|_{r}\mathbf{1}_{\{y=0\}}+\infty\mathbf{1}_{\{y\neq 0\}}, for some r>1r>1 and (x,y)k×(x,y)\in\mathbb{R}^{k}\times\mathbb{R}, in the definition of the Wasserstein distance, then Blanchet et al., 2019a showed that

(5) infaksupνBδ(μ)(yx,a)2ν(dx,dy)=infak((ya,x)2μ(dx,dy)+δ|a|s)2\begin{split}&\inf_{a\in\mathbb{R}^{k}}\sup_{\nu\in B_{\delta}\left(\mu\right)}\int(y-\langle x,a\rangle)^{2}\,\nu(dx,dy)\\ &\qquad=\inf_{a\in\mathbb{R}^{k}}\left(\sqrt{\int(y-\langle a,x\rangle)^{2}\,\mu(dx,dy)}+\delta|a|_{s}\right)^{2}\end{split}

holds, where 1/r+1/s=11/r+1/s=1. The δ=0\delta=0 case is the ordinary least squares regression. For δ>0\delta>0, the RHS for s=2s=2 is directly related to the Ridge regression, while the limiting case s=1s=1 is called the square-root LASSO regression, a regularised variant of linear regression well known for its good empirical performance. Closed-form solutions to (5) do not exist in general and it is a common practice to use numerical routines to solve it approximately. Theorem 5 offers instead an explicit first-order approximation of aδa^{\star}_{\delta} for small δ\delta. We denote by aa^{\star} the ordinary least squares estimator and by II the k×kk\times k identity matrix. Note that the first order condition on aa^{\star} implies that (ya,x)xiμ(dx,dy)=0\int(y-\langle a^{\star},x\rangle)x_{i}\mu(dx,dy)=0 for all 1ik1\leq i\leq k. In particular, V(0)=(y2a,xy)μ(dx,dy)V(0)=\int(y^{2}-\langle a^{\star},x\rangle y)\mu(dx,dy) and a=D1yxμ(dx,dy)a^{\star}=D^{-1}\int yx\mu(dx,dy), where we assume the system is overdetermined so that D=xxTμ(dx,dy)D=\int xx^{T}\,\mu(dx,dy) is invertible. Letting J=axT+(Ia)(Ix)J=a^{\star}x^{T}+(Ia^{\star})(Ix), a direct computation, see Bartl et al., 2021a, yields

(6) aδaV(0)D1h(a)δ.\displaystyle a^{\star}_{\delta}\approx\ a^{\star}-\sqrt{V(0)}D^{-1}\,h(a^{\star})\delta.

For s=2s=2, h(a)=a/|a|2h(a^{\star})=a^{\star}/|a^{\star}|_{2} and for s=1s=1, h(a)=sign(a)h(a^{\star})=\text{sign}(a^{\star}); in the latter case, inspecting the proof, Theorem 5 still holds since aa^{\star} does not have zero components μ\mu-a.s., which are the only points of discontinuity of hh. Hence aδa^{\star}_{\delta} is approximately

(7) (1V(0)|a|2D1δ)a and aV(0)D1sign(a)δ\left(1-\frac{\sqrt{V(0)}}{|a^{\star}|_{2}}D^{-1}\delta\right)a^{\star}\text{ and }a^{\star}-\sqrt{V(0)}D^{-1}\text{sign}(a^{\star})\delta

respectively. This corresponds to parameter shrinkage: proportional for square-root Ridge and a shift towards zero for square-root LASSO. To the best of our knowledge these are first such results and we stress that our formulae are valid in a general context and, in particular, parameter shrinkage depends on the direction through the D1D^{-1} factor. Figure 3 compares the first order approximation with the actual results and shows a remarkable fit.

Figure 3. Square-root LASSO parameter shrinkage aδa0a^{\star}_{\delta}-a^{\star}_{0}: exact (o) and the first-order approximation (x) in (7). 2000 observations generated according to Y=1.5X13X22X3+0.3X40.5X50.7X6+0.2X7+0.5X8+1.2X9+0.8X10+εY=1.5X_{1}-3X_{2}-2X_{3}+0.3X_{4}-0.5X_{5}-0.7X_{6}+0.2X_{7}+0.5X_{8}+1.2X_{9}+0.8X_{10}+\varepsilon with all Xi,εX_{i},\varepsilon i.i.d. 𝒩(0,1)\mathcal{N}(0,1).
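The approximation (7) is also easy to test directly. The sketch below (our own, on simulated data similar to Figure 3) computes the OLS solution a^{\star}, the first-order square-root LASSO approximation from (7) with s=1, and an "exact" square-root LASSO fit obtained by numerically minimizing the right-hand side of (5).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, k, delta = 2_000, 10, 0.05
beta = np.array([1.5, -3, -2, 0.3, -0.5, -0.7, 0.2, 0.5, 1.2, 0.8])

X = rng.normal(size=(N, k))
y = X @ beta + rng.normal(size=N)

# Ordinary least squares (delta = 0)
D = (X.T @ X) / N
a_star = np.linalg.solve(D, (X.T @ y) / N)
V0 = np.mean((y - X @ a_star) ** 2)

# First-order square-root LASSO approximation, eq. (7), s = 1
a_approx = a_star - np.sqrt(V0) * np.linalg.solve(D, np.sign(a_star)) * delta

# "Exact" square-root LASSO: minimize sqrt(mean squared error) + delta * |a|_1
sqrt_lasso = lambda a: np.sqrt(np.mean((y - X @ a) ** 2)) + delta * np.sum(np.abs(a))
a_exact = minimize(sqrt_lasso, a_star, method="Powell",
                   options={"maxiter": 50_000}).x

print(np.round(a_approx - a_star, 4))   # first-order shrinkage
print(np.round(a_exact - a_star, 4))    # shrinkage of the numerical solution
```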

Furthermore, our results agree with what is known in the canonical test case for the (standard) Ridge and LASSO, see Tibshirani, (1996), when μ=μN\mu=\mu_{N} is the empirical measure of NN i.i.d. observations, the data is centred and the covariates are orthogonal, i.e., D=1NID=\frac{1}{N}I. In that case, (7) simplifies to

(1δN(1R21))a and aN|y|1R2sign(a)δ,\left(1-\delta\sqrt{N\left(\frac{1}{R^{2}}-1\right)}\right)a^{\star}\text{ and }a^{\star}-\sqrt{N}\,|y|\,\sqrt{1-R^{2}}\,\text{sign}(a^{\star})\delta,

where R2R^{2} is the usual coefficient of determination.

The case of μN\mu_{N} is naturally of particular importance in statistics and data science and we continue to consider it in the next subsection. In particular, we characterise the asymptotic distribution of N(a1/Na)\sqrt{N}(a^{\star}_{1/\sqrt{N}}-a^{\star}), where aδ𝒜δ(μN)a^{\star}_{\delta}\in\mathcal{A}^{\star}_{\delta}(\mu_{N}) and a𝒜0(μ)a^{\star}\in\mathcal{A}^{\star}_{0}(\mu_{\infty}) is the optimizer of the non-robust problem for the data-generating measure. This recovers the central limit theorem of Blanchet et al., 2019b, a link we explain further in section 4.2.

3.5. Out-of-sample error

A benchmark of paramount importance in optimization is the so-called out-of-sample error, also known as the prediction error in statistical learning. Consider the setup above when μN\mu_{N} is the empirical measure of NN i.i.d. observations sampled from the “true” distribution μ=μ\mu=\mu_{\infty} and take, for simplicity, =||s\|\cdot\|=|\cdot|_{s}, with s>1s>1. Our aim is to compute the optimal aa^{\star} which solves the original problem (1). However, we only have access to the training set, encoded via μN\mu_{N}. Suppose we solve the distributionally robust optimization problem (2) for μN\mu_{N} and denote the robust optimizer aδ,Na^{{\star},N}_{\delta}. Then the out-of-sample error

V(0,aδ,N)V(0,a)=f(x,aδ,N)μ(dx)f(x,a)μ(dx)\displaystyle V(0,a^{{\star},N}_{\delta})-V(0,a^{\star})=\int f(x,a^{{\star},N}_{\delta})\,\mu(dx)-\int f(x,a^{{\star}})\,\mu(dx)

quantifies the error from using aδ,Na^{{\star},N}_{\delta} as opposed to the true optimizer aa^{\star}.

While this expression seems to be hard to compute explicitly for finite samples, Theorem 5 offers a way to find the asymptotic distribution of a (suitably rescaled version of) the out-of-sample error. We suppose the assumptions in Theorem 5 are satisfied and note that the first order condition for aa^{\star} gives aV(0,a)=0\nabla_{a}V(0,a^{\star})=0. Then, a second-order Taylor expansion gives

(8) V(0,aδ,N)V(0,a)=12(aδ,Na)Ta2V(0,a~)(aδ,Na),\displaystyle V(0,a^{{\star},N}_{\delta})-V(0,a^{\star})=\frac{1}{2}(a^{{\star},N}_{\delta}-a^{{\star}})^{T}\nabla_{a}^{2}V(0,\tilde{a})(a^{{\star},N}_{\delta}-a^{{\star}}),

for some a~\tilde{a} (coordinate-wise) between aa^{\star} and aδ,Na^{{\star},N}_{\delta}. Now we write

aδ,Na\displaystyle a^{{\star},N}_{\delta}-a^{{\star}} =aδ,Na,N+a,Na,\displaystyle=a^{{\star},N}_{\delta}-a^{{\star},N}+a^{{\star},N}-a^{{\star}},

where we define a,Na^{{\star},N} as the optimizer of the non-robust problem (1) with μ\mu replaced by μN\mu_{N}. In particular the δ\delta-method for M-estimators implies that

(9) N(a,Na)(a2V(0,a))1H,\displaystyle\sqrt{N}(a^{{\star},N}-a^{{\star}})\Rightarrow(\nabla_{a}^{2}V(0,a^{{\star}}))^{-1}H,

where H𝒩(0,(af(x,a))Taf(x,a)μ(dx))H\sim\mathcal{N}\left(0,\int(\nabla_{a}f(x,a^{{\star}}))^{T}\nabla_{a}f(x,a^{{\star}})\,\mu(dx)\right) and \Rightarrow denotes convergence in distribution. On the other hand, for a fixed NN\in\mathbb{N}, Theorem 5 applied to μN\mu_{N} yields

(10) aδ,Na,N=\displaystyle a^{{\star},N}_{\delta}-a^{{\star},N}= (|xf(x,a,N)|sqμN(dx))1q1(a2f(x,a,N)μN(dx))1\displaystyle-\Big{(}\int|\nabla_{x}f(x,a^{{\star},N})|_{s}^{q}\,\mu_{N}(dx)\Big{)}^{\frac{1}{q}-1}\cdot\left(\int\nabla_{a}^{2}f(x,a^{{\star},N})\,\mu_{N}(dx)\right)^{-1}
xaf(x,a,N)h(xf(x,a,N))|xf(x,a,N)|s1qμN(dx)δ+o(δ)\displaystyle\quad\ \cdot\int\frac{\nabla_{x}\nabla_{a}f(x,a^{{\star},N})\,h(\nabla_{x}f(x,a^{{\star},N}))}{|\nabla_{x}f(x,a^{{\star},N})|_{s}^{1-q}}\,\mu_{N}(dx)\cdot\delta+o(\delta)
(11) =\displaystyle= ((a2V(0,a))1Θ+ΔN)δ+o(δ),\displaystyle-\left((\nabla_{a}^{2}V(0,a^{{\star}}))^{-1}\Theta+\Delta_{N}\right)\cdot\delta+o(\delta),

where

Θ\displaystyle\Theta :=(|xf(x,a)|sqμ(dx))1q1xaf(x,a)h(xf(x,a))|xf(x,a)|s1qμ(dx),\displaystyle:=\Big{(}\int|\nabla_{x}f(x,a^{\star})|^{q}_{s}\,\mu(dx)\Big{)}^{\frac{1}{q}-1}\cdot\int\frac{\nabla_{x}\nabla_{a}f(x,a^{\star})\,h(\nabla_{x}f(x,a^{\star}))}{|\nabla_{x}f(x,a^{\star})|_{s}^{1-q}}\,\mu(dx),
ΔN\displaystyle\Delta_{N} :=(|xf(x,a,N)|sqμN(dx))1q1(a2f(x,a,N)μN(dx))1\displaystyle:=\Big{(}\int|\nabla_{x}f(x,a^{{\star},N})|_{s}^{q}\,\mu_{N}(dx)\Big{)}^{\frac{1}{q}-1}\cdot\left(\int\nabla_{a}^{2}f(x,a^{{\star},N})\,\mu_{N}(dx)\right)^{-1}
xaf(x,a,N)h(xf(x,a,N))|xf(x,a,N)|s1qμN(dx)(a2V(0,a))1Θ.\displaystyle\quad\cdot\int\frac{\nabla_{x}\nabla_{a}f(x,a^{{\star},N})\,h(\nabla_{x}f(x,a^{{\star},N}))}{|\nabla_{x}f(x,a^{{\star},N})|_{s}^{1-q}}\,\mu_{N}(dx)-(\nabla_{a}^{2}V(0,a^{{\star}}))^{-1}\Theta.

Almost surely (w.r.t. sampling of μN\mu_{N}), we know that μNμ\mu_{N}\to\mu in WpW_{p} as NN\to\infty, see Fournier and Guillin, (2014), and under the regularity and growth assumptions on ff in (Bartl et al., 2021a, , eq. (27)) we check that ΔN0\Delta_{N}\to 0 a.s., see (Bartl et al., 2021a, , Example 28) for details. In particular, taking δ=1/N\delta=1/\sqrt{N} and combining the above with (9) we obtain

(12) N(a1/N,Na)(a2V(0,a))1(HΘ).\sqrt{N}\left(a^{{\star},N}_{1/\sqrt{N}}-a^{\star}\right)\Rightarrow(\nabla_{a}^{2}V(0,a^{\star}))^{-1}(H-\Theta).

This recovers the central limit theorem of Blanchet et al., 2019b, as discussed in more detail in section 4.2 below. Together, (8) and (11) give us the a.s. asymptotic behaviour of the out-of-sample error

(13) V(0,aδ,N)V(0,a)=12N(HΘ)T(a2V(0,a))1(HΘ)+o(1N).\displaystyle V(0,a^{{\star},N}_{\delta})-V(0,a^{\star})=\frac{1}{2N}(H-\Theta)^{T}(\nabla_{a}^{2}V(0,a^{{\star}}))^{-1}(H-\Theta)+o\left(\frac{1}{N}\right).

These results also extend and complement (Anderson and Philpott, 2019, Prop. 17). Anderson and Philpott, (2019) investigate when the distributionally robust optimizers aδ,Na^{{\star},N}_{\delta} yield, on average, better performance than the simple in-sample optimizer a,Na^{{\star},N}. To this end, they consider the expectation, over the realisations of the empirical measure μN\mu_{N}, of

V(0,aδ,N)V(0,a,N)=f(x,aδ,N)μ(dx)f(x,a,N)μ(dx).\displaystyle V(0,a^{{\star},N}_{\delta})-V(0,a^{{\star},N})=\int f(x,a^{{\star},N}_{\delta})\,\mu(dx)-\int f(x,a^{{\star},N})\,\mu(dx).

This is closely related to the out-of-sample error and our derivations above can be easily modified. The first order term in the Taylor expansion no longer vanishes and, instead of (8), we now have

V(0,aδ,N)V(0,a,N)=aV(0,a,N)(aδ,Na,N)+o(|aδ,Na,N|),\displaystyle V(0,a^{{\star},N}_{\delta})-V(0,a^{{\star},N})=\nabla_{a}V(0,a^{{\star},N})(a^{{\star},N}_{\delta}-a^{{\star},N})+o(|a^{{\star},N}_{\delta}-a^{{\star},N}|),

which holds, e.g., if for any r>0r>0, there exists c>0c>0 such that i=1k|aaif(x,a)|c(1+|x|p)\sum_{i=1}^{k}\Big{|}\nabla_{a}\nabla_{a_{i}}f(x,a)\Big{|}\leq c(1+|x|^{p}) for all x𝒮x\in\mathcal{S}, |a|r|a|\leq r. Combined with (10), this gives asymptotics in small δ\delta for a fixed NN. For quadratic ff and taking qq\uparrow\infty, we recover the result in (Anderson and Philpott,, 2019, Prop. 17), see (Bartl et al., 2021a, , Example 28) for details.

4. Further discussion and literature review

We start with an overview of related literature and then focus specifically on a comparison of our results with the CLT of Blanchet et al., 2019b mentioned above.

4.1. Discussion of related literature

Let us first remark that, while Theorem 2 bears some superficial similarity to a classical maximum theorem, which is usually concerned with continuity properties of δV(δ)\delta\mapsto V(\delta), in this work we are instead interested in the exact first derivative of the function δV(δ)\delta\mapsto V(\delta). Indeed the convergence limδ0supνBδ(μ)f(x)ν(dx)=f(x)μ(dx)\lim_{\delta\to 0}\sup_{\nu\in B_{\delta}(\mu)}\int f(x)\,\nu(dx)=\int f(x)\,\mu(dx) follows for all ff satisfying f(x)c(1+|x|p)f(x)\leq c(1+|x|^{p}) directly from the definition of convergence in the Wasserstein metric (see e.g. (Villani, 2008, Def. 6.8)). In conclusion, the main issue is to quantify the rate of this convergence by calculating the first derivative V(δ)V^{\prime}(\delta).

Our work investigates model uncertainty broadly conceived: it includes errors related to the choice of models from a particular (parametric or not) class of models as well as the mis-specification of such class altogether (or indeed, its absence). In decision theoretic literature, these aspects are sometimes referred to as model ambiguity and model mis-specification respectively, see Hansen and Marinacci, (2016). However, seeing our main problem (2) in decision theoretic terms is not necessarily helpful as we think of ff as given and not coming from some latent expected utility type of problem. In particular, our actions a𝒜a\in\mathcal{A} are just constants.

In our work we decided to capture the uncertainty in the specification of μ\mu using neighborhoods in the Wasserstein distance. As already mentioned, other choices are possible and have been used in the past. Possibly the most often used alternative is the relative entropy, or the Kullback-Leibler divergence. In particular, it has been used in this context in economics, see Hansen and Sargent, (2008). To the best of our knowledge, the only comparable study of sensitivities with respect to relative entropy balls is Lam, (2016), see also Lam, (2018) allowing for additional marginal constraints. However, this only considered the specific case f(x,a)=f(x)f(x,a)=f(x) where the reward function is independent of the action. Its main result is

supνBδKL(μ)f(x)ν(dx)=f(x)μ(dx)+2Varμ(f(X))δ+13κ3(f(X))Varμ(f(X))δ2+O(δ3),\displaystyle\sup_{\nu\in B^{KL}_{\delta}(\mu)}\int f(x)\,\nu(dx)=\int f(x)\,\mu(dx)+\sqrt{2\operatorname{Var}_{\mu}(f(X))}\delta+\frac{1}{3}\frac{\kappa_{3}(f(X))}{\operatorname{Var}_{\mu}(f(X))}\delta^{2}+O\left(\delta^{3}\right),

where BδKL(μ)B^{KL}_{\delta}(\mu) is a ball of radius δ2\delta^{2} centred around μ\mu in KL-divergence, and Varμ(f(X))\operatorname{Var}_{\mu}(f(X)) and κ3(f(X))\kappa_{3}(f(X)) denote the variance and third cumulant of ff under the measure μ\mu respectively. In particular, the first order sensitivity involves the function ff itself. In contrast, our Theorem 2 states V(0)=((f(x))2μ(dx))1/2V^{\prime}(0)=(\int(f^{\prime}(x))^{2}\,\mu(dx))^{1/2} and involves the first derivative ff^{\prime}. In the trivial case of a point mass μ=δx\mu=\delta_{x} we recover the intuitive sensitivity V(0)=|f(x)|V^{\prime}(0)=|f^{\prime}(x)|, while the results of Lam, (2016) do not apply in this case. We also note that Lam, (2016) requires exponential moments of the function ff under the baseline measure μ\mu, while we only require polynomial moments. In particular in applications in econometrics (or any field in which μ\mu typically has fat tails), the scope of application of the corresponding results might then be decidedly different. We remark, however, that this requirement can be substantially weakened (to the existence of polynomial moments) when replacing KL-divergences by α\alpha-divergences, see e.g. Atar et al., (2015); Glasserman and Xu, (2014). We expect a sensitivity analysis similar to Lam, (2016) to hold in this setting. However, to the best of our knowledge no explicit results seem to be available in the literature.
To understand the relative technical difficulties and merits it is insightful to go into the details of the statements. In fact, in the case of relative entropy and the one-period setup we are considering, the form of the optimizing density can be determined exactly (see (Lam, 2016, Proposition 3.1)) up to a one-dimensional Lagrange parameter. This is well known and is the reason behind the usual elegant formulae obtained in this context. But this then reduces the problem in Lam, (2016) to a one-dimensional problem, which can be well-approximated via a Taylor approximation. In contrast, when we consider balls in the Wasserstein distance, the form of the optimizing measure is not known (apart from some degenerate cases). In fact a key insight of our results is that the optimizing measure can be approximated by a deterministic shift in the direction (x+f(x)δ)μ(x+f^{\prime}(x)\delta)_{*}\mu (this is, in general, not exact but only true as a first order approximation). The reason for these contrasting starting points of the analyses is the fact that Wasserstein balls contain a more heterogeneous set of measures, while in the case of relative entropy, exponentiating ff will always do the trick. We remark however that this is no longer true for the finite-horizon problems considered in (Lam, 2016, Section 3.2), where the worst-case measure is found using an elaborate fixed-point equation.
A point which further emphasizes that the topology introduced by the Wasserstein metric is less tractable is the fact that

Wpp(μ,ν)=limε0infπΠ(μ,ν)|xy|pπ(dx,dy)+εH(πμν)=limε0εinfπΠ(μ,ν)H(πRε),\displaystyle W^{p}_{p}(\mu,\nu)=\lim_{\varepsilon\to 0}\inf_{\pi\in\Pi(\mu,\nu)}\int|x-y|^{p}\,\pi(dx,dy)+\varepsilon H(\pi\mid\mu\otimes\nu)=\lim_{\varepsilon\to 0}\varepsilon\inf_{\pi\in\Pi(\mu,\nu)}H(\pi\mid R^{\varepsilon}),

where H(πRε)=log(dπdRε)𝑑πH(\pi\mid R^{\varepsilon})=\int\log\left(\frac{d\pi}{dR^{\varepsilon}}\right)\,d\pi is the relative entropy and

dRε=c0exp(|xy|p/ε)d(μν)dR^{\varepsilon}=c_{0}\exp(-|x-y|^{p}/\varepsilon)d(\mu\otimes\nu)

for some normalising constant c0>0c_{0}>0, see, e.g., Carlier et al., (2017). This is known as the entropic optimal transport formulation and has received considerable interest in the ML community in recent years (see e.g. Peyré and Cuturi, (2019)). In particular, the Wasserstein distance can be approximated by relative entropy, but only with respect to reference measures on the product space. As we consider optimization over ν\nu above, this amounts to changing the reference measure. In consequence, the topological structure imposed by Wasserstein distances is more intricate than that of relative entropy, but also more flexible.

The other well studied distance is the Hellinger distance. Lindsay, (1994) calculates influence curves for the minimum Hellinger distance estimator aHell,a^{\mathrm{Hell},{\star}} on a countable sample space. Their main result is that for the choice f(x,a)=log((x,a))f(x,a)=\log(\ell(x,a)) (where ((x,a))a𝒜(\ell(x,a))_{a\in\mathcal{A}} is a collection of parametric densities)

IC(x)=(a2V(0,aHell,))1alog((x,aHell,)),\displaystyle IC(x)=-\left(\nabla_{a}^{2}V(0,a^{\mathrm{Hell},{\star}})\right)^{-1}\nabla_{a}\log(\ell(x,a^{\mathrm{Hell},{\star}})),

the product of the inverse Fisher information matrix and the score function, which is the same as for the classical maximum likelihood estimator. Denote by μN\mu_{N} the empirical measure of NN data samples and by aHell,(N)a^{\mathrm{Hell},{\star}}(N) the corresponding minimum Hellinger distance estimator for μN\mu_{N}. In particular this result implies the same CLT as for M-estimators given by

N1/2(aHell,(N)aHell,)(a2V(0,a))1H\displaystyle N^{1/2}(a^{\mathrm{Hell},{\star}}(N)-a^{\mathrm{Hell},{\star}})\Rightarrow(\nabla_{a}^{2}V(0,a^{\star}))^{-1}H

where H𝒩(0,af(x,aHell,)Taf(x,aHell,)μ(dx)).H\sim\mathcal{N}\left(0,\int\nabla_{a}f(x,a^{\mathrm{Hell},{\star}})^{T}\nabla_{a}f(x,a^{\mathrm{Hell},{\star}})\,\mu(dx)\right). As we discuss in the next section, our Theorem 5 yields a similar CLT, namely

N1/2(a1/N,Na)(a2V(0,a))1(Ha|xf(x,a)|s2μ(dx)).\displaystyle N^{1/2}(a^{{\star},N}_{1/\sqrt{N}}-a^{\star})\Rightarrow(\nabla_{a}^{2}V(0,a^{\star}))^{-1}\cdot\left(H-\nabla_{a}\sqrt{\int|\nabla_{x}f(x,a^{\star})|_{s}^{2}\,\mu(dx)}\,\right).

Thus the Wasserstein worst-case approach leads to a shift of the mean of the normal distribution in the direction

a|xf(x,a)|s2μ(dx)-\nabla_{a}\sqrt{\int|\nabla_{x}f(x,a^{\star})|_{s}^{2}\,\mu(dx)}

compared to the non-robust case. In the simple case μ=𝒩(0,σ2)\mu=\mathcal{N}(0,\sigma^{2}) with standard deviation σ>0\sigma>0 we obtain the MLE (σ,N)2=1Ni=1NXi2(\sigma^{{\star},N})^{2}=\frac{1}{N}\sum_{i=1}^{N}X_{i}^{2}. We can directly compute (for a=σa=\sigma) that

σ|x(const.+log(exp(x22(σ)2)))|s2μ(dx)\displaystyle\nabla_{\sigma}\sqrt{\int\left|\nabla_{x}\left(\mathrm{const.}+\log\left(\exp\left(-\frac{x^{2}}{2(\sigma^{\star})^{2}}\right)\right)\right)\right|_{s}^{2}\,\mu(dx)} =σx2(σ)4μ(dx)\displaystyle=\nabla_{\sigma}\sqrt{\int\frac{x^{2}}{(\sigma^{\star})^{4}}\,\mu(dx)}
=σσ(σ)2=σ1σ=1(σ)2.\displaystyle=\nabla_{\sigma}\frac{\sigma^{\star}}{(\sigma^{\star})^{2}}=\nabla_{\sigma}\frac{1}{\sigma^{\star}}=-\frac{1}{(\sigma^{\star})^{2}}.

Thus the robust approach accounts for a shift of 1/(σ)21/(\sigma^{\star})^{2} (of order 1 if multiplied by the inverse Fisher information) to account for a possibly higher variance in the underlying data. In particular, in our approach, the so-called neutral spaces considered, e.g., in (Komorowski et al., 2011, eq. (21)) as

{a:(aa)Ta2V(0,a)(aa)δ}\displaystyle\left\{a:\ -(a-a^{\star})^{T}\nabla_{a}^{2}V(0,a^{\star})(a-a^{\star})\leq\delta\right\}

should also take this shift into account, i.e., their definition should be adjusted to

{a:(aa+a|xf(x,a)|s2μ(dx))T\displaystyle\Bigg{\{}a:\ -\left(a-a^{\star}+\nabla_{a}\sqrt{\int|\nabla_{x}f(x,a^{\star})|_{s}^{2}\,\mu(dx)}\right)^{T} a2V(0,a)\displaystyle\nabla_{a}^{2}V(0,a^{\star})
(aa+a|xf(x,a)|s2μ(dx))δ}.\displaystyle\cdot\left(a-a^{\star}+\nabla_{a}\sqrt{\int|\nabla_{x}f(x,a^{\star})|_{s}^{2}\,\mu(dx)}\right)\leq\delta\Bigg{\}}.

Lastly, let us mention another situation when our approach provides directly interpretable insights in the context of a parametric family of models. Namely, if one considers a family of models 𝒫\mathcal{P} such that the worst-case model in the Wasserstein ball remains in 𝒫\mathcal{P}, i.e., (x+f(x)δ)μ𝒫(x+f^{\prime}(x)\delta)_{*}\mu\in\mathcal{P}, then considering (the first order approximation to) model uncertainty over Wasserstein balls actually reduces to considerations within the parametric family. While uncommon, such a situation would arise, e.g., for a scale-location family 𝒫\mathcal{P}, with μ𝒫\mu\in\mathcal{P} and a linear/quadratic ff.

4.2. Link to the CLT of Blanchet et al., 2019b

As observed in section 3.5 above, Theorem 5 allows us to recover the main results in Blanchet et al., 2019b . We explain this now in detail. Set =||s\|\cdot\|=|\cdot|_{s}, p=q=2p=q=2, 𝒮=d\mathcal{S}=\mathbb{R}^{d}. Let μN\mu_{N} denote the empirical measure of NN i.i.d. samples from μ\mu. We impose the assumptions on μ\mu and ff from Blanchet et al., 2019b , including Lipschitz continuity of gradients of ff and strict convexity. These, in particular, imply that the optimizers aδ,N,a,Na^{{\star},N}_{\delta},a^{{\star},N} and aa^{{\star}}, as defined in section 3.5, are well defined and unique, and further a1/N,Naa^{{\star},N}_{1/\sqrt{N}}\to a^{\star} as NN\to\infty. (Blanchet et al., 2019b, Thm. 1) implies that, as NN\to\infty,

(14) N(a1/N,Na)(a2V(0,a))1(Ha|xf(x,a)|s2μ(dx)),\displaystyle\sqrt{N}(a^{{\star},N}_{1/\sqrt{N}}-a^{\star})\Rightarrow(\nabla_{a}^{2}V(0,a^{\star}))^{-1}\cdot\left(H-\nabla_{a}\sqrt{\int|\nabla_{x}f(x,a^{\star})|_{s}^{2}\,\mu(dx)}\,\right),

where H𝒩(0,af(x,a)Taf(x,a)μ(dx)).H\sim\mathcal{N}\left(0,\int\nabla_{a}f(x,a^{\star})^{T}\nabla_{a}f(x,a^{\star})\,\mu(dx)\right). We note that for =||s\|\cdot\|=|\cdot|_{s} we have

\displaystyle h(x)=(\text{sign}(x_{1})\,|x_{1}|^{s-1},\dots,\text{sign}(x_{d})\,|x_{d}|^{s-1})\cdot|x|_{s}^{1-s}=\nabla_{x}|x|_{s}.

Thus

a|xf(x,a)|s2μ(dx)\displaystyle\nabla_{a}\sqrt{\int|\nabla_{x}f(x,a^{\star})|_{s}^{2}\,\mu(dx)} =|xf(x,a)|sh(xf(x,a))xaf(x,a)μ(dx)|xf(x,a)|s2μ(dx)\displaystyle=\frac{\int|\nabla_{x}f(x,a^{\star})|_{s}h(\nabla_{x}f(x,a^{\star}))\nabla_{x}\nabla_{a}f(x,a^{\star})\,\mu(dx)}{\sqrt{\int|\nabla_{x}f(x,a^{\star})|_{s}^{2}\,\mu(dx)}}\cdot

and (14) agrees with (12), which is justified by the Lipschitz growth assumptions on ff, xf(x,a)\nabla_{x}f(x,a) and axf(x,a)\nabla_{a}\nabla_{x}f(x,a) from Blanchet et al., 2019b , see (Bartl et al., 2021a, eq. (27)). In particular Theorem 5 implies (14) as a special case. While this connection is insightful to establish (we thank Jose Blanchet for pointing out the possible link and encouraging us to explore it), it is also worth stressing that the proofs in Blanchet et al., 2019b pass through the dual formulation and are thus substantially different to ours. Furthermore, while Theorem 5 holds under milder assumptions on ff than those in Blanchet et al., 2019b , the last argument in our reasoning above requires the stronger assumptions on ff. It is thus not clear if our results could help to significantly weaken the assumptions in the central limit theorems of Blanchet et al., 2019b .

5. Proofs

We consider the case 𝒮=d\mathcal{S}=\mathbb{R}^{d} and =||\|\cdot\|=|\cdot| here. For the general case and additional details we refer to Bartl et al., 2021a . When clear from the context, we do not indicate the space over which we integrate.

Proof of Theorem 2.

For every δ0\delta\geq 0 let Cδ(μ)C_{\delta}(\mu) denote those π𝒫(d×d)\pi\in\mathcal{P}(\mathbb{R}^{d}\times\mathbb{R}^{d}) which satisfy

π1=μ and (|xy|pπ(dx,dy))1/pδ.\pi_{1}=\mu\text{ and }\left(\int|x-y|^{p}\,\pi(dx,dy)\right)^{1/p}\leq\delta.

As the infimum in the definition of Wp(μ,ν)W_{p}(\mu,\nu) is attained (see (Villani, 2008, Theorem 4.1, p. 43)), one has Bδ(μ)={π2:πCδ(μ)}B_{\delta}(\mu)=\{\pi_{2}:\pi\in C_{\delta}(\mu)\}.

We start by showing the “\leq” inequality in the statement. For any a𝒜0a^{\star}\in\mathcal{A}^{\star}_{0} one has V(δ)supνBδ(μ)f(y,a)ν(dy)V(\delta)\leq\sup_{\nu\in B_{\delta}(\mu)}\int f(y,a^{\star})\,\nu(dy) with equality for δ=0\delta=0. Therefore, differentiating f(,a)f(\cdot,a^{\star}) and using both Fubini’s theorem and Hölder’s inequality, we obtain that

V(δ)V(0)supπCδ(μ)f(y,a)f(x,a)π(dx,dy)\displaystyle V(\delta)-V(0)\leq\sup_{\pi\in C_{\delta}(\mu)}\int f(y,a^{\star})-f(x,a^{\star})\,\pi(dx,dy)
=supπCδ(μ)01xf(x+t(yx),a),(yx)π(dx,dy)𝑑t\displaystyle=\sup_{\pi\in C_{\delta}(\mu)}\int_{0}^{1}\int\langle\nabla_{x}f(x+t(y-x),a^{\star}),(y-x)\rangle\,\pi(dx,dy)dt
δsupπCδ(μ)01(|xf(x+t(yx),a)|qπ(dx,dy))1/q𝑑t.\displaystyle\leq\delta\sup_{\pi\in C_{\delta}(\mu)}\int_{0}^{1}\Big{(}\int|\nabla_{x}f(x+t(y-x),a^{\star})|^{q}\pi(dx,dy)\Big{)}^{1/q}dt.

Any choice πδCδ(μ)\pi^{\delta}\in C_{\delta}(\mu) converges in pp-Wasserstein distance on 𝒫(d×d\mathcal{P}(\mathbb{R}^{d}\times\mathbb{R}^{d}) to the pushforward measure of μ\mu under the mapping x(x,x)x\mapsto(x,x), which we denote [x(x,x)]μ[x\mapsto(x,x)]_{\ast}\mu. This can be seen by, e.g., considering the coupling [(x,y)(x,y,x,x)]πδ[(x,y)\mapsto(x,y,x,x)]_{\ast}\pi^{\delta} between πδ\pi^{\delta} and [x(x,x)]μ[x\mapsto(x,x)]_{\ast}\mu. Now note that q=p/(p1)q=p/(p-1) and the growth assumption on xf(,a)\nabla_{x}f(\cdot,a^{\star}) implies

(15) |xf(x+t(yx),a)|qc(1+|x|p+|y|p)\displaystyle|\nabla_{x}f(x+t(y-x),a^{\star})|^{q}\leq c(1+|x|^{p}+|y|^{p})

for some c>0c>0 and all x,ydx,y\in\mathbb{R}^{d}, t[0,1]t\in[0,1]. In particular |xf(x+t(yx),a)|qπδ(dx,dy)C\int|\nabla_{x}f(x+t(y-x),a^{\star})|^{q}\,\pi^{\delta}(dx,dy)\leq C for all t[0,1]t\in[0,1] and small δ>0\delta>0, for another constant C>0C>0. As further (x,y)|xf(x+t(yx),a)|q(x,y)\mapsto|\nabla_{x}f(x+t(y-x),a^{\star})|^{q} is continuous for every tt, the pp-Wasserstein convergence of πδ\pi^{\delta} to [x(x,x)]μ[x\mapsto(x,x)]_{\ast}\mu implies that

|xf(x+t(yx),a)|qπδ(dx,dy)|xf(x,a)|qμ(dx)\int|\nabla_{x}f(x+t(y-x),a^{\star})|^{q}\,\pi^{\delta}(dx,dy)\to\int|\nabla_{x}f(x,a^{\star})|^{q}\,\mu(dx)

for every t[0,1]t\in[0,1] for δ0\delta\to 0, see (Bartl et al., 2021a, , Lemma 21). Dominated convergence (in tt) then yields “\leq” in the statement of the theorem.

We turn now to the opposite “\geq” inequality. As V(δ)V(0)V(\delta)\geq V(0) for every δ>0\delta>0 there is no loss in generality in assuming that the right hand side is not equal to zero. Now take any, for notational simplicity not relabelled, subsequence of (δ)δ>0(\delta)_{\delta>0} which attains the liminf in (V(δ)V(0))/δ(V(\delta)-V(0))/\delta and pick aδ𝒜δa^{\star}_{\delta}\in\mathcal{A}^{\star}_{\delta}. By assumption, for a (again not relabelled) subsequence, one has aδa𝒜0a^{\star}_{\delta}\to a^{\star}\in\mathcal{A}^{\star}_{0}. Further note that V(0)f(x,aδ)μ(dx)V(0)\leq\int f(x,a^{\star}_{\delta})\,\mu(dx) which implies

V(δ)V(0)\displaystyle V(\delta)-V(0) supπCδ(μ)f(y,aδ)f(x,aδ)π(dx,dy).\displaystyle\geq\sup_{\pi\in C_{\delta}(\mu)}\int f(y,a^{\star}_{\delta})-f(x,a^{\star}_{\delta})\,\pi(dx,dy).

Now define πδ:=[x(x,x+δT(x))]μ\pi^{\delta}:=[x\mapsto(x,x+\delta T(x))]_{\ast}\mu, where

T(x):=xf(x,a)|xf(x,a)|2q(|xf(z,a)|qμ(dz))1/q1T(x):=\frac{\nabla_{x}f(x,a^{\star})}{|\nabla_{x}f(x,a^{\star})|^{2-q}}\Big{(}\int|\nabla_{x}f(z,a^{\star})|^{q}\,\mu(dz)\Big{)}^{1/q-1}

for xdx\in\mathbb{R}^{d} with the convention 0/0=00/0=0. Note that the integral is well defined since, as before in (15), one has |xf(x,a)|qC(1+|x|p)|\nabla_{x}f(x,a^{\star})|^{q}\leq C(1+|x|^{p}) for some C>0C>0 and the latter is integrable under μ\mu. Using that pqp=qpq-p=q it further follows that

|xy|pπδ(dx,dy)=δp|T(x)|pμ(dx)\displaystyle\int|x-y|^{p}\,\pi^{\delta}(dx,dy)=\delta^{p}\int|T(x)|^{p}\,\mu(dx)
=δp|xf(x,a)|p+pq2pμ(dx)(|xf(z,a)|qμ(dz))p(11/q)=δp.\displaystyle=\delta^{p}\frac{\int|\nabla_{x}f(x,a^{\star})|^{p+pq-2p}\,\mu(dx)}{\big{(}\int|\nabla_{x}f(z,a^{\star})|^{q}\,\mu(dz)\big{)}^{p(1-1/q)}}=\delta^{p}.

In particular πδCδ(μ)\pi^{\delta}\in C_{\delta}(\mu) and we can use it to estimate from below the supremum over Cδ(μ)C_{\delta}(\mu) giving

V(δ)V(0)δ\displaystyle\frac{V(\delta)-V(0)}{\delta} 1δf(x+δT(x),aδ)f(x,aδ)μ(dx)\displaystyle\geq\frac{1}{\delta}\int f(x+\delta T(x),a^{\star}_{\delta})-f(x,a^{\star}_{\delta})\,\mu(dx)
=01xf(x+tδT(x),aδ),T(x)μ(dx)𝑑t.\displaystyle=\int_{0}^{1}\int\langle\nabla_{x}f(x+t\delta T(x),a^{\star}_{\delta}),T(x)\rangle\,\mu(dx)\,dt.

For any t[0,1]t\in[0,1], with δ0\delta\to 0, the inner integral converges to

xf(x,a),T(x)μ(dx)=(|xf(x,a)|qμ(dx))1/q.\displaystyle\int\langle\nabla_{x}f(x,a^{\star}),T(x)\rangle\,\mu(dx)=\Big{(}\int|\nabla_{x}f(x,a^{\star})|^{q}\,\mu(dx)\Big{)}^{1/q}.

The last equality follows from the definition of TT and a simple calculation. To justify the convergence, first note that xf(x+tδT(x),aδ),T(x)xf(x,a),T(x)\langle\nabla_{x}f(x+t\delta T(x),a^{\star}_{\delta}),T(x)\rangle\to\langle\nabla_{x}f(x,a^{\star}),T(x)\rangle for all xdx\in\mathbb{R}^{d} by continuity of xf\nabla_{x}f and since aδaa^{\star}_{\delta}\to a^{\star}. Moreover, as before in (15), one has |T(x)|c(1+|x|)|T(x)|\leq c(1+|x|) for some c>0c>0, hence |xf(x+tδT(x),a),T(x)|C(1+|x|p)|\langle\nabla_{x}f(x+t\delta T(x),a^{\star}),T(x)\rangle|\leq C(1+|x|^{p}) for some C>0C>0 and all t[0,1]t\in[0,1]. The latter is integrable under μ\mu, hence convergence of the integrals follows from the dominated convergence theorem. This concludes the proof. ∎
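
The lower-bound construction above can also be illustrated by simulation. The sketch below is our own numerical illustration (not part of the proof): it takes p=q=2, a fixed action, the illustrative loss f(x)=\log(1+|x|^{2}) and an empirical measure in place of \mu, and checks that the transport x\mapsto x+\delta T(x) realises, to first order, the gain \delta\big(\int|\nabla_{x}f|^{2}\,d\mu\big)^{1/2}.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 2))                    # samples standing in for mu

def f(x):                                          # illustrative loss, action suppressed
    return np.log1p((x ** 2).sum(axis=1))

def grad_f(x):                                     # its gradient in x
    return 2 * x / (1 + (x ** 2).sum(axis=1, keepdims=True))

G = grad_f(X)
norm_q = np.sqrt((G ** 2).sum(axis=1).mean())      # (int |grad_x f|^2 dmu)^{1/2}
T = G / norm_q                                     # the map T from the proof for p = q = 2

for delta in (0.1, 0.01, 0.001):
    # pushing mu forward by x -> x + delta*T(x) costs exactly delta in W_2 and
    # its gain per unit delta approaches norm_q, the first-order sensitivity
    gain = (f(X + delta * T) - f(X)).mean() / delta
    print(delta, gain, norm_q)
```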

Proof of Theorem 5.

We first show that

(16) \displaystyle\lim_{\delta\to 0}\frac{-\nabla_{a_{i}}V(0,a^{\star}_{\delta})}{\delta} =\int\nabla_{x}\nabla_{a_{i}}f(x,a^{\star})\frac{\nabla_{x}f(x,a^{\star})}{|\nabla_{x}f(x,a^{\star})|^{2-q}}\,\mu(dx)
\qquad\cdot\Big{(}\int|\nabla_{x}f(x,a^{\star})|^{q}\,\mu(dx)\Big{)}^{1/q-1}

for all i{1,,k}i\in\{1,\dots,k\}. We start with the “\leq”-inequality. For any a𝒜oa\in{\mathcal{A}}^{o} we have

\displaystyle\nabla_{a}f(y,a)-\nabla_{a}f(x,a) =\int_{0}^{1}\nabla_{x}\nabla_{a}f(x+t(y-x),a)\,(y-x)\,dt.

Let δ>0\delta>0 and recall that the optimizers aδ𝒜δa^{\star}_{\delta}\in\mathcal{A}^{\star}_{\delta} converge to a𝒜0a^{\star}\in\mathcal{A}^{\star}_{0} as δ0\delta\to 0. Let Bδ(μ,aδ)B^{\star}_{\delta}(\mu,a^{\star}_{\delta}) denote the set of νBδ(μ)\nu\in B_{\delta}(\mu) which attain the value: f(x,aδ)ν(dx)=V(δ)\int f(x,a^{\star}_{\delta})\,\nu(dx)=V(\delta). By (Bartl et al., 2021a, Lemma 29) the function aV(δ,a)a\mapsto V(\delta,a) is (one-sided) directionally differentiable at aδa^{\star}_{\delta} for all δ>0\delta>0 small and thus for all i{1,,k}i\in\{1,\dots,k\}

supνBδ(μ,aδ)aif(x,aδ)ν(dx)0.\displaystyle\sup_{\nu\in B^{\star}_{\delta}(\mu,a^{\star}_{\delta})}\int\nabla_{a_{i}}f(x,a^{\star}_{\delta})\,\nu(dx)\geq 0.

Then, using Lagrange multipliers to encode the optimality of Bδ(μ,aδ)B^{\star}_{\delta}(\mu,a^{\star}_{\delta}) in Bδ(μ)B_{\delta}(\mu), we obtain

aiV(0,aδ)supνBδ(μ,aδ)aif(y,aδ)ν(dy)aiV(0,aδ)\displaystyle-\nabla_{a_{i}}V(0,a^{\star}_{\delta})\leq\sup_{\nu\in B^{\star}_{\delta}(\mu,a^{\star}_{\delta})}\int\nabla_{a_{i}}f(y,a^{\star}_{\delta})\nu(dy)-\nabla_{a_{i}}V(0,a^{\star}_{\delta})
=supνBδ(μ)infλ([aif(y,aδ)+λ(f(y,aδ)V(δ))]ν(dy)\displaystyle=\sup_{\nu\in B_{\delta}(\mu)}\inf_{\lambda\in\mathbb{R}}\bigg{(}\int\big{[}\nabla_{a_{i}}f(y,a^{\star}_{\delta})+\lambda(f(y,a^{\star}_{\delta})-V(\delta))\big{]}\nu(dy)
[aif(x,aδ)+λ(f(x,aδ)V(0,aδ))]μ(dx))\displaystyle\quad-\int\big{[}\nabla_{a_{i}}f(x,a^{\star}_{\delta})+\lambda(f(x,a^{\star}_{\delta})-V(0,a^{\star}_{\delta}))\big{]}\mu(dx)\bigg{)}
=infλ(supπCδ(μ)01xaif(x+t(yx),aδ)\displaystyle=\inf_{\lambda\in\mathbb{R}}\bigg{(}\sup_{\pi\in C_{\delta}(\mu)}\int_{0}^{1}\int\Big{\langle}\nabla_{x}\nabla_{a_{i}}f(x+t(y-x),a^{\star}_{\delta})
+λxf(x+t(yx),aδ),yxπ(dx,dy)dt\displaystyle\quad+\lambda\nabla_{x}f(x+t(y-x),a^{\star}_{\delta}),y-x\Big{\rangle}\,\pi(dx,dy)\,dt
\quad-\lambda\sup_{\pi\in C_{\delta}(\mu)}\int_{0}^{1}\int\langle\nabla_{x}f(x+t(y-x),a^{\star}_{\delta}),y-x\rangle\,\pi(dx,dy)\,dt\bigg{)}

where we used a minimax argument as well as Fubini’s theorem. We note that the functions above satisfy the assumptions of Theorem 2 for a fixed λ\lambda. In particular, using exactly the same arguments as in the proof of Theorem 2 (i.e., Hölder’s inequality and a specific transport attaining the supremum), we obtain by exchanging the order of lim sup\limsup and inf\inf that

(17) lim supδ0aiV(0,aδ)δ\displaystyle\limsup_{\delta\to 0}\frac{-\nabla_{a_{i}}V(0,a^{\star}_{\delta})}{\delta}\
infλ((|xaif(x,a)+λxf(x,a)|qμ(dx))1/q\displaystyle\leq\inf_{\lambda\in\mathbb{R}}\Bigg{(}\left(\int\left|\nabla_{x}\nabla_{a_{i}}f(x,a^{\star})+\lambda\nabla_{x}f(x,a^{\star})\right|^{q}\,\mu(dx)\right)^{1/q}
λ(|xf(x,a)|qμ(dx))1/q).\displaystyle\qquad-\lambda\left(\int\left|\nabla_{x}f(x,a^{\star})\right|^{q}\,\mu(dx)\right)^{1/q}\Bigg{)}.

For q=2q=2 the infimum can be computed explicitly and equals

xaif(x,a),xf(x,a)μ(dx)|xf(x,a)|2μ(dx)\displaystyle\frac{\int\langle\nabla_{x}\nabla_{a_{i}}f(x,a^{\star}),\nabla_{x}f(x,a^{\star})\rangle\,\mu(dx)}{\sqrt{\int|\nabla_{x}f(x,a^{\star})|^{2}\,\mu(dx)}}

For the general case we refer to (Bartl et al., 2021a, Lemma 30); noting that by assumption xf(x,a)0\nabla_{x}f(x,a^{\star})\neq 0, we see that the RHS above is equal to the RHS in (16).

The proof of the “\geq”-inequality in (16) follows by the very same arguments. Indeed, (Bartl et al., 2021a, Lemma 29) implies that

infνBδ(μ,aδ)aif(x,aδ)ν(dx)0\inf_{\nu\in B^{\star}_{\delta}(\mu,a^{\star}_{\delta})}\int\nabla_{a_{i}}f(x,a^{\star}_{\delta})\,\nu(dx)\leq 0

for all i{1,,k}i\in\{1,\dots,k\} and we can write

aiV(0,aδ)infνBδ(μ,aδ)aif(y,aδ)ν(dy)aiV(0,aδ)\displaystyle-\nabla_{a_{i}}V(0,a^{\star}_{\delta})\geq\inf_{\nu\in B^{\star}_{\delta}(\mu,a^{\star}_{\delta})}\int\nabla_{a_{i}}f(y,a^{\star}_{\delta})\,\nu(dy)-\nabla_{a_{i}}V(0,a^{\star}_{\delta})
=infνBδ(μ)supλ([aif(y,aδ)+λ(f(y,aδ)V(δ))]ν(dy)\displaystyle=\inf_{\nu\in B_{\delta}(\mu)}\sup_{\lambda\in\mathbb{R}}\bigg{(}\int\big{[}\nabla_{a_{i}}f(y,a^{\star}_{\delta})+\lambda(f(y,a^{\star}_{\delta})-V(\delta))\big{]}\nu(dy)
[aif(x,aδ)+λ(f(x,aδ)V(0,aδ))]μ(dx)).\displaystyle\quad-\int\big{[}\nabla_{a_{i}}f(x,a^{\star}_{\delta})+\lambda(f(x,a^{\star}_{\delta})-V(0,a^{\star}_{\delta}))\big{]}\mu(dx)\bigg{)}.

From here on, we argue as in the “\leq”-inequality and conclude that indeed (16) holds.

By assumption the matrix a2V(0,a)\nabla_{a}^{2}V(0,a^{\star}) is invertible. Therefore, in a small neighborhood of aa^{\star}, the mapping aV(0,)\nabla_{a}V(0,\cdot) is invertible. In particular aδ=(aV(0,))1(aV(0,aδ))a^{\star}_{\delta}=(\nabla_{a}V(0,\cdot))^{-1}\left(\nabla_{a}V(0,a^{\star}_{\delta})\right) and by the first order condition a=(aV(0,))1(0)a^{\star}=(\nabla_{a}V(0,\cdot))^{-1}\left(0\right). Applying the chain rule and using (16) gives

limδ0aδaδ\displaystyle\lim_{\delta\to 0}\frac{a^{\star}_{\delta}-a^{\star}}{\delta} =(a2V(0,a))1limδ0aV(0,aδ)δ\displaystyle=(\nabla_{a}^{2}V(0,a^{\star}))^{-1}\cdot\ \lim_{\delta\to 0}\frac{\nabla_{a}V(0,a^{\star}_{\delta})}{\delta}
=(a2V(0,a))1(|xf(z,a)|qμ(dz))1/q1\displaystyle=-(\nabla^{2}_{a}V(0,a^{\star}))^{-1}\Big{(}\int|\nabla_{x}f(z,a^{\star})|^{q}\,\mu(dz)\Big{)}^{1/q-1}
xaf(x,a)xf(x,a)|xf(x,a)|2qμ(dx).\displaystyle\cdot\int\frac{\nabla_{x}\nabla_{a}f(x,a^{\star})\nabla_{x}f(x,a^{\star})}{|\nabla_{x}f(x,a^{\star})|^{2-q}}\,\mu(dx).

This completes the proof. ∎
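
As a side remark, the explicit value of the infimum over \lambda used for q=2 above can be checked numerically: the map \lambda\mapsto\big(\int|A+\lambda B|^{2}d\mu\big)^{1/2}-\lambda\big(\int|B|^{2}d\mu\big)^{1/2} is non-increasing and tends to the displayed ratio as \lambda\to\infty. The following sketch (our own illustration, with synthetic vectors standing in for \nabla_{x}\nabla_{a_{i}}f and \nabla_{x}f) is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50000, 3))            # stand-in for grad_x grad_{a_i} f(x, a*)
B = rng.normal(size=(50000, 3)) + 0.5 * A  # stand-in for grad_x f(x, a*)

def phi(lam):
    # the bracket on the right-hand side of (17) for q = 2
    return np.sqrt(((A + lam * B) ** 2).sum(axis=1).mean()) \
        - lam * np.sqrt((B ** 2).sum(axis=1).mean())

closed_form = (A * B).sum(axis=1).mean() / np.sqrt((B ** 2).sum(axis=1).mean())
for lam in (1.0, 10.0, 100.0, 1000.0):
    print(lam, phi(lam))                   # decreases monotonically in lam
print("closed form:", closed_form)         # phi(lam) -> closed_form as lam grows
```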

References

  • Anderson and Philpott, (2019) Anderson, E. J. and Philpott, A. B. (2019). Improving sample average approximation using distributional robustness. Optimization Online.
  • Araujo et al., (2019) Araujo, A., Pinot, R., Negrevergne, B., Meunier, L., Chevaleyre, Y., Yger, F., and Atif, J. (2019). Robust neural networks using randomized adversarial training. arXiv:1903.10219.
  • Armacost and Fiacco, (1974) Armacost, R. L. and Fiacco, A. V. (1974). Computational experience in sensitivity analysis for nonlinear programming. Math. Program., 6(1):301–326.
  • Artzner et al., (1999) Artzner, P., Delbaen, F., Eber, J., and Heath, D. (1999). Coherent measures of risk. Math. Finance, 9(3):203–228.
  • Asi and Duchi, (2019) Asi, H. and Duchi, J. C. (2019). The importance of better models in stochastic optimization. Proc. Natl. Acad. Sci. USA, 116(46):22924–22930.
  • Atar et al., (2015) Atar, R., Chowdhary, K., and Dupuis, P. (2015). Robust bounds on risk-sensitive functionals via rényi divergence. SIAM/ASA J. Uncertain. Quantif., 3(1):18–33.
  • Bartl et al., (2021a) Bartl, D., Drapeau, S., Obłój, J., and Wiesel, J. (2021a). Appendix to sensitivity analysis of Wasserstein distributionally robust optimization problems.
  • Bartl et al., (2021b) Bartl, D., Drapeau, S., Obłój, J., and Wiesel, J. (2021b). Sensitivity analysis of Wasserstein distributionally robust optimization problems. TBC.
  • Bartl et al., (2020) Bartl, D., Drapeau, S., and Tangpi, L. (2020). Computational aspects of robust optimized certainty equivalents and option pricing. Math. Finance, 30(1):287–309.
  • Bastani et al., (2016) Bastani, O., Ioannou, Y., Lampropoulos, L., Vytiniotis, D., Nori, A., and Criminisi, A. (2016). Measuring neural net robustness with constraints. In Advances in neural information processing systems, pages 2613–2621.
  • Ben Tal and Teboulle, (1986) Ben Tal, A. and Teboulle, M. (1986). Expected utility, penalty functions, and duality in stochastic nonlinear programming. Manag. Sci., 32(11):1445–1466.
  • Ben Tal and Teboulle, (2007) Ben Tal, A. and Teboulle, M. (2007). An old-new concept of convex risk measures: the optimized certainty equivalent. Math. Finance, 17(3):449–476.
  • Black and Scholes, (1973) Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. J. Political Econ, 81(3):637–654.
  • Blanchet et al., (2019a) Blanchet, J., Kang, Y., and Murthy, K. (2019a). Robust Wasserstein profile inference and applications to machine learning. J. Appl. Probab., 56(3):830–857.
  • Blanchet and Murthy, (2019) Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Math. Oper. Res., 44(2):565–600.
  • Blanchet et al., (2019b) Blanchet, J., Murthy, K., and Si, N. (2019b). Confidence regions in Wasserstein distributionally robust estimation. arXiv:1906.01614.
  • Bonnans and Shapiro, (2013) Bonnans, J. F. and Shapiro, A. (2013). Perturbation Analysis of Optimization Problems. Springer Science & Business Media.
  • Brezis, (2010) Brezis, H. (2010). Functional analysis, Sobolev spaces and partial differential equations. Springer Science & Business Media.
  • Calafiore, (2007) Calafiore, G. C. (2007). Ambiguous risk measures and optimal robust portfolios. SIAM J. Optim., 18(3):853–877.
  • Carlier et al., (2017) Carlier, G., Duval, V., Peyré, G., and Schmitzer, B. (2017). Convergence of entropic schemes for optimal transport and gradient flows. SIAM J. Math. Anal., 49(2):1385–1418.
  • Carlier and Ekeland, (2010) Carlier, G. and Ekeland, I. (2010). Matching for teams. Econom. Theory, 42:397–418.
  • Carlini and Wagner, (2017) Carlini, N. and Wagner, D. (2017). Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE.
  • Chen et al., (2018) Chen, Z., Kuhn, D., and Wiesemann, W. (2018). Data-driven chance constrained programs over Wasserstein balls. arXiv:1809.00210.
  • Chiappori et al., (2010) Chiappori, P.-A., McCann, R. J., and Nesheim, L. (2010). Hedonic price equilibria, stable matching, and optimal transport: Equivalence, topology, and uniqueness. Econom. Theory, 42:317–354.
  • Dupacova, (1990) Dupacova, J. (1990). Stability and sensitivity analysis for stochastic programming. Ann. Oper. Res., 27(1-4):115–142.
  • Fournier and Guillin, (2014) Fournier, N. and Guillin, A. (2014). On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields, 162(3-4):707–738.
  • Gao and Kleywegt, (2016) Gao, R. and Kleywegt, A. J. (2016). Distributionally robust stochastic optimization with Wasserstein distance. arXiv:1604.02199.
  • Ghanem et al., (2017) Ghanem, R., Higdon, D., and Owhadi, H., editors (2017). Handbook of Uncertainty Quantification. Springer International Publishing.
  • Glasserman and Xu, (2014) Glasserman, P. and Xu, X. (2014). Robust risk measurement and model risk. Quant. Finance, 14(1):29–58.
  • Goodfellow et al., (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv:1412.6572.
  • Hansen and Marinacci, (2016) Hansen, L. P. and Marinacci, M. (2016). Ambiguity aversion and model misspecification: An economic perspective. Stat. Sci., 31(4):511–515.
  • Hansen and Sargent, (2008) Hansen, L. P. and Sargent, T. (2008). Robustness. Princeton university press.
  • Ho-Nguyen and Wright, (2020) Ho-Nguyen, N. and Wright, S. J. (2020). Adversarial classification via distributional robustness with wasserstein ambiguity. arXiv preprint arXiv:2005.13815.
  • Huber and Ronchetti, (1981) Huber, P. and Ronchetti, E. (1981). Robust statistics. Wiley Series in Probability and Mathematical Statistics. New York, NY, USA, Wiley-IEEE, 52:54.
  • Komorowski et al., (2011) Komorowski, M., Costa, M. J., Rand, D. A., and Stumpf, M. P. (2011). Sensitivity, robustness, and identifiability in stochastic chemical kinetics models. Proc. Natl. Acad. Sci. USA, 108(21):8645–8650.
  • Kuhn et al., (2019) Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. (2019). Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pages 130–166. INFORMS.
  • Lam, (2016) Lam, H. (2016). Robust sensitivity analysis for stochastic systems. Math. Oper. Res., 41(4):1248–1275.
  • Lam, (2018) Lam, H. (2018). Sensitivity to serial dependency of input processes: A robust approach. Management Science, 64(3):1311–1327.
  • Li et al., (2019) Li, L., Zhong, Z., Li, B., and Xie, T. (2019). Robustra: training provable robust neural networks over reference adversarial space. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4711–4717. AAAI Press.
  • Lindsay, (1994) Lindsay, B. G. (1994). Efficiency versus robustness: the case for minimum hellinger distance and related methods. Ann. Stat., 22(2):1081–1114.
  • Mangal et al., (2019) Mangal, R., Nori, A. V., and Orso, A. (2019). Robustness of neural networks: a probabilistic and practical approach. In Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results, pages 93–96. IEEE Press.
  • Markowitz, (1952) Markowitz, H. (1952). Portfolio selection. J. Finance, 7(1):77–91.
  • Mohajerin Esfahani and Kuhn, (2018) Mohajerin Esfahani, P. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Math. Program., 171(1-2, Ser. A):115–166.
  • Obłój and Wiesel, (2021) Obłój, J. and Wiesel, J. (2021). Robust estimation of superhedging prices. Ann. Stat., 49(1):508–530.
  • Peyré and Cuturi, (2019) Peyré, G. and Cuturi, M. (2019). Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607.
  • Pflug and Wozabal, (2007) Pflug, G. and Wozabal, D. (2007). Ambiguity in portfolio selection. Quant. Finance, 7(4):435–442.
  • Pflug et al., (2012) Pflug, G. C., Pichler, A., and Wozabal, D. (2012). The 1/n investment strategy is optimal under high model ambiguity. J. Banking Finance, 36(2):410–417.
  • Rahimian and Mehrotra, (2019) Rahimian, H. and Mehrotra, S. (2019). Distributionally robust optimization: A review. arXiv.org:1908.05659.
  • Romisch, (2003) Romisch, W. (2003). Stability of stochastic programming problems. In Stochastic programming, pages 483–554. Elsevier Sci. B. V., Amsterdam.
  • Shafieezadeh-Abadeh et al., (2019) Shafieezadeh-Abadeh, S., Kuhn, D., and Esfahani, P. M. (2019). Regularization via mass transportation. J. Mach. Learn. Res., 20(103):1–68.
  • Sinha et al., (2020) Sinha, A., Namkoong, H., Volpi, R., and Duchi, J. (2020). Certifying some distributional robustness with principled adversarial training. arXiv:1710.10571v5.
  • Szegedy et al., (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2013). Intriguing properties of neural networks. arXiv:1312.6199.
  • Terkelsen, (1973) Terkelsen, F. (1973). Some minimax theorems. Math. Scand., 31(2):405–413.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol., 58(1):267–288.
  • Villani, (2008) Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media.
  • Vogel, (2007) Vogel, S. (2007). Stability results for stochastic programming problems. Optimization, 19(2):269–288.
  • Weng et al., (2018) Weng, T.-W., Zhang, H., Chen, P.-Y., Yi, J., Su, D., Gao, Y., Hsieh, C.-J., and Daniel, L. (2018). Evaluating the robustness of neural networks: An extreme value theory approach. arXiv:1801.10578.
  • Wong and Kolter, (2017) Wong, E. and Kolter, J. Z. (2017). Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv:1711.00851.

Appendix A Preliminaries

We recall and further explain the setting from the main body of the paper Bartl et al., 2021b . Take d,kd,k\in\mathbb{N}, endow d\mathbb{R}^{d} with the Euclidean norm |||\cdot|. Throughout the paper we take the convention that topological properties, such as continuity or closure, are understood w.r.t. |||\cdot|. We let Γo,Γ¯,Γ,Γc{\Gamma}^{o},\bar{\Gamma},\partial\Gamma,\Gamma^{c} denote respectively the interior, the closure, the boundary and the complement of a set Γd\Gamma\subset\mathbb{R}^{d}. We denote the set of all probability measures on Γ\Gamma by 𝒫(Γ)\mathcal{P}(\Gamma). For a variable γΓ\gamma\in\Gamma, we will denote the optimizer by γ\gamma^{\star} and the set of optimizers by Γ\Gamma^{\star}.

Fix a seminorm \|\cdot\| on d\mathbb{R}^{d} and denote by \|\cdot\|_{\ast} its (extended) dual norm, i.e. y:=sup{x,y:x1}\|y\|_{\ast}:=\sup\{\langle x,y\rangle:\|x\|\leq 1\}. Let us define the equivalence relation xyx\sim y if and only if xy=0\|x-y\|=0. Furthermore let us set U:={xd:x=0}U:=\{x\in\mathbb{R}^{d}\ :\ \|x\|=0\} and write [x]=x+U[x]=x+U. With this notation, the quotient space d/U={[x]:xd}\mathbb{R}^{d}/U=\{[x]\ :\ x\in\mathbb{R}^{d}\} is a normed space for \|\cdot\|. Furthermore, by the triangle inequality for \|\cdot\| and equivalence of norms on d\mathbb{R}^{d}, there exists c>0c>0 such that xc|x|\|x\|\leq c|x| and |x|cx|x|\leq c\|x\|_{\ast} for all xdx\in\mathbb{R}^{d}. As |||\cdot| is Hausdorff, this immediately implies that \|\cdot\|_{\ast} is Hausdorff as well. Furthermore we conclude that \|\cdot\| is continuous and \|\cdot\|_{\ast} is lower semicontinuous w.r.t. |||\cdot| (as the supremum over continuous functions x,\langle x,\cdot\rangle). Lastly we make the convention that Bδ(x)B_{\delta}(x) denotes the ball of radius δ\delta around xx in |||\cdot|. As our setup is slightly non-standard, we state the following lemmas for completeness:

Lemma 6.

For every xdx\in\mathbb{R}^{d} we have that x=sup{x,y:y1}\|x\|=\sup\{\langle x,y\rangle\ :\ \|y\|_{\ast}\leq 1\}.

Proof.

As {xd:x1}\{x\in\mathbb{R}^{d}\ :\ \|x\|\leq 1\} is convex and closed, this follows directly from the bipolar theorem. ∎

Lemma 7.

Assume that \|\cdot\|_{\ast} is strictly convex. Then the following hold:

  1. (i)

    For all xdx\in\mathbb{R}^{d} there exists h(x)dh(x)\in\mathbb{R}^{d} such that h(x)=1\|h(x)\|_{\ast}=1 and x=x,h(x)\|x\|=\langle x,h(x)\rangle. If x0x\neq 0, then h(x)h(x) is unique.

  2. (ii)

    The map h:d{0}dh:\mathbb{R}^{d}\setminus\{0\}\to\mathbb{R}^{d} is continuous.

Proof.

Fix xd{0}x\in\mathbb{R}^{d}\setminus\{0\}. The existence of h(x)dh(x)\in\mathbb{R}^{d} in (i) follows from Lemma 6. Assume towards a contradiction that there exists another h~(x)d\tilde{h}(x)\in\mathbb{R}^{d} with h~(x)=1\|\tilde{h}(x)\|_{\ast}=1, x,h~(x)=x\langle x,\tilde{h}(x)\rangle=\|x\| and h~(x)h(x)\tilde{h}(x)\neq h(x). Defining h¯(x)=(h(x)+h~(x))/2\bar{h}(x)=(h(x)+\tilde{h}(x))/2 we have x,h¯(x)=(x,h(x)+x,h~(x))/2=x\langle x,\bar{h}(x)\rangle=(\langle x,h(x)\rangle+\langle x,\tilde{h}(x)\rangle)/2=\|x\|. On the other hand, by the Hausdorff property of \|\cdot\|_{\ast}, we have h(x)h~(x)0\|h(x)-\tilde{h}(x)\|_{\ast}\neq 0 and thus, by strict convexity of \|\cdot\|_{\ast}, h¯(x)<1\|\bar{h}(x)\|_{\ast}<1. Using again Lemma 6, we conclude xx,h¯(x)/h¯(x)>x\|x\|\geq\langle x,\bar{h}(x)/\|\bar{h}(x)\|_{\ast}\rangle>\|x\|, a contradiction.
For (ii) we assume towards a contradiction that for some sequence (xn)n(x_{n})_{n\in\mathbb{N}} in d\mathbb{R}^{d} we have limnxn=xd{0}\lim_{n\to\infty}x_{n}=x\in\mathbb{R}^{d}\setminus\{0\}, but limnh(xn)h(x)\lim_{n\to\infty}h(x_{n})\neq h(x). As remarked above, we have {1}Bc(0)\{\|\cdot\|_{\ast}\leq 1\}\subseteq B_{c}(0), in particular limnh(xn)=yd\lim_{n\to\infty}h(x_{n})=y\in\mathbb{R}^{d} after taking a subsequence. Recalling that h(x)yh(x)\neq y and \|\cdot\|_{\ast} is lower semicontinuous, we conclude that y1\|y\|_{\ast}\leq 1 and in particular x>x,y\|x\|>\langle x,y\rangle by Lemma 6 and (i). Finally

x=limnxn=limnxn,h(xn)=x,y,\|x\|=\lim_{n\to\infty}\|x_{n}\|=\lim_{n\to\infty}\langle x_{n},h(x_{n})\rangle=\langle x,y\rangle,

which leads to a contradiction. ∎

Lemma 8.

If \|\cdot\| is strictly convex, then \|\cdot\|_{\ast} is strictly convex as well.

Proof.

Fix yd{0}y\in\mathbb{R}^{d}\setminus\{0\}. We first note that

k(y):={xd:x=1,y=x,y}/U\displaystyle k(y):=\{x\in\mathbb{R}^{d}\ :\ \|x\|=1,\,\|y\|_{\ast}=\langle x,y\rangle\}/U

is uniquely defined. Indeed, this follows from applying the exact same arguments as in the proof of Lemma 7, adjusting for UU. Take now y,ydy,y^{\prime}\in\mathbb{R}^{d} such that y=y=1\|y\|_{\ast}=\|y^{\prime}\|_{\ast}=1 and yy0\|y-y^{\prime}\|_{\ast}\neq 0. Set y¯=(y+y)/2\bar{y}=(y+y^{\prime})/2 and note that y¯y,y¯y0\|\bar{y}-y\|_{\ast},\|\bar{y}-y^{\prime}\|_{\ast}\neq 0. Then y¯=([k(y¯)],y+[k(y¯)],y)/2<1\|\bar{y}\|_{\ast}=(\langle[k(\bar{y})],y\rangle+\langle[k(\bar{y})],y^{\prime}\rangle)/2<1. This shows the claim. ∎

Let 𝒮\mathcal{S} denote the state space which is a closed convex subset of d\mathbb{R}^{d}. Fix p>1p>1 and take q=p/(p1)q=p/(p-1) so that 1/p+1/q=11/p+1/q=1. For probability measures μ\mu and ν\nu on 𝒮\mathcal{S}, we define their pp-Wasserstein distance as

Wp(μ,ν):=inf{𝒮×𝒮xypπ(dx,dy):πCpl(μ,ν)}1/p,W_{p}(\mu,\nu):=\inf\left\{\int_{\mathcal{S}\times\mathcal{S}}\|x-y\|_{\ast}^{p}\,\pi(dx,dy)\colon\pi\in\mathrm{Cpl}(\mu,\nu)\right\}^{1/p},

where Cpl(μ,ν)\mathrm{Cpl}(\mu,\nu) is the set of all probability measures π𝒫(𝒮×𝒮)\pi\in\mathcal{P}(\mathcal{S}\times\mathcal{S}) with first marginal π1:=π(×𝒮)=μ\pi_{1}:=\pi(\cdot\times\mathcal{S})=\mu and second marginal π2:=π(𝒮×)=ν\pi_{2}:=\pi(\mathcal{S}\times\cdot)=\nu. In the proofs we sometimes also use the pp-Wasserstein distance with respect to the Euclidean norm |||\cdot| given by

Wp||(μ,ν)=inf{𝒮×𝒮|xy|pπ(dx,dy):πCpl(μ,ν)}1/p.W_{p}^{|\cdot|}(\mu,\nu)=\inf\left\{\int_{\mathcal{S}\times\mathcal{S}}|x-y|^{p}\,\pi(dx,dy)\colon\pi\in\mathrm{Cpl}(\mu,\nu)\right\}^{1/p}.

Recall that ||c|\cdot|\leq c\|\cdot\|_{\ast} for some constant c>0c>0, which in turn implies that Wp||(,)cWp(,)W_{p}^{|\cdot|}(\cdot,\cdot)\leq cW_{p}(\cdot,\cdot). A Wasserstein ball of size δ0\delta\geq 0 around μ\mu is denoted

Bδ(μ):={ν𝒫(d):Wp(μ,ν)δ}.B_{\delta}(\mu):=\left\{\nu\in\mathcal{P}(\mathbb{R}^{d}):W_{p}(\mu,\nu)\leq\delta\right\}.
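
As an aside, in the scalar case 𝒮=\mathbb{R}, \|\cdot\|=|\cdot| the distance above between two empirical measures with the same number of equally weighted atoms is given by the monotone (sorted) coupling. The short sketch below is purely illustrative (the sample distributions are arbitrary) and computes W_{p} in this special case.

```python
import numpy as np

def wasserstein_1d(x, y, p):
    # For two empirical measures on R with the same number of equally weighted
    # atoms, the monotone (sorted) coupling is optimal.
    x, y = np.sort(x), np.sort(y)
    return (np.abs(x - y) ** p).mean() ** (1.0 / p)

rng = np.random.default_rng(2)
mu_samples = rng.normal(0.0, 1.0, size=1000)
nu_samples = rng.normal(0.5, 1.2, size=1000)
print(wasserstein_1d(mu_samples, nu_samples, p=2))
```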

From now on, we fix μ𝒫(𝒮)\mu\in\mathcal{P}(\mathcal{S}) such that μ(𝒮)=0\mu(\partial\mathcal{S})=0 and 𝒮|x|pμ(dx)<\int_{\mathcal{S}}|x|^{p}\,\mu(dx)<\infty. Let 𝒜\mathcal{A} denote the action (decision) space which is a convex and closed subset of k\mathbb{R}^{k}. We consider the robust stochastic optimization problem (2):

V(δ):=infa𝒜V(δ,a):=infa𝒜supνBδ(μ)𝒮f(x,a)ν(dx).\displaystyle V(\delta)\mathrel{\mathop{:}}=\inf_{a\in\mathcal{A}}V(\delta,a)\mathrel{\mathop{:}}=\inf_{a\in\mathcal{A}}\sup_{\nu\in B_{\delta}\left(\mu\right)}\int_{\mathcal{S}}f\left(x,a\right)\,\nu(dx).

In accordance with our conventions, we write aa^{\star} for an optimizer: V(δ)=V(δ,a)V(\delta)=V(\delta,a^{\star}) and 𝒜δ𝒜\mathcal{A}^{\star}_{\delta}\subset\mathcal{A} for the set of such optimizers. We also let Bδ(μ,a)B^{\star}_{\delta}(\mu,a) denote the set of measures ν\nu^{\star} such that V(δ,a)=𝒮f(x,a)ν(dx)V(\delta,a)=\int_{\mathcal{S}}f(x,a)\,\nu^{\star}(dx) and sometimes write Bδ(μ)B^{\star}_{\delta}(\mu) for Bδ(μ,a)B^{\star}_{\delta}(\mu,a^{\star}) if a𝒜δa^{\star}\in\mathcal{A}^{\star}_{\delta} is fixed.

Appendix B Discussion, extensions and proofs related to Theorem 2

We complement now the discussion of Theorem 2. We start with some remarks, extensions and further examples before proceeding with the proofs, including a complete proof of Theorem 2 for general seminorms \|\cdot\|.

B.1. Discussion and extensions of Theorem 2

Remark 9.

Theorem 2 may fail for p=1p=1. Indeed take d=1d=1, =||\|\cdot\|=|\cdot| and f(x)=x2f(x)=x^{2}, 𝒮=[1,1]\mathcal{S}=[-1,1], and let μ\mu be the point mass at zero, μ=δ0\mu=\delta_{0}. Then xf(x)=2x\nabla_{x}f(x)=2x and the μ(dx)\mu(dx)–essential supremum of |xf(x)||\nabla_{x}f(x)| is equal to 0. However νλ:=λδ1+(1λ)δ0Bλ(μ)\nu_{\lambda}:=\lambda\delta_{1}+(1-\lambda)\delta_{0}\in B_{\lambda}(\mu) for all λ[0,1]\lambda\in[0,1] and it is easy to see that V(δ)=δV(\delta)=\delta and thus V(0)=1V^{\prime}(0)=1. The point where the proof of Theorem 2 breaks down is that the map from δ\delta to the νδ\nu_{\delta}–essential supremum of |xf(x)||\nabla_{x}f(x)| is not continuous at δ=0\delta=0.

Remark 10.

Let p>2p>2. In addition to Assumption 1, suppose that ff is twice continuously differentiable and that for every r0r\geq 0 there is c0c\geq 0 such that |x2f(x,a)|c(1+|x|p2)|\nabla_{x}^{2}f(x,a)|\leq c(1+|x|^{p-2}) for all x𝒮x\in\mathcal{S} and all a𝒜a\in\mathcal{A} with |a|r|a|\leq r. Then, the same arguments as in the proof of Theorem 2 but with a second order Taylor expansion yield

\displaystyle V(\delta)\leq V(0) +\delta\left(\int_{\mathcal{S}}\|\nabla_{x}f(x,a^{\star})\|^{q}\,\mu(dx)\right)^{1/q}
+\delta^{2}\left(\int_{\mathcal{S}}\lambda_{\max}\Big{(}\frac{1}{2}\nabla_{x}^{2}f(x,a^{\star})\Big{)}^{r}\,\mu(dx)\right)^{1/r}+o(\delta^{2}),

for small δ0\delta\geq 0, where λmax\lambda_{\max} denotes the largest eigenvalue of the Hessian taken w.r.t. the norm \|\cdot\|_{\ast} and r=p/(p2)r=p/(p-2) is such that 2/p+1/r=12/p+1/r=1.

In particular, this means that if the term in front of δ2\delta^{2} is the same order of magnitude as the term in front of δ\delta, then the first order approximation is quite accurate for small δ\delta. Note that larger pp implies smaller rr and therefore a smaller term in front of the δ2\delta^{2} term.

Remark 11.

We believe that Assumption 1 lists natural sufficient conditions for differentiability of V(δ)V(\delta) in zero. In particular all these conditions are used in the proof of Theorem 2. Relaxing Assumption 1 seems to require a careful analysis of the interplay between (the space explored by balls around) μ\mu and the functions f,xff,\nabla_{x}f. We state here a straightforward extension to the case where ff is only weakly differentiable and leave more fundamental extensions (e.g., to manifolds) for future research.
Specifically, in case that the baseline distribution μ\mu is absolutely continuous w.r.t. the Lebesgue measure and =||\|\cdot\|=|\cdot|, Theorem 2 remains true if we merely assume that f(,a)f(\cdot,a) has a weak derivative (in the Sobolev sense) on 𝒮o{\mathcal{S}}^{o} for all a𝒜a\in\mathcal{A} and replace xf(,a)\nabla_{x}f(\cdot,a) by the weak derivative of f(,a)f(\cdot,a) in Assumption 1. More concretely the first point of Assumption 1 should read:

  • The weak derivative (x,a)g(x,a)(x,a)\mapsto g(x,a) of f(,a)f(\cdot,a) is continuous at every point (x,a)(𝒮N)×𝒜0(x,a)\in(\mathcal{S}\setminus N)\times\mathcal{A}^{\star}_{0}, where NN is a Lebesgue-null set, and for every r>0r>0 there is c>0c>0 such that |g(x,a)|c(1+|x|p1)|g(x,a)|\leq c(1+|x|^{p-1}) for all x𝒮x\in\mathcal{S} and |a|r|a|\leq r.

Proof of Remark 11.

For notational simplicity we only consider the case 𝒮=d\mathcal{S}=\mathbb{R}^{d}. Note that by, e.g., (Brezis, 2010, Theorem 8.2) we can assume that f(,a)f(\cdot,a) is continuous and satisfies

f(y,a)f(x,a)=01g(x+t(yx),a),yx𝑑t\displaystyle f(y,a)-f(x,a)=\int_{0}^{1}\langle g(x+t(y-x),a),y-x\rangle\,dt

for all x,ydx,y\in\mathbb{R}^{d} and all a𝒜a\in\mathcal{A}. Furthermore

(18) supνBδ(μ)𝒮f(x,a)ν(dx)=supνBδ(μ),νLeb𝒮f(x,a)ν(dx),\displaystyle\sup_{\nu\in B_{\delta}(\mu)}\int_{\mathcal{S}}f(x,a)\,\nu(dx)=\sup_{\nu\in B_{\delta}(\mu),\ \nu\ll\mathrm{Leb}}\int_{\mathcal{S}}f(x,a)\,\nu(dx),

where νLeb\nu\ll\mathrm{Leb} means that ν\nu is absolutely continuous w.r.t. the Lebesgue measure. Indeed, let us take νBδ(μ)\nu\in B_{\delta}(\mu) and set ν~=ν~(t,ε)=(1t)(νN(0,ε))+tμ\tilde{\nu}=\tilde{\nu}(t,\varepsilon)=(1-t)(\nu\ast N(0,\varepsilon))+t\mu, where N(0,ε)N(0,\varepsilon) denotes the multivariate normal distribution with covariance ε𝐈\varepsilon\mathbf{I}, ε>0\varepsilon>0 and \ast denotes the convolution operator. For every 0<t<10<t<1, by convexity of Wpp(,)W_{p}^{p}(\cdot,\cdot) and the triangle inequality for WpW_{p}, we have

Wpp(μ,ν~)\displaystyle W_{p}^{p}(\mu,\tilde{\nu}) (1t)Wpp(νN(0,ε),μ)+tWpp(μ,μ)\displaystyle\leq(1-t)W_{p}^{p}(\nu\ast N(0,\varepsilon),\mu)+tW_{p}^{p}(\mu,\mu)
=(1t)Wpp(νN(0,ε),μ)\displaystyle=(1-t)W_{p}^{p}(\nu\ast N(0,\varepsilon),\mu)
(1t)(Wp(νN(0,ε),ν)+Wp(ν,μ))p.\displaystyle\leq(1-t)\left(W_{p}(\nu\ast N(0,\varepsilon),\nu)+W_{p}(\nu,\mu)\right)^{p}.

By assumption Wp(ν,μ)δW_{p}(\nu,\mu)\leq\delta and one can check that Wp(νN(0,ε),ν)0W_{p}(\nu\ast N(0,\varepsilon),\nu)\to 0 as ε0\varepsilon\to 0. Hence, for every 0<t<10<t<1 there exists a small ε=ε(t)>0\varepsilon=\varepsilon(t)>0 such that Wp(μ,ν~)δW_{p}(\mu,\tilde{\nu})\leq\delta; moreover, ε(t)\varepsilon(t) can be chosen so that Wp(νN(0,ε(t)),ν)0W_{p}(\nu\ast N(0,\varepsilon(t)),\nu)\to 0 as t0t\to 0. As then limt0𝒮f(x,a)ν~(dx)=𝒮f(x,a)ν(dx)\lim_{t\to 0}\int_{\mathcal{S}}f(x,a)\,\tilde{\nu}(dx)=\int_{\mathcal{S}}f(x,a)\,\nu(dx), this shows (18). The proof of the remark now follows by the exact same arguments as in the proof of Theorem 2. ∎

A natural example which highlights the importance of Remark 11 is the following:

Example 12.

We let μ\mu be a model for a vector of returns X𝒮=dX\in\mathcal{S}=\mathbb{R}^{d} and assume that μ\mu is absolutely continuous with respect to Lebesgue measure. Let further =||\|\cdot\|=|\cdot| and let zBdz\in B\subset\mathbb{R}^{d} denote a portfolio. We then consider the average value at risk at level α(0,1)\alpha\in(0,1) of the portfolio wealth z,X\langle z,X\rangle, which can be written as

AV@Rα(z,X)=1α1α1V@Ru(z,X)𝑑u,\displaystyle\text{AV@R}_{\alpha}\left(\langle z,X\rangle\right)=\frac{1}{\alpha}\int_{1-\alpha}^{1}\text{V@R}_{u}(\langle z,X\rangle)du,

where V@Ru(z,X)\text{V@R}_{u}(\langle z,X\rangle) is the value at risk at level u(0,1)u\in(0,1) defined as

\displaystyle\text{V@R}_{u}(\langle z,X\rangle)=\inf\{m\in\mathbb{R}\ :\ \mu\left(\langle z,X\rangle\leq m\right)\geq u\}.

We note that the average value at risk is an example of an optimized certainty equivalent (OCE), obtained by choosing l(x,a)=a+1α(xa)+l(x,a)=a+\frac{1}{\alpha}(x-a)^{+} in (Bartl et al., 2021b, p. 3). We can thus rewrite the optimization problem

V(0)\displaystyle V(0) =infzBAV@Rα(z,X)\displaystyle=\inf_{z\in B}\text{AV@R}_{\alpha}\left(\langle z,X\rangle\right)

as

V(0)=infzB,m(m+1α𝒮(z,xm)+μ(dx)).\displaystyle V(0)=\inf_{z\in B,m\in\mathbb{R}}\left(m+\frac{1}{\alpha}\int_{\mathcal{S}}\left(\langle z,x\rangle-m\right)^{+}\mu(dx)\right).

Set 𝒜=B×\mathcal{A}=B\times\mathbb{R} and assume that there exists a unique minimiser (z,m)𝒜0(z^{\star},m^{\star})\in\mathcal{A}^{\star}_{0} of V(0)V(0). Then mm^{\star} is given by V@R1α(z,X)\mathrm{V@R}_{1-\alpha}(\langle z^{\star},X\rangle). The robust version of V(0)V(0) reads

V(δ)=inf(z,m)𝒜supνBδ(μ)(m+1α𝒮(z,xm)+ν(dx)).V(\delta)=\inf_{(z,m)\in\mathcal{A}}\sup_{\nu\in B_{\delta}(\mu)}\left(m+\frac{1}{\alpha}\int_{\mathcal{S}}\left(\langle z,x\rangle-m\right)^{+}\nu(dx)\right).

Note that the function xx+x\mapsto x^{+} is weakly differentiable with weak derivative 𝟏{x0}\mathbf{1}_{\{x\geq 0\}}. Hence f(x,(z,m))=m+1α(z,xm)+f(x,(z,m))=m+\frac{1}{\alpha}\left(\langle z,x\rangle-m\right)^{+} has weak derivative (in xx)

g(x,(z,m))=\frac{1}{\alpha}\,z\,\mathbf{1}_{\{\langle z,x\rangle-m\geq 0\}},

which is continuous at (x,(z,m))(x,(z^{\star},m^{\star})) except on the lower-dimensional set {x𝒮:z,xm=0}\{x\in\mathcal{S}\ :\ \langle z^{\star},x\rangle-m^{\star}=0\}, which is in particular a Lebesgue null set. Remark 11 thus yields

\displaystyle V^{\prime}(0)=|z^{\star}|\left(\frac{1}{\alpha^{q}}\int_{\mathcal{S}}\mathbf{1}_{\left\{\langle z^{\star},x\rangle\geq m^{\star}\right\}}\,\mu(dx)\right)^{\frac{1}{q}}=\frac{|z^{\star}|}{\alpha^{1/p}},\qquad\text{since }\mu\left(\langle z^{\star},X\rangle\geq m^{\star}\right)=\alpha,

and thus

V(δ)\displaystyle V(\delta) =AV@Rα(z,X)+|z|α1/pδ+o(δ).\displaystyle=\text{AV@R}_{\alpha}\left(\langle z^{\star},X\rangle\right)+\frac{|z^{\star}|}{\alpha^{1/p}}\delta+o(\delta).

Comparing with (Bartl et al., 2020, Table 1), we see that this approximation is actually exact for p=1,2p=1,2.
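
The sensitivity formula of Example 12 is also easy to verify by simulation. The sketch below is our own illustration (assumptions: p=q=2, a Gaussian return model and a fixed portfolio z in place of the optimizer z^{\star}); it evaluates the right-hand side of the display on samples and compares it with |z|/\alpha^{1/2}.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, z = 0.05, np.array([0.6, -0.8])               # level and a fixed portfolio
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 2.0]], size=200000)
loss = X @ z
m = np.quantile(loss, 1 - alpha)                     # minimiser of the inner OCE problem
indicator_mass = (loss >= m).mean()                  # approximately alpha

sensitivity = np.linalg.norm(z) * np.sqrt(indicator_mass / alpha ** 2)
print(sensitivity, np.linalg.norm(z) / np.sqrt(alpha))   # the two numbers should agree
```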

We now mention two extensions of Theorem 2. The first one concerns the derivative of V(δ)V(\delta) for δ>0\delta>0.

Corollary 13.

Fix r>0r>0 and in addition to the assumptions of Theorem 2 assume that

  • 𝒜r+δ\mathcal{A}^{\star}_{r+\delta}\neq\emptyset for δ0\delta\geq 0 small enough and for every sequence (δn)n(\delta_{n})_{n\in\mathbb{N}} such that limnδn=0\lim_{n\to\infty}\delta_{n}=0 and (an)n(a^{\star}_{n})_{n\in\mathbb{N}} such that an𝒜r+δna^{\star}_{n}\in\mathcal{A}^{\star}_{r+\delta_{n}} there is a subsequence which converges to some a𝒜ra^{\star}\in\mathcal{A}^{\star}_{r}.

  • there exists ε>0\varepsilon>0 such that for all γ>0\gamma>0 and every a𝒜a\in\mathcal{A} with |a|γ|a|\leq\gamma one has |xf(x,a)|c(1+|x|p1ε)|\nabla_{x}f(x,a)|\leq c(1+|x|^{p-1-\varepsilon}) for all x𝒮x\in\mathcal{S} and some constant c>0c>0.

Then

V(r+)=limδ0V(r+δ)V(r)δ=infa𝒜rsupνBr(μ,a)(𝒮xf(x,a)qν(dx))1/q,V^{\prime}(r+)=\lim_{\delta\to 0}\frac{V(r+\delta)-V(r)}{\delta}=\inf_{a^{\star}\in\mathcal{A}^{\star}_{r}}\sup_{\nu\in B^{\star}_{r}(\mu,a^{\star})}\left(\int_{\mathcal{S}}\|\nabla_{x}f(x,a^{\star})\|^{q}\,\nu(dx)\right)^{1/q},

where we recall that Br(μ,a)B^{\star}_{r}(\mu,a^{\star}) is the set of all νBr(μ)\nu\in B_{r}(\mu) for which 𝒮f(x,a)ν(dx)=V(r)\int_{\mathcal{S}}f(x,a^{\star})\,\nu(dx)=V(r).

Remark 14.

Recall the notation V(δ,a)V(\delta,a) in (2). Inspecting the proof of the above Corollary, it is clear that the main difficulty lies in showing that

limδ0V(r+δ,a)V(r,a)δ=supνBr(μ,a)(𝒮xf(x,a)qν(dx))1/q.\lim_{\delta\to 0}\frac{V(r+\delta,a)-V(r,a)}{\delta}=\sup_{\nu\in B^{\star}_{r}(\mu,a)}\left(\int_{\mathcal{S}}\|\nabla_{x}f(x,a)\|^{q}\,\nu(dx)\right)^{1/q}.

In this way, the final statement of Corollary 13, or indeed of Theorem 2, can be interpreted as an instance of the envelope theorem.

The second extension of Theorem 2 offers a more specific sensitivity result by including additional constraints on the ball Bδ(μ)B_{\delta}(\mu) of measures considered. Let mm\in\mathbb{N} and let Φ=(Φ1,,Φm):𝒮m\Phi=(\Phi_{1},\dots,\Phi_{m})\colon\mathcal{S}\to\mathbb{R}^{m} be a family of mm functions and assume that μ\mu is calibrated to Φ\Phi in the sense that 𝒮Φ(x)μ(dx)=0\int_{\mathcal{S}}\Phi(x)\,\mu(dx)=0. Consider the set

BδΦ(μ):={νBδ(μ):𝒮Φ(x)ν(dx)=0}B^{\Phi}_{\delta}(\mu):=\left\{\nu\in B_{\delta}(\mu):\int_{\mathcal{S}}\Phi(x)\,\nu(dx)=0\right\}

and the corresponding optimization problem

VΦ(δ):=infa𝒜supνBδΦ(μ)𝒮f(x,a)ν(dx).\displaystyle V^{\Phi}(\delta)\mathrel{\mathop{:}}=\inf_{a\in\mathcal{A}}\sup_{\nu\in B^{\Phi}_{\delta}\left(\mu\right)}\int_{\mathcal{S}}f\left(x,a\right)\,\nu(dx).

We have the following result.

Theorem 15 (Sensitivity of V(δ)V(\delta) under linear constraints).

In addition to the assumptions of Theorem 2, assume that there is some small ε>0\varepsilon>0 such that for every a𝒜a\in\mathcal{A} one has |f(x,a)|c(1+|x|pε)|f(x,a)|\leq c(1+|x|^{p-\varepsilon}) for all xdx\in\mathbb{R}^{d} and some constant c>0c>0. Further assume that Φi\Phi_{i}, imi\leq m, are continuously differentiable with |Φi(x)|c(1+|x|pε)|\Phi_{i}(x)|\leq c(1+|x|^{p-\varepsilon}), |xΦi(x)|c(1+|x|p1)|\nabla_{x}\Phi_{i}(x)|\leq c(1+|x|^{p-1}) and that the non-degeneracy condition

(19) \displaystyle\inf\left\{\int_{\mathcal{S}}\bigg{\|}\sum_{i=1}^{m}\lambda_{i}\nabla_{x}\Phi_{i}(x)\bigg{\|}^{q}\,\mu(dx):\lambda\in\mathbb{R}^{m},\ |\lambda|=1\right\}>0

holds. Then

(VΦ)(0)=infa𝒜0infλm(𝒮xf(x,a)+i=1mλixΦi(x)qμ(dx))1/q.\displaystyle(V^{\Phi})^{\prime}(0)=\inf_{a^{\star}\in\mathcal{A}^{\star}_{0}}\inf_{\lambda\in\mathbb{R}^{m}}\left(\int_{\mathcal{S}}\Big{\|}\nabla_{x}f(x,a^{\star})+\sum_{i=1}^{m}\lambda_{i}\nabla_{x}\Phi_{i}(x)\Big{\|}^{q}\,\mu(dx)\right)^{1/q}.
Remark 16.

Note that if \|\cdot\| is a norm and μ\mu has full support, the above non-degeneracy condition (19) can be assumed without loss of generality. Indeed, as the unit sphere is compact and the function λ𝒮i=1mλixΦi(x)qμ(dx)\lambda\mapsto\int_{\mathcal{S}}\left\|\sum_{i=1}^{m}\lambda_{i}\nabla_{x}\Phi_{i}(x)\right\|^{q}\,\mu(dx) is continuous, the infimum in (19) is attained. In particular, if

inf{𝒮i=1mλixΦi(x)qμ(dx):|λ|=1}=0,\displaystyle\inf\left\{\int_{\mathcal{S}}\left\|\sum_{i=1}^{m}\lambda_{i}\nabla_{x}\Phi_{i}(x)\right\|^{q}\,\mu(dx):|\lambda|=1\right\}=0,

then i=1mλixΦi=0\sum_{i=1}^{m}\lambda_{i}\nabla_{x}\Phi_{i}=0 μ\mu-a.s. for some λ\lambda in the unit sphere. As μ\mu has full support this implies that i=1mλixΦi=0\sum_{i=1}^{m}\lambda_{i}\nabla_{x}\Phi_{i}=0 on 𝒮\mathcal{S}. Thus xΦ1,,xΦm\nabla_{x}\Phi_{1},\dots,\nabla_{x}\Phi_{m} are linearly dependent functions on 𝒮\mathcal{S}. Deleting all linearly dependent coordinates and calling the resulting vector Φ~\tilde{\Phi}, we have VΦ(δ)=VΦ~(δ)V^{\Phi}(\delta)=V^{\tilde{\Phi}}(\delta) for every δ0\delta\geq 0. Moreover, the non-degeneracy condition (19) holds for Φ~\tilde{\Phi}.

Remark 17.

We can relax the conditions of Theorem 15 in the spirit of Remark 11: more specifically, assume that the baseline distribution μ\mu is absolutely continuous w.r.t. the Lebesgue measure and =||\|\cdot\|=|\cdot|. Then Theorem 15 remains true if we merely assume that f(,a)f(\cdot,a) and Φi\Phi_{i} have a weak derivative (in the Sobolev sense) on 𝒮o{\mathcal{S}}^{o} for all a𝒜a\in\mathcal{A} and replace xf(,a)\nabla_{x}f(\cdot,a) and Φi\nabla\Phi_{i} by the weak derivative of f(,a)f(\cdot,a) and of Φi\Phi_{i} respectively. More concretely the assumption should read:

  • The weak derivatives (x,a)g(x,a)(x,a)\mapsto g(x,a) of f(,a)f(\cdot,a) and xgi(x)x\mapsto g_{i}(x) of Φi\Phi_{i} are continuous at every point (x,a)(𝒮N)×𝒜0(x,a)\in(\mathcal{S}\setminus N)\times\mathcal{A}^{\star}_{0}, where NN is a Lebesgue-null set, and for every r>0r>0 there is c>0c>0 such that |g(x,a)|c(1+|x|p1)|g(x,a)|\leq c(1+|x|^{p-1}) and |gi(x)|c(1+|x|p1)|g_{i}(x)|\leq c(1+|x|^{p-1}) for all x𝒮x\in\mathcal{S}, i=1,,mi=1,\dots,m and |a|r|a|\leq r.

Example 18 (Martingale constraints).

Let d=1d=1, 𝒮=\mathcal{S}=\mathbb{R}, =||\|\cdot\|=|\cdot|, p=2p=2, and let Φ1(x):=xx0\Phi_{1}(x):=x-x_{0} and Φ:={Φ1}\Phi:=\{\Phi_{1}\}, i.e., BδΦ(μ)B^{\Phi}_{\delta}(\mu) corresponds to the measures νBδ(μ)\nu\in B_{\delta}(\mu) satisfying the martingale (barycentre preservation) constraint xν(dx)=x0\int_{\mathbb{R}}x\,\nu(dx)=x_{0}. Clearly the assumptions on Φ\Phi of Theorem 15 are satisfied. It remains to solve the optimization problem over λ\lambda\in\mathbb{R} and plug in the optimizer. We then obtain

(VΦ)(0)=infa𝒜0((xf(x,a)xf(y,a)μ(dy))2μ(dx))1/2,(V^{\Phi})^{\prime}(0)=\inf_{a^{\star}\in\mathcal{A}^{\star}_{0}}\left(\int_{\mathbb{R}}\left(\nabla_{x}f(x,a^{\star})-\int_{\mathbb{R}}\nabla_{x}f(y,a^{\star})\,\mu(dy)\right)^{2}\,\mu(dx)\right)^{1/2},

i.e., (VΦ)(0)(V^{\Phi})^{\prime}(0) is the standard deviation of xf(,a)\nabla_{x}f(\cdot,a^{\star}) under μ\mu. In line with the previous remark, this result extends to the call option pricing example discussed in the main body of the paper.
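
In this example the infimum over λ\lambda in Theorem 15 is a one-dimensional least-squares problem, which makes the standard-deviation formula easy to check numerically. The sketch below uses the illustrative choices f(x)=x^{3} and μ\mu standard Gaussian (our own choices, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100000)          # samples standing in for mu
grad_f = 3 * x ** 2                  # grad_x f for the illustrative choice f(x) = x^3

lam_star = -grad_f.mean()            # optimal Lagrange multiplier for Phi_1(x) = x - x_0
value = np.sqrt(((grad_f + lam_star) ** 2).mean())
print(value, grad_f.std())           # both equal the standard deviation of grad_x f under mu
```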

Example 19 (Covariance constraints).

Let d=2d=2, 𝒮=2\mathcal{S}=\mathbb{R}^{2}, =||\|\cdot\|=|\cdot|, p=2p=2. Further let Φ1(x1,x2):=x1x2b\Phi_{1}(x_{1},x_{2}):=x_{1}x_{2}-b for some bb\in\mathbb{R} and Φ:={Φ1}\Phi:=\{\Phi_{1}\}, i.e., we want to optimize over measures νBδ(μ)\nu\in B_{\delta}(\mu) satisfying the covariance constraint 2x1x2ν(dx)=b\int_{\mathbb{R}^{2}}x_{1}x_{2}\,\nu(dx)=b. Assume that there exists no λ{0}\lambda\in\mathbb{R}\setminus\{0\} such that μ\mu-a.s. x1=λx2x_{1}=\lambda x_{2}. Clearly the assumptions on Φ\Phi of Theorem 15 are satisfied. Note that

2|xf(x,a)+λ1xΦ1(x)|2μ(dx)\displaystyle\int_{\mathbb{R}^{2}}|\nabla_{x}f(x,a)+\lambda_{1}\nabla_{x}\Phi_{1}(x)|^{2}\,\mu(dx)
=2(x1f(x,a)+λ1x2)2+(x2f(x,a)+λ1x1)2μ(dx),\displaystyle=\int_{\mathbb{R}^{2}}(\nabla_{x_{1}}f(x,a)+\lambda_{1}x_{2})^{2}+(\nabla_{x_{2}}f(x,a)+\lambda_{1}x_{1})^{2}\,\mu(dx),

so in particular the optimal λ\lambda in the definition of (VΦ)(0)(V^{\Phi})^{\prime}(0) is given by

λ1=2x1f(x,a)x2+x2f(x,a)x1μ(dx)2x12+x22μ(dx).\lambda_{1}=\frac{-\int_{\mathbb{R}^{2}}\nabla_{x_{1}}f(x,a)x_{2}+\nabla_{x_{2}}f(x,a)x_{1}\,\mu(dx)}{\int_{\mathbb{R}^{2}}x_{1}^{2}+x_{2}^{2}\,\mu(dx)}.

Plugging this in gives

2|xf(x,a)+λ1xΦ1(x)|2μ(dx)=2(x1f(x,a))2+(x2f(x,a))2μ(dx)\displaystyle\int_{\mathbb{R}^{2}}|\nabla_{x}f(x,a)+\lambda_{1}\nabla_{x}\Phi_{1}(x)|^{2}\,\mu(dx)=\int_{\mathbb{R}^{2}}(\nabla_{x_{1}}f(x,a))^{2}+(\nabla_{x_{2}}f(x,a))^{2}\,\mu(dx)
\qquad\qquad+2\lambda_{1}\int_{\mathbb{R}^{2}}(\nabla_{x_{1}}f(x,a)x_{2}+\nabla_{x_{2}}f(x,a)x_{1})\,\mu(dx)+\lambda_{1}^{2}\int_{\mathbb{R}^{2}}x_{2}^{2}+x_{1}^{2}\,\mu(dx)
=2(x1f(x,a))2+(x2f(x,a))2μ(dx)(2(x1f(x,a)x2+x2f(x,a)x1)μ(dx))22(x12+x22)μ(dx).\displaystyle=\int_{\mathbb{R}^{2}}(\nabla_{x_{1}}f(x,a))^{2}+(\nabla_{x_{2}}f(x,a))^{2}\,\mu(dx)-\frac{\left(\int_{\mathbb{R}^{2}}(\nabla_{x_{1}}f(x,a)x_{2}+\nabla_{x_{2}}f(x,a)x_{1})\,\mu(dx)\right)^{2}}{\int_{\mathbb{R}^{2}}(x_{1}^{2}+x_{2}^{2})\,\mu(dx)}.

It follows that

(VΦ)(0)=infa𝒜0(\displaystyle(V^{\Phi})^{\prime}(0)=\inf_{a^{\star}\in\mathcal{A}^{\star}_{0}}\Bigg{(} 2|xf(x,a)|2μ(dx)\displaystyle\int_{\mathbb{R}^{2}}|\nabla_{x}f(x,a^{\star})|^{2}\,\mu(dx)
(2x1f(x,a)x2+x2f(x,a)x1μ(dx))22|x|2μ(dx))1/2.\displaystyle-\frac{\big{(}\int_{\mathbb{R}^{2}}\nabla_{x_{1}}f(x,a^{\star})x_{2}+\nabla_{x_{2}}f(x,a^{\star})x_{1}\,\mu(dx)\big{)}^{2}}{\int_{\mathbb{R}^{2}}|x|^{2}\,\mu(dx)}\Bigg{)}^{1/2}.
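
The closed form for the optimal λ_{1} derived above can be cross-checked against a brute-force minimization. The sketch below does this for the illustrative choices f(x)=x_{1}^{2}x_{2} and μ\mu a standard Gaussian (these choices are ours, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(5)
x1, x2 = rng.normal(size=(2, 50000))      # samples standing in for mu
g1, g2 = 2 * x1 * x2, x1 ** 2             # partial derivatives of the illustrative f(x) = x_1^2 x_2

def objective(lam):
    # the integrand of Example 19 with the constraint gradient (x_2, x_1)
    return ((g1 + lam * x2) ** 2 + (g2 + lam * x1) ** 2).mean()

lam_closed = -(g1 * x2 + g2 * x1).mean() / (x1 ** 2 + x2 ** 2).mean()
grid = np.linspace(lam_closed - 1.0, lam_closed + 1.0, 2001)
lam_grid = grid[np.argmin([objective(l) for l in grid])]
print(lam_closed, lam_grid)               # should agree up to the grid resolution
```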
Example 20 (Calibration).

Consider the function f((T,K),a)=(Ea[(STK)+]C(T,K))2f((T,K),a)=(E_{\mathbb{P}_{a}}[(S_{T}-K)^{+}]-C(T,K))^{2}, where the discrete measure μ\mu formalises grid points for which option data C(T,K)C(T,K) is available, 𝒮+×+\mathcal{S}\subset\mathbb{R}_{+}\times\mathbb{R}_{+} is the set of maturities and strikes of interest and {a,a𝒜}\{\mathbb{P}_{a},a\in\mathcal{A}\}, for a given compact set 𝒜\mathcal{A}, is a class of parametric models (e.g., Heston). A Wasserstein ball around μ\mu can then be seen as a plausible formalisation of market data uncertainty. Derivatives in TT and KK correspond to classical pricing sensitivities, which are readily available for most common parametric models. These only have to be evaluated for one model a\mathbb{P}_{a^{\star}}. Changing the class of parametric models {a,a𝒜}\{\mathbb{P}_{a},a\in\mathcal{A}\} and computing the sensitivity in Theorem 2 could then yield insights into when a calibration procedure can be considered reasonably robust.

B.2. Proofs and auxiliary results related to Theorem 2

Proof of Theorem 2.

We now present a complete proof of Theorem 2 for a general state space 𝒮\mathcal{S} and semi-norm \|\cdot\|. All the essential ideas have already been outlined in Bartl et al., 2021b, but for the convenience of the reader we repeat all of the steps rather than only detailing where the general case differs from the one treated in Bartl et al., 2021b.

Step 1: Let us first assume that 𝒮=d\mathcal{S}=\mathbb{R}^{d}. For every δ0\delta\geq 0 let Cδ(μ)C_{\delta}(\mu) denote those π𝒫(𝒮×𝒮)\pi\in\mathcal{P}(\mathcal{S}\times\mathcal{S}) which satisfy

π1=μ and (𝒮×𝒮xypπ(dx,dy))1/pδ.\pi_{1}=\mu\text{ and }\left(\int_{\mathcal{S}\times\mathcal{S}}\|x-y\|^{p}_{\ast}\,\pi(dx,dy)\right)^{1/p}\leq\delta.

Note that the dual norm \|\cdot\|_{\ast} is lower semicontinuous, which implies that the infimum in the definition of Wp(μ,ν)W_{p}(\mu,\nu) is attained (see (Villani, 2008, Theorem 4.1, p. 43)); in particular, one has Bδ(μ)={π2:πCδ(μ)}B_{\delta}(\mu)=\{\pi_{2}:\pi\in C_{\delta}(\mu)\}.

We start by showing the “\leq” inequality in the statement. For any a𝒜0a^{\star}\in\mathcal{A}^{\star}_{0} one has V(δ)supνBδ(μ)𝒮f(y,a)ν(dy)V(\delta)\leq\sup_{\nu\in B_{\delta}(\mu)}\int_{\mathcal{S}}f(y,a^{\star})\,\nu(dy) with equality for δ=0\delta=0. Therefore, differentiating f(,a)f(\cdot,a^{\star}) and using Fubini’s theorem, we obtain that

V(δ)V(0)\displaystyle V(\delta)-V(0) supπCδ(μ)𝒮×𝒮f(y,a)f(x,a)π(dx,dy)\displaystyle\leq\sup_{\pi\in C_{\delta}(\mu)}\int_{\mathcal{S}\times\mathcal{S}}f(y,a^{\star})-f(x,a^{\star})\,\pi(dx,dy)
=supπCδ(μ)01𝒮xf(x+t(yx),a),(yx)π(dx,dy)𝑑t.\displaystyle=\sup_{\pi\in C_{\delta}(\mu)}\int_{0}^{1}\int_{\mathcal{S}}\langle\nabla_{x}f(x+t(y-x),a^{\star}),(y-x)\rangle\,\pi(dx,dy)dt.

Now recall that x,yxy\langle x,y\rangle\leq\|x\|\|y\|_{\ast} for every x,ydx,y\in\mathbb{R}^{d}, whence for any πCδ(μ)\pi\in C_{\delta}(\mu) and t[0,1]t\in[0,1], we have that

𝒮xf(x+t(yx),a),(yx)π(dx,dy)\displaystyle\int_{\mathcal{S}}\langle\nabla_{x}f(x+t(y-x),a^{\star}),(y-x)\rangle\,\pi(dx,dy)
𝒮xf(x+t(yx),a)yxπ(dx,dy)\displaystyle\leq\int_{\mathcal{S}}\|\nabla_{x}f(x+t(y-x),a^{\star})\|\|y-x\|_{\ast}\,\pi(dx,dy)
\displaystyle\leq\Big{(}\int_{\mathcal{S}}\|\nabla_{x}f(x+t(y-x),a^{\star})\|^{q}\,\pi(dx,dy)\Big{)}^{1/q}\Big{(}\int_{\mathcal{S}}\|y-x\|_{\ast}^{p}\,\pi(dx,dy)\Big{)}^{1/p},

where we used Hölder’s inequality to obtain the last inequality. By definition of Cδ(μ)C_{\delta}(\mu) the last factor is at most δ\delta and we end up with

V(δ)V(0)δsupπCδ(μ)01(𝒮xf(x+t(yx),a)qπ(dx,dy))1/q𝑑t.V(\delta)-V(0)\leq\delta\sup_{\pi\in C_{\delta}(\mu)}\int_{0}^{1}\Big{(}\int_{\mathcal{S}}\|\nabla_{x}f(x+t(y-x),a^{\star})\|^{q}\pi(dx,dy)\Big{)}^{1/q}dt.

It remains to show that the last term converges to the integral under μ\mu. To that end, note that any choice πδCδ(μ)\pi^{\delta}\in C_{\delta}(\mu) converges in Wp||W_{p}^{|\cdot|} on 𝒫(𝒮×𝒮)\mathcal{P}(\mathcal{S}\times\mathcal{S}) to the pushforward measure of μ\mu under the mapping x(x,x)x\mapsto(x,x), which we denote [x(x,x)]μ[x\mapsto(x,x)]_{\ast}\mu. This can be seen by, e.g., considering the coupling [(x,y)(x,y,x,x)]πδ[(x,y)\mapsto(x,y,x,x)]_{\ast}\pi^{\delta} between πδ\pi^{\delta} and [x(x,x)]μ[x\mapsto(x,x)]_{\ast}\mu. Now note that the growth restriction on xf\nabla_{x}f from Assumption 1, together with q=p/(p1)q=p/(p-1), implies

(20) xf(x+t(yx),a)qc(1+|x|p+|y|p)\displaystyle\|\nabla_{x}f(x+t(y-x),a^{\star})\|^{q}\leq c(1+|x|^{p}+|y|^{p})

for some c>0c>0 and all x,ydx,y\in\mathbb{R}^{d}, t[0,1]t\in[0,1]. Recall that there furthermore exists c~>0\tilde{c}>0 such that xc~|x|\|x\|\leq\tilde{c}|x|, in particular 𝒮xf(x+t(yx),a)qπδ(dx,dy)C\int_{\mathcal{S}}\|\nabla_{x}f(x+t(y-x),a^{\star})\|^{q}\,\pi^{\delta}(dx,dy)\leq C for all t[0,1]t\in[0,1] and small δ>0\delta>0, for another constant C>0C>0. As Assumption 1 further yields continuity of (x,y)xf(x+t(yx),a)q(x,y)\mapsto\|\nabla_{x}f(x+t(y-x),a^{\star})\|^{q} for every tt, the pp-Wasserstein convergence of πδ\pi^{\delta} to [x(x,x)]μ[x\mapsto(x,x)]_{\ast}\mu implies that

\int_{\mathcal{S}}\|\nabla_{x}f(x+t(y-x),a^{\star})\|^{q}\,\pi^{\delta}(dx,dy)\to\int_{\mathcal{S}}\|\nabla_{x}f(x,a^{\star})\|^{q}\,\mu(dx)

for every t[0,1]t\in[0,1], see Lemma 21. Dominated convergence (in tt) then yields “\leq” in the statement of the theorem.

We turn now to the opposite “\geq” inequality. As V(δ)V(0)V(\delta)\geq V(0) for every δ>0\delta>0 there is no loss in generality in assuming that the right hand side is not equal to zero. Now take any, for notational simplicity not relabelled, subsequence of (δ)δ>0(\delta)_{\delta>0} which attains the liminf in (V(δ)V(0))/δ(V(\delta)-V(0))/\delta and pick aδ𝒜δa^{\star}_{\delta}\in\mathcal{A}^{\star}_{\delta}. By the second part of Assumption 1, for a (again not relabelled) subsequence, one has aδa𝒜0a^{\star}_{\delta}\to a^{\star}\in\mathcal{A}^{\star}_{0}. Further note that V(0)𝒮f(x,aδ)μ(dx)V(0)\leq\int_{\mathcal{S}}f(x,a^{\star}_{\delta})\,\mu(dx) which implies

V(δ)V(0)\displaystyle V(\delta)-V(0) supπCδ(μ)𝒮×𝒮f(y,aδ)f(x,aδ)π(dx,dy).\displaystyle\geq\sup_{\pi\in C_{\delta}(\mu)}\int_{\mathcal{S}\times\mathcal{S}}f(y,a^{\star}_{\delta})-f(x,a^{\star}_{\delta})\,\pi(dx,dy).

By Lemma 7 there exists a function h:d{xd:x=1}h\colon\mathbb{R}^{d}\mapsto\{x\in\mathbb{R}^{d}:\|x\|_{\ast}=1\} such that x=x,h(x)\|x\|=\langle x,h(x)\rangle for every xdx\in\mathbb{R}^{d}. Now define

πδ\displaystyle\pi^{\delta} :=[x(x,x+δT(x))]μ,where\displaystyle:=[x\mapsto(x,x+\delta T(x))]_{\ast}\mu,\quad\text{where}
T(x)\displaystyle T(x) :=h(xf(x,a))xf(x,a)1q(𝒮xf(z,a)qμ(dz))1/q1\displaystyle:=\frac{h(\nabla_{x}f(x,a^{\star}))}{\|\nabla_{x}f(x,a^{\star})\|^{1-q}}\Big{(}\int_{\mathcal{S}}\|\nabla_{x}f(z,a^{\star})\|^{q}\,\mu(dz)\Big{)}^{1/q-1}

for xdx\in\mathbb{R}^{d} with the convention h()/0=0h(\cdot)/0=0. Note that the integral is well defined since, as before in (20), one has xf(x,a)qC(1+|x|p)\|\nabla_{x}f(x,a^{\star})\|^{q}\leq C(1+|x|^{p}) for some C>0C>0 and the latter is integrable under μ\mu. Using that pqp=qpq-p=q it further follows that

𝒮×𝒮xypπδ(dx,dy)=δp𝒮T(x)pμ(dx)\displaystyle\int_{\mathcal{S}\times\mathcal{S}}\|x-y\|_{\ast}^{p}\,\pi^{\delta}(dx,dy)=\delta^{p}\int_{\mathcal{S}}\|T(x)\|_{\ast}^{p}\,\mu(dx)
=δp𝒮xf(x,a)pqpμ(dx)(𝒮xf(z,a)qμ(dz))p(11/q)=δp.\displaystyle=\delta^{p}\frac{\int_{\mathcal{S}}\|\nabla_{x}f(x,a^{\star})\|^{pq-p}\,\mu(dx)}{\big{(}\int_{\mathcal{S}}\|\nabla_{x}f(z,a^{\star})\|^{q}\,\mu(dz)\big{)}^{p(1-1/q)}}=\delta^{p}.

In particular πδCδ(μ)\pi^{\delta}\in C_{\delta}(\mu) and we can use it to estimate from below the supremum over Cδ(μ)C_{\delta}(\mu) giving

V(δ)V(0)δ\displaystyle\frac{V(\delta)-V(0)}{\delta} 1δ𝒮f(x+δT(x),aδ)f(x,aδ)μ(dx)\displaystyle\geq\frac{1}{\delta}\int_{\mathcal{S}}f(x+\delta T(x),a^{\star}_{\delta})-f(x,a^{\star}_{\delta})\,\mu(dx)
=01𝒮xf(x+tδT(x),aδ),T(x)μ(dx)𝑑t.\displaystyle=\int_{0}^{1}\int_{\mathcal{S}}\langle\nabla_{x}f(x+t\delta T(x),a^{\star}_{\delta}),T(x)\rangle\,\mu(dx)\,dt.

For any t[0,1]t\in[0,1], with δ0\delta\to 0, the inner integral converges to

𝒮xf(x,a),T(x)μ(dx)=(𝒮xf(x,a)qμ(dx))1/q.\displaystyle\int_{\mathcal{S}}\langle\nabla_{x}f(x,a^{\star}),T(x)\rangle\,\mu(dx)=\Big{(}\int_{\mathcal{S}}\|\nabla_{x}f(x,a^{\star})\|^{q}\,\mu(dx)\Big{)}^{1/q}.

The last equality follows from the definition of TT and a simple calculation. To justify the convergence, first note that

xf(x+tδT(x),aδ),T(x)xf(x,a),T(x)\langle\nabla_{x}f(x+t\delta T(x),a^{\star}_{\delta}),T(x)\rangle\to\langle\nabla_{x}f(x,a^{\star}),T(x)\rangle

for all xdx\in\mathbb{R}^{d} by continuity of (a,x)xf(x,a)(a,x)\mapsto\nabla_{x}f(x,a) and since aδaa^{\star}_{\delta}\to a^{\star}. Moreover, as before in (20), one has

|\langle\nabla_{x}f(x+t\delta T(x),a^{\star}),T(x)\rangle|\leq C(1+|x|^{p})

for some C>0C>0 and all t[0,1]t\in[0,1]. The latter is integrable under μ\mu, hence convergence of the integrals follows from the dominated convergence theorem.

Step 2: We now extend the proof to the case where 𝒮d\mathcal{S}\subset\mathbb{R}^{d} is closed convex and its boundary has zero measure under μ\mu.
Note that the proof of the “\leq”-inequality remains unchanged. We modify the proof of the “\geq”-inequality as follows: let us first define

𝒮ε:={x𝒮:|xz|ε for all z𝒮c}\mathcal{S}^{\varepsilon}:=\{x\in\mathcal{S}\ :\ |x-z|\geq\varepsilon\text{ for all }z\in\mathcal{S}^{c}\}

for all ε>0\varepsilon>0, so that in particular ε>0𝒮ε=𝒮o\bigcup_{\varepsilon>0}\mathcal{S}^{\varepsilon}={\mathcal{S}}^{o}. We now redefine

πδ:=[x(x,x+δT(x)𝟏{x𝒮δ}𝟏{|T(x)|1/δ})]μ.\pi^{\delta}:=\left[x\mapsto\left(x,x+\delta T(x)\mathbf{1}_{\{x\in\mathcal{S}^{\sqrt{\delta}}\}}\mathbf{1}_{\{|T(x)|\leq 1/\sqrt{\delta}\}}\right)\right]_{\ast}\mu.

Then πδ𝒫(𝒮×𝒮)\pi^{\delta}\in\mathcal{P}(\mathcal{S}\times\mathcal{S}) and in particular πδCδ(μ)\pi^{\delta}\in C_{\delta}(\mu) as in Step 1. Noting that

limδ0T(x)𝟏{x𝒮δ}𝟏{|T(x)|1/δ}=T(x)𝟏{x𝒮o},\displaystyle\lim_{\delta\to 0}T(x)\mathbf{1}_{\{x\in\mathcal{S}^{\sqrt{\delta}}\}}\mathbf{1}_{\{|T(x)|\leq 1/\sqrt{\delta}\}}=T(x)\mathbf{1}_{\{x\in{\mathcal{S}}^{o}\}},

the remaining steps of the proof follow as in Step 1. This concludes the proof. ∎
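To illustrate the construction in Step 1, the following minimal numerical sketch (our own illustration, not part of the paper) takes the Euclidean norm, $p=q=2$, a Gaussian sample in place of $\mu$ and the toy loss $f(x,a)=\tfrac{1}{2}|x-a|^{2}$. It checks that the coupling $\pi^{\delta}=[x\mapsto(x,x+\delta T(x))]_{\ast}\mu$ has transport cost exactly $\delta$, and that $\frac{1}{\delta}\int_{\mathcal{S}}\big(f(x+\delta T(x),a)-f(x,a)\big)\,\mu(dx)$ approaches $(\int_{\mathcal{S}}\|\nabla_{x}f(x,a)\|^{q}\,\mu(dx))^{1/q}$ for a fixed action $a$ as $\delta\to 0$.

# Minimal sketch (assumptions: Euclidean norm, p = q = 2, so h(x) = x/|x|,
# toy loss f(x,a) = |x-a|^2/2 and a Gaussian sample standing in for mu).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(10_000, 2))                    # sample from mu
a = np.array([0.3, -0.7])                           # a fixed action

def f(x, a):
    return 0.5 * np.sum((x - a) ** 2, axis=1)

grad = x - a                                        # grad_x f(x, a)
q_norm = np.mean(np.sum(grad ** 2, axis=1)) ** 0.5  # (int |grad_x f|^q dmu)^{1/q}
T = grad / q_norm                                   # T(x) from the proof, for p = q = 2

for delta in [0.1, 0.01, 0.001]:
    cost = np.sqrt(np.mean(np.sum((delta * T) ** 2, axis=1)))   # W_2 cost of pi^delta
    lower_bound = np.mean(f(x + delta * T, a) - f(x, a)) / delta
    print(delta, cost, lower_bound, q_norm)         # cost equals delta; lower_bound -> q_norm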

Lemma 21.

Let p[1,)p\in[1,\infty), let a0𝒜a_{0}\in\mathcal{A} and assume that ff is continuous and, for some constant c>0c>0, satisfies |f(x,a)|c(1+|x|p)|f(x,a)|\leq c(1+|x|^{p}) for all x𝒮x\in\mathcal{S} and all aa in a neighborhood of a0a_{0}. Let (μn)n(\mu_{n})_{n\in\mathbb{N}} be a sequence of probability measures which converges to some μ\mu w.r.t. Wp||W_{p}^{|\cdot|} and (an)n(a_{n})_{n\in\mathbb{N}} be a sequence which converges to a0a_{0}. Then 𝒮f(x,an)μn(dx)𝒮f(x,a0)μ(dx)\int_{\mathcal{S}}f(x,a_{n})\,\mu_{n}(dx)\to\int_{\mathcal{S}}f(x,a_{0})\,\mu(dx) as nn\to\infty.

Proof.

Let KK be a small neighborhood of a0a_{0} such that |f(x,a)|c(1+|x|p)|f(x,a)|\leq c(1+|x|^{p}) for all x𝒮x\in\mathcal{S} and aKa\in K. The measures μnδan\mu_{n}\otimes\delta_{a_{n}} converge in Wp||W_{p}^{|\cdot|} to the measure μδa0\mu\otimes\delta_{a_{0}}. As 𝒮f(x,an)μn(dx)=𝒮×Kf(x,a)(μnδan)(d(x,a))\int_{\mathcal{S}}f(x,a_{n})\,\mu_{n}(dx)=\int_{\mathcal{S}\times K}f(x,a)\,(\mu_{n}\otimes\delta_{a_{n}})(d(x,a)) and similarly for μδa0\mu\otimes\delta_{a_{0}}, the claim follows from (Villani,, 2008, Lemma 4.3, p.43). ∎

The following lemma relates to the financial economics applications described in Bartl et al., 2021b. We focus on a sufficient condition for the second part of Assumption 1. For this, we assume that $\mu$ does not contain any redundant assets, i.e. $\mu(\{x\in\mathbb{R}^{d}:\langle a,x-x_{0}\rangle>0\})>0$ for every $a\neq 0$. If $\mu$ satisfies this condition, we call it non-degenerate. Note that this condition is slightly stronger than no-arbitrage. However, if $\mu$ satisfies no-arbitrage, then one can always delete the redundant dimensions in $\mu$, similarly to the remark after Theorem 15, so that the modified measure satisfies $\mu(\{x\in\mathbb{R}^{d}:\langle a,x-x_{0}\rangle>0\})>0$ for every $a\neq 0$.

Lemma 22.

Assume that $l\colon\mathbb{R}\to\mathbb{R}$ is convex, increasing, bounded from below and $f(x,a):=l(g(x)+\langle a,x-x_{0}\rangle)$ satisfies the first part of Assumption 1. Furthermore assume that $\mu$ is non-degenerate in the above sense. Then for every $\delta\geq 0$ there exists an optimizer $a^{\star}_{\delta}\in\mathbb{R}^{d}$ for $V(\delta)$, i.e.,

V(δ)=supνBδ(μ)dl(g(x)+aδ,xx0)ν(dx)<.V(\delta)=\sup_{\nu\in B_{\delta}(\mu)}\int_{\mathbb{R}^{d}}l(g(x)+\langle a^{\star}_{\delta},x-x_{0}\rangle)\,\nu(dx)<\infty.

Furthermore, if ll is strictly convex, the optimizer aa^{\star} of V(0)V(0) is unique and aδaa^{\star}_{\delta}\to a^{\star} as δ0\delta\to 0. In particular, Assumption 1 is satisfied.

Proof.

The first statement is trivially true if $l$ is constant, so assume otherwise in the following. Moreover, note that by the first part of Assumption 1 we have $V(\delta)<\infty$ for all $\delta\geq 0$. Now fix $\delta\geq 0$, and let $(a_{n})_{n\in\mathbb{N}}$ be a minimizing sequence, i.e.

V(δ)=limnsupνBδ(μ)dl(g(x)+an,xx0)ν(dx).V(\delta)=\lim_{n\to\infty}\sup_{\nu\in B_{\delta}(\mu)}\int_{\mathbb{R}^{d}}l(g(x)+\langle a_{n},x-x_{0}\rangle)\,\nu(dx).

If (an)n(a_{n})_{n\in\mathbb{N}} is bounded, then after passing to a subsequence there is a limit, and Fatou’s lemma shows that this limit is a minimizer. It remains to argue why (an)n(a_{n})_{n\in\mathbb{N}} is bounded. Heading for a contradiction, assume that |an||a_{n}|\to\infty as nn\to\infty. After passing to a (not relabeled) subsequence, there is a~d\tilde{a}\in\mathbb{R}^{d} with |a~|=1|\tilde{a}|=1 such that an/|an|a~a_{n}/|a_{n}|\to\tilde{a} as nn\to\infty. By our assumption we have μ({xd:a~,xx0>0})>0\mu(\{x\in\mathbb{R}^{d}:\langle\tilde{a},x-x_{0}\rangle>0\})>0. As ll is bounded below this shows that

supνBδ(μ)dl(g(x)+an,xx0)ν(dx)dl(g(x)+an,xx0)μ(dx),\sup_{\nu\in B_{\delta}(\mu)}\int_{\mathbb{R}^{d}}l(g(x)+\langle a_{n},x-x_{0}\rangle)\,\nu(dx)\geq\int_{\mathbb{R}^{d}}l(g(x)+\langle a_{n},x-x_{0}\rangle)\,\mu(dx)\to\infty,

as nn\to\infty, a contradiction.

To prove the second claim note that strict convexity of $l$ readily implies that $V(0)$ admits a unique minimizer $a^{\star}$. Now, heading for a contradiction, assume that there exists a subsequence $(\delta_{n})_{n\in\mathbb{N}}$ converging to zero, such that $a^{\star}_{\delta_{n}}$ does not converge to $a^{\star}$. The exact same reasoning as above shows that $(a^{\star}_{\delta_{n}})_{n\in\mathbb{N}}$ is bounded, hence (possibly after passing to a not relabeled subsequence) there is a limit $\tilde{a}\neq a^{\star}$. Using Fatou’s lemma once more implies

V(0)\displaystyle V(0) <dl(g(x)+a~,xx0)μ(dx)\displaystyle<\int_{\mathbb{R}^{d}}l(g(x)+\langle\tilde{a},x-x_{0}\rangle)\,\mu(dx)
lim infndl(g(x)+aδn,xx0)μ(dx)lim infnV(δn).\displaystyle\leq\liminf_{n\to\infty}\int_{\mathbb{R}^{d}}l(g(x)+\langle a^{\star}_{\delta_{n}},x-x_{0}\rangle)\,\mu(dx)\leq\liminf_{n\to\infty}V(\delta_{n}).

On the other hand, plugging aa^{\star} into V(δ)V(\delta) implies

lim supnV(δn)lim supnsupνBδn(μ)dl(g(x)+a,xx0)ν(dx)=V(0),\limsup_{n\to\infty}V(\delta_{n})\leq\limsup_{n\to\infty}\sup_{\nu\in B_{\delta_{n}}(\mu)}\int_{\mathbb{R}^{d}}l(g(x)+\langle a^{\star},x-x_{0}\rangle)\,\nu(dx)=V(0),

which follows since $l(g(x)+\langle a^{\star},x-x_{0}\rangle)\leq c(1+|x|^{p})$ and any $\nu_{n}\in B_{\delta_{n}}(\mu)$ converges in $W^{|\cdot|}_{p}$ to $\mu$ by definition. This gives the desired contradiction. ∎

In analogy to the above result, the following summarizes simple sufficient conditions for the second part of Assumption 1.

Lemma 23.

Assume that either $\mathcal{A}$ is compact or that $a\mapsto V(0,a)$ is coercive, in the sense that $V(0,a_{n})\to\infty$ if $|a_{n}|\to\infty$. Moreover, assume that $f$ is continuous and satisfies $f(x,a)\leq c(1+|x|^{p})$ for some $c\geq 0$. Then the second part of Assumption 1 is satisfied.

Proof.

Let us first note that for fixed $\delta\geq 0$ the function $a\mapsto V(\delta,a)$ is lower semicontinuous as a supremum of the continuous functions $a\mapsto\int f(x,a)\,\nu(dx)$ for $\nu\in B_{\delta}(\mu)$. Next we note that $\mathcal{A}^{\star}(\delta)\neq\emptyset$. Indeed, if $\mathcal{A}$ is compact, this directly follows from lower semicontinuity of $a\mapsto V(\delta,a)$. Otherwise, the fact that $V(\delta,a)\geq V(0,a)$ for all $a\in\mathcal{A}$ and coercivity imply that any minimising sequence $(a_{n})_{n\in\mathbb{N}}$ is bounded. Lastly, we show that any accumulation point of such a sequence is an element of $\mathcal{A}^{\star}(0)$. By the above we can assume (by taking a subsequence without relabelling if necessary) that $\lim_{n\to\infty}a_{n}=a\in\mathcal{A}$. If $a\notin\mathcal{A}^{\star}(0)$, then

lim infnV(δn,an)limnV(0,an)=V(0,a)>V(0,a)=limnV(δn,a)\displaystyle\liminf_{n\to\infty}V(\delta_{n},a_{n})\geq\lim_{n\to\infty}V(0,a_{n})=V(0,a)>V(0,a^{\star})=\lim_{n\to\infty}V(\delta_{n},a^{\star})

for any a𝒜(0)a^{\star}\in\mathcal{A}^{\star}(0). This contradicts an𝒜(δn)a_{n}\in\mathcal{A}^{\star}(\delta_{n}) for all nn\in\mathbb{N} and concludes the proof. ∎

Proof of Corollary 13.

We start with the “\leq”-inequality. First, note that for any δ>0\delta>0, ar𝒜ra^{r}\in\mathcal{A}^{\star}_{r}, and νr+δBr+δ(μ,ar)\nu^{r+\delta}\in B_{r+\delta}^{\star}(\mu,a^{r}), we have

V(r+δ)\displaystyle V(r+\delta) V(r+δ,ar)=𝒮f(x,ar)νr+δ(dx),\displaystyle\leq V(r+\delta,a^{r})=\int_{\mathcal{S}}f(x,a^{r})\,\nu^{r+\delta}(dx),
V(r)\displaystyle V(r) supνBr(μ)Bδ(νr+δ)𝒮f(x,ar)ν(dx).\displaystyle\geq\sup_{\nu\in B_{r}(\mu)\cap B_{\delta}(\nu^{r+\delta})}\int_{\mathcal{S}}f(x,a^{r})\,\nu(dx).

This implies that

V(r+δ)V(r)\displaystyle V(r+\delta)-V(r) supπCδ(νr+δ)𝒮×𝒮f(x,ar)f(y,ar)π(dx,dy)\displaystyle\leq\sup_{\pi\in C_{\delta}(\nu^{r+\delta})}\int_{\mathcal{S}\times\mathcal{S}}f(x,a^{r})-f(y,a^{r})\,\pi(dx,dy)
=supπCδ(νr+δ)01𝒮×𝒮xf(y+t(xy),ar),(xy)π(dx,dy)𝑑t\displaystyle=\sup_{\pi\in C_{\delta}(\nu^{r+\delta})}\int_{0}^{1}\int_{\mathcal{S}\times\mathcal{S}}\langle\nabla_{x}f(y+t(x-y),a^{r}),(x-y)\rangle\,\pi(dx,dy)\,dt
(21) δsupπCδ(νr+δ)01(𝒮×𝒮xf(y+t(xy),ar)qπ(dx,dy))1/q𝑑t.\displaystyle\leq\delta\sup_{\pi\in C_{\delta}(\nu^{r+\delta})}\int_{0}^{1}\left(\int_{\mathcal{S}\times\mathcal{S}}\|\nabla_{x}f(y+t(x-y),a^{r})\|^{q}\,\pi(dx,dy)\right)^{1/q}\,dt.

Note that the assumption $|\nabla_{x}f(x,a)|\leq c(1+|x|^{p-1-\varepsilon})$ implies $|\nabla_{x}f(x,a)|^{q}\leq c(1+|x|^{\frac{p(p-1-\varepsilon)}{p-1}})$ (for some new constant $c$). To simplify notation let us thus define $\tilde{\varepsilon}=(p-1-\varepsilon)/(p-1)<1$ and recall that $B_{r+1}(\mu)$ is compact w.r.t. $W^{|\cdot|}_{p\tilde{\varepsilon}}$ by Lemma 24, hence there is $\tilde{\nu}^{r}\in B_{r}(\mu)$ such that (after passing to a subsequence) $\nu^{r+\delta}\to\tilde{\nu}^{r}$ w.r.t. $W^{|\cdot|}_{p\tilde{\varepsilon}}$ as $\delta\to 0$. The same arguments as in the proof of Theorem 2 show that (21) (divided by $\delta$) converges to $\left(\int_{\mathcal{S}}\left\|\nabla_{x}f(x,a^{r})\right\|^{q}\tilde{\nu}^{r}(dx)\right)^{1/q}$ as $\delta\to 0$. So, to conclude the “$\leq$”-part, it remains to show that $\tilde{\nu}^{r}\in B_{r}^{\star}(\mu,a^{r})$, which follows as

V(r)limδ0V(r+δ)limδ0𝒮f(x,ar)νr+δ(dx)=𝒮f(x,ar)ν~r(dx)V(r).V(r)\leq\lim_{\delta\to 0}V(r+\delta)\leq\lim_{\delta\to 0}\int_{\mathcal{S}}f(x,a^{r})\nu^{r+\delta}(dx)=\int_{\mathcal{S}}f(x,a^{r})\,\tilde{\nu}^{r}(dx)\leq V(r).

We now turn to the proof of the “\geq”-inequality. To that end, let (ar+δ)δ>0(a^{r+\delta})_{\delta>0} be a sequence of optimizers, i.e. ar+δ𝒜r+δa^{r+\delta}\in\mathcal{A}^{\star}_{r+\delta} for all δ>0\delta>0. Then by assumption there exists ar𝒜ra^{r}\in\mathcal{A}^{\star}_{r} such that (after passing to a subsequence) limδ0ar+δ=ar\lim_{\delta\to 0}a^{r+\delta}=a^{r}. Let νrBr(μ,ar)\nu^{r}\in B_{r}^{\star}(\mu,a^{r}) be arbitrary. As Bδ(νr)Br+δ(μ)B_{\delta}(\nu^{r})\subset B_{r+\delta}(\mu) (by the triangle inequality) we have

V(r+δ)\displaystyle V(r+\delta) supνBδ(νr)𝒮f(x,ar+δ)ν(dx).\displaystyle\geq\sup_{\nu\in B_{\delta}(\nu^{r})}\int_{\mathcal{S}}f(x,a^{r+\delta})\,\nu(dx).

As further (trivially) V(r)𝒮f(x,ar+δ)νr(dx)V(r)\leq\int_{\mathcal{S}}f(x,a^{r+\delta})\,\nu^{r}(dx) we conclude

\displaystyle\frac{V(\delta+r)-V(r)}{\delta} \geq\sup_{\nu\in B_{\delta}(\nu^{r})}\frac{1}{\delta}\left(\int_{\mathcal{S}}f(x,a^{r+\delta})\,\nu(dx)-\int_{\mathcal{S}}f(x,a^{r+\delta})\,\nu^{r}(dx)\right)
(𝒮xf(x,ar)qνr(dx))1/q,\displaystyle\to\left(\int_{\mathcal{S}}\left\|\nabla_{x}f(x,a^{r})\right\|^{q}\nu^{r}(dx)\right)^{1/q},

as $\delta\to 0$, where the last convergence follows from the exact same arguments as presented in the proof of Theorem 2. As $\nu^{r}\in B_{r}^{\star}(\mu,a^{r})$ was arbitrary, the claim follows. ∎

Proof of Theorem 15.

We start by showing the easier estimate

(22) lim supδ0VΦ(δ)VΦ(0)δinfa𝒜0infλm(𝒮xf(x,a)+i=1mλixΦi(x)qμ(dx))1/q.\displaystyle\begin{split}&\limsup_{\delta\to 0}\frac{V^{\Phi}(\delta)-V^{\Phi}(0)}{\delta}\\ &\leq\inf_{a^{\star}\in\mathcal{A}^{\star}_{0}}\inf_{\lambda\in\mathbb{R}^{m}}\left(\int_{\mathcal{S}}\Big{\|}\nabla_{x}f(x,a^{\star})+\sum_{i=1}^{m}\lambda_{i}\nabla_{x}\Phi_{i}(x)\Big{\|}^{q}\,\mu(dx)\right)^{1/q}.\end{split}

To that end, let $a^{\star}\in\mathcal{A}^{\star}_{0}$ and $\lambda\in\mathbb{R}^{m}$ be arbitrary. Then $V^{\Phi}(0)=\int_{\mathcal{S}}f(x,a^{\star})+\sum_{i=1}^{m}\lambda_{i}\Phi_{i}(x)\,\mu(dx)$. Moreover, as $B_{\delta}^{\Phi}(\mu)\subset B_{\delta}(\mu)$, it further follows that $V^{\Phi}(\delta)\leq\sup_{\nu\in B_{\delta}(\mu)}\int_{\mathcal{S}}f(y,a^{\star})+\sum_{i=1}^{m}\lambda_{i}\Phi_{i}(y)\,\nu(dy)$. Therefore (22) is a consequence of Theorem 2 (applied to the function $\tilde{f}(x,a):=f(x,a^{\star})+\sum_{i=1}^{m}\lambda_{i}\Phi_{i}(x)$).

To show the other direction, i.e. that

(23) \displaystyle\begin{split}&\liminf_{\delta\to 0}\frac{V^{\Phi}(\delta)-V^{\Phi}(0)}{\delta}\\ &\geq\inf_{a^{\star}\in\mathcal{A}^{\star}_{0}}\inf_{\lambda\in\mathbb{R}^{m}}\left(\int_{\mathcal{S}}\Big{\|}\nabla_{x}f(x,a^{\star})+\sum_{i=1}^{m}\lambda_{i}\nabla_{x}\Phi_{i}(x)\Big{\|}^{q}\,\mu(dx)\right)^{1/q},\end{split}

pick a (not relabeled) subsequence of $(\delta)_{\delta>0}$ along which the liminf is attained. For $a^{\star}_{\delta}\in\mathcal{A}^{\star}_{\delta}$, there is another (again not relabeled) subsequence which converges to some $a^{\star}\in\mathcal{A}^{\star}_{0}$. From now on we stick to this subsequence. In a first step, notice that

VΦ(δ)\displaystyle V^{\Phi}(\delta) =supνBδ(μ)infλm𝒮f(y,aδ)+i=1mλiΦi(y)ν(dy)\displaystyle=\sup_{\nu\in B_{\delta}(\mu)}\inf_{\lambda\in\mathbb{R}^{m}}\int_{\mathcal{S}}f(y,a^{\star}_{\delta})+\sum_{i=1}^{m}\lambda_{i}\Phi_{i}(y)\,\nu(dy)
(24) =infλmsupνBδ(μ)𝒮f(y,aδ)+i=1mλiΦi(y)ν(dy).\displaystyle=\inf_{\lambda\in\mathbb{R}^{m}}\sup_{\nu\in B_{\delta}(\mu)}\int_{\mathcal{S}}f(y,a^{\star}_{\delta})+\sum_{i=1}^{m}\lambda_{i}\Phi_{i}(y)\,\nu(dy).

Indeed, this follows from a minimax theorem (see (Terkelsen,, 1973, Cor. 2, p. 411)) and appropriate compactness of Bδ(μ)B_{\delta}(\mu) as stated in Lemma 24. For notational simplicity let λδ\lambda^{\star}_{\delta} be an optimizer for (24). Then

(25) VΦ(δ)VΦ(0)δ1δsupπCδ(μ)𝒮×𝒮f(y,aδ)f(x,aδ)+i=1mλδ,i(Φi(y)Φi(x))π(dx,dy),\displaystyle\begin{split}&\frac{V^{\Phi}(\delta)-V^{\Phi}(0)}{\delta}\\ &\geq\frac{1}{\delta}\sup_{\pi\in C_{\delta}(\mu)}\int_{\mathcal{S}\times\mathcal{S}}f(y,a^{\star}_{\delta})-f(x,a^{\star}_{\delta})+\sum_{i=1}^{m}\lambda^{\star}_{\delta,i}(\Phi_{i}(y)-\Phi_{i}(x))\,\pi(dx,dy),\end{split}

where we used that VΦ(0)𝒮f(x,aδ)+i=1mλδ,iΦi(x)μ(dx)V^{\Phi}(0)\leq\int_{\mathcal{S}}f(x,a^{\star}_{\delta})+\sum_{i=1}^{m}\lambda^{\star}_{\delta,i}\Phi_{i}(x)\,\mu(dx). Now, in case that λδ\lambda^{\star}_{\delta} is uniformly bounded for all small δ>0\delta>0, after passing to a subsequence, it converges to some λ\lambda^{\star}. Then it follows from the exact same arguments as used in the proof of Theorem 2 that

lim infδ01δsupπCδ(μ)𝒮×𝒮f(y,aδ)f(x,aδ)+i=1mλδ,i(Φi(y)Φi(x))π(dx,dy)\displaystyle\liminf_{\delta\to 0}\frac{1}{\delta}\sup_{\pi\in C_{\delta}(\mu)}\int_{\mathcal{S}\times\mathcal{S}}f(y,a^{\star}_{\delta})-f(x,a^{\star}_{\delta})+\sum_{i=1}^{m}\lambda^{\star}_{\delta,i}(\Phi_{i}(y)-\Phi_{i}(x))\,\pi(dx,dy)
(𝒮xf(x,a)+i=1mλixΦi(x)qμ(dx))1/q\displaystyle\geq\Big{(}\int_{\mathcal{S}}\Big{\|}\nabla_{x}f(x,a^{\star})+\sum_{i=1}^{m}\lambda^{\star}_{i}\nabla_{x}\Phi_{i}(x)\Big{\|}^{q}\,\mu(dx)\Big{)}^{1/q}

which shows (23). It remains to argue why λδ\lambda^{\star}_{\delta} is bounded for small δ>0\delta>0. By (25) and the estimate “sup(A+B)supA+infB\sup(A+B)\geq\sup A+\inf B” we have

VΦ(δ)VΦ(0)δ\displaystyle\frac{V^{\Phi}(\delta)-V^{\Phi}(0)}{\delta}\ 1δsupπCδ(μ)𝒮×𝒮i=1mλδ,i(Φi(y)Φi(x))π(dx,dy)\displaystyle\geq\frac{1}{\delta}\sup_{\pi\in C_{\delta}(\mu)}\int_{\mathcal{S}\times\mathcal{S}}\sum_{i=1}^{m}\lambda^{\star}_{\delta,i}(\Phi_{i}(y)-\Phi_{i}(x))\,\pi(dx,dy)
+1δinfπCδ(μ)𝒮×𝒮f(y,aδ)f(x,aδ)π(dx,dy).\displaystyle\quad+\frac{1}{\delta}\inf_{\pi\in C_{\delta}(\mu)}\int_{\mathcal{S}\times\mathcal{S}}f(y,a^{\star}_{\delta})-f(x,a^{\star}_{\delta})\,\pi(dx,dy).

The second term converges to (𝒮xf(x,a)qμ(dx))1/q-(\int_{\mathcal{S}}\|\nabla_{x}f(x,a^{\star})\|^{q}\,\mu(dx))^{1/q} (see the proof of Theorem 2), in particular it is bounded for all δ>0\delta>0 small. On the other hand by (19) and continuity as well as growth of xxΦi(x)x\mapsto\nabla_{x}\Phi_{i}(x), the first term is larger than c|λδ|c|\lambda^{\star}_{\delta}| for some c>0c>0. By (22) this implies that (λδ)δ>0(\lambda^{\star}_{\delta})_{\delta>0} must be bounded for small δ>0\delta>0. ∎

We have used the following lemma:

Lemma 24.

Let $p,q\in[1,\infty)$ such that $q<p$ and let $\mu$ be a probability measure on $\mathcal{S}$. Then the $p$-Wasserstein ball $B_{\delta}(\mu)$ is compact w.r.t. $W_{q}^{|\cdot|}$.

Proof.

We recall that $\|\cdot\|_{\ast}$ is lower semicontinuous and there exists $c>0$ such that $|x|\leq c\|x\|_{\ast}$ for all $x\in\mathbb{R}^{d}$. As $\int_{\mathcal{S}}|x|^{p}\,\mu(dx)<\infty$ by assumption, an application of Prokhorov’s theorem shows that $B_{\delta}(\mu)$ is weakly precompact (recall the convention that continuity is defined w.r.t. $(\mathbb{R}^{d},|\cdot|)$). Hence, for every sequence of measures $(\nu_{n})_{n\in\mathbb{N}}$ in $B_{\delta}(\mu)$ there exists a subsequence, which we also call $(\nu_{n})_{n\in\mathbb{N}}$, and a measure $\nu$ such that $\nu_{n}$ converges weakly to $\nu$. As $W_{p}$ is weakly lower semicontinuous (see (Villani, 2008, Lemma 4.3, p. 43)), this implies $\nu\in B_{\delta}(\mu)$. Applying the same argument to the tight sequence $(\tilde{\nu}_{n})_{n\in\mathbb{N}}$ defined via

ν~n(dx):=|x|q𝒮|y|qνn(dy)νn(dx)\tilde{\nu}_{n}(dx):=\frac{|x|^{q}}{\int_{\mathcal{S}}|y|^{q}\,\nu_{n}(dy)}\nu_{n}(dx)

we conclude that there exists another subsequence of (νn)n(\nu_{n})_{n\in\mathbb{N}} which also converges in Wq||W_{q}^{|\cdot|}. This concludes the proof. ∎

Appendix C Discussion, proofs and auxiliary results related to Theorem 5

C.1. Further discussion of Theorem 5

We note that a natural way to compute the sensitivity of $a^{\star}_{\delta}$ would be by combining Theorem 2 with the chain rule and differentiation of the function $V(\delta,a)$. This cannot however be rigorously justified, as the following remark demonstrates.

Remark 25.

Let us point out that it is not true that $a\mapsto V(\delta,a)$ is differentiable for $\delta>0$ under the sole assumption that $(x,a)\mapsto f(x,a)$ is sufficiently smooth and $\nabla_{a}^{2}f\neq 0$.

To give an example, let $\mathcal{S}=\mathbb{R}$, $\|\cdot\|=|\cdot|$, $\mathcal{A}=\mathbb{R}$ and take $f(x,a):=ax+a^{2}$ and $\mu=\delta_{0}$. A quick computation shows $V(\delta,a)=\delta|a|+a^{2}$ (independently of $p$): the worst case over $B_{\delta}(\delta_{0})$ shifts the point mass to $\delta\,\mathrm{sign}(a)$. In particular $V(\delta)=0$ and $a^{\star}_{\delta}=a^{\star}=0$ for all $\delta>0$, and $a\mapsto V(\delta,a)$ is clearly not differentiable at $a=0$.
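As a quick illustration (ours, not from the paper), the following sketch evaluates the closed form $V(\delta,a)=\delta|a|+a^{2}$ obtained above and shows that the two one-sided difference quotients at $a=0$ converge to $+\delta$ and $-\delta$, respectively, confirming the kink.

# Sketch of the kink of a -> V(delta, a) at a = 0, assuming the closed form
# V(delta, a) = delta*|a| + a**2 derived in Remark 25.
import numpy as np

def V(delta, a):
    return delta * np.abs(a) + a ** 2

delta = 0.5
for h in [1e-2, 1e-4, 1e-6]:
    right = (V(delta, h) - V(delta, 0.0)) / h      # -> +delta
    left = (V(delta, 0.0) - V(delta, -h)) / h      # -> -delta
    print(h, right, left)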

Instead, we use a more involved argument, combining differentiability of aV(0,a)a\mapsto V(0,a) with a Lagrangian approach. This however requires slightly stricter growth assumptions than the ones imposed in Assumption 1, which are specified in Assumption 4.

Example 26.

We provide detailed computations behind the square-root LASSO/Ridge regression example discussed in Bartl et al., 2021b. We consider $\mathcal{A}=\mathbb{R}^{k}$, $\mathcal{S}=\mathbb{R}^{k+1}$. We fix the norms $\|(x,y)\|=|x|_{s}$ and $\|(x,y)\|_{\ast}=|x|_{r}\mathbf{1}_{\{y=0\}}+\infty\mathbf{1}_{\{y\neq 0\}}$, for some $s>1$, $1/s+1/r=1$ and $(x,y)\in\mathbb{R}^{k}\times\mathbb{R}$. We recall that then (5) holds and we can apply our methodology to $f((x,y),a):=(y-\langle x,a\rangle)^{2}$. In general we have

(x,y)f((x,y),a)=(2(yx,a)a,2(yx,a))\nabla_{(x,y)}f((x,y),a^{\star})=(-2(y-\langle x,a^{\star}\rangle)a^{\star},2(y-\langle x,a^{\star}\rangle))

$\nabla_{a}^{2}V(0,a^{\star})=2D$, where $D:=\int_{\mathbb{R}^{k+1}}xx^{T}\,\mu(dx,dy)$, and

(k+1(x,y)f((x,y),a)2μ(dx,dy))1/2\displaystyle\left(\int_{\mathbb{R}^{k+1}}\|\nabla_{(x,y)}f((x,y),a^{\star})\|^{2}\,\mu(dx,dy)\right)^{1/2} =2|a|s(k+1(yx,a)2μ(dx,dy))1/2\displaystyle=2|a^{\star}|_{s}\left(\int_{\mathbb{R}^{k+1}}(y-\langle x,a^{\star}\rangle)^{2}\,\mu(dx,dy)\right)^{1/2}
=2|a|sV(0).\displaystyle=2|a^{\star}|_{s}\sqrt{V(0)}.

Recalling the convention that $\nabla_{(x,y)}\nabla_{a}f\in\mathbb{R}^{k\times(k+1)}$ is given by

\begin{bmatrix}\nabla_{x_{1}}\nabla_{a_{1}}f&\dots&\nabla_{x_{k}}\nabla_{a_{1}}f&\nabla_{y}\nabla_{a_{1}}f\\ \nabla_{x_{1}}\nabla_{a_{2}}f&\dots&\nabla_{x_{k}}\nabla_{a_{2}}f&\nabla_{y}\nabla_{a_{2}}f\\ \vdots&\vdots&\vdots&\vdots\\ \nabla_{x_{1}}\nabla_{a_{k}}f&\dots&\nabla_{x_{k}}\nabla_{a_{k}}f&\nabla_{y}\nabla_{a_{k}}f\end{bmatrix}

we conclude

\displaystyle\nabla_{(x,y)}\nabla_{a}f((x,y),a^{\star})=2\left(-y\mathbf{I}+x(a^{\star})^{T}+\langle a^{\star},x\rangle\mathbf{I},\,-x\right),

where 𝐈\mathbf{I} is the k×kk\times k identity matrix. Recall furthermore that k+1(ya,x)xiμ(dx,dy)=0\int_{\mathbb{R}^{k+1}}(y-\langle a^{\star},x\rangle)x_{i}\mu(dx,dy)=0 for all 1ik1\leq i\leq k and in particular V(0)=k+1(y2a,xy)μ(dx,dy)V(0)=\int_{\mathbb{R}^{k+1}}(y^{2}-\langle a^{\star},x\rangle y)\mu(dx,dy). Set now

h((x,y)):=(sign(x1)|x1|s1,,sign(xk)|xk|s1,0)|x|s1s.h((x,y)):=(\text{sign}(x_{1})\,|x_{1}|^{s-1},\dots,\text{sign}(x_{k})\,|x_{k}|^{s-1},0)\cdot|x|_{s}^{1-s}.

Then (x,y),h((x,y))=|x|s\langle(x,y),h((x,y))\rangle=|x|_{s} and |h(x,y)|r=1|h(x,y)|_{r}=1 for (x,y)𝒮U(x,y)\in\mathcal{S}\setminus U. As hh does not depend on the last coordinate, we also write simply h(x)h(x) for h((x,y))h((x,y)). As q=2q=2 we have in particular

k+1(x,y)af((x,y),a)h((x,y)f((x,y),a))(x,y)f((x,y),a)1μ(dx,dy)\displaystyle\int_{\mathbb{R}^{k+1}}\nabla_{(x,y)}\nabla_{a}f((x,y),a^{\star})\frac{h(\nabla_{(x,y)}f((x,y),a^{\star}))}{\|\nabla_{(x,y)}f((x,y),a^{\star})\|^{-1}}\,\mu(dx,dy)
\displaystyle=4\int_{\mathbb{R}^{k+1}}\big{[}-y\mathbf{I}+x(a^{\star})^{T}+\langle a^{\star},x\rangle\mathbf{I}\big{]}\,h(-(y-\langle x,a^{\star}\rangle)a^{\star})\,|a^{\star}|_{s}|y-\langle x,a^{\star}\rangle|\,\mu(dx,dy)
\displaystyle=-4|a^{\star}|_{s}\int_{\mathbb{R}^{k+1}}\big{[}-y\mathbf{I}+x(a^{\star})^{T}+\langle a^{\star},x\rangle\mathbf{I}\big{]}\,(y-\langle x,a^{\star}\rangle)h(a^{\star})\,\mu(dx,dy)
=4|a|sV(0)h(a).\displaystyle=4|a^{\star}|_{s}V(0)\,h(a^{\star}).

In conclusion

aδ\displaystyle a^{\star}_{\delta}\approx a(k+1(x,y)f((x,y),a)2μ(dx,dy))1/2(a2V(0,a))1\displaystyle\ a^{\star}-\Big{(}\int_{\mathbb{R}^{k+1}}\|\nabla_{(x,y)}f((x,y),a^{\star})\|^{2}\,\mu(dx,dy)\Big{)}^{-1/2}(\nabla^{2}_{a}V(0,a^{\star}))^{-1}
k+1(x,y)af((x,y),a)h((x,y)f((x,y),a))(x,y)f((x,y),a)1μ(dx,dy)δ\displaystyle\qquad\qquad\qquad\cdot\int_{\mathbb{R}^{k+1}}\frac{\nabla_{(x,y)}\nabla_{a}f((x,y),a^{\star})\,h(\nabla_{(x,y)}f((x,y),a^{\star}))}{\|\nabla_{(x,y)}f((x,y),a^{\star})\|^{-1}}\,\mu(dx,dy)\cdot\delta
=a14|a|sV(0)D1 4|a|sV(0)h(a)δ\displaystyle=a^{\star}-\frac{1}{4|a^{\star}|_{s}\sqrt{V(0)}}\,D^{-1}\,4|a^{\star}|_{s}V(0)\,h(a^{\star})\cdot\delta
=aV(0)D1h(a)δ.\displaystyle=a^{\star}-\sqrt{V(0)}D^{-1}\,h(a^{\star})\cdot\delta.

Let us now specialise to the typical statistical context and let $\mu=\mu_{N}$ be the empirical measure of $N$ data samples, i.e., $\mu_{N}=\frac{1}{N}\sum_{i=1}^{N}\delta_{(x_{i},y_{i})}$ for some points $x_{1},\dots,x_{N}\in\mathbb{R}^{k}$ and $y_{1},\ldots,y_{N}\in\mathbb{R}$. Let us write $x_{i}=(x_{i,1},\dots,x_{i,k})$ and $X=(x_{i,j})_{i=1,\dots,N}^{j=1,\dots,k}$. Then in particular

D=k+1xxTμN(dx,dy)=1NXTX\displaystyle D=\int_{\mathbb{R}^{k+1}}xx^{T}\,\mu_{N}(dx,dy)=\frac{1}{N}X^{T}X

and we recover the notation common in statistics. In particular, a=(XTX)1XTya^{\star}=(X^{T}X)^{-1}X^{T}y. If we now assume that XTX=𝐈X^{T}X=\mathbf{I} (and hence D1=N𝐈D^{-1}=N\mathbf{I}), then we can easily compute

V(0)\displaystyle V(0) =1N(yXa)T(yXa)=1N(yXXTy)T(yXXTy)\displaystyle=\frac{1}{N}(y-Xa^{\star})^{T}(y-Xa^{\star})=\frac{1}{N}(y-XX^{T}y)^{T}(y-XX^{T}y)
=1NyT(𝐈XXT)T(𝐈XXT)y=1NyT(𝐈XXTXXT+XXTXXT)y\displaystyle=\frac{1}{N}y^{T}(\mathbf{I}-XX^{T})^{T}(\mathbf{I}-XX^{T})y=\frac{1}{N}y^{T}(\mathbf{I}-XX^{T}-XX^{T}+XX^{T}XX^{T})y
=1NyT(𝐈XXT)y\displaystyle=\frac{1}{N}y^{T}(\mathbf{I}-XX^{T})y

Note that, under the assumption that i=1Nyi=0\sum_{i=1}^{N}y_{i}=0, R2R^{2} is defined as

R2\displaystyle R^{2} =1yT(𝐈XXT)yyTy=yTyyT(𝐈XXT)yyTy=yTXXTyyTy.\displaystyle=1-\frac{y^{T}(\mathbf{I}-XX^{T})y}{y^{T}y}=\frac{y^{T}y-y^{T}(\mathbf{I}-XX^{T})y}{y^{T}y}=\frac{y^{T}XX^{T}y}{y^{T}y}.

Thus in the case s=1s=1 we have

aδ\displaystyle a^{\star}_{\delta} aV(0)D1sign(a)δ=aNyTyyTXXTysign(a)δ\displaystyle\approx a^{\star}-\sqrt{V(0)}D^{-1}\,\text{sign}(a^{\star})\cdot\delta=a^{\star}-\sqrt{N}\,\sqrt{y^{T}y-y^{T}XX^{T}y}\,\text{sign}(a^{\star})\cdot\delta
=aNyTy1yTXXTyyTysign(a)δ\displaystyle=a^{\star}-\sqrt{N}\,\sqrt{y^{T}y}\,\sqrt{1-\frac{y^{T}XX^{T}y}{y^{T}y}}\,\text{sign}(a^{\star})\cdot\delta
=aN|y|1R2sign(a)δ.\displaystyle=a^{\star}-\sqrt{N}\,|y|\,\sqrt{1-R^{2}}\,\text{sign}(a^{\star})\cdot\delta.

Furthermore, in the case s=2s=2 we have

aδ\displaystyle a^{\star}_{\delta} aD1V(0)|a|2aδ=a(1NyT(𝐈XXT)yN|a|2δ)\displaystyle\approx a^{\star}-D^{-1}\frac{\sqrt{V(0)}}{|a^{\star}|_{2}}a^{\star}\delta=a^{\star}\left(1-N\frac{\sqrt{y^{T}(\mathbf{I}-XX^{T})y}}{\sqrt{N}|a^{\star}|_{2}}\,\delta\right)
(26) =a(1NyT(𝐈XXT)y|a|2δ).\displaystyle=a^{\star}\left(1-\frac{\sqrt{N\,y^{T}(\mathbf{I}-XX^{T})y}}{|a^{\star}|_{2}}\,\delta\right).

We also have

\displaystyle|a^{\star}|_{2}=\sqrt{\langle a^{\star},a^{\star}\rangle}=\sqrt{y^{T}XX^{T}y},

so (26) simplifies to

aδ\displaystyle a^{\star}_{\delta} a(1NyT(𝐈XXT)yyTXXTyδ)=a(1δN(yTyyTXXTy1))\displaystyle\approx a^{\star}\left(1-\frac{\sqrt{N\,y^{T}(\mathbf{I}-XX^{T})y}}{\sqrt{y^{T}XX^{T}y}}\,\delta\right)=a^{\star}\left(1-\delta\sqrt{N\left(\frac{y^{T}y}{y^{T}XX^{T}y}-1\right)}\right)
=a(1δN(1R21)).\displaystyle=a^{\star}\left(1-\delta\sqrt{N\left(\frac{1}{R^{2}}-1\right)}\right).
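For concreteness, the following short numerical sketch (our own, with synthetic data; the orthonormal design $X^{T}X=\mathbf{I}$ is enforced via a QR factorization) evaluates the two first-order shrinkage formulas derived above, namely $a^{\star}_{\delta}\approx a^{\star}-\sqrt{N}\,|y|\,\sqrt{1-R^{2}}\,\mathrm{sign}(a^{\star})\,\delta$ for $s=1$ and $a^{\star}_{\delta}\approx a^{\star}(1-\delta\sqrt{N(1/R^{2}-1)})$ for $s=2$.

# Sketch of the first-order shrinkage formulas under an orthonormal design
# (synthetic data; all variable names are ours).
import numpy as np

rng = np.random.default_rng(0)
N, k = 200, 3
X, _ = np.linalg.qr(rng.normal(size=(N, k)))        # columns orthonormal: X^T X = I
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=N)
y -= y.mean()                                       # so that sum_i y_i = 0

a_star = X.T @ y                                    # OLS coefficients when X^T X = I
R2 = (a_star @ a_star) / (y @ y)                    # R^2 = y^T X X^T y / y^T y
delta = 0.01

a_delta_s2 = a_star * (1 - delta * np.sqrt(N * (1 / R2 - 1)))                  # s = 2
a_delta_s1 = a_star - np.sqrt(N) * np.linalg.norm(y) * np.sqrt(1 - R2) \
             * np.sign(a_star) * delta                                         # s = 1
print(a_star, a_delta_s2, a_delta_s1, sep="\n")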
Remark 27.

While $|\cdot|_{1}$ is not strictly convex, the above example can still be adapted to cover this case under the additional assumption that $a^{\star}$ has no entries which are equal to zero. Indeed, we note that $x\mapsto h(x,y)$ is continuous (even constant) at every point $x$ except if a component of $x$ is equal to zero. Thus the proof of Lemma 30 still applies if we assume that $g$ has $\mu$-a.s. no components which are equal to zero instead of merely assuming that $g\neq 0$ $\mu$-a.s.

Example 28.

We provide further details and discussion to complement the out-of-sample error example in Bartl et al., 2021b. First, we recall the remainder term obtained therein:

ΔN\displaystyle\Delta_{N} :=(|xf(x,a,N)|sqμN(dx))1q1(a2f(x,a,N)μN(dx))1\displaystyle:=\Big{(}\int|\nabla_{x}f(x,a^{{\star},N})|_{s}^{q}\,\mu_{N}(dx)\Big{)}^{\frac{1}{q}-1}\cdot\left(\int\nabla_{a}^{2}f(x,a^{{\star},N})\,\mu_{N}(dx)\right)^{-1}
xaf(x,a,N)h(xf(x,a,N))|xf(x,a,N)|s1qμN(dx)(a2V(0,a))1Θ,where\displaystyle\quad\cdot\int\frac{\nabla_{x}\nabla_{a}f(x,a^{{\star},N})\,h(\nabla_{x}f(x,a^{{\star},N}))}{|\nabla_{x}f(x,a^{{\star},N})|_{s}^{1-q}}\,\mu_{N}(dx)-(\nabla_{a}^{2}V(0,a^{{\star}}))^{-1}\Theta,\quad\textrm{where}
Θ\displaystyle\Theta :=(|xf(x,a)|sqμ(dx))1q1xaf(x,a)h(xf(x,a))|xf(x,a)|s1qμ(dx).\displaystyle:=\Big{(}\int|\nabla_{x}f(x,a^{\star})|_{s}^{q}\,\mu(dx)\Big{)}^{\frac{1}{q}-1}\cdot\int\frac{\nabla_{x}\nabla_{a}f(x,a^{\star})\,h(\nabla_{x}f(x,a^{\star}))}{|\nabla_{x}f(x,a^{\star})|_{s}^{1-q}}\,\mu(dx).

Recall that μNμ\mu_{N}\to\mu in WpW_{p} holds a.s. We suppose that Assumptions 1 and 4 hold, and that for any r>0r>0, there exists c>0c>0 such that the following hold uniformly for all |a|r|a|\leq r:

(27) i=1k|aaif(x,a)|c(1+|x|p),|xaf(x,a)h(xf(x,a))|xf(x,a)|1q|c(1+|x|p).\displaystyle\begin{split}\sum_{i=1}^{k}\Big{|}\nabla_{a}\nabla_{a_{i}}f(x,a)\Big{|}&\leq c(1+|x|^{p}),\\ \Bigg{|}\frac{\nabla_{x}\nabla_{a}f(x,a^{\star})\,h(\nabla_{x}f(x,a^{\star}))}{|\nabla_{x}f(x,a^{\star})|^{1-q}}\Bigg{|}&\leq c(1+|x|^{p}).\end{split}

Recall from (9) that we already know that a,Naa^{{\star},N}\to a^{{\star}} a.s. Under the above integrability assumption, Lemma 21 gives

|aiajf(x,a,N)μN(dx)aiajf(x,a)μ(dx)|0,\displaystyle\left|\int\nabla_{a_{i}}\nabla_{a_{j}}f(x,a^{{\star},N})\,\mu_{N}(dx)-\int\nabla_{a_{i}}\nabla_{a_{j}}f(x,a^{{\star}})\,\mu(dx)\right|\to 0,

with analogous convergence for the other two terms in ΔN\Delta_{N}. We conclude that ΔN0\Delta_{N}\to 0 a.s. and that (12) and (13) hold.

We now show how the arguments above can be adapted to extend and complement (Anderson and Philpott,, 2019, Prop. 17). Therein, the authors study VRS(δ)(\delta) which is the expectation over realisations of μN\mu_{N} of

f(x,a,N)μ(dx)f(x,aδ,N)μ(dx).\displaystyle\int f(x,a^{{\star},N})\,\mu(dx)-\int f(x,a^{{\star},N}_{\delta})\,\mu(dx).

If VRS$(\delta)>0$ then, on average, the robust problem offers an improved performance, i.e., finds a better approximation to the true optimizer $a^{\star}$ than the classical non-robust problem. If we work with the difference above, then we consider a first-order Taylor expansion and obtain

V(0,aδ,N)V(0,a,N)=aV(0,a,N)(aδ,Na,N)+o(|aδ,Na,N|),\displaystyle V(0,a^{{\star},N}_{\delta})-V(0,a^{{\star},N})=\nabla_{a}V(0,a^{{\star},N})(a^{{\star},N}_{\delta}-a^{{\star},N})+o(|a^{{\star},N}_{\delta}-a^{{\star},N}|),

which holds under the first condition in (27). This can be compared with (Anderson and Philpott, 2019, Lemma 1), which was derived under a Lipschitz continuity assumption on $a\mapsto f(x,a)$. For the quadratic case of (Anderson and Philpott, 2019, Prop. 17) we have $f(x,a)=\tfrac{1}{2}a^{2}-g(x)a$, where we took $d=1$ for notational simplicity. We then have $\nabla_{x}f(x,a)=-g^{\prime}(x)a$, $\nabla_{a}^{2}f(x,a)=1$ and $\nabla_{x}\nabla_{a}f(x,a)\nabla_{x}f(x,a)=(g^{\prime}(x))^{2}a$. Specialising (10) to this setting, with $s=2$, gives

aδ,Na,N\displaystyle a^{{\star},N}_{\delta}-a^{{\star},N} (a2V(0,a))1(|xf(x,a)|qμN(dx))1/q1xaf(x,a)xf(x,a)|xf(x,a)|2qμN(dx)\displaystyle\approx-\left(\nabla_{a}^{2}V\left(0,a^{{\star}}\right)\right)^{-1}\left(\int\left|\nabla_{x}f\left(x,a^{{\star}}\right)\right|^{q}\mu_{N}(dx)\right)^{1/q-1}\cdot\int\frac{\nabla_{x}\nabla_{a}f\left(x,a^{{\star}}\right)\nabla_{x}f\left(x,a^{{\star}}\right)}{|\nabla_{x}f\left(x,a^{{\star}}\right)|^{2-q}}\mu_{N}(dx)
=|a|1q(|g(x)|qμN(dx))1/q1(g(x))2a|g(x)a|2qμN(dx)\displaystyle=-|a^{{\star}}|^{1-q}\left(\int|g^{\prime}(x)|^{q}\,\mu_{N}(dx)\right)^{1/q-1}\int\frac{(g^{\prime}(x))^{2}a^{{\star}}}{|g^{\prime}(x)a^{{\star}}|^{2-q}}\,\mu_{N}(dx)
=sign(a)(|g(x)|qμN(dx))1/q.\displaystyle=-\text{sign}(a^{{\star}})\left(\int|g^{\prime}(x)|^{q}\,\mu_{N}(dx)\right)^{1/q}.

While our results work for p>1p>1, see Remark 9, we can formally let qq\uparrow\infty. The last term then converges to sign(a)gL(μ)-\text{sign}(a^{{\star}})\|g^{\prime}\|_{L^{\infty}(\mu)} which recovers (Anderson and Philpott,, 2019, Prop. 17), taking into account that sign(a)=sign(g(x)μ(dx)).\text{sign}(a^{{\star}})=\text{sign}\left(\int g(x)\,\mu(dx)\right).
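To make the formal limit $q\uparrow\infty$ concrete, here is a small sketch (our own; the choice $g(x)=\sin x$ and the uniform sample are arbitrary) that evaluates the first-order coefficient $-\mathrm{sign}(a^{\star,N})\,(\int|g^{\prime}(x)|^{q}\,\mu_{N}(dx))^{1/q}$ for increasing $q$ and compares it with $-\mathrm{sign}(a^{\star,N})\max_{i}|g^{\prime}(x_{i})|$.

# Sketch of the quadratic example f(x,a) = a^2/2 - g(x)a with g(x) = sin(x)
# (our own choice) on an empirical sample mu_N; the first-order coefficient
# approaches the L^infinity norm of g' over the sample as q grows.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=1_000)               # sample behind mu_N
gp = np.cos(x)                                      # g'(x) on the sample

a_star_N = np.mean(np.sin(x))                       # minimizer of int (a^2/2 - g(x)a) dmu_N
for q in [2, 5, 20, 100]:
    coeff = -np.sign(a_star_N) * np.mean(np.abs(gp) ** q) ** (1 / q)
    print(q, coeff)
print("q -> infinity:", -np.sign(a_star_N) * np.max(np.abs(gp)))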

C.2. Proofs and auxiliary results related to Theorem 5

Lemma 29.

Let $f\colon\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ be differentiable such that $(x,a)\mapsto\nabla_{a}f(x,a)$ is continuous, fix $a\in{\mathcal{A}}^{o}$, and assume that for some $\varepsilon>0$ we have that $|\nabla_{x}f(x,\tilde{a})|\leq c(1+|x|^{p-1-\varepsilon})$ and $|\nabla_{a}f(x,\tilde{a})|\leq c(1+|x|^{p-\varepsilon})$ for some $c>0$, all $x\in\mathcal{S}$ and all $\tilde{a}\in\mathcal{A}$ close to $a$. Further fix $\delta\geq 0$ and recall that $B^{\star}_{\delta}(\mu,a)$ is the set of maximizing measures given the strategy $a$. Then the (one-sided) directional derivative of $V(\delta,\cdot)$ at $a$ in the direction $b\in\mathbb{R}^{k}$ is given by

limh0V(δ,a+hb)V(δ,a)h=supνBδ(μ,a)𝒮af(x,a),bν(dx).\lim_{h\to 0}\frac{V(\delta,a+hb)-V(\delta,a)}{h}=\sup_{\nu\in B^{\star}_{\delta}(\mu,a)}\int_{\mathcal{S}}\langle\nabla_{a}f(x,a),b\rangle\,\nu(dx).
Proof.

Fix bkb\in\mathbb{R}^{k}. We start by showing that

(28) lim infh0V(δ,a+hb)V(δ,a)h\displaystyle\liminf_{h\to 0}\frac{V(\delta,a+hb)-V(\delta,a)}{h} supνBδ(μ,a)𝒮af(x,a),bν(dx).\displaystyle\geq\sup_{\nu\in B^{\star}_{\delta}(\mu,a)}\int_{\mathcal{S}}\langle\nabla_{a}f(x,a),b\rangle\,\nu(dx).

To that end, let νBδ(μ,a)\nu\in B^{\star}_{\delta}(\mu,a) and h>0h>0 be arbitrary. By definition of Bδ(μ,a)B^{\star}_{\delta}(\mu,a) one has V(δ,a)=𝒮f(x,a)ν(dx)V(\delta,a)=\int_{\mathcal{S}}f(x,a)\,\nu(dx). Moreover Bδ(μ,a)Bδ(μ)B^{\star}_{\delta}(\mu,a)\subseteq B_{\delta}(\mu) implies that V(δ,a+hb)𝒮f(x,a+hb)ν(dx)V(\delta,a+hb)\geq\int_{\mathcal{S}}f(x,a+hb)\,\nu(dx). Note that the assumption |xf(x,a~)|c(1+|x|p1ε)|\nabla_{x}f(x,\tilde{a})|\leq c(1+|x|^{p-1-\varepsilon}) implies

|f(x,a~)f(0,a~)|\displaystyle|f(x,\tilde{a})-f(0,\tilde{a})| =|01xf(tx,a~),x𝑑t|\displaystyle=\left|\int_{0}^{1}\langle\nabla_{x}f(tx,\tilde{a}),x\rangle dt\right|
01c(1+|tx|p1ε)|x|𝑑tc(1+|x|pε|x|).\displaystyle\leq\int_{0}^{1}c(1+|tx|^{p-1-\varepsilon})|x|dt\leq c(1+|x|^{p-\varepsilon}\vee|x|).

Therefore, by dominated convergence, one has

lim infh0V(δ,a+hb)V(δ,a)h\displaystyle\liminf_{h\to 0}\frac{V(\delta,a+hb)-V(\delta,a)}{h} lim infh0𝒮f(x,a+hb)f(x,a)hν(dx)\displaystyle\geq\liminf_{h\to 0}\int_{\mathcal{S}}\frac{f(x,a+hb)-f(x,a)}{h}\,\nu(dx)
=𝒮limh0f(x,a+hb)f(x,a)hν(dx)\displaystyle=\int_{\mathcal{S}}\lim_{h\to 0}\frac{f(x,a+hb)-f(x,a)}{h}\,\nu(dx)
=𝒮af(x,a),bν(dx)\displaystyle=\int_{\mathcal{S}}\langle\nabla_{a}f(x,a),b\rangle\,\nu(dx)

and as νBδ(μ,a)\nu\in B^{\star}_{\delta}(\mu,a) was arbitrary, this shows (28).

We proceed to show that

(29) lim suph0V(δ,a+hb)V(δ,a)h\displaystyle\limsup_{h\to 0}\frac{V(\delta,a+hb)-V(\delta,a)}{h} supνBδ(μ,a)𝒮af(x,a),bν(dx).\displaystyle\leq\sup_{\nu\in B^{\star}_{\delta}(\mu,a)}\int_{\mathcal{S}}\langle\nabla_{a}f(x,a),b\rangle\,\nu(dx).

For every sufficiently small h>0h>0 let νhBδ(μ,a+hb)\nu^{h}\in B_{\delta}^{\star}(\mu,a+hb) such that V(δ,a+hb)=𝒮f(x,a+hb)νh(dx)V(\delta,a+hb)=\int_{\mathcal{S}}f(x,a+hb)\,\nu^{h}(dx). The existence of such νh\nu^{h} is guaranteed by Lemma 24, which also guarantees that (possibly after passing to a subsequence) there is ν~Bδ(μ)\tilde{\nu}\in B_{\delta}(\mu) such that νhν~\nu^{h}\to\tilde{\nu} in Wpε||W^{|\cdot|}_{p-\varepsilon}. We claim that ν~Bδ(μ,a)\tilde{\nu}\in B^{\star}_{\delta}(\mu,a). By Lemma 21 one has

limh0V(δ,a+hb)=𝒮f(x,a)ν~(dx)V(δ,a).\lim_{h\to 0}V(\delta,a+hb)=\int_{\mathcal{S}}f(x,a)\,\tilde{\nu}(dx)\leq V(\delta,a).

On the other hand, for any choice $\bar{\nu}\in B^{\star}_{\delta}(\mu,a)$ one has

\lim_{h\to 0}V(\delta,a+hb)\geq\lim_{h\to 0}\int_{\mathcal{S}}f(x,a+hb)\,\bar{\nu}(dx)=\int_{\mathcal{S}}f(x,a)\,\bar{\nu}(dx)=V(\delta,a).

This implies V(δ,a)=𝒮f(x,a)ν~(dx)V(\delta,a)=\int_{\mathcal{S}}f(x,a)\,\tilde{\nu}(dx) and in particular ν~Bδ(μ,a)\tilde{\nu}\in B^{\star}_{\delta}(\mu,a). At this point expand

f(x,a+hb)=f(x,a)+01af(x,a+thb),hb𝑑tf(x,a+hb)=f(x,a)+\int_{0}^{1}\langle\nabla_{a}f(x,a+thb),hb\rangle\,dt

so that

V(δ,a+hb)V(δ,a)\displaystyle V(\delta,a+hb)-V(\delta,a)
=𝒮(f(x,a)+01af(x,a+thb),hb𝑑t)νh(dx)𝒮f(x,a)ν~(dx)\displaystyle=\int_{\mathcal{S}}\Big{(}f(x,a)+\int_{0}^{1}\langle\nabla_{a}f(x,a+thb),hb\rangle\,dt\Big{)}\,\nu^{h}(dx)-\int_{\mathcal{S}}f(x,a)\,\tilde{\nu}(dx)
𝒮01af(x,a+thb),hb𝑑tνh(dx)\displaystyle\leq\int_{\mathcal{S}}\int_{0}^{1}\langle\nabla_{a}f(x,a+thb),hb\rangle\,dt\,\nu^{h}(dx)

where we used ν~Bδ(μ,a)\tilde{\nu}\in B^{\star}_{\delta}(\mu,a) for the last inequality. Recall that νh\nu^{h} converges to ν~\tilde{\nu} in Wpε||W^{|\cdot|}_{p-\varepsilon} and by assumption |af(x,a~)|c(1+|x|pε)|\nabla_{a}f(x,\tilde{a})|\leq c(1+|x|^{p-\varepsilon}) for all a~𝒜\tilde{a}\in\mathcal{A} close to aa. In particular

1haf(x,a+thb),hb|af(x,a+thb)||b|c(1+|x|pε)\frac{1}{h}\langle\nabla_{a}f(x,a+thb),hb\rangle\leq|\nabla_{a}f(x,a+thb)||b|\leq c(1+|x|^{p-\varepsilon})

for hh sufficiently small. As furthermore (x,a)af(x,a)(x,a)\mapsto\nabla_{a}f(x,a) is continuous, we conclude by Lemma 21 that

\displaystyle\lim_{h\to 0}\frac{1}{h}\int_{\mathcal{S}}\langle\nabla_{a}f(x,a+thb),hb\rangle\,\nu^{h}(dx)=\int_{\mathcal{S}}\langle\nabla_{a}f(x,a),b\rangle\,\tilde{\nu}(dx)\quad\text{for every fixed }t\in[0,1].

Lastly, by Fubini’s theorem and dominated convergence (in $t$)

1h𝒮01af(x,a+thb),hb𝑑tνh(dx)𝒮af(x,a),bν~(dx)\frac{1}{h}\int_{\mathcal{S}}\int_{0}^{1}\langle\nabla_{a}f(x,a+thb),hb\rangle\,dt\,\nu^{h}(dx)\to\int_{\mathcal{S}}\langle\nabla_{a}f(x,a),b\rangle\,\tilde{\nu}(dx)

as h0h\to 0, which ultimately shows (29). ∎
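As a small sanity check (ours, not part of the paper), the directional derivative formula of Lemma 29 can be tested in the explicit setting of Remark 25: there $V(\delta,a)=\delta|a|+a^{2}$, and for $a\neq 0$ every maximizing measure $\nu\in B^{\star}_{\delta}(\mu,a)$ satisfies $\int x\,\nu(dx)=\delta\,\mathrm{sign}(a)$, so the lemma predicts $\partial_{a}V(\delta,a)=\delta\,\mathrm{sign}(a)+2a$.

# Sketch comparing a numerical derivative of the closed form V(delta, a)
# from Remark 25 with the prediction of Lemma 29 (assumptions as stated above).
import numpy as np

delta, a, h = 0.3, 0.7, 1e-6
V = lambda a_: delta * abs(a_) + a_ ** 2            # closed form from Remark 25
numerical = (V(a + h) - V(a - h)) / (2 * h)         # central difference in a
x_worst = delta * np.sign(a)                        # barycenter of any maximizing measure
predicted = x_worst + 2 * a                         # int grad_a f(x,a) dnu = x_worst + 2a
print(numerical, predicted)                         # both approximately delta + 2a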

Lemma 30.

Let $q\in(1,\infty)$ and let $f,g\colon\mathcal{S}\to\mathbb{R}^{d}$ be measurable such that $\int_{\mathcal{S}}\|f(x)\|^{q}+\|g(x)\|^{q}\,\mu(dx)<\infty$ and such that $g\neq 0$ $\mu$-a.s. Then we have that

(30) infλ((𝒮f(x)+λg(x)qμ(dx))1/qλ(𝒮g(x)qμ(dx))1/q)=𝒮f(x),h(g(x))g(x)1qμ(dx)(𝒮g(x)qμ(dx))1/q1,\displaystyle\begin{split}&\inf_{\lambda\in\mathbb{R}}\left(\left(\int_{\mathcal{S}}\|f(x)+\lambda g(x)\|^{q}\,\mu(dx)\right)^{1/q}-\lambda\left(\int_{\mathcal{S}}\|g(x)\|^{q}\,\mu(dx)\right)^{1/q}\right)\\ &=\int_{\mathcal{S}}\frac{\langle f(x),h(g(x))\rangle}{\|g(x)\|^{1-q}}\,\mu(dx)\cdot\Big{(}\int_{\mathcal{S}}\|g(x)\|^{q}\,\mu(dx)\Big{)}^{1/q-1},\end{split}

where h:d{0}dh\colon\mathbb{R}^{d}\setminus\{0\}\to\mathbb{R}^{d} was defined in Lemma 7.

Proof.

First recall that hh is continuous and satisfies x=x,h(x)\|x\|=\langle x,h(x)\rangle for every x0x\neq 0. Now define

G(x):=h(g(x))g(x)1q(𝒮g(z)qμ(dz))1/q1for x𝒮.G(x):=\frac{h(g(x))}{\|g(x)\|^{1-q}}\Big{(}\int_{\mathcal{S}}\|g(z)\|^{q}\,\mu(dz)\Big{)}^{1/q-1}\quad\text{for }x\in\mathcal{S}.

Similarly, define GλG^{\lambda} by replacing gg in the definition of GG by gλ:=f+λgg^{\lambda}:=f+\lambda g. As in the proof of Theorem 2 we compute

𝒮G(x)pμ(dx)=1and(𝒮g(x)qμ(dx))1/q=𝒮g(x),G(x)μ(dx).\int_{\mathcal{S}}\|G(x)\|_{\ast}^{p}\,\mu(dx)=1\quad\text{and}\quad\left(\int_{\mathcal{S}}\|g(x)\|^{q}\,\mu(dx)\right)^{1/q}=\int_{\mathcal{S}}\langle g(x),G(x)\rangle\,\mu(dx).

This remains true when gg and GG are replaced by gλg^{\lambda} and GλG^{\lambda}, respectively. Moreover, Hölder’s inequality implies that

(𝒮gλ(x)qμ(dx))1/q\displaystyle\left(\int_{\mathcal{S}}\|g^{\lambda}(x)\|^{q}\,\mu(dx)\right)^{1/q} 𝒮gλ(x),G(x)μ(dx),\displaystyle\geq\int_{\mathcal{S}}\langle g^{\lambda}(x),G(x)\rangle\,\mu(dx),
(𝒮g(x)qμ(dx))1/q\displaystyle\left(\int_{\mathcal{S}}\|g(x)\|^{q}\,\mu(dx)\right)^{1/q} 𝒮g(x),Gλ(x)μ(dx).\displaystyle\geq\int_{\mathcal{S}}\langle g(x),G^{\lambda}(x)\rangle\,\mu(dx).

The first of these two inequalities immediately implies that the left hand side in (30) is larger than the right hand side.

To show the other inequality, note that hh is continuous and satisfies h(λx)=h(x)h(\lambda x)=h(x) for λ>0\lambda>0, hence h(g(x))=limλh(gλ(x))h(g(x))=\lim_{\lambda\to\infty}h(g^{\lambda}(x)) for all x𝒮x\in\mathcal{S} such that g(x)0g(x)\neq 0. Consequently one quickly computes G(x)=limλGλ(x)G(x)=\lim_{\lambda\to\infty}G^{\lambda}(x) for all x𝒮x\in\mathcal{S} such that g(x)0g(x)\neq 0. By dominated convergence we conclude that

infλ((𝒮f(x)+λg(x)qμ(dx))1/qλ(𝒮g(x)qμ(dx))1/q)\displaystyle\inf_{\lambda\in\mathbb{R}}\left(\left(\int_{\mathcal{S}}\|f(x)+\lambda g(x)\|^{q}\,\mu(dx)\right)^{1/q}-\lambda\left(\int_{\mathcal{S}}\|g(x)\|^{q}\,\mu(dx)\right)^{1/q}\right)
limλ(𝒮f(x)+λg(x),Gλ(x)μ(dx)λ𝒮g(x),Gλ(x)μ(dx))\displaystyle\leq\lim_{\lambda\to\infty}\left(\int_{\mathcal{S}}\langle f(x)+\lambda g(x),G^{\lambda}(x)\rangle\,\mu(dx)-\lambda\int_{\mathcal{S}}\langle g(x),G^{\lambda}(x)\rangle\,\mu(dx)\right)
=𝒮f(x),G(x)μ(dx)\displaystyle=\int_{\mathcal{S}}\langle f(x),G(x)\rangle\,\mu(dx)

and the claim follows. ∎
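As a numerical sanity check (ours, not part of the paper), the identity (30) can be verified for the Euclidean norm and $q=2$, where $h(x)=x/|x|$. Consistently with the proof, the objective on the left hand side is non-increasing in $\lambda$ and approaches the right hand side as $\lambda\to\infty$.

# Sketch checking the identity of Lemma 30 for the Euclidean norm and q = 2
# on simulated samples of f and g (our own choices; g != 0 holds a.s. for
# this continuous distribution).
import numpy as np

rng = np.random.default_rng(1)
f_s = rng.normal(size=(20_000, 3))                  # samples of f(x) under mu
g_s = rng.normal(size=(20_000, 3)) + 1.0            # samples of g(x) under mu

q = 2.0
norm_g = np.linalg.norm(g_s, axis=1)
rhs = (np.mean(np.sum(f_s * g_s, axis=1) * norm_g ** (q - 2))
       * np.mean(norm_g ** q) ** (1 / q - 1))

def objective(lam):                                 # left hand side of (30) before the inf
    return (np.mean(np.linalg.norm(f_s + lam * g_s, axis=1) ** q) ** (1 / q)
            - lam * np.mean(norm_g ** q) ** (1 / q))

for lam in [1.0, 10.0, 100.0, 1000.0]:
    print(lam, objective(lam))                      # decreases towards rhs
print("right hand side:", rhs)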

Let us lastly give the proof of Theorem 5 for general seminorms.

Proof of Theorem 5.

Recall the convention that xaf(x,a)k×d\nabla_{x}\nabla_{a}f(x,a)\in\mathbb{R}^{k\times d} and xf(x,a)d×1\nabla_{x}f(x,a)\in\mathbb{R}^{d\times 1}, af(x,a)k×1\nabla_{a}f(x,a)\in\mathbb{R}^{k\times 1} as well as h()/0=0h(\cdot)/0=0. Further recall that a𝒜(0)a^{\star}\in\mathcal{A}^{\star}(0) and aδ𝒜(δ)a^{\star}_{\delta}\in\mathcal{A}^{\star}(\delta) converge to aa^{\star} as δ0\delta\to 0. In order to show

limδ0aδaδ\displaystyle\lim_{\delta\to 0}\frac{a^{\star}_{\delta}-a^{\star}}{\delta} =(𝒮xf(z,a)qμ(dz))1q1(a2V(0,a))1\displaystyle=-\Big{(}\int_{\mathcal{S}}\|\nabla_{x}f(z,a^{\star})\|^{q}\,\mu(dz)\Big{)}^{\frac{1}{q}-1}(\nabla^{2}_{a}V(0,a^{\star}))^{-1}
𝒮xaf(x,a)h(xf(x,a))xf(x,a)1qμ(dx),\displaystyle\qquad\cdot\int_{\mathcal{S}}\frac{\nabla_{x}\nabla_{a}f(x,a^{\star})\,h(\nabla_{x}f(x,a^{\star}))}{\|\nabla_{x}f(x,a^{\star})\|^{1-q}}\,\mu(dx),

we first show that for every i{1,,k}i\in\{1,\dots,k\}

(31) \displaystyle\lim_{\delta\to 0}\frac{-\nabla_{a_{i}}V(0,a^{\star}_{\delta})}{\delta} =\int_{\mathcal{S}}\nabla_{x}\nabla_{a_{i}}f(x,a^{\star})\frac{h(\nabla_{x}f(x,a^{\star}))}{\|\nabla_{x}f(x,a^{\star})\|^{1-q}}\,\mu(dx)
\qquad\cdot\Big{(}\int_{\mathcal{S}}\|\nabla_{x}f(x,a^{\star})\|^{q}\,\mu(dx)\Big{)}^{1/q-1},

where we recall that aiV(0,aδ)\nabla_{a_{i}}V(0,a^{\star}_{\delta}) is the ii-th coordinate of the vector aV(0,aδ)\nabla_{a}V(0,a^{\star}_{\delta}). We start with the “\leq”-inequality in (31). For any a𝒜oa\in{\mathcal{A}}^{o}, the fundamental theorem of calculus implies that

af(y,a)af(x,a)\displaystyle\nabla_{a}f(y,a)-\nabla_{a}f(x,a) =01xaf(x+t(yx),a)(yx)𝑑t.\displaystyle=\int_{0}^{1}\nabla_{x}\nabla_{a}f(x+t(y-x),a)(y-x)\,dt.

Moreover, by Lemma 29 the function aV(δ,a)a\mapsto V(\delta,a) is (one-sided) directionally differentiable at aδa^{\star}_{\delta} for all δ>0\delta>0 small and thus for all i{1,,k}i\in\{1,\dots,k\}

(32) supνBδ(μ,aδ)𝒮aif(x,aδ)ν(dx)0,\displaystyle\sup_{\nu\in B_{\delta}^{\star}(\mu,a^{\star}_{\delta})}\int_{\mathcal{S}}\nabla_{a_{i}}f(x,a^{\star}_{\delta})\,\nu(dx)\geq 0,

where we recall Bδ(μ,aδ)B_{\delta}^{\star}(\mu,a^{\star}_{\delta}) is the set of all νBδ(μ)\nu\in B_{\delta}(\mu) for which 𝒮f(x,aδ)ν(dx)=V(δ,aδ)=V(δ)\int_{\mathcal{S}}f(x,a^{\star}_{\delta})\,\nu(dx)=V(\delta,a^{\star}_{\delta})=V(\delta). We now encode the optimality of ν\nu in Bδ(μ,aδ)B_{\delta}^{\star}(\mu,a^{\star}_{\delta}) via a Lagrange multiplier to obtain

(33) supνBδ(μ,aδ)𝒮aif(x,aδ)ν(dx)=supνBδ(μ)infλ𝒮[aif(y,aδ)+λ(f(y,aδ)V(δ))]ν(dy).\displaystyle\begin{split}&\sup_{\nu\in B_{\delta}^{\star}(\mu,a^{\star}_{\delta})}\int_{\mathcal{S}}\nabla_{a_{i}}f(x,a^{\star}_{\delta})\,\nu(dx)\\ &=\sup_{\nu\in B_{\delta}(\mu)}\inf_{\lambda\in\mathbb{R}}\int_{\mathcal{S}}\big{[}\nabla_{a_{i}}f(y,a^{\star}_{\delta})+\lambda(f(y,a^{\star}_{\delta})-V(\delta))\big{]}\nu(dy).\end{split}

In a similar manner, we trivially have

(34) 𝒮aif(x,aδ)μ(dx)=𝒮[aif(x,aδ)+λ(f(x,aδ)V(0,aδ))]μ(dx)\displaystyle\int_{\mathcal{S}}\nabla_{a_{i}}f(x,a^{\star}_{\delta})\,\mu(dx)=\int_{\mathcal{S}}\big{[}\nabla_{a_{i}}f(x,a^{\star}_{\delta})+\lambda(f(x,a^{\star}_{\delta})-V(0,a^{\star}_{\delta}))\big{]}\mu(dx)

for any λ\lambda\in\mathbb{R}, as 𝒮f(x,aδ)μ(dx)=V(0,aδ)\int_{\mathcal{S}}f(x,a^{\star}_{\delta})\,\mu(dx)=V(0,a^{\star}_{\delta}). Applying (32) and then (33), (34) we thus conclude for i{1,,k}i\in\{1,\dots,k\}

\displaystyle-\nabla_{a_{i}}V(0,a^{\star}_{\delta})\leq\sup_{\nu\in B_{\delta}^{\star}(\mu,a^{\star}_{\delta})}\int_{\mathcal{S}}\nabla_{a_{i}}f(y,a^{\star}_{\delta})\,\nu(dy)-\nabla_{a_{i}}V(0,a^{\star}_{\delta})
=supνBδ(μ)infλ(𝒮[aif(y,aδ)+λ(f(y,aδ)V(δ))]ν(dy)\displaystyle=\sup_{\nu\in B_{\delta}(\mu)}\inf_{\lambda\in\mathbb{R}}\bigg{(}\int_{\mathcal{S}}\big{[}\nabla_{a_{i}}f(y,a^{\star}_{\delta})+\lambda(f(y,a^{\star}_{\delta})-V(\delta))\big{]}\,\nu(dy)
𝒮[aif(x,aδ)+λ(f(x,aδ)V(0,aδ))]μ(dx))\displaystyle\quad-\int_{\mathcal{S}}\big{[}\nabla_{a_{i}}f(x,a^{\star}_{\delta})+\lambda(f(x,a^{\star}_{\delta})-V(0,a^{\star}_{\delta}))\big{]}\mu(dx)\bigg{)}
(35) =supνBδ(μ)infλ(𝒮[aif(y,aδ)+λf(y,aδ)]ν(dy)𝒮[aif(x,aδ)+λf(x,aδ)]μ(dx)λ(V(δ)V(0,aδ))).\displaystyle\begin{split}&=\sup_{\nu\in B_{\delta}(\mu)}\inf_{\lambda\in\mathbb{R}}\bigg{(}\int_{\mathcal{S}}\Big{[}\nabla_{a_{i}}f(y,a^{\star}_{\delta})+\lambda f(y,a^{\star}_{\delta})\Big{]}\,\nu(dy)\\ &\quad-\int_{\mathcal{S}}\Big{[}\nabla_{a_{i}}f(x,a^{\star}_{\delta})+\lambda f(x,a^{\star}_{\delta})\Big{]}\,\mu(dx)-\lambda(V(\delta)-V(0,a^{\star}_{\delta}))\bigg{)}.\end{split}

As in the proof of Lemma 29 we note that $B_{\delta}(\mu)$ is compact in $W_{p-\varepsilon}^{|\cdot|}$ and both terms inside the $\nu(dy)$-integral grow at most as $c(1+|y|^{p-\varepsilon})$ by Assumption 4. Thus using (Terkelsen, 1973, Cor. 2, p. 411) we can interchange the infimum and supremum in the last line above. Recall that

V(δ)=supνBδ(μ)𝒮f(y,aδ)ν(dy),V(\delta)=\sup_{\nu\in B_{\delta}(\mu)}\int_{\mathcal{S}}f(y,a^{\star}_{\delta})\,\nu(dy),

whence (35) is equal to

infλ(supπCδ(μ)𝒮×𝒮[aif(y,aδ)aif(x,aδ)+λ(f(y,aδ)f(x,aδ))]π(dx,dy)\displaystyle\inf_{\lambda\in\mathbb{R}}\bigg{(}\sup_{\pi\in C_{\delta}(\mu)}\int_{\mathcal{S}\times\mathcal{S}}\Big{[}\nabla_{a_{i}}f(y,a^{\star}_{\delta})-\nabla_{a_{i}}f(x,a^{\star}_{\delta})+\lambda(f(y,a^{\star}_{\delta})-f(x,a^{\star}_{\delta}))\Big{]}\,\pi(dx,dy)
λsupπCδ(μ)𝒮×𝒮f(y,aδ)f(x,aδ)π(dx,dy)).\displaystyle-\lambda\sup_{\pi\in C_{\delta}(\mu)}\int_{\mathcal{S}\times\mathcal{S}}f(y,a^{\star}_{\delta})-f(x,a^{\star}_{\delta})\,\pi(dx,dy)\bigg{)}.

For every fixed λ\lambda\in\mathbb{R} we can follow the arguments in the proof of Theorem 2 to see that, when divided by δ\delta, the term inside the infimum converges to

(36) (𝒮xaif(x,a)+λxf(x,a)qμ(dx))1/qλ(𝒮xf(x,a)qμ(dx))1/q\displaystyle\left(\int_{\mathcal{S}}\left\|\nabla_{x}\nabla_{a_{i}}f(x,a^{\star})+\lambda\nabla_{x}f(x,a^{\star})\right\|^{q}\,\mu(dx)\right)^{1/q}-\lambda\left(\int_{\mathcal{S}}\left\|\nabla_{x}f(x,a^{\star})\right\|^{q}\,\mu(dx)\right)^{1/q}

as $\delta\to 0$. Note that following these arguments requires the following properties, which are a direct consequence of Assumptions 1 and 4:

  • (x,a)f(x,a)(x,a)\mapsto f(x,a) is differentiable on 𝒮o×𝒜o{\mathcal{S}}^{o}\times{\mathcal{A}}^{o},

  • xaif(x,a)x\mapsto\nabla_{a_{i}}f(x,a) is differentiable on 𝒮o{\mathcal{S}}^{o} for every a𝒜a\in\mathcal{A},

  • (x,a)xf(x,a)(x,a)\mapsto\nabla_{x}f(x,a) is continuous,

  • (x,a)xaif(x,a)(x,a)\mapsto\nabla_{x}\nabla_{a_{i}}f(x,a) is continuous,

  • for every r>0r>0 there is c>0c>0 such that |λxf(x,a)|c(1+|x|p1)|\lambda\nabla_{x}f(x,a)|\leq c(1+|x|^{p-1}) for all x𝒮x\in\mathcal{S} and a𝒜a\in\mathcal{A} with |a|r|a|\leq r.

  • for every r>0r>0 there is c>0c>0 such that |xaif(x,a)|c(1+|x|p1)|\nabla_{x}\nabla_{a_{i}}f(x,a)|\leq c(1+|x|^{p-1}) for all x𝒮x\in\mathcal{S} and a𝒜a\in\mathcal{A} with |a|r|a|\leq r.

  • For all δ0\delta\geq 0 sufficiently small we have 𝒜δ\mathcal{A}^{\star}_{\delta}\neq\emptyset and for every sequence (δn)n(\delta_{n})_{n\in\mathbb{N}} such that limnδn=0\lim_{n\to\infty}\delta_{n}=0 and (an)n(a^{\star}_{n})_{n\in\mathbb{N}} such that an𝒜δna^{\star}_{n}\in\mathcal{A}^{\star}_{\delta_{n}} for all nn\in\mathbb{N} there is a subsequence which converges to some a𝒜0a^{\star}\in\mathcal{A}^{\star}_{0}.

Suppose first that $\nabla_{x}\nabla_{a_{i}}f(x,a^{\star})=0$ $\mu$-a.s. Then the right hand side of (31) is equal to zero. Moreover, taking $\lambda=0$ in (36), we also have that $\limsup_{\delta\to 0}-\nabla_{a_{i}}V(0,a^{\star}_{\delta})/\delta\leq 0$, which proves that indeed the left hand side in (31) is smaller than the right hand side.

Now suppose that $\nabla_{x}f(x,a^{\star})\neq 0$ $\mu$-a.s. Then, using the inequality “$\limsup_{\delta}\inf_{\lambda}\leq\inf_{\lambda}\limsup_{\delta}$” and Lemma 30 to compute the last term (noting that $\nabla_{x}f(x,a^{\star})\neq 0$ by assumption), we conclude that indeed

\displaystyle\limsup_{\delta\to 0}\frac{-\nabla_{a_{i}}V(0,a^{\star}_{\delta})}{\delta} \leq\int_{\mathcal{S}}\nabla_{x}\nabla_{a_{i}}f(x,a^{\star})\frac{h(\nabla_{x}f(x,a^{\star}))}{\|\nabla_{x}f(x,a^{\star})\|^{1-q}}\,\mu(dx)
\qquad\cdot\Big{(}\int_{\mathcal{S}}\|\nabla_{x}f(x,a^{\star})\|^{q}\,\mu(dx)\Big{)}^{1/q-1}.

The reverse “$\geq$”-inequality in (31) follows by the very same arguments. Indeed, Lemma 29 implies that

infνBδ(μ,aδ)𝒮aif(x,aδ)ν(dx)0\displaystyle\inf_{\nu\in B_{\delta}^{\star}(\mu,a^{\star}_{\delta})}\int_{\mathcal{S}}\nabla_{a_{i}}f(x,a^{\star}_{\delta})\,\nu(dx)\leq 0

for all i{1,,k}i\in\{1,\dots,k\} and we can write

\displaystyle-\nabla_{a_{i}}V(0,a^{\star}_{\delta})\geq\inf_{\nu\in B_{\delta}^{\star}(\mu,a^{\star}_{\delta})}\int_{\mathcal{S}}\nabla_{a_{i}}f(y,a^{\star}_{\delta})\,\nu(dy)-\int_{\mathcal{S}}\nabla_{a_{i}}f(x,a^{\star}_{\delta})\,\mu(dx)
=infνBδ(μ)supλ𝒮[aif(y,aδ)+λ(f(y,aδ)V(δ))]ν(dy)𝒮aif(x,aδ)μ(dx)\displaystyle=\inf_{\nu\in B_{\delta}(\mu)}\sup_{\lambda\in\mathbb{R}}\int_{\mathcal{S}}\big{[}\nabla_{a_{i}}f(y,a^{\star}_{\delta})+\lambda(f(y,a^{\star}_{\delta})-V(\delta))\big{]}\,\nu(dy)-\int_{\mathcal{S}}\nabla_{a_{i}}f(x,a^{\star}_{\delta})\,\mu(dx)

From here on we argue as in the “\leq”-inequality to conclude that (31) holds.

By assumption the matrix a2V(0,a)\nabla_{a}^{2}V(0,a^{\star}) is invertible. Therefore, in a small neighborhood of aa^{\star}, the mapping aV(0,)\nabla_{a}V(0,\cdot) is invertible. In particular

aδ=(aV(0,))1(aV(0,aδ))anda=(aV(0,))1(0),a^{\star}_{\delta}=(\nabla_{a}V(0,\cdot))^{-1}\left(\nabla_{a}V(0,a^{\star}_{\delta})\right)\quad\text{and}\quad a^{\star}=(\nabla_{a}V(0,\cdot))^{-1}\left(0\right),

where the second equality holds by the first order condition for optimality of aa^{\star}. Applying the chain rule and using (31) gives

limδ0aδaδ\displaystyle\lim_{\delta\to 0}\frac{a^{\star}_{\delta}-a^{\star}}{\delta} =(a2V(0,a))1limδ0aV(0,aδ)δ\displaystyle=(\nabla_{a}^{2}V(0,a^{\star}))^{-1}\cdot\ \lim_{\delta\to 0}\frac{\nabla_{a}V(0,a^{\star}_{\delta})}{\delta}
=(a2V(0,a))1(𝒮xf(z,a)qμ(dz))1/q1\displaystyle=-(\nabla^{2}_{a}V(0,a^{\star}))^{-1}\Big{(}\int_{\mathcal{S}}\|\nabla_{x}f(z,a^{\star})\|^{q}\,\mu(dz)\Big{)}^{1/q-1}
𝒮xaf(x,a)h(xf(x,a))xf(x,a)1qμ(dx).\displaystyle\quad\cdot\int_{\mathcal{S}}\frac{\nabla_{x}\nabla_{a}f(x,a^{\star})h(\nabla_{x}f(x,a^{\star}))}{\|\nabla_{x}f(x,a^{\star})\|^{1-q}}\,\mu(dx).

This completes the proof. ∎