
Sensitivity of multiperiod optimization problems with respect to the adapted Wasserstein distance

Daniel Bartl
Faculty of Mathematics
University of Vienna, Austria
[email protected]
and Johannes Wiesel
Department of Statistics
Columbia University
[email protected]
Abstract.

We analyze the effect of small changes in the underlying probabilistic model on the value of multiperiod stochastic optimization problems and optimal stopping problems. We work in finite discrete time and measure these changes with the adapted Wasserstein distance. We prove explicit first-order approximations for both problems. Expected utility maximization is discussed as a special case.

Keywords: robust multiperiod stochastic optimization, sensitivity analysis, (adapted) Wasserstein distance, optimal stopping
Both authors thank Mathias Beiglböck, Yifan Jiang, and Jan Obłój for helpful discussions and two referees for a careful reading. DB acknowledges support from Austrian Science Fund (FWF) through projects ESP-31N and P34743. JW acknowledges support from NSF grant DMS-2205534.

1. Introduction

Consider a (real-valued) discrete-time stochastic process X=(X_{t})_{t=1}^{T} whose probabilistic behavior is governed by a reference model P. Typically, such a model could describe the evolution of the stochastic process in an idealized probabilistic setting, as is customary in mathematical finance, or it could be derived from historical observations, as is a common assumption in statistics and machine learning. In both cases, one expects P to merely approximate the true but unknown model. Consequently, an important question pertains to the effect that (small) misspecifications of P have on quantities of interest in these areas. In this note, we analyze this question in two fundamental instances: optimal stopping problems and convex multiperiod stochastic optimization problems. For simplicity we focus on the latter in this introduction: consider

v(P):=\inf_{a\text{ admissible control}}E_{P}[f(X,a(X))],

where f\colon\mathbb{R}^{T}\times\mathbb{R}^{T}\to\mathbb{R} is convex in the control variable (i.e., its second argument). The admissible controls are the (uniformly bounded) predictable functions a=(a_{t})_{t=1}^{T}, i.e., a_{t}(X) only depends on X_{1},\dots,X_{t-1}. For concreteness, let us mention that utility maximization—an essential problem in mathematical finance—falls into this framework by setting

f(x,a):=-U\Big{(}g(x)-\sum_{t=1}^{T}a_{t}(x_{t}-x_{t-1})\Big{)},

where U\colon\mathbb{R}\to\mathbb{R} is a concave utility function, g\colon\mathbb{R}^{T}\to\mathbb{R} is a payoff function and X_{0}\in\mathbb{R} is a fixed initial value—see Example 2.6 for more details.

The question of how changes of the model P influence the value v(P) clearly depends on the chosen distance between models. In order to answer it in a generic way (i.e., without restricting to parametric models), one first needs to choose a suitable metric on the laws of stochastic processes \mathcal{P}(\mathbb{R}^{T}). A crucial observation, which has appeared in different contexts and goes back at least to Aldous (1981); Hellwig (1996); Pflug (2010); Pflug and Pichler (2012), is that any metric compatible with weak convergence (and also variants thereof that account for integrability, such as the Wasserstein distance) is too coarse to guarantee continuity of the map P\mapsto v(P) in general. Roughly put, the reason is that two processes can have very similar laws but completely different filtrations, hence completely different sets of admissible controls. This fact has been rediscovered several times during the last decades, and researchers from different fields have introduced modifications of the weak topology that guarantee such continuity properties of P\mapsto v(P); we refer to Backhoff-Veraguas et al. (2020b) for detailed historical references. Strikingly, all these different modifications of the weak topology turn out to be equivalent to the so-called weak adapted topology: this is the coarsest topology that makes multiperiod optimization problems continuous, see Backhoff-Veraguas et al. (2020b); Bartl et al. (2021a).

With the choice of topology settled, the next question pertains to the choice of a suitable distance. This is already relevant in a one-period framework, where the weak and weak adapted topologies coincide. Recent research shows that the Wasserstein distance, which metrizes the weak topology, is (perhaps surprisingly) powerful and versatile. Analogously, the adapted Wasserstein distance \mathcal{AW}_{p} (see Section 2 for the definition) metrizes the weak adapted topology, and Pflug (2010); Backhoff-Veraguas et al. (2020a) show that the multiperiod optimization problems considered in this note are Lipschitz-continuous w.r.t. \mathcal{AW}_{p}. However, the Lipschitz constants depend on global continuity parameters of f and are thus far from being sharp in general. Moreover, the exact computation of the (worst case) value of v(Q), where Q is in a neighbourhood of P, requires solving an infinite-dimensional optimization problem, which does not admit explicit solutions in general. Both of these issues already occur in a one-period setting—despite the results of, e.g., Blanchet and Murthy (2019); Bartl et al. (2019), which relate this infinite-dimensional optimization problem to a simpler dual problem. In conclusion, computing the error

\mathcal{E}(r):=\sup_{\mathcal{AW}_{p}(P,Q)\leq r}v(Q)-v(P)

exactly is only possible in a few (arguably degenerate) cases.

In this note we address both issues by extending ideas of Bartl et al. (2021b); Oblój and Wiesel (2021) from a one-period setting to a multiperiod setting. The key insight of these works is that in a one-period setting, computing first-order approximations for \mathcal{E}(r) is virtually always possible, while obtaining exact expressions might be infeasible in many cases. Our results go hand in hand with those of Bartl et al. (2021b), and we obtain explicit closed-form expressions for \mathcal{E}^{\prime}(0), which have intuitive interpretations. For instance, we show in Theorem 2.4 that, under mild integrability and differentiability assumptions,

\sup_{Q\,:\,\mathcal{AW}_{p}(P,Q)\leq r}v(Q)=v(P)+r\cdot\Big{(}\sum_{t=1}^{T}E_{P}\big{[}\big{|}E_{P}\big{[}\partial_{x_{t}}f(X,a^{\ast}(X))\,\big{|}\,\mathcal{F}_{t}^{X}\big{]}\big{|}^{\frac{p}{p-1}}\big{]}\Big{)}^{\frac{p-1}{p}}+o(r)

holds as r\downarrow 0, where \partial_{x_{t}}f(x,a) is the partial derivative with respect to the t-th coordinate of x, \mathcal{F}^{X}_{t}=\sigma(X_{1},\dots,X_{t}), a^{\ast} is the unique optimizer for v(P) and o denotes the Landau symbol. In the case of utility maximization with p=q=2 (and g\equiv 0 for simplicity), the first-order correction term is essentially the expected quadratic variation of a^{\ast}, not computed under P itself, but distorted by the conditioned Radon-Nikodym density of an equivalent martingale measure w.r.t. P—see Example 2.6 for details.

Investigating robustness of optimization problems in varying formulations is a recurring theme in the optimization literature; we refer to Rahimian and Mehrotra (2019) and the references therein for an overview. In the context of mathematical finance, representing distributional uncertainty through Wasserstein neighbourhoods goes back (at least) to Pflug and Wozabal (2007) and has seen a spike in recent research activity, leading to many impressive developments, see, e.g., the duality results in Gao and Kleywegt (2016); Blanchet and Murthy (2019); Kuhn et al. (2019); Bartl et al. (2019) and applications in mathematical finance Blanchet et al. (2021), machine learning and statistics Shafieezadeh-Abadeh et al. (2019); Blanchet et al. (2020). Our theoretical results are directly linked to Acciaio and Hou (2022); Backhoff et al. (2022), which characterize the speed of convergence between the true and the (modified) empirical measure in the adapted Wasserstein distance, and to new developments on computationally efficient relaxations of optimal transport problems, see Eckstein and Pammer (2022). For completeness, we mention that other notions of distance have been used to model distributional uncertainty, see, e.g., Lam (2016, 2018); Calafiore (2007) in the context of operations research, Huber (2011); Lindsay (1994) in the context of statistics, and Herrmann and Muhle-Karbe (2017); Hobson (1998); Karoui et al. (1998) in the context of mathematical finance.

2. Main results

2.1. Preliminaries

We start by setting up notation. Let T\in\mathbb{N}, let \mathbb{R}^{T} be the path space of a stochastic process in finite discrete time, and let \mathcal{P}_{p}({\mathbb{R}}^{T}) denote the set of all Borel probability measures on {\mathbb{R}}^{T} with finite p-th moment. Throughout this article, X\colon{\mathbb{R}}^{T}\to{\mathbb{R}}^{T} is the identity (i.e., the canonical process) and X,Y\colon{\mathbb{R}}^{T}\times{\mathbb{R}}^{T}\to{\mathbb{R}}^{T} denote the projections onto the first and second coordinate, respectively. The filtration generated by X is denoted by (\mathcal{F}^{X}_{t})_{t=0}^{T}, i.e., \mathcal{F}_{t}^{X}:=\sigma(X_{s}:s\leq t) and {\mathcal{F}}_{0}^{X}:=\{\emptyset,{\mathbb{R}}^{T}\}. Sometimes we write (\mathcal{F}^{X,Y}_{t})_{t=1}^{T} for the filtration generated by the process (X_{t},Y_{t})_{t=1}^{T}.

For a function f\colon{\mathbb{R}}^{T}\times{\mathbb{R}}^{T}\to{\mathbb{R}} we write \partial_{x_{t}}f for the partial derivative of f in the t-th coordinate of x, that is,

\partial_{x_{t}}f(x,a)=\lim_{\varepsilon\downarrow 0}\frac{1}{\varepsilon}(f(x+\varepsilon e_{t},a)-f(x,a)),

where e_{t} is the t-th unit vector; we write \nabla_{a}f for the gradient in a and \nabla_{a}^{2}f for the Hessian in a. We adopt the same notation for functions f\colon{\mathbb{R}}^{T}\to{\mathbb{R}} or f\colon{\mathbb{R}}^{T}\times\{1,\dots,T\}\to{\mathbb{R}} and write \partial_{x_{t}}f for the partial derivative of f in the t-th coordinate of x\in{\mathbb{R}}^{T}. For univariate functions \ell\colon{\mathbb{R}}\to{\mathbb{R}} we simply write \ell^{\prime},\ell^{\prime\prime} for the first and second derivatives.

Definition 2.1.

Let P,Q\in\mathcal{P}_{p}({\mathbb{R}}^{T}). A Borel probability measure \pi on {\mathbb{R}}^{T}\times{\mathbb{R}}^{T} is called a coupling (between P and Q) if its first marginal distribution is P and its second one is Q. A coupling \pi is called causal if

(2.1) \pi((Y_{1},\dots,Y_{t})\in A\,|\,X_{1},\dots,X_{T})=\pi((Y_{1},\dots,Y_{t})\in A\,|\,X_{1},\dots,X_{t})

\pi-almost surely for all Borel sets A\subseteq{\mathbb{R}}^{t} and all 1\leq t\leq T; a causal coupling is called bicausal if (2.1) also holds with the roles of X and Y reversed.

Phrased differently, (2.1) means that under \pi, conditionally on the ‘past’ X_{1},\dots,X_{t}, the ‘future’ X_{t+1},\dots,X_{T} is independent of Y_{1},\dots,Y_{t}; see, e.g., (Bartl et al., 2021a, Lemma 2.2) for this and further equivalent characterizations of (bi-)causality. It is also instructive to analyze condition (2.1) in the case of a Monge coupling, i.e., when there is a transport map \psi\colon{\mathbb{R}}^{T}\to{\mathbb{R}}^{T} such that Y=\psi(X)=(\psi_{t}(X))_{t=1}^{T} \pi-almost surely. Indeed, in this case (2.1) simply means that \psi_{t} needs to be \mathcal{F}^{X}_{t}-measurable.

Fix p\in(1,\infty) and define the adapted Wasserstein distance on \mathcal{P}_{p}({\mathbb{R}}^{T}) by

(2.2) \mathcal{AW}_{p}(P,Q):=\inf_{\pi}\Big{(}\sum_{t=1}^{T}E_{\pi}[|X_{t}-Y_{t}|^{p}]\Big{)}^{1/p},

where the infimum is taken over all bicausal couplings \pi between P and Q. Set

B_{r}(P):=\{Q\in\mathcal{P}_{p}({\mathbb{R}}^{T}):\mathcal{AW}_{p}(P,Q)\leq r\}

for r\geq 0, and denote by q:=p/(p-1) the conjugate Hölder exponent of p.
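
For intuition, the distance (2.2) can be evaluated explicitly for finitely supported laws via its well-known backward (nested) formulation, see, e.g., Pflug and Pichler (2012); Backhoff-Veraguas et al. (2020a): for T=2, one first computes conditional Wasserstein distances at time 2 and then solves an optimal transport problem at time 1 whose cost includes these inner values. The following Python sketch (our own illustration; the function names, the LP-based transport solver and the toy example below are not taken from the paper) implements this for p=2.

import numpy as np
from scipy.optimize import linprog

def ot_cost(pw, qw, cost):
    """Exact discrete optimal transport cost via linear programming."""
    m, n = cost.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0      # row sums equal pw
    for j in range(n):
        A_eq[m + j, j::n] = 1.0               # column sums equal qw
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([pw, qw]),
                  bounds=(0, None), method="highs")
    return res.fun

def disintegrate(law):
    """Split a law {(x1, x2): prob} into its first marginal and the kernels x1 -> law of x2."""
    marg, kernel = {}, {}
    for (x1, x2), w in law.items():
        marg[x1] = marg.get(x1, 0.0) + w
        kernel.setdefault(x1, {})
        kernel[x1][x2] = kernel[x1].get(x2, 0.0) + w
    for x1, k in kernel.items():
        for x2 in k:
            k[x2] /= marg[x1]
    return marg, kernel

def adapted_w2_squared(P, Q):
    """AW_2^2 between two-period finitely supported laws via the nested formulation."""
    margP, kerP = disintegrate(P)
    margQ, kerQ = disintegrate(Q)
    xs, ys = sorted(margP), sorted(margQ)
    outer = np.zeros((len(xs), len(ys)))
    for i, x1 in enumerate(xs):
        for j, y1 in enumerate(ys):
            xs2, ys2 = sorted(kerP[x1]), sorted(kerQ[y1])
            inner = np.array([[(a - b) ** 2 for b in ys2] for a in xs2])
            pw = np.array([kerP[x1][a] for a in xs2])
            qw = np.array([kerQ[y1][b] for b in ys2])
            outer[i, j] = (x1 - y1) ** 2 + ot_cost(pw, qw, inner)
    pw = np.array([margP[x] for x in xs])
    qw = np.array([margQ[y] for y in ys])
    return ot_cost(pw, qw, outer)

# toy two-period laws, given as {(x1, x2): probability}; here the conditional laws
# at time 2 coincide, so the adapted distance is driven by the time-1 marginals only
P = {(0.0, 1.0): 0.5, (0.0, -1.0): 0.5}
Q = {(0.1, 1.0): 0.25, (0.1, -1.0): 0.25, (-0.1, 1.0): 0.25, (-0.1, -1.0): 0.25}
print("AW_2(P, Q) =", adapted_w2_squared(P, Q) ** 0.5)    # 0.1 in this example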

2.2. The uncontrolled case

We are now in a position to state the main results of the paper. We start with a simplified case, where f depends on X only and there are no controls. The sensitivities of the stochastic optimization and optimal stopping problems in Sections 2.3 and 2.4, respectively, can be seen as natural extensions of this result; indeed, the sensitivity computed in Theorem 2.2 already exhibits the structure common to all the results presented here.

Theorem 2.2.

Let f\colon{\mathbb{R}}^{T}\to{\mathbb{R}} be continuously differentiable and assume that there exists c>0 such that

\sum_{s=1}^{T}|\partial_{x_{s}}f(x)|\leq c\Big{(}1+\sum_{s=1}^{T}|x_{s}|^{p-1}\Big{)}

for every x\in{\mathbb{R}}^{T}. Then, as r\downarrow 0,

\sup_{Q\in B_{r}(P)}E_{Q}[f(X)]=E_{P}[f(X)]+r\cdot\Big{(}\sum_{t=1}^{T}E_{P}\Big{[}\big{|}E_{P}[\partial_{x_{t}}f(X)|\mathcal{F}^{X}_{t}]\big{|}^{q}\Big{]}\Big{)}^{1/q}+o(r).
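
Whenever the conditional expectations E_{P}[\partial_{x_{t}}f(X)|\mathcal{F}^{X}_{t}] are accessible, the first-order correction term in Theorem 2.2 is straightforward to estimate by simulation. The following minimal Python sketch does this for p=q=2 in a toy model of our own choosing (not taken from the paper): X_{1}\sim N(0,1), X_{2}=X_{1}+N(0,1) and f(x)=x_{1}x_{2}, for which the conditional expectations are explicit and the correction term equals \sqrt{2}.

import numpy as np

rng = np.random.default_rng(0)
n, r, p = 200_000, 0.05, 2.0
q = p / (p - 1)

# toy reference model P: X1 ~ N(0,1), X2 = X1 + N(0,1)
X1 = rng.standard_normal(n)
X2 = X1 + rng.standard_normal(n)

# f(x) = x1 * x2, so d_{x1} f = x2 and d_{x2} f = x1.  In this model
#   E_P[d_{x1} f | F_1^X] = E_P[X2 | X1] = X1   and   E_P[d_{x2} f | F_2^X] = X1.
F1, F2 = X1, X1

correction = (np.mean(np.abs(F1) ** q) + np.mean(np.abs(F2) ** q)) ** (1 / q)
print("E_P[f(X)]                       ", np.mean(X1 * X2))            # ~ 1
print("first-order correction term     ", correction)                  # ~ sqrt(2)
print("sup over B_r(P), to first order ", np.mean(X1 * X2) + r * correction)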

2.3. Multiperiod stochastic optimization problems

Fix a constant L throughout this section, and denote by \mathcal{A} the set of all predictable controls bounded by L, i.e., every a=(a_{t})_{t=1}^{T}\in\mathcal{A} is such that a_{t}\colon\mathbb{R}^{T}\to\mathbb{R} only depends on x_{1},\dots,x_{t-1} (with the convention that a_{1} is deterministic) and satisfies |a_{t}|\leq L. Recall that

v(Q)=\inf_{a\in\mathcal{A}}E_{Q}[f(X,a(X))],

where f\colon\mathbb{R}^{T}\times\mathbb{R}^{T}\to\mathbb{R} is assumed to be convex in the control variable (i.e., its second argument).

Assumption 2.3.

For every x\in\mathbb{R}^{T}, f(x,\cdot) is twice continuously differentiable and strongly convex in the sense that \nabla^{2}_{a}f(X,\cdot)\succ\varepsilon(X)I on [-L,L]^{T}, where I is the identity matrix and P(\varepsilon(X)>0)=1. (Here, for two T\times T-matrices A and B, we write A\succ B if A-B is positive semidefinite, that is, \langle Az,z\rangle\geq\langle Bz,z\rangle for all z\in{\mathbb{R}}^{T}.)

Moreover, f(\cdot,a) is differentiable for every a\in\mathbb{R}^{T}, its partial derivatives \partial_{x_{t}}f are jointly continuous, and there is a constant c>0 such that

\sum_{s=1}^{T}|\partial_{x_{s}}f(x,a)|\leq c\cdot\Big{(}1+\sum_{s=1}^{T}|x_{s}|^{p-1}\Big{)}

for every x\in\mathbb{R}^{T} and a\in[-L,L]^{T}.

Theorem 2.4.

If Assumption 2.3 holds true, then there exists exactly one a^{\ast}\in\mathcal{A} such that v(P)=E_{P}[f(X,a^{\ast}(X))]. Furthermore, as r\downarrow 0,

\sup_{Q\in B_{r}(P)}v(Q)=v(P)+r\cdot\Big{(}\sum_{t=1}^{T}E_{P}\big{[}|E_{P}[\partial_{x_{t}}f(X,a^{\ast}(X))|\mathcal{F}_{t}^{X}]|^{q}\big{]}\Big{)}^{1/q}+o(r).
Remark 2.5.

The restriction to controls that are uniformly bounded (i.e., satisfy |a_{t}(x)|\leq L) is necessary to guarantee continuity of Q\mapsto v(Q) in general. This can be seen easily in the utility maximization example below—even when restricting to models that satisfy a no-arbitrage condition, see, e.g., (Backhoff-Veraguas et al., 2020a, Remark 5.3).

Example 2.6.

Let \ell\colon\mathbb{R}\to\mathbb{R} be a convex loss function, i.e., \ell is bounded from below and convex. Moreover, let g\colon\mathbb{R}^{T}\to\mathbb{R} be (the negative of) a payoff function and consider the problem

u(P):=\inf_{a\in\mathcal{A}}E_{P}\Big{[}\ell\Big{(}g(X)+\sum_{t=1}^{T}a_{t}(X)(X_{t}-X_{t-1})\Big{)}\Big{]},

where X_{0}\in\mathbb{R} is a fixed value. As discussed in the introduction, u(P) corresponds to the utility maximization problem with payoff g.

Suppose that \ell is twice continuously differentiable with |\ell^{\prime}(u)|\leq c(1+|u|^{p-1}) and \ell^{\prime\prime}>0, that g is continuously differentiable with bounded derivative, and that P(X_{t-1}=X_{t})=0 for all t. Then Assumption 2.3 is satisfied.

The assumption that P(X_{t+1}=X_{t})=0 is used to prove strong convexity in the sense of Assumption 2.3. In the present one-dimensional setting, it simply means that the stock price does not stay constant from time t to t+1 with positive probability. It is satisfied, for instance, if X is a binomial tree under P, or if P has a density with respect to the Lebesgue measure—in particular, if X is a discretized SDE with non-zero volatility. Moreover, the assumption that the derivative of g is bounded can be relaxed at the price of restricting to \ell with a slower growth.

Corollary 2.7.

In the setting of Example 2.6: let a^{\ast} be the unique optimizer for u(P), set a^{\ast}_{T+1}:=0,

Z:=g(X)+\sum_{t=1}^{T}a^{\ast}_{t}(X_{t}-X_{t-1}),

and

V:=\Big{(}\sum_{t=1}^{T}E_{P}\Big{[}\Big{|}(a^{\ast}_{t+1}-a^{\ast}_{t})\cdot E_{P}\big{[}\ell^{\prime}(Z)|\mathcal{F}_{t}^{X}\big{]}-E_{P}\big{[}\ell^{\prime}(Z)\partial_{x_{t}}g(X)|\mathcal{F}_{t}^{X}\big{]}\Big{|}^{q}\Big{]}\Big{)}^{1/q}.

Then

\sup_{Q\in B_{r}(P)}u(Q)=u(P)+r\cdot V+o(r)\qquad\text{as }r\downarrow 0.

Note that for p=q=2 and g=0, V is essentially the expected quadratic variation of a^{\ast}, not computed under P itself, but distorted by the conditioned Radon-Nikodym density of an equivalent martingale measure w.r.t. P.
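
To make this remark precise (this is our reading of it, under the additional assumptions stated below, not a verbatim statement from the paper), specializing the definition of V to p=q=2 and g\equiv 0 gives

V=\Big{(}\sum_{t=1}^{T}E_{P}\big{[}(a^{\ast}_{t+1}-a^{\ast}_{t})^{2}\,E_{P}[\ell^{\prime}(Z)|\mathcal{F}_{t}^{X}]^{2}\big{]}\Big{)}^{1/2}.

If, in addition, \ell^{\prime}>0 and the optimizer a^{\ast} is interior, so that the first-order conditions E_{P}[\ell^{\prime}(Z)(X_{t}-X_{t-1})|\mathcal{F}_{t-1}^{X}]=0 hold, then, at least formally, D:=\ell^{\prime}(Z)/E_{P}[\ell^{\prime}(Z)] is the density of an equivalent martingale measure and

V=E_{P}[\ell^{\prime}(Z)]\Big{(}\sum_{t=1}^{T}E_{P}\big{[}(a^{\ast}_{t+1}-a^{\ast}_{t})^{2}\,E_{P}[D|\mathcal{F}_{t}^{X}]^{2}\big{]}\Big{)}^{1/2},

i.e., the quadratic variation of a^{\ast} with each squared increment weighted by E_{P}[D|\mathcal{F}_{t}^{X}]^{2}.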

2.4. Optimal stopping problems

Let f\colon{\mathbb{R}}^{T}\times\{1,\dots,T\}\to\mathbb{R} be such that f(X,t) is {\mathcal{F}}_{t}^{X}-measurable for t=1,\dots,T and consider

s(Q):=\inf_{\tau\in\mathrm{ST}}E_{Q}[f(X,\tau)],

where \mathrm{ST} refers to the set of all bounded stopping times with respect to the canonical filtration, i.e., \tau\in\mathrm{ST} if \tau\colon{\mathbb{R}}^{T}\to\{1,\dots,T\} is such that \{\tau\leq t\}\in\mathcal{F}_{t}^{X} for every t.

Theorem 2.8.

Assume that f(\cdot,t) is continuously differentiable for every t=1,\dots,T and that there is a constant c>0 such that

\sum_{s=1}^{T}|\partial_{x_{s}}f(x,t)|\leq c\Big{(}1+\sum_{s=1}^{T}|x_{s}|^{p-1}\Big{)}

for every x\in{\mathbb{R}}^{T} and t=1,\dots,T. Furthermore, assume that there exists exactly one optimal stopping time \tau^{\ast} for s(P). Then, as r\downarrow 0,

\sup_{Q\in B_{r}(P)}s(Q)=s(P)+r\cdot\left(\sum_{t=1}^{T}E_{P}\left[\left|E_{P}\left[\partial_{x_{t}}f(X,\tau^{\ast})|{\mathcal{F}}_{t}^{X}\right]\right|^{q}\right]\right)^{1/q}+o(r).
Example 2.9.

It is instructive to consider Theorem 2.8 in the special case where f is Markovian, i.e., there is a function g\colon{\mathbb{R}}\to{\mathbb{R}} such that f(x,t)=g(x_{t}) for all t and x. Indeed, in this case, the first-order correction term simplifies to E_{P}[|g^{\prime}(X_{\tau^{\ast}})|^{q}]^{1/q}.
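
The computation behind this simplification is short: since \partial_{x_{t}}f(x,s)=g^{\prime}(x_{t})\mathbf{1}_{\{s=t\}} and \{\tau^{\ast}=t\}\in\mathcal{F}_{t}^{X},

\sum_{t=1}^{T}E_{P}\big{[}\big{|}E_{P}[\partial_{x_{t}}f(X,\tau^{\ast})|\mathcal{F}_{t}^{X}]\big{|}^{q}\big{]}=\sum_{t=1}^{T}E_{P}\big{[}|g^{\prime}(X_{t})|^{q}\mathbf{1}_{\{\tau^{\ast}=t\}}\big{]}=E_{P}\big{[}|g^{\prime}(X_{\tau^{\ast}})|^{q}\big{]}.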

2.5. Extensions and open questions

To the best of our knowledge, this is the first work addressing the nonparametric sensitivity of multiperiod optimization problems (w.r.t. the adapted Wasserstein distance). Below we identify possible extensions, which are outside the scope of the current article. We plan to address these in future work.

(1) Our results extend to {\mathbb{R}}^{d}-valued stochastic processes and \mathcal{AW}_{p} defined with an arbitrary norm on {\mathbb{R}}^{d}. Similarly, the set of predictable controls in Theorem 2.4 can be chosen to be any compact convex subset of {\mathbb{R}}^{d}. The modifications needed are in line with Bartl et al. (2021b).

(2) A natural extension of our results from a financial perspective would be the analysis of sensitivities for robust option pricing: let P be a martingale law (i.e., X is an (\mathcal{F}_{t}^{X})_{t=1}^{T}-martingale under P) and consider

\sup_{Q\in B_{r}(P):\,Q\text{ is a martingale law}}E_{Q}[f(X)].

In one-period models (T=1) this was carried out in Bartl et al. (2021b); Nendel and Sgarabottolo (2022). In a similar manner, it is natural to analyze the sensitivity of robust American option pricing by restricting to martingale laws in Theorem 2.8.

(3) There are certain natural examples for f that do not satisfy our regularity assumptions, e.g., in mathematical finance. In a one-period framework, the regularity of f can be relaxed systematically, see Bartl et al. (2021b); Nendel and Sgarabottolo (2022), and it is interesting to investigate whether this is the case here as well.

(4) In some examples, the restriction to bounded controls is automatic, see, e.g., Rásonyi and Stettner (2005). For instance, in the setting of Example 2.6 with g=0, we suspect that arguments similar to those used in Rásonyi and Stettner (2005) might show that a “conditional full support condition” on P is sufficient to obtain the first-order approximation with unbounded strategies.

(5) We suspect that the assumption on the uniqueness of the optimizer in Theorems 2.4 and 2.8 can be relaxed. Indeed, at least in a two-period setting, modifications of the arguments presented here can cover the general case in Theorem 2.8.

(6) Motivated by the literature on distributionally robust optimization problems cited in the introduction, one could also consider min-max problems of the form

\inf_{a\in\mathcal{A}}\sup_{Q\in B_{r}(P)}E_{Q}[f(X,a(X))].

An important observation is that most arguments in the analysis of such problems (in the one-period setting) heavily rely on (convexity and) compactness of B_{r}(P); both properties fail to hold true in multiple periods. It was recently shown in Bartl et al. (2021a) that these can be recovered by passing to an appropriate factor space of processes together with general filtrations.

(7) The present methods can be extended to cover functionals that depend not only on P but also on its disintegrations—as is common in weak optimal transport (see, e.g., Gozlan et al. (2017)). As an example, consider T=2 and J(Q):=E_{Q}[f(X_{1})+g(Q_{X_{1}})], where the functions f and g are suitably (Fréchet) differentiable. Using the same arguments as in the proof of Theorem 2.2, one can show that the first-order correction term equals (E_{P}[|f^{\prime}(X_{1})|^{q}+|E_{P}[g^{\prime}(P_{X_{1}})]|^{q}])^{1/q}. (When completing a first draft of this paper, we learned that similar results have been established by Jiang in independent research.)

3. Proofs

3.1. Proof of Theorem 2.2

We need the following technical lemma, which essentially states that causal couplings can be approximated by bicausal ones with similar marginals. For a Borel probability measure \pi on {\mathbb{R}}^{T}\times{\mathbb{R}}^{T} and a Borel mapping \phi from {\mathbb{R}}^{T}\times{\mathbb{R}}^{T} to another Polish space, \phi_{\ast}\pi denotes the push-forward of the measure \pi under \phi.

Lemma 3.1.

Let P,Q\in\mathcal{P}_{p}({\mathbb{R}}^{T}) and let \pi be a causal coupling between P and Q. For each \delta>0 there exists Y^{\delta}\colon{\mathbb{R}}^{T}\times{\mathbb{R}}^{T}\to{\mathbb{R}}^{T} such that Y^{\delta}_{t} is {\mathcal{F}}^{X,Y}_{t}-measurable, X_{t} is \sigma(Y^{\delta}_{t})-measurable, and |Y_{t}^{\delta}-Y_{t}|\leq\delta for every 1\leq t\leq T.

In particular, \pi^{\delta}:=(X,Y^{\delta})_{\ast}\pi is a bicausal coupling between P and Q^{\delta}:=Y^{\delta}_{\ast}\pi.

Proof.

For \delta>0 we consider the Borel mappings

\psi_{\delta}\colon\mathbb{R}\to(0,\delta)\quad\text{and}\quad\phi_{\delta}\colon\mathbb{R}\to\delta\mathbb{Z}:=\{\delta k:k\in\mathbb{Z}\},

where \psi_{\delta} is a (Borel-)isomorphism and \phi_{\delta}(x):=\max\{\delta k:\delta k\leq x\}. For t=1,\dots,T, set

Y_{t}^{\delta}:=\phi_{\delta}(Y_{t})+\psi_{\delta}(X_{t}).

By definition X_{t} is \sigma(Y_{t}^{\delta})-measurable, Y^{\delta}_{t} is \mathcal{F}^{X,Y}_{t}-measurable, and |Y_{t}^{\delta}-Y_{t}|\leq\delta. It remains to note that the bicausality constraints (2.1) are satisfied: causality of \pi^{\delta} is inherited from the causality of \pi because each Y^{\delta}_{t} is \mathcal{F}^{X,Y}_{t}-measurable, while the reversed condition holds because X_{1},\dots,X_{t} are already determined by Y^{\delta}_{1},\dots,Y^{\delta}_{t}. ∎
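
The construction is easy to experiment with numerically. In the following Python sketch we pick, purely for illustration, a scaled logistic map as the Borel isomorphism \psi_{\delta} (any Borel isomorphism onto (0,\delta) works) and verify both properties of the lemma on a few simulated paths; none of the concrete choices below appear in the paper.

import numpy as np

delta = 0.1

def psi(x):
    """A concrete Borel isomorphism R -> (0, delta): a scaled logistic map."""
    return delta / (1.0 + np.exp(-x))

def psi_inv(u):
    """Inverse of psi, mapping (0, delta) back to R."""
    return np.log(u / (delta - u))

def phi(y):
    """Round down to the grid delta * Z."""
    return delta * np.floor(y / delta)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                 # five sample paths with T = 3
Y = X + 0.3 * rng.standard_normal((5, 3))       # paths of the second process

Y_delta = phi(Y) + psi(X)

# |Y_t^delta - Y_t| <= delta for all paths and times ...
print(bool(np.all(np.abs(Y_delta - Y) <= delta)))
# ... and X_t can be read off from Y_t^delta alone (sigma(Y_t^delta)-measurability of X_t)
X_recovered = psi_inv(Y_delta - phi(Y_delta))
print(float(np.max(np.abs(X_recovered - X))))   # numerically zero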

The following proof serves as a blueprint for the proofs of Theorems 2.4 and 2.8.

Proof of Theorem 2.2.

To simplify notation, set

F_{t}:=E_{P}[\partial_{x_{t}}f(X)|\mathcal{F}^{X}_{t}]\quad\text{for }t=1,\dots,T.

We first prove the upper bound, that is

(3.1) \limsup_{r\to 0}\frac{1}{r}\Big{(}\sup_{Q\in B_{r}(P)}E_{Q}[f(X)]-E_{P}[f(X)]\Big{)}\leq\Big{(}\sum_{t=1}^{T}E_{P}[|F_{t}|^{q}]\Big{)}^{1/q}.

To that end, for any r>0, let Q=Q^{r}\in B_{r}(P) be such that

E_{Q}[f(X)]\geq\sup_{R\in B_{r}(P)}E_{R}[f(X)]-o(r),

and let \pi=\pi^{r} be an (almost) optimal bicausal coupling between P and Q, i.e.,

\Big{(}\sum_{t=1}^{T}E_{\pi}[|X_{t}-Y_{t}|^{p}]\Big{)}^{1/p}\leq\mathcal{AW}_{p}(P,Q)+o(r)\leq r+o(r).

The fundamental theorem of calculus and Fubini’s theorem imply

E_{Q}[f(X)]-E_{P}[f(X)]=E_{\pi}[f(Y)-f(X)]=\sum_{t=1}^{T}\int_{0}^{1}E_{\pi}[\partial_{x_{t}}f(X+\lambda(Y-X))\cdot(Y_{t}-X_{t})]\,d\lambda.

Moreover, by the tower property and Hölder’s inequality,

E_{\pi}[\partial_{x_{t}}f(X+\lambda(Y-X))\cdot(Y_{t}-X_{t})]=E_{\pi}\Big{[}E_{\pi}[\partial_{x_{t}}f(X+\lambda(Y-X))|\mathcal{F}_{t}^{X,Y}]\cdot(Y_{t}-X_{t})\Big{]}\leq E_{\pi}\Big{[}|E_{\pi}[\partial_{x_{t}}f(X+\lambda(Y-X))|\mathcal{F}_{t}^{X,Y}]|^{q}\Big{]}^{1/q}\cdot E_{\pi}[|Y_{t}-X_{t}|^{p}]^{1/p}.

We next claim that, for every \lambda\in[0,1],

(3.2) E_{\pi}\Big{[}\Big{|}E_{\pi}[\partial_{x_{t}}f(X+\lambda(Y-X))|\mathcal{F}_{t}^{X,Y}]\Big{|}^{q}\Big{]}^{1/q}\to E_{P}[|F_{t}|^{q}]^{1/q}

as r\to 0. Indeed, since \pi is bicausal, we have that

E_{\pi}[\partial_{x_{t}}f(X)|\mathcal{F}_{t}^{X,Y}]=E_{P}[\partial_{x_{t}}f(X)|\mathcal{F}_{t}^{X}]=F_{t}

\pi-almost surely, see, e.g., (Bartl et al., 2021a, Lemma 2.2). Therefore, Jensen's inequality shows that

E_{\pi}\Big{[}\Big{|}E_{\pi}[\partial_{x_{t}}f(X+\lambda(Y-X))|\mathcal{F}_{t}^{X,Y}]-F_{t}\Big{|}^{q}\Big{]}\leq E_{\pi}\Big{[}|\partial_{x_{t}}f(X+\lambda(Y-X))-\partial_{x_{t}}f(X)|^{q}\Big{]},

which converges to zero; this follows from the continuity of \partial_{x_{t}}f, since \sum_{t=1}^{T}E_{\pi}[|X_{t}-Y_{t}|^{p}]\to 0, and since |\partial_{x_{t}}f(x)|^{q}\leq\tilde{c}(1+\sum_{s=1}^{T}|x_{s}|^{p}) by the growth assumption and q(p-1)=p. Then (3.2) follows from the triangle inequality.

We conclude that

E_{Q}[f(X)]-E_{P}[f(X)]\leq\sum_{t=1}^{T}\Big{(}E_{P}[|F_{t}|^{q}]^{1/q}+o(1)\Big{)}E_{\pi}[|Y_{t}-X_{t}|^{p}]^{1/p}\leq\Big{(}\sum_{t=1}^{T}E_{P}[|F_{t}|^{q}]+o(1)\Big{)}^{1/q}\Big{(}\sum_{t=1}^{T}E_{\pi}[|Y_{t}-X_{t}|^{p}]\Big{)}^{1/p},

where the second inequality follows from Hölder's inequality between \ell^{q}(\mathbb{R}^{T}) and \ell^{p}(\mathbb{R}^{T}). Recalling that \pi is an almost optimal bicausal coupling between P and Q and that \mathcal{AW}_{p}(P,Q)\leq r, this shows (3.1).

It remains to prove the lower bound, that is,

(3.3) \liminf_{r\to 0}\frac{1}{r}\Big{(}\sup_{Q\in B_{r}(P)}E_{Q}[f(X)]-E_{P}[f(X)]\Big{)}\geq\Big{(}\sum_{t=1}^{T}E_{P}[|F_{t}|^{q}]\Big{)}^{1/q}.

To that end, we first use the duality between \|\cdot\|_{\ell^{q}(\mathbb{R}^{T})} and \|\cdot\|_{\ell^{p}(\mathbb{R}^{T})}, which yields the existence of a\in[0,\infty)^{T} satisfying

\Big{(}\sum_{t=1}^{T}E_{P}[|F_{t}|^{q}]\Big{)}^{1/q}=\sum_{t=1}^{T}E_{P}[|F_{t}|^{q}]^{1/q}a_{t}\quad\text{ and }\quad\sum_{t=1}^{T}a_{t}^{p}=1.

Next we use the duality between \|\cdot\|_{L^{q}(P)} and \|\cdot\|_{L^{p}(P)}, which yields the existence of random variables (Z_{t})_{t=1}^{T} satisfying

E_{P}[|F_{t}|^{q}]^{1/q}a_{t}=E_{P}[F_{t}Z_{t}]\quad\text{and}\quad E_{P}[|Z_{t}|^{p}]^{1/p}=a_{t}

for t=1,\dots,T. Combining both results,

(3.4) \sum_{t=1}^{T}E_{P}[F_{t}Z_{t}]=\Big{(}\sum_{t=1}^{T}E_{P}[|F_{t}|^{q}]\Big{)}^{1/q}\quad\text{and}\quad\sum_{t=1}^{T}E_{P}[|Z_{t}|^{p}]=1.

Note that, since F_{t} is \mathcal{F}^{X}_{t}-measurable, Z_{t} can be chosen \mathcal{F}^{X}_{t}-measurable as well.
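
For concreteness, if not all of the F_{t} vanish P-almost surely (otherwise the right-hand side of (3.3) is zero and the lower bound is trivial, since P\in B_{r}(P)), one admissible choice is the standard L^{p}-L^{q} duality element

Z_{t}:=\frac{\operatorname{sign}(F_{t})\,|F_{t}|^{q-1}}{\big{(}\sum_{s=1}^{T}E_{P}[|F_{s}|^{q}]\big{)}^{1/p}},\qquad t=1,\dots,T,

for which both identities in (3.4) follow by a direct computation using p(q-1)=q.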

At this point, for fixed r>0, we would like to define Q^{r} as the law of X+rZ and \pi=\pi^{r} as the law of (X,X+rZ). Since Z_{t} is \mathcal{F}^{X}_{t}-measurable, \pi is clearly causal. Unfortunately, however, it does not need to be bicausal in general. We thus first apply Lemma 3.1 to P and Q=Q^{r} with \delta=1/n, which yields measures Q^{n}=Q^{r,n} and processes Y^{n}=Y^{r,n} which satisfy the assertion of Lemma 3.1.

Now fix r>0. Since

\mathcal{AW}_{p}(P,Q^{n})\leq\Big{(}\sum_{t=1}^{T}E_{\pi}[|X_{t}-Y_{t}^{n}|^{p}]\Big{)}^{1/p}\leq\Big{(}\sum_{t=1}^{T}(ra_{t})^{p}\Big{)}^{1/p}+\frac{T}{n}=r+\frac{T}{n},

we have that, for every \varepsilon>0,

\sup_{R\in B_{r+\varepsilon}(P)}E_{R}[f(X)]-E_{P}[f(X)]\geq\lim_{n\to\infty}E_{\pi}[f(Y^{n})-f(X)].

Using the fundamental theorem of calculus and Fubini's theorem as before, the fact that Y^{n}_{t} is \mathcal{F}^{X}_{t}-measurable shows that

E_{P}[f(Y^{n})-f(X)]=\sum_{t=1}^{T}\int_{0}^{1}E_{P}\Big{[}\partial_{x_{t}}f(X+\lambda(Y^{n}-X))\cdot(Y^{n}_{t}-X_{t})\Big{]}\,d\lambda=\sum_{t=1}^{T}\int_{0}^{1}E_{P}\Big{[}E_{P}[\partial_{x_{t}}f(X+\lambda(Y^{n}-X))|{\mathcal{F}}_{t}^{X}]\cdot(Y^{n}_{t}-X_{t})\Big{]}\,d\lambda\to\sum_{t=1}^{T}\int_{0}^{1}E_{P}\Big{[}E_{P}[\partial_{x_{t}}f(X+\lambda rZ)|{\mathcal{F}}_{t}^{X}]\cdot rZ_{t}\Big{]}\,d\lambda

as n\to\infty, by the growth assumption and since Y^{n}_{t}-X_{t}\to rZ_{t} in L^{p}(P).

In a final step we let r\to 0. Applying the previous step to \varepsilon=o(r) shows that

\liminf_{r\to 0}\frac{1}{r}\Big{(}\sup_{R\in B_{r}(P)}E_{R}[f(X)]-E_{P}[f(X)]\Big{)}\geq\liminf_{r\to 0}\sum_{t=1}^{T}\int_{0}^{1}E_{P}\big{[}E_{P}[\partial_{x_{t}}f(X+\lambda rZ)|{\mathcal{F}}_{t}^{X}]\cdot Z_{t}\big{]}\,d\lambda=\sum_{t=1}^{T}E_{P}\big{[}E_{P}[\partial_{x_{t}}f(X)|{\mathcal{F}}_{t}^{X}]\cdot Z_{t}\big{]},

where the equality holds by the growth assumption on |\partial_{x_{t}}f|. Recalling the choice of (Z_{t})_{t=1}^{T} (see (3.4)) completes the proof. ∎

3.2. Proof of Theorem 2.4

The proof of Theorem 2.4 has a similar structure to the proof of Theorem 2.2, but some additional arguments are needed to take care of the optimization over a\in\mathcal{A}. Throughout, we work under Assumption 2.3. We start with two auxiliary results.

Lemma 3.2.

Let (Q_{n})_{n\in\mathbb{N}} be such that \mathcal{AW}_{p}(P,Q_{n})\to 0 as n\to\infty. Then v(Q_{n})\to v(P).

Proof.

Let Q\in\mathcal{P}_{p}({\mathbb{R}}^{T}) and let \pi be a bicausal coupling between P and Q. Let \varepsilon>0 be arbitrary and fix a\in\mathcal{A} that satisfies E_{P}[f(X,a(X))]\leq v(P)+\varepsilon. Next define b by

b_{t}:=E_{\pi}[a_{t}(X)|{\mathcal{F}}_{T}^{Y}]\qquad\text{for }t=1,\dots,T.

By bicausality, b_{t} is actually measurable with respect to {\mathcal{F}}_{t-1}^{Y} (see, e.g., (Bartl et al., 2021a, Lemma 2.2)) and clearly |b_{t}|\leq L; thus b\in\mathcal{A}. Moreover, convexity of f(x,\cdot) implies that

v(Q)\leq E_{Q}[f(Y,b(Y))]=E_{\pi}[f(Y,E_{\pi}[a(X)|{\mathcal{F}}_{T}^{Y}])]\leq E_{\pi}[f(Y,a(X))].

The fundamental theorem of calculus and Hölder’s inequality yield

E_{\pi}[f(Y,a(X))]-E_{P}[f(X,a(X))]=\int_{0}^{1}\sum_{t=1}^{T}E_{\pi}\Big{[}\partial_{x_{t}}f(X+\lambda(Y-X),a(X))(Y_{t}-X_{t})\Big{]}\,d\lambda\leq\int_{0}^{1}\sum_{t=1}^{T}E_{\pi}[|\partial_{x_{t}}f(X+\lambda(Y-X),a(X))|^{q}]^{1/q}E_{\pi}[|Y_{t}-X_{t}|^{p}]^{1/p}\,d\lambda.

Using the growth assumption and arguing as in the proof of Theorem 2.2, the last term is at most of order \mathcal{AW}_{p}(P,Q). As \varepsilon was arbitrary, this shows v(Q)-v(P)\leq O(\mathcal{AW}_{p}(P,Q)) (where again O denotes the Landau symbol), and reversing the roles of P and Q completes the proof. ∎

Lemma 3.3.

There exists exactly one a^{\ast}\in\mathcal{A} such that v(P)=E_{P}[f(X,a^{\ast}(X))].

Proof.

This is a standard result. The existence follows from Komlós' lemma, see Komlós (1967), and the uniqueness from strict convexity. ∎

Proof of Theorem 2.4.

Let a^{\ast}\in\mathcal{A} be the unique optimizer of v(P) (see Lemma 3.3) and, for shorthand notation, set F_{t}:=E_{P}[\partial_{x_{t}}f(X,a^{\ast}(X))|\mathcal{F}_{t}^{X}] for t=1,\dots,T.

We first prove the upper bound. We claim that it follows from combining the reasoning in the proof of Theorem 2.2 and Lemma 3.2. Indeed, let Q^{r}\in B_{r}(P) be such that

v(Q^{r})\geq\sup_{Q\in B_{r}(P)}v(Q)-o(r)

and let \pi=\pi^{r} be a bicausal coupling between P and Q^{r} that is (almost) optimal for \mathcal{AW}_{p}(P,Q^{r}). Define b^{r}\in\mathcal{A} by b^{r}:=E_{\pi}[a^{\ast}(X)|{\mathcal{F}}_{T}^{Y}] and use convexity of f(x,\cdot) to conclude that

v(Q^{r})\leq E_{\pi}[f(Y,b^{r}(Y))]\leq E_{\pi}[f(Y,a^{\ast}(X))].

From here on, it follows from the fundamental theorem of calculus and Hölder’s inequality just as in the proof of Theorem 2.2 that

v(Q^{r})-v(P)\leq\sum_{t=1}^{T}E_{\pi}[\partial_{x_{t}}f(X,a^{\ast}(X))\cdot(Y_{t}-X_{t})]+o(r)\leq r\Big{(}\sum_{t=1}^{T}E_{\pi}[|F_{t}|^{q}]\Big{)}^{1/q}+o(r).

This completes the proof of the upper bound.

We proceed with the lower bound. To that end, we start with the same construction as in the proof of Theorem 2.2: let \pi=\pi^{r} be the law of (X,X+rZ), and let Q^{r} denote its second marginal, where Z satisfies (3.4), that is, Z_{t} is \mathcal{F}_{t}^{X}-measurable for every t and

\sum_{t=1}^{T}E_{P}[|Z_{t}|^{p}]\leq 1\quad\text{and}\quad\Big{(}\sum_{t=1}^{T}E_{P}[|F_{t}|^{q}]\Big{)}^{1/q}=\sum_{t=1}^{T}E_{P}[F_{t}Z_{t}].

Again, \pi might only be causal and not bicausal, and we would need to rely on Lemma 3.1. For the sake of a clearer presentation, we omit this step here.

For each r, let a^{r}\in\mathcal{A} be almost optimal for v(Q^{r}), that is,

E_{Q^{r}}[f(Y,a^{r}(Y))]\leq v(Q^{r})+o(r).

Observe that, by construction of Q^{r} (i.e., since Z_{t} is \mathcal{F}_{t}^{X}-measurable for each t), there is b^{r}\in\mathcal{A} such that \pi(a^{r}(Y)=b^{r}(X))=1.

Now let (r_{n})_{n} be an arbitrary sequence that converges to zero. By Lemma 3.4 below, after passing to a subsequence (r_{n_{k}})_{k}, b^{r_{n_{k}}}(X) converges to a^{\ast}(X) P-almost surely. Since

v(P)\leq E_{P}[f(X,b^{r_{n_{k}}}(X))]

for all k, the fundamental theorem of calculus and the growth assumption imply

v(Q^{r_{n_{k}}})-v(P)\geq E_{P}[f(X+r_{n_{k}}Z,b^{r_{n_{k}}}(X))-f(X,b^{r_{n_{k}}}(X))]-o(r_{n_{k}})\geq\sum_{t=1}^{T}E_{P}[E_{P}[\partial_{x_{t}}f(X,b^{r_{n_{k}}}(X))|\mathcal{F}_{t}^{X}]\cdot r_{n_{k}}Z_{t}]-o(r_{n_{k}}).

Since b^{r_{n_{k}}}(X)\to a^{\ast}(X) P-almost surely, the continuity of \partial_{x_{t}}f and the growth assumption imply that

\liminf_{k\to\infty}\frac{v(Q^{r_{n_{k}}})-v(P)}{r_{n_{k}}}\geq\sum_{t=1}^{T}E_{P}[E_{P}[\partial_{x_{t}}f(X,a^{\ast}(X))|\mathcal{F}_{t}^{X}]\cdot Z_{t}].

To complete the proof, it remains to recall the choice of Z and that (r_{n})_{n} was an arbitrary sequence. ∎

Lemma 3.4.

In the setting of the proof of Theorem 2.4: there exists a subsequence (r_{n_{k}})_{k} such that b^{r_{n_{k}}}(X)\to a^{\ast}(X) P-almost surely.

Proof.

Recall that b^{r_{n}} was chosen almost optimally for v(Q^{r_{n}}), hence

v(Q^{r_{n}})\geq E_{P}[f(X+r_{n}Z,b^{r_{n}}(X))]-o(r_{n})\geq E_{P}[f(X,b^{r_{n}}(X))]-O(r_{n}),

where the last inequality holds by continuity and the growth assumptions on f, see the proof of Theorem 2.4. Next recall that \nabla_{a}^{2}f(X,a)\succ\varepsilon(X)I for a\in[-L,L]^{T} with P(\varepsilon(X)>0)=1. In particular, a second-order Taylor expansion shows that

E_{P}[f(X,b^{r_{n}}(X))]-E_{P}[f(X,a^{\ast}(X))]\geq E_{P}\Big{[}\langle\nabla_{a}f(X,a^{\ast}(X)),b^{r_{n}}(X)-a^{\ast}(X)\rangle\Big{]}+E_{P}\Big{[}\frac{\varepsilon(X)}{2}\|b^{r_{n}}(X)-a^{\ast}(X)\|_{\ell^{2}({\mathbb{R}}^{T})}^{2}\Big{]}.

The first term is non-negative by optimality of a^{\ast}. Thus, since v(Q^{r_{n}})\to v(P) by Lemma 3.2, this implies that the second term must converge to zero. As \varepsilon is strictly positive, this can only happen if b^{r_{n}}(X)\to a^{\ast}(X) in P-probability. Hence, after passing to a subsequence, b^{r_{n}}(X)\to a^{\ast}(X) P-almost surely. ∎

3.3. Proof of Corollary 2.7

For shorthand notation, set

(a\cdot x)_{T}:=\sum_{t=1}^{T}a_{t}(x_{t}-x_{t-1}).

The goal is to apply Theorem 2.4 to the function

f(x,a):=\ell(g(x)+(a\cdot x)_{T})

for (x,a)\in{\mathbb{R}}^{T}\times\mathbb{R}^{T}. To that end, we start by checking Assumption 2.3. Since g is continuously differentiable and \ell is twice continuously differentiable, the parts of Assumption 2.3 pertaining to the differentiability of f hold true. Moreover,

\langle\nabla^{2}_{a}f(x,a)u,u\rangle=\ell^{\prime\prime}(g(x)+(a\cdot x)_{T})\sum_{t=1}^{T}u_{t}^{2}(x_{t}-x_{t-1})^{2}

for any u\in\mathbb{R}^{T}. Since \ell^{\prime\prime}>0 and P(X_{t}=X_{t-1})=0 for every t by assumption, one can readily verify that there is \varepsilon(X) with P(\varepsilon(X)>0)=1 such that

\nabla^{2}_{a}f(X,\cdot)\succ\varepsilon(X)I\qquad\text{on }[-L,L]^{T}.

Next observe that

\partial_{x_{t}}f(x,a)=\ell^{\prime}(g(x)+(a\cdot x)_{T})(\partial_{x_{t}}g(x)+(a_{t}-a_{t+1})).

A quick computation involving the boundedness of g^{\prime} and the growth assumption on \ell^{\prime} shows that

|\partial_{x_{t}}f(x,a)|\leq c\Big{(}1+\sum_{s=1}^{T}|x_{s}|^{p-1}\Big{)}\qquad\text{ for all }x\in\mathbb{R}^{T}\text{ and }a\in[-L,L]^{T}.

In particular, Assumption 2.3 is satisfied, and the proof follows by applying Theorem 2.4. ∎
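
For concreteness, the following Python sketch evaluates u(P) and the first-order correction V of Corollary 2.7 (with p=q=2) in a two-period binomial model. All concrete choices (the tree parameters, the quadratic loss \ell(z)=z^{2}, the payoff g(x)=x_{2}, and the unconstrained least-squares computation of a^{\ast}, for which we assume the bound L is large enough not to bind) are our own illustrations and do not appear in the paper.

import numpy as np

X0, u, d, pu = 1.0, 1.2, 0.9, 0.55
r = 0.05                       # radius of the adapted Wasserstein ball

# the four paths (x1, x2) of the binomial tree and their probabilities under P
paths = np.array([(X0 * a, X0 * a * b) for a in (u, d) for b in (u, d)])
probs = np.array([pu * pu, pu * (1 - pu), (1 - pu) * pu, (1 - pu) * (1 - pu)])
x1, x2 = paths[:, 0], paths[:, 1]

# predictable controls: a1 is deterministic, a2 depends on x1 (up or down)
up1 = x1 > X0
A = np.column_stack([x1 - X0, (x2 - x1) * up1, (x2 - x1) * ~up1])

# quadratic loss l(z) = z^2 and payoff g(x) = x2:
#   u(P) = min_a E_P[(x2 + a1 (x1 - X0) + a2(x1) (x2 - x1))^2],
# a weighted least-squares problem that we solve exactly below
w = np.sqrt(probs)
theta, *_ = np.linalg.lstsq(w[:, None] * A, -w * x2, rcond=None)
a1, a2_up, a2_down = theta
a2 = np.where(up1, a2_up, a2_down)

Z = x2 + A @ theta
uP = np.sum(probs * Z ** 2)
lprime = 2.0 * Z               # l'(z) = 2z

# E_P[l'(Z) | F_1^X] is the conditional average over the second-period branch
cond = np.where(up1,
                np.sum(probs * lprime * up1) / np.sum(probs * up1),
                np.sum(probs * lprime * ~up1) / np.sum(probs * ~up1))

# correction V from Corollary 2.7 with a*_3 := 0, dg/dx1 = 0 and dg/dx2 = 1
term1 = (a2 - a1) * cond
term2 = (0.0 - a2) * lprime - lprime
V = np.sqrt(np.sum(probs * term1 ** 2) + np.sum(probs * term2 ** 2))

print("u(P)                           ", uP)
print("first-order worst case u-value ", uP + r * V)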

3.4. Proof of Theorem 2.8

We start with the upper bound. To that end, let \tau^{\ast} be the optimal stopping time for s(P), let Q\in B_{r}(P) be such that s(Q)\geq\sup_{R\in B_{r}(P)}s(R)-o(r), and let \pi be an (almost) optimal bicausal coupling for \mathcal{AW}_{p}(P,Q). Arguing similarly to Lemma 3.2, we can use the coupling \pi to build a stopping time \tau such that

E_{Q}[f(X,\tau(X))]\leq E_{\pi}[f(Y,\tau^{\ast}(X))]

—see (Backhoff-Veraguas et al., 2020b, Lemma 7.1) or (Bartl et al., 2021a, Proposition 5.8) for detailed proofs. Under the growth assumption on f, the fundamental theorem of calculus and Fubini's theorem yield

s(Q)-s(P)\leq E_{\pi}[f(Y,\tau^{\ast}(X))-f(X,\tau^{\ast}(X))]+o(r)=\int_{0}^{1}\sum_{t=1}^{T}E_{\pi}\left[\partial_{x_{t}}f(X+\lambda(Y-X),\tau^{\ast}(X))\cdot(Y_{t}-X_{t})\right]\,d\lambda+o(r)\leq r\int_{0}^{1}\Big{(}\sum_{t=1}^{T}E_{\pi}\left[|E_{\pi}[\partial_{x_{t}}f(X+\lambda(Y-X),\tau^{\ast}(X))|{\mathcal{F}}_{t}^{X,Y}]|^{q}\right]\Big{)}^{1/q}\,d\lambda+o(r),

where the last inequality follows from Hölder’s inequality and since

\sum_{t=1}^{T}E_{\pi}[|X_{t}-Y_{t}|^{p}]\leq r^{p}

in the same way as in the proof of Theorem 2.2. We also conclude using similar arguments that

\lim_{r\to 0}E_{\pi}\left[|E_{\pi}[\partial_{x_{t}}f(X+\lambda(Y-X),\tau^{\ast}(X))|{\mathcal{F}}_{t}^{X,Y}]|^{q}\right]=E_{P}\left[|E_{P}[\partial_{x_{t}}f(X,\tau^{\ast})|{\mathcal{F}}_{t}^{X}]|^{q}\right]

for every \lambda\in[0,1] and every t=1,\dots,T.

We proceed with the lower bound. To make the presentation concise, we assume here that T=2—the general case follows from a (somewhat tedious) adaptation of the arguments presented here. The assumption that the optimal stopping time \tau^{\ast} is unique implies, by the Snell envelope theorem, that

P(f(X,1)\neq E_{P}[f(X,2)|{\mathcal{F}}_{1}^{X}])=1;

in particular

(3.5) \{\tau^{\ast}=1\}=\{f(X,1)<E_{P}[f(X,2)|{\mathcal{F}}_{1}^{X}]\},\qquad\{\tau^{\ast}=2\}=\{f(X,1)>E_{P}[f(X,2)|{\mathcal{F}}_{1}^{X}]\}.

As before, set F_{t}:=E_{P}[\partial_{x_{t}}f(X,\tau^{\ast})|{\mathcal{F}}_{t}^{X}] and take Z that satisfies (3.4), i.e., Z_{t} is \mathcal{F}_{t}^{X}-measurable for every t, and

(3.6) E_{P}[|Z_{1}|^{p}]+E_{P}[|Z_{2}|^{p}]\leq 1\quad\text{and}\quad E_{P}[F_{1}Z_{1}]+E_{P}[F_{2}Z_{2}]=(E_{P}[|F_{1}|^{q}]+E_{P}[|F_{2}|^{q}])^{1/q}.

Next, for every r>0, set

A^{r}:=\{f(X+rZ,1)<E_{P}[f(X+rZ,2)|{\mathcal{F}}_{1}^{X}]\}\cap\{\tau^{\ast}=1\},\qquad B^{r}:=\{f(X+rZ,1)>E_{P}[f(X+rZ,2)|{\mathcal{F}}_{1}^{X}]\}\cap\{\tau^{\ast}=2\}.

Define the process

X^{r}:=X+rZ\mathbf{1}_{A^{r}\cup B^{r}}.

Since A^{r},B^{r} and Z_{1} are {\mathcal{F}}_{1}^{X}-measurable, the coupling \pi^{r}:=(X,X^{r})_{\ast}P is causal between P and P^{r}:=(X^{r})_{\ast}P. Using Lemma 3.1 (just as in the proof of Theorem 2.2), we can actually assume without loss of generality that \pi^{r} is in fact bicausal and that \mathcal{F}^{X^{r}}_{t}=\mathcal{F}^{X}_{t} for each t—we will leave this detail to the reader and proceed.

In particular, since

|X_{t}-X_{t}^{r}|\leq r|Z_{t}|,

it follows from (3.6) that \mathcal{AW}_{p}(P,P^{r})\leq r; thus

(3.7) \sup_{Q\in B_{r}(P)}s(Q)\geq\inf_{\tau\in\mathrm{ST}}E_{P}[f(X^{r},\tau(X^{r}))]
(3.8) =E_{P}[f(X^{r},1)\wedge E_{P}[f(X^{r},2)|\mathcal{F}^{X}_{1}]],

where the equality holds by the Snell envelope theorem and since \mathcal{F}^{X}_{1}=\mathcal{F}^{X^{r}}_{1}.

Next note that

f(X^{r},1)<E_{P}[f(X^{r},2)|\mathcal{F}^{X}_{1}]\ \text{ and }\ f(X,1)<E_{P}[f(X,2)|\mathcal{F}^{X}_{1}]\qquad\text{on }A^{r},
f(X^{r},1)>E_{P}[f(X^{r},2)|\mathcal{F}^{X}_{1}]\ \text{ and }\ f(X,1)>E_{P}[f(X,2)|\mathcal{F}^{X}_{1}]\qquad\text{on }B^{r},
f(X^{r},1)\wedge E_{P}[f(X^{r},2)|\mathcal{F}^{X}_{1}]=f(X,1)\wedge E_{P}[f(X,2)|\mathcal{F}^{X}_{1}]\qquad\text{on }(A^{r}\cup B^{r})^{c}\in\mathcal{F}_{1}^{X}.

Combined with (3.7) and since

s(P)=E_{P}[f(X,1)\wedge E_{P}[f(X,2)|{\mathcal{F}}_{1}^{X}]],

we get

\sup_{Q\in B_{r}(P)}s(Q)-s(P)\geq E_{P}\big{[}(f(X^{r},1)-f(X,1))\mathbf{1}_{A^{r}}+(E_{P}[f(X^{r},2)-f(X,2)|{\mathcal{F}}_{1}^{X}])\mathbf{1}_{B^{r}}\big{]}=E_{P}\big{[}(f(X^{r},1)-f(X,1))\mathbf{1}_{A^{r}}+(f(X^{r},2)-f(X,2))\mathbf{1}_{B^{r}}\big{]},

where the last equality holds by the tower property. Using the fundamental theorem of calculus just as in the proof of Theorem 2.2 shows that

\sup_{Q\in B_{r}(P)}\frac{1}{r}\big{(}s(Q)-s(P)\big{)}\geq\sum_{t=1}^{2}E_{P}\big{[}\partial_{x_{t}}f(X,1)Z_{t}\mathbf{1}_{A^{r}}+\partial_{x_{t}}f(X,2)Z_{t}\mathbf{1}_{B^{r}}\big{]}-o(1)\to\sum_{t=1}^{2}E_{P}\big{[}\partial_{x_{t}}f(X,\tau^{\ast})Z_{t}\big{]}

as r\downarrow 0, where the convergence holds because, by (3.5), \mathbf{1}_{A^{r}}\to\mathbf{1}_{\{\tau^{\ast}=1\}} and \mathbf{1}_{B^{r}}\to\mathbf{1}_{\{\tau^{\ast}=2\}}. To complete the proof, it remains to recall the definition of Z, see (3.6). ∎

References

  • Acciaio and Hou [2022] Beatrice Acciaio and Songyan Hou. Convergence of adapted empirical measures on \mathbb{R}^{d}. arXiv preprint arXiv:2211.10162, 2022.
  • Aldous [1981] D. Aldous. Weak convergence and general theory of processes. Department of Statistics, University of California, Berkeley, CA 94720, 1981.
  • Backhoff et al. [2022] Julio Backhoff, Daniel Bartl, Mathias Beiglböck, and Johannes Wiesel. Estimating processes in adapted Wasserstein distance. The Annals of Applied Probability, 32(1):529–550, 2022.
  • Backhoff-Veraguas et al. [2020a] Julio Backhoff-Veraguas, Daniel Bartl, Mathias Beiglböck, and Manu Eder. Adapted Wasserstein distances and stability in mathematical finance. Finance and Stochastics, 24(3):601–632, 2020a.
  • Backhoff-Veraguas et al. [2020b] Julio Backhoff-Veraguas, Daniel Bartl, Mathias Beiglböck, and Manu Eder. All adapted topologies are equal. Probability Theory and Related Fields, 178(3):1125–1172, 2020b.
  • Bartl et al. [2019] Daniel Bartl, Samuel Drapeau, and Ludovic Tangpi. Computational aspects of robust optimized certainty equivalents and option pricing. Mathematical Finance, 9(1):203, March 2019.
  • Bartl et al. [2021a] Daniel Bartl, Mathias Beiglböck, and Gudmund Pammer. The Wasserstein space of stochastic processes. arXiv preprint arXiv:2104.14245, 2021a.
  • Bartl et al. [2021b] Daniel Bartl, Samuel Drapeau, Jan Oblój, and Johannes Wiesel. Sensitivity analysis of Wasserstein distributionally robust optimization problems. Proceedings of the Royal Society A, 477(2256):20210176, 2021b.
  • Blanchet and Murthy [2019] Jose Blanchet and Karthyek Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, 2019.
  • Blanchet et al. [2020] José Blanchet, Yang Kang, José Luis Montiel Olea, Viet Anh Nguyen, and Xuhui Zhang. Machine learning’s dropout training is distributionally robust optimal. arXiv preprint arXiv:2009.06111, 2020.
  • Blanchet et al. [2021] Jose Blanchet, Lin Chen, and Xun Yu Zhou. Distributionally robust mean-variance portfolio selection with Wasserstein distances. Management Science, 2021.
  • Calafiore [2007] Giuseppe C Calafiore. Ambiguous risk measures and optimal robust portfolios. SIAM Journal on Optimization, 18(3):853–877, 2007.
  • Eckstein and Pammer [2022] Stephan Eckstein and Gudmund Pammer. Computational methods for adapted optimal transport. arXiv preprint arXiv:2203.05005, 2022.
  • Gao and Kleywegt [2016] Rui Gao and Anton J Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
  • Gozlan et al. [2017] Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, and Prasad Tetali. Kantorovich duality for general transport costs and applications. Journal of Functional Analysis, 273(11):3327–3405, 2017.
  • Hellwig [1996] M. Hellwig. Sequential decisions under uncertainty and the maximum theorem. J. Math. Econom., 25(4):443–464, 1996.
  • Herrmann and Muhle-Karbe [2017] Sebastian Herrmann and Johannes Muhle-Karbe. Model uncertainty, recalibration, and the emergence of delta–vega hedging. Finance and Stochastics, 21(4):873–930, 2017.
  • Hobson [1998] David G Hobson. Volatility misspecification, option pricing and superreplication via coupling. Annals of Applied Probability, pages 193–205, 1998.
  • Huber [2011] Peter J Huber. Robust statistics. In International encyclopedia of statistical science, pages 1248–1251. Springer, 2011.
  • Jiang [2023] Yifan Jiang. Wasserstein distributional sensitivity to model uncertainty in a dynamic context. DPhil Transfer of Status Thesis, University of Oxford, January 2023. Private communication.
  • Karoui et al. [1998] Nicole El Karoui, Monique Jeanblanc-Picqué, and Steven E Shreve. Robustness of the Black and Scholes formula. Mathematical Finance, 8(2):93–126, 1998.
  • Komlós [1967] Janos Komlós. A generalization of a problem of Steinhaus. Acta Mathematica Academiae Scientiarum Hungaricae, 18(1-2):217–229, 1967.
  • Kuhn et al. [2019] Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations research & management science in the age of analytics, pages 130–166. Informs, 2019.
  • Lam [2016] Henry Lam. Robust sensitivity analysis for stochastic systems. Mathematics of Operations Research, 41(4):1248–1275, 2016.
  • Lam [2018] Henry Lam. Sensitivity to serial dependency of input processes: A robust approach. Management Science, 64(3):1311–1327, 2018.
  • Lindsay [1994] Bruce G Lindsay. Efficiency versus robustness: the case for minimum Hellinger distance and related methods. The Annals of Statistics, 22(2):1081–1114, 1994.
  • Nendel and Sgarabottolo [2022] Max Nendel and Alessandro Sgarabottolo. A parametric approach to the estimation of convex risk functionals based on Wasserstein distances. arXiv preprint arXiv:2210.14340, 2022.
  • Oblój and Wiesel [2021] Jan Oblój and Johannes Wiesel. Distributionally robust portfolio maximization and marginal utility pricing in one period financial markets. Mathematical Finance, 31(4):1454–1493, 2021.
  • Pflug [2010] G Ch Pflug. Version-Independence and Nested Distributions in Multistage Stochastic Optimization. SIAM Journal on Optimization, 20(3):1406–1420, January 2010.
  • Pflug and Wozabal [2007] Georg Pflug and David Wozabal. Ambiguity in portfolio selection. Quantitative Finance, 7(4):435–442, 2007.
  • Pflug and Pichler [2012] Georg Ch Pflug and Alois Pichler. A Distance For Multistage Stochastic Optimization Models. SIAM Journal on Optimization, 22(1):1–23, January 2012.
  • Rahimian and Mehrotra [2019] Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659, 2019.
  • Rásonyi and Stettner [2005] Miklós Rásonyi and Lukasz Stettner. On utility maximization in discrete-time financial market models. The Annals of Applied Probability, 15(2):1367–1395, 2005.
  • Shafieezadeh-Abadeh et al. [2019] Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1–68, 2019.