
Distributionally Robust Markov Decision Processes and their Connection to Risk Measures

Nicole Bäuerle Department of Mathematics, Karlsruhe Institute of Technology (KIT), D-76128 Karlsruhe, Germany [email protected]  and  Alexander Glauner Department of Mathematics, Karlsruhe Institute of Technology (KIT), D-76128 Karlsruhe, Germany [email protected]
Abstract.

We consider robust Markov Decision Processes with Borel state and action spaces, unbounded cost and finite time horizon. Our formulation leads to a Stackelberg game against nature. Under integrability, continuity and compactness assumptions we derive a robust cost iteration for a fixed policy of the decision maker and a value iteration for the robust optimization problem. Moreover, we show the existence of deterministic optimal policies for both players. This is in contrast to classical zero-sum games. In case the state space is the real line we show under some convexity assumptions that the interchange of supremum and infimum is possible with the help of Sion’s minimax Theorem. Further, we consider the problem with special ambiguity sets. In particular we are able to derive some cases where the robust optimization problem coincides with the minimization of a coherent risk measure. In the final section we discuss two applications: A robust LQ problem and a robust problem for managing regenerative energy.

Key words:

Robust Markov Decision Process; Dynamic Games; Minimax Theorem; Risk Measures

AMS subject classifications:

90C40, 90C17, 91G70

1. Introduction

Markov Decision Processes (MDPs) are a well-established tool to model and solve sequential decision making under stochastic perturbations. In the standard theory it is assumed that all parameters and distributions are known or can be estimated with a certain precision. However, using the so-derived 'optimal' policies in a system where the true parameters or distributions deviate may lead to a significant degradation of performance. In order to cope with this problem there are different approaches in the literature.

The first approach, which is typically used when parameters are unknown, is the so-called Bayesian approach. In this setting a prior distribution for the model parameters is assumed and additional information, which is revealed while the process evolves, is used to update the beliefs. Hence this approach allows parameters to be learned. It is very popular in engineering applications. Introductions can e.g. be found in [hernandez2012adaptive] and [BaeuerleRieder2011]. In this paper we are not going to pursue this stream of literature.

A second approach is the so-called robust approach. Here it is assumed that instead of having one particular transition law, we are faced with a whole family of laws which are possible for the transition. In the literature this is referred to as model ambiguity. One way of dealing with this ambiguity is the worst case approach, where the controller selects a policy which is optimal with respect to the most adverse transition law in each scenario. This setting can also be interpreted as a dynamic game with nature as the controller’s opponent. The worst case approach is empirically justified by the so-called Ellsberg Paradox. The experiment suggested by [Ellsberg1961] has shown that agents tend to be ambiguity averse. In the sequel, axiomatic approaches to model risk and ambiguity attitude have appeared, see e.g. [gilboa1989maxmin, maccheroni2006ambiguity]. [EpsteinSchneider2003] investigated the question whether ambiguity aversion can be incorporated in an axiomatic model of intertemporal utility. The representation of the preferences turned out to be some worst case expected utility, i.e. the minimal expected utility over an appropriate set of probability measures. This set of probability measures needs to satisfy some rectangularity condition for the utility to have a recursive structure and therefore to be time consistent. The rectangularity property has been taken up by [Iyengar2005] as a key assumption for being able to derive a Bellman equation for a robust MDP with countable state and action spaces. Contemporaneously, [NilimGhaoui2005] reached similar findings, albeit limited to finite state and action spaces.

[wiesemann2013robust] have considered robust MDP beyond the rectangularity condition. Based on observed histories, they derive a confidence region that contains the unknown parameters with a prespecified probability and determine a policy that attains the highest worst-case performance over this confidence region. A similar approach has been taken in [xu2010distributionally] where nested uncertainty sets for transition laws are given which correspond to confidence sets. The optimization is then based on the expected performance of policies under the (respective) most adversarial distribution. This approach lies between the Bayesian and the robust approach since the decision maker uses prior information without a Bayesian interpretation. All these analyses are restricted to finite state and action spaces. A similar but different approach has been considered in [bauerle2019markov] where parameter ambiguity is combined with Bayesian learning methods. Here the authors deal with arbitrary Borel state and action spaces.

In our paper we will generalize the results of [Iyengar2005] to a model with Borel spaces and unbounded cost function. We consider finite horizon expected cost problems with a given transition law under ambiguity concerning the distribution of the disturbance variables. The problem is to minimize the worst expected cost. This leads to a Stackelberg game against nature. In order to deal with the arising measurability issues which impose a major technical difficulty compared to the countable space situation, we borrow from the dynamic game setup in [Gonzales2002] and [JaskiewiczNowak2011]. The major difference of our contribution compared to these two works is the design of the distributional ambiguity. We replace the topology of weak convergence on the ambiguity set by the weak* topology σ(Lq,Lp)\sigma(L^{q},L^{p}) in order to obtain connections to recursive risk measures in Section 6. Moreover, we rigorously derive a Bellman equation. Note that [jaskiewicz2014robust] treats another robust MDP setup with Borel state and action spaces, but deals with the average control problem. Further, our model allows for rather general ambiguity sets for the disturbance distribution. Under additional technical assumptions our model also comprises the setting of decreasing confidence regions for distributions. Moreover, we discuss in detail sufficient conditions for the interchange of supremum and infimum. We provide a counterexample which shows that this interchange is not always possible, in contrast to [NilimGhaoui2005], [wiesemann2013robust]. This counterexample also highlights the difference to classical two-person zero-sum games. Further, we are able to derive a connection to the optimization of risk measures in MDP frameworks. In case the ambiguity sets are chosen in a specific way the robust optimization problem coincides with the optimization of a risk measure applied to the total cost.

The outline of the paper is as follows: In the next section we introduce our basic model which is a game against nature concerning expected cost with a finite time horizon. Section 3 contains the first main results. Under integrability, continuity and compactness assumptions we derive a robust cost iteration for a fixed policy of the decision maker and a value iteration for the robust optimization problem. Moreover, we show the existence of optimal deterministic policies for both players. This is in contrast to classical zero-sum games where we usually obtain optimal randomized policies. In Section 4 we consider the real line as state space which allows for slightly different model assumptions. Then in Section 5 we discuss (with the real line as state space) when supremum and infimum can be interchanged in the solution of the problem. Under some convexity assumptions this can be achieved with the help of Sion’s minimax Theorem [Sion1958]. Being able to interchange supremum and infimum sometimes simplifies the solution of the problem. In Section 6 we consider the problem with special ambiguity sets. In particular we are able to derive some cases which can be solved in a straightforward way and situations where the robust optimization problem coincides with the minimization of a coherent risk measure. In the final section we discuss two applications: A robust LQ problem and a robust problem for managing regenerative energy.

2. The Markov Decision Model

We consider the following standard Markov Decision Process with general Borel state and action spaces and restrict ourselves to a model with finite planning horizon NN\in\mathbb{N}. Results for an infinite planning horizon can be found in [glauner20]. The state space EE is a Borel space with Borel σ\sigma-algebra \mathfrak{B}(E) and the action space AA is a Borel space with Borel σ\sigma-algebra \mathfrak{B}(A). The possible state-action combinations at time nn form a measurable subset DnD_{n} of E×AE\times A such that DnD_{n} contains the graph of a measurable mapping EAE\to A. The xx-section of DnD_{n},

Dn(x)={aA:(x,a)Dn},D_{n}(x)=\{a\in A:(x,a)\in D_{n}\},

is the set of admissible actions in state xEx\in E at time nn. The sets Dn(x)D_{n}(x) are non-empty by assumption. We assume that the dynamics of the MDP are given by measurable transition functions Tn:Dn×𝒵ET_{n}:D_{n}\times\mathcal{Z}\to E and depend on disturbances Z1,,ZNZ_{1},\dots,Z_{N} which are independent random elements on a common probability space (Ω,𝒜,)(\Omega,\mathcal{A},\mathbb{P}) with values in a measurable space (𝒵,)(\mathcal{Z},\mathfrak{Z}). When the current state is xnx_{n} the controller chooses action anDn(xn)a_{n}\in D_{n}(x_{n}) and zn+1z_{n+1} is the realization of Zn+1Z_{n+1}, then the next state is given by

xn+1=Tn(xn,an,zn+1).x_{n+1}=T_{n}(x_{n},a_{n},z_{n+1}).

The one-stage cost function cn:Dn×Ec_{n}:D_{n}\times E\to\mathbb{R} gives the cost cn(x,a,x)c_{n}(x,a,x^{\prime}) for choosing action aa if the system is in state xx at time nn and the next state is xx^{\prime}. The terminal cost function cN:Ec_{N}:E\to\mathbb{R} gives the cost cN(x)c_{N}(x) if the system terminates in state xx.
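To fix ideas, the following minimal Python sketch (not part of the paper) simulates one trajectory of such a controlled process and accumulates its cost; the one-dimensional transition function, the cost functions, the horizon and the decision rule are ad-hoc illustrative choices, not the model data of this section.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                        # finite planning horizon (illustrative)

def T(n, x, a, z):           # transition function T_n(x, a, z): toy additive dynamics
    return x + a + z

def c(n, x, a, x_next):      # one-stage cost c_n(x, a, x')
    return a ** 2 + x_next ** 2

def c_N(x):                  # terminal cost c_N(x)
    return x ** 2

def d(n, x):                 # a fixed Markov decision rule d_n(x), chosen ad hoc
    return -0.5 * x

x, total_cost = 1.0, 0.0     # initial state x_0
for n in range(N):
    a = d(n, x)
    z = rng.normal()         # disturbance Z_{n+1}, drawn here from a reference law P_{n+1}
    x_next = T(n, x, a, z)
    total_cost += c(n, x, a, x_next)
    x = x_next
total_cost += c_N(x)
print(total_cost)
```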

The model data is supposed to have the following continuity and compactness properties.

Assumption 2.1.
  1. (i)

    Dn(x)D_{n}(x) are compact and ExDn(x)E\ni x\mapsto D_{n}(x) is upper semicontinuous for n=0,,N1n=0,\dots,N-1, i.e. if xkxx_{k}\to x and akDn(xk)a_{k}\in D_{n}(x_{k}), kk\in\mathbb{N}, then (ak)(a_{k}) has an accumulation point in Dn(x)D_{n}(x).

  2. (ii)

    Tn(x,a,z)T_{n}(x,a,z) is continuous in (x,a)(x,a) for z𝒵z\in\mathcal{Z} and n=0,,N1n=0,\dots,N-1.

  3. (iii)

    cn,n=0,,N1c_{n},n=0,\dots,N-1, as well as the terminal cost function cNc_{N} are lower semicontinuous.

For n0n\in\mathbb{N}_{0} we denote by n\mathcal{H}_{n} the set of feasible histories of the decision process up to time nn

hn={x0,if n=0,(x0,a0,x1,,xn),if n1,\displaystyle h_{n}=\begin{cases}x_{0},&\text{if }n=0,\\ (x_{0},a_{0},x_{1},\dots,x_{n}),&\text{if }n\geq 1,\end{cases}

where akDk(xk)a_{k}\in D_{k}(x_{k}) for k0k\in\mathbb{N}_{0}. In order for the controller’s decisions to be implementable, they must be based on the information available at the time of decision making, i.e. be functions of the history of the decision process.

Definition 2.2.
  • (i)

    A randomized policy is a sequence π=(π0,π1,,πN1)\pi=(\pi_{0},\pi_{1},\dots,\pi_{N-1}) of stochastic kernels πn\pi_{n} from n\mathcal{H}_{n} to the action space AA satisfying the constraint

    πn(Dn(xn)|hn)=1,hnn.\pi_{n}(D_{n}(x_{n})|h_{n})=1,\qquad h_{n}\in\mathcal{H}_{n}.
  • (ii)

    A measurable mapping dn:nAd_{n}:\mathcal{H}_{n}\to A with dn(hn)Dn(xn)d_{n}(h_{n})\in D_{n}(x_{n}) for every hnnh_{n}\in\mathcal{H}_{n} is called (deterministic) decision rule at time nn. π=(d0,d1,,dN1)\pi=(d_{0},d_{1},\dots,d_{N-1}) is called (deterministic) policy.

  • (iii)

    A decision rule at time nn is called Markov if it depends on the current state only, i.e. dn(hn)=dn(xn)d_{n}(h_{n})=d_{n}(x_{n}) for all hnnh_{n}\in\mathcal{H}_{n}. If all decision rules are Markov, the deterministic (NN-stage) policy is called Markov.

For convenience, deterministic policies may simply be referred to as policy. With ΠRΠΠM\Pi^{R}\supseteq\Pi\supseteq\Pi^{M} we denote the sets of all randomized policies, deterministic policies and Markov policies. The first inclusion is by identifying deterministic decision rules dnd_{n} with the corresponding Dirac kernels

πn(|hn):=δdn(hn)(),hnn.\pi_{n}(\cdot|h_{n}):=\delta_{d_{n}(h_{n})}(\cdot),\qquad h_{n}\in\mathcal{H}_{n}.

A feasible policy always exists since DnD_{n} contains the graph of a measurable mapping.

Due to the independence of the disturbances we may without loss of generality assume that the probability space has a product structure

(Ω,𝒜,)=n=1(Ωn,𝒜n,n).(\Omega,\mathcal{A},\mathbb{P})=\bigotimes_{n=1}^{N}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}).

We take (Ω,𝒜,)(\Omega,\mathcal{A},\mathbb{P}) as the canonical construction, i.e.

(Ωn,𝒜n,n)=(𝒵,,Zn)andZn(ω¯)=ωn,ω¯=(ω1,,ωN)Ω(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n})=\left(\mathcal{Z},\mathfrak{Z},\mathbb{P}^{Z_{n}}\right)\qquad\text{and}\qquad Z_{n}(\bar{\omega})=\omega_{n},\quad\bar{\omega}=(\omega_{1},\dots,\omega_{N})\in\Omega

for all n=1,,Nn=1,\dots,N. We denote by (Xn),(An)(X_{n}),(A_{n}) the random state and action processes and define Hn:=(X0,A0,,Xn)H_{n}:=(X_{0},A_{0},\ldots,X_{n}). In the sequel, we will require n\mathbb{P}_{n} to be separable. Additionally, we will assume for some results that (Ωn,𝒜n,n)\left(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}\right) is atomless in order to support a generalized distributional transform.

Let n{0,,N1}n\in\{0,\dots,N-1\} be a stage of the decision process. Due to the product structure of (Ω,𝒜,)(\Omega,\mathcal{A},\mathbb{P}) the transition kernel is given by

\displaystyle Q_{n}(B|x,a)=\int 1_{B}\big{(}T_{n}(x,a,z_{n+1})\big{)}\mathbb{P}_{n+1}(\operatorname{d}z_{n+1}),\quad B\in\mathfrak{B}(E),\ (x,a)\in D_{n}. (2.1)

We assume now that there is some uncertainty about n\mathbb{P}_{n}, e.g. because it cannot be estimated properly. Moreover, the decision maker is very risk averse and tries to minimize the expected cost on a worst case basis. Thus, we denote by (Ωn,𝒜n,n)\mathcal{M}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}) the set of probability measures on (Ωn,𝒜n)(\Omega_{n},\mathcal{A}_{n}) which are absolutely continuous with respect to n\mathbb{P}_{n} and define for q(1,]q\in(1,\infty]

q(Ωn,𝒜n,n)={(Ωn,𝒜n,n):ddnLq(Ωn,𝒜n,n)}.\mathcal{M}^{q}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n})=\left\{\mathbb{Q}\in\mathcal{M}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}):\frac{\operatorname{d}\mathbb{Q}}{\operatorname{d}\mathbb{P}_{n}}\in L^{q}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n})\right\}.

Henceforth, we fix a non-empty subset 𝒬nq(Ωn,𝒜n,n)\mathcal{Q}_{n}\subseteq\mathcal{M}^{q}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}) which is referred to as ambiguity set at stage nn. This can be seen as the set of probability measures which may reflect the law of motion. Due to absolute continuity, we can identify 𝒬n\mathcal{Q}_{n} with the set of corresponding densities w.r.t. n\mathbb{P}_{n}

𝒬nd={ddnLq(Ωn,𝒜n,n):𝒬n}.\mathcal{Q}_{n}^{d}=\left\{\frac{\operatorname{d}\mathbb{Q}}{\operatorname{d}\mathbb{P}_{n}}\in L^{q}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}):\mathbb{Q}\in\mathcal{Q}_{n}\right\}. (2.2)

Accordingly, we view 𝒬n\mathcal{Q}_{n} as a subset of Lq(Ωn,𝒜n,n)L^{q}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}) and endow it with the trace topology of the weak* topology σ(Lq,Lp)\sigma(L^{q},L^{p}) on Lq(Ωn,𝒜n,n)L^{q}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}) where 1p+1q=1\frac{1}{p}+\frac{1}{q}=1. The weak* topology in turn induces a Borel σ\sigma-algebra on 𝒬n\mathcal{Q}_{n} making it a measurable space. We obtain the following result (for a proof see the Appendix).

Lemma 2.3.

Let the ambiguity set be norm-bounded and the probability measure n\mathbb{P}_{n} on (Ωn,𝒜n)(\Omega_{n},\mathcal{A}_{n}) be separable. Then 𝒬n\mathcal{Q}_{n} endowed with the weak* topology σ(Lq,Lp)\sigma(L^{q},L^{p}) is a separable metrizable space. If 𝒬n\mathcal{Q}_{n} is additionally weak* closed, it is even a compact Borel space.

In our cost model we allow for any norm-bounded ambiguity set 𝒬nq(Ωn,𝒜n,n)\mathcal{Q}_{n}\subseteq\mathcal{M}^{q}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}). For applications, a meaningful way of choosing 𝒬n\mathcal{Q}_{n} (within a norm bound) is to take all probability measures in q(Ωn,𝒜n,n)\mathcal{M}^{q}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}) which are close to n\mathbb{P}_{n} in a certain metric, e.g. the Wasserstein metric (see e.g. [yang2017convex]). In our setting, which requires absolute continuity, the Kullback-Leibler divergence could be a natural choice.
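Purely as an illustration of the density representation (2.2) and of such a choice of ambiguity set (not a construction from the paper), the following Python sketch discretizes the disturbance space and enumerates measures whose densities with respect to the reference measure satisfy the moment bound of Assumption 3.1 (iii) and lie in a Kullback-Leibler neighbourhood of it; the grid, the exponent q, the bound K and the radius are arbitrary demonstration values.

```python
import numpy as np
from itertools import product

# reference disturbance distribution P_n on a finite space Z = {z_1, ..., z_4}
p = np.array([0.25, 0.25, 0.25, 0.25])
q_exponent = 2.0          # the exponent q defining M^q (illustrative)
K = 1.5                   # bound on E|dQ/dP_n|^q as in Assumption 3.1 (iii)
kl_radius = 0.05          # KL neighbourhood of P_n (illustrative choice)

def q_moment(density):
    # E_P |dQ/dP_n|^q, the quantity bounded by K
    return float(np.sum(p * np.abs(density) ** q_exponent))

def kl(q_meas, p_meas):
    mask = q_meas > 0
    return float(np.sum(q_meas[mask] * np.log(q_meas[mask] / p_meas[mask])))

# enumerate candidate measures Q on a coarse grid and keep those in the ambiguity set
grid = np.linspace(0.05, 0.85, 17)
ambiguity_set = []
for probs in product(grid, repeat=len(p) - 1):
    last = 1.0 - sum(probs)
    if last <= 0:
        continue
    q_meas = np.array(list(probs) + [last])
    density = q_meas / p                       # dQ/dP_n on the finite space
    if q_moment(density) <= K and kl(q_meas, p) <= kl_radius:
        ambiguity_set.append(q_meas)

print(len(ambiguity_set), "measures in the discretized ambiguity set")
```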

The controller only knows that the transition kernel (2.1) at each stage is defined by some 𝒬n+1\mathbb{Q}\in\mathcal{Q}_{n+1} instead of n+1\mathbb{P}_{n+1} but not which one exactly. We assume here that the controller faces a dynamic game against nature. This means that nature reacts to the controller’s action ana_{n} in scenario hnnh_{n}\in\mathcal{H}_{n} with a decision rule γn:n×A𝒬n+1\gamma_{n}:\mathcal{H}_{n}\times A\to\mathcal{Q}_{n+1} where anDn(xn)a_{n}\in D_{n}(x_{n}). A policy of nature is a sequence of such decision rules γ=(γ0,,γN1)\gamma=(\gamma_{0},\dots,\gamma_{N-1}). Let Γ\Gamma be the set of all policies of nature. Since nature is an unobserved theoretical opponent of the controller, her actions are not considered to be part of the history of the Markov Decision Process. A Markov decision rule of nature at time nn is a measurable mapping γn:Dn𝒬n+1\gamma_{n}:D_{n}\to\mathcal{Q}_{n+1} and a Markov policy of nature is a sequence γ=(γ0,,γN1)\gamma=(\gamma_{0},\dots,\gamma_{N-1}) of such decision rules. The set of Markov policies of nature is denoted by ΓM\Gamma^{M}. Thus we are faced with a Stackelberg game where the controller is the first mover and nature is the follower.

Lemma 2.4.

For n=0,,N1n=0,\dots,N-1 a decision rule γn:n×A𝒬n+1\gamma_{n}:\mathcal{H}_{n}\times A\to\mathcal{Q}_{n+1} induces a stochastic kernel from n×A\mathcal{H}_{n}\times A to Ωn+1\Omega_{n+1} by

γn(B|hn,an):=γn(hn,an)(B),B𝒜n+1,(hn,an)n×A.\gamma_{n}(B|h_{n},a_{n}):=\gamma_{n}(h_{n},a_{n})(B),\qquad B\in\mathcal{A}_{n+1},\ (h_{n},a_{n})\in\mathcal{H}_{n}\times A.

For a proof of the lemma see the Appendix. In the sequel, it will be clear from the context where we refer to γn\gamma_{n} as a decision rule or as a stochastic kernel.

The probability measure γn(|hn,an)\gamma_{n}(\cdot|h_{n},a_{n}), which is unknown for the controller, now takes the role of n+1\mathbb{P}_{n+1} in defining the transition kernel of the Markov Decision Process in (2.1). Let

\displaystyle Q_{n}^{\gamma}(B|h_{n},a_{n})=\int 1_{B}\big{(}T_{n}(x_{n},a_{n},z_{n+1})\big{)}\gamma_{n}(\operatorname{d}z_{n+1}|h_{n},a_{n}),\;B\in\mathfrak{B}(E),\ h_{n}\in\mathcal{H}_{n},a_{n}\in D_{n}(x_{n}). (2.3)

As in the case without ambiguity, the Theorem of Ionescu-Tulcea yields that each initial state xEx\in E and pair of policies of the controller and nature (π,γ)ΠR×Γ(\pi,\gamma)\in\Pi^{R}\times\Gamma induce a unique law of motion

xπγ:=δxπ0Q0γπ1Q1γπN1QN1γ\displaystyle\mathbb{Q}^{\pi\gamma}_{x}:=\delta_{x}\otimes\pi_{0}\otimes Q_{0}^{\gamma}\otimes\pi_{1}\otimes Q_{1}^{\gamma}\otimes\dots\otimes\pi_{N-1}\otimes Q_{N-1}^{\gamma} (2.4)

on N{\mathcal{H}}_{N} satisfying

xπγ(X0B)\displaystyle\mathbb{Q}^{\pi\gamma}_{x}(X_{0}\in B) =δx(B),\displaystyle=\delta_{x}(B),
xπγ(AnC|Hn=hn)\displaystyle\mathbb{Q}^{\pi\gamma}_{x}(A_{n}\in C|H_{n}=h_{n}) =πn(C|hn),\displaystyle=\pi_{n}(C|h_{n}),
xπγ(Xn+1B|(Hn,An)=(hn,an))\displaystyle\mathbb{Q}^{\pi\gamma}_{x}(X_{n+1}\in B|(H_{n},A_{n})=(h_{n},a_{n})) =Qnγ(B|hn,an)\displaystyle=Q_{n}^{\gamma}(B|h_{n},a_{n})

for all B\in\mathfrak{B}(E) and C\in\mathfrak{B}(A). In the usual way, we denote with 𝔼xπγ\mathbb{E}_{x}^{\pi\gamma} the expectation operator with respect to xπγ\mathbb{Q}^{\pi\gamma}_{x} and with 𝔼nhnπγ\mathbb{E}_{nh_{n}}^{\pi\gamma} or 𝔼nxπγ\mathbb{E}_{nx}^{\pi\gamma} the respective conditional expectation given Hn=hnH_{n}=h_{n} or Xn=xX_{n}=x.

The value of a policy pair (π,γ)ΠR×Γ(\pi,\gamma)\in\Pi^{R}\times\Gamma at time n=0,,N1n=0,\dots,N-1 is defined as

VNπγ(hN)=cN(xN),hNN,Vnπγ(hn)=𝔼nhnπγ[k=nN1ck(Xk,Ak,Xk+1)+cN(XN)],hnn.\displaystyle\begin{aligned} V_{N\pi\gamma}(h_{N})&=c_{N}(x_{N}),&h_{N}\in\mathcal{H}_{N},\\ V_{n\pi\gamma}(h_{n})&=\mathbb{E}_{nh_{n}}^{\pi\gamma}\left[\sum_{k=n}^{N-1}c_{k}(X_{k},A_{k},X_{k+1})+c_{N}(X_{N})\right],&h_{n}\in\mathcal{H}_{n}.\end{aligned} (2.5)

Since the controller is unaware which probability measure in the ambiguity set determines the transition law in each scenario, he tries to minimize the expected cost under the assumption to be confronted with the most adverse probability measure. The value functions are thus given by

Vn(hn)=infπΠRsupγΓVnπγ(hn),hnn,\displaystyle V_{n}(h_{n})=\inf_{\pi\in\Pi^{R}}\sup_{\gamma\in\Gamma}\ V_{n\pi\gamma}(h_{n}),\qquad h_{n}\in\mathcal{H}_{n},

and the optimization objective is

V0(x)=infπΠRsupγΓV0πγ(x),xE.\displaystyle V_{0}(x)=\inf_{\pi\in\Pi^{R}}\sup_{\gamma\in\Gamma}\ V_{0\pi\gamma}(x),\qquad x\in E. (2.6)

In game-theoretic terminology this is the upper value of a dynamic zero-sum game. If nature were to act first, i.e. if infimum and supremum were interchanged, one would obtain the game’s lower value. If the two values agree and the infima/suprema are attained, the game has a Nash equilibrium, see also Section 5. Note, however, that the two players have asymmetric information in our setting.

Remark 2.5.

[Iyengar2005] does not model nature to make active decisions, but instead defines the set of all possible laws of motion. When each law of motion is of the form (2.4), he calls the set rectangular. Our approach with active decisions of nature based on [Gonzales2002] and [JaskiewiczNowak2011] is needed to construct Markov kernels as in Lemma 2.4 with probability measures from a given ambiguity set. When state and action spaces are countable as in [Iyengar2005] the technical problem of measurability does not arise and one can directly construct an ambiguous law of motion by simply multiplying (conditional) probabilities. The rectangularity property is satisfied in our setting.

The feature of our model that there is no ambiguity in the transition functions is justified in many applications. Typically, transition functions describe a technical process or an economic calculation which is known ex-ante and does not have to be estimated. The same applies to the cost function.

3. Value Iteration and Optimal Policies

In order to have well-defined value functions we need some integrability conditions.

Assumption 3.1.
  • (i)

    There exist α,ϵ¯,ϵ¯0\alpha,\underaccent{\bar}{\epsilon},\bar{\epsilon}\geq 0 with ϵ¯+ϵ¯=1,α1\underaccent{\bar}{\epsilon}+\bar{\epsilon}=1,\alpha\neq 1 and measurable functions b¯:E(,ϵ¯]\underaccent{\bar}{b}:E\to(-\infty,-\underaccent{\bar}{\epsilon}] and b¯:E[ϵ¯,)\bar{b}:E\to[\bar{\epsilon},\infty) such that it holds for all n=0,,N1n=0,\dots,N-1, 𝒬n+1\mathbb{Q}\in\mathcal{Q}_{n+1} and (x,a)Dn(x,a)\in D_{n}

    𝔼[cn(x,a,Tn(x,a,Zn+1))]\displaystyle\mathbb{E}^{\mathbb{Q}}\left[-c_{n}^{-}(x,a,T_{n}(x,a,Z_{n+1}))\right] b¯(x),\displaystyle\geq\underaccent{\bar}{b}(x), 𝔼[b¯(Tn(x,a,Zn+1))]\displaystyle\mathbb{E}^{\mathbb{Q}}\left[\underaccent{\bar}{b}(T_{n}(x,a,Z_{n+1}))\right] αb¯(x),\displaystyle\geq\alpha\underaccent{\bar}{b}(x),
    𝔼[cn+(x,a,Tn(x,a,Zn+1))]\displaystyle\mathbb{E}^{\mathbb{Q}}\left[c_{n}^{+}(x,a,T_{n}(x,a,Z_{n+1}))\right] b¯(x),\displaystyle\leq\bar{b}(x), 𝔼[b¯(Tn(x,a,Zn+1))]\displaystyle\mathbb{E}^{\mathbb{Q}}\left[\bar{b}(T_{n}(x,a,Z_{n+1}))\right] αb¯(x).\displaystyle\leq\alpha\bar{b}(x).

    Furthermore, it holds b¯(x)cN(x)b¯(x)\underaccent{\bar}{b}(x)\leq c_{N}(x)\leq\bar{b}(x) for all xEx\in E.

  • (ii)

    We define b:E[1,),b(x):=b¯(x)b¯(x)b:E\to[1,\infty),\ b(x):=\bar{b}(x)-\underaccent{\bar}{b}(x). For all n=0,,N1n=0,\dots,N-1 and (x¯,a¯)Dn(\bar{x},\bar{a})\in D_{n} there exists an ϵ>0\epsilon>0 and measurable functions Θn,1x¯,a¯,Θn,2x¯,a¯:𝒵+\Theta_{n,1}^{\bar{x},\bar{a}},\Theta_{n,2}^{\bar{x},\bar{a}}:\mathcal{Z}\to\mathbb{R}_{+} such that Θn,1x¯,a¯(Zn+1),Θn,2x¯,a¯(Zn+1)Lp(Ωn+1,𝒜n+1,n+1)\Theta_{n,1}^{\bar{x},\bar{a}}(Z_{n+1}),\Theta_{n,2}^{\bar{x},\bar{a}}(Z_{n+1})\in L^{p}(\Omega_{n+1},\mathcal{A}_{n+1},\mathbb{P}_{n+1}) and

    |cn(x,a,Tn(x,a,z))|Θn,1x¯,a¯(z),b(Tn(x,a,z))Θn,2x¯,a¯(z)\displaystyle|c_{n}(x,a,T_{n}(x,a,z))|\leq\Theta_{n,1}^{\bar{x},\bar{a}}(z),\qquad\qquad b(T_{n}(x,a,z))\leq\Theta_{n,2}^{\bar{x},\bar{a}}(z)

    for all z𝒵z\in\mathcal{Z} and (x,a)Bϵ(x¯,a¯)Dn(x,a)\in B_{\epsilon}(\bar{x},\bar{a})\cap D_{n}. Here, Bϵ(x¯,a¯)B_{\epsilon}(\bar{x},\bar{a}) is the closed ball around (x¯,a¯)(\bar{x},\bar{a}) w.r.t. an arbitrary product metric on E×AE\times A.

  • (iii)

    The ambiguity sets 𝒬n\mathcal{Q}_{n} are norm bounded, i.e. there exists K[1,)K\in[1,\infty) such that

    𝔼|ddn|qK\mathbb{E}\left|\frac{\operatorname{d}\mathbb{Q}}{\operatorname{d}\mathbb{P}_{n}}\right|^{q}\leq K

    for all n=1,,Nn=1,\dots,N and 𝒬n\mathbb{Q}\in\mathcal{Q}_{n}.

Remark 3.2.
  1. (a)

b¯,b¯\underaccent{\bar}{b},\bar{b} are called lower and upper bounding function, respectively, while bb is referred to as bounding function. As the absolute value is the sum of positive and negative part, bb satisfies for all n=0,,N1n=0,\dots,N-1, 𝒬n+1\mathbb{Q}\in\mathcal{Q}_{n+1} and (x,a)Dn(x,a)\in D_{n}:

    𝔼[|cn(x,a,Tn(x,a,Zn+1))|]b(x)and𝔼[|b(Tn(x,a,Zn+1))|]αb(x)\mathbb{E}^{\mathbb{Q}}\left[|c_{n}(x,a,T_{n}(x,a,Z_{n+1}))|\right]\leq b(x)\qquad\text{and}\qquad\mathbb{E}^{\mathbb{Q}}\left[|b(T_{n}(x,a,Z_{n+1}))|\right]\leq\alpha b(x)
  2. (b)

    Assumptions 3.1 (i) and (ii) are satisfied with b¯=b¯=constant>0-\underaccent{\bar}{b}=\bar{b}=constant>0 if the cost functions are bounded.

  3. (c)

If p=1p=1 and hence q=q=\infty, it is technically sufficient if part (i) of Assumption 3.1 holds under the reference probability measure n\mathbb{P}_{n}. Using Hölder’s inequality and part (iii) we get for every 𝒬n+1\mathbb{Q}\in\mathcal{Q}_{n+1}

    𝔼[cn(x,a,Tn(x,a,Zn+1))]\displaystyle\mathbb{E}^{\mathbb{Q}}\left[-c_{n}^{-}(x,a,T_{n}(x,a,Z_{n+1}))\right] 𝔼[cn(x,a,Tn(x,a,Zn+1))]esssupddn+1Kb¯(x),\displaystyle\geq\mathbb{E}\left[-c_{n}^{-}(x,a,T_{n}(x,a,Z_{n+1}))\right]\operatorname{ess\,sup}\frac{\operatorname{d}\mathbb{Q}}{\operatorname{d}\mathbb{P}_{n+1}}\geq K\underaccent{\bar}{b}(x),
    𝔼[b¯(Tn(x,a,Zn+1))]\displaystyle\mathbb{E}^{\mathbb{Q}}\left[\underaccent{\bar}{b}(T_{n}(x,a,Z_{n+1}))\right] 𝔼[b¯(Tn(x,a,Zn+1))]esssupddn+1αKb¯(x)\displaystyle\geq\mathbb{E}\left[\underaccent{\bar}{b}(T_{n}(x,a,Z_{n+1}))\right]\operatorname{ess\,sup}\frac{\operatorname{d}\mathbb{Q}}{\operatorname{d}\mathbb{P}_{n+1}}\geq\alpha K\underaccent{\bar}{b}(x)

    and analogous results for the upper bounding function. I.e. one simply has to replace α\alpha by KαK\alpha. However, the factor KαK\alpha may be unnecessarily crude.

The next lemma shows that due to Assumption 3.1 (i) the value (2.5) of a policy pair (π,γ)ΠR×Γ(\pi,\gamma)\in\Pi^{R}\times\Gamma is well-defined at all stages n=0,,Nn=0,\dots,N. One can see that the existence of either a lower or an upper bounding function is sufficient for the policy value to be well-defined. However, for the existence of an optimal policy pair we will need the integral to exist with finite value and therefore require both a lower and an upper bounding function.

Lemma 3.3.

Under Assumption 3.1 it holds for all policy pairs (π,γ)ΠR×Γ(\pi,\gamma)\in\Pi^{R}\times\Gamma, time points n=0,,Nn=0,\dots,N and histories hnnh_{n}\in\mathcal{H}_{n}

1αN+1n1αb¯(xn)Vnπγ(hn)1αN+1n1αb¯(xn).\frac{1-\alpha^{N+1-n}}{1-\alpha}\underaccent{\bar}{b}(x_{n})\leq V_{n\pi\gamma}(h_{n})\leq\frac{1-\alpha^{N+1-n}}{1-\alpha}\bar{b}(x_{n}).
Proof.

We only prove the first inequality. The second inequality is analogous. We use that

Vnπγ(hn)\displaystyle V_{n\pi\gamma}(h_{n})\geq k=nN1𝔼nhnπγ[ck(Xk,Ak,Xk+1)]+𝔼NhNπγ[cN(XN)]\displaystyle\sum_{k=n}^{N-1}\mathbb{E}_{nh_{n}}^{\pi\gamma}\left[-c_{k}^{-}(X_{k},A_{k},X_{k+1})\right]+\mathbb{E}_{Nh_{N}}^{\pi\gamma}\left[-c_{N}^{-}(X_{N})\right]

and consider the summands individually. We have 𝔼NhNπγ[cN(XN)]𝔼NhNπγ[b¯(XN)]\mathbb{E}_{Nh_{N}}^{\pi\gamma}\left[-c_{N}^{-}(X_{N})\right]\geq\mathbb{E}_{Nh_{N}}^{\pi\gamma}\left[\underaccent{\bar}{b}(X_{N})\right] by Assumption 3.1 (i). Since γk\gamma_{k} is a mapping to 𝒬k+1\mathcal{Q}_{k+1} it follows from the first inequality of Assumption 3.1 (i) that

𝔼nhnπγ[ck(Xk,Ak,Xk+1)]=𝔼khkπγ[ck(Xk,Ak,Xk+1)]xπγ(dhk|Hn=hn)\displaystyle\mathbb{E}_{nh_{n}}^{\pi\gamma}\left[-c_{k}^{-}(X_{k},A_{k},X_{k+1})\right]=\int\mathbb{E}_{kh_{k}}^{\pi\gamma}\left[-c_{k}^{-}(X_{k},A_{k},X_{k+1})\right]\mathbb{Q}_{x}^{\pi\gamma}(\operatorname{d}h_{k}|H_{n}=h_{n})
=ck(xk,ak,Tk(xk,ak,zk+1))γk(dzk+1|hk,ak)πk(dak|hk)xπγ(dhk|Hn=hn)\displaystyle=\iiint-c_{k}^{-}\big{(}x_{k},a_{k},T_{k}(x_{k},a_{k},z_{k+1})\big{)}\,\gamma_{k}(\operatorname{d}z_{k+1}|h_{k},a_{k})\,\pi_{k}(\operatorname{d}a_{k}|h_{k})\ \mathbb{Q}_{x}^{\pi\gamma}(\operatorname{d}h_{k}|H_{n}=h_{n})
b¯(xk)xπγ(dhk|Hn=hn)=𝔼nhnπγ[b¯(Xk)]\displaystyle\geq\int\underaccent{\bar}{b}(x_{k})\ \mathbb{Q}_{x}^{\pi\gamma}(\operatorname{d}h_{k}|H_{n}=h_{n})=\mathbb{E}_{nh_{n}}^{\pi\gamma}\left[\underaccent{\bar}{b}(X_{k})\right]

for k=n,,N1k=n,\dots,N-1. Now, the second inequality of Assumption 3.1 (i) yields for kn+1k\geq n+1

𝔼nhnπγ[b¯(Xk)]=𝔼k1hk1[b¯(Xk)]xπγ(dhk1|Hn=hn)\displaystyle\mathbb{E}_{nh_{n}}^{\pi\gamma}\left[\underaccent{\bar}{b}(X_{k})\right]=\int\mathbb{E}_{k-1h_{k-1}}\left[\underaccent{\bar}{b}(X_{k})\right]\mathbb{Q}_{x}^{\pi\gamma}(\operatorname{d}h_{k-1}|H_{n}=h_{n})
=b¯(Tk1(xk1,ak1,zk))γk1(dzk|hk1,ak1)πk1(dak1|hk1)xπγ(dhk1|Hn=hn)\displaystyle=\iiint\!\underaccent{\bar}{b}\big{(}T_{k-1}(x_{k-1},a_{k-1},z_{k})\big{)}\,\gamma_{k-1}(\operatorname{d}\!z_{k}|h_{k-1},a_{k-1})\,\pi_{k-1}(\operatorname{d}\!a_{k-1}|h_{k-1})\,\mathbb{Q}_{x}^{\pi\gamma}(\operatorname{d}\!h_{k-1}|H_{n}\!=\!h_{n})
αb¯(xk1)xπγ(dhk1|Hn=hn)=α𝔼nhnπγ[b¯(Xk1)].\displaystyle\geq\alpha\int\underaccent{\bar}{b}(x_{k-1})\mathbb{Q}_{x}^{\pi\gamma}(\operatorname{d}h_{k-1}|H_{n}=h_{n})=\alpha\mathbb{E}_{nh_{n}}^{\pi\gamma}\left[\underaccent{\bar}{b}(X_{k-1})\right].

Iterating this argument we obtain

𝔼nhnπγ[cN(XN)]αNnb¯(xn)and𝔼nhnπγ[ck(Xk,Ak,Xk+1)]αknb¯(xn).\mathbb{E}_{nh_{n}}^{\pi\gamma}\left[-c_{N}^{-}(X_{N})\right]\geq\alpha^{N-n}\underaccent{\bar}{b}(x_{n})\qquad\text{and}\qquad\mathbb{E}_{nh_{n}}^{\pi\gamma}\left[-c_{k}^{-}(X_{k},A_{k},X_{k+1})\right]\geq\alpha^{k-n}\underaccent{\bar}{b}(x_{n}).

Finally, summation over kk yields the assertion. ∎

With the bounding function bb we define the function space

𝔹b:={v:E|v measurable with λ+ s.t. |v(x)|λb(x) for all xE}.\mathbb{B}_{b}:=\left\{v:E\to\mathbb{R}\ |\ v\text{ measurable with }\lambda\in\mathbb{R}_{+}\text{ s.t. }|v(x)|\leq\lambda\,b(x)\text{ for all }x\in E\right\}.

Endowing 𝔹b\mathbb{B}_{b} with the weighted supremum norm

vb:=supxE|v(x)|b(x)\|v\|_{b}:=\sup_{x\in E}\frac{|v(x)|}{b(x)}

makes (𝔹b,b)(\mathbb{B}_{b},\|\cdot\|_{b}) a Banach space, cf. Proposition 7.2.1 in [HernandezLasserre1999].

Having ensured that the value functions are well-defined, we can now proceed to derive the cost iteration. To ease notation we introduce the following operators.

Definition 3.4.

For functions v:n+1v:\mathcal{H}_{n+1}\to\mathbb{R} s.t. v(hn,an,)𝔹bv(h_{n},a_{n},\cdot)\in\mathbb{B}_{b} for all hnn,anDn(xn)h_{n}\in\mathcal{H}_{n},a_{n}\in D_{n}(x_{n}) and n=0,,N1n=0,\ldots,N-1 let

𝒯nπnγnv(hn):=\displaystyle\mathcal{T}_{n\pi_{n}\gamma_{n}}v(h_{n}):=
cn(xn,an,Tn(xn,an,zn+1))+v(hn,an,Tn(xn,an,zn+1))γn(dzn+1|hn,an)πn(dan|hn)\displaystyle\iint c_{n}\big{(}x_{n},a_{n},T_{n}(x_{n},a_{n},z_{n+1})\big{)}+v\big{(}h_{n},a_{n},T_{n}(x_{n},a_{n},z_{n+1})\big{)}\,\gamma_{n}(\operatorname{d}z_{n+1}|h_{n},a_{n})\,\pi_{n}(\operatorname{d}a_{n}|h_{n})
𝒯nπnv(hn):=\displaystyle\mathcal{T}_{n\pi_{n}}v(h_{n}):=
sup𝒬n+1cn(xn,an,Tn(xn,an,zn+1))+v(hn,an,Tn(xn,an,zn+1))(dzn+1)πn(dan|hn)\displaystyle\int\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}\int c_{n}\big{(}x_{n},a_{n},T_{n}(x_{n},a_{n},z_{n+1})\big{)}+v\big{(}h_{n},a_{n},T_{n}(x_{n},a_{n},z_{n+1})\big{)}\,\mathbb{Q}(\operatorname{d}z_{n+1})\,\pi_{n}(\operatorname{d}a_{n}|h_{n})

Note that the operators are monotone in vv.

Proposition 3.5.

Under Assumption 3.1 the value of a policy pair (π,γ)ΠR×Γ(\pi,\gamma)\in\Pi^{R}\times\Gamma can be calculated recursively for n=0,,Nn=0,\dots,N and hnnh_{n}\in\mathcal{H}_{n} as

VNπγ(hN)\displaystyle V_{N\pi\gamma}(h_{N}) =cN(xN),\displaystyle=c_{N}(x_{N}),
Vnπγ(hn)\displaystyle V_{n\pi\gamma}(h_{n}) =𝒯nπnγnVn+1πγ(hn).\displaystyle=\mathcal{T}_{n\pi_{n}\gamma_{n}}V_{n+1\pi\gamma}(h_{n}).
Proof.

The proof is by backward induction. At time NN there is nothing to show. Now assume the assertion holds for n+1n+1, then the tower property of conditional expectation yields for nn

Vnπγ(hn)=𝔼nhnπγ[cn(Xn,An,Xn+1)+𝔼n+1hnAnXn+1πγ[k=n+1N1ck(Xk,Ak,Xk+1)+cN(XN)]]\displaystyle V_{n\pi\gamma}(h_{n})=\mathbb{E}_{nh_{n}}^{\pi\gamma}\left[c_{n}(X_{n},A_{n},X_{n+1})+\mathbb{E}_{n+1h_{n}A_{n}X_{n+1}}^{\pi\gamma}\left[\sum_{k=n+1}^{N-1}c_{k}(X_{k},A_{k},X_{k+1})+c_{N}(X_{N})\right]\right]
=cn(xn,an,x)+𝔼n+1hnanxπγ[k=n+1N1ck(Xk,Ak,Xk+1)+cN(XN)]Qnγ(dx|hn,an)πn(dan|hn)\displaystyle=\iint c_{n}(x_{n},a_{n},x^{\prime})+\mathbb{E}_{n+1h_{n}a_{n}x^{\prime}}^{\pi\gamma}\left[\sum_{k=n+1}^{N-1}c_{k}(X_{k},A_{k},X_{k+1})+c_{N}(X_{N})\right]Q_{n}^{\gamma}(\operatorname{d}x^{\prime}|h_{n},a_{n})\,\pi_{n}(\operatorname{d}a_{n}|h_{n})
=𝒯nπnγnVn+1πγ(hn)\displaystyle=\mathcal{T}_{n\pi_{n}\gamma_{n}}V_{n+1\pi\gamma}(h_{n})

for all hnnh_{n}\in\mathcal{H}_{n}. ∎
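For a finite toy model the cost iteration of Proposition 3.5 can be carried out directly. The following sketch (illustrative only; the states, actions, disturbances, costs, ambiguity set and the two policies are ad-hoc choices, not data from the paper) evaluates a fixed Markov policy of the controller against a fixed Markov policy of nature via the backward recursion V_{nπγ} = 𝒯_{nπ_nγ_n} V_{n+1πγ}.

```python
import numpy as np

states, actions, zs, N = [0, 1], [0, 1], [0, 1], 3     # E, A, Z, horizon (toy data)
Q_set = [np.array([0.5, 0.5]), np.array([0.8, 0.2])]   # finite ambiguity set Q_{n+1}

def T(n, x, a, z):                 # transition function
    return (x + a + z) % 2

def c(n, x, a, x_next):            # one-stage cost
    return 1.0 * x_next + 0.5 * a

def c_N(x):                        # terminal cost
    return float(x)

def d(n, x):                       # controller's Markov decision rule d_n
    return 0

def gamma(n, x, a):                # nature's Markov decision rule gamma_n
    return Q_set[1]

# cost iteration: V_N = c_N and V_n = T_{n d_n gamma_n} V_{n+1}
V = {x: c_N(x) for x in states}
for n in reversed(range(N)):
    V_new = {}
    for x in states:
        a = d(n, x)
        Q = gamma(n, x, a)
        V_new[x] = sum(Q[i] * (c(n, x, a, T(n, x, a, z)) + V[T(n, x, a, z)])
                       for i, z in enumerate(zs))
    V = V_new

print(V)   # V_{0 pi gamma}(x) for both initial states
```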

Now, we evaluate a policy of the controller under the worst-case scenario regarding nature’s reaction. We define the robust value of a policy πΠR\pi\in\Pi^{R} at time n=0,,N1n=0,\dots,N-1 as

Vnπ(hn)=supγΓVnπγ(hn),hnn.V_{n\pi}(h_{n})=\sup_{\gamma\in\Gamma}V_{n\pi\gamma}(h_{n}),\quad h_{n}\in\mathcal{H}_{n}.

To minimize this quantity is the controller’s optimization objective. For the robust policy value, a cost iteration holds, too. With regard to a policy of nature this is a Bellman equation given a fixed policy of the controller.

Theorem 3.6.

Let Assumptions 2.1, 3.1 be satisfied.

  • a)

    The robust value of a policy πΠR\pi\in\Pi^{R} is a measurable function of hnnh_{n}\in\mathcal{H}_{n} for n=0,,N1n=0,\dots,N-1. It can be calculated recursively as

    VNπ(hN)\displaystyle V_{N\pi}(h_{N}) =cN(xN),\displaystyle=c_{N}(x_{N}),
    Vnπ(hn)\displaystyle V_{n\pi}(h_{n}) =𝒯nπnVn+1π(hn)\displaystyle=\mathcal{T}_{n\pi_{n}}V_{n+1\pi}(h_{n})
  • b)

If the ambiguity sets 𝒬n+1\mathcal{Q}_{n+1} are weak* closed, there exist maximizing decision rules γn\gamma_{n}^{*} of nature for all nn. Each sequence of such decision rules γ=(γ0,,γN1)Γ\gamma^{*}=(\gamma_{0}^{*},\dots,\gamma_{N-1}^{*})\in\Gamma is an optimal response of nature to the controller’s policy, i.e. Vnπ=VnπγV_{n\pi}=V_{n\pi\gamma^{*}}, n=0,,N1.n=0,\dots,N-1.

Proof.

The proof is by backward induction. At time NN there is nothing to show. Now assume the assertion holds at time n+1n+1, i.e. that Vn+1πV_{n+1\pi} is measurable and that for every ϵ>0\epsilon>0 there exists an ϵ2\frac{\epsilon}{2}-optimal strategy γ^=(γ^n+1,,γ^N1)\hat{\gamma}=(\hat{\gamma}_{n+1},\dots,\hat{\gamma}_{N-1}) of nature. By Proposition 3.5 we have at nn

Vnπ(hn)\displaystyle V_{n\pi}(h_{n}) =\displaystyle= supγΓVnπγ(hn)=supγΓ𝒯nπnγnVn+1πγ(hn)supγΓ𝒯nπnγnVn+1π(hn)𝒯nπnVn+1π(hn)\displaystyle\sup_{\gamma\in\Gamma}V_{n\pi\gamma}(h_{n})=\sup_{\gamma\in\Gamma}\mathcal{T}_{n\pi_{n}\gamma_{n}}V_{n+1\pi\gamma}(h_{n})\leq\sup_{\gamma\in\Gamma}\mathcal{T}_{n\pi_{n}\gamma_{n}}V_{n+1\pi}(h_{n})\leq\mathcal{T}_{n\pi_{n}}V_{n+1\pi}(h_{n}) (3.1)
\displaystyle\leq 𝒯nπnγ^nVn+1π(hn)+ε2𝒯nπnγ^nVn+1πγ^(hn)+ε=Vnπγ^(hn)+ϵVnπ(hn)+ϵ\displaystyle\!\!\!\mathcal{T}_{n\pi_{n}\hat{\gamma}_{n}}V_{n+1\pi}(h_{n})+\frac{\varepsilon}{2}\leq\mathcal{T}_{n\pi_{n}\hat{\gamma}_{n}}V_{n+1\pi\hat{\gamma}}(h_{n})\!+\!\varepsilon=V_{n\pi\hat{\gamma}}(h_{n})\!+\!\epsilon\leq V_{n\pi}(h_{n})+\epsilon

where γ^n:n×A𝒬n+1\hat{\gamma}_{n}:\mathcal{H}_{n}\times A\to\mathcal{Q}_{n+1} is a measurable ϵ2\frac{\epsilon}{2}-maximizer.

Since ϵ>0\epsilon>0 is arbitrary, equality holds. It remains to show the measurability of the outer integrand after the second inequality and the existence of an ϵ2\frac{\epsilon}{2}-maximizer. This follows from the optimal measurable selection theorem in [Rieder1978]: To see this, first note that the function

f(hn,an,)=cn(xn,an,Tn(xn,an,Zn+1))+Vn+1π(hn,an,Tn(xn,an,Zn+1))d,f(h_{n},a_{n},\mathbb{Q})=\int c_{n}\big{(}x_{n},a_{n},T_{n}(x_{n},a_{n},Z_{n+1})\big{)}+V_{n+1\pi}\big{(}h_{n},a_{n},T_{n}(x_{n},a_{n},Z_{n+1})\big{)}\operatorname{d}\mathbb{Q},

is jointly measurable due to Lemma 4.51 in [AliprantisBorder2006]. Consequently,

\left\{(h_{n},a_{n},\mathbb{Q})\in\mathcal{H}_{n}\times A\times\mathcal{Q}_{n+1}:f(h_{n},a_{n},\mathbb{Q})\geq\eta\right\}\in\left\{S\times Q:S\in\mathfrak{B}(\mathcal{H}_{n}\times A),\ Q\subseteq\mathcal{Q}_{n+1}\right\}.

for every η\eta\in\mathbb{R}. The right hand side is a selection class. Obviously, it holds

\mathcal{H}_{n}\times A\times\mathcal{Q}_{n+1}\in\left\{S\times Q:S\in\mathfrak{B}(\mathcal{H}_{n}\times A),\ Q\subseteq\mathcal{Q}_{n+1}\right\}.

Now, Theorem 3.2 in [Rieder1978] yields that

n×A(hn,an)sup𝒬n+1f(hn,an,)\mathcal{H}_{n}\times A\ni(h_{n},a_{n})\mapsto\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}f(h_{n},a_{n},\mathbb{Q})

is measurable and for every ϵ>0\epsilon>0 there exists an ϵ\epsilon-maximizer γn:n×A𝒬n+1\gamma_{n}:\mathcal{H}_{n}\times A\to\mathcal{Q}_{n+1}.

For part b) we have to show that there exists not only an ϵ\epsilon-maximizer γ^n\hat{\gamma}_{n} at (3.1) but an actual maximizer. This follows from Theorem 3.7 in [Rieder1978]. The additional requirements are that 𝒬n+1\mathcal{Q}_{n+1} is a separable metrizable space, which holds by Lemma 2.3, and that the set {𝒬n+1:f(hn,an,)η}\{\mathbb{Q}\in\mathcal{Q}_{n+1}:f(h_{n},a_{n},\mathbb{Q})\geq\eta\} is compact for every η\eta\in\mathbb{R} and (hn,an)n×A(h_{n},a_{n})\in\mathcal{H}_{n}\times A. By assumption, 𝒬n+1\mathcal{Q}_{n+1} is weak* closed and therefore compact by Lemma 2.3. Since due to Assumption 3.1 the integrand of ff is in LpL^{p}, the mapping f(hn,an,)\mathbb{Q}\mapsto f(h_{n},a_{n},\mathbb{Q}) is continuous for every (hn,an)n×A(h_{n},a_{n})\in\mathcal{H}_{n}\times A. Hence, {𝒬n+1:f(hn,an,)η}\{\mathbb{Q}\in\mathcal{Q}_{n+1}:f(h_{n},a_{n},\mathbb{Q})\geq\eta\} is closed as the preimage of a closed set. Since closed subsets of compact sets are compact, the proof is complete. ∎

So far we have only considered the case where the ambiguity set may depend on the time index but not on the state of the decision process. This covers many applications, e.g. the connection to risk measures (see Section 6). Moreover, as long as the ambiguity set is independent of the state, we can allow any norm-bounded ambiguity set by using the optimal selection theorem by [Rieder1978] in Theorem 3.6. If the ambiguity set is weak* closed, the following generalization is possible.

Corollary 3.7.

For n=0,,N1n=0,\dots,N-1 let 𝒬n\mathcal{Q}_{n} be weak* closed and

Dn(x,a)𝒬n+1(x,a)𝒬n+1D_{n}\ni(x,a)\mapsto\mathcal{Q}_{n+1}(x,a)\subseteq\mathcal{Q}_{n+1}

be a non-empty and closed-valued mapping giving the possible probability measures at time nn in state xEx\in E if the controller chooses aDn(x)a\in D_{n}(x). Then the assertion of Theorem 3.6 b) still holds.

Proof.

We have to show the existence of a measurable maximizer at (3.1). The rest of the proof is not affected. Since 𝒬n+1\mathcal{Q}_{n+1} is weak* closed, it is a compact Borel space by Lemma 2.3. Consequently, the set-valued mapping 𝒬n+1()\mathcal{Q}_{n+1}(\cdot) is compact-valued, as closed subsets of compact sets are compact. In the proof of the theorem it has been shown that the function f(hn,an,)f(h_{n},a_{n},\mathbb{Q}) is jointly measurable and continuous in \mathbb{Q}. Hence, Proposition D.5 in [HernandezLasserre1996] yields the existence of a measurable maximizer. ∎

State-dependent ambiguity sets are one way to make the distributionally robust optimality criterion less conservative. For example, they allow the incorporation of learning about the unknown disturbance distribution. We refer the reader to [Bielecki2019] for an interesting example where the ambiguity sets are recursive confidence regions for an unknown parameter of the disturbance distribution.

Let us now consider specifically deterministic Markov policies πΠM\pi\in\Pi^{M} of the controller. The subspace

𝔹={v𝔹b:v lower semicontinuous}\mathbb{B}=\{v\in\mathbb{B}_{b}:\ v\text{ lower semicontinuous}\}

of (𝔹b,b)(\mathbb{B}_{b},\|\cdot\|_{b}) turns out to be the set of possible value functions under such policies. (𝔹,b)(\mathbb{B},\|\cdot\|_{b}) is a complete metric space since the subset of lower semicontinuous functions is closed in (𝔹b,b)(\mathbb{B}_{b},\|\cdot\|_{b}). We define the following operators on 𝔹b\mathbb{B}_{b} and especially on 𝔹\mathbb{B}.

Definition 3.8.

For v𝔹bv\in\mathbb{B}_{b} and Markov decision rules d:EAd:E\to A, γ:Dn𝒬n+1\gamma:D_{n}\to\mathcal{Q}_{n+1} we define

Lnv(x,a,)\displaystyle L_{n}v(x,a,\mathbb{Q}) =cn(x,a,Tn(x,a,Zn+1))+v(Tn(x,a,Zn+1))d,\displaystyle=\int c_{n}(x,a,T_{n}(x,a,Z_{n+1}))+v(T_{n}(x,a,Z_{n+1}))\operatorname{d}\mathbb{Q},\quad (x,a,)Dn×𝒬n+1,\displaystyle(x,a,\mathbb{Q})\in D_{n}\times\mathcal{Q}_{n+1},
𝒯ndγv(x)\displaystyle\mathcal{T}_{nd\gamma}v(x) =Lnv(x,d(x),γ(x,d(x))),\displaystyle=L_{n}v\big{(}x,d(x),\gamma(x,d(x))\big{)}, xE,\displaystyle x\in E,
𝒯ndv(x)\displaystyle\mathcal{T}_{nd}v(x) =sup𝒬n+1Lnv(x,d(x),),\displaystyle=\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}L_{n}v(x,d(x),\mathbb{Q}), xE,\displaystyle x\in E,
𝒯nv(x)\displaystyle\mathcal{T}_{n}v(x) =infaDn(x)sup𝒬n+1Lnv(x,a,),\displaystyle=\inf_{a\in D_{n}(x)}\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}L_{n}v(x,a,\mathbb{Q}), xE.\displaystyle x\in E.

Note that the operators are monotone in vv. Under Markov policies π=(d0,,dN1)ΠM\pi=(d_{0},\dots,d_{N-1})\in\Pi^{M} of the controller and γ=(γ0,,γN1)ΓM\gamma=(\gamma_{0},\dots,\gamma_{N-1})\in\Gamma^{M} of nature, the value iteration can be expressed with the help of these operators. In order to distinguish from the history-dependent case, we denote the value function here with JJ. Setting JNπγ(x)=cN(x),xE,J_{N\pi\gamma}(x)=c_{N}(x),\ x\in E, we obtain for n=0,,N1n=0,\dots,N-1 and xEx\in E with Proposition 3.5

Jnπγ(x)\displaystyle J_{n\pi\gamma}(x) =cn(x,dn(x),Tn(x,dn(x),zn+1))\displaystyle=\int c_{n}\big{(}x,d_{n}(x),T_{n}(x,d_{n}(x),z_{n+1})\big{)}
+Jn+1πγ(Tn(x,dn(x),zn+1))γn(dzn+1|x,dn(x))=𝒯ndnγnJn+1πγ(x).\displaystyle\phantom{=\int}\ +J_{n+1\pi\gamma}\big{(}T_{n}(x,d_{n}(x),z_{n+1})\big{)}\,\gamma_{n}(\operatorname{d}z_{n+1}|x,d_{n}(x))=\mathcal{T}_{nd_{n}\gamma_{n}}J_{n+1\pi\gamma}(x).

We define the robust value of Markov policy πΠM\pi\in\Pi^{M} of the controller as

Jnπ(x)=supγΓMJnπγ(x),xE.\displaystyle J_{n\pi}(x)=\sup_{\gamma\in\Gamma^{M}}J_{n\pi\gamma}(x),\qquad x\in E.

When calculating the robust value of a Markov policy of the controller it is no restriction to take the supremum only over Markov policies of nature.

Corollary 3.9.

Let πΠM\pi\in\Pi^{M}. It holds for n=0,,Nn=0,\dots,N that Jnπ(xn)=Vnπ(hn),hnn.J_{n\pi}(x_{n})=V_{n\pi}(h_{n}),\ h_{n}\in\mathcal{H}_{n}. I.e., we have the robust value iteration

Jnπ(x)\displaystyle J_{n\pi}(x) =sup𝒬n+1cn(x,dn(x),Tn(x,dn(x),Zn+1))+Jn+1π(Tn(x,dn(x),Zn+1))d\displaystyle=\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}\int c_{n}\big{(}x,d_{n}(x),T_{n}(x,d_{n}(x),Z_{n+1})\big{)}+J_{n+1\pi}\big{(}T_{n}(x,d_{n}(x),Z_{n+1})\big{)}\operatorname{d}\mathbb{Q}
=𝒯ndnJn+1π(x).\displaystyle=\mathcal{T}_{nd_{n}}J_{n+1\pi}(x).

Moreover, there exists a Markovian ϵ\epsilon-optimal policy of nature and if the ambiguity sets 𝒬n+1\mathcal{Q}_{n+1} are all weak* closed even a Markovian optimal policy.

Proof.

For n=Nn=N the assertion is trivial. Assuming it holds at time n+1n+1, it follows at time nn from Theorem 3.6 that

Vnπ(hn)\displaystyle V_{n\pi}(h_{n}) =sup𝒬n+1cn(x,dn(x),Tn(x,dn(x),Zn+1))+Jn+1π(Tn(x,dn(x),Zn+1))d\displaystyle=\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}\int c_{n}\big{(}x,d_{n}(x),T_{n}(x,d_{n}(x),Z_{n+1})\big{)}+J_{n+1\pi}\big{(}T_{n}(x,d_{n}(x),Z_{n+1})\big{)}\operatorname{d}\mathbb{Q}
=Jnπ(xn).\displaystyle=J_{n\pi}(x_{n}).

Replacing n×A\mathcal{H}_{n}\times A by DnD_{n}, the existence of (ϵ\epsilon-) optimal policies is guaranteed by the same arguments as in the proof of Theorem 3.6. ∎

Let us further define for n=0,,Nn=0,\dots,N the Markovian value function

Jn(x)=infπΠMsupγΓMJnπγ(x),xE.\displaystyle J_{n}(x)=\inf_{\pi\in\Pi^{M}}\sup_{\gamma\in\Gamma^{M}}J_{n\pi\gamma}(x),\qquad x\in E.

The next result shows that JnJ_{n} satisfies a Bellman equation and proves that an optimal policy of the controller exists and is Markov.

Theorem 3.10.

Let Assumptions 2.1, 3.1 be satisfied.

  • a)

    For n=0,,N1n=0,\dots,N-1, it suffices to consider deterministic Markov policies both for the controller and nature, i.e. Vn(hn)=Jn(xn)V_{n}(h_{n})=J_{n}(x_{n}) for all hnnh_{n}\in\mathcal{H}_{n}. Moreover, Jn𝔹J_{n}\in\mathbb{B} and satisfies the Bellman equation

    JN(x)\displaystyle J_{N}(x) =cN(x),\displaystyle=c_{N}(x),
    Jn(x)\displaystyle J_{n}(x) =𝒯nJn+1(x),xE.\displaystyle=\mathcal{T}_{n}J_{n+1}(x),\qquad x\in E.

    For n=0,,N1n=0,\dots,N-1 there exist Markov minimizers dnd_{n}^{*} of Jn+1J_{n+1} and every sequence of such minimizers constitutes an optimal policy π=(d0,,dN1)\pi^{*}=(d_{0}^{*},\dots,d_{N-1}^{*}) of the controller.

  • b)

If the ambiguity sets 𝒬n\mathcal{Q}_{n} are weak* closed, there exist maximizing decision rules γn\gamma_{n}^{*} of nature, i.e. Jn=𝒯ndnγnJn+1J_{n}=\mathcal{T}_{nd_{n}^{*}\gamma_{n}^{*}}J_{n+1} and every sequence of maximizers induces an optimal policy of nature γ=(γ0,,γN1)ΓM\gamma^{*}=(\gamma_{0}^{*},\dots,\gamma_{N-1}^{*})\in\Gamma^{M} satisfying Jn=JnπγJ_{n}=J_{n\pi^{*}\gamma^{*}}.

Proof.

We proceed by backward induction. At time NN we have VN=JN=cN𝔹V_{N}=J_{N}=c_{N}\in\mathbb{B} due to semicontinuity and Assumption 3.1 (i). Now assume the assertion holds at time n+1n+1. Using the robust value iteration (Corollary 3.9) we obtain at time nn:

Vn(hn)\displaystyle V_{n}(h_{n}) =\displaystyle= infπΠRsupγΓVnπγ(hn)=infπΠRVnπ(hn)=infπΠR𝒯nπnVn+1π(hn)\displaystyle\inf_{\pi\in\Pi^{R}}\sup_{\gamma\in\Gamma}V_{n\pi\gamma}(h_{n})=\inf_{\pi\in\Pi^{R}}V_{n\pi}(h_{n})=\inf_{\pi\in\Pi^{R}}\mathcal{T}_{n\pi_{n}}V_{n+1\pi}(h_{n})
\displaystyle\geq infπΠR𝒯nπnVn+1(hn)=infπΠRsup𝒬n+1LnJn+1(xn,an,)πn(dan|hn)\displaystyle\inf_{\pi\in\Pi^{R}}\mathcal{T}_{n\pi_{n}}V_{n+1}(h_{n})=\inf_{\pi\in\Pi^{R}}\int\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}L_{n}J_{n+1}(x_{n},a_{n},\mathbb{Q})\pi_{n}(\operatorname{d}a_{n}|h_{n})
\displaystyle\geq infanDn(xn)sup𝒬n+1LnJn+1(xn,an,)=𝒯nJn+1(xn).\displaystyle\inf_{a_{n}\in D_{n}(x_{n})}\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}L_{n}J_{n+1}(x_{n},a_{n},\mathbb{Q})=\mathcal{T}_{n}J_{n+1}(x_{n}).

Here, objective and constraint depend on the history of the process only through xnx_{n}. Thus, given the existence of a minimizing Markov decision rule dnd_{n}^{*}, the last expression is equal to 𝒯ndnJn+1(xn)\mathcal{T}_{nd_{n}^{*}}J_{n+1}(x_{n}). Again by the induction hypothesis, there exists an optimal Markov policy π=(dn+1,,dN1)ΠM\pi^{*}=(d_{n+1}^{*},\dots,d_{N-1}^{*})\in\Pi^{M} such that

Vn(hn)𝒯nJn+1(xn)=𝒯ndnJn+1π(xn)=Jnπ(xn)Jn(xn)Vn(hn).\displaystyle V_{n}(h_{n})\geq\mathcal{T}_{n}J_{n+1}(x_{n})=\mathcal{T}_{nd_{n}^{*}}J_{n+1\pi^{*}}(x_{n})=J_{n\pi^{*}}(x_{n})\geq J_{n}(x_{n})\geq V_{n}(h_{n}). (3.2)

It remains to show the existence of a minimizing Markov decision rule dnd_{n}^{*} and that Jn𝔹J_{n}\in\mathbb{B}. We want to apply Proposition 2.4.3 in [BaeuerleRieder2011]. The set-valued mapping ExDn(x)E\ni x\mapsto D_{n}(x) is compact-valued and upper semicontinuous. Next, we show that Dn(x,a)sup𝒬n+1Lnv(x,a,)D_{n}\ni(x,a)\mapsto\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}L_{n}v(x,a,\mathbb{Q}) is lower semicontinuous for every v𝔹v\in\mathbb{B}. Let {(xk,ak)}k\{(x_{k},a_{k})\}_{k\in\mathbb{N}} be a convergent sequence in DnD_{n} with limit (x,a)Dn(x^{*},a^{*})\in D_{n}. The function

(x,a)cn(x,a,Tn(x,a,z))+v(Tn(x,a,z))\displaystyle(x,a)\mapsto c_{n}\big{(}x,a,T_{n}(x,a,z)\big{)}+v\big{(}T_{n}(x,a,z)\big{)} (3.3)

is lower semicontinuous for every z𝒵z\in\mathcal{Z}. The sequence of random variables {Ck}k\{C_{k}\}_{k\in\mathbb{N}} given by

Ck:=cn(xk,ak,Tn(xk,ak,Zn+1))+v(Tn(xk,ak,Zn+1))C_{k}:=c_{n}\big{(}x_{k},a_{k},T_{n}(x_{k},a_{k},Z_{n+1})\big{)}+v\big{(}T_{n}(x_{k},a_{k},Z_{n+1})\big{)}

is bounded by some C¯Lp(Ωn+1,𝒜n+1,n+1)\bar{C}\in L^{p}(\Omega_{n+1},\mathcal{A}_{n+1},\mathbb{P}_{n+1}) by Lemma LABEL:thm:abstract_robust_Lpbound. Now, Fatou’s Lemma yields for every 𝒬n+1\mathbb{Q}\in\mathcal{Q}_{n+1}

lim infkLnv(xk,ak,)\displaystyle\liminf_{k\to\infty}L_{n}v(x_{k},a_{k},\mathbb{Q}) =lim infk𝔼[cn(xk,ak,Tn(xk,ak,Zn+1))+v(Tn(xk,ak,Zn+1))]\displaystyle=\liminf_{k\to\infty}\mathbb{E}^{\mathbb{Q}}\Big{[}c_{n}\big{(}x_{k},a_{k},T_{n}(x_{k},a_{k},Z_{n+1})\big{)}+v\big{(}T_{n}(x_{k},a_{k},Z_{n+1})\big{)}\Big{]}
𝔼[lim infkcn(xk,ak,Tn(xk,ak,Zn+1))+v(Tn(xk,ak,Zn+1))]\displaystyle\geq\mathbb{E}^{\mathbb{Q}}\Big{[}\liminf_{k\to\infty}c_{n}\big{(}x_{k},a_{k},T_{n}(x_{k},a_{k},Z_{n+1})\big{)}+v\big{(}T_{n}(x_{k},a_{k},Z_{n+1})\big{)}\Big{]}
𝔼[cn(x,a,Tn(x,a,Zn+1))+v(Tn(x,a,Zn+1))]=Lnv(x,a,),\displaystyle\geq\mathbb{E}^{\mathbb{Q}}\left[c_{n}\big{(}x^{*},a^{*},T_{n}(x^{*},a^{*},Z_{n+1})\big{)}+v\big{(}T_{n}(x^{*},a^{*},Z_{n+1})\big{)}\right]=L_{n}v(x^{*},a^{*},\mathbb{Q}),

where the last inequality is by the lower semicontinuity of (3.3). Thus, the function Dn(x,a)Lnv(x,a,)D_{n}\ni(x,a)\mapsto L_{n}v(x,a,\mathbb{Q}) is lower semicontinuous for every 𝒬n+1\mathbb{Q}\in\mathcal{Q}_{n+1} and consequently Dn(x,a)sup𝒬n+1Lnv(x,a,)D_{n}\ni(x,a)\mapsto\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}L_{n}v(x,a,\mathbb{Q}) is lower semicontinuous as a supremum of lower semicontinuous functions. Now, Proposition 2.4.3 in [BaeuerleRieder2011] yields the existence of a minimizing Markov decision rule dnd_{n}^{*} at (3.2) and that Jn=𝒯nJn+1J_{n}=\mathcal{T}_{n}J_{n+1} is lower semicontinuous. Furthermore, JnJ_{n} is bounded by λb\lambda b for some λ+\lambda\in\mathbb{R}_{+} due to Lemma 3.3. Thus Jn𝔹J_{n}\in\mathbb{B}. Part b) follows from Theorem 3.6 b). ∎
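To illustrate the backward induction behind Theorem 3.10, the following sketch solves the Bellman equation J_n = 𝒯_n J_{n+1} on a small finite model: the supremum is taken over a finite ambiguity set and the infimum over the finite action set, yielding J_n together with a minimizing Markov decision rule d_n^*. All model data below (states, actions, disturbances, costs, ambiguity set) are ad-hoc choices for illustration and not part of the paper.

```python
import numpy as np

states, actions, zs, N = [0, 1], [0, 1], [0, 1], 3      # E, A, Z, horizon (toy data)
Q_set = [np.array([0.5, 0.5]), np.array([0.8, 0.2])]    # finite ambiguity set Q_{n+1}

T = lambda n, x, a, z: (x + a + z) % 2                  # transition function
c = lambda n, x, a, x_next: 1.0 * x_next + 0.5 * a      # one-stage cost
c_N = lambda x: float(x)                                # terminal cost

def L(n, J_next, x, a, Q):
    # L_n v(x, a, Q) = E_Q[ c_n(x, a, T_n(x, a, Z)) + v(T_n(x, a, Z)) ]
    return sum(Q[i] * (c(n, x, a, T(n, x, a, z)) + J_next[T(n, x, a, z)])
               for i, z in enumerate(zs))

J = {x: c_N(x) for x in states}           # J_N = c_N
policy = {}                               # minimizing Markov decision rules d_n^*
for n in reversed(range(N)):
    J_new, d_star = {}, {}
    for x in states:
        # Bellman equation: J_n(x) = min_a max_Q L_n J_{n+1}(x, a, Q)
        robust = {a: max(L(n, J, x, a, Q) for Q in Q_set) for a in actions}
        d_star[x] = min(robust, key=robust.get)
        J_new[x] = robust[d_star[x]]
    J, policy[n] = J_new, d_star

print(J)        # J_0(x) for both initial states
print(policy)   # optimal Markov policy (d_0^*, ..., d_{N-1}^*)
```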

4. Real Line as State Space

The model has been introduced in Section 2 with a general Borel space as state space. In order to solve the distributionally robust cost minimization problem in Section 3 we needed a continuous transition function despite having a semicontinuous model, cf. the proof of Theorem 3.10. This assumption on the transition function can be relaxed to semicontinuity if the state space is the real line and the transition and one-stage cost function have some form of monotonicity. In some applications this relaxation of the continuity assumption is relevant. Furthermore, a real state space can be exploited to address the distributionally robust cost minimization problem with more specific techniques. In addition to Assumption 3.1, we make the following assumptions in this section.

Assumption 4.1.
  • (i)

    The state space is the real line E=E=\mathbb{R}.

  • (ii)

    The model data satisfies the Continuity and Compactness Assumptions 2.1 with the transition function TnT_{n} being lower semicontinuous instead of continuous.

  • (iii)

    The model data has the following monotonicity properties:

    • (iii a)

      The set-valued mapping xDn(x)\mathbb{R}\ni x\mapsto D_{n}(x) is decreasing.

    • (iii b)

      The function xTn(x,a,z)\mathbb{R}\ni x\mapsto T_{n}(x,a,z) is increasing for all aDn(x),z𝒵a\in D_{n}(x),z\in\mathcal{Z}.

    • (iii c)

      The function 2(x,x)cn(x,a,x)\mathbb{R}^{2}\ni(x,x^{\prime})\mapsto c_{n}(x,a,x^{\prime}) is increasing for all aDn(x)a\in D_{n}(x).

    • (iii d)

      The function xcN(x)\mathbb{R}\ni x\mapsto c_{N}(x) is increasing.

With the real line as state space a simple separation condition is sufficient for Assumption 3.1 (ii).

Corollary 4.2.

Let there be upper semicontinuous functions ϑn,1,ϑn,2:D+\vartheta_{n,1},\vartheta_{n,2}:D\to\mathbb{R}_{+} and measurable functions Θn,1,Θn,2:𝒵+\Theta_{n,1},\Theta_{n,2}:\mathcal{Z}\to\mathbb{R}_{+} which fulfil Θn,1(Zn+1),Θn,2(Zn+1)Lp(Ωn,𝒜n,n)\Theta_{n,1}(Z_{n+1}),\Theta_{n,2}(Z_{n+1})\in L^{p}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}) and

|cn(x,a,Tn(x,a,z))|ϑn,1(x,a)+Θn,1(z),b(Tn(x,a,z))ϑn,2(x,a)+Θn,2(z)\displaystyle|c_{n}(x,a,T_{n}(x,a,z))|\leq\vartheta_{n,1}(x,a)+\Theta_{n,1}(z),\qquad b(T_{n}(x,a,z))\leq\vartheta_{n,2}(x,a)+\Theta_{n,2}(z)

for every (x,a,z)D×𝒵(x,a,z)\in D\times\mathcal{Z}. Then Assumption 3.1 (ii) is satisfied.

Proof.

Let (x¯,a¯)Dn(\bar{x},\bar{a})\in D_{n}. We can choose ϵ>0\epsilon>0 arbitrarily. The set S=[x¯ϵ,x¯+ϵ]×Dn(x¯ϵ)S=[\bar{x}-\epsilon,\bar{x}+\epsilon]\times D_{n}(\bar{x}-\epsilon) is compact w.r.t. the product topology. Moreover, Bϵ(x¯,a¯)DnSB_{\epsilon}(\bar{x},\bar{a})\cap D_{n}\subseteq S since the set-valued mapping Dn()D_{n}(\cdot) is decreasing. Due to upper semicontinuity there exist (xi,ai)S(x_{i},a_{i})\in S such that ϑn,i(xi,ai)=sup(x,a)Sϑn,i(x,a)\vartheta_{n,i}(x_{i},a_{i})=\sup_{(x,a)\in S}\vartheta_{n,i}(x,a), i=1,2i=1,2. Hence, one can define

Θn,ix¯,a¯():=ϑn,i(xi,ai)+Θn,i(),i=1,2\Theta_{n,i}^{\bar{x},\bar{a}}(\cdot):=\vartheta_{n,i}(x_{i},a_{i})+\Theta_{n,i}(\cdot),\quad i=1,2

and Assumption 3.1 (ii) is satisfied. ∎

The question is how replacing Assumption 2.1 (ii) by Assumption 4.1 affects the validity of all previous results. The only result that was proven using the continuity of the transition function TnT_{n} in (x,a)(x,a) and not only its measurability is Theorem 3.10. All other statements are unaffected.

Proposition 4.3.

The assertions of Theorem 3.10 still hold when we replace Assumption 2.1 by Assumption 4.1. Moreover, JnJ_{n} and JJ are increasing. The set of potential value functions can therefore be replaced by

𝔹={v𝔹b:v lower semicontinuous and increasing}.\mathbb{B}=\{v\in\mathbb{B}_{b}:\ v\text{ lower semicontinuous and increasing}\}.
Proof.

In the proof of Theorem 3.10 the continuity of TnT_{n} is used to show that Dn(x,a)Lnv(x,a,)D_{n}\ni(x,a)\mapsto L_{n}v(x,a,\mathbb{Q}) is lower semicontinuous for every v𝔹v\in\mathbb{B} and every 𝒬n+1\mathbb{Q}\in\mathcal{Q}_{n+1}. Due to the monotonicity assumptions

Dn(x,a)cn(x,a,Tn(x,a,zn+1))+v(Tn(x,a,zn+1))D_{n}\ni(x,a)\mapsto c_{n}\big{(}x,a,T_{n}(x,a,z_{n+1})\big{)}+v\big{(}T_{n}(x,a,z_{n+1})\big{)}

is lower semicontinuous for every zn+1𝒵z_{n+1}\in\mathcal{Z}. Now the lower semicontinuity of Dn(x,a)Lnv(x,a,)D_{n}\ni(x,a)\mapsto L_{n}v(x,a,\mathbb{Q}) and the existence of a minimizing decision rule follow as in the proof of Theorem 3.10. The fact that 𝒯nv\mathcal{T}_{n}v is increasing for every v𝔹v\in\mathbb{B} follows as in Theorem 2.4.14 in [BaeuerleRieder2011]. ∎

In the following Section 5 we use a minimax approach as an alternative way to solve the Bellman equation of the distributionally robust cost minimization problem and to study its game theoretical properties. Subsequently in Section 6, we also consider special choices of the ambiguity set which are advantageous for solving the optimization problem.

5. Minimax Approach and Game Theory

We assume E=E=\mathbb{R} as in the last section. Compared to a risk-neutral Markov Decision Model, the Bellman equation of the robust model (see Theorem 3.10) has the additional complication that a supremum over possibly uncountably many expectations needs to be calculated. This can be a quite challenging task. Therefore, it may be advantageous to interchange the infimum and supremum. For instance, in applications it may be possible to infer structural properties of the optimal actions independently from the probability measure \mathbb{Q} after the interchange. Based on the minimax theorem by [Sion1958], cf. Appendix Theorem LABEL:thm:A:minimax, this section presents a criterion under which the interchange of infimum and supremum is possible.

Lemma 5.1.

Let Assumptions 3.1, 4.1 be satisfied. Let AA be a subset of a vector space and DnD_{n} a convex set, and let xcN(x)x\mapsto c_{N}(x), (x,a)Tn(x,a,z)(x,a)\mapsto T_{n}(x,a,z) and (x,a,x)cn(x,a,x)(x,a,x^{\prime})\mapsto c_{n}(x,a,x^{\prime}) be convex. Then the value functions JnJ_{n} and the limit value function JJ are convex.

Proof.

The proof is by backward induction. JN=cNJ_{N}=c_{N} is convex by assumption. Now assume that Jn+1J_{n+1} is convex. Recall that Jn+1J_{n+1} is increasing (Proposition 4.3). Hence, for every z𝒵z\in\mathcal{Z} the function

Dn(x,a)cn(x,a,Tn(x,a,z))+Jn+1(Tn(x,a,z))\displaystyle D_{n}\ni(x,a)\mapsto c_{n}\big{(}x,a,T_{n}(x,a,z)\big{)}+J_{n+1}\big{(}T_{n}(x,a,z)\big{)}

is convex as a composition of an increasing convex with a convex function. By the linearity of expectation,

Dn(x,a)𝔼[cn(x,a,Tn(x,a,Zn+1))+Jn+1(Tn(x,a,Zn+1))]\displaystyle D_{n}\ni(x,a)\mapsto\mathbb{E}^{\mathbb{Q}}\Big{[}c_{n}\big{(}x,a,T_{n}(x,a,Z_{n+1})\big{)}+J_{n+1}\big{(}T_{n}(x,a,Z_{n+1})\big{)}\Big{]} (5.1)

is convex for every 𝒬n+1\mathbb{Q}\in\mathcal{Q}_{n+1}. As the pointwise supremum of a collection of convex functions is convex, we obtain convexity of Dn(x,a)sup𝒬n+1LJn+1(x,a,)D_{n}\ni(x,a)\mapsto\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}LJ_{n+1}(x,a,\mathbb{Q}). Now, Proposition 2.4.18 in [BaeuerleRieder2011] yields the assertion. ∎

The assumptions of Lemma 5.1 are subsequently referred to as a convex model.

Theorem 5.2.

Let Assumptions 3.1, 4.1 be satisfied. In a convex model we have for all n=0,,N1n=0,\dots,N-1

Jn(x)\displaystyle J_{n}(x) =infaDn(x)sup𝒬n+1LnJn+1(x,a,)=sup𝒬n+1infaDn(x)LnJn+1(x,a,),x.\displaystyle=\inf_{a\in D_{n}(x)}\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}L_{n}J_{n+1}(x,a,\mathbb{Q})=\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}\inf_{a\in D_{n}(x)}L_{n}J_{n+1}(x,a,\mathbb{Q}),\qquad x\in\mathbb{R}.
Proof.

Let xx\in\mathbb{R} be fixed and define f:Dn(x)×𝒬n+1f:D_{n}(x)\times\mathcal{Q}_{n+1}\to\mathbb{R},

$f(a,\mathbb{Q})=L_{n}J_{n+1}(x,a,\mathbb{Q})=\mathbb{E}^{\mathbb{Q}}\Big[c_{n}\big(x,a,T_{n}(x,a,Z_{n+1})\big)+J_{n+1}\big(T_{n}(x,a,Z_{n+1})\big)\Big].$

The function ff is convex in aa by (5.1) and linear in \mathbb{Q}, i.e. especially concave. Furthermore, the set Dn(x)D_{n}(x) is compact and it has been shown in the proof of Theorem 3.10 that ff is lower semicontinuous in aa. Hence, the assertion follows from Theorem LABEL:thm:A:minimax a). ∎
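To make the interchange in Theorem 5.2 concrete, the following small numerical sketch (not part of the formal development; all model data are hypothetical) evaluates both iterated optima on a grid for a one-period convex model with $T(x,a,z)=(a-z)^{2}$, $c(x,a,x^{\prime})=x^{\prime}$, $c_{N}\equiv 0$ and $Z\sim\text{Bin}(1,p)$: the resulting objective $f(a,p)=(1-p)a^{2}+p(1-a)^{2}$ is convex in $a$ and linear in $p$, and both sides of the minimax identity come out as $1/4$.

```python
import numpy as np

# Hypothetical one-period convex model (illustration only):
# T(x, a, z) = (a - z)^2, c(x, a, x') = x', c_N = 0, Z ~ Bin(1, p), p in [0, 1].
# Then E_p[cost] = (1 - p) a^2 + p (1 - a)^2 is convex in a and linear in p,
# so Sion's theorem applies and inf_a sup_p equals sup_p inf_a.
a_grid = np.linspace(0.0, 1.0, 2001)
p_grid = np.linspace(0.0, 1.0, 2001)
A, P = np.meshgrid(a_grid, p_grid, indexing="ij")   # A[i, j] = a_i, P[i, j] = p_j

f = (1.0 - P) * A**2 + P * (1.0 - A)**2             # f(a, p)

inf_sup = f.max(axis=1).min()   # inf_a sup_p f(a, p)
sup_inf = f.min(axis=0).max()   # sup_p inf_a f(a, p)
print(inf_sup, sup_inf)         # both are approximately 1/4
```

Note the contrast with Example 5.4 below, where the same disturbance enters through a concave transition function and the identity fails.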

Remark 5.3.

The interchange of infimum and supremum in Theorem 5.2 is based on Sion’s Minimax Theorem LABEL:thm:A:minimax, which requires convexity of the function

$\displaystyle a\mapsto\int c_{n}\big(x,a,T_{n}(x,a,z)\big)+J_{n+1}\big(T_{n}(x,a,z)\big)\,\mathbb{Q}(\operatorname{d}z)$ (5.2)

for every (x,)×𝒬(x,\mathbb{Q})\in\mathbb{R}\times\mathcal{Q}. This can be guaranteed by a convex model (cf. Lemma 5.1) which means that several components of the decision model need to have some convexity property. However, these assumptions are quite restrictive. Resorting to randomized actions is a standard approach to convexify (or more precisely linearize) the function (5.2) without assumptions on the model components. Let 𝒫(Dn(x))\mathcal{P}(D_{n}(x)) be the set of all probability measures on Dn(x)D_{n}(x). Then it follows from Sion’s Theorem LABEL:thm:A:minimax that

infμ𝒫(Dn(x))sup𝒬n+1LnJn+1(x,a,)μ(da)=sup𝒬n+1infμ𝒫(Dn(x))LnJn+1(x,a,)μ(da)\displaystyle\inf_{\mu\in\mathcal{P}(D_{n}(x))}\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}\int L_{n}J_{n+1}(x,a,\mathbb{Q})\mu(da)=\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}\inf_{\mu\in\mathcal{P}(D_{n}(x))}\int L_{n}J_{n+1}(x,a,\mathbb{Q})\mu(da) (5.3)
=sup𝒬n+1infaDn(x)LnJn+1(x,a,).\displaystyle=\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}\inf_{a\in D_{n}(x)}L_{n}J_{n+1}(x,a,\mathbb{Q}).

The last equality holds since acn(x,a,Tn(x,a,z))+Jn+1(Tn(x,a,z))a\mapsto c_{n}\big{(}x,a,T_{n}(x,a,z)\big{)}+J_{n+1}\big{(}T_{n}(x,a,z)\big{)} is lower semicontinuous (cf. the proof of Theorem 3.10) and Dn(x)D_{n}(x) is compact. This appears to be a very elegant solution for the interchange problem, but unfortunately, the Bellman equation of the distributionally robust cost minimization problem (2.6) under a randomized action of the controller is given by

Jn(x)\displaystyle J_{n}(x) =infμ𝒫(Dn(x))sup𝒬n+1cn(x,a,Tn(x,a,z))+Jn+1(Tn(x,a,z))(dz)μ(da)\displaystyle=\inf_{\mu\in\mathcal{P}(D_{n}(x))}\int\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}\int c_{n}\big{(}x,a,T_{n}(x,a,z)\big{)}+J_{n+1}\big{(}T_{n}(x,a,z)\big{)}\,\mathbb{Q}(\operatorname{d}z)\,\mu(\operatorname{d}a) (5.4)
=infaDn(x)sup𝒬n+1LnJn+1(x,a,),\displaystyle=\inf_{a\in D_{n}(x)}\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}L_{n}J_{n+1}(x,a,\mathbb{Q}),

cf. Theorems 3.6 and 3.10. Recall that in our model nature is allowed to react to any realization of the controller's action; this was crucial for obtaining the robust value iteration in Theorem 3.6. In contrast, (5.3) means that nature maximizes knowing only the distribution of the controller's action. Hence, in general (5.3) does not equal (5.4), as the next example shows.

Example 5.4.

In order to see that (5.3) $\neq$ (5.4) and that infimum and supremum cannot be interchanged in general, consider the simple static counterexample $N=1$, $E=\mathbb{R}$, $A=[0,1]$, $D=\mathbb{R}\times A$, $Z\sim\text{Bin}(1,p)$, $p\in[0,1]=\mathcal{Q}$, $T(x,a,z)=-(a-z)^{2}$ and $c(x,a,x^{\prime})=x^{\prime}$. It is readily checked that Assumption 4.1 is satisfied. In particular, one has constant bounding functions. In this example (5.4) equals

infa[0,1]supp[0,1]𝔼p[c(x,a,T(x,a,Z))]=infa[0,1]supp[0,1](1p)a2p(a1)2\displaystyle\inf_{a\in[0,1]}\sup_{p\in[0,1]}\mathbb{E}^{p}\big{[}c(x,a,T(x,a,Z))\big{]}=\inf_{a\in[0,1]}\sup_{p\in[0,1]}-(1-p)a^{2}-p(a-1)^{2}
=supa[0,1]infp[0,1](1p)a2+p(a1)2=supa[0,1]min{a2,(1a)2}=14.\displaystyle=-\sup_{a\in[0,1]}\inf_{p\in[0,1]}(1-p)a^{2}+p(a-1)^{2}=-\sup_{a\in[0,1]}\min\{a^{2},(1-a)^{2}\}=-\frac{1}{4}.

If the controller chooses $\mu\sim\mathcal{U}(0,1)$, then (5.3) must be less than or equal to

supp[0,1]01(1p)a2p(a1)2da\displaystyle\sup_{p\in[0,1]}\int_{0}^{1}-(1-p)a^{2}-p(a-1)^{2}\,\operatorname{d}a =supp[0,1]13(1p)13p=13.\displaystyle=\sup_{p\in[0,1]}-\frac{1}{3}(1-p)-\frac{1}{3}p=-\frac{1}{3}.

Indeed we obtain here $\sup_{p\in[0,1]}\inf_{a\in[0,1]}\mathbb{E}^{p}\big[c(x,a,T(x,a,Z))\big]=-\frac{1}{2}$. The approach of interchanging infimum and supremum through a linearization with randomized actions only works when nature observes the distribution and not the realization of the controller's action. Also note that the situation here is quite different from classical two-person zero-sum games. The fact that infimum and supremum cannot be interchanged is a consequence of the asymmetric mover/follower situation. The example above with $P_{xa}^{p}=(1-p)a^{2}+p(a-1)^{2}$ and $[0,1]$ discretized can also be used as a counterexample within the setting of [NilimGhaoui2005] and [wiesemann2013robust].
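The computations in Example 5.4 can be reproduced on a grid; the following sketch (illustration only, not part of the original argument) recovers the three values $-1/4$, $-1/2$ and the bound $-1/3$ obtained with the uniformly randomized action.

```python
import numpy as np

# Numerical check of Example 5.4 (illustration only): Z ~ Bin(1, p),
# T(x, a, z) = -(a - z)^2, c(x, a, x') = x', hence
# E_p[cost] = -(1 - p) a^2 - p (a - 1)^2.
a = np.linspace(0.0, 1.0, 2001)
p = np.linspace(0.0, 1.0, 2001)
A, P = np.meshgrid(a, p, indexing="ij")
f = -(1.0 - P) * A**2 - P * (A - 1.0)**2

inf_sup = f.max(axis=1).min()       # (5.4): approximately -1/4
sup_inf = f.min(axis=0).max()       # approximately -1/2
# Randomized controller with mu ~ U(0, 1): average f over a for each p.
rand_bound = f.mean(axis=0).max()   # approximately -1/3, an upper bound on (5.3)
print(inf_sup, sup_inf, rand_bound)
```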

As mentioned before, the distributionally robust cost minimization model can be interpreted as a dynamic game with nature as the controller’s opponent. Since nature chooses her action after the controller, observing his action but not being restricted by it, there is a (weak) second-mover advantage by construction of the game. The fact that infimum and supremum in the Bellman equation can be interchanged means that the second-mover advantage vanishes in the special case of a convex model.

If additionally the one-stage ambiguity sets $\mathcal{Q}_{n+1}$ are weak* closed, the conditions of Theorem LABEL:thm:A:minimax b) are fulfilled, too: the ambiguity set is weak* compact by Lemma 2.3, and by Lemma LABEL:thm:abstract_robust_Lpbound we have that $c_{n}\big(x,a,T_{n}(x,a,Z_{n+1})\big)+J_{n+1}\big(T_{n}(x,a,Z_{n+1})\big)$ is in $L^{p}$. Thus, $\mathbb{Q}\mapsto L_{n}J_{n+1}(x,a,\mathbb{Q})$ is weak* continuous for every $(x,a)\in D_{n}$. This yields that $(a,\mathbb{Q})\mapsto L_{n}J_{n+1}(x,a,\mathbb{Q})$ satisfies the minimax equality

minaDn(x)max𝒬n+1LnJn+1(x,a,)=max𝒬n+1minaDn(x)LnJn+1(x,a,)\min_{a\in D_{n}(x)}\max_{\mathbb{Q}\in\mathcal{Q}_{n+1}}L_{n}J_{n+1}(x,a,\mathbb{Q})=\max_{\mathbb{Q}\in\mathcal{Q}_{n+1}}\min_{a\in D_{n}(x)}L_{n}J_{n+1}(x,a,\mathbb{Q})

and Lemma 2.105 in [BarbuPrecupanu2012] implies that for every xx\in\mathbb{R} the function has a saddle point (a,)(a^{*},\mathbb{Q}^{*}), i.e.

LnJn+1(x,a,)LnJn+1(x,a,)LnJn+1(x,a,)L_{n}J_{n+1}(x,a^{*},\mathbb{Q})\leq L_{n}J_{n+1}(x,a^{*},\mathbb{Q}^{*})\leq L_{n}J_{n+1}(x,a,\mathbb{Q}^{*})

for all $a\in D_{n}(x)$ and $\mathbb{Q}\in\mathcal{Q}_{n+1}$. Such a saddle point constitutes a Nash equilibrium in the subgame scenario $X_{n}=x$. We will show that Nash equilibria exist not only in one-stage subgames but also globally.

Before doing so, let us introduce a modification of the game against nature in which nature instead of the controller moves first. Given a policy of nature, the controller faces an arbitrary but fixed probability measure in each scenario $X_{n}=x$. Thus, the inner optimization problem is a risk-neutral MDP and it follows from standard theory that it suffices for the controller to consider deterministic Markov policies. Clearly, the controller's optimal policy will depend on the policy that nature has chosen before; it will turn out to depend pointwise on the actions of nature. To clarify this and for comparability with the original game (2.6), where the controller moves first, we distinguish the following types of Markov strategies of the controller

Π()=\displaystyle\Pi(\mathbb{R})= ΠM={π=(d0,,dN1)|dn:A measurable,dn(x)Dn(x),x}\displaystyle\Pi^{M}=\ \left\{\pi=(d_{0},\dots,d_{N-1})|\,d_{n}:\mathbb{R}\to A\text{ measurable},\,d_{n}(x)\in D_{n}(x),\,x\in\mathbb{R}\right\}
Π(,𝒬)=\displaystyle\Pi(\mathbb{R},\mathcal{Q})= {π=(d0,,dN1)|dn:×𝒬n+1A measurable,dn(x,)Dn(x),x}\displaystyle\ \left\{\pi=(d_{0},\dots,d_{N-1})|\,d_{n}:\mathbb{R}\times\mathcal{Q}_{n+1}\to A\text{ measurable},\,d_{n}(x,\mathbb{Q})\in D_{n}(x),\,x\in\mathbb{R}\right\}

and of nature

Γ()=\displaystyle\Gamma(\mathbb{R})= {γ=(γ0,,γN1)|γn:𝒬n+1 measurable}\displaystyle\ \left\{\gamma=(\gamma_{0},\dots,\gamma_{N-1})|\ \gamma_{n}:\mathbb{R}\to\mathcal{Q}_{n+1}\text{ measurable}\right\}
Γ(,A)=ΓM=\displaystyle\Gamma(\mathbb{R},A)=\Gamma^{M}= {γ=(γ0,,γN1)|γn:×A𝒬n+1 measurable}.\displaystyle\ \left\{\gamma=(\gamma_{0},\dots,\gamma_{N-1})|\ \gamma_{n}:\mathbb{R}\times A\to\mathcal{Q}_{n+1}\text{ measurable}\right\}.

The value JnπγJ_{n\pi\gamma} of a pair of Markov policies (γ,π)Γ()×Π(,𝒬)(\gamma,\pi)\in\Gamma(\mathbb{R})\times\Pi(\mathbb{R},\mathcal{Q}) is defined as in (2.5). The bounds in Lemma 3.3 apply since the proofs do not use properties of the policies. The game under consideration is

J~n(x)=supγΓ()infπΠ(,𝒬)Jnπγ(x),x,n=0,,N.\displaystyle\widetilde{J}_{n}(x)=\sup_{\gamma\in\Gamma(\mathbb{R})}\inf_{\pi\in\Pi(\mathbb{R},\mathcal{Q})}J_{n\pi\gamma}(x),\qquad x\in\mathbb{R},\ n=0,\ldots,N. (5.5)

For clarity, we mark all quantities of the game where nature moves first which differ from the respective quantity of the original game with a tilde. The value of a policy of nature γΓ()\gamma\in\Gamma(\mathbb{R}) is defined as

J~nγ(x)=infπΠ(,𝒬)Jnπγ(x),x.\displaystyle\widetilde{J}_{n\gamma}(x)=\inf_{\pi\in\Pi(\mathbb{R},\mathcal{Q})}J_{n\pi\gamma}(x),\qquad x\in\mathbb{R}.

The Bellman operator on 𝔹\mathbb{B} can be introduced in the usual way:

𝒯~nv(x)\displaystyle\widetilde{\mathcal{T}}_{n}v(x) =sup𝒬n+1infaDn(x)Lnv(x,a,),x.\displaystyle=\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}\inf_{a\in D_{n}(x)}L_{n}v(x,a,\mathbb{Q}),\quad x\in\mathbb{R}.
Theorem 5.5.

Let Assumptions 3.1, 4.1 be satisfied, the ambiguity sets 𝒬n+1\mathcal{Q}_{n+1} be weak* closed and the model be convex.

  • a)

    J~n𝔹\widetilde{J}_{n}\in\mathbb{B} for n=0,,Nn=0,\ldots,N and they satisfy the Bellman equation

    J~N(x)\displaystyle\widetilde{J}_{N}(x) =cN(x),\displaystyle=c_{N}(x),
    J~n(x)\displaystyle\widetilde{J}_{n}(x) =𝒯~nJ~n+1(x),x.\displaystyle=\widetilde{\mathcal{T}}_{n}\widetilde{J}_{n+1}(x),\qquad x\in\mathbb{R}.

There exist decision rules $\tilde{\gamma}_{n}:\mathbb{R}\to\mathcal{Q}_{n+1}$ of nature and $\tilde{d}_{n}:\mathbb{R}\times\mathcal{Q}_{n+1}\to A$ of the controller such that $\widetilde{J}_{n}(x)=\mathcal{T}_{\tilde{d}_{n}\tilde{\gamma}_{n}}\widetilde{J}_{n+1}(x)$, and all sequences of such decision rules induce an optimal policy pair $\tilde{\gamma}=(\tilde{\gamma}_{0},\dots,\tilde{\gamma}_{N-1})\in\Gamma(\mathbb{R})$ and $\tilde{\pi}=(\tilde{d}_{0},\dots,\tilde{d}_{N-1})\in\Pi(\mathbb{R},\mathcal{Q})$, i.e. $\widetilde{J}_{n}=J_{n\tilde{\pi}\tilde{\gamma}}$.

  • b)

    We have that Jn=J~n=J~nγ~J_{n}=\widetilde{J}_{n}=\widetilde{J}_{n\tilde{\gamma}}.

Proof.

We have for all $n$ and $x\in\mathbb{R}$:

$\displaystyle J_{n}(x)=\inf_{\pi\in\Pi(\mathbb{R})}\sup_{\gamma\in\Gamma(\mathbb{R},A)}J_{n\pi\gamma}(x)\geq\inf_{\pi\in\Pi(\mathbb{R})}\sup_{\gamma\in\Gamma(\mathbb{R})}J_{n\pi\gamma}(x)\geq\sup_{\gamma\in\Gamma(\mathbb{R})}\inf_{\pi\in\Pi(\mathbb{R})}J_{n\pi\gamma}(x)\geq\sup_{\gamma\in\Gamma(\mathbb{R})}\inf_{\pi\in\Pi(\mathbb{R},\mathcal{Q})}J_{n\pi\gamma}(x)=\widetilde{J}_{n}(x).$ (5.6)

Let π=(d0,,dN1)Π()\pi^{*}=(d_{0}^{*},\dots,d_{N-1}^{*})\in\Pi(\mathbb{R}) and γ=(γ0,,γN1)Γ(,A)\gamma^{*}=(\gamma_{0}^{*},\dots,\gamma_{N-1}^{*})\in\Gamma(\mathbb{R},A) be optimal strategies for the original game (2.6). The existence is guaranteed by Theorem 3.10. Then γ~=(γ~0,,γ~N1)\tilde{\gamma}=(\tilde{\gamma}_{0},\dots,\tilde{\gamma}_{N-1}) defined by γ~n:=γn(,dn())\tilde{\gamma}_{n}:=\gamma_{n}^{*}(\cdot,d_{n}^{*}(\cdot)) lies in Γ()\Gamma(\mathbb{R}) since the decision rules are well defined as compositions of measurable maps.
We prove all statements simultaneously by induction. In particular we show that there exists a policy π~Π(,𝒬)\tilde{\pi}\in\Pi(\mathbb{R},\mathcal{Q}) such that

Jn=J~nγ~=J~nπ~γ~.J_{n}=\widetilde{J}_{n\tilde{\gamma}}=\widetilde{J}_{n\tilde{\pi}\tilde{\gamma}}.

We show next that JnJ~nJ_{n}\leq\widetilde{J}_{n}. From Theorem 5.2 and the induction hypothesis we obtain

Jn(x)\displaystyle J_{n}(x) =\displaystyle= LnJn+1(x,dn(x),γ~n(x))=infaDn(x)LnJn+1(x,a,γ~n(x))=infaDn(x)LnJ~n+1(x,a,γ~n(x)).\displaystyle L_{n}J_{n+1}(x,d_{n}^{*}(x),\tilde{\gamma}_{n}(x))=\inf_{a\in D_{n}(x)}L_{n}J_{n+1}(x,a,\tilde{\gamma}_{n}(x))=\inf_{a\in D_{n}(x)}L_{n}\widetilde{J}_{n+1}(x,a,\tilde{\gamma}_{n}(x)).

Observe that again by the induction hypothesis

J~nγ~(x)\displaystyle\widetilde{J}_{n\tilde{\gamma}}(x) =\displaystyle= infπΠ(,𝒬)Jnπγ~(x)=infπΠ(,𝒬)LnJn+1πγ~(x,dn(x,γ~n(x)),γ~n(x))\displaystyle\inf_{\pi\in\Pi(\mathbb{R},\mathcal{Q})}J_{n\pi\tilde{\gamma}}(x)=\inf_{\pi\in\Pi(\mathbb{R},\mathcal{Q})}L_{n}J_{n+1\pi\tilde{\gamma}}(x,d_{n}(x,\tilde{\gamma}_{n}(x)),\tilde{\gamma}_{n}(x)) (5.7)
\displaystyle\geq infaDn(x)LnJ~n+1γ~(x,a,γ~n(x))=LnJ~n+1γ~(x,d~n(x,γ~n(x)),γ~n(x))\displaystyle\inf_{a\in D_{n}(x)}L_{n}\widetilde{J}_{n+1\tilde{\gamma}}(x,a,\tilde{\gamma}_{n}(x))=L_{n}\widetilde{J}_{n+1\tilde{\gamma}}(x,\tilde{d}_{n}(x,\tilde{\gamma}_{n}(x)),\tilde{\gamma}_{n}(x))
=\displaystyle= LnJ~n+1π~γ~(x,d~n(x,γ~n(x)),γ~n(x))=J~nπ~γ~J~nγ~.\displaystyle L_{n}\widetilde{J}_{n+1\tilde{\pi}\tilde{\gamma}}(x,\tilde{d}_{n}(x,\tilde{\gamma}_{n}(x)),\tilde{\gamma}_{n}(x))=\widetilde{J}_{n\tilde{\pi}\tilde{\gamma}}\geq\widetilde{J}_{n\tilde{\gamma}}.

We will show below that the existence of a minimizer $\tilde{d}_{n}$ in the second line is justified. Thus we get equality in the expression above. Combining the two previous displays we finally get

Jn(x)\displaystyle J_{n}(x) =\displaystyle= infaDn(x)LnJ~n+1(x,a,γ~n(x))=J~nγ~J~n(x).\displaystyle\inf_{a\in D_{n}(x)}L_{n}\widetilde{J}_{n+1}(x,a,\tilde{\gamma}_{n}(x))=\widetilde{J}_{n\tilde{\gamma}}\leq\widetilde{J}_{n}(x).

In total we have shown that $J_{n}=\widetilde{J}_{n}=\widetilde{J}_{n\tilde{\gamma}}=\widetilde{J}_{n\tilde{\pi}\tilde{\gamma}}$. The joint Bellman equation for the controller and nature, $\widetilde{J}_{n}=\widetilde{\mathcal{T}}_{n}\widetilde{J}_{n+1}$, follows from Theorem 5.2.

Finally, we verify the existence of a minimizer $\tilde{d}_{n}$. Let $\{(x_{k},a_{k},\mathbb{Q}_{k})\}_{k\in\mathbb{N}}$ be a convergent sequence in $\mathbb{R}\times A\times\mathcal{Q}_{n+1}$ with limit $(x^{*},a^{*},\mathbb{Q}^{*})$. By dominated convergence (Lemma LABEL:thm:abstract_robust_Lpbound) and the lower semicontinuity of $D_{n}\ni(x,a)\mapsto c_{n}\big(x,a,T_{n}(x,a,z_{n+1})\big)+v\big(T_{n}(x,a,z_{n+1})\big)$ for any $v\in\mathbb{B}$ we obtain that the increasing sequence of random variables $\{C_{m}\}_{m}$ given by

Cm:=infkmcn(xk,ak,Tn(xk,ak,Zn+1))+v(Tn(xk,ak,Zn+1))C_{m}:=\inf_{k\geq m}c_{n}\big{(}x_{k},a_{k},T_{n}(x_{k},a_{k},Z_{n+1})\big{)}+v\big{(}T_{n}(x_{k},a_{k},Z_{n+1})\big{)}

satisfies

satisfies, using the weak* convergence $\mathbb{Q}_{k}\to\mathbb{Q}^{*}$ and monotone convergence,

$\liminf_{k\to\infty}L_{n}v(x_{k},a_{k},\mathbb{Q}_{k})\geq\lim_{m\to\infty}\mathbb{E}^{\mathbb{Q}^{*}}[C_{m}]\geq\mathbb{E}^{\mathbb{Q}^{*}}\Big[c_{n}\big(x^{*},a^{*},T_{n}(x^{*},a^{*},Z_{n+1})\big)+v\big(T_{n}(x^{*},a^{*},Z_{n+1})\big)\Big]=L_{n}v(x^{*},a^{*},\mathbb{Q}^{*}),$

i.e. $(x,a,\mathbb{Q})\mapsto L_{n}v(x,a,\mathbb{Q})$ is lower semicontinuous on $D_{n}\times\mathcal{Q}_{n+1}$. This justifies the existence of the minimizer $\tilde{d}_{n}$ at (5.7) and shows that

$\mathbb{R}\times\mathcal{Q}_{n+1}\ni(x,\mathbb{Q})\mapsto\inf_{a\in D_{n}(x)}L_{n}\widetilde{J}_{n+1\tilde{\gamma}}(x,a,\mathbb{Q})=L_{n}\widetilde{J}_{n+1\tilde{\gamma}}(x,\tilde{d}_{n}(x,\mathbb{Q}),\mathbb{Q})$

is lower semicontinuous. This completes the proof. ∎

Remark 5.6.

Note that in contrast to classical zero-sum games we obtain the existence of deterministic policies.

As a direct consequence we get the existence of Nash equilibria on the policy level.

Corollary 5.7.

Consider a convex model with weak* closed ambiguity sets $\mathcal{Q}_{n+1}$ and Assumptions 3.1, 4.1 fulfilled. For $x\in\mathbb{R}$ we get

$J_{n}(x)=\min_{\pi\in\Pi(\mathbb{R})}\max_{\gamma\in\Gamma(\mathbb{R},A)}J_{n\pi\gamma}(x)=\max_{\gamma\in\Gamma(\mathbb{R})}\min_{\pi\in\Pi(\mathbb{R},\mathcal{Q})}J_{n\pi\gamma}(x)=\widetilde{J}_{n}(x).$

Consequently, we even obtain

$J_{n}(x)=\min_{\pi\in\Pi(\mathbb{R})}\max_{\gamma\in\Gamma(\mathbb{R})}J_{n\pi\gamma}(x)=\max_{\gamma\in\Gamma(\mathbb{R})}\min_{\pi\in\Pi(\mathbb{R})}J_{n\pi\gamma}(x).$

Proof.

Theorem 5.5 implies equality in (5.6), i.e.

$J_{n}(x)=\min_{\pi\in\Pi(\mathbb{R})}\max_{\gamma\in\Gamma(\mathbb{R},A)}J_{n\pi\gamma}(x)=\inf_{\pi\in\Pi(\mathbb{R})}\sup_{\gamma\in\Gamma(\mathbb{R})}J_{n\pi\gamma}(x)=\sup_{\gamma\in\Gamma(\mathbb{R})}\inf_{\pi\in\Pi(\mathbb{R})}J_{n\pi\gamma}(x)=\max_{\gamma\in\Gamma(\mathbb{R})}\min_{\pi\in\Pi(\mathbb{R},\mathcal{Q})}J_{n\pi\gamma}(x)=\widetilde{J}_{n}(x).$

It remains to find optimal policies attaining the second and fourth equality in this chain. Let $\pi^{*}=(d_{0}^{*},\dots,d_{N-1}^{*})\in\Pi(\mathbb{R})$, $\gamma^{*}=(\gamma_{0}^{*},\dots,\gamma_{N-1}^{*})\in\Gamma(\mathbb{R},A)$ and $\tilde{\gamma}=(\tilde{\gamma}_{0},\dots,\tilde{\gamma}_{N-1})\in\Gamma(\mathbb{R})$, $\tilde{\pi}=(\tilde{d}_{0},\dots,\tilde{d}_{N-1})\in\Pi(\mathbb{R},\mathcal{Q})$ be optimal strategies for the first and the fifth expression, respectively, which exist by Proposition 4.3 and Theorem 5.5. Then

$\inf_{\pi\in\Pi(\mathbb{R})}\sup_{\gamma\in\Gamma(\mathbb{R})}J_{n\pi\gamma}=\sup_{\gamma\in\Gamma(\mathbb{R})}\inf_{\pi\in\Pi(\mathbb{R})}J_{n\pi\gamma}$

is attained by the admissible strategy pair $(\hat{\pi},\hat{\gamma})\in\Pi(\mathbb{R})\times\Gamma(\mathbb{R})$ which can be defined by $\hat{d}_{n}:=d_{n}^{*}$ and $\hat{\gamma}_{n}:=\gamma_{n}^{*}(\cdot,d_{n}^{*}(\cdot))$ or alternatively by $\hat{d}_{n}:=\tilde{d}_{n}(\cdot,\tilde{\gamma}_{n}(\cdot))$ and $\hat{\gamma}_{n}:=\tilde{\gamma}_{n}$ for $n=0,\dots,N-1$. ∎

6. Special Ambiguity Sets

In this section, we consider some special choices for the ambiguity set 𝒬n+1\mathcal{Q}_{n+1} which simplify solving the Markov Decision Problem (2.6) or allow for structural statements about the solution. We assume a real-valued state space here.

Convex hull. It does not change the optimal value of the optimization problems if a given ambiguity set 𝒬n+1\mathcal{Q}_{n+1} is replaced by its convex hull conv(𝒬n+1)\operatorname{conv}(\mathcal{Q}_{n+1}) or its closed convex hull conv¯(𝒬n+1)\operatorname{\overline{conv}}(\mathcal{Q}_{n+1}), where the closure is with respect to the weak* topology. Clearly, to demonstrate this, it suffices to compare the corresponding Bellman equations.

Lemma 6.1.

Let 𝒬n+1\mathcal{Q}_{n+1} be any norm bounded ambiguity set. Then it holds for all v𝔹v\in\mathbb{B} and xx\in\mathbb{R}

$\inf_{a\in D_{n}(x)}\sup_{\mathbb{Q}\in\mathcal{Q}_{n+1}}L_{n}v(x,a,\mathbb{Q})=\inf_{a\in D_{n}(x)}\sup_{\mathbb{Q}\in\operatorname{conv}(\mathcal{Q}_{n+1})}L_{n}v(x,a,\mathbb{Q})=\inf_{a\in D_{n}(x)}\sup_{\mathbb{Q}\in\operatorname{\overline{conv}}(\mathcal{Q}_{n+1})}L_{n}v(x,a,\mathbb{Q}).$
Proof.

Fix $(x,a)\in D_{n}$. The function $\mathbb{Q}\mapsto L_{n}v(x,a,\mathbb{Q})$ is linear. Thus, for a generic element $\mathbb{Q}=\sum_{i=1}^{m}\lambda_{i}\mathbb{Q}_{i}\in\operatorname{conv}(\mathcal{Q}_{n+1})$ we have

Lnv(x,a,i=1mλii)=i=1mλiLnv(x,a,i)maxi=1,,mLnv(x,a,i),L_{n}v\left(x,a,\sum_{i=1}^{m}\lambda_{i}\mathbb{Q}_{i}\right)=\sum_{i=1}^{m}\lambda_{i}L_{n}v(x,a,\mathbb{Q}_{i})\leq\max_{i=1,\dots,m}L_{n}v(x,a,\mathbb{Q}_{i}),

i.e. there can be no improvement of the supremum on the convex hull. We also have that $\operatorname{\overline{conv}}(\mathcal{Q}_{n+1})$ is metrizable and therefore coincides with the set of limit points of sequences in $\operatorname{conv}(\mathcal{Q}_{n+1})$. Since $\mathbb{Q}\mapsto L_{n}v(x,a,\mathbb{Q})$ is weak* continuous (cf. proof of Theorem 3.6), the supremum cannot be improved on the closure either. ∎
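The content of Lemma 6.1 is simply the linearity of $\mathbb{Q}\mapsto L_{n}v(x,a,\mathbb{Q})$; the following toy computation (hypothetical data, for illustration only) shows that the supremum over all convex combinations of two disturbance laws is attained at one of the original laws.

```python
import numpy as np

# Illustration of Lemma 6.1 (hypothetical data): for fixed (x, a) the map
# Q -> L_n v(x, a, Q) is linear in Q, so passing to the convex hull of the
# ambiguity set cannot increase the supremum.
cost = np.array([3.0, 1.0, 4.0])   # values of c_n(x,a,T_n(x,a,z)) + v(T_n(x,a,z)) on a z-grid
Q1 = np.array([0.2, 0.5, 0.3])     # two candidate disturbance laws on that grid
Q2 = np.array([0.6, 0.1, 0.3])

lams = np.linspace(0.0, 1.0, 101)
values = [lam * (Q1 @ cost) + (1.0 - lam) * (Q2 @ cost) for lam in lams]

# The maximum over all convex combinations equals the maximum over {Q1, Q2}.
print(max(values), max(Q1 @ cost, Q2 @ cost))
```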

Integral stochastic orders on $\mathcal{Q}_{n+1}$. Following an idea of [Mueller1997], one can define integral stochastic orders on the ambiguity set $\mathcal{Q}_{n+1}$ by

1𝔹,x,a2:\displaystyle\mathbb{Q}_{1}\leq_{\mathbb{B},x,a}\mathbb{Q}_{2}\ :\Longleftrightarrow\ Lnv(x,a,1)Lnv(x,a,2)for all v𝔹\displaystyle L_{n}v(x,a,\mathbb{Q}_{1})\leq L_{n}v(x,a,\mathbb{Q}_{2})\quad\text{for all }v\in\mathbb{B}
where (x,a)Dn(x,a)\in D_{n} is fixed and
1𝔹2:\displaystyle\mathbb{Q}_{1}\leq_{\mathbb{B}}\mathbb{Q}_{2}\ :\Longleftrightarrow\ 1𝔹,x,a2for all (x,a)Dn.\displaystyle\mathbb{Q}_{1}\leq_{\mathbb{B},x,a}\mathbb{Q}_{2}\quad\text{for all }(x,a)\in D_{n}.

If there exists a maximal element with respect to one of these stochastic orders, this probability measure is an optimal action for nature (in the respective scenario). The proof of the next lemma follows directly from the Bellman equation and the definition of the orderings.

Lemma 6.2.
  1. (a)

    If there exists a maximal element n,x,a𝒬n+1\mathbb{Q}_{n,x,a}\in\mathcal{Q}_{n+1} w.r.t. 𝔹,x,a\leq_{\mathbb{B},x,a} for every (x,a)Dn(x,a)\in D_{n}, then γ=(γ0,,γN1)\gamma=(\gamma_{0},\dots,\gamma_{N-1}) with γn(x,a):=n,x,a\gamma_{n}(x,a):=\mathbb{Q}_{n,x,a} defines an optimal policy.

  2. (b)

If there exists a maximal element $\mathbb{Q}^{*}_{n+1}\in\mathcal{Q}_{n+1}$ w.r.t. $\leq_{\mathbb{B}}$, then $\gamma=(\gamma_{0},\dots,\gamma_{N-1})$ with $\gamma_{n}\equiv\mathbb{Q}^{*}_{n+1}$ defines a constant optimal action of nature. That is, (2.6) can be reformulated as a risk-neutral MDP under the probability measures $\mathbb{Q}^{*}_{n+1}$.

In fact, Lemma 6.2 holds for any state space, but it is only a reformulation of what an optimal action for nature is. However, under Assumption 4.1 it has practical relevance when a simpler sufficient condition for the integral stochastic order $\leq_{\mathbb{B}}$ is fulfilled. We give the following exemplary criteria:

  1. 1.

    Let 𝒵\mathcal{Z} be a partially ordered space, e.g. 𝒵=m\mathcal{Z}=\mathbb{R}^{m}, and assume that the transition functions are increasing in zz. Then the functions

    𝒵zcn(x,a,Tn(x,a,z))+v(Tn(x,a,z)),v𝔹,(x,a)Dn\displaystyle\mathcal{Z}\ni z\mapsto c_{n}(x,a,T_{n}(x,a,z))+v(T_{n}(x,a,z)),\qquad v\in\mathbb{B},\ (x,a)\in D_{n} (6.1)

    are increasing. Thus, 1𝔹2\mathbb{Q}_{1}\leq_{\mathbb{B}}\mathbb{Q}_{2} is implied by the usual stochastic order of the disturbance distributions 1Zn+1st2Zn+1\mathbb{Q}_{1}^{Z_{n+1}}\leq_{st}\mathbb{Q}_{2}^{Z_{n+1}} and a maximal element of 𝒬n+1\mathcal{Q}_{n+1} w.r.t. st\leq_{st} allows the same conclusion as in Lemma 6.2 b).

  2. 2.

    Let 𝒵\mathcal{Z} be a real vector space, assume a convex model (cf. Lemma 5.1) and let the transition functions TnT_{n} additionally be convex in zz. Now, Lemma 5.1 yields that the functions (6.1) are convex as compositions of increasing convex and convex mappings. Consequently, 𝔹\leq_{\mathbb{B}} is implied by the convex order cx\leq_{cx} of the disturbance distributions Zn+1\mathbb{Q}^{Z_{n+1}}.

Convex order on the set of densities. Since the probability measures in 𝒬n+1\mathcal{Q}_{n+1} are absolutely continuous with respect to the reference probability measure n+1\mathbb{P}_{n+1}, we can alternatively consider the set of densities 𝒬n+1d\mathcal{Q}_{n+1}^{d} (see (2.2)). In general, one has to take care both of the marginal distribution of the density and the dependence structure with the random cost when searching for a maximizing density of the Bellman equation

infaDn(x)supY𝒬n+1d𝔼[(cn(x,a,Tn(x,a,Z))+Jn+1(Tn(x,a,Z)))Y].\inf_{a\in D_{n}(x)}\sup_{Y\in\mathcal{Q}_{n+1}^{d}}\mathbb{E}\Big{[}\Big{(}c_{n}\big{(}x,a,T_{n}(x,a,Z)\big{)}+J_{n+1}\big{(}T_{n}(x,a,Z)\big{)}\Big{)}Y\Big{]}.

However, if 𝒬n+1d\mathcal{Q}_{n+1}^{d} is sufficiently rich, the maximization reduces to comparing marginal distributions.

Definition 6.3.

The set of densities 𝒬nd\mathcal{Q}_{n}^{d} is called law invariant, if for Y1𝒬ndY_{1}\in\mathcal{Q}_{n}^{d} every Y2Lq(Ωn,𝒜n,n)Y_{2}\in L^{q}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}) with Y2Y1Y_{2}\sim Y_{1} is in 𝒬nd\mathcal{Q}_{n}^{d}, too.

Lemma 6.4.

Let Assumptions 3.1, 4.1 be satisfied. If 𝒬n+1d\mathcal{Q}_{n+1}^{d} is law invariant,

supY𝒬n+1d𝔼[(cn(x,a,Tn(x,a,Zn+1))+v(Tn(x,a,Zn+1)))Y],(x,a)Dn,v𝔹,\sup_{Y\in\mathcal{Q}_{n+1}^{d}}\mathbb{E}\Big{[}\Big{(}c_{n}\big{(}x,a,T_{n}(x,a,Z_{n+1})\big{)}+v\big{(}T_{n}(x,a,Z_{n+1})\big{)}\Big{)}Y\Big{]},\qquad(x,a)\in D_{n},\ v\in\mathbb{B}, (6.2)

is not changed by restricting the maximization to densities which are comonotonic to the random variable Tn(x,a,Zn+1)T_{n}(x,a,Z_{n+1}).

Proof.

By Lemma LABEL:thm:abstract_robust_Lpbound, $c_{n}\big(x,a,T_{n}(x,a,Z_{n+1})\big)+v\big(T_{n}(x,a,Z_{n+1})\big)$ is in $L^{p}(\Omega_{n+1},\mathcal{A}_{n+1},\mathbb{P}_{n+1})$ for all $(x,a)\in D_{n}$. Thus the expectation (6.2) exists for all $Y\in L^{q}(\mathcal{Z},\mathfrak{Z},\mathbb{P}^{Z_{n+1}})$. Note that the expectation of a product of random variables with fixed marginals is maximized when the random variables are chosen comonotonic. Due to the law invariance of $\mathcal{Q}_{n+1}^{d}$ we can find for every $Y\in\mathcal{Q}_{n+1}^{d}$ some $Y^{\prime}\in\mathcal{Q}_{n+1}^{d}$ with $Y^{\prime}\sim Y$ which is comonotonic to $c_{n}\big(x,a,T_{n}(x,a,Z_{n+1})\big)+v\big(T_{n}(x,a,Z_{n+1})\big)$, so that (6.2) is not decreased by passing from $Y$ to $Y^{\prime}$. Since the function $\mathbb{R}\ni x^{\prime}\mapsto c_{n}\big(x,a,x^{\prime}\big)+v\big(x^{\prime}\big)$ is increasing, this is the same as requiring comonotonicity to $T_{n}(x,a,Z_{n+1})$. ∎
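The comonotonicity argument can be visualized numerically: with fixed marginals, pairing large outcomes of the cost with large values of the density maximizes the expectation of the product. The sketch below (simulated data, purely illustrative) compares the comonotone coupling with an independent and an arbitrary coupling.

```python
import numpy as np

# Illustration of the comonotonicity argument (simulated data): with fixed
# marginals, E[X Y] is maximized by coupling X and Y comonotonically, which
# on a finite sample means pairing the sorted values (rearrangement inequality).
rng = np.random.default_rng(0)
X = rng.gamma(2.0, size=10_000)           # stands in for c_n(...) + v(...)
Y = rng.lognormal(sigma=0.5, size=10_000)
Y = Y / Y.mean()                          # normalize so Y has unit mean like a density

comonotone = np.mean(np.sort(X) * np.sort(Y))   # comonotone coupling
independent = np.mean(X) * np.mean(Y)           # independent coupling
shuffled = np.mean(X * rng.permutation(Y))      # an arbitrary coupling
print(comonotone, independent, shuffled)        # the comonotone value is largest
```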

For the comparison of marginal distributions one would naturally think of stochastic orders. Here, the convex order yields a sufficient criterion for optimality. In order to obtain a connection to risk measures, let us introduce the following notations:

Definition 6.5.

Let FXF_{X} be the distribution function of a real-valued random variable XX.

  • a)

    The (lower) quantile function of XX is the left-continuous generalized inverse of FXF_{X}

    FX1(α):=qX(α):=inf{x:FX(x)α},α(0,1).F^{-1}_{X}(\alpha):=q^{-}_{X}(\alpha):=\inf\{x\in\mathbb{R}:\ F_{X}(x)\geq\alpha\},\qquad\alpha\in(0,1).
  • b)

    The upper quantile function of XX is the right-continuous generalized inverse of FXF_{X}

    qX+(α):=inf{x:FX(x)>α},α(0,1).q^{+}_{X}(\alpha):=\inf\{x\in\mathbb{R}:\ F_{X}(x)>\alpha\},\qquad\alpha\in(0,1).
Lemma 6.6 ([Rueschendorf2009, 2.1]).

For any random variable XX on an atomless probability space there exists a random variable UX𝒰(0,1)U_{X}\sim\mathcal{U}(0,1) such that

qX(UX)=qX+(UX)=X-a.s.\displaystyle q_{X}^{-}(U_{X})=q_{X}^{+}(U_{X})=X\quad\mathbb{P}\text{-a.s.}

The random variable UXU_{X} is referred to as (generalized) distributional transform of XX. Note that if h:h:\mathbb{R}\to\mathbb{R} is increasing and left-continuous, then XX and h(X)h(X) have the same distributional transform.

Definition 6.7.

Let $\phi:[0,1]\to\mathbb{R}_{+}$ be increasing and right-continuous with $\int_{0}^{1}\phi(u)\operatorname{d}u=1$. Functionals $\rho_{\phi}:L^{p}\to\bar{\mathbb{R}}$ of the form

$\rho_{\phi}(X):=\int_{0}^{1}q_{X}(u)\phi(u)\operatorname{d}u$

are called spectral risk measures and $\phi$ is called the spectrum.

If we choose the spectrum $\phi(u)=\frac{1}{1-\alpha}1_{[\alpha,1]}(u)$, we obtain the Expected Shortfall at level $\alpha$. Note that every spectral risk measure is also coherent, i.e. monotone, translation-invariant, positively homogeneous and subadditive.
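As a sanity check of the spectral representation (not part of the paper), the following sketch computes the Expected Shortfall of a standard normal sample once via the spectrum $\phi(u)=\frac{1}{1-\alpha}1_{[\alpha,1]}(u)$ and once as the average of the worst $(1-\alpha)$-share of outcomes; both agree up to Monte Carlo error.

```python
import numpy as np

# Illustration (simulated sample): with spectrum phi(u) = 1/(1-alpha) on
# [alpha, 1], the spectral risk measure is the Expected Shortfall, i.e. the
# average of the quantile function over the upper tail.
rng = np.random.default_rng(1)
X = rng.standard_normal(500_000)
alpha = 0.95

# Spectral representation: integrate q_X(u) * phi(u) over (0, 1); phi vanishes
# below alpha, so it suffices to integrate the quantile function over [alpha, 1).
u = np.linspace(alpha, 1.0, 20_000, endpoint=False)
es_spectral = np.trapz(np.quantile(X, u), u) / (1.0 - alpha)

# Direct tail average of the worst (1 - alpha) share of outcomes.
es_tail = np.sort(X)[int(alpha * X.size):].mean()
print(es_spectral, es_tail)   # both close to 2.06 for the standard normal
```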

In what follows we assume a non-atomic probability space.

Lemma 6.8.

Let Assumptions 3.1, 4.1 be satisfied, 𝒬n+1d\mathcal{Q}_{n+1}^{d} be law invariant and suppose there exists a maximal element Yn+1Y^{*}_{n+1} of 𝒬n+1d\mathcal{Q}_{n+1}^{d} w.r.t. the convex order cx\leq_{cx}.

  • a)

    Then

    ρϕ(X)=supY𝒬n+1d𝔼[XY],XLp(Ωn+1,𝒜n+1,n+1),\displaystyle\rho_{\phi}(X)=\sup_{Y\in\mathcal{Q}_{n+1}^{d}}\mathbb{E}[XY],\qquad X\in L^{p}(\Omega_{n+1},\mathcal{A}_{n+1},\mathbb{P}_{n+1}),

    defines a spectral risk measure with spectrum ϕn+1(u):=qYn+1+(u),u[0,1]\phi_{n+1}(u):=q^{+}_{Y^{*}_{n+1}}(u),\ u\in[0,1]. In this case, γ=(γ0,,γN1)\gamma=(\gamma_{0},\dots,\gamma_{N-1}) with γn(x,a)=ϕn+1(UTn(x,a,Zn+1))\gamma_{n}(x,a)=\phi_{n+1}(U_{T_{n}(x,a,Z_{n+1})}) is an optimal strategy of nature in (2.6). Here UTn(x,a,Zn+1)U_{T_{n}(x,a,Z_{n+1})} denotes the distributional transform of Tn(x,a,Zn+1)T_{n}(x,a,Z_{n+1}).

  • b)

    If additionally, the disturbance space is the real line 𝒵=\mathcal{Z}=\mathbb{R} and the transition function TnT_{n} is increasing and lower semicontinuous in zz, γ=(γ0,,γN1)\gamma=(\gamma_{0},\dots,\gamma_{N-1}) with γnϕn+1(UZn+1)\gamma_{n}\equiv\phi_{n+1}(U_{Z_{n+1}}) defines a constant optimal action of nature. That is, (2.6) can be reformulated to a risk-neutral MDP with probability measures dn+1=ϕn+1(UZn+1)dn+1\operatorname{d}\mathbb{Q}_{n+1}=\phi_{n+1}(U_{Z_{n+1}})\operatorname{d}\mathbb{P}_{n+1}.

Proof.
  1. (a)

    It holds Yn+1=qYn+1+(UYn+1)Y^{*}_{n+1}=q^{+}_{Y^{*}_{n+1}}(U_{Y^{*}_{n+1}})\ \mathbb{P}-a.s. by Lemma 6.6. Therefore,

    𝒬n+1d{YLq(Ωn+1,𝒜n+1,n+1):Ycxϕn+1(U),U𝒰(0,1)},\mathcal{Q}_{n+1}^{d}\subseteq\left\{Y\in L^{q}(\Omega_{n+1},\mathcal{A}_{n+1},\mathbb{P}_{n+1}):\ Y\leq_{cx}\phi_{n+1}(U),\ U\sim\mathcal{U}(0,1)\right\},

    and the random variables ϕn+1(U),U𝒰(0,1)\phi_{n+1}(U),\ U\sim\mathcal{U}(0,1) are contained in both sets due to law invariance. ρϕ\rho_{\phi} indeed defines a spectral risk measure (see Proposition LABEL:thm:spectral_rm-dual) and ϕn+1(U~)\phi_{n+1}(\tilde{U}) is an optimal action of nature at time nn given (x,a)(x,a), where U~\tilde{U} is the distributional transform of the random variable

    cn(x,a,Tn(x,a,Zn+1))+Jn+1(Tn(x,a,Zn+1)).c_{n}\big{(}x,a,T_{n}(x,a,Z_{n+1})\big{)}+J_{n+1}\big{(}T_{n}(x,a,Z_{n+1})\big{)}.

    Since the function xcn(x,a,x)+v(x)\mathbb{R}\ni x^{\prime}\mapsto c_{n}\big{(}x,a,x^{\prime}\big{)}+v\big{(}x^{\prime}\big{)} is increasing and lower semicontinuous, i.e. left continuous, it follows with Lemma 6.6 that U~=UTn(x,a,Zn+1)\tilde{U}=U_{T_{n}(x,a,Z_{n+1})}.

  2. (b)

    Under the additional assumptions we have that UTn(x,a,Zn+1)=UZn+1U_{T_{n}(x,a,Z_{n+1})}=U_{Z_{n+1}}. ∎

Recall that the probability space under consideration is the product space. Under the assumptions of Lemma 6.8 b) we can replace the probability measure \mathbb{P} by

^:=n=1N1n,dn=ϕn(UZn)dn\widehat{\mathbb{Q}}:=\bigotimes_{n=1}^{N-1}\mathbb{Q}^{*}_{n},\qquad\operatorname{d}\mathbb{Q}^{*}_{n}=\phi_{n}(U_{Z_{n}})\operatorname{d}\mathbb{P}_{n}

and the optimization problem (2.6) can equivalently be written as

infπΠM𝔼x^[n=0N1cn(Xn,dn(Xn),Xn+1)+cN(XN)].\displaystyle\inf_{\pi\in\Pi^{M}}\mathbb{E}_{x}^{\widehat{\mathbb{Q}}}\left[\sum_{n=0}^{N-1}c_{n}(X_{n},d_{n}(X_{n}),X_{n+1})+c_{N}(X_{N})\right]. (6.3)

Reversing the argument of Lemma 6.8, a robust formulation of (6.3) is given by

infπΠMsup𝔔𝔼x[n=0N1cn(Xn,dn(Xn),Xn+1)+cN(XN)]\displaystyle\inf_{\pi\in\Pi^{M}}\sup_{\mathbb{Q}\in\mathfrak{Q}}\mathbb{E}_{x}^{\mathbb{Q}}\left[\sum_{n=0}^{N-1}c_{n}(X_{n},d_{n}(X_{n}),X_{n+1})+c_{N}(X_{N})\right] (6.4)

with

𝔔={n=1N1n:dn=Yndn,YnLq(Ωn,𝒜n,n),Yncxϕn(Un),Un𝒰(0,1)}.\mathfrak{Q}=\left\{\bigotimes_{n=1}^{N-1}\mathbb{Q}_{n}:\ \operatorname{d}\mathbb{Q}_{n}=Y_{n}\operatorname{d}\mathbb{P}_{n},\ Y_{n}\in L^{q}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}),\ Y_{n}\leq_{cx}\phi_{n}(U_{n}),\ U_{n}\sim\mathcal{U}(0,1)\right\}.

The $Y_{n}$ are indeed densities. Now, (6.4) can be interpreted as the minimization of a coherent risk measure $\rho$

infπΠMρ(n=0N1cn(Xn,dn(Xn),Xn+1)+cN(XN))\displaystyle\inf_{\pi\in\Pi^{M}}\rho\Big{(}\sum_{n=0}^{N-1}c_{n}(X_{n},d_{n}(X_{n}),X_{n+1})+c_{N}(X_{N})\Big{)} (6.5)

and the problem can be solved with the value iteration

Jn(x)=infaDn(x)ρn(cn(x,a,Tn(x,a,Zn+1))+Jn+1(Tn(x,a,Zn+1)))J_{n}(x)=\inf_{a\in D_{n}(x)}\rho_{n}\Big{(}c_{n}(x,a,T_{n}(x,a,Z_{n+1}))+J_{n+1}(T_{n}(x,a,Z_{n+1}))\Big{)}

where $\rho_{n}(X)=\sup_{Y\in\mathcal{Q}_{n+1}^{d}}\mathbb{E}[XY]$ and $J_{0}(x)$ gives the minimal value of (6.4).
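The value iteration with $\rho_{n}$ can be prototyped on a small finite model. The sketch below is purely illustrative: all model data are hypothetical, and the Expected Shortfall at level $\alpha$ is used as one concrete example of a coherent risk measure $\rho_{n}$ arising from a law invariant ambiguity set.

```python
import numpy as np

def expected_shortfall(values, probs, alpha):
    """ES_alpha of a discrete random variable: mean of the worst (1-alpha) tail."""
    order = np.argsort(values)[::-1]          # sort outcomes from largest cost down
    v, p = values[order], probs[order]
    tail = 1.0 - alpha
    cum = np.cumsum(p)
    prev = np.concatenate(([0.0], cum[:-1]))
    w = np.clip(np.minimum(cum, tail) - prev, 0.0, None)   # tail mass per outcome
    return float(v @ w) / tail

# Hypothetical finite model on a state grid (inventory-like dynamics).
states = np.linspace(0.0, 5.0, 51)
actions = np.linspace(0.0, 2.0, 21)
z_vals = np.array([0.0, 1.0, 2.0])
z_probs = np.array([0.3, 0.5, 0.2])
alpha, N = 0.9, 5

def T(x, a, z):              # transition function (assumed for illustration)
    return np.clip(x + a - z, states[0], states[-1])

def cost(x, a, x_next):      # one-stage cost (assumed for illustration)
    return 0.5 * a + (x_next - 2.0) ** 2

J = (states - 2.0) ** 2      # terminal cost c_N
for n in reversed(range(N)):
    J_new = np.empty_like(J)
    for i, x in enumerate(states):
        best = np.inf
        for a in actions:
            x_next = T(x, a, z_vals)
            outcome = cost(x, a, x_next) + np.interp(x_next, states, J)
            best = min(best, expected_shortfall(outcome, z_probs, alpha))
        J_new[i] = best
    J = J_new

print(J[:5])   # J_0 on the first few grid points
```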

Remark 6.9.

Some authors have already considered the problem of optimizing risk measures of dynamic decision processes. For example, [ruszczynski2010risk] considered Markov risk measures for finite and discounted infinite horizon models; in the final section he briefly discusses the relation to min-max problems where player 2 chooses the measure. [shen2013risk] treat similar problems (also with average cost) with different properties of the risk maps. More specific applications (dividend problem and economic growth) with recursive entropic risk measures are treated in [BaeuerleJaskiewicz2017, BaeuerleJaskiewicz2018]. In [chow2015risk] the authors treat the dynamic Average Value-at-Risk as a min-max problem. For this risk measure there are also alternative ways to obtain the solution, see e.g. [BaeuerleOtt2011].

7. Application

7.1. LQ Problem

We consider here so-called linear-quadratic (LQ) problems. The state and action space are $E=\mathbb{R}$ and $A=\mathbb{R}^{d}$. Let $U_{1},\dots,U_{N}$, $V_{1},\dots,V_{N}$ be $\mathbb{R}$- and $\mathbb{R}^{d}$-valued random vectors, respectively, and $W_{1},\dots,W_{N}$ be random variables with values in $\mathbb{R}$. The random elements $\{Z_{n}=(U_{n},V_{n},W_{n})\}_{1\leq n\leq N}$ are independent and the $n$-th element is defined on $(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n})$. It is supposed that the disturbances $\{Z_{n}\}_{1\leq n\leq N}$ have finite $2p$-th moments, $p\geq 1$. The transition function is given by

Tn(x,a,Zn+1)=Un+1x+Vn+1a+Wn+1T_{n}(x,a,Z_{n+1})=U_{n+1}x+V_{n+1}^{\top}a+W_{n+1}

for n=0,,N1n=0,\dots,N-1. Furthermore, let there be deterministic positive constants Q0,,QN+Q_{0},\dots,Q_{N}\in\mathbb{R}_{+} and deterministic positive definite symmetric matrices R0,,RN1d×dR_{0},\dots,R_{N-1}\in\mathbb{R}^{d\times d}. The one-stage cost functions are

cn(x,a,x)=cn(x,a)=x2Qn+aRnac_{n}(x,a,x^{\prime})=c_{n}(x,a)=x^{2}Q_{n}+a^{\top}R_{n}a

and the terminal cost function is cN(x)=x2QNc_{N}(x)=x^{2}Q_{N}. Hence, the optimization problem under consideration is

infπΠRsupγΓ𝔼0xπγ[k=0N1Xk2Qk+AkRkAk+XN2QN].\displaystyle\inf_{\pi\in\Pi^{R}}\sup_{\gamma\in\Gamma}\ \mathbb{E}_{0x}^{\pi\gamma}\left[\sum_{k=0}^{N-1}X_{k}^{2}Q_{k}+A_{k}^{\top}R_{k}A_{k}+X_{N}^{2}Q_{N}\right]. (7.1)

Policy values and value functions are defined in the usual way. For different formulations of robust LQ problems, see e.g. [hansen1999five].

Since QnQ_{n} and the matrices RnR_{n} are positive semidefinite, b¯0\underaccent{\bar}{b}\equiv 0 is a lower bounding function and the one-stage costs are at least quasi-integrable. In the sequel, we will determine the value functions and optimal policy by elementary calculations and will show that the value functions are convex and therefore continuous. Hence, we can dispense with an upper bounding function and compactness of the action space.

Since the Borel σ\sigma-algebra of a finite dimensional Euclidean space is countably generated, it is no restriction to assume that the probability measures n+1\mathbb{P}_{n+1} are separable. Further, we assume that for n=0,,N1n=0,\dots,N-1

  • (i)

    the ambiguity sets 𝒬n+1q(Ωn+1,𝒜n+1,n+1)\mathcal{Q}_{n+1}\subseteq\mathcal{M}^{q}(\Omega_{n+1},\mathcal{A}_{n+1},\mathbb{P}_{n+1}) are norm bounded and weak* closed.

  • (ii)

    it holds 𝔼[Wn+1]=0\mathbb{E}^{\mathbb{Q}}[W_{n+1}]=0 for all 𝒬n+1\mathbb{Q}\in\mathcal{Q}_{n+1}.

  • (iii)

    Wn+1W_{n+1} and (Un+1,Vn+1)(U_{n+1},V_{n+1}) are independent for all 𝒬n+1\mathbb{Q}\in\mathcal{Q}_{n+1}, i.e. =Wn+1Un+1,Vn+1\mathbb{Q}=\mathbb{Q}_{W_{n+1}}\otimes\mathbb{Q}_{U_{n+1},V_{n+1}}.

I.e. Assumption 3.1 is satisfied apart from upper bounding. Theorems 3.6 and 3.10 use the bounding, continuity and compactness assumptions only to prove the existence of optimal decision rules. Thus, we can employ the Bellman equation and restrict the consideration to Markov policies as long as we are able to prove the existence of optimal decision rules on each stage. We proceed backwards. At stage NN, no action has to be chosen and the value function is JN(x)=x2QNJ_{N}(x)=x^{2}Q_{N}. At stage N1N-1, we have to solve the Bellman equation

JN1(x)\displaystyle J_{N-1}(x) =infaAsup𝒬NcN1(x,a)+𝔼[JN(T(x,a,Zn+1))]\displaystyle=\inf_{a\in A}\sup_{\mathbb{Q}\in\mathcal{Q}_{N}}c_{N-1}(x,a)+\mathbb{E}^{\mathbb{Q}}\left[J_{N}(T(x,a,Z_{n+1}))\right] (7.2)
=infaAsup𝒬Nx2QN1+aRN1a+𝔼[(UNx+VNa+WN)2QN]\displaystyle=\inf_{a\in A}\sup_{\mathbb{Q}\in\mathcal{Q}_{N}}x^{2}Q_{N-1}+a^{\top}R_{N-1}a+\mathbb{E}^{\mathbb{Q}}\left[(U_{N}x+V_{N}^{\top}a+W_{N})^{2}Q_{N}\right]
=infaAsup𝒬Nx2QN1+aRN1a+𝔼[x2UN2+(VNa)2+2xUNVNa+WN2]QN\displaystyle=\inf_{a\in A}\sup_{\mathbb{Q}\in\mathcal{Q}_{N}}x^{2}Q_{N-1}+a^{\top}R_{N-1}a+\mathbb{E}^{\mathbb{Q}}\left[x^{2}U_{N}^{2}+(V_{N}^{\top}a)^{2}\right.\left.+2xU_{N}V_{N}^{\top}a+W_{N}^{2}\right]Q_{N}

For the last equality we used that $\mathbb{E}^{\mathbb{Q}}\left[2W_{N}(U_{N}x+V_{N}^{\top}a)\right]=0$ by assumption. Since $R_{N-1}$ and $Q_{N}$ are positive (semi-) definite, the objective function (7.2) is strictly convex in $a$. Moreover, it is linear and especially concave in $\mathbb{Q}$. Finally, $\mathcal{Q}_{N}$ is weak* compact by the Theorem of Banach-Alaoglu [AliprantisBorder2006, 6.21] and the objective function (7.2) is continuous in $\mathbb{Q}$ by definition of the weak* topology since the integrand is in $L^{p}(\Omega_{N},\mathcal{A}_{N},\mathbb{P}_{N})$. Thus, the requirements of Sion's Minimax Theorem LABEL:thm:A:minimax b) are satisfied and we can interchange infimum and supremum in (7.2), i.e.

JN1(x)\displaystyle J_{N-1}(x) =sup𝒬NinfaAx2QN1+aRN1a+x2𝔼[UN2]QN+a𝔼[VNVN]aQN\displaystyle=\sup_{\mathbb{Q}\in\mathcal{Q}_{N}}\inf_{a\in A}x^{2}Q_{N-1}+a^{\top}R_{N-1}a+x^{2}\mathbb{E}^{\mathbb{Q}}[U_{N}^{2}]Q_{N}+a^{\top}\mathbb{E}^{\mathbb{Q}}[V_{N}^{\top}V_{N}]aQ_{N}
+2xQN𝔼[UNVN]a+𝔼[WN2]QN\displaystyle\phantom{=\inf_{a\in A}\sup_{\mathbb{Q}\in\mathcal{Q}_{N}}}\ +2xQ_{N}\mathbb{E}^{\mathbb{Q}}[U_{N}V_{N}^{\top}]a+\mathbb{E}^{\mathbb{Q}}[W_{N}^{2}]Q_{N} (7.3)

In order to solve the inner minimization problem it suffices due to strict convexity and smoothness to determine the unique zero of the gradient of the objective function.

0=2RN1a+2𝔼[VNVN]aQN+2x𝔼[VNUN]QN\displaystyle 0=2R_{N-1}a+2\mathbb{E}^{\mathbb{Q}}[V_{N}^{\top}V_{N}]aQ_{N}+2x\mathbb{E}^{\mathbb{Q}}[V_{N}^{\top}U_{N}]Q_{N}
\displaystyle\Longleftrightarrow\quad a=(RN1+𝔼[VNVN]QN)1𝔼[VNUN]QNx.\displaystyle a=-\big{(}R_{N-1}+\mathbb{E}^{\mathbb{Q}}[V_{N}^{\top}V_{N}]Q_{N}\big{)}^{-1}\mathbb{E}^{\mathbb{Q}}[V_{N}^{\top}U_{N}]Q_{N}x.

Note that the matrix $R_{N-1}+\mathbb{E}^{\mathbb{Q}}[V_{N}^{\top}V_{N}]Q_{N}$ is positive definite and therefore invertible due to the positive (semi-) definiteness of $R_{N-1}$ and $Q_{N}$. Inserting into (7.3) gives

JN1(x)\displaystyle J_{N-1}(x) =sup𝒬N𝔼[WN2]QN+x2(QN1+𝔼[UN2]QN\displaystyle=\sup_{\mathbb{Q}\in\mathcal{Q}_{N}}\mathbb{E}^{\mathbb{Q}}[W_{N}^{2}]Q_{N}+x^{2}\Big{(}Q_{N-1}+\mathbb{E}^{\mathbb{Q}}[U_{N}^{2}]Q_{N}
𝔼[UNVN](RN1+𝔼[VNVN]QN)1𝔼[VNUN]QN2)\displaystyle\phantom{=\sup_{\mathbb{Q}\in\mathcal{Q}_{N}}}\ -\mathbb{E}^{\mathbb{Q}}[U_{N}V_{N}]\Big{(}R_{N-1}+\mathbb{E}^{\mathbb{Q}}[V_{N}^{\top}V_{N}]Q_{N}\Big{)}^{-1}\mathbb{E}^{\mathbb{Q}}[V_{N}^{\top}U_{N}]Q_{N}^{2}\Big{)}
=sup𝒬N𝔼[WN2]QN+x2KN1=supWN𝔼[WN2]QN+x2supUN,VNKN1\displaystyle=\sup_{\mathbb{Q}\in\mathcal{Q}_{N}}\mathbb{E}^{\mathbb{Q}}[W_{N}^{2}]Q_{N}+x^{2}K_{N-1}^{\mathbb{Q}}=\sup_{\mathbb{Q}_{W_{N}}}\mathbb{E}^{\mathbb{Q}}[W_{N}^{2}]Q_{N}+x^{2}\sup_{\mathbb{Q}_{U_{N},V_{N}}}K_{N-1}^{\mathbb{Q}} (7.4)

where

KN1\displaystyle K_{N-1}^{\mathbb{Q}} =QN1+𝔼[UN2]QN𝔼[UNVN](RN1+𝔼[VNVN]QN)1𝔼[VNUN]QN2.\displaystyle=Q_{N-1}+\mathbb{E}^{\mathbb{Q}}[U_{N}^{2}]Q_{N}-\mathbb{E}^{\mathbb{Q}}[U_{N}^{\top}V_{N}]\Big{(}R_{N-1}+\mathbb{E}^{\mathbb{Q}}[V_{N}^{\top}V_{N}]Q_{N}\Big{)}^{-1}\mathbb{E}^{\mathbb{Q}}[V_{N}^{\top}U_{N}]Q_{N}^{2}.

Since $J_{n}(x)\geq 0$ for all $x\in\mathbb{R}$ we must have $K_{N-1}^{\mathbb{Q}}>0$, and since $\mathcal{Q}_{N}$ is weak* compact and $\mathbb{Q}\mapsto\mathbb{E}^{\mathbb{Q}}[W_{N}^{2}]Q_{N}+x^{2}K_{N-1}^{\mathbb{Q}}$ is weak* continuous, there exists an optimal measure $\gamma_{N-1}(x)=\mathbb{Q}^{*}_{N}=\mathbb{Q}^{*}_{W_{N}}\otimes\mathbb{Q}^{*}_{U_{N},V_{N}}$ which is independent of $x$ due to our independence assumption (iii). Since $J_{N-1}$ is again quadratic, iterating this argument we obtain, when we define

Kn1\displaystyle K_{n-1}^{\mathbb{Q}} =Qn1+Kn(𝔼[Un2]𝔼[UnVn](Rn1/Kn+𝔼[VnVn])1𝔼[VnUn])\displaystyle=Q_{n-1}+K_{n}^{\mathbb{Q}}\Big{(}\mathbb{E}^{\mathbb{Q}}[U_{n}^{2}]-\mathbb{E}^{\mathbb{Q}}[U_{n}^{\top}V_{n}]\Big{(}R_{n-1}/K_{n}^{\mathbb{Q}}+\mathbb{E}^{\mathbb{Q}}[V_{n}^{\top}V_{n}]\Big{)}^{-1}\mathbb{E}^{\mathbb{Q}}[V_{n}^{\top}U_{n}]\Big{)}
Ln1\displaystyle L_{n-1}^{\mathbb{Q}} =(Rn1/Kn+𝔼[VnVn])1𝔼[VnUn]\displaystyle=-\Big{(}R_{n-1}/K_{n}^{\mathbb{Q}}+\mathbb{E}^{\mathbb{Q}}[V_{n}^{\top}V_{n}]\Big{)}^{-1}\mathbb{E}^{\mathbb{Q}}[V_{n}^{\top}U_{n}]

that $\gamma^{*}_{n}(x)\equiv\gamma^{*}_{n}=\operatorname{argmax}_{\mathbb{Q}_{W_{n+1}}}\mathbb{E}^{\mathbb{Q}}[W_{n+1}^{2}]\otimes\operatorname{argmax}_{\mathbb{Q}_{U_{n+1},V_{n+1}}}K_{n}^{\mathbb{Q}}$ and $d_{n}^{*}(x)=L_{n}^{\gamma^{*}_{n}}x$. Since the third term of $K_{n-1}^{\mathbb{Q}}$ is a quadratic form, nature should choose $\mathbb{Q}$ such that $\mathbb{E}^{\mathbb{Q}}[V_{n}^{\top}U_{n}]=0$ if possible and maximize the second moment of $U_{n}$. If this is possible, we obtain $d_{n-1}^{*}(x)=0$, i.e. the controller will not control the system. In any case we see that the optimal choice of nature $\gamma^{*}_{n}(x)$ does not depend on $x$.

When we specialize the situation to d=1d=1 we obtain

$\displaystyle K_{n-1}^{\mathbb{Q}}=Q_{n-1}+\Big(\mathbb{E}^{\mathbb{Q}}[U_{n}^{2}]-\frac{\big(\mathbb{E}^{\mathbb{Q}}[U_{n}V_{n}]\big)^{2}}{\frac{R_{n-1}}{K_{n}^{\mathbb{Q}}}+\mathbb{E}^{\mathbb{Q}}[V_{n}^{2}]}\Big)K_{n}^{\mathbb{Q}}.$

Thus, nature has to maximize the expression in brackets. For the moment we drop the index $n$. Let us further assume that $(U,V)$ is jointly normally distributed with means $\mu_{U}$ and $\mu_{V}$, variances $\sigma^{2}_{U}$ and $\sigma^{2}_{V}$ and covariance $\sigma_{U,V}$, and that all these parameters may be elements of compact intervals. Then the expression in brackets reduces to

σU2+μU2(σUV+μUμV)2RK+σV2+μV2.\sigma_{U}^{2}+\mu_{U}^{2}-\frac{(\sigma_{UV}+\mu_{U}\mu_{V})^{2}}{\frac{R}{K^{\mathbb{Q}}}+\sigma_{V}^{2}+\mu_{V}^{2}}.

We see immediately that both $\sigma_{V}^{2}$ and $\sigma_{U}^{2}$ have to be chosen as large as possible. The remaining three parameters $\mu_{U},\mu_{V}$ and $\sigma_{UV}$ have to balance maximizing $\mu_{U}^{2}$ against minimizing the fraction. If it is possible to choose them such that $|\mu_{U}|$ is maximal and $\sigma_{UV}+\mu_{U}\mu_{V}=0$, this would be optimal.
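For $d=1$ the backward recursion for $K_{n}^{\mathbb{Q}}$ is easy to carry out numerically. In the sketch below (all numbers hypothetical) nature picks the moments $\mu_{U},\mu_{V},\sigma_{U}^{2},\sigma_{V}^{2},\sigma_{UV}$ from finite candidate sets so as to maximize the bracket above at each stage; the corresponding feedback gain follows from the formula for $L_{n-1}^{\mathbb{Q}}$.

```python
import numpy as np
from itertools import product

# Hypothetical robust LQ recursion for d = 1: at each stage nature picks the
# moments of (U, V) from finite candidate sets so as to maximize the bracket
# above.  (The W-part only adds a constant and is omitted here.)
N = 5
Q = np.ones(N + 1)   # state cost weights Q_0, ..., Q_N
R = np.ones(N)       # action cost weights R_0, ..., R_{N-1}

mu_U = [0.9, 1.0, 1.1]; mu_V = [-0.2, 0.0, 0.2]
var_U = [0.05, 0.10];   var_V = [0.05, 0.10]
cov_UV = [-0.05, 0.0, 0.05]

K = Q[N]             # K_N = Q_N
gains = []
for n in range(N, 0, -1):        # compute K_{n-1} from K_n
    best_val, best_gain = -np.inf, 0.0
    for mU, mV, vU, vV, cUV in product(mu_U, mu_V, var_U, var_V, cov_UV):
        EU2, EV2, EUV = vU + mU**2, vV + mV**2, cUV + mU * mV
        bracket = EU2 - EUV**2 / (R[n - 1] / K + EV2)
        if bracket > best_val:
            best_val = bracket
            best_gain = -EUV / (R[n - 1] / K + EV2)   # L_{n-1} under nature's choice
    gains.append(best_gain)
    K = Q[n - 1] + best_val * K

print("K_0 =", K, "gains L_0, ..., L_{N-1} =", gains[::-1])
```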

7.2. Managing regenerative energy

The second example is the management of a joint wind and storage facility. Before each period the owner of the wind turbine has to announce the amount aa of energy she wants to produce. If there is enough wind in the next period she receives the reward PaPa where we assume that P>0P>0 is the fixed price for energy. Additional energy which may have been produced will be stored in a battery with capacity K>0K>0. If there is not enough wind in the next period, the storage device will be used to cover the shortage. In case this is still not enough, the remaining shortage will be penalized by a proportional cost rate c>0c>0 (see [anderson2015optimal] for further background). We consider a robust version here, i.e. the distribution \mathbb{Q} of the produced wind energy varies in a set 𝒬\mathcal{Q} with bounded support [0,B][0,B]. The state is the amount of energy in the battery, hence E=[0,K]E=[0,K]. Further A=[0,B]=D(x)A=[0,B]=D(x) and the action is the amount of energy which is bid. We obtain the following Bellman equation:

Jn(x)\displaystyle J_{n}(x) =\displaystyle= infaD(x)sup𝒬{aBaP+Jn+1((x+za)K)(dz)\displaystyle\inf_{a\in D(x)}\sup_{\mathbb{Q}\in\mathcal{Q}}\Big{\{}\int_{a}^{B}-aP+J_{n+1}((x+z-a)\wedge K)\mathbb{Q}(dz)
+0a(a(z+x))P+(azx)+c+Jn+1((x+za)+)(dz)}\displaystyle+\int_{0}^{a}-(a\wedge(z+x))P+(a-z-x)^{+}c+J_{n+1}((x+z-a)^{+})\mathbb{Q}(dz)\Big{\}}
=\displaystyle= infaD(x)sup𝒬{aP+aBJn+1((x+za)K)(dz)\displaystyle\inf_{a\in D(x)}\sup_{\mathbb{Q}\in\mathcal{Q}}\Big{\{}-aP+\int_{a}^{B}J_{n+1}((x+z-a)\wedge K)\mathbb{Q}(dz)
+0a(P+c)(x+za)+Jn+1((x+za)+)(dz)}\displaystyle+\int_{0}^{a}(P+c)(x+z-a)^{-}+J_{n+1}((x+z-a)^{+})\mathbb{Q}(dz)\Big{\}}

From the last representation we see that

$T(x,a,z)=\left\{\begin{array}{ll}(x+z-a)\wedge K,&z\geq a\\ (x+z-a)^{+},&0\leq z\leq a\end{array}\right.$

and

c(x,a,T(x,a,z))=aP+{0,za(P+c)(x+za),0zac(x,a,T(x,a,z))=-aP+\left\{\begin{array}[]{ll}0,&z\geq a\\ (P+c)(x+z-a)^{-},&0\leq z\leq a\end{array}\right.

We see that $D(x)$ is compact and $x\mapsto D(x)$ is lower semicontinuous. $T$ is continuous and $c$ is continuous and bounded. It is easy to see that $J_{n}(x)$ is decreasing in $x$. Suppose $\mathcal{Q}$ has a minimal element $\mathbb{Q}^{*}$ w.r.t. $\leq_{st}$. Then we can omit the $\sup_{\mathbb{Q}\in\mathcal{Q}}$ and replace $\mathbb{Q}$ by $\mathbb{Q}^{*}$. The remaining Bellman equation is then that of a standard MDP.
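A direct way to evaluate this Bellman equation is backward induction on a grid of battery levels, taking the worst case over a finite ambiguity set of wind distributions. The sketch below uses hypothetical parameters; since the integrand is decreasing in $z$, the inner maximum is always attained by the $\leq_{st}$-smallest distribution, in line with the observation above.

```python
import numpy as np

# Grid-based backward induction for the robust wind/storage model
# (all numerical values hypothetical).
K, B, P, c_pen, N = 4.0, 3.0, 1.0, 2.0, 10
x_grid = np.linspace(0.0, K, 41)          # battery levels
a_grid = np.linspace(0.0, B, 31)          # bids
z_vals = np.array([0.0, 1.0, 2.0, 3.0])   # wind outcomes

# Finite ambiguity set; the first distribution is the st-smallest one.
Qs = [np.array([0.4, 0.3, 0.2, 0.1]),
      np.array([0.2, 0.3, 0.3, 0.2]),
      np.array([0.1, 0.2, 0.3, 0.4])]

def transition(x, a, z):
    return np.where(z >= a, np.minimum(x + z - a, K), np.maximum(x + z - a, 0.0))

def stage_cost(x, a, z):
    shortfall = np.where(z < a, np.maximum(a - z - x, 0.0), 0.0)
    return -a * P + (P + c_pen) * shortfall

J = np.zeros_like(x_grid)                 # terminal cost c_N = 0 (assumption)
for n in range(N):
    J_new = np.empty_like(J)
    for i, x in enumerate(x_grid):
        best = np.inf
        for a in a_grid:
            x_next = transition(x, a, z_vals)
            outcome = stage_cost(x, a, z_vals) + np.interp(x_next, x_grid, J)
            best = min(best, max(q @ outcome for q in Qs))   # worst case over Q
        J_new[i] = best
    J = J_new

print(J)   # J_0 on the battery grid; it is decreasing in the battery level
```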

8. Appendix

8.1. Additional Proofs

Proof of Lemma 2.3:

Recall that we identify $\mathcal{Q}_{n}$ with the set of the corresponding densities $\mathcal{Q}_{n}^{d}$. The closure $\overline{\mathcal{Q}}_{n}^{d}$ of $\mathcal{Q}_{n}^{d}$ remains norm bounded. This can be seen as follows: Let $X\in\overline{\mathcal{Q}}_{n}^{d}$. Then there exists a net $\{X_{\alpha}\}_{\alpha\in I}\subseteq\mathcal{Q}_{n}^{d}$ such that $X_{\alpha}\to X$ with respect to the weak* topology, i.e. $\mathbb{E}[X_{\alpha}Y]\to\mathbb{E}[XY]$ for all $Y\in L^{p}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n})$. Now fix $Y\in L^{p}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n})$ with $\|Y\|_{L^{p}}=1$. By Hölder's inequality we have for all $\alpha\in I$

|𝔼[XαY]|𝔼|XαY|XαLqYLp=XαLqK.\left|\mathbb{E}[X_{\alpha}Y]\right|\leq\mathbb{E}|X_{\alpha}Y|\leq\|X_{\alpha}\|_{L^{q}}\|Y\|_{L^{p}}=\|X_{\alpha}\|_{L^{q}}\leq K.

Thus, |𝔼[XY]|K|\mathbb{E}[XY]|\leq K. Finally, due to duality it follows

XLq=supYLp=1|𝔼[XY]|K.\|X\|_{L^{q}}=\sup_{\|Y\|_{L^{p}}=1}|\mathbb{E}[XY]|\leq K.

The separability of the probability measure n\mathbb{P}_{n} makes Lp(Ωn,𝒜n,n)L^{p}(\Omega_{n},\mathcal{A}_{n},\mathbb{P}_{n}) a separable Banach space. Consequently, the weak* topology is metrizable on the norm bounded set 𝒬¯nd\overline{\mathcal{Q}}_{n}^{d} [Morrison2001, p. 157]. The trace topology on the subset 𝒬nd𝒬¯nd\mathcal{Q}_{n}^{d}\subseteq\overline{\mathcal{Q}}_{n}^{d} coincides with the topology induced by the restriction of the metric [OSearcoid2007, 4.4.1], i.e. 𝒬nd\mathcal{Q}_{n}^{d} is metrizable, too.

Since 𝒬¯nd\overline{\mathcal{Q}}_{n}^{d} is norm bounded and weak* closed, the Theorem of Banach-Alaoglu [AliprantisBorder2006, 6.21] yields that it is weak* compact. As a compact metrizable space 𝒬¯nd\overline{\mathcal{Q}}_{n}^{d} is complete [AliprantisBorder2006, 3.28] and also separable [AliprantisBorder2006, 3.26, 3.28]. Hence, 𝒬¯nd\overline{\mathcal{Q}}_{n}^{d} is a Borel Space. The set of densities 𝒬nd\mathcal{Q}_{n}^{d} is also separable as a subspace of a separable metrizable space [AliprantisBorder2006, 3.5]. \Box