
Distributionally robust stochastic optimal control

Alexander Shapiro Georgia Institute of Technology, Atlanta, Georgia 30332, USA, [email protected]
Research of this author was partially supported by Air Force Office of Scientific Research (AFOSR) under Grant FA9550-22-1-0244.
   Yan Li Texas A&M University, College Station, Texas 77843, USA, [email protected]

The main goal of this paper is to discuss the construction of distributionally robust counterparts of stochastic optimal control problems. Randomized and non-randomized policies are considered. In particular, necessary and sufficient conditions for the existence of non-randomized optimal policies are given.

Keywords: stochastic optimal control, dynamic equations, distributional robustness, game formulation, risk measures

1 Introduction

Consider the Stochastic Optimal Control (SOC) (discrete time, finite horizon) model (e.g., [1]):

$$\min_{\pi\in\Pi}\mathbb{E}^{\pi}\Big[\sum_{t=1}^{T}f_t(x_t,u_t,\xi_t)+f_{T+1}(x_{T+1})\Big], \qquad (1.1)$$

where $\Pi$ is the set of policies satisfying the constraints

$$\Pi=\Big\{\pi=(\pi_1,\ldots,\pi_T):\,u_t=\pi_t(x_t,\xi_{[t-1]}),\;u_t\in\mathcal{U}_t,\;x_{t+1}=F_t(x_t,u_t,\xi_t),\;t=1,\ldots,T\Big\}. \qquad (1.2)$$

Here the variables $x_t\in\mathbb{R}^{n_t}$, $t=1,\ldots,T+1$, represent the state of the system, $u_t\in\mathbb{R}^{m_t}$, $t=1,\ldots,T$, are controls, $\xi_t\in\Xi_t$, $t=1,\ldots,T$, are random vectors, $\Xi_t$ is a closed subset of $\mathbb{R}^{d_t}$, $f_t:\mathbb{R}^{n_t}\times\mathbb{R}^{m_t}\times\mathbb{R}^{d_t}\to\mathbb{R}$, $t=1,\ldots,T$, are cost functions, $f_{T+1}(x_{T+1})$ is a final cost function, $F_t:\mathbb{R}^{n_t}\times\mathbb{R}^{m_t}\times\mathbb{R}^{d_t}\to\mathbb{R}^{n_{t+1}}$ are (measurable) mappings and $\mathcal{U}_t$ is a (nonempty) subset of $\mathbb{R}^{m_t}$. The values $x_1$ and $\xi_0$ are deterministic (initial conditions); it is also possible to view $x_1$ as random with a given distribution, but this is not essential for the following discussion.
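To make the model concrete, here is a minimal Monte Carlo sketch of the objective in (1.1) for a fixed feedback policy. All problem data (scalar linear dynamics, quadratic costs, uniform i.i.d. noise, and the policy itself) are hypothetical placeholders rather than anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 3          # horizon
x1 = 0.0       # deterministic initial state

def F(t, x, u, xi):   # transition x_{t+1} = F_t(x_t, u_t, xi_t)
    return x + u + xi

def f(t, x, u, xi):   # stage cost f_t(x_t, u_t, xi_t)
    return x**2 + 0.1 * u**2

def f_final(x):       # final cost f_{T+1}(x_{T+1})
    return x**2

def policy(t, x, xi_hist):   # feedback policy u_t = pi_t(x_t, xi_{[t-1]})
    return -0.5 * x

def expected_cost(n_samples=20_000):
    """Monte Carlo estimate of the objective in (1.1) for the fixed policy."""
    total = 0.0
    for _ in range(n_samples):
        x, xi_hist, cost = x1, [], 0.0
        for t in range(1, T + 1):
            u = policy(t, x, xi_hist)
            xi = rng.uniform(-1.0, 1.0)      # xi_t, independent across stages
            cost += f(t, x, u, xi)
            x = F(t, x, u, xi)
            xi_hist.append(xi)
        cost += f_final(x)
        total += cost
    return total / n_samples

print(expected_cost())
```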

  • Unless stated otherwise we assume that the probability law of the random process $\xi_1,\ldots,\xi_T$ does not depend on our decisions.

Remark 1.1.

The above assumption is basic for our analysis. It views $\xi_1,\ldots,\xi_T$ as a random data process and assumes that its probability law does not depend on the respective states and actions. This assumption is reasonable in many applications; for instance, in the Inventory Model discussed in Section 4 it means that the probability distribution of the demand process does not depend on the inventory level and order quantity. This assumption allows us to separate the transition probabilities, defined by the functional relations $x_{t+1}=F_t(x_t,u_t,\xi_t)$, from the probability distribution of the data process (see Remarks 1.2 and 2.1 below). $\square$

The optimization in (1.1) is performed over policies $\pi\in\Pi$ determined by decisions $u_t$ and state variables $x_t$ considered as functions of $\xi_{[t-1]}=(\xi_1,\ldots,\xi_{t-1})$, $t=1,\ldots,T$, and satisfying the feasibility constraints (1.2). We also denote $\Xi_{[t]}:=\Xi_1\times\cdots\times\Xi_t$. For the sake of simplicity, in order not to distract from the main message of the paper, we assume that the control sets $\mathcal{U}_t$ do not depend on $x_t$. It is possible to extend the analysis to the general case, where the control sets are functions of the state variables.

Remark 1.2.

Note that because of the basic assumption that the probability distribution of $\xi_1,\ldots,\xi_T$ does not depend on our decisions (does not depend on states and actions), it suffices to consider policies $\{\pi_t(\xi_{[t-1]})\}$ as functions of the process $\xi_t$ alone. It can also be noted that if the random process $\xi_t$ is stagewise independent, i.e., the random vector $\xi_{t+1}$ is independent of $\xi_{[t]}$, $t=1,\ldots,T-1$, then it suffices to consider policies of the form $\{u_t=\pi_t(x_t)\}$. $\square$

Consider the Banach space $C(\Xi_t)$ of continuous functions $\phi:\Xi_t\to\mathbb{R}$ equipped with the sup-norm $\|\phi\|_\infty=\sup_{\xi\in\Xi_t}|\phi(\xi)|$. The dual space of $C(\Xi_t)$ is the space of finite signed measures on $\Xi_t$, with respect to the bilinear form $\langle\phi,\gamma\rangle=\int\phi\,d\gamma$ (Riesz representation).

Remark 1.3.

Denote by $\mathfrak{M}_t$ the set of probability measures on the set $\Xi_t$ equipped with its Borel sigma algebra. The set $\mathfrak{M}_t$ is a weakly$^*$ closed subset of the unit ball of the dual space of $C(\Xi_t)$ and hence is weakly$^*$ compact by the Banach-Alaoglu theorem. The weak$^*$ topology of $\mathfrak{M}_t$ (in probability theory this weak$^*$ topology is often referred to as the weak topology, e.g., [2]) is metrizable (e.g., [6, Theorem 6.30]). It can be noted that if $\phi_n\in C(\Xi_t)$ converges (in the norm topology) to $\phi\in C(\Xi_t)$ and $\gamma_n\stackrel{w^*}{\to}\gamma$, then $\int\phi_n\,d\gamma_n=\langle\phi_n,\gamma_n\rangle$ converges to $\int\phi\,d\gamma$. $\square$

Remark 1.4.

Let us recall the following properties of the min-max problem:

$$\min_{x\in\mathcal{X}}\sup_{y\in\mathcal{Y}}f(x,y), \qquad (1.3)$$

where $\mathcal{X}$ and $\mathcal{Y}$ are nonempty sets and $f:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ is a real valued function. A point $(x^*,y^*)\in\mathcal{X}\times\mathcal{Y}$ is a saddle point of problem (1.3) if $x^*\in\mathop{\rm arg\,min}_{x\in\mathcal{X}}f(x,y^*)$ and $y^*\in\mathop{\rm arg\,max}_{y\in\mathcal{Y}}f(x^*,y)$. If a saddle point exists, then problem (1.3) has the same optimal value as its dual

$$\max_{y\in\mathcal{Y}}\inf_{x\in\mathcal{X}}f(x,y), \qquad (1.4)$$

and $x^*$ is an optimal solution of problem (1.3) and $y^*$ is an optimal solution of problem (1.4). Conversely, if the optimal values of problems (1.3) and (1.4) are the same, $x^*$ is an optimal solution of problem (1.3) and $y^*$ is an optimal solution of problem (1.4), then $(x^*,y^*)$ is a saddle point. By Sion's minimax theorem [16], if $\mathcal{X}$ and $\mathcal{Y}$ are convex subsets of linear topological spaces, $f(x,y)$ is continuous, convex in $x$ and concave in $y$, and at least one of the sets $\mathcal{X}$ or $\mathcal{Y}$ is compact, then the optimal values of problems (1.3) and (1.4) are equal to each other. $\square$
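As a small numerical illustration of Remark 1.4, take the hypothetical function $f(x,y)=x^2-y^2$ on $\mathcal{X}=\mathcal{Y}=[-1,1]$: it is convex in $x$, concave in $y$, and both sets are compact, so by Sion's theorem the optimal values of (1.3) and (1.4) coincide, with saddle point $(0,0)$.

```python
import numpy as np

# f(x, y) = x^2 - y^2 on [-1, 1]^2: convex in x, concave in y, compact sets.
grid = np.linspace(-1.0, 1.0, 201)
f = grid[:, None]**2 - grid[None, :]**2   # f[i, j] = x_i^2 - y_j^2

minimax = f.max(axis=1).min()             # min_x sup_y f(x, y), cf. (1.3)
maximin = f.min(axis=0).max()             # max_y inf_x f(x, y), cf. (1.4)
print(minimax, maximin)                   # both 0.0: equal optimal values
```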

The risk neutral SOC model (1.1) has been thoroughly investigated. On the other hand, the analysis of its distributionally robust counterpart is more involved. The distributionally robust approach to Markov Decision Processes (MDPs) can be traced back to [7] and [10]. The origins of distributional robustness can also be related to dynamic game theory (e.g., [8] and references therein). By duality arguments the distributionally robust formulations are closely related to risk averse settings. The aim of this paper is to describe and formulate certain related questions in the framework of the SOC model. In particular we consider randomized policies and give necessary and sufficient conditions for the existence of non-randomized optimal policies. We also discuss the relation between the nested risk averse and game formulations of distributionally robust problems.

2 Distributionally Robust Stochastic Optimal Control

In the distributionally robust counterpart of the risk neutral SOC problem (1.1) it is assumed that the probability law of the data process $\xi_t$ is not specified exactly, but rather is a member of a specified so-called ambiguity set of probability distributions. This modeling concept can be applied in various contexts, such as the Inventory Model discussed in Section 4, where the order quantity must be determined in the presence of uncertain demand distributions. Specifically, consider the setting where there is an ambiguity set $\mathcal{P}_t^{\xi_{[t-1]}}$ of probability measures on $\Xi_t$, depending on the history $\xi_{[t-1]}=(\xi_1,\ldots,\xi_{t-1})$ (cf., [11]). That is, $\mathcal{P}_t^{\xi_{[t-1]}}=\mathbb{M}_t(\xi_{[t-1]})$, where $\mathbb{M}_t:\Xi_{[t-1]}\rightrightarrows\mathfrak{M}_t$ is the respective multifunction. Here $\mathbb{M}_1(\cdot)\equiv\mathcal{P}_1$ for some pre-determined $\mathcal{P}_1\subset\mathfrak{M}_1$.

We view the distributionally robust counterpart of problem (1.1) as a dynamic game between the decision maker (the controller) and the adversary (the nature). Define the history of the decision process as

$$\mathfrak{h}_t=(x_1,u_1,P_1,\xi_1,x_2,\ldots,u_{t-1},P_{t-1},\xi_{t-1},x_t),\;t=1,\ldots,T. \qquad (2.5)$$

At stage $t=1,\ldots,T$, based on $\mathfrak{h}_t$ the nature chooses a probability measure $P_t\in\mathcal{P}_t^{\xi_{[t-1]}}$, and at the same time the controller chooses its action $u_t\in\mathcal{U}_t$. For a realization of $\xi_t\sim P_t$, the next state is $x_{t+1}=F_t(x_t,u_t,\xi_t)$, and the process continues in this way. (It is also possible to let the ambiguity set $\mathcal{P}_t^{\xi_{[t-1]}}$ depend on the past actions $u_{[t-1]}=(u_1,\ldots,u_{t-1})$; this does not change the essence of our discussion of the dynamic game.) We write the corresponding distributionally robust problem as

$$\min_{\pi\in\Pi}\sup_{\gamma\in\Gamma}\mathbb{E}^{\pi,\gamma}\Big[\sum_{t=1}^{T}f_t(x_t,u_t,\xi_t)+f_{T+1}(x_{T+1})\Big], \qquad (2.6)$$

where $\Pi$ is the set of policies of the controller and $\Gamma$ is the set of policies of the nature (cf., [13, p. 812]).

Unless stated otherwise we make the following assumptions about $\mathcal{U}_t,\Xi_t,f_t,F_t$.

Assumption 2.1.

The sets $\mathcal{U}_t$ and $\Xi_t$, $t=1,\ldots,T$, are compact, and the functions $F_t(x_t,u_t,\xi_t)$, $f_t(x_t,u_t,\xi_t)$, $t=1,\ldots,T$, and $f_{T+1}(x_{T+1})$ are continuous.

2.1 Dynamic Equations

The dynamic programming equations for the distributionally robust counterpart (2.6) of problem (1.1) are $\mathcal{V}_{T+1}(x_{T+1},\xi_{[T]})=f_{T+1}(x_{T+1})$ for all $\xi_{[T]}\in\Xi_1\times\cdots\times\Xi_T$, and for $t=T,\ldots,1$,

$$\mathcal{V}_t(x_t,\xi_{[t-1]})=\inf_{u_t\in\mathcal{U}_t}\sup_{P_t\in\mathcal{P}_t^{\xi_{[t-1]}}}\mathbb{E}_{\xi_t\sim P_t}\Big[f_t(x_t,u_t,\xi_t)+\mathcal{V}_{t+1}\big(F_t(x_t,u_t,\xi_t),\xi_{[t]}\big)\Big]. \qquad (2.7)$$

To ensure that (2.7) is well-defined, we need the value functions $\mathcal{V}_t$ to be measurable. We establish below the continuity of the value functions and show that the optimal action in (2.7) is attained. We refer the reader to the Appendix for the definition of continuous multifunctions.
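Before turning to the continuity result, here is a minimal sketch of how the recursion (2.7) can be computed numerically. It assumes, purely for illustration, a scalar state tabulated on a grid, a finite control set, a finite support $\Xi_t$, and history-independent ambiguity sets given by finitely many candidate distributions; the dynamics, costs and all numbers are hypothetical, and projecting the next state onto the grid is a crude approximation.

```python
import numpy as np

T = 3
xs = np.linspace(-3.0, 3.0, 121)             # state grid
us = np.linspace(-1.0, 1.0, 21)              # finite control set U_t
xis = np.array([-1.0, 0.0, 1.0])             # finite support Xi_t
P_candidates = np.array([[0.2, 0.6, 0.2],    # finite ambiguity set:
                         [0.5, 0.3, 0.2],    # candidate distributions on Xi_t
                         [0.1, 0.3, 0.6]])

def F(x, u, xi):  return x + u + xi          # hypothetical dynamics
def f(x, u, xi):  return x**2 + 0.1 * u**2   # hypothetical stage cost
def f_final(x):   return x**2

V = f_final(xs)                              # V_{T+1} on the grid
for t in range(T, 0, -1):
    V_next, V = V, np.empty_like(xs)
    for i, x in enumerate(xs):
        best = np.inf
        for u in us:
            x_next = F(x, u, xis)                        # one state per xi
            idx = np.clip(np.searchsorted(xs, x_next), 0, len(xs) - 1)
            z = f(x, u, xis) + V_next[idx]               # cost-to-go per xi
            worst = max(P @ z for P in P_candidates)     # sup over P_t
            best = min(best, worst)                      # inf over u_t
        V[i] = best

print(V[np.searchsorted(xs, 0.0)])           # robust value at x_1 = 0
```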

Proposition 2.1.

Suppose that Assumption 2.1 holds and the multifunctions $\mathbb{M}_t$ are continuous. Then the value functions $\mathcal{V}_t(x_t,\xi_{[t-1]})$ in (2.7) are continuous in $(x_t,\xi_{[t-1]})\in\mathbb{R}^{n_t}\times\Xi_{[t-1]}$ and the infimum in (2.7) is attained. The optimal policy of the controller is given by $\bar\pi_t(x_t,\xi_{[t-1]})=\bar u_t(x_t,\xi_{[t-1]})$, $t=1,\ldots,T$, with

$$\bar u_t(x_t,\xi_{[t-1]})\in\mathop{\rm arg\,min}_{u_t\in\mathcal{U}_t}\sup_{P_t\in\mathcal{P}_t^{\xi_{[t-1]}}}\mathbb{E}_{P_t}\Big[f_t(x_t,u_t,\xi_t)+\mathcal{V}_{t+1}\big(F_t(x_t,u_t,\xi_t),\xi_{[t]}\big)\Big]. \qquad (2.8)$$
Proof.

We argue by induction in $t=T,\ldots,1$. We have that $\mathcal{V}_{T+1}(x_{T+1},\xi_{[T]})=f_{T+1}(x_{T+1})$, and hence $\mathcal{V}_{T+1}(x_{T+1},\xi_{[T]})$ is continuous by the assumption. Now suppose that $\mathcal{V}_{t+1}(\cdot,\cdot)$ is continuous. Consider

$$\psi_t(P_t,x_t,u_t,\xi_{[t-1]}):=\mathbb{E}_{\xi_t\sim P_t}\Big[f_t(x_t,u_t,\xi_t)+\mathcal{V}_{t+1}(F_t(x_t,u_t,\xi_t),\xi_{[t]})\Big]=\int_{\Xi_t}\Big[f_t(x_t,u_t,\xi_t)+\mathcal{V}_{t+1}(F_t(x_t,u_t,\xi_t),\xi_{[t-1]},\xi_t)\Big]\,P_t(d\xi_t).$$

By Assumption 2.1 the set $\Xi_t$ is compact, and consequently for any sequence $(x_t^n,u_t^n,\xi_{[t-1]}^n)$ converging to $(x_t,u_t,\xi_{[t-1]})$ we have that $\phi_t^n(\xi_t):=f_t(x_t^n,u_t^n,\xi_t)+\mathcal{V}_{t+1}(F_t(x_t^n,u_t^n,\xi_t),(\xi_{[t-1]}^n,\xi_t))$ converges to $\phi_t(\xi_t):=f_t(x_t,u_t,\xi_t)+\mathcal{V}_{t+1}(F_t(x_t,u_t,\xi_t),\xi_{[t]})$ uniformly in $\xi_t\in\Xi_t$. For any $P_t^n\stackrel{w^*}{\to}P_t$, it follows by Remark 1.3 that $\psi_t(P_t^n,x_t^n,u_t^n,\xi_{[t-1]}^n)\to\psi_t(P_t,x_t,u_t,\xi_{[t-1]})$. In addition, since $\Xi_t$ is compact, $C(\Xi_t)$ is separable and $\mathfrak{M}_t$ with the weak$^*$ topology is metrizable [6, Theorem 6.30]. It follows that $\psi_t$ is jointly continuous in $P_t$ (with respect to the weak$^*$ topology of $\mathfrak{M}_t$) and $(x_t,u_t,\xi_{[t-1]})\in\mathbb{R}^{n_t}\times\mathbb{R}^{m_t}\times\Xi_{[t-1]}$. Consequently we obtain by Theorem 7.1 that the function

$$\nu(x_t,u_t,\xi_{[t-1]}):=\sup_{P_t\in\mathcal{P}_t^{\xi_{[t-1]}}}\psi_t(P_t,x_t,u_t,\xi_{[t-1]})$$

is continuous in $(x_t,u_t,\xi_{[t-1]})$. Finally, again by Theorem 7.1, we have that

$$\mathcal{V}_t(x_t,\xi_{[t-1]})=\inf_{u_t\in\mathcal{U}_t}\nu(x_t,u_t,\xi_{[t-1]})$$

is continuous. The infimum is attained since $\mathcal{U}_t$ is compact. ∎

Remark 2.1.

It can be noted that, because of the basic assumption that the probability laws of the process $\xi_t$ do not depend on the states and actions, the state $x_t$ can be considered as a function of $\xi_{[t-1]}$, and hence the value functions $\mathcal{V}_t(\xi_{[t-1]})$ and optimal policies $\bar\pi_t(\xi_{[t-1]})$ can be viewed as functions of the process $\xi_1,\ldots,\xi_T$ alone (compare with Remark 1.2). $\square$

Remark 2.2.

Condition (2.8) is sufficient for the corresponding policy to be optimal. It is interesting to note that for certain classes of ambiguity sets such a condition is not necessary for the policy to be optimal for problem (2.6), even if the ambiguity sets $\mathcal{P}_t$ do not depend on $\xi_{[t-1]}$ (cf., [12]). $\square$

2.2 Randomized Policies

We also consider randomized policies for the controller. That is, at stage $t=1,\ldots,T$, the nature chooses a probability measure $P_t\in\mathcal{P}_t^{\xi_{[t-1]}}$, and at the same time the controller chooses its action $u_t$ at random according to a probability distribution $\mu_t\in\mathfrak{C}_t$, where $\mathfrak{C}_t$ denotes the set of (Borel) probability measures on $\mathcal{U}_t$. Consequently $x_{t+1}=F_t(x_t,u_t,\xi_t)$, and so on. Here the decision process is defined by histories $\mathfrak{h}_t$, as defined in (2.5), with the actions $u_t$ chosen at random.

This corresponds to the min-max formulation

$$\min_{\mu\in\widehat\Pi}\sup_{\gamma\in\Gamma}\mathbb{E}^{\mu,\gamma}\Big[\sum_{t=1}^{T}f_t(x_t,u_t,\xi_t)+f_{T+1}(x_{T+1})\Big], \qquad (2.9)$$

where $\widehat\Pi$ is the set of randomized policies of the controller, mapping $\mathfrak{h}_t$ into $\mathfrak{C}_t$ at each stage. For randomized policies the counterpart of the dynamic equation (2.7) becomes $V_{T+1}(x_{T+1},\xi_{[T]})=f_{T+1}(x_{T+1})$ for all $\xi_{[T]}\in\Xi_{[T]}$, and for $t=T,\ldots,1$,

$$V_t(x_t,\xi_{[t-1]})=\inf_{\mu_t\in\mathfrak{C}_t}\sup_{P_t\in\mathcal{P}_t^{\xi_{[t-1]}}}\mathbb{E}_{(u_t,\xi_t)\sim\mu_t\times P_t}\Big[f_t(x_t,u_t,\xi_t)+V_{t+1}\big(F_t(x_t,u_t,\xi_t),\xi_{[t]}\big)\Big] \qquad (2.10)$$

(cf., [9], [17]). The controller has a non-randomized optimal policy if for every $t$ and $(x_t,\xi_{[t-1]})$ problem (2.11) below has an optimal solution supported on a single point of $\mathcal{U}_t$.
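For intuition, consider a toy one-step instance of (2.11) with two controls and an ambiguity set spanned by two distributions (all data hypothetical). The matrix $G[i,j]$ below records the expected cost of control $u_i$ under distribution $P_j$; there is no pure saddle point, and the optimal $\bar\mu_t$ mixes both controls, so no non-randomized policy is optimal here.

```python
import numpy as np

# G[i, j]: expected cost-to-go of control u_i under candidate distribution P_j.
G = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# Non-randomized controller: inf over u of sup over P, as in (2.7).
v_pure = min(max(row) for row in G)                       # = 1.0

# Randomized controller, mu = (p, 1 - p): brute force over a grid of p.
ps = np.linspace(0.0, 1.0, 1001)
v_mixed = min(max(p * G[0, j] + (1 - p) * G[1, j] for j in range(2))
              for p in ps)                                # = 0.5 at p = 0.5

print(v_pure, v_mixed)   # randomization strictly lowers the worst-case cost
```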

Note that by the Banach-Alaoglu theorem the set $\mathfrak{C}_t$ is compact in the weak$^*$ topology of the dual of the space $C(\mathcal{U}_t)$. With arguments similar to those of Proposition 2.1, we have the following.

Proposition 2.2.

Suppose that Assumption 2.1 holds and the multifunctions $\mathbb{M}_t$ are continuous. Then the value functions $V_t(x_t,\xi_{[t-1]})$ in (2.10) are continuous in $(x_t,\xi_{[t-1]})\in\mathbb{R}^{n_t}\times\Xi_{[t-1]}$ and the infimum in (2.10) is attained. The optimal policy of the controller is given by $\bar\pi_t(x_t,\xi_{[t-1]})=\bar\mu_t(x_t,\xi_{[t-1]})$ with

$$\bar\mu_t(x_t,\xi_{[t-1]})\in\mathop{\rm arg\,min}_{\mu_t\in\mathfrak{C}_t}\sup_{P_t\in\mathcal{P}_t^{\xi_{[t-1]}}}\mathbb{E}_{(u_t,\xi_t)\sim\mu_t\times P_t}\Big[f_t(x_t,u_t,\xi_t)+V_{t+1}\big(F_t(x_t,u_t,\xi_t),\xi_{[t]}\big)\Big]. \qquad (2.11)$$
Proof.

We have that $V_{T+1}(x_{T+1},\xi_{[T]})=f_{T+1}(x_{T+1})$, and hence $V_{T+1}(x_{T+1},\xi_{[T]})$ is continuous by the assumption. Now suppose that $V_{t+1}(\cdot,\cdot)$ is continuous. Consider

$$\psi_t(P_t,x_t,\mu_t,\xi_{[t-1]}):=\mathbb{E}_{(u_t,\xi_t)\sim\mu_t\times P_t}\Big[f_t(x_t,u_t,\xi_t)+V_{t+1}(F_t(x_t,u_t,\xi_t),\xi_{[t]})\Big]=\int_{\mathcal{U}_t}\int_{\Xi_t}\Big[f_t(x_t,u_t,\xi_t)+V_{t+1}(F_t(x_t,u_t,\xi_t),\xi_{[t]})\Big]\,P_t(d\xi_t)\,\mu_t(du_t).$$

By Assumption 2.1 both $\Xi_t$ and $\mathcal{U}_t$ are compact, and consequently for any $(x_t^n,\xi_{[t-1]}^n)$ converging to $(x_t,\xi_{[t-1]})$ we have that $\phi_t^n(u_t,\xi_t):=f_t(x_t^n,u_t,\xi_t)+V_{t+1}(F_t(x_t^n,u_t,\xi_t),(\xi_{[t-1]}^n,\xi_t))$ converges to $\phi_t(u_t,\xi_t):=f_t(x_t,u_t,\xi_t)+V_{t+1}(F_t(x_t,u_t,\xi_t),\xi_{[t]})$ uniformly in $(u_t,\xi_t)\in\mathcal{U}_t\times\Xi_t$. For $P_t^n\stackrel{w^*}{\to}P_t$ and $\mu_t^n\stackrel{w^*}{\to}\mu_t$, it can be readily shown, using the compactness of $\Xi_t$ and $\mathcal{U}_t$, that $P_t^n\times\mu_t^n$ converges to $P_t\times\mu_t$ in the weak$^*$ topology of the dual of $C(\Xi_t\times\mathcal{U}_t)$, and hence $\psi_t(P_t^n,x_t^n,\mu_t^n,\xi_{[t-1]}^n)\to\psi_t(P_t,x_t,\mu_t,\xi_{[t-1]})$ (Remark 1.3). In addition, both $\mathfrak{M}_t$ and $\mathfrak{C}_t$ with their weak$^*$ topologies are metrizable [6, Theorem 6.30]. It follows that $\psi_t$ is continuous jointly in $(P_t,\mu_t)$ (with respect to the product of the weak$^*$ topologies of $\mathfrak{M}_t$ and $\mathfrak{C}_t$) and $(x_t,\xi_{[t-1]})\in\mathbb{R}^{n_t}\times\Xi_{[t-1]}$. The rest of the proof follows along the same lines as that of Proposition 2.1. ∎

Remark 2.3.

Note that here the state $x_t$ also depends on the history of the chosen controls $u_\tau\sim\mu_\tau$, $\tau=1,\ldots,t-1$. Therefore the optimal policy $\bar\pi_t(x_t,\xi_{[t-1]})$ cannot be written as a function of $\xi_{[t-1]}$ alone (compare with Remark 2.1). $\square$

In Section 4 we will discuss necessary and sufficient conditions for the existence of non-randomized optimal policies for the controller.

2.3 Nested Formulation

For non-randomized policies of the controller, the dynamic equations (2.7) correspond to the following nested formulation of the respective distributionally robust SOC problem. For a non-randomized policy $\{\pi_t=\pi_t(x_t,\xi_{[t-1]})\}$, consider the total cost

$$Z^\pi:=\sum_{t=1}^{T}f_t(x_t,\pi_t,\xi_t)+f_{T+1}(x_{T+1}).$$

Recall that we can consider here policies $\pi_t(\xi_{[t-1]})$ as functions of $\xi_{[t-1]}$ alone, and that the state $x_t$ is also a function of $\xi_{[t-1]}$ (see Remark 2.1). Therefore we can view $Z^\pi=Z^\pi(\xi_{[T]})$ as a function of the process $\xi_1,\ldots,\xi_T$.

Consider linear spaces $\mathcal{Z}_t$ of bounded random variables $Z_t:\Xi_{[t]}\to\mathbb{R}$. For $t=T,\ldots,1$, define recursively the mappings $\mathcal{R}_t:\mathcal{Z}_t\to\mathcal{Z}_{t-1}$,

$$\mathcal{R}_t(Z_t):=\sup_{P_t\in\mathcal{P}_t^{\xi_{[t-1]}}}\mathbb{E}_{\xi_t\sim P_t}[Z_t(\xi_1,\ldots,\xi_{t-1},\xi_t)],\quad\text{where}\;\;\mathbb{E}_{\xi_t\sim P_t}[Z_t(\xi_1,\ldots,\xi_{t-1},\xi_t)]=\int_{\Xi_t}Z_t(\xi_1,\ldots,\xi_{t-1},\xi_t)\,dP_t(\xi_t). \qquad (2.12)$$

Note that $\mathcal{Z}_0=\mathbb{R}$ and $\mathcal{R}_1(Z_1)=\sup_{P_1\in\mathcal{P}_1}\mathbb{E}_{P_1}[Z_1]$ is a real number. Let

$$\mathfrak{R}_T(Z_T):=\mathcal{R}_1(\mathcal{R}_2(\cdots\mathcal{R}_T(Z_T))) \qquad (2.13)$$

be the corresponding composite functional. Then with $Z_T=Z^\pi$ the nested distributionally robust counterpart of problem (1.1) is

$$\min_{\pi\in\Pi}\mathfrak{R}_T\Big[\sum_{t=1}^{T}f_t(x_t,\pi_t,\xi_t)+f_{T+1}(x_{T+1})\Big], \qquad (2.14)$$

where (as before) $\Pi$ is the set of non-randomized policies of the controller.
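To illustrate the recursion (2.12)-(2.13), the following toy computation evaluates the nested functional on a two-stage tree with finite supports; the cost realizations and the candidate conditional distributions are all hypothetical.

```python
import numpy as np

xi1_support = [0, 1]                      # support of xi_1
Z = np.array([[1.0, 3.0],                 # Z(xi_1, xi_2): cost realizations
              [2.0, 0.5]])

# Conditional ambiguity sets P_2^{xi_1}: candidate distributions over xi_2.
P2 = {0: [np.array([0.5, 0.5]), np.array([0.8, 0.2])],
      1: [np.array([0.5, 0.5]), np.array([0.3, 0.7])]}
P1 = [np.array([0.5, 0.5]), np.array([0.6, 0.4])]   # first-stage set P_1

# R_2(Z): worst-case conditional expectation, a function of xi_1, cf. (2.12).
R2 = np.array([max(p @ Z[x1] for p in P2[x1]) for x1 in xi1_support])
# R_1(R_2(Z)): a real number, cf. (2.13).
print(max(p @ R2 for p in P1))
```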

2.4 Stagewise Independence Setting

An important particular case of the above framework is when the ambiguity sets $\mathcal{P}_t$ do not depend on $\xi_{[t-1]}$, i.e., $\mathcal{P}_t^{\xi_{[t-1]}}=\mathcal{P}_t$ for all $\xi_{[t-1]}\in\Xi_{[t-1]}$. In that case the value functions $V_t(x_t)$ and optimal policies $\bar\pi_t(x_t)$ do not depend on $\xi_{[t-1]}$, and the respective dynamic equations can be written as

$$V_t(x_t)=\inf_{\mu_t\in\mathfrak{C}_t}\sup_{P_t\in\mathcal{P}_t}\mathbb{E}_{(u_t,\xi_t)\sim\mu_t\times P_t}\Big[f_t(x_t,u_t,\xi_t)+V_{t+1}\big(F_t(x_t,u_t,\xi_t)\big)\Big]. \qquad (2.15)$$

Also, in that case the assumption of continuity of the multifunctions $\mathbb{M}_t(\cdot)\equiv\mathcal{P}_t$ holds automatically.

Here we can consider the static counterpart of problem (2.6) defined as

$$\min_{\pi\in\Pi}\sup_{P\in\mathfrak{P}}\mathbb{E}^{\pi}_{\xi_{[T]}\sim P}\Big[\sum_{t=1}^{T}f_t(x_t,u_t,\xi_t)+f_{T+1}(x_{T+1})\Big], \qquad (2.16)$$

where

$$\mathfrak{P}:=\{P=P_1\times\cdots\times P_T:P_t\in\mathcal{P}_t,\;t=1,\ldots,T\}. \qquad (2.17)$$

In the distributionally robust setting, an ambiguity set of probability distributions of $\xi_{[T]}$ of the form (2.17) can be viewed as the counterpart of the stagewise independence condition.

3 Duality of Game Formulation

The dynamic equations of the dual of problem (2.9) are $Q_{T+1}(x_{T+1},\xi_{[T]})=f_{T+1}(x_{T+1})$ for all $\xi_{[T]}\in\Xi_{[T]}$, and for $t=T,\ldots,1$,

$$Q_t(x_t,\xi_{[t-1]})=\sup_{P_t\in\mathcal{P}_t^{\xi_{[t-1]}}}\inf_{\mu_t\in\mathfrak{C}_t}\mathbb{E}_{\mu_t\times P_t}\Big[f_t(x_t,u_t,\xi_t)+Q_{t+1}\big(F_t(x_t,u_t,\xi_t),\xi_{[t]}\big)\Big]. \qquad (3.1)$$

For a given $P_t$ (and $x_t$) the minimization on the right hand side of (3.1) is over all probability measures $\mu_t$ supported on the set $\mathcal{U}_t$. It is straightforward to see that the minimum is attained at a Dirac measure (we denote by $\delta_a$ the Dirac measure of mass one at the point $a$) concentrated at some $u_t\in\mathcal{U}_t$. Therefore it suffices to perform the minimization in (3.1) over Dirac measures $\mu_t=\delta_{u_t}$, and hence we can write equation (3.1) in the following equivalent way:

$$Q_t(x_t,\xi_{[t-1]})=\sup_{P_t\in\mathcal{P}_t^{\xi_{[t-1]}}}\inf_{u_t\in\mathcal{U}_t}\mathbb{E}_{P_t}\Big[f_t(x_t,u_t,\xi_t)+Q_{t+1}\big(F_t(x_t,u_t,\xi_t),\xi_{[t]}\big)\Big]. \qquad (3.2)$$

Similarly to Proposition 2.2, it can be shown that the value functions $Q_t(\cdot)$ are well-defined and continuous. By the standard theory of min-max problems, $Q_t(\cdot)\le V_t(\cdot)$. We next establish that equality indeed holds when the multifunctions $\mathbb{M}_t$, $t=1,\ldots,T$, are convex-valued, i.e., the sets $\mathcal{P}_t^{\xi_{[t-1]}}=\mathbb{M}_t(\xi_{[t-1]})$ are convex for all $\xi_{[t-1]}$.

Proposition 3.1.

Suppose that Assumption 2.1 holds and the multifunctions $\mathbb{M}_t$ are continuous and convex-valued. Then $Q_t(\cdot)=V_t(\cdot)$ and the min-max problem in (3.1) possesses a saddle point $(P_t^*,\mu_t^*)$, for all $t=1,\ldots,T$.

Proof.

Under the specified assumptions the sets $\mathcal{P}_t^{\xi_{[t-1]}}$ and $\mathfrak{C}_t$ are weakly$^*$ compact and convex, and of course the expectation $\mathbb{E}_{\mu_t\times P_t}$ is linear with respect to $\mu_t$ and $P_t$ and continuous in the respective weak$^*$ topologies. Then, by using Sion's duality theorem (see Remark 1.4) and applying induction backward in time, we conclude that $Q_t(\cdot)=V_t(\cdot)$, $t=1,\ldots,T$. Further, by compactness arguments the respective min-max and max-min problems have optimal solutions, which implies the existence of the saddle point. ∎

It could be worth mentioning here that the duality of the game formulation (when considering randomized policies $\widehat\Pi$ of the controller) does not require convexity of $\mathcal{U}_t$ or any assumptions on the transition functions $F_t$.
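As a numerical check of Proposition 3.1, the one-step problems (2.10) and (3.1) with a finite control set and an ambiguity set given by the convex hull of finitely many distributions reduce to a matrix game; with randomized controls the inf-sup and sup-inf values coincide. The data below are hypothetical and the LP is the standard mixed-strategy formulation of a matrix game.

```python
import numpy as np
from scipy.optimize import linprog

G = np.array([[1.0, 0.0],      # G[i, j]: expected cost of control u_i
              [0.0, 1.0]])     # under the extreme distribution P_j

def game_value(B):
    """min over the row player's mixed strategy of max over columns of B."""
    m, n = B.shape
    c = np.r_[np.zeros(m), 1.0]                     # variables (x, v); min v
    res = linprog(c,
                  A_ub=np.c_[B.T, -np.ones(n)], b_ub=np.zeros(n),  # B^T x <= v
                  A_eq=np.r_[np.ones(m), 0.0].reshape(1, -1), b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.fun

V = game_value(G)      # controller moves first with randomized u, cf. (2.10)
Q = -game_value(-G.T)  # nature moves first; inner inf attained at Diracs, cf. (3.2)
print(V, Q)            # equal (0.5): Q_t(.) = V_t(.) as in Proposition 3.1
```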

4 Existence of Non-randomized Optimal Policies

We have the following necessary and sufficient condition for the existence of a non-randomized optimal policy of the controller (cf., [9, Theorem 2.2]).

Theorem 4.1.

Suppose that Assumption 2.1 holds and the multifunctions $\mathbb{M}_t$ are continuous and convex-valued. Then the controller has a non-randomized optimal policy if and only if the min-max problem in the right hand side of (2.7) has a saddle point for all $t=1,\ldots,T$ and $(x_t,\xi_{[t-1]})$.

Proof.

Note that the min-max problem (2.7) (resp. the max-min problem (3.2)) has a saddle point iff $Q_t(\cdot)=\mathcal{V}_t(\cdot)$, $t=1,\ldots,T$.

It is clear that $V_t(\cdot)\le\mathcal{V}_t(\cdot)$ and $Q_t(\cdot)\le V_t(\cdot)$. Suppose that for $t=1,\ldots,T$ and all $(x_t,\xi_{[t-1]})$ the min-max problem (2.7) possesses a saddle point $(\bar u_t,P_t)$, possibly depending on $(x_t,\xi_{[t-1]})$. Then we have $\mathcal{V}_t(\cdot)=Q_t(\cdot)$ and consequently $V_t(\cdot)=\mathcal{V}_t(\cdot)$. It follows that $\bar\mu_t=\delta_{\bar u_t}$ is an optimal solution of the min-max problem (2.10), and hence is a non-randomized optimal policy of the controller.

Conversely, suppose that the controller has a non-randomized optimal policy. Then $V_t(\cdot)=\mathcal{V}_t(\cdot)$, $t=1,\ldots,T$. By Proposition 3.1 we also have that $V_t(\cdot)=Q_t(\cdot)$, and hence $Q_t(\cdot)=\mathcal{V}_t(\cdot)$, $t=1,\ldots,T$. Since problem (2.7) and its dual have optimal solutions, this implies the existence of a saddle point for (2.7) (see Remark 1.4). ∎

In particular, we have the following result (cf., [5]).

Corollary 4.1.

Suppose that the sets $\mathcal{U}_t$ are convex, for every $\xi_t\in\Xi_t$ the functions $f_t(x_t,u_t,\xi_t)$ are convex in $(x_t,u_t)$, the mappings $F_t(x_t,u_t,\xi_t)=A_t(\xi_t)x_t+B_t(\xi_t)u_t+b_t(\xi_t)$ are affine, the multifunctions $\mathbb{M}_t$ are continuous and convex-valued, $t=1,\ldots,T$, and Assumption 2.1 holds. Then the value functions $V_t(x_t,\xi_{[t-1]})=\mathcal{V}_t(x_t,\xi_{[t-1]})$ are convex in $x_t$ for every $\xi_{[t-1]}$, and the controller has a non-randomized optimal policy.

Proof.

It can be directly verified that under the specified assumptions the value functions $\mathcal{V}_t(\cdot,\xi_{[t-1]})$ are convex for all $t=1,\ldots,T$ and $\xi_{[t-1]}\in\Xi_{[t-1]}$. Consequently the min-max problem (2.7) has a saddle point (see Remark 1.4). ∎

In many interesting applications the considered problem is convex in the sense of Corollary 4.1. In such cases there is no point in considering randomized policies. As an example, consider the classical Inventory Model (cf., [18]).

Inventory Model.

Consider the following inventory model (cf., [18]):

$$\begin{array}{cl}\min & \mathbb{E}\Big[\sum_{t=1}^{T}c_tu_t+\psi_t(x_t,u_t,D_t)\Big]\\[4pt] {\rm s.t.} & u_t\in\mathcal{U}_t,\;\;x_{t+1}=x_t+u_t-D_t,\;t=1,\ldots,T,\end{array} \qquad (4.1)$$

where $\psi_t(x_t,u_t,d_t):=b_t[d_t-(x_t+u_t)]_++h_t[x_t+u_t-d_t]_+$, $D_1,\ldots,D_T$ is a (random) demand process, $c_t,b_t,h_t$ are the ordering, backorder penalty and holding costs per unit, respectively, $x_t$ is the inventory level, $u_t$ is the order quantity at time $t$, and $\mathcal{U}_t=[0,a_t]$. Suppose that the distribution of $(D_1,\ldots,D_T)$ is supported on $\Xi_1\times\cdots\times\Xi_T$ with $\Xi_t$, $t=1,\ldots,T$, being a finite subinterval of the nonnegative real line.

The distributionally robust counterpart of (4.1) is a convex problem, provided the respective multifunctions $\mathbb{M}_t$ are continuous and convex-valued. It follows by Corollary 4.1 that there is a non-randomized optimal policy of the controller.
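For illustration, the robust dynamic programming recursion for this model can be sketched as follows, assuming (hypothetically) a finite demand support, history-independent candidate demand distributions, discretized inventory levels and order quantities, and zero final cost; all numbers are placeholders.

```python
import numpy as np

T, c, b, h, a = 3, 1.0, 4.0, 2.0, 10.0       # horizon, c_t, b_t, h_t, a_t
demand = np.array([0.0, 2.0, 4.0])           # finite demand support Xi_t
P_candidates = [np.array([0.3, 0.4, 0.3]),   # extreme points of a convex
                np.array([0.1, 0.3, 0.6])]   # ambiguity set of demand laws

xs = np.linspace(-12.0, 12.0, 241)           # inventory-level grid
us = np.linspace(0.0, a, 41)                 # order quantities in [0, a_t]

def psi(x, u, d):                            # backorder + holding cost
    return b * np.maximum(d - (x + u), 0.0) + h * np.maximum(x + u - d, 0.0)

V = np.zeros_like(xs)                        # V_{T+1} = 0 (no final cost)
for t in range(T, 0, -1):
    V_next, V = V, np.empty_like(xs)
    for i, x in enumerate(xs):
        costs = []
        for u in us:
            x_next = x + u - demand          # next inventory, one per demand
            idx = np.clip(np.searchsorted(xs, x_next), 0, len(xs) - 1)
            z = c * u + psi(x, u, demand) + V_next[idx]
            costs.append(max(P @ z for P in P_candidates))  # worst case
        V[i] = min(costs)                    # optimal order at level x

print(V[np.searchsorted(xs, 0.0)])           # robust cost-to-go at x_1 = 0
```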

5 Construction of Ambiguity Sets

There are several ways in which the ambiguity sets can be constructed. We now discuss an approach which can be considered as an extension of the risk averse modeling of multistage stochastic programs.

Let $\mathbb{P}$ be a probability measure on a (closed) set $\Xi\subset\mathbb{R}^d$, and let $\mathcal{P}$ be a set of probability measures on $\Xi$ absolutely continuous with respect to $\mathbb{P}$. Consider the corresponding distributionally robust functional

$$\mathcal{R}(Z):=\sup_{P\in\mathcal{P}}\mathbb{E}_P[Z] \qquad (5.1)$$

defined on an appropriate space $\mathcal{Z}$ of random variables $Z:\Xi\to\mathbb{R}$. Such a functional can be considered as a coherent risk measure (e.g., [14, Chapter 6]). The functional $\mathcal{R}$ is law invariant if $\mathcal{R}(Z)=\mathcal{R}(Z')$ whenever $Z$ and $Z'$ are distributionally equivalent, i.e., $\mathbb{P}(Z\le z)=\mathbb{P}(Z'\le z)$ for all $z\in\mathbb{R}$. A law invariant coherent risk measure can be considered as a function of the cumulative distribution function (cdf) $H_Z(z):=\mathbb{P}(Z\le z)$.

An important example of a law invariant coherent risk measure is the Average Value-at-Risk

$$\mathsf{AV@R}_\alpha(Z):=(1-\alpha)^{-1}\int_\alpha^1\mathsf{V@R}_\tau(Z)\,d\tau,\;\alpha\in[0,1),$$

where $\mathsf{V@R}_\tau(Z)=\inf\{z:\mathbb{P}(Z\le z)\ge\tau\}$. It has the dual representation of the form (5.1) with (the notation $P\ll Q$ means that the measure $P$ is absolutely continuous with respect to the measure $Q$)

$$\mathcal{P}=\big\{P\ll\mathbb{P}:dP/d\mathbb{P}\le(1-\alpha)^{-1}\big\}.$$
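The dual representation can be checked numerically on a finite sample with the uniform reference measure: the worst-case density saturates the bound $(1-\alpha)^{-1}$ on the largest outcomes, which for $(1-\alpha)N$ integer reproduces the average of the top order statistics obtained from the quantile formula. The sample below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha = 10, 0.6                    # (1 - alpha) * N = 4 atoms carry mass
Z = np.sort(rng.normal(size=N))       # sample; uniform reference weights 1/N

# Dual form (5.1): maximize q @ Z over q >= 0, sum q = 1,
# q_i <= 1/((1 - alpha) N); the greedy optimum fills mass from the top.
cap = 1.0 / ((1.0 - alpha) * N)
q, remaining = np.zeros(N), 1.0
for i in range(N - 1, -1, -1):        # largest outcomes first
    q[i] = min(cap, remaining)
    remaining -= q[i]
avar_dual = q @ Z

# Quantile form: here it reduces to the mean of the top (1 - alpha) N values.
avar_quantile = Z[int(np.ceil(N * alpha)):].mean()
print(avar_dual, avar_quantile)       # equal
```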

Now let $\mathbb{P}$ be a reference probability measure on $\Xi_{[T]}=\Xi_1\times\cdots\times\Xi_T$. Denote by $\mathbb{P}_t$ and $\mathbb{P}_{[t]}$ the corresponding marginal probability measures on $\Xi_t$ and $\Xi_{[t]}$. Let us define $\mathbb{P}_t^{\xi_{[t-1]}}$, with respect to $\Xi_{[t]}=\Xi_{[t-1]}\times\Xi_t$ and $\mathbb{P}_{[t-1]}$, recursively going backward in time. That is, using the Regular Probability Kernel (RPK) (e.g., [4, III-70]), define $\mathbb{P}_t^{\xi_{[t-1]}}$ as a probability measure on $\Xi_t$ for almost every (a.e.) $\xi_{[t-1]}\in\Xi_{[t-1]}$ (a.e. with respect to $\mathbb{P}_{[t-1]}$), such that for any measurable sets $A\subset\Xi_{[t-1]}$ and $B\subset\Xi_t$,

$$\mathbb{P}(A\times B)=\int_A\mathbb{P}_t^{\xi_{[t-1]}}(B)\,d\mathbb{P}_{[t-1]}(\xi_{[t-1]}). \qquad (5.2)$$

The conditional counterpart $\mathcal{R}_t^{\xi_{[t-1]}}$ of a law invariant coherent risk measure can then be defined as the respective function of the conditional cdf of $Z$ (cf., [15]). For example, the conditional counterpart of the Average Value-at-Risk can be defined in that way and is given by

$$\mathsf{AV@R}^{\xi_{[t-1]}}_\alpha(Z)=(1-\alpha)^{-1}\int_\alpha^1\mathsf{V@R}_{\tau|\xi_{[t-1]}}(Z)\,d\tau,\;\alpha\in[0,1),$$

where $\mathsf{V@R}_{\tau|\xi_{[t-1]}}$ is the conditional counterpart of $\mathsf{V@R}_\tau$.

In turn this defines the set $\mathcal{P}_t^{\xi_{[t-1]}}$ of conditional probability measures. The respective one-step mappings $\mathcal{R}_t:\mathcal{Z}_t\to\mathcal{Z}_{t-1}$, defined in (2.12), are given by

$$[\mathcal{R}_t(Z_t)](\xi_{[t-1]})=\mathcal{R}_t^{\xi_{[t-1]}}(Z_t). \qquad (5.3)$$

This leads to the respective nested functional $\mathfrak{R}_T$, defined in (2.13), and to the nested formulation (2.14) of the problem with respect to non-randomized policies of the controller.

In this construction of nested counterparts of law invariant risk measures it is essential that the ambiguity sets consist of probability measures absolutely continuous with respect to the reference measure. However, the above approach can be extended beyond that setting. That is, the conditional set $\mathcal{P}_t^{\xi_{[t-1]}}$ of probability measures on $\Xi_t$ can be defined in some way with respect to the conditional distribution $\mathbb{P}_t^{\xi_{[t-1]}}$. For example, $\mathcal{P}_t^{\xi_{[t-1]}}$ can consist of probability measures $P_t\in\mathfrak{M}_t$ with Wasserstein distance from $\mathbb{P}_t^{\xi_{[t-1]}}$ less than or equal to a constant $r_t>0$.

Of course, in such a construction it should be verified that the dynamic programming equations (2.7) (in the case of non-randomized policies) and (2.10) (in the case of randomized policies) are well defined. There are two important cases where the corresponding multifunctions $\mathbb{M}_t$ are continuous and hence the value functions are continuous (Proposition 2.2). One such case is when the ambiguity sets $\mathcal{P}_t$ do not depend on $\xi_{[t-1]}$ (see Section 2.4). This corresponds to the case where $\mathbb{P}$ is given by the direct product $\mathbb{P}_1\times\cdots\times\mathbb{P}_T$ of the marginal measures. In that case the ambiguity sets $\mathcal{P}_t\subset\mathfrak{M}_t$ can be arbitrary. For example, $\mathcal{P}_t$ can be the set of probability measures with Wasserstein distance from the reference measure $\mathbb{P}_t$ less than or equal to a positive constant $r_t$. The dynamic programming equations take the form (2.15), and by Proposition 2.2 the value functions $V_t(x_t)$ are continuous.
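For a finite support, the worst-case expectation over such a Wasserstein ball is itself a linear program in the transport plan. The sketch below, with hypothetical data and a fixed common support, illustrates the computation of $\sup\{\mathbb{E}_P[Z]:W_1(P,\mathbb{P}_t)\le r_t\}$.

```python
import numpy as np
from scipy.optimize import linprog

xs = np.array([0.0, 1.0, 2.0, 3.0])          # common finite support
p_ref = np.array([0.4, 0.3, 0.2, 0.1])       # reference measure
Z = xs**2                                    # payoff Z(xi)
r = 0.5                                      # Wasserstein radius r_t
n = len(xs)
D = np.abs(xs[:, None] - xs[None, :])        # ground cost |x_i - x_j|

# Variables: transport plan gamma[i, j], flattened row-major.
c = -np.tile(Z, n)                           # maximize sum_ij gamma_ij Z_j
A_eq = np.kron(np.eye(n), np.ones(n))        # sum_j gamma_ij = p_ref_i
res = linprog(c,
              A_ub=D.reshape(1, -1), b_ub=[r],   # total transport cost <= r
              A_eq=A_eq, b_eq=p_ref,
              bounds=[(0, None)] * (n * n))
print(-res.fun)                              # worst-case expectation E_P[Z]
```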

Another important case is when the sets $\Xi_t$, $t=1,\ldots,T$, are finite, i.e., $(\xi_1,\ldots,\xi_T)$ has a discrete distribution with finite support. Then continuity of the multifunctions $\mathbb{M}_t$ trivially holds, and hence the value functions $V_t(x_t,\xi_{[t-1]})$ are continuous.

6 Conclusions

The main technical assumption used in the considered construction is continuity of the multifunctions $\mathbb{M}_t$. This assumption ensures continuity of the value functions and hence measurability of the objective functions in the dynamic programming equations (2.7) and (2.10). It holds automatically in two important cases, namely when the ambiguity sets do not depend on the history of the process $\xi_t$ or when the sets $\Xi_t$ are finite. It can be noted that upper semicontinuity of $\mathbb{M}_t$ alone is not sufficient for ensuring continuity (or even semicontinuity) of the value functions. In general it can be difficult to verify the assumption of continuity of $\mathbb{M}_t$, in which case the measurability question remains open.

By using the game formulation it is possible to consider randomized policies of the controller. In Theorem 4.1 we give necessary and sufficient conditions for the existence of non-randomized optimal policies. The assumption that the multifunctions $\mathbb{M}_t$ are convex-valued, i.e., that the respective ambiguity sets are convex, is rather mild. The assumption of continuity of $\mathbb{M}_t$ is needed in order to ensure that the respective dynamic equations are well defined.

There is a delicate point which we would like to mention. In the game formulation, after the nature chooses the probability measure $P_t\in\mathcal{P}_t^{\xi_{[t-1]}}$, the system moves to the next state $x_{t+1}=F_t(x_t,u_t,\xi_t)$ according to the probability distribution $\xi_t\sim P_t$. On the other hand, the risk averse approach assumes the existence of reference measures $\mathbb{P}_t$, and the probability law of $\xi_t$ is defined by $\xi_t\sim\mathbb{P}_t$. In both cases, the game and the nested risk averse (for non-randomized policies) approaches lead to the same dynamic equations, the same optimal policies of the controller and the same optimal values. However, the interpretation in terms of realizations of the random process could be different.

References

  • [1] D.P. Bertsekas and S.E. Shreve. Stochastic Optimal Control, The Discrete Time Case. Academic Press, New York, 1978.
  • [2] P. Billingsley. Probability and Measure, Third Edition. Wiley, 1995.
  • [3] J. Frédéric Bonnans and Alexander Shapiro. Perturbation Analysis of Optimization Problems. Springer Series in Operations Research. Springer, 2000.
  • [4] C. Dellacherie and P.A. Meyer. Probabilities and Potential. North-Holland, Amsterdam, Netherlands, 1988.
  • [5] E. Delage, D. Kuhn, and W. Wiesemann. "Dice"-sion-making under uncertainty: When can a random decision reduce risk? Management Science, 65(7):3282–3301, 2019.
  • [6] C.D. Aliprantis and K.C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer, 2006.
  • [7] G.N. Iyengar. Robust Dynamic Programming. Mathematics of Operations Research, 30:257–280, 2005.
  • [8] A. Jaśkiewicz and A. S. Nowak. Zero-sum stochastic games. In T. Basar and G. Zaccour, editors, Handbook of dynamic game theory. Springer, 2016.
  • [9] Y. Li and A. Shapiro. Rectangularity and duality of distributionally robust Markov decision processes. https://arxiv.org/abs/2308.11139, 2023.
  • [10] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition probabilities. Operations Research, 53:780–798, 2005.
  • [11] A. Shapiro. Rectangular sets of probability measures. Operations Research, 64(2):528–541, 2016.
  • [12] A. Shapiro. Interchangeability principle and dynamic equations in risk averse stochastic programming. Operations Research Letters, 45:377–381, 2017.
  • [13] A. Shapiro. Distributionally robust optimal control and MDP modeling. Operations Research Letters, 49:809–814, 2021.
  • [14] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia, third edition, 2021.
  • [15] A. Shapiro and A. Pichler. Conditional distributionally robust functionals. Operations Research, online:1–13, 2023.
  • [16] M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8:171–176, 1958.
  • [17] S. Wang, N. Si, J. Blanchet, and Z. Zhou. On the foundation of distributionally robust reinforcement learning. https://arxiv.org/abs/2311.09018, 2023.
  • [18] P.H. Zipkin. Foundations of Inventory Management. McGraw-Hill, Boston, MA, 2000.

7 Appendix

Let $\mathcal{X}$ and $\Gamma$ be nonempty compact metric spaces, $f:\mathcal{X}\times\Gamma\to\mathbb{R}$ be a continuous real valued function, and $\Phi:\Gamma\rightrightarrows\mathcal{X}$ be a multifunction (point-to-set mapping). It is said that the multifunction $\Phi$ is closed if $\gamma_n\to\gamma$, $x_n\in\Phi(\gamma_n)$ and $x_n\to x$ imply that $x\in\Phi(\gamma)$. If $\Phi$ is closed, then $\Phi$ is closed valued, i.e., the set $\Phi(\gamma)$ is closed for every $\gamma\in\Gamma$. It is said that the multifunction $\Phi$ is upper semicontinuous if for any $\bar\gamma\in\Gamma$ and any neighborhood $\mathcal{V}$ of $\Phi(\bar\gamma)$ there is a neighborhood $\mathcal{U}$ of $\bar\gamma$ such that $\Phi(\gamma)\subset\mathcal{V}$ for any $\gamma\in\mathcal{U}$. In the considered framework of compact sets, the multifunction $\Phi$ is closed iff it is closed valued and upper semicontinuous (e.g., [3, Lemma 4.3]).

It is said that the multifunction $\Phi$ is lower semicontinuous if for any $\bar\gamma\in\Gamma$ and any open set $\mathcal{V}$ such that $\mathcal{V}\cap\Phi(\bar\gamma)\ne\emptyset$ there is a neighborhood $\mathcal{U}$ of $\bar\gamma$ such that $\mathcal{V}\cap\Phi(\gamma)\ne\emptyset$ for any $\gamma\in\mathcal{U}$. It is said that $\Phi$ is continuous if it is closed and lower semicontinuous.

Consider the value function

$$v(\gamma):=\sup_{x\in\Phi(\gamma)}f(x,\gamma). \qquad (7.1)$$

We have the following result (e.g., [3, Proposition 4.4]).

Theorem 7.1.

Suppose that the sets $\mathcal{X}$ and $\Gamma$ are compact, the function $f:\mathcal{X}\times\Gamma\to\mathbb{R}$ is continuous, and the multifunction $\Phi:\Gamma\rightrightarrows\mathcal{X}$ is continuous. Then the value function $v(\gamma)$ is real valued and continuous.