Distributionally Robust Markov Decision Processes and their Connection to Risk Measures
Abstract.
We consider robust Markov Decision Processes with Borel state and action spaces, unbounded cost and finite time horizon. Our formulation leads to a Stackelberg game against nature. Under integrability, continuity and compactness assumptions we derive a robust cost iteration for a fixed policy of the decision maker and a value iteration for the robust optimization problem. Moreover, we show the existence of deterministic optimal policies for both players. This is in contrast to classical zero-sum games. In case the state space is the real line we show under some convexity assumptions that the interchange of supremum and infimum is possible with the help of Sion’s minimax Theorem. Further, we consider the problem with special ambiguity sets. In particular we are able to derive some cases where the robust optimization problem coincides with the minimization of a coherent risk measure. In the final section we discuss two applications: A robust LQ problem and a robust problem for managing regenerative energy.
Key words: Robust Markov Decision Process; Dynamic Games; Minimax Theorem; Risk Measures
AMS subject classifications: 90C40, 90C17, 91G70
1. Introduction
Markov Decision Processes (MDPs) are a well-established tool to model and solve sequential decision making under stochastic perturbations. In the standard theory it is assumed that all parameters and distributions are known or can be estimated with a certain precision. However, using the so-derived 'optimal' policies in a system where the true parameters or distributions deviate may lead to a significant deterioration of performance. In order to cope with this problem there are different approaches in the literature.
The first approach, which is typically used when parameters are unknown, is the so-called Bayesian approach. In this setting a prior distribution for the model parameters is assumed and additional information which is revealed while the process evolves is used to update the beliefs. Hence this approach allows parameters to be learned. It is very popular in engineering applications. Introductions can e.g. be found in [hernandez2012adaptive] and [BaeuerleRieder2011]. In this paper we are not going to pursue this stream of literature.
A second approach is the so-called robust approach. Here it is assumed that instead of having one particular transition law, we are faced with a whole family of laws which are possible for the transition. In the literature this is referred to as model ambiguity. One way of dealing with this ambiguity is the worst case approach, where the controller selects a policy which is optimal with respect to the most adverse transition law in each scenario. This setting can also be interpreted as a dynamic game with nature as the controller's opponent. The worst case approach is empirically justified by the so-called Ellsberg Paradox. The experiment suggested by [Ellsberg1961] has shown that agents tend to be ambiguity averse. In the sequel, axiomatic approaches to model risk and ambiguity attitude have appeared, see e.g. [gilboa1989maxmin, maccheroni2006ambiguity]. [EpsteinSchneider2003] investigated whether ambiguity aversion can be incorporated in an axiomatic model of intertemporal utility. The representation of the preferences turned out to be some worst case expected utility, i.e. the minimal expected utility over an appropriate set of probability measures. This set of probability measures needs to satisfy some rectangularity condition for the utility to have a recursive structure and therefore be time consistent. The rectangularity property has been taken up by [Iyengar2005] as a key assumption for being able to derive a Bellman equation for a robust MDP with countable state and action spaces. Contemporaneously, [NilimGhaoui2005] reached similar findings, though limited to finite state and action spaces.
[wiesemann2013robust] have considered robust MDPs beyond the rectangularity condition. Based on observed histories, they derive a confidence region that contains the unknown parameters with a prespecified probability and determine a policy that attains the highest worst-case performance over this confidence region. A similar approach has been taken in [xu2010distributionally], where nested uncertainty sets for transition laws are given which correspond to confidence sets. The optimization is then based on the expected performance of policies under the (respective) most adversarial distribution. This approach lies between the Bayesian and the robust approach since the decision maker uses prior information without a Bayesian interpretation. All these analyses are restricted to finite state and action spaces. A related but different approach has been considered in [bauerle2019markov], where parameter ambiguity is combined with Bayesian learning methods. Here the authors deal with arbitrary Borel state and action spaces.
In our paper we will generalize the results of [Iyengar2005] to a model with Borel spaces and unbounded cost function. We consider finite horizon expected cost problems with a given transition law under ambiguity concerning the distribution of the disturbance variables. The problem is to minimize the worst expected cost. This leads to a Stackelberg game against nature. In order to deal with the arising measurability issues which impose a major technical difficulty compared to the countable space situation, we borrow from the dynamic game setup in [Gonzales2002] and [JaskiewiczNowak2011]. The major difference of our contribution compared to these two works is the design of the distributional ambiguity. We replace the topology of weak convergence on the ambiguity set by the weak* topology in order to obtain connections to recursive risk measures in Section 6. Moreover, we rigorously derive a Bellman equation. Note that [jaskiewicz2014robust] treats another robust MDP setup with Borel state and action spaces, but deals with the average control problem. Further, our model allows for rather general ambiguity sets for the disturbance distribution. Under additional technical assumptions our model also comprises the setting of decreasing confidence regions for distributions. Moreover, we discuss in detail sufficient conditions for the interchange of supremum and infimum. We provide a counterexample which shows that this interchange is not always possible, in contrast to [NilimGhaoui2005], [wiesemann2013robust]. This counterexample also highlights the difference to classical two-person zero-sum games. Further, we are able to derive a connection to the optimization of risk measures in MDP frameworks. In case the ambiguity sets are chosen in a specific way the robust optimization problem coincides with the optimization of a risk measure applied to the total cost.
The outline of the paper is as follows: In the next section we introduce our basic model which is a game against nature concerning expected cost with a finite time horizon. Section 3 contains the first main results. Under integrability, continuity and compactness assumptions we derive a robust cost iteration for a fixed policy of the decision maker and a value iteration for the robust optimization problem. Moreover, we show the existence of optimal deterministic policies for both players. This is in contrast to classical zero-sum games where we usually obtain optimal randomized policies. In Section 4 we consider the real line as state space which allows for slightly different model assumptions. Then in Section 5 we discuss (with the real line as state space) when supremum and infimum can be interchanged in the solution of the problem. Under some convexity assumptions this can be achieved with the help of Sion's minimax Theorem [Sion1958]. Being able to interchange supremum and infimum sometimes simplifies the solution of the problem. In Section 6 we consider the problem with special ambiguity sets. In particular we are able to derive some cases which can be solved straightforwardly and situations where the robust optimization problem coincides with the minimization of a coherent risk measure. In the final section we discuss two applications: A robust LQ problem and a robust problem for managing regenerative energy.
2. The Markov Decision Model
We consider the following standard Markov Decision Process with general Borel state and action spaces and restrict ourselves to a model with finite planning horizon . Results for an infinite planning horizon can be found in [glauner20]. The state space is a Borel space with Borel σ-algebra and the action space is a Borel space with Borel σ-algebra . The possible state-action combinations at time form a measurable subset of such that contains the graph of a measurable mapping . The -section of ,
is the set of admissible actions in state at time . The sets are non-empty by assumption. We assume that the dynamics of the MDP are given by measurable transition functions and depend on disturbances which are independent random elements on a common probability space with values in a measurable space . When the current state is the controller chooses action and is the realization of , then the next state is given by
The one-stage cost function gives the cost for choosing action if the system is in state at time and the next state is . The terminal cost function gives the cost if the system terminates in state .
The model data is supposed to have the following continuity and compactness properties.
Assumption 2.1.
(i) are compact and is upper semicontinuous for , i.e. if and , , then has an accumulation point in .
(ii) is continuous in for and .
(iii) , as well as the terminal cost function are lower semicontinuous.
For we denote by the set of feasible histories of the decision process up to time
where for . In order for the controller's decisions to be implementable, they must be based on the information available at the time of decision making, i.e. be functions of the history of the decision process.
Definition 2.2.
(i) A randomized policy is a sequence of stochastic kernels from to the action space satisfying the constraint
(ii) A measurable mapping with for every is called (deterministic) decision rule at time . is called (deterministic) policy.
(iii) A decision rule at time is called Markov if it depends on the current state only, i.e. for all . If all decision rules are Markov, the deterministic (-stage) policy is called Markov.
For convenience, deterministic policies may simply be referred to as policies. With we denote the sets of all randomized policies, deterministic policies and Markov policies. The first inclusion follows by identifying deterministic decision rules with the corresponding Dirac kernels
A feasible policy always exists since contains the graph of a measurable mapping.
Due to the independence of the disturbances we may without loss of generality assume that the probability space has a product structure
We take as the canonical construction, i.e.
for all . We denote by the random state and action processes and define . In the sequel, we will require to be separable. Additionally, we will assume for some results that is atomless in order to support a generalized distributional transform.
Let be a stage of the decision process. Due to the product structure of the transition kernel is given by
(2.1)
We assume now that there is some uncertainty about , e.g. because it cannot be estimated properly. Moreover, the decision maker is very risk averse and tries to minimize the expected cost on a worst case basis. Thus, we denote by the set of probability measures on which are absolutely continuous with respect to and define for
Henceforth, we fix a non-empty subset which is referred to as ambiguity set at stage . This can be seen as the set of probability measures which may reflect the law of motion. Due to absolute continuity, we can identify with the set of corresponding densities w.r.t.
(2.2)
Accordingly, we view as a subset of and endow it with the trace topology of the weak* topology on where . The weak* topology in turn induces a Borel σ-algebra on making it a measurable space. We obtain the following result (for a proof see the Appendix).
Lemma 2.3.
Let the ambiguity set be norm-bounded and the probability measure on be separable. Then endowed with the weak* topology is a separable metrizable space. If is additionally weak* closed, it is even a compact Borel space.
In our cost model we allow for any norm-bounded ambiguity set . For applications, a meaningful way of choosing (within a norm bound) is to take all probability measures in which are close to in a certain metric, e.g. the Wasserstein metric (see e.g. [yang2017convex]). In our setting, which requires absolute continuity, the Kullback–Leibler divergence would be a natural choice.
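As a concrete illustration (not part of the model, all names and numbers hypothetical), the following minimal Python sketch discretizes the disturbance space and checks whether a candidate density with respect to the reference measure lies in a Kullback–Leibler ball, one possible way to specify an ambiguity set of the above type.

```python
import numpy as np

# Minimal sketch (hypothetical data): a disturbance space with finitely many atoms.
# The reference measure is the probability vector p_ref; candidate measures are given
# through their densities w.r.t. p_ref. The ambiguity set collects all measures whose
# Kullback-Leibler divergence from p_ref is at most eps.

def kl_divergence(q, p):
    """KL(q || p) for probability vectors with q absolutely continuous w.r.t. p."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def in_ambiguity_set(density, p_ref, eps):
    """Check whether the measure with density dQ/dP = density lies in the KL ball."""
    q = density * p_ref                      # candidate probability vector
    assert np.isclose(q.sum(), 1.0), "the density must integrate to one under p_ref"
    return kl_divergence(q, p_ref) <= eps

p_ref = np.array([0.2, 0.3, 0.5])            # reference disturbance distribution
density = np.array([1.2, 1.1, 0.86])         # candidate density w.r.t. p_ref
print(in_ambiguity_set(density, p_ref, eps=0.05))   # True
```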
The controller only knows that the transition kernel (2.1) at each stage is defined by some instead of but not which one exactly. We assume here that the controller faces a dynamic game against nature. This means that nature reacts to the controller's action in scenario with a decision rule where . A policy of nature is a sequence of such decision rules . Let be the set of all policies of nature. Since nature is an unobserved theoretical opponent of the controller, her actions are not considered to be part of the history of the Markov Decision Process. A Markov decision rule of nature at time is a measurable mapping and a Markov policy of nature is a sequence of such decision rules. The set of Markov policies of nature is denoted by . Thus we are faced with a Stackelberg game where the controller is the first mover and nature is the follower.
Lemma 2.4.
For a decision rule induces a stochastic kernel from to by
For a proof of the lemma see the Appendix. In the sequel, it will be clear from the context where we refer to as a decision rule or as a stochastic kernel.
The probability measure , which is unknown for the controller, now takes the role of in defining the transition kernel of the Markov Decision Process in (2.1). Let
(2.3)
As in the case without ambiguity, the Theorem of Ionescu-Tulcea yields that each initial state and pair of policies of the controller and nature induce a unique law of motion
(2.4)
on satisfying
for all and . In the usual way, we denote with the expectation operator with respect to and with or the respective conditional expectation given or .
The value of a policy pair at time is defined as
(2.5)
Since the controller is unaware which probability measure in the ambiguity set determines the transition law in each scenario, he tries to minimize the expected cost under the assumption to be confronted with the most adverse probability measure. The value functions are thus given by
and the optimization objective is
(2.6)
In game-theoretic terminology this is the upper value of a dynamic zero-sum game. If nature were to act first, i.e. if infimum and supremum were interchanged, one would obtain the game's lower value. If the two values agree and the infima/suprema are attained, the game has a Nash equilibrium, see also Section 5. But note that the players have asymmetric information in our setting.
Remark 2.5.
[Iyengar2005] does not model nature as making active decisions, but instead defines the set of all possible laws of motion. When each law of motion is of the form (2.4), he calls the set rectangular. Our approach with active decisions of nature based on [Gonzales2002] and [JaskiewiczNowak2011] is needed to construct Markov kernels as in Lemma 2.4 with probability measures from a given ambiguity set. When state and action spaces are countable as in [Iyengar2005], the technical problem of measurability does not arise and one can directly construct an ambiguous law of motion by simply multiplying (conditional) probabilities. The rectangularity property is satisfied in our setting.
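To make the rectangularity idea tangible, the following toy sketch (finite spaces, all numbers invented) lets nature pick one conditional distribution per stage, state and action from a finite stage-wise menu and builds the law of motion by multiplying the chosen conditionals along paths, exactly the countable-space construction alluded to above.

```python
import numpy as np
from itertools import product

# Hypothetical finite model: Q[(t, x, a)] is the stage-wise ambiguity set of
# conditional distributions over the next state; a rectangular law of motion is the
# product of conditionals chosen by nature at every (t, x, a).

states = [0, 1]
Q = {(t, x, a): [np.array([0.7, 0.3]), np.array([0.4, 0.6])]
     for t, x, a in product(range(2), states, [0])}

policy = {t: {x: 0 for x in states} for t in range(2)}   # controller: always action 0
nature = {key: 1 for key in Q}                            # nature: second measure everywhere

def law_of_motion(x0):
    """Path probabilities P(X1 = x1, X2 = x2) under the rectangular construction."""
    law = {}
    for x1, x2 in product(states, states):
        q0 = Q[(0, x0, policy[0][x0])][nature[(0, x0, policy[0][x0])]]
        q1 = Q[(1, x1, policy[1][x1])][nature[(1, x1, policy[1][x1])]]
        law[(x1, x2)] = float(q0[x1] * q1[x2])
    return law

law = law_of_motion(x0=0)
print(law, sum(law.values()))                             # the probabilities sum to one
```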
The feature of our model that there is no ambiguity in the transition functions is justified in many applications. Typically, transition functions describe a technical process or an economic calculation which is known ex ante and does not have to be estimated. The same applies to the cost function.
3. Value Iteration and Optimal Policies
In order to have well-defined value functions we need some integrability conditions.
Assumption 3.1.
(i) There exist with and measurable functions and such that it holds for all , and
Furthermore, it holds for all .
(ii) We define . For all and there exists an and measurable functions such that and
for all and . Here, is the closed ball around w.r.t. an arbitrary product metric on .
(iii) The ambiguity sets are norm bounded, i.e. there exists such that
for all and .
Remark 3.2.
(a) are called lower and upper bounding function, respectively, while is referred to as bounding function. As the absolute value is the sum of positive and negative part, satisfies for all , and :
(b) Assumptions 3.1 (i) and (ii) are satisfied with if the cost functions are bounded.
(c) If and hence , it is technically sufficient if part (ii) of Assumption 3.1 holds under the reference probability measure . Using Hölder's inequality and part (iii) we get for every
and analogous results for the upper bounding function. I.e. one simply has to replace by . However, the factor may be unnecessarily crude.
The next lemma shows that due to Assumption 3.1 (i) the value (2.5) of a policy pair is well-defined at all stages . One can see that the existence of either a lower or an upper bounding function is sufficient for the policy value to be well-defined. However, for the existence of an optimal policy pair we will need the integral to exist with finite value and therefore require both a lower and an upper bounding function.
Lemma 3.3.
Under Assumption 3.1 it holds for all policy pairs , time points and histories
Proof.
We only prove the first inequality. The second inequality is analogous. We use that
and consider the summands individually. We have by Assumption 3.1 (i). Since is a mapping to it follows from the first inequality of Assumption 3.1 (i) that
for . Now, the second inequality of Assumption 3.1 (i) yields for
Iterating this argument we obtain
Finally, summation over yields the assertion. ∎
With the bounding function we define the function space
Endowing with the weighted supremum norm
makes a Banach space, cf. Proposition 7.2.1 in [HernandezLasserre1999].
Having ensured that the value functions are well-defined, we can now proceed to derive the cost iteration. To ease notation we introduce the following operators.
Definition 3.4.
For functions s.t. for all and let
Note that the operators are monotone in .
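For intuition, here is an informal sketch (discretized disturbance space, all names and data our own) of the two kinds of operators: one evaluates the expected one-stage cost plus continuation value under a single measure, the robust one takes the supremum over a finite ambiguity set.

```python
import numpy as np

def L(x, a, q, z_grid, transition, cost, v_next):
    """Expected one-stage cost plus continuation value under the measure q."""
    next_states = [transition(x, a, z) for z in z_grid]
    payoff = np.array([cost(x, a, xn) + v_next(xn) for xn in next_states])
    return float(np.dot(q, payoff))

def L_robust(x, a, ambiguity_set, z_grid, transition, cost, v_next):
    """Supremum of L over a finite ambiguity set, i.e. nature's best-response value."""
    return max(L(x, a, q, z_grid, transition, cost, v_next) for q in ambiguity_set)

# toy usage: scalar state, additive dynamics, quadratic cost, zero continuation value
z_grid = np.array([-1.0, 0.0, 1.0])
ambiguity_set = [np.array([0.2, 0.6, 0.2]), np.array([0.1, 0.5, 0.4])]
print(L_robust(0.0, 0.5, ambiguity_set, z_grid,
               transition=lambda x, a, z: x + a + z,
               cost=lambda x, a, xn: a ** 2 + xn ** 2,
               v_next=lambda xn: 0.0))
```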
Proposition 3.5.
Under Assumption 3.1 the value of a policy pair can be calculated recursively for and as
Proof.
The proof is by backward induction. At time there is nothing to show. Now assume the assertion holds for , then the tower property of conditional expectation yields for
for all . ∎
Now, we evaluate a policy of the controller under the worst-case scenario regarding nature’s reaction. We define the robust value of a policy at time as
To minimize this quantity is the controller’s optimization objective. For the robust policy value, a cost iteration holds, too. With regard to a policy of nature this is a Bellman equation given a fixed policy of the controller.
Theorem 3.6.
Let Assumptions 2.1, 3.1 be satisfied.
a) The robust value of a policy is a measurable function of for . It can be calculated recursively as
b) If the ambiguity sets are weak* closed, there exist maximizing decision rules of nature for all . Each sequence of such decision rules is an optimal response of nature to the controller's policy, i.e. ,
Proof.
The proof is by backward induction. At time there is nothing to show. Now assume the assertion holds at time , i.e. that is measurable and that for every there exists an -optimal strategy of nature. By Proposition 3.5 we have at
(3.1)
where is a measurable -maximizer.
Since is arbitrary, equality holds. It remains to show the measurability of the outer integrand after the second inequality and the existence of an -maximizer. This follows from the optimal measurable selection theorem in [Rieder1978]: To see this, first note that the function
is jointly measurable due to Lemma 4.51 in [AliprantisBorder2006]. Consequently,
for every . The right hand side is a selection class. Obviously, it holds
Now, Theorem 3.2 in [Rieder1978] yields that
is measurable and for every there exists an -maximizer .
For part b) we have to show that there exists not only a -maximizer at (3.1) but also a maximizer. This follows from Theorem 3.7 in [Rieder1978]. The additional requirements are that is a separable metrizable space, which holds by Lemma 2.3, and that the set is compact for every and . By assumption, is weak* closed and therefore compact by Lemma 2.3. Since due to Assumption 3.1 the integrand of is in , the mapping is continuous for every . Hence, is closed as the preimage of a closed set. Since closed subsets of compact sets are compact, the proof is complete. ∎
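Numerically, the attainment of the supremum is easy to see in the discretized case: when the ambiguity set is polyhedral, for instance all measures on a finite disturbance grid whose density with respect to the reference measure is bounded by a constant, nature's inner problem is a linear program. The following sketch uses hypothetical data and scipy's linprog.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: maximize <q, v> over all probability vectors q on a finite grid whose
# density w.r.t. the reference vector p_ref is bounded by kappa (a norm-bounded,
# weak*-closed ambiguity set). All numbers are illustrative.

p_ref = np.array([0.25, 0.25, 0.25, 0.25])        # reference measure on the grid
kappa = 2.0                                       # density bound defining the set
v = np.array([1.0, 3.0, 0.5, 2.0])                # integrand: cost-to-go on the grid

res = linprog(
    c=-v,                                         # maximize <q, v> = minimize <-v, q>
    A_eq=np.ones((1, 4)), b_eq=np.array([1.0]),   # q is a probability vector
    bounds=[(0.0, kappa * p) for p in p_ref],     # density dQ/dP bounded by kappa
)
print(res.x, -res.fun)                            # nature's worst-case measure and value
```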
So far we have only considered the case where the ambiguity set may depend on the time index but not on the state of the decision process. This covers many applications, e.g. the connection to risk measures (see Section 6). Moreover, using the optimal selection theorem by [Rieder1978] in Theorem 3.6, we can allow any norm-bounded ambiguity set as long as it is independent of the state. If the ambiguity set is weak* closed, the following generalization is possible.
Corollary 3.7.
For let be weak* closed and
be a non-empty and closed-valued mapping giving the possible probability measures at time in state if the controller chooses . Then the assertion of Theorem 3.6 b) still holds.
Proof.
We have to show the existence of a measurable maximizer at (3.1). The rest of the proof is not affected. Since is weak* closed, it is a compact Borel space by Lemma 2.3. Consequently, the set-valued mapping is compact-valued, as closed subsets of compact sets are compact. In the proof of the theorem it has been shown that the function is jointly measurable and continuous in . Hence, Proposition D.5 in [HernandezLasserre1996] yields the existence of a measurable maximizer. ∎
State-dependent ambiguity sets are one possibility to make the distributionally robust optimality criterion less conservative. For example, they allow one to incorporate learning about the unknown disturbance distribution. We refer the reader to [Bielecki2019] for an interesting example where the ambiguity sets are recursive confidence regions for an unknown parameter of the disturbance distribution.
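A stylized sketch of such a data-driven ambiguity set, inspired by but not taken from the cited construction (all modelling choices below are our own): the stage-t ambiguity set consists of Normal(mu, 1) disturbance laws whose mean lies in a confidence interval built from the disturbances observed up to time t, so the set shrinks as information accumulates.

```python
import numpy as np

def ambiguity_set_params(observed, z=1.96):
    """Interval of admissible means for Normal(mu, 1) disturbance laws at the 95% level."""
    n = len(observed)
    if n == 0:
        return (-np.inf, np.inf)          # no information yet: unrestricted mean
    m = float(np.mean(observed))
    half = z / np.sqrt(n)                 # known unit variance
    return (m - half, m + half)

rng = np.random.default_rng(0)
history = rng.normal(loc=0.3, scale=1.0, size=10)
print(ambiguity_set_params(history[:2]))  # wide interval after 2 observations
print(ambiguity_set_params(history))      # narrower interval after 10 observations
```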
Let us now consider specifically deterministic Markov policies of the controller. The subspace
of turns out to be the set of possible value functions under such policies. is a complete metric space since the subset of lower semicontinuous functions is closed in . We define the following operators on and especially on .
Definition 3.8.
For and Markov decision rules , we define
Note that the operators are monotone in . Under Markov policies of the controller and of nature, the value iteration can be expressed with the help of these operators. In order to distinguish from the history-dependent case, we denote the value function here with . Setting we obtain for and with Proposition 3.5
We define the robust value of Markov policy of the controller as
When calculating the robust value of a Markov policy of the controller it is no restriction to take the supremum only over Markov policies of nature.
Corollary 3.9.
Let . It holds for that I.e., we have the robust value iteration
Moreover, there exists a Markovian -optimal policy of nature and if the ambiguity sets are all weak* closed even a Markovian optimal policy.
Proof.
Let us further define for the Markovian value function
The next result shows that satisfies a Bellman equation and proves that an optimal policy of the controller exists and is Markov.
Theorem 3.10.
Let Assumptions 2.1, 3.1 be satisfied.
a) For , it suffices to consider deterministic Markov policies both for the controller and nature, i.e. for all . Moreover, and satisfies the Bellman equation
For there exist Markov minimizers of and every sequence of such minimizers constitutes an optimal policy of the controller.
b) If the ambiguity sets are weak* closed, there exist maximizing decision rules of nature, i.e. and every sequence of maximizers induces an optimal policy of nature satisfying .
Proof.
We proceed by backward induction. At time we have due to semicontinuity and Assumption 3.1 (i). Now assume the assertion holds at time . Using the robust value iteration (Corollary 3.9) we obtain at time :
Here, objective and constraint depend on the history of the process only through . Thus, given the existence of a minimizing Markov decision rule the last expression is equal to . Again by the induction hypothesis, there exists an optimal Markov policy such that
(3.2)
It remains to show the existence of a minimizing Markov decision rule and that . We want to apply Proposition 2.4.3 in [BaeuerleRieder2011]. The set-valued mapping is compact-valued and upper semicontinuous. Next, we show that is lower semicontinuous for every . Let be a convergent sequence in with limit . The function
(3.3)
is lower semicontinuous for every . The sequence of random variables given by
is bounded by some by Lemma LABEL:thm:abstract_robust_Lpbound. Now, Fatou’s Lemma yields for every
where the last inequality is by the lower semicontinuity of (3.3). Thus, the function is lower semicontinuous for every and consequently is lower semicontinuous as a supremum of lower semicontinuous functions. Now, Proposition 2.4.3 in [BaeuerleRieder2011] yields the existence of a minimizing Markov decision rule at (3.2) and that is lower semicontinuous. Furthermore, is bounded by for some due to Lemma 3.3. Thus . Part b) follows from Theorem 3.6 b). ∎
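For intuition, the following numerical sketch implements the Bellman equation of Theorem 3.10 on a fully discretized toy model (finite state grid, finite action set, finite per-stage ambiguity set). All model data are hypothetical and the grid projection is a crude stand-in for the continuous-state theory.

```python
import numpy as np

states = np.linspace(-2.0, 2.0, 9)
actions = np.array([-1.0, 0.0, 1.0])
z_grid = np.array([-1.0, 0.0, 1.0])
ambiguity = [np.array([0.2, 0.6, 0.2]), np.array([0.1, 0.5, 0.4])]   # per-stage set
N = 3                                                                # planning horizon

transition = lambda x, a, z: np.clip(x + a + z, states[0], states[-1])
cost = lambda x, a, xn: a ** 2 + xn ** 2                             # one-stage cost
terminal = lambda x: x ** 2                                          # terminal cost

def nearest(x):
    """Project a state onto the grid (interpolation-free lookup)."""
    return int(np.argmin(np.abs(states - x)))

V = [terminal(x) for x in states]                                    # V_N on the grid
policy = []
for t in reversed(range(N)):
    V_new, d_t = [], []
    for x in states:
        best_val, best_a = np.inf, None
        for a in actions:
            xn = [transition(x, a, z) for z in z_grid]
            payoff = np.array([cost(x, a, v) + V[nearest(v)] for v in xn])
            robust = max(float(np.dot(q, payoff)) for q in ambiguity)  # sup over the set
            if robust < best_val:
                best_val, best_a = robust, a
        V_new.append(best_val)
        d_t.append(best_a)
    V, policy = V_new, [d_t] + policy

print(np.round(V, 2))      # robust value function at t = 0 on the grid
print(policy[0])           # optimal Markov decision rule at t = 0
```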
4. Real Line as State Space
The model has been introduced in Section 2 with a general Borel space as state space. In order to solve the distributionally robust cost minimization problem in Section 3 we needed a continuous transition function despite having a semicontinuous model, cf. the proof of Theorem 3.10. This assumption on the transition function can be relaxed to semicontinuity if the state space is the real line and the transition and one-stage cost function have some form of monotonicity. In some applications this relaxation of the continuity assumption is relevant. Furthermore, a real state space can be exploited to address the distributionally robust cost minimization problem with more specific techniques. In addition to Assumption 3.1, we make the following assumptions in this section.
Assumption 4.1.
(i) The state space is the real line .
(ii) The model data satisfies the Continuity and Compactness Assumptions 2.1 with the transition function being lower semicontinuous instead of continuous.
(iii) The model data has the following monotonicity properties:
(iii a) The set-valued mapping is decreasing.
(iii b) The function is increasing for all .
(iii c) The function is increasing for all .
(iii d) The function is increasing.
With the real line as state space a simple separation condition is sufficient for Assumption 3.1 (ii).
Corollary 4.2.
Let there be upper semicontinuous functions and measurable functions which fulfil and
for every . Then Assumption 3.1 (ii) is satisfied.
Proof.
Let . We can choose arbitrarily. The set is compact w.r.t. the product topology. Moreover, since the set-valued mapping is decreasing. Due to upper semicontinuity there exist such that , . Hence, one can define
and Assumption 3.1 (ii) is satisfied. ∎
The question is how replacing Assumption 2.1 (ii) by Assumption 4.1 affects the validity of all previous results. The only result that was proven using the continuity of the transition function in , and not only its measurability, is Theorem 3.10. All other statements are unaffected.
Proposition 4.3.
Proof.
In the proof of Theorem 3.10 the continuity of is used to show that is lower semicontinuous for every . Due to the monotonicity assumptions
is lower semicontinuous for every . Now the lower semicontinuity of and the existence of a minimizing decision rule follows as in the proof of Theorem 3.10. The fact that is increasing for every follows as in Theorem 2.4.14 in [BaeuerleRieder2011]. ∎
In the following Section 5 we use a minimax approach as an alternative way to solve the Bellman equation of the distributionally robust cost minimization problem and to study its game theoretical properties. Subsequently in Section 6, we also consider special choices of the ambiguity set which are advantageous for solving the optimization problem.
5. Minimax Approach and Game Theory
We assume as in the last section. Compared to a risk-neutral Markov Decision Model, the Bellman equation of the robust model (see Theorem 3.10) has the additional complication that a supremum over possibly uncountably many expectations needs to be calculated. This can be a quite challenging task. Therefore, it may be advantageous to interchange the infimum and supremum. For instance, in applications it may be possible to infer structural properties of the optimal actions independently from the probability measure after the interchange. Based on the minimax theorem by [Sion1958], cf. Appendix Theorem LABEL:thm:A:minimax, this section presents a criterion under which the interchange of infimum and supremum is possible.
Lemma 5.1.
Proof.
The proof is by backward induction. is convex by assumption. Now assume that is convex. Recall that is increasing (Proposition 4.3). Hence, for every the function
is convex as a composition of an increasing convex with a convex function. By the linearity of expectation,
(5.1)
is convex for every . As the pointwise supremum of a collection of convex functions is convex, we obtain convexity of . Now, Proposition 2.4.18 in [BaeuerleRieder2011] yields the assertion. ∎
The assumptions of Lemma 5.1 are subsequently referred to as convex model.
Proof.
Remark 5.3.
The interchange of infimum and supremum in Theorem 5.2 is based on Sion’s Minimax Theorem LABEL:thm:A:minimax, which requires convexity of the function
(5.2)
for every . This can be guaranteed by a convex model (cf. Lemma 5.1) which means that several components of the decision model need to have some convexity property. However, these assumptions are quite restrictive. Resorting to randomized actions is a standard approach to convexify (or more precisely linearize) the function (5.2) without assumptions on the model components. Let be the set of all probability measures on . Then it follows from Sion’s Theorem LABEL:thm:A:minimax that
(5.3)
The last equality holds since is lower semicontinuous (cf. the proof of Theorem 3.10) and is compact. This appears to be a very elegant solution for the interchange problem, but unfortunately, the Bellman equation of the distributionally robust cost minimization problem (2.6) under a randomized action of the controller is given by
(5.4)
cf. Theorems 3.6 and 3.10, and (5.3) does in general not equal (5.4). Recall that in our model nature is allowed to react to any realization of the controller’s action. This was crucial to obtain a robust value iteration in Theorem 3.6. In contrast to that, (5.3) means that nature maximizes only knowing the distribution of the controller’s action. However, in general (5.3) (5.4) as will be shown in the next example.
Example 5.4.
In order to see that (5.3) (5.4) and that infimum and supremum cannot be interchanged in general, consider the simple static counter example , , , , , , and . It is readily checked that Assumption 4.1 is satisfied. Especially, one has constant bounding functions. In this example (5.4) equals
If the controller chooses , then (5.3) must be less or equal than
Indeed we obtain here . The approach to interchange infimum and supremum through a linearization with randomized actions works when nature only observes the distribution and not the realization of the controller’s action. Also note that the situation here is quite different to classical two-person zero-sum games. The fact that infimum and supremum cannot be interchanged is a consequence of the asymmetric mover/follower situation. The example above with and discretized can also be used as a counter-example within the setting of [NilimGhaoui2005] and [wiesemann2013robust].
As mentioned before, the distributionally robust cost minimization model can be interpreted as a dynamic game with nature as the controller’s opponent. Since nature chooses her action after the controller, observing his action but not being restricted by it, there is a (weak) second-mover advantage by construction of the game. The fact that infimum and supremum in the Bellman equation can be interchanged means that the second-mover advantage vanishes in the special case of a convex model.
Let additionally the one-stage ambiguity sets be weak* closed. Now, the conditions of Theorem LABEL:thm:A:minimax b) are fulfilled, too. Then, the ambiguity set is weak* compact by Lemma 2.3 and by Lemma LABEL:thm:abstract_robust_Lpbound we have that is in . Thus, is weak* continuous for every . This yields that satisfies the minimax-equality
and Lemma 2.105 in [BarbuPrecupanu2012] implies that for every the function has a saddle point , i.e.
for all and . Such a saddle point constitutes a Nash equilibrium in the subgame scenario . We will show that Nash equilibria exist not only in one-stage subgames but also globally.
Before doing so, let us introduce a modification of the game against nature where nature instead of the controller moves first. Given a policy of nature, the controller faces an arbitrary but fixed probability measure in each scenario . Thus, the inner optimization problem is a risk-neutral MDP and it follows from standard theory that it suffices for the controller to consider deterministic Markov policies. Clearly, the controller's optimal policy will depend on the policy that nature has chosen before. It will turn out to be a pointwise dependence on the actions of nature. To clarify this and for comparability with the original game (2.6), where the controller moves first, we distinguish the following types of Markov strategies of the controller
and of nature
The value of a pair of Markov policies is defined as in (2.5). The bounds in Lemma 3.3 apply since the proofs do not use properties of the policies. The game under consideration is
(5.5)
For clarity, we mark all quantities of the game where nature moves first which differ from the respective quantity of the original game with a tilde. The value of a policy of nature is defined as
The Bellman operator on can be introduced in the usual way:
Theorem 5.5.
Let Assumptions 3.1, 4.1 be satisfied, the ambiguity sets be weak* closed and the model be convex.
a) for and they satisfy the Bellman equation
There exist decision rules of nature and of the controller such that and all sequences of such decision rules induce an optimal policy pair and , i.e. .
b) We have that .
Proof.
We have for and :
(5.6)
Let and be optimal strategies for the original game (2.6). The existence is guaranteed by Theorem 3.10.
Then defined by
lies in since the decision rules are well defined as compositions of measurable maps.
We prove all statements simultaneously by induction. In particular we show that there exists a policy such that
We show next that . From Theorem 5.2 and the induction hypothesis we obtain
Observe that again by the induction hypothesis
(5.7)
We will show below that the existence of a minimizer in the second line is justified. Thus we get equality in the expression above. Combining the two previous equations we finally get
In total we have shown that . The joint Bellman equation for the controller and nature follows from Theorem 5.2.
Finally, we verify the existence of a minimizer . Let be a convergent sequence in with limit . By dominated convergence (Lemma LABEL:thm:abstract_robust_Lpbound) and the lower semicontinuity of for any we obtain that the increasing sequence of random variables given by
satisfies
6. Special Ambiguity Sets
In this section, we consider some special choices for the ambiguity set which simplify solving the Markov Decision Problem (2.6) or allow for structural statements about the solution. We assume a real-valued state space here.
Convex hull. It does not change the optimal value of the optimization problems if a given ambiguity set is replaced by its convex hull or its closed convex hull , where the closure is with respect to the weak* topology. Clearly, to demonstrate this, it suffices to compare the corresponding Bellman equations.
Lemma 6.1.
Let be any norm bounded ambiguity set. Then it holds for all and
Proof.
Fix . The function is linear. Thus, for a generic element we have
i.e. there can be no improvement of the supremum on the convex hull. We also have that is metrizable and therefore coincides with the limit points of sequences in . Since is weak* continuous (cf. proof of Theorem 3.6), the supremum cannot be improved on the closure either. ∎
Integral stochastic orders on . Following an idea of [Mueller1997], one can define integral stochastic orders on the ambiguity set by
If there exists a maximal element with respect to one of these stochastic orders, this probability measure is an optimal action for nature (in the respective scenario). The proof of the next lemma follows directly from the Bellman equation and the definition of the orderings.
Lemma 6.2.
In fact, Lemma 6.2 holds for any state space. But it is only a reformulation of what is an optimal action for nature. However, under Assumption 4.1 it has practical relevance when a simpler sufficient condition for the integral stochastic order is fulfilled. We give three exemplary criteria:
Convex order on the set of densities. Since the probability measures in are absolutely continuous with respect to the reference probability measure , we can alternatively consider the set of densities (see (2.2)). In general, one has to take care both of the marginal distribution of the density and the dependence structure with the random cost when searching for a maximizing density of the Bellman equation. However, if is sufficiently rich, the maximization reduces to comparing marginal distributions.
Definition 6.3.
The set of densities is called law invariant, if for every with is in , too.
Lemma 6.4.
Proof.
By Lemma LABEL:thm:abstract_robust_Lpbound are in for all . Thus the expectation (6.2) exists for all . Note that a product of r.v. with fixed margins is maximized in expectation when the r.v. are chosen comonotonic. Due to the law invariance of we can find for every some comonotonic to such that and (6.2) is maximal with . Since the function is increasing, this is the same as requiring comonotonicity to . ∎
For the comparison of marginal distributions one would naturally think of stochastic orders. Here, the convex order yields a sufficient criterion for optimality. In order to obtain a connection to risk measures, let us introduce the following notations:
Definition 6.5.
Let be the distribution function of a real-valued random variable .
Lemma 6.6 ([Rueschendorf2009, 2.1]).
For any random variable on an atomless probability space there exists a random variable such that
The random variable is referred to as (generalized) distributional transform of . Note that if is increasing and left-continuous, then and have the same distributional transform.
Definition 6.7.
Let be increasing and right-continuous with . Functionals are called spectral risk measures and is called spectrum. When we choose the spectrum then we obtain the Expected Shortfall. Note that every spectral risk measure is also coherent, i.e. monotone, translation-invariant, positive homogeneous and subadditive. In what follows we assume a non-atomic probability space.
Lemma 6.8.
Let Assumptions 3.1, 4.1 be satisfied, be law invariant and suppose there exists a maximal element of w.r.t. the convex order .
Proof.
Recall that the probability space under consideration is the product space. Under the assumptions of Lemma 6.8 b) we can replace the probability measure by and the optimization problem (2.6) can equivalently be written as
With the reversed argumentation of Lemma 6.8 a robust formulation of (6.3) is given by
with The are indeed densities. Now, (6.4) can be interpreted as the minimization of a coherent risk measure
and the problem can be solved with the value iteration where and gives the minimal value (6.4).
Remark 6.9.
Some authors have already considered the problem of optimizing risk measures of dynamic decision processes. For example, [ruszczynski2010risk] considered Markov risk measures for finite and discounted infinite horizon models. In the final section he briefly discusses the relation to min-max problems where player 2 chooses the measure. [shen2013risk] treat similar problems (also with average cost) with different properties of the risk maps. More specific applications (dividend problem and economic growth) with the recursive entropic risk measures are treated in [BaeuerleJaskiewicz2017, BaeuerleJaskiewicz2018]. In [chow2015risk] the authors treat the dynamic average value-at-risk as a min-max problem. For this risk measure there are also alternative solution approaches, see e.g. [BaeuerleOtt2011].
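To illustrate the risk-measure connection of Section 6 numerically, the following sketch (simulated data, our own function names) computes the Expected Shortfall of a sampled total cost once as the average over the worst alpha-fraction of outcomes and once in its dual form as a worst-case expectation over all densities bounded by 1/alpha; the two values coincide, mirroring the identification of the robust problem with a coherent risk measure.

```python
import numpy as np

rng = np.random.default_rng(1)
cost = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # simulated total cost
alpha = 0.05

# (i) Expected Shortfall as the average of the worst alpha-fraction of outcomes
k = int(alpha * cost.size)
es_tail = np.sort(cost)[-k:].mean()

# (ii) dual form: sup of E_Q[cost] over all Q with density dQ/dP <= 1/alpha,
#      attained by putting maximal density on the worst outcomes
density = np.zeros_like(cost)
worst = np.argsort(cost)[::-1][:k]
density[worst] = 1.0 / alpha
es_dual = np.mean(density * cost)

print(round(es_tail, 3), round(es_dual, 3))               # the two values coincide
```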
7. Application
7.1. LQ Problem
We consider here so-called linear-quadratic (LQ) problems. The state and action space are and . Let , be - and -valued random vectors, respectively, and be random variables with values in . The random elements are independent and the -th element is defined on . It is supposed that the disturbances have finite -th moments, . The transition function is given by for . Furthermore, let there be deterministic positive constants and deterministic positive definite symmetric matrices . The one-stage cost functions are and the terminal cost function is . Hence, the optimization problem under consideration is
Policy values and value functions are defined in the usual way. For different formulations of robust LQ problems, see e.g. [hansen1999five]. Since and the matrices are positive semidefinite, is a lower bounding function and the one-stage costs are at least quasi-integrable. In the sequel, we will determine the value functions and the optimal policy by elementary calculations and will show that the value functions are convex and therefore continuous. Hence, we can dispense with an upper bounding function and compactness of the action space. Since the Borel σ-algebra of a finite-dimensional Euclidean space is countably generated, it is no restriction to assume that the probability measures are separable. Further, we assume that for
I.e. Assumption 3.1 is satisfied apart from upper bounding. Theorems 3.6 and 3.10 use the bounding, continuity and compactness assumptions only to prove the existence of optimal decision rules. Thus, we can employ the Bellman equation and restrict the consideration to Markov policies as long as we are able to prove the existence of optimal decision rules on each stage. We proceed backwards. At stage , no action has to be chosen and the value function is . At stage , we have to solve the Bellman equation
For the last equality we used that by assumption. Since and are positive (semi-)definite, the objective function (7.2) is strictly convex in . Moreover, it is linear and in particular concave in . Finally, is weak* compact by the Theorem of Banach-Alaoglu [AliprantisBorder2006, 6.21] and the objective function (7.2) is continuous in by definition of the weak* topology since the integrand is in . Thus, the requirements of Sion's Minimax Theorem LABEL:thm:A:minimax b) are satisfied and we can interchange infimum and supremum in (7.2), i.e.
In order to solve the inner minimization problem it suffices, due to strict convexity and smoothness, to determine the unique zero of the gradient of the objective function. Note that the matrix is positive definite and therefore invertible due to the positive (semi-)definiteness of and .
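As a hedged numerical sketch of this step (the paper's exact matrices and notation are not reproduced): assuming linear dynamics X' = Ax + Ba + eps, an action cost a'Ra and a quadratic continuation value x'Mx, setting the gradient of the one-stage objective to zero gives the minimizer in closed form.

```python
import numpy as np

# For f(a) = a'Ra + E_q[(Ax + Ba + eps)' M (Ax + Ba + eps)] with E_q[eps] = mu_q,
# the gradient vanishes at a* = -(R + B'MB)^{-1} B'M (Ax + mu_q); R + B'MB is
# positive definite, hence invertible. All matrices below are hypothetical.

def lq_inner_minimizer(A, B, R, M, x, mu_q):
    """Closed-form minimizer of the one-stage quadratic objective."""
    H = R + B.T @ M @ B
    return -np.linalg.solve(H, B.T @ M @ (A @ x + mu_q))

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
R = np.array([[0.5]])
M = np.eye(2)                             # hypothetical continuation value matrix
x = np.array([1.0, -0.5])
print(lq_inner_minimizer(A, B, R, M, x, mu_q=np.zeros(2)))
```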
Inserting in (7.1) gives
where
Since for all we must have and since is weak* compact and weak* continuous, there exists an optimal measure which is independent of due to our independence assumption (iii). Since is again quadratic, we obtain, when we define , that and . Since the third term of is a quadratic form, nature should choose such that if possible and maximize the second moment of . If this is possible, we obtain , i.e. the controller will not control the system. In any case we see here that the optimal choice of nature does not depend on . When we specialize the situation to we obtain
Thus, nature has to maximize the expression in brackets. For a moment we skip the index . Let us further assume that is jointly normally distributed with expectations and and variances and and covariance and that all these parameters may be elements of compact intervals. Then the expression in brackets reduces to
We see immediately that both and have to be chosen as the maximal possible value. The remaining three parameters and have to minimize the fraction. If it is possible to choose them such that is maximal and , this would be optimal.
7.2. Managing regenerative energy
The second example is the management of a joint wind and storage facility. Before each period the owner of the wind turbine has to announce the amount of energy she wants to produce. If there is enough wind in the next period, she receives the reward , where we assume that is the fixed price for energy. Additional energy which may have been produced will be stored in a battery with capacity . If there is not enough wind in the next period, the storage device will be used to cover the shortage. In case this is still not enough, the remaining shortage will be penalized by a proportional cost rate (see [anderson2015optimal] for further background). We consider a robust version here, i.e. the distribution of the produced wind energy varies in a set with bounded support . The state is the amount of energy in the battery, hence . Further, and the action is the amount of energy which is bid. We obtain the following Bellman equation:
From the last representation we see that and
We see that is compact and is lower semicontinuous. is continuous and is continuous and bounded. It is easy to see that is decreasing in . Suppose has a minimal element w.r.t. . Then we can omit the and replace by . The remaining Bellman equation is then a standard MDP.
8. Appendix
8.1. Additional Proofs
Proof of Lemma 2.3: Recall that we identify with the set of the corresponding densities . The closure of remains norm bounded. This can be seen as follows: Let . Then there exists a net such that [XY] for all Y ∈ L^p(Ω_n, A_n, P_n) with ‖Y‖_{L^p} = 1. By Hölder's inequality we have for all
Thus, . Finally, due to duality it follows
The separability of the probability measure makes a separable Banach space. Consequently, the weak* topology is metrizable on the norm bounded set [Morrison2001, p. 157]. The trace topology on the subset coincides with the topology induced by the restriction of the metric [OSearcoid2007, 4.4.1], i.e. is metrizable, too. Since is norm bounded and weak* closed, the Theorem of Banach-Alaoglu [AliprantisBorder2006, 6.21] yields that it is weak* compact. As a compact metrizable space it is complete [AliprantisBorder2006, 3.28] and also separable [AliprantisBorder2006, 3.26, 3.28]. Hence, is a Borel space. The set of densities is also separable as a subspace of a separable metrizable space [AliprantisBorder2006, 3.5].