Distributionally robust stochastic optimal control
The main goal of this paper is to discuss the construction of distributionally robust counterparts of stochastic optimal control problems. Randomized and non-randomized policies are considered. In particular, necessary and sufficient conditions for the existence of non-randomized optimal policies are given.
Keywords: stochastic optimal control, dynamic equations, distributional robustness, game formulation, risk measures
1 Introduction
Consider the following (discrete time, finite horizon) Stochastic Optimal Control (SOC) model (e.g., [1]):
$$\min_{\pi\in\Pi}\ \mathbb{E}^{\pi}\Big[\sum_{t=1}^{T} c_t(x_t,u_t,\xi_t)+c_{T+1}(x_{T+1})\Big], \tag{1.1}$$
where $\Pi$ is the set of policies satisfying the constraints
$$x_{t+1}=F_t(x_t,u_t,\xi_t),\ \ u_t\in\mathcal{U}_t,\ \ t=1,\dots,T. \tag{1.2}$$
Here the variables $x_t\in\mathcal{X}$, $t=1,\dots,T+1$, represent the state of the system, $u_t$, $t=1,\dots,T$, are controls, $\xi_t\in\mathbb{R}^{d_t}$, $t=1,\dots,T$, are random vectors, $\mathcal{X}$ is a closed subset of $\mathbb{R}^{n}$, $c_t:\mathcal{X}\times\mathcal{U}_t\times\mathbb{R}^{d_t}\to\mathbb{R}$, $t=1,\dots,T$, are cost functions, $c_{T+1}:\mathcal{X}\to\mathbb{R}$ is a final cost function, $F_t:\mathcal{X}\times\mathcal{U}_t\times\mathbb{R}^{d_t}\to\mathcal{X}$ are (measurable) mappings and $\mathcal{U}_t$ is a (nonempty) subset of $\mathbb{R}^{m_t}$. The value $x_1$ is deterministic (initial condition); it is also possible to view $x_1$ as random with a given distribution, but this is not essential for the following discussion.
• Unless stated otherwise we assume that the probability law of the random process $\xi_1,\dots,\xi_T$ does not depend on our decisions.
Remark 1.1.
The above assumption is basic for our analysis. It views $\xi_1,\dots,\xi_T$ as a random data process and assumes that its probability law does not depend on the respective states and actions. This assumption is reasonable in many applications; for instance, in the Inventory Model example discussed in Section 4 it means that the probability distribution of the demand process does not depend on the inventory level and order quantity. This assumption makes it possible to separate the transition probabilities, defined by the functional relations $x_{t+1}=F_t(x_t,u_t,\xi_t)$, from the probability distribution of the data process (see Remarks 1.2 and 2.1 below).
The optimization in (1.1) is performed over policies determined by decisions $u_t$ and state variables $x_t$ considered as functions of the history $\xi_{[t-1]}:=(\xi_1,\dots,\xi_{t-1})$ of the data process, $t=1,\dots,T$, and satisfying the feasibility constraints (1.2). We also use the notation $\xi_{[t]}:=(\xi_1,\dots,\xi_t)$. For the sake of simplicity, in order not to distract from the main message of the paper, we assume that the control sets $\mathcal{U}_t$ do not depend on the state variables. It is possible to extend the analysis to the general case, where the control sets $\mathcal{U}_t(x_t)$ are functions of the state variables.
Remark 1.2.
Note that because of the basic assumption that the probability distribution of $\xi_1,\dots,\xi_T$ does not depend on our decisions (does not depend on states and actions), it suffices to consider policies as functions of the data process alone, i.e., $u_t=\pi_t(\xi_{[t-1]})$. It could also be noted that in case the random process is stagewise independent, i.e., the random vector $\xi_{t+1}$ is independent of $\xi_{[t]}$, $t=1,\dots,T-1$, it suffices to consider policies of the form $u_t=\pi_t(x_t)$.
Consider the Banach space $C(\Xi)$ of continuous functions $\psi:\Xi\to\mathbb{R}$ on a compact metric space $\Xi$, equipped with the sup-norm $\|\psi\|:=\sup_{\xi\in\Xi}|\psi(\xi)|$. The dual space of $C(\Xi)$ is the space of finite signed measures $\mu$ on $\Xi$ with respect to the bilinear form $\langle\mu,\psi\rangle=\int_\Xi\psi\,d\mu$ (Riesz representation).
Remark 1.3.
Denote by $\mathcal{P}(\Xi)$ the set of probability measures on the set $\Xi$ equipped with its Borel sigma algebra. The set $\mathcal{P}(\Xi)$ is a weakly∗ closed subset of the unit ball of the dual space of $C(\Xi)$ and hence is weakly∗ compact by the Banach-Alaoglu theorem. The weak∗ topology of $\mathcal{P}(\Xi)$ (in probability theory this weak∗ topology is often referred to as the weak topology, e.g., [2]) is metrizable (e.g., [6, Theorem 6.30]). It can be noted that if $\psi_k\in C(\Xi)$ converges (in the norm topology) to $\psi$ and $P_k\in\mathcal{P}(\Xi)$ converges weakly∗ to $P$, then $\int_\Xi\psi_k\,dP_k$ converges to $\int_\Xi\psi\,dP$.
Remark 1.4.
Let us recall the following properties of the min-max problem:
$$\min_{x\in A}\ \max_{y\in B}\ f(x,y), \tag{1.3}$$
where $A$ and $B$ are nonempty sets and $f:A\times B\to\mathbb{R}$ is a real valued function. A point $(\bar{x},\bar{y})\in A\times B$ is a saddle point of problem (1.3) if $f(\bar{x},y)\le f(\bar{x},\bar{y})$ for all $y\in B$ and $f(x,\bar{y})\ge f(\bar{x},\bar{y})$ for all $x\in A$. If a saddle point exists, then problem (1.3) has the same optimal value as its dual
$$\max_{y\in B}\ \min_{x\in A}\ f(x,y), \tag{1.4}$$
and $\bar{x}$ is an optimal solution of problem (1.3) and $\bar{y}$ is an optimal solution of problem (1.4). Conversely, if the optimal values of problems (1.3) and (1.4) are the same, $\bar{x}$ is an optimal solution of problem (1.3) and $\bar{y}$ is an optimal solution of problem (1.4), then $(\bar{x},\bar{y})$ is a saddle point. By Sion's minimax theorem [16], if $A$ and $B$ are convex subsets of linear topological spaces, $f(x,y)$ is continuous, convex in $x$ and concave in $y$, and at least one of the sets $A$ or $B$ is compact, then the optimal values of problems (1.3) and (1.4) are equal to each other.
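As a simple illustration, consider $f(x,y):=xy$ on $A=B=[-1,1]$. Then
$$\min_{x\in[-1,1]}\max_{y\in[-1,1]} xy=\min_{x\in[-1,1]}|x|=0
\quad\text{and}\quad
\max_{y\in[-1,1]}\min_{x\in[-1,1]} xy=\max_{y\in[-1,1]}(-|y|)=0,$$
so the optimal values of the min-max and max-min problems coincide, in accordance with Sion's theorem (here $f$ is bilinear and both sets are convex and compact), and $(\bar{x},\bar{y})=(0,0)$ is a saddle point.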
The risk neutral SOC model (1.1) has been thoroughly investigated. On the other hand, the analysis of its distributionally robust counterpart is more involved. The distributionally robust approach to Markov Decision Processes (MDPs) can be traced back to [7] and [10]. The origins of distributional robustness can also be related to dynamic game theory (e.g., [8] and references therein). By duality arguments the distributionally robust formulations are closely related to the risk averse settings. The aim of this paper is to describe and formulate certain related questions in the framework of the SOC model. In particular we consider randomized policies and give necessary and sufficient conditions for the existence of non-randomized optimal policies. We also discuss a relation between the nested risk averse and game formulations of distributionally robust problems.
2 Distributionally Robust Stochastic Optimal Control
In the distributionally robust counterpart of the risk neutral SOC problem (1.1) it is assumed that the probability law of the data process $\xi_1,\dots,\xi_T$ is not specified exactly, but rather is a member of a specified so-called ambiguity set of probability distributions. This modeling concept can be applied in various contexts, such as the Inventory Model discussed in Section 4, where the order quantity must be determined in the presence of an uncertain demand distribution. Specifically, consider the setting where at stage $t$ there is an ambiguity set $\mathfrak{M}_t(\xi_{[t-1]})$ of probability measures on $\Xi_t$, depending on the history $\xi_{[t-1]}$ (cf., [11]). That is, $\mathfrak{M}_t(\xi_{[t-1]})\subset\mathcal{P}(\Xi_t)$, where $\mathfrak{M}_t:\Xi_1\times\cdots\times\Xi_{t-1}\rightrightarrows\mathcal{P}(\Xi_t)$ is the respective multifunction. Here $\xi_t\in\Xi_t$, $t=1,\dots,T$, for some pre-determined sets $\Xi_t\subset\mathbb{R}^{d_t}$.
We view the distributionally robust counterpart of problem (1.1) as a dynamic game between the decision maker (the controller) and the adversary (the nature). Define the history of the decision process at stage $t$ as
$$h_t:=(x_1,u_1,\xi_1,\dots,x_{t-1},u_{t-1},\xi_{t-1},x_t). \tag{2.5}$$
At stage $t$, based on $h_t$ the nature chooses a probability measure $P_t\in\mathfrak{M}_t(\xi_{[t-1]})$, and at the same time the controller chooses its action $u_t\in\mathcal{U}_t$. For a realization $\xi_t\sim P_t$, the next state is $x_{t+1}=F_t(x_t,u_t,\xi_t)$, and so on the process is continued. (It is also possible to let the ambiguity set depend on the past actions $u_1,\dots,u_{t-1}$; this does not change the essence of our discussion for the dynamic game.) We write the corresponding distributionally robust problem as
$$\min_{\pi\in\Pi}\ \max_{\gamma\in\Gamma}\ \mathbb{E}^{\pi,\gamma}\Big[\sum_{t=1}^{T} c_t(x_t,u_t,\xi_t)+c_{T+1}(x_{T+1})\Big], \tag{2.6}$$
where $\Pi$ is the set of policies of the controller and $\Gamma$ is the set of policies of the nature (cf., [13, p. 812]).
Unless stated otherwise we make the following assumptions about the data of the problem.
Assumption 2.1.
The sets $\mathcal{U}_t$ and $\Xi_t$, $t=1,\dots,T$, are compact, and the functions $c_t(x_t,u_t,\xi_t)$, $t=1,\dots,T$, $c_{T+1}(x_{T+1})$ and $F_t(x_t,u_t,\xi_t)$, $t=1,\dots,T$, are continuous.
2.1 Dynamic Equations
The dynamic programming equations for the distributionally robust counterpart (2.6) of problem (1.1) are $V_{T+1}(x_{T+1},\xi_{[T]})=c_{T+1}(x_{T+1})$ for all $x_{T+1}$ and $\xi_{[T]}$, and for $t=T,\dots,1$,
$$V_t(x_t,\xi_{[t-1]})=\min_{u\in\mathcal{U}_t}\ \max_{P\in\mathfrak{M}_t(\xi_{[t-1]})}\ \mathbb{E}_{\xi_t\sim P}\big[c_t(x_t,u,\xi_t)+V_{t+1}\big(F_t(x_t,u,\xi_t),\xi_{[t]}\big)\big]. \tag{2.7}$$
To ensure that (2.7) is well defined, we need the value functions to be measurable. We establish below the continuity of the value functions and show that the optimal action in (2.7) is attained. We refer the reader to the Appendix for the definition of continuous multifunctions.
Proposition 2.1.
Suppose that Assumption 2.1 holds and the multifunctions $\mathfrak{M}_t(\cdot)$, $t=1,\dots,T$, are continuous. Then the value functions $V_t(x_t,\xi_{[t-1]})$, $t=1,\dots,T+1$, defined by the dynamic equations (2.7), are continuous and the minimum in the right hand side of (2.7) is attained.
Proof.
We argue by induction going backward in $t$. We have that $V_{T+1}(\cdot)=c_{T+1}(\cdot)$ and hence $V_{T+1}$ is continuous by the assumption. Now suppose that $V_{t+1}(\cdot,\cdot)$ is continuous. Consider the function
$$\phi_t(x_t,\xi_{[t-1]},u,P):=\mathbb{E}_{\xi_t\sim P}\big[c_t(x_t,u,\xi_t)+V_{t+1}\big(F_t(x_t,u,\xi_t),\xi_{[t]}\big)\big].$$
From Assumption 2.1 the set $\Xi_t$ is compact, and subsequently for any sequence $(x^k_t,\xi^k_{[t-1]},u^k)$ converging to $(x_t,\xi_{[t-1]},u)$, we have that $c_t(x^k_t,u^k,\cdot)+V_{t+1}\big(F_t(x^k_t,u^k,\cdot),(\xi^k_{[t-1]},\cdot)\big)$ converges to $c_t(x_t,u,\cdot)+V_{t+1}\big(F_t(x_t,u,\cdot),(\xi_{[t-1]},\cdot)\big)$ uniformly in $\xi_t\in\Xi_t$. For any $P^k\in\mathcal{P}(\Xi_t)$ converging weakly∗ to $P$, it follows by Remark 1.3 that $\phi_t(x^k_t,\xi^k_{[t-1]},u^k,P^k)\to\phi_t(x_t,\xi_{[t-1]},u,P)$. In addition, since $\Xi_t$ is compact, $C(\Xi_t)$ is separable, and $\mathcal{P}(\Xi_t)$ with the weak∗ topology is metrizable [6, Theorem 6.30]. It follows that $\phi_t$ is jointly continuous in $P$ (with respect to the weak∗ topology of $\mathcal{P}(\Xi_t)$) and $(x_t,\xi_{[t-1]},u)$. Consequently we obtain by Theorem 7.1 that the function
$$\varphi_t(x_t,\xi_{[t-1]},u):=\max_{P\in\mathfrak{M}_t(\xi_{[t-1]})}\phi_t(x_t,\xi_{[t-1]},u,P)$$
is continuous in $(x_t,\xi_{[t-1]},u)$. Finally, again by Theorem 7.1 we have that
$$V_t(x_t,\xi_{[t-1]})=\min_{u\in\mathcal{U}_t}\varphi_t(x_t,\xi_{[t-1]},u)$$
is continuous. The minimum is attained since $\mathcal{U}_t$ is compact. ∎
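For a finite (or suitably discretized) model the dynamic equations (2.7) can be evaluated directly by backward recursion. The following sketch is only illustrative and rests on several simplifying assumptions: finite state, control and noise sets, a history-independent ambiguity set given as a finite list of probability vectors, and a transition mapping whose values stay on the state grid; all names are hypothetical.

```python
import numpy as np

def robust_backward_recursion(states, controls, noises, ambiguity, c, F, c_final, T):
    """Value functions V[t], t = T+1, ..., 1, of recursion (2.7) for a finite
    model with a history-independent (rectangular) ambiguity set."""
    V = {T + 1: np.array([c_final(x) for x in states])}
    index = {x: i for i, x in enumerate(states)}
    for t in range(T, 0, -1):
        V_t = np.empty(len(states))
        for i, x in enumerate(states):
            best = np.inf
            for u in controls:
                # stage cost plus cost-to-go for every noise realization;
                # F(t, x, u, xi) is assumed to return an element of `states`
                stage = np.array([c(t, x, u, xi) + V[t + 1][index[F(t, x, u, xi)]]
                                  for xi in noises])
                # worst-case expectation over the finite ambiguity set
                best = min(best, max(p @ stage for p in ambiguity))
            V_t[i] = best
        V[t] = V_t
    return V
```

Recording the minimizing control at each state along the way yields a (non-randomized) policy of the form $u_t=\pi_t(x_t)$, as in the stagewise independent setting of Section 2.4.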
Remark 2.1.
It can be noted that because of the basic assumption that the probability law of the process $\xi_1,\dots,\xi_T$ does not depend on the states and actions, it follows that the state $x_t$ can be considered as a function of $\xi_{[t-1]}$, and hence the value functions and optimal policies can be viewed as functions of the process $\xi_{[t-1]}$ alone (compare with Remark 1.2).
2.2 Randomized Policies
We also consider randomized policies for the controller. That is, at stage $t$, the nature chooses a probability measure $P_t\in\mathfrak{M}_t(\xi_{[t-1]})$, and at the same time the controller chooses its action $u_t$ at random according to a probability distribution $\mu_t\in\mathcal{P}(\mathcal{U}_t)$, where $\mathcal{P}(\mathcal{U}_t)$ denotes the set of (Borel) probability measures on $\mathcal{U}_t$. Consequently $x_{t+1}=F_t(x_t,u_t,\xi_t)$ with $u_t\sim\mu_t$ and $\xi_t\sim P_t$, and so on. Here the decision process is defined by histories $h_t$, as defined in (2.5), with the actions $u_t$ chosen at random.
This corresponds to the min-max formulation
$$\min_{\pi\in\Pi^{\mathrm{r}}}\ \max_{\gamma\in\Gamma}\ \mathbb{E}^{\pi,\gamma}\Big[\sum_{t=1}^{T}c_t(x_t,u_t,\xi_t)+c_{T+1}(x_{T+1})\Big], \tag{2.9}$$
where $\Pi^{\mathrm{r}}$ is the set of randomized policies of the controller that map the history $h_t$ into $\mathcal{P}(\mathcal{U}_t)$ at each stage. For randomized policies the counterpart of dynamic equation (2.7) becomes $\mathcal{V}_{T+1}(x_{T+1},\xi_{[T]})=c_{T+1}(x_{T+1})$ for all $x_{T+1}$ and $\xi_{[T]}$, and for $t=T,\dots,1$,
$$\mathcal{V}_t(x_t,\xi_{[t-1]})=\min_{\mu\in\mathcal{P}(\mathcal{U}_t)}\ \max_{P\in\mathfrak{M}_t(\xi_{[t-1]})}\ \mathbb{E}_{u\sim\mu,\ \xi_t\sim P}\big[c_t(x_t,u,\xi_t)+\mathcal{V}_{t+1}\big(F_t(x_t,u,\xi_t),\xi_{[t]}\big)\big] \tag{2.10}$$
(cf., [9], [17]). The controller has a non-randomized optimal policy if for every $t$ and every $(x_t,\xi_{[t-1]})$ problem (2.11) below has an optimal solution supported on a single point of $\mathcal{U}_t$:
$$\min_{\mu\in\mathcal{P}(\mathcal{U}_t)}\ \max_{P\in\mathfrak{M}_t(\xi_{[t-1]})}\ \mathbb{E}_{u\sim\mu,\ \xi_t\sim P}\big[c_t(x_t,u,\xi_t)+\mathcal{V}_{t+1}\big(F_t(x_t,u,\xi_t),\xi_{[t]}\big)\big]. \tag{2.11}$$
Note that by the Banach-Alaoglu theorem the set $\mathcal{P}(\mathcal{U}_t)$ is compact in the weak∗ topology of the dual of the space $C(\mathcal{U}_t)$. With similar arguments as in Proposition 2.1, we have the following.
Proposition 2.2.
Suppose that Assumption 2.1 holds and the multifunctions $\mathfrak{M}_t(\cdot)$, $t=1,\dots,T$, are continuous. Then the value functions $\mathcal{V}_t(x_t,\xi_{[t-1]})$, $t=1,\dots,T+1$, defined by the dynamic equations (2.10), are continuous and the minimum in the right hand side of (2.10) is attained.
Proof.
We have that $\mathcal{V}_{T+1}(\cdot)=c_{T+1}(\cdot)$ and hence $\mathcal{V}_{T+1}$ is continuous by the assumption. Now suppose that $\mathcal{V}_{t+1}(\cdot,\cdot)$ is continuous. Consider the function
$$\phi_t(x_t,\xi_{[t-1]},\mu,P):=\mathbb{E}_{u\sim\mu,\ \xi_t\sim P}\big[c_t(x_t,u,\xi_t)+\mathcal{V}_{t+1}\big(F_t(x_t,u,\xi_t),\xi_{[t]}\big)\big].$$
From Assumption 2.1, both $\mathcal{U}_t$ and $\Xi_t$ are compact, and subsequently for any $(x^k_t,\xi^k_{[t-1]})$ converging to $(x_t,\xi_{[t-1]})$, we have that $c_t(x^k_t,u,\xi_t)+\mathcal{V}_{t+1}\big(F_t(x^k_t,u,\xi_t),\xi^k_{[t]}\big)$ converges to $c_t(x_t,u,\xi_t)+\mathcal{V}_{t+1}\big(F_t(x_t,u,\xi_t),\xi_{[t]}\big)$ uniformly in $(u,\xi_t)\in\mathcal{U}_t\times\Xi_t$. For $\mu^k$ converging weakly∗ to $\mu$ and $P^k$ converging weakly∗ to $P$, from compactness of $\mathcal{U}_t$ and $\Xi_t$, it can be readily shown that $\mu^k\times P^k$ converges to $\mu\times P$ in the weak∗ topology of the dual of $C(\mathcal{U}_t\times\Xi_t)$, and hence $\phi_t(x^k_t,\xi^k_{[t-1]},\mu^k,P^k)\to\phi_t(x_t,\xi_{[t-1]},\mu,P)$ (Remark 1.3). In addition, both $\mathcal{P}(\mathcal{U}_t)$ and $\mathcal{P}(\Xi_t)$ with the weak∗ topology are metrizable [6, Theorem 6.30]. It follows that $\phi_t$ is continuous jointly in $(\mu,P)$ (with respect to the product weak∗ topology of $\mathcal{P}(\mathcal{U}_t)$ and $\mathcal{P}(\Xi_t)$) and $(x_t,\xi_{[t-1]})$. The rest of the proof then follows the same lines as in Proposition 2.1. ∎
Remark 2.3.
Note that here the state $x_t$ also depends on the history of the chosen (randomized) controls $u_1,\dots,u_{t-1}$. Therefore the optimal policy cannot be written as a function of $\xi_{[t-1]}$ alone (compare with Remark 2.1).
In Section 4 we will discuss necessary and sufficient conditions for the existence of non-randomized optimal policies for the controller.
2.3 Nested Formulation
For non-randomized policies of the controller, dynamic equations (2.7) correspond to the following nested formulation of the respective distributionally robust SOC problem. For a non-randomized policy $\pi=(\pi_1,\dots,\pi_T)$, consider the total cost
$$Z^\pi:=\sum_{t=1}^{T}c_t(x_t,u_t,\xi_t)+c_{T+1}(x_{T+1}).$$
Recall that we can consider here policies as functions of $\xi_{[t-1]}$ alone, and that the state $x_t$ is also a function of $\xi_{[t-1]}$ (see Remark 2.1). Therefore we can view $Z^\pi$ as a function of the process $\xi_1,\dots,\xi_T$.
Consider linear spaces $\mathcal{Z}_t$, $t=1,\dots,T$, of bounded random variables $Z_t=Z_t(\xi_{[t]})$, and set $\mathcal{Z}_0:=\mathbb{R}$. For $t=T,\dots,1$, define recursively the mappings $\mathfrak{R}_t:\mathcal{Z}_t\to\mathcal{Z}_{t-1}$,
$$\mathfrak{R}_t(Z_t):=\sup_{P\in\mathfrak{M}_t(\xi_{[t-1]})}\mathbb{E}_{P}\big[Z_t\,\big|\,\xi_{[t-1]}\big]. \tag{2.12}$$
Note that $\mathcal{Z}_0=\mathbb{R}$ and hence $\mathfrak{R}_1(Z_1)$ is a real number. Let
$$\mathfrak{R}(\cdot):=\mathfrak{R}_1\circ\cdots\circ\mathfrak{R}_T(\cdot) \tag{2.13}$$
be the corresponding composite functional. Then with $Z^\pi$ viewed as a function of $\xi_{[T]}$, the nested distributionally robust counterpart of problem (1.1) is
$$\min_{\pi\in\Pi}\ \mathfrak{R}\big(Z^\pi\big), \tag{2.14}$$
where (as before) $\Pi$ is the set of non-randomized policies of the controller.
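For instance, for $T=2$ the composite functional takes the form
$$\mathfrak{R}(Z^\pi)=\sup_{P_1\in\mathfrak{M}_1}\mathbb{E}_{P_1}\Big[\sup_{P_2\in\mathfrak{M}_2(\xi_1)}\mathbb{E}_{P_2}\big[Z^\pi\,\big|\,\xi_1\big]\Big],$$
so that the worst-case distribution at the second stage is allowed to depend on the realization of $\xi_1$.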
2.4 Stagewise Independence Setting
An important particular case of the above framework is when the ambiguity sets do not depend on $\xi_{[t-1]}$, i.e., $\mathfrak{M}_t(\xi_{[t-1]})\equiv\mathfrak{M}_t$ for all $\xi_{[t-1]}$. In that case the value functions and optimal policies do not depend on $\xi_{[t-1]}$ and the respective dynamic equations can be written as
$$V_t(x_t)=\min_{u\in\mathcal{U}_t}\ \max_{P\in\mathfrak{M}_t}\ \mathbb{E}_{\xi_t\sim P}\big[c_t(x_t,u,\xi_t)+V_{t+1}\big(F_t(x_t,u,\xi_t)\big)\big]. \tag{2.15}$$
Also in that case the assumption of continuity of the multifunctions $\mathfrak{M}_t(\cdot)$ holds automatically, since they are constant.
3 Duality of Game Formulation
The dynamic equations of the dual of problem (2.9) are: $\bar{\mathcal{V}}_{T+1}(x_{T+1},\xi_{[T]})=c_{T+1}(x_{T+1})$ for all $x_{T+1}$ and $\xi_{[T]}$, and for $t=T,\dots,1$,
$$\bar{\mathcal{V}}_t(x_t,\xi_{[t-1]})=\max_{P\in\mathfrak{M}_t(\xi_{[t-1]})}\ \min_{\mu\in\mathcal{P}(\mathcal{U}_t)}\ \mathbb{E}_{u\sim\mu,\ \xi_t\sim P}\big[c_t(x_t,u,\xi_t)+\bar{\mathcal{V}}_{t+1}\big(F_t(x_t,u,\xi_t),\xi_{[t]}\big)\big]. \tag{3.1}$$
For given $(x_t,\xi_{[t-1]})$ and $P\in\mathfrak{M}_t(\xi_{[t-1]})$, the minimization in the right hand side of (3.1) is over all probability measures $\mu$ supported on the set $\mathcal{U}_t$. It is straightforward to see that the minimum is attained at a Dirac measure $\delta_{\bar{u}}$ (we denote by $\delta_a$ the Dirac measure of mass one at the point $a$), where $\bar{u}=\bar{u}(x_t,\xi_{[t-1]},P)$ depends on $P$. Therefore it suffices to perform the minimization in (3.1) over Dirac measures, and hence we can write equation (3.1) in the following equivalent way
$$\bar{\mathcal{V}}_t(x_t,\xi_{[t-1]})=\max_{P\in\mathfrak{M}_t(\xi_{[t-1]})}\ \min_{u\in\mathcal{U}_t}\ \mathbb{E}_{\xi_t\sim P}\big[c_t(x_t,u,\xi_t)+\bar{\mathcal{V}}_{t+1}\big(F_t(x_t,u,\xi_t),\xi_{[t]}\big)\big]. \tag{3.2}$$
Similar to Proposition 2.2, it can be shown that the value functions $\bar{\mathcal{V}}_t$ are well defined and continuous. By the standard theory of min-max, we have that $\bar{\mathcal{V}}_t(\cdot)\le\mathcal{V}_t(\cdot)$, $t=1,\dots,T+1$. We next establish that equality indeed holds when the multifunctions $\mathfrak{M}_t(\cdot)$, $t=1,\dots,T$, are convex-valued, i.e., the sets $\mathfrak{M}_t(\xi_{[t-1]})$ are convex for all $\xi_{[t-1]}$.
Proposition 3.1.
Suppose that Assumption 2.1 holds and the multifunctions $\mathfrak{M}_t(\cdot)$, $t=1,\dots,T$, are continuous and convex-valued. Then $\mathcal{V}_t(\cdot)=\bar{\mathcal{V}}_t(\cdot)$, $t=1,\dots,T+1$, and the min-max problems in the right hand side of the respective dynamic equations possess saddle points.
Proof.
Under the specified assumptions, the sets $\mathcal{P}(\mathcal{U}_t)$ and $\mathfrak{M}_t(\xi_{[t-1]})$ are weakly∗ compact and convex, and of course the expectation $\mathbb{E}_{u\sim\mu,\ \xi_t\sim P}[\,\cdot\,]$ is linear with respect to $\mu$ and $P$ and continuous in the respective weak∗ topologies. Then by using Sion's duality theorem (see Remark 1.4) and applying induction backward in time, we can conclude that $\mathcal{V}_t(\cdot)=\bar{\mathcal{V}}_t(\cdot)$, $t=1,\dots,T+1$. Further, by compactness arguments the respective min-max and max-min problems have optimal solutions, which implies existence of the saddle point. ∎
It is also worth mentioning here that the duality of the game formulation (when considering randomized policies of the controller) does not require convexity of the cost functions $c_t$ or any structural assumptions on the transition mappings $F_t$.
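When the control set is finite and the ambiguity set is the convex hull of finitely many measures, the one-step problem behind (2.10) and (3.1) reduces to a matrix game, and the equality of the min-max and max-min values asserted in Proposition 3.1 can be checked numerically by linear programming. The sketch below is purely illustrative: the cost matrix C is random, and the helper game_value is an assumed name rather than an existing routine.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
C = rng.uniform(size=(4, 3))             # C[i, j]: expected cost of control i under measure j

def game_value(C):
    """Value and optimal mixed strategy of the minimizing player of the matrix game C."""
    m, n = C.shape
    obj = np.r_[np.zeros(m), 1.0]        # variables (mu_1, ..., mu_m, v); minimize v
    A_ub = np.c_[C.T, -np.ones(n)]       # sum_i mu_i C[i, j] <= v for every column j
    b_ub = np.zeros(n)
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
    b_eq = np.array([1.0])               # mu is a probability vector
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun, res.x[:m]

v_minmax, mu = game_value(C)             # controller randomizes over controls
v_maxmin, lam = game_value(-C.T)         # nature's problem; value has the opposite sign
print(v_minmax, -v_maxmin)               # the two values coincide (up to solver tolerance)
```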
4 Existence of Non-randomized Optimal Policies
We have the following necessary and sufficient condition for the existence of a non-randomized optimal policy of the controller (cf., [9, Theorem 2.2]).
Theorem 4.1.
Proof.
In particular, we have the following result (cf., [5]).
Corollary 4.1.
Suppose that the sets $\mathcal{U}_t$ are convex, for every $\xi_t\in\Xi_t$ the functions $c_t(x_t,u_t,\xi_t)$ are convex in $(x_t,u_t)$ and $c_{T+1}(x_{T+1})$ is convex, the mappings $F_t(x_t,u_t,\xi_t)$ are affine in $(x_t,u_t)$, the multifunctions $\mathfrak{M}_t(\cdot)$ are continuous and convex-valued, $t=1,\dots,T$, and Assumption 2.1 holds. Then the value functions are convex in $x_t$ for every $\xi_{[t-1]}$, and the controller has a non-randomized optimal policy.
Proof.
In many interesting applications the considered problem is convex in the sense of Corollary 4.1. In such cases there is no point in considering randomized policies. As an example consider the classical Inventory Model (cf., [18]).
Inventory Model.
Consider the following inventory model (cf., [18])
$$\begin{array}{ll}\min\limits_{\pi\in\Pi} & \mathbb{E}^{\pi}\Big[\sum_{t=1}^{T}\big(c\,u_t+b\,[D_t-x_t-u_t]_{+}+h\,[x_t+u_t-D_t]_{+}\big)\Big]\\[4pt] \text{s.t.} & x_{t+1}=x_t+u_t-D_t,\ \ u_t\ge 0,\ \ t=1,\dots,T,\end{array} \tag{4.1}$$
where $D_1,\dots,D_T$ is a (random) demand process, $c$, $b$ and $h$ are the ordering, backorder penalty and holding costs per unit, respectively, $x_t$ is the inventory level, $u_t$ is the order quantity at time $t$, and $[a]_+:=\max\{a,0\}$. Suppose that the distribution of $D_t$ is supported on $\Xi_t=[\underline{d}_t,\bar{d}_t]$ with $0\le\underline{d}_t\le\bar{d}_t<\infty$, $t=1,\dots,T$, being a finite subinterval of the nonnegative real line.
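A small numerical illustration of the distributionally robust counterpart of (4.1) can be obtained by backward recursion when the demand has finite support, the ambiguity set is a finite collection of demand distributions and the order-up-to level is restricted to a grid. All data below are invented for the sketch, and the next inventory level is clipped to the grid for simplicity.

```python
import numpy as np

c, b, h = 1.0, 4.0, 0.5                  # ordering, backorder and holding costs per unit
demand = np.array([0.0, 1.0, 2.0, 3.0])  # finite demand support
ambiguity = [np.array([0.25, 0.25, 0.25, 0.25]),
             np.array([0.10, 0.20, 0.30, 0.40])]
grid = np.linspace(0.0, 6.0, 61)         # inventory levels
T = 3

V = np.zeros_like(grid)                  # V_{T+1} = 0
for t in range(T, 0, -1):
    V_new = np.empty_like(grid)
    for i, x in enumerate(grid):
        best = np.inf
        for y in grid[grid >= x]:        # order up to level y >= x, i.e. u = y - x
            stage = (c * (y - x)
                     + b * np.maximum(demand - y, 0.0)
                     + h * np.maximum(y - demand, 0.0)
                     + np.interp(y - demand, grid, V))   # cost-to-go, clipped to the grid
            best = min(best, max(p @ stage for p in ambiguity))
        V_new[i] = best
    V = V_new
```

Since the problem is convex in the sense of Corollary 4.1, nothing is lost by restricting the controller to non-randomized (order-up-to) policies.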
5 Construction of Ambiguity Sets
There are several ways in which the ambiguity sets $\mathfrak{M}_t$ can be constructed. We now discuss an approach which can be considered as an extension of the risk averse modeling of multistage stochastic programs.
Let $P$ be a probability measure on a (closed) set $\Xi\subset\mathbb{R}^{d}$, and $\mathfrak{M}$ be a set of probability measures on $\Xi$ absolutely continuous with respect to $P$. Consider the corresponding distributionally robust functional
$$\mathcal{R}(Z):=\sup_{Q\in\mathfrak{M}}\ \mathbb{E}_Q[Z], \tag{5.1}$$
defined on an appropriate space $\mathcal{Z}$ of random variables $Z:\Xi\to\mathbb{R}$. Such a functional can be considered as a coherent risk measure (e.g., [14, Chapter 6]). The functional $\mathcal{R}$ is law invariant if $\mathcal{R}(Z)=\mathcal{R}(Z')$ when $Z$ and $Z'$ are distributionally equivalent, i.e., $P(Z\le z)=P(Z'\le z)$ for any $z\in\mathbb{R}$. A law invariant coherent risk measure can be considered as a function of the cumulative distribution function (cdf) $F_Z(z):=P(Z\le z)$.
An important example of a law invariant coherent risk measure is the Average Value-at-Risk
$$\mathsf{AV@R}_\alpha(Z):=\inf_{\tau\in\mathbb{R}}\Big\{\tau+\alpha^{-1}\,\mathbb{E}_P[Z-\tau]_+\Big\},$$
where $\alpha\in(0,1]$. It has dual representation of the form (5.1) with
$$\mathfrak{M}=\big\{Q\ll P:\ dQ/dP\le\alpha^{-1}\big\}$$
(the notation $Q\ll P$ means that the measure $Q$ is absolutely continuous with respect to the measure $P$).
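As a numerical illustration, the above variational formula for the Average Value-at-Risk and its dual representation of the form (5.1) can be compared on an equally weighted empirical sample; the sketch assumes that $\alpha n$ is an integer, so that the worst-case density simply puts the maximal admissible weight $1/(\alpha n)$ on the $\alpha n$ largest outcomes.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=10_000)              # sample of the random variable Z
alpha, n = 0.1, 10_000

# variational form: AV@R_alpha(Z) = inf_tau { tau + E[(Z - tau)_+] / alpha },
# with the infimum attained at the (1 - alpha)-quantile of Z
tau = np.quantile(z, 1 - alpha)
avar_primal = tau + np.mean(np.maximum(z - tau, 0.0)) / alpha

# dual form (5.1): sup of E_Q[Z] over densities dQ/dP <= 1/alpha;
# for equal weights 1/n the supremum is the average of the largest alpha*n outcomes
k = int(round(alpha * n))
avar_dual = np.sort(z)[-k:].mean()

print(avar_primal, avar_dual)            # agree up to the empirical quantile convention
```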
Now let $\mathbf{P}$ be a reference probability measure on $\Xi_1\times\cdots\times\Xi_T$. Denote by $P_{[t]}$ and $P_{[t-1]}$ the corresponding marginal probability measures on $\Xi_1\times\cdots\times\Xi_t$ and $\Xi_1\times\cdots\times\Xi_{t-1}$. Let us define the conditional measures $P_{t|\xi_{[t-1]}}$, with respect to $P_{[t]}$ and $P_{[t-1]}$, recursively going backward in time. That is, using the Regular Probability Kernel (RPK) (e.g., [4, III-70]), define $P_{t|\xi_{[t-1]}}$ as a probability measure on $\Xi_t$ for almost every (a.e.) $\xi_{[t-1]}$ (a.e. with respect to $P_{[t-1]}$), such that for any measurable sets $A\subset\Xi_1\times\cdots\times\Xi_{t-1}$ and $B\subset\Xi_t$,
$$P_{[t]}(A\times B)=\int_A P_{t|\xi_{[t-1]}}(B)\,dP_{[t-1]}(\xi_{[t-1]}). \tag{5.2}$$
The conditional counterpart $\mathcal{R}_{|\xi_{[t-1]}}$ of a law invariant coherent risk measure can be defined as the respective function of the conditional cdf of $Z_t$ (cf., [15]). For example, the conditional counterpart of the Average Value-at-Risk can be defined in that way and is given by
$$\mathsf{AV@R}_{\alpha|\xi_{[t-1]}}(Z_t)=\inf_{\tau\in\mathbb{R}}\Big\{\tau+\alpha^{-1}\,\mathbb{E}_{|\xi_{[t-1]}}[Z_t-\tau]_+\Big\},$$
where $\mathbb{E}_{|\xi_{[t-1]}}$ is the conditional counterpart of the expectation $\mathbb{E}_P$.
In turn this defines the corresponding set $\mathfrak{M}_t(\xi_{[t-1]})$ of conditional probability measures (via the dual representation of the conditional risk measure). The respective one-step mappings $\mathfrak{R}_t$, defined in (2.12), are given by
$$\mathfrak{R}_t(Z_t)=\mathcal{R}_{|\xi_{[t-1]}}(Z_t),\ \ Z_t\in\mathcal{Z}_t. \tag{5.3}$$
This leads to the respective nested functional $\mathfrak{R}$, defined in (2.13), and the nested formulation (2.14) of the problem with respect to non-randomized policies of the controller.
In this construction of nested counterparts of law invariant risk measures it is essential that the ambiguity sets consist of probability measures absolutely continuous with respect to the reference measure. However, the above approach can be extended beyond that setting. That is, the conditional set $\mathfrak{M}_t(\xi_{[t-1]})$ of probability measures on $\Xi_t$ can be defined in some way with respect to the conditional distribution $P_{t|\xi_{[t-1]}}$. For example, $\mathfrak{M}_t(\xi_{[t-1]})$ can consist of probability measures with Wasserstein distance from $P_{t|\xi_{[t-1]}}$ less than or equal to a constant $r>0$.
Of course, in such a construction it should be verified that the dynamic programming equations (2.7) (in the case of non-randomized policies) and (2.10) (in the case of randomized policies) are well defined. There are two important cases where the corresponding multifunctions are continuous and hence the value functions are continuous (Proposition 2.2). One such case is when the ambiguity sets do not depend on $\xi_{[t-1]}$ (see Section 2.4). This corresponds to the case where the reference measure $\mathbf{P}$ is given by the direct product of the marginal measures $P_1,\dots,P_T$. In that case the ambiguity sets $\mathfrak{M}_t$ can be arbitrary. For example, $\mathfrak{M}_t$ can be the set of probability measures with Wasserstein distance from the reference measure $P_t$ less than or equal to a positive constant $r$. The dynamic programming equations take the form (2.15), and by Proposition 2.2 the value functions are continuous.
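For a finite support, the worst-case expectation over such a Wasserstein ball can be computed by a small transportation linear program. In the sketch below the reference measure, the cost function and the restriction of the candidate measures to the same grid as the reference measure are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

grid = np.linspace(-3.0, 3.0, 41)                    # common support of reference and candidate measures
p_ref = np.exp(-grid**2 / 2); p_ref /= p_ref.sum()   # reference measure P_t
f = grid**2                                          # cost whose expectation nature maximizes
r = 0.25                                             # Wasserstein-1 radius

m = len(grid)
# decision variable: transport plan pi[i, j] between p_ref and the candidate measure,
# flattened row-major; the candidate measure is q_j = sum_i pi[i, j]
obj = -np.tile(f, m)                                 # maximize sum_ij pi_ij f(grid_j)
A_eq = np.kron(np.eye(m), np.ones(m))                # row sums: sum_j pi[i, j] = p_ref[i]
b_eq = p_ref
dist = np.abs(grid[:, None] - grid[None, :]).ravel()
A_ub, b_ub = dist.reshape(1, -1), np.array([r])      # transportation cost <= r
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (m * m))
print(p_ref @ f, -res.fun)                           # nominal vs worst-case expectation
```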
Another important case is when the sets $\Xi_t$, $t=1,\dots,T$, are finite, i.e., $\xi_t$ has a discrete distribution with finite support. Then continuity of the multifunctions $\mathfrak{M}_t(\cdot)$ trivially holds, and hence the value functions are continuous.
6 Conclusions
The main technical assumption used in the considered construction is continuity of the multifunctions $\mathfrak{M}_t(\cdot)$. This assumption ensures continuity of the value functions and hence measurability of the objective functions in the dynamic programming equations (2.7) and (2.10). It holds automatically in two important cases, namely when the ambiguity sets do not depend on the history $\xi_{[t-1]}$ of the process or when the sets $\Xi_t$ are finite. It could be noted that upper semicontinuity of $\mathfrak{M}_t(\cdot)$ alone is not sufficient for ensuring continuity (or even semicontinuity) of the value functions. In general it could be difficult to verify the assumption of continuity of $\mathfrak{M}_t(\cdot)$, in which case the measurability question remains open.
By using the game formulation it is possible to consider randomized policies of the controller. In Theorem 4.1 we give necessary and sufficient conditions for the existence of non-randomized optimal policies. The assumption that the multifunctions $\mathfrak{M}_t(\cdot)$ are convex-valued, i.e., that the respective ambiguity sets are convex, is rather mild. The assumption of continuity of $\mathfrak{M}_t(\cdot)$ is needed in order to ensure that the respective dynamic equations are well defined.
There is a delicate point which we would like to mention. In the game formulation, after the nature chooses the probability measure $P_t\in\mathfrak{M}_t(\xi_{[t-1]})$, the system moves to the next state with $\xi_t$ distributed according to $P_t$. On the other hand, in the risk averse approach the existence of reference (conditional) measures $P_{t|\xi_{[t-1]}}$ is assumed, and the probability law of the process $\xi_1,\dots,\xi_T$ is defined by the reference measure $\mathbf{P}$. In both cases the game and the nested risk averse (for non-randomized policies) approaches lead to the same dynamic equations, the same optimal policies of the controller and the same optimal values. However, the interpretation in terms of realizations of the random process could be different.
References
- [1] D.P. Bertsekas and S.E. Shreve. Stochastic Optimal Control, The Discrete Time Case. Academic Press, New York, 1978.
- [2] P. Billingsley. Probability and Measure, Third Edition. Wiley, 1995.
- [3] J. Frédéric Bonnans and Alexander Shapiro. Perturbation Analysis of Optimization Problems. Springer Series in Operations Research. Springer, 2000.
- [4] C. Dellacherie and P.A. Meyer. Probabilities and Potential. North-Holland Publishing, Amsterdam, Netherlands, 1988.
- [5] E. Delage, D. Kuhn, and W. Wiesemann. "Dice"-sion-making under uncertainty: When can a random decision reduce risk? Management Science, 65(7):3282–3301, 2019.
- [6] C.D. Aliprantis and K.C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer, 2006.
- [7] G.N. Iyengar. Robust Dynamic Programming. Mathematics of Operations Research, 30:257–280, 2005.
- [8] A. Jaśkiewicz and A. S. Nowak. Zero-sum stochastic games. In T. Basar and G. Zaccour, editors, Handbook of dynamic game theory. Springer, 2016.
- [9] Y. Li and A. Shapiro. Rectangularity and duality of distributionally robust Markov decision processes. https://arxiv.org/abs/2308.11139, 2023.
- [10] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition probabilities. Operations Research, 53:780–798, 2005.
- [11] A. Shapiro. Rectangular sets of probability measures. Operations Research, 64(2):528–541, 2016.
- [12] A. Shapiro. Interchangeability principle and dynamic equations in risk averse stochastic programming. Operations Research Letters, 45:377–381, 2017.
- [13] A. Shapiro. Distributionally robust optimal control and MDP modeling. Operations Research Letters, 49:809–814, 2021.
- [14] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia, third edition, 2021.
- [15] A. Shapiro and A. Pichler. Conditional distributionally robust functionals. Operations Research, online:1–13, 2023.
- [16] M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8:171–176, 1958.
- [17] S. Wang, N. Si, J. Blanchet, and Z. Zhou. On the foundation of distributionally robust reinforcement learning. https://arxiv.org/abs/2311.09018, 2023.
- [18] P.H. Zipkin. Foundations of Inventory Management. McGraw-Hill, Boston, MA, 2000.
7 Appendix
Let $X$ and $Y$ be nonempty compact metric spaces, $f:X\times Y\to\mathbb{R}$ be a continuous real valued function, and $\mathcal{G}:X\rightrightarrows Y$ be a multifunction (point-to-set mapping). It is said that the multifunction $\mathcal{G}$ is closed if $y_k\in\mathcal{G}(x_k)$, $x_k\to x$ and $y_k\to y$ implies that $y\in\mathcal{G}(x)$. If $\mathcal{G}$ is closed, then it is closed valued, i.e., the set $\mathcal{G}(x)$ is closed for every $x\in X$. It is said that the multifunction $\mathcal{G}$ is upper semicontinuous if for any $x\in X$ and any neighborhood $V$ of $\mathcal{G}(x)$ there is a neighborhood $W$ of $x$ such that $\mathcal{G}(x')\subset V$ for any $x'\in W$. In the considered framework of compact sets, the multifunction $\mathcal{G}$ is closed iff it is closed valued and upper semicontinuous (e.g., [3, Lemma 4.3]).
It is said that the multifunction $\mathcal{G}$ is lower semicontinuous if for any $x\in X$, any $y\in\mathcal{G}(x)$ and any neighborhood $V$ of $y$ there is a neighborhood $W$ of $x$ such that $\mathcal{G}(x')\cap V\neq\emptyset$ for any $x'\in W$. It is said that $\mathcal{G}$ is continuous if it is closed and lower semicontinuous.
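For example, the multifunction $\mathcal{G}:[-1,1]\rightrightarrows[0,1]$ defined by $\mathcal{G}(0):=[0,1]$ and $\mathcal{G}(x):=\{0\}$ for $x\neq 0$ is closed (hence upper semicontinuous) but not lower semicontinuous, and for $f(x,y):=-y$ the corresponding value function $\inf_{y\in\mathcal{G}(x)}f(x,y)$ equals $-1$ at $x=0$ and $0$ elsewhere, and hence is not continuous. This illustrates why upper semicontinuity of the multifunction alone is not sufficient in Theorem 7.1 below.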
Theorem 7.1.
Suppose that the sets $X$ and $Y$ are compact, the function $f:X\times Y\to\mathbb{R}$ is continuous, and the multifunction $\mathcal{G}:X\rightrightarrows Y$ is continuous. Then the value function $\vartheta(x):=\inf_{y\in\mathcal{G}(x)}f(x,y)$ is real valued and continuous.