Partially Observable Discrete-time Discounted Markov Games with General Utility
Abstract.
In this paper, we investigate a partially observable zero-sum game where the state process is a discrete-time Markov chain. We consider a general utility function in the optimization criterion. We show the existence of the value for both finite and infinite horizon games and also establish the existence of optimal policies. The main step involves converting the partially observable game into a completely observable game which also keeps track of the total discounted accumulated reward/cost.
Keywords: partially observable; zero-sum games; utility function; value of the game; optimal policies.
1. Introduction
In this paper we consider a discrete-time zero-sum game where the state process evolves like a controlled Markov chain. We assume that the state has two components, one of which is observable while the other is not. In the optimization criterion we consider a general utility function and discounted rewards/costs. Player 1 is assumed to be the maximizer and player 2 the minimizer. Both finite and infinite horizon problems are investigated. In both cases we show that the game has a value and also establish the existence of optimal policies for both players.
Risk-sensitive optimization problems, wherein one tries to optimize the expectation of the exponential of the accumulated cost/reward, are widely studied in the stochastic optimal control literature. One of the primary reasons for the popularity of this criterion is that, unlike the risk-neutral criterion, it takes the risk preferences of the controllers into account. Risk-sensitive control problems have been studied for discrete-time Markov chains in [2, 7, 11, 13, 22, 25], for continuous-time Markov chains in [19, 20, 21, 27, 29] and for semi-Markov processes in [6, 10, 23]. There is also a good amount of literature on risk-sensitive games, both zero-sum and non-zero-sum, see [1, 4, 5, 9, 18, 28]. However, in all of the above-cited literature it is assumed that the state process is completely observable. In practice, complete information about the state may be unavailable to the controller for taking decisions, which makes the study of optimization problems with partial observation quite natural. Risk-sensitive control problems for discrete-time Markov chains with partial observation have been studied in [3, 12, 24]. Although there has been work on partially observable risk-neutral games [15, 16, 17, 26], to the best of our knowledge there has been no work on risk-sensitive games with partial observation; this appears to be the first such work. In this paper we consider risk-sensitive zero-sum games with partial observation and a general utility function, so that the exponential utility is a special case of our model. As in the risk-neutral case, we convert the partially observable game into an equivalent completely observable game. The difference from the risk-neutral case is that here we also need to keep track of the accumulated cost. Since in the considered model the reward/cost function is assumed to be dependent on the unobservable component, we need to consider the joint conditional distribution of the unobservable state component and the accumulated reward/cost.
The rest of the paper is organized as follows. In Section 2, we describe our game model. Section 3 deals with the finite horizon problem. Finally, in Section 4 we investigate the infinite horizon problem as a limit of the finite horizon problem.
2. Zero-Sum Game Model
The partially observable risk-sensitive zero-sum Markov game model that we are interested in can be represented by the tuple
(1) |
where each individual component has the following interpretation.
• X and Y are Borel spaces; X represents the observable state space and Y the non-observable state space.
• Two Borel spaces serve as the action sets for players 1 and 2, respectively. For each observable state, Borel subsets of these spaces denote the sets of all admissible actions in that state for players 1 and 2, respectively.
• The set of admissible state–action pairs is assumed to be a measurable subset of the product of the state and action spaces. On this set, a single function serves as the immediate reward function for player 1 and the immediate cost function for player 2.
• For each admissible state–action pair, the transition probability gives the probability that the next state pair lies in a given Borel subset of the state space, given the current state and the actions chosen by the players.
In addition, a discount factor is given. In what follows, we assume that the transition kernel has a measurable density with respect to some σ-finite measures on X and Y, i.e.,
(2)
We also introduce the corresponding marginal transition kernel density of the observable component, obtained by integrating out the unobservable coordinate.
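For concreteness, writing λ and ν for the two σ-finite reference measures on X and Y, q for the joint density and q^X for the marginal density (all of these symbols are illustrative choices, not fixed by the description above), the factorization and the marginal kernel take the form

```latex
% Illustrative notation: \lambda, \nu, q and q^X are assumed names.
\[
  Q(B \times C \mid x, y, a, b)
    = \int_{B} \int_{C} q(x', y' \mid x, y, a, b)\, \nu(dy')\, \lambda(dx'),
  \qquad B \in \mathcal{B}(X),\; C \in \mathcal{B}(Y),
\]
\[
  q^{X}(x' \mid x, y, a, b) := \int_{Y} q(x', y' \mid x, y, a, b)\, \nu(dy').
\]
```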
We assume that the distribution of the initial (unobservable) state component is known to the players. The risk-sensitive partially observed zero-sum game evolves in the following way:
• At the initial decision epoch, based on the initial observation, the players choose their actions simultaneously and independently of each other.
• As a consequence of the chosen actions, player 1 receives a reward and player 2 incurs a cost, both depending on the initial unobservable state. Note that the reward/cost depends on the unobservable component and is therefore itself unobservable.
• At the next time epoch the system moves to the next state according to the transition law. Based on the observable state at time 1, the previous observation and the previous pair of actions, players 1 and 2 again choose actions simultaneously. This yields a reward for player 1 and a cost for player 2. The sequence of events described above then repeats itself; a schematic simulation of one round is sketched below.
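To make the order of events concrete, the following sketch simulates one round of the game on a finite model. All names here (the prior dictionary, the policy callables, `reward` and `transition`) are hypothetical placeholders for the model ingredients above, not notation from the paper.

```python
import random

def simulate_one_round(x0, prior_y, policy_1, policy_2, reward, transition, rng=random):
    """One decision epoch of the partially observed zero-sum game (illustrative only)."""
    ys, probs = zip(*prior_y.items())
    y0 = rng.choices(ys, weights=probs)[0]        # hidden component drawn from the known prior
    history = [x0]                                # players only observe x0
    a = policy_1(history)                         # actions chosen simultaneously ...
    b = policy_2(history)                         # ... and independently of each other
    r = reward(x0, y0, a, b)                      # reward for player 1 = cost for player 2;
                                                  # depends on y0, hence itself unobservable
    x1, y1 = transition(x0, y0, a, b, rng)        # next (observable, hidden) pair
    history += [(a, b), x1]                       # observable history now contains (x0, a, b, x1)
    return history, (y1, r)                       # y1 and r remain hidden from the players
```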
The admissible observable history available up to time n consists of the observable states at the decision epochs 0 through n together with the pairs of actions chosen by the two players at the epochs 0 through n−1. We endow the history spaces with the Borel σ-algebra. Next we introduce the concepts of decision rules and policies.
Definition 1.
Consider the sets of all probability measures on the two action spaces. A measurable mapping from the observable histories up to stage n into the probability measures on player 1's action space, giving full mass to the admissible actions at the current observable state, is called a decision rule at stage n for player 1. Decision rules at stage n for player 2 are defined analogously. Sequences of such decision rules, one for each stage, are called policies for players 1 and 2, respectively.
3. Finite Horizon Problem
For a fixed pair of policies, a fixed (observable) initial state and the initial distribution of the unobservable component, together with the transition kernel, a theorem of Ionescu Tulcea yields a probability measure on the canonical sample space endowed with the product σ-algebra. More precisely, this is the probability measure governing the game under the given pair of policies and the given initial data. On this probability space we define the state random variables via the canonical projections. The action sequences of players 1 and 2 are the corresponding coordinate processes, and under the fixed pair of policies their distributions at each stage are given by the respective decision rules applied to the observable history.
In this section we study the N-stage optimization problem. For defining the optimality criterion, consider a utility function which is assumed to be continuous and strictly increasing. The discounted cost/reward generated over N stages is given by:
(3) |
For a fixed initial observable state and given policies of the two players, the optimization criterion that we are interested in is given by:
(4) |
Player 1 tries to maximize this criterion over all of its policies, for each initial observable state. Analogously, player 2 tries to minimize it over all of its policies. This leads to the following definitions of optimal policies and the value of the game.
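As a simple illustration of the criterion (4), the following sketch estimates the expected utility of the N-stage discounted sum by Monte Carlo simulation under a fixed pair of policies. The helper `simulate_episode`, the discount factor name `beta` and the exponential example utility are illustrative assumptions, not part of the model description above.

```python
import math
import random

def evaluate_criterion(simulate_episode, utility, beta, n_stages, n_samples=10_000, rng=random):
    """Monte Carlo estimate of E[ U( sum_{k<N} beta**k * c_k ) ] for fixed policies.

    `simulate_episode(n_stages, rng)` is assumed to return the one-stage rewards/costs
    [c_0, ..., c_{N-1}] generated under the fixed pair of policies.
    """
    total = 0.0
    for _ in range(n_samples):
        costs = simulate_episode(n_stages, rng)
        discounted_sum = sum((beta ** k) * c for k, c in enumerate(costs))
        total += utility(discounted_sum)          # utility is applied to the accumulated amount
    return total / n_samples

# Example utility: the classical risk-sensitive (exponential) special case.
def exp_utility(s, gamma=0.5):
    return math.exp(gamma * s)
```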
Definition 2.
A strategy is said to be optimal for player 1 in the partially observed model if
The quantity is referred to as the lower value of the partially observed game.
Similarly, a strategy is said to be optimal for player 2 in the partially observed model if
The quantity is referred to as the upper value of the partially observed game.
Hence, a pair of optimal strategies satisfies
for any initial observable state. Thus, such a pair constitutes a saddle-point equilibrium. The partially observed game is said to have a value if
Note that if both players have optimal strategies then the partially observed game has a value. Our aim is to show that the game model under consideration has a value and that a saddle-point equilibrium exists. To that end, we assume that the following assumptions are in force throughout the paper.
Assumption 1.
(i) For each observable state, the admissible action sets of the two players are compact subsets of the respective action spaces. Also, the corresponding set-valued mappings are continuous.
(ii) is continuous.
(iii) is bounded and continuous.
(iv) C is also bounded, i.e., there exist constants bounding it from below and above.
In the literature, partially observable risk-neutral games are treated by first converting them into equivalent completely observable games. Following that approach, in the risk-sensitive setting we also first convert our partially observed game into a completely observable game. We show the existence of the value function and of optimal strategies for the equivalent completely observable model and then translate the results back to the partially observed model.
In the partially observed model, the unobservable state component and the cost accumulated so far cannot be observed, since the latter depends on the former. We therefore need to estimate them. For that purpose we consider the set of all probability measures on the Borel σ-algebra of the product of the unobservable state space and the real line whose support in the real (accumulated cost) component is bounded by some constant. The elements of this set will play the role of the joint conditional distribution of the hidden state component and the accumulated cost; the precise interpretation is given in Theorem 1. In order to estimate the unobserved state component and the accumulated cost we define the following updating operator, given by
(5) |
where the denominator involves the Y-marginal distribution of the measure; going forward we will also use a corresponding notation for the S-marginal. We define the updating operator only when the denominator is positive. Given the initial distribution and the observed data, we now define the sequence of probability measures (a numerical sketch of this update follows the recursion below)
(6) |
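A rough numerical sketch of the updating operator is given below for a finite hidden-state grid and a dictionary representation of the joint conditional law of (hidden state, accumulated cost). The names `q`, `c` and the discount weight `disc`, as well as the convention that the one-stage cost is added with its current discount factor, are assumptions of this sketch and should be matched to (5)–(6).

```python
from collections import defaultdict

def update(mu, x, a, b, x_next, y_grid, q, c, disc):
    """Bayes-type update of the joint conditional law of (hidden state, accumulated cost).

    mu     : dict mapping (y, s) -> probability, the current joint conditional law
    y_grid : finite grid of hidden states (assumption of this sketch)
    q      : assumed joint transition density q(x_next, y_next | x, y, a, b)
    c      : one-stage reward/cost c(x, y, a, b)
    disc   : discount weight applied to the current one-stage cost (e.g. beta ** n)
    Returns the updated law, or None if the normalising denominator vanishes.
    """
    new_mu = defaultdict(float)
    for (y, s), p in mu.items():
        s_next = s + disc * c(x, y, a, b)                      # accumulate the discounted cost
        for y_next in y_grid:
            new_mu[(y_next, s_next)] += q(x_next, y_next, x, y, a, b) * p
    norm = sum(new_mu.values())                                # likelihood of observing x_next
    if norm <= 0.0:
        return None                                            # defined only for positive denominator
    return {key: w / norm for key, w in new_mu.items()}
```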
The next theorem provides the interpretation of the above-defined sequence of probability measures as a sequence of conditional distributions. For that purpose we first define the following sequence of random variables:
We then have the following result, which generalizes Theorem 1 of [3] to the game setting.
Theorem 1.
Suppose that the sequence of measures is given by the recursion (3). Then, for a given initial observable state and given policies of the two players, we have
where .
Proof.
We first show that
(7) |
for all bounded and measurable and
We use induction on the time index. The basis step is true, since for the initial time both sides reduce to the same expression. Now suppose that the statement is true for a fixed time; to ease notation we use an abbreviated symbol in what follows. For a given observable history, the left-hand side of (7) becomes:
The right-hand side becomes:
where we use the recursion for the measure in the third equality and Fubini's theorem to cancel out the normalizing constant. Hence we are done by induction.
Now, in particular, choosing the function and the measurable set of histories up to the given time appropriately, we get from (7),
This establishes the fact that the constructed measure is a conditional distribution of the unobservable component and the accumulated cost given the observable history. ∎
We now return to the optimization problem (4). Motivated by the previous result, we define, for the relevant arguments:
(8) |
We solve our optimization problem by using a state augmentation technique. For that purpose we define, for a probability measure :
We now consider the completely observable model with a new state space. The action spaces for players 1 and 2 are the same as in the partially observable model. A one-stage cost/reward and a terminal cost/reward function are given. Since, for every measure in the new state space, the support in the s-component is bounded, the relevant expectation is well defined. The transition law for the new model, defined for each admissible state–action triple and Borel measurable subset, is given by
In the newly defined model the decision rules for player 1 are measurable mappings from the new state space into the probability measures on player 1's action space that give full mass to the admissible actions; decision rules for player 2 are defined analogously. We denote the corresponding sets of all decision rules for the two players, as well as the sets of all Markov policies built from such decision rules.
Note that we consider the topology of weak convergence on the set of probability measures. For a suitable function on the new state space and a pair of decision rules, we consider the following operators:
(9) |
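At the heart of these operators is, for each fixed augmented state, the value of a one-shot zero-sum game over randomized actions; Fan's minimax theorem guarantees that the maximizing and minimizing operations can be interchanged. For finite action sets this inner value can be computed by linear programming, as in the following sketch (illustrative, not tied to the paper's notation).

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value and an optimal mixed strategy for the maximiser in the zero-sum game
    with payoff matrix M (rows: maximiser's actions, columns: minimiser's actions)."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    # Decision variables: mixed strategy p (length m) and the game value v.
    c = np.concatenate([np.zeros(m), [-1.0]])              # maximise v  <=>  minimise -v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])              # v - p @ M[:, j] <= 0 for every column j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])                                 # p must be a probability vector
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    p, value = res.x[:m], res.x[m]
    return value, p
```

An optimal mixed strategy for the minimiser can be obtained from the dual problem, or simply by applying the same routine to the transposed and negated payoff matrix.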
Next we have the following theorem:
Theorem 2.
(a) Let a pair of policies for players 1 and 2 be given. Then it holds that, for all stages,
(b) For all let . Then and
(c) There exist measurable decision rules such that
for all arguments. Then the corresponding function is the value of the N-stage partially observable stochastic game, and the Markov policies induced by these decision rules are optimal for players 1 and 2, respectively.
Proof.
(a) We establish the above iteration by induction on the number of stages. For the basis step we have,
Now suppose that the statement is true for a fixed number of stages, and consider the 1-shifted policies. Then we get,
Hence we have the desired conclusion by induction.
(b) and (c): We show by induction on the stage index that
(i)
(ii) , for any measurable .
(iii) , for any measurable .
(iv) .
Under our assumptions . For , it follows from definition that
Under our assumptions the existence of the minimax selectors follows from a classical measurable selection theorem [8] and Fan's minimax theorem [14]. The remaining claims for the basis step follow from the definitions. Now suppose the statement is true for a fixed stage. We again have the existence of the corresponding selectors. Using the induction hypothesis and the monotonicity of the operators we obtain
for any measurable . On the other hand
for any measurable decision rules. Thus, combining with part (a), we have shown that the constructed pair of policies is a saddle-point equilibrium for the N-stage completely observable game with optimization criterion given by (8). The rest of the conclusions now follow from the relation between the partially observable and the completely observable game. ∎
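Schematically, parts (b) and (c) amount to a backward induction in the completely observable model: start from the terminal reward/cost and, at each earlier stage, solve a one-shot zero-sum game whose payoff entries are given by the operator in (9). The sketch below only illustrates this pattern on a finite (e.g. discretized) augmented state space; the callable `stage_payoff`, which must encode the operator in (9) exactly, and all other names are assumptions of this sketch.

```python
import numpy as np

def backward_induction(states, actions_1, actions_2, stage_payoff, terminal, horizon):
    """N-stage backward induction in the completely observable game (illustrative).

    stage_payoff(z, a, b, v_next) : hypothetical callable returning the one-shot payoff entry
                                    at state z given the continuation values v_next; it should
                                    implement the operator in (9).
    terminal(z)                   : terminal reward/cost at state z.
    Returns the stage-0 value function and the maximiser's optimal mixtures per stage.
    """
    v_next = {z: terminal(z) for z in states}
    strategies = []
    for _ in range(horizon):                                   # stages N-1, N-2, ..., 0
        v_curr, mix = {}, {}
        for z in states:
            payoff = np.array([[stage_payoff(z, a, b, v_next) for b in actions_2]
                               for a in actions_1])
            v_curr[z], mix[z] = matrix_game_value(payoff)      # solver from the previous sketch
        strategies.append(mix)
        v_next = v_curr
    strategies.reverse()                                       # strategies[0] = rule at stage 0
    return v_next, strategies
```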
4. Infinite Horizon Problem
For the finite horizon problem we have established the existence of the value of the game and of optimal strategies. We now consider the infinite horizon case, i.e., for a given pair of policies we are interested in the optimization problem:
The upper value, lower value, optimal strategies and value of the game are defined as in Definition 2, with the finite horizon criterion replaced by its infinite horizon counterpart.
In the infinite horizon case we need to deal with concave and convex utility functions separately.
Concave utility function: We first investigate the case of a concave utility function. Just as in the finite horizon case, we consider the equivalent completely observable game. To that end we introduce the following notation.
(10) |
We denote
Then we have the main theorem of this section:
Theorem 3.
(a) The limit function is the unique solution of the corresponding fixed-point equation in the relevant class of functions, where the operator is as defined in (9). Moreover, the finite horizon quantities converge to it as the horizon tends to infinity.
(b) There exist measurable functions such that
(11)
holds for all arguments. Then the limit function is the value of the partially observable infinite horizon stochastic game. Moreover, the stationary policies induced by these functions are optimal for players 1 and 2, respectively.
Proof.
(a) We first prove the first assertion. Since the utility function is increasing and concave, it satisfies the inequality
where the left-hand derivative of the utility appears, which exists since the utility is concave; furthermore, it is non-negative and non-increasing. Using this, we get
(12) |
where the constant is defined by the last equation. From (4) and the above bound we obtain one inequality after taking the limit; applying the operator to both sides of the corresponding relation and letting the index tend to infinity yields the reverse inequality. Hence, combining the two inequalities, we obtain the desired identity.
Next, we obtain
By similar reasoning the analogous inequality can be shown. Thus the corresponding limits exist. Moreover, by iteration we have,
Using the fact , we obtain
From this, the corresponding differences vanish as the horizon tends to infinity. Also, since the limit function is both an increasing and a decreasing limit of continuous functions, it is continuous.
Now, for uniqueness, suppose there is another solution of the fixed-point equation in the same class. This then implies the corresponding relation for all stages, and taking the limit we obtain the uniqueness of the solution.
(b) The existence of the maximizing and minimizing decision rules follows again from the measurable selection theorem and the minimax theorem. By monotonicity and the fixed-point property we obtain the stated inequalities, where the stationary policies are those induced by these decision rules. By the definition of the operators we obtain,
for any ,
The fixed-point property also implies the reverse relation. Hence we can also write, for any admissible policy,
By iterating this inequality finitely many times we get
for an arbitrary number of iterations and an arbitrary policy. Letting the number of iterations tend to infinity we get,
for all policies of the two players. The rest of the conclusions are now straightforward.
∎
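Theorem 3(a) identifies the infinite-horizon value as the limit of the finite-horizon quantities, which numerically suggests iterating the one-stage recursion until the iterates stabilize. The sketch below reuses `matrix_game_value` and the hypothetical `stage_payoff` from the earlier sketches; the sup-norm stopping rule is an assumption of this sketch rather than part of the paper's argument.

```python
import numpy as np

def infinite_horizon_value(states, actions_1, actions_2, stage_payoff, terminal,
                           tol=1e-8, max_iter=10_000):
    """Approximate the infinite-horizon value by iterating the finite-horizon recursion."""
    v = {z: terminal(z) for z in states}
    for _ in range(max_iter):
        v_new = {}
        for z in states:
            payoff = np.array([[stage_payoff(z, a, b, v) for b in actions_2]
                               for a in actions_1])
            v_new[z], _ = matrix_game_value(payoff)            # solver from the earlier sketch
        if max(abs(v_new[z] - v[z]) for z in states) < tol:    # iterates have stabilized
            return v_new
        v = v_new
    return v
```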
Convex utility function: We now consider the case of a convex utility function.
Theorem 4.
Theorem 3 also holds for a convex utility function.
Proof.
For the convex case the proof follows the same lines as that of Theorem 3. The only difference is that we use a different inequality. Note that for an increasing and convex utility function we have the inequality
where the right-hand derivative of the utility appears, which exists since the utility is convex; moreover, it is non-negative and increasing. Thus, we obtain,
Denoting the resulting constant accordingly, we end up with
Letting the horizon tend to infinity yields the claim.
We use the same inequality to get,
and the right-hand side converges to 0 as the horizon tends to infinity. ∎
A few remarks are in order.
Remark 1.
(1) An important sub-case of the model considered here is when the reward/cost does not depend on the unobservable component. In this case the accumulated reward/cost is no longer unobservable and therefore need not be estimated. It can then be shown, along similar lines as in Proposition 1 of [3], that the state space of the completely observable model can be reduced accordingly and the updating operator can be taken as
where is a Borel subset of and .
(2) An important utility function is the exponential utility with a fixed risk-sensitivity parameter. In this case it is again not necessary to keep track of the accumulated cost (see the illustration after this remark), and it is enough to consider a reduced state space for the completely observable model. Arguing as in Theorem 3 of [3], it can be shown that the value of the N-stage game is given by a recursion of the following form,
where and for , Borel subset of and , Borel subset of ,
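Assuming the utility in item (2) is the exponential one, U(x) = e^{γx} with parameter γ (the exact formula is not reproduced above, so this is an assumption), the reason the accumulated cost need not be tracked is the multiplicative factorization

```latex
% Assumption: exponential utility U(x) = e^{\gamma x}.
\[
  U\bigl(s + \beta^{n} r\bigr) = e^{\gamma s}\, e^{\gamma \beta^{n} r}.
\]
```

The already accumulated amount s then enters only through the factor e^{γs}, which can be pulled out of all conditional expectations, so the conditional law of the hidden state component alone suffices as the information state.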
References
- [1] Arnab Basu and Mrinal Kanti Ghosh. Zero-sum risk-sensitive stochastic games on a countable state space. Stochastic Process. Appl., 124(1):961–983, 2014.
- [2] Nicole Bäuerle and Ulrich Rieder. More risk-sensitive Markov decision processes. Math. Oper. Res., 39(1):105–120, 2014.
- [3] Nicole Bäuerle and Ulrich Rieder. Partially observable risk-sensitive Markov decision processes. Math. Oper. Res., 42(4):1180–1196, 2017.
- [4] Nicole Bäuerle and Ulrich Rieder. Zero-sum risk-sensitive stochastic games. Stochastic Process. Appl., 127(2):622–642, 2017.
- [5] Arnab Bhabak and Subhamay Saha. Zero and non-zero sum risk-sensitive semi-Markov games. Stochastic Analysis and Applications, pages 1–18, 2021.
- [6] Arnab Bhabak and Subhamay Saha. Risk-sensitive semi-Markov decision problems with discounted cost and general utilities. Statist. Probab. Lett., 184:Paper No. 109408, 9, 2022.
- [7] V. S. Borkar and S. P. Meyn. Risk-sensitive optimal control for Markov decision processes with monotone cost. Math. Oper. Res., 27(1):192–209, 2002.
- [8] L. D. Brown and R. Purves. Measurable selections of extrema. Ann. Statist., 1:902–912, 1973.
- [9] Rolando Cavazos-Cadena and Daniel Hernández-Hernández. The vanishing discount approach in a class of zero-sum finite games with risk-sensitive average criterion. SIAM J. Control Optim., 57(1):219–240, 2019.
- [10] Selene Chávez-Rodríguez, Rolando Cavazos-Cadena, and Hugo Cruz-Suárez. Controlled semi-Markov chains with risk-sensitive average cost criterion. J. Optim. Theory Appl., 170(2):670–686, 2016.
- [11] Kun Jen Chung and Matthew J. Sobel. Discounted MDPs: distribution functions and exponential utility maximization. SIAM J. Control Optim., 25(1):49–62, 1987.
- [12] G. B. Di Masi and L. Stettner. Risk sensitive control of discrete time partially observed Markov processes with infinite horizon. Stochastics Stochastics Rep., 67(3-4):309–322, 1999.
- [13] Giovanni B. Di Masi and Łukasz Stettner. Infinite horizon risk sensitive control of discrete time Markov processes under minorization property. SIAM J. Control Optim., 46(1):231–252, 2007.
- [14] Ky Fan. Minimax theorems. Proc. Nat. Acad. Sci. U.S.A., 39:42–47, 1953.
- [15] M. K. Ghosh, D. McDonald, and S. Sinha. Zero-sum stochastic games with partial information. J. Optim. Theory Appl., 121(1):99–118, 2004.
- [16] Mrinal K. Ghosh and Anindya Goswami. Partially observable semi-Markov games with discounted payoff. Stoch. Anal. Appl., 24(5):1035–1059, 2006.
- [17] Mrinal K. Ghosh and Anindya Goswami. Partially observed semi-Markov zero-sum games with average payoff. J. Math. Anal. Appl., 345(1):26–39, 2008.
- [18] Mrinal K. Ghosh, K. Suresh Kumar, and Chandan Pal. Zero-sum risk-sensitive stochastic games for continuous time Markov chains. Stoch. Anal. Appl., 34(5):835–851, 2016.
- [19] Mrinal K. Ghosh and Subhamay Saha. Risk-sensitive control of continuous time Markov chains. Stochastics, 86(4):655–675, 2014.
- [20] Xianping Guo and Zhong-Wei Liao. Risk-sensitive discounted continuous-time Markov decision processes with unbounded rates. SIAM J. Control Optim., 57(6):3857–3883, 2019.
- [21] Xianping Guo and Junyu Zhang. Risk-sensitive continuous-time Markov decision processes with unbounded rates and Borel spaces. Discrete Event Dyn. Syst., 29(4):445–471, 2019.
- [22] Daniel Hernández-Hernández and Steven I. Marcus. Risk sensitive control of Markov processes in countable state space. Systems Control Lett., 29(3):147–155, 1996.
- [23] Yonghui Huang, Zhaotong Lian, and Xianping Guo. Risk-sensitive semi-Markov decision processes with general utilities and multiple criteria. Adv. in Appl. Probab., 50(3):783–804, 2018.
- [24] Matthew R. James, John S. Baras, and Robert J. Elliott. Risk-sensitive control and dynamic games for partially observed discrete-time nonlinear systems. IEEE Trans. Automat. Control, 39(4):780–792, 1994.
- [25] Anna Jaśkiewicz. Average optimality for risk-sensitive control with general state space. Ann. Appl. Probab., 17(2):654–675, 2007.
- [26] Subhamay Saha. Zero-sum stochastic games with partial information and average payoff. J. Optim. Theory Appl., 160(1):344–354, 2014.
- [27] K. Suresh Kumar and Chandan Pal. Risk-sensitive control of pure jump process on countable space with near monotone cost. Appl. Math. Optim., 68(3):311–331, 2013.
- [28] Qingda Wei. Zero-sum games for continuous-time Markov jump processes with risk-sensitive finite-horizon cost criterion. Oper. Res. Lett., 46(1):69–75, 2018.
- [29] Qingda Wei and Xian Chen. Risk-sensitive average continuous-time Markov decision processes with unbounded rates. Optimization, 68(4):773–800, 2019.