
Partially Observable Discrete-time Discounted Markov Games with General Utility

Arnab Bhabak Department of Mathematics
Indian Institute of Technology Guwahati
Guwahati, Assam, India
[email protected]
 and  Subhamay Saha Department of Mathematics
Indian Institute of Technology Guwahati
Guwahati, Assam, India
[email protected]
Abstract.

In this paper we investigate a partially observable zero-sum game in which the state process is a discrete-time Markov chain. We consider a general utility function in the optimization criterion. We show the existence of the value for both finite and infinite horizon games and also establish the existence of optimal policies. The main step involves converting the partially observable game into a completely observable game which also keeps track of the total discounted accumulated reward/cost.

Keywords: partially observable; zero-sum games; utility function; value of the game; optimal policies.

1. Introduction

In this paper we consider a discrete-time zero-sum game where the state process evolves as a controlled Markov chain. The state has two components, one of which is observable while the other is not. In the optimization criterion we consider a general utility function applied to the discounted reward/cost. Player 1 is assumed to be the maximizer and player 2 the minimizer. Both finite and infinite horizon problems are investigated. In both cases we show that the game has a value and we establish the existence of optimal policies for both players.

Risk-sensitive optimization problems, wherein one tries to optimize the expectation of the exponential of the accumulated cost/reward, are widely studied in the stochastic optimal control literature. One of the primary reasons for the popularity of this criterion is that, unlike the risk-neutral optimization criterion, the risk preferences of the controllers are taken into account. Risk-sensitive control problems have been widely studied in the literature for both Markov and semi-Markov processes. Risk-sensitive control problems have been studied for discrete-time Markov chains in [2, 7, 11, 13, 22, 25], for continuous-time Markov chains in [19, 20, 21, 27, 29] and for semi-Markov processes in [6, 10, 23]. There is also a good amount of literature on risk-sensitive games, both zero-sum and non-zero-sum, see [1, 4, 5, 9, 18, 28]. However, in all the above cited literature it is assumed that the state process is completely observable. In practice it may be the case that complete information about the state is unavailable to the controller for taking decisions, which makes the study of optimization problems with partial observation quite natural. Risk-sensitive control problems for discrete-time Markov chains with partial observation have been studied in [3, 12, 24]. Although there has been work on partially observable risk-neutral games [15, 16, 17, 26], to the best of our knowledge there has been no work on risk-sensitive games with partial observation. Thus, this appears to be the first work on partially observable risk-sensitive games. In this paper we consider risk-sensitive zero-sum games with partial observation and a general utility function, so the exponential utility is a special case of our model. As in the risk-neutral case, we convert the partially observable game into an equivalent completely observable game. The difference from the risk-neutral case is that here we also need to keep track of the accumulated cost. Since in the considered model the reward/cost function is assumed to depend on the unobservable component, we need to consider the joint conditional distribution of the unobservable state component and the accumulated reward/cost.

The rest of the paper is organized as follows. In Section 2, we describe our game model. Section 3 deals with the finite horizon problem. Finally, in Section 4 we investigate the infinite horizon problem as a limit of finite horizon problems.

2. Zero-Sum Game Model

The partially observable risk-sensitive zero-sum Markov game model that we are interested in can be represented by the tuple

$(X,Y,A,B,\{A(x)\subset A,\ B(x)\subset B,\ x\in X\},C(x,y,a,b),Q),$   (1)

where each individual component has the following interpretation.

  • $X, Y$ are Borel spaces; $X$ represents the observable state space and $Y$ the unobservable state space.

  • The Borel spaces $A$ and $B$ are the action sets for players 1 and 2 respectively. For each $x\in X$, $A(x)\subset A$ and $B(x)\subset B$ are Borel subsets denoting the sets of admissible actions in state $x$ for players 1 and 2 respectively.

  • Define $\mathbb{K}=\{(x,a,b):x\in X,\ a\in A(x),\ b\in B(x)\}$, the set of admissible state-action pairs, which is assumed to be a measurable subset of $X\times A\times B$. Then $C:\mathbb{K}\times Y\to\mathbb{R}$ is the immediate reward function for player 1 and the immediate cost function for player 2.

  • For each $(x,y,a,b)\in\mathbb{K}\times Y$, $Q(B|x,y,a,b)$ is the transition probability that the next state pair lies in $B\in\mathcal{B}(X\times Y)$, where $\mathcal{B}(X\times Y)$ is the collection of all Borel subsets of $X\times Y$, given that the current state is $(x,y)$ and the players choose actions $(a,b)\in A(x)\times B(x)$.

Also we have a discount factor $\beta\in(0,1)$. In what follows, we assume that the transition kernel $Q$ has a measurable density $q$ with respect to some $\sigma$-finite measures $\lambda$ and $\nu$, i.e.,

$Q(B|x,y,a,b)=\int_{B}q(x',y'|x,y,a,b)\,\lambda(dx')\,\nu(dy'),\qquad B\in\mathcal{B}(X\times Y).$   (2)

We also introduce the marginal transition kernel density by

$q^{X}(x'|x,y,a,b):=\int_{Y}q(x',y'|x,y,a,b)\,\nu(dy').$
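For instance, when $Y$ is finite and $\nu$ is the counting measure, the marginal density is simply a sum over the unobservable component. The following illustrative sketch (Python, with a hypothetical array representation of $q$ that is not part of the formal model) makes this concrete.

```python
import numpy as np

# Hypothetical finite model: q[x_next, y_next, x, y, a, b] is the joint transition
# density with respect to counting measures, so the nu-integration is a plain sum.
def marginal_kernel(q):
    """Return q_X[x_next, x, y, a, b] = sum over y_next of q[x_next, y_next, x, y, a, b]."""
    return q.sum(axis=1)
```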

We assume that the distribution $Q_{0}$ of the initial (unobservable) state $Y_{0}$ is known to both players. The risk-sensitive partially observed zero-sum game evolves in the following way (an illustrative simulation sketch is given after the list):

  • At the 0th decision epoch, based on the initial observation $x_{0}$, the players choose actions $a_{0}\in A(x_{0})$ and $b_{0}\in B(x_{0})$ simultaneously and independently of each other.

  • As a consequence of these actions player 1 receives a reward $C(x_{0},y_{0},a_{0},b_{0})$ and player 2 incurs a cost $C(x_{0},y_{0},a_{0},b_{0})$, where $y_{0}$ is the initial unobservable state. Note that the reward/cost depends on the unobservable component as well and is therefore itself unobservable.

  • At the next time epoch the system moves to the next state $(x_{1},y_{1})$ according to the transition law $Q(\cdot|x_{0},y_{0},a_{0},b_{0})$. If the observable state at time 1 is $x_{1}$, then based on this observation, the previous observation $x_{0}$ and the previous pair of actions $(a_{0},b_{0})$, players 1 and 2 choose actions $a_{1}\in A(x_{1})$ and $b_{1}\in B(x_{1})$ respectively. This yields a reward $C(x_{1},y_{1},a_{1},b_{1})$ for player 1 and a cost $C(x_{1},y_{1},a_{1},b_{1})$ for player 2. The sequence of events described above then repeats itself.
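For concreteness, one round of this interaction can be simulated in a small finite model as in the sketch below (Python; the arrays q and C and the policy functions are illustrative stand-ins, not part of the formal model).

```python
import numpy as np

rng = np.random.default_rng(0)

# One round of the partially observed game.  x, y are indices into finite X, Y;
# q[x_next, y_next, x, y, a, b] and C[x, y, a, b] are hypothetical arrays.
def play_round(x, y, policy1, policy2, q, C, z):
    a = policy1(x)                        # both players observe only x
    b = policy2(x)
    reward = z * C[x, y, a, b]            # unobservable, since it depends on y
    joint = q[:, :, x, y, a, b]           # law of (x_next, y_next)
    flat = rng.choice(joint.size, p=joint.ravel() / joint.sum())
    x_next, y_next = np.unravel_index(flat, joint.shape)
    return int(x_next), int(y_next), reward
```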

Let $\mathcal{H}_{n}$ be the set of admissible observable histories up to time $n$, i.e., $\mathcal{H}_{0}=X$ and, for $n\geq 1$, $\mathcal{H}_{n}=\mathcal{H}_{n-1}\times A\times B\times X$. Thus a typical element of $\mathcal{H}_{n}$ is $h_{n}=(x_{0},a_{0},b_{0},x_{1},a_{1},b_{1},\ldots,x_{n-1},a_{n-1},b_{n-1},x_{n})$, where $x_{k}$ is the observable state at the $k$th decision epoch, and $a_{k}$ and $b_{k}$ are the actions chosen by players 1 and 2 respectively at the $k$th decision epoch. We endow these spaces with their Borel $\sigma$-algebras. Next we introduce the concepts of decision rules and policies.

Definition 1.

Let $P(A)$ and $P(B)$ denote the sets of all probability measures on $A$ and $B$ respectively.

A measurable mapping $f_{n}:\mathcal{H}_{n}\rightarrow P(A)$ with the property $f_{n}(h_{n})(A(x_{n}))=1$ for $h_{n}\in\mathcal{H}_{n}$ is called a decision rule at stage $n$ for player 1. Similarly, a measurable mapping $g_{n}:\mathcal{H}_{n}\rightarrow P(B)$ with the property $g_{n}(h_{n})(B(x_{n}))=1$ for $h_{n}\in\mathcal{H}_{n}$ is called a decision rule at stage $n$ for player 2.

Sequences $\pi=(f_{0},f_{1},\ldots)$ and $\sigma=(g_{0},g_{1},\ldots)$, where $f_{n}$ and $g_{n}$ are decision rules at stage $n$ for all $n$, are called policies for players 1 and 2 respectively.

3. Finite Horizon Problem

For fixed policies $\pi=(f_{0},f_{1},\ldots)$ and $\sigma=(g_{0},g_{1},\ldots)$, a fixed (observable) initial state $x\in X$ and the initial distribution $Q_{0}$, together with the transition kernel $Q$, we obtain by the theorem of Ionescu Tulcea a probability measure $\mathbb{P}^{\pi\sigma}_{xy}$ on $(X\times Y)^{\infty}$ endowed with the product $\sigma$-algebra. More precisely, $\mathbb{P}^{\pi\sigma}_{xy}$ is the probability measure under the policy pair $(\pi,\sigma)$ given $X_{0}=x$ and $Y_{0}=y$. On this probability space, for $\omega=(x_{0},y_{0},x_{1},y_{1},\ldots,x_{n},y_{n},\ldots)$ we define the random variables $X_{n}$ and $Y_{n}$ via the canonical projections:

$X_{n}(\omega)=x_{n},\qquad Y_{n}(\omega)=y_{n}.$

The action sequences of players 1 and 2 are denoted by $\{A_{n}\}$ and $\{B_{n}\}$ respectively. Under the policies $\pi=(f_{0},f_{1},\ldots)$ and $\sigma=(g_{0},g_{1},\ldots)$, the distribution of $A_{n}$ is $f_{n}(X_{0},A_{0},B_{0},\ldots,X_{n})$ and the distribution of $B_{n}$ is $g_{n}(X_{0},A_{0},B_{0},\ldots,X_{n})$.

In this section we study the $N$-stage optimization problem. To define the optimality criterion, consider a utility function $U:\mathbb{R}_{+}\rightarrow\mathbb{R}$ which is assumed to be continuous and strictly increasing. The discounted cost/reward generated over $N$ stages is given by:

$C_{N}=\sum_{k=0}^{N-1}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k}).$   (3)

For a fixed initial observable state $x$ and given policies $\pi$ and $\sigma$, the optimization criterion that we are interested in is given by:

$J_{N\pi\sigma}(x):=\int_{Y}\mathbb{E}^{\pi\sigma}_{xy}[U(C_{N})]\,Q_{0}(dy).$   (4)

Here player 1 tries to maximize $J_{N\pi\sigma}(x)$ over all policies $\pi$, for each $\sigma$. Analogously, player 2 tries to minimize $J_{N\pi\sigma}(x)$ over all policies $\sigma$, for each $\pi$. This leads to the following definitions of optimal policies and the value of the game.
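In a finite toy model the criterion (4) can be approximated by crude Monte Carlo, drawing $Y_{0}$ from $Q_{0}$ and averaging the utility of the discounted sum. The sketch below reuses the hypothetical play_round helper and rng from Section 2 and serves only as an illustration.

```python
def estimate_J_N(x0, policy1, policy2, q, C, Q0, U, beta, N, runs=10_000):
    """Monte Carlo estimate of J_{N,pi,sigma}(x0) for fixed (stationary) policies."""
    total = 0.0
    for _ in range(runs):
        y = rng.choice(len(Q0), p=Q0)          # Y_0 ~ Q_0
        x, s, z = x0, 0.0, 1.0
        for _ in range(N):
            x, y, r = play_round(x, y, policy1, policy2, q, C, z)
            s += r                              # r already carries the factor z = beta^k
            z *= beta
        total += U(s)
    return total / runs
```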

Definition 2.

A strategy $\pi^{*}$ is said to be optimal for player 1 in the partially observed model if

$J_{N\pi^{*}\sigma}(x)\geq\sup_{\pi}\inf_{\sigma'}J_{N\pi\sigma'}(x)\quad\text{for any }\sigma.$

The quantity $\sup_{\pi}\inf_{\sigma'}J_{N\pi\sigma'}(x)$ is referred to as the lower value of the partially observed game.
Similarly, a strategy $\sigma^{*}$ is said to be optimal for player 2 in the partially observed model if

$J_{N\pi\sigma^{*}}(x)\leq\inf_{\sigma}\sup_{\pi'}J_{N\pi'\sigma}(x)\quad\text{for any }\pi.$

The quantity $\inf_{\sigma}\sup_{\pi'}J_{N\pi'\sigma}(x)$ is referred to as the upper value of the partially observed game.
Hence, a pair of optimal strategies $(\pi^{*},\sigma^{*})$ satisfies

$J_{N\pi\sigma^{*}}\leq J_{N\pi^{*}\sigma^{*}}\leq J_{N\pi^{*}\sigma}$

for any $\pi,\sigma$. Thus, $(\pi^{*},\sigma^{*})$ constitutes a saddle-point equilibrium. The partially observed game is said to have a value if

$J_{N}(x)=\sup_{\pi}\inf_{\sigma}J_{N\pi\sigma}(x)=\inf_{\sigma}\sup_{\pi}J_{N\pi\sigma}(x).$

Note that if both players have optimal strategies then the partially observed game has a value. Our aim is to show that the game model under consideration has a value and that a saddle-point equilibrium exists. Towards that end, we assume that the following assumptions are in force throughout the paper.

Assumption 1.


  • (i)

    For each $x\in X$, the sets $A(x)$ and $B(x)$ are compact subsets of $A$ and $B$ respectively. Also, the mappings $x\mapsto A(x)$ and $x\mapsto B(x)$ are continuous.

  • (ii)

    $(x,y,a,b)\mapsto C(x,y,a,b)$ is continuous.

  • (iii)

    $(x,y,x',y',a,b)\mapsto q(x',y'|x,y,a,b)$ is bounded and continuous.

  • (iv)

    $C$ is also bounded, i.e., there exist constants $\underline{c},\bar{c}$ with $0<\underline{c}\leq C(x,y,a,b)\leq\bar{c}$.

In the literature, partially observable risk-neutral games are treated by first converting them into equivalent completely observable games. Following that approach, in the risk-sensitive setting we also first convert our partially observed game into a completely observed game. We show the existence of the value function and of optimal strategies for the equivalent completely observable model and then revert back to our partially observed model.

In the unobserved model, the state component $y$ and the cost accumulated so far cannot be observed, since the latter depends on $y$. Thus we need to estimate them. For that purpose we consider the following set of probability measures on $Y\times\mathbb{R}_{+}$:

$P_{b}(Y\times\mathbb{R}_{+}):=\{\mu$ is a probability measure on the Borel $\sigma$-algebra $\mathcal{B}(Y\times\mathbb{R}_{+})$ such that there exists a constant $K=K(\mu)>0$ with $\mu(Y\times[0,K])=1\}$.

The elements of the above set will play the role of the conditional distribution of the hidden state component and the accumulated cost; the precise interpretation is given in Theorem 1. In order to estimate the unobserved state component and the accumulated cost we define the updating operator $\Phi:X\times A\times B\times X\times P_{b}(Y\times\mathbb{R}_{+})\times\mathbb{R}_{+}\rightarrow P_{b}(Y\times\mathbb{R}_{+})$ given by

$\Phi(x,a,b,x',\mu,z)(B):=\dfrac{\int_{Y}\int_{\mathbb{R}_{+}}\big(\int_{B}q(x',y'|x,y,a,b)\,\nu(dy')\,\delta_{s+zC(x,y,a,b)}(ds')\big)\,\mu(dy,ds)}{\int_{Y}q^{X}(x'|x,y,a,b)\,\mu^{Y}(dy)},$   (5)

where $B\in\mathcal{B}(Y\times\mathbb{R}_{+})$ and $\mu^{Y}(dy)=\mu(dy,\mathbb{R}_{+})$ is the $Y$-marginal distribution of $\mu$. Going forward we will also use the notation $\mu^{s}(ds):=\mu(Y,ds)$ for the $s$-marginal. The updating operator is defined only when the denominator is positive. For $h_{n}=(x_{0},a_{0},b_{0},\ldots,x_{n})$ and $B\in\mathcal{B}(Y\times\mathbb{R}_{+})$ we now define the sequence of probability measures

$\mu_{0}(B|h_{0}):=(Q_{0}\times\delta_{0})(B),$
$\mu_{n+1}(B|h_{n},a,b,x'):=\Phi(x_{n},a,b,x',\mu_{n}(\cdot|h_{n}),\beta^{n})(B).$   (6)
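When $Y$ is finite and $C$ takes finitely many values, each $\mu_{n}$ has finite support and the update (5) together with the recursion (6) can be carried out exactly. A possible sketch (Python, counting-measure densities, illustrative array names from the earlier sketches) is:

```python
# mu is a dict {(y, s): probability}; q[x_next, y_next, x, y, a, b] and C[x, y, a, b]
# are the hypothetical arrays used in the earlier sketches.
def phi_update(x, a, b, x_next, mu, z, q, C):
    new_mu, norm = {}, 0.0
    for (y, s), p in mu.items():
        s_next = s + z * C[x, y, a, b]              # Dirac shift of the cost component
        for y_next in range(q.shape[1]):
            w = p * q[x_next, y_next, x, y, a, b]   # numerator of (5)
            if w > 0.0:
                new_mu[(y_next, s_next)] = new_mu.get((y_next, s_next), 0.0) + w
            norm += w                               # denominator: q^X integrated against mu^Y
    return {k: v / norm for k, v in new_mu.items()}

# mu_0 = {(y, 0.0): Q0[y] for y in range(len(Q0))}
# mu_{n+1}(.|h_n, a, b, x') = phi_update(x_n, a, b, x', mu_n, beta**n, q, C)
```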

The next theorem provides the interpretation of the above defined sequence of probability measures $(\mu_{n})$ as a sequence of conditional distributions. For that purpose we first define the sequence of random variables:

$S_{0}:=0,\qquad S_{n}:=\sum_{k=0}^{n-1}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k}),\quad n\in\mathbb{N}.$

We then have the following result which generalizes Theorem 1 of [3] to the game setting.

Theorem 1.

Suppose that $(\mu_{n})$ is given by the recursion (6). For $n\geq 0$, a given initial observable state $x\in X$ and given policies $\pi=(f_{0},f_{1},\ldots)$ and $\sigma=(g_{0},g_{1},\ldots)$ of the respective players we have

$\mathbb{P}_{x}^{\pi\sigma}\big((Y_{n},S_{n})\in B\,\big|\,X_{0},A_{0},B_{0},\ldots,X_{n}\big)=\mu_{n}(B|X_{0},A_{0},B_{0},\ldots,X_{n})\quad\mathbb{P}^{\pi\sigma}_{x}\text{-a.s., for }B\in\mathcal{B}(Y\times\mathbb{R}_{+}),$

where $\mathbb{P}_{x}^{\pi\sigma}(\cdot):=\int_{Y}\mathbb{P}^{\pi\sigma}_{xy}(\cdot)\,Q_{0}(dy)$.

Proof.

We first show that

$\mathbb{E}^{\pi\sigma}_{x}[v(X_{0},A_{0},B_{0},X_{1},\ldots,X_{n},Y_{n},S_{n})]=\mathbb{E}^{\pi\sigma}_{x}[v'(X_{0},A_{0},B_{0},X_{1},\ldots,X_{n})]$   (7)

for all bounded and measurable $v:\mathcal{H}_{n}\times Y\times\mathbb{R}_{+}\rightarrow\mathbb{R}$ and

$v'(h_{n}):=\int_{Y}\int_{\mathbb{R}_{+}}v(h_{n},y_{n},s_{n})\,\mu_{n}(dy_{n},ds_{n}|h_{n}).$

We use induction on $n$. The basis step holds, since for $n=0$ both sides reduce to $\int v(x,y,0)\,Q_{0}(dy)$. Now suppose that the statement is true for $n-1$. We simply write $f_{n},g_{n}$ in place of $f_{n}(h_{n}),g_{n}(h_{n})$. For a given observable history $h_{n-1}$, the left-hand side of (7) becomes:

$\mathbb{E}^{\pi\sigma}_{x}[v(h_{n-1},A_{n-1},B_{n-1},X_{n},Y_{n},S_{n})]=\int_{B}\int_{A}\int_{Y}\int_{\mathbb{R}_{+}}\mu_{n-1}(dy_{n-1},ds_{n-1}|h_{n-1})\times$
$\int_{Y}\int_{X}\nu(dy_{n})\,\lambda(dx_{n})\,q(x_{n},y_{n}|x_{n-1},y_{n-1},a_{n-1},b_{n-1})\times$
$\int_{\mathbb{R}_{+}}\delta_{s_{n-1}+\beta^{n-1}C(x_{n-1},y_{n-1},a_{n-1},b_{n-1})}(ds_{n})\,v(h_{n-1},a_{n-1},b_{n-1},x_{n},y_{n},s_{n})\,f_{n-1}(da_{n-1})\,g_{n-1}(db_{n-1})$
$=\int_{B}\int_{A}\int_{Y}\int_{\mathbb{R}_{+}}\mu_{n-1}(dy_{n-1},ds_{n-1}|h_{n-1})\int_{Y}\int_{X}\nu(dy_{n})\,\lambda(dx_{n})\,q(x_{n},y_{n}|x_{n-1},y_{n-1},a_{n-1},b_{n-1})\times$
$v\big(h_{n-1},a_{n-1},b_{n-1},x_{n},y_{n},s_{n-1}+\beta^{n-1}C(x_{n-1},y_{n-1},a_{n-1},b_{n-1})\big)\,f_{n-1}(da_{n-1})\,g_{n-1}(db_{n-1}).$

While the right hand side becomes:

$\mathbb{E}^{\pi\sigma}_{x}[v'(X_{0},A_{0},B_{0},X_{1},\ldots,X_{n})]$
$=\int_{B}\int_{A}\int_{Y}\int_{\mathbb{R}_{+}}\mu_{n-1}(dy_{n-1},ds_{n-1}|h_{n-1})\times$
$\int_{X}\lambda(dx_{n})\,q^{X}(x_{n}|x_{n-1},y_{n-1},a_{n-1},b_{n-1})\,v'(h_{n-1},a_{n-1},b_{n-1},x_{n})\,f_{n-1}(da_{n-1})\,g_{n-1}(db_{n-1})$
$=\int_{B}\int_{A}\int_{Y}\mu_{n-1}^{Y}(dy_{n-1}|h_{n-1})\times\int_{X}\lambda(dx_{n})\,q^{X}(x_{n}|x_{n-1},y_{n-1},a_{n-1},b_{n-1})\times$
$\int_{Y}\int_{\mathbb{R}_{+}}\mu_{n}(dy_{n},ds_{n}|h_{n})\,v(h_{n-1},a_{n-1},b_{n-1},x_{n},y_{n},s_{n})\,f_{n-1}(da_{n-1})\,g_{n-1}(db_{n-1})$
$=\int_{B}\int_{A}\int_{Y}\int_{X}\nu(dy_{n})\,\lambda(dx_{n})\int_{Y}\int_{\mathbb{R}_{+}}\mu_{n-1}(dy_{n-1},ds_{n-1}|h_{n-1})\,q(x_{n},y_{n}|x_{n-1},y_{n-1},a_{n-1},b_{n-1})\times$
$\int_{\mathbb{R}_{+}}\delta_{s_{n-1}+\beta^{n-1}C(x_{n-1},y_{n-1},a_{n-1},b_{n-1})}(ds_{n})\,v(h_{n-1},a_{n-1},b_{n-1},x_{n},y_{n},s_{n})\,f_{n-1}(da_{n-1})\,g_{n-1}(db_{n-1})$
$=\int_{B}\int_{A}\int_{Y}\int_{\mathbb{R}_{+}}\mu_{n-1}(dy_{n-1},ds_{n-1}|h_{n-1})\int_{Y}\int_{X}\nu(dy_{n})\,\lambda(dx_{n})\,q(x_{n},y_{n}|x_{n-1},y_{n-1},a_{n-1},b_{n-1})\times$
$v\big(h_{n-1},a_{n-1},b_{n-1},x_{n},y_{n},s_{n-1}+\beta^{n-1}C(x_{n-1},y_{n-1},a_{n-1},b_{n-1})\big)\,f_{n-1}(da_{n-1})\,g_{n-1}(db_{n-1}),$

where we use the recursion for $\mu_{n}$ in the third equality and Fubini's theorem to cancel the normalizing constant of $\mu_{n}$. Hence we are done by induction.
Now, in particular, if we take $v=1_{B\times C}$ with $B\in\mathcal{B}(Y\times\mathbb{R}_{+})$ and $C\subset X\times A\times B\times\cdots\times X$ a measurable set of histories up to time $n$, then we get from (7),

$\mathbb{P}_{x}^{\pi\sigma}\big((Y_{n},S_{n})\in B,(X_{0},A_{0},B_{0},\ldots,X_{n})\in C\big)=\mathbb{E}^{\pi\sigma}_{x}\big[\mu_{n}(B|X_{0},A_{0},B_{0},\ldots,X_{n})\,1_{C}(X_{0},A_{0},B_{0},\ldots,X_{n})\big].$

This establishes that $\mu_{n}(B|X_{0},A_{0},B_{0},\ldots,X_{n})$ is a conditional $\mathbb{P}^{\pi\sigma}_{x}$-distribution of $(Y_{n},S_{n})$ given the history $(X_{0},A_{0},B_{0},\ldots,X_{n})$. ∎

We now return to the optimization problem (4). Motivated by the previous result we define, for $x\in X$, $\mu\in P_{b}(Y\times\mathbb{R}_{+})$, $z\in(0,1]$ and $n=1,2,\ldots,N$:

$V_{n\pi\sigma}(x,\mu,z):=\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\Big[U\big(s+z\sum_{k=0}^{n-1}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\big)\Big]\mu(dy,ds).$   (8)

We solve our optimization problem by using a state augmentation technique. For that purpose we define, for a probability measure $\mu\in P(Y)$:

$Q^{X}(B|x,\mu,a,b):=\int_{B}\int_{Y}q^{X}(x'|x,y,a,b)\,\mu(dy)\,\lambda(dx'),\qquad B\in\mathcal{B}(X).$

We now consider the completely observable model with the new state space $E=X\times P_{b}(Y\times\mathbb{R}_{+})\times(0,1]$. The action spaces for players 1 and 2 are the same as in the partially observable model. The one-stage cost/reward is $0$ and the terminal cost/reward function is $V_{0}(x,\mu,z):=\int_{Y}\int_{\mathbb{R}_{+}}U(s)\,\mu(dy,ds)$. Since for all $\mu\in P_{b}(Y\times\mathbb{R}_{+})$ the support of $\mu$ in the $s$-component is bounded, this expectation is well defined. The transition law for the new model is given by $\tilde{Q}(\cdot|x,\mu,z,a,b)$, which for $(x,\mu,z,a,b)\in E\times A\times B$ and a Borel measurable subset $B\subset E$ is defined by

$\tilde{Q}(B|x,\mu,z,a,b):=\int_{X}1_{B}\big((x',\Phi(x,a,b,x',\mu,z),\beta z)\big)\,Q^{X}(dx'|x,\mu^{Y},a,b).$

The decision rules for player 1 in the newly defined model are measurable mappings $f:E\rightarrow P(A)$ such that $f(x,\mu,z)(A(x))=1$. Similarly, the decision rules for player 2 are measurable mappings $g:E\rightarrow P(B)$ such that $g(x,\mu,z)(B(x))=1$. We denote by $F_{1}$ the set of all decision rules for player 1 and by $F_{2}$ the corresponding set for player 2. For player 1, we denote by $\Pi_{1}^{M}$ the set of all Markov policies $\pi=(f_{0},f_{1},\ldots)$ with $f_{n}\in F_{1}$ for all $n\geq 0$; $\Pi_{2}^{M}$ denotes the analogous set for player 2.

Let $\mathcal{C}(E)=\{v:E\to\mathbb{R}\ :\ v\ \text{is continuous and}\ v\geq V_{0}\}$. Note that we consider the topology of weak convergence on $P_{b}(Y\times\mathbb{R}_{+})$. For $v\in\mathcal{C}(E)$, $(\zeta,\eta)\in P(A(x))\times P(B(x))$ and $(f,g)\in F_{1}\times F_{2}$, we consider the following operators:

$(T_{fg}v)(x,\mu,z):=\int_{B}\int_{A}\int_{X}v\big(x',\Phi(x,a,b,x',\mu,z),\beta z\big)\,Q^{X}(dx'|x,\mu^{Y},a,b)\,f(x,\mu,z)(da)\,g(x,\mu,z)(db),$
$(Lv)(x,\mu,z,\zeta,\eta):=\int_{B}\int_{A}\int_{X}v\big(x',\Phi(x,a,b,x',\mu,z),\beta z\big)\,Q^{X}(dx'|x,\mu^{Y},a,b)\,\zeta(da)\,\eta(db),$
$(Tv)(x,\mu,z):=\inf_{g}\sup_{f}(T_{fg}v)(x,\mu,z)=\inf_{\eta}\sup_{\zeta}(Lv)(x,\mu,z,\zeta,\eta),\qquad(x,\mu,z)\in E.$   (9)
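For finite action sets, evaluating $T$ at a fixed point $(x,\mu,z)$ amounts to solving a zero-sum matrix game whose $(a,b)$ entry is the inner integral in (9), and the minimax value can be obtained by linear programming. The following sketch (Python with scipy, reusing the hypothetical phi_update above and assuming $A(x)=A$, $B(x)=B$) only illustrates this structure; it is not part of the formal development.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the matrix game max_zeta min_eta zeta^T M eta, via a standard LP."""
    m, n = M.shape
    c = np.zeros(m + 1); c[-1] = -1.0                      # maximize the game value v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])              # v <= zeta^T M e_b for every b
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # zeta is a probability vector
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=np.ones(1), bounds=bounds)
    return res.x[-1]

def apply_T(v, x, mu, z, q, C, beta):
    """One application of the operator T in (9) at the point (x, mu, z)."""
    q_X = q.sum(axis=1)                                     # marginal kernel density
    mu_Y = {}
    for (y, s), p in mu.items():                            # Y-marginal of mu
        mu_Y[y] = mu_Y.get(y, 0.0) + p
    n_x, n_a, n_b = q.shape[0], q.shape[4], q.shape[5]
    G = np.zeros((n_a, n_b))
    for a in range(n_a):
        for b in range(n_b):
            for x_next in range(n_x):
                p_next = sum(q_X[x_next, x, y, a, b] * py for y, py in mu_Y.items())
                if p_next > 0.0:
                    mu_next = phi_update(x, a, b, x_next, mu, z, q, C)
                    G[a, b] += p_next * v(x_next, mu_next, beta * z)
    return matrix_game_value(G)
```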

Next we have the following theorem:

Theorem 2.
  • (a)

    Let $\pi=(f_{0},f_{1},\ldots,f_{N-1})$ and $\sigma=(g_{0},g_{1},\ldots,g_{N-1})$ be policies for players 1 and 2 respectively. Then for all $n=1,2,\ldots,N$,

    $V_{n\pi\sigma}(x,\mu,z)=\big(T_{f_{0}g_{0}}T_{f_{1}g_{1}}\cdots T_{f_{n-1}g_{n-1}}V_{0}\big)(x,\mu,z).$
  • (b)

    For all $n=1,2,\ldots,N$ let $V_{n}=\inf_{\sigma}\sup_{\pi}V_{n\pi\sigma}$. Then $V_{n}\in\mathcal{C}(E)$ and

    $V_{n}(x,\mu,z)=TV_{n-1}(x,\mu,z).$
  • (c)

    For $n=1,2,\ldots,N$ there exist measurable functions $(\gamma_{n}^{*},\delta_{n}^{*})\in F_{1}\times F_{2}$ such that

    $LV_{n-1}\big(x,\mu,z,\zeta,\delta_{n}^{*}(x,\mu,z)\big)\leq LV_{n-1}\big(x,\mu,z,\gamma_{n}^{*}(x,\mu,z),\delta_{n}^{*}(x,\mu,z)\big)\leq LV_{n-1}\big(x,\mu,z,\gamma^{*}_{n}(x,\mu,z),\eta\big),$

    for all $(\zeta,\eta)\in P(A(x))\times P(B(x))$ and $(x,\mu,z)\in E$. Then $V_{N}(\cdot,Q_{0}\times\delta_{0},1)$ is the value of the $N$-stage partially observable stochastic game, and $(\pi^{*},\sigma^{*})=(f_{n}^{*},g^{*}_{n})_{n=0,1,\ldots,N-1}$ with $f^{*}_{n}(h_{n}):=\gamma_{N-n}^{*}(x_{n},\mu_{n}(\cdot|h_{n}),\beta^{n})$ and $g^{*}_{n}(h_{n}):=\delta_{N-n}^{*}(x_{n},\mu_{n}(\cdot|h_{n}),\beta^{n})$ are optimal policies for players 1 and 2 respectively.

Proof.

(a) We establish the above iteration by induction on $n$. For $n=1$ we have,

$T_{f_{0}g_{0}}V_{0}(x,\mu,z)=\int_{B}\int_{A}\int_{X}V_{0}\big(x',\Phi(x,a,b,x',\mu,z),\beta z\big)\,Q^{X}(dx'|x,\mu^{Y},a,b)\,f_{0}(x,\mu,z)(da)\,g_{0}(x,\mu,z)(db)$
$=\int_{B}\int_{A}\int_{Y}\int_{\mathbb{R}_{+}}\int_{X}\int_{\mathbb{R}_{+}}U(s')\,\delta_{s+zC(x,y,a,b)}(ds')\,q^{X}(x'|x,y,a,b)\,\lambda(dx')\,\mu(dy,ds)\,f_{0}(x,\mu,z)(da)\,g_{0}(x,\mu,z)(db)$
$=\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\big[U\big(s+zC(x,y,A_{0},B_{0})\big)\big]\mu(dy,ds)$
$=V_{1\pi\sigma}(x,\mu,z).$

Now suppose that the statement is true for $V_{n}$. Let $\bar{\pi}=(f_{1},f_{2},\ldots)$ and $\bar{\sigma}=(g_{1},g_{2},\ldots)$ denote the 1-shifted policies. Then we get,

$(T_{f_{0}g_{0}}T_{f_{1}g_{1}}\cdots T_{f_{n}g_{n}}V_{0})(x,\mu,z)$
$=\int_{B}\int_{A}\int_{X}V_{n\bar{\pi}\bar{\sigma}}\big(x',\Phi(x,a,b,x',\mu,z),\beta z\big)\,Q^{X}(dx'|x,\mu^{Y},a,b)\,f_{0}(x,\mu,z)(da)\,g_{0}(x,\mu,z)(db)$
$=\int_{B}\int_{A}\int_{X}\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\bar{\pi}\bar{\sigma}}_{x'y'}\big[U\big(s'+z\sum_{k=0}^{n-1}\beta^{k+1}C(X_{k},Y_{k},A_{k},B_{k})\big)\big]\,\Phi(x,a,b,x',\mu,z)(dy',ds')\,Q^{X}(dx'|x,\mu^{Y},a,b)\,f_{0}(x,\mu,z)(da)\,g_{0}(x,\mu,z)(db)$
$=\int_{B}\int_{A}\int_{X}\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}\big[U\big(s'+z\sum_{k=1}^{n}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\big)\,\big|\,X_{1}=x',Y_{1}=y'\big]\times$
$\int_{Y}\int_{\mathbb{R}_{+}}q(x',y'|x,y,a,b)\,\delta_{s+zC(x,y,a,b)}(ds')\,\mu(dy,ds)\,\nu(dy')\,\lambda(dx')\,f_{0}(x,\mu,z)(da)\,g_{0}(x,\mu,z)(db)$
$=\int_{B}\int_{A}\int_{Y}\int_{Y}\int_{X}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}\big[U\big(s+zC(x,y,a,b)+z\sum_{k=1}^{n}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\big)\,\big|\,X_{1}=x',Y_{1}=y'\big]\times$
$q(x',y'|x,y,a,b)\,\mu(dy,ds)\,\nu(dy')\,\lambda(dx')\,f_{0}(x,\mu,z)(da)\,g_{0}(x,\mu,z)(db)$
$=\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\big[U\big(s+z\sum_{k=0}^{n}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\big)\big]\mu(dy,ds)$
$=V_{n+1\,\pi\sigma}(x,\mu,z).$

Hence we have the desired conclusion by induction.

(b and c) We show by induction on $n$ that
(i) $V_{n}=TV_{n-1}\in\mathcal{C}(E)$;
(ii) $T_{\gamma_{n}\delta^{*}_{n}}T_{\gamma_{n-1}\delta^{*}_{n-1}}\cdots T_{\gamma_{1}\delta^{*}_{1}}V_{0}\leq V_{n}$ for any measurable $\gamma_{1},\gamma_{2},\ldots,\gamma_{n}:E\rightarrow P(A(x))$;
(iii) $T_{\gamma^{*}_{n}\delta_{n}}T_{\gamma^{*}_{n-1}\delta_{n-1}}\cdots T_{\gamma_{1}^{*}\delta_{1}}V_{0}\geq V_{n}$ for any measurable $\delta_{1},\delta_{2},\ldots,\delta_{n}:E\rightarrow P(B(x))$;
(iv) $T_{\gamma^{*}_{n}\delta^{*}_{n}}T_{\gamma^{*}_{n-1}\delta^{*}_{n-1}}\cdots T_{\gamma^{*}_{1}\delta^{*}_{1}}V_{0}=V_{n}$.

Under our assumptions $V_{0}\in\mathcal{C}(E)$. For $n=1$, it follows from the definition that

$V_{1}=\inf_{g}\sup_{f}T_{fg}V_{0}=TV_{0}.$

Under our assumptions the existence of $(\gamma_{1}^{*},\delta_{1}^{*})$ follows from a classical measurable selection theorem [8] and Fan's minimax theorem [14]. Properties (ii)–(iv) then follow from the definition of $(\gamma_{1}^{*},\delta_{1}^{*})$. Now suppose the statement is true for $n-1$. Since $V_{n-1}\in\mathcal{C}(E)$, we again have the existence of $(\gamma^{*}_{n},\delta^{*}_{n})$. Using the induction hypothesis and the monotonicity of $T$ we obtain

$T_{\gamma_{n}\delta^{*}_{n}}T_{\gamma_{n-1}\delta^{*}_{n-1}}\cdots T_{\gamma_{1}\delta^{*}_{1}}V_{0}\leq T_{\gamma_{n}\delta_{n}^{*}}V_{n-1}\leq T_{\gamma^{*}_{n}\delta_{n}^{*}}V_{n-1}=T_{\gamma^{*}_{n}\delta^{*}_{n}}T_{\gamma^{*}_{n-1}\delta^{*}_{n-1}}\cdots T_{\gamma^{*}_{1}\delta^{*}_{1}}V_{0}$

for any measurable $\gamma_{1},\gamma_{2},\ldots,\gamma_{n}:E\rightarrow P(A(x))$. On the other hand,

$T_{\gamma^{*}_{n}\delta_{n}^{*}}V_{n-1}\leq T_{\gamma^{*}_{n}\delta_{n}}V_{n-1}\leq T_{\gamma^{*}_{n}\delta_{n}}T_{\gamma^{*}_{n-1}\delta_{n-1}}\cdots T_{\gamma_{1}^{*}\delta_{1}}V_{0}$

for any measurable $\delta_{1},\delta_{2},\ldots,\delta_{n}:E\rightarrow P(B(x))$. Thus, combining with part (a), we have shown that $\big((\gamma_{N}^{*},\gamma_{N-1}^{*},\ldots,\gamma^{*}_{1}),(\delta_{N}^{*},\delta_{N-1}^{*},\ldots,\delta^{*}_{1})\big)$ is a saddle-point equilibrium for the $N$-stage completely observable game problem with optimization criterion given by (8). The remaining conclusions now follow from the relation between the partially observable game and the completely observable game. ∎
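Theorem 2 suggests a direct, if naive, recursive computation of the $N$-stage value in small finite models: $V_{0}$ integrates $U$ against $\mu$, and each further stage applies $T$. A purely illustrative sketch (exponential in the horizon, reusing the hypothetical apply_T and phi_update from the earlier sketches):

```python
def finite_horizon_value(n, x, mu, z, U, q, C, beta):
    """V_n(x, mu, z) computed via the recursion V_n = T V_{n-1} of Theorem 2(b)."""
    if n == 0:
        return sum(p * U(s) for (_, s), p in mu.items())    # V_0(x, mu, z)
    return apply_T(lambda x2, mu2, z2: finite_horizon_value(n - 1, x2, mu2, z2, U, q, C, beta),
                   x, mu, z, q, C, beta)

# Value of the N-stage partially observable game started at x0:
# mu0 = {(y, 0.0): Q0[y] for y in range(len(Q0))}
# value = finite_horizon_value(N, x0, mu0, 1.0, U, q, C, beta)
```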

4. Infinite Horizon Problem

For the finite horizon problem we have established the existence of the value of the game and of optimal strategies. We now consider the infinite horizon case, i.e., for a given pair of policies $(\pi,\sigma)$ and $x\in X$ we are interested in the optimization problem:

$J_{\infty\pi\sigma}(x):=\int_{Y}\mathbb{E}^{\pi\sigma}_{xy}\big[U\big(\sum_{k=0}^{\infty}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\big)\big]Q_{0}(dy),\qquad x\in X.$

The upper value, lower value, optimal strategies and value of the game are defined as in Definition 2 with $N$ replaced by $\infty$. In the infinite horizon case we need to treat concave and convex utility functions separately.
Concave utility function: We first investigate the case of a concave utility function. Just as in the finite horizon case we consider the equivalent completely observable game. To that end we introduce the following notation.

$V_{\infty\pi\sigma}(x,\mu,z):=\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\big[U\big(s+z\sum_{k=0}^{\infty}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\big)\big]\mu(dy,ds),$
$V_{\infty}(x,\mu,z):=\inf_{\sigma}\sup_{\pi}V_{\infty\pi\sigma}(x,\mu,z).$   (10)

We denote

$\bar{a}(\mu,z):=\int_{\mathbb{R}_{+}}U\Big(s+\frac{z\bar{c}}{1-\beta}\Big)\mu^{s}(ds),$
$\underline{a}(\mu,z):=\int_{\mathbb{R}_{+}}U\Big(s+\frac{z\underline{c}}{1-\beta}\Big)\mu^{s}(ds),\qquad(\mu,z)\in P_{b}(Y\times\mathbb{R}_{+})\times(0,1].$
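With $\mu$ represented by a finite-support dictionary as in the earlier sketches, these bounding functions reduce to one-line sums (c_low and c_high below play the roles of $\underline{c}$ and $\bar{c}$ from Assumption 1):

```python
def a_upper(mu, z, U, c_high, beta):
    # \bar{a}(mu, z): utility when the remaining discounted cost is as large as possible
    return sum(p * U(s + z * c_high / (1 - beta)) for (_, s), p in mu.items())

def a_lower(mu, z, U, c_low, beta):
    # \underline{a}(mu, z): utility when the remaining discounted cost is as small as possible
    return sum(p * U(s + z * c_low / (1 - beta)) for (_, s), p in mu.items())
```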

Then we have the main theorem of this section:

Theorem 3.
  • (a)

    $V_{\infty}$ is the unique solution of $v=Tv$ in $\mathcal{C}(E)$ with $\underline{a}(\mu,z)\leq v(x,\mu,z)\leq\bar{a}(\mu,z)$, where $T$ is as defined in (9). Moreover, $T^{n}V_{0}\uparrow V_{\infty}$, $T^{n}\underline{a}\uparrow V_{\infty}$ and $T^{n}\bar{a}\downarrow V_{\infty}$ as $n\rightarrow\infty$.

  • (b)

    There exist measurable functions $(\gamma^{*},\delta^{*})\in F_{1}\times F_{2}$ such that

    $T_{\gamma\delta^{*}}V_{\infty}(x,\mu,z)\leq T_{\gamma^{*}\delta^{*}}V_{\infty}(x,\mu,z)\leq T_{\gamma^{*}\delta}V_{\infty}(x,\mu,z),$   (11)

    for all $(\gamma,\delta)\in F_{1}\times F_{2}$ and $(x,\mu,z)\in E$. Then $V_{\infty}(x,Q_{0}\times\delta_{0},1)$ is the value of the partially observable infinite horizon stochastic game. Moreover, $(\pi^{*},\sigma^{*})=((f_{0}^{*},f_{1}^{*},\ldots),(g_{0}^{*},g_{1}^{*},\ldots))$ with $f^{*}_{n}(h_{n}):=\gamma^{*}(x_{n},\mu_{n}(\cdot|h_{n}),\beta^{n})$ and $g^{*}_{n}(h_{n}):=\delta^{*}(x_{n},\mu_{n}(\cdot|h_{n}),\beta^{n})$ are optimal policies for players 1 and 2 respectively.

Proof.

(a) Here we first show that $V_{n}=T^{n}V_{0}\uparrow V_{\infty}$. Since $U:\mathbb{R}_{+}\rightarrow\mathbb{R}$ is increasing and concave, it satisfies the inequality

$U(s_{1}+s_{2})\leq U(s_{1})+U'_{-}(s_{1})\,s_{2},\qquad s_{1},s_{2}\geq 0,$

where $U'_{-}$ is the left-hand derivative of $U$, which exists since $U$ is concave. Furthermore, $U'_{-}(s)\geq 0$ and $U'_{-}$ is non-increasing. For $(x,\mu,z)\in E$ it holds that $V_{n}(x,\mu,z)\leq V_{\infty}(x,\mu,z)$. Using this, we get

$V_{n\pi\sigma}(x,\mu,z)\leq V_{\infty\pi\sigma}(x,\mu,z)$
$=\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\big[U\big(s+z\sum_{k=0}^{\infty}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\big)\big]\mu(dy,ds)$
$\leq\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\big[U\big(s+z\sum_{k=0}^{n-1}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\big)\big]\mu(dy,ds)$
$\quad+\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\big[U'_{-}\big(s+z\sum_{k=0}^{n-1}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\big)\,z\sum_{m=n}^{\infty}\beta^{m}C(X_{m},Y_{m},A_{m},B_{m})\big]\mu(dy,ds)$
$\leq V_{n\pi\sigma}(x,\mu,z)+\beta^{n}\frac{z\bar{c}}{1-\beta}\int_{\mathbb{R}_{+}}U'_{-}(s+z\underline{c})\,\mu^{s}(ds)$
$\leq V_{n\pi\sigma}(x,\mu,z)+\beta^{n}\frac{z\bar{c}}{1-\beta}U'_{-}(z\underline{c})=:V_{n\pi\sigma}(x,\mu,z)+\epsilon_{n}(z),$   (12)

where $\epsilon_{n}(z)$ is defined by the last equality. Clearly, $\lim_{n\rightarrow\infty}\epsilon_{n}(z)=0$. Now from (12) we get $V_{n}(x,\mu,z)\leq V_{\infty}(x,\mu,z)\leq V_{n}(x,\mu,z)+\epsilon_{n}(z)$. Taking limits we get $V_{n}=T^{n}V_{0}\uparrow V_{\infty}$. Now, $V_{n+1}=TV_{n}\leq TV_{\infty}$; taking the limit $n\to\infty$ we get $V_{\infty}\leq TV_{\infty}$. Again, $V_{\infty}\leq V_{n}+\epsilon_{n}$. Applying the operator $T$ on both sides we have $TV_{\infty}\leq T(V_{n}+\epsilon_{n})=V_{n+1}+\epsilon_{n+1}$. Thus, again letting $n\rightarrow\infty$, we obtain $TV_{\infty}\leq V_{\infty}$. Combining the two inequalities we have $V_{\infty}=TV_{\infty}$.
Next, we obtain

$(T\bar{a})(x,\mu,z)=\inf_{\eta}\sup_{\zeta}\int_{B}\int_{A}\int_{X}\int_{\mathbb{R}_{+}}U\Big(s'+\frac{z\beta\bar{c}}{1-\beta}\Big)\,\Phi(x,a,b,x',\mu,z)(Y\times ds')\,Q^{X}(dx'|x,\mu^{Y},a,b)\,\zeta(da)\,\eta(db)$
$\leq\int_{\mathbb{R}_{+}}U\Big(s+z\bar{c}+\frac{z\beta\bar{c}}{1-\beta}\Big)\mu^{s}(ds)$
$=\int_{\mathbb{R}_{+}}U\Big(s+\frac{z\bar{c}}{1-\beta}\Big)\mu^{s}(ds)=\bar{a}(\mu,z).$

By similar reasoning it can be shown that $T\underline{a}\geq\underline{a}$. Thus $T^{n}\bar{a}\downarrow$ and $T^{n}\underline{a}\uparrow$, and the limits exist. Moreover, by iteration we have,

$(T^{n}\underline{a})(x,\mu,z)=\inf_{\sigma}\sup_{\pi}\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\Big[U\Big(s+\frac{z\beta^{n}\underline{c}}{1-\beta}+z\sum_{k=0}^{n-1}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\Big)\Big]\mu(dy,ds)\geq(T^{n}V_{0})(x,\mu,z),$
$(T^{n}\bar{a})(x,\mu,z)=\inf_{\sigma}\sup_{\pi}\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\Big[U\Big(s+\frac{z\beta^{n}\bar{c}}{1-\beta}+z\sum_{k=0}^{n-1}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\Big)\Big]\mu(dy,ds).$

Using the fact that $U(s_{1}+s_{2})\leq U(s_{1})+U'_{-}(s_{1})s_{2}$, we obtain

$0\leq(T^{n}\bar{a})(x,\mu,z)-(T^{n}\underline{a})(x,\mu,z)\leq(T^{n}\bar{a})(x,\mu,z)-(T^{n}V_{0})(x,\mu,z)$
$\leq\sup_{\sigma}\sup_{\pi}\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\Big[U\Big(s+\frac{z\beta^{n}\bar{c}}{1-\beta}+z\sum_{k=0}^{n-1}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\Big)-U\Big(s+z\sum_{k=0}^{n-1}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\Big)\Big]\mu(dy,ds)$
$\leq\epsilon_{n}(z).$

Since $\lim_{n\rightarrow\infty}\epsilon_{n}=0$, we obtain $T^{n}\underline{a}\uparrow V_{\infty}$ and $T^{n}\bar{a}\downarrow V_{\infty}$ as $n\rightarrow\infty$. Also, since $V_{\infty}$ is both an increasing and a decreasing limit of continuous functions, it is continuous.
For uniqueness, suppose there is another solution $v\in\mathcal{C}(E)$ of $v=Tv$ with $\underline{a}\leq v\leq\bar{a}$. This implies that $T^{n}\underline{a}\leq v\leq T^{n}\bar{a}$ for all $n$. Taking the limit $n\rightarrow\infty$ we obtain the uniqueness of the solution.
(b) The existence of $(\gamma^{*},\delta^{*})$ again follows from the measurable selection theorem and the minimax theorem. By monotonicity and the fact that $V_{0}\leq V_{\infty}\leq\bar{a}$ we obtain $\lim_{n\rightarrow\infty}T^{n}_{\gamma^{*}\delta^{*}}V_{0}=\lim_{n\rightarrow\infty}T^{n}_{\gamma^{*}\delta^{*}}V_{\infty}=V_{\infty\pi_{\gamma^{*}}\sigma_{\delta^{*}}}$, where $\pi_{\gamma^{*}}=(\gamma^{*},\gamma^{*},\ldots)$ and $\sigma_{\delta^{*}}=(\delta^{*},\delta^{*},\ldots)$. By the definition of $(\gamma^{*},\delta^{*})$ we obtain, for any $(\gamma,\delta)\in F_{1}\times F_{2}$,

$T_{\gamma\delta^{*}}V_{\infty}\leq T_{\gamma^{*}\delta^{*}}V_{\infty}\leq T_{\gamma^{*}\delta}V_{\infty}.$

The defining property of $(\gamma^{*},\delta^{*})$ also implies that $V_{\infty}=TV_{\infty}=T_{\gamma^{*}\delta^{*}}V_{\infty}$. Hence we can also write, for any $(\gamma,\delta)\in F_{1}\times F_{2}$,

$T_{\gamma\delta^{*}}V_{\infty}\leq V_{\infty}\leq T_{\gamma^{*}\delta}V_{\infty}.$

Iterating this inequality $n$ times we get

$T_{\gamma_{1}\delta^{*}}T_{\gamma_{2}\delta^{*}}\cdots T_{\gamma_{n}\delta^{*}}V_{\infty}\leq V_{\infty}\leq T_{\gamma^{*}\delta_{1}}T_{\gamma^{*}\delta_{2}}\cdots T_{\gamma^{*}\delta_{n}}V_{\infty}$

for arbitrary $\gamma_{1},\gamma_{2},\ldots,\gamma_{n}$ and $\delta_{1},\delta_{2},\ldots,\delta_{n}$. Letting $n\rightarrow\infty$ we get,

$V_{\infty\pi\sigma_{\delta^{*}}}\leq V_{\infty\pi_{\gamma^{*}}\sigma_{\delta^{*}}}=V_{\infty}\leq V_{\infty\pi_{\gamma^{*}}\sigma}$

for all policies $\pi$ and $\sigma$. The remaining conclusions are now straightforward. ∎

Convex utility function: We now consider the case of a convex utility function $U$.

Theorem 4.

Theorem 3 also holds for convex $U$.

Proof.

For the convex case the proof is along the same lines as that of Theorem 3. The only difference is that we use a different inequality. Note that for $U:\mathbb{R}_{+}\rightarrow\mathbb{R}$ increasing and convex we have the inequality

$U(s_{1}+s_{2})\leq U(s_{1})+U'_{+}(s_{1}+s_{2})\,s_{2},\qquad s_{1},s_{2}\geq 0,$

where $U'_{+}$ is the right-hand derivative of $U$, which exists since $U$ is convex. Moreover, $U'_{+}(s)\geq 0$ and $U'_{+}$ is increasing. Thus, we obtain for $(x,\mu,z)\in E$,

$V_{n\pi\sigma}(x,\mu,z)\leq V_{\infty\pi\sigma}(x,\mu,z)$
$=\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\big[U\big(s+z\sum_{k=0}^{\infty}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\big)\big]\mu(dy,ds)$
$\leq\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\big[U\big(s+z\sum_{k=0}^{n-1}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\big)\big]\mu(dy,ds)$
$\quad+\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\big[U'_{+}\big(s+z\sum_{k=0}^{\infty}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\big)\,z\sum_{m=n}^{\infty}\beta^{m}C(X_{m},Y_{m},A_{m},B_{m})\big]\mu(dy,ds)$
$\leq V_{n\pi\sigma}(x,\mu,z)+\beta^{n}\frac{z\bar{c}}{1-\beta}\int_{\mathbb{R}_{+}}U'_{+}\Big(s+\frac{z\bar{c}}{1-\beta}\Big)\mu^{s}(ds).$

Let us denote $\delta_{n}(\mu,z):=\beta^{n}\frac{z\bar{c}}{1-\beta}\int_{\mathbb{R}_{+}}U'_{+}\big(s+\frac{z\bar{c}}{1-\beta}\big)\mu^{s}(ds)$. Then $\lim_{n\rightarrow\infty}\delta_{n}(\mu,z)=0$, and we end up with

$V_{n}(x,\mu,z)\leq V_{\infty}(x,\mu,z)\leq V_{n}(x,\mu,z)+\delta_{n}(\mu,z).$

Letting $n\rightarrow\infty$ yields $T^{n}V_{0}\rightarrow V_{\infty}$.
We use the same inequality to get,

$0\leq(T^{n}\bar{a})(x,\mu,z)-(T^{n}\underline{a})(x,\mu,z)\leq(T^{n}\bar{a})(x,\mu,z)-(T^{n}V_{0})(x,\mu,z)$
$\leq\sup_{\sigma}\sup_{\pi}\int_{Y}\int_{\mathbb{R}_{+}}\mathbb{E}^{\pi\sigma}_{xy}\Big[U\Big(s+\frac{z\beta^{n}\bar{c}}{1-\beta}+z\sum_{k=0}^{n-1}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\Big)-U\Big(s+z\sum_{k=0}^{n-1}\beta^{k}C(X_{k},Y_{k},A_{k},B_{k})\Big)\Big]\mu(dy,ds)$
$\leq\beta^{n}\frac{z\bar{c}}{1-\beta}\int_{\mathbb{R}_{+}}U'_{+}\Big(s+\frac{z\bar{c}}{1-\beta}\Big)\mu^{s}(ds)=\delta_{n}(\mu,z),$

and the right-hand side converges to 0 as $n\rightarrow\infty$. ∎
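The error terms $\epsilon_{n}(z)$ and $\delta_{n}(\mu,z)$ in the two proofs above are explicit, so they also indicate how many stages of value iteration suffice for a prescribed accuracy. A purely numerical illustration with made-up constants (not taken from the paper):

```python
import math

# Illustrative choice: beta = 0.9, c_low = 1, c_high = 2, z = 1 and U(x) = sqrt(x),
# so U'_-(s) = 1 / (2 sqrt(s)) and eps_n(z) = beta**n * z*c_high/(1-beta) * U'_-(z*c_low).
beta, c_low, c_high, z = 0.9, 1.0, 2.0, 1.0
eps = lambda n: beta**n * (z * c_high / (1 - beta)) / (2.0 * math.sqrt(z * c_low))
n = 0
while eps(n) > 1e-2:      # smallest horizon with V_n <= V_inf <= V_n + 0.01
    n += 1
print(n, eps(n))          # prints 66 and roughly 0.0096
```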

A few remarks are in order.

Remark 1.

(1) An important special case of the model considered here is when the reward/cost does not depend on the unobservable component, i.e., $C(x,y,a,b)=C'(x,a,b)$ for some function $C'(\cdot)$. In this case the accumulated reward/cost is no longer unobservable, and hence we need not estimate it. It can then be shown, along similar lines as in Proposition 1 of [3], that we can take the state space of the completely observable model to be $X\times P(Y)\times\mathbb{R}_{+}\times(0,1]$ and the updating operator to be

$\Phi(x,a,b,x',\mu,z)(B):=\dfrac{\int_{Y}\big(\int_{B}q(x',y'|x,y,a,b)\,\nu(dy')\big)\mu(dy)}{\int_{Y}q^{X}(x'|x,y,a,b)\,\mu(dy)},$

where $B$ is a Borel subset of $Y$ and $\mu\in P(Y)$.

(2) An important utility function is $U(x)=\frac{1}{\theta}e^{\theta x}$, where $\theta>0$ is a fixed parameter. In this case it is again not necessary to keep track of the accumulated cost; it suffices to consider $X\times P(Y)\times(0,1]$ as the state space of the completely observable model. Arguing as in Theorem 3 of [3], it can be shown that the value of the $N$-stage game problem is given by $J(x)=\alpha_{N}(x,Q_{0},\theta)$, where $\alpha_{0}(x,\mu,\theta z)=\frac{1}{\theta}$ and, for $n=0,1,\ldots,N-1$,

$\alpha_{n+1}(x,\mu,\theta z)=\inf_{\eta\in P(B(x))}\sup_{\zeta\in P(A(x))}\int_{B}\int_{A}\Big[\int_{X}\alpha_{n}\big(x',\Phi_{e}(x,a,b,x',\mu,z),\beta\theta z\big)\,\hat{Q}^{X}(dx'|x,\mu,a,b,\theta z)\Big]\zeta(da)\,\eta(db),$

where $(x,\mu,z)\in X\times P(Y)\times(0,1]$ and, for Borel subsets $B_{1}$ of $X$ and $B_{2}$ of $Y$,

$\hat{Q}^{X}(B_{1}|x,\mu,a,b,z)=\int_{B_{1}}\int_{Y}e^{zC(x,y,a,b)}q^{X}(x'|x,y,a,b)\,\mu(dy)\,\lambda(dx'),$
$\Phi_{e}(x,a,b,x',\mu,z)(B_{2})=\dfrac{\int_{B_{2}}\int_{Y}e^{zC(x,y,a,b)}q(x',y'|x,y,a,b)\,\mu(dy)\,\nu(dy')}{\int_{Y}\int_{Y}e^{zC(x,y,a,b)}q(x',y'|x,y,a,b)\,\mu(dy)\,\nu(dy')}.$
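For finite $Y$ (counting measures), the exponentially reweighted filter $\Phi_{e}$ and the kernel $\hat{Q}^{X}$ of this remark can be written down directly; the sketch below (with the same hypothetical arrays as in the earlier sketches) is only meant to illustrate the reweighting.

```python
import numpy as np

# mu is a length-|Y| probability vector; q and C are the arrays from the earlier sketches.
def phi_e(x, a, b, x_next, mu, z, q, C):
    weights = sum(np.exp(z * C[x, y, a, b]) * q[x_next, :, x, y, a, b] * mu[y]
                  for y in range(len(mu)))          # unnormalized measure over y_next
    return weights / weights.sum()

def Q_hat_X(x, mu, a, b, z, q, C):
    """Unnormalized weight of each next observable state under the reweighted kernel."""
    q_X = q.sum(axis=1)
    return np.array([sum(np.exp(z * C[x, y, a, b]) * q_X[x_next, x, y, a, b] * mu[y]
                         for y in range(len(mu)))
                     for x_next in range(q.shape[0])])
```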

References

  • [1] Arnab Basu and Mrinal Kanti Ghosh. Zero-sum risk-sensitive stochastic games on a countable state space. Stochastic Process. Appl., 124(1):961–983, 2014.
  • [2] Nicole Bäuerle and Ulrich Rieder. More risk-sensitive Markov decision processes. Math. Oper. Res., 39(1):105–120, 2014.
  • [3] Nicole Bäuerle and Ulrich Rieder. Partially observable risk-sensitive Markov decision processes. Math. Oper. Res., 42(4):1180–1196, 2017.
  • [4] Nicole Bäuerle and Ulrich Rieder. Zero-sum risk-sensitive stochastic games. Stochastic Process. Appl., 127(2):622–642, 2017.
  • [5] Arnab Bhabak and Subhamay Saha. Zero and non-zero sum risk-sensitive semi-Markov games. Stochastic Analysis and Applications, pages 1–18, 2021.
  • [6] Arnab Bhabak and Subhamay Saha. Risk-sensitive semi-Markov decision problems with discounted cost and general utilities. Statist. Probab. Lett., 184:Paper No. 109408, 9, 2022.
  • [7] V. S. Borkar and S. P. Meyn. Risk-sensitive optimal control for Markov decision processes with monotone cost. Math. Oper. Res., 27(1):192–209, 2002.
  • [8] L. D. Brown and R. Purves. Measurable selections of extrema. Ann. Statist., 1:902–912, 1973.
  • [9] Rolando Cavazos-Cadena and Daniel Hernández-Hernández. The vanishing discount approach in a class of zero-sum finite games with risk-sensitive average criterion. SIAM J. Control Optim., 57(1):219–240, 2019.
  • [10] Selene Chávez-Rodríguez, Rolando Cavazos-Cadena, and Hugo Cruz-Suárez. Controlled semi-Markov chains with risk-sensitive average cost criterion. J. Optim. Theory Appl., 170(2):670–686, 2016.
  • [11] Kun Jen Chung and Matthew J. Sobel. Discounted MDPs: distribution functions and exponential utility maximization. SIAM J. Control Optim., 25(1):49–62, 1987.
  • [12] G. B. Di Masi and L. Stettner. Risk sensitive control of discrete time partially observed Markov processes with infinite horizon. Stochastics Stochastics Rep., 67(3-4):309–322, 1999.
  • [13] Giovanni B. Di Masi and Łukasz Stettner. Infinite horizon risk sensitive control of discrete time Markov processes under minorization property. SIAM J. Control Optim., 46(1):231–252, 2007.
  • [14] Ky Fan. Minimax theorems. Proc. Nat. Acad. Sci. U.S.A., 39:42–47, 1953.
  • [15] M. K. Ghosh, D. McDonald, and S. Sinha. Zero-sum stochastic games with partial information. J. Optim. Theory Appl., 121(1):99–118, 2004.
  • [16] Mrinal K. Ghosh and Anindya Goswami. Partially observable semi-Markov games with discounted payoff. Stoch. Anal. Appl., 24(5):1035–1059, 2006.
  • [17] Mrinal K. Ghosh and Anindya Goswami. Partially observed semi-Markov zero-sum games with average payoff. J. Math. Anal. Appl., 345(1):26–39, 2008.
  • [18] Mrinal K. Ghosh, K. Suresh Kumar, and Chandan Pal. Zero-sum risk-sensitive stochastic games for continuous time Markov chains. Stoch. Anal. Appl., 34(5):835–851, 2016.
  • [19] Mrinal K. Ghosh and Subhamay Saha. Risk-sensitive control of continuous time Markov chains. Stochastics, 86(4):655–675, 2014.
  • [20] Xianping Guo and Zhong-Wei Liao. Risk-sensitive discounted continuous-time Markov decision processes with unbounded rates. SIAM J. Control Optim., 57(6):3857–3883, 2019.
  • [21] Xianping Guo and Junyu Zhang. Risk-sensitive continuous-time Markov decision processes with unbounded rates and Borel spaces. Discrete Event Dyn. Syst., 29(4):445–471, 2019.
  • [22] Daniel Hernández-Hernández and Steven I. Marcus. Risk sensitive control of Markov processes in countable state space. Systems Control Lett., 29(3):147–155, 1996.
  • [23] Yonghui Huang, Zhaotong Lian, and Xianping Guo. Risk-sensitive semi-Markov decision processes with general utilities and multiple criteria. Adv. in Appl. Probab., 50(3):783–804, 2018.
  • [24] Matthew R. James, John S. Baras, and Robert J. Elliott. Risk-sensitive control and dynamic games for partially observed discrete-time nonlinear systems. IEEE Trans. Automat. Control, 39(4):780–792, 1994.
  • [25] Anna Jaśkiewicz. Average optimality for risk-sensitive control with general state space. Ann. Appl. Probab., 17(2):654–675, 2007.
  • [26] Subhamay Saha. Zero-sum stochastic games with partial information and average payoff. J. Optim. Theory Appl., 160(1):344–354, 2014.
  • [27] K. Suresh Kumar and Chandan Pal. Risk-sensitive control of pure jump process on countable space with near monotone cost. Appl. Math. Optim., 68(3):311–331, 2013.
  • [28] Qingda Wei. Zero-sum games for continuous-time Markov jump processes with risk-sensitive finite-horizon cost criterion. Oper. Res. Lett., 46(1):69–75, 2018.
  • [29] Qingda Wei and Xian Chen. Risk-sensitive average continuous-time Markov decision processes with unbounded rates. Optimization, 68(4):773–800, 2019.