
Common Information Belief based Dynamic Programs for Stochastic Zero-sum Games with Competing Teams

Dhruva Kartik, Ashutosh Nayyar and Urbashi Mitra Dhruva Kartik, Ashutosh Nayyar and Urbashi Mitra are with the Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, USA. Email: {mokhasun,ashutosn,ubli}@usc.edu. This research was supported by Grant ONR N00014-15-1-2550, NSF CCF-1817200, NSF ECCS 1750041, NSF CCF-2008927, ARO W911NF1910269, Cisco Foundation 1980393, ONR 503400-78050, DOE DE-SC0021417, Swedish Research Council 2018-04359 and Okawa Foundation.
Abstract

Decentralized team problems where players have asymmetric information about the state of the underlying stochastic system have been actively studied, but games between such teams are less understood. We consider a general model of zero-sum stochastic games between two competing teams. This model subsumes many previously considered team and zero-sum game models. For this general model, we provide bounds on the upper (min-max) and lower (max-min) values of the game. Furthermore, if the upper and lower values of the game are identical (i.e., if the game has a value), our bounds coincide with the value of the game. Our bounds are obtained using two dynamic programs based on a sufficient statistic known as the common information belief (CIB). We also identify certain information structures in which only the minimizing team controls the evolution of the CIB. In these cases, we show that one of our CIB based dynamic programs can be used to find the min-max strategy (in addition to the min-max value). We propose an approximate dynamic programming approach for computing the values (and the strategy when applicable) and illustrate our results with the help of an example.

I Introduction

In decentralized team problems, players collaboratively control a stochastic system to minimize a common cost. The information used by these players to select their control actions may be different. For instance, some of the players may have more information about the system state than others [1]; or each player may have some private observations that are shared with other players with some delay [2]. Such multi-agent team problems with information asymmetry arise in a multitude of domains such as autonomous vehicles and drones, power grids, transportation networks, military and rescue operations, and wildlife conservation [3]. Over the past few years, several methods have been developed to address decentralized team problems [4, 5, 6, 7, 1]. However, games between such teams are less understood. Many of the aforementioned systems are susceptible to adversarial attacks. Therefore, the strategies used by the team of players for controlling these systems must be designed in such a way that the damage inflicted by the adversary is minimized. Such adversarial interactions can be modeled as zero-sum games between competing teams, and our main goal in this paper is to develop a framework that can be used to analyze and solve them.

The aforementioned works [4, 5, 6, 7, 1] on cooperative team problems solve them by first constructing an auxiliary single-agent Markov Decision Process (MDP). The auxiliary state (state of the auxiliary MDP) is the common information belief (CIB). CIB is the belief on the system state and all the players’ private information conditioned on the common (or public) information. Auxiliary actions (actions in the auxiliary MDP) are mappings from agents’ private information to their actions [4]. The optimal values of the team problem and this auxiliary MDP are identical. Further, an optimal strategy for the team problem can be obtained using any optimal solution of the auxiliary MDP with a simple transformation. The optimal value and strategies of this auxiliary MDP (and thus the team problem) can be characterized by dynamic programs (a.k.a. Bellman equations or recursive formulas). A key consequence of this characterization is that the CIB is a sufficient statistic for optimal control in team problems. We investigate whether a similar approach can be used to characterize values and strategies in zero-sum games between teams. This extension is not straightforward. In general games (i.e., not necessarily zero-sum), it may not be possible to obtain such dynamic programs (DPs) and/or sufficient statistics [8, 9]. However, we show that for zero-sum games between teams, the values can be characterized by CIB based DPs. Further, we show that for some specialized models, the CIB based DPs can be used to characterize a min-max strategy as well. A key implication of our result is that this CIB based approach can be used to solve several team problems considered before [4, 1, 7] even in the presence of certain types of adversaries.

A phenomenon of particular interest and importance in team problems is signaling. Players in a team can agree upon their control strategies ex ante. Based on these agreed-upon strategies, a player can often make inferences about the system state or the other players’ private information (which are otherwise inaccessible to the player). This implicit form of communication between players is referred to as signaling and can be vital for effective coordination. While signaling is beneficial in cooperative teams, it can be detrimental in the presence of an adversary. This is because the adversary can exploit it to infer sensitive private information and inflict severe damage upon the system. A concrete example that illustrates this trade-off between signaling and secrecy is discussed in Section V. Our framework can be used to optimize this trade-off in several stochastic games between teams.

Related Work on Games

Zero-sum games between two individual players with asymmetric information have been extensively studied. In [10, 11, 12, 13, 14, 15], stochastic zero-sum games with varying degrees of generality were considered and dynamic programming characterizations of the value of the game were provided. Various properties of the value functions (such as continuity) were also established and for some specialized information structures, these works also characterize a min-max strategy. Linear programs for computing the values and strategies in certain games were proposed in [16, 17]; and methods based on heuristic search value iteration (HSVI) [18] to compute the value of some games were proposed in [19, 20]. Zero-sum extensive form games in which a team of players competes against an adversary have been studied in [21, 22, 23]. Structured Nash equilibria in general games (i.e. not necessarily zero-sum) were studied in [24, 25, 26] under some assumptions on the system dynamics and players’ information structure. A combination of reinforcement learning and search was used in [27] to solve two-player zero-sum games. While this approach has very strong empirical performance, a better analytical understanding of it is needed. Our work is closely related to [28, 15] and builds on their results. Our novel contributions in this paper over past works are summarized below.

Contributions

(i) In this paper, we study a general class of stochastic zero-sum games between two competing teams of players. Our general model captures a variety of information structures, including many previously considered stochastic team [5, 7, 1] and zero-sum game models [15, 19, 20] as well as game models that have not been studied before. Our results provide a unified framework for analyzing a wide class of game models that satisfy some minimal assumptions. (ii) For our general model, we adapt the techniques in [15] to provide bounds on the upper (min-max) and lower (max-min) values of the game. These bounds provide fundamental limits on the performance achievable by either team. Furthermore, if the upper and lower values of the game are identical (i.e., if the game has a value), our bounds coincide with the value of the game. Our bounds are obtained using two dynamic programs (DPs) based on a sufficient statistic known as the common information belief (CIB). (iii) We also identify a subclass of game models in which only one of the teams (say the minimizing team) controls the evolution of the CIB. In these cases, we show that one of our CIB based dynamic programs can be used to find the min-max value as well as a min-max strategy. (This characterization of a min-max strategy is not present in [15]; a similar result for a very specific model with limited applicability exists in [28], and our result is substantially more general.) (iv) Our result reveals that the structure of the CIB based min-max strategy is similar to the structure of team optimal strategies. Such structural results have been successfully used in prior works [7, 27] to design efficient strategies for significantly challenging team problems. (v) Lastly, we discuss an approximate dynamic programming approach, along with key structural properties, for computing the values (and the strategy when applicable) and illustrate our results with the help of an example.

Notation

Random variables are denoted by upper case letters and their realizations by the corresponding lower case letters. In general, subscripts are used as time indices while superscripts are used to index decision-making agents. For time indices $t_1 \leq t_2$, $X_{t_1:t_2}$ is the short hand notation for the variables $(X_{t_1}, X_{t_1+1}, \dots, X_{t_2})$. Similarly, $X^{1:2}$ is the short hand notation for the collection of variables $(X^1, X^2)$. Operators $\mathbb{P}(\cdot)$ and $\mathbb{E}[\cdot]$ denote the probability of an event and the expectation of a random variable, respectively. For random variables/vectors $X$ and $Y$, $\mathbb{P}(\cdot \mid Y=y)$, $\mathbb{E}[X \mid Y=y]$ and $\mathbb{P}(X=x \mid Y=y)$ are denoted by $\mathbb{P}(\cdot \mid y)$, $\mathbb{E}[X \mid y]$ and $\mathbb{P}(x \mid y)$, respectively. For a strategy $g$, we use $\mathbb{P}^g(\cdot)$ (resp. $\mathbb{E}^g[\cdot]$) to indicate that the probability (resp. expectation) depends on the choice of $g$. For any finite set $\mathcal{A}$, $\Delta\mathcal{A}$ denotes the probability simplex over the set $\mathcal{A}$. For any two sets $\mathcal{A}$ and $\mathcal{B}$, $\mathcal{F}(\mathcal{A},\mathcal{B})$ denotes the set of all functions from $\mathcal{A}$ to $\mathcal{B}$. We define rand to be a mechanism that, given (i) a finite set $\mathcal{A}$, (ii) a distribution $d$ over $\mathcal{A}$, and (iii) a random variable $K$ uniformly distributed over the interval $(0,1]$, produces a random variable $X \in \mathcal{A}$ with distribution $d$, i.e.,

\displaystyle X={\textsc{rand}}(\mathcal{A},d,K)\sim d. \qquad (1)
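For concreteness, the rand mechanism in (1) can be realized by inverse-CDF sampling. The following is a minimal Python sketch; the function name rand_mechanism and the list representation of the set and distribution are our own illustrative choices.

import random

def rand_mechanism(elements, d, k):
    """Return the element of the finite set `elements` whose cumulative-probability
    interval under the distribution `d` contains k, for k drawn uniformly from (0, 1].
    This realizes X = rand(A, d, K) ~ d."""
    cumulative = 0.0
    for element, prob in zip(elements, d):
        cumulative += prob
        if k <= cumulative:
            return element
    return elements[-1]  # guard against floating-point round-off

# Example: a draw from A = {'a', 'b', 'c'} with d = (0.2, 0.5, 0.3).
k = 1.0 - random.random()            # uniform on (0, 1]
x = rand_mechanism(['a', 'b', 'c'], [0.2, 0.5, 0.3], k)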

II Problem Formulation

Consider a dynamic system with two teams. Team 1 has $N_1$ players and Team 2 has $N_2$ players. The system operates in discrete time over a horizon $T$. (With a sufficiently large planning horizon, infinite-horizon problems with discounted cost can be solved approximately as finite-horizon problems.) Let $X_t \in \mathcal{X}_t$ be the state of the system at time $t$, and let $U_t^{i,j} \in \mathcal{U}_t^{i,j}$ be the action of Player $j$, $j \in \{1,\dots,N_i\}$, in Team $i$, $i \in \{1,2\}$, at time $t$. Let

Ut1(Ut1,1,,Ut1,N1);\displaystyle U_{t}^{1}\doteq\left(U_{t}^{1,1},\dots,U_{t}^{1,N_{1}}\right); Ut2(Ut2,1,,Ut2,N2),\displaystyle U_{t}^{2}\doteq\left(U_{t}^{2,1},\dots,U_{t}^{2,N_{2}}\right),

and 𝒰ti\mathcal{U}_{t}^{i} be the set of all possible realizations of UtiU_{t}^{i}. We will refer to UtiU_{t}^{i} as Team ii’s action at time tt. The state of the system evolves in a controlled Markovian manner as

Xt+1=ft(Xt,Ut1,Ut2,Wts),\displaystyle{{X}}_{t+1}=f_{t}({{X}}_{t},{{U}}_{t}^{1},{{U}}_{t}^{2},{{W}}_{t}^{s}), (2)

where Wts{{W}}_{t}^{s} is the system noise. There is an observation process Yti,j𝒴ti,j{{Y}}_{t}^{i,j}\in{\mathcal{Y}}_{t}^{i,j} associated with each Player jj in Team ii and is given as

Yti,j=hti,j(Xt,Ut11,Ut12,Wti,j),\displaystyle{{Y}}_{t}^{i,j}=h_{t}^{i,j}({{X}}_{t},{{U}}_{t-1}^{1},{{U}}_{t-1}^{2},{{W}}_{t}^{i,j}), (3)

where Wti,j{{W}}_{t}^{i,j} is the observation noise. Let us define

Yt1(Yt1,1,,Yt1,N1);\displaystyle Y_{t}^{1}\doteq\left(Y_{t}^{1,1},\dots,Y_{t}^{1,N_{1}}\right); Yt2(Yt2,1,,Yt2,N2).\displaystyle Y_{t}^{2}\doteq\left(Y_{t}^{2,1},\dots,Y_{t}^{2,N_{2}}\right).

We assume that the sets 𝒳t{\mathcal{X}}_{t}, 𝒰ti,j{\mathcal{U}}_{t}^{i,j} and 𝒴ti,j{\mathcal{Y}}_{t}^{i,j} are finite for all i,ji,j and tt. Further, the random variables X1,Wts,Wti,j{{X}}_{1},{{W}}_{t}^{s},{{W}}_{t}^{i,j} (referred to as the primitive random variables) can take finitely many values and are mutually independent.
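To illustrate how the functional description in (2)-(3) generates the next state and observations, here is a minimal one-step simulation sketch; the function signature and the dictionaries of per-player observation functions and noise realizations are assumptions made purely for illustration.

def system_step(x_t, u1_t, u2_t, f_t, h_next, w_s, w_obs):
    """One step of the controlled dynamics (2) and observation model (3).

    x_t    : current state X_t
    u1_t   : Team 1's joint action U_t^1
    u2_t   : Team 2's joint action U_t^2
    f_t    : state transition function f_t(x, u1, u2, w)
    h_next : dict mapping (i, j) -> the observation function h_{t+1}^{i,j}
    w_s    : realization of the system noise W_t^s
    w_obs  : dict mapping (i, j) -> realization of the observation noise W_{t+1}^{i,j}
    """
    x_next = f_t(x_t, u1_t, u2_t, w_s)                      # equation (2)
    y_next = {(i, j): h(x_next, u1_t, u2_t, w_obs[(i, j)])  # equation (3), written at time t+1
              for (i, j), h in h_next.items()}
    return x_next, y_next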

Remark 1.

An alternative approach commonly used for characterizing system dynamics and observation models is to specify the transition and observation probabilities. We emphasize that this alternative characterization is equivalent to ours in equations (2) and (3) [29].

Information Structure

At time tt, Player jj in Team ii has access to a subset of all observations and actions generated so far. Let Iti,jI^{i,j}_{t} denote the collection of variables (i.e. observations and actions) available to Player jj in team ii at time tt. Then Iti,ji,j{Y1:ti,j,U1:t1i,j}{{I}}_{t}^{i,j}\subseteq\cup_{i,j}\{{{Y}}^{i,j}_{1:t},{{U}}^{i,j}_{1:t-1}\}. The set of all possible realizations of Iti,j{{I}}_{t}^{i,j} is denoted by ti,j\mathcal{I}^{i,j}_{t}. Examples of such information structures include Iti,j={Y1:ti,j,U1:t1i,j}I_{t}^{i,j}=\{Y_{1:t}^{i,j},U_{1:t-1}^{i,j}\} which corresponds to the information structure in Dec-POMDPs [5] and Iti,j={Y1:ti,j,Y1:td1:2,U1:t11:2}I_{t}^{i,j}=\{Y_{1:t}^{i,j},Y^{1:2}_{1:t-d},U_{1:t-1}^{1:2}\} wherein each player’s actions are seen by all the players and their observations become public after a delay of dd time steps.

Information Iti,j{{I}}_{t}^{i,j} can be decomposed into common and private information, i.e. Iti,j=CtPti,j{{I}}_{t}^{i,j}={{C}}_{t}\cup{{P}}_{t}^{i,j}; common information Ct{{C}}_{t} is the set of variables known to all players at time tt. The private information Pti,jP_{t}^{i,j} for Player jj in Team ii is defined as Iti,jCt{{I}}_{t}^{i,j}\setminus C_{t}. Let

Pt1(Pt1,1,,Pt1,N1);\displaystyle P_{t}^{1}\doteq\left(P_{t}^{1,1},\dots,P_{t}^{1,N_{1}}\right); Pt2(Pt2,1,,Pt2,N2).\displaystyle P_{t}^{2}\doteq\left(P_{t}^{2,1},\dots,P_{t}^{2,N_{2}}\right).

We will refer to PtiP_{t}^{i} as Team ii’s private information. Let 𝒞t\mathcal{C}_{t} be the set of all possible realizations of common information at time tt, 𝒫ti,j\mathcal{P}_{t}^{i,j} be the set of all possible realizations of private information for Player jj in Team ii at time tt and 𝒫ti\mathcal{P}_{t}^{i} be the set of all possible realizations of PtiP^{i}_{t}. We make the following assumption on the evolution of common and private information. This is similar to Assumption 1 of [24, 15].

Assumption 1.

The evolution of common and private information available to the players is as follows: (i) The common information Ct{{C}}_{t} is non-decreasing with time, i.e. CtCt+1{{C}}_{t}\subseteq{{C}}_{t+1}. Let Zt+1Ct+1Ct{{Z}}_{t+1}\doteq{{C}}_{t+1}\setminus{{C}}_{t} be the increment in common information. Thus, Ct+1={Ct,Zt+1}{{C}}_{t+1}=\{{{C}}_{t},{{Z}}_{t+1}\}. Furthermore,

Zt+1=ζt+1(Pt1:2,Ut1:2,Yt+11:2),\displaystyle{{Z}}_{t+1}=\zeta_{t+1}({{P}}_{t}^{1:2},{{U}}_{t}^{1:2},{{Y}}_{t+1}^{1:2}), (4)

where ζt+1\zeta_{t+1} is a fixed transformation. (ii) The private information evolves as

Pt+1i=ξt+1i(Pt1:2,Ut1:2,Yt+11:2),\displaystyle{{P}}^{i}_{t+1}=\xi_{t+1}^{i}({{P}}_{t}^{1:2},{{U}}_{t}^{1:2},{{Y}}_{t+1}^{1:2}), (5)

where ξt+1i\xi_{t+1}^{i} is a fixed transformation and i=1,2i=1,2.

As noted in [4, 15], a number of information structures satisfy the above assumption. Our analysis applies to any information structure that satisfies Assumption 1 including, among others, Dec-POMDPs and the delayed sharing information structure discussed above.
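As a concrete instance of Assumption 1, take the delayed sharing structure above with delay $d=1$, so that $C_t = \{Y_{1:t-1}^{1:2}, U_{1:t-1}^{1:2}\}$ and $P_t^{i,j} = \{Y_t^{i,j}\}$. The increment and private-information maps then take the explicit, strategy-independent form

Z_{t+1} = \zeta_{t+1}(P_t^{1:2}, U_t^{1:2}, Y_{t+1}^{1:2}) = \{Y_t^{1:2},\, U_t^{1:2}\}, \qquad P_{t+1}^{i} = \xi_{t+1}^{i}(P_t^{1:2}, U_t^{1:2}, Y_{t+1}^{1:2}) = \{Y_{t+1}^{i}\}, \quad i=1,2,

which satisfies both conditions of the assumption.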

Strategies and Values

Players can use any information available to them to select their actions, and we allow behavioral strategies for all players. Thus, at time $t$, Player $j$ in Team $i$ chooses a distribution $\delta U_t^{i,j}$ over its action space using a control law $g_t^{i,j} : \mathcal{I}_t^{i,j} \rightarrow \Delta\mathcal{U}_t^{i,j}$, i.e., $\delta U_t^{i,j} = g_t^{i,j}(I_t^{i,j}) = g_t^{i,j}(C_t, P_t^{i,j})$. The distribution $\delta U_t^{i,j}$ is then used to randomly generate the control action $U_t^{i,j}$ as follows. We assume that Player $j$ of Team $i$ has access to i.i.d. random variables $K_{1:T}^{i,j}$ that are uniformly distributed over the interval $(0,1]$. These uniformly distributed variables are independent of each other and of the primitive random variables. The action $U_t^{i,j}$ is generated using $K_t^{i,j}$ and the randomization mechanism described in (1), i.e.,

Uti,j=rand(𝒰ti,j,δUti,j,Kti,j).\displaystyle{{U}}_{t}^{i,j}={\textsc{rand}}(\mathcal{U}_{t}^{i,j},\delta U_{t}^{i,j},{{K}}_{t}^{i,j}). (6)

The collection of control laws used by the players in Team ii at time tt is denoted by gti(gti,1,,gti,Ni)g_{t}^{i}\doteq(g_{t}^{i,1},\dots,g_{t}^{i,N_{i}}) and is referred to as the control law of Team ii at time tt. Let the set of all possible control laws for Team ii at time tt be denoted by 𝒢ti\mathcal{G}_{t}^{i}. The collection of control laws gi(g1i,,gTi){g}^{i}\doteq(g_{1}^{i},\dots,g_{T}^{i}) is referred to as the control strategy of Team ii, and the pair of control strategies (g1,g2)({g}^{1},{g}^{2}) is referred to as a strategy profile. Let the set of all possible control strategies for Team ii be 𝒢i\mathcal{G}^{i}.

The total expected cost associated with a strategy profile (g1,g2)({g}^{1},{g}^{2}) is

J(g1,g2)𝔼(g1,g2)[t=1Tct(Xt,Ut1,Ut2)],\displaystyle J({g}^{1},{g}^{2})\doteq{\mathbb{E}}^{\left({g}^{1},{g}^{2}\right)}\left[\sum_{t=1}^{T}c_{t}({{X}}_{t},{{U}}_{t}^{1},{{U}}_{t}^{2})\right], (7)

where ct:𝒳t×𝒰t1×𝒰t2c_{t}:{\mathcal{X}}_{t}\times{\mathcal{U}}_{t}^{1}\times{\mathcal{U}}_{t}^{2}\rightarrow{\mathbb{R}} is the cost function at time tt. Team 1 wants to minimize the total expected cost, while Team 2 wants to maximize it. We refer to this zero-sum game between Team 1 and Team 2 as Game 𝒢\mathscr{G}.

Definition 1.

The upper and lower values of the game 𝒢\mathscr{G} are respectively defined as

Su(𝒢)\displaystyle S^{u}(\mathscr{G}) ming1𝒢1maxg2𝒢2J(g1,g2),\displaystyle\doteq\min_{g^{1}\in\mathcal{G}^{1}}\max_{g^{2}\in\mathcal{G}^{2}}J(g^{1},g^{2}), (8)
Sl(𝒢)\displaystyle S^{l}(\mathscr{G}) maxg2𝒢2ming1𝒢1J(g1,g2).\displaystyle\doteq\max_{g^{2}\in\mathcal{G}^{2}}\min_{g^{1}\in\mathcal{G}^{1}}J(g^{1},g^{2}). (9)

If the upper and lower values are the same, they are referred to as the value of the game and denoted by $S(\mathscr{G})$. The minimizing strategy in (8) is referred to as Team 1’s optimal strategy and the maximizing strategy in (9) is referred to as Team 2’s optimal strategy. (The strategy spaces $\mathcal{G}^1$ and $\mathcal{G}^2$ are compact and the cost $J(\cdot)$ is continuous in $g^1, g^2$; hence, the existence of optimal strategies can be established using Berge’s maximum theorem [30].)

A key objective of this work is to characterize the upper and lower values Su(𝒢)S^{u}(\mathscr{G}) and Sl(𝒢)S^{l}(\mathscr{G}) of Game 𝒢\mathscr{G}. To this end, we will define an expanded virtual game 𝒢e\mathscr{G}_{e}. This virtual game will be used to obtain bounds on the upper and lower values of the original game 𝒢\mathscr{G}. These bounds happen to be tight when the upper and lower values of game 𝒢\mathscr{G} are equal. For a sub-class of information structures, we will show that the expanded virtual game 𝒢e\mathscr{G}_{e} can be used to obtain optimal strategies for one of the teams.

Remark 2.

An alternative way of randomization is to use mixed strategies wherein a player randomly chooses a deterministic strategy at the beginning of the game and uses it for selecting its actions. According to Kuhn’s theorem, mixed and behavioral strategies are equivalent when players have perfect recall [31].

Remark 3 (Independent and Shared Randomness).

In most situations, the source of randomization is either privately known to the player (as in (6)) or publicly known to all the players in both teams. In this paper, we focus on independent randomization as in (6). In some situations, a shared source of randomness may be available to all players in Team $i$ but not to any of the players in the opposing team. Such shared randomness can help players in a team coordinate better. We believe that our approach can be extended to this case as well with some modifications.

We note that if the upper and lower values of game $\mathscr{G}$ are the same, then any pair of optimal strategies $(g^{1*}, g^{2*})$ forms a Team Nash Equilibrium (when players in a team randomize independently, Team Nash equilibria may not exist in general [32]), i.e., for every $g^1 \in \mathcal{G}^1$ and $g^2 \in \mathcal{G}^2$,

J(g1,g2)J(g1,g2)J(g1,g2).\displaystyle J({g}^{1*},{g}^{2})\leq J({g}^{1*},{g}^{2*})\leq J({g}^{1},{g}^{2*}).

In this case, J(g1,g2)J(g^{1*},g^{2*}) is the value of the game, i.e. J(g1,g2)=Sl(𝒢)=Su(𝒢)=S(𝒢).J({g}^{1*},{g}^{2*})=S^{l}(\mathscr{G})=S^{u}(\mathscr{G})=S(\mathscr{G}). Conversely, if a Team Nash Equilibrium exists, then the upper and lower values are the same [33].

III Expanded Virtual Game 𝒢e\mathscr{G}_{e}

The expanded virtual game 𝒢e\mathscr{G}_{e} is constructed using the methodology in [15]. This game involves the same underlying system model as in game 𝒢\mathscr{G}. The key distinction between games 𝒢\mathscr{G} and 𝒢e\mathscr{G}_{e} lies in the manner in which the actions used to control the system are chosen. In game 𝒢e\mathscr{G}_{e}, all the players in each team of game 𝒢\mathscr{G} are replaced by a virtual player. Thus, game 𝒢e\mathscr{G}_{e} has two virtual players, one for each team, and they operate as follows.

Prescriptions

Consider virtual player ii associated with Team ii, i=1,2i=1,2. At each time tt and for each j=1,,Nij=1,\dots,N_{i}, virtual player ii selects a function Γti,j\Gamma^{i,j}_{t} that maps private information Pti,jP^{i,j}_{t} to a distribution δUti,j\delta{{U}}_{t}^{i,j} over the space 𝒰ti,j\mathcal{U}_{t}^{i,j}. Thus, δUti,j=Γti,j(Pti,j)\delta U_{t}^{i,j}=\Gamma_{t}^{i,j}(P_{t}^{i,j}). The set of all such mappings is denoted by ti,j(𝒫ti,j,Δ𝒰ti,j)\mathcal{B}_{t}^{i,j}\doteq\mathcal{F}({\mathcal{P}}_{t}^{i,j},\Delta{\mathcal{U}}_{t}^{i,j}). We refer to the tuple Γti(Γti,1,,Γti,Ni)\Gamma_{t}^{i}\doteq(\Gamma_{t}^{i,1},\dots,\Gamma_{t}^{i,N_{i}}) of such mappings as virtual player ii’s prescription at time tt. The set of all possible prescriptions for virtual player ii at time tt is denoted by titi,1××ti,Ni\mathcal{B}_{t}^{i}\doteq\mathcal{B}_{t}^{i,1}\times\dots\times\mathcal{B}_{t}^{i,N_{i}}. Once virtual player ii selects its prescription, the action Uti,jU^{i,j}_{t} is randomly generated according to the distribution Γti,j(Pti,j)\Gamma_{t}^{i,j}({{P}}_{t}^{i,j}). More precisely,

Uti,j\displaystyle{{U}}_{t}^{i,j} =rand(𝒰ti,j,Γti,j(Pti,j),Kti,j),\displaystyle={\textsc{rand}}({\mathcal{U}}_{t}^{i,j},\Gamma_{t}^{i,j}({{P}}_{t}^{i,j}),{{K}}_{t}^{i,j}), (10)

where the random variable Kti,jK^{i,j}_{t} and the mechanism rand are the same as in equation (6).
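To make the prescription mechanism concrete, the sketch below represents a single player’s prescription as a lookup table from private-information realizations to action distributions and generates the action via the rand mechanism as in (10); the dictionary representation, the helper name sample, and the realizations 'p0', 'p1', 'u0', 'u1' are purely illustrative assumptions.

import random

def sample(elements, dist, k):
    """The rand mechanism of (1): inverse-CDF sampling with k uniform on (0, 1]."""
    c = 0.0
    for element, prob in zip(elements, dist):
        c += prob
        if k <= c:
            return element
    return elements[-1]

# A prescription for one player: one action distribution per private-information value.
actions = ['u0', 'u1']
prescription = {
    'p0': [0.9, 0.1],   # Gamma_t^{i,j}('p0'): distribution over actions
    'p1': [0.3, 0.7],   # Gamma_t^{i,j}('p1')
}

p_realized = 'p1'                    # the player's realized private information P_t^{i,j}
k_private = 1.0 - random.random()    # the player's private randomness K_t^{i,j}, uniform on (0, 1]
u = sample(actions, prescription[p_realized], k_private)   # equation (10)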

Strategies

The virtual players in game 𝒢e\mathscr{G}_{e} have access to the common information Ct{{C}}_{t} and all the past prescriptions of both players, i.e., Γ1:t11:2\Gamma_{1:t-1}^{1:2}. Virtual player ii selects its prescription at time tt using a control law χ~ti\tilde{\chi}_{t}^{i}, i.e., Γti=χ~ti(Ct,Γ1:t11:2).\Gamma_{t}^{i}=\tilde{\chi}_{t}^{i}({{C}}_{t},\Gamma_{1:t-1}^{1:2}). Let ~ti\tilde{\mathcal{H}}_{t}^{i} be the set of all such control laws at time tt and ~i~1i××~Ti\tilde{\mathcal{H}}^{i}\doteq\tilde{\mathcal{H}}_{1}^{i}\times\dots\times\tilde{\mathcal{H}}_{T}^{i} be the set of all control strategies for virtual player ii. The total cost for a strategy profile (χ~1,χ~2)(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2}) is

𝒥(χ~1,χ~2)=𝔼(χ~1,χ~2)[t=1Tct(Xt,Ut1,Ut2)].\displaystyle{\mathcal{J}}(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2})={\mathbb{E}}^{(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2})}\left[\sum_{t=1}^{T}c_{t}({{X}}_{t},{{U}}_{t}^{1},{{U}}_{t}^{2})\right]. (11)

The upper and lower values in 𝒢e\mathscr{G}_{e} are defined as

Su(𝒢e)\displaystyle S^{u}(\mathscr{G}_{e}) minχ~1~1maxχ~2~2𝒥(χ~1,χ~2)\displaystyle\doteq\min_{\tilde{\chi}^{1}\in\tilde{\mathcal{H}}^{1}}\max_{\tilde{\chi}^{2}\in\tilde{\mathcal{H}}^{2}}{\mathcal{J}}(\tilde{\chi}^{1},\tilde{\chi}^{2})
Sl(𝒢e)\displaystyle S^{l}(\mathscr{G}_{e}) maxχ~2~2minχ~1~1𝒥(χ~1,χ~2).\displaystyle\doteq\max_{\tilde{\chi}^{2}\in\tilde{\mathcal{H}}^{2}}\min_{\tilde{\chi}^{1}\in\tilde{\mathcal{H}}^{1}}{\mathcal{J}}(\tilde{\chi}^{1},\tilde{\chi}^{2}).

The following theorem establishes the relationship between the upper and lower values of the expanded game 𝒢e\mathscr{G}_{e} and the original game 𝒢\mathscr{G}. This result is analogous to Theorem 1 from [15].

Theorem 1 (Proof in App. A).

The lower and upper values of the two games described above satisfy the following: Sl(𝒢)Sl(𝒢e)Su(𝒢e)Su(𝒢).S^{l}(\mathscr{G})\leq S^{l}(\mathscr{G}_{e})\leq S^{u}(\mathscr{G}_{e})\leq S^{u}(\mathscr{G}). Further, all these inequalities become equalities when a Team Nash equilibrium exists in Game 𝒢\mathscr{G}.

III-A The Dynamic Programming Characterization

We describe a methodology for finding the upper and lower values of the expanded game 𝒢e\mathscr{G}_{e} in this subsection. The results (and their proofs) in this subsection are similar to those in Section 4.2 of [15]. However, the prescription spaces ti\mathcal{B}_{t}^{i} in this paper are different (and more general) from those in [15], and thus our results in this paper are more general. Our dynamic program is based on a sufficient statistic for virtual players in game 𝒢e\mathscr{G}_{e} called the common information belief (CIB).

Definition 2.

At time tt, the common information belief (CIB), denoted by Πt\Pi_{t}, is defined as the virtual players’ belief on the state and private information based on their information in game 𝒢e\mathscr{G}_{e}. Thus, for each xt𝒳t,pt1𝒫t1x_{t}\in{\mathcal{X}}_{t},p_{t}^{1}\in{\mathcal{P}}_{t}^{1} and pt2𝒫t2,p_{t}^{2}\in{\mathcal{P}}_{t}^{2}, we have

Πt(xt,pt1:2)[Xt=xt,Pt1:2=pt1:2Ct,Γ1:t11:2].\displaystyle\Pi_{t}({x}_{t},{p}_{t}^{1:2})\doteq{\mathbb{P}}\left[{{X}}_{t}={x}_{t},{{P}}_{t}^{1:2}={p}_{t}^{1:2}\mid{{C}}_{t},\Gamma_{1:t-1}^{1:2}\right].

The belief Πt\Pi_{t} takes values in the set 𝒮tΔ(𝒳t×𝒫t1×𝒫t2)\mathcal{S}_{t}\doteq\Delta(\mathcal{X}_{t}\times\mathcal{P}^{1}_{t}\times\mathcal{P}^{2}_{t}).

The following lemma describes an update rule that can be used to compute the CIB.

Lemma 1 (Proof in App. B).

For any strategy profile (χ~1,χ~2)(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2}) in Game 𝒢e\mathscr{G}_{e}, the common information based belief Πt\Pi_{t} evolves almost surely as

Πt+1=Ft(Πt,Γt1:2,Zt+1),\displaystyle\Pi_{t+1}=F_{t}(\Pi_{t},\Gamma_{t}^{1:2},{Z}_{t+1}), (12)

where FtF_{t} is a fixed transformation that does not depend on the virtual players’ strategies. Further, the total expected cost can be expressed as

𝒥(χ~1,χ~2)=𝔼(χ~1,χ~2)[t=1Tc~t(Πt,Γt1,Γt2)],\displaystyle{\mathcal{J}}(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2})={\mathbb{E}}^{(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2})}\left[\sum_{t=1}^{T}\tilde{c}_{t}(\Pi_{t},\Gamma_{t}^{1},\Gamma_{t}^{2})\right], (13)

where the function c~t\tilde{c}_{t} is as defined in equation (58) in Appendix B.
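The exact form of the transformation $F_t$ is given in Appendix B. For intuition, the sketch below implements a generic Bayes update consistent with the model of Section II, assuming a joint transition-observation kernel as in Remark 1; the team-level representation of prescriptions (each team’s per-player maps combined into one product distribution over the team action) and all function names are our own simplifications, not the paper’s notation.

from collections import defaultdict

def cib_update(pi_t, gamma1, gamma2, z_next, kernel, zeta, xi1, xi2):
    """Bayes update of the common information belief, a sketch of the map F_t in (12).

    pi_t     : dict mapping (x, p1, p2) -> probability (the current CIB Pi_t)
    gamma1   : p1 -> dict {u1: prob}; Team 1's prescription, per-player maps combined
    gamma2   : p2 -> dict {u2: prob}; Team 2's prescription, combined likewise
    z_next   : the realized common-information increment Z_{t+1}
    kernel   : (x, u1, u2) -> dict {(x_next, y_next): prob}; joint transition-observation
               kernel, with y_next the joint observation of all players
    zeta     : (p1, p2, u1, u2, y_next) -> common-information increment, as in (4)
    xi1, xi2 : (p1, p2, u1, u2, y_next) -> next private information of each team, as in (5)
    """
    unnormalized = defaultdict(float)
    for (x, p1, p2), prob in pi_t.items():
        for u1, a1 in gamma1(p1).items():
            for u2, a2 in gamma2(p2).items():
                for (x_next, y_next), q in kernel(x, u1, u2).items():
                    if zeta(p1, p2, u1, u2, y_next) != z_next:
                        continue  # inconsistent with the observed increment
                    key = (x_next,
                           xi1(p1, p2, u1, u2, y_next),
                           xi2(p1, p2, u1, u2, y_next))
                    unnormalized[key] += prob * a1 * a2 * q
    total = sum(unnormalized.values())
    if total == 0.0:
        return dict(pi_t)  # zero-probability increment; keep the old belief (a convention)
    return {key: value / total for key, value in unnormalized.items()}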

Values in Game 𝒢e\mathscr{G}_{e}

We now describe two dynamic programs, one for each virtual player in 𝒢e\mathscr{G}_{e}. The minimizing virtual player (virtual player 1) in game 𝒢e\mathscr{G}_{e} solves the following dynamic program. Define VT+1u(πT+1)=0V^{u}_{T+1}(\pi_{T+1})=0 for every πT+1\pi_{T+1}. In a backward inductive manner, at each time tTt\leq T and for each possible common information belief πt\pi_{t} and prescriptions γt1,γt2\gamma^{1}_{t},\gamma^{2}_{t}, define the upper cost-to-go function wtuw_{t}^{u} and the upper value function VtuV_{t}^{u} as

wtu(πt,γt1,γt2)\displaystyle w^{u}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}) (14)
c~t(πt,γt1,γt2)+𝔼[Vt+1u(Ft(πt,γt1:2,Zt+1))πt,γt1:2],\displaystyle\doteq\tilde{c}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2})+{\mathbb{E}}[V^{u}_{t+1}(F_{t}(\pi_{t},\gamma_{t}^{1:2},{{Z}}_{t+1}))\mid\pi_{t},\gamma_{t}^{1:2}],
Vtu(πt)minγt1maxγt2wtu(πt,γt1,γt2).\displaystyle V_{t}^{u}(\pi_{t})\doteq\min_{{\gamma}_{t}^{1}}\max_{\gamma_{t}^{2}}w_{t}^{u}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}). (15)

The maximizing virtual player (virtual player 2) solves an analogous max-min dynamic program with a lower cost-to-go function wtlw_{t}^{l} and lower value function VtlV_{t}^{l} (See App. C for details).
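The recursion (14)-(15) is over a continuum of beliefs and prescriptions. For intuition, the sketch below illustrates a discretized version that stores the upper value on a finite grid of CIB points and searches over finite candidate prescription sets (related to, but not identical with, the approximate DP approach of Appendix H). All helper routines and their signatures are assumptions made for illustration, and the time dependence of the model is suppressed for brevity.

def backward_induction(T, belief_grid, cand1, cand2, stage_cost, increments,
                       increment_prob, cib_update, project):
    """Discretized illustration of the recursion (14)-(15).

    belief_grid    : list of hashable CIB points pi at which values are stored
    cand1, cand2   : finite candidate prescription sets for virtual players 1 and 2
    stage_cost     : (pi, g1, g2) -> expected stage cost c_tilde(pi, g1, g2)
    increments     : list of possible common-information increments z
    increment_prob : (pi, g1, g2, z) -> P(Z_{t+1} = z | pi, g1, g2)
    cib_update     : (pi, g1, g2, z) -> next CIB, i.e. F_t(pi, g1, g2, z)
    project        : maps an arbitrary CIB to the nearest point of belief_grid
    Returns a list V with V[t][pi] approximating the upper value V_t^u(pi).
    """
    V = [dict() for _ in range(T + 2)]
    V[T + 1] = {pi: 0.0 for pi in belief_grid}          # V_{T+1} = 0
    for t in range(T, 0, -1):
        for pi in belief_grid:
            def w(g1, g2):
                # cost-to-go w_t^u(pi, g1, g2) of equation (14)
                future = sum(increment_prob(pi, g1, g2, z)
                             * V[t + 1][project(cib_update(pi, g1, g2, z))]
                             for z in increments)
                return stage_cost(pi, g1, g2) + future
            # equation (15): min over virtual player 1, max over virtual player 2
            V[t][pi] = min(max(w(g1, g2) for g2 in cand2) for g1 in cand1)
    return V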

Lemma 2 (Proof in App. C).

For each tt, there exists a measurable mapping Ξt1:𝒮tt1\Xi^{1}_{t}:\mathcal{S}_{t}\rightarrow\mathcal{B}_{t}^{1} such that Vtu(πt)=maxγt2wtu(πt,Ξt1(πt),γt2)V_{t}^{u}(\pi_{t})=\max_{\gamma_{t}^{2}}w_{t}^{u}(\pi_{t},\Xi_{t}^{1}(\pi_{t}),\gamma_{t}^{2}). Similarly, there exists a measurable mapping Ξt2:𝒮tt2\Xi^{2}_{t}:\mathcal{S}_{t}\rightarrow\mathcal{B}_{t}^{2} such that Vtl(πt)=minγt1wtl(πt,γt1,Ξt2(πt))V_{t}^{l}(\pi_{t})=\min_{\gamma_{t}^{1}}w_{t}^{l}(\pi_{t},\gamma_{t}^{1},\Xi_{t}^{2}(\pi_{t})).

Theorem 2 (Proof in App. D).

The upper and lower values of the expanded virtual game 𝒢e\mathscr{G}_{e} are given by Su(𝒢e)=𝔼[V1u(Π1)]S^{u}(\mathscr{G}_{e})={\mathbb{E}}[V_{1}^{u}(\Pi_{1})] and Sl(𝒢e)=𝔼[V1l(Π1)]S^{l}(\mathscr{G}_{e})={\mathbb{E}}[V_{1}^{l}(\Pi_{1})].

Theorem 2 gives us a dynamic programming characterization of the upper and lower values of the expanded game. As mentioned in Theorem 1, the upper and lower values of the expanded game provide bounds on the corresponding values of the original game. If the original game has a Team Nash equilibrium, then the dynamic programs described above characterize the value of the game.

Optimal Strategies in Game 𝒢e\mathscr{G}_{e}

The mappings Ξ1\Xi^{1} and Ξ2\Xi^{2} obtained from the dynamic programs described above (see Lemma 2) can be used to construct optimal strategies for both virtual players in game 𝒢e\mathscr{G}_{e} in the following manner.

Definition 3.

Define strategies χ~1\tilde{\chi}^{1*} and χ~2\tilde{\chi}^{2*} for virtual players 1 and 2 respectively as follows: for each instance of common information ctc_{t} and prescription history γ1:t11:2\gamma_{1:t-1}^{1:2}, let

\displaystyle\tilde{\chi}^{1*}_{t}(c_{t},\gamma_{1:t-1}^{1:2})\doteq\Xi_{t}^{1}(\pi_{t}); \qquad \tilde{\chi}^{2*}_{t}(c_{t},\gamma_{1:t-1}^{1:2})\doteq\Xi_{t}^{2}(\pi_{t}),

where Ξt1\Xi_{t}^{1} and Ξt2\Xi_{t}^{2} are the mappings defined in Lemma 2 and πt\pi_{t} (which is a function of ct,γ1:t11:2c_{t},\gamma_{1:t-1}^{1:2}) is obtained in a forward inductive manner using the update rule FtF_{t} defined in Lemma 1.

Theorem 3 (Proof in App. D).

The strategies χ~1\tilde{\chi}^{1*} and χ~2\tilde{\chi}^{2*} as defined in Definition 3 are, respectively, min-max and max-min strategies in the expanded virtual game 𝒢e\mathscr{G}_{e}.

IV Only Virtual Player 1 controls the CIB

In this section, we consider a special class of instances of Game $\mathscr{G}$ and show that the dynamic program in (15) can be used to obtain a min-max strategy for Team 1, the minimizing team in game $\mathscr{G}$. The key property of the information structures considered in this section is that the common information belief $\Pi_t$ is controlled only by virtual player 1 in the corresponding expanded game $\mathscr{G}_e$. (Note that the players in Team 2 might still be able to control the state dynamics through their actions.) This is formally stated in the following assumption.

Assumption 2.

For any strategy profile (χ~1,χ~2)(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2}) in Game 𝒢e\mathscr{G}_{e}, the CIB Πt\Pi_{t} evolves almost surely as

Πt+1=Ft(Πt,Γt1,Zt+1),\displaystyle\Pi_{t+1}=F_{t}(\Pi_{t},\Gamma_{t}^{1},{Z}_{t+1}), (16)

where FtF_{t} is a fixed transformation that does not depend on the virtual players’ strategies.

We will now describe some instances of Game 𝒢\mathscr{G} that satisfy Assumption 2. We note that two-player zero-sum games that satisfy a property similar to Assumption 2 were studied in [13].

IV-A Game Models Satisfying Assumption 2

All players in Team 2 have the same information

Consider an instance of game 𝒢\mathscr{G} in which every player jj in Team 2 has the following information structure It2,j={Y1:t2,U1:t12}I_{t}^{2,j}=\left\{Y_{1:t}^{2},U_{1:t-1}^{2}\right\}. Further, Team 2’s information is known to every player in Team 1. Thus, the common information Ct=It2,jC_{t}=I_{t}^{2,j}. Under this condition, players in Team 2 do not have any private information. Thus, their private information Pt2=P_{t}^{2}=\varnothing. Any information structure satisfying the above conditions satisfies Assumption 2, see Appendix F-A for a proof. Since Team 1’s information structure is relatively unrestricted, the above model subsumes many previously considered team and game models. Notable examples of such models include: (i) all purely cooperative team problems in [4, 34, 1, 7], and (ii) two-player zero-sum game models where one agent is more informed than the other [16, 13, 19].

Team 2’s observations become common information with one-step delay

Consider an instance of game $\mathscr{G}$ where the current private information of Team 2 becomes common information in the very next time step. More specifically, we have $C_{t+1} \supseteq \{Y_{1:t}^2, U_{1:t}^2\}$ and, for each Player $j$ in Team 2, $P_t^{2,j} = Y_t^{2,j}$. Note that unlike in [16, 19], players in Team 2 have some private information in this model. Any information structure that satisfies the above conditions satisfies Assumption 2; see Appendix F-B for a proof.

Games with symmetric information

Consider the information structure where $I_t^{i,j} = \cup_{i,j}\{Y_{1:t}^{i,j}, U_{1:t-1}^{i,j}\}$ for every $i,j$. All the players in this game have the same information and thus, players do not have any private information. Note that this model subsumes perfect-information games. It can be shown that this model satisfies Assumption 2 using the same arguments as in Appendix F-A. In this case, the CIB is controlled by neither virtual player, and thus the dynamic programs can be used to obtain both min-max and max-min strategies.

In addition to the models discussed above, there are other instances of 𝒢\mathscr{G} that satisfy Assumption 2. These are included in Appendix E.

IV-B Min-max Value and Strategy in Game 𝒢\mathscr{G}

Dynamic Program

Since we are considering special cases of Game $\mathscr{G}$, we can use the analysis in Section III to write the min-max dynamic program for virtual player 1. Because of Assumption 2, the belief update $F_t(\pi_t,\gamma_t^{1:2},z_{t+1})$ in (14) is replaced by $F_t(\pi_t,\gamma_t^1,z_{t+1})$. Using Theorems 2 and 3, we can conclude that the upper value of the expanded game is $S^u(\mathscr{G}_e) = \mathbb{E}[V_1^u(\Pi_1)]$ and that the strategy $\tilde{\chi}^{1*}$ obtained from the DP is a min-max strategy for virtual player 1 in Game $\mathscr{G}_e$. An approximate dynamic programming based approach for solving the dynamic programs is discussed in Appendix H. This discussion includes certain structural properties of the value functions that make their computation significantly more tractable.

Min-max Value and Strategy

The following results provide a characterization of the min-max value Su(𝒢)S^{u}(\mathscr{G}) and a min-max strategy g1g^{1*} in game 𝒢\mathscr{G} under Assumption 2. Note that unlike the inequality in Theorem 1, the upper values of games 𝒢\mathscr{G} and 𝒢e\mathscr{G}_{e} are always equal in this case.

Theorem 4 (Proof in App. G).

Under Assumption 2, we have Su(𝒢)=Su(𝒢e)=𝔼[V1u(Π1)].S^{u}(\mathscr{G})=S^{u}(\mathscr{G}_{e})={\mathbb{E}}[V_{1}^{u}(\Pi_{1})].

Algorithm 1 Strategy g1,jg^{1,j*} for Player jj in Team 1
  Input: Ξt1(π)\Xi^{1}_{t}(\pi) obtained from DP for all tt and all π\pi
  for t=1t=1 to TT do
     Current information: Ct,Pt1,jC_{t},P_{t}^{1,j} {where Ct={Ct1,Zt}C_{t}=\{C_{t-1},Z_{t}\}}
     Update CIB Πt=Ft1(Πt1,Ξt11(Πt1),Zt)\Pi_{t}=F_{t-1}(\Pi_{t-1},\Xi_{t-1}^{1}(\Pi_{t-1}),Z_{t}) {If t=1t=1, Initialize CIB Πt\Pi_{t} using CtC_{t}}
     Get prescription Γt1=(Γt1,1,,Γt1,N1)=Ξt1(Πt)\Gamma_{t}^{1}=(\Gamma_{t}^{1,1},\dots,\Gamma_{t}^{1,N_{1}})=\Xi_{t}^{1}(\Pi_{t})
     Get distribution δUt1,j=Γt1,j(Pt1,j)\delta U_{t}^{1,j}=\Gamma_{t}^{1,j}(P_{t}^{1,j}) and select action Ut1,j=rand(𝒰t1,j,δUt1,j,Kt1,j)U_{t}^{1,j}={\textsc{rand}}(\mathcal{U}_{t}^{1,j},\delta U_{t}^{1,j},{{K}}_{t}^{1,j})
  end for
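A minimal Python rendering of Algorithm 1 is sketched below. The routines xi1 (the DP map $\Xi_t^1$), cib_update (the update $F_t$ under Assumption 2), initial_cib, get_new_info, and action_set are assumed to be supplied by the environment and the solved DP; their names and signatures are illustrative, not part of the paper.

import random

def sample(elements, dist, k):
    """The rand mechanism of (1) with k uniform on (0, 1]."""
    c = 0.0
    for element, prob in zip(elements, dist):
        c += prob
        if k <= c:
            return element
    return elements[-1]

def run_player(T, j, xi1, cib_update, initial_cib, get_new_info, action_set):
    """Execute the strategy g^{1,j*} of Algorithm 1 for Player j of Team 1.

    xi1(t, pi)              : the DP map Xi_t^1; returns a team prescription gamma,
                              where gamma[j] maps private information to a distribution
    cib_update(t, pi, g, z) : the CIB update F_t(pi, g, z) under Assumption 2
    initial_cib(z1)         : initializes Pi_1 from the initial common information
    get_new_info(t)         : returns (z_t, p_tj): the common-information increment Z_t
                              and the player's current private information P_t^{1,j}
    action_set(t, j)        : the finite action set U_t^{1,j} (as an ordered list)
    """
    pi, gamma_prev = None, None
    for t in range(1, T + 1):
        z_t, p_tj = get_new_info(t)
        if t == 1:
            pi = initial_cib(z_t)                        # initialize the CIB from C_1
        else:
            pi = cib_update(t - 1, pi, gamma_prev, z_t)  # Pi_t = F_{t-1}(Pi_{t-1}, Xi_{t-1}^1(Pi_{t-1}), Z_t)
        gamma = xi1(t, pi)                               # Gamma_t^1 = Xi_t^1(Pi_t)
        delta_u = gamma[j](p_tj)                         # delta U_t^{1,j} = Gamma_t^{1,j}(P_t^{1,j})
        u = sample(action_set(t, j), delta_u,
                   1.0 - random.random())                # U_t^{1,j} = rand(U_t^{1,j}, delta U, K_t^{1,j})
        gamma_prev = gamma
        yield u                                          # play U_t^{1,j}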

Theorem 5 (Proof in App. G).

Under Assumption 2, the strategy g1{g}^{1*} defined in Algorithm 1 is a min-max strategy for Team 1 in the original game 𝒢\mathscr{G}.

V A Special Case and Numerical Experiments

Consider an instance of Game 𝒢\mathscr{G} in which Team 1 has two players and Team 2 has only one player. At each time tt, Player 1 in Team 1 observes the state perfectly, i.e. Yt1,1=XtY_{t}^{1,1}=X_{t}, but the player in Team 2 gets an imperfect observation Yt2Y^{2}_{t} defined as in (3). Player 1 has complete information: at each time tt, it knows the entire state, observation and action histories of all the players. The player in Team 2 has partial information: at each time tt, it knows its observation history Y1:t2Y^{2}_{1:t} and action histories of all the players. Player 2 in Team 1 has the same information as that of the player in Team 2. Thus, the total information available to each player at tt is as follows:

It1,1={X1:t,Y1:t2,U1:t11:2};\displaystyle{{I}}_{t}^{1,1}=\{{{X}}_{1:t},{{Y}}_{1:t}^{2},{{U}}_{1:t-1}^{1:2}\}; It1,2=It2={Y1:t2,U1:t11:2}.\displaystyle I_{t}^{1,2}={{I}}_{t}^{2}=\{{{Y}}_{1:t}^{2},{{U}}_{1:t-1}^{1:2}\}.

Clearly, It2It1,1{{I}}_{t}^{2}\subseteq{{I}}_{t}^{1,1}. The common and private information for this game can be written as follows: Ct=It2{{C}}_{t}={{I}}_{t}^{2}, Pt1,1={X1:t}{{P}}_{t}^{1,1}=\{{{X}}_{1:t}\} and Pt1,2=Pt2=P_{t}^{1,2}={{P}}_{t}^{2}=\varnothing. The increment in common information at time tt is Zt={Yt2,Ut11:2}{{Z}}_{t}=\{{{Y}}_{t}^{2},{{U}}_{t-1}^{1:2}\}. In the game described above, the private information in Pt1,1P_{t}^{1,1} includes the entire state history. However, Player 1 in Team 1 can ignore the past states X1:t1X_{1:t-1} without loss of optimality.

Lemma 3 (Proof in App. I).

There exists a min-max strategy g1g^{1*} such that the control law gt1,1g^{1,1*}_{t} at time tt uses only Xt{{X}}_{t} and It2{{I}}_{t}^{2} to select δUt1,1\delta U_{t}^{1,1}, i.e., δUt1,1=gt1,1(Xt,It2).\delta U^{1,1}_{t}=g^{1,1*}_{t}(X_{t},I^{2}_{t}).

The lemma above implies that, for the purpose of characterizing the value of the game and a min-max strategy for Team 1, we can restrict Player 1’s information structure to be $I_t^{1,1} = \{X_t, I_t^2\}$. Thus, the common and private information become: $C_t = I_t^2$, $P_t^{1,1} = \{X_t\}$ and $P_t^{1,2} = P_t^2 = \varnothing$. We refer to this game with reduced private information as Game $\mathscr{H}$. The corresponding expanded virtual game is denoted by $\mathscr{H}_e$. A general methodology for reducing private information in decentralized team and game problems can be found in [35]. The information structure in $\mathscr{H}$ is a special case of the first information structure in Section IV-A, and thus satisfies Assumption 2. Therefore, using the dynamic program in Section IV-B, we can obtain the value function $V_1^u$ and the min-max strategy $g^{1*}$.

Numerical Experiments

We consider a particular example of game \mathscr{H} described above. In this example, there are two entities (ll and rr) that can potentially be attacked and at any given time, exactly one of the entities is vulnerable. Player 1 of Team 1 knows which of the two entities is vulnerable whereas all the other players do not have this information. Player 2 of Team 1 can choose to defend one of the entities. The attacker in Team 2 can either launch a blanket attack on both entities or launch a targeted attack on one of the entities. When the attacker launches a blanket attack, the damage incurred by the system is minimal if Player 2 in Team 1 happens to be defending the vulnerable entity and the damage is substantial otherwise. When the attacker launches a targeted attack on the vulnerable entity, the damage is substantial irrespective of the defender’s position. But if the attacker targets the invulnerable entity, the attacker becomes passive and cannot attack for some time. Thus, launching a targeted attack involves high risk for the attacker. The state of the attacker (active or passive) and all the players’ actions are public information. The system state XtX_{t} thus has two components, the hidden state (ll or rr) and the state of the attacker (aa or pp). For convenience, we will denote the states (l,a)(l,a) and (r,a)(r,a) with 0 and 1 respectively.

The only role of Player 1 in Team 1 in this game is to signal the hidden state using two available actions α\alpha and β\beta. The main challenge is that both the defender and the attacker can see Player 1’s actions. Player 1 needs to signal the hidden state to some extent so that its teammate’s defense is effective under blanket attacks. However, if too much information is revealed, the attacker can exploit it to launch a targeted attack and cause significant damage. In this example, the key is to design a strategy that can balance between these two contrasting goals of signaling and secrecy. A precise description of this model is provided in Appendix J.

Figure 1: (a) An estimate of the value function $V_1^u(\cdot)$. (b) Prescriptions at $t=1$ for Player 1 in Team 1. In these plots, the $x$-axis represents $\pi_1(0)$ and we restrict our attention to those beliefs where $\pi_1(0)+\pi_1(1)=1$, i.e. when the attacker is active. In Figure 1(b), the blue and red curves respectively depict the Bernoulli probabilities associated with the distributions $\gamma_1^{1,1}(0)$ and $\gamma_1^{1,1}(1)$, where $\gamma_1^{1,1}$ is Player 1’s prescription in Team 1.

In order to solve this problem, we used the approximate DP approach discussed in Appendix H. The value function V1u()V_{1}^{u}(\cdot) thus obtained is shown in Figure 1(a). The tension between signaling and secrecy can be seen in the shape of the value function in Figure 1(a). When the CIB π1(0)=0.5\pi_{1}(0)=0.5, the value function is concave in its neighborhood and decreases as we move away from 0.5. This indicates that in these belief states, revealing the hidden state to some extent is preferable. However, as the belief goes further away from 0.5, the value function starts increasing at some point. This indicates that the adversary has too much information and is using it to inflict damage upon the system. Figure 1(b) depicts Player 1’s prescriptions leading to non-trivial signaling patterns at various belief states. Notice that the distributions γ11,1(0)\gamma_{1}^{1,1}(0) and γ11,1(1)\gamma_{1}^{1,1}(1) for hidden states 0 and 1 are quite distinct when π1(0)=0.5\pi_{1}(0)=0.5 (indicating significant signaling) and are nearly identical when π1(0)=0.72\pi_{1}(0)=0.72 (indicating negligible signaling). A more detailed discussion on our experimental results can be found in Appendix J.

VI Conclusions

We considered a general model of stochastic zero-sum games between two competing decentralized teams and provided bounds on their upper and lower values in the form of CIB based dynamic programs. When the game has a value, our bounds coincide with the value. We identified several instances of this game model (including previously considered models) in which the CIB is controlled only by one of the teams (say the minimizing team). For such games, we also provided a characterization of a min-max strategy. Under this strategy, each player only uses the current CIB and its private information to select its actions. The sufficiency of the CIB and private information for optimality can potentially be exploited to design efficient strategies in various problems. Finally, we proposed a computational approach for approximately solving the CIB based DPs. There is significant scope for improvement in our computational approach. Tailored forward-exploration heuristics for sampling the belief space and adding a policy network could improve the accuracy and tractability of our approach.

References

  • [1] Yuxuan Xie, Jilles Dibangoye, and Olivier Buffet, “Optimally solving two-agent decentralized pomdps under one-sided information sharing,” in International Conference on Machine Learning. PMLR, 2020, pp. 10473–10482.
  • [2] Ashutosh Nayyar, Aditya Mahajan, and Demosthenis Teneketzis, “Optimal control strategies in delayed sharing information structures,” IEEE Transactions on Automatic Control, vol. 56, no. 7, pp. 1606–1620, 2010.
  • [3] Fei Fang, Milind Tambe, Bistra Dilkina, and Andrew J Plumptre, Artificial intelligence and conservation, Cambridge University Press, 2019.
  • [4] Ashutosh Nayyar, Aditya Mahajan, and Demosthenis Teneketzis, “Decentralized stochastic control with partial history sharing: A common information approach,” IEEE Transactions on Automatic Control, vol. 58, no. 7, pp. 1644–1658, 2013.
  • [5] Frans A Oliehoek and Christopher Amato, A concise introduction to decentralized POMDPs, Springer, 2016.
  • [6] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson, “Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning,” in International Conference on Machine Learning. PMLR, 2018, pp. 4295–4304.
  • [7] Jakob Foerster, Francis Song, Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson, Matthew Botvinick, and Michael Bowling, “Bayesian action decoder for deep multi-agent reinforcement learning,” in International Conference on Machine Learning. PMLR, 2019, pp. 1942–1951.
  • [8] Ashutosh Nayyar and Abhishek Gupta, “Information structures and values in zero-sum stochastic games,” in American Control Conference (ACC), 2017. IEEE, 2017, pp. 3658–3663.
  • [9] Dengwang Tang, Hamidreza Tavafoghi, Vijay Subramanian, Ashutosh Nayyar, and Demosthenis Teneketzis, “Dynamic games among teams with delayed intra-team information sharing,” arXiv preprint arXiv:2102.11920, 2021.
  • [10] Jean-François Mertens, Sylvain Sorin, and Shmuel Zamir, Repeated games, vol. 55, Cambridge University Press, 2015.
  • [11] Dinah Rosenberg, Eilon Solan, and Nicolas Vieille, “Stochastic games with a single controller and incomplete information,” SIAM journal on control and optimization, vol. 43, no. 1, pp. 86–110, 2004.
  • [12] Jérôme Renault, “The value of repeated games with an informed controller,” Mathematics of operations Research, vol. 37, no. 1, pp. 154–179, 2012.
  • [13] Fabien Gensbittel, Miquel Oliu-Barton, and Xavier Venel, “Existence of the uniform value in zero-sum repeated games with a more informed controller,” Journal of Dynamics and Games, vol. 1, no. 3, pp. 411–445, 2014.
  • [14] Xiaoxi Li and Xavier Venel, “Recursive games: uniform value, tauberian theorem and the mertens conjecture,” International Journal of Game Theory, vol. 45, no. 1-2, pp. 155–189, 2016.
  • [15] Dhruva Kartik and Ashutosh Nayyar, “Upper and lower values in zero-sum stochastic games with asymmetric information,” Dynamic Games and Applications, pp. 1–26, 2020.
  • [16] Jiefu Zheng and David A Castañón, “Decomposition techniques for Markov zero-sum games with nested information,” in 52nd IEEE Conference on Decision and Control. IEEE, 2013, pp. 574–581.
  • [17] Lichun Li and Jeff Shamma, “LP formulation of asymmetric zero-sum stochastic games,” in 53rd IEEE Conference on Decision and Control. IEEE, 2014, pp. 1930–1935.
  • [18] Trey Smith and Reid Simmons, “Heuristic search value iteration for pomdps,” in Proceedings of the 20th conference on Uncertainty in artificial intelligence, 2004, pp. 520–527.
  • [19] Karel Horák, Branislav Bošanskỳ, and Michal Pěchouček, “Heuristic search value iteration for one-sided partially observable stochastic games,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2017, vol. 31.
  • [20] Karel Horák and Branislav Bošanskỳ, “Solving partially observable stochastic games with public observations,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 2029–2036.
  • [21] Bernhard von Stengel and Daphne Koller, “Team-maxmin equilibria,” Games and Economic Behavior, vol. 21, no. 1-2, pp. 309–321, 1997.
  • [22] Gabriele Farina, Andrea Celli, Nicola Gatti, and Tuomas Sandholm, “Ex ante coordination and collusion in zero-sum multi-player extensive-form games,” in Conference on Neural Information Processing Systems (NIPS), 2018.
  • [23] Youzhi Zhang and Bo An, “Computing team-maxmin equilibria in zero-sum multiplayer extensive-form games,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, vol. 34, pp. 2318–2325.
  • [24] Ashutosh Nayyar, Abhishek Gupta, Cedric Langbort, and Tamer Başar, “Common information based Markov perfect equilibria for stochastic games with asymmetric information: Finite games,” IEEE Transactions on Automatic Control, vol. 59, no. 3, pp. 555–570, 2014.
  • [25] Yi Ouyang, Hamidreza Tavafoghi, and Demosthenis Teneketzis, “Dynamic games with asymmetric information: Common information based perfect bayesian equilibria and sequential decomposition,” IEEE Transactions on Automatic Control, vol. 62, no. 1, pp. 222–237, 2017.
  • [26] Deepanshu Vasal, Abhinav Sinha, and Achilleas Anastasopoulos, “A systematic process for evaluating structured perfect bayesian equilibria in dynamic games with asymmetric information,” IEEE Transactions on Automatic Control, vol. 64, no. 1, pp. 78–93, 2019.
  • [27] Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong, “Combining deep reinforcement learning and search for imperfect-information games,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds. 2020, vol. 33, pp. 17057–17069, Curran Associates, Inc.
  • [28] Dhruva Kartik and Ashutosh Nayyar, “Stochastic zero-sum games with asymmetric information,” in 58th IEEE Conference on Decision and Control. IEEE, 2019.
  • [29] Panganamala Ramana Kumar and Pravin Varaiya, Stochastic systems: Estimation, identification, and adaptive control, vol. 75, SIAM, 2015.
  • [30] Charalambos D. Aliprantis and Kim C. Border, Infinite Dimensional Analysis: A Hitchhiker’s Guide, Springer, 2006.
  • [31] Michael Maschler, Eilon Solan, and Shmuel Zamir, Game Theory, Cambridge University Press, 2013.
  • [32] Venkat Anantharam and Vivek Borkar, “Common randomness and distributed control: A counterexample,” Systems & control letters, vol. 56, no. 7-8, pp. 568–572, 2007.
  • [33] Martin J Osborne and Ariel Rubinstein, A course in game theory, MIT press, 1994.
  • [34] Jilles Steeve Dibangoye, Christopher Amato, Olivier Buffet, and François Charpillet, “Optimally solving dec-pomdps as continuous-state mdps,” Journal of Artificial Intelligence Research, vol. 55, pp. 443–497, 2016.
  • [35] Hamidreza Tavafoghi, Yi Ouyang, and Demosthenis Teneketzis, “A sufficient information approach to decentralized decision making,” in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 5069–5076.
  • [36] Onésimo Hernández-Lerma and Jean B Lasserre, Discrete-time Markov control processes: basic optimality criteria, vol. 30, Springer Science & Business Media, 2012.
  • [37] Dimitri P Bertsekas and John N Tsitsiklis, Neuro-dynamic programming, vol. 5, Athena Scientific Belmont, MA, 1996.
  • [38] Tianyi Lin, Chi Jin, and Michael Jordan, “On gradient descent ascent for nonconvex-concave minimax problems,” in International Conference on Machine Learning. PMLR, 2020, pp. 6083–6093.
  • [39] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” arXiv preprint arXiv:1706.08500, 2017.

Appendix A Virtual Game 𝒢v\mathscr{G}_{v} and Proof of Theorem 1

A-A Virtual Game 𝒢v\mathscr{G}_{v}

In order to establish the connection between the original game 𝒢\mathscr{G} and the expanded game 𝒢e\mathscr{G}_{e}, we construct an intermediate virtual game 𝒢v\mathscr{G}_{v}. Virtual game 𝒢v\mathscr{G}_{v} is very similar to the expanded game 𝒢e\mathscr{G}_{e}. The only difference between games 𝒢v\mathscr{G}_{v} and 𝒢e\mathscr{G}_{e} is that the virtual players in Game 𝒢v\mathscr{G}_{v} have access only to the common information CtC_{t} at time tt. On the other hand, the virtual players in Game 𝒢e\mathscr{G}_{e} have access to the common information CtC_{t} as well as the prescription history Γ1:t11:2\Gamma_{1:t-1}^{1:2} at time tt. We formally define virtual game 𝒢v\mathscr{G}_{v} below.

Model and Prescriptions

Consider virtual player ii associated with Team ii, i=1,2i=1,2. At each time tt, virtual player ii selects a prescription Γti=(Γti,1,,Γti,Ni)ti\Gamma^{i}_{t}=(\Gamma_{t}^{i,1},\dots,\Gamma_{t}^{i,N_{i}})\in\mathcal{B}_{t}^{i}, where ti\mathcal{B}_{t}^{i} is the set of prescriptions defined in Section III. Once virtual player ii selects its prescription, the action Uti,jU^{i,j}_{t} is randomly generated according to the distribution Γti,j(Pti,j)\Gamma_{t}^{i,j}({{P}}_{t}^{i,j}). More precisely, the system dynamics for this game are given by:

Xt+1\displaystyle{{X}}_{t+1} =ft(Xt,Ut1:2,Wts)\displaystyle=f_{t}({{X}}_{t},{{U}}_{t}^{1:2},{{W}}_{t}^{s}) (17)
Pt+1i\displaystyle{{P}}^{i}_{t+1} =ξt+1i(Pt1:2,Ut1:2,Yt+11:2)\displaystyle=\xi_{t+1}^{i}({{P}}_{t}^{1:2},{{U}}_{t}^{1:2},{{Y}}_{t+1}^{1:2}) i=1,2,\displaystyle i=1,2, (18)
Yt+1i,j\displaystyle{{Y}}_{t+1}^{i,j} =ht+1i,j(Xt+1,Ut1:2,Wt+1i,j)\displaystyle=h_{t+1}^{i,j}({{X}}_{t+1},{{U}}_{t}^{1:2},{{W}}_{t+1}^{i,j}) i=1,2;j=1,,Ni,\displaystyle i=1,2;j=1,\dots,N_{i}, (19)
Uti,j\displaystyle{{U}}_{t}^{i,j} =rand(𝒰ti,j,Γti,j(Pti,j),Kti,j)\displaystyle={\textsc{rand}}({\mathcal{U}}_{t}^{i,j},\Gamma_{t}^{i,j}({{P}}_{t}^{i,j}),{{K}}_{t}^{i,j}) i=1,2;j=1,,Ni\displaystyle i=1,2;j=1,\dots,N_{i} (20)
Zt+1\displaystyle{{Z}}_{t+1} =ζt+1(Pt1:2,Ut1:2,Yt+11:2),\displaystyle=\zeta_{t+1}({{P}}_{t}^{1:2},{{U}}_{t}^{1:2},{{Y}}_{t+1}^{1:2}), (21)

where the functions ft,hti,jf_{t},h_{t}^{i,j}, ξti,rand\xi_{t}^{i},{\textsc{rand}} and ζt\zeta_{t} are the same as in 𝒢\mathscr{G}.

Strategies

In virtual game 𝒢v\mathscr{G}_{v}, virtual players use the common information Ct{{C}}_{t} to select their prescriptions at time tt. The iith virtual player selects its prescription according to a control law χti\chi_{t}^{i}, i.e. Γti=χti(Ct)\Gamma_{t}^{i}=\chi_{t}^{i}({{C}}_{t}). For virtual player ii, the collection of control laws over the entire time horizon χi=(χ1i,,χTi){\chi}^{i}=(\chi_{1}^{i},\dots,\chi_{T}^{i}) is referred to as its control strategy. Let ti\mathcal{H}_{t}^{i} be the set of all possible control laws for virtual player ii at time tt and let i\mathcal{H}^{i} be the set of all possible control strategies for virtual player ii, i.e. i=1i××Ti\mathcal{H}^{i}=\mathcal{H}_{1}^{i}\times\dots\times\mathcal{H}_{T}^{i}. The total cost associated with the game for a strategy profile (χ1,χ2)({\chi}^{1},{\chi}^{2}) is

𝒥(χ1,χ2)=𝔼(χ1,χ2)[t=1Tct(Xt,Ut1,Ut2)],\displaystyle\mathcal{J}({\chi}^{1},{\chi}^{2})={\mathbb{E}}^{({\chi}^{1},{\chi}^{2})}\left[\sum_{t=1}^{T}c_{t}({{X}}_{t},{{U}}_{t}^{1},{{U}}_{t}^{2})\right], (22)

where the function ctc_{t} is the same as in Game 𝒢\mathscr{G}.

Note that any strategy χii\chi^{i}\in\mathcal{H}^{i} is equivalent to the strategy χ~i~i\tilde{\chi}^{i}\in\tilde{\mathcal{H}}^{i} that satisfies the following condition: for each time tt and for each realization of common information ct𝒞t{c}_{t}\in\mathcal{C}_{t},

χ~ti(ct,γ1:t11:2)=χti(ct)γ1:t11:21:t11:2.\displaystyle\tilde{\chi}_{t}^{i}({c}_{t},\gamma_{1:t-1}^{1:2})=\chi_{t}^{i}({c}_{t})\quad\forall\;\gamma_{1:t-1}^{1:2}\in\mathcal{B}_{1:t-1}^{1:2}. (23)

Hence, with slight abuse of notation, we can say that the strategy space i\mathcal{H}^{i} in the virtual game 𝒢v\mathscr{G}_{v} is a subset of the strategy space ~i\tilde{\mathcal{H}}^{i} in the expanded game 𝒢e\mathscr{G}_{e}.

The following lemma establishes a connection between the original game 𝒢\mathscr{G} and the virtual game 𝒢v\mathscr{G}_{v} constructed above.

Lemma 4.

Let Su(𝒢v)S^{u}(\mathscr{G}_{v}) and Sl(𝒢v)S^{l}(\mathscr{G}_{v}) be, respectively, the upper and lower values of the virtual game 𝒢v\mathscr{G}_{v}. Then,

Sl(𝒢)=Sl(𝒢v)andSu(𝒢)=Su(𝒢v).S^{l}(\mathscr{G})=S^{l}(\mathscr{G}_{v})~{}~{}\mbox{and}~{}~{}S^{u}(\mathscr{G})=S^{u}(\mathscr{G}_{v}).

Consequently, if a Team Nash equilibrium exists in the original game 𝒢\mathscr{G}, then S(𝒢)=Sl(𝒢v)=Su(𝒢v).S(\mathscr{G})=S^{l}(\mathscr{G}_{v})=S^{u}(\mathscr{G}_{v}).

Proof.

For a given strategy gi𝒢ig^{i}\in\mathcal{G}^{i}, let us define a strategy χii\chi^{i}\in\mathcal{H}^{i} in the following manner. For each time tt, each instance of common information ct𝒞tc_{t}\in\mathcal{C}_{t} and player j=1,,Nij=1,\dots,N_{i}, let

γti,jgti,j(ct,),\displaystyle\gamma_{t}^{i,j}\doteq g_{t}^{i,j}(c_{t},\cdot), (24)

and let γti(γti,1,,γti,Ni)\gamma_{t}^{i}\doteq(\gamma_{t}^{i,1},\dots,\gamma_{t}^{i,N_{i}}). Note that the partial function gti,j(ct,)g_{t}^{i,j}(c_{t},\cdot) is a mapping from 𝒫ti,j\mathcal{P}_{t}^{i,j} to Δ𝒰ti,j\Delta{\mathcal{U}}_{t}^{i,j}. Let

χti(ct)γti.\displaystyle\chi_{t}^{i}(c_{t})\doteq\gamma_{t}^{i}. (25)

We will denote this correspondence between strategies in Game 𝒢\mathscr{G} and 𝒢v\mathscr{G}_{v} with i:𝒢ii\mathcal{M}^{i}:\mathcal{G}^{i}\rightarrow\mathcal{H}^{i}, i=1,2i=1,2, i.e., χi=i(gi)\chi^{i}=\mathcal{M}^{i}(g^{i}). One can easily verify that the mapping i\mathcal{M}^{i} is bijective. Further, for every g1𝒢1g^{1}\in\mathcal{G}^{1} and g2𝒢2g^{2}\in\mathcal{G}^{2}, we have

J(g1,g2)=𝒥(1(g1),2(g2)).\displaystyle J(g^{1},g^{2})=\mathcal{J}(\mathcal{M}^{1}(g^{1}),\mathcal{M}^{2}(g^{2})). (26)

We refer the reader to Appendix A of [24] for a proof of the above statement.

Therefore, for any strategy g1𝒢1g^{1}\in\mathcal{G}^{1}, we have

supg2𝒢2J(g1,g2)\displaystyle\sup_{g^{2}\in\mathcal{G}^{2}}J(g^{1},g^{2}) =supg2𝒢2𝒥(1(g1),2(g2))\displaystyle=\sup_{g^{2}\in\mathcal{G}^{2}}\mathcal{J}(\mathcal{M}^{1}(g^{1}),\mathcal{M}^{2}(g^{2})) (27)
=supχ22𝒥(1(g1),χ2).\displaystyle=\sup_{\chi^{2}\in\mathcal{H}^{2}}\mathcal{J}(\mathcal{M}^{1}(g^{1}),\chi^{2}). (28)

Consequently,

infg1𝒢1supg2𝒢2J(g1,g2)\displaystyle\inf_{g^{1}\in\mathcal{G}^{1}}\sup_{g^{2}\in\mathcal{G}^{2}}J(g^{1},g^{2}) =infg1𝒢1supχ22𝒥(1(g1),χ2)\displaystyle=\inf_{g^{1}\in\mathcal{G}^{1}}\sup_{\chi^{2}\in\mathcal{H}^{2}}\mathcal{J}(\mathcal{M}^{1}(g^{1}),\chi^{2}) (29)
=infχ11supχ22𝒥(χ1,χ2).\displaystyle=\inf_{\chi^{1}\in\mathcal{H}^{1}}\sup_{\chi^{2}\in\mathcal{H}^{2}}\mathcal{J}(\chi^{1},\chi^{2}). (30)

This implies that Su(𝒢)=Su(𝒢v)S^{u}(\mathscr{G})=S^{u}(\mathscr{G}_{v}). We can similarly prove that Sl(𝒢)=Sl(𝒢v)S^{l}(\mathscr{G})=S^{l}(\mathscr{G}_{v}).

Remark 4.

We can also show that a strategy g1g^{1} is a min-max strategy in Game 𝒢\mathscr{G} if and only if 1(g1)\mathcal{M}^{1}(g^{1}) is a min-max strategy in Game 𝒢v\mathscr{G}_{v}. A similar statement holds for max-min strategies as well.

We will now establish the relationship between the upper and lower values of the virtual game 𝒢v\mathscr{G}_{v} and the expanded virtual game 𝒢e\mathscr{G}_{e}. To do so, we define the following mappings between the strategies in games 𝒢v\mathscr{G}_{v} and 𝒢e\mathscr{G}_{e}.

Definition 4.

Let ϱi:~1×~2i\varrho^{i}:\tilde{\mathcal{H}}^{1}\times\tilde{\mathcal{H}}^{2}\rightarrow\mathcal{H}^{i} be an operator that maps a strategy profile (χ~1,χ~2)({\tilde{\chi}}^{1},{\tilde{\chi}}^{2}) in virtual game 𝒢e\mathscr{G}_{e} to a strategy χi{{\chi}}^{i} for virtual player ii in game 𝒢v\mathscr{G}_{v} as follows: For t=1,2,,T,t=1,2,\ldots,T,

χti(ct)χ~ti(ct,γ~1:t11:2),\displaystyle\chi_{t}^{i}({c}_{t})\doteq\tilde{\chi}_{t}^{i}({c}_{t},\tilde{\gamma}_{1:t-1}^{1:2}), (31)

where γ~sj=χ~sj(cs,γ~1:s11:2)\tilde{\gamma}_{s}^{j}=\tilde{\chi}_{s}^{j}({c}_{s},\tilde{\gamma}_{1:s-1}^{1:2}) for every 1st11\leq s\leq t-1 and j=1,2j=1,2. We denote the ordered pair (ϱ1,ϱ2)(\varrho^{1},\varrho^{2}) by ϱ\varrho.

The mapping ϱ\varrho is defined in such a way that the strategy profile (χ~1,χ~2)({\tilde{\chi}}^{1},{\tilde{\chi}}^{2}) and the strategy profile ϱ(χ~1,χ~2)\varrho({\tilde{\chi}}^{1},{\tilde{\chi}}^{2}) induce identical dynamics in the respective games 𝒢e\mathscr{G}_{e} and 𝒢v\mathscr{G}_{v}.

Lemma 5.

Let (χ1,χ2)({{\chi}}^{1},{{\chi}}^{2}) and (χ~1,χ~2)({\tilde{\chi}}^{1},{\tilde{\chi}}^{2}) be strategy profiles for games 𝒢v\mathscr{G}_{v} and 𝒢e\mathscr{G}_{e}, such that χi=ϱi(χ~1,χ~2){\chi}^{i}=\varrho^{i}({\tilde{\chi}}^{1},{\tilde{\chi}}^{2}), i=1,2i=1,2. Then,

𝒥(χ1,χ2)=𝒥(χ~1,χ~2).\displaystyle\mathcal{J}({{\chi}}^{1},{{\chi}}^{2})={\mathcal{J}}({\tilde{\chi}}^{1},{\tilde{\chi}}^{2}). (32)
Proof.

Let us consider the evolution of the virtual game 𝒢v\mathscr{G}_{v} under the strategy profile (χ1,χ2)(\chi^{1},\chi^{2}) and the expanded virtual game 𝒢e\mathscr{G}_{e} under the strategy profile (χ~1,χ~2)(\tilde{\chi}^{1},\tilde{\chi}^{2}). Let the primitive variables and the randomization variables Kti,jK_{t}^{i,j} in both games be identical. The variables such as the state, action and information variables in the expanded game 𝒢e\mathscr{G}_{e} are distinguished from those in the virtual game 𝒢v\mathscr{G}_{v} by means of a tilde. For instance, XtX_{t} is the state in game 𝒢v\mathscr{G}_{v} and X~t\tilde{X}_{t} is the state in game 𝒢e\mathscr{G}_{e}.

We will prove by induction that the system evolution in both these games is identical over the entire horizon. This is clearly true at the end of time t=1t=1 because the state, observations and the common and private information variables are identical in both games. Moreover, since χi=ϱi(χ~1,χ~2){\chi}^{i}=\varrho^{i}({\tilde{\chi}}^{1},{\tilde{\chi}}^{2}), i=1,2i=1,2, the strategies χ1i\chi^{i}_{1} and χ~1i\tilde{\chi}^{i}_{1} are identical by definition (see Definition 4). Thus, the prescriptions and actions at t=1t=1 are also identical.

For induction, assume that the system evolution in both games is identical until the end of time tt. Then,

Xt+1=ft(Xt,Ut1:2,Wts)=ft(X~t,U~t1:2,Wts)=X~t+1.{{X}}_{t+1}=f_{t}({{X}}_{t},{{U}}_{t}^{1:2},{{W}}_{t}^{s})=f_{t}(\tilde{{{X}}}_{t},\tilde{{{U}}}_{t}^{1:2},{{W}}_{t}^{s})=\tilde{{{X}}}_{t+1}.

Using equations (3), (5) and (4), we can similarly argue that Yt+1i=Y~t+1i{{Y}}_{t+1}^{i}=\tilde{{{Y}}}_{t+1}^{i}, Pt+1i=P~t+1i{{P}}_{t+1}^{i}=\tilde{{{P}}}_{t+1}^{i} and Ct+1=C~t+1{{C}}_{t+1}=\tilde{{{C}}}_{t+1}. Since χi=ϱi(χ~1,χ~2){\chi}^{i}=\varrho^{i}({\tilde{\chi}}^{1},{\tilde{\chi}}^{2}), we also have

Γ~t+1i\displaystyle\tilde{\Gamma}_{t+1}^{i} =χ~t+1i(C~t+1,Γ~1:t1:2)=aχt+1i(C~t+1)=bΓt+1i.\displaystyle=\tilde{\chi}_{t+1}^{i}(\tilde{{{C}}}_{t+1},\tilde{\Gamma}_{1:t}^{1:2})\stackrel{{\scriptstyle a}}{{=}}\chi_{t+1}^{i}(\tilde{{{C}}}_{t+1})\stackrel{{\scriptstyle b}}{{=}}\Gamma_{t+1}^{i}. (33)

Here, equality (a)(a) follows from the construction of the mapping ϱi\varrho^{i} (see Definition 4) and equality (b)(b) follows from the fact that Ct+1=C~t+1C_{t+1}=\tilde{C}_{t+1}. Further,

Ut+1i,j=rand(𝒰t+1i,j,Γt+1i,j(Pt+1i,j),Kt+1i,j)\displaystyle{{U}}_{t+1}^{i,j}={\textsc{rand}}({\mathcal{U}}_{t+1}^{i,j},\Gamma_{t+1}^{i,j}({{P}}_{t+1}^{i,j}),{{K}}_{t+1}^{i,j}) =rand(𝒰t+1i,j,Γ~t+1i,j(P~t+1i,j),Kt+1i,j)\displaystyle={\textsc{rand}}({\mathcal{U}}_{t+1}^{i,j},\tilde{\Gamma}_{t+1}^{i,j}(\tilde{{{P}}}_{t+1}^{i,j}),{{K}}_{t+1}^{i,j}) (34)
=U~t+1i,j.\displaystyle=\tilde{{{U}}}_{t+1}^{i,j}. (35)

Thus, by induction, the hypothesis is true for every 1tT1\leq t\leq T. This proves that the virtual and expanded games have identical dynamics under strategy profiles (χ1,χ2)({\chi}^{1},\chi^{2}) and (χ~1,χ~2)(\tilde{{\chi}}^{1},\tilde{\chi}^{2}).

Since the virtual and expanded games have the same cost structure, having identical dynamics ensures that strategy profiles (χ1,χ2)({\chi}^{1},\chi^{2}) and (χ~1,χ~2)(\tilde{{\chi}}^{1},\tilde{\chi}^{2}) have the same expected cost in games 𝒢v\mathscr{G}_{v} and 𝒢e\mathscr{G}_{e}, respectively. Therefore, 𝒥(χ1,χ2)=𝒥(χ~1,χ~2)\mathcal{J}({\chi}^{1},\chi^{2})={\mathcal{J}}(\tilde{{\chi}}^{1},\tilde{\chi}^{2}). ∎

A-B Proof of Theorem 1

For any strategy χ11\chi^{1}\in\mathcal{H}^{1}, we have

supχ~2~2𝒥(χ1,χ~2)supχ22𝒥(χ1,χ2),\displaystyle\sup_{\tilde{\chi}^{2}\in\tilde{\mathcal{H}}^{2}}\mathcal{J}({\chi}^{1},\tilde{\chi}^{2})\geq\sup_{{\chi}^{2}\in{\mathcal{H}}^{2}}\mathcal{J}({\chi}^{1},{\chi}^{2}), (36)

because 2~2\mathcal{H}^{2}\subseteq\tilde{\mathcal{H}}^{2}. Further,

supχ~2~2𝒥(χ1,χ~2)\displaystyle\sup_{\tilde{\chi}^{2}\in\tilde{\mathcal{H}}^{2}}\mathcal{J}({\chi}^{1},\tilde{\chi}^{2}) =supχ~2~2𝒥(ϱ1(χ1,χ~2),ϱ2(χ1,χ~2)).\displaystyle=\sup_{\tilde{\chi}^{2}\in\tilde{\mathcal{H}}^{2}}\mathcal{J}(\varrho^{1}(\chi^{1},\tilde{\chi}^{2}),\varrho^{2}(\chi^{1},\tilde{\chi}^{2})). (37)
=supχ~2~2𝒥(χ1,ϱ2(χ1,χ~2))\displaystyle=\sup_{\tilde{\chi}^{2}\in\tilde{\mathcal{H}}^{2}}\mathcal{J}({\chi}^{1},\varrho^{2}(\chi^{1},\tilde{\chi}^{2})) (38)
supχ22𝒥(χ1,χ2),\displaystyle\leq\sup_{{\chi}^{2}\in{\mathcal{H}}^{2}}\mathcal{J}({\chi}^{1},{\chi}^{2}), (39)

where the first equality is due to Lemma 5, the second equality is because ϱ1(χ1,χ~2)=χ1\varrho^{1}(\chi^{1},\tilde{\chi}^{2})=\chi^{1} and the last inequality is due to the fact that ϱ2(χ1,χ~2)2\varrho^{2}(\chi^{1},\tilde{\chi}^{2})\in{\mathcal{H}}^{2} for any χ~2~2\tilde{\chi}^{2}\in\tilde{\mathcal{H}}^{2}.

Combining (36) and (39), we obtain that

supχ22𝒥(χ1,χ2)=supχ~2~2𝒥(χ1,χ~2).\displaystyle\sup_{{\chi}^{2}\in{\mathcal{H}}^{2}}\mathcal{J}({\chi}^{1},{\chi}^{2})=\sup_{\tilde{\chi}^{2}\in\tilde{\mathcal{H}}^{2}}\mathcal{J}({\chi}^{1},\tilde{\chi}^{2}). (40)

Now,

Su(𝒢e)\displaystyle S^{u}(\mathscr{G}_{e}) :=infχ~1~1supχ~2~2𝒥(χ~1,χ~2)\displaystyle:=\inf_{\tilde{\chi}^{1}\in\tilde{\mathcal{H}}^{1}}\sup_{\tilde{\chi}^{2}\in\tilde{\mathcal{H}}^{2}}\mathcal{J}(\tilde{\chi}^{1},\tilde{\chi}^{2}) (41)
infχ11supχ~2~2𝒥(χ1,χ~2)\displaystyle\leq\inf_{{\chi}^{1}\in{\mathcal{H}}^{1}}\sup_{\tilde{\chi}^{2}\in\tilde{\mathcal{H}}^{2}}\mathcal{J}({\chi}^{1},\tilde{\chi}^{2}) (42)
=infχ11supχ22𝒥(χ1,χ2),\displaystyle=\inf_{{\chi}^{1}\in{\mathcal{H}}^{1}}\sup_{{\chi}^{2}\in{\mathcal{H}}^{2}}\mathcal{J}({\chi}^{1},{\chi}^{2}), (43)
=:Su(𝒢v).\displaystyle=:S^{u}(\mathscr{G}_{v}). (44)

where the inequality (42) is true since 1~1\mathcal{H}^{1}\subseteq\tilde{\mathcal{H}}^{1} and the equality in (43) follows from (40). Therefore, Su(𝒢e)Su(𝒢v)S^{u}(\mathscr{G}_{e})\leq S^{u}(\mathscr{G}_{v}). We can use similar arguments to show that Sl(𝒢v)Sl(𝒢e).S^{l}(\mathscr{G}_{v})\leq S^{l}(\mathscr{G}_{e}).

Appendix B Proof of Lemma 1

Definition 5 (Notation for Prescriptions).

For virtual player ii at time tt, let γi=(γi,1,,γi,Ni)ti\gamma^{i}=(\gamma^{i,1},\dots,\gamma^{i,N_{i}})\in\mathcal{B}_{t}^{i} be a prescription. We use γi,j(pti,j;uti,j)\gamma^{i,j}(p_{t}^{i,j};u_{t}^{i,j}) to denote the probability assigned to action uti,ju_{t}^{i,j} by the distribution γi,j(pti,j)\gamma^{i,j}(p_{t}^{i,j}). We will also use the following notation:

γi(pti;uti)j=1Niγi,j(pti,j;uti,j)pti𝒫ti,uti𝒰ti.\displaystyle\gamma^{i}(p_{t}^{i};u_{t}^{i})\doteq\prod_{j=1}^{N_{i}}\gamma^{i,j}(p_{t}^{i,j};u_{t}^{i,j})~{}~{}\forall p_{t}^{i}\in{\mathcal{P}}_{t}^{i},u_{t}^{i}\in{\mathcal{U}}_{t}^{i}.

Here, γi(pti;uti)\gamma^{i}(p_{t}^{i};u_{t}^{i}) is the probability assigned to a team action utiu_{t}^{i} by the prescription γi\gamma^{i} when the team’s private information is ptip_{t}^{i}.

We begin by defining the following transformations for each time tt. Recall that 𝒮t\mathcal{S}_{t} is the set of all possible common information beliefs at time tt and ti\mathcal{B}_{t}^{i} is the prescription space for virtual player ii at time tt.

Definition 6.
  1.

    Let 𝒫tj:𝒮t×t1×t2Δ(𝒵t+1×𝒳t+1×𝒫t+11×𝒫t+12)\mathscr{P}_{t}^{j}:\mathcal{S}_{t}\times\mathcal{B}_{t}^{1}\times\mathcal{B}_{t}^{2}\to\Delta(\mathcal{Z}_{t+1}\times{\mathcal{X}}_{t+1}\times{\mathcal{P}}_{t+1}^{1}\times{\mathcal{P}}_{t+1}^{2}) be defined as

    𝒫tj(πt,\displaystyle\mathscr{P}_{t}^{j}(\pi_{t}, γt1:2;zt+1,xt+1,pt+11:2)\displaystyle\gamma_{t}^{1:2};z_{t+1},x_{t+1},p_{t+1}^{1:2}) (45)
    :=xt,pt1:2,ut1:2πt(xt,pt1:2)γt1(pt1;ut1)γt2(pt2;ut2)[xt+1,pt+11:2,zt+1xt,pt1:2,ut1:2].\displaystyle:=\sum_{{x}_{t},{p}^{1:2}_{t},{u}_{t}^{1:2}}\pi_{t}({x}_{t},{p}_{t}^{1:2})\gamma_{t}^{1}({p}_{t}^{1};u_{t}^{1})\gamma_{t}^{2}({p}_{t}^{2};u_{t}^{2}){\mathbb{P}}[{x}_{t+1},{p}_{t+1}^{1:2},{z}_{t+1}\mid{x}_{t},{p}_{t}^{1:2},{u}_{t}^{1:2}]. (46)

    We will use 𝒫tj(πt,γt1:2)\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2}) as a shorthand for the probability distribution 𝒫tj(πt,γt1:2;,,)\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2};\cdot,\cdot,\cdot). The distribution 𝒫tj(πt,γt1:2)\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2}) can be viewed as a joint distribution over the variables Zt+1,Xt+1,Pt+11:2Z_{t+1},X_{t+1},P_{t+1}^{1:2} if the distribution on Xt,Pt1:2X_{t},P^{1:2}_{t} is πt\pi_{t} and prescriptions γt1:2\gamma^{1:2}_{t} are chosen by the virtual players at time tt.

  2.

    Let 𝒫tm:𝒮t×t1×t2Δ𝒵t+1\mathscr{P}_{t}^{m}:\mathcal{S}_{t}\times\mathcal{B}_{t}^{1}\times\mathcal{B}_{t}^{2}\to\Delta\mathcal{Z}_{t+1} be defined as

    𝒫tm(πt,γt1:2;zt+1)=xt+1,pt+11:2𝒫tj(πt,γt1:2;zt+1,xt+1,pt+11:2).\displaystyle\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1})=\sum_{x_{t+1},p_{t+1}^{1:2}}\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2};z_{t+1},x_{t+1},p_{t+1}^{1:2}). (47)

    The distribution 𝒫tm(πt,γt1:2)\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2}) is the marginal distribution of the variable Zt+1Z_{t+1} obtained from the joint distribution 𝒫tj(πt,γt1:2)\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2}) defined above.

  3.

    Let Ft:𝒮t×t1×t2×𝒵t+1𝒮t+1F_{t}:\mathcal{S}_{t}\times\mathcal{B}_{t}^{1}\times\mathcal{B}_{t}^{2}\times\mathcal{Z}_{t+1}\to\mathcal{S}_{t+1} be defined as

    Ft(πt,γt1:2,zt+1)={𝒫tj(πt,γt1:2;zt+1,,)𝒫tm(πt,γt1:2;zt+1)if 𝒫tm(πt,γt1:2;zt+1)>0Gt(πt,γt1:2,zt+1)otherwise,\displaystyle F_{t}(\pi_{t},\gamma_{t}^{1:2},z_{t+1})=\begin{cases}\frac{\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2};z_{t+1},\cdot,\cdot)}{\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1})}&\text{if }\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1})>0\\ G_{t}(\pi_{t},\gamma_{t}^{1:2},z_{t+1})&\text{otherwise},\end{cases} (48)

    where Gt:𝒮t×t1×t2×𝒵t+1𝒮t+1G_{t}:\mathcal{S}_{t}\times\mathcal{B}_{t}^{1}\times\mathcal{B}_{t}^{2}\times\mathcal{Z}_{t+1}\to\mathcal{S}_{t+1} can be an arbitrary measurable mapping. One such mapping takes every element πt,γt1:2,zt+1\pi_{t},\gamma_{t}^{1:2},z_{t+1} to the uniform distribution over the finite space 𝒳t+1×𝒫t+11×𝒫t+12\mathcal{X}_{t+1}\times{\mathcal{P}}_{t+1}^{1}\times{\mathcal{P}}_{t+1}^{2}.

Let the collection of transformations FtF_{t} that can be constructed using the method described in Definition 6 be denoted by \mathscr{B}. Note that the transformations 𝒫tj,𝒫tm\mathscr{P}_{t}^{j},\mathscr{P}_{t}^{m} and FtF_{t} do not depend on the strategy profile (χ~1,χ~2)(\tilde{\chi}^{1},\tilde{\chi}^{2}) because the term [xt+1,pt+11:2,zt+1xt,pt1:2,ut1:2]{\mathbb{P}}[{x}_{t+1},{p}_{t+1}^{1:2},{z}_{t+1}\mid{x}_{t},{p}_{t}^{1:2},{u}_{t}^{1:2}] in (46) depends only on the system dynamics (see equations (17) – (21)) and not on the strategy profile (χ~1,χ~2)(\tilde{\chi}^{1},\tilde{\chi}^{2}).
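As a numerical aside (not part of the proof), the sketch below evaluates the joint transformation, its marginal over the common observation, and the normalized update F_t of Definition 6 for finite sets, assuming the one-step kernel P[x_{t+1}, p_{t+1}^{1:2}, z_{t+1} | x_t, p_t^{1:2}, u_t^{1:2}] is supplied as an array; all array shapes are placeholder assumptions.

```python
import numpy as np

def cib_update(pi, gamma1, gamma2, kernel, z):
    """One CIB update (Definition 6) for finite sets.

    pi:      array (X, P1, P2), the current belief pi_t(x, p1, p2)
    gamma1:  array (P1, U1), gamma1[p1, u1] = prob. of action u1 given p1
    gamma2:  array (P2, U2), gamma2[p2, u2] = prob. of action u2 given p2
    kernel:  array (X, P1, P2, U1, U2, Z, Xn, P1n, P2n) giving
             P[x_{t+1}, p_{t+1}^{1:2}, z_{t+1} | x_t, p_t^{1:2}, u_t^{1:2}]
    z:       index of the realized common observation z_{t+1}
    """
    # Joint transformation P_t^j over (z_{t+1}, x_{t+1}, p_{t+1}^1, p_{t+1}^2).
    joint = np.einsum('abc,bd,ce,abcdezfgh->zfgh', pi, gamma1, gamma2, kernel)
    # Marginal P_t^m of the common observation.
    marginal = joint.sum(axis=(1, 2, 3))
    if marginal[z] > 0:
        return joint[z] / marginal[z]            # F_t, first case of (48)
    # Otherwise apply the arbitrary map G_t; here, the uniform distribution.
    return np.full(joint.shape[1:], 1.0 / np.prod(joint.shape[1:]))
```

For instance, pi_next = cib_update(pi, gamma1, gamma2, kernel, z) reproduces the update (54) whenever the realized observation z has positive probability under (pi, gamma1, gamma2).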

Consider a strategy profile (χ~1,χ~2)(\tilde{\chi}^{1},\tilde{\chi}^{2}). Note that the number of possible realizations of common information and prescription history under (χ~1,χ~2)(\tilde{\chi}^{1},\tilde{\chi}^{2}) is finite. Let ct+1,γ1:t1:2c_{t+1},\gamma_{1:t}^{1:2} be a realization of the common information and prescription history at time t+1t+1 with non-zero probability of occurrence under (χ~1,χ~2)(\tilde{\chi}^{1},\tilde{\chi}^{2}). For this realization of virtual players’ information, the common information based belief on the state and private information at time t+1t+1 is given by

πt+1(xt+1,pt+11:2)\displaystyle\pi_{t+1}(x_{t+1},p_{t+1}^{1:2})
=(χ~1,χ~2)[Xt+1=xt+1,Pt+11:2=pt+11:2ct+1,γ1:t1:2]\displaystyle={\mathbb{P}}^{(\tilde{\chi}^{1},\tilde{\chi}^{2})}[X_{t+1}={x}_{t+1},P_{t+1}^{1:2}={p}_{t+1}^{1:2}\mid{c}_{t+1},\gamma_{1:t}^{1:2}]
=(χ~1,χ~2)[Xt+1=xt+1,Pt+11:2=pt+11:2ct,γ1:t11:2,zt+1,γt1:2]\displaystyle={\mathbb{P}}^{(\tilde{\chi}^{1},\tilde{\chi}^{2})}[X_{t+1}={x}_{t+1},P_{t+1}^{1:2}={p}_{t+1}^{1:2}\mid{c}_{t},\gamma_{1:t-1}^{1:2},{z}_{t+1},\gamma_{t}^{1:2}]
=(χ~1,χ~2)[Xt+1=xt+1,Pt+11:2=pt+11:2,Zt+1=zt+1ct,γ1:t1:2](χ~1,χ~2)[Zt+1=zt+1ct,γ1:t1:2].\displaystyle=\frac{{\mathbb{P}}^{(\tilde{\chi}^{1},\tilde{\chi}^{2})}[X_{t+1}={x}_{t+1},P_{t+1}^{1:2}={p}_{t+1}^{1:2},Z_{t+1}={z}_{t+1}\mid{c}_{t},\gamma_{1:t}^{1:2}]}{{\mathbb{P}}^{(\tilde{\chi}^{1},\tilde{\chi}^{2})}[Z_{t+1}={z}_{t+1}\mid{c}_{t},\gamma_{1:t}^{1:2}]}. (49)

Notice that the expression (49) is well-defined, that is, the denominator is non-zero because of our assumption that the realization ct+1,γ1:t1:2c_{t+1},\gamma_{1:t}^{1:2} has non-zero probability of occurrence. Let us consider the numerator in the expression (49). For convenience, we will denote it with (χ~1,χ~2)[xt+1,pt+11:2,zt+1ct,γ1:t1:2]{\mathbb{P}}^{(\tilde{\chi}^{1},\tilde{\chi}^{2})}[{x}_{t+1},{p}_{t+1}^{1:2},{z}_{t+1}\mid{c}_{t},\gamma_{1:t}^{1:2}]. We have

(χ~1,χ~2)[xt+1,pt+11:2,zt+1ct,γ1:t1:2]\displaystyle{\mathbb{P}}^{(\tilde{\chi}^{1},\tilde{\chi}^{2})}[{x}_{t+1},{p}_{t+1}^{1:2},{z}_{t+1}\mid{c}_{t},\gamma_{1:t}^{1:2}]
=xt,pt1:2,ut1:2πt(xt,pt1:2)γt1(pt1;ut1)γt2(pt2;ut2)(χ~1,χ~2)[xt+1,pt+11:2,zt+1ct,γ1:t1:2,xt,pt1:2,ut1:2]\displaystyle=\sum_{{x}_{t},{p}^{1:2}_{t},{u}_{t}^{1:2}}\pi_{t}({x}_{t},{p}_{t}^{1:2})\gamma_{t}^{1}({p}_{t}^{1};{u}_{t}^{1})\gamma_{t}^{2}({p}_{t}^{2};{u}_{t}^{2}){\mathbb{P}}^{(\tilde{\chi}^{1},\tilde{\chi}^{2})}[{x}_{t+1},{p}_{t+1}^{1:2},{z}_{t+1}\mid{c}_{t},\gamma_{1:t}^{1:2},{x}_{t},{p}_{t}^{1:2},{u}_{t}^{1:2}] (50)
=xt,pt1:2,ut1:2πt(xt,pt1:2)γt1(pt1;ut1)γt2(pt2;ut2)[xt+1,pt+11:2,zt+1xt,pt1:2,ut1:2]\displaystyle=\sum_{{x}_{t},{p}^{1:2}_{t},{u}_{t}^{1:2}}\pi_{t}({x}_{t},{p}_{t}^{1:2})\gamma_{t}^{1}({p}_{t}^{1};u_{t}^{1})\gamma_{t}^{2}({p}_{t}^{2};u_{t}^{2}){\mathbb{P}}[{x}_{t+1},{p}_{t+1}^{1:2},{z}_{t+1}\mid{x}_{t},{p}_{t}^{1:2},{u}_{t}^{1:2}] (51)
=𝒫tj(πt,γt1:2;zt+1,xt+1,pt+11:2),\displaystyle=\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2};z_{t+1},x_{t+1},p_{t+1}^{1:2}), (52)

where πt\pi_{t} is the common information belief on Xt,Pt1,Pt2X_{t},P_{t}^{1},P_{t}^{2} at time tt given the realization ct,γ1:t11:2c_{t},\gamma_{1:t-1}^{1:2}, and 𝒫tj\mathscr{P}_{t}^{j} is as defined in Definition 6. (Note that the belief (χ~1,χ~2)[xt,pt1:2ct,γ1:t11:2]=(χ~1,χ~2)[xt,pt1:2ct,γ1:t1:2]{\mathbb{P}}^{(\tilde{\chi}^{1},\tilde{\chi}^{2})}[x_{t},p_{t}^{1:2}\mid c_{t},\gamma_{1:t-1}^{1:2}]={\mathbb{P}}^{(\tilde{\chi}^{1},\tilde{\chi}^{2})}[x_{t},p_{t}^{1:2}\mid c_{t},\gamma_{1:t}^{1:2}] because γti=χ~ti(ct,γ1:t11:2)\gamma_{t}^{i}=\tilde{\chi}_{t}^{i}(c_{t},\gamma_{1:t-1}^{1:2}), i=1,2i=1,2.) The equality in (51) is due to the structure of the system dynamics in game 𝒢e\mathscr{G}_{e} described by equations (17) – (21). Similarly, the denominator in (49) satisfies

0<(χ~1,χ~2)[zt+1ct,γ1:t1:2]\displaystyle 0<{\mathbb{P}}^{(\tilde{\chi}^{1},\tilde{\chi}^{2})}[{z}_{t+1}\mid{c}_{t},\gamma_{1:t}^{1:2}] =xt+1,pt+11:2𝒫tj(πt,γt1:2;zt+1,xt+1,pt+11:2)\displaystyle=\sum_{x_{t+1},p_{t+1}^{1:2}}\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2};z_{t+1},x_{t+1},p_{t+1}^{1:2})
=𝒫tm(πt,γt1:2;zt+1),\displaystyle=\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1}), (53)

where 𝒫tm\mathscr{P}_{t}^{m} is as defined in Definition 6. Thus, from equation (49), we have

πt+1=𝒫tj(πt,γt1:2;zt+1,,)𝒫tm(πt,γt1:2,zt+1)=Ft(πt,γt1:2;zt+1),\pi_{t+1}=\frac{\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2};z_{t+1},\cdot,\cdot)}{\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2},z_{t+1})}=F_{t}(\pi_{t},\gamma_{t}^{1:2};{z}_{t+1}), (54)

where FtF_{t} is as defined in Definition 6. Since the relation (54) holds for every realization ct+1,γ1:t1:2c_{t+1},\gamma_{1:t}^{1:2} that has non-zero probability of occurrence under (χ~1,χ~2)(\tilde{\chi}^{1},\tilde{\chi}^{2}), we can conclude that the common information belief Πt\Pi_{t} evolves almost surely as

Πt+1=Ft(Πt,Γt1:2,Zt+1),t1,\displaystyle\Pi_{t+1}=F_{t}(\Pi_{t},\Gamma_{t}^{1:2},{Z}_{t+1}),~{}~{}t\geq 1, (55)

under the strategy profile (χ~1,χ~2)(\tilde{\chi}^{1},\tilde{\chi}^{2}).

The expected cost at time tt can be expressed as follows

𝔼(χ~1,χ~2)[ct(Xt,Ut1,Ut2)]\displaystyle{\mathbb{E}}^{(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2})}[c_{t}(X_{t},U_{t}^{1},U_{t}^{2})] =𝔼(χ~1,χ~2)[𝔼[ct(Xt,Ut1,Ut2)Ct,Γ1:t1:2]]\displaystyle={\mathbb{E}}^{(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2})}[{\mathbb{E}}[c_{t}(X_{t},U_{t}^{1},U_{t}^{2})\mid C_{t},\Gamma_{1:t}^{1:2}]] (56)
=𝔼(χ~1,χ~2)[c~t(Πt,Γt1,Γt2)],\displaystyle={\mathbb{E}}^{(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2})}[\tilde{c}_{t}(\Pi_{t},\Gamma_{t}^{1},\Gamma_{t}^{2})], (57)

where the function c~t\tilde{c}_{t} is defined as

c~t(π,γ1,γ2):=\displaystyle\tilde{c}_{t}(\pi,\gamma^{1},\gamma^{2}):= xt,pt1:2,ut1:2ct(xt,ut1,ut2)π(xt,pt1,pt2)γ1(pt1;ut1)γ2(pt2;ut2).\displaystyle\sum_{{x}_{t},{p}_{t}^{1:2},{u}_{t}^{1:2}}c_{t}({x}_{t},{u}_{t}^{1},{u}_{t}^{2})\pi({x}_{t},{p}_{t}^{1},{p}_{t}^{2})\gamma^{1}({p}_{t}^{1};{u}_{t}^{1})\gamma^{2}({p}_{t}^{2};{u}_{t}^{2}). (58)

Therefore, the total cost can be expressed as

𝔼(χ~1,χ~2)\displaystyle{\mathbb{E}}^{(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2})} [t=1Tct(Xt,Ut1,Ut2)]=𝔼(χ~1,χ~2)[t=1Tc~t(Πt,Γt1,Γt2)].\displaystyle\left[\sum_{t=1}^{T}c_{t}({{X}}_{t},{{U}}_{t}^{1},{{U}}_{t}^{2})\right]={\mathbb{E}}^{(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2})}\left[\sum_{t=1}^{T}\tilde{c}_{t}(\Pi_{t},\Gamma_{t}^{1},\Gamma_{t}^{2})\right]. (59)

Appendix C Domain Extension and Proof of Lemma 2

C-A The Max-min Dynamic Program

The maximizing virtual player (virtual player 2) in game 𝒢e\mathscr{G}_{e} solves the following dynamic program. Define VT+1l(πT+1)=0V^{l}_{T+1}(\pi_{T+1})=0 for every πT+1\pi_{T+1}. In a backward inductive manner, at each time tTt\leq T and for each possible common information belief πt\pi_{t} and prescriptions γt1,γt2\gamma^{1}_{t},\gamma^{2}_{t}, define the lower cost-to-go function wtlw_{t}^{l} and the lower value function VtlV_{t}^{l} as

wtl(πt,γt1,γt2)\displaystyle w^{l}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}) c~t(πt,γt1,γt2)+𝔼[Vt+1l(Ft(πt,γt1:2,Zt+1))πt,γt1:2],\displaystyle\doteq\tilde{c}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2})+{\mathbb{E}}[V^{l}_{t+1}(F_{t}(\pi_{t},\gamma_{t}^{1:2},{{Z}}_{t+1}))\mid\pi_{t},\gamma_{t}^{1:2}], (60)
Vtl(πt)\displaystyle V_{t}^{l}(\pi_{t}) maxγt2minγt1wtl(πt,γt1,γt2).\displaystyle\doteq\max_{\gamma_{t}^{2}}\min_{{\gamma}_{t}^{1}}w_{t}^{l}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}). (61)

C-B Domain Extension

Recall that 𝒮t\mathcal{S}_{t} is the set of all probability distributions over the finite set 𝒳t×𝒫t1×𝒫t2\mathcal{X}_{t}\times\mathcal{P}_{t}^{1}\times\mathcal{P}_{t}^{2}. Define

𝒮¯t:={απt:0α1,πt𝒮t}.\displaystyle\bar{\mathcal{S}}_{t}:=\{\alpha\pi_{t}:0\leq\alpha\leq 1,\pi_{t}\in\mathcal{S}_{t}\}. (62)

The functions c~t\tilde{c}_{t} in (58), 𝒫tj\mathscr{P}_{t}^{j} in (45), 𝒫tm\mathscr{P}_{t}^{m} in (53) and FtF_{t} in (54) were defined for any πt𝒮t\pi_{t}\in\mathcal{S}_{t}. We will extend the domain of the argument πt\pi_{t} in these functions to 𝒮¯t\bar{\mathcal{S}}_{t} as follows. For any γtiti,i=1,2\gamma_{t}^{i}\in\mathcal{B}_{t}^{i},i=1,2, zt+1𝒵t+1z_{t+1}\in\mathcal{Z}_{t+1}, 0α10\leq\alpha\leq 1 and πt𝒮t\pi_{t}\in{\mathcal{S}}_{t}, let

  1.

    c~t(απt,γt1,γt2):=αc~t(πt,γt1,γt2)\tilde{c}_{t}(\alpha\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}):=\alpha\tilde{c}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2})

  2.

    𝒫tj(απt,γt1:2):=α𝒫tj(πt,γt1:2)\mathscr{P}_{t}^{j}(\alpha\pi_{t},\gamma_{t}^{1:2}):=\alpha\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2})

  3.

    𝒫tm(απt,γt1:2):=α𝒫tm(πt,γt1:2)\mathscr{P}_{t}^{m}(\alpha\pi_{t},\gamma_{t}^{1:2}):=\alpha\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2})

  4.

    Ft(απt,γt1:2,zt+1):={Ft(πt,γt1:2,zt+1)if α>0𝟎if α=0,F_{t}(\alpha\pi_{t},\gamma_{t}^{1:2},z_{t+1}):=\begin{cases}F_{t}(\pi_{t},\gamma_{t}^{1:2},z_{t+1})&\text{if }\alpha>0\\ \bm{0}&\text{if }\alpha=0,\end{cases}

where 𝟎\bm{0} is a zero-vector of size |𝒳t×𝒫t1×𝒫t2||\mathcal{X}_{t}\times\mathcal{P}_{t}^{1}\times\mathcal{P}_{t}^{2}|.

Having extended the domain of the above functions, we can also extend the domain of the argument πt\pi_{t} in the functions wtu(),wtl(),Vtu(),Vtl()w_{t}^{u}(\cdot),w_{t}^{l}(\cdot),V_{t}^{u}(\cdot),V_{t}^{l}(\cdot) defined in the dynamic programs of Section III-A. First, for any 0α10\leq\alpha\leq 1 and πT+1𝒮T+1\pi_{T+1}\in{\mathcal{S}}_{T+1}, define VT+1u(απT+1):=0V^{u}_{T+1}(\alpha\pi_{T+1}):=0. We can then define the following functions for every tTt\leq T in a backward inductive manner: For any γtiti,i=1,2\gamma_{t}^{i}\in\mathcal{B}_{t}^{i},i=1,2, 0α10\leq\alpha\leq 1 and πt𝒮t\pi_{t}\in{\mathcal{S}}_{t}, let

wtu(απt,γt1,γt2)\displaystyle w^{u}_{t}(\alpha\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}) :=c~t(απt,γt1,γt2)+zt+1[𝒫tm(απt,γt1:2;zt+1)Vt+1u(Ft(απt,γt1:2,zt+1))]\displaystyle:=\tilde{c}_{t}(\alpha\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2})+\sum_{z_{t+1}}\big{[}\mathscr{P}_{t}^{m}(\alpha\pi_{t},\gamma_{t}^{1:2};z_{t+1})V^{u}_{t+1}(F_{t}(\alpha\pi_{t},\gamma_{t}^{1:2},z_{t+1}))\big{]} (63)
Vtu(απt)\displaystyle V^{u}_{t}(\alpha\pi_{t}) :=infγt1supγt2wtu(απt,γt1,γt2).\displaystyle:=\inf_{{\gamma}_{t}^{1}}\sup_{\gamma_{t}^{2}}w_{t}^{u}(\alpha\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}). (64)

Note that when α=1\alpha=1, the above definition of wtuw^{u}_{t} is equal to the definition of wtuw^{u}_{t} in equation (15) of the dynamic program. We can similarly extend wtlw^{l}_{t} and VtlV^{l}_{t}.

Lemma 6.

For any constant 0α10\leq\alpha\leq 1 and any πt𝒮¯t\pi_{t}\in\bar{\mathcal{S}}_{t}, we have αVtu(πt)=Vtu(απt)\alpha V_{t}^{u}(\pi_{t})=V_{t}^{u}(\alpha\pi_{t}) and αVtl(πt)=Vtl(απt)\alpha V_{t}^{l}(\pi_{t})=V_{t}^{l}(\alpha\pi_{t}).

Proof.

The proof can be easily obtained from the above definitions of the extended functions. ∎

C-C Proof of Lemma 2

We will first prove inductively that the function wtu:𝒮¯t×t1×t2w_{t}^{u}:\bar{\mathcal{S}}_{t}\times\mathcal{B}_{t}^{1}\times\mathcal{B}_{t}^{2}\to{\mathbb{R}} is continuous. Let us assume that the value function Vt+1uV_{t+1}^{u} is continuous for some tTt\leq T. Note that this assumption clearly holds at t=Tt=T. At time tt, we have

wtu(πt,γt1,γt2)\displaystyle w^{u}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}) =c~t(πt,γt1,γt2)+𝔼[Vt+1u(Ft(πt,γt1:2,Zt+1))πt,γt1:2]\displaystyle=\tilde{c}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2})+{\mathbb{E}}[V^{u}_{t+1}(F_{t}(\pi_{t},\gamma_{t}^{1:2},{{Z}}_{t+1}))\mid\pi_{t},\gamma_{t}^{1:2}] (65)
=c~t(πt,γt1,γt2)+zt+1𝒫tm(πt,γt1:2;zt+1)Vt+1u(Ft(πt,γt1:2,zt+1))\displaystyle=\tilde{c}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2})+\sum_{z_{t+1}}\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1})V_{t+1}^{u}(F_{t}(\pi_{t},\gamma_{t}^{1:2},{z}_{t+1})) (66)
=c~t(πt,γt1,γt2)+zt+1Vt+1u(𝒫tj(πt,γt1:2;zt+1,,)),\displaystyle=\tilde{c}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2})+\sum_{z_{t+1}}V_{t+1}^{u}(\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2};{z}_{t+1},\cdot,\cdot)), (67)

where the last equality follows from the homogeneity property of the value functions in Lemma 6 and the structure of the belief update in (54). The first term in (67) is clearly continuous (see (58)). Also, the transformation 𝒫tj(πt,γt1:2;zt+1,,)\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2};z_{t+1},\cdot,\cdot) defined in (45) is a continuous function. Therefore, by our induction hypothesis, the composition Vt+1u(𝒫tj())V_{t+1}^{u}(\mathscr{P}_{t}^{j}(\cdot)) is continuous in (πt,γt1:2)(\pi_{t},\gamma_{t}^{1:2}) for every common observation zt+1z_{t+1}. Thus, wtuw_{t}^{u} is continuous in its arguments. To complete our inductive argument, we need to show that VtuV_{t}^{u} is a continuous function; to this end, we use Berge’s maximum theorem (Lemma 17.31 in [30]). Since wtuw_{t}^{u} is continuous and t2\mathcal{B}_{t}^{2} is compact, we can use the maximum theorem to conclude that the following function

vtu(πt,γt1)supγt2wtu(πt,γt1,γt2),\displaystyle v_{t}^{u}(\pi_{t},\gamma_{t}^{1})\doteq\sup_{\gamma_{t}^{2}}w^{u}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}), (68)

is continuous in πt,γt1\pi_{t},\gamma_{t}^{1}. Once again, we can use the maximum theorem to conclude that

Vtu(πt)=infγt1vtu(πt,γt1)\displaystyle V_{t}^{u}(\pi_{t})=\inf_{\gamma_{t}^{1}}v_{t}^{u}(\pi_{t},\gamma_{t}^{1}) (69)

is continuous in πt\pi_{t}. This concludes our inductive argument. Also, because of the continuity of vtuv_{t}^{u} in (68), we can use the measurable selection condition (see Condition 3.3.2 in [36]) to prove the existence of the measurable mapping Ξt1(πt)\Xi_{t}^{1}(\pi_{t}) as defined in Lemma 2. A similar argument can be made to establish the existence of a max-minimizer and a measurable mapping Ξt2(πt)\Xi_{t}^{2}(\pi_{t}) as defined in Lemma 2. This concludes the proof of Lemma 2.

Appendix D Proof of Theorem 2

Let us first define a distribution Π~t\tilde{\Pi}_{t} over the space 𝒳t×𝒫t1×𝒫t2\mathcal{X}_{t}\times{\mathcal{P}}_{t}^{1}\times{\mathcal{P}}_{t}^{2} in the following manner. The distribution Π~t\tilde{\Pi}_{t}, given Ct,Γ1:t11:2C_{t},\Gamma_{1:t-1}^{1:2}, is recursively obtained using the following relation

Π~1(x1,p11,p12)\displaystyle\tilde{\Pi}_{1}(x_{1},p_{1}^{1},p_{1}^{2}) =[X1=x1,P11=p11,P12=p12C1]x1,p11,p12,\displaystyle={\mathbb{P}}[X_{1}=x_{1},P_{1}^{1}=p_{1}^{1},P_{1}^{2}=p_{1}^{2}\mid C_{1}]~{}\forall\;x_{1},p_{1}^{1},p_{1}^{2}, (70)
Π~τ+1\displaystyle\tilde{\Pi}_{\tau+1} =Fτ(Π~τ,Γτ1,Γτ2,Zτ+1),τ1,\displaystyle=F_{\tau}(\tilde{\Pi}_{\tau},\Gamma_{\tau}^{1},\Gamma_{\tau}^{2},{Z}_{\tau+1}),~{}~{}\tau\geq 1, (71)

where FτF_{\tau} is as defined in Definition 6 in Appendix B. We refer to this distribution as the strategy-independent common information belief (SI-CIB).
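Computationally, the SI-CIB can be generated by iterating the update maps of Definition 6 along a realized trajectory; a minimal sketch, assuming the maps F_tau are supplied as Python callables (an assumption about the representation):

```python
def si_cib_trajectory(pi1, F, prescriptions, observations):
    """Compute the SI-CIB recursion (70)-(71): start from Pi_1 (built from C_1)
    and apply Pi_{tau+1} = F_tau(Pi_tau, Gamma_tau^1, Gamma_tau^2, Z_{tau+1}).

    pi1:           initial belief Pi_1 as in (70)
    F:             list of belief-update maps F_tau from Definition 6
    prescriptions: list of pairs (gamma_tau^1, gamma_tau^2)
    observations:  realized common observations z_2, z_3, ...
    """
    beliefs = [pi1]
    for tau, ((g1, g2), z_next) in enumerate(zip(prescriptions, observations)):
        beliefs.append(F[tau](beliefs[-1], g1, g2, z_next))
    return beliefs
```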

Let χ~1~1\tilde{\chi}^{1}\in\tilde{\mathcal{H}}^{1} be any strategy for virtual player 1 in game 𝒢e\mathscr{G}_{e}. Consider the problem of obtaining virtual player 2’s best response to the strategy χ~1\tilde{\chi}^{1} with respect to the cost 𝒥(χ~1,χ~2)\mathcal{J}(\tilde{\chi}^{1},\tilde{\chi}^{2}) defined in (11). This problem can be formulated as a Markov decision process (MDP) with common information and prescription history Ct,Γ1:t11:2C_{t},\Gamma_{1:t-1}^{1:2} as the state. The control action at time tt in this MDP is Γt2\Gamma_{t}^{2}, which is selected based on the information Ct,Γ1:t11:2C_{t},\Gamma_{1:t-1}^{1:2} using strategy χ~2~2\tilde{\chi}^{2}\in\tilde{\mathcal{H}}^{2}. The evolution of the state Ct,Γ1:t11:2C_{t},\Gamma_{1:t-1}^{1:2} of this MDP is as follows

{Ct+1,Γ1:t1:2}={Ct,Zt+1,Γ1:t11:2,χ~t1(Ct,Γ1:t11:2),Γt2},\displaystyle\{C_{t+1},\Gamma_{1:t}^{1:2}\}=\{C_{t},Z_{t+1},\Gamma_{1:t-1}^{1:2},\tilde{\chi}^{1}_{t}(C_{t},\Gamma_{1:t-1}^{1:2}),\Gamma_{t}^{2}\}, (72)

where

(χ~1,χ~2)[Zt+1=zt+1Ct,Γ1:t11:2,Γt2]=𝒫tm[Π~t,Γt1,Γt2;zt+1],\displaystyle{\mathbb{P}}^{(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2})}[Z_{t+1}=z_{t+1}\mid C_{t},\Gamma_{1:t-1}^{1:2},\Gamma_{t}^{2}]=\mathscr{P}_{t}^{m}[\tilde{\Pi}_{t},\Gamma_{t}^{1},\Gamma_{t}^{2};z_{t+1}], (73)

almost surely. Here, Γt1=χ~t1(Ct,Γ1:t11:2)\Gamma_{t}^{1}=\tilde{\chi}^{1}_{t}(C_{t},\Gamma_{1:t-1}^{1:2}) and the transformation 𝒫tm\mathscr{P}_{t}^{m} is as defined in Definition 6 in Appendix B. Notice that due to Lemma 1, the common information belief Πt\Pi_{t} associated with any strategy profile (χ~1,χ~2){(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2})} is equal to Πt~\tilde{\Pi_{t}} almost surely. This results in the state evolution equation in (73). The objective of this MDP is to maximize, for a given χ~1\tilde{\chi}^{1}, the following cost

𝔼(χ~1,χ~2)[t=1Tc~t(Π~t,Γt1,Γt2)],\displaystyle{\mathbb{E}}^{(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2})}\left[\sum_{t=1}^{T}\tilde{c}_{t}(\tilde{\Pi}_{t},\Gamma_{t}^{1},\Gamma_{t}^{2})\right], (74)

where c~t\tilde{c}_{t} is as defined in equation (58). Due to Lemma 1, the total expected cost defined above is equal to the cost 𝒥(χ~1,χ~2){\mathcal{J}}(\tilde{{\chi}}^{1},\tilde{{\chi}}^{2}) defined in (11).

The MDP described above can be solved using the following dynamic program. For every realization of virtual players’ information cT+1,γ1:T1:2c_{T+1},\gamma_{1:T}^{1:2}, define

VT+1χ~1(cT+1,γ1:T1:2):=0.V^{\tilde{\chi}^{1}}_{T+1}(c_{T+1},\gamma_{1:T}^{1:2}):=0.

In a backward inductive manner, for each time tTt\leq T and each realization ct,γ1:t11:2c_{t},\gamma_{1:t-1}^{1:2}, define

Vtχ~1(ct,γ1:t11:2)\displaystyle V^{\tilde{\chi}^{1}}_{t}(c_{t},\gamma_{1:t-1}^{1:2}) :=supγt2[c~t(π~t,γt1,γt2)+𝔼[Vt+1χ~1(ct,Zt+1,γ1:t1:2)ct,γ1:t1:2]],\displaystyle:=\sup_{\gamma_{t}^{2}}[\tilde{c}_{t}(\tilde{\pi}_{t},\gamma_{t}^{1},\gamma_{t}^{2})+{\mathbb{E}}[V^{\tilde{\chi}^{1}}_{t+1}(c_{t},Z_{t+1},\gamma_{1:t}^{1:2})\mid c_{t},\gamma_{1:t}^{1:2}]], (75)

where γt1=χ~t1(ct,γ1:t11:2)\gamma_{t}^{1}=\tilde{\chi}^{1}_{t}(c_{t},\gamma_{1:t-1}^{1:2}) and π~t\tilde{\pi}_{t} is the SI-CIB associated with the information ct,γ1:t11:2c_{t},\gamma_{1:t-1}^{1:2}. Note that the measurable selection condition (see condition 3.3.2 in [36]) holds for the dynamic program described above. Thus, the value functions Vtχ~1()V^{\tilde{\chi}^{1}}_{t}(\cdot) are measurable and there exists a measurable best-response strategy for player 2 which is a solution to the dynamic program described above. Therefore, we have

supχ~2𝒥(χ~1,χ~2)=𝔼V1χ~1(C1).\displaystyle\sup_{\tilde{\chi}^{2}}\mathcal{J}(\tilde{\chi}^{1},\tilde{\chi}^{2})={\mathbb{E}}V^{\tilde{\chi}^{1}}_{1}(C_{1}). (76)
Claim D.1.

For any strategy χ~1~1\tilde{\chi}^{1}\in\tilde{\mathcal{H}}^{1} and for any realization of virtual players’ information ct,γ1:t11:2c_{t},\gamma_{1:t-1}^{1:2}, we have

Vtχ~1(ct,γ1:t11:2)Vtu(π~t),\displaystyle V^{\tilde{\chi}^{1}}_{t}(c_{t},\gamma_{1:t-1}^{1:2})\geq V_{t}^{u}(\tilde{\pi}_{t}), (77)

where VtuV_{t}^{u} is as defined in (15) and π~t\tilde{\pi}_{t} is the SI-CIB belief associated with the instance ct,γ1:t11:2c_{t},\gamma_{1:t-1}^{1:2}. As a consequence, we have

supχ~2𝒥(χ~1,χ~2)𝔼V1u(Π1).\displaystyle\sup_{\tilde{\chi}^{2}}\mathcal{J}(\tilde{\chi}^{1},\tilde{\chi}^{2})\geq{\mathbb{E}}V^{u}_{1}(\Pi_{1}). (78)
Proof.

The proof is by backward induction. Clearly, the claim is true at time t=T+1t=T+1. Assume that the claim is true for all times greater than tt. Then we have

Vtχ~1(ct,γ1:t11:2)\displaystyle V^{\tilde{\chi}^{1}}_{t}(c_{t},\gamma_{1:t-1}^{1:2}) =supγt2[c~t(π~t,γt1,γt2)+𝔼[Vt+1χ~1(ct,Zt+1,γ1:t1:2)ct,γ1:t1:2]]\displaystyle=\sup_{\gamma_{t}^{2}}[\tilde{c}_{t}(\tilde{\pi}_{t},\gamma_{t}^{1},\gamma_{t}^{2})+{\mathbb{E}}[V^{\tilde{\chi}^{1}}_{t+1}(c_{t},Z_{t+1},\gamma_{1:t}^{1:2})\mid c_{t},\gamma_{1:t}^{1:2}]]
supγt2[c~t(π~t,γt1,γt2)+𝔼[Vt+1u(Ft(π~t,γt1:2,Zt+1))ct,γ1:t1:2]]\displaystyle\geq\sup_{\gamma_{t}^{2}}[\tilde{c}_{t}(\tilde{\pi}_{t},\gamma_{t}^{1},\gamma_{t}^{2})+{\mathbb{E}}[V^{u}_{t+1}(F_{t}(\tilde{\pi}_{t},\gamma_{t}^{1:2},Z_{t+1}))\mid c_{t},\gamma_{1:t}^{1:2}]]
Vtu(π~t).\displaystyle\geq V_{t}^{u}(\tilde{\pi}_{t}).

The first equality follows from the definition in (75) and the inequality after that follows from the induction hypothesis. The last inequality is a consequence of the definition of the value function VtuV_{t}^{u}. This completes the induction argument. Further, using Claim D.1 and the result in (76), we have

supχ~2𝒥(χ~1,χ~2)=𝔼V1χ~1(C1)𝔼V1u(Π~1)=𝔼V1u(Π1).\displaystyle\sup_{\tilde{\chi}^{2}}\mathcal{J}(\tilde{\chi}^{1},\tilde{\chi}^{2})={\mathbb{E}}V^{\tilde{\chi}^{1}}_{1}(C_{1})\geq{\mathbb{E}}V^{u}_{1}(\tilde{\Pi}_{1})={\mathbb{E}}V^{u}_{1}(\Pi_{1}).

We can therefore say that

Su(𝒢e)\displaystyle S^{u}(\mathscr{G}_{e}) =infχ~1supχ~2𝒥(χ~1,χ~2)infχ~1𝔼V1u(Π1)=𝔼V1u(Π1).\displaystyle=\inf_{\tilde{\chi}^{1}}\sup_{\tilde{\chi}^{2}}\mathcal{J}(\tilde{\chi}^{1},\tilde{\chi}^{2})\geq\inf_{\tilde{\chi}^{1}}{\mathbb{E}}V^{u}_{1}(\Pi_{1})={\mathbb{E}}V^{u}_{1}(\Pi_{1}). (79)

Further, for the strategy χ~1\tilde{\chi}^{1*} defined in Definition 3, the inequalities (77) and (78) hold with equality. We can prove this using an inductive argument similar to the one used to prove Claim D.1. Therefore, we have

Su(𝒢e)\displaystyle S^{u}(\mathscr{G}_{e}) =infχ~1supχ~2𝒥(χ~1,χ~2)supχ~2𝒥(χ~1,χ~2)=𝔼V1χ~1(C1)=𝔼V1u(Π1).\displaystyle=\inf_{\tilde{\chi}^{1}}\sup_{\tilde{\chi}^{2}}\mathcal{J}(\tilde{\chi}^{1},\tilde{\chi}^{2})\leq\sup_{\tilde{\chi}^{2}}\mathcal{J}(\tilde{\chi}^{1*},\tilde{\chi}^{2})={\mathbb{E}}V^{\tilde{\chi}^{1*}}_{1}(C_{1})={\mathbb{E}}V^{u}_{1}(\Pi_{1}). (80)

Combining (79) and (80), we have

Su(𝒢e)=𝔼V1u(Π1).\displaystyle S^{u}(\mathscr{G}_{e})={\mathbb{E}}V^{u}_{1}(\Pi_{1}).

Thus, the inequality in (80) holds with equality which leads us to the result that the strategy χ~1\tilde{\chi}^{1*} is a min-max strategy in game 𝒢e\mathscr{G}_{e}. A similar argument can be used to show that

Sl(𝒢e)=𝔼V1l(Π1),\displaystyle S^{l}(\mathscr{G}_{e})={\mathbb{E}}V^{l}_{1}(\Pi_{1}),

and that the strategy χ~2\tilde{\chi}^{2*} defined in Definition 3 is a max-min strategy in game 𝒢e\mathscr{G}_{e}.

Appendix E Information Structures

E-A Team 2 does not control the state

Consider an instance of Game 𝒢\mathscr{G} in which the state evolution and the players’ observations are given by

Xt+1=ft(Xt,Ut1,Wts);\displaystyle{{X}}_{t+1}=f_{t}({{X}}_{t},{{U}}_{t}^{1},{{W}}_{t}^{s}); Yti,j=hti,j(Xt,Ut11,Wti,j).\displaystyle{{Y}}_{t}^{i,j}=h_{t}^{i,j}({{X}}_{t},{{U}}_{t-1}^{1},{{W}}_{t}^{i,j}).

Further, let the information structure of the players be such that the common and private information evolve as

Zt+1\displaystyle{{Z}}_{t+1} =Ct+1Ct=ζt+1(Pt1:2,Ut1,Yt+11:2)\displaystyle=C_{t+1}\setminus C_{t}=\zeta_{t+1}({{P}}_{t}^{1:2},{{U}}_{t}^{1},{{Y}}_{t+1}^{1:2}) (81)
Pt+1i\displaystyle{{P}}^{i}_{t+1} =ξt+1i(Pt1:2,Ut1,Yt+11:2).\displaystyle=\xi_{t+1}^{i}({{P}}_{t}^{1:2},{{U}}_{t}^{1},{{Y}}_{t+1}^{1:2}). (82)

In this model, Team 2’s actions do not affect the system evolution, and Team 2’s past actions are not used by any of the players to select their current actions. Any information structure that satisfies these conditions satisfies Assumption 2; see Appendix F-C for a proof.

E-B Global and local states

Consider a system in which the system state Xt=(Xt0,Xt1,Xt2)X_{t}=(X_{t}^{0},X_{t}^{1},X_{t}^{2}) comprises a global state Xt0X_{t}^{0} and a local state Xti=(Xti,1,,Xti,Ni)X_{t}^{i}=(X_{t}^{i,1},\dots,X_{t}^{i,N_{i}}) for Team i=1,2i=1,2. The state evolution is given by

Xt+10\displaystyle{{X}}^{0}_{t+1} =ft1(Xt0,Xt1,Ut1,Ut2,Wts,1)\displaystyle=f_{t}^{1}({{X}}_{t}^{0},X_{t}^{1},{{U}}_{t}^{1},U_{t}^{2},{{W}}_{t}^{s,1}) (83)
Xt+11\displaystyle{{X}}^{1}_{t+1} =ft2(Xt0,Xt1,Ut1,Ut2,Wts,2)\displaystyle=f_{t}^{2}({{X}}_{t}^{0},X_{t}^{1},{{U}}_{t}^{1},U_{t}^{2},{{W}}_{t}^{s,2}) (84)
Xt+12\displaystyle{{X}}^{2}_{t+1} =ft3(Xt0,Ut1,Ut2,Wts,3).\displaystyle=f_{t}^{3}({{X}}_{t}^{0},{{U}}_{t}^{1},U_{t}^{2},{{W}}_{t}^{s,3}). (85)

Note that Team 2’s current local state does not affect the state evolution. Further, we have

Ct\displaystyle C_{t} ={X1:t0,U1:t11,U1:t12}\displaystyle=\{X_{1:t}^{0},U_{1:t-1}^{1},U_{1:t-1}^{2}\} (86)
Pt1,j\displaystyle P_{t}^{1,j} ={Xt1,j}j=1,,N1\displaystyle=\{X_{t}^{1,j}\}\quad\forall j=1,\dots,N_{1} (87)
Pt2,j\displaystyle P_{t}^{2,j} ={Xt2,j}j=1,,N2.\displaystyle=\{X_{t}^{2,j}\}\quad\forall j=1,\dots,N_{2}. (88)

Under these system dynamics and this information structure, we can prove that Assumption 2 holds; the proof is provided in Appendix F-D.

Appendix F Information Structures that Satisfy Assumption 2: Proofs

For each model in Section IV-A, we construct transformations Qt:𝒮t×t1×𝒵t+1|𝒳t+1×𝒫t+11:2|Q_{t}:\mathcal{S}_{t}\times\mathcal{B}_{t}^{1}\times\mathcal{Z}_{t+1}\to{\mathbb{R}}^{|\mathcal{X}_{t+1}\times\mathcal{P}_{t+1}^{1:2}|} and Rt:𝒮t×t1×𝒵t+1R_{t}:\mathcal{S}_{t}\times\mathcal{B}_{t}^{1}\times\mathcal{Z}_{t+1}\to{\mathbb{R}} at each time tt such that

𝒫tm(πt,γt1:2;zt+1)\displaystyle\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1}) Rt(πt,γt1,zt+1)πt,γt1:2,zt+1,\displaystyle\leq R_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1})\quad\forall\pi_{t},\gamma_{t}^{1:2},z_{t+1}, (89a)
𝒫tj(πt,γt1:2,zt+1,)𝒫tm(πt,γt1:2;zt+1)\displaystyle\frac{\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2},z_{t+1},\cdot)}{\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1})} =Qt(πt,γt1,zt+1)Rt(πt,γt1,zt+1)𝒫tm(πt,γt1:2;zt+1)>0.\displaystyle=\frac{Q_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1})}{R_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1})}\quad\forall\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1})>0. (89b)

Note that the transformations QtQ_{t} and RtR_{t} do not make use of virtual player 2’s prescription γt2\gamma_{t}^{2}. Following the methodology in Definition 6, we define FtF_{t} as

Ft(πt,γt1:2,zt+1)\displaystyle F_{t}(\pi_{t},\gamma_{t}^{1:2},z_{t+1}) ={𝒫tj(πt,γt1:2,zt+1,)𝒫tm(πt,γt1:2;zt+1)if 𝒫tm(πt,γt1:2;zt+1)>0Gt(πt,γt1:2,zt+1)otherwise,\displaystyle=\begin{cases}\frac{\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2},z_{t+1},\cdot)}{\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1})}&\text{if }\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1})>0\\ G_{t}(\pi_{t},\gamma_{t}^{1:2},z_{t+1})&\text{otherwise},\end{cases} (90)

where the transformation GtG_{t} is chosen to be

Gt(πt,γt1:2,zt+1)\displaystyle G_{t}(\pi_{t},\gamma_{t}^{1:2},z_{t+1}) ={Qt(πt,γt1,zt+1)Rt(πt,γt1,zt+1)if Rt(πt,γt1,zt+1)>0𝒰(𝒳t+1×𝒫t+11:2)otherwise,\displaystyle=\begin{cases}\frac{Q_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1})}{R_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1})}&\text{if }R_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1})>0\\ \mathscr{U}({\mathcal{X}}_{t+1}\times\mathcal{P}_{t+1}^{1:2})&\text{otherwise},\end{cases} (91)

where 𝒰(𝒳t+1×𝒫t+11:2)\mathscr{U}({\mathcal{X}}_{t+1}\times\mathcal{P}_{t+1}^{1:2}) is the uniform distribution over the space 𝒳t+1×𝒫t+11:2{\mathcal{X}}_{t+1}\times\mathcal{P}_{t+1}^{1:2}. Since the transformations QtQ_{t} and RtR_{t} satisfy (89a) and (89b), we can simplify the expression for the transformation FtF_{t} in (90) to obtain the following

Ft(πt,γt1:2,zt+1)\displaystyle F_{t}(\pi_{t},\gamma_{t}^{1:2},z_{t+1}) ={Qt(πt,γt1,zt+1)Rt(πt,γt1,zt+1)if Rt(πt,γt1,zt+1)>0𝒰(𝒳t+1×𝒫t+11:2)otherwise.\displaystyle=\begin{cases}\frac{Q_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1})}{R_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1})}&\text{if }R_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1})>0\\ \mathscr{U}({\mathcal{X}}_{t+1}\times\mathcal{P}_{t+1}^{1:2})&\text{otherwise}.\end{cases} (92)

This concludes the construction of an update rule FtF_{t} in the class \mathscr{B} that does not use virtual player 2’s prescription γt2\gamma_{t}^{2}. We will now describe the construction of the transformations QtQ_{t} and RtR_{t} for each information structure in Section IV-A.
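As an aside, once Q_t and R_t satisfying (89a)–(89b) are available, the update (92) can be evaluated without virtual player 2's prescription. A minimal sketch, assuming Q_t and R_t are supplied as callables returning an unnormalized array and its total mass:

```python
import numpy as np

def cib_update_team1_only(pi, gamma1, z, Q, R, next_shape):
    """Belief update (92): F_t depends only on (pi_t, gamma_t^1, z_{t+1}).

    Q(pi, gamma1, z): unnormalized array over (x_{t+1}, p_{t+1}^1, p_{t+1}^2)
    R(pi, gamma1, z): its total mass (a scalar)
    next_shape:       shape of the next-step belief array
    """
    r = R(pi, gamma1, z)
    if r > 0:
        return Q(pi, gamma1, z) / r
    # As in (91), fall back to the uniform distribution when R vanishes.
    return np.full(next_shape, 1.0 / np.prod(next_shape))
```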

F-A All players in Team 2 have the same information

In this case, Team 2 does not have any private information and any instance of the common observation zt+1z_{t+1} includes Team 2’s action at time tt (denote it with u^t2\hat{u}_{t}^{2}). The corresponding transformation 𝒫tj\mathscr{P}_{t}^{j} (see Definition 6) has the following form.

𝒫tj(πt,γt1:2;zt+1,xt+1,pt+11)\displaystyle\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2};z_{t+1},x_{t+1},p_{t+1}^{1}) (93)
=xt,pt1:2,ut1:2πt(xt,pt1:2)γt1(pt1;ut1)γt2(pt2;ut2)[xt+1,pt+11:2,zt+1xt,pt1:2,ut1:2]\displaystyle=\sum_{{x}_{t},{p}^{1:2}_{t},{u}_{t}^{1:2}}\pi_{t}({x}_{t},{p}_{t}^{1:2})\gamma_{t}^{1}({p}_{t}^{1};u_{t}^{1})\gamma_{t}^{2}({p}_{t}^{2};u_{t}^{2}){\mathbb{P}}[{x}_{t+1},{p}_{t+1}^{1:2},{z}_{t+1}\mid{x}_{t},{p}_{t}^{1:2},{u}_{t}^{1:2}] (94)
=γt2(;u^t2)xt,pt1,ut1πt(xt,pt1)γt1(pt1;ut1)[xt+1,pt+11,zt+1xt,pt1,ut1,u^t2]\displaystyle=\gamma_{t}^{2}(\varnothing;\hat{u}_{t}^{2})\sum_{{x}_{t},{p}^{1}_{t},{u}_{t}^{1}}\pi_{t}({x}_{t},{p}_{t}^{1})\gamma_{t}^{1}({p}_{t}^{1};u_{t}^{1}){\mathbb{P}}[{x}_{t+1},{p}_{t+1}^{1},{z}_{t+1}\mid{x}_{t},{p}_{t}^{1},{u}_{t}^{1},\hat{u}_{t}^{2}] (95)
=:γt2(;u^t2)Qt(πt,γt1,zt+1;xt+1,pt+11).\displaystyle=:\gamma_{t}^{2}(\varnothing;\hat{u}_{t}^{2})Q_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1};x_{t+1},p_{t+1}^{1}). (96)

Here, we use the fact that Team 2 does not have any private information and u^t2\hat{u}_{t}^{2} is a part of the common observation zt+1z_{t+1}. Similarly, the corresponding transformation 𝒫tm\mathscr{P}_{t}^{m} (see Definition 6) has the following form.

𝒫tm(πt,γt1:2;zt+1)\displaystyle\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1}) =γt2(;u^t2)xt+1,pt+11Qt(πt,γt1,zt+1;xt+1,pt+11)\displaystyle=\gamma_{t}^{2}(\varnothing;\hat{u}_{t}^{2})\sum_{x_{t+1},p_{t+1}^{1}}Q_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1};x_{t+1},p_{t+1}^{1}) (97)
=:γt2(;u^t2)Rt(πt,γt1,zt+1).\displaystyle=:\gamma_{t}^{2}(\varnothing;\hat{u}_{t}^{2})R_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1}). (98)

Using the results (96) and (98), we can easily conclude that the transformations QtQ_{t} and RtR_{t} defined above satisfy the conditions (89a) and (89b).

F-B Team 2’s observations become common information with a delay of one-step

In this case, any instance of the common observation zt+1z_{t+1} includes Team 2’s private information at time tt (denote it with p^t2\hat{p}_{t}^{2}) and Team 2’s action at time tt (denote it with u^t2\hat{u}_{t}^{2}). The corresponding transformation 𝒫tj\mathscr{P}_{t}^{j} (see Definition 6) has the following form.

𝒫tj(πt,γt1:2;zt+1,xt+1,pt+11:2)\displaystyle\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2};z_{t+1},x_{t+1},p_{t+1}^{1:2}) (99)
=xt,pt1:2,ut1:2πt(xt,pt1:2)γt1(pt1;ut1)γt2(pt2;ut2)[xt+1,pt+11:2,zt+1xt,pt1:2,ut1:2]\displaystyle=\sum_{{x}_{t},{p}^{1:2}_{t},{u}_{t}^{1:2}}\pi_{t}({x}_{t},{p}_{t}^{1:2})\gamma_{t}^{1}({p}_{t}^{1};u_{t}^{1})\gamma_{t}^{2}({p}_{t}^{2};u_{t}^{2}){\mathbb{P}}[{x}_{t+1},{p}_{t+1}^{1:2},{z}_{t+1}\mid{x}_{t},{p}_{t}^{1:2},{u}_{t}^{1:2}] (100)
=γt2(p^t2;u^t2)xt,pt1,ut1πt(xt,pt1,p^t2)γt1(pt1;ut1)[xt+1,pt+11:2,zt+1xt,pt1,p^t2,ut1,u^t2]\displaystyle=\gamma_{t}^{2}(\hat{p}_{t}^{2};\hat{u}_{t}^{2})\sum_{{x}_{t},{p}^{1}_{t},{u}_{t}^{1}}\pi_{t}({x}_{t},{p}_{t}^{1},\hat{p}_{t}^{2})\gamma_{t}^{1}({p}_{t}^{1};u_{t}^{1}){\mathbb{P}}[{x}_{t+1},{p}_{t+1}^{1:2},{z}_{t+1}\mid{x}_{t},{p}_{t}^{1},\hat{p}_{t}^{2},{u}_{t}^{1},\hat{u}_{t}^{2}] (101)
=:γt2(p^t2;u^t2)Qt(πt,γt1,zt+1;xt+1,pt+11:2).\displaystyle=:\gamma_{t}^{2}(\hat{p}_{t}^{2};\hat{u}_{t}^{2})Q_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1};x_{t+1},p_{t+1}^{1:2}). (102)

Here, we use the fact that both p^t2\hat{p}_{t}^{2} and u^t2\hat{u}_{t}^{2} are part of the common observation zt+1z_{t+1}. Similarly, the corresponding transformation 𝒫tm\mathscr{P}_{t}^{m} (see Definition 6) has the following form.

𝒫tm(πt,γt1:2;zt+1)\displaystyle\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1}) =γt2(p^t2;u^t2)xt+1,pt+11:2Qt(πt,γt1,zt+1;xt+1,pt+11:2)\displaystyle=\gamma_{t}^{2}(\hat{p}_{t}^{2};\hat{u}_{t}^{2})\sum_{x_{t+1},p_{t+1}^{1:2}}Q_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1};x_{t+1},p_{t+1}^{1:2}) (103)
=:γt2(p^t2;u^t2)Rt(πt,γt1,zt+1).\displaystyle=:\gamma_{t}^{2}(\hat{p}_{t}^{2};\hat{u}_{t}^{2})R_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1}). (104)

Using the results (102) and (104), we can conclude that the transformations QtQ_{t} and RtR_{t} defined above satisfy the conditions (89a) and (89b).

F-C Team 2 does not control the state

In this case, the corresponding transformation 𝒫tj\mathscr{P}_{t}^{j} (see Definition 6) has the following form.

𝒫tj(πt,γt1:2;zt+1,xt+1,pt+11:2)\displaystyle\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2};z_{t+1},x_{t+1},p_{t+1}^{1:2}) (105)
=xt,pt1:2,ut1:2πt(xt,pt1:2)γt1(pt1;ut1)γt2(pt2;ut2)[xt+1,pt+11:2,zt+1xt,pt1:2,ut1:2]\displaystyle=\sum_{{x}_{t},{p}^{1:2}_{t},{u}_{t}^{1:2}}\pi_{t}({x}_{t},{p}_{t}^{1:2})\gamma_{t}^{1}({p}_{t}^{1};u_{t}^{1})\gamma_{t}^{2}({p}_{t}^{2};u_{t}^{2}){\mathbb{P}}[{x}_{t+1},{p}_{t+1}^{1:2},{z}_{t+1}\mid{x}_{t},{p}_{t}^{1:2},{u}_{t}^{1:2}] (106)
=xt,pt1:2,ut1:2πt(xt,pt1:2)γt1(pt1;ut1)γt2(pt2;ut2)[xt+1,pt+11:2,zt+1xt,pt1:2,ut1]\displaystyle=\sum_{{x}_{t},{p}^{1:2}_{t},{u}_{t}^{1:2}}\pi_{t}({x}_{t},{p}_{t}^{1:2})\gamma_{t}^{1}({p}_{t}^{1};u_{t}^{1})\gamma_{t}^{2}({p}_{t}^{2};u_{t}^{2}){\mathbb{P}}[{x}_{t+1},{p}_{t+1}^{1:2},{z}_{t+1}\mid{x}_{t},{p}_{t}^{1:2},{u}_{t}^{1}] (107)
=xt,pt1:2,ut1πt(xt,pt1:2)γt1(pt1;ut1)[xt+1,pt+11:2,zt+1xt,pt1:2,ut1]\displaystyle=\sum_{{x}_{t},{p}^{1:2}_{t},{u}_{t}^{1}}\pi_{t}({x}_{t},{p}_{t}^{1:2})\gamma_{t}^{1}({p}_{t}^{1};u_{t}^{1}){\mathbb{P}}[{x}_{t+1},{p}_{t+1}^{1:2},{z}_{t+1}\mid{x}_{t},{p}_{t}^{1:2},{u}_{t}^{1}] (108)
=:Qt(πt,γt1,zt+1;xt+1,pt+11:2).\displaystyle=:Q_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1};x_{t+1},p_{t+1}^{1:2}). (109)

Note that (107) holds because ut2u_{t}^{2} does not influence the evolution of the state, the common information, or the private information, and (108) follows by summing over ut2u_{t}^{2}. Similarly, the corresponding transformation 𝒫tm\mathscr{P}_{t}^{m} (see Definition 6) has the following form.

𝒫tm(πt,γt1:2;zt+1)\displaystyle\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1}) =xt+1,pt+11:2Qt(πt,γt1,zt+1;xt+1,pt+11:2)\displaystyle=\sum_{x_{t+1},p_{t+1}^{1:2}}Q_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1};x_{t+1},p_{t+1}^{1:2}) (110)
=:Rt(πt,γt1,zt+1).\displaystyle=:R_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1}). (111)

Using the results (109) and (111), we can conclude that the transformations QtQ_{t} and RtR_{t} satisfy the conditions (89a) and (89b).

F-D Global and local states

In this case, the private information variables of each player are part of the system state XtX_{t} because Pt1=Xt1P_{t}^{1}=X_{t}^{1} and Pt2=Xt2P_{t}^{2}=X_{t}^{2}. Therefore, the common information belief Πt\Pi_{t} is formed only on the system state. Let us first define a collection of beliefs with a particular structure

𝒮t{π𝒮t:π(x0,x1,x2)=𝟙x^(x0)π1|0(x1x0)π2(x2)x0,x1,x2},\displaystyle\mathcal{S}^{\prime}_{t}\doteq\left\{\pi\in\mathcal{S}_{t}:\pi(x_{0},x_{1},x_{2})=\mathbbm{1}_{\hat{x}}(x_{0})\pi^{1|0}(x_{1}\mid x_{0})\pi^{2}(x_{2})~{}~{}\forall x_{0},x_{1},x_{2}\right\}, (112)

where π1|0\pi^{1|0} and π2\pi^{2} denote the respective conditional and marginal distributions formed using the joint distribution π\pi, and x^\hat{x} is some realization of the global state Xt0X_{t}^{0}. Under the dynamics and information structure in this case, we can show that the belief πt\pi_{t} computed recursively using the transformation FtF_{t} as in (54) always lies in 𝒮t\mathcal{S}^{\prime}_{t} (for an appropriate initial distribution on the state X1X_{1}). Therefore, we restrict our attention to beliefs in the set 𝒮t\mathcal{S}_{t}^{\prime}. Let πt𝒮t\pi_{t}\in\mathcal{S}^{\prime}_{t} be such that

πt(x0,x1,x2)=𝟙x^(x0)πt1|0(x1x0)πt2(x2)x0,x1,x2.\displaystyle\pi_{t}(x_{0},x_{1},x_{2})=\mathbbm{1}_{\hat{x}}(x_{0})\pi_{t}^{1|0}(x_{1}\mid x_{0})\pi_{t}^{2}(x_{2})~{}~{}\forall x_{0},x_{1},x_{2}. (113)

In this case, any instance of the common observation zt+1z_{t+1} comprises the global state x^t+10\hat{x}_{t+1}^{0} and players’ actions (denoted by u^t1,u^t2\hat{u}_{t}^{1},\hat{u}_{t}^{2} for Teams 1 and 2 respectively). The corresponding transformation 𝒫tj\mathscr{P}_{t}^{j} (see Definition 6) has the following form.

𝒫tj(πt,γt1:2;zt+1,xt+1)\displaystyle\mathscr{P}_{t}^{j}(\pi_{t},\gamma_{t}^{1:2};z_{t+1},x_{t+1}) (114)
=xt,ut1:2πt(xt)γt1(xt1;ut1)γt2(xt2;ut2)[xt+1,zt+1xt,ut1:2]\displaystyle=\sum_{{x}_{t},{u}_{t}^{1:2}}\pi_{t}({x}_{t})\gamma_{t}^{1}({x}_{t}^{1};u_{t}^{1})\gamma_{t}^{2}({x}_{t}^{2};u_{t}^{2}){\mathbb{P}}[{x}_{t+1},{z}_{t+1}\mid{x}_{t},{u}_{t}^{1:2}] (115)
=(xt0,xt1𝟙x^(xt0)πt1|0(xt1xt0)γt1(xt1;u^t1)[xt+1,zt+1xt0,xt1,u^t1:2])(xt2πt2(xt2)γt2(xt2;u^t2))\displaystyle=\left(\sum_{x_{t}^{0},{x}^{1}_{t}}\mathbbm{1}_{\hat{x}}(x_{t}^{0})\pi^{1|0}_{t}(x_{t}^{1}\mid x_{t}^{0})\gamma_{t}^{1}({x}_{t}^{1};\hat{u}_{t}^{1}){\mathbb{P}}[{x}_{t+1},{z}_{t+1}\mid{x}^{0}_{t},x_{t}^{1},\hat{u}_{t}^{1:2}]\right)\left(\sum_{{x}^{2}_{t}}\pi_{t}^{2}(x_{t}^{2})\gamma_{t}^{2}({x}_{t}^{2};\hat{u}_{t}^{2})\right) (116)
=(xt1πt1|0(xt1x^)γt1(xt1;u^t1)[xt+1,zt+1x^,xt1,u^t1:2])(xt2πt2(xt2)γt2(xt2;u^t2))\displaystyle=\left(\sum_{{x}^{1}_{t}}\pi^{1|0}_{t}(x_{t}^{1}\mid\hat{x})\gamma_{t}^{1}({x}_{t}^{1};\hat{u}_{t}^{1}){\mathbb{P}}[{x}_{t+1},{z}_{t+1}\mid\hat{x},x_{t}^{1},\hat{u}_{t}^{1:2}]\right)\left(\sum_{{x}^{2}_{t}}\pi_{t}^{2}(x_{t}^{2})\gamma_{t}^{2}({x}_{t}^{2};\hat{u}_{t}^{2})\right) (117)
=:Qt(πt,γt1,zt+1;xt+1)(xt2πt2(xt2)γt2(xt2;u^t2)).\displaystyle=:Q_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1};x_{t+1})\left(\sum_{{x}^{2}_{t}}\pi_{t}^{2}(x_{t}^{2})\gamma_{t}^{2}({x}_{t}^{2};\hat{u}_{t}^{2})\right). (118)

Similarly, the corresponding transformation 𝒫tm\mathscr{P}_{t}^{m} (see Definition 6) has the following form.

𝒫tm(πt,γt1:2;zt+1)\displaystyle\mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1}) =(xt2πt2(xt2)γt2(xt2;u^t2))xt+1Qt(πt,γt1,zt+1;xt+1)\displaystyle=\left(\sum_{{x}^{2}_{t}}\pi_{t}^{2}(x_{t}^{2})\gamma_{t}^{2}({x}_{t}^{2};\hat{u}_{t}^{2})\right)\sum_{x_{t+1}}Q_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1};x_{t+1}) (119)
=:(xt2πt2(xt2)γt2(xt2;u^t2))Rt(πt,γt1,zt+1).\displaystyle=:\left(\sum_{{x}^{2}_{t}}\pi_{t}^{2}(x_{t}^{2})\gamma_{t}^{2}({x}_{t}^{2};\hat{u}_{t}^{2})\right)R_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1}). (120)

Using the results (118) and (120), we can conclude that the transformations QtQ_{t} and RtR_{t} satisfy the conditions (89a) and (89b).

Appendix G Proof of Theorem 5

Let χ~1~1\tilde{\chi}^{1*}\in\tilde{\mathcal{H}}^{1} be the min-max strategy for virtual player 1 in the expanded virtual game 𝒢e\mathscr{G}_{e} as described in Section IV-B. Note that the strategy χ~1\tilde{\chi}^{1*} uses only the common information ctc_{t} and virtual player 1’s past prescriptions γ1:t11\gamma_{1:t-1}^{1}. This is because the CIB update FtF_{t} in Assumption 2 does not depend on virtual player 2’s prescription γt2\gamma_{t}^{2}. Let us define a strategy χ11\chi^{1*}\in\mathcal{H}^{1} for virtual player 1 in the virtual game 𝒢v\mathscr{G}_{v} defined in Appendix A. At each time tt and for each instance ct𝒞tc_{t}\in\mathcal{C}_{t},

χt1(ct)Ξt1(πt).\displaystyle\chi_{t}^{1*}({c}_{t})\doteq\Xi_{t}^{1}(\pi_{t}). (121)

Here, Ξt1\Xi_{t}^{1} is the mapping obtained by solving the min-max dynamic program (see Lemma 2) and πt\pi_{t} is computed using the following relation

π1(x1,p11,p12)\displaystyle\pi_{1}(x_{1},p_{1}^{1},p_{1}^{2}) =[X1=x1,P11=p11,P12=p12C1=c1]x1,p11,p12\displaystyle={\mathbb{P}}[X_{1}=x_{1},P_{1}^{1}=p_{1}^{1},P_{1}^{2}=p_{1}^{2}\mid C_{1}=c_{1}]~{}\forall\;x_{1},p_{1}^{1},p_{1}^{2} (122)
πτ+1\displaystyle\pi_{\tau+1} =Fτ(πτ,Ξτ1(πτ),zτ+1),1τ<t,\displaystyle=F_{\tau}(\pi_{\tau},\Xi_{\tau}^{1}(\pi_{\tau}),z_{\tau+1}),~{}1\leq\tau<t, (123)

where FtF_{t} is the belief update transformation in Assumption 2. Note that the prescription χt1(ct)\chi^{1*}_{t}(c_{t}) is the same as the one obtained in the “Get prescription” step in Algorithm 1 for common information ctc_{t} in the tt-th iteration.
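The construction in (121)–(123) can be executed online: maintain the CIB with the update F_t of Assumption 2 and apply the precomputed maps Ξ_t^1 at each step. The sketch below is an illustration of this recursion (not Algorithm 1 itself); Xi1 and F are assumed to be supplied as lists of callables obtained from the solved dynamic program and from Assumption 2.

```python
def run_minmax_strategy(pi1, Xi1, F, observations):
    """Execute chi^{1*} of (121)-(123): at each time t, issue the prescription
    Xi1[t](pi_t) and update the CIB with the realized common observation.

    pi1:          initial CIB pi_1 computed from C_1 as in (122)
    Xi1:          list of maps, Xi1[t](pi) -> prescription gamma_t^1
    F:            list of maps, F[t](pi, gamma1, z) -> next CIB, as in (123)
    observations: realized common observations z_2, z_3, ...
    """
    pi, prescriptions = pi1, []
    for t, z_next in enumerate(observations):
        gamma1 = Xi1[t](pi)              # the "Get prescription" step
        prescriptions.append(gamma1)
        pi = F[t](pi, gamma1, z_next)    # CIB update (123)
    return prescriptions
```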

Using Definition 4, we have

χ1=ϱ1(χ~1,χ~2){\chi}^{1*}=\varrho^{1}(\tilde{{\chi}}^{1*},\tilde{{\chi}}^{2}) (124)

for any strategy χ~2H~2\tilde{{\chi}}^{2}\in\tilde{H}^{2}. Based on this observation and the fact that for a given χ22\chi^{2}\in\mathcal{H}^{2}, ϱ2(χ~1,χ2)=χ2\varrho^{2}(\tilde{\chi}^{1},\chi^{2})=\chi^{2} for every χ~1~1\tilde{\chi}^{1}\in\tilde{\mathcal{H}}^{1}, we have

(χ1,χ2)=ϱ(χ~1,χ2).({\chi}^{1*},{\chi}^{2})=\varrho(\tilde{{\chi}}^{1*},{{\chi}}^{2}). (125)

Further, due to Theorem 1, we have

Su(𝒢v)\displaystyle S^{u}(\mathscr{G}_{v}) Su(𝒢e)\displaystyle\geq S^{u}(\mathscr{G}_{e}) (126)
=asupχ~2~2𝒥(χ~1,χ~2)\displaystyle\stackrel{{\scriptstyle a}}{{=}}\sup_{\tilde{\chi}^{2}\in\tilde{\mathcal{H}}^{2}}{\mathcal{J}}(\tilde{{\chi}}^{1*},{\tilde{\chi}}^{2}) (127)
bsupχ22𝒥(χ~1,χ2)\displaystyle\stackrel{{\scriptstyle b}}{{\geq}}\sup_{{\chi}^{2}\in{\mathcal{H}}^{2}}{\mathcal{J}}(\tilde{{\chi}}^{1*},{{\chi}}^{2}) (128)
=csupχ22𝒥(χ1,χ2)\displaystyle\stackrel{{\scriptstyle c}}{{=}}\sup_{{\chi}^{2}\in{\mathcal{H}}^{2}}{\mathcal{J}}({{\chi}}^{1*},{{\chi}}^{2}) (129)
Su(𝒢v).\displaystyle\geq S^{u}(\mathscr{G}_{v}). (130)

where the equality in (a)(a) is because χ~1\tilde{{\chi}}^{1*} is a min-max strategy of 𝒢e\mathscr{G}_{e}. Inequality (b)(b) holds because 2~2\mathcal{H}^{2}\subseteq\tilde{\mathcal{H}}^{2}. Equality in (c)(c) is a consequence of the result in (125) and Lemma 5 in Appendix A. The last inequality simply follows from the definition of the upper value of the virtual game. Therefore, all the inequalities in the display above must hold with equality. Hence, χ1{{\chi}}^{1*} must be a min-max strategy of game 𝒢v\mathscr{G}_{v} and Su(𝒢v)=Su(𝒢e)S^{u}(\mathscr{G}_{v})=S^{u}(\mathscr{G}_{e}).

From Lemma 4 in Appendix A, we know that Su(𝒢v)=Su(𝒢)S^{u}(\mathscr{G}_{v})=S^{u}(\mathscr{G}) and thus, Su(𝒢)=Su(𝒢e)S^{u}(\mathscr{G})=S^{u}(\mathscr{G}_{e}). Further, we can verify from the description of the strategy g1g^{1*} in Algorithm 1 that χ1=1(g1)\chi^{1*}=\mathcal{M}^{1}(g^{1*}), where 1\mathcal{M}^{1} is the transformation defined in the proof of Lemma 4 in Appendix A. Based on the argument in Remark 4 in Appendix A, we can conclude that the strategy g1g^{1*} is a min-max strategy in Game 𝒢\mathscr{G}.

Appendix H Solving the DP: Methods and Challenges

In the previous section, we provided a dynamic programming characterization of the upper value $S^{u}(\mathscr{G})$ and a min-max strategy $g^{1*}$ in the original game $\mathscr{G}$. We now describe an approximate dynamic programming methodology [37] that can be used to compute them. The methodology we propose offers broad guidelines for a computational approach; we do not make any claim about the degree of approximation it achieves. In Section H-A, we discuss some structural properties of the cost-to-go and value functions in the dynamic program that may be useful for computational simplification. We illustrate our approach and the challenges involved with the help of an example in Section V.

At each time $t$, let $\hat{V}_{t+1}(\pi_{t+1},\theta_{t+1})$ be an approximate representation of the upper value function $V_{t+1}^{u}(\pi_{t+1})$ in a suitable parametric form, where $\theta_{t+1}$ is a vector representing the parameters. We proceed as follows.

Sampling the belief space

At time $t$, we sample a set $\mathscr{S}_{t}$ of belief points from the set $\mathcal{S}_{t}$ (the set of all CIBs). A simple sampling approach would be to sample the space $\mathcal{S}_{t}$ uniformly. Using the arguments in Appendix 5 of [15], we can say that the value function $V_{t}^{u}$ is uniformly continuous in the CIB $\pi_{t}$. Thus, if the set $\mathscr{S}_{t}$ is sufficiently large, computing the value function only at the belief points in $\mathscr{S}_{t}$ may be enough to obtain an $\epsilon$-approximation of the value function. However, the size of $\mathscr{S}_{t}$ required to ensure an $\epsilon$-approximation is likely to be prohibitively large. For solving POMDPs, more intelligent strategies for sampling the belief points, known as forward exploration heuristics, exist in the literature [18, 19, 1]. While these methods rely on certain structural properties of the value functions (such as convexity), it may be possible to adapt them to our dynamic program and make the CIB sampling process more efficient. A precise understanding of such exploration heuristics and of the relationship between the approximation error $\epsilon$ and the associated computational burden is needed; we leave this for future work.
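As a minimal sketch of the uniform-sampling option (the function and parameter names below are ours, not part of any algorithm in the paper), belief points can be drawn from the probability simplex using a symmetric Dirichlet distribution:

```python
import numpy as np

def sample_cib_points(num_points, dim, rng=None):
    """Draw candidate CIB points uniformly from the (dim-1)-dimensional
    probability simplex; Dirichlet(1,...,1) is the uniform distribution there."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.dirichlet(np.ones(dim), size=num_points)

# Example: a sampled set S_t of 200 candidate CIBs over a 4-point support.
S_t = sample_cib_points(200, dim=4)
```

A forward exploration heuristic would replace this uniform draw with belief points that are actually reachable under candidate prescriptions.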

Compute value function at each belief point

Once we have a collection of points $\mathscr{S}_{t}$, we can approximately compute the value $V_{t}^{u}(\pi_{t})$ for each $\pi_{t}\in\mathscr{S}_{t}$. For each belief vector $\pi_{t}\in\mathscr{S}_{t}$, we compute $\bar{V}_{t}(\pi_{t})$, which is given by

\hat{w}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}) \doteq \tilde{c}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}) + \mathbb{E}[\hat{V}_{t+1}(F_{t}(\pi_{t},\gamma_{t}^{1},Z_{t+1}),\theta_{t+1}) \mid \pi_{t},\gamma_{t}^{1:2}] \qquad (131)
\bar{V}_{t}(\pi_{t}) \doteq \min_{\gamma_{t}^{1}} \max_{\gamma_{t}^{2}} \hat{w}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}). \qquad (132)

For a given belief point $\pi_{t}$, one approach for solving the min-max problem in (132) is the Gradient Descent Ascent (GDA) method [38, 39]. This can, however, lead to local optima because, in general, the cost $\hat{w}_{t}$ is neither convex nor concave in the respective prescriptions. In some cases, such as when Team 2 has only one player, the inner maximization problem in (132) can be substantially simplified; we discuss this in Section H-A.
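To make the GDA step concrete, the following sketch (our own illustration, using finite-difference gradients and a standard Euclidean simplex projection) performs projected descent on $\gamma_{t}^{1}$ and ascent on $\gamma_{t}^{2}$. It assumes each prescription has been flattened into a single probability vector and that a callable `w_hat(gamma1, gamma2)` evaluating (131) is supplied; step sizes and the iteration budget are placeholders.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def gda(w_hat, gamma1, gamma2, lr=0.05, iters=500):
    """Projected gradient descent on gamma1 and ascent on gamma2 for w_hat.

    Gradients are estimated by central finite differences, so w_hat only needs
    to be evaluable, not differentiable in closed form.
    """
    def grad(f, x, eps=1e-5):
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = eps
            g[i] = (f(x + e) - f(x - e)) / (2 * eps)
        return g

    for _ in range(iters):
        g1 = grad(lambda a: w_hat(a, gamma2), gamma1)
        g2 = grad(lambda b: w_hat(gamma1, b), gamma2)
        gamma1 = project_simplex(gamma1 - lr * g1)  # descent for the minimizer
        gamma2 = project_simplex(gamma2 + lr * g2)  # ascent for the maximizer
    return gamma1, gamma2, w_hat(gamma1, gamma2)
```

Because of the non-convexity mentioned above, such an iteration is only guaranteed to find a local solution; multiple random restarts are a natural mitigation.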

Interpolation

For solving a particular instance of the min-max problem in (132), we need an estimate of the value function $V_{t+1}^{u}$. Generally, knowing just the value of the function at different points may be insufficient, and we may need additional information like first derivatives/gradients, especially when we are using gradient-based methods like GDA. In that case, it is helpful to choose an appropriate parametric form (such as a neural network) for $\hat{V}_{t}$ that has desirable properties like continuity or differentiability. The parameters $\theta_{t}$ can be obtained by solving the following regression problem:

\min_{\theta_{t}} \sum_{\pi_{t}\in\mathscr{S}_{t}} \big(\hat{V}_{t}(\pi_{t},\theta_{t}) - \bar{V}_{t}(\pi_{t})\big)^{2}. \qquad (133)

Standard methods such as gradient descent can be used to solve this regression problem.
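As an illustration of this regression step, here is a minimal PyTorch sketch (a choice of library and architecture made purely for concreteness; any differentiable parametric family would do) that fits $\hat{V}_{t}(\cdot,\theta_{t})$ to the computed targets $\bar{V}_{t}$ by minimizing the squared loss in (133):

```python
import torch
import torch.nn as nn

def fit_value_function(belief_points, targets, hidden=64, epochs=2000, lr=1e-3):
    """Fit V_hat_t(., theta_t) to the targets V_bar_t at the sampled CIBs
    by minimizing the squared loss in (133)."""
    dim = belief_points.shape[1]
    model = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
    x = torch.as_tensor(belief_points, dtype=torch.float32)
    y = torch.as_tensor(targets, dtype=torch.float32).reshape(-1, 1)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model  # model(pi_t) approximates V_bar_t(pi_t)
```

The fitted network is then used as $\hat{V}_{t}$ when the dynamic program is solved at time $t-1$.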

H-A Structural Properties of the DP

We will now prove that under Assumption 2, the cost-to-go function $w_{t}^{u}$ (see (14)) in the min-max dynamic program defined in Section IV-B is linear in virtual player 2's prescription in its product form $\gamma_{t}^{2}(p_{t}^{2};u_{t}^{2})$ (see Definition 5 in Appendix B).

Lemma 7.

The cost-to-go $w_{t}^{u}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2})$ is linear in $\gamma_{t}^{2}(p_{t}^{2};u_{t}^{2})$ (see Definition 5 in Appendix B), i.e., there exists a function $a_{t}:\mathcal{P}_{t}^{2}\times\mathcal{U}_{t}^{2}\times\mathcal{S}_{t}\times\mathcal{B}_{t}^{1}\to\mathbb{R}$ such that

w^{u}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}) = \sum_{p_{t}^{2},u_{t}^{2}} \pi_{t}(p_{t}^{2})\,\gamma_{t}^{2}(p_{t}^{2};u_{t}^{2})\,a_{t}(p_{t}^{2},u_{t}^{2},\pi_{t},\gamma_{t}^{1}).

Here, $\pi_{t}(p_{t}^{2})$ denotes the marginal probability of $p_{t}^{2}$ with respect to the CIB $\pi_{t}$.

Proof.

We have

w^{u}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}) = \tilde{c}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}) + \mathbb{E}[V^{u}_{t+1}(F_{t}(\pi_{t},\gamma_{t}^{1:2},Z_{t+1})) \mid \pi_{t},\gamma_{t}^{1:2}] \qquad (134)
= \tilde{c}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}) + \sum_{z_{t+1}} \mathscr{P}_{t}^{m}(\pi_{t},\gamma_{t}^{1:2};z_{t+1})\, V_{t+1}^{u}(F_{t}(\pi_{t},\gamma_{t}^{1},z_{t+1})) \qquad (135)
\doteq \sum_{p_{t}^{2},u_{t}^{2}} \pi_{t}(p_{t}^{2})\,\gamma_{t}^{2}(p_{t}^{2};u_{t}^{2})\,a_{t}(p_{t}^{2},u_{t}^{2},\pi_{t},\gamma_{t}^{1}), \qquad (136)

where the last equality uses the fact that the belief update $F_{t}$ does not depend on $\gamma_{t}^{2}$ and that both $\tilde{c}_{t}$ (see (58)) and $\mathscr{P}_{t}^{m}$ (see (53)) are linear in $\gamma_{t}^{2}(p_{t}^{2};u_{t}^{2})$. We then rearrange terms in (135) and appropriately define $a_{t}$ to obtain (136). Here, $\pi_{t}(p_{t}^{2})$ represents the marginal distribution of the second player's private information with respect to the common information based belief $\pi_{t}$. ∎

Remark 5.

When there is only one player in Team 2, the prescription $\gamma_{t}^{2}$ coincides with its product form in Definition 5. In this case, the cost-to-go $w_{t}^{u}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2})$ is linear in $\gamma_{t}^{2}$ as well.

At any given time $t$, a pure prescription for virtual player 2 in Game $\mathscr{G}_{e}$ is a prescription $(\gamma^{2,1},\dots,\gamma^{2,N_{2}})\in\mathcal{B}_{t}^{2}$ such that for every $j$ and every instance $p_{t}^{2,j}\in\mathcal{P}_{t}^{2,j}$, the distribution $\gamma^{2,j}(p_{t}^{2,j})$ is degenerate. Let $\mathcal{Q}_{t}^{2}$ be the set of all such pure prescriptions at time $t$ for virtual player 2.

Lemma 8.

Under Assumption 2, for any given $\pi_{t}\in\mathcal{S}_{t}$ and $\gamma_{t}^{1}\in\mathcal{B}_{t}^{1}$, the cost-to-go function in (14) satisfies

\max_{\gamma_{t}^{2}\in\mathcal{B}_{t}^{2}} w^{u}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}) = \max_{\gamma_{t}^{2}\in\mathcal{Q}_{t}^{2}} w^{u}_{t}(\pi_{t},\gamma_{t}^{1},\gamma_{t}^{2}).
Proof.

Using Lemma 7, for a fixed value of the CIB $\pi_{t}$ and virtual player 1's prescription $\gamma_{t}^{1}$, the inner maximization problem in (15) can be viewed as a single-stage team problem. In this team problem, there are $N_{2}$ players and the state of the system is $P_{t}^{2}$, which is distributed according to $\pi_{t}$. Player $j$'s information in this team is $P_{t}^{2,j}$. Using this information, the player selects a distribution $\delta U_{t}^{2,j}$, which is then used to randomly generate an action $U_{t}^{2,j}$. The reward for this team is given by $a_{t}$ defined in Lemma 7. It can easily be shown that in a single-stage team problem, we can restrict attention to deterministic strategies without loss of optimality. It is easy to see that the set of all deterministic strategies in the single-stage team problem described above corresponds to the collection of pure prescriptions $\mathcal{Q}_{t}^{2}$. This concludes the proof of Lemma 8. ∎

Lemma 8 allows us to transform a non-concave maximization problem over the continuous space $\mathcal{B}_{t}^{2}$ in (15) and (132) into a maximization problem over the finite set $\mathcal{Q}_{t}^{2}$. This is particularly useful when players in Team 2 do not have any private information because, in that case, $\mathcal{Q}_{t}^{2}$ has the same size as the action space $\mathcal{U}_{t}^{2}$. Other structural properties and simplifications are discussed in Appendix 7.
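For illustration, the pure-prescription set $\mathcal{Q}_{t}^{2}$ can be enumerated explicitly, so the inner maximization becomes a finite search. In this sketch the helper names are ours, and `w_hat` stands for the cost-to-go evaluated at a fixed $\pi_{t}$ and $\gamma_{t}^{1}$:

```python
from itertools import product

def enumerate_pure_prescriptions(private_info_sets, action_sets):
    """All pure prescriptions for virtual player 2: for each player j in Team 2,
    one deterministic map from realizations of P_t^{2,j} to actions in U_t^{2,j}."""
    per_player = [list(product(actions, repeat=len(p_set)))
                  for p_set, actions in zip(private_info_sets, action_sets)]
    return list(product(*per_player))

def inner_max_over_pure(w_hat, private_info_sets, action_sets):
    """Lemma 8: the max over B_t^2 equals the max over the finite set Q_t^2."""
    return max(w_hat(q) for q in enumerate_pure_prescriptions(private_info_sets,
                                                              action_sets))
```

When the players in Team 2 have no private information, each `p_set` is a singleton and the enumeration reduces to a search over joint actions.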

Appendix I Proof of Lemma 3

For proving Lemma 3, it will be helpful to split Team 1's strategy $g^{1}$ into two parts $(g^{1,1},g^{1,2})$, where $g^{1,1}=(g^{1,1}_{1},\dots,g^{1,1}_{T})$ and $g^{1,2}=(g^{1,2}_{1},\dots,g^{1,2}_{T})$. The set of all such strategies for Player $j$ in Team 1 will be denoted by $\mathcal{G}^{1,j}$. We will prove the lemma using the following claim.

Claim I.1.

Consider any arbitrary strategy $g^{1}=(g^{1,1},g^{1,2})$ for Team 1. Then there exists a strategy $\bar{g}^{1,1}$ for Player 1 in Team 1 such that, for each $t$, $\bar{g}^{1,1}_{t}$ is a function of $X_{t}$ and $I^{2}_{t}$ and

J((\bar{g}^{1,1},g^{1,2}),g^{2}) = J((g^{1,1},g^{1,2}),g^{2}), \quad \forall g^{2}\in\mathcal{G}^{2}.

Suppose that the above claim is true. Let $(h^{1,1},h^{1,2})$ be a min-max strategy for Team 1. Due to Claim I.1, there exists a strategy $\bar{h}^{1,1}$ for Player 1 in Team 1 such that, for each $t$, $\bar{h}^{1,1}_{t}$ is a function only of $X_{t}$ and $I^{2}_{t}$ and

J((\bar{h}^{1,1},h^{1,2}),g^{2}) = J((h^{1,1},h^{1,2}),g^{2}),

for every strategy $g^{2}\in\mathcal{G}^{2}$. Therefore, we have that

\sup_{g^{2}\in\mathcal{G}^{2}} J((\bar{h}^{1,1},h^{1,2}),g^{2}) = \sup_{g^{2}\in\mathcal{G}^{2}} J((h^{1,1},h^{1,2}),g^{2}) = \inf_{g^{1}\in\mathcal{G}^{1}} \sup_{g^{2}\in\mathcal{G}^{2}} J(g^{1},g^{2}).

Thus, $(\bar{h}^{1,1},h^{1,2})$ is a min-max strategy for Team 1 wherein Player 1 uses only the current state and Player 2's information.

Proof of Claim I.1: Consider any arbitrary strategy $g^{1}=(g^{1,1},g^{1,2})$ for Team 1. Let $\iota_{t}^{2}=\{u_{1:t-1}^{1:2},y_{1:t}^{2}\}$ be a realization of Team 2's information $I_{t}^{2}$ (which is the same as Player 2's information in Team 1). Define the distribution $\Psi_{t}(\iota_{t}^{2})$ over the space $(\prod_{\tau=1}^{t}\mathcal{X}_{\tau})\times\mathcal{U}_{t}^{1,1}$ as follows:

\Psi_{t}(\iota_{t}^{2};x_{1:t},u_{t}^{1,1}) \doteq \mathbb{P}^{g^{1},h^{2}}[X_{1:t},U^{1,1}_{t}=(x_{1:t},u_{t}^{1,1}) \mid I_{t}^{2}=\iota_{t}^{2}],

if $\iota_{t}^{2}$ is feasible, that is, if $\mathbb{P}^{g^{1},h^{2}}[I_{t}^{2}=\iota_{t}^{2}]>0$ under the open-loop strategy $h^{2}\doteq(u^{2}_{1:t-1})$ for Team 2. Otherwise, define $\Psi_{t}(\iota_{t}^{2};x_{1:t},u_{t}^{1,1})$ to be the uniform distribution over the space $(\prod_{\tau=1}^{t}\mathcal{X}_{\tau})\times\mathcal{U}_{t}^{1,1}$.

Lemma 9.

Let $g^{1}$ be Team 1's strategy and let $g^{2}$ be an arbitrary strategy for Team 2. Then for any realization $x_{1:t},u_{t}^{1,1}$ of the variables $X_{1:t},U^{1,1}_{t}$, we have

\mathbb{P}^{g^{1},g^{2}}[X_{1:t},U^{1,1}_{t}=(x_{1:t},u_{t}^{1,1}) \mid I_{t}^{2}] = \Psi_{t}(I_{t}^{2};x_{1:t},u_{t}^{1,1}),

almost surely.

Proof.

From Team 2's perspective, the system evolution can be seen in the following manner. The system state at time $t$ is $S_{t}=(X_{1:t},U_{t}^{1,1},I^{2}_{t})$. Team 2 obtains a partial observation $Y_{t}^{2}$ of the state at time $t$. Using the information $\{Y_{1:t}^{2},U_{1:t-1}^{1:2}\}$, Team 2 then selects an action $U_{t}^{2}$. The state then evolves in a controlled Markovian manner (with dynamics that depend on $g^{1}$). Thus, from Team 2's perspective, this system is a partially observable Markov decision process (POMDP). The claim in the lemma then follows from the standard result in POMDPs that the belief on the state given the player's information does not depend on the player's strategy [29]. ∎

For any instance $\iota_{t}^{2}$ of Team 2's information $I_{t}^{2}$, define the distribution $\Phi_{t}(\iota_{t}^{2})$ over the space $\mathcal{X}_{t}\times\mathcal{U}_{t}^{1,1}$ as follows:

\Phi_{t}(\iota_{t}^{2};x_{t},u_{t}^{1,1}) = \sum_{x_{1:t-1}} \Psi_{t}(\iota_{t}^{2};x_{1:t},u_{t}^{1,1}). \qquad (137)

Define the strategy $\bar{g}^{1,1}$ for Player 1 in Team 1 such that for any realization $x_{t},\iota_{t}^{2}$ of the state $X_{t}$ and Player 2's information $I_{t}^{2}$ at time $t$, the probability of selecting an action $u_{t}^{1,1}$ at time $t$ is

\bar{g}_{t}^{1,1}(x_{t},\iota_{t}^{2};u_{t}^{1,1}) \doteq \begin{cases} \dfrac{\Phi_{t}(\iota_{t}^{2};x_{t},u_{t}^{1,1})}{\sum_{u_{t}^{1,1^{\prime}}}\Phi_{t}(\iota_{t}^{2};x_{t},u_{t}^{1,1^{\prime}})} & \text{if } \sum_{u_{t}^{1,1^{\prime}}}\Phi_{t}(\iota_{t}^{2};x_{t},u_{t}^{1,1^{\prime}})>0, \\ \mathscr{U}(\cdot) & \text{otherwise}, \end{cases} \qquad (138)

where $\mathscr{U}(\cdot)$ denotes the uniform distribution over the action space $\mathcal{U}_{t}^{1,1}$. Notice that the construction of the strategy $\bar{g}^{1,1}$ does not involve Team 2's strategy $g^{2}$.
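The construction in (138) is simply a conditional normalization with a uniform fallback. A minimal sketch (assuming $\Phi_{t}(\iota_{t}^{2};x_{t},\cdot)$ is available as a vector indexed by $u_{t}^{1,1}$) reads:

```python
import numpy as np

def gbar_from_phi(phi_row):
    """Construct gbar_t^{1,1}(x_t, iota_t^2; .) from Phi_t(iota_t^2; x_t, .) as in
    (138): normalize if the marginal is positive, else return the uniform
    distribution over U_t^{1,1}."""
    phi_row = np.asarray(phi_row, dtype=float)
    total = phi_row.sum()
    if total > 0:
        return phi_row / total
    return np.full_like(phi_row, 1.0 / len(phi_row))
```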

Lemma 10.

For any strategy $g^{2}$ for Team 2, we have

\mathbb{P}^{(g^{1},g^{2})}[U_{t}^{1,1}=u_{t}^{1,1} \mid X_{t},I_{t}^{2}] = \bar{g}_{t}^{1,1}(X_{t},I_{t}^{2};u_{t}^{1,1})

almost surely for every $u_{t}^{1,1}\in\mathcal{U}_{t}^{1,1}$.

Proof.

Let $x_{t},\iota_{t}^{2}$ be a realization that has a non-zero probability of occurrence under the strategy profile $(g^{1},g^{2})$. Then, using Lemma 9, we have

\mathbb{P}^{(g^{1},g^{2})}[X_{1:t},U^{1,1}_{t}=(x_{1:t},u_{t}^{1,1}) \mid \iota_{t}^{2}] = \Psi_{t}(\iota_{t}^{2};x_{1:t},u_{t}^{1,1}), \qquad (139)

for every realization $x_{1:t-1}$ of the states $X_{1:t-1}$ and $u_{t}^{1,1}$ of the action $U_{t}^{1,1}$. Summing over all $x_{1:t-1},u^{1,1}_{t}$ and using (137) and (139), we have

\mathbb{P}^{(g^{1},g^{2})}[X_{t}=x_{t} \mid I_{t}^{2}=\iota_{t}^{2}] = \sum_{u_{t}^{1,1}} \Phi_{t}(\iota_{t}^{2};x_{t},u_{t}^{1,1}). \qquad (140)

The left-hand side of the above equation is positive since $x_{t},\iota^{2}_{t}$ is a realization of positive probability under the strategy profile $(g^{1},g^{2})$.

Using Bayes’ rule, (137), (138) and (139), we obtain

\mathbb{P}^{g^{1},g^{2}}[U_{t}^{1,1}=u_{t}^{1,1} \mid X_{t}=x_{t},I_{t}^{2}=\iota_{t}^{2}] = \frac{\Phi_{t}(\iota_{t}^{2};x_{t},u_{t}^{1,1})}{\sum_{u_{t}^{1,1^{\prime}}}\Phi_{t}(\iota_{t}^{2};x_{t},u_{t}^{1,1^{\prime}})} = \bar{g}_{t}^{1,1}(x_{t},\iota_{t}^{2};u_{t}^{1,1}).

This concludes the proof of the lemma. ∎

Let us define $\bar{g}^{1}=(\bar{g}^{1,1},g^{1,2})$, where $\bar{g}^{1,1}$ is as defined in (138). We can now show that the strategy $\bar{g}^{1}$ satisfies

J(\bar{g}^{1},g^{2}) = J(g^{1},g^{2}),

for every strategy $g^{2}\in\mathcal{G}^{2}$. Because of the structure of the cost function in (7), it is sufficient to show that for each time $t$, the random variables $(X_{t},U_{t}^{1},U_{t}^{2},I^{2}_{t})$ have the same joint distribution under the strategy profiles $(g^{1},g^{2})$ and $(\bar{g}^{1},g^{2})$. We prove this by induction. It is easy to verify that at time $t=1$, $(X_{1},U_{1}^{1},U_{1}^{2},I^{2}_{1})$ have the same joint distribution under the strategy profiles $(g^{1},g^{2})$ and $(\bar{g}^{1},g^{2})$.

Now assume that at time $t$,

\mathbb{P}^{g^{1},g^{2}}[x_{t},u_{t}^{1},u_{t}^{2},\iota_{t}^{2}] = \mathbb{P}^{\bar{g}^{1},g^{2}}[x_{t},u_{t}^{1},u_{t}^{2},\iota_{t}^{2}], \qquad (141)

for any realization $x_{t},u_{t}^{1},u_{t}^{2},\iota_{t}^{2}$ of the state, actions and Team 2's information. Let $\iota_{t+1}^{2}=(\iota_{t}^{2},u_{t}^{1:2},y_{t+1}^{2})$. Then we have

\mathbb{P}^{g^{1},g^{2}}[x_{t+1},\iota_{t+1}^{2}] = \sum_{\bar{x}_{t}} \mathbb{P}[x_{t+1},y_{t+1}^{2} \mid \bar{x}_{t},u_{t}^{1:2},\iota_{t}^{2}]\, \mathbb{P}^{g^{1},g^{2}}[\bar{x}_{t},u_{t}^{1:2},\iota_{t}^{2}]
= \sum_{\bar{x}_{t}} \mathbb{P}[x_{t+1},y_{t+1}^{2} \mid \bar{x}_{t},u_{t}^{1:2},\iota_{t}^{2}]\, \mathbb{P}^{\bar{g}^{1},g^{2}}[\bar{x}_{t},u_{t}^{1:2},\iota_{t}^{2}] \qquad (142)
= \mathbb{P}^{\bar{g}^{1},g^{2}}[x_{t+1},\iota_{t+1}^{2}]. \qquad (143)

The equality in (142) is due to the induction hypothesis. Note that the conditional distribution $\mathbb{P}[x_{t+1},\iota_{t+1}^{2} \mid x_{t},u_{t}^{1},u_{t}^{2},\iota_{t}^{2}]$ does not depend on the players' strategies (see equations (2) and (3)).

At $t+1$, for any realization $x_{t+1},u_{t+1}^{1},u_{t+1}^{2},\iota_{t+1}^{2}$ that has non-zero probability of occurrence under the strategy profile $(g^{1},g^{2})$, we have

\mathbb{P}^{g^{1},g^{2}}[x_{t+1},u_{t+1}^{1},u_{t+1}^{2},\iota_{t+1}^{2}] \qquad (144)
= \mathbb{P}^{g^{1},g^{2}}[x_{t+1},\iota_{t+1}^{2}]\, g_{t}^{2}(\iota_{t+1}^{2};u_{t+1}^{2})\, g_{t}^{1,2}(\iota_{t+1}^{2};u_{t+1}^{1,2})\, \mathbb{P}^{g^{1},g^{2}}[u_{t+1}^{1,1} \mid x_{t+1},\iota_{t+1}^{2}] \qquad (145)
= \mathbb{P}^{g^{1},g^{2}}[x_{t+1},\iota_{t+1}^{2}]\, g_{t}^{2}(\iota_{t+1}^{2};u_{t+1}^{2})\, g_{t}^{1,2}(\iota_{t+1}^{2};u_{t+1}^{1,2})\, \bar{g}_{t}^{1,1}(x_{t+1},\iota_{t+1}^{2};u_{t+1}^{1,1}) \qquad (146)
= \mathbb{P}^{\bar{g}^{1},g^{2}}[x_{t+1},\iota_{t+1}^{2}]\, g_{t}^{2}(\iota_{t+1}^{2};u_{t+1}^{2})\, g_{t}^{1,2}(\iota_{t+1}^{2};u_{t+1}^{1,2})\, \bar{g}_{t}^{1,1}(x_{t+1},\iota_{t+1}^{2};u_{t+1}^{1,1}) \qquad (147)
= \mathbb{P}^{\bar{g}^{1},g^{2}}[x_{t+1},\iota_{t+1}^{2}]\, g_{t}^{2}(\iota_{t+1}^{2};u_{t+1}^{2})\, g_{t}^{1,2}(\iota_{t+1}^{2};u_{t+1}^{1,2})\, \mathbb{P}^{\bar{g}^{1},g^{2}}[u_{t+1}^{1,1} \mid x_{t+1},\iota_{t+1}^{2}] \qquad (148)
= \mathbb{P}^{\bar{g}^{1},g^{2}}[x_{t+1},u_{t+1}^{1},u_{t+1}^{2},\iota_{t+1}^{2}], \qquad (149)

where the equality between (144) and (145) is a consequence of the chain rule and the manner in which players randomize their actions. The equality in (146) follows from Lemma 10, and the equality in (147) follows from the result in (143). Therefore, by induction, the equality in (141) holds for all $t$. This concludes the proof of Claim I.1. ∎

Appendix J Numerical Experiments: Details

J-A Game Model

States

There are two players in Team 1 and one player in Team 2. We will refer to Player 2 in Team 1 as the defender and to the player in Team 2 as the attacker. Player 1 in Team 1 will be referred to as the signaling player. The state space of the system is $\mathcal{X}_{t}=\{(l,a),(r,a),(l,p),(r,p)\}$. For convenience, we will denote these four states by $\{0,1,2,3\}$ in the same order. Recall that $l$ and $r$ represent the two entities, and $a$ and $p$ denote the active and passive states of the attacker. Therefore, if the state is $X_{t}=(l,a)$, the vulnerable entity is $l$ and the attacker is active. A similar interpretation applies to the other states.

Actions

Each player in Team 1 has two actions: $\mathcal{U}_{t}^{1,1}=\mathcal{U}_{t}^{1,2}=\{\alpha,\beta\}$. Player 2's action $\alpha$ corresponds to defending entity $l$ and $\beta$ corresponds to defending entity $r$. Player 1's actions are meant only for signaling. The player (attacker) in Team 2 has three actions, $\mathcal{U}_{t}^{2,1}=\{\alpha,\beta,\mu\}$. For the attacker, $\alpha$ represents a targeted attack on entity $l$, $\beta$ represents a targeted attack on entity $r$, and $\mu$ represents a blanket attack on both entities.

Dynamics

In this particular example, the actions of the players in Team 1 do not affect the state evolution; only the attacker's actions do. We consider this type of setup for simplicity, and one can let the players in Team 1 control the state evolution as well. On the other hand, note that the evolution of the CIB is controlled only by Team 1 (the signaling player) and not by Team 2 (the attacker). The transition probabilities are provided in Table I. We interpret these transitions below.

Whenever the attacker is passive, i.e., $X_{t}=2$ or $3$, the attacker remains passive with probability 0.7, in which case the system state does not change. With probability 0.3, the attacker becomes active; given that the attacker becomes active, the next state $X_{t+1}$ is either 0 or 1 with probability 0.5 each. In these states, even the attacker's actions do not affect the transitions.

When the current state is $X_{t}=0$, the following cases are possible:

  1. Attacker plays $\alpha$: In this case, the attacker launches a successful targeted attack. The next state $X_{t+1}$ is either 0 or 1 with probability 0.5 each.

  2. Attacker plays $\beta$: The next state $X_{t+1}$ is 2 with probability 1. Thus, the attacker becomes passive if it targets the invulnerable entity.

  3. Attacker plays $\mu$: The state does not change with probability 0.7. With probability 0.3, $X_{t+1}=2$.

The attacker's actions can be interpreted similarly when the state is $X_{t}=1$.

TABLE I: The transition probabilities $\mathbb{P}[X_{t+1}\mid X_{t},U_{t}^{2}]$. Note that the dynamics depend neither on time $t$ nor on Team 1's actions.

$U_t^2$   | $X_t=0$              | $X_t=1$              | $X_t=2$                | $X_t=3$
$\alpha$  | (0.5, 0.5, 0.0, 0.0) | (0.0, 0.0, 0.0, 1.0) | (0.15, 0.15, 0.7, 0.0) | (0.15, 0.15, 0.0, 0.7)
$\beta$   | (0.0, 0.0, 1.0, 0.0) | (0.5, 0.5, 0.0, 0.0) | (0.15, 0.15, 0.7, 0.0) | (0.15, 0.15, 0.0, 0.7)
$\mu$     | (0.7, 0.0, 0.3, 0.0) | (0.0, 0.7, 0.0, 0.3) | (0.15, 0.15, 0.7, 0.0) | (0.15, 0.15, 0.0, 0.7)
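For reference, Table I can be encoded directly as a transition tensor (our own encoding; states are indexed 0–3 as above and the attacker's actions are ordered $\alpha,\beta,\mu$):

```python
import numpy as np

# P[a][x] = distribution of X_{t+1} given X_t = x and attacker action a,
# exactly as in Table I (actions ordered alpha, beta, mu; states 0..3).
P = np.array([
    # X_t = 0               X_t = 1               X_t = 2                 X_t = 3
    [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.15, 0.15, 0.7, 0.0], [0.15, 0.15, 0.0, 0.7]],  # alpha
    [[0.0, 0.0, 1.0, 0.0], [0.5, 0.5, 0.0, 0.0], [0.15, 0.15, 0.7, 0.0], [0.15, 0.15, 0.0, 0.7]],  # beta
    [[0.7, 0.0, 0.3, 0.0], [0.0, 0.7, 0.0, 0.3], [0.15, 0.15, 0.7, 0.0], [0.15, 0.15, 0.0, 0.7]],  # mu
])
assert np.allclose(P.sum(axis=-1), 1.0)  # each row is a probability distribution
```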

Cost

Player 1’s actions in Team 1 do not affect the cost in this example (again for simplicity). The cost as a function of the state and the other players’ actions is provided in Table II.

TABLE II: The cost $c(X_{t},U_{t}^{1,2},U_{t}^{2,1})$. Note that the actions of Player 1 in Team 1 do not affect the cost. Each row corresponds to a pair of actions $(U_{t}^{1,2},U_{t}^{2,1})$ and each column corresponds to a state $X_{t}$.

$(U_t^{1,2},U_t^{2,1})$ | $X_t=0$ | $X_t=1$ | $X_t=2$ | $X_t=3$
$(\alpha,\alpha)$       | 15      | 0       | 0       | 0
$(\alpha,\beta)$        | 0       | 15      | 0       | 0
$(\alpha,\mu)$          | 10      | 20      | 0       | 0
$(\beta,\alpha)$        | 15      | 0       | 0       | 0
$(\beta,\beta)$         | 0       | 15      | 0       | 0
$(\beta,\mu)$           | 20      | 10      | 0       | 0

Note that when the attacker is passive, the system does not incur any cost. If the attacker launches a successful targeted attack, then the cost is 15. If the attacker launches a failed targeted attack, then the cost is 0. When the attacker launches a blanket attack, if the defender happens to be defending the vulnerable entity, the cost incurred is 10. Otherwise, the cost incurred is 20.

We consider a discounted cost with horizon $T=15$. The cost function at time $t$ is therefore $c_{t}(\cdot)=\delta^{t-1}c(\cdot)$, where the function $c(\cdot)$ is defined in Table II. The discount factor $\delta$ in our problem is $0.9$.
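Similarly, the stage cost of Table II and the discounting can be written out explicitly (our own encoding; the defender's actions are ordered $\alpha,\beta$ and the attacker's actions $\alpha,\beta,\mu$):

```python
import numpy as np

# C[d][a][x] = c(X_t = x, U_t^{1,2} = d, U_t^{2,1} = a) from Table II.
C = np.array([
    # attacker alpha  beta            mu
    [[15, 0, 0, 0], [0, 15, 0, 0], [10, 20, 0, 0]],   # defender alpha
    [[15, 0, 0, 0], [0, 15, 0, 0], [20, 10, 0, 0]],   # defender beta
])

DELTA, T = 0.9, 15

def stage_cost(t, x, d, a):
    """Discounted stage cost c_t = delta^(t-1) * c, with t in 1..T."""
    return (DELTA ** (t - 1)) * C[d, a, x]
```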

J-B Architecture and Implementation

At any time $t$, we represent the value function $V_{t}^{u}$ using a deep fully-connected ReLU network. The network (denoted by $\hat{V}_{t}$ with parameters $\theta_{t}$) takes the CIB $\pi_{t}$ as input and returns the value $\hat{V}_{t}(\pi_{t},\theta_{t})$. The state $X_{t}$ in our example takes only four values and thus we sample the CIB belief space uniformly. For each CIB point in the sampled set, we compute an estimate $\bar{V}_{t}$ of the value function as in (132). Since Team 2 has only one player, we can use the structural result in Section H-A to simplify the inner maximization problem in (132), and we solve the outer minimization problem in (132) using gradient descent. The neural network representation of the value function is helpful here because we can compute gradients using backpropagation. Since this minimization is not convex, we may not converge to the global optimum; therefore, we use multiple initializations and pick the best solution. Finally, the interpolation step is performed by training the neural network $\hat{V}_{t}$ with the $\ell_{2}$ loss.
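A hedged sketch of this outer minimization (our own illustration, not the exact code used for the experiments): the attacker's inner maximization is an enumeration over its three actions, and $\gamma_{t}^{1}$ is parameterized through a softmax so that unconstrained gradient descent with random restarts can be applied. The callable `w_hat(gamma1, u2)`, which should evaluate (131) for a pure attacker action using the network $\hat{V}_{t+1}$, is assumed to be supplied.

```python
import torch

def outer_min(w_hat, gamma1_shape, num_attacker_actions=3,
              restarts=5, steps=300, lr=0.05):
    """Approximately solve min over gamma_t^1 of max over pure attacker actions
    of w_hat(gamma_t^1, u^2). Returns the best prescription found."""
    best_val, best_gamma = float("inf"), None
    for _ in range(restarts):
        logits = torch.randn(gamma1_shape, requires_grad=True)
        opt = torch.optim.Adam([logits], lr=lr)
        for _ in range(steps):
            gamma1 = torch.softmax(logits, dim=-1)  # rows are action distributions
            # Inner maximization over the finite set of pure attacker actions.
            val = torch.stack([w_hat(gamma1, u2)
                               for u2 in range(num_attacker_actions)]).max()
            opt.zero_grad()
            val.backward()
            opt.step()
        with torch.no_grad():
            gamma1 = torch.softmax(logits, dim=-1)
            val = max(w_hat(gamma1, u2).item() for u2 in range(num_attacker_actions))
        if val < best_val:
            best_val, best_gamma = val, gamma1
    return best_gamma, best_val
```

The multiple restarts mirror the multiple initializations mentioned above; none of them is guaranteed to reach the global minimum.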

We illustrate this iterative procedure in Section J-D. For each time $t$, we compute the upper value function $\bar{V}_{t}$ at the sampled belief points using the estimated value function $\hat{V}_{t+1}$. In the figures in Section J-D, these points are labeled "Value function" and plotted in blue. The weights of the neural network $\hat{V}_{t}$ are then adjusted by regression to obtain an approximation; this approximated value function is plotted in orange. For the particular example discussed above, we can obtain very close approximations of the value functions. We can also observe that the value function converges, due to discounting, as the horizon of the problem increases.

J-C The Value Function and Team 1’s Strategy

The value function computed using the methodology described above is depicted in Figure 1(a). Since the initial belief is given by $\pi_{1}(0)=\pi_{1}(1)=0.5$, we can conclude that the min-max value of the game is 65.2. We note that the value function in Figure 1(a) is neither convex nor concave. As discussed earlier in Section V, the trade-off between signaling and secrecy can be understood from the value function. Clearly, revealing too much (or too little) information about the hidden state is unfavorable for Team 1. The most favorable belief according to the value function appears to be $\pi_{1}(0)=0.28$ (or 0.72).

We also compute a min-max strategy for Team 1 and the corresponding best response for the attacker in Team 2 (this best response is not to be confused with the attacker's max-min strategy). The min-max strategy is computed using Algorithm 1.

  1. In the best response, the attacker decides to launch a targeted attack (while it is active) on entity $l$ if $0\leq\pi_{1}(0)<0.28$ and on entity $r$ if $0.72<\pi_{1}(0)\leq 1$. For all other belief states it launches a blanket attack. Clearly, the attacker launches a targeted attack only if it is sufficiently confident.

  2. According to our computed min-max strategy, Player 2 in Team 1 simply computes the CIB and chooses to defend the entity that is more likely to be vulnerable.

  3. The signaling strategy for Player 1 in Team 1 is more complicated. This can be seen from the strategy depicted in Figure 1(b), which can be divided into three broad regions.

     (a) The grey region is where the signaling can be arbitrary. This is because the attacker in these belief states is confident enough about which entity is vulnerable and will launch a targeted attack immediately.

     (b) In the brown region, the signaling is negligible. This is because even slight signaling might make the attacker confident enough to launch a targeted attack.

     (c) In the yellow region, there is substantial signaling. In this region, the attacker uses the blanket attack, and signaling can help Player 2 in Team 1 defend better.

We also observe an interesting pattern in the signaling strategy. Consider the CIB $\pi_{1}(0)=0.6$ in Figure 1(b) and let the true state be 0. In this scenario, Player 1 in Team 1 knows that the state is 0 while Player 2 believes that state 1 is more likely. Therefore, Player 1's signaling strategy must quickly resolve this mismatch between the state and Player 2's belief. However, when the true state is 1, Player 2's belief is consistent with the state, and too much signaling can make the attacker more confident about the system state. Thus, the signaling requirement is asymmetric with respect to the state: signaling is favorable when the state is 0 but unfavorable when the state is 1. The strategy depicted in Figure 1(b) achieves this asymmetric signaling efficiently.

J-D Value Function Approximations

In this section, we present successive approximations of the value functions obtained using the methodology in Section H.

Figure 2: Estimated value function $\hat{V}_{15}$

Figure 3: Estimated value function $\hat{V}_{14}$

Figure 4: Estimated value function $\hat{V}_{13}$

Figure 5: Estimated value function $\hat{V}_{12}$

Figure 6: Estimated value function $\hat{V}_{11}$

Figure 7: Estimated value function $\hat{V}_{10}$

Figure 8: Estimated value function $\hat{V}_{9}$

Figure 9: Estimated value function $\hat{V}_{8}$

Figure 10: Estimated value function $\hat{V}_{7}$

Figure 11: Estimated value function $\hat{V}_{6}$

Figure 12: Estimated value function $\hat{V}_{5}$

Figure 13: Estimated value function $\hat{V}_{4}$

Figure 14: Estimated value function $\hat{V}_{3}$

Figure 15: Estimated value function $\hat{V}_{2}$

Figure 16: Estimated value function $\hat{V}_{1}$