
Zero-Sum Games between Large-Population Teams:
Reachability-based Analysis under Mean-Field Sharing

Yue Guan, Mohammad Afshari, Panagiotis Tsiotras

Yue Guan is a PhD student with the School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA, USA. Email: [email protected]
Mohammad Afshari is a Postdoctoral Fellow with the Institute for Robotics and Intelligent Machines, Georgia Institute of Technology, Atlanta, GA, USA. Email: [email protected]
Panagiotis Tsiotras is the David & Andrew Lewis Chair Professor with the School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA, USA. Email: [email protected]
Abstract

This work studies the behavior of two large-population teams competing in a discrete environment. The team-level interactions are modeled as a zero-sum game, while the agent dynamics within each team are formulated as a collaborative mean-field team problem. Drawing inspiration from the mean-field literature, we first approximate the large-population team game with its infinite-population limit. We then construct a fictitious centralized system and transform the infinite-population game into an equivalent zero-sum game between two coordinators. We study the optimal coordination strategies for each team via a novel reachability analysis and then translate them back into decentralized strategies that the original agents deploy. We prove that the strategies are $\epsilon$-optimal for the original finite-population team game, and we further show that the suboptimality diminishes as the team sizes approach infinity. The theoretical guarantees are verified by numerical examples.

1 Introduction

Multi-agent decision-making arises in many applications, ranging from warehouse robots [Li et al., 2021] and surveillance missions [Tian et al., 2020] to organizational economics [Gibbons et al., 2013]. While the majority of the literature formulates such problems within either a purely cooperative or a purely competitive setting, results on mixed collaborative-competitive team behaviors are relatively sparse. In this work, we consider a competitive team game, where two teams, each consisting of a large number of intelligent agents, compete at the team level, while the agents within each team collaborate. Such hierarchical interactions are of particular interest to military operations [Tyler et al., 2020] and other multi-agent systems that operate in adversarial environments.

There are two major challenges when trying to solve such competitive team problems:

  1.

    Large-population team problems are computationally challenging since the solution complexity increases exponentially with the number of agents. It is well-known that team optimal control problems belong to the NEXP complexity class [Bernstein et al., 2002].

  2.

    Competitive team problems are conceptually challenging due to the elusive nature of the opponent team. In particular, one may want to impose assumptions on the opponent team to obtain tractable solutions, but these assumptions may not be valid. It is thus unclear whether one can deploy techniques that are readily available in the large-population game literature.

The scalability challenge is also present in dynamic games; however, it has been successfully resolved for a class of models known as mean-field games (MFGs) [Huang et al., 2006; Lasry and Lions, 2007]. The salient feature of mean-field games is that the selfish agents are weakly coupled in their dynamics and rewards, and the coupling is only through the mean field (i.e., the empirical distribution of the agents). When the population is large, the game can be approximately solved by considering the infinite-population limit, at which point the interactions among agents are reduced to a game between a typical agent and the infinite population.

The mean-field approximation idea has been extended to the single-team setting [Arabneydi and Mahajan, 2014], where a group of homogeneous agents are weakly coupled but receive the same team reward and thus collaborate. Under the assumption that all agents apply the same strategy, a dynamic programming decomposition is developed leveraging the common-information approach [Nayyar et al., 2013], so that all agents within the team deploy the same strategy prescribed by a fictitious coordinator. The identical-strategy assumption significantly simplifies the problem, but the optimality of such an approach is only guaranteed for linear-quadratic models [Arabneydi and Mahajan, 2015]. In the competitive team setting, however, although one may restrict the strategies used by one's own team to be identical, extending the same assumption to the opponent team may lead to a substantial underestimation of the opponent's capabilities and thus requires further justification.

1.1 Main Contributions

We address the two aforementioned challenges for the class of discrete zero-sum mean-field team games (ZS-MFTGs), which extends the mean-field (single-)team problem. Importantly, the ZS-MFTG models competitive team behaviors and focuses on a justifiable approximation of the opponent team's strategies. We consider discrete-time models with finite state and action spaces, different from the continuous setting in the concurrent work [Sanjari et al., 2023].

We develop a dynamic program that constructs $\epsilon$-optimal strategies for the proposed large-population competitive team problem. Notably, our approach finds an optimal solution at the infinite-population limit and considers only identical team strategies. This avoids both the so-called "curse of dimensionality" in multi-agent systems and the bookkeeping of individual strategies. Our main results provide a sub-optimality bound on the exploitability of the proposed solution in the original finite-population game, even when the opponent team is allowed to use non-identical strategies. Specifically, we show that the sub-optimality decreases at the rate $\mathcal{O}(\underline{N}^{-0.5})$, where $\underline{N}$ is the size of the smaller team.

Our results stem from a novel reachability-based analysis of the mean-field approximation. In particular, we show that any finite-population team behavior can be effectively approximated by an infinite-population team that uses identical team strategies. This result allows us to approximate the original problem with one between two competing infinite-population teams and to transform the resulting problem into a zero-sum game between two fictitious coordinators. Specifically, each coordinator observes the mean-fields (state distributions) of both teams and prescribes a local policy to all agents within its team (see Figure 1). This transformation leads to a simple dynamic program based on the common-information approach [Nayyar et al., 2013].

1.2 Related Work

Mean-field games.

The mean-field game model was introduced in [Huang et al., 2006, 2007; Lasry and Lions, 2007] to address the scalability issues in large-population games. The salient feature of mean-field games is that selfish agents are weakly coupled in their dynamics and rewards through the state distribution (mean-field) of the agents. If the population is sufficiently large, an approximately optimal solution for these models can be obtained by solving the infinite-population limit, and the solution in this limiting case is the mean-field equilibrium (MFE). The infinite-population game is easier to solve since, when the population is asymptotically large, the action of a single agent does not affect the mean-field. In the continuous setting, the MFE is characterized by a Hamilton-Jacobi-Bellman (HJB) equation coupled with a transport equation [Lasry and Lions, 2007]. The HJB equation describes the optimality conditions for the strategy of a typical agent, and the transport equation captures the evolution of the mean-field. The existence and uniqueness of the MFE have been established in [Huang et al., 2006]. Two desirable features of the MFE are: (i) the resulting strategy is decentralized and identical across all agents, i.e., an agent selects its action based only on the local information of its own state, and no information regarding the mean-field is needed; and (ii) the complexity of the MFE solution does not depend on the number of agents. The mean-field results have been extended to discrete environments in [Cui and Koeppl, 2021; Guan et al., 2022] using entropy regularization. We refer the reader to [Laurière et al., 2022] for a detailed overview of the main results in the mean-field game literature.

The main differences between our setup and the current mean-field game literature are the following: (i) We seek team-optimal strategies, while MFGs seek Nash equilibrium strategies. In particular, we provide performance guarantees when the whole opponent team deviates, while MFGs only consider single-agent deviations. (ii) The MFE assumes that all agents apply the same strategy, and thus the mean-field flow can be solved offline. As a result, the MFE strategy is open-loop with respect to the mean-field. Under the setting in this paper, however, different opponent team strategies can lead to different mean-field trajectories, and thus our optimal strategy requires feedback on the mean-field in order to respond to the different strategies deployed by the opponent team.

Mean-field teams.

The single-team problem with mean-field-type coupling has been considered in [Arabneydi and Mahajan, 2014], where all agents receive the same team reward and thus the problem is collaborative. That work transforms the team problem into a hierarchical control structure, where a fictitious coordinator broadcasts a local policy based on the distribution of agents, and the individual agents then act according to this policy. The work [Arabneydi and Mahajan, 2014] assumes that all agents within the team apply the same strategy, and optimality for the finite-population game is only guaranteed in the LQG setting [Mahajan and Nayyar, 2015]. Our work extends this formulation to the two-team zero-sum setting and justifies the identical-team-strategy assumption on the opponent team.

The concurrent work [Sanjari et al., 2023] studies a similar team-against-team problem, but in a continuous state and action setting. By modeling the randomized policies as Borel probability measures, the authors show that, under a suitable topology induced by a certain information structure, a Nash equilibrium exists for the team game, and that it is exchangeable for the finite population and symmetric (identical) for the infinite population. It is also shown that common randomness is not necessary for an approximate Nash equilibrium in large-population games. Our work differs in the following aspects: (i) The concurrent work [Sanjari et al., 2023] relies on the convexity of the team best-response correspondence and the Kakutani fixed-point theorem to establish the existence of a Nash equilibrium. Due to the discrete nature of our formulation, however, convexity no longer holds, and thus a Nash equilibrium may not exist, as shown in Numerical Example 1. Consequently, we develop a novel reachability-based analysis to study single-sided optimality based on the lower and upper game values. (ii) The mean-field sharing information structure allows an easy transformation of the team-against-team problem into a zero-sum continuous game between two coordinators, which can be solved via dynamic programming. (iii) Our reachability-based approach and the additional Lipschitz assumptions allow us to provide the convergence rate of the finite-population team performance to its infinite-population limit.

Markov team games and information structure.

Different from mean-field games, team problems focus on fully cooperative scenarios, where all agents share a common reward function. The theory of teams was first investigated in [Marschak, 1955; Radner, 1962]. From a game-theoretic perspective, this cooperative setting can also be viewed as a special case of Markov potential games [Zazo et al., 2016], with the potential function being the common reward. When all agents have access to the present and past observations along with the past actions of all other agents, such a system is equivalent to a centralized one, which enables the use of single-agent algorithms, such as value iteration or Q-learning [Bertsekas and Tsitsiklis, 1996]. However, such a centralized system suffers from scalability issues, since the joint state and action spaces grow exponentially with the number of agents. Furthermore, in many applications, decentralization is a desirable or even required trait, due to reasons such as limited communication bandwidth, privacy, etc. This work, therefore, investigates the decentralized team control problem under the information structure known as mean-field sharing [Arabneydi and Mahajan, 2014].

The information structure is one of the most important characteristics of a multi-agent system, since it largely determines the tractability of the corresponding decentralized decision problem [Papadimitriou and Tsitsiklis, 1985; Ho, 1980]. In [Witsenhausen, 1971], information structures are classified into three classes: classical, quasi-classical, and non-classical. The characterizing feature of the classical information structure is centralization of information, i.e., all agents know the information available to all agents acting in the past. A system is called quasi-classical or partially nested if the following condition holds: if agent $i$ can influence agent $j$, then agent $j$ must know the information available to agent $i$. All other information structures are non-classical. In the team-game literature, commonly used information structures include state sharing [Aicardi et al., 1987], belief sharing [Yüksel, 2009], partial-history sharing [Nayyar et al., 2013], mean-field sharing [Arabneydi and Mahajan, 2014], etc. This work, in particular, assumes a mean-field sharing structure, where each agent observes its local state and the mean-fields of both teams. As the agents do not receive information regarding the other agents' individual states, mean-field sharing is a non-classical information structure, thus posing a challenge to attaining tractable solutions.

As decentralized team problems belong to the NEXP complexity class [Bernstein et al., 2002], no efficient algorithm is available, in general. Nonetheless, it is possible to develop a dynamic programming decomposition for specific information structures. Three such generic approaches are discussed in [Mahajan et al., 2012]: namely, the person-by-person approach, the designer's approach, and the common-information approach. While we use the common-information approach in this work, we refer the interested reader to [Mahajan et al., 2012] for the other two approaches.

Team problems in a mixed competitive-cooperative setting have also been considered. For example, [Lagoudakis and Parr, 2002] proposed a centralized learning algorithm for zero-sum Markov games in which each side is composed of multiple agents collaborating against an opponent team of agents. Later, in [Hogeboom-Burr and Yüksel, 2023], the authors studied the information structure and the existence of equilibria in games involving teams against teams.

Common-information approach.

The common information-based approach was proposed in [Nayyar et al., 2013] to investigate optimal strategies in decentralized stochastic control problems with a partial history-sharing information structure. In this model, each agent shares part of its information with the other agents at each time step. Based on the common information available to all agents, a fictitious coordinator is designed. It is shown that the optimal control problem at the coordinator level is a centralized Partially Observable Markov Decision Process (POMDP) and can be solved using dynamic programming. The coordinator solves the POMDP and chooses a prescription for each agent that maps that agent's local information to its actions. The common-information approach has been used to solve decentralized stochastic control problems including the control-sharing information structure [Mahajan, 2011], mean-field teams [Arabneydi and Mahajan, 2014], and LQG systems [Mahajan and Nayyar, 2015].

Most relevant to this work, [Kartik et al., 2021] considered a general model of zero-sum stochastic games between two competing teams, where the information available to each agent can be decomposed into common and private information. Exploiting this specific information structure, an expanded game between two virtual players (fictitious coordinators) is constructed based on the sufficient statistic known as the common information belief. The authors showed that the expanded game can be solved via dynamic programming, and that the upper and lower values of the expanded game provide bounds for the original game values. Although we utilize a similar common-information approach as in [Kartik et al., 2021], this work focuses more on the large-population aspect of the problem, i.e., the mean-field limits of the team games along with the performance guarantees under the mean-field approximation.

In this work, we treat the mean-fields of both (infinite-population) teams as common information. Instead of having a single coordinator, we introduce two fictitious coordinators, one for each team, and, consequently, formulate an equivalent zero-sum game between the two coordinators. Since we consider the equivalent coordinator game at the infinite-population limit, the two mean-fields provide enough information to characterize the status of the system under the mean-field dynamics, which leads to a full-state information structure instead of a partially observable one. Although the original environment is discrete, the resulting zero-sum coordinator game has continuous state and action spaces: the joint state of the coordinator game consists of the mean-fields of the two teams, which are distributions over the state spaces and hence continuous objects, and the actions used by the coordinators are the local policies, which are distributions over the action spaces and are also continuous. Under certain assumptions, we show that the optimal max-min and min-max value functions are Lipschitz continuous with respect to the mean-fields. This justifies the solution of the continuous game via a discretization scheme.

1.3 Road Map

Our approach can be summarized in the road map in Figure 1.

  1.

    In Section 2, we present the formulation of the zero-sum mean-field team game (ZS-MFTG) with a finite number of agents.

  2.

    In Section 3, we employ the mean-field approximation and consider the infinite-population limit, where identical team strategies are deployed.

  3.

    In Section 4, we use the common-information approach [Nayyar et al., 2013] and reformulate the infinite-population problem as a zero-sum game between two coordinators. This allows us to efficiently find the max-min and min-max optimal strategies through dynamic programming.

  4.

    In Section 5, we establish that the optimal strategies for the coordinator game (computed under the common-information approach, the mean-field approximation, and identical team strategies) remain $\epsilon$-optimal for the original finite-population game. Specifically, if one team applies the proposed strategy, the opponent team can only improve its performance by $\mathcal{O}\big(1/\sqrt{\min\{N_1,N_2\}}\big)$, even if the opponent team uses a non-identical team strategy to exploit the proposed one.

Figure 1: The road map of the proposed approach.

1.4 Notations

We use the shorthand notation $[n]$ to denote the set of indices $\{1,2,\ldots,n\}$. The indicator function is denoted as $\mathds{1}_{\cdot}(\cdot)$, such that $\mathds{1}_a(b)=1$ if $a=b$ and $0$ otherwise. To distinguish between random variables and their realizations, we use uppercase letters to denote random variables (e.g., $X$ and $\mathcal{M}$) and lowercase letters to denote their realizations (e.g., $x$ and $\mu$). We use bold letters to denote vectors, e.g., $\mathbf{Y}=(Y_1,\ldots,Y_n)$. For a finite set $E$, we denote the space of all probability measures over $E$ as $\mathcal{P}(E)$ and equip it with the total variation norm $\mathrm{d}_{\mathrm{TV}}$ [Chung, 2001]. Additional important notation used in this paper is summarized in Table 1.

Table 1: Important notation.

| Notation | Description | Population size |
| $N_1$ / $N_2$ / $N$ | Number of agents in the Blue team / Red team / whole system | Finite |
| $\rho$ | Team size ratio $N_1/N$ | Finite |
| $X^{N_1}_{i,t}$ ($x^{N_1}_{i,t}$) / $Y^{N_2}_{j,t}$ ($y^{N_2}_{j,t}$) | State of a single Blue/Red agent (realization) | Finite |
| $\mathcal{M}^{N_1}_t$ ($\mu^{N_1}_t$) / $\mathcal{N}^{N_2}_t$ ($\nu^{N_2}_t$) | Empirical distribution of the Blue/Red team (realization) | Finite |
| $\mu^{\rho}_t$ / $\nu^{\rho}_t$ | Mean-field of the Blue/Red team | Infinite |
| $\phi_t$ / $\psi_t$ | Identical Blue/Red team policy | Finite & Infinite |
| $\phi$ / $\psi$ | Identical Blue/Red team strategy | Finite & Infinite |
| $\phi^{N_1}_t$ / $\psi^{N_2}_t$ | Non-identical Blue/Red team policy | Finite |
| $\phi^{N_1}$ / $\psi^{N_2}$ | Non-identical Blue/Red team strategy | Finite |
| $\pi_t$ / $\sigma_t$ | Identical Blue/Red team local policy | Infinite |
| $\alpha$ / $\beta$ | Blue/Red coordinator strategy | Infinite |
| $r^{\rho}_t$ | Game reward | Finite & Infinite |
| $f^{\rho}_t$ / $g^{\rho}_t$ | Dynamics of the Blue/Red agents | Finite & Infinite |
| $J^N$ | Finite-population game value | Finite |
| $J^{\rho}_{\mathrm{cor}}$ | Coordinator game value | Infinite |
| $\mathcal{R}^{\rho}_{\mu,t}$ / $\mathcal{R}^{\rho}_{\nu,t}$ | Reachability correspondence for the Blue/Red mean-field | Infinite |

2 Problem Formulation

Consider a discrete-time system with two large teams of agents that operate over a finite time horizon $T$. The Blue team consists of $N_1$ homogeneous agents, and the Red team consists of $N_2$ homogeneous agents. The total system size is denoted as $N=N_1+N_2$. We use $\rho=N_1/N$ to denote the ratio between the size of the Blue team and the total number of agents (Blue and Red agents combined). Let $X^{N_1}_{i,t}\in\mathcal{X}$ and $U^{N_1}_{i,t}\in\mathcal{U}$ be the random variables representing the state and action, respectively, of Blue agent $i\in[N_1]$ at time $t$. Here, $\mathcal{X}$ and $\mathcal{U}$ represent the finite individual state and action spaces of each Blue agent, independent of $i$ and $t$. Similarly, we use $Y^{N_2}_{j,t}\in\mathcal{Y}$ and $V^{N_2}_{j,t}\in\mathcal{V}$ to denote the individual state and action of Red agent $j\in[N_2]$. The joint state and action of the Blue team are denoted as $\mathbf{X}^{N_1}_t=(X^{N_1}_{1,t},\ldots,X^{N_1}_{N_1,t})$ and $\mathbf{U}^{N_1}_t=(U^{N_1}_{1,t},\ldots,U^{N_1}_{N_1,t})$, respectively. Similarly, the Red team's joint state and action are denoted as $\mathbf{Y}^{N_2}_t$ and $\mathbf{V}^{N_2}_t$. We use $\mathbf{X}^{N_1}_{-i,t}=(X^{N_1}_{1,t},\ldots,X^{N_1}_{i-1,t},X^{N_1}_{i+1,t},\ldots,X^{N_1}_{N_1,t})$ to denote the joint state of the Blue team excluding Blue agent $i$, and $\mathbf{Y}^{N_2}_{-j,t}$ is defined similarly.

Definition 1 (Empirical Distribution).

The empirical distributions (EDs) of the Blue and Red teams are defined as

\mathcal{M}^{N_1}_t(x) = \frac{1}{N_1}\sum_{i=1}^{N_1}\mathds{1}_x\big(X^{N_1}_{i,t}\big), \qquad x\in\mathcal{X},   (1a)
\mathcal{N}^{N_2}_t(y) = \frac{1}{N_2}\sum_{j=1}^{N_2}\mathds{1}_y\big(Y^{N_2}_{j,t}\big), \qquad y\in\mathcal{Y}.   (1b)

Notice that $\mathcal{M}^{N_1}_t(x)$ gives the fraction of Blue agents at state $x$. Hence, the random vector $\mathcal{M}^{N_1}_t=[\mathcal{M}^{N_1}_t(x)]_{x\in\mathcal{X}}$ is a probability measure, i.e., $\mathcal{M}^{N_1}_t\in\mathcal{P}(\mathcal{X})$. Similarly, we have $\mathcal{N}^{N_2}_t\in\mathcal{P}(\mathcal{Y})$. Since finite state and action spaces are considered, we have $\mathcal{P}(\mathcal{X})=\Delta_{|\mathcal{X}|}$ and $\mathcal{P}(\mathcal{Y})=\Delta_{|\mathcal{Y}|}$, where $\Delta_k$ denotes the $k$-dimensional standard simplex.

We introduce two operators $\mathrm{Emp}_{\mu}:\mathcal{X}^{N_1}\to\mathcal{P}(\mathcal{X})$ and $\mathrm{Emp}_{\nu}:\mathcal{Y}^{N_2}\to\mathcal{P}(\mathcal{Y})$ to denote the operation performed in (1), which relates the joint states to their corresponding EDs:

\mu^{N_1}_t = \mathrm{Emp}_{\mu}\big(\mathbf{x}^{N_1}_t\big), \qquad \nu^{N_2}_t = \mathrm{Emp}_{\nu}\big(\mathbf{y}^{N_2}_t\big).   (2)

To measure the distance between two distributions (probability measures), we use the total variation.

Definition 2.

For a finite set $E$, the total variation between two probability measures $\mu,\mu'\in\mathcal{P}(E)$ is given by

\mathrm{d}_{\mathrm{TV}}\big(\mu,\mu'\big) = \frac{1}{2}\sum_{e\in E}\big\lvert\mu(e)-\mu'(e)\big\rvert = \frac{1}{2}\big\lVert\mu-\mu'\big\rVert_1.   (3)
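The objects in Definitions 1 and 2 are simple to compute. The following Python sketch (added here purely for illustration; the integer encoding of states is our own convention) evaluates the $\mathrm{Emp}$ operator of (2) and the total variation distance (3).

```python
import numpy as np

def empirical_distribution(joint_state, num_states):
    """Emp operator in (2): fraction of agents in each state (Definition 1)."""
    counts = np.bincount(joint_state, minlength=num_states)
    return counts / len(joint_state)

def tv_distance(mu, mu_prime):
    """Total variation distance between two finite distributions, Eq. (3)."""
    return 0.5 * np.sum(np.abs(mu - mu_prime))

# Example: N1 = 4 Blue agents on |X| = 2 states, joint state (x^1, x^1, x^2, x^1).
mu = empirical_distribution(np.array([0, 0, 1, 0]), num_states=2)   # -> [0.75, 0.25]
nu = empirical_distribution(np.array([0, 1, 1, 1]), num_states=2)   # -> [0.25, 0.75]
print(mu, nu, tv_distance(mu, nu))                                  # TV = 0.5
```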

2.1 Dynamics

We consider weakly-coupled dynamics, where the dynamics of each individual agent are coupled with those of the other agents through the EDs. For Blue agent $i$, the stochastic transition is governed by a transition kernel $f^{\rho}_t:\mathcal{X}\times\mathcal{X}\times\mathcal{U}\times\mathcal{P}(\mathcal{X})\times\mathcal{P}(\mathcal{Y})\to[0,1]$ and satisfies

\mathbb{P}\big(X^{N_1}_{i,t+1}=x^{N_1}_{i,t+1}\,\big|\,U^{N_1}_{i,t}=u^{N_1}_{i,t},\mathbf{X}^{N_1}_t=\mathbf{x}^{N_1}_t,\mathbf{Y}^{N_2}_t=\mathbf{y}^{N_2}_t\big) = f^{\rho}_t\big(x^{N_1}_{i,t+1}\,\big|\,x^{N_1}_{i,t},u^{N_1}_{i,t},\mu^{N_1}_t,\nu^{N_2}_t\big),   (4)

where $\mu^{N_1}_t=\mathrm{Emp}_{\mu}(\mathbf{x}^{N_1}_t)$ and $\nu^{N_2}_t=\mathrm{Emp}_{\nu}(\mathbf{y}^{N_2}_t)$. Note that the transition kernel depends on $\rho=N_1/N$, the ratio between the sizes of the two teams. Similarly, the transition of Red agent $j$ follows

\mathbb{P}\big(Y^{N_2}_{j,t+1}=y^{N_2}_{j,t+1}\,\big|\,V^{N_2}_{j,t}=v^{N_2}_{j,t},\mathbf{X}^{N_1}_t=\mathbf{x}^{N_1}_t,\mathbf{Y}^{N_2}_t=\mathbf{y}^{N_2}_t\big) = g^{\rho}_t\big(y^{N_2}_{j,t+1}\,\big|\,y^{N_2}_{j,t},v^{N_2}_{j,t},\mu^{N_1}_t,\nu^{N_2}_t\big).   (5)
Assumption 1.

The transition kernels $f^{\rho}_t$ and $g^{\rho}_t$ are $L_{f_t}$- and $L_{g_t}$-Lipschitz continuous, respectively. Formally, for all $\mu,\mu'\in\mathcal{P}(\mathcal{X})$ and $\nu,\nu'\in\mathcal{P}(\mathcal{Y})$, and for all $t\in\{0,\ldots,T-1\}$, there exist positive constants $L_{f_t}$ and $L_{g_t}$ such that

\sum_{x'\in\mathcal{X}}\big\lvert f^{\rho}_t(x'|x,u,\mu,\nu)-f^{\rho}_t(x'|x,u,\mu',\nu')\big\rvert \leq L_{f_t}\big(\mathrm{d}_{\mathrm{TV}}(\mu,\mu')+\mathrm{d}_{\mathrm{TV}}(\nu,\nu')\big), \qquad \forall x\in\mathcal{X},\,u\in\mathcal{U},
\sum_{y'\in\mathcal{Y}}\big\lvert g^{\rho}_t(y'|y,v,\mu,\nu)-g^{\rho}_t(y'|y,v,\mu',\nu')\big\rvert \leq L_{g_t}\big(\mathrm{d}_{\mathrm{TV}}(\mu,\mu')+\mathrm{d}_{\mathrm{TV}}(\nu,\nu')\big), \qquad \forall y\in\mathcal{Y},\,v\in\mathcal{V}.
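To make the weakly-coupled dynamics (4) concrete, the sketch below advances a finite Blue team by one step: each agent samples its next state from $f^{\rho}_t(\cdot\,|\,x,u,\mu,\nu)$, where the coupling with the other agents enters only through the two EDs. The specific kernel `f_rho` is a hypothetical two-node example of our own, not one prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_rho(x, u, mu, nu, rho):
    """Hypothetical weakly-coupled Blue kernel on two nodes: action u = 1 attempts
    to move to the other node, succeeding with a probability biased by the local
    numerical advantage rho*mu[x] - (1 - rho)*nu[x]; action u = 0 stays put."""
    p_move = 0.5 * (1.0 + rho * mu[x] - (1.0 - rho) * nu[x]) if u == 1 else 0.0
    probs = np.zeros(2)
    probs[1 - x] = p_move
    probs[x] = 1.0 - p_move
    return probs

def blue_one_step(x_joint, u_joint, nu, rho):
    """Sample X_{t+1} for every Blue agent given the joint state/action and the EDs,
    as in (4): the coupling enters only through mu and nu."""
    mu = np.bincount(x_joint, minlength=2) / len(x_joint)   # Blue ED, Eq. (1a)
    return np.array([rng.choice(2, p=f_rho(x, u, mu, nu, rho))
                     for x, u in zip(x_joint, u_joint)])

# Two Blue agents at node 1 choosing "move"; Red ED nu = [0.5, 0.5]; rho = 0.5.
print(blue_one_step(np.array([0, 0]), np.array([1, 1]), np.array([0.5, 0.5]), rho=0.5))
```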

2.2 Reward Structure

Under the team-game framework, agents in the same team share the same reward. Similar to the dynamics, we consider a weakly-coupled team reward

r^{\rho}_t : \mathcal{P}(\mathcal{X})\times\mathcal{P}(\mathcal{Y})\to[-R_{\max},R_{\max}].   (6)
Assumption 2.

The reward function is $L_r$-Lipschitz continuous. Formally, for all $\mu,\mu'\in\mathcal{P}(\mathcal{X})$, $\nu,\nu'\in\mathcal{P}(\mathcal{Y})$, and for all $t\in\{0,\ldots,T\}$, there exists a positive constant $L_r$ such that

\big\lvert r^{\rho}_t(\mu,\nu)-r^{\rho}_t(\mu',\nu')\big\rvert \leq L_r\big(\mathrm{d}_{\mathrm{TV}}(\mu,\mu')+\mathrm{d}_{\mathrm{TV}}(\nu,\nu')\big).   (7)

Under the zero-sum reward structure, we let the Blue team maximize the reward while the Red team minimizes it.

2.3 Information Structure.

We assume a mean-field sharing information structure [Arabneydi and Mahajan, 2015]. At each time step $t$, Blue agent $i$ observes its own state $X^{N_1}_{i,t}$ and the EDs $\mathcal{M}^{N_1}_t$ and $\mathcal{N}^{N_2}_t$. Similarly, Red agent $j$ observes $Y^{N_2}_{j,t}$, $\mathcal{M}^{N_1}_t$, and $\mathcal{N}^{N_2}_t$. Viewing the EDs as aggregated states of the teams, we consider the following mixed Markov policies:

\phi_{i,t} : \mathcal{U}\times\mathcal{X}\times\mathcal{P}(\mathcal{X})\times\mathcal{P}(\mathcal{Y})\to[0,1],   (8)
\psi_{j,t} : \mathcal{V}\times\mathcal{Y}\times\mathcal{P}(\mathcal{X})\times\mathcal{P}(\mathcal{Y})\to[0,1],

where $\phi_{i,t}(u|X^{N_1}_{i,t},\mathcal{M}^{N_1}_t,\mathcal{N}^{N_2}_t)$ represents the probability that Blue agent $i$ selects action $u$ given its state $X^{N_1}_{i,t}$ and the team EDs $\mathcal{M}^{N_1}_t$ and $\mathcal{N}^{N_2}_t$. Note that an agent's individual state is private information, while the team EDs are common information shared among all agents.

An individual strategy is defined as the time sequence $\phi_i=\{\phi_{i,t}\}_{t=0}^{T}$. A Blue team strategy $\phi^{N_1}=\{\phi_i\}_{i=1}^{N_1}$ is the collection of the individual strategies used by the Blue agents. We use $\Phi_t$ and $\Phi$ to denote, respectively, the set of individual policies and the set of individual strategies available to each Blue agent (since the Blue agents have the same state and action spaces, they share the same policy space). The set of Blue team strategies is then defined as the Cartesian product $\Phi^{N_1}=\times_{i=1}^{N_1}\Phi$. These notations extend naturally to the Red team.

In summary, an instance of a zero-sum mean-field team game is defined as the tuple:

ZS-MFTG = \langle \mathcal{X},\mathcal{Y},\mathcal{U},\mathcal{V},f^{\rho}_t,g^{\rho}_t,r^{\rho}_t,N_1,N_2,T \rangle.

2.4 Optimization Problem.

Starting from the initial joint states $\mathbf{x}^{N_1}_0$ and $\mathbf{y}^{N_2}_0$, the performance of any team strategy pair $(\phi^{N_1},\psi^{N_2})\in\Phi^{N_1}\times\Psi^{N_2}$ is quantified by the expected cumulative reward

J^{N,\phi^{N_1},\psi^{N_2}}\big(\mathbf{x}^{N_1}_0,\mathbf{y}^{N_2}_0\big) = \mathbb{E}_{\phi^{N_1},\psi^{N_2}}\Big[\sum_{t=0}^{T} r^{\rho}_t\big(\mathcal{M}^{N_1}_t,\mathcal{N}^{N_2}_t\big)\,\Big|\,\mathbf{X}^{N_1}_0=\mathbf{x}^{N_1}_0,\,\mathbf{Y}^{N_2}_0=\mathbf{y}^{N_2}_0\Big],   (9)

where $\mathcal{M}^{N_1}_t=\mathrm{Emp}_{\mu}(\mathbf{X}^{N_1}_t)$ and $\mathcal{N}^{N_2}_t=\mathrm{Emp}_{\nu}(\mathbf{Y}^{N_2}_t)$, and the expectation is with respect to the distribution induced on all system variables by the strategy pair $(\phi^{N_1},\psi^{N_2})$.

Remark 1.

Note that the value function in (9) takes the joint states as arguments, different from the value function in [Arabneydi and Mahajan, 2015], which takes the ED as its argument. The difference comes from the non-identical team strategies allowed in our formulation, which require each agent's state and index information to sample actions and predict the game's evolution. Arabneydi and Mahajan [2015] assume an identical team strategy, and hence the ED is an information state (a sufficient statistic) that characterizes the value function of the finite-population game. A counter-example showing that the ED is not an information state under non-identical team strategies is presented in Appendix A.3.

When the maximizing Blue team considers its worst-case performance, we have the following max-min optimization:

\underline{J}^{N*} = \max_{\phi^{N_1}\in\Phi^{N_1}}\;\min_{\psi^{N_2}\in\Psi^{N_2}}\; J^{N,\phi^{N_1},\psi^{N_2}},   (10)

where $\underline{J}^{N*}$ is the lower game value of the finite-population game. Note that the game value may not always exist (the max-min value may differ from the min-max value) [Elliott and Kalton, 1972]. In Section 4.5, we present a special case of the ZS-MFTG where the game value exists at the infinite-population limit. For the general case, we consider the following optimality condition for the Blue team strategy.

Definition 3.

A Blue team strategy $\phi^{N_1*}$ is $\epsilon$-optimal if

\underline{J}^{N*} \geq \min_{\psi^{N_2}\in\Psi^{N_2}} J^{N,\phi^{N_1*},\psi^{N_2}} \geq \underline{J}^{N*} - \epsilon.

The strategy $\phi^{N_1*}$ is optimal if $\epsilon=0$.

Similarly, the minimizing Red team considers a min-max optimization problem, which leads to the upper game value as follows

\bar{J}^{N*} = \min_{\psi^{N_2}\in\Psi^{N_2}}\;\max_{\phi^{N_1}\in\Phi^{N_1}}\; J^{N,\phi^{N_1},\psi^{N_2}}.   (11)

The definitions of optimality and ϵ\epsilon-optimality extend naturally to the Red team strategies.

The rest of the paper focuses on the performance analysis from the Blue team's perspective (max-min optimization), but the techniques developed are applicable to the Red team's side due to the symmetry of the problem formulation.

2.5 Examples

A two-node example

Consider a simple team game on the two-node graph in Figure 2. The state spaces are $\mathcal{X}=\{x^1,x^2\}$ and $\mathcal{Y}=\{y^1,y^2\}$, and the action spaces are $\mathcal{U}=\{u^1,u^2\}$ and $\mathcal{V}=\{v^1,v^2\}$. The Blue action $u^1$ corresponds to staying at the current node, and $u^2$ corresponds to moving to the other node. The same connotations apply to the Red actions $v^1$ and $v^2$.

Figure 2: An example of a ZS-MFTG over a two-node graph, where $N_1=2$, $N_2=2$, and $\rho=0.5$.

The maximizing Blue team's objective is to maximize its presence at node 2 (state $x^2$), and hence $r_t(\mu,\nu)=\mu(x^2)$. The Blue transition kernel at $x^1$ under $u^2$ is defined as

f_t(x^1|x^1,u^2,\mu,\nu) = 0.5\big(1-\big(\rho\mu(x^1)-(1-\rho)\nu(y^1)\big)\big),
f_t(x^2|x^1,u^2,\mu,\nu) = 0.5\big(1+\big(\rho\mu(x^1)-(1-\rho)\nu(y^1)\big)\big).

Under this transition kernel, the probability of a Blue agent transitioning from node 1 to node 2 depends on the Blue team’s numerical advantage over the Red team at node 1.

The initial joint states depicted in Figure 2 are $\mathbf{x}^2_0=[x^1,x^1]$ and $\mathbf{y}^2_0=[y^1,y^2]$. The corresponding EDs are $\mu^2_0=[1,0]$ and $\nu^2_0=[0.5,0.5]$, and the running reward is $r_0=\mu^2_0(x^2)=0$. Suppose the Blue team applies a team strategy such that $\phi^i_0(u^2|x^1,\mu^2_0,\nu^2_0)=1$ for both $i\in[2]$ (under which both Blue agents apply $u^2$). The probability of an individual Blue agent transitioning to node 2 is then 0.625. Thus, the next Blue ED is a random vector with three possible realizations: (i) $\mathcal{M}^2_1=[1,0]$ with probability 0.14 (both Blue agents remain at node 1); (ii) $\mathcal{M}^2_1=[0.5,0.5]$ with probability 0.47 (one Blue agent moves and the other remains); and (iii) $\mathcal{M}^2_1=[0,1]$ with probability 0.39 (both Blue agents move). Suppose the game terminates at $T=1$; then the value under the given Blue strategy $\phi^2$ is $J^{4,\phi^2,\psi^2}(\mathbf{x}^2_0,\mathbf{y}^2_0) = 0 + (0.14\cdot 0 + 0.47\cdot 0.5 + 0.39\cdot 1) \approx 0.63$.
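The numbers in this example follow from a direct enumeration of the two Blue agents' independent transitions; a short Python check (our own illustration) is given below.

```python
from math import comb

rho, mu_x1, nu_y1 = 0.5, 1.0, 0.5                    # initial EDs: mu_0 = [1, 0], nu_0 = [0.5, 0.5]
p = 0.5 * (1 + (rho * mu_x1 - (1 - rho) * nu_y1))    # f_t(x^2 | x^1, u^2, mu, nu) = 0.625

# Both Blue agents move independently, so the number reaching node 2 is Binomial(2, p).
probs = {k: comb(2, k) * p**k * (1 - p)**(2 - k) for k in range(3)}
# probs ~ {0: 0.14, 1: 0.47, 2: 0.39}, the three realizations of M_1^2 listed above.

value = 0.0 + sum(probs[k] * (k / 2) for k in range(3))   # r_0 + E[mu_1(x^2)]
print(p, probs, value)                                    # value = 0.625 (~0.63 after rounding)
```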

Connections to the dynamics in [Huang et al., 2006]

As a special case of the previously described dynamics and reward structure, we consider the following pairwise-state-coupled dynamics and team reward, derived from the single-population model in [Huang et al., 2006]. The transition kernels for the Blue and Red agents are defined as follows:

f_t\big(x^{N_1}_{i,t+1}\big|x^{N_1}_{i,t},u^{N_1}_{i,t},\mathbf{x}^{N_1}_{-i,t},\mathbf{y}^{N_2}_t\big) = \frac{\sum_{k=1}^{N_1} f_{1,t}\big(x^{N_1}_{i,t+1}\big|x^{N_1}_{i,t},u^{N_1}_{i,t},x^{N_1}_{k,t}\big)}{N} + \frac{\sum_{k=1}^{N_2} f_{2,t}\big(x^{N_1}_{i,t+1}\big|x^{N_1}_{i,t},u^{N_1}_{i,t},y^{N_2}_{k,t}\big)}{N},   (12a)
g_t\big(y^{N_2}_{j,t+1}\big|y^{N_2}_{j,t},v^{N_2}_{j,t},\mathbf{x}^{N_1}_t,\mathbf{y}^{N_2}_{-j,t}\big) = \frac{\sum_{k=1}^{N_1} g_{1,t}\big(y^{N_2}_{j,t+1}\big|y^{N_2}_{j,t},v^{N_2}_{j,t},x^{N_1}_{k,t}\big)}{N} + \frac{\sum_{k=1}^{N_2} g_{2,t}\big(y^{N_2}_{j,t+1}\big|y^{N_2}_{j,t},v^{N_2}_{j,t},y^{N_2}_{k,t}\big)}{N}.   (12b)

The terms $f_{1,t}$, $f_{2,t}$, $g_{1,t}$, and $g_{2,t}$ are transition kernels that model the pairwise influence between agents. For example, $f_{2,t}$ represents the influence that the state of a Red agent has on the transition of a Blue agent.

The weakly-coupled team reward is the averaged state-dependent reward

r_t\big(\mathbf{x}^{N_1}_t,\mathbf{y}^{N_2}_t\big) = \frac{1}{N}\sum_{k=1}^{N_1} r_{1,t}\big(x^{N_1}_{k,t}\big) - \frac{1}{N}\sum_{k=1}^{N_2} r_{2,t}\big(y^{N_2}_{k,t}\big).   (13)

Through some algebraic manipulations, the aforementioned pairwise-state-coupled transition kernels and team reward can be transformed into the weakly-coupled form of equations (4) and (6) as follows:

f_t\big(x^{N_1}_{i,t+1}\big|x^{N_1}_{i,t},u^{N_1}_{i,t},\mathbf{x}^{N_1}_{-i,t},\mathbf{y}^{N_2}_t\big) = \rho\sum_{x\in\mathcal{X}}\mu^{N_1}_t(x)\,f_{1,t}\big(x^{N_1}_{i,t+1}\big|x^{N_1}_{i,t},u^{N_1}_{i,t},x\big) + (1-\rho)\sum_{y\in\mathcal{Y}}\nu^{N_2}_t(y)\,f_{2,t}\big(x^{N_1}_{i,t+1}\big|x^{N_1}_{i,t},u^{N_1}_{i,t},y\big) \triangleq f^{\rho}_t\big(x^{N_1}_{i,t+1}\big|x^{N_1}_{i,t},u^{N_1}_{i,t},\mu^{N_1}_t,\nu^{N_2}_t\big),
g_t\big(y^{N_2}_{j,t+1}\big|y^{N_2}_{j,t},v^{N_2}_{j,t},\mathbf{x}^{N_1}_t,\mathbf{y}^{N_2}_{-j,t}\big) = \rho\sum_{x\in\mathcal{X}}\mu^{N_1}_t(x)\,g_{1,t}\big(y^{N_2}_{j,t+1}\big|y^{N_2}_{j,t},v^{N_2}_{j,t},x\big) + (1-\rho)\sum_{y\in\mathcal{Y}}\nu^{N_2}_t(y)\,g_{2,t}\big(y^{N_2}_{j,t+1}\big|y^{N_2}_{j,t},v^{N_2}_{j,t},y\big) \triangleq g^{\rho}_t\big(y^{N_2}_{j,t+1}\big|y^{N_2}_{j,t},v^{N_2}_{j,t},\mu^{N_1}_t,\nu^{N_2}_t\big),
r_t\big(\mathbf{x}^{N_1}_t,\mathbf{y}^{N_2}_t\big) = \rho\sum_{x\in\mathcal{X}}\mu^{N_1}_t(x)\,r_{1,t}(x) - (1-\rho)\sum_{y\in\mathcal{Y}}\nu^{N_2}_t(y)\,r_{2,t}(y) \triangleq r^{\rho}_t\big(\mu^{N_1}_t,\nu^{N_2}_t\big),

where $\rho=N_1/N$, $\mu^{N_1}_t=\mathrm{Emp}_{\mu}(\mathbf{x}^{N_1}_t)$, and $\nu^{N_2}_t=\mathrm{Emp}_{\nu}(\mathbf{y}^{N_2}_t)$ (see Appendix A for the detailed derivations). In this specific example, it is clear that the team with more agents has a larger influence on the dynamics and the reward. Hence, it is necessary to include $\rho=N_1/N$ as a parameter in both the dynamics and the reward function. It can be further verified that the above weakly-coupled transition kernels $f^{\rho}_t$ and $g^{\rho}_t$ are both 2-Lipschitz continuous with respect to the EDs. Additionally, the weakly-coupled reward function has Lipschitz constant $L_r=2\max\{r_{1,\max},r_{2,\max}\}$, where $r_{1,\max}=\max_{x,t}\lvert r_{1,t}(x)\rvert$ and $r_{2,\max}=\max_{y,t}\lvert r_{2,t}(y)\rvert$ (see Propositions 1 and 2 in Appendix A).
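The identity above amounts to exchanging the order of summation: averaging the pairwise kernels over the other agents' individual states equals averaging over the EDs with weights $\rho$ and $1-\rho$. The following sketch verifies this numerically for randomly generated kernels $f_{1,t}$ and $f_{2,t}$ (the random data and the array layout are our own test setup).

```python
import numpy as np

rng = np.random.default_rng(1)
nX, nY, nU = 3, 2, 2
N1, N2 = 5, 3
N, rho = N1 + N2, N1 / (N1 + N2)

# Random pairwise kernels f1[x', x, u, x_k] and f2[x', x, u, y_k], normalized over x'.
f1 = rng.random((nX, nX, nU, nX)); f1 /= f1.sum(axis=0, keepdims=True)
f2 = rng.random((nX, nX, nU, nY)); f2 /= f2.sum(axis=0, keepdims=True)

x_joint = rng.integers(nX, size=N1)          # Blue joint state
y_joint = rng.integers(nY, size=N2)          # Red joint state
mu = np.bincount(x_joint, minlength=nX) / N1
nu = np.bincount(y_joint, minlength=nY) / N2

x_i, u_i = x_joint[0], 1                     # focus on Blue agent 0 taking action u = 1

# Left-hand side of (12a): average the pairwise kernels over the individual states.
lhs = (f1[:, x_i, u_i, x_joint].sum(axis=1) + f2[:, x_i, u_i, y_joint].sum(axis=1)) / N
# Right-hand side: the weakly-coupled form f_t^rho expressed through (mu, nu).
rhs = rho * f1[:, x_i, u_i, :] @ mu + (1 - rho) * f2[:, x_i, u_i, :] @ nu

print(np.allclose(lhs, rhs))                 # True
```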

3 Mean-Field Approximation

The preceding max-min and min-max optimizations are intractable for large-population systems, since the dimensions of the joint policy spaces $\Phi^{N_1}$ and $\Psi^{N_2}$ grow exponentially with the number of agents. To address this scalability challenge, we first consider the limiting case of the ZS-MFTG, where the number of agents in both teams goes to infinity. We further assume that agents from the same infinite-population team employ the same strategy. As a result, the behavior of an entire team can be represented by a typical agent [Huang et al., 2006]. As we will show later, due to the law of large numbers, the ED of an infinite-population team converges to the state distribution of its corresponding typical agent. This limiting distribution is known as the mean-field [Huang et al., 2006]. In the sequel, we formulate the mean-field team game at its infinite-population limit and introduce several additional concepts that are necessary for the performance analysis in the following sections.

3.1 Identical Team Strategies and Mean-Field Dynamics

At the infinite-population limit, we focus on Blue and Red teams that employ identical team strategies. Consequently, we first formalize the class of identical team strategies.

Definition 4 (Identical Blue Team Strategy).

The Blue team strategy $\phi^{N_1}=\{\phi_1,\ldots,\phi_{N_1}\}$ is an identical team strategy if $\phi_{i_1,t}=\phi_{i_2,t}$ for all $i_1,i_2\in[N_1]$ and all $t\in\{0,\ldots,T-1\}$.

We slightly abuse notation and use $\phi$ to denote the identical Blue team strategy, under which all Blue agents apply the same individual strategy $\phi$. Consequently, $\Phi$ denotes both the set of Blue individual strategies and the set of identical Blue team strategies (when $\Phi$ denotes the set of identical team strategies, it is a subset of the original set of Blue team strategies $\Phi^{N_1}$, i.e., $\Phi\subseteq\Phi^{N_1}$). The definitions and notations extend to identical Red team strategies.

With the notion of identical team strategies in place, we define the mean-field flow as the deterministic sequence of state distributions of a typical agent under a given pair of strategies.

Definition 5.

Given identical team strategies $\phi\in\Phi$ and $\psi\in\Psi$, the mean-fields (MFs) are defined as the sequences of vectors that adhere to the following coupled deterministic dynamics:

\mu^{\rho}_{t+1}(x') = \sum_{x\in\mathcal{X}}\Big[\sum_{u\in\mathcal{U}} f^{\rho}_t(x'|x,u,\mu^{\rho}_t,\nu^{\rho}_t)\,\phi_t(u|x,\mu^{\rho}_t,\nu^{\rho}_t)\Big]\mu^{\rho}_t(x),   (14a)
\nu^{\rho}_{t+1}(y') = \sum_{y\in\mathcal{Y}}\Big[\sum_{v\in\mathcal{V}} g^{\rho}_t(y'|y,v,\mu^{\rho}_t,\nu^{\rho}_t)\,\psi_t(v|y,\mu^{\rho}_t,\nu^{\rho}_t)\Big]\nu^{\rho}_t(y),   (14b)

where the initial conditions are given by $\mu^{\rho}_0=\mu_0$ and $\nu^{\rho}_0=\nu_0$.

The above deterministic mean-field dynamics can be expressed in a compact matrix form as

\mu^{\rho}_{t+1} = \mu^{\rho}_t F^{\rho}_t(\mu^{\rho}_t,\nu^{\rho}_t,\phi_t),   (15)
\nu^{\rho}_{t+1} = \nu^{\rho}_t G^{\rho}_t(\mu^{\rho}_t,\nu^{\rho}_t,\psi_t),

where $F^{\rho}_t\in\mathbb{R}^{|\mathcal{X}|\times|\mathcal{X}|}$ is the transition matrix for a typical Blue agent's distribution under the policy $\phi_t$, with entries $\big[F^{\rho}_t(\mu^{\rho}_t,\nu^{\rho}_t,\phi_t)\big]_{pq} = \sum_{u\in\mathcal{U}} f^{\rho}_t(q|p,u,\mu^{\rho}_t,\nu^{\rho}_t)\,\phi_t(u|p,\mu^{\rho}_t,\nu^{\rho}_t)$. The matrix $G^{\rho}_t$ is defined similarly.
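The matrix form (15) translates directly into code: build the row-stochastic matrix $F^{\rho}_t$ from the kernel and the policy, then left-multiply by the current MF treated as a row vector. The sketch below is our own illustration; the kernel used at the end is a hypothetical symmetric extension of the two-node example of Section 2.5.

```python
import numpy as np

rho = 0.5

def f_rho(q, p, u, mu, nu):
    """Hypothetical two-node kernel: action u = 1 attempts a move from node p,
    succeeding with probability biased by the local numerical advantage."""
    if u == 0:                                    # stay put
        return float(q == p)
    p_move = 0.5 * (1 + rho * mu[p] - (1 - rho) * nu[p])
    return p_move if q != p else 1 - p_move

def phi_t(u, x, mu, nu):
    """Identical Blue policy: every agent always attempts to move."""
    return float(u == 1)

def blue_transition_matrix(mu, nu, nX=2, nU=2):
    """F_t^rho in (15): [F]_{pq} = sum_u f_rho(q | p, u, mu, nu) * phi_t(u | p, mu, nu)."""
    F = np.zeros((nX, nX))
    for p in range(nX):
        for q in range(nX):
            F[p, q] = sum(f_rho(q, p, u, mu, nu) * phi_t(u, p, mu, nu) for u in range(nU))
    return F

def propagate_blue_mf(mu, nu):
    """One step of the deterministic Blue MF dynamics (14a)/(15)."""
    return mu @ blue_transition_matrix(mu, nu)

print(propagate_blue_mf(np.array([1.0, 0.0]), np.array([0.5, 0.5])))
# -> [0.375, 0.625], consistent with the two-node example of Section 2.5
```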

The following lemma shows that the deterministic MF defined above approximates the (stochastic) finite-population ED, and that the approximation error vanishes as $N_1,N_2\to\infty$. Thus, we can regard the mean-field as the empirical distribution of an infinite-population team.

Lemma 1.

Let $\mathbf{X}^{N_1}_t$, $\mathbf{Y}^{N_2}_t$, $\mathcal{M}^{N_1}_t$, and $\mathcal{N}^{N_2}_t$ be the joint states and the corresponding EDs of a finite-population game. Denote the next EDs induced by an identical policy pair $(\phi_t,\psi_t)\in\Phi_t\times\Psi_t$ as $(\mathcal{M}^{N_1}_{t+1},\mathcal{N}^{N_2}_{t+1})$. Then, the following bounds hold:

\mathbb{E}_{\phi_t}\big[\mathrm{d}_{\mathrm{TV}}\big(\mathcal{M}^{N_1}_{t+1},\mathcal{M}_{t+1}\big)\,\big|\,\mathbf{X}^{N_1}_t,\mathbf{Y}^{N_2}_t\big] \leq \frac{|\mathcal{X}|}{2}\sqrt{\frac{1}{N_1}},
\mathbb{E}_{\psi_t}\big[\mathrm{d}_{\mathrm{TV}}\big(\mathcal{N}^{N_2}_{t+1},\mathcal{N}_{t+1}\big)\,\big|\,\mathbf{X}^{N_1}_t,\mathbf{Y}^{N_2}_t\big] \leq \frac{|\mathcal{Y}|}{2}\sqrt{\frac{1}{N_2}},

where $\mathcal{M}_{t+1}=\mathcal{M}^{N_1}_t F^{\rho}_t(\mathcal{M}^{N_1}_t,\mathcal{N}^{N_2}_t,\phi_t)$ and $\mathcal{N}_{t+1}=\mathcal{N}^{N_2}_t G^{\rho}_t(\mathcal{M}^{N_1}_t,\mathcal{N}^{N_2}_t,\psi_t)$, according to (15).

Proof.

See Appendix B.2.
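Lemma 1 can be visualized by sampling a finite Blue team, pushing each agent through the kernel once, and comparing the resulting ED with the deterministic prediction $\mathcal{M}^{N_1}_t F^{\rho}_t$. The Monte Carlo sketch below (our own illustration, again using a hypothetical two-node kernel under the "always move" policy) shows the expected TV error decaying roughly as $1/\sqrt{N_1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, nu = 0.5, np.array([0.5, 0.5])

def move_prob(x, mu):
    """Hypothetical two-node kernel under the 'always move' policy:
    a Blue agent at node x switches nodes with this probability."""
    return 0.5 * (1 + rho * mu[x] - (1 - rho) * nu[x])

for N1 in (10, 100, 1000, 10000):
    errs = []
    for _ in range(200):                                    # Monte Carlo repetitions
        x = rng.integers(2, size=N1)                        # random Blue joint state
        mu = np.bincount(x, minlength=2) / N1               # current ED
        # Deterministic MF step (15): row-stochastic F built from the same kernel.
        F = np.array([[1 - move_prob(0, mu), move_prob(0, mu)],
                      [move_prob(1, mu), 1 - move_prob(1, mu)]])
        mf_pred = mu @ F
        # Finite-population step: each agent switches nodes independently.
        moved = rng.random(N1) < move_prob(x, mu)
        ed_next = np.bincount(np.where(moved, 1 - x, x), minlength=2) / N1
        errs.append(0.5 * np.abs(ed_next - mf_pred).sum())  # TV distance
    print(N1, np.mean(errs), np.sqrt(1 / N1))               # compare with the O(1/sqrt(N1)) bound
```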

3.2 Infinite-Population Optimization

For the infinite-population game, the cumulative reward induced by the strategy pair $(\phi,\psi)\in\Phi\times\Psi$ is given by

J^{\rho,\phi,\psi}(\mu^{\rho}_0,\nu^{\rho}_0) = \sum_{t=0}^{T} r^{\rho}_t(\mu^{\rho}_t,\nu^{\rho}_t),   (16)

where the mean-fields $\mu^{\rho}_t$ and $\nu^{\rho}_t$ propagate according to the dynamics (14). The worst-case performance of the Blue team is given by the lower game value

\underline{J}^{\rho*}(\mu^{\rho}_0,\nu^{\rho}_0) = \max_{\phi\in\Phi}\;\min_{\psi\in\Psi}\; J^{\rho,\phi,\psi}(\mu^{\rho}_0,\nu^{\rho}_0).   (17)

The Red team instead uses the upper game value, based on the following min-max optimization:

\bar{J}^{\rho*}(\mu^{\rho}_0,\nu^{\rho}_0) = \min_{\psi\in\Psi}\;\max_{\phi\in\Phi}\; J^{\rho,\phi,\psi}(\mu^{\rho}_0,\nu^{\rho}_0).   (18)
Remark 2.

Different from the finite-population value (10), which takes the joint states as arguments, the infinite-population game value (17) takes the MFs. This is due to the assumed identical team strategies, under which the agents' index information is no longer needed to differentiate them and sample their actions. Consequently, the MFs can serve as the information state, as in [Arabneydi and Mahajan, 2015]. In contrast, the non-identical strategies considered in the finite-population game require each agent's state and index information to sample actions and predict the game's evolution.
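For a fixed pair of identical team strategies, the value (16) is obtained by simply rolling out the deterministic MF dynamics (14) and accumulating the rewards. The sketch below does this for the two-node example; the "always move"/"always stay" policies and the symmetric Red kernel are illustrative assumptions of our own.

```python
import numpy as np

rho, T = 0.5, 3
reward = lambda mu, nu: mu[1]                        # r_t^rho(mu, nu) = mu(x^2)

def step(own, other, move_prob, w_own, w_other):
    """Generic two-node MF update: agents at each node attempt to move with
    probability move_prob[node], succeeding w.p. 0.5*(1 + advantage at that node)."""
    adv = w_own * own - w_other * other
    succeed = 0.5 * (1 + adv) * move_prob
    return np.array([own[0] * (1 - succeed[0]) + own[1] * succeed[1],
                     own[0] * succeed[0] + own[1] * (1 - succeed[1])])

mu, nu = np.array([1.0, 0.0]), np.array([0.5, 0.5])
phi = np.array([1.0, 1.0])                           # Blue: always try to move
psi = np.array([0.0, 0.0])                           # Red: always stay
J = 0.0
for t in range(T + 1):
    J += reward(mu, nu)
    if t < T:
        mu, nu = (step(mu, nu, phi, rho, 1 - rho),
                  step(nu, mu, psi, 1 - rho, rho))
print(J)                                             # J^{rho, phi, psi}(mu_0, nu_0) per (16)
```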

4 Zero-Sum Game Between Coordinators

The mean-field sharing structure in (8) allows us to reformulate the infinite-population competitive team problem (17) as an equivalent two-player game from the perspective of two fictitious coordinators (the coordinators are fictitious since they are introduced as an auxiliary concept and are not required for the actual implementation of the obtained strategies). Each coordinator knows the common information (the MFs) and selects a local policy that maps each agent's local information (its individual state) to its actions. Through this common-information approach [Nayyar et al., 2013], we provide a dynamic program that constructs optimal strategies for all agents under the original mean-field sharing information structure.

4.1 Equivalent Centralized Problem

We use $\pi_t:\mathcal{U}\times\mathcal{X}\to[0,1]$ to denote a local Blue policy, which is open-loop with respect to the MFs. Specifically, $\pi_t(u|x)$ is the probability that a Blue agent at state $x$ selects action $u$, regardless of the current MFs. The set of open-loop Blue local policies is denoted as $\Pi_t$. Similarly, $\sigma_t:\mathcal{V}\times\mathcal{Y}\to[0,1]$ and $\Sigma_t$ denote a Red local policy and its admissible set. Under the local policy $\pi_t$, the Blue MF propagates as

\mu^{\rho}_{t+1}(x') = \sum_{x\in\mathcal{X}}\Big[\sum_{u\in\mathcal{U}} f^{\rho}_t(x'|x,u,\mu^{\rho}_t,\nu^{\rho}_t)\,\pi_t(u|x)\Big]\mu^{\rho}_t(x),   (19)

and the Red team MF dynamics under a Red local policy is defined similarly as

\nu^{\rho}_{t+1}(y') = \sum_{y\in\mathcal{Y}}\Big[\sum_{v\in\mathcal{V}} g^{\rho}_t(y'|y,v,\mu^{\rho}_t,\nu^{\rho}_t)\,\sigma_t(v|y)\Big]\nu^{\rho}_t(y).

At each time $t$, the Blue coordinator observes the MFs of both teams (the common information) and prescribes a local policy $\pi_t\in\Pi_t$ to all Blue agents within its team. The local policy is selected according to

\pi_t = \alpha_t\big(\mu^{\rho}_t,\nu^{\rho}_t\big),

where $\alpha_t:\mathcal{P}(\mathcal{X})\times\mathcal{P}(\mathcal{Y})\to\Pi_t$ is a deterministic Blue coordination policy, and $\pi_t(u_t|x_t)\triangleq\alpha_t(\mu^{\rho}_t,\nu^{\rho}_t)(u_t|x_t)$ gives the probability that a Blue agent selects action $u_t$ given its current state $x_t$. Similarly, the Red coordinator observes the MFs and selects a local policy $\sigma_t\in\Sigma_t$ according to $\sigma_t=\beta_t\big(\mu^{\rho}_t,\nu^{\rho}_t\big)$. We refer to the time sequence $\alpha=(\alpha_1,\ldots,\alpha_{T-1})$ as the Blue team coordination strategy and to $\beta=(\beta_1,\ldots,\beta_{T-1})$ as the Red team coordination strategy. The sets of admissible coordination strategies are denoted as $\mathcal{A}$ and $\mathcal{B}$.

Remark 3.

There is a one-to-one correspondence between the coordination strategies and the identical team strategies. For example, given an identical Blue team strategy $\phi\in\Phi$, one can define the coordination strategy

\alpha_t(\mu_t,\nu_t) = \pi_t \quad \mathrm{s.t.} \quad \pi_t(u_t|x_t) = \phi_t(u_t|x_t,\mu_t,\nu_t) \qquad \forall \mu_t\in\mathcal{P}(\mathcal{X}),\;\nu_t\in\mathcal{P}(\mathcal{Y}),\;x_t\in\mathcal{X},\;u_t\in\mathcal{U}.

Similarly, a Blue coordination strategy $\alpha\in\mathcal{A}$ induces an identical team strategy $\phi\in\Phi$ according to the rule

\phi_t(u_t|x_t,\mu^{\rho}_t,\nu^{\rho}_t) = \big(\alpha_t(\mu^{\rho}_t,\nu^{\rho}_t)\big)(u_t|x_t) \qquad \forall \mu^{\rho}_t\in\mathcal{P}(\mathcal{X}),\;\nu^{\rho}_t\in\mathcal{P}(\mathcal{Y}),\;x_t\in\mathcal{X},\;u_t\in\mathcal{U}.
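A minimal sketch of this correspondence is given below (our own illustration): a coordination policy `alpha_t` maps the observed MFs to a local policy $\pi_t(u|x)$, and the induced identical team strategy `phi_t` simply evaluates that local policy at an agent's own state, exactly as in the display above.

```python
import numpy as np

def alpha_t(mu, nu):
    """Hypothetical Blue coordination policy on two nodes: attempt to move from
    node 1 more aggressively when the Blue team is outnumbered there."""
    p_move = 0.5 + 0.5 * max(0.0, nu[0] - mu[0])
    return np.array([[1 - p_move, p_move],      # pi_t(. | x^1)
                     [0.9, 0.1]])               # pi_t(. | x^2)

def phi_t(u, x, mu, nu):
    """Induced identical Blue team strategy phi_t(u | x, mu, nu)."""
    return alpha_t(mu, nu)[x, u]

mu, nu = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(phi_t(1, 0, mu, nu))   # probability that any Blue agent at node 1 attempts to move
```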

Plugging in the deterministic coordination strategies, the mean-field dynamics in (19) become

\mu^{\rho}_{t+1} = \mu^{\rho}_t F^{\rho}_t\big(\mu^{\rho}_t,\nu^{\rho}_t,\alpha_t(\mu^{\rho}_t,\nu^{\rho}_t)\big),   (20a)
\nu^{\rho}_{t+1} = \nu^{\rho}_t G^{\rho}_t\big(\mu^{\rho}_t,\nu^{\rho}_t,\beta_t(\mu^{\rho}_t,\nu^{\rho}_t)\big).   (20b)

For notational simplicity, we will use the shorthand notations $F^{\rho}_t(\mu^{\rho}_t,\nu^{\rho}_t,\alpha_t)$ and $G^{\rho}_t(\mu^{\rho}_t,\nu^{\rho}_t,\beta_t)$.

The original competitive team problem in (17) can now be viewed as an equivalent zero-sum game played between the two coordinators, where the game state is the joint mean-field $(\mu^{\rho}_t,\nu^{\rho}_t)$, and the actions are the local policies $\pi_t$ and $\sigma_t$ selected by the coordinators. Figure 3 provides a schematic of the equivalent centralized system.

Formally, the zero-sum coordinator game can be defined via a tuple

ZS-CG = \langle \mathcal{P}(\mathcal{X}),\mathcal{P}(\mathcal{Y}),\Pi_t,\Sigma_t,F^{\rho}_t,G^{\rho}_t,r^{\rho}_t,\rho,T \rangle.   (21)

In particular, the continuous game state space is $\mathcal{P}(\mathcal{X})\times\mathcal{P}(\mathcal{Y})$, and the continuous action spaces are $\Pi_t$ and $\Sigma_t$ for the Blue and Red coordinators, respectively. The deterministic dynamics of the game are given in (20). Finally, the reward $r^{\rho}_t$ is given in (6), and the horizon $T$ is the same as in the original game.

Figure 3: A schematic of the proposed common-information approach for the mean-field zero-sum team games.

Given two coordination strategies $\alpha\in\mathcal{A}$ and $\beta\in\mathcal{B}$, the cumulative reward of the coordinator game is defined as

J^{\rho,\alpha,\beta}_{\mathrm{cor}}(\mu^{\rho}_0,\nu^{\rho}_0) = \sum_{t=0}^{T} r^{\rho}_t(\mu^{\rho}_t,\nu^{\rho}_t),   (22)

where the mean-fields $\mu^{\rho}_t$ and $\nu^{\rho}_t$ propagate according to the dynamics (20). The worst-case performance of the Blue team is given by the lower coordinator game value

\underline{J}^{\rho*}_{\mathrm{cor}}(\mu^{\rho}_0,\nu^{\rho}_0) = \max_{\alpha\in\mathcal{A}}\;\min_{\beta\in\mathcal{B}}\; J^{\rho,\alpha,\beta}_{\mathrm{cor}}(\mu^{\rho}_0,\nu^{\rho}_0),   (23)

and the upper value for the Red team is based on the following min-max optimization:

\bar{J}^{\rho*}_{\mathrm{cor}}(\mu^{\rho}_0,\nu^{\rho}_0) = \min_{\beta\in\mathcal{B}}\;\max_{\alpha\in\mathcal{A}}\; J^{\rho,\alpha,\beta}_{\mathrm{cor}}(\mu^{\rho}_0,\nu^{\rho}_0).   (24)

In the next section, we examine the properties of the max-min (lower) and min-max (upper) game values of the coordinator game.

4.2 Value Functions of the Coordinator Game

Similar to standard two-player zero-sum games, we use a backward-induction scheme to find the lower and upper values of the coordinator game. In general, the max-min value need not be equal to the min-max value [Elliott and Kalton, 1972]. Consequently, from the Blue coordinator's perspective, we consider the lower (max-min) value, which provides the highest guaranteed performance for the maximizing Blue team in a worst-case scenario.

The terminal lower value at time TT is defined as

J¯cor,Tρ(μTρ,νTρ)=rTρ(μTρ,νTρ).\underline{J}_{\mathrm{cor},T}^{\rho*}(\mu^{\rho}_{T},\nu^{\rho}_{T})=r^{\rho}_{T}(\mu^{\rho}_{T},\nu^{\rho}_{T}). (25)

For all previous time steps t=0,,T1t=0,\ldots,T-1, the two coordinators optimize their cumulative reward function by choosing their actions (i.e., local policies) πt\pi_{t} and σt\sigma_{t}. Consequently, for all t=0,,T1,t=0,\ldots,T-1, we have

J¯cor,tρ(μtρ,νtρ)=rtρ(μtρ,νtρ)+maxπtΠtminσtΣtJ¯cor,t+1ρ(μtρFtρ(μtρ,νtρ,πt),νtρGtρ(μtρ,νtρ,σt)).\underline{J}_{\mathrm{cor},t}^{\rho*}(\mu^{\rho}_{t},\nu^{\rho}_{t})=r^{\rho}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t})+\max_{\pi_{t}\in\Pi_{t}}~{}\min_{\sigma_{t}\in\Sigma_{t}}\underline{J}_{\mathrm{cor},t+1}^{\rho*}\big{(}\mu^{\rho}_{t}F^{\rho}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t},\pi_{t}),\nu^{\rho}_{t}G^{\rho}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t},\sigma_{t})\big{)}. (26)

With the optimal value function, the optimal Blue team coordination policies can then be easily constructed via

αt(μtρ,νtρ)\displaystyle\alpha^{*}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t}) argmaxπtΠtminσtΣtJ¯cor,t+1ρ(μtρFtρ(μtρ,νtρ,πt),νtρGtρ(μtρ,νtρ,σt)).\displaystyle\in\operatorname*{argmax}_{\pi_{t}\in\Pi_{t}}~{}\min_{\sigma_{t}\in\Sigma_{t}}\underline{J}_{\mathrm{cor},t+1}^{\rho*}\big{(}\mu^{\rho}_{t}F_{t}^{\rho}(\mu^{\rho}_{t},\nu^{\rho}_{t},\pi_{t}),\nu_{t}^{\rho}G^{\rho}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t},\sigma_{t})\big{)}. (27)
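As a rough illustration of the backups (26)-(27), the sketch below performs one max-min step at a single mean-field pair by enumerating finite samples of the local-policy spaces; in the paper the optimization ranges over the continuous sets Π_t and Σ_t (or the reachable sets of Section 4.3), so this enumeration is only a coarse numerical approximation. It reuses blue_transition_matrix from the earlier sketch.

```python
import numpy as np

def lower_value_backup(r_t, J_next, mu, nu, f, g, blue_policies, red_policies):
    """One max-min backup of (26) at the mean-field pair (mu, nu).

    blue_policies / red_policies : finite samples of Pi_t / Sigma_t
                                   (arrays of shape (|X|, |U|) and (|Y|, |V|))
    J_next : callable approximating the lower value at time t+1
    Returns the backed-up lower value and a maximizing local Blue policy, cf. (27).
    """
    best_val, best_pi = -np.inf, None
    for pi in blue_policies:
        mu_next = mu @ blue_transition_matrix(f, mu, nu, pi)
        worst = min(J_next(mu_next, nu @ blue_transition_matrix(g, mu, nu, sigma))
                    for sigma in red_policies)
        if worst > best_val:
            best_val, best_pi = worst, pi
    return r_t(mu, nu) + best_val, best_pi
```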

Similarly, for the Red team coordinator, the upper values are computed as

J¯cor,Tρ(μTρ,νTρ)=rTρ(μTρ,νTρ),\displaystyle\bar{J}_{\mathrm{cor},T}^{\rho*}(\mu^{\rho}_{T},\nu^{\rho}_{T})=r^{\rho}_{T}(\mu^{\rho}_{T},\nu^{\rho}_{T}), (28a)
J¯cor,tρ(μtρ,νtρ)=rtρ(μtρ,νtρ)+minσtΣtmaxπtΠtJ¯cor,t+1ρ(μtρFtρ(μtρ,νtρ,πt),νtρGtρ(μtρ,νtρ,σt)),t=0,,T1,\displaystyle\bar{J}_{\mathrm{cor},t}^{\rho*}(\mu^{\rho}_{t},\nu^{\rho}_{t})=r^{\rho}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t})+\min_{\sigma_{t}\in\Sigma_{t}}\max_{\pi_{t}\in\Pi_{t}}\bar{J}_{\mathrm{cor},t+1}^{\rho*}\big{(}\mu^{\rho}_{t}F^{\rho}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t},\pi_{t}),\nu^{\rho}_{t}G^{\rho}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t},\sigma_{t})\big{)},~{}~{}t=0,\ldots,T-1, (28b)

and the optimal Red team coordination policy is given by

βt(μtρ,νtρ)\displaystyle\beta^{*}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t}) argminσtΣtmaxπtΠtJ¯cor,t+1ρ(μtρFtρ(μtρ,νtρ,πt),νtρGtρ(μtρ,νtρ,σt)).\displaystyle\in\operatorname*{argmin}_{\sigma_{t}\in\Sigma_{t}}~{}\max_{\pi_{t}\in\Pi_{t}}\bar{J}_{\mathrm{cor},t+1}^{\rho*}\big{(}\mu^{\rho}_{t}F_{t}^{\rho}(\mu^{\rho}_{t},\nu^{\rho}_{t},\pi_{t}),\nu^{\rho}_{t}G^{\rho}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t},\sigma_{t})\big{)}. (29)

Note that the optimal Blue team coordination strategy induces an identical Blue team strategy that satisfies the mean-field sharing information structure and can be implemented in the finite-population game (Remark 3).

4.3 Reachable Sets

At the infinite-population limit, the MF dynamics in (15) is deterministic, and thus selecting the local policies πt\pi_{t} and σt\sigma_{t} at time tt is equivalent to selecting the desired MFs at the next time step. Consequently, we examine the set of MFs that can be reached from the current MFs.

Definition 6.

The Blue reachable set, starting from the mean-fields μtρ\mu^{\rho}_{t} and νtρ\nu^{\rho}_{t}, is defined as the set comprising all the possible next Blue team mean-fields μt+1ρ\mu^{\rho}_{t+1} that can be achieved by employing a local policy πtΠt\pi_{t}\in\Pi_{t}. Formally,

μ,tρ(μtρ,νtρ)={μt+1ρ|πtΠts.t.μt+1ρ=μtρFtρ(μtρ,νtρ,πt)}.\mathcal{R}_{\mu,t}^{\rho}(\mu^{\rho}_{t},\nu^{\rho}_{t})=\left\{\mu^{\rho}_{t+1}\;|\;\exists\pi_{t}\in\Pi_{t}~{}\mathrm{s.t.}~{}\mu^{\rho}_{t+1}=\mu^{\rho}_{t}F^{\rho}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t},\pi_{t})\right\}. (30)

Similarly, the Red reachable set is defined as

ν,tρ(μtρ,νtρ)={νt+1ρ|σtΣts.t.νt+1ρ=νtρGtρ(μtρ,νtρ,σt)}.\mathcal{R}_{\nu,t}^{\rho}(\mu^{\rho}_{t},\nu^{\rho}_{t})=\left\{\nu^{\rho}_{t+1}\;|\;\exists\sigma_{t}\in\Sigma_{t}~{}\mathrm{s.t.}~{}\nu^{\rho}_{t+1}=\nu^{\rho}_{t}G^{\rho}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t},\sigma_{t})\right\}. (31)

We will regard the reachable sets as set-valued functions (correspondences) [Freeman and Kokotovic,, 2008]. In this case, we write μ,tρ:𝒫(𝒳)×𝒫(𝒴)𝒫(𝒳)\mathcal{R}^{\rho}_{\mu,t}:\mathcal{P}(\mathcal{X})\times\mathcal{P}(\mathcal{Y})\rightsquigarrow\mathcal{P}(\mathcal{X}), and similarly ν,tρ:𝒫(𝒳)×𝒫(𝒴)𝒫(𝒴)\mathcal{R}^{\rho}_{\nu,t}:\mathcal{P}(\mathcal{X})\times\mathcal{P}(\mathcal{Y})\rightsquigarrow\mathcal{P}(\mathcal{Y}).
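Since the next mean-field μ_t F_t^ρ(μ_t, ν_t, π_t) is linear in the local policy π_t (the property also exploited by the linear program in Appendix E), and the local-policy space is a product of simplices, the Blue reachable set (30) is a polytope whose vertices are the mean-fields induced by deterministic local policies. The sketch below enumerates these vertices; it reuses blue_transition_matrix from the earlier sketch and is included only for illustration.

```python
import numpy as np
from itertools import product

def blue_reachable_vertices(f, mu, nu, n_x, n_u):
    """Vertices of the Blue reachable set R_mu(mu, nu) in (30).

    Each deterministic local policy assigns one action to every state; the images of
    these policies under the (policy-linear) mean-field map span the reachable polytope.
    """
    vertices = []
    for assignment in product(range(n_u), repeat=n_x):      # assignment[x] = action taken at x
        pi = np.zeros((n_x, n_u))
        pi[np.arange(n_x), assignment] = 1.0
        vertices.append(mu @ blue_transition_matrix(f, mu, nu, pi))
    return np.array(vertices)
```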

The following lemma justifies using the reachable sets constructed based on the local policies to analyze the reachability of identical team policies.

Lemma 2.

For all μtρ𝒫(𝒳)\mu_{t}^{\rho}\in\mathcal{P}(\mathcal{X}) and νtρ𝒫(𝒴)\nu_{t}^{\rho}\in\mathcal{P}(\mathcal{Y}), we have that

{μt+1ρ|ϕtΦts.t.μt+1ρ=μtρFtρ(μtρ,νtρ,ϕt)}=μ,tρ(μtρ,νtρ).\left\{\mu^{\rho}_{t+1}\;|\;\exists\phi_{t}\in\Phi_{t}~{}\mathrm{s.t.}~{}\mu^{\rho}_{t+1}=\mu^{\rho}_{t}F^{\rho}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t},\phi_{t})\right\}=\mathcal{R}_{\mu,t}^{\rho}(\mu^{\rho}_{t},\nu^{\rho}_{t}). (32)
Proof.

This lemma is a direct consequence of Remark 3.

Remark 4.

Note that the reachable sets are constructed based on identical team strategies, since under the coordinator game formulation, all agents in the same team follow the same local policies prescribed by their coordinator.

4.4 Equivalent Form of Value Functions

We can now change the optimization domains in (25) and (26) from the policy spaces to the corresponding reachable sets. One can easily see that the following lower value propagation scheme is equivalent to the one in (25) and (26).

J¯cor,Tρ(μTρ,νTρ)\displaystyle\underline{J}_{\mathrm{cor},T}^{\rho*}(\mu^{\rho}_{T},\nu^{\rho}_{T}) =rTρ(μTρ,νTρ)\displaystyle=r^{\rho}_{T}(\mu^{\rho}_{T},\nu^{\rho}_{T}) (33)
J¯cor,tρ(μtρ,νtρ)\displaystyle\underline{J}_{\mathrm{cor},t}^{\rho*}(\mu^{\rho}_{t},\nu^{\rho}_{t}) =rtρ(μtρ,νtρ)+maxμt+1ρμ,tρ(μtρ,νtρ)minνt+1ρν,tρ(μtρ,νtρ)J¯cor,t+1ρ(μt+1ρ,νt+1ρ),t=0,,T1.\displaystyle=r^{\rho}_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t})+\max_{\mu_{t+1}^{\rho}\in\mathcal{R}_{\mu,t}^{\rho}(\mu_{t}^{\rho},\nu_{t}^{\rho})}~{}\min_{\nu_{t+1}^{\rho}\in\mathcal{R}_{\nu,t}^{\rho}(\mu_{t}^{\rho},\nu_{t}^{\rho})}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu^{\rho}_{t+1},\nu^{\rho}_{t+1}),\quad t=0,\ldots,T-1.

Given the current mean-fields (μtρ,νtρ)(\mu_{t}^{\rho},\nu_{t}^{\rho}), the optimal next mean-field to achieve is then given by

μt+1ρargmaxμt+1ρμ,tρ(μtρ,νtρ)minνt+1ρν,tρ(μtρ,νtρ)J¯cor,t+1ρ(μt+1ρ,νt+1ρ).\mu_{t+1}^{\rho*}\in\operatorname*{argmax}_{\mu_{t+1}^{\rho}\in\mathcal{R}_{\mu,t}^{\rho}(\mu_{t}^{\rho},\nu_{t}^{\rho})}~{}\min_{\nu_{t+1}^{\rho}\in\mathcal{R}_{\nu,t}^{\rho}(\mu_{t}^{\rho},\nu_{t}^{\rho})}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu^{\rho}_{t+1},\nu^{\rho}_{t+1}).

The Blue coordination policy that leads to μt+1ρ\mu_{t+1}^{\rho*} can be constructed via a simple linear program leveraging the linearity of the mean-field dynamics with respect to the policy. See Appendix E for details.
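A hedged sketch of such a linear program is given below (the actual construction is in Appendix E, which we do not reproduce). Using occupation-style variables w(x,u) = μ_t(x) π_t(u|x), both the mean-field matching condition and the policy normalization are linear, so recovering a local policy that reaches a target mean-field reduces to an LP feasibility problem. All names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def policy_from_target_mf(f, mu, nu, mu_target, n_x, n_u):
    """Find a local Blue policy pi with mu F(mu, nu, pi) = mu_target, if one exists."""
    mu = np.asarray(mu, dtype=float)
    n_var = n_x * n_u
    idx = lambda x, u: x * n_u + u
    A_eq, b_eq = [], []
    # mass conservation per state: sum_u w(x, u) = mu(x)
    for x in range(n_x):
        row = np.zeros(n_var)
        row[[idx(x, u) for u in range(n_u)]] = 1.0
        A_eq.append(row); b_eq.append(mu[x])
    # target mean-field: sum_{x,u} w(x, u) f(x'|x, u, mu, nu) = mu_target(x')
    for x_next in range(n_x):
        row = np.zeros(n_var)
        for x in range(n_x):
            for u in range(n_u):
                row[idx(x, u)] = f(x_next, x, u, mu, nu)
        A_eq.append(row); b_eq.append(mu_target[x_next])
    res = linprog(np.zeros(n_var), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * n_var)
    if not res.success:
        return None                              # mu_target is not in the reachable set (30)
    w = res.x.reshape(n_x, n_u)
    return np.where(mu[:, None] > 0, w / np.maximum(mu[:, None], 1e-12), 1.0 / n_u)
```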

In the sequel, we primarily work with the reachability-based optimization in (33). There are two advantages to this approach: First, the reachable sets generally have a lower dimension than the coordinator action spaces (the Blue reachable set is a subset of 𝒫(𝒳)\mathcal{P}(\mathcal{X}), while the Blue coordinator action space is Πt=(𝒫(𝒰))|𝒳|\Pi_{t}=(\mathcal{P}(\mathcal{U}))^{|\mathcal{X}|}), which is desirable for numerical algorithms; Second, the reachability-based optimization allows us to compare the “reachability” induced by non-identical and identical team strategies (Theorem 2 in Section 5) and then study the performance loss due to the identical strategy assumption.

4.5 Existence of the Coordinator Game Value

In general, the coordinator game value may not exist, i.e., the max-min and min-max values may differ (see Numerical Example 1 in Section 6). However, we can establish the existence of the coordinator game value for a special class of mean-field team games, in which the dynamics of each individual agent is independent of the other agents’ states (both its teammates’ and its opponents’). Obtaining more general conditions that ensure the existence of the game value is left as a future research direction.

Definition 7.

We say that the weakly-coupled dynamics are independent if the following holds for all μt𝒫(𝒳)\mu_{t}\in\mathcal{P}(\mathcal{X}), νt𝒫(𝒴)\nu_{t}\in\mathcal{P}(\mathcal{Y}) and t{0,,T1}t\in\{0,\ldots,T-1\},

ftρ(xt+1|xt,ut,μt,νt)\displaystyle f^{\rho}_{t}(x_{t+1}|x_{t},u_{t},\mu_{t},\nu_{t}) =f¯t(xt+1|xt,ut),\displaystyle=\bar{f}_{t}(x_{t+1}|x_{t},u_{t}), (34)
gtρ(yt+1|yt,vt,μt,νt)\displaystyle g^{\rho}_{t}(y_{t+1}|y_{t},v_{t},\mu_{t},\nu_{t}) =g¯t(yt+1|yt,vt),\displaystyle=\bar{g}_{t}(y_{t+1}|y_{t},v_{t}),

for some transition kernels f¯t\bar{f}_{t} and g¯t\bar{g}_{t}.

In other words, the dynamics of an agent depends only on that agent’s current state and action and is fully decoupled from all other agents. However, we still allow the rewards to be coupled through the MFs. The following theorem ensures the existence of the game value under independent dynamics.

Theorem 1.

Suppose the reward function rtρr^{\rho}_{t} is concave-convex for all t{0,,T}t\in\{0,\ldots,T\} and the system dynamics is independent. Then, the game value exists.

Proof.

One can show that the optimal value under independent dynamics is concave-convex and apply the minimax theorem. The detailed proof is given in Appendix D. ∎

5 Main Results

Recall that the optimal Blue team coordination strategy α\alpha^{*} is constructed for the infinite-population game assuming that both teams employ identical team strategies. This section establishes the performance guarantees for α\alpha^{*} in the finite-population games where both teams are allowed to deploy non-identical strategies.

5.1 Approximation Error

As α\alpha^{*} is solved at the infinite-population limit, it is essential to understand how well the infinite-population game approximates the original finite-population problem. In this subsection, we show that the reachable set constructed using identical strategies and the mean-field dynamics is rich enough to approximate any empirical distribution induced by non-identical team strategies in finite-population games. We start by constructing local policies that mimic the behavior of non-identical team strategies.

Lemma 3.

Let 𝐗tN1\mathbf{X}^{N_{1}}_{t}, 𝐘tN2\mathbf{Y}^{N_{2}}_{t}, tN1\mathcal{M}^{N_{1}}_{t} and 𝒩tN2\mathcal{N}^{N_{2}}_{t} be the joint states and the corresponding EDs of a finite-population game. Given any Blue team policy ϕtN1ΦtN1\phi^{N_{1}}_{t}\in\Phi^{N_{1}}_{t} (potentially non-identical), define the following local policy

πapprx,t(u|x)={i=1N1𝟙x(Xi,tN1)ϕi,t(u|x,tN1,𝒩tN2)N1tN1(x)if tN1(x)>0,1/|𝒰|if tN1(x)=0.\pi_{\mathrm{apprx},t}(u|x)=\left\{\begin{array}[]{ll}\frac{\sum_{i=1}^{N_{1}}\mathds{1}_{x}\big{(}{X_{i,t}^{N_{1}}}\big{)}\phi_{i,t}(u|x,\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t})}{N_{1}\mathcal{M}^{N_{1}}_{t}(x)}&\text{if }~{}~{}\mathcal{M}_{t}^{N_{1}}(x)>0,\\ {1}/{|\mathcal{U}|}&\text{if }~{}~{}\mathcal{M}_{t}^{N_{1}}(x)=0.\end{array}\right. (35)

Further, define the next mean-field induced by πapprx,t\pi_{\mathrm{apprx},t} as

apprx,t+1=tN1Ftρ(tN1,𝒩tN2,πapprx,t).\mathcal{M}_{\mathrm{apprx},t+1}=\mathcal{M}^{N_{1}}_{t}F^{\rho}_{t}(\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t},\pi_{\mathrm{apprx},t}). (36)

Then, the expected distance between the next Blue ED t+1N1\mathcal{M}^{N_{1}}_{t+1} induced by the team policy ϕtN1\phi^{N_{1}}_{t} and the mean-field apprx,t+1\mathcal{M}_{\mathrm{apprx},t+1} satisfies

𝔼[dTV(t+1N1,apprx,t+1)|𝐗tN1,𝐘tN2]|𝒳|21N1.\mathbb{E}\left[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}^{N_{1}}_{t+1},\mathcal{M}_{\mathrm{apprx},t+1}\big{)}|\mathbf{X}_{t}^{N_{1}},\mathbf{Y}_{t}^{N_{2}}\right]\leq\frac{|\mathcal{X}|}{2}\sqrt{\frac{1}{N_{1}}}. (37)
Proof.

See Appendix B.3. ∎

The local policy πapprx,t\pi_{\mathrm{apprx},t} mimics the population behavior by setting its action distribution at state xx as the average of the policies used by the Blue agents at state xx. The purpose of the second case in (35) is to ensure that the constructed local policy is well-defined at states where no Blue agent is present. The mean-field induced by πapprx,t\pi_{\mathrm{apprx},t} is within the reachable set and is close to the next ED induced by the non-identical team policy ϕtN1\phi^{N_{1}}_{t} in expectation. This idea is visualized in Figure 4.
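For concreteness, the construction (35) can be written in a few lines: the sketch below takes each Blue agent's current state and its (possibly distinct) policy, already evaluated at the current EDs, and averages them state-by-state. The argument names are illustrative.

```python
import numpy as np

def approximating_local_policy(agent_states, agent_policies, n_x, n_u):
    """Construct pi_apprx,t of (35) from a possibly non-identical Blue team policy.

    agent_states   : length-N1 sequence; agent_states[i] is the state of Blue agent i
    agent_policies : length-N1 list of (n_x, n_u) arrays; agent_policies[i][x, u] is
                     agent i's probability of action u at state x (ED arguments fixed)
    """
    pi = np.full((n_x, n_u), 1.0 / n_u)                 # uniform at states with no Blue agent
    for x in range(n_x):
        occupants = [i for i, s in enumerate(agent_states) if s == x]
        if occupants:                                   # M_t^{N1}(x) > 0
            pi[x] = np.mean([agent_policies[i][x] for i in occupants], axis=0)
    return pi
```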

Figure 4: An illustration of Lemma 3.

Lemma 3 directly leads to the following theorem regarding the richness of the reachable sets, as the mean-field induced by πapprx,t\pi_{\mathrm{apprx},t} is within the reachable set.

Theorem 2.

Let 𝐗tN1\mathbf{X}^{N_{1}}_{t}, 𝐘tN2\mathbf{Y}^{N_{2}}_{t}, tN1\mathcal{M}^{N_{1}}_{t}, and 𝒩tN2\mathcal{N}^{N_{2}}_{t} be the joint states and the corresponding EDs at time tt. Denote the next Blue ED induced by some Blue team policy ϕtN1ΦtN1\phi^{N_{1}}_{t}\in\Phi^{N_{1}}_{t} as t+1N1\mathcal{M}^{N_{1}}_{t+1}. Then, there exists a mean-field μt+1μ,tρ(tN1,𝒩tN2)\mu_{t+1}\in\mathcal{R}^{\rho}_{\mu,t}(\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t}) that satisfies

𝔼ϕtN1[dTV(t+1N1,μt+1)|𝐗tN1,𝐘tN2]|𝒳|21N1.\mathbb{E}_{\phi_{t}^{N_{1}}}\left[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}^{N_{1}}_{t+1},\mu_{t+1}\big{)}\big{|}\mathbf{X}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t}\right]\leq\frac{|\mathcal{X}|}{2}\sqrt{\frac{1}{N_{1}}}. (38)
Remark 5.

The construction of πapprx,t\pi_{\mathrm{apprx},t} in (35) requires knowing each Blue agent’s state, which may seem to violate the mean-field sharing information structure in (8). However, πapprx,t\pi_{\mathrm{apprx},t} only serves as an auxiliary concept to prove the existence of a mean-field in the reachable set that satisfies (38). The existence result in Theorem 2 is all we need to provide performance guarantees. In fact, πapprx,t\pi_{\mathrm{apprx},t} will not be used to construct the optimal policies.

Remark 6.

All results in this subsection extend to the analysis from the Red team’s side.

5.2 Lipschitz Continuity of the Value Functions

Next, we examine the continuity of the optimal value function in (33) with respect to the mean-field arguments, which is essential for the performance guarantees. Clearly, the continuity of the value function depends on the continuity of the two reachability correspondences μ,tρ\mathcal{R}^{\rho}_{\mu,t} and ν,tρ\mathcal{R}^{\rho}_{\nu,t}. To properly study the continuity of the reachability correspondences, we use the Hausdorff distance to measure the distance between two sets.

Definition 8 (Hausdorff distance).

For a normed space (𝒳,)(\mathcal{X},\left\lVert\cdot\right\rVert), the Hausdorff distance between the sets A,B𝒳A,B\subseteq\mathcal{X} is defined as

distH(A,B)=max{supaAinfbBab,supbBinfaAab}.\mathrm{dist}_{\mathrm{H}}(A,B)=\max\left\{\sup_{a\in A}\inf_{b\in B}\left\lVert a-b\right\rVert,\sup_{b\in B}\inf_{a\in A}\left\lVert a-b\right\rVert\right\}. (39)
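For finite sets of distributions (e.g., the vertex sets of two polytopic reachable sets), (39) can be evaluated directly; a minimal sketch, using the total-variation norm d_TV(a, b) = ||a − b||_1 / 2 as in Lemma 4 below:

```python
import numpy as np

def hausdorff_distance(A, B):
    """Hausdorff distance (39) between two finite sets of distributions under d_TV."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    D = 0.5 * np.abs(A[:, None, :] - B[None, :, :]).sum(axis=-1)   # pairwise d_TV distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())           # max of the two directed distances
```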

Based on the Hausdorff distance, we have the following notion of Lipschitz continuity for correspondences [Freeman and Kokotovic,, 2008].

Definition 9.

A correspondence Γ:𝒳𝒵\Gamma:\mathcal{X}\rightsquigarrow\mathcal{Z} is LΓL_{\Gamma}-Lipschitz continuous under the Hausdorff distance if, for all x,x𝒳x,x^{\prime}\in\mathcal{X}, it satisfies that

distH(Γ(x),Γ(x))LΓxx.\mathrm{dist}_{\mathrm{H}}(\Gamma(x),\Gamma(x^{\prime}))\leq L_{\Gamma}\left\lVert x-x^{\prime}\right\rVert. (40)

The following lemma presents a Lipschitz continuity result for the reachability correspondences.

Lemma 4.

The reachability correspondences μ,t\mathcal{R}_{\mu,t} and ν,t\mathcal{R}_{\nu,t} in (30) and (31) satisfy the following inequalities for all μt,μt𝒫(𝒳)\mu_{t},\mu^{\prime}_{t}\in\mathcal{P}(\mathcal{X}) and νt,νt𝒫(𝒴)\nu_{t},\nu^{\prime}_{t}\in\mathcal{P}(\mathcal{Y}),

distH(μ,tρ(μt,νt),μ,tρ(μt,νt))LRμ,t(dTV(μt,μt)+dTV(νt,νt)),\displaystyle\mathrm{dist}_{\mathrm{H}}(\mathcal{R}^{\rho}_{\mu,t}(\mu_{t},\nu_{t}),\mathcal{R}^{\rho}_{\mu,t}(\mu^{\prime}_{t},\nu^{\prime}_{t}))\leq L_{R_{\mu,t}}\big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t},\mu^{\prime}_{t}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu_{t},\nu^{\prime}_{t}\big{)}\big{)}, (41)
distH(ν,tρ(μt,νt),ν,tρ(μt,νt))LRν,t(dTV(μt,μt)+dTV(νt,νt)),\displaystyle\mathrm{dist}_{\mathrm{H}}(\mathcal{R}^{\rho}_{\nu,t}(\mu_{t},\nu_{t}),\mathcal{R}^{\rho}_{\nu,t}(\mu^{\prime}_{t},\nu^{\prime}_{t}))\leq L_{R_{\nu,t}}\big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t},\mu^{\prime}_{t}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu_{t},\nu^{\prime}_{t}\big{)}\big{)},

where LRμ,t=1+12LftL_{R_{\mu,t}}=1+\frac{1}{2}L_{f_{t}}, LRν,t=1+12LgtL_{R_{\nu,t}}=1+\frac{1}{2}L_{g_{t}}, and LftL_{f_{t}} and LgtL_{g_{t}} are the Lipschitz constants in Assumption 1.

Proof.

The proof is postponed to Appendix C.2. ∎

Leveraging the continuity of the reachability correspondences, the following theorem establishes the Lipschitz continuity of the optimal value functions.

Theorem 3.

The optimal lower value function J¯cor,tρ\underline{J}^{\rho*}_{\mathrm{cor},t} and the optimal upper value function J¯cor,tρ\bar{J}^{\rho*}_{\mathrm{cor},t} are both Lipschitz continuous. Formally, for all μtρ,μtρ𝒫(𝒳)\mu_{t}^{\rho},\mu^{\rho\prime}_{t}\in\mathcal{P}(\mathcal{X}) and νtρ,νtρ𝒫(𝒴)\nu_{t}^{\rho},\nu^{\rho\prime}_{t}\in\mathcal{P}(\mathcal{Y}), the following inequalities hold

|J¯cor,tρ(μtρ,νtρ)J¯cor,tρ(μtρ,νtρ)|\displaystyle\left\lvert\underline{J}^{\rho*}_{\mathrm{cor},t}(\mu^{\rho}_{t},\nu^{\rho}_{t})-\underline{J}^{\rho*}_{\mathrm{cor},t}(\mu^{\rho\prime}_{t},\nu^{\rho\prime}_{t})\right\rvert LJ,t(dTV(μtρ,μtρ)+dTV(νtρ,νtρ)),\displaystyle\leq L_{J,t}\big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu^{\rho}_{t},\mu_{t}^{\rho\prime}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu^{\rho}_{t},\nu_{t}^{\rho\prime}\big{)}\big{)}, (42)
|J¯cor,tρ(μtρ,νtρ)J¯cor,tρ(μtρ,νtρ)|\displaystyle\left\lvert\bar{J}^{\rho*}_{\mathrm{cor},t}(\mu^{\rho}_{t},\nu^{\rho}_{t})-\bar{J}^{\rho*}_{\mathrm{cor},t}(\mu^{\rho\prime}_{t},\nu^{\rho\prime}_{t})\right\rvert LJ,t(dTV(μtρ,μtρ)+dTV(νtρ,νtρ)),\displaystyle\leq L_{J,t}\big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu^{\rho}_{t},\mu_{t}^{\rho\prime}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu^{\rho}_{t},\nu_{t}^{\rho\prime}\big{)}\big{)},

where the Lipschitz constant LJ,tL_{J,t} is given by

LJ,t=Lr(1+k=tT1τ=tk(Lμ,τρ+Lν,τρ)).L_{J,t}=L_{r}\big{(}1+\sum_{k=t}^{T-1}\prod_{\tau=t}^{k}(L_{\mathcal{R}^{\rho}_{\mu,\tau}}+L_{\mathcal{R}^{\rho}_{\nu,\tau}})\big{)}. (43)
Proof.

Observe that the lower value in (33) takes the form: f(x,y)=maxpΓ(x,y)minqΘ(x,y)g(p,q)f(x,y)=\max_{p\in\Gamma(x,y)}\min_{q\in\Theta(x,y)}g(p,q), which can be viewed as an extension of the marginal function [Freeman and Kokotovic,, 2008] to the max-min case. We present a continuity result for this type of marginal function in Lemma 11 in Appendix C.1. Based on Lemma 11, we can prove the Lipschitz continuity result through an inductive argument since the terminal rewards are assumed to be Lipschitz. A detailed proof is given in Appendix C.3. ∎

5.3 Performance Guarantees

In the previous section, we obtained an optimal Blue coordination strategy α\alpha^{*} for the coordinator game. The coordinator game is constructed under the mean-field setting, i.e., both teams have an infinite number of agents and all agents in each team apply the same strategy. We analyze the performance guarantees for α\alpha^{*} in the finite-population game and compare the worst-case performance of this coordinator-induced Blue team strategy to the original max-min optimization in (10) where agents have the flexibility to apply non-identical strategies within each team.

To this end, let the Red team apply non-identical team strategies ψN2=(ψ1N2,ψN2N2)ΨN2\psi^{N_{2}}=(\psi^{N_{2}}_{1},\ldots\psi^{N_{2}}_{N_{2}})\in\Psi^{N_{2}}. Recall that a Blue coordination strategy induces an identical Blue team strategy (see Remark 3). We denote the finite-population performance under the identical Blue team strategy induced by α\alpha^{*} and the Red team strategy ψN2\psi^{N_{2}} as

J0N,α,ψN2(𝐱0N1,𝐲0N2)=𝔼α,ψN2[t=0Trtρ(tN1,𝒩tN2)|𝐗0N1=𝐱0N1,𝐘0N2=𝐲0N2].J_{0}^{N,\alpha^{*},\psi^{N_{2}}}(\mathbf{x}^{N_{1}}_{0},\mathbf{y}^{N_{2}}_{0})=\mathbb{E}_{\alpha^{*},\psi^{N_{2}}}\Big{[}\sum_{t=0}^{T}r^{\rho}_{t}(\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t})|\mathbf{X}^{N_{1}}_{0}=\mathbf{x}^{N_{1}}_{0},\mathbf{Y}^{N_{2}}_{0}=\mathbf{y}^{N_{2}}_{0}\Big{]}.

Using dynamic programming, we can compute this induced value via the following backward propagation scheme:

JTN,α,ψN2(𝐱TN1,𝐲TN2)\displaystyle J_{T}^{N,\alpha^{*},\psi^{N_{2}}}(\mathbf{x}^{N_{1}}_{T},\mathbf{y}^{N_{2}}_{T}) =rTρ(μTN1,νTN2)\displaystyle=r^{\rho}_{T}(\mu^{N_{1}}_{T},\nu^{N_{2}}_{T}) (44a)
JtN,α,ψN2(𝐱tN1,𝐲tN2)\displaystyle J_{t}^{N,\alpha^{*},\psi^{N_{2}}}(\mathbf{x}^{N_{1}}_{t},\mathbf{y}^{N_{2}}_{t}) =rtρ(μtN1,νtN2)+𝔼α,ψN2[Jt+1N,α,ψN2(𝐗t+1N1,𝐘t+1N2)|𝐗tN1=𝐱tN1,𝐘tN2=𝐲tN2],\displaystyle=r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\mathbb{E}_{\alpha^{*},\psi^{N_{2}}}\Big{[}J_{t+1}^{N,\alpha^{*},\psi^{N_{2}}}(\mathbf{X}^{N_{1}}_{t+1},\mathbf{Y}^{N_{2}}_{t+1})\big{|}\mathbf{X}^{N_{1}}_{t}=\mathbf{x}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t}=\mathbf{y}^{N_{2}}_{t}\;\Big{]}, (44b)
t=0,,T1,\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\forall t=0,\ldots,T-1,

where μtN1=Empμ(𝐱tN1)\mu_{t}^{N_{1}}=\mathrm{Emp}_{\mu}(\mathbf{x}^{N_{1}}_{t}) and νtN2=Empν(𝐲tN2)\nu_{t}^{N_{2}}=\mathrm{Emp}_{\nu}(\mathbf{y}^{N_{2}}_{t}) are the EDs corresponding to the joint states 𝐱tN1\mathbf{x}^{N_{1}}_{t} and 𝐲tN2\mathbf{y}^{N_{2}}_{t}, respectively.
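In practice, the value in (44) can also be estimated by Monte Carlo simulation rather than exact dynamic programming; the hedged sketch below samples one trajectory of the finite-population game under an identical Blue strategy and a (possibly non-identical) Red team strategy. All function signatures here are illustrative placeholders rather than the paper's notation.

```python
import numpy as np

def rollout(rng, x0, y0, blue_policy, red_policies, f, g, r, T, n_x, n_y, n_u, n_v):
    """One sampled trajectory; averaging many rollouts estimates the value in (44)."""
    x, y, total = np.array(x0), np.array(y0), 0.0
    for t in range(T + 1):
        mu = np.bincount(x, minlength=n_x) / len(x)          # Blue ED  M_t^{N1}
        nu = np.bincount(y, minlength=n_y) / len(y)          # Red ED   N_t^{N2}
        total += r(t, mu, nu)
        if t == T:
            break
        u = [rng.choice(n_u, p=blue_policy(t, xi, mu, nu)) for xi in x]                     # identical Blue strategy
        v = [rng.choice(n_v, p=red_policies[j](t, yj, mu, nu)) for j, yj in enumerate(y)]   # per-agent Red strategies
        x = np.array([rng.choice(n_x, p=[f(t, xn, xi, ui, mu, nu) for xn in range(n_x)])
                      for xi, ui in zip(x, u)])
        y = np.array([rng.choice(n_y, p=[g(t, yn, yj, vj, mu, nu) for yn in range(n_y)])
                      for yj, vj in zip(y, v)])
    return total
```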

We have the following main result regarding the performance guarantees for the optimal Blue coordination strategy.

Theorem 4.

The optimal Blue coordination strategy α\alpha^{*} obtained from (27) induces an ϵ\epsilon-optimal Blue team strategy. Formally, for all 𝐱N1𝒳N1,𝐲N2𝒴N2\mathbf{x}^{N_{1}}\in\mathcal{X}^{N_{1}},\mathbf{y}^{N_{2}}\in\mathcal{Y}^{N_{2}},

J¯N(𝐱N1,𝐲N2)minψN2ΨN2JN,α,ψN2(𝐱N1,𝐲N2)J¯N(𝐱N1,𝐲N2)𝒪(1N¯)\underline{J}^{N*}(\mathbf{x}^{N_{1}},\mathbf{y}^{N_{2}})\geq\min_{\psi^{N_{2}}\in\Psi^{N_{2}}}J^{N,\alpha^{*},\psi^{N_{2}}}(\mathbf{x}^{N_{1}},\mathbf{y}^{N_{2}})\geq\underline{J}^{N*}(\mathbf{x}^{N_{1}},\mathbf{y}^{N_{2}})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)} (45)

where N¯=min{N1,N2}\underline{N}=\min\{N_{1},N_{2}\}.

Proof.

We first prove the first inequality in (45).

J¯N(𝐱N1,𝐲N2)\displaystyle\underline{J}^{N*}(\mathbf{x}^{N_{1}},\mathbf{y}^{N_{2}}) =maxϕN1ΦN1minψN2ΨN2JN,ϕN1,ψN2(𝐱N1,𝐲N2)\displaystyle=\max_{\phi^{N_{1}}\in\Phi^{N_{1}}}\min_{\psi^{N_{2}}\in\Psi^{N_{2}}}J^{N,\phi^{N_{1}},\psi^{N_{2}}}(\mathbf{x}^{N_{1}},\mathbf{y}^{N_{2}})
(i)maxϕΦminψN2ΨN2JN,ϕ,ψN2(𝐱N1,𝐲N2)=(ii)maxα𝒜minψN2ΨN2JN,α,ψN2(𝐱N1,𝐲N2)\displaystyle\stackrel{{\scriptstyle\text{(i)}}}{{\geq}}\max_{\phi\in\Phi}\min_{\psi^{N_{2}}\in\Psi^{N_{2}}}J^{N,\phi,\psi^{N_{2}}}(\mathbf{x}^{N_{1}},\mathbf{y}^{N_{2}})\stackrel{{\scriptstyle\text{(ii)}}}{{=}}\max_{\alpha\in\mathcal{A}}\min_{\psi^{N_{2}}\in\Psi^{N_{2}}}J^{N,\alpha,\psi^{N_{2}}}(\mathbf{x}^{N_{1}},\mathbf{y}^{N_{2}})
minψN2ΨN2JN,α,ψN2(𝐱N1,𝐲N2),\displaystyle\geq\min_{\psi^{N_{2}}\in\Psi^{N_{2}}}J^{N,\alpha^{*},\psi^{N_{2}}}(\mathbf{x}^{N_{1}},\mathbf{y}^{N_{2}}),

where inequality (i) is a result of ΦΦN1\Phi\subseteq\Phi^{N_{1}}, and equality (ii) is due to the one-to-one correspondence between coordination strategies and identical team strategies (see Remark 3).

For the second inequality in (45), we break it down into two lemmas: First, Lemma 5 states that minψN2ΨN2JN,α,ψN2J¯corρϵ1\min_{\psi^{N_{2}}\in\Psi^{N_{2}}}J^{N,\alpha^{*},\psi^{N_{2}}}\geq\underline{J}_{\mathrm{cor}}^{\rho*}-\epsilon_{1}; and Lemma 6 shows that J¯corρJ¯Nϵ2\underline{J}_{\mathrm{cor}}^{\rho*}\geq\underline{J}^{N*}-\epsilon_{2}; finally, it is shown that both error terms are of order 𝒪(1N¯)\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}. Combining the two lemmas, we obtain the desired result. ∎

Remark 7.

Recall that α\alpha^{*} is solved at the infinite-population limit under the restriction that both teams apply identical team strategies. Theorem 4 states that the identical Blue team strategy induced by α\alpha^{*} is still ϵ\epsilon-optimal, even if (i) it is deployed in a finite-population game and (ii) the opponent team employs non-identical strategies to exploit.

Remark 8.

Continuity Assumptions 1 and 2 are necessary to translate the infinite-population performance back to the finite-population game. See Appendix A.2 for a discontinuous example where the infinite-population game value is significantly different from that of the finite-population problem.

Lemma 5.

For all joint states 𝐱N1𝒳N1\mathbf{x}^{N_{1}}\in\mathcal{X}^{N_{1}} and 𝐲N2𝒴N2\mathbf{y}^{N_{2}}\in\mathcal{Y}^{N_{2}}, the optimal Blue coordination strategy α\alpha^{*} in (27) guarantees

minψN2ΨN2JN,α,ψN2(𝐱N1,𝐲N2)J¯corρ(μN1,νN2)𝒪(1N¯),\min_{\psi^{N_{2}}\in\Psi^{N_{2}}}J^{N,\alpha^{*},\psi^{N_{2}}}(\mathbf{x}^{N_{1}},\mathbf{y}^{N_{2}})\geq\underline{J}_{\mathrm{cor}}^{\rho*}(\mu^{N_{1}},\nu^{N_{2}})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}, (46)

where μN1=Empμ(𝐱N1)\mu^{N_{1}}=\mathrm{Emp}_{\mu}(\mathbf{x}^{N_{1}}) and νN2=Empν(𝐲N2)\nu^{N_{2}}=\mathrm{Emp}_{\nu}(\mathbf{y}^{N_{2}}) are the corresponding EDs.

Proof.

The proof is constructed based on induction. Fix an arbitrary Red team strategy ψN2ΨN2\psi^{N_{2}}\in\Psi^{N_{2}}.

Base case: At the terminal timestep TT, since there is no decision to be made, both value functions are equal to the terminal reward and are thus the same. Formally, for all 𝐱TN1𝒳N1\mathbf{x}^{N_{1}}_{T}\in\mathcal{X}^{N_{1}} and 𝐲TN2𝒴N2\mathbf{y}^{N_{2}}_{T}\in\mathcal{Y}^{N_{2}},

JTN,α,ψN2(𝐱TN1,𝐲TN2)=J¯cor,Tρ(μTN1,νTN2)=rTρ(μTN1,νTN2),J_{T}^{N,\alpha^{*},\psi^{N_{2}}}(\mathbf{x}^{N_{1}}_{T},\mathbf{y}^{N_{2}}_{T})=\underline{J}_{\mathrm{cor},T}^{\rho*}(\mu^{N_{1}}_{T},\nu^{N_{2}}_{T})=r^{\rho}_{T}(\mu^{N_{1}}_{T},\nu^{N_{2}}_{T}),

where μTN1=Empμ(𝐱TN1)\mu^{N_{1}}_{T}=\mathrm{Emp}_{\mu}(\mathbf{x}^{N_{1}}_{T}) and νTN2=Empν(𝐲TN2)\nu^{N_{2}}_{T}=\mathrm{Emp}_{\nu}(\mathbf{y}^{N_{2}}_{T}). For simplicity, we do not emphasize the correspondence between the joint states and the EDs for the rest of the proof, as it is clear from the context.

Inductive hypothesis: Assume that at t+1t+1, the following holds for all joint states 𝐱t+1N1𝒳N1\mathbf{x}^{N_{1}}_{t+1}\in\mathcal{X}^{N_{1}} and 𝐲t+1N2𝒴N2\mathbf{y}^{N_{2}}_{t+1}\in\mathcal{Y}^{N_{2}}:

Jt+1N,α,ψN2(𝐱t+1N1,𝐲t+1N2)J¯cor,t+1ρ(μt+1N1,νt+1N2)𝒪(1N¯).J_{t+1}^{N,\alpha^{*},\psi^{N_{2}}}(\mathbf{x}^{N_{1}}_{t+1},\mathbf{y}^{N_{2}}_{t+1})\geq\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu^{N_{1}}_{t+1},\nu^{N_{2}}_{t+1})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}. (47)

Induction: At timestep tt, consider an arbitrary pair of joint states 𝐱tN1𝒳N1\mathbf{x}^{N_{1}}_{t}\in\mathcal{X}^{N_{1}} and 𝐲tN2𝒴N2\mathbf{y}^{N_{2}}_{t}\in\mathcal{Y}^{N_{2}}, and their corresponding EDs μtN1\mu^{N_{1}}_{t} and νtN2\nu^{N_{2}}_{t}. Define μt+1\mu_{t+1}^{*} as

μt+1=μtN1Ftρ(μtN1,νtN2,αt(μtN1,νtN2)).\mu_{t+1}^{*}=\mu^{N_{1}}_{t}F^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t},\alpha_{t}^{*}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})).

Note that, from the optimality of αt\alpha_{t}^{*}, we have

μt+1argmaxμt+1μ,tρ(μtN1,νtN2)minνt+1ν,tρ(μtN1,νtN2)J¯cor,t+1ρ(μt+1,νt+1).\mu_{t+1}^{*}\in\operatorname*{argmax}_{\mu_{t+1}\in\mathcal{R}_{\mu,t}^{\rho}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})}~{}\min_{\nu_{t+1}\in\mathcal{R}_{\nu,t}^{\rho}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu_{t+1},\nu_{t+1}). (48)

Furthermore, from Theorem 2, there exists a νapprx,t+1ν,tρ(μtN1,νtN2)\nu_{\mathrm{apprx},t+1}\in\mathcal{R}_{\nu,t}^{\rho}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t}) for the Red team policy ψtN2\psi_{t}^{N_{2}} such that

𝔼ψtN2[dTV(𝒩t+1N2,νapprx,t+1)|𝐗tN1=𝐱tN1,𝐘tN2=𝐲tN2]|𝒴|21N2.\mathbb{E}_{\psi_{t}^{N_{2}}}\left[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{N}^{N_{2}}_{t+1},\nu_{\mathrm{apprx},t+1}\big{)}\big{|}\mathbf{X}_{t}^{N_{1}}=\mathbf{x}^{N_{1}}_{t},\mathbf{Y}_{t}^{N_{2}}=\mathbf{y}^{N_{2}}_{t}\right]\leq\frac{|\mathcal{Y}|}{2}\sqrt{\frac{1}{N_{2}}}. (49)

For notational simplicity, we drop the conditions 𝐗tN1=𝐱tN1\mathbf{X}_{t}^{N_{1}}=\mathbf{x}^{N_{1}}_{t} and 𝐘tN2=𝐲tN2\mathbf{Y}_{t}^{N_{2}}=\mathbf{y}^{N_{2}}_{t} in the following derivations. Then, for all joint states 𝐱tN1𝒳N1\mathbf{x}^{N_{1}}_{t}\in\mathcal{X}^{N_{1}} and 𝐲tN2𝒴N2\mathbf{y}^{N_{2}}_{t}\in\mathcal{Y}^{N_{2}}, we have

JtN,α,ψN2(𝐱tN1,𝐲tN2)=rtρ(μtN1,νtN2)+𝔼α,ψN2[Jt+1N,α,ψN2(𝐗t+1N1,𝐘t+1N2)]\displaystyle J_{t}^{N,\alpha^{*},\psi^{N_{2}}}(\mathbf{x}^{N_{1}}_{t},\mathbf{y}^{N_{2}}_{t})=r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\mathbb{E}_{\alpha^{*},\psi^{N_{2}}}\Big{[}J_{t+1}^{N,\alpha^{*},\psi^{N_{2}}}(\mathbf{X}^{N_{1}}_{t+1},\mathbf{Y}^{N_{2}}_{t+1})\Big{]} (50)
(i)rtρ(μtN1,νtN2)+𝔼α,ψN2[J¯cor,t+1ρ(t+1N1,𝒩t+1N2)]𝒪(1N¯)\displaystyle\stackrel{{\scriptstyle\text{(i)}}}{{\geq}}r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\mathbb{E}_{\alpha^{*},\psi^{N_{2}}}\Big{[}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mathcal{M}^{N_{1}}_{t+1},\mathcal{N}^{N_{2}}_{t+1})\Big{]}-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)} (51)
=rtρ(μtN1,νtN2)𝒪(1N¯)\displaystyle=r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)} (52)
+𝔼α,ψN2[J¯cor,t+1ρ(t+1N1,𝒩t+1N2)J¯cor,t+1ρ(μt+1,νapprx,t+1)+J¯cor,t+1ρ(μt+1,νapprx,t+1)]\displaystyle\qquad+\mathbb{E}_{\alpha^{*},\psi^{N_{2}}}\Big{[}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mathcal{M}^{N_{1}}_{t+1},\mathcal{N}^{N_{2}}_{t+1})-\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu_{t+1}^{*},\nu_{\mathrm{apprx},t+1})+\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu_{t+1}^{*},\nu_{\mathrm{apprx},t+1})\Big{]}
(ii)rtρ(μtN1,νtN2)+J¯cor,t+1ρ(μt+1,νapprx,t+1)𝒪(1N¯)\displaystyle\stackrel{{\scriptstyle\text{(ii)}}}{{\geq}}r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu_{t+1}^{*},\nu_{\mathrm{apprx},t+1})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)} (53)
LJ,t+1𝔼α[dTV(t+1N1,μt+1)]=𝒪(1N1) due to Lemma 1LJ,t+1𝔼ψN2[dTV(𝒩t+1N2,νapprx,t+1)]=𝒪(1N2) due to (49)\displaystyle\qquad\qquad\qquad\qquad\qquad-L_{J,t+1}\underbrace{\mathbb{E}_{\alpha^{*}}\Big{[}\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}_{t+1}^{N_{1}},{\mu}^{*}_{t+1}\big{)}\Big{]}}_{=\mathcal{O}(\frac{1}{\sqrt{N_{1}}})\text{ due to Lemma~{}\ref{lmm:mf-aprx}}}-L_{J,t+1}\underbrace{\mathbb{E}_{\psi^{N_{2}}}\Big{[}\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{N}^{N_{2}}_{t+1},\nu_{\mathrm{apprx},t+1}\big{)}\Big{]}}_{=\mathcal{O}(\frac{1}{\sqrt{N_{2}}})\text{ due to \eqref{eqn:red-apprx-nu}}}
=(iii)rtρ(μtN1,νtN2)+J¯cor,t+1ρ(μt+1,νapprx,t+1)𝒪(1N¯)\displaystyle\stackrel{{\scriptstyle\text{(iii)}}}{{=}}r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu_{t+1}^{*},\nu_{\mathrm{apprx},t+1})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)} (54)
(iv)rtρ(μtN1,νtN2)+minνt+1ν,tρ(μtN1,νtN2)J¯cor,t+1ρ(μt+1,νt+1)𝒪(1N¯)\displaystyle\stackrel{{\scriptstyle\text{(iv)}}}{{\geq}}r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\min_{\nu_{t+1}\in\mathcal{R}_{\nu,t}^{\rho}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu^{*}_{t+1},\nu_{t+1})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)} (55)
=(v)rtρ(μtN1,νtN2)+maxμt+1μ,tρ(μtN1,νtN2)minνt+1ν,tρ(μtN1,νtN2)J¯cor,t+1ρ(μt+1,νt+1)𝒪(1N¯)\displaystyle\stackrel{{\scriptstyle\text{(v)}}}{{=}}r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\max_{\mu_{t+1}\in\mathcal{R}_{\mu,t}^{\rho}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})}~{}\min_{\nu_{t+1}\in\mathcal{R}_{\nu,t}^{\rho}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu_{t+1},\nu_{t+1})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)} (56)
=J¯cor,tρ(μtN1,νtN2)𝒪(1N¯).\displaystyle=\underline{J}_{\mathrm{cor},t}^{\rho*}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}. (57)

For inequality (i), we used the inductive hypothesis; for inequality (ii), we utilized the Lipschitz continuity of the coordinator value function in Theorem 3; equality (iii) is due to Lemma 1 and Theorem 2, and the fact that the two error terms with Lipschitz constant LJ,t+1L_{J,t+1} are bounded by 𝒪(1N¯)\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}; inequality (iv) is due to the fact that νapprx,t+1\nu_{\mathrm{apprx},t+1} is in the reachable set; equality (v) comes from the definition of μt+1\mu^{*}_{t+1} in (48), and the final equality in (57) is simply the definition of J¯cor,tρ\underline{J}^{\rho*}_{\mathrm{cor},t}, which completes the induction.

Since the Red team strategy ψN2ΨN2\psi^{N_{2}}\in\Psi^{N_{2}} is arbitrary, we have that, for all joint states 𝐱N1𝒳N1\mathbf{x}^{N_{1}}\in\mathcal{X}^{N_{1}} and 𝐲N2𝒴N2\mathbf{y}^{N_{2}}\in\mathcal{Y}^{N_{2}},

minψN2ΨN2JN,α,ψN2\displaystyle\min_{\psi^{N_{2}}\in\Psi^{N_{2}}}J^{N,\alpha^{*},\psi^{N_{2}}} (𝐱N1,𝐲N2)=minψN2ΨN2J0N,α,ψN2(𝐱N1,𝐲N2)\displaystyle(\mathbf{x}^{N_{1}},\mathbf{y}^{N_{2}})=\min_{\psi^{N_{2}}\in\Psi^{N_{2}}}J^{N,\alpha^{*},\psi^{N_{2}}}_{0}(\mathbf{x}^{N_{1}},\mathbf{y}^{N_{2}})
J¯cor,0ρ(μN1,νN2)𝒪(1N¯)=J¯corρ(μN1,νN2)𝒪(1N¯).\displaystyle\geq\underline{J}_{\mathrm{cor},0}^{\rho*}(\mu^{N_{1}},\nu^{N_{2}})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}=\underline{J}_{\mathrm{cor}}^{\rho*}(\mu^{N_{1}},\nu^{N_{2}})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}. ∎

Lemma 6.

The following inequality holds for all joint states 𝐱N1𝒳N1\mathbf{x}^{N_{1}}\in\mathcal{X}^{N_{1}} and 𝐲N2𝒴N2\mathbf{y}^{N_{2}}\in\mathcal{Y}^{N_{2}},

J¯corρ(μN1,νN2)J¯N(𝐱N1,𝐲N2)𝒪(1N¯),\underline{J}_{\mathrm{cor}}^{\rho*}(\mu^{N_{1}},\nu^{N_{2}})\geq\underline{J}^{N*}(\mathbf{x}^{N_{1}},\mathbf{y}^{N_{2}})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}, (58)

where μN1=Empμ(𝐱N1)\mu^{N_{1}}=\mathrm{Emp}_{\mu}(\mathbf{x}^{N_{1}}) and νN2=Empν(𝐲N2)\nu^{N_{2}}=\mathrm{Emp}_{\nu}(\mathbf{y}^{N_{2}}).

Proof.

We prove the lemma through an inductive argument.

Base case: At the terminal timestep TT, the two value functions are the same. Thus, we have, for all joint states 𝐱TN1𝒳N1\mathbf{x}^{N_{1}}_{T}\in\mathcal{X}^{N_{1}} and 𝐲TN2𝒴N2\mathbf{y}^{N_{2}}_{T}\in\mathcal{Y}^{N_{2}}, that

J¯cor,Tρ(μTN1,νTN2)=J¯TN(𝐱TN1,𝐲TN2)=rTρ(μTN1,νTN2).\underline{J}_{\mathrm{cor},T}^{\rho*}(\mu^{N_{1}}_{T},\nu^{N_{2}}_{T})=\underline{J}_{T}^{N*}(\mathbf{x}^{N_{1}}_{T},\mathbf{y}^{N_{2}}_{T})=r^{\rho}_{T}(\mu^{N_{1}}_{T},\nu^{N_{2}}_{T}).

Inductive hypothesis: Assume that, at time step t+1t+1, the following holds for all 𝐱t+1N1𝒳N1\mathbf{x}^{N_{1}}_{t+1}\in\mathcal{X}^{N_{1}} and 𝐲t+1N2𝒴N2\mathbf{y}^{N_{2}}_{t+1}\in\mathcal{Y}^{N_{2}},

J¯cor,t+1ρ(μt+1N1,νt+1N2)J¯t+1N(𝐱t+1N1,𝐲t+1N2)𝒪(1N¯).\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu^{N_{1}}_{t+1},\nu^{N_{2}}_{t+1})\geq\underline{J}_{t+1}^{N*}(\mathbf{x}^{N_{1}}_{t+1},\mathbf{y}^{N_{2}}_{t+1})-\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}. (59)

Induction: Consider arbitrary 𝐱tN1𝒳N1\mathbf{x}^{N_{1}}_{t}\in\mathcal{X}^{N_{1}} and 𝐲tN2𝒴N2\mathbf{y}^{N_{2}}_{t}\in\mathcal{Y}^{N_{2}}. For each Blue team policy ϕtN1ΦtN1\phi_{t}^{N_{1}}\in\Phi_{t}^{N_{1}}, Theorem 2 allows us to define μapprx,t+1ϕtN1μ,tρ(μtN1,νtN2)\mu_{\mathrm{apprx},t+1}^{\phi^{N_{1}}_{t}}\in\mathcal{R}^{\rho}_{\mu,t}(\mu_{t}^{N_{1}},\nu_{t}^{N_{2}}) such that

𝔼ϕtN1[dTV(t+1N1,μapprx,t+1ϕtN1)|𝐗tN1=𝐱tN1,𝐘tN2=𝐲tN2]|𝒳|21N1.\mathbb{E}_{\phi_{t}^{N_{1}}}\left[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}^{N_{1}}_{t+1},\mu^{\phi^{N_{1}}_{t}}_{\mathrm{apprx},t+1}\big{)}\big{|}\mathbf{X}_{t}^{N_{1}}=\mathbf{x}^{N_{1}}_{t},\mathbf{Y}_{t}^{N_{2}}=\mathbf{y}^{N_{2}}_{t}\right]\leq\frac{|\mathcal{X}|}{2}\sqrt{\frac{1}{N_{1}}}. (60)

For an identical Red team policy ψtΨt\psi_{t}\in\Psi_{t}, denote νt+1ψt=νtN2Gtρ(μtN1,νtN2,ψt)\nu^{\psi_{t}}_{t+1}=\nu^{N_{2}}_{t}G^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t},\psi_{t}). Then, we have

J¯tN\displaystyle\underline{J}_{t}^{N*} (𝐱tN1,𝐲tN2)=maxϕtN1ΦtN1minψtN2ΨtN2rtρ(μtN1,νtN2)+𝔼ϕtN1,ψtN2[J¯t+1N(𝐗t+1N1,𝐘t+1N2)]\displaystyle(\mathbf{x}^{N_{1}}_{t},\mathbf{y}^{N_{2}}_{t})=\max_{\phi^{N_{1}}_{t}\in\Phi^{N_{1}}_{t}}\min_{\psi^{N_{2}}_{t}\in\Psi^{N_{2}}_{t}}r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\mathbb{E}_{\phi^{N_{1}}_{t},\psi_{t}^{N_{2}}}\Big{[}\underline{J}_{t+1}^{N*}(\mathbf{X}^{N_{1}}_{t+1},\mathbf{Y}^{N_{2}}_{t+1})\Big{]}
(i)rtρ(μtN1,νtN2)+maxϕtN1ΦtN1minψtN2ΨtN2𝔼ϕtN1,ψtN2[J¯cor,t+1ρ(t+1N1,𝒩t+1N2)]+𝒪(1N¯)\displaystyle\stackrel{{\scriptstyle\text{(i)}}}{{\leq}}r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\max_{\phi^{N_{1}}_{t}\in\Phi^{N_{1}}_{t}}\min_{\psi_{t}^{N_{2}}\in\Psi^{N_{2}}_{t}}\mathbb{E}_{\phi^{N_{1}}_{t},\psi_{t}^{N_{2}}}\Big{[}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mathcal{M}^{N_{1}}_{t+1},\mathcal{N}^{N_{2}}_{t+1})\Big{]}+\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}
(ii)rtρ(μtN1,νtN2)+maxϕtN1ΦtN1minψtΨt𝔼ϕtN1,ψt[J¯cor,t+1ρ(t+1N1,𝒩t+1N2)]+𝒪(1N¯)\displaystyle\stackrel{{\scriptstyle\text{(ii)}}}{{\leq}}r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\max_{\phi^{N_{1}}_{t}\in\Phi^{N_{1}}_{t}}\;\min_{\psi_{t}\in\Psi_{t}}\;\mathbb{E}_{\phi^{N_{1}}_{t},\psi_{t}}\Big{[}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mathcal{M}^{N_{1}}_{t+1},\mathcal{N}^{N_{2}}_{t+1})\Big{]}+\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}
=rtρ(μtN1,νtN2)+𝒪(1N¯)+maxϕtN1ΦtN1minψtΨt𝔼ϕtN1,ψt[J¯cor,t+1ρ(t+1N1,𝒩t+1N2)\displaystyle=r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}+\max_{\phi^{N_{1}}_{t}\in\Phi^{N_{1}}_{t}}\;\min_{\psi_{t}\in\Psi_{t}}\;\mathbb{E}_{\phi^{N_{1}}_{t},\psi_{t}}\Big{[}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mathcal{M}^{N_{1}}_{t+1},\mathcal{N}^{N_{2}}_{t+1})
J¯cor,t+1ρ(μapprx,t+1ϕtN1,νt+1ψt)+J¯cor,t+1ρ(μapprx,t+1ϕtN1,νt+1ψt)]\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad-\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu_{\mathrm{apprx},t+1}^{\phi^{N_{1}}_{t}},\nu_{t+1}^{\psi_{t}})+\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu_{\mathrm{apprx},t+1}^{\phi^{N_{1}}_{t}},\nu_{t+1}^{\psi_{t}})\Big{]}
(iii)rtρ(μtN1,νtN2)+maxϕtN1ΦtN1minψtΨtJ¯cor,t+1ρ(μapprx,t+1ϕtN1,νt+1ψt)+𝒪(1N¯)\displaystyle\stackrel{{\scriptstyle\text{(iii)}}}{{\leq}}r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\max_{\phi^{N_{1}}_{t}\in\Phi^{N_{1}}_{t}}\;\min_{\psi_{t}\in\Psi_{t}}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu_{\mathrm{apprx},t+1}^{\phi^{N_{1}}_{t}},\nu_{t+1}^{\psi_{t}})+\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}
+LJ,t+1𝔼ϕtN1,ψt[dTV(t+1N1,μapprx,t+1ϕtN1)+dTV(𝒩t+1N2,νt+1ψt)]\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad+L_{J,t+1}\mathbb{E}_{\phi^{N_{1}}_{t},\psi_{t}}\Big{[}\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}_{t+1}^{N_{1}},\mu_{\mathrm{apprx},t+1}^{\phi^{N_{1}}_{t}}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{N}_{t+1}^{N_{2}},\nu_{t+1}^{\psi_{t}}\big{)}\Big{]}
(iv)rtρ(μtN1,νtN2)+maxϕtN1ΦtN1minψtΨtJ¯cor,t+1ρ(μapprx,t+1ϕtN1,νt+1ψt)+𝒪(1N¯)\displaystyle\stackrel{{\scriptstyle\text{(iv)}}}{{\leq}}r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\max_{\phi^{N_{1}}_{t}\in\Phi^{N_{1}}_{t}}\;\min_{\psi_{t}\in\Psi_{t}}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu_{\mathrm{apprx},t+1}^{\phi^{N_{1}}_{t}},\nu_{t+1}^{\psi_{t}})+\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}
=(v)rtρ(μtN1,νtN2)+maxϕtN1ΦtN1minνt+1ν,tρ(μtN1,νtN2)J¯cor,t+1ρ(μapprx,t+1ϕtN1,νt+1)+𝒪(1N¯)\displaystyle\stackrel{{\scriptstyle\text{(v)}}}{{=}}r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\max_{\phi^{N_{1}}_{t}\in\Phi^{N_{1}}_{t}}\min_{\nu_{t+1}\in\mathcal{R}_{\nu,t}^{\rho}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu_{\mathrm{apprx},t+1}^{\phi^{N_{1}}_{t}},\nu_{t+1})+\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}
(vi)rtρ(μtN1,νtN2)+maxμt+1μ,tρ(μtN1,νtN2)minνt+1ν,tρ(μtN1,νtN2)J¯cor,t+1ρ(μt+1,νt+1)+𝒪(1N¯)\displaystyle\stackrel{{\scriptstyle\text{(vi)}}}{{\leq}}r^{\rho}_{t}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\max_{\mu_{t+1}\in\mathcal{R}_{\mu,t}^{\rho}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})}\min_{\nu_{t+1}\in\mathcal{R}_{\nu,t}^{\rho}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu_{t+1},\nu_{t+1})+\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}
=J¯cor,tρ(μtN1,νtN2)+𝒪(1N¯).\displaystyle=\underline{J}_{\mathrm{cor},t}^{\rho*}(\mu^{N_{1}}_{t},\nu^{N_{2}}_{t})+\mathcal{O}\Big{(}\frac{1}{\sqrt{\underline{N}}}\Big{)}.

For inequality (i), we used the inductive hypothesis; for inequality (ii), we reduced the optimization domain of the Red team to the identical policy space; inequality (iii) is a result of the Lipschitz continuity; for inequality (iv), we use Lemma 1 and Theorem 2; equality (v) holds since, for every νt+1\nu_{t+1} in the Red reachable set, there is an identical Red team policy that induces it, and vice versa; consequently, switching from optimizing over the identical Red team policies to optimizing over the reachable set does not change the minimization domain; inequality (vi) is due to the fact that μapprx,t+1ϕtN1\mu_{\mathrm{apprx},t+1}^{\phi^{N_{1}}_{t}} is always in the reachable set by construction. ∎

6 Numerical Example

In this section, we provide two numerical examples. The first one illustrates a scenario where the lower and upper coordinator game values are different. The second one verifies the performance guarantees in the finite-population game. For both examples, the state spaces for the Blue and Red teams are 𝒳={x1,x2}\mathcal{X}=\{x^{1},x^{2}\} and 𝒴={y1,y2}\mathcal{Y}=\{y^{1},y^{2}\}, and the action spaces are 𝒰={u1,u2}\mathcal{U}=\{u^{1},u^{2}\} and 𝒱={v1,v2}\mathcal{V}=\{v^{1},v^{2}\}. The two-state state spaces allow the joint mean-fields (μt,νt)(\mu_{t},\nu_{t}) to be fully characterized solely by μt(x1)\mu_{t}(x^{1}) and νt(y1)\nu_{t}(y^{1}).

The code that implements the two examples can be found at https://github.com/scottyueguan/MFTG.

6.1 Numerical Example 1

We consider the following two-stage example with population ratio ρ=0.6\rho=0.6. The time-invariant transition kernel for the Blue agents is defined as follows:

fρ(x1|x1,u1,μ,ν)=0.5(1+(ρμ(x1)(1ρ)ν(y1))),\displaystyle f^{\rho}(x^{1}|x^{1},u^{1},\mu,\nu)=0.5\big{(}1+\big{(}\rho\mu(x^{1})-(1-\rho)\nu(y^{1})\big{)}\big{)},~{}~{} fρ(x2|x1,u1,μ,ν)=0.5(1(ρμ(x1)(1ρ)ν(y1))),\displaystyle f^{\rho}(x^{2}|x^{1},u^{1},\mu,\nu)=0.5\big{(}1-\big{(}\rho\mu(x^{1})-(1-\rho)\nu(y^{1})\big{)}\big{)},
fρ(x1|x1,u2,μ,ν)=0.5(10.3(ρμ(x1)(1ρ)ν(y1))),\displaystyle f^{\rho}(x^{1}|x^{1},u^{2},\mu,\nu)=0.5\big{(}1-0.3\big{(}\rho\mu(x^{1})-(1-\rho)\nu(y^{1})\big{)}\big{)},~{}~{} fρ(x2|x1,u2,μ,ν)=0.5(1+0.3(ρμ(x1)(1ρ)ν(y1))),\displaystyle f^{\rho}(x^{2}|x^{1},u^{2},\mu,\nu)=0.5\big{(}1+0.3\big{(}\rho\mu(x^{1})-(1-\rho)\nu(y^{1})\big{)}\big{)},
fρ(x1|x2,u1,μ,ν)=0.5(1(ρμ(x2)(1ρ)ν(y2))),\displaystyle f^{\rho}(x^{1}|x^{2},u^{1},\mu,\nu)=0.5\big{(}1-\big{(}\rho\mu(x^{2})-(1-\rho)\nu(y^{2})\big{)}\big{)},~{}~{} fρ(x2|x2,u1,μ,ν)=0.5(1+(ρμ(x2)(1ρ)ν(y2))),\displaystyle f^{\rho}(x^{2}|x^{2},u^{1},\mu,\nu)=0.5\big{(}1+\big{(}\rho\mu(x^{2})-(1-\rho)\nu(y^{2})\big{)}\big{)},
fρ(x1|x2,u2,μ,ν)=0.5(1+0.3(ρμ(x2)(1ρ)ν(y2))),\displaystyle f^{\rho}(x^{1}|x^{2},u^{2},\mu,\nu)=0.5\big{(}1+0.3\big{(}\rho\mu(x^{2})-(1-\rho)\nu(y^{2})\big{)}\big{)},~{}~{} fρ(x2|x2,u2,μ,ν)=0.5(10.3(ρμ(x2)(1ρ)ν(y2))).\displaystyle f^{\rho}(x^{2}|x^{2},u^{2},\mu,\nu)=0.5\big{(}1-0.3\big{(}\rho\mu(x^{2})-(1-\rho)\nu(y^{2})\big{)}\big{)}.

The time-invariant Red transition kernel is similarly defined as

gρ(y1|y1,v1,μ,ν)=0.5(1+((1ρ)ν(y1)ρμ(x1))),\displaystyle g^{\rho}(y^{1}|y^{1},v^{1},\mu,\nu)=0.5\big{(}1+\big{(}(1-\rho)\nu(y^{1})-\rho\mu(x^{1})\big{)}\big{)},~{}~{} gρ(y2|y1,v1,μ,ν)=0.5(1((1ρ)ν(y1)ρμ(x1))),\displaystyle g^{\rho}(y^{2}|y^{1},v^{1},\mu,\nu)=0.5\big{(}1-\big{(}(1-\rho)\nu(y^{1})-\rho\mu(x^{1})\big{)}\big{)},
gρ(y1|y1,v2,μ,ν)=0.5(10.3((1ρ)ν(y1)ρμ(x1))),\displaystyle g^{\rho}(y^{1}|y^{1},v^{2},\mu,\nu)=0.5\big{(}1-0.3\big{(}(1-\rho)\nu(y^{1})-\rho\mu(x^{1})\big{)}\big{)},~{}~{} gρ(y2|y1,v2,μ,ν)=0.5(1+0.3((1ρ)ν(y1)ρμ(x1))),\displaystyle g^{\rho}(y^{2}|y^{1},v^{2},\mu,\nu)=0.5\big{(}1+0.3\big{(}(1-\rho)\nu(y^{1})-\rho\mu(x^{1})\big{)}\big{)},
gρ(y1|y2,v1,μ,ν)=0.5(1((1ρ)ν(y2)ρμ(x2))),\displaystyle g^{\rho}(y^{1}|y^{2},v^{1},\mu,\nu)=0.5\big{(}1-\big{(}(1-\rho)\nu(y^{2})-\rho\mu(x^{2})\big{)}\big{)},~{}~{} gρ(y2|y2,v1,μ,ν)=0.5(1+((1ρ)ν(y2)ρμ(x2))),\displaystyle g^{\rho}(y^{2}|y^{2},v^{1},\mu,\nu)=0.5\big{(}1+\big{(}(1-\rho)\nu(y^{2})-\rho\mu(x^{2})\big{)}\big{)},
gρ(y1|y2,v2,μ,ν)=0.5(1+0.3((1ρ)ν(y2)ρμ(x2))),\displaystyle g^{\rho}(y^{1}|y^{2},v^{2},\mu,\nu)=0.5\big{(}1+0.3\big{(}(1-\rho)\nu(y^{2})-\rho\mu(x^{2})\big{)}\big{)},~{}~{} gρ(y2|y2,v2,μ,ν)=0.5(10.3((1ρ)ν(y2)ρμ(x2))).\displaystyle g^{\rho}(y^{2}|y^{2},v^{2},\mu,\nu)=0.5\big{(}1-0.3\big{(}(1-\rho)\nu(y^{2})-\rho\mu(x^{2})\big{)}\big{)}.

For simplicity, we only consider non-zero reward at the terminal time step T=2T=2. Specifically,

r0ρ(μ,ν)=r1ρ(μ,ν)=0μ𝒫(𝒳),ν𝒫(𝒴),\displaystyle r^{\rho}_{0}(\mu,\nu)=r^{\rho}_{1}(\mu,\nu)=0\quad\forall\mu\in\mathcal{P}(\mathcal{X}),\nu\in\mathcal{P}(\mathcal{Y}),
r2ρ(μ,ν)=μ(x2).\displaystyle r^{\rho}_{2}(\mu,\nu)=\mu(x^{2}).

In other words, the objective of the Blue team is to maximize its population fraction on state x2x^{2} at t=2t=2. While the Red distribution does not appear in the reward itself, the Red agents need to strategically distribute themselves to influence the Blue agents’ transitions and thereby minimize the terminal reward.
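For reproducibility, the Blue kernel and the reward above admit a compact implementation; the sketch below indexes states and actions from zero and exploits the symmetric structure of the kernel (the Red kernel gρ is obtained by swapping the roles of μ and ν, i.e., using the drift (1−ρ)ν(y^k)−ρμ(x^k)). The function names are illustrative.

```python
import numpy as np

RHO = 0.6

def f_example1(x_next, x, u, mu, nu):
    """Blue transition kernel of Numerical Example 1 (states/actions indexed from 0)."""
    drift = RHO * mu[x] - (1.0 - RHO) * nu[x]     # rho*mu(x^k) - (1-rho)*nu(y^k), k matching the current state
    gain = 1.0 if u == 0 else -0.3                # u^1 pushes with the drift, u^2 pushes 0.3 against it
    stay = 0.5 * (1.0 + gain * drift)             # probability of remaining at the current state
    return stay if x_next == x else 1.0 - stay

def r_example1(t, mu, nu):
    """Reward: non-zero only at the terminal step T = 2, equal to the Blue mass on x^2."""
    return mu[1] if t == 2 else 0.0
```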

Figure 5: Subplots (a)-(c) present the game value computed via discretization. The x- and y-axes correspond to μtρ(x1)\mu^{\rho}_{t}(x^{1}) and νtρ(y1)\nu^{\rho}_{t}(y^{1}), respectively. Subplot (d) illustrates the reachable sets starting from μ0=[0.96,0.04]\mu_{0}=[0.96,0.04] and ν0=[0.04,0.96]\nu_{0}=[0.04,0.96].

The game values for the coordinator game are computed through discretization, where we uniformly mesh the two-dimensional simplices 𝒫(𝒳)\mathcal{P}(\mathcal{X}) and 𝒫(𝒴)\mathcal{P}(\mathcal{Y}) into 500 bins (one can bound the difference between the discretized value and the true optimal value using the Lipschitz constants; see [Guan et al.,, 2023] for an example). The numerically solved values are presented in Figure 5.

While the game value Jcor,1ρJ^{\rho*}_{\mathrm{cor},1} exists at time t=1t=1, it is not concave-convex, and hence the best-response correspondences in the coordinator game are not convex-valued. Consequently, the lower (max-min) and upper (min-max) game values at the previous step t=0t=0 differ, as observed in subplot (a). Specifically, at μ0ρ=[0.96,0.04]\mu^{\rho}_{0}=[0.96,0.04] and ν0ρ=[0.04,0.96]\nu^{\rho}_{0}=[0.04,0.96], we have the lower value J¯cor,0ρ=0.5298\underline{J}^{\rho*}_{\mathrm{cor},0}=0.5298 and the upper value J¯cor,0ρ=0.5384\bar{J}^{\rho*}_{\mathrm{cor},0}=0.5384, which are visualized as the green and yellow points, respectively. This discrepancy in the game values implies the absence of a Nash equilibrium in this coordinator game.

Based on (33), the optimization domains for computing Jcor,0ρJ^{\rho*}_{\mathrm{cor},0} are μ,0(μ0ρ,ν0ρ)\mathcal{R}_{\mu,0}(\mu_{0}^{\rho},\nu_{0}^{\rho}) for the maximization and ν,0(μ0ρ,ν0ρ)\mathcal{R}_{\nu,0}(\mu_{0}^{\rho},\nu_{0}^{\rho}) for the minimization, both of which are plotted in (d) and visualized as the box in subplots (b) and (c). Subplot (c) presents a zoom-in for the optimization maxμ,0minν,0Jcor,1ρ\max_{\mathcal{R}_{\mu,0}}\min_{\mathcal{R}_{\nu,0}}J^{\rho*}_{\mathrm{cor},1} and its min-max counterpart. The marginal functions are also plotted, from which the max-min (green point) and min-max values (yellow point) at t=0t=0 can be directly obtained.

Figure 6: Max-min and min-max trajectory of the coordinator game.

Figure 6 presents the max-min and min-max trajectories of the coordinator game. Note that the max-min solution corresponds to the highest worst-case performance for the Blue coordinator, where the Blue coordinator announces its first move from [0.96,0.04][0.96,0.04] to [0.4172,0.5828][0.4172,0.5828] and then the Red coordinator exploits the announced move. If, instead, the Red coordinator announces its first move toward [0.3160,0.6840][0.3160,0.6840] (its move in the max-min trajectory), the Blue coordinator can exploit that move by achieving [0.776,0.224][0.776,0.224] at t=1t=1 and ultimately receives a higher reward of 0.5442.

6.2 Numerical Example 2

It is generally challenging to verify the suboptimality bound in Theorem 4, since computing the true optimal performance of a finite-population team game is intractable. However, for the following simple example, we can construct the optimal team strategies even for large but still finite-population teams.

Consider a ZS-MFTG instance with a terminal time T=2T=2 and population ratio ρ=0.375\rho=0.375. The (minimizing) Red team’s objective is to maximize its presence at state y1y^{1} at t=2t=2, which translates to

r0ρ(μ,ν)=r1ρ(μ,ν)=0μ𝒫(𝒳),ν𝒫(𝒴),\displaystyle r^{\rho}_{0}(\mu,\nu)=r^{\rho}_{1}(\mu,\nu)=0\quad\forall\mu\in\mathcal{P}(\mathcal{X}),\nu\in\mathcal{P}(\mathcal{Y}),
r2ρ(μ,ν)=ν(y1)\displaystyle r^{\rho}_{2}(\mu,\nu)=-\nu(y^{1})

The Blue transition kernels are time-invariant, deterministic and independent of the MFs, and are defined as

ftρ(x1|x1,u1,μ,ν)=1.0,\displaystyle f_{t}^{\rho}(x^{1}|x^{1},u^{1},\mu,\nu)=1.0,\quad ftρ(x2|x1,u1,μ,ν)=0.0,\displaystyle f_{t}^{\rho}(x^{2}|x^{1},u^{1},\mu,\nu)=0.0,\qquad μ𝒫(𝒳),ν𝒫(𝒴),t{0,1},\displaystyle\forall\;\mu\in\mathcal{P}(\mathcal{X}),\nu\in\mathcal{P}(\mathcal{Y}),t\in\{0,1\}, (61)
ftρ(x1|x1,u2,μ,ν)=0.0,\displaystyle f_{t}^{\rho}(x^{1}|x^{1},u^{2},\mu,\nu)=0.0,\quad ftρ(x2|x1,u2,μ,ν)=1.0,\displaystyle f_{t}^{\rho}(x^{2}|x^{1},u^{2},\mu,\nu)=1.0,\qquad μ𝒫(𝒳),ν𝒫(𝒴),t{0,1},\displaystyle\forall\;\mu\in\mathcal{P}(\mathcal{X}),\nu\in\mathcal{P}(\mathcal{Y}),t\in\{0,1\},
ftρ(x1|x2,u1,μ,ν)=0.0,\displaystyle f_{t}^{\rho}(x^{1}|x^{2},u^{1},\mu,\nu)=0.0,\quad ftρ(x2|x2,u1,μ,ν)=1.0,\displaystyle f_{t}^{\rho}(x^{2}|x^{2},u^{1},\mu,\nu)=1.0,\qquad μ𝒫(𝒳),ν𝒫(𝒴),t{0,1},\displaystyle\forall\;\mu\in\mathcal{P}(\mathcal{X}),\nu\in\mathcal{P}(\mathcal{Y}),t\in\{0,1\},
ftρ(x1|x2,u2,μ,ν)=1.0,\displaystyle f_{t}^{\rho}(x^{1}|x^{2},u^{2},\mu,\nu)=1.0,\quad ftρ(x2|x2,u2,μ,ν)=0.0,\displaystyle f_{t}^{\rho}(x^{2}|x^{2},u^{2},\mu,\nu)=0.0,\qquad μ𝒫(𝒳),ν𝒫(𝒴),t{0,1}.\displaystyle\forall\;\mu\in\mathcal{P}(\mathcal{X}),\nu\in\mathcal{P}(\mathcal{Y}),t\in\{0,1\}.

Under the above transition kernel, a Blue agent can move freely between the two states. Specifically, action u1u^{1} instructs a Blue agent to remain at its current state, while action u2u^{2} directs it to transition to the other state. One can verify that the above Blue transition kernel ensures that the Blue reachable set covers the whole simplex regardless of the initial team distributions, i.e., μ,t(μ,ν)=𝒫(𝒳)\mathcal{R}_{\mu,t}(\mu,\nu)=\mathcal{P}(\mathcal{X}) for all μ𝒫(𝒳)\mu\in\mathcal{P}(\mathcal{X}), ν𝒫(𝒴)\nu\in\mathcal{P}(\mathcal{Y}) and t{0,1}t\in\{0,1\}.

The Red transition kernels are time-dependent and are defined as

g0ρ(y1|y1,v,μ,ν)=1.0,\displaystyle g_{0}^{\rho}(y^{1}|y^{1},v,\mu,\nu)=1.0,\quad g0ρ(y2|y1,v,μ,ν)=0.0,\displaystyle g_{0}^{\rho}(y^{2}|y^{1},v,\mu,\nu)=0.0,~{}~{} v𝒱,μ𝒫(𝒳),ν𝒫(𝒴),\displaystyle\forall\;v\in\mathcal{V},\mu\in\mathcal{P}(\mathcal{X}),\nu\in\mathcal{P}(\mathcal{Y}), (62)
g0ρ(y1|y2,v,μ,ν)=0.0,\displaystyle g_{0}^{\rho}(y^{1}|y^{2},v,\mu,\nu)=0.0,\quad g0ρ(y2|y2,v,μ,ν)=1.0,\displaystyle g_{0}^{\rho}(y^{2}|y^{2},v,\mu,\nu)=1.0,~{}~{} v𝒱,μ𝒫(𝒳),ν𝒫(𝒴),\displaystyle\forall\;v\in\mathcal{V},\mu\in\mathcal{P}(\mathcal{X}),\nu\in\mathcal{P}(\mathcal{Y}), (63)
g1ρ(y1|y1,v,μ,ν)=1.0,\displaystyle g_{1}^{\rho}(y^{1}|y^{1},v,\mu,\nu)=1.0,\quad g1ρ(y2|y1,v,μ,ν)=0.0,\displaystyle g_{1}^{\rho}(y^{2}|y^{1},v,\mu,\nu)=0.0,~{}~{} v𝒱,μ𝒫(𝒳),ν𝒫(𝒴),\displaystyle\forall\;v\in\mathcal{V},\mu\in\mathcal{P}(\mathcal{X}),\nu\in\mathcal{P}(\mathcal{Y}), (64)
g1ρ(y1|y2,v1,μ,ν)=0.0,\displaystyle g_{1}^{\rho}(y^{1}|y^{2},v^{1},\mu,\nu)=0.0,\quad g1ρ(y2|y2,v1,μ,ν)=1.0,\displaystyle g_{1}^{\rho}(y^{2}|y^{2},v^{1},\mu,\nu)=1.0,~{}~{} μ𝒫(𝒳),ν𝒫(𝒴),\displaystyle\qquad\quad\forall\;\mu\in\mathcal{P}(\mathcal{X}),\nu\in\mathcal{P}(\mathcal{Y}), (65)

and the transitions from state y2y^{2} using action v2v^{2} at t=1t=1 are given by

g1ρ(y2|y2,v2,μ,ν)=1min{5((μ(x1)12)2+(μ(x2)(112))2),1},\displaystyle g_{1}^{\rho}(y^{2}|y^{2},v^{2},\mu,\nu)=1-\min\Big{\{}5\Big{(}(\mu(x^{1})-\frac{1}{\sqrt{2}})^{2}+(\mu(x^{2})-(1-\frac{1}{\sqrt{2}}))^{2}\Big{)},1\Big{\}},~{}~{}~{}~{} ν𝒫(𝒴),\displaystyle\forall\nu\in\mathcal{P}(\mathcal{Y}), (66)
g1ρ(y2|y2,v2,μ,ν)=min{5((μ(x1)12)2+(μ(x2)(112))2),1},\displaystyle g_{1}^{\rho}(y^{2}|y^{2},v^{2},\mu,\nu)=\min\Big{\{}5\Big{(}(\mu(x^{1})-\frac{1}{\sqrt{2}})^{2}+(\mu(x^{2})-(1-\frac{1}{\sqrt{2}}))^{2}\Big{)},1\Big{\}},~{}~{}~{}~{} ν𝒫(𝒴).\displaystyle\forall\nu\in\mathcal{P}(\mathcal{Y}).

The Red transitions in (62) and (63) imply that all Red agents are frozen and cannot change their states with any action at t=0t=0. Similarly, all Red agents at state y1y^{1} are frozen at t=1t=1, as depicted in (64). The Red agents at state y2y^{2} at t=1t=1 can choose to either deterministically stay at state y2y^{2} using action v1v^{1}, as in (65), or try to move toward state y1y^{1} using action v2v^{2}. As in (66), the probability of transitioning from y2y^{2} to y1y^{1} is controlled by the Blue team’s distribution μ\mu at t=1t=1. If the Blue team can perfectly match the target distribution [1/2,11/2][1/\sqrt{2},1-1/\sqrt{2}], then no Red agent can transition from y2y^{2} to y1y^{1}. Otherwise, the more the Blue team deviates from the target distribution, the more likely a Red agent can transition from y2y^{2} to y1y^{1}.
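The coupling in (66) is a clipped quadratic penalty on the Blue team's deviation from the target distribution; in code (a small sketch, with states indexed from zero and an illustrative function name):

```python
import numpy as np

TARGET = np.array([1.0 / np.sqrt(2.0), 1.0 - 1.0 / np.sqrt(2.0)])

def red_escape_probability(mu):
    """Probability in (66) that a Red agent at y^2 playing v^2 at t = 1 reaches y^1."""
    return min(5.0 * float(np.sum((np.asarray(mu) - TARGET) ** 2)), 1.0)
```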

Infinite-population case.

Since all Red agents are frozen at t=0t=0 and the Red sub-population at state y1y^{1} is again frozen at t=1t=1, only the actions of the Red agents at state y2y^{2} at t=1t=1 have an impact on the game outcome. As a result, the above setup leads to a dominant optimal Red team strategy: all Red agents at y2y^{2} use action v2v^{2} at t=1t=1 and try to transition to state y1y^{1} to maximize the team’s presence at y1y^{1}, regardless of the Blue team’s distribution. Note that this is the dominant optimal strategy against all Blue team strategies in both the finite and infinite-population scenarios. Specifically,

ψj,1N2(v2|y2,μ,ν)=ψ1ρ(v2|y2,μ,ν)=1.0,μ𝒫(𝒳),ν𝒫(𝒴),j[N2].\psi^{N_{2}*}_{j,1}(v^{2}|y^{2},\mu,\nu)=\psi^{\rho*}_{1}(v^{2}|y^{2},\mu,\nu)=1.0,~{}~{}\forall\mu\in\mathcal{P}(\mathcal{X}),\nu\in\mathcal{P}(\mathcal{Y}),j\in[N_{2}]. (67)

On the other hand, the Blue team should try to match the target distribution [1/2,11/2][1/\sqrt{2},1-1/\sqrt{2}] to minimize the portion of Red agents transitioning from y2y^{2} to y1y^{1} at t=1t=1. Since the Blue reachable set at t=0t=0 covers the whole simplex, the target distribution can always be achieved at t=1t=1 starting from any initial Blue team distribution in the infinite-population case. One Blue coordination strategy that achieves the given target distribution is

α0(μ,ν)={π01,if μ0(x1)<12,π02,if μ0(x1)12,\alpha_{0}(\mu,\nu)=\left\{\begin{array}[]{cc}\pi_{0}^{1},&\text{if }\mu_{0}(x^{1})<\frac{1}{\sqrt{2}},\\ \pi_{0}^{2},&\text{if }\mu_{0}(x^{1})\geq\frac{1}{\sqrt{2}},\end{array}\right. (68)

where,

π01(u1|x1)=1,\displaystyle\pi_{0}^{1}(u^{1}|x^{1})=1,\qquad π01(u2|x1)=0,\displaystyle\pi_{0}^{1}(u^{2}|x^{1})=0,
π01(u1|x2)=11/2μ0(x2),\displaystyle\pi_{0}^{1}(u^{1}|x^{2})=\frac{1-1/\sqrt{2}}{\mu_{0}(x^{2})},\qquad π01(u2|x2)=1/2μ0(x1)μ0(x2),\displaystyle\pi_{0}^{1}(u^{2}|x^{2})=\frac{1/\sqrt{2}-\mu_{0}(x^{1})}{\mu_{0}(x^{2})},
π02(u1|x1)=1/2μ0(x1),\displaystyle\pi_{0}^{2}(u^{1}|x^{1})=\frac{1/\sqrt{2}}{\mu_{0}(x^{1})},\qquad π02(u2|x1)=11/2μ0(x2)μ0(x1),\displaystyle\pi_{0}^{2}(u^{2}|x^{1})=\frac{1-1/\sqrt{2}-\mu_{0}(x^{2})}{\mu_{0}(x^{1})},
π02(u1|x2)=1,\displaystyle\pi_{0}^{2}(u^{1}|x^{2})=1,\qquad π02(u2|x2)=0.\displaystyle\pi_{0}^{2}(u^{2}|x^{2})=0.
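A small sketch of (68): the policy keeps the mass that is already where it should be and routes just enough of the remainder across states so that, under the deterministic Blue kernel (61) (u^1 stays, u^2 switches), the next Blue mean-field equals the target [1/√2, 1−1/√2] exactly. Variable names are illustrative.

```python
import numpy as np

SQRT2_INV = 1.0 / np.sqrt(2.0)

def alpha_0(mu0):
    """Local Blue policy pi_0 of (68); pi[x, u] = pi_0(u^{u+1} | x^{x+1})."""
    pi = np.zeros((2, 2))
    if mu0[0] < SQRT2_INV:           # too little mass at x^1: move some from x^2 to x^1
        pi[0] = [1.0, 0.0]
        pi[1] = [(1.0 - SQRT2_INV) / mu0[1], (SQRT2_INV - mu0[0]) / mu0[1]]
    else:                            # too much mass at x^1: move some from x^1 to x^2
        pi[0] = [SQRT2_INV / mu0[0], (1.0 - SQRT2_INV - mu0[1]) / mu0[0]]
        pi[1] = [1.0, 0.0]
    return pi

# sanity check: the next Blue mass at x^1 is mu0[0]*pi[0,0] + mu0[1]*pi[1,1]
mu0 = np.array([0.3, 0.7])
pi = alpha_0(mu0)
assert abs(mu0[0] * pi[0, 0] + mu0[1] * pi[1, 1] - SQRT2_INV) < 1e-12
```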

Figure 7 presents the coordinator game values, which are solved via discretization. In this example, the upper and lower game values coincide, and thus the game value exists. The black line in the middle subplot marks the game value with μ1ρ(x1)=1/2\mu^{\rho}_{1}(x^{1})=1/\sqrt{2}, when the Blue team perfectly achieves the target distribution. The values on the black line dominate and give the highest performance the Blue team can achieve for any ν1ρ(y1)\nu_{1}^{\rho}(y^{1}), which aligns with our analysis that the Blue team should match the target distribution.

Figure 7: The values of the coordinator game at time t=0,1t=0,1 and 2. The blue line in the middle figure marks the value at μ1ρ(x1)=1/2\mu^{\rho}_{1}(x^{1})=1/\sqrt{2}, when the Blue team perfectly achieves the target distribution.

In the infinite-population coordinator game, the Blue team can always achieve the target distribution regardless of its initial distribution and thus completely block the migration of Red agents from y2y^{2} to y1y^{1}. Consequently, only the Red agents that are initialized directly at y1y^{1} can be counted toward the reward if the Blue team plays rationally, and thus the game value is simply given by J0ρ=ν0ρ(y1)J^{\rho*}_{0}=-\nu^{\rho}_{0}(y^{1}), which matches the game value plotted in Figure 7.

Finite-population case.

The Red team’s optimal strategy remains the same as in the infinite-population case. However, the Blue team cannot achieve the irrational target distribution with a finite number of agents. While the Blue team can still match the target distribution in expectation using a (stochastic) identical team strategy, the following analysis shows that a non-identical deterministic Blue team strategy achieves better performance.

Consider a Blue team with three agents, all initially located at node 1, i.e., \mu^{3}_{0}=[1,0]. The optimal Blue coordination strategy prescribes that each Blue agent picks u^{1} (“stay”) with probability 1/\sqrt{2} and u^{2} (“move to x^{2}”) with probability (1-1/\sqrt{2}) to reach the target distribution in expectation. This action selection leads to four possible outcomes for the next Blue team ED \mu^{3}_{1}: \mathbb{P}([1,0])=0.354, \mathbb{P}([2/3,1/3])=0.439, \mathbb{P}([1/3,2/3])=0.182, and \mathbb{P}([0,1])=0.025. In expectation, these empirical distributions lead to a transition probability of 0.518 for a Red agent moving from y^{2} to y^{1}. Consequently, the worst-case performance of the optimal Blue coordination strategy is \min_{\psi^{N_{2}}}J^{3,\alpha^{*},\psi^{N_{2}}}=-\nu_{0}(y^{1})-0.518\nu_{0}(y^{2}).
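The four outcome probabilities above follow from a binomial computation, reproduced in the short sketch below (the 0.518 expected transition probability additionally requires the Red transition kernel of this example, which we do not repeat here).

```python
from math import comb, sqrt

N1 = 3
p_stay = 1 / sqrt(2)   # probability that a Blue agent plays u^1 ("stay")

# ED outcome [k/3, (3-k)/3] occurs when exactly k of the 3 agents stay at x^1.
for k in range(N1, -1, -1):
    prob = comb(N1, k) * p_stay**k * (1 - p_stay)**(N1 - k)
    print(f"P(mu_1^3 = [{k}/{N1}, {N1 - k}/{N1}]) = {prob:.3f}")
# Prints 0.354, 0.439, 0.182 and 0.025, matching the four outcomes listed above.
```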

Next, consider the non-identical deterministic Blue team strategy, under which Blue agents 1 and 2 apply action u1u^{1} and Blue agent 3 applies u2u^{2}. This Blue team strategy deterministically leads to 13=[2/3,1/3]\mathcal{M}_{1}^{3}=[2/3,1/3] at t=1t=1, and the resultant Red team transition probability from y2y^{2} to y1y^{1} is 0.016. Clearly, the non-identical Blue team strategy significantly outperforms the identical mixed team strategy in this three-agent case. Furthermore, this Blue team strategy is optimal over the entire non-identical Blue team strategy set, resulting in a finite-population optimal game value J3=ν0(y1)0.016ν0(y2)J^{3*}=-\nu_{0}(y^{1})-0.016\nu_{0}(y^{2}).

We repeat the above computation for multiple Blue team sizes N_{1} and plot the suboptimality gap as the blue line in Figure 8, which verifies the \mathcal{O}(1/\sqrt{\underline{N}}) decrease rate predicted by Theorem 4.

Specifically, the optimal value for the finite-population game is computed based on the following steps (a computational sketch is given after the list):

  • At t=0, the Blue team uses a non-identical deterministic team strategy and deterministically achieves the optimal ED \mu^{N_{1}}_{1}, which minimizes the two-norm distance to the target distribution. This optimal Blue ED is constructed as [\frac{n}{N_{1}},\frac{N_{1}-n}{N_{1}}], where n=\operatorname*{argmin}_{0\leq n\leq N_{1}}(\frac{n}{N_{1}}-\frac{1}{\sqrt{2}})^{2}.

  • At t=1t=1, all Red agents apply v2v^{2}, and the distribution of the next Red EDs ν2N2\nu^{N_{2}}_{2} can be computed based on the optimal μ1N1\mu^{N_{1}}_{1} achieved at t=1t=1.

  • The expected game value is then computed based on the distribution of the Red ED ν2N2\nu_{2}^{N_{2}}.
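The first step reduces to a one-dimensional search over n, sketched below; the function name best_blue_ed is ours. The remaining two steps additionally require the Red transition kernel of this example, which is not reproduced in this appendix.

```python
import numpy as np

TARGET = 1 / np.sqrt(2)   # target Blue mass at x^1

def best_blue_ed(N1):
    """Best deterministically achievable Blue ED at t=1: the rational point
    n/N1 closest (in two-norm) to the irrational target 1/sqrt(2)."""
    n = min(range(N1 + 1), key=lambda k: (k / N1 - TARGET) ** 2)
    return np.array([n / N1, (N1 - n) / N1])

for N1 in [3, 10, 100]:
    mu1 = best_blue_ed(N1)
    # The gap to the target is at most 1/(2*N1) and vanishes as N1 grows.
    print(N1, mu1, abs(mu1[0] - TARGET))
```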

Figure 8: Difference between game values with initial distributions μ0=[1,0]\mu_{0}=[1,0] and ν0=[0.6,0.4]\nu_{0}=[0.6,0.4].

7 Conclusion

In this work, we introduced a discrete zero-sum mean-field team game to model the behaviors of competing large-population teams. We developed a dynamic programming approach that approximately solves this large-population game at its infinite-population limit where only identical team strategies are considered. Our analysis demonstrated that the identical strategies constructed are ϵ\epsilon-optimal within the general class of non-identical team strategies when deployed in the original finite-population game. The derived performance guarantees are verified through numerical examples. Future work will investigate the LQG setup of this problem and explore machine-learning techniques to address more complex zero-sum mean-field team problems. Additionally, we aim to generalize our results to the infinite-horizon discounted case and problems with heterogeneous sub-populations.

Acknowledgments:

The authors acknowledge fruitful discussions with Dr. Ali Pakniyat.

Appendix A Special Cases and Examples

A.1 Pairwise State-Coupled Mean-Field Team Games

In Section 2, we introduced the pairwise state-coupled dynamics and the averaged state-dependent reward as a special case for the proposed mean-field team game. Specifically, the pairwise state-coupled transition kernels are defined as follows

ft(xi,t+1N1|xi,tN1,ui,tN1,𝐱i,tN1,𝐲tN2)\displaystyle f_{t}\big{(}x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},\mathbf{x}^{N_{1}}_{-i,t},\mathbf{y}^{N_{2}}_{t}) =k=1N1f1,t(xi,t+1N1|xi,tN1,ui,tN1,xk,tN1)N+k=1N2f2,t(xi,t+1N1|xi,tN1,ui,tN1,yk,tN2)N\displaystyle=\frac{\sum_{k=1}^{N_{1}}f_{1,t}(x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},x^{N_{1}}_{k,t})}{N}+\frac{\sum_{k=1}^{N_{2}}f_{2,t}(x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},y^{N_{2}}_{k,t})}{N} (12)
gt(yj,t+1N2|yj,tN2,vj,tN2,𝐱tN1,𝐲j,tN2)\displaystyle g_{t}\big{(}y^{N_{2}}_{j,t+1}\big{|}y^{N_{2}}_{j,t},v^{N_{2}}_{j,t},\mathbf{x}^{N_{1}}_{t},\mathbf{y}^{N_{2}}_{-j,t}) =k=1N1g1,t(yj,t+1N2|yj,tN2,vj,tN2,xk,tN1)N+k=1N2g2,t(yj,t+1N2|yj,tN2,vj,tN2,yk,tN2)N.\displaystyle=\frac{\sum_{k=1}^{N_{1}}g_{1,t}(y^{N_{2}}_{j,t+1}\big{|}y^{N_{2}}_{j,t},v^{N_{2}}_{j,t},x^{N_{1}}_{k,t})}{N}+\frac{\sum_{k=1}^{N_{2}}g_{2,t}(y^{N_{2}}_{j,t+1}\big{|}y^{N_{2}}_{j,t},v^{N_{2}}_{j,t},y^{N_{2}}_{k,t})}{N}.

Through some algebraic manipulations, the Blue transition kernel can be re-written in the weakly-coupled form in (4):

ft(xi,t+1N1|xi,tN1,ui,tN1,𝐱i,tN1,𝐲tN2)\displaystyle f_{t}\big{(}x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},\mathbf{x}^{N_{1}}_{-i,t},\mathbf{y}^{N_{2}}_{t})
=N1Nk=1N1f1,t(xi,t+1N1|xi,tN1,ui,tN1,xk,tN1)N1+N2Nk=1N2f2,t(xi,t+1N1|xi,tN1,ui,tN1,yk,tN2)N2\displaystyle=\frac{N_{1}}{N}\frac{\sum_{k=1}^{N_{1}}f_{1,t}(x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},x^{N_{1}}_{k,t})}{N_{1}}+\frac{N_{2}}{N}\frac{\sum_{k=1}^{N_{2}}f_{2,t}(x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},y^{N_{2}}_{k,t})}{N_{2}}
=ρk=1N1x𝒳𝟙x(xk,tN1)f1,t(xi,t+1N1|xi,tN1,ui,tN1,xk,tN1)N1+(1ρ)k=1N2y𝒴𝟙y(yk,tN2)f2,t(xi,t+1N1|xi,tN1,ui,tN1,yk,tN2)N2\displaystyle=\rho\frac{\sum_{k=1}^{N_{1}}\sum_{x\in\mathcal{X}}\mathds{1}_{x}\big{(}{x^{N_{1}}_{k,t}}\big{)}f_{1,t}(x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},x^{N_{1}}_{k,t})}{N_{1}}+(1-\rho)\frac{\sum_{k=1}^{N_{2}}\sum_{y\in\mathcal{Y}}\mathds{1}_{y}\big{(}{y^{N_{2}}_{k,t}}\big{)}f_{2,t}(x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},y^{N_{2}}_{k,t})}{N_{2}}
=ρx𝒳k=1N1𝟙x(xk,tN1)f1,t(xi,t+1N1|xi,tN1,ui,tN1,x)N1+(1ρ)y𝒴k=1N2𝟙y(yk,tN2)f2,t(xi,t+1N1|xi,tN1,ui,tN1,y)N2\displaystyle=\rho\frac{\sum_{x\in\mathcal{X}}\sum_{k=1}^{N_{1}}\mathds{1}_{x}\big{(}{x^{N_{1}}_{k,t}}\big{)}f_{1,t}(x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},x)}{N_{1}}+(1-\rho)\frac{\sum_{y\in\mathcal{Y}}\sum_{k=1}^{N_{2}}\mathds{1}_{y}\big{(}{y^{N_{2}}_{k,t}}\big{)}f_{2,t}(x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},y)}{N_{2}}
=ρx𝒳μtN1(x)f1,t(xi,t+1N1|xi,tN1,ui,tN1,x)+(1ρ)y𝒴νtN2(y)f2,t(xi,t+1N1|xi,tN1,ui,tN1,y)\displaystyle=\rho\sum_{x\in\mathcal{X}}\mu^{N_{1}}_{t}(x)f_{1,t}(x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},x)+(1-\rho)\sum_{y\in\mathcal{Y}}\nu_{t}^{N_{2}}(y)f_{2,t}(x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},y)
ftρ(xi,t+1N1|xi,tN1,ui,tN1,μtN1,νtN2).\displaystyle\triangleq f^{\rho}_{t}\big{(}x^{N_{1}}_{i,t+1}\big{|}x^{N_{1}}_{i,t},u^{N_{1}}_{i,t},\mu^{N_{1}}_{t},\nu^{N_{2}}_{t}).

The Red transition kernel can be similarly transformed into a weakly-coupled form.
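For concreteness, the reduced kernel f^{\rho}_{t} in the last line above can be evaluated as in the following sketch, where the pairwise kernels f_{1,t} and f_{2,t} are assumed to be supplied as NumPy arrays indexed by (x', x, u, coupled state).

```python
import numpy as np

def f_rho(f1, f2, rho, mu, nu):
    """Weakly-coupled Blue kernel:
        f^rho(x'|x,u,mu,nu) = rho * sum_z mu(z) f1(x'|x,u,z)
                              + (1 - rho) * sum_y nu(y) f2(x'|x,u,y).
    f1 has shape (|X|, |X|, |U|, |X|) and f2 has shape (|X|, |X|, |U|, |Y|),
    with axes ordered as (x', x, u, coupled state)."""
    return rho * np.einsum("pxuz,z->pxu", f1, mu) \
        + (1.0 - rho) * np.einsum("pxuy,y->pxu", f2, nu)

# Whenever each f1[:, x, u, z] and f2[:, x, u, y] is a probability vector over x',
# so is each f_rho(...)[:, x, u], since mu and nu are probability vectors.
```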

The averaged state-dependent reward is defined as

rt(𝐱tN1,𝐲tN2)\displaystyle r_{t}(\mathbf{x}_{t}^{N_{1}},\mathbf{y}_{t}^{N_{2}}) =1Nk=1N1r1,t(xk,tN1)1Nk=1N2r2,t(yk,tN2).\displaystyle=\frac{1}{N}\sum_{k=1}^{N_{1}}r_{1,t}(x^{N_{1}}_{k,t})-\frac{1}{N}\sum_{k=1}^{N_{2}}r_{2,t}(y^{N_{2}}_{k,t}). (13)

One can also transform the above reward structure to a weakly-coupled form as in (6)

\displaystyle r_{t}(\mathbf{x}_{t}^{N_{1}},\mathbf{y}_{t}^{N_{2}}) =\frac{N_{1}}{N}\frac{\sum_{k=1}^{N_{1}}r_{1,t}(x^{N_{1}}_{k,t})}{N_{1}}-\frac{N_{2}}{N}\frac{\sum_{k=1}^{N_{2}}r_{2,t}(y^{N_{2}}_{k,t})}{N_{2}}
\displaystyle=\frac{N_{1}}{N}\frac{\sum_{k=1}^{N_{1}}\sum_{x\in\mathcal{X}}\mathds{1}_{x}\big{(}{x^{N_{1}}_{k,t}}\big{)}r_{1,t}(x^{N_{1}}_{k,t})}{N_{1}}-\frac{N_{2}}{N}\frac{\sum_{k=1}^{N_{2}}\sum_{y\in\mathcal{Y}}\mathds{1}_{y}\big{(}{y^{N_{2}}_{k,t}}\big{)}r_{2,t}(y^{N_{2}}_{k,t})}{N_{2}}
=ρx𝒳μtN1(x)r1,t(x)(1ρ)y𝒴νtN2(y)r2,t(y)\displaystyle=\rho\sum_{x\in\mathcal{X}}\mu^{N_{1}}_{t}(x)r_{1,t}(x)-(1-\rho)\sum_{y\in\mathcal{Y}}\nu^{N_{2}}_{t}(y)r_{2,t}(y)
rtρ(μtN1,νtN2).\displaystyle\triangleq r^{\rho}_{t}(\mu_{t}^{N_{1}},\nu_{t}^{N_{2}}).
Proposition 1.

The transition kernels in (12) satisfy

x𝒳|ftρ(x|x,u,μ,ν)ftρ(x|x,u,μ,ν)|2(dTV(μ,μ)+dTV(ν,ν))\displaystyle\sum_{x^{\prime}\in\mathcal{X}}\left\lvert f^{\rho}_{t}(x^{\prime}|x,u,\mu,\nu)-f^{\rho}_{t}(x^{\prime}|x,u,\mu^{\prime},\nu^{\prime})\right\rvert\leq 2\Big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu,\mu^{\prime}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu,\nu^{\prime}\big{)}\Big{)}~{}~{}~{} x𝒳,u𝒰,\displaystyle\forall x\in\mathcal{X},u\in\mathcal{U},
y𝒴|gtρ(y|y,v,μ,ν)gtρ(y|y,v,μ,ν)|2(dTV(μ,μ)+dTV(ν,ν))\displaystyle\sum_{y^{\prime}\in\mathcal{Y}}\left\lvert g^{\rho}_{t}(y^{\prime}|y,v,\mu,\nu)-g^{\rho}_{t}(y^{\prime}|y,v,\mu^{\prime},\nu^{\prime})\right\rvert\leq 2\Big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu,\mu^{\prime}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu,\nu^{\prime}\big{)}\Big{)}~{}~{}~{} y𝒴,v𝒱.\displaystyle\forall y\in\mathcal{Y},v\in\mathcal{V}.
Proof.

Due to symmetry, we only show the result for the Blue transition kernel. From the definition, we have, for all x𝒳x\in\mathcal{X} and u𝒰u\in\mathcal{U}, that

x𝒳|ftρ(\displaystyle\sum_{x^{\prime}\in\mathcal{X}}\big{|}f^{\rho}_{t}( x|x,u,μ,ν)fρt(x|x,u,μ,ν)|=x𝒳|ρz𝒳μ(z)f1,t(x|x,u,z)+(1ρ)y𝒴ν(y)f2,t(x|x,u,y)\displaystyle x^{\prime}|x,u,\mu,\nu)-f^{\rho}_{t}(x^{\prime}|x,u,\mu^{\prime},\nu^{\prime})\big{|}=\sum_{x^{\prime}\in\mathcal{X}}\Big{|}\rho\sum_{z\in\mathcal{X}}\mu(z)f_{1,t}(x^{\prime}|x,u,z)+(1-\rho)\sum_{y\in\mathcal{Y}}\nu(y)f_{2,t}(x^{\prime}|x,u,y)
\displaystyle\qquad\qquad\qquad\qquad\qquad\quad\qquad\quad\qquad~{}~{}~{}-\rho\sum_{z\in\mathcal{X}}\mu^{\prime}(z)f_{1,t}(x^{\prime}|x,u,z)-(1-\rho)\sum_{y\in\mathcal{Y}}\nu^{\prime}(y)f_{2,t}(x^{\prime}|x,u,y)\Big{|}
ρx𝒳z𝒳f1,t(x|x,u,z)|μ(z)μ(z)|+(1ρ)x𝒳y𝒴f2,t(x|x,u,y)|ν(y)ν(y)|\displaystyle\leq\rho\sum_{x^{\prime}\in\mathcal{X}}\sum_{z\in\mathcal{X}}f_{1,t}(x^{\prime}|x,u,z)\left\lvert\mu(z)-\mu^{\prime}(z)\right\rvert+(1-\rho)\sum_{x^{\prime}\in\mathcal{X}}\sum_{y\in\mathcal{Y}}f_{2,t}(x^{\prime}|x,u,y)\left\lvert\nu(y)-\nu^{\prime}(y)\right\rvert
=ρz𝒳|μ(z)μ(z)|+(1ρ)y𝒴|ν(y)ν(y)|=2ρdTV(μ,μ)+2(1ρ)dTV(ν,ν)\displaystyle=\rho\sum_{z\in\mathcal{X}}\left\lvert\mu(z)-\mu^{\prime}(z)\right\rvert+(1-\rho)\sum_{y\in\mathcal{Y}}\left\lvert\nu(y)-\nu^{\prime}(y)\right\rvert=2\rho\;\mathrm{d}_{\mathrm{TV}}\big{(}\mu,\mu^{\prime}\big{)}+2(1-\rho)\;\mathrm{d}_{\mathrm{TV}}\big{(}\nu,\nu^{\prime}\big{)}
2(dTV(μ,μ)+dTV(ν,ν)).\displaystyle\leq 2\Big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu,\mu^{\prime}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu,\nu^{\prime}\big{)}\Big{)}.

Proposition 2.

For all μ,μ𝒫(𝒳)\mu,\mu^{\prime}\in\mathcal{P}(\mathcal{X}), ν,ν𝒫(𝒴)\nu,\nu^{\prime}\in\mathcal{P}(\mathcal{Y}) and t{0,,T}t\in\{0,\ldots,T\}, the reward structure in (13) satisfies

|rtρ(μ,ν)rtρ(μ,ν)|(2max{r1,max,r2,max})(dTV(μ,μ)+dTV(ν,ν)),\left\lvert r^{\rho}_{t}(\mu,\nu)-r^{\rho}_{t}(\mu^{\prime},\nu^{\prime})\right\rvert\leq(2\max\{r_{1,\max},r_{2,\max}\})\Big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu,\mu^{\prime}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu,\nu^{\prime}\big{)}\Big{)},

where r1,max=maxx,t|r1,t(x)|r_{1,\max}=\max_{x,t}\left\lvert r_{1,t}(x)\right\rvert and r2,max=maxy,t|r2,t(y)|r_{2,\max}=\max_{y,t}\left\lvert r_{2,t}(y)\right\rvert.

Proof.

Note that

|rtρ(μ,ν)rtρ(μ,ν)|\displaystyle\left\lvert r^{\rho}_{t}(\mu,\nu)-r^{\rho}_{t}(\mu^{\prime},\nu^{\prime})\right\rvert =|xr1,t(x)(μ(x)μ(x))yr2,t(y)(ν(y)ν(y))|\displaystyle=\left\lvert\sum_{x}r_{1,t}(x)(\mu(x)-\mu^{\prime}(x))-\sum_{y}r_{2,t}(y)(\nu(y)-\nu^{\prime}(y))\right\rvert
|xr1,t(x)(μ(x)μ(x))|+|yr2,t(y)(ν(y)ν(y))|\displaystyle\leq\left\lvert\sum_{x}r_{1,t}(x)(\mu(x)-\mu^{\prime}(x))\right\rvert+\left\lvert\sum_{y}r_{2,t}(y)(\nu(y)-\nu^{\prime}(y))\right\rvert
x|r1,t(x)||μ(x)μ(x)|+y|r2,t(y)||ν(y)ν(y)|\displaystyle\leq\sum_{x}\left\lvert r_{1,t}(x)\right\rvert\left\lvert\mu(x)-\mu^{\prime}(x)\right\rvert+\sum_{y}\left\lvert r_{2,t}(y)\right\rvert\left\lvert\nu(y)-\nu^{\prime}(y)\right\rvert
r1,maxx|μ(x)μ(x)|+r2,maxy|ν(y)ν(y)|\displaystyle\leq r_{1,\max}\sum_{x}\left\lvert\mu(x)-\mu^{\prime}(x)\right\rvert+r_{2,\max}\sum_{y}\left\lvert\nu(y)-\nu^{\prime}(y)\right\rvert
max{r1,max,r2,max}(2dTV(μ,μ)+2dTV(ν,ν)).\displaystyle\leq\max\{r_{1,\max},r_{2,\max}\}\big{(}2\mathrm{d}_{\mathrm{TV}}\big{(}\mu,\mu^{\prime}\big{)}+2\mathrm{d}_{\mathrm{TV}}\big{(}\nu,\nu^{\prime}\big{)}\big{)}.
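Before moving on, here is a small numerical spot-check of Proposition 2 based on the reduced form r^{\rho}_{t}(\mu,\nu)=\rho\sum_{x}\mu(x)r_{1,t}(x)-(1-\rho)\sum_{y}\nu(y)r_{2,t}(y) derived above; the particular reward values and distributions are arbitrary illustrative choices.

```python
import numpy as np

rho = 0.6
r1 = np.array([1.0, -2.0])          # r_{1,t} over X = {x^1, x^2}
r2 = np.array([0.5, 3.0, -1.0])     # r_{2,t} over Y = {y^1, y^2, y^3}

def r_rho(mu, nu):
    return rho * mu @ r1 - (1 - rho) * nu @ r2

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

mu, mu2 = np.array([0.7, 0.3]), np.array([0.2, 0.8])
nu, nu2 = np.array([0.1, 0.6, 0.3]), np.array([0.5, 0.2, 0.3])

lhs = abs(r_rho(mu, nu) - r_rho(mu2, nu2))
rhs = 2 * max(np.abs(r1).max(), np.abs(r2).max()) * (tv(mu, mu2) + tv(nu, nu2))
print(lhs, "<=", rhs)                # the bound of Proposition 2 holds
```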

A.2 Discontinuous Reward Example

Through this example, we demonstrate the necessity of Assumption 2, i.e., that the reward function needs to be continuous with respect to the mean-fields as in (7). Consider an MFTG with terminal time T=1 over the following deterministic system.

Figure 9: A deterministic system with two Blue states (left) and one Red state (right).

The Blue agents’ transition kernel is given by

f0ρ(x1|x1,u1,μ,ν)=1,f0ρ(x2|x1,u1,μ,ν)=0,\displaystyle f^{\rho}_{0}(x^{1}|x^{1},u^{1},\mu,\nu)=1,\qquad f^{\rho}_{0}(x^{2}|x^{1},u^{1},\mu,\nu)=0,
\displaystyle f^{\rho}_{0}(x^{1}|x^{1},u^{2},\mu,\nu)=0,\qquad f^{\rho}_{0}(x^{2}|x^{1},u^{2},\mu,\nu)=1,
f0ρ(x1|x2,u3,μ,ν)=0,f0ρ(x2|x2,u3,μ,ν)=1,\displaystyle f^{\rho}_{0}(x^{1}|x^{2},u^{3},\mu,\nu)=0,\qquad f^{\rho}_{0}(x^{2}|x^{2},u^{3},\mu,\nu)=1,
f0ρ(x1|x2,u4,μ,ν)=1,f0ρ(x2|x2,u4,μ,ν)=0,\displaystyle f^{\rho}_{0}(x^{1}|x^{2},u^{4},\mu,\nu)=1,\qquad f^{\rho}_{0}(x^{2}|x^{2},u^{4},\mu,\nu)=0,

which holds for all distributions μ𝒫(𝒳)\mu\in\mathcal{P}(\mathcal{X}) and ν𝒫(𝒴)\nu\in\mathcal{P}(\mathcal{Y}).

The Red agent’s transition kernel is given by

g0ρ(y1|y1,v1,μ,ν)=1,μ𝒫(𝒳),ν𝒫(𝒴).\displaystyle g^{\rho}_{0}(y^{1}|y^{1},v^{1},\mu,\nu)=1,\quad\forall\mu\in\mathcal{P}(\mathcal{X}),~{}\nu\in\mathcal{P}(\mathcal{Y}).

Given μ𝒫(𝒳)\mu\in\mathcal{P}(\mathcal{X}) and ν𝒫(𝒴)\nu\in\mathcal{P}(\mathcal{Y}), the discontinuous reward is given by

r0ρ(μ,ν)\displaystyle r_{0}^{\rho}(\mu,\nu) =0,\displaystyle=0,
r1ρ(μ,ν)\displaystyle r_{1}^{\rho}(\mu,\nu) =𝟙(μ=(1/3,11/3)).\displaystyle=\mathds{1}\Big{(}\mu=(1/\sqrt{3},1-1/\sqrt{3})\Big{)}.

As the dynamics are decoupled and the reward does not depend on the Red distribution, the game is essentially a single-team problem. We choose this degenerate example for its simplicity. Since the reward at time t=0 is always zero, the objective of the Blue team is to achieve the desired distribution (1/\sqrt{3},1-1/\sqrt{3}) at time t=1 to receive one unit of reward. Since t=0 is the only time step at which the Blue agents select actions, a Blue coordination strategy consists only of a Blue coordination policy at time t=0. In this case, an optimal Blue coordination strategy can be

α0(μ,ν)={π01,if μ0(x1)<13π02,if μ0(x1)13,\alpha^{*}_{0}(\mu,\nu)=\left\{\begin{array}[]{cc}\pi_{0}^{1},&\text{if }\mu_{0}(x^{1})<\frac{1}{\sqrt{3}}\\ \pi_{0}^{2},&\text{if }\mu_{0}(x^{1})\geq\frac{1}{\sqrt{3}}\end{array}\right.,

where,

π01(u1|x1)=1,\displaystyle\pi_{0}^{1}(u^{1}|x^{1})=1,\qquad π01(u2|x1)=0,\displaystyle\pi_{0}^{1}(u^{2}|x^{1})=0,
π01(u3|x2)=11/3μ0(x2),\displaystyle\pi_{0}^{1}(u^{3}|x^{2})=\frac{1-1/\sqrt{3}}{\mu_{0}(x^{2})},\qquad π01(u4|x2)=1/3μ0(x1)μ0(x2),\displaystyle\pi_{0}^{1}(u^{4}|x^{2})=\frac{1/\sqrt{3}-\mu_{0}(x^{1})}{\mu_{0}(x^{2})},
π02(u1|x1)=1/3μ0(x1),\displaystyle\pi_{0}^{2}(u^{1}|x^{1})=\frac{1/\sqrt{3}}{\mu_{0}(x^{1})},\qquad π02(u2|x1)=11/3μ0(x2)μ0(x1),\displaystyle\pi_{0}^{2}(u^{2}|x^{1})=\frac{1-1/\sqrt{3}-\mu_{0}(x^{2})}{\mu_{0}(x^{1})},
π02(u3|x2)=1,\displaystyle\pi_{0}^{2}(u^{3}|x^{2})=1,\qquad π02(u4|x2)=0.\displaystyle\pi_{0}^{2}(u^{4}|x^{2})=0.

The proposed local policies are well-defined: for the case where \mu_{0}(x^{1})<1/\sqrt{3}, it is implied that \mu_{0}(x^{2})>0, and similarly, for the case where \mu_{0}(x^{1})\geq 1/\sqrt{3}, we automatically have \mu_{0}(x^{1})>0.

One can verify that the above coordination strategy achieves the mean-field \mu_{1}=(1/\sqrt{3},1-1/\sqrt{3}) from any \mu_{0}\in\mathcal{P}(\mathcal{X}) under the mean-field dynamics (19). Consequently, we have

Jcorρ(μ0,ν0)=1,μ0𝒫(𝒳),ν0𝒫(𝒴).J^{\rho*}_{\mathrm{cor}}(\mu_{0},\nu_{0})=1,\quad\forall\mu_{0}\in\mathcal{P}(\mathcal{X}),\nu_{0}\in\mathcal{P}(\mathcal{Y}).

However, we know that for any N_{1} every entry of the ED is rational, and thus \mathcal{M}^{N_{1}}\neq(1/\sqrt{3},1-1/\sqrt{3}) almost surely, which leads to

JN(𝐱0N1,𝐲0N2)=0,𝐱0N1𝒳N1,𝐲0N2𝒴N2,J^{N*}(\mathbf{x}_{0}^{N_{1}},\mathbf{y}^{N_{2}}_{0})=0,\quad\forall\mathbf{x}_{0}^{N_{1}}\in\mathcal{X}^{N_{1}},\mathbf{y}^{N_{2}}_{0}\in\mathcal{Y}^{N_{2}},

which implies that the performance of the coordination strategy achieved at the infinite-population limit fails to translate back to the finite-population game, and hence Theorem 4 no longer holds.

One can construct similar examples to illustrate the necessity of the transition kernels being continuous with respect to the mean-fields.

A.3 Counter-Example for Information State

For simplicity, we use a single-team example to illustrate why the ED cannot be an information state when different agents are allowed to apply different strategies, even under the weakly-coupled dynamics (4) and rewards (6). Consider the following system with state spaces 𝒳={x1,x2}\mathcal{X}=\{x^{1},x^{2}\} and 𝒴=\mathcal{Y}=\varnothing, and action spaces 𝒰={u1,u2}\mathcal{U}=\{u^{1},u^{2}\} and 𝒱=\mathcal{V}=\varnothing.

Figure 10: A counter-example showing ED cannot serve as an information state.

The transition kernel is time-invariant and is given by

\displaystyle f_{t}(x^{1}|x^{1},u^{1},\mu)=1.0,\qquad f_{t}(x^{2}|x^{1},u^{2},\mu)=1.0,
\displaystyle f_{t}(x^{2}|x^{2},u^{1},\mu)=1.0,\qquad f_{t}(x^{1}|x^{2},u^{2},\mu)=1.0,\qquad\forall\;\mu\in\mathcal{P}(\mathcal{X}),\;t\in\{0,1\}.

In words, each agent’s transition is deterministic and decoupled from the other agents. At each state, the agent can select action u1u^{1} to stay at its current state or use action u2u^{2} to move to the other state.

We consider a one-stage scenario with the reward functions

r0(μ)\displaystyle r_{0}(\mu) =0,μ𝒫(𝒳),\displaystyle=0,\qquad\forall\mu\in\mathcal{P}(\mathcal{X}),
r1(μ)\displaystyle r_{1}(\mu) =μ(x1).\displaystyle=\mu(x^{1}).

To illustrate that the ED cannot serve as an information state, consider a two-agent team and the following non-identical team strategy:

ϕ1,02(u2|x1,μ)=1,\displaystyle\phi^{2}_{1,0}(u^{2}|x^{1},\mu)=1,\qquad ϕ1,02(u1|x2,μ)=1,μ𝒫(𝒳),\displaystyle\phi^{2}_{1,0}(u^{1}|x^{2},\mu)=1,~{}~{}\forall\mu\in\mathcal{P}(\mathcal{X}),
ϕ2,02(u1|x1,μ)=1,\displaystyle\phi^{2}_{2,0}(u^{1}|x^{1},\mu)=1,\qquad ϕ2,02(u1|x2,μ)=1,μ𝒫(𝒳).\displaystyle\phi^{2}_{2,0}(u^{1}|x^{2},\mu)=1,~{}~{}\forall\mu\in\mathcal{P}(\mathcal{X}).

In words, agent 1 selects u2u^{2} on state x1x^{1} and selects u1u^{1} on state x2x^{2}, while agent 2 selects u1u^{1} on both states x1x^{1} and x2x^{2}.

We now consider two initial configurations with the same ED.

  • X^{2}_{1,0}=x^{1} and X^{2}_{2,0}=x^{2}, i.e., agent 1 is initially at x^{1} and agent 2 at x^{2}. The ED is thus \mathcal{M}^{2}_{0}=[0.5,0.5]. If the two agents follow the above non-identical team strategy, then agent 1 uses action u^{2} and transitions to x^{2}, while agent 2 chooses action u^{1} and stays at x^{2}, leading to \mathcal{M}^{2}_{1}=[0,1]. The resultant reward value is then 0.

  • X^{2}_{1,0}=x^{2} and X^{2}_{2,0}=x^{1}, i.e., agent 1 is initially at x^{2} and agent 2 at x^{1}. The ED is again \mathcal{M}^{2}_{0}=[0.5,0.5]. If the two agents follow the above non-identical team strategy, then both agents select action u^{1} and stay at their current states, leading to \mathcal{M}^{2}_{1}=[0.5,0.5] and a reward value of 0.5.

This example shows that, under the same team strategy, the values can differ for initial conditions that correspond to the same ED, since the ED does not retain the identities of the agents. Clearly, the ED alone is not enough to characterize the future evolution of the game, nor the value function. Consequently, the ED is not an information state, and the joint state information is needed in the finite-population game to properly construct the value functions when different agents apply different strategies.
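The two configurations above can be verified with a few lines of Python; the dynamics are deterministic, so no sampling is involved (the integer encoding of states and the function names are ours).

```python
# States: x^1 -> 0, x^2 -> 1.  Actions: u^1 ("stay") -> 0, u^2 ("switch") -> 1.
def step(state, action):
    return state if action == 0 else 1 - state

# Non-identical team strategy: agent 1 switches at x^1 and stays at x^2,
# while agent 2 always stays.
def agent1_action(state):
    return 1 if state == 0 else 0

def agent2_action(state):
    return 0

def terminal_value(s1, s2):
    s1 = step(s1, agent1_action(s1))
    s2 = step(s2, agent2_action(s2))
    return ((s1 == 0) + (s2 == 0)) / 2      # terminal reward r_1(mu) = mu(x^1)

print(terminal_value(0, 1))   # agent 1 at x^1, agent 2 at x^2  ->  0.0
print(terminal_value(1, 0))   # agent 1 at x^2, agent 2 at x^1  ->  0.5
```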

Appendix B Proof of the Mean-Field Approximation Results

B.1 Modified 2\ell^{2} Weak Law of Large Numbers

The following lemma is a modified version of the 2\ell^{2} weak law of large numbers [Chung,, 2001] adapted to accommodate the notion of EDs used in this work. This lemma will be used extensively in the later proofs.

Lemma 7.

Let X1,,XNX_{1},\ldots,X_{N} be NN independent random variables (vectors) taking values in a finite set 𝒳\mathcal{X}. Suppose XiX_{i} is distributed according to pi𝒫(𝒳)p_{i}\in\mathcal{P}(\mathcal{X}), such that (Xi=x)=pi(x)\mathbb{P}(X_{i}=x)=p_{i}(x). Define the ED as

N(x)=1Ni=1N𝟙x(Xi).\mathcal{M}^{N}(x)=\frac{1}{N}\sum_{i=1}^{N}\mathds{1}_{x}\big{(}{X_{i}}\big{)}.

Moreover, let \mu(x)=\frac{1}{N}\sum_{i=1}^{N}p_{i}(x). Then, the following inequality holds:

𝔼[dTV(N,μ)]12|𝒳|N.\mathbb{E}\left[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}^{N},\mu\big{)}\right]\leq\frac{1}{2}\sqrt{\frac{|\mathcal{X}|}{N}}. (69)
Proof.

Consider the random variables X_{x,i}=\mathds{1}_{x}\big{(}{X_{i}}\big{)}, which are independent across i with mean p_{i}(x). Since X_{x,i} only takes values in \{0,1\}, we have \mathbb{E}[X_{x,i}^{2}]=\mathbb{E}[X_{x,i}]=p_{i}(x), and \mathrm{Var}(X_{x,i})=p_{i}(x)-p_{i}(x)^{2}. It follows that

𝔼[Nμ22]\displaystyle\mathbb{E}\left[\lVert\mathcal{M}^{N}-\mu\rVert_{2}^{2}\right] =𝔼[x𝒳|N(x)μ(x)|2]=𝔼[x𝒳|1Ni=1N(Xx,ipi(x))|2]\displaystyle=\mathbb{E}\left[\sum_{x\in\mathcal{X}}\left\lvert\mathcal{M}^{N}(x)-\mu(x)\right\rvert^{2}\right]=\mathbb{E}\Big{[}\sum_{x\in\mathcal{X}}\Big{|}\frac{1}{N}\sum_{i=1}^{N}(X_{x,i}-p_{i}(x))\Big{|}^{2}\Big{]}
=1N2x𝒳Var(i=1NXx,i)=(i)1N2x𝒳i=1NVar(Xx,i)\displaystyle=\frac{1}{N^{2}}\sum_{x\in\mathcal{X}}\mathrm{Var}\Big{(}\sum_{i=1}^{N}X_{x,i}\Big{)}\stackrel{{\scriptstyle(i)}}{{=}}\frac{1}{N^{2}}\sum_{x\in\mathcal{X}}\sum_{i=1}^{N}\mathrm{Var}(X_{x,i})
=1N2x𝒳i=1N(pi(x)pi(x)2)1N2i=1N(x𝒳pi(x))\displaystyle=\frac{1}{N^{2}}\sum_{x\in\mathcal{X}}\sum_{i=1}^{N}(p_{i}(x)-p_{i}(x)^{2})\leq\frac{1}{N^{2}}\sum_{i=1}^{N}\Big{(}\sum_{x\in\mathcal{X}}p_{i}(x)\Big{)}
1N,\displaystyle\leq\frac{1}{N},

where (i)(i) leverages the fact that Xx,iX_{x,i} are independent across ii. By Jensen’s inequality, we have 𝔼[Nμ2]1N\mathbb{E}\left[\lVert\mathcal{M}^{N}-\mu\rVert_{2}\right]\leq\frac{1}{\sqrt{N}}. Furthermore, due to equivalence of norms in the finite dimensional space 𝒳\mathcal{X}, we have Nμ1|𝒳|Nμ2\left\lVert\mathcal{M}^{N}-\mu\right\rVert_{1}\leq\sqrt{|\mathcal{X}|}\left\lVert{\mathcal{M}^{N}-\mu}\right\rVert_{2} almost surely, where |𝒳||\mathcal{X}| is the cardinality of the set 𝒳\mathcal{X}. It then follows that

𝔼[dTV(N,μ)]=12𝔼[Nμ1]12|𝒳|N.\displaystyle\mathbb{E}[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}^{N},\mu\big{)}]=\frac{1}{2}\mathbb{E}\left[\left\lVert\mathcal{M}^{N}-\mu\right\rVert_{1}\right]\leq\frac{1}{2}\sqrt{\frac{|\mathcal{X}|}{N}}.

Corollary 1.

Let X_{1},\ldots,X_{N} be N independent and identically distributed random variables (vectors) taking values in a finite set \mathcal{X}. Suppose X_{i} is distributed according to p\in\mathcal{P}(\mathcal{X}), such that \mathbb{P}(X_{i}=x)=p(x). Define the ED as

N(x)=1Ni=1N𝟙x(Xi).\mathcal{M}^{N}(x)=\frac{1}{N}\sum_{i=1}^{N}\mathds{1}_{x}\big{(}{X_{i}}\big{)}.

Then,

\mathbb{E}\left[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}^{N},p\big{)}\right]\leq\frac{1}{2}\sqrt{\frac{|\mathcal{X}|}{N}}. (70)
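A quick Monte Carlo illustration of Lemma 7 and Corollary 1 is sketched below; note that \frac{1}{2}\sqrt{|\mathcal{X}|/N} is an upper bound on the expected total-variation error, not its exact value, and the distribution p is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_tv_error(p, N, trials=2000):
    """Monte Carlo estimate of E[d_TV(M^N, p)] for N i.i.d. samples from p."""
    counts = rng.multinomial(N, p, size=trials)        # ED counts, one row per trial
    return 0.5 * np.abs(counts / N - p).sum(axis=1).mean()

p = np.array([0.2, 0.3, 0.5])
for N in [10, 100, 1000]:
    bound = 0.5 * np.sqrt(len(p) / N)                  # right-hand side of (70)
    print(N, round(avg_tv_error(p, N), 4), "<=", round(bound, 4))
```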

B.2 Proof of Lemma 1

The following lemma is used to support the proof of Lemma 1.

Lemma 8.

For all a1,,an0a_{1},\ldots,a_{n}\in\mathbb{R}_{\geq 0}, the following inequality holds:

i=1nain(i=1nai2).\sum_{i=1}^{n}a_{i}\leq\sqrt{n}\left(\sqrt{\sum_{i=1}^{n}a_{i}^{2}}\right).
Proof.

Cauchy-Schwarz inequality directly implies

i=1nai=i=1n|ai|1i=1nai2i=1n12=ni=1nai2.\sum_{i=1}^{n}a_{i}=\sum_{i=1}^{n}|a_{i}|\cdot 1\leq\sqrt{\sum_{i=1}^{n}a_{i}^{2}}\sqrt{\sum_{i=1}^{n}1^{2}}=\sqrt{n}\sqrt{\sum_{i=1}^{n}a_{i}^{2}}.

Now, we restate Lemma 1 and present its proof.

Proof.

Due to symmetry, we only prove the result for the Blue team. Let N1,tx=N1tN1(x)N_{1,t}^{x}=N_{1}\mathcal{M}^{N_{1}}_{t}(x) be the number of Blue agents that are at state xx at time tt in the finite-population system. With a slight abuse of notation, we use [N1,tx][N_{1,t}^{x}] to denote the set of the Blue agents currently at state xx. We first focus on the sub-population of Blue agents that are at state xx. Since all Blue agents in [N1,tx][N_{1,t}^{x}] apply ϕt\phi_{t} and randomize independently, the next states for these N1,txN_{1,t}^{x} agents are independent and identically distributed conditioned on the joint states 𝐗tN1\mathbf{X}^{N_{1}}_{t} and 𝐘tN2\mathbf{Y}^{N_{2}}_{t}. For an agent ii in this sub-population [N1,tx][N_{1,t}^{x}] the conditional distribution of its next state Xi,t+1N1X^{N_{1}}_{i,t+1} is given by

\mathbb{P}\left[X_{i,t+1}^{N_{1}}=x^{\prime}\big{|}X_{i,t}^{N_{1}},\mathbf{X}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t}\right]=\sum_{u}f_{t}^{\rho}(x^{\prime}|x,u,\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t})\phi_{t}(u|x,\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t})\triangleq\mathcal{M}_{t+1}^{x}(x^{\prime}), (71)

where tN1=Empμ(𝐗tN1)\mathcal{M}^{N_{1}}_{t}=\mathrm{Emp}_{\mu}(\mathbf{X}^{N_{1}}_{t}) and 𝒩tN2=Empν(𝐘tN2)\mathcal{N}^{N_{2}}_{t}=\mathrm{Emp}_{\nu}(\mathbf{Y}^{N_{2}}_{t}). Define the ED of these N1,txN_{1,t}^{x} Blue agents at time t+1t+1 as

t+1N1,tx(x)=1N1,txi[N1,tx]𝟙x(Xi,t+1N1).\mathcal{M}_{t+1}^{N_{1,t}^{x}}(x^{\prime})=\frac{1}{N_{1,t}^{x}}\sum_{i\in[N_{1,t}^{x}]}\mathds{1}_{x^{\prime}}\big{(}{X_{i,t+1}^{N_{1}}}\big{)}.

Then, we have the following inequality due to Corollary 1:

𝔼ϕt[dTV(t+1N1,tx,t+1x)|𝐗tN1,𝐘tN2]12|𝒳|N1,tx.\mathbb{E}_{\phi_{t}}\left[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}^{N_{1,t}^{x}}_{t+1},\mathcal{M}^{x}_{t+1}\big{)}\Big{|}\mathbf{X}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t}\right]\leq\frac{1}{2}\sqrt{\frac{|\mathcal{X}|}{N_{1,t}^{x}}}. (72)

Note that we can compute the ED of the whole Blue team based on the ED of sub-populations of Blue agents on different states via

\displaystyle\mathcal{M}_{t+1}^{N_{1}}(x^{\prime}) =\frac{1}{N_{1}}\sum_{i=1}^{N_{1}}\mathds{1}_{x^{\prime}}\big{(}{X_{i,t+1}^{N_{1}}}\big{)}=\frac{1}{N_{1}}\sum_{x\in\mathcal{X}}\sum_{i\in[N_{1,t}^{x}]}\mathds{1}_{x^{\prime}}\big{(}{X_{i,t+1}^{N_{1}}}\big{)}=\sum_{x\in\mathcal{X}}\frac{N_{1,t}^{x}}{N_{1}}\mathcal{M}_{t+1}^{N_{1,t}^{x}}(x^{\prime}). (73)

Similarly, we can relate the propagated mean-field of the whole team with the ones of each sub-population via

t+1ρ(x)\displaystyle\mathcal{M}^{\rho}_{t+1}(x^{\prime}) =x𝒳[u𝒰ftρ(x|x,u,tN1,𝒩tN2)ϕt(u|x,tN1,𝒩tN2)]tN1(x)\displaystyle=\sum_{x\in\mathcal{X}}\left[\sum_{u\in\mathcal{U}}f^{\rho}_{t}(x^{\prime}|x,u,\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t})\;\phi_{t}(u|x,\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t})\right]\mathcal{M}^{N_{1}}_{t}(x) (74)
=x𝒳t+1x(x)tN1(x)=x𝒳N1,txN1t+1x(x).\displaystyle=\sum_{x\in\mathcal{X}}\mathcal{M}^{x}_{t+1}(x^{\prime})\mathcal{M}_{t}^{N_{1}}(x)=\sum_{x\in\mathcal{X}}\frac{N_{1,t}^{x}}{N_{1}}\mathcal{M}^{x}_{t+1}(x^{\prime}).

Combining (72), (73), and (74), we have

𝔼ϕt[dTV(t+1N1,t+1ρ)|𝐗tN1,𝐘tN2]\displaystyle\mathbb{E}_{\phi_{t}}\left[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}^{N_{1}}_{t+1},\mathcal{M}^{\rho}_{t+1}\big{)}\big{|}\mathbf{X}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t}\right] =𝔼ϕt[12x|t+1N1(x)t+1ρ(x)||𝐗tN1,𝐘tN2]\displaystyle=\mathbb{E}_{\phi_{t}}\left[\frac{1}{2}\sum_{x^{\prime}}\left\lvert\mathcal{M}_{t+1}^{N_{1}}(x^{\prime})-\mathcal{M}^{\rho}_{t+1}(x^{\prime})\right\rvert|\;\mathbf{X}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t}\right]
=𝔼ϕt[12x|xN1,txN1t+1N1,tx(x)xN1,txN1t+1x(x)||𝐗tN1,𝐘tN2]\displaystyle=\mathbb{E}_{\phi_{t}}\left[\frac{1}{2}\sum_{x^{\prime}}\left\lvert\sum_{x}\frac{N_{1,t}^{x}}{N_{1}}\mathcal{M}_{t+1}^{N^{x}_{1,t}}(x^{\prime})-\sum_{x}\frac{N_{1,t}^{x}}{N_{1}}\mathcal{M}^{x}_{t+1}(x^{\prime})\right\rvert\Big{|}\;\mathbf{X}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t}\right]
xN1,txN1𝔼ϕt[12x|t+1N1,tx(x)t+1x(x)||𝐗tN1,𝐘tN2]\displaystyle\leq\sum_{x}\frac{N_{1,t}^{x}}{N_{1}}\mathbb{E}_{\phi_{t}}\left[\frac{1}{2}\sum_{x^{\prime}}\left\lvert\mathcal{M}_{t+1}^{N^{x}_{1,t}}(x^{\prime})-\mathcal{M}^{x}_{t+1}(x^{\prime})\right\rvert\big{|}\;\mathbf{X}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t}\right]
=xN1,txN1𝔼ϕt[dTV(t+1N1,tx,t+1x)|𝐗tN1,𝐘tN2]\displaystyle=\sum_{x}\frac{N_{1,t}^{x}}{N_{1}}\mathbb{E}_{\phi_{t}}\left[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}_{t+1}^{N^{x}_{1,t}},\mathcal{M}^{x}_{t+1}\big{)}\big{|}\mathbf{X}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t}\right]
xN1,tx2N1|𝒳|N1,tx|𝒳|21N1,\displaystyle\leq\sum_{x}\frac{N_{1,t}^{x}}{2N_{1}}\sqrt{\frac{|\mathcal{X}|}{N_{1,t}^{x}}}\leq\frac{|\mathcal{X}|}{2}\sqrt{\frac{1}{N_{1}}},

where the last inequality is a result of Lemma 8 (let a_{x}=\sqrt{N_{1,t}^{x}} for each x\in\mathcal{X}; then a_{x}\geq 0 and \sum_{x}(a_{x})^{2}=N_{1} almost surely, which allows the application of Lemma 8).

B.3 Proof of Lemma 3

See Lemma 3.

Proof.

Note that if tN1(x)=0\mathcal{M}^{N_{1}}_{t}(x)=0, then the state xx has no contribution to the next mean-field propagated via (36). The second case in (35) ensures that the constructed policy is well-defined. Without loss of generality, we assume that tN1(x)>0\mathcal{M}^{N_{1}}_{t}(x)>0 for all x𝒳x\in\mathcal{X}.

Let N_{1,t}^{x}=N_{1}\mathcal{M}^{N_{1}}_{t}(x) be the number of Blue agents that are at state x at time t in the finite-population system. With a slight abuse of notation, we use [N_{1,t}^{x}] to denote the set of the Blue agents at state x. Then, for agent i\in[N_{1,t}^{x}] applying policy \phi_{i,t}, the conditional distribution of its next state X^{N_{1}}_{i,t+1} is given by

p_{i}^{x}(x^{\prime})=\mathbb{P}(X^{N_{1}}_{i,t+1}=x^{\prime}|\mathbf{X}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t})=\sum_{u\in\mathcal{U}}f^{\rho}_{t}(x^{\prime}|x,u,\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t})\phi_{i,t}(u|x,\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t}).

Define the ED of these N1,txN^{x}_{1,t} Blue agents at time t+1t+1 as

t+1N1,tx(x)=1N1,txi[N1,tx]𝟙x(Xi,t+1N1).\mathcal{M}^{N_{1,t}^{x}}_{t+1}(x^{\prime})=\frac{1}{N^{x}_{1,t}}\sum_{i\in[N^{x}_{1,t}]}\mathds{1}_{x^{\prime}}\big{(}{X_{i,t+1}^{N_{1}}}\big{)}. (75)

On the other hand, the mean-field propagated from state xx under local policy πapprx,t\pi_{\mathrm{apprx},t} can be expressed as

apprx,t+1x(x)\displaystyle\mathcal{M}^{x}_{\mathrm{apprx},t+1}(x^{\prime}) =uftρ(x|x,u,tN1,𝒩tN2)πapprx,t(u|x)\displaystyle=\sum_{u}f_{t}^{\rho}(x^{\prime}|x,u,\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t})\pi_{\mathrm{apprx},t}(u|x) (76)
=uftρ(x|x,u,tN1,𝒩tN2)i[N1,tx]ϕi,t(u|x,tN1,𝒩tN2)N1,tx\displaystyle=\sum_{u}f_{t}^{\rho}(x^{\prime}|x,u,\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t})\frac{\sum_{i\in[N^{x}_{1,t}]}\phi_{i,t}(u|x,\mathcal{M}^{N_{1}}_{t},\mathcal{N}^{N_{2}}_{t})}{N^{x}_{1,t}}
=i[N1,tx]pix(x)N1,tx.\displaystyle=\frac{\sum_{i\in[N^{x}_{1,t}]}p^{x}_{i}(x^{\prime})}{N^{x}_{1,t}}.

From Lemma 7, we have that

𝔼[dTV(t+1N1,tx,apprx,t+1x)|𝐗tN1,𝐘tN2]12|𝒳|N1,tx.\mathbb{E}\left[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}^{N_{1,t}^{x}}_{t+1},\mathcal{M}_{\mathrm{apprx},t+1}^{x}\big{)}\big{|}\mathbf{X}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t}\right]\leq\frac{1}{2}\sqrt{\frac{|\mathcal{X}|}{N_{1,t}^{x}}}. (77)

The rest of the analysis is similar to the proof of Lemma 1. Notice that

t+1N1(x)\displaystyle\mathcal{M}_{t+1}^{N_{1}}(x^{\prime}) =x𝒳N1,txN1t+1N1,tx(x),\displaystyle=\sum_{x\in\mathcal{X}}\frac{N_{1,t}^{x}}{N_{1}}\mathcal{M}_{t+1}^{N_{1,t}^{x}}(x^{\prime}),
apprx,t+1(x)\displaystyle\mathcal{M}_{\mathrm{apprx},t+1}(x^{\prime}) =x𝒳N1,txN1apprx,t+1x(x).\displaystyle=\sum_{x\in\mathcal{X}}\frac{N_{1,t}^{x}}{N_{1}}\mathcal{M}_{\mathrm{apprx},t+1}^{x}(x^{\prime}).

It then follows

𝔼[dTV(t+1N1,apprx,t+1)|𝐗tN1,𝐘tN2]\displaystyle\mathbb{E}\left[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}_{t+1}^{N_{1}},\mathcal{M}_{\mathrm{apprx},t+1}\big{)}\big{|}\mathbf{X}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t}\right] xN1,txN1𝔼[dTV(t+1N1,tx,apprx,t+1x)|𝐗tN1,𝐘tN2]\displaystyle\leq\sum_{x}\frac{N_{1,t}^{x}}{N_{1}}\mathbb{E}\left[\mathrm{d}_{\mathrm{TV}}\big{(}\mathcal{M}_{t+1}^{N_{1,t}^{x}},\mathcal{M}_{\mathrm{apprx},t+1}^{x}\big{)}\big{|}\mathbf{X}^{N_{1}}_{t},\mathbf{Y}^{N_{2}}_{t}\right]
|𝒳|21N1,\displaystyle\leq\frac{|\mathcal{X}|}{2}\sqrt{\frac{1}{N_{1}}},

where the last inequality results from Lemma 8.

Appendix C Proof of the Continuity Results

In this section, we present the proofs for Lemma 4 and Theorem 3. We start with several preliminary results.

C.1 Preliminary Results

Lemma 9.

Let \xi:\mathcal{A}\times\mathcal{B}\to\mathbb{R} be a Lipschitz continuous function in its second argument, such that \left\lvert\xi(a,b)-\xi(a,b^{\prime})\right\rvert\leq L_{\xi}\left\lVert b-b^{\prime}\right\rVert for all a\in\mathcal{A} and b,b^{\prime}\in\mathcal{B}. Let Q,Q^{\prime}\subseteq\mathcal{B} be compact sets such that \mathrm{dist}_{\mathrm{H}}(Q,Q^{\prime})\leq\epsilon. Then, for all a\in\mathcal{A},

|minbQξ(a,b)minbQξ(a,b)|\displaystyle\left\lvert\min_{b\in Q}\xi(a,b)-\min_{b^{\prime}\in Q^{\prime}}\xi(a,b^{\prime})\right\rvert Lξϵ.\displaystyle\leq L_{\xi}\epsilon. (78a)
|maxbQξ(a,b)maxbQξ(a,b)|\displaystyle\left\lvert\max_{b\in Q}\xi(a,b)-\max_{b^{\prime}\in Q^{\prime}}\xi(a,b^{\prime})\right\rvert Lξϵ.\displaystyle\leq L_{\xi}\epsilon. (78b)
Proof.

We start with the proof for (78a). Fix a𝒜a\in\mathcal{A} and consider the minimizer bargminbQξ(a,b)b^{*}\in\operatorname*{argmin}_{b\in Q}\xi(a,b). From the assumption on the distance between QQ and QQ^{\prime}, there exists a bQb^{*\prime}\in Q^{\prime} such that bbϵ\left\lVert b^{*\prime}-b^{*}\right\rVert\leq\epsilon. Since ξ\xi is Lipschitz with respect to bb, it follows that

|ξ(a,b)ξ(a,b)|Lξϵ.\left\lvert\xi(a,b^{*})-\xi(a,b^{*\prime})\right\rvert\leq L_{\xi}\epsilon. (79)

Then,

minbQξ(a,b)=ξ(a,b)ξ(a,b)LξϵminbQξ(a,b)Lξϵ.\displaystyle\min_{b\in Q}\xi(a,b)=\xi(a,b^{*})\geq\xi(a,b^{*\prime})-L_{\xi}\epsilon\geq\min_{b^{\prime}\in Q^{\prime}}\xi(a,b^{\prime})-L_{\xi}\epsilon.

Similarly, one can show that minbQξ(a,b)minbQξ(a,b)Lξϵ\min_{b^{\prime}\in Q^{\prime}}\xi(a,b^{\prime})\geq\min_{b\in Q}\xi(a,b)-L_{\xi}\epsilon, and thus we have (78a).

The result (78a) can be easily extended to the maximization case in (78b) by making use of the fact that maxbQξ(a,b)=minbQξ(a,b)\max_{b\in Q}\xi(a,b)=-\min_{b\in Q}-\xi(a,b). ∎

Lemma 10.

Let ζ:𝒜×\zeta:\mathcal{A}\times\mathcal{B}\to\mathbb{R} be a Lipschitz continuous function in its first argument, such that |ζ(a,b)ζ(a,b)|Lζaa\left\lvert\zeta(a,b)-\zeta(a^{\prime},b)\right\rvert\leq L_{\zeta}\left\lVert a-a^{\prime}\right\rVert for all a,a𝒜a,a^{\prime}\in\mathcal{A} and bb\in\mathcal{B}. Then, for all compact sets QQ\subseteq\mathcal{B},

|minbQζ(a,b)minbQζ(a,b)|Lζaa.\left\lvert\min_{b\in Q}\zeta(a,b)-\min_{b^{\prime}\in Q}\zeta(a^{\prime},b^{\prime})\right\rvert\leq L_{\zeta}\left\lVert a-a^{\prime}\right\rVert.
Proof.

Since QQ is compact and ζ\zeta is continuous, the two minimization problems are well-defined. Next, let bargminbQζ(a,b)b^{*}\in\operatorname*{argmin}_{b\in Q}\zeta(a,b), it follows that

minbQζ(a,b)minbQζ(a,b)=minbQζ(a,b)ζ(a,b)ζ(a,b)ζ(a,b)(i)Lζaa,\min_{b^{\prime}\in Q}\zeta(a^{\prime},b^{\prime})-\min_{b\in Q}\zeta(a,b)=\min_{b^{\prime}\in Q}\zeta(a^{\prime},b^{\prime})-\zeta(a,b^{*})\leq\zeta(a^{\prime},b^{*})-\zeta(a,b^{*})\stackrel{{\scriptstyle\mathrm{(i)}}}{{\leq}}L_{\zeta}\left\lVert a-a^{\prime}\right\rVert,

where (i) is due to the Lipschitz assumption. Similarly, one can show the other direction minbQζ(a,b)minbQζ(a,b)Lζaa\min_{b\in Q}\zeta(a,b)-\min_{b^{\prime}\in Q}\zeta(a^{\prime},b^{\prime})\leq L_{\zeta}\left\lVert a-a^{\prime}\right\rVert. ∎

Lemma 11.

Consider the compact-valued correspondences Γ:𝒳×𝒴𝒳\Gamma:\mathcal{X}\times\mathcal{Y}\rightsquigarrow\mathcal{X} and Θ:𝒳×𝒴𝒴\Theta:\mathcal{X}\times\mathcal{Y}\rightsquigarrow\mathcal{Y}, which are Lipschitz continuous with Lipschitz constants LΓL_{\Gamma} and LΘL_{\Theta}, respectively. Given a LgL_{g}-Lipschitz continuous real-valued function g:𝒳×𝒴g:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}, the max-min marginal function

f(x,y)=maxpΓ(x,y)minqΘ(x,y)g(p,q)f(x,y)=\max_{p\in\Gamma(x,y)}\min_{q\in\Theta(x,y)}g(p,q) (80)

is Lipschitz continuous with Lipschitz constant Lg(LΓ+LΘ)L_{g}(L_{\Gamma}+L_{\Theta}).

Proof.

Let h(x,y,p)=minqΘ(x,y)g(p,q)h(x,y,p)=\min_{q\in\Theta(x,y)}g(p,q). Since g(p,q)g(p,q) is continuous and Θ(x,y)\Theta(x,y) is compact, the minimization is well-defined for each pp. Fix p𝒳p\in\mathcal{X}, and consider x,x𝒳x,x^{\prime}\in\mathcal{X} and y,y𝒴y,y^{\prime}\in\mathcal{Y}. The Lipschitz continuity of Θ\Theta implies that

distH(Θ(x,y),Θ(x,y))LΘ(xx+yy).\mathrm{dist}_{\mathrm{H}}(\Theta(x,y),\Theta(x^{\prime},y^{\prime}))\leq L_{\Theta}(\left\lVert x-x^{\prime}\right\rVert+\left\lVert y-y^{\prime}\right\rVert). (81)

For simplicity, let \epsilon=(\left\lVert x-x^{\prime}\right\rVert+\left\lVert y-y^{\prime}\right\rVert). Leveraging Lemma 9 (set g as the function \xi; treat the argument p as a and the optimization variable q as b; regard the sets \Theta(x,y) and \Theta(x^{\prime},y^{\prime}) as Q and Q^{\prime}, respectively), we have

|h(x,y,p)h(x,y,p)|LgLΘϵ.\left\lvert h(x,y,p)-h(x^{\prime},y^{\prime},p)\right\rvert\leq L_{g}L_{\Theta}\epsilon. (82)

Next, fix (x,y)\in\mathcal{X}\times\mathcal{Y} and consider p,p^{\prime}\in\mathcal{X}. It follows from Lemma 10 (set g as the function \zeta; treat the arguments p and q as a and b, respectively; regard the set \Theta(x,y) as Q) that

|h(x,y,p)h(x,y,p)|=|minqΘ(x,y)g(p,q)minqΘ(x,y)g(p,q)|Lgpp.\displaystyle\left\lvert h(x,y,p)-h(x,y,p^{\prime})\right\rvert=\left\lvert\min_{q\in\Theta(x,y)}g(p,q)-\min_{q^{\prime}\in\Theta(x,y)}g(p^{\prime},q^{\prime})\right\rvert\leq L_{g}\left\lVert p-p^{\prime}\right\rVert. (83)

Finally, consider x,x𝒳x,x^{\prime}\in\mathcal{X} and y,y𝒴y,y^{\prime}\in\mathcal{Y}, it follows that

|f(x,y)\displaystyle\big{|}f(x,y) f(x,y)|=|maxpΓ(x,y)h(x,y,p)maxpΓ(x,y)h(x,y,p)|\displaystyle-f(x^{\prime},y^{\prime})\big{|}=\left\lvert\max_{p\in\Gamma(x,y)}h(x,y,p)-\max_{p^{\prime}\in\Gamma(x^{\prime},y^{\prime})}h(x^{\prime},y^{\prime},p^{\prime})\right\rvert (84)
|maxpΓ(x,y)h(x,y,p)maxpΓ(x,y)h(x,y,p)|+|maxpΓ(x,y)h(x,y,p)maxpΓ(x,y)h(x,y,p)|\displaystyle\leq\left\lvert\max_{p\in\Gamma(x,y)}h(x,y,p)-\max_{p\in\Gamma(x,y)}h(x^{\prime},y^{\prime},p)\right\rvert+\left\lvert\max_{p\in\Gamma(x,y)}h(x^{\prime},y^{\prime},p)-\max_{p^{\prime}\in\Gamma(x^{\prime},y^{\prime})}h(x^{\prime},y^{\prime},p^{\prime})\right\rvert (85)
LgLΘϵ+LΓLgϵ=Lg(LΘ+LΓ)ϵ,\displaystyle\leq L_{g}L_{\Theta}\epsilon+L_{\Gamma}L_{g}\epsilon=L_{g}(L_{\Theta}+L_{\Gamma})\epsilon, (86)

where the first term in (85) is bounded using Lemma 10 and the Lipschitz constant derived in (82) (set h as the function \zeta; treat the arguments (x,y) as a and the argument p as b; regard the set \Gamma(x,y) as the optimization domain Q), and the second term in (85) is bounded using Lemma 9 and the Lipschitz constant in (83) (set h as the function \xi; treat the argument (x^{\prime},y^{\prime}) as a and the optimization variable p as b; regard the sets \Gamma(x,y) and \Gamma(x^{\prime},y^{\prime}) as Q and Q^{\prime}, respectively). ∎

Next, we present some results regarding the continuity of the reachability correspondences. We start by defining the pure local policies.

Definition 10 (Pure local policy).

A Blue local policy πtΠt\pi_{t}\in\Pi_{t} is said to be pure if πt(u|x){0,1}\pi_{t}(u|x)\in\{0,1\} for all u𝒰u\in\mathcal{U} and x𝒳x\in\mathcal{X}. We use Π^t={π^k}k=1|𝒰||𝒳|\hat{\Pi}_{t}=\{\hat{\pi}^{k}\}_{k=1}^{|\mathcal{U}|^{|\mathcal{X}|}} to denote the set of pure Blue policies, where π^k\hat{\pi}^{k} denotes the kk-th pure Blue local policy. The pure Red local policies are defined similarly.

Proposition 3.

The Blue reachable set is characterized as

\mathcal{R}^{\rho}_{\mu,t}(\mu_{t},\nu_{t})=\mathrm{Co}(\{\mu_{t}F_{t}^{\rho}(\mu_{t},\nu_{t},\hat{\pi}^{k})\}_{k=1}^{|\mathcal{U}|^{|\mathcal{X}|}}), (87)

where π^k\hat{\pi}^{k} are pure Blue local policies, and Co\mathrm{Co} denotes the convex hull.

Proof.

We first show that the mean-field propagation rule is linear with respect to the policy. Note that the set of admissible policies Πt\Pi_{t} can be characterized as the convex hull of the pure policies, i.e., Πt=Co({π^k}k)\Pi_{t}=\mathrm{Co}(\{\hat{\pi}^{k}\}_{k}). Consider arbitrary current mean-fields μt\mu_{t} and νt\nu_{t}, and consider the policy πt=λπt1+(1λ)πt2\pi_{t}=\lambda\pi_{t}^{1}+(1-\lambda)\pi_{t}^{2} for some λ(0,1)\lambda\in(0,1) and πt1,πt2Πt\pi_{t}^{1},\pi_{t}^{2}\in\Pi_{t}. Then, the next mean-field induced by πt\pi_{t} is given by

\displaystyle\mu_{t+1}(x^{\prime}) =\sum_{x}\left[\sum_{u}f^{\rho}_{t}(x^{\prime}|x,u,\mu_{t},\nu_{t})\pi_{t}(u|x)\right]\mu_{t}(x)
\displaystyle=\lambda\sum_{x}\left[\sum_{u}f^{\rho}_{t}(x^{\prime}|x,u,\mu_{t},\nu_{t})\pi^{1}_{t}(u|x)\right]\mu_{t}(x)+(1-\lambda)\sum_{x}\left[\sum_{u}f^{\rho}_{t}(x^{\prime}|x,u,\mu_{t},\nu_{t})\pi^{2}_{t}(u|x)\right]\mu_{t}(x),

which implies that

μt+1=μtFtρ(μt,νt,πt)=λμtFtρ(μt,νt,πt1)+(1λ)μtFtρ(μt,νt,πt2).\mu_{t+1}=\mu_{t}F^{\rho}_{t}(\mu_{t},\nu_{t},\pi_{t})=\lambda\mu_{t}F^{\rho}_{t}(\mu_{t},\nu_{t},\pi^{1}_{t})+(1-\lambda)\mu_{t}F^{\rho}_{t}(\mu_{t},\nu_{t},\pi^{2}_{t}). (88)

For the \subseteq direction in (87), consider μt+1μ,tρ(μt,νt)\mu_{t+1}\in\mathcal{R}^{\rho}_{\mu,t}(\mu_{t},\nu_{t}). Then there exists a policy πt\pi_{t} such that μt+1=μtFtρ(μt,νt,πt)\mu_{t+1}=\mu_{t}F^{\rho}_{t}(\mu_{t},\nu_{t},\pi_{t}). Since the admissible policy set is the convex hull of pure policies, we have πt=kλkπ^k\pi_{t}=\sum_{k}\lambda_{k}\hat{\pi}^{k}, where λk0\lambda_{k}\geq 0 and kλk=1\sum_{k}\lambda_{k}=1. It follows directly from (88) that

\mu_{t+1}=\sum_{k}\lambda_{k}\mu_{t}F^{\rho}_{t}(\mu_{t},\nu_{t},\hat{\pi}^{k})\in\mathrm{Co}(\{\mu_{t}F_{t}^{\rho}(\mu_{t},\nu_{t},\hat{\pi}^{k})\}_{k=1}^{|\mathcal{U}|^{|\mathcal{X}|}}). (89)

For the \supseteq direction in (87), consider a point μt+1Co({μtFtρ(μt,νt,π^k)}k=1|𝒰||𝒳|)\mu_{t+1}\in\mathrm{Co}(\{\mu_{t}F_{t}^{\rho}(\mu_{t},\nu_{t},\hat{\pi}^{k})\}_{k=1}^{|\mathcal{U}|^{|\mathcal{X}|}}). By definition and the linearity in (88), we have

μt+1=kλkμtFtρ(μt,νt,π^k)=μtFtρ(μt,νt,πt),\mu_{t+1}=\sum_{k}\lambda_{k}\mu_{t}F_{t}^{\rho}(\mu_{t},\nu_{t},\hat{\pi}^{k})=\mu_{t}F_{t}^{\rho}(\mu_{t},\nu_{t},\pi_{t}),

where \pi_{t}=\sum_{k}{\lambda_{k}\hat{\pi}^{k}}\in\Pi_{t}, which implies \mu_{t+1}\in\mathcal{R}_{\mu,t}^{\rho}(\mu_{t},\nu_{t}). ∎
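Proposition 3 also suggests a direct way to compute the Blue reachable set numerically: enumerate the |\mathcal{U}|^{|\mathcal{X}|} pure local policies and take the convex hull of the resulting next mean-fields. A minimal sketch is given below, where the reduced kernel is assumed to be available as a function f_rho(xp, x, u, mu, nu) returning a single transition probability.

```python
import itertools
import numpy as np

def reachable_generators(f_rho, mu, nu, n_states, n_actions):
    """Images of the current Blue mean-field under all pure local policies.
    By Proposition 3, the convex hull of the returned points is the Blue
    reachable set R^rho_{mu,t}(mu, nu)."""
    points = []
    for assignment in itertools.product(range(n_actions), repeat=n_states):
        # assignment[x] is the action applied (deterministically) at state x
        mu_next = np.zeros(n_states)
        for x in range(n_states):
            for xp in range(n_states):
                mu_next[xp] += mu[x] * f_rho(xp, x, assignment[x], mu, nu)
        points.append(mu_next)
    return np.array(points)   # |U|^|X| points; their convex hull is the reachable set
```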

Lemma 12.

Consider a pure local policy π^tΠ^t\hat{\pi}_{t}\in\hat{\Pi}_{t} and arbitrary mean-fields μt,μt𝒫(𝒳)\mu_{t},\mu^{\prime}_{t}\in\mathcal{P}(\mathcal{X}) and νt,νt𝒫(𝒴)\nu_{t},\nu^{\prime}_{t}\in\mathcal{P}(\mathcal{Y}). The following bound holds:

dTV(μtFtρ(μt,νt,π^t),μtFtρ(μt,νt,π^t))(1+12Lft)(dTV(μt,μt)+dTV(νt,νt)).\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t}F^{\rho}_{t}(\mu_{t},\nu_{t},\hat{\pi}_{t}),\mu_{t}^{\prime}F^{\rho}_{t}(\mu^{\prime}_{t},\nu^{\prime}_{t},\hat{\pi}_{t})\big{)}\leq\big{(}1+\frac{1}{2}L_{f_{t}}\big{)}\big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t},\mu^{\prime}_{t}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu_{t},\nu^{\prime}_{t}\big{)}\big{)}.
Proof.

Using the triangle inequality, we have that

dTV(μtFtρ(μt,νt,π^t),μtFtρ(μt,νt,π^t))\displaystyle\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t}F^{\rho}_{t}(\mu_{t},\nu_{t},\hat{\pi}_{t}),\mu_{t}^{\prime}F^{\rho}_{t}(\mu^{\prime}_{t},\nu^{\prime}_{t},\hat{\pi}_{t})\big{)} dTV(μtFtρ(μt,νt,π^t),μtFtρ(μt,νt,π^t))A\displaystyle\leq\underbrace{\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t}F^{\rho}_{t}(\mu_{t},\nu_{t},\hat{\pi}_{t}),\mu_{t}^{\prime}F^{\rho}_{t}(\mu_{t},\nu_{t},\hat{\pi}_{t})\big{)}}_{A}
+dTV(μtFtρ(μt,νt,π^t),μtFtρ(μt,νt,π^t))B.\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad+\underbrace{\mathrm{d}_{\mathrm{TV}}\big{(}\mu^{\prime}_{t}F^{\rho}_{t}(\mu_{t},\nu_{t},\hat{\pi}_{t}),\mu_{t}^{\prime}F^{\rho}_{t}(\mu^{\prime}_{t},\nu^{\prime}_{t},\hat{\pi}_{t})\big{)}}_{B}.

Since FtρF^{\rho}_{t} is a stochastic matrix, it is a non-expansive operator under the 1-norm, and hence AdTV(μt,μt)A\leq\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t},\mu^{\prime}_{t}\big{)}. Next, we bound the term BB using Assumption 1.

B\displaystyle B =12x|xμt(x)(uπ^t(u|x)(ftρ(x|x,μt,νt,u)ftρ(x|x,μt,νt,u)))|\displaystyle=\frac{1}{2}\sum_{x^{\prime}}\left\lvert\sum_{x}\mu^{\prime}_{t}(x)\Big{(}\sum_{u}\hat{\pi}_{t}(u|x)\big{(}f_{t}^{\rho}(x^{\prime}|x,\mu_{t},\nu_{t},u)-f_{t}^{\rho}(x^{\prime}|x,\mu^{\prime}_{t},\nu^{\prime}_{t},u)\big{)}\Big{)}\right\rvert
12x,x,uμt(x)π^t(u|x)|ftρ(x|x,μt,νt,u)ftρ(x|x,μt,νt,u)|\displaystyle\leq\frac{1}{2}\sum_{x^{\prime},x,u}\mu_{t}^{\prime}(x)\hat{\pi}_{t}(u|x)\big{|}f_{t}^{\rho}(x^{\prime}|x,\mu_{t},\nu_{t},u)-f_{t}^{\rho}(x^{\prime}|x,\mu^{\prime}_{t},\nu^{\prime}_{t},u)\big{|}
=12x,uμt(x)π^t(u|x)(x|ftρ(x|x,μt,νt,u)ftρ(x|x,μt,νt,u)|)\displaystyle=\frac{1}{2}\sum_{x,u}\mu_{t}^{\prime}(x)\hat{\pi}_{t}(u|x)\Big{(}\sum_{x^{\prime}}\big{|}f_{t}^{\rho}(x^{\prime}|x,\mu_{t},\nu_{t},u)-f_{t}^{\rho}(x^{\prime}|x,\mu^{\prime}_{t},\nu^{\prime}_{t},u)\big{|}\Big{)}
12x,uμt(x)π^t(u|x)Lft(dTV(μt,μt)+dTV(νt,νt))=12Lft(dTV(μt,μt)+dTV(νt,νt))\displaystyle\leq\frac{1}{2}\sum_{x,u}\mu_{t}^{\prime}(x)\hat{\pi}_{t}(u|x)L_{f_{t}}\Big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t},\mu^{\prime}_{t}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu_{t},\nu_{t}^{\prime}\big{)}\Big{)}=\frac{1}{2}L_{f_{t}}\Big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t},\mu^{\prime}_{t}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu_{t},\nu_{t}^{\prime}\big{)}\Big{)}

Combining the bounds, we have

dTV(μtFtρ(μt,νt,π^t),μtFtρ(μt,νt,π^t))(1+12Lft)dTV(μt,μt)+12LftdTV(νt,νt).\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t}F^{\rho}_{t}(\mu_{t},\nu_{t},\hat{\pi}_{t}),\mu_{t}^{\prime}F^{\rho}_{t}(\mu^{\prime}_{t},\nu^{\prime}_{t},\hat{\pi}_{t})\big{)}\leq\big{(}1+\frac{1}{2}L_{f_{t}}\big{)}\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t},\mu^{\prime}_{t}\big{)}+\frac{1}{2}L_{f_{t}}\mathrm{d}_{\mathrm{TV}}\big{(}\nu_{t},\nu^{\prime}_{t}\big{)}.

Lemma 13.

Consider two sets of points \{x_{i}\}_{i=1}^{N} and \{y_{i}\}_{i=1}^{N} that satisfy \left\lVert x_{i}-y_{i}\right\rVert\leq\epsilon for all i. Then,

distH(Co({xi}i=1N),Co({yi}i=1N))ϵ.\mathrm{dist}_{\mathrm{H}}(\mathrm{Co}(\{x_{i}\}_{i=1}^{N}),\mathrm{Co}(\{y_{i}\}_{i=1}^{N}))\leq\epsilon.
Proof.

Consider a point xCo({xi}i=1N)x\in\mathrm{Co}(\{x_{i}\}_{i=1}^{N}). By definition, there exists a set of non-negative coefficients {λi}i=1N\{\lambda_{i}\}_{i=1}^{N}, such that x=iλixix=\sum_{i}\lambda_{i}x_{i} and iλi=1\sum_{i}\lambda_{i}=1. Using the same set of coefficients, define y=iλiyiCo({yi}i=1N)y=\sum_{i}\lambda_{i}y_{i}\in\mathrm{Co}(\{y_{i}\}_{i=1}^{N}). Then,

xyiλixiyiϵ.\left\lVert x-y\right\rVert\leq\sum_{i}\lambda_{i}\left\lVert x_{i}-y_{i}\right\rVert\leq\epsilon.

Thus, for a fixed xCo({xi}i=1N)x\in\mathrm{Co}(\{x_{i}\}_{i=1}^{N}), we have infyCo({yi}i=1N)xyϵ.\inf_{y\in\mathrm{Co}(\{y_{i}\}_{i=1}^{N})}\left\lVert x-y\right\rVert\leq\epsilon. Since xCo({xi}i=1N)x\in\mathrm{Co}(\{x_{i}\}_{i=1}^{N}) is arbitrary, the above inequality implies

supxCo({xi}i=1N)infyCo({yi}i=1N)xyϵ.\sup_{x\in\mathrm{Co}(\{x_{i}\}_{i=1}^{N})}\inf_{y\in\mathrm{Co}(\{y_{i}\}_{i=1}^{N})}\left\lVert x-y\right\rVert\leq\epsilon.

Through a similar argument, we can conclude that supyCo({yi}i=1N)infxCo({xi}i=1N)xyϵ,\sup_{y\in\mathrm{Co}(\{y_{i}\}_{i=1}^{N})}\inf_{x\in\mathrm{Co}(\{x_{i}\}_{i=1}^{N})}\left\lVert x-y\right\rVert\leq\epsilon, which completes the proof. ∎

C.2 Proof of Lemma 4

See Lemma 4.

Proof.

Due to symmetry, we only present the proof for the Blue reachability correspondence. From Lemma 12, we have that, for all pure local policies π^tkΠ^t\hat{\pi}^{k}_{t}\in\hat{\Pi}_{t},

\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t}F^{\rho}_{t}(\mu_{t},\nu_{t},\hat{\pi}^{k}_{t}),\mu_{t}^{\prime}F^{\rho}_{t}(\mu^{\prime}_{t},\nu^{\prime}_{t},\hat{\pi}^{k}_{t})\big{)}\leq\big{(}1+\frac{1}{2}L_{f_{t}}\big{)}\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t},\mu^{\prime}_{t}\big{)}+\frac{1}{2}L_{f_{t}}\mathrm{d}_{\mathrm{TV}}\big{(}\nu_{t},\nu^{\prime}_{t}\big{)}.

From Proposition 3, we know that the reachable set can be characterized as the convex hull

\mathcal{R}^{\rho}_{\mu,t}(\mu_{t},\nu_{t})=\mathrm{Co}(\{\mu_{t}F_{t}^{\rho}(\mu_{t},\nu_{t},\hat{\pi}^{k})\}_{k=1}^{|\mathcal{U}|^{|\mathcal{X}|}}).

Leveraging Lemma 13, we can conclude that

\mathrm{dist}_{\mathrm{H}}\big{(}\mathcal{R}^{\rho}_{\mu,t}(\mu_{t},\nu_{t}),\mathcal{R}^{\rho}_{\mu,t}(\mu^{\prime}_{t},\nu^{\prime}_{t})\big{)}\leq\big{(}1+\frac{1}{2}L_{f_{t}}\big{)}\big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu_{t},\mu^{\prime}_{t}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu_{t},\nu^{\prime}_{t}\big{)}\big{)}.

C.3 Proof of Theorem 3

See Theorem 3.

Proof.

The proof is given by induction.

Base case: At the terminal time TT, the optimal coordinator value function is given by

J¯cor,Tρ(μTρ,νTρ)=rTρ(μTρ,νTρ).\underline{J}_{\mathrm{cor},T}^{\rho*}(\mu^{\rho}_{T},\nu^{\rho}_{T})=r^{\rho}_{T}(\mu^{\rho}_{T},\nu^{\rho}_{T}).

From Assumption 2, the Lipschitz constant for J¯cor,Tρ\underline{J}_{\mathrm{cor},T}^{\rho*} is LrL_{r}.

Inductive hypothesis: Suppose that the Lipschitz constant is given by (43) at time t+1t+1.

Induction: Recall the definition of the optimal coordinator lower value at time tt:

J¯cor,tρ(μtρ,νtρ)=rt(μtρ,νtρ)+maxμt+1ρμ,tρ(μtρ,νtρ)minνt+1ρν,tρ(μtρ,νtρ)J¯cor,t+1ρ(μt+1ρ,νt+1ρ),\underline{J}_{\mathrm{cor},t}^{\rho*}(\mu^{\rho}_{t},\nu^{\rho}_{t})=r_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t})+\max_{\mu_{t+1}^{\rho}\in\mathcal{R}_{\mu,t}^{\rho}(\mu_{t}^{\rho},\nu_{t}^{\rho})}~{}\min_{\nu_{t+1}^{\rho}\in\mathcal{R}_{\nu,t}^{\rho}(\mu_{t}^{\rho},\nu_{t}^{\rho})}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu^{\rho}_{t+1},\nu^{\rho}_{t+1}),

where the second term takes the form of a max-min marginal function defined in (80), where Γ=μ,tρ\Gamma=\mathcal{R}^{\rho}_{\mu,t}, Θ=ν,tρ\Theta=\mathcal{R}^{\rho}_{\nu,t} and g=J¯cor,t+1ρg=\underline{J}_{\mathrm{cor},t+1}^{\rho*}, while the first term is Lipschitz continuous with Lipschitz constant LrL_{r}. Applying Lemma 11, it follows that, for all μtρ,μtρ𝒫(𝒳)\mu^{\rho}_{t},\mu^{\rho\prime}_{t}\in\mathcal{P}(\mathcal{X}) and νtρ,νtρ𝒫(𝒴)\nu^{\rho}_{t},\nu^{\rho\prime}_{t}\in\mathcal{P}(\mathcal{Y}),

|J¯cor,tρ(μtρ,νtρ)J¯cor,tρ(μtρ,νtρ)||rt(μtρ,νtρ)rt(μtρ,νtρ)|+\displaystyle\Big{|}\underline{J}_{\mathrm{cor},t}^{\rho*}(\mu^{\rho}_{t},\nu^{\rho}_{t})-\underline{J}_{\mathrm{cor},t}^{\rho*}(\mu^{\rho\prime}_{t},\nu^{\rho\prime}_{t})\Big{|}\leq\left\lvert r_{t}(\mu^{\rho}_{t},\nu^{\rho}_{t})-r_{t}(\mu^{\rho\prime}_{t},\nu^{\rho\prime}_{t})\right\rvert+
\left\lvert\max_{\mu_{t+1}^{\rho}\in\mathcal{R}_{\mu,t}^{\rho}(\mu_{t}^{\rho},\nu_{t}^{\rho})}~{}\min_{\nu_{t+1}^{\rho}\in\mathcal{R}_{\nu,t}^{\rho}(\mu_{t}^{\rho},\nu_{t}^{\rho})}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu^{\rho}_{t+1},\nu^{\rho}_{t+1})-\max_{\mu_{t+1}^{\rho\prime}\in\mathcal{R}_{\mu,t}^{\rho}(\mu_{t}^{\rho\prime},\nu_{t}^{\rho\prime})}~{}\min_{\nu_{t+1}^{\rho\prime}\in\mathcal{R}_{\nu,t}^{\rho}(\mu_{t}^{\rho\prime},\nu_{t}^{\rho\prime})}\underline{J}_{\mathrm{cor},t+1}^{\rho*}(\mu^{\rho\prime}_{t+1},\nu^{\rho\prime}_{t+1})\right\rvert
(Lr+(Lμ,tρ+Lν,tρ)Lr(1+k=t+1T1τ=t+1k(Lμ,τρ+Lν,τρ)))(dTV(μtρ,μtρ)+dTV(νtρ,νtρ))\displaystyle\leq\bigg{(}L_{r}+(L_{\mathcal{R}^{\rho}_{\mu,t}}+L_{\mathcal{R}^{\rho}_{\nu,t}})L_{r}\Big{(}1+\sum_{k=t+1}^{T-1}\prod_{\tau=t+1}^{k}(L_{\mathcal{R}^{\rho}_{\mu,\tau}}+L_{\mathcal{R}^{\rho}_{\nu,\tau}})\Big{)}\bigg{)}\Big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu^{\rho}_{t},\mu^{\rho\prime}_{t}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu^{\rho}_{t},\nu^{\rho\prime}_{t}\big{)}\Big{)}
=(Lr(1+(Lμ,tρ+Lν,tρ)+k=t+1T1(Lμ,tρ+Lν,tρ)τ=t+1k(Lμ,τρ+Lν,τρ)))(dTV(μtρ,μtρ)+dTV(νtρ,νtρ))\displaystyle=\bigg{(}L_{r}\Big{(}1+(L_{\mathcal{R}^{\rho}_{\mu,t}}+L_{\mathcal{R}^{\rho}_{\nu,t}})+\sum_{k=t+1}^{T-1}(L_{\mathcal{R}^{\rho}_{\mu,t}}+L_{\mathcal{R}^{\rho}_{\nu,t}})\prod_{\tau=t+1}^{k}(L_{\mathcal{R}^{\rho}_{\mu,\tau}}+L_{\mathcal{R}^{\rho}_{\nu,\tau}})\Big{)}\bigg{)}\Big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu^{\rho}_{t},\mu^{\rho\prime}_{t}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu^{\rho}_{t},\nu^{\rho\prime}_{t}\big{)}\Big{)}
\displaystyle=\bigg{(}L_{r}\Big{(}1+(L_{\mathcal{R}^{\rho}_{\mu,t}}+L_{\mathcal{R}^{\rho}_{\nu,t}})+\sum_{k=t+1}^{T-1}\prod_{\tau=t}^{k}(L_{\mathcal{R}^{\rho}_{\mu,\tau}}+L_{\mathcal{R}^{\rho}_{\nu,\tau}})\Big{)}\bigg{)}\Big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu^{\rho}_{t},\mu^{\rho\prime}_{t}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu^{\rho}_{t},\nu^{\rho\prime}_{t}\big{)}\Big{)}
=(Lr(1+k=tT1τ=tk(Lμ,τρ+Lν,τρ)))(dTV(μtρ,μtρ)+dTV(νtρ,νtρ)),\displaystyle=\bigg{(}L_{r}\Big{(}1+\sum_{k=t}^{T-1}\prod_{\tau=t}^{k}(L_{\mathcal{R}^{\rho}_{\mu,\tau}}+L_{\mathcal{R}^{\rho}_{\nu,\tau}})\Big{)}\bigg{)}\Big{(}\mathrm{d}_{\mathrm{TV}}\big{(}\mu^{\rho}_{t},\mu^{\rho\prime}_{t}\big{)}+\mathrm{d}_{\mathrm{TV}}\big{(}\nu^{\rho}_{t},\nu^{\rho\prime}_{t}\big{)}\Big{)},

which completes the induction. ∎

Appendix D Proof of Existence of Game Values

We first examine the convexity of the reachability correspondences.

Definition 11 ([Kuroiwa,, 1996]).

Let 𝒳\mathcal{X}, 𝒴\mathcal{Y} and 𝒵\mathcal{Z} be convex sets. The correspondence Γ:𝒳×𝒴𝒵\Gamma:\mathcal{X}\times\mathcal{Y}\rightsquigarrow\mathcal{Z} is convex with respect to 𝒳\mathcal{X} if, for all x1,x2𝒳x_{1},x_{2}\in\mathcal{X}, y𝒴y\in\mathcal{Y}, z1Γ(x1,y)z_{1}\in\Gamma(x_{1},y), z2Γ(x2,y)z_{2}\in\Gamma(x_{2},y), and λ(0,1)\lambda\in(0,1), there exists zΓ(λx1+(1λ)x2,y)z\in\Gamma(\lambda x_{1}+(1-\lambda)x_{2},y) such that

z=λz1+(1λ)z2.z=\lambda z_{1}+(1-\lambda)z_{2}.
Remark 9.

The above definition of convexity is equivalent to the following set inclusion

Γ(λx1+(1λ)x2,y)λΓ(x1,y)+(1λ)Γ(x2,y),\Gamma(\lambda x_{1}+(1-\lambda)x_{2},y)\supseteq\lambda\Gamma(x_{1},y)+(1-\lambda)\Gamma(x_{2},y), (90)

for all λ(0,1)\lambda\in(0,1), x1,x2𝒳x_{1},x_{2}\in\mathcal{X} and y𝒴y\in\mathcal{Y}, where the Minkowski sum is defined as

λΓ(x1,y)+(1λ)Γ(x2,y)={z|z=λz1+(1λ)z2 where z1Γ(x1,y),z2Γ(x2,y)}.\lambda\Gamma(x_{1},y)+(1-\lambda)\Gamma(x_{2},y)=\{z|z=\lambda z_{1}+(1-\lambda)z_{2}\,\text{ where }\,z_{1}\in\Gamma(x_{1},y),z_{2}\in\Gamma(x_{2},y)\}.
Definition 12.

The correspondence Γ:𝒳×𝒴𝒵\Gamma:\mathcal{X}\times\mathcal{Y}\rightsquigarrow\mathcal{Z} is constant with respect to 𝒴\mathcal{Y} if

Γ(x,y1)=Γ(x,y2)x𝒳,y1,y2𝒴.\Gamma(x,y_{1})=\Gamma(x,y_{2})\quad\forall x\in\mathcal{X},~{}\forall y_{1},y_{2}\in\mathcal{Y}. (91)

We have the following concave-convex result for a max-min marginal function defined in (80).

Lemma 14.

Consider two compact-valued correspondences Γ:𝒳×𝒴𝒳\Gamma:\mathcal{X}\times\mathcal{Y}\rightsquigarrow\mathcal{X} and Θ:𝒳×𝒴𝒴\Theta:\mathcal{X}\times\mathcal{Y}\rightsquigarrow\mathcal{Y}. Let Γ\Gamma be convex with respect to 𝒳\mathcal{X} and constant with respect to 𝒴\mathcal{Y}, and let Θ\Theta be constant with respect to 𝒳\mathcal{X} and convex with respect to 𝒴\mathcal{Y}. Let g:𝒳×𝒴g:\mathcal{X}\times\mathcal{Y}\to\mathbb{R} be concave-convex. Then, the max-min marginal function f(x,y)=maxpΓ(x,y)minqΘ(x,y)g(p,q)f(x,y)=\max_{p\in\Gamma(x,y)}\min_{q\in\Theta(x,y)}g(p,q) is also concave-convex.

Proof.

Fix y𝒴y\in\mathcal{Y}, and consider x1,x2𝒳x_{1},x_{2}\in\mathcal{X} and λ(0,1)\lambda\in(0,1). Denote x=λx1+(1λ)x2x=\lambda x_{1}+(1-\lambda)x_{2}. It follows that

f(λx1+(1λ)x2,y)\displaystyle f(\lambda x_{1}+(1-\lambda)x_{2},y) =maxpΓ(λx1+(1λ)x2,y)minqΘ(λx1+(1λ)x2,y)g(p,q)\displaystyle=\max_{p\in\Gamma(\lambda x_{1}+(1-\lambda)x_{2},y)}\;\min_{q\in\Theta(\lambda x_{1}+(1-\lambda)x_{2},y)}g(p,q)
(i)maxpλΓ(x1,y)+(1λ)Γ(x2,y)minqΘ(x,y)g(p,q)\displaystyle\stackrel{{\scriptstyle\text{(i)}}}{{\geq}}\max_{p\in\lambda\Gamma(x_{1},y)+(1-\lambda)\Gamma(x_{2},y)}~{}~{}\min_{q\in\Theta(x,y)}g(p,q)
=maxp1Γ(x1,y)p2Γ(x2,y)minqΘ(x,y)g(λp1+(1λ)p2,q)\displaystyle=\max_{\begin{subarray}{c}p_{1}\in\Gamma(x_{1},y)\\ p_{2}\in\Gamma(x_{2},y)\end{subarray}}\min_{q\in\Theta(x,y)}g(\lambda p_{1}+(1-\lambda)p_{2},q)
(ii)maxp1Γ(x1,y)p2Γ(x2,y)minqΘ(x,y)(λg(p1,q)+(1λ)g(p2,q))\displaystyle\stackrel{{\scriptstyle\text{(ii)}}}{{\geq}}\max_{\begin{subarray}{c}p_{1}\in\Gamma(x_{1},y)\\ p_{2}\in\Gamma(x_{2},y)\end{subarray}}\min_{q\in\Theta(x,y)}(\lambda g(p_{1},q)+(1-\lambda)g(p_{2},q))
(iii)maxp1Γ(x1,y)p2Γ(x2,y)(minq1Θ(x,y)λg(p1,q1)A(p1)+minq2Θ(x,y)(1λ)g(p2,q2)B(p2))\displaystyle\stackrel{{\scriptstyle\text{(iii)}}}{{\geq}}\max_{\begin{subarray}{c}p_{1}\in\Gamma(x_{1},y)\\ p_{2}\in\Gamma(x_{2},y)\end{subarray}}\Big{(}\underbrace{\min_{q_{1}\in\Theta(x,y)}\lambda g(p_{1},q_{1})}_{A(p_{1})}+\underbrace{\min_{q_{2}\in\Theta(x,y)}(1-\lambda)g(p_{2},q_{2})}_{B(p_{2})}\Big{)}
=λmaxp1Γ(x1,y)minq1Θ(x,y)g(p1,q1)+(1λ)maxp2Γ(x2,y)minq2Θ(x,y)g(p2,q2)\displaystyle=\lambda\max_{p_{1}\in\Gamma(x_{1},y)}\min_{q_{1}\in\Theta(x,y)}g(p_{1},q_{1})+(1-\lambda)\max_{p_{2}\in\Gamma(x_{2},y)}\min_{q_{2}\in\Theta(x,y)}g(p_{2},q_{2})
=(iv)λmaxp1Γ(x1,y)minq1Θ(x1,y)g(p1,q1)+(1λ)maxp2Γ(x2,y)minq2Θ(x2,y)g(p2,q2)\displaystyle\stackrel{{\scriptstyle\text{(iv)}}}{{=}}\lambda\max_{p_{1}\in\Gamma(x_{1},y)}\min_{q_{1}\in\Theta(x_{1},y)}g(p_{1},q_{1})+(1-\lambda)\max_{p_{2}\in\Gamma(x_{2},y)}\min_{q_{2}\in\Theta(x_{2},y)}g(p_{2},q_{2})
=λf(x1,y)+(1λ)f(x2,y),\displaystyle=\lambda f(x_{1},y)+(1-\lambda)f(x_{2},y),

where inequality (i) follows from restricting the maximization domain using the convexity of Γ\Gamma in (90); inequality (ii) follows from the concavity of gg with respect to its pp-argument; inequality (iii) follows from splitting the minimization over the two terms; and equality (iv) holds since Θ\Theta is constant with respect to 𝒳\mathcal{X}. The above chain of inequalities establishes the concavity of ff with respect to its xx-argument.

Fix x𝒳x\in\mathcal{X}, and let y1,y2𝒴y_{1},y_{2}\in\mathcal{Y} and λ(0,1)\lambda\in(0,1). Denote y=λy1+(1λ)y2y=\lambda y_{1}+(1-\lambda)y_{2}. Then,

f(x,λy1+(1λ)y2)\displaystyle f(x,\lambda y_{1}+(1-\lambda)y_{2}) =maxpΓ(x,λy1+(1λ)y2)minqΘ(x,λy1+(1λ)y2)g(p,q)\displaystyle=\max_{p\in\Gamma(x,\lambda y_{1}+(1-\lambda)y_{2})}\;\min_{q\in\Theta(x,\lambda y_{1}+(1-\lambda)y_{2})}g(p,q)
maxpΓ(x,y)minq1Θ(x,y1)q2Θ(x,y2)g(p,λq1+(1λ)q2)\displaystyle\leq\max_{p\in\Gamma(x,y)}\min_{\begin{subarray}{c}q_{1}\in\Theta(x,y_{1})\\ q_{2}\in\Theta(x,y_{2})\end{subarray}}g(p,\lambda q_{1}+(1-\lambda)q_{2})
maxpΓ(x,y)minq1Θ(x,y1)q2Θ(x,y2)(λg(p,q1)+(1λ)g(p,q2))\displaystyle\leq\max_{p\in\Gamma(x,y)}\min_{\begin{subarray}{c}q_{1}\in\Theta(x,y_{1})\\ q_{2}\in\Theta(x,y_{2})\end{subarray}}(\lambda g(p,q_{1})+(1-\lambda)g(p,q_{2}))
λmaxp1Γ(x,y)minq1Θ(x,y1)g(p1,q1)+(1λ)maxp2Γ(x,y)minq2Θ(x,y2)g(p2,q2)\displaystyle\leq\lambda\max_{p_{1}\in\Gamma(x,y)}\min_{q_{1}\in\Theta(x,y_{1})}g(p_{1},q_{1})+(1-\lambda)\max_{p_{2}\in\Gamma(x,y)}\min_{q_{2}\in\Theta(x,y_{2})}g(p_{2},q_{2})
=λmaxp1Γ(x,y1)minq1Θ(x,y1)g(p1,q1)+(1λ)maxp2Γ(x,y2)minq2Θ(x,y2)g(p2,q2)\displaystyle=\lambda\max_{p_{1}\in\Gamma(x,y_{1})}\min_{q_{1}\in\Theta(x,y_{1})}g(p_{1},q_{1})+(1-\lambda)\max_{p_{2}\in\Gamma(x,y_{2})}\min_{q_{2}\in\Theta(x,y_{2})}g(p_{2},q_{2})
=λf(x,y1)+(1λ)f(x,y2),\displaystyle=\lambda f(x,y_{1})+(1-\lambda)f(x,y_{2}),

where the first inequality uses the convexity of Θ\Theta with respect to 𝒴\mathcal{Y}, the second uses the convexity of gg with respect to its qq-argument, the third follows from splitting the maximization, and the last equality uses the fact that Γ\Gamma is constant with respect to 𝒴\mathcal{Y}. This implies that ff is convex with respect to its yy-argument. ∎
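
As a concrete illustration of Lemma 14, the following Python sketch checks the concave-convex property numerically on an assumed toy instance of our choosing (interval-valued correspondences and a quadratic concave-convex gg), not on the correspondences of the team game; the tolerance accounts for grid-discretization error.

```python
import numpy as np

# Toy instance (assumed, for illustration only): Gamma(x, y) = [0, x] is convex in x and
# constant in y; Theta(x, y) = [0, y] is constant in x and convex in y;
# g(p, q) = -(p - 1)^2 + (q - 1)^2 + 2 p q is concave in p and convex in q.

def f(x, y, n=1001):
    """Max-min marginal f(x, y) = max_{p in [0, x]} min_{q in [0, y]} g(p, q) on a grid."""
    p = np.linspace(0.0, x, n)[:, None]
    q = np.linspace(0.0, y, n)[None, :]
    g = -(p - 1.0) ** 2 + (q - 1.0) ** 2 + 2.0 * p * q
    return g.min(axis=1).max()   # inner minimization over q, outer maximization over p

rng = np.random.default_rng(0)
tol = 5e-2                       # slack for grid-discretization error
for _ in range(20):
    lam = rng.uniform(0.1, 0.9)
    x1, x2, y = rng.uniform(0.1, 2.0, size=3)
    # concavity of f in its first argument
    assert f(lam * x1 + (1 - lam) * x2, y) >= lam * f(x1, y) + (1 - lam) * f(x2, y) - tol
    x, y1, y2 = rng.uniform(0.1, 2.0, size=3)
    # convexity of f in its second argument
    assert f(x, lam * y1 + (1 - lam) * y2) <= lam * f(x, y1) + (1 - lam) * f(x, y2) + tol
```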

Recall the definition of independent dynamics: See 7

The following lemma shows the convexity of the reachability correspondences μ,tρ\mathcal{R}^{\rho}_{\mu,t} and ν,tρ\mathcal{R}^{\rho}_{\nu,t} under independent dynamics.

Lemma 15.

Under the independent dynamics in Definition 7, the reachability correspondences satisfy

μ,tρ(μt,νt)\displaystyle\mathcal{R}^{\rho}_{\mu,t}(\mu_{t},\nu_{t}) =μ,tρ(μt,νt),μt𝒫(𝒳),νt,νt𝒫(𝒴),\displaystyle=\mathcal{R}^{\rho}_{\mu,t}(\mu_{t},\nu^{\prime}_{t}),\quad\forall\mu_{t}\in\mathcal{P}(\mathcal{X}),~{}~{}\forall\nu_{t},\nu^{\prime}_{t}\in\mathcal{P}(\mathcal{Y}), (92a)
ν,tρ(μt,νt)\displaystyle\mathcal{R}^{\rho}_{\nu,t}(\mu_{t},\nu_{t}) =ν,tρ(μt,νt),μt,μt𝒫(𝒳),νt𝒫(𝒴).\displaystyle=\mathcal{R}^{\rho}_{\nu,t}(\mu^{\prime}_{t},\nu_{t}),\quad\forall\mu_{t},\mu^{\prime}_{t}\in\mathcal{P}(\mathcal{X}),~{}~{}\forall\nu_{t}\in\mathcal{P}(\mathcal{Y}). (92b)

Furthermore, for all λ[0,1]\lambda\in[0,1],

μ,tρ(λμt+(1λ)μt,νt)\displaystyle\mathcal{R}^{\rho}_{\mu,t}\big{(}\lambda\mu_{t}+(1-\lambda)\mu^{\prime}_{t},\nu_{t}\big{)} =λμ,tρ(μt,νt)+(1λ)μ,tρ(μt,νt),μt,μt𝒫(𝒳),νt𝒫(𝒴),\displaystyle=\lambda\mathcal{R}^{\rho}_{\mu,t}\big{(}\mu_{t},\nu_{t}\big{)}+(1-\lambda)\mathcal{R}^{\rho}_{\mu,t}\big{(}\mu^{\prime}_{t},\nu_{t}\big{)},\quad\forall\mu_{t},\mu^{\prime}_{t}\in\mathcal{P}(\mathcal{X}),~{}~{}\forall\nu_{t}\in\mathcal{P}(\mathcal{Y}), (93a)
ν,tρ(μt,λνt+(1λ)νt)\displaystyle\mathcal{R}^{\rho}_{\nu,t}\big{(}\mu_{t},\lambda\nu_{t}+(1-\lambda)\nu^{\prime}_{t}\big{)} =λν,tρ(μt,νt)+(1λ)ν,tρ(μt,νt),μt𝒫(𝒳),νt,νt𝒫(𝒴).\displaystyle=\lambda\mathcal{R}^{\rho}_{\nu,t}\big{(}\mu_{t},\nu_{t}\big{)}+(1-\lambda)\mathcal{R}^{\rho}_{\nu,t}\big{(}\mu_{t},\nu^{\prime}_{t}\big{)},\quad\forall\mu_{t}\in\mathcal{P}(\mathcal{X}),~{}~{}\forall\nu_{t},\nu^{\prime}_{t}\in\mathcal{P}(\mathcal{Y}). (93b)
Proof.

Due to symmetry, we only prove the results for the Blue reachability correspondence. The results for the Red team can be obtained through a similar argument.

Consider an arbitrary pair of distributions μt𝒫(𝒳)\mu_{t}\in\mathcal{P}(\mathcal{X}) and νt𝒫(𝒴)\nu_{t}\in\mathcal{P}(\mathcal{Y}). If μt+1μ,tρ(μt,νt)\mu_{t+1}\in\mathcal{R}_{\mu,t}^{\rho}(\mu_{t},\nu_{t}) under the independent dynamics in (34), there exists a local policy πtΠt\pi_{t}\in\Pi_{t} such that

μt+1(x)=x𝒳[u𝒰f¯t(x|x,u)πt(u|x)]μt(x),\mu_{t+1}(x^{\prime})=\sum_{x\in\mathcal{X}}\Big{[}\sum_{u\in\mathcal{U}}\bar{f}_{t}(x^{\prime}|x,u)\pi_{t}(u|x)\Big{]}\mu_{t}(x),

which is independent of νt\nu_{t}. Consequently,

μ,tρ(μt,νt)=μ,tρ(μt,νt),μt𝒫(𝒳) and νt,νt𝒫(𝒴).\mathcal{R}_{\mu,t}^{\rho}(\mu_{t},\nu_{t})=\mathcal{R}_{\mu,t}^{\rho}(\mu_{t},\nu^{\prime}_{t}),\quad\forall~{}\mu_{t}\in\mathcal{P}(\mathcal{X})\text{ and }\forall~{}\nu_{t},\nu^{\prime}_{t}\in\mathcal{P}(\mathcal{Y}).

Next, we prove the property in (93a). For the \subseteq direction, consider two Blue mean-fields μt,μt𝒫(𝒳)\mu_{t},\mu^{\prime}_{t}\in\mathcal{P}(\mathcal{X}) and a Red mean-field νt𝒫(𝒴)\nu_{t}\in\mathcal{P}(\mathcal{Y}). Let μ¯t+1μ,tρ(λμt+(1λ)μt,νt)\bar{\mu}_{t+1}\in\mathcal{R}^{\rho}_{\mu,t}(\lambda\mu_{t}+(1-\lambda)\mu^{\prime}_{t},\nu_{t}). From the definition of the reachable set, there exists a local policy πtΠt\pi_{t}\in\Pi_{t} such that

μ¯t+1(x)\displaystyle\bar{\mu}_{t+1}(x^{\prime}) =x𝒳[u𝒰f¯t(x|x,u)πt(u|x)](λμt(x)+(1λ)μt(x))\displaystyle=\sum_{x\in\mathcal{X}}\Big{[}\sum_{u\in\mathcal{U}}\bar{f}_{t}(x^{\prime}|x,u)\pi_{t}(u|x)\Big{]}(\lambda\mu_{t}(x)+(1-\lambda)\mu_{t}^{\prime}(x))
=λx𝒳[u𝒰f¯t(x|x,u)πt(u|x)]μt(x)μt+1(x)+(1λ)x𝒳[u𝒰f¯t(x|x,u)πt(u|x)]μt(x)μt+1(x),\displaystyle=\lambda\underbrace{\sum_{x\in\mathcal{X}}\Big{[}\sum_{u\in\mathcal{U}}\bar{f}_{t}(x^{\prime}|x,u)\pi_{t}(u|x)\Big{]}\mu_{t}(x)}_{\mu_{t+1}(x^{\prime})}+(1-\lambda)\underbrace{\sum_{x\in\mathcal{X}}\Big{[}\sum_{u\in\mathcal{U}}\bar{f}_{t}(x^{\prime}|x,u)\pi_{t}(u|x)\Big{]}\mu_{t}^{\prime}(x)}_{\mu^{\prime}_{t+1}(x^{\prime})},

where μt+1μ,tρ(μt,νt)\mu_{t+1}\in\mathcal{R}^{\rho}_{\mu,t}(\mu_{t},\nu_{t}) and μt+1μ,tρ(μt,νt)\mu^{\prime}_{t+1}\in\mathcal{R}^{\rho}_{\mu,t}(\mu^{\prime}_{t},\nu_{t}). Hence, we have

μ,tρ(λμt+(1λ)μt,νt)λμ,tρ(μt,νt)+(1λ)μ,tρ(μt,νt).\mathcal{R}^{\rho}_{\mu,t}(\lambda\mu_{t}+(1-\lambda)\mu^{\prime}_{t},\nu_{t})\subseteq\lambda\mathcal{R}^{\rho}_{\mu,t}\big{(}\mu_{t},\nu_{t}\big{)}+(1-\lambda)\mathcal{R}^{\rho}_{\mu,t}\big{(}\mu^{\prime}_{t},\nu_{t}\big{)}.

For the \supseteq direction, let μ¯t+1=λμt+1+(1λ)μt+1\bar{\mu}_{t+1}=\lambda\mu_{t+1}+(1-\lambda)\mu_{t+1}^{\prime}, where μt+1μ,tρ(μt,νt)\mu_{t+1}\in\mathcal{R}^{\rho}_{\mu,t}(\mu_{t},\nu_{t}) and μt+1μ,tρ(μt,νt)\mu_{t+1}^{\prime}\in\mathcal{R}^{\rho}_{\mu,t}(\mu_{t}^{\prime},\nu_{t}). Further, define μ¯t=λμt+(1λ)μt\bar{\mu}_{t}=\lambda\mu_{t}+(1-\lambda)\mu^{\prime}_{t}. From the definition of the reachable set, there exist local policies πt\pi_{t} and πt\pi^{\prime}_{t} such that

μ¯t+1(x)\displaystyle\bar{\mu}_{t+1}(x^{\prime}) =λx𝒳[u𝒰f¯t(x|x,u)πt(u|x)]μt(x)+(1λ)x𝒳[u𝒰f¯t(x|x,u)πt(u|x)]μt(x)\displaystyle=\lambda\sum_{x\in\mathcal{X}}\Big{[}\sum_{u\in\mathcal{U}}\bar{f}_{t}(x^{\prime}|x,u)\pi_{t}(u|x)\Big{]}\mu_{t}(x)+(1-\lambda)\sum_{x\in\mathcal{X}}\Big{[}\sum_{u\in\mathcal{U}}\bar{f}_{t}(x^{\prime}|x,u)\pi^{\prime}_{t}(u|x)\Big{]}\mu_{t}^{\prime}(x)
=x𝒳u𝒰f¯t(x|x,u)[λπt(u|x)μt(x)+(1λ)πt(u|x)μt(x)]\displaystyle=\sum_{x\in\mathcal{X}}\sum_{u\in\mathcal{U}}\bar{f}_{t}(x^{\prime}|x,u)\Big{[}\lambda\pi_{t}(u|x)\mu_{t}(x)+(1-\lambda)\pi^{\prime}_{t}(u|x)\mu^{\prime}_{t}(x)\Big{]}
=x𝒳u𝒰𝟙μ¯t(x)>0f¯t(x|x,u)[λπt(u|x)μt(x)+(1λ)πt(u|x)μt(x)]\displaystyle=\sum_{x\in\mathcal{X}}\sum_{u\in\mathcal{U}}\mathds{1}_{\bar{\mu}_{t}(x)>0}\;\bar{f}_{t}(x^{\prime}|x,u)\Big{[}\lambda\pi_{t}(u|x)\mu_{t}(x)+(1-\lambda)\pi^{\prime}_{t}(u|x)\mu^{\prime}_{t}(x)\Big{]}
+x𝒳u𝒰𝟙μ¯t(x)=0f¯t(x|x,u)[λπt(u|x)μt(x)+(1λ)πt(u|x)μt(x)]=0\displaystyle+\sum_{x\in\mathcal{X}}\sum_{u\in\mathcal{U}}\mathds{1}_{\bar{\mu}_{t}(x)=0}\;\bar{f}_{t}(x^{\prime}|x,u)\underbrace{\Big{[}\lambda\pi_{t}(u|x)\mu_{t}(x)+(1-\lambda)\pi^{\prime}_{t}(u|x)\mu^{\prime}_{t}(x)\Big{]}}_{=0\,}
=x𝒳u𝒰𝟙μ¯t(x)>0f¯t(x|x,u)λπt(u|x)μt(x)+(1λ)πt(u|x)μt(x)μ¯t(x)μ¯t(x)\displaystyle=\sum_{x\in\mathcal{X}}\sum_{u\in\mathcal{U}}\mathds{1}_{\bar{\mu}_{t}(x)>0}\;\bar{f}_{t}(x^{\prime}|x,u)\frac{\lambda\pi_{t}(u|x)\mu_{t}(x)+(1-\lambda)\pi^{\prime}_{t}(u|x)\mu^{\prime}_{t}(x)}{\bar{\mu}_{t}(x)}\bar{\mu}_{t}(x)
+x𝒳u𝒰𝟙μ¯t(x)=0f¯t(x|x,u)[λπt(u|x)+(1λ)πt(u|x)]μ¯t(x)=0\displaystyle+\sum_{x\in\mathcal{X}}\sum_{u\in\mathcal{U}}\mathds{1}_{\bar{\mu}_{t}(x)=0}\;\bar{f}_{t}(x^{\prime}|x,u)\underbrace{\Big{[}\lambda\pi_{t}(u|x)+(1-\lambda)\pi^{\prime}_{t}(u|x)\Big{]}\bar{\mu}_{t}(x)}_{=0}
=x𝒳[u𝒰f¯t(x|x,u)π¯t(u|x)]μ¯t(x),\displaystyle=\sum_{x\in\mathcal{X}}\left[\sum_{u\in\mathcal{U}}\bar{f}_{t}(x^{\prime}|x,u)\bar{\pi}_{t}(u|x)\right]\bar{\mu}_{t}(x),

where the “averaged” local policy π¯t\bar{\pi}_{t} is given by

π¯t(u|x)=𝟙μ¯t(x)>0λπt(u|x)μt(x)+(1λ)πt(u|x)μt(x)λμt(x)+(1λ)μt(x)+𝟙μ¯t(x)=0(λπt(u|x)+(1λ)πt(u|x)).\bar{\pi}_{t}(u|x)=\mathds{1}_{\bar{\mu}_{t}(x)>0}\frac{\lambda\pi_{t}(u|x)\mu_{t}(x)+(1-\lambda)\pi^{\prime}_{t}(u|x)\mu^{\prime}_{t}(x)}{\lambda\mu_{t}(x)+(1-\lambda)\mu^{\prime}_{t}(x)}+\mathds{1}_{\bar{\mu}_{t}(x)=0}\Big{(}\lambda\pi_{t}(u|x)+(1-\lambda)\pi^{\prime}_{t}(u|x)\Big{)}.

Consequently, we have

μ¯t+1μ,tρ(λμt+(1λ)μt,νt).\bar{\mu}_{t+1}\in\mathcal{R}^{\rho}_{\mu,t}(\lambda\mu_{t}+(1-\lambda)\mu_{t}^{\prime},\nu_{t}).

Hence,

μ,tρ(λμt+(1λ)μt,νt)λμ,tρ(μt,νt)+(1λ)μ,tρ(μt,νt).\mathcal{R}^{\rho}_{\mu,t}(\lambda\mu_{t}+(1-\lambda)\mu^{\prime}_{t},\nu_{t})\supseteq\lambda\mathcal{R}^{\rho}_{\mu,t}\big{(}\mu_{t},\nu_{t}\big{)}+(1-\lambda)\mathcal{R}^{\rho}_{\mu,t}\big{(}\mu^{\prime}_{t},\nu_{t}\big{)}.

Combining the two set inclusions establishes (93a), which completes the proof. ∎
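
To illustrate the averaged-policy construction used in the proof, the following Python sketch considers an assumed toy Blue chain with three states and two actions under independent dynamics; all names and dimensions are illustrative and not part of the model. It verifies numerically that propagating the mixed mean-field under the averaged local policy π¯t\bar{\pi}_{t} reproduces the corresponding mixture of reachable mean-fields.

```python
import numpy as np

# Toy illustration (assumed dimensions) of the averaged-policy construction in Lemma 15.
rng = np.random.default_rng(1)
nx, nu = 3, 2   # |X| = 3 states, |U| = 2 actions

# Independent Blue transition kernel f_bar[u, x, x'] and two arbitrary local policies pi[x, u].
f_bar = rng.random((nu, nx, nx)); f_bar /= f_bar.sum(axis=2, keepdims=True)
pi   = rng.random((nx, nu)); pi   /= pi.sum(axis=1, keepdims=True)
pi_p = rng.random((nx, nu)); pi_p /= pi_p.sum(axis=1, keepdims=True)

def propagate(mu, policy):
    """Mean-field update mu_{t+1}(x') = sum_x sum_u f_bar(x'|x,u) policy(u|x) mu(x)."""
    return np.einsum('x,xu,uxy->y', mu, policy, f_bar)

mu, mu_p = rng.dirichlet(np.ones(nx)), rng.dirichlet(np.ones(nx))
lam = 0.3
mu_bar = lam * mu + (1 - lam) * mu_p   # strictly positive here, so only the first branch is needed

# Averaged local policy from the proof of Lemma 15.
pi_bar = (lam * pi * mu[:, None] + (1 - lam) * pi_p * mu_p[:, None]) / mu_bar[:, None]

lhs = propagate(mu_bar, pi_bar)                                      # reachable from the mixed mean-field
rhs = lam * propagate(mu, pi) + (1 - lam) * propagate(mu_p, pi_p)    # mixture of reachable mean-fields
assert np.allclose(lhs, rhs)
```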

We now restate and prove the result on the existence of the game value: See 1

Proof.

The game value at time TT is directly the mean-field reward, which is assumed to be concave-convex. Since the lower game value in (33) has the form of a max-min marginal function, a backward induction leveraging Lemma 14 and Lemma 15 shows that both the lower and upper game values are concave-convex at each time step. Finally, since the reachable sets are compact and convex, the minimax theorem [Owen,, 2013] guarantees that the max-min value coincides with the min-max value, which ensures the existence of the game value. ∎
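
For concreteness, the minimax theorem invoked above can be checked numerically in its simplest bilinear instance, where g(p,q)=pAqg(p,q)=p^{\top}Aq is optimized over probability simplices. The Python sketch below uses an assumed random payoff matrix and the standard matrix-game linear program; it illustrates the minimax theorem itself, not the paper's coordinator construction.

```python
import numpy as np
from scipy.optimize import linprog

# Minimax illustration: lower and upper values of a bilinear game p^T A q over simplices.
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 5))      # assumed toy payoff matrix

def lower_value(B):
    """max_{p in simplex} min_{q in simplex} p^T B q, via max v s.t. B^T p >= v 1."""
    m, n = B.shape
    c = np.zeros(m + 1); c[-1] = -1.0                       # minimize -v, i.e., maximize v
    A_ub = np.hstack([-B.T, np.ones((n, 1))])               # v - (B^T p)_j <= 0 for all j
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # p sums to one
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return -res.fun

lower = lower_value(A)           # max-min (lower) value
upper = -lower_value(-A.T)       # min-max (upper) value, by symmetry
assert np.isclose(lower, upper, atol=1e-6)   # the game value exists
```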

Appendix E Policy Extraction from Reachability Set

In Section 4.4, we changed the optimization domain from the policy space to the reachability set and obtained the optimal next-to-go mean-field. We now address the natural question of how to construct a coordination strategy that achieves the desired next-to-go mean-field.

We consider the following general policy extraction problem.

Problem 1.

Given a desired next-to-go Blue mean-field μt+1ρμ,tρ(μtρ,νtρ)\mu^{\rho}_{t+1}\in\mathcal{R}_{\mu,t}^{\rho}(\mu^{\rho}_{t},\nu^{\rho}_{t}), find the local policy πtΠt\pi_{t}\in\Pi_{t} such that

μt+1ρ=μtρFρ(μtρ,νtρ,πt).\mu^{\rho}_{t+1}=\mu^{\rho}_{t}F^{\rho}(\mu^{\rho}_{t},\nu^{\rho}_{t},\pi_{t}).

From the convex hull characterization of the reachability set (see Proposition 3), we have that the mean-field μt+1ρμ,tρ(μtρ,νtρ)\mu^{\rho}_{t+1}\in\mathcal{R}_{\mu,t}^{\rho}(\mu^{\rho}_{t},\nu^{\rho}_{t}) can be written as

μt+1ρ=kλkμtρFtρ(μtρ,νtρ,π^tk),\mu_{t+1}^{\rho}=\sum_{k}\lambda^{k}\mu_{t}^{\rho}F^{\rho}_{t}(\mu_{t}^{\rho},\nu_{t}^{\rho},\hat{\pi}_{t}^{k}),

where λk0\lambda^{k}\geq 0 and kλk=1\sum_{k}\lambda^{k}=1, and {π^tk}k\{\hat{\pi}^{k}_{t}\}_{k} are the pure local policies (see Definition 10).

Since μtρ\mu_{t}^{\rho}, νtρ\nu_{t}^{\rho} and {π^tk}k\{\hat{\pi}^{k}_{t}\}_{k} are all known at time tt, the coefficient vector 𝝀\bm{\lambda} can be found by solving the following linear feasibility problem:

𝑴t𝝀=μt+1ρ,s.t.𝝀0andkλk=1,\bm{M}_{t}\bm{\lambda}=\mu^{\rho}_{t+1},\quad\text{s.t.}~{}\bm{\lambda}\geq 0~{}\text{and}~{}\sum_{k}\lambda^{k}=1, (94)

where the matrix 𝑴t|𝒳|×|Π^t|\bm{M}_{t}\in\mathbb{R}^{|\mathcal{X}|\times|\hat{\Pi}_{t}|} has μtρFtρ(μtρ,νtρ,π^tk)\mu_{t}^{\rho}F^{\rho}_{t}(\mu_{t}^{\rho},\nu_{t}^{\rho},\hat{\pi}_{t}^{k}) as its columns. Again, the feasibility of (94) is guaranteed due to the construction of the reachable set.

From the linearity of the mean-field dynamics in (88), we have that the local policy πt=kλkπ^tk\pi_{t}=\sum_{k}\lambda^{k}\hat{\pi}_{t}^{k} achieves the desired next-to-go mean-field. Specifically,

μt+1ρ=μtρFρ(μtρ,νtρ,πt)=kλkμtρFtρ(μtρ,νtρ,π^tk).\mu_{t+1}^{\rho}=\mu^{\rho}_{t}F^{\rho}(\mu^{\rho}_{t},\nu^{\rho}_{t},\pi_{t})=\sum_{k}\lambda^{k}\mu_{t}^{\rho}F^{\rho}_{t}(\mu_{t}^{\rho},\nu_{t}^{\rho},\hat{\pi}_{t}^{k}).

The Blue coordination strategy is then constructed by selecting, at each time step, the local Blue policy obtained from (94), which achieves the optimal next-to-go Blue mean-field.
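
The extraction procedure can be summarized in a short computational sketch. The example below assumes a toy Blue chain with three states and two actions under the independent dynamics of Definition 7; all dimensions and names are illustrative. It forms the matrix 𝑴t\bm{M}_{t} from the pure local policies, solves the feasibility problem (94) with a linear program, and mixes the pure policies with the resulting weights.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

# Policy extraction sketch (assumed toy dimensions): solve (94) and mix the pure local policies.
rng = np.random.default_rng(3)
nx, nu = 3, 2

f_bar = rng.random((nu, nx, nx)); f_bar /= f_bar.sum(axis=2, keepdims=True)
mu_t = rng.dirichlet(np.ones(nx))

def propagate(mu, policy):
    """mu_{t+1}(x') = sum_x sum_u f_bar(x'|x,u) policy(u|x) mu(x)."""
    return np.einsum('x,xu,uxy->y', mu, policy, f_bar)

# Pure local policies: one deterministic action per state (|U|^|X| of them).
pure = [np.eye(nu)[list(a)] for a in product(range(nu), repeat=nx)]
M = np.column_stack([propagate(mu_t, p) for p in pure])   # columns of M_t in (94)

# A reachable target mean-field (feasible by construction) and the feasibility LP (94).
mu_next = M @ rng.dirichlet(np.ones(len(pure)))
A_eq = np.vstack([M, np.ones((1, len(pure)))])
b_eq = np.concatenate([mu_next, [1.0]])
res = linprog(c=np.zeros(len(pure)), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * len(pure))
lam = res.x

# By linearity of the mean-field dynamics, the mixed local policy reproduces the target.
pi_t = np.einsum('k,kxu->xu', lam, np.stack(pure))
assert np.allclose(propagate(mu_t, pi_t), mu_next)
```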

References

  • Aicardi et al., [1987] Aicardi, M., Davoli, F., and Minciardi, R. (1987). Decentralized optimal control of Markov chains with a common past information set. IEEE Transactions on Automatic Control, 32(11):1028–1031.
  • Arabneydi and Mahajan, [2014] Arabneydi, J. and Mahajan, A. (2014). Team optimal control of coupled subsystems with mean-field sharing. In 53rd IEEE Conference on Decision and Control, pages 1669–1674.
  • Arabneydi and Mahajan, [2015] Arabneydi, J. and Mahajan, A. (2015). Team-optimal solution of finite number of mean-field coupled LQG subsystems. In 54th IEEE Conference on Decision and Control, pages 5308–5313, Osaka, Japan, Dec. 15–18, 2015.
  • Bernstein et al., [2002] Bernstein, D. S., Givan, R., Immerman, N., and Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840.
  • Bertsekas and Tsitsiklis, [1996] Bertsekas, D. and Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Athena Scientific.
  • Chung, [2001] Chung, K. L. (2001). A Course in Probability Theory. Academic Press.
  • Cui and Koeppl, [2021] Cui, K. and Koeppl, H. (2021). Approximately solving mean field games via entropy-regularized deep reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1909–1917. PMLR.
  • Elliott and Kalton, [1972] Elliott, R. J. and Kalton, N. J. (1972). The Existence of Value in Differential Games, volume 126. American Mathematical Soc.
  • Freeman and Kokotovic, [2008] Freeman, R. and Kokotovic, P. V. (2008). Robust Nonlinear Control Design: State-space and Lyapunov Techniques. Springer Science & Business Media.
  • Gibbons et al., [2013] Gibbons, R., Roberts, J., et al. (2013). The Handbook of Organizational Economics. Princeton University Press, Princeton, NJ.
  • Guan et al., [2023] Guan, Y., Pan, L., Shishika, D., and Tsiotras, P. (2023). On the adversarial convex body chasing problem. In 2023 American Control Conference, pages 435–440.
  • Guan et al., [2022] Guan, Y., Zhou, M., Pakniyat, A., and Tsiotras, P. (2022). Shaping large population agent behaviors through entropy-regularized mean-field games. In 2022 American Control Conference (ACC), pages 4429–4435. IEEE.
  • Ho, [1980] Ho, Y.-C. (1980). Team decision theory and information structures. Proceedings of the IEEE, 68(6):644–654.
  • Hogeboom-Burr and Yüksel, [2023] Hogeboom-Burr, I. and Yüksel, S. (2023). Zero-sum games involving teams against teams: existence of equilibria, and comparison and regularity in information. Systems & Control Letters, 172:105454.
  • Huang et al., [2007] Huang, M., Caines, P. E., and Malhamé, R. P. (2007). Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ϵ\epsilon-Nash equilibria. IEEE Transactions on Automatic Control, 52(9):1560–1571.
  • Huang et al., [2006] Huang, M., Malhamé, R. P., and Caines, P. E. (2006). Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252.
  • Kartik et al., [2021] Kartik, D., Nayyar, A., and Mitra, U. (2021). Common information belief based dynamic programs for stochastic zero-sum games with competing teams. arXiv preprint arXiv:2102.05838.
  • Kuroiwa, [1996] Kuroiwa, D. (1996). Convexity for set-valued maps. Applied Mathematics Letters, 9(2):97–101.
  • Lagoudakis and Parr, [2002] Lagoudakis, M. G. and Parr, R. (2002). Learning in zero-sum team Markov games using factored value functions. Advances in Neural Information Processing Systems, 15.
  • Lasry and Lions, [2007] Lasry, J.-M. and Lions, P.-L. (2007). Mean field games. Japanese Journal of Mathematics, 2(1):229–260.
  • Laurière et al., [2022] Laurière, M., Perrin, S., Geist, M., and Pietquin, O. (2022). Learning mean field games: A survey. arXiv preprint arXiv:2205.12944.
  • Li et al., [2021] Li, J., Tinka, A., Kiesel, S., Durham, J. W., Kumar, T. S., and Koenig, S. (2021). Lifelong multi-agent path finding in large-scale warehouses. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11272–11281.
  • Mahajan, [2011] Mahajan, A. (2011). Optimal decentralized control of coupled subsystems with control sharing. In 50th IEEE Conference on Decision and Control and European Control Conference, pages 5726–5731, Orlando, FL, Dec. 12–15, 2011.
  • Mahajan et al., [2012] Mahajan, A., Martins, N. C., Rotkowitz, M. C., and Yüksel, S. (2012). Information structures in optimal decentralized control. In 51st IEEE Conference on Decision and Control, pages 1291–1306, Maui, HW, Dec. 10–13, 2012. IEEE.
  • Mahajan and Nayyar, [2015] Mahajan, A. and Nayyar, A. (2015). Sufficient statistics for linear control strategies in decentralized systems with partial history sharing. IEEE Transactions on Automatic Control, 60(8):2046–2056.
  • Marschak, [1955] Marschak, J. (1955). Elements for a theory of teams. Management Science, 1(2):127–137.
  • Nayyar et al., [2013] Nayyar, A., Mahajan, A., and Teneketzis, D. (2013). Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 58(7):1644–1658.
  • Owen, [2013] Owen, G. (2013). Game Theory. Emerald Group Publishing.
  • Papadimitriou and Tsitsiklis, [1985] Papadimitriou, C. H. and Tsitsiklis, J. N. (1985). Intractable problems in control theory. In 24th IEEE Conference on Decision and Control, pages 1099–1103, Fort Lauderdale, FL, Dec. 11–13, 1985.
  • Radner, [1962] Radner, R. (1962). Team decision problems. The Annals of Mathematical Statistics, 33(3):857–881.
  • Sanjari et al., [2023] Sanjari, S., Saldi, N., and Yüksel, S. (2023). Nash equilibria for exchangeable team against team games and their mean field limit. In 2023 American Control Conference, pages 1104–1109. IEEE.
  • Tian et al., [2020] Tian, Y., Liu, K., Ok, K., Tran, L., Allen, D., Roy, N., and How, J. P. (2020). Search and rescue under the forest canopy using multiple UAVs. The International Journal of Robotics Research, 39(10-11):1201–1221.
  • Tyler et al., [2020] Tyler, J., Arnold, R., Abruzzo, B., and Korpela, C. (2020). Autonomous teammates for squad tactics. In International Conference on Unmanned Aircraft Systems, pages 1667–1672, Athens, Greece, Sept. 1–4, 2020.
  • Witsenhausen, [1971] Witsenhausen, H. S. (1971). Separation of estimation and control for discrete-time systems. Proceedings of the IEEE, 59(11):1557–1566.
  • Yüksel, [2009] Yüksel, S. (2009). Stochastic nestedness and the belief sharing information pattern. IEEE Transactions on Automatic Control, 54(12):2773–2786.
  • Zazo et al., [2016] Zazo, S., Macua, S. V., Sánchez-Fernández, M., and Zazo, J. (2016). Dynamic potential games with constraints: Fundamentals and applications in communications. IEEE Transactions on Signal Processing, 64(14):3806–3821.