Zero-Sum Games between Large-Population Teams:
Reachability-based Analysis under Mean-Field Sharing
Abstract
This work studies the behaviors of two large-population teams competing in a discrete environment. The team-level interactions are modeled as a zero-sum game, while the agent dynamics within each team are formulated as a collaborative mean-field team problem. Drawing inspiration from the mean-field literature, we first approximate the large-population team game with its infinite-population limit. Subsequently, we construct a fictitious centralized system and transform the infinite-population game into an equivalent zero-sum game between two coordinators. We study the optimal coordination strategies for each team via a novel reachability analysis and later translate them back to decentralized strategies that the original agents deploy. We prove that the strategies are $\epsilon$-optimal for the original finite-population team game, and we further show that the suboptimality diminishes as the team sizes approach infinity. The theoretical guarantees are verified by numerical examples.
1 Introduction
Multi-agent decision-making arises in many applications, ranging from warehouse robots [Li et al.,, 2021], surveillance missions [Tian et al.,, 2020] to organizational economics [Gibbons et al.,, 2013]. While the majority of the literature formulates the problems within either the cooperative or competitive settings, results on mixed collaborative-competitive team behaviors are relatively sparse. In this work, we consider a competitive team game, where two teams, each consisting of a large number of intelligent agents, compete at the team level, while the agents within each team collaborate. Such hierarchical interactions are of particular interest to military operations [Tyler et al.,, 2020] and other multi-agent systems that operate in adversarial environments.
There are two major challenges when trying to solve such competitive team problems:
1. Large-population team problems are computationally challenging, since the solution complexity increases exponentially with the number of agents. It is well-known that team optimal control problems belong to the NEXP complexity class [Bernstein et al.,, 2002].
2. Competitive team problems are conceptually challenging due to the elusive nature of the opponent team. In particular, one may want to impose assumptions on the opponent team to obtain tractable solutions, but these assumptions may not be valid. It is thus unclear whether one can deploy techniques that are readily available in the large-population game literature.
The scalability challenge is also present in dynamic games; however, it has been successfully resolved for a class of models known as mean-field games (MFGs) [Huang et al.,, 2006; Lasry and Lions,, 2007]. The salient feature of mean-field games is that the selfish agents are weakly coupled in their dynamics and rewards, and coupling is only through the mean field (i.e., the empirical distribution of the agents). When the population is large, the game can be approximately solved by considering the infinite-population limit, at which point the interactions among agents are reduced to a game between a typical agent and the infinite population.
The mean-field approximation idea has been extended to the single-team setting [Arabneydi and Mahajan,, 2014], where a group of homogeneous agents are weakly coupled but receive the same team reward and thus collaborate. Under the assumption that all agents apply the same strategy, a dynamic programming decomposition is developed leveraging the common-information approach [Nayyar et al.,, 2013], so that all agents within the team deploy the same strategy prescribed by a fictitious coordinator. The identical-strategy assumption significantly simplifies the problem, but the optimality of such an approach is only guaranteed for linear-quadratic models [Arabneydi and Mahajan,, 2015]. In the competitive team setting, however, although one may restrict the strategies used by one's own team to be identical, extending the same assumption to the opponent team may lead to a substantial underestimation of the opponent's capabilities and thus requires further justification.
1.1 Main Contributions
We address the two aforementioned challenges for the class of discrete zero-sum mean-field team games (ZS-MFTGs), which is an extension of the mean-field (single-team) problems. Importantly, the ZS-MFTG models competitive team behaviors and draws focus to the justifiable approximation of the opponent team's strategies. We focus on discrete-time models with finite state and action spaces, different from the continuous setting in the concurrent work [Sanjari et al.,, 2023].
We develop a dynamic program that constructs $\epsilon$-optimal strategies for the proposed large-population competitive team problem. Notably, our approach finds an optimal solution at the infinite-population limit and considers only identical team strategies. This avoids both the so-called “curse of dimensionality” in multi-agent systems and the book-keeping of individual strategies. Our main results provide a sub-optimality bound on the exploitability of our proposed solution in the original finite-population game, even when the opponent team is allowed to use non-identical strategies. Specifically, we show that the sub-optimality decreases at the rate $\mathcal{O}(1/\sqrt{N})$, where $N$ is the size of the smaller team.
Our results stem from a novel reachability-based analysis of the mean-field approximation. In particular, we show that any finite-population team behavior can be effectively approximated by an infinite-population team that uses identical team strategies. This result allows us to approximate the original problem with two competing infinite-population teams and transform the resulting problem into a zero-sum game between two fictitious coordinators. Specifically, each coordinator observes the mean-fields (state distributions) of both teams and prescribes a local policy for all agents within its team (see Figure 1). Such transformation leads to a simple dynamic program based on the common-information approach [Nayyar et al.,, 2013].
1.2 Related Work
Mean-field games.
The mean-field game model was introduced in [Huang et al.,, 2006, 2007; Lasry and Lions,, 2007] to address the scalability issues in large population games. The salient feature of mean-field games is that selfish agents are weakly-coupled in their dynamics and rewards through the state distribution (mean-field) of the agents. If the population is sufficiently large, then an approximately optimal solution for these models can be obtained by solving the infinite-population limit, and the solution in this limiting case is the mean-field equilibrium (MFE). The infinite-population game is easier to solve since, when the population is asymptotically large, the action of a single agent does not affect the mean-field. In the continuous setting, the MFE is characterized by a Hamilton-Jacobi-Bellman equation (HJB) coupled with a transport equation [Lasry and Lions,, 2007]. The HJB equation describes the optimality conditions for the strategy of a typical agent, and the transport equation captures the evolution of the mean-field. The existence and uniqueness of the MFE have been established in [Huang et al.,, 2006]. Two desirable features of the MFE are: (i) the resultant strategy is decentralized and identical across all agents, i.e., an agent selects its action only based on the local information of its own state, and no information regarding the mean-field is needed; and (ii) the complexity of the MFE solution does not depend on the number of agents. The mean-field results have been extended to discrete environments in [Cui and Koeppl,, 2021; Guan et al.,, 2022], using entropy-regularization. We refer the readers to [Laurière et al.,, 2022] for a detailed overview of the main results in the mean-field game literature.
The main differences between our setup and the current mean-field game literature are the following: (i) we seek team-optimal strategies, while MFGs seek Nash equilibrium strategies. In particular, we provide performance guarantees when the whole opponent team deviates, while MFGs only consider single-agent deviations. (ii) The MFE assumes that all agents apply the same strategy, and thus the mean-field flow can be solved offline. As a result, the MFE strategy is open-loop with respect to the mean-field. However, under the setting in this paper, different opponent team strategies can lead to different mean-field trajectories, and thus our optimal strategy requires feedback on the mean-field to respond to the different strategies deployed by the opponent team.
Mean-field teams.
The single-team problem with mean-field-type coupling has been considered in [Arabneydi and Mahajan,, 2014], where all agents receive the same team reward and thus the problem is collaborative. It transforms the team problem into a hierarchical control structure, where a fictitious coordinator broadcasts a local policy based on the distribution of agents, and the individual agents then act according to this policy. The work [Arabneydi and Mahajan,, 2014] assumes that all agents within the team apply the same strategy, and the optimality for the finite-population game is only guaranteed in the LQG setting [Mahajan and Nayyar,, 2015]. Our work extends this formulation to the two-team zero-sum setting and justifies the identical team strategy assumption on the opponent team.
The concurrent work [Sanjari et al.,, 2023] studies a similar team-against-team problem but in a continuous state and action setting. By modeling the randomized policies as Borel probability measures, the authors showed that, under a suitable topology induced by a certain information structure, a Nash equilibrium exists for the team game, and it is exchangeable for the finite population and symmetric (identical) for the infinite population. It is also shown that common randomness is not necessary for an approximate Nash equilibrium in large-population games. Our work differs in the following aspects: (i) the concurrent work [Sanjari et al.,, 2023] relies on the convexity of the team best-response correspondence and the Kakutani fixed-point theorem to establish the existence of a Nash equilibrium. However, due to the discrete nature of our formulation, the convexity no longer holds, and thus a Nash equilibrium may not exist, as shown in Numerical Example 1. Consequently, we develop a novel reachability-based analysis to study the single-sided optimality based on the lower and upper game values; (ii) the mean-field sharing information structure allows an easy transformation of the team-against-team problem into a zero-sum continuous game between two coordinators, which can be solved via dynamic programming; (iii) our reachability-based approach and the additional Lipschitz assumptions allow us to provide the convergence rate of the finite-population team performance to its infinite-population limit.
Markov team games and information structure.
Different from mean-field games, team problems focus on fully cooperative scenarios, where all agents share a common reward function. The theory of teams was first investigated in [Marschak,, 1955; Radner,, 1962]. From a game-theoretic perspective, this cooperative setting can also be viewed as a special case of Markov potential games [Zazo et al.,, 2016], with the potential function being the common reward. When all agents have access to the present and past observations along with the past actions of all other agents, such a system is equivalent to a centralized one, which enables the use of single-agent algorithms, such as value iteration or Q-learning [Bertsekas and Tsitsiklis,, 1996]. However, such a centralized system suffers from scalability issues, since the joint state and action spaces grow exponentially with the number of agents. Furthermore, in many applications, decentralization is a desirable or even required trait, due to reasons such as limited communication bandwidth, privacy, etc. This work, therefore, investigates the decentralized team control problem under the information structure known as the mean-field sharing [Arabneydi and Mahajan,, 2014].
Information structure is one of the most important characteristics of a multi-agent system, since it largely determines the tractability of the corresponding decentralized decision problem [Papadimitriou and Tsitsiklis,, 1985; Ho,, 1980]. In [Witsenhausen,, 1971], information structures are classified into three classes: classical, quasi-classical, and non-classical. The characterizing feature of the classical information structure is centralization of information, i.e., all agents know the information available to all agents acting in the past. A system is called quasi-classical or partially nested if the following condition holds: if agent $i$ can influence agent $j$, then agent $j$ must know the information available to agent $i$. All other information structures are non-classical. In the team game literature, commonly used information structures include state sharing [Aicardi et al.,, 1987], belief sharing [Yüksel,, 2009], partial-history sharing [Nayyar et al.,, 2013], mean-field sharing [Arabneydi and Mahajan,, 2014], etc. This work, in particular, assumes a mean-field sharing structure, where each agent observes its local state and the mean-fields of both teams. As the agents do not receive information regarding the other agents' individual states, mean-field sharing is a non-classical information structure, thus posing a challenge to attaining tractable solutions.
As decentralized team problems belong to the NEXP complexity class [Bernstein et al.,, 2002], no efficient algorithm is available, in general. Nonetheless, it is possible to develop a dynamic programming decomposition for specific information structures. Three such generic approaches are discussed in [Mahajan et al.,, 2012]: namely, person-by-person approach, designer’s approach, and common-information approach. While we use the common-information approach in this work, we refer the interested reader to [Mahajan et al.,, 2012] for the other two approaches.
Team problems in a mixed competitive-cooperative setting have also been considered. For example, [Lagoudakis and Parr,, 2002] proposed a centralized learning algorithm for zero-sum Markov games in which each side is composed of multiple agents collaborating against an opponent team of agents. Later, in [Hogeboom-Burr and Yüksel,, 2023], the authors studied the information structure and the existence of equilibria in games involving teams against teams.
Common-information approach.
The common information-based approach was proposed in [Nayyar et al.,, 2013] to investigate optimal strategies in decentralized stochastic control problems with partial history-sharing information structure. In this model, each agent shares part of its information with other agents at each time step. Based on the common information available to all agents, a fictitious coordinator is designed. It is shown that the optimal control problem at the coordinator level is a centralized Partially Observable Markov Decision Process (POMDP) and can be solved using dynamic programming. The coordinator solves the POMDP problem and chooses a prescription for each agent that maps that agent’s local information to its actions. The common information approach was used to solve decentralized stochastic control problems including the control sharing information structure [Mahajan,, 2011], mean-field teams [Arabneydi and Mahajan,, 2014], and LQG systems [Mahajan and Nayyar,, 2015].
Most relevant to this work, [Kartik et al.,, 2021] considered a general model of zero-sum stochastic game between two competing teams, where the information available to each agent can be decomposed into common and private information. Exploiting this specific information structure, an expanded game between two virtual players (fictitious coordinators) is constructed based on the sufficient statistics known as the common information belief. The authors showed that the expanded game can be solved via dynamic programming, and the upper and lower values of the expanded games provide bounds for the original game values. Although we utilize a similar common information approach as in [Kartik et al.,, 2021], this work focuses more on the large-population aspect of the problem, i.e., the mean-field limits of the team games along with the performance guarantees under the mean-field approximation.
In this work, we treat the mean-fields of both (infinite-population) teams as common information. Instead of having a single coordinator, we introduce two fictitious coordinators, one for each team, and, consequently, formulate an equivalent zero-sum game between the two coordinators. Since we consider the equivalent coordinator game at the infinite-population limit, the two mean-fields provide enough information to characterize the status of the system under the mean-field dynamics, which leads to a full-state information structure instead of a partially observable one. Although the original environment is discrete, the resultant zero-sum coordinator game has continuous state and action spaces: the joint state of the coordinator game consists of the mean-fields of the two teams, which are distributions over the state spaces and hence continuous objects, and the actions used by the coordinators are the local policies, which are distributions over the action spaces and are also continuous. Under certain assumptions, we show that the optimal max-min and min-max value functions are Lipschitz continuous with respect to the mean-fields. This justifies the solution of the continuous game via a discretization scheme.
1.3 Road Map
Our approach can be summarized in the road map in Figure 1.
1. In Section 2, we present the formulation of the zero-sum mean-field team games (ZS-MFTG) with a finite number of agents.
2. In Section 3, we employ the mean-field approximation and consider the infinite-population limit, where identical team strategies are deployed.
3. In Section 4, we reformulate the infinite-population game as an equivalent zero-sum game between two coordinators and analyze its value functions through a reachability-based analysis.
4. In Section 5, we establish that the optimal strategies for the coordinator game (computed under the common information approach, mean-field approximation, and identical team strategies) remain $\epsilon$-optimal for the original finite-population game. Specifically, if one team applies the proposed strategy, the opponent team can only improve its performance by $\epsilon$, even if it uses a non-identical team strategy to exploit.
1.4 Notations
We use the shorthand notation $[n]$ to denote the index set $\{1,\ldots,n\}$. The indicator function is denoted as $\mathbb{1}(\cdot)$, such that $\mathbb{1}(E)=1$ if the event $E$ holds and $\mathbb{1}(E)=0$ otherwise. To distinguish between random variables and their realizations, we use uppercase letters to denote random variables (e.g., $X$) and lowercase letters to denote their realizations (e.g., $x$). We use bold letters to denote vectors. For a finite set $E$, we denote the space of all probability measures over $E$ as $\mathcal{P}(E)$ and equip it with the total variation norm [Chung,, 2001]. Additional important notations used in this paper are summarized in Table 1.
Notation | Description | Population Size
---|---|---
 | Number of agents in the Blue team / Red team / whole system | Finite
 | Team size ratio | Finite
 | State of a single Blue / Red agent (realization) | Finite
 | Empirical distribution of the Blue / Red team (realization) | Finite
 | Mean-field of the Blue / Red team | Infinite
 | Identical Blue / Red team policy | Finite & Infinite
 | Identical Blue / Red team strategy | Finite & Infinite
 | Non-identical Blue / Red team policy | Finite
 | Non-identical Blue / Red team strategy | Finite
 | Identical Blue / Red team local policy | Infinite
 | Blue / Red coordinator strategy | Infinite
 | Game reward | Finite & Infinite
 | Dynamics of Blue / Red agents | Finite & Infinite
 | Finite-population game value | Finite
 | Coordinator game value | Infinite
 | Reachability correspondence for the Blue / Red mean-field | Infinite
2 Problem Formulation
Consider a discrete-time system with two large teams of agents that operate over a finite time horizon . The Blue team consists of homogeneous agents, and the Red team consists of homogeneous agents. The total system size is denoted as . We use to denote the ratio between the size of the Blue team and the total number of agents (Blue and Red agents combined). Let and be the random variables representing the state and action, respectively, taken by Blue agent at time . Here, and represent the finite individual state and action space for each Blue agent, independent of and . Similarly, we use and to denote the individual state and action of Red agent . The joint state and action of the Blue team are denoted as and , respectively. Similarly, the Red team’s joint state and action are denoted as , and . We use to denote the joint state of the Blue team excluding Blue agent , and is defined similarly.
Definition 1 (Empirical Distribution).
The empirical distribution (ED) for the Blue and Red teams are defined as
(1a) | ||||
(1b) |
Notice that gives the fraction of Blue agents at state . Hence, the random vector is a probability measure, i.e., . Similarly, we have . Since finite state and action spaces are considered, we have and , where is the -dimensional standard simplex.
We introduce two operators and to denote the operation performed in (1) that relates the joint states and their corresponding EDs:
(2) |
To measure the distance between two distributions (probability measures), we use the total variation.
Definition 2.
For a finite set , the total variation between two probability measures is given by
(3) |
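To make Definitions 1 and 2 concrete, the following minimal Python sketch computes an ED from a vector of agent states and the total variation distance between two distributions. It assumes the common convention that the total variation equals half the $\ell_1$ distance between the probability vectors, which may differ from the exact normalization used in (3).

```python
import numpy as np

def empirical_distribution(states, num_states):
    """Empirical distribution (ED) of a team: fraction of agents in each state."""
    counts = np.bincount(states, minlength=num_states)
    return counts / len(states)

def total_variation(mu, nu):
    """Total variation distance, assuming d_TV(mu, nu) = 0.5 * ||mu - nu||_1."""
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

# Example: 4 Blue agents on a 2-state space, three of them in state 0.
blue_states = np.array([0, 0, 0, 1])
mu = empirical_distribution(blue_states, num_states=2)   # [0.75, 0.25]
print(mu, total_variation(mu, [0.5, 0.5]))               # TV to uniform: 0.25
```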
2.1 Dynamics
We consider weakly-coupled dynamics, where the dynamics of each individual agent are coupled with those of the other agents through the EDs (empirical distributions). For Blue agent , its stochastic transition is governed by a transition kernel and satisfies
(4) |
where and . Note that the transition kernel depends on the ratio between the sizes of the two teams. Similarly, the transition of Red agent follows
(5) |
Assumption 1.
The transition kernels and are and -Lipschitz continuous, respectively. Formally, for all and and for all , there exist positive constants and such that
2.2 Reward Structure
Under the team-game framework, agents in the same team share the same reward. Similar to the dynamics, we consider a weakly-coupled team reward
(6) |
Assumption 2.
The reward function is -Lipschitz continuous. Formally, for all , , and for all , there exists a positive constant such that
(7) |
Under the zero-sum reward structure, we let the Blue team maximize the reward while the Red team minimizes it.
2.3 Information Structure.
We assume a mean-field sharing information structure [Arabneydi and Mahajan,, 2015]. At each time step , Blue agent observes its own state and the EDs and . Similarly, Red agent observes , and . Viewing the EDs as aggregated states of the teams, we consider the following mixed Markov policies:
(8) | ||||
where represents the probability that Blue agent selects action given its state and the team distributions and . Note that an agent's individual state is private information, while the team EDs are common information shared among all agents.
An individual strategy is defined as a time sequence . A Blue team strategy is the collection of the individual strategies used by each Blue agent. We use and to denote, respectively, the set of individual policies and strategies available to each Blue agent. (Since the Blue agents have the same state and action spaces, they have the same policy space.) The set of Blue team strategies is then defined as the Cartesian product . The notations extend naturally to the Red team.
In summary, an instance of a zero-sum mean-field team game is defined as the tuple:
2.4 Optimization Problem.
Starting from the initial joint states and , the performance of any team strategy pair is quantified by the expected cumulative reward
(9) |
where and , and the expectation is with respect to the distribution induced on all system variables by the strategy pair .
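As an illustration of how the expected cumulative reward in (9) can be estimated, the sketch below performs a single Monte-Carlo rollout of the finite-population game; averaging many such rollouts approximates the expectation. All function names and signatures (the kernels f and g, the reward r, and the per-agent policies) are hypothetical placeholders, not part of the paper's formulation or code.

```python
import numpy as np

def rollout_value(f, g, r, blue_pols, red_pols, x0, y0, T, rho,
                  n_x, n_y, n_u, n_v, rng=np.random.default_rng()):
    """Single Monte-Carlo rollout of the cumulative reward in (9).

    Hypothetical placeholder signatures:
      f(x, u, mu, nu, rho), g(y, v, mu, nu, rho): next-state distributions,
      r(mu, nu): team reward,
      blue_pols[i](x, mu, nu), red_pols[j](y, mu, nu): action distributions.
    """
    x, y = np.array(x0), np.array(y0)
    total = 0.0
    for t in range(T + 1):
        mu = np.bincount(x, minlength=n_x) / len(x)      # Blue empirical distribution
        nu = np.bincount(y, minlength=n_y) / len(y)      # Red empirical distribution
        total += r(mu, nu)                               # weakly-coupled team reward
        if t == T:
            break
        u = [rng.choice(n_u, p=blue_pols[i](x[i], mu, nu)) for i in range(len(x))]
        v = [rng.choice(n_v, p=red_pols[j](y[j], mu, nu)) for j in range(len(y))]
        x = np.array([rng.choice(n_x, p=f(x[i], u[i], mu, nu, rho)) for i in range(len(x))])
        y = np.array([rng.choice(n_y, p=g(y[j], v[j], mu, nu, rho)) for j in range(len(y))])
    return total   # average over many rollouts to estimate the expectation in (9)
```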
Remark 1.
Note that the value function in (9) takes the joint states as arguments, different from the value function in [Arabneydi and Mahajan,, 2015], which takes the ED as its argument. The difference comes from the non-identical team strategies allowed in our formulation, which require each agent's state and index information to sample actions and predict the game's evolution. Arabneydi and Mahajan assume an identical team strategy, and hence the ED is an information state (sufficient statistic) that characterizes the value function of the finite-population game. A counter-example showing that the ED is not an information state under non-identical team strategies is presented in Appendix A.3.
When the maximizing Blue team considers its worst-case performance, we have the following max-min optimization:
(10) |
where is the lower game value for the finite-population game. Note that the game value may not always exist (i.e., the max-min value may differ from the min-max value) [Elliott and Kalton,, 1972]. In Section 4.5, we present a special case of the ZS-MFTG where the game value exists at the infinite-population limit. For the general case, we consider the following optimality condition for the Blue team strategy.
Definition 3.
A Blue team strategy is $\epsilon$-optimal if its worst-case performance is within $\epsilon$ of the lower game value in (10).
The strategy is optimal if $\epsilon = 0$.
Similarly, the minimizing Red team considers a min-max optimization problem, which leads to the upper game value as follows
(11) |
The definitions of optimality and -optimality extend naturally to the Red team strategies.
The rest of the paper focuses on the performance analysis from the Blue team's perspective (max-min optimization), but the techniques developed are applicable to the Red team's side due to the symmetry of the problem formulation.
2.5 Examples
A two-node example
Consider a simple team game on a two-node graph in Figure 2. The state spaces are given by and , and the action spaces are and . The Blue action corresponds to staying on the current node and represents moving to the other node. The same connotations apply to Red actions and .

The maximizing Blue team’s objective is to maximize its presence at node 2 (state ), and hence . The Blue transition kernel at under is defined as
Under this transition kernel, the probability of a Blue agent transitioning from node 1 to node 2 depends on the Blue team’s numerical advantage over the Red team at node 1.
The initial joint states depicted in Figure 2 are given by and . The corresponding EDs are , , and the running reward is . Suppose the Blue team applies a team strategy such that for both (under which both Blue agents apply ). The probability of an individual Blue agent transitioning to node 2 is 0.625. Thus, the next Blue ED is a random vector with three possible realizations: (i) with probability 0.14 (both Blue agents remain on node 1); (ii) with probability 0.47 (one Blue agent moves and the other remains); and (iii) with probability 0.39 (both Blue agents move). Suppose the game terminates at , then the value under the given Blue strategy is given by .
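Since the two Blue agents transition independently, the distribution over the next Blue ED is binomial. The short sketch below (assuming exactly two Blue agents and the individual transition probability 0.625 stated above) reproduces the outcome probabilities 0.14, 0.47, and 0.39.

```python
from math import comb

p = 0.625      # individual Blue transition probability to node 2 (from the text)
n_blue = 2     # two Blue agents in this example

# Probability that exactly k of the Blue agents end up at node 2 at the next step.
outcome_probs = {k: comb(n_blue, k) * p**k * (1 - p)**(n_blue - k)
                 for k in range(n_blue + 1)}
print(outcome_probs)   # {0: 0.1406..., 1: 0.4687..., 2: 0.3906...}

# Expected Blue fraction at node 2 at the next step.
print(sum(k / n_blue * q for k, q in outcome_probs.items()))   # 0.625
```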
Connections to the dynamics in [Huang et al.,, 2006]
As a special case of the previously described dynamics and reward structure, we consider the following pairwise-state-coupled dynamics and team reward derived from the single-population model in [Huang et al.,, 2006]. The transition kernel for the Blue and Red agents are defined as follows:
(12a) | ||||
(12b) |
The terms , , and are transition kernels and model the pair-wise influence between agents. For example, represents the influence that the state of a Red agent has on the transition of a Blue agent.
The weakly-coupled team rewards are the averaged state-dependent reward
(13) |
Through some algebraic manipulations, the aforementioned pairwise-state-dependent transition kernel and team reward can be transformed into the weakly-coupled form represented by equations (4) and (6) as follows:
where , and (see Appendix A for the detailed derivations). In this specific example, it is clear that the team with more agents has a larger influence on the dynamics and the reward. Hence, it is necessary to include as a parameter for both the dynamics and the reward function. It can be further verified that the above weakly-coupled transition kernels and are both -Lipschitz continuous with respect to the EDs. Additionally, the weakly-coupled reward function has a Lipschitz constant of , where and (see Propositions 1 and 2 in Appendix A).
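For intuition, the following sketch shows one way such pairwise influences can be averaged into a weakly-coupled kernel. It assumes, for illustration only and not as the exact expression derived in Appendix A, that the pairwise kernels enter as a population-weighted average of the form $\rho \sum_y f^{11}(\cdot\,|\,x,u,y)\,\mu(y) + (1-\rho)\sum_z f^{12}(\cdot\,|\,x,u,z)\,\nu(z)$.

```python
import numpy as np

def weakly_coupled_kernel(f11, f12, x, u, mu, nu, rho):
    """Illustrative averaging of pairwise influences into a weakly-coupled kernel.

    Assumed (hypothetical) data layout:
      f11[x, u, y, :] = next-state distribution of a Blue agent at state x taking
                        action u when influenced by a Blue agent at state y,
      f12[x, u, z, :] = the analogous influence of a Red agent at state z.
    Returns rho * sum_y mu[y] f11[x, u, y, :] + (1 - rho) * sum_z nu[z] f12[x, u, z, :].
    """
    blue_part = np.tensordot(mu, f11[x, u], axes=([0], [0]))
    red_part = np.tensordot(nu, f12[x, u], axes=([0], [0]))
    return rho * blue_part + (1.0 - rho) * red_part
```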
3 Mean-Field Approximation
The preceding max-min and min-max optimizations are intractable for large-population systems, since the dimension of the joint policy spaces and grows exponentially with the number of agents. To address this scalability challenge, we first consider the limiting case of the ZS-MFTG, where the number of agents in both teams goes to infinity. We further assume that agents from the same infinite-population team employ the same strategy. As a result, the behavior of the entire team can be represented by a typical agent [Huang et al.,, 2006]. As we will show later, due to the law of large numbers, the ED of an infinite-population team converges to the state distribution of its corresponding typical agent. This limiting distribution is known as the mean-field [Huang et al.,, 2006]. In the sequel, we formulate the mean-field team game at its infinite-population limit and introduce several additional concepts that are necessary for the performance analysis in the next sections.
3.1 Identical Team Strategies and Mean-Field Dynamics
At the infinite population limit, we specifically focus on the Blue and Red teams that employ identical team strategies. Consequently, we first formalize the class of identical team strategies.
Definition 4 (Identical Blue Team Strategy).
The Blue team strategy is an identical team strategy, if for all and all .
We slightly abuse the notation and use to denote the identical Blue team strategy, under which all Blue agents apply the same individual strategy . Consequently, is used to denote both the set of Blue individual strategies and the set of identical Blue team strategies. (When it denotes the set of identical team strategies, it is a subset of the original set of Blue team strategies.) The definitions and notations extend to the identical Red team strategies.
With the notions of identical team strategies, we define the mean-field flow as a deterministic sequence of the state distribution of a typical agent under the given pair of strategies.
Definition 5.
Given identical team strategies and , the MFs are defined as the sequence of vectors that adhere to the following coupled deterministic dynamics:
(14a) | ||||
(14b) |
where the initial conditions are given by and .
The above deterministic mean-field dynamics can be expressed in a compact matrix form as
(15) | ||||
where is the transition matrix for a typical Blue agent’s distribution under the policy and its entries are given by . The matrix is defined similarly.
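A minimal sketch of one propagation step of the deterministic mean-field dynamics (14)-(15) is given below; the kernel signature f(x, u, mu, nu, rho) is a placeholder assumption, and the Red mean-field propagates analogously.

```python
import numpy as np

def propagate_blue_mf(mu, nu, policy, f, rho):
    """One step of the deterministic Blue mean-field dynamics in (14)-(15).

    policy[x, :] is the action distribution of a typical Blue agent at state x, and
    f(x, u, mu, nu, rho) (hypothetical signature) returns the next-state distribution.
    Builds the transition matrix F with
        F[x, x'] = sum_u policy[x, u] * f(x, u, mu, nu, rho)[x']
    and returns mu @ F.
    """
    n_states, n_actions = policy.shape
    F = np.zeros((n_states, n_states))
    for x in range(n_states):
        for u in range(n_actions):
            F[x] += policy[x, u] * f(x, u, mu, nu, rho)
    return mu @ F
```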
The following lemma shows that the deterministic MF above is an approximation of the (stochastic) finite-population ED, and that the approximation error vanishes as the population size goes to infinity. Thus, we can regard the mean-field as the empirical distribution of an infinite-population team.
Lemma 1.
Let , , and be the joint states and the corresponding EDs of a finite-population game. Denote the next EDs induced by an identical policy pair as . Then, the following bounds hold:
where and according to (15).
Proof.
See Appendix B.2. ∎
3.2 Infinite-Population Optimization
For the infinite-population game, the cumulative reward induced by the strategy pair is given by
(16) |
where the propagation of the mean-fields and are subject to the dynamics (14). The worst-case performance of the Blue team is given by the lower game value
(17) |
The Red team instead uses the upper game value based on the following min-max optimization
(18) |
Remark 2.
Different from the finite-population value (10), which takes the joint states as arguments, the infinite-population game value (17) takes the MFs as arguments. This is due to the assumed identical team strategies, which no longer require the agents' index information to differentiate them and sample actions. Consequently, the MFs can serve as the information state, as in [Arabneydi and Mahajan,, 2015]. In contrast, the non-identical strategies considered in the finite-population game require each agent's state and index information to sample actions and predict the game's evolution.
4 Zero-Sum Game Between Coordinators
The mean-field sharing structure in (8) allows us to reformulate the infinite-population competitive team problem (17) as an equivalent two-player game from the perspective of two fictitious coordinators. (The coordinators are fictitious since they are introduced as an auxiliary concept and are not required for the actual implementation of the obtained strategies.) Each coordinator knows the common information (MFs) and selects a local policy that maps each agent's local information (individual state) to its actions. Through this common-information approach [Nayyar et al.,, 2013], we provide a dynamic program that constructs optimal strategies for all agents under the original mean-field sharing information structure.
4.1 Equivalent Centralized Problem
We use to denote a local Blue policy, which is open-loop with respect to the MFs. Specifically, is the probability that a Blue agent selects action at state regardless of the current MFs. The set of open-loop Blue local policies is denoted as . Similarly, and denote a Red local policy and its admissible set. Under the local policy , the Blue MF propagates as
(19) |
and the Red team MF dynamics under Red local policies is defined similarly as
At each time , a Blue coordinator observes the MFs of both teams (common information) and prescribes a local policy to all Blue agents within its team. The local policy is selected based on:
where is a deterministic Blue coordination policy, and gives the probability that a Blue agent selects action given its current state . Similarly, the Red coordinator observes the MFs and selects a local policy according to . We refer to the time sequence as the Blue team coordination strategy and as the Red team coordination strategy. The sets of admissible coordination strategies are denoted as and .
Remark 3.
There is a one-to-one correspondence between the coordination strategies and the identical team strategies. For example, given an identical Blue team strategy , one can define the coordination strategy:
Similarly, a Blue coordination strategy induces an identical team strategy according to the rule
Plugging in the deterministic coordination strategies, the mean-field dynamics in (19) become
(20a) | ||||
(20b) |
For notational simplicity, we will use the shorthand notations and .
The original competitive team problem in (17) can now be viewed as an equivalent zero-sum game played between the two coordinators, where the game state is the joint mean-field , and the actions are the local policies and selected by the coordinators. Figure 3 provides a schematic of the equivalent centralized system.
Formally, the zero-sum coordinator game can be defined via a tuple
(21) |
In particular, the continuous game state space is , and the continuous action spaces are and for the Blue and Red coordinators, respectively. The deterministic dynamics of the game is given in (20). Finally, the reward structure is given in (6), and the horizon is the same as in the original game.

Given two coordination strategies and , the cumulative reward of the coordinator game is defined as
(22) |
where the propagation of the mean-fields and are subject to the dynamics (20). The worst-case performance of the Blue team is given by the lower coordinator game value
(23) |
and the upper value for the Red team is based on the following min-max optimization
(24) |
In the next section, we examine the properties of the max-min (lower) and min-max (upper) game values of the coordinator game.
4.2 Value Functions of the Coordinator Game
Similar to the standard two-player zero-sum games, we use a backward induction scheme to find the lower and upper values of the coordinator game. In general, the max-min value need not be equal to the min-max value [Elliott and Kalton,, 1972]. Consequently, from the Blue coordinator’s perspective, we consider the lower value (max-min), which provides the highest guaranteed performance for the maximizing Blue team in a worst-case scenario.
The terminal lower value at time is defined as
(25) |
For all previous time steps , the two coordinators optimize their cumulative reward function by choosing their actions (i.e., local policies) and . Consequently, for all we have
(26) |
With the optimal value function, the optimal Blue team coordination policies can then be easily constructed via
(27) |
Similarly, for the Red team coordinator, the upper values are computed as
(28a) | |||
(28b) |
and the optimal Red team coordination policy is given by
(29) |
Note that the optimal Blue team coordination strategy induces an identical Blue team strategy that satisfies the mean-field sharing information structure and can be implemented in the finite-population game (Remark 3).
4.3 Reachable Sets
At the infinite-population limit, the MF dynamics in (15) is deterministic, and thus selecting the local policies and at time is equivalent to selecting the desirable MFs at the next time step. Consequently, we examine the set of MFs that can be reached from the current MFs.
Definition 6.
The Blue reachable set, starting from the mean-fields and , is defined as the set comprising all the possible next Blue team mean-fields that can be achieved by employing a local policy . Formally,
(30) |
Similarly, the Red reachable set is defined as
(31) |
We will regard the reachable sets as set-valued functions (correspondences) [Freeman and Kokotovic,, 2008]. In this case, we write , and similarly .
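Because the next mean-field in (19) is linear in the local policy, the reachable set is a polytope, and its extreme points are attained by deterministic local policies (one action per state). The brute-force enumeration below is a sketch based on that observation, practical only for small state and action spaces; the kernel signature is again a placeholder assumption.

```python
import itertools
import numpy as np

def blue_reachable_vertices(mu, nu, f, rho, n_states, n_actions):
    """Enumerate candidate extreme points of the Blue reachable set in (30).

    The next mean-field is linear in the local policy, so the reachable set is the
    convex hull of the mean-fields induced by deterministic local policies.
    f(x, u, mu, nu, rho) is a hypothetical kernel returning the next-state distribution.
    """
    vertices = []
    for assignment in itertools.product(range(n_actions), repeat=n_states):
        next_mu = np.zeros(n_states)
        for x, u in enumerate(assignment):
            next_mu += mu[x] * f(x, u, mu, nu, rho)
        vertices.append(next_mu)
    return np.unique(np.round(vertices, 12), axis=0)   # drop duplicates
```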
The following lemma justifies using the reachable sets constructed based on the local policies to analyze the reachability of identical team policies.
Lemma 2.
For all and , we have that
(32) |
Remark 4.
Note that the reachable sets are constructed based on identical team strategies, since under the coordinator game formulation, all agents in the same team follow the same local policies prescribed by their coordinator.
4.4 Equivalent Form of Value Functions
We can now change the optimization domains in (25) and (26) from the policy spaces to the corresponding reachable sets. One can easily see that the following lower value propagation scheme is equivalent to the one in (25) and (26).
(33) | ||||
Given the current mean-fields , the optimal next mean-field to achieve is then given by
The Blue coordination policy that leads to can be constructed via a simple linear program leveraging the linearity of the mean-field dynamics with respect to the policy. See Appendix E for details.
In the sequel, we primarily work with the reachability-based optimization in (33). There are two advantages to this approach. First, the reachable sets generally have a lower dimension than the coordinator action spaces (the Blue reachable set is a subset of the simplex of Blue state distributions, while the Blue coordinator action space is the set of all Blue local policies), which is desirable for numerical algorithms. Second, the reachability-based optimization allows us to compare the “reachability” induced by non-identical and identical team strategies (Theorem 2 in Section 5) and thus study the performance loss due to the identical-strategy assumption.
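For concreteness, a single backward step of the reachability-based lower value in (33) can be sketched as follows. The helpers reach_blue, reach_red, V_next, and reward are hypothetical placeholders: in practice, the reachable sets would be sampled (e.g., on a grid or via their polytope vertices) and V_next interpolated from a discretized value table.

```python
def lower_value_step(V_next, mu, nu, reach_blue, reach_red, reward):
    """One backward step of the max-min recursion in (33).

    reach_blue(mu, nu) / reach_red(mu, nu): finite samples of the reachable sets,
    V_next(mu_next, nu_next): (approximate) lower value at the next time step,
    reward(mu, nu): running reward.
    """
    blue_candidates = reach_blue(mu, nu)
    red_candidates = reach_red(mu, nu)
    worst_case = [min(V_next(mu_next, nu_next) for nu_next in red_candidates)
                  for mu_next in blue_candidates]
    return reward(mu, nu) + max(worst_case)
```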
4.5 Existence of the Coordinator Game Value
In general, the coordinator game value may not exist, i.e., the max-min and min-max values may differ (see Numerical Example 1 in Section 6). However, we can show the existence of the coordinator game value for a special class of mean-field team games, where the dynamics of each individual agent is independent of the other agents' states (both its teammates' and its opponents'). Obtaining more general conditions that ensure the existence of the game value is left as a future research direction.
Definition 7.
We say that the weakly-coupled dynamics are independent if the following holds for all , and ,
(34) | ||||
for some transition kernels and .
In other words, the dynamics of an agent depend only on that agent's current state and action and are fully decoupled from all other agents. However, we still allow the reward to be coupled through the MFs. The following theorem ensures the existence of the game value under independent dynamics.
Theorem 1.
Suppose the reward function is concave-convex for all and the system dynamics is independent. Then, the game value exists.
Proof.
One can show that the optimal value under independent dynamics is concave-convex and apply the minimax theorem. The detailed proof is given in Appendix D. ∎
5 Main Results
Recall that the optimal Blue team coordination strategy is constructed for the infinite-population game assuming that both teams employ identical team strategies. This section establishes the performance guarantees for in the finite-population games where both teams are allowed to deploy non-identical strategies.
5.1 Approximation Error
As is solved at the infinite-population limit, it is essential to understand how well the infinite-population game approximates the original finite-population problem. In this subsection, we show that the reachable set constructed using identical strategies and the mean-field dynamics is rich enough to approximate any empirical distribution induced by non-identical team strategies in finite-population games. We start by constructing local policies that mimic the behaviors of non-identical team strategies.
Lemma 3.
Let , , and be the joint states and the corresponding EDs of a finite-population game. Given any Blue team policy (potentially non-identical), define the following local policy
(35) |
Further, define the next mean-field induced by as
(36) |
Then, the expected distance between the next Blue ED induced by the team policy and the mean-field satisfies
(37) |
Proof.
See Appendix B.3. ∎
The local policy mimics the population behavior by setting its action distribution at state as the average of the policies used by the Blue agents at state . The purpose of the second case in (35) is to ensure that the constructed local policy is well-defined at states where no Blue agent is present. The mean-field induced by is within the reachable set and is close to the next ED induced by the non-identical team policy in expectation. This idea is visualized in Figure 4.
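A minimal sketch of the construction in (35) is given below: at each state, the local policy averages the policies of the Blue agents currently occupying that state. The uniform fallback at unoccupied states is an assumption made here for concreteness; since such states carry zero weight in the ED, the particular choice does not affect the induced mean-field in (36).

```python
import numpy as np

def mimicking_local_policy(states, agent_policies, n_states, n_actions):
    """Averaged local policy of Lemma 3 (a sketch of (35)).

    states[i] is Blue agent i's current state; agent_policies[i] is that agent's
    action distribution (length n_actions), already evaluated at its own state
    and the current EDs.
    """
    pi = np.zeros((n_states, n_actions))
    for x in range(n_states):
        occupants = [agent_policies[i] for i, xi in enumerate(states) if xi == x]
        if occupants:
            pi[x] = np.mean(occupants, axis=0)          # average policy at state x
        else:
            pi[x] = np.ones(n_actions) / n_actions      # fallback at empty states
    return pi
```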

Lemma 3 directly leads to the following theorem regarding the richness of the reachable sets, as the mean-field induced by is within the reachable set.
Theorem 2.
Let , , , and be the joint states and the corresponding EDs at time . Denote the next Blue ED induced by some Blue team policy as . Then, there exists a mean-field that satisfies
(38) |
Remark 5.
The construction of in (35) requires knowing each Blue agent’s state, which may seem to violate the mean-field sharing information structure in (8). However, only serves as an auxiliary concept to prove the existence of a mean-field in the reachable set that satisfies (38). The existence result in Theorem 2 is all we need to provide performance guarantees. In fact, will not be used to construct the optimal policies.
Remark 6.
All results in this subsection extend to the analysis from the Red team's side.
5.2 Lipschitz Continuity of the Value Functions
Next, we examine the continuity of the optimal value function in (33) with respect to the mean-field arguments, which is essential for the performance guarantees. Clearly, the continuity of the value function depends on the continuity of the two reachability correspondences and . To properly study the continuity of the reachability correspondences, we use the Hausdorff distance to measure the distance between two sets.
Definition 8 (Hausdorff distance).
For a normed space , the Hausdorff distance between the sets is defined as
(39) |
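A small numerical sketch of the Hausdorff distance between two finite point sets is given below; the Euclidean norm is used here for concreteness, although Definition 8 applies to any norm (e.g., the total variation norm on the simplex).

```python
import numpy as np

def hausdorff_distance(A, B):
    """Hausdorff distance between finite point sets A and B (rows are points)."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)   # pairwise distances
    return max(d.min(axis=1).max(),   # sup over A of the distance to B
               d.min(axis=0).max())   # sup over B of the distance to A

# Example: two subsets of the one-dimensional simplex, represented by sampled points.
A = np.array([[t, 1 - t] for t in np.linspace(0.0, 0.5, 6)])
B = np.array([[t, 1 - t] for t in np.linspace(0.2, 0.9, 8)])
print(hausdorff_distance(A, B))
```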
Based on the Hausdorff distance, we have the following notion of Lipschitz continuity for correspondences [Freeman and Kokotovic,, 2008].
Definition 9.
A correspondence is -Lipschitz continuous under the Hausdorff distance if, for all , it satisfies that
(40) |
The following lemma presents a Lipschitz continuity result for the reachability correspondences.
Lemma 4.
The Blue and Red reachability correspondences are Lipschitz continuous with respect to their mean-field arguments under the Hausdorff distance.
Proof.
The proof is postponed to Appendix C.2. ∎
Leveraging the continuity of the reachability correspondences, the following theorem establishes the Lipschitz continuity of the optimal value functions.
Theorem 3.
The optimal lower value function and the optimal upper value function are both Lipschitz continuous. Formally, for all and , the following inequalities hold
(42) | ||||
where the Lipschitz constant is given by
(43) |
Proof.
Observe that the lower value in (33) takes the form: , which can be viewed as an extension of the marginal function [Freeman and Kokotovic,, 2008] to the max-min case. We present a continuity result for this type of marginal function in Lemma 11 in Appendix C.1. Based on Lemma 11, we can prove the Lipschitz continuity result through an inductive argument since the terminal rewards are assumed to be Lipschitz. A detailed proof is given in Appendix C.3. ∎
5.3 Performance Guarantees
In the previous section, we obtained an optimal Blue coordination strategy for the coordinator game. The coordinator game is constructed under the mean-field setting, i.e., both teams have an infinite number of agents and all agents in each team apply the same strategy. We analyze the performance guarantees for in the finite-population game and compare the worst-case performance of this coordinator-induced Blue team strategy to the original max-min optimization in (10) where agents have the flexibility to apply non-identical strategies within each team.
To this end, let the Red team apply non-identical team strategies . Recall that a Blue coordination strategy induces an identical Blue team strategy (see Remark 3). We denote the finite-population performance under the identical Blue team strategy induced by and the Red team strategy as
Through dynamic programming, we can compute the induced value above using the following backward propagation scheme
(44a) | ||||
(44b) | ||||
where and are the EDs corresponding to the joint states and , respectively.
We have the following main result regarding the performance guarantees for the optimal Blue coordination strategy.
Theorem 4.
The optimal Blue coordination strategy obtained from (27) induces an $\epsilon$-optimal Blue team strategy. Formally, for all ,
(45) |
where .
Proof.
The result follows by combining Lemma 5 and Lemma 6 below, which together bound the gap between the worst-case finite-population performance of the induced strategy and the finite-population lower game value via the coordinator game value. ∎
Remark 7.
Recall that is solved at the infinite-population limit under the restriction that both teams apply identical team strategies. Theorem 4 states that the identical Blue team strategy induced by is still $\epsilon$-optimal, even if (i) it is deployed in a finite-population game and (ii) the opponent team employs non-identical strategies to exploit it.
Remark 8.
Lemma 5.
For all joint states and , the optimal Blue coordination strategy in (27) guarantees
(46) |
where and are the corresponding EDs.
Proof.
The proof proceeds by induction. Fix an arbitrary Red team strategy .
Base case: At the terminal timestep , since there is no decision to be made, both value functions are equal to the terminal reward and are thus the same. Formally, for all and ,
where and . For simplicity, we do not emphasize the correspondence between the joint states and the EDs for the rest of the proof, as it is clear from the context.
Inductive hypothesis: Assume that at , the following holds for all joint states and :
(47) |
Induction: At timestep , consider an arbitrary pair of joint states and , and their corresponding EDs and . Define as
Note that, from the optimality of , we have
(48) |
Furthermore, from Theorem 2, there exists a for the Red team policy such that
(49) |
For notational simplicity, we drop the conditions and in the following derivations. Then, for all joint states and , we have
(50) | |||
(51) | |||
(52) | |||
(53) | |||
(54) | |||
(55) | |||
(56) | |||
(57) |
For inequality (i), we used the inductive hypothesis; for inequality (ii), we utilized the Lipschitz continuity of the coordinator value function in Theorem 3; equality (iii) is due to Lemma 1 and Theorem 2, and the fact that the two error terms with Lipschitz constant are bounded by ; inequality (iv) is due to the fact that is in the reachable set; equality (v) comes from the definition of in (48), and the final equality in (57) is simply the definition of , which completes the induction.
Since the Red team strategy is arbitrary, we have that, for all joint states and ,
∎
Lemma 6.
The following inequality holds for all joint states and ,
(58) |
where and .
Proof.
We prove the lemma through an inductive argument.
Base case: At the terminal timestep , the two value functions are the same. Thus, we have, for all joint states and , that
Inductive hypothesis: Assume that, at time step , the following holds for all and ,
(59) |
Induction: Consider arbitrary and . For each Blue team policy , Theorem 2 allows us to define such that
(60) |
For an identical Red team policy , denote . Then, we have
For inequality (i), we used the inductive hypothesis; for inequality (ii), we reduced the optimization domain of the Red team to the identical policy space; inequality (iii) is a result of the Lipschitz continuity; for inequality (iv), we use Lemma 1 and Theorem 2; equality (v) holds since, for all in the reachable set, there is an identical Red team strategy that induces it, and vice versa; consequently, switching from optimizing over the identical Red team policies to the reachable set does not change the minimization domain; inequality (vi) is due to the fact that is always in the reachable set by construction. ∎
6 Numerical Example
In this section, we provide two numerical examples. The first one illustrates a scenario where the lower and upper coordinator game values differ. The second one verifies the performance guarantees in the finite-population game. For both examples, the state spaces for the Blue and Red teams are and , and the action spaces are and . The two-state state spaces allow the joint mean-fields to be fully characterized solely by and .
The code that implements the two examples can be found at https://github.com/scottyueguan/MFTG.
6.1 Numerical Example 1
We consider the following two-stage example with a population ratio . The time-invariant transition kernel for the Blue agents is defined as follows:
The time-invariant Red transition kernel is similarly defined as
For simplicity, we consider a non-zero reward only at the terminal time step . Specifically,
In other words, the objective of the Blue team is to maximize its population fraction at state at . While the Red distribution does not play a role in the reward structure, the Red agents need to strategically distribute themselves to influence the Blue agents' transitions in order to minimize the terminal reward.

The game values for the coordinator game are computed through discretization, where we uniformly mesh the two-dimensional simplices and into 500 bins. (One can easily provide an error bound on the difference between the discretized value and the true optimal value using the Lipschitz constants; see [Guan et al.,, 2023] for an example.) The numerically solved values are presented in Figure 5.
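One simple way to implement such a discretization for the two-state case is sketched below (the actual scheme used in the linked repository may differ): each one-dimensional simplex is represented by its first coordinate on a uniform grid, and value lookups round to the nearest grid point.

```python
import numpy as np

n_bins = 500
grid = np.linspace(0.0, 1.0, n_bins + 1)      # mesh of mu(x1); mu(x2) = 1 - mu(x1)

# Lower-value table on the joint grid over (mu(x1), nu(x1)),
# to be filled in by the backward recursion (33).
V_lower = np.zeros((len(grid), len(grid)))

def nearest_index(p):
    """Map the first coordinate of a two-state distribution to the nearest grid index."""
    return int(round(p * n_bins))

# Example lookup: the (discretized) value at mu = (0.3, 0.7), nu = (0.65, 0.35).
print(V_lower[nearest_index(0.3), nearest_index(0.65)])
```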
While the game value exists at time , it is not concave-convex, and hence the best-response correspondences in the coordinator game are not convex-valued. Consequently, the lower (max-min) and upper (min-max) game values at the previous step differ, as observed in subplot (a). Specifically, at and , we have the lower value and the upper value , which are visualized as the green and yellow points, respectively. This discrepancy in the game values implies the absence of a Nash equilibrium in this coordinator game.
Based on (33), the optimization domains for computing are for the maximization and for the minimization, both of which are plotted in (d) and visualized as the box in subplots (b) and (c). Subplot (c) presents a zoom-in for the optimization and its min-max counterpart. The marginal functions are also plotted, from which the max-min (green point) and min-max values (yellow point) at can be directly obtained.

Figure 6 presents the max-min and min-max trajectory of the coordinator game. Note that the max-min solution corresponds to the highest worst case performance for the Blue coordinator, where the Blue coordinator announces its first move from to and then the Red coordinator exploits the announced move. If instead, the Red coordinator announces its first move toward in the max-min trajectory, the Blue coordinator can exploit that move by achieving at and ultimately receive a higher reward of 0.5442.
6.2 Numerical Example 2
It is generally challenging to verify the suboptimality bound in Theorem 4, since computing the true optimal performance of a finite-population team game is intractable. However, for the following simple example, we can construct the optimal team strategies even for large but still finite-population teams.
Consider a ZS-MFTG instance with a terminal time and population ratio . The (minimizing) Red team’s objective is to maximize its presence at state at , which translates to
The Blue transition kernels are time-invariant, deterministic and independent of the MFs, and are defined as
(61) | ||||||||
Under the above transition kernel, a Blue agent can move freely between the two states. Specifically, action instructs a Blue agent to remain at its current state, while action directs it to transition to the other state. One can verify that the above Blue transition kernel ensures that the Blue reachable set covers the whole simplex regardless of the initial team distributions, i.e., for all , and .
The Red transition kernels are time-dependent and are defined as
(62) | |||||
(63) | |||||
(64) | |||||
(65) |
and the transitions from state using action at are given by
(66) | ||||||
The Red transitions in (62) and (63) imply that all Red agents are frozen and cannot change their states with any action at . Similarly, all Red agents at state are frozen at , as depicted in (64). The Red agents at state at can choose either to deterministically stay at state using action , as in (65), or to try to move toward state using action . As in (66), the probability of transitioning from to is controlled by the Blue team's distribution at . If the Blue team can perfectly match the target distribution , then no Red agent can transition from to . Otherwise, the more the Blue team deviates from the target distribution, the more likely a Red agent can transition from to .
Infinite-population case.
Since all Red agents are frozen at and the Red sub-population at state is again frozen at , only the actions of Red agents at state at have an impact on the game outcome. As a result, the above setup leads to a dominant optimal Red team strategy: all Red agents at use action at and try to transition to state for the sake of maximizing the team’s presence at , regardless of the Blue team’s distribution. Note that this is the dominant optimal strategy against all Blue team strategies in both finite and infinite-population scenarios. Specifically,
(67) |
On the other hand, the Blue team should try to match the target distribution to minimize the portion of Red agents transitioning from to at . Since the Blue reachable set at covers the whole simplex, the target distribution can always be achieved at starting from any initial Blue team distribution in the infinite-population case. One Blue coordination strategy that achieves the given target distribution is
(68) |
where,
Figure 7 presents the coordinator game value, which is solved via discretization. In this example, the upper and lower game values coincide, and thus the game value exists. The black line in the middle subplot marks the game value with , when the Blue team perfectly achieves the target distribution. The values on the black line dominate and give the highest performance the Blue team can achieve given any , which aligns with our analysis stating that the Blue team should match the target distribution.

In the infinite-population coordinator game, the Blue team can always achieve the target distribution regardless of its initial distribution and thus completely block the migration of Red agents from to . Consequently, only the Red agents that are initialized directly at can be counted toward the reward if the Blue team plays rationally, and thus the game value is simply given by , which matches the game value plotted in Figure 7.
Finite-population case.
The Red team’s optimal strategy remains the same as the infinite-population case. However, the Blue team cannot achieve the irrational target distribution with finite number of agents. While the Blue team can still match the target distribution in expectation using a (stochastic) identical team strategy, the following analysis shows that a non-identical deterministic Blue team strategy achieves a better performance.
Consider a Blue team with three agents, all initially on node 1, i.e., . The optimal Blue coordination strategy prescribes that each Blue agent pick (“stay”) with probability and (“move to ”) with probability to reach the target distribution in expectation. This action selection leads to four possible outcomes for the next Blue team ED : , , , and . In expectation, these empirical distributions lead to a transition probability of 0.518 for a Red team agent moving from to . Consequently, the worst-case performance of the optimal Blue coordinator strategy is .
Next, consider the non-identical deterministic Blue team strategy, under which Blue agents 1 and 2 apply action and Blue agent 3 applies . This Blue team strategy deterministically leads to at , and the resultant Red team transition probability from to is 0.016. Clearly, the non-identical Blue team strategy significantly outperforms the identical mixed team strategy in this three-agent case. Furthermore, this Blue team strategy is optimal over the entire non-identical Blue team strategy set, resulting in a finite-population optimal game value .
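This comparison can be reproduced qualitatively with a short computation. The target mass `p_star = 1/sqrt(2)` and the crossing-probability model `g` below are illustrative assumptions rather than the actual kernel (66), so the numbers differ from 0.518 and 0.016, but the conclusion is the same.

```python
from math import comb, sqrt

# Sketch of the three-agent comparison with assumed target and crossing model.
p_star = 1 / sqrt(2)                                 # hypothetical irrational target mass
g = lambda mu2: min(1.0, 2.0 * abs(mu2 - p_star))    # assumed Red crossing probability

N = 3
# Identical mixed strategy: each agent moves independently w.p. p_star,
# so the number of moved agents is Binomial(N, p_star).
mixed = sum(comb(N, k) * p_star**k * (1 - p_star)**(N - k) * g(k / N)
            for k in range(N + 1))

# Non-identical deterministic strategy: move exactly the number of agents
# whose resulting ED is closest to the target (2 of the 3 agents here).
k_best = min(range(N + 1), key=lambda k: abs(k / N - p_star))
deterministic = g(k_best / N)

print(round(mixed, 3), round(deterministic, 3))   # the deterministic ED does much better
```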
We repeat the above computation for multiple Blue team sizes and plot the suboptimality gap as the blue line in Figure 8, which verifies the rate of decrease predicted by Theorem 4.
Specifically, the optimal value for the finite-population game is computed via the following steps (a computational sketch is given after the list):
• At , the Blue team uses a non-identical deterministic team strategy and deterministically achieves the optimal ED , which minimizes the two-norm distance to the target distribution. This optimal Blue ED is constructed as where .
• At , all Red agents apply , and the distribution of the next Red EDs can be computed based on the optimal achieved at .
• The expected game value is then computed based on the distribution of the Red ED .
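A sketch of this computation over a range of team sizes is given below. As before, `p_star` and `g` are assumed stand-ins for the example's target distribution and the kernel in (66); the point is the roughly O(1/√N) trend of the gap rather than the exact values plotted in Figure 8.

```python
from math import comb, sqrt

# Sketch of the suboptimality-gap computation over increasing Blue team sizes,
# comparing the identical mixed strategy with the best deterministic ED.
p_star = 1 / sqrt(2)
g = lambda mu2: min(1.0, 2.0 * abs(mu2 - p_star))

def gap(N):
    mixed = sum(comb(N, k) * p_star**k * (1 - p_star)**(N - k) * g(k / N)
                for k in range(N + 1))
    k_best = min(range(N + 1), key=lambda k: abs(k / N - p_star))
    return mixed - g(k_best / N)

for N in (3, 10, 30, 100, 300, 1000):
    value = gap(N)
    print(N, round(value, 4), round(value * sqrt(N), 3))
# The last column stays roughly constant, consistent with an O(1/sqrt(N)) gap.
```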
[Figure 8: Suboptimality gap of the identical Blue coordination strategy versus the Blue team size, illustrating the rate of decrease predicted by Theorem 4.]
7 Conclusion
In this work, we introduced a discrete zero-sum mean-field team game to model the behaviors of competing large-population teams. We developed a dynamic programming approach that approximately solves this large-population game at its infinite-population limit where only identical team strategies are considered. Our analysis demonstrated that the identical strategies constructed are -optimal within the general class of non-identical team strategies when deployed in the original finite-population game. The derived performance guarantees are verified through numerical examples. Future work will investigate the LQG setup of this problem and explore machine-learning techniques to address more complex zero-sum mean-field team problems. Additionally, we aim to generalize our results to the infinite-horizon discounted case and problems with heterogeneous sub-populations.
Acknowledgments:
The authors acknowledge fruitful discussions with Dr. Ali Pakniyat.
Appendix A Special Cases and Examples
A.1 Pairwise State-Coupled Mean-Field Team Games
In Section 2, we introduced the pairwise state-coupled dynamics and the averaged state-dependent reward as a special case for the proposed mean-field team game. Specifically, the pairwise state-coupled transition kernels are defined as follows
(12)
Through some algebraic manipulations, the Blue transition kernel can be re-written in the weakly-coupled form in (4):
The Red transition kernel can be similarly transformed into a weakly-coupled form.
The averaged state-dependent reward is defined as
(13)
One can also transform the above reward structure to a weakly-coupled form as in (6)
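For concreteness, a representative pairwise state-coupled kernel and averaged reward take the following form; the notation below is assumed for illustration, written for a single team of size N, with cross-team coupling added analogously. The point is that pairwise terms average into a dependence on the ED, which yields the weakly-coupled forms in (4) and (6).

```latex
% Representative pairwise state-coupled forms (notation assumed for illustration).
\begin{align*}
  f_t\big(x' \mid x^i, u^i, \mathbf{x}^{-i}\big)
    &= \frac{1}{N}\sum_{j=1}^{N} f^{\mathrm{pair}}_t\big(x' \mid x^i, u^i, x^j\big)
     = \sum_{y} \mu^{N}_t(y)\, f^{\mathrm{pair}}_t\big(x' \mid x^i, u^i, y\big),\\[2pt]
  r_t\big(x^i, \mathbf{x}^{-i}\big)
    &= \frac{1}{N}\sum_{j=1}^{N} \tilde r_t\big(x^i, x^j\big)
     = \sum_{y} \mu^{N}_t(y)\, \tilde r_t\big(x^i, y\big),
\end{align*}
% so both objects depend on the joint state only through (x^i, mu^N_t).
```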
Proposition 1.
The transition kernels in (12) satisfy
Proof.
Due to symmetry, we only show the result for the Blue transition kernel. From the definition, we have, for all and , that
∎
Proposition 2.
Proof.
Note that
∎
A.2 Discontinuous Reward Example
Through this example, we demonstrate the necessity of Assumption 2, i.e., that the reward function be continuous with respect to the mean-fields as in (7). Consider an MFTG with terminal time over the following deterministic system.
[Figure: the deterministic system considered in this example.]
The Blue agent’s transition kernel is given by
which holds for all distributions and .
The Red agent’s transition kernel is given by
Given and , the discontinuous reward is given by
As the dynamics are decoupled and the reward does not depend on the Red distribution, the game is essentially a single-team problem; we choose this degenerate example for its simplicity. Since the reward at time is always zero, the objective of the Blue team is to achieve the desired distribution at time to receive one unit of reward. Since is the only time step at which the Blue agents select actions, a Blue coordination strategy consists only of a Blue coordination policy at time . In this case, an optimal Blue coordination strategy can be
where,
The proposed local policy is well-defined, since for the case where , it is implied that , and similarly, for the case where , we automatically have .
One can verify that the above coordination strategy will achieve the mean-field from all under the mean-field dynamics (19). Consequently, we have
However, we know that for all , the ED must be rational, and thus for all almost surely, which leads to
which implies that the performance of the coordination strategy achieved at the infinite-population limit fails to translate back to the finite-population game, and hence Theorem 4 no longer holds.
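The argument can be summarized compactly with an indicator-type reward at an irrational target distribution; the notation and the specific target in the display below are assumptions for illustration.

```latex
% Summary of the discontinuity argument (indicator-type reward, assumed target).
\begin{align*}
  r_1(\mu_1) &= \mathbb{1}\{\mu_1 = \mu^*\}, \qquad \mu^*(s_1) \notin \mathbb{Q},\\
  \text{infinite population:}\quad & \mu^* \text{ is reachable at } t=1
      \;\Rightarrow\; \text{value} = 1,\\
  \text{finite population:}\quad & \mu^{N}_1(s_1) \in \{0, \tfrac{1}{N}, \dots, 1\}
      \;\Rightarrow\; \mathbb{P}\big(\mu^{N}_1 = \mu^*\big) = 0
      \;\Rightarrow\; \text{value} = 0,
\end{align*}
% so the infinite-population guarantee does not transfer for any finite N.
```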
One can construct similar examples to illustrate the necessity of the transition kernels being continuous with respect to the mean-fields.
A.3 Counter-Example for Information State
For simplicity, we use a single-team example to illustrate why the ED cannot be an information state when different agents are allowed to apply different strategies, even under the weakly-coupled dynamics (4) and rewards (6). Consider the following system with state spaces and , and action spaces and .
[Figure: the two-state system used in this counter-example.]
The transition kernel is time-invariant and is given by
In words, each agent’s transition is deterministic and decoupled from the other agents. At each state, the agent can select action to stay at its current state or use action to move to the other state.
We consider a one-stage scenario with the reward functions
To illustrate that the ED cannot serve as an information state, consider a two-agent team and the following non-identical team strategy:
In words, agent 1 selects on state and selects on state , while agent 2 selects on both states and .
We now consider two initial configurations with the same ED.
• and , i.e., agent 1 is initially at and agent 2 at . The ED is thus . If the two agents follow the above non-identical team strategy, then agent 1 uses action and transitions to , while agent 2 chooses action and stays on , leading to . The resultant reward value is then 0.
• and , i.e., agent 1 is initially at and agent 2 at . The ED is again . If the two agents follow the above non-identical team strategy, then both agents select action and stay at their current states, leading to and a reward value of 0.5.
This example shows that the values can differ under the same team strategy for different initial conditions that correspond to the same ED, since the ED does not record which agent occupies which state. Clearly, the ED alone is not enough to characterize the future evolution of the game, nor the value function. Consequently, the ED is not an information state, and we need the joint state information in the finite-population game to properly construct the value functions when different agents apply different strategies.
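A self-contained toy computation of the same phenomenon is sketched below. The dynamics mirror the stay/switch system above, while the non-identical policies and the agent configurations are illustrative choices rather than the exact ones used in this example.

```python
# Toy demonstration that the ED is not an information state under a
# non-identical team strategy (policies and reward are illustrative assumptions).

STATES = ("s1", "s2")

def step(state, action):
    """Deterministic dynamics: a1 stays, a2 moves to the other state."""
    if action == "a1":
        return state
    return "s2" if state == "s1" else "s1"

# Non-identical team strategy: agent 1 always switches, agent 2 always stays.
policies = [lambda s: "a2", lambda s: "a1"]

def empirical_distribution(joint_state):
    return tuple(sum(s == x for s in joint_state) / len(joint_state) for x in STATES)

def next_ed(joint_state):
    return empirical_distribution(tuple(step(s, pi(s)) for s, pi in zip(joint_state, policies)))

config_a = ("s1", "s2")   # agent 1 at s1, agent 2 at s2
config_b = ("s2", "s1")   # same ED (0.5, 0.5), different agent ordering

print(empirical_distribution(config_a), empirical_distribution(config_b))  # identical EDs
print(next_ed(config_a), next_ed(config_b))                                # different next EDs
# Any ED-dependent reward that separates the two next EDs yields different values
# from the same current ED, so the ED alone cannot serve as an information state.
```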
Appendix B Proof of the Mean-Field Approximation Results
B.1 Modified Weak Law of Large Numbers
The following lemma is a modified version of the weak law of large numbers [Chung,, 2001] adapted to accommodate the notion of EDs used in this work. This lemma will be used extensively in the later proofs.
Lemma 7.
Let be independent random variables (vectors) taking values in a finite set . Suppose is distributed according to , such that . Define the ED as
and let . Then, the following inequality holds
(69)
Proof.
Consider the random variables , which are independent across with mean . Since only takes values 0 and 1, we have , and . It follows that
where leverages the fact that are independent across . By Jensen’s inequality, we have . Furthermore, due to the equivalence of norms in the finite-dimensional space , we have almost surely, where is the cardinality of the set . It then follows that
∎
Corollary 1.
Let be independent and identically distributed random variables (vectors) taking values in a finite set . Suppose is distributed according to , such that . Define the ED as
Then,
(70)
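A quick Monte Carlo check of the 1/√n rate in Lemma 7 and Corollary 1 is sketched below; the three-point distribution and the number of trials are arbitrary choices, and the exact constants in (69)–(70) are not reproduced.

```python
import numpy as np

# Numerical check of the O(1/sqrt(n)) rate: sample n i.i.d. categorical
# variables, form the ED, and estimate the expected 2-norm deviation from mu.
rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.3])            # distribution on a 3-element set (assumed)
trials = 2000

for n in (10, 100, 1000, 10000):
    samples = rng.choice(len(mu), size=(trials, n), p=mu)
    eds = np.stack([(samples == x).mean(axis=1) for x in range(len(mu))], axis=1)
    err = np.linalg.norm(eds - mu, axis=1).mean()
    print(n, round(err, 4), round(err * np.sqrt(n), 3))
# The last column stays roughly constant, matching the 1/sqrt(n) decay of the bound.
```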
B.2 Proof of Lemma 1
The following lemma is used to support the proof of Lemma 1.
Lemma 8.
For all , the following inequality holds:
Proof.
The Cauchy-Schwarz inequality directly implies
∎
Proof.
Due to symmetry, we only prove the result for the Blue team. Let be the number of Blue agents that are at state at time in the finite-population system. With a slight abuse of notation, we use to denote the set of the Blue agents currently at state . We first focus on the sub-population of Blue agents that are at state . Since all Blue agents in apply and randomize independently, the next states of these agents are independent and identically distributed conditioned on the joint states and . For an agent in this sub-population, the conditional distribution of its next state is given by
(71)
where and . Define the ED of these Blue agents at time as
Then, we have the following inequality due to Corollary 1:
(72)
Note that we can compute the ED of the whole Blue team based on the ED of sub-populations of Blue agents on different states via
(73)
Similarly, we can relate the propagated mean-field of the whole team with the ones of each sub-population via
(74)
B.3 Proof of Lemma 3
Lemma 3 (restated).
Proof.
Note that if , then the state has no contribution to the next mean-field propagated via (36). The second case in (35) ensures that the constructed policy is well-defined. Without loss of generality, we assume that for all .
Let be the number of Blue agents that are on state at time in the finite-population system. With a slight abuse of notation, we use to denote the set of the Blue agents on state . Then, for an agent applying policy , the conditional distribution of its next state is given by
Define the ED of these Blue agents at time as
(75)
On the other hand, the mean-field propagated from state under local policy can be expressed as
(76)
From Lemma 7, we have that
(77)
The rest of the analysis is similar to the proof of Lemma 1. Notice that
It then follows
where the last inequality results from Lemma 8.
∎
Appendix C Proof of the Continuity Results
In this section, we present the proofs for Lemma 4 and Theorem 3. We start with several preliminary results.
C.1 Preliminary results
Lemma 9.
Let be a Lipschitz continuous function in its second argument, such that for all and . Let and be compact sets such that . Then, for all ,
(78a)–(78b)
Proof.
Lemma 10.
Let be a Lipschitz continuous function in its first argument, such that for all and . Then, for all compact sets ,
Proof.
Since is compact and is continuous, the two minimization problems are well-defined. Next, let ; it follows that
where (i) is due to the Lipschitz assumption. Similarly, one can show the other direction . ∎
Lemma 11.
Consider the compact-valued correspondences and , which are Lipschitz continuous with Lipschitz constants and , respectively. Given a -Lipschitz continuous real-valued function , the max-min marginal function
(80)
is Lipschitz continuous with Lipschitz constant .
Proof.
Let . Since is continuous and is compact, the minimization is well-defined for each . Fix , and consider and . The Lipschitz continuity of implies that
(81)
For simplicity, let . Leveraging Lemma 9, we have (to relate to Lemma 9: set as the function ; treat the argument as and the optimization variable as ; regard the sets and as and , respectively)
(82)
Next, fix and consider . It follows from Lemma 10 that (to relate to Lemma 10: set as the function ; treat the arguments and as and , respectively; regard the set as )
(83)
Finally, consider and ; it follows that
(84)–(86)
where the first term in (85) is bounded using Lemma 10 and the Lipschitz constant derived in (82) (to relate to Lemma 10: set as the function ; treat the arguments as and the argument as ; regard the set as the optimization domain ), and the second term in (85) is bounded using Lemma 9 and the Lipschitz constant in (83) (to relate to Lemma 9: set as the function ; treat the argument as and the optimization variable as ; regard the sets and as and , respectively). ∎
Next, we present some results regarding the continuity of the reachability correspondences. We start by defining the pure local policies.
Definition 10 (Pure local policy).
A Blue local policy is said to be pure if for all and . We use to denote the set of pure Blue policies, where denotes the -th pure Blue local policy. The pure Red local policies are defined similarly.
Proposition 3.
The Blue reachable set is characterized as
(87)
where are pure Blue local policies, and denotes the convex hull.
Proof.
We first show that the mean-field propagation rule is linear with respect to the policy. Note that the set of admissible policies can be characterized as the convex hull of the pure policies, i.e., . Consider arbitrary current mean-fields and , and consider the policy for some and . Then, the next mean-field induced by is given by
which implies that
(88)
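The convex hull characterization in Proposition 3 can be made concrete on a small example. The two-state, two-action kernel `F` and the current mean-field `mu` below are assumed numbers for illustration, not one of the systems studied above.

```python
import numpy as np
from itertools import product

# Enumerate the next mean-fields induced by all pure local policies; by
# Proposition 3, the reachable set is the convex hull of these points.
F = {  # F[u][x, x'] = P(next = x' | current = x, action = u); assumed numbers
    0: np.array([[1.0, 0.0],
                 [0.0, 1.0]]),            # action 0: stay
    1: np.array([[0.1, 0.9],
                 [0.9, 0.1]]),            # action 1: (noisy) switch
}
mu = np.array([0.7, 0.3])                 # current team mean-field (assumed)

vertices = []
for policy in product(F.keys(), repeat=len(mu)):   # 2^2 = 4 pure local policies
    # Row x of the induced transition matrix is the row of F[policy[x]] at state x.
    P = np.array([F[policy[x]][x] for x in range(len(mu))])
    vertices.append(mu @ P)               # next mean-field under this pure policy

print(np.round(vertices, 3))
# Any convex combination of these vertices is reachable by mixing the pure
# policies, reflecting the linearity of the mean-field dynamics in (88).
```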
Lemma 12.
Consider a pure local policy and arbitrary mean-fields and . The following bound holds:
Proof.
Using the triangle inequality, we have that
Since is a stochastic matrix, it is a non-expansive operator under the 1-norm, and hence . Next, we bound the term using Assumption 1.
Combining the bounds, we have
∎
Lemma 13.
Let and be two sets of points that satisfy for all . Then,
Proof.
Consider a point . By definition, there exists a set of non-negative coefficients , such that and . Using the same set of coefficients, define . Then,
Thus, for a fixed , we have . Since is arbitrary, the above inequality implies
Through a similar argument, we can conclude that , which completes the proof. ∎
C.2 Proof of Lemma 4
Lemma 4 (restated).
C.3 Proof of Theorem 3
Theorem 3 (restated).
Proof.
The proof is given by induction.
Base case: At the terminal time , the optimal coordinator value function is given by
From Assumption 2, the Lipschitz constant for is .
Inductive hypothesis: Suppose that the Lipschitz constant is given by (43) at time .
Induction: Recall the definition of the optimal coordinator lower value at time :
where the second term takes the form of a max-min marginal function defined in (80), where , and , while the first term is Lipschitz continuous with Lipschitz constant . Applying Lemma 11, it follows that, for all and ,
which completes the induction. ∎
Appendix D Proof of Existence of Game Values
We first examine the convexity of the reachability correspondences.
Definition 11 ([Kuroiwa,, 1996]).
Let , and be convex sets. The correspondence is convex with respect to if, for all , , , , and , there exists such that
Remark 9.
The above definition of convexity is equivalent to the following set inclusion
(90)
for all , and , where the Minkowski sum is defined as
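For reference, the standard set-inclusion form of this convexity notion, consistent with [Kuroiwa,, 1996], is shown below with respect to the second argument and with assumed notation.

```latex
% Set-inclusion form of convexity for a correspondence (notation assumed).
\begin{equation*}
  \lambda\,\Phi(x, y_1) + (1-\lambda)\,\Phi(x, y_2)
  \;\subseteq\; \Phi\big(x,\, \lambda y_1 + (1-\lambda) y_2\big),
  \qquad \forall\, y_1, y_2 \in \mathcal{Y},\; \lambda \in [0,1],
\end{equation*}
% where the sum on the left-hand side is the Minkowski sum
% A + B = \{\, a + b : a \in A,\; b \in B \,\}.
```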
Definition 12.
The correspondence is constant with respect to if
(91)
We have the following concave-convex result for a max-min marginal function defined in (80).
Lemma 14.
Consider two compact-valued correspondences and . Let be convex with respect to and constant with respect to , and let be constant with respect to and convex with respect to . Let be concave-convex. Then, the max-min marginal function is also concave-convex.
Proof.
Fix , and consider and . Denote . It follows that
where inequality (i) holds from restricting the maximization domain using the convexity of in (90); inequality (ii) holds from the assumption that is concave with respect to its -argument; inequality (iii) is the result of distributing the minimization; equality (iv) is due to being constant with respect to . The above result implies the concavity of with respect to its -argument.
Fix , and let and . Denote . Then,
which implies that is convex with respect to its -argument. ∎
Recall the definition of independent dynamics (Definition 7).
The following lemma shows the convexity of the reachability correspondences and under independent dynamics.
Lemma 15.
Under the independent dynamics in Definition 7, the reachability correspondences satisfy
(92a)–(92b)
Furthermore, for all ,
(93a)–(93b)
Proof.
Due to symmetry, we only prove the results for the Blue reachability correspondence. The results for the Red team can be obtained through a similar argument.
Consider an arbitrary pair of distributions and . If , then under the independent dynamics in (34), there exists a local policy such that
which is independent of . Consequently,
Next, we prove the property in (93a). For the direction, consider two Blue mean-fields and a Red mean-field . Let . From the definition of the reachable set, there exists a local policy such that
where and . Hence, we have
Let now , where and . Further, define . From the definition of the reachable set, there exist local policies and such that
where the “averaged” local policy is given by
Consequently, we have
Hence,
∎
See 1
Proof.
The game value at time is directly the mean-field reward, which is assumed to be concave-convex. Hence, one can provide an inductive proof leveraging Lemma 14 and Lemma 15 to show that both the lower and upper game values are concave-convex at each time step. Note that the lower game value in (33) has the form of a max-min marginal function. Finally, as the reachable sets are compact and convex, one can apply the minimax theorem [Owen,, 2013], which guarantees that the max-min optimal value is the same as the min-max optimal value, and thus ensures the existence of the game value. ∎
Appendix E Policy Extraction from Reachability Set
In Section 4.4, we changed the optimization domain from the policy space to the reachability set and obtained the optimal next-to-go mean-field. We now address the natural question of constructing the coordination strategy that achieves the desired next-to-go mean-field.
We consider the following general policy extraction problem.
Problem 1.
Given a desired next-to-go Blue mean-field , find the local policy such that
From the convex hull characterization of the reachability set (see Proposition 3), we have that the mean-field can be written as
where and , and are the pure local policies (see Definition 10).
Since , and are all known variables at time , the vector-form coefficients can be found by solving the following linear feasibility problem:
(94)
where the matrix has as its columns. Again, the feasibility of (94) is guaranteed due to the construction of the reachable set.
From the linearity of the mean-field dynamics in (88), we have that the local policy achieves the desired next-to-go mean-field. Specifically,
The Blue coordination strategy can then be constructed to select the local Blue policy solved from (94) to achieve the optimal next-to-go Blue mean-field.
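A minimal sketch of this extraction step, using an off-the-shelf linear-programming solver for the feasibility problem (94), is given below; the matrix M (whose columns are the next mean-fields under the pure policies) and the target mean-field are assumed numbers for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Find convex-combination weights alpha over the pure local policies whose
# induced next mean-fields (columns of M) reproduce the desired mean-field.
M = np.array([[1.0, 0.0, 0.97, 0.03],     # column k = next MF under pure policy k (assumed)
              [0.0, 1.0, 0.03, 0.97]])
target = np.array([0.4, 0.6])             # desired next-to-go mean-field (assumed)

n = M.shape[1]
res = linprog(                            # pure feasibility problem: zero objective
    c=np.zeros(n),
    A_eq=np.vstack([M, np.ones((1, n))]), # M @ alpha = target and sum(alpha) = 1
    b_eq=np.concatenate([target, [1.0]]),
    bounds=[(0.0, 1.0)] * n,
)
alpha = res.x
print(np.round(alpha, 3), np.allclose(M @ alpha, target))

# The extracted local policy is the alpha-weighted mixture of the pure policies,
# pi(u | x) = sum_k alpha[k] * pi_k(u | x), which attains the target mean-field
# by the linearity of the mean-field dynamics in (88).
```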
References
- Aicardi et al., [1987] Aicardi, M., Davoli, F., and Minciardi, R. (1987). Decentralized optimal control of Markov chains with a common past information set. IEEE Transactions on Automatic Control, 32(11):1028–1031.
- Arabneydi and Mahajan, [2014] Arabneydi, J. and Mahajan, A. (2014). Team optimal control of coupled subsystems with mean-field sharing. In 53rd IEEE Conference on Decision and Control, pages 1669–1674.
- Arabneydi and Mahajan, [2015] Arabneydi, J. and Mahajan, A. (2015). Team-optimal solution of finite number of mean-field coupled LQG subsystems. In 54th IEEE Conference on Decision and Control, pages 5308–5313, Osaka, Japan, Dec. 15–18, 2015.
- Bernstein et al., [2002] Bernstein, D. S., Givan, R., Immerman, N., and Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840.
- Bertsekas and Tsitsiklis, [1996] Bertsekas, D. and Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Athena Scientific.
- Chung, [2001] Chung, K. L. (2001). A Course in Probability Theory. Academic Press.
- Cui and Koeppl, [2021] Cui, K. and Koeppl, H. (2021). Approximately solving mean field games via entropy-regularized deep reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1909–1917. PMLR.
- Elliott and Kalton, [1972] Elliott, R. J. and Kalton, N. J. (1972). The Existence of Value in Differential Games, volume 126. American Mathematical Soc.
- Freeman and Kokotovic, [2008] Freeman, R. and Kokotovic, P. V. (2008). Robust Nonlinear Control Design: State-space and Lyapunov Techniques. Springer Science & Business Media.
- Gibbons et al., [2013] Gibbons, R., Roberts, J., et al. (2013). The Handbook of Organizational Economics. Princeton University Press, Princeton, NJ.
- Guan et al., [2023] Guan, Y., Pan, L., Shishika, D., and Tsiotras, P. (2023). On the adversarial convex body chasing problem. In 2023 American Control Conference, pages 435–440.
- Guan et al., [2022] Guan, Y., Zhou, M., Pakniyat, A., and Tsiotras, P. (2022). Shaping large population agent behaviors through entropy-regularized mean-field games. In 2022 American Control Conference (ACC), pages 4429–4435. IEEE.
- Ho, [1980] Ho, Y.-C. (1980). Team decision theory and information structures. Proceedings of the IEEE, 68(6):644–654.
- Hogeboom-Burr and Yüksel, [2023] Hogeboom-Burr, I. and Yüksel, S. (2023). Zero-sum games involving teams against teams: existence of equilibria, and comparison and regularity in information. Systems & Control Letters, 172:105454.
- Huang et al., [2007] Huang, M., Caines, P. E., and Malhamé, R. P. (2007). Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized -Nash equilibria. IEEE Transactions on Automatic Control, 52(9):1560–1571.
- Huang et al., [2006] Huang, M., Malhamé, R. P., and Caines, P. E. (2006). Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252.
- Kartik et al., [2021] Kartik, D., Nayyar, A., and Mitra, U. (2021). Common information belief based dynamic programs for stochastic zero-sum games with competing teams. arXiv preprint arXiv:2102.05838.
- Kuroiwa, [1996] Kuroiwa, D. (1996). Convexity for set-valued maps. Applied Mathematics Letters, 9(2):97–101.
- Lagoudakis and Parr, [2002] Lagoudakis, M. G. and Parr, R. (2002). Learning in zero-sum team Markov games using factored value functions. Advances in Neural Information Processing Systems, 15.
- Lasry and Lions, [2007] Lasry, J.-M. and Lions, P.-L. (2007). Mean field games. Japanese Journal of Mathematics, 2(1):229–260.
- Laurière et al., [2022] Laurière, M., Perrin, S., Geist, M., and Pietquin, O. (2022). Learning mean field games: A survey. arXiv preprint arXiv:2205.12944.
- Li et al., [2021] Li, J., Tinka, A., Kiesel, S., Durham, J. W., Kumar, T. S., and Koenig, S. (2021). Lifelong multi-agent path finding in large-scale warehouses. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11272–11281.
- Mahajan, [2011] Mahajan, A. (2011). Optimal decentralized control of coupled subsystems with control sharing. In 50th IEEE Conference on Decision and Control and European Control Conference, pages 5726–5731, Orlando, FL, Dec. 12–15, 2011.
- Mahajan et al., [2012] Mahajan, A., Martins, N. C., Rotkowitz, M. C., and Yüksel, S. (2012). Information structures in optimal decentralized control. In 51st IEEE Conference on Decision and Control, pages 1291–1306, Maui, HW, Dec. 10–13, 2012. IEEE.
- Mahajan and Nayyar, [2015] Mahajan, A. and Nayyar, A. (2015). Sufficient statistics for linear control strategies in decentralized systems with partial history sharing. IEEE Transactions on Automatic Control, 60(8):2046–2056.
- Marschak, [1955] Marschak, J. (1955). Elements for a theory of teams. Management Science, 1(2):127–137.
- Nayyar et al., [2013] Nayyar, A., Mahajan, A., and Teneketzis, D. (2013). Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 58(7):1644–1658.
- Owen, [2013] Owen, G. (2013). Game Theory. Emerald Group Publishing.
- Papadimitriou and Tsitsiklis, [1985] Papadimitriou, C. H. and Tsitsiklis, J. N. (1985). Intractable problems in control theory. In 24th IEEE Conference on Decision and Control, pages 1099–1103, Fort Lauderdale, FL, Dec. 11–13, 1985.
- Radner, [1962] Radner, R. (1962). Team decision problems. The Annals of Mathematical Statistics, 33(3):857–881.
- Sanjari et al., [2023] Sanjari, S., Saldi, N., and Yüksel, S. (2023). Nash equilibria for exchangeable team against team games and their mean field limit. In 2023 American Control Conference, pages 1104–1109. IEEE.
- Tian et al., [2020] Tian, Y., Liu, K., Ok, K., Tran, L., Allen, D., Roy, N., and How, J. P. (2020). Search and rescue under the forest canopy using multiple UAVs. The International Journal of Robotics Research, 39(10-11):1201–1221.
- Tyler et al., [2020] Tyler, J., Arnold, R., Abruzzo, B., and Korpela, C. (2020). Autonomous teammates for squad tactics. In International Conference on Unmanned Aircraft Systems, pages 1667–1672, Athens, Greece, Sept. 1–4, 2020.
- Witsenhausen, [1971] Witsenhausen, H. S. (1971). Separation of estimation and control for discrete-time systems. Proceedings of the IEEE, 59(11):1557–1566.
- Yüksel, [2009] Yüksel, S. (2009). Stochastic nestedness and the belief sharing information pattern. IEEE Transactions on Automatic Control, 54(12):2773–2786.
- Zazo et al., [2016] Zazo, S., Macua, S. V., Sánchez-Fernández, M., and Zazo, J. (2016). Dynamic potential games with constraints: Fundamentals and applications in communications. IEEE Transactions on Signal Processing, 64(14):3806–3821.