Graphon Mean-Field Control for Cooperative Multi-Agent Reinforcement Learning
Abstract
The marriage between mean-field theory and reinforcement learning has shown a great capacity to solve large-scale control problems with homogeneous agents. To break the homogeneity restriction of mean-field theory, a recent line of interest is to introduce graphon theory into the mean-field paradigm. In this paper, we propose a graphon mean-field control (GMFC) framework to approximate cooperative multi-agent reinforcement learning (MARL) with nonuniform interactions and show that the approximation error is of order $\mathcal{O}(1/\sqrt{N})$, with $N$ the number of agents. By discretizing the graphon index of GMFC, we further introduce a smaller class of GMFC called block GMFC, which is shown to approximate cooperative MARL well. Our empirical studies on several examples demonstrate that our GMFC approach is comparable with state-of-the-art MARL algorithms while enjoying better scalability.
1 Introduction
Multi-agent reinforcement learning (MARL) has found various applications in the fields of transportation management and simulation [50, 1], stock trading [32, 31], wireless communication networks [12, 11, 13], and learning behaviors in social dilemmas [33, 28, 34]. MARL, however, becomes intractable as the number of agents increases, due to the complex interactions among agents.
A recent tractable approach is the mean-field approach, which considers MARL in the regime of a large number of homogeneous agents under weak interactions [20]. According to the number of agents and the learning goals, there are three subtly different types of mean-field theories for MARL. The first one is called mean-field MARL (MF-MARL), which refers to the empirical average of the states or actions of a finite population. For example, [52] proposes to approximate interactions within the population of agents by averaging the actions of the overall population or of neighboring agents. [35] proposes a mean-field proximal policy optimization algorithm for a class of MARL with permutation invariance. The second one is called mean-field game (MFG), which describes the asymptotic limit of non-cooperative stochastic games as the number of agents goes to infinity [30, 27, 8]. Recently, a rapidly growing literature studies MFG for non-cooperative MARL either in a model-based way [53, 6, 26] or by a model-free approach [25, 48, 18, 14, 44]. The third one is called mean-field control (MFC), which is closely related to MFG yet differs from MFG in terms of learning goals. For cooperative MFC, the Bellman equation for the value function is defined on an enlarged space of probability measures, and MFC is typically reformulated as a new Markov decision process (MDP) with continuous state-action space [43]. [9] shows the existence of optimal policies for MFC in the form of a mean-field MDP and adapts classical reinforcement learning (RL) methods to the mean-field setup. [24] approximates MARL by an MFC approach and proposes a model-free kernel-based Q-learning algorithm (MFC-K-Q) that enjoys a linear convergence rate and is independent of the number of agents. [44] presents a model-based RL algorithm, M3-UCRL, for MFC with a general regret bound. [2] proposes a unified two-timescale learning framework for MFG and MFC by tuning the ratio of the learning rates of the Q function and the population state distribution.
One restriction of mean-field theory is that it eliminates the differences among agents: interactions between agents are assumed to be uniform. However, in many real-world scenarios, strategic interactions between agents are not always uniform and depend on the relative positions of agents. To develop scalable learning algorithms for multi-agent systems with heterogeneous agents, one approach is to exploit the local network structure of agents [45, 38]. Another approach is to consider mean-field systems on large graphs and their asymptotic limits, which leads to graphon mean-field theory [39]. So far, most existing works on graphon mean-field theory consider either diffusion processes without learning in continuous time or non-cooperative graphon mean-field games (GMFG) in discrete time. [3] considers uncontrolled graphon mean-field systems in continuous time. [17] studies MFG on an Erdös-Rényi graph. [19] studies the convergence of weighted empirical measures described by stochastic differential equations. [4] studies propagation of chaos of weakly interacting particles on general graph sequences. [5] considers general GMFG and studies $\varepsilon$-Nash equilibria of the multi-agent system by a PDE approach in continuous time. [29] studies stochastic games on large graphs and their graphon limits, and shows that GMFG can be viewed as a special case of MFG by treating the label of an agent as a component of the state process. [21, 22] study continuous-time cooperative graphon mean-field systems with linear dynamics. On the other hand, [7] studies static finite-agent network games and their associated graphon games. [49] provides a sequential decomposition algorithm to find Nash equilibria of discrete-time GMFG. [15] constructs a discrete-time learning GMFG framework to analyze approximate Nash equilibria for MARL with nonuniform interactions. However, little work has focused on learning cooperative graphon mean-field systems in discrete time, except for [41, 42] on particular forms of nonuniform interactions among agents. [42] proves that when the reward is affine in the state distribution and action distribution, MARL with nonuniform interactions can still be approximated by classic MFC. [41] considers multi-class MARL, where agents belonging to the same class are homogeneous. In contrast, we consider a general discrete-time GMFC framework under which agents are allowed to interact non-uniformly on any network captured by a graphon.
Our Work
In this work, we propose a general discrete-time GMFC framework to approximate cooperative MARL on large graphs by combining classic MFC and network games. Theoretically, we first show that GMFC can be reformulated as a new MDP with deterministic dynamics and infinite-dimensional state-action space, so that the Bellman equation for the Q function is established on the space of probability measure ensembles. We show that GMFC approximates cooperative MARL well in terms of both the value function and optimal policies, with an approximation error of order $\mathcal{O}(1/\sqrt{N})$, where $N$ is the number of agents. Furthermore, instead of learning infinite-dimensional GMFC directly, we introduce a smaller class called block GMFC by discretizing the graphon index, which can be recast as a new MDP with deterministic dynamics and finite-dimensional continuous state-action space. We show that the optimal policy ensemble learned from block GMFC is near-optimal for cooperative MARL. To deploy the policy ensemble in the finite-agent system, we directly sample from the action distributions in the blocks. Empirically, our experiments in Section 5 demonstrate that as the number of agents becomes large, the mean episode reward of MARL becomes increasingly close to that of block GMFC, which verifies our theoretical findings. Furthermore, our block GMFC approach achieves comparable performance with other popular MARL algorithms in the finite-agent setting.
Outline
The rest of the paper is organized as follows. Section 2 recalls basic notations of graphons and introduces the setup of cooperative MARL with nonuniform interactions and its asymptotic limit called GMFC. Section 3 connects cooperative MARL and GMFC, introduces block GMFC for efficient algorithm design, and builds its connection with cooperative MARL. The main theoretical proofs are presented in Section 4. Section 5 tests the performance of block GMFC experimentally.
2 Mean-Field MARL on Dense Graphs
2.1 Preliminary: Graphon Theory
In the following, we consider a cooperative multi-agent system and its associated mean-field limit. In this system, each agent is affected by all others, with different agents exerting different effects on her. This multi-agent system with $N$ agents can be described by a weighted graph $G_N = (V_N, E_N)$, where the vertex set $V_N$ and the edge set $E_N$ represent agents and the interactions between agents, respectively. To study the limit of the multi-agent system as $N$ goes to infinity, we adopt the graphon theory introduced in [39], which characterizes the limit behavior of dense graph sequences. Therefore, throughout the paper, we assume the graph is dense and leave sparse graphs for future study.
In general, a graphon is represented by a bounded symmetric measurable function $W : [0,1]^2 \to [0,1]$. We denote by $\mathcal{W}$ the space of all graphons and equip the space with the cut norm
$$\|W\|_{\square} := \sup_{S, T \subseteq [0,1]} \Big| \int_{S \times T} W(\alpha, \beta)\, d\alpha\, d\beta \Big|,$$
where the supremum is taken over measurable subsets $S, T$ of $[0,1]$.
It is worth noting that each weighted graph $G_N$ is uniquely determined by a step-graphon $W_N \in \mathcal{W}$ that is constant on each block $[\frac{i-1}{N}, \frac{i}{N}) \times [\frac{j-1}{N}, \frac{j}{N})$ and equal there to the edge weight between agents $i$ and $j$.
We assume that the sequence of step-graphons $(W_N)_{N \ge 1}$ converges to a graphon $W$ in cut norm as the number of agents $N$ goes to infinity, which is crucial for the convergence analysis of cooperative MARL in Section 3.
Assumption 2.1
The sequence $(W_N)_{N \ge 1}$ converges in cut norm to some graphon $W \in \mathcal{W}$ such that
$$\lim_{N \to \infty} \|W_N - W\|_{\square} = 0.$$
Some common examples of graphons include (see also the sketch after this list):

1) Erdős-Rényi: $W(\alpha, \beta) = p$ for all $\alpha, \beta \in [0,1]$, with $p \in [0,1]$;

2) Stochastic block model: $W(\alpha, \beta) = p$ if $\alpha, \beta \in [0, \frac{1}{2}]$ or $\alpha, \beta \in (\frac{1}{2}, 1]$, and $W(\alpha, \beta) = q$ otherwise, where $p$ represents the intra-community interaction and $q$ the inter-community interaction;

3) Random geometric graphon: $W(\alpha, \beta) = f\big(\min(|\alpha - \beta|, 1 - |\alpha - \beta|)\big)$, where $f$ is a non-increasing function.
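For concreteness, the following minimal Python sketch implements the three example graphons as plain functions and evaluates the corresponding step-graphon weights on a finite grid; the particular parameter values ($p$, $q$, and the kernel $f$) are illustrative choices, not the ones used in our experiments.

```python
import numpy as np

def erdos_renyi(p=0.5):
    """Constant graphon W(a, b) = p."""
    return lambda a, b: p

def stochastic_block(p=0.8, q=0.2):
    """Two-community graphon: p within a community, q across communities."""
    def W(a, b):
        same_block = (a <= 0.5) == (b <= 0.5)
        return p if same_block else q
    return W

def random_geometric(f=lambda d: np.exp(-3.0 * d)):
    """W(a, b) = f(min(|a - b|, 1 - |a - b|)) with a non-increasing f."""
    return lambda a, b: f(min(abs(a - b), 1.0 - abs(a - b)))

# Example: evaluate the step-graphon weights W_N(i/N, j/N) for N agents.
W = stochastic_block()
N = 8
weights = np.array([[W(i / N, j / N) for j in range(1, N + 1)] for i in range(1, N + 1)])
print(weights.round(2))
```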
2.2 Cooperative MARL with Nonuniform Interactions
In this section, we facilitate the analysis of MARL by considering a particular class of MARL with nonuniform interactions, where each agent interacts with all other agents via the aggregated weighted mean-field effect of the population of all agents.
Recall that we use the weighted graph $G_N$ to represent the multi-agent system, in which agents are cooperative and coordinated by a central controller. They share a finite state space $\mathcal{X}$ and take actions from a finite action space $\mathcal{U}$. We denote by $\mathcal{P}(\mathcal{X})$ and $\mathcal{P}(\mathcal{U})$ the spaces of all probability measures on $\mathcal{X}$ and $\mathcal{U}$, respectively. Furthermore, denote by $\mathcal{M}(\mathcal{X})$ the space of all Borel measures on $\mathcal{X}$.
For each agent $i \in \{1, \dots, N\}$, the neighborhood empirical measure is given by
$$\nu^{N,i}_t(\cdot) := \frac{1}{N} \sum_{j=1}^{N} \xi^N_{ij}\, \delta_{x^j_t}(\cdot), \qquad (2.2)$$
where $\delta_x$ denotes the Dirac measure at $x$, and the weight $\xi^N_{ij}$ describing the interaction between agents $i$ and $j$ is taken as either
$$\xi^N_{ij} = W_N\Big(\frac{i}{N}, \frac{j}{N}\Big) \qquad \text{(C1)}$$
or
$$\xi^N_{ij} \sim \mathrm{Bernoulli}\Big(W_N\Big(\frac{i}{N}, \frac{j}{N}\Big)\Big). \qquad \text{(C2)}$$
At each step $t$, if agent $i$, at state $x^i_t \in \mathcal{X}$, takes an action $u^i_t \in \mathcal{U}$, then she will receive a reward
$$r\big(x^i_t, u^i_t, \nu^{N,i}_t\big), \qquad (2.3)$$
where $r : \mathcal{X} \times \mathcal{U} \times \mathcal{M}(\mathcal{X}) \to \mathbb{R}$, and she will change to a new state $x^i_{t+1}$ according to a transition probability such that
$$x^i_{t+1} \sim P\big(\cdot \mid x^i_t, u^i_t, \nu^{N,i}_t\big), \qquad (2.4)$$
where $P : \mathcal{X} \times \mathcal{U} \times \mathcal{M}(\mathcal{X}) \to \mathcal{P}(\mathcal{X})$.
(2.3)-(2.4) indicate that the reward and the transition probability of agent $i$ at time $t$ depend on both her individual information $(x^i_t, u^i_t)$ and her neighborhood empirical measure $\nu^{N,i}_t$.
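As an illustration of (2.2), the following sketch computes every agent's neighborhood empirical measure from the current joint state under either (C1) or (C2); the Bernoulli edge sampling and the tabular state encoding are implementation assumptions.

```python
import numpy as np

def neighborhood_empirical_measures(states, W, n_states, bernoulli=False, rng=None):
    """Row i of the output is nu^{N,i} = (1/N) * sum_j xi_ij * delta_{x_j}.

    states:    length-N array of integer states in {0, ..., n_states - 1}
    W:         graphon, W(alpha, beta) in [0, 1]
    bernoulli: False -> condition (C1), xi_ij = W_N(i/N, j/N);
               True  -> condition (C2), xi_ij ~ Bernoulli(W_N(i/N, j/N)).
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(states)
    grid = np.arange(1, N + 1) / N
    xi = np.array([[W(a, b) for b in grid] for a in grid])   # step-graphon weights
    if bernoulli:
        xi = rng.binomial(1, xi).astype(float)               # random edges under (C2)
    onehot = np.eye(n_states)[states]                        # delta_{x_j} as one-hot rows
    return xi @ onehot / N

# Illustrative usage: 5 agents, 2 states, constant graphon W = 0.5.
nu = neighborhood_empirical_measures(np.array([0, 1, 1, 0, 1]), lambda a, b: 0.5, n_states=2)
print(nu)
```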
Furthermore, the policy of each agent is assumed to be stationary for simplicity and takes the Markovian form
$$u^i_t \sim \pi^i\big(\cdot \mid x^i_t\big), \qquad (2.5)$$
which maps agent $i$'s state to a randomized action. For each agent $i$, the space of all such policies is denoted as $\Pi$.
Remark 2.2
2.3 Graphon Mean-Field Control
We expect the cooperative MARL problem of Section 2.2 to become a GMFC problem as $N \to \infty$. In GMFC, there is a continuum of agents $\alpha \in [0,1]$, and each agent with index/label $\alpha$ follows
$$x^\alpha_{t+1} \sim P\big(\cdot \mid x^\alpha_t, u^\alpha_t, \nu^\alpha_t\big), \qquad u^\alpha_t \sim \pi^\alpha\big(\cdot \mid x^\alpha_t\big), \qquad (2.8)$$
where $\mu^\beta_t$ denotes the probability distribution of $x^\beta_t$, and $\nu^\alpha_t$ is defined as the neighborhood mean-field measure of agent $\alpha$:
$$\nu^\alpha_t(\cdot) := \int_0^1 W(\alpha, \beta)\, \mu^\beta_t(\cdot)\, d\beta, \qquad (2.9)$$
with the graphon $W$ given in Assumption 2.1.
To ease the subsequent analysis, define the space of state distribution ensembles $\mathcal{D} := \{\mu = (\mu^\alpha)_{\alpha \in [0,1]} : \mu^\alpha \in \mathcal{P}(\mathcal{X})\}$ and the space of policy ensembles $\boldsymbol{\Pi} := \{\pi = (\pi^\alpha)_{\alpha \in [0,1]} : \pi^\alpha \in \Pi\}$. Then $\mu_t := (\mu^\alpha_t)_{\alpha \in [0,1]}$ and $\pi := (\pi^\alpha)_{\alpha \in [0,1]}$ are elements of $\mathcal{D}$ and $\boldsymbol{\Pi}$, respectively.
The objective of GMFC is to maximize the expected discounted accumulated reward averaged over all agents,
$$\sup_{\pi \in \boldsymbol{\Pi}} J(\pi) := \sup_{\pi \in \boldsymbol{\Pi}} \int_0^1 \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r\big(x^\alpha_t, u^\alpha_t, \nu^\alpha_t\big)\Big]\, d\alpha,$$
with discount factor $\gamma \in (0, 1)$.
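For a numerical illustration of the neighborhood mean-field measure (2.9), the integral over the label space can be approximated by a Riemann sum on a finite label grid; the grid and the uniform test distributions below are assumptions made purely for illustration.

```python
import numpy as np

def neighborhood_mean_field(mu, W, grid):
    """Approximate nu^alpha = int_0^1 W(alpha, beta) mu^beta dbeta on a label grid.

    mu:   (L, X) array, row l is the state distribution of label grid[l]
    W:    graphon
    grid: length-L array of labels in [0, 1]
    """
    L = len(grid)
    Wmat = np.array([[W(a, b) for b in grid] for a in grid])  # (L, L) graphon evaluations
    return Wmat @ mu / L                                      # Riemann-sum approximation

# Illustrative usage: 4 labels, 2 states, uniform state distributions.
grid = (np.arange(4) + 0.5) / 4
mu = np.full((4, 2), 0.5)
print(neighborhood_mean_field(mu, lambda a, b: a * b, grid))
```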
3 Main Results
3.1 Reformulation of GMFC
In this section, we show that the GMFC problem of Section 2.3 can be reformulated as an MDP with deterministic dynamics and continuous state-action space $\mathcal{D} \times \boldsymbol{\Pi}$.
Theorem 3.1
(3.4) and (3.2) describe the evolution of the state distribution ensemble over time. That is, under a fixed policy ensemble $\pi$, the state distribution of agent $\alpha$ at time $t+1$ is fully determined by the policy ensemble $\pi$ and the state distribution ensemble $\mu_t$ at time $t$. Note that the state distribution of each agent is fully coupled with the state distributions of the whole population via the graphon $W$.
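The deterministic evolution described above can be made explicit on a discretized label grid. The sketch below propagates a state distribution ensemble one step forward under a fixed policy ensemble; the transition kernel `P` and the discretization are user-supplied placeholders, so this is only an illustration of the update, not the exact construction in Theorem 3.1.

```python
import numpy as np

def propagate(mu, pi, P, W, grid):
    """One deterministic step of the state distribution ensemble:
    mu^alpha_{t+1}(x') = sum_{x,u} mu^alpha_t(x) * pi^alpha(u|x) * P(x'|x, u, nu^alpha_t).

    mu: (L, X) distributions per label;  pi: (L, X, U) policies per label
    P:  callable P(x, u, nu) -> length-X array of next-state probabilities
    """
    L, X = mu.shape
    U = pi.shape[2]
    Wmat = np.array([[W(a, b) for b in grid] for a in grid])
    nu = Wmat @ mu / L                       # neighborhood mean-field measures per label
    mu_next = np.zeros_like(mu)
    for l in range(L):
        for x in range(X):
            for u in range(U):
                mu_next[l] += mu[l, x] * pi[l, x, u] * P(x, u, nu[l])
    return mu_next
```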
With the reformulation in Theorem 3.1, denote by $\Phi(\mu, \pi)$ the deterministic one-step update of the state distribution ensemble and by $R(\mu, \pi)$ the aggregated reward under $(\mu, \pi)$. The associated Q function starting from $\mu \in \mathcal{D}$ is defined as
$$Q(\mu, \pi) := \sup_{(\pi_t)_{t \ge 1}} \sum_{t=0}^{\infty} \gamma^t R(\mu_t, \pi_t), \qquad \mu_0 = \mu, \quad \pi_0 = \pi, \quad \mu_{t+1} = \Phi(\mu_t, \pi_t). \qquad (3.5)$$
Hence its Bellman equation is given by
$$Q(\mu, \pi) = R(\mu, \pi) + \gamma \sup_{\pi' \in \boldsymbol{\Pi}} Q\big(\Phi(\mu, \pi), \pi'\big). \qquad (3.6)$$
Remark 3.2
3.2 Approximation
In this section, we show that the GMFC problem of Section 2.3 provides a good approximation for the cooperative multi-agent system of Section 2.2 in terms of the value function and the optimal policy ensemble. To do this, the following assumptions on the graphon $W$, the transition probability $P$, the reward $r$, and the policy ensembles are needed.
Assumption 3.3 (graphon $W$)
There exists $L_W > 0$ such that for all $\alpha_1, \alpha_2, \beta \in [0, 1]$,
$$|W(\alpha_1, \beta) - W(\alpha_2, \beta)| \le L_W\, |\alpha_1 - \alpha_2|.$$
Assumption 3.3 is common in graphon mean-field theory [21, 15, 29]. Indeed, the Lipschitz continuity assumption on $W$ in Assumption 3.3 can be relaxed to piecewise Lipschitz continuity on $W$.
Assumption 3.4 (transition probability $P$)
There exists $L_P > 0$ such that for all $x \in \mathcal{X}$, $u \in \mathcal{U}$, $\nu_1, \nu_2 \in \mathcal{M}(\mathcal{X})$,
$$\big\| P(\cdot \mid x, u, \nu_1) - P(\cdot \mid x, u, \nu_2) \big\|_1 \le L_P\, \|\nu_1 - \nu_2\|_1,$$
where $\|\cdot\|_1$ denotes the $\ell_1$ norm here and throughout the paper.
Assumption 3.5 (reward $r$)
There exist $M_r > 0$ and $L_r > 0$ such that for all $x \in \mathcal{X}$, $u \in \mathcal{U}$, $\nu_1, \nu_2 \in \mathcal{M}(\mathcal{X})$,
$$|r(x, u, \nu_1)| \le M_r, \qquad |r(x, u, \nu_1) - r(x, u, \nu_2)| \le L_r\, \|\nu_1 - \nu_2\|_1.$$
Assumption 3.6 (policy $\pi$)
There exists $L_\pi > 0$ such that any policy ensemble $\pi \in \boldsymbol{\Pi}$ is Lipschitz continuous in the graphon index, i.e., for all $\alpha_1, \alpha_2 \in [0,1]$ and $x \in \mathcal{X}$,
$$\big\| \pi^{\alpha_1}(\cdot \mid x) - \pi^{\alpha_2}(\cdot \mid x) \big\|_1 \le L_\pi\, |\alpha_1 - \alpha_2|.$$
Assumptions 3.4-3.6 are standard and commonly used to bridge the multi-agent system and mean-field theory.
To show the approximation properties of GMFC in the large-scale multi-agent system, we need to relate policy ensembles of GMFC to policies of the multi-agent system. On one hand, any policy ensemble $\pi \in \boldsymbol{\Pi}$ leads to an $N$-agent policy tuple $(\pi^1, \dots, \pi^N)$ with
$$\pi^i := \pi^{i/N}, \qquad i = 1, \dots, N. \qquad (3.7)$$
On the other hand, any $N$-agent policy tuple $(\pi^1, \dots, \pi^N)$ can be seen as a step policy ensemble in $\boldsymbol{\Pi}$:
$$\pi^\alpha := \sum_{i=1}^{N} \pi^i\, \mathbb{1}_{\{\alpha \in [\frac{i-1}{N}, \frac{i}{N})\}}. \qquad (3.8)$$
Theorem 3.7 (Approximate Pareto Property)
Assume Assumptions 2.1, 3.3, 3.4, 3.5 and 3.6. Then under either the condition (C1) or (C2), we have, for any initial distribution ensemble $\mu_0$,
$$\lim_{N \to \infty} \big| J_N(\pi^1, \dots, \pi^N) - J(\pi) \big| = 0, \qquad (3.9)$$
where $J_N$ denotes the expected discounted accumulated reward of the $N$-agent system averaged over all agents and $(\pi^1, \dots, \pi^N)$ is associated with $\pi$ via (3.7).
Moreover, if the graphon convergence in Assumption 2.1 is at rate $\mathcal{O}(1/\sqrt{N})$, then the convergence in (3.9) is also at rate $\mathcal{O}(1/\sqrt{N})$. As a consequence, for any $\varepsilon > 0$, there exists an integer $N_\varepsilon$ such that when $N \ge N_\varepsilon$, the optimal policy ensemble of GMFC, denoted as $\pi^*$ (if it exists), provides an $\varepsilon$-Pareto optimum for the multi-agent system of Section 2.2, with the $N$-agent policy tuple defined in (3.7).
Directly learning the Q function of GMFC in (3.6) would lead to high complexity. Instead, we will introduce a smaller class of GMFC with a lower dimension in the next section, which enables a scalable algorithm.
3.3 Algorithm Design
This section shows that discretizing the graphon index of GMFC allows the Q function in (3.6) to be approximated by the Q function in (3.10) below, which is defined on a smaller space; this is critical for designing efficient learning algorithms.
Precisely, we choose uniform grid points $\alpha_m := \frac{m}{M}$, $m = 1, \dots, M$, for simplicity, and approximate each agent $\alpha \in [0,1]$ by the nearest grid point. Introduce the block quantities $\mu^m_t := \mu^{\alpha_m}_t$ and $\pi^m := \pi^{\alpha_m}$, $m = 1, \dots, M$. Meanwhile, $(\mu^m_t)_{m=1}^M$ and $(\pi^m)_{m=1}^M$ can be viewed as a piecewise constant state distribution ensemble in $\mathcal{D}$ and a piecewise constant policy ensemble in $\boldsymbol{\Pi}$, respectively. Our arguments can be easily generalized to nonuniform grids.
Consequently, instead of directly running algorithms according to (3.6) with a continuum of graphon labels, we work with GMFC with $M$ blocks, called block GMFC, in which agents in the same block are homogeneous. The Bellman equation for the Q function of block GMFC is given by
$$Q_M(\boldsymbol{\mu}, \boldsymbol{\pi}) = R_M(\boldsymbol{\mu}, \boldsymbol{\pi}) + \gamma \sup_{\boldsymbol{\pi}'} Q_M\big(\Phi_M(\boldsymbol{\mu}, \boldsymbol{\pi}), \boldsymbol{\pi}'\big), \qquad (3.10)$$
where, for $\boldsymbol{\mu} = (\mu^m)_{m=1}^M$ and $\boldsymbol{\pi} = (\pi^m)_{m=1}^M$, the neighborhood mean-field measure, the aggregated reward and the aggregated transition dynamics are given by
$$\nu^m := \frac{1}{M} \sum_{m'=1}^{M} W(\alpha_m, \alpha_{m'})\, \mu^{m'}, \qquad (3.11)$$
$$R_M(\boldsymbol{\mu}, \boldsymbol{\pi}) := \frac{1}{M} \sum_{m=1}^{M} \sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} r(x, u, \nu^m)\, \pi^m(u \mid x)\, \mu^m(x), \qquad (3.12)$$
$$\Phi_M(\boldsymbol{\mu}, \boldsymbol{\pi})^m(x') := \sum_{x \in \mathcal{X}} \sum_{u \in \mathcal{U}} P(x' \mid x, u, \nu^m)\, \pi^m(u \mid x)\, \mu^m(x), \qquad m = 1, \dots, M. \qquad (3.13)$$
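A minimal sketch of one environment step of block GMFC, following the block-level quantities (3.11)-(3.13): it returns the aggregated reward and the next block state distributions. The placeholder `reward` and `P` callables and the uniform grid $\alpha_m = m/M$ are assumptions for illustration.

```python
import numpy as np

def block_gmfc_step(mu, pi, W, reward, P):
    """One step of block GMFC: aggregated reward R_M and next block distributions Phi_M.

    mu:     (M, X) block state distributions
    pi:     (M, X, U) block policies
    reward: callable r(x, u, nu) -> float
    P:      callable P(x, u, nu) -> length-X next-state distribution
    """
    M, X = mu.shape
    U = pi.shape[2]
    alphas = np.arange(1, M + 1) / M
    Wmat = np.array([[W(a, b) for b in alphas] for a in alphas])
    nu = Wmat @ mu / M                               # block neighborhood mean-field measures (3.11)
    R, mu_next = 0.0, np.zeros_like(mu)
    for m in range(M):
        for x in range(X):
            for u in range(U):
                w = mu[m, x] * pi[m, x, u]
                R += w * reward(x, u, nu[m]) / M     # aggregated reward (3.12)
                mu_next[m] += w * P(x, u, nu[m])     # aggregated transition (3.13)
    return R, mu_next
```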
We see from (3.10) that block GMFC is an MDP with deterministic dynamics and continuous state-action space $\mathcal{P}(\mathcal{X})^M \times \Pi^M$. The following theorem shows that there exists an optimal policy ensemble of block GMFC in $\Pi^M$.
Theorem 3.8 (Existence of Optimal Policy Ensemble)
Furthermore, we show that with sufficiently fine partitions of the graphon index $[0,1]$, i.e., when $M$ is sufficiently large, block GMFC (3.10)-(3.13) approximates the multi-agent system in Section 2.2 well.
Theorem 3.9
Theorem 3.9 shows that the optimal policy ensemble of block GMFC is near-optimal for all sufficiently large multi-agent systems, meaning that block GMFC provides a good approximation for the multi-agent system.
Recall that block GMFC can be viewed as an MDP with deterministic dynamics and continuous state-action space. To learn block GMFC, one can adopt the kernel-based Q-learning method of [24] for MFC, a uniform discretization method, or deep reinforcement learning algorithms such as DDPG [37] for MFC as in [9], with theoretical guarantees. Since block GMFC has a higher dimension than classical MFC, we choose to adapt the deep RL algorithm Proximal Policy Optimization (PPO) [47] to block GMFC and then apply the learned policy ensemble of block GMFC to the multi-agent system to validate our theoretical findings. We describe the deployment of block GMFC in the multi-agent system in Algorithm 1, which we call N-agent GMFC.
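A sketch of the deployment step of N-agent GMFC (Algorithm 1) as we implement it: each agent $i$ is assigned to the block containing its label $i/N$ and samples an action from the learned block policy for its current state. The uniform block-assignment rule shown here is an implementation assumption.

```python
import numpy as np

def deploy_block_policy(states, block_policies, rng=None):
    """Sample one action per agent from the learned block GMFC policy ensemble.

    states:         length-N array of integer states
    block_policies: (M, X, U) array; block_policies[m, x] is a distribution over actions
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(states)
    M, _, U = block_policies.shape
    actions = np.empty(N, dtype=int)
    for i, x in enumerate(states):
        alpha = (i + 1) / N                            # graphon label of agent i
        m = min(int(np.ceil(alpha * M)) - 1, M - 1)    # block containing the label
        actions[i] = rng.choice(U, p=block_policies[m, x])
    return actions
```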
4 Proofs of Main Results
4.1 Proof of Theorem 3.7
To prove Theorem 3.7, we need the following two lemmas. For notational simplicity, we start by defining the step state distribution
(4.1)
Lemma 4.1 shows the convergence of the neighborhood empirical measure to the neighborhood mean-field measure.
Lemma 4.1
Moreover, if the graphon convergence in Assumption 2.1 is at rate $\mathcal{O}(1/\sqrt{N})$, then so is the convergence in (4.2).
Proof of Lemma 4.1.
We first prove (4.2) under the condition (C1) and then show (4.2) also holds under the condition (C2).
Case 1: condition (C1). Note that under (C1), the neighborhood empirical measure can be expressed via the step state distribution in (4.1). Then
For this term, we adopt Theorem 2 in [15] and have that, under the policy ensemble $\pi$ and the $N$-agent policy defined via (3.7),
Moreover, if the graphon convergence in Assumption 2.1 is at rate $\mathcal{O}(1/\sqrt{N})$, then this term is also at rate $\mathcal{O}(1/\sqrt{N})$.
By noting that ,
where the last inequality follows from the fact in [39] that convergence in cut norm is equivalent to convergence of the associated bilinear forms against bounded functions.
Case 2: condition (C2), i.e., $\xi^N_{ij}$ are Bernoulli$\big(W_N(\frac{i}{N}, \frac{j}{N})\big)$ random variables.
Note from Case 1 that the corresponding bound vanishes as $N \to \infty$, and does so at rate $\mathcal{O}(1/\sqrt{N})$ if the graphon convergence in Assumption 2.1 is at that rate. Therefore, it is enough to estimate the difference between the neighborhood empirical measures under (C2) and (C1).
We proceed with the same argument as in the proof of Theorem 6.3 in [24]. Precisely, conditioned on the states at time $t$, the centered interaction weights form a sequence of independent mean-zero bounded random variables, since $\xi^N_{ij} \in \{0, 1\}$ and $W_N$ takes values in $[0,1]$. This implies that each of them is sub-Gaussian with variance proxy bounded by 4. As a result, conditioned on the states at time $t$, their normalized sum is a mean-zero sub-Gaussian random variable with variance proxy of order $1/N$. By equation (2.66) in [51], we have
Therefore, combining the two estimates in Case 2, we conclude that when $\xi^N_{ij}$ are Bernoulli$\big(W_N(\frac{i}{N}, \frac{j}{N})\big)$ random variables, (4.2) holds under the condition (C2). ∎
Lemma 4.2 shows the convergence of the state distribution of the $N$-agent game to the state distribution of GMFC.
Lemma 4.2
Proof of Lemma 4.2.
First term of (4.4)
where the second inequality is from the continuity of the transition probability $P$, and the last inequality is from Lemma 4.1.
Second term of (4.4)
One can view this term as a function of the graphon index for any fixed state and action. Note that it satisfies a Lipschitz-type bound with a constant independent of $N$ and $t$. Since (4.3) holds at time $t$, then
Third term of (4.4)
where the second inequality is by the uniform boundedness of and the third inequality is from Assumption 3.6. ∎
Now we are ready to prove Theorem 3.7. We start by defining the aggregated reward over all possible actions under the policy
First term of (4.5)
Second term of (4.5)
From Lemma 4.2, we obtain the convergence of this term.
Third term of (4.5)
Therefore, combining the estimates of the three terms yields the desired result. ∎
4.2 Proof of Theorem 3.8
First, we see that (3.10) corresponds to the following optimal control problem
(4.7)
We first show the verification result and then prove the continuity property of the Q function in (4.8), which thus leads to Theorem 3.8.
Lemma 4.3 (Verification)
Proof of Lemma 4.3.
First, define the space of bounded functions equipped with the supremum norm. Then we define a Bellman operator
One can show that this Bellman operator is a contraction with modulus $\gamma$. By the Banach fixed point theorem, it admits a unique fixed point. As the Q function of (4.8) satisfies the Bellman equation, it is the unique solution of (3.10).
We next define the Bellman operator under a fixed policy ensemble with
Similarly, we can show that this operator is a contraction with modulus $\gamma$ and thus admits a unique fixed point. From this, we have
which implies the existence of an optimal policy ensemble. ∎
Proof of Lemma 4.4.
We will show that as , in the sense that ,
From (4.8),
By induction, we obtain
Therefore, if , then
where is a constant depending on . ∎
Now we prove Theorem 3.8.
4.3 Proof of Theorem 3.9
We first prove the following Lemma, which shows that GMFC and block GMFC become increasingly close to each other as the number of blocks becomes larger.
Proof of Lemma 4.5.
Based on Lemma 4.5, we have the following Proposition.
Proposition 4.6
Proof of Proposition 4.6.
Proof of Theorem 3.9.
Suppose that the two policies are optimal for the problem (4.7) and for the $N$-agent problem, respectively. From Proposition 4.6, for any $\varepsilon > 0$, there exists a sufficiently large $M$ such that
where by (3.8), .
From Theorem 3.7, for any $\varepsilon > 0$, there exists $N_\varepsilon$ such that for all $N \ge N_\varepsilon$,
Then we have
where the last step is due to the optimality of the $N$-agent optimal policy for the $N$-agent problem. This means that the optimal policy of block GMFC provides an $\varepsilon$-optimal policy for the multi-agent system when $N$ is sufficiently large. ∎
5 Experiments
In this section, we provide an empirical verification of our theoretical results, with two examples adapted from existing works on learning MFGs [16, 10] and learning GMFGs [15].
5.1 SIS Graphon Model
We consider the SIS graphon model of [16] under a cooperative setting. In this model, each agent shares a state space $\mathcal{X} = \{S, I\}$ and an action space $\mathcal{U} = \{C, D\}$, where $S$ is susceptible, $I$ is infected, $C$ represents keeping contact with others, and $D$ means keeping social distance. The transition probability of each agent is represented as follows:
$$P\big(I \mid S, u, \nu\big) = \big(\beta_1 \mathbb{1}_{\{u = C\}} + \beta_2 \mathbb{1}_{\{u = D\}}\big)\, \nu(I), \qquad P\big(S \mid I, u, \nu\big) = \delta,$$
where $\beta_1$ is the infection rate when keeping contact with others, $\beta_2$ is the infection rate under social distance, and $\delta$ is the fixed recovery rate. We assume $\beta_2 \le \beta_1$, meaning that keeping social distance reduces the risk of being infected. The individual reward function is defined as
$$r(x, u, \nu) = -\, c_I\, \mathbb{1}_{\{x = I\}} - c_D\, \mathbb{1}_{\{u = D\}} - c_{IC}\, \mathbb{1}_{\{x = I,\ u = C\}},$$
where $c_I$ represents the cost of being infected, such as the cost of medical treatment, $c_D$ represents the cost of keeping social distance, and $c_{IC}$ represents the penalty of going out while infected.
In our experiment, we set $\beta_1 = 0.8$ and $\beta_2 = 0$ for the transition dynamics, and $c_I = 2$ and $c_D = 0.3$ for the reward function. The initial mean field is taken as the uniform distribution. We set the episode length to 50.
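A minimal environment sketch of the SIS dynamics as reconstructed above: a susceptible agent keeping contact is infected with probability $\beta_1 \nu(I)$ (or $\beta_2 \nu(I)$ under social distance), and an infected agent recovers with probability $\delta$. The state and action encodings, the recovery rate, and the infected-going-out penalty are illustrative assumptions, since their exact values are not restated here.

```python
import numpy as np

S, I = 0, 1           # states: susceptible, infected
CONTACT, DIST = 0, 1  # actions: keep contact, keep social distance

def sis_step(x, u, nu_infected, beta1=0.8, beta2=0.0, delta=0.3, rng=None):
    """One transition of a single agent; delta is an illustrative recovery rate."""
    rng = np.random.default_rng() if rng is None else rng
    if x == S:
        rate = beta1 if u == CONTACT else beta2
        return I if rng.random() < rate * nu_infected else S
    return S if rng.random() < delta else I

def sis_reward(x, u, c_inf=2.0, c_dist=0.3, c_out=0.5):
    """Costs for being infected, distancing, and going out while infected (c_out illustrative)."""
    return -(c_inf * (x == I) + c_dist * (u == DIST) + c_out * (x == I and u == CONTACT))
```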
5.2 Malware Spread Graphon Model
We consider the malware spread model of [10] under a cooperative setting. In this model, let $x^i_t \in \mathcal{X} = \{0, 1, \dots, K-1\}$ denote the health level of agent $i$, where $0$ and $K-1$ represent the best level and the worst level, respectively. All agents can take two actions: $u = 0$ means doing nothing, and $u = 1$ means repairing. The state transition is given by
where $w^i_t$ are i.i.d. random variables with a certain probability distribution. Then after taking action $u^i_t$, agent $i$ will receive an individual reward
Here, considering the heterogeneity of agents, we use $\xi^N_{ij}$ to denote the importance effect of agent $j$ on agent $i$; the neighborhood term is the risk of being infected by other agents, and the last term is the cost of taking the repair action.
In our experiment, we set the number of health levels to 3 and the two reward coefficients to 0.3 and 0.5. In addition, to stabilize the training of the RL agent, we fix the disturbance parameter to a static value, i.e., 0.7. In this model, we set the episode length to 10.
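Since the exact transition and reward formulas are not reproduced above, the sketch below encodes one plausible reading of the model: doing nothing lets the health level degrade by an i.i.d. increment $w$ (capped at the worst level), repairing resets it to the best level, and the reward penalizes the own health level, the weighted infection risk from neighbors, and the repair cost. All functional forms and coefficient names here are assumptions for illustration only.

```python
import numpy as np

K = 3                  # number of health levels: 0 (best) ... K-1 (worst), illustrative
NOTHING, REPAIR = 0, 1

def malware_step(x, u, w):
    """Assumed dynamics: repairing resets the level, doing nothing degrades it by w."""
    return 0 if u == REPAIR else min(x + w, K - 1)

def malware_reward(x, u, nu_mean_level, risk_coef=0.3, repair_cost=0.5):
    """Assumed reward: own health level, weighted neighborhood risk, and repair cost."""
    return -(x + risk_coef * nu_mean_level + repair_cost * (u == REPAIR))
```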
5.3 Performance of N-agent GMFC on Multi-Agent System
For both models, we use PPO [47] to train the block GMFC agent in the infinite-agent environment to obtain the policy ensembles, and further use Algorithm 1 to deploy them in the finite-agent environment. We test the performance of N-agent GMFC with 10 blocks on different numbers of agents, from 10 to 100. For each case, we run 1000 simulations and show the mean and standard deviation (green shadows in Figures 1 and 2) of the mean episode reward. We can see that in both scenarios and for different types of graphons, the mean episode reward of N-agent GMFC becomes increasingly close to that of block GMFC as the number of agents grows (see Figures 1 and 2). This verifies our theoretical findings empirically.
[Figures 1 and 2: Mean episode reward of N-agent GMFC versus the number of agents, compared with block GMFC, for the three graphon types in the SIS and malware spread models.]
5.4 Comparison with Other Algorithms
For different types of graphons, we compare our algorithm, N-agent GMFC, with three existing MARL algorithms: two independent learning algorithms, independent DQN [40] and independent PPO [47], and a powerful centralized-training-and-decentralized-execution (CTDE) based algorithm, QMIX [46]. We test the performance of these algorithms with different numbers of blocks, i.e., 2, 5, and 10, on multi-agent systems with 40 agents. The results are reported in Table 1 and Table 2.
In the SIS graphon model, N-agent GMFC shows dominating performance in most cases and outperforms the independent learning algorithms by a large margin; only QMIX reaches comparable results. In the malware spread graphon model, N-agent GMFC outperforms the other algorithms in more than half of the cases; only independent DQN has comparable performance in this environment. In both environments, the performance gap between N-agent GMFC and the other MARL algorithms shrinks as the number of blocks grows. This is mainly because the action space of block GMFC grows more quickly than that of the MARL algorithms as the number of blocks increases, and it is hard to train RL agents when the action space is too large.
Beyond the results shown in Tables 1 and 2, when the number of agents grows larger, classic MARL methods become infeasible because of the curse of dimensionality and memory limitations, while N-agent GMFC is trained only once and is independent of the number of agents $N$; hence it is easier to scale up in the large-scale regime and enjoys more stable performance. Indeed, N-agent GMFC shows more stable results as $N$ increases, as shown in Figures 1 and 2.
Table 1: Mean episode reward in the SIS graphon model with 40 agents.

| Graphon Type | M | N-agent GMFC | I-DQN | I-PPO | QMIX |
|---|---|---|---|---|---|
| Erdős Rényi | 2 | -15.37 | -17.58 | -20.63 | -20.51 |
| Erdős Rényi | 5 | -15.74 | -16.17 | -20.42 | -16.94 |
| Erdős Rényi | 10 | -15.67 | -17.55 | -21.38 | -14.45 |
| Stochastic Block | 2 | -13.58 | -16.05 | -18.38 | -17.69 |
| Stochastic Block | 5 | -13.67 | -15.91 | -20.13 | -13.79 |
| Stochastic Block | 10 | -13.57 | -15.52 | -14.87 | -13.86 |
| Random Geometric | 2 | -12.45 | -17.93 | -14.82 | -14.52 |
| Random Geometric | 5 | -9.82 | -12.81 | -12.99 | -10.84 |
| Random Geometric | 10 | -10.52 | -11.68 | -12.66 | -12.60 |
Table 2: Mean episode reward in the malware spread graphon model with 40 agents.

| Graphon Type | M | N-agent GMFC | I-DQN | I-PPO | QMIX |
|---|---|---|---|---|---|
| Erdős Rényi | 2 | -5.21 | -5.11 | -5.31 | -6.05 |
| Erdős Rényi | 5 | -5.21 | -5.30 | -5.26 | -6.13 |
| Erdős Rényi | 10 | -5.21 | -5.14 | -5.27 | -5.21 |
| Stochastic Block | 2 | -5.16 | -5.21 | -5.37 | -5.88 |
| Stochastic Block | 5 | -5.10 | -5.19 | -5.31 | -5.70 |
| Stochastic Block | 10 | -5.09 | -5.05 | -5.28 | -5.27 |
| Random Geometric | 2 | -5.02 | -5.21 | -5.27 | -5.35 |
| Random Geometric | 5 | -4.85 | -5.03 | -5.04 | -5.05 |
| Random Geometric | 10 | -4.82 | -4.83 | -5.14 | -4.83 |
5.5 Implementation Details
We use three graphons in our experiments: (1) Erdős Rényi: $W(\alpha, \beta) = p$; (2) Stochastic block model: $W(\alpha, \beta) = p$ if $\alpha, \beta \in [0, \frac{1}{2}]$ or $\alpha, \beta \in (\frac{1}{2}, 1]$, and $W(\alpha, \beta) = q$ otherwise; (3) Random geometric graphon: $W(\alpha, \beta) = f\big(\min(|\alpha - \beta|, 1 - |\alpha - \beta|)\big)$, where $f$ is a non-increasing function.
For the RL algorithms, we use the implementation of RLlib [36] (version 1.11.0, Apache-2.0 license). For the PPO agent used to learn an optimal policy ensemble in block GMFC, we use a 64-dimensional linear layer to encode the observation and 2-layer MLPs with 256 hidden units per layer for both the value network and the actor network. For independent DQN and independent PPO, we use the default weight-sharing model with 64-dimensional embedding layers. We train the GMFC PPO agent for 1000 iterations, and the other three MARL agents for 200 iterations. The specific hyper-parameters are listed in Table 3, and a training sketch is given after the table.
Table 3: Hyper-parameters of the learning algorithms.

| Hyper-parameter | GMFC PPO | I-DQN | I-PPO | QMIX |
|---|---|---|---|---|
| Learning rate | 0.0005 | 0.0005 | 0.0001 | 0.00005 |
| Learning rate decay | True | True | True | False |
| Discount factor | 0.95 | 0.95 | 0.95 | 0.95 |
| Batch size | 128 | 128 | 128 | 128 |
| KL coefficient | 0.2 | - | 0.2 | - |
| KL target | 0.01 | - | 0.01 | - |
| Buffer size | - | 2000 | - | 2000 |
| Target network update frequency | - | 2000 | - | 1000 |
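For reference, a minimal sketch of how the block GMFC PPO agent can be trained with RLlib 1.x using the hyper-parameters of Table 3. `BlockGMFCEnv` is a hypothetical gym-style environment wrapping the block GMFC dynamics (not shown), the custom 64-dimensional observation encoder is omitted, and mapping the reported batch size onto RLlib's `sgd_minibatch_size` is our assumption.

```python
import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

from my_envs import BlockGMFCEnv  # hypothetical env exposing block GMFC as a gym.Env

ray.init()
tune.register_env("block_gmfc", lambda cfg: BlockGMFCEnv(**cfg))

config = {
    "env": "block_gmfc",
    "env_config": {"num_blocks": 10, "graphon": "stochastic_block"},  # hypothetical options
    "framework": "torch",
    "gamma": 0.95,                      # discount factor (Table 3)
    "lr": 5e-4,                         # GMFC PPO learning rate (Table 3)
    "kl_coeff": 0.2,                    # KL coefficient (Table 3)
    "kl_target": 0.01,                  # KL target (Table 3)
    "sgd_minibatch_size": 128,          # our reading of "batch size 128"
    "model": {"fcnet_hiddens": [256, 256]},
}

trainer = PPOTrainer(config=config)
for it in range(1000):                  # 1000 training iterations as reported
    result = trainer.train()
    if (it + 1) % 100 == 0:
        print(it + 1, result["episode_reward_mean"])
```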
6 Conclusion
In this work, we have proposed a discrete-time GMFC framework for MARL with nonuniform interactions on dense graphs. Theoretically, we have shown that under suitable assumptions, GMFC approximates MARL well with an approximation error of order $\mathcal{O}(1/\sqrt{N})$. To reduce the dimension of GMFC, we have introduced block GMFC by discretizing the graphon index and shown that it also approximates MARL well. Empirical studies on several examples have verified the plausibility of the GMFC framework. For future research, we are interested in establishing theoretical guarantees of the PPO-based algorithm for block GMFC, learning the graph structure of MARL, and extending our framework to MARL with nonuniform interactions on sparse graphs.
References
- Adler and Blue [2002] Adler, J.L., Blue, V.J., 2002. A cooperative multi-agent transportation management and route guidance system. Transportation Research Part C: Emerging Technologies 10, 433–454.
- Angiuli et al. [2022] Angiuli, A., Fouque, J.P., Laurière, M., 2022. Unified reinforcement Q-learning for mean field game and control problems. Mathematics of Control, Signals, and Systems , 1–55.
- Bayraktar et al. [2020] Bayraktar, E., Chakraborty, S., Wu, R., 2020. Graphon mean field systems. arXiv preprint arXiv:2003.13180 .
- Bet et al. [2020] Bet, G., Coppini, F., Nardi, F.R., 2020. Weakly interacting oscillators on dense random graphs. arXiv preprint arXiv:2006.07670 .
- Caines and Huang [2019] Caines, P.E., Huang, M., 2019. Graphon mean field games and the GMFG equations: ε-Nash equilibria, in: 2019 IEEE 58th Conference on Decision and Control (CDC), IEEE. pp. 286–292.
- Cardaliaguet and Lehalle [2018] Cardaliaguet, P., Lehalle, C.A., 2018. Mean field game of controls and an application to trade crowding. Mathematics and Financial Economics 12, 335–363.
- Carmona et al. [2022] Carmona, R., Cooney, D.B., Graves, C.V., Lauriere, M., 2022. Stochastic graphon games: I. the static case. Mathematics of Operations Research 47, 750–778.
- Carmona and Delarue [2013] Carmona, R., Delarue, F., 2013. Probabilistic analysis of mean-field games. SIAM Journal on Control and Optimization 51, 2705–2734.
- Carmona et al. [2019] Carmona, R., Laurière, M., Tan, Z., 2019. Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. arXiv preprint arXiv:1910.12802 .
- Chen et al. [2021] Chen, Y., Liu, J., Khoussainov, B., 2021. Agent-level maximum entropy inverse reinforcement learning for mean field games. arXiv preprint arXiv:2104.14654 .
- Choi et al. [2009] Choi, J., Oh, S., Horowitz, R., 2009. Distributed learning and cooperative control for multi-agent systems. Automatica 45, 2802–2814.
- Cortes et al. [2004] Cortes, J., Martinez, S., Karatas, T., Bullo, F., 2004. Coverage control for mobile sensing networks. IEEE Transactions on Robotics and Automation 20, 243–255.
- Cui et al. [2019] Cui, J., Liu, Y., Nallanathan, A., 2019. Multi-agent reinforcement learning-based resource allocation for UAV networks. IEEE Transactions on Wireless Communications 19, 729–743.
- Cui and Koeppl [2021] Cui, K., Koeppl, H., 2021. Approximately solving mean field games via entropy-regularized deep reinforcement learning, in: International Conference on Artificial Intelligence and Statistics, PMLR. pp. 1909–1917.
- Cui and Koeppl [2022] Cui, K., Koeppl, H., 2022. Learning graphon mean field games and approximate Nash equilibria, in: International Conference on Learning Representations.
- Cui et al. [2021] Cui, K., Tahir, A., Sinzger, M., Koeppl, H., 2021. Discrete-time mean field control with environment states, in: 2021 60th IEEE Conference on Decision and Control (CDC), IEEE. pp. 5239–5246.
- Delarue [2017] Delarue, F., 2017. Mean field games: A toy model on an Erdös-Renyi graph. ESAIM: Proceedings and Surveys 60, 1–26.
- Elie et al. [2020] Elie, R., Perolat, J., Laurière, M., Geist, M., Pietquin, O., 2020. On the convergence of model free learning in mean field games, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7143–7150.
- Esunge and Wu [2014] Esunge, J.N., Wu, J., 2014. Convergence of weighted empirical measures. Stochastic Analysis and Applications 32, 802–819.
- Flyvbjerg et al. [1993] Flyvbjerg, H., Sneppen, K., Bak, P., 1993. Mean field theory for a simple model of evolution. Physical review letters 71, 4087.
- Gao and Caines [2019a] Gao, S., Caines, P.E., 2019a. Graphon control of large-scale networks of linear systems. IEEE Transactions on Automatic Control 65, 4090–4105.
- Gao and Caines [2019b] Gao, S., Caines, P.E., 2019b. Spectral representations of graphons in very large network systems control, in: 2019 IEEE 58th conference on decision and Control (CDC), IEEE. pp. 5068–5075.
- Gu et al. [2019] Gu, H., Guo, X., Wei, X., Xu, R., 2019. Dynamic programming principles for mean-field controls with learning. arXiv preprint arXiv:1911.07314 .
- Gu et al. [2021] Gu, H., Guo, X., Wei, X., Xu, R., 2021. Mean-field controls with Q-learning for cooperative MARL: convergence and complexity analysis. SIAM Journal on Mathematics of Data Science 3, 1168–1196.
- Guo et al. [2019] Guo, X., Hu, A., Xu, R., Zhang, J., 2019. Learning mean-field games. Advances in Neural Information Processing Systems 32.
- Hadikhanloo and Silva [2019] Hadikhanloo, S., Silva, F.J., 2019. Finite mean field games: fictitious play and convergence to a first order continuous mean field game. Journal de Mathématiques Pures et Appliquées 132, 369–397.
- Huang et al. [2006] Huang, M., Malhamé, R.P., Caines, P.E., 2006. Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems 6, 221–252.
- Hughes et al. [2018] Hughes, E., Leibo, J.Z., Phillips, M., Tuyls, K., Dueñez-Guzman, E., García Castañeda, A., Dunning, I., Zhu, T., McKee, K., Koster, R., et al., 2018. Inequity aversion improves cooperation in intertemporal social dilemmas. Advances in neural information processing systems 31.
- Lacker and Soret [2022] Lacker, D., Soret, A., 2022. A label-state formulation of stochastic graphon games and approximate equilibria on large networks. arXiv preprint arXiv:2204.09642 .
- Lasry and Lions [2007] Lasry, J.M., Lions, P.L., 2007. Mean field games. Japanese Journal of Mathematics 2, 229–260.
- Lee et al. [2007] Lee, J.W., Park, J., Jangmin, O., Lee, J., Hong, E., 2007. A multiagent approach to Q-learning for daily stock trading. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 37, 864–877.
- Lee and Zhang [2002] Lee, J.W., Zhang, B.T., 2002. Stock trading system using reinforcement learning with cooperative agents, in: Proceedings of the Nineteenth International Conference on Machine Learning, pp. 451–458.
- Leibo et al. [2017] Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., Graepel, T., 2017. Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037 .
- Lerer and Peysakhovich [2017] Lerer, A., Peysakhovich, A., 2017. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068 .
- Li et al. [2021] Li, Y., Wang, L., Yang, J., Wang, E., Wang, Z., Zhao, T., Zha, H., 2021. Permutation invariant policy optimization for mean-field multi-agent reinforcement learning: A principled approach. arXiv preprint arXiv:2105.08268 .
- Liang et al. [2018] Liang, E., Liaw, R., Nishihara, R., Moritz, P., Fox, R., Goldberg, K., Gonzalez, J.E., Jordan, M.I., Stoica, I., 2018. RLlib: Abstractions for distributed reinforcement learning, in: International Conference on Machine Learning (ICML).
- Lillicrap et al. [2016] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D., 2016. Continuous control with deep reinforcement learning, in: International Conference on Learning Representations.
- Lin et al. [2021] Lin, Y., Qu, G., Huang, L., Wierman, A., 2021. Multi-agent reinforcement learning in stochastic networked systems. Advances in Neural Information Processing Systems 34, 7825–7837.
- Lovász [2012] Lovász, L., 2012. Large networks and graph limits. volume 60. American Mathematical Soc.
- Mnih et al. [2013] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M., 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 .
- Mondal et al. [2022a] Mondal, W.U., Agarwal, M., Aggarwal, V., Ukkusuri, S.V., 2022a. On the approximation of cooperative heterogeneous multi-agent reinforcement learning (MARL) using mean field control (MFC). Journal of Machine Learning Research 23, 1–46.
- Mondal et al. [2022b] Mondal, W.U., Aggarwal, V., Ukkusuri, S.V., 2022b. Can mean field control (MFC) approximate cooperative multi agent reinforcement learning (MARL) with non-uniform interaction? arXiv preprint arXiv:2203.00035 .
- Motte and Pham [2022] Motte, M., Pham, H., 2022. Mean-field Markov decision processes with common noise and open-loop controls. The Annals of Applied Probability 32, 1421–1458.
- Pasztor et al. [2021] Pasztor, B., Bogunovic, I., Krause, A., 2021. Efficient model-based multi-agent mean-field reinforcement learning. arXiv preprint arXiv:2107.04050 .
- Qu et al. [2020] Qu, G., Wierman, A., Li, N., 2020. Scalable reinforcement learning of localized policies for multi-agent networked systems, in: Learning for Dynamics and Control, PMLR. pp. 256–266.
- Rashid et al. [2018] Rashid, T., Samvelyan, M., Schroeder, C., Farquhar, G., Foerster, J., Whiteson, S., 2018. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning, in: International Conference on Machine Learning, PMLR. pp. 4295–4304.
- Schulman et al. [2017] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 .
- Subramanian and Mahajan [2019] Subramanian, J., Mahajan, A., 2019. Reinforcement learning in stationary mean-field games, in: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 251–259.
- Vasal et al. [2021] Vasal, D., Mishra, R., Vishwanath, S., 2021. Sequential decomposition of graphon mean field games, in: 2021 American Control Conference (ACC), IEEE. pp. 730–736.
- W Axhausen et al. [2016] W Axhausen, K., Horni, A., Nagel, K., 2016. The multi-agent transport simulation MATSim. Ubiquity Press.
- Wainwright [2019] Wainwright, M.J., 2019. High-dimensional statistics: A non-asymptotic viewpoint. volume 48. Cambridge University Press.
- Yang et al. [2018] Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., Wang, J., 2018. Mean field multi-agent reinforcement learning, in: International Conference on Machine Learning, PMLR. pp. 5571–5580.
- Yin et al. [2010] Yin, H., Mehta, P.G., Meyn, S.P., Shanbhag, U.V., 2010. Learning in mean-field oscillator games, in: 49th IEEE Conference on Decision and Control (CDC), IEEE. pp. 3125–3132.