Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach
Yan Li†, Lingxiao Wang⋄, Jiachen Yang†, Ethan Wang†,
Zhaoran Wang⋄, Tuo Zhao†, Hongyuan Zha∗††
Yan Li, Jiachen Yang, Ethan Wang, and Tuo Zhao are affiliated with Georgia Institute of Technology; Lingxiao Wang, Zhaoran Wang are affiliated with Northwestern University; Hongyuan Zha is affiliated with The Chinese University of Hong Kong, Shenzhen. Tuo Zhao is the corresponding author; Email: tourzhao@gatech.edu.
Abstract
Multi-agent reinforcement learning (MARL) becomes more challenging in the presence of more agents, as the size of the joint state and action spaces grows exponentially in the number of agents. To address this challenge of scale, we identify a class of cooperative MARL problems with permutation invariance, and formulate them as a mean-field Markov decision process (MDP). To exploit the permutation invariance therein, we propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture.
We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence. Moreover, its sample complexity is independent of the number of agents. We validate the theoretical advantages of MF-PPO with numerical experiments in the multi-agent particle environment (MPE). In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors with a smaller number of model parameters, which is the key to its generalization performance.
1 Introduction
Multi-Agent Reinforcement Learning (Littman, 1994; Zhang et al., 2019) generalizes Reinforcement Learning (Sutton and Barto, 2018) to the sequential decision-making problem of multiple agents, each maximizing its own long-term reward while interacting with the others in a common environment.
With breakthroughs in deep learning, MARL algorithms equipped with deep neural networks have seen significant empirical successes in various domains,
including simulated autonomous driving (Shalev-Shwartz et al., 2016), multi-agent robotic control (Matarić, 1997; Kober et al., 2013), and E-sports (Vinyals et al., 2019).
Despite tremendous successes, MARL is notoriously hard to scale to the many-agent setting, as the size of the state-action space grows exponentially with respect to the number of agents.
This phenomenon has recently been described as the curse of many agents (Menda et al., 2018).
To tackle this challenge,
we focus on cooperative MARL,
where agents work together to maximize their team reward (Panait and Luke, 2005).
We identify and exploit a key property of cooperative MARL with homogeneous agents, namely the invariance with respect to the permutation of agents.
Such permutation invariance can be found in many real-world scenarios with homogeneous agents, such as distributed control of multiple autonomous vehicles and team sports (Cao et al., 2013; Kalyanakrishnan et al., 2006),
and also in the case of heterogeneous groups where invariance holds within each group (Liu et al., 2019b).
More importantly, we further find that permutation invariance has significant practical implications, as the optimal value functions remain invariant when permuting the joint state-action pairs.
Such an observation strongly advocates a permutation invariant design for learning, which helps reduce the effective search space of the policy/value functions from exponential dependence on the number of agents to polynomial dependence.
Several empirical methods have been proposed to incorporate permutation invariance into solving MARL problems.
Liu et al. (2019b) implement a permutation invariant critic based on Graph Convolutional Network (GCN) (Kipf and Welling, 2017).
Sunehag et al. (2017) propose value decomposition, which together with parameter sharing, leads to a joint critic network that is permutation invariant over agents.
While these methods are based on heuristics, we are the first to provide theoretical principles for introducing permutation invariance as an inductive bias
for learning value functions and policies in homogeneous systems.
In addition, we adopt the DeepSet (Zaheer et al., 2017) architecture, which is well suited to handling the homogeneity of agents: it induces permutation invariance with much simpler operations and, by doing so, is considerably more parameter-efficient.
To scale up learning in MARL problems with a large number of agents, or even infinitely many,
mean-field approximation has been explored to directly model the population behavior of the agents.
Yang et al. (2017) consider a mean-field game with deterministic linear state transitions, and show it can be reformulated as a mean-field MDP, with mean-field state residing in a finite-dimensional probability simplex.
Yang et al. (2018) take the mean-field approximation over actions, such that the interaction for any given agent and the population is approximated by the interaction between the agent’s action and the averaged actions of its neighboring agents.
However, the motivation for averaging over local actions remains unclear, and it generally requires a sparse graph over agents. In practice, properly identifying such structure also demands extensive prior knowledge.
Carmona et al. (2019) motivate a mean-field MDP from the perspective of mean-field control.
The mean-field state therein lies in a probability simplex and thus continuous in nature. To enable the ensuing Q-learning algorithm, discretization of the joint state-action space is necessary.
Wang et al. (2020) motivate a mean-field MDP from permutation invariance, but assume a central controller coordinating the actions of all the agents; their approach therefore only addresses the curse of many agents arising from the exponential blowup of the joint state space.
In contrast, our formulation of mean-field approximation allows agents to make their own local actions without resorting to a centralized controller.
We propose a mean-field Markov decision process motivated from the permutation invariance structure of cooperative MARL, which can be viewed as a natural limit of finite-agent MDP by taking the number of agents to infinity.
Such a mean-field MDP generalizes traditional MDP, with each state representing a distribution over the state space of a single agent.
The mean-field MDP provides us with a tractable formulation for modeling MDPs with many, or even infinitely many, agents.
We further propose the Mean-Field Proximal Policy Optimization (MF-PPO) algorithm, at the core of which is a pair of permutation invariant actor and critic neural networks.
These networks are implemented based on DeepSet (Zaheer et al., 2017), which uses convolutional type operations to induce permutation invariance over the set of inputs.
We show that with sufficiently many agents, MF-PPO converges to the optimal policy of the mean-field MDP with a sublinear sample complexity independent of the number of agents.
To support our theory, we conduct numerical experiments on the benchmark multi-agent particle environment (MPE) and show that our proposed method requires a smaller number of model parameters and attains a better performance compared to multiple baselines.
Notations.
For a set , we denote as the set of distributions on . denotes the Dirac measure supported at .
For , we use to denote that each is independently sampled from distribution .
For a function and a distribution , we write . We write in short for .
2 Problem Setup
We focus on studying multi-agent systems with cooperative, homogeneous agents, where the agents within the system are of similar nature and hence cannot be distinguished from each other.
Specifically, we consider a discrete time control problem with agents, formulated as a Markov decision process .
We define the joint state space to be the Cartesian product of the finite state space for each agent, and similarly define joint action space .
The homogeneous nature of the system is reflected in the transition kernel and the shared reward , which satisfy:
(1)
for all and permutation mapping .
In other words, it is the configuration, rather than the identities of the agents, that affects the team reward, and the transition to the next configuration depends solely on the current configuration. See Figure 1 for a detailed illustration.
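Concretely, writing $\kappa$ for a permutation of the agent indices, with $\kappa(s)=(s_{\kappa(1)},\dots,s_{\kappa(N)})$ and $\kappa(a)$ defined analogously (our notation), condition (1) requires
\[
r\big(\kappa(s),\kappa(a)\big) = r(s,a), \qquad \mathbb{P}\big(\kappa(s') \,\big|\, \kappa(s),\kappa(a)\big) = \mathbb{P}\big(s' \,\big|\, s,a\big),
\]
for all joint states $s, s'$ and joint actions $a$.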
Such permutation invariance can be found in many real-world scenarios, such as distributed control of multiple autonomous vehicles, team sports and social economic systems (Zheng et al., 2020; Cao et al., 2013; Kalyanakrishnan et al., 2006).
Figure 1: Illustration of permutation invariance. Exchanging identities of the agents (first and third row) does not change the transition of the underlying system (second row).
Our goal is to find the optimal policy within a policy class which maximizes the expected discounted reward
Our first result shows that learning with permutation invariance advocates permutation invariant network design.
Proposition 2.1.
For cooperative MARL satisfying (1), there exists an optimal policy that is permutation invariant, i.e., for any permutation mapping .
In addition, for any permutation invariant policy , the value function and the state-action value function are also permutation invariant:
where .
Proposition 2.1 has an important implication for architecture design, as it states that it suffices to search within the permutation invariant policy class and value function class. To the best of our knowledge, this is the first theoretical justification of permutation invariant network design for learning with homogeneous agents.
We focus on the factorized policy class with a parameter sharing scheme, where each agent makes its own decision given its local state.
Specifically, the joint policy can be factorized as , where denotes the shared local mapping.
Such a policy class is widely adopted in the celebrated centralized training with decentralized execution paradigm (Lowe et al., 2017), due to its light overhead in the deployment phase and its favorable performance.
However, directly learning such a factorized policy remains challenging, as each agent needs to estimate its state-action value function, denoted as . The search space during learning is , which scales exponentially with respect to the number of agents. This large search space poses a significant roadblock to efficient learning, and was recently termed the curse of many agents (Menda et al., 2018).
To address the curse of many agents, we exploit the homogeneity of the system and take mean-field approximation. We begin by taking the perspective of agent , which is arbitrarily chosen from the agents.
We denote the state of such agent by and the states of the rest of the agents by .
One can verify that permuting the states of all the other agents leaves the value function unchanged; additionally, we can further characterize the value function as a function of the local state and the empirical state distribution over the rest of the agents.
Proposition 2.2.
For any permutation mapping ,
the value function satisfies
Additionally, there exists such that:
where is the empirical state distribution over the rest of the agents.
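Concretely, if agent $i$ is in state $s_i$ and the remaining agents are in states $\{s_j\}_{j\neq i}$, the empirical state distribution over the rest of the agents can be written (in our shorthand) as
\[
\widehat{\mu}_{-i} \;=\; \frac{1}{N-1}\sum_{j\neq i}\delta_{s_j},
\]
where $\delta_{s}$ denotes the Dirac measure at $s$; note that $\widehat{\mu}_{-i}$ is itself invariant to permuting the other agents.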
For a system with a large number of agents (e.g., financial markets, social networks), the empirical state distribution can be seen as the concrete realization of the underlying distribution of the population.
Motivated by this observation and Proposition 2.2, we formulate the following mean-field MDP, which can be seen as the limit of the finite-agent MDP in the presence of infinitely many homogeneous agents.
Definition 2.1(mean-field MDP).
The mean-field MDP consists of elements of the following: state ; action ;
reward ; transition kernel ).
The defined mean-field MDP has an intimate connection with our previously discussed finite-agent MDP: here represents the joint state viewed from the individual agent with state ;
each agent makes a local decision using the shared local mapping and receives a local reward, whose cumulative effect is captured by the team reward ; finally, the joint state transits to according to the kernel .
We can clearly see that the mean-field MDP has a step-by-step correspondence with the finite-agent MDP with homogeneous agents, making it a natural generalization to the setting with a large number of agents.
Our goal is to efficiently learn a policy that maximizes the expected discounted reward. To facilitate the ensuing discussion, we define the value function as
where , and Q-function as
where .
The optimal policy is then denoted by
Despite the intuitive analogy to the finite-agent MDP, solving the mean-field MDP poses some unique challenges.
In addition to the unknown transition kernel and reward, the mean-field MDP takes a distribution as its state, of which we do not have complete information during training.
In the following section, we propose our mean-field Proximal Policy Optimization (MF-PPO) algorithm, which, with a careful architecture design, solves such a mean-field MDP efficiently in a model-free fashion.
3 Mean-field Proximal Policy Optimization
Our algorithm falls into the category of the actor-critic learning paradigm, consisting of alternating iterations of policy evaluation and improvement.
The unique features of MF-PPO are the following:
(1) it exploits the permutation invariance of the system, drastically reducing the search space of the actor/critic networks and enabling much more efficient learning;
(2) it can handle a varying number of agents. For simplicity of exposition, we consider a fixed number of agents here.
Throughout the rest of the section, we assume that for any joint state , the agent has access to samples from . We denote concatenation of such samples as and write .
MF-PPO maintains a pair of actor (denoted by ) and critic networks (denoted by ), and uses the actor network to induce an energy-based policy .
Specifically, given state , the actor network induces a distribution on the action set according to
where denotes the temperature parameter.
We use to denote the dependency of the policy on the energy function.
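For concreteness, a minimal sketch of how an energy function over a finite action set induces such a policy; the names f, state, actions, and tau are our placeholders rather than the paper's notation, and the actor network is treated as a black-box callable returning a scalar energy.

import numpy as np

def energy_based_policy(f, state, actions, tau):
    """Return a categorical distribution over `actions` proportional to
    exp(f(state, a) / tau), where f is the actor (energy) network and tau
    is the temperature parameter."""
    energies = np.array([f(state, a) for a in actions]) / tau
    energies -= energies.max()              # subtract max for numerical stability
    probs = np.exp(energies)
    return probs / probs.sum()

# Sampling an action from the induced policy:
# rng = np.random.default_rng(0)
# probs = energy_based_policy(f, state, actions, tau=0.5)
# a = actions[rng.choice(len(actions), p=probs)]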
Permutation-invariant Actor and Critic.
We adopt a permutation invariant design of the actor and critic network.
Specifically, given individual state and sampled states ,
the actor (resp. critic) network (resp. ) satisfies
for any permutation mapping .
With permutation invariance, the search space of the actor/critic network depends only polynomially on the number of agents .
Proposition 3.1.
The search space of a permutation invariant actor (critic) network is at the order of
Additionally, if , then the search space depends on at the order of .
Compared to architectures without permutation invariance, whose search space depends on at the order of , we can clearly see the search space of MF-PPO can be exponentially smaller.
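To see why, a standard counting argument (assuming a finite local state space $\mathcal{S}$, in our notation): a permutation-invariant function of $N$ local states depends only on the multiset of those states, i.e., on their empirical distribution, so a tabular representation needs one entry per multiset rather than per ordered tuple,
\[
\binom{N+|\mathcal{S}|-1}{|\mathcal{S}|-1} \;=\; O\!\big(N^{|\mathcal{S}|-1}\big)
\qquad\text{entries, versus } |\mathcal{S}|^{N} \text{ without invariance.}
\]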
Motivated by the characterization of the permutation invariant set function in Zaheer et al. (2017), the actor/critic network in MF-PPO takes the form of Deep Sets architecture, i.e.,
Both networks first aggregate local information by averaging the outputs of a sub-network shared among agents, before feeding the aggregated information into a subsequent network.
See Figure 2 for a detailed illustration.
Effectively, by the average pooling layer and the preceding parameter sharing scheme, the network can keep its output unchanged when permuting the ordering of agents.
Compared to a Graph Convolutional Neural Network (Kipf and Welling, 2017), the averaging operations are well suited for homogeneous agents and more parameter-efficient.
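As an illustration, a minimal PyTorch sketch of such a DeepSet critic; the layer sizes and the choice of PyTorch are ours, and the architectures actually used in the experiments are described in Appendix B.

import torch
import torch.nn as nn

class DeepSetCritic(nn.Module):
    """Permutation-invariant critic: a shared sub-network phi is applied to each
    agent's local observation, the per-agent features are mean-pooled, and a
    second network rho maps the pooled feature to a scalar value."""

    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, obs):                      # obs: (batch, n_agents, obs_dim)
        pooled = self.phi(obs).mean(dim=1)       # average pooling over the agent axis
        return self.rho(pooled)                  # (batch, 1)

# Permuting the agent axis leaves the value unchanged:
# critic = DeepSetCritic(obs_dim=4)
# obs = torch.randn(2, 5, 4)
# perm = torch.randperm(5)
# torch.allclose(critic(obs), critic(obs[:, perm]))   # True

Because the pooling is an average, the same network also accepts a varying number of agents at test time, in line with the discussion at the start of this section.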
Naturally, the actor network when given the joint state-action pair is given by
We assume is parameterized by a neural network with parameters , which is to be learned during training.
We let the function class of all possible actor networks be denoted by .
This same architecture design applies to the critic network, with learnable parameters denoted by and the function class denoted by .
MF-PPO then consists of successive iterations of policy evaluation and policy improvement steps described in detail below.
Figure 2: Illustration of a DeepSet parameterized critic network.
Policy Evaluation.
At the -th iteration of MF-PPO, we first update the critic network by minimizing the mean squared Bellman error while holding the actor network fixed.
We denote the policy induced by the actor network as , the stationary state distribution of policy as , and the stationary state-action distribution as . Thus, we follow the update
(2)
where denotes the Euclidean ball with radius centered at the initialized parameter , and the Bellman evaluation operator is given by
where .
We solve (2) by T-step temporal-difference (TD) update and output .
At the -th iteration of the TD update,
where we sample
We use to denote the orthogonal projection onto set , and to denote the step size.
The details are summarized in Algorithm 2 in Appendix A.
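For concreteness, a minimal sketch of one such projected TD step in PyTorch; the function signature, the sampled transition, and the flattened-parameter projection are our illustrative choices, and the precise update is the one given in Algorithm 2.

import torch

def projected_td_step(critic, transition, theta0, radius, gamma, lr):
    """One TD(0) semi-gradient step on the critic, followed by a Euclidean
    projection of the parameters onto the ball of radius `radius` centered at
    the flattened initialization vector theta0."""
    s, a, r, s_next, a_next = transition        # (s, a) from the current policy, a_next ~ pi(.|s_next)
    q = critic(s, a)
    with torch.no_grad():
        target = r + gamma * critic(s_next, a_next)   # bootstrapped evaluation target
    loss = 0.5 * (q - target).pow(2).mean()
    critic.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in critic.parameters():
            if p.grad is not None:
                p -= lr * p.grad                      # semi-gradient descent step
        theta = torch.nn.utils.parameters_to_vector(critic.parameters())
        diff = theta - theta0
        if diff.norm() > radius:                      # project back onto the ball
            torch.nn.utils.vector_to_parameters(theta0 + diff * radius / diff.norm(),
                                                critic.parameters())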
Policy Improvement. Following the policy evaluation step, MF-PPO updates its policy by updating the policy network , which is the energy function associated with the policy.
We update the policy network by
The update rule intuitively reads as increasing the probability of choosing an action if it yields a higher value under the critic network , and can be viewed as a softened version of policy iteration (Bertsekas, 2011).
Additionally, by controlling the proximity parameter , we can control the softness of the update, with yielding the vanilla policy iteration.
Moreover, without the constraint of , such an update has a nice closed-form expression, and the result is itself another energy-based policy.
Proposition 3.2.
Let denote the energy-based policy, then the update
yields .
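In generic notation (our symbols, not necessarily the paper's), this is the usual closed form of a KL-regularized improvement step: given the current policy $\pi_k$, a critic estimate $Q_k$, and proximity parameter $\beta$,
\[
\operatorname*{argmax}_{\pi}\; \mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[Q_k(s,a)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot\mid s)\,\|\,\pi_k(\cdot\mid s)\big)
\;\propto\; \pi_k(\cdot\mid s)\,\exp\!\big(Q_k(s,\cdot)/\beta\big),
\]
so the new energy is the old energy plus a critic term, up to the temperature rescaling.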
To account for the fact that the representable functions of the actor network reside in , we update the policy by projecting the energy function of back onto .
Specifically, by denoting
we recover the next actor network (i.e., energy) by
performing the following regression task
(3)
where .
We approximately solve (3) via T-step stochastic gradient descent (SGD), and output
At the -th iteration of SGD,
where we sample , and , and is the step size.
The details are summarized in Algorithm 3 of Appendix A.
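A minimal sketch of one SGD step for the regression (3), assuming the energy-based policy has the form $\pi(a\mid s)\propto\exp(f(s,a)/\tau)$ and that the target energy is the one given by Proposition 3.2; the names and the exact form of the target are our placeholders, and the precise update is the one in Algorithm 3.

import torch

def policy_improvement_sgd_step(actor, actor_old, critic, batch,
                                tau, tau_next, beta, optimizer):
    """Fit the new energy f_theta so that f_theta / tau_next matches the energy
    of the ideal updated policy, f_old / tau + Q / beta, in mean squared error
    over state-action pairs sampled from the current visitation distribution."""
    s, a = batch
    with torch.no_grad():
        target = actor_old(s, a) / tau + critic(s, a) / beta
    pred = actor(s, a) / tau_next
    loss = 0.5 * (pred - target).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()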
Finally, we present the complete MF-PPO in Algorithm 1.
Require: Mean-field MDP , discount factor ; number of outer iterations , number of inner updates ;
policy update parameter , step size , projection radius .
Initialize: , , (uniform policy).
for … do
Set temperature parameter , and proximity parameter
Solve (2) to update the critic network , using TD update (Algorithm 2)
Solve (3) to update the actor network for , using SGD update (Algorithm 3)
Update policy:
end for
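Schematically, the outer loop of Algorithm 1 alternates the two inner solvers; the sketch below is illustrative only, and in particular the temperature and proximity schedules are placeholders for the ones prescribed by Theorem 4.1.

def mf_ppo(actor, critic, K, T, td_step, sgd_step):
    """Outer loop of MF-PPO: at each iteration, run T policy-evaluation steps
    (Algorithm 2) and T policy-improvement steps (Algorithm 3), then update the
    temperature and proximity parameters.  `td_step` and `sgd_step` stand in
    for the inner update routines."""
    tau = 1.0
    for k in range(K):
        tau_next = 1.0 / (k + 2) ** 0.5             # placeholder temperature schedule
        beta = (k + 2) ** 0.5                       # placeholder proximity schedule
        for _ in range(T):                          # policy evaluation, Eq. (2)
            td_step(critic, actor, tau)
        for _ in range(T):                          # policy improvement, Eq. (3)
            sgd_step(actor, critic, tau, tau_next, beta)
        tau = tau_next
    return actor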
4 Global Convergence of MF-PPO
We present the global convergence of the MF-PPO algorithm for a two-layer permutation-invariant parameterization of the actor and critic networks.
We remark that our analysis can be extended to multi-layer permutation-invariant networks, and we present the two-layer case here for simplicity of exposition.
Specifically, the actor and critic networks take the form
where (resp. ) is the width of the actor (resp. critic) network, and denotes the ReLU activation.
We randomly initialize (resp. ) and first layer weights (resp. ) by
For ease of analysis, we take and share the initialization of and (resp. and ).
Additionally, we keep ’s fixed during training,
and (resp. ) within the ball (resp. ) throughout training.
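For reference, the standard over-parameterized two-layer ReLU form we have in mind for each such sub-network, written in generic symbols ($m$, $b_r$, $w_r$, and the initialization scheme below are our notation): with width $m$, fixed second-layer weights $b_r\in\{-1,+1\}$, and trainable first-layer weights $w_r$,
\[
f(x; W) \;=\; \frac{1}{\sqrt{m}} \sum_{r=1}^{m} b_r\, \sigma\big(w_r^{\top} x\big),
\qquad \sigma(u) = \max\{u, 0\},
\]
with $b_r$ drawn uniformly from $\{-1,+1\}$ and $w_r$ drawn from a Gaussian at initialization; permutation invariance is then obtained by averaging such sub-networks over agents as in Section 3.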
We define the following function class, which characterizes the class of actor/critic networks defined above in the large-width regime.
Definition 4.1.
Given , define the function class over domain by
where are random weights sampled according to
also induces the function class over given by
It is well known that functions within approximate functions within the reproducing kernel Hilbert space associated with kernel
for large network width
(Jacot et al., 2018; Chizat and Bach, 2018; Cai et al., 2019; Arora et al., 2019), whose RKHS norm is bounded by .
For large and , represents a rich class of functions.
Additionally, functions within can be viewed as mean embeddings of the joint state-action pair into the RKHS (Muandet et al., 2016; Song et al., 2009; Smola et al., 2007). Below, we make one technical assumption, which states that is rich enough to represent the Q-functions of all the policies within our policy class.
Assumption 1.
For any policy induced by , we have .
We define two mild conditions stating (1) boundedness of reward, and (2) regularity of the joint state and the policy.
Assumption 2.
Reward function for some .
Additionally, for any and any , there exists such that
for any and .
We remark that the condition in Assumption 2 holds whenever is bounded and has an upper bounded density.
We measure the progress of MF-PPO in Algorithm 1 using the expected total reward
(4)
where is the stationary state distribution of the optimal policy .
We also denote as the stationary state-action distribution induced by .
Note that we have:
for any policy .
Our main result is presented in the following theorem, showing that converges to at a sublinear rate.
Theorem 4.1(Global Convergence of MF-PPO).
Under Assumptions 1 and 2,
the policies generated by Algorithm 1 satisfy
where ,
and are defined in Lemma 4.3.
In particular, suppose at each iteration of MF-PPO, we observe
agents, and the actor/critic network satisfies
and
then we have
Theorem 4.1 states that, given sufficiently many agents and a large enough actor/critic network, MF-PPO attains global optimality at a sublinear rate.
Our result shows that when solving the mean-field MDP, having more agents serves as a blessing instead of a curse.
In addition, as will be demonstrated in our proof sketch, there exists an inherent phase transition depending on the number of agents, where the final optimality gap is dominated by statistical error for a small number of agents (first phase); and by optimization error for a large number of agents (second phase).
The complete proof of Theorem 4.1 requires a careful analysis of the errors from the policy evaluation step (2) and the policy improvement step (3). The outer iterations of MF-PPO can be viewed as approximate mirror descent, and the analysis needs to account for how the evaluation and improvement errors interact.
Intuitively, the tuple describes the effect of the policy update when using approximate policy evaluation and policy improvement, and will be further clarified in the ensuing discussion.
We present here the skeleton of our proof, and defer the technical details to the appendix.
Proof Sketch.
We first establish the global convergence of the policy evaluation (2) and improvement step (3).
Lemma 4.1(Policy Evaluation).
Under the same assumptions in Theorem 4.1,
let denote the Q-function of policy , let
denote the policy evaluation error, where is the output of Algorithm 2; then we have
Lemma 4.2(Policy Improvement).
Under the same assumptions in Theorem 4.1, let denote the policy optimization error, where is the output of Algorithm 3; then we have
Lemma 4.1 and 4.2 show that despite non-convexity, both the policy evaluation and the policy improvement steps converge approximately to the global optimal solution.
In particular, given networks with large width, for a small number of iterations , the optimization error dominates the optimality gap; for a large number of iterations , the statistical error dominates the optimality gap.
With Lemma 4.1 and 4.2, we illustrate the main argument for the proof of Theorem 4.1. Let us assume the ideal case when .
Note that for , we obtain the exact -function of policy .
For , we obtain the ideal energy-based updated policy defined in Proposition 3.2. That is,
(5)
Without function approximation, problem (5) can be solved by treating each joint state independently; hence one can apply the well-known three-point lemma in mirror descent (Chen and Teboulle, 1993) and obtain that, for all :
From Lemma 6.1 in Kakade and Langford (2002), the expectation of the left hand side yields exactly .
Hence we have
Pinsker’s Inequality
combined with the observation ,
and basic inequality for gives us
(6)
By setting , and telescoping the above inequality from to , we obtain:
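For intuition, a schematic version of recursion (6), and the bound obtained by telescoping it, written in generic notation (our symbols; the approximation errors entering (7) and (8) are suppressed): with step size $\eta$ and Q-functions uniformly bounded by $Q_{\max}$,
\[
\mathbb{E}_{\nu^*}\big[\mathrm{KL}\big(\pi^*(\cdot\mid s)\,\|\,\pi_{k+1}(\cdot\mid s)\big)\big]
\;\le\;
\mathbb{E}_{\nu^*}\big[\mathrm{KL}\big(\pi^*(\cdot\mid s)\,\|\,\pi_{k}(\cdot\mid s)\big)\big]
\;-\;\eta(1-\gamma)\big(V(\pi^*)-V(\pi_k)\big)
\;+\;\tfrac{1}{2}\eta^{2}Q_{\max}^{2},
\]
and summing over $k=0,\dots,K-1$ with $\eta=1/\sqrt{K}$ gives
\[
\min_{0\le k<K}\big(V(\pi^*)-V(\pi_k)\big)
\;\le\;
\frac{\mathbb{E}_{\nu^*}\big[\mathrm{KL}\big(\pi^*(\cdot\mid s)\,\|\,\pi_{0}(\cdot\mid s)\big)\big]+\tfrac{1}{2}Q_{\max}^{2}}{(1-\gamma)\sqrt{K}}.
\]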
Note that the key element in the global convergence of MF-PPO is the recursion defined in (6), which holds whenever we have the exact -function of the current policy and no function approximation is used when updating the next policy.
Now MF-PPO conducts approximate policy evaluation (), and after obtaining the approximate -function,
conducts an approximate policy improvement step ().
In addition, the error of approximating the -function introduced in the evaluation step can be further compounded in the improvement step.
Nevertheless, recursion (6) still holds approximately, with additional terms representing the policy evaluation/improvement errors.
Let (evaluation error) and (improvement error) be defined as in Lemma 4.1 and Lemma 4.2, respectively. We have:
(7)
(8)
where
In addition, and are defined by:
Finally, from Lemma 4.3, by telescoping inequality (8) from to , we complete the proof of Theorem 4.1.
5 Experiments
We perform experiments on the benchmark multi-agent particle environment (MPE) used in prior work (Lowe et al., 2017; Mordatch and Abbeel, 2018; Liu et al., 2019b).
In the cooperative navigation task, agents each with position must move to cover fixed landmarks at positions .
They receive a team reward
In the cooperative push task, agents with position work together to push a ball to a fixed landmark .
They receive a team reward
Both tasks involve homogeneous agents.
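For intuition about why these team rewards are permutation invariant, below is a sketch of the cooperative-navigation reward in the form used by the standard MPE simple_spread scenario, namely a negative sum over landmarks of the distance to the closest agent; our experiments use the open-source environment of Liu et al. (2019b) without modification (see Appendix B).

import numpy as np

def navigation_team_reward(agent_pos, landmark_pos):
    """agent_pos: (N, 2) array of agent positions; landmark_pos: (L, 2) array of
    landmark positions.  Each landmark contributes minus its distance to the
    closest agent, so permuting the rows of agent_pos leaves the reward unchanged."""
    dists = np.linalg.norm(agent_pos[None, :, :] - landmark_pos[:, None, :], axis=-1)
    return -dists.min(axis=1).sum()    # (L, N) distances -> closest agent per landmark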
We instantiate MF-PPO by building on the original PPO algorithm (Schulman et al., 2017) with a centralized permutation invariant value-function critic, represented by a DeepSet (Zaheer et al., 2017) network with two hidden layers.
We use a standard two-layer multi-layer perceptron (MLP) for the actor network in all algorithms.
One version, labeled as MF, has a centralized actor network that outputs the mean and diagonal covariance of a Gaussian distribution over the joint continuous action space.
A second decentralized version with parameter sharing, labeled as MF-PS, has a decentralized actor network that maps individual observations to parameters of a Gaussian distribution over the individual action space.
(a)
(b)
(c)
(d)
Figure 3: Performance versus number of environment steps in the multi-agent cooperative navigation task, over five independent runs per method.
Each point is the average reward during 1000 evaluation episodes.
MF, based on the DeepSet invariant critic and built on either PPO or MADDPG, significantly outperforms other critic representations for various numbers of agents.
We compare with two other critic representations: one that uses MLP for the centralized critic, labeled MLP, and another that uses a graph convolutional network for the critic (Liu et al., 2019b), labeled GCN.
Note that the GCN representation is permutation invariant if one imposes a fully-connected graph for the agents in the MPE, but this invariance property does not hold for all graphs in general.
We also compare with an extension of (Yang et al., 2018) to the case of continuous action spaces, labeled MF-A, in which each independent DDPG agent has a decentralized critic that takes in the mean of all other agents’ actions .
Empirically, we find that off-policy RL learns faster than on-policy RL in the MPE as the number of agents grows, regardless of the critic representation, so we make all comparisons on top of MADDPG (Lowe et al., 2017) in all cases.
For a fair comparison of all critic representations, we ensure that all neural network architectures contain approximately the same number of trainable weights.
For the cooperative navigation task,
Figure 3 shows that the permutation invariant critic representation based on DeepSet enables MF and MF-PS to learn faster or reach higher performance than all other representations and methods in the MPE with 6, 15, 30 and 200 agents.
Figure 4(a) shows that MF consistently and significantly outperforms GCN as the number of parameters in their critic networks varies over a range, with all other settings fixed.
In particular, MF requires far fewer critic parameters to achieve higher performance than GCN.
For the cooperative push task,
Figure 4(b) demonstrates similar performance improvement of MF compared to all other methods.
We provide additional experiment detail in Appendix B.
(a)Varying critic size
(b)
Figure 4:
(a) MF outperforms GCN even with fewer critic network parameters ().
(b) Performance versus number of environment steps in the multi-agent cooperative push task.
MF based on DeepSet invariant critic significantly outperforms other critic representations.
Computational Improvements.
Theorem 4.1 states that to obtain a small optimality gap in MF-PPO, one needs to compute the update on a large number of agents. We remark that with the dual embedding techniques developed in Dai et al. (2017), one can avoid computation on all the agents by sampling a small number of agents to compute the update. This technique could be readily incorporated into MF-PPO to improve its computational efficiency.
6 Conclusion
We propose a principled approach to exploit agent homogeneity and permutation invariance through the mean-field approximation in MARL.
Moreover, our results are also the first to provably show the global convergence of MARL algorithms with neural networks as function approximators.
This is in sharp contrast to current practices, which are mostly heuristic methods without convergence guarantees.
References
Arora et al. (2019)Arora, S., Du, S. S., Hu, W., Li, Z. and
Wang, R. (2019).
Fine-grained analysis of optimization and generalization for
overparameterized two-layer neural networks.
arXiv preprint arXiv:1901.08584 .
Bertsekas (2011)Bertsekas, D. P. (2011).
Approximate policy iteration: A survey and some new methods.
Journal of Control Theory and Applications, 9, 310–335.
Bloem-Reddy and Teh (2019)Bloem-Reddy, B. and Teh, Y. W. (2019).
Probabilistic symmetry and invariant neural networks.
arXiv preprint arXiv:1901.06082 .
Cai et al. (2019)Cai, Q., Yang, Z., Lee, J. D. and Wang, Z.
(2019).
Neural temporal-difference learning converges to global optima.
In Advances in Neural Information Processing Systems.
Cao et al. (2013)Cao, Y., Yu, W., Ren, W. and Chen, G.
(2013).
An overview of recent progress in the study of distributed
multi-agent coordination.
IEEE Transactions on Industrial Informatics, 9, 427–438.
Carmona et al. (2019)Carmona, R., Laurière, M. and Tan, Z. (2019).
Model-free mean-field reinforcement learning: mean-field mdp and
mean-field q-learning.
arXiv preprint arXiv:1910.12802 .
Chen and Teboulle (1993)Chen, G. and Teboulle, M. (1993).
Convergence analysis of a proximal-like minimization algorithm using
bregman functions.
SIAM Journal on Optimization, 3, 538–543.
Chizat and Bach (2018)Chizat, L. and Bach, F. (2018).
A note on lazy training in supervised differentiable programming.
arXiv preprint arXiv:1812.07956 .
Dai et al. (2017)Dai, B., Shaw, A., Li, L., Xiao, L.,
He, N., Liu, Z., Chen, J. and Song, L.
(2017).
Sbeed: Convergent reinforcement learning with nonlinear function
approximation.
arXiv preprint arXiv:1712.10285 .
Jacot et al. (2018)Jacot, A., Gabriel, F. and Hongler, C. (2018).
Neural tangent kernel: Convergence and generalization in neural
networks.
In Advances in neural information processing systems.
Kakade and Langford (2002)Kakade, S. and Langford, J. (2002).
Approximately optimal approximate reinforcement learning.
In ICML, vol. 2.
Kalyanakrishnan et al. (2006)Kalyanakrishnan, S., Liu, Y. and Stone, P. (2006).
Half field offense in robocup soccer: A multiagent reinforcement
learning case study.
In Robot Soccer World Cup. Springer.
Kipf and Welling (2017)Kipf, T. N. and Welling, M. (2017).
Semi-supervised classification with graph convolutional networks.
In International Conference on Learning Representations.
Kober et al. (2013)Kober, J., Bagnell, J. A. and Peters, J. (2013).
Reinforcement learning in robotics: A survey.
The International Journal of Robotics Research, 32, 1238–1274.
Littman (1994)Littman, M. L. (1994).
Markov games as a framework for multi-agent reinforcement learning.
In Machine Learning Proceedings 1994. Elsevier, 157–163.
Liu et al. (2019a)Liu, B., Cai, Q., Yang, Z. and Wang, Z.
(2019a).
Neural proximal/trust region policy optimization attains globally
optimal policy.
arXiv preprint arXiv:1906.10306 .
Liu et al. (2019b)Liu, I.-J., Yeh, R. A. and Schwing, A. G.
(2019b).
Pic: Permutation invariant critic for multi-agent deep reinforcement
learning.
arXiv preprint arXiv:1911.00025 .
Lowe et al. (2017)Lowe, R., Wu, Y., Tamar, A., Harb, J.,
Abbeel, O. P. and Mordatch, I. (2017).
Multi-agent actor-critic for mixed cooperative-competitive
environments.
In Advances in Neural Information Processing Systems.
Matarić (1997)Matarić, M. J. (1997).
Reinforcement learning in the multi-robot domain.
In Robot colonies. Springer, 73–83.
Menda et al. (2018)Menda, K., Chen, Y.-C., Grana, J., Bono,
J. W., Tracey, B. D., Kochenderfer, M. J. and
Wolpert, D. (2018).
Deep reinforcement learning for event-driven multi-agent decision
processes.
IEEE Transactions on Intelligent Transportation Systems, 20, 1259–1268.
Mordatch and Abbeel (2018)Mordatch, I. and Abbeel, P. (2018).
Emergence of grounded compositional language in multi-agent
populations.
In Thirty-Second AAAI Conference on Artificial Intelligence.
Muandet et al. (2016)Muandet, K., Fukumizu, K., Sriperumbudur, B. and
Schölkopf, B. (2016).
Kernel mean embedding of distributions: A review and beyond.
arXiv preprint arXiv:1605.09522 .
Panait and Luke (2005)Panait, L. and Luke, S. (2005).
Cooperative multi-agent learning: The state of the art.
Autonomous Agents and Multi-Agent Systems, 11, 387–434.
Schulman et al. (2017)Schulman, J., Wolski, F., Dhariwal, P.,
Radford, A. and Klimov, O. (2017).
Proximal policy optimization algorithms.
arXiv preprint arXiv:1707.06347 .
Shalev-Shwartz et al. (2016)Shalev-Shwartz, S., Shammah, S. and Shashua, A.
(2016).
Safe, multi-agent, reinforcement learning for autonomous driving.
arXiv preprint arXiv:1610.03295 .
Smola et al. (2007)Smola, A., Gretton, A., Song, L. and
Schölkopf, B. (2007).
A hilbert space embedding for distributions.
In International Conference on Algorithmic Learning Theory.
Springer.
Song et al. (2009)Song, L., Huang, J., Smola, A. and Fukumizu,
K. (2009).
Hilbert space embeddings of conditional distributions with
applications to dynamical systems.
In Proceedings of the 26th Annual International Conference on
Machine Learning.
Sunehag et al. (2017)Sunehag, P., Lever, G., Gruslys, A.,
Czarnecki, W. M., Zambaldi, V., Jaderberg, M.,
Lanctot, M., Sonnerat, N., Leibo, J. Z.,
Tuyls, K.et al. (2017).
Value-decomposition networks for cooperative multi-agent learning.
arXiv preprint arXiv:1706.05296 .
Sutton and Barto (2018)Sutton, R. S. and Barto, A. G. (2018).
Reinforcement learning: An introduction.
MIT press.
Vinyals et al. (2019)Vinyals, O., Babuschkin, I., Czarnecki, W. M.,
Mathieu, M., Dudzik, A., Chung, J., Choi,
D. H., Powell, R., Ewalds, T., Georgiev, P.et al. (2019).
Grandmaster level in StarCraft II using multi-agent reinforcement
learning.
Nature, 575, 350–354.
Wang et al. (2020)Wang, L., Yang, Z. and Wang, Z. (2020).
Breaking the curse of many agents: Provable mean embedding
q-iteration for mean-field reinforcement learning.
In International Conference on Machine Learning. PMLR.
Yang et al. (2017)Yang, J., Ye, X., Trivedi, R., Xu, H. and
Zha, H. (2017).
Learning deep mean field games for modeling large population
behavior.
arXiv preprint arXiv:1711.03156 .
Yang et al. (2018)Yang, Y., Luo, R., Li, M., Zhou, M.,
Zhang, W. and Wang, J. (2018).
Mean field multi-agent reinforcement learning.
arXiv preprint arXiv:1802.05438 .
Zaheer et al. (2017)Zaheer, M., Kottur, S., Ravanbakhsh, S.,
Poczos, B., Salakhutdinov, R. R. and Smola, A. J.
(2017).
Deep sets.
In Advances in neural information processing systems.
Zhang et al. (2019)Zhang, K., Yang, Z. and Başar, T. (2019).
Multi-agent reinforcement learning: A selective overview of theories
and algorithms.
arXiv preprint arXiv:1911.10635 .
Zheng et al. (2020)Zheng, S., Trott, A., Srinivasa, S., Naik,
N., Gruesbeck, M., Parkes, D. C. and Socher, R.
(2020).
The ai economist: Improving equality and productivity with ai-driven
tax policies.
arXiv preprint arXiv:2004.13332 .
Supplementary Material for Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach
Appendix A Algorithms for Policy Evaluation/Improvement
Algorithm 2 Policy Evaluation via TD
Require: Mean-field MDP , sample , number of iterations
Initialize:
Set step size
for … do
Set
end for
Take ergodic average:
Output:
Algorithm 3 Policy Improvement via SGD
Require: Mean-field MDP , sample , number of iterations
Initialize:
Set step size
for … do
Set
end for
Take ergodic average:
Output:
Appendix B Experimental details
Environment.
We used the open-source code for the multi-agent particle environments provided by Liu et al. (2019b), which itself is based on the original code by Lowe et al. (2017), without any modification.
Please refer to Liu et al. (2019b, Appendix B) for complete details.
Computing infrastructure and runtime.
Experiments were run on Intel Xeon 6154 CPUs, using one core for each independent policy training session.
Average training time was approximately 4 hours for and with 1.5e6 steps, and 12 hours for with 7.25e5 steps.
B.1 Implementation
GCN. (We label this architecture as “GCN” to distinguish it from alternative ways to instantiate a permutation invariant critic.)
For experiments based on MADDPG, we re-ran the open-source code provided by Liu et al. (2019b) without modification.
For experiments based on PPO, we used the GCN network in Kipf and Welling (2017) as the critic in PPO.
The GCN performance reported in this work is the result of our own experimental runs.
MF.
In the PPO-based implementation, the joint policy is parameterized by a multi-layer perceptron (MLP) with two hidden layers of size 128 and ReLU activation, which takes in the concatenation of all agents’ observations and outputs the mean and diagonal covariance of a joint Gaussian policy.
In the MADDPG-based implementation, each decentralized actor (i.e., policy) network is an MLP with two hidden layers of size 187 and ReLU activation, which takes in each agent’s individual observation and outputs the agent’s real-valued action vector.
The centralized critic has the form , where and are hidden layers of size (for agents, respectively, to have approximately equal number of trainable critic parameters as the GCN critic used in Liu et al. (2019b)) with ReLU activation, and is a linear layer with output size 1.
MLP.
The centralized critic is an MLP with two hidden layers of size 138, 187, 75, 42, 7 for the case of agents, respectively, such that the total number of trainable parameters is approximately equal to that in GCN (Liu et al., 2019b).
The joint policy has the same representation as the one in MF for both PPO-based and MADDPG-based implementations.
Hyperparameters.
We used the Adam optimizer for all algorithms.
For experiments based on PPO, the common hyperparameters across all algorithms are: discount factor , KL divergence target , KL divergence initial coefficient , and GAE (see Schulman et al. (2017)), policy entropy loss coefficient 0.01, max gradient norm , and actor learning rate .
Each PPO training step uses 5 epochs with minibatch size 2.
We did a sweep of critic learning rate for each critic architecture, choosing for MF, for GCN and MLP, and for MF-A.
For experiments based on MADDPG, the common hyperparameters are:
actor learning rate , critic learning rate , batch size 1024, discount factor , replay buffer size , 8 actor and critic updates per 100 environment steps, and target network update rate .
We first prove the first part of the claim. We begin by showing that the optimal Q-function is permutation invariant, i.e., for all permutation mappings .
Note that is the unique solution to Bellman optimality condition over the space of -function
(9)
Or equivalently, from the permutation invariance of and defined in (1)
That is, is also a solution to the Bellman optimality condition (9). From the uniqueness of the solution, we have
. Hence the optimal policy is permutation invariant.
For the second part of the proposition, we give the detailed proof of the permutation invariance of the value function .
For any permutation invariant policy , let , and denote the key elements of the induced Markov reward process.
From the permutation invariance defined in (1), one can clearly see that both and are permutation invariant.
We define the -step value function as the expected reward from time to time , then:
We can see from the above recursion that is permutation invariant for all . From , we can conclude that is also permutation invariant.
For permutation invariance of Q-function, recall that is the unique solution of the following Bellman evaluation equation
(10)
From the permutation invariance of , and , we have is also a solution to (10), as
The proof for showing follows the exact same ingredients as in the proof for Proposition 2.1.
Following Theorem 11 in Bloem-Reddy and Teh (2019), there exists a function such that , where . Then Proposition 2.2 follows from letting .
For any permutation invariant actor/critic function ,
we consider the size of the tabular representation of such functions and refer to it as the size of the search space.
We say and are permutation equivalent if for some permutation mapping .
We can observe that (i) permutation equivalence is a valid equivalence relation defined over the space of all possible ; (ii) and are permutation equivalent if and only if every value in occurs the same number of times in and .
One can verify that for containing elements, with each element taking values in , the number of equivalence classes over the space of all possible , induced by the permutation equivalence relation, is . The claim of Proposition 3.1 follows immediately.
∎
We provide a unified analysis of the policy evaluation and policy improvement step.
The policy evaluation and improvement steps can be described as minimizing the following mean-squared loss:
(11)
where the operator maps function to
(12)
Formulation (11) includes policy improvement and policy evaluation as special cases:
Policy Improvement. This corresponds to ,
, ,
Policy Evaluation. This corresponds to , , , .
Note that equals to the Bellman operator: .
To solve problem (11), we consider the following generic update rule:
(13)
(14)
where .
Instead of knowing exactly, we only have access to via independent samples represented as the states of agents. Hence, we perform the following update:
(15)
(16)
where ; and .
Policy Improvement Updates. For , and , we recover the policy improvement update in Algorithm 3.
Policy Evaluation Updates. For , , we recover the policy evaluation update in Algorithm 2.
We define a few more notations before we proceed
(residual):
(stochastic semi-gradient):
(population semi-gradient):
where .
We also define the associated finite sample approximation:
where .
Note that given is not an unbiased estimator of , as .
E.1 Linearization at Initialization
For a two layer ReLU network defined as , where , we define
its linearization at the initialization
Note that , which equivalently means is the local linearization of at its initialization (note we do not train the second layer ).
We define
and similarly define , and , with everything related to replaced by .
Our main objectives are to show that the objective in (11) can be well approximated by replacing everything related to with its local linearization , and that the linearized stochastic semi-gradient remains close to the population semi-gradient when the number of hidden units and the number of agents are large enough.
By doing so, we can conclude that learning with the linearized stochastic semi-gradient suffices to solve (11) to high accuracy when the number of hidden units and the number of agents are large enough.
For ease of exposition, we also make the following assumptions:
Assumption 3.
We assume for all .
Assumption 4.
The function satisfies: for each and any action , we have
.
Remark: We will verify that with , Assumption 4 holds for the policy evaluation step. With , Assumption 4 holds for the policy improvement step.
where in the last inequality we use and Assumption 4.
Finally, note that
The claim follows immediately.
∎
We then bound the variance of for fixed .
Lemma E.4.
For any , let and , and ,
we have
Proof.
We have
(20)
For the first term, note that , we have
where in the last inequality we use a similar argument as in Lemma E.3.
Now as we have shown in Lemma E.3, we have .
Hence the first term in (E.1) is at the order of .
To bound the second term in (E.1), from Jensen’s inequality we have
On the other hand, we have
Therefore, taking squares on both sides of (E.1) and using the Cauchy-Schwarz inequality, we have
∎
With Lemma E.3 and Lemma E.4, we can now bound the difference between and .
Now applying Lemma E.3 and Lemma E.4, the claim follows immediately.
∎
We are now ready to show the convergence of updates (13) and (14) for solving problem (11).
We denote to be the function within the class , which satisfies the stationary condition:
(21)
where denotes the Euclidean projection onto set .
Condition (21) is equivalent to:
Note that , and .
Hence we have
(22)
With the definition of operator in (12),
we know that
Note that satisfies .
For policy evaluation, by definition (12), the operator equals to the Bellman evaluation operator: .
Hence we have .
Now by Assumption 1, we know that , together with the fact that .
By uniqueness of the projection, we have , which implies
The claim follows immediately by applying Theorem E.1.
∎
Note that here we have . For policy optimization the operator is defined by:
.
Then we have
where the second inequality comes from the definition of projection and ,
and the last inequality comes from direct application of Theorem E.1 and Lemma E.1, together with the fact that and share the same initialization.
∎