
TAPE: Leveraging Agent Topology for Cooperative Multi-Agent Policy Gradient

Xingzhou Lou1,2, Junge Zhang1,2, Timothy J. Norman3, Kaiqi Huang1,2, Yali Du4
Work done while visiting King’s College London.
Abstract

Multi-Agent Policy Gradient (MAPG) has made significant progress in recent years. However, centralized critics in state-of-the-art MAPG methods still face the centralized-decentralized mismatch (CDM) issue, which means sub-optimal actions by some agents will affect other agents' policy learning. While using individual critics for policy updates can avoid this issue, they severely limit cooperation among agents. To address this issue, we propose an agent topology framework, which decides whether other agents should be considered in the policy gradient and achieves a compromise between facilitating cooperation and alleviating the CDM issue. The agent topology allows agents to use coalition utility as the learning objective, instead of the global utility given by centralized critics or the local utility given by individual critics. To constitute the agent topology, various models are studied. We propose Topology-based multi-Agent Policy gradiEnt (TAPE) for both stochastic and deterministic MAPG methods. We prove the policy improvement theorem for stochastic TAPE and give a theoretical explanation for the improved cooperation among agents. Experiment results on several benchmarks show that the agent topology is able to facilitate agent cooperation and alleviate the CDM issue, improving the performance of TAPE. Finally, multiple ablation studies and a heuristic graph search algorithm are devised to show the efficacy of the agent topology.

1 Introduction

Recent years have witnessed dramatic progress of reinforcement learning (RL) and multi-agent reinforcement learning (MARL) in real-life applications, such as unmanned vehicles (Liu et al. 2022), traffic signal control (Noaeen et al. 2022) and on-demand delivery (Wang et al. 2023). Taking advantage of the centralized training decentralized execution (CTDE) (Oliehoek, Spaan, and Vlassis 2008; Kraemer and Banerjee 2016) paradigm, current cooperative MARL methods (Du et al. 2023; Wang et al. 2020a, b; Peng et al. 2021; Zhang et al. 2021; Zhou, Lan, and Aggarwal 2022) adopt value function factorization or a centralized critic to provide centralized learning signals that promote cooperation and achieve implicit or explicit credit assignment. Multi-agent policy gradient (MAPG) (Lowe et al. 2017; Foerster et al. 2018; Zhou et al. 2020; Zhang et al. 2021; Zhou, Lan, and Aggarwal 2022; Du et al. 2019) applies RL policy gradient techniques (Sutton and Barto 2018; Silver et al. 2014; Lillicrap et al. 2015) to the multi-agent context. In CTDE, MAPG methods adopt centralized critics or value-mixing networks (Rashid et al. 2020b, a; Wang et al. 2020a) over individual critics so that agents can directly update their policies to maximize the global $Q$ value $Q^{\bm{\pi}}_{tot}$ in their policy gradient. As a result, agents cooperate more effectively and obtain better expected team rewards.

The centralized critic approach has an inherent problem known as centralized-decentralized mismatch (CDM) (Wang et al. 2020c; Chen et al. 2022). The CDM issue refers to sub-optimal or explorative actions of some agents negatively affecting the policy learning of others, causing catastrophic miscoordination. The CDM issue arises because sub-optimal or explorative actions may lead to a small or negative centralized global $Q$ value $Q^{\bm{\pi}}_{tot}$, even if the other agents take good or optimal actions. In turn, the small $Q^{\bm{\pi}}_{tot}$ will make the other agents mistake their good actions for bad ones and disrupt their policy learning. The Decomposed Off-Policy (DOP) approach (Wang et al. 2020c) deals with sub-optimal actions of other agents through linearly decomposed individual critics, which ignore the other agents' actions in the policy gradient. But the use of individual critics severely limits agent cooperation.

We give an example to illustrate the issues of learning with centralized critics and individual critics respectively. Consider a one-step matrix game with two agents $A$, $B$ where each agent has two actions $a_0, a_1$. The reward is $R(a_0,a_0)=2$, $R(a_0,a_1)=-4$, $R(a_1,a_0)=-1$ and $R(a_1,a_1)=0$. Assume agent $A$ has a near-optimal policy that chooses the non-optimal action $a_1$ with probability $\epsilon$ and is using the COMA centralized critic (Foerster et al. 2018) for policy learning. If agent $A$ takes the optimal action $a_0$ and $B$ takes the non-optimal action $a_1$, agent $A$'s counterfactual advantage is $Adv_A(a_0,a_1)=Q_{tot}^{\bm{\pi}}(a_0,a_1)-\left[\left(1-\epsilon\right)Q_{tot}^{\bm{\pi}}(a_0,a_1)+\epsilon Q_{tot}^{\bm{\pi}}(a_1,a_1)\right]=-4\epsilon<0$, which means agent $A$ will mistakenly regard $a_0$ as a bad action. Consequently, the sub-optimal action of agent $B$ causes agent $A$ to decrease the probability of taking the optimal action $a_0$ and deviate from the optimal policy. Similar problems occur with other centralized critics. If we employ individual critics, however, cooperation will be limited. Assume both agents' policies are initialized as random policies and learn with individual critics. For agent $A$, $Q_A(a_0)=\mathbb{E}_{a_B\sim\pi_B}[Q^{\bm{\pi}}_{tot}(a_0,a_B)]=0.5\times 2-0.5\times 4=-1$. Similarly, we get $Q_A(a_1)=-0.5$, $Q_B(a_0)=0.5$, $Q_B(a_1)=-2$. The post-update joint policy will be $(a_1,a_0)$ and receive reward $-1$, which is clearly sub-optimal.
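The arithmetic in this example can be checked with a short numpy sketch (the variable names are ours; since the game has a single step, $Q^{\bm{\pi}}_{tot}$ coincides with the payoff matrix $R$):

```python
import numpy as np

# Payoff matrix of the one-step game: rows are agent A's actions, columns are agent B's.
# In a one-step game, Q_tot(a_A, a_B) is simply R(a_A, a_B).
R = np.array([[2.0, -4.0],
              [-1.0, 0.0]])

eps = 0.1                                  # probability that agent A takes the non-optimal action a1
pi_A = np.array([1.0 - eps, eps])          # agent A's near-optimal policy

# COMA-style counterfactual advantage for agent A when A plays a0 and B plays a1:
# Adv_A(a0, a1) = Q(a0, a1) - sum_{a'} pi_A(a') * Q(a', a1)
adv_A = R[0, 1] - pi_A @ R[:, 1]
print(adv_A)                               # -4 * eps < 0: the optimal action a0 looks bad to agent A

# Individual critics under uniform (random) policies for both agents.
uniform = np.array([0.5, 0.5])
Q_A = R @ uniform                          # Q_A(a0) = -1.0, Q_A(a1) = -0.5
Q_B = uniform @ R                          # Q_B(a0) =  0.5, Q_B(a1) = -2.0
print(Q_A, Q_B)                            # greedy joint action becomes (a1, a0), with reward -1
```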

In this paper, we aim to alleviate the CDM issue without hindering agents' cooperation capacity by proposing an agent topology framework that describes the relationships between agents' policy updates. Under the agent topology framework, agents connected in the topology consider and maximize each other's utilities. Thus, the shared objective makes each individual agent form a coalition with its connected neighbors. Agents only consider the utilities of agents in the same coalition, facilitating in-coalition cooperation and avoiding the influence of out-of-coalition agents. Based on the agent topology, we propose Topology-based multi-Agent Policy gradiEnt (TAPE) for both stochastic and deterministic MAPG, where the agent topology alleviates the bad influence of other agents' sub-optimality without hindering cooperation among agents. Theoretically, we prove the policy improvement theorem for stochastic TAPE and give a theoretical explanation for the improved cooperation from exploiting the agent topology, from the perspective of parameter-space exploration.

Empirically, we use three prevalent random graph models (Erdős, Rényi et al. 1960; Watts and Strogatz 1998; Albert and Barabási 2002) to constitute the agent topology. Results show that the Erdős–Rényi (ER) model (Erdős, Rényi et al. 1960) is able to generate the most diverse topologies. With diverse coalitions, agents are able to explore different cooperation patterns and achieve strong cooperation performance. Evaluation results on a matrix game, Level-based foraging (Papoudakis et al. 2021) and SMAC (Samvelyan et al. 2019) show that TAPE outperforms all baselines and the agent topology is able to improve base methods’ performance by both facilitating cooperation among agents and alleviating the CDM issue. Moreover, to show the efficacy of the agent topology, we conduct multiple studies and devise a heuristic graph search algorithm.

Contributions of this paper are three-fold: First, we propose an agent topology framework and Topology-based multi-Agent Policy gradiEnt (TAPE) to achieve a compromise between facilitating cooperation and alleviating the CDM issue. Second, we theoretically establish the policy improvement theorem for stochastic TAPE and elaborate on the cause of the improved cooperation under the agent topology. Finally, empirical results demonstrate that the agent topology is able to alleviate the CDM issue without hindering cooperation among agents, resulting in the strong performance of TAPE.

2 Preliminaries

The cooperative multi-agent task in this paper is modelled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) (Oliehoek and Amato 2016). A Dec-POMDP is a tuple $G=\left\langle I,S,\mathcal{A},P,r,\mathcal{O},O,n,\gamma\right\rangle$, where $I=\{1,..,n\}$ is a finite set of $n$ agents, $S$ is the state space, $\mathcal{A}$ is the agent action space and $\gamma$ is a discount factor. At each timestep, every agent $i\in I$ picks an action $a_i\in\mathcal{A}$ to form the joint action $\mathbf{a}\in\mathbf{A}=\mathcal{A}^n$, which interacts with the environment. Then a state transition occurs according to the state transition function $P(s^{\prime}|s,\mathbf{a}):S\times\mathbf{A}\times S\rightarrow[0,1]$. All agents receive a shared reward given by the reward function $r(s,\mathbf{a}):S\times\mathbf{A}\rightarrow\mathbb{R}$. During execution, every agent draws a local observation $o\in\mathcal{O}$ from the observation function $O(s,a):S\times A\rightarrow\mathcal{O}$. Every agent stores an observation-action history $\tau_i\in T=(\mathcal{O}\times\mathcal{A})$, based on which agent $i$ derives a policy $\pi_i(a_i|\tau_i)$. The joint policy $\bm{\pi}=\{\pi_1,..,\pi_n\}$ consists of the policies of all agents. The global $Q$ value function $Q^{\bm{\pi}}_{tot}(s,\mathbf{a})=\mathbb{E}_{\bm{\pi}}[\sum_{i=0}\gamma^{i}r_{t+i}|s_t=s,\mathbf{a}_t=\mathbf{a}]$ is the expectation of the discounted future reward sum under the joint policy $\bm{\pi}$.
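For concreteness, the following minimal Python sketch spells out how these components interact during an episode; the container and function names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative container for the Dec-POMDP components above (names are assumptions).
@dataclass
class DecPOMDP:
    n: int                  # number of agents |I|
    gamma: float            # discount factor
    transition: Callable    # s' ~ P(. | s, joint_action)
    reward: Callable        # shared reward r(s, joint_action)
    observe: Callable       # local observation O(s, agent_id)

def rollout(env: DecPOMDP, policies: List[Callable], s0, horizon: int) -> float:
    """Discounted shared return under decentralized policies pi_i(a_i | tau_i)."""
    s, ret = s0, 0.0
    histories = [[] for _ in range(env.n)]           # tau_i: observation-action history
    for t in range(horizon):
        joint_action = []
        for i in range(env.n):
            o_i = env.observe(s, i)
            a_i = policies[i](histories[i] + [o_i])  # a_i ~ pi_i(. | tau_i)
            histories[i] += [o_i, a_i]
            joint_action.append(a_i)
        ret += env.gamma ** t * env.reward(s, joint_action)   # reward shared by all agents
        s = env.transition(s, joint_action)
    return ret
```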

The policy gradient in the stochastic MAPG method DOP is $g=\mathbb{E}_{\bm{\pi}}\left[\sum_{i}k_{i}(s)\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})Q_{i}^{\phi_{i}}(s,a_{i})\right]$, where $k_i\geq 0$ is the positive coefficient provided by the mixing network. The policy gradient in deterministic MAPG methods is $g=\mathbb{E}_{\mathcal{D}}\left[\sum_{i}\nabla_{\theta_{i}}\pi_{i}(\tau_{i})\nabla_{a_{i}}Q^{\bm{\pi}}_{tot}(s,\bm{a})|_{a_{i}=\pi_{i}(\tau_{i})}\right]$, where $Q^{\bm{\pi}}_{tot}$ is the centralized critic and $\pi_i$ is the policy of agent $i$ parameterized by $\theta_i$.

3 Related Work

Multi-Agent Policy Gradient The policy gradient in stochastic MAPG methods has the form $\mathbb{E}_{\bm{\pi}}\left[\sum_{i}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})G_{i}\right]$ (Foerster et al. 2018; Wang et al. 2020c; Lou et al. 2023b; Chen et al. 2022), where the objective $G_i$ varies across methods, such as the counterfactual advantage (Foerster et al. 2018) and the polarized joint-action value (Chen et al. 2022). The objective in DOP is the individual aristocrat utility (Wolpert and Tumer 2001), which ignores other agents' utilities to avoid the CDM issue, but cooperation is also limited by this objective. It is worth noting that the polarized joint-action value (Chen et al. 2022) also aims to address the CDM issue, but it only applies to stochastic MAPG methods, and the polarized global $Q$ value can be very unstable. Deterministic MAPG methods use gradient ascent to directly maximize the centralized global $Q$ value $Q^{\bm{\pi}}_{tot}$. Lowe et al. (2017) model the global $Q$ value with a centralized critic. Current deterministic MAPG methods (Zhang et al. 2021; Peng et al. 2021; Zhou, Lan, and Aggarwal 2022) adopt value factorization to mix individual $Q$ values into $Q^{\bm{\pi}}_{tot}$. As the global $Q$ value is given by the centralized critic for all agents, sub-optimal actions of one agent will easily influence all others.

Topology in Reinforcement Learning Adjodah et al. (2019) discuss the communication topology issue in parallel-running RL algorithms such as A3C (Mnih et al. 2016). Their results show that the centralized learner implicitly yields a fully-connected communication topology among parallel workers, which can harm performance. In MARL with decentralized training, a communication topology is adopted to enable inter-agent communication among networked agents (Zhang et al. 2018; Wang et al. 2019; Konan, Seraj, and Gombolay 2022; Du et al. 2021). The communication topology allows agents to share local information with each other during both training and execution and even reach local consensus, which further leads to better cooperation performance. In MARL with centralized training, deep coordination graph (DCG) (Böhmer, Kurin, and Whiteson 2020) factorizes the joint value function according to a coordination graph to achieve a trade-off between representational capacity and generalization. Deep implicit coordination graph (Li et al. 2020) infers the coordination graph dynamically from agent interactions instead of the domain expertise required by DCG. Ruan et al. (2022) learn an action coordination graph that represents agents' decision dependencies, which further coordinates the dependent behaviors among agents.

4 Topology-based Multi-Agent Policy Gradient

In this section, we propose Topology-based multi-Agent Policy gradiEnt (TAPE), which exploits the agent topology for both stochastic and deterministic MAPG. This use of the agent topology provides a compromise between facilitating cooperation and alleviating the CDM issue. The primary purpose of the agent topology is to indicate the relationships between agents' policy updates, so we focus on the policy gradients of TAPE here and cover the remainder in the supplementary material. First, we define the agent topology.

The agent topology describes how agents should consider others' utility during policy updates. Each agent is a vertex $v\in\mathcal{V}$ and $\mathcal{E}$ is the set of edges. For a given topology $(\mathcal{V},\mathcal{E})$, if $e_{ij}\in\mathcal{E}$, the source agent $i$ should consider the utility of the destination agent $j$ in its policy gradient. The only constraint we place on a topology is that $\forall i,\ e_{ii}\in\mathcal{E}$, because agents should at least consider their own utility in the policy gradient. The topology captures the relationships between agents' policy updates, not their communication network at test time (Foerster et al. 2016; Das et al. 2019; Wang et al. 2019; Ding, Huang, and Lu 2020). Connected agents consider and maximize each other's utilities together. Thus, the shared objective makes each individual agent form a coalition with its connected neighbors. We use the adjacency matrix $E$ to refer to the agent topology in what follows.
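As a minimal illustration (a sketch with a hypothetical three-agent topology, not taken from the paper), an adjacency matrix $E$ with the required self-loops can be read directly as per-agent coalitions:

```python
import numpy as np

def coalitions(E: np.ndarray) -> list:
    """Read coalitions off an agent-topology adjacency matrix E, where E[i, j] = 1 means
    agent i considers agent j's utility in its policy gradient. The only constraint the
    paper imposes is E[i, i] = 1 for every agent."""
    assert np.all(np.diag(E) == 1), "each agent must consider its own utility"
    return [np.flatnonzero(E[i]).tolist() for i in range(E.shape[0])]

# Hypothetical 3-agent topology: agents 0 and 1 form a coalition, agent 2 is alone.
E = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
print(coalitions(E))   # [[0, 1], [0, 1], [2]]
```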

In our agent topology framework, DOP (Wang et al. 2020c) (policy gradient given in Section 2) and other independent-learning algorithms have an edgeless agent topology: the adjacency matrix is the identity matrix and no edge exists between distinct agents. With no coalition, each DOP agent only maximizes its own individual utility $Q_i$, and hence is poor at cooperation. Although DOP adopts a mixing network over the individual utilities to enhance cooperation, an agent's ability to cooperate is still limited, which we show empirically in the matrix game experiments. Methods with a centralized critic, such as COMA (Foerster et al. 2018), FACMAC (Peng et al. 2021) and PAC (Zhou, Lan, and Aggarwal 2022), yield the fully-connected agent topology. In these methods, there is only one coalition containing all of the agents (all edges exist in the topology), and all agents update their policies based on the centralized critic. Consequently, they suffer severely from the CDM issue, because the influence of an agent's sub-optimal behavior spreads to the entire multi-agent system.

4.1 Stochastic TAPE

Instead of a global centralized critic (Foerster et al. 2018), we use the agent topology to aggregate individual utilities and critics to facilitate cooperation among agents for stochastic MAPG (Wang et al. 2020c). To this end, a new learning objective for the policy gradient, the Coalition Utility, is defined below.

Definition 1 (Coalition Utility).

Coalition Utility $\bm{U}_i$ for agent $i$ is the summation of the individual utilities $U_j$ of the agents $j$ connected to $i$ in the agent topology $E$, i.e. $\bm{U}_i=\sum_{j=1}^{n}E_{ij}U_j$, where $U_j(s,a_j)=Q^{\phi}_{tot}(s,\bm{a})-\sum_{a_j^{\prime}}\pi_j(a_j^{\prime}|\tau_j)Q^{\phi}_{tot}(s,(a_j^{\prime},\bm{a}_{-j}))$.

$U_j$ is the aristocrat utility from (Wang et al. 2020c; Wolpert and Tumer 2001). $E_{ij}=1$ only if agent $j$ is connected to agent $i$ in $E$, and $Q^{\phi}_{tot}$ is the global $Q$ value function. The coalition utility only depends on in-coalition agents, because $E_{ij}=0$ whenever agent $j$ is not in agent $i$'s coalition. With the coalition utility, we propose stochastic TAPE with the policy gradient given by

\nabla J_{1}(\theta) = \mathbb{E}_{\bm{\pi}}\left[\sum_{i}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})\mathbf{U}_{i}\right]  (1)
= \mathbb{E}_{\bm{\pi}}\left[\sum_{i,j}E_{ij}k_{j}(s)\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})Q_{j}^{\phi_{j}}(s,a_{j})\right],  (2)

where $k_j\geq 0$ is the weight for agent $j$'s local $Q$ value $Q_j^{\phi_j}$ provided by the mixing network. The derivation from Eq. 1 to Eq. 2 is provided in Appendix A. Since the local utilities of other in-coalition agents are maximized by the policy updates, cooperation among agents is facilitated. Pseudo-code and more details of stochastic TAPE are provided in Appendix D.1.
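A minimal PyTorch-style sketch of a surrogate loss whose gradient corresponds to Eq. 2 is given below; the tensor layout and names are our own assumptions rather than the released implementation.

```python
import torch

def stochastic_tape_loss(log_pi, q_local, k, E):
    """Surrogate loss whose gradient matches Eq. 2 for one batch of transitions
    (a minimal sketch; tensor names and shapes are assumptions).

    log_pi : [B, n] log pi_i(a_i | tau_i) of the taken actions (differentiable)
    q_local: [B, n] local critic values Q_j^{phi_j}(s, a_j)    (treated as constants)
    k      : [B, n] non-negative mixing weights k_j(s)         (treated as constants)
    E      : [n, n] agent-topology adjacency matrix with E[i, i] = 1
    """
    # For each agent i: sum_j E_ij * k_j(s) * Q_j(s, a_j), the coalition-weighted term.
    coalition_term = (k * q_local).detach() @ E.to(q_local.dtype).T   # [B, n]
    # Minimizing this loss ascends the policy gradient of Eq. 2.
    return -(log_pi * coalition_term).sum(dim=1).mean()
```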

4.2 Deterministic TAPE

Current deterministic MAPG methods (Peng et al. 2021; Zhang et al. 2021; Zhou, Lan, and Aggarwal 2022) yield a fully-connected agent topology, which makes agents vulnerable to the bad influence of other agents' sub-optimal actions. A mixing network $f_{\text{mix}}$ is adopted to mix the local $Q$ value functions $Q^{\pi_i}_i$. Each agent uses the deterministic policy gradient to update its parameters and directly maximize the global $Q$ value $Q_{tot}^{\bm{\pi}}=f_{\text{mix}}(s,Q^{\pi_1}_1,\cdots,Q^{\pi_n}_n)$. We use the agent topology to drop out the utilities of out-of-coalition agents, so that the influence of their sub-optimal actions will not spread to in-coalition agents. To this end, the Coalition $Q$ is defined below.

Definition 2 (Coalition $Q$).

Coalition $Q$, $Q^i_{\text{co}}$, for agent $i$ is the mixture of its in-coalition agents' local $Q$ values with the mixing network $f_{\text{mix}}$, i.e.

Q^{i}_{\text{co}}(s,\bm{a})=f_{\text{mix}}(s,\mathds{1}[E_{i1}]Q^{\pi_{1}}_{1},\cdots,\mathds{1}[E_{in}]Q^{\pi_{n}}_{n}),  (3)

where $\mathds{1}[E_{ij}]$ is the indicator function and $\mathds{1}[E_{ij}]=1$ only when edge $E_{ij}$ exists in the topology.

Figure 1: (a) gives the three proposed matrix games of different difficulty levels. We use different colors for the different games: blue represents Easy, green represents Medium and red represents Hard. (b), (c) and (d) give the evaluation results. Stochastic TAPE has the best performance because the agents directly maximize a joint utility to achieve strong cooperation. The only difference between TAPE and DOP is that TAPE adopts the agent topology. Although COMA is seen as a weak baseline on SMAC, it achieves much better performance than DOP here. QMIX fails to perform well in these games as they are not monotonic games.

During policy updates, out-of-coalition agents' $Q$ values are always masked out, so agent $i$'s policy learning will not be affected by out-of-coalition agents. Based on the Coalition $Q$, we propose deterministic TAPE, whose policy gradient is given by

\nabla J_{2}(\theta) = \mathbb{E}_{\mathcal{D}}\left[\sum_{i}\nabla_{\theta_{i}}\pi_{i}(\tau_{i})\nabla_{a_{i}}\hat{Q}_{\text{co}}^{i}(s,\bm{a})|_{a_{i}=\pi_{i}(\tau_{i})}\right]  (4)

where $\hat{Q}_{\text{co}}^{i}(s,\bm{a})=f_{\text{mix}}\left(s,\mathds{1}[E_{i1}]\hat{Q}^{\phi_{1}}_{1},\cdots,\mathds{1}[E_{in}]\hat{Q}^{\phi_{n}}_{n}\right)$ and $\hat{Q}^{\phi_{i}}_{i}(\tau_{i},a_{i},m_{i})=Q^{\phi_{i}}_{i}(\tau_{i},a_{i},m_{i})-\alpha\log\pi_{i}(a_{i}|\tau_{i})$ is the local soft $Q$ value (Zhang et al. 2021) augmented with assistive information $m_i$, which guides policy learning towards the optimal policy as in (Zhou, Lan, and Aggarwal 2022). After dropping out agents not in the coalition, the bad influence of out-of-coalition sub-optimal actions no longer affects in-coalition agents. More details and pseudo-code are provided in Appendix D.2.
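The masking in Eq. 3 and the per-agent objective behind Eq. 4 can be sketched as follows (names, shapes and the `f_mix` signature are assumptions; the actual mixing network and soft $Q$ details follow the appendix):

```python
import torch

def coalition_q(f_mix, state, q_local, E, i):
    """Coalition Q for agent i (Eq. 3): mask out the local Q values of out-of-coalition
    agents before mixing. f_mix is assumed to map (state, per-agent Q vector [B, n])
    to a scalar value [B, 1]."""
    mask = E[i].to(q_local.dtype)          # 1 for in-coalition agents j, 0 otherwise
    return f_mix(state, q_local * mask)

def deterministic_tape_objective(f_mix, state, q_local, E):
    """Objective maximized by Eq. 4, assuming q_local was computed from differentiable
    actions a_i = pi_i(tau_i) so gradients flow back into each policy."""
    n = E.shape[0]
    return sum(coalition_q(f_mix, state, q_local, E, i).mean() for i in range(n))
```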

5 Analysis

5.1 Agent Topology

Although the agent topology can be arbitrary, a proper agent topology should enable agents to explore diverse cooperation patterns, which is essential for robust cooperation (Li et al. 2021; Strouse et al. 2021; Lou et al. 2023a). We study three prevalent random graph models: the Barabási–Albert (BA) model (Albert and Barabási 2002), the Watts–Strogatz (WS) model (Watts and Strogatz 1998) and the Erdős–Rényi (ER) model (Erdős, Rényi et al. 1960). The BA model generates scale-free networks commonly used to model citation networks and biological signaling networks (Barabási and Albert 1999). The WS model generates small-world networks, in which every node can be reached from any other through a small number of intermediate nodes, as in the six-degrees-of-separation phenomenon (Travers and Milgram 1977). In the ER model, each edge between any two nodes is present with an independent probability. Formally, the adjacency matrix $E$ of the ER agent topology $(\mathcal{V},\mathcal{E})$ for $n$ agents is defined as: $\forall i\in\{1,..,n\}$, $E_{ii}=1$; $\forall i,j\in\{1,..,n\}$ with $i\neq j$, $E_{ij}=1$ with probability $p$ and $0$ otherwise.
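Sampling such a topology amounts to a few lines of numpy; the sketch below simply instantiates the definition above (the seed and agent count are arbitrary choices for illustration):

```python
import numpy as np

def er_agent_topology(n: int, p: float, rng: np.random.Generator) -> np.ndarray:
    """Sample an Erdős–Rényi agent topology as defined above: every self-loop E_ii = 1,
    and every off-diagonal edge E_ij is present independently with probability p."""
    E = (rng.random((n, n)) < p).astype(int)
    np.fill_diagonal(E, 1)
    return E

rng = np.random.default_rng(0)
E = er_agent_topology(n=12, p=0.3, rng=rng)
print(E.sum(axis=1))   # each agent's out-degree, including its self-loop
```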

In research question 1 of Section 6.3, we find that the ER model generates the most diverse topologies, which in turn helps agents explore diverse cooperation patterns and achieve the strongest performance. Thus, we use the ER model to constitute the agent topology in our experiments.

5.2 Theoretical Results

We now establish the policy improvement theorem for stochastic TAPE, and prove a theorem on the cooperation improvement from the perspective of exploring the parameter space, a common motivation in RL research (Schulman, Chen, and Abbeel 2017; Haarnoja et al. 2018; Zhang et al. 2021; Adjodah et al. 2019). We assume the policies have tabular expressions.

The following theorem states that stochastic TAPE updates can monotonically improve the objective function $J(\bm{\pi})=\mathbb{E}_{\bm{\pi}}\left[\sum_{t}\gamma^{t}r_{t}\right]$.

Theorem 1. [Stochastic TAPE policy improvement theorem] With tabular expressions for policies, for any pre-update policy $\bm{\pi}$ and updated policy $\hat{\bm{\pi}}$ obtained by the policy gradient in Eq. 2 that satisfy, for any agent $i$, $\hat{\pi}_i(a_i|\tau_i)=\pi_i(a_i|\tau_i)+\beta_{a_i,s}\delta$, where $\delta$ is a sufficiently small number, we have $J(\hat{\bm{\pi}})\geq J(\bm{\pi})$, i.e. the joint policy is improved by the update.

Please refer to Appendix B for the proof of Theorem 1. Although this policy improvement theorem is established for policies with tabular expressions, we provide conditions in the proof under which policy improvement is guaranteed even with function approximators.

Next, we provide a theoretical insight that, compared to using individual critics, stochastic TAPE can better explore the parameter space to find more effective cooperation patterns. One heuristic for measuring such capacity is the diversity of parameter updates during each iteration (Adjodah et al. 2019), which is measured by the variance of the parameter updates.

Figure 2: (a) shows the scenario 6x6-3p-4f in LBF; 6x6-3p-4f stands for a 6x6 grid-world with 3 players and 4 fruits. (b) In 8x8-2p-3f, stochastic TAPE achieves the best performance. In the more difficult task 15x15-4p-5f (c), deterministic TAPE outperforms its base method and all other baselines. Compare stochastic TAPE against DOP, and deterministic TAPE against PAC.

Given state $s$ and action $a_i$, let $\xi^{\text{TAPE}}_{a_i,s}$ and $\xi^{\text{DOP}}_{a_i,s}$ denote the stochastic TAPE and DOP parameter updates respectively. The following theorem states that stochastic TAPE policy updates are more diverse, so stochastic TAPE can explore the parameter space more effectively.

Theorem 2. For any agent $i$ and $\forall s,a_i$, the stochastic TAPE policy update $\xi^{\text{TAPE}}_{a_i,s}$ and the DOP policy update $\xi^{\text{DOP}}_{a_i,s}$ satisfy $\text{Var}\left[\xi^{\text{TAPE}}_{a_i,s}\right]\geq\text{Var}\left[\xi^{\text{DOP}}_{a_i,s}\right]$, and $\Delta=\text{Var}\left[\xi^{\text{TAPE}}_{a_i,s}\right]-\text{Var}\left[\xi^{\text{DOP}}_{a_i,s}\right]$ is proportional to $p^2$, where $p$ is the probability of edges being present in the Erdős–Rényi model, i.e. $\Delta\propto p^2$.

Theorem 2 shows that, compared to solely using individual critics, our agent topology provides larger diversity in policy updates to find better cooperation patterns. More details and the proof are provided in Appendix C. It is worth noting that although a larger hyperparameter $p$ in the agent topology means larger diversity in parameter updates, the CDM issue also becomes more severe because the connections among agents become denser. Thus, $p$ must be set properly to achieve a compromise between facilitating cooperation and avoiding the CDM issue, which we show later in the experiments.

6 Experiment

In this section, we first demonstrate that ignoring other agents in the policy gradient to avoid the bad influence of their sub-optimal actions severely harms cooperation among agents. To this end, three one-step matrix games that require strong cooperation are proposed. Then, we evaluate the efficacy of the proposed methods on (a) Level-Based Foraging (LBF) (Papoudakis et al. 2021) and (b) the StarCraft II Multi-Agent Challenge (SMAC) (Samvelyan et al. 2019), and answer several research questions via various ablations and a heuristic graph search technique. Our code is available at github.com/LxzGordon/TAPE.

6.1 Matrix Game

We propose three one-step matrix games, which are harder versions of the example in the introduction. The matrix games are given in Fig. 1(a). We use different colors to show rewards in the different games (blue for Easy, green for Medium and red for Hard). The optimal joint policy is for both agents to take action $a_0$. But agent $A_0$ lacks motivation to choose $a_0$ because it is very likely to receive a large penalty ($-8$ or $-16$). Thus, this game requires strong cooperation among agents. In the Medium game, we further increase the penalty for agent $A_0$ choosing $a_0$. In the Hard game, we keep the large penalty and add a locally optimal reward at $(a_1,a_1)$. Note that these matrix games are not monotonic games (Rashid et al. 2020b), as the optimal action for each agent depends on the other agent. The evaluation results are given in Fig. 1.

With the agent topology encouraging cooperation, stochastic TAPE outperforms the other methods by a large margin and is able to learn the optimal joint policy even in the Hard game. DOP agents optimize individual utilities, ignoring the utilities of other agents to avoid the influence of their sub-optimal actions, which results in severe miscoordination in these games. Since DOP agents adopt stochastic policies, they may still receive some large rewards after enough exploration, but their learning efficiency is much lower than that of stochastic TAPE. COMA is a weak baseline on complex tasks (Samvelyan et al. 2019) (0% win rate in all maps in Section 6.3), but since COMA agents optimize the global $Q$ value (the expected team reward sum) instead of the individual utilities used in DOP, COMA achieves better results on these tasks that require strong cooperation. These matrix games demonstrate the importance of considering the utilities of other agents in cooperative tasks. With the agent topology, stochastic TAPE facilitates cooperation among agents and alleviates the CDM issue simultaneously.

6.2 Level-Based Foraging

Figure 3: Experiment results on SMAC. (a-c) give the results on hard maps, and (d-f) give the results on super-hard maps. After adopting our agent topology to facilitate cooperation and alleviate the CDM issue, stochastic TAPE and deterministic TAPE outperform their respective base methods. Compare stochastic TAPE against DOP, and deterministic TAPE against PAC.

In Level-Based Foraging (LBF) (Papoudakis et al. 2021), agents navigate a grid-world and collect randomly scattered food items. Agents and food items are assigned levels. A food item can only be collected when the level sum of the nearby agents exceeds the food's level. Reward is only given when a food item is collected, which makes the environment sparse-reward. The test return is 1 when all food items are collected. The compared baselines include value-based methods, QMIX (Rashid et al. 2020b) and QPLEX (Wang et al. 2020a), and policy-based methods, DOP (Wang et al. 2020c), FACMAC (Peng et al. 2021) and PAC (Zhou, Lan, and Aggarwal 2022). Scenario illustrations and results are given in Fig. 2.

To make 8x8-2p-3f more difficult, food items can only be collected when all agents participate. In this simple, sparse-reward task, with its stochastic policy and enhanced cooperation, stochastic TAPE outperforms all other methods in convergence speed and performance. In 15x15-4p-5f, only the state-of-the-art method PAC and deterministic TAPE learn to collect food items. With the agent topology keeping out the bad influence of other agents' sub-optimal actions, deterministic TAPE achieves the best performance.

6.3 StarCraft Multi-Agent Challenge

The StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al. 2019) is a challenging benchmark built on StarCraft II, where agents must cooperate to defeat enemy teams controlled by the built-in AI. We evaluate the proposed methods and baselines with the recommended evaluation protocol and metric in six maps, including three hard maps (3s_vs_4z, 5m_vs_6m and 2c_vs_64zg) and three super-hard maps (corridor, MMM2 and 6h_vs_8z). All algorithms are run four times with different random seeds. Each run lasts for $5\times 10^6$ environment steps. During training, each algorithm uses four parallel environments to collect training data.

Figure 4: (a) and (b) show the topologies and performance obtained when using different models to constitute agent topologies. BA is the Barabási–Albert model, WS is the Watts–Strogatz model, ER is the Erdős–Rényi model, and Edgeless and FC (Fully-Connected) are the topologies adopted in DOP and PAC respectively. ER produces the most diverse topologies and the strongest performance. (c) and (d) show the performance of stochastic TAPE and deterministic TAPE on MMM2 with different values of the hyperparameter $p$ for the ER model. The evaluation metric is test win rate, and scores are normalized by the base method. The base method DOP corresponds to $p=0$ and the base method PAC to $p=1$. The boxplots are obtained with four different random seeds, and the red lines show the mean performance.
Figure 5: The heatmaps show the difference between the frequency of edges being present and the probability $p$. Source and Destination represent the starting node and destination node of an edge. During training, over 1 million agent topologies are generated. By the law of large numbers, the difference stays around 0 when the heuristic graph search technique is not used, as in (b). In (a) and (c), we adopt the heuristic graph search technique to choose the agent topology with the strongest performance. When $p$ is too small (0.01 in (a)), the connections among agents are too sparse, weakening cooperation among agents. Therefore, agent topologies with more edges can facilitate cooperation and are preferred by the graph search technique; as a result, the difference is always positive in (a). On the contrary, when the connections are too dense ($p=0.3$ in (c)), topologies with fewer edges are preferred because they stop the bad influence of sub-optimal actions from spreading and have better performance, resulting in negative differences in (c).

Overall results The overall results in the six maps are provided in Fig. 3. Deterministic TAPE outperforms all other methods in terms of performance and convergence speed. In 6h_vs_8z, one of the most difficult maps in SMAC, deterministic TAPE achieves noticeably better performance than its base method PAC and the other baselines. It is worth noting that after integrating the agent topology, both stochastic TAPE and deterministic TAPE outperform their base methods. This demonstrates the efficacy of the proposed agent topology in facilitating cooperation for DOP and alleviating the CDM issue for PAC. Notably, in 2c_vs_64zg, stochastic TAPE outperforms all of the baselines except our deterministic TAPE, while its base method DOP struggles to perform well.

Next, we answer three research questions via ablations and additional experiments: Q1. What is a proper model to constitute the agent topology? Q2. Is there indeed a compromise between facilitating cooperation and suffering from the CDM issue? Q3. Is the agent topology capable of compromising between facilitating cooperation and alleviating the CDM issue to achieve the best performance?

Q1. We study three prevalent random graph models, the Barabási–Albert (BA) model (Albert and Barabási 2002), the Watts–Strogatz (WS) model (Watts and Strogatz 1998) and the Erdős–Rényi (ER) model (Erdős, Rényi et al. 1960), via visualization and an ablation study. First, we generate 1000 topologies for 12 agents with each model and give the visualization result in Fig. 4(a), where the $x$-axis is the average degree and the $y$-axis is the connectivity (the minimum number of edges whose removal splits the graph into two sub-graphs). Average degree and connectivity are two essential factors for the agent topology, as they reflect the level of the CDM issue and of cooperation. Compared to the other two models, the ER model generates much more diverse topologies, covering the area from the edgeless topology to the fully-connected topology. Then, we evaluate stochastic TAPE with each model on MMM2, a super-hard map in SMAC. Results are given in Fig. 4(b). For the random graph models, the larger the graph diversity in Fig. 4(a), the stronger the performance. Thus, we constitute the agent topology with the ER model in the other experiments. For the fully-connected topology, the performance shows very large variance, because once a sub-optimal action occurs, its bad influence easily spreads through the centralized critic to all other agents. It is worth noting that the graphs could also be generated via Bayesian optimization, but this may result in limited graph diversity, causing unstable or even worse performance. Thus, how to generate agent topologies via optimization-based methods remains a challenge.

Q2. The compromise here means that the more connections among agents there are to improve performance, the more severe the CDM issue becomes, and when it becomes too severe, it in turn harms performance. To answer this research question, we devise a heuristic graph search technique. During policy training of agent $i$, we generate $n$ topologies with the ER model in each step and use them to update the agent's policy. After obtaining $n$ updated policies $[\pi_1^i,..,\pi_n^i]$, we evaluate the post-update global $Q$ value $Q_{tot}^{\bm{\pi}^{-i},\pi^i_j}$ and choose the policy with the largest global $Q$ value as the updated policy, i.e. $\pi^i=\arg\max_j Q_{tot}^{\bm{\pi}^{-i},\pi^i_j}$. The motivation of this heuristic graph search technique is that the global $Q$ value is the expected future reward sum, which reflects post-update performance. Using this technique, we can find the topology with better performance. Then, we apply the graph search technique when $p$ is small or large respectively and visualize the preferred topologies in Fig. 5. The results confirm that the compromise does exist, because (1) facilitating cooperation by building more agent connections when there is little CDM issue (Fig. 5(a)) and (2) removing connections to stop the bad influence of sub-optimal actions from spreading when the CDM issue is severe (Fig. 5(c)) can both improve performance.
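To make the procedure concrete, the following sketch outlines one such topology-selection step in Python; the callables and their signatures are illustrative assumptions rather than the released implementation.

```python
import copy

def graph_search_update(agent_i, policies, global_critic, batch, n_candidates, sample_topology):
    """One policy-update step for agent i under the heuristic graph search described above.

    sample_topology()                 -> a candidate ER agent topology E
    agent_i.updated_with(E, batch)    -> copy of agent i's policy after one gradient step under E
    global_critic(policies, batch)    -> estimate of the post-update global Q value Q_tot
    """
    best_policy, best_q = None, float("-inf")
    for _ in range(n_candidates):
        E = sample_topology()
        candidate = agent_i.updated_with(E, batch)       # pi_j^i
        joint = copy.copy(policies)
        joint[agent_i.index] = candidate                 # joint policy (pi^{-i}, pi_j^i)
        q_tot = global_critic(joint, batch)              # Q_tot^{pi^{-i}, pi_j^i}
        if q_tot > best_q:
            best_policy, best_q = candidate, q_tot
    return best_policy                                   # argmax_j of the post-update Q_tot
```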

Q3. We answer this research question by reporting performance under different values of the hyperparameter $p$, as it controls the level of the CDM issue and of cooperation. The results are given in Fig. 4(c), (d). A large $p$ yields dense connections, where agents are easily affected by the sub-optimal actions of other agents but cooperation is strongly encouraged. A small $p$ yields sparse connections, where the influence of sub-optimal actions does not spread easily but cooperation among agents is limited. (c) and (d) are drawn at the end of training and at half of training to show convergence performance and speed respectively. The performance first increases while $p$ is small and later decreases when $p$ becomes too large. The best performance appears at the point where cooperation is strong and the CDM issue is acceptable. From these results, we conclude that our ER agent topology is able to compromise between facilitating cooperation and alleviating the CDM issue to achieve the best performance.

7 Conclusion and Future Work

In this paper, we propose an agent topology framework, which aims to alleviate the CDM issue without limiting agents' cooperation capacity. Based on the agent topology, we propose TAPE for both stochastic and deterministic MAPG methods. Theoretically, we prove the policy improvement theorem for stochastic TAPE and give a theoretical explanation of the improved cooperation among agents. Empirically, we evaluate the proposed methods on several benchmarks. Experiment results show that the methods outperform their base methods and other baselines in terms of convergence speed and performance. A heuristic graph search algorithm is devised and various studies are conducted, which validate the efficacy of the proposed agent topology.

Limitation and Future Work In this work, we construct the agent topology with existing random graph models rather than learning-based methods. In future work, we aim to adaptively learn an agent topology that can simultaneously facilitate agent cooperation and alleviate the CDM issue.

References

  • Adjodah et al. (2019) Adjodah, D.; Calacci, D.; Dubey, A.; Goyal, A.; Krafft, P.; Moro, E.; and Pentland, A. 2019. Leveraging Communication Topologies Between Learning Agents in Deep Reinforcement Learning. arXiv preprint arXiv:1902.06740.
  • Albert and Barabási (2002) Albert, R.; and Barabási, A.-L. 2002. Statistical mechanics of complex networks. Reviews of modern physics, 74(1): 47.
  • Alemi et al. (2016) Alemi, A. A.; Fischer, I.; Dillon, J. V.; and Murphy, K. 2016. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.
  • Barabási and Albert (1999) Barabási, A.-L.; and Albert, R. 1999. Emergence of scaling in random networks. science, 286(5439): 509–512.
  • Böhmer, Kurin, and Whiteson (2020) Böhmer, W.; Kurin, V.; and Whiteson, S. 2020. Deep coordination graphs. In International Conference on Machine Learning, 980–991. PMLR.
  • Chen et al. (2022) Chen, W.; Li, W.; Liu, X.; and Yang, S. 2022. Learning Credit Assignment for Cooperative Reinforcement Learning. arXiv preprint arXiv:2210.05367.
  • Das et al. (2019) Das, A.; Gervet, T.; Romoff, J.; Batra, D.; Parikh, D.; Rabbat, M.; and Pineau, J. 2019. Tarmac: Targeted multi-agent communication. In International Conference on Machine Learning, 1538–1546. PMLR.
  • Degris, White, and Sutton (2012) Degris, T.; White, M.; and Sutton, R. S. 2012. Off-policy actor-critic. arXiv preprint arXiv:1205.4839.
  • Ding, Huang, and Lu (2020) Ding, Z.; Huang, T.; and Lu, Z. 2020. Learning individually inferred communication for multi-agent cooperation. Advances in Neural Information Processing Systems, 33: 22069–22079.
  • Du et al. (2019) Du, Y.; Han, L.; Fang, M.; Dai, T.; Liu, J.; and Tao, D. 2019. LIIR: learning individual intrinsic reward in multi-agent reinforcement learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS), 4403–4414.
  • Du et al. (2023) Du, Y.; Leibo, J. Z.; Islam, U.; Willis, R.; and Sunehag, P. 2023. A Review of Cooperation in Multi-agent Learning. arXiv preprint arXiv:2312.05162.
  • Du et al. (2021) Du, Y.; Liu, B.; Moens, V.; Liu, Z.; Ren, Z.; Wang, J.; Chen, X.; and Zhang, H. 2021. Learning Correlated Communication Topology in Multi-Agent Reinforcement learning. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 456–464.
  • Erdős, Rényi et al. (1960) Erdős, P.; Rényi, A.; et al. 1960. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1): 17–60.
  • Feinberg et al. (2018) Feinberg, V.; Wan, A.; Stoica, I.; Jordan, M. I.; Gonzalez, J. E.; and Levine, S. 2018. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101.
  • Foerster et al. (2016) Foerster, J.; Assael, I. A.; De Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. Advances in neural information processing systems, 29.
  • Foerster et al. (2018) Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. 2018. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI conference on artificial intelligence, volume 32.
  • Haarnoja et al. (2018) Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. 2018. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
  • Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Konan, Seraj, and Gombolay (2022) Konan, S.; Seraj, E.; and Gombolay, M. 2022. Iterated reasoning with mutual information in cooperative and byzantine decentralized teaming. arXiv preprint arXiv:2201.08484.
  • Kraemer and Banerjee (2016) Kraemer, L.; and Banerjee, B. 2016. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190: 82–94.
  • Li et al. (2021) Li, C.; Wang, T.; Wu, C.; Zhao, Q.; Yang, J.; and Zhang, C. 2021. Celebrating diversity in shared multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 34: 3991–4002.
  • Li et al. (2020) Li, S.; Gupta, J. K.; Morales, P.; Allen, R.; and Kochenderfer, M. J. 2020. Deep implicit coordination graphs for multi-agent reinforcement learning. arXiv preprint arXiv:2006.11438.
  • Lillicrap et al. (2015) Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
  • Liu et al. (2022) Liu, H.; Kiumarsi, B.; Kartal, Y.; Koru, A. T.; Modares, H.; and Lewis, F. L. 2022. Reinforcement learning applications in unmanned vehicle control: A comprehensive overview. Unmanned Systems, 1–10.
  • Lou et al. (2023a) Lou, X.; Guo, J.; Zhang, J.; Wang, J.; Huang, K.; and Du, Y. 2023a. PECAN: Leveraging Policy Ensemble for Context-Aware Zero-Shot Human-AI Coordination. arXiv preprint arXiv:2301.06387.
  • Lou et al. (2023b) Lou, X.; Zhang, J.; Du, Y.; Yu, C.; He, Z.; and Huang, K. 2023b. Leveraging Joint-action Embedding in Multi-agent Reinforcement Learning for Cooperative Games. IEEE Transactions on Games.
  • Lowe et al. (2017) Lowe, R.; Wu, Y. I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30.
  • Mnih et al. (2016) Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, 1928–1937. PMLR.
  • Mnih et al. (2013) Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • Munos et al. (2016) Munos, R.; Stepleton, T.; Harutyunyan, A.; and Bellemare, M. 2016. Safe and efficient off-policy reinforcement learning. Advances in neural information processing systems, 29.
  • Noaeen et al. (2022) Noaeen, M.; Naik, A.; Goodman, L.; Crebo, J.; Abrar, T.; Abad, Z. S. H.; Bazzan, A. L.; and Far, B. 2022. Reinforcement learning in urban network traffic signal control: A systematic literature review. Expert Systems with Applications, 116830.
  • Oliehoek and Amato (2016) Oliehoek, F. A.; and Amato, C. 2016. A concise introduction to decentralized POMDPs. Springer.
  • Oliehoek, Spaan, and Vlassis (2008) Oliehoek, F. A.; Spaan, M. T.; and Vlassis, N. 2008. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32: 289–353.
  • Papoudakis et al. (2021) Papoudakis, G.; Christianos, F.; Schäfer, L.; and Albrecht, S. V. 2021. Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS).
  • Peng et al. (2021) Peng, B.; Rashid, T.; Schroeder de Witt, C.; Kamienny, P.-A.; Torr, P.; Böhmer, W.; and Whiteson, S. 2021. Facmac: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems, 34: 12208–12221.
  • Precup (2000) Precup, D. 2000. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, 80.
  • Rashid et al. (2020a) Rashid, T.; Farquhar, G.; Peng, B.; and Whiteson, S. 2020a. Weighted qmix: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. Advances in neural information processing systems, 33: 10199–10210.
  • Rashid et al. (2020b) Rashid, T.; Samvelyan, M.; De Witt, C. S.; Farquhar, G.; Foerster, J.; and Whiteson, S. 2020b. Monotonic value function factorisation for deep multi-agent reinforcement learning. The Journal of Machine Learning Research, 21(1): 7234–7284.
  • Ruan et al. (2022) Ruan, J.; Du, Y.; Xiong, X.; Xing, D.; Li, X.; Meng, L.; Zhang, H.; Wang, J.; and Xu, B. 2022. GCS: graph-based coordination strategy for multi-agent reinforcement learning. arXiv preprint arXiv:2201.06257.
  • Samvelyan et al. (2019) Samvelyan, M.; Rashid, T.; de Witt, C. S.; Farquhar, G.; Nardelli, N.; Rudner, T. G. J.; Hung, C.-M.; Torr, P. H. S.; Foerster, J.; and Whiteson, S. 2019. The StarCraft Multi-Agent Challenge. CoRR, abs/1902.04043.
  • Schulman, Chen, and Abbeel (2017) Schulman, J.; Chen, X.; and Abbeel, P. 2017. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440.
  • Silver et al. (2014) Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In International conference on machine learning, 387–395. Pmlr.
  • Strouse et al. (2021) Strouse, D.; McKee, K.; Botvinick, M.; Hughes, E.; and Everett, R. 2021. Collaborating with humans without human data. Advances in Neural Information Processing Systems, 34: 14502–14515.
  • Sutton and Barto (2018) Sutton, R. S.; and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT press.
  • Tishby, Pereira, and Bialek (2000) Tishby, N.; Pereira, F. C.; and Bialek, W. 2000. The information bottleneck method. arXiv preprint physics/0004057.
  • Travers and Milgram (1977) Travers, J.; and Milgram, S. 1977. An experimental study of the small world problem. In Social networks, 179–197. Elsevier.
  • Wang et al. (2020a) Wang, J.; Ren, Z.; Liu, T.; Yu, Y.; and Zhang, C. 2020a. Qplex: Duplex dueling multi-agent q-learning. arXiv preprint arXiv:2008.01062.
  • Wang et al. (2023) Wang, S.; Hu, S.; Guo, B.; and Wang, G. 2023. Cross-Region Courier Displacement for On-Demand Delivery With Multi-Agent Reinforcement Learning. IEEE Transactions on Big Data.
  • Wang et al. (2020b) Wang, T.; Gupta, T.; Mahajan, A.; Peng, B.; Whiteson, S.; and Zhang, C. 2020b. Rode: Learning roles to decompose multi-agent tasks. arXiv preprint arXiv:2010.01523.
  • Wang et al. (2019) Wang, T.; Wang, J.; Zheng, C.; and Zhang, C. 2019. Learning nearly decomposable value functions via communication minimization. arXiv preprint arXiv:1910.05366.
  • Wang et al. (2020c) Wang, Y.; Han, B.; Wang, T.; Dong, H.; and Zhang, C. 2020c. Off-policy multi-agent decomposed policy gradients. arXiv preprint arXiv:2007.12322.
  • Watts and Strogatz (1998) Watts, D. J.; and Strogatz, S. H. 1998. Collective dynamics of ‘small-world’networks. nature, 393(6684): 440–442.
  • Wolpert and Tumer (2001) Wolpert, D. H.; and Tumer, K. 2001. Optimal payoff functions for members of collectives. Advances in Complex Systems, 4(02n03): 265–279.
  • Zhang et al. (2018) Zhang, K.; Yang, Z.; Liu, H.; Zhang, T.; and Basar, T. 2018. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, 5872–5881. PMLR.
  • Zhang et al. (2021) Zhang, T.; Li, Y.; Wang, C.; Xie, G.; and Lu, Z. 2021. Fop: Factorizing optimal joint policy of maximum-entropy multi-agent reinforcement learning. In International Conference on Machine Learning, 12491–12500. PMLR.
  • Zhou, Lan, and Aggarwal (2022) Zhou, H.; Lan, T.; and Aggarwal, V. 2022. PAC: Assisted Value Factorisation with Counterfactual Predictions in Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2206.11420.
  • Zhou et al. (2020) Zhou, M.; Liu, Z.; Sui, P.; Li, Y.; and Chung, Y. Y. 2020. Learning implicit credit assignment for cooperative multi-agent reinforcement learning. Advances in neural information processing systems, 33: 11853–11864.

A   Derivation of Policy Gradient

In this section, we give the derivation of the following policy gradient for stochastic TAPE

\nabla J(\theta) = \mathbb{E}_{\bm{\pi}}\left[\sum_{i}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})\mathbf{U}_{i}\right]  (5)
= \mathbb{E}_{\bm{\pi}}\left[\sum_{i,j}E_{ij}k_{j}(s)\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})Q_{j}^{\phi_{j}}(s,a_{j})\right],

where $\bm{U}_i=\sum_{j=1}^{n}E_{ij}U_j$ is the coalition utility of agent $i$ with the other agents connected to it in the agent topology, $U_j(s,a_j)=Q^{\phi}_{tot}(s,\bm{a})-\sum_{a_j^{\prime}}\pi_j(a_j^{\prime}|\tau_j)Q^{\phi}_{tot}(s,(a_j^{\prime},\bm{a}_{-j}))$ is the aristocrat utility of agent $j$ from (Wolpert and Tumer 2001; Wang et al. 2020c), and $\forall i,\ E_{ii}=1$, i.e. all agents are connected to themselves.

Proof. First, we will reformulate the aristocrat utility for the policy gradient

U_{i}(s,a_{i}) = Q^{\phi}_{tot}(s,\bm{a})-\sum_{a_{i}^{\prime}}\pi_{i}(a_{i}^{\prime}|\tau_{i})Q^{\phi}_{tot}(s,(a_{i}^{\prime},\bm{a}_{-i}))
= \sum_{j}k_{j}(s)Q^{\phi_{j}}_{j}(s,a_{j})-\sum_{a_{i}^{\prime}}\pi_{i}(a_{i}^{\prime}|\tau_{i})\left[\sum_{j\neq i}k_{j}(s)Q^{\phi_{j}}_{j}(s,a_{j})+k_{i}(s)Q^{\phi_{i}}_{i}(s,a_{i}^{\prime})\right]
= k_{i}(s)Q^{\phi_{i}}_{i}(s,a_{i})-k_{i}(s)\sum_{a_{i}^{\prime}}\pi_{i}(a_{i}^{\prime}|\tau_{i})Q^{\phi_{i}}_{i}(s,a_{i}^{\prime})
= k_{i}(s)\left[Q^{\phi_{i}}_{i}(s,a_{i})-\sum_{a_{i}^{\prime}}\pi_{i}(a_{i}^{\prime}|\tau_{i})Q^{\phi_{i}}_{i}(s,a_{i}^{\prime})\right].

Similarly, $U_j(s,a_j)=k_j(s)\left[Q_j^{\phi_j}(s,a_j)-\sum_{a_j^{\prime}}\pi_j(a_j^{\prime}|\tau_j)Q_j^{\phi_j}(s,a_j^{\prime})\right]$. So the policy gradient $g$ is

g = \mathbb{E}_{\bm{\pi}}\left[\sum_{i,j}E_{ij}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})U_{j}(s,a_{j})\right]
= \mathbb{E}_{\bm{\pi}}\left[\sum_{i,j}E_{ij}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})k_{j}(s)\left(Q_{j}^{\phi_{j}}(s,a_{j})-\sum_{a_{j}^{\prime}}\pi_{j}(a_{j}^{\prime}|\tau_{j})Q_{j}^{\phi_{j}}(s,a_{j}^{\prime})\right)\right].

Let $q_j(s)=\sum_{a_j^{\prime}}\pi_j(a_j^{\prime}|\tau_j)Q_j^{\phi_j}(s,a_j^{\prime})$. Consider a given agent $i$:

g_{i} = \mathbb{E}_{\bm{\pi}}\left[\sum_{j}E_{ij}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})k_{j}(s)\left(Q_{j}^{\phi_{j}}(s,a_{j})-q_{j}(s)\right)\right]
= \mathbb{E}_{\bm{\pi}}\left[\sum_{j}E_{ij}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})k_{j}(s)Q_{j}^{\phi_{j}}(s,a_{j})\right]-\mathbb{E}_{\bm{\pi}}\left[\sum_{j}E_{ij}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})k_{j}(s)q_{j}(s)\right].

Since

\mathbb{E}_{\bm{\pi}}\left[\sum_{j}E_{ij}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})k_{j}(s)q_{j}(s)\right]
= \sum_{s}d^{\bm{\pi}}(s)\sum_{a_{i}}\sum_{\bm{a}^{-i}}\pi_{i}(a_{i}|\tau_{i})\bm{\pi}^{-i}(\bm{a}^{-i}|\bm{\tau}^{-i})\sum_{j}E_{ij}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})k_{j}(s)q_{j}(s)
= \sum_{s}d^{\bm{\pi}}(s)\sum_{j}\sum_{\bm{a}^{-i}}E_{ij}\bm{\pi}^{-i}(\bm{a}^{-i}|\bm{\tau}^{-i})k_{j}(s)q_{j}(s)\sum_{a_{i}}\pi_{i}(a_{i}|\tau_{i})\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})=0,

$g_i=\mathbb{E}_{\bm{\pi}}\left[\sum_{j}E_{ij}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})k_{j}(s)Q_{j}^{\phi_{j}}(s,a_{j})\right]$. Thus,

\nabla J(\theta)=\mathbb{E}_{\bm{\pi}}\left[\sum_{i,j}E_{ij}k_{j}(s)\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})Q_{j}^{\phi_{j}}(s,a_{j})\right]

is the policy gradient of stochastic TAPE. $\square$

B   Policy Improvement Theorem

In this section, we give the proof of the policy improvement theorem of stochastic TAPE. As in previous works (Degris, White, and Sutton 2012; Wang et al. 2020c; Feinberg et al. 2018), we relax the requirement that $Q^{\phi}_{tot}$ is a good estimate of $Q^{\bm{\pi}}_{tot}$ and simplify the $Q$-function learning process to the following MSE problem to make it tractable:

L(\phi)=\sum_{\bm{a},s}p(s)\bm{\pi}(\bm{a}|\bm{\tau})\left(Q^{\bm{\pi}}_{tot}(s,\bm{a})-Q^{\phi}_{tot}(s,\bm{a})\right)^{2}  (6)

where $\bm{\pi}$ is the joint policy, $Q^{\bm{\pi}}_{tot}(s,\bm{a})$ is the true value and $Q^{\phi}_{tot}(s,\bm{a})$ is the estimated value.

To make this proof self-contained, we first borrow Lemma 1 and Lemma 2 from (Wang et al. 2020c). Lemma 1 states that the learning of the centralized critic preserves the order of local action values. Without loss of generality, we consider a given state $s$.

Lemma 1. We consider the following optimization problem:

L_{s}(\phi)=\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\left(Q^{\bm{\pi}}(s,\bm{a})-f(\bm{Q}^{\phi}(s,\bm{a}))\right)^{2}  (7)

Here, $f(\bm{Q}^{\phi}(s,\bm{a})):\mathbb{R}^{n}\rightarrow\mathbb{R}$, and $\bm{Q}^{\phi}(s,\bm{a})$ is a vector whose $i^{th}$ entry is $Q_{i}^{\phi_{i}}(s,a_{i})$. $f$ satisfies $\forall i,a_{i},\ \frac{\partial f}{\partial Q^{\phi_{i}}_{i}(s,a_{i})}>0$.

Then, for any locally optimal solution, it holds that

Q^{\bm{\pi}}_{i}(s,a_{i})\geq Q^{\bm{\pi}}_{i}(s,a^{\prime}_{i})\Longleftrightarrow Q^{\phi_{i}}_{i}(s,a_{i})\geq Q^{\phi_{i}}_{i}(s,a^{\prime}_{i}),\quad\forall i,a_{i},a^{\prime}_{i}.

Proof. A necessary condition for a local optimum is

\frac{\partial L_{s}(\phi)}{\partial Q^{\phi_{i}}_{i}(s,a_{i})}=\pi_{i}(a_{i}|\tau_{i})\sum_{\bm{a}_{-i}}\prod_{j\neq i}\pi_{j}(a_{j}|\tau_{j})\left(Q^{\bm{\pi}}(s,\bm{a})-f(\bm{Q}^{\phi}(s,\bm{a}))\right)\left(-\frac{\partial f}{\partial Q^{\phi_{i}}_{i}(s,a_{i})}\right)=0.

This implies that, \forall i, a_{i}, we have

\sum_{\bm{a}_{-i}}\prod_{j\neq i}\pi_{j}(a_{j}|\tau_{j})\left(Q^{\bm{\pi}}(s,\bm{a})-f(\bm{Q}^{\phi}(s,\bm{a}))\right)=0
\Rightarrow\ \sum_{\bm{a}_{-i}}\bm{\pi}_{-i}(\bm{a}_{-i}|\bm{\tau}_{-i})f\left(\bm{Q}^{\phi}\left(s,(a_{i},\bm{a}_{-i})\right)\right)=Q_{i}^{\bm{\pi}}(s,a_{i}).

Let q(s,a_{i}) denote \sum_{\bm{a}_{-i}}\bm{\pi}_{-i}(\bm{a}_{-i}|\bm{\tau}_{-i})f\left(\bm{Q}^{\phi}\left(s,(a_{i},\bm{a}_{-i})\right)\right). We have

\frac{\partial q(s,a_{i})}{\partial Q_{i}^{\phi_{i}}(s,a_{i})}=\sum_{\bm{a}_{-i}}\bm{\pi}_{-i}(\bm{a}_{-i}|\bm{\tau}_{-i})\frac{\partial f\left(\bm{Q}^{\phi}\left(s,(a_{i},\bm{a}_{-i})\right)\right)}{\partial Q_{i}^{\phi_{i}}(s,a_{i})}>0.

Therefore, if Q_{i}^{\bm{\pi}}(s,a_{i})\geq Q_{i}^{\bm{\pi}}(s,a^{\prime}_{i}), then any local minimum of L_{s}(\phi) satisfies Q^{\phi_{i}}_{i}(s,a_{i})\geq Q^{\phi_{i}}_{i}(s,a^{\prime}_{i}).\hfill\square

The mixer module of our method satisfies \forall i, a_{i}, \frac{\partial f}{\partial Q^{\phi_{i}}_{i}(s,a_{i})}>0, so Lemma 1 holds after policy evaluation converges.

Lemma 2. For two sequences \{a_{i}\} and \{b_{i}\}, i\in[n], each listed in increasing order, if \sum_{i}b_{i}=0, then \sum_{i}a_{i}b_{i}\geq 0.

Proof. Denote \overline{a}=\frac{1}{n}\sum_{i}a_{i} and \tilde{a}_{i}=a_{i}-\overline{a}, so that \sum_{i}a_{i}b_{i}=\overline{a}\left(\sum_{i}b_{i}\right)+\sum_{i}\tilde{a}_{i}b_{i} with \sum_{i}\tilde{a}_{i}=0. Since \sum_{i}b_{i}=0, we may assume without loss of generality that \sum_{i}a_{i}=0. Let j and k be the indices such that a_{j}\leq 0, a_{j+1}\geq 0 and b_{k}\leq 0, b_{k+1}\geq 0. Since a and b play symmetric roles, we assume j\leq k. Then, we have

\sum_{i\in[n]}a_{i}b_{i}=\sum_{i\in[1,j]}a_{i}b_{i}+\sum_{i\in[j+1,k]}a_{i}b_{i}+\sum_{i\in[k+1,n]}a_{i}b_{i}
\geq\sum_{i\in[j+1,k]}a_{i}b_{i}+\sum_{i\in[k+1,n]}a_{i}b_{i}
\geq a_{k}\sum_{i\in[j+1,k]}b_{i}+a_{k+1}\sum_{i\in[k+1,n]}b_{i}.

As \sum_{i\in[j+1,n]}b_{i}\geq 0, we have -\sum_{i\in[j+1,k]}b_{i}\leq\sum_{i\in[k+1,n]}b_{i}.

Thus \sum_{i\in[n]}a_{i}b_{i}\geq(a_{k+1}-a_{k})\sum_{i\in[k+1,n]}b_{i}\geq 0.\hfill\square
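Lemma 2 is a Chebyshev-type rearrangement inequality, and it can be sanity-checked numerically. The following snippet is our own illustration (not code from the paper): it samples random same-ordered sequences with a zero-sum second sequence and verifies that the inner product is non-negative.

import numpy as np

rng = np.random.default_rng(0)
for _ in range(10000):
    n = int(rng.integers(2, 10))
    a = np.sort(rng.normal(size=n))   # increasing sequence {a_i}
    b = np.sort(rng.normal(size=n))
    b -= b.mean()                     # enforce sum_i b_i = 0; subtracting a constant preserves the order
    assert a @ b >= -1e-12, "Lemma 2 violated"
print("Lemma 2 held on all sampled sequence pairs.")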

The next lemma states that, for any tabular policies updated by the stochastic TAPE policy gradient, the larger the local critic value Q_{i}^{\phi_{i}}(s,a_{i}), the larger the update stepsize \beta_{a_{i},s} for a_{i} under state s, and vice versa.

Lemma 3. For any pre-update joint policy \bm{\pi} and joint policy \hat{\bm{\pi}} updated by the stochastic TAPE policy gradient with tabular expressions, satisfying \hat{\pi}_{i}(a_{i}|\tau_{i})=\pi_{i}(a_{i}|\tau_{i})+\beta_{a_{i},s}\delta for every agent i, where \delta is a sufficiently small number, it holds \forall s, a_{i}, a^{\prime}_{i} that

Q_{i}^{\phi_{i}}(s,a_{i})\geq Q_{i}^{\phi_{i}}(s,a^{\prime}_{i})\Longleftrightarrow\beta_{a_{i},s}\geq\beta_{a^{\prime}_{i},s}. (8)

Proof. We start by showing the connection between Q_{i}^{\phi_{i}}(s,a_{i}) and \beta_{a_{i},s}. The policy gradient for agent i is \nabla_{\theta_{i}}J(\theta):

\nabla_{\theta_{i}}J(\theta)=\mathbb{E}_{\bm{\pi}}\left[\sum_{j}E_{ij}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})k_{j}(s)Q_{j}^{\phi_{j}}(s,a_{j})\right]
=\sum_{s}d^{\bm{\pi}}(s)\sum_{a_{i}}\sum_{\bm{a}^{-i}}\pi_{i}(a_{i}|\tau_{i})\bm{\pi}^{-i}(\bm{a}^{-i}|\bm{\tau}^{-i})\sum_{j}E_{ij}\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})k_{j}(s)Q_{j}^{\phi_{j}}(s,a_{j})
=\sum_{s,\bm{a}^{-i}}d^{\bm{\pi}}(s)\bm{\pi}^{-i}(\bm{a}^{-i}|\bm{\tau}^{-i})\sum_{a_{i}}\sum_{j}E_{ij}\nabla_{\theta_{i}}\pi_{i}(a_{i}|\tau_{i})k_{j}(s)Q_{j}^{\phi_{j}}(s,a_{j}).

With a slight abuse of notation, we write d(s,\bm{\tau}^{-i})=d^{\bm{\pi}}(s)\bm{\pi}^{-i}(\bm{a}^{-i}|\bm{\tau}^{-i}) for simplicity, where \bm{\tau} is the joint action-observation history and the superscript -i denotes excluding agent i. Thus, without loss of generality, for a given state s and joint action \bm{a}, the policy gradient g_{i} for agent i is

g_{i}=d(s,\bm{\tau}^{-i})\sum_{j}E_{ij}\nabla_{\theta_{i}}\pi_{i}(a_{i}|\tau_{i})k_{j}(s)Q_{j}^{\phi_{j}}(s,a_{j}) (9)
=d(s,\bm{\tau}^{-i})\nabla_{\theta_{i}}\pi_{i}(a_{i}|\tau_{i})k_{i}(s)Q_{i}^{\phi_{i}}(s,a_{i})+d(s,\bm{\tau}^{-i})\sum_{j\neq i}E_{ij}\nabla_{\theta_{i}}\pi_{i}(a_{i}|\tau_{i})k_{j}(s)Q_{j}^{\phi_{j}}(s,a_{j}).

Since the policies have tabular expressions, \pi_{i}(a_{i}|\tau_{i})=\theta_{a_{i},\tau_{i}}, and the update \beta_{a_{i},s} is proportional to the gradient, so that

\beta_{a_{i},s}\propto g_{i}=d(s,\bm{\tau}^{-i})k_{i}(s)Q_{i}^{\phi_{i}}(s,a_{i})+C,

where C is a constant independent of a_{i} and Q_{i}^{\phi_{i}}(s,a_{i}).

Since k_{i}(s)\geq 0 is the non-negative coefficient given by the mixing network, d(s,\bm{\tau}^{-i})k_{i}(s)Q_{i}^{\phi_{i}}(s,a_{i})+C is a linear function of Q_{i}^{\phi_{i}}(s,a_{i}) with a non-negative coefficient. Thus \forall s, a_{i}, a^{\prime}_{i}, it holds that Q_{i}^{\phi_{i}}(s,a_{i})\geq Q_{i}^{\phi_{i}}(s,a^{\prime}_{i})\Longleftrightarrow\beta_{a_{i},s}\geq\beta_{a^{\prime}_{i},s}.\hfill\square

Now, we are ready to provide proof for the policy improvement theorem.

Theorem 1. With tabular expressions for policies, for any pre-update policy \bm{\pi} and policy \hat{\bm{\pi}} updated by the policy gradient of stochastic TAPE, satisfying \hat{\pi}_{i}(a_{i}|\tau_{i})=\pi_{i}(a_{i}|\tau_{i})+\beta_{a_{i},s}\delta for any agent i, where \delta is a sufficiently small number, we have J(\hat{\bm{\pi}})\geq J(\bm{\pi}), i.e., the joint policy is improved by the update.

Proof. Given a good value estimate Q^{\phi}_{tot}, from Lemma 1 and Lemma 3 we have

Q^{\bm{\pi}}_{i}(s,a_{i})\geq Q^{\bm{\pi}}_{i}(s,a^{\prime}_{i})\Longleftrightarrow\beta_{a_{i},s}\geq\beta_{a^{\prime}_{i},s}. (10)

For any s_{t}, we have

\sum_{\bm{a}_{t}}\hat{\bm{\pi}}(\bm{a}_{t}|\bm{\tau}_{t})Q^{\bm{\pi}}_{tot}(s_{t},\bm{a}_{t})
=\sum_{\bm{a}_{t}}\left(\prod^{n}_{i}\hat{\pi}_{i}(a_{i}^{t}|\tau_{i}^{t})\right)Q^{\bm{\pi}}_{tot}(s_{t},\bm{a}_{t})
=\sum_{\bm{a}_{t}}\left(\prod^{n}_{i}\left(\pi_{i}(a_{i}^{t}|\tau_{i}^{t})+\beta_{a_{i}^{t},s_{t}}\delta\right)\right)Q^{\bm{\pi}}_{tot}(s_{t},\bm{a}_{t})
=\sum_{\bm{a}_{t}}\prod^{n}_{i}\pi_{i}(a_{i}^{t}|\tau_{i}^{t})Q^{\bm{\pi}}_{tot}(s_{t},\bm{a}_{t})+\delta\sum^{n}_{i=1}\sum_{a_{i}^{t}}\beta_{a_{i}^{t},s_{t}}Q^{\bm{\pi}}_{i}(s_{t},a_{i}^{t})+o(\delta)
=V^{\bm{\pi}}_{tot}(s_{t})+\delta\sum^{n}_{i=1}\sum_{a_{i}^{t}}\beta_{a_{i}^{t},s_{t}}Q^{\bm{\pi}}_{i}(s_{t},a_{i}^{t})+o(\delta).

Here the first-order term uses \sum_{\bm{a}^{-i}_{t}}\prod_{j\neq i}\pi_{j}(a_{j}^{t}|\tau_{j}^{t})Q^{\bm{\pi}}_{tot}(s_{t},\bm{a}_{t})=Q^{\bm{\pi}}_{i}(s_{t},a_{i}^{t}), and o(\delta) collects all terms in which the power of \delta is greater than 1. Since \delta is sufficiently small, o(\delta) is omitted in the remaining analysis.

Because \sum_{a_{i}^{t}}\left[\pi_{i}(a_{i}^{t}|\tau_{i}^{t})+\beta_{a_{i}^{t},s_{t}}\delta\right]=1, we have \sum_{a_{i}^{t}}\beta_{a_{i}^{t},s_{t}}=0. From Lemma 2 and Eq. 10, we have \sum^{n}_{i=1}\sum_{a_{i}^{t}}\beta_{a_{i}^{t},s_{t}}Q^{\bm{\pi}}_{i}(s_{t},a_{i}^{t})\geq 0. Thus

\sum_{\bm{a}_{t}}\hat{\bm{\pi}}(\bm{a}_{t}|\bm{\tau}_{t})Q^{\bm{\pi}}_{tot}(s_{t},\bm{a}_{t})\geq V^{\bm{\pi}}_{tot}(s_{t}).

The rest of the proof follows the policy improvement theorem for tabular MDPs from (Sutton and Barto 2018).

V^{\bm{\pi}}_{tot}(s_{t})\leq\sum_{\bm{a}_{t}}\hat{\bm{\pi}}(\bm{a}_{t}|\bm{\tau}_{t})Q^{\bm{\pi}}_{tot}(s_{t},\bm{a}_{t})
=\sum_{\bm{a}_{t}}\hat{\bm{\pi}}(\bm{a}_{t}|\bm{\tau}_{t})\left(r_{t}+\gamma\sum_{s_{t+1}}p(s_{t+1}|s_{t},\bm{a}_{t})V^{\bm{\pi}}_{tot}(s_{t+1})\right)
\leq\sum_{\bm{a}_{t}}\hat{\bm{\pi}}(\bm{a}_{t}|\bm{\tau}_{t})\left(r_{t}+\gamma\sum_{s_{t+1}}p(s_{t+1}|s_{t},\bm{a}_{t})\sum_{\bm{a}_{t+1}}\hat{\bm{\pi}}(\bm{a}_{t+1}|\bm{\tau}_{t+1})Q^{\bm{\pi}}_{tot}(s_{t+1},\bm{a}_{t+1})\right)
\cdots
=V^{\hat{\bm{\pi}}}_{tot}(s_{t}).

So we have

V^{\hat{\bm{\pi}}}_{tot}(s_{0})\geq V^{\bm{\pi}}_{tot}(s_{0}),\quad\forall s_{0}
\Longrightarrow\ \sum_{s_{0}}p(s_{0})V^{\hat{\bm{\pi}}}_{tot}(s_{0})\geq\sum_{s_{0}}p(s_{0})V^{\bm{\pi}}_{tot}(s_{0}),

which is equivalent to J(\hat{\bm{\pi}})\geq J(\bm{\pi}), since J(\bm{\pi})=\sum_{s_{0}}p(s_{0})V^{\bm{\pi}}_{tot}(s_{0}).\hfill\square

Remark. We prove that with the coalition utility \bm{U}_{i} in the policy gradient, the objective function J(\bm{\pi}) is monotonically maximized. The monotone condition (Eq. 8) guarantees monotonic improvement of stochastic TAPE policy updates in tabular cases. When function approximators (such as neural networks) are used, policy improvement is still guaranteed as long as the monotone condition holds, i.e., actions with larger values receive larger update stepsizes. In the experiment section, we empirically show that policies parameterized by deep neural networks improve steadily during training and outperform the baselines.

C   Policy Update Diversity

In this section, we give the detailed proof of Theorem 2. We define the diversity of exploration in the parameter space as the variance of the parameter updates \xi_{a_{i},s}^{\text{TAPE}}=\mathbb{E}_{E}\left[g_{a_{i},s}^{\text{TAPE}}\lambda\right] and \xi_{a_{i},s}^{\text{DOP}}=g_{a_{i},s}^{\text{DOP}}\lambda, where g_{a_{i},s} is the policy gradient given s and a_{i}, \lambda is the learning rate, and E is the Erdős–Rényi network in stochastic TAPE. Without loss of generality, we assume \lambda=1.

For clarity, we first restate Theorem 2.

Theorem 2. For any agent i and \forall s, a_{i}, the stochastic TAPE policy update \xi^{\text{TAPE}}_{a_{i},s} and the DOP policy update \xi^{\text{DOP}}_{a_{i},s} satisfy \text{Var}\left[\xi^{\text{TAPE}}_{a_{i},s}\right]\geq\text{Var}\left[\xi^{\text{DOP}}_{a_{i},s}\right], and \Delta=\text{Var}\left[\xi^{\text{TAPE}}_{a_{i},s}\right]-\text{Var}\left[\xi^{\text{DOP}}_{a_{i},s}\right] is proportional to p^{2}, where p is the probability of edges being present in the Erdős–Rényi model, i.e., \Delta\propto p^{2}.

Proof. From the proof of Lemma 3, we have

g_{a_{i},s}^{\text{TAPE}}=d(s,\bm{\tau}^{-i})\sum_{j}E_{ij}k_{j}(s)Q_{j}^{\phi_{j}}(s,a_{j}).

By replacing the adjacency matrix E with the identity matrix I, we have

g_{a_{i},s}^{\text{DOP}}=d(s,\bm{\tau}^{-i})k_{i}(s)Q_{i}^{\phi_{i}}(s,a_{i}).

Substituting g_{a_{i},s}^{\text{TAPE}} into \xi^{\text{TAPE}}_{a_{i},s}:

\xi^{\text{TAPE}}_{a_{i},s}=\mathbb{E}_{E}\Big[d(s,\bm{\tau}^{-i})k_{i}(s)Q_{i}^{\phi_{i}}(s,a_{i})+\sum_{j\neq i}d(s,\bm{\tau}^{-i})E_{ij}k_{j}(s)Q_{j}^{\phi_{j}}(s,a_{j})\Big]
=d(s,\bm{\tau}^{-i})k_{i}(s)Q_{i}^{\phi_{i}}(s,a_{i})+\sum_{j\neq i}\mathbb{E}_{E}\left[d(s,\bm{\tau}^{-i})E_{ij}k_{j}(s)Q_{j}^{\phi_{j}}(s,a_{j})\right]
=\xi_{a_{i},s}^{\text{DOP}}+p\sum_{j\neq i}d(s,\bm{\tau}^{-i})k_{j}(s)Q_{j}^{\phi_{j}}(s,a_{j}),

where d(s,\bm{\tau}^{-i})k_{j}(s)Q_{j}^{\phi_{j}}(s,a_{j})=C_{j} is independent of \pi_{i} and Q_{i}^{\phi_{i}}. Thus,

\text{Var}\left[\xi^{\text{TAPE}}_{a_{i},s}\right]=\text{Var}\left[\xi_{a_{i},s}^{\text{DOP}}+p\sum_{j\neq i}C_{j}\right]=\text{Var}\left[\xi_{a_{i},s}^{\text{DOP}}\right]+p^{2}\sum_{j\neq i}\text{Var}\left[C_{j}\right].

Since variance is always non-negative, we have \text{Var}\left[\xi^{\text{TAPE}}_{a_{i},s}\right]\geq\text{Var}\left[\xi^{\text{DOP}}_{a_{i},s}\right] and \Delta\propto p^{2}.\hfill\square

Remark. The above theorem states that stochastic TAPE explores the parameter space more effectively, which provides a theoretical explanation of why stochastic TAPE agents are more capable of finding good cooperation patterns with other agents. \Delta is proportional to p^{2}, so as p increases, the connections among agents in the topology become denser, which enhances their capability to cooperate. However, denser connections also reintroduce the CDM issue, as the sub-optimality of some agents more easily affects other agents. Thus, p serves as a hyperparameter that trades off avoiding the CDM issue against the capability to explore cooperation patterns.
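The variance gap in Theorem 2 can also be illustrated with a small simulation. The snippet below is our own sketch (not the paper's code): the DOP update and the terms C_{j} are replaced by independent standard normals, and the empirical gap \text{Var}[\xi^{\text{TAPE}}]-\text{Var}[\xi^{\text{DOP}}] is compared with the predicted p^{2}\sum_{j\neq i}\text{Var}[C_{j}].

import numpy as np

rng = np.random.default_rng(1)
n_agents, n_samples = 5, 200000
xi_dop = rng.normal(size=n_samples)              # stand-in for the DOP update xi^DOP
C = rng.normal(size=(n_samples, n_agents - 1))   # stand-in for the terms C_j of the other agents

for p in (0.1, 0.3, 0.5, 0.7):
    xi_tape = xi_dop + p * C.sum(axis=1)         # xi^TAPE = xi^DOP + p * sum_{j != i} C_j
    gap = xi_tape.var() - xi_dop.var()
    print(f"p={p:.1f}  empirical gap={gap:.3f}  predicted p^2 * sum_j Var[C_j]={p**2 * (n_agents - 1):.3f}")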

D   Algorithm

In this section, we give pseudo-code of the proposed methods. As we only modify the policy gradients, the rest of the structure remains the same as in the base methods (Wang et al. 2020c; Zhou, Lan, and Aggarwal 2022).

D.1   Stochastic TAPE

We first give the pseudo-code of stochastic TAPE and then describe the details.

Algorithm 1 Stochastic TAPE
1: Initialize critic \phi, target critic \phi^{\prime}=\phi, policy \theta_{i} for each agent i, and off-policy replay buffer \mathcal{D}
2: while training not finished do
3:     Roll out n trajectories and store them in \mathcal{D}
4:     Use the trajectories to calculate y^{\text{on}}
5:     Sample a batch of trajectories from \mathcal{D} to calculate y^{\text{off}}
6:     Update \phi with y^{\text{on}} and y^{\text{off}} according to Eq. 13
7:     Generate an agent topology E according to Eq. 12
8:     Update the policy parameters \theta with the on-policy trajectories according to Eq. 11
9:     Copy the critic parameters \phi to the target critic \phi^{\prime} every m episodes
10: end while

The policy gradient of stochastic TAPE is given by

g=\mathbb{E}_{\bm{\pi}}\left[\sum_{i,j}E_{ij}k_{j}(s)\nabla_{\theta_{i}}\log\pi_{i}(a_{i}|\tau_{i})Q_{j}^{\phi_{j}}(s,a_{j})\right], (11)

where E is the ER agent model, defined as

\forall\ i,j\in\{1,\ldots,n\}:\quad E_{ij}=\begin{cases}1 & \text{if } i=j,\\ 1 & \text{with probability } p \text{ if } i\neq j,\\ 0 & \text{otherwise}.\end{cases} (12)
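To make Eqs. 11-12 concrete, the following PyTorch-style sketch shows one way they could be implemented. It is our own illustration under stated assumptions, not the official code: tensor shapes and the names log_probs, q_local and k are hypothetical, and the local critic values are treated as fixed targets.

import torch

def sample_er_topology(n_agents: int, p: float) -> torch.Tensor:
    # Eq. 12: self-edges are always present; other edges are i.i.d. Bernoulli(p).
    E = (torch.rand(n_agents, n_agents) < p).float()
    E.fill_diagonal_(1.0)
    return E

def stochastic_tape_pg_loss(log_probs, q_local, k, E):
    # log_probs: [B, n] log pi_i(a_i | tau_i) of the sampled actions
    # q_local:   [B, n] local values Q_j^{phi_j}(s, a_j), treated as fixed targets
    # k:         [B, n] non-negative mixing coefficients k_j(s)
    # E:         [n, n] sampled agent topology
    # Coalition utility of agent i: sum_j E_ij k_j(s) Q_j^{phi_j}(s, a_j).
    coalition_utility = (k * q_local).detach() @ E.t()   # [B, n]
    # Negative of the Eq. 11 surrogate, averaged over the batch, to be minimized.
    return -(log_probs * coalition_utility).sum(dim=-1).mean()

Setting E to the identity matrix recovers a DOP-style individual-critic update, while a fully connected E recovers the fully centralized objective; intermediate p interpolates between the two.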

As in the base method DOP, stochastic TAPE adopts an off-policy critic to improve sample efficiency: the critic's training loss mixes an off-policy loss with target y^{\text{off}}, based on the tree-backup technique (Precup 2000; Munos et al. 2016), and an on-policy loss with target y^{\text{on}}, i.e.

\mathcal{L}(\phi)=\kappa\,\mathbb{E}_{\mathcal{D}}\left[\text{MSE}(y^{\text{off}},Q^{\phi}_{tot})\right]+(1-\kappa)\,\mathbb{E}_{\bm{\pi}}\left[\text{MSE}(y^{\text{on}},Q^{\phi}_{tot})\right], (13)

where

y^{\text{off}}=Q^{\phi^{\prime}}_{tot}(s,\bm{a})+\sum_{t=0}^{k-1}\gamma^{t}c_{t}\Big[r_{t}+\gamma\sum_{i}k_{i}(s_{t+1})\mathbb{E}_{\pi_{i}}\left[Q_{i}^{\phi_{i}^{\prime}}(s_{t+1},\cdot)\right]+b(s_{t+1})-Q^{\phi^{\prime}}_{tot}(s_{t},\bm{a}_{t})\Big], (14)
y^{\text{on}}=Q^{\phi^{\prime}}_{tot}(s,\bm{a})+\sum_{t=0}^{\infty}(\gamma\lambda)^{t}\Big[r_{t}+\gamma Q^{\phi^{\prime}}_{tot}(s_{t+1},\bm{a}_{t+1})-Q^{\phi^{\prime}}_{tot}(s_{t},\bm{a}_{t})\Big], (15)

where MSE is the mean-squared-error loss function, \kappa is a parameter controlling the importance of off-policy learning, \mathcal{D} is the off-policy replay buffer, \bm{\pi} is the joint policy, Q^{\phi}_{tot}(s_{t},\bm{a}_{t})=\sum_{i}k_{i}(s_{t})Q_{i}^{\phi_{i}}(s_{t},a^{i}_{t})+b(s_{t}), k_{i}\geq 0 and b are coefficients provided by the mixing network, Q^{\phi^{\prime}} is the target network used to stabilize training (Mnih et al. 2013), c_{t}=\prod_{l=1}^{t}\lambda\bm{\pi}(\bm{a}_{l}|s_{l}), and \lambda is the TD(\lambda) hyperparameter.
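A minimal sketch of the linearly decomposed critic and the mixed loss of Eq. 13 is given below. It is our own illustration: the function and argument names are hypothetical, and the tree-backup and TD(\lambda) targets of Eqs. 14-15 are assumed to be precomputed tensors.

import torch
import torch.nn.functional as F

def q_tot(q_local, k, b):
    # Linear decomposition: Q_tot(s, a) = sum_i k_i(s) Q_i(s, a_i) + b(s).
    return (k * q_local).sum(dim=-1) + b   # [B]

def critic_loss(q_loc_off, k_off, b_off, y_off,
                q_loc_on, k_on, b_on, y_on, kappa=0.5):
    # Eq. 13: kappa-weighted mixture of the off-policy and on-policy MSE losses.
    off = F.mse_loss(q_tot(q_loc_off, k_off, b_off), y_off)
    on = F.mse_loss(q_tot(q_loc_on, k_on, b_on), y_on)
    return kappa * off + (1.0 - kappa) * on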

With all the equations above, the pseudo-code of stochastic TAPE is given in Algorithm 1.

D.2   Deterministic TAPE

We first give the pseudo-code of deterministic TAPE and then describe the details.

Algorithm 2 Deterministic TAPE
1: Initialize critic \phi_{i}, target critic \phi^{\prime}_{i}=\phi_{i}, policy \theta_{i} for each agent i, mixing network f_{\text{mix}}, information encoder f_{m} and replay buffer \mathcal{D}
2: while training not finished do
3:     Roll out n trajectories and store them in \mathcal{D}
4:     Sample a batch of trajectories from \mathcal{D}
5:     Sample assistive information m_{i}\sim N(f_{m}(o_{i}),I) for all agents
6:     Calculate \mathcal{L}_{\bm{\pi}}, \mathcal{L}_{CA}, \mathcal{L}_{IB}, \mathcal{L}_{\hat{Q}^{*}} and \mathcal{L}_{Q_{tot}} according to Eqs. 16-20
7:     Total loss \mathcal{L}=-\mathcal{L}_{\bm{\pi}}+\mathcal{L}_{CA}+\mathcal{L}_{IB}+\mathcal{L}_{\hat{Q}^{*}}+\mathcal{L}_{Q_{tot}}
8:     Update the critics, mixing network, policy networks and information encoder to minimize \mathcal{L}
9:     Copy the critic parameters \phi to the target critics \phi^{\prime} every m episodes
10: end while

The policy gradient of deterministic TAPE is given by

\mathcal{L}_{\bm{\pi}}=\mathbb{E}_{\mathcal{D}}\Bigg[\sum_{i}\nabla_{\theta_{i}}\pi_{i}(\tau_{i})\nabla_{a_{i}}f_{\text{mix}}\left(s,\mathbbm{1}[E_{i1}]\hat{Q}^{\phi_{1}}_{1},\cdots,\mathbbm{1}[E_{in}]\hat{Q}^{\phi_{n}}_{n}\right)\Big|_{a_{i}=\pi_{i}(\tau_{i})}\Bigg], (16)

where \hat{Q}^{\phi_{i}}_{i}(\tau_{i},a_{i},m_{i})=Q^{\phi_{i}}_{i}(\tau_{i},a_{i},m_{i})-\alpha\log\pi_{i}(a_{i}|\tau_{i}) is the soft Q network augmented with assistive information m_{i}. The assistive information m_{i} is encoded from the observation o_{i} of agent i; it provides information about the optimal joint action \bm{a}^{*} and assists value factorization.
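The sketch below shows one way the Eq. 16 objective could be realized with automatic differentiation: the topology-masked soft local values are passed through the mixer once per agent, and gradients flow back into each policy through a_{i}=\pi_{i}(\tau_{i}). It is our own simplified illustration under stated assumptions; the mixer interface and the tensor q_hat are hypothetical, and entropy and assistive-information details are elided.

import torch

def masked_mixer_objective(state, q_hat, mixer, E):
    # q_hat: [B, n] soft local values \hat{Q}_i^{phi_i}(tau_i, a_i, m_i) evaluated at
    #        a_i = pi_i(tau_i), kept differentiable so gradients reach each policy.
    # E:     [n, n] agent topology; row i keeps only the values of agent i's coalition.
    # One mixer pass per agent; maximizing the sum follows the Eq. 16 update direction.
    per_agent = [mixer(state, q_hat * E[i].view(1, -1)).mean() for i in range(E.shape[0])]
    return torch.stack(per_agent).sum()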

PAC follows the design of WQMIX (Rashid et al. 2020a) and keeps two mixing networks, Q_{tot} and \hat{Q}^{*}. Q_{tot} is the monotonic mixing network as in QMIX (Rashid et al. 2020b), while \hat{Q}^{*} is an unrestricted function that ensures the joint-action values are correctly estimated, as in WQMIX. The loss functions for Q_{tot} and \hat{Q}^{*} are given by

\mathcal{L}_{\hat{Q}^{*}}=\sum_{k=0}^{N}\left(\hat{Q}^{*}(s_{k},\hat{\bm{a}}_{k})-y_{k}\right)^{2} (17)
\mathcal{L}_{Q_{tot}}=\sum_{k=0}^{N}\omega(s_{k},\bm{a}_{k})\left(Q_{tot}(s_{k},\bm{a}_{k},\bm{m}_{k})-y_{k}\right)^{2}, (18)

where N is the batch size, \hat{a}_{i}=\arg\max_{a_{i}}Q_{i}^{\phi_{i}}(\tau_{i},a_{i},m_{i}), \hat{\bm{a}}=[\hat{a}_{1},\ldots,\hat{a}_{n}], \omega(s,\bm{a}) is the weighting function in WQMIX, \bm{m} is the joint assistive information, Q_{tot} is the mixture of local Q values, and the target is y_{k}=r_{k}+\gamma\hat{Q}^{*}(s^{\prime}_{k},\arg\max_{\hat{\bm{a}}^{\prime}_{k}}Q_{tot}(s^{\prime}_{k},\hat{\bm{a}}^{\prime}_{k},\bm{m}^{\prime}_{k};\phi^{\prime})), with \phi^{\prime} being the parameters of the target network. PAC adopts two auxiliary losses to assist value factorization in MAPG, which we briefly introduce next.
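As a hedged sketch of Eqs. 17-18 (our own illustration; the weighting function omega, the two mixer outputs and the target y are assumed to be precomputed tensors):

import torch

def mixer_losses(q_star_pred, q_tot_pred, y, omega):
    # q_star_pred: [B] \hat{Q}^*(s_k, \hat{a}_k) from the unrestricted mixer
    # q_tot_pred:  [B] Q_tot(s_k, a_k, m_k) from the monotonic mixer
    # y:           [B] targets y_k, treated as constants
    # omega:       [B] WQMIX-style weights omega(s_k, a_k)
    loss_q_star = ((q_star_pred - y) ** 2).sum()
    loss_q_tot = (omega * (q_tot_pred - y) ** 2).sum()
    return loss_q_star, loss_q_tot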

Counterfactual Assistance Loss: The counterfactual assistance loss directly guides each agent's policy towards the action \hat{a}_{i}^{*} from \hat{Q}^{*}. To this end, PAC proposes an advantage function with a counterfactual baseline that compares \hat{a}_{i}^{*} against the policy's expected local value while keeping all other agents' actions \bm{a}^{-i} fixed. The counterfactual assistance loss is given by

\mathcal{L}_{CA}=\sum_{i}\log\pi_{i}(a_{i}|\tau_{i})\sum_{a_{i}}\left[Q^{\phi_{i}}_{i}(\tau_{i},\hat{a}^{*}_{i},m_{i})-\pi_{i}(a_{i}|\tau_{i})Q^{\phi_{i}}_{i}(\tau_{i},a_{i},m_{i})\right]. (19)
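The following is our reading of Eq. 19 as a weighted policy-gradient term: each agent's log-probability is weighted by the advantage of \hat{a}^{*}_{i} over the policy's expected local value. It is a hedged sketch with hypothetical names, not the official implementation.

import torch

def counterfactual_assistance_loss(log_pi_taken, pi_all, q_all, a_star):
    # log_pi_taken: [B, n]    log pi_i(a_i | tau_i) of the sampled actions
    # pi_all:       [B, n, A] full policy distributions pi_i(. | tau_i)
    # q_all:        [B, n, A] local values Q_i^{phi_i}(tau_i, ., m_i)
    # a_star:       [B, n]    greedy actions \hat{a}^*_i selected from \hat{Q}^*
    q_star = q_all.gather(-1, a_star.unsqueeze(-1)).squeeze(-1)   # Q_i(tau_i, a^*_i, m_i)
    baseline = (pi_all * q_all).sum(dim=-1)                       # E_{a ~ pi_i}[Q_i(tau_i, a, m_i)]
    advantage = (q_star - baseline).detach()                      # counterfactual advantage
    return (log_pi_taken * advantage).sum(dim=-1).mean()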

Information Bottleneck Loss: Inspired by the information bottleneck method (Tishby, Pereira, and Bialek 2000), the information bottleneck loss encodes the optimal joint action \hat{\bm{a}}^{*} into the assistive information m_{i} for the local Q value functions Q_{i}^{\phi_{i}}(\tau_{i},a_{i},m_{i}), so that the assistive information is maximally informative about the optimal action \hat{a}_{i}^{*}. With the deep variational information bottleneck (Alemi et al. 2016), the variational lower bound of this objective is

\mathcal{L}_{IB}=\mathbb{E}_{o_{i}\sim\mathcal{D},\,m_{j}\sim f_{m}}\Bigg[-\mathcal{H}\left[p(\hat{a}^{*}_{j}|\bm{o}),q_{\psi}(\hat{a}^{*}_{j}|o_{j},\bm{m})\right]+\beta D_{KL}\left(p(m_{i}|o_{i})\,\|\,q_{\sigma}(m_{i})\right)\Bigg], (20)

where \mathcal{H} is the cross-entropy operator, D_{KL} is the KL divergence, and q_{\sigma}(m_{i}) is a variational posterior estimator of p(m_{i}) with parameters \sigma. The information encoder f_{m} is trained to encode the assistive information m_{i}\sim N(f_{m}(o_{i};\theta_{m}),I), where N is the normal distribution.
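A rough sketch of Eq. 20 is given below, assuming a unit-covariance Gaussian encoder and a standard-normal variational prior q_{\sigma}(m_{i}); the decoder predicting \hat{a}^{*}_{j} is an assumed classifier head, and all names are hypothetical.

import torch
import torch.nn.functional as F

def ib_loss(logits_a_star, a_star_target, mu_m, beta=1e-3):
    # logits_a_star: [B, A] logits of the decoder q_psi(\hat{a}^*_j | o_j, m)
    # a_star_target: [B]    optimal actions \hat{a}^*_j used as classification labels
    # mu_m:          [B, d] encoder means f_m(o_i); covariance is fixed to the identity
    ce = F.cross_entropy(logits_a_star, a_star_target)   # cross-entropy term of Eq. 20
    kl = 0.5 * (mu_m ** 2).sum(dim=-1).mean()            # KL( N(mu, I) || N(0, I) ) = 0.5 * ||mu||^2
    return ce + beta * kl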

With all the loss terms defined above, the pseudo-code of deterministic TAPE is given in Algorithm 2.

E   Experiment and Implementation Details

We run experiments on Nvidia RTX Titan graphics cards with an Intel Xeon Gold 6240R CPU. The curves in our experiments are smoothed by a sliding window of size 4, i.e., the result at each timestep is the average of the past 4 recorded timesteps. More experimental and implementation details for each environment are given below. It is worth noting that the only hyperparameter that needs tuning in our methods is p, the probability of edges being present.
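For reference, the window-4 smoothing described above amounts to the following short helper (our own sketch, not the paper's plotting code):

import numpy as np

def smooth(curve, window=4):
    # Each point becomes the average of the past `window` recorded values (fewer at the start).
    curve = np.asarray(curve, dtype=float)
    return np.array([curve[max(0, t - window + 1): t + 1].mean() for t in range(len(curve))])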

E.1   Matrix Game

The evaluation metric in the matrix game is the average return of the last 100 episodes, and training runs for 10000 episodes in total. The results are drawn from four random seeds, which are randomly initialized at the beginning of each experiment. For this simple matrix game, we use tabular expressions for the policies in stochastic TAPE, DOP and COMA. The critics are parameterized by three-layer feed-forward neural networks with hidden size 32. The mixing network for QMIX and the linearly decomposed critics in stochastic TAPE and DOP are also three-layer feed-forward neural networks with hidden size 32, where the coefficients for local Q values are always non-negative. The hyperparameter p for the ER-based agent topology in stochastic TAPE is 0.7. All algorithms are trained with the Adam optimizer (Kingma and Ba 2014) with learning rate 1\times 10^{-3}, and we use batch size 32 for QMIX. Since we wish to compare differences between the policy gradients, we omit the off-policy critic target of Eq. 14 in stochastic TAPE and DOP for simplicity.

E.2   Level-Based Foraging

Figure 6: Evaluation results on the six tasks of SMAC. InfoPG is a decentralized graph-based MARL method with networked agents. Our proposed methods adopt the agent topology and follow the centralized-training-decentralized-execution paradigm, which makes them outperform InfoPG by a large margin during decentralized execution.

We use the official implementation of Level-Based Foraging (LBF) (Papoudakis et al. 2021) and keep the default settings of the environment, e.g., the reward function and randomly scattered food items and agents. The time limit is 25 for 8x8-2p-3f and 120 for 15x15-4p-5f. For the 8x8-2p-3f scenario, we use the '-coop' option of the environment to force the agents to collect all food items together, making the task more difficult. Training runs for 2 million timesteps in 8x8-2p-3f and 5 million timesteps in 15x15-4p-5f. Each algorithm is tested for 100 episodes every 50k timesteps. The evaluation metric is the average return of the test episodes.

We use official implementations for all baseline algorithms and implement our methods on top of them without changing the default hyperparameters. For example, the hypernetwork in QMIX has two hidden layers with hidden size 64, and target networks are updated every 600 steps in DOP. For a fair comparison, we only change the number of parallel-running environments to 4 for all algorithms and keep the other hyperparameters recommended by the official implementations. The hyperparameter p for both stochastic and deterministic TAPE is 0.3.

E.3   Starcraft Multi-Agent Challenge

In the StarCraft Multi-Agent Challenge (SMAC), agents do not play the whole StarCraft II game. Instead, SMAC proposes a series of decentralized micromanagement tasks in StarCraft II to test a group of cooperative agents, which must cooperate to fight against enemy units controlled by the StarCraft II built-in game AI. We keep the default settings of SMAC in our experiments, such as game AI difficulty 7, observation ranges and unit health points. We run 32 test episodes every 20k timesteps for evaluation and report the median test win rate across all individual runs, as recommended.

As in LBF, we use the official implementations, the recommended hyperparameters and 4 parallel-running environments for all baseline algorithms. The hyperparameter p controlling the probability of edges being present is chosen from \{0.1,0.3\} for stochastic TAPE and \{0.5,0.7\} for deterministic TAPE. The value of p for each map and algorithm is given in Table 1.

Map    p (stochastic)    p (deterministic)
5m_vs_6m 0.1 0.7
2c_vs_64zg 0.1 0.7
6h_vs_8z 0.1 0.5
3s_vs_4z 0.3 0.5
corridor 0.3 0.5
MMM2 0.3 0.5
Table 1: Hyperparameter p for each map and algorithm.

F   Additional Experiments

The concept of topology is also adopted in fully decentralized MARL methods with networked agents (Zhang et al. 2018; Konan, Seraj, and Gombolay 2022). In fully decentralized methods, the CDM issue does not exist since all agents are trained independently. Agents are networked together according to the topology so that they can take each other into account during decision making, cooperate better and even reach local consensus. However, although the agent topology is used to gather and utilize local information from neighboring agents, this decentralized training paradigm cannot coordinate agents' behavior as well as our methods, which are based on the centralized-training-decentralized-execution paradigm.

InfoPG (Konan, Seraj, and Gombolay 2022) is a state-of-the-art fully decentralized method with networked agents, where each agent's policy is conditioned on the policies of its neighboring teammates. We run InfoPG on all six maps of our SMAC experiments and compare the mean test return with our proposed stochastic TAPE and deterministic TAPE.

The results in Figure 6 demonstrate that our agent topology is effective in facilitating cooperation and filtering out the bad influence of other agents during centralized training, which allows our methods to outperform the fully decentralized method InfoPG.