
Efficient Multi-agent Reinforcement Learning by Planning

Qihan Liu1∗, Jianing Ye2∗, Xiaoteng Ma1∗, Jun Yang1†, Bin Liang1, Chongjie Zhang3†
1Department of Automation, Tsinghua University
2Institute for Interdisciplinary Information Sciences, Tsinghua University
3Department of Computer Science & Engineering, Washington University in St. Louis
{lqh20, yejn21, ma-xt17}@mails.tsinghua.edu.cn
{yangjun603, bliang}@tsinghua.edu.cn
[email protected]
Abstract

Multi-agent reinforcement learning (MARL) algorithms have accomplished remarkable breakthroughs in solving large-scale decision-making tasks. Nonetheless, most existing MARL algorithms are model-free, limiting sample efficiency and hindering their applicability in more challenging scenarios. In contrast, model-based reinforcement learning (MBRL), particularly algorithms integrating planning, such as MuZero, has demonstrated superhuman performance with limited data in many tasks. Hence, we aim to boost the sample efficiency of MARL by adopting model-based approaches. However, incorporating planning and search methods into multi-agent systems poses significant challenges. The expansive action space of multi-agent systems often necessitates leveraging the nearly-independent property of agents to accelerate learning. To tackle this issue, we propose the MAZero algorithm, which combines a centralized model with Monte Carlo Tree Search (MCTS) for policy search. We design a novel network structure to facilitate distributed execution and parameter sharing. To enhance search efficiency in deterministic environments with sizable action spaces, we introduce two novel techniques: Optimistic Search Lambda (OS($\lambda$)) and Advantage-Weighted Policy Optimization (AWPO). Extensive experiments on the SMAC benchmark demonstrate that MAZero outperforms model-free approaches in terms of sample efficiency and provides comparable or better performance than existing model-based methods in terms of both sample and computational efficiency. Our code is available at https://github.com/liuqh16/MAZero.

∗ Equal contribution. † Corresponding authors.

1 Introduction

Multi-Agent Reinforcement Learning (MARL) has seen significant success in recent years, with applications in real-time strategy games (Arulkumaran et al., 2019; Ye et al., 2020), card games (Bard et al., 2020), sports games (Kurach et al., 2020), autonomous driving (Zhou et al., 2020), and multi-robot navigation (Long et al., 2018). Nonetheless, challenges within multi-agent environments have led to the problem of sample inefficiency. One key issue is the non-stationarity of multi-agent settings, where agents continuously update their policies based on observations and rewards, resulting in a changing environment for individual agents (Nguyen et al., 2020). Additionally, the dimension of the joint action space can increase exponentially with the number of agents, leading to an immense policy search space (Hernandez-Leal et al., 2020). These challenges, combined with issues such as partial observability, coordination, and credit assignment, create a considerable demand for samples to train MARL agents effectively (Gronauer & Diepold, 2022).

Conversely, Model-Based Reinforcement Learning (MBRL) has demonstrated its worth in terms of sample efficiency within single-agent RL scenarios, both in practice (Wang et al., 2019) and theory (Sun et al., 2019). Unlike model-free methods, MBRL approaches typically focus on learning a parameterized model that characterizes the transition or reward functions of the real environment (Sutton & Barto, 2018; Corneil et al., 2018; Ha & Schmidhuber, 2018). Based on the usage of the learned model, MBRL methods can be roughly divided into two lines: planning with model (Hewing et al., 2020; Nagabandi et al., 2018; Wang & Ba, 2019; Schrittwieser et al., 2020; Hansen et al., 2022) and data augmentation with model (Kurutach et al., 2018; Janner et al., 2019; Hafner et al., 2019; 2020; 2023). Due to the foresight inherent in planning methods and their theoretically guaranteed convergence properties, MBRL with planning often demonstrates significantly higher sample efficiency and converges more rapidly (Zhang et al., 2020). A well-known planning-based MBRL method is MuZero, which conducts Monte Carlo Tree Search (MCTS) with a value-equivalent learned model and achieves superhuman performance in video games like Atari and board games including Go, Chess and Shogi (Schrittwieser et al., 2020).

Despite the success of MBRL in single-agent settings, planning with the multi-agent environment model remains in its early stages of development. Recent efforts have emerged to bridge the gap by combining single-agent MBRL algorithms with MARL frameworks (Willemsen et al., 2021; Bargiacchi et al., 2021; Mahajan et al., 2021; Egorov & Shpilman, 2022). However, the extension of single-agent MBRL methods to multi-agent settings presents formidable challenges. On one hand, existing model designs for single-agent algorithms do not account for the unique biases inherent to multi-agent environments, such as the nearly-independent property of agents. Consequently, directly employing models from single-agent MBRL algorithms typically falls short in supporting efficient learning within real-world multi-agent environments, underscoring the paramount importance of model redesign. On the other hand, multi-agent environments possess action spaces significantly more intricate than their single-agent counterparts, thereby exponentially escalating the search complexity within multi-agent settings. This necessitates exploration into specialized planning algorithms tailored to these complex action spaces.

In this paper, we propose MAZero, the first empirically effective approach that extends the MuZero paradigm into multi-agent cooperative environments. In particular, our contributions are fourfold:

  1. Inspired by the Centralized Training with Decentralized Execution (CTDE) concept, we develop a centralized-value, individual-dynamic model with shared parameters among all agents. We incorporate an additional communication block using the attention mechanism to promote cooperation during model unrolling (see Section 4.1).

  2. Exploiting the deterministic characteristics of the learned model, we devise the Optimistic Search Lambda (OS($\lambda$)) approach (see Section 4.2). It optimistically estimates the sampled returns while mitigating unrolling errors at larger depths using the parameter $\lambda$.

  3. We propose a novel policy loss named Advantage-Weighted Policy Optimization (AWPO) (see Section 4.3), which utilizes the value information calculated by OS($\lambda$) to improve the sampled actions.

  4. We conduct extensive experiments on the SMAC benchmark, showing that MAZero achieves superior sample efficiency compared to model-free approaches and provides better or comparable performance than existing model-based methods in terms of both sample and computational efficiency (see Section 5).

2 Background

This section introduces essential notation; related work is discussed in Appendix A.

POMDP

Reinforcement Learning (RL) addresses the problem of an agent learning to act in an environment so as to maximize a scalar reward signal. The environment is characterized as a partially observable Markov decision process (POMDP) (Kaelbling et al., 1998) defined by $(\mathcal{S},\mathcal{A},T,U,\Omega,\mathcal{O},\gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of possible actions, $T:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ is the transition function over next states given the action at the current state, and $U:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is the reward function. $\Omega$ is the set of observations for the agent, and the observation function $\mathcal{O}:\mathcal{S}\rightarrow\Omega$ maps states to observations. $\gamma\in[0,1)$ is the discount factor. At each timestep $t$, the agent acquires an observation $o_t=\mathcal{O}(s_t)$ based on the current state $s_t$, chooses an action $a_t$ given the history of observations $o_{\leq t}$, and receives the corresponding reward $u_t=U(s_t,a_t)$ from the environment. The objective of the agent is to learn a policy $\pi$ that maximizes the expected discounted return $J(\pi)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^t u_t \,\middle|\, a_t\sim\pi(\cdot|o_{\leq t})\right]$.

MuZero

MuZero (Schrittwieser et al., 2020) is an MBRL algorithm for single-agent settings that combines a learned model of environment dynamics with the MCTS planning algorithm. MuZero's learned model consists of three functions: a representation function $h$, a dynamics function $g$, and a prediction function $f$. The model is conditioned on the observation history $o_{\leq t}$ and a sequence of future actions $a_{t:t+K}$, and is trained to predict rewards $r_{t,0:K}$, values $v_{t,0:K}$ and policies $p_{t,0:K}$, where $r_{t,k}, v_{t,k}, p_{t,k}$ are model predictions based on the imaginary future state $s_{t,k}$ obtained by unrolling $k$ steps from the current time $t$. Specifically, the representation function maps the current observation history $o_{\leq t}$ into a hidden state $s_{t,0}$, which is used as the root node of the MCTS tree. The dynamics function $g$ takes the previous hidden state $s_{t,k}$ and an action $a_{t+k}$ as input and outputs the next hidden state $s_{t,k+1}$ and the predicted reward $r_{t,k}$. The prediction function $f$ computes the value $v_{t,k}$ and policy $p_{t,k}$ at each hidden state $s_{t,k}$. To perform MCTS, MuZero runs $N$ simulations, each consisting of three stages: Selection, Expansion and Backup. During the Selection stage, MuZero traverses from the root node and selects actions by employing the probabilistic Upper Confidence Tree (pUCT) rule (Kocsis & Szepesvári, 2006; Silver et al., 2016) until reaching a leaf node:

$$a=\arg\max_{a}\left[Q(s,a)+P(s,a)\cdot\frac{\sqrt{\sum_{b}N(s,b)}}{1+N(s,a)}\cdot c(s)\right] \tag{1}$$

where $Q(s,a)$ denotes the estimated Q-value, $N(s,a)$ the visit count, $P(s,a)$ the prior probability of selecting action $a$ obtained from the prediction function, and $c(s)$ controls the influence of the prior relative to the Q-value. The target search policy $\pi(\cdot|s_{t,0})$ is the normalized distribution of visit counts at the root node $s_{t,0}$.
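To make the selection rule of Equation 1 concrete, the following minimal sketch (our own illustration, not code from the MuZero or MAZero implementations) computes pUCT scores for a node's expanded children and picks the action to traverse; the logarithmic schedule for $c(s)$ follows the form used in the MuZero paper, with assumed constants `c_base` and `c_init`.

```python
import math

def puct_select(q_values, priors, visit_counts, c_base=19652, c_init=1.25):
    """Select the child action maximizing the pUCT score of Equation 1.

    q_values, priors, visit_counts: dicts mapping action -> Q(s,a), P(s,a), N(s,a).
    c(s) grows logarithmically with the total visit count, as in MuZero.
    """
    total_visits = sum(visit_counts.values())
    c_s = math.log((total_visits + c_base + 1) / c_base) + c_init

    def score(a):
        exploration = priors[a] * math.sqrt(total_visits) / (1 + visit_counts[a])
        return q_values[a] + exploration * c_s

    return max(q_values, key=score)

# Example: a node with three expanded actions.
q = {0: 0.1, 1: 0.4, 2: 0.3}
p = {0: 0.5, 1: 0.2, 2: 0.3}
n = {0: 10, 1: 2, 2: 4}
print(puct_select(q, p, n))
```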

The model is trained to minimize three losses between targets and network predictions: the reward loss $l_r$, the value loss $l_v$ and the policy loss $l_p$:

$$L^{\text{MuZero}}=\sum_{k=0}^{K}\left[l_{r}(u_{t+k},r_{t,k})+l_{v}(z_{t+k},v_{t,k})+l_{p}(\pi_{t+k},p_{t,k})\right] \tag{2}$$

where $u_{t+k}$ is the real reward from the environment, $z_{t+k}=\sum_{i=0}^{n-1}\gamma^{i}u_{t+k+i}+\gamma^{n}\nu_{t+k+n}$ is the n-step return consisting of rewards and the MCTS search value, and $\pi_{t+k}$ is the MCTS search policy.

EfficientZero (Ye et al., 2021) proposes an asynchronous parallel workflow to alleviate the computational demands of reanalysis MCTS. It also incorporates a self-supervised consistency loss $l_{s}=\|s_{t,k}-s_{t+k,0}\|$ using the SimSiam network to enforce consistency between the hidden state $s_{t,k}$ obtained from $k$ steps of the dynamics function and the direct representation $s_{t+k,0}$ of the future observation.

Sampled MuZero

Since the efficiency of MCTS planning is intricately tied to the number of simulations performed, the application of MuZero has traditionally been limited to domains with relatively small action spaces, which can be fully enumerated at each node during the tree search. To tackle larger action spaces, Sampled MuZero (Hubert et al., 2021) introduces a sampling-based framework where policy improvement and evaluation are computed over small subsets of sampled actions. Specifically, in the Expansion stage, only a subset $T(s)$ of the full action space $\mathcal{A}$ is expanded, where $T(s)$ is resampled on the fly from a distribution $\beta$ based on the prior policy $\pi$. After that, in the Selection stage, an action is picked according to the modified pUCT formula:

$$a=\mathop{\arg\max}\limits_{a\in T(s)\subset\mathcal{A}}\left[Q(s,a)+\frac{\hat{\beta}}{\beta}P(s,a)\cdot\frac{\sqrt{\sum_{b}N(s,b)}}{1+N(s,a)}\cdot c(s)\right] \tag{3}$$

where $\hat{\beta}$ is the empirical distribution of the sampled actions, which means its support is $T(s)$.

In this way, when making actual decisions at some root node state $s_{t,0}$, Sampled MCTS yields a policy $\omega(\cdot|s_{t,0})$ after $N$ simulations, which is a stochastic policy supported on $T(s_{t,0})$. The actual improved policy is denoted $\pi^{\text{MCTS}}(\cdot|s_{t,0})=\mathbb{E}[\omega(\cdot|s_{t,0})]$, i.e., the expectation of $\omega(\cdot|s_{t,0})$ taken over all sampling operations in Sampled MCTS. As demonstrated in the original paper, Sampled MuZero can be seamlessly extended to multi-agent settings by treating each agent's policy as a distinct dimension of a single agent's multi-discrete action space during centralized planning. The improved policy is then denoted $\boldsymbol{\pi}^{\text{MCTS}}(\cdot|\boldsymbol{s}_{t,0})$.¹

¹For the sake of simplicity, we do not emphasize the differences between the search policy $\boldsymbol{\pi}^{\text{MCTS}}(\cdot|\boldsymbol{s}_{t,0})$ in multi-agent settings and $\pi^{\text{MCTS}}(\cdot|s_{t,0})$ in single-agent settings, apart from the mathematical expressions.
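A minimal sketch of the sampling-corrected selection rule in Equation 3, reusing the pUCT form shown earlier; the per-action ratio $\hat{\beta}/\beta$ is assumed to be supplied by the caller, and the function name and fixed exploration constant are our own illustrative choices.

```python
import math

def sampled_puct_select(q_values, priors, visit_counts, beta_ratio, c_s=1.25):
    """Equation 3: pUCT restricted to the sampled subset T(s), with the prior
    P(s,a) corrected by beta_hat(a) / beta(a) for each sampled action."""
    total_visits = sum(visit_counts.values())

    def score(a):
        corrected_prior = beta_ratio[a] * priors[a]
        exploration = corrected_prior * math.sqrt(total_visits) / (1 + visit_counts[a])
        return q_values[a] + exploration * c_s

    # q_values is keyed only by the sampled actions T(s), so the argmax ranges over T(s).
    return max(q_values, key=score)
```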

3 Challenges in Planning-based Multi-agent Model-based RL

Extending single-agent planning-based MBRL methods to multi-agent environments is highly non-trivial. On one hand, existing model designs for single-agent algorithms do not account for the unique biases inherent to multi-agent environments, such as the nearly-independent property of agents. As a result, directly employing a flattened model from single-agent MBRL algorithms typically falls short of supporting efficient learning in real-world multi-agent environments. On the other hand, multi-agent environments possess state-action spaces significantly more intricate than their single-agent counterparts, thereby exponentially escalating the search complexity within multi-agent settings. This, in turn, compels us to explore specialized search algorithms tailored to complex action spaces. Moreover, the form of the model, in tandem with its generalization ability, constrains the form and efficiency of the search algorithm, and vice versa, making the design of the model and the design of the search algorithm highly interrelated.

In this section, we discuss the challenges encountered in the current design of planning-based MBRL algorithms, encompassing both model design and search algorithm aspects. We then elucidate how our MAZero algorithm systematically addresses these two issues in Section 4.

3.1 Model Design

Within the paradigm of centralized training with decentralized execution (CTDE), one straightforward approach is to learn a joint model that enables agents to do centralized planning within the joint policy space. However, given the large size of the state-action space in multi-agent environments, the direct application of a flattened model from single-agent MBRL algorithms tends to render the learning process inefficient (Figure 6). This inefficiency stems from the inadequacy of a flattened model in accommodating the unique biases inherent in multi-agent environments.

In numerous real-world multi-agent scenarios, agents typically engage in independent decision-making for the majority of instances, with collaboration reserved for exceptional circumstances. Furthermore, similar agents often exhibit homogeneous behavior (e.g., focusing fire in the SMAC environment). For the former, a series of algorithms, including IPPO (Yu et al., 2022), have demonstrated the efficacy of independent learning in a multitude of MARL settings. Regarding the latter, previous research in multi-agent model-free approaches has underscored the huge success of parameter sharing (Rashid et al., 2020; Sunehag et al., 2017) within MARL, which can be regarded as an exploitation of agents' homogeneous biases. Hence, encoding these biases effectively within the design of the model becomes one main challenge in multi-agent MBRL.

3.2 Planning in Exponential Sized Action Space

Figure 1: Bandit Experiment. We compare the Behavior Cloning (BC) loss and our Advantage-Weighted Policy Optimization (AWPO) loss on a bandit with action space $|\mathcal{A}|=100$ and sampling time $B=2$. It is evident that AWPO converges much faster than BC, owing to the effective utilization of values.

The action space in multi-agent environments grows exponentially with the number of agents, rendering it immensely challenging for vanilla MCTS algorithms to expand all actions. While Sampled MuZero, in prior single-agent MBRL work, attempted to address scenarios with large action spaces by utilizing action sampling, the size of the action spaces it tackled (e.g., Go, $\sim 300$ actions) is still far smaller than that of typical multi-agent environments (e.g., SMAC 27m_vs_30m, $\sim 10^{42}$ joint actions).

In practical applications like MARL, because far fewer actions are sampled than the size of the original action space, the underestimation issue of Sampled MCTS becomes more pronounced, making it less than ideal in terms of search efficiency. This motivates us to design a more optimistic search process.

Moreover, the behavior cloning (BC) loss adopted by Sampled MCTS algorithms disregards value information: the target policy $\omega(\cdot|s_{t,0})$ only encapsulates the relative magnitude of the sampled action values while discarding the absolute action values. The issue is not particularly severe when dealing with a smaller action space. However, in multi-agent environments, the disparity between the sampling time $B$ and $|\mathcal{A}|$ becomes significant, making it impossible to overlook the repercussions of disregarding value information (Figure 1).

4 MAZero algorithm

4.1 Model Structure

The MAZero model comprises six key functions: a representation function $h_{\theta}$ that maps the current observation history $o^{i}_{\leq t}$ of agent $i$ to an individual latent state $s_{t,0}^{i}$; a communication function $e_{\theta}$ that generates additional cooperative features $e_{t,k}^{i}$ for each agent $i$ through the attention mechanism; a dynamic function $g_{\theta}$ tasked with deriving the subsequent local latent state $s_{t,k+1}^{i}$ from the agent's individual state $s_{t,k}^{i}$, future action $a_{t+k}^{i}$ and communication feature $e_{t,k}^{i}$; a reward prediction function $R_{\theta}$ forecasting the cooperative team reward $r_{t,k}$ from the global hidden state $\boldsymbol{s}_{t,k}=(s_{t,k}^{1},\ldots,s_{t,k}^{N})$ and joint action $\boldsymbol{a}_{t+k}=(a_{t+k}^{1},\ldots,a_{t+k}^{N})$; a value prediction function $V_{\theta}$ aimed at predicting the value $v_{t,k}$ of each global hidden state $\boldsymbol{s}_{t,k}$; and a policy prediction function $P_{\theta}$ which, given an individual state $s_{t,k}^{i}$, generates the corresponding policy $p_{t,k}^{i}$ for agent $i$. The model equations are shown in Equation 4.

$$\left\{\begin{array}{ll}
\text{Representation:} & s_{t,0}^{i}=h_{\theta}(o_{\leq t}^{i})\\[4pt]
\text{Communication:} & e_{t,k}^{1},\ldots,e_{t,k}^{N}=e_{\theta}(s_{t,k}^{1},\ldots,s_{t,k}^{N},a_{t+k}^{1},\ldots,a_{t+k}^{N})\\[4pt]
\text{Dynamic:} & s_{t,k+1}^{i}=g_{\theta}(s_{t,k}^{i},a_{t+k}^{i},e_{t,k}^{i})\\[4pt]
\text{Reward Prediction:} & r_{t,k}=R_{\theta}(s_{t,k}^{1},\ldots,s_{t,k}^{N},a_{t+k}^{1},\ldots,a_{t+k}^{N})\\[4pt]
\text{Value Prediction:} & v_{t,k}=V_{\theta}(s_{t,k}^{1},\ldots,s_{t,k}^{N})\\[4pt]
\text{Policy Prediction:} & p_{t,k}^{i}=P_{\theta}(s_{t,k}^{i})
\end{array}\right. \tag{4}$$

Notably, the representation function $h_{\theta}$, the dynamic function $g_{\theta}$, and the policy prediction function $P_{\theta}$ all operate on local information, enabling distributed execution. Conversely, the other functions deal with value information and team cooperation, necessitating centralized training for effective learning.
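As a concrete illustration of Equation 4, the PyTorch-style sketch below performs one unrolling step of a MAZero-style model. It is a simplification under our own assumptions: layer sizes, the mean-pooling used by the centralized heads, and the placement of the value/policy heads on the unrolled state are illustrative choices, not the released implementation.

```python
import torch
import torch.nn as nn

class MAZeroModelSketch(nn.Module):
    """Minimal sketch of the six functions in Equation 4 (B = batch, N = agents)."""

    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())       # representation h_theta
        self.sa_embed = nn.Linear(hidden + act_dim, hidden)                 # state-action embedding
        self.comm = nn.TransformerEncoder(                                  # communication e_theta (attention)
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True), num_layers=1)
        self.g = nn.Sequential(nn.Linear(2 * hidden + act_dim, hidden), nn.ReLU())  # shared dynamics g_theta
        self.R = nn.Linear(hidden + act_dim, 1)   # reward head R_theta (pooled over agents)
        self.V = nn.Linear(hidden, 1)             # value head V_theta (pooled over agents)
        self.P = nn.Linear(hidden, act_dim)       # per-agent policy head P_theta

    def represent(self, obs):                     # obs: (B, N, obs_dim) -> s: (B, N, hidden)
        return self.h(obs)

    def unroll_step(self, s, a_onehot):           # s: (B, N, hidden), a_onehot: (B, N, act_dim)
        sa = torch.cat([s, a_onehot], dim=-1)
        e = self.comm(self.sa_embed(sa))                    # cooperative features per agent
        s_next = self.g(torch.cat([sa, e], dim=-1))         # individual dynamics, parameters shared
        reward = self.R(sa).mean(dim=1)                     # centralized reward prediction
        value = self.V(s_next).mean(dim=1)                  # centralized value prediction
        policy_logits = self.P(s_next)                      # decentralized per-agent policies
        return s_next, reward, value, policy_logits
```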

Figure 2: MAZero model structure. Given the current observations $o_t^i$ for each agent, the model separately maps them into local hidden states $s_{t,0}^i$ using a shared representation network $h$. The value prediction $v_{t,0}$ is computed based on the global hidden state $\boldsymbol{s}_{t,0}$, while policy priors $p_{t,0}^i$ are individually calculated for each agent using their corresponding local hidden states. Agents use the communication network $e$ to access team information $e_{t,0}^i$ and generate the next local hidden states $s_{t,1}^i$ via the shared individual dynamic network $g$, subsequently deriving the reward $r_{t,1}$, value $v_{t,1}$ and policies $p_{t,1}^i$. During the training stage, real future observations $\boldsymbol{o}_{t+1}$ can be obtained to generate the target for the next hidden state, denoted $\boldsymbol{s}_{t+1,0}$.

4.2 Optimistic Search Lambda

Since MAZero acquires a deterministic world model, we devise an Optimistic Search Lambda (OS($\lambda$)) approach to better harness this characteristic of the model.

In prior work, the Selection stage of MCTS employs two metrics (a value score and an exploration bonus) to gauge our interest in a particular action, where the value score is the mean of the value estimates obtained from all simulations within that node's subtree. Within a deterministic model, however, the mean estimate is excessively conservative. Consider the scenario where the tree degenerates into a multi-armed bandit: if pulling an arm consistently yields a deterministic outcome, there is no need, as in the Upper Confidence Bound (UCB) algorithm, to repeatedly sample the same arm and average the results. Similarly, within the tree-based UCT (Upper Confidence Trees) algorithm, averaging values across all simulations in a subtree may be substituted with a more optimistic estimate when the environment is deterministic.

Hence, courtesy of the deterministic model, our focus narrows to managing model generalization errors rather than contending with errors introduced by environmental stochasticity. Building on this conceptual foundation, we devise a methodology for calculating the value score that places a heightened emphasis on optimistic values.

Specifically, for each node $s$, we define:

$$\begin{aligned}
\mathcal{U}_{d}(s) &= \left\{\sum_{k<d}\gamma^{k}r_{k}+\gamma^{d}v(s')\;\middle|\;s':\text{dep}(s')=\text{dep}(s)+d\right\} && (5)\\
\mathcal{U}_{d}^{\rho}(s) &= \text{Top }(1-\rho)\text{ values in }\mathcal{U}_{d}(s) && (6)\\
V_{\lambda}^{\rho}(s) &= \sum_{d}\sum_{x\in\mathcal{U}_{d}^{\rho}(s)}\lambda^{d}x \Big/ \sum_{d}\lambda^{d}\left|\mathcal{U}_{d}^{\rho}(s)\right| && (7)\\
A_{\lambda}^{\rho}(s,a) &= r(s,a)+\gamma V_{\lambda}^{\rho}(\text{Dynamic}(s,a))-v(s) && (8)
\end{aligned}$$

where $\text{dep}(u)$ denotes the depth of node $u$ in the MCTS tree, $\rho,\lambda\in[0,1]$ are hyperparameters, and $r$, $v$ and Dynamic are all model predictions calculated according to Equation 4.

To briefly elucidate, Equation 6 is the set of optimistic values at depth $d$, where the degree of optimism is controlled by the quantile parameter $\rho$. Since model errors magnify with increasing depth in practice, Equation 7 computes a weighted mean of all optimistic values in the subtree as the value estimate of $s$, where estimates at different depths are discounted by a factor $\lambda$ per level.

Finally, we use the optimistic advantage (Equation 8) to replace the value score in Equation 3 for optimistic search, that is:

$$a=\mathop{\arg\max}\limits_{a\in T(s)\subset\mathcal{A}}\left[A_{\lambda}^{\rho}(s,a)+\frac{\hat{\beta}}{\beta}P(s,a)\cdot\frac{\sqrt{\sum_{b}N(s,b)}}{1+N(s,a)}\cdot c(s)\right] \tag{9}$$
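To illustrate Equations 5–8, the sketch below computes the optimistic value $V_{\lambda}^{\rho}$ and advantage $A_{\lambda}^{\rho}$ from per-depth return sets gathered in a node's subtree. It is a simplified, self-contained illustration under our own assumptions (the per-depth return sets are assumed to be pre-computed and passed in, with index 0 holding the node's own value estimate), not the search implementation itself.

```python
import numpy as np

def optimistic_value(returns_by_depth, rho=0.75, lam=0.8):
    """V_lambda^rho(s): lambda-weighted mean of the top (1 - rho) returns at each depth.

    returns_by_depth[d] holds the set U_d(s) of discounted returns
    sum_{k<d} gamma^k r_k + gamma^d v(s') for descendants s' at depth dep(s) + d,
    with d = 0 corresponding to the node's own value estimate {v(s)}.
    """
    num, den = 0.0, 0.0
    for d, returns in enumerate(returns_by_depth):
        if len(returns) == 0:
            continue
        returns = np.sort(np.asarray(returns))
        keep = max(1, int(np.ceil((1 - rho) * len(returns))))  # top (1 - rho) quantile (Eq. 6)
        top = returns[-keep:]
        num += (lam ** d) * top.sum()                           # Eq. 7 numerator
        den += (lam ** d) * len(top)                            # Eq. 7 denominator
    return num / den

def optimistic_advantage(r_sa, v_s, child_returns_by_depth, gamma=0.99, rho=0.75, lam=0.8):
    """A_lambda^rho(s, a) = r(s, a) + gamma * V_lambda^rho(Dynamic(s, a)) - v(s)  (Eq. 8)."""
    return r_sa + gamma * optimistic_value(child_returns_by_depth, rho, lam) - v_s

# Toy example: the child node Dynamic(s, a) with returns gathered at depths 0, 1 and 2.
child_returns = [[0.35], [0.2, 0.5, 0.4], [0.1, 0.6]]
print(optimistic_advantage(r_sa=0.05, v_s=0.3, child_returns_by_depth=child_returns))
```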

4.3 Advantage-Weighted Policy Optimization

In order to utilize the value information calculated by OS($\lambda$), we propose Advantage-Weighted Policy Optimization (AWPO), a novel policy loss function that incorporates the optimistic advantage (Equation 8) into the behavior cloning loss.²

²In this section, the expectation operator is taken over the randomness of OS($\lambda$). For the sake of simplicity, we omit this in the formulas.

In terms of outcome, the AWPO loss weights the per-action behavior cloning loss by $\exp\left(A_{\lambda}^{\rho}(s,a)/\alpha\right)$:

$$l_{p}(\theta;\pi,\omega,A_{\lambda}^{\rho})=-\mathbb{E}\left[\sum_{a\in T(s)}\omega(a|s)\exp\left(\frac{A_{\lambda}^{\rho}(s,a)}{\alpha}\right)\log\pi(a|s;\theta)\right] \tag{10}$$

where $\theta$ denotes the parameters of the learned model, $\pi(\cdot|s;\theta)$ is the network's predicted policy to be improved, $\omega(\cdot|s)$ is the search policy supported on the action subset $T(s)$, $A_{\lambda}^{\rho}$ is the optimistic advantage derived from OS($\lambda$), and $\alpha>0$ is a hyperparameter controlling the degree of optimism. Theoretically, AWPO can be regarded as a cross-entropy loss between $\eta^{*}$ and $\pi$ (equivalently, minimizing the KL divergence between $\eta^{*}$ and $\pi$), where $\eta^{*}$ is the non-parametric solution of the following constrained optimization problem:

$$\begin{array}{rl}\displaystyle\mathop{\text{maximize}}_{\eta} & \mathbb{E}_{a\sim\eta(\cdot|s)}\left[A_{\lambda}^{\rho}(s,a)\right]\\[4pt] \text{s.t.} & \text{KL}\left(\eta(\cdot|s)\,\|\,\pi^{\text{MCTS}}(\cdot|s)\right)\leq\epsilon\end{array} \tag{11}$$

More specifically, by the Lagrangian method, we have:

$$\eta^{*}(a|s)\propto\pi^{\text{MCTS}}(a|s)\exp\left(\frac{A_{\lambda}^{\rho}(s,a)}{\alpha}\right)=\mathbb{E}\left[\omega(a|s)\exp\left(\frac{A_{\lambda}^{\rho}(s,a)}{\alpha}\right)\right] \tag{12}$$

Thereby, minimizing the KL divergence between $\eta^{*}$ and $\pi$ gives the form of the AWPO loss:

$$\begin{aligned}
&\operatorname*{arg\,min}_{\theta}\,\text{KL}\left(\eta^{*}(\cdot|s)\,\|\,\pi(\cdot|s;\theta)\right)\\
=\;&\operatorname*{arg\,min}_{\theta}\,-\frac{1}{Z(s)}\sum_{a}\pi^{\text{MCTS}}(a|s)\exp\left(\frac{A_{\lambda}^{\rho}(s,a)}{\alpha}\right)\log\pi(a|s;\theta)-H(\eta^{*}(\cdot|s))\\
=\;&\operatorname*{arg\,min}_{\theta}\,l_{p}(\theta;\pi,\omega,A_{\lambda}^{\rho})
\end{aligned}$$

where $Z(s)=\sum_{a}\pi^{\text{MCTS}}(a|s)\exp\left(\frac{A_{\lambda}^{\rho}(s,a)}{\alpha}\right)$ is a normalizing factor and $H(\eta^{*})$ stands for the entropy of $\eta^{*}$, which is a constant.

Equation 11 shows that AWPO aims to optimize the value improvement of $\pi$ in proximity to $\pi^{\text{MCTS}}$, which effectively combines the improved policy obtained from OS($\lambda$) with the optimistic value, thereby compensating for the shortcomings of the BC loss and enhancing the efficiency of policy optimization.

Details of the derivation can be found in Appendix G.
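The sketch below computes the AWPO loss of Equation 10 for a single root state, given the sampled action subset, the search policy $\omega$, the optimistic advantages and the network's logits. The tensor layout and the name of the temperature argument are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def awpo_loss(logits, sampled_actions, omega, advantages, alpha=3.0):
    """AWPO loss (Eq. 10): advantage-weighted cross-entropy over the sampled actions.

    logits:          (A,) network policy logits pi(.|s; theta) over the full action space.
    sampled_actions: (B,) indices of the sampled subset T(s).
    omega:           (B,) search policy omega(a|s) over T(s) (sums to 1).
    advantages:      (B,) optimistic advantages A_lambda^rho(s, a) from OS(lambda).
    """
    log_pi = F.log_softmax(logits, dim=-1)[sampled_actions]   # log pi(a|s; theta) for a in T(s)
    weights = omega * torch.exp(advantages / alpha)           # omega(a|s) * exp(A / alpha)
    return -(weights * log_pi).sum()

# Toy usage with a 10-action space and 3 sampled actions.
logits = torch.randn(10, requires_grad=True)
loss = awpo_loss(logits,
                 sampled_actions=torch.tensor([1, 4, 7]),
                 omega=torch.tensor([0.5, 0.3, 0.2]),
                 advantages=torch.tensor([0.2, -0.1, 0.05]))
loss.backward()
```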

4.4 Model training

The MAZero model is unrolled and trained end-to-end in a schema similar to MuZero. Specifically, given a trajectory segment of length $K+1$ with observations $\boldsymbol{o}_{t:t+K}$, joint actions $\boldsymbol{a}_{t:t+K}$, rewards $u_{t:t+K}$, value targets $z_{t:t+K}$, policy targets $\boldsymbol{\pi}_{t:t+K}^{\text{MCTS}}$ and optimistic advantages $A_{\lambda}^{\rho}$, the model is unrolled for $K$ steps as shown in Figure 2 and trained to minimize the following loss:

$$\mathcal{L}=\sum_{k=1}^{K}(l_{r}+l_{v}+l_{s})+\sum_{k=0}^{K}l_{p} \tag{13}$$

where $l_{r}=\|r_{t,k}-u_{t+k}\|$ and $l_{v}=\|v_{t,k}-z_{t+k}\|$ are reward and value losses similar to MuZero, $l_{s}=\|\boldsymbol{s}_{t,k}-\boldsymbol{s}_{t+k,0}\|$ is the consistency loss akin to EfficientZero, and $l_{p}$ is the AWPO policy loss (Equation 10) in the multi-agent joint action space:

$$l_{p}=-\sum_{\boldsymbol{a}\in T(\boldsymbol{s}_{t+k,0})}\omega(\boldsymbol{a}|\boldsymbol{s}_{t+k,0})\exp\left(\frac{A_{\lambda}^{\rho}(\boldsymbol{s}_{t+k,0},\boldsymbol{a})}{\alpha}\right)\log\boldsymbol{\pi}(\boldsymbol{a}|\boldsymbol{s}_{t,k};\theta) \tag{14}$$

where OS($\lambda$) is performed on the hidden state $\boldsymbol{s}_{t+k,0}$ directly represented from the actual future observation $\boldsymbol{o}_{t+k}$, deriving the corresponding action subset $T(\boldsymbol{s}_{t+k,0})$, search policy $\omega(\boldsymbol{a}|\boldsymbol{s}_{t+k,0})$ and optimistic advantage $A_{\lambda}^{\rho}(\boldsymbol{s}_{t+k,0},\boldsymbol{a})$. $\boldsymbol{\pi}(\boldsymbol{a}|\boldsymbol{s}_{t,k};\theta)=\prod_{i=1}^{N}P_{\theta}(a^{i}|s_{t,k}^{i})$ denotes the joint policy to be improved at the $k$th-step hidden state $\boldsymbol{s}_{t,k}$ unrolled from the dynamic function.
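For completeness, a schematic of how the unrolled losses in Equations 13 and 14 are accumulated is sketched below. The per-step quantities (predicted and target rewards, values, hidden states, and the AWPO terms) are assumed to be supplied by the unrolled model and the reanalysis stage, and the squared-error form of the reward/value/consistency losses is a simplification of the categorical losses used in practice.

```python
def mazero_loss(preds, targets, awpo_terms, awpo_fn):
    """Accumulate the K-step training loss of Equation 13.

    preds[k]      = (r_pred, v_pred, s_pred, joint_logits) for unroll step k = 0..K.
    targets[k]    = (u, z, s_repr): real reward, value target, and re-encoded hidden state.
    awpo_terms[k] = (sampled_joint_actions, omega, advantages) from OS(lambda) reanalysis.
    awpo_fn       = an AWPO loss of Equation 10/14 (e.g., the sketch in Section 4.3).
    """
    total = 0.0
    for k, (r_pred, v_pred, s_pred, joint_logits) in enumerate(preds):
        u, z, s_repr = targets[k]
        if k >= 1:  # reward, value and consistency losses apply to unrolled steps only
            total = total + (r_pred - u).pow(2).mean()        # l_r
            total = total + (v_pred - z).pow(2).mean()        # l_v
            total = total + (s_pred - s_repr).pow(2).mean()   # l_s (consistency)
        actions, omega, adv = awpo_terms[k]
        total = total + awpo_fn(joint_logits, actions, omega, adv)  # l_p, for k = 0..K
    return total
```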

5 Experiments

5.1 StarCraft Multi-Agent Challenge

Baselines

We compare MAZero with both model-based and model-free baseline methods on StarCraft Multi-Agent Challenge (SMAC) environments. The model-based baseline is MAMBA (Egorov & Shpilman, 2022), a recently introduced model-based MARL algorithm built on DreamerV2 (Hafner et al., 2020) and known for its state-of-the-art sample efficiency in various SMAC tasks. The model-free MARL baselines include QMIX (Rashid et al., 2020), QPLEX (Wang et al., 2020a), RODE (Wang et al., 2020b), CDS (Li et al., 2021), and MAPPO (Yu et al., 2022).

Figure 3: Comparisons against baselines in SMAC. The y-axis denotes the win rate and the x-axis denotes the number of steps taken in the environment. Each algorithm is executed with 10 random seeds.

Figure 3 illustrates the results in the SMAC environments. Among the 23 available scenarios, we choose 8 for presentation in this paper: four categorized as Easy tasks and four as Hard tasks. MAZero outperforms all baseline algorithms across the eight scenarios for a given number of steps taken in the environment. Notably, both MAZero and MAMBA, which are model-based methods, exhibit markedly superior sample efficiency on easy tasks compared to model-free baselines. Furthermore, MAZero displays a more stable training curve and a smoother win rate during evaluation than MAMBA. On the hard tasks, MAZero surpasses MAMBA in overall performance, most notably on the 2c_vs_64zg scenario, where MAZero attains a higher win rate with a significantly smaller sample size than other methods. This scenario has a unique property: it features only two agents, each with an action space of up to 70 actions, in contrast to other scenarios where an agent's action space generally contains fewer than 20 actions. This characteristic enhances the role of planning in MAZero, similar to MuZero's outstanding performance in the domain of Go.

Figure 4: Comparisons against MBRL baselines in SMAC. The y-axis denotes the win rate, and the x-axis denotes the cumulative run time of the algorithms on the same platform.

As MAZero builds upon the MuZero framework, we perform end-to-end training directly using planning results from the replay buffer. Consequently, this approach circumvents the time overhead for data augmentation, as seen in Dreamer-based methods. Figure 4 illustrates the superior performance of MAZero with respect to the temporal cost in SMAC environments when compared to MAMBA.

5.2 Ablation

We perform several ablation experiments on the two proposed techniques, OS($\lambda$) and AWPO, with each algorithm executed with three random seeds; the results are reported in Figure 5. In particular, we examine whether disabling OS($\lambda$), AWPO, or both of them (i.e., using the original Sampled MCTS and BC loss function) impairs final performance. Both techniques greatly impact final performance and learning efficiency.

Figure 5: Ablation study on proposed approaches for planning.

In our ablation of the MAZero network structure (Figure 6), we have discerned the substantial impact of two components, “communication” and “sharing”, on algorithmic performance. In stark contrast, the flattened model employed in single-agent MBRL, due to its failure to encapsulate the unique biases of multi-agent environments, can only learn victorious strategies in the most rudimentary of scenarios, such as the 2m_vs_1z map.

Figure 6: Ablation study on network structure.

6 Conclusion

In this paper, we introduce a model-based multi-agent algorithm, MAZero, which utilizes the CTDE framework and MCTS planning. This approach boasts superior sample efficiency compared to state-of-the-art model-free methods and provides comparable or better performance than existing model-based methods in terms of both sample and computational efficiency. We also develop two novel approaches, OS($\lambda$) and AWPO, to improve search efficiency in vast action spaces based on Sampled MCTS. In the future, we aim to further address the challenge of vast action spaces by reducing the dimensionality of the action space during search, for example via action representation.

7 Reproducibility

We ensure the reproducibility of our research by providing comprehensive information and resources that allow others to replicate our findings. All experimental settings in this paper are available in Appendix B, including details on environmental setup, network structure, training procedures, hyperparameters, and more. The source code utilized in our experiments along with clear instructions on how to reproduce our results will be made available upon finalization of the camera-ready version. The benchmark employed in this investigation is either publicly accessible or can be obtained by contacting the appropriate data providers. Additionally, detailed derivations for our theoretical claims can be found in Appendix G. We are dedicated to addressing any concerns or inquiries related to reproducibility and are open to collaborating with others to further validate and verify our findings.

Acknowledgments

This work was supported by the National Science and Technology Innovation 2030 - Major Project (Grant No. 2022ZD0208804).

References

  • Agarwal et al. (2021) Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34:29304–29320, 2021.
  • Anthony et al. (2017) Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. Advances in neural information processing systems, 30, 2017.
  • Antonoglou et al. (2021) Ioannis Antonoglou, Julian Schrittwieser, Sherjil Ozair, Thomas K Hubert, and David Silver. Planning in stochastic environments with a learned model. In International Conference on Learning Representations, 2021.
  • Arulkumaran et al. (2019) Kai Arulkumaran, Antoine Cully, and Julian Togelius. Alphastar: An evolutionary computation perspective. In Proceedings of the genetic and evolutionary computation conference companion, pp.  314–315, 2019.
  • Bard et al. (2020) Nolan Bard, Jakob N Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, et al. The hanabi challenge: A new frontier for ai research. Artificial Intelligence, 280:103216, 2020.
  • Bargiacchi et al. (2021) Eugenio Bargiacchi, Timothy Verstraeten, and Diederik M Roijers. Cooperative prioritized sweeping. In AAMAS, pp.  160–168, 2021.
  • Brown et al. (2020) Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong. Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33:17057–17069, 2020.
  • Corneil et al. (2018) Dane Corneil, Wulfram Gerstner, and Johanni Brea. Efficient model-based deep reinforcement learning with variational state tabulation. In International Conference on Machine Learning, pp. 1049–1058. PMLR, 2018.
  • Danihelka et al. (2021) Ivo Danihelka, Arthur Guez, Julian Schrittwieser, and David Silver. Policy improvement by planning with gumbel. In International Conference on Learning Representations, 2021.
  • Egorov & Shpilman (2022) Vladimir Egorov and Aleksei Shpilman. Scalable multi-agent model-based reinforcement learning. arXiv preprint arXiv:2205.15023, 2022.
  • Gronauer & Diepold (2022) Sven Gronauer and Klaus Diepold. Multi-agent deep reinforcement learning: a survey. Artificial Intelligence Review, pp.  1–49, 2022.
  • Ha & Schmidhuber (2018) David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
  • Hafner et al. (2019) Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.
  • Hafner et al. (2020) Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020.
  • Hafner et al. (2023) Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
  • Hansen et al. (2022) Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955, 2022.
  • He et al. (2023) Jinke He, Thomas M Moerland, and Frans A Oliehoek. What model does muzero learn? arXiv preprint arXiv:2306.00840, 2023.
  • Hernandez-Leal et al. (2020) Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. A very condensed survey and critique of multiagent deep reinforcement learning. In Proceedings of the 19th international conference on autonomous agents and multiagent systems, pp.  2146–2148, 2020.
  • Hessel et al. (2021) Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado Van Hasselt. Muesli: Combining improvements in policy optimization. In International conference on machine learning, pp. 4214–4226. PMLR, 2021.
  • Hewing et al. (2020) Lukas Hewing, Kim P Wabersich, Marcel Menner, and Melanie N Zeilinger. Learning-based model predictive control: Toward safe learning in control. Annual Review of Control, Robotics, and Autonomous Systems, 3:269–296, 2020.
  • Hubert et al. (2021) Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Mohammadamin Barekatain, Simon Schmitt, and David Silver. Learning and planning in complex action spaces. In International Conference on Machine Learning, pp. 4476–4486. PMLR, 2021.
  • Janner et al. (2019) Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in neural information processing systems, 32, 2019.
  • Kaelbling et al. (1998) Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kocsis & Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pp.  282–293. Springer, 2006.
  • Kurach et al. (2020) Karol Kurach, Anton Raichuk, Piotr Stanczyk, Michal Zajac, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, et al. Google research football: A novel reinforcement learning environment. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  4501–4510, 2020.
  • Kurutach et al. (2018) Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.
  • Li et al. (2021) Chenghao Li, Tonghan Wang, Chengjie Wu, Qianchuan Zhao, Jun Yang, and Chongjie Zhang. Celebrating diversity in shared multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 34:3991–4002, 2021.
  • Liu et al. (2020) Yong Liu, Weixun Wang, Yujing Hu, Jianye Hao, Xingguo Chen, and Yang Gao. Multi-agent game abstraction via graph attention neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  7211–7218, 2020.
  • Long et al. (2018) Pinxin Long, Tingxiang Fan, Xinyi Liao, Wenxi Liu, Hao Zhang, and Jia Pan. Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In 2018 IEEE international conference on robotics and automation (ICRA), pp.  6252–6259. IEEE, 2018.
  • Lowe et al. (2017) Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017.
  • Mahajan et al. (2021) Anuj Mahajan, Mikayel Samvelyan, Lei Mao, Viktor Makoviychuk, Animesh Garg, Jean Kossaifi, Shimon Whiteson, Yuke Zhu, and Animashree Anandkumar. Tesseract: Tensorised actors for multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 7301–7312. PMLR, 2021.
  • Mei et al. (2022) Yixuan Mei, Jiaxuan Gao, Weirui Ye, Shaohuai Liu, Yang Gao, and Yi Wu. Speedyzero: Mastering atari with limited data and time. In The Eleventh International Conference on Learning Representations, 2022.
  • Nagabandi et al. (2018) Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE international conference on robotics and automation (ICRA), pp.  7559–7566. IEEE, 2018.
  • Nair et al. (2021) Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement learning with offline datasets, 2021.
  • Nguyen et al. (2020) Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi. Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications. IEEE transactions on cybernetics, 50(9):3826–3839, 2020.
  • Oliehoek et al. (2016) Frans A Oliehoek, Christopher Amato, et al. A concise introduction to decentralized POMDPs, volume 1. Springer, 2016.
  • Peng et al. (2021) Bei Peng, Tabish Rashid, Christian Schroeder de Witt, Pierre-Alexandre Kamienny, Philip Torr, Wendelin Böhmer, and Shimon Whiteson. Facmac: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems, 34:12208–12221, 2021.
  • Rashid et al. (2020) Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. The Journal of Machine Learning Research, 21(1):7234–7284, 2020.
  • Ryu et al. (2020) Heechang Ryu, Hayong Shin, and Jinkyoo Park. Multi-agent actor-critic with hierarchical graph attention network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  7236–7243, 2020.
  • Schaul et al. (2015) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
  • Schrittwieser et al. (2020) Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
  • Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
  • Silver et al. (2018) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
  • Son et al. (2019) Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International conference on machine learning, pp. 5887–5896. PMLR, 2019.
  • Sun et al. (2019) Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches. In Conference on learning theory, pp.  2898–2933. PMLR, 2019.
  • Sunehag et al. (2017) Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.
  • Sutton & Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Tesauro (1994) Gerald Tesauro. Td-gammon, a self-teaching backgammon program, achieves master-level play. Neural computation, 6(2):215–219, 1994.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. (2020a) Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. Qplex: Duplex dueling multi-agent q-learning. arXiv preprint arXiv:2008.01062, 2020a.
  • Wang & Ba (2019) Tingwu Wang and Jimmy Ba. Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649, 2019.
  • Wang et al. (2019) Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.
  • Wang et al. (2020b) Tonghan Wang, Tarun Gupta, Anuj Mahajan, Bei Peng, Shimon Whiteson, and Chongjie Zhang. Rode: Learning roles to decompose multi-agent tasks. arXiv preprint arXiv:2010.01523, 2020b.
  • Willemsen et al. (2021) Daniël Willemsen, Mario Coppola, and Guido CHE de Croon. Mambpo: Sample-efficient multi-robot reinforcement learning using learned world models. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  5635–5640. IEEE, 2021.
  • Yang et al. (2020) Yaodong Yang, Jianye Hao, Ben Liao, Kun Shao, Guangyong Chen, Wulong Liu, and Hongyao Tang. Qatten: A general framework for cooperative multiagent reinforcement learning. arXiv preprint arXiv:2002.03939, 2020.
  • Ye et al. (2020) Deheng Ye, Zhao Liu, Mingfei Sun, Bei Shi, Peilin Zhao, Hao Wu, Hongsheng Yu, Shaojie Yang, Xipeng Wu, Qingwei Guo, et al. Mastering complex control in moba games with deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  6672–6679, 2020.
  • Ye et al. (2023) Jianing Ye, Chenghao Li, Jianhao Wang, and Chongjie Zhang. Towards global optimality in cooperative marl with the transformation and distillation framework, 2023.
  • Ye et al. (2021) Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, and Yang Gao. Mastering atari games with limited data. Advances in Neural Information Processing Systems, 34:25476–25488, 2021.
  • Yu et al. (2022) Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35:24611–24624, 2022.
  • Zhang et al. (2020) Kaiqing Zhang, Sham Kakade, Tamer Basar, and Lin Yang. Model-based multi-agent rl in zero-sum markov games with near-optimal sample complexity. Advances in Neural Information Processing Systems, 33:1166–1178, 2020.
  • Zhou et al. (2020) Ming Zhou, Jun Luo, Julian Villella, Yaodong Yang, David Rusu, Jiayu Miao, Weinan Zhang, Montgomery Alban, Iman Fadakar, Zheng Chen, et al. Smarts: Scalable multi-agent reinforcement learning training school for autonomous driving. arXiv preprint arXiv:2010.09776, 2020.

Appendix A Related works

Dec-POMDP

In this work, we focus on fully cooperative multi-agent systems that can be formalized as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) (Oliehoek et al., 2016), defined by $(N,\mathcal{S},\mathcal{A}^{1:N},T,U,\Omega^{1:N},\mathcal{O}^{1:N},\gamma)$, where $N$ is the number of agents, $\mathcal{S}$ is the global state space, $T$ is the global transition function, $U$ is the shared reward function, and $\mathcal{A}^{i},\Omega^{i},\mathcal{O}^{i}$ are the action space, observation space and observation function for agent $i$. Given the state $s_t$ at timestep $t$, agent $i$ can only acquire the local observation $o_t^i=\mathcal{O}^i(s_t)$ and chooses an action $a_t^i\in\mathcal{A}^i$ according to its policy $\pi^i$ based on its local observation history $o_{\leq t}^i$. The environment then transitions to the next state $s_{t+1}\sim T(\cdot|s_t,\boldsymbol{a}_t)$ and returns a scalar reward $u_t=U(s_t,\boldsymbol{a}_t)$. The objective for all agents is to learn a joint policy $\boldsymbol{\pi}$ that maximizes the expected discounted return $J(\boldsymbol{\pi})=\mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}u_{t}\,\middle|\,a_{t}^{i}\sim\pi^{i}(\cdot|o_{\leq t}^{i}),\,i=1,\ldots,N\right]$.

Combining reinforcement learning and planning algorithms

The integration of reinforcement learning and planning within a common paradigm has yielded superhuman performance in diverse domains (Tesauro, 1994; Silver et al., 2017; Anthony et al., 2017; Silver et al., 2018; Schrittwieser et al., 2020; Brown et al., 2020). In this approach, RL acquires knowledge through the learning of value and policy networks, which in turn guide the planning process. Simultaneously, planning generates expert targets that facilitate RL training. For example, MuZero-based algorithms (Schrittwieser et al., 2020; Hubert et al., 2021; Antonoglou et al., 2021; Ye et al., 2021; Mei et al., 2022) combine Monte-Carlo Tree Search (MCTS), TD-MPC (Hansen et al., 2022) integrates Model Predictive Control (MPC), and ReBeL (Brown et al., 2020) incorporates Counterfactual Regret Minimization (CFR). Nevertheless, the prevailing approach among these algorithms typically treats the planning results as targets and subsequently employs behavior cloning to make the RL network emulate these targets. Muesli (Hessel et al., 2021) first combines regularized policy optimization with model learning as an auxiliary loss within the MuZero framework. However, it focuses solely on the policy gradient loss while neglecting the potential advantages of planning; consequently, it attains at most a performance level equivalent to that of MuZero. To the best of our knowledge, we are the first to combine policy gradient and MCTS planning for accurate policy learning in model-based reinforcement learning.

Variants of MuZero-based algorithms

While the original MuZero algorithm (Schrittwieser et al., 2020) has achieved superhuman performance levels in Go and Atari, it is important to note that it harbors several limitations. To address and overcome these limitations, subsequent advancements have been made in the field. EfficientZero (Ye et al., 2021), for instance, introduces self-supervised consistency loss, value prefix prediction, and off-policy correction mechanisms to expedite and stabilize the training process. Sampled MuZero (Hubert et al., 2021) extends its capabilities to complex action spaces through the incorporation of sample-based policy iteration. Gumbel MuZero (Danihelka et al., 2021) leverages the Gumbel-Top-k trick and modifies the planning process using Sequential Halving, thereby reducing the demands of MCTS simulations. SpeedyZero (Mei et al., 2022) integrates a distributed RL framework to facilitate parallel training, further enhancing training efficiency. He et al. (2023) demonstrate that the model acquired by MuZero typically lacks accuracy when employed for policy evaluation, rendering it ineffective in generalizing to assess previously unseen policies.

Multi-agent reinforcement learning

The typical approach for MARL in cooperative settings involves centralized training and decentralized execution (CTDE). During the training phase, this approach leverages global or communication information, while during the execution phase, it restricts itself to the observation information relevant to the current agent. This paradigm encompasses both value-based (Sunehag et al., 2017; Rashid et al., 2020; Son et al., 2019; Yang et al., 2020; Wang et al., 2020a) and policy-based (Lowe et al., 2017; Liu et al., 2020; Peng et al., 2021; Ryu et al., 2020; Ye et al., 2023) MARL methods within the context of model-free scenarios. Model-based MARL algorithms remain relatively underexplored, with only a few methods as follows: MAMBPO (Willemsen et al., 2021) pioneers the fusion of the CTDE framework with Dyna-style model-based techniques, emphasizing the utilization of generated data closely resembling real-world data. CPS (Bargiacchi et al., 2021) introduces dynamics and reward models to determine data priorities based on the MAMBPO approach. Tesseract (Mahajan et al., 2021) dissects states and actions into low-rank tensors and evaluates the Q function using Dynamic Programming (DP) within an approximate environment model. MAMBA (Egorov & Shpilman, 2022) incorporates the world model proposed in DreamerV2 (Hafner et al., 2020) and introduces an attention mechanism to facilitate communication. This integration leads to noteworthy performance improvements and a substantial enhancement in sample efficiency when compared to previous model-free methods. To the best of our knowledge, we are the first to extend the MuZero framework and incorporate MCTS planning into model-based MARL settings.

Appendix B Implementation Details

B.1 Network Structure

For the SMAC scenarios, where the input observations are 1-dimensional vectors (as opposed to the 2-dimensional images of the board games or Atari used by MuZero), we use a variation of the MuZero model architecture in which all convolutions are replaced by fully connected layers. The model consists of 6 modules: representation function, communication function, dynamic function, reward prediction function, value prediction function, and policy prediction function, all represented as neural networks. All modules except the communication block are implemented as Multi-Layer Perceptron (MLP) networks, where each linear layer is followed by a Rectified Linear Unit (ReLU) activation and a Layer Normalisation (LN) layer. Specifically, we use a hidden state size of 128 for all SMAC scenarios, and the hidden layers of each MLP module are as follows: Representation Network = $[128,128]$, Dynamic Network = $[128,128]$, Reward Network = $[32]$, Value Network = $[32]$ and Policy Network = $[32]$. We use a Transformer architecture (Vaswani et al., 2017) with three stacked layers for encoding state-action pairs. These encodings are then used by the agents to process local dynamics and make predictions. We apply dropout with probability $p=0.1$ to prevent over-fitting and use positional encoding to distinguish agents in homogeneous settings.

For the representation network, we stack the last four local observations together as input for each agent to deal with partial observability. Additionally, the concatenated observations are processed by an extra LN to normalize observed features before representation.

The dynamics function takes as input the concatenation of the current local state, the individual action, and the communication encoding computed from the current state-action pairs. To alleviate vanishing gradients when the model is unrolled for many consecutive steps, the dynamics function adds a residual connection from the current hidden state to the next hidden state.
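For concreteness, one unroll step can be sketched as follows (continuing the assumptions of the sketch above; tensor names are illustrative):

```python
import torch

def dynamics_step(dynamics, hidden, action_onehot, comm_encoding):
    """One unroll step: concatenate the local hidden state, the individual action, and the
    communication encoding, then add a residual connection so gradients do not vanish
    over long unrolls."""
    x = torch.cat([hidden, action_onehot, comm_encoding], dim=-1)
    return dynamics(x) + hidden   # residual connection to the current hidden state
```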

For value and reward prediction, we follow MuZero (Schrittwieser et al., 2020) in scaling targets with the invertible transform $h(x)=\operatorname{sign}(x)\left(\sqrt{|x|+1}-1\right)+0.001\cdot x$ and in using its categorical representation. We use ten bins for both the value and the reward predictions, which can represent values in $[-5,5]$. We use $n$-step bootstrapping with $n=5$ and a discount of 0.99.
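For reference, the transform, its inverse, and the categorical (two-hot) projection can be sketched as below. This is our own illustration of the standard MuZero-style scheme; the exact binning and implementation in the released code may differ.

```python
import torch

def h(x, eps=1e-3):
    """Invertible value/reward transform h(x) = sign(x)(sqrt(|x|+1) - 1) + eps*x."""
    return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1) + eps * x

def h_inv(y, eps=1e-3):
    """Closed-form inverse of h."""
    return torch.sign(y) * (
        ((torch.sqrt(1 + 4 * eps * (torch.abs(y) + 1 + eps)) - 1) / (2 * eps)) ** 2 - 1
    )

def scalar_to_categorical(x, num_bins=10, v_min=-5.0, v_max=5.0):
    """Two-hot projection of transformed scalars onto num_bins atoms spanning [v_min, v_max]."""
    x = h(x).clamp(v_min, v_max)
    step = (v_max - v_min) / (num_bins - 1)
    pos = (x - v_min) / step                      # fractional index into the support
    low = pos.floor().long().clamp(max=num_bins - 2)
    w_high = pos - low.float()
    dist = torch.zeros(*x.shape, num_bins)
    dist.scatter_(-1, low.unsqueeze(-1), (1 - w_high).unsqueeze(-1))
    dist.scatter_add_(-1, (low + 1).unsqueeze(-1), w_high.unsqueeze(-1))
    return dist
```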

B.2 Training Details

MAZero employs a pipeline similar to EfficientZero (Ye et al., 2021), running data collection (self-play workers), reanalysis (reanalyze workers), and training (main thread) as parallel stages while coordinating them so that runs with the same random seed remain reproducible.

We use the Adam optimizer (Kingma & Ba, 2014) for training, with a batch size of 256 and a constant learning rate of $10^{-4}$. Samples are drawn from the replay buffer according to prioritized replay (Schaul et al., 2015), using the same priority and hyperparameters as in MuZero.

In practice, we re-execute OS($\lambda$) with a delayed target model $\hat{\theta}$ in the reanalysis stage to reduce off-policy error. Consequently, the AWPO loss actually used is:

l_{p}=-\sum_{\hat{\boldsymbol{a}}\in\hat{T}(\hat{\boldsymbol{s}}_{t+k,0})}\hat{\omega}(\hat{\boldsymbol{a}}\mid\hat{\boldsymbol{s}}_{t+k,0})\exp\left(\frac{\hat{A}_{\lambda}^{\rho}(\hat{\boldsymbol{s}}_{t+k,0},\hat{\boldsymbol{a}})}{\alpha}\right)\log\boldsymbol{\pi}(\hat{\boldsymbol{a}}\mid\boldsymbol{s}_{t,k};\theta)
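A minimal sketch of this loss in code (our own illustration, not the released implementation; tensor names and shapes are assumptions):

```python
import torch

def awpo_loss(log_pi, omega_hat, advantage, alpha=3.0):
    """Cross-entropy between the search policy and the network policy, weighted by
    exp(advantage / alpha) and summed over the K sampled joint actions.

    log_pi:    (B, K) log pi(a_hat | s; theta) for each sampled joint action
    omega_hat: (B, K) visit-count target policy from the re-executed OS(lambda) search
    advantage: (B, K) OS(lambda) advantage estimates from the delayed target model
    """
    weight = omega_hat * torch.exp(advantage / alpha)
    return -(weight.detach() * log_pi).sum(dim=-1).mean()
```

Only the log-probabilities carry gradients; the search-derived weights are treated as constants.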

For other details, we provide hyper-parameters in Table 1.

Parameter Setting
Observations stacked 4
Discount factor 0.99
Minibatch size 256
Optimizer Adam
Optimizer: learning rate $10^{-4}$
Optimizer: RMSprop epsilon $10^{-5}$
Optimizer: weight decay 0
Max gradient norm 5
Priority exponent ($c_{\alpha}$) 0.6
Priority correction ($c_{\beta}$) $0.4\rightarrow 1$
Evaluation episodes 32
Min replay size for sampling 300
Target network updating interval 200
Unroll steps 5
TD steps ($n$) 5
Number of MCTS sampled actions ($K$) 10
Number of MCTS simulations ($N$) 100
Quantile in MCTS value estimation ($\rho$) 0.75
Decay lambda in MCTS value estimation ($\lambda$) 0.8
Exponential factor in Weighted-Advantage ($\alpha$) 3
Table 1: Hyper-parameters for MAZero in SMAC environments

B.3 Details of Baseline Algorithm Implementations

MAMBA (Egorov & Shpilman, 2022) is executed based on the open-source implementation: https://github.com/jbr-ai-labs/mamba with the hyper-parameters in Table 2.

Parameter Setting
Batch size 256
GAE $\lambda$ 0.95
Entropy coefficient 0.001
Entropy annealing 0.99998
Number of updates 4
Epochs per update 5
Update clipping parameter 0.2
Actor Learning rate $5\times 10^{-4}$
Critic Learning rate $5\times 10^{-4}$
Discount factor 0.99
Model Learning rate $2\times 10^{-4}$
Number of epochs 60
Number of sampled rollouts 40
Sequence length 20
Rollout horizon $H$ 15
Buffer size $2.5\times 10^{5}$
Number of categoricals 32
Number of classes 32
KL balancing entropy weight 0.2
KL balancing cross entropy weight 0.8
Gradient clipping 100
Trajectories between updates 1
Hidden size 256
Table 2: Hyper-parameters for MAMBA in SMAC environments

QMIX (Rashid et al., 2020) is executed based on the open-source implementation: https://github.com/oxwhirl/pymarl with the hyper-parameters in Table 3.

Parameter Setting
Batch size 32
Buffer size 5000
Discount factor 0.99
Actor Learning rate $5\times 10^{-4}$
Critic Learning rate $5\times 10^{-4}$
Optimizer RMSProp
RMSProp $\alpha$ 0.99
RMSProp $\epsilon$ $10^{-5}$
Gradient clipping 10
$\epsilon$-greedy $1.0\rightarrow 0.05$
$\epsilon$ annealing time 50000
Table 3: Hyper-parameters for QMIX in SMAC environments

QPLEX (Wang et al., 2020a) is executed based on the open-source implementation: https://github.com/wjh720/QPLEX with the hyper-parameters in Table 4.

Parameter Setting
Batch size 32
Buffer size 5000
Discount factor 0.99
Actor Learning rate $5\times 10^{-4}$
Critic Learning rate $5\times 10^{-4}$
Optimizer RMSProp
RMSProp $\alpha$ 0.99
RMSProp $\epsilon$ $10^{-5}$
Gradient clipping 10
$\epsilon$-greedy $1.0\rightarrow 0.05$
$\epsilon$ annealing time 50000
Table 4: Hyper-parameters for QPLEX in SMAC environments

RODE (Wang et al., 2020b) is executed based on the open-source implementation: https://github.com/TonghanWang/RODE with the hyper-parameters in Table 5.

Parameter Setting
Batch size 32
Buffer size 5000
Discount factor 0.99
Actor Learning rate $5\times 10^{-4}$
Critic Learning rate $5\times 10^{-4}$
Optimizer RMSProp
RMSProp $\alpha$ 0.99
RMSProp $\epsilon$ $10^{-5}$
Gradient clipping 10
$\epsilon$-greedy $1.0\rightarrow 0.05$
$\epsilon$ annealing time $50K\sim 500K$
Number of clusters $2\sim 5$
Table 5: Hyper-parameters for RODE in SMAC environments

CDS (Li et al., 2021) is executed based on the open-source implementation: https://github.com/lich14/CDS with the hyper-parameters in Table 6.

Parameter Setting
Batch size 32
Buffer size 5000
Discount factor 0.99
Actor Learning rate $5\times 10^{-4}$
Critic Learning rate $5\times 10^{-4}$
Optimizer RMSProp
RMSProp $\alpha$ 0.99
RMSProp $\epsilon$ $10^{-5}$
Gradient clipping 10
$\epsilon$-greedy $1.0\rightarrow 0.05$
$\beta$ 0.05
$\beta_{1}$ 0.5
$\beta_{2}$ 0.5
$\lambda$ 0.1
Attention regulation coefficient $10^{-3}$
Table 6: Hyper-parameters for CDS in SMAC environments

MAPPO (Yu et al., 2022) is executed based on the open-source implementation: https://github.com/marlbenchmark/on-policy with the hyper-parameters in Table 7.

Parameter Setting
Recurrent data chunk length 10
Gradient clipping 10
GAE $\lambda$ 0.95
Discount factor 0.99
Value loss Huber loss
Huber delta 10.0
Batch size num envs × buffer length × num agents
Mini batch size batch size / number of mini-batches
Optimizer Adam
Optimizer: learning rate $5\times 10^{-4}$
Optimizer: RMSprop epsilon $10^{-5}$
Optimizer: weight decay 0
Table 7: Hyper-parameters for MAPPO in SMAC environments

B.4 Details of the Bandit Experiment

The bandit experiment shown in Figure 1 compares the behavior cloning loss and the AWPO loss on a 100-armed bandit with action values $0,\cdots,99$ and sampling time $B=2$. The policies $\pi^{\text{BC}}$ and $\pi^{\text{AWPO}}$ are parameterized as softmax policies, i.e.,

\pi^{\text{BC}}(a;\theta^{\text{BC}})=\frac{\exp(\theta^{\text{BC}}_{a})}{\sum_{b}\exp(\theta^{\text{BC}}_{b})},\qquad
\pi^{\text{AWPO}}(a;\theta^{\text{AWPO}})=\frac{\exp(\theta^{\text{AWPO}}_{a})}{\sum_{b}\exp(\theta^{\text{AWPO}}_{b})},

where $\theta^{\text{BC}},\theta^{\text{AWPO}}\in\mathbb{R}^{100}$ share the same random initialization.

To exclude the influence of stochasticity in the search algorithm and to enable a precise and fair comparison of the loss functions, we make three targeted adjustments.

  1. Let the subset of sampled actions be $T$; we define the target policy $\omega(a)=\mathbb{I}\left(a=\operatorname*{arg\,max}_{b\in T}\text{value}(b)\right)$.

  2. We compute the expectation of the loss over the randomness of all possible subsets $T$.

  3. We normalize the advantage in AWPO to zero mean and unit standard deviation and choose $\alpha=1$.

Formally, we have

l^{\text{BC}}(\theta) =-\sum_{a}\log\pi_{\theta}(a)\,\overline{t_{\theta}(a)}
l^{\text{AWPO}}(\theta) =-\sum_{a}\log\pi_{\theta}(a)\,\overline{t_{\theta}(a)}\exp\left(\text{adv}(a)\right)

where $t_{\theta}(a)=\left(\sum_{b\geq a}\pi_{\theta}(b)\right)^{K}-\left(\sum_{b>a}\pi_{\theta}(b)\right)^{K}$ is the expectation of $\omega(a)$, the overline denotes stop-gradient, and $\text{adv}(a)=\left(a-\frac{99}{2}\right)\sqrt{\frac{3}{2525}}$ is the normalized advantage.
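These expected losses are simple enough to compute exactly. The sketch below is our own illustration with illustrative names (we write K for the sampling time, denoted B=2 in the text); it evaluates both losses for a given parameter vector:

```python
import numpy as np

n_arms, K = 100, 2                              # arm values are 0..99; K actions sampled per step
values = np.arange(n_arms)
adv = (values - 99 / 2) * np.sqrt(3 / 2525)     # advantage normalized to zero mean, unit (sample) std

def expected_losses(theta):
    """Expected BC and AWPO losses over the randomness of the sampled subset T.
    (In the paper t carries a stop-gradient; here we only evaluate the loss values.)"""
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    # t(a): probability that a is the best arm among the K sampled actions
    tail = np.concatenate([np.cumsum(pi[::-1])[::-1], [0.0]])   # tail[a] = sum_{b >= a} pi[b]
    t = tail[:-1] ** K - tail[1:] ** K
    log_pi = np.log(pi)
    l_bc = -(t * log_pi).sum()                     # behavior cloning loss
    l_awpo = -(t * np.exp(adv) * log_pi).sum()     # advantage-weighted (AWPO) loss
    return l_bc, l_awpo

theta = np.random.randn(n_arms)                  # shared random initialization for both policies
print(expected_losses(theta))
```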

Appendix C Standardised Performance Evaluation Protocol

We report our experiments based on the standardised performance evaluation protocol (Agarwal et al., 2021).

Figure 7: Aggregate metrics on the SMAC benchmark with 95% stratified bootstrap confidence intervals. Higher median, interquartile mean (IQM), and mean, but lower optimality gap indicate better performance.
Figure 8: Probabilities of improvement, i.e., how likely it is for MAZero to outperform baselines on the SMAC benchmark.

Appendix D More Ablations

In the experiments section, we presented several ablation studies to demonstrate the effectiveness of each component of MAZero. In this section, we report additional ablation results on hyperparameters.

First, we perform an ablation study on the training stage optimizers, contrasting SGD and Adam. The SGD optimizer is prevalent in most MuZero-based algorithms (Schrittwieser et al., 2020; Ye et al., 2021), while Adam is frequently used in MARL environments (Rashid et al., 2020; Wang et al., 2020a; Yu et al., 2022). Figure 9 indicates that Adam demonstrates superior performance and more consistent stability compared to SGD.

Figure 9: Ablation on optimizer
Figure 10: Ablation on sampled scale
Figure 11: Ablation on the hyper-parameters of OS($\lambda$): (a) ablation on $\rho$; (b) ablation on $\lambda$

In addition, we perform an ablation study on the sampling scale (the number of sampled actions $K$ and the number of simulations $N$ in MCTS), which has been shown to be essential for final performance in Sampled MuZero (Hubert et al., 2021). Given the limitations of time and computational resources, we only compare two settings, $K=5, N=50$ and $K=10, N=100$, on the easy map 8m and the hard map 5m_vs_6m. Figure 10 shows that our method performs better with a larger sampling scale.

We also test the impact of $\rho$ and $\lambda$ in our Optimistic Search Lambda (OS($\lambda$)) algorithm on map 8m. We test $\rho=0.65, 0.75, 0.85, 0.95$ with $\lambda$ fixed to 0.8, and $\lambda=0.7, 0.8, 0.9, 1.0$ with $\rho$ fixed to 0.75. Figure 11 shows that the optimistic approach consistently improves sample efficiency and that the $\lambda$ term helps mitigate model error.

We also perform an ablation on MCTS planning during the evaluation stage. MAZero is designed under the CTDE framework: the global reward allows the agents to learn and optimize their policies collectively during centralized training, and the predicted global reward is used in MCTS planning to search for a better policy on top of the network prior. Table 8 shows that the agents maintain comparable performance without MCTS planning during evaluation, i.e., when directly using the final model and local observations without global reward, communication, or MCTS planning.

Map Env steps w MCTS w/o MCTS performance ratio
3m 50k 0.985 ± 0.015 0.936 ± 0.107 95.0 ± 10.8%
2m_vs_1z 50k 1.0 ± 0.0 1.0 ± 0.0 100 ± 0.0%
so_many_baneling 50k 0.959 ± 0.023 0.938 ± 0.045 97.8 ± 4.7%
2s_vs_1sc 100k 0.948 ± 0.072 0.623 ± 0.185 65.7 ± 19.5%
2c_vs_64zg 400k 0.893 ± 0.114 0.768 ± 0.182 86.0 ± 20.4%
5m_vs_6m 1M 0.875 ± 0.031 0.821 ± 0.165 93.8 ± 18.9%
8m_vs_9m 1M 0.906 ± 0.092 0.855 ± 0.127 94.4 ± 14.0%
10m_vs_11m 1M 0.922 ± 0.064 0.863 ± 0.023 93.6 ± 2.5%
average performance 100% 90.1 ± 11.3%
Table 8: Ablation for using MCTS during evaluation in SMAC environments

Appendix E Experiments on Single-Agent Environments

We also demonstrate the effectiveness of OS($\lambda$) and AWPO on single-agent decision problems with large action spaces, which helps establish the general applicability of these techniques beyond the multi-agent SMAC benchmark.

We choose the classical LunarLander environment as a single-agent benchmark, discretizing its 2-dimensional continuous action space into 400 discrete actions. We additionally select the Walker2D scenario from the MuJoCo environment and discretize each dimension of the continuous action space into 7 discrete actions, i.e., $6^{7}\approx 280{,}000$ legal actions. Tables 9 and 10 illustrate the results: both techniques have a large impact on learning efficiency and final performance.
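As an illustration of such a discretization, a flat discrete action index can be mapped back to a continuous action by treating the index as a product over per-dimension bins. The sketch below uses our own naming and an assumed equal-width binning; it is not the released code.

```python
import numpy as np

def index_to_continuous(idx, bins_per_dim, low, high):
    """Map a flat discrete action index to a continuous action vector, with each
    dimension discretized into `bins_per_dim` evenly spaced values in [low, high]."""
    dim = len(low)
    grid = [np.linspace(low[d], high[d], bins_per_dim) for d in range(dim)]
    sub = np.unravel_index(idx, [bins_per_dim] * dim)   # per-dimension bin indices
    return np.array([grid[d][sub[d]] for d in range(dim)])

# Example with an assumed binning: a 2-dimensional action space in [-1, 1]^2 with
# 20 bins per dimension yields 20**2 = 400 legal discrete actions.
low, high = -np.ones(2), np.ones(2)
print(index_to_continuous(123, bins_per_dim=20, low=low, high=high))
```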

Environment Env steps MAZero w/o OS($\lambda$) and AWPO
LunarLander 250k 184.6±22.8 104.0±87.5
500k 259.8±12.9 227.7±56.3
1M 276.9±2.9 274.1±3.5
Table 9: Ablation for using OS($\lambda$) and AWPO in LunarLander environments
Environment Env steps MAZero w/o OS($\lambda$) and AWPO TD3 SAC
Walker2D 300k 3424 ± 246 2302 ± 472 1101 ± 386 1989 ± 500
500k 4507 ± 411 3859 ± 424 2878 ± 343 3381 ± 329
1M 5189 ± 382 4266 ± 509 3946 ± 292 4314 ± 256
Table 10: Ablation for using OS($\lambda$) and AWPO in Walker2D environments

Appendix F Experiments on Other Multi-Agent Environments

It is also beneficial to validate MAZero on tasks beyond the SMAC benchmark. We further benchmark MAZero on the academy_pass_and_shoot_with_keeper scenario of Google Research Football (GRF) (Kurach et al., 2020) and compare our method with several model-free baseline algorithms. Table 11 shows that our method outperforms the baselines in terms of sample efficiency.

Environment Env steps MAZero CDS RODE QMIX
academy_pass_and_shoot_with_keeper 500k 0.123 ± 0.089 0.069 ± 0.041 0 0
1M 0.214 ± 0.072 0.148 ± 0.117 0 0
2M 0.619 ± 0.114 0.426 ± 0.083 0.290 ± 0.104 0
Table 11: Comparisons against baselines in GRF.

Appendix G Derivation of AWPO Loss

To show that the AWPO loss (Equation 10) essentially solves the corresponding constrained optimization problem (Equation 11), it suffices to show that the closed-form solution of Equation 11 is Equation 12.

We follow the derivation in Nair et al. (2021). Note that the following optimization problem

\begin{gathered}\eta^{*}=\underset{\eta}{\arg\max}\,\mathbb{E}_{\mathbf{a}\sim\eta(\cdot\mid\mathbf{s})}\left[A(\mathbf{s},\mathbf{a})\right]\\ \text{ s.t. }\mathrm{KL}\left(\eta(\cdot\mid\mathbf{s})\|\pi(\cdot\mid\mathbf{s})\right)\leq\epsilon\\ \int_{\mathbf{a}}\eta(\mathbf{a}\mid\mathbf{s})\,{\rm d}\mathbf{a}=1.\end{gathered} (15)

has Lagrangian

\begin{aligned}\mathcal{L}(\eta,\lambda,\alpha)=\;&\mathbb{E}_{\mathbf{a}\sim\eta(\cdot\mid\mathbf{s})}\left[A(\mathbf{s},\mathbf{a})\right]\\ &+\lambda\left(\epsilon-D_{\mathrm{KL}}\left(\eta(\cdot\mid\mathbf{s})\|\pi(\cdot\mid\mathbf{s})\right)\right)\\ &+\alpha\left(1-\int_{\mathbf{a}}\eta(\mathbf{a}\mid\mathbf{s})\,{\rm d}\mathbf{a}\right).\end{aligned} (16)

Applying KKT condition, we have

\frac{\partial\mathcal{L}}{\partial\eta}=A(\mathbf{s},\mathbf{a})+\lambda\log\pi(\mathbf{a}\mid\mathbf{s})-\lambda\log\eta(\mathbf{a}\mid\mathbf{s})-\lambda-\alpha=0 (17)

Solving the above equation gives

\eta^{*}(\mathbf{a}\mid\mathbf{s})=\frac{1}{Z(\mathbf{s})}\pi(\mathbf{a}\mid\mathbf{s})\exp\left(\frac{1}{\lambda}A(\mathbf{s},\mathbf{a})\right) (18)

where $Z(\mathbf{s})$ is a normalizing factor.
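Explicitly, the normalizing factor is determined by the equality constraint; we spell it out for completeness:

Z(\mathbf{s})=\int_{\mathbf{a}}\pi(\mathbf{a}\mid\mathbf{s})\exp\left(\frac{1}{\lambda}A(\mathbf{s},\mathbf{a})\right){\rm d}\mathbf{a},

which guarantees that $\eta^{*}(\cdot\mid\mathbf{s})$ integrates to one.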

To make the KKT conditions hold, we can let $\eta>0$ and use the LICQ (linear independence constraint qualification) condition.

Plugging the original problem (Equation 11) into Equation 15 completes our proof.