
University of Science and Technology of China, Hefei, China
{puyuan, samwang, yr0013, xinyao}@mail.ustc.edu.cn, [email protected]

Decomposed Soft Actor Critic Method for Cooperative Multi-Agent Reinforcement Learning

Yuan Pu, Shaochen Wang, Rui Yang, Xin Yao, Bin Li
Abstract

Deep reinforcement learning methods have shown great performance on many challenging cooperative multi-agent tasks. Two promising research directions are multi-agent value function decomposition and multi-agent policy gradients. In this paper, we propose a new decomposed multi-agent soft actor-critic (mSAC) method, which effectively combines the advantages of these two approaches. Its main modules are a decomposed Q network architecture, a discrete probabilistic policy, and an (optional) counterfactual advantage function. Theoretically, mSAC supports efficient off-policy learning and partially addresses the credit assignment problem in both discrete and continuous action spaces. On the StarCraft II micromanagement cooperative multi-agent benchmark, we empirically investigate the performance of mSAC against its variants and analyze the effects of the different components. Experimental results demonstrate that mSAC significantly outperforms the policy-based approach COMA and achieves competitive results with the state-of-the-art value-based approach Qmix on most tasks in terms of asymptotic performance. In addition, mSAC achieves strong results on tasks with large action spaces, such as $2c\_vs\_64zg$ and $MMM2$.

Keywords:
Deep reinforcement learning · Multi-agent · Actor-critic.

1 Introduction

Many real-world tasks can be modeled as multi-agent systems, and developing AI systems for playing multi-agent games has attracted much attention. In recent years, deep multi-agent reinforcement learning (MARL) algorithms [1] have produced impressive results in many challenging multi-agent systems, such as the coordination of autonomous vehicles [2] and the challenging StarCraft II game [3]. Perhaps the simplest way to solve a multi-agent problem is to treat everything else as part of the environment for each individual agent and to learn concurrently based on the global reward. However, this approach faces two issues [4]: (1) non-stationarity: while an agent is learning, the policies of the other agents are also changing simultaneously, so the environment dynamics are non-stationary; (2) scalability: the joint state and action spaces grow exponentially with the number of agents. To cope with these issues, most recent algorithms adopt the paradigm of centralized training with decentralized execution (CTDE) [5]: they learn a centralized critic conditioned on the joint action and observation history and execute in a decentralized manner via a local actor (value function or policy) for each individual agent.

Following the CTDE paradigm, there are two popular and promising research lines in MARL: value function decomposition and multi-agent policy gradients. Value Decomposition Networks (VDN) [6] represent the joint Q value $Q^{tot}$ as a sum of individual Q values $q^{i}$ that condition only on individual actions and observations; each decentralized policy arises simply from its local Q values $q^{i}$ (selecting actions greedily with respect to $q^{i}$). QMIX [7] subsequently employs a mixing network to estimate joint action-values as a non-linear combination of per-agent values conditioned on local observations. The representative multi-agent policy gradient method is COMA [8], which explicitly uses a counterfactual baseline to address the challenge of multi-agent credit assignment, together with a critic representation that computes the counterfactual baseline efficiently.

Recent work [16] points out that multi-agent Q-learning with linear value decomposition implicitly implements a classical multi-agent credit assignment method called counterfactual difference rewards, which draws a connection with COMA. However, value function decomposition is hard to apply in off-policy training and potentially suffers from the risk of unbounded divergence. In single-agent problems, to achieve sample efficiency and robust performance, [11] proposed the soft actor-critic (SAC) algorithm, an off-policy actor-critic RL method based on the maximum entropy reinforcement learning framework that achieves state-of-the-art performance on many challenging continuous control benchmarks.

To attain both stability and good final performance in the CTDE paradigm, it is important to effectively incorporate the soft actor-critic paradigm into multi-agent value function decomposition. Following the research line of [14], our key insight is that the expected joint Q value can be computed efficiently only when a linearity condition is satisfied, namely that the joint Q value $Q^{tot}$ is a linear mixture of the individual Q values $q^{i}$; in that case the following equation holds (a detailed proof can be found in the Appendix):

\mathbb{E}_{\bm{\pi}}\left[Q^{tot}(\bm{s},\bm{\tau},\bm{a})\right] = \sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]+b(\bm{s}) \quad (1)
= q^{mix}\left(\bm{s},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]\right) \quad (2)

where $Q^{tot}$ is represented by neural networks consisting of the agent networks $q^{i}$ and the mixing network $q^{mix}$. Note that, to make the above equation hold, our mixing network $q^{mix}$ is not a complex non-linear combination but a linear one, whose weights are produced by hyper-networks conditioned only on the global state information.
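
As an illustration, the following minimal sketch (written in PyTorch by us; all variable names are assumptions, not taken from the paper's code) shows how the expectation of a linearly mixed joint Q value in equations (1)-(2) can be computed exactly from per-agent expected Q values:

import torch

def expected_joint_q(pi_probs, local_q, w, b):
    # pi_probs: [batch, n_agents, n_actions]  per-agent policies pi^i(a | tau^i)
    # local_q:  [batch, n_agents, n_actions]  per-agent values q^i(tau^i, a)
    # w:        [batch, n_agents]             non-negative mixing weights k^i(s) from the hyper-network
    # b:        [batch, 1]                    state-dependent bias b(s)
    expected_local_q = (pi_probs * local_q).sum(dim=-1)            # E_{pi^i}[q^i], shape [batch, n_agents]
    return (w * expected_local_q).sum(dim=-1, keepdim=True) + b    # equations (1)-(2)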

Motivated by these insights, in this paper we present a novel multi-agent soft actor-critic (mSAC) method, which is based on the following assumption: the joint Q value $Q^{tot}$ is a linear mixture of the individual Q values $q^{i}$. mSAC contains three main components: a decomposed soft Q network architecture, decentralized probabilistic policies, and an (optional) counterfactual advantage function. The method effectively incorporates the ideas of soft actor-critic and multi-agent value function decomposition.

We empirically investigate the performance of mSAC and analyze the influence of its components through ablation studies on StarCraft II micromanagement cooperative multi-agent tasks. Experimental results demonstrate that mSAC significantly outperforms current advanced policy-based algorithms (e.g., COMA) and achieves comparable performance with value-based approaches (e.g., Qmix) on most tasks. In addition, mSAC achieves strong results on large action space tasks, such as $2c\_vs\_64zg$ and $MMM2$.

To sum up, here are our contributions:

  • We propose the novel mSAC method, which effectively incorporates soft actor-critic with the value function decomposition method, and investigate its practical performance on the StarCraft II cooperative multi-agent benchmark.

  • We conduct extensive performance tests of different mSAC variants to show the effects of soft value iteration, the counterfactual advantage function, and the probabilistic policy, respectively.

2 Related Works

2.1 Soft Actor-Critic

Before introducing the Soft Actor-Critic (SAC) method [11], we briefly present the deep reinforcement learning (RL) problem definition. An RL problem is often formulated as a Markov Decision Process (MDP), $\mathcal{M}=\left(\mathcal{S},\mathcal{A},p,r,\gamma\right)$. When the RL agent interacts with the environment, at each step the agent observes a state $\mathbf{s}_{t}\in\mathcal{S}$, where $\mathcal{S}$ is the state space, and chooses an action $\mathbf{a}_{t}\in\mathcal{A}$ according to the policy $\pi(\mathbf{a}_{t}|\mathbf{s}_{t})$, where $\mathcal{A}$ is the action space; the agent then receives a reward $r\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)$ and the environment transitions to the next state $\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}|\mathbf{s}_{t},\mathbf{a}_{t})$.

The objective of reinforcement learning is to maximize the discounted expected total reward. In the maximum entropy RL framework, however, the goal is not only to optimize the cumulative expected reward but also to maximize the expected entropy of the policy:

J(\pi)=\sum_{t=0}^{T}\mathbb{E}_{\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)\sim\rho_{\pi}}\left[r\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)+\alpha\mathcal{H}\left(\pi\left(\cdot|\mathbf{s}_{t}\right)\right)\right] \quad (3)

where $\gamma$ is the discount factor and $\rho_{\pi}\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)$ denotes the state-action marginal distribution of the trajectory induced by the policy $\pi(\mathbf{a}_{t}|\mathbf{s}_{t})$. SAC is a popular single-agent off-policy actor-critic method built on the maximum entropy reinforcement learning framework. It uses an actor-critic architecture with separate policy and value networks, an off-policy paradigm that reuses previously collected data, and entropy maximization for effective exploration. In contrast to other off-policy algorithms, SAC is quite stable and is considered a state-of-the-art baseline for a diverse range of RL problems with continuous actions.
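
For concreteness, the following minimal sketch (our own illustration, assuming a discrete-action setting with twin target Q networks, rather than the continuous-action setting of the original SAC paper) shows the entropy-augmented soft Bellman target implied by this objective:

import torch

def soft_q_target(reward, next_q1, next_q2, next_pi, alpha, gamma, done):
    # next_q1, next_q2: [batch, n_actions]  twin target Q values at s_{t+1}
    # next_pi:          [batch, n_actions]  policy probabilities pi(. | s_{t+1})
    next_q = torch.min(next_q1, next_q2)                                       # clipped double-Q target
    soft_v = (next_pi * (next_q - alpha * torch.log(next_pi + 1e-8))).sum(-1)  # E_pi[Q - alpha * log pi]
    return reward + gamma * (1.0 - done) * soft_v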

2.2 Value Function Decomposition

Value function decomposition methods, e.g., VDN [6], learn a local Q value function for each individual agent; these local Q values are then combined by a learnable mixing neural network to produce the joint Q value:

Q^{tot}(\bm{\tau},\mathbf{a})=q^{mix}\left(\bm{s},\left[q^{i}\left(\tau^{i},a^{i}\right)\right]\right) \quad (4)

In VDN, the mixing function $q^{mix}$ is a simple arithmetic summation. In QMIX, it is a non-linear monotonic factorization structure, which achieves a much richer function class while satisfying the principle of Individual-Global-Max (IGM): a global $\arg\max$ performed on $Q^{tot}$ yields the same result as a set of individual $\arg\max$ operations performed on each local $q^{i}$.

2.3 Multi-Agent Policy Gradients

The centralized training with decentralized execution (CTDE) paradigm has recently attracted attention for its ability to address non-stationarity. Learning a centralized critic with decentralized actors (CCDA) is an efficient approach that exploits the CTDE paradigm; COMA and MADDPG are two representative examples.

COMA uses a centralised critic to estimate the Q function and decentralised actors to optimise the agents’ policies. To address the challenge of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent’s action while keeping the other agents’ actions fixed. In addition, COMA uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. It updates stochastic policies using the gradient:

g=\mathbb{E}_{\bm{\pi}}\left[\sum_{i}\nabla_{\theta_{i}}\log\pi^{i}\left(a^{i}\mid\tau^{i}\right)A^{i}(\bm{\tau},\bm{a})\right] \quad (5)

where,

A^{i}(\bm{\tau},\bm{a})=Q^{tot}(\bm{\tau},\bm{a})-\sum_{a^{\prime,i}}\pi^{i}\left(a^{\prime,i}\mid\tau^{i}\right)Q^{tot}\left(\bm{\tau},\left(\bm{a}^{-i},a^{\prime,i}\right)\right)

is the counterfactual advantage, where $\bm{a}^{-i}$ denotes the joint action of all agents other than agent $i$.
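
As a sketch (shapes and names are our assumptions, not COMA's released code), the counterfactual advantage can be computed from the critic's Q values for every candidate action of agent $i$, with the other agents' actions held fixed:

import torch

def counterfactual_advantage(q_i, pi_i, action_i):
    # q_i:      [batch, n_actions]  Q^tot(tau, (a^{-i}, a')) for each candidate action a' of agent i
    # pi_i:     [batch, n_actions]  agent i's policy pi^i(a' | tau^i)
    # action_i: [batch] (long)      the action agent i actually took
    q_taken = q_i.gather(1, action_i.unsqueeze(-1)).squeeze(-1)   # Q^tot(tau, a)
    baseline = (pi_i * q_i).sum(dim=-1)                           # sum_{a'} pi^i(a') Q^tot(tau, (a^{-i}, a'))
    return q_taken - baseline                                     # A^i(tau, a)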

MADDPG [10] is an adaptation of actor-critic methods that learns deterministic policies in continuous action spaces, takes the policies of other agents into account, and can successfully learn policies that require complex multi-agent coordination.

3 Methods

In this section, we first introduce the definition and notation of the decentralized partially observable Markov decision process (Dec-POMDP) and then present the multi-agent decomposed policy gradient architecture. Afterwards, we describe three variants: the multi-agent soft actor-critic (mSAC) method, the multi-agent counterfactual soft actor-critic (mCSAC) method, and the multi-agent counterfactual actor-critic (mCAC) method.

3.1 Problem Formulation

Fully cooperative multi-agent tasks can be modelled as a decentralized partially observable Markov decision process (Dec-POMDP) [12]

$G=\langle S,A,P,R,\Omega,O,n,\gamma\rangle$

where $s\in S$ is the global state and $o\in\Omega$ is a local observation. At each time step, each agent $i$ receives an observation $o^{i}$ drawn according to the observation function $O(s,i)$ and selects an action $a^{i}\in A^{i}$, forming a joint action $\bm{a}\in\bm{A}\equiv A^{n}$; the environment then transitions to the next state $s^{\prime}$ according to the transition function $P\left(s^{\prime}\mid s,\bm{a}\right)$ and all agents receive a shared reward $r=R(s,\bm{a})$. Each agent learns a policy $\pi^{i}\left(a^{i}\mid\tau^{i};\theta_{i}\right)$, parameterized by $\theta_{i}$ and conditioned on the local observation-action history $\tau^{i}\in\mathrm{T}\equiv(\Omega\times A)^{*}$. The joint policy $\bm{\pi}$, with parameters $\theta=\left\langle\theta_{1},\cdots,\theta_{n}\right\rangle$, induces the joint Q function: $Q_{\pi}^{tot}(\bm{\tau},\bm{a})=\mathbb{E}_{s_{0:\infty},\bm{a}_{0:\infty}}\left[\sum_{t=0}^{\infty}\gamma^{t}R\left(s_{t},\bm{a}_{t}\right)\mid s_{0}=s,\bm{a}_{0}=\bm{a},\bm{\pi}\right]$.

3.2 Multi-Agent Decomposed Policy Gradient Architecture

In this part, we first present the common multi-agent decomposed policy gradient architecture for all the algorithm variants we will introduce in the next three subsections.

We use function approximators (neural networks) for both the centralized critic (the Q function) and the decentralized actors (the policies), and alternate between optimizing both networks with stochastic gradient descent. We consider a parameterized Q function $Q_{\phi}(s_{t},\tau_{t},a_{t})$ and a tractable policy $\pi_{\theta}(a_{t}|\tau_{t})$, where $\phi$ and $\theta$ denote the parameters of the Q networks and the policy networks, respectively.

Policy Network Also called the decentralized actor. For simplicity, our decentralized actor (policy) network has the same structure as agent $i$'s local Q network, except that a $clamp(-5,2)$ operation and a $softmax$ layer are added after the local Q network. At the beginning of training, improper initialization of the policy network can make the policy distribution too sharp and thereby constrain exploration; empirically, we found that this clamp operation relieves the issue and accelerates training. The softmax layer converts the logits into a categorical distribution. The policy network parameters are shared among all agents, and different agents are distinguished by a one-hot identity vector, in order to be consistent with Qmix and to allow a fair comparison.
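
A minimal sketch of such an actor is shown below (our own PyTorch illustration; the layer sizes and names are assumptions, not the paper's exact code):

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, input_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)    # recurrent layer over the trajectory
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, h):
        x = torch.relu(self.fc1(obs))
        h = self.rnn(x, h)
        logits = self.fc2(h).clamp(-5, 2)                # keeps early logits from becoming too sharp
        return torch.softmax(logits, dim=-1), h          # categorical action probabilities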

Value Network The centralized critic's network structure is modified from Qmix's Q network and is comprised of agent $i$'s local Q network $q^{i}$ and a mixing network $q^{mix}$, whose weights and biases are produced by separate hyper-networks. Figure 1 illustrates the detailed structure of the local Q network and the mixing network. For each agent $i$, there is one local Q network that represents its local Q value function $q^{i}(\tau^{i},a^{i})$. We implement the local Q networks as GRUs [17] that receive the current individual observation $o^{i}_{t}$ and the last action $a^{i}_{t-1}$ as input at each time step. The mixing network is a feed-forward neural network that takes the agents' local Q values as input and mixes them linearly, followed by an absolute-value activation, producing the value of $Q^{tot}$, as shown in Figure 1. To make equation (1) hold, the weights and biases of the mixing network are restricted to be linear functions of $s$ and are produced by separate hyper-networks, as in Qmix, which allows us to calculate the expected Q values efficiently.
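
The sketch below illustrates this linear, state-conditioned mixing (again an assumed PyTorch implementation, not the paper's code); the absolute value keeps the mixing weights non-negative:

import torch
import torch.nn as nn

class LinearMixer(nn.Module):
    def __init__(self, n_agents, state_dim):
        super().__init__()
        self.hyper_w = nn.Linear(state_dim, n_agents)    # hyper-network producing the mixing weights
        self.hyper_b = nn.Linear(state_dim, 1)           # hyper-network producing the bias

    def forward(self, local_qs, state):
        # local_qs: [batch, n_agents]  either q^i(tau^i, a^i) or their expectations under pi^i
        w = torch.abs(self.hyper_w(state))               # k^i(s) >= 0
        b = self.hyper_b(state)                          # b(s)
        return (w * local_qs).sum(dim=-1, keepdim=True) + b   # Q^tot = sum_i k^i(s) q^i + b(s)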

Figure 1: Left: mixing network structure; the red parts are the hyper-networks that produce the weights and biases of the mixing network layers. Middle: the overall Qmix architecture. Right: an agent's local Q network (in green); $i$ denotes the one-hot identity vector used to distinguish different agents.

3.3 Multi-Agent Soft Actor-Critic (mSAC)

Algorithm 1 mSAC

Initialize the network parameters $\theta,\phi_{1,2},\bar{\phi}_{1,2}$ and the replay buffer $\mathcal{D}$ used for training the policy and value networks; set the maximum number of training episodes $M$ and the replay buffer size.

1:  for episode = 1 to M do
2:     for each agent i, observe the global state s and its individual observation $o^{i}$
3:     for t = 1 to max-episode-length do
4:        for each agent i, select action $\mathbf{a}_{t}^{i}$ according to the current policy $\pi_{\theta}(\mathbf{a}_{t}^{i}|\tau_{t}^{i})$
5:        Execute the joint action $\mathbf{a}=(a^{1},a^{2},\ldots,a^{N})$, obtain the global reward $r$, and observe the next global state $s^{\prime}$
6:        Add the experience $\left(\mathbf{s}_{t},o^{i}_{t},\mathbf{a}_{t},r_{t},\mathbf{s}_{t+1},o^{i}_{t+1}\right)$ to the replay buffer $\mathcal{D}$
7:        for each RL training step do
8:           Sample a random minibatch of $B$ episodes uniformly from $\mathcal{D}$
9:           Update the critic networks according to equation (6): $\phi_{i}\leftarrow\phi_{i}-\alpha_{Q}\hat{\nabla}_{\phi_{i}}J_{Q}\left(\phi_{i}\right)$, for $i\in\{1,2\}$
10:           Update the actor network according to equation (9): $\theta\leftarrow\theta-\alpha_{\pi}\hat{\nabla}_{\theta}J_{\pi}(\theta)$
11:           Update the temperature $\alpha$ according to equation (11); update the target value network parameters for each $i\in\{1,2\}$: $\bar{\phi}_{i}\leftarrow\tau\phi_{i}+(1-\tau)\bar{\phi}_{i}$
12:        end for
13:     end for
14:  end for
15:  return $Q_{\phi},\pi_{\theta}$

Before introducing the counterfactual multi-agent soft actor-critic method, we first present the multi-agent soft actor-critic method (referred to as mSAC), which adopts the practical approximation to soft policy iteration as in [11].

Similar to [11], the critic loss function of mSAC in the multi-agent setting is

\mathcal{L}(\phi)=\mathbb{E}_{\mathcal{D}}\left[\left(r_{t}+\gamma\min_{j\in\{1,2\}}\hat{Q}_{\phi_{j}^{\prime}}^{targ}-Q^{tot}_{\phi}\left(\bm{s}_{t},\bm{\tau}_{t},\bm{a}_{t}\right)\right)^{2}\right] \quad (6)

Following the original soft actor-critic algorithm, our method maintains two soft Q-value networks $Q^{tot}_{\phi_{j}}\left(\bm{s},\bm{\tau},\bm{a}\right)$, $j\in\{1,2\}$, and takes the minimum of the two as the target. In equations (7)-(8),

\hat{Q}_{\phi_{j}^{\prime}}^{targ} = \mathbb{E}_{\bm{\pi}_{\theta}}\left[Q^{tot}_{\phi_{j}^{\prime}}\left(\bm{s}_{t+1},\bm{\tau}_{t+1},\bm{a}_{t+1}\right)-\alpha\log\bm{\pi}\left(\bm{a}_{t+1}\mid\bm{\tau}_{t+1}\right)\right] \quad (7)
= q^{mix}\left(\bm{s}_{t+1},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i}_{t+1},a^{\prime,i}_{t+1}\right)-\alpha\log\pi^{i}\left(a^{\prime,i}_{t+1}\mid\tau^{i}_{t+1}\right)\right]\right) \quad (8)

where $\mathcal{D}$ is a replay buffer containing previously sampled transitions (states, local observations, actions, rewards, next states, next local observations), and $Q^{tot}_{\phi_{j}^{\prime}}$ is the target Q network, whose parameters $\phi_{j}^{\prime}$ are obtained as an exponentially moving average of the current Q network weights $\phi_{j}$, which has been shown to stabilize training.

Note that, in equation (8), $a^{\prime,i}_{t+1}$ is sampled from agent $i$'s current policy $\pi^{i}$ rather than from the replay buffer. Compared with Qmix, we add a policy network that outputs a probabilistic policy (a probability mass function in the discrete domain), which gives exactly the probability with which each agent selects each discrete action; therefore we can compute the expectation exactly.
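
A minimal sketch of this exact expectation (shapes assumed; our illustration) is:

import torch

def expected_soft_local_q(next_local_q, next_pi, alpha):
    # next_local_q: [batch, n_agents, n_actions]  q^i(tau^i_{t+1}, a')
    # next_pi:      [batch, n_agents, n_actions]  pi^i(a' | tau^i_{t+1})
    log_pi = torch.log(next_pi + 1e-8)
    return (next_pi * (next_local_q - alpha * log_pi)).sum(dim=-1)   # [batch, n_agents]

The resulting per-agent values are then fed, together with $\bm{s}_{t+1}$, into the mixing network $q^{mix}$ to form the target in equation (8).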

Recent work theoretically proved that soft (Boltzmann) policy iteration is guaranteed to improve the policy and converges to the optimal policy. Derived from the soft policy iteration procedure, the objective for the policy update is:

\mathcal{L}(\theta) = \mathbb{E}_{\mathcal{D}}\left[\alpha\log\bm{\pi}\left(\bm{a}_{t}\mid\bm{\tau}_{t}\right)-Q^{tot}_{\phi^{\prime}}\left(\bm{s}_{t},\bm{\tau}_{t},\bm{a}_{t}\right)\right] \quad (9)
= q^{mix}\left(\bm{s}_{t},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i}_{t},a^{i}_{t}\right)-\alpha\log\pi^{i}\left(a^{i}_{t}\mid\tau^{i}_{t}\right)\right]\right) \quad (10)

Here $\alpha$ is a hyper-parameter that controls the trade-off between maximizing the policy entropy and maximizing the expected discounted return.
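
A minimal sketch of this policy objective (our illustration; we assume access to the detached, non-negative mixing weights $k^{i}(s)$, and drop the state-dependent bias, which does not affect the policy gradient):

import torch

def policy_loss(local_q, pi, w, alpha):
    # local_q: [batch, n_agents, n_actions]  q^i(tau^i_t, a), detached from the critic graph
    # pi:      [batch, n_agents, n_actions]  current categorical policies pi^i(a | tau^i_t)
    # w:       [batch, n_agents]             detached non-negative mixing weights k^i(s)
    log_pi = torch.log(pi + 1e-8)
    per_agent = (pi * (alpha * log_pi - local_q)).sum(dim=-1)   # E_{pi^i}[alpha * log pi^i - q^i]
    return (w * per_agent).sum(dim=-1).mean()                   # state-weighted sum over agents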

However, $\alpha$ needs to be set to different values at different stages of training and on different tasks, because the degree of exploration needed differs across states. In some states a good policy has already been learned, and the corresponding $\alpha$ should be reduced to a very small value to weaken exploration; in other states it is unclear which actions are good, so exploration should be increased. The SAC algorithm therefore recasts the original soft policy iteration as a constrained optimization problem: while optimizing the policy to maximize the cumulative discounted return, the algorithm keeps the average policy entropy at a fixed target value (usually $-|A|$), while the action entropy in different states may vary. Specifically, $\alpha$ is updated automatically by optimizing the following loss [11]:

L(\alpha)=\mathbb{E}_{\mathbf{a}_{t}\sim\pi_{t}}\left[-\alpha\log\pi_{t}\left(\mathbf{a}_{t}\mid\tau_{t}\right)-\alpha\overline{\mathcal{H}}\right] \quad (11)
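
A minimal sketch of this temperature update for the discrete case (our illustration; parameterizing the temperature through $\log\alpha$ is an implementation choice we assume, not one stated in the paper):

import torch

def alpha_loss(log_alpha, pi, target_entropy):
    # pi: [batch, n_actions] current categorical policy; target_entropy: scalar target (e.g., -|A|)
    log_pi = torch.log(pi + 1e-8)
    entropy = -(pi * log_pi).sum(dim=-1)                                     # H(pi(. | tau_t))
    return (log_alpha.exp() * (entropy - target_entropy).detach()).mean()    # alpha rises when entropy is below the target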

The details of the mSAC algorithm are summarized in Algorithm 1.

3.4 Multi-Agent Counterfactual Soft Actor-Critic (mCSAC)

One of the most important problems in multi-agent reinforcement learning is credit assignment. To partially address this issue, we adopt the insight of COMA and use a counterfactual advantage function when optimizing the individual policies in the multi-agent decomposed policy gradient paradigm. The policy loss function of the multi-agent counterfactual soft actor-critic (mCSAC) method is as follows:

\mathbb{E}_{\left(\bm{s}_{t},\bm{\tau}_{t},r_{t}\right)\sim\mathcal{D},\,\bm{a}_{t}\sim\bm{\pi}_{\theta}}\left[\log\pi^{i}\left(a^{i}_{t}\mid\tau^{i}_{t}\right)A^{i}(s_{t},\bm{\tau}_{t},\bm{a}_{t})\right] \quad (12)

where,

A^{i}(s_{t},\bm{\tau}_{t},\bm{a}_{t})= -\alpha\log\pi^{i}\left(a^{i}_{t}\mid\tau^{i}_{t}\right)+Q^{tot}_{\phi}\left(\bm{s}_{t},\bm{\tau}_{t},\bm{a}_{t}\right) \quad (13)
\qquad\qquad - q^{mix}\left(\bm{s}_{t},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i}_{t},a^{i,\prime}_{t}\right)\right],q^{-i}\left(\bm{\tau}^{-i}_{t},\bm{a}^{-i}_{t}\right)\right) \quad (14)

Note that on the right-hand side of the above equation, $\bm{a}_{t}=\left(a^{i}_{t},\bm{a}^{-i}_{t}\right)$, where $\bm{a}_{t}$ is the joint action at time $t$, $a^{i}_{t}$ is the local action of agent $i$, and $\bm{a}^{-i}_{t}$ is the (partial) joint action of all agents other than agent $i$. The action $a^{i,\prime}_{t}$ is sampled from the current policy $\pi^{i}$ of agent $i$, while $\bm{a}^{-i}_{t}$ is taken from the replay buffer $\mathcal{D}$. The term $q^{mix}(\cdot)$ computes the counterfactual baseline, which measures the expected action value under the individual policy of agent $i$ while keeping the actions of the other agents fixed. If the joint action value of the sample $\left(\bm{s}_{t},\bm{\tau}_{t},\bm{a}_{t}\right)$ drawn from the replay buffer is greater than this baseline, we update the policy network parameters of agent $i$ to increase the probability of $a^{i}_{t}$, and vice versa.
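
A minimal sketch of this advantage (our illustration; the shapes and the mixer interface are assumptions, and in practice the advantage would be detached before being used in the policy gradient of equation (12)):

import torch

def mcsac_advantage(local_q, pi, actions, q_tot_taken, mixer, state, alpha, i):
    # local_q:     [batch, n_agents, n_actions]  q^j(tau^j_t, .) for all agents j
    # pi:          [batch, n_agents, n_actions]  current policies pi^j(. | tau^j_t)
    # actions:     [batch, n_agents] (long)      joint actions sampled from the replay buffer
    # q_tot_taken: [batch, 1]                    Q^tot_phi(s_t, tau_t, a_t) for the sampled joint action
    q_taken = local_q.gather(-1, actions.unsqueeze(-1)).squeeze(-1)      # q^j(tau^j, a^j), [batch, n_agents]
    counterfactual = q_taken.clone()
    counterfactual[:, i] = (pi[:, i] * local_q[:, i]).sum(dim=-1)        # E_{pi^i}[q^i(tau^i, a')]
    baseline = mixer(counterfactual, state)                              # counterfactual baseline, eq. (14)
    log_pi_i = torch.log(pi[:, i].gather(-1, actions[:, i:i+1]) + 1e-8)  # log pi^i(a^i_t | tau^i_t)
    return -alpha * log_pi_i + q_tot_taken - baseline                    # A^i, eq. (13)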

3.5 Multi-Agent Counterfactual Actor-Critic (mCAC)

To probe the effect of the soft policy iteration paradigm on multi-agent policy optimization, in this section we introduce another variant of the mSAC method: the multi-agent counterfactual actor-critic (mCAC) method, which is obtained by deleting the entropy augmentation term $\alpha\log\pi$ from all loss functions of the mCSAC method. Because it no longer satisfies the conditions of the soft policy iteration paradigm, mCAC becomes an on-policy algorithm, and the capacity of the replay buffer used to update the policy and value networks is set to a small value.

The behavior policy used to collect trajectories is not the categorical distribution output by the policy network, but a strategy similar to $\epsilon$-greedy exploration. The implementation of mCAC is otherwise similar to the mSAC algorithm. Action probabilities are produced from the final layer $z$ via a bounded softmax distribution that lower-bounds the probability of any given action by $\epsilon/|A|$: $P(a)=(1-\epsilon)\,\mathrm{softmax}(z)_{a}+\epsilon/|A|$. We anneal $\epsilon$ linearly from 0.5 to 0.02 across 20000 training episodes.
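
A minimal sketch of this bounded softmax behaviour policy with linear annealing (our illustration):

import torch

def bounded_softmax(logits, episode, n_actions,
                    eps_start=0.5, eps_end=0.02, anneal_episodes=20000):
    frac = min(episode / anneal_episodes, 1.0)                 # linear annealing schedule
    eps = eps_start + frac * (eps_end - eps_start)
    probs = torch.softmax(logits, dim=-1)
    return (1.0 - eps) * probs + eps / n_actions               # every action keeps probability >= eps/|A|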

4 Experiments

In this section, we present our experimental results and analysis. First, we describe the decentralized cooperative StarCraft II micromanagement benchmark to which we apply our proposed method mSAC and the variant methods we consider. Then we compare mSAC, its ablation variants mCSAC and mCAC, the representative value decomposition algorithm Qmix, and the policy gradient algorithm COMA in the aforementioned discrete-action environments. We use the Qmix implementation from the open-source code at https://github.com/starry-sky6688/StarCraft with the same hyper-parameters as [7]. To be consistent with previous work, our implementation (https://github.com/puyuan1996/MARL) uses almost the same network architecture and hyper-parameters across all tasks. More experimental details can be found in the Appendix.

Experimental Setup We focus on the StarCraft II decentralized micromanagement tasks [13] (we use StarCraft II version SC2.4.10 in our experiments), in which each agent controls an individual army unit and all agents receive a shared global reward. We use the StarCraft Multi-Agent Challenge (SMAC) environment [13] as our API, which has become a commonly used benchmark for evaluating state-of-the-art MARL approaches such as COMA, QMIX, and other baseline algorithms. In this paper, our algorithm learns multiple agents (policies) that control the allied units to beat the enemy, while the enemy units are controlled by the built-in handcrafted heuristic AI. Two representative StarCraft II micromanagement scenarios ($3m$ and $2c\_vs\_64zg$) are shown in Figure 2.

Figure 2: Visualizations of the two representative StarCraft II micromanagement scenarios ($3m$ and $2c\_vs\_64zg$).

To compare each method's performance as fairly as possible, our Qmix implementation also uses target Q networks obtained as an exponentially moving average of the Q network weights, which differs from the hard update used in the original paper [7]. In addition, we adopt the same evaluation procedure as in [7]: for each run of a method, we pause training every 100 episodes and run 20 independent test episodes in which each agent performs greedy decentralised action selection (Qmix selects the action with the largest local Q value; the other methods select the action with the largest probability). The percentage of these episodes in which the method defeats all enemy units within the (map-specific) time limit is referred to as the test win rate.
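
A minimal sketch of this evaluation protocol (our illustration using a generic gym-style environment interface; the exact SMAC API differs, and the "battle_won" flag is an assumption):

def test_win_rate(env, select_greedy_actions, n_test_episodes=20):
    wins = 0
    for _ in range(n_test_episodes):
        obs = env.reset()
        done, won = False, False
        while not done:
            actions = select_greedy_actions(obs)        # argmax of local Q values, or of policy probabilities
            obs, reward, done, info = env.step(actions)
            won = info.get("battle_won", False)
        wins += int(won)
    return wins / n_test_episodes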

In the figures, the x-axis is in units of 100 episodes; different maps have different episode length limits according to the difficulty of the task. The shaded region indicates one quarter of a standard deviation.

Algorithm Details of Variant Methods The policy network of every agent consists of a recurrent layer of GRUs with 64-dimensional hidden states and a fully connected MLP layer before and after it. An episode ends when one team is defeated or the time step limit is reached. The mixing network of the value function consists of a single hidden layer of 64 units and does not use the ELU non-linear activation; its weights and biases are generated by an additional hyper-network composed of a single hidden layer of 64 units without the ReLU non-linear activation.

Similar to Qmix, mSAC is trained in mini-batch mode: the batch size is 32 episodes, the target smoothing coefficient used to update the two target Q networks is 0.005, and the discount factor is set to 0.95. Due to parameter sharing, all agents are processed in parallel, and the information of each agent at each time step of each episode occupies one entry of the mini-batch. Once a new episode of trajectory is added to the replay buffer, the algorithm updates the network parameters of the actor and the critic. Specifically, after collecting one episode of trajectory, 32 episodes are sampled from the replay buffer as a mini-batch to train the actor and the critic; the recurrent parts of the actor and the critic are fully unrolled over all time steps, the gradients are backpropagated, and the accumulated gradient updates are applied to the neural networks. For clarity, the hyper-parameter settings of the mSAC algorithm are summarized in Table 1.

Table 1: Hyper-parameters
Parameter Name Value
learning rate 5e-4
target smoothing coefficient (τ\tau) 0.005
discount factor 0.99
optimizer RMSprop
activation function ReLU
replay buffer size (Off-policy) 5000 episodes
replay buffer size (On-policy) 32 episodes
RL batch size 32 episodes
KL lambda automatically adapted
entropy target -dim(A) (e.g., -9 for 3m)

The learning performance of the mSAC method and its variants mCSAC and mCAC on the StarCraft II micromanagement tasks was tested separately to study the impact of off-policy updates, soft Q values, the probabilistic policy, the counterfactual advantage function, and other modules on the multi-agent policy gradient algorithm. On all maps, all algorithms use reward standardization for stability:

r_{standard}=10\cdot(r-r_{mean})/(r_{std}+10^{-6}) \quad (15)
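
A minimal sketch of this standardization (our illustration; whether the mean and standard deviation are computed per batch or as running statistics is an assumption, not specified in the paper):

import numpy as np

def standardize_rewards(rewards):
    # rewards: 1-D array of raw global rewards from a sampled batch of episodes
    rewards = np.asarray(rewards, dtype=np.float64)
    return 10.0 * (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # eq. (15)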

For clarity, we briefly outline the key differences between the variant methods in Table 2.

Table 2: Comparison of Variant Methods
method on/off policy buffer size counterfactual advantage function soft Q values
mSAC off-policy 5000 episodes no yes
mCSAC off-policy 5000 episodes yes yes
mCAC on-policy 32 episodes yes no

Experimental Results

Figure 3: The performance curves for mCSAC, mSAC, mCAC, QMIX, and COMA on StarCraft II micromanagement combat maps with homogeneous agents: (a) $8m$, (b) $3m$.
Figure 4: The performance curves for mCSAC, mSAC, mCAC, QMIX, and COMA on StarCraft II micromanagement combat maps with heterogeneous agents: (a) $1c3s5z$, (b) $3s5z$, (c) $3s\_vs\_5z$. On $3s5z$ and $3s\_vs\_5z$, COMA achieves a zero test win rate according to the experimental results in [18], so we do not plot COMA curves in these graphs.
Figure 5: The performance curves for mSAC and QMIX on large action space tasks: (a) $2c\_vs\_64zg$, (b) $MMM2$, (c) $bane\_vs\_bane$.

In this part, we compare the performance of our method mSAC, its variants mCSAC and mCAC, and the advanced Qmix and COMA methods on different maps, including maps with homogeneous agents ($3m$, $8m$), maps with heterogeneous agents ($1c3s5z$, $3s5z$, $3s\_vs\_5z$), and maps with a large action space ($2c\_vs\_64zg$, $MMM2$, $bane\_vs\_bane$, $27m\_vs\_30m$). The test win rate learning curves are shown in Figures 3, 4, and 5, respectively. From the experimental results, we find that mSAC performs similarly to the policy-based method COMA on the maps with homogeneous agents and is significantly better than COMA on the other maps. On the maps $8m$, $1c3s5z$, $3m$, and $3s5z$, it has asymptotic performance similar to the value-based method Qmix. On the maps with a large action space ($2c\_vs\_64zg$, $MMM2$, $bane\_vs\_bane$), its performance is significantly better than Qmix. On some relatively difficult tasks, such as $3s\_vs\_5z$, all policy-based methods perform worse than Qmix, but not by much. After carefully analyzing the experimental results, we draw the following observations and conclusions:

1. The soft policy iteration paradigm is also effective in multi-agent scenarios. From the results on all maps, we find that mCAC performs worse than the other methods in both stability and asymptotic performance, which indicates that soft policy iteration is usually beneficial to robust policy improvement in the multi-agent policy gradient setting. We conjecture that this is because simultaneously maximizing expected return and entropy lets the agents explore more widely and efficiently and capture multiple modes of near-optimal behaviour.

2. It is important to jointly optimize the entire policy distribution on tasks where the agent has a relatively large action space. For example, on the map $2c\_vs\_64zg$, the Colossus units have a large action space ($|A|=70$). In Qmix, all agents execute in a decentralized manner: each agent greedily selects actions based on its local action value function, so in a given state only one action maximizes the local action value and the remaining actions share the same small exploration probability. At a given time step on $2c\_vs\_64zg$, the probability of an allied unit attacking one specific enemy among the 64 enemy units is very high, while the probabilities of attacking the other 63 units are all equally small. In mSAC, each agent instead executes actions according to its own policy, i.e., the learned categorical distribution, and can cover different areas of the action space in a planned way. By jointly optimizing the entire probability distribution to maximize the sum of expected return and policy entropy, mSAC is intuitively more reasonable and effective than Qmix's $\epsilon$-greedy exploration on tasks with large action spaces.

3. Counterfactual advantage functions are not always effective and matter more in relatively complex tasks. On easy maps such as $8m$, $1c3s5z$, and $3s5z$, the performance of mCSAC and mSAC is similar, but on harder maps such as $3s\_vs\_5z$ the performance gap between mCSAC and mSAC is larger. This indicates that attributing the global reward is critical for solving harder tasks; the counterfactual advantage function partially addresses this issue and can gradually learn a reasonable credit assignment during training on some tasks, but it is not always effective. Moreover, we carefully analyzed the performance of each seed and found that after training for 80000 episodes, the test win rate reaches up to 90% for some seeds while remaining zero for others. We speculate that this is because it is harder to explore good strategies on difficult maps, which indicates that effective exploration is an important research problem.

5 Conclusion and Future Works

In this paper, we presented a new decomposed multi-agent soft actor-critic method (mSAC) that incorporates value function decomposition, soft policy iteration, and an (optional) counterfactual advantage function; it supports efficient off-policy learning and partially addresses the credit assignment problem. mSAC learns a distributional policy for each agent simultaneously, which acts as an implicit form of guided distributional exploration; the experimental results show that this is especially important on tasks with large action spaces.

In addition, we empirically investigated the performance of mSAC and its variants on the StarCraft II micromanagement cooperative multi-agent benchmark. Experimental results demonstrate that mSAC achieves relatively stable and efficient multi-agent off-policy learning and outperforms, or is competitive with, current mainstream policy-based and value-based approaches (e.g., COMA and Qmix) on most tasks, and it achieves very good results on large action space tasks such as $2c\_vs\_64zg$ and $MMM2$.

However, in this paper we only study the effect of the counterfactual multi-agent soft actor-critic paradigm in the discrete domain; experiments in the continuous domain remain to be studied. In addition, a more solid theoretical analysis of the algorithm is needed, and how to explore more efficiently will be valuable future work.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
  • [2] Yongcan Cao, Wenwu Yu, Wei Ren, and Guanrong Chen. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427–438, 2012.
  • [3] O. Vinyals, I. Babuschkin, J. Chung, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350–354, 2019.
  • [4] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1146–1155. JMLR. org, 2017.
  • [5] Kraemer, L. and Banerjee, B. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82–94, 2016.
  • [6] Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., and Graepel, T. Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, 2017.
  • [7] Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, pages 4292–4301, 2018.
  • [8] Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [9] Jacopo Castellini, Frans A Oliehoek, Rahul Savani, and Shimon Whiteson. The representational capacity of action-value networks for multi-agent reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1862–1864. International Foundation for Autonomous Agents and Multiagent Systems, 2019.
  • [10] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017
  • [11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, ”Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning (ICML), 2018.
  • [12] Frans A Oliehoek, Christopher Amato, et al. A concise introduction to decentralized POMDPs, volume 1. Springer, 2016.
  • [13] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft Multi-Agent Challenge. arXiv preprint arXiv:1902.04043, 2019.
  • [14] Yihan Wang et al. DOP: Off-policy Multi-Agent Decomposed Policy Gradients. In International Conference on Learning Representations (ICLR), 2019.
  • [15] Scott Fujimoto et al. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • [16] Jianhao Wang et al. Towards Understanding Linear Value Decomposition in Cooperative Multi-Agent Q-learning. arXiv preprint arXiv:2006.00587, 2020.
  • [17] Cho, K.; van Merrenboer, B.; Bahdanau, D.; and Bengio, Y. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
  • [18] Yaodong Yang et al. Qatten: A General Framework for Cooperative Multiagent Reinforcement Learning. arXiv preprint arXiv:2002.03939, 2020.

APPENDIX

Table 3: StarCraftII Micromanagement Maps Parameters
Name Ally Units Enemy Units Episode Length Obs. Dim Action Dim
Easy
3m 3 Marines 3 Marines 60 30 9
8m 8 Marines 8 Marines 120 80 14
1c3s5z 1 Colossi, 3 Stalkers, 5 Zealots 1 Colossi, 3 Stalkers,5 Zealots 180 162 15
bane_vs_bane 4 Banelings,20 Zerglings 4 Banelings,20 Zerglings 200 336 30
Hard
3s5z 3 Stalkers, 5 Zealots 3 Stalkers, 5 Zealots 150 128 14
3s_vs_5z 3 Stalkers 5 Zealots 250 48 11
2c_vs_64zg 2 Colossi 64 Zerglings 400 332 70
10m_vs_11m 10 Marines 11 Marines 150 105 17
SuperHard
27m_vs_30m 27 Marines 30 Marines 180 285 36
3s5z_vs_3s6z 3 Stalkers,5 Zealots 3 Stalkers,6 Zealots 170 136 15
MMM2 1 Medivac,2 Marauders,7 Marines 1 Medivac,3 Marauders,8 Marines 180 176 18

Some Equation Proof Details

The key equation for efficiently calculating the expectation of the joint Q value from the expectations of the local Q values under the local policies is:

\mathbb{E}_{\bm{\pi}}\left[Q^{tot}(\bm{s},\bm{\tau},\bm{a})\right] = \sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]+b(\bm{s}) \quad (16)
= q^{mix}\left(\bm{s},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]\right)

The detailed proof is as follows:

\begin{aligned}
\mathbb{E}_{\bm{\pi}}\left[Q^{tot}(\bm{s},\bm{\tau},\bm{a})\right]
&=\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\,Q^{tot}(\bm{s},\bm{\tau},\bm{a}) && (17)\\
&=\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\left[\sum_{i}k^{i}(\bm{s})\,q^{i}\left(\tau^{i},a^{i}\right)+b(\bm{s})\right]\\
&=\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\sum_{i}k^{i}(\bm{s})\,q^{i}\left(\tau^{i},a^{i}\right)+\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\,b(\bm{s})\\
&=\sum_{i}\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\,k^{i}(\bm{s})\,q^{i}\left(\tau^{i},a^{i}\right)+b(\bm{s})\\
&\quad\text{(below we omit } b(\bm{s}) \text{ for simplicity)}\\
&=\sum_{i}k^{i}(\bm{s})\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\,q^{i}\left(\tau^{i},a^{i}\right)\\
&=\sum_{i}k^{i}(\bm{s})\sum_{\bm{a}}\pi^{i}(a^{i}|\tau^{i})\,\bm{\pi}^{-i}(\bm{a}^{-i}|\bm{\tau}^{-i})\,q^{i}\left(\tau^{i},a^{i}\right)\\
&=\sum_{i}k^{i}(\bm{s})\sum_{a^{i}}\pi^{i}(a^{i}|\tau^{i})\,q^{i}\left(\tau^{i},a^{i}\right)\sum_{\bm{a}^{-i}}\bm{\pi}^{-i}(\bm{a}^{-i}|\bm{\tau}^{-i})\\
&=\sum_{i}k^{i}(\bm{s})\sum_{a^{i}}\pi^{i}(a^{i}|\tau^{i})\,q^{i}\left(\tau^{i},a^{i}\right)\\
&=\sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]
\end{aligned}

\begin{aligned}
\mathbb{E}_{\bm{\pi}}\left[Q^{tot}(\bm{s},\bm{\tau},\bm{a})-\alpha\log\bm{\pi}\left(\bm{a}\mid\bm{\tau}\right)\right]
&=\sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]+b(\bm{s})+\alpha H(\bm{\pi}) && (18)\\
&=q^{mix}\left(\bm{s},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]\right)+\alpha H(\bm{\pi})
\end{aligned}

We can approximate the above quantity using the following expression:

\begin{aligned}
& q^{mix}\left(\bm{s},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)-\alpha\log\pi^{i}\left(a^{i}\mid\tau^{i}\right)\right]\right) && (19)\\
&=\sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]+\sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[-\alpha\log\pi^{i}\left(a^{i}|\tau^{i}\right)\right]+b(\bm{s})\\
&=\sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]+b(\bm{s})+\alpha\sum_{i}k^{i}(\bm{s})\,H(\pi^{i})
\end{aligned}

If we use a different mixing network for the entropy term, the equation becomes:

\begin{aligned}
& q^{mix1}\left(\bm{s},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]\right)+q^{mix2}\left(\bm{s},\mathbb{E}_{\pi^{i}}\left[-\alpha\log\pi^{i}\left(a^{i}\mid\tau^{i}\right)\right]\right) && (20)\\
&=\sum_{i}k^{i}_{1}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]+b(\bm{s})+\alpha\sum_{i}k^{i}_{2}(\bm{s})\,H(\pi^{i})
\end{aligned}