
University of Science and Technology of China, Hefei, China
{puyuan, samwang, yr0013, xinyao}@mail.ustc.edu.cn, [email protected]

Decomposed Soft Actor Critic Method for Cooperative Multi-Agent Reinforcement Learning

Yuan Pu, Shaochen Wang, Rui Yang, Xin Yao, Bin Li
Abstract

Deep reinforcement learning methods have shown great performance on many challenging cooperative multi-agent tasks. Two promising research directions are multi-agent value function decomposition and multi-agent policy gradients. In this paper, we propose a new decomposed multi-agent soft actor-critic (mSAC) method, which effectively combines the advantages of these two approaches. Its main modules are a decomposed Q network architecture, a discrete probabilistic policy, and an (optional) counterfactual advantage function. Theoretically, mSAC supports efficient off-policy learning and partially addresses the credit assignment problem in both discrete and continuous action spaces. On the StarCraft II micromanagement cooperative multi-agent benchmark, we empirically investigate the performance of mSAC against its variants and analyze the effects of the different components. Experimental results demonstrate that mSAC significantly outperforms the policy-based approach COMA and achieves competitive results with the state-of-the-art value-based approach Qmix on most tasks in terms of asymptotic performance. In addition, mSAC achieves strong results on tasks with large action spaces, such as $2c\_vs\_64zg$ and $MMM2$.

Keywords:
Deep reinforcement learning · Multi-agent · Actor-critic.

1 Introduction

Many real-world tasks can be modeled as multi-agent systems, and developing AI systems for playing multi-agent games has attracted much attention. In recent years, deep multi-agent reinforcement learning (MARL) algorithms [1] have produced impressive results in many challenging multi-agent systems, such as the coordination of autonomous vehicles [2] and the challenging StarCraft II game [3]. Perhaps the simplest way to solve a multi-agent problem is to treat everything else as part of the environment for each individual agent and to learn concurrently based on the global reward. However, this approach faces two issues [4]: (1) non-stationarity: while an agent is learning, the policies of the other agents are also changing simultaneously, so the environment dynamics are non-stationary; (2) scalability: the joint state and action spaces grow exponentially with the number of agents. To cope with these issues, most recent algorithms adopt the paradigm of centralized training with decentralized execution (CTDE) [5]: they learn a centralized critic conditioned on the joint action and observation history and execute in a decentralized manner via a local actor (value function or policy) for each individual agent.

Following the CTDE paradigm, there are two popular and promising research lines in MARL: value function decomposition and multi-agent policy gradients. Value Decomposition Networks (VDN) [6] represent the joint Q value $Q^{tot}$ as a sum of individual Q values $q^{i}$ that condition only on individual actions and observations; each decentralized policy arises simply from its local Q values $q^{i}$ (selecting actions greedily with respect to $q^{i}$). QMIX [7] subsequently employs a mixing network to estimate joint action-values as a non-linear combination of per-agent values conditioned on local observations. The representative multi-agent policy gradient method is COMA [8], which explicitly uses a counterfactual baseline to address the challenge of multi-agent credit assignment, together with a critic representation that computes the counterfactual baseline efficiently.

Recent work [16] points out that multi-agent Q-learning with linear value decomposition implicitly implements a classical multi-agent credit assignment method called counterfactual difference rewards, which draws a connection with COMA. However, value function decomposition is hard to apply in off-policy training and potentially suffers from the risk of unbounded divergence. In single-agent problems, to achieve sample efficiency and robust performance, [11] proposed the soft actor-critic (SAC) algorithm, an off-policy actor-critic RL method based on the maximum entropy reinforcement learning framework that achieves state-of-the-art performance on many challenging continuous control benchmarks.

To attain both stability and good final performance in the CTDE paradigm, it is important to effectively incorporate the soft actor-critic paradigm into multi-agent value function decomposition. Following the research line of [14], our key insight is that the expected joint Q value can be computed efficiently only when a linearity condition is satisfied, namely that the joint Q value $Q^{tot}$ is a linear mixture of the individual Q values $q^{i}$; in that case the following equation holds (a detailed proof can be found in the Appendix):

\mathbb{E}_{\bm{\pi}}\left[Q^{tot}(\bm{s},\bm{\tau},\bm{a})\right] = \sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]+b(\bm{s}) \quad (1)
= q^{mix}\left(\bm{s},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]\right) \quad (2)

where $Q^{tot}$ is represented by neural networks consisting of the agent networks $q^{i}$ and the mixing network $q^{mix}$. Note that, to make the above equation hold, our mixing network $q^{mix}$ is not a complex non-linear combination but a linear one, whose weights are produced by hyper-networks conditioned only on the global state information.
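
As an illustration, the following minimal sketch (written in PyTorch by us; all variable names are assumptions, not taken from the paper's code) shows how the expectation of a linearly mixed joint Q value in equations (1)-(2) can be computed exactly from per-agent expected Q values:

import torch

def expected_joint_q(pi_probs, local_q, w, b):
    # pi_probs: [batch, n_agents, n_actions]  per-agent policies pi^i(a | tau^i)
    # local_q:  [batch, n_agents, n_actions]  per-agent values q^i(tau^i, a)
    # w:        [batch, n_agents]             non-negative mixing weights k^i(s) from the hyper-network
    # b:        [batch, 1]                    state-dependent bias b(s)
    expected_local_q = (pi_probs * local_q).sum(dim=-1)            # E_{pi^i}[q^i], shape [batch, n_agents]
    return (w * expected_local_q).sum(dim=-1, keepdim=True) + b    # equations (1)-(2)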

Motivated by these insights, in this paper we present a novel multi-agent soft actor-critic (mSAC) method, which is based on the following assumption: the joint Q value $Q^{tot}$ is a linear mixture of the individual Q values $q^{i}$. mSAC contains three main components: a decomposed soft Q network architecture, decentralized probabilistic policies, and an (optional) counterfactual advantage function. The method effectively incorporates the ideas of soft actor-critic and multi-agent value function decomposition.

We empirically investigate the performance of mSAC and analyze the influence of its components through ablation studies on StarCraft II micromanagement cooperative multi-agent tasks. Experimental results demonstrate that mSAC significantly outperforms current advanced policy-based algorithms (e.g., COMA) and achieves comparable performance with value-based approaches (e.g., Qmix) on most tasks. In addition, mSAC achieves strong results on large action space tasks, such as $2c\_vs\_64zg$ and $MMM2$.

To sum up, here are our contributions:

  • We propose the novel mSAC method, which effectively incorporates soft actor-critic with the value function decomposition method, and investigate its practical performance on the StarCraft II cooperative multi-agent benchmark.

  • We conduct extensive performance tests of different mSAC variants to show the effects of soft value iteration, the counterfactual advantage function, and the probabilistic policy, respectively.

2 Related Works

2.1 Soft Actor-Critic

Before introducing the Soft Actor-Critic (SAC) method [11], we briefly present the deep reinforcement learning (RL) problem definition. An RL problem is often formulated as a Markov Decision Process (MDP), $\mathcal{M}=\left(\mathcal{S},\mathcal{A},p,r,\gamma\right)$. When the RL agent interacts with the environment, at each step the agent observes a state $\mathbf{s}_{t}\in\mathcal{S}$, where $\mathcal{S}$ is the state space, and chooses an action $\mathbf{a}_{t}\in\mathcal{A}$ according to the policy $\pi(\mathbf{a}_{t}|\mathbf{s}_{t})$, where $\mathcal{A}$ is the action space; the agent then receives a reward $r\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)$ and the environment transitions to the next state $\mathbf{s}_{t+1}\sim p(\mathbf{s}_{t+1}|\mathbf{s}_{t},\mathbf{a}_{t})$.

The objective of reinforcement learning is to maximize the discounted expected total reward. In the maximum entropy RL framework, however, the goal is not only to optimize the cumulative expected reward but also to maximize the expected entropy of the policy:

J(\pi)=\sum_{t=0}^{T}\mathbb{E}_{\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)\sim\rho_{\pi}}\left[r\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)+\alpha\mathcal{H}\left(\pi\left(\cdot|\mathbf{s}_{t}\right)\right)\right] \quad (3)

where $\gamma$ is the discount factor and $\rho_{\pi}\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)$ denotes the state-action marginal distribution of the trajectory induced by the policy $\pi(\mathbf{a}_{t}|\mathbf{s}_{t})$. SAC is a popular single-agent off-policy actor-critic method built on the maximum entropy reinforcement learning framework. It uses an actor-critic architecture with separate policy and value networks, an off-policy paradigm that reuses previously collected data, and entropy maximization for effective exploration. In contrast to other off-policy algorithms, SAC is quite stable and is considered a state-of-the-art baseline for a diverse range of RL problems with continuous actions.
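
For concreteness, the following minimal sketch (our own illustration, assuming a discrete-action setting with twin target Q networks, rather than the continuous-action setting of the original SAC paper) shows the entropy-augmented soft Bellman target implied by this objective:

import torch

def soft_q_target(reward, next_q1, next_q2, next_pi, alpha, gamma, done):
    # next_q1, next_q2: [batch, n_actions]  twin target Q values at s_{t+1}
    # next_pi:          [batch, n_actions]  policy probabilities pi(. | s_{t+1})
    next_q = torch.min(next_q1, next_q2)                                       # clipped double-Q target
    soft_v = (next_pi * (next_q - alpha * torch.log(next_pi + 1e-8))).sum(-1)  # E_pi[Q - alpha * log pi]
    return reward + gamma * (1.0 - done) * soft_v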

2.2 Value Function Decomposition

Value function decomposition methods, e.g., VDN [6], learn a local Q value function for each individual agent; these local Q values are then combined by a learnable mixing neural network to produce the joint Q value:

Q^{tot}(\bm{\tau},\mathbf{a})=q^{mix}\left(\bm{s},\left[q^{i}\left(\tau^{i},a^{i}\right)\right]\right) \quad (4)

In VDN, the mixing function $q^{mix}$ is a simple arithmetic summation. In QMIX, it is a non-linear monotonic factorization structure, which achieves a much richer function class while satisfying the principle of Individual-Global-Max (IGM): a global $\arg\max$ performed on $Q^{tot}$ yields the same result as a set of individual $\arg\max$ operations performed on each local $q^{i}$.

2.3 Multi-Agent Policy Gradients

The centralized training with decentralized execution (CTDE) paradigm has recently attracted attention for its ability to address non-stationarity. Learning a centralized critic with decentralized actors (CCDA) is an efficient approach that exploits the CTDE paradigm; COMA and MADDPG are two representative examples.

COMA uses a centralised critic to estimate the Q function and decentralised actors to optimise the agents’ policies. To address the challenge of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent’s action while keeping the other agents’ actions fixed. In addition, COMA uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. It updates stochastic policies using the gradient:

g=\mathbb{E}_{\bm{\pi}}\left[\sum_{i}\nabla_{\theta_{i}}\log\pi^{i}\left(a^{i}\mid\tau^{i}\right)A^{i}(\bm{\tau},\bm{a})\right] \quad (5)

where,

A^{i}(\bm{\tau},\bm{a})=Q^{tot}(\bm{\tau},\bm{a})-\sum_{a^{\prime,i}}\pi^{i}\left(a^{\prime,i}\mid\tau^{i}\right)Q^{tot}\left(\bm{\tau},\left(\bm{a}^{-i},a^{\prime,i}\right)\right)

is the counterfactual advantage, where $\bm{a}^{-i}$ denotes the joint action of all agents other than agent $i$.
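
As a sketch (shapes and names are our assumptions, not COMA's released code), the counterfactual advantage can be computed from the critic's Q values for every candidate action of agent $i$, with the other agents' actions held fixed:

import torch

def counterfactual_advantage(q_i, pi_i, action_i):
    # q_i:      [batch, n_actions]  Q^tot(tau, (a^{-i}, a')) for each candidate action a' of agent i
    # pi_i:     [batch, n_actions]  agent i's policy pi^i(a' | tau^i)
    # action_i: [batch] (long)      the action agent i actually took
    q_taken = q_i.gather(1, action_i.unsqueeze(-1)).squeeze(-1)   # Q^tot(tau, a)
    baseline = (pi_i * q_i).sum(dim=-1)                           # sum_{a'} pi^i(a') Q^tot(tau, (a^{-i}, a'))
    return q_taken - baseline                                     # A^i(tau, a)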

MADDPG [10] is an adaptation of actor-critic methods that learns deterministic policies in continuous action spaces, takes the policies of other agents into account, and can successfully learn policies that require complex multi-agent coordination.

3 Methods

In this section, we first introduce the definition and notation of the decentralized partially observable Markov decision process (Dec-POMDP) and then present the multi-agent decomposed policy gradient architecture. Afterwards, we describe three variants: the multi-agent soft actor-critic (mSAC) method, the multi-agent counterfactual soft actor-critic (mCSAC) method, and the multi-agent counterfactual actor-critic (mCAC) method.

3.1 Problem Formulation

Fully cooperative multi-agent tasks can be modelled as a decentralized partially observable Markov decision process (Dec-POMDP) [12]

$G=\langle S,A,P,R,\Omega,O,n,\gamma\rangle$

where $s\in S$ is the global state and $o\in\Omega$ is a local observation. At each time step, each agent $i$ receives an observation $o^{i}$ drawn according to the observation function $O(s,i)$ and selects an action $a^{i}\in A^{i}$, forming a joint action $\bm{a}\in\bm{A}\equiv A^{n}$; the environment then transitions to the next state $s^{\prime}$ according to the transition function $P\left(s^{\prime}\mid s,\bm{a}\right)$ and all agents receive a shared reward $r=R(s,\bm{a})$. Each agent learns a policy $\pi^{i}\left(a^{i}\mid\tau^{i};\theta_{i}\right)$, parameterized by $\theta_{i}$ and conditioned on the local observation-action history $\tau^{i}\in\mathrm{T}\equiv(\Omega\times A)^{*}$. The joint policy $\bm{\pi}$, with parameters $\theta=\left\langle\theta_{1},\cdots,\theta_{n}\right\rangle$, induces the joint Q function: $Q_{\pi}^{tot}(\bm{\tau},\bm{a})=\mathbb{E}_{s_{0:\infty},\bm{a}_{0:\infty}}\left[\sum_{t=0}^{\infty}\gamma^{t}R\left(s_{t},\bm{a}_{t}\right)\mid s_{0}=s,\bm{a}_{0}=\bm{a},\bm{\pi}\right]$.

3.2 Multi-Agent Decomposed Policy Gradient Architecture

In this part, we first present the common multi-agent decomposed policy gradient architecture for all the algorithm variants we will introduce in the next three subsections.

We use function approximators (neural networks) for both the centralized critic (the Q function) and the decentralized actors (the policies), and alternate between optimizing both networks with stochastic gradient descent. We consider a parameterized Q function $Q_{\phi}(s_{t},\tau_{t},a_{t})$ and a tractable policy $\pi_{\theta}(a_{t}|\tau_{t})$, where $\phi$ and $\theta$ denote the parameters of the Q networks and the policy networks, respectively.

Policy Network Also called the decentralized actor. For simplicity, our decentralized actor (policy) network has the same structure as agent $i$'s local Q network, except that a $clamp(-5,2)$ operation and a $softmax$ layer are added after the local Q network. At the beginning of training, improper initialization of the policy network can make the policy distribution too sharp and thereby constrain exploration; empirically, we found that this clamp operation relieves the issue and accelerates training. The softmax layer converts the logits into a categorical distribution. The policy network parameters are shared among all agents, and different agents are distinguished by a one-hot identity vector, in order to be consistent with Qmix and to allow a fair comparison.
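
A minimal sketch of such an actor is shown below (our own PyTorch illustration; the layer sizes and names are assumptions, not the paper's exact code):

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, input_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)    # recurrent layer over the trajectory
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, h):
        x = torch.relu(self.fc1(obs))
        h = self.rnn(x, h)
        logits = self.fc2(h).clamp(-5, 2)                # keeps early logits from becoming too sharp
        return torch.softmax(logits, dim=-1), h          # categorical action probabilities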

Value Network The centralized critic's network structure is modified from Qmix's Q network and is comprised of agent $i$'s local Q network $q^{i}$ and a mixing network $q^{mix}$, whose weights and biases are produced by separate hyper-networks. Figure 1 illustrates the detailed structure of the local Q network and the mixing network. For each agent $i$, there is one local Q network that represents its local Q value function $q^{i}(\tau^{i},a^{i})$. We implement the local Q networks as GRUs [17] that receive the current individual observation $o^{i}_{t}$ and the last action $a^{i}_{t-1}$ as input at each time step. The mixing network is a feed-forward neural network that takes the agents' local Q values as input and mixes them linearly, followed by an absolute-value activation, producing the value of $Q^{tot}$, as shown in Figure 1. To make equation (1) hold, the weights and biases of the mixing network are restricted to be linear functions of $s$ and are produced by separate hyper-networks, as in Qmix, which allows us to calculate the expected Q values efficiently.
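
The sketch below illustrates this linear, state-conditioned mixing (again an assumed PyTorch implementation, not the paper's code); the absolute value keeps the mixing weights non-negative:

import torch
import torch.nn as nn

class LinearMixer(nn.Module):
    def __init__(self, n_agents, state_dim):
        super().__init__()
        self.hyper_w = nn.Linear(state_dim, n_agents)    # hyper-network producing the mixing weights
        self.hyper_b = nn.Linear(state_dim, 1)           # hyper-network producing the bias

    def forward(self, local_qs, state):
        # local_qs: [batch, n_agents]  either q^i(tau^i, a^i) or their expectations under pi^i
        w = torch.abs(self.hyper_w(state))               # k^i(s) >= 0
        b = self.hyper_b(state)                          # b(s)
        return (w * local_qs).sum(dim=-1, keepdim=True) + b   # Q^tot = sum_i k^i(s) q^i + b(s)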

Figure 1: Left: mixing network structure; the red parts are the hyper-networks that produce the weights and biases of the mixing network layers. Middle: the overall Qmix architecture. Right: an agent's local Q network (in green); $i$ denotes the one-hot identity vector used to distinguish different agents.

3.3 Multi-Agent Soft Actor-Critic (mSAC)

Algorithm 1 mSAC

Initialize the network parameters $\theta,\phi_{1,2},\bar{\phi}_{1,2}$ and the replay buffer $\mathcal{D}$ used for training the policy and value networks; set the maximum number of training episodes $M$ and the replay buffer size.

1:  for episode = 1 to M do
2:     for each agent i, observe the global state s and its individual observation $o^{i}$
3:     for t = 1 to max-episode-length do
4:        for each agent i, select action $\mathbf{a}_{t}^{i}$ according to the current policy $\pi_{\theta}(\mathbf{a}_{t}^{i}|\tau_{t}^{i})$
5:        Execute the joint action $\mathbf{a}=(a^{1},a^{2},\ldots,a^{N})$, obtain the global reward $r$, and observe the next global state $s^{\prime}$
6:        Add the experience $\left(\mathbf{s}_{t},o^{i}_{t},\mathbf{a}_{t},r_{t},\mathbf{s}_{t+1},o^{i}_{t+1}\right)$ to the replay buffer $\mathcal{D}$
7:        for each RL training step do
8:           Sample a random minibatch of $B$ episodes uniformly from $\mathcal{D}$
9:           Update the critic networks according to equation (6): $\phi_{i}\leftarrow\phi_{i}-\alpha_{Q}\hat{\nabla}_{\phi_{i}}J_{Q}\left(\phi_{i}\right)$, for $i\in\{1,2\}$
10:           Update the actor network according to equation (9): $\theta\leftarrow\theta-\alpha_{\pi}\hat{\nabla}_{\theta}J_{\pi}(\theta)$
11:           Update the temperature $\alpha$ according to equation (11); update the target value network parameters for each $i\in\{1,2\}$: $\bar{\phi}_{i}\leftarrow\tau\phi_{i}+(1-\tau)\bar{\phi}_{i}$
12:        end for
13:     end for
14:  end for
15:  return $Q_{\phi},\pi_{\theta}$

Before introducing the counterfactual multi-agent soft actor-critic method, we first present the multi-agent soft actor-critic method (referred to as mSAC), which adopts the practical approximation to soft policy iteration as in [11].

Similar to [11], the critic loss function of mSAC in the multi-agent setting is

\mathcal{L}(\phi)=\mathbb{E}_{\mathcal{D}}\left[\left(r_{t}+\gamma\min_{j\in\{1,2\}}\hat{Q}_{\phi_{j}^{\prime}}^{targ}-Q^{tot}_{\phi}\left(\bm{s}_{t},\bm{\tau}_{t},\bm{a}_{t}\right)\right)^{2}\right] \quad (6)

Following the original soft actor-critic algorithm, our method maintains two soft Q-value networks $Q^{tot}_{\phi_{j}}\left(\bm{s},\bm{\tau},\bm{a}\right)$, $j\in\{1,2\}$, and takes the minimum of the two as the target. In equations (7)-(8),

\hat{Q}_{\phi_{j}^{\prime}}^{targ} = \mathbb{E}_{\bm{\pi}_{\theta}}\left[Q^{tot}_{\phi_{j}^{\prime}}\left(\bm{s}_{t+1},\bm{\tau}_{t+1},\bm{a}_{t+1}\right)-\alpha\log\bm{\pi}\left(\bm{a}_{t+1}\mid\bm{\tau}_{t+1}\right)\right] \quad (7)
= q^{mix}\left(\bm{s}_{t+1},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i}_{t+1},a^{\prime,i}_{t+1}\right)-\alpha\log\pi^{i}\left(a^{\prime,i}_{t+1}\mid\tau^{i}_{t+1}\right)\right]\right) \quad (8)

where $\mathcal{D}$ is a replay buffer containing previously sampled transitions (states, local observations, actions, rewards, next states, next local observations), and $Q^{tot}_{\phi_{j}^{\prime}}$ is the target Q network, whose parameters $\phi_{j}^{\prime}$ are obtained as an exponentially moving average of the current Q network weights $\phi_{j}$, which has been shown to stabilize training.

Note that, in equation (8), $a^{\prime,i}_{t+1}$ is sampled from agent $i$'s current policy $\pi^{i}$ rather than from the replay buffer. Compared with Qmix, we add a policy network that outputs a probabilistic policy (a probability mass function in the discrete domain), which gives exactly the probability with which each agent selects each discrete action; therefore we can compute the expectation exactly.
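
A minimal sketch of this exact expectation (shapes assumed; our illustration) is:

import torch

def expected_soft_local_q(next_local_q, next_pi, alpha):
    # next_local_q: [batch, n_agents, n_actions]  q^i(tau^i_{t+1}, a')
    # next_pi:      [batch, n_agents, n_actions]  pi^i(a' | tau^i_{t+1})
    log_pi = torch.log(next_pi + 1e-8)
    return (next_pi * (next_local_q - alpha * log_pi)).sum(dim=-1)   # [batch, n_agents]

The resulting per-agent values are then fed, together with $\bm{s}_{t+1}$, into the mixing network $q^{mix}$ to form the target in equation (8).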

Recent work theoretically proved that soft (Boltzmann) policy iteration is guaranteed to improve the policy and converges to the optimal policy. Derived from the soft policy iteration procedure, the objective for the policy update is:

\mathcal{L}(\theta) = \mathbb{E}_{\mathcal{D}}\left[\alpha\log\bm{\pi}\left(\bm{a}_{t}\mid\bm{\tau}_{t}\right)-Q^{tot}_{\phi^{\prime}}\left(\bm{s}_{t},\bm{\tau}_{t},\bm{a}_{t}\right)\right] \quad (9)
= q^{mix}\left(\bm{s}_{t},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i}_{t},a^{i}_{t}\right)-\alpha\log\pi^{i}\left(a^{i}_{t}\mid\tau^{i}_{t}\right)\right]\right) \quad (10)

Here $\alpha$ is a hyper-parameter that controls the trade-off between maximizing the policy entropy and maximizing the expected discounted return.
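
A minimal sketch of this policy objective (our illustration; we assume access to the detached, non-negative mixing weights $k^{i}(s)$, and drop the state-dependent bias, which does not affect the policy gradient):

import torch

def policy_loss(local_q, pi, w, alpha):
    # local_q: [batch, n_agents, n_actions]  q^i(tau^i_t, a), detached from the critic graph
    # pi:      [batch, n_agents, n_actions]  current categorical policies pi^i(a | tau^i_t)
    # w:       [batch, n_agents]             detached non-negative mixing weights k^i(s)
    log_pi = torch.log(pi + 1e-8)
    per_agent = (pi * (alpha * log_pi - local_q)).sum(dim=-1)   # E_{pi^i}[alpha * log pi^i - q^i]
    return (w * per_agent).sum(dim=-1).mean()                   # state-weighted sum over agents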

However, $\alpha$ needs to be set to different values at different stages of training and on different tasks, because the degree of exploration needed differs across states. In some states a good policy has already been learned, and the corresponding $\alpha$ should be reduced to a very small value to weaken exploration; in other states it is unclear which actions are good, so exploration should be increased. The SAC algorithm therefore recasts the original soft policy iteration as a constrained optimization problem: while optimizing the policy to maximize the cumulative discounted return, the algorithm keeps the average policy entropy at a fixed target value (usually $-|A|$), while the action entropy in different states may vary. Specifically, $\alpha$ is updated automatically by optimizing the following loss [11]:

L(\alpha)=\mathbb{E}_{\mathbf{a}_{t}\sim\pi_{t}}\left[-\alpha\log\pi_{t}\left(\mathbf{a}_{t}\mid\tau_{t}\right)-\alpha\overline{\mathcal{H}}\right] \quad (11)
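
A minimal sketch of this temperature update for the discrete case (our illustration; parameterizing the temperature through $\log\alpha$ is an implementation choice we assume, not one stated in the paper):

import torch

def alpha_loss(log_alpha, pi, target_entropy):
    # pi: [batch, n_actions] current categorical policy; target_entropy: scalar target (e.g., -|A|)
    log_pi = torch.log(pi + 1e-8)
    entropy = -(pi * log_pi).sum(dim=-1)                                     # H(pi(. | tau_t))
    return (log_alpha.exp() * (entropy - target_entropy).detach()).mean()    # alpha rises when entropy is below the target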

The details of the mSAC algorithm are summarized in Algorithm 1.

3.4 Multi-Agent Counterfactual Soft Actor-Critic (mCSAC)

One of the most important problems in multi-agent reinforcement learning is credit assignment. To partially address this issue, we adopt the insight of COMA and use a counterfactual advantage function when optimizing the individual policies in the multi-agent decomposed policy gradient paradigm. The policy loss function of the multi-agent counterfactual soft actor-critic (mCSAC) method is as follows:

\mathbb{E}_{\left(\bm{s}_{t},\bm{\tau}_{t},r_{t}\right)\sim\mathcal{D},\,\bm{a}_{t}\sim\bm{\pi}_{\theta}}\left[\log\pi^{i}\left(a^{i}_{t}\mid\tau^{i}_{t}\right)A^{i}(s_{t},\bm{\tau}_{t},\bm{a}_{t})\right] \quad (12)

where,

A^{i}(s_{t},\bm{\tau}_{t},\bm{a}_{t})= -\alpha\log\pi^{i}\left(a^{i}_{t}\mid\tau^{i}_{t}\right)+Q^{tot}_{\phi}\left(\bm{s}_{t},\bm{\tau}_{t},\bm{a}_{t}\right) \quad (13)
\qquad\qquad - q^{mix}\left(\bm{s}_{t},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i}_{t},a^{i,\prime}_{t}\right)\right],q^{-i}\left(\bm{\tau}^{-i}_{t},\bm{a}^{-i}_{t}\right)\right) \quad (14)

Note that on the right-hand side of the above equation, $\bm{a}_{t}=\left(a^{i}_{t},\bm{a}^{-i}_{t}\right)$, where $\bm{a}_{t}$ is the joint action at time $t$, $a^{i}_{t}$ is the local action of agent $i$, and $\bm{a}^{-i}_{t}$ is the (partial) joint action of all agents other than agent $i$. The action $a^{i,\prime}_{t}$ is sampled from the current policy $\pi^{i}$ of agent $i$, while $\bm{a}^{-i}_{t}$ is taken from the replay buffer $\mathcal{D}$. The term $q^{mix}(\cdot)$ computes the counterfactual baseline, which measures the expected action value under the individual policy of agent $i$ while keeping the actions of the other agents fixed. If the joint action value of the sample $\left(\bm{s}_{t},\bm{\tau}_{t},\bm{a}_{t}\right)$ drawn from the replay buffer is greater than this baseline, we update the policy network parameters of agent $i$ to increase the probability of $a^{i}_{t}$, and vice versa.
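
A minimal sketch of this advantage (our illustration; the shapes and the mixer interface are assumptions, and in practice the advantage would be detached before being used in the policy gradient of equation (12)):

import torch

def mcsac_advantage(local_q, pi, actions, q_tot_taken, mixer, state, alpha, i):
    # local_q:     [batch, n_agents, n_actions]  q^j(tau^j_t, .) for all agents j
    # pi:          [batch, n_agents, n_actions]  current policies pi^j(. | tau^j_t)
    # actions:     [batch, n_agents] (long)      joint actions sampled from the replay buffer
    # q_tot_taken: [batch, 1]                    Q^tot_phi(s_t, tau_t, a_t) for the sampled joint action
    q_taken = local_q.gather(-1, actions.unsqueeze(-1)).squeeze(-1)      # q^j(tau^j, a^j), [batch, n_agents]
    counterfactual = q_taken.clone()
    counterfactual[:, i] = (pi[:, i] * local_q[:, i]).sum(dim=-1)        # E_{pi^i}[q^i(tau^i, a')]
    baseline = mixer(counterfactual, state)                              # counterfactual baseline, eq. (14)
    log_pi_i = torch.log(pi[:, i].gather(-1, actions[:, i:i+1]) + 1e-8)  # log pi^i(a^i_t | tau^i_t)
    return -alpha * log_pi_i + q_tot_taken - baseline                    # A^i, eq. (13)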

3.5 Multi-Agent Counterfactual Actor-Critic (mCAC)

To probe the effect of the soft policy iteration paradigm on multi-agent policy optimization, in this section we introduce another variant of the mSAC method: the multi-agent counterfactual actor-critic (mCAC) method, which is obtained by deleting the entropy augmentation term $\alpha\log\pi$ from all loss functions of the mCSAC method. Because it no longer satisfies the conditions of the soft policy iteration paradigm, mCAC becomes an on-policy algorithm, and the capacity of the replay buffer used to update the policy and value networks is set to a small value.

The behavior policy used to collect trajectories is not the categorical distribution output by the policy network, but a strategy similar to $\epsilon$-greedy exploration. The implementation of mCAC is otherwise similar to the mSAC algorithm. Action probabilities are produced from the final layer $z$ via a bounded softmax distribution that lower-bounds the probability of any given action by $\epsilon/|A|$: $P(a)=(1-\epsilon)\,\mathrm{softmax}(z)_{a}+\epsilon/|A|$. We anneal $\epsilon$ linearly from 0.5 to 0.02 across 20000 training episodes.
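
A minimal sketch of this bounded softmax behaviour policy with linear annealing (our illustration):

import torch

def bounded_softmax(logits, episode, n_actions,
                    eps_start=0.5, eps_end=0.02, anneal_episodes=20000):
    frac = min(episode / anneal_episodes, 1.0)                 # linear annealing schedule
    eps = eps_start + frac * (eps_end - eps_start)
    probs = torch.softmax(logits, dim=-1)
    return (1.0 - eps) * probs + eps / n_actions               # every action keeps probability >= eps/|A|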

4 Experiments

In this section, we present our experimental results and analysis. First, we describe the decentralized cooperative StarCraft II micromanagement benchmark to which we apply our proposed method mSAC and the variant methods we consider. Then we compare mSAC, its ablation variants mCSAC and mCAC, the representative value decomposition algorithm Qmix, and the policy gradient algorithm COMA in the aforementioned discrete-action environments. We use the Qmix implementation from the open-source code at https://github.com/starry-sky6688/StarCraft with the same hyper-parameters as [7]. To be consistent with previous work, our implementation (https://github.com/puyuan1996/MARL) uses almost the same network architecture and hyper-parameters across all tasks. More experimental details can be found in the Appendix.

Experimental Setup We focus on the StarCraft II decentralized micromanagement tasks [13] (we use StarCraft II version SC2.4.10 in our experiments), in which each agent controls an individual army unit and all agents receive a shared global reward. We use the StarCraft Multi-Agent Challenge (SMAC) environment [13] as our API, which has become a commonly used benchmark for evaluating state-of-the-art MARL approaches such as COMA, QMIX, and other baseline algorithms. In this paper, our algorithm learns multiple agents (policies) that control the allied units to beat the enemy, while the enemy units are controlled by the built-in handcrafted heuristic AI. Two representative StarCraft II micromanagement scenarios ($3m$ and $2c\_vs\_64zg$) are shown in Figure 2.

Figure 2: Visualizations of the two representative StarCraft II micromanagement scenarios ($3m$ and $2c\_vs\_64zg$).

To compare each method's performance as fairly as possible, our Qmix implementation also uses target Q networks obtained as an exponentially moving average of the Q network weights, which differs from the hard update used in the original paper [7]. In addition, we adopt the same evaluation procedure as in [7]: for each run of a method, we pause training every 100 episodes and run 20 independent test episodes in which each agent performs greedy decentralised action selection (Qmix selects the action with the largest local Q value; the other methods select the action with the largest probability). The percentage of these episodes in which the method defeats all enemy units within the (map-specific) time limit is referred to as the test win rate.
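
A minimal sketch of this evaluation protocol (our illustration using a generic gym-style environment interface; the exact SMAC API differs, and the "battle_won" flag is an assumption):

def test_win_rate(env, select_greedy_actions, n_test_episodes=20):
    wins = 0
    for _ in range(n_test_episodes):
        obs = env.reset()
        done, won = False, False
        while not done:
            actions = select_greedy_actions(obs)        # argmax of local Q values, or of policy probabilities
            obs, reward, done, info = env.step(actions)
            won = info.get("battle_won", False)
        wins += int(won)
    return wins / n_test_episodes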

In the figures, the x-axis is in units of 100 episodes; different maps have different episode length limits according to the difficulty of the task. The shaded region indicates one quarter of a standard deviation.

Algorithm Details of Variant Methods The policy network of every agent consists of a recurrent layer of GRUs with 64-dimensional hidden states and a fully connected MLP layer before and after it. An episode ends when one team is defeated or the time step limit is reached. The mixing network of the value function consists of a single hidden layer of 64 units and does not use the ELU non-linear activation; its weights and biases are generated by an additional hyper-network composed of a single hidden layer of 64 units without the ReLU non-linear activation.

Similar to Qmix, mSAC is trained in mini-batch mode: the batch size is 32 episodes, the target smoothing coefficient used to update the two target Q networks is 0.005, and the discount factor is set to 0.95. Due to parameter sharing, all agents are processed in parallel, and the information of each agent at each time step of each episode occupies one entry of the mini-batch. Once a new episode of trajectory is added to the replay buffer, the algorithm updates the network parameters of the actor and the critic. Specifically, after collecting one episode of trajectory, 32 episodes are sampled from the replay buffer as a mini-batch to train the actor and the critic; the recurrent parts of the actor and the critic are fully unrolled over all time steps, the gradients are backpropagated, and the accumulated gradient updates are applied to the neural networks. For clarity, the hyper-parameter settings of the mSAC algorithm are summarized in Table 1.

Table 1: Hyper-parameters
Parameter Name Value
learning rate 5e-4
target smoothing coefficient (τ\tau) 0.005
discount factor 0.99
optimizer RMSprop
activation function ReLU
replay buffer size (Off-policy) 5000 episodes
replay buffer size (On-policy) 32 episodes
RL batch size 32 episodes
KL lambda automatically adapted
entropy target -dim(A) (e.g., -9 for 3m)

The learning performance of the mSAC method and its variants mCSAC and mCAC on the StarCraft II micromanagement tasks was tested separately to study the impact of off-policy updates, soft Q values, the probabilistic policy, the counterfactual advantage function, and other modules on the multi-agent policy gradient algorithm. On all maps, all algorithms use reward standardization for stability:

r_{standard}=10\cdot(r-r_{mean})/(r_{std}+10^{-6}) \quad (15)
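
A minimal sketch of this standardization (our illustration; whether the mean and standard deviation are computed per batch or as running statistics is an assumption, not specified in the paper):

import numpy as np

def standardize_rewards(rewards):
    # rewards: 1-D array of raw global rewards from a sampled batch of episodes
    rewards = np.asarray(rewards, dtype=np.float64)
    return 10.0 * (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # eq. (15)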

For clarity, we briefly outline the key differences between the variant methods in Table 2.

Table 2: Comparison of Variant Methods
method on/off policy buffer size counterfactual advantage function soft Q values
mSAC off-policy 5000 episodes no yes
mCSAC off-policy 5000 episodes yes yes
mCAC on-policy 32 episodes yes no

Experimental Results

Figure 3: The performance curves for mCSAC, mSAC, mCAC, QMIX, and COMA on StarCraft II micromanagement combat maps with homogeneous agents: (a) $8m$, (b) $3m$.
Figure 4: The performance curves for mCSAC, mSAC, mCAC, QMIX, and COMA on StarCraft II micromanagement combat maps with heterogeneous agents: (a) $1c3s5z$, (b) $3s5z$, (c) $3s\_vs\_5z$. On $3s5z$ and $3s\_vs\_5z$, COMA achieves a zero test win rate according to the experimental results in [18], so we do not plot COMA curves in these graphs.
Figure 5: The performance curves for mSAC and QMIX on large action space tasks: (a) $2c\_vs\_64zg$, (b) $MMM2$, (c) $bane\_vs\_bane$.

In this part, we compare the performance of our method mSAC, its variants mCSAC and mCAC, and the advanced Qmix and COMA methods on different maps, including maps with homogeneous agents ($3m$, $8m$), maps with heterogeneous agents ($1c3s5z$, $3s5z$, $3s\_vs\_5z$), and maps with a large action space ($2c\_vs\_64zg$, $MMM2$, $bane\_vs\_bane$, $27m\_vs\_30m$). The test win rate learning curves are shown in Figures 3, 4, and 5, respectively. From the experimental results, we find that mSAC performs similarly to the policy-based method COMA on the maps with homogeneous agents and is significantly better than COMA on the other maps. On the maps $8m$, $1c3s5z$, $3m$, and $3s5z$, it has asymptotic performance similar to the value-based method Qmix. On the maps with a large action space ($2c\_vs\_64zg$, $MMM2$, $bane\_vs\_bane$), its performance is significantly better than Qmix. On some relatively difficult tasks, such as $3s\_vs\_5z$, all policy-based methods perform worse than Qmix, but not by much. After carefully analyzing the experimental results, we draw the following observations and conclusions:

1. The soft policy iteration paradigm is also effective in multi-agent scenarios. From the results on all maps, we find that mCAC performs worse than the other methods in both stability and asymptotic performance, which indicates that soft policy iteration is usually beneficial to robust policy improvement in the multi-agent policy gradient setting. We conjecture that this is because simultaneously maximizing expected return and entropy lets the agents explore more widely and efficiently and capture multiple modes of near-optimal behaviour.

2. It is important to jointly optimize the entire policy distribution on tasks where the agent has a relatively large action space. For example, on the map $2c\_vs\_64zg$, the Colossus units have a large action space ($|A|=70$). In Qmix, all agents execute in a decentralized manner: each agent greedily selects actions based on its local action value function, so in a given state only one action maximizes the local action value and the remaining actions share the same small exploration probability. At a given time step on $2c\_vs\_64zg$, the probability of an allied unit attacking one specific enemy among the 64 enemy units is very high, while the probabilities of attacking the other 63 units are all equally small. In mSAC, each agent instead executes actions according to its own policy, i.e., the learned categorical distribution, and can cover different areas of the action space in a planned way. By jointly optimizing the entire probability distribution to maximize the sum of expected return and policy entropy, mSAC is intuitively more reasonable and effective than Qmix's $\epsilon$-greedy exploration on tasks with large action spaces.

3. Counterfactual advantage functions are not always effective and matter more in relatively complex tasks. On easy maps such as $8m$, $1c3s5z$, and $3s5z$, the performance of mCSAC and mSAC is similar, but on harder maps such as $3s\_vs\_5z$ the performance gap between mCSAC and mSAC is larger. This indicates that attributing the global reward is critical for solving harder tasks; the counterfactual advantage function partially addresses this issue and can gradually learn a reasonable credit assignment during training on some tasks, but it is not always effective. Moreover, we carefully analyzed the performance of each seed and found that after training for 80000 episodes, the test win rate reaches up to 90% for some seeds while remaining zero for others. We speculate that this is because it is harder to explore good strategies on difficult maps, which indicates that effective exploration is an important research problem.

5 Conclusion and Future Works

In this paper, we presented a new decomposed multi-agent soft actor-critic method (mSAC) that incorporates value function decomposition, soft policy iteration, and an (optional) counterfactual advantage function; it supports efficient off-policy learning and partially addresses the credit assignment problem. mSAC learns a distributional policy for each agent simultaneously, which acts as an implicit form of guided distributional exploration; the experimental results show that this is especially important on tasks with large action spaces.

In addition, we empirically investigated the performance of mSAC and its variants on the StarCraft II micromanagement cooperative multi-agent benchmark. Experimental results demonstrate that mSAC achieves relatively stable and efficient multi-agent off-policy learning and outperforms, or is competitive with, current mainstream policy-based and value-based approaches (e.g., COMA and Qmix) on most tasks, and it achieves very good results on large action space tasks such as $2c\_vs\_64zg$ and $MMM2$.

However, in this paper we only study the effect of the counterfactual multi-agent soft actor-critic paradigm in the discrete domain; experiments in the continuous domain remain to be studied. In addition, a more solid theoretical analysis of the algorithm is needed, and how to explore more efficiently will be valuable future work.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
  • [2] Yongcan Cao, Wenwu Yu, Wei Ren, and Guanrong Chen. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427–438, 2012.
  • [3] O. Vinyals, I. Babuschkin, J. Chung, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350–354, 2019.
  • [4] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1146–1155. JMLR. org, 2017.
  • [5] Kraemer, L. and Banerjee, B. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82–94, 2016.
  • [6] Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., and Graepel, T. Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, 2017.
  • [7] Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, pages 4292–4301, 2018.
  • [8] Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [9] Jacopo Castellini, Frans A Oliehoek, Rahul Savani, and Shimon Whiteson. The representational capacity of action-value networks for multi-agent reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1862–1864. International Foundation for Autonomous Agents and Multiagent Systems, 2019.
  • [10] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017
  • [11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, ”Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning (ICML), 2018.
  • [12] Frans A Oliehoek, Christopher Amato, et al. A concise introduction to decentralized POMDPs, volume 1. Springer, 2016.
  • [13] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft Multi-Agent Challenge. arXiv preprint arXiv:1902.04043, 2019.
  • [14] Yihan Wang et al. DOP: Off-policy Multi-Agent Decomposed Policy Gradients. In International Conference on Learning Representations (ICLR), 2019.
  • [15] Scott Fujimoto et al. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • [16] Jianhao Wang et al. Towards Understanding Linear Value Decomposition in Cooperative Multi-Agent Q-learning. arXiv preprint arXiv:2006.00587, 2020.
  • [17] Cho, K.; van Merrenboer, B.; Bahdanau, D.; and Bengio, Y. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
  • [18] Yaodong Yang et al. Qatten: A General Framework for Cooperative Multiagent Reinforcement Learning. arXiv preprint arXiv:2002.03939, 2020.

APPENDIX

Table 3: StarCraftII Micromanagement Maps Parameters
Name Ally Units Enemy Units Episode Length Obs. Dim Action Dim
Easy
3m 3 Marines 3 Marines 60 30 9
8m 8 Marines 8 Marines 120 80 14
1c3s5z 1 Colossi, 3 Stalkers, 5 Zealots 1 Colossi, 3 Stalkers,5 Zealots 180 162 15
bane_vs_bane 4 Banelings,20 Zerglings 4 Banelings,20 Zerglings 200 336 30
Hard
3s5z 3 Stalkers, 5 Zealots 3 Stalkers, 5 Zealots 150 128 14
3s_vs_5z 3 Stalkers 5 Zealots 250 48 11
2c_vs_64zg 2 Colossi 64 Zerglings 400 332 70
10m_vs_11m 10 Marines 11 Marines 150 105 17
SuperHard
27m_vs_30m 27 Marines 30 Marines 180 285 36
3s5z_vs_3s6z 3 Stalkers,5 Zealots 3 Stalkers,6 Zealots 170 136 15
MMM2 1 Medivac,2 Marauders,7 Marines 1 Medivac,3 Marauders,8 Marines 180 176 18

Some Equation Proof Details

The key equation for efficiently calculating the expectation of the joint Q value from the expectations of the local Q values under the local policies is:

\mathbb{E}_{\bm{\pi}}\left[Q^{tot}(\bm{s},\bm{\tau},\bm{a})\right] = \sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]+b(\bm{s}) \quad (16)
= q^{mix}\left(\bm{s},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]\right)

The detailed proof is as follows:

\begin{aligned}
\mathbb{E}_{\bm{\pi}}\left[Q^{tot}(\bm{s},\bm{\tau},\bm{a})\right]
&=\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\,Q^{tot}(\bm{s},\bm{\tau},\bm{a}) && (17)\\
&=\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\left[\sum_{i}k^{i}(\bm{s})\,q^{i}\left(\tau^{i},a^{i}\right)+b(\bm{s})\right]\\
&=\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\sum_{i}k^{i}(\bm{s})\,q^{i}\left(\tau^{i},a^{i}\right)+\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\,b(\bm{s})\\
&=\sum_{i}\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\,k^{i}(\bm{s})\,q^{i}\left(\tau^{i},a^{i}\right)+b(\bm{s})\\
&\quad\text{(below we omit } b(\bm{s}) \text{ for simplicity)}\\
&=\sum_{i}k^{i}(\bm{s})\sum_{\bm{a}}\bm{\pi}(\bm{a}|\bm{\tau})\,q^{i}\left(\tau^{i},a^{i}\right)\\
&=\sum_{i}k^{i}(\bm{s})\sum_{\bm{a}}\pi^{i}(a^{i}|\tau^{i})\,\bm{\pi}^{-i}(\bm{a}^{-i}|\bm{\tau}^{-i})\,q^{i}\left(\tau^{i},a^{i}\right)\\
&=\sum_{i}k^{i}(\bm{s})\sum_{a^{i}}\pi^{i}(a^{i}|\tau^{i})\,q^{i}\left(\tau^{i},a^{i}\right)\sum_{\bm{a}^{-i}}\bm{\pi}^{-i}(\bm{a}^{-i}|\bm{\tau}^{-i})\\
&=\sum_{i}k^{i}(\bm{s})\sum_{a^{i}}\pi^{i}(a^{i}|\tau^{i})\,q^{i}\left(\tau^{i},a^{i}\right)\\
&=\sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]
\end{aligned}

\begin{aligned}
\mathbb{E}_{\bm{\pi}}\left[Q^{tot}(\bm{s},\bm{\tau},\bm{a})-\alpha\log\bm{\pi}\left(\bm{a}\mid\bm{\tau}\right)\right]
&=\sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]+b(\bm{s})+\alpha H(\bm{\pi}) && (18)\\
&=q^{mix}\left(\bm{s},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]\right)+\alpha H(\bm{\pi})
\end{aligned}

We can approximate the above quantity using the following expression:

\begin{aligned}
& q^{mix}\left(\bm{s},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)-\alpha\log\pi^{i}\left(a^{i}\mid\tau^{i}\right)\right]\right) && (19)\\
&=\sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]+\sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[-\alpha\log\pi^{i}\left(a^{i}|\tau^{i}\right)\right]+b(\bm{s})\\
&=\sum_{i}k^{i}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]+b(\bm{s})+\alpha\sum_{i}k^{i}(\bm{s})\,H(\pi^{i})
\end{aligned}

If we use a different mixing network for the entropy term, the equation becomes:

\begin{aligned}
& q^{mix1}\left(\bm{s},\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]\right)+q^{mix2}\left(\bm{s},\mathbb{E}_{\pi^{i}}\left[-\alpha\log\pi^{i}\left(a^{i}\mid\tau^{i}\right)\right]\right) && (20)\\
&=\sum_{i}k^{i}_{1}(\bm{s})\,\mathbb{E}_{\pi^{i}}\left[q^{i}\left(\tau^{i},a^{i}\right)\right]+b(\bm{s})+\alpha\sum_{i}k^{i}_{2}(\bm{s})\,H(\pi^{i})
\end{aligned}