Mimicking To Dominate: Imitation Learning Strategies for Success in Multiagent Competitive Games

Viet The Bui¹, Tien Mai¹, Thanh H. Nguyen²
Abstract

Training agents in multi-agent competitive games presents significant challenges due to their intricate nature. These challenges are exacerbated by dynamics influenced not only by the environment but also by opponents' strategies. Existing methods often struggle with slow convergence and instability. To address this, we harness the potential of imitation learning to comprehend and anticipate opponents' behavior, aiming to mitigate uncertainties with respect to the game dynamics. Our key contributions include: (i) a new multi-agent imitation learning model for predicting the opponents' next moves, which works with hidden opponent actions and local observations; (ii) a new multi-agent reinforcement learning algorithm that combines our imitation learning model and policy training into a single training process; and (iii) extensive experiments in three challenging game environments, including an advanced version of the StarCraft multi-agent challenge (i.e., SMACv2). Experimental results show that our approach achieves superior performance compared to existing state-of-the-art multi-agent RL algorithms.

Introduction

Recent works in multi-agent reinforcement learning (MARL) have made significant progress in developing new effective MARL algorithms that can perform well in complex multi-agent environments including SMAC (StarCraft Multi-Agent Challenge) (yu2022surprising; samvelyan2019starcraft). Among these works, centralized training and decentralized execution (CTDE) (foerster2018counterfactual) has attracted significant attention from the RL research community due to its advantage of leveraging global information to train a centralized critic (a.k.a. actor-critic methods (lowe2017multi)) or a joint Q-function (a.k.a. value-decomposition methods (rashid2020monotonic; sunehag2017value)). This approach enables a more efficient and stable learning process while still allowing agents to act in a decentralized manner. Under this CTDE framework, off-policy methods such as MADDPG (lowe2017multi) and QMIX (rashid2020monotonic) have become very popular as a result of their data efficiency and state-of-the-art (SOTA) results on a wide range of benchmarks.

On the other hand, on-policy gradient methods have been under-explored so far in MARL due to their data consumption and the difficulty of transferring knowledge from single-agent to multi-agent settings. However, a more recent work (yu2022surprising) shows that on-policy methods such as MAPPO (a multi-agent version of proximal policy optimization) outperform all other SOTA methods including MADDPG and QMIX in various multi-agent benchmarks, and especially work extremely well in complex SMAC settings. (yu2022surprising) show that with proper choices of various hyper-parameters, including value normalization, clipping ratio, and batch size, policy gradient methods can achieve both data efficiency and high return for agents. Motivated by these promising results of MAPPO, our work focuses on improving the performance of policy gradient methods in MARL.

We consider a multi-agent environment in which agents form an alliance to play against a team of opponents in a partially observable MDP, where allied agents have to make decisions independently without communicating with other members. Unlike the leading method, MAPPO, which focuses on the investigation of hyper-parameter choices for PPO to perform well in the MARL setting, we attempt to enhance the performance of PPO in MARL with the introduction of a novel opponent-behavior-prediction component. This new component is incorporated into the MAPPO framework to enhance the policy learning of allied agents in the MARL setting. A key challenge in our problem setting is that agents in the alliance are unaware of actions taken by their opponents. Moreover, each allied agent only has local observations of opponents located in the agent's current neighborhood; the locations and neighborhoods of all players are not fixed but change over time depending on the actions taken by players and the dynamics of the environment. Lastly, learning the behavior of the opponents occurs during the policy learning process of the allied agents. The inter-dependency between these two learning components makes the entire learning process significantly challenging.

We address the aforementioned challenges while providing the following key contributions. First, we convert the problem of opponent behavior learning into predicting the next states of the opponents. This way, we do not have to directly predict the behavior of the opponents (whose actions are not observable); instead, we use the next-state prediction as an indirect indication of their behavior. We then cast the problem of opponent next-state prediction as a new multi-agent imitation learning (IL) problem. We propose a new multi-agent IL algorithm, which is an adaptation of IQ-Learn (garg2021iq) (a SOTA imitation learning algorithm), that only considers local opponent-state-only observations. In particular, instead of imitating the opponents' policy, our IL algorithm targets the prediction of the next states of neighboring opponents. Second, we provide a comprehensive theoretical analysis that bounds the impact of the changing policy of the allied agents (as a result of the policy learning process) on our imitation learning outcomes.

Third, we present a unified MARL algorithmic framework in which we incorporate our behavior learning component into MAPPO. The idea is that we can combine each allied agent’s local observations with the next-state prediction of neighboring opponents of that agent, creating an augmented input based on which to improve the decision making of the allied agent at every state. This novel integration results in a new MARL algorithm, which we name Imitation-enhanced Multi-Agent EXtended PPO (IMAX-PPO).

Finally, we conduct extensive experiments in several benchmarks ranging from complex to simple ones, including: SMACv2 (an advanced version of the StarCraft multi-agent challenge) (ellis2022smacv2), Google research football (GRF) (kurach2020google), and Gold Miner (Miner). Our empirical results show that our new algorithm consistently and significantly outperforms SOTA algorithms (i.e., QMIX and MAPPO) across all these benchmarks.

Related Work

MARL. The literature on MARL includes centralized and decentralized algorithms. While centralized algorithms (claus1998dynamics) learn a single joint policy to produce joint actions of all the agents, decentralized learning (littman1994markov) optimizes each agent's local policy independently. There are also algorithms based on centralized training and decentralized execution (CTDE). For example, methods in (lowe2017multi; foerster2018counterfactual) adopt actor-critic structures and learn a centralized critic that takes global information as input. Value-decomposition (VD) is a class of methods that represent the joint Q-function as a function of agents' local Q-functions (sunehag2017value; rashid2020monotonic). On the other hand, the use of policy-gradient methods, such as PPO (schulman2017proximal), has been investigated in multi-agent RL. For example, (de2020independent) propose independent PPO (IPPO), a decentralized MARL method that can achieve high success rates in several hard SMAC maps. IPPO is, however, overall worse than QMIX (rashid2020monotonic), a VD method. Recently, (yu2022surprising) developed MAPPO, a PPO-based MARL algorithm that outperforms QMIX on popular multi-agent game environments such as SMAC (samvelyan2019starcraft) and GRF (kurach2020google). To the best of our knowledge, MAPPO is currently a SOTA algorithm for MARL. In this work, we integrate a new IL-based behavior prediction model into MAPPO, resulting in a new MARL algorithm that outperforms SOTA MARL methods on various challenging game tasks.

Imitation Learning (IL). In this study, we employ imitation learning to anticipate opponents' behavior, making it pertinent to review the relevant literature. IL has been acknowledged as a compelling approach for sequential decision-making (ng2000algorithms; abbeel2004apprenticeship). In IL, a collection of expert trajectories is provided, with the objective of learning a policy that emulates behavior similar to the expert's policy. One of the simplest IL methods is Behavioral Cloning (BC), which aims to maximize the likelihood of the expert's actions under the learned policy. BC disregards environmental dynamics, rendering it suitable only for uncomplicated environments. Several advanced IL techniques that account for environmental dynamics have been proposed (reddy2019sqil; fu2017learning; ho2016generative). While these methods operate in complex and continuous domains, they involve adversarial learning, making them prone to instability and sensitivity to hyperparameters. IQ-Learn (garg2021iq) stands as a cutting-edge IL algorithm with distinct advantages, specifically its incorporation of dynamics awareness and non-adversarial training. It is important to note that all the aforementioned IL methods were designed for single-agent RL. In contrast, the literature on multi-agent RL is considerably more limited, with only a handful of studies addressing IL in multi-agent RL. For instance, (song2018multi) present an adversarial-training-based algorithm named Multi-Agent Generative Adversarial IL. It is worth noting that all the IL algorithms mentioned above are established on the premise that actions are observable, which implies that no existing algorithm can be directly applied to our multi-agent game settings with local state-only observations.

Competitive Multi-Agent POMDP Setting

We consider a multi-player Markov game in which multiple agents form an alliance to play against a set of opponent agents. We represent the Markov game as a tuple $\{\mathcal{S},\mathcal{N}_{\alpha},\mathcal{N}_{e},\mathcal{A}^{\alpha},\mathcal{A}^{e},P,R\}$, where $\mathcal{S}$ is the set of global states shared by all the agents, $\mathcal{N}_{\alpha}$ and $\mathcal{N}_{e}$ are the sets of ally and enemy agents, $\mathcal{A}^{\alpha}=\prod_{i\in\mathcal{N}_{\alpha}}\mathcal{A}^{\alpha}_{i}$ is the set of joint actions of all the ally agents, $\mathcal{A}^{e}=\prod_{j\in\mathcal{N}_{e}}\mathcal{A}^{e}_{j}$ is the set of joint actions of all the enemy agents, $P$ is the transition dynamics of the game environment, and $R$ is a general reward function that takes states and actions of the ally or enemy agents as inputs and returns the corresponding rewards. At each time step where the global state is $S$, each ally agent $i\in\mathcal{N}_{\alpha}$ takes an action $a^{\alpha}_{i}$ according to a policy $\pi^{\alpha}_{i}(a^{\alpha}_{i}|o^{\alpha}_{i})$, where $o^{\alpha}_{i}$ is the observation of ally agent $i$ given state $S$. The joint action can now be defined as follows:

$$A^{\alpha}=\{a^{\alpha}_{i} \mid i\in\mathcal{N}_{\alpha}\}$$

and the joint policy is defined accordingly:

$$\Pi^{\alpha}(A^{\alpha}|S)=\prod_{i\in\mathcal{N}_{\alpha}}\pi^{\alpha}_{i}(a^{\alpha}_{i}|o^{\alpha}_{i}).$$

The enemy agents, at the same time, take a joint action $A^{e}=\{a^{e}_{j} \mid j\in\mathcal{N}_{e}\}$ with the following probability:

$$\Pi^{e}(A^{e}|S)=\prod_{j\in\mathcal{N}_{e}}\pi^{e}_{j}(a^{e}_{j}|o^{e}_{j}).$$

After the allies and enemies make their decisions, the global state transitions to a new state $S'$ with probability $P(S'|A^{e},A^{\alpha},S)$. In our setting, the enemies' policies $\Pi^{e}$ are fixed. Therefore, we can treat the enemies' policies as part of the environment dynamics, as follows:

$$P(S'|A^{\alpha},S)=\sum_{A^{e}}\Pi^{e}(A^{e}|S)\,P(S'|A^{e},A^{\alpha},S)$$
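For illustration, the following Python sketch (a tabular toy example; the arrays P_env and Pi_e are hypothetical and not part of the paper's implementation) folds a fixed enemy joint policy into the transition dynamics exactly as in the marginalization above:

    import numpy as np

    # Hypothetical tabular quantities for illustration only:
    #   P_env[s_next, a_e, a_alpha, s] = P(S' | A^e, A^alpha, S)
    #   Pi_e[a_e, s]                   = Pi^e(A^e | S), the fixed enemy joint policy
    n_states, n_enemy_actions, n_ally_actions = 4, 3, 3
    rng = np.random.default_rng(0)
    P_env = rng.random((n_states, n_enemy_actions, n_ally_actions, n_states))
    P_env /= P_env.sum(axis=0, keepdims=True)        # normalize over S'
    Pi_e = rng.random((n_enemy_actions, n_states))
    Pi_e /= Pi_e.sum(axis=0, keepdims=True)          # normalize over A^e

    # P(S' | A^alpha, S) = sum_{A^e} Pi^e(A^e | S) P(S' | A^e, A^alpha, S)
    P_marginal = np.einsum('es,neas->nas', Pi_e, P_env)
    assert np.allclose(P_marginal.sum(axis=0), 1.0)  # still a distribution over S'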

Our goal is to find a policy that optimizes the allies' expected joint reward, which can be formulated as follows (environment dynamics are implicitly involved in the sampling):

$$\max_{\Pi^{\alpha}}\;\mathbb{E}_{(A^{\alpha},S)\sim\Pi^{\alpha}}\big[R^{\alpha}(S,A^{\alpha})\big] \tag{1}$$

The game dynamics involve both the environment dynamics and the joint policy of the enemies, making training costly to converge. Our idea is, therefore, to mitigate the uncertainties associated with these game dynamics by first predicting the opponents' behavior based on past observations of the allies and then leveraging this prediction to guide the policy training for the allies.

Opponent Behavior Prediction

The key challenge in our problem setting is that actions taken by opponents are hidden from allied agents. Moreover, each allied agent has limited observations of the other agents in the environment; it can only obtain information about nearby opponents. For example, in the SMAC environment, besides information about itself, each allied agent is also aware of the relative position, relative distance, health points, etc., of the neighboring opponents (the neighborhood of each allied agent changes over time depending on the actions of all agents and the dynamics of the environment). Therefore, instead of directly predicting opponents' next moves, we focus on anticipating the next states of opponents; this next-state prediction serves as an indication of what actions have been taken by the neighboring opponents.

Our key contributions here include: (i) a novel representation of this opponent behavior prediction (or next-state prediction) problem in the form of multi-agent imitation learning; (ii) a new adaptation of IQ-Learn (a SOTA method for the standard single-agent imitation learning problem) to solve our new imitation learning problem; (iii) a comprehensive theoretical analysis of the influence of the allied agents' learning policies on the next-state prediction outcomes; and (iv) a practical multi-agent imitation learning algorithm tailored to the local observations of allied agents.

State-Only Multi-Agent Imitation Learning

In this section, we first present our approach of modeling the problem of predicting the opponents' behavior as state-only multi-agent IL. We then describe our adaptation of IQ-Learn for solving this new multi-agent IL problem. (For ease of presentation, this section focuses on multi-agent IL with global state-only observations in which the actions of opponents are hidden from the allies; in the next section we introduce a practical algorithm that localizes this global state-based IL algorithm to the local observations of allied agents.)

Opponent Behavior Prediction as Multi-Agent IL.

To formulate the problem as multi-agent IL, we introduce a new notion of a joint reward function for the enemies, $R^{e}(S'|S,A^{\alpha})$, associated with a tuple $(S',A^{\alpha},S)$, where $(S,S')$ are the current and next global states and $A^{\alpha}$ is the joint action taken by the allies. Intuitively, the next state $S'$ is a joint "action" of the enemies in this context. This action ($S'$) is essentially a result of the allies' joint action $A^{\alpha}$ and the hidden joint action of the enemies $A^{e}$ in the original multi-agent POMDP. Together with the reward function $R^{e}(S'|S,A^{\alpha})$, we introduce a new notion of a joint policy for the enemies, $\Pi^{e}(S'|S,A^{\alpha})$, which is essentially the probability of ending up at a global state $S'$ from state $S$ when the allies take action $A^{\alpha}$. Let $\boldsymbol{\Pi}$ be the support set of the imitating policy, i.e., $\boldsymbol{\Pi}=\{\Pi:\mathcal{S}\times\mathcal{S}\times\mathcal{A}^{\alpha}\rightarrow[0,1],~\sum_{S'\in\mathcal{S}}\Pi(S'|S,A^{\alpha})=1,~\forall S\in\mathcal{S},A^{\alpha}\in\mathcal{A}^{\alpha}\}$.

We now introduce the maximum-entropy inverse RL framework (ho2016generative) with respect to the new notions of the enemies' reward and policy, $(R^{e}(S'|S,A^{\alpha}),\Pi^{e}(S'|S,A^{\alpha}))$, as follows:

$$\max_{R^{e}}\min_{\Pi\in\boldsymbol{\Pi}}\Big\{L(R^{e},\Pi)=\mathbb{E}_{\rho_{E}^{e}}\big[R^{e}(S'|S,A^{\alpha})\big]-\mathbb{E}_{\rho_{\Pi}^{e}}\big[R^{e}(S'|S,A^{\alpha})\big]+\mathbb{E}_{\rho_{\Pi}^{e}}\big[\ln\Pi(S'|S,A^{\alpha})\big]\Big\} \tag{2}$$

where $\rho_{E}^{e}$ is the occupancy measure of the enemies' joint policy $\Pi^{e}$, $\rho_{\Pi}^{e}$ is the occupancy measure of the imitating policy $\Pi$, and the last term is the entropy regularizer.
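As a minimal sketch, the inner objective in (2) can be estimated from samples of the two occupancy measures; the interfaces reward_net and pi_net below are hypothetical placeholders with assumed call signatures, shown only for illustration:

    def irl_objective(reward_net, pi_net, expert_batch, policy_batch):
        # Monte-Carlo estimate of L(R^e, Pi) in Eq. (2).
        # expert_batch / policy_batch: tuples (S, A_alpha, S_next) of tensors sampled
        # from the occupancy measures rho_E^e and rho_Pi^e, respectively.
        # reward_net(S_next, S, A_alpha)    -> R^e(S'|S, A^alpha)      (hypothetical)
        # pi_net.log_prob(S_next, S, A_alpha) -> ln Pi(S'|S, A^alpha)  (hypothetical)
        S_E, A_E, Sn_E = expert_batch
        S_P, A_P, Sn_P = policy_batch
        expert_reward = reward_net(Sn_E, S_E, A_E).mean()      # E_{rho_E^e}[R^e]
        policy_reward = reward_net(Sn_P, S_P, A_P).mean()      # E_{rho_Pi^e}[R^e]
        log_prob_term = pi_net.log_prob(Sn_P, S_P, A_P).mean() # E_{rho_Pi^e}[ln Pi]
        return expert_reward - policy_reward + log_prob_term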

An Adaptation of IQ-Learn.

Drawing inspiration from IQ-Learn, a state-of-the-art IL algorithm, we construct our IL algorithm as an adaptation of IQ-Learn tailored to our multi-agent competitive game environment.

The main idea of IQ-Learn is to convert a reward learning problem into a Q-function learning one. To apply IQ-Learn to our setting, we introduce the following new soft and inverse soft Bellman operators, which are modifications of the original ones introduced in (garg2021iq):

$$\mathcal{B}^{\Pi,R^{e}}_{Q^{e}}(S'|S,A^{\alpha})=R^{e}(S'|S,A^{\alpha})+\gamma V^{e}_{\Pi}(S') \tag{3}$$
$$\mathcal{T}^{\Pi}_{Q^{e}}(S'|S,A^{\alpha})=Q^{e}(S'|S,A^{\alpha})-\gamma V^{e}_{\Pi}(S') \tag{4}$$

where

$$V^{e}_{\Pi}(S)=\mathbb{E}_{(A^{\alpha},S')\sim(\Pi^{\alpha},\Pi)}\Big[Q^{e}(S'|S,A^{\alpha})-\ln\Pi(S'|S,A^{\alpha})\Big]$$

The key difference between Equations 3 and 4 and the original ones presented in IQ-Learn is that the actions defined for IL in our setting coincide with next states, so no expectation over future returns needs to be computed. This difference raises the important question of whether the properties and theoretical results of the original IQ-Learn still hold in our new problem setting. Interestingly, we show that the key properties of IQ-Learn carry over to this new setting.
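Spelled out in code, the modified operators treat the next state $S'$ as the enemies' "action"; the callables R_e, Q_e, V_e, and log_Pi below are hypothetical stand-ins for $R^{e}$, $Q^{e}$, $V^{e}_{\Pi}$, and $\ln\Pi$, shown as a sketch only:

    def soft_bellman(R_e, V_e, s_next, s, a_alpha, gamma=0.99):
        # Soft Bellman operator, Eq. (3): B^{Pi,R^e} Q^e (S'|S,A^alpha).
        return R_e(s_next, s, a_alpha) + gamma * V_e(s_next)

    def inverse_soft_bellman(Q_e, V_e, s_next, s, a_alpha, gamma=0.99):
        # Inverse soft Bellman operator, Eq. (4): T^Pi Q^e (S'|S,A^alpha).
        return Q_e(s_next, s, a_alpha) - gamma * V_e(s_next)

    def soft_value(Q_e, log_Pi, samples):
        # V^e_Pi(S) as a sample average of Q^e - ln Pi over (A^alpha, S') drawn
        # from (Pi^alpha, Pi) at a fixed state S; samples: iterable of (S, A^alpha, S').
        vals = [Q_e(sn, s, a) - log_Pi(sn, s, a) for s, a, sn in samples]
        return sum(vals) / len(vals)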

First, it can be seen that the soft Bellman operator $\mathcal{B}^{\Pi,R^{e}}_{Q^{e}}$ is contractive, thus defining a unique fixed-point solution $Q^{*}$ such that $\mathcal{B}^{\Pi,R^{e}}_{Q^{*}}=Q^{*}$. Let us further define the following function of $\Pi$ and $Q^{e}$ for the Q-function learning:

$$J(\Pi,Q^{e})=\mathbb{E}_{\rho_{E}^{e}}\big[\mathcal{T}^{\Pi}_{Q^{e}}(S'|S,A^{\alpha})\big]-\mathbb{E}_{\rho_{\Pi}^{e}}\big[\mathcal{T}^{\Pi}_{Q^{e}}(S'|S,A^{\alpha})\big]+\mathbb{E}_{\rho_{\Pi}^{e}}\big[\ln\Pi(S'|S,A^{\alpha})\big]$$

We obtain a result similar to that of IQ-Learn, showing a connection between reward learning and Q-function learning (all proofs are in the appendix):

Proposition 1.

For any reward function $R^{e}$, let $Q^{*}$ be the unique fixed-point solution to the soft Bellman equation $\mathcal{B}^{\Pi,R^{e}}_{Q^{*}}=Q^{*}$; then we have $L(\Pi,R^{e})=J(\Pi,Q^{*})$, and for any $Q^{e}$, $J(\Pi,Q^{e})=L(\Pi,\mathcal{T}^{\Pi}_{Q^{e}})$.

Proposition 1 indicates that the reward learning problem $\max_{R^{e}}\min_{\Pi}L(\Pi,R^{e})$ can be converted equivalently into the Q-learning problem $\max_{Q^{e}}\min_{\Pi}J(\Pi,Q^{e})$. Suppose $Q^{*}$ is a solution to the Q-learning problem; then rewards can be recovered by taking $R^{e}(S'|S,A^{\alpha})=Q^{*}(S'|S,A^{\alpha})-\gamma V^{e}_{\Pi}(S')$. Furthermore, Proposition 2 below shows that the function $J(\Pi,Q^{e})$ can be reformulated in a more compact form that is convenient for training.

Proposition 2.

The function $J(\cdot)$ can be written as follows:

$$J(\Pi,Q^{e})=\mathbb{E}_{(S',A^{\alpha},S)\sim\rho^{e}_{E}}\Big[Q^{e}(S'|S,A^{\alpha})-\gamma V^{e}_{\Pi}(S')\Big]+(1-\gamma)\,\mathbb{E}_{S^{0}\sim P^{0}}\big[V^{e}_{\Pi}(S^{0})\big] \tag{5}$$

where $S^{0}$ is an initial state drawn from the initial-state distribution $P^{0}$. Under this viewpoint, we show that the key properties of classical IQ-Learn still hold in the context of our multi-agent setting with missing observations (Proposition 3), making our IQ-Learn version convenient to use.
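A minimal sketch of how (5) could be estimated from expert transitions and initial states follows; Q_net and V_of are hypothetical callables standing in for $Q^{e}$ and $V^{e}_{\Pi}$:

    def iq_objective(Q_net, V_of, expert_batch, init_states, gamma=0.99):
        # Monte-Carlo estimate of J(Pi, Q^e) in Eq. (5).
        # expert_batch: (S, A_alpha, S_next) drawn from the expert occupancy rho_E^e.
        # init_states:  a batch of initial states S^0 ~ P^0.
        # Q_net(S_next, S, A_alpha) -> Q^e(S'|S, A^alpha); V_of(S) -> V^e_Pi(S).
        S, A, Sn = expert_batch
        expert_term = (Q_net(Sn, S, A) - gamma * V_of(Sn)).mean()
        init_term = (1.0 - gamma) * V_of(init_states).mean()
        return expert_term + init_term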

Proposition 3.

The problem $\max_{Q^{e}}\min_{\Pi}J(\Pi,Q^{e})$ is equivalent to the maximization $\max_{Q^{e}}J(\Pi^{Q},Q^{e})$, where

$$\Pi^{Q}(S'|S,A^{\alpha})=\frac{\exp\big(Q^{e}(S'|S,A^{\alpha})\big)}{\sum_{S''}\exp\big(Q^{e}(S''|S,A^{\alpha})\big)}$$

Moreover, the function $J(\Pi^{Q},Q^{e})$ is concave in $Q^{e}$.

Corollary 4.

If $\Pi=\Pi^{Q}$ (defined in Proposition 3), then we can write $V^{e}_{\Pi}(S)$ as follows:

$$V^{e}_{\Pi}(S)=V^{e}_{Q}(S)=\sum_{A^{\alpha}}\Pi^{\alpha}(A^{\alpha}|S)\,\ln\Big(\sum_{S'}\exp\big(Q^{e}(S'|S,A^{\alpha})\big)\Big)$$
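For a finite set of candidate next states, $\Pi^{Q}$ is simply a softmax of $Q^{e}$ over $S'$ and $V^{e}_{Q}$ an average log-sum-exp, as the following sketch (with a hypothetical tensor of Q-values) illustrates:

    import torch
    import torch.nn.functional as F

    # Hypothetical Q-values for illustration:
    # q[b, k] = Q^e(S'_k | S_b, A^alpha_b) over K candidate next states,
    # where each row b plays the role of one sampled (S, A^alpha) pair.
    q = torch.randn(32, 10)

    # Proposition 3: Pi^Q(S'|S, A^alpha) is the softmax of Q^e over next states.
    pi_Q = F.softmax(q, dim=-1)

    # Corollary 4: V^e_Q(S) averages log-sum-exp of Q^e over samples A^alpha ~ Pi^alpha.
    v_Q = torch.logsumexp(q, dim=-1).mean()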

Effect of Allies' Policies on Imitation Learning

The above IL algorithm is trained during the training of the allies' policies; thus, the game dynamics change during the IL process. We analyze the impact of these changes on the imitation policy. To this end, we consider the loss function of the imitation model as a function of the allies' joint policy:

$$\Phi(\Pi^{\alpha}|Q^{e})=\mathbb{E}_{(S',A^{\alpha},S)\sim(\Pi^{e},\Pi^{\alpha})}\big[Q^{e}(S'|S,A^{\alpha})-\gamma V^{e}_{Q}(S')\big]-(1-\gamma)\,\mathbb{E}_{S^{0}\sim P^{0}}\big[V^{e}_{Q}(S^{0})\big]$$

In the following, we first present our theoretical results bounding the impact of the allies' changing policies on the imitation learning loss function (Proposition 5 for a discrete state space and Proposition 6 for a continuous one). Based on these results, we then provide a theoretical bound on the IL outcome (Corollary 7).

Proposition 5 (Finite state space $\mathcal{S}$).

Given two allies' joint policies $\Pi^{\alpha}$ and $\widetilde{\Pi}^{\alpha}$ such that $\textsc{KL}(\Pi^{\alpha}(\cdot|S)\,\|\,\widetilde{\Pi}^{\alpha}(\cdot|S))\leq\epsilon$ for any state $S\in\mathcal{S}$, the following inequality holds:

$$\left|\Phi(\Pi^{\alpha}|Q^{e})-\Phi(\widetilde{\Pi}^{\alpha}|Q^{e})\right|\leq\left(\frac{\gamma+1+(1-\gamma)^{3}}{(1-\gamma)^{2}}\overline{Q}+\frac{1+(1-\gamma)^{3}}{(1-\gamma)^{2}}\ln|\mathcal{S}|\right)\sqrt{2\ln 2\epsilon}$$

where $\overline{Q}=\max_{(S'_{t},S_{t},A^{\alpha}_{t})}Q^{e}(S'_{t}|S_{t},A^{\alpha}_{t})$ is an upper bound of $Q^{e}$.

Under a continuous state space, an actor-critic method is used since the policy $\Pi^{Q}$ cannot be computed exactly. In this case, one can use an explicit policy $\Pi$ to approximate $\Pi^{Q}$. We then alternately update $Q^{e}$ and $\Pi$ using the loss function in Proposition 2. In particular, for a fixed $Q^{e}$, a soft actor update $\max_{\Pi}\mathbb{E}_{S'\sim\Pi(\cdot|S,A^{\alpha})}\big[Q^{e}(S'|S,A^{\alpha})-\ln\Pi(S'|S,A^{\alpha})\big]$ will bring $\Pi$ closer to $\Pi^{Q}$.

Let $\Phi(\Pi^{\alpha}|\Pi,Q^{e})$ be the objective of IL in (5), written as a function of the allies' joint policy $\Pi^{\alpha}$. The following proposition establishes a bound on the variation $|\Phi(\Pi^{\alpha}|\Pi,Q^{e})-\Phi(\widetilde{\Pi}^{\alpha}|\Pi,Q^{e})|$ as a function of $\textsc{KL}(\Pi^{\alpha}\,\|\,\widetilde{\Pi}^{\alpha})$.

Proposition 6 (Continuous state space $\mathcal{S}$).

Given two allies' joint policies $\Pi^{\alpha}$ and $\widetilde{\Pi}^{\alpha}$ such that for every state $S\in\mathcal{S}$, $\textsc{KL}(\Pi^{\alpha}(\cdot|S)\,\|\,\widetilde{\Pi}^{\alpha}(\cdot|S))\leq\epsilon$, the following inequality holds:

$$\left|\Phi(\Pi^{\alpha}|Q^{e},\Pi)-\Phi(\widetilde{\Pi}^{\alpha}|Q^{e},\Pi)\right|\leq\left(\frac{\gamma+1+(1-\gamma)^{3}}{(1-\gamma)^{2}}\overline{Q}-\frac{1+(1-\gamma)^{3}}{(1-\gamma)^{2}}H\right)\sqrt{2\ln 2\epsilon}$$

where $\overline{Q}=\sup_{(S'_{t},S_{t},A^{\alpha}_{t})}Q^{e}(S'_{t}|S_{t},A^{\alpha}_{t})$ is an upper bound of $Q^{e}(\cdot|\cdot,\cdot)$ and $H=\inf_{(S,A^{\alpha})}\sum_{S'}\Pi(S'|S,A^{\alpha})\ln\Pi(S'|S,A^{\alpha})$ is a lower bound of the negative entropy of the actor policy $\Pi(\cdot|\cdot,\cdot)$.

Propositions 5 and 6 allow us to establish an upper bound on the imitation learning objective when the allies' joint policy changes.

Corollary 7.

Given two allies' joint policies $\Pi^{\alpha}$ and $\widetilde{\Pi}^{\alpha}$ with $\textsc{KL}(\Pi^{\alpha}(\cdot|S)\,\|\,\widetilde{\Pi}^{\alpha}(\cdot|S))\leq\epsilon$ for all $S\in\mathcal{S}$, the following holds:

$$\Big|\max_{Q^{e}}\big\{\Phi(\Pi^{\alpha}|Q^{e})\big\}-\max_{Q^{e}}\big\{\Phi(\widetilde{\Pi}^{\alpha}|Q^{e})\big\}\Big|\leq\mathcal{O}(\sqrt{\epsilon})\quad\text{(discrete)}$$
$$\Big|\max_{Q^{e}}\min_{\Pi}\big\{\Phi(\Pi^{\alpha}|Q^{e},\Pi)\big\}-\max_{Q^{e}}\min_{\Pi}\big\{\Phi(\widetilde{\Pi}^{\alpha}|Q^{e},\Pi)\big\}\Big|\leq\mathcal{O}(\sqrt{\epsilon})\quad\text{(continuous)}$$

Since the allies' joint policy $\Pi^{\alpha}$ changes during our policy learning process, the above result implies that the imitating policy will be stable if $\Pi^{\alpha}$ becomes stable; moreover, if $\Pi^{\alpha}$ converges to a target policy $\Pi^{\alpha*}$, then the imitator's policy also converges to the one trained with the target ally policy at a rate of $\sqrt{\textsc{KL}(\Pi^{\alpha}\,\|\,\Pi^{\alpha*})}$. That is, if the actual policy is within an $\mathcal{O}(\epsilon)$ neighborhood of the target policy (i.e., $\textsc{KL}(\Pi^{\alpha}\,\|\,\Pi^{\alpha*})\leq\epsilon$), then the expected return of the imitating policy is within an $\mathcal{O}(\sqrt{\epsilon})$ neighborhood of the desired expected return given by the target policy.

Figure 1: An overview of our IMAX-PPO algorithm. Each local observation oαio^{\alpha}_{i} of an ally agent ii includes information about itself, as well as enemy and ally agents in its neighborhood (which changes over time). The output of the IL component is the predicted next states of neighboring enemy agents (predictions for the non-neighbor enemies will be masked out).

IMAX-PPO: Imitation-enhanced Multi-Agent EXtended PPO Algorithm

We present our MARL algorithm for the competitive game setting. We first focus on a practical implementation of the imitation learning algorithm that takes local observations into account. We then show how to integrate it into our MARL algorithm. We call our algorithm IMAX-PPO, standing for Imitation-enhanced Multi-Agent EXtended PPO.

Imitation Learning with Local Observations

In the previous section, we presented our new imitation learning algorithm (an adaptation of IQ-Learn) to learn an enemy policy $\widehat{\Pi}^{e}(S'|S,A^{\alpha})$ that approximates the probability of ending up at state $S'$ when the current global state is $S$ and the allies' joint action is $A^{\alpha}$. Here, $S$ represents all the information that the enemy agents can access. In our multi-agent game setting, however, ally agents do not have access to the observations of the enemies. Instead, each ally agent $i$ only has information about the neighboring enemies observable by agent $i$ (such as their locations, speeds, etc.).

Therefore, we adapt our IL algorithm to such available local information. That is, the aim now is to optimize policies $\widehat{\Pi}^{e}_{\psi_{\pi}}(S^{e,\text{next}}_{i}|o^{\alpha}_{i},a^{\alpha}_{i})$, $i\in\mathcal{N}_{\alpha}$, where $o^{\alpha}_{i}$ is an observation vector of agent $i$, containing the local states of agent $i$ itself as well as of all the agents in the neighborhood that are observable by agent $i$. In particular,

$$o^{\alpha}_{i}=\{s^{\alpha}_{i}\}\cup\{s^{\alpha}_{i_{k}}:i_{k}\in N(i)\cap\mathcal{N}_{\alpha}\}\cup\{s^{e}_{j_{k}}:j_{k}\in N(i)\cap\mathcal{N}_{e}\}$$

where $s^{\alpha}_{i_{k}}$ is the local state of an ally agent $i_{k}$, $s^{e}_{j_{k}}$ is the local state of an enemy agent $j_{k}$, and $N(i)$ is the neighborhood of agent $i$.
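A sketch of how $o^{\alpha}_{i}$ could be assembled from hypothetical per-agent state containers follows (the actual feature layout is environment-specific and not specified here):

    def local_observation(i, ally_states, enemy_states, neighbors):
        # ally_states / enemy_states: dicts mapping agent ids to state vectors
        # (ally and enemy ids are assumed to be distinct keys).
        # neighbors: the set N(i) of agent ids currently observable by agent i.
        obs = [ally_states[i]]                                        # s^alpha_i
        obs += [ally_states[k] for k in neighbors
                if k in ally_states and k != i]                       # neighboring allies
        obs += [enemy_states[j] for j in neighbors
                if j in enemy_states]                                 # neighboring enemies
        return obs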

To apply our multi-agent IL to this local-observation setting, we build a common policy network $\widehat{\Pi}^{e}_{\psi_{\pi}}$ and a common Q network $\widehat{Q}^{e}_{\psi_{Q}}$ for all the agents, where $\psi_{\pi}$ and $\psi_{Q}$ are the network parameters. The objective function for IL can be reformulated according to the local observations of the allies as follows:

$$J(\widehat{\Pi}^{e}_{\psi_{\pi}},\widehat{Q}^{e}_{\psi_{Q}})=\sum_{i\in\mathcal{N}_{\alpha}}\mathbb{E}_{(S^{e,\text{next}}_{i},a^{\alpha}_{i},o^{\alpha}_{i})\sim\rho^{e}_{E}}\Big[\widehat{Q}^{e}(S^{e,\text{next}}_{i}|o^{\alpha}_{i},a^{\alpha}_{i})-\gamma V^{e}_{\Pi}(S^{e,\text{next}}_{i})\Big]+(1-\gamma)\,\mathbb{E}_{S^{e,0}_{i}\sim P^{0}}\big[V^{e}_{\Pi}(S^{e,0}_{i})\big] \tag{6}$$

where $S^{e,\text{next}}_{i}=\{s'^{e}_{j_{k}}:j_{k}\in N(i)\cap\mathcal{N}_{e}\}$ denotes the next states of the enemies in the current neighborhood of ally agent $i$. In addition, $V^{e}_{\Pi}(S^{e,\text{next}}_{i})=\mathbb{E}_{(a^{\alpha}_{i},S^{e,\text{next}}_{i})\sim(\Pi^{\alpha},\widehat{\Pi}^{e}_{\psi_{\pi}})}\big[\widehat{Q}^{e}_{\psi_{Q}}(S^{e,\text{next}}_{i}|o^{\alpha}_{i},a^{\alpha}_{i})-\ln\big(\widehat{\Pi}^{e}_{\psi_{\pi}}(S^{e,\text{next}}_{i}|o^{\alpha}_{i},a^{\alpha}_{i})\big)\big]$. We update $\widehat{\Pi}^{e}_{\psi_{\pi}}$ and $\widehat{Q}^{e}_{\psi_{Q}}$ by the following actor-critic rule: for a fixed $\widehat{\Pi}^{e}_{\psi_{\pi}}$, update $\psi_{Q}$ to maximize $J(\widehat{\Pi}^{e}_{\psi_{\pi}},\widehat{Q}^{e}_{\psi_{Q}})$; and for a fixed $\widehat{Q}^{e}_{\psi_{Q}}$, apply a soft actor-critic (SAC) style update to $\psi_{\pi}$.
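The localized actor-critic update can be sketched as follows; the interfaces Q_net, pi_net.rsample, and pi_net.log_prob are hypothetical stand-ins for $\widehat{Q}^{e}_{\psi_{Q}}$ and $\widehat{\Pi}^{e}_{\psi_{\pi}}$, and the initial-state term of (6) is omitted for brevity:

    def imitator_update(Q_net, pi_net, q_opt, pi_opt, batch, gamma=0.99):
        # batch: tensors (o_i, a_i, S^{e,next}_i) for a mini-batch of ally agents,
        # sampled from the replay buffer of collected transitions.
        o, a, s_next = batch

        def soft_value(o, a):
            # One-sample estimate of V^e_Pi = E_{S'~pi}[Q^e(S'|o,a) - ln pi(S'|o,a)],
            # following the definition of V^e_Pi given below objective (6).
            s_pred = pi_net.rsample(o, a)          # hypothetical reparameterized sample
            return Q_net(s_pred, o, a) - pi_net.log_prob(s_pred, o, a)

        # Critic step: ascend J in psi_Q with the imitating policy held fixed
        # (each optimizer only updates its own parameters, and its gradients are
        # zeroed before each step).
        j = (Q_net(s_next, o, a) - gamma * soft_value(o, a)).mean()
        q_opt.zero_grad()
        (-j).backward()
        q_opt.step()

        # Actor step (SAC-style soft actor update): bring pi closer to Pi^Q by
        # maximizing E_{S'~pi}[Q^e(S'|o,a) - ln pi(S'|o,a)] with Q held fixed.
        actor_loss = -soft_value(o, a).mean()
        pi_opt.zero_grad()
        actor_loss.backward()
        pi_opt.step()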

IMAX-PPO Algorithm

We aim to optimize the allies' policy $\Pi^{\alpha}_{\theta}(a^{\alpha}_{i}|o^{\alpha}_{i},S^{e,\text{next}}_{i})$, $i\in\mathcal{N}_{\alpha}$, so as to maximize the long-term expected joint reward:

$$\max_{\Pi^{\alpha}_{\theta}}\mathbb{E}_{(a^{\alpha}_{i},o^{\alpha}_{i},S^{e,\text{next}}_{i})\sim\Pi^{\alpha}_{\theta}}\Big[\sum_{i\in\mathcal{N}_{\alpha}}R^{\alpha}_{i}(a^{\alpha}_{i},o^{\alpha}_{i})\Big]$$

where $o^{\alpha}_{i}$ is the observation vector of agent $i$, $S^{e,\text{next}}_{i}$ is the information derived from the imitator for agent $i$, and $a^{\alpha}_{i}\in\mathcal{A}^{\alpha}_{i}$ is a local action of agent $i$. To facilitate the training and integration of the imitation learning policy into the MARL algorithm, for every ally agent $i$ we gather game trajectories following the structure $(o^{\alpha}_{i},a^{\alpha}_{i},S^{e,\text{next}}_{i})$. These observations are then stored in a replay buffer used to train the imitation policy $\widehat{\Pi}^{e}_{\psi_{\pi}}(S^{e,\text{next}}_{i}|o^{\alpha}_{i},a^{\alpha}_{i})$.

To integrate the imitator into the IMAX-PPO framework, we include the predicted next states of the enemy agents as inputs for training the allies' policy $\Pi^{\alpha}$. Specifically, at each game state $S$, given the current actor policy $\Pi^{\alpha}$ and the imitating policy $\widehat{\Pi}^{e}$, we draw a sample of the allied agents' joint action $\widetilde{A}^{\alpha}\sim\Pi^{\alpha}$. The corresponding local observation $o^{\alpha}_{i}$ and action $\widetilde{a}^{\alpha}_{i}$ of each agent $i\in\mathcal{N}_{\alpha}$ are then fed into the imitation policy to predict the subsequent enemy states $S^{e,\text{next}}_{i}\sim\widehat{\Pi}^{e}(\cdot|o^{\alpha}_{i},\widetilde{a}^{\alpha}_{i})$. Once the predicted local states $\{S^{e,\text{next}}_{i},\,i\in\mathcal{N}_{\alpha}\}$ are available, they are used as inputs to the actor policy $\Pi^{\alpha}_{\theta}$ to generate new actions for the allied agents; that is, we select the next local action $a'^{\alpha}_{i}\sim\Pi^{\alpha}_{\theta}(\cdot|o^{\alpha}_{i},S^{e,\text{next}}_{i})$. Besides the allies' policy network, we also use a centralized value network $V^{\alpha}_{\theta_{v}}(S)$ and update it together with the policy network in an actor-critic manner, similarly to MAPPO. The actor network is trained by optimizing the following objective:
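One environment step of this two-stage action selection can be sketched as follows; actor and imitator are hypothetical wrappers around $\Pi^{\alpha}_{\theta}$ and $\widehat{\Pi}^{e}_{\psi_{\pi}}$, and how the tentative action is conditioned at the very first step is an assumption of the sketch:

    def imax_ppo_step(actor, imitator, observations, prev_prediction):
        # observations: {agent id i: o^alpha_i}; prev_prediction: {i: last S^{e,next}_i}
        # (used only to condition the tentative sample; e.g., zeros at the first step).
        # 1) Tentative joint action \tilde{A}^alpha ~ Pi^alpha.
        tentative = {i: actor.sample(observations[i], prev_prediction[i])
                     for i in observations}
        # 2) Predict next states of each agent's neighboring enemies,
        #    S^{e,next}_i ~ \hat{Pi}^e(. | o^alpha_i, \tilde{a}^alpha_i).
        prediction = {i: imitator.sample(observations[i], tentative[i])
                      for i in observations}
        # 3) Executed actions a'^alpha_i ~ Pi^alpha_theta(. | o^alpha_i, S^{e,next}_i).
        actions = {i: actor.sample(observations[i], prediction[i])
                   for i in observations}
        return actions, prediction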

$$L^{\alpha}(\theta)=\sum_{i\in\mathcal{N}_{\alpha}}\mathbb{E}_{o^{\alpha}_{i},a^{\alpha}_{i},S^{e,\text{next}}_{i}}\Big[\min\big\{r_{i}(\theta)\widehat{A},\;\text{clip}(r_{i}(\theta),1-\epsilon,1+\epsilon)\widehat{A}\big\}\Big] \tag{7}$$

where $r_{i}(\theta)=\frac{\Pi^{\alpha}_{\theta}(a^{\alpha}_{i}|o^{\alpha}_{i},S^{e,\text{next}}_{i})}{\Pi^{\alpha}_{\theta_{old}}(a^{\alpha}_{i}|o^{\alpha}_{i},S^{e,\text{next}}_{i})}$ and $\widehat{A}$ is the advantage function, calculated by Generalized Advantage Estimation (GAE). The critic network is trained by optimizing

$$\Phi^{\alpha}(\theta_{v})=\mathbb{E}_{S}\Big[\max\big\{(V^{\alpha}_{\theta_{v}}(S)-\widehat{R}(S))^{2},\;(V^{\text{clip}}_{\theta_{v},\theta_{v,old}}(S)-\widehat{R}(S))^{2}\big\}\Big]$$

where $\widehat{R}(S)=\widehat{A}+V^{\alpha}_{\theta_{v,old}}(S)$ and $V^{\text{clip}}_{\theta_{v},\theta_{v,old}}(S)=\text{clip}\big(V^{\alpha}_{\theta_{v}}(S),\,V^{\alpha}_{\theta_{v,old}}(S)-\epsilon,\,V^{\alpha}_{\theta_{v,old}}(S)+\epsilon\big)$. We provide the key stages in Algorithm 1 below. Additionally, Fig. 1 serves as an illustration of our IMAX-PPO algorithm.
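Both training objectives reduce to a few tensor operations; the sketch below assumes flat per-sample tensors (log-probabilities under the new and old policies, GAE advantages, value predictions, and returns $\widehat{R}$) and is only an illustration of (7) and the clipped critic loss:

    import torch

    def imax_ppo_losses(new_logp, old_logp, adv, values, old_values, returns, eps=0.2):
        # r_i(theta) from log-probabilities of Pi^alpha_theta and Pi^alpha_theta_old.
        ratio = torch.exp(new_logp - old_logp)
        # Clipped surrogate objective L^alpha(theta), Eq. (7) (to be maximized).
        actor_obj = torch.min(ratio * adv,
                              torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv).mean()
        # Clipped value loss Phi^alpha(theta_v) with
        # V^clip = clip(V, V_old - eps, V_old + eps).
        v_clip = old_values + torch.clamp(values - old_values, -eps, eps)
        critic_loss = torch.max((values - returns) ** 2,
                                (v_clip - returns) ** 2).mean()
        return actor_obj, critic_loss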

Algorithm 1 IMAX-PPO Algorithm

Input: Initial allies' policy network $\Pi^{\alpha}_{\theta}$, initial allies' value network $V^{\alpha}_{\theta_{v}}$, initial imitator's policy network $\widehat{\Pi}^{e}_{\psi_{\pi}}$, initial imitator's Q network $\widehat{Q}^{e}_{\psi_{Q}}$, learning rates $\kappa^{e}_{\pi},\kappa^{e}_{Q},\kappa^{\alpha}_{\pi},\kappa^{\alpha}_{V}$.
Output: Trained allies' policy $\Pi^{\alpha}_{\theta}$

1:  for $t=0,1,\ldots$ do
2:     # Updating imitator:
3:     # Train the Q function using the objective in (6)
4:     $\psi_{Q,t+1}=\psi_{Q,t}+\kappa^{e}_{Q}\nabla_{\psi_{Q}}[J(\psi_{Q})]$
5:     Update the policy $\widehat{\Pi}^{e}_{\psi_{\pi}}$ with a SAC-style actor update (for continuous domains)
6:     $\psi_{\pi,t+1}=\psi_{\pi,t}-\kappa^{e}_{\pi}\nabla_{\psi_{\pi}}\mathbb{E}_{S^{e,\text{next}}_{i}}[V^{e}_{\Pi}(S^{e,\text{next}}_{i})]$
7:     # Updating allies' policy:
8:     # Update the allies' actor by maximizing $L^{\alpha}(\theta)$
9:     $\theta_{t+1}=\theta_{t}+\kappa^{\alpha}_{\pi}\nabla_{\theta}L^{\alpha}(\theta)$
10:    # Update the allies' critic by minimizing $\Phi^{\alpha}(\theta_{v})$
11:    $\theta_{v,t+1}=\theta_{v,t}-\kappa^{\alpha}_{V}\nabla_{\theta_{v}}\Phi^{\alpha}(\theta_{v})$
12: return the trained allies' policy $\Pi^{\alpha}_{\theta}$
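Finally, Algorithm 1 can be read as the following training-loop skeleton; all helper callables passed in (collect_rollout, update_imitator, ppo_update) are hypothetical placeholders for the steps sketched above:

    def train_imax_ppo(env, actor, critic, imitator_pi, imitator_q, optimizers,
                       replay_buffer, collect_rollout, update_imitator, ppo_update,
                       num_iterations, batch_size):
        for t in range(num_iterations):
            # Roll out the current allies' policy together with the imitator and
            # store per-agent tuples (o_i, a_i, S^{e,next}_i) for imitation learning.
            rollout = collect_rollout(env, actor, imitator_pi)
            replay_buffer.extend(rollout.imitation_tuples)

            # Updating imitator: critic step on objective (6), then a SAC-style actor step.
            update_imitator(imitator_q, imitator_pi, optimizers,
                            replay_buffer.sample(batch_size))

            # Updating allies' policy: maximize L^alpha(theta) and minimize
            # Phi^alpha(theta_v) using GAE advantages computed from the rollout.
            ppo_update(actor, critic, optimizers, rollout)
        return actor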