Mimicking To Dominate: Imitation Learning Strategies for Success in Multiagent Competitive Games

Viet The Bui¹, Tien Mai¹, Thanh H. Nguyen²
Abstract

Training agents in multi-agent competitive games presents significant challenges due to their intricate nature. These challenges are exacerbated by dynamics influenced not only by the environment but also by opponents' strategies. Existing methods often struggle with slow convergence and instability. To address this, we harness the potential of imitation learning to comprehend and anticipate opponents' behavior, aiming to mitigate uncertainties with respect to the game dynamics. Our key contributions include: (i) a new multi-agent imitation learning model for predicting the opponents' next moves, which works with hidden opponent actions and local observations; (ii) a new multi-agent reinforcement learning algorithm that combines our imitation learning model and policy training into a single training process; and (iii) extensive experiments in three challenging game environments, including an advanced version of the StarCraft multi-agent challenge (i.e., SMACv2). Experimental results show that our approach achieves superior performance compared to existing state-of-the-art multi-agent RL algorithms.

Introduction

Recent works in multi-agent reinforcement learning (MARL) have made significant progress in developing new effective MARL algorithms that can perform well in complex multi-agent environments including SMAC (StarCraft Multi-Agent Challenge) (yu2022surprising; samvelyan2019starcraft). Among these works, centralized training and decentralized execution (CTDE) (foerster2018counterfactual) has attracted significant attention from the RL research community due to its advantage of leveraging global information to train a centralized critic (a.k.a. actor-critic methods (lowe2017multi)) or a joint Q-function (a.k.a. value-decomposition methods (rashid2020monotonic; sunehag2017value)). This approach enables a more efficient and stable learning process while still allowing agents to act in a decentralized manner. Under this CTDE framework, off-policy methods such as MADDPG (lowe2017multi) and QMIX (rashid2020monotonic) have become very popular as a result of their data efficiency and state-of-the-art (SOTA) results on a wide range of benchmarks.

On the other hand, on-policy gradient methods have been under-explored so far in MARL due to their data consumption and the difficulty of transferring knowledge from single-agent to multi-agent settings. However, a more recent work (yu2022surprising) shows that on-policy methods such as MAPPO (a multi-agent version of proximal policy optimization) outperform all other SOTA methods including MADDPG and QMIX in various multi-agent benchmarks, and especially work extremely well in complex SMAC settings. (yu2022surprising) show that with proper choices of various hyper-parameters, including value normalization, clipping ratio, and batch size, policy gradient methods can achieve both data efficiency and high return for agents. Motivated by these promising results of MAPPO, our work focuses on improving the performance of policy gradient methods in MARL.

We consider a multi-agent environment in which agents form an alliance to play against a team of opponents in a partially observable MDP, where allied agents have to make decisions independently without communicating with other members. Unlike the leading method, MAPPO, which focuses on the investigation of hyper-parameter choices for PPO to perform well in the MARL setting, we attempt to enhance the performance of PPO in MARL with the introduction of a novel opponent-behavior-prediction component. This new component is incorporated into the MAPPO framework to enhance the policy learning of allied agents in the MARL setting. A key challenge in our problem setting is that agents in the alliance are unaware of actions taken by their opponents. Moreover, each allied agent only has local observations of opponents located in the agent's current neighborhood; the locations and neighborhoods of all players are not fixed but change over time depending on the actions taken by players and the dynamics of the environment. Lastly, learning the behavior of the opponents occurs during the policy learning process of the allied agents. The inter-dependency between these two learning components makes the entire learning process significantly challenging.

We address the aforementioned challenges while providing the following key contributions. First, we convert the problem of opponent behavior learning into predicting the next states of the opponents. This way, we do not have to directly predict the behavior of the opponents (whose actions are not observable); instead, we use the next-state prediction as an indirect indication of their behavior. We then cast the problem of opponent next-state prediction as a new multi-agent imitation learning (IL) problem. We propose a new multi-agent IL algorithm, which is an adaptation of IQ-Learn (garg2021iq) (a SOTA imitation learning algorithm), that only considers local opponent-state-only observations. In particular, instead of imitating the opponents' policy, our IL algorithm targets the prediction of the next states of neighboring opponents. Second, we provide a comprehensive theoretical analysis that bounds the impact of the changing policy of the allied agents (as a result of the policy learning process) on our imitation learning outcomes.

Third, we present a unified MARL algorithmic framework in which we incorporate our behavior learning component into MAPPO. The idea is that we can combine each allied agent’s local observations with the next-state prediction of neighboring opponents of that agent, creating an augmented input based on which to improve the decision making of the allied agent at every state. This novel integration results in a new MARL algorithm, which we name Imitation-enhanced Multi-Agent EXtended PPO (IMAX-PPO).

Finally, we conduct extensive experiments in several benchmarks ranging from complex to simple ones, including: SMACv2 (an advanced version of the StarCraft multi-agent challenge) (ellis2022smacv2), Google research football (GRF) (kurach2020google), and Gold Miner (Miner). Our empirical results show that our new algorithm consistently and significantly outperforms SOTA algorithms (i.e., QMIX and MAPPO) across all these benchmarks.

Related Work

MARL. The literature on MARL includes centralized and decentralized algorithms. While centralized algorithms (claus1998dynamics) learn a single joint policy to produce joint actions of all the agents, decentralized learning (littman1994markov) optimizes each agent's local policy independently. There are also algorithms based on centralized training and decentralized execution (CTDE). For example, methods in (lowe2017multi; foerster2018counterfactual) adopt actor-critic structures and learn a centralized critic that takes global information as input. Value-decomposition (VD) is a class of methods that represent the joint Q-function as a function of agents' local Q-functions (sunehag2017value; rashid2020monotonic). On the other hand, the use of policy-gradient methods, such as PPO (schulman2017proximal), has been investigated in multi-agent RL. For example, (de2020independent) propose independent PPO (IPPO), a decentralized MARL method that can achieve high success rates in several hard SMAC maps. IPPO is, however, overall worse than QMIX (rashid2020monotonic), a VD method. Recently, (yu2022surprising) developed MAPPO, a PPO-based MARL algorithm that outperforms QMIX on popular multi-agent game environments such as SMAC (samvelyan2019starcraft) and GRF (kurach2020google). To the best of our knowledge, MAPPO is currently a SOTA algorithm for MARL. In this work, we integrate a new IL-based behavior prediction model into MAPPO, resulting in a new MARL algorithm that outperforms SOTA MARL methods on various challenging game tasks.

Imitation Learning (IL). In this study, we employ imitation learning to anticipate opponents' behavior, making it pertinent to review the relevant literature. IL has been acknowledged as a compelling approach for sequential decision-making (ng2000algorithms; abbeel2004apprenticeship). In IL, a collection of expert trajectories is provided, with the objective of learning a policy that emulates behavior similar to the expert's policy. One of the simplest IL methods is Behavioral Cloning (BC), which aims to maximize the likelihood of the expert's actions under the learned policy. BC disregards environmental dynamics, rendering it suitable only for uncomplicated environments. Several advanced IL techniques that account for environmental dynamics have been proposed (reddy2019sqil; fu2017learning; ho2016generative). While these methods operate in complex and continuous domains, they involve adversarial learning, making them prone to instability and sensitivity to hyperparameters. IQ-Learn (garg2021iq) stands as a cutting-edge IL algorithm with distinct advantages, specifically its incorporation of dynamics awareness and non-adversarial training. It is important to note that all the aforementioned IL methods were designed for single-agent RL. In contrast, the literature on multi-agent RL is considerably more limited, with only a handful of studies addressing IL in multi-agent RL. For instance, (song2018multi) present an adversarial-training-based algorithm named Multi-Agent Generative Adversarial IL. It is worth noting that all the IL algorithms mentioned above are established on the premise that actions are observable, which implies that no existing algorithm can be directly applied to our multi-agent game settings with local state-only observations.

Competitive Multi-Agent POMDP Setting

We consider a multi-player Markov game in which multiple agents form an alliance to play against a set of opponent agents. We represent the Markov game as a tuple $\{\mathcal{S},\mathcal{N}_{\alpha},\mathcal{N}_{e},\mathcal{A}^{\alpha},\mathcal{A}^{e},P,R\}$, where $\mathcal{S}$ is the set of global states shared by all the agents, $\mathcal{N}_{\alpha}$ and $\mathcal{N}_{e}$ are the sets of ally and enemy agents, $\mathcal{A}^{\alpha}=\prod_{i\in\mathcal{N}_{\alpha}}\mathcal{A}^{\alpha}_{i}$ is the set of joint actions of all the ally agents, $\mathcal{A}^{e}=\prod_{j\in\mathcal{N}_{e}}\mathcal{A}^{e}_{j}$ is the set of joint actions of all the enemy agents, $P$ is the transition dynamics of the game environment, and $R$ is a general reward function that takes states and actions of the ally or enemy agents as inputs and returns the corresponding rewards. At each time step where the global state is $S$, each ally agent $i\in\mathcal{N}_{\alpha}$ takes an action $a^{\alpha}_{i}$ according to a policy $\pi^{\alpha}_{i}(a^{\alpha}_{i}|o^{\alpha}_{i})$, where $o^{\alpha}_{i}$ is the observation of ally agent $i$ given state $S$. The joint action can now be defined as follows:

$$A^{\alpha}=\{a^{\alpha}_{i} \mid i\in\mathcal{N}_{\alpha}\}$$

and the joint policy is defined accordingly:

$$\Pi^{\alpha}(A^{\alpha}|S)=\prod_{i\in\mathcal{N}_{\alpha}}\pi^{\alpha}_{i}(a^{\alpha}_{i}|o^{\alpha}_{i}).$$

The enemy agents, at the same time, take a joint action $A^{e}=\{a^{e}_{j} \mid j\in\mathcal{N}_{e}\}$ with the following probability:

$$\Pi^{e}(A^{e}|S)=\prod_{j\in\mathcal{N}_{e}}\pi^{e}_{j}(a^{e}_{j}|o^{e}_{j}).$$

After the allies and enemies make their decisions, the global state transitions to a new state $S'$ with probability $P(S'|A^{e},A^{\alpha},S)$. In our setting, the enemies' policies $\Pi^{e}$ are fixed. Therefore, we can treat the enemies' policies as part of the environment dynamics, as follows:

$$P(S'|A^{\alpha},S)=\sum_{A^{e}}\Pi^{e}(A^{e}|S)\,P(S'|A^{e},A^{\alpha},S)$$
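For illustration, the following Python sketch (a tabular toy example; the arrays P_env and Pi_e are hypothetical and not part of the paper's implementation) folds a fixed enemy joint policy into the transition dynamics exactly as in the marginalization above:

    import numpy as np

    # Hypothetical tabular quantities for illustration only:
    #   P_env[s_next, a_e, a_alpha, s] = P(S' | A^e, A^alpha, S)
    #   Pi_e[a_e, s]                   = Pi^e(A^e | S), the fixed enemy joint policy
    n_states, n_enemy_actions, n_ally_actions = 4, 3, 3
    rng = np.random.default_rng(0)
    P_env = rng.random((n_states, n_enemy_actions, n_ally_actions, n_states))
    P_env /= P_env.sum(axis=0, keepdims=True)        # normalize over S'
    Pi_e = rng.random((n_enemy_actions, n_states))
    Pi_e /= Pi_e.sum(axis=0, keepdims=True)          # normalize over A^e

    # P(S' | A^alpha, S) = sum_{A^e} Pi^e(A^e | S) P(S' | A^e, A^alpha, S)
    P_marginal = np.einsum('es,neas->nas', Pi_e, P_env)
    assert np.allclose(P_marginal.sum(axis=0), 1.0)  # still a distribution over S'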

Our goal is to find a policy that optimizes the allies' expected joint reward, which can be formulated as follows (environment dynamics are implicitly involved in the sampling):

$$\max_{\Pi^{\alpha}}\;\mathbb{E}_{(A^{\alpha},S)\sim\Pi^{\alpha}}\big[R^{\alpha}(S,A^{\alpha})\big] \tag{1}$$

The game dynamics involve both the environment dynamics and the joint policy of the enemies, making training costly to converge. Our idea is, therefore, to mitigate the uncertainties associated with these game dynamics by first predicting the opponents' behavior based on past observations of the allies and then leveraging this prediction to guide the policy training for the allies.

Opponent Behavior Prediction

The key challenge in our problem setting is that actions taken by opponents are hidden from allied agents. Moreover, each allied agent has limited observations of the other agents in the environment; it can only obtain information about nearby opponents. For example, in the SMAC environment, besides information about itself, each allied agent is also aware of the relative position, relative distance, health points, etc., of the neighboring opponents (the neighborhood of each allied agent changes over time depending on the actions of all agents and the dynamics of the environment). Therefore, instead of directly predicting opponents' next moves, we focus on anticipating the next states of opponents; this next-state prediction serves as an indication of what actions have been taken by the neighboring opponents.

Our key contributions here include: (i) a novel representation of this opponent behavior prediction (or next-state prediction) problem in the form of multi-agent imitation learning; (ii) a new adaptation of IQ-Learn (a SOTA method for the standard single-agent imitation learning problem) to solve our new imitation learning problem; (iii) a comprehensive theoretical analysis of the influence of the allied agents' learning policies on the next-state prediction outcomes; and (iv) a practical multi-agent imitation learning algorithm tailored to the local observations of allied agents.

State-Only Multi-Agent Imitation Learning

In this section, we first present our approach of modeling the problem of predicting the opponents' behavior as state-only multi-agent IL. We then describe our adaptation of IQ-Learn for solving this new multi-agent IL problem. (For ease of presentation, this section focuses on multi-agent IL with global state-only observations in which the actions of opponents are hidden from the allies; in the next section we introduce a practical algorithm that localizes this global state-based IL algorithm to the local observations of allied agents.)

Opponent Behavior Prediction as Multi-Agent IL.

To formulate the problem as multi-agent IL, we introduce a new notion of a joint reward function for the enemies, $R^{e}(S'|S,A^{\alpha})$, associated with a tuple $(S',A^{\alpha},S)$, where $(S,S')$ are the current and next global states and $A^{\alpha}$ is the joint action taken by the allies. Intuitively, the next state $S'$ is a joint "action" of the enemies in this context. This action ($S'$) is essentially a result of the allies' joint action $A^{\alpha}$ and the hidden joint action of the enemies $A^{e}$ in the original multi-agent POMDP. Together with the reward function $R^{e}(S'|S,A^{\alpha})$, we introduce a new notion of a joint policy for the enemies, $\Pi^{e}(S'|S,A^{\alpha})$, which is essentially the probability of ending up at a global state $S'$ from state $S$ when the allies take action $A^{\alpha}$. Let $\boldsymbol{\Pi}$ be the support set of the imitating policy, i.e., $\boldsymbol{\Pi}=\{\Pi:\mathcal{S}\times\mathcal{S}\times\mathcal{A}^{\alpha}\rightarrow[0,1],~\sum_{S'\in\mathcal{S}}\Pi(S'|S,A^{\alpha})=1,~\forall S\in\mathcal{S},A^{\alpha}\in\mathcal{A}^{\alpha}\}$.

We now introduce the maximum-entropy inverse RL framework (ho2016generative) with respect to the new notions of the enemies' reward and policy, $(R^{e}(S'|S,A^{\alpha}),\Pi^{e}(S'|S,A^{\alpha}))$, as follows:

$$\max_{R^{e}}\min_{\Pi\in\boldsymbol{\Pi}}\Big\{L(R^{e},\Pi)=\mathbb{E}_{\rho_{E}^{e}}\big[R^{e}(S'|S,A^{\alpha})\big]-\mathbb{E}_{\rho_{\Pi}^{e}}\big[R^{e}(S'|S,A^{\alpha})\big]+\mathbb{E}_{\rho_{\Pi}^{e}}\big[\ln\Pi(S'|S,A^{\alpha})\big]\Big\} \tag{2}$$

where $\rho_{E}^{e}$ is the occupancy measure of the enemies' joint policy $\Pi^{e}$, $\rho_{\Pi}^{e}$ is the occupancy measure of the imitating policy $\Pi$, and the last term is the entropy regularizer.
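As a minimal sketch, the inner objective in (2) can be estimated from samples of the two occupancy measures; the interfaces reward_net and pi_net below are hypothetical placeholders with assumed call signatures, shown only for illustration:

    def irl_objective(reward_net, pi_net, expert_batch, policy_batch):
        # Monte-Carlo estimate of L(R^e, Pi) in Eq. (2).
        # expert_batch / policy_batch: tuples (S, A_alpha, S_next) of tensors sampled
        # from the occupancy measures rho_E^e and rho_Pi^e, respectively.
        # reward_net(S_next, S, A_alpha)    -> R^e(S'|S, A^alpha)      (hypothetical)
        # pi_net.log_prob(S_next, S, A_alpha) -> ln Pi(S'|S, A^alpha)  (hypothetical)
        S_E, A_E, Sn_E = expert_batch
        S_P, A_P, Sn_P = policy_batch
        expert_reward = reward_net(Sn_E, S_E, A_E).mean()      # E_{rho_E^e}[R^e]
        policy_reward = reward_net(Sn_P, S_P, A_P).mean()      # E_{rho_Pi^e}[R^e]
        log_prob_term = pi_net.log_prob(Sn_P, S_P, A_P).mean() # E_{rho_Pi^e}[ln Pi]
        return expert_reward - policy_reward + log_prob_term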

An Adaptation of IQ-Learn.

Drawing inspiration from IQ-Learn, a state-of-the-art IL algorithm, we construct our IL algorithm as an adaptation of IQ-Learn tailored to our multi-agent competitive game environment.

The main idea of IQ-Learn is to convert a reward learning problem into a Q-function learning one. To apply IQ-Learn to our setting, we introduce the following new soft and inverse soft Bellman operators, which are modifications of the original ones introduced in (garg2021iq):

$$\mathcal{B}^{\Pi,R^{e}}_{Q^{e}}(S'|S,A^{\alpha})=R^{e}(S'|S,A^{\alpha})+\gamma V^{e}_{\Pi}(S') \tag{3}$$
$$\mathcal{T}^{\Pi}_{Q^{e}}(S'|S,A^{\alpha})=Q^{e}(S'|S,A^{\alpha})-\gamma V^{e}_{\Pi}(S') \tag{4}$$

where

$$V^{e}_{\Pi}(S)=\mathbb{E}_{(A^{\alpha},S')\sim(\Pi^{\alpha},\Pi)}\Big[Q^{e}(S'|S,A^{\alpha})-\ln\Pi(S'|S,A^{\alpha})\Big]$$

The key difference between Equations 3 and 4 and the original ones presented in IQ-Learn is that the actions defined for IL in our setting coincide with next states, so no expectation over future returns needs to be computed. This difference raises the important question of whether the properties and theoretical results of the original IQ-Learn still hold in our new problem setting. Interestingly, we show that the key properties of IQ-Learn carry over to this new setting.
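Spelled out in code, the modified operators treat the next state $S'$ as the enemies' "action"; the callables R_e, Q_e, V_e, and log_Pi below are hypothetical stand-ins for $R^{e}$, $Q^{e}$, $V^{e}_{\Pi}$, and $\ln\Pi$, shown as a sketch only:

    def soft_bellman(R_e, V_e, s_next, s, a_alpha, gamma=0.99):
        # Soft Bellman operator, Eq. (3): B^{Pi,R^e} Q^e (S'|S,A^alpha).
        return R_e(s_next, s, a_alpha) + gamma * V_e(s_next)

    def inverse_soft_bellman(Q_e, V_e, s_next, s, a_alpha, gamma=0.99):
        # Inverse soft Bellman operator, Eq. (4): T^Pi Q^e (S'|S,A^alpha).
        return Q_e(s_next, s, a_alpha) - gamma * V_e(s_next)

    def soft_value(Q_e, log_Pi, samples):
        # V^e_Pi(S) as a sample average of Q^e - ln Pi over (A^alpha, S') drawn
        # from (Pi^alpha, Pi) at a fixed state S; samples: iterable of (S, A^alpha, S').
        vals = [Q_e(sn, s, a) - log_Pi(sn, s, a) for s, a, sn in samples]
        return sum(vals) / len(vals)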

First, it can be seen that the soft Bellman operator $\mathcal{B}^{\Pi,R^{e}}_{Q^{e}}$ is contractive, thus defining a unique fixed-point solution $Q^{*}$ such that $\mathcal{B}^{\Pi,R^{e}}_{Q^{*}}=Q^{*}$. Let us further define the following function of $\Pi$ and $Q^{e}$ for the Q-function learning:

$$J(\Pi,Q^{e})=\mathbb{E}_{\rho_{E}^{e}}\big[\mathcal{T}^{\Pi}_{Q^{e}}(S'|S,A^{\alpha})\big]-\mathbb{E}_{\rho_{\Pi}^{e}}\big[\mathcal{T}^{\Pi}_{Q^{e}}(S'|S,A^{\alpha})\big]+\mathbb{E}_{\rho_{\Pi}^{e}}\big[\ln\Pi(S'|S,A^{\alpha})\big]$$

We obtain a result similar to that of IQ-Learn, showing a connection between reward learning and Q-function learning (all proofs are in the appendix):

Proposition 1.

For any reward function $R^{e}$, let $Q^{*}$ be the unique fixed-point solution to the soft Bellman equation $\mathcal{B}^{\Pi,R^{e}}_{Q^{*}}=Q^{*}$; then we have $L(\Pi,R^{e})=J(\Pi,Q^{*})$, and for any $Q^{e}$, $J(\Pi,Q^{e})=L(\Pi,\mathcal{T}^{\Pi}_{Q^{e}})$.

Proposition 1 indicates that the reward learning problem $\max_{R^{e}}\min_{\Pi}L(\Pi,R^{e})$ can be converted equivalently into the Q-learning problem $\max_{Q^{e}}\min_{\Pi}J(\Pi,Q^{e})$. Suppose $Q^{*}$ is a solution to the Q-learning problem; then rewards can be recovered by taking $R^{e}(S'|S,A^{\alpha})=Q^{*}(S'|S,A^{\alpha})-\gamma V^{e}_{\Pi}(S')$. Furthermore, Proposition 2 below shows that the function $J(\Pi,Q^{e})$ can be reformulated in a more compact form that is convenient for training.

Proposition 2.

The function $J(\cdot)$ can be written as follows:

$$J(\Pi,Q^{e})=\mathbb{E}_{(S',A^{\alpha},S)\sim\rho^{e}_{E}}\Big[Q^{e}(S'|S,A^{\alpha})-\gamma V^{e}_{\Pi}(S')\Big]+(1-\gamma)\,\mathbb{E}_{S^{0}\sim P^{0}}\big[V^{e}_{\Pi}(S^{0})\big] \tag{5}$$

where $S^{0}$ is an initial state drawn from the initial-state distribution $P^{0}$. Under this viewpoint, we show that the key properties of classical IQ-Learn still hold in the context of our multi-agent setting with missing observations (Proposition 3), making our IQ-Learn version convenient to use.
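A minimal sketch of how (5) could be estimated from expert transitions and initial states follows; Q_net and V_of are hypothetical callables standing in for $Q^{e}$ and $V^{e}_{\Pi}$:

    def iq_objective(Q_net, V_of, expert_batch, init_states, gamma=0.99):
        # Monte-Carlo estimate of J(Pi, Q^e) in Eq. (5).
        # expert_batch: (S, A_alpha, S_next) drawn from the expert occupancy rho_E^e.
        # init_states:  a batch of initial states S^0 ~ P^0.
        # Q_net(S_next, S, A_alpha) -> Q^e(S'|S, A^alpha); V_of(S) -> V^e_Pi(S).
        S, A, Sn = expert_batch
        expert_term = (Q_net(Sn, S, A) - gamma * V_of(Sn)).mean()
        init_term = (1.0 - gamma) * V_of(init_states).mean()
        return expert_term + init_term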

Proposition 3.

The problem $\max_{Q^{e}}\min_{\Pi}J(\Pi,Q^{e})$ is equivalent to the maximization $\max_{Q^{e}}J(\Pi^{Q},Q^{e})$, where

$$\Pi^{Q}(S'|S,A^{\alpha})=\frac{\exp\big(Q^{e}(S'|S,A^{\alpha})\big)}{\sum_{S''}\exp\big(Q^{e}(S''|S,A^{\alpha})\big)}$$

Moreover, the function $J(\Pi^{Q},Q^{e})$ is concave in $Q^{e}$.

Corollary 4.

If $\Pi=\Pi^{Q}$ (defined in Proposition 3), then we can write $V^{e}_{\Pi}(S)$ as follows:

$$V^{e}_{\Pi}(S)=V^{e}_{Q}(S)=\sum_{A^{\alpha}}\Pi^{\alpha}(A^{\alpha}|S)\,\ln\Big(\sum_{S'}\exp\big(Q^{e}(S'|S,A^{\alpha})\big)\Big)$$
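For a finite set of candidate next states, $\Pi^{Q}$ is simply a softmax of $Q^{e}$ over $S'$ and $V^{e}_{Q}$ an average log-sum-exp, as the following sketch (with a hypothetical tensor of Q-values) illustrates:

    import torch
    import torch.nn.functional as F

    # Hypothetical Q-values for illustration:
    # q[b, k] = Q^e(S'_k | S_b, A^alpha_b) over K candidate next states,
    # where each row b plays the role of one sampled (S, A^alpha) pair.
    q = torch.randn(32, 10)

    # Proposition 3: Pi^Q(S'|S, A^alpha) is the softmax of Q^e over next states.
    pi_Q = F.softmax(q, dim=-1)

    # Corollary 4: V^e_Q(S) averages log-sum-exp of Q^e over samples A^alpha ~ Pi^alpha.
    v_Q = torch.logsumexp(q, dim=-1).mean()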

Effect of Allies' Policies on Imitation Learning

The above IL algorithm is trained during the training of the allies' policies; thus, the game dynamics change during the IL process. We analyze the impact of these changes on the imitation policy. To this end, we consider the loss function of the imitation model as a function of the allies' joint policy:

$$\Phi(\Pi^{\alpha}|Q^{e})=\mathbb{E}_{(S',A^{\alpha},S)\sim(\Pi^{e},\Pi^{\alpha})}\big[Q^{e}(S'|S,A^{\alpha})-\gamma V^{e}_{Q}(S')\big]-(1-\gamma)\,\mathbb{E}_{S^{0}\sim P^{0}}\big[V^{e}_{Q}(S^{0})\big]$$

In the following, we first present our theoretical results bounding the impact of the allies' changing policies on the imitation learning loss function (Proposition 5 for a discrete state space and Proposition 6 for a continuous one). Based on these results, we then provide a theoretical bound on the IL outcome (Corollary 7).

Proposition 5 (Finite state space $\mathcal{S}$).

Given two allies' joint policies $\Pi^{\alpha}$ and $\widetilde{\Pi}^{\alpha}$ such that $\textsc{KL}(\Pi^{\alpha}(\cdot|S)\,\|\,\widetilde{\Pi}^{\alpha}(\cdot|S))\leq\epsilon$ for any state $S\in\mathcal{S}$, the following inequality holds:

$$\left|\Phi(\Pi^{\alpha}|Q^{e})-\Phi(\widetilde{\Pi}^{\alpha}|Q^{e})\right|\leq\left(\frac{\gamma+1+(1-\gamma)^{3}}{(1-\gamma)^{2}}\overline{Q}+\frac{1+(1-\gamma)^{3}}{(1-\gamma)^{2}}\ln|\mathcal{S}|\right)\sqrt{2\ln 2\epsilon}$$

where $\overline{Q}=\max_{(S'_{t},S_{t},A^{\alpha}_{t})}Q^{e}(S'_{t}|S_{t},A^{\alpha}_{t})$ is an upper bound of $Q^{e}$.

Under a continuous state space, an actor-critic method is used since the policy $\Pi^{Q}$ cannot be computed exactly. In this case, one can use an explicit policy $\Pi$ to approximate $\Pi^{Q}$. We then alternately update $Q^{e}$ and $\Pi$ using the loss function in Proposition 2. In particular, for a fixed $Q^{e}$, a soft actor update $\max_{\Pi}\mathbb{E}_{S'\sim\Pi(\cdot|S,A^{\alpha})}\big[Q^{e}(S'|S,A^{\alpha})-\ln\Pi(S'|S,A^{\alpha})\big]$ will bring $\Pi$ closer to $\Pi^{Q}$.

Let $\Phi(\Pi^{\alpha}|\Pi,Q^{e})$ be the objective of IL in (5), written as a function of the allies' joint policy $\Pi^{\alpha}$. The following proposition establishes a bound on the variation $|\Phi(\Pi^{\alpha}|\Pi,Q^{e})-\Phi(\widetilde{\Pi}^{\alpha}|\Pi,Q^{e})|$ as a function of $\textsc{KL}(\Pi^{\alpha}\,\|\,\widetilde{\Pi}^{\alpha})$.

Proposition 6 (Continuous state space $\mathcal{S}$).

Given two allies' joint policies $\Pi^{\alpha}$ and $\widetilde{\Pi}^{\alpha}$ such that for every state $S\in\mathcal{S}$, $\textsc{KL}(\Pi^{\alpha}(\cdot|S)\,\|\,\widetilde{\Pi}^{\alpha}(\cdot|S))\leq\epsilon$, the following inequality holds:

$$\left|\Phi(\Pi^{\alpha}|Q^{e},\Pi)-\Phi(\widetilde{\Pi}^{\alpha}|Q^{e},\Pi)\right|\leq\left(\frac{\gamma+1+(1-\gamma)^{3}}{(1-\gamma)^{2}}\overline{Q}-\frac{1+(1-\gamma)^{3}}{(1-\gamma)^{2}}H\right)\sqrt{2\ln 2\epsilon}$$

where $\overline{Q}=\sup_{(S'_{t},S_{t},A^{\alpha}_{t})}Q^{e}(S'_{t}|S_{t},A^{\alpha}_{t})$ is an upper bound of $Q^{e}(\cdot|\cdot,\cdot)$ and $H=\inf_{(S,A^{\alpha})}\sum_{S'}\Pi(S'|S,A^{\alpha})\ln\Pi(S'|S,A^{\alpha})$ is a lower bound of the negative entropy of the actor policy $\Pi(\cdot|\cdot,\cdot)$.

Propositions 5 and 6 allow us to establish an upper bound on the imitation learning objective when the allies' joint policy changes.

Corollary 7.

Given two allies' joint policies $\Pi^{\alpha}$ and $\widetilde{\Pi}^{\alpha}$ with $\textsc{KL}(\Pi^{\alpha}(\cdot|S)\,\|\,\widetilde{\Pi}^{\alpha}(\cdot|S))\leq\epsilon$ for all $S\in\mathcal{S}$, the following holds:

$$\Big|\max_{Q^{e}}\big\{\Phi(\Pi^{\alpha}|Q^{e})\big\}-\max_{Q^{e}}\big\{\Phi(\widetilde{\Pi}^{\alpha}|Q^{e})\big\}\Big|\leq\mathcal{O}(\sqrt{\epsilon})\quad\text{(discrete)}$$
$$\Big|\max_{Q^{e}}\min_{\Pi}\big\{\Phi(\Pi^{\alpha}|Q^{e},\Pi)\big\}-\max_{Q^{e}}\min_{\Pi}\big\{\Phi(\widetilde{\Pi}^{\alpha}|Q^{e},\Pi)\big\}\Big|\leq\mathcal{O}(\sqrt{\epsilon})\quad\text{(continuous)}$$

Since the allies' joint policy $\Pi^{\alpha}$ changes during our policy learning process, the above result implies that the imitating policy will be stable if $\Pi^{\alpha}$ becomes stable; moreover, if $\Pi^{\alpha}$ converges to a target policy $\Pi^{\alpha*}$, then the imitator's policy also converges to the one trained with the target ally policy at a rate of $\sqrt{\textsc{KL}(\Pi^{\alpha}\,\|\,\Pi^{\alpha*})}$. That is, if the actual policy is within an $\mathcal{O}(\epsilon)$ neighborhood of the target policy (i.e., $\textsc{KL}(\Pi^{\alpha}\,\|\,\Pi^{\alpha*})\leq\epsilon$), then the expected return of the imitating policy is within an $\mathcal{O}(\sqrt{\epsilon})$ neighborhood of the desired expected return given by the target policy.

Figure 1: An overview of our IMAX-PPO algorithm. Each local observation oαio^{\alpha}_{i} of an ally agent ii includes information about itself, as well as enemy and ally agents in its neighborhood (which changes over time). The output of the IL component is the predicted next states of neighboring enemy agents (predictions for the non-neighbor enemies will be masked out).

IMAX-PPO: Imitation-enhanced Multi-Agent EXtended PPO Algorithm

We present our MARL algorithm for the competitive game setting. We first focus on a practical implementation of the imitation learning algorithm that takes local observations into account. We then show how to integrate it into our MARL algorithm. We call our algorithm IMAX-PPO, standing for Imitation-enhanced Multi-Agent EXtended PPO.

Imitation Learning with Local Observations

In the previous section, we presented our new imitation learning algorithm (an adaptation of IQ-Learn) to learn an enemy policy $\widehat{\Pi}^{e}(S'|S,A^{\alpha})$ that approximates the probability of ending up at state $S'$ when the current global state is $S$ and the allies' joint action is $A^{\alpha}$. Here, $S$ represents all the information that the enemy agents can access. In our multi-agent game setting, however, ally agents do not have access to the observations of the enemies. Instead, each ally agent $i$ only has information about the neighboring enemies observable by agent $i$ (such as their locations, speeds, etc.).

Therefore, we adapt our IL algorithm to such available local information. That is, the aim now is to optimize policies $\widehat{\Pi}^{e}_{\psi_{\pi}}(S^{e,\text{next}}_{i}|o^{\alpha}_{i},a^{\alpha}_{i})$, $i\in\mathcal{N}_{\alpha}$, where $o^{\alpha}_{i}$ is an observation vector of agent $i$, containing the local states of agent $i$ itself as well as of all the agents in the neighborhood that are observable by agent $i$. In particular,

$$o^{\alpha}_{i}=\{s^{\alpha}_{i}\}\cup\{s^{\alpha}_{i_{k}}:i_{k}\in N(i)\cap\mathcal{N}_{\alpha}\}\cup\{s^{e}_{j_{k}}:j_{k}\in N(i)\cap\mathcal{N}_{e}\}$$

where $s^{\alpha}_{i_{k}}$ is the local state of an ally agent $i_{k}$, $s^{e}_{j_{k}}$ is the local state of an enemy agent $j_{k}$, and $N(i)$ is the neighborhood of agent $i$.
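A sketch of how $o^{\alpha}_{i}$ could be assembled from hypothetical per-agent state containers follows (the actual feature layout is environment-specific and not specified here):

    def local_observation(i, ally_states, enemy_states, neighbors):
        # ally_states / enemy_states: dicts mapping agent ids to state vectors
        # (ally and enemy ids are assumed to be distinct keys).
        # neighbors: the set N(i) of agent ids currently observable by agent i.
        obs = [ally_states[i]]                                        # s^alpha_i
        obs += [ally_states[k] for k in neighbors
                if k in ally_states and k != i]                       # neighboring allies
        obs += [enemy_states[j] for j in neighbors
                if j in enemy_states]                                 # neighboring enemies
        return obs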

To apply our multi-agent IL to this local-observation setting, we build a common policy network $\widehat{\Pi}^{e}_{\psi_{\pi}}$ and a common Q network $\widehat{Q}^{e}_{\psi_{Q}}$ for all the agents, where $\psi_{\pi}$ and $\psi_{Q}$ are the network parameters. The objective function for IL can be reformulated according to the local observations of the allies as follows:

$$J(\widehat{\Pi}^{e}_{\psi_{\pi}},\widehat{Q}^{e}_{\psi_{Q}})=\sum_{i\in\mathcal{N}_{\alpha}}\mathbb{E}_{(S^{e,\text{next}}_{i},a^{\alpha}_{i},o^{\alpha}_{i})\sim\rho^{e}_{E}}\Big[\widehat{Q}^{e}(S^{e,\text{next}}_{i}|o^{\alpha}_{i},a^{\alpha}_{i})-\gamma V^{e}_{\Pi}(S^{e,\text{next}}_{i})\Big]+(1-\gamma)\,\mathbb{E}_{S^{e,0}_{i}\sim P^{0}}\big[V^{e}_{\Pi}(S^{e,0}_{i})\big] \tag{6}$$

where $S^{e,\text{next}}_{i}=\{s'^{e}_{j_{k}}:j_{k}\in N(i)\cap\mathcal{N}_{e}\}$ denotes the next states of the enemies in the current neighborhood of ally agent $i$. In addition, $V^{e}_{\Pi}(S^{e,\text{next}}_{i})=\mathbb{E}_{(a^{\alpha}_{i},S^{e,\text{next}}_{i})\sim(\Pi^{\alpha},\widehat{\Pi}^{e}_{\psi_{\pi}})}\big[\widehat{Q}^{e}_{\psi_{Q}}(S^{e,\text{next}}_{i}|o^{\alpha}_{i},a^{\alpha}_{i})-\ln\big(\widehat{\Pi}^{e}_{\psi_{\pi}}(S^{e,\text{next}}_{i}|o^{\alpha}_{i},a^{\alpha}_{i})\big)\big]$. We update $\widehat{\Pi}^{e}_{\psi_{\pi}}$ and $\widehat{Q}^{e}_{\psi_{Q}}$ by the following actor-critic rule: for a fixed $\widehat{\Pi}^{e}_{\psi_{\pi}}$, update $\psi_{Q}$ to maximize $J(\widehat{\Pi}^{e}_{\psi_{\pi}},\widehat{Q}^{e}_{\psi_{Q}})$; and for a fixed $\widehat{Q}^{e}_{\psi_{Q}}$, apply a soft actor-critic (SAC) style update to $\psi_{\pi}$.
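The localized actor-critic update can be sketched as follows; the interfaces Q_net, pi_net.rsample, and pi_net.log_prob are hypothetical stand-ins for $\widehat{Q}^{e}_{\psi_{Q}}$ and $\widehat{\Pi}^{e}_{\psi_{\pi}}$, and the initial-state term of (6) is omitted for brevity:

    def imitator_update(Q_net, pi_net, q_opt, pi_opt, batch, gamma=0.99):
        # batch: tensors (o_i, a_i, S^{e,next}_i) for a mini-batch of ally agents,
        # sampled from the replay buffer of collected transitions.
        o, a, s_next = batch

        def soft_value(o, a):
            # One-sample estimate of V^e_Pi = E_{S'~pi}[Q^e(S'|o,a) - ln pi(S'|o,a)],
            # following the definition of V^e_Pi given below objective (6).
            s_pred = pi_net.rsample(o, a)          # hypothetical reparameterized sample
            return Q_net(s_pred, o, a) - pi_net.log_prob(s_pred, o, a)

        # Critic step: ascend J in psi_Q with the imitating policy held fixed
        # (each optimizer only updates its own parameters, and its gradients are
        # zeroed before each step).
        j = (Q_net(s_next, o, a) - gamma * soft_value(o, a)).mean()
        q_opt.zero_grad()
        (-j).backward()
        q_opt.step()

        # Actor step (SAC-style soft actor update): bring pi closer to Pi^Q by
        # maximizing E_{S'~pi}[Q^e(S'|o,a) - ln pi(S'|o,a)] with Q held fixed.
        actor_loss = -soft_value(o, a).mean()
        pi_opt.zero_grad()
        actor_loss.backward()
        pi_opt.step()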

IMAX-PPO Algorithm

We aim to optimize the allies' policy $\Pi^{\alpha}_{\theta}(a^{\alpha}_{i}|o^{\alpha}_{i},S^{e,\text{next}}_{i})$, $i\in\mathcal{N}_{\alpha}$, so as to maximize the long-term expected joint reward:

$$\max_{\Pi^{\alpha}_{\theta}}\mathbb{E}_{(a^{\alpha}_{i},o^{\alpha}_{i},S^{e,\text{next}}_{i})\sim\Pi^{\alpha}_{\theta}}\Big[\sum_{i\in\mathcal{N}_{\alpha}}R^{\alpha}_{i}(a^{\alpha}_{i},o^{\alpha}_{i})\Big]$$

where $o^{\alpha}_{i}$ is the observation vector of agent $i$, $S^{e,\text{next}}_{i}$ is the information derived from the imitator for agent $i$, and $a^{\alpha}_{i}\in\mathcal{A}^{\alpha}_{i}$ is a local action of agent $i$. To facilitate the training and integration of the imitation learning policy into the MARL algorithm, for every ally agent $i$ we gather game trajectories following the structure $(o^{\alpha}_{i},a^{\alpha}_{i},S^{e,\text{next}}_{i})$. These observations are then stored in a replay buffer used to train the imitation policy $\widehat{\Pi}^{e}_{\psi_{\pi}}(S^{e,\text{next}}_{i}|o^{\alpha}_{i},a^{\alpha}_{i})$.

To integrate the imitator into the IMAX-PPO framework, we include the predicted next states of the enemy agents as inputs for training the allies' policy $\Pi^{\alpha}$. Specifically, at each game state $S$, given the current actor policy $\Pi^{\alpha}$ and the imitating policy $\widehat{\Pi}^{e}$, we draw a sample of the allied agents' joint action $\widetilde{A}^{\alpha}\sim\Pi^{\alpha}$. The corresponding local observation $o^{\alpha}_{i}$ and action $\widetilde{a}^{\alpha}_{i}$ of each agent $i\in\mathcal{N}_{\alpha}$ are then fed into the imitation policy to predict the subsequent enemy states $S^{e,\text{next}}_{i}\sim\widehat{\Pi}^{e}(\cdot|o^{\alpha}_{i},\widetilde{a}^{\alpha}_{i})$. Once the predicted local states $\{S^{e,\text{next}}_{i},\,i\in\mathcal{N}_{\alpha}\}$ are available, they are used as inputs to the actor policy $\Pi^{\alpha}_{\theta}$ to generate new actions for the allied agents; that is, we select the next local action $a'^{\alpha}_{i}\sim\Pi^{\alpha}_{\theta}(\cdot|o^{\alpha}_{i},S^{e,\text{next}}_{i})$. Besides the allies' policy network, we also use a centralized value network $V^{\alpha}_{\theta_{v}}(S)$ and update it together with the policy network in an actor-critic manner, similarly to MAPPO. The actor network is trained by optimizing the following objective:
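One environment step of this two-stage action selection can be sketched as follows; actor and imitator are hypothetical wrappers around $\Pi^{\alpha}_{\theta}$ and $\widehat{\Pi}^{e}_{\psi_{\pi}}$, and how the tentative action is conditioned at the very first step is an assumption of the sketch:

    def imax_ppo_step(actor, imitator, observations, prev_prediction):
        # observations: {agent id i: o^alpha_i}; prev_prediction: {i: last S^{e,next}_i}
        # (used only to condition the tentative sample; e.g., zeros at the first step).
        # 1) Tentative joint action \tilde{A}^alpha ~ Pi^alpha.
        tentative = {i: actor.sample(observations[i], prev_prediction[i])
                     for i in observations}
        # 2) Predict next states of each agent's neighboring enemies,
        #    S^{e,next}_i ~ \hat{Pi}^e(. | o^alpha_i, \tilde{a}^alpha_i).
        prediction = {i: imitator.sample(observations[i], tentative[i])
                      for i in observations}
        # 3) Executed actions a'^alpha_i ~ Pi^alpha_theta(. | o^alpha_i, S^{e,next}_i).
        actions = {i: actor.sample(observations[i], prediction[i])
                   for i in observations}
        return actions, prediction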

$$L^{\alpha}(\theta)=\sum_{i\in\mathcal{N}_{\alpha}}\mathbb{E}_{o^{\alpha}_{i},a^{\alpha}_{i},S^{e,\text{next}}_{i}}\Big[\min\big\{r_{i}(\theta)\widehat{A},\;\text{clip}(r_{i}(\theta),1-\epsilon,1+\epsilon)\widehat{A}\big\}\Big] \tag{7}$$

where $r_{i}(\theta)=\frac{\Pi^{\alpha}_{\theta}(a^{\alpha}_{i}|o^{\alpha}_{i},S^{e,\text{next}}_{i})}{\Pi^{\alpha}_{\theta_{old}}(a^{\alpha}_{i}|o^{\alpha}_{i},S^{e,\text{next}}_{i})}$ and $\widehat{A}$ is the advantage function, calculated by Generalized Advantage Estimation (GAE). The critic network is trained by optimizing

$$\Phi^{\alpha}(\theta_{v})=\mathbb{E}_{S}\Big[\max\big\{(V^{\alpha}_{\theta_{v}}(S)-\widehat{R}(S))^{2},\;(V^{\text{clip}}_{\theta_{v},\theta_{v,old}}(S)-\widehat{R}(S))^{2}\big\}\Big]$$

where $\widehat{R}(S)=\widehat{A}+V^{\alpha}_{\theta_{v,old}}(S)$ and $V^{\text{clip}}_{\theta_{v},\theta_{v,old}}(S)=\text{clip}\big(V^{\alpha}_{\theta_{v}}(S),\,V^{\alpha}_{\theta_{v,old}}(S)-\epsilon,\,V^{\alpha}_{\theta_{v,old}}(S)+\epsilon\big)$. We provide the key stages in Algorithm 1 below. Additionally, Fig. 1 serves as an illustration of our IMAX-PPO algorithm.
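Both training objectives reduce to a few tensor operations; the sketch below assumes flat per-sample tensors (log-probabilities under the new and old policies, GAE advantages, value predictions, and returns $\widehat{R}$) and is only an illustration of (7) and the clipped critic loss:

    import torch

    def imax_ppo_losses(new_logp, old_logp, adv, values, old_values, returns, eps=0.2):
        # r_i(theta) from log-probabilities of Pi^alpha_theta and Pi^alpha_theta_old.
        ratio = torch.exp(new_logp - old_logp)
        # Clipped surrogate objective L^alpha(theta), Eq. (7) (to be maximized).
        actor_obj = torch.min(ratio * adv,
                              torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv).mean()
        # Clipped value loss Phi^alpha(theta_v) with
        # V^clip = clip(V, V_old - eps, V_old + eps).
        v_clip = old_values + torch.clamp(values - old_values, -eps, eps)
        critic_loss = torch.max((values - returns) ** 2,
                                (v_clip - returns) ** 2).mean()
        return actor_obj, critic_loss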

Algorithm 1 IMAX-PPO Algorithm

Input: Initial allies' policy network $\Pi^{\alpha}_{\theta}$, initial allies' value network $V^{\alpha}_{\theta_{v}}$, initial imitator's policy network $\widehat{\Pi}^{e}_{\psi_{\pi}}$, initial imitator's Q network $\widehat{Q}^{e}_{\psi_{Q}}$, learning rates $\kappa^{e}_{\pi},\kappa^{e}_{Q},\kappa^{\alpha}_{\pi},\kappa^{\alpha}_{V}$.
Output: Trained allies' policy $\Pi^{\alpha}_{\theta}$

1:  for $t=0,1,\ldots$ do
2:     # Updating imitator:
3:     # Train the Q function using the objective in (6)
4:     $\psi_{Q,t+1}=\psi_{Q,t}+\kappa^{e}_{Q}\nabla_{\psi_{Q}}[J(\psi_{Q})]$
5:     Update the policy $\widehat{\Pi}^{e}_{\psi_{\pi}}$ with a SAC-style actor update (for continuous domains)
6:     $\psi_{\pi,t+1}=\psi_{\pi,t}-\kappa^{e}_{\pi}\nabla_{\psi_{\pi}}\mathbb{E}_{S^{e,\text{next}}_{i}}[V^{e}_{\Pi}(S^{e,\text{next}}_{i})]$
7:     # Updating allies' policy:
8:     # Update the allies' actor by maximizing $L^{\alpha}(\theta)$
9:     $\theta_{t+1}=\theta_{t}+\kappa^{\alpha}_{\pi}\nabla_{\theta}L^{\alpha}(\theta)$
10:    # Update the allies' critic by minimizing $\Phi^{\alpha}(\theta_{v})$
11:    $\theta_{v,t+1}=\theta_{v,t}-\kappa^{\alpha}_{V}\nabla_{\theta_{v}}\Phi^{\alpha}(\theta_{v})$
12: return the trained allies' policy $\Pi^{\alpha}_{\theta}$
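Finally, Algorithm 1 can be read as the following training-loop skeleton; all helper callables passed in (collect_rollout, update_imitator, ppo_update) are hypothetical placeholders for the steps sketched above:

    def train_imax_ppo(env, actor, critic, imitator_pi, imitator_q, optimizers,
                       replay_buffer, collect_rollout, update_imitator, ppo_update,
                       num_iterations, batch_size):
        for t in range(num_iterations):
            # Roll out the current allies' policy together with the imitator and
            # store per-agent tuples (o_i, a_i, S^{e,next}_i) for imitation learning.
            rollout = collect_rollout(env, actor, imitator_pi)
            replay_buffer.extend(rollout.imitation_tuples)

            # Updating imitator: critic step on objective (6), then a SAC-style actor step.
            update_imitator(imitator_q, imitator_pi, optimizers,
                            replay_buffer.sample(batch_size))

            # Updating allies' policy: maximize L^alpha(theta) and minimize
            # Phi^alpha(theta_v) using GAE advantages computed from the rollout.
            ppo_update(actor, critic, optimizers, rollout)
        return actor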