Mimicking To Dominate: Imitation Learning Strategies for Success in Multiagent Competitive Games
Abstract
Training agents in multi-agent competitive games presents significant challenges due to their intricate nature. These challenges are exacerbated by dynamics influenced not only by the environment but also by the opponents' strategies. Existing methods often struggle with slow convergence and instability. To address this, we harness the potential of imitation learning to comprehend and anticipate opponents' behavior, aiming to mitigate uncertainties with respect to the game dynamics. Our key contributions include: (i) a new multi-agent imitation learning model for predicting the opponents' next moves — our model works with hidden opponent actions and local observations; (ii) a new multi-agent reinforcement learning algorithm that combines our imitation learning model and policy training into one single training process; and (iii) extensive experiments in three challenging game environments, including an advanced version of the StarCraft Multi-Agent Challenge (i.e., SMACv2). Experimental results show that our approach achieves superior performance compared to existing state-of-the-art multi-agent RL algorithms.
Introduction
Recent works in multi-agent reinforcement learning (MARL) have made significant progress in developing new, effective MARL algorithms that can perform well in complex multi-agent environments, including SMAC (StarCraft Multi-Agent Challenge) (yu2022surprising; samvelyan2019starcraft). Among these works, centralized training and decentralized execution (CTDE) (foerster2018counterfactual) has attracted significant attention from the RL research community due to its advantage of leveraging global information to train a centralized critic (a.k.a. actor-critic methods (lowe2017multi)) or a joint Q-function (a.k.a. value-decomposition methods (rashid2020monotonic; sunehag2017value)). This approach enables a more efficient and stable learning process while still allowing agents to act in a decentralized manner. Under this CTDE framework, off-policy methods such as MADDPG (lowe2017multi) and QMIX (rashid2020monotonic) have become very popular as a result of their data efficiency and state-of-the-art (SOTA) results on a wide range of benchmarks.
On the other hand, on-policy gradient methods have so far been under-explored in MARL due to their data requirements and the difficulty of transferring knowledge from single-agent to multi-agent settings. However, a more recent work (yu2022surprising) shows that on-policy methods such as MAPPO (a multi-agent version of proximal policy optimization) outperform all other SOTA methods, including MADDPG and QMIX, on various multi-agent benchmarks, and work especially well in complex SMAC settings. (yu2022surprising) show that with proper choices of hyper-parameters, including value normalization, clipping ratio, and batch size, policy gradient methods can achieve both data efficiency and high returns for agents. Motivated by these promising results of MAPPO, our work focuses on improving the performance of policy gradient methods in MARL.
We consider a multi-agent environment in which agents form an alliance to play against a team of opponents in a partially observable MDP, where allied agents have to make decisions independently without communicating with other members. Unlike the leading method, MAPPO, which focuses on investigating hyper-parameter choices that allow PPO to perform well in the MARL setting, we attempt to enhance the performance of PPO in MARL by introducing a novel opponent-behavior-prediction component. This new component is incorporated into the MAPPO framework to enhance the policy learning of allied agents in the MARL setting. A key challenge in our problem setting is that agents in the alliance are unaware of the actions taken by their opponents. Moreover, each allied agent only has local observations of opponents located in its current neighborhood — the locations and neighborhoods of all players are not fixed but change over time depending on the actions taken by players and the dynamics of the environment. Lastly, learning the behavior of the opponents occurs during the policy learning process of the allied agents. The inter-dependency between these two learning components makes the entire learning process significantly challenging.
We address the aforementioned challenges while providing the following key contributions. First, we convert the problem of opponent behavior learning into predicting the next states of the opponents. This way, we do not have to directly predict the behavior of the opponents (whose actions are not observable); instead, we use the next-state prediction outcome as an indirect implication of their behavior. We then cast the problem of opponent next-state prediction as a new multi-agent imitation learning (IL) problem. We propose a new multi-agent IL algorithm, an adaptation of IQ-Learn (garg2021iq) (a SOTA imitation learning algorithm), that only considers local opponent-state-only observations. In particular, instead of imitating the opponents' policy, our IL algorithm targets the prediction of next states of the neighboring opponents. Second, we provide a comprehensive theoretical analysis that bounds the impact of the allied agents' changing policy (a result of the policy learning process) on our imitation learning outcomes.
Third, we present a unified MARL algorithmic framework in which we incorporate our behavior learning component into MAPPO. The idea is that we can combine each allied agent’s local observations with the next-state prediction of neighboring opponents of that agent, creating an augmented input based on which to improve the decision making of the allied agent at every state. This novel integration results in a new MARL algorithm, which we name Imitation-enhanced Multi-Agent EXtended PPO (IMAX-PPO).
Finally, we conduct extensive experiments on several benchmarks ranging from complex to simple ones, including: SMACv2 (an advanced version of the StarCraft multi-agent challenge) (ellis2022smacv2), Google Research Football (GRF) (kurach2020google), and Gold Miner (Miner). Our empirical results show that our new algorithm consistently and significantly outperforms SOTA algorithms (i.e., QMIX and MAPPO) across all these benchmarks.
Related Work
MARL. The literature on MARL includes centralized and decentralized algorithms. While centralized algorithms (claus1998dynamics) learn a single joint policy to produce the joint actions of all the agents, decentralized learning (littman1994markov) optimizes each agent's local policy independently. There are also algorithms based on centralized training and decentralized execution (CTDE). For example, methods in (lowe2017multi; foerster2018counterfactual) adopt actor-critic structures and learn a centralized critic that takes global information as input. Value decomposition (VD) is a class of methods that represent the joint Q-function as a function of agents' local Q-functions (sunehag2017value; rashid2020monotonic). On the other hand, the use of policy-gradient methods, such as PPO (schulman2017proximal), has also been investigated in multi-agent RL. For example, (de2020independent) propose independent PPO (IPPO), a decentralized MARL method that can achieve high success rates on several hard SMAC maps. IPPO is, however, overall worse than QMIX (rashid2020monotonic), a VD method. Recently, (yu2022surprising) developed MAPPO, a PPO-based MARL algorithm that outperforms QMIX on popular multi-agent game environments such as SMAC (samvelyan2019starcraft) and GRF (kurach2020google). To the best of our knowledge, MAPPO is currently a SOTA algorithm for MARL. In this work, we integrate a new IL-based behavior prediction model into MAPPO, resulting in a new MARL algorithm that outperforms SOTA MARL methods on various challenging game tasks.
Imitation Learning (IL). In this study, we employ imitation learning to anticipate opponents' behavior, making it pertinent to review the relevant literature. IL has been acknowledged as a compelling approach for sequential decision-making (ng2000algorithms; abbeel2004apprenticeship). In IL, a collection of expert trajectories is provided, with the objective of learning a policy that emulates behavior similar to the expert's policy. One of the simplest IL methods is Behavioral Cloning (BC), which aims to maximize the likelihood of the expert's actions under the learned policy. BC disregards environmental dynamics, rendering it suitable only for uncomplicated environments. Several advanced IL techniques that incorporate environmental dynamics have been proposed (reddy2019sqil; fu2017learning; ho2016generative). While these methods operate in complex and continuous domains, they involve adversarial learning, making them prone to instability and sensitivity to hyperparameters. IQ-Learn (garg2021iq) stands as a cutting-edge IL algorithm with distinct advantages, specifically its incorporation of dynamics awareness and non-adversarial training. It is important to note that all the aforementioned IL methods were designed for single-agent RL. In contrast, the literature on multi-agent IL is considerably more limited, with only a handful of studies addressing IL in multi-agent RL. For instance, (song2018multi) present an adversarial-training-based algorithm named Multi-Agent Generative Adversarial IL. It is worth noting that all the IL algorithms mentioned above are established on the premise that actions are observable, which implies that no existing algorithm can be directly applied to our multi-agent game settings with local state-only observations.
Competitive Multi-Agent POMDP Setting
We consider a multi-player Markov game in which multiple agents form an alliance to play against some opponent agents. We present the Markov game as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{E}, \mathbf{A}, \mathbf{B}, \mathcal{P}, r \rangle$, where $\mathcal{S}$ is the set of global states shared by all the agents, $\mathcal{A}$ and $\mathcal{E}$ are the sets of ally and enemy agents, $\mathbf{A}$ is the set of joint actions of all the ally agents, $\mathbf{B}$ is the set of joint actions of all the enemy agents, $\mathcal{P}$ is the transition dynamics of the game environment, and $r$ is a general reward function that can take as input states and actions of the ally or enemy agents and return the corresponding rewards. At each time step $t$ where the global state is $s_t$, each ally agent $i \in \mathcal{A}$ takes an action $a^t_i$ according to a policy $\pi_i(a^t_i \mid o^t_i)$, where $o^t_i$ is the observation of ally agent $i$ given state $s_t$. The joint action can now be defined as $\mathbf{a}_t = (a^t_i)_{i \in \mathcal{A}} \in \mathbf{A}$,
and the joint policy is defined accordingly: $\boldsymbol{\pi}(\mathbf{a}_t \mid s_t) = \prod_{i \in \mathcal{A}} \pi_i(a^t_i \mid o^t_i)$.
The enemy agents, at the same time, take a joint action $\mathbf{b}_t \in \mathbf{B}$ with the following probability: $\boldsymbol{\mu}(\mathbf{b}_t \mid s_t) = \prod_{j \in \mathcal{E}} \mu_j(b^t_j \mid o^t_j)$.
After the allies and enemies make their decisions, the global state transits to a new state $s_{t+1}$ with probability $\mathcal{P}(s_{t+1} \mid s_t, \mathbf{a}_t, \mathbf{b}_t)$. In our setting, the enemies' policies are fixed. Therefore, we can treat the enemies' policies as a part of the environment dynamics, as follows: $\mathcal{P}^{\text{env}}(s_{t+1} \mid s_t, \mathbf{a}_t) = \sum_{\mathbf{b}_t \in \mathbf{B}} \boldsymbol{\mu}(\mathbf{b}_t \mid s_t)\, \mathcal{P}(s_{t+1} \mid s_t, \mathbf{a}_t, \mathbf{b}_t)$.
Our goal is to find a joint policy $\boldsymbol{\pi}$ that optimizes the allies' expected joint reward, which can be formulated as follows (environment dynamics are implicitly involved in the sampling):
$\max_{\boldsymbol{\pi}} \; \mathbb{E}_{\mathbf{a}_t \sim \boldsymbol{\pi}(\cdot \mid s_t),\; s_{t+1} \sim \mathcal{P}^{\text{env}}(\cdot \mid s_t, \mathbf{a}_t)} \Big[ \textstyle\sum_{t} \gamma^t\, r(s_t, \mathbf{a}_t) \Big] \qquad (1)$
where $\gamma \in (0, 1)$ is a discount factor.
The game dynamics involve both the environment dynamics and the joint policy of the enemies, making the training costly to converge. Our idea is, therefore, to mitigate the uncertainties associated with these game dynamics by first predicting the opponents' behavior based on past observations of the allies and then leveraging this behavior prediction to guide the policy training for the allies.
Opponent Behavior Prediction
The key challenge in our problem setting is that actions taken by opponents are hidden from allied agents. Moreover, each allied agent has limited observations of the other agents in the environment; it can only obtain information about nearby opponents. For example, in the SMAC environment, besides information about itself, each allied agent is also aware of the relative position, relative distance, and health points, etc., of the neighboring opponents. (The neighborhood of each allied agent changes over time depending on the actions of all agents and the dynamics of the environment.) Therefore, instead of directly predicting the opponents' next moves, we focus on anticipating the next states of the opponents — this next-state prediction can be used as an indication of what actions have been taken by the neighboring opponents.
Our key contributions here include: (i) a novel representation of this opponent behavior prediction (or next-state prediction) problem in the form of multi-agent imitation learning; (ii) a new adaptation of IQ-Learn (a SOTA method for solving the standard single-agent imitation learning problem) to solve our new imitation learning problem; (iii) a comprehensive theoretical analysis of the influence of the allied agents' changing policies on the next-state prediction outcomes; and (iv) a practical multi-agent imitation learning algorithm that is tailored to the local observations of allied agents.
State-Only Multi-Agent Imitation Learning
In this section, we first present our approach of modeling the problem of predicting the opponents' behavior as a state-only multi-agent IL problem. We then describe our adaptation of IQ-Learn for solving this new multi-agent IL problem. (For ease of presentation, this section focuses on multi-agent IL with global state-only observations in which the actions of opponents are hidden from the allies. The next section then introduces a practical algorithm that localizes this global-state-based IL algorithm to work with the local observations of allied agents.)
Opponent Behavior Prediction as Multi-Agent IL.
In order to formulate the problem as a multi-agent IL, we introduce a new notion of a joint reward function for the enemies, $r^e(s, \mathbf{a}, s')$, associated with a tuple $(s, \mathbf{a}, s')$, where $s, s'$ are the current and next global states and $\mathbf{a}$ is the joint action taken by the allies. Intuitively, the next state $s'$ is a joint “action” of the enemies in this context. This action ($s'$) is basically a result of the joint action of the allies and the hidden joint actions of the enemies in the original multi-agent POMDP. Together with the reward function $r^e$, we introduce a new notion of joint policy for the enemies, $\pi^e(s' \mid s, \mathbf{a})$, which is essentially the probability of ending up at a global state $s'$ from state $s$ when the allies take action $\mathbf{a}$. Our aim is to learn an imitating policy $\hat{\pi}(s' \mid s, \mathbf{a})$ that mimics $\pi^e$. Let $\Omega$ be the support set of the imitating policy $\hat{\pi}$.
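Note that, in the notation of the previous section, this induced enemy policy coincides with the enemy-marginalized environment dynamics:
$\pi^e(s' \mid s, \mathbf{a}) \;=\; \sum_{\mathbf{b} \in \mathbf{B}} \boldsymbol{\mu}(\mathbf{b} \mid s)\, \mathcal{P}(s' \mid s, \mathbf{a}, \mathbf{b}) \;=\; \mathcal{P}^{\text{env}}(s' \mid s, \mathbf{a}),$
so imitating $\pi^e$ amounts to learning the portion of the game dynamics contributed by the (hidden) enemy behavior.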
We now introduce the maximum-entropy inverse RL framework (ho2016generative) w.r.t. the new notions of the enemies' reward and policy as follows:
$\max_{r^e} \min_{\hat{\pi}} \;\; \mathbb{E}_{\rho^{e}}\big[r^e(s, \mathbf{a}, s')\big] \;-\; \mathbb{E}_{\rho^{\hat{\pi}}}\big[r^e(s, \mathbf{a}, s')\big] \;-\; H(\hat{\pi}) \qquad (2)$
where $\rho^{e}$ is the occupancy measure of the enemies' joint policy $\pi^e$, $\rho^{\hat{\pi}}$ is the occupancy measure of the imitating policy $\hat{\pi}$, and the last term $H(\hat{\pi})$ is the entropy regularizer.
An Adaptation of IQ-Learn.
Drawing inspiration from the state-of-the-art IL algorithm IQ-Learn, we construct our IL algorithm as an adaptation of IQ-Learn tailored to our multi-agent competitive game environment.
The main idea of IQ-Learn is to convert a reward-learning problem into a Q-function-learning one. To apply IQ-Learn to our setting, we present the following new soft and inverse soft Bellman operators, which are modifications of the original ones introduced in (garg2021iq):
(3)
(4)
where $V^{\hat{\pi}}$ denotes the soft value function induced by the imitating policy $\hat{\pi}$.
The key difference between Equations 3 and 4 and the original ones presented in IQ-Learn is that the actions defined for IL in our setting coincide with next states, so no expectation over next states is needed when computing future returns. This difference raises an important question of whether the properties and theoretical results of the original IQ-Learn still hold in our new problem setting. Interestingly, we show that the key properties of IQ-Learn still carry over to this new setting.
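For concreteness, the following is a sketch of what these adapted operators look like in our notation, following the structure of the original IQ-Learn operators (the exact form of Equations 3 and 4, e.g., regarding regularization, may differ):
$(\mathcal{B}^{\hat{\pi}} Q)(s, \mathbf{a}, s') = r^e(s, \mathbf{a}, s') + \gamma V^{\hat{\pi}}(s'), \qquad (\mathcal{T}^{\hat{\pi}} Q)(s, \mathbf{a}, s') = Q(s, \mathbf{a}, s') - \gamma V^{\hat{\pi}}(s'),$
with $V^{\hat{\pi}}(s') = \mathbb{E}_{\mathbf{a}' \sim \boldsymbol{\pi}(\cdot \mid s'),\, s'' \sim \hat{\pi}(\cdot \mid s', \mathbf{a}')}\big[ Q(s', \mathbf{a}', s'') - \log \hat{\pi}(s'' \mid s', \mathbf{a}') \big]$. Because the “action” here is the next state $s'$ itself, the value of the next state enters deterministically rather than through an expectation over the transition kernel, which is exactly the difference discussed above. A reward estimate can then be recovered from a learned Q-function as $r^e = \mathcal{T}^{\hat{\pi}} Q$.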
First, it can be seen that the soft Bellman operator is contractive, thus defining, for any given reward function, a unique fixed-point Q-function. Let us further define the following function for the Q-function learning:
We obtain a result similar to that of IQ-Learn, showing a connection between reward learning and Q-function learning (all proofs are in the appendix):
Proposition 1.
For any reward function , let be the unique fixed point solution to the soft Bellman equation , then we have , and for any , .
Proposition 1 indicates that the reward-learning problem can be equivalently converted into a Q-function-learning one. Given a solution to this Q-learning problem, the corresponding rewards can be recovered by applying the inverse soft Bellman operator. Furthermore, our following Proposition 2 shows that the objective function can be reformulated in a more compact form that is convenient for training.
Proposition 2.
The function can be written as follows:
(5)
where $s_0$ denotes an initial state. Under this viewpoint, we show that the key properties of the classical IQ-Learn still hold in the context of our multi-agent setting with missing observations (Proposition 3), making our IQ-Learn version convenient to use.
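To give a concrete sense of this compact form, the analogous objective in the original IQ-Learn, written in our notation, telescopes the reward out of the occupancy-measure objective and reads roughly as
$\mathcal{J}(\hat{\pi}, Q) \;=\; \mathbb{E}_{(s, \mathbf{a}, s') \sim \rho^{e}}\big[ Q(s, \mathbf{a}, s') - \gamma V^{\hat{\pi}}(s') \big] \;-\; (1 - \gamma)\, \mathbb{E}_{s_0 \sim p_0}\big[ V^{\hat{\pi}}(s_0) \big],$
where $p_0$ is the initial-state distribution. The exact constants and regularization terms in Eq. (5) may differ, but this structure — an expectation under the enemies' occupancy measure plus an initial-value correction — is the part that carries over to our setting.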
Proposition 3.
The problem is equivalent to the maximization where
Moreover, the function is concave in .
Corollary 4.
If (defined in Proposition 3), then we can write as follows:
Effect of Allies' Policies on Imitation Learning
The above IL algorithm is trained concurrently with the allies' policies; thus, the game dynamics change during the IL process. We analyze the impact of these changes on the imitation policy. To this end, we consider the loss function of the imitation model as a function of the allies' joint policy.
In the following, we first present our theoretical results bounding the impact of the allies' changing policies on the imitation learning loss function (Proposition 5 for a discrete state space and Proposition 6 for a continuous one). Based on these results, we then provide a corresponding theoretical bound on the IL outcome (Corollary 7).
Proposition 5 (Finite state space ).
Given two allies’ joint policies and such that for any state , the following inequality holds:
where is an upper bound of .
Under a continuous state space, an actor-critic method is used, as the imitating policy cannot be computed exactly. In this case, one can use an explicit actor policy to approximate it. We then update the Q-function and the actor policy alternately using the loss function in Proposition 2. In particular, for a fixed Q-function, a soft actor update will bring the actor policy closer to the softmax policy induced by that Q-function.
Let us write the IL objective in (5) as a function of the allies' joint policy. The following proposition establishes a bound on the variation of this objective as a function of the allies' policy.
Proposition 6 (Continuous state space ).
Given two allies’ joint policies and such that for every state , , the following inequality holds:
where is an upper bound of and , is a lower bound of the entropy of the actor policy .
Propositions 5 and 6 allow us to establish an upper bound on the change in the imitation learning objective when the allies' joint policy changes.
Corollary 7.
Given two allies’ joint policies and with , , the following holds:
Since the allies' joint policy changes during our policy learning process, the above result implies that the imitating policy will be stable once the allies' policy becomes stable; moreover, if the allies' policy converges to a target policy, then the imitator's policy also converges to the one trained with that target ally policy, at a rate determined by the bound in Corollary 7. That is, if the actual ally policy lies within a neighborhood of the target policy, then the expected return of the imitating policy lies within a neighborhood of the desired expected return given by the target policy.
IMAX-PPO: Imitation-enhanced Multi-Agent EXtended PPO Algorithm
We present our MARL algorithm for the competitive game setting. We first focus on a practical implementation of the imitation learning algorithm that takes local observations into account. We then show how to integrate this into our MARL algorithm. We call our algorithm IMAX-PPO, which stands for Imitation-enhanced Multi-Agent EXtended PPO.
Imitation Learning with Local Observations
In the previous section, we presented our new imitation learning algorithm (an adaptation of IQ-Learn) to learn an enemy policy $\hat{\pi}(s' \mid s, \mathbf{a})$ that mimics the probability of ending up at state $s'$ when the current global state is $s$ and the allies' joint action is $\mathbf{a}$. Here, the global state $s$ represents all the information that the enemy agents can access. In our multi-agent game setting, however, ally agents do not have access to the observations of the enemies. Instead, each ally agent $i$ only has information about the neighboring enemies observable by agent $i$ (such as their locations, speeds, etc.).
Therefore, we adapt our IL algorithm to such locally available information. That is, the aim now is to optimize local imitation policies $\hat{\pi}(o'_i \mid o_i, a_i)$, $i \in \mathcal{A}$ (with $o'_i$ denoting the next states of agent $i$'s neighboring enemies, as formalized below), where $o_i$ is the observation vector of agent $i$, containing the local states of the agent itself as well as of all the agents in the neighborhood that are observable by agent $i$. In particular,
$o_i = \big( (x_j)_{j \in \mathcal{N}_i \cap \mathcal{A}},\; (y_k)_{k \in \mathcal{N}_i \cap \mathcal{E}} \big),$
where $x_j$ is the local state of an ally agent $j$, $y_k$ is that of an enemy agent $k$, and $\mathcal{N}_i$ is the neighborhood of agent $i$ (which includes $i$ itself).
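As a concrete (hypothetical) illustration of how such a local observation vector can be assembled in practice, the sketch below concatenates the agent's own features with the features of the enemies it currently observes, zero-padding absent neighbors; the field names and sizes are illustrative, not the exact SMAC feature layout, and ally neighbors are omitted for brevity.

```python
import numpy as np

ENEMY_FEAT = 4  # e.g., relative x, relative y, distance, health (illustrative)

def build_local_obs(own_feats, visible_enemies, max_enemies):
    """Concatenate the agent's own features with per-enemy features.

    own_feats:        1-D array of the ally agent's own state features.
    visible_enemies:  list of 1-D arrays, one per enemy currently in the
                      agent's neighborhood (order is arbitrary but fixed).
    max_enemies:      total number of enemy slots; unseen enemies are zero-padded.
    """
    enemy_block = np.zeros((max_enemies, ENEMY_FEAT), dtype=np.float32)
    for slot, feats in enumerate(visible_enemies[:max_enemies]):
        enemy_block[slot] = feats
    return np.concatenate([own_feats.astype(np.float32), enemy_block.ravel()])

# Example: an ally with 6 own features sees 2 of up to 5 enemies.
obs_i = build_local_obs(np.random.rand(6),
                        [np.random.rand(4), np.random.rand(4)], max_enemies=5)
assert obs_i.shape == (6 + 5 * ENEMY_FEAT,)
```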
To apply our multi-agent IL to this local-observation setting, we build a common imitation policy network $\hat{\pi}_\psi$ and Q network $Q_\omega$ shared by all the agents, where $\psi$ and $\omega$ are the network parameters. The objective function for IL can then be reformulated according to the local observations of the allies as follows:
(6)
where $o'_i$ denotes the next states of the enemies in the current neighborhood of ally agent $i$. We can update $\psi$ and $\omega$ by the following actor-critic rule: for a fixed $\hat{\pi}_\psi$, update $Q_\omega$ to maximize the objective in (6); and for a fixed $Q_\omega$, apply soft actor-critic (SAC) to update $\hat{\pi}_\psi$.
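A minimal sketch of this alternating update, assuming PyTorch-style modules `q_net` (for $Q_\omega$) and `imitator` (for $\hat{\pi}_\psi$, with a reparameterized `sample` method), a replay buffer of $(o_i, a_i, o'_i)$ tuples, and a simplified stand-in for the exact objective in Eq. (6) — all names and the precise loss are illustrative assumptions, not the paper's implementation:

```python
import torch

def imitation_update(q_net, imitator, batch, q_opt, pi_opt):
    """One alternating IL update: Q-step for a fixed imitator, then a SAC-style policy step.

    batch: dict of tensors
      'obs'        - local observations o_i             [B, obs_dim]
      'act'        - the ally agent's own actions a_i   [B, act_dim]
      'next_enemy' - observed next enemy states o'_i    [B, pred_dim]
    """
    obs, act, target_next = batch["obs"], batch["act"], batch["next_enemy"]

    # --- Q-step (imitator fixed): push Q up on observed transitions and
    # down on a soft-value baseline; a rough stand-in for maximizing Eq. (6).
    with torch.no_grad():
        pred_next, logp = imitator.sample(obs, act)        # hypothetical API
    q_expert = q_net(obs, act, target_next)
    soft_value = q_net(obs, act, pred_next) - logp
    q_loss = -(q_expert - soft_value).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # --- Policy step (Q fixed): standard SAC-style actor update that moves
    # the imitator toward next-state predictions with high soft Q-value.
    pred_next, logp = imitator.sample(obs, act)            # reparameterized sample
    pi_loss = (logp - q_net(obs, act, pred_next)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    return q_loss.item(), pi_loss.item()
```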
IMAX-PPO Algorithm
We aim to find the allies' policy $\pi_\theta$ that optimizes the long-term expected joint reward
$\max_{\theta} \;\; \mathbb{E}\Big[ \textstyle\sum_t \gamma^t\, r(s_t, \mathbf{a}_t) \Big], \qquad \text{with } a^t_i \sim \pi_\theta(\cdot \mid o^t_i, \hat{o}^t_i) \text{ for each } i \in \mathcal{A},$
where $o^t_i$ is the observation vector of agent $i$, $\hat{o}^t_i$ is the information derived from the imitator for agent $i$ (i.e., its prediction of the next states of agent $i$'s neighboring enemies), and $a^t_i$ is a local action of agent $i$. To facilitate the training and integration of the imitation learning policy into the MARL algorithm, for every ally agent $i$, we gather game trajectories following the structure $(o_i, a_i, o'_i)$. These gathered observations are then stored in a replay buffer to train the imitation policy $\hat{\pi}_\psi$.
To integrate the imitator into the IMAX-PPO framework, we include the predicted next states of the enemy agents as inputs for training the allies' policy $\pi_\theta$. Specifically, at each game state $s_t$, considering the current actor policy $\pi_\theta$ and the imitating policy $\hat{\pi}_\psi$, we first draw a sample of the allied agents' joint action. The corresponding local observation and action $(o_i, a_i)$ of each agent $i$ are then fed as inputs into the imitation policy $\hat{\pi}_\psi$ to predict the subsequent enemy states $\hat{o}_i$. Once the predicted local states are available, they are used as input to the actor policy to generate the actions actually executed by the allied agents; in simpler terms, each agent $i$ selects a local action $a_i \sim \pi_\theta(\cdot \mid o_i, \hat{o}_i)$. Besides the allies' policy network, we also use a centralized value network $V_\phi$ and update it together with the policy network in an actor-critic manner, similarly to MAPPO. The actor network is trained by optimizing the following objective:
$L(\theta) \;=\; \mathbb{E}_t\Big[\min\big(\eta_t(\theta)\,\hat{A}_t,\; \operatorname{clip}(\eta_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big] \qquad (7)$
where $\eta_t(\theta) = \pi_\theta(a_t \mid o_t, \hat{o}_t) / \pi_{\theta_{\text{old}}}(a_t \mid o_t, \hat{o}_t)$ is the probability ratio and $\hat{A}_t$ is the advantage function, calculated by Generalized Advantage Estimation (GAE). The critic network $V_\phi$ is trained by minimizing a squared-error loss between the predicted value and the estimated return computed from the GAE targets.
We provide the key stages of our method in Algorithm 1 below. Additionally, Fig. 1 serves as an illustration of our IMAX-PPO algorithm.
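To make the per-step data flow described above concrete, below is a hedged sketch of a single decision step for one ally agent, using hypothetical `actor` and `imitator` modules; how the provisional action is conditioned (here, on the previous step's prediction) is an illustrative choice, not a detail fixed by the paper.

```python
import torch

def imax_ppo_decision_step(actor, imitator, obs_i, prev_pred):
    """One decision step for ally agent i.

    obs_i:     1-D tensor, local observation o_i.
    prev_pred: 1-D tensor, the imitator prediction carried over from the
               previous step (zeros at episode start).
    """
    with torch.no_grad():
        # 1) Draw a provisional action from the current actor policy.
        a_tilde, _ = actor.sample(torch.cat([obs_i, prev_pred]))   # hypothetical API

        # 2) Feed (o_i, a~_i) to the imitator to predict the next local
        #    states of agent i's neighboring enemies.
        pred_next = imitator.predict(obs_i, a_tilde)               # hypothetical API

        # 3) Augment the observation with the prediction and select the
        #    action a_i that is actually executed in the environment.
        a_i, logp_i = actor.sample(torch.cat([obs_i, pred_next]))
    return a_i, logp_i, pred_next
```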
Input: Initial allies' policy network $\pi_\theta$, initial allies' value network $V_\phi$, initial imitator's policy network $\hat{\pi}_\psi$, initial imitator's Q network $Q_\omega$, and learning rates.
Output: Trained allies' policy $\pi_\theta$.
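Finally, a high-level sketch of the training loop that Algorithm 1 outlines; `collect_rollout`, `sample_batches`, `compute_gae`, and `ppo_update` are placeholders for standard MAPPO-style components, and `imitation_update` refers to the sketch given earlier — this is an illustrative outline, not the exact implementation.

```python
def train_imax_ppo(env, actor, critic, imitator, q_net, q_opt, pi_opt,
                   num_iterations, rollout_len, ppo_epochs):
    """Sketch of the overall IMAX-PPO training loop (helper functions are placeholders)."""
    replay_buffer = []   # (o_i, a_i, o'_i) tuples for imitation learning
    for _ in range(num_iterations):
        # 1) Collect on-policy rollouts; each agent acts on its local observation
        #    augmented with the imitator's next-state prediction.
        rollout = collect_rollout(env, actor, imitator, rollout_len)     # placeholder
        replay_buffer.extend(rollout.imitation_tuples)

        # 2) Update the imitation model (Q-step / SAC-style policy step)
        #    on mini-batches drawn from the replay buffer.
        for batch in sample_batches(replay_buffer):                      # placeholder
            imitation_update(q_net, imitator, batch, q_opt, pi_opt)

        # 3) PPO-style update of the allies' actor and centralized critic
        #    on the augmented observations, using GAE advantages.
        advantages, returns = compute_gae(rollout, critic)               # placeholder
        for _ in range(ppo_epochs):
            ppo_update(actor, critic, rollout, advantages, returns)      # placeholder
    return actor
```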