
Model-Based Opponent Modeling

Xiaopeng Yu  Jiechuan Jiang  Wanpeng Zhang  Haobin Jiang  Zongqing Lu

Peking University
Abstract

When one agent interacts with a multi-agent environment, it is challenging to deal with various opponents unseen before. Modeling the behaviors, goals, or beliefs of opponents could help the agent adjust its policy to adapt to different opponents. It is also important to consider opponents that are learning simultaneously or capable of reasoning. However, existing work usually tackles only one of the aforementioned types of opponents. In this paper, we propose model-based opponent modeling (MBOM), which employs the environment model to adapt to all kinds of opponents. MBOM simulates the recursive reasoning process in the environment model and imagines a set of improving opponent policies. To effectively and accurately represent the opponent policy, MBOM further mixes the imagined opponent policies according to their similarity with the real behaviors of opponents. Empirically, we show that MBOM achieves more effective adaptation than existing methods in a variety of tasks against different types of opponents, i.e., fixed policy, naïve learner, and reasoning learner.

1 Introduction

Reinforcement learning (RL) has made great progress in multi-agent competitive games, e.g., AlphaGo [33], OpenAI Five [26], and AlphaStar [39]. In multi-agent environments, an agent usually has to compete against or cooperate with diverse other agents (collectively termed opponents, whether collaborators or competitors) unseen before. Since the opponent policy influences the transition dynamics experienced by the agent, interacting with diverse opponents makes the environment nonstationary from the agent’s perspective. Due to the complexity and diversity of opponent policies, it is very challenging for the agent to retain overall supremacy.

Explicitly modeling the behaviors, goals, or beliefs of opponents [2], rather than treating them as a part of the environment, could help the agent adjust its policy to adapt to different opponents. Many studies rely on predicting the actions [11, 13, 10, 27] and goals [29, 30] of opponents during training. When facing diverse or unseen opponents, the agent policy conditions on such predictions or representations generated by the corresponding modules. However, opponents may also have the same reasoning ability, e.g., an opponent who makes predictions about the agent’s goal. In this scenario, higher-level reasoning and other modeling techniques are required to handle such sophisticated opponents [40, 43, 45]. In addition, when the opponents learn simultaneously, the modeling becomes unstable and models fitted on historical experiences lag behind. To enable the agent to continuously adapt to learning opponents, LOLA [8] takes into account the gradients of the opponent’s learning for policy updates, Meta-PG [1] formulates continuous adaptation as a meta-learning problem, and Meta-MAPG [16] combines meta-learning with LOLA. However, LOLA requires knowing the learning algorithm of opponents, while Meta-PG and Meta-MAPG require that all opponents use the same learning algorithm.

Unlike existing work, we do not make such assumptions and focus on enabling the agent to learn effectively by directly representing the policy improvement of opponents when interacting with them, even if they may also be capable of reasoning. Inspired by the intuition that humans can anticipate the future behaviors of opponents by simulating the interactions in their mind once they know the rules and mechanics of the environment, in this paper, we propose model-based opponent modeling (MBOM), which employs the environment model to predict and capture the policy improvement of opponents. By simulating the interactions in the environment model, we can obtain the best responses of opponents to the agent policy that is conditioned on the opponent model. Then, the opponent model can be finetuned using the simulated best responses to obtain a higher-level opponent model. By recursively repeating the simulation and finetuning, MBOM imagines the learning and reasoning of opponents and generates a set of opponent models with different levels, which can also be seen as recursive reasoning. However, since the learning and reasoning of opponents are unknown, a certain-level opponent model might erroneously estimate the opponent. To effectively and accurately represent the opponent policy, we further propose to mix the imagined opponent policies, with mixing weights updated by the Bayesian rule according to their similarity with the real behaviors of opponents.

We empirically evaluate MBOM in a variety of competitive tasks, against three types of opponents, i.e., fixed policy, naïve learner, and reasoning learner. MBOM outperforms strong baselines, especially when against naïve learner and reasoning learner. Ablation studies verify the effectiveness of recursive imagination and Bayesian mixing. We also show that MBOM can be applied in cooperative tasks.

2 Related Work

Opponent Modeling. In multi-agent reinforcement learning (MARL), it is a big challenge to form robust policies due to the unknown opponent policy. From the perspective of an agent, if opponents are considered as a part of the environment, the environment is unstable and complex for policy learning when the policies of opponents are also changing. If the information about opponents is included, e.g., behaviors, goals, and beliefs, the environment may become stable, and the agent could learn using single-agent RL methods. This line of research is opponent modeling.

One simple idea of opponent modeling is to build a model each time a new opponent or group of opponents is encountered [48]. However, learning a model every time is not efficient. A more computationally tractable approach is to represent an opponent’s policy with an embedding vector. Grover et al. [10] uses a neural network as an encoder, taking as input the trajectory of one agent. Imitation learning and contrastive learning are used to train the encoder. Then, the learned encoder can be combined with RL by feeding the generated representation into policy or/and value networks. Learning of the model can also be performed simultaneously with RL, as an auxiliary task [14]. Based on DQN [24], DRON [11] and DPIQN [13] use a secondary network that takes observations as input and predicts opponents’ actions. The hidden layer of this network is used by the DQN module to condition on for a better policy. It is also feasible to model opponents using variational auto-encoders [27], which means the generated representations are no longer deterministic vectors, but high-dimensional distributions. ToMnet [29] tries to make agents have the same Theory of Mind [28] as humans. ToMnet consists of three networks, reasoning about the agent’s action and goal based on past and current information. SOM [30] implements Theory of Mind from a different perspective. SOM uses its own policy, opponent’s observation, and opponent’s action to work backward to learn opponent’s goal by gradient ascent.

The methods aforementioned only consider opponent policies that are independent of the agent. If opponents hold a belief about the agent, the agent can form a higher-level belief about opponents’ beliefs. This process can be performed recursively, termed recursive reasoning. In repeated games, R2-B2 [5] performs recursive reasoning by assuming that all agents select actions according to the GP-UCB acquisition function [34], and shows that reasoning at one level higher than the other agents leads to faster regret convergence. In more complicated stochastic games, recursive reasoning should be compatible with reinforcement learning algorithms. PR2 [40] and GR2 [41] use the agent’s joint Q-function to perform recursive reasoning. Yuan et al. [45] takes both level-0 and level-1 beliefs as input to the value function, where the level-0 belief is updated according to the Bayesian rule and the level-1 belief is updated using a learnable neural network. However, these methods [40, 41, 45] use centralized training with decentralized execution to train a set of fixed agents that cannot handle diverse opponents in execution.

If the opponents are also learning, the modeling mentioned above becomes unstable and the fitted models with historical experiences lag behind. So it is beneficial to take into consideration the learning process of opponents. LOLA [8] introduces the impact of the agent’s policy on the anticipated parameter update of the opponent. A neural network is used to model the opponent’s policy and estimate the learning gradient of the opponent’s policy, implying that the learning algorithm used by the opponent should be known, otherwise the estimated gradient will be inaccurate. Further, the opponents may still be learning continuously during execution. Meta-PG [1] is a method based on meta policy gradient, using trajectories from the current opponents to do multiple meta-gradient steps and construct a policy that is good for the updated opponents. Meta-MAPG [16] extends this method by including an additional term that accounts for the impact of the agent’s current policy on the future policies of opponents, similar to LOLA. These meta-learning based methods require that the distribution of trajectories matches across training and test, which implicitly means all opponents use the same learning algorithm.

Unlike existing work, we go one step further and consider a more general setting, where the opponents could be fixed, randomly sampled from an unknowable policy set, or continuously learning using an unknowable and changeable RL algorithm, both in training and execution. We learn an environment model that allows the agent to perform recursive reasoning against opponents who may also have the same reasoning ability.

Model-based RL and MARL. Model-based RL allows the agent to have access to the transition function. There are two typical branches of model-based RL approaches: background planning and decision-time planning. In background planning, the agent could use the learned model to generate additional experiences for assisting learning. For example, Dyna-style algorithms [35, 19, 23] perform policy optimization on simulated experiences, and model-augmented value expansion algorithms [7, 25, 3] use model-based rollouts to improve the update targets. In decision-time planning [4], the agent could use the model to rollout the optimal action at a given state by looking forward during execution, e.g., model predictive control. Recent studies have extended model-based methods to multi-agent settings for sample efficiency [46, 47], centralized training [42], and communication [17]. Unlike these studies, we exploit the environment model for opponent modeling.

3 Preliminaries

In general, we consider an $n$-agent stochastic game $(\mathcal{S},\mathcal{A}^{1},\dots,\mathcal{A}^{n},\mathcal{P},\mathcal{R}^{1},\dots,\mathcal{R}^{n},\gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}^{i}$ is the action space of agent $i\in[1,\dots,n]$, $\mathcal{A}=\prod_{i=1}^{n}\mathcal{A}^{i}$ is the joint action space of agents, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to[0,1]$ is the transition function, $\mathcal{R}^{i}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the reward function of agent $i$, and $\gamma$ is the discount factor. The policy of agent $i$ is $\pi^{i}$, and the joint policy of the other agents is $\pi^{-i}(a^{-i}|s)=\prod_{j\neq i}\pi^{j}(a^{j}|s)$, where $a^{-i}$ is the joint action of all agents except agent $i$. All agents interact with the environment simultaneously without communication. The historical trajectory is available, i.e., for agent $i$ at timestep $t$, $\{s_{0},a_{0}^{i},a_{0}^{-i},\dots,s_{t-1},a_{t-1}^{i},a_{t-1}^{-i}\}$ is observable. The goal of agent $i$ is to maximize its expected cumulative discounted reward

\mathop{\mathbb{E}}\limits_{\begin{subarray}{c}s_{t+1}\sim\mathcal{P}(\cdot|s_{t},a_{t}^{i},a_{t}^{-i}),\\ a_{t}^{i}\sim\pi^{i}(\cdot|s_{t}),\,a_{t}^{-i}\sim\pi^{-i}(\cdot|s_{t})\end{subarray}}\left[\sum_{t=0}^{\infty}\gamma^{t}\mathcal{R}^{i}(s_{t},a_{t}^{i},a_{t}^{-i})\right]. \qquad (1)

For convenience, the learning agent treats all other agents as a joint opponent with joint action $a^{o}\sim\pi^{o}(\cdot|s)$ and reward $r^{o}$. The action and reward of the learning agent are denoted as $a\sim\pi(\cdot|s)$ and $r$, respectively.
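For intuition, the objective in (1) can be read as a Monte-Carlo estimate of the agent's discounted return against the joint opponent. Below is a minimal Python sketch of such an estimate, assuming hypothetical `env`, `agent_policy`, and `opponent_policy` interfaces that are not part of the paper.

```python
import numpy as np

def estimate_return(env, agent_policy, opponent_policy, gamma=0.99,
                    episodes=32, max_steps=200):
    """Monte-Carlo estimate of the discounted return in Eq. (1).

    `env.reset()`, `env.step(a, a_o)`, and the two policies are assumed
    interfaces for illustration; all other agents are folded into a single
    joint opponent, and `r` is the learning agent's reward.
    """
    returns = []
    for _ in range(episodes):
        s = env.reset()
        g, discount = 0.0, 1.0
        for _ in range(max_steps):
            a = agent_policy(s)        # a ~ pi(.|s)
            a_o = opponent_policy(s)   # a^o ~ pi^o(.|s)
            s, r, done = env.step(a, a_o)
            g += discount * r
            discount *= gamma
            if done:
                break
        returns.append(g)
    return float(np.mean(returns))
```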

4 Model-Based Opponent Modeling

MBOM employs the environment model to predict and capture the learning of opponent policy. By simulating recursive reasoning via the environment model, MBOM imagines the learning and reasoning of the opponent and generates a set of opponent models. To obtain a stronger representation ability and accurately capture the adaptation of the opponent, MBOM mixes the imagined opponent policies according to the similarity with the real behaviors of the opponent.

4.1 Recursive Imagination

If the opponent is also learning during the interaction, the opponent model fitted with historical experiences always lags behind, making it hard for the agent to adapt to the opponent. Moreover, if the opponent could adjust its policy according to the actions, intentions, or goals of the agent, then recursive reasoning may occur between the agent and the opponent. However, with a lagged opponent model, the agent would struggle to keep up with the learning of the opponent. To adapt to a learning and reasoning opponent, the agent should predict the current opponent policy and reason more deeply than the opponent.

Figure 1: Architectures of MBOM

MBOM explicitly simulates the recursive reasoning process using the environment model, called recursive imagination, to generate a series of opponent models, called Imagined Opponent Policies (IOPs). First, we pretrain the agent’s policy $\pi$ using PPO [32] while interacting with $\nu$ different opponents with diverse policies, which can be learned, for example, by [37], and collect a buffer $\mathcal{D}$ of experiences $\left<s,a,a^{o},s^{\prime},\bm{r}\right>$, where $\bm{r}=\left<r,r^{o}\right>$. For zero-sum games ($r+r^{o}=0$) and fully cooperative games ($r=r^{o}$), $r^{o}$ can be easily obtained, while for general-sum games we assume that $r^{o}$ can be accessed during experience collection. Then, using the experience buffer $\mathcal{D}$, we can train the environment model $\Gamma(s^{\prime},\bm{r}|s,a,a^{o};\zeta)$ by minimizing the mean squared error

\mathop{\mathbb{E}}\limits_{s,a,a^{o},s^{\prime},\bm{r}\sim\mathcal{D}}\;\frac{1}{2}\|s^{\prime}-\hat{s}^{\prime}\|^{2}+\frac{1}{2}\|\bm{r}-\hat{\bm{r}}\|^{2}, \qquad (2)

where $\hat{s}^{\prime},\hat{\bm{r}}=\Gamma(s,a,a^{o};\zeta)$, and obtain the level-0 IOP $\tilde{\pi}^{o}_{0}(\cdot|s;\phi_{0})$ by maximum-likelihood estimation,

\mathop{\mathbb{E}}\limits_{s,a^{o}\sim\mathcal{D}}\log\tilde{\pi}^{o}_{0}(a^{o}|s;\phi_{0}). \qquad (3)
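As a concrete, hedged illustration of (2) and (3), the following PyTorch sketch trains an environment model $\Gamma_{\zeta}$ and the level-0 IOP from the buffer $\mathcal{D}$; the network sizes, the batch format, and the assumption of discrete opponent actions are ours for illustration, not necessarily the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnvModel(nn.Module):
    """Gamma(s', r | s, a, a^o; zeta): predicts the next state and the reward pair <r, r^o>."""
    def __init__(self, s_dim, a_dim, ao_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim + ao_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, s_dim + 2))  # next state + <r, r^o>

    def forward(self, s, a, ao):
        out = self.net(torch.cat([s, a, ao], dim=-1))
        return out[..., :-2], out[..., -2:]  # s_hat', r_hat = <r, r^o>

class IOP(nn.Module):
    """Imagined opponent policy pi~^o(.|s; phi) over discrete opponent actions."""
    def __init__(self, s_dim, n_ao, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_ao))

    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

def train_from_buffer(env_model, iop0, batches, lr=1e-3):
    """`batches` yields (s, a, ao_onehot, ao_index, s_next, r_pair) tensors sampled from D."""
    opt = torch.optim.Adam(list(env_model.parameters()) + list(iop0.parameters()), lr=lr)
    for s, a, ao_onehot, ao_idx, s_next, r_pair in batches:
        s_hat, r_hat = env_model(s, a, ao_onehot)
        model_loss = 0.5 * F.mse_loss(s_hat, s_next) + 0.5 * F.mse_loss(r_hat, r_pair)  # Eq. (2)
        iop_loss = -iop0(s).log_prob(ao_idx).mean()                                     # Eq. (3)
        opt.zero_grad()
        (model_loss + iop_loss).backward()
        opt.step()
```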

To imagine the learning of the opponent, as illustrated in Figure 1, we use the rollout algorithm [38] to get the best response of the opponent to the agent policy $\pi$. For each opponent action $a^{o}_{t}$ at timestep $t$, we uniformly sample the opponent action sequences in the next $k$ timesteps, simulate the trajectories using the learned environment model $\Gamma_{\zeta}$, and select the best response with the highest rollout value

a^{o*}_{t}=\mathop{\operatorname{argmax}}\limits_{a_{t}^{o}}\;\max\limits_{a_{t+1}^{o},\cdots,a_{t+k}^{o}\sim\text{Unif}}\;\sum_{j=0}^{k}\gamma^{j}r_{t+j}^{o}. \qquad (4)

During the simulation, the agent acts according to the policy conditioned on the modeled opponent policy, $a_{t}\sim\pi(\cdot|s_{t},\tilde{a}_{t}^{o}\sim\tilde{\pi}^{o}_{0}(\cdot|s_{t};\phi_{0});\theta)$, and the learned environment model provides the transition $s_{t+1},\bm{r}_{t}=\Gamma(s_{t},a_{t},a_{t}^{o};\zeta)$. With larger $k$, the rollout has a longer planning horizon and thus could evaluate the action $a^{o*}$ more accurately, assuming a perfect environment model. However, the computation cost of the rollout grows exponentially with the planning horizon, while in practice the compounding error of the environment model also increases with the planning horizon [15]. Therefore, the choice of $k$ is a tradeoff between accuracy and cost. Specifically, for zero-sum and fully cooperative games, we can approximately estimate the opponent state value $V^{o}(s)$ as $-V(s)$ and $V(s)$, respectively, and modify the rollout value, like the $n$-step return [36], to obtain a longer horizon (see Appendix C for the empirical effect of $k$)

a^{o*}_{t}=\mathop{\operatorname{argmax}}\limits_{a_{t}^{o}}\;\max\limits_{a_{t+1}^{o},\cdots,a_{t+k}^{o}\sim\text{Unif}}\;\left[\sum_{j=0}^{k}\gamma^{j}r_{t+j}^{o}+\gamma^{k+1}V^{o}(s_{t+k+1})\right]. \qquad (5)

By imagination, we can obtain the best response of the opponent to the agent policy $\pi$ and construct the simulated data $\{\left<s,a^{o*}\right>\}$. Then, we use the data to finetune the level-0 IOP $\tilde{\pi}^{o}_{0}(\cdot|s;\phi_{0})$ by maximum-likelihood estimation and obtain the level-1 IOP $\tilde{\pi}^{o}_{1}(\cdot|s;\phi_{1})$. The level-1 IOP can be seen as the best response of the opponent to the response of the agent conditioned on the level-0 IOP. In the imagination, this takes the nested form “the opponent believes [that the agent believes (that the opponent believes …)]”: the existing imagined opponent policy is the innermost “(that the opponent believes)”, while the outermost “the opponent believes” is $a^{o*}$, which is obtained by the rollout process. Recursively repeating the rollout and finetuning, where the agent policy is conditioned on the IOP $\tilde{\pi}^{o}_{m-1}(\cdot|s;\phi_{m-1})$, we can derive the level-$m$ IOP $\tilde{\pi}^{o}_{m}(\cdot|s;\phi_{m})$.
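To make the rollout and finetuning step more concrete, here is a hedged sketch of one recursive-imagination step, reusing the `EnvModel` and `IOP` classes from the sketch above; the `agent_policy` interface, the sampling budget `n_samples`, and the one-hot action encoding are illustrative assumptions rather than the paper's exact procedure.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def best_response_action(env_model, agent_policy, iop, s, n_ao, k=1, n_samples=16, gamma=0.99):
    """Eq. (4): pick the opponent action with the highest k-step imagined return of r^o."""
    best_a, best_v = 0, float("-inf")
    for a0 in range(n_ao):                      # enumerate the first opponent action a_t^o
        for _ in range(n_samples):              # uniformly sampled continuation sequences
            s_t, value, discount, a_o = s, 0.0, 1.0, a0
            for _ in range(k + 1):
                ao_onehot = F.one_hot(torch.tensor(a_o), n_ao).float()
                a_t = agent_policy(s_t, iop(s_t).probs)       # agent conditions on the IOP
                s_t, r_pair = env_model(s_t, a_t, ao_onehot)  # imagined transition and <r, r^o>
                value += discount * float(r_pair[1])          # accumulate the opponent reward
                discount *= gamma
                a_o = int(torch.randint(n_ao, (1,)))          # next opponent action ~ Unif
            if value > best_v:
                best_v, best_a = value, a0
    return best_a

def imagine_next_level(iop_prev, env_model, agent_policy, states, n_ao, lr=1e-3):
    """Finetune a copy of the level-(m-1) IOP on imagined best responses to obtain level-m.

    `states` is a list of state tensors collected from recent interaction.
    """
    iop_next = copy.deepcopy(iop_prev)
    opt = torch.optim.Adam(iop_next.parameters(), lr=lr)
    targets = torch.tensor([best_response_action(env_model, agent_policy, iop_prev, s, n_ao)
                            for s in states])
    loss = -iop_next(torch.stack(states)).log_prob(targets).mean()  # MLE on <s, a^{o*}>
    opt.zero_grad(); loss.backward(); opt.step()
    return iop_next
```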

4.2 Bayesian Mixing

By recursive imagination, we get $M$ IOPs with different reasoning levels. However, since the learning and reasoning of the opponent are unknown, a single IOP might erroneously estimate the opponent. To obtain stronger representation capability and accurately capture the learning of the opponent, we linearly combine the IOPs to get a mixed IOP,

\tilde{\pi}^{o}_{\operatorname{mix}}(\cdot|s)=\sum_{m=0}^{M-1}\alpha_{m}\tilde{\pi}^{o}_{m}(\cdot|s;\phi_{m}), \qquad (6)

where $\alpha_{m}$ is the weight of the level-$m$ IOP, which is taken as $p(\tilde{\pi}^{o}_{m}|a^{o})$, also denoted as $p(m|a^{o})$, the probability that the opponent action $a^{o}$ is generated from the level-$m$ IOP. Thus, $p(m|a^{o})$ indicates the similarity between the level-$m$ IOP $\tilde{\pi}^{o}_{m}$ and the opponent policy $\pi^{o}$ in the most recent stage. By the Bayesian rule, we have

p(m|a^{o})=\frac{p(a^{o}|m)p(m)}{\sum_{i=0}^{M-1}p(a^{o}|i)p(i)}=\frac{\tilde{\pi}^{o}_{m}(a^{o}|s;\phi_{m})p(m)}{\sum_{i=0}^{M-1}\left[\tilde{\pi}^{o}_{i}(a^{o}|s;\phi_{i})p(i)\right]}. \qquad (7)

Updating $p(m|a^{o})$ during the interaction yields a more accurate estimate of the improving opponent policy.

In practice, we use the moving average of $p(m|a^{o})$ as the prior $p(m)$, take the decayed moving average of $p(m|a^{o})$ as $\Psi_{m}$, and obtain $\bm{\alpha}$ by the IOPs mixer,

\bm{\alpha}=(\alpha_{0},\dots,\alpha_{M-1})=\text{softer-softmax}(\Psi_{0},\dots,\Psi_{M-1}). \qquad (8)

Softer-softmax [12] is a variant of softmax that uses a higher temperature to soften the probability distribution. The IOPs mixer is non-parametric, so it can be updated quickly and efficiently without parameter training or many interactions (see Appendix B for an empirical analysis of $\bm{\alpha}$). Therefore, the IOPs mixer can adapt to a fast-improving opponent.
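The following sketch shows one possible non-parametric realization of (6)–(8): a Bayesian update of $p(m|a^{o})$ with a moving-average prior, a decayed moving average $\Psi_{m}$, and a softer-softmax with a higher temperature. The horizon, decay, and temperature values are placeholders, not the paper's hyperparameters.

```python
import numpy as np

def softer_softmax(x, temperature=5.0):
    """Softmax with a higher temperature, which softens the distribution (Eq. (8))."""
    z = np.asarray(x, dtype=np.float64) / temperature
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

class IOPMixer:
    """Non-parametric mixer over M IOPs; horizon/decay/temperature are illustrative."""
    def __init__(self, M, horizon=20, decay=0.9, temperature=5.0):
        self.M, self.H, self.lam, self.T = M, horizon, decay, temperature
        self.history = []  # recent p(m | a^o) vectors

    def update(self, likelihoods):
        """likelihoods[m] = pi~^o_m(a^o | s; phi_m) for the observed opponent action a^o."""
        likelihoods = np.asarray(likelihoods, dtype=np.float64)
        prior = (np.mean(self.history[-self.H:], axis=0)           # moving-average prior p(m)
                 if self.history else np.full(self.M, 1.0 / self.M))
        posterior = likelihoods * prior
        posterior /= posterior.sum()                                # Eq. (7)
        self.history.append(posterior)
        recent = self.history[-self.H:]
        psi = sum(self.lam ** (len(recent) - l) * p                 # decayed moving average Psi_m
                  for l, p in enumerate(recent))
        return softer_softmax(psi, self.T)                          # alpha, Eq. (8)

def mixed_iop_probs(alphas, iop_action_probs):
    """Eq. (6): pi~^o_mix(.|s) = sum_m alpha_m * pi~^o_m(.|s)."""
    return np.einsum("m,ma->a", np.asarray(alphas), np.asarray(iop_action_probs))
```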

Algorithm 1 MBOM
1:  Pretraining:
2:  Initialize the recursive imagination level $M$.
3:  Initialize the weights $\bm{\alpha}$ with $(1/M,1/M,\dots,1/M)$.
4:  Interact with $\nu$ learning opponents to train the agent policy $\theta$ and collect the experience buffer $\mathcal{D}$.
5:  Train the level-0 IOP $\phi_{0}$ and the environment model $\Gamma_{\zeta}$ using the experience buffer $\mathcal{D}$.
6:  Interaction:
7:  for $t=1,\ldots,\text{max\_epoch}$ do
8:     Interact with the opponent to observe the real opponent actions.  {// Recursive Imagination}
9:     Finetune $\phi_{0}$ with the real opponent actions.
10:     for $m=1,\ldots,M-1$ do
11:        Rollout in $\Gamma_{\zeta}$ with $\pi(\cdot|s,\tilde{a}^{o}\sim\tilde{\pi}^{o}_{m-1}(\cdot|s;\phi_{m-1});\theta)$ to get $a^{o*}$ by (4) or (5).
12:        Finetune $\phi_{m-1}$ with the best responses to get $\phi_{m}$.
13:     end for  {// Bayesian Mixing}
14:     Update $\bm{\alpha}$ by (8).
15:     Mix the IOPs to get $\tilde{\pi}_{\operatorname{mix}}^{o}$ by (6).
16:  end for

For completeness, the full procedure of MBOM is given in Algorithm 1.
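Putting the pieces together, the interaction phase of Algorithm 1 might be organized as in the following high-level Python sketch; every callable argument is an assumed interface standing in for the routines sketched in Sections 4.1 and 4.2, not the authors' code.

```python
def mbom_interaction(collect_episode, finetune_on_actions, imagine_next_level,
                     mixer, iop0, M=3, max_epoch=100):
    """High-level sketch of Algorithm 1 with hypothetical interfaces.

    collect_episode(alphas)          -> (states, opponent_actions) from real interaction
    finetune_on_actions(iop, s, ao)  -> IOP finetuned on observed opponent actions (line 9)
    imagine_next_level(iop, states)  -> next-level IOP via rollout + finetuning (lines 11-12)
    mixer.update(likelihoods)        -> weights alpha via Eqs. (7)-(8) (line 14)
    """
    alphas = [1.0 / M] * M                                       # line 3: uniform initial weights
    for _ in range(max_epoch):                                   # line 7
        states, opp_actions = collect_episode(alphas)            # line 8: real interaction
        iops = [finetune_on_actions(iop0, states, opp_actions)]  # line 9: level-0 IOP
        for m in range(1, M):                                    # lines 10-13: recursive imagination
            iops.append(imagine_next_level(iops[m - 1], states))
        for s, a_o in zip(states, opp_actions):                  # line 14: Bayesian mixing
            alphas = mixer.update([float(iop(s).probs[a_o]) for iop in iops])
        # line 15: the mixed IOP is sum_m alphas[m] * iops[m](.|s), used to condition pi
    return alphas, iops
```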

4.3 Theoretical Analysis

MBOM consists of several components, including a conditional policy, an environment model, and $M$ IOPs. They all interact with each other through model rollouts, which makes it extremely hard to give an error analysis for each component of MBOM. Therefore, we instead focus mainly on recursive imagination and Bayesian mixing, and provide a deeper understanding of MBOM. Proofs are available in Appendix A.

Without loss of generality, we define $\delta_{m}$ as the discrepancy between the level-$m$ and level-$(m{-}1)$ IOPs in terms of the estimated value function given the IOP, and $\varepsilon_{m}$ as the error of the level-$m$ IOP compared to the true value function given the opponent policy, i.e., $\delta_{m}\doteq|\widehat{V}_{m}-\widehat{V}_{m-1}|$, $\varepsilon_{m}\doteq|\widehat{V}_{m}-V|$, and $\delta_{0}\doteq\varepsilon_{0}$. Then, we have the following two lemmas.

Lemma 1.

For the mixed imagined opponent policy (IOP) $\tilde{\pi}_{\operatorname{mix}}^{o}(\cdot|s)=\sum_{m=0}^{M-1}\alpha_{m}\tilde{\pi}^{o}_{m}(\cdot|s;\phi_{m})$, the total error $\varepsilon_{total}=\sum_{m=0}^{M-1}\alpha_{m}\varepsilon_{m}$ satisfies the following inequality:

\varepsilon_{total}\leq\sum_{m=0}^{M-1}\delta_{m}\sum_{i=m}^{M-1}\alpha_{i}.
Lemma 2.

Suppose the value function is Lipschitz continuous on the state space $\mathcal{S}$, $K$ is the Lipschitz constant, and $\widehat{\mathcal{M}}_{i}(s,a)\doteq\Gamma(s,a,a^{o};\zeta)\cdot\tilde{\pi}^{o}_{i}(a^{o}|s;\phi_{i})$ is the transition distribution given $s$ and $a$; then

\left|\widehat{V}_{i}-\widehat{V}_{j}\right|\leq\frac{\gamma K}{1-\gamma}\cdot\underset{s,a\sim\pi,\widehat{\mathcal{M}}_{j}}{\mathbb{E}}\left\|\widehat{\mathcal{M}}_{i}(s,a)-\widehat{\mathcal{M}}_{j}(s,a)\right\|.

According to (6) and (7), we can express $\alpha_{m}$ as follows:

\alpha_{m}\doteq\mathbb{E}\left[p(m|a^{o})\right]=\sum\limits_{s,a^{o}}d(s,a^{o})\frac{\tilde{\pi}^{o}_{m}\left(a^{o}|s;\phi_{m}\right)p(m)}{\sum_{i=0}^{M-1}\tilde{\pi}^{o}_{i}\left(a^{o}|s;\phi_{i}\right)p(i)},

where $d(s,a^{o})$ is the stationary distribution of state and opponent-action pairs. Then, based on Lemmas 1 and 2, we can obtain the following theorem, where $\widehat{\mathcal{M}}_{-1}$ denotes $\mathcal{P}(\cdot|s,a,a^{o})\cdot\pi^{o}(a^{o}|s)$, the true transition dynamics of the agent.

Theorem 1.

Define the error between the approximated value function of the Bayesian mixing IOP and the true value function as $\varepsilon_{true}\doteq\left|\widehat{V}-V\right|$. For the Bayesian mixing IOP, the true error satisfies the following inequality:

\small\varepsilon_{true}\leq\varepsilon_{total}\leq\sum\limits_{s,a^{o}}\frac{\gamma Kd(s,a^{o})}{1-\gamma}\cdot\frac{\sum\limits_{m=0}^{M-1}\underset{s,a\sim\pi,\widehat{\mathcal{M}}_{m-1}}{\mathbb{E}}\left\|\widehat{\mathcal{M}}_{m}-\widehat{\mathcal{M}}_{m-1}\right\|\sum\limits_{i=m}^{M-1}\tilde{\pi}^{o}_{i}\left(a^{o}|s;\phi_{i}\right)p(i)}{\sum\limits_{j=0}^{M-1}\tilde{\pi}^{o}_{j}\left(a^{o}|s;\phi_{j}\right)p(j)}.

In the error bound of Theorem 1, it is easy to see that the error accumulates as the level $M$ increases, but at the same time, a higher level also has advantages. To analyze this, we need to define the benefit of using the mixed IOP. We use the negative distribution distance $\eta\doteq-\|\hat{\pi}-\pi\|$ to denote the benefit of using a policy at a given level, where $\hat{\pi}$ is the estimated opponent policy, $\pi$ is the true opponent policy, and $\|\cdot\|$ is a general form of distribution distance. For the mixed policy, we define the benefit as:

\eta_{M}=-\|(\alpha_{0}\tilde{\pi}^{o}_{0}+\cdots+\alpha_{M-1}\tilde{\pi}^{o}_{M-1})-\pi^{o}\|.

Then, we have the following lemma and theorem.

Lemma 3.

Assume that the true opponent policy $\pi^{o}$ has a probability distribution over the given set of IOPs $\{\tilde{\pi}^{o}_{0},\ldots,\tilde{\pi}^{o}_{M-1}\}$, with probability $\mathrm{Pr}\{\pi^{o}=\tilde{\pi}_{m}^{o}\}=p_{m}$ for each $\tilde{\pi}_{m}^{o}$; then the maximum expectation of $\eta_{M}$ can be achieved if and only if $\alpha_{m}=p_{m},\forall m\in[0,M-1]$.

Theorem 2.

Given the action trajectory of the opponent $\{a^{o}_{0},a^{o}_{1},\ldots,a^{o}_{t}\}$, the posterior probability updated by Bayesian mixing approximates the true probability of the opponent as the length of the trajectory grows, i.e.,

P(m|a^{o}_{t},\cdots,a^{o}_{1},a^{o}_{0})\to P_{\operatorname{true}}(m)\doteq p_{m},\quad t\to\infty.

Then the maximum expectation of $\eta_{M}$ can be achieved with $\alpha_{m}=\mathbb{E}[P(m|a^{o}_{t},\cdots,a^{o}_{1},a^{o}_{0})]$.

According to Theorem 2, the estimated probability distribution of the opponent converges to the true distribution of the opponent. It is obvious that a larger $M$ improves the representation capability of the IOPs and thus better satisfies the assumption in Lemma 3, but it also increases the error bound in Theorem 1. Therefore, the selection of $M$ is a tradeoff between the two (see Appendix C for an empirical study on $M$).

5 Experiments

We first evaluate MBOM thoroughly in two-player zero-sum tasks. Then, we investigate MBOM when against multiple opponents and in a cooperative task. In all the experiments, the baselines have the same neural network architectures as MBOM. All the methods are trained for five runs with different random seeds, and results are presented using mean and 95% confidence intervals. More details about experimental settings and hyperparameters are available in Appendix E.

5.1 Two-Player Zero-Sum Tasks

We evaluate MBOM in two competitive tasks: (1) Triangle Game is an asymmetric zero-sum game implemented on Multi-Agent Particle Environments (MPE) [21]. When facing different policies of the opponent, the agent has to adjust its policy to adapt to the opponent for higher reward. (2) One-on-One is a two-player competitive game implemented on Google Research Football [18]. The goalkeeper controlled by the agent could only passively react to the strategies of the shooter controlled by the opponent and makes policy adaptation when the shooter strategy changes.

Baselines. In the experiments, we compare MBOM with the following methods:

  • LOLA-DiCE [9] is an extension of LOLA that uses the Differentiable Monte-Carlo Estimator (DiCE) to consider how to shape the learning dynamics of other agents.

  • Meta-PG [1] uses trajectories from the current opponents to do multiple meta-gradient steps and construct a policy that is good for the updated opponents.

  • Meta-MAPG [16] includes an additional term that accounts for the impact of the agent’s current policy on the future policies of opponents, compared with Meta-PG.

  • PPO [32] is a classical single-agent RL algorithm, without any other modules.

Opponents. We construct three types of opponents:

  • Fixed policy. The opponents are pre-trained but not updated during interaction.

  • Naïve learner. The opponents are pre-trained and updated using PPO during interaction.

  • Reasoning learner. The opponents are pre-trained and can model the behavior of the agent. The model is finetuned during interaction and their policy is conditioned on the predicted action of the model.

Figure 2: Performance against different types of opponents, i.e., fixed policy, naïve learner, and reasoning learner, where the x-axis is the opponent index: (a) Triangle Game, versus fixed policy; (b) Triangle Game, versus naïve learner; (c) Triangle Game, versus reasoning learner; (d) One-on-One, versus fixed policy; (e) One-on-One, versus naïve learner; (f) One-on-One, versus reasoning learner. The results show that MBOM outperforms the other baselines, especially against naïve learner and reasoning learner.

Methods shown in Figure 2: LOLA-DiCE, Meta-PG, Meta-MAPG, PPO, MBOM.

Table 1: Performance on Triangle Game and One-on-One

Methods     | Triangle Game (score ↑)                           | One-on-One (win rate %)
            | Fixed Policy   Naïve Learner   Reasoning Learner  | Fixed Policy  Naïve Learner  Reasoning Learner
LOLA-DiCE   | -22.51 ± 5.22  -20.48 ± 4.02   -21.55 ± 4.43      | 33.0 ± 0.5    25.2 ± 1.2     18.4 ± 0.9
Meta-PG     |  -3.78 ± 0.13   -6.72 ± 0.18    -8.35 ± 1.87      | 41.7 ± 0.7    35.4 ± 2.5     28.0 ± 0.6
Meta-MAPG   |  -2.01 ± 0.06   -5.76 ± 0.29    -6.14 ± 0.84      | 44.4 ± 1.5    36.6 ± 1.6     31.1 ± 1.9
PPO         | -13.29 ± 1.80  -18.42 ± 1.28   -20.51 ± 1.80      | 40.6 ± 0.7    21.5 ± 0.7     25.8 ± 1.0
MBOM        |  -1.19 ± 0.03   -1.66 ± 0.29    -2.75 ± 0.89      | 59.5 ± 0.9    51.3 ± 1.0     64.2 ± 1.8

Performance. The experimental results against test opponents in Triangle Game and One-on-One are shown in Figure 2, and the mean performance with standard deviation over all test opponents is summarized in Table 1. Without explicitly considering the opponent policy, PPO achieves poor performance. The learning of LOLA-DiCE depends on the gradient of the opponent model. However, the gradient information cannot clearly reflect the distinctions between diverse opponents, so LOLA-DiCE cannot adapt to unseen opponents quickly and effectively. Meta-PG and Meta-MAPG show mediocre performance in Triangle Game. In One-on-One, since Meta-PG and Meta-MAPG heavily rely on the reward signal, adaptation is difficult for the two methods in this sparse-reward task. MBOM outperforms Meta-MAPG and Meta-PG because the opponent model of MBOM improves the ability to adapt to different opponents. The performance gain becomes more significant when the opponent is a naïve learner or reasoning learner, which is attributed to the recursive reasoning capability brought by recursive imagination in the environment model and to Bayesian mixing that quickly captures the learning of the opponent.

Ablation Studies. The opponent modeling module of MBOM is composed of both recursive imagination and Bayesian mixing. We respectively test the functionality of the two components.

In recursive imagination, MBOM generates a sequence of different levels of IOPs finetuned by the imagined best response of the opponent in the environment model. For comparison, we use only $\phi_{0}$ as the opponent model (i.e., without recursive imagination), denoted as MBOM w/o IOPs, and use random actions to finetune the IOPs rather than using the best response, denoted as MBOM-BM. Note that MBOM w/o IOPs is essentially the same as the simple opponent modeling methods that predict opponents’ actions, like [11, 13, 10]. The experimental results of MBOM, MBOM w/o IOPs, and MBOM-BM are shown in Figure 3(a) and 3(b). MBOM w/o IOPs obtains similar results to MBOM when facing fixed policy opponents because $\phi_{0}$ could accurately predict the opponent behaviors if the opponent is fixed, which corroborates the empirical results in [11, 13, 10]. However, if the opponent is learning or reasoning, $\phi_{0}$ cannot accurately represent the opponent, thus the performance of MBOM w/o IOPs drops. The methods with IOPs perform better than MBOM w/o IOPs, which means that a variety of IOPs can capture the policy changes of learning opponents. When facing the reasoning learner, MBOM outperforms MBOM-BM, indicating that the IOPs finetuned by the imagined best response have a stronger ability to represent the reasoning learner.

In Bayesian mixing, MBOM mixes the generated IOPs to obtain a policy that is close to the real opponent. As a comparison, we directly use the individual level-$m$ IOP generated by recursive imagination without mixing, denoted as MBOM-$\phi_{m}$, and use uniform mixing instead of Bayesian mixing, denoted as MBOM-unif. The experimental results are illustrated in Figure 3(c) and 3(d). Benefiting from recursive imagination, MBOM-$\phi_{1}$ and MBOM-$\phi_{2}$ show stronger performance than MBOM-$\phi_{0}$. MBOM consistently outperforms the ablation baselines without mixing when fighting against reasoning learners. The middling performance of MBOM-unif is due to not exploiting the more representative IOPs, indicating that Bayesian mixing obtains a more accurate estimate of reasoning opponents.

We additionally provide an analysis of the weights $\bm{\alpha}$ and ablation studies on hyperparameters, including the rollout length $k$ and the level of recursive imagination $M$, which are available in Appendix B and C, respectively.

Figure 3: Ablation study of MBOM. MBOM is compared with MBOM-BM and MBOM w/o IOPs on recursive imagination in 3(a) Triangle Game and 3(b) One-on-One. MBOM is compared with MBOM-$\phi_{0}$, MBOM-$\phi_{1}$, MBOM-$\phi_{2}$, and MBOM-unif on Bayesian mixing in 3(c) Triangle Game and 3(d) One-on-One.

5.2 Multiple Opponents

When facing multiple opponents, MBOM takes them as a joint opponent, which is consistent with the single-opponent case. We perform experiments in Predator-Prey [21]. We control the prey as the agent and take the three predators as opponents. In this task, the agent tries not to be touched by the three opponents. When the policies of the opponents are fixed (e.g., enveloping the agent), it is difficult for the agent to get a high reward. However, if the policies of the opponents change during the interaction, there may be chances for the agent to escape. The results are shown in Figure 4(a). When facing learning opponents, MBOM achieves a performance gain, especially when competing with the reasoning learner, which indicates that MBOM adapts to the learning and reasoning of the opponents and responds effectively. Meta-MAPG and Meta-PG do not effectively adapt to the changes of the opponents’ policies and are prone to be touched by the opponents many times in a small area, resulting in poor performance. LOLA-DiCE, Meta-PG, and Meta-MAPG underperform PPO when against multiple reasoning learners, indicating that their adaptation may have a negative effect on the agent’s performance. Detailed experimental results with different opponents are available in Appendix D.

5.3 Cooperative Task

MBOM can also be applied to cooperative tasks. We test MBOM on a cooperative scenario, Coin Game, which is a high-dimensional expansion of the iterated prisoner’s dilemma with multi-step actions [20, 8]. Both agents simultaneously update their policies using MBOM or the baseline methods to maximize the sum of rewards. The experimental results are shown in Figure 4(b). Meta-PG and Meta-MAPG degenerate to policy gradients in this task as there is no training set. Both learn a greedy strategy that collects coins of any color, which leads to a zero total score for the two players. LOLA-DiCE learns too slowly and does not learn to cooperate within 200 iterations, indicating the inefficiency of estimating the opponent gradients. PPO learns to cooperate quickly and successfully, which corroborates the good performance of independent PPO in cooperative multi-agent tasks as pointed out in [6, 44]. MBOM slightly outperforms PPO, which indicates that MBOM can also be applied to cooperative tasks without negative effects. Note that we do not compare with other cooperative MARL methods like QMIX [31], as they require centralized training.

Figure 4: 4(a) Performance against different types of opponents in Predator-Prey, where a smaller touch number means better performance. 4(b) Learning curves in Coin Game.

6 Conclusion

We have proposed MBOM, which employs recursive imagination and Bayesian mixing to predict and capture the learning and improvement of opponents. Empirically, we evaluated MBOM in a variety of competitive tasks and demonstrated that MBOM adapts to learning and reasoning opponents much better than the baselines. These results make MBOM a simple and effective RL method in competitive environments, whether opponents are fixed, continuously learning, or reasoning. Moreover, MBOM can also be applied in cooperative tasks.

References

  • Al-Shedivat et al. [2018] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous Adaptation Via Meta-Learning in Nonstationary and Competitive Environments. In International Conference on Learning Representations (ICLR), 2018.
  • Albrecht and Stone [2018] Stefano V Albrecht and Peter Stone. Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems. Artificial Intelligence, 258:66–95, 2018.
  • Buckman et al. [2018] Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep Reinforcement Learning in A Handful of Trials Using Probabilistic Dynamics Models. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • Dai et al. [2020] Zhongxiang Dai, Yizhou Chen, Bryan Kian Hsiang Low, Patrick Jaillet, and Teck-Hua Ho. R2-b2: Recursive reasoning-based bayesian optimization for no-regret learning in games. In International Conference on Machine Learning (ICML), 2020.
  • de Witt et al. [2020] Christian Schröder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip H. S. Torr, Mingfei Sun, and Shimon Whiteson. Is Independent Learning All You Need in the Starcraft Multi-Agent Challenge? arXiv preprint arXiv:2011.09533, 2020.
  • Feinberg et al. [2018] Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning. arXiv preprint arXiv:1803.00101, 2018.
  • Foerster et al. [2018a] Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with Opponent-Learning Awareness. In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2018a.
  • Foerster et al. [2018b] Jakob Foerster, Gregory Farquhar, Maruan Al-Shedivat, Tim Rocktäschel, Eric Xing, and Shimon Whiteson. DICE: The Infinitely Differentiable MOnte CArlo Estimator. In International Conference on Machine Learning (ICML), 2018b.
  • Grover et al. [2018] Aditya Grover, Maruan Al-Shedivat, Jayesh K. Gupta, Yuri Burda, and Harrison Edwards. Learning policy representations in multiagent systems. In International Conference on Machine Learning (ICML), 2018.
  • He et al. [2016] He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent Modeling in Deep Reinforcement Learning. In International Conference on Machine Learning (ICML), 2016.
  • Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling The Knowledge in A Neural Network. arXiv preprint arXiv:1503.02531, 2015.
  • Hong et al. [2018] Zhang-Wei Hong, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, and Chun-Yi Lee. A Deep Policy Inference Q-Network for Multi-Agent Systems. In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2018.
  • Jaderberg et al. [2016] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, and K. Kavukcuoglu. Reinforcement Learning with Unsupervised Auxiliary Tasks. In International Conference on Learning Representations (ICLR), 2016.
  • Janner et al. [2019] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to Trust Your Model: Model-Based Policy Optimization. Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Kim et al. [2021a] Dong-Ki Kim, Miao Liu, Matthew Riemer, Chuangchuang Sun, Marwa Abdulhai, Golnaz Habibi, Sebastian Lopez-Cot, Gerald Tesauro, and Jonathan P. How. A Policy Gradient Algorithm for Learning To Learn in Multiagent Reinforcement Learning. In International Conference on Machine learning proceedings (ICML), 2021a.
  • Kim et al. [2021b] Woojun Kim, Jongeui Park, and Youngchul Sung. Communication in multi-agent reinforcement learning: Intention sharing. In International Conference on Learning Representations (ICLR), 2021b.
  • Kurach et al. [2020] Karol Kurach, Anton Raichuk, Piotr Stańczyk, Michał Zając, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, et al. Google Research Football: A Novel Reinforcement Learning Environment. In AAAI Conference on Artificial Intelligence (AAAI), 2020.
  • Kurutach et al. [2018] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-Ensemble Trust-Region Policy Optimization. In International Conference on Learning Representations (ICLR), 2018.
  • Lerer and Peysakhovich [2018] Adam Lerer and Alexander Peysakhovich. Maintaining Cooperation in Complex Social Dilemmas Using Deep Reinforcement Learning. arXiv preprint arXiv:1707.01068, 2018.
  • Lowe et al. [2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • Luo et al. [2019a] Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic Framework for Model-Based Deep Reinforcement Learning with Theoretical Guarantees. In International Conference on Learning Representations (ICLR), 2019a.
  • Luo et al. [2019b] Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic Framework for Model-Based Deep Reinforcement Learning with Theoretical Guarantees. In International Conference on Learning Representations (ICLR), 2019b.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-Level Control Through Deep Reinforcement Learning. Nature, 518(7540):529–533, 2015.
  • Oh et al. [2017] Junhyuk Oh, Satinder Singh, and Honglak Lee. Value Prediction Network. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • OpenAI [2018] OpenAI. Openai five. https://blog.openai.com/openai-five/, 2018.
  • Papoudakis and Albrecht [2020] Georgios Papoudakis and Stefano V Albrecht. Variational Autoencoders for Opponent Modeling in Multi-Agent Systems. arXiv preprint arXiv:2001.10829, 2020.
  • Premack and Woodruff [1978] David Premack and Guy Woodruff. Does The Chimpanzee Have A Theory of Mind? Behavioral and brain sciences, 1(4):515–526, 1978.
  • Rabinowitz et al. [2018] Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, SM Ali Eslami, and Matthew Botvinick. Machine Theory of Mind. In International Conference on Machine Learning (ICML), 2018.
  • Raileanu et al. [2018] Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. Modeling others using oneself in multi-agent reinforcement learning. In International Conference on Machine Learning (ICML), 2018.
  • Rashid et al. [2018] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning (ICML), 2018.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering The Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587):484–489, 2016.
  • Srinivas et al. [2009] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
  • Sutton [1990] Richard S Sutton. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In International Conference on Machine learning proceedings (ICML), 1990.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT press, 2018.
  • Tang et al. [2021] Zhenggang Tang, Chao Yu, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Shaolei Du, Yu Wang, and Yi Wu. Discovering diverse multi-agent strategic behavior via reward randomization. In International Conference on Learning Representations (ICLR), 2021.
  • Tesauro and Galperin [1996] G. Tesauro and G. R. Galperin. On-Line Policy Improvement Using Monte-Carlo Search. In Advances in Neural Information Processing Systems (NeurIPS), 1996.
  • Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning. Nature, 575(7782):350–354, 2019.
  • Wen et al. [2019] Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, and Wei Pan. Probabilistic Recursive Reasoning for Multi-Agent Reinforcement Learning. In International Conference on Learning Representations (ICLR), 2019.
  • Wen et al. [2020] Ying Wen, Yaodong Yang, Rui Luo, and Jun Wang. Modelling bounded rationality in multi-agent interactions by generalized recursive reasoning. In International Joint Conference on Artificial Intelligence (IJCAI), 2020.
  • Willemsen et al. [2021] Daniël Willemsen, Mario Coppola, and Guido CHE de Croon. Mambpo: Sample-efficient multi-robot reinforcement learning using learned world models. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021.
  • Yang et al. [2019] Tianpei Yang, Jianye Hao, Zhaopeng Meng, Chongjie Zhang, Yan Zheng, and Ze Zheng. Towards efficient detection and optimal response against sophisticated opponents. In International Joint Conference on Artificial Intelligence (IJCAI), 2019.
  • Yu et al. [2021] Chao Yu, Akash Velu, Eugene Vinitsky, Yu Wang, Alexandre M. Bayen, and Yi Wu. The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games. arXiv preprint arXiv:2103.01955, 2021.
  • Yuan et al. [2020] Luyao Yuan, Zipeng Fu, Jingyue Shen, Lu Xu, Junhong Shen, and Song-Chun Zhu. Emergence of Pragmatics From Referential Game Between Theory of Mind Agents. arXiv preprint arXiv:2001.07752, 2020.
  • Zhang et al. [2020] Kaiqing Zhang, Sham M. Kakade, Tamer Basar, and Lin F. Yang. Model-based multi-agent RL in zero-sum markov games with near-optimal sample complexity. arXiv preprint arXiv:2007.07461, 2020.
  • Zhang et al. [2021] Weinan Zhang, Xihuai Wang, Jian Shen, and Ming Zhou. Model-based multi-agent policy optimization with adaptive opponent-wise rollouts. arXiv preprint arXiv:2105.03363, 2021.
  • Zheng et al. [2018] Yan Zheng, Zhaopeng Meng, Jianye Hao, Zongzhang Zhang, Tianpei Yang, and Changjie Fan. A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Appendix A Proofs

Lemma 1.

For the mixed imagined opponent policy (IOP) $\tilde{\pi}_{\operatorname{mix}}^{o}(\cdot|s)=\sum\limits_{m=0}^{M-1}\alpha_{m}\tilde{\pi}^{o}_{m}(\cdot|s;\phi_{m})$, the total error satisfies the following inequality:

\varepsilon_{total}\leq\sum_{m=0}^{M-1}\delta_{m}\sum_{i=m}^{M-1}\alpha_{i}.
Proof.

By the definitions of $\delta_{m}$ and $\varepsilon_{m}$, it is clear that

\varepsilon_{m}=|\widehat{V}_{m}-V|\leq|\widehat{V}_{m}-\widehat{V}_{m-1}|+\cdots+|\widehat{V}_{1}-\widehat{V}_{0}|+|\widehat{V}_{0}-V|
=\delta_{m}+\cdots+\delta_{1}+\delta_{0}

Then, we can derive the following inequality,

\varepsilon_{total}=\sum_{m=0}^{M-1}\alpha_{m}\varepsilon_{m}\leq\sum_{m=0}^{M-1}\alpha_{m}\sum_{k=0}^{m}\delta_{k}
=\sum_{m=0}^{M-1}\delta_{m}\sum_{i=m}^{M-1}\alpha_{i}

Lemma 2.

Suppose the value function is Lipschitz continuous on the state space $\mathcal{S}$, $K$ is the Lipschitz constant, and $\widehat{\mathcal{M}}_{i}(s,a)\doteq\Gamma(s,a,a^{o};\zeta)\cdot\tilde{\pi}^{o}_{i}(a^{o}|s;\phi_{i})$ is the transition distribution given $s$ and $a$; then

\left|\widehat{V}_{i}-\widehat{V}_{j}\right|\leq\frac{\gamma K}{1-\gamma}\cdot\underset{s,a\sim\pi,\widehat{\mathcal{M}}_{j}}{\mathbb{E}}\left\|\widehat{\mathcal{M}}_{i}(s,a)-\widehat{\mathcal{M}}_{j}(s,a)\right\|.
Proof.

Lemma 2 is directly adapted from a theorem in [22], with some modifications to fit our context. ∎

Theorem 1.

Define the error between the approximated value function of the Bayesian mixing IOP and the true value function as $\varepsilon_{true}\doteq\left|\widehat{V}-V\right|$. For the Bayesian mixing IOP, the true error satisfies the following inequality:

\small\varepsilon_{true}\leq\varepsilon_{total}\leq\sum\limits_{s,a^{o}}\frac{\gamma Kd(s,a^{o})}{1-\gamma}\cdot\frac{\sum\limits_{m=0}^{M-1}\underset{s,a\sim\pi,\widehat{\mathcal{M}}_{m-1}}{\mathbb{E}}\left\|\widehat{\mathcal{M}}_{m}-\widehat{\mathcal{M}}_{m-1}\right\|\sum\limits_{i=m}^{M-1}\tilde{\pi}^{o}_{i}\left(a^{o}|s;\phi_{i}\right)p(i)}{\sum\limits_{j=0}^{M-1}\tilde{\pi}^{o}_{j}\left(a^{o}|s;\phi_{j}\right)p(j)}.
Proof.

It is clear that

\varepsilon_{true}\doteq\left|\widehat{V}-V\right|=\left|\sum_{m=0}^{M-1}\alpha_{m}\widehat{V}_{m}-V\right|
\leq\sum_{m=0}^{M-1}\alpha_{m}\left|\widehat{V}_{m}-V\right|
=\sum_{m=0}^{M-1}\alpha_{m}\varepsilon_{m}=\varepsilon_{total}

Then, we can derive the following inequality,

\varepsilon_{true}\leq\varepsilon_{total}\leq\sum\limits_{m=0}^{M-1}\delta_{m}\sum\limits_{i=m}^{M-1}\alpha_{i}=\sum\limits_{m=0}^{M-1}\delta_{m}\sum\limits_{i=m}^{M-1}\sum\limits_{s,a^{o}}d(s,a^{o})\frac{\tilde{\pi}^{o}_{i}\left(a^{o}|s;\phi_{i}\right)p(i)}{\sum\limits_{j=0}^{M-1}\tilde{\pi}^{o}_{j}\left(a^{o}|s;\phi_{j}\right)p(j)}
=\sum\limits_{s,a^{o}}d(s,a^{o})\frac{\sum\limits_{m=0}^{M-1}\delta_{m}\sum\limits_{i=m}^{M-1}\tilde{\pi}^{o}_{i}\left(a^{o}|s;\phi_{i}\right)p(i)}{\sum\limits_{j=0}^{M-1}\tilde{\pi}^{o}_{j}\left(a^{o}|s;\phi_{j}\right)p(j)}
=\sum\limits_{s,a^{o}}\frac{\gamma Kd(s,a^{o})}{1-\gamma}\cdot\frac{\sum\limits_{m=0}^{M-1}\underset{s,a\sim\pi,\widehat{\mathcal{M}}_{m-1}}{\mathbb{E}}\left\|\widehat{\mathcal{M}}_{m}-\widehat{\mathcal{M}}_{m-1}\right\|\sum\limits_{i=m}^{M-1}\tilde{\pi}^{o}_{i}\left(a^{o}|s;\phi_{i}\right)p(i)}{\sum\limits_{j=0}^{M-1}\tilde{\pi}^{o}_{j}\left(a^{o}|s;\phi_{j}\right)p(j)},

where $d(s,a^{o})$ is the stationary distribution of state and opponent-action pairs and $\widehat{\mathcal{M}}_{-1}$ denotes $\mathcal{P}(\cdot|s,a,a^{o})\cdot\pi^{o}(a^{o}|s)$, the true transition dynamics of the agent. ∎

Lemma 3.

Assume that the true opponent policy $\pi^{o}$ has a probability distribution over the given set of IOPs $\{\tilde{\pi}^{o}_{0},\ldots,\tilde{\pi}^{o}_{M-1}\}$, with probability $\mathrm{Pr}\{\pi^{o}=\tilde{\pi}_{m}^{o}\}=p_{m}$ for each $\tilde{\pi}_{m}^{o}$; then the maximum expectation of $\eta_{M}$ can be achieved if and only if $\alpha_{m}=p_{m},\forall m\in[0,M-1]$.

Proof.
\mathbb{E}[\eta_{M}]=-\mathbb{E}\|(\alpha_{0}\tilde{\pi}^{o}_{0}+\cdots+\alpha_{M-1}\tilde{\pi}^{o}_{M-1})-\pi^{o}\|
=-\sum_{m=0}^{M-1}p_{m}\|(\alpha_{0}\tilde{\pi}^{o}_{0}+\cdots+\alpha_{M-1}\tilde{\pi}^{o}_{M-1})-\tilde{\pi}^{o}_{m}\|
\geq-\sum_{m=0}^{M-1}\|\alpha_{m}\tilde{\pi}^{o}_{m}-p_{m}\tilde{\pi}^{o}_{m}\|

Since $\mathbb{E}[\eta_{M}]$ is a negative distance function, it is clear that

\mathbb{E}[\eta_{M}]\leq 0,

and when $\alpha_{m}=p_{m},\forall m\in[0,M-1]$,

\mathbb{E}[\eta_{M}]\geq-\sum_{m=0}^{M-1}\|p_{m}\tilde{\pi}^{o}_{m}-p_{m}\tilde{\pi}^{o}_{m}\|=0

Therefore, the maximum expectation of $\eta_{M}$ can be achieved if and only if $\alpha_{m}=p_{m},\forall m\in[0,M-1]$.

Theorem 2.

Given the action trajectory of the opponent $\{a^{o}_{0},a^{o}_{1},\ldots,a^{o}_{t}\}$, the posterior probability updated by Bayesian mixing approximates the true probability of the opponent as the length of the trajectory grows, i.e.,

P(m|a^{o}_{t},\cdots,a^{o}_{1},a^{o}_{0})\to P_{\operatorname{true}}(m),\quad t\to\infty.

Then the maximum expectation of $\eta_{M}$ can be achieved with $\alpha_{m}=\mathbb{E}[P(m|a^{o}_{t},\cdots,a^{o}_{1},a^{o}_{0})]$.

Proof.

According to Bayes’ theorem, as we update the posterior probability according to (7), we have the following posterior probabilities

p(m|a^{o}_{0})=\frac{p(a^{o}_{0}|m)p(m)}{\sum_{i=0}^{M-1}p(a^{o}_{0}|i)p(i)}
p(m|a^{o}_{1},a^{o}_{0})=\frac{p(a^{o}_{1}|a^{o}_{0},m)p(a^{o}_{0}|m)p(m)}{\left[\sum_{i=0}^{M-1}p(a^{o}_{1}|a^{o}_{0},i)p(i)\right]\cdot\left[\sum_{i=0}^{M-1}p(a^{o}_{0}|i)p(i)\right]}
\vdots
p(m|a^{o}_{t},\cdots,a^{o}_{0})=\frac{p(a^{o}_{t}|a^{o}_{t-1},\cdots,a^{o}_{0},m)\cdots p(a^{o}_{0}|m)p(m)}{\left[\sum_{i=0}^{M-1}p(a^{o}_{t}|a^{o}_{t-1},\cdots,a^{o}_{0},i)p(i)\right]\cdots\left[\sum_{i=0}^{M-1}p(a^{o}_{0}|i)p(i)\right]}

With a sufficiently long trajectory of opponent actions,

P(m|a^{o}_{t},\cdots,a^{o}_{0})\to P_{\operatorname{true}}(m)\doteq p_{m},\quad t\to\infty.

This can also be expressed as

\mathbb{E}[P(m|a^{o}_{t},\cdots,a^{o}_{1},a^{o}_{0})]=p_{m}.

When $\alpha_{m}=\mathbb{E}[P(m|a^{o}_{t},\cdots,a^{o}_{1},a^{o}_{0})]$, we have $\alpha_{m}=p_{m}$. By Lemma 3, the maximum expectation of $\eta_{M}$ is achieved. ∎
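As a sanity check (not part of the paper), the following toy simulation illustrates the statement of Theorem 2 under simplifying assumptions: the candidate IOPs are fixed, distinct categorical policies over a discrete action space, the true opponent is drawn from $(p_{0},\ldots,p_{M-1})$ at the start of each trial, and the posterior is updated by the standard recursive Bayes rule, which is a simplification of (7). Averaged over trials, the posterior approaches $p_{m}$.

import numpy as np

# Toy illustration of Theorem 2 (a sketch; the setup here is assumed, not from the paper).
rng = np.random.default_rng(0)
M, n_actions, T, n_trials = 3, 4, 200, 500
iops = rng.dirichlet(np.ones(n_actions), size=M)   # M fixed candidate opponent policies
p_true = np.array([0.2, 0.5, 0.3])                 # Pr{pi^o = pi~_m}

avg_posterior = np.zeros(M)
for _ in range(n_trials):
    m_star = rng.choice(M, p=p_true)               # true opponent of this trial
    posterior = np.full(M, 1.0 / M)                # uniform prior p(m)
    for _ in range(T):
        a = rng.choice(n_actions, p=iops[m_star])  # observed opponent action
        posterior *= iops[:, a]                    # multiply by the likelihood p(a|m)
        posterior /= posterior.sum()               # renormalize
    avg_posterior += posterior / n_trials

print("average posterior:", np.round(avg_posterior, 3))   # close to p_true
print("p_true           :", p_true)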

Appendix B Weights $\bm{\alpha}$

Figure 5: The change of $\bm{\alpha}$ during adaptation, colored from red (initial $\bm{\alpha}$) to green (final), in Triangle Game. In each subplot, the top-left number is the opponent index; from left to right, the agent plays against a fixed policy, a naïve learner, and a reasoning learner.

In Bayesian mixing, $\bm{\alpha}$ is the vector of weights used to mix the IOPs,

\bm{\alpha}=(\alpha_{0},\dots,\alpha_{M-1})=\text{softer-softmax}(\Psi_{0},\dots,\Psi_{M-1}).

$\Psi_{m}^{t}$ of the level-$m$ IOP at timestep $t$ is

\Psi_{m}^{t}=\sum_{l=t-H}^{t-1}\lambda^{t-l}p(m|a_{l}^{o}),

where $\lambda$ is the decay factor, $H$ is the horizon, and $p(m|a_{t}^{o})$ can be calculated using (7). Moreover, $p(m)$ at timestep $t$ is the moving average of $p(m|a^{o})$ over the horizon $H$:

p(m)=\sum_{l=t-H}^{t-1}p(m|a_{l}^{o})/H.
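To make the mixing procedure concrete, below is a minimal sketch (not the authors' code) that maintains the per-step posteriors $p(m|a^{o}_{l})$ over a sliding horizon $H$ and produces the mixing weights $\bm{\alpha}$. It assumes the likelihood of the observed opponent action under the level-$m$ IOP, $\tilde{\pi}^{o}_{m}(a^{o}|s;\phi_{m})$, is available, and it takes softer-softmax to be a temperature-scaled softmax (the temperature playing the role of the s-softmax parameter in Table 3).

from collections import deque

import numpy as np

def softer_softmax(x, temperature=1.0):
    # Assumed form of softer-softmax: a softmax with temperature scaling.
    z = np.asarray(x, dtype=np.float64) / temperature
    z -= z.max()                                   # numerical stability
    e = np.exp(z)
    return e / e.sum()

class BayesianMixer:
    """Maintains the mixing weights alpha over M IOPs from recent opponent actions."""

    def __init__(self, M, horizon=10, decay=0.9, temperature=1.0):
        self.M, self.H, self.lam, self.tau = M, horizon, decay, temperature
        self.posteriors = deque(maxlen=horizon)    # p(m|a^o_l) for the last H steps
        self.prior = np.full(M, 1.0 / M)           # p(m), kept as a moving average

    def update(self, likelihoods):
        # likelihoods[m] = pi~^o_m(a^o|s; phi_m): probability the level-m IOP
        # assigns to the opponent action actually observed at this timestep.
        post = np.asarray(likelihoods, dtype=np.float64) * self.prior + 1e-12
        post /= post.sum()                         # p(m|a^o_t) as in (7)
        self.posteriors.append(post)
        self.prior = np.mean(list(self.posteriors), axis=0)   # moving average over H

    def alpha(self):
        # Psi_m^t = sum_{l=t-H}^{t-1} lambda^{t-l} p(m|a^o_l)
        psi = np.zeros(self.M)
        t = len(self.posteriors)
        for l, post in enumerate(self.posteriors):
            psi += (self.lam ** (t - l)) * post
        return softer_softmax(psi, self.tau)

At every timestep the agent would call update with the IOPs' probabilities of the observed opponent action and then mix the IOPs' outputs according to alpha; the class and method names here are illustrative.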
Figure 6: The change of $\bm{\alpha}$ during adaptation, colored from red (initial $\bm{\alpha}$) to green (final), in One-on-One. In each subplot, the top-left number is the opponent index; from left to right, the agent plays against a fixed policy, a naïve learner, and a reasoning learner.

Figures 5 and 6 depict the change of $\bm{\alpha}$ ($M=3$) during adaptation when the agent plays against different test opponents in Triangle Game and One-on-One, respectively. In each subplot, the top-left number is the opponent index, and from left to right the opponents are a fixed policy, a naïve learner, and a reasoning learner. The changing trends of $\bm{\alpha}$ are diverse when playing against different opponents.

Since the level-0 IOP is finetuned during interaction, $\alpha_{0}$ should intuitively be large when playing against a fixed policy, and $\alpha_{1}$ should be large when playing against a reasoning learner. The naïve learner, whose policy is updated only from reward, has no direct counterpart among the IOPs. As illustrated in Figures 5 and 6, $\bm{\alpha}$ does not always converge to the corresponding level of IOP when playing against the fixed policy and the reasoning learner. The reason is two-fold. First, as the level-0 IOP is pre-trained against training opponents that differ from the test opponents (discussed in Appendix E), the level-0 IOP can be largely different from the opponent policy, and the small number of samples obtained during online interaction may not be enough to finetune the level-0 IOP to accurately model the opponent policy. This is the error of opponent modeling, and thus $\bm{\alpha}$ does not always converge to $\alpha_{0}$ when testing against the fixed policy. Second, due to such an inaccurate level-0 IOP and the error of the environment model, higher-level IOPs may also be inaccurate, so $\bm{\alpha}$ does not always converge to $\alpha_{1}$ when testing against the reasoning learner. Essentially, $\bm{\alpha}$ is a mapping from the IOPs reasoned by the agent to the true policy generated by the opponent's learning method. When testing against different types of opponents, the IOP mixed according to $\bm{\alpha}$ may already represent the true opponent policy and capture its update well, which offsets the errors of opponent modeling and the environment model, keeps MBOM largely unaffected, and allows it to outperform the baselines.

Appendix C Ablation on Hyperparameters

Figure 7: Performance against different types of opponents, i.e., fixed policy, naïve learner, and reasoning learner, with different rollout planning horizons $k$ as in (4). Panels (a)–(c): Triangle Game versus fixed policy, naïve learner, and reasoning learner; panels (d)–(f): One-on-One versus fixed policy, naïve learner, and reasoning learner. The selection of $k$ is a tradeoff between the error of the environment model and the accuracy of the rollout estimate. Results are plotted as means with 95% confidence intervals over five random seeds (x-axis is opponent index).

Legend: $k=1$, $k=2$, $k=3$ (Triangle Game); $k=1$, $k=3$, $k=5$ (One-on-One).

Figure 8: Performance against different types of opponents, i.e., fixed policy, naïve learner, and reasoning learner, with different numbers of recursive imagination levels $M$. Panels (a)–(c): Triangle Game versus fixed policy, naïve learner, and reasoning learner; panels (d)–(f): One-on-One versus fixed policy, naïve learner, and reasoning learner. Results are plotted as means with 95% confidence intervals over five random seeds (x-axis is opponent index). Higher $M$ improves the representational capability of the opponent model but also accumulates errors, so the choice of $M$ is a tradeoff between the two. $M=1,2,3,4$ are evaluated in Triangle Game, while $M=1,2,3$ are evaluated in One-on-One. In general, $M\geq 2$ performs similarly, which indicates robustness to the choice of $M$. Note that $M=1$ corresponds to MBOM w/o IOPs.

Legend: $M=1$, $M=2$, $M=3$, $M=4$.

Figure 7 shows the performance of MBOM with different rollout planning horizons $k$. The selection of $k$ is a tradeoff between the environment model error and the accuracy of value estimation. In addition, the computational complexity of the IOPs increases exponentially with $k$. In a sparse-reward environment, appropriately increasing $k$ makes the algorithm more robust, while in a dense-reward environment a smaller $k$ works well.

Figure 8 shows the performance of MBOM with different numbers of recursive imagination levels $M$. From our theoretical analysis, higher $M$ improves the representational capability of the opponent model but also accumulates model error. As illustrated in Figure 8, except for $M=1$ (i.e., MBOM w/o IOPs), $M=2,3,4$ perform similarly, indicating that $M\geq 2$ is robust. A larger $M$ increases the representational capability of the IOPs but does not always improve performance, i.e., the gain vanishes as $M$ increases due to the compounding error. Moreover, in practice $M$ also linearly increases the computational cost, so smaller values of $M$, e.g., 2 or 3, are generally preferred.

Appendix D Detailed Results on Predator-Prey

Figure 9: Performance against different types of opponents, i.e., fixed policy, naïve learner, and reasoning learner, in Predator-Prey, where the x-axis is the joint opponent index. Panels: (a) versus fixed policy, (b) versus naïve learner, (c) versus reasoning learner. A smaller number of touches means better performance.

Legend: Meta-MAPG, Meta-PG, PPO, LOLA-DiCE, MBOM.

Figure 9 shows the performance against different types of opponents compared with the baselines. For each type, there are ten test joint opponent policies. The results show that MBOM substantially outperforms the baselines against reasoning learners. LOLA-DiCE, Meta-PG, and Meta-MAPG underperform PPO when playing against multiple reasoning learners, indicating that their adaptation may have a negative effect on the agent's performance. The reasons may be as follows. LOLA-DiCE exploits the opponent model to estimate the learning gradient of the opponent policy; however, when playing against multiple reasoning learners, the estimated gradient of their joint policy can hardly be accurate enough to capture the changes of their individual policies, since each opponent learns conditioned on the agent's policy. Meta-PG and Meta-MAPG both update the agent's policy to accommodate the future policy of the opponent; however, the future joint policy of multiple reasoning opponents is much harder to anticipate than in single-opponent cases.

Appendix E Experiment Settings

The experiment environments are detailed as follows:

Triangle Game. As shown in Figure 10(a), there are two moving players, player 1 and player 2, and three fixed landmarks, $L1$–$L3$, in a square field. The landmarks are located at the three vertices of an equilateral triangle with side length 0.6. When the distance between a player and a landmark is less than 0.15, the player touches the landmark and is in state $T$: $T1$ indicates that the player touches landmark $L1$, and so on. If the player does not touch any landmark, its state is $F$. The payoff matrix of the two players is shown in Table 2, where player 2 has an inherent disadvantage since the optimal action of player 2 always strictly depends on the state of player 1. When facing different policies of player 1, player 2 has to adjust its policy to adapt to player 1 for a higher reward. We control player 2 as the agent and take player 1 as the opponent.

One-on-One. As shown in Figure 10(b), there are a goalkeeper and a shooter; the shooter controls the ball in the initial state and can dribble or shoot it. At the end of an episode, if the shooter shoots the ball into the goal, the shooter gets a reward of $+1$ and the goalkeeper gets a reward of $-1$; otherwise, the shooter gets a reward of $-1$ and the goalkeeper gets a reward of $+1$. The goalkeeper can only passively react to the strategies of the shooter and has to adapt its policy when the shooter's strategy changes. We control the goalkeeper as the agent and take the shooter as the opponent.

Predator-Prey. We follow the setting in MPE [21]. In each game, the agent plays against three opponents (predators), and the episode length is 200 timesteps.

Coin Game. As shown in Figure 10(c), there are two players, red and blue, moving on a $3\times 3$ grid, and two types of coins, red and blue, randomly generated on the grid. If a player moves to the position of a coin, the player collects the coin and receives a reward of $+1$. However, if the color of the collected coin differs from the player's color, the other player receives a reward of $-2$. The length of the game is 150 timesteps.
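To make the reward rule concrete, here is a minimal sketch of the Coin Game payoff described above (a sketch only; the function name and dictionary-based interface are illustrative, not the authors' environment code):

def coin_game_rewards(collector, coin_color):
    """Reward rule of Coin Game: collector and coin_color are "red" or "blue".

    Returns a dict mapping each player's color to its reward for this pickup.
    """
    rewards = {"red": 0.0, "blue": 0.0}
    rewards[collector] += 1.0                # collecting any coin gives +1
    if coin_color != collector:              # picked up the other player's coin
        other = "blue" if collector == "red" else "red"
        rewards[other] -= 2.0                # the other player receives -2
    return rewards

# Example: the red player collects a blue coin.
print(coin_game_rewards("red", "blue"))      # {'red': 1.0, 'blue': -2.0}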

Figure 10: Illustrations of the scenarios: (a) Triangle Game, (b) One-on-One, (c) Coin Game.
Table 2: Payoff matrix of Triangle Game. Each entry is the payoff of player 1 / player 2; rows are the state of player 1 and columns are the state of player 2.

               Player 2
               F              T1             T2             T3
Player 1  F    0 / 0          -0.5 / +0.5    -0.5 / +0.5    -0.5 / +0.5
          T1   +0.5 / -0.5    +1 / -1        +1 / -1        -1 / +1
          T2   +0.5 / -0.5    -1 / +1        +1 / -1        +1 / -1
          T3   +0.5 / -0.5    +1 / -1        -1 / +1        +1 / -1
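For reference, the payoff matrix in Table 2 can be expressed as a simple lookup (a sketch; the names are illustrative, not the authors' code):

# Payoffs from Table 2: PAYOFF[p1_state][p2_state] = (player 1 reward, player 2 reward).
PAYOFF = {
    "F":  {"F": (0.0, 0.0),  "T1": (-0.5, 0.5), "T2": (-0.5, 0.5), "T3": (-0.5, 0.5)},
    "T1": {"F": (0.5, -0.5), "T1": (1.0, -1.0), "T2": (1.0, -1.0), "T3": (-1.0, 1.0)},
    "T2": {"F": (0.5, -0.5), "T1": (-1.0, 1.0), "T2": (1.0, -1.0), "T3": (1.0, -1.0)},
    "T3": {"F": (0.5, -0.5), "T1": (1.0, -1.0), "T2": (-1.0, 1.0), "T3": (1.0, -1.0)},
}

def triangle_payoff(p1_state, p2_state):
    """Return (player 1 reward, player 2 reward) for the given touch states."""
    return PAYOFF[p1_state][p2_state]

# Example: player 1 touches L1, player 2 touches L3 -> player 2 gets +1.
print(triangle_payoff("T1", "T3"))   # (-1.0, 1.0)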

Preparing opponents. For the two types of opponents, fixed policy and naïve learner, we run the independent PPO [32] algorithm 10 times. During each run, we store 20 opponent policies in the training set, 3 opponent policies in the validation set, and 3 opponent policies in the test set, so the sizes of the training set, validation set, and test set are 200, 30, and 30, respectively. The validation set is only required by Meta-PG and Meta-MAPG. The reasoning learner learns a model to predict the action of the agent and a policy conditioned on the predicted action. Since the initial parameters of the reasoning learner should not be shared with the first two types of opponents, we train an additional 30 reasoning learners in the same way as described above and add them to the test set.
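The following is a minimal sketch of the reasoning-learner structure described above: one head predicts the agent's action from the observation, and the policy head is conditioned on the observation together with that prediction. The network sizes, the softmax over the predicted agent action, and all names are assumptions for illustration, not the authors' specification.

import torch
import torch.nn as nn

class ReasoningLearner(nn.Module):
    """Predicts the agent's action and conditions its own policy on the prediction."""

    def __init__(self, obs_dim, agent_action_dim, own_action_dim, hidden=64):
        super().__init__()
        # Head that predicts the agent's action distribution from the observation.
        self.agent_action_predictor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, agent_action_dim),
        )
        # Policy head conditioned on the observation and the predicted agent action.
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + agent_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, own_action_dim),
        )

    def forward(self, obs):
        pred_agent_action = torch.softmax(self.agent_action_predictor(obs), dim=-1)
        logits = self.policy(torch.cat([obs, pred_agent_action], dim=-1))
        return logits, pred_agent_action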

To increase the diversity of the opponent policies, the method of [37] could be adopted, but here we use a few tricks to increase diversity without incurring too much training cost. For the Triangle Game, we train opponents with a modified reward so as to obtain an opponent that commutes between T1 and T2; other types of opponents, such as hovering around one landmark, commuting between two landmarks, or rotating among the three landmarks, are obtained in a similar way. For One-on-One, we set a barrier in front of the goal (invisible, but able to block the ball) and keep only a gap through which the ball can enter the goal; we train opponents with such gaps at different positions.

Pre-training and Test. In the pre-training phase, all methods are trained against $\nu$ learning opponents from the training set. The data collected during this phase is used to fit the environment model and the level-0 IOP of MBOM. In the test phase, the agent interacts with the opponents in the test set to evaluate its ability to adapt to various opponents. The test phase lasts for 100 episodes, during which the environment model is no longer trained while the agent continuously finetunes its parameters. The fixed policy, naïve learner, and reasoning learner initialize their parameters from the test set and then follow their respective learning schemes. All methods use the same training, validation, and test sets. There are enough opponents with different policies for testing to ensure that the experimental results are unbiased.

The hyperparameters of MBOM are summarized in Table 3.

Table 3: Hyperparameters

                                          Triangle Game   One-on-One    Predator-Prey   Coin Game
PPO             policy hidden units       MLP[64,32]      LSTM[64,32]   MLP[64,32]      MLP[64,32]
                value hidden units        MLP[64,32]      MLP[64,32]    MLP[64,32]      MLP[64,32]
                activation function       ReLU            ReLU          ReLU            ReLU
                optimizer                 Adam            Adam          Adam            Adam
                learning rate             0.001           0.001         0.001           0.001
                num. of updates           10              10            10              10
                value discount factor     0.99            0.99          0.99            0
                GAE parameter             0.99            0.99          0.99            0
                clip parameter            0.115           0.115         0.115           0.115
Opponent model  hidden units              MLP[64,32]      MLP[64,32]    MLP[64,32]      MLP[64,32]
                learning rate             0.001           0.001         0.001           0.001
                batch size                64              64            64              64
                num. of updates           10              10            10              10
IOPs            num. of levels $M$        3               3             2               2
                learning rate             0.005           0.005         0.005           0.005
                update times              3               3             3               3
                rollout horizon           2               5             1               1
                decay factor of $\Psi$    0.9             0.9           0.9             0.9
                horizon of $\Psi$         10              10            10              10
                s-softmax parameter       1               1.1/e         1               1
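Expressed as code, the Triangle Game column of Table 3 might be organized as the following configuration sketch (key names are illustrative, not from the authors' code):

# Hyperparameters of MBOM for Triangle Game, transcribed from Table 3.
TRIANGLE_GAME_CONFIG = {
    "ppo": {
        "policy_hidden": ("MLP", [64, 32]),
        "value_hidden": ("MLP", [64, 32]),
        "activation": "ReLU",
        "optimizer": "Adam",
        "learning_rate": 1e-3,
        "num_updates": 10,
        "value_discount_factor": 0.99,
        "gae_parameter": 0.99,
        "clip_parameter": 0.115,
    },
    "opponent_model": {
        "hidden": ("MLP", [64, 32]),
        "learning_rate": 1e-3,
        "batch_size": 64,
        "num_updates": 10,
    },
    "iops": {
        "num_levels_M": 3,
        "learning_rate": 5e-3,
        "update_times": 3,
        "rollout_horizon": 2,
        "psi_decay_factor": 0.9,     # lambda in the definition of Psi
        "psi_horizon": 10,           # H
        "s_softmax_parameter": 1.0,
    },
}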

Appendix F Future Work

In MBOM, the learning agent treats all opponents as a joint opponent. If the size of the joint opponent is large, the agent needs a large number of rollouts to obtain an accurate best response, and the cost increases dramatically with the size of the joint opponent. How to reduce the computational overhead in such scenarios will be considered in future work. Moreover, MBOM implicitly assumes that the relationship between opponents is fully cooperative; how to deal with the case where their relationship is non-cooperative is also left as future work.