Offline Decentralized Multi-Agent Reinforcement Learning
Abstract
In many real-world multi-agent cooperative tasks, due to high cost and risk, agents cannot continuously interact with the environment and collect experiences during learning, but have to learn from offline datasets. However, the transition dynamics in the dataset of each agent can be very different from those induced by the learned policies of other agents in execution, creating large errors in value estimation. Consequently, agents learn uncoordinated, low-performing policies. In this paper, we propose a framework for offline decentralized multi-agent reinforcement learning, which exploits value deviation and transition normalization to deliberately modify the transition probabilities. Value deviation optimistically increases the transition probabilities of high-value next states, and transition normalization normalizes the transition probabilities of next states. Together they enable agents to learn high-performing and coordinated policies. Theoretically, we prove the convergence of Q-learning under the altered non-stationary transition dynamics. Empirically, we show that the framework can be easily built on many existing offline reinforcement learning algorithms and achieves substantial improvement in a variety of multi-agent tasks.
1 Introduction
Reinforcement learning (RL) has shown great potential in many domains, including recommendation systems [13], games [28], and robotics [7]. When applying RL to real-world applications, the agent needs to continuously interact with the environment and collect experiences for learning, which however can be costly, risky, and time-consuming. One way to address this is offline RL, where the agent learns the policy from a fixed dataset of experiences collected by any behavior policy, without interacting with the environment. The main challenge of offline RL is the extrapolation error, an error incurred by the mismatch between the experience distributions of the learned policy and the dataset [5]. Many offline RL methods [5, 10, 31, 11] have been proposed to address the value overestimation caused by the mismatch between the action distributions of the learned policy and the behavior policy. However, much less attention has been paid to the mismatch between the transition dynamics in the dataset and the real ones in execution. The main reason is that the mismatch of transition dynamics is negligible given a large dataset, if the environment is stationary. Nevertheless, in many real-world scenarios, e.g., autonomous driving, there are other learning agents in the environment. That means the policies of other agents during data collection can be significantly different from their policies in execution, which creates a mismatch of transition dynamics and hence undermines the learned policy.
More specifically, these scenarios can be formulated as the offline decentralized multi-agent setting, where agents cooperate on the task but each learns its policy from its own fixed dataset. The dataset of each agent is independently collected by arbitrary behavior policies (i.e., we make no assumptions on data collection) and contains its own action instead of the joint action of all agents. This setting resembles many industrial applications where agents belong to different companies and the actions of other agents are not accessible, e.g., autonomous vehicles or robots. Apparently, this setting does not fit the paradigm of offline centralized training and decentralized execution (CTDE) in multi-agent reinforcement learning (MARL) [33]: even if the datasets of all agents could be accessed in a centralized way, a full dataset that contains the joint actions could not be constructed, because the datasets are fully independent without any common information. Thus, each agent has to learn a cooperative policy in an offline and fully decentralized way. In such decentralized multi-agent settings, from the perspective of an individual agent, other agents are a part of the environment, so the transition dynamics experienced by each agent depend on the policies of other agents and change as other agents update their policies [2]. As the learned policies of other agents will be inconsistent with their behavior policies, the transition dynamics induced by the learned policies of other agents in execution will differ from the transition dynamics in the dataset, causing a large mismatch between the two. This mismatch can lead to a low-performing policy for each agent. Furthermore, as agents learn in a decentralized way on datasets with different distributions of experiences collected by various behavior policies, the estimated values of the same state can differ greatly between agents, which prevents the learned policies from coordinating well with each other.
In this paper, we propose a framework for offline fully decentralized multi-agent reinforcement learning in which, to overcome the mismatch between transition dynamics and the resulting miscoordination, value deviation and transition normalization are introduced to deliberately modify the transition probabilities in the dataset. During data collection, if one agent takes a ‘good’ action while other agents take ‘bad’ actions at a state, the transition probabilities of low-value next states will be high. Thus, the Q-value of the ‘good’ action will be underestimated by the agent. As other agents are also learning, their learned policies are expected to become better than their behavior policies. Therefore, for each agent, the transition probabilities of high-value next states in execution will be higher than those in the dataset. So, we let each agent be optimistic towards other agents and multiply the transition probabilities by the deviation of the value of the next state from the expected value over all next states, to make the estimated transition probabilities close to the transition probabilities induced by the learned policies of other agents. To address the miscoordination caused by the diverse value estimates of agents, we normalize the transition probabilities in the dataset to be uniform. Transition normalization helps build a consensus on value estimates among agents and hence promotes coordination. By combining value deviation and transition normalization, agents are able to learn high-performing and coordinated policies in an offline and fully decentralized way.
Value deviation and transition normalization make the transition dynamics non-stationary during learning. However, we mathematically prove the convergence of Q-learning under such purposely controlled non-stationary transition dynamics. Moreover, by derivation, we show that value deviation and transition normalization take effect merely as weights of the objective function, so our framework can be easily built on many existing offline single-agent RL algorithms that address the overestimation incurred by out-of-distribution actions. Empirically, we provide an example instantiation and a thorough analysis of our framework on BCQ [5], termed MABCQ, and also test the variants on CQL [11] and TD3+BC [4], termed MACQL and MATD3+BC respectively. Experimental results show that our method substantially outperforms baselines in a variety of multi-agent tasks, including multi-agent mujoco [1], SMAC [23], and MPE [15]. To the best of our knowledge, this is the first work on offline and fully decentralized multi-agent reinforcement learning.
2 Preliminaries
We consider $N$ agents in a multi-agent MDP [17] with the state space $\mathcal{S}$ and the joint action space $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$. At each timestep, each agent $i$ gets the state $s$ (the global state is used only for the convenience of theoretical analysis [9]; the agents could learn based on partial observations in practice) and performs an individual action $a_i$, and the environment transitions to the next state $s'$ by taking the joint action $\boldsymbol{a}$ with the transition probability $P(s'|s,\boldsymbol{a})$. All agents get a shared reward $r$, which is simplified to depend only on the state [24] (again only for the convenience of theoretical analysis, not necessary in practice). All the agents learn to maximize the expected return $\mathbb{E}\left[\sum_{t}\gamma^{t} r_{t}\right]$, where $\gamma$ is a discount factor. However, in fully decentralized learning, $P(s'|s,\boldsymbol{a})$ is partially observable to each agent since each agent $i$ only observes its own action $a_i$ instead of the joint action $\boldsymbol{a}$. During execution, from the perspective of each agent $i$, there is an experienced MDP with the individual action space $\mathcal{A}_i$ and the online transition probability
$$P^{\mathrm{on}}_i(s'|s,a_i) = \mathbb{E}_{\boldsymbol{a}_{-i} \sim \boldsymbol{\pi}_{-i}}\left[P(s'|s,a_i,\boldsymbol{a}_{-i})\right],$$
where $\boldsymbol{a}_{-i}$ and $\boldsymbol{\pi}_{-i}$ respectively denote the joint action and the learned joint policy of all agents except agent $i$. However, as the agent cannot interact with other agents in the environment, $P^{\mathrm{on}}_i$ is unknown.
In offline decentralized settings, each agent $i$ only has access to a fixed offline dataset $\mathcal{B}_i$, which is pre-collected by behavior policies and contains the tuples $(s, a_i, r, s')$. As defined in BCQ [5], the visible MDP is constructed on $\mathcal{B}_i$, which has the offline transition probability
$$P^{\mathrm{off}}_i(s'|s,a_i) = \frac{N(s,a_i,s')}{\sum_{\tilde{s}} N(s,a_i,\tilde{s})},$$
where $N(s,a_i,s')$ is the number of times the tuple $(s,a_i,s')$ appears in $\mathcal{B}_i$. As the learned policies of other agents may greatly deviate from their behavior policies, $P^{\mathrm{off}}_i$ can be largely biased from $P^{\mathrm{on}}_i$. We define the discrepancy between $P^{\mathrm{off}}_i$ and $P^{\mathrm{on}}_i$ as the transition bias. The large extrapolation errors caused by the transition bias, together with the differences in value estimates between agents, lead to uncoordinated, low-performing policies.
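For the tabular case, $P^{\mathrm{off}}_i$ is simply a normalized count over the agent's own dataset. Below is a minimal sketch of this estimate (illustrative only; the function and variable names are not from the paper):

```python
from collections import Counter, defaultdict

def offline_transition_probs(transitions):
    """Empirical P_off(s' | s, a_i) from a list of (s, a_i, s') tuples,
    following the count-based definition above."""
    counts = Counter((s, a, s2) for s, a, s2 in transitions)
    totals = defaultdict(int)
    for (s, a, _), n in counts.items():
        totals[(s, a)] += n
    return {(s, a, s2): n / totals[(s, a)] for (s, a, s2), n in counts.items()}

# Toy dataset: one state-action pair observed 10 times with two different next states.
data = [("s0", "a0", "s1")] * 4 + [("s0", "a0", "s2")] * 6
print(offline_transition_probs(data))  # P_off(s1|s0,a0)=0.4, P_off(s2|s0,a0)=0.6
```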
Table 1: The matrix game. Rows are the actions of agent 1 and columns are the actions of agent 2; the numbers in parentheses are the action probabilities of the behavior policies.

| | Agent 2: $a^2_1$ (0.4) | Agent 2: $a^2_2$ (0.6) |
|---|---|---|
| Agent 1: $a^1_1$ (0.8) | 1 | 5 |
| Agent 1: $a^1_2$ (0.2) | 6 | 1 |
Table 2: The offline transition probabilities and expected returns estimated by each agent from its own dataset.

Agent 1:

| action | transition | expected return |
|---|---|---|
| $a^1_1$ | $\langle 0.4, 0.6 \rangle$ | 3.4 |
| $a^1_2$ | $\langle 0.4, 0.6 \rangle$ | 3.0 |

Agent 2:

| action | transition | expected return |
|---|---|---|
| $a^2_1$ | $\langle 0.8, 0.2 \rangle$ | 2.0 |
| $a^2_2$ | $\langle 0.8, 0.2 \rangle$ | 4.2 |
To intuitively illustrate the issue, we introduce a two-player matrix game as depicted in Table 1, where the action distributions of the behavior policies of the two agents are $\langle 0.8, 0.2 \rangle$ and $\langle 0.4, 0.6 \rangle$, respectively. Table 2 shows the transition probabilities and expected returns calculated independently by the agents from the datasets collected by the behavior policies. However, as the behavior policies are poor, when agent 1 chooses the optimal action $a^1_2$, agent 2 has a higher probability of selecting the suboptimal action $a^2_2$, which leads to low transition probabilities of the true high-value next states ($0.4$ vs. $0.6$). So the agents underestimate the optimal actions and converge to the suboptimal joint policy $\langle a^1_1, a^2_2 \rangle$, rather than the optimal one $\langle a^1_2, a^2_1 \rangle$.
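A quick arithmetic check of the expected returns in Table 2, using the payoff matrix and behavior policies as reconstructed in Table 1 (a minimal sketch; the variable names are illustrative):

```python
# Payoff matrix R[i][j]: agent 1 plays row action i, agent 2 plays column action j.
R = [[1, 5],
     [6, 1]]
mu1 = [0.8, 0.2]  # behavior policy of agent 1 (Table 1)
mu2 = [0.4, 0.6]  # behavior policy of agent 2 (Table 1)

# Agent 1's offline expected return for each of its actions: the partner's
# behavior policy plays the role of the transition probabilities in Table 2.
q1 = [sum(mu2[j] * R[i][j] for j in range(2)) for i in range(2)]
# Agent 2's offline expected return for each of its actions.
q2 = [sum(mu1[i] * R[i][j] for i in range(2)) for j in range(2)]
print(q1)  # [3.4, 3.0] -> agent 1 prefers a1_1
print(q2)  # [2.0, 4.2] -> agent 2 prefers a2_2
```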
3 Offline Decentralized MARL Framework
In fully decentralized learning, as the policies of other agents are not accessible, it is hard for an agent to learn a policy that coordinates well with other agents merely from a dataset that only contains its own actions. To tackle this challenging problem, we propose a framework that introduces value deviation and transition normalization to address the transition bias and miscoordination, and leverages offline single-agent RL algorithms to avoid out-of-distribution actions. The convergence under the non-stationary transition dynamics is theoretically guaranteed, and an example instantiation of the framework is provided.
3.1 Value Deviation
If the behavior policies of some agents are low-performing during data collection, they usually take ‘bad’ actions alongside the ‘good’ actions of other agents, which leads to high transition probabilities of low-value next states. When agent $i$ performs Q-learning on the dataset $\mathcal{B}_i$, the Bellman operator is approximated with the transition probability $P^{\mathrm{off}}_i$ to estimate the expectation over next states:
$$\mathcal{T} Q_i(s,a_i) = \mathbb{E}_{s' \sim P^{\mathrm{off}}_i(\cdot|s,a_i)}\left[r + \gamma \max_{a_i'} Q_i(s',a_i')\right].$$
If $P^{\mathrm{off}}_i(s'|s,a_i)$ of a high-value $s'$ is lower than $P^{\mathrm{on}}_i(s'|s,a_i)$, the Q-value of $(s,a_i)$ is underestimated, which will cause a large extrapolation error.
However, as the policies of other agents are also updated towards maximizing the Q-values, $P^{\mathrm{on}}_i(s'|s,a_i)$ of high-value next states will grow higher than $P^{\mathrm{off}}_i(s'|s,a_i)$. Thus, we let each agent be optimistic towards other agents and modify $P^{\mathrm{off}}_i$ as
$$\hat{P}_i(s'|s,a_i) = \frac{1}{Z}\left(1 + \frac{V_i(s') - \mathbb{E}_{s'' \sim P^{\mathrm{off}}_i(\cdot|s,a_i)}\left[V_i(s'')\right]}{\mathbb{E}_{s'' \sim P^{\mathrm{off}}_i(\cdot|s,a_i)}\left[V_i(s'')\right]}\right) P^{\mathrm{off}}_i(s'|s,a_i),$$
where the state value $V_i(s') = \max_{a_i'} Q_i(s',a_i')$, the term in parentheses is one plus the relative deviation of the value of the next state from the expected value over all next states, which increases the transition probabilities of high-value next states and decreases those of low-value next states, and $Z$ is a normalization term that makes the transition probabilities sum to one. Value deviation modifies the transition probabilities to be close to $P^{\mathrm{on}}_i$ and hence reduces the transition bias. The optimism towards other agents helps the agent discover potentially good actions that are hidden by the poor behavior policies of other agents.
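As a minimal numerical sketch of this reweighting (using the form written above; the numbers come from agent 1's action $a^1_1$ in the matrix game, and the helper name is illustrative):

```python
import numpy as np

def value_deviation_reweight(p_off, v_next):
    """Scale each next-state probability by one plus the relative deviation of its
    value from the expected value, then renormalize to sum to one."""
    p_off, v_next = np.asarray(p_off, dtype=float), np.asarray(v_next, dtype=float)
    expected_v = np.dot(p_off, v_next)
    p_hat = p_off * (v_next / expected_v)   # = p_off * (1 + (V - E[V]) / E[V])
    return p_hat / p_hat.sum()

# Agent 1, action a1_1: next-state values (1, 5), offline transition (0.4, 0.6).
p_hat = value_deviation_reweight([0.4, 0.6], [1.0, 5.0])
print(p_hat)                      # ~[0.12, 0.88]
print(np.dot(p_hat, [1.0, 5.0]))  # ~4.5, the optimistic estimate in Table 3
```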
3.2 Transition Normalization
As $\mathcal{B}_i$ of each agent is individually collected by different behavior policies, the diverse combinations of behavior policies of all agents lead to the value of the same state being overestimated by some agents while being underestimated by others. Since the agents are trained to reach high-value states, a large disagreement on state values will cause miscoordination of the learned policies. To overcome this problem, we normalize $P^{\mathrm{off}}_i$ to be uniform over next states,
$$\bar{P}_i(s'|s,a_i) = \frac{1}{N_i(s,a_i)},$$
where the normalization term $N_i(s,a_i)$ is the number of different $s'$ given $(s,a_i)$ in $\mathcal{B}_i$. Transition normalization enforces that each agent has the same transition probability when it acts its learned action on the same state $s$, and we have the following proposition.
Proposition 1.
In episodic environments, if each agent $i$ performs Q-learning on $\mathcal{B}_i$, all agents will converge to the same state value $V(s)$ if they have the same transition probability on any state $s$ where each agent $i$ acts its learned action $\pi_i(s)$.
The proof is provided in Appendix (available at https://arxiv.org/abs/2108.01832). This proposition implies that transition normalization can enable agents to have the same state value estimates. However, to satisfy this condition for all states in the datasets, the agents should have the same set of next states $s'$ at each $s$, which is a strong assumption. In practice, although the assumption is not strictly satisfied, transition normalization can still normalize the transition probabilities, encouraging the estimated state values of the agents to be close to each other.
3.3 Optimization Objective
Combining value deviation, denoted as $\nu = 1 + \frac{V_i(s') - \mathbb{E}_{s'' \sim P^{\mathrm{off}}_i}[V_i(s'')]}{\mathbb{E}_{s'' \sim P^{\mathrm{off}}_i}[V_i(s'')]}$, and transition normalization, denoted as $\omega = \frac{1}{N_i(s,a_i)\, P^{\mathrm{off}}_i(s'|s,a_i)}$, we modify $P^{\mathrm{off}}_i$ as
$$\hat{P}_i(s'|s,a_i) = \frac{1}{Z}\, \nu\, \omega\, P^{\mathrm{off}}_i(s'|s,a_i),$$
where $Z$ is the normalization term. Indeed, $\hat{P}_i$ makes offline learning on $\mathcal{B}_i$ similar to online decentralized MARL. In the initial stage, $\nu$ is close to $1$ since $Q_i$ is not learned yet, and the transition probabilities are uniform, resembling the start of online learning where all agents act randomly. During training, the transition probabilities of high-value states gradually grow through value deviation, which is an analogy of other agents improving their policies in online learning. Therefore, $\hat{P}_i$ encourages the agents to learn high-performing policies and improves coordination.
Although $\hat{P}_i$ is non-stationary (i.e., it changes along with the updates of the Q-value), we have the following theorem that guarantees the convergence of the Bellman operator under $\hat{P}_i$.
Theorem 1.
Under the non-stationary transition probability $\hat{P}_i$, the Bellman operator is a contraction and converges to a unique fixed point when the discount factor $\gamma$ is smaller than a threshold determined by $r_{\min}$ and $r_{\max}$, if the reward is bounded by the positive region $[r_{\min}, r_{\max}]$.
The proof is provided in Appendix. As any positive affine transformation of the reward function does not change the optimal policy in fixed-horizon environments [36], Theorem 1 holds in general, and we can rescale the reward to make $r_{\min}/r_{\max}$ arbitrarily close to $1$ so as to make the upper bound of $\gamma$ close to $1$.
In deep reinforcement learning, directly modifying the transition probability is infeasible. However, we can instead modify the sampling probability to achieve the same effect. The optimization objective of decentralized deep Q-learning,
$$\mathbb{E}_{(s,a_i,r,s') \sim \mathcal{B}_i}\left[\big(Q_i(s,a_i) - y\big)^2\right], \qquad y = r + \gamma \max_{a_i'} Q_i(s',a_i'),$$
is calculated by sampling batches from $\mathcal{B}_i$ according to the sampling probability $p_{\mathcal{B}_i}(s,a_i,s')$. By factorizing the sampling probability as $p_{\mathcal{B}_i}(s,a_i,s') = p_{\mathcal{B}_i}(s,a_i)\, P^{\mathrm{off}}_i(s'|s,a_i)$, we see that modifying the transition probability corresponds to reweighting the samples. Therefore, we can modify the transition probability as $\hat{P}_i(s'|s,a_i) = \frac{1}{Z}\,\nu\,\omega\, P^{\mathrm{off}}_i(s'|s,a_i)$ and scale $p_{\mathcal{B}_i}(s,a_i)$ with $Z$. Then, the sampling probability can be re-written as
$$\hat{p}_{\mathcal{B}_i}(s,a_i,s') = Z\, p_{\mathcal{B}_i}(s,a_i) \cdot \hat{P}_i(s'|s,a_i) = \nu\,\omega\, p_{\mathcal{B}_i}(s,a_i,s').$$
Since $Z$ is independent of $s'$, it can be regarded as a scale factor on $p_{\mathcal{B}_i}(s,a_i)$, which will not change the expected target value $\mathbb{E}_{s'}[y]$. Thus, sampling batches according to the modified sampling probability achieves the same effect as modifying the transition probability. Using importance sampling, the modified optimization objective is
$$\mathbb{E}_{(s,a_i,r,s') \sim \mathcal{B}_i}\left[\nu\,\omega\,\big(Q_i(s,a_i) - y\big)^2\right].$$
We can see that $\nu$ and $\omega$ simply take effect as weights of the objective function, which makes them easy to integrate with existing offline RL methods.
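To make the weighting concrete, here is a hedged PyTorch-style sketch of the weighted TD objective above for one agent (the network interfaces and batch layout are assumptions for illustration, not the paper's code):

```python
import torch

def weighted_td_loss(q_net, q_target, batch, gamma=0.99):
    """TD loss where value deviation (nu) and transition normalization (omega)
    act as per-sample weights, as derived in Section 3.3."""
    s, a, r, s2, nu, omega = batch  # tensors sampled from the agent's own dataset
    with torch.no_grad():
        target = r + gamma * q_target(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # nu * omega reweights each sample, mimicking the modified transition probabilities.
    return (nu * omega * (q_sa - target) ** 2).mean()
```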
3.4 An Example Instantiation
Our framework can be practically instantiated on many offline single-agent RL algorithms that address the overestimation incurred by out-of-distribution actions. Here, we give the instantiation of the framework on BCQ [5], termed MABCQ. To make MABCQ adapt to high-dimensional continuous spaces, for each agent $i$, we train a Q-network $Q_i$, a perturbation network $\xi_i$, and a conditional VAE $G_i$. In execution, each agent generates $n$ candidate actions with $G_i$, adds small perturbations to the actions using $\xi_i$, and then selects the candidate with the highest value under $Q_i$. The policy can be written as
$$\pi_i(s) = \operatorname*{arg\,max}_{a_i^{(j)} + \xi_i(s, a_i^{(j)})} Q_i\big(s,\, a_i^{(j)} + \xi_i(s, a_i^{(j)})\big), \qquad a_i^{(j)} \sim G_i(s),\ j = 1, \ldots, n.$$
$Q_i$ is updated by minimizing
$$\mathbb{E}_{(s,a_i,r,s') \sim \mathcal{B}_i}\left[\nu\,\omega\,\big(Q_i(s,a_i) - y\big)^2\right], \qquad (1)$$
where the target value $y = r + \gamma\, Q_i'\big(s', \pi_i'(s')\big)$ is calculated by the target networks $Q_i'$ and $\xi_i'$, and $\pi_i'$ is correspondingly the policy induced by $G_i$ and $\xi_i'$. $\xi_i$ is updated by maximizing
$$\mathbb{E}_{s \sim \mathcal{B}_i,\, a_i \sim G_i(s)}\left[Q_i\big(s,\, a_i + \xi_i(s, a_i)\big)\right]. \qquad (2)$$
To estimate $\nu$, we need $V_i(s')$ and $\mathbb{E}_{s'' \sim P^{\mathrm{off}}_i}[V_i(s'')]$, which can be estimated from the sampled transition without actually going through all next states $s''$. We estimate $V_i$ using the target networks to stabilize $\nu$ along with the updates of $Q_i$ and $\xi_i$. To avoid extreme values, we clip the deviation term to the region $[-\lambda, \lambda]$, where $\lambda$ is the optimism level.
To estimate $\omega$, we train a VAE. Since the latent variable of the VAE follows a Gaussian distribution, we use the mean as the encoding of the input and estimate the probability density functions of the encodings of $(s, a_i)$ and $(s, a_i, s')$ under the unit Gaussian. The ratio of the two densities gives the conditional density of $s'$ given $(s, a_i)$, and the transition probability $P^{\mathrm{off}}_i(s'|s,a_i)$ is proportional to this conditional density up to a small constant, which is absorbed into the normalization. In practice, we find that the resulting $\omega$ falls into a moderate region for almost all samples. For completeness, we summarize the training of MABCQ in Algorithm 1.
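For intuition, the sketch below illustrates the BCQ-style action selection that each MABCQ agent uses at execution time: sample candidate actions from the VAE, perturb them, and pick the candidate with the highest Q-value. The network interfaces (`vae_decoder`, `perturb_net`, `q_net`) are assumptions for illustration, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def select_action(state, vae_decoder, perturb_net, q_net, n_candidates=10, max_action=1.0):
    """Decentralized BCQ-style action selection for one agent (illustrative sketch)."""
    s = state.unsqueeze(0).repeat(n_candidates, 1)             # [n, state_dim]
    candidates = vae_decoder(s)                                 # actions close to the behavior data
    perturbed = (candidates + perturb_net(s, candidates)).clamp(-max_action, max_action)
    q_values = q_net(s, perturbed).squeeze(-1)                  # [n]
    return perturbed[q_values.argmax()]                         # highest-value candidate
```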
Table 3: Transition probabilities and expected returns re-estimated with value deviation ($\nu$) only.

Agent 1:

| action | transition | expected return |
|---|---|---|
| $a^1_1$ | $\langle 0.12, 0.88 \rangle$ | 4.52 |
| $a^1_2$ | $\langle 0.8, 0.2 \rangle$ | 5 |

Agent 2:

| action | transition | expected return |
|---|---|---|
| $a^2_1$ | $\langle 0.4, 0.6 \rangle$ | 4 |
| $a^2_2$ | $\langle 0.95, 0.05 \rangle$ | 4.8 |
Table 4: Transition probabilities and expected returns re-estimated with both value deviation ($\nu$) and transition normalization ($\omega$).

Agent 1:

| action | transition | expected return |
|---|---|---|
| $a^1_1$ | $\langle 0.17, 0.83 \rangle$ | 4.33 |
| $a^1_2$ | $\langle 0.86, 0.14 \rangle$ | 5.29 |

Agent 2:

| action | transition | expected return |
|---|---|---|
| $a^2_1$ | $\langle 0.14, 0.86 \rangle$ | 5.29 |
| $a^2_2$ | $\langle 0.83, 0.17 \rangle$ | 4.33 |
[Table 5: Average normalized scores of MABCQ, BCQ w/ $\nu$, BCQ w/ $\omega$, BCQ, DDPG, and Behavior on HalfCheetah, Walker, Hopper, and Ant; numeric entries not shown.]
4 Related Work
4.1 Off-policy MARL
Many off-policy MARL methods have been proposed for learning to solve cooperative tasks in an online manner. Policy-based methods [15, 8, 37, 26, 29] extend actor-critic to multi-agent cases. Value factorization methods [27, 22, 25, 30] decompose the joint value function into individual value functions. All these methods follow CTDE, where the information of all agents can be accessed in a centralized way during training. Unlike these studies, we consider decentralized settings where global information is not available. For decentralized learning, the key challenge is the obsolete experiences in the replay buffer, which are considered in Fingerprints [2], Lenient-DQN [19], and concurrent experience replay [18]. However, these methods require additional information, e.g., the training iteration number and exploration rate, which are often not provided by offline datasets.
4.2 Offline RL
Offline RL requires the agent to learn from a fixed batch of data consisting of single-step transitions, without exploration. Most offline RL methods consider out-of-distribution actions [12] as the fundamental challenge, which is the main cause of the extrapolation error [5] in value estimation in single-agent environments. To minimize the extrapolation error, some recent methods introduce constraints to enforce the learned policy to be close to the behavior policy, which can be a direct action constraint [5], kernel MMD [10], Wasserstein distance [31], KL divergence [21], or $\ell_2$ distance [4, 20]. Some methods train a Q-function pessimistic towards out-of-distribution actions to avoid overestimation, by adding a reward penalty quantified by the learned environment model [35], by minimizing the Q-values of out-of-distribution actions [11, 34], by weighting the update of the Q-function via Monte Carlo dropout [32], or by explicitly assigning and training pseudo Q-values for out-of-distribution actions [16]. Our framework can be built on these methods.
MAICQ [33] studies offline MARL in the CTDE setting, which requires the joint actions of all agents in the dataset and thus cannot be applied to decentralized settings where datasets contain only individual actions. None of the aforementioned methods considers the extrapolation error introduced by the transition bias, which is a fatal problem in offline decentralized MARL.
5 Experiments
We evaluate our framework in both fully and partially observable tasks. In each task, we build an offline dataset $\mathcal{B}_i$ for each agent $i$, which does not contain the actions of other agents. We give the details about the collection of each offline dataset below. Our method and the baselines have the same neural network architectures and hyperparameters, which are available in Appendix. All the models are trained for five runs with different random seeds, and the results are presented as mean and standard deviation.
5.1 Matrix Game
We perform MABCQ on the matrix game in Table 1. As shown in Table 3, if we only use $\nu$ without considering transition normalization, since the transition probabilities of high-value next states are increased, for agent 1 the value of $a^1_2$ becomes higher than that of $a^1_1$. However, due to the unbalanced action distribution of agent 1, the initial transition probabilities of agent 2 are extremely unbalanced. With $\nu$ alone, agent 2 still underestimates the value of $a^2_1$ and learns the action $a^2_2$. The agents arrive at the joint action $\langle a^1_2, a^2_2 \rangle$, which is a worse solution than the initial one (Table 2). Further, by normalizing the transition probabilities with $\omega$, the agents learn the optimal solution $\langle a^1_2, a^2_1 \rangle$ and build a consensus on the values of the learned actions, as shown in Table 4.
[Table 6: Mean difference in value estimates between agents for MABCQ and BCQ w/ $\nu$ on HalfCheetah, Walker, Hopper, and Ant; values not shown.]
[Table 7: Extrapolation errors of MABCQ and BCQ on HalfCheetah, Walker, Hopper, and Ant; values not shown.]
[Table 8: Results of MABCQ, BCQ, MACQL, and CQL on the SMAC maps 3m, 8m, 3s_vs_3z, and 3s_vs_4z with random, medium, replay, and expert datasets; values not shown.]
5.2 Multi-Agent Mujoco
To evaluate MABCQ in high-dimensional complex environments, we adopt multi-agent mujoco [1], where each agent independently controls one or several joints of the robot and gets the state [9] and reward of the robot. The task illustrations and the collection of the offline datasets are given in Appendix.
Baselines. We compare MABCQ against:

- BCQ w/ $\nu$. Using value deviation $\nu$ alone on BCQ.
- BCQ w/ $\omega$. Using transition normalization $\omega$ alone on BCQ.
- BCQ. Removing both $\nu$ and $\omega$ from MABCQ.
- DDPG [14]. Each agent is trained using independent DDPG on its offline dataset $\mathcal{B}_i$, without action constraint and transition probability modification.
- Behavior. Each agent takes the action generated from the VAE $G_i$.
[Figure 1: (a), (b) comparisons with the centralized method MAICQ on two offline CTDE replay datasets; (c) performance of MATD3+BC with different optimism levels $\lambda$; (d) results on Cooperative Navigation.]
Ablation. Table 5 shows the normalized scores [3] of all the methods in the four tasks. Without the action constraint, DDPG severely suffers from large extrapolation error and hardly learns. BCQ outperforms the behavior policies but only arrives at mediocre performance. Using $\omega$ alone does not always improve performance, e.g., in HalfCheetah and Hopper. This is because $\omega$ makes the transition probabilities uniform, which can be far from those in execution, leading to large extrapolation errors. In Ant, BCQ w/ $\omega$ outperforms BCQ, which is attributed to the value consensus built by the normalized transition probabilities. By optimistically increasing the transition probabilities of high-value next states, $\nu$ mitigates the underestimation of potentially good actions and thus boosts the performance. MABCQ combines the advantages of value deviation and transition normalization and outperforms the other baselines.
Consensus on value estimates. To verify that transition normalization decreases the difference in value estimates among agents, we uniformly sample a subset from the union of all agents' states and calculate the difference in value estimates among agents on this subset. The mean differences during training are reported in Table 6. The value difference of MABCQ is indeed lower than that of BCQ w/ $\nu$. If there is a consensus among agents about which states are high-value, the agents will select the actions that most likely lead to the common high-value states. This promotes the coordination of policies and helps MABCQ outperform BCQ w/ $\nu$.
Extrapolation error. In Table 7, we present the extrapolation errors of MABCQ and BCQ, measured by the gap between the estimated Q-value and the true return evaluated by Monte Carlo. Although MABCQ greatly outperforms BCQ (i.e., attains much higher return), it still achieves much smaller extrapolation errors than BCQ in Walker, Hopper, and Ant. This empirically verifies that our method can decrease the extrapolation error.
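Both diagnostics reduce to simple averages; a minimal sketch, assuming the value difference is measured as a mean absolute difference over the sampled states (the helper names are illustrative, not from the paper):

```python
import numpy as np

def value_disagreement(v_agent1, v_agent2):
    """Mean absolute difference of two agents' value estimates on a shared set of states."""
    return float(np.mean(np.abs(np.asarray(v_agent1) - np.asarray(v_agent2))))

def extrapolation_error(q_estimates, mc_returns):
    """Mean absolute gap between estimated Q-values and Monte Carlo returns."""
    return float(np.mean(np.abs(np.asarray(q_estimates) - np.asarray(mc_returns))))
```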
5.3 SMAC
We also investigate the proposed framework on SMAC [23] tasks, including 3m, 8m, 3s_vs_3z, and 3s_vs_4z. We build random datasets that are generated by uniform policies, medium datasets that are generated by mixed medium and uniform policies, replay datasets that are collected during the training of QMIX [22], and expert datasets that are generated by expert policies trained by QMIX. Each dataset contains a fixed number of episodes. We also build our framework on CQL [11], termed MACQL. As shown in Table 8, MABCQ and MACQL achieve substantial performance improvements compared with the baselines, especially on the random and medium datasets, where the transition dynamics in execution are much different from those in the offline datasets. On the expert datasets, since the behavior policies are nearly deterministic, offline RL methods avoid selecting out-of-distribution actions and thus degenerate to behavior cloning. Therefore, all the methods perform similarly.
Offline CTDE settings. Although the offline CTDE method [33] does not fit our offline decentralized setting, our framework can work on offline CTDE datasets. To verify this, we select two replay datasets (jointly collected) from MAICQ [33], split them into individual datasets, and test MACQL on them. As shown in Figures 1(a) and 1(b), our decentralized method obtains competitive performance compared with the centralized method MAICQ.
[Table 9: Results of MATD3+BC and TD3+BC on the Differential Game under the full observation and partial observation settings; values not shown.]
[Table 10: Results of MATD3+BC, TD3+BC, and TD3+BC (single) on the decentralized splits of the D4RL halfcheetah, hopper, and walker2d datasets (random, medium, replay, and medium-expert); values not shown.]
5.4 MPE
We additionally evaluate our framework on an MPE-based [15] Differential Game (DG), where the transition bias greatly affects performance. Two agents can move within a bounded range, and the action is the speed, which is also bounded. The shared reward is defined as a function of the positions of the two agents, $x_1$ and $x_2$.
The visualization of reward function is shown in Figure 2.
[Figure 2: Visualization of the reward function of the Differential Game.]
Partial observation vs. full observation. The offline datasets are collected by uniform random policies. In the full observation setting, the dataset of each agent contains the positions of both agents; in the partial observation setting, the dataset of each agent only contains its own position. In both settings, the datasets do not contain the actions of the other agent. We add $\nu$ and $\omega$ to TD3+BC [4], termed MATD3+BC. As shown in Table 9, MATD3+BC obtains substantial improvement in both the full and partial observation settings.
Hyperparameter $\lambda$. The optimism level $\lambda$ controls the strength of value deviation. If $\lambda$ is too small, value deviation has a weak effect on the objective function. On the other hand, if $\lambda$ is too large, the agent will be over-optimistic about other agents' learned policies. Figure 1(c) shows the performance of MATD3+BC with different $\lambda$, which verifies that our framework is robust to $\lambda$.
We also test MATD3+BC on Cooperative Navigation in MPE, where agents learn to cover landmarks. The reward is the negative sum of $d_j$, where $d_j$ is the distance from landmark $j$ to the closest agent. The offline datasets are collected by uniform random policies. Figure 1(d) shows that our framework significantly outperforms the baseline.
5.5 Additional Results
We also split the D4RL [3] Mujoco datasets into decentralized multi-agent datasets and test MATD3+BC on them. The results are summarized in Table 10. We find that the results of the decentralized methods, where the joints are controlled by different agents, are very close to those of the single-agent method, TD3+BC (single), where a single agent controls all joints, which can be seen as the "upper bound" of the decentralized methods. This is the reason that our method does not bring significant improvement on these tasks. However, MATD3+BC still outperforms TD3+BC on several tasks, e.g., halfcheetah-random and walker2d-medium.
[Table 11: Average wall-clock time (ms) of one update in HalfCheetah for MABCQ, BCQ, MATD3+BC, and TD3+BC; values not shown.]
To demonstrate the computational efficiency of our method, in Table 11 we report the average time taken by one update in HalfCheetah. The experiments are carried out on an Intel i7-8700 CPU and an NVIDIA GTX 1080Ti GPU. Since $\nu$ and $\omega$ can be computed from the sampled experience without actually going through all next states, our framework additionally needs only two forward passes for computing $\nu$ and $\omega$ in each update. Since the value computation is very efficient in TD3+BC, our framework is also efficient on it.
6 Conclusion
We propose a framework for offline decentralized multi-agent reinforcement learning to overcome the mismatch between the transition dynamics in the dataset and those in execution. The framework can be instantiated on many offline RL methods. Theoretically, we show that under the purposely controlled non-stationary transition dynamics, offline decentralized Q-learning converges to a unique fixed point. Empirically, the framework outperforms the baselines on a variety of multi-agent offline datasets.
Acknowledgments. This work was supported by NSF China under grants 62250068 and 61872009. The authors would like to thank the anonymous reviewers for their valuable comments.
References
- [1] Christian Schroeder de Witt, Bei Peng, Pierre-Alexandre Kamienny, Philip Torr, Wendelin Böhmer, and Shimon Whiteson, ‘Deep Multi-Agent Reinforcement Learning for Decentralized Continuous Cooperative Control’, arXiv preprint arXiv:2003.06709, (2020).
- [2] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson, ‘Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning’, in International Conference on Machine Learning (ICML), (2017).
- [3] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine, ‘D4rl: Datasets for Deep Data-Driven Reinforcement Learning’, arXiv preprint arXiv:2004.07219, (2020).
- [4] Scott Fujimoto and Shixiang Shane Gu, ‘A Minimalist Approach To Offline Reinforcement Learning’, in Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS), (2021).
- [5] Scott Fujimoto, David Meger, and Doina Precup, ‘Off-Policy Deep Reinforcement Learning Without Exploration’, in International Conference on Machine Learning (ICML), (2019).
- [6] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine, ‘Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with A Stochastic Actor’, in International Conference on Machine Learning (ICML), (2018).
- [7] Julian Ibarz, Jie Tan, Chelsea Finn, Mrinal Kalakrishnan, Peter Pastor, and Sergey Levine, ‘How To Train Your Robot with Deep Reinforcement Learning: Lessons We Have Learned’, The International Journal of Robotics Research, 40(4-5), 698–721, (2021).
- [8] Shariq Iqbal and Fei Sha, ‘Actor-Attention-Critic for Multi-Agent Reinforcement Learning’, in International Conference on Machine Learning (ICML), (2019).
- [9] Jakub Grudzien Kuba, Ruiqing Chen, Munning Wen, Ying Wen, Fanglei Sun, Jun Wang, and Yaodong Yang, ‘Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning’, International Conference on Learning Representations (ICLR), (2022).
- [10] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine, ‘Stabilizing Off-Policy Q-Learning Via Bootstrapping Error Reduction’, in Advances in Neural Information Processing Systems (NeurIPS), (2019).
- [11] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine, ‘Conservative Q-Learning for Offline Reinforcement Learning’, Neural Information Processing Systems (NeurIPS), (2020).
- [12] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu, ‘Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems’, arXiv preprint arXiv:2005.01643, (2020).
- [13] Lihong Li, Wei Chu, John Langford, and Robert E Schapire, ‘A Contextual-Bandit Approach To Personalized News Article Recommendation’, in International Conference on World Wide Web (WWW), pp. 661–670, (2010).
- [14] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra, ‘Continuous Control with Deep Reinforcement Learning.’, in International Conference on Learning Representations (ICLR), (2016).
- [15] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch, ‘Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments’, in Advances in Neural Information Processing Systems (NeurIPS), (2017).
- [16] Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu, ‘Mildly Conservative Q-learning for Offline Reinforcement Learning’, Neural Information Processing Systems (NeurIPS), (2022).
- [17] Frans A Oliehoek and Christopher Amato, A Concise Introduction To Decentralized POMDPs, Springer, 2016.
- [18] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian, ‘Deep Decentralized Multi-Task Multi-Agent Reinforcement Learning Under Partial Observability’, in International Conference on Machine Learning (ICML), (2017).
- [19] Gregory Palmer, Karl Tuyls, Daan Bloembergen, and Rahul Savani, ‘Lenient Multi-Agent Deep Reinforcement Learning’, in International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), (2018).
- [20] Ling Pan, Longbo Huang, Tengyu Ma, and Huazhe Xu, ‘Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification’, arXiv preprint arXiv:2111.11188, (2021).
- [21] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine, ‘Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning’, arXiv preprint arXiv:1910.00177, (2019).
- [22] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson, ‘QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning’, in International Conference on Machine Learning (ICML), (2018).
- [23] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, and Shimon Whiteson, ‘The StarCraft Multi-Agent Challenge’, CoRR, abs/1902.04043, (2019).
- [24] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz, ‘Trust Region Policy Optimization’, in International Conference on Machine Learning (ICML), (2015).
- [25] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi, ‘QTRAN: Learning To Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning’, in International Conference on Machine Learning (ICML), (2019).
- [26] Kefan Su and Zongqing Lu, ‘Divergence-Regularized Multi-Agent Actor-Critic’, in International Conference on Machine Learning (ICML), (2022).
- [27] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al., ‘Value-Decomposition Networks for Cooperative Multi-Agent Learning Based on Team Reward’, in International Conference on Autonomous Agents and Multiagent Systems (AAMAS), (2018).
- [28] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al., ‘Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning’, Nature, 575(7782), 350–354, (2019).
- [29] Jiangxing Wang, Deheng Ye, and Zongqing Lu, ‘More Centralized Training, Still Decentralized Execution: Multi-Agent Conditional Policy Factorization’, in International Conference on Learning Representations (ICLR), (2023).
- [30] Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang, ‘Qplex: Duplex Dueling Multi-Agent Q-Learning’, in International Conference on Learning Representations (ICLR), (2021).
- [31] Yifan Wu, George Tucker, and Ofir Nachum, ‘Behavior Regularized Offline Reinforcement Learning’, arXiv preprint arXiv:1911.11361, (2019).
- [32] Yue Wu, Shuangfei Zhai, Nitish Srivastava, Joshua Susskind, Jian Zhang, Ruslan Salakhutdinov, and Hanlin Goh, ‘Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning’, arXiv preprint arXiv:2105.08140, (2021).
- [33] Yiqin Yang, Xiaoteng Ma, Chenghao Li, Zewu Zheng, Qiyuan Zhang, Gao Huang, Jun Yang, and Qianchuan Zhao, ‘Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning’, in Neural Information Processing Systems (NeurIPS), (2021).
- [34] Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn, ‘Combo: Conservative Offline Model-Based Policy Optimization’, arXiv preprint arXiv:2102.08363, (2021).
- [35] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma, ‘MOPO: Model-Based Offline Policy Optimization’, Advances in Neural Information Processing Systems (NeurIPS), (2020).
- [36] Chi Zhang, Sanmukh Rao Kuppannagari, and Viktor Prasanna, ‘BRAC+: Going Deeper with Behavior Regularized Offline Reinforcement Learning’, (2021).
- [37] Tianhao Zhang, Yueheng Li, Chen Wang, Guangming Xie, and Zongqing Lu, ‘FOP: Factorizing Optimal Joint Policy of Maximum-Entropy Multi-Agent Reinforcement Learning’, in International Conference on Machine Learning (ICML), (2021).
Appendix A Proofs
Proposition 1.
In episodic environments, if each agent $i$ performs Q-learning on $\mathcal{B}_i$, all agents will converge to the same state value $V(s)$ if they have the same transition probability on any state $s$ where each agent $i$ acts its learned action $\pi_i(s)$.
Proof.
Considering the two-agent case, we define $\Delta(s) = V_1(s) - V_2(s)$ as the difference in the state values of the two agents. For the terminal state, we have $\Delta = 0$. If the two agents have the same transition probability at every state where each acts its learned action, then recursively expanding the Bellman equation backwards from the terminal state yields $\Delta(s) = 0$ for all such states. We can easily show that the argument also holds in the $N$-agent case. ∎
Theorem 1.
Under the non-stationary transition probability $\hat{P}_i$, the Bellman operator is a contraction and converges to a unique fixed point when the discount factor $\gamma$ is smaller than a threshold determined by $r_{\min}$ and $r_{\max}$, if the reward is bounded by the positive region $[r_{\min}, r_{\max}]$.
Proof.
We initialize the Q-value to be non-negative. Since the reward is bounded by the positive region $[r_{\min}, r_{\max}]$, the Q-value under the operator is also bounded. Expanding the definition of the modified transition probability $\hat{P}_i$ and bounding the difference between the operator applied to two Q-functions term by term (with a case analysis on the sign of the value deviation), we obtain a contraction modulus that depends on $\gamma$ and the ratio $r_{\min}/r_{\max}$. Therefore, if $\gamma$ is below the corresponding threshold, the operator is a contraction. By the contraction mapping theorem, it converges to a unique fixed point. ∎
Appendix B Settings and Hyperparameters
[Figure 3: Illustrations of the multi-agent mujoco tasks (HalfCheetah, Walker, Hopper, and Ant); different colors indicate different agents.]
[Table 12: Hyperparameters (learning rates of the networks, VAE hidden space size, and threshold) for continuous BCQ, discrete BCQ, CQL, and TD3+BC; values not shown.]
The illustrations of multi-agent mujoco are shown in Figure 3; different colors indicate different agents. Each agent independently controls one or several joints of the robot and gets the state and reward of the robot, which are defined in the original tasks. For each environment, we collect a dataset for each agent, containing transitions $(s, a_i, r, s')$. For data collection, we train an intermediate policy and an expert policy for each agent using SAC [6]. Each offline dataset is a mixture of four parts: transitions split from the experiences generated by the SAC agent at the early stage of training; transitions generated when the agent acts the intermediate policy while the other agents act the expert policies; transitions generated when the agent performs the expert policy while the other agents act the intermediate policies; and transitions generated when all agents perform the expert policies. For the last three parts, we add a small noise to the policies to increase the diversity of the dataset.
In all tasks, we use the same discount factor and ReLU activations; the MLP hidden sizes and batch sizes differ between the Mujoco tasks and the SMAC and DG tasks. The hyperparameter of our framework is the optimism level $\lambda$, which is set separately for HalfCheetah, Walker, Hopper, and Ant, for SMAC, and for MPE. The hyperparameters of the baselines are summarized in Table 12.
In this paper, we use SMAC (MIT license), MPE (MIT license), Gym (MIT license), and D4RL (Apache-2.0 license). Many thanks for their contributions.