
Offline Decentralized Multi-Agent
Reinforcement Learning

Jiechuan Jiang, Zongqing Lu (corresponding author, email: [email protected])
Peking University
Abstract

In many real-world multi-agent cooperative tasks, due to high cost and risk, agents cannot continuously interact with the environment and collect experiences during learning, but have to learn from offline datasets. However, the transition dynamics in the dataset of each agent can be much different from the ones induced by the learned policies of other agents in execution, creating large errors in value estimates. Consequently, agents learn uncoordinated, low-performing policies. In this paper, we propose a framework for offline decentralized multi-agent reinforcement learning, which exploits value deviation and transition normalization to deliberately modify the transition probabilities. Value deviation optimistically increases the transition probabilities of high-value next states, and transition normalization normalizes the transition probabilities over next states. Together, they enable agents to learn high-performing and coordinated policies. Theoretically, we prove the convergence of Q-learning under the altered non-stationary transition dynamics. Empirically, we show that the framework can be easily built on many existing offline reinforcement learning algorithms and achieves substantial improvement in a variety of multi-agent tasks.

1 Introduction

Reinforcement learning (RL) has shown great potential in many domains, including recommendation systems [13], games [28], and robotics [7]. When applying RL to real-world applications, the agent needs to continuously interact with the environment and collect experiences for learning, which however can be costly, risky, and time-consuming. One way to address this is offline RL, where the agent learns the policy, without interacting with the environment, from a fixed dataset of experiences collected by any behavior policy. The main challenge of offline RL is the extrapolation error, an error incurred by the mismatch between the experience distributions of the learned policy and the dataset [5]. Many offline RL methods [5, 10, 31, 11] have been proposed to address the value overestimation caused by the mismatch between the action distributions of the learned policy and the behavior policy. However, much less attention has been paid to the mismatch between the transition dynamics in the dataset and the real ones in execution. The main reason is that the mismatch of transition dynamics is negligible given a large dataset, if the environment is stationary. Nevertheless, in many real-world scenarios, e.g., autonomous driving, there are other learning agents in the environment. That means the policies of other agents during data collection can be significantly different from their policies in execution, which creates a mismatch of transition dynamics and hence undermines the learned policy.

More specifically, these scenarios can be formulated as the offline decentralized multi-agent setting, where agents cooperate on the task but each learns its policy from its own fixed dataset. The dataset of each agent is independently collected by arbitrary behavior policies (i.e., we make no assumptions on data collection) and contains its own action instead of the joint action of all agents. This setting resembles many industrial applications where agents belong to different companies and the actions of other agents are not accessible, e.g., autonomous vehicles or robots. Apparently, this setting does not fit the paradigm of offline centralized training and decentralized execution (CTDE) in multi-agent reinforcement learning (MARL) [33] (even if the datasets of all agents could be accessed in a centralized way, a full dataset that contains the joint actions cannot be constructed, because the datasets are fully independent without any common information), and each agent has to learn a cooperative policy in an offline and fully decentralized way. In such decentralized multi-agent settings, from the perspective of an individual agent, other agents are a part of the environment, thus the transition dynamics experienced by each agent depend on the policies of other agents and change as other agents update their policies [2]. As the learned policies of other agents will be inconsistent with their behavior policies, the transition dynamics induced by the learned policies of other agents in execution will differ from the transition dynamics in the dataset, causing a large mismatch between the two. This mismatch can lead to a low-performing policy for each agent. Furthermore, as agents learn in a decentralized way on datasets with different distributions of experiences collected by various behavior policies, the estimated values of the same state can differ greatly between agents, which prevents the learned policies from coordinating well with each other.

In this paper, we propose a framework for offline fully decentralized multi-agent reinforcement learning, in which value deviation and transition normalization are introduced to deliberately modify the transition probabilities in the dataset, so as to overcome the transition mismatch and miscoordination. During data collection, if one agent takes a ‘good’ action while other agents take ‘bad’ actions at a state, the transition probabilities of low-value next states will be high. Thus, the Q-value of the ‘good’ action will be underestimated by the agent. As other agents are also learning, their learned policies are expected to become better than their behavior policies. Therefore, for each agent, the transition probabilities of high-value next states in execution will be higher than those in the dataset. So, we let each agent be optimistic towards other agents and multiply the transition probabilities by the deviation of the value of the next state from the expected value over all next states, to make the estimated transition probabilities close to those induced by the learned policies of other agents. To address the miscoordination caused by diverse value estimates among agents, we normalize the transition probabilities in the dataset to be uniform. Transition normalization helps build a consensus about value estimates among agents and hence promotes coordination. By combining value deviation and transition normalization, agents are able to learn high-performing and coordinated policies in an offline and fully decentralized way.

Value deviation and transition normalization make the transition dynamics non-stationary during learning. However, we mathematically prove the convergence of Q-learning under such purposely controlled non-stationary transition dynamics. Moreover, by derivation, we show that value deviation and transition normalization take effect merely as weights of the objective function, so our framework can be easily built on many existing offline single-agent RL algorithms that address the overestimation incurred by out-of-distribution actions. Empirically, we provide an example instantiation and a thorough analysis of our framework on BCQ [5], termed MABCQ, and also test the variants on CQL [11] and TD3+BC [4], termed MACQL and MATD3+BC respectively. Experimental results show that our method substantially outperforms the baselines in a variety of multi-agent tasks, including multi-agent MuJoCo [1], SMAC [23], and MPE [15]. To the best of our knowledge, this is the first work on offline and fully decentralized multi-agent reinforcement learning.

2 Preliminaries

We consider $N$ agents in a multi-agent MDP [17] $M_{\mathrm{env}}=\langle\mathcal{S},\mathcal{A},R,P_{\mathrm{env}},\gamma\rangle$ with the state space $\mathcal{S}$ and the joint action space $\mathcal{A}$. At each timestep, each agent $i$ gets the state $s$ (the global state is only for the convenience of theoretical analysis [9]; the agents could learn based on partial observations in practice) and performs an individual action $a_i$, and the environment transitions to the next state $s'$ by taking the joint action $\bm{a}$ with the transition probability $P_{\mathrm{env}}(s'|s,\bm{a})$. All agents get a shared reward $r=R(s)$, which is simplified to depend only on the state [24] (this simplification is only for the convenience of theoretical analysis, but is not necessary in practice). All agents learn to maximize the expected return $\mathbb{E}\sum_{t=0}^{\infty}\gamma^{t}r_{t}$, where $\gamma$ is the discount factor. However, in fully decentralized learning, $M_{\mathrm{env}}$ is partially observable to each agent since each agent only observes its own action $a_i$ instead of the joint action $\bm{a}$. During execution, from the perspective of each agent $i$, there is an experienced MDP $M_{\mathcal{E}_i}=\langle\mathcal{S},\mathcal{A}_i,R,P_{\mathcal{E}_i},\gamma\rangle$ with the individual action space $\mathcal{A}_i$ and the online transition probability

$$P_{\mathcal{E}_i}(s'|s,a_i)=\sum_{\bm{a}_{-i}}P_{\mathrm{env}}(s'|s,a_i,\bm{a}_{-i})\,\bm{\pi}_{\mathcal{E}_{-i}}(\bm{a}_{-i}|s),$$

where $\bm{a}_{-i}$ and $\bm{\pi}_{\mathcal{E}_{-i}}$ respectively denote the joint action and the learned joint policy of all agents except agent $i$. However, as the agent cannot interact with other agents in the environment, $P_{\mathcal{E}_i}$ is unknown.

In offline decentralized settings, each agent $i$ only has access to a fixed offline dataset $\mathcal{B}_i$, which is pre-collected by behavior policies $\bm{\pi}_{\mathcal{B}}$ and contains the tuples $(s,a_i,r,s')$. As defined in BCQ [5], the visible MDP $M_{\mathcal{B}_i}=\langle\mathcal{S},\mathcal{A}_i,R,P_{\mathcal{B}_i},\gamma\rangle$ is constructed on $\mathcal{B}_i$, with the offline transition probability

$$P_{\mathcal{B}_i}(s'|s,a_i)=\sum_{\bm{a}_{-i}}P_{\mathrm{env}}(s'|s,a_i,\bm{a}_{-i})\,\bm{\pi}_{\mathcal{B}_{-i}}(\bm{a}_{-i}|s).$$

As the learned policies of other agents $\bm{\pi}_{\mathcal{E}_{-i}}$ may greatly deviate from their behavior policies $\bm{\pi}_{\mathcal{B}_{-i}}$, $P_{\mathcal{B}_i}$ can be largely biased from $P_{\mathcal{E}_i}$. We define the discrepancy between $P_{\mathcal{B}_i}$ and $P_{\mathcal{E}_i}$ as the transition bias. The large extrapolation errors caused by the transition bias, together with the differences in value estimates between agents, lead to uncoordinated, low-performing policies.

Table 1: The matrix game. Behavior policy probabilities are shown in parentheses.

|                      | Agent 2: $a_1$ (0.4) | Agent 2: $a_2$ (0.6) |
|----------------------|----------------------|----------------------|
| Agent 1: $a_1$ (0.8) | 1                    | 5                    |
| Agent 1: $a_2$ (0.2) | 6                    | 1                    |
Table 2: Transition probabilities and expected returns calculated in the dataset.

| Agent   | action | transition                                   | expected return |
|---------|--------|----------------------------------------------|-----------------|
| Agent 1 | $a_1$  | $p(1\mid a_1)=0.4$, $p(5\mid a_1)=0.6$       | 3.4             |
| Agent 1 | $a_2$  | $p(6\mid a_2)=0.4$, $p(1\mid a_2)=0.6$       | 3.0             |
| Agent 2 | $a_1$  | $p(1\mid a_1)=0.8$, $p(6\mid a_1)=0.2$       | 2.0             |
| Agent 2 | $a_2$  | $p(5\mid a_2)=0.8$, $p(1\mid a_2)=0.2$       | 4.2             |

To intuitively illustrate the issue, we introduce a two-player matrix game as depicted in Table 1, where the action distributions of the behavior policies of the two agents are $[0.8,0.2]$ and $[0.4,0.6]$, respectively. Table 2 shows the transition probabilities and expected returns calculated independently by the agents from the datasets collected by the behavior policies. However, as the behavior policies are poor, when agent 1 chooses the optimal action $a_2$, agent 2 has a higher probability of selecting the suboptimal action $a_2$, which leads to low transition probabilities of the truly high-value next states for agent 1 ($p(6\mid a_2)=0.4$ vs. $p(1\mid a_2)=0.6$). So the agents underestimate the optimal actions and converge to the suboptimal joint action (agent 1: $a_1$, agent 2: $a_2$), rather than the optimal one (agent 1: $a_2$, agent 2: $a_1$).
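As a quick sanity check (our own illustration, not part of the paper), the entries of Table 2 can be reproduced by marginalizing the payoff matrix of Table 1 over the other agent's behavior policy; the names below (`payoff`, `pi_b1`, `pi_b2`) are ours.

```python
import numpy as np

# Payoff matrix of Table 1: rows are agent 1's actions, columns are agent 2's actions.
payoff = np.array([[1.0, 5.0],
                   [6.0, 1.0]])
pi_b1 = np.array([0.8, 0.2])  # behavior policy of agent 1
pi_b2 = np.array([0.4, 0.6])  # behavior policy of agent 2

# From agent 1's perspective, the "next state" (payoff) distribution for each of its
# actions is induced by agent 2's behavior policy, and symmetrically for agent 2.
for a1 in range(2):
    print(f"agent 1, a{a1+1}: expected return {payoff[a1] @ pi_b2:.1f}")    # 3.4, 3.0
for a2 in range(2):
    print(f"agent 2, a{a2+1}: expected return {payoff[:, a2] @ pi_b1:.1f}")  # 2.0, 4.2
```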

3 Offline Decentralized MARL Framework

In fully decentralized learning, as the policies of other agents are not accessible, it is hard for an agent to learn, merely from a dataset that only contains its own actions, a policy that coordinates well with other agents. To tackle this challenging problem, we propose a framework that introduces value deviation and transition normalization to address the transition bias and miscoordination, and leverages offline single-agent RL algorithms to avoid out-of-distribution actions. Convergence under the non-stationary transition dynamics is theoretically guaranteed, and an example instantiation of the framework is provided.

3.1 Value Deviation

If the behavior policies of some agents are low-performing during data collection, they usually take ‘bad’ actions to cooperate with the ‘good’ actions of other agents, which leads to high transition probabilities of low-value next states. When agent $i$ performs Q-learning with the dataset $\mathcal{B}_i$, the Bellman operator $\mathcal{T}$ is approximated by the transition probability $P_{\mathcal{B}_i}(s'|s,a_i)$ to estimate the expectation over $s'$:

$$\mathcal{T}Q_i(s,a_i)=\mathbb{E}_{s'\sim P_{\mathcal{B}_i}(s'|s,a_i)}\left[r+\gamma\max_{\hat{a}_i}Q_i(s',\hat{a}_i)\right].$$

If $P_{\mathcal{B}_i}$ of a high-value $s'$ is lower than $P_{\mathcal{E}_i}$, the Q-value of the state-action pair $(s,a_i)$ is underestimated, which causes a large extrapolation error.

However, as the policies of other agents are also updated towards maximizing the Q-values, $P_{\mathcal{E}_i}$ of high-value next states will grow higher than $P_{\mathcal{B}_i}$. Thus, we let each agent be optimistic towards other agents and modify $P_{\mathcal{B}_i}$ as

$$P_{\mathcal{B}_i}(s'|s,a_i)\cdot\underbrace{\left(1+\frac{V_i^*(s')-\mathbb{E}_{\hat{s}'}V_i^*(\hat{s}')}{|\mathbb{E}_{\hat{s}'}V_i^*(\hat{s}')|}\right)}_{\text{value deviation}}\cdot\frac{1}{z_i^{vd}},$$

where the state value $V_i^*(s)=\max_{a_i}Q_i(s,a_i)$. The factor $1+\frac{V_i^*(s')-\mathbb{E}_{\hat{s}'}V_i^*(\hat{s}')}{|\mathbb{E}_{\hat{s}'}V_i^*(\hat{s}')|}$ is the deviation of the value of the next state from the expected value over all next states, which increases the transition probabilities of high-value next states and decreases those of low-value next states, and $z_i^{vd}=\sum_{s'}P_{\mathcal{B}_i}(s'|s,a_i)\left(1+\frac{V_i^*(s')-\mathbb{E}_{\hat{s}'}V_i^*(\hat{s}')}{|\mathbb{E}_{\hat{s}'}V_i^*(\hat{s}')|}\right)$ is a normalization term that ensures the modified transition probabilities sum to one. Value deviation pushes the transition probabilities towards $P_{\mathcal{E}_i}$ and hence reduces the transition bias. The optimism towards other agents helps the agent discover potentially good actions that are hidden by the poor behavior policies of other agents.
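To make the reweighting concrete, below is a minimal tabular sketch of ours (with hypothetical names `probs` and `values`) of the value-deviation factor applied to a small set of next states.

```python
import numpy as np

def value_deviation_weights(probs, values):
    """Reweight offline transition probabilities P_B(s'|s,a_i) by the deviation of
    V*(s') from the expected next-state value, then renormalize by z^{vd}."""
    expected_v = probs @ values                       # E_{s'}[V*(s')]
    lam_vd = 1.0 + (values - expected_v) / abs(expected_v)
    reweighted = probs * lam_vd
    return reweighted / reweighted.sum()              # divide by z_i^{vd}

# A low-value next state visited often and a high-value next state visited rarely.
probs = np.array([0.6, 0.4])
values = np.array([1.0, 5.0])
print(value_deviation_weights(probs, values))         # [0.23 0.77]: mass shifts to the high-value s'
```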

3.2 Transition Normalization

As $\mathcal{B}_i$ of each agent is individually collected by different behavior policies, the diverse combinations of behavior policies of all agents lead to the value of the same state $s$ being overestimated by some agents while being underestimated by others. Since the agents are trained to reach high-value states, the large disagreement on state values causes miscoordination of the learned policies. To overcome this problem, we normalize $P_{\mathcal{B}_i}$ to be uniform over next states,

$$P_{\mathcal{B}_i}(s'|s,a_i)\cdot\underbrace{\frac{1}{P_{\mathcal{B}_i}(s'|s,a_i)}}_{\text{transition normalization}}\cdot\frac{1}{z_i^{tn}},$$

where $z_i^{tn}$ is a normalization term equal to the number of distinct $s'$ given $(s,a_i)$ in $\mathcal{B}_i$. Transition normalization enforces that each agent has the same $P_{\mathcal{B}_i}$ when it takes its learned action $a_i^*$ in the same state $s$, and we have the following proposition.
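The effect can be seen in a toy numpy example (ours): whatever next-state frequencies two agents observed for the same $(s,a_i^*)$, the normalized probabilities become the same uniform distribution.

```python
import numpy as np

def normalize_transitions(probs):
    """Transition normalization: multiply P_B(s'|s,a_i) by 1/P_B(s'|s,a_i) and divide
    by z^{tn}, the number of distinct next states, yielding a uniform distribution."""
    lam_tn = 1.0 / probs
    z_tn = len(probs)
    return probs * lam_tn / z_tn

agent1_probs = np.array([0.7, 0.2, 0.1])    # next-state frequencies seen by agent 1
agent2_probs = np.array([0.1, 0.3, 0.6])    # very different frequencies seen by agent 2
print(normalize_transitions(agent1_probs))  # [0.333 0.333 0.333]
print(normalize_transitions(agent2_probs))  # [0.333 0.333 0.333] -> shared transition view
```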

Proposition 1.

In episodic environments, if each agent $i$ performs Q-learning on $\mathcal{B}_i$, all agents will converge to the same $V^*$ if they have the same transition probability at any state where each agent $i$ takes its learned action $a_i^*$.

The proof is provided in Appendix (available at https://arxiv.org/abs/2108.01832). This proposition implies that transition normalization enables agents to have the same state-value estimates. However, to satisfy $P_{\mathcal{B}_1}(s'|s,a_1^*)=P_{\mathcal{B}_2}(s'|s,a_2^*)=\ldots=P_{\mathcal{B}_N}(s'|s,a_N^*)$ for all $s'\in\mathcal{S}$ in the datasets, the agents should have the same set of $s'$ at $(s,a^*)$, which is a strong assumption. In practice, although the assumption is not strictly satisfied, transition normalization still normalizes the transition probabilities, encouraging the estimated state values $V^*$ of different agents to be close to each other.

3.3 Optimization Objective

Combining value deviation $1+\frac{V_i^*(s')-\mathbb{E}_{\hat{s}'}V_i^*(\hat{s}')}{|\mathbb{E}_{\hat{s}'}V_i^*(\hat{s}')|}$, denoted as $\lambda_{vd_i}$, and transition normalization $\frac{1}{P_{\mathcal{B}_i}(s'|s,a_i)}$, denoted as $\lambda_{tn_i}$, we modify $P_{\mathcal{B}_i}$ as

$$\hat{P}_{\mathcal{B}_i}(s'|s,a_i)=P_{\mathcal{B}_i}(s'|s,a_i)\cdot\frac{\lambda_{tn_i}\lambda_{vd_i}}{z_i},$$

where $z_i=\sum_{s'}\left(1+\frac{V_i^*(s')-\mathbb{E}_{\hat{s}'}V_i^*(\hat{s}')}{|\mathbb{E}_{\hat{s}'}V_i^*(\hat{s}')|}\right)$ is the normalization term. Indeed, $\hat{P}_{\mathcal{B}_i}$ makes offline learning on $\mathcal{B}_i$ resemble online decentralized MARL. In the initial stage, $\lambda_{vd_i}$ is close to $1$ since $Q_i(s,a_i)$ has not been learned yet, and the transition probabilities are uniform, resembling the early stage of online learning where all agents act randomly. During training, the transition probabilities of high-value states gradually grow through value deviation, which is an analogy to other agents improving their policies in online learning. Therefore, $\hat{P}_{\mathcal{B}_i}$ encourages the agents to learn high-performing policies and improves coordination.

Although $\hat{P}_{\mathcal{B}_i}$ is non-stationary (i.e., $\lambda_{vd_i}$ changes along with the updates of the Q-value), we have the following theorem that guarantees the convergence of the Bellman operator $\mathcal{T}$ under $\hat{P}_{\mathcal{B}_i}$,

$$\mathcal{T}Q_i(s,a_i)=\mathbb{E}_{s'\sim\hat{P}_{\mathcal{B}_i}(s'|s,a_i)}\left[r+\gamma\max_{\hat{a}_i}Q_i(s',\hat{a}_i)\right].$$
Theorem 1.

Under the non-stationary transition probability $\hat{P}_{\mathcal{B}_i}$, the Bellman operator $\mathcal{T}$ is a contraction and converges to a unique fixed point when $\gamma<\frac{r_{\min}}{2r_{\max}-r_{\min}}$, if the reward is bounded in the positive region $[r_{\min},r_{\max}]$.

The proof is provided in Appendix. As any positive affine transformation of the reward function does not change the optimal policy in fixed-horizon environments [36], Theorem 1 holds in general, and we can rescale the reward to make $r_{\min}$ arbitrarily close to $r_{\max}$ so as to push the upper bound of $\gamma$ close to $1$.
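As a worked illustration of this bound (our own arithmetic, not from the paper), consider rescaling the rewards into a narrow positive interval:

```latex
\text{Rescale } r \in [0, 10] \text{ by } r' = 0.01\,r + 1, \text{ so } r' \in [1, 1.1]:\quad
\gamma < \frac{r'_{\min}}{2 r'_{\max} - r'_{\min}} = \frac{1}{2 \times 1.1 - 1} = \frac{1}{1.2} \approx 0.83.\\[4pt]
\text{Rescaling into } [1, 1.01] \text{ instead gives } \gamma < \frac{1}{1.02} \approx 0.98,
\text{ i.e., the bound approaches } 1 \text{ as } r'_{\min} \to r'_{\max}.
```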

In deep reinforcement learning, directly modifying the transition probability is infeasible. However, we can instead modify the sampling probability to achieve the same effect. The optimization objective of decentralized deep Q-learning, $\mathbb{E}_{p_{\mathcal{B}_i}(s,a_i,s')}|Q_i(s,a_i)-y_i|^2$, is computed by sampling a batch from $\mathcal{B}_i$ according to the sampling probability $p_{\mathcal{B}_i}(s,a_i,s')$. By factorizing $p_{\mathcal{B}_i}(s,a_i,s')$, we have

$$\underbrace{p_{\mathcal{B}_i}(s,a_i,s')}_{\text{sampling probability}}=p_{\mathcal{B}_i}(s,a_i)\cdot\underbrace{P_{\mathcal{B}_i}(s'|s,a_i)}_{\text{transition probability}}.$$

Therefore, we can modify the transition probability as $\frac{\lambda_{tn_i}\lambda_{vd_i}}{z_i}P_{\mathcal{B}_i}(s'|s,a_i)$ and scale $p_{\mathcal{B}_i}(s,a_i)$ by $z_i$. Then, the sampling probability can be rewritten as

$$\underbrace{\lambda_{tn_i}\lambda_{vd_i}\,p_{\mathcal{B}_i}(s,a_i,s')}_{\text{modified sampling probability}}=z_i\,p_{\mathcal{B}_i}(s,a_i)\cdot\underbrace{\frac{\lambda_{tn_i}\lambda_{vd_i}}{z_i}P_{\mathcal{B}_i}(s'|s,a_i)}_{\text{modified transition probability}}.$$

Since $z_i$ is independent of $s'$, it can be regarded as a scale factor on $p_{\mathcal{B}_i}(s,a_i)$, which does not change the expected target value $y_i$. Thus, sampling batches according to the modified sampling probability achieves the same effect as modifying the transition probability. Using importance sampling, the modified optimization objective is

$$\begin{aligned}
&\mathbb{E}_{\lambda_{tn_i}\lambda_{vd_i}p_{\mathcal{B}_i}(s,a_i,s')}|Q_i(s,a_i)-y_i|^2\\
&=\mathbb{E}_{p_{\mathcal{B}_i}(s,a_i,s')}\frac{\lambda_{tn_i}\lambda_{vd_i}\,p_{\mathcal{B}_i}(s,a_i,s')}{p_{\mathcal{B}_i}(s,a_i,s')}|Q_i(s,a_i)-y_i|^2\\
&=\mathbb{E}_{p_{\mathcal{B}_i}(s,a_i,s')}\lambda_{tn_i}\lambda_{vd_i}|Q_i(s,a_i)-y_i|^2.
\end{aligned}$$

We can see that $\lambda_{tn_i}$ and $\lambda_{vd_i}$ simply act as weights of the objective function, which makes them easy to integrate with existing offline RL methods.
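For instance, a minimal PyTorch-style sketch of this weighted objective for a discrete-action Q-network might look as follows; the names `q_net`, `q_target`, `lam_tn`, and `lam_vd` are our assumptions, not the authors' code.

```python
import torch

def weighted_td_loss(q_net, q_target, batch, gamma, lam_tn, lam_vd):
    """TD loss of decentralized deep Q-learning, with lambda_tn * lambda_vd acting as
    per-sample weights instead of explicitly modified transition probabilities."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q_i(s, a_i)
    with torch.no_grad():
        y = r + gamma * q_target(s_next).max(dim=1).values   # target y_i
    per_sample = (q - y).pow(2)                              # |Q_i(s, a_i) - y_i|^2
    return (lam_tn * lam_vd * per_sample).mean()             # weighted objective
```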

Algorithm 1 MABCQ
1:  for $i \in N$ do
2:     Initialize the conditional VAEs: $G_i^1=\{E_i^1(\mu^1,\sigma^1|s,a),\,D_i^1(a|s,z^1)\}$, $G_i^2=\{E_i^2(\mu^2,\sigma^2|s,a,s'),\,D_i^2(a|s,s',z^2)\}$.
3:     Initialize the Q-network $Q_i$, the perturbation network $\xi_i$, and their target networks $\hat{Q}_i$ and $\hat{\xi}_i$.
4:     Fit the VAEs $G_i^1$ and $G_i^2$ using $\mathcal{B}_i$.
5:     for $t=1,\ldots,max\_update$ do
6:        Sample a mini-batch from $\mathcal{B}_i$.
7:        Update $Q_i$ by minimizing (1).
8:        Update $\xi_i$ by maximizing (2).
9:        Update the target networks $\hat{Q}_i$ and $\hat{\xi}_i$.
10:     end for
11:  end for

3.4 An Example Instantiation

Our framework can be practically instantiated on many offline single-agent RL algorithms that address the overestimation incurred by out-of-distribution actions. Here, we give an instantiation of the framework on BCQ [5], termed MABCQ. To make MABCQ adapt to high-dimensional continuous spaces, for each agent $i$, we train a Q-network $Q_i$, a perturbation network $\xi_i$, and a conditional VAE $G_i^1=\{E_i^1(\mu^1,\sigma^1|s,a),\,D_i^1(a|s,z^1\sim\mathcal{N}(\mu^1,\sigma^1))\}$. In execution, each agent $i$ generates $n$ candidate actions with $G_i^1$, adds small perturbations in $[-\Phi,\Phi]$ to the actions using $\xi_i$, and then selects the action with the highest value under $Q_i$. The policy can be written as

$$\pi_i(s)=\underset{a_i^j+\xi_i(s,a_i^j)}{\operatorname{argmax}}\;Q_i\left(s,a_i^j+\xi_i(s,a_i^j)\right),\quad\text{where }\left\{a_i^j\sim G_i^1(s)\right\}_{j=1}^n.$$
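In code, this action-selection rule could be sketched as below, where `vae_decode`, `xi`, and `q_net` are hypothetical interfaces standing in for $G_i^1$, $\xi_i$, and $Q_i$ (not the authors' implementation).

```python
import torch

def select_action(s, vae_decode, xi, q_net, n=10, phi=0.05):
    """Sample n candidate actions from the VAE, perturb each within [-phi, phi],
    and return the candidate with the highest Q-value."""
    s_rep = s.unsqueeze(0).repeat(n, 1)                    # repeat the state n times
    candidates = vae_decode(s_rep)                         # a_i^j ~ G_i^1(s), shape [n, act_dim]
    perturbed = candidates + phi * torch.tanh(xi(s_rep, candidates))  # bounded perturbation
    q_values = q_net(s_rep, perturbed).squeeze(-1)         # Q_i(s, a_i^j + xi_i(s, a_i^j))
    return perturbed[q_values.argmax()]
```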

$Q_i$ is updated by minimizing

$$\mathbb{E}_{p_{\mathcal{B}_i}(s,a_i,s')}\,\lambda_{tn_i}\lambda_{vd_i}|Q_i(s,a_i)-y_i|^2,\quad\text{where }y_i=r+\gamma\hat{Q}_i(s',\hat{\pi}_i(s')). \tag{1}$$

$y_i$ is calculated by the target networks $\hat{Q}_i$ and $\hat{\xi}_i$, where $\hat{\pi}_i$ is the policy induced by $\hat{Q}_i$ and $\hat{\xi}_i$. $\xi_i$ is updated by maximizing

$$\mathbb{E}_{p_{\mathcal{B}_i}(s,a_i,s')}\,\lambda_{tn_i}\lambda_{vd_i}\,Q_i\left(s,a_i+\xi_i(s,a_i)\right). \tag{2}$$

To estimate $\lambda_{vd_i}$, we need $V_i^*(s')=\hat{Q}_i(s',\hat{\pi}_i(s'))$ and $\mathbb{E}_{s'}[V_i^*(s')]=\frac{1}{\gamma}(\hat{Q}_i(s,a_i)-r)$, which can be estimated from the sample without actually going through all $s'$. We estimate $\lambda_{vd_i}$ using the target networks to stabilize $\lambda_{vd_i}$ along with the updates of $Q_i$ and $\xi_i$. To avoid extreme values, we clip $\lambda_{vd_i}$ to the region $[1-\epsilon,1+\epsilon]$, where $\epsilon$ is the optimism level.
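A per-sample sketch of this estimate (ours; `q_target` and `pi_target` stand for the target Q-network and the policy induced by the target networks) is:

```python
import torch

def estimate_lam_vd(batch, q_target, pi_target, gamma, eps=0.2):
    """Value-deviation weight computed from target networks and clipped to
    [1 - eps, 1 + eps], where eps is the optimism level."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    with torch.no_grad():
        v_next = q_target(s_next, pi_target(s_next)).squeeze(-1)   # V_i^*(s')
        expected_v = (q_target(s, a).squeeze(-1) - r) / gamma      # E_{s'}[V_i^*(s')]
        lam_vd = 1.0 + (v_next - expected_v) / expected_v.abs()
    return lam_vd.clamp(1.0 - eps, 1.0 + eps)
```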

To estimate $\lambda_{tn_i}$, we train a second VAE $G_i^2=\{E_i^2(\mu^2,\sigma^2|s,a,s'),\,D_i^2(a|s,s',z^2\sim\mathcal{N}(\mu^2,\sigma^2))\}$. Since the latent variable of a VAE follows a Gaussian distribution, we use the mean as the encoding of the input and estimate the probability density functions $\rho_i(s,a)\approx\rho_{\mathcal{N}(0,1)}(\mu_i^1)$ and $\rho_i(s,a,s')\approx\rho_{\mathcal{N}(0,1)}(\mu_i^2)$, where $\rho_{\mathcal{N}(0,1)}$ is the density of the unit Gaussian distribution. The conditional density is then $\rho_i(s'|s,a)\approx\rho_{\mathcal{N}(0,1)}(\mu_i^2)/\rho_{\mathcal{N}(0,1)}(\mu_i^1)$, and the transition probability is $P_{\mathcal{B}_i}(s'|s,a_i)\approx\int_{s'-\frac{1}{2}\delta_\mathcal{S}}^{s'+\frac{1}{2}\delta_\mathcal{S}}\rho_i(s'|s,a)\,\mathrm{d}s'\approx\rho_i(s'|s,a)\left\|\delta_\mathcal{S}\right\|$ when $\left\|\delta_\mathcal{S}\right\|$ is a small constant. Approximately, we have

$$\lambda_{tn_i}=\frac{\rho_{\mathcal{N}(0,1)}(\mu_i^1)}{\rho_{\mathcal{N}(0,1)}(\mu_i^2)},$$

and the constant $\left\|\delta_\mathcal{S}\right\|$ is absorbed into $z_i$. In practice, we find that $\lambda_{tn_i}$ falls into the region $[0.2,1.4]$ for almost all samples. For completeness, we summarize the training of MABCQ in Algorithm 1.
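Since the ratio involves only unit-Gaussian densities evaluated at the two latent means, it can be computed in closed form; a sketch with hypothetical encoder interfaces `encode1` and `encode2` (returning the latent mean and standard deviation) is:

```python
import torch

def estimate_lam_tn(batch, encode1, encode2):
    """lambda_tn as the ratio of unit-Gaussian densities at the latent means of the two
    VAEs, rho(mu^1) / rho(mu^2), computed in log space for numerical stability."""
    s, a, s_next = batch["s"], batch["a"], batch["s_next"]
    with torch.no_grad():
        mu1, _ = encode1(s, a)                  # mean of E_i^1(. | s, a)
        mu2, _ = encode2(s, a, s_next)          # mean of E_i^2(. | s, a, s')
        # log N(mu; 0, I) = -0.5 * ||mu||^2 + const; the constants cancel in the
        # ratio when both latent spaces have the same dimension.
        log_ratio = 0.5 * (mu2.pow(2).sum(-1) - mu1.pow(2).sum(-1))
    return log_ratio.exp()
```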

Table 3: Transition probabilities and expected returns calculated in the dataset using only $\lambda_{vd}$.

| Agent   | action | transition                                     | expected return |
|---------|--------|------------------------------------------------|-----------------|
| Agent 1 | $a_1$  | $p(1\mid a_1)=0.12$, $p(5\mid a_1)=0.88$       | 4.52            |
| Agent 1 | $a_2$  | $p(6\mid a_2)=0.8$, $p(1\mid a_2)=0.2$         | 5               |
| Agent 2 | $a_1$  | $p(1\mid a_1)=0.4$, $p(6\mid a_1)=0.6$         | 4               |
| Agent 2 | $a_2$  | $p(5\mid a_2)=0.95$, $p(1\mid a_2)=0.05$       | 4.8             |
Table 4: Transition probabilities and expected returns calculated in the dataset using $\lambda_{tn}$ and $\lambda_{vd}$.

| Agent   | action | transition                                     | expected return |
|---------|--------|------------------------------------------------|-----------------|
| Agent 1 | $a_1$  | $p(1\mid a_1)=0.17$, $p(5\mid a_1)=0.83$       | 4.33            |
| Agent 1 | $a_2$  | $p(6\mid a_2)=0.86$, $p(1\mid a_2)=0.14$       | 5.29            |
| Agent 2 | $a_1$  | $p(1\mid a_1)=0.14$, $p(6\mid a_1)=0.86$       | 5.29            |
| Agent 2 | $a_2$  | $p(5\mid a_2)=0.83$, $p(1\mid a_2)=0.17$       | 4.33            |
Table 5: Normalized scores of MABCQ and the baselines. The best results are in bold.

|             | MABCQ           | BCQ w/ $\lambda_{vd}$ | BCQ w/ $\lambda_{tn}$ | BCQ         | DDPG         | Behavior    |
|-------------|-----------------|-----------------------|-----------------------|-------------|--------------|-------------|
| HalfCheetah | **17.6 ± 3.3**  | 13.3 ± 4.8            | 13.4 ± 6.5            | 13.4 ± 5.2  | -3.1 ± 3.6   | 11.3 ± 2.8  |
| Walker      | **54.4 ± 5.6**  | 50.1 ± 11.0           | 41.2 ± 17.8           | 28.8 ± 14.4 | 1.7 ± 0.9    | 10.0 ± 0.8  |
| Hopper      | **43.1 ± 14.2** | 34.1 ± 8.2            | 19.8 ± 8.7            | 18.0 ± 3.4  | 9.0 ± 15.3   | 10.7 ± 2.3  |
| Ant         | **60.5 ± 3.6**  | **59.7 ± 4.9**        | **62.9 ± 2.1**        | 51.5 ± 12.7 | -48.4 ± 1.3  | 19.3 ± 4.5  |

4 Related Work

4.1 Off-policy MARL

Many off-policy MARL methods have been proposed for learning to solve cooperative tasks in an online manner. Policy-based methods [15, 8, 37, 26, 29] extend actor-critic into multi-agent cases. Value factorization methods [27, 22, 25, 30] decompose the joint value function into individual value functions. All these methods follow CTDE, where the information of all agents can be accessed in a centralized way during training. Unlike these studies, we consider decentralized settings where global information is not available. For decentralized learning, the key challenge is the obsolete experiences in the replay buffer, which is considered in Fingerprints [2], Lenient-DQN [19], and concurrent experience replay [18]. However, these methods require additional information, e.g., training iteration number and exploration rate, which are often not provided by the offline dataset.

4.2 Offline RL

Offline RL requires the agent to learn from a fixed batch of data consisting of single-step transitions, without exploration. Most offline RL methods consider out-of-distribution actions [12] as the fundamental challenge, which is the main cause of the extrapolation error [5] in value estimation in single-agent environments. To minimize the extrapolation error, some recent methods introduce constraints to enforce the learned policy to be close to the behavior policy, via direct action constraint [5], kernel MMD [10], Wasserstein distance [31], KL divergence [21], or $\ell_2$ distance [4, 20]. Other methods train a Q-function pessimistic to out-of-distribution actions to avoid overestimation, by adding a reward penalty quantified by the learned environment model [35], by minimizing the Q-values of out-of-distribution actions [11, 34], by weighting the update of the Q-function via Monte Carlo dropout [32], or by explicitly assigning and training pseudo Q-values for out-of-distribution actions [16]. Our framework can be built on these methods.

MAICQ [33] studies offline MARL in the CTDE setting, which requires the joint actions of all agents in the dataset and thus cannot be applied to decentralized settings where datasets contain only individual actions. None of the aforementioned methods considers the extrapolation error introduced by the transition bias, which is a fatal problem in offline decentralized MARL.

5 Experiments

We evaluate our framework in both fully and partially observable tasks. In each task, we build an offline dataset $\mathcal{B}_i$ for each agent $i$, which does not contain the actions of other agents. We give the details of the collection of each offline dataset below. Our method and the baselines share the same neural network architectures and hyperparameters, which are available in Appendix. All models are trained for five runs with different random seeds, and the results are reported as mean and standard deviation.

5.1 Matrix Game

We perform MABCQ on the matrix game in Table 1. As shown in Table 3, if we only use $\lambda_{vd}$ without considering transition normalization, the transition probabilities of high-value next states are increased, so for agent 1 the value of $a_2$ becomes higher than that of $a_1$. However, due to the unbalanced action distribution of agent 1, the initial transition probabilities of agent 2 are extremely unbalanced. Even with $\lambda_{vd}$, agent 2 still underestimates the value of $a_1$ and learns the action $a_2$. The agents arrive at the joint action (agent 1: $a_2$, agent 2: $a_2$), which is a worse solution than the initial one (Table 2). Further, by normalizing the transition probabilities with $\lambda_{tn}$, the agents learn the optimal solution (agent 1: $a_2$, agent 2: $a_1$) and build a consensus about the values of the learned actions, as shown in Table 4.

Table 6: Mean difference in value estimates among agents during training. Transition normalization indeed reduces the difference in value estimates.

| value difference | MABCQ          | BCQ w/ $\lambda_{vd}$ |
|------------------|----------------|-----------------------|
| HalfCheetah      | **44.4 ± 3.4** | 411.7 ± 72.4          |
| Walker           | **28.2 ± 2.8** | 38.7 ± 6.9            |
| Hopper           | **24.2 ± 0.8** | 25.4 ± 1.3            |
| Ant              | **60.8 ± 2.9** | 67.0 ± 3.1            |
Table 7: Extrapolation errors. Our framework can decrease the extrapolation error.

| extrapolation error | MABCQ            | BCQ         |
|---------------------|------------------|-------------|
| HalfCheetah         | 98.4 ± 31.3      | 97.2 ± 29.1 |
| Walker              | **55.0 ± 9.6**   | 91.5 ± 35.4 |
| Hopper              | **28.1 ± 3.4**   | 65.8 ± 6.4  |
| Ant                 | **180.2 ± 22.2** | 231.3 ± 47  |
Table 8: Rewards on SMAC datasets.

| dataset | map       | MABCQ          | BCQ         | MACQL          | CQL        |
|---------|-----------|----------------|-------------|----------------|------------|
| random  | 3m        | **4.0 ± 1.1**  | 0 ± 0       | **13.5 ± 1.5** | 9.3 ± 3.0  |
| random  | 8m        | **4.5 ± 0.9**  | 0.8 ± 0.2   | **8.2 ± 1.1**  | 6.6 ± 1.2  |
| random  | 3s_vs_3z  | **8.1 ± 0.7**  | 0 ± 0       | 10.2 ± 0.8     | 10.1 ± 0.4 |
| random  | 3s_vs_4z  | **4.1 ± 1.5**  | 0 ± 0       | 5.7 ± 0.6      | 6.7 ± 0.2  |
| medium  | 3m        | 8.9 ± 1.3      | 7.8 ± 0.4   | **15.1 ± 1.8** | 13.8 ± 1.4 |
| medium  | 8m        | **7.6 ± 0.8**  | 4.5 ± 1.2   | **14.5 ± 1.5** | 12.4 ± 0.6 |
| medium  | 3s_vs_3z  | **8.7 ± 1.1**  | 3.9 ± 0.6   | 9.3 ± 0.9      | 8.9 ± 0.6  |
| medium  | 3s_vs_4z  | **4.3 ± 0.5**  | 0 ± 0       | 6.1 ± 0.7      | 6.8 ± 1.8  |
| replay  | 3m        | 13.2 ± 0.2     | 12.7 ± 0.7  | 13.8 ± 0.4     | 13.5 ± 0.6 |
| replay  | 8m        | **15.2 ± 1.0** | 14.3 ± 0.9  | **17.9 ± 0.4** | 16.3 ± 0.4 |
| replay  | 3s_vs_3z  | 19.4 ± 0.4     | 19.8 ± 0.3  | 20.0 ± 0.0     | 20.0 ± 0.0 |
| replay  | 3s_vs_4z  | 5.3 ± 0.6      | 5.3 ± 0.9   | **5.9 ± 0.3**  | 5.2 ± 0.7  |
| expert  | 3m        | 18.8 ± 0.7     | 18.3 ± 1.1  | 18.9 ± 0.5     | 19.1 ± 0.5 |
| expert  | 8m        | 17.0 ± 0.8     | 17.5 ± 1.1  | 18.5 ± 1.2     | 18.3 ± 1.1 |
| expert  | 3s_vs_3z  | 19.1 ± 0.5     | 19.0 ± 0.6  | 19.2 ± 0.6     | 19.1 ± 0.6 |
| expert  | 3s_vs_4z  | 5.6 ± 0.9      | 5.4 ± 1.1   | 6.8 ± 0.7      | 6.5 ± 0.9  |

5.2 Multi-Agent Mujoco

To evaluate MABCQ in high-dimensional complex environments, we adopt multi-agent MuJoCo [1], where each agent independently controls one or several joints of the robot and gets the state [9] and reward of the robot. The task illustrations and the collection of the offline datasets are given in Appendix.

Baselines. We compare MABCQ against

  • BCQ w/ $\lambda_{vd}$: using $\lambda_{vd}$ alone on BCQ.

  • BCQ w/ $\lambda_{tn}$: using $\lambda_{tn}$ alone on BCQ.

  • BCQ: removing both $\lambda_{tn}$ and $\lambda_{vd}$ from MABCQ.

  • DDPG [14]: each agent $i$ is trained with independent DDPG on the offline $\mathcal{B}_i$, without action constraint or transition probability modification.

  • Behavior: each agent $i$ takes the action generated by the VAE $G_i^1$.

Figure 1: (a) 3s_vs_3z and (b) 2s3z: comparison with MAICQ on two SMAC datasets [33]. (c) DG: performance of MATD3+BC with different $\epsilon$ on DG datasets. (d) CN: learning curves on CN datasets.

Ablation. Table 5 shows the normalized scores [3] of all methods in the four tasks. Without action constraint, DDPG severely suffers from large extrapolation error and hardly learns. BCQ outperforms the behavior policies but only reaches mediocre performance. Using $\lambda_{tn}$ alone does not always improve performance, e.g., in HalfCheetah and Hopper. This is because $\lambda_{tn}$ makes the transition probabilities uniform, which can be far from the ones in execution, leading to large extrapolation errors. In Ant, BCQ w/ $\lambda_{tn}$ outperforms BCQ, which is attributed to the value consensus built by the normalized transition probabilities. By optimistically increasing the transition probabilities of high-value next states, $\lambda_{vd}$ mitigates the underestimation of potentially good actions and thus boosts the performance. MABCQ combines the advantages of value deviation and transition normalization and outperforms all other baselines.

Consensus on value estimates. To verify that transition normalization decreases the difference in value estimates among agents, we uniformly sample a subset from the union of all agents' states and calculate the difference in value estimates, $\max_i V_i^*-\min_i V_i^*$, on this subset. The mean differences during training are shown in Table 6. The $\max_i V_i^*-\min_i V_i^*$ of MABCQ is indeed lower than that of BCQ w/ $\lambda_{vd}$. If there is a consensus among agents about which states are high-value, the agents will select the actions that most likely lead to the common high-value states. This promotes the coordination of policies and helps MABCQ outperform BCQ w/ $\lambda_{vd}$.

Extrapolation error. In Table 7, we present the extrapolation errors of MABCQ and BCQ, measured by $|\frac{1}{N}\sum_i Q_i(s,a_i)-G|$, where $G$ is the true return evaluated by Monte Carlo. Although MABCQ greatly outperforms BCQ (i.e., achieves a much higher return), it still has much smaller extrapolation errors than BCQ in Walker, Hopper, and Ant. This empirically supports the claim that our method decreases the extrapolation error.
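The metric itself is straightforward to compute from evaluation rollouts; a rough sketch under assumed interfaces (`q_nets`, `rollout`) follows, noting that the exact evaluation protocol (which states and actions the Q-values are averaged over) is our assumption.

```python
import numpy as np

def extrapolation_error(q_nets, rollout, gamma):
    """|1/N * sum_i Q_i(s, a_i) - G|, with G the discounted Monte Carlo return of the
    rollout and Q_i evaluated at the initial state and each agent's initial action."""
    s0, a0 = rollout["states"][0], rollout["actions"][0]   # a0[i]: agent i's action
    mean_q = np.mean([q(s0, a0[i]) for i, q in enumerate(q_nets)])
    g = sum(gamma ** t * r for t, r in enumerate(rollout["rewards"]))
    return abs(mean_q - g)
```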

5.3 SMAC

We also investigate the proposed framework on SMAC [23] tasks, including 3m, 8m, 3s_vs_3z, and 3s_vs_4z. We build random datasets generated by uniform policies, medium datasets generated by mixed medium and uniform policies, replay datasets collected during the training of QMIX [22], and expert datasets generated by expert policies trained by QMIX. Each dataset contains $1\times 10^4$ episodes. We also build our framework on CQL [11], termed MACQL. As shown in Table 8, MABCQ and MACQL achieve substantial performance improvements over the baselines, especially on the random and medium datasets, where the transition dynamics in execution are much different from those in the offline datasets. On the expert datasets, since the behavior policies are nearly deterministic, offline RL methods avoid selecting out-of-distribution actions and thus degenerate to behavior cloning. Therefore, all methods perform similarly.

Offline CTDE settings. Although the offline CTDE method [33] does not fit our offline decentralized setting, our framework can also work on offline CTDE datasets. To verify this, we select two replay datasets (jointly collected) from MAICQ [33], split them into individual datasets, and test MACQL on them. As shown in Figures 1(a) and 1(b), our decentralized method obtains competitive performance compared with the centralized method, MAICQ.

Table 9: Rewards on DG datasets.

|                     | MATD3+BC       | TD3+BC     |
|---------------------|----------------|------------|
| full observation    | **46.5 ± 0.9** | 38.9 ± 0.8 |
| partial observation | **31.4 ± 0.6** | 20.7 ± 0.6 |
Table 10: Normalized scores on D4RL MuJoCo datasets.

| dataset       | task        | MATD3+BC       | TD3+BC       | TD3+BC (single) |
|---------------|-------------|----------------|--------------|-----------------|
| random        | halfcheetah | **14.3 ± 2.9** | 12.9 ± 2.6   | 10.2 ± 1.3      |
| random        | hopper      | 10.3 ± 0.6     | 10.4 ± 0.8   | 11.0 ± 0.1      |
| random        | walker2d    | 4.5 ± 3.4      | 3.7 ± 3.0    | 1.4 ± 1.6       |
| medium        | halfcheetah | **41.5 ± 2.4** | 40.4 ± 2.4   | 42.8 ± 0.3      |
| medium        | hopper      | 97.1 ± 1.7     | 98.7 ± 1.2   | 99.5 ± 1.0      |
| medium        | walker2d    | **82.0 ± 5.1** | 72.3 ± 4.1   | 79.7 ± 1.8      |
| replay        | halfcheetah | 40.0 ± 3.2     | 39.3 ± 2.8   | 43.3 ± 0.5      |
| replay        | hopper      | **28.4 ± 4.1** | 25.5 ± 3.3   | 31.4 ± 3.0      |
| replay        | walker2d    | 21.0 ± 2.9     | 21.3 ± 1.5   | 25.2 ± 5.1      |
| medium-expert | halfcheetah | 96.2 ± 5.5     | 95.3 ± 6.5   | 97.9 ± 4.4      |
| medium-expert | hopper      | 112.0 ± 0.8    | 112.3 ± 0.9  | 112.2 ± 0.2     |
| medium-expert | walker2d    | 88.9 ± 17.4    | 82.0 ± 10.8  | 105.7 ± 2.7     |

5.4 MPE

We additionally evaluate our framework on an MPE-based [15] Differential Game (DG), where the transition bias greatly affects performance. Two agents can move in the range $[-1,1]$. The action is the speed, which is in the range $[-0.1,0.1]$. Define $l=\sqrt{x_1^2+x_2^2}$, where $x_1$ and $x_2$ are the positions of the two agents, respectively. The shared reward is set as

$$r=\begin{cases}0.5\times(\cos(15\times l)+1)&\text{if }l<0.2\\ 0&\text{if }0.2\leq l\leq 0.6\\ 0.5\times(l-0.6)^2&\text{if }l>0.6.\end{cases}$$
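A direct transcription of this reward into code (a sketch of ours, not taken from the released environment) is:

```python
import math

def dg_reward(x1: float, x2: float) -> float:
    """Shared reward of the Differential Game given the two agents' positions."""
    l = math.sqrt(x1 ** 2 + x2 ** 2)
    if l < 0.2:
        return 0.5 * (math.cos(15 * l) + 1)   # narrow high-reward peak around the origin
    elif l <= 0.6:
        return 0.0                            # flat zero-reward region
    else:
        return 0.5 * (l - 0.6) ** 2           # shallow suboptimal slope far from the origin

print(dg_reward(0.0, 0.0))  # 1.0, the global optimum
```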

The visualization of the reward function is shown in Figure 2.

Figure 2: Visualization of the reward function in the Differential Game.

Partial observation vs full observation. The offline datasets are collected by uniform random policies and contain $1\times 10^6$ transitions. In the full observation setting, the dataset of each agent contains the positions of both agents. In the partial observation setting, the dataset of each agent only contains its own position. In both settings, the datasets do not contain the actions of the other agent. We add $\lambda_{vd}$ and $\lambda_{tn}$ to TD3+BC [4], termed MATD3+BC. As shown in Table 9, MATD3+BC obtains substantial improvements in both the full and partial observation settings.

Hyperparameter $\epsilon$. The optimism level $\epsilon$ controls the strength of value deviation. If $\epsilon$ is too small, value deviation has only a weak effect on the objective function. On the other hand, if $\epsilon$ is too large, the agent becomes overoptimistic about other agents' learned policies. Figure 1(c) shows the performance of MATD3+BC with different $\epsilon$, which verifies that our framework is robust to $\epsilon$.

We also test MATD3+BC on Cooperative Navigation (CN) in MPE, where $4$ agents learn to cover $4$ landmarks. The reward is $-\sum_j \mathrm{distance}_j$, where $\mathrm{distance}_j$ is the distance from landmark $j$ to the closest agent. The offline datasets are collected by uniform random policies and contain $1\times 10^6$ transitions. Figure 1(d) shows that our framework significantly outperforms the baseline.

5.5 Additional Results

We also split the D4RL [3] MuJoCo datasets into decentralized multi-agent datasets and test MATD3+BC on them. The results are summarized in Table 10. We find that the results of the decentralized methods, where the joints are controlled by different agents, are very close to those of the single-agent method, TD3+BC (single), where a single agent controls all joints and which can be seen as the "upper bound" of the decentralized methods. That is why our method does not bring significant improvement on these tasks. However, MATD3+BC still outperforms TD3+BC on several tasks, e.g., halfcheetah-random and walker2d-medium.

Table 11: Average time taken by one update.

| MABCQ | BCQ   | MATD3+BC | TD3+BC |
|-------|-------|----------|--------|
| 18 ms | 10 ms | 4 ms     | 3 ms   |

To demonstrate the computational efficiency of our method, Table 11 reports the average time taken by one update in HalfCheetah. The experiments are carried out on an Intel i7-8700 CPU and an NVIDIA GTX 1080Ti GPU. Since $\lambda_{vd}$ and $\lambda_{tn}$ can be calculated from the sampled experience without actually going through all next states, our framework additionally needs only two forward passes to compute $\lambda_{vd}$ and $\lambda_{tn}$ per update. Since the value computation is very efficient in TD3+BC, our framework is also efficient on it.

6 Conclusion

We propose a framework for offline decentralized multi-agent reinforcement learning, which overcomes the mismatch between the transition dynamics in the dataset and those in execution. The framework can be instantiated on many offline RL methods. Theoretically, we show that under the purposely controlled non-stationary transition dynamics, offline decentralized Q-learning converges to a unique fixed point. Empirically, the framework outperforms the baselines on a variety of multi-agent offline datasets.

Acknowledgements

This work was supported by NSF China under grants 62250068 and 61872009. The authors would like to thank the anonymous reviewers for their valuable comments.

References

  • [1] Christian Schroeder de Witt, Bei Peng, Pierre-Alexandre Kamienny, Philip Torr, Wendelin Böhmer, and Shimon Whiteson, ‘Deep Multi-Agent Reinforcement Learning for Decentralized Continuous Cooperative Control’, arXiv preprint arXiv:2003.06709, (2020).
  • [2] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson, ‘Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning’, in International Conference on Machine Learning (ICML), (2017).
  • [3] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine, ‘D4rl: Datasets for Deep Data-Driven Reinforcement Learning’, arXiv preprint arXiv:2004.07219, (2020).
  • [4] Scott Fujimoto and Shixiang Shane Gu, ‘A Minimalist Approach To Offline Reinforcement Learning’, in Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS), (2021).
  • [5] Scott Fujimoto, David Meger, and Doina Precup, ‘Off-Policy Deep Reinforcement Learning Without Exploration’, in International Conference on Machine Learning (ICML), (2019).
  • [6] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine, ‘Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with A Stochastic Actor’, in International Conference on Machine Learning (ICML), (2018).
  • [7] Julian Ibarz, Jie Tan, Chelsea Finn, Mrinal Kalakrishnan, Peter Pastor, and Sergey Levine, ‘How To Train Your Robot with Deep Reinforcement Learning: Lessons We Have Learned’, The International Journal of Robotics Research, 40(4-5), 698–721, (2021).
  • [8] Shariq Iqbal and Fei Sha, ‘Actor-Attention-Critic for Multi-Agent Reinforcement Learning’, in International Conference on Machine Learning (ICML), (2019).
  • [9] Jakub Grudzien Kuba, Ruiqing Chen, Munning Wen, Ying Wen, Fanglei Sun, Jun Wang, and Yaodong Yang, ‘Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning’, International Conference on Learning Representations (ICLR), (2022).
  • [10] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine, ‘Stabilizing Off-Policy Q-Learning Via Bootstrapping Error Reduction’, in Advances in Neural Information Processing Systems (NeurIPS), (2019).
  • [11] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine, ‘Conservative Q-Learning for Offline Reinforcement Learning’, Neural Information Processing Systems (NeurIPS), (2020).
  • [12] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu, ‘Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems’, arXiv preprint arXiv:2005.01643, (2020).
  • [13] Lihong Li, Wei Chu, John Langford, and Robert E Schapire, ‘A Contextual-Bandit Approach To Personalized News Article Recommendation’, in International Conference on World Wide Web (WWW), pp. 661–670, (2010).
  • [14] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra, ‘Continuous Control with Deep Reinforcement Learning.’, in International Conference on Learning Representations (ICLR), (2016).
  • [15] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch, ‘Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments’, in Advances in Neural Information Processing Systems (NeurIPS), (2017).
  • [16] Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu, ‘Mildly Conservative Q-learning for Offline Reinforcement Learning’, Neural Information Processing Systems (NeurIPS), (2022).
  • [17] Frans A Oliehoek and Christopher Amato, A Concise Introduction To Decentralized POMDPs, Springer, 2016.
  • [18] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian, ‘Deep Decentralized Multi-Task Multi-Agent Reinforcement Learning Under Partial Observability’, in International Conference on Machine Learning (ICML), (2017).
  • [19] Gregory Palmer, Karl Tuyls, Daan Bloembergen, and Rahul Savani, ‘Lenient Multi-Agent Deep Reinforcement Learning’, in International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), (2018).
  • [20] Ling Pan, Longbo Huang, Tengyu Ma, and Huazhe Xu, ‘Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification’, arXiv preprint arXiv:2111.11188, (2021).
  • [21] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine, ‘Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning’, arXiv preprint arXiv:1910.00177, (2019).
  • [22] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson, ‘QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning’, in International Conference on Machine Learning (ICML), (2018).
  • [23] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philiph H. S. Torr, Jakob Foerster, and Shimon Whiteson, ‘The StarCraft Multi-Agent Challenge’, CoRR, abs/1902.04043, (2019).
  • [24] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz, ‘Trust Region Policy Optimization’, in International Conference on Machine Learning (ICML), (2015).
  • [25] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi, ‘QTRAN: Learning To Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning’, in International Conference on Machine Learning (ICML), (2019).
  • [26] Kefan Su and Zongqing Lu, ‘Divergence-Regularized Multi-Agent Actor-Critic’, in International Conference on Machine Learning (ICML), (2022).
  • [27] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al., ‘Value-Decomposition Networks for Cooperative Multi-Agent Learning Based on Team Reward’, in International Conference on Autonomous Agents and Multiagent Systems (AAMAS), (2018).
  • [28] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al., ‘Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning’, Nature, 575(7782), 350–354, (2019).
  • [29] Jiangxing Wang, Deheng Ye, and Zongqing Lu, ‘More Centralized Training, Still Decentralized Execution: Multi-Agent Conditional Policy Factorization’, in International Conference on Learning Representations (ICLR), (2023).
  • [30] Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang, ‘Qplex: Duplex Dueling Multi-Agent Q-Learning’, in International Conference on Learning Representations (ICLR), (2021).
  • [31] Yifan Wu, George Tucker, and Ofir Nachum, ‘Behavior Regularized Offline Reinforcement Learning’, arXiv preprint arXiv:1911.11361, (2019).
  • [32] Yue Wu, Shuangfei Zhai, Nitish Srivastava, Joshua Susskind, Jian Zhang, Ruslan Salakhutdinov, and Hanlin Goh, ‘Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning’, arXiv preprint arXiv:2105.08140, (2021).
  • [33] Yiqin Yang, Xiaoteng Ma, Chenghao Li, Zewu Zheng, Qiyuan Zhang, Gao Huang, Jun Yang, and Qianchuan Zhao, ‘Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning’, in Neural Information Processing Systems (NeurIPS), (2021).
  • [34] Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn, ‘Combo: Conservative Offline Model-Based Policy Optimization’, arXiv preprint arXiv:2102.08363, (2021).
  • [35] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma, ‘MOPO: Model-Based Offline Policy Optimization’, Advances in Neural Information Processing Systems (NeurIPS), (2020).
  • [36] Chi Zhang, Sanmukh Rao Kuppannagari, and Viktor Prasanna, ‘BRAC+: Going Deeper with Behavior Regularized Offline Reinforcement Learning’, (2021).
  • [37] Tianhao Zhang, Yueheng Li, Chen Wang, Guangming Xie, and Zongqing Lu, ‘FOP: Factorizing Optimal Joint Policy of Maximum-Entropy Multi-Agent Reinforcement Learning’, in International Conference on Machine Learning (ICML), (2021).

Appendix A Proofs

Proposition 1.

In episodic environments, if each agent $i$ performs Q-learning on $\mathcal{B}_i$, all agents will converge to the same $V^*$ if they have the same transition probability at any state where each agent $i$ takes its learned action $a_i^*$.

Proof.

Consider the two-agent case and define $\delta(s)$ as the difference in $V^*$:

$$\begin{aligned}
\delta(s)&=V_1^*(s)-V_2^*(s)\\
&=\sum_{s'}P_{\mathcal{B}_1}(s'|s,a_1^*)\left(r+\gamma V_1^*(s')\right)-\sum_{s'}P_{\mathcal{B}_2}(s'|s,a_2^*)\left(r+\gamma V_2^*(s')\right)\\
&=\sum_{s'}P_{\mathcal{B}_1}(s'|s,a_1^*)\left(r+\gamma V_2^*(s')+\gamma V_1^*(s')-\gamma V_2^*(s')\right)-\sum_{s'}P_{\mathcal{B}_2}(s'|s,a_2^*)\left(r+\gamma V_2^*(s')\right)\\
&=\sum_{s'}\left(P_{\mathcal{B}_1}(s'|s,a_1^*)-P_{\mathcal{B}_2}(s'|s,a_2^*)\right)\left(r+\gamma V_2^*(s')\right)+\gamma\sum_{s'}P_{\mathcal{B}_1}(s'|s,a_1^*)\,\delta(s').
\end{aligned}$$

For the terminal state $s_{end}$, we have $\delta(s_{end})=0$. If $P_{\mathcal{B}_1}\left(s^{\prime}|s,a_1^*\right)=P_{\mathcal{B}_2}\left(s^{\prime}|s,a_2^*\right)$ for all $s^{\prime}\in S$, recursively expanding the $\delta$ term gives $\delta(s)=0+\gamma\cdot 0+\gamma^2\cdot 0+\dots+0=0$. The same argument extends directly to the $N$-agent case. ∎
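To make the statement concrete, below is a minimal numerical illustration on a hypothetical three-state toy problem (the states, actions, rewards, and transition tables are invented for illustration and are not from the paper's experiments; exact value iteration on each agent's dataset-induced MDP stands in for Q-learning at convergence). The two agents disagree at suboptimal actions, but at their greedy actions they induce the same next-state distribution and reward, so they recover the same $V^*$.

```python
import numpy as np

# Toy episodic MDP: states 0 and 1; state 2 is terminal.
gamma = 0.9
n_states, n_actions = 3, 2

# Dataset-induced dynamics for each agent: P[s, a, s'] and R[s, a].
# The agents differ at suboptimal actions, but at their greedy actions
# they induce the same next-state distribution and reward.
P1 = np.zeros((n_states, n_actions, n_states)); R1 = np.zeros((n_states, n_actions))
P2 = np.zeros((n_states, n_actions, n_states)); R2 = np.zeros((n_states, n_actions))

# State 0: agent 1's action 0 and agent 2's action 1 both move to state 1 with reward 1.
P1[0, 0, 1], R1[0, 0] = 1.0, 1.0
P1[0, 1, 2], R1[0, 1] = 1.0, 0.0
P2[0, 1, 1], R2[0, 1] = 1.0, 1.0
P2[0, 0, 2], R2[0, 0] = 1.0, 0.2

# State 1: the greedy actions of both agents reach the terminal state with reward 1.
P1[1, 0, 2], R1[1, 0] = 1.0, 1.0
P1[1, 1, 2], R1[1, 1] = 1.0, 0.5
P2[1, 1, 2], R2[1, 1] = 1.0, 1.0
P2[1, 0, 2], R2[1, 0] = 1.0, 0.3

def value_iteration(P, R, iters=200):
    """Exact Bellman backups on the dataset-induced MDP of one agent."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = Q.max(axis=1)
        V[2] = 0.0                      # terminal state has zero value
        Q = R + gamma * P @ V
    return Q.max(axis=1)

V1, V2 = value_iteration(P1, R1), value_iteration(P2, R2)
print(V1, V2)                           # both [1.9, 1.0, 0.0]: identical V*
assert np.allclose(V1, V2)
```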

Theorem 1.

Under the non-stationary transition probability $\hat{P}_{\mathcal{B}_i}$, the Bellman operator $\mathcal{T}$ is a contraction and converges to a unique fixed point when $\gamma<\frac{r_{\min}}{2r_{\max}-r_{\min}}$, if the reward is bounded within the positive region $[r_{\min},r_{\max}]$.

Proof.

We initialize the Q-value to $\eta r_{\min}$, where $\eta$ denotes $\frac{1-\gamma^{T+1}}{1-\gamma}$. Since the reward is bounded within the positive region $[r_{\min},r_{\max}]$, the Q-value under the operator $\mathcal{T}$ is bounded within $[\eta r_{\min},\eta r_{\max}]$. By the definition of $\hat{P}_{\mathcal{B}_i}\left(s^{\prime}|s,a_i\right)$, it can be written as $\frac{V_i^*(s^{\prime})}{\sum_{\hat{s}^{\prime}}V_i^*(\hat{s}^{\prime})}$, where $V_i^*(s^{\prime})=\max_{\hat{a}_i}Q_i\left(s^{\prime},\hat{a}_i\right)$. Then, we have the following:

$$
\begin{aligned}
&\left\|\mathcal{T}Q_{i}^{1}-\mathcal{T}Q_{i}^{2}\right\|_{\infty}\\
&=\max_{s,a_{i}}\left|\sum_{s^{\prime}\in\mathcal{S}}\hat{P}^{1}_{\mathcal{B}_{i}}\left(s^{\prime}|s,a_{i}\right)\left[r+\gamma\max_{\hat{a}_{i}}Q_{i}^{1}\left(s^{\prime},\hat{a}_{i}\right)\right]-\sum_{s^{\prime}\in\mathcal{S}}\hat{P}^{2}_{\mathcal{B}_{i}}\left(s^{\prime}|s,a_{i}\right)\left[r+\gamma\max_{\hat{a}_{i}}Q^{2}_{i}\left(s^{\prime},\hat{a}_{i}\right)\right]\right|\\
&=\max_{s,a_{i}}\gamma\left|\frac{\sum_{s^{\prime}\in\mathcal{S}}(V_{i}^{*^{1}}(s^{\prime}))^{2}}{\sum_{s^{\prime}\in\mathcal{S}}V_{i}^{*^{1}}(s^{\prime})}-\frac{\sum_{s^{\prime}\in\mathcal{S}}(V_{i}^{*^{2}}(s^{\prime}))^{2}}{\sum_{s^{\prime}\in\mathcal{S}}V_{i}^{*^{2}}(s^{\prime})}\right|\\
&=\max_{s,a_{i}}\gamma\left|\frac{\sum_{s^{\prime}\in\mathcal{S}}\left[(V_{i}^{*^{1}}(s^{\prime}))^{2}-(V_{i}^{*^{2}}(s^{\prime}))^{2}\right]}{\sum_{s^{\prime}\in\mathcal{S}}V_{i}^{*^{1}}(s^{\prime})}-\sum_{s^{\prime}\in\mathcal{S}}(V_{i}^{*^{2}}(s^{\prime}))^{2}\left(\frac{1}{\sum_{s^{\prime}\in\mathcal{S}}V_{i}^{*^{2}}(s^{\prime})}-\frac{1}{\sum_{s^{\prime}\in\mathcal{S}}V_{i}^{*^{1}}(s^{\prime})}\right)\right|\\
&=\max_{s,a_{i}}\gamma\left|\frac{\sum_{s^{\prime}\in\mathcal{S}}(V_{i}^{*^{1}}(s^{\prime})-V_{i}^{*^{2}}(s^{\prime}))(V_{i}^{*^{1}}(s^{\prime})+V_{i}^{*^{2}}(s^{\prime}))}{\sum_{s^{\prime}\in\mathcal{S}}V_{i}^{*^{1}}(s^{\prime})}-\sum_{s^{\prime}\in\mathcal{S}}(V_{i}^{*^{2}}(s^{\prime}))^{2}\,\frac{\sum_{s^{\prime}\in\mathcal{S}}\left(V_{i}^{*^{1}}(s^{\prime})-V_{i}^{*^{2}}(s^{\prime})\right)}{\sum_{s^{\prime}\in\mathcal{S}}V_{i}^{*^{1}}(s^{\prime})\sum_{s^{\prime}\in\mathcal{S}}V_{i}^{*^{2}}(s^{\prime})}\right|\\
&\leq\max_{s,a_{i}}\gamma\left|\sum_{s^{\prime}\in\mathcal{S}}(V_{i}^{*^{1}}(s^{\prime})-V_{i}^{*^{2}}(s^{\prime}))\right|\cdot\frac{1}{\sum_{s^{\prime}\in\mathcal{S}}V_{i}^{*^{1}}(s^{\prime})}\cdot\max\left|(V_{i}^{*^{1}}(s^{\prime})+V_{i}^{*^{2}}(s^{\prime}))-\frac{\sum_{s^{\prime}\in\mathcal{S}}(V_{i}^{*^{2}}(s^{\prime}))^{2}}{\sum_{s^{\prime}\in\mathcal{S}}V_{i}^{*^{2}}(s^{\prime})}\right|\\
&\leq\gamma|\mathcal{S}|\left\|Q_{i}^{1}-Q_{i}^{2}\right\|_{\infty}\cdot\frac{1}{|\mathcal{S}|\eta r_{\min}}\cdot\eta(2r_{\max}-r_{\min})\\
&=\gamma\Big(\frac{2r_{\max}}{r_{\min}}-1\Big)\left\|Q_{i}^{1}-Q_{i}^{2}\right\|_{\infty}.
\end{aligned}
$$

The last factor in the penultimate inequality is bounded as follows. If $V_i^{*^1}(s^{\prime})+V_i^{*^2}(s^{\prime})>\frac{\sum_{s^{\prime}\in\mathcal{S}}(V_i^{*^2}(s^{\prime}))^2}{\sum_{s^{\prime}\in\mathcal{S}}V_i^{*^2}(s^{\prime})}$, then

$$
V_i^{*^1}(s^{\prime})+V_i^{*^2}(s^{\prime})-\frac{\sum_{s^{\prime}\in\mathcal{S}}(V_i^{*^2}(s^{\prime}))^2}{\sum_{s^{\prime}\in\mathcal{S}}V_i^{*^2}(s^{\prime})}
\le V_i^{*^1}(s^{\prime})+V_i^{*^2}(s^{\prime})-\frac{\sum_{s^{\prime}\in\mathcal{S}}V_i^{*^2}(s^{\prime})\cdot\eta r_{\min}}{\sum_{s^{\prime}\in\mathcal{S}}V_i^{*^2}(s^{\prime})}
\le 2\eta r_{\max}-\eta r_{\min},
$$

otherwise,

$$
\frac{\sum_{s^{\prime}\in\mathcal{S}}(V_i^{*^2}(s^{\prime}))^2}{\sum_{s^{\prime}\in\mathcal{S}}V_i^{*^2}(s^{\prime})}-\big(V_i^{*^1}(s^{\prime})+V_i^{*^2}(s^{\prime})\big)
\le\frac{\sum_{s^{\prime}\in\mathcal{S}}V_i^{*^2}(s^{\prime})\cdot\eta r_{\max}}{\sum_{s^{\prime}\in\mathcal{S}}V_i^{*^2}(s^{\prime})}
\le\eta r_{\max}.
$$

Since $2\eta r_{\max}-\eta r_{\min}\geq\eta r_{\max}$, in both cases we have

$$
\left|\big(V_i^{*^1}(s^{\prime})+V_i^{*^2}(s^{\prime})\big)-\frac{\sum_{s^{\prime}\in\mathcal{S}}(V_i^{*^2}(s^{\prime}))^2}{\sum_{s^{\prime}\in\mathcal{S}}V_i^{*^2}(s^{\prime})}\right|\le 2\eta r_{\max}-\eta r_{\min}.
$$

Therefore, if $\gamma<\frac{r_{\min}}{2r_{\max}-r_{\min}}$, the operator $\mathcal{T}$ is a contraction with modulus $\gamma\big(\frac{2r_{\max}}{r_{\min}}-1\big)<1$. By the contraction mapping theorem, $\mathcal{T}$ converges to a unique fixed point. ∎
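As a complement to the proof, the following is a minimal numerical sanity check (a sketch, not part of the paper's experiments). As in the proof, it takes $\hat{P}(s^{\prime}|s,a_i)$ proportional to $V_i^*(s^{\prime})=\max_{\hat{a}_i}Q_i(s^{\prime},\hat{a}_i)$, uses a reward that does not depend on the next state, samples Q-tables bounded in $[\eta r_{\min},\eta r_{\max}]$, and asserts the contraction modulus $\gamma\big(\frac{2r_{\max}}{r_{\min}}-1\big)$; the state/action sizes, seed, and number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 4
r_min, r_max, T = 1.0, 2.0, 50
gamma = 0.2                                    # satisfies gamma < r_min / (2*r_max - r_min) = 1/3
eta = (1 - gamma ** (T + 1)) / (1 - gamma)

def bellman(Q, r):
    """One application of T under the value-proportional transition probabilities."""
    V = Q.max(axis=1)                          # V(s') for every state
    P_hat = V / V.sum()                        # P_hat(s') = V(s') / sum_s' V(s')
    return r + gamma * (P_hat * V).sum()       # broadcasts over all (s, a)

r = rng.uniform(r_min, r_max, size=(n_states, n_actions))   # reward independent of s'
modulus = gamma * (2 * r_max / r_min - 1)      # contraction modulus from the theorem
for _ in range(1000):
    Q1 = rng.uniform(eta * r_min, eta * r_max, size=(n_states, n_actions))
    Q2 = rng.uniform(eta * r_min, eta * r_max, size=(n_states, n_actions))
    lhs = np.abs(bellman(Q1, r) - bellman(Q2, r)).max()
    assert lhs <= modulus * np.abs(Q1 - Q2).max() + 1e-9
print("contraction modulus", modulus, "verified on all sampled Q-table pairs")
```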

Appendix B Settings and Hyperparameters

Figure 3: Illustrations of multi-agent MuJoCo: HalfCheetah, Walker, Hopper, and Ant.
Table 12: Hyperparameters
Continuous BCQ: learning rate of $Q$ $10^{-3}$; learning rate of $\xi$ $10^{-4}$; learning rate of $G$ $10^{-4}$; $\Phi=0.05$; $n=10$; VAE hidden space $10$.
Discrete BCQ: learning rate of $Q$ $10^{-4}$; threshold $\frac{0.6}{\text{action space}}$.
CQL: learning rate of $Q$ $10^{-4}$; learning rate of $\pi$ $10^{-4}$; $\alpha=0.2$.
TD3+BC: learning rate of $Q$ $3\times10^{-4}$; learning rate of $\pi$ $3\times10^{-4}$; $\lambda=2.5$.
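The same settings can be kept in a plain configuration dictionary. This is a sketch with invented key names; the grouping of $\xi$, $G$, $\Phi$, $n$, and the VAE hidden size under continuous BCQ, the threshold under discrete BCQ, $\alpha$ under CQL, and $\lambda$ under TD3+BC follows the standard roles of these hyperparameters in the respective algorithms.

```python
# Baseline hyperparameters from Table 12, grouped per algorithm.
BASELINE_HPARAMS = {
    "continuous_bcq": {
        "lr_q": 1e-3, "lr_xi": 1e-4, "lr_g": 1e-4,
        "phi": 0.05, "n_sampled_actions": 10, "vae_hidden_dim": 10,
    },
    "discrete_bcq": {"lr_q": 1e-4, "threshold": 0.6},  # threshold is divided by the action-space size
    "cql": {"lr_q": 1e-4, "lr_pi": 1e-4, "alpha": 0.2},
    "td3_bc": {"lr_q": 3e-4, "lr_pi": 3e-4, "lambda": 2.5},
}
```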

The illustrations of multi-agent MuJoCo are shown in Figure 3, where different colors indicate different agents. Each agent independently controls one or several joints of the robot and observes the state and reward of the robot, as defined in the original tasks. For each environment, we collect $N$ datasets for the $N$ agents. Each dataset contains $1\times10^6$ transitions $(s,a_i,r,s^{\prime},done)$. For data collection, we train an intermediate policy and an expert policy for each agent using SAC [6]. The offline dataset $\mathcal{B}_i$ is a mixture of four parts: 20% of the transitions are taken from the experiences generated by the SAC agent early in training; 35% are generated with agent $i$ acting the intermediate policy while the other agents act the expert policies; 35% are generated with agent $i$ acting the expert policy while the other agents act the intermediate policies; and 10% are generated with all agents acting the expert policies. For the last three parts, we add a small noise to the policies to increase the diversity of the dataset.
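As a rough sketch of this mixing step, one agent's dataset could be assembled as below. Only the 20/35/35/10 proportions and the target dataset size come from the text; the helper, the array shapes, and the placeholder sources are hypothetical, and the exploration noise is assumed to be added at collection time.

```python
import numpy as np

def mix_offline_dataset(parts, fractions=(0.20, 0.35, 0.35, 0.10),
                        size=1_000_000, seed=0):
    """Subsample the four transition sources according to the stated
    proportions and shuffle them into one agent's offline dataset."""
    rng = np.random.default_rng(seed)
    chunks = []
    for part, frac in zip(parts, fractions):
        n = int(frac * size)
        idx = rng.choice(len(part), size=n, replace=False)
        chunks.append(part[idx])
    dataset = np.concatenate(chunks, axis=0)
    rng.shuffle(dataset)                      # shuffle along the transition axis
    return dataset

# Placeholder arrays standing in for the four collection regimes described above:
# early SAC experiences, intermediate-self / expert-others,
# expert-self / intermediate-others, and all-expert rollouts.
dim = 20                                      # width of a flattened (s, a_i, r, s', done); task-dependent
parts = [np.random.randn(5_000, dim) for _ in range(4)]
buffer_i = mix_offline_dataset(parts, size=10_000)   # the paper uses 1e6 transitions
print(buffer_i.shape)                         # (10000, 20)
```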

In all tasks, we set the discount factor $\gamma=0.99$ and use ReLU activation. In MuJoCo tasks, the MLP units are $(64,64)$ and the batch size is $1024$. In SMAC and DG, the MLP units are $(256,256)$ and the batch size is $100$. The hyperparameter of our framework is the optimism level $\epsilon$. We set $\epsilon=0.80,0.48,0.80,0.64$ in HalfCheetah, Walker, Hopper, and Ant, respectively, $\epsilon=0.99$ in SMAC, and $\epsilon=0.9$ in MPE. The hyperparameters of the baselines are summarized in Table 12.
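Gathered in one place, these settings could look like the following configuration sketch; the key names are invented for illustration, while the values are the ones listed above.

```python
EXPERIMENT_CONFIG = {
    "gamma": 0.99,
    "activation": "relu",
    "mujoco": {"mlp_units": (64, 64), "batch_size": 1024},
    "smac_dg": {"mlp_units": (256, 256), "batch_size": 100},
    # optimism level epsilon of the framework, per task
    "epsilon": {
        "HalfCheetah": 0.80, "Walker": 0.48, "Hopper": 0.80, "Ant": 0.64,
        "SMAC": 0.99, "MPE": 0.9,
    },
}
```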

In this paper, we use SMAC (MIT license), MPE (MIT license), Gym (MIT license), and D4RL (Apache-2.0 license). Many thanks to their contributors.