
Exclusively Penalized Q-learning for Offline Reinforcement Learning

Junghyuk Yeom  Yonghyeon Jo  Jungmo Kim  Sanghyeon Lee  Seungyul Han
Graduate School of Artificial Intelligence
UNIST
Ulsan, South Korea 44919
{junghyukyum,yonghyeonjo,jmkim22,sanghyeon,syhan}@unist.ac.kr
Abstract

Constraint-based offline reinforcement learning (RL) mitigates the overestimation errors caused by distributional shift either by constraining the policy or by imposing penalties on the value function. This paper focuses on a limitation of existing offline RL methods with penalized value functions: the penalty can introduce unnecessary bias into the value function, leading to underestimation. To address this concern, we propose Exclusively Penalized Q-learning (EPQ), which reduces estimation bias in the value function by selectively penalizing states that are prone to inducing estimation errors. Numerical results show that our method significantly reduces underestimation bias and improves performance in various offline control tasks compared to other offline RL methods.

1 Introduction

* indicates equal contribution and † indicates the corresponding author: Seungyul Han. Special thanks to Whiyoung Jung from LG AI Research for providing experimental data used in this work.

Reinforcement learning (RL) is gaining significant attention for solving complex Markov decision process (MDP) tasks. Traditionally, online RL develops advanced decision-making strategies through continuous interaction with environments [1, 2, 3, 4, 5, 6]. However, in real-world scenarios, interacting with the environment can be costly, particularly in high-risk environments like disaster situations, where obtaining sufficient data for learning is challenging [7, 8]. In such setups, the need for exploration [9, 10, 11, 12] to discover optimal strategies often incurs additional costs, as agents must try various actions, some of which may be inefficient or risky [13, 14]. This highlights the significance of research on offline setups, where policies are learned using pre-collected data without any direct interaction with the environment [15, 16]. In offline setups, policy actions not present in the data may introduce extrapolation errors, disrupting accurate value estimation by causing a large overestimation error in the value function, known as the distributional shift problem [17].

To address the distributional shift problem, Fujimoto et al. [17] proposes batch-constrained $Q$-learning (BCQ), assuming that policy actions are selected from the dataset only. Ensuring optimal convergence of both the policy and value function under batch-constrained RL setups [17], BCQ demonstrates stable learning in offline setups and outperforms behavior cloning (BC) techniques [18], which simply mimic actions from the dataset. However, the policy constraint of BCQ strongly limits the policy space, prompting further research to find improved policies by relaxing constraints based on the support of the policy using metrics like maximum mean discrepancy (MMD) [19] or Kullback–Leibler (KL) divergence [20]. While these methods moderately relax policy restrictions, the issue of limited policies persists. Thus, instead of constraining the policy space directly, alternative offline RL methods have been proposed to reduce overestimation bias based on penalized $Q$-functions [21, 22]. Conservative $Q$-learning (CQL) [21], a representative offline RL algorithm using a $Q$-penalty, penalizes the $Q$-function for policy actions and provides a bonus to the $Q$-function for actions in the dataset. Consequently, CQL selects more actions from the dataset, effectively reducing overestimation errors without policy constraints.

While CQL has demonstrated outstanding performance across various offline tasks, we observed that it introduces unnecessary estimation bias in the value function for states that do not contribute to overestimation. This issue becomes more pronounced as the level of penalty increases, resulting in performance degradation. To address this issue, this paper introduces a novel Exclusively Penalized Q-learning (EPQ) method for efficient offline RL. EPQ imposes a threshold-based penalty on the value function exclusively for states causing estimation errors to mitigate overestimation bias without introducing unnecessary bias in offline learning. Experimental results demonstrate that our proposed method effectively reduces both overestimation bias due to distributional shift and underestimation bias due to the penalty, allowing a more accurate evaluation of the current policy compared to the existing methods. Numerical results reveal that EPQ significantly outperforms other state-of-the-art offline RL algorithms on various D4RL tasks [23].

2 Preliminaries

2.1 Markov Decision Process and Offline RL

We consider a Markov decision process (MDP) environment denoted as $\mathcal{M}:=(\mathcal{S},\mathcal{A},P,R,\gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ represents the transition probability, $\gamma$ is the discount factor, and $R$ is the bounded reward function. In offline RL, transition samples $d_{t}=(s_{t},a_{t},r_{t},s_{t+1})$ are generated by a behavior policy $\beta$ and stored in the dataset $D$. We can empirically estimate $\beta$ as $\hat{\beta}(a|s)=\frac{N(s,a)}{N(s)}$, where $N$ represents the number of data points in $D$. We assume that $\mathbb{E}_{s\sim D,a\sim\beta}[f(s,a)]\approx\mathbb{E}_{s\sim D,a\sim\hat{\beta}}[f(s,a)]=\mathbb{E}_{s,a\sim D}[f(s,a)]$ for an arbitrary function $f$. Utilizing only the provided dataset without interacting with the environment, our objective is to find a target policy $\pi$ that maximizes the expected discounted return, denoted as $J(\pi):=\mathbb{E}_{s_{0},a_{0},s_{1},\cdots\sim\pi}[G_{0}]$, where $G_{t}=\sum^{\infty}_{l=t}\gamma^{l-t}R(s_{l},a_{l})$ represents the discounted return.
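To make the empirical estimate concrete, the following minimal Python sketch (assuming a toy dataset with hashable, discrete states and actions; all names are illustrative) computes $\hat{\beta}(a|s)=N(s,a)/N(s)$ from transition tuples. Continuous-action tasks instead fit $\hat{\beta}$ with behavior cloning, as in Algorithm 1.

```python
# Count-based estimate beta_hat(a|s) = N(s,a) / N(s) from Section 2.1.
from collections import Counter

def estimate_behavior_policy(dataset):
    """dataset: iterable of (s, a, r, s_next) transitions with hashable s, a."""
    n_sa = Counter((s, a) for (s, a, _, _) in dataset)
    n_s = Counter(s for (s, _, _, _) in dataset)
    return lambda a, s: n_sa[(s, a)] / n_s[s] if n_s[s] > 0 else 0.0

# Toy usage: beta_hat(a=1 | s=0) = N(0,1)/N(0) = 2/3.
data = [(0, 1, 1.0, 0), (0, 1, 1.0, 1), (0, 0, 0.0, 1), (1, 0, 0.5, 1)]
beta_hat = estimate_behavior_policy(data)
print(beta_hat(1, 0))
```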

2.2 Distributional Shift Problem in Offline RL

In online RL, the optimal policy that maximizes $J(\pi)$ is found through iterative policy evaluation and policy improvement [2, 3]. For policy evaluation, the action value function is defined as $Q^{\pi}(s_{t},a_{t}):=\mathbb{E}_{s_{t},a_{t},s_{t+1},\cdots\sim\pi}[\sum^{\infty}_{l=t}\gamma^{l-t}R(s_{l},a_{l})|s_{t},a_{t}]$. $Q^{\pi}$ can be estimated by iteratively applying the Bellman operator $\mathcal{B}^{\pi}$ to an arbitrary $Q$-function, where $(\mathcal{B}^{\pi}Q)(s,a):=R(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a),~a^{\prime}\sim\pi(\cdot|s^{\prime})}[Q(s^{\prime},a^{\prime})]$. The $Q$-function is updated to minimize the Bellman error using the dataset $D$, given by $\mathbb{E}_{s,a\sim D}\left[\left(Q(s,a)-\mathcal{B}^{\pi}Q(s,a)\right)^{2}\right]$. In offline RL, samples are generated by the behavior policy $\beta$ only, resulting in estimation errors in the $Q$-function for policy actions not present in the dataset $D$. The policy $\pi$ is updated to maximize the $Q$-function, incorporating the estimation error in the policy improvement step. This process accumulates positive bias in the $Q$-function as iterations progress [17].
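As a concrete illustration, a hedged PyTorch-style sketch of the Bellman backup used in this objective is given below; q_target_net and policy.sample are assumed interfaces for this sketch, not part of the paper.

```python
# Sketch of the Bellman target B^pi Q(s,a) = r + gamma * E_{a'~pi(.|s')}[Q(s',a')],
# estimated with one sampled action a' ~ pi(.|s'). Names are illustrative.
import torch

@torch.no_grad()
def bellman_target(q_target_net, policy, r, s_next, done, gamma=0.99):
    a_next, _ = policy.sample(s_next)          # a' ~ pi(.|s'), log-prob ignored here
    q_next = q_target_net(s_next, a_next)      # Q(s', a') from a target network
    return r + gamma * (1.0 - done) * q_next   # zero bootstrap at terminal states
```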

2.3 Conservative $Q$-learning

To mitigate overestimation in offline RL, conservative Q-learning (CQL) [21] penalizes the $Q$-function for the policy actions $a\sim\pi$ and increases the $Q$-function for the data actions $a\sim\hat{\beta}$ while minimizing the Bellman error, where the $Q$-loss function of CQL is given by

\frac{1}{2}\mathbb{E}_{s,a,s^{\prime}\sim D}\left[\left(Q(s,a)-\mathcal{B}^{\pi}Q(s,a)\right)^{2}\right]+\alpha\mathbb{E}_{s\sim D}\left[\mathbb{E}_{a\sim\pi}[Q(s,a)]-\mathbb{E}_{a\sim\hat{\beta}}[Q(s,a)]\right], \qquad (1)

where $\alpha\geq 0$ is a penalizing constant. From the value update in (1), the average $Q$-value of data actions $\mathbb{E}_{a\sim\hat{\beta}}[Q(s,a)]$ becomes larger than the average $Q$-value of target policy actions $\mathbb{E}_{a\sim\pi}[Q(s,a)]$ as $\alpha$ increases. As a result, the policy tends to choose data actions more often in the policy improvement step, effectively reducing overestimation error in the $Q$-function [21].
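For reference, a minimal Monte Carlo sketch of the objective in (1) is shown below (PyTorch-style; q_net, the sampled actions, and the precomputed Bellman target are assumed inputs for this sketch, not the authors' exact implementation):

```python
# Sketch of Eq. (1): Bellman error plus alpha * (E_{a~pi}[Q] - E_{a~beta_hat}[Q]).
import torch

def cql_loss(q_net, bellman_target, s, a_data, a_pi, alpha=5.0):
    """s, a_data: states/actions from D; a_pi: actions sampled from pi(.|s);
    bellman_target: detached estimate of B^pi Q(s, a_data)."""
    bellman_error = 0.5 * (q_net(s, a_data) - bellman_target).pow(2).mean()
    conservative_gap = q_net(s, a_pi).mean() - q_net(s, a_data).mean()
    return bellman_error + alpha * conservative_gap
```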

3 Methodology

3.1 Motivation: Necessity of Mitigating Unnecessary Estimation Bias

In this section, we focus on the penalization behavior of CQL, one of the most representative penalty-based offline RL methods, and present an illustrative example to show that unnecessary estimation bias can occur in the $Q$-function due to the penalization. As explained in Section 2.3, CQL penalizes the $Q$-function for policy actions and increases the $Q$-function for data actions in (1). When examining the $Q$-function for each state-action pair $(s,a)$, the $Q$-value decreases if $\pi(a|s)>\hat{\beta}(a|s)$ and increases otherwise, as the penalizing constant $\alpha$ becomes sufficiently large [21].

To visually demonstrate this, Fig. 1 depicts histograms of the fixed policy $\pi$ and the estimated behavior policy $\hat{\beta}$ for various $\pi$ and $\beta$ at the initial state $s_{0}$ on the Pendulum task with a single-dimensional action space in OpenAI Gym [24], as cases (a), (b), and (c), along with the estimation bias in the $Q$-function for CQL with various penalizing factors $\alpha$. In this example, for all states except the initial state, we consider $\pi=\beta=\textrm{Unif}(-2,2)$. In each case, CQL only updates the $Q$-function with its penalty to evaluate $\pi$ in an offline setup, as shown in equation (1), and we plot the estimation bias of CQL, which represents the average difference between the learned $Q$-function and the expected return $G_{0}$.

Figure 1: Histograms of $\pi$ and $\hat{\beta}$ (left axis), and the estimation bias of CQL with various $\alpha$ (right axis) at $s_{0}$ for three cases: (a) $\beta=\textrm{Unif}(-2,2)$ and $\pi=N(0,0.2)$, (b) $\beta=\frac{1}{2}N(-1,0.3)+\frac{1}{2}N(1,0.3)$ and $\pi=N(1,0.2)$, (c) $\beta=\frac{1}{2}N(-1,0.3)+\frac{1}{2}N(1,0.3)$ and $\pi=N(0,0.2)$, where $\textrm{Unif}(-2,2)$ represents a uniform distribution and $N(\mu,\sigma)$ denotes a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$.

From the results in Fig. 1, we observe that CQL suffers from unnecessary estimation bias in the $Q$-function for cases (a) and (b). In both cases, the histograms illustrate that policy actions are fully contained in the dataset $\hat{\beta}$, suggesting that estimation error in the Bellman update is unlikely to occur even without any penalty. However, CQL introduces a substantial negative bias for actions near 0, where $\pi(0|s_{0})>\hat{\beta}(0|s_{0})$, and a positive bias for other actions. Furthermore, the bias intensifies as the penalty level $\alpha$ increases. In order to mitigate this bias, reducing the penalty level $\alpha$ to zero may seem intuitive in cases like Fig. 1(a) and Fig. 1(b). However, such an approach would be inadequate in cases like Fig. 1(c). In this case, because policy actions close to 0 are rare in the dataset, penalization is necessary to address overestimation caused by estimation errors in offline learning. Furthermore, this problem may become more severe in actual offline learning situations, where the policy continues to change as learning progresses, compared to the setting considered here with a fixed policy.

3.2 Exclusively Penalized Q-learning

To address the issue outlined in Section 3.1, our goal is to selectively give a penalty to the $Q$-function in cases like Fig. 1(c), where policy actions are insufficient in the dataset, while minimizing unnecessary bias due to the penalty in scenarios like Fig. 1(a) and Fig. 1(b), where policy actions are sufficient in the dataset. To achieve this goal, we introduce a novel exclusive penalty $\mathcal{P}_{\tau}$ defined by

\mathcal{P}_{\tau}:=\underbrace{f_{\tau}^{\pi,\hat{\beta}}(s)}_{\textrm{penalty adaptation factor}}\cdot\underbrace{\left(\frac{\pi(a|s)}{\hat{\beta}(a|s)}-1\right)}_{\textrm{penalty term}}, \qquad (2)
Figure 2: An illustration of our exclusive penalty: (a) The log-probability of $\hat{\beta}$ and the thresholds $\tau_{1}$ and $\tau_{2}$ according to the number of data samples $N_{1}$ and $N_{2}$, where $N_{1}\ll N_{2}$. (b) The penalty adaptation factor $f^{\pi,\hat{\beta}}_{\tau}$, which represents the amount of adaptive penalty, indicating how much $\log\hat{\beta}$ exceeds the threshold $\tau$. Three different policies $\pi_{i},~i=1,2,3$, are considered.

where $f_{\tau}^{\pi,\hat{\beta}}(s)=\mathbb{E}_{a\sim\pi(\cdot|s)}[x_{\tau}^{\hat{\beta}}]$ is a penalty adaptation factor for a given $\hat{\beta}$ and policy $\pi$. Here, $x_{\tau}^{\hat{\beta}}=\min(1.0,\exp(-(\log\hat{\beta}(a|s)-\tau)))$ represents the amount of adaptive penalty, which is reduced as $\log\hat{\beta}$ exceeds the threshold $\tau$. Thus, the adaptation factor $f^{\pi,\hat{\beta}}_{\tau}$ indicates the average penalty that policy actions should receive. If the probability of the estimated behavior policy $\hat{\beta}$ for policy actions exceeds the threshold $\tau$, i.e., policy actions are sufficiently present in the dataset, then $x_{\tau}^{\hat{\beta}}$ will be smaller than 1 and will reduce the penalty in proportion to the amount by which $\hat{\beta}$ exceeds the threshold $\tau$, avoiding the unnecessary bias described in Section 3.1. Otherwise, it will be 1 due to $\min(1.0,\cdot)$, maintaining the penalty since policy actions are insufficient in the dataset. The latter penalty term $\frac{\pi(a|s)}{\hat{\beta}(a|s)}-1$ is positive if $\pi(a|s)>\hat{\beta}(a|s)$ and negative otherwise; it thus imposes a positive penalty on the $Q$-function when $\pi(a|s)>\hat{\beta}(a|s)$ and increases the $Q$-function otherwise, as in the $Q$-penalization method of CQL [21].
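A small sketch of how (2) can be computed is given below (PyTorch-style; the tensors of log-probabilities and probability ratios are assumed inputs evaluated with the learned $\hat{\beta}$ and $\pi$):

```python
# Sketch of the exclusive penalty P_tau in Eq. (2).
import torch

def penalty_adaptation_factor(log_beta_at_pi_actions, tau):
    """f_tau^{pi,beta_hat}(s) = E_{a~pi(.|s)}[min(1, exp(-(log beta_hat(a|s) - tau)))].
    log_beta_at_pi_actions: [batch, n_pi_samples] log beta_hat at sampled policy actions."""
    x_tau = torch.clamp(torch.exp(-(log_beta_at_pi_actions - tau)), max=1.0)
    return x_tau.mean(dim=-1)                      # average over policy-action samples

def exclusive_penalty(f_tau, pi_prob, beta_prob):
    """P_tau = f_tau(s) * (pi(a|s)/beta_hat(a|s) - 1) at the penalized action a."""
    return f_tau * (pi_prob / beta_prob - 1.0)
```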

To elaborate further on our proposed penalty, Fig. 2(a) depicts the log-probability of $\hat{\beta}$ and the thresholds $\tau$ used for penalty adaptation, with $N$ representing the number of data points. In Fig. 2(a), if the log-probability $\log\hat{\beta}$ of an action $a\in\mathcal{A}$ exceeds the threshold $\tau$, this indicates that the action $a$ is sufficiently represented in the dataset. Thus, we reduce the penalty for such actions. Furthermore, as shown in Fig. 2(a), when the number of data samples increases from $N_{1}$ to $N_{2}$, the threshold for determining "enough data" decreases from $\tau_{1}$ to $\tau_{2}$, even if the data distribution remains unchanged.

Furthermore, to explain the role of the threshold $\tau$ in the proposed penalty $\mathcal{P}_{\tau}$, we consider two thresholds, $\tau_{1}$ and $\tau_{2}$. In Fig. 2(b), which illustrates the proposed penalty adaptation factors $f^{\pi,\hat{\beta}}_{\tau_{1}}$ and $f^{\pi,\hat{\beta}}_{\tau_{2}}$ for thresholds $\tau_{1}$ and $\tau_{2}$, $x_{\tau_{1}}^{\hat{\beta}}$ is larger than $x_{\tau_{2}}^{\hat{\beta}}$ because $\tau_{1}>\tau_{2}$. As a result, in the case of $\tau_{1}$, $\mathcal{P}_{\tau_{1}}$ only reduces the penalty for $\pi_{3}$. In other words, $f_{\tau_{1}}^{\pi_{1},\hat{\beta}}=f_{\tau_{1}}^{\pi_{2},\hat{\beta}}=1$ and $f_{\tau_{1}}^{\pi_{3},\hat{\beta}}<1$. On the other hand, as the number of data samples increases from $N_{1}$ to $N_{2}$, more actions generated by the behavior policy $\beta$ will be stored in the dataset, so policy actions are more likely to be in the dataset. In this case, the threshold should be lowered from $\tau_{1}$ to $\tau_{2}$. As a result, $\hat{\beta}$ exceeds the threshold $\tau_{2}$ in the support of all policies $\pi_{i}$, and $\mathcal{P}_{\tau_{2}}$ reduces the penalty in the support of all policies $\pi_{i}$, i.e., $f_{\tau_{2}}^{\pi_{3},\hat{\beta}}<f_{\tau_{2}}^{\pi_{1},\hat{\beta}}<f_{\tau_{2}}^{\pi_{2},\hat{\beta}}<1$. Thus, even without knowing the exact number of data samples, the proposed penalty $\mathcal{P}_{\tau}$ allows adjusting the penalty level appropriately according to the given number of data samples based on the threshold $\tau$.

Now, we propose exclusively penalized Q-learning (EPQ), a novel offline RL method that minimizes the Bellman error while imposing the proposed exclusive penalty $\mathcal{P}_{\tau}$ on the $Q$-function as follows:

\min_{Q}~\mathbb{E}_{s,a,s^{\prime}\sim D}\left[\left(Q(s,a)-\{\mathcal{B}^{\pi}Q(s,a)-\alpha\mathcal{P}_{\tau}\}\right)^{2}\right]. \qquad (3)

Then, we can prove that the final $Q$-function of EPQ underestimates the true value function $Q^{\pi}$ in offline RL if $\alpha$ is sufficiently large, as stated in the following theorem. This indicates that the proposed EPQ can successfully reduce overestimation bias in offline RL, while simultaneously alleviating unnecessary bias based on the proposed penalty $\mathcal{P}_{\tau}$.

Theorem 3.1.

We denote by $\hat{Q}^{\pi}$ the $Q$-function converged from the $Q$-update of EPQ using the proposed penalty $\mathcal{P}_{\tau}$ in (3). Then, the expected value of $\hat{Q}^{\pi}$ underestimates the expected true policy value, i.e., $\mathbb{E}_{a\sim\pi}[\hat{Q}^{\pi}(s,a)]\leq\mathbb{E}_{a\sim\pi}[Q^{\pi}(s,a)],~\forall s\in D$, with high probability $1-\delta$ for some $\delta\in(0,1)$, if the penalizing factor $\alpha$ is sufficiently large. Furthermore, the proposed penalty reduces the average penalty for policy actions compared to the average penalty of CQL.

Proof. The proof of Theorem 3.1 is provided in Appendix A.

Figure 3: Histogram of $\hat{\beta}$ (left axis), and the corresponding $f_{\tau}^{\pi,\hat{\beta}}(s)$ with various $\tau$ (right axis) for two cases: (a) $\beta=\textrm{Unif}(-2,2)$, (b) $\beta=\frac{1}{2}N(-1,0.3)+\frac{1}{2}N(1,0.3)$.
Figure 4: Histograms of $\pi$ and $\hat{\beta}$ (left axis), and the estimation bias of CQL and EPQ with various $\tau$ (right axis) for three cases: (a) $\beta=\textrm{Unif}(-2,2)$ and $\pi=N(0,0.2)$, (b) $\beta=\frac{1}{2}N(-1,0.3)+\frac{1}{2}N(1,0.3)$ and $\pi=N(1,0.2)$, (c) $\beta=\frac{1}{2}N(-1,0.3)+\frac{1}{2}N(1,0.3)$ and $\pi=N(0,0.2)$.

In order to demonstrate the $Q$-function convergence behavior of the proposed EPQ in more detail, we revisit the previous Pendulum task in Fig. 1. Fig. 3 shows the histogram of $\hat{\beta}$ and the penalty adaptation factor $f_{\tau}^{\pi,\hat{\beta}}(s)$ for a Gaussian policy $\pi=N(\mu,0.2)$, where $\mu$ varies from $-2$ to $2$, with varying $\beta$. In Fig. 3(a), $f_{\tau}^{\pi,\hat{\beta}}(s)$ should be less than 1 for any policy mean $\mu$ since all policy actions are sufficient in the dataset. In Fig. 3(b), $f_{\tau}^{\pi,\hat{\beta}}(s)$ is less than 1 only if the $\hat{\beta}$ probability near the policy mean $\mu$ is high, and otherwise $f_{\tau}^{\pi,\hat{\beta}}(s)$ is 1, which indicates a lack of policy actions in the dataset. Thus, the result shows that $f_{\tau}^{\pi,\hat{\beta}}(s)$ reflects our motivation in Section 3.1 well. Moreover, Fig. 4 compares the estimation bias curves of CQL and EPQ with $\alpha=10$ in the scenarios presented in Fig. 1. CQL exhibits unnecessary bias in the situations of Fig. 4(a) and Fig. 4(b) where no penalty is needed, as discussed in Section 3.1. Conversely, our proposed method effectively reduces estimation bias in these cases while appropriately maintaining the penalty in the scenario of Fig. 4(c) where penalization is required. This experiment demonstrates the effectiveness of our proposed approach, and the subsequent numerical results in Section 4 will numerically show that our method significantly reduces estimation bias in offline learning, resulting in improved performance.

3.3 Prioritized Dataset

Figure 5: An illustration of the prioritized dataset. As the policy focuses on actions with maximum $Q$-values, the difference between $\hat{\beta}$ and $\pi$ becomes substantial, inducing a large penalty: (a) The change of data distribution from $\hat{\beta}$ (w/o PD) to $\hat{\beta}^{Q}$ (with PD). (b) The corresponding penalty graphs for $\hat{\beta}$ (w/o PD) and $\hat{\beta}^{Q}$ (with PD).

In Section 3.2, EPQ effectively controls the penalty in the scenarios depicted in Fig. 4. However, in cases where the policy is highly concentrated on one side, as shown in Fig. 4, the estimation bias may not be completely eliminated due to the latter penalty term $\frac{\pi}{\hat{\beta}}-1$ in $\mathcal{P}_{\tau}$, as $\pi$ significantly exceeds $\hat{\beta}$. This situation, detailed in Fig. 5, arises when there is a substantial difference in the $Q$-function values among data actions. As the policy is updated to maximize the $Q$-function, the policy shifts towards the data action with a larger $Q$-value, resulting in a more significant penalty for CQL. To further alleviate the penalty and reduce unnecessary bias in this situation, instead of applying a penalty based on $\hat{\beta}$, we introduce a penalty based on the prioritized dataset (PD) $\hat{\beta}^{Q}\propto\hat{\beta}\exp(Q)$. As shown in Fig. 5(a), which illustrates the difference between the original data distribution $\hat{\beta}$ and the modified data distribution $\hat{\beta}^{Q}$ after applying PD, $\hat{\beta}^{Q}$ prioritizes data actions with higher $Q$-values within the support of $\hat{\beta}$. According to Fig. 5(a), when the policy $\pi$ focuses on specific actions, the penalty $\frac{\pi}{\hat{\beta}}-1$ increases significantly, as depicted in Fig. 5(b). In contrast, by applying PD, $\hat{\beta}$ is adjusted to approach $\hat{\beta}^{Q}\propto\hat{\beta}\exp(Q)$, aligning the data distribution more closely with the policy $\pi$. Consequently, we anticipate that the penalty will be significantly mitigated, as the difference between $\pi$ and $\hat{\beta}^{Q}$ is much smaller than the difference between $\pi$ and $\hat{\beta}$. Following this intuition, we modify our penalty using PD as $\mathcal{P}_{\tau,PD}:=f_{\tau}^{\pi,\hat{\beta}}(s)\cdot\left(\frac{\pi(a|s)}{\hat{\beta}^{Q}(a|s)}-1\right)$. It is important to note that the penalty adaptation factor $f_{\tau}^{\pi,\hat{\beta}}(s)$ remains unchanged since we use all data samples in the dataset for $Q$-updates. Additionally, we consider the prioritized dataset for the Bellman update to focus more on data actions with higher $Q$-function values for better performance, as considered in [25]. Then, we can derive the final $Q$-loss function of EPQ with PD as

L(Q)=\frac{1}{2}\mathbb{E}_{s,s^{\prime}\sim D,a\sim\hat{\beta}^{Q}}\left[\left(Q-\{\mathcal{B}^{\pi}Q-\alpha\mathcal{P}_{\tau,PD}\}\right)^{2}\right] \qquad (4)
=\mathbb{E}_{s,s^{\prime}\sim D,a\sim\hat{\beta},a^{\prime}\sim\pi}\left[w_{s,a}^{Q}\cdot\left\{\frac{1}{2}\left(Q(s,a)-\mathcal{B}^{\pi}Q(s,a)\right)^{2}+\alpha f_{\tau}^{\pi,\hat{\beta}}(s)\left(Q(s,a^{\prime})-Q(s,a)\right)\right\}\right]+C,

where $w_{s,a}^{Q}=\frac{\hat{\beta}^{Q}(a|s)}{\hat{\beta}(a|s)}=\frac{\exp(Q(s,a))}{\mathbb{E}_{a^{\prime}\sim\hat{\beta}(\cdot|s)}[\exp(Q(s,a^{\prime}))]}$ is the importance sampling (IS) weight, $C$ is the remaining constant term, and the detailed derivation of (4) is provided in Appendix B.1. The ablation study in Section 4.3 will show that EPQ performs better when the prioritized dataset $\hat{\beta}^{Q}$ is considered.
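A hedged sketch of how (4) can be evaluated on a minibatch is given below (PyTorch-style; the Bellman targets, the $Q$-values at policy actions, $f_{\tau}^{\pi,\hat{\beta}}(s)$, and the $Q$-values at actions sampled from $\hat{\beta}$ are assumed to be computed elsewhere, e.g., as in the earlier sketches):

```python
# Sketch of the IS weight w_{s,a}^Q and the EPQ loss in Eq. (4).
import math
import torch

def is_weight(q_sa, q_beta_samples):
    """w = exp(Q(s,a)) / E_{a'~beta_hat(.|s)}[exp(Q(s,a'))], with
    q_beta_samples: [batch, n] values Q(s, a_i) for a_i ~ beta_hat(.|s)."""
    n = q_beta_samples.shape[1]
    log_denom = torch.logsumexp(q_beta_samples, dim=1) - math.log(n)
    return torch.exp(q_sa - log_denom)

def epq_loss(q_sa, bellman_target, q_pi, f_tau, w, alpha=20.0):
    """q_sa: Q(s,a) at batch actions; q_pi: Q(s,a') with a' ~ pi(.|s);
    f_tau: penalty adaptation factor per state; w: IS weight."""
    bellman = 0.5 * (q_sa - bellman_target).pow(2)
    penalty = alpha * f_tau * (q_pi - q_sa)
    return (w.detach() * (bellman + penalty)).mean()
```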

3.4 Practical Implementation and Algorithm

Now, we propose the implementation of EPQ based on the value loss function (4). Basically, our implementation follows the setup of CQL [21]. For the policy, we utilize the Gaussian policy with a $\textrm{Tanh}(\cdot)$ layer proposed by Haarnoja et al. [4] and update the policy to maximize the $Q$-function with its entropy. Then, the policy loss function is given by

L(\pi)=\mathbb{E}_{s\sim D,~a\sim\pi}[-Q(s,a)+\log\pi(a|s)]. \qquad (5)
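A minimal sketch of (5) for a tanh-Gaussian policy is shown below (the policy.sample interface returning reparameterized actions and their log-probabilities is an assumption of this sketch):

```python
# Sketch of the policy loss in Eq. (5): maximize Q(s,a) plus the policy entropy.
import torch

def policy_loss(q_net, policy, s):
    a, log_prob = policy.sample(s)             # reparameterized a ~ pi(.|s)
    return (-q_net(s, a) + log_prob).mean()    # -Q(s,a) + log pi(a|s), averaged over batch
```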

Based on the $Q$-update in (4) and the policy loss function (5), we summarize the algorithm of EPQ as Algorithm 1. More detailed implementation, including the calculation method of the IS weight $w_{s,a}^{Q}$ and redefined loss functions for the parameterized $Q$ and $\pi$, is provided in Appendix B.2.

Algorithm 1 Exclusively Penalized Q-learning
0:  Offline dataset $D$
1:  Train the behavior policy $\hat{\beta}$ based on behavior cloning (BC)
2:  Initialize $Q$ and $\pi$
3:  for gradient step $k=0,1,2,3,\ldots$ do
4:     Sample batch transitions $\{(s,a,r,s^{\prime})\}$ from $D$.
5:     Calculate the penalty adaptation factor $f_{\tau}^{\pi,\hat{\beta}}(s)$ and IS weight $w_{s,a}^{Q}$
6:     Compute losses $L(Q)$ in Equation (4) and $L(\pi)$ in Equation (5)
7:     Update the policy $\pi$ to minimize $L(\pi)$
8:     Update the $Q$-function to minimize $L(Q)$
9:  end for

4 Experiments

In this section, we evaluate our proposed EPQ against other state-of-the-art offline RL algorithms using the D4RL benchmark [23], commonly used in the offline RL domain. Among various D4RL tasks, we mainly consider Mujoco locomotion tasks, Adroit manipulation tasks, and AntMaze navigation tasks, with scores normalized from 0 to 100, where 0 represents random performance and 100 represents expert performance.

Mujoco Locomotion Tasks: The D4RL dataset comprises offline datasets obtained from Mujoco tasks [26] such as HalfCheetah, Hopper, and Walker2d. Each task has ‘random’, ‘medium’, and ‘expert’ datasets, obtained by a random policy, a medium policy with performance of 50 to 100 points, and an expert policy with performance of 100 points, respectively. Additionally, there is a ‘medium-expert’ dataset that contains both ‘medium’ and ‘expert’ data, and ‘medium-replay’ and ‘full-replay’ datasets that contain the replay buffers generated while the medium and expert policies are trained, respectively.
Adroit Manipulation Tasks: Adroit provides four complex manipulation tasks: Pen, Hammer, Door, and Relocate, utilizing motion-captured human data with associated rewards. Each task has two datasets: ‘human’ dataset derived from human motion-capture data, and ‘cloned’ dataset comprising samples from both the cloned behavior policy using BC and the original motion-capture data.
AntMaze Navigation Tasks: AntMaze is composed of six navigation tasks, including ‘umaze’, ‘umaze-diverse’, ‘medium-play’, ‘medium-diverse’, ‘large-play’, and ‘large-diverse’, where a robot ant agent is trained to reach a goal within the maze. While the ‘play’ datasets are acquired under a fixed set of goal locations and a fixed set of starting locations, the ‘diverse’ datasets are acquired under random goal locations and random starting locations.

Table 1: Performance comparison: Normalized average return results
Task name BC 10% BC TD3+BC CQL (paper) CQL (reprod.) Onestep IQL MCQ MISA EPQ
halfcheetah-random 2.3 2.2 12.7 35.4 20.8 6.9 12.9 28.5 2.5 33.0±2.4
hopper-random 4.1 4.7 22.5 10.8 9.7 7.9 9.6 31.8 9.9 32.1±0.3
walker2d-random 1.7 2.3 7.2 7.0 7.1 6.2 6.9 17.0 9.0 23.0±0.7
halfcheetah-medium 42.6 42.5 48.3 44.4 44.0 48.4 47.4 64.3 47.4 67.3±0.5
hopper-medium 52.9 56.9 59.3 86.6 58.5 59.6 66.3 78.4 67.1 101.3±0.2
walker2d-medium 75.3 75.0 83.7 74.5 72.5 81.8 78.3 91.0 84.1 87.8±2.1
halfcheetah-medium-expert 55.2 92.9 90.7 62.4 91.6 93.4 86.7 87.5 94.7 95.7±0.3
hopper-medium-expert 52.5 110.9 98.0 111.0 105.4 103.3 91.5 111.2 109.8 108.8±5.2
walker2d-medium-expert 107.5 109.0 110.1 98.7 108.8 113.0 109.6 114.2 109.4 112.0±0.6
halfcheetah-expert 92.9 91.9 98.6 104.8 96.3 92.3 95.4 96.2 95.9 107.2±0.2
hopper-expert 111.2 109.6 111.7 109.9 110.8 112.3 112.4 111.4 111.9 112.4±0.5
walker2d-expert 108.5 109.1 110.3 121.6 110.0 111.0 110.1 107.2 109.3 109.8±1.0
halfcheetah-medium-replay 36.6 40.6 44.6 46.2 45.5 38.1 44.2 56.8 45.6 62.0±1.6
hopper-medium-replay 18.1 75.9 60.9 48.6 95.0 97.5 94.7 101.6 98.6 97.8±1.0
walker2d-medium-replay 26.0 62.5 81.8 32.6 77.2 49.5 73.9 91.3 86.2 85.3±1.0
halfcheetah-full-replay 62.4 68.7 75.9 - 76.9 80.0 73.3 82.3 74.8 85.3±0.7
hopper-full-replay 34.3 92.8 81.5 - 101.0 107.8 107.2 108.5 103.5 108.5±0.6
walker2d-full-replay 45.0 89.4 95.2 - 93.4 102.0 98.1 95.7 94.8 107.4±0.6
Mujoco Tasks Total 929.1 1236.9 1293.0 - 1325.8 1311.0 1318.5 1474.9 1354.5 1536.7
pen-human 63.9 -2.0 64.8 55.8 37.5 71.8 71.5 68.5 88.1 83.9±6.8
door-human 2.0 0.0 0.0 9.1 9.9 5.4 4.3 2.3 5.2 13.2±2.4
hammer-human 1.2 0.0 1.8 2.1 4.4 1.2 1.4 0.3 8.1 3.9±5.0
relocate-human 0.1 0.0 0.1 0.4 0.2 1.9 0.1 0.1 0.1 0.3±0.2
pen-cloned 37.0 0.0 49 40.3 39.2 60.0 37.3 49.4 58.6 91.8±4.7
door-cloned 0.0 0.0 0.0 3.5 0.4 0.4 1.6 1.3 0.5 5.8±2.8
hammer-cloned 0.6 0.0 0.2 5.7 2.1 2.1 2.1 1.4 2.2 22.8±15.3
relocate-cloned -0.3 0.0 -0.2 -0.1 -0.1 -0.1 -0.2 0.0 -0.1 0.1±0.1
Adroit Tasks Total 104.5 -2 115.7 116.8 93.6 142.7 118.1 123.3 162.7 221.8
umaze 54.6 62.8 78.6 74.0 80.4 72.5 87.5 98.3 92.3 99.4±1.0
umaze-diverse 45.6 50.2 71.4 84.0 56.3 75.0 62.2 80.0 89.1 78.3±5.0
medium-play 0.0 5.4 10.6 61.2 67.5 5.0 71.2 52.5 63.0 85.0±11.2
medium-diverse 0.0 9.8 3.0 53.7 62.5 5.0 70.0 37.5 62.8 86.7±18.9
large-play 0.0 0.0 0.2 15.8 35.0 2.5 39.6 2.5 17.5 40.0±8.2
large-diverse 0.0 6.0 0.0 14.9 13.3 2.5 47.5 7.5 23.4 36.7±4.7
AntMaze Tasks Total 100.2 134.2 163.8 303.6 315.0 162.5 378.0 278.3 348.1 426.1

4.1 Performance Comparisons

We compare our algorithm with various constraint-based offline RL methods, including the CQL baseline [21], on which our algorithm is based. For other baseline methods, we consider behavior cloning (BC) and 10% BC, where the latter utilizes only the top 10% of demonstrations with high returns, TD3+BC [27], which simply combines BC with TD3 [3], Onestep RL [28], which performs a single policy iteration based on the dataset, implicit $Q$-learning (IQL) [29], which seeks the optimal value function for the dataset through expectile regression, mildly conservative $Q$-learning (MCQ) [30], which reduces overestimation by using pseudo $Q$-values for out-of-distribution actions, and MISA [31], which considers a policy constraint based on mutual information. To assess baseline algorithm performance, we utilize results directly from the original papers for CQL (paper) [21] and MCQ [30], as well as reported results from other baseline algorithms according to Ma et al. [31]. For CQL, reproducing its performance is challenging, so we also include reproduced CQL performance labeled as CQL (reprod.) from Ma et al. [31]. Any missing experimental results have been filled in by re-implementing each baseline algorithm. For our algorithm, we explored various penalty control thresholds $\tau\in\{c\cdot\rho,~c\in[0,10]\}$, where $\rho$ represents the log-density of $\textrm{Unif}(\mathcal{A})$. For Mujoco tasks, the EPQ penalizing constant is fixed at $\alpha=20.0$, and for Adroit tasks, we consider either $\alpha=5.0$ or $\alpha=20.0$. To ensure robustness, we run our algorithm with four different seeds for each task. Table 1 displays the average normalized returns and corresponding standard deviations for the compared algorithms. The performance of EPQ is based on the best hyperparameter setup, with additional results presented in the ablation study in Section 4.3. Further details on the hyperparameter setup are provided in Appendix C.
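For concreteness, a small sketch of this threshold grid is shown below, assuming the action space is $[-1,1]^{d}$ (as with a tanh-squashed Gaussian policy), so that the log-density of $\textrm{Unif}(\mathcal{A})$ is $\rho=-d\log 2$; the action-bound assumption is ours, not stated in the paper.

```python
# Threshold grid tau = c * rho, with rho the log-density of Unif(A) on [-1, 1]^d.
import numpy as np

def threshold_candidates(action_dim, coeffs=(0.2, 0.5, 1.0, 2.0, 5.0, 10.0)):
    rho = -action_dim * np.log(2.0)   # log(1 / 2^d)
    return [c * rho for c in coeffs]

print(threshold_candidates(3))        # e.g., a 3-dimensional action space
```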

The results in Table 1 show that our algorithm significantly outperforms the other constraint-based offline RL algorithms in all considered tasks. In particular, in challenging tasks such as the Adroit and AntMaze tasks, where rewards are sparse or intermittent, EPQ demonstrates remarkable performance improvements compared to recent offline RL methods. This is because EPQ can impose an appropriate penalty on each state, even if the policy and behavior policy vary depending on the timestep, as demonstrated in Section 3.2. Also, we observe that our proposed algorithm shows a large increase in performance in the ‘Hopper-random’, ‘Hopper-medium’, and ‘Halfcheetah-medium’ environments compared to CQL, so we further analyze the causes of this performance increase in the following section. For Adroit tasks, the performance of CQL (reprod.) is much lower than that of CQL (paper), so we provide an enhanced version of CQL in Appendix E; the results there show that EPQ still performs better than the enhanced version of CQL.

Figure 6: Analysis of the proposed method: (a) squared value of estimation bias; (b) normalized average return.

4.2 The Analysis of Estimation Bias

In Section 4.1, EPQ outperforms the CQL baselines significantly across various D4RL tasks based on our proposed penalty in Section 3. To analyze the impact of our overestimation reduction method on performance enhancement, we compare the estimation bias of EPQ and the CQL baselines with various penalizing constants $\alpha\in\{0,1,5,10,20\}$ on the ‘Hopper-random’, ‘Hopper-medium’, and ‘Halfcheetah-medium’ tasks. In Fig. 6(a), we depict the squared value of the estimation bias, obtained from the difference between the $Q$-value and the empirical average return for sample trajectories generated by the policy, to show both overestimation bias and underestimation bias. In the experiment shown in Fig. 6(a), the estimation bias in CQL with $\alpha=0$ became excessively large, causing the gradients to explode and resulting in forced termination of the training. Fig. 6(b) illustrates the corresponding normalized average returns, emphasizing learning progress after 200k gradient steps.
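The bias metric in Fig. 6(a) can be summarized by the following small sketch (illustrative; q_values and mc_returns are assumed to be collected from evaluation rollouts of the current policy):

```python
# Squared estimation bias: the squared mean gap between Q(s_t, a_t) and the
# empirical discounted return G_t along trajectories sampled from pi.
import numpy as np

def squared_estimation_bias(q_values, mc_returns):
    """q_values, mc_returns: 1-D arrays aligned over sampled (s_t, a_t) pairs."""
    return float(np.mean(np.asarray(q_values) - np.asarray(mc_returns)) ** 2)
```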

In Fig. 6(a), we observe an increase in estimation bias for CQL as the penalizing constant $\alpha$ rises, attributed to the unnecessary bias highlighted in Fig. 1. Reducing $\alpha$ to nearly 0 in CQL, however, fails to effectively mitigate overestimation error, leading to divergence of the $Q$-function in tasks such as ‘Hopper-random’ and ‘Hopper-medium’, as shown in Fig. 6(a). Conversely, EPQ demonstrates superior reduction of estimation bias in the $Q$-function compared to the CQL baselines for all tasks in Fig. 6(a), indicating its capability to mitigate both overestimation and underestimation bias based on the proposed penalty. As a result, Fig. 6(b) shows that EPQ significantly outperforms all CQL variants on the ‘Hopper-random’, ‘Hopper-medium’, and ‘Halfcheetah-medium’ tasks.

Figure 7: Additional ablation studies on the Hopper-random, Hopper-medium, and Halfcheetah-medium tasks: (a) component evaluation; (b) penalty control thresholds $\tau\in[0.2\rho,0.5\rho,1.0\rho,2.0\rho,5.0\rho,10.0\rho]$. The best hyperparameter in the paper is denoted by the orange curve.

4.3 Ablation Study

To understand the impact of EPQ’s components and hyperparameters, we conduct ablation studies to evaluate each component and the penalty control threshold $\tau$ on the ‘Hopper-random’, ‘Hopper-medium’, and ‘HalfCheetah-medium’ tasks, where our proposed method showed a significant performance improvement compared to the baseline CQL.

Component Evaluation: In Section 3, we introduced two variants of the EPQ algorithm: EPQ (w/o PD), which does not incorporate a prioritized dataset, as in equation (3), and EPQ (with PD), which leverages a prioritized dataset based on $\hat{\beta}^{Q}$, as in equation (4). In Fig. 7(a), we compare the performance of EPQ (w/o PD), EPQ (with PD), and the CQL baseline to analyze the impact of each component. EPQ (w/o PD) still outperforms CQL, demonstrating that the proposed penalty $\mathcal{P}_{\tau}$ in Section 3.2 enhances performance by efficiently reducing overestimation without introducing unnecessary estimation bias, as discussed in Section 3.2. Additionally, Fig. 7(a) shows that EPQ (with PD) outperforms EPQ (w/o PD) significantly in the HalfCheetah-medium task, indicating that the proposed prioritized dataset contributes to improved performance, as anticipated in Section 3.3.

Penalty Control Threshold $\tau$: As discussed in Section 3.2, EPQ can dynamically control the penalty amount based on the penalty control threshold $\tau$, as illustrated in Fig. 2, even in the absence of knowledge about the exact number of data samples. Fig. 7(b) shows the performance of EPQ with various penalty control thresholds $\tau\in[0.2\rho,0.5\rho,1.0\rho,2.0\rho,5.0\rho,10.0\rho]$, where $\rho$ represents the log-density of $\textrm{Unif}(\mathcal{A})$. Note that $\rho$ is negative, so $\tau=10.0\rho$ is the lowest threshold while $\tau=0.2\rho$ is the highest. The results indicate that in tasks like Hopper-medium, where a variety of actions are not sufficiently sampled, a higher threshold performs better. Conversely, in tasks like Hopper-random, where a broad range of actions is sampled, a lower threshold is more effective. An exception is the HalfCheetah-medium task, which, despite having fewer action variations, visits a diverse range of states. This may result in lower overestimation errors for OOD actions, benefiting from a lower threshold. Furthermore, the performance on the considered tasks appears to be surprisingly insensitive to changes in $\tau$. We initially expected that performance might be sensitive to $\tau$, since it reflects the fixed number of data samples, but the results indicate that the performance is not significantly affected by variations in $\tau$. Moreover, EPQ with different $\tau$ consistently outperforms the CQL baseline, highlighting the superiority of the proposed method.

5 Related Works

5.1 Constraint-based Offline RL

In order to reduce overestimation in offline learning, several constraint-based offline RL methods have been studied. Fujimoto et al. [17] propose a batch-constrained policy to minimize the extrapolation error, while Kumar et al. [19] and Wu et al. [20] limit the policy distribution based on distributional distance measures rather than directly constraining the policy. Fujimoto and Gu [27] restrict the policy actions to batch data based on the online algorithm TD3 [3]. Furthermore, Kumar et al. [21] and Yu et al. [32] aim to minimize the probability of out-of-distribution actions using a lower bound of the true value. By predicting a more optimistic cost for tuples within the batch data, Xu et al. [22] provide stable training for offline safe RL tasks. On the other hand, Ma et al. [31] utilize mutual information to constrain the policy.

5.2 Offline Learning based on Data Optimality

In the offline learning setup, the optimality of the dataset greatly impacts performance [25]. Simply using $n$-% BC or applying weighted experiences [33, 34], which utilize only a portion of the data based on how the given data are evaluated, fails to exploit the full distribution of the data. Based on Haarnoja et al. [35], Reddy et al. [36] and Garg et al. [37] use the Boltzmann distribution for offline learning, training the policy to follow actions with higher value in the imitation learning domain [38, 39]. Kostrikov et al. [29] and Xiao et al. [40] argue that the optimality of data can be improved by using expectile regression and the in-sample SoftMax, respectively. Additionally, methods that learn the value function from the return of the data in a supervised manner have been proposed [41, 28, 42, 43].

5.3 Value Function Shaping

In offline RL, imposing constraints on the policy can decrease performance; thus, Kumar et al. [21] and Lyu et al. [44] impose penalties on out-of-distribution actions by structuring the learned value function as a lower bound to the actual values. Additionally, Fakoor et al. [45] address the issue by imposing a divergence-based policy constraint and suppressing overly optimistic estimates of the value function, thereby preventing its excessive expansion. Moreover, Wu et al. [46] predict the instability of actions through the variance of the value function, imposing penalties on out-of-distribution actions, while Lyu et al. [30] replace the $Q$-values for out-of-distribution actions with pseudo $Q$-values, and Agarwal et al. [47], An et al. [48], Bai et al. [49], and Lee et al. [50] mitigate the instability of learning the value function by applying ensemble techniques. In addition, Ghosh et al. [51] interpret the changes in the MDP from a Bayesian perspective through the value function, thereby conducting adaptive policy learning.

6 Conclusion

To mitigate overestimation error in offline RL, this paper focuses on exclusive penalty control, which selectively gives a penalty to states where policy actions are insufficient in the dataset. Furthermore, we propose a prioritized dataset to enhance the efficiency of reducing unnecessary bias due to the penalty. As a result, our proposed method, EPQ, successfully reduces the overestimation error arising from distributional shift while avoiding underestimation error due to the penalty. This significantly reduces estimation bias in offline learning, resulting in substantial performance improvements across various D4RL tasks.

Acknowledgements

This work was supported in part by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00469, Development of Core Technologies for Task-oriented Reinforcement Learning for Commercialization of Autonomous Drones), in part by IITP grant funded by the Korea government (MSIT) (No. RS-2022-00156361, Innovative Human Resource Development for Local Intellectualization (UNIST)), and in part by Artificial Intelligence Graduate School support (UNIST), IITP grant funded by the Korea government (MSIT) (No. 2020-0-01336).

References

  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Lillicrap et al. [2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Fujimoto et al. [2018] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pages 1587–1596. PMLR, 2018.
  • Haarnoja et al. [2018a] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018a.
  • Han and Sung [2019] Seungyul Han and Youngchul Sung. Dimension-wise importance sampling weight clipping for sample-efficient reinforcement learning. In International Conference on Machine Learning, pages 2586–2595. PMLR, 2019.
  • Han and Sung [2021a] Seungyul Han and Youngchul Sung. Diversity actor-critic: Sample-aware entropy regularization for sample-efficient exploration. In International Conference on Machine Learning, pages 4018–4029. PMLR, 2021a.
  • Qin et al. [2022] Rong-Jun Qin, Xingyuan Zhang, Songyi Gao, Xiong-Hui Chen, Zewen Li, Weinan Zhang, and Yang Yu. Neorl: A near real-world benchmark for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:24753–24765, 2022.
  • Zhou et al. [2023] Gaoyue Zhou, Liyiming Ke, Siddhartha Srinivasa, Abhinav Gupta, Aravind Rajeswaran, and Vikash Kumar. Real world offline reinforcement learning with realistic data source. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7176–7183. IEEE, 2023.
  • Tang et al. [2017] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. Advances in neural information processing systems, 30, 2017.
  • Hong et al. [2018] Zhang-Wei Hong, Tzu-Yun Shann, Shih-Yang Su, Yi-Hsiang Chang, Tsu-Jui Fu, and Chun-Yi Lee. Diversity-driven exploration strategy for deep reinforcement learning. Advances in neural information processing systems, 31, 2018.
  • Han and Sung [2021b] Seungyul Han and Youngchul Sung. A max-min entropy framework for reinforcement learning. Advances in Neural Information Processing Systems, 34:25732–25745, 2021b.
  • Jo et al. [2024] Yonghyeon Jo, Sunwoo Lee, Junghyuk Yeom, and Seungyul Han. Fox: Formation-aware exploration in multi-agent reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 12985–12994, 2024.
  • Pecka and Svoboda [2014] Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning–an overview. In Modelling and Simulation for Autonomous Systems: First International Workshop, MESAS 2014, Rome, Italy, May 5-6, 2014, Revised Selected Papers 1, pages 357–375. Springer, 2014.
  • Chae et al. [2022] Jongseong Chae, Seungyul Han, Whiyoung Jung, Myungsik Cho, Sungho Choi, and Youngchul Sung. Robust imitation learning against variations in environment dynamics. In International Conference on Machine Learning, pages 2828–2852. PMLR, 2022.
  • Levine et al. [2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Kumar et al. [2022] Aviral Kumar, Joey Hong, Anikait Singh, and Sergey Levine. When should we prefer offline reinforcement learning over behavioral cloning? arXiv preprint arXiv:2204.05618, 2022.
  • Fujimoto et al. [2019] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pages 2052–2062. PMLR, 2019.
  • Bain and Sammut [1995] Michael Bain and Claude Sammut. A framework for behavioural cloning. In Machine Intelligence 15, pages 103–129, 1995.
  • Kumar et al. [2019] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.
  • Wu et al. [2019] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  • Kumar et al. [2020] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  • Xu et al. [2022] Haoran Xu, Xianyuan Zhan, and Xiangyu Zhu. Constraints penalized q-learning for safe offline reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8753–8760, 2022.
  • Fu et al. [2020] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Yarats et al. [2022] Denis Yarats, David Brandfonbrener, Hao Liu, Michael Laskin, Pieter Abbeel, Alessandro Lazaric, and Lerrel Pinto. Don’t change the algorithm, change the data: Exploratory data for offline reinforcement learning. arXiv preprint arXiv:2201.13425, 2022.
  • Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012.
  • Fujimoto and Gu [2021] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145, 2021.
  • Brandfonbrener et al. [2021] David Brandfonbrener, Will Whitney, Rajesh Ranganath, and Joan Bruna. Offline rl without off-policy evaluation. Advances in neural information processing systems, 34:4933–4946, 2021.
  • Kostrikov et al. [2021] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021.
  • Lyu et al. [2022a] Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu. Mildly conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:1711–1724, 2022a.
  • Ma et al. [2024] Xiao Ma, Bingyi Kang, Zhongwen Xu, Min Lin, and Shuicheng Yan. Mutual information regularized offline reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
  • Yu et al. [2021] Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. Advances in neural information processing systems, 34:28954–28967, 2021.
  • Schaul et al. [2015] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
  • Andrychowicz et al. [2017] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in neural information processing systems, 30, 2017.
  • Haarnoja et al. [2017] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International conference on machine learning, pages 1352–1361. PMLR, 2017.
  • Reddy et al. [2019] Siddharth Reddy, Anca D Dragan, and Sergey Levine. Sqil: Imitation learning via reinforcement learning with sparse rewards. arXiv preprint arXiv:1905.11108, 2019.
  • Garg et al. [2021] Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, and Stefano Ermon. Iq-learn: Inverse soft-q learning for imitation. Advances in Neural Information Processing Systems, 34:4028–4039, 2021.
  • Ho and Ermon [2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016.
  • Choi et al. [2024] Sungho Choi, Seungyul Han, Woojun Kim, Jongseong Chae, Whiyoung Jung, and Youngchul Sung. Domain adaptive imitation learning with visual observation. Advances in Neural Information Processing Systems, 36, 2024.
  • Xiao et al. [2023] Chenjun Xiao, Han Wang, Yangchen Pan, Adam White, and Martha White. The in-sample softmax for offline reinforcement learning. arXiv preprint arXiv:2302.14372, 2023.
  • Mandlekar et al. [2020] Ajay Mandlekar, Fabio Ramos, Byron Boots, Silvio Savarese, Li Fei-Fei, Animesh Garg, and Dieter Fox. Iris: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 4414–4420. IEEE, 2020.
  • Emmons et al. [2021] Scott Emmons, Benjamin Eysenbach, Ilya Kostrikov, and Sergey Levine. Rvs: What is essential for offline rl via supervised learning? arXiv preprint arXiv:2112.10751, 2021.
  • Zhuang et al. [2023] Zifeng Zhuang, Kun Lei, Jinxin Liu, Donglin Wang, and Yilang Guo. Behavior proximal policy optimization. arXiv preprint arXiv:2302.11312, 2023.
  • Lyu et al. [2022b] Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu. Mildly conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:1711–1724, 2022b.
  • Fakoor et al. [2021] Rasool Fakoor, Jonas W Mueller, Kavosh Asadi, Pratik Chaudhari, and Alexander J Smola. Continuous doubly constrained batch reinforcement learning. Advances in Neural Information Processing Systems, 34:11260–11273, 2021.
  • Wu et al. [2021] Yue Wu, Shuangfei Zhai, Nitish Srivastava, Joshua Susskind, Jian Zhang, Ruslan Salakhutdinov, and Hanlin Goh. Uncertainty weighted actor-critic for offline reinforcement learning. arXiv preprint arXiv:2105.08140, 2021.
  • Agarwal et al. [2020] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pages 104–114. PMLR, 2020.
  • An et al. [2021] Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble. Advances in neural information processing systems, 34:7436–7447, 2021.
  • Bai et al. [2022] Chenjia Bai, Lingxiao Wang, Zhuoran Yang, Zhihong Deng, Animesh Garg, Peng Liu, and Zhaoran Wang. Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. arXiv preprint arXiv:2202.11566, 2022.
  • Lee et al. [2022] Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. In Conference on Robot Learning, pages 1702–1712. PMLR, 2022.
  • Ghosh et al. [2022] Dibya Ghosh, Anurag Ajay, Pulkit Agrawal, and Sergey Levine. Offline rl policies should be trained to be adaptive. In International Conference on Machine Learning, pages 7513–7530. PMLR, 2022.
  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Bhatia et al. [2010] Nitin Bhatia et al. Survey of nearest neighbor techniques. arXiv preprint arXiv:1007.0085, 2010.
  • Yang et al. [2022] Rui Yang, Chenjia Bai, Xiaoteng Ma, Zhaoran Wang, Chongjie Zhang, and Lei Han. Rorl: Robust offline reinforcement learning via conservative smoothing. Advances in neural information processing systems, 35:23851–23866, 2022.
  • Haarnoja et al. [2018b] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b.

Appendix A Proof

Theorem 3.1 We denote by $\hat{Q}^{\pi}$ the $Q$-function converged from the $Q$-update of EPQ using the proposed penalty $\mathcal{P}_{\tau}$ in (3). Then, the expected value of $\hat{Q}^{\pi}$ underestimates the expected true policy value, i.e., $\mathbb{E}_{a\sim\pi}[\hat{Q}^{\pi}(s,a)]\leq\mathbb{E}_{a\sim\pi}[Q^{\pi}(s,a)],~\forall s\in D$, with high probability $1-\delta$ for some $\delta\in(0,1)$, if the penalizing factor $\alpha$ is sufficiently large. Furthermore, the proposed penalty reduces the average penalty for policy actions compared to the average penalty of CQL.

A.1 Proof of Theorem 3.1

The proof of Theorem 3.1 basically follows the proof of Theorem 3.2 in Kumar et al. [21], since $\mathcal{P}_{\tau}$ multiplies the penalty of CQL by the penalty adaptation factor $f_{\tau}^{\pi,\hat{\beta}}(s)$. At the $k$-th iteration, the $Q$-function is updated by equation (3), so that

Q_{k+1}(s,a)\leftarrow\hat{\mathcal{B}}^{\pi}Q_{k}(s,a)-\alpha\mathcal{P}_{\tau},~\forall s,a, \qquad (A.1)

where $\hat{\mathcal{B}}^{\pi}$ is the estimation of the true Bellman operator $\mathcal{B}^{\pi}$ based on data samples. It is known that the error between the estimated Bellman operator $\hat{\mathcal{B}}^{\pi}$ and the true Bellman operator is bounded with high probability $1-\delta$ for some $\delta\in(0,1)$ as $|(\mathcal{B}^{\pi}Q)(s,a)-(\hat{\mathcal{B}}^{\pi}Q)(s,a)|\leq\xi^{\delta}(s,a),~\forall s,a$, where $\xi^{\delta}$ is a positive constant related to the given dataset $D$, the discount factor $\gamma$, and the transition probability $P$ [21]. Then, with high probability $1-\delta$,

Q_{k+1}(s,a)\leftarrow\mathcal{B}^{\pi}Q_{k}(s,a)-\alpha\mathcal{P}_{\tau}+\xi^{\delta}(s,a),~\forall s,a. \qquad (A.2)

Now, with the state value function V(s):=\mathbb{E}_{a\sim\pi(\cdot|s)}[Q(s,a)], taking the expectation of (A.2) over a\sim\pi(\cdot|s) gives

V_{k+1}(s)=\mathbb{E}_{a\sim\pi(\cdot|s)}[Q_{k+1}(s,a)]\leq\mathcal{B}^{\pi}V_{k}(s)-\alpha\mathbb{E}_{a\sim\pi}[\mathcal{P}_{\tau}]+\mathbb{E}_{a\sim\pi}[\xi^{\delta}(s,a)]
=\mathcal{B}^{\pi}V_{k}(s)-\alpha\mathbb{E}_{a\sim\pi}\left[f_{\tau}^{\pi,\hat{\beta}}(s)\cdot\left(\frac{\pi(a|s)}{\hat{\beta}(a|s)}-1\right)\right]+\mathbb{E}_{a\sim\pi}[\xi^{\delta}(s,a)]
=\mathcal{B}^{\pi}V_{k}(s)-\alpha\Delta_{EPQ}^{\pi}(s)+\mathbb{E}_{a\sim\pi}[\xi^{\delta}(s,a)]. (A.3)

Iterating this bound and applying the fixed-point theorem, the converged value satisfies V_{\infty}(s)\leq V^{\pi}(s)+(I-\gamma P^{\pi})^{-1}\cdot\{-\alpha\Delta_{EPQ}^{\pi}(s)+\mathbb{E}_{a\sim\pi}[\xi^{\delta}(s,a)]\}, where \Delta_{EPQ}^{\pi}(s):=\mathbb{E}_{a\sim\pi}[\mathcal{P}_{\tau}] is the average penalty for policy \pi, I is the identity matrix, and P^{\pi} is the state transition matrix under policy \pi. Here, we can show that the average penalty \Delta_{EPQ}^{\pi}(s) is non-negative as follows:

\Delta_{EPQ}^{\pi}(s)=\mathbb{E}_{a\sim\pi}\left[f_{\tau}^{\pi,\hat{\beta}}(s)\cdot\left(\frac{\pi(a|s)}{\hat{\beta}(a|s)}-1\right)\right]
=f_{\tau}^{\pi,\hat{\beta}}(s)\left[\sum_{a\in\mathcal{A}}\pi(a|s)\left(\frac{\pi(a|s)}{\hat{\beta}(a|s)}-1\right)-\underbrace{\sum_{a\in\mathcal{A}}\hat{\beta}(a|s)\left(\frac{\pi(a|s)}{\hat{\beta}(a|s)}-1\right)}_{=0}\right]
=f_{\tau}^{\pi,\hat{\beta}}(s)\cdot\sum_{a\in\mathcal{A}}\frac{(\pi(a|s)-\hat{\beta}(a|s))^{2}}{\hat{\beta}(a|s)}\geq 0, (A.4)

where the equality in (A.4) holds when \pi=\hat{\beta} or f_{\tau}^{\pi,\hat{\beta}}=0. Given the upper bound V_{\infty}(s)\leq V^{\pi}(s)+(I-\gamma P^{\pi})^{-1}\cdot\{-\alpha\Delta_{EPQ}^{\pi}(s)+\mathbb{E}_{a\sim\pi}[\xi^{\delta}(s,a)]\}, choosing a penalizing constant \alpha that satisfies \alpha\geq\max_{s,a\in D}[\xi^{\delta}(s,a)]\cdot\max_{s\in D}(\Delta_{EPQ}^{\pi}(s))^{-1} yields

-\alpha\cdot\Delta_{EPQ}^{\pi}(s)+\mathbb{E}_{a\sim\pi}[\xi^{\delta}(s,a)]
\leq-\max_{s,a\in D}[\xi^{\delta}(s,a)]\cdot\underbrace{\max_{s\in D}(\Delta_{EPQ}^{\pi}(s))^{-1}\cdot\Delta_{EPQ}^{\pi}(s)}_{\geq 1}+\mathbb{E}_{a\sim\pi}[\xi^{\delta}(s,a)]
\leq-\max_{s,a\in D}[\xi^{\delta}(s,a)]+\mathbb{E}_{a\sim\pi}[\xi^{\delta}(s,a)]\leq 0,~~\forall s. (A.5)

Since I-\gamma P^{\pi} is a non-singular M-matrix and the inverse of a non-singular M-matrix is non-negative, i.e., all elements of (I-\gamma P^{\pi})^{-1} are non-negative, we obtain V_{\infty}(s)\leq V^{\pi}(s)+(I-\gamma P^{\pi})^{-1}\cdot\{-\alpha\Delta_{EPQ}^{\pi}(s)+\mathbb{E}_{a\sim\pi}[\xi^{\delta}(s,a)]\}\leq V^{\pi}(s),~\forall s. Therefore, V_{\infty} underestimates the true value function V^{\pi} if the penalizing constant \alpha satisfies \alpha\geq\max_{s,a\in D}[\xi^{\delta}(s,a)]\cdot\max_{s\in D}(\Delta_{EPQ}^{\pi}(s))^{-1}. In addition, according to [21], the average penalty of CQL for policy actions can be written as \Delta_{CQL}^{\pi}(s)=\mathbb{E}_{a\sim\pi}[\frac{\pi}{\hat{\beta}}-1]. Thus, \Delta_{EPQ}^{\pi}(s)=f_{\tau}^{\pi,\hat{\beta}}(s)\Delta_{CQL}^{\pi}(s) and f_{\tau}^{\pi,\hat{\beta}}(s)\leq 1 from the definition in (2), so 0\leq\Delta_{EPQ}^{\pi}(s)\leq\Delta_{CQL}^{\pi}(s). Moreover, if \pi=\hat{\beta}, then 0=\Delta_{EPQ}^{\hat{\beta}}(s)=\Delta_{CQL}^{\hat{\beta}}(s) by the equality condition in (A.4), which indicates that the average penalty for data actions is 0 for both EPQ and CQL. \blacksquare
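As a quick numerical illustration of (A.4), the following sketch checks on a random discrete example that the average EPQ penalty is non-negative and never exceeds the average CQL penalty when the penalty control factor is at most one; the distributions and the factor value are synthetic placeholders, not quantities from our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 5

# Synthetic policy pi and estimated behavior policy beta_hat on a discrete action set.
pi = rng.dirichlet(np.ones(n_actions))
beta_hat = rng.dirichlet(np.ones(n_actions))
f = 0.7  # placeholder penalty control factor f_tau in [0, 1]

# Average CQL penalty for policy actions: E_{a~pi}[pi/beta_hat - 1]
delta_cql = np.sum(pi * (pi / beta_hat - 1.0))

# Average EPQ penalty: f * sum_a (pi - beta_hat)^2 / beta_hat, as in (A.4)
delta_epq = f * np.sum((pi - beta_hat) ** 2 / beta_hat)

assert delta_epq >= 0.0
assert delta_epq <= delta_cql + 1e-12  # holds since f <= 1
print(delta_cql, delta_epq)
```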

Appendix B Implementation Details

In this section, we provide the implementation details of the proposed EPQ. First, we give a detailed derivation of the final Q-loss function (4) of EPQ in Section B.1. Next, we introduce a practical implementation of EPQ to compute the loss functions for the parameterized policy and Q-function in Section B.2. To compute these loss functions, we provide additional implementation details in Appendices B.3, B.4, and B.5. We conduct our experiments on a single server equipped with an Intel Xeon Gold 6336Y CPU and one NVIDIA RTX A5000 GPU, and we compare the running time of EPQ with other baseline algorithms in Section B.6. For the additional hyperparameters in the practical implementation of EPQ, we provide the detailed hyperparameter setup and additional ablation studies in Appendix C and Appendix D, respectively.

B.1 Detailed Derivation of QQ-Loss Function

In Section 3.3, the final Q-loss function with the proposed penalty \mathcal{P}_{\tau,PD}=f_{\tau}^{\pi,\hat{\beta}}(\frac{\pi}{\hat{\beta}^{Q}}-1) is given by L(Q)=\frac{1}{2}\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}^{Q}}[(Q-\{\mathcal{B}^{\pi}Q-\alpha\mathcal{P}_{\tau,PD}\})^{2}]. In this section, we provide a more detailed calculation of L(Q) to obtain (4) as follows:

L(Q)=\frac{1}{2}\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}^{Q}}\left[\left(Q-\{\mathcal{B}^{\pi}Q-\alpha\mathcal{P}_{\tau,PD}\}\right)^{2}\right]
=\frac{1}{2}\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}^{Q}}\left[\left(Q-\mathcal{B}^{\pi}Q\right)^{2}\right]+\alpha\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}^{Q}}\left[\mathcal{P}_{\tau,PD}\cdot Q\right]+C
=\frac{1}{2}\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}^{Q}}\left[\left(Q-\mathcal{B}^{\pi}Q\right)^{2}\right]+\alpha\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}^{Q}}\left[f_{\tau}^{\pi,\hat{\beta}}\left(\frac{\pi}{\hat{\beta}^{Q}}-1\right)Q\right]+C
=\frac{1}{2}\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}^{Q}}\left[\left(Q-\mathcal{B}^{\pi}Q\right)^{2}\right]+\alpha\mathbb{E}_{s,s'\sim D}\left[\int_{a\in\mathcal{A}}\hat{\beta}^{Q}f_{\tau}^{\pi,\hat{\beta}}\left(\frac{\pi}{\hat{\beta}^{Q}}-1\right)Q\,da\right]+C
=\frac{1}{2}\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}^{Q}}\left[\left(Q-\mathcal{B}^{\pi}Q\right)^{2}\right]+\alpha\mathbb{E}_{s,s'\sim D}\left[\int_{a\in\mathcal{A}}f_{\tau}^{\pi,\hat{\beta}}\left(\pi-\hat{\beta}^{Q}\right)Q\,da\right]+C
=\frac{1}{2}\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}^{Q}}\left[\left(Q-\mathcal{B}^{\pi}Q\right)^{2}\right]+\alpha\mathbb{E}_{s,s'\sim D}\left[\int_{a'\in\mathcal{A}}\pi f_{\tau}^{\pi,\hat{\beta}}Q\,da'-\int_{a\in\mathcal{A}}\hat{\beta}^{Q}f_{\tau}^{\pi,\hat{\beta}}Q\,da\right]+C
=\frac{1}{2}\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}^{Q}}\left[\left(Q-\mathcal{B}^{\pi}Q\right)^{2}\right]+\alpha\mathbb{E}_{s,s'\sim D}\left[\mathbb{E}_{a'\sim\pi}\left[f_{\tau}^{\pi,\hat{\beta}}Q\right]-\mathbb{E}_{a\sim\hat{\beta}^{Q}}\left[f_{\tau}^{\pi,\hat{\beta}}Q\right]\right]+C
=\frac{1}{2}\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}^{Q}}\left[\left(Q-\mathcal{B}^{\pi}Q\right)^{2}\right]+\alpha\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}^{Q}}\left[\mathbb{E}_{a'\sim\pi}\left[f_{\tau}^{\pi,\hat{\beta}}Q\right]-f_{\tau}^{\pi,\hat{\beta}}Q\right]+C
\overset{(*)}{=}\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta}}\left[\frac{\hat{\beta}^{Q}}{\hat{\beta}}\cdot\left\{\frac{1}{2}\left(Q-\mathcal{B}^{\pi}Q\right)^{2}+\alpha f_{\tau}^{\pi,\hat{\beta}}\cdot\left(\mathbb{E}_{a'\sim\pi}\left[Q\right]-Q\right)\right\}\right]+C
=\mathbb{E}_{s,s'\sim D,a\sim\hat{\beta},a'\sim\pi}\left[w_{s,a}^{Q}\cdot\left\{\frac{1}{2}\left(Q(s,a)-\mathcal{B}^{\pi}Q(s,a)\right)^{2}+\alpha f_{\tau}^{\pi,\hat{\beta}}(s)\left(Q(s,a')-Q(s,a)\right)\right\}\right]+C,

where C is the remaining constant term that can be ignored in the Q-update since \mathcal{B}^{\pi}Q is the fixed target value. For (*), we apply the importance sampling (IS) technique, which states that \mathbb{E}_{x\sim p}[f(x)]=\mathbb{E}_{x\sim q}[\frac{p(x)}{q(x)}f(x)] for any probability distributions p and q and any function f, where w_{s,a}^{Q}=\frac{\hat{\beta}^{Q}(a|s)}{\hat{\beta}(a|s)}=\frac{\exp(Q(s,a))}{\mathbb{E}_{a'\sim\hat{\beta}(\cdot|s)}[\exp(Q(s,a'))]} is the IS ratio between \hat{\beta}^{Q} and \hat{\beta}.

B.2 Practical Implementation for EPQ

Our implementation basically follows the setup of CQL [21]. We use a Gaussian policy \pi with a Tanh(\cdot) layer, as proposed by Haarnoja et al. [4], and parameterize the policy \pi and the Q-function with neural network parameters \phi and \theta, respectively. We then update the policy to maximize Q_{\theta} together with its entropy \mathcal{H}(\pi_{\phi})=\mathbb{E}_{\pi_{\phi}}[-\log\pi_{\phi}], following the maximum entropy principle [4] as explained in Section 3.3, to account for stochastic policies. Accordingly, we can redefine the policy loss function L(\pi) defined in (5) as the policy loss function L_{\pi}(\phi) for the policy parameter \phi, given by

L_{\pi}(\phi)=\mathbb{E}_{s\sim D,~a\sim\pi_{\phi}}[-Q_{\theta}(s,a)+\log\pi_{\phi}(a|s)]. (B.1)

For the Q-loss function in (4), we use the IS ratio w_{s,a}^{Q} in (4) to account for prioritized sampling based on \hat{\beta}^{Q}. However, \hat{\beta}^{Q} discards samples with low IS weights, which can reduce sample efficiency. To address this, we use the clipped IS weight \max(c_{\min},w_{s,a}^{Q}), where c_{\min}\in(0,1] is the IS clipping constant. This clipped IS weight multiplies only the term (Q(s,a)-\mathcal{B}^{\pi}Q(s,a))^{2} in (4), so that all data samples can be exploited for Q-learning while the proposed penalty is preserved. A detailed analysis of c_{\min} is provided in Appendix D. In addition, the optimal policy that maximizes (B.1) follows the Boltzmann distribution, proportional to \exp(Q_{\theta}(s,\cdot)). It has been proven in Kumar et al. [21] that the optimal policy satisfies \mathbb{E}_{a\sim\pi}[Q_{\theta}(s,a)]+\mathcal{H}(\pi)=\log\sum_{a\in\mathcal{A}}\exp Q_{\theta}(s,a), so we can replace the \mathbb{E}_{a'\sim\pi}[Q_{\theta}(s,a')] term in (4) with \log\sum_{a'\in\mathcal{A}}\exp Q_{\theta}(s,a'), given that \mathcal{H}(\pi) does not depend on the Q-function. The Bellman operator \mathcal{B}^{\pi} can be estimated from samples in the dataset as \mathcal{B}^{\pi}Q_{\theta}\approx r(s,a)+\gamma\mathbb{E}_{a'\sim\pi}[Q_{\bar{\theta}}(s',a')], where \bar{\theta} is the parameter of the target Q-function. The target network is updated using an exponential moving average (EMA) with coefficient \eta_{\bar{\theta}}=0.005, as proposed in the deep Q-network (DQN) [52]. Finally, by applying IS clipping and the \log\sum_{a}\exp Q replacement to the Q-loss function (4) and redefining it as the loss for the value parameter \theta, we obtain the following refined value loss function L_{Q}(\theta):

L_{Q}(\theta)=\frac{1}{2}\mathbb{E}_{s,a,s'\sim D}\big[\max(c_{\min},w_{s,a}^{Q})\cdot\left(r(s,a)+\gamma\mathbb{E}_{a'\sim\pi}[Q_{\bar{\theta}}(s',a')]-Q_{\theta}(s,a)\right)^{2}\big] (B.2)
+\alpha\mathbb{E}_{s,a\sim D}\left[w_{s,a}^{Q}f_{\tau}^{\pi,\hat{\beta}}(s)\left(\log\sum_{a'\in\mathcal{A}}\exp Q_{\theta}(s,a')-Q_{\theta}(s,a)\right)\right],

where \hat{\beta} is pre-trained by behavior cloning (BC) [18, 53] to compute f_{\tau}^{\pi,\hat{\beta}}. The parameters \phi and \theta are updated to minimize their loss functions L_{\pi}(\phi) and L_{Q}(\theta) with learning rates \eta_{\phi} and \eta_{\theta}, respectively. Detailed implementations for estimating the behavior policy \hat{\beta}, the IS weight w_{s,a}^{Q}, and \log\sum_{a}\exp Q are provided in Appendices B.3, B.4, and B.5, respectively.
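For concreteness, a minimal PyTorch-style sketch of the loss computations in (B.1) and (B.2) is given below. The module and tensor names (policy, q_net, target_q_net, f_tau, w_is) are illustrative placeholders rather than our released code, and the logsumexp term is written with a crude one-sample version of the estimator in Appendix B.5; this is a simplified sketch under those assumptions, not the full implementation.

```python
import torch

def policy_loss(policy, q_net, states):
    # (B.1): maximize Q with entropy, i.e., minimize E[-Q + log pi]
    actions, log_pi = policy.sample(states)          # reparameterized sample and its log-prob (placeholder API)
    return (-q_net(states, actions) + log_pi).mean()

def q_loss(q_net, target_q_net, policy, batch, f_tau, w_is,
           alpha=5.0, c_min=0.2, gamma=0.99):
    s, a, r, s_next = batch                          # tensors sampled from the offline dataset D
    with torch.no_grad():
        a_next, _ = policy.sample(s_next)
        target = r + gamma * target_q_net(s_next, a_next)

    # Bellman error term, weighted by the clipped IS weight max(c_min, w)
    bellman = (q_net(s, a) - target) ** 2
    td_term = 0.5 * (torch.clamp(w_is, min=c_min) * bellman).mean()

    # Exclusive penalty term: w * f_tau * (logsumexp_a Q - Q(s, a)),
    # with logsumexp replaced by a one-sample estimate using only policy actions (cf. B.7)
    a_pi, log_pi = policy.sample(s)
    lse = q_net(s, a_pi) - log_pi
    penalty = (w_is * f_tau * (lse - q_net(s, a))).mean()

    return td_term + alpha * penalty
```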

B.3 Behavior Policy Estimation Based on Variational Auto-Encoder

In Section B.2, we need an estimate of the behavior policy \beta that generated the data samples in D in order to compute the penalty adaptation factor f_{\tau}^{\pi,\hat{\beta}} in equation (2). To estimate the behavior policy \hat{\beta}, we employ the variational auto-encoder (VAE), one of the most representative variational inference methods, which approximates the underlying distribution of a large dataset via the variational lower bound [53]. In the VAE, we define an encoder model p_{\psi}(z|s,a) and a decoder model q_{\psi}(a|z,s) parameterized by \psi, where z is the latent variable whose prior distribution p(z) is the multivariate normal distribution, i.e., p(z)=N(0,I). Assuming independence among all data samples, we can derive the variational lower bound on the likelihood of \beta as proposed by Kingma and Welling [53]:

\log\beta(a|s)\geq\underbrace{\mathbb{E}_{z\sim p_{\psi}(\cdot|s,a)}[\log q_{\psi}(a|z,s)]-D_{KL}(p_{\psi}(z|s,a)\,||\,p(z))}_{\textrm{variational lower bound}},~\forall s,a\in D, (B.3)

where D_{KL}(p||q)=\mathbb{E}_{p}[\log p-\log q] is the Kullback-Leibler (KL) divergence between two distributions p and q. Since we consider a deterministic decoder q_{\psi}(z,s), the former term \mathbb{E}_{z\sim p_{\psi}(\cdot|s,a)}[\log q_{\psi}(a|z,s)] can be replaced with the negative mean squared error (MSE), i.e., \mathbb{E}_{z\sim p_{\psi}(\cdot|s,a)}[\log q_{\psi}(a|z,s)]\approx-\mathbb{E}_{z\sim p_{\psi}(\cdot|s,a)}[(q_{\psi}(z,s)-a)^{2}]. At each k-th iteration, we update the VAE parameter \psi to maximize the lower bound in equation (B.3). Then, \log\beta can be estimated using the variational lower bound in (B.3) to obtain f_{\tau}^{\pi,\hat{\beta}}. The hyperparameter setup for the VAE is provided in Table 2.

Table 2: Hyperparameter setup for VAE (layers listed from input to output)
VAE Hyperparameters
z dimension: 2 \cdot state dimension
Hidden activation function: ReLU
Encoder network p_{\psi}: (state dim. + action dim., 512), (512, 512), (512, 2 \cdot z dim.)
Decoder network q_{\psi}: (z dim. + state dim., 512), (512, 512), (512, action dim.)
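A minimal sketch of this VAE-based behavior density estimation is shown below; the layer sizes follow Table 2, while the class name, the reparameterization details, and the training loop are illustrative assumptions rather than our released code.

```python
import torch
import torch.nn as nn

class BehaviorVAE(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        z_dim = 2 * state_dim                        # latent size from Table 2
        self.enc = nn.Sequential(nn.Linear(state_dim + action_dim, 512), nn.ReLU(),
                                 nn.Linear(512, 512), nn.ReLU(),
                                 nn.Linear(512, 2 * z_dim))   # outputs mean and log-std of p_psi(z|s,a)
        self.dec = nn.Sequential(nn.Linear(z_dim + state_dim, 512), nn.ReLU(),
                                 nn.Linear(512, 512), nn.ReLU(),
                                 nn.Linear(512, action_dim))  # deterministic decoder q_psi(z,s)

    def elbo(self, s, a):
        mu, log_std = self.enc(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        std = log_std.exp()
        z = mu + std * torch.randn_like(std)         # reparameterization trick
        recon = self.dec(torch.cat([z, s], dim=-1))
        rec_term = -((recon - a) ** 2).sum(dim=-1)   # negative MSE, cf. the approximation in B.3
        kl = 0.5 * (mu ** 2 + std ** 2 - 2 * log_std - 1).sum(dim=-1)  # KL(q(z|s,a) || N(0, I))
        return (rec_term - kl).mean()                # variational lower bound (B.3)
```

Training \psi then amounts to maximizing elbo (i.e., minimizing its negative) over mini-batches from D, and the resulting lower bound serves as the estimate of \log\hat{\beta}(a|s).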

B.4 Implementation of IS Weight ws,aQw_{s,a}^{Q}

To account for the prioritized data distribution \hat{\beta}^{Q} proposed in Section 3.3, we use the importance sampling (IS) weight defined by

w_{s,a}^{Q}=\frac{\hat{\beta}^{Q}(a|s)}{\hat{\beta}(a|s)}=\frac{\exp(Q(s,a))}{\mathbb{E}_{a'\sim\hat{\beta}(\cdot|s)}[\exp(Q(s,a'))]},~\forall s,a\in D. (B.4)

Since computing \mathbb{E}_{a'\sim\hat{\beta}(\cdot|s)} exactly would require knowing the full set of actions that can occur at state s, which is unavailable, we approximately estimate the IS weight based on clustering as follows:

w_{s,a}^{Q}=\frac{\exp(Q(s,a))}{\mathbb{E}_{a'\sim\hat{\beta}(\cdot|s)}[\exp(Q(s,a'))]}\approx\frac{\exp(Q(s,a))}{\frac{1}{|\mathcal{C}_{s,a}|}\sum_{(s',a')\in\mathcal{C}_{s,a}}\exp(Q(s',a'))},~\forall s,a\in D. (B.5)

Here, 𝒞s,a\mathcal{C}_{s,a} is the cluster that contains data samples adjacent to (s,a)(s,a), defined by

\mathcal{C}_{s,a}=\{(s',a')\in D~|~\|s-s'\|_{2}\leq\epsilon\cdot\bar{d}_{\textrm{closest}}\}, (B.6)

where the cluster \mathcal{C}_{s,a} can be obtained directly with a nearest neighbor (NN) algorithm [54] available in standard Python libraries. \epsilon\cdot\bar{d}_{\textrm{closest}} is the radius of the cluster, and \bar{d}_{\textrm{closest}} is the average distance between the closest states in each task. In our implementation, we control the radius parameter \epsilon>0 to adjust the number of adjacent samples used to estimate the IS weight w_{s,a}^{Q}. In addition, using the Q-function in the IS weight makes learning unstable since the Q-function changes continuously as learning progresses. Thus, instead of the Q-function, we use the rescaled empirical return G_{t}/\zeta of each state-action pair obtained from the trajectories stored in D, where \zeta>0 is the regularizing temperature. As \zeta increases, the difference in weights between adjacent samples in the cluster decreases, so the effect of prioritization is reduced. A detailed analysis of \epsilon and \zeta is provided in Appendix D.
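The clustering-based estimate of w_{s,a}^{Q} in (B.5) can be sketched as follows; here the returns G_t/\zeta replace Q as described above, and the scikit-learn nearest-neighbor query and the exact radius handling are illustrative assumptions rather than the exact implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def is_weights(states, returns, eps, zeta):
    """Approximate w_{s,a}^Q in (B.5) with radius-based clusters over states."""
    nn = NearestNeighbors(n_neighbors=2).fit(states)
    dists, _ = nn.kneighbors(states)
    d_closest = dists[:, 1].mean()                  # average distance to the closest other state
    radius = eps * d_closest                        # cluster radius eps * d_closest, cf. (B.6)

    scores = np.exp(returns / zeta)                 # exp(G_t / zeta) replaces exp(Q)
    weights = np.empty_like(scores)
    for i, s in enumerate(states):
        # Cluster C_{s,a}: all samples whose state lies within the radius of s
        members = np.linalg.norm(states - s, axis=1) <= radius
        weights[i] = scores[i] / scores[members].mean()
    return weights
```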

B.5 Implementation of QQ-loss Function

In equation (B.2), the final Q-loss function of the proposed EPQ is given by

L_{Q}(\theta)=\frac{1}{2}\mathbb{E}_{s,a,s'\sim D}\big[\max(c_{\min},w_{s,a}^{Q})\left(r(s,a)+\gamma\mathbb{E}_{a'\sim\pi}[Q_{\bar{\theta}}(s',a')]-Q_{\theta}(s,a)\right)^{2}\big]
+\alpha\mathbb{E}_{s,a\sim D}\left[w_{s,a}^{Q}f_{\tau}^{\pi,\hat{\beta}}(s)\left(\log\sum_{a'\in\mathcal{A}}\exp Q_{\theta}(s,a')-Q_{\theta}(s,a)\right)\right].

Here, we can estimate \log\sum_{a}\exp Q(s,a) based on the method proposed in CQL [21] as follows:

\log\sum_{a}\exp Q(s,a)=\log\left(\frac{1}{2}\sum_{a}\pi(a|s)\exp(Q(s,a)-\log\pi(a|s))+\frac{1}{2}\sum_{a}\rho_{d}\exp(Q(s,a)-\log\rho_{d})\right)
\approx\log\left(\frac{1}{2N_{a}}\sum_{a_{n}\sim\pi}^{N_{a}}\exp(Q(s,a_{n})-\log\pi(a_{n}|s))+\frac{1}{2N_{a}}\sum_{a_{n}\sim\textrm{Unif}(\mathcal{A})}^{N_{a}}\exp(Q(s,a_{n})-\log\rho_{d})\right), (B.7)

where N_{a} is the number of sampled actions, \textrm{Unif}(\mathcal{A}) is the uniform distribution on \mathcal{A}, and \rho_{d} is the density of the uniform distribution.
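The sampled estimator in (B.7) can be written compactly as below; q_net and policy are the same placeholder modules as in the earlier sketch, and the action space is assumed to be [-1, 1]^{action_dim} (as with Tanh-squashed policies), so the uniform density rho_d used here is an assumption.

```python
import torch

def estimate_logsumexp_q(q_net, policy, states, action_dim, n_actions=10, rho_d=None):
    """Importance-sampled estimate of log sum_a exp Q(s, a), as in (B.7)."""
    if rho_d is None:
        rho_d = 1.0 / (2.0 ** action_dim)           # density of Unif([-1, 1]^action_dim) (assumption)
    terms = []
    for _ in range(n_actions):
        a_pi, log_pi = policy.sample(states)        # actions from the current policy
        a_unif = 2 * torch.rand(states.shape[0], action_dim, device=states.device) - 1
        terms.append(q_net(states, a_pi) - log_pi)
        terms.append(q_net(states, a_unif) - torch.log(torch.tensor(rho_d)))
    stacked = torch.stack(terms, dim=0)             # shape [2 * n_actions, batch]
    # log of the average of exp(...) over both proposal distributions
    return torch.logsumexp(stacked, dim=0) - torch.log(torch.tensor(float(2 * n_actions)))
```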

B.6 Time comparison with other offline RL methods

In this section, we compare the runtime of EPQ with other baseline algorithms, namely CQL, Onestep, IQL, MCQ, and MISA, in Table 3 below. For a fair comparison across all algorithms, we conducted experiments on the Hopper-medium task, a popular dataset for comparing computational costs [48, 55], on a single server equipped with an Intel Xeon Gold 6336Y CPU and one NVIDIA RTX A5000 GPU. We measured both the epoch runtime over 1,000 gradient steps and the score runtime, i.e., the time each algorithm takes to reach a given normalized score.

From the epoch runtime results in Table 3, we observe that EPQ takes approximately 20-30% more runtime per gradient step than the CQL baseline. Note that Onestep RL may appear to have a very short execution time compared to other algorithms, but one must account for the significantly longer pretraining time required to learn the Q-function of the behavior policy accurately. Compared to faster offline RL algorithms such as IQL and MISA, EPQ requires more runtime per step, and it exhibits a runtime similar to MCQ, another conservative Q-learning algorithm. However, according to the score runtime results in Table 3, only the proposed EPQ achieves a score of 100 points, while all other algorithms fail to reach this score. In particular, compared to MCQ, which also builds on CQL, EPQ achieves the same score with significantly less runtime. Therefore, while EPQ may consume slightly more runtime per gradient step than other algorithms, we conclude that the proposed EPQ offers substantial advantages in terms of convergence performance.

Table 3: Runtime comparison: Epoch runtime and Score runtime
epoch runtime(s) CQL Onestep IQL MCQ MISA EPQ
1,000 gradient steps 43.1 12.6 13.8 58.1 23.5 54.8
score runtime(s) CQL Onestep IQL MCQ MISA EPQ
Normalized average return
60 3540.0 252.5 1600.2 31,143.4 4,632.7 3,232.2
80 - 568.4 - 49,359.7 - 21,920.0
100 - - - - - 30,633.2

Appendix C Hyperparameter Setup

The implementation of the proposed EPQ basically follows the implementation of the CQL algorithm [21]. First, we provide the details of the shared algorithm hyperparameters in Table 4, where we compare the shared algorithm hyperparameters of CQL, the revised version of CQL (revised), and the proposed EPQ. CQL (revised) uses the same hyperparameter setup as our algorithm for Adroit tasks, since CQL (reprod.), reproduced with the author-provided hyperparameter setup, significantly underperforms the result of CQL (paper) in Table 1.

For the coefficient of the entropy term in the policy update (B.1), CQL automatically controls the entropy coefficient so that the entropy of \pi approaches the target entropy, as proposed in Haarnoja et al. [56]. We observe that while the automatic control of policy entropy is effective for MuJoCo tasks, it adversely affects performance on Adroit tasks, since a policy with low entropy can lead to significant overestimation errors there. Thus, we use a fixed entropy coefficient for Adroit tasks, as shown in Table 4. In addition, CQL controls the penalizing constant \alpha based on the Lagrangian method [21] for Adroit tasks, but we also observe that the automatic control of \alpha destabilizes training and leads to poor performance. Therefore, we use a fixed penalizing constant for Adroit tasks in Table 4 for stable learning.

In addition, in Table 5, we provide the details of the task hyperparameters related to our contributions in the proposed EPQ: the penalty control threshold \tau and the IS clipping factor c_{\min} in the Q-loss implementation (B.2), and the cluster radius \epsilon and regularizing temperature \zeta for the practical implementation of the IS weight w_{s,a}^{Q} in Section B.4. Note that \rho in Table 5 represents the log-density of the uniform distribution. For the task hyperparameters, we consider various hyperparameter setups and report the best setup for all considered tasks in Table 5. The results are based on the ablation studies provided in Section 4.3 and Appendix D.

Table 4: Algorithm hyperparameter setup of CQL, CQL (revised), and EPQ (ours)
Hyperparameters: CQL / CQL (revised, for Adroit) / EPQ
Policy learning rate \eta_{\phi}: 1e-4 / 1e-4 / 1e-4
Value function learning rate \eta_{\theta}: 3e-4 / 3e-4 / 3e-4
Soft target update coefficient \eta_{\bar{\theta}}: 0.005 / 0.005 / 0.005
Batch size: 256 / 256 / 256
Number of sampled actions N_{a}: 10 / 10 / 10
Initial behavior cloning steps: 10000 / 10000 / 10000
Gradient steps for training: 3m (0.3m for Adroit) / 0.3m / 3m (0.3m for Adroit)
Entropy coefficient: Auto / 0.5 / Auto (0.5 for Adroit)
Penalizing constant \alpha: Auto (10 for MuJoCo) / 5 or 20 / 20 for MuJoCo, 5 or 20 for Adroit, 5 or Auto for AntMaze
Discount factor \gamma: 0.99 / 0.9 or 0.95 / 0.99 (0.9 or 0.95 for Adroit)
Table 5: Task hyperparameter setup for Mujoco tasks and Adroit tasks
Mujoco Tasks τ/ρ\tau/\rho cminc_{\min} ϵ\epsilon ζ\zeta
halfcheetah-random 10 0.2 2 2
hopper-random 2 0.1 0.5 2
walker2d-random 1 0.2 2 0.5
halfcheetah-medium 10 0.2 0.5 2
hopper-medium 0.2 0.5 2 5
walker2d-medium 1 0.5 2 2
halfcheetah-medium-expert 1.0 0.2 0.5 2
hopper-medium-expert 1 0.2 0.5 2
walker2d-medium-expert 1.0 0.2 0.5 2
halfcheetah-expert 1 0.2 0.5 2
hopper-expert 1 0.2 0.5 2
walker2d-expert 0.5 0.2 2.0 2
halfcheetah-medium-replay 2 0.2 0.5 2
hopper-medium-replay 2 0.2 0.5 2
walker2d-medium-replay 0.2 0.5 1.0 2
halfcheetah-full-replay 1.5 0.2 0.5 2
hopper-full-replay 2.0 0.2 1.0 2
walker2d-full-replay 1.0 0.2 0.5 2
Adroit Tasks τ/ρ\tau/\rho cminc_{\min} ϵ\epsilon ζ\zeta
pen-human 0.05 0.5 1.0 200
door-human 0.05 0.5 0.5 200
hammer-human 0.1 0.2 5 100
relocate-human 0.2 0.2 2 10
pen-cloned 0.2 0.2 5 50
door-cloned 0.2 0.5 1 10
hammer-cloned 0.2 0.2 5 100
relocate-cloned 0.2 0.2 5 10
AntMaze Tasks τ/ρ\tau/\rho cminc_{\min} ϵ\epsilon ζ\zeta
umaze 10 0.2 2 2
umaze-diverse 10 0.2 2 2
medium-play 0.1 0.2 1 2
medium-diverse 0.1 0.2 1 2
large-play 0.1 0.2 1 2
large-diverse 0.1 0.2 1 2
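For reference, the task hyperparameters in Table 5 can be organized as a simple configuration mapping; the dictionary below is an illustrative sketch (the keys and structure are our own choice, with values copied from Table 5 for two example tasks).

```python
# Illustrative task-hyperparameter configuration (values taken from Table 5).
# Keys: tau_over_rho = penalty control threshold tau / log-density rho,
#       c_min = IS clipping factor, eps = cluster radius, zeta = temperature.
TASK_HPARAMS = {
    "hopper-medium": {"tau_over_rho": 0.2, "c_min": 0.5, "eps": 2.0, "zeta": 5.0},
    "pen-human":     {"tau_over_rho": 0.05, "c_min": 0.5, "eps": 1.0, "zeta": 200.0},
}

def get_task_hparams(task_name):
    """Look up the EPQ task hyperparameters for a given D4RL task name."""
    return TASK_HPARAMS[task_name]
```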

Appendix D Additional Ablation Studies Related to ws,aQw_{s,a}^{Q} Estimation

In this section, we provide additional ablation studies related to the estimation of the IS weight w_{s,a}^{Q} in Appendix B. For the analysis, Fig. LABEL:fig:ablappen shows the performance when the IS clipping factor c_{\min}, the cluster radius \epsilon, and the temperature \zeta are varied.

IS Clipping Factor c_{\min}: In the EPQ implementation, the IS clipping factor c_{\min} is employed to clip the IS weight w_{s,a}^{Q} and prevent the exclusion of data samples with relatively low w_{s,a}^{Q}. When c_{\min}=0, low-quality samples with low w_{s,a}^{Q} are hardly utilized because of the prioritization in Section 3.3. As c_{\min} increases, these low-quality samples are increasingly exploited. Fig. 7(c) illustrates the performance of EPQ with varying c_{\min}, and EPQ achieves the best performance when c_{\min}=0.5. This result suggests that it is more beneficial to use low-quality samples with proper priority rather than discarding them entirely.

Cluster Radius \epsilon: As explained in Appendix B.4, we can control the number of adjacent samples in the cluster through the radius \epsilon. From the results illustrated in Fig. LABEL:fig:ablappen(a), we observe that EPQ with \epsilon=2.0 performs best, and either decreasing or increasing \epsilon can significantly affect the performance, indicating that \epsilon must be chosen properly for each task so that the cluster contains the appropriate adjacent samples. If \epsilon is too small, the cluster hardly contains any adjacent samples, and if \epsilon is too large, samples that should be distinguished are aggregated into the same cluster, adversely affecting the performance.

Temperature \zeta: As proposed in Section 3.3, samples in the dataset are prioritized according to the definition of w_{s,a}^{Q}. Since samples with higher Q values are more likely to be selected for the Q-function update, the temperature \zeta controls the amount of prioritization, as explained in Appendix B.4. Increasing \zeta reduces the difference in weights between samples, putting less emphasis on prioritization. Fig. LABEL:fig:ablappen(b) shows the performance change with respect to \zeta, and the results show that the performance does not depend heavily on \zeta. From this ablation study, we conclude that the radius \epsilon has a greater influence on the performance of the Hopper-medium task than the temperature \zeta.

Appendix E Additional Performance Comparison on Adroit Tasks

For Adroit tasks, the performance of CQL (reprod.) is much lower than that of CQL (paper) in Table 1, so we additionally provide the performance of the revised version of CQL described in Appendix C. We compare the performance of EPQ with that of CQL (revised) on various Adroit tasks, and Table 6 shows the corresponding results. From the results, we can see that CQL (revised) greatly enhances the performance of CQL on Adroit tasks, but EPQ still outperforms CQL (revised), which demonstrates that the advantages of the proposed exclusive penalty and prioritized dataset carry over to the Adroit tasks.

Table 6: Performance comparison of CQL (paper), CQL (revised), and EPQ (ours) on Adroit tasks.
Task CQL (paper) CQL (revised) EPQ
pen-human 55.8 82.0±\pm6.2 83.9±\pm6.8
door-human 9.1 7.8±\pm0.5 13.2 ±\pm 2.4
hammer-human 2.1 6.4±\pm5.4 3.9±\pm5.0
relocate-human 0.4 0.1±\pm0.2 0.3±\pm0.2
pen-cloned 40.3 90.7±\pm4.8 91.8±\pm4.7
door-cloned 3.5 1.3±\pm2.2 5.8±\pm2.8
hammer-cloned 5.7 2.0±\pm1.3 22.8±\pm15.3
relocate-cloned -0.1 0.0±\pm0.0 0.1±\pm0.1
Adroit Tasks Total 116.8 190.3 221.8

Appendix F Limitations

The proposed EPQ significantly improves performance over the existing CQL baseline on various D4RL tasks, but it introduces several hyperparameters that need to be tuned: the penalty control threshold \tau, the IS clipping factor c_{\min}, the cluster radius \epsilon, and the regularizing temperature \zeta. Therefore, for the proposed EPQ to perform well, it is necessary to search over various hyperparameter setups, which may require some interaction with the environment.

Appendix G Broader Impact

In real-world situations, engaging with the environment can be costly. Particularly in high-risk contexts such as disaster scenarios, acquiring adequate data for learning can be quite challenging. Our research focuses on offline settings, and the proposed method, EPQ, holds potential for practical applications in real-life situations where interaction is not available, showing promise in addressing the challenges faced by offline RL algorithms. Consequently, our work carries several potential societal implications, although we believe that none require specific emphasis in this context.

NeurIPS Paper Checklist


  1. 1.

    Claims

  2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

  3. Answer: [Yes]

  4. Justification: The claims made in the abstract and introduction are well reflected in Section 3 Methodology and Section 4 Experiments.

  5. Guidelines:

    • The answer NA means that the abstract and introduction do not include the claims made in the paper.

    • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

    • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

    • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

  6. 2.

    Limitations

  7. Question: Does the paper discuss the limitations of the work performed by the authors?

  8. Answer: [Yes]

  9. Justification: The limitations are addressed in Appendix F Limitations.

  10. Guidelines:

    • The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

    • The authors are encouraged to create a separate "Limitations" section in their paper.

    • The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

    • The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

    • The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

    • The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

    • If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

    • While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

  11. 3.

    Theory Assumptions and Proofs

  12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

  13. Answer: [Yes]

  14. Justification: For each theoretical result, the detailed proofs and assumptions are provided in Appendix A Proof and Appendix B Implementation Details.

  15. Guidelines:

    • The answer NA means that the paper does not include theoretical results.

    • All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

    • All assumptions should be clearly stated or referenced in the statement of any theorems.

    • The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

    • Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

    • Theorems and Lemmas that the proof relies upon should be properly referenced.

  16. 4.

    Experimental Result Reproducibility

  17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

  18. Answer: [Yes]

  19. Justification: The specific environment descriptions and experimental setups including the hyperparameters can be found in Section 4 Experiments and Appendix C Hyperparameter Setup.

  20. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

    • If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

    • Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

    • While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example

      1. (a)

        If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

      2. (b)

        If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

      3. (c)

        If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

      4. (d)

        We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

  21. 5.

    Open access to data and code

  22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

  23. Answer: [Yes]

  24. Justification: The data and code for reproducing the main experimental results are included in supplemental materials.

  25. Guidelines:

    • The answer NA means that paper does not include experiments requiring code.

    • Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

    • The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

    • The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

    • At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

    • Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

  26. 6.

    Experimental Setting/Details

  27. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

  28. Answer: [Yes]

  29. Justification: The specific experimental setups including the hyperparameters can be found in Section 4 Experiments and Appendix C Hyperparameter Setup.

  30. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

    • The full details can be provided either with the code, in appendix, or as supplemental material.

  31. 7.

    Experiment Statistical Significance

  32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

  33. Answer: [Yes]

  34. Justification: The graphs included in the paper, such as Figure 6 and Figure 7 in Section 4 Experiments, demonstrate the statistical significance of the experiments.

  35. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

    • The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

    • The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

    • The assumptions made should be given (e.g., Normally distributed errors).

    • It should be clear whether the error bar is the standard deviation or the standard error of the mean.

    • It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

    • For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

    • If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

  36. 8.

    Experiments Compute Resources

  37. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

  38. Answer: [Yes]

  39. Justification: The information on computation resources is provided in Appendix B Implementation Details.

  40. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

    • The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

    • The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).

  41. 9.

    Code Of Ethics

  42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

  43. Answer: [Yes]

  44. Justification: The research conducted in the paper conforms to the NeurIPS Code of Ethics.

  45. Guidelines:

    • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.

    • If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

    • The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

  46. 10.

    Broader Impacts

  47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

  48. Answer: [Yes]

  49. Justification: The societal impacts of the proposed method are discussed in Appendix G Broader Impact.

  50. Guidelines:

    • The answer NA means that there is no societal impact of the work performed.

    • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

    • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

    • The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

    • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

    • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

  51. 11.

    Safeguards

  52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

  53. Answer: [N/A]

  54. Justification: The proposed paper does not pose such risks.

  55. Guidelines:

    • The answer NA means that the paper poses no such risks.

    • Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

    • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

    • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

  56. 12.

    Licenses for existing assets

  57. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

  58. Answer: [Yes]

  59. Justification: The baseline code and experimental data are cited both in-text and in the References section.

  60. Guidelines:

    • The answer NA means that the paper does not use existing assets.

    • The authors should cite the original paper that produced the code package or dataset.

    • The authors should state which version of the asset is used and, if possible, include a URL.

    • The name of the license (e.g., CC-BY 4.0) should be included for each asset.

    • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

    • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

    • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

    • If this information is not available online, the authors are encouraged to reach out to the asset’s creators.

  61. 13.

    New Assets

  62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

  63. Answer: [N/A]

  64. Justification: The proposed paper does not release new assets.

  65. Guidelines:

    • The answer NA means that the paper does not release new assets.

    • Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

    • The paper should discuss whether and how consent was obtained from people whose asset is used.

    • At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

  66. 14.

    Crowdsourcing and Research with Human Subjects

  67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

  68. Answer: [N/A]

  69. Justification: The proposed paper does not involve crowdsourcing nor research with human subjects.

  70. Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

    • According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

  71. 15.

    Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

  72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

  73. Answer: [N/A]

  74. Justification: The paper does not involve crowdsourcing nor research with human subjects.

  75. Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

    • We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

    • For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.