
Adversarially Trained Weighted Actor-Critic for Safe Offline Reinforcement Learning

Honghao Wei
Washington State University
[email protected] &Xiyue Peng
ShanghaiTech University
[email protected] &Arnob Ghosh
New Jersey Institute of Technology
[email protected] &Xin Liu
ShanghaiTech University
[email protected]
Abstract

We propose WSAC (Weighted Safe Actor-Critic), a novel algorithm for Safe Offline Reinforcement Learning (RL) under functional approximation, which can robustly optimize policies to improve upon an arbitrary reference policy with limited data coverage. WSAC is designed as a two-player Stackelberg game to optimize a refined objective function. The actor optimizes the policy against two adversarially trained value critics with small importance-weighted Bellman errors, which focus on scenarios where the actor’s performance is inferior to the reference policy. In theory, we demonstrate that when the actor employs a no-regret optimization oracle, WSAC achieves a number of guarantees: (i) For the first time in the safe offline RL setting, we establish that WSAC can produce a policy that outperforms any reference policy while maintaining the same level of safety, which is critical to designing a safe algorithm for offline RL. (ii) WSAC achieves the optimal statistical convergence rate of 1/\sqrt{N} to the reference policy, where N is the size of the offline dataset. (iii) We theoretically show that WSAC guarantees a safe policy improvement across a broad range of hyperparameters that control the degree of pessimism, indicating its practical robustness. Additionally, we offer a practical version of WSAC and compare it with existing state-of-the-art safe offline RL algorithms in several continuous control environments. WSAC outperforms all baselines across a range of tasks, supporting the theoretical results.

1 Introduction

Online safe reinforcement learning (RL) has found successful applications in various domains, such as autonomous driving (Isele et al., 2018), recommender systems (Chow et al., 2017), and robotics (Achiam et al., 2017). It enables the effective learning of safe policies while satisfying certain safety constraints, including collision avoidance, budget adherence, and reliability. However, collecting diverse interaction data can be extremely costly and infeasible in many real-world applications, and this challenge becomes even more critical in scenarios where risky behavior cannot be tolerated. Given the inherently risk-sensitive nature of these safety-related tasks, data collection becomes feasible only when the employed behavior policies satisfy all the safety requirements.

To overcome the limitations imposed by interactive data collection, offline RL algorithms are designed to learn a policy from an available dataset collected from historical experiences by some behavior policy, which may differ from the policy we aim to learn. A desirable property of an effective offline algorithm is the assurance of robust policy improvement (RPI), which guarantees that a learned policy is always at least as good as the baseline behavior policies (Fujimoto et al., 2019; Laroche et al., 2019; Kumar et al., 2019; Siegel et al., 2020; Chen et al., 2022a; Zhu et al., 2023; Bhardwaj et al., 2024). We extend RPI to offline safe RL as safe robust policy improvement (SRPI), which requires that the improvement be safe as well. This is particularly important in offline safe RL. For example, in autonomous driving, an expert human driver operates the vehicle to collect a diverse dataset under various road and weather conditions, serving as the behavior policy. This policy is considered both effective and safe, as it demonstrates proficient human driving behavior while adhering to all traffic laws and other safety constraints. Achieving a policy that upholds the SRPI property with such a dataset can significantly mitigate the likelihood of potential collisions and other safety concerns.

Figure 1: Comparison between WSAC and the behavior policy in the tabular case. The behavior policy is a mixture of the optimal policy and a random policy, with the mixture percentage representing the proportion of the optimal policy. The cost threshold is set to 0.1. We observe that WSAC consistently ensures a safely improved policy across various scenarios, even when the behavior policy is not safe.

In offline RL, we represent the state-action occupancy distribution of policy \pi over the dataset distribution \mu using the ratio w^{\pi}=d^{\pi}/\mu. A commonly required assumption is that the \ell_{\infty} concentrability C_{\ell_{\infty}}^{\pi} is bounded, which is defined as the infinity norm of w^{\pi} for all policies (Liu et al., 2019; Chen and Jiang, 2019; Wang et al., 2019; Liao et al., 2022; Zhang et al., 2020). A stronger assumption requires a uniform lower bound on \mu(a|s) (Xie and Jiang, 2021). However, such all-policy concentrability assumptions are difficult to satisfy in practice, particularly for offline safe RL, as they essentially require the offline dataset to have good coverage of all unsafe state-action pairs. To address the full coverage requirement, other works (Rashidinejad et al., 2021; Zhan et al., 2022; Chen and Jiang, 2022; Xie et al., 2021; Uehara and Sun, 2021) adopt conservative algorithms based on the principle of pessimism in the face of uncertainty, reducing the assumption to \ell_{\infty} concentrability of the best covered policy (or the optimal policy). Zhu et al. (2023) introduce \ell_{2} concentrability to further relax the assumption, since \ell_{\infty} concentrability is always an upper bound of \ell_{2} concentrability (see Table 1 for detailed comparisons with previous works). While provable guarantees under single-policy concentrability have been obtained for unconstrained MDPs, as Table 1 suggests, in the safe RL setting all existing studies (Hong et al., 2024; Le et al., 2019) still require coverage of all policies. Further, as Table 1 suggests, these papers do not guarantee robust safe policy improvement. Our main contributions are summarized below:

  1. We prove that our algorithm, which uses an average Bellman error, enjoys the optimal statistical rate of 1/\sqrt{N} under a partial data coverage assumption. This is the first work to achieve such a result using only single-policy \ell_{2} concentrability.

  2. We propose a novel offline safe RL algorithm, called Weighted Safe Actor-Critic (WSAC), which can robustly learn policies that improve upon any behavior policy with controlled relative pessimism. We prove that, under standard function approximation assumptions, when the actor incorporates a no-regret policy optimization oracle, WSAC outputs a policy that never degrades the performance of a reference policy (including the behavior policy) for a range of hyperparameters (defined later). This is the first work that provably demonstrates the property of SRPI in the offline safe RL setting.

  3. We point out that primal-dual-based approaches (Hong et al., 2024) may require an all-policy concentrability assumption. Thus, unlike the primal-dual-based approach, we propose a novel rectified penalty-based approach that obtains results using single-policy concentrability, which requires novel analysis techniques.

  4. Furthermore, we provide a practical implementation of WSAC following a two-timescale actor-critic framework, using adversarial frameworks similar to Cheng et al. (2022); Zhu et al. (2023), and test it on several continuous control environments in the offline safe RL benchmark (Liu et al., 2023a). WSAC outperforms all other state-of-the-art baselines, validating the safe policy improvement property.

Table 1: Comparison of algorithms for offline RL (safe RL) with function approximation. The parameters C_{\ell_{2}}^{\pi}, C_{\ell_{\infty}}^{\pi}, C_{Bellman}^{\pi} refer to different types of concentrability; it always holds that C_{\ell_{2}}^{\pi}\leq C_{\ell_{\infty}}^{\pi}, and under certain conditions C_{\ell_{2}}^{\pi}\leq C_{Bellman}^{\pi}. Detailed definitions and more discussion can be found in Section 3.3.
Algorithm | Safe RL | Coverage assumption | Policy Improvement | Suboptimality
Xie and Jiang (2021) | No | all-policy, C_{\ell_{2}}^{\pi} | Yes | \mathcal{O}(1/\sqrt{N})
Xie et al. (2021) | No | single-policy, C_{Bellman}^{\pi} | Yes | \mathcal{O}(1/\sqrt{N})
Cheng et al. (2022) | No | single-policy, C_{Bellman}^{\pi} | Yes & Robust | \mathcal{O}(1/N^{1/3})
Ozdaglar et al. (2023) | No | single-policy, C_{\ell_{\infty}}^{\pi} | No | \mathcal{O}(1/\sqrt{N})
Zhu et al. (2023) | No | single-policy, C_{\ell_{2}}^{\pi} | Yes & Robust | \mathcal{O}(1/\sqrt{N})
Le et al. (2019) | Yes | all-policy, C_{\ell_{\infty}}^{\pi} | No | \mathcal{O}(1/\sqrt{N})
Hong et al. (2024) | Yes | all-policy, C_{\ell_{2}}^{\pi} | No | \mathcal{O}(1/\sqrt{N})
Ours | Yes | single-policy, C_{\ell_{2}}^{\pi} | Yes & Robust | \mathcal{O}(1/\sqrt{N})

2 Related Work

Offline safe RL: Deep offline safe RL algorithms (Lee et al., 2022; Liu et al., 2023b; Xu et al., 2022; Chen et al., 2021; Zheng et al., 2024) have shown strong empirical performance but lack theoretical guarantees. To the best of our knowledge, the investigation of policy improvement properties in offline safe RL is relatively rare in the state-of-the-art offline RL literature. Wu et al. (2021) focus on the offline constrained multi-objective Markov Decision Process (CMOMDP) and demonstrate that an optimal policy can be learned when there is sufficient data coverage. However, although they show that CMDP problems can be formulated as CMOMDP problems, they assume a linear kernel CMOMDP in their paper, whereas our consideration extends to a more general function approximation setting. Le et al. (2019) propose a model-based primal-dual-type algorithm with deviation control for offline safe RL in the tabular setting. With prior knowledge of the slackness in Slater’s condition and a constant bound on the concentrability coefficient, an (\epsilon,\delta)-PAC error is achievable when the number of data samples N is large enough (N=\tilde{\mathcal{O}}(1/\epsilon^{2})). These assumptions make the algorithm impractical, and their computational complexity is much higher than ours. Additionally, we consider a more practical, model-free function approximation setting. In another concurrent work (Hong et al., 2024), a primal-dual critic algorithm is proposed for offline constrained RL with general function approximation. However, their algorithm requires \ell_{2} concentrability for all policies, which is not practical as discussed. The reason is that the dual-variable optimization in their primal-dual design requires an accurate estimation of all the policies used in each episode, which necessitates coverage over all policies. Moreover, they cannot guarantee the property of SRPI. Furthermore, their algorithm requires an additional offline policy evaluation (OPE) oracle for policy evaluation, making it less efficient.

3 Preliminaries

3.1 Constrained Markov Decision Process

We consider a Constrained Markov Decision Process (CMDP) \mathcal{M}, denoted by (\mathcal{S},\mathcal{A},\mathcal{P},R,C,\gamma,\rho). \mathcal{S} is the state space, \mathcal{A} is the action space, \mathcal{P}:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S}) is the transition kernel, where \Delta(\cdot) denotes a probability simplex, R:\mathcal{S}\times\mathcal{A}\rightarrow[0,1] is the reward function, C:\mathcal{S}\times\mathcal{A}\rightarrow[-1,1] is the cost function, \gamma\in[0,1) is the discount factor, and \rho:\mathcal{S}\rightarrow[0,1] is the initial state distribution. We assume \mathcal{A} is finite while allowing \mathcal{S} to be arbitrarily complex. We use \pi:\mathcal{S}\rightarrow\Delta(\mathcal{A}) to denote a stationary policy, which specifies a distribution over actions for each state. At each time, the agent observes a state s_{t}\in\mathcal{S}, takes an action a_{t}\in\mathcal{A} according to a policy \pi, and receives a reward r_{t}=R(s_{t},a_{t}) and a cost c_{t}=C(s_{t},a_{t}). The CMDP then moves to the next state s_{t+1} based on the transition kernel \mathcal{P}(\cdot|s_{t},a_{t}). Given a policy \pi, we use V_{r}^{\pi}(s) and V_{c}^{\pi}(s) to denote the expected discounted return and the expected cumulative discounted cost of \pi, starting from state s, respectively:

V_{r}^{\pi}(s):=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{t}|s_{0}=s,a_{t}\sim\pi(\cdot|s_{t})] (1)
V_{c}^{\pi}(s):=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}c_{t}|s_{0}=s,a_{t}\sim\pi(\cdot|s_{t})]. (2)

Accordingly, we also define the Q-value functions of a policy \pi for the reward and cost as

Q^{\pi}_{r}(s,a):=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{t}|(s_{0},a_{0})=(s,a),a_{t}\sim\pi(\cdot|s_{t})] (3)
Q^{\pi}_{c}(s,a):=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}c_{t}|(s_{0},a_{0})=(s,a),a_{t}\sim\pi(\cdot|s_{t})], (4)

respectively. As rewards and costs are bounded, we have 0\leq Q^{\pi}_{r}\leq\frac{1}{1-\gamma} and -\frac{1}{1-\gamma}\leq Q_{c}^{\pi}\leq\frac{1}{1-\gamma}. We let V_{\max}=\frac{1}{1-\gamma} to simplify the notation. We further write

J_{r}(\pi):=(1-\gamma)\mathbb{E}_{s\sim\rho}[V_{r}^{\pi}(s)],\quad J_{c}(\pi):=(1-\gamma)\mathbb{E}_{s\sim\rho}[V_{c}^{\pi}(s)]

to represent the normalized average reward/cost value of policy \pi. In addition, we use d^{\pi}(s,a) to denote the normalized and discounted state-action occupancy measure of the policy \pi:

d^{\pi}(s,a):=(1-\gamma)\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}\mathds{1}(s_{t}=s,a_{t}=a)|a_{t}\sim\pi(\cdot|s_{t})],

where \mathds{1}(\cdot) is the indicator function. We also use d^{\pi}(s)=\sum_{a\in\mathcal{A}}d^{\pi}(s,a) to denote the discounted state occupancy, and we use \mathbb{E}_{\pi} as a shorthand for \mathbb{E}_{(s,a)\sim d^{\pi}}[\cdot] or \mathbb{E}_{s\sim d^{\pi}}[\cdot] to denote the expectation with respect to d^{\pi}. The objective of the agent in safe RL is to find a policy such that

\pi\in\arg\max~J_{r}(\pi)\quad\text{s.t.}~J_{c}(\pi)\leq 0. (5)
Remark 3.1.

For ease of exposition, this paper exclusively focuses on a single constraint. However, it is readily extendable to accommodate multiple constraints.
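To make the quantities in (1)-(5) concrete, the following sketch estimates the normalized values J_{r}(\pi) and J_{c}(\pi) of a policy by Monte Carlo rollouts. It is a hedged illustration only: the simulator interface (env.reset(), env.step(a) returning (s', r, c, done)) and the callable policy(s) are assumptions, not part of the paper.

```python
import numpy as np

def monte_carlo_values(env, policy, gamma=0.99, n_episodes=100, horizon=1000):
    """Hedged sketch: estimate J_r(pi) and J_c(pi) by discounted rollouts,
    following the definitions of V_r^pi, V_c^pi and their (1 - gamma) normalization."""
    returns_r, returns_c = [], []
    for _ in range(n_episodes):
        s, disc_r, disc_c, g = env.reset(), 0.0, 0.0, 1.0
        for _ in range(horizon):
            s, r, c, done = env.step(policy(s))   # assumed simulator interface
            disc_r += g * r
            disc_c += g * c
            g *= gamma
            if done:
                break
        returns_r.append(disc_r)
        returns_c.append(disc_c)
    # Normalized values: J_r(pi) = (1 - gamma) E_rho[V_r^pi], likewise for J_c.
    return (1 - gamma) * np.mean(returns_r), (1 - gamma) * np.mean(returns_c)
```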

3.2 Function Approximation

In complex environments, the state space \mathcal{S} is usually very large or even infinite. We assume access to a policy class \Pi\subseteq(\mathcal{S}\rightarrow\Delta(\mathcal{A})) consisting of all candidate policies from which we can search for good policies. We also assume access to a value function class \mathcal{F}\subseteq(\mathcal{S}\times\mathcal{A}\rightarrow[0,V_{\max}]) to model the reward Q-functions, and \mathcal{G}\subseteq(\mathcal{S}\times\mathcal{A}\rightarrow[-V_{\max},V_{\max}]) to model the cost Q-functions of candidate policies. We further assume access to a function class \mathcal{W}\subseteq\{w:\mathcal{S}\times\mathcal{A}\rightarrow[0,B_{w}]\} that represents marginalized importance weights with respect to the offline data distribution. Without loss of generality, we assume that the all-one function is contained in \mathcal{W}.

For a given policy \pi\in\Pi, we denote f(s^{\prime},\pi):=\sum_{a^{\prime}}\pi(a^{\prime}|s^{\prime})f(s^{\prime},a^{\prime}) for any s^{\prime}\in\mathcal{S}. The Bellman operator \mathcal{T}_{r}^{\pi}:\mathbb{R}^{\mathcal{S}\times\mathcal{A}}\rightarrow\mathbb{R}^{\mathcal{S}\times\mathcal{A}} for the reward is defined as

(\mathcal{T}^{\pi}_{r}f)(s,a):=R(s,a)+\gamma\mathbb{E}_{\mathcal{P}(s^{\prime}|s,a)}[f(s^{\prime},\pi)],

and the Bellman operator \mathcal{T}_{c}^{\pi}:\mathbb{R}^{\mathcal{S}\times\mathcal{A}}\rightarrow\mathbb{R}^{\mathcal{S}\times\mathcal{A}} for the cost is

(\mathcal{T}^{\pi}_{c}f)(s,a):=C(s,a)+\gamma\mathbb{E}_{\mathcal{P}(s^{\prime}|s,a)}[f(s^{\prime},\pi)].

Let \|\cdot\|_{2,\mu}:=\sqrt{\mathbb{E}_{\mu}[(\cdot)^{2}]} denote the Euclidean norm weighted by the distribution \mu. We make the following standard assumptions on the representation power of the function classes, as in the offline RL setting (Xie et al., 2021; Cheng et al., 2022; Zhu et al., 2023):

Assumption 3.2 (Approximate Realizability).

Assume there exists \epsilon_{1}\geq 0 such that, for any given policy \pi\in\Pi, we have \min_{f\in\mathcal{F}}\max_{\text{admissible }\nu}\|f-\mathcal{T}_{r}^{\pi}f\|_{2,\nu}^{2}\leq\epsilon_{1} and \min_{f\in\mathcal{G}}\max_{\text{admissible }\nu}\|f-\mathcal{T}_{c}^{\pi}f\|_{2,\nu}^{2}\leq\epsilon_{1}, where \nu is the state-action distribution of any admissible policy, i.e., \nu\in\{d^{\pi},\forall\pi\in\Pi\}.

Assumption 3.2 states that for any policy \pi\in\Pi, Q_{r}^{\pi} and Q_{c}^{\pi} are approximately realizable in \mathcal{F} and \mathcal{G}, respectively. When \epsilon_{1} is small for all admissible \nu, we have f_{r}\approx Q_{r}^{\pi} and f_{c}\approx Q_{c}^{\pi}. In particular, when \epsilon_{1}=0, we have Q_{r}^{\pi}\in\mathcal{F} and Q_{c}^{\pi}\in\mathcal{G} for any policy \pi\in\Pi. Note that we do not need the Bellman completeness assumption of Cheng et al. (2022).

3.3 Offline RL

In offline RL, we assume that the available offline data \mathcal{D}=\{(s_{i},a_{i},r_{i},c_{i},s_{i}^{\prime})\}_{i=1}^{N} consists of N samples. Samples are i.i.d. (a common assumption in the unconstrained setting (Cheng et al., 2022) as well as the constrained setting (Hong et al., 2024)), and the distribution of each tuple (s,a,r,c,s^{\prime}) is specified by a distribution \mu\in\Delta(\mathcal{S}\times\mathcal{A}), which is also the discounted visitation probability of a behavior policy (also denoted by \mu for simplicity). In particular, (s,a)\sim\mu, r=R(s,a), c=C(s,a), s^{\prime}\sim\mathcal{P}(\cdot|s,a). We use a\sim\mu(\cdot|s) to denote that the action is drawn using the behavior policy, and (s,a,s^{\prime})\sim\mu to denote that (s,a)\sim\mu and s^{\prime}\sim\mathcal{P}(\cdot|s,a).

For a given policy \pi, we define the marginalized importance weights w^{\pi}(s,a):=\frac{d^{\pi}(s,a)}{\mu(s,a)}, the ratio between the discounted state-action occupancy of \pi and the data distribution \mu. This ratio can be used to measure the concentrability of the data coverage (Xie and Jiang, 2020; Zhan et al., 2022; Rashidinejad et al., 2022; Ozdaglar et al., 2023; Lee et al., 2021).

In this paper we study offline RL with access to a dataset with limited coverage. The coverage of a policy \pi in the dataset can be measured by the weighted \ell_{2} single-policy concentrability coefficient (Zhu et al., 2023; Yin and Wang, 2021; Uehara et al., 2024; Hong et al., 2024):

Definition 3.3 (\ell_{2} Concentrability).

Given a policy \pi, define C_{\ell_{2}}^{\pi}=\|w^{\pi}\|_{2,\mu}=\|d^{\pi}/\mu\|_{2,\mu}.

Remark 3.4.

The definition here is much weaker than the all-policy concentrability used in offline RL (Chen and Jiang, 2019) and safe offline RL (Le et al., 2019; Hong et al., 2024), which requires the ratio \frac{d^{\pi}(s,a)}{\mu(s,a)} to be bounded for all s\in\mathcal{S}, all a\in\mathcal{A}, and all policies \pi. In particular, the all-policy concentrability assumption essentially requires the dataset to have full coverage of all policies (nearly all state-action pairs). This requirement is often violated in practical scenarios. It is even impossible to meet in safe offline RL, because it would require collecting data from every dangerous state and action, which is clearly impractical.

In the following lemma, we compare two variants of the single-policy concentrability definition with the \ell_{2} version defined in Definition 3.3.

Lemma 1 (Restatement of Proposition 2.1 in Zhu et al. (2023)).

Define the \ell_{\infty} single-policy concentrability (Rashidinejad et al., 2021) as C_{\ell_{\infty}}^{\pi}=\|d^{\pi}/\mu\|_{\infty} and the Bellman-consistent single-policy concentrability (Xie et al., 2021) as C_{Bellman}^{\pi}=\max_{f\in\mathcal{F}}\frac{\|f-\mathcal{T}^{\pi}f\|_{2,d^{\pi}}^{2}}{\|f-\mathcal{T}^{\pi}f\|_{2,\mu}^{2}} (\mathcal{T} can be \mathcal{T}_{r} or \mathcal{T}_{c} in our setting). Then it always holds that (C_{\ell_{2}}^{\pi})^{2}\leq C_{\ell_{\infty}}^{\pi} and C_{\ell_{2}}^{\pi}\leq C_{\ell_{\infty}}^{\pi}, and there exist offline RL instances where (C_{\ell_{2}}^{\pi})^{2}\leq C_{Bellman}^{\pi} and C_{\ell_{2}}^{\pi}\leq C_{Bellman}^{\pi}.

Remark 3.5.

It is easy to observe that the \ell_{2} variant is bounded by the \ell_{\infty} variant, and by C_{Bellman}^{\pi} in some cases. There is an example (Example 1) in Zhu et al. (2023) showing that C_{\ell_{2}}^{\pi} is bounded by the constant \sqrt{2} while C_{\ell_{\infty}}^{\pi} can be arbitrarily large. When the function class \mathcal{F} is highly expressive, C_{Bellman}^{\pi} can be close to C_{\ell_{\infty}}^{\pi} and thus possibly larger than C_{\ell_{2}}^{\pi}. Intuitively, bounding C^{\pi}_{\ell_{2}} only requires \mathbb{E}_{d^{\pi}}[w^{\pi}(s,a)] to be bounded, whereas the \ell_{\infty} concentrability bound requires w^{\pi}(s,a) to be bounded for all (s,a).
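To make the distinction concrete, the following toy computation is a hedged illustration (the occupancy and data distributions are arbitrary numbers, not taken from the paper) of how the two coefficients compare for a single policy on a handful of state-action pairs.

```python
import numpy as np

# Toy illustration (assumed numbers): compare the l2 and l_inf concentrability
# coefficients of one policy over 4 state-action pairs.
d_pi = np.array([0.70, 0.28, 0.01, 0.01])   # discounted occupancy d^pi of the target policy
mu   = np.array([0.50, 0.25, 0.20, 0.05])   # offline data distribution mu

w = d_pi / mu                               # marginalized importance weights w^pi

C_l2  = np.sqrt(np.sum(mu * w ** 2))        # C_{l2}^pi   = ||w^pi||_{2,mu}
C_inf = np.max(w)                           # C_{linf}^pi = ||w^pi||_inf

print(f"C_l2 = {C_l2:.3f}, C_inf = {C_inf:.3f}")   # here C_l2 < C_inf, as Lemma 1 predicts
```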

Given the definition of the concentrability, we make the following assumption on the weight function class 𝒲\mathcal{W} and a single-policy realizability:

Assumption 3.6 (Boundedness of \mathcal{W} in \ell_{2} norm).

For all w\in\mathcal{W}, assume that \|w\|_{2,\mu}\leq C_{\ell_{2}}^{*}.

Assumption 3.7 (Single-policy realizability of w^{\pi}).

For some policy \pi that we would like to compete with, assume that w^{\pi}\in\mathcal{W}.

In this paper, we study robust policy improvement over an arbitrary reference policy, so we assume that we are provided with a reference policy \pi_{\text{ref}}. Note that in many applications (e.g., scheduling, networking) we indeed have a reference policy, and when applying a more sophisticated RL policy we want it to do better while remaining safe. This is one of the main motivations behind this assumption.

Assumption 3.8 (Reference Policy).

We assume access to a reference policy \pi_{\text{ref}}\in\Pi, which can be queried at any state.

In many applications such as networking, scheduling, and control problems, good enough reference policies already exist. In these cases, a robust and safe policy improvement over these reference policies has practical value. If \pi_{\text{ref}} is not provided, we can simply run behavior cloning on the offline data to accurately extract the behavior policy and use it as \pi_{\text{ref}}, as long as the offline dataset is large enough. More discussion can be found in Section C in the Appendix.
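As a minimal sketch of this extraction step (an illustration under assumed data access, not the procedure analyzed in the Appendix), a count-based behavior-cloning estimate of \mu(a|s) for finite actions could look as follows; for continuous states one would instead fit a parametric policy by maximum likelihood.

```python
import numpy as np
from collections import defaultdict

def behavior_cloning_tabular(dataset, n_actions, smoothing=1e-3):
    """Hedged sketch: estimate the behavior policy mu(a|s) by empirical action
    counts over the offline tuples (s, a, r, c, s'). States are assumed hashable
    and actions integer-indexed; `smoothing` keeps unseen actions at a small
    but nonzero probability."""
    counts = defaultdict(lambda: np.full(n_actions, smoothing))
    for s, a, _r, _c, _s_next in dataset:
        counts[s][a] += 1.0

    def pi_ref(s):
        probs = counts[s]            # unseen states fall back to a uniform policy
        return probs / probs.sum()

    return pi_ref

# Usage sketch: pi_ref = behavior_cloning_tabular(D, n_actions=4); pi_ref(s0)
```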

4 Actor-Critic with Importance Weighted Bellman Error

Our algorithm design builds upon the constrained actor-critic method, in which we iteratively optimize a policy and improve the policy based on the evaluation of reward and cost. Consider the following actor-critic approach for solving the optimization problem (5):

Actor: \hat{\pi}^{*}\in\arg\max_{\pi\in\Pi}f_{r}^{\pi}(s_{0},\pi)\quad\text{s.t.}\quad f_{c}^{\pi}(s_{0},\pi)\leq 0
Critic: f_{r}^{\pi}\in\arg\min_{f\in\mathcal{F}}\mathbb{E}_{\mu}[((f-\mathcal{T}_{r}f)(s,a))^{2}],\quad f_{c}^{\pi}\in\arg\min_{f\in\mathcal{G}}\mathbb{E}_{\mu}[((f-\mathcal{T}_{c}f)(s,a))^{2}],

where we assume that s_{0} is a fixed initial state, and f_{r}(s,\pi)=\sum_{a\in\mathcal{A}}\pi(a|s)f_{r}(s,a), f_{c}(s,\pi)=\sum_{a\in\mathcal{A}}\pi(a|s)f_{c}(s,a). The policy is optimized by maximizing the reward Q-function f_{r} while ensuring that f_{c} satisfies the constraint, and the two functions are trained by minimizing the Bellman error. However, this formulation has several disadvantages. 1) It cannot handle insufficient data coverage, which may fail to provide an accurate estimation of the policy for unseen states and actions. 2) It cannot guarantee robust policy improvement. 3) The actor training step is computationally intractable, especially when the policy space is extremely large.

To address the insufficient data coverage issue, as mentioned in Xie et al. (2021), the critic can perform a Bellman-consistent pessimistic evaluation of \pi, selecting the most pessimistic function that approximately satisfies the Bellman equation; this is called absolute pessimism. Later, Cheng et al. (2022) showed that, instead of absolute pessimism, a relative pessimism approach that competes against the behavior policy can obtain a robust improvement over it. However, this kind of approach only achieves a suboptimal statistical rate of 1/N^{1/3} and fails to reach the optimal rate of 1/\sqrt{N}; a weighted average Bellman error (Uehara et al., 2020; Xie and Jiang, 2020; Zhu et al., 2023) is one possible remedy for improving the rate. We remark that this discussion concerns traditional unconstrained offline RL. Regarding safety, no existing efficient algorithm in safe offline RL has theoretically demonstrated the property of robust policy improvement with the optimal statistical rate.

Can primal-dual-based approaches achieve results using only single-policy coverability?: The most commonly used approach for addressing safe RL problems is primal-dual optimization (Efroni et al., 2020; Altman, 1999). As shown in the current offline safe RL literature (Hong et al., 2024; Le et al., 2019), the policy can be optimized by maximizing a new unconstrained "reward" Q-function f_{r}^{\pi}(s_{0},\pi)-\lambda f_{c}^{\pi}(s_{0},\pi), where \lambda is a dual variable. The dual variable can then be tuned by taking gradient descent steps. As discussed in the introduction, all these approaches require all-policy concentrability, which is not practical, especially for safe RL. An important question is whether the all-policy concentrability assumption can be relaxed. Note that the primal-dual algorithm relies on solving the min-max problem \min_{\lambda}\max_{\pi}f_{r}^{\pi}(s_{0},\pi)-\lambda f_{c}^{\pi}(s_{0},\pi). A recent result (Cui and Du, 2022) shows that the single-policy concentrability assumption is not enough for offline min-max games. Hence, we conjecture that the all-policy concentrability assumption cannot be relaxed using the primal-dual method. Intuitively, primal-dual-based methods (Hong et al., 2024) rely on bounding the regret in the dual domain \sum_{k}(\lambda_{k}-\lambda^{*})(f_{c}^{\pi_{k}}-0); hence, all the policies \{\pi_{k}\}_{k=1}^{K} encountered throughout the iterations must be supported by the dataset in order to evaluate the dual value \lambda^{*}(f_{c}^{\pi_{k}}-0), where \lambda^{*} is the optimal dual variable.

Our novelty: In contrast, we propose an aggression-limited objective function f_{r}(s_{0},\pi)-\lambda\cdot\{f_{c}(s_{0},\pi)\}_{+} to control aggressive policies, where \{\cdot\}_{+}:=\max\{\cdot,0\}. The high-level intuition behind this aggression-limited objective function is that, by appropriately selecting \lambda (usually large enough), we penalize all policies that are not safe. As a result, the policy that maximizes the objective function is the optimal safe policy. This formulation is fundamentally different from the traditional primal-dual approach, as it does not require dual-variable tuning and thus does not require all-policy concentrability. In particular, we only need to bound the regret in the primal domain, which can be done as long as the reference policy is covered by the dataset, similar to the unconstrained setup.

Combining all the previous ideas together provides the design of our main algorithm named WSAC (Weighted Safe Actor-Critic). In Section 5, we will provide theoretical guarantees of WSAC and discuss its advantages over existing approaches in offline safe RL. WSAC aims to solve the following optimization problem:

\hat{\pi}^{*}\in\underset{\pi\in\Pi}{\arg\max}~\mathcal{L}_{\mu}(\pi,f^{\pi}_{r})-\lambda\{\mathcal{L}_{\mu}(\pi,f^{\pi}_{c})\}_{+} (6)
\text{s.t.}\quad f^{\pi}_{r}\in\underset{f_{r}\in\mathcal{F}}{\arg\min}~\mathcal{L}_{\mu}(\pi,f_{r})+\beta\mathcal{E}_{\mu}(\pi,f_{r}),\quad f^{\pi}_{c}\in\underset{f_{c}\in\mathcal{G}}{\arg\min}~-\lambda\mathcal{L}_{\mu}(\pi,f_{c})+\beta\hat{\mathcal{E}}_{\mu}(\pi,f_{c}),

where \mathcal{L}_{\mu}(\pi,f):=\mathbb{E}_{\mu}[f(s,\pi)-f(s,a)], \mathcal{E}_{\mu}(\pi,f):=\max_{w\in\mathcal{W}}|\mathbb{E}_{\mu}[w(s,a)((f-\mathcal{T}^{\pi}_{r}f)(s,a))]|, and \hat{\mathcal{E}}_{\mu}(\pi,f):=\max_{w\in\mathcal{W}}|\mathbb{E}_{\mu}[w(s,a)((f-\mathcal{T}^{\pi}_{c}f)(s,a))]|. This formulation can also be viewed as a Stackelberg game (Von Stackelberg, 2010) or a bilevel optimization problem. We penalize the objective function only when the approximate cost Q-function f_{c}^{\pi} of the policy \pi is more perilous than that of the behavior policy (f_{c}^{\pi}(s,\pi)\geq f_{c}^{\pi}(s,a)), forcing our policy to be as safe as the behavior policy. The maximization over w in training the two critics ensures that the Bellman error is small when averaged over the measure \mu\cdot w for any w\in\mathcal{W}, which turns out to be sufficient to control the suboptimality of the learned policy.

In the following theorem, we show that under Assumption 3.2 with \epsilon_{1}=0, the solution of the optimization problem (6) is no worse than the behavior policy \mu in both performance and safety for any \beta\geq 0 and \lambda>0.

Theorem 4.1.

Assume that Assumption 3.2 holds with \epsilon_{1}=0 and the behavior policy \mu\in\Pi. Then for any \beta\geq 0,\lambda>0, we have J_{r}(\hat{\pi}^{*})\geq J_{r}(\mu) and \{J_{c}(\hat{\pi}^{*})\}_{+}\leq\{J_{c}(\mu)\}_{+}+\frac{1}{\lambda}.

The result in Theorem 4.1 shows that by selecting \lambda large enough, for any \beta\geq 0, the solution can achieve better performance than the behavior policy while maintaining safety that is arbitrarily close to that of the behavior policy. The theorem validates the design of our framework, which has the potential to achieve a robust safe improvement.

In the next section, we will introduce our main algorithm WSAC and provide its theoretical guarantees.

5 Theoretical Analysis of WSAC

5.1 Main Algorithm

In this section, we present the theoretical version of our new model-free offline safe RL algorithm WSAC. Since we only have access to a dataset \mathcal{D} instead of the data distribution, WSAC solves an empirical version of (6):

\hat{\pi}\in\underset{\pi\in\Pi}{\arg\max}~\mathcal{L}_{\mathcal{D}}(\pi,f^{\pi}_{r})-\lambda\{\mathcal{L}_{\mathcal{D}}(\pi,f^{\pi}_{c})\}_{+} (7)
\text{s.t.}\quad f^{\pi}_{r}\in\underset{f_{r}\in\mathcal{F}}{\arg\min}~\mathcal{L}_{\mathcal{D}}(\pi,f_{r})+\beta\mathcal{E}_{\mathcal{D}}(\pi,f_{r}),\quad f^{\pi}_{c}\in\underset{f_{c}\in\mathcal{G}}{\arg\min}~-\lambda\mathcal{L}_{\mathcal{D}}(\pi,f_{c})+\beta\hat{\mathcal{E}}_{\mathcal{D}}(\pi,f_{c}),

where

\mathcal{L}_{\mathcal{D}}(\pi,f):=\mathbb{E}_{\mathcal{D}}[f(s,\pi)-f(s,a)] (8)
\mathcal{E}_{\mathcal{D}}(\pi,f):=\max_{w\in\mathcal{W}}|\mathbb{E}_{\mathcal{D}}[w(s,a)(f(s,a)-r-\gamma f(s^{\prime},\pi))]|
\hat{\mathcal{E}}_{\mathcal{D}}(\pi,f):=\max_{w\in\mathcal{W}}|\mathbb{E}_{\mathcal{D}}[w(s,a)(f(s,a)-c-\gamma f(s^{\prime},\pi))]|.
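To make these empirical quantities concrete, the following sketch is a hedged tabular illustration (assuming discrete states and actions, a critic given as an [S, A] table, and a finite weight class W given as a list of tables; the names are not from the paper) of evaluating \mathcal{L}_{\mathcal{D}} and the weighted average Bellman error \mathcal{E}_{\mathcal{D}} (or \hat{\mathcal{E}}_{\mathcal{D}} when is_cost=True) on a minibatch.

```python
import numpy as np

def empirical_losses(batch, f, pi, W, gamma=0.99, is_cost=False):
    """Hedged sketch of the empirical quantities in (8) for the tabular case.

    batch: dict of index arrays "s", "a", "s_next" and value arrays "r", "c"
    f:     np.ndarray of shape [S, A], a candidate critic
    pi:    np.ndarray of shape [S, A], pi[s] is a distribution over actions
    W:     list of np.ndarrays of shape [S, A], the weight function class
    """
    s, a, s_next = batch["s"], batch["a"], batch["s_next"]
    target = batch["c"] if is_cost else batch["r"]

    f_sa    = f[s, a]                                  # f(s, a)
    f_s_pi  = (pi[s] * f[s]).sum(axis=1)               # f(s, pi)
    f_sp_pi = (pi[s_next] * f[s_next]).sum(axis=1)     # f(s', pi)

    L_D = np.mean(f_s_pi - f_sa)                       # L_D(pi, f)

    td  = f_sa - target - gamma * f_sp_pi              # empirical Bellman residual
    E_D = max(abs(np.mean(w[s, a] * td)) for w in W)   # weighted average Bellman error
    return L_D, E_D
```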

As shown in Algorithm 1, at each iteration WSAC selects f_{r}^{k} maximally pessimistic and f_{c}^{k} maximally optimistic for the current policy \pi_{k}, with a weighted regularization on the estimated Bellman error for reward and cost, respectively (Lines 4 and 6), to address the worst cases within a reasonable range. In order to achieve a safe robust policy improvement, the actor then applies a no-regret policy optimization oracle to update the policy \pi_{k+1} by optimizing the aggression-limited objective function compared with the reference policy (Line 7): f_{r}^{k}(s,a)-\lambda\{f_{c}^{k}(s,a)-f_{c}^{k}(s,\pi_{\text{ref}})\}_{+}. Our algorithm is computationally efficient and tractable compared with existing approaches (Hong et al., 2024; Le et al., 2019), since we do not need another inner loop for optimizing the dual variable with an additional online algorithm or an offline policy evaluation oracle.

The policy improvement process relies on a no-regret policy optimization oracle, a technique commonly employed in the offline RL literature (Zhu et al., 2023; Cheng et al., 2022; Hong et al., 2024). Extensive literature exists on such methodologies. For instance, approaches like soft policy iteration (Pirotta et al., 2013) and algorithms based on natural policy gradients (Kakade, 2001; Agarwal et al., 2021) can function as effective no-regret policy optimization oracles. We now formally define the oracle:

Definition 5.1 (No-regret policy optimization oracle).

An algorithm PO is called a no-regret policy optimization oracle if, for any sequence of functions f^{1},\dots,f^{K} with f^{k}:\mathcal{S}\times\mathcal{A}\rightarrow[0,V_{\max}] for all k\in[K], the policies \pi_{1},\dots,\pi_{K} produced by the oracle PO satisfy, for any policy \pi\in\Pi:

\epsilon_{opt}^{\pi}\triangleq\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{\pi}[f^{k}(s,\pi)-f^{k}(s,\pi_{k})]=o(1) (9)

There indeed exist many methods that can serve as the no-regret oracle, for example, the mirror-descent approach (Geist et al., 2019) or the natural policy gradient approach (Kakade, 2001) of the form \pi_{k+1}(a|s)\propto\pi_{k}(a|s)\exp(\eta f^{k}(s,a)) with \eta=\sqrt{\frac{\log|\mathcal{A}|}{2V^{2}_{\max}K}} (Even-Dar et al., 2009; Agarwal et al., 2021). In the following, we define \epsilon_{opt}^{\pi} as the error generated by the oracle PO when f_{r}^{k}(s,a)-\lambda\{f_{c}^{k}(s,a)-f_{c}^{k}(s,\pi)\}_{+} is taken as the sequence of functions in Definition 5.1; then we have the following guarantee.
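For concreteness, a tabular rendering of this exponentiated (natural policy gradient) update might look as follows; this is a hedged sketch, and the shapes and names (pi, f_k of shape [S, A]) are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def npg_update(pi, f_k, eta):
    """Hedged sketch of the no-regret update pi_{k+1}(a|s) proportional to
    pi_k(a|s) * exp(eta * f^k(s,a)), for a tabular policy pi and a critic
    table f_k, both of shape [S, A]."""
    logits = np.log(pi + 1e-12) + eta * f_k
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# With eta = sqrt(log|A| / (2 * Vmax**2 * K)) and f_k taken to be the
# aggression-limited critic f_r^k(s,a) - lambda * max(f_c^k(s,a) - f_c^k(s, pi_ref), 0),
# this would instantiate the oracle PO used in Algorithm 1.
```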

Lemma 2.

Applying a no-regret oracle PO for K episodes with (f_{r}^{k}(s,a)-\lambda\{f^{k}_{c}(s,a)-f^{k}_{c}(s,\pi)\}_{+}) for an arbitrary policy \pi guarantees

\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{\pi}[f_{r}^{k}(s,\pi)-f_{r}^{k}(s,\pi_{k})]\leq\epsilon_{opt}^{\pi} (10)
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{\pi}[\{f_{c}^{k}(s,\pi_{k})-f_{c}^{k}(s,\pi)\}_{+}]\leq\epsilon_{opt}^{\pi}+\frac{V_{\max}}{\lambda}. (11)

Lemma 2 establishes that the policies output by PO under the aggression-limited "reward" enjoy strong guarantees on both reward and cost when \lambda is large enough, and are comparable with any competitor policy. This property is critical for achieving the performance guarantee of our algorithm and the safe and robust policy improvement. The detailed proof is deferred to Appendix B.2 due to the page limit.

Algorithm 1 Weighted Safe Actor-Critic (WSAC)
1:Input: Batch data \mathcal{D}, coefficients \beta,\lambda, value function classes \mathcal{F},\mathcal{G}, importance weight function class \mathcal{W}, a randomly initialized policy \pi_{1}, any reference policy \pi_{\text{ref}}, and a no-regret policy optimization oracle PO (Definition 5.1).
2:for k=1,2,\dots,K do
3:     Obtain the reward state-action value function estimate for \pi_{k}:
4:     f_{r}^{k}\leftarrow\arg\min_{f_{r}\in\mathcal{F}}\mathcal{L}_{\mathcal{D}}(\pi_{k},f_{r})+\beta\mathcal{E}_{\mathcal{D}}(\pi_{k},f_{r})
5:     Obtain the cost state-action value function estimate for \pi_{k}:
6:     f_{c}^{k}\leftarrow\arg\min_{f_{c}\in\mathcal{G}}-\lambda\mathcal{L}_{\mathcal{D}}(\pi_{k},f_{c})+\beta\hat{\mathcal{E}}_{\mathcal{D}}(\pi_{k},f_{c})
7:     Update policy: \pi_{k+1}\leftarrow\text{\bf PO}(\pi_{k},f_{r}^{k}(s,a)-\lambda\{f_{c}^{k}(s,a)-f_{c}^{k}(s,\pi_{\text{ref}})\}_{+},\mathcal{D}). // \mathcal{L}_{\mathcal{D}},\mathcal{E}_{\mathcal{D}},\hat{\mathcal{E}}_{\mathcal{D}} are defined in (8)
8:end for
9:Output: \bar{\pi}=\text{Unif}(\pi_{1},\dots,\pi_{K}). // Uniformly mix \pi_{1},\dots,\pi_{K}

5.2 Theoretical Guarantees

We are now ready to provide the theoretical guarantees of WSAC Algorithm 1. The complete proof is deferred to Appendix B.3.

Theorem 5.2 (Main Theorem).

Under Assumptions 3.2 and 3.6, let the reference policy \pi_{\text{ref}}\in\Pi in Algorithm 1 be any policy satisfying Assumption 3.7. Then with probability at least 1-\delta,

J_{r}(\pi_{\text{ref}})-J_{r}(\bar{\pi})\leq\mathcal{O}\Big{(}\epsilon_{stat}+C^{*}_{\ell_{2}}\sqrt{\epsilon_{1}}\Big{)}+\epsilon_{opt}^{\pi_{\text{ref}}} (12)
J_{c}(\bar{\pi})-J_{c}(\pi_{\text{ref}})\leq\mathcal{O}\Big{(}\epsilon_{stat}+C^{*}_{\ell_{2}}\sqrt{\epsilon_{1}}\Big{)}+\epsilon_{opt}^{\pi_{\text{ref}}}+\frac{V_{\max}}{\lambda}, (13)

where \epsilon_{stat}:=V_{\max}C^{*}_{\ell_{2}}\sqrt{\frac{\log(|\mathcal{F}||\Pi||W|/\delta)}{N}}+\frac{V_{\max}B_{w}\log(|\mathcal{F}||\Pi||W|/\delta)}{N}, and \bar{\pi} is the policy returned by Algorithm 1 with \beta=2 and \pi_{\text{ref}} as input.

Remark 5.3.

When \epsilon_{1}=0, i.e., there is no model misspecification (the true value functions belong to the function classes used to approximate them, so the function classes are rich enough), and \pi_{\text{ref}} is the optimal policy, the results in Theorem 5.2 achieve the optimal statistical rate of \frac{1}{\sqrt{N}} (for large K), which matches the best existing results. Our algorithm is both statistically optimal and computationally efficient with only a single-policy assumption, rather than relying on the much stronger assumption of all-policy concentrability (Hong et al., 2024; Le et al., 2019). Hence, if the behavior policy or the reference policy is safe, our result indicates that the policy returned by our algorithm will also be (nearly) safe. Such a guarantee was missing in the existing literature.

Remark 5.4.

We also do not need a completeness assumption, which requires that for any f\in\mathcal{F}\text{ or }\mathcal{G} and \pi\in\Pi, it approximately holds that \mathcal{T}_{r}f\in\mathcal{F} and \mathcal{T}_{c}f\in\mathcal{G}, as required in Xie et al. (2021); Chen et al. (2022b). They need this assumption to address over-estimation issues caused by the squared \ell_{2} Bellman error, but our algorithm avoids this strong assumption by using a weighted Bellman error, which is a simple and unbiased estimator.

Remark 5.5.

Our algorithm can compete with any reference policy \pi_{\text{ref}}\in\Pi as long as w^{\pi_{\text{ref}}}=d^{\pi_{\text{ref}}}/\mu is contained in \mathcal{W}. The importance ratio of the behavior policy is w^{\mu}=d^{\mu}/\mu=\mu/\mu=1, which always satisfies this condition, implying that our algorithm achieves a safe robust policy improvement (Theorem 5.6, discussed below).

5.3 A Safe Robust Policy Improvement

Robust policy improvement (RPI) (Cheng et al., 2022; Zhu et al., 2023; Bhardwaj et al., 2024) refers to the property that an offline RL algorithm can learn to improve over the behavior policy using a wide range of hyperparameters. In this paper, we introduce the property of safe robust policy improvement (SRPI), which requires the offline algorithm to improve over the behavior policy in both return and safety, using a wide range of hyperparameters. In the following Theorem 5.6, we show that as long as the hyperparameter \beta=o(\sqrt{N}), our algorithm can, with high probability, produce a policy with vanishing suboptimality compared to the behavior policy.

Theorem 5.6 (SRPI).

Under Assumptions 3.2 and 3.6, with probability at least 1-\delta,

J_{r}(\mu)-J_{r}(\bar{\pi})\leq\mathcal{O}\Big{(}\epsilon_{stat}^{\pi}+C^{*}_{\ell_{2}}\sqrt{\epsilon_{1}}\Big{)}+\epsilon_{opt}^{\mu} (14)
J_{c}(\bar{\pi})-J_{c}(\mu)\leq\mathcal{O}\Big{(}\epsilon_{stat}^{\pi}+C^{*}_{\ell_{2}}\sqrt{\epsilon_{1}}\Big{)}+\epsilon_{opt}^{\mu}+\frac{V_{\max}}{\lambda}, (15)

where \epsilon_{stat}:=V_{\max}C^{*}_{\ell_{2}}\sqrt{\frac{\log(|\mathcal{F}||\Pi||W|/\delta)}{N}}+\frac{V_{\max}B_{w}\log(|\mathcal{F}||\Pi||W|/\delta)}{N}, and \bar{\pi} is the policy returned by Algorithm 1 with \beta\geq 0 and \mu as input.

The detailed proofs are deferred to Appendix B.4.

6 Experiments

Table 2: The normalized reward and cost of WSAC and other baselines. The Average row shows the average performance across the environments. The cost threshold is 1. Gray: unsafe agent whose normalized cost is greater than 1. Blue: safe agent with the best performance.
Environment | BC | Safe-BC | BCQL | BEARL | CPQ | COptiDICE | WSAC
(each cell: Reward \uparrow / Cost \downarrow)
BallCircle | 0.70 / 0.95 | 0.61 / 0.49 | 0.73 / 0.82 | 0.80 / 1.23 | 0.62 / 0.76 | 0.71 / 1.13 | 0.75 / 0.27
CarCircle | 0.57 / 1.43 | 0.57 / 0.65 | 0.79 / 1.19 | 0.84 / 1.87 | 0.67 / 0.28 | 0.49 / 1.52 | 0.68 / 0.59
PointButton | 0.26 / 1.75 | 0.12 / 0.69 | 0.36 / 1.76 | 0.32 / 1.71 | 0.43 / 3.10 | 0.15 / 1.92 | 0.13 / 0.67
PointPush | 0.13 / 0.67 | 0.20 / 1.35 | 0.16 / 1.01 | 0.12 / 0.90 | -0.01 / 2.39 | 0.07 / 1.18 | 0.07 / 0.52
Average | 0.42 / 1.20 | 0.38 / 0.80 | 0.51 / 1.12 | 0.52 / 1.43 | 0.36 / 1.63 | 0.36 / 1.44 | 0.41 / 0.51

6.1 WSAC-Practical Implementation

We introduce a deep RL implementation of WSAC in Algorithm 2 (in the Appendix), following the key structure of its theoretical version (Algorithm 1). The reward and cost Q-functions f_{r},f_{c} and the policy network \pi are all parameterized by neural networks. The critic losses (line 4) l_{reward}(f_{r}) and l_{cost}(f_{c}) are computed on the minibatch following the principles of Algorithm 1. Optimizing the actor aims to approximate a no-regret optimization oracle; we use a gradient-based update on the actor loss (line 5) l_{actor}(\pi). In the implementation we use the adaptive gradient descent algorithm ADAM (Kingma and Ba, 2015) for updating the two critic networks and the actor network. The algorithm follows standard two-timescale first-order methods (Fujimoto et al., 2018; Haarnoja et al., 2018), with a fast learning rate \eta_{fast} for updating the critic networks and a slow learning rate \eta_{slow} for updating the actor.
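The following is a minimal sketch of how such minibatch losses could be assembled, assuming discrete actions; the names (fr, fc, actor, w_net, pi_ref), the batch layout, and the use of a single learned weight network w_net in place of the maximization over \mathcal{W} are illustrative assumptions rather than the released implementation.

```python
import torch

def wsac_losses(fr, fc, actor, w_net, batch, pi_ref, lam=10.0, beta=2.0, gamma=0.99):
    """Hedged sketch of WSAC-style minibatch losses for discrete actions.
    fr(s), fc(s) return per-action value tensors of shape [B, A];
    pi_ref(s) returns reference-policy action probabilities [B, A];
    w_net(s, a) returns per-sample importance weights [B]."""
    s, a, r, c, s2 = batch                                # offline minibatch tensors
    pi, pi2 = torch.softmax(actor(s), -1), torch.softmax(actor(s2), -1)

    def L(f):                                             # empirical L_D(pi_k, f)
        return ((pi.detach() * f(s)).sum(-1)
                - f(s).gather(-1, a.unsqueeze(-1)).squeeze(-1)).mean()

    def E(f, target):                                     # weighted avg Bellman error
        td = (f(s).gather(-1, a.unsqueeze(-1)).squeeze(-1)
              - target - gamma * (pi2.detach() * f(s2)).sum(-1))
        return (w_net(s, a) * td).mean().abs()

    l_reward = L(fr) + beta * E(fr, r)                    # critic loss for f_r (cf. Line 4)
    l_cost   = -lam * L(fc) + beta * E(fc, c)             # critic loss for f_c (cf. Line 6)

    q_r = (pi * fr(s).detach()).sum(-1)
    gap = (pi * fc(s).detach()).sum(-1) - (pi_ref(s) * fc(s).detach()).sum(-1)
    l_actor = -(q_r - lam * torch.clamp(gap, min=0.0)).mean()   # aggression-limited objective
    return l_reward, l_cost, l_actor
```

In a full implementation, w_net would itself be updated to (approximately) maximize the weighted Bellman error, and the two critic losses would be stepped with the fast ADAM optimizer while the actor loss uses the slow one, matching the two-timescale structure described above.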

6.2 Simulations

We present a scalable deep RL version of WSAC in Algorithm 2, following the principles of Algorithm 1. We evaluate WSAC against Behavior Cloning (BC), safe Behavior Cloning (Safe-BC), Batch-Constrained deep Q-learning with Lagrangian PID (BCQL) (Fujimoto et al., 2019; Stooke et al., 2020), bootstrapping error accumulation reduction with Lagrangian PID (BEARL) (Kumar et al., 2019; Stooke et al., 2020), Constraints Penalized Q-learning (CPQ) (Xu et al., 2022), and one of the state-of-the-art algorithms, COptiDICE (Lee et al., 2022), as baselines.

We study several representative environments and focus on presenting “BallCircle”. BallCircle requires the ball to move along a circle in a clockwise direction without leaving the safety zone defined by the boundaries, as proposed by Achiam et al. (2017). The ball is a spherical-shaped agent which can freely move on the xy-plane. The reward is dense and increases with the agent's velocity and with proximity to the boundary of the circle. The cost is incurred if the agent leaves the safety zone defined by the boundaries.

We use the offline dataset from Liu et al. (2019), where the corresponding expert policies are used to interact with the environments and collect the data. To better illustrate the results, we normalize the reward and cost. Our simulation results are reported in Table 2. We observe that WSAC guarantees that all the final agents are safe, which is most critical in the safe RL literature, even in challenging environments such as PointButton, where most baselines fail to learn safe policies. WSAC achieves the best results in 3 of the environments. Moreover, WSAC outperforms all the baselines in terms of average performance, demonstrating its ability to learn a safe policy by leveraging an offline dataset. The simulation results verify our theoretical findings. We also compared WSAC with all the baselines in settings with different cost limits; WSAC still outperforms all the other baselines and ensures a safe policy. We further include simulations to investigate the contribution of each component of our algorithm, including the weighted Bellman regularizer, the aggression-limited objective, and the no-regret policy optimization, which together guarantee the theoretical results. More details and discussions are deferred to Appendix D due to the page limit.

7 Conclusion

In this paper, we explore the problem of offline Safe-RL with a single policy data coverage assumption. We propose a novel algorithm, WSAC, which, for the first time, is proven to guarantee the property of safe robust policy improvement. WSAC is able to outperform any reference policy, including the behavior policy, while maintaining the same level of safety across a broad range of hyperparameters. Our simulation results demonstrate that WSAC outperforms existing state-of-the-art offline safe-RL algorithms. Interesting future work includes combining WSAC with online exploration with safety guarantees and extending the approach to multi-agent settings to handle coupled constraints.

References

  • Achiam et al., (2017) Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In Int. Conf. Machine Learning (ICML), volume 70, pages 22–31. JMLR.
  • Agarwal et al., (2021) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2021). On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76.
  • Altman, (1999) Altman, E. (1999). Constrained Markov decision processes, volume 7. CRC Press.
  • Bhardwaj et al., (2024) Bhardwaj, M., Xie, T., Boots, B., Jiang, N., and Cheng, C.-A. (2024). Adversarial model for offline reinforcement learning. Advances in Neural Information Processing Systems, 36.
  • (5) Chen, F., Zhang, J., and Wen, Z. (2022a). A near-optimal primal-dual method for off-policy learning in cmdp. In Advances Neural Information Processing Systems (NeurIPS), volume 35, pages 10521–10532.
  • Chen and Jiang, (2019) Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In Int. Conf. Machine Learning (ICML), pages 1042–1051. PMLR.
  • Chen and Jiang, (2022) Chen, J. and Jiang, N. (2022). Offline reinforcement learning under value and density-ratio realizability: the power of gaps. In Uncertainty in Artificial Intelligence, pages 378–388. PMLR.
  • (8) Chen, L., Jain, R., and Luo, H. (2022b). Learning infinite-horizon average-reward markov decision process with constraints. In Int. Conf. Machine Learning (ICML), pages 3246–3270. PMLR.
  • Chen et al., (2021) Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. (2021). Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097.
  • Cheng et al., (2020) Cheng, C.-A., Kolobov, A., and Agarwal, A. (2020). Policy improvement via imitation of multiple oracles. In Advances Neural Information Processing Systems (NeurIPS), volume 33, pages 5587–5598.
  • Cheng et al., (2022) Cheng, C.-A., Xie, T., Jiang, N., and Agarwal, A. (2022). Adversarially trained actor critic for offline reinforcement learning. In International Conference on Machine Learning, pages 3852–3878. PMLR.
  • Chow et al., (2017) Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. (2017). Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120.
  • Cui and Du, (2022) Cui, Q. and Du, S. S. (2022). When are offline two-player zero-sum markov games solvable? Advances in Neural Information Processing Systems, 35:25779–25791.
  • Efroni et al., (2020) Efroni, Y., Mannor, S., and Pirotta, M. (2020). Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189.
  • Even-Dar et al., (2009) Even-Dar, E., Kakade, S. M., and Mansour, Y. (2009). Online markov decision processes. Mathematics of Operations Research, 34(3):726–736.
  • Fujimoto et al., (2019) Fujimoto, S., Meger, D., and Precup, D. (2019). Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pages 2052–2062. PMLR.
  • Fujimoto et al., (2018) Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. In Int. Conf. Machine Learning (ICML), pages 1582–1591.
  • Geist et al., (2019) Geist, M., Scherrer, B., and Pietquin, O. (2019). A theory of regularized markov decision processes. In International Conference on Machine Learning, pages 2160–2169. PMLR.
  • Haarnoja et al., (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft Actor-Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Int. Conf. Machine Learning (ICML), pages 1861–1870.
  • Hong et al., (2024) Hong, K., Li, Y., and Tewari, A. (2024). A primal-dual-critic algorithm for offline constrained reinforcement learning. In Int. Conf. Artificial Intelligence and Statistics (AISTATS), pages 280–288. PMLR.
  • Isele et al., (2018) Isele, D., Nakhaei, A., and Fujimura, K. (2018). Safe reinforcement learning on autonomous vehicles. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–6. IEEE.
  • Kakade and Langford, (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In Int. Conf. Machine Learning (ICML), pages 267–274.
  • Kakade, (2001) Kakade, S. M. (2001). A natural policy gradient. In Advances Neural Information Processing Systems (NeurIPS).
  • Kingma and Ba, (2015) Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y., editors, Int. Conf. on Learning Representations (ICLR).
  • Kumar et al., (2019) Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32.
  • Kumar et al., (2022) Kumar, A., Hong, J., Singh, A., and Levine, S. (2022). Should i run offline reinforcement learning or behavioral cloning? In Int. Conf. on Learning Representations (ICLR).
  • Laroche et al., (2019) Laroche, R., Trichelair, P., and Des Combes, R. T. (2019). Safe policy improvement with baseline bootstrapping. In International conference on machine learning, pages 3652–3661. PMLR.
  • Le et al., (2019) Le, H., Voloshin, C., and Yue, Y. (2019). Batch policy learning under constraints. In International Conference on Machine Learning, pages 3703–3712. PMLR.
  • Lee et al., (2021) Lee, J., Jeon, W., Lee, B., Pineau, J., and Kim, K.-E. (2021). Optidice: Offline policy optimization via stationary distribution correction estimation. In Meila, M. and Zhang, T., editors, Int. Conf. Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pages 6120–6130. PMLR.
  • Lee et al., (2022) Lee, J., Paduraru, C., Mankowitz, D. J., Heess, N., Precup, D., Kim, K.-E., and Guez, A. (2022). Coptidice: Offline constrained reinforcement learning via stationary distribution correction estimation. arXiv preprint arXiv:2204.08957.
  • Liao et al., (2022) Liao, P., Qi, Z., Wan, R., Klasnja, P., and Murphy, S. A. (2022). Batch policy learning in average reward markov decision processes. Annals of statistics, 50(6):3364.
  • Liu et al., (2019) Liu, B., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural trust region/proximal policy optimization attains globally optimal policy. Advances in neural information processing systems, 32.
  • (33) Liu, Z., Guo, Z., Lin, H., Yao, Y., Zhu, J., Cen, Z., Hu, H., Yu, W., Zhang, T., Tan, J., et al. (2023a). Datasets and benchmarks for offline safe reinforcement learning. arXiv preprint arXiv:2306.09303.
  • (34) Liu, Z., Guo, Z., Yao, Y., Cen, Z., Yu, W., Zhang, T., and Zhao, D. (2023b). Constrained decision transformer for offline safe reinforcement learning. In International Conference on Machine Learning, pages 21611–21630. PMLR.
  • Ozdaglar et al., (2023) Ozdaglar, A. E., Pattathil, S., Zhang, J., and Zhang, K. (2023). Revisiting the linear-programming framework for offline rl with general function approximation. In International Conference on Machine Learning, pages 26769–26791. PMLR.
  • Pirotta et al., (2013) Pirotta, M., Restelli, M., Pecorino, A., and Calandriello, D. (2013). Safe policy iteration. In Int. Conf. Machine Learning (ICML), pages 307–315. PMLR.
  • Rajaraman et al., (2020) Rajaraman, N., Yang, L., Jiao, J., and Ramchandran, K. (2020). Toward the fundamental limits of imitation learning. Advances Neural Information Processing Systems (NeurIPS), 33:2914–2924.
  • Rashidinejad et al., (2021) Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. (2021). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. In Advances Neural Information Processing Systems (NeurIPS), volume 34, pages 11702–11716.
  • Rashidinejad et al., (2022) Rashidinejad, P., Zhu, H., Yang, K., Russell, S., and Jiao, J. (2022). Optimal conservative offline rl with general function approximation via augmented lagrangian. arXiv preprint arXiv:2211.00716.
  • Siegel et al., (2020) Siegel, N. Y., Springenberg, J. T., Berkenkamp, F., Abdolmaleki, A., Neunert, M., Lampe, T., Hafner, R., Heess, N., and Riedmiller, M. (2020). Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396.
  • Stooke et al., (2020) Stooke, A., Achiam, J., and Abbeel, P. (2020). Responsive safety in reinforcement learning by pid lagrangian methods. In Int. Conf. Machine Learning (ICML), pages 9133–9143. PMLR.
  • Uehara et al., (2020) Uehara, M., Huang, J., and Jiang, N. (2020). Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pages 9659–9668. PMLR.
  • Uehara et al., (2024) Uehara, M., Kallus, N., Lee, J. D., and Sun, W. (2024). Offline minimax soft-q-learning under realizability and partial coverage. Advances in Neural Information Processing Systems, 36.
  • Uehara and Sun, (2021) Uehara, M. and Sun, W. (2021). Pessimistic model-based offline reinforcement learning under partial coverage. arXiv preprint arXiv:2107.06226.
  • Von Stackelberg, (2010) Von Stackelberg, H. (2010). Market structure and equilibrium. Springer Science & Business Media.
  • Wang et al., (2019) Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.
  • Wu et al., (2021) Wu, R., Zhang, Y., Yang, Z., and Wang, Z. (2021). Offline constrained multi-objective reinforcement learning via pessimistic dual value iteration. In Advances Neural Information Processing Systems (NeurIPS), volume 34, pages 25439–25451.
  • Xie et al., (2021) Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. (2021). Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34:6683–6694.
  • Xie and Jiang, (2020) Xie, T. and Jiang, N. (2020). Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550–559. PMLR.
  • Xie and Jiang, (2021) Xie, T. and Jiang, N. (2021). Batch value-function approximation with only realizability. In Int. Conf. Machine Learning (ICML), pages 11404–11413. PMLR.
  • Xu et al., (2022) Xu, H., Zhan, X., and Zhu, X. (2022). Constraints penalized Q-learning for safe offline reinforcement learning. In AAAI Conf. Artificial Intelligence, volume 36, pages 8753–8760.
  • Yin and Wang, (2021) Yin, M. and Wang, Y.-X. (2021). Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34:4065–4078.
  • Zhan et al., (2022) Zhan, W., Huang, B., Huang, A., Jiang, N., and Lee, J. (2022). Offline reinforcement learning with realizability and single-policy concentrability. In Proc. Conf. Learning Theory (COLT), pages 2730–2775. PMLR.
  • Zhang et al., (2020) Zhang, J., Koppel, A., Bedi, A. S., Szepesvari, C., and Wang, M. (2020). Variational policy gradient method for reinforcement learning with general utilities. Advances in Neural Information Processing Systems, 33:4572–4583.
  • Zheng et al., (2024) Zheng, Y., Li, J., Yu, D., Yang, Y., Li, S. E., Zhan, X., and Liu, J. (2024). Safe offline reinforcement learning with feasibility-guided diffusion model. arXiv preprint arXiv:2401.10700.
  • Zhu et al., (2023) Zhu, H., Rashidinejad, P., and Jiao, J. (2023). Importance weighted actor-critic for optimal conservative offline reinforcement learning. arXiv preprint arXiv:2301.12714.

Supplementary Material

Appendix A Auxiliary Lemmas

In the following, we first provide several lemmas which are useful for proving our main results.

Lemma 3.

With probability at least 1δ,1-\delta, for any fr,fc𝒢,πΠf_{r}\in\mathcal{F},f_{c}\in\mathcal{G},\pi\in\Pi and w𝒲,w\in\mathcal{W}, we have

\displaystyle\bigg{|}|\mathbb{E}_{\mu}[(f_{r}-\mathcal{T}_{r}^{\pi}f_{r})w]|-\Big{|}\frac{1}{N}\sum_{(s,a,r,s^{\prime})}w(s,a)(f_{r}(s,a)-r-\gamma f_{r}(s^{\prime},\pi))\Big{|}\bigg{|}
\displaystyle\leq 𝒪(Vmaxlog(|||Π||𝒲|/δ)N+VmaxBwlog(|||Π||𝒲|/δ)N)\displaystyle\mathcal{O}\bigg{(}V_{\max}\sqrt{\frac{\log(|\mathcal{F}||\Pi||\mathcal{W}|/\delta)}{N}}+\frac{V_{\max}B_{w}\log(|\mathcal{F}||\Pi||\mathcal{W}|/\delta)}{N}\bigg{)} (16)
\displaystyle\bigg{|}|\mathbb{E}_{\mu}[(f_{c}-\mathcal{T}_{c}^{\pi}f_{c})w]|-\Big{|}\frac{1}{N}\sum_{(s,a,r,s^{\prime})}w(s,a)(f_{c}(s,a)-c-\gamma f_{c}(s^{\prime},\pi))\Big{|}\bigg{|}
\displaystyle\leq 𝒪(Vmaxlog(|𝒢||Π||𝒲|/δ)N+VmaxBwlog(|𝒢||Π||𝒲|/δ)N)\displaystyle\mathcal{O}\bigg{(}V_{\max}\sqrt{\frac{\log(|\mathcal{G}||\Pi||\mathcal{W}|/\delta)}{N}}+\frac{V_{\max}B_{w}\log(|\mathcal{G}||\Pi||\mathcal{W}|/\delta)}{N}\bigg{)} (17)

The proof can be found in Lemma 4 of Zhu et al., (2023).

Lemma 4.

With probability at least 12δ,1-2\delta, for any fr,fc𝒢f_{r}\in\mathcal{F},f_{c}\in\mathcal{G} and πΠ,\pi\in\Pi, we have

|μ(π,fr)𝒟(π,fr)|ϵstat\displaystyle|\mathcal{E}_{\mu}(\pi,f_{r})-\mathcal{E}_{\mathcal{D}}(\pi,f_{r})|\leq\epsilon_{stat} (18)
|μ(π,fc)𝒟(π,fc)|ϵstat,\displaystyle|\mathcal{E}_{\mu}(\pi,f_{c})-\mathcal{E}_{\mathcal{D}}(\pi,f_{c})|\leq\epsilon_{stat}, (19)

where \epsilon_{stat}:=V_{\max}C^{*}_{\ell_{2}}\sqrt{\frac{\log(|\mathcal{F}||\mathcal{G}||\Pi||\mathcal{W}|/\delta)}{N}}+\frac{V_{\max}B_{w}\log(|\mathcal{F}||\mathcal{G}||\Pi||\mathcal{W}|/\delta)}{N}.

Proof.

Conditioning on the high-probability event in Lemma 3, for any f_{r}\in\mathcal{F},f_{c}\in\mathcal{G},\pi\in\Pi, define

w^{*}_{\pi,f_{r}}:=\arg\max_{w\in\mathcal{W}}|\mathbb{E}_{\mu}[w(s,a)(f_{r}-\mathcal{T}_{r}^{\pi}f_{r})(s,a)]|

and define

\displaystyle\hat{w}_{\pi,f_{r}}:=\arg\max_{w\in\mathcal{W}}\Big{|}\frac{1}{N}\sum_{(s,a,r,s^{\prime})\in\mathcal{D}}w(s,a)(f_{r}(s,a)-r-\gamma f_{r}(s^{\prime},\pi))\Big{|}.

Then we can have

\displaystyle\mathcal{E}_{\mu}(\pi,f_{r})-\mathcal{E}_{\mathcal{D}}(\pi,f_{r})
\displaystyle=|\mathbb{E}_{\mu}[w^{*}_{\pi,f_{r}}(s,a)(f_{r}-\mathcal{T}_{r}^{\pi}f_{r})(s,a)]|-\bigg{|}\frac{1}{N}\sum_{(s,a,r,s^{\prime})}\hat{w}_{\pi,f_{r}}(s,a)(f_{r}(s,a)-r-\gamma f_{r}(s^{\prime},\pi))\bigg{|}
\displaystyle=|\mathbb{E}_{\mu}[w^{*}_{\pi,f_{r}}(s,a)(f_{r}-\mathcal{T}_{r}^{\pi}f_{r})(s,a)]|-|\mathbb{E}_{\mu}[\hat{w}_{\pi,f_{r}}(s,a)(f_{r}-\mathcal{T}_{r}^{\pi}f_{r})(s,a)]|
\displaystyle\quad+|\mathbb{E}_{\mu}[\hat{w}_{\pi,f_{r}}(s,a)(f_{r}-\mathcal{T}_{r}^{\pi}f_{r})(s,a)]|-\bigg{|}\frac{1}{N}\sum_{(s,a,r,s^{\prime})}\hat{w}_{\pi,f_{r}}(s,a)(f_{r}(s,a)-r-\gamma f_{r}(s^{\prime},\pi))\bigg{|}
\displaystyle\geq 0-\epsilon_{stat}=-\epsilon_{stat},

where the inequality is true by using the definition of wπ,frw_{\pi,f_{r}}^{*} and Lemma 3. Thus

\displaystyle\mathcal{E}_{\mu}(\pi,f_{r})-\mathcal{E}_{\mathcal{D}}(\pi,f_{r})
\displaystyle=|\mathbb{E}_{\mu}[{w}^{*}_{\pi,f_{r}}(s,a)(f_{r}-\mathcal{T}_{r}^{\pi}f_{r})(s,a)]|-\bigg{|}\frac{1}{N}\sum_{(s,a,r,s^{\prime})}{w}^{*}_{\pi,f_{r}}(s,a)(f_{r}(s,a)-r-\gamma f_{r}(s^{\prime},\pi))\bigg{|}
\displaystyle\quad+\bigg{|}\frac{1}{N}\sum_{(s,a,r,s^{\prime})}{w}^{*}_{\pi,f_{r}}(s,a)(f_{r}(s,a)-r-\gamma f_{r}(s^{\prime},\pi))\bigg{|}-\bigg{|}\frac{1}{N}\sum_{(s,a,r,s^{\prime})}\hat{w}_{\pi,f_{r}}(s,a)(f_{r}(s,a)-r-\gamma f_{r}(s^{\prime},\pi))\bigg{|}
\displaystyle\leq\epsilon_{stat},

The proof for the case |μ(π,fc)𝒟(π,fc)|ϵstat|\mathcal{E}_{\mu}(\pi,f_{c})-\mathcal{E}_{\mathcal{D}}(\pi,f_{c})|\leq\epsilon_{stat} is similar. ∎

Lemma 5.

(Empirical weighted average Bellman Error) With probability at least 12δ,1-2\delta, for any πΠ,\pi\in\Pi, we have

𝒟(π,frπ)\displaystyle\mathcal{E}_{\mathcal{D}}(\pi,f_{r}^{\pi})\leq C2ϵ1+ϵstat\displaystyle C_{\ell_{2}}^{*}\sqrt{\epsilon_{1}}+\epsilon_{stat} (20)
𝒟(π,fcπ)\displaystyle\mathcal{E}_{\mathcal{D}}(\pi,f_{c}^{\pi})\leq C2ϵ1+ϵstat,\displaystyle C_{\ell_{2}}^{*}\sqrt{\epsilon_{1}}+\epsilon_{stat}, (21)

where

frπ:=argminfrsupadmissible νfr𝒯rπfr2,ν2,πΠ\displaystyle f_{r}^{\pi}:=\underset{f_{r}\in\mathcal{F}}{\arg\min}\underset{\text{admissible }\nu}{\sup}\|f_{r}-\mathcal{T}_{r}^{\pi}f_{r}\|_{2,\nu}^{2},\forall\pi\in\Pi
fcπ:=argminfc𝒢supadmissible νfc𝒯cπfc2,ν2,πΠ.\displaystyle f_{c}^{\pi}:=\underset{f_{c}\in\mathcal{G}}{\arg\min}\underset{\text{admissible }\nu}{\sup}\|f_{c}-\mathcal{T}_{c}^{\pi}f_{c}\|_{2,\nu}^{2},\forall\pi\in\Pi.
Proof.

Conditioning on the high-probability event in Lemma 4, we have

\displaystyle\mathcal{E}_{\mu}(\pi,f_{r}^{\pi})=\max_{w\in\mathcal{W}}|\mathbb{E}_{\mu}[w(s,a)(f_{r}^{\pi}-\mathcal{T}^{\pi}_{r}f_{r}^{\pi})(s,a)]|
\displaystyle\leq\max_{w\in\mathcal{W}}\|w\|_{2,\mu}\|f_{r}^{\pi}-\mathcal{T}^{\pi}_{r}f_{r}^{\pi}\|_{2,\mu}
\displaystyle\leq C_{\ell_{2}}^{*}\sqrt{\epsilon_{1}},

where the first inequality follows from the Cauchy–Schwarz inequality and the second inequality follows from the definition of f_{r}^{\pi} and Assumption 3.2. Thus we obtain

𝒟(π,frπ)μ(π,frπ)+ϵstatC2ϵ1+ϵstat.\displaystyle\mathcal{E}_{\mathcal{D}}(\pi,f_{r}^{\pi})\leq\mathcal{E}_{\mu}(\pi,f_{r}^{\pi})+\epsilon_{stat}\leq C_{\ell_{2}}^{*}\sqrt{\epsilon_{1}}+\epsilon_{stat}. (22)

Following a similar argument, we have

\displaystyle\mathcal{E}_{\mathcal{D}}(\pi,f_{c}^{\pi})\leq\mathcal{E}_{\mu}(\pi,f_{c}^{\pi})+\epsilon_{stat}\leq C_{\ell_{2}}^{*}\sqrt{\epsilon_{1}}+\epsilon_{stat}. (23)

Lemma 6.

(Performance difference decomposition, restatement of Lemma 12 in Cheng et al., (2022)) For arbitrary policies \pi,\hat{\pi}\in\Pi and an arbitrary function f over \mathcal{S}\times\mathcal{A}, we have

J(π)J(π^)\displaystyle J_{\diamond}(\pi)-J_{\diamond}(\hat{\pi})
\displaystyle=\mathbb{E}_{\mu}\big{[}\big{(}f-\mathcal{T}_{\diamond}^{\hat{\pi}}f\big{)}(s,a)\big{]}+\mathbb{E}_{\pi}\big{[}\big{(}\mathcal{T}^{\hat{\pi}}_{\diamond}f-f\big{)}(s,a)\big{]}+\mathbb{E}_{\pi}[f(s,\pi)-f(s,\hat{\pi})]+\mathcal{L}_{\mu}(\hat{\pi},f)-\mathcal{L}_{\mu}(\hat{\pi},Q_{\diamond}^{\hat{\pi}}), (24)

where :=r\diamond:=r or c.c.

Proof.

We prove the case \diamond:=r; the other case is identical. Let R^{f,\hat{\pi}}(s,a):=f(s,a)-\gamma\mathbb{E}_{s^{\prime}|(s,a)}[f(s^{\prime},\hat{\pi})] be a virtual reward function for the given f and \hat{\pi}. According to the performance difference lemma (Kakade and Langford, 2002), we first have that

(Jr(π^)Jr(μ))=\displaystyle(J_{r}(\hat{\pi})-J_{r}(\mu))= μ(π^,Qrπ^)\displaystyle\mathcal{L}_{\mu}(\hat{\pi},Q_{r}^{\hat{\pi}})
=\displaystyle= Δ(π^)+μ(π^,f)\displaystyle\Delta(\hat{\pi})+\mathcal{L}_{\mu}(\hat{\pi},f) (Δ(π^):=μ(π^,Qrπ^)μ(π^,f)\Delta(\hat{\pi}):=\mathcal{L}_{\mu}(\hat{\pi},Q_{r}^{\hat{\pi}})-\mathcal{L}_{\mu}(\hat{\pi},f))
=\displaystyle= Δ(π^)+𝔼μ[f(s,π^)f(s,a)]\displaystyle\Delta(\hat{\pi})+\mathbb{E}_{\mu}[f(s,\hat{\pi})-f(s,a)]
=\displaystyle= Δ(π^)+(1γ)(JRf,π^(π^)JRf,π^(μ))\displaystyle\Delta(\hat{\pi})+(1-\gamma)(J_{R^{f,\hat{\pi}}}(\hat{\pi})-J_{R^{f,\hat{\pi}}}(\mu))
=\displaystyle= Δ(π^)+(1γ)QRf,π^π^(s0,π^)𝔼μ[Rπ^,f(s,a)]\displaystyle\Delta(\hat{\pi})+(1-\gamma)Q_{R^{f,\hat{\pi}}}^{\hat{\pi}}(s_{0},\hat{\pi})-\mathbb{E}_{\mu}[R^{\hat{\pi},f}(s,a)]
=\displaystyle= Δ(π^)+(1γ)f(s0,π^)𝔼μ[Rπ^,f(s,a)],\displaystyle\Delta(\hat{\pi})+(1-\gamma)f(s_{0},\hat{\pi})-\mathbb{E}_{\mu}[R^{\hat{\pi},f}(s,a)],

where the last equality is true because that

Q^{\hat{\pi}}_{R^{f,\hat{\pi}}}(s,a)=(\mathcal{T}^{\hat{\pi}}_{R^{f,\hat{\pi}}}f)(s,a)=R^{f,\hat{\pi}}(s,a)+\gamma\mathbb{E}_{s^{\prime}|(s,a)}[f(s^{\prime},\hat{\pi})]=f(s,a).

Thus we have

\displaystyle(J_{r}(\pi)-J_{r}(\hat{\pi}))=(J_{r}(\pi)-J_{r}(\mu))-(J_{r}(\hat{\pi})-J_{r}(\mu))
=\displaystyle= (Jr(π)f(d0,π^))+(𝔼μ[Rπ^,f(s,a)]Jr(μ))Δ(π^).\displaystyle(J_{r}(\pi)-f(d_{0},\hat{\pi}))+\bigg{(}\mathbb{E}_{\mu}[R^{\hat{\pi},f}(s,a)]-J_{r}(\mu)\bigg{)}-\Delta(\hat{\pi}). (25)

For the first term, we have

(Jr(π)f(d0,π^))=\displaystyle(J_{r}(\pi)-f(d_{0},\hat{\pi}))= (Jr(π)f(s0,π^))\displaystyle(J_{r}(\pi)-f(s_{0},\hat{\pi}))
=\displaystyle= Jr(π)𝔼dπ[Rπ^,f(s,a)]+𝔼dπ[Rπ^,f(s,a)]f(s0,π^)\displaystyle J_{r}(\pi)-\mathbb{E}_{d^{\pi}}[R^{\hat{\pi},f}(s,a)]+\mathbb{E}_{d^{\pi}}[R^{\hat{\pi},f}(s,a)]-f(s_{0},\hat{\pi})
=\displaystyle= 𝔼dπ[R(s,a)Rπ^,f(s,a)]+𝔼dπ[f(s,π)f(s,π^)]\displaystyle\mathbb{E}_{d^{\pi}}[R(s,a)-R^{\hat{\pi},f}(s,a)]+\mathbb{E}_{d^{\pi}}[f(s,\pi)-f(s,\hat{\pi})]
=\displaystyle= 𝔼dπ[(𝒯rπ^ff)(s,a)]+𝔼dπ[f(s,π)f(s,π^)],\displaystyle\mathbb{E}_{d^{\pi}}[(\mathcal{T}^{\hat{\pi}}_{r}f-f)(s,a)]+\mathbb{E}_{d^{\pi}}[f(s,\pi)-f(s,\hat{\pi})], (26)

where the second equality is true because

𝔼dπ[Rπ^,f(s,a)]f(s0,π^)\displaystyle\mathbb{E}_{d^{\pi}}[R^{\hat{\pi},f}(s,a)]-f(s_{0},\hat{\pi})
=\displaystyle= 𝔼dπ[f(s,a)γ𝔼s|(s,a)[f(s,π^)]]f(s0,π^)\displaystyle\mathbb{E}_{d^{\pi}}\big{[}f(s,a)-\gamma\mathbb{E}_{s^{\prime}|(s,a)}[f(s^{\prime},\hat{\pi})]\big{]}-f(s_{0},\hat{\pi})
=\displaystyle= 𝔼dπ[f(s,π)]st=1γtPr(st=s|s0d0,π)f(s,π^(s))f(s0,π^)\displaystyle\mathbb{E}_{d^{\pi}}\big{[}f(s,\pi)]-\sum_{s}\sum_{t=1}^{\infty}\gamma^{t}\Pr(s_{t}=s|s_{0}\sim d_{0},\pi)f(s,\hat{\pi}(s))-f(s_{0},\hat{\pi})
=\displaystyle= 𝔼dπ[f(s,π)]st=0γtPr(st=s|s0d0,π)f(s,π^(s))\displaystyle\mathbb{E}_{d^{\pi}}\big{[}f(s,\pi)]-\sum_{s}\sum_{t=0}^{\infty}\gamma^{t}\Pr(s_{t}=s|s_{0}\sim d_{0},\pi)f(s,\hat{\pi}(s))
=\displaystyle= 𝔼dπ[f(s,π)]s,at=0γtPr(st=s,at=a|s0d0,π)f(s,π^(s))\displaystyle\mathbb{E}_{d^{\pi}}\big{[}f(s,\pi)]-\sum_{s,a}\sum_{t=0}^{\infty}\gamma^{t}\Pr(s_{t}=s,a_{t}=a|s_{0}\sim d_{0},\pi)f(s,\hat{\pi}(s))
=\displaystyle= 𝔼dπ[f(s,π)f(s,π^)].\displaystyle\mathbb{E}_{d^{\pi}}[f(s,\pi)-f(s,\hat{\pi})].

For the second term we have

𝔼μ[Rπ^,f(s,a)]Jr(μ)\displaystyle\mathbb{E}_{\mu}[R^{\hat{\pi},f}(s,a)]-J_{r}(\mu)
=\displaystyle= 𝔼μ[Rπ^,f(s,a)R(s,a)]\displaystyle\mathbb{E}_{\mu}[R^{\hat{\pi},f}(s,a)-R(s,a)]
=\displaystyle= 𝔼μ[(f𝒯rπ^f)(s,a)].\displaystyle\mathbb{E}_{\mu}[(f-\mathcal{T}^{\hat{\pi}}_{r}f)(s,a)]. (27)

Therefore, plugging (26) and (27) into Eq. (25), we have

Jr(π)Jr(π^)\displaystyle J_{r}(\pi)-J_{r}(\hat{\pi})
\displaystyle=\mathbb{E}_{\mu}\big{[}\big{(}f-\mathcal{T}_{r}^{\hat{\pi}}f\big{)}(s,a)\big{]}+\mathbb{E}_{\pi}\big{[}\big{(}\mathcal{T}_{r}^{\hat{\pi}}f-f\big{)}(s,a)\big{]}+\mathbb{E}_{\pi}[f(s,\pi)-f(s,\hat{\pi})]+\mathcal{L}_{\mu}(\hat{\pi},f)-\mathcal{L}_{\mu}(\hat{\pi},Q_{r}^{\hat{\pi}}).

The proof is completed. ∎

Lemma 7.

With probability at least 12δ,1-2\delta, for any fr,fc𝒢,f_{r}\in\mathcal{F},f_{c}\in\mathcal{G}, and πΠ,\pi\in\Pi, we have:

|μ(π,fr)𝒟(π,fr)|ϵstat\displaystyle|\mathcal{L}_{\mu}(\pi,f_{r})-\mathcal{L}_{\mathcal{D}}(\pi,f_{r})|\leq\epsilon_{stat} (28)
|μ(π,fc)𝒟(π,fc)|ϵstat\displaystyle|\mathcal{L}_{\mu}(\pi,f_{c})-\mathcal{L}_{\mathcal{D}}(\pi,f_{c})|\leq\epsilon_{stat} (29)

where \epsilon_{stat}:=V_{\max}C^{*}_{\ell_{2}}\sqrt{\frac{\log(|\mathcal{F}||\mathcal{G}||\Pi||\mathcal{W}|/\delta)}{N}}+\frac{V_{\max}B_{w}\log(|\mathcal{F}||\mathcal{G}||\Pi||\mathcal{W}|/\delta)}{N}.

Proof.

Recall that \mathbb{E}_{\mu}[\mathcal{L}_{\mathcal{D}}(\pi,f_{r})]=\mathcal{L}_{\mu}(\pi,f_{r}) and |f_{r}(s,\pi)-f_{r}(s,a)|\leq V_{\max}. For any f_{r}\in\mathcal{F} and policy \pi\in\Pi, applying Hoeffding's inequality and a union bound, we obtain that with probability at least 1-\delta,

|μ(π,fr)𝒟(π,fr)|𝒪(Vmaxlog(|||Π|/δ)N)ϵstat.\displaystyle|\mathcal{L}_{\mu}(\pi,f_{r})-\mathcal{L}_{\mathcal{D}}(\pi,f_{r})|\leq\mathcal{O}\bigg{(}V_{\max}\sqrt{\frac{\log(|\mathcal{F}||\Pi|/\delta)}{N}}\bigg{)}\leq\epsilon_{stat}. (30)

The proof of the inequality for f_{c} is identical. ∎

Appendix B Missing Proofs

B.1 Proof of Theorem 4.1

Proof.

According to the performance difference lemma (Kakade and Langford, 2002), we have

(Jr(π)Jr(μ))λ{Jc(π)Jc(μ)}+\displaystyle(J_{r}(\pi)-J_{r}(\mu))-\lambda\{J_{c}(\pi)-J_{c}(\mu)\}_{+}
=\displaystyle= μ(π,Qrπ)λ{μ(π,Qcπ)}+\displaystyle\mathcal{L}_{\mu}(\pi,Q_{r}^{\pi})-\lambda\{\mathcal{L}_{\mu}(\pi,Q_{c}^{\pi})\}_{+}
=\displaystyle= μ(π,Qrπ)+βμ(π,Qrπ)λ{μ(π,Qcπ)}++β^μ(π,Qcπ)\displaystyle\mathcal{L}_{\mu}(\pi,Q_{r}^{\pi})+\beta\mathcal{E}_{\mu}(\pi,Q_{r}^{\pi})-\lambda\{\mathcal{L}_{\mu}(\pi,Q_{c}^{\pi})\}_{+}+\beta\hat{\mathcal{E}}_{\mu}(\pi,Q_{c}^{\pi})
\displaystyle\geq μ(π,frπ)+βμ(π,frπ)λ{μ(π,fcπ)}++β^μ(π,fcπ)\displaystyle\mathcal{L}_{\mu}(\pi,f_{r}^{\pi})+\beta\mathcal{E}_{\mu}(\pi,f_{r}^{\pi})-\lambda\{\mathcal{L}_{\mu}(\pi,f_{c}^{\pi})\}_{+}+\beta\hat{\mathcal{E}}_{\mu}(\pi,f_{c}^{\pi})
\displaystyle\geq μ(π,frπ)λ{μ(π,fcπ)}+,\displaystyle\mathcal{L}_{\mu}(\pi,f_{r}^{\pi})-\lambda\{\mathcal{L}_{\mu}(\pi,f_{c}^{\pi})\}_{+}, (31)

where the second equality is true because μ(π,Qrπ)=^μ(π,Qcπ)=0\mathcal{E}_{\mu}(\pi,Q_{r}^{\pi})=\hat{\mathcal{E}}_{\mu}(\pi,Q_{c}^{\pi})=0 by Assumption 3.2, and the first inequality comes from the selection of frπf_{r}^{\pi} and fcπf_{c}^{\pi} in optimization (6).

Therefore, we can obtain

Jr(π^)Jr(μ)\displaystyle J_{r}(\hat{\pi}^{*})-J_{r}(\mu)\geq (μ(π^,frπ^)λ{μ(π^,fcπ^)}+)+λ{Jc(π^)Jc(μ)}+\displaystyle\big{(}\mathcal{L}_{\mu}(\hat{\pi}^{*},f_{r}^{\hat{\pi}^{*}})-\lambda\{\mathcal{L}_{\mu}(\hat{\pi}^{*},f_{c}^{\hat{\pi}^{*}})\}_{+}\big{)}+\lambda\{J_{c}(\hat{\pi}^{*})-J_{c}(\mu)\}_{+}
\displaystyle\geq (μ(μ,frμ)λ{μ(μ,fcμ)}+)+λ{Jc(π^)Jc(μ)}+\displaystyle\big{(}\mathcal{L}_{\mu}(\mu,f_{r}^{\mu})-\lambda\{\mathcal{L}_{\mu}(\mu,f_{c}^{\mu})\}_{+}\big{)}+\lambda\{J_{c}(\hat{\pi}^{*})-J_{c}(\mu)\}_{+}
\displaystyle\geq λ{Jc(π^)Jc(μ)}+0\displaystyle\lambda\{J_{c}(\hat{\pi}^{*})-J_{c}(\mu)\}_{+}\geq 0 (32)

and

{Jc(π^)}+{Jc(μ)}+{Jc(π^)Jc(μ)}+1λ(Jr(π^)Jr(μ))1λ.\displaystyle\{J_{c}(\hat{\pi}^{*})\}_{+}-\{J_{c}(\mu)\}_{+}\leq\{J_{c}(\hat{\pi}^{*})-J_{c}(\mu)\}_{+}\leq\frac{1}{\lambda}(J_{r}(\hat{\pi}^{*})-J_{r}(\mu))\leq\frac{1}{\lambda}. (33)

B.2 Proof of Lemma 2

Proof.

Denote \pi_{\text{ref}} as \pi. First, according to the definition of the no-regret oracle 5.1, we have

1Kk=1K\displaystyle\frac{1}{K}\sum_{k=1}^{K} 𝔼π[frk(s,π)frk(s,πk)λ{fck(s,π)fck(s,π)}+\displaystyle\mathbb{E}_{\pi}[f_{r}^{k}(s,\pi)-f_{r}^{k}(s,\pi_{k})-\lambda\{f_{c}^{k}(s,\pi)-f_{c}^{k}(s,\pi)\}_{+}
+λ{fck(s,πk)fck(s,π)}+]ϵoptπ\displaystyle+\lambda\{f_{c}^{k}(s,\pi_{k})-f_{c}^{k}(s,\pi)\}_{+}]\leq\epsilon_{opt}^{\pi} (34)

Therefore,

1Kk=1K𝔼π[frk(s,π)frk(s,πk)]\displaystyle\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{\pi}[f_{r}^{k}(s,\pi)-f_{r}^{k}(s,\pi_{k})]
\displaystyle\leq ϵoptπ+1Kk=1K𝔼π[λ{fck(s,π)fck(s,π)}+λ{fck(s,πk)fck(s,π)}+]ϵoptπ,\displaystyle\epsilon_{opt}^{\pi}+\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{\pi}[\lambda\{f_{c}^{k}(s,\pi)-f_{c}^{k}(s,\pi)\}_{+}-\lambda\{f_{c}^{k}(s,\pi_{k})-f_{c}^{k}(s,\pi)\}_{+}]\leq\epsilon_{opt}^{\pi}, (35)

and

1Kk=1K𝔼π[{fck(s,πk)fck(s,π)}+]ϵoptπ1λKk=1K𝔼π[frk(s,π)frk(s,πk)]ϵoptπ+Vmaxλ.\displaystyle\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{\pi}[\{f_{c}^{k}(s,\pi_{k})-f_{c}^{k}(s,\pi)\}_{+}]\leq\epsilon_{opt}^{\pi}-\frac{1}{\lambda K}\sum_{k=1}^{K}\mathbb{E}_{\pi}[f_{r}^{k}(s,\pi)-f_{r}^{k}(s,\pi_{k})]\leq\epsilon_{opt}^{\pi}+\frac{V_{\max}}{\lambda}. (36)

We finish the proof. ∎

B.3 Proof of Theorem 5.2

Theorem (Restatement of Theorem 5.2).

Under Assumptions 3.2 and 3.6, let the reference policy πrefΠ\pi_{\text{ref}}\in\Pi be any policy satisfying Assumption 3.7, then with probability at least 1δ,1-\delta,

Jr(πref)Jr(π¯)𝒪(ϵstat+C2ϵ1)+ϵoptπ\displaystyle J_{r}(\pi_{\text{ref}})-J_{r}(\bar{\pi})\leq\mathcal{O}\bigg{(}\epsilon_{stat}+C^{*}_{\ell_{2}}\sqrt{\epsilon_{1}}\bigg{)}+\epsilon_{opt}^{\pi} (37)
Jc(π¯)Jc(πref)𝒪(ϵstat+C2ϵ1)+ϵoptπ+Vmaxλ,\displaystyle J_{c}(\bar{\pi})-J_{c}(\pi_{\text{ref}})\leq\mathcal{O}\bigg{(}\epsilon_{stat}+C^{*}_{\ell_{2}}\sqrt{\epsilon_{1}}\bigg{)}+\epsilon_{opt}^{\pi}+\frac{V_{\max}}{\lambda}, (38)

where \epsilon_{stat}:=V_{\max}C^{*}_{\ell_{2}}\sqrt{\frac{\log(|\mathcal{F}||\mathcal{G}||\Pi||\mathcal{W}|/\delta)}{N}}+\frac{V_{\max}B_{w}\log(|\mathcal{F}||\mathcal{G}||\Pi||\mathcal{W}|/\delta)}{N}, and \bar{\pi} is the policy returned by Algorithm 1 with \beta=2 and \pi_{\text{ref}} as input.

Proof.

Denote \pi_{\text{ref}} as \pi. According to the definition of \bar{\pi} and Lemma 6, we have

Jr(π)Jr(π¯)=1Kk=1K(Jr(π)Jr(πk))\displaystyle J_{r}(\pi)-J_{r}(\bar{\pi})=\frac{1}{K}\sum_{k=1}^{K}(J_{r}(\pi)-J_{r}(\pi_{k}))
=\displaystyle= 1Kk=1K(𝔼μ[frk𝒯rπkfrk](I)+𝔼π[𝒯rπkfrkfrk](II)\displaystyle\frac{1}{K}\sum_{k=1}^{K}\Bigg{(}\underbrace{{\mathbb{E}_{\mu}\ [f_{r}^{k}-\mathcal{T}^{\pi_{k}}_{r}f_{r}^{k}]}}_{(\text{I})}+\underbrace{{\mathbb{E}_{\pi}[\mathcal{T}^{\pi_{k}}_{r}f_{r}^{k}-f_{r}^{k}]}}_{(\text{II})}
+𝔼π[frk(s,π)frk(s,πk)](III)+μ(πk,frk)μ(πk,Qπk)(IV))\displaystyle+\underbrace{{\mathbb{E}_{\pi}[f_{r}^{k}(s,\pi)-f_{r}^{k}(s,\pi_{k})]}}_{(\text{III})}+\underbrace{\mathcal{L}_{\mu}(\pi_{k},f_{r}^{k})-\mathcal{L}_{\mu}(\pi_{k},Q^{\pi_{k}})}_{(\text{IV})}\Bigg{)} (39)

Conditioning on the high-probability event in Lemma 4, we have

(I)+(II)2μ(πk,frk)2𝒟(πk,frk)+2ϵstat\displaystyle(\text{I})+(\text{II})\leq 2\mathcal{E}_{\mu}(\pi_{k},f_{r}^{k})\leq 2\mathcal{E}_{\mathcal{D}}(\pi_{k},f_{r}^{k})+2\epsilon_{stat} (40)

By an argument similar to that of Lemma 13 in Cheng et al., (2022), we have

|μ(πk,Qrπk)μ(πk,frπk)|\displaystyle|\mathcal{L}_{\mu}({\pi_{k}},Q_{r}^{{\pi_{k}}})-\mathcal{L}_{\mu}({\pi_{k}},f_{r}^{\pi_{k}})|
=\displaystyle= |𝔼μ[Qrπk(s,πk)Qrπk(s,a)]μ(πk,frπk)|\displaystyle|\mathbb{E}_{\mu}[Q_{r}^{\pi_{k}}(s,\pi_{k})-Q_{r}^{\pi_{k}}(s,a)]-\mathcal{L}_{\mu}({\pi_{k}},f_{r}^{\pi_{k}})|
=\displaystyle= |(Jr(πk)Jr(μ))μ(πk,frπk)|\displaystyle|(J_{r}(\pi_{k})-J_{r}(\mu))-\mathcal{L}_{\mu}({\pi_{k}},f_{r}^{\pi_{k}})|
=\displaystyle= |(frπk(s0,πk)Jr(μ))+(Jr(πk)frπk(s0,πk))μ(πk,frπk)|\displaystyle|(f_{r}^{\pi_{k}}(s_{0},\pi_{k})-J_{r}(\mu))+(J_{r}(\pi_{k})-f_{r}^{\pi_{k}}(s_{0},\pi_{k}))-\mathcal{L}_{\mu}({\pi_{k}},f_{r}^{\pi_{k}})|
=\displaystyle= |𝔼μ[frπk(s,πk)(𝒯rπkfrπk)(s,a)]+𝔼dπk[(𝒯rπkfrπk)(s,a)frπk(s,a)]μ(πk,frπk)|\displaystyle|\mathbb{E}_{\mu}[f_{r}^{\pi_{k}}(s,\pi_{k})-(\mathcal{T}_{r}^{\pi_{k}}f_{r}^{\pi_{k}})(s,a)]+\mathbb{E}_{d^{\pi_{k}}}[(\mathcal{T}_{r}^{\pi_{k}}f_{r}^{\pi_{k}})(s,a)-f_{r}^{\pi_{k}}(s,a)]-\mathcal{L}_{\mu}({\pi_{k}},f_{r}^{\pi_{k}})|
=\displaystyle= |μ(πk,frπk)+𝔼μ[frπk(s,a)(𝒯rπkfrπk)(s,a)]+𝔼dπk[(𝒯rπkfrπk)(s,a)frπk(s,a)]μ(πk,frπk)|\displaystyle|\mathcal{L}_{\mu}(\pi_{k},f_{r}^{\pi_{k}})+\mathbb{E}_{\mu}[f_{r}^{\pi_{k}}(s,a)-(\mathcal{T}_{r}^{\pi_{k}}f_{r}^{\pi_{k}})(s,a)]+\mathbb{E}_{d^{\pi_{k}}}[(\mathcal{T}_{r}^{\pi_{k}}f_{r}^{\pi_{k}})(s,a)-f_{r}^{\pi_{k}}(s,a)]-\mathcal{L}_{\mu}({\pi_{k}},f_{r}^{\pi_{k}})|
\displaystyle\leq frπk(s,a)(𝒯rπkfrπk)(s,a)2,μ+(𝒯rπkfrπk)(s,a)frπk(s,a)2,dπk\displaystyle\|f_{r}^{\pi_{k}}(s,a)-(\mathcal{T}_{r}^{\pi_{k}}f_{r}^{\pi_{k}})(s,a)\|_{2,\mu}+\|(\mathcal{T}_{r}^{\pi_{k}}f_{r}^{\pi_{k}})(s,a)-f_{r}^{\pi_{k}}(s,a)\|_{2,d^{\pi_{k}}}
\displaystyle\leq 𝒪(ϵ1),\displaystyle\mathcal{O}(\sqrt{\epsilon_{1}}), (41)

where frπ:=argminfrsupadmissible νfr𝒯rπfr2,ν2,πΠ.f_{r}^{\pi}:=\underset{f_{r}\in\mathcal{F}}{\arg\min}\underset{\text{admissible }\nu}{\sup}\|f_{r}-\mathcal{T}_{r}^{\pi}f_{r}\|_{2,\nu}^{2},\forall\pi\in\Pi. By using Lemma 7, we have

|μ(πk,frk)𝒟(πk,frk)|+|μ(πk,frπk)𝒟(πk,frπk)|𝒪(ϵstat).\displaystyle|\mathcal{L}_{\mu}(\pi_{k},f_{r}^{k})-\mathcal{L}_{\mathcal{D}}(\pi_{k},f_{r}^{k})|+|\mathcal{L}_{\mu}(\pi_{k},f_{r}^{\pi_{k}})-\mathcal{L}_{\mathcal{D}}(\pi_{k},f_{r}^{\pi_{k}})|\leq\mathcal{O}(\epsilon_{stat}). (42)

Therefore

(I)+(II)+(IV)\displaystyle(\text{I})+(\text{II})+(\text{IV}) μ(πk,frk)+2μ(πk,frk)+2ϵstatμ(πk,frπk)+𝒪(ϵ1)\displaystyle\leq\mathcal{L}_{\mu}(\pi_{k},f_{r}^{k})+2\mathcal{E}_{\mu}(\pi_{k},f_{r}^{k})+2\epsilon_{stat}-\mathcal{L}_{\mu}(\pi_{k},f_{r}^{\pi_{k}})+\mathcal{O}(\sqrt{\epsilon_{1}}) (43)
𝒟(πk,frk)+2𝒟(πk,frk)+𝒪(ϵstat)𝒟(πk,frπk)+𝒪(ϵ1)\displaystyle\leq\mathcal{L}_{\mathcal{D}}(\pi_{k},f_{r}^{k})+2\mathcal{E}_{\mathcal{D}}(\pi_{k},f_{r}^{k})+\mathcal{O}(\epsilon_{stat})-\mathcal{L}_{\mathcal{D}}(\pi_{k},f_{r}^{\pi_{k}})+\mathcal{O}(\sqrt{\epsilon_{1}}) (44)
𝒟(πk,frπk)+2𝒟(πk,frπk)+𝒪(ϵstat)𝒟(πk,frπk)+𝒪(ϵ1)\displaystyle\leq\mathcal{L}_{\mathcal{D}}(\pi_{k},f_{r}^{\pi_{k}})+2\mathcal{E}_{\mathcal{D}}(\pi_{k},f_{r}^{\pi_{k}})+\mathcal{O}(\epsilon_{stat})-\mathcal{L}_{\mathcal{D}}(\pi_{k},f_{r}^{\pi_{k}})+\mathcal{O}(\sqrt{\epsilon_{1}}) (45)
𝒪(ϵstat+C2ϵ1),\displaystyle\leq\mathcal{O}(\epsilon_{stat}+C_{\ell_{2}}^{*}\sqrt{\epsilon_{1}}), (46)

where the third inequality holds by the selection of f_{r}^{k}, and the last inequality holds by Lemma 5. Therefore, by using Lemma 2 we obtain

\displaystyle J_{r}(\pi)-J_{r}(\bar{\pi})\leq\mathcal{O}(\epsilon_{stat}+C_{\ell_{2}}^{*}\sqrt{\epsilon_{1}})+\epsilon_{opt}^{\pi}. (47)

Following a similar argument, we have that

\displaystyle J_{c}(\bar{\pi})-J_{c}(\pi)=\frac{1}{K}\sum_{k=1}^{K}(J_{c}(\pi_{k})-J_{c}(\pi))\leq\mathcal{O}(\epsilon_{stat}+C_{\ell_{2}}^{*}\sqrt{\epsilon_{1}})+\epsilon_{opt}^{\pi}+\frac{V_{\max}}{\lambda}. (48)

B.4 Proof of Theorem 5.6

Theorem (Restatement of Theorem 5.6).

Under Assumptions 3.2 and 3.6, let the reference policy be the behavior policy \mu (which satisfies Assumption 3.7); then with probability at least 1-\delta,

Jr(μ)Jr(π¯)𝒪(ϵstatπ+C2ϵ1)+ϵoptμ\displaystyle J_{r}(\mu)-J_{r}(\bar{\pi})\leq\mathcal{O}\bigg{(}\epsilon_{stat}^{\pi}+C^{*}_{\ell_{2}}\sqrt{\epsilon_{1}}\bigg{)}+\epsilon_{opt}^{\mu} (49)
Jc(π¯)Jc(μ)𝒪(ϵstatπ+C2ϵ1)+ϵoptμ+Vmaxλ,\displaystyle J_{c}(\bar{\pi})-J_{c}(\mu)\leq\mathcal{O}\bigg{(}\epsilon_{stat}^{\pi}+C^{*}_{\ell_{2}}\sqrt{\epsilon_{1}}\bigg{)}+\epsilon_{opt}^{\mu}+\frac{V_{\max}}{\lambda}, (50)

where \epsilon_{stat}^{\pi}:=V_{\max}C^{*}_{\ell_{2}}\sqrt{\frac{\log(|\mathcal{F}||\Pi||\mathcal{W}|/\delta)}{N}}+\frac{V_{\max}B_{w}\log(|\mathcal{F}||\Pi||\mathcal{W}|/\delta)}{N}, and \bar{\pi} is the policy returned by Algorithm 1 with \beta\geq 0 and \mu as input.

Proof.

The proof follows that of Theorem 5.2. When the reference policy is the behavior policy, we have (\text{I})+(\text{II})=0. Therefore we have

\displaystyle(\text{IV})=\mathcal{L}_{\mu}(\pi_{k},f_{r}^{k})-\mathcal{L}_{\mu}(\pi_{k},Q_{r}^{\pi_{k}})
\displaystyle\leq\mathcal{L}_{\mu}(\pi_{k},f_{r}^{k})-\mathcal{L}_{\mu}(\pi_{k},Q_{r}^{\pi_{k}})+\beta\mathcal{E}_{\mathcal{D}}(\pi_{k},f_{r}^{k})
\displaystyle\leq\mathcal{L}_{\mu}(\pi_{k},f_{r}^{k})-\mathcal{L}_{\mu}(\pi_{k},Q_{r}^{\pi_{k}})+\beta\mathcal{E}_{\mathcal{D}}(\pi_{k},f_{r}^{k})-\beta\mathcal{E}_{\mathcal{D}}(\pi_{k},f_{r}^{\pi_{k}})+\beta C_{\ell_{2}}^{*}\sqrt{\epsilon_{1}}+\beta\epsilon_{stat}
\displaystyle\leq\mathcal{L}_{\mathcal{D}}(\pi_{k},f_{r}^{k})+\beta\mathcal{E}_{\mathcal{D}}(\pi_{k},f_{r}^{k})-\mathcal{L}_{\mathcal{D}}(\pi_{k},f_{r}^{\pi_{k}})-\beta\mathcal{E}_{\mathcal{D}}(\pi_{k},f_{r}^{\pi_{k}})+(\beta+1)(\epsilon_{stat}+C_{\ell_{2}}^{*}\sqrt{\epsilon_{1}})
\displaystyle\leq(\beta+1)(\epsilon_{stat}+C_{\ell_{2}}^{*}\sqrt{\epsilon_{1}}).

We finish the proof. ∎

Appendix C Discussion on obtaining the behavior policy

To extract the behavior policy when it is not provided, we can simply run behavior cloning on the offline data. In particular, we can estimate the learned behavior policy π^μ\hat{\pi}_{\mu} as follows: s𝒟,π^μ(a|s)n(s,a)n(s)\forall s\in\mathcal{D},\hat{\pi}_{\mu}(a|s)\leftarrow\frac{n(s,a)}{n(s)}, and s𝒟,π^μ(a|s)1|𝒜|\forall s\notin\mathcal{D},\hat{\pi}_{\mu}(a|s)\leftarrow\frac{1}{|\mathcal{A}|}, where n(s,a)n(s,a) is the number of times (s,a)(s,a) appears in the offline dataset 𝒟\mathcal{D}. Essentially, the estimated BC policy matches the empirical behavior policy on states in the offline dataset and takes uniform random actions outside the support of the dataset. It is easy to show that the gap between the learned policy π^μ\hat{\pi}_{\mu} and the behavior policy πμ\pi_{\mu} is upper bounded by 𝒪(min{1,|𝒮|/N})\mathcal{O}(\min\{1,|\mathcal{S}|/N\}) (Kumar et al.,, 2022; Rajaraman et al.,, 2020). We can have a very accurate estimate as long as the size of the dataset is large enough.
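For concreteness, the count-based estimate described above can be implemented in a few lines. The sketch below is a minimal illustration that assumes finite state and action spaces and a dataset given as an iterable of (state, action) pairs; the function name and data format are our own illustrative choices rather than part of the algorithm.

from collections import Counter

def estimate_behavior_policy(dataset, num_actions):
    """Count-based estimate: empirical frequencies on states in D, uniform elsewhere."""
    n_sa, n_s = Counter(), Counter()  # n(s, a) and n(s)
    for s, a in dataset:
        n_sa[(s, a)] += 1
        n_s[s] += 1

    def pi_hat(a, s):
        if n_s[s] == 0:               # state never appears in the offline dataset
            return 1.0 / num_actions  # uniform random action outside the support
        return n_sa[(s, a)] / n_s[s]  # empirical frequency n(s, a) / n(s)

    return pi_hat

For example, on the toy dataset [(0, 1), (0, 1), (0, 0)] with two actions, the estimate assigns probability 2/3 to action 1 and 1/3 to action 0 in state 0, and probability 1/2 to either action in any unseen state.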

Appendix D Experimental Supplement

D.1 Practical Algorithm

The practical version of our algorithm WSAC is shown in Algorithm 2.

Algorithm 2 WSAC - Practical Version
1:Input: Batch data 𝒟,\mathcal{D}, policy network π,\pi, network for the reward critic fr,f_{r}, network for the cost critic fc,β>0,λ>0.f_{c},\beta>0,\lambda>0.
2:for k=1,2,,Kk=1,2,\dots,K do
3:     Sample minibatch 𝒟mini\mathcal{D}_{\text{mini}} from the dataset 𝒟.\mathcal{D}.
4:     Update Critic Networks:
lreward(fr)\displaystyle l_{\text{reward}}(f_{r}) :=𝒟mini(π,fr)+β𝒟mini(π,fr),\displaystyle:=\mathcal{L}_{\mathcal{D}_{\text{mini}}}(\pi,f_{r})+\beta\mathcal{E}_{\mathcal{D}_{\text{mini}}}(\pi,f_{r}),
fr\displaystyle f_{r} ADAM(frηfastlreward(fr)),\displaystyle\leftarrow\text{ADAM}(f_{r}-\eta_{\text{fast}}\nabla l_{\text{reward}}(f_{r})),
lcost(fc)\displaystyle l_{\text{cost}}(f_{c}) :=λ𝒟mini(π,fc)+β𝒟mini(π,fc),\displaystyle:=-\lambda\mathcal{L}_{\mathcal{D}_{\text{mini}}}(\pi,f_{c})+\beta\mathcal{E}_{\mathcal{D}_{\text{mini}}}(\pi,f_{c}),
fc\displaystyle f_{c} ADAM(fcηfastlcost(fc)).\displaystyle\leftarrow\text{ADAM}(f_{c}-\eta_{\text{fast}}\nabla l_{\text{cost}}(f_{c})).
5:     Update Policy Network:
\displaystyle l_{\text{actor}}(\pi):=-\mathcal{L}_{\mathcal{D}_{\text{mini}}}(\pi,f_{r})+\lambda\{\mathcal{L}_{\mathcal{D}_{\text{mini}}}(\pi,f_{c})\}_{+},
π\displaystyle\pi ADAM(πηslowlactor(π)).\displaystyle\leftarrow\text{ADAM}(\pi-\eta_{\text{slow}}\nabla l_{\text{actor}}(\pi)).
6:end for
7:Output: π\pi
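To make the update rules in Algorithm 2 concrete, the following is a minimal PyTorch-style sketch of one WSAC update step under several simplifying assumptions: a Gaussian policy for continuous control, critics f_r and f_c implemented as networks mapping a concatenated state-action vector to a scalar, a minibatch given as tensors (s, a, r, c, s'), and the choice \mathcal{W}=\{0,C_{\infty}\} so that the weighted Bellman error reduces to a scaled squared TD error (see Appendix D.3). Network sizes, function names, and the default hyperparameter values are illustrative and do not reproduce the exact implementation used in our experiments.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class GaussianActor(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = mlp(s_dim, 2 * a_dim)
    def forward(self, s):
        mean, log_std = self.net(s).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

def wsac_update(batch, actor, f_r, f_c, opt_actor, opt_r, opt_c,
                gamma=0.99, beta_r=10.0, beta_c=30.0, lam=5.0, c_inf=1.0):
    s, a, r, c, s2 = batch                      # minibatch tensors; r and c of shape (B, 1)
    sa = torch.cat([s, a], dim=-1)
    with torch.no_grad():                       # policy actions used inside the critic losses
        a_pi, a2_pi = actor(s).sample(), actor(s2).sample()

    def critic_loss(f, signal, sign, beta):
        # L_D(pi, f): advantage of the policy action over the dataset action.
        l_d = (f(torch.cat([s, a_pi], dim=-1)) - f(sa)).mean()
        # E_D(pi, f) with W = {0, C_inf}: scaled squared Bellman (TD) error.
        td = f(sa) - (signal + gamma * f(torch.cat([s2, a2_pi], dim=-1)))
        return sign * l_d + beta * c_inf * td.pow(2).mean()

    # Step 4: adversarially trained critics with small weighted Bellman errors.
    loss_r = critic_loss(f_r, r, sign=1.0, beta=beta_r)    # l_reward
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()
    loss_c = critic_loss(f_c, c, sign=-lam, beta=beta_c)   # l_cost
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # Step 5: the actor maximizes the reward advantage and penalizes positive cost advantage.
    a_new = actor(s).rsample()
    adv_r = (f_r(torch.cat([s, a_new], dim=-1)) - f_r(sa)).mean()
    adv_c = (f_c(torch.cat([s, a_new], dim=-1)) - f_c(sa)).mean()
    loss_actor = -adv_r + lam * torch.clamp(adv_c, min=0.0)
    opt_actor.zero_grad(); loss_actor.backward(); opt_actor.step()

A full training loop would repeatedly sample minibatches from \mathcal{D} and call wsac_update with three Adam optimizers, using the fast learning rate for the two critics and the slow one for the actor, as in Algorithm 2.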

D.2 Environments Description

Besides the “BallCircle” environment, we also study several representative environments, described below. All of them are shown in Figure 2, and their offline datasets are from Liu et al., (2023a).

  • CarCircle: This environment requires the car to move clockwise along a circle while staying within the safety zone defined by the boundaries. The car is a four-wheeled agent based on MIT's race car. The reward is dense and increases with the car's velocity and with its proximity to the boundary of the circle; a cost is incurred if the agent leaves the safety zone defined by the two yellow boundaries, the same as in “BallCircle”.

  • PointButton: This environment requires the point to navigate to the goal button and touch it while avoiding gremlins and hazards. The point has two actuators, one for rotation and one for forward/backward movement. The reward consists of two parts: one based on the distance between the agent and the goal button, and one for reaching and touching the goal button. A cost is incurred if the agent enters the hazardous areas, contacts the gremlins, or presses the wrong button.

  • PointPush: This environment requires the point to push a box to the goal while circumventing hazards and pillars. The objects lie in a 2D plane, and the point agent is the same as in “PointButton”. It has a small square in front of it, which makes it easier to determine its orientation visually and also helps the point push the box.

Figure 2: BallCircle and CarCircle (left), PointButton (middle), PointPush (right).

D.3 Implementation Details and Experimental settings

We run all the experiments with an NVIDIA GeForce RTX 3080 Ti GPU and an 8-core processor.

The normalized reward and cost are summarized as follows:

Rnormalized=Rπrmin()rmax()rmin()\displaystyle R_{normalized}=\frac{R_{\pi}-r_{min}(\mathcal{M})}{r_{max}(\mathcal{M})-r_{min}(\mathcal{M})} (51)
Cnormalized=Cπ+ϵκ+ϵ,\displaystyle C_{normalized}=\frac{C_{\pi}+\epsilon}{\kappa+\epsilon}, (52)

where r_{min}(\mathcal{M}) and r_{max}(\mathcal{M}) are the minimum and maximum empirical rewards for task \mathcal{M}, \kappa is the cost threshold, and \epsilon is a small constant that ensures numerical stability. Thus any normalized cost below 1 is considered safe. We use R_{\pi} and C_{\pi} to denote the cumulative reward and cost of the evaluated policy, respectively. The parameters r_{min}(\mathcal{M}), r_{max}(\mathcal{M}), and \kappa are environment-dependent constants whose values are listed in Table 3. We remark that the normalized reward and cost are used only for reporting performance and are not used in the training process.
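For reference, the normalization in Eqs. (51)–(52) amounts to the small helper below; the example call uses the BallCircle constants from Table 3, while the return values passed in are placeholders for illustration only.

def normalize(reward_return, cost_return, r_min, r_max, kappa, eps=1e-8):
    """Normalized reward and cost as in Eqs. (51)-(52); c_norm <= 1 is considered safe."""
    r_norm = (reward_return - r_min) / (r_max - r_min)
    c_norm = (cost_return + eps) / (kappa + eps)
    return r_norm, c_norm

# Example with the BallCircle constants of Table 3 (placeholder returns):
r_norm, c_norm = normalize(500.0, 20.0, r_min=0.3831, r_max=881.4633, kappa=40)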

Table 3: Hyperparameters of WSAC
Parameters BallCircle CarCircle PointButton PointPush
βc\beta_{c} 30.0 38.0 30.0 30.0
βr\beta_{r} 10.0 12.0 10.0 10.0
UBQCUB_{Q_{C}} 30.0 28.0 32.0 30.0
λ\lambda [1.0, 20.0]
Batch size 512
Actor learning rate 0.0001
Critic learning rate 0.0003
κ\kappa 40
rmin()r_{min}(\mathcal{M}) 0.3831 3.4844 0.0141 0.0012
rmax()r_{max}(\mathcal{M}) 881.4633 534.3061 42.8986 14.6910

To mitigate the risk of unsafe scenarios, we introduce a hyperparameter UB_{Q_{C}} that serves as an upper bound (an intentional overestimate) of the cost Q-function when computing the actor loss. We use two separate coefficients \beta_{r} and \beta_{c} for the reward and cost Q-functions to make the algorithm more flexible.

Figure 3: Evaluation results on (a) BallCircle, (b) CarCircle, (c) PointButton, and (d) PointPush. The moving average of evaluation results is recorded every 500 training steps, with each result representing the average over 20 evaluation episodes and three random seeds. A cost threshold of 1 is applied, with any normalized cost below 1 considered safe.

We use different \beta values for the reward and cost critic networks and a separate UB_{Q_{C}} for the actor network to make the adversarial training more stable. We also let the key parameter \lambda vary within a certain range to balance reward and cost during training. Their values are shown in Table 3. In the experiments, we take \mathcal{W}=\{0,C_{\infty}\} for computational efficiency. Then \mathcal{E}_{\mathcal{D}}(\pi,f) reduces to C_{\infty}\mathbb{E}_{\mathcal{D}}[(f(s,a)-r-\gamma f(s^{\prime},\pi))^{2}] and \hat{\mathcal{E}}_{\mathcal{D}}(\pi,f) reduces to C_{\infty}\mathbb{E}_{\mathcal{D}}[(f(s,a)-c-\gamma f(s^{\prime},\pi))^{2}]. In this case, C_{\infty} can be absorbed into the hyperparameter \beta_{r} (\beta_{c}).

D.4 Experimental results details and supplements

The evaluation performance of the agents in each environment after 30000 update steps of training is shown in Table 2, and the average rewards and costs are shown in Figure 3. From the results, we observe that WSAC achieves the best reward performance with significantly lower costs than all baselines. This suggests that WSAC learns a safe and efficient policy and achieves a steady improvement by leveraging the offline dataset.

D.5 Simulations under different cost limits

To further evaluate our algorithm under varying conditions, we compare it with the baselines under different cost limits and report the average performance in Table 4. Specifically, cost limits of [10, 20, 40] are used for the BallCircle and CarCircle environments and [20, 40, 80] for the PointButton and PointPush environments, following the standard setup outlined by Liu et al., (2023a). Our results demonstrate that WSAC maintains safety across all environments, and its performance is comparable to or better than the best baseline in each case. This suggests that WSAC is well suited to tasks of varying difficulty.

Table 4: The normalized reward and cost of WSAC and other baselines for different cost limits. Each value is averaged over 3 distinct cost limits, 20 evaluation episodes, and 3 random seeds. The Average row reports the average over the four environments. The cost threshold is 1. Gray: unsafe agent whose normalized cost is greater than 1. Blue: safe agent with the best performance. The performance of all baselines is reported according to the results in Liu et al., (2023a).
BC Safe-BC CDT BCQL BEARL CPQ COptiDICE WSAC
Reward \uparrow Cost \downarrow Reward \uparrow Cost \downarrow Reward \uparrow Cost \downarrow Reward \uparrow Cost \downarrow Reward \uparrow Cost \downarrow Reward \uparrow Cost \downarrow Reward \uparrow Cost \downarrow Reward \uparrow Cost \downarrow
BallCircle 0.74 4.71 0.52 0.65 0.77 1.07 0.69 2.36 0.86 3.09 0.64 0.76 0.70 2.61 0.74 0.51
CarCircle 0.58 3.74 0.50 0.84 0.75 0.95 0.63 1.89 0.74 2.18 0.71 0.33 0.49 3.14 0.65 0.55
PointButton 0.27 2.02 0.16 1.10 0.46 1.57 0.40 2.66 0.43 2.47 0.58 4.30 0.15 1.51 0.11 0.55
PointPush 0.18 0.91 0.11 0.80 0.21 0.65 0.23 0.99 0.16 0.89 0.11 1.04 0.02 1.18 0.07 0.61
Average 0.44 2.85 0.32 0.85 0.55 1.06 0.49 1.98 0.55 2.16 0.51 1.61 0.34 2.11 0.39 0.56

D.6 Ablation studies

To investigate the contribution of each component of our algorithm, including the weighted Bellman regularizer, the aggression-limited objective, and the no-regret policy optimization (which together guarantee our theoretical results), we performed an ablation study in the tabular setting. The results, presented in Table 5, indicate that the weighted Bellman regularization ensures the safety of the algorithm, while the aggression-limited objective fine-tunes the algorithm to achieve higher rewards without compromising safety.

Table 5: Ablation study under tabular case (cost limit is 0.1) over 10 repeat experiments
Components cost reward
ALL 0.014 ±0.006\pm 0.006 0.788±0.0040.788\pm 0.004
W/O no-regret policy optimization 0.014±0.0060.014\pm 0.006 0.788±0.0040.788\pm 0.004
W/O Aggression-limited objective 0.014±0.0060.014\pm 0.006 0.788±0.0050.788\pm 0.005
W/O Weighted Bellman regularizer 0.323±0.0610.323\pm 0.061 0.684±0.0170.684\pm 0.017

D.7 Sensitivity Analysis of Hyper-Parameters

Figure 4: Sensitivity analysis of hyperparameters in the tabular case. The left panel shows tests conducted with various \beta values (for ease of discussion, we denote \beta=\beta_{r}=\beta_{c}) with \lambda=[0,2], while the right panel presents tests across different ranges of \lambda with \beta_{r}=\beta_{c}=2.0.

In Figure 4 we report the rewards and costs under different settings of \beta_{r}=\beta_{c}\in\{1,0.5,0.05\} and \lambda\in\{[0,1],[0,2],[1,2]\} (since \lambda only increases during training, each closed interval gives its initial value and upper bound) to demonstrate the robustness of our approach in the tabular setting. We observe that the performance is almost identical across the different parameter settings and different qualities of behavior policies.