Adversarially Trained Weighted Actor-Critic for Safe Offline Reinforcement Learning
Abstract
We propose WSAC (Weighted Safe Actor-Critic), a novel algorithm for Safe Offline Reinforcement Learning (RL) under functional approximation, which can robustly optimize policies to improve upon an arbitrary reference policy with limited data coverage. WSAC is designed as a two-player Stackelberg game to optimize a refined objective function. The actor optimizes the policy against two adversarially trained value critics with small importance-weighted Bellman errors, which focus on scenarios where the actor's performance is inferior to the reference policy. In theory, we demonstrate that when the actor employs a no-regret optimization oracle, WSAC achieves a number of guarantees: For the first time in the safe offline RL setting, we establish that WSAC can produce a policy that outperforms any reference policy while maintaining the same level of safety, which is critical to designing a safe algorithm for offline RL. WSAC achieves the optimal statistical convergence rate of $1/\sqrt{N}$ with respect to the reference policy, where $N$ is the size of the offline dataset. We theoretically show that WSAC guarantees a safe policy improvement across a broad range of hyperparameters that control the degree of pessimism, indicating its practical robustness. Additionally, we offer a practical version of WSAC and compare it with existing state-of-the-art safe offline RL algorithms in several continuous control environments. WSAC outperforms all baselines across a range of tasks, supporting the theoretical results.
1 Introduction
Online safe reinforcement learning (RL) has found successful applications in various domains, such as autonomous driving (Isele et al., 2018), recommender systems (Chow et al., 2017), and robotics (Achiam et al., 2017). It enables the effective learning of safe policies while satisfying certain safety constraints, including collision avoidance, budget adherence, and reliability. However, collecting diverse interaction data can be extremely costly and infeasible in many real-world applications, and this challenge becomes even more critical in scenarios where risky behavior cannot be tolerated. Given the inherently risk-sensitive nature of these safety-related tasks, data collection becomes feasible only when the employed behavior policies satisfy all the safety requirements.
To overcome the limitations imposed by interactive data collection, offline RL algorithms are designed to learn a policy from an available dataset of historical experiences collected by some behavior policy, which may differ from the policy we aim to learn. A desirable property of an effective offline algorithm is the assurance of robust policy improvement (RPI), which guarantees that a learned policy is always at least as good as the baseline behavior policies (Fujimoto et al., 2019; Laroche et al., 2019; Kumar et al., 2019; Siegel et al., 2020; Chen et al., 2022a; Zhu et al., 2023; Bhardwaj et al., 2024). We extend RPI to offline safe RL, calling the resulting property safe robust policy improvement (SRPI), which requires that the improvement be safe as well. This is particularly important in offline safe RL. For example, in autonomous driving, an expert human driver operates the vehicle to collect a diverse dataset under various road and weather conditions, serving as the behavior policy. This policy is considered both effective and safe, as it demonstrates proficient human driving behavior while adhering to all traffic laws and other safety constraints. Achieving a policy that upholds the SRPI property with such a dataset can significantly mitigate the likelihood of potential collisions and other safety concerns.

In offline RL, we represent the ratio between the state-action occupancy distribution of a policy $\pi$ and the dataset distribution $\mu$ by $w^\pi(s,a) = d^\pi(s,a)/\mu(s,a)$. A commonly required assumption is that the concentrability is bounded, defined as the infinity norm of $w^\pi$ over all policies $\pi$ (Liu et al., 2019; Chen and Jiang, 2019; Wang et al., 2019; Liao et al., 2022; Zhang et al., 2020). A stronger assumption requires a uniform lower bound on the data distribution $\mu(s,a)$ (Xie and Jiang, 2021). However, such all-policy concentrability assumptions are difficult to satisfy in practice, particularly for offline safe RL, as they essentially require the offline dataset to have good coverage of all unsafe state-action pairs. To remove the full-coverage requirement, other works (Rashidinejad et al., 2021; Zhan et al., 2022; Chen and Jiang, 2022; Xie et al., 2021; Uehara and Sun, 2021) adopt conservative algorithms based on the principle of pessimism in the face of uncertainty, which reduces the assumption to concentrability of the best-covered (or optimal) policy. Zhu et al. (2023) introduce $\ell_2$ concentrability to further relax the assumption, showing that $\ell_\infty$ concentrability is always an upper bound on $\ell_2$ concentrability (see Table 1 for detailed comparisons with previous works). While such provable guarantees under single-policy concentrability exist for unconstrained MDPs, as Table 1 suggests, in the safe RL setting all existing studies (Hong et al., 2024; Le et al., 2019) still require coverage of all policies. Further, as Table 1 suggests, these papers do not guarantee robust safe policy improvement. Our main contributions are summarized below:
1. We prove that our algorithm, which uses an average Bellman error, enjoys an optimal statistical rate of $1/\sqrt{N}$ under a partial data coverage assumption. This is the first work that achieves such a result in safe offline RL using only single-policy concentrability.
2. We propose a novel offline safe RL algorithm, called Weighted Safe Actor-Critic (WSAC), which can robustly learn policies that improve upon any behavior policy with controlled relative pessimism. We prove that, under standard function approximation assumptions, when the actor incorporates a no-regret policy optimization oracle, WSAC outputs a policy that never degrades the performance of a reference policy (including the behavior policy) for a wide range of hyperparameters (defined later). This is the first work that provably demonstrates the property of SRPI in the offline safe RL setting.
3. We point out that primal-dual-based approaches (Hong et al., 2024) may require an all-policy concentrability assumption. Thus, unlike the primal-dual-based approach, we propose a novel rectified penalty-based approach that obtains results using single-policy concentrability, which requires novel analysis techniques.
4. Furthermore, we provide a practical implementation of WSAC following a two-timescale actor-critic framework with adversarial training similar to Cheng et al. (2022); Zhu et al. (2023), and test it on several continuous control environments in the offline safe RL benchmark (Liu et al., 2023a). WSAC outperforms all other state-of-the-art baselines, validating the property of safe policy improvement.
Algorithm | Safe RL | Coverage assumption | Policy Improvement | Suboptimality |
Xie and Jiang, (2021) | No | all-policy, $\ell_\infty$ | Yes | |
Xie et al., (2021) | No | single-policy | Yes | |
Cheng et al., (2022) | No | single-policy | Yes & Robust | |
Ozdaglar et al., (2023) | No | single-policy | No | |
Zhu et al., (2023) | No | single-policy, $\ell_2$ | Yes & Robust | |
Le et al., (2019) | Yes | all-policy | No | |
Hong et al., (2024) | Yes | all-policy | No | |
Ours | Yes | single-policy, $\ell_2$ | Yes & Robust | |
2 Related Work
Offline safe RL: Deep offline safe RL algorithms (Lee et al., 2022; Liu et al., 2023b; Xu et al., 2022; Chen et al., 2021; Zheng et al., 2024) have shown strong empirical performance but lack theoretical guarantees. To the best of our knowledge, the investigation of policy improvement properties in offline safe RL is relatively rare in the state-of-the-art offline RL literature. Wu et al. (2021) focus on the offline constrained multi-objective Markov Decision Process (CMOMDP) and demonstrate that an optimal policy can be learned when there is sufficient data coverage. However, although they show that CMDP problems can be formulated as CMOMDP problems, they assume a linear kernel CMOMDP, whereas our consideration extends to a more general function approximation setting. Le et al. (2019) propose a model-based primal-dual-type algorithm with deviation control for offline safe RL in the tabular setting. With prior knowledge of the slackness in Slater's condition and a constant bound on the concentrability coefficient, an $\epsilon$-PAC error is achievable when the number of data samples is sufficiently large. These assumptions make the algorithm impractical, and their computational complexity is much higher than ours. Additionally, we consider a more practical, model-free function approximation setting. In another concurrent work (Hong et al., 2024), a primal-dual critic algorithm is proposed for the offline constrained RL setting with general function approximation. However, their algorithm requires concentrability for all policies, which is not practical, as discussed. The reason is that the dual-variable optimization in their primal-dual design requires an accurate evaluation of every policy encountered in each episode, which necessitates coverage over all policies. Moreover, they cannot guarantee the property of SRPI. Furthermore, their algorithm requires an additional offline policy evaluation (OPE) oracle, making it less efficient.
3 Preliminaries
3.1 Constrained Markov Decision Process
We consider a Constrained Markov Decision Process (CMDP) denoted by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, c, \gamma, \mu_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P: \mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the transition kernel, where $\Delta(\mathcal{S})$ is the probability simplex over $\mathcal{S}$, $r: \mathcal{S}\times\mathcal{A}\to[0,1]$ is the reward function, $c: \mathcal{S}\times\mathcal{A}\to[0,1]$ is the cost function, $\gamma\in(0,1)$ is the discount factor, and $\mu_0$ is the initial state distribution. We assume $\mathcal{A}$ is finite while allowing $\mathcal{S}$ to be arbitrarily complex. We use $\pi: \mathcal{S}\to\Delta(\mathcal{A})$ to denote a stationary policy, which specifies a distribution over actions for each state. At each time $t$, the agent observes a state $s_t$, takes an action $a_t$ according to a policy $\pi(\cdot\mid s_t)$, and receives a reward $r(s_t,a_t)$ and a cost $c(s_t,a_t)$. Then the CMDP moves to the next state $s_{t+1}\sim P(\cdot\mid s_t,a_t)$. Given a policy $\pi$, we use $V_r^\pi(s)$ and $V_c^\pi(s)$ to denote the expected discounted return and the expected cumulative discounted cost of $\pi$, starting from state $s$, respectively:
$$V_r^\pi(s) := \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t)\,\middle|\, s_0 = s\right], \qquad (1)$$
$$V_c^\pi(s) := \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^t c(s_t, a_t)\,\middle|\, s_0 = s\right]. \qquad (2)$$
Accordingly, we also define the action-value ($Q$-) functions of a policy $\pi$ for the reward and cost as
$$Q_r^\pi(s,a) := \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t)\,\middle|\, s_0 = s, a_0 = a\right], \qquad (3)$$
$$Q_c^\pi(s,a) := \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^t c(s_t, a_t)\,\middle|\, s_0 = s, a_0 = a\right], \qquad (4)$$
respectively. As rewards and costs are bounded, we have $V_\diamond^\pi(s) \in [0, \frac{1}{1-\gamma}]$ and $Q_\diamond^\pi(s,a) \in [0, \frac{1}{1-\gamma}]$ for $\diamond \in \{r, c\}$. We let $V_\diamond^\pi(\mu_0) := \mathbb{E}_{s\sim\mu_0}[V_\diamond^\pi(s)]$ to simplify the notation. We further write
$$J_r(\pi) := (1-\gamma)\,V_r^\pi(\mu_0), \qquad J_c(\pi) := (1-\gamma)\,V_c^\pi(\mu_0)$$
to represent the normalized average reward/cost value of policy $\pi$. In addition, we use
$$d^\pi(s,a) := (1-\gamma)\,\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^t\,\mathbb{1}(s_t = s, a_t = a)\right]$$
to denote the normalized and discounted state-action occupancy measure of the policy $\pi$, where $\mathbb{1}(\cdot)$ is the indicator function. We also use $d^\pi(s)$ to denote the discounted state occupancy, and we use $\mathbb{E}_{d^\pi}[\cdot]$ as a shorthand of $\mathbb{E}_{(s,a)\sim d^\pi}[\cdot]$ or $\mathbb{E}_{s\sim d^\pi,\,a\sim\pi(\cdot\mid s)}[\cdot]$ to denote the expectation with respect to $d^\pi$. Thus $J_\diamond(\pi) = \mathbb{E}_{d^\pi}[\diamond(s,a)]$ for $\diamond \in \{r, c\}$. The objective in safe RL is to find a policy $\pi$ such that
$$\max_{\pi}\; J_r(\pi) \quad \text{s.t.} \quad J_c(\pi) \le \tau, \qquad (5)$$
where $\tau \ge 0$ is a given cost threshold.
Remark 3.1.
For ease of exposition, this paper exclusively focuses on a single constraint. However, it is readily extendable to accommodate multiple constraints.
3.2 Function Approximation
In complex environments, the state space is usually very large or even infinite. We assume access to a policy class $\Pi$ consisting of all candidate policies from which we can search for good policies. We also assume access to a value function class $\mathcal{F} \subseteq \{f: \mathcal{S}\times\mathcal{A}\to[0,\frac{1}{1-\gamma}]\}$ to model the reward $Q$-functions, and a class $\mathcal{G} \subseteq \{g: \mathcal{S}\times\mathcal{A}\to[0,\frac{1}{1-\gamma}]\}$ to model the cost $Q$-functions of candidate policies. We further assume access to a function class $\mathcal{W} \subseteq \{w: \mathcal{S}\times\mathcal{A}\to[0,\infty)\}$ that represents marginalized importance weights with respect to the offline data distribution. Without loss of generality, we assume that the all-one function is contained in $\mathcal{W}$.
For a given policy $\pi$, we denote $f(s,\pi) := \mathbb{E}_{a\sim\pi(\cdot\mid s)}[f(s,a)]$ for any $f: \mathcal{S}\times\mathcal{A}\to\mathbb{R}$. The Bellman operator for the reward is defined as
$$(\mathcal{T}_r^\pi f)(s,a) := r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[f(s',\pi)\big].$$
The Bellman operator for the cost is
$$(\mathcal{T}_c^\pi g)(s,a) := c(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[g(s',\pi)\big].$$
Let $\|\cdot\|_{2,\nu} := \sqrt{\mathbb{E}_{\nu}[(\cdot)^2]}$ denote the Euclidean norm weighted by a distribution $\nu$. We make the following standard assumptions in the offline RL setting (Xie et al., 2021; Cheng et al., 2022; Zhu et al., 2023) on the representation power of the function classes:

Assumption 3.2 (Approximate Realizability).

Assume there exists $\epsilon_{\mathcal{F}} \ge 0$ s.t. for any given policy $\pi \in \Pi$ we have $\min_{f\in\mathcal{F}} \sup_{\nu} \|f - \mathcal{T}_r^\pi f\|_{2,\nu}^2 \le \epsilon_{\mathcal{F}}$ and $\min_{g\in\mathcal{G}} \sup_{\nu} \|g - \mathcal{T}_c^\pi g\|_{2,\nu}^2 \le \epsilon_{\mathcal{F}}$, where $\nu$ is the state-action distribution of any admissible policy, i.e., $\nu = d^{\pi'}$ for some policy $\pi'$.
3.3 Offline RL
In offline RL, we assume that the available offline dataset $\mathcal{D} = \{(s_i, a_i, r_i, c_i, s_i')\}_{i=1}^{N}$ consists of $N$ samples. The samples are i.i.d. (a common assumption in both the unconstrained setting (Cheng et al., 2022) and the constrained setting (Hong et al., 2024)), and the distribution of each tuple is specified by a distribution $\mu$, which is also the discounted state-action visitation distribution of a behavior policy (also denoted by $\mu$ for simplicity). We use $a \sim \mu$ to denote that the action is drawn using the behavior policy, and $(s,a,r,c,s') \sim \mu$ to denote that $(s,a)\sim\mu$, $r = r(s,a)$, $c = c(s,a)$, and $s' \sim P(\cdot\mid s,a)$.
For a given policy $\pi$, we define the marginalized importance weights $w^\pi(s,a) := \frac{d^\pi(s,a)}{\mu(s,a)}$, which is the ratio between the discounted state-action occupancy of $\pi$ and the data distribution $\mu$. This ratio can be used to measure the concentrability of the data coverage (Xie and Jiang, 2020; Zhan et al., 2022; Rashidinejad et al., 2022; Ozdaglar et al., 2023; Lee et al., 2021).
In this paper we study offline RL with access to a dataset with limited coverage. The coverage of a policy $\pi$ in the dataset can be measured by the weighted ($\ell_2$) single-policy concentrability coefficient (Zhu et al., 2023; Yin and Wang, 2021; Uehara et al., 2024; Hong et al., 2024):
Definition 3.3 ($\ell_2$ Concentrability).

Given a policy $\pi$, define $C_{\ell_2}^\pi := \|w^\pi\|_{2,\mu} = \sqrt{\mathbb{E}_{(s,a)\sim\mu}\big[(w^\pi(s,a))^2\big]}$.
Remark 3.4.
The definition here is much weaker than the all-policy concentrability used in offline RL (Chen and Jiang, 2019) and safe offline RL (Le et al., 2019; Hong et al., 2024), which requires the ratio $d^\pi(s,a)/\mu(s,a)$ to be bounded for all $(s,a)$ and all policies $\pi$. In particular, the all-policy concentrability assumption essentially requires the dataset to have full coverage of all policies (nearly all state-action pairs). This requirement is often violated in practical scenarios, and it is essentially impossible to meet in safe offline RL because it would require collecting data from every dangerous state and action, which is clearly impractical.
In the following lemma, we compare two variants of the single-policy concentrability definition with the $\ell_2$ version defined in Definition 3.3.
Lemma 1 (Restatement of a proposition in Zhu et al., (2023)).
Remark 3.5.
It is easy to observe that the $\ell_2$ variant $C_{\ell_2}^\pi$ is bounded by $C_\infty^\pi$, and can be much smaller in some cases. There is an example in Zhu et al. (2023) showing that $C_{\ell_2}^\pi$ is bounded by a constant while $C_\infty^\pi$ can be arbitrarily large. When the function class is highly expressive, the other (Bellman-residual-based) variant can be close to $C_{\ell_2}^\pi$. Intuitively, bounding $C_{\ell_2}^\pi$ only requires the second moment $\mathbb{E}_\mu[(w^\pi)^2]$ to be bounded, rather than requiring $w^\pi(s,a)$ to be bounded for all $(s,a)$ as in the $\ell_\infty$ concentrability bound.
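To make the comparison concrete, here is a small numerical illustration (not from the paper) on four state-action pairs, assuming the standard definitions $C_\infty^\pi = \max_{s,a} w^\pi(s,a)$ and $C_{\ell_2}^\pi = \|w^\pi\|_{2,\mu}$; the $\ell_2$ coefficient is never larger than the $\ell_\infty$ one and can be far smaller when the data leave only a small region poorly covered.

```python
import numpy as np

mu = np.array([0.50, 0.30, 0.15, 0.05])    # offline data distribution (illustrative)
d_pi = np.array([0.10, 0.20, 0.30, 0.40])  # occupancy of the comparator policy (illustrative)

w = d_pi / mu                              # importance weights w^pi = d^pi / mu
c_inf = w.max()                            # ell-infinity (all state-action) concentrability
c_l2 = np.sqrt(np.sum(mu * w ** 2))        # ell-2 (weighted) concentrability

print(f"C_inf = {c_inf:.2f}, C_l2 = {c_l2:.2f}")  # here: C_inf = 8.00, C_l2 = 1.99
```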
Given the definition of the concentrability, we make the following assumption on the weight function class and a single-policy realizability:
Assumption 3.6 (Boundedness of $\mathcal{W}$ in $\ell_2(\mu)$ norm).

For all $w \in \mathcal{W}$, assume that $\|w\|_{2,\mu}$ is bounded by a known constant.
Assumption 3.7 (Single-policy realizability of $w^\pi$).

For some policy $\pi$ that we would like to compete with, assume that $w^\pi \in \mathcal{W}$.
In this paper, we study robust policy improvement over an arbitrary reference policy, so we assume that we are provided with a reference policy $\pi_{\text{ref}}$. Note that in many applications (e.g., scheduling, networking) we indeed have a reference policy, and we want a sophisticated RL policy to do better than it while remaining safe. This is one of the main motivations behind this assumption.
Assumption 3.8 (Reference Policy).
We assume access to a reference policy $\pi_{\text{ref}}$, which can be queried at any state.
In many applications such as networking, scheduling, and control problems, good enough reference policies already exist. In these cases, a robust and safe policy improvement over these reference policies has practical value. If $\pi_{\text{ref}}$ is not provided, we can simply run behavior cloning on the offline data to extract the behavior policy, which can be estimated accurately as long as the offline dataset is large enough. More discussion can be found in Appendix C.
4 Actor-Critic with Importance Weighted Bellman Error
Our algorithm design builds upon the constrained actor-critic method, in which we iteratively optimize a policy and improve it based on evaluations of the reward and cost. Consider the following actor-critic approach for solving the optimization problem (5):

Actor: $\quad \hat{\pi} \leftarrow \arg\max_{\pi\in\Pi}\; f(s_0, \pi) \quad \text{s.t.} \quad g(s_0, \pi) \le \tau,$

Critic: $\quad f \leftarrow \arg\min_{f\in\mathcal{F}} \mathbb{E}_\mu\big[\big(f(s,a) - r - \gamma f(s',\pi)\big)^2\big], \qquad g \leftarrow \arg\min_{g\in\mathcal{G}} \mathbb{E}_\mu\big[\big(g(s,a) - c - \gamma g(s',\pi)\big)^2\big],$

where we assume that $s_0$ is a fixed initial state and $f, g$ are the reward and cost critics, respectively. The policy is optimized by maximizing the reward critic while ensuring that the cost critic satisfies the constraint, and the two critics are trained by minimizing the squared Bellman error. However, this formulation has several disadvantages. It cannot handle insufficient data coverage, and may therefore fail to provide an accurate evaluation of the policy on unseen states and actions. It cannot guarantee robust policy improvement. Finally, the actor training step is computationally intractable, especially when the policy space is extremely large.
To address the insufficient data coverage issue, as mentioned in Xie et al. (2021), the critic can include a Bellman-consistent pessimistic evaluation of the policy, which selects the most pessimistic value function that approximately satisfies the Bellman equation; this is called absolute pessimism. Later, Cheng et al. (2022) showed that, instead of absolute pessimism, a relative pessimism approach that competes against the behavior policy can obtain a robust improvement over the behavior policy. However, this kind of approach only achieves a suboptimal statistical rate and fails to achieve the optimal rate of $1/\sqrt{N}$; a weighted average Bellman error (Uehara et al., 2020; Xie and Jiang, 2020; Zhu et al., 2023) was later identified as one possible way of improving the rate. We remark that all the discussions here concern traditional unconstrained offline RL. Regarding safety, no existing efficient algorithm in safe offline RL has theoretically demonstrated the property of robust policy improvement with the optimal statistical rate.
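As a rough illustration of the distinction (hypothetical tensors and a single batch, not the paper's code), the following sketch contrasts the squared Bellman surrogate with an importance-weighted average Bellman error of the form $\mathbb{E}_\mu[w(s,a)(f(s,a) - r - \gamma f(s',a'))]$, which is a simple, unbiased estimate of an average Bellman error and is the kind of quantity the regularizers below keep small.

```python
import torch

gamma, batch = 0.99, 256
f_sa    = torch.randn(batch)   # critic values f(s, a) on a sampled batch (placeholder)
f_next  = torch.randn(batch)   # critic values f(s', a') with a' ~ pi(.|s') (placeholder)
rewards = torch.rand(batch)    # observed rewards (placeholder)
weights = torch.rand(batch)    # candidate importance weights w(s, a) from the class W (placeholder)

td_residual = f_sa - rewards - gamma * f_next

# Weighted average Bellman error: unbiased estimate of the Bellman error averaged
# under the distribution that w reweights the data towards.
weighted_avg_bellman = (weights * td_residual).mean()

# Classical squared Bellman surrogate: biased upward by the variance of the TD target.
squared_bellman = (td_residual ** 2).mean()
```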
Can primal-dual-based approaches achieve results using only single-policy coverage? The most commonly used approach for addressing safe RL problems is primal-dual optimization (Efroni et al., 2020; Altman, 1999). As shown in the current offline safe RL literature (Hong et al., 2024; Le et al., 2019), the policy can be optimized by maximizing a new unconstrained "reward" function $r(s,a) - \lambda\, c(s,a)$, where $\lambda \ge 0$ is a dual variable; the dual variable is then tuned by taking gradient descent steps. As we discussed in the introduction, all of these approaches require all-policy concentrability, which is not practical, especially for safe RL. An important question is whether the all-policy concentrability assumption can be relaxed. Note that the primal-dual algorithm relies on solving a min-max problem over the policy and the dual variable. A recent result (Cui and Du, 2022) shows that a single-policy concentrability assumption is not enough for offline min-max games. Hence, we conjecture that the all-policy concentrability assumption cannot be relaxed using the primal-dual method. Intuitively, primal-dual-based methods (Hong et al., 2024) rely on bounding the regret in the dual domain; hence, all the policies encountered throughout the iterations must be supported by the dataset in order to accurately evaluate the dual objective and compare it against the optimal dual value.
Our novelty: In contrast, we propose an aggression-limited objective function that controls aggressive policies via a rectified penalty. The high-level intuition behind this aggression-limited objective is that, by appropriately selecting a penalty coefficient (usually large enough), we penalize all policies that are not safe. As a result, the policy that maximizes the objective function is the optimal safe policy. This formulation is fundamentally different from the traditional primal-dual approach: it does not require dual-variable tuning, and thus does not require all-policy concentrability. In particular, we only need to bound the regret in the primal domain, which can be done as long as the reference policy is covered by the dataset, similar to the unconstrained setup. A toy illustration of the rectified penalty is sketched below.
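The following toy sketch (hypothetical scalar estimates and a generic penalty coefficient `beta`, not the paper's exact objective) illustrates the rectified-penalty idea: the reward value of a candidate policy is penalized only when its estimated cost exceeds that of the reference policy.

```python
def aggression_limited_objective(f_r_pi, f_c_pi, f_c_ref, beta):
    """f_r_pi: estimated reward value of the candidate policy; f_c_pi / f_c_ref:
    estimated cost values of the candidate and of the reference policy; beta >= 0."""
    return f_r_pi - beta * max(f_c_pi - f_c_ref, 0.0)

# A safe candidate is left untouched, an unsafe one is dominated for large beta.
print(aggression_limited_objective(0.8, 0.2, 0.3, beta=10.0))  # 0.8  (no penalty)
print(aggression_limited_objective(0.9, 0.5, 0.3, beta=10.0))  # -1.1 (heavily penalized)
```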
Combining all the previous ideas together provides the design of our main algorithm named WSAC (Weighted Safe Actor-Critic). In Section 5, we will provide theoretical guarantees of WSAC and discuss its advantages over existing approaches in offline safe RL. WSAC aims to solve the following optimization problem:
(6)

where $[x]_+ := \max\{x, 0\}$ and $\beta \ge 0$ is a penalty coefficient. This formulation can also be treated as a Stackelberg game (Von Stackelberg, 2010) or a bilevel optimization problem. We penalize the objective function only when the approximate cost $Q$-function of the policy is more perilous than that of the behavior policy $\mu$, forcing our policy to be as safe as the behavior policy. Maximization over $w \in \mathcal{W}$ when training the two critics ensures that the Bellman error is small when averaged over the $w$-weighted data measure for any $w \in \mathcal{W}$, which turns out to be sufficient to control the suboptimality of the learned policy.
In the following theorem, we show that, under Assumption 3.2 with $\epsilon_{\mathcal{F}} = 0$, the solution of the optimization problem (6) is no worse than the behavior policy in both performance and safety for any $\beta \ge 0$.
Theorem 4.1.
Assume that Assumption 3.2 holds with $\epsilon_{\mathcal{F}} = 0$ and that the behavior policy $\mu$ is contained in $\Pi$. Then, for any $\beta \ge 0$, the solution of (6) achieves a return no worse than that of the behavior policy and a cost exceeding that of the behavior policy by at most a term that vanishes as $\beta$ grows.
The result in Theorem 4.1 shows that, by selecting $\beta$ large enough, the solution can achieve better performance than the behavior policy while maintaining safety that is arbitrarily close to that of the behavior policy. The theorem thus validates the design of our framework, which has the potential to provide a robust and safe improvement.
In the next section, we will introduce our main algorithm WSAC and provide its theoretical guarantees.
5 Theoretical Analysis of WSAC
5.1 Main Algorithm
In this section, we present the theoretical version of our new model-free offline safe RL algorithm, WSAC. Since we only have access to a dataset $\mathcal{D}$ instead of the data distribution, WSAC solves an empirical version of (6):
(7)

where

(8)
As shown in Algorithm 1, at each iteration, WSAC selects a maximally pessimistic reward critic $f \in \mathcal{F}$ and a maximally optimistic cost critic $g \in \mathcal{G}$ for the current policy, each with a weighted regularization on the estimated Bellman error (the two critic-selection steps of Algorithm 1), in order to address the worst cases within a reasonable range. To achieve a safe robust policy improvement, the actor then applies a no-regret policy optimization oracle to update the policy by optimizing the aggression-limited objective function relative to the reference policy (the actor step of Algorithm 1); a schematic view of one iteration is sketched below. Our algorithm is computationally efficient and tractable compared with existing approaches (Hong et al., 2024; Le et al., 2019), since we do not need an additional inner loop for optimizing the dual variable with an online algorithm or an offline policy evaluation oracle.
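The sketch below summarizes this structure with abstract finite function classes and hypothetical helper callables; it is a reading of the description above, not Algorithm 1 verbatim.

```python
def wsac_iteration(pi, F_r, F_c, W, beta, value, bellman_err, po_update):
    """value(f, pi): estimated value of pi under critic f on the offline data;
    bellman_err(f, pi, w): empirical w-weighted average Bellman error;
    po_update(pi, f_r, g_c): one no-regret step on the aggression-limited objective."""
    # Reward critic: maximally pessimistic value, regularized so the weighted Bellman error stays small.
    f_r = min(F_r, key=lambda f: value(f, pi) + beta * max(abs(bellman_err(f, pi, w)) for w in W))
    # Cost critic: maximally optimistic (worst-case) cost value with the same regularizer.
    g_c = max(F_c, key=lambda g: value(g, pi) - beta * max(abs(bellman_err(g, pi, w)) for w in W))
    # Actor: one no-regret policy optimization step using both critics.
    return po_update(pi, f_r, g_c)
```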
The policy improvement process relies on a no-regret policy optimization oracle, a technique commonly employed in the offline RL literature (Zhu et al., 2023; Cheng et al., 2022; Hong et al., 2024). Extensive literature exists on such methodologies. For instance, approaches like soft policy iteration (Pirotta et al., 2013) and algorithms based on natural policy gradients (Kakade, 2001; Agarwal et al., 2021) can function as effective no-regret policy optimization oracles. We now formally define the oracle:
Definition 5.1 (No-regret policy optimization oracle).
An algorithm PO is called a no-regret policy optimization oracle if, for any sequence of bounded functions $f^1, \dots, f^K$ with $f^k: \mathcal{S}\times\mathcal{A}\to\mathbb{R}$, the policies $\pi^1, \dots, \pi^K$ produced by the oracle PO satisfy, for any policy $\pi \in \Pi$,
(9) |
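For concreteness, here is a minimal tabular sketch of one such oracle, the exponentiated-gradient / soft policy iteration update $\pi^{k+1}(a\mid s) \propto \pi^k(a\mid s)\exp(\eta f^k(s,a))$; the array shapes, the random critic sequence, and the step size are illustrative assumptions.

```python
import numpy as np

def no_regret_step(pi, f, eta):
    """pi: (S, A) array of action probabilities; f: (S, A) critic values; eta: step size."""
    logits = np.log(pi) + eta * f
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

S, A, K = 4, 3, 50
pi = np.full((S, A), 1.0 / A)                      # start from the uniform policy
eta = np.sqrt(np.log(A) / K)                       # standard no-regret step size
for k in range(K):
    f_k = np.random.randn(S, A)                    # stand-in for the critic sequence f^k
    pi = no_regret_step(pi, f_k, eta)
```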
There indeed exist many methods that can serve as the no-regret oracle, for example, the mirror-descent approach (Geist et al., 2019) or the natural policy gradient approach (Kakade, 2001) of the multiplicative-weights form sketched above, with a suitable step size (Even-Dar et al., 2009; Agarwal et al., 2021). In the following, define the optimization error as the error generated by the oracle PO when the aggression-limited critics are used as the sequence of functions in Definition 5.1; then we have the following guarantee.
Lemma 2.
Applying a no-regret oracle PO for $K$ episodes with the aggression-limited "reward", for an arbitrary comparator policy $\pi$, guarantees

(10)
(11)
Lemma 2 establishes that the policy output by PO, run with the aggression-limited "reward", enjoys a strong guarantee on the performance of both reward and cost when $\beta$ is large enough, and is comparable with any competitor policy. This is critical to achieving the performance guarantee of our algorithm and the safe, robust policy improvement. The detailed proof is deferred to Appendix B.2 due to the page limit.
5.2 Theoretical Guarantees
We are now ready to provide the theoretical guarantees of WSAC Algorithm 1. The complete proof is deferred to Appendix B.3.
Theorem 5.2 (Main Theorem).
Remark 5.3.
When $\epsilon_{\mathcal{F}} = 0$, i.e., there is no model misspecification (the true value functions belong to the function classes used to approximate them), and the comparator is the optimal policy, the results in Theorem 5.2 achieve an optimal statistical rate of $1/\sqrt{N}$ (for large $N$), which matches the best existing results. Our algorithm is both statistically optimal and computationally efficient with only a single-policy concentrability assumption, rather than relying on the much stronger all-policy concentrability assumptions of Hong et al. (2024); Le et al. (2019). Hence, if the behavior policy or the reference policy is safe, our result indicates that the policy returned by our algorithm will also be (nearly) safe. Such a guarantee was missing in the existing literature.
Remark 5.4.
We also do not need a completeness assumption, which requires that for any $\pi$ and $f \in \mathcal{F}$ it approximately holds that $\mathcal{T}_r^\pi f \in \mathcal{F}$, as required in Xie et al. (2021); Chen et al. (2022b). They need this assumption to address the over-estimation issue caused by the squared Bellman error, but our algorithm avoids this strong assumption by using a weighted average Bellman error, which is a simple and unbiased estimator.
Remark 5.5.
Our algorithm can compete with any reference policy $\pi_{\text{ref}}$ as long as $w^{\pi_{\text{ref}}}$ is contained in $\mathcal{W}$. The importance ratio of the behavior policy is the all-one function, which always satisfies this condition, implying that our algorithm achieves a safe robust policy improvement over the behavior policy (Theorem 5.6, discussed below).
5.3 A Safe Robust Policy Improvement
Robust policy improvement (RPI) (Cheng et al., 2022; Zhu et al., 2023; Bhardwaj et al., 2024) refers to the property that an offline RL algorithm can learn to improve over the behavior policy using a wide range of hyperparameters. In this paper, we introduce the property of safe robust policy improvement (SRPI): the offline algorithm can learn to improve over the behavior policy in both return and safety, using a wide range of hyperparameters. In the following Theorem 5.6, we show that as long as the hyperparameter $\beta$ is chosen within a suitable (wide) range, our algorithm can, with high probability, produce a policy with vanishing suboptimality compared to the behavior policy.
Theorem 5.6 (SRPI).
The detailed proofs are deferred to Appendix B.4.
6 Experiments
Environment | BC | Safe-BC | BCQL | BEARL | CPQ | COptiDICE | WSAC | |||||||
Reward | Cost | Reward | Cost | Reward | Cost | Reward | Cost | Reward | Cost | Reward | Cost | Reward | Cost | |
BallCircle | 0.70 | 0.95 | 0.61 | 0.49 | 0.73 | 0.82 | 0.80 | 1.23 | 0.62 | 0.76 | 0.71 | 1.13 | 0.75 | 0.27 |
CarCircle | 0.57 | 1.43 | 0.57 | 0.65 | 0.79 | 1.19 | 0.84 | 1.87 | 0.67 | 0.28 | 0.49 | 1.52 | 0.68 | 0.59 |
PointButton | 0.26 | 1.75 | 0.12 | 0.69 | 0.36 | 1.76 | 0.32 | 1.71 | 0.43 | 3.10 | 0.15 | 1.92 | 0.13 | 0.67 |
PointPush | 0.13 | 0.67 | 0.20 | 1.35 | 0.16 | 1.01 | 0.12 | 0.90 | -0.01 | 2.39 | 0.07 | 1.18 | 0.07 | 0.52 |
Average | 0.42 | 1.20 | 0.38 | 0.80 | 0.51 | 1.12 | 0.52 | 1.43 | 0.36 | 1.63 | 0.36 | 1.44 | 0.41 | 0.51 |
6.1 WSAC-Practical Implementation
We introduce a deep RL implementation of WSAC in Algorithm 2 (in the Appendix), following the key structure of its theoretical version (Algorithm 1). The reward critic, cost critic, and policy network are all parameterized by neural networks. The two critic losses are calculated on a minibatch of the dataset based on the principles of Algorithm 1. Optimizing the actor aims to emulate a no-regret optimization oracle; we use a gradient-based update on the actor loss. In the implementation we use the adaptive gradient algorithm Adam (Kingma and Ba, 2015) to update the two critic networks and the actor network. The algorithm follows standard two-timescale first-order methods (Fujimoto et al., 2018; Haarnoja et al., 2018), with a fast learning rate for updating the critic networks and a slow learning rate for updating the actor.
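The following schematic update (a sketch under assumed network and loss-function interfaces, not the released implementation) shows how the two timescales fit together in one training step.

```python
import torch

def wsac_practical_update(actor, critic_r, critic_c, batch,
                          critic_loss_fn, actor_loss_fn,
                          opt_actor, opt_critic_r, opt_critic_c):
    # Fast timescale: update the reward and cost critics with their (adversarial,
    # weighted-Bellman-regularized) losses computed on the minibatch.
    loss_r = critic_loss_fn(critic_r, actor, batch, kind="reward")
    loss_c = critic_loss_fn(critic_c, actor, batch, kind="cost")
    opt_critic_r.zero_grad(); loss_r.backward(); opt_critic_r.step()
    opt_critic_c.zero_grad(); loss_c.backward(); opt_critic_c.step()

    # Slow timescale: one gradient step on the actor loss built from both critics,
    # approximating the no-regret policy optimization oracle.
    loss_pi = actor_loss_fn(actor, critic_r, critic_c, batch)
    opt_actor.zero_grad(); loss_pi.backward(); opt_actor.step()

# Two-timescale optimizers (learning rates are placeholders, not the paper's values):
# opt_critic_r = torch.optim.Adam(critic_r.parameters(), lr=3e-4)   # fast
# opt_critic_c = torch.optim.Adam(critic_c.parameters(), lr=3e-4)   # fast
# opt_actor    = torch.optim.Adam(actor.parameters(),    lr=3e-5)   # slow
```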
6.2 Simulations
We present a scalable deep RL version of WSAC in Algorithm 2, following the principles of Algorithm 1. We evaluate WSAC against Behavior Cloning (BC), safe Behavior Cloning (Safe-BC), Batch-Constrained deep Q-learning with Lagrangian PID (BCQL) (Fujimoto et al., 2019; Stooke et al., 2020), Bootstrapping Error Accumulation Reduction with Lagrangian PID (BEARL) (Kumar et al., 2019; Stooke et al., 2020), Constraints Penalized Q-learning (CPQ) (Xu et al., 2022), and one of the state-of-the-art algorithms, COptiDICE (Lee et al., 2022), as baselines.
We study several representative environments and focus here on "BallCircle". BallCircle requires the ball to move on a circle in a clockwise direction without leaving the safety zone defined by the boundaries, as proposed by Achiam et al. (2017). The ball is a spherical agent that can move freely on the xy-plane. The reward is dense and increases with the agent's velocity and with the proximity to the boundary of the circle, and a cost is incurred if the agent leaves the safety zone defined by the boundaries.
We use the offline datasets from Liu et al. (2023a), where the corresponding expert policies were used to interact with the environments and collect the data. To better illustrate the results, we normalize the reward and cost. Our simulation results are reported in Table 2. We observe that WSAC guarantees that all the final agents are safe, which is the most critical requirement in the safe RL literature, even in challenging environments such as PointButton, where most baselines fail to learn safe policies. WSAC obtains the best results in three of the environments. Moreover, WSAC outperforms all the baselines in terms of average performance, demonstrating its ability to learn a safe policy by leveraging an offline dataset. The simulation results thus support our theoretical findings. We also compared WSAC with all the baselines under different cost limits; WSAC still outperforms all the other baselines and ensures a safe policy. We further include simulations to investigate the contribution of each component of our algorithm, including the weighted Bellman regularizer, the aggression-limited objective, and the no-regret policy optimization, which together guarantee the theoretical results. More details and discussions are deferred to Appendix D due to the page limit.
7 Conclusion
In this paper, we explore the problem of offline Safe-RL with a single policy data coverage assumption. We propose a novel algorithm, WSAC, which, for the first time, is proven to guarantee the property of safe robust policy improvement. WSAC is able to outperform any reference policy, including the behavior policy, while maintaining the same level of safety across a broad range of hyperparameters. Our simulation results demonstrate that WSAC outperforms existing state-of-the-art offline safe-RL algorithms. Interesting future work includes combining WSAC with online exploration with safety guarantees and extending the approach to multi-agent settings to handle coupled constraints.
References
- Achiam et al., (2017) Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In Int. Conf. Machine Learning (ICML), volume 70, pages 22–31. JMLR.
- Agarwal et al., (2021) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2021). On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76.
- Altman, (1999) Altman, E. (1999). Constrained Markov decision processes, volume 7. CRC Press.
- Bhardwaj et al., (2024) Bhardwaj, M., Xie, T., Boots, B., Jiang, N., and Cheng, C.-A. (2024). Adversarial model for offline reinforcement learning. Advances in Neural Information Processing Systems, 36.
- (5) Chen, F., Zhang, J., and Wen, Z. (2022a). A near-optimal primal-dual method for off-policy learning in cmdp. In Advances Neural Information Processing Systems (NeurIPS), volume 35, pages 10521–10532.
- Chen and Jiang, (2019) Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In Int. Conf. Machine Learning (ICML), pages 1042–1051. PMLR.
- Chen and Jiang, (2022) Chen, J. and Jiang, N. (2022). Offline reinforcement learning under value and density-ratio realizability: the power of gaps. In Uncertainty in Artificial Intelligence, pages 378–388. PMLR.
- (8) Chen, L., Jain, R., and Luo, H. (2022b). Learning infinite-horizon average-reward markov decision process with constraints. In Int. Conf. Machine Learning (ICML), pages 3246–3270. PMLR.
- Chen et al., (2021) Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. (2021). Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097.
- Cheng et al., (2020) Cheng, C.-A., Kolobov, A., and Agarwal, A. (2020). Policy improvement via imitation of multiple oracles. In Advances Neural Information Processing Systems (NeurIPS), volume 33, pages 5587–5598.
- Cheng et al., (2022) Cheng, C.-A., Xie, T., Jiang, N., and Agarwal, A. (2022). Adversarially trained actor critic for offline reinforcement learning. In International Conference on Machine Learning, pages 3852–3878. PMLR.
- Chow et al., (2017) Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. (2017). Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120.
- Cui and Du, (2022) Cui, Q. and Du, S. S. (2022). When are offline two-player zero-sum markov games solvable? Advances in Neural Information Processing Systems, 35:25779–25791.
- Efroni et al., (2020) Efroni, Y., Mannor, S., and Pirotta, M. (2020). Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189.
- Even-Dar et al., (2009) Even-Dar, E., Kakade, S. M., and Mansour, Y. (2009). Online markov decision processes. Mathematics of Operations Research, 34(3):726–736.
- Fujimoto et al., (2019) Fujimoto, S., Meger, D., and Precup, D. (2019). Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pages 2052–2062. PMLR.
- Fujimoto et al., (2018) Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. In Int. Conf. Machine Learning (ICML), pages 1582–1591.
- Geist et al., (2019) Geist, M., Scherrer, B., and Pietquin, O. (2019). A theory of regularized markov decision processes. In International Conference on Machine Learning, pages 2160–2169. PMLR.
- Haarnoja et al., (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft Actor-Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Int. Conf. Machine Learning (ICML), pages 1861–1870.
- Hong et al., (2024) Hong, K., Li, Y., and Tewari, A. (2024). A primal-dual-critic algorithm for offline constrained reinforcement learning. In Int. Conf. Artificial Intelligence and Statistics (AISTATS), pages 280–288. PMLR.
- Isele et al., (2018) Isele, D., Nakhaei, A., and Fujimura, K. (2018). Safe reinforcement learning on autonomous vehicles. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–6. IEEE.
- Kakade and Langford, (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In Int. Conf. Machine Learning (ICML), pages 267–274.
- Kakade, (2001) Kakade, S. M. (2001). A natural policy gradient. In Advances Neural Information Processing Systems (NeurIPS).
- Kingma and Ba, (2015) Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y., editors, Int. Conf. on Learning Representations (ICLR).
- Kumar et al., (2019) Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32.
- Kumar et al., (2022) Kumar, A., Hong, J., Singh, A., and Levine, S. (2022). Should i run offline reinforcement learning or behavioral cloning? In Int. Conf. on Learning Representations (ICLR).
- Laroche et al., (2019) Laroche, R., Trichelair, P., and Des Combes, R. T. (2019). Safe policy improvement with baseline bootstrapping. In International conference on machine learning, pages 3652–3661. PMLR.
- Le et al., (2019) Le, H., Voloshin, C., and Yue, Y. (2019). Batch policy learning under constraints. In International Conference on Machine Learning, pages 3703–3712. PMLR.
- Lee et al., (2021) Lee, J., Jeon, W., Lee, B., Pineau, J., and Kim, K.-E. (2021). Optidice: Offline policy optimization via stationary distribution correction estimation. In Meila, M. and Zhang, T., editors, Int. Conf. Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pages 6120–6130. PMLR.
- Lee et al., (2022) Lee, J., Paduraru, C., Mankowitz, D. J., Heess, N., Precup, D., Kim, K.-E., and Guez, A. (2022). Coptidice: Offline constrained reinforcement learning via stationary distribution correction estimation. arXiv preprint arXiv:2204.08957.
- Liao et al., (2022) Liao, P., Qi, Z., Wan, R., Klasnja, P., and Murphy, S. A. (2022). Batch policy learning in average reward markov decision processes. Annals of statistics, 50(6):3364.
- Liu et al., (2019) Liu, B., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural trust region/proximal policy optimization attains globally optimal policy. Advances in neural information processing systems, 32.
- (33) Liu, Z., Guo, Z., Lin, H., Yao, Y., Zhu, J., Cen, Z., Hu, H., Yu, W., Zhang, T., Tan, J., et al. (2023a). Datasets and benchmarks for offline safe reinforcement learning. arXiv preprint arXiv:2306.09303.
- (34) Liu, Z., Guo, Z., Yao, Y., Cen, Z., Yu, W., Zhang, T., and Zhao, D. (2023b). Constrained decision transformer for offline safe reinforcement learning. In International Conference on Machine Learning, pages 21611–21630. PMLR.
- Ozdaglar et al., (2023) Ozdaglar, A. E., Pattathil, S., Zhang, J., and Zhang, K. (2023). Revisiting the linear-programming framework for offline rl with general function approximation. In International Conference on Machine Learning, pages 26769–26791. PMLR.
- Pirotta et al., (2013) Pirotta, M., Restelli, M., Pecorino, A., and Calandriello, D. (2013). Safe policy iteration. In Int. Conf. Machine Learning (ICML), pages 307–315. PMLR.
- Rajaraman et al., (2020) Rajaraman, N., Yang, L., Jiao, J., and Ramchandran, K. (2020). Toward the fundamental limits of imitation learning. Advances Neural Information Processing Systems (NeurIPS), 33:2914–2924.
- Rashidinejad et al., (2021) Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. (2021). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. In Advances Neural Information Processing Systems (NeurIPS), volume 34, pages 11702–11716.
- Rashidinejad et al., (2022) Rashidinejad, P., Zhu, H., Yang, K., Russell, S., and Jiao, J. (2022). Optimal conservative offline rl with general function approximation via augmented lagrangian. arXiv preprint arXiv:2211.00716.
- Siegel et al., (2020) Siegel, N. Y., Springenberg, J. T., Berkenkamp, F., Abdolmaleki, A., Neunert, M., Lampe, T., Hafner, R., Heess, N., and Riedmiller, M. (2020). Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396.
- Stooke et al., (2020) Stooke, A., Achiam, J., and Abbeel, P. (2020). Responsive safety in reinforcement learning by pid lagrangian methods. In Int. Conf. Machine Learning (ICML), pages 9133–9143. PMLR.
- Uehara et al., (2020) Uehara, M., Huang, J., and Jiang, N. (2020). Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pages 9659–9668. PMLR.
- Uehara et al., (2024) Uehara, M., Kallus, N., Lee, J. D., and Sun, W. (2024). Offline minimax soft-q-learning under realizability and partial coverage. Advances in Neural Information Processing Systems, 36.
- Uehara and Sun, (2021) Uehara, M. and Sun, W. (2021). Pessimistic model-based offline reinforcement learning under partial coverage. arXiv preprint arXiv:2107.06226.
- Von Stackelberg, (2010) Von Stackelberg, H. (2010). Market structure and equilibrium. Springer Science & Business Media.
- Wang et al., (2019) Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.
- Wu et al., (2021) Wu, R., Zhang, Y., Yang, Z., and Wang, Z. (2021). Offline constrained multi-objective reinforcement learning via pessimistic dual value iteration. In Advances Neural Information Processing Systems (NeurIPS), volume 34, pages 25439–25451.
- Xie et al., (2021) Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. (2021). Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34:6683–6694.
- Xie and Jiang, (2020) Xie, T. and Jiang, N. (2020). Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550–559. PMLR.
- Xie and Jiang, (2021) Xie, T. and Jiang, N. (2021). Batch value-function approximation with only realizability. In Int. Conf. Machine Learning (ICML), pages 11404–11413. PMLR.
- Xu et al., (2022) Xu, H., Zhan, X., and Zhu, X. (2022). Constraints penalized q-learning for safe offline reinforcement learning. In AAAI Conf. Artificial Intelligence, volume 36, pages 8753–8760.
- Yin and Wang, (2021) Yin, M. and Wang, Y.-X. (2021). Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34:4065–4078.
- Zhan et al., (2022) Zhan, W., Huang, B., Huang, A., Jiang, N., and Lee, J. (2022). Offline reinforcement learning with realizability and single-policy concentrability. In Proc. Conf. Learning Theory (COLT), pages 2730–2775. PMLR.
- Zhang et al., (2020) Zhang, J., Koppel, A., Bedi, A. S., Szepesvari, C., and Wang, M. (2020). Variational policy gradient method for reinforcement learning with general utilities. Advances in Neural Information Processing Systems, 33:4572–4583.
- Zheng et al., (2024) Zheng, Y., Li, J., Yu, D., Yang, Y., Li, S. E., Zhan, X., and Liu, J. (2024). Safe offline reinforcement learning with feasibility-guided diffusion model. arXiv preprint arXiv:2401.10700.
- Zhu et al., (2023) Zhu, H., Rashidinejad, P., and Jiao, J. (2023). Importance weighted actor-critic for optimal conservative offline reinforcement learning. arXiv preprint arXiv:2301.12714.
Supplementary Material
Appendix A Auxiliary Lemmas
In the following, we first provide several lemmas which are useful for proving our main results.
Lemma 3.
With probability at least $1-\delta$, for any $w \in \mathcal{W}$ and any $f \in \mathcal{F}$ (resp. $g \in \mathcal{G}$), we have

(16)
(17)
The proofs can be found in the corresponding lemma of Zhu et al. (2023).
Lemma 4.
With probability at least $1-\delta$, for any $w \in \mathcal{W}$ and any $f \in \mathcal{F}$ (resp. $g \in \mathcal{G}$), we have

(18)
(19)
where
Proof.
Lemma 5.
(Empirical weighted average Bellman error) With probability at least $1-\delta$, for any policy $\pi \in \Pi$, we have

(20)
(21)
where
Proof.
Lemma 6.
(Performance difference decomposition, restatement of a lemma in Cheng et al., (2022)) Let $\pi$ be an arbitrary policy and let $f$ be an arbitrary function over $\mathcal{S}\times\mathcal{A}$. Then we have,
(24) |
where the decomposition holds for either the reward or the cost case.
Proof.
We prove the reward case; the cost case is identical. Define a virtual reward function for the given $f$ and $\pi$. According to the performance difference lemma (Kakade and Langford, 2002), we first have that

( )

where the last equality is true because
Thus we have
(25) |
For the first term, we have
(26) |
where the second equality is true because
For the second term we have
(27) |
Therefore, plugging (26) and (27) into Eq. (25), we have
The proof is completed. ∎
Lemma 7.
With probability at least $1-\delta$, for any policy $\pi \in \Pi$ and any $w \in \mathcal{W}$, we have:

(28)
(29)
where
Proof.
Recall the definitions above. For any policy $\pi$, applying Hoeffding's inequality and a union bound, we obtain, with probability at least $1-\delta$,
(30) |
The proof of the other inequality is the same. ∎
Appendix B Missing Proofs
B.1 Proof of Theorem 4.1
B.2 Proof of Lemma 2
Proof.
First, according to the definition of the no-regret oracle (Definition 5.1), we have
(34) |
Therefore,
(35) |
and
(36) |
We finish the proof. ∎
B.3 Proof of Theorem 5.2
Theorem (Restatement of Theorem 5.2).
Proof.
According to the definition of the objective and Lemma 6, we have
(39) |
Conditioning on the high-probability event above, we have
(40) |
According to an argument similar to that of the corresponding lemma in Cheng et al. (2022), we have that
(41) |
By using Lemma 7, we have
(42) |
Therefore
(43)
(44)
(45)
(46)
where the third inequality holds by the selection of $\beta$ and the last inequality holds by Lemma 5. Therefore, by using Lemma 2 (proved in Appendix B.2), we obtain
(47) |
Following a similar argument, we have that
(48) |
∎
B.4 Proof of Theorem 5.6
Theorem (Restatement of Theorem 5.6).
Proof.
The proof follows that of Theorem 5.2, but when the reference policy is the behavior policy, its importance ratio is the all-one function, which is contained in $\mathcal{W}$. Therefore we have
We finish the proof. ∎
Appendix C Discussion on obtaining the behavior policy
To extract the behavior policy when it is not provided, we can simply run behavior cloning on the offline data. In particular, we can estimate the behavior policy as follows: $\hat{\mu}(a\mid s) = \frac{n(s,a)}{n(s)}$ if $n(s) > 0$, and $\hat{\mu}(a\mid s) = \frac{1}{|\mathcal{A}|}$ otherwise, where $n(s,a)$ (resp. $n(s)$) is the number of times $(s,a)$ (resp. $s$) appears in the offline dataset $\mathcal{D}$. Essentially, the estimated BC policy matches the empirical behavior policy on states in the offline dataset and takes uniform random actions outside the support of the dataset. It is easy to show that the gap between the learned policy and the behavior policy vanishes as the dataset grows (Kumar et al., 2022; Rajaraman et al., 2020), so we can obtain a very accurate estimate as long as the size of the dataset is large enough.
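A minimal sketch of this count-based estimator for a small discrete state and action space (the helper name and the toy dataset are illustrative, not from the paper):

```python
from collections import Counter

def behavior_cloning(dataset, actions):
    """dataset: iterable of (s, a, r, c, s') tuples; actions: list of all actions."""
    sa_counts = Counter((s, a) for s, a, *_ in dataset)
    s_counts = Counter(s for s, *_ in dataset)

    def pi_hat(a, s):
        if s_counts[s] == 0:                      # state never seen: uniform random action
            return 1.0 / len(actions)
        return sa_counts[(s, a)] / s_counts[s]    # empirical action frequency on seen states
    return pi_hat

data = [(0, 1, 0.0, 0.0, 1), (0, 1, 0.5, 0.0, 2), (0, 0, 0.2, 1.0, 1)]
pi_hat = behavior_cloning(data, actions=[0, 1])
print(pi_hat(1, 0), pi_hat(0, 3))                 # 0.666..., 0.5
```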
Appendix D Experimental Supplement
D.1 Practical Algorithm
The practical version of our algorithm WSAC is shown in Algorithm 2.
D.2 Environments Description
Besides the "BallCircle" environment, we also study several representative environments, described below. All of them are shown in Figure 2, and their offline datasets are from Liu et al., 2023a.
•
CarCircle: This environment requires the car to move on a circle in a clockwise direction within the safety zone defined by the boundaries. The car is a four-wheeled agent based on MIT's race car. The reward is dense and increases with the car's velocity and with the proximity to the boundary of the circle, and a cost is incurred if the agent leaves the safety zone defined by the two yellow boundaries, the same as in "BallCircle".
•
PointButton: This environment requires the point to navigate to the goal button location and touch the right goal button while avoiding gremlins and hazards. The point has two actuators, one for rotation and the other for forward/backward movement. The reward consists of two parts, reflecting the distance between the agent and the goal and whether the agent reaches the goal button and touches it. A cost is incurred if the agent enters the hazardous areas, contacts the gremlins, or presses the wrong button.
•
PointPush: This environment requires the point to push a box to reach the goal while circumventing hazards and pillars. The objects are in 2D planes and the point is the same as in "PointButton". It has a small square in front of it, which makes it easier to determine the orientation visually and also helps the point push the box.



D.3 Implementation Details and Experimental settings
We run all the experiments on an NVIDIA GeForce RTX Ti GPU.
The normalized reward and cost are summarized as follows:
(51)
(52)
where $r_{\max}$ and $r_{\min}$ are the maximum and minimum empirical returns for the task, $\kappa$ is the cost threshold, and $\epsilon$ is a small number to ensure numerical stability. Thus any normalized cost below 1 is considered safe. We use $R_\pi$ and $C_\pi$ to denote the cumulative reward and cost of the evaluated policy, respectively. The parameters $r_{\max}$, $r_{\min}$, and $\kappa$ are environment-dependent constants, and their specific values can be found in Table 3 of this appendix. We remark that the normalized reward and cost are used only for reporting performance and are not used in the training process.
Parameters | BallCircle | CarCircle | PointButton | PointPush |
30.0 | 38.0 | 30.0 | 30.0 | |
10.0 | 12.0 | 10.0 | 10.0 | |
30.0 | 28.0 | 32.0 | 30.0 | |
Batch size | ||||
Actor learning rate | ||||
Critic learning rate | ||||
0.3831 | 3.4844 | 0.0141 | 0.0012 | |
881.4633 | 534.3061 | 42.8986 | 14.6910 |
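As an illustration, the following sketch assumes the normalization follows the benchmark's usual convention, with $r_{\max}$, $r_{\min}$ read from Table 3 and an assumed cost threshold $\kappa$; it is a reading of Eqs. (51)-(52), not their verbatim form, and $\epsilon$ only guards against division by zero.

```python
def normalize(reward_return, cost_return, r_max, r_min, kappa, eps=1e-8):
    """Normalize an evaluated policy's cumulative reward and cost (assumed convention)."""
    norm_reward = (reward_return - r_min) / (r_max - r_min)
    norm_cost = cost_return / (kappa + eps)   # a normalized cost below 1 counts as safe
    return norm_reward, norm_cost

# Example with BallCircle-like constants from Table 3 and an assumed threshold kappa.
print(normalize(reward_return=600.0, cost_return=25.0,
                r_max=881.4633, r_min=0.3831, kappa=30.0))
```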
To mitigate the risk of unsafe scenarios, we introduce a hyperparameter that overestimates the cost $Q$-function when calculating the actor loss. We use two separate coefficients for the reward and cost functions to make the algorithm more flexible.

(Figure 3: average normalized reward and cost during training on (a) BallCircle, (b) CarCircle, (c) PointButton, and (d) PointPush.)
We use different coefficients for the reward and cost critic networks and a different one for the actor network to make the adversarial training more stable. We also restrict the key trade-off parameter to a certain range to balance reward and cost during the training process. Their values are shown in Table 3. In the experiments, we adopt a simplification for computational efficiency under which one of the coefficients can be absorbed into another hyperparameter.
D.4 Experimental results details and supplements
The evaluation performance of the agents in each environment after training is shown in Table 2, and the average rewards and costs during training are shown in Figure 3. From the results, we observe that WSAC achieves the best reward performance with significantly lower costs than all the baselines, suggesting that WSAC learns a safe and efficient policy and achieves a steady improvement by leveraging the offline dataset.
D.5 Simulations under different cost limits
To further evaluate the performance of our algorithm under varying conditions, we compare it with the baselines under different cost limits and report the average performance of our method and the baselines in Table 4. Specifically, we use the standard sets of cost limits for the BallCircle and CarCircle environments and for the PointButton and PointPush environments, following the setup outlined by Liu et al., 2023a. Our results demonstrate that WSAC maintains safety across all environments, and its performance is comparable to or better than that of the best baseline in each case. This suggests that WSAC is well-suited to tasks of varying difficulty.
BC | Safe-BC | CDT | BCQL | BEARL | CPQ | COptiDICE | WSAC | |||||||||
Reward | Cost | Reward | Cost | Reward | Cost | Reward | Cost | Reward | Cost | Reward | Cost | Reward | Cost | Reward | Cost | |
BallCircle | 0.74 | 4.71 | 0.52 | 0.65 | 0.77 | 1.07 | 0.69 | 2.36 | 0.86 | 3.09 | 0.64 | 0.76 | 0.70 | 2.61 | 0.74 | 0.51 |
CarCircle | 0.58 | 3.74 | 0.50 | 0.84 | 0.75 | 0.95 | 0.63 | 1.89 | 0.74 | 2.18 | 0.71 | 0.33 | 0.49 | 3.14 | 0.65 | 0.55 |
PointButton | 0.27 | 2.02 | 0.16 | 1.10 | 0.46 | 1.57 | 0.40 | 2.66 | 0.43 | 2.47 | 0.58 | 4.30 | 0.15 | 1.51 | 0.11 | 0.55 |
PointPush | 0.18 | 0.91 | 0.11 | 0.80 | 0.21 | 0.65 | 0.23 | 0.99 | 0.16 | 0.89 | 0.11 | 1.04 | 0.02 | 1.18 | 0.07 | 0.61 |
Average | 0.44 | 2.85 | 0.32 | 0.85 | 0.55 | 1.06 | 0.49 | 1.98 | 0.55 | 2.16 | 0.51 | 1.61 | 0.34 | 2.11 | 0.39 | 0.56 |
D.6 Ablation studies
To investigate the contribution of each component of our algorithm, including the weighted Bellman regularizer, the aggression-limited objective, and the no-regret policy optimization (which together guarantee our theoretical results), we performed an ablation study in the tabular setting. The results, presented in Table 5, indicate that the weighted Bellman regularization ensures the safety of the algorithm, while the aggression-limited objective fine-tunes the algorithm to achieve higher rewards without compromising safety.
Components | cost | reward |
ALL | 0.014 | |
W/O no-regret policy optimization | ||
W/O Aggression-limited objective | ||
W/O Weighted Bellman regularizer |
D.7 Sensitivity Analysis of Hyper-Parameters


We report the rewards and costs under different sets of hyperparameters (since one of our hyperparameters only increases during training, the closed interval reported here denotes its initial value and upper bound) to demonstrate the robustness of our approach in the tabular setting; see Figure 4. We observe that the performance is almost the same under different sets of parameters and different qualities of behavior policies.