
Evaluating Model-free Reinforcement Learning toward Safety-critical Tasks

Linrui Zhang1, Qin Zhang1, Li Shen2, Bo Yuan3, Xueqian Wang1, Dacheng Tao2
Correspondence to: Li Shen and Xueqian Wang. This work was done during Linrui Zhang's internship at JD Explore Academy, JD.com, Inc.
Abstract

Safety comes first in many real-world applications involving autonomous agents. Despite a large number of reinforcement learning (RL) methods focusing on safety-critical tasks, there is still a lack of high-quality evaluation of algorithms that adhere to safety constraints at each decision step under complex and unknown dynamics. In this paper, we revisit prior work in this scope from the perspective of state-wise safe RL and categorize it into projection-based, recovery-based, and optimization-based approaches. Furthermore, we propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection. This novel technique explicitly enforces hard constraints via a deep unrolling architecture and enjoys structural advantages in navigating the trade-off between reward improvement and constraint satisfaction. To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit, a toolkit that provides off-the-shelf interfaces and evaluation utilities for safety-critical tasks. We then perform a comparative study of the involved algorithms on six benchmarks ranging from robotic control to autonomous driving. The empirical results provide insights into their applicability and robustness in learning zero-cost-return policies without task-dependent handcrafting. The project page is available at https://sites.google.com/view/saferlkit.

Introduction

Model-free reinforcement learning (RL) has achieved superhuman performance on many decision-making problems (Mnih et al. 2015; Vinyals et al. 2019). Typically, the agent learns from trial and error and requires minimal prior knowledge of the environment. Such a paradigm features significant advantages in mastering essential skills for complex systems, but concerns about systematic safety limit the adoption of model-free RL in real-world applications, such as human-robot collaboration (Villani et al. 2018) and autonomous driving (Kiran et al. 2021).

Penalizing unsafe transitions via the reward function is straightforward but often makes it cumbersome to navigate the trade-off between performance and safety: a trivial penalty term may fail to yield a risk-averse policy, whereas an excessive punishment may make the agent too conservative to explore the environment. Alternatively, incorporating safety into RL via constraints (Altman 1999) is widely adopted, since the strength of the constraints reflects the human-specified safety requirement and the agent is expected to optimize its behavior within the constrained policy search space.

In this paper, we explore model-free reinforcement learning methods that adhere to state-wise safety constraints. To better understand how this study is developed, two points deserve further clarification. First, our work aims at learning a stationary safe policy under general model-free settings, rather than preventing every safety violation during training; the latter is more related to optimal control and relies on a known dynamics model (Cheng et al. 2019) or a carefully designed energy function (Zhao, He, and Liu 2021). Second, we focus on the state-wise constraint at every decision-making step and demonstrate, both theoretically and empirically, that this type of constraint is stricter and more practical for safety-critical tasks. Our contributions in this paper are summarized as follows:

1.

    We revisit model-free RL following state-wise safety constraints and present SafeRL-Kit, a toolkit that implements prior work in this scope under a unified off-policy framework. Specifically, SafeRL-Kit contains projection-based Safety Layer (Dalal et al. 2018), recovery-based Recovery RL (Thananjeyan et al. 2021), optimization-based Off-policy Lagrangian (Ha et al. 2020), Feasible Actor-Critic (Ma et al. 2021), and the new method proposed in this paper.

2.

    We propose Unrolling Safety Layer (USL), a novel approach that combines safety projection and safety optimization. USL unrolls gradient-based corrections after the jointly optimized actor network and thus explicitly enforces the constraints. The proposed method is simple yet effective and outperforms state-of-the-art algorithms in learning risk-averse policies.

3.

    We perform a comparative study based on SafeRL-Kit and evaluate the related algorithms on six different tasks. We further demonstrate their applicability and robustness in safety-critical tasks with the universal binary cost indicator and a constant constraint threshold.

Related Work

Safe RL Algorithms.

Safe RL optimizes policies under episodic or instantaneous constraints. The most common approach to handling the episodic constraint is Lagrangian relaxation (Chow et al. 2017; Tessler et al. 2017; Stooke et al. 2020). Other works (Achiam et al. 2017; Yang et al. 2020) approximate the constrained policy iteration with a quadratic constrained optimization. Recently, first-order methods (Liu, Ding, and Liu 2020; Zhang et al. 2022; Yang et al. 2022a) have started to gain attention, as their objectives are efficient to optimize and readily handle multiple constraints. For the instantaneous constraint, Lagrangian relaxation is also a feasible solution (Bohez et al. 2019). Dalal et al. (2018) perform quadratic programming to project actions back to the safe set. Other works model the instantaneous cost as Gaussian Processes and plan in the safety-proven neighboring states (Wachi et al. 2018; Wachi and Sui 2020). In this paper, we formulate safe RL following state-wise safety constraints, which differs slightly from the above lines of work.

Safe RL Benchmarks.

Several safety-critical benchmarks already exist for evaluating the efficacy of safe RL methods, including traditional MuJoCo tasks (Achiam et al. 2017; Zhang, Vuong, and Ross 2020), navigation in cluttered environments (Ray, Achiam, and Amodei 2019), safe robotic control tasks (Yuan et al. 2021), and safe autonomous driving (Li et al. 2021; Herman et al. 2021). However, a comprehensive study on learning a zero-cost-return policy with model-free methods is still absent.

Preliminaries

This study lies in the context of the constrained Markov Decision Process (CMDP) (Altman 1999), which extends the standard MDP (Sutton and Barto 1998) as a tuple $(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\mathcal{C},\mu,\gamma)$. $\mathcal{S}$ and $\mathcal{A}$ denote the state space and the action space, respectively. $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1]$ is the transition probability function that describes the dynamics of the system. $\mathcal{R}:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}$ is the reward function. $\mathcal{C}:\mathcal{S}\times\mathcal{A}\mapsto[0,+\infty]$ is the cost function and reflects the safety violation. $\mu:\mathcal{S}\mapsto[0,1]$ is the initial state distribution, and $\gamma$ is the discount factor for future reward and cost. A stationary policy $\pi:\mathcal{S}\mapsto P(\mathcal{A})$ maps given states to probability distributions over the action space, and the expected discounted return of the policy is $J_{R}(\pi)=\mathbb{E}_{\tau\sim\pi}\big[\sum^{\infty}_{t=0}\gamma^{t}R(s_{t},a_{t})\big]$, where $\tau\sim\pi$ denotes the stochastic trajectory distribution induced by $s_{0}\sim\mu$, $a_{t}\sim\pi(\cdot|s_{t})$, $s_{t+1}\sim\mathcal{P}(\cdot|s_{t},a_{t})$. The goal of safe RL is to find the optimal policy:

$\max_{\pi}J_{R}(\pi)\quad\mathrm{s.t.}\ \ \pi\text{ is feasible}.$ (1)

In a CMDP, the agent is typically constrained by the cost function in one of two ways. One is the Episodic Constraint, which requires the cost-return of the whole trajectory to stay within a certain threshold, namely $J_{C}(\pi)=\mathbb{E}_{\tau\sim\pi}\big[\sum^{\infty}_{t=0}\gamma^{t}c_{t}\big]\leq d$, and is suitable for total energy consumption, resource over-utilization, etc. The other is the Instantaneous Constraint, which requires the selected actions to satisfy the constraint at every decision-making step, namely $\forall t,\ \mathbb{E}_{\pi}[c_{t}|s_{t}]\leq\epsilon$, and is indispensable for accident and damage avoidance.
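To make the distinction concrete, the snippet below is a minimal Python sketch that checks both formulations on a recorded cost trajectory; the trajectory, thresholds, and discount are illustrative assumptions rather than the settings used in our experiments.

```python
# Minimal sketch: checking the two constraint types on a recorded cost trajectory.
# `costs`, `d`, `eps`, and `gamma` below are illustrative values, not the settings
# used in the paper's experiments.

def episodic_constraint_ok(costs, gamma=0.99, d=5.0):
    """Episodic constraint: discounted cost-return of the whole trajectory <= d."""
    j_c = sum((gamma ** t) * c for t, c in enumerate(costs))
    return j_c <= d

def instantaneous_constraint_ok(costs, eps=0.0):
    """Instantaneous constraint: every single-step cost <= eps."""
    return all(c <= eps for c in costs)

if __name__ == "__main__":
    costs = [0.0] * 100 + [1.0] + [0.0] * 99      # one violation at step 100
    print(episodic_constraint_ok(costs))          # True: 0.99**100 ~ 0.37 <= 5.0
    print(instantaneous_constraint_ok(costs))     # False: the violation is caught
```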

Revisit RL toward Safety-critical Tasks

In many safety-critical scenarios, the final policy is expected to maintain a zero cost-return, since any inadmissible behavior could lead to catastrophic failure during execution. Prior constrained learning paradigms have fatal flaws under this premise. For the episodic constraint, with a threshold $d$ close to 0, the agent often either fails to improve the policy or still receives a cost-return greater than zero. The underlying reason is that such a constraint is easy to violate, especially when the time horizon contains thousands of steps. If an algorithm handles the episodic formulation directly, it is cumbersome to identify ill actions and to tweak the policy parameters and the corresponding action sequence accordingly. The instantaneous constraint, in contrast, is tighter, since it is a sufficient but not necessary condition for the episodic constraint with $\epsilon=(1-\gamma)d$. Nevertheless, directly enforcing $C(s_{t},a_{t})\leq\epsilon$ at every single decision-making step is problematic under complicated dynamics, since some actions have long-term effects on future visited states, and infeasible states might be intractable to recover from in a single time-step. A more reasonable solution, inspired by model predictive control, is to prevent the current state from falling into the unsafe set within a certain planning span. Consequently, we define the long-term cost-return as $Q^{\pi}_{c}(s_{t},a_{t})=\mathbb{E}_{\pi}\big[\sum^{\infty}_{t^{\prime}=t}\gamma^{t^{\prime}-t}c_{t^{\prime}}|s_{t},a_{t}\big]$, which mirrors $Q^{\pi}(s,a)$ for reward in standard RL with $r_{t}$ replaced by $c_{t}$. Next, we present the formal definitions related to state-wise safe reinforcement learning.

Definition 1 (State-wise safety constraints).

In the whole trajectory, the agent is required to adhere to the following long-term constraint at every visited state

$Q^{\pi}_{c}(s_{t},a_{t})\leq\delta,\ \forall t\geq 0.$ (2)
Definition 2 (Optimal state-wise safe policy).

In any feasible state, the optimal action is

$a^{*}=\mathop{\arg\max}_{a}\big[Q^{\pi}(s,a)\big]\quad\mathrm{s.t.}\ \ Q^{\pi}_{c}(s,a)\leq\delta.$ (3)

Using the cumulative constraint to enhance state-wise safety is not novel (Srinivasan et al. 2020; Ma et al. 2021; Yu et al. 2022). Nevertheless, the choice of $\delta$ is tricky, since the relationship between (2) and the desired instantaneous constraint is not straightforward. In this paper, we give a theoretical bound on the cost limit $\delta$ as follows.

Proposition 1.

If $\delta\leq\epsilon\cdot\gamma^{T}$, any policy $\pi(\cdot|s)$ satisfying (2) fulfills $\mathbb{E}_{\pi}[c_{t}|s_{t}]\leq\epsilon$ within the planning span $T$.

Proof.

The proof is by contradiction: if $\mathbb{E}_{\pi}[c_{t}|s_{t}]>\epsilon$ at any step $T^{\prime}<T$, then $V^{\pi}_{c}(s_{t})\geq\epsilon\cdot\gamma_{c}^{T^{\prime}}>\epsilon\cdot\gamma_{c}^{T}\geq\delta$. ∎

The above proposition, to some extent, alleviates the concern that decreasing $\delta$ leads to an overly conservative policy. For example, if a racing car is able to slow down and avoid an obstacle 20 steps ahead, setting $T>20$ will not change the optimal actions $a^{*}_{t-t^{\prime}}$ for $t^{\prime}>20$. In our experiments, we keep $\delta=0.1$ across different safety-critical tasks, which corresponds to a safe planning span of at least 100 steps with the universal binary cost indicator. Empirical results demonstrate that safe RL methods adhering to state-wise safety constraints are robust to this value.
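As a sanity check of Proposition 1, the short sketch below computes the planning span $T$ implied by a given $\delta$, under the assumption of a binary cost indicator (so the tightest per-step bound is $\epsilon=1$); the helper name and the second value of $\gamma$ are illustrative, and $\gamma=0.99$ yields a span of roughly 229 steps, consistent with the "at least 100 steps" claim above.

```python
import math

# Numerical check of Proposition 1 under a binary cost indicator (eps = 1):
# the implied planning span T is the largest T with delta <= eps * gamma**T.

def planning_span(delta=0.1, eps=1.0, gamma=0.99):
    return int(math.floor(math.log(delta / eps) / math.log(gamma)))

if __name__ == "__main__":
    print(planning_span())             # ~229 steps for gamma = 0.99 (>= 100)
    print(planning_span(gamma=0.97))   # ~75 steps: a smaller gamma shortens the span
```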

In this paper, we explore model-free RL that adheres to state-wise safety constraints in continuous state-action spaces and unknown dynamics. We classify the most related approaches into the following three categories:

Safety Correction.

This type of method corrects the initial unsafe decision by projecting it back to the safe set. The projection can be constructed by a control barrier function (Cheng et al. 2019) with known dynamics, an implicit safety index (Zhao, He, and Liu 2021) with a hand-crafted energy function, or a parametric linear model (Dalal et al. 2018) learned from past experiences. However, these approaches sometimes underperform in terms of cumulative reward, since the correction only guarantees feasibility but is not equivalent to optimality.

Safety Recovery.

This type of method is especially popular in robotics (Thananjeyan et al. 2021; Yang et al. 2022b) and autonomous driving (Chen et al. 2021). The key idea behind Recovery RL is to introduce a dedicated policy that recovers the agent from unsafe states, whereas the task policy is trained by standard RL to achieve the original goal. However, these approaches struggle to learn a rational recovery policy, which tends to be overly conservative and prevent risky exploration. Furthermore, the decisions of the two policies may conflict with each other, which can easily leave the agent stuck near the boundaries of safe regions.

Safety Optimization.

This type of method incorporates safety constraints into the RL objective, yielding a constrained sequential optimization problem. These approaches employ different optimization objectives to guide the updates of parametric policies, which can be tackled by Lagrangian relaxation (Ha et al. 2020; Ma et al. 2021) or the penalty method (Zhang et al. 2022). Unfortunately, the "soft" loss function in sample-based learning does not consistently enforce the "hard" constraint in practice and rarely leads to zero-cost-return policies even at convergence.

Unrolling Safety Layer: A Novel Approach

Orthogonal to existing algorithms, we propose a novel approach referred to as Unrolling Safety Layer (USL), which is inspired by the complementarity of safety projection and safety optimization. For projection-based approaches, the correction (even when tractable) only enforces feasibility but is not equivalent to maximizing return. For optimization-based approaches, most of them seek the optimal solution to the constrained problem with the help of neural networks; however, the forward pass lacks explicit restrictions on the output actions, and thus the "soft" loss function often fails to fully satisfy "hard" constraints. Recently, Deep Constraint Completion and Correction (DC3) (Donti, Rolnick, and Kolter 2021) has shown potential to achieve optimal objective values while preserving feasibility, which is of independent interest for general constrained problems. As illustrated in Figure 1, we employ a similar joint architecture that combines safety optimization (serving as an approximate solver in the first stage) and safety projection (serving as iterative correction in the second stage). To the best of our knowledge, this is the first study to introduce deep unrolling optimization into safe RL.

Figure 1: The deep unrolling architecture for safe RL. At each decision-making step, the pre-posed policy network outputs a near-optimal action $a_{0}$; the post-posed unrolling safety layer (USL) takes $a_{0}$ as the initial solution and iteratively performs gradient-based corrections to enforce the hard constraint. Back-propagation of the state-wise constrained objective guides the policy optimization.

Stage 1: Policy Network as Approximate Solver

We train a parametric neural network in the first stage as an approximate solver to problem (3), which aims to output sub-optimal actions via naive forward computation. Different from Donti, Rolnick, and Kolter (2021), who simply apply an $\ell_{2}$ regularization term to the objective function, we use the exact penalty function (Zhang et al. 2022) as an alternative. The merit is that one can construct an equivalent counterpart of problem (3) with a finite penalty factor $\kappa$ as

$\ell(\pi)=\mathbb{E}_{\mathcal{D}}\big[-Q^{\pi}(s,\pi(s))+\kappa\cdot\max\{0,Q^{\pi}_{c}(s,\pi(s))-\delta\}\big].$ (4)
Proposition 2.

Let $\mathcal{L}(\pi,\lambda)$ denote the Lagrangian function $\mathbb{E}_{\mathcal{D}}\big[-Q^{\pi}(s,\pi(s))+\lambda(s)\big(Q^{\pi}_{c}(s,\pi(s))-\delta\big)\big]$. Assume that the optimal $\pi^{*}$ and $\lambda^{*}$ exist for the dual problem $\max_{\lambda\geq 0}\min_{\pi}\mathcal{L}(\pi,\lambda)$. If $\kappa\geq||\lambda(s)||_{\infty}$, it holds that

$\min\ell(\pi)\iff\max_{\lambda\geq 0}\min_{\pi}\mathcal{L}(\pi,\lambda)$
Proof.

The exactness of the penalty function follows from our previous work (Zhang et al. 2022). ∎

We use a fixed $\kappa$ as a hyper-parameter in the practical implementation and find that a large constant ($\kappa=5$ in our experiments) is empirically effective across different tasks, even though the supremum of the Lagrange multipliers is intractable to estimate. Moreover, if the actual $\lambda(s)$ tends to positive infinity for some critically dangerous states, optimization-based approaches would run into numerical issues. On the contrary, under those circumstances the objective function (4) can be regarded as a penalty method with a regularization term and only gives a sub-optimal initial solution for the next stage. Fortunately, the proposed two-stage architecture does not require an optimal solution in the first stage, and the joint training and inference process with the post-projection can alleviate this problem to some extent.
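For reference, the following is a minimal PyTorch-style sketch of the Stage-1 objective in Eq. (4); the `actor`, `critic`, and `cost_critic` modules are assumed to be user-defined, and only $\kappa=5$ and $\delta=0.1$ follow the defaults reported above.

```python
import torch

# Minimal PyTorch-style sketch of the Stage-1 actor objective in Eq. (4).
# `actor`, `critic`, and `cost_critic` are assumed user-defined modules mapping
# (state[, action]) batches to column vectors of values.

def stage1_actor_loss(actor, critic, cost_critic, states,
                      kappa: float = 5.0, delta: float = 0.1) -> torch.Tensor:
    actions = actor(states)                  # pi_theta(s)
    q = critic(states, actions)              # Q(s, pi(s))
    qc = cost_critic(states, actions)        # Q_c(s, pi(s))
    # Exact penalty: maximize Q while penalizing the violation (Q_c - delta)^+.
    penalty = torch.clamp(qc - delta, min=0.0)
    return (-q + kappa * penalty).mean()
```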

Figure 2: The schema of SafeRL-Kit. All the algorithms are implemented under off-policy settings and evaluated on six safety-critical tasks with the universal binary cost indicator and a constant constraint threshold.

Stage 2: Gradient-based Projection

The approximate solver in stage 1 may still output infeasible actions for the following reasons: (a) The supremum of the Lagrangian multipliers in Proposition 2 is hard to obtain, and we only set $\kappa$ to a fixed, large, but possibly sub-optimal hyper-parameter. (b) The inherent issues of safe RL, such as approximation errors in modeling, sample-based learning, distributional shift, etc., make it possible for the end-to-end actor to violate hard constraints during policy execution.

To address the above issues, the post-posed projection performs gradient steps to enforce the hard constraint, starting from the initial iteration point $a^{0}\sim\pi(\cdot|s)$. $\psi(\cdot)$ is the differentiable operator that takes the intermediate action $a^{k}$ at the $k$-th iteration as input and performs a gradient descent step on the constraint violation wrapped with a ReLU function:

$\psi(a^{k})=a^{k}-\frac{\eta}{\mathcal{Z}_{k}}\cdot\frac{\partial}{\partial{a^{k}}}\big[Q_{c}(s,a^{k})-\delta\big]^{+}.$ (5)

Here $\mathcal{Z}_{k}=||\frac{\partial}{\partial{a^{k}}}\big[Q_{c}(s,a^{k})-\delta\big]^{+}||_{\infty}$ is the normalization factor that rescales the gradients on $a^{k}$, and therefore the hyper-parameter $\eta$ determines the maximum step size of the action change.

Notably, the iterative executions of $\psi(\cdot)$ do not always converge to the global (or even a local) optimum of the primal constrained optimization problem (3). Nevertheless, such a method is highly effective in practice if the initial iteration point is close to the optimal solution (Panageas, Piliouras, and Wang 2019; Donti, Rolnick, and Kolter 2021). This fact emphasizes the necessity of training a pre-posed policy network via the exact penalty regularization, which provides a non-pathological initialization for USL. By minimizing the objective function (4), the output of $\pi(\cdot|s)$ may still be infeasible at times but is already close to the optimal action $a^{*}$. Thus, the sequence $a^{k+1}=\psi(a^{k})$ is expected to converge to $a^{*}$ as $k\rightarrow+\infty$. Since letting $k\rightarrow+\infty$ is impractical, we set an upper limit $K$ as the maximum number of USL iterations. Note that the value of $K$ needs to match the gradient step-size factor $\eta$. For example, we set $\eta=0.05$ and $K=20$ so that USL can move a normalized action from $1$ to $0$ within the maximum number of iterations.
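The following is a minimal PyTorch sketch of the post-posed projection in Eq. (5) for a single state-action pair; `cost_critic` is an assumed differentiable module $Q_c(s,a)$, and batched usage would normalize the gradient per sample rather than globally.

```python
import torch

# Minimal PyTorch sketch of the Stage-2 projection in Eq. (5) for one state-action
# pair. `cost_critic` is an assumed differentiable module Q_c(s, a); eta, K, and
# delta follow the hyper-parameters reported above, the rest is illustrative.

def usl_project(cost_critic, state, a0, eta=0.05, K=20, delta=0.1, tol=1e-8):
    a = a0.clone()
    for _ in range(K):
        a = a.detach().requires_grad_(True)
        violation = torch.clamp(cost_critic(state, a) - delta, min=0.0).sum()
        if violation.item() <= 0.0:            # already feasible: stop early
            break
        grad, = torch.autograd.grad(violation, a)
        z = grad.abs().max().clamp_min(tol)    # normalization factor Z_k (inf-norm)
        a = (a.detach() - eta * grad / z).clamp(-1.0, 1.0)   # one correction step
    return a.detach()
```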

More algorithmic details are summarized in Appendix A.

SafeRL-Kit: A Systematic Implementation

To facilitate further research in this area, we release SafeRL-Kit (project page: https://sites.google.com/view/saferlkit), a reproducible and open-source safe RL toolkit, as shown in Figure 2. In brief, SafeRL-Kit contains a list of representative algorithms that address safe learning from different perspectives. Potential users can also incorporate domain-specific knowledge into appropriate baselines to build more competent algorithms for their tasks of interest. Furthermore, SafeRL-Kit is implemented in an off-policy training pipeline, which provides unified and efficient interfaces for fair comparisons among different algorithms on different benchmarks.

Safety-critical Benchmarks

SafeRL-Kit includes six safety-critical benchmarks, ranging from basic robotic control to autonomous driving, which are well-explored in recent literature (Yuan et al. 2021; Ray, Achiam, and Amodei 2019; Li et al. 2021). A short description of the benchmarks is presented below:

(A) SpeedLimit.

The four-legged ant runs along the avenue and receives a cost signal when exceeding the velocity limit.

(B) Stabilization.

The cart pole is rewarded for keeping itself upright while being constrained by angular velocity.

(C) PathTracking.

The quadrotor tracks the green circular trajectory and receives a cost signal if it leaves the flight area bounded by the red rectangle.

(D) SafetyGym-PG.

The mass point moves to the green goal and is required to avoid the blue hazards.

(E) PandaPush.

The robotic arm pushes the green cube to the destination while avoiding collisions with the red cube.

(F) SafeDriving.

The autonomous vehicle learns to reach the navigation landmarks as quickly as possible, but is not allowed to collide with other vehicles or drive out of the road.

Figure 3: Learning curves of different algorithms on safety-critical tasks. The x-axis is the number of interactions. The y-axis represents episodic return (top line), episodic cost rate (middle line) and total cost rate (bottom line), respectively. The dashed line is the expected zero-cost limit in the test time.

More detailed environment descriptions are provided in Appendix B.1.

Safe Learning Algorithms

SafeRL-Kit includes five safe learning methods addressing safety-critical tasks from different perspectives.

Safety Layer

(Dalal et al. 2018) for safety projection

$a^{*}=\arg\min_{a}\frac{1}{2}||a-\pi_{\theta}(s)||^{2}\quad\mathrm{s.t.}\ \ g_{\omega}(s)^{\top}a+\bar{c}(s)\leq\delta.$
Recovery RL

(Thananjeyan et al. 2021) for safety recovery

$a_{t}=\begin{cases}\pi_{\text{task}}(s_{t}),&\text{if }Q^{\pi}_{\text{risk}}\big(s_{t},\pi_{\text{task}}(s_{t})\big)\leq\delta\\ \pi_{\text{risk}}(s_{t}),&\text{otherwise}\end{cases}.$

Off-Policy Lagrangian Method

(Ha et al. 2020) for safety optimization

$\max_{\lambda\geq 0}\min_{\theta}\mathbb{E}_{\mathcal{D}}\big[-Q^{\pi}(s,\pi_{\theta}(s))+\lambda\big(Q^{\pi}_{c}(s,\pi_{\theta}(s))-\epsilon\big)\big].$
Feasible Actor Critic (FAC)

(Ma et al. 2021) for safety optimization

$\max_{\xi}\min_{\theta}\mathbb{E}_{\mathcal{D}}\big[-Q^{\pi}(s,\pi_{\theta}(s))+\lambda_{\xi}(s)\big(Q^{\pi}_{c}(s,\pi_{\theta}(s))-\epsilon\big)\big].$
Unrolling Safety Layer (USL).

A joint approach proposed in this paper combining safety projection and optimization.

All the above algorithms in SafeRL-Kit are implemented under the off-policy Actor-Critic architecture. Although these model-free algorithms may inevitably encounter cost signals in the training process, they still enjoy better sample efficiency with fewer unsafe transitions compared with on-policy implementations (Ray, Achiam, and Amodei 2019), and can better leverage human demonstrations if needed. The essential updates of the backbone networks uniformly follow TD3 (Fujimoto, Hoof, and Meger 2018), and thus we can perform a fair evaluation to see which of them is best suited for safety-critical tasks.

More details of the implemented algorithms are provided in Appendix B.2.

Cost Function and Evaluation Metrics

Without loss of generality, we uniformly designate the cost function as a binary indicator (1 for unsafe transitions; 0 for other cases), and our experiments aim to obtain stationary policies that adhere to zero cost signals. It is worth noting that some works define the cost more elaborately, for example, using the distance from the car to the road boundary in autonomous driving scenarios (Chen et al. 2021). We omit such task-dependent hand-crafting in our comparative study, since it relies heavily on domain-specific knowledge and is sometimes intractable on complex tasks. Instead, receiving an instantaneous cost signal is much more straightforward and generalizes across related tasks.

Considering the properties of safety-critical tasks and the definition of the cost function, we employ the following metrics for the joint evaluation:

Episodic Return

$\triangleq$ sum of rewards in the test time. It indicates how well the agent finishes the original task.

Episodic Cost Rate

$\triangleq\frac{\text{number of cost signals}}{\text{length of the episode}}$. It indicates how safe the agent is in the test time.

Total Cost Rate

$\triangleq\frac{\text{total number of cost signals}}{\text{total number of training steps}}$. It indicates how safe the agent is in the whole training process.
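For clarity, a minimal sketch of how these three metrics can be computed from simple logs is given below; the function names and logging format are illustrative assumptions.

```python
# Minimal sketch of the three evaluation metrics, assuming simple per-episode logs
# of rewards and binary cost signals plus running totals over training.

def episodic_return(rewards):
    return sum(rewards)

def episodic_cost_rate(costs):
    return sum(costs) / max(len(costs), 1)

def total_cost_rate(total_cost_signals, total_training_steps):
    return total_cost_signals / max(total_training_steps, 1)
```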

Table 1: Mean performance at convergence with 95% confidence interval for different algorithms on safety-critical tasks.
Tasks | Metric | USL (ours) | Safety Layer | Recovery RL | Lagrangian | FAC | TD3 (ref)
(A) SpeedLimit | Ep Return | 965.97±5.09 | 935.69±9.11 | 965.73±3.52 | 897.32±77.80 | 978.77±12.12 | 1382.16±19.65
(A) SpeedLimit | Ep CostRate (%) | 0.63±0.55 | 12.73±2.71 | 15.10±4.07 | 27.17±14.00 | 1.07±0.58 | 95.34±1.03
(A) SpeedLimit | Tot CostRate (%) | 0.63±0.06 | 8.00±0.75 | 8.25±2.74 | 8.01±3.28 | 1.85±0.40 | 62.56±5.89
(B) Stabilization | Ep Return | 228.18±3.88 | 222.80±15.22 | 204.23±3.92 | 231.61±2.12 | 214.20±19.38 | 238.47±3.99
(B) Stabilization | Ep CostRate (%) | 0.00±0.00 | 6.98±2.96 | 0.10±0.04 | 0.03±0.06 | 0.02±0.03 | 12.89±4.08
(B) Stabilization | Tot CostRate (%) | 1.77±0.23 | 18.27±2.06 | 16.63±1.98 | 9.05±0.85 | 2.07±0.19 | 20.01±2.87
(C) PathTracking | Ep Return | 241.74±0.88 | 205.93±46.73 | 229.05±7.16 | 240.50±2.03 | 218.28±28.99 | 248.80±0.54
(C) PathTracking | Ep CostRate (%) | 0.17±0.18 | 18.79±5.52 | 6.22±3.91 | 5.77±2.27 | 6.33±3.55 | 48.40±9.49
(C) PathTracking | Tot CostRate (%) | 5.36±0.60 | 39.44±10.66 | 25.08±8.27 | 12.18±2.10 | 11.92±2.43 | 42.22±3.74
(D) SafetyGym | Ep Return | 6.36±0.90 | 13.64±0.05 | 12.24±0.26 | 12.28±0.87 | 9.66±1.24 | 13.65±0.12
(D) SafetyGym | Ep CostRate (%) | 1.49±0.74 | 5.17±0.47 | 4.06±1.68 | 2.47±0.41 | 1.84±0.88 | 5.41±0.16
(D) SafetyGym | Tot CostRate (%) | 3.07±0.15 | 5.79±0.16 | 6.40±0.22 | 4.31±0.27 | 3.97±0.42 | 12.80±0.55
(E) PandaPush | Ep Return | 0.96±0.04 | 0.55±0.22 | 0.89±0.19 | 0.77±0.35 | 0.92±0.08 | 0.16±0.13
(E) PandaPush | Ep CostRate (%) | 0.17±0.17 | 6.28±4.44 | 0.66±0.57 | 6.28±4.44 | 3.48±2.55 | 12.56±4.70
(E) PandaPush | Tot CostRate (%) | 1.56±0.43 | 12.89±6.11 | 6.17±2.87 | 10.37±2.52 | 4.28±2.84 | 12.35±4.06
(F) SafeDriving | Ep Return | 0.73±0.04 | 0.73±0.05 | 0.78±0.06 | 0.74±0.05 | 0.69±0.04 | 0.80±0.06
(F) SafeDriving | Ep CostRate (%) | 0.85±0.14 | 2.59±0.22 | 2.83±0.38 | 1.85±0.98 | 0.66±0.10 | 3.81±0.51
(F) SafeDriving | Tot CostRate (%) | 1.30±0.17 | 3.96±0.68 | 5.42±0.25 | 2.33±0.44 | 1.30±0.13 | 6.22±0.82
Table 2: Ablation study for USL on two representative tasks.
USL Models | SpeedLimit: Ep Return | Ep CostRate (%) | Tot CostRate (%) | PathTracking: Ep Return | Ep CostRate (%) | Tot CostRate (%)
Stage 1 + Stage 2 | 965.97±5.09 | 0.63±0.55 | 0.63±0.06 | 241.74±0.80 | 0.09±0.09 | 5.35±0.60
Stage 1 only | 1016.03±29.17 | 5.05±2.69 | 2.53±0.39 | 242.32±0.27 | 0.49±0.09 | 8.74±0.72
Stage 2 only | 989.12±96.62 | 38.87±15.50 | 13.25±7.82 | 211.88±18.28 | 0.62±0.15 | 18.24±1.66
Unconstrained | 1382.16±19.65 | 95.34±1.03 | 62.56±5.89 | 244.80±0.54 | 24.20±4.75 | 42.22±3.74
Table 3: Computational efficiency for different algorithms ($K=20$, $\eta=0.05$ for USL).
Metrics | USL (ours) | Recovery | Safety Layer | Lagrangian | FAC | TD3 (ref)
Normalized Inference Time | 5.0 | 1.2 | 1.6 | 1.0 | 1.0 | 1.0
Average Inference Time (s) | 25E-4 | 6E-4 | 8E-4 | 5E-4 | 5E-4 | 5E-4
Max Control Frequency (Hz) | 400 | 1666 | 1250 | 2000 | 2000 | 2000

Experiments

In this section, we empirically evaluate model-free RL toward safety-critical tasks based on SafeRL-Kit and investigate the following research questions.

Q1: How applicable and robust are the algorithms regarding state-wise constraints?

We plot the learning curves of each algorithm on different tasks over five random seeds in Figure 3 and report their mean performance at convergence in Table 1. Detailed hyper-parameter settings are presented in Appendix C.1.

TD3 (Fujimoto, Hoof, and Meger 2018) is used as the unconstrained reference for the upper bounds of reward performance and cost signals. On the SpeedLimit task, the safety constraint (velocity $<1.5\,$m/s) is easily violated, since the ant is able to move much faster for higher rewards. Thus, we can clearly observe that the TD3 agent reaches an episodic cost rate of over 95% at convergence, while the risk-aware algorithms in SafeRL-Kit suppress the explosion of the cost rate. On the other tasks, the cost signals are sparser, such as accidental collisions with obstacles in autonomous driving. Nevertheless, the evaluated algorithms still effectively reduce the likelihood of safety violations.

The empirical results reveal that Safety Layer and Recovery RL are comparatively ineffective in reducing the cost-return. For Safety Layer, the main reasons are that the linear approximation to the cost function introduces non-negligible errors and that the single-step correction is myopic with respect to future risks. For Recovery RL, the estimation error of $Q_{\text{risk}}$ is the major factor affecting its efficacy.

By contrast, Off-policy Lagrangian and FAC achieve significantly lower cumulative costs. However, Lagrangian-based methods suffer from inherent issues of primal-dual ascent: tuning the Lagrangian multipliers causes oscillations in the learning curves, and the performance may depend heavily on the multipliers' initialization and learning rate. According to the sensitivity analysis, Lagrangian-based methods are susceptible to the learning rate of the Lagrangian multiplier(s) in stochastic primal-dual optimization. First, the oscillating $\lambda$ causes non-negligible deviations in the learning curves. Second, increasing $\eta_{\lambda}$ may degrade the performance dramatically. This phenomenon is especially pronounced in FAC, which has a multiplier network to predict the state-dependent $\lambda(s;\xi)$. Consequently, we suggest ensuring $\eta_{\lambda}\ll\eta_{\theta}$ in practice. The full sensitivity analysis of these two algorithms is provided in Appendix C.2.

Figure 4: Sensitivity analysis on the penalty factor $\kappa$.

In summary, the proposed USL is clearly effective for learning constraint-satisfying policies. First, USL achieves higher or competitive returns while adhering to (almost) zero cost-return across different tasks. Second, USL features minor standard deviations and oscillations, which demonstrates its robustness. Finally, USL generally converges with fewer interactions, which is crucial in sample-expensive risky environments. The underlying reason is that the optimization stage is equivalent to FAC but reduces the state-dependent Lagrangian multipliers to a single fixed hyper-parameter; meanwhile, the consistent loss function stabilizes the training process compared with primal-dual optimization. Furthermore, the projection stage explicitly enforces the state-wise constraints that are intractable in naive forward computation.

Additional experiments comparing the state-wise constraint, the episodic constraint, and reward shaping are provided in Appendix C.3.

Q2: How to account for the importance of the two stages in USL?

To better understand the importance of the two stages in our approach, we perform an ablation study, shown in Table 2, and confirm that the two stages must work jointly to achieve the desired performance. An intuitive example is that the solution derived by standard RL may be far away from the desired optimal safe action on tasks such as SpeedLimit. Thus, directly post-optimizing over $Q^{\pi}_{c}$ does not necessarily converge to $a^{*}$, and the agent still has a 38% cost rate. Instead, if the initial solution from Stage 1 is close to $a^{*}$, it can serve as a valid candidate and the cost rate goes down to 0.63%. Note that when the unconstrained action is not that far from the safe set, such as on the PathTracking task, both the optimization and the projection parts can effectively reduce the cost rate from 24% to less than 1%. However, using only Stage 2 is inferior in episodic return, which is an inherent flaw of projection-based methods.

Q3: How sensitive is USL to its hyper-parameters and how to tune them?

We study the impact of two pivotal hyper-parameters in USL, namely the penalty factor $\kappa$ in the training objective and the maximum number of iterations $K$ in the post-projection, on the SpeedLimit task. For $\kappa$, Figure 4 shows that final policies are insensitive to its value, and the learning curves are almost identical for sufficiently large $\kappa$. By contrast, a small $\kappa$ may degenerate the first stage of USL into a "soft" regularization method. In our experiments, we find $\kappa=5$ generally achieves good performance across different tasks. For $K$, Figure 5 shows that USL can enforce the hard constraint within five iterations at most decision-making steps, indicating the possibility of navigating the trade-off between constraint satisfaction and computational efficiency. We also set $K=0$ in the sensitivity study and show that the optimization in Stage 1 alone cannot lead to a zero cost-return without the aid of the post-posed projection.

Figure 5: Sensitivity analysis on the iterative number $K$.

Q4: How is the computational efficiency of USL with the additional iterative steps?

The two-stage architecture of USL inevitably raises concerns about computational feasibility in real-world applications. We compare the different algorithms on a mainstream computing platform (Intel Core i7-9700K, NVIDIA GeForce RTX 2070). Table 3 shows that USL takes around 4-5 times the inference time of unconstrained TD3 but still achieves an admissible 400 Hz control frequency. We leave improving the inference speed of USL as future work.

Conclusions

In this paper, we perform a comparative study of model-free reinforcement learning toward safety-critical tasks following state-wise safety constraints. We revisit and evaluate related algorithms from the perspectives of safety projection, recovery, and optimization, respectively. Furthermore, we propose Unrolling Safety Layer (USL) and demonstrate its efficacy in improving the episodic return and enhancing safety-constraint satisfaction with admissible computational overhead. We also release the open-source SafeRL-Kit and invite researchers and practitioners to incorporate domain-specific knowledge into the baselines to build more competent algorithms for their tasks.

Acknowledgments

This work is supported by the National Key R&D Program of China (2022YFB4701400/4701402), the National Natural Science Foundation of China (No. U21B6002, U1813216, 52265002), and the Science and Technology Innovation 2030 – “Brain Science and Brain-like Research” key Project (No. 2021ZD0201405).

References

  • Achiam et al. (2017) Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained policy optimization. In International Conference on Machine Learning, 22–31. PMLR.
  • Altman (1999) Altman, E. 1999. Constrained Markov decision processes, volume 7. CRC Press.
  • Bohez et al. (2019) Bohez, S.; Abdolmaleki, A.; Neunert, M.; Buchli, J.; Heess, N.; and Hadsell, R. 2019. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623.
  • Chen et al. (2021) Chen, B.; Francis, J.; Nyberg, J. O. E.; and Herbert, S. L. 2021. Safe Autonomous Racing via Approximate Reachability on Ego-vision. arXiv preprint arXiv:2110.07699.
  • Cheng et al. (2019) Cheng, R.; Orosz, G.; Murray, R. M.; and Burdick, J. W. 2019. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 3387–3395.
  • Dalal et al. (2018) Dalal, G.; Dvijotham, K.; Vecerik, M.; Hester, T.; Paduraru, C.; and Tassa, Y. 2018. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757.
  • Donti, Rolnick, and Kolter (2021) Donti, P. L.; Rolnick, D.; and Kolter, J. Z. 2021. DC3: A learning method for optimization with hard constraints. arXiv preprint arXiv:2104.12225.
  • Fujimoto, Hoof, and Meger (2018) Fujimoto, S.; Hoof, H.; and Meger, D. 2018. Addressing function approximation error in actor-critic methods. In International conference on machine learning, 1587–1596. PMLR.
  • Ha et al. (2020) Ha, S.; Xu, P.; Tan, Z.; Levine, S.; and Tan, J. 2020. Learning to walk in the real world with minimal human effort. arXiv preprint arXiv:2002.08550.
  • Herman et al. (2021) Herman, J.; Francis, J.; Ganju, S.; Chen, B.; Koul, A.; Gupta, A.; Skabelkin, A.; Zhukov, I.; Kumskoy, M.; and Nyberg, E. 2021. Learn-to-race: A multimodal control environment for autonomous racing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9793–9802.
  • Kiran et al. (2021) Kiran, B. R.; Sobh, I.; Talpaert, V.; Mannion, P.; Al Sallab, A. A.; Yogamani, S.; and Pérez, P. 2021. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems.
  • Li et al. (2021) Li, Q.; Peng, Z.; Xue, Z.; Zhang, Q.; and Zhou, B. 2021. Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning. arXiv preprint arXiv:2109.12674.
  • Liu, Ding, and Liu (2020) Liu, Y.; Ding, J.; and Liu, X. 2020. Ipo: Interior-point policy optimization under constraints. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04): 4940–4947.
  • Luenberger, Ye et al. (1984) Luenberger, D. G.; Ye, Y.; et al. 1984. Linear and nonlinear programming, volume 2. Springer.
  • Ma et al. (2021) Ma, H.; Guan, Y.; Li, S. E.; Zhang, X.; Zheng, S.; and Chen, J. 2021. Feasible actor-critic: Constrained reinforcement learning for ensuring statewise safety. arXiv preprint arXiv:2105.10682.
  • Mnih et al. (2015) Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. nature, 518(7540): 529–533.
  • Panageas, Piliouras, and Wang (2019) Panageas, I.; Piliouras, G.; and Wang, X. 2019. First-order methods almost always avoid saddle points: The case of vanishing step-sizes. Advances in Neural Information Processing Systems, 32.
  • Ray, Achiam, and Amodei (2019) Ray, A.; Achiam, J.; and Amodei, D. 2019. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 7: 1.
  • Srinivasan et al. (2020) Srinivasan, K.; Eysenbach, B.; Ha, S.; Tan, J.; and Finn, C. 2020. Learning to be safe: Deep rl with a safety critic. arXiv preprint arXiv:2010.14603.
  • Sutton and Barto (1998) Sutton, R. S.; and Barto, A. G. 1998. Reinforcement learning: An introduction. MIT press.
  • Thananjeyan et al. (2021) Thananjeyan, B.; Balakrishna, A.; Nair, S.; Luo, M.; Srinivasan, K.; Hwang, M.; Gonzalez, J. E.; Ibarz, J.; Finn, C.; and Goldberg, K. 2021. Recovery rl: Safe reinforcement learning with learned recovery zones. IEEE Robotics and Automation Letters, 6(3): 4915–4922.
  • Villani et al. (2018) Villani, V.; Pini, F.; Leali, F.; and Secchi, C. 2018. Survey on human–robot collaboration in industrial settings: Safety, intuitive interfaces and applications. Mechatronics, 55: 248–266.
  • Vinyals et al. (2019) Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D. H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782): 350–354.
  • Wachi and Sui (2020) Wachi, A.; and Sui, Y. 2020. Safe reinforcement learning in constrained Markov decision processes. In International Conference on Machine Learning, 9797–9806. PMLR.
  • Wachi et al. (2018) Wachi, A.; Sui, Y.; Yue, Y.; and Ono, M. 2018. Safe exploration and optimization of constrained mdps using gaussian processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  • Yang et al. (2022a) Yang, L.; Ji, J.; Dai, J.; Zhang, L.; Zhou, B.; Li, P.; Yang, Y.; and Pan, G. 2022a. Constrained Update Projection Approach to Safe Policy Optimization. In 36th Conference on Neural Information Processing Systems.
  • Yang et al. (2020) Yang, T.-Y.; Rosca, J.; Narasimhan, K.; and Ramadge, P. J. 2020. Projection-based constrained policy optimization. arXiv preprint arXiv:2010.03152.
  • Yang et al. (2022b) Yang, T.-Y.; Zhang, T.; Luu, L.; Ha, S.; Tan, J.; and Yu, W. 2022b. Safe Reinforcement Learning for Legged Locomotion. arXiv preprint arXiv:2203.02638.
  • Yu et al. (2022) Yu, D.; Ma, H.; Li, S.; and Chen, J. 2022. Reachability Constrained Reinforcement Learning. In International Conference on Machine Learning, 25636–25655. PMLR.
  • Yuan et al. (2021) Yuan, Z.; Hall, A. W.; Zhou, S.; Brunke, L.; Greeff, M.; Panerati, J.; and Schoellig, A. P. 2021. safe-control-gym: a Unified Benchmark Suite for Safe Learning-based Control and Reinforcement Learning. arXiv preprint arXiv:2109.06325.
  • Zhang et al. (2022) Zhang, L.; Shen, L.; Yang, L.; Chen, S.; Wang, X.; Yuan, B.; and Tao, D. 2022. Penalized Proximal Policy Optimization for Safe Reinforcement Learning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, 3744–3750.
  • Zhang, Vuong, and Ross (2020) Zhang, Y.; Vuong, Q.; and Ross, K. W. 2020. First order constrained optimization in policy space. arXiv preprint arXiv:2002.06506.
  • Zhao, He, and Liu (2021) Zhao, W.; He, T.; and Liu, C. 2021. Model-free safe control for zero-violation reinforcement learning. In 5th Annual Conference on Robot Learning.

Supplementary Material A: Algorithmic Details

Algorithm 1 Deterministic Policy Gradients with USL.
0:  deterministic policy network $\pi(s;\theta)$; critic networks $\hat{Q}(s,a;\phi)$ and $\hat{Q}_{c}(s,a;\varphi)$
1:  for $t$ in $1,2,\dots$ do
2:     $a^{0}_{t}=\pi(s_{t};\theta)+\epsilon,\ \ \epsilon\sim\mathcal{N}(0,\sigma)$.
3:     for $k$ in $1,2,\dots,K$ do
4:        $a^{k}_{t}=\psi(a^{k-1}_{t})$.
5:     end for
6:     Apply $a_{t}=a^{K}_{t}$ to the environment.
7:     Store the transition $(s_{t},a_{t},s_{t+1},r_{t},c_{t},d_{t})$ in $\mathcal{B}$.
8:     Sample a mini-batch of $N$ transitions from $\mathcal{B}$.
9:     $\varphi\leftarrow\arg\min_{\varphi}\mathbb{E}_{\mathcal{B}}\big[\hat{Q}_{c}(s,a;\varphi)-\big(c+\gamma_{c}(1-d)\hat{Q}_{c}(s^{\prime},\pi(s^{\prime};\theta);\varphi)\big)\big]^{2}$.
10:     $\phi\leftarrow\arg\min_{\phi}\mathbb{E}_{\mathcal{B}}\big[\hat{Q}(s,a;\phi)-\big(r+\gamma(1-d)\hat{Q}(s^{\prime},\pi(s^{\prime};\theta);\phi)\big)\big]^{2}$.
11:     $\theta\leftarrow\arg\min_{\theta}\mathbb{E}_{\mathcal{B}}\big[-\hat{Q}(s,\pi(s;\theta);\phi)+\kappa\cdot\max\{0,\hat{Q}_{c}(s,\pi(s;\theta);\varphi)-\delta\}\big]$.
12:  end for
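As a reference for line 9 of Algorithm 1, the following is a minimal PyTorch sketch of the cost-critic regression; target networks and the TD3-style tricks used in SafeRL-Kit are omitted, and the module and batch names are assumptions.

```python
import torch
import torch.nn.functional as F

# Minimal PyTorch sketch of the cost-critic regression in line 9 of Algorithm 1.
# `cost_critic`, `actor`, and the batch tensors are assumed to come from the user's
# own training loop; target networks and TD3-style tricks are omitted.

def cost_critic_loss(cost_critic, actor, batch, gamma_c=0.99):
    s, a, s_next, c, done = batch            # tensors sampled from the replay buffer
    with torch.no_grad():
        a_next = actor(s_next)
        target = c + gamma_c * (1.0 - done) * cost_critic(s_next, a_next)
    return F.mse_loss(cost_critic(s, a), target)
```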

Algorithm 2 Stochastic Policy Gradients with USL.
0:  stochastic policy network $\pi(\cdot|s;\theta)$; critic networks $\hat{Q}(s,a;\phi)$ and $\hat{Q}_{c}(s,a;\varphi)$
1:  for $t$ in $1,2,\dots$ do
2:     $a^{0}_{t}\sim\pi(\cdot|s_{t};\theta)$.
3:     for $k$ in $1,2,\dots,K$ do
4:        $a^{k}_{t}=\psi(a^{k-1}_{t})$.
5:     end for
6:     Apply $a_{t}=a^{K}_{t}$ to the environment.
7:     Store the transition $(s_{t},a_{t},s_{t+1},r_{t},c_{t},d_{t})$ in $\mathcal{B}$.
8:     Sample a mini-batch of $N$ transitions from $\mathcal{B}$.
9:     Sample $a^{\prime}\sim\pi(\cdot|s^{\prime};\theta)$.
10:     $\varphi\leftarrow\arg\min_{\varphi}\mathbb{E}_{\mathcal{B}}\big[\hat{Q}_{c}(s,a;\varphi)-\big(c+\gamma_{c}(1-d)\hat{Q}_{c}(s^{\prime},a^{\prime};\varphi^{\prime})-\alpha\log\pi(a^{\prime}|s^{\prime};\theta)\big)\big]^{2}$.
11:     $\phi\leftarrow\arg\min_{\phi}\mathbb{E}_{\mathcal{B}}\big[\hat{Q}(s,a;\phi)-\big(r+\gamma(1-d)\hat{Q}(s^{\prime},a^{\prime};\phi^{\prime})-\alpha\log\pi(a^{\prime}|s^{\prime};\theta)\big)\big]^{2}$.
12:     Utilize reparameterization to calculate $\tilde{a}(\theta)=\tanh\big(\mu_{\theta}(s)+\sigma_{\theta}(s)\odot\xi\big),\ \ \xi\sim\mathcal{N}(0,1)$.
13:     $\theta\leftarrow\arg\min_{\theta}\mathbb{E}_{\mathcal{B}}\big[-\hat{Q}(s,\tilde{a}(\theta);\phi)+\kappa\cdot\max\{0,\hat{Q}_{c}(s,\tilde{a}(\theta);\varphi)-\delta\}\big]$.
14:  end for

Supplementary Material B: Implementation Details

B.1 Safety-Critical Benchmarks

(a) SpeedLimit. Refer to Zhang, Vuong, and Ross (2020). The observation space ($\mathcal{S}\in\mathbb{R}^{33}$) contains ego-information, including position, linear velocity, quaternion, angular velocity, feet contact forces, etc. The action space ($\mathcal{A}\in\mathbb{R}^{8}$) denotes the forces applied to each motor. The four-legged ant is rewarded for running along the avenue, which is calculated by the forward distance towards the target $x_{\text{target}}=+\infty$. However, it is constrained by its linear velocity $\dot{x}$. In our experiments, the agent receives a $+1$ cost signal if $|\dot{x}|>0.15\,m/s$ or the ant is out of the yellow boundary, i.e., $|y|>2\,m$.
(b) Stabilization. Refer to Yuan et al. (2021). The cart-pole is rewarded for keeping itself upright, but is constrained by its angle $\theta$ and angular velocity $\dot{\theta}$. The observation space ($\mathcal{S}\in\mathbb{R}^{4}$) contains the horizontal position of the cart $x$, the velocity of the cart $\dot{x}$, the angle of the pole w.r.t. the vertical $\theta$, and the angular velocity of the pole $\dot{\theta}$. The action space ($\mathcal{A}\in\mathbb{R}$) is the force $F$ applied to the center of mass of the cart. The reward is an instantaneous signal of $+1$ if the pole is upright ($|\theta|\leq\theta_{\max}$). The agent receives a $+1$ cost signal if $|\theta|>\theta_{\max}$ or $|\dot{\theta}|>\dot{\theta}_{\max}$. In our experiments, we set $\theta_{\max}=\dot{\theta}_{\max}=0.2$.
(c) PathTracking. Refer to Yuan et al. (2021). The drone (UAV) is rewarded for tracking a circular trajectory, but the safe area is bounded within a smaller rectangle. The observation space ($\mathcal{S}\in\mathbb{R}^{6}$) contains the translational position $x,z$ and velocity $\dot{x},\dot{z}$ of the drone in the $xz$-plane, as well as the pitch angle $\theta$ and pitch angular velocity $\dot{\theta}$. The action space ($\mathcal{A}\in\mathbb{R}^{2}$) denotes the thrusts $[T_{1},T_{2}]$ generated by the two motors. The reward function is quadratic w.r.t. the reference $x$ and $a$. The agent receives a $+1$ cost signal if it is out of the area allowed to fly. In our experiments, we set $-0.4\,m\leq x_{\text{safe}}\leq 0.4\,m$ and $0.05\,m\leq y_{\text{safe}}\leq 0.9\,m$.
(d) SafetyGym-PG. Refer to Ray, Achiam, and Amodei (2019). The observation space ($\mathcal{S}\in\mathbb{R}^{60}$) contains ego information and Lidar information (towards the hazards and the goal, respectively), etc. The action space ($\mathcal{A}\in\mathbb{R}^{2}$) denotes the linear velocity and angular velocity of the point mass. The point mass is rewarded for getting close to the green destination, and the agent receives a $+1$ cost signal if it overlaps with the virtual blue hazards. Since the original point-goal environment is not designed for zero-cost tasks, we inherit the setting of Zhao, He, and Liu (2021), where the number of hazards is 8 and their radius is 0.3 m.
(e) PandaPush. Refer to Zhang et al. (2022). The 7-DoF Franka Emika Panda manipulator is required to push the green cube to the destination marked in yellow, and the environment returns a sparse reward (0 for finished and -1 for unfinished). We add a red cube on the optimal path and return a $+1$ cost if the green cube collides with the obstacle. The observation space ($\mathcal{S}\in\mathbb{R}^{21}$) contains ego information, obstacle information, and destination information. We adopt position control to move the manipulator end-effector, i.e., the action ($\mathcal{A}\in\mathbb{R}^{3}$) is the increment along the X-Y-Z axes.
(f) SafeDrive. Refer to Li et al. (2021). MetaDrive is a compositional, lightweight, and realistic platform for vehicle autonomy. Most importantly, it provides pre-defined environments for safe policy learning in autopilots. Concretely, the observation is encoded by a vector containing the ego-state, navigation information, and surrounding information detected by the Lidar. We control the speed and steering of the car to hit virtual land markers for rewards (by default, Li et al. (2021) conduct reward shaping by penalizing collisions and overstepping the road; otherwise the task would be too hard to learn), and the cost is $+1$ if the vehicle collides with other obstacles or drives out of the road.

B.2 Safe Learning Algorithms

Safety Projection

This type of method corrects the initial unsafe decision by projecting it back to the safe set. In SafeRL-Kit, the representative model-free method is Safety Layer, which is added on top of the original policy network. Specifically, Safety Layer utilizes a parametric linear model $C(s_{t},a_{t})\approx g(s_{t};\omega)^{\top}a_{t}+c_{t-1}$ to approximate the single-step cost function with supervised training and solves the following quadratic program

$a_{t}^{*}=\arg\min_{a}\ \frac{1}{2}||a-\mu_{\theta}(s_{t})||^{2}\quad\mathrm{s.t.}\ \ g(s_{t};\omega)^{\top}a+c_{t-1}\leq\epsilon,$ (A.1)

to find the "closest" action to the feasible region. Since there is only one compositional cost signal in our problem, the closed-form solution of problem (A.1) is

$a_{t}^{*}=\mu_{\theta}(s_{t})-\bigg[\frac{g(s_{t};\omega)^{\top}\mu_{\theta}(s_{t})+c_{t-1}-\epsilon}{g(s_{t};\omega)^{\top}g(s_{t};\omega)}\bigg]^{+}g(s_{t};\omega)$ (A.2)

Note that $g_{\omega}$ is trained from offline data in Dalal et al. (2018). In SafeRL-Kit, we instead learn the linear model synchronously with the policy network, considering the side effect of distribution shift. We also employ a warm-up phase for the safety critic in the training process to avoid inaccurate estimation and wrong corrections.
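A minimal sketch of the closed-form correction in Eq. (A.2), assuming a single cost signal and illustrative tensor names, is given below.

```python
import torch

# Minimal sketch of the closed-form single-constraint correction in Eq. (A.2).
# `g` is the learned linear cost model g(s; omega) evaluated at the current state,
# `a` the proposed action, `c_prev` the previous cost, `eps` the threshold.

def safety_layer_correction(a, g, c_prev, eps=0.1):
    violation = torch.clamp(g @ a + c_prev - eps, min=0.0)   # the [.]^+ term
    scale = violation / (g @ g).clamp_min(1e-8)
    return a - scale * g
```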

Safety Recovery

The critical insight behind safety recovery is to introduce an additional policy that recovers the agent from potentially unsafe states. In SafeRL-Kit, the representative model-free method is Recovery RL (Thananjeyan et al. 2021). We first learn a safe critic to estimate the future probability of constraint violation as

$Q^{\pi}_{\text{risk}}(s_{t},a_{t})=c_{t}+(1-c_{t})\gamma\,\mathbb{E}_{\pi}Q^{\pi}_{\text{risk}}(s_{t+1},a_{t+1}).$ (A.3)

This formulation is slightly different from the standard Bellman equation, since it assumes the episode terminates when the agent receives a cost signal. We remove the early-stopping condition so that agents can better master complex skills, but still preserve the original formulation of $Q^{\pi}_{\text{risk}}$ in (A.3), since it bounds the safe critic from above and mitigates over-estimation in Q-learning. In the policy execution phase, the recovery policy takes over control when the predicted value of the safe critic exceeds the given threshold:

$a_{t}=\begin{cases}\pi_{\text{task}}(s_{t}),&\text{if }Q^{\pi}_{\text{risk}}\big(s_{t},\pi_{\text{task}}(s_{t})\big)\leq\delta\\ \pi_{\text{risk}}(s_{t}),&\text{otherwise}\end{cases}.$ (A.4)

In Recovery RL, it is best practice to store both $a_{\text{task}}$ and $a_{\text{risk}}$ in the replay buffer and use them to train $\pi_{\text{task}}$ and $\pi_{\text{risk}}$, respectively. This technique ensures that $\pi_{\text{task}}$ learns from the new MDP instead of proposing the same unsafe actions continuously. Similar to Safety Layer, Recovery RL in SafeRL-Kit also has a warm-up stage during which $Q^{\pi}_{\text{risk}}$ is trained but not yet utilized.
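The following is a minimal sketch of the Recovery RL execution rule (A.4) and the risk-critic target (A.3) for a single state; `pi_task`, `pi_risk`, and `q_risk` are assumed user-defined modules.

```python
import torch

# Minimal sketch of the Recovery RL policy switch in Eq. (A.4) and the risk-critic
# target in Eq. (A.3) for a single state; `pi_task`, `pi_risk`, and `q_risk` are
# assumed user-defined modules.

def select_action(s, pi_task, pi_risk, q_risk, delta=0.1):
    a_task = pi_task(s)
    if q_risk(s, a_task).item() <= delta:   # the task action is predicted to be safe
        return a_task
    return pi_risk(s)                       # otherwise the recovery policy takes over

def q_risk_target(q_risk, pi, s_next, c, gamma=0.99):
    with torch.no_grad():
        return c + (1.0 - c) * gamma * q_risk(s_next, pi(s_next))
```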

Safety Optimization

State-wise safe RL can be formulated as the constrained sequential optimization problem

$a^{*}=\mathop{\arg\max}_{a}\big[Q^{\pi}(s,a)\big]\quad\mathrm{s.t.}\ \ Q^{\pi}_{c}(s,a)\leq\delta,$ (A.5)

and can be tackled via the dual problem in the parametric space as follows

$\max_{\lambda\geq 0}\min_{\theta}\mathbb{E}_{\mathcal{D}}\big[-Q^{\pi}(s,\pi_{\theta}(s))+\lambda\big(Q^{\pi}_{c}(s,\pi_{\theta}(s))-\epsilon\big)\big].$ (A.6)

Off-policy Lagrangian applies stochastic primal-dual optimization (Luenberger, Ye et al. 1984) to update the primal and dual variables alternately, as follows:

\begin{cases}\theta\leftarrow\theta+\eta_{\theta}\nabla_{\theta}\,\mathbb{E}_{\mathcal{D}}\big{(}Q^{\pi}(s,\pi_{\theta}(s))-\lambda Q^{\pi}_{c}(s,\pi_{\theta}(s))\big{)}\\ \lambda\leftarrow\big{[}\lambda+\eta_{\lambda}\,\mathbb{E}_{\mathcal{D}}\big{(}Q^{\pi}_{c}(s,\pi_{\theta}(s))-\epsilon\big{)}\big{]}^{+}\end{cases} (A.7)

Notably, the timescale of the primal-variable updates is required to be faster than that of the Lagrange multiplier. Thus, we set $\eta_{\theta}\gg\eta_{\lambda}$ in SafeRL-Kit.
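A minimal sketch of one alternating update in (A.7) is given below in PyTorch-style code, assuming illustrative callables `actor`, `critic`, and `safe_critic`; it is not the exact SafeRL-Kit implementation.

```python
import torch

def primal_dual_step(actor, critic, safe_critic, actor_opt, lam, obs,
                     epsilon=0.1, eta_lambda=1e-5):
    """One alternating primal-dual update following Eq. (A.7). Illustrative sketch:
    actor/critic/safe_critic are assumed callables (e.g., torch.nn.Modules),
    `lam` is the scalar Lagrange multiplier tensor; the updated value is returned."""
    # Primal step on theta: ascend Q - lambda * Q_c (i.e., descend its negation).
    a = actor(obs)
    actor_loss = (-critic(obs, a) + lam * safe_critic(obs, a)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Dual step on lambda with the projection [.]^+ onto lambda >= 0.
    with torch.no_grad():
        violation = (safe_critic(obs, actor(obs)) - epsilon).mean()
        lam = torch.clamp(lam + eta_lambda * violation, min=0.0)
    return lam
```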

The constraint in Off-policy Lagrangian is based on the expectation of the safety critic. Feasible Actor-Critic (FAC) (Ma et al. 2021) instead introduces state-wise constraints for each “feasible” initial state and reformulates (A.6) as

\mathop{\max}_{\lambda\geq 0}\,\mathop{\min}_{\theta}\ \mathbb{E}_{\mathcal{D}}\big{[}-Q^{\pi}(s,\pi_{\theta}(s))+\lambda(s)\big{(}Q^{\pi}_{c}(s,\pi_{\theta}(s))-\epsilon\big{)}\big{]}. (A.8)

The distinctive feature of problem (A.8) is that it involves infinitely many state-dependent Lagrange multipliers. In SafeRL-Kit, we employ a neural network $\lambda(s;\xi)$ with a Softplus output activation to map a given state $s$ to its corresponding Lagrange multiplier $\lambda(s)$. The update of the policy network is similar to the primal step in (A.7); the update of the multiplier network is given by

\xi\leftarrow\xi+\eta_{\xi}\nabla_{\xi}\,\mathbb{E}_{\mathcal{D}}\,\lambda(s;\xi)\big{(}Q^{\pi}_{c}(s,\pi_{\theta}(s))-\epsilon\big{)}. (A.9)

In addition, inspired by Fujimoto et al. (2018), we set different update intervals $m_{\pi}$ (delay steps for $\pi_{\theta}$) and $m_{\lambda}$ (delay steps for $\lambda_{\xi}$) in SafeRL-Kit to stabilize the training process.
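For illustration, a state-dependent multiplier network with a Softplus output and its ascent step (A.9) could be sketched as follows; the architecture and names are assumptions of ours, not the exact SafeRL-Kit modules.

```python
import torch
import torch.nn as nn

class MultiplierNet(nn.Module):
    """Maps a state s to a non-negative Lagrange multiplier lambda(s; xi).
    Illustrative architecture, not SafeRL-Kit's actual module."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # Softplus keeps lambda(s) >= 0
        )

    def forward(self, obs):
        return self.net(obs)

def multiplier_step(multiplier_net, mult_opt, safe_critic, actor, obs, epsilon=0.1):
    """Gradient ascent on xi as in Eq. (A.9), holding the actor fixed."""
    with torch.no_grad():
        violation = safe_critic(obs, actor(obs)) - epsilon  # shape (B, 1)
    # Ascending the dual objective equals descending its negation.
    loss = -(multiplier_net(obs) * violation).mean()
    mult_opt.zero_grad()
    loss.backward()
    mult_opt.step()
```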

Supplementary Material C: Experimental Details

C.1 Hyper-parameter Settings

Table 5: Hyper-parameters of different safety-aware algorithms in SafeRL-Kit.
Hyper-parameter Safety Layer Recovery RL Lagrangian FAC USL(ours)
Cost Limit $\delta$ 0.1 0.1 0.1 0.1 0.1
Reward Discount 0.99 0.99 0.99 0.99 0.99
Cost Discount 0.99 0.99 0.99 0.99 0.99
Warm-up Ratio 0.2 0.2 N/A N/A N/A
Batch Size 256 256 256 256 256
Critic LR 3E-4 3E-4 3E-4 3E-4 3E-4
Actor LR 3E-4 3E-4 3E-4 3E-4 3E-4
Safe Critic LR 3E-4 3E-4 3E-4 3E-4 3E-4
Safe Actor LR N/A 3E-4 N/A N/A N/A
Multiplier LR N/A N/A 1E-5 1E-5 N/A
Multiplier Init N/A N/A 0.0 N/A N/A
Policy Delay 2 2 2 2 2
Multiplier Delay N/A N/A N/A 12 N/A
Penalty Factor $\kappa$ N/A N/A N/A N/A 5
Iterative Step $K$ N/A N/A N/A N/A 20
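For reference, the USL column of Table 5 can be collected into a plain configuration dictionary as shown below; the key names are our own shorthand and do not correspond to the actual SafeRL-Kit argument names.

```python
# Hypothetical grouping of the USL hyper-parameters from Table 5
# (key names are illustrative, not SafeRL-Kit's actual arguments).
usl_config = {
    "cost_limit_delta": 0.1,
    "reward_discount": 0.99,
    "cost_discount": 0.99,
    "batch_size": 256,
    "critic_lr": 3e-4,
    "actor_lr": 3e-4,
    "safe_critic_lr": 3e-4,
    "policy_delay": 2,
    "penalty_factor_kappa": 5,
    "iterative_step_K": 20,
}
```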

C.2 Additional Sensitivity Study of Lagrangian, FAC, and USL

In this section, we study the sensitivity to hyper-parameters of the Lagrangian-based methods and the newly proposed USL in Figure 6 and Figure 7, respectively. We find that Lagrangian-based methods are sensitive to the learning rate of the Lagrange multiplier(s) in stochastic primal-dual optimization. First, the oscillating $\lambda$ causes non-negligible deviations in the learning curves. Moreover, increasing $\eta_{\lambda}$ may degrade the performance dramatically. This phenomenon is especially pronounced in FAC, which uses a multiplier network to predict the state-dependent $\lambda(s;\xi)$. Thus, we suggest $\eta_{\lambda}\ll\eta_{\theta}$ in practice. As for USL, we find that if the penalty factor $\kappa$ is too small, the cost return may fail to converge. Nevertheless, as long as $\kappa$ is sufficiently large, the learning curves are robust and almost identical. Thus, we suggest $\kappa>5$ in experiments and a grid search for better performance.

[Figure 6 panels: (a) Reward-Lag-SpeedLimit, (b) Cost-Lag-SpeedLimit, (c) Reward-FAC-MetaDrive, (d) Cost-FAC-MetaDrive]
Figure 6: Sensitivity study of Lagrangian-based methods. The first two panels show the reward and cost curves of Off-policy Lagrangian on the Car-SpeedLimit task with different $\lambda$ learning rates. The last two panels show the success-rate and cost curves of Feasible Actor-Critic on the MetaDrive benchmark with different $\lambda(s;\xi)$ learning rates.
[Figure 7 panels: (a) Reward-USL-SpeedLimit, (b) Cost-USL-SpeedLimit, (c) Reward-USL-MetaDrive, (d) Cost-USL-MetaDrive]
Figure 7: Sensitivity study of USL. The first two panels show the reward and cost curves of USL on the Car-SpeedLimit task with different penalty factors $\kappa$. The last two panels show the success-rate and cost curves of USL on the MetaDrive benchmark with different penalty factors $\kappa$.

C.3 Additional Comparative Study of State-wise Safe RL, Episodic Safe RL, and Reward Shaping

In Section 1: Introduction, we claim that “Penalizing unsafe transitions on the reward function (i.e., $r^{\prime}=r-\sigma\cdot c$) is straightforward but sometimes cumbersome to navigate the trade-off between performance and safety.” In this section, we empirically examine this claim by varying the punishment intensity $\sigma$ of the reward shaping method on the Stabilization task.
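As a minimal illustration of the shaping rule $r^{\prime}=r-\sigma\cdot c$ used in this comparison, one could wrap the environment as below. This is a sketch assuming the classic Gym step API and a `cost` entry in the info dict; the wrapper is not part of SafeRL-Kit.

```python
import gym

class RewardShapingWrapper(gym.Wrapper):
    """Replaces the reward with r' = r - sigma * c, assuming the per-step cost
    is exposed as info["cost"] (an assumption for this sketch)."""
    def __init__(self, env, sigma=1.0):
        super().__init__(env)
        self.sigma = sigma

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        shaped = reward - self.sigma * info.get("cost", 0.0)
        return obs, shaped, done, info
```

Varying `sigma` corresponds to the different punishment intensities compared in Figure 8.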

[Figure 8 panels: (a) Reward-Stabilization, (b) Cost-Stabilization, (c) CostRate-Stabilization]
Figure 8: Comparative study of different punishment intensities of the reward shaping method on the Stabilization task.

In Section 3: Revisit RL toward Safety-critical Tasks, we claim that “In many safety-critical scenarios, the final policy is supposed to maintain the zero-cost return since any inadmissible behavior could lead to catastrophic failure in the execution. Prior constrained learning paradigms have fatal flaws under this premise. For the episode constraint, if we set the threshold $d$ close to 0, the agent either fails to improve policy or receives a cost-return more significant than 0.” In this section, we empirically demonstrate that episodic safe RL may not work on safety-critical tasks when we set $d\rightarrow 0$ in the corresponding constraint $J_{c}(\pi)\leq d$. We perform a comparative study on the Stabilization task and take CPO, PPO-L, and TRPO-L (Ray, Achiam, and Amodei 2019) as episodic baselines. The results show that CPO/PPO-L/TRPO-L fail to improve the policy and incur cost returns greater than 0 when we set $d=0$. In contrast, the state-wise safe RL method (we take USL as an example) obtains a reasonable policy while converging to a zero-cost return.

[Figure 9 panels: (a) Reward-Stabilization, (b) Cost-Stabilization, (c) CostRate-Stabilization]
Figure 9: Comparative study between state-wise safe RL (USL) and episodic safe RL (CPO/PPO-L/TRPO-L) on the Stabilization task.

In Section 3: Revisit RL toward Safety-critical Tasks, we claim that “Empirical results demonstrate safe RL methods adhering to state-wise safety constraints are robust to the $\delta$ value.” In this section, we empirically demonstrate that changing $\delta$ from 0.1 to 1 (namely, shrinking the safe planning span from 100 steps to 1) does not noticeably affect the learning curves. However, when we set $\delta=1$, the constraint degenerates to the instantaneous safety constraint, and the agent cannot obtain a zero-cost-return policy at convergence.

[Figure 10 panels: (a) Reward-Stabilization, (b) Cost-Stabilization, (c) CostRate-Stabilization]
Figure 10: Comparative study of different safe planning span settings in state-wise safe RL (USL) on the Stabilization task.