
SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition

Dylan Slack (UC Irvine, [email protected])
Yinlam Chow (Google Research, [email protected])
Bo Dai (Google Research, [email protected])
Nevan Wichers (Google Research, [email protected])
Work performed while an intern at Google AI
Abstract

Methods that extract policy primitives from offline demonstrations using deep generative models have shown promise at accelerating reinforcement learning (RL) for new tasks. Intuitively, these methods should also help to train safe RL agents because they enforce useful skills. However, we identify that these techniques are not well equipped for safe policy learning: they ignore negative experiences (e.g., unsafe or unsuccessful ones) and focus only on positive experiences, which harms their ability to generalize safely to new tasks. Instead, we model the latent safety context using principled contrastive training on an offline dataset of demonstrations from many tasks, including both negative and positive experiences. Using this latent variable, our RL framework, SAFEty skill pRiors (SAFER), extracts task-specific safe primitive skills in order to generalize to new tasks safely and successfully. At inference time, policies trained with SAFER learn to compose safe skills into successful policies. We theoretically characterize why SAFER can enforce safe policy learning and demonstrate its effectiveness on several complex safety-critical robotic grasping tasks inspired by the game Operation (https://en.wikipedia.org/wiki/OperationGame), in which SAFER outperforms state-of-the-art primitive learning methods in success and safety.

Keywords: Primitives, Offline RL, Behavioral Prior

1 Introduction

Decision Awareness in Reinforcement Learning Workshop at the 39th International Conference on Machine Learning (ICML), Baltimore, Maryland, USA, 2022. Copyright 2022 by the author(s).

Reinforcement learning (RL) has demonstrated strong performance at solving complex control tasks. However, RL algorithms still require considerable exploration to acquire successful policies. For many complex safety-critical applications (e.g., autonomous driving, healthcare), extensive interaction with an environment is impossible due to the dangers associated with exploration. These difficulties are further complicated by the challenge of specifying safety constraints in complex environments. Nevertheless, relatively few existing safe reinforcement learning algorithms can rapidly and safely solve complex RL problems with hard-to-specify safety constraints.

One promising route is offline primitive learning [1, 2, 3, 4]. These methods use offline datasets to learn representations of useful actions or behaviors through deep generative models, such as normalizing flows or variational autoencoders (VAEs). Specifically, they treat the latent space of the generative model as an abstract action space of higher-level actions (i.e., skills). Using the learned primitives, these methods train an RL agent to map states onto the abstract action space of skills for each downstream task. This approach can significantly accelerate policy learning because the generative model learns useful primitives from a dataset, simplifying the action space [5].

However, primitive learning techniques suffer from a critical drawback when applied to safety-critical tasks. Intuitively, if trained on datasets consisting of trajectories that are both safe and successful, offline skill learning methods should capture safe and useful behaviors and encourage the rapid acquisition of safe policies on future tasks (downstream learning). For example, when trained on data from everyday household tasks, these methods should learn behaviors that successfully and safely accomplish similar tasks, such as handling objects carefully or avoiding animals in the environment. However, when offline skill learning methods are trained only on safe experiences, unsafe data lies outside the training distribution. It is well known that deep generative models have problems generalizing to out-of-distribution data, which increases the likelihood of unsafe actions (see Fig. 1) [6, 7, 8]. Thus, current state-of-the-art primitive learning techniques may, counter-intuitively, encourage unsafe behavior.

[Bar chart comparing SAFER and PARROT: log-likelihood lower bound (x-axis) vs. % unsafe (state, action) pairs (y-axis).]
Figure 1: Evaluating the concentration of unsafe data in high-likelihood regions by computing the % of unsafe state-action pairs in a holdout dataset of a safe robotic grasping task. PARROT assigns high likelihoods to unsafe data, i.e., it does not encourage safety, while SAFER assigns much lower likelihood to unsafe data, so it will encourage safety.

In this work, we identify that modeling the latent safety context is the key to overcoming these challenges. To this end, we introduce SAFER: SAFEty skill pRiors, a primitive learning technique that accelerates reinforcement learning with safe actions. (An overview is provided in Figure 2.) To acquire safe skills, SAFER i) uses a contrastive loss to distinguish safe and unsafe data and ii) learns a posterior sampling distribution of a latent safety variable that captures different safety contexts. Using the safety context, SAFER establishes a set of task-specific safe actions, greatly improving safety generalization. Consequently, policies trained using the SAFER abstract actions as the action space learn to compose a set of safe policy primitives. As shown in Figure 1, SAFER assigns much lower likelihood to unsafe states and actions, indicating that it will better promote safe behaviors when applied to downstream RL. To demonstrate the effectiveness of SAFER, we evaluate it on a set of complex safety-critical robotic grasping tasks. When compared with state-of-the-art primitive learning methods, SAFER achieves both a higher success rate and fewer safety violations.

2 Related Work

Safe Exploration Several related works focus on safe exploration in RL when there is access to a known constraint function [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. In our work, we focus on the setting where the constraint function cannot be easily specified and must be inferred entirely from data, which is critical for scaling safe RL methods to the real world. To this end, a few works consider a similar setting where the constraints must be inferred from data. Thananjeyan et al. [21] use an offline dataset of safety constraint violations to learn about safety constraints and train a policy to recover from safety violations so the agent can continue exploring safely. Yang et al. [22] use natural language to enforce a set of safety constraints during policy learning. Compared to our work, these methods focus on constrained exploration in a single-task setting. Instead, we consider accelerating learning across multiple tasks through learned safe primitives.

Demonstrations for Safe RL The use of demonstrations to ensure safety in RL has received considerable interest in the literature [23, 24, 25, 26]. Most relevant, Srinivasan et al. [27] use unsafe demonstrations to constrain exploration to only a safe set of actions for task adaptation. Thananjeyan et al. [28] rely on a set of sub-optimal demonstrations to safely learn new tasks. Though these works leverage demonstrations to improve safety, they each rely on task-specific demonstrations. Instead, we focus on learning generalizable safe primitives, which we transfer to downstream tasks, and we demonstrate this can greatly accelerate safe policy learning.

Skill Discovery Various works consider learning skills in an online fashion [29, 30, 31, 32, 33]. These methods learn skills for planning [31] or online RL [29, 30]. In contrast, we focus on a setting with access to an offline dataset, from which the primitives are learned. Further works also use offline datasets to extract skills, and transfer these to downstream learning [2, 3, 4], but they do not model the safety of the downstream tasks, which we demonstrate is critical for safe generalization.

Hierarchical RL Numerous works have found it beneficial to learn high-level primitives with auxiliary models and control them with RL [1, 34, 35, 36, 37, 38, 39, 40, 41]. Though these methods can accelerate the acquisition of successful policies, they do not specifically consider learning with safety constraints, which makes them susceptible to the generalization issues discussed in Section 1 and Section 3, where unsafe behavior can inadvertently be made highly likely. In contrast, SAFER learns a hierarchical policy that explicitly considers the safety of tasks, resulting in both safe and successful generalization to downstream tasks and addressing the aforementioned issues.

3 Background

In this section, we provide background for our problem setting. Recall the motivating household robotics example, where we wish to train an agent to accomplish a series of household tasks. The agent must learn to do tasks like setting a cast iron pot to boil, removing dirt from a dish with a sponge, or cutting an apple with a knife. Within these tasks, there are different goals and notions of safety. For example, the robot can safely drop the sponge but cannot safely drop the cast iron pot while cooking, because this would be quite dangerous. From a training perspective, it is difficult to devise safety violation functions for all tasks, given how many ways one could behave unsafely with a cast iron pot or knife. However, it is straightforward to determine whether the task is successful (e.g., the apple is cut in half or it is not). Consequently, it is more reasonable to assume an offline data collection process in which a large set of behaviors has been annotated for success and safety violation through simulation or real-world experience. Agents must then rely entirely on the existing data to learn safety constraints when generalizing to downstream tasks, though they may have access to a sparse reward signal.

Safety MDP In a setting with different tasks and safety constraints, for each task $\mathcal{T}$ the agent's interaction is modeled as a safety Markov decision process (safety MDP). A safety MDP is a tuple $(\mathcal{S},\mathcal{A},\mathrm{T},r,\gamma,{\bm{s}}_0,\omega({\bm{s}},{\bm{a}}))$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $\mathrm{T}(\cdot|{\bm{s}},{\bm{a}})$ is the transition probability, $r({\bm{s}},{\bm{a}})$ is the reward function, $\gamma\in[0,1)$ is the discount factor, ${\bm{s}}_0\in\mathcal{S}$ is the initial state, and $\omega({\bm{s}},{\bm{a}})\in\{0,1\}$ is the safety violation function, which indicates whether the current state and action lead to a safety violation ($1$) or no safety violation ($0$). Given a policy $\mu$, we define the expected return as $\mathcal{R}_{\mu}({\bm{s}}_0):=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^t r({\bm{s}}_t,{\bm{a}}_t)\mid\mu,{\bm{s}}_0]$ and, at each state ${\bm{s}}\in\mathcal{S}$, the safety constraint function (i.e., expected safety violation) as $\mathcal{W}_{\mu}({\bm{s}}):=\mathbb{E}[\omega({\bm{s}},{\bm{a}})\mid\mu,{\bm{s}}]$. The safety constraint is then $\mathcal{W}_{\mu}({\bm{s}})\leq\epsilon$, where $\epsilon\in[0,1]$ is the tolerable violation threshold. For each task, the goal in a safety MDP is to satisfy the safety constraint while maximizing expected return.
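For concreteness, the expected return and the safety constraint can be estimated empirically from sampled rollouts. The following is a minimal sketch of such Monte Carlo estimates (a simplification for illustration, not part of the SAFER algorithm itself):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Monte Carlo estimate of the expected return R_mu(s_0) from one rollout."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

def satisfies_safety_constraint(violations, epsilon=0.05):
    """Empirical analogue of the constraint W_mu(s) <= epsilon, where `violations`
    holds the safety labels omega(s_t, a_t) in {0, 1} for each step of a rollout."""
    return float(np.mean(violations)) <= epsilon
```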

Refer to caption
Figure 2: Overview of SAFER: SAFER optimizes the posterior over a latent safety variable (left-hand side of figure) that encodes safety information about the environment. SAFER uses the safety variable to learn an abstract action space $\mathcal{Z}$ that maps to safe and useful behaviors through a normalizing flow $f_{\phi}$ (middle of figure). SAFER accelerates RL training by learning a latent-action policy $\pi_{\theta}({\bm{z}}|{\bm{s}})$ in $\mathcal{Z}$ (right-hand side of figure).

Offline Primitive Learning In many problems, we may not have access to the underlying reward and safety violation functions across many different tasks. Instead, we assume access to an offline dataset $\mathcal{D}$, which consists of state-action rollouts $\tau=\{{\bm{s}}_0,{\bm{a}}_0,\dots,{\bm{s}}_t,{\bm{a}}_t\}$ collected across different tasks, where the reward and safety violation are labeled for each state-action pair. Further, when adapting to new tasks, we assume that we do not have access to the underlying safety violation function and only have a sparse reward signal for whether the task was completed successfully. Thus, the safety constraints must be inferred entirely from the data.

To use the offline dataset $\mathcal{D}$ to generalize to downstream tasks, offline primitive discovery techniques [1, 4, 2, 3] use a policy structure consisting of a prior $\mu_{\psi}=f_{\phi}({\bm{z}};{\bm{s}})$ and a policy ${\bm{z}}\sim\pi_{\theta}({\bm{z}}|{\bm{s}})$. In this parameterization, the prior $f_{\phi}:\mathcal{Z}\times\mathcal{S}\rightarrow\mathcal{A}$, with learnable parameters $\phi$, maps from the abstract action space $\mathcal{Z}$ and state space $\mathcal{S}$ to the action space $\mathcal{A}$ and is trained to learn a set of useful skills from the dataset $\mathcal{D}$. The task-dependent, high-level policy $\pi_{\theta}:\mathcal{S}\rightarrow\mathbb{P}(\mathcal{Z})$ maps any state ${\bm{s}}\in\mathcal{S}$ to a distribution over abstract actions in $\mathcal{Z}$. In this way, policies $\pi_{\theta}$ trained on downstream tasks learn to compose the primitives learned by $f_{\phi}$ from the offline dataset. Different ways to express the behavior prior mapping have been proposed and have been found to greatly accelerate policy learning. For instance, Ajay et al. [4] optimize the likelihood of actions conditioned on the state and abstract action, $\log\pi_{\theta}({\bm{a}}|{\bm{s}},{\bm{z}})$. Singh et al. [1] directly optimize the log-likelihood $\log p({\bm{a}}|{\bm{s}})$ and fix an invertible mapping, via a conditional normalizing flow [42], between the abstract action space $\mathcal{Z}$ and the distribution over useful actions $p({\bm{a}}|{\bm{s}})$.

Issues With Offline Primitive Discovery for Safe RL Though current offline primitive discovery methods are highly useful for accelerating learning, they only increase the likelihood of useful actions. Thus, when applied to a safety MDP, unsafe or unsuccessful data should not be used, because it would be counterproductive to increase the likelihood of those actions [1, 4]. Consequently, unsafe states and actions may be out of distribution (OOD). It is well established in the literature on deep generative models (including the techniques used in offline primitive discovery methods) that OOD data is handled poorly and, in some cases, might have higher likelihood than in-distribution data [6, 7, 8]. As we see in Figure 1, these observations hold for current techniques, where unsafe data has high likelihood, indicating that they may encourage unsafe behavior. Since the offline primitive discovery policy structure relies on high-likelihood actions from the prior [4, 1], using the aforementioned behavior priors for safety will be problematic.

4 SAFER: Safety Skill Priors

Considering the shortcomings of existing offline primitive discovery techniques discussed in Section 3 and the need for methods that can learn complex safety constraints, a method that encourages safety should ideally i) be capable of learning complex safety constraints by sufficiently exploiting the data, thereby avoiding the OOD issue; ii) permit the specification of undesirable behaviors through data; and iii) accelerate the learning of successful policies. Motivated by these requirements, in this section we introduce SAFER, an offline primitive learning method that circumvents the aforementioned shortcomings and is specifically designed for safety MDPs.

4.1 Latent Safety Variable

To address these criteria, we introduce a latent variable called the safety variable, ${\bm{c}}\in\mathcal{C}$, that encodes safety context about the environment, i.e., $f_{\phi}:\mathcal{Z}\times\mathcal{C}\times\mathcal{S}\rightarrow\mathcal{A}$. This construction encodes information beyond the current state ${\bm{s}}$ to help SAFER model complex per-task safety dynamics. For example, the safety variable could encode the locations of people or animals while a robot performs household tasks. Because we do not assume the safety variable is provided, we infer it using a neural network.

4.2 Learning The Safety Variable

In order to train the prior $f_{\phi}$ and the posterior over the safety variable, we adopt a variational inference (VI) approach: we jointly train an invertible conditional normalizing flow $f_{\phi}$ [42] as the prior and a posterior over the safety variable. At each state ${\bm{s}}\in\mathcal{S}$ and safety variable ${\bm{c}}\in\mathcal{C}$, the flow model $f_{\phi}$ maps a unit-Normal abstract action ${\bm{z}}\in\mathcal{Z}$ (i.e., samples ${\bm{z}}=f^{-1}_{\phi}({\bm{a}};{\bm{s}};{\bm{c}})$ of the inverse flow follow the distribution $p_{\mathcal{Z}}(\cdot):=\mathcal{N}(0,I)$) onto the action space $\mathcal{A}$ of safe behaviors, and thus the corresponding prior action distribution is given by

$$p_{\phi}({\bm{a}}|{\bm{s}},{\bm{c}}) := p_{\mathcal{Z}}\big(f^{-1}_{\phi}({\bm{a}};{\bm{s}};{\bm{c}})\big)\cdot\big|\det\big(\partial f^{-1}_{\phi}({\bm{a}};{\bm{s}};{\bm{c}})/\partial{\bm{a}}\big)\big|. \qquad (1)$$

The flow model is a good choice for the mapping $f_{\phi}$ because it allows computing exact log-likelihoods. Further, it yields a mapping such that actions taken in the abstract action space ${\bm{z}}\in\mathcal{Z}$ can easily be transformed into useful ones, ${\bm{a}}=f_{\phi}({\bm{z}};{\bm{s}};{\bm{c}})$. However, since VI maximizes a lower bound on the likelihood, it does not explicitly enforce the safety requirements in the safety variable ${\bm{c}}$. To overcome this issue, we encode safety into ${\bm{c}}$ by formulating the learning problem as a chance constrained optimization problem [43].
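To make the change-of-variables density in Eq. (1) concrete, the sketch below implements a toy conditional flow prior in PyTorch. It uses a single conditional affine bijector purely for illustration (not the Real NVP architecture we use in practice), and all layer sizes and names are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class ConditionalAffineFlow(nn.Module):
    """Toy stand-in for f_phi: an elementwise affine map of z conditioned on (s, c)."""

    def __init__(self, state_dim, context_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim),  # per-dimension shift and log-scale
        )
        self.base = Normal(0.0, 1.0)            # p_Z = N(0, I)

    def _params(self, s, c):
        shift, log_scale = self.net(torch.cat([s, c], dim=-1)).chunk(2, dim=-1)
        return shift, log_scale

    def forward(self, z, s, c):
        """a = f_phi(z; s; c): map an abstract action to an environment action."""
        shift, log_scale = self._params(s, c)
        return z * log_scale.exp() + shift

    def log_prob(self, a, s, c):
        """log p_phi(a|s,c) = log p_Z(f^{-1}(a;s;c)) + log|det d f^{-1}/d a| (Eq. 1)."""
        shift, log_scale = self._params(s, c)
        z = (a - shift) * torch.exp(-log_scale)   # inverse map f_phi^{-1}
        log_det_inv = -log_scale.sum(dim=-1)      # log |det of the inverse Jacobian|
        return self.base.log_prob(z).sum(dim=-1) + log_det_inv
```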

Chance Constrained Optimization Formally, our objective arises from optimizing a neural network to infer the posterior over the safety variable ${\bm{c}}$ using amortized variational inference [44]. In particular, we parameterize the posterior as $q_{\rho}({\bm{c}}\,|\,\Lambda)$, where ${\bm{c}}$ is the safety variable and $\Lambda$ is the information from which to infer it. We set $\Lambda$ to be a sliding window of states: if ${\bm{s}}_t$ is the current state at time $t$ and $w$ is the window size, then $\Lambda=[{\bm{s}}_t,{\bm{s}}_{t-1},\dots,{\bm{s}}_{t-w}]$. We infer the safety variable from the sliding window $\Lambda$ because we expect it to contain useful information concerning safe learning. For example, in a robotics setting where the observations are images, previous states may contain useful information about the locations of objects to avoid that are unobserved in the current state. We write the evidence lower bound (ELBO) of our model as

$$\mathbb{E}_{{\bm{c}}\sim q_{\rho}(\cdot|\Lambda)}\big[\log p_{\phi}({\bm{a}}|{\bm{s}},{\bm{c}})\big] - D_{\textrm{KL}}\big(q_{\rho}(\cdot|\Lambda)\,\|\,p(\cdot)\big), \qquad (2)$$

where ${\bm{a}}$ is a safe action (i.e., $\omega({\bm{s}},{\bm{a}})=0$) and $p({\bm{c}})$ is a prior over the safety variable ${\bm{c}}$. To ensure that SAFER samples unsafe actions only with low probability, we add a chance constraint on the likelihood of unsafe actions [45] to the ELBO optimization,

$$\max_{\rho,\phi,\xi}\;\; \mathbb{E}_{{\bm{c}}\sim q_{\rho}(\cdot|{\bm{s}})}\big[\log p_{\phi}({\bm{a}}|{\bm{s}},{\bm{c}})\big] - D_{\textrm{KL}}\big(q_{\rho}(\cdot|\Lambda)\,\|\,p(\cdot)\big) - \lambda^{\prime}\xi \qquad (3)$$
$$\text{s.t.}\;\; \mathbb{P}_{{\bm{c}}\sim q_{\rho}(\cdot|{\bm{s}})}\big(p_{\phi}({\bm{a}}_{\textrm{unsafe}}|{\bm{s}},{\bm{c}})>\epsilon\big)\leq\xi,$$

where the constraint states that the probability, over safety variables ${\bm{c}}$ drawn from the posterior, that the likelihood of an unsafe action (i.e., one with $\omega({\bm{s}},{\bm{a}}_{\textrm{unsafe}})=1$) exceeds the safety threshold $\epsilon$ is at most $\xi$. Intuitively, this objective enforces that the safety variable makes safe actions as likely as possible while minimizing the probability of unsafe actions.

Tractable Lower Bound Due to the difficulty in optimizing the chance constrained ELBO objective, we instead consider optimizing an unconstrained surrogate lower bound [45]. We provide a proof in Appendix Section A.

Proposition 4.1

Assuming the chance constrained ELBO is written as in Equation 3, we can write the surrogate lower bound as,

$$\max_{\rho,\phi}\; \mathbb{E}_{{\bm{c}}\sim q_{\rho}(\cdot|{\bm{s}})}\Big[\log p_{\phi}({\bm{a}}|{\bm{s}},{\bm{c}}) - \lambda\log p_{\phi}({\bm{a}}_{\textrm{unsafe}}|{\bm{s}},{\bm{c}})\Big] - D_{\textrm{KL}}\big(q_{\rho}(\cdot|\Lambda)\,\|\,p(\cdot)\big) \qquad (4)$$

We refer to this objective as the SAFER Contrastive Objective. It has an intuitive interpretation: the first two terms act as a contrastive loss that encourages safe actions (high likelihood) while discouraging unsafe ones (low likelihood), and, together with the final KL term, they force the variable ${\bm{c}}$ to contain useful information about safety. Thus the objective satisfies our goals, allowing safety constraints to be inferred through the safety variable while discouraging unsafe behaviors. Finally, since SAFER can increase the likelihood of any safe behavior, the remaining criterion, that the offline primitive discovery technique accelerate downstream policy learning, is met by using safe and successful trajectory data during SAFER training.
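For concreteness, a minimal PyTorch sketch of this contrastive objective is given below. It assumes a `flow.log_prob(a, s, c)` method computing $\log p_{\phi}({\bm{a}}|{\bm{s}},{\bm{c}})$ (e.g., the toy flow sketched earlier in this section) and a `posterior(window)` module returning a diagonal Gaussian $q_{\rho}({\bm{c}}|\Lambda)$; both interfaces are illustrative, not the exact implementation:

```python
import torch
from torch.distributions import Normal, kl_divergence

def safer_contrastive_loss(flow, posterior, s, window, a_safe, a_unsafe, lam=1.0):
    """Negative of the surrogate objective in Eq. (4): contrastive log-likelihood
    of safe vs. unsafe actions plus a KL term on the safety variable posterior."""
    q_c = posterior(window)                       # q_rho(c | Lambda), a diagonal Normal
    c = q_c.rsample()                             # reparameterized sample for gradients
    log_p_safe = flow.log_prob(a_safe, s, c)      # log p_phi(a | s, c)
    log_p_unsafe = flow.log_prob(a_unsafe, s, c)  # log p_phi(a_unsafe | s, c)
    prior_c = Normal(torch.zeros_like(q_c.mean), torch.ones_like(q_c.stddev))
    kl = kl_divergence(q_c, prior_c).sum(dim=-1)  # D_KL(q_rho(.|Lambda) || p(.))
    objective = log_p_safe - lam * log_p_unsafe - kl
    return -objective.mean()                      # minimize the negative lower bound
```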

Algorithm 1 Accelerating Safe Reinforcement Learning with SAFER
Require: SAFER prior $f_{\phi}$, safety posterior $q_{\rho}({\bm{c}}|\Lambda)$, safety bound $\eta$, task $\mathcal{T}$, window $\Lambda=\{\}$
for step $k=1,\dots,K$ do
   ${\bm{s}}_k \leftarrow$ current state
   ${\bm{c}}_k \leftarrow \mathbb{E}_{{\bm{c}}\sim q_{\rho}(\cdot|\Lambda_k)}[{\bm{c}}]$  {Mean safety variable}
   ${\bm{z}}_k \sim \pi_{\theta}(\cdot|{\bm{s}}_k)$  {Sample abstract action}
   ${\bm{a}}_k \leftarrow f_{\phi}({\bm{z}}_k;{\bm{s}}_k;{\bm{c}}_k)$  {Get SAFER action}
   ${\bm{s}}_{k+1}, r_k, \omega_k \leftarrow$ perform ${\bm{a}}_k$ in task $\mathcal{T}$
   Update $\pi_{\theta}({\bm{z}}|{\bm{s}})$ using $({\bm{s}}_k,{\bm{z}}_k,{\bm{s}}_{k+1},r_k)$
   Update $\Lambda$ with ${\bm{s}}_k$ in FIFO order
end for
Return: policy $\pi_{\theta}({\bm{z}}|{\bm{s}})$ for task $\mathcal{T}$

Parameterization Choices To parameterize the SAFER action mapping $f_{\phi}$, we use the Real NVP conditional normalizing flow proposed by Dinh et al. [42], because it is highly expressive and allows exact log-likelihood computation. Next, we parameterize the posterior distribution $q_{\rho}({\bm{c}}|\Lambda)$ over the safety variable as a diagonal Gaussian, which makes the KL term efficient to compute while enabling an expressive latent space. We use a transformer architecture to model the sequential dependency between the Gaussian safety variable ${\bm{c}}$ and the window of previous states $\Lambda$ [46]. Finally, because the state space is an image pixel space, we encode each observation into a vector using a CNN. An overview of the architecture is given in Figure 2.
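One way such a posterior could be structured is sketched below; the layer sizes, the mean-pooling over the window, and the module names are illustrative assumptions rather than our exact architecture:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class SafetyPosterior(nn.Module):
    """Sketch of q_rho(c | Lambda): CNN-encode each frame in the state window,
    aggregate with a transformer encoder, and output a diagonal Gaussian over c."""

    def __init__(self, context_dim=8, embed_dim=64):
        super().__init__()
        self.frame_encoder = nn.Sequential(      # per-frame CNN for 48x48 RGB images
            nn.Conv2d(3, 16, 4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(embed_dim),
        )
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, 2 * context_dim)  # mean and log-std of c

    def forward(self, window_frames):
        # window_frames: (batch, window, 3, 48, 48)
        b, w = window_frames.shape[:2]
        tokens = self.frame_encoder(window_frames.flatten(0, 1)).view(b, w, -1)
        pooled = self.transformer(tokens).mean(dim=1)      # aggregate over the window
        mean, log_std = self.head(pooled).chunk(2, dim=-1)
        return Normal(mean, log_std.exp())                 # diagonal Gaussian q_rho(c|Lambda)
```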

Training First, it is necessary to use the reparameterization trick to compute gradients through the objective in Equation 5 [47]. Second, optimizing Equation 5 involves minimizing an unbounded log-likelihood in the second term of the objective. This term can lead to numerical instabilities when $p_{\phi}({\bm{a}}_{\textrm{unsafe}}|{\bm{s}},{\bm{c}})$ is too small. To overcome these issues, we use gradient clipping and freeze this term if it starts to diverge. Pseudocode of the procedure to train SAFER is provided in Algorithm 2 in Appendix F, and hyperparameter details are provided in Appendix D.
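A minimal sketch of these stability measures follows; `loss_terms` is a hypothetical helper returning the three pieces of the objective (e.g., built from the contrastive-loss sketch in Section 4.2), and the divergence threshold is purely illustrative:

```python
import torch

def stabilized_step(loss_terms, params, optimizer, lam=1.0, clip_norm=1.0,
                    freeze_below=-50.0):
    """One optimization step with gradient clipping and freezing of the unsafe
    log-likelihood term when it diverges (freeze_below is an illustrative threshold)."""
    log_p_safe, log_p_unsafe, kl = loss_terms()
    if log_p_unsafe.mean().item() < freeze_below:
        log_p_unsafe = log_p_unsafe.detach()        # freeze the diverging term
    loss = -(log_p_safe - lam * log_p_unsafe - kl).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, clip_norm)  # gradient clipping
    optimizer.step()
    return loss.item()
```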

4.3 Accelerating Safe RL with SAFER

When using SAFER on a safe RL task, the goal is to accelerate safe learning by leveraging the mapping $f_{\phi}$ in the hierarchical policy $\mu_{\psi}({\bm{s}},{\bm{c}})=\int_{{\bm{z}}} f_{\phi}({\bm{z}};{\bm{s}};{\bm{c}})\,d\pi_{\theta}({\bm{z}}|{\bm{s}})$, where the parameters $\phi$ of the mapping are fixed and the parameters $\theta$ need to be optimized (pseudocode of the procedure is provided in Algorithm 1). The policy $\pi_{\theta}({\bm{z}}|{\bm{s}})$ can be learned by any standard RL method (e.g., SAC [48]) that produces continuous actions. To leverage SAFER at inference time, at each timestep $t$ the RL policy takes an action in the abstract action space, ${\bm{z}}_t\sim\pi_{\theta}({\bm{z}}|{\bm{s}}={\bm{s}}_t)$. Using the sliding window of states $\Lambda$, the safety variable posterior computes the distribution over the safety variable ${\bm{c}}_t$. (If there are insufficient states to fill a window of size $w$, e.g., at the beginning of the rollout, we pad the available states with zeros to construct a window of $w$ states.) Because a single safety variable value ${\bm{c}}_t$ is required, we fix it at its mean, $\mathrm{E}[{\bm{c}}_t]=\int {\bm{c}}\,dq_{\rho}({\bm{c}}|\Lambda_t)$. Finally, SAFER computes the action ${\bm{a}}_t=f_{\phi}({\bm{z}}_t;{\bm{s}}_t;\mathrm{E}[{\bm{c}}_t])$, the action is taken in the environment, and the reward $r({\bm{s}}_t,{\bm{a}}_t)$ and safety violation $\omega({\bm{s}}_t,{\bm{a}}_t)$ are returned. The abstract action ${\bm{z}}_t$ and reward $r_t$ are added to the replay buffer for subsequent RL training.
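The sketch below mirrors one step of Algorithm 1 at inference time. Here `env`, `policy`, `flow`, and `posterior` are assumed interfaces (an environment that returns the reward and safety violation, a latent-action policy with a `sample_abstract_action` method, and the flow and posterior modules sketched in Section 4.2), and `window` is a FIFO buffer, e.g., `collections.deque(maxlen=w)`, of past state tensors:

```python
import torch

def safer_rollout_step(env, policy, flow, posterior, window, w=5):
    """One environment step with SAFER at inference time (mirrors Algorithm 1)."""
    s = env.current_state()                                      # image tensor, (3, 48, 48)
    frames = list(window)
    frames = [torch.zeros_like(s)] * (w - len(frames)) + frames  # zero-pad short windows
    lam_t = torch.stack(frames).unsqueeze(0)                     # Lambda_t: (1, w, 3, 48, 48)
    c_t = posterior(lam_t).mean                                  # fix c_t at its posterior mean
    z_t = policy.sample_abstract_action(s.unsqueeze(0))          # z_t ~ pi_theta(. | s_t)
    a_t = flow(z_t, s.unsqueeze(0), c_t)                         # a_t = f_phi(z_t; s_t; c_t)
    next_s, r_t, omega_t = env.step(a_t.squeeze(0))              # reward and safety violation
    window.append(s)                                             # FIFO update of Lambda
    return (s, z_t, next_s, r_t), omega_t                        # transition for the RL buffer
```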

4.4 Using SAFER to Guarantee Safety

Next, we demonstrate how SAFER can be used to theoretically guarantee safety for any policy trained under the prior. To show this, we assume there always exist safe actions to take in the environment and make an optimality assumption about the prior in (5). Then, we can construct a bound on the range of abstract actions that ensures only safe actions under the prior:

Proposition 4.2

There exists an $\eta$ such that the corresponding bounded abstract actions ${\bm{z}}\in(-\eta,\eta)$ are safe, i.e., $\omega({\bm{s}},f_{\phi}({\bm{z}};{\bm{c}};{\bm{s}}))=0$ for all ${\bm{z}}\in(-\eta,\eta)$, ${\bm{s}}$, ${\bm{c}}$.

As a result, we can construct a latent variable bound around the mean of $\mathcal{Z}$, since the actions $f_{\phi}({\bm{z}};{\bm{c}};{\bm{s}})$ that are more likely to be safe and successful are closer to the mean. Because unsafe actions have lower likelihood under the SAFER prior and our assumption ensures that safe actions exist, there must exist a finite latent variable bound $\eta$ that contains only safe actions. Consequently, with such an $\eta$ from Proposition 4.2, any agent $\pi_{\theta}$ that is trained under the SAFER prior and has a bounded abstract action output ${\bm{z}}\in(-\eta,\eta)$ is safe. The full proof details are provided in Appendix B.

In practice, we use an offline dataset with safe $({\bm{s}},{\bm{a}})$ and unsafe $({\bm{s}},{\bm{a}}_{\text{unsafe}})$ state-action pairs to determine the value of $\eta$ that ensures safety. It is also acceptable to fix a range $(-\eta,\eta)$ that includes a small number of unsafe actions (e.g., at most a $1-\epsilon$ portion of all actions in the data is unsafe) to avoid an overly tight bound. We optimize the real-valued $\eta>0$ with a numerical, gradient-free approach. First, we initialize $\eta=\eta_0$ with a large constant and use it to generate the corresponding SAFER abstract actions ${\bm{z}}$ w.r.t. the offline data, whose latent values are bounded in $(-\eta_0,\eta_0)$. We sub-sample this SAFER-action bootstrapped dataset to construct a refined one in which at most a $1-\epsilon$ portion of the actions is unsafe. Since the normalizing flow in SAFER is invertible, computing the corresponding latent action value for every $({\bm{s}},{\bm{a}})$ pair in this data is straightforward. This allows us to estimate $\eta_1$, the maximum latent action magnitude in this dataset. We repeat this procedure until convergence. If the offline dataset contains sufficiently diverse state-action data covering most situations encountered by SAFER, we expect the resulting safety threshold to generalize [49]. We refer to this procedure as computing the SAFER safety assurances and provide pseudocode in Algorithm 3 in the Appendix.
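A gradient-free sketch of this search is given below. Here `flow_inverse(a, s, c)` is an assumed helper returning the latent action ${\bm{z}}=f_{\phi}^{-1}({\bm{a}};{\bm{s}};{\bm{c}})$, `dataset` is a list of (state, safety variable, action, unsafe-label) tuples built from the offline data, and dropping the largest-magnitude latent actions is one possible way to implement the sub-sampling step described above:

```python
import numpy as np

def estimate_safety_bound(flow_inverse, dataset, epsilon=0.85, eta0=10.0, iters=10):
    """Iteratively shrink eta so that at most a (1 - epsilon) fraction of the
    bounded abstract actions in the offline data is unsafe (defaults illustrative)."""
    eta = eta0
    for _ in range(iters):
        # Keep only data whose latent action falls inside the current bound.
        entries = []
        for s, c, a, is_unsafe in dataset:
            z_mag = float(np.max(np.abs(flow_inverse(a, s, c))))
            if z_mag < eta:
                entries.append((z_mag, is_unsafe))
        if not entries:
            break
        # Drop the largest-magnitude entries until the unsafe fraction is acceptable.
        entries.sort(key=lambda e: e[0])
        while entries and np.mean([u for _, u in entries]) > (1.0 - epsilon):
            entries.pop()
        if not entries:
            break
        eta = max(m for m, _ in entries)   # eta_{i+1}: max latent magnitude retained
    return eta
```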

Refer to caption
Figure 3: An example of a task where the robot successfully and safely grasps an object (top row). Here, the robot reaches into the container and extracts the object without touching the container. On the bottom row, the robot performs the same task but commits safety violations by touching the container.

5 Experiments

We evaluate the calibration of the safety assurances introduced in Section 4.4 and how well SAFER encourages both safe and successful policy learning compared to baselines.

5.1 Experiments Setup

To evaluate SAFER, we introduce a suite of safety-critical robotic grasping tasks inspired by the game Operation (https://en.wikipedia.org/wiki/OperationGame).

Safety-critical Robotic Grasping Tasks Based on the game Operation, whose goal is to extract objects from different sized containers without touching the container, we construct a set of 40 grasping tasks, each consisting of a container and an object defined in PyBullet [50]. We collect data from these tasks to train SAFER and use 6 of the more complex tasks for evaluation. In our tasks, the objects are randomly selected from those available in the PyBullet package, and the containers are generated to fit the objects, with randomly generated dimensions (heights and widths). Our agent controls a 5-DoF robotic arm and gripper. The agent receives a positive reward ($r(s,a)=1$) when it extracts the object from the box and a negative reward ($r(s,a)=-1$) at every time step while the task is incomplete. The agent incurs a safety violation ($\omega(s,a)=1$) if the arm touches the box (examples of safe/unsafe trajectories are shown in Figure 3; examples of the tasks are in Figure 10). The states are $48\times 48$ pixel image observations of the scene collected from a fixed camera.

Offline Data Collection To generate the offline data for the SAFER training algorithm, for each robot grasping task we use the scripted policy from Singh et al. [1] to collect trajectories with a total of 1,000,000 steps. The scripted policy controls the robotic arm to grasp the object by minimizing the absolute distance between the object and the robot. To obtain more diverse, exploratory trajectories, we also add random actuation noise to the policy. After collecting the trajectories, for each state-action pair $({\bm{s}},{\bm{a}})$ in the dataset we provide labels for i) safety violation, $\omega(s,a)\in\{0,1\}$, and ii) whether the pair $({\bm{s}},{\bm{a}})$ is part of a successful rollout (i.e., $({\bm{s}},{\bm{a}})$ such that $\mathbb{E}[r({\bm{s}}_T,{\bm{a}}_T)\,|\,\mu_{\text{data}},{\bm{s}}_0={\bm{s}},{\bm{a}}_0={\bm{a}}]=1$, where $T$ is the trajectory length random variable). To create the state window $\Lambda$ for SAFER training, for each $({\bm{s}},{\bm{a}})$ in the data buffer we save the previous $w$ states. We use these labels to separate safe from unsafe data when training SAFER.
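A minimal sketch of this per-rollout labeling and window construction is shown below; the exact window convention (here, the current state plus the previous $w-1$ states, zero-padded at the start of a rollout) and the field names are illustrative choices:

```python
import numpy as np

def label_and_window_rollout(states, actions, violations, success, w=5):
    """Attach a safety label and a rollout-level success label to every (s, a) pair
    and build the sliding window Lambda of recent states for SAFER training."""
    examples = []
    for t, (s, a) in enumerate(zip(states, actions)):
        recent = states[max(0, t - w + 1): t + 1]
        pad = [np.zeros_like(s)] * (w - len(recent))        # pad early steps with zeros
        examples.append({
            "state": s,
            "action": a,
            "window": pad + list(recent),                   # Lambda for this step
            "unsafe": int(violations[t]),                   # omega(s_t, a_t)
            "successful_rollout": int(success),             # label from rollout outcome
        })
    return examples
```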

Baseline Comparisons To demonstrate the improved safety performance of SAFER over existing offline primitive learning techniques, we compare against baseline methods that leverage offline data to accelerate learning, including PARROT [1]; a contextual version of PARROT (Context. PAR) that uses a latent variable to help accelerate learning; Prior Explore, a method that samples from SAFER to help with data collection during training; and RL from scratch using SAC. See Appendix E for more details. Last, because our setting requires learning safety constraints entirely from labeled offline data, we do not compare against methods that require online safety constraint functions, such as many existing safe RL methods that use a constraint function during training.

5.2 Results Discussion

Table 1: Training RL with SAFER, we give the mean ± SD success rate and cumulative safety violations across different tasks and initializations. SAFER produces the lowest cumulative safety violations throughout training and outperforms the baseline methods in terms of success rate.

Success Rate (%)
                Task 1        Task 2        Task 3        Task 4        Task 5        Task 6
SAC             0.0 ± 0.0     0.0 ± 0.0     0.0 ± 0.0     0.0 ± 0.0     0.0 ± 0.0     2.3 ± 0.0
PARROT          0.0 ± 0.0     12.8 ± 0.2    25.7 ± 0.2    16.1 ± 0.2    33.9 ± 0.3    6.3 ± 0.1
Context PAR.    5.0 ± 0.0     24.2 ± 0.2    27.0 ± 0.3    0.7 ± 0.0     7.3 ± 0.1     12.0 ± 0.2
Prior Explore   1.8 ± 0.0     1.5 ± 0.0     3.0 ± 0.0     1.8 ± 0.0     1.1 ± 0.0     1.0 ± 0.0
SAFER           21.0 ± 0.1    87.4 ± 0.2    89.3 ± 0.0    28.1 ± 0.2    54.4 ± 0.1    83.3 ± 0.0

Total Number of Safety Violations (Out of 50,000 Steps)
                Task 1        Task 2        Task 3        Task 4        Task 5        Task 6
SAC             2045 ± 236    876 ± 117     1055 ± 216    2736 ± 147    2188 ± 405    756 ± 293
PARROT          6332 ± 3026   307 ± 291     13 ± 21       541 ± 461     2414 ± 314    932 ± 844
Context PAR.    5929 ± 2964   1576 ± 1208   1039 ± 777    5056 ± 1778   2796 ± 624    2085 ± 1951
Prior Explore   6203 ± 551    2240 ± 634    2867 ± 853    4525 ± 826    4669 ± 542    2596 ± 703
SAFER           610 ± 184     51 ± 61       10 ± 14       455 ± 470     1707 ± 292    7 ± 9

Effectiveness of RL training with SAFER In Table 1 we compare SAFER with the baseline methods in terms of both cumulative safety violations and success rate. Note that here we use the underlying reward and safety violation functions for each task to evaluate performance. We choose a SAFER policy primitive with a safety assurance upper bound that guarantees at most 15% unsafe actions, which empirically maintains a good balance between performance and safety. For each downstream task, we then train the RL agent $\pi_{\theta}$ with SAC for only 50,000 steps because we are primarily interested in evaluating the power of the primitive learning algorithm. Overall, we see that SAFER has the lowest cumulative safety violations, indicating that it is the most effective method at promoting safe policy learning. Interestingly, SAFER also consistently outperforms other methods in policy performance. The strong success rates of SAFER are potentially due to the fact that discouraging unsafe behaviors may indeed help refine the space of useful behaviors, thus improving policy learning.

Refer to caption
Figure 4: Assessing the calibration of the SAFER safety assurances by randomly sampling actions from the prior with various safety upper bounds across different evaluation tasks. Each dot corresponds to the empirical percentage of unsafe $({\bm{s}},{\bm{a}})$ pairs from a single rollout on the task. Overall, we see that the SAFER safety assurances are quite well calibrated.

Safety Assurance Calibration We evaluate whether the safe abstract action bound of SAFER computed in Section 4.4 is well calibrated, i.e., whether the empirical percentage of unsafe actions is less than the upper bound. To study this, we compute the $\mathcal{Z}$-action bound $(-\eta,\eta)$ corresponding to upper bounds of 0%, 15%, 30%, and 45% unsafe actions for SAFER. We compute the percentage of unsafe actions by randomly sampling actions from SAFER on each evaluation task and report the results in Figure 4, which shows that the SAFER bounds are indeed well calibrated.
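A sketch of this check is given below; `env` (with a step function returning the next state, reward, and safety violation) and `safer_action(s, z)` (mapping a state and abstract action to an environment action through SAFER) are assumed interfaces, and abstract actions are drawn from the unit Gaussian restricted to $(-\eta,\eta)$ by rejection sampling:

```python
import numpy as np

def empirical_unsafe_rate(env, safer_action, eta, action_dim, num_steps=1000, seed=0):
    """Estimate the fraction of unsafe steps when sampling abstract actions at random
    inside the bound (-eta, eta); compare this rate against the nominal upper bound."""
    rng = np.random.default_rng(seed)
    unsafe = 0
    s = env.reset()
    for _ in range(num_steps):
        z = rng.standard_normal(action_dim)
        while np.any(np.abs(z) >= eta):         # rejection sampling into (-eta, eta)
            z = rng.standard_normal(action_dim)
        s, _, omega = env.step(safer_action(s, z))
        unsafe += int(omega)                    # omega(s, a) in {0, 1}
    return unsafe / num_steps
```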

Impact of latent safety variable We train SAFER on Tasks 2 and 5 using the contrastive objective in Equation 5 but without the safety variable. In this case, the success rate never exceeds 10% and the safety violations are quite high (see Appendix C for the Task 2 results). In contrast, SAFER with the safety variable achieves at least a 60% success rate on both tasks (Table 1). This result suggests that the latent safety variable is crucial for success and safety.

6 Limitations

Though SAFER improves both safe and successful generalization to downstream tasks, there are several critical limitations to consider. Foremost, SAFER relies on a labeled offline dataset. In certain settings, it may be impractical to collect a sufficiently large dataset to ensure useful learned primitives or to obtain high-quality labels. If the dataset is not sufficiently large or the labels are of poor quality, this could harm the capacity of SAFER to learn useful primitives. In the future, researchers should benchmark and improve the sample efficiency of SAFER. Second, the offline dataset must include a selection of unsafe demonstrations. In settings where there are no existing unsafe data points (e.g., from previous failures), or unsafe data cannot be simulated, it may be difficult for SAFER to learn generalizable safety constraints from the data.

7 Conclusion

In this paper, we introduced SAFER, an offline primitive learning method that improves the data efficiency of safe RL when there is access to both safe and unsafe data examples. This is particularly important because most existing safe RL algorithms are very data inefficient. We proposed a set of complex safety-critical robotic grasping tasks to evaluate SAFER, investigated limitations of state-of-the-art offline primitive learning baselines, and demonstrated that SAFER can achieve better success rates while enforcing safety with high-probability assurances.

References

  • Singh et al. [2021] A. Singh, H. Liu, G. Zhou, A. Yu, N. Rhinehart, and S. Levine. Parrot: Data-driven behavioral priors for reinforcement learning. ICLR, 2021.
  • Pertsch et al. [2020] K. Pertsch, Y. Lee, and J. J. Lim. Accelerating reinforcement learning with learned skill priors. CoRL, 2020.
  • Pertsch et al. [2021] K. Pertsch, Y. Lee, Y. Wu, and J. J. Lim. Guided reinforcement learning with learned skills. In Self-Supervision for Reinforcement Learning Workshop-ICLR 2021, 2021.
  • Ajay et al. [2021] A. Ajay, A. Kumar, P. Agrawal, S. Levine, and O. Nachum. Opal: Offline primitive discovery for accelerating offline reinforcement learning. ICLR, abs/2010.13611, 2021.
  • Dulac-Arnold et al. [2015] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
  • Nalisnick et al. [2018] E. Nalisnick, A. Matsukawa, Y. Teh, D. Gorur, and B. Lakshminarayanan. Do deep generative models know what they don’t know? ICLR, 10 2018.
  • Fetaya et al. [2020] E. Fetaya, J.-H. Jacobsen, W. Grathwohl, and R. Zemel. Understanding the limitations of conditional generative models. ICLR, 10 2020.
  • Kirichenko et al. [2020] P. Kirichenko, P. Izmailov, and A. G. Wilson. Why normalizing flows fail to detect out-of-distribution data. arXiv, 2020.
  • Wachi and Sui [2020] A. Wachi and Y. Sui. Safe reinforcement learning in constrained Markov decision processes. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9797–9806. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/wachi20a.html.
  • Achiam et al. [2017] J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 22–31. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/achiam17a.html.
  • Dalal et al. [2018] G. Dalal, K. Dvijotham, M. Vecerík, T. Hester, C. Paduraru, and Y. Tassa. Safe exploration in continuous action spaces. CoRR, abs/1801.08757, 2018. URL http://arxiv.org/abs/1801.08757.
  • Bharadhwaj et al. [2021] H. Bharadhwaj, A. Kumar, N. Rhinehart, S. Levine, F. Shkurti, and A. Garg. Conservative safety critics for exploration. ICLR, 2021.
  • Narasimhan [2020] K. Narasimhan. Projection-based constrained policy optimization. ICLR, abs/2010.03152, 2020.
  • Yang et al. [2021] T.-Y. Yang, J. P. Rosca, K. Narasimhan, and P. J. Ramadge. Accelerating safe reinforcement learning with constraint-mismatched policies. ICML, abs/2006.11645, 2021.
  • Chow et al. [2018a] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 8103–8112, Red Hook, NY, USA, 2018a. Curran Associates Inc.
  • Chow et al. [2018b] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018b. URL https://proceedings.neurips.cc/paper/2018/file/4fe5149039b52765bde64beb9f674940-Paper.pdf.
  • Achiam and Amodei [2019] J. Achiam and D. Amodei. Benchmarking safe exploration in deep reinforcement learning. 2019.
  • Berkenkamp et al. [2017] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/766ebcd59621e305170616ba3d3dac32-Paper.pdf.
  • El Chamie et al. [2016] M. El Chamie, Y. Yu, and B. Açıkmeşe. Convex synthesis of randomized policies for controlled markov chains with density safety upper bound constraints. In 2016 American Control Conference (ACC), pages 6290–6295, 2016. doi:10.1109/ACC.2016.7526658.
  • Turchetta et al. [2020] M. Turchetta, A. Kolobov, S. Shah, A. Krause, and A. Agarwal. Safe reinforcement learning via curriculum induction. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12151–12162. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/8df6a65941e4c9da40a4fb899de65c55-Paper.pdf.
  • Thananjeyan et al. [2021] B. Thananjeyan, A. Balakrishna, S. Nair, M. Luo, K. P. Srinivasan, M. Hwang, J. E. Gonzalez, J. Ibarz, C. Finn, and K. Goldberg. Recovery rl: Safe reinforcement learning with learned recovery zones. IEEE Robotics and Automation Letters, 6:4915–4922, 2021.
  • Yang et al. [2017] T. Yang, M. Hu, Y. Chow, P. J. Ramadge, and K. Narasimhan. Safe reinforcement learning with natural language constraints. CoRR, abs/2010.05150, 2017. URL https://arxiv.org/abs/2010.05150.
  • Rosolia and Borrelli [2018] U. Rosolia and F. Borrelli. Learning model predictive control for iterative tasks. a data-driven control framework. IEEE Transactions on Automatic Control, 63:1883–1896, 2018.
  • Thananjeyan et al. [2021] B. Thananjeyan, A. Balakrishna, U. Rosolia, J. Gonzalez, A. D. Ames, and K. Goldberg. Abc-lmpc: Safe sample-based learning mpc for stochastic nonlinear dynamical systems with adjustable boundary conditions. In WAFR, 2021.
  • Driessens and Dzeroski [2004] K. Driessens and S. Dzeroski. Integrating guidance into relational reinforcement learning. Machine Learning, 57:271–304, 2004.
  • Smart and Kaelbling [2000] W. D. Smart and L. P. Kaelbling. Practical reinforcement learning in continuous spaces. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, page 903–910, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1558607072.
  • Srinivasan et al. [2020] K. P. Srinivasan, B. Eysenbach, S. Ha, J. Tan, and C. Finn. Learning to be safe: Deep rl with a safety critic. ArXiv, abs/2010.14603, 2020.
  • Thananjeyan et al. [2020] B. Thananjeyan, A. Balakrishna, U. Rosolia, F. Li, R. McAllister, J. Gonzalez, S. Levine, F. Borrelli, and K. Goldberg. Safety augmented value estimation from demonstrations (saved): Safe deep model-based rl for sparse cost robotic tasks. IEEE Robotics and Automation Letters, 5:3612–3619, 2020.
  • Eysenbach et al. [2019] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. ICLR, 2019.
  • Nachum et al. [2019] O. Nachum, S. Gu, H. Lee, and S. Levine. Near-optimal representation learning for hierarchical reinforcement learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=H1emus0qF7.
  • Sharma et al. [2020] A. Sharma, S. S. Gu, S. Levine, V. Kumar, and K. Hausman. Dynamics-aware unsupervised discovery of skills. ArXiv, abs/1907.01657, 2020.
  • Xie et al. [2021] K. Xie, H. Bharadhwaj, D. Hafner, A. Garg, and F. Shkurti. Latent skill planning for exploration and transfer. ICLR, 2021.
  • Konidaris and Barto [2009] G. Konidaris and A. Barto. Skill discovery in continuous reinforcement learning domains using skill chaining. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009. URL https://proceedings.neurips.cc/paper/2009/file/e0cf1f47118daebc5b16269099ad7347-Paper.pdf.
  • Peng et al. [2019] X. B. Peng, M. Chang, G. Zhang, P. Abbeel, and S. Levine. MCP: Learning Composable Hierarchical Control with Multiplicative Compositional Policies. Curran Associates Inc., Red Hook, NY, USA, 2019.
  • Chandak et al. [2019] Y. Chandak, G. Theocharous, J. Kostas, S. Jordan, and P. Thomas. Learning action representations for reinforcement learning. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 941–950. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/chandak19a.html.
  • Nachum et al. [2018] O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/e6384711491713d29bc63fc5eeb5ba4f-Paper.pdf.
  • Hausman et al. [2017] K. Hausman, Y. Chebotar, S. Schaal, G. Sukhatme, and J. J. Lim. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 1235–1245, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
  • Florensa et al. [2017] C. Florensa, Y. Duan, and P. Abbeel. Stochastic neural networks for hierarchical reinforcement learning. ICLR, 2017.
  • Fox et al. [2017] R. Fox, S. Krishnan, I. Stoica, and K. Goldberg. Multi-level discovery of deep options. ArXiv, abs/1703.08294, 2017.
  • Dietterich [1998] T. G. Dietterich. The maxq method for hierarchical reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML ’98, page 118–126, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN 1558605568.
  • Rakelly et al. [2019] K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5331–5340. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/rakelly19a.html.
  • Dinh et al. [2017] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=HkpbnH9lx.
  • Charnes and Cooper [1959] A. Charnes and W. W. Cooper. Chance-constrained programming. Management science, 6(1):73–79, 1959.
  • Zhang et al. [2019] C. Zhang, J. Bütepage, H. Kjellström, and S. Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:2008–2026, 2019.
  • Nemirovski and Shapiro [2007] A. Nemirovski and A. Shapiro. Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17(4):969–996, 2007. doi:10.1137/050622328. URL https://doi.org/10.1137/050622328.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  • Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014.
  • Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
  • Kääriäinen and Langford [2005] M. Kääriäinen and J. Langford. A comparison of tight generalization error bounds. In Proceedings of the 22nd international conference on Machine learning, pages 409–416, 2005.
  • Coumans and Bai [2016–2021] E. Coumans and Y. Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021.
  • Shankar and Gupta [2020] T. Shankar and A. Gupta. Learning robot skills with temporal variational inference. In International Conference on Machine Learning, pages 8624–8633. PMLR, 2020.
  • Fox et al. [2017] R. Fox, S. Krishnan, I. Stoica, and K. Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
  • Ghadirzadeh et al. [2020] A. Ghadirzadeh, P. Poklukar, V. Kyrki, D. Kragic, and M. Björkman. Data-efficient visuomotor policy training using reinforcement learning and generative models. arXiv preprint arXiv:2007.13134, 2020.
  • Guadarrama et al. [2018] S. Guadarrama, A. Korattikara, O. Ramirez, P. Castro, E. Holly, S. Fishman, K. Wang, E. Gonina, N. Wu, E. Kokiopoulou, L. Sbaiz, J. Smith, G. Bartók, J. Berent, C. Harris, V. Vanhoucke, and E. Brevdo. TF-Agents: A library for reinforcement learning in tensorflow. https://github.com/tensorflow/agents, 2018. URL https://github.com/tensorflow/agents. [Online; accessed 25-June-2019].

Appendix

Appendix A Proof: Tractable Lower Bound

Proposition 4.1 (restated): Assuming the chance constrained ELBO is written as in Equation 3, we can write the surrogate lower bound as,

$$\max_{\rho,\phi}\; \mathbb{E}_{{\bm{c}}\sim q_{\rho}(\cdot|{\bm{s}})}\Big[\log p_{\phi}({\bm{a}}|{\bm{s}},{\bm{c}}) - \lambda\log p_{\phi}({\bm{a}}_{\textrm{unsafe}}|{\bm{s}},{\bm{c}})\Big] - D_{\textrm{KL}}\big(q_{\rho}(\cdot|\Lambda)\,\|\,p(\cdot)\big) \qquad (5)$$

Proof: We rewrite the optimization problem in Equation 3 in the following form,

$$\max_{\rho,\phi,\lambda^{\prime}}\; \mathbb{E}_{{\bm{c}}\sim q_{\rho}(\cdot|{\bm{s}})}\big[\log p_{\phi}({\bm{a}}|{\bm{s}},{\bm{c}}) - D_{\textrm{KL}}\big(q_{\rho}(\cdot|\Lambda)\,\|\,p(\cdot)\big)\big] - \lambda^{\prime}\,\mathbb{P}_{{\bm{c}}\sim q_{\rho}(\cdot|{\bm{s}})}\big(p_{\phi}({\bm{a}}_{\textrm{unsafe}}|{\bm{s}},{\bm{c}})>\epsilon\big). \qquad (6)$$

With the Markov inequality we have

$$\mathbb{P}_{{\bm{c}}\sim q_{\rho}(\cdot|{\bm{s}})}\big(p_{\phi}({\bm{a}}_{\textrm{unsafe}}|{\bm{s}},{\bm{c}})>\epsilon\big) \leq \frac{\mathbb{E}_{{\bm{c}}}\big[p_{\phi}({\bm{a}}_{\textrm{unsafe}}|{\bm{s}},{\bm{c}})\big]}{\epsilon}, \qquad (7)$$

such that the following objective function is a lower bound of that in Equation 3:

$$\max_{\rho,\phi,\lambda^{\prime}}\; \mathbb{E}_{{\bm{c}}\sim q_{\rho}(\cdot|{\bm{s}})}\Big[\log p_{\phi}({\bm{a}}|{\bm{s}},{\bm{c}}) - \frac{\lambda^{\prime}}{\epsilon}\,p_{\phi}({\bm{a}}_{\textrm{unsafe}}|{\bm{s}},{\bm{c}})\Big] - D_{\textrm{KL}}\big(q_{\rho}(\cdot|\Lambda)\,\|\,p(\cdot)\big) \qquad (8)$$

For convenience, we write $\frac{\lambda^{\prime}}{\epsilon}$ as the single hyperparameter $\lambda$ and optimize the log of $p_{\phi}({\bm{a}}_{\textrm{unsafe}}|{\bm{s}},{\bm{c}})$ for better numerical stability. We thus obtain the surrogate lower bound objective in Equation 5.

Appendix B Guaranteeing Safety with SAFER

In this section, we show that SAFER can guarantee safety for any policy trained under the skill prior, demonstrating the utility of the method. We restate and clarify our assumptions before providing the proof of the proposition. The first assumption ensures that there is always a safe action to take.

Assumption B.1

At every state ${\bm{s}}$, there always exists a safe action ${\bm{a}}$, i.e., $\forall{\bm{s}}\;\exists{\bm{a}}\;\text{s.t.}\;\omega({\bm{s}},{\bm{a}})=0$.

The second assumption ensures that the SAFER model is optimal according to the SAFER objective given in Objective 5. In effect, this assumption means that safe actions have high likelihood while the unsafe actions are much less likely under the SAFER prior.

Assumption B.2

The SAFER prior parameters $\hat{\rho},\hat{\phi}$ are optimal per Objective 5, such that all safe actions have higher likelihood than unsafe actions under the prior, i.e., $\forall{\bm{a}},{\bm{a}}_{\textrm{unsafe}},{\bm{s}},{\bm{c}}:\;\log p_{\phi}({\bm{a}}|{\bm{s}},{\bm{c}})\gg\log p_{\phi}({\bm{a}}_{\textrm{unsafe}}|{\bm{s}},{\bm{c}})$.

Next, we provide a proof for Proposition 4.2.

Proposition B.1

There exists an $\eta$ such that the corresponding bounded abstract actions ${\bm{z}}\in(-\eta,\eta)$ are safe, i.e., $\omega({\bm{s}},f_{\phi}({\bm{z}};{\bm{c}};{\bm{s}}))=0$ for all ${\bm{z}}\in(-\eta,\eta)$, ${\bm{s}}$, ${\bm{c}}$.

Proof (Sketch): Because the abstract action space $\mathcal{Z}$ is unit Gaussian ($\mathcal{Z}\sim\mathcal{N}(0,I)$), abstract actions ${\bm{z}}$ closer to the zero vector $\bm{0}$ have higher likelihood, i.e., ${\bm{z}}$'s with lower norm $\|{\bm{z}}\|$ have higher likelihood. From the assumptions, we know that in every state there exist safe actions and that, by the way we train SAFER, they have much higher likelihood than unsafe actions under the prior distribution, $\log p_{\phi}({\bm{a}}|{\bm{s}},{\bm{c}})\gg\log p_{\phi}({\bm{a}}_{\textrm{unsafe}}|{\bm{s}},{\bm{c}})$. Using the invertibility of the normalizing flow, one concludes that, for all states, the abstract actions ${\bm{z}}$ corresponding to unsafe actions are much farther from the zero vector $\bm{0}$. Consequently, there must exist a finite latent bound $\eta$ that separates all safe actions from unsafe ones.

Appendix C Additional Results

In this Appendix, we present additional results with SAFER.

Cumulative Safety Violation Graphs In the main paper, we presented the cumulative safety violations at the end of training. Here, in Figure 5, we present graphs of the cumulative safety violations throughout training for the baselines and SAFER. In these graphs, we see that SAFER is consistently the safest method throughout training.

Figure 5: The cumulative safety violations throughout training for SAFER and the baselines. We see that SAFER is consistently the safest method throughout training.
Figure 6: Assessing the tradeoff between success and safety while varying the safety assurances bound on the abstract action space $\mathcal{Z}$ (referred to as $\eta$ in Algorithm 3). There is a sweet spot where the success rate is high and safety violations are low.

Success & Safety Tradeoff In Figure 6 we assess the tradeoff between success and safety by varying the $\mathcal{Z}$-action bound in Algorithm 3. We sweep over different bounds and compute both the success rate and safety violations at the end of training for Task 5. We see that there is a sweet spot with a high success rate and low safety violations when the safety assurances bound is close to $15\%$. Interestingly, when the bound is too tight (corresponding to small ${\bm{z}}$ values), both the safety violations and the success rate become low, indicating SAFER cannot solve the task without sufficient exploration.

Per Step Safety Violations In the main paper, we provide cumulative safety violation graphs. Here, we provide the safety violations over the last 1,000 steps in Figure 7 in order to give a better sense of the safety violations throughout training. We again consistently see that SAFER is the safest method over the course of training. One interesting observation is that, in Section 3, we discussed how PARROT rates unsafe $({\bm{s}},{\bm{a}})$ pairs as high likelihood. Because PARROT draws higher likelihood actions from the prior earlier in training, we would expect PARROT to be more unsafe earlier in training. Empirically, we see this to be the case: looking at the graphs, PARROT has high safety violation spikes at the beginning of training. These results demonstrate that our earlier observations surrounding the unsafety of PARROT hold true when running RL.

Figure 7: The safety violations over each step of training for each of the tasks (same task ordering as Figure 5). We see that SAFER is consistently the safest method throughout training.

Impact of Probabilistic Treatment One question worth considering is how necessary it is to treat SAFER as a latent variable model and optimize the posterior over the safety variable using variational inference, as proposed in Section 4.2. It could be easier to treat ${\bm{c}}$ as a vector (without defining it as a Gaussian random variable), exclude the KL term from Equation 5, and optimize $q_{\rho}({\bm{c}}|\Lambda)$ with the rest of the objective. To assess whether this is the case, we ran a sweep across different hyperparameter configurations, including the number of bijectors in the real NVP model, the learning rate, $\lambda$, and the number of hidden units in each bijector. Doing this, however, we find SAFER quickly diverges, indicating the probabilistic treatment greatly helps stabilize training and is necessary for the success of the method.
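
For illustration, here is a minimal sketch of the probabilistic treatment, assuming the posterior $q_{\rho}({\bm{c}}|\Lambda)$ is a diagonal Gaussian produced by a small network; the name encoder is a hypothetical placeholder for that network, and this is not the authors' implementation.

import numpy as np

def sample_context_and_kl(encoder, window):
    # encoder(window) -> (mu, log_var), each of shape (d,)  (hypothetical network)
    mu, log_var = encoder(window)
    eps = np.random.randn(*mu.shape)
    c = mu + np.exp(0.5 * log_var) * eps  # reparameterized sample of c ~ q_rho(c | Lambda)
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a diagonal Gaussian
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return c, kl

In the non-probabilistic ablation described above, ${\bm{c}}$ would instead be the deterministic output of the encoder and the KL term would simply be dropped.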

Training SAFER Without the Safety Context Variable As an ablation in the main paper, we considered training SAFER without the safety context variable and found that it led to a worse success rate and relatively higher safety violations. In this Appendix, we provide the full training results in Figure 8 in terms of success rate and per step safety violations. Here, we see that for the tasks considered, training SAFER without the safety variable leads to worse success rates and less safety (compare with the per step safety violations in Figure 7).

Figure 8: Effectiveness of RL Training using the SAFER objective without the safety variable. We see the prior without the safety variable is quite unsuccessful, indicating that the safety variable is critical to enabling SAFER to promote both safe and successful learning.
Figure 9: Training PARROT using unsafe data from successful trajectories as well as safe data. We see that this leads to relatively worse success rates (top row) as well as relatively higher per step safety violations (bottom row). These results suggest it is best to train PARROT with safe and successful data only.

Training PARROT With Unsafe Data In the paper, we performed experiments with PARROT trained using safe data, meaning $\omega({\bm{s}},{\bm{a}})=0$ for each training point. We also limited the data to only those tuples in successful trajectories to promote PARROT acquiring safe and successful behaviors. Though it makes the most sense to train PARROT for safety concerned tasks in this fashion, it is worth considering what would happen if we also included unsafe data from successful trajectories. To assess this, we train PARROT using both safe and unsafe data from successful trajectories, using the hyperparameters for PARROT in Appendix E. The results given in Figure 9 demonstrate that this leads to relatively higher per step safety violations, indicating that it is best to train PARROT with only safe data from successful trajectories.

Appendix D SAFER Hyperparameter Details

Hyperparameter Details We explored a number of different parameter configurations with SAFER. We tuned $\lambda$ (1e-4, 1e-5), the number of bijectors in the real NVP flow model (3, 5), the number of components in the context variable ${\bm{c}}$ (8, 32, 64), the size of the states window $w$ (16, 32), the optimizer (Adam, SGD+Momentum), and the learning rate (1e-4, 5e-5). We trained for 500k steps and found that using a smaller number of components in the context variable (8) led to more stable training. Setting the learning rate to 1e-4 led to much quicker convergence, without sacrificing much stability. Furthermore, training with Adam led to divergence in some cases while SGD+Momentum tended to diverge less often. Between the other parameters considered, there was relatively little difference, and therefore we used a model with learning rate 1e-4, 3 bijectors, 8 components, a states window size of 16, and SGD+Momentum.
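
For reference, the final configuration described above can be summarized as follows (a summary of the text, not a verbatim configuration file from the authors; key names are illustrative):

safer_hparams = {
    "learning_rate": 1e-4,
    "num_bijectors": 3,
    "context_components": 8,
    "states_window_size": 16,
    "optimizer": "SGD+Momentum",
    "training_steps": 500_000,
    # lambda was tuned over {1e-4, 1e-5}; the selected value is not stated above
}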

Appendix E Baseline Methods

We select several baseline methods to compare with SAFER. We mainly focus on methods that leverage action primitives trained with offline data to improve efficiency, e.g., PARROT, Prior-Explore. While we are aware of additional baseline methods in the literature, e.g., TrajRL [51, 52], HIRL [53], we omit their comparisons here because it has been shown in prior work [1] that their performance is consistently below that of the state of the art.

Soft Actor Critic: Soft actor-critic (SAC) [48] is one of the standard model-free policy-gradient based RL methods. Here, without using any action primitives, we apply SAC to learn a policy that directly maps states in $\mathcal{X}$ to actions in $\mathcal{A}$. Later we also use SAC in all our action primitive based RL methods (e.g., SAFER, PARROT) to optimize the high-level policy. Therefore, one can view the SAC baseline as an ablation study as well. We use the implementation from TF-Agents [54]. We use SAC with automatic entropy tuning and tune the target network update period, discount factor, policy learning rate, and Q-function learning rate.

PARROT: We compare against the state-of-the-art primitive learning RL method PARROT, proposed by Singh et al. [1]. Similar to SAFER, PARROT leverages a conditional normalizing flow to train a behavioral prior using data from successful rollouts. To enforce safety in the PARROT agent, we additionally limit the training data of its behavioral prior to rollouts that are both safe and successful; otherwise PARROT may encourage unsafe behaviors. We tune the number of bijectors in the conditional normalizing flow for PARROT (5, 3), the number of hidden units in each bijector layer (128, 256), the learning rate (1e-4, 5e-5, 1e-5), and the optimizer (Adam or SGD+Momentum), and train for 500k steps. We find using 3 bijectors with learning rate 1e-4 and the Adam optimizer works best.
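
As background, one conditional affine coupling layer of the kind used in real NVP-style conditional flows can be sketched as follows; the conditioner network st_net is a hypothetical placeholder, and this is a sketch rather than the implementation used by PARROT or SAFER.

import numpy as np

def coupling_forward(z, cond, st_net):
    # st_net(z1, cond) -> (log_scale, shift)  (hypothetical conditioner network)
    # The first half of z passes through unchanged; the second half is scaled
    # and shifted as a function of the first half and the conditioning input.
    d = z.shape[-1] // 2
    z1, z2 = z[..., :d], z[..., d:]
    log_scale, shift = st_net(z1, cond)
    a2 = z2 * np.exp(log_scale) + shift
    return np.concatenate([z1, a2], axis=-1)

def coupling_inverse(a, cond, st_net):
    # Exact inverse of coupling_forward, used to map actions back to latents.
    d = a.shape[-1] // 2
    a1, a2 = a[..., :d], a[..., d:]
    log_scale, shift = st_net(a1, cond)
    z2 = (a2 - shift) * np.exp(-log_scale)
    return np.concatenate([a1, z2], axis=-1)

Stacking several such layers (alternating which half is transformed) and conditioning on the state yields an invertible state-conditioned mapping of the kind both PARROT and SAFER build on.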

Prior-Explore: We also consider the Prior-Explore method proposed in Singh et al. [1] as one of our baseline methods. Here the Prior-Explore policy combines the action mapping $f_{\phi}$ in Equation 1 with a SAC agent to aid exploration of the RL agent. It selects an action from the prior policy with probability $\delta$ and from the SAC agent otherwise. Following Singh et al. [1], we set this probability $\delta$ to 0.9 and use the mapping $f_{\phi}$ trained for SAFER.
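
A minimal sketch of this action-selection rule (the names prior_policy and sac_policy are hypothetical placeholders):

import random

def prior_explore_action(state, prior_policy, sac_policy, delta=0.9):
    # With probability delta take the behavioral-prior action, otherwise the SAC action.
    if random.random() < delta:
        return prior_policy(state)
    return sac_policy(state)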

Contextual PARROT (SAFER Without Contrastive Loss): As an ablation study, we consider SAFER without the contrastive loss. This setup also models the behavioral prior policy with a conditional normalizing flow and the latent safety variable, but trains it only with safe and successful data. Note that this baseline method is equivalent to PARROT with a policy that is a function of the latent safety variable. We use the same parameters as PARROT with this baseline and 8 components in the safety variable because we found this number of components to be the most successful with SAFER.

Appendix F Training SAFER

In this appendix, we provide pseudocode for the SAFER training procedure in Algorithm 2.

Algorithm 2 SAFER Training
Require: SAFER Behavioral Prior $f_{\phi}$, Safety Variable Posterior $q_{\rho}({\bm{c}}|\Lambda)$, safe dataset $\mathcal{D}_{\textrm{safe}}$, unsafe dataset $\mathcal{D}_{\textrm{unsafe}}$, Steps $N$, $\lambda$
  Let flow_loss$(\cdot)$ refer to Equation 1
  for $n=1,\ldots,N$ do
     $({\bm{s}},{\bm{a}},\Lambda)_{\textrm{Safe}}\sim\mathcal{D}_{\textrm{Safe}}$ {Sample safe and unsafe batches of data}
     $({\bm{s}},{\bm{a}},\Lambda)_{\textrm{Unsafe}}\sim\mathcal{D}_{\textrm{Unsafe}}$
     ${\bm{c}}_{\textrm{Safe}}\sim q_{\rho}({\bm{c}}|\Lambda_{\textrm{Safe}})$ {Sample safety variables}
     ${\bm{c}}_{\textrm{Unsafe}}\sim q_{\rho}({\bm{c}}|\Lambda_{\textrm{Unsafe}})$
     $\mathcal{L}_{\textrm{safe}}\leftarrow\log\big(\textrm{flow\_loss}({\bm{s}}_{\textrm{safe}};{\bm{a}}_{\textrm{safe}};{\bm{c}}_{\textrm{safe}})\big)$ {Compute log-likelihoods}
     $\mathcal{L}_{\textrm{unsafe}}\leftarrow\log\big(\textrm{flow\_loss}({\bm{s}}_{\textrm{unsafe}};{\bm{a}}_{\textrm{unsafe}};{\bm{c}}_{\textrm{unsafe}})\big)$
     $D_{\textrm{KL}}^{\textrm{Safe}}\leftarrow D_{\textrm{KL}}\big(q_{\rho}({\bm{c}}|\Lambda_{\textrm{Safe}})\,\|\,p({\bm{c}})\big)$ {Compute KL of safety variables}
     $D_{\textrm{KL}}^{\textrm{Unsafe}}\leftarrow D_{\textrm{KL}}\big(q_{\rho}({\bm{c}}|\Lambda_{\textrm{Unsafe}})\,\|\,p({\bm{c}})\big)$
     $NLL\leftarrow-\big(\mathcal{L}_{\textrm{safe}}-\lambda\cdot\mathcal{L}_{\textrm{unsafe}}-D_{\textrm{KL}}^{\textrm{Safe}}-D_{\textrm{KL}}^{\textrm{Unsafe}}\big)$
     Minimize $NLL$ and update $\phi,\rho$ {Update SAFER}
  end for
  Return: SAFER Behavioral Prior $f_{\phi}$, Safety Variable Posterior $q_{\rho}({\bm{c}}|\Lambda)$

Appendix G Setting the Safety Assurance

In this appendix, we provide pseudocode for the SAFER safety assurances procedure in Algorithm 3. This algorithm provides a numerical, gradient-free approach to finding a bound $\eta$ such that an $\epsilon$ portion of the actions it contains are safe.

Algorithm 3 SAFER Safety Assurances
Require: Initial bound $\eta_{0}$, Desired percent safe actions $\epsilon$, SAFER prior $f_{\phi}$, Safe dataset $\mathcal{D}_{\textrm{safe}}$, Unsafe dataset $\mathcal{D}_{\textrm{unsafe}}$
  define
  function get_in_bound(dataset $\mathcal{D}$, bound $\eta_{t}$)
     // This function computes the abstract actions ${\bm{z}}$ within bound $\eta_{t}$
     $\mathcal{Z}\leftarrow\{\}$
     for $({\bm{s}},{\bm{a}},\Lambda)$ in $\mathcal{D}$ do
        // Iterate over tuples (state ${\bm{s}}$, action ${\bm{a}}$, and context $\Lambda$)
        ${\bm{c}}\leftarrow\mathbb{E}_{q_{\rho}(\cdot|\Lambda)}\left[{\bm{c}}\right]$, ${\bm{z}}\leftarrow f_{\phi}^{-1}({\bm{a}};{\bm{s}};{\bm{c}})$ {Get abstract action ${\bm{z}}$ from ${\bm{s}}$, ${\bm{a}}$, and $\Lambda$}
        if ${\bm{z}}$ within bound $\eta_{t}$ then
           $\mathcal{Z}\leftarrow\mathcal{Z}\cup\{{\bm{z}}\}$ {Add ${\bm{z}}$ if it is within bound $\eta_{t}$}
        end if
     end for
     return $\mathcal{Z}$
  end function
  $\eta\leftarrow(-\eta_{0},\eta_{0})$ {Initialize bound}
  done $\leftarrow$ False
  while not done do
     $\mathcal{Z}_{\textrm{safe}}^{\eta}\leftarrow$ get_in_bound($\mathcal{D}_{\textrm{safe}}$, $\eta$), $\mathcal{Z}_{\textrm{unsafe}}^{\eta}\leftarrow$ get_in_bound($\mathcal{D}_{\textrm{unsafe}}$, $\eta$) {Get ${\bm{z}}$ in current bound $\eta$}
     $S\leftarrow|\mathcal{Z}_{\textrm{safe}}^{\eta}|+|\mathcal{Z}_{\textrm{unsafe}}^{\eta}|$
     $\mathcal{Z}_{\textrm{safe}}^{\epsilon}\leftarrow$ sample $\lfloor S\times\epsilon\rfloor$ items from $\mathcal{Z}_{\textrm{safe}}^{\eta}$, $\mathcal{Z}_{\textrm{unsafe}}^{1-\epsilon}\leftarrow$ sample $\lfloor S\times(1-\epsilon)\rfloor$ items from $\mathcal{Z}_{\textrm{unsafe}}^{\eta}$
     $\eta\leftarrow$ the max component absolute value across $\mathcal{Z}_{\textrm{unsafe}}^{1-\epsilon}$ and $\mathcal{Z}_{\textrm{safe}}^{\epsilon}$ {Update bound}
     if an $\epsilon$ portion of items across $\mathcal{D}_{\textrm{safe}}$ and $\mathcal{D}_{\textrm{unsafe}}$ within bound $(-\eta,\eta)$ are safe then
        done $\leftarrow$ True {Break if bound $\eta$ contains desired portion of safe actions}
     end if
  end while
  return $\eta$
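
For intuition, a simplified variant of this search (not equivalent to Algorithm 3) can be written as a binary search over $\eta$: enlarge the bound while at least an $\epsilon$ fraction of the abstract actions inside it come from the safe dataset, and shrink it otherwise. This sketch assumes the safe fraction roughly decreases as the bound grows and that the arrays of abstract actions have been precomputed with the inverse flow $f_{\phi}^{-1}$.

import numpy as np

def find_eta(z_safe, z_unsafe, epsilon, eta_max, iters=30):
    # z_safe, z_unsafe: arrays of abstract actions f_phi^{-1}(a; s; c), shape (N, d)
    def safe_fraction(eta):
        in_safe = np.sum(np.all(np.abs(z_safe) < eta, axis=-1))
        in_unsafe = np.sum(np.all(np.abs(z_unsafe) < eta, axis=-1))
        total = in_safe + in_unsafe
        return 1.0 if total == 0 else in_safe / total

    lo, hi = 0.0, eta_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if safe_fraction(mid) >= epsilon:
            lo = mid  # bound is safe enough; try enlarging it
        else:
            hi = mid  # too many unsafe actions inside; shrink the bound
    return lo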

Appendix H Additional Task Examples

In this Appendix, we provide additional examples of the tasks included in the safe robotic grasping environment in Figure 10.

Figure 10: Additional examples of tasks included in the safe robotic grasping environment (top row). The tasks all use different sized containers to represent different difficulties in preserving safe behavior. We also provide a zoomed-in version of the task (right hand side). Finally, we also include the examples of safe and unsafe trajectories provided in the main paper (Figure 3) for completeness.