
Safety Verification of Model Based
Reinforcement Learning Controllers

Akshita Gupta & Inseok Hwang
School of Aeronautics and Astronautics
Purdue University
{gupta417,ihwang}@purdue.edu
Abstract

Model-based reinforcement learning (RL) has emerged as a promising tool for developing controllers for real world systems (e.g., robotics, autonomous driving, etc.). However, real systems often have constraints imposed on their state space which must be satisfied to ensure the safety of the system and its environment. Developing a verification tool for RL algorithms is challenging because the non-linear structure of neural networks impedes analytical verification of such models or controllers. To this end, we present a novel safety verification framework for model-based RL controllers using reachable set analysis. The proposed framework can efficiently handle models and controllers which are represented using neural networks. Additionally, if a controller fails to satisfy the safety constraints in general, the proposed framework can also be used to identify the subset of initial states from which the controller can be safely executed.

1 Introduction

One of the primary reasons for the growing application of reinforcement learning (RL) algorithms in developing optimal controllers is that RL does not assume a priori knowledge of the system dynamics. Model-based RL explicitly learns a model of the system dynamics from observed samples of state transitions. This learnt model is used along with a planning algorithm to develop optimal controllers for different tasks. Thus, any uncertainties in the system, including environment noise, friction, air drag, etc., can also be captured by the modeled dynamics.

However, the performance of the controller is directly related to how accurately the learnt model represents the true system dynamics. Due to the discrepancy between the learnt model and the true model, the developed controller can behave unexpectedly when deployed on the real physical system, e.g., land robots, UAVs, etc. (Benbrahim & Franklin, 1997; Endo et al., 2008; Morimoto & Doya, 2001). This unexpected behavior may result in the violation of constraints imposed on the system, thereby violating its safety requirements (Moldovan & Abbeel, 2012). Thus, it is necessary to have a framework which can ensure that the controller will satisfy the safety constraints before it is deployed on a real system. This raises the primary question of interest: Given a set of safety constraints imposed on the state space, how do we determine whether a given controller is safe or not?

In the literature, there have been several works that focus on the problem of ensuring safety. Some of the works aim at finding a safe controller, under the assumption of a known baseline safe policy (Hans et al., 2008; Garcia & Fernández, 2012; Berkenkamp et al., 2017; Thomas et al., 2015; Laroche et al., 2019), or several known safe policies (Perkins & Barto, 2002). However, such safe policies may not be readily available in general. Alternatively, Akametalu et al. (2014) used reachability analysis to develop safe model-based controllers, under the assumption that the system dynamics can be modeled using Gaussian processes, i.e., an assumption which is violated by most modern RL methods that make use of neural networks (NN) instead. While there have been several works proposed to develop safe controllers, some of the assumptions made in these works may not be possible to realize in practice. In recent years, this limitation has drawn attention towards developing verification frameworks for RL controllers, which is the focus of this paper. Since verifying safety of an NN based RL controller is also related to verifying the safety of the underlying NN model (Xiang et al., 2018b; Tran et al., 2019b; Xiang et al., 2018a; Tran et al., 2019a), we provide an additional review for these methods in Appendix A.1.

Contributions:

In this work, we focus on the problem of determining whether a given controller is safe or not, with respect to satisfying constraints imposed on the state space. To do so, we propose a novel safety verification algorithm for model-based RL controllers using forward reachable tube analysis that can handle NN based learnt dynamics and controllers, while also being robust against modeling error. The problem of determining the reachable tube is framed as an optimal control problem using the Hamilton Jacobi (HJ) partial differential equation (PDE), whose solution is computed using the level set method. The advantage of using the level set method is the fact that it can represent sets with non-convex boundaries, thereby avoiding approximation errors that most existing methods suffer from. Additionally, if a controller is deemed unsafe, we take a step further to identify if there are any starting conditions from which the given controller can be safely executed. To achieve this, a backward reachable tube is computed for the learnt model and, to the best of our knowledge, this is the first work which computes the backward reachable tube over an NN. Finally, empirical results are presented on two domains inspired by real-world applications where safety verification is critical.

2 Problem Setting

Let ${\mathbb{S}}\subset\mathbb{R}^{n}$ denote the set of states and ${\mathbb{A}}\subset\mathbb{R}^{m}$ denote the set of feasible actions for the RL agent. Let ${\mathbb{S}}_{0}\subset{\mathbb{S}}$ denote the set of bounded initial states and $\xi:=\{({\bm{s}}_{t},{\bm{a}}_{t})\}_{t=0}^{T}$ represent a trajectory generated over a finite time $T$, as a sequence of state-action tuples, where the subscript $t$ denotes the instantaneous time. Additionally, let ${\bm{s}}(\cdot)$ and ${\bm{a}}(\cdot)$ represent a sequence of states and actions, respectively. The state constraints imposed on the system are represented as unsafe regions using bounded sets ${\mathbb{C}}_{s}=\cup_{i=1}^{p}{\mathbb{C}}_{s}^{(i)}$, where ${\mathbb{C}}_{s}^{(i)}\subset{\mathbb{S}},\;\forall i\in\{1,2,\ldots,p\}$. The true system dynamics is given by a non-linear function $f:{\mathbb{S}}\times{\mathbb{A}}\rightarrow{\mathbb{R}}^{n}$ such that $\dot{{\bm{s}}}=f({\bm{s}},{\bm{a}})$, and is unknown to the agent.

A model-based RL algorithm is used to find an optimal controller $\pi:{\mathbb{S}}\rightarrow{\mathbb{A}}$ to reach a set of target states ${\mathbb{T}}\subset{\mathbb{S}}$ within some finite time $T$, while avoiding the constraints ${\mathbb{C}}_{s}$. An NN model $\hat{f}_{\bm{\theta}}:{\mathbb{S}}\times{\mathbb{A}}\rightarrow{\mathbb{R}}^{n}$, parameterized by weights ${\bm{\theta}}$, is trained to learn the true, but unknown, system dynamics from the observed state transition data tuples $D=\{({\bm{s}}_{t},{\bm{a}}_{t},\Delta{\bm{s}}_{t+1})^{(i)}\}_{i=1}^{N}$. However, due to sampling bias, the learnt model $\hat{f}_{\bm{\theta}}$ may not be accurate. We assume that it is possible to estimate a bounded set ${\mathbb{D}}\subset{\mathbb{R}}^{n}$ such that, at any state ${\bm{s}}\in{\mathbb{S}}$, augmenting the learnt dynamics $\hat{f}_{\bm{\theta}}$ with some ${\bm{d}}\in{\mathbb{D}}$ results in a closer approximation of the true system dynamics at that particular state. Using this notation, we now define the problem of safety verification of a given controller $\pi({\bm{s}})$.

Problem 1

(Safety verification): Given a set of initial states ${\mathbb{S}}_{0}$, determine whether, $\forall{\bm{s}}_{0}\in{\mathbb{S}}_{0}$, all the trajectories $\xi$ executed under $\pi({\bm{s}})$ and following the system dynamics $f$ satisfy the constraints ${\mathbb{C}}_{s}$.

The solution to Problem 1 only provides a binary yes or no answer to whether $\pi({\bm{s}})$ is safe with respect to ${\mathbb{S}}_{0}$. In the case where the policy is unsafe, a stronger result is the identification of safe initial states ${\mathbb{S}}_{safe}\subset{\mathbb{S}}_{0}$ from which $\pi({\bm{s}})$ executes trajectories that always satisfy the constraints ${\mathbb{C}}_{s}$. This problem is stated below.

Problem 2

(Safe initial states): Given $\pi({\bm{s}})$, find ${\mathbb{S}}_{safe}$ such that any trajectory $\xi$ executed under $\pi({\bm{s}})$ and following the system dynamics $f$, starting from any ${\bm{s}}_{0}\in{\mathbb{S}}_{safe}$, satisfies the constraints ${\mathbb{C}}_{s}$.

3 Safety Verification

To address Problems 1 and 2, we use reachability analysis. Reachability analysis is an exhaustive verification technique which tracks the evolution of the system states over a finite time, known as the reachable tube, from a given set of initial states ${\mathbb{S}}_{0}$ (Maler, 2008). If the evolution is tracked forward starting from ${\mathbb{S}}_{0}$, it is called the forward reachable tube and is denoted by ${\mathbb{F}}_{R}(T)$. Analogously, if the evolution is tracked backward from ${\mathbb{C}}_{s}$ towards ${\mathbb{S}}_{0}$, it is called the backward reachable tube and is denoted by ${\mathbb{B}}_{R}(-T)$.

In the following sections, we will formulate a reachable tube problem for NN-based models and controllers, and then propose a verification framework that (a) can determine whether or not a given policy $\pi$ is safe, and (b) can compute ${\mathbb{S}}_{safe}$ if $\pi$ is unsafe. To do so, there are two main questions that need to be answered. First, since the true system dynamics $f$ is unknown, how can we determine a conservative bound on the modeling error, to augment the learnt model $\hat{f}_{\bm{\theta}}$ and better model the true system dynamics when evaluating a controller $\pi$? Second, how do we formulate the forward and backward reachable tube problems over the NN-modeled system dynamics?

3.1 Model-Based Reinforcement Learning

In this section, we focus on the necessary requirements for the modeled dynamics $\hat{f}_{\bm{\theta}}$ and discuss the estimation of the modeling error set ${\mathbb{D}}$. A summary of the model-based RL framework is presented in Appendix A.2. Recall that the learnt model $\hat{f}_{\bm{\theta}}$ is represented using an NN and predicts the change in states $\Delta\hat{{\bm{s}}}_{t+1}\in{\mathbb{R}}^{n}$. To learn $\hat{f}_{\bm{\theta}}$, an observed data set $D=\{({\bm{s}}_{t},{\bm{a}}_{t},\Delta{\bm{s}}_{t+1})^{(i)}\}_{i=1}^{N}$ is first split into a training data set $D_{t}$ and a validation data set $D_{v}$. A supervised learning technique is then used to train $\hat{f}_{\bm{\theta}}$ over $D_{t}$ by minimizing the prediction error $E=\frac{1}{N_{t}}\sum_{i=1}^{N_{t}}||\Delta\hat{{\bm{s}}}_{t+1}^{(i)}-\Delta{\bm{s}}_{t+1}^{(i)}||^{2}$, where $\Delta\hat{{\bm{s}}}_{t+1}$ is the change predicted by the learnt model, $\Delta{\bm{s}}_{t+1}$ is the true observed change in state, and $N_{t}=|D_{t}|$. With this notation, we now state the following assumption, which is required to ensure boundedness during the analysis and is easily satisfied by NNs that use common activation functions such as tanh or sigmoid (Usama & Chang, 2018).

Assumption 1

$\hat{f}_{{\bm{\theta}}}$ is Lipschitz continuous and $\forall j\in\{1,\ldots,n\},\;\Delta\hat{{\bm{s}}}_{j}\in[-c,c]$, where $|c|<\infty$.
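As a concrete illustration of the model-learning step described above, the following sketch trains a one-step model that predicts the change in state by minimizing the mean squared prediction error $E$. The network size and dimensions are hypothetical choices, and the bounded tanh output is one simple way to satisfy Assumption 1.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions and architecture; the framework only requires an NN
# that predicts the change in state from a (state, action) pair.
n_state, n_action, c = 2, 2, 1.0

dynamics_model = nn.Sequential(
    nn.Linear(n_state + n_action, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, n_state), nn.Tanh(),   # output bounded in (-1, 1)
)

def f_hat(s, a):
    """Predicted change in state, scaled to the bounded range [-c, c]."""
    return c * dynamics_model(torch.cat([s, a], dim=-1))

def train_model(states, actions, delta_states, epochs=200, lr=1e-3):
    """Minimize E = (1/N_t) * sum ||delta_s_hat - delta_s||^2 over D_t."""
    opt = torch.optim.Adam(dynamics_model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((f_hat(states, actions) - delta_states) ** 2).sum(-1).mean()
        loss.backward()
        opt.step()
```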

Modeling error: As mentioned earlier, the accuracy of the learnt model $\hat{f}_{{\bm{\theta}}}$ depends on the quality of the data and the NN being used, resulting in some modeling error ${\bm{d}}$ in the prediction of the next state. Estimating modeling errors is an active area of research, is required by several existing works on safe RL (Akametalu et al., 2014; Gillula & Tomlin, 2012), and is complementary to our goal. Since the primary contribution of this work is the development of a reachable tube formulation for model-based controllers that use NNs, we rely on existing techniques (Moldovan et al., 2015) to estimate a conservative modeling error bound. We leverage the error estimates $\hat{{\bm{d}}}_{j}=\Delta\hat{{\bm{s}}}_{t+1}^{(j)}-\Delta{\bm{s}}^{(j)}_{t+1}$ of $\hat{f}_{\bm{\theta}}$ on the transition tuples in the validation set $D_{v}$ to construct the upper confidence bound ${\bm{d}}^{+}=[d_{1},\;d_{2},\ldots,d_{n}]$ and lower confidence bound ${\bm{d}}^{-}=[-d_{1},-d_{2},\ldots,-d_{n}]$ for ${\bm{d}}$, for each state dimension. Let a high-confidence bounded error set ${\mathbb{D}}$ be defined such that ${\mathbb{D}}=\{{\bm{d}}:\forall i\in\{1,\ldots,n\},\,\,{\bm{d}}^{-}_{i}<{\bm{d}}_{i}<{{\bm{d}}}^{+}_{i}\}$. We then use ${\mathbb{D}}$ to represent the augmented learnt system dynamics as

$$\hat{f}^{(r)}_{{\bm{\theta}}}:=\hat{f}_{{\bm{\theta}}}({\bm{s}},{\bm{a}})+{\bm{d}},\quad{\bm{d}}\in{\mathbb{D}}. \qquad (1)$$
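One hedged way to instantiate this construction, consistent with the $\pm 3\sigma$ bounds used in the experiments of Appendix A.4, is to take $d_{i}$ as a multiple of the per-dimension standard deviation of the validation residuals; all names below are illustrative.

```python
import numpy as np

def estimate_error_bounds(delta_pred, delta_true, k=3.0):
    """Per-dimension modeling error bounds from validation residuals.

    delta_pred, delta_true: arrays of shape (N_v, n) with the predicted and
    observed state changes on the validation set D_v.
    Returns (d_minus, d_plus), the lower and upper confidence bounds on d.
    """
    residuals = delta_pred - delta_true        # \hat{d}_j for each tuple
    d = k * residuals.std(axis=0)              # e.g. a 3-sigma bound per dimension
    return -d, d

def f_hat_r(f_hat_value, d):
    """Augmented dynamics (1): learnt prediction plus a disturbance d in D."""
    return f_hat_value + d
```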

3.2 Reachable Tube Formulation

For an exhaustive verification technique on a continuous state and action space, it is infeasible to sample trajectories from every point in the given initial set and, further, to verify whether all of these trajectories satisfy the safety constraints. Therefore, reachable sets are usually (approximately) represented as convex polyhedrons and their evolution is tracked by pushing the boundaries of the polyhedron according to the system dynamics. However, as convex polyhedrons can lead to large approximation errors, we leverage the level set method (Mitchell et al., 2005) to compute the boundaries of the reachable tube for NN-based models and controllers at every time instant. With the level set method, even non-convex boundaries of the reachable tube can be represented, thereby ensuring an accurate computation of the reachable tube (Mitchell et al., 2005). Since the non-linear, non-convex structure of NNs is not suitable for analytical verification techniques, reachability analysis gives an efficient, simulation-based approach to analyze the safety of the controller. Therefore, in this subsection, we formulate the reachable tube for an NN-modeled system dynamics $\hat{f}_{{\bm{\theta}}}$, but first we formally define the forward and backward reachable tubes for a policy $\pi({\bm{s}})$.

Figure 1: Flowchart of the proposed safety verification algorithm. The rectangle on the left represents the flow for computing the forward reachable tube (FRT), which can only state whether $\pi({\bm{s}})$ is safe or unsafe for ${\mathbb{S}}_{0}$. The rectangle on the right presents the flow for computing the backward reachable tube (BRT), which is invoked if $\pi({\bm{s}})$ is unsafe. The BRT analysis can compute the subset of safe initial states ${\mathbb{S}}_{safe}$ for $\pi({\bm{s}})$, if such a set exists.

Forward reachable tube (FRT): This is the set of all states that can be reached from an initial set ${\mathbb{S}}_{0}$ when the trajectories $\xi$ are executed under the policy $\pi({\bm{s}})$ and the system dynamics $\hat{f}^{(r)}_{{\bm{\theta}}}({\bm{s}},{\bm{a}},{\bm{d}})$. The FRT is computed over a finite length of time $T$ and is formally defined as

$${\mathbb{F}}_{R}(T):=\{{\bm{s}}:\forall{\bm{d}}\in{\mathbb{D}},\;{\bm{s}}(\cdot)\text{ satisfies }\dot{{\bm{s}}}=\hat{f}^{(r)}_{{\bm{\theta}}}({\bm{s}},{\bm{a}},{\bm{d}}),\text{ where }{\bm{a}}=\pi({\bm{s}}),\;{\bm{s}}_{t_{0}}\in{\mathbb{S}}_{0},\;t_{f}=T\}, \qquad (2)$$

where $t_{0}$ and $t_{f}$ denote the initial and final times of the trajectory $\xi$, respectively.
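For intuition (and as the ground-truth generator used later in the experiments), a sampling-based under-approximation of this tube can be obtained by rolling out the augmented dynamics under $\pi$; the level set method formulated below computes the tube without such sampling. The function and argument names in this sketch are illustrative.

```python
import numpy as np

def sample_frt(f_hat, policy, sample_s0, d_minus, d_plus,
               horizon=100, dt=0.05, n_rollouts=1000, seed=0):
    """Monte-Carlo under-approximation of the FRT in (2).

    Rolls out s_dot = f_hat(s, pi(s)) + d with d drawn from the box
    [d_minus, d_plus], starting from states sampled from S_0.
    """
    rng = np.random.default_rng(seed)
    visited = []
    for _ in range(n_rollouts):
        s = sample_s0(rng)
        for _ in range(horizon):
            d = rng.uniform(d_minus, d_plus)
            s = s + dt * (f_hat(s, policy(s)) + d)   # explicit Euler step
            visited.append(s.copy())
    return np.array(visited)
```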

Backward reachable tube (BRT): This is the set of all states which can reach a given bounded target set ${\mathbb{T}}\subset{\mathbb{R}}^{n}$ when the trajectories $\xi$ are executed under the policy $\pi({\bm{s}})$ and the system dynamics $\hat{f}^{(r)}_{{\bm{\theta}}}({\bm{s}},{\bm{a}},{\bm{d}})$. The BRT is also computed over a finite length of time, with the trajectories starting at time $t_{0}=-T$ and ending at time $t_{f}=0$. It is denoted as

$${\mathbb{B}}_{R}(-T):=\{{\bm{s}}_{0}:\forall{\bm{d}}\in{\mathbb{D}},\;{\bm{s}}(\cdot)\text{ satisfies }\dot{{\bm{s}}}=\hat{f}^{(r)}_{{\bm{\theta}}}({\bm{s}},{\bm{a}},{\bm{d}}),\text{ where }{\bm{a}}=\pi({\bm{s}})\text{ with }{\bm{s}}_{t_{0}}={\bm{s}}_{-T};\;{\bm{s}}_{t_{f}}\in{\mathbb{T}},\;t_{f}\in[-T,0]\}. \qquad (3)$$

The key difference between the FRT and BRT is that, for the former, the initial set of states is known, whereas for the latter, the final set of states is known.

Outline:

The flowchart of the safety verification framework proposed in this work is presented in Fig. 1. Given a model-based policy π(𝒔)\pi({\bm{s}}), the set of initial states 𝕊0{\mathbb{S}}_{0} and the set of constrained states s{\mathbb{C}}_{s}, the first step is to estimate the bounded set of modeling error 𝔻{\mathbb{D}}, as discussed in Section 3.1. Using f^𝜽(r)\hat{f}^{(r)}_{\bm{\theta}}, the FRT is constructed from the initial set 𝕊0{\mathbb{S}}_{0} and it contains all the states reachable by π(𝒔)\pi({\bm{s}}) over a finite time TT. Thus, if the FRT contains any state from the unsafe region s{\mathbb{C}}_{s}, π(𝒔)\pi({\bm{s}}) is deemed unsafe. Therefore, the solution to Problem 1, is determined by analyzing the set of intersection of the FRT with s{\mathbb{C}}_{s} as

π={ safeif 𝔽R(T)s=, unsafeif 𝔽R(T)s.\pi=\begin{cases}\text{ safe}&\mbox{if }\,\,{\mathbb{F}}_{R}(T)\cap{\mathbb{C}}_{s}=\emptyset,\\ \text{ unsafe}&\mbox{if }\,\,{\mathbb{F}}_{R}(T)\cap{\mathbb{C}}_{s}\neq\emptyset.\end{cases} (4)

If $\pi({\bm{s}})$ is classified as safe for the entire set ${\mathbb{S}}_{0}$, then no further analysis is required. However, if $\pi({\bm{s}})$ is classified as unsafe, we proceed to compute the subset ${\mathbb{S}}_{safe}\subset{\mathbb{S}}_{0}$ of initial states for which $\pi({\bm{s}})$ generates safe trajectories. ${\mathbb{S}}_{safe}$ is the solution to Problem 2 and allows an unsafe policy to be deployed on a real system with restrictions on its starting states. To this end, the BRT is computed from the unsafe region ${\mathbb{C}}_{s}$ to determine the set of trajectories (and states) which terminate in ${\mathbb{C}}_{s}$. The intersection of the BRT with ${\mathbb{S}}_{0}$ determines the set of unsafe initial states ${\mathbb{S}}_{unsafe}$. To determine ${\mathbb{S}}_{safe}$, we utilize the properties (a) ${\mathbb{S}}_{safe}\cup{\mathbb{S}}_{unsafe}={\mathbb{S}}_{0}$ and (b) ${\mathbb{S}}_{safe}\cap{\mathbb{S}}_{unsafe}=\emptyset$, and compute

$${\mathbb{S}}_{safe}=\overline{{\mathbb{S}}}_{unsafe}\cap{\mathbb{S}}_{0}. \qquad (5)$$

If ${\mathbb{S}}_{safe}\neq\emptyset$, then we have identified the safe initial states for $\pi({\bm{s}})$; otherwise, it is concluded that there are no initial states in ${\mathbb{S}}_{0}$ from which $\pi({\bm{s}})$ can generate safe trajectories.
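On the discretized grid used by the level set solver, the decisions in (4) and (5) reduce to elementwise operations on boolean membership masks. The following sketch illustrates this bookkeeping; the mask names are illustrative and not part of the paper's implementation.

```python
import numpy as np

def verify_policy(frt_mask, unsafe_mask):
    """Eq. (4): pi is safe iff the FRT does not intersect the unsafe region."""
    return "unsafe" if np.any(frt_mask & unsafe_mask) else "safe"

def safe_initial_states(brt_mask, s0_mask):
    """Eq. (5): S_safe is the complement of S_unsafe intersected with S_0,
    where S_unsafe is the intersection of the BRT with S_0."""
    s_unsafe = brt_mask & s0_mask
    return ~s_unsafe & s0_mask

# Each *_mask is a boolean grid of identical shape, e.g. obtained by
# thresholding a level set function at zero (membership <=> value <= 0).
```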

Mathematical Formulation:

This section presents the mathematical formulation used to compute the BRT. The FRT can be computed with a slight modification to the BRT formulation, as discussed at the end of this section.

Recall that, for the BRT problem, there exists a target set ${\mathbb{T}}\subset{\mathbb{R}}^{n}$ which the agent has to reach in finite time, i.e., the condition on the final state is given as ${\bm{s}}_{t_{f}}\in{\mathbb{T}}$. Conventionally, for the BRT formulation, the final time is $t_{f}=0$ and the starting time is $t_{0}=-T$, where $0<T<\infty$. When evaluating a policy $\pi({\bm{s}})$, the control input is computed by the given policy as ${\bm{a}}=\pi({\bm{s}})$. However, following the system dynamics $\hat{f}_{\bm{\theta}}^{(r)}$ in (1), the modeling error ${\bm{d}}$ is now included in the system as an adversarial input, whose value at each state is chosen so as to maximize the controller's cost function. We use the HJ PDE to formulate the effect of the modeling error on the system when computing the BRT, but first we briefly review the formulation of the HJ PDE for an NN-modeled system dynamics $\hat{f}_{\bm{\theta}}$.

For an optimal controller, we first define the cost function which the controller has to minimize. Let $C({\bm{s}}_{t},{\bm{a}}_{t})$ denote the running cost of the agent, which depends on the state and action taken at time $t\in[-T,0]$. Let $g({\bm{s}}_{t_{f}})$ denote the cost at the final state ${\bm{s}}_{t_{f}}$. Then, the goal of the optimal controller is to find a sequence of optimal actions such that

$$\min_{{\bm{a}}_{\tau}(\cdot)}\left(\int_{-T}^{0}C({\bm{s}}_{\tau},{\bm{a}}_{\tau})\,d\tau+g({\bm{s}}_{t_{f}})\right)\quad\text{subject to }\;\dot{{\bm{s}}}=\hat{f}_{{\bm{\theta}}}({\bm{s}},{\bm{a}}),\;\;{\bm{s}}_{t_{f}}\in{\mathbb{T}}, \qquad (6)$$

where ${\bm{a}}_{\tau}\in{\mathbb{A}}$ and $\hat{f}_{\bm{\theta}}$ is the NN-modeled system dynamics. The above optimization problem is solved using the dynamic programming approach (Smith & Smith, 1991), which is based on the Principle of Optimality (Troutman, 2012). Let $V({\bm{s}}_{t},t)$ denote the value function of a state ${\bm{s}}$ at time $t\in[-T,0]$, such that

$$V({\bm{s}}_{t},t)=\min_{{\bm{a}}_{\tau}(\cdot)}\left[\int_{t}^{0}C({\bm{s}}_{\tau},{\bm{a}}_{\tau})\,d\tau+g({\bm{s}}_{t_{f}})\right]=\min_{{\bm{a}}_{\tau}(\cdot)}\left[\int_{t}^{t+\delta}C({\bm{s}}_{\tau},{\bm{a}}_{\tau})\,d\tau+V({\bm{s}}_{t+\delta},t+\delta)\right], \qquad (7)$$

where $\delta>0$. $V({\bm{s}}_{t},t)$ is a quantitative measure of being at a state ${\bm{s}}$, described in terms of the cost required to reach the goal state from ${\bm{s}}$. Then, using a Taylor series expansion, $V({\bm{s}}_{t+\delta},t+\delta)$ is approximated around $V({\bm{s}}_{t},t)$ in (7) to derive the HJ PDE as

$$\frac{dV}{dt}+\min_{{\bm{a}}}\left[\nabla V\cdot\hat{f}_{{\bm{\theta}}}({\bm{s}},{\bm{a}})+C({\bm{s}},{\bm{a}})\right]=0,\qquad V({\bm{s}}_{t_{f}},t_{f})=g({\bm{s}}_{t_{f}}), \qquad (8)$$

where $\nabla V\in{\mathbb{R}}^{n}$ is the spatial derivative of $V$. Additionally, the time index has been dropped above and the dynamics constraint in (6) has been included in the PDE. Equation (8) is a terminal value PDE, and by solving (8), we can compute the value $V({\bm{s}}_{t},t)$ of a state at any time $t$.
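For completeness, the first-order Taylor expansion step that takes (7) to (8) can be written out as follows (a standard derivation, reproduced here only for clarity):

$$V({\bm{s}}_{t+\delta},t+\delta)\approx V({\bm{s}}_{t},t)+\frac{dV}{dt}\,\delta+\nabla V\cdot\hat{f}_{{\bm{\theta}}}({\bm{s}},{\bm{a}})\,\delta,\qquad \int_{t}^{t+\delta}C({\bm{s}}_{\tau},{\bm{a}}_{\tau})\,d\tau\approx C({\bm{s}},{\bm{a}})\,\delta.$$

Substituting these into (7), cancelling $V({\bm{s}}_{t},t)$ on both sides, dividing by $\delta$, and letting $\delta\rightarrow 0$ yields (8).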

We now discuss how the formulation in (8) can be modified to obtain the BRT. Note that, along with computing the value function, the formulation in (8) also computes the optimal action ${\bm{a}}$. However, in Problems 1 and 2, the policy $\pi({\bm{s}})$ is already provided. Therefore, the constraint ${\bm{a}}=\pi({\bm{s}})$ should be included in problem (6), thereby avoiding the need to minimize over actions ${\bm{a}}\in{\mathbb{A}}$ in (8). Additionally, as discussed in Section 3.1, the NN-modeled system dynamics $\hat{f}_{\bm{\theta}}$ may not be a good approximation of the true system dynamics $f$. Instead, the augmented learnt system dynamics $\hat{f}^{(r)}_{\bm{\theta}}$ in (1) is used in place of $\hat{f}_{\bm{\theta}}$ in (8), since it better models the true dynamics at a given state. However, by including $\hat{f}_{\bm{\theta}}^{(r)}$ in (8), the modeling error ${\bm{d}}$ enters the formulation. The modeling error ${\bm{d}}\in{\mathbb{D}}$ is treated as an adversarial input which tries to drive the system away from its goal state by taking the value that maximizes the cost function at each state. Thus, to account for this adversarial input, the formulation in (8) is maximized over ${\bm{d}}$.

Lastly, the BRT problem is posed for a set of states and not an individual state. Hence, an efficient representation of the target set is required to propagate an entire set of trajectories at a time, as opposed to propagating individual trajectories.

Assumption 2

The target set ${\mathbb{T}}\subset{\mathbb{R}}^{n}$ is closed and can be represented as the zero sublevel set of a bounded and Lipschitz continuous function $l:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}$, such that ${\mathbb{T}}=\{{\bm{s}}:l({\bm{s}})\leq 0\}$.

The above assumption defines a function $l$ that checks whether a state lies inside or outside the target set. If ${\mathbb{T}}$ is represented using a regular, well-defined geometric shape (e.g., a sphere, rectangle, or cylinder), then deriving the level set function $l({\bm{s}})$ is straightforward, whereas an irregularly shaped ${\mathbb{T}}$ can be represented as a union of several well-defined geometric shapes to derive $l({\bm{s}})$.
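For example (these particular constructions are standard choices rather than requirements of the framework): for a ball of radius $r$ centered at ${\bm{c}}$, $l({\bm{s}})=\|{\bm{s}}-{\bm{c}}\|_{2}-r$; for an axis-aligned box with center ${\bm{c}}$ and half-widths $w_{i}$, $l({\bm{s}})=\max_{i}\left(|s_{i}-c_{i}|-w_{i}\right)$; and for a union ${\mathbb{T}}={\mathbb{T}}_{1}\cup{\mathbb{T}}_{2}$ of two such shapes, $l({\bm{s}})=\min\left(l_{1}({\bm{s}}),l_{2}({\bm{s}})\right)$.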

For the BRT problem, the goal is to determine all the states which can reach ${\mathbb{T}}$ within a finite time. The path taken by the controller is irrelevant, and only the value of the final state is used to determine whether a state ${\bm{s}}\in{\mathbb{B}}_{R}(-T)$. From Assumption 2, the terminal condition ${\bm{s}}_{t_{f}}\in{\mathbb{T}}$ can be restated as $l({\bm{s}}_{t_{f}})\leq 0$. Thus, to prevent the system from reaching ${\mathbb{T}}$, the adversarial input ${\bm{d}}$ tries to maximize $l({\bm{s}}_{t_{f}})$, thereby pushing ${\bm{s}}_{t_{f}}$ as far away from ${\mathbb{T}}$ as possible. Therefore, the cost function in (6) is modified to $J=l({\bm{s}}_{t_{f}})$. Additionally, any state which can reach ${\mathbb{T}}$ within the finite time interval $T$ is included in the BRT. Therefore, if a trajectory reaches ${\mathbb{T}}$ at some $t_{f}<0$, it should not be allowed to leave the set. Keeping this in mind, the BRT optimization problem can be posed as

$$\max_{{\bm{d}}(\cdot)}\left(\min_{t\in[-T,0]}l({\bm{s}}_{t})\right)\quad\text{subject to: }\dot{{\bm{s}}}=\hat{f}^{(r)}_{{\bm{\theta}}}({\bm{s}},{\bm{a}},{\bm{d}}),\;\;{\bm{a}}=\pi({\bm{s}}),\;\;l({\bm{s}}_{t_{f}})\leq 0, \qquad (9)$$

where the inner minimization over time prevents the trajectory from leaving the target set. Then, the value function for the above problem is defined as

$$V_{R}({\bm{s}}_{t},t):=\max_{{\bm{d}}(\cdot)}\,l({\bm{s}}(t_{f})). \qquad (10)$$

Comparing this with (7), it is observed that the value of a state ${\bm{s}}$ is no longer dependent on the running cost $C({\bm{s}},{\bm{a}})$. This does not imply that the generated trajectories are not optimal w.r.t. the action ${\bm{a}}$, because the running cost is equivalent to the negative reward function, for which $\pi({\bm{s}})$ is already optimized. Instead, $V_{R}$ depends solely on whether the final state of the trajectory lies within the target set or not, i.e., whether or not ${\bm{s}}_{t_{f}}\in{\mathbb{T}}$. Thus, the value function $V_{R}$ for any state ${\bm{s}}$ is equal to $l({\bm{s}}_{t_{f}})$, where ${\bm{s}}_{t_{f}}$ is the final state of the trajectory originating at ${\bm{s}}$. Then, the HJ PDE for the problem in (9) is stated as

$$\frac{dV_{R}}{dt}+\min\{0,H^{*}({\bm{s}},\nabla V_{R}({\bm{s}}_{t},t),t)\}=0,\qquad V_{R}({\bm{s}}_{t_{f}},t_{f})=l({\bm{s}}_{t_{f}}),\qquad\text{where }H^{*}=\max_{{\bm{d}}}\left(\nabla V_{R}\cdot\hat{f}^{(r)}_{{\bm{\theta}}}({\bm{s}},\pi({\bm{s}}),{\bm{d}})\right), \qquad (11)$$

where $H^{*}$ represents the optimal Hamiltonian. Since we are computing the BRT, the term $\min\{0,H^{*}({\bm{s}},\nabla V_{R}({\bm{s}}_{t},t),t)\}$ in the PDE above ensures that the tube grows only in the backward direction, thereby preventing a trajectory which has reached ${\mathbb{T}}$ from leaving it. In (11), the optimal action has been substituted by $\pi({\bm{s}})$ and the augmented learnt dynamics is used instead of $\hat{f}_{\bm{\theta}}$. The Hamiltonian optimization problem can be further simplified to derive an analytical solution for the modeling error. By substituting the augmented dynamics from (1), the optimization problem can be re-written as

$$H^{*}=\nabla V_{R}\cdot\hat{f}_{{\bm{\theta}}}({\bm{s}},{\bm{a}})+\max_{{\bm{d}}}\nabla V_{R}\cdot{\bm{d}}. \qquad (12)$$

Expanding $\nabla V_{R}=[p_{1},p_{2},\ldots,p_{n}]^{T}\in{\mathbb{R}}^{n}$, the vector product is $\nabla V_{R}\cdot{\bm{d}}=p_{1}d_{1}+p_{2}d_{2}+\ldots+p_{n}d_{n}$. Therefore, to maximize $\nabla V_{R}\cdot{\bm{d}}$, the disturbance input is chosen as

$$d_{i}=\begin{cases}d_{i}&\text{if }p_{i}>0,\\ -d_{i}&\text{if }p_{i}<0,\end{cases}\quad\forall i=1,\ldots,n. \qquad (13)$$

With this analytical solution, the final PDE representing the BRT is stated as

$$\frac{dV_{R}}{dt}+\min\{0,H^{*}({\bm{s}},\nabla V_{R}({\bm{s}}_{t},t),t)\}=0,\qquad V_{R}({\bm{s}}_{t_{f}},t_{f})=l({\bm{s}}_{t_{f}}),\qquad\text{where }H^{*}=\nabla V_{R}\cdot\hat{f}_{{\bm{\theta}}}({\bm{s}},{\bm{a}})+\left(|p_{1}|d_{1}+|p_{2}|d_{2}+\ldots+|p_{n}|d_{n}\right). \qquad (14)$$

The value function $V_{R}({\bm{s}}_{t},t)$ in (14) represents the evolution of the target level set function backwards in time. By solving for $V_{R}$ in the above PDE, the level set function is determined at any time instant $t\in[-T,0]$, thereby determining the BRT. From the result of Theorem 2 in Mitchell et al. (2005), the solution $V_{R}$ of (14) at any time $t$ gives the zero sublevel set representation of the BRT. Thus,

$${\mathbb{B}}_{R}(-T)=\{{\bm{s}}:V_{R}({\bm{s}}_{t},t)\leq 0,\;t\in[-T,0]\}. \qquad (15)$$

The solution for $V_{R}$ can be computed numerically using existing solvers for the level set method. A brief note on the implementation of the algorithm is included in Appendix A.3.
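Concretely, once a solver returns $V_{R}$ on a grid at a sequence of backward time steps, the tube in (15) is the union of the zero sublevel sets over those time steps. A minimal sketch follows; the array layout is an assumption, not the solver's actual interface.

```python
import numpy as np

def brt_mask(V_R_over_time):
    """Eq. (15): a grid point lies in B_R(-T) if V_R <= 0 at some t in [-T, 0].

    V_R_over_time: array of shape (n_timesteps, *grid_shape) holding the value
    function returned by the level set solver at each backward time step.
    """
    return (V_R_over_time <= 0.0).any(axis=0)
```

Because of the $\min\{0,H^{*}\}$ term, the value function is non-increasing backwards in time, so in principle the final time slice alone yields the same set.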

There are a few things to note about the formulation in (14). First, Equation (14) assumes that ${\mathbb{T}}$ is a desired goal set. However, the formulation can be modified if ${\mathbb{T}}$ is an unsafe set, in which case the adversarial modeling error tries to minimize the Hamiltonian. Similarly, the input ${\bm{d}}$ can represent any other disturbance in the system, either adversarial or cooperative. Second, to compute the FRT, the formulation in (14) is modified from a final value PDE to an initial value PDE.

4 Experiments

Figure 2: (Left) The environment for the safe land navigation problem. (Middle) The FRTs for both the augmented learnt model dynamics and the true system dynamics classify the controller as unsafe. (Right) The BRT computed from obstacles 1 and 2 for the given controller. ${\mathbb{S}}_{safe}$ computed by the proposed BRT algorithm is compared with the ground truth (G.T.) data for the safe initial states, marked in green.

In our experiments, we aim to answer the following questions: (a) Can safety verification be performed for an NN-based $\pi$ and $\hat{f}_{\bm{\theta}}$ using the FRT? and (b) Can ${\mathbb{S}}_{safe}$ be identified using the BRT if $\pi$ is deemed unsafe? To answer these two questions, we demonstrate results on the following domains inspired by real-world safety-critical problems, where RL controllers developed using a learnt model can be appealing, as they can adapt to transition dynamics involving friction, air drag, wind, etc., which might be hard to model explicitly otherwise.

Safe land navigation: Navigation of a ground robot in an indoor environment is a common application which requires the satisfaction of safety constraints by $\pi$ to avoid collision with obstacles. For this setting, we simulate a ground robot with continuous states and actions. The initial configuration of the domain is shown in Fig. 2. The sets of initial and goal states are represented by circles and the obstacles by rectangles.

Safe aerial navigation: This domain simulates a navigation problem in an urban environment for an unmanned aerial vehicle (UAV). Constraints are incorporated while training $\pi$ to ensure that collisions with potential obstacles in its path are avoided. States and actions are both continuous, and the initial configuration of the domain is shown in Fig. 3. The sets of initial and goal states are represented using cuboids, and the obstacles with cylinders.

Analysis:

To address the questions for the above-mentioned domains, we first train an NN-based $\hat{f}_{\bm{\theta}}$ to estimate the dynamics using sampled transitions. This $\hat{f}_{\bm{\theta}}$ is then used to learn an NN-based controller $\pi$, trained with a cost function designed to mitigate collisions. For brevity, only representative results for this $\pi$ are discussed here; implementation details and further experimental results are available in Appendices A.2, A.3, A.4 and A.5.

To address the first question, the FRT is computed for both domains over the augmented learnt dynamics $\hat{f}^{(r)}_{\bm{\theta}}$, as shown in Fig. 2 and Fig. 3. Additionally, for land navigation we also compute the FRT over the true system dynamics $f$, which serves as a way to validate the safety verification result of $\pi$ from the proposed framework. It is observed that, for both domains, the FRTs deem the given policy $\pi$ unsafe, since the FRTs intersect with one of the obstacles. Even when $\pi$ is learnt using a cost function designed to avoid collisions, the proposed safety verification framework successfully brings out the limitations of $\pi$, which may have resulted from the use of function approximation, ill-specified hyper-parameters, convergence to a local optimum, etc.

For the second question, the BRT is computed from both obstacles for the given controller $\pi$, as shown in Fig. 2 and Fig. 3. To estimate the accuracy of the BRT computation, we compare the computed ${\mathbb{S}}_{unsafe}$ and ${\mathbb{S}}_{safe}$ sets with the ground truth (G.T.) data generated using random samples of possible trajectories. It is observed that the BRT from obstacle 1 does not intersect with ${\mathbb{S}}_{0}$, implying that all trajectories are safe w.r.t. obstacle 1. However, the BRT from obstacle 2 intersects with ${\mathbb{S}}_{0}$ and identifies the subset of initial states which are unsafe. Such information can be critical to safely deploy even an unsafe controller simply by restricting its starting conditions.

Figure 3: (Left) The safe aerial navigation domain. (Middle) The FRT computed for the augmented learnt dynamics. For clarity in 3D, the FRT with the true dynamics is not plotted; instead, the approximation quality can be better visualized in the BRT. (Right) Comparing ${\mathbb{S}}_{safe}$ computed by the proposed BRT algorithm with the ground truth data of the safe initial states, marked in green (and unsafe states marked in red), it is observed that the BRT conservatively approximates ${\mathbb{S}}_{safe}$.

5 Conclusion

In this paper, we have presented a novel framework using forward and backward reachable tubes for safety verification and for determining the subset of initial states for which a given model-based RL controller always satisfies the state constraints. The main contribution of this work is the formulation of the reachability problem for a neural network modeled system dynamics, together with the use of the level set method to compute an exact reachable tube by solving the Hamilton-Jacobi partial differential equation within the reinforcement learning framework, thereby minimizing the approximation errors that other existing reachability methods suffer from. Additionally, the proposed framework can identify the set of safe initial states for a given policy, thereby determining the initial conditions under which even a sub-optimal, unsafe policy satisfies the safety constraints.

References

  • Akametalu et al. (2014) Anayo K Akametalu, Jaime F Fisac, Jeremy H Gillula, Shahab Kaynama, Melanie N Zeilinger, and Claire J Tomlin. Reachability-based safe learning with gaussian processes. In 53rd IEEE Conference on Decision and Control, pp. 1424–1431. IEEE, 2014.
  • Bastani et al. (2016) Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya Nori, and Antonio Criminisi. Measuring neural net robustness with constraints. In Advances in neural information processing systems, pp. 2613–2621, 2016.
  • Benbrahim & Franklin (1997) Hamid Benbrahim and Judy A Franklin. Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems, 22(3-4):283–302, 1997.
  • Berkenkamp et al. (2017) Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pp. 908–918, 2017.
  • Crandall & Lions (1984) Michael G Crandall and P-L Lions. Two approximations of solutions of hamilton-jacobi equations. Mathematics of computation, 43(167):1–19, 1984.
  • Endo et al. (2008) Gen Endo, Jun Morimoto, Takamitsu Matsubara, Jun Nakanishi, and Gordon Cheng. Learning cpg-based biped locomotion with a policy gradient method: Application to a humanoid robot. The International Journal of Robotics Research, 27(2):213–228, 2008.
  • Garcia & Fernández (2012) Javier Garcia and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45:515–564, 2012.
  • Gillula & Tomlin (2012) Jeremy H Gillula and Claire J Tomlin. Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor. In 2012 IEEE International Conference on Robotics and Automation, pp.  2723–2730. IEEE, 2012.
  • Hans et al. (2008) Alexander Hans, Daniel Schneegaß, Anton Maximilian Schäfer, and Steffen Udluft. Safe exploration for reinforcement learning. In ESANN, pp.  143–148, 2008.
  • Huang et al. (2017) Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. Safety verification of deep neural networks. In International Conference on Computer Aided Verification, pp.  3–29. Springer, 2017.
  • Katz et al. (2017) Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient smt solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp.  97–117. Springer, 2017.
  • Kwiatkowska (2019) Marta Z Kwiatkowska. Safety verification for deep neural networks with provable guarantees. 2019.
  • Laroche et al. (2019) Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning, pp. 3652–3661. PMLR, 2019.
  • Maler (2008) Oded Maler. Computing reachable sets: An introduction. Tech. rep. French National Center of Scientific Research, 2008.
  • Mitchell (2007) Ian M Mitchell. A toolbox of level set methods. UBC Department of Computer Science Technical Report TR-2007-11, 2007.
  • Mitchell et al. (2005) Ian M Mitchell, Alexandre M Bayen, and Claire J Tomlin. A time-dependent hamilton-jacobi formulation of reachable sets for continuous dynamic games. IEEE Transactions on automatic control, 50(7):947–957, 2005.
  • Moldovan & Abbeel (2012) Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in markov decision processes. arXiv preprint arXiv:1205.4810, 2012.
  • Moldovan et al. (2015) Teodor Mihai Moldovan, Sergey Levine, Michael I Jordan, and Pieter Abbeel. Optimism-driven exploration for nonlinear systems. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp.  3239–3246. IEEE, 2015.
  • Morimoto & Doya (2001) Jun Morimoto and Kenji Doya. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems, 36(1):37–51, 2001.
  • Osher & Sethian (1988) Stanley Osher and James A Sethian. Fronts propagating with curvature-dependent speed: algorithms based on hamilton-jacobi formulations. Journal of computational physics, 79(1):12–49, 1988.
  • Osher et al. (2004) Stanley Osher, Ronald Fedkiw, and K Piechor. Level set methods and dynamic implicit surfaces. Appl. Mech. Rev., 57(3):B15–B15, 2004.
  • Perkins & Barto (2002) Theodore J Perkins and Andrew G Barto. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3(Dec):803–832, 2002.
  • Pulina & Tacchella (2012) Luca Pulina and Armando Tacchella. Challenging smt solvers to verify neural networks. Ai Communications, 25(2):117–135, 2012.
  • Sethian (1999) James Albert Sethian. Level set methods and fast marching methods: evolving interfaces in computational geometry, fluid mechanics, computer vision, and materials science, volume 3. Cambridge university press, 1999.
  • Smith & Smith (1991) David K Smith and David K Smith. Dynamic programming: a practical introduction. Ellis Horwood New York, 1991.
  • Sun et al. (2019) Xiaowu Sun, Haitham Khedr, and Yasser Shoukry. Formal verification of neural network controlled autonomous systems. In Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control, pp.  147–156, 2019.
  • Thomas et al. (2015) Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388, 2015.
  • toolbox (2019) toolbox. helperoc. https://github.com/HJReachability/helperOC, May 2019.
  • Tran et al. (2019a) Hoang-Dung Tran, Feiyang Cai, Manzanas Lopez Diego, Patrick Musau, Taylor T Johnson, and Xenofon Koutsoukos. Safety verification of cyber-physical systems with reinforcement learning control. ACM Transactions on Embedded Computing Systems (TECS), 18(5s):1–22, 2019a.
  • Tran et al. (2019b) Hoang-Dung Tran, Patrick Musau, Diego Manzanas Lopez, Xiaodong Yang, Luan Viet Nguyen, Weiming Xiang, and Taylor T Johnson. Parallelizable reachability analysis algorithms for feed-forward neural networks. In 2019 IEEE/ACM 7th International Conference on Formal Methods in Software Engineering (FormaliSE), pp.  51–60. IEEE, 2019b.
  • Troutman (2012) John L Troutman. Variational calculus and optimal control: optimization with elementary convexity. Springer Science & Business Media, 2012.
  • Usama & Chang (2018) Muhammad Usama and Dong Eui Chang. Towards robust neural networks with lipschitz continuity. In International Workshop on Digital Watermarking, pp. 373–389. Springer, 2018.
  • Xiang et al. (2018a) Weiming Xiang, Hoang-Dung Tran, and Taylor T Johnson. Output reachable set estimation and verification for multilayer neural networks. IEEE transactions on neural networks and learning systems, 29(11):5777–5783, 2018a.
  • Xiang et al. (2018b) Weiming Xiang, Hoang-Dung Tran, Joel A Rosenfeld, and Taylor T Johnson. Reachable set estimation and safety verification for piecewise linear systems with neural network controllers. In 2018 Annual American Control Conference (ACC), pp. 1574–1579. IEEE, 2018b.

Appendix A Appendix

A.1 Extended Related Work

NNs are one of the most commonly used function approximators for RL algorithms. Therefore, often, verifying the safety of an RL controller reduces to verifying the safety of the underlying NN model or controller. In this section, we briefly review the techniques used to evaluate the safety of NNs.

In recent years, NN-based supervised learning of image classifiers has found applications in autonomous driving and perception modules. These are safety-critical systems, and it is necessary to verify the robustness of NN-based classifiers to minimize misclassification errors after they are deployed (Kwiatkowska, 2019). Some recent works have relied upon existing optimization frameworks, such as linear programming and satisfiability modulo theories (SMT), to determine adversarial examples in the neighborhood of input images (Huang et al., 2017; Bastani et al., 2016; Pulina & Tacchella, 2012). However, these techniques usually require an approximation of the constraints on the NN model. Katz et al. (2017) improved the scalability of SMT verification algorithms, and Sun et al. (2019) used satisfiability modulo convex (SMC) programming over a partitioned workspace to determine the set of safe initial states for a robot, but their algorithms are limited to NNs with the ReLU activation function. However, the problem of interest in this work is more general than the single-step classification problems considered in these related works; instead, we look at RL problems which require multi-step decision making in real-valued domains.

For real-valued output domains, the safety property needs to be verified for a set of states, as opposed to a single label (i.e., a single point) in the output domain. Therefore, to verify safety in the continuous problem domain, most works (Xiang et al., 2018b; Tran et al., 2019b; Xiang et al., 2018a; Tran et al., 2019a) first compute the forward reachable tube (FRT) of an NN, i.e., the set of output values achievable by the NN. Then, the NN is said to be unsafe if the FRT intersects with an unsafe set. Xiang et al. (2018b), Tran et al. (2019b) and Xiang et al. (2018a) follow the FRT algorithm by discretizing the input space into subspaces and executing a forward pass over the NN to evaluate its output range. However, these works use convex polyhedrons to approximate non-convex reachable sets, and the approximation error compounds over each layer of the NN. Tran et al. (2019a) used a star-shaped polygon approach to generate the FRT, which gives a less conservative solution than most polyhedra-based reachable tube computations. Additionally, some of the algorithms are applicable only to ReLU activation functions (Xiang et al., 2018b; Tran et al., 2019b). In comparison, the algorithm proposed in this work computes the exact reachable tube and can handle the most commonly used activation functions, including tanh, sigmoid, ReLU, etc.

A.2 Algorithm for Model-Based RL

RL follows the formal framework of a Markov decision process (MDP), which is represented using a tuple $M:=({\mathbb{S}},{\mathbb{A}},f,R)$, where ${\mathbb{S}}$, ${\mathbb{A}}$ and $f$ denote the state space, action space and true system dynamics, as defined in Section 2. The last element in the MDP tuple is the reward function $R:{\mathbb{S}}\times{\mathbb{A}}\rightarrow{\mathbb{R}}$, which quantifies the goodness of taking an action ${\bm{a}}_{t}$ in state ${\bm{s}}_{t}$ by computing the scalar value $r_{t}:=R({\bm{s}}_{t},{\bm{a}}_{t})$. The goal of the RL agent is then to develop a policy (controller) $\pi$ which maximizes the expectation of the total return $G=\sum_{t=1}^{T}\gamma^{t-1}r_{t}$ over time $T$, where $\gamma\in[0,\,1)$ is the discount factor. In the model-based RL framework, the policy $\pi$ can simply be a planning algorithm or it can be a function approximator.

Under the model-based RL framework, the true dynamics function $f$ is unknown and needs to be learnt from observed data samples collected from the actual environment. The learnt model is represented using a function approximator as $\hat{f}_{\bm{\theta}}$. A generic algorithm for model-based RL is provided in Algorithm 1. To train the model, an initial training data set $D^{(0)}$ is first generated by sampling trajectories $\xi$ using a randomly initialized policy $\pi$. In any iteration $k$, the data tuples $D^{(k)}=\{({\bm{s}}_{t},{\bm{a}}_{t},\Delta{\bm{s}}_{t+1})_{i}\}_{i=1}^{N_{k}}$ are used to train $\hat{f}_{\bm{\theta}}$ by minimizing the prediction error $E$ using a supervised learning technique. This trained model $\hat{f}_{\bm{\theta}}$ is then used to update the policy $\pi$ or is used by a planning algorithm to generate new trajectories and improve the return $G$ obtained by the agent. The data tuples obtained from the new trajectories are appended to the existing training data set to create $D^{(k+1)}=D^{(k)}\cup\{\xi_{i}\}$, and the process continues until the performance of the controller converges.

Input: $\pi^{(0)}$, $\hat{f}^{(0)}_{\bm{\theta}}$
Generate trajectories $\xi$ using the base policy $\pi^{(0)}$;
Collect and construct the initial data set $D^{(0)}=\{({\bm{s}}_{t},{\bm{a}}_{t},\Delta{\bm{s}}_{t+1})_{i}\}_{i=1}^{N_{0}}$;
$k = 0$;
while $\Delta E>\epsilon$ do
    Learn the dynamics model $\hat{f}^{(k)}_{{\bm{\theta}}}({\bm{s}},{\bm{a}})$ over $D^{(k)}$ by minimizing the error $E(D^{(k)})$;
    while $\Delta G>\delta$ do
        At time $t$, plan through $\hat{f}^{(k)}_{\bm{\theta}}({\bm{s}},{\bm{a}})$ to select the action ${\bm{a}}_{t}$;
        Execute the action ${\bm{a}}_{t}$ and observe the reward $r_{t}$ and the next state ${\bm{s}}_{t+1}$ (MPC);
        Obtain the updated policy $\pi^{(k+1)}$ using backpropagation to maximize the return $G=\sum_{t=1}^{T}\gamma^{t-1}r_{t}$;
        Create the data set $D^{(k+1)}$ by appending the newly observed data points;
    end while
    $k = k+1$;
end while
Algorithm 1: Generic model-based RL algorithm
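For concreteness, a compact sketch of the inner loop of Algorithm 1 is given below, assuming a random-shooting MPC planner over the learnt model; the paper leaves the choice of planner/policy open, and all function names here are illustrative.

```python
import numpy as np

def mpc_action(f_hat, reward_fn, s, n_action, horizon=10, n_candidates=500,
               rng=None):
    """Random-shooting MPC: sample action sequences, roll them out through the
    learnt model f_hat (which predicts the change in state), and return the
    first action of the best-scoring sequence."""
    rng = rng or np.random.default_rng(0)
    best_a, best_ret = None, -np.inf
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, n_action))
        state, ret = np.array(s, dtype=float), 0.0
        for a in actions:
            ret += reward_fn(state, a)
            state = state + f_hat(state, a)
        if ret > best_ret:
            best_a, best_ret = actions[0], ret
    return best_a

def collect_rollout(env_reset, env_step, f_hat, reward_fn, n_action, T=100):
    """One real-environment rollout with the MPC policy; returns the
    (s_t, a_t, delta_s_{t+1}) tuples appended to D for the next iteration."""
    data, s = [], env_reset()
    for _ in range(T):
        a = mpc_action(f_hat, reward_fn, s, n_action)
        s_next = env_step(s, a)
        data.append((s, a, s_next - s))
        s = s_next
    return data
```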

A.3 Implementation of Proposed Safety Verification Framework

To implement the proposed safety verification framework in Fig. 1, the MATLAB helperOC toolbox (2019), developed and maintained by the HJ Reachability Group on GitHub, was used. This toolbox depends on the level set toolbox developed by Mitchell (2007). In the helperOC toolbox, the level set method is used to provide a numerical solution to the PDE in (14). Level set methods have been used to track and model the evolution of dynamic surfaces (here, the reachable tube) by representing the dynamic surface implicitly using a level set function at any instant of time (Osher & Sethian, 1988; Osher et al., 2004; Sethian, 1999). There are several advantages to using level set algorithms, including the fact that they can represent non-linear, non-convex surfaces even if the surface merges or divides over time. Additionally, it is easy to extend the theory to higher dimensional state spaces. However, the computational time increases exponentially with the dimension of the state space. The helperOC toolbox (2019) has demonstrated results on problems with up to 10-dimensional state spaces, and since most physical systems can be represented within this limit, it is sufficient for the problems of interest in this work.

The helperOC toolbox (2019) comes with built-in numerical schemes to compute the Hamiltonian $H^{*}$, the spatial derivative $\nabla V_{R}$ and the time derivative $\frac{dV_{R}}{dt}$ in (14). This toolbox is modified to solve the problem in (14) by substituting the true system dynamics $f$ with the augmented learnt dynamics $\hat{f}^{(r)}_{\bm{\theta}}({\bm{s}},{\bm{a}},{\bm{d}})$ and computing the action at any state ${\bm{s}}$ as ${\bm{a}}=\pi({\bm{s}})$. Subsequently, to determine the level set function at any time instant, $\frac{dV_{R}}{dt}$ is computed using a Runge-Kutta integration scheme, while the gradient of the level set function $\nabla V_{R}$ is computed using an upwind finite difference scheme. Additionally, a Lax-Friedrichs approximation is used to stably evaluate the Hamiltonian (Crandall & Lions, 1984).
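To make these numerical schemes concrete, the following minimal 1D sketch (not the helperOC implementation) combines upwind differences for $\nabla V_{R}$, a Lax-Friedrichs approximation of the Hamiltonian in (14), and an explicit Euler step backwards in time; the closed-loop dynamics, error bound and grid are all assumed for illustration.

```python
import numpy as np

def one_step(V, x, dt, f_hat, d_bound):
    """One backward-in-time step of dV/dt + min(0, H) = 0 on a 1D grid."""
    dx = x[1] - x[0]
    # One-sided (upwind) difference approximations of dV/dx.
    p_minus, p_plus = np.empty_like(V), np.empty_like(V)
    p_minus[1:] = (V[1:] - V[:-1]) / dx
    p_minus[0] = p_minus[1]
    p_plus[:-1] = (V[1:] - V[:-1]) / dx
    p_plus[-1] = p_plus[-2]

    # Hamiltonian from (14): H = p * f_hat(x) + |p| * d_bound.
    p_avg = 0.5 * (p_minus + p_plus)
    H = p_avg * f_hat(x) + np.abs(p_avg) * d_bound

    # Lax-Friedrichs dissipation; alpha bounds |dH/dp| = |f_hat(x)| + d_bound.
    alpha = np.abs(f_hat(x)) + d_bound
    H_hat = H - 0.5 * alpha * (p_plus - p_minus)

    # min(0, .) freezes states that have already reached the target set.
    return V + dt * np.minimum(0.0, H_hat)

# Example: target set T = [-0.2, 0.2] via l(x) = |x| - 0.2, hypothetical
# closed-loop dynamics x_dot = -x (policy already folded in), error bound 0.05.
x = np.linspace(-2.0, 2.0, 401)
V = np.abs(x) - 0.2
for _ in range(125):                      # integrate back to t = -0.5
    V = one_step(V, x, dt=0.004, f_hat=lambda s: -s, d_bound=0.05)
brt = x[V <= 0.0]                         # grid points inside B_R(-0.5)
```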

A.4 Experiment Results: Safe Land Navigation

In the land navigation problem, a ground robot is moving in an indoor environment with obstacles. The ground vehicle has the state vector ${\bm{s}}=[x,y]^{T}$, representing the $x$ and $y$ coordinates of the vehicle. The control vector ${\bm{a}}=[v,\psi]$ comprises the velocity $v$ and the heading direction $\psi$ of the vehicle. The problem setting is shown in Fig. 2. The sets of initial and goal states are represented by circles centered at coordinates $(0.0,0.0)$ and $(6.0,4.5)$, with radii of $0.7$ and $0.5$ units, respectively. There are two rectangular obstacles in the environment with centers at $(1.5,4.5)$ and $(4.0,1.5)$.

To demonstrate the working of the proposed safety verification framework, different models and controllers were trained for the 2D land navigation problem. Three models were trained over training data sets of different sizes: model 1 over 300 data samples, model 2 over 600 data samples, and model 3 over 1000 data samples. To estimate the modeling error for each of the three models, a $\pm 3\sigma$ bound was computed from the standard deviation (s.d.) of the prediction errors over the validation data sets (Table 1).

Table 1: Standard deviations (s.d.) computed for different models.

                     Model 1    Model 2    Model 3
s.d. for state $x$   0.00114    0.00045    0.00031
s.d. for state $y$   0.00120    0.00053    0.00042

Then, for each model, three different policies were sampled after training for 100, 300 and 600 episodes, respectively. We first present the FRT results for all the controllers corresponding to each model, followed by the BRT results.

Figure 4: The FRTs computed over the augmented learnt dynamics and the true system dynamics are presented. (Top Row) The FRTs corresponding to policy 1 (left), policy 2 (middle) and policy 3 (right) trained over learnt model 1. (Middle Row) The FRTs corresponding to policy 1 (left), policy 2 (middle) and policy 3 (right) trained over learnt model 2. (Bottom Row) The FRTs corresponding to policy 1 (left), policy 2 (middle) and policy 3 (right) trained over learnt model 3.

The FRT results for all the different models and controllers are shown in Fig. 4. Each plot in Fig. 4 shows two FRTs: the first corresponding to the augmented learnt dynamics and the second corresponding to the true system dynamics. This is done to validate the result of the proposed safety verification framework against the ground truth classification of $\pi$ as safe or unsafe. In general, it is observed that the FRT computed by the proposed framework over the augmented learnt dynamics closely resembles the FRT computed over the true system dynamics. Thus, in every case, the classification of $\pi$ determined by the proposed framework is consistent with the ground truth result. All the above policies are found to be unsafe because their FRTs intersect with obstacle 2. In all cases, the policies are safe with respect to obstacle 1. Since all the above policies are unsafe, the BRT analysis is performed to determine ${\mathbb{S}}_{safe}$ for each policy.

The results corresponding to the determination of ${\mathbb{S}}_{safe}$, using the proposed BRT formulation, for each of the three controllers trained over learnt models 1, 2 and 3 are presented in Fig. 5. ${\mathbb{S}}_{safe}$ determined by the proposed method is compared with the ground truth (G.T.) data, which was generated using Monte-Carlo samples of possible trajectories from ${\mathbb{S}}_{0}$. It is observed that the most accurate results are obtained for model 1. For model 2, a small error due to modeling uncertainty is observed while determining ${\mathbb{S}}_{safe}$; this error is related to how conservative the bound on the modeling error is. For model 3, the proposed BRT formulation correctly identifies policy 1 as completely unsafe, since ${\mathbb{S}}_{safe}=\emptyset$. For the remaining policies, a small error is observed while determining ${\mathbb{S}}_{safe}$.

It can be seen from Fig. 5 that there is no consistent trend in the fraction of safe initial states, either across the models trained with an increasing number of sample points (top to bottom) or across the policies trained for an increasing number of iterations (left to right). This highlights one of the primary concerns regarding the lack of safety in RL-based controllers: even though the cost function is designed to mitigate collisions and head towards the target, the controller learnt using it may not be safe for various reasons stemming from the state representations learned by the NN-based policies and model dynamics, the hyper-parameters being used, or the optimization getting stuck in local minima. In such scenarios, the proposed framework can be used to check, before deployment, whether the developed RL controller abides by the required safety constraints.

Figure 5: Comparison of ${\mathbb{S}}_{safe}$ computed by the proposed BRT formulation with the ground truth (G.T.) data. (Top Row) The BRT computed for policy 1 (left), policy 2 (middle) and policy 3 (right) trained over learnt model 1. (Middle Row) The BRT computed for policy 1 (left), policy 2 (middle) and policy 3 (right) trained over learnt model 2. (Bottom Row) The BRT computed for policy 1 (left), policy 2 (middle) and policy 3 (right) trained over learnt model 3.

A.5 Experiment Results: Safe Aerial Navigation

In the aerial navigation problem, an unmanned aerial vehicle (UAV) is flying in an urban environment. To ensure a safe flight, the UAV has to avoid collisions with obstacles, such as buildings and trees, in its path. This problem is abstracted in a continuous state and action domain, where the state vector ${\bm{s}}=[x,y,z]$ gives the 3D coordinates of the UAV and the action vector ${\bm{a}}=[v,\psi,\phi]$ comprises the velocity, heading angle and pitch angle. The UAV has to take off from the ground and reach a desired set of goal states. Both the initial set of states and the goal states are represented using cuboids centered at coordinates $(0.0,0.0,0.0)$ and $(3.8,4.5,4.5)$, respectively. There are two cylindrical obstacles in the environment, each of height $6$ units, centered at coordinates $(2.0,4.0,0.0)$ and $(4.0,3.0,0.0)$, respectively, with a radius of $0.5$ units. This problem setting is presented in Fig. 3.

A learnt model represented by an NN is trained over 1000 data points, and the policy $\pi$ trained over this model is evaluated. The FRT computed by the proposed framework over the augmented learnt dynamics is presented from different perspectives in Fig. 6. While the computed FRT does not intersect with obstacle 1, it clearly intersects with obstacle 2. Therefore, $\pi$ is deemed unsafe. The next step is to compute the BRT and determine ${\mathbb{S}}_{safe}$ for the unsafe policy $\pi$, as shown in Fig. 6. It is observed that the proposed method under-approximates ${\mathbb{S}}_{safe}$ when compared with the G.T. data.

Figure 6: (Top Row) Views from different angles of the FRT computed by the proposed method over the augmented learnt system dynamics. It can be noticed that the FRT intersects with obstacle 2; therefore, the policy is deemed unsafe. (Bottom Row) Views from different angles of the BRT computed by the proposed method over the augmented learnt system dynamics. The proposed method under-approximates ${\mathbb{S}}_{safe}$ when compared with the G.T. data, marked in green.