
Barrier Certified Safety Learning Control:
When Sum-of-Square Programming Meets Reinforcement Learning

Hejun Huang1, Zhenglong Li2, Dongkun Han1*
1Hejun Huang and Dongkun Han are with the Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong.
2Zhenglong Li is with the Department of Electrical and Electronic Engineering, The University of Hong Kong.
*Email: [email protected]
Abstract

Safety guarantees are essential in many engineering applications. Reinforcement learning provides a useful way to learn control policies, but reinforcement learning algorithms alone cannot completely guarantee safety during realistic operation. To address this issue, this work combines control barrier functions with reinforcement learning and proposes a compensated algorithm that maintains safety throughout learning. Specifically, sum-of-squares programming is exploited to search for the optimal controller and to tune the learning hyperparameters simultaneously, so that the control actions always remain within the safe region. The effectiveness of the proposed method is demonstrated on an inverted pendulum model. Compared with quadratic-programming-based reinforcement learning methods, the proposed sum-of-squares-programming-based reinforcement learning shows superior performance.

I Introduction

Reinforcement learning (RL) is one of the most popular methods for accomplishing the long-term objective of a Markov decision process [1, 2, 3, 4, 5, 6]. It seeks an optimal policy that maximizes a long-term reward in a given environment through repeated interaction. In policy gradient methods [7, 8, 9], this reward-driven learning problem is traditionally addressed with gradient-based updates to obtain better control policies.

RL algorithms can produce the desired results in academic trials, e.g., robotic experiments [10, 11] and autonomous driving scenarios [12, 13]. For realistic applications, however, how to completely guarantee safety under these RL policies remains an open problem. The intermediate policies produced during the agent's exploration and exploitation may damage the environment or the agent itself, which is unacceptable for safety-critical systems. Moreover, disturbances acting on the system can confuse the agent during learning, making the control task more complex and challenging.

To cope with the issues above, various methods have been proposed to enforce dynamical safety during the learning process. Early work certified safety through the region of attraction (ROA) obtained from control Lyapunov functions or Lyapunov-like functions [14, 15]; related and useful work on ROA estimation is introduced in [16, 17, 18, 19]. In 2017, [20] began to restrict the RL exploration and exploitation to the ROA to maintain safety. Different from region constraints, in 2018 [21, 22] proposed a shield framework that repeatedly filters out unsafe actions. Inspired by this work, [23, 24] further combined the control barrier function (CBF) with RL to synthesize the learned controller. Note that using CBFs to identify safe regions is not new in control theory: [25, 26] are representative works combining CBFs with bipedal robots, and [27] explored their implementation on quadrotors. An appropriate region constraint is important for building safe RL. Hence, a reliable learning method is expected to keep the controller inside the safe region throughout the process.

In this paper, we consider a state-space model in polynomial form that includes an unknown term estimated by a Gaussian process. We then apply the Deep Deterministic Policy Gradient (DDPG) algorithm to design a controller. The generated control action affects the system dynamics and, in turn, the sum-of-squares programming (SOSP) solution of the control barrier function.

The main contributions of this work are: (1) a framework that embeds the SOSP-based CBF into the DDPG algorithm; (2) a comparison of quadratic-programming (QP)-based and SOSP-based controllers combined with DDPG; and (3) a demonstration of the safety and efficiency of the DDPG-SOSP algorithm on an inverted pendulum model.

This paper is organized as follows. Section II presents preliminaries on reinforcement learning, Gaussian processes, and control barrier functions. Section III formulates the steps to compute an SOSP-based controller and introduces the corresponding DDPG-SOSP algorithm. Section IV gives numerical examples and discusses the learning efficiency and safety performance of the QP-based and SOSP-based strategies, before the paper is concluded.

II Preliminary

A discrete-time nonlinear control-affine system with states $s_{t},s_{t+1}\in\mathcal{S}$ and control action $a_{t}\in\mathcal{A}$ is considered:

$s_{t+1}=f(s_{t})+g(s_{t})a_{t}+d(s_{t}),\quad\forall t\in[\underline{t},\overline{t}]$,  (1)

where the transition from $s_{t}$ to $s_{t+1}$ occurs within a bounded time, and $\underline{t}$ and $\overline{t}$ are constants.

Notation: Following infinite-horizon discounted Markov decision process (MDP) theory [28], let $r:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{R}$ denote the reward of the current state-action pair $(s_{t},a_{t})$ in (1), let $P(s_{t},s_{t+1},a_{t})=p(s_{t+1}\mid s_{t},a_{t})$ denote the transition probability to the next state $s_{t+1}$, and let $\gamma\in(0,1)$ denote a discount factor. Thus, an MDP tuple $\tau=(s_{t},s_{t+1},a_{t},r_{t},P_{t},\gamma)$ is defined at each time step $t$.

II-A Reinforcement Learning

RL trains an agent to map situations to the most valuable actions over a long-term process. During this process, the agent is driven by a cumulative reward $r$ to find the best policy $\pi(a\mid s)$. Various RL algorithms have been proposed to select an optimal $\pi(a\mid s)$; for more details on selecting $\pi$, we refer interested readers to [1].

Given the MDP tuple $\tau$, the RL agent selects an optimal policy $\pi^{*}$ by maximizing the expected return $J(\pi)$:

$J(\pi)=\mathbb{E}_{\tau}\left[\sum^{\infty}_{t=0}\gamma^{t}r(s_{t},a_{t})\right]$.  (2)
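As a minimal illustration of (2), the short numpy snippet below computes the discounted return of a sampled finite-length reward sequence; the trajectory and discount factor used here are arbitrary placeholders.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a sampled reward sequence, cf. the objective J(pi) in (2)."""
    discounts = gamma ** np.arange(len(rewards))   # gamma^t for t = 0, 1, ...
    return float(np.dot(discounts, rewards))

# Example: a short sampled trajectory of rewards
print(discounted_return([1.0, 0.5, 0.25], gamma=0.9))  # 1.0 + 0.9*0.5 + 0.81*0.25
```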

Furthermore, the action-value function $Q_{\pi}$ can be derived from $\tau$ and satisfies the Bellman equation

$Q_{\pi}(s_{t},a_{t}) = \mathbb{E}_{s_{t},a_{t}}\big[r_{t}+\gamma\,\mathbb{E}_{a_{t+1}}[Q_{\pi}(s_{t+1},a_{t+1})]\big] = \mathbb{E}_{s_{t+1},a_{t+1},\dots}\left[\sum^{\infty}_{l=0}\gamma^{l}r(s_{t+l},a_{t+l})\right]$.  (3)

In this paper, we use the DDPG algorithm, a typical actor-critic, off-policy method, to handle the dynamics in (1). Let the parameters $\theta^{\pi}$ and $\theta^{Q}$ denote the actor and critic neural networks, respectively. By interacting with the environment repeatedly, the actor generates a policy $\pi:\mathcal{S}\rightarrow\mathcal{A}$ and is evaluated by the critic via $Q_{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{R}$. The transitions are stored in a replay buffer; once the buffer is full, they are used to update the policy gradient w.r.t. $\theta^{\pi}$ and the critic loss w.r.t. $\theta^{Q}$.
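For concreteness, the sketch below shows the kind of actor/critic networks and soft target update that a standard DDPG implementation uses; the layer sizes, soft-update rate, and pendulum-sized dimensions are illustrative assumptions, not the exact architecture used in this paper.

```python
import numpy as np
import tensorflow as tf

state_dim, action_dim, a_max = 2, 1, 15.0   # pendulum-sized example; sizes are assumptions

# Actor pi_theta: S -> A and critic Q_theta: S x A -> R as small MLPs
actor = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(action_dim, activation="tanh"),   # scaled by a_max below
])
critic = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim + action_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Target network and the soft update theta' <- tau*theta + (1 - tau)*theta' used by DDPG
target_actor = tf.keras.models.clone_model(actor)
target_actor.set_weights(actor.get_weights())

def soft_update(target, source, tau=0.005):
    target.set_weights([tau * w + (1.0 - tau) * wt
                        for w, wt in zip(source.get_weights(), target.get_weights())])

s = np.zeros((1, state_dim), dtype=np.float32)
a_rl = a_max * actor(s).numpy()                       # a^RL_theta(s), scaled to the torque range
q_value = critic(np.concatenate([s, a_rl], axis=1))   # Q_theta(s, a)
soft_update(target_actor, actor)
```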

However, the state-action pairs generated by the DDPG algorithm come without a complete safety guarantee, which is unacceptable in safety-critical situations. Inspired by the work of [23], we propose a programming-assisted method in Section III to maintain safety throughout the learning process.

II-B Gaussian Process

A Gaussian process (GP) estimates the system and predicts its dynamics based on prior data. A GP is a stochastic process whose values at any finite set of states $\{s_{1},s_{2},\dots\}\subset\mathcal{S}$ follow a joint Gaussian distribution. We use a GP to estimate the unknown term $d$ in (1) during system operation. Corrupted by independent Gaussian noise $\mathcal{N}(0,\sigma_{n}^{2})$, the GP prior data $\{d_{0},d_{1},\dots,d_{k}\}$ can be computed indirectly from $d_{k}=s_{k+1}-f-g\cdot a_{k}$. Then, a high-probability statement of the posterior output $\hat{d}:\mathcal{S}\rightarrow\mathbb{R}$ is obtained as

$d(s)\sim\mathcal{N}(m_{d}(s),\sigma_{d}(s))$,  (4)

where the mean function $m_{d}(s)$ and the variance function $\sigma_{d}(s)$ describe the posterior distribution as

$m_{d}(s)-k_{\delta}\sigma_{d}(s)\leq d(s)\leq m_{d}(s)+k_{\delta}\sigma_{d}(s)$,  (5)

with probability greater than or equal to $(1-\delta)^{k}$, $\delta\in(0,1)$ [29], where $k_{\delta}$ is a design parameter for the confidence interval $[(1-\delta)^{k},1]$. Therefore, by learning from a finite number of data points, we can estimate the unknown term $d(s_{*})$ at any query state $s_{*}\in\mathcal{S}$.
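The snippet below is a minimal sketch of this regression step using scikit-learn's GP implementation rather than the toolchain of Section IV: residual samples $d_{k}=s_{k+1}-f-g\,a_{k}$ are fitted and the posterior mean and standard deviation of (4)-(5) are queried. The residual data here are synthetic placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D illustration: residuals d_k = s_{k+1} - f(s_k) - g(s_k) a_k along a trajectory.
# The "true" residual below is made up purely for demonstration.
rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=(30, 1))
d = 0.3 * np.sin(2.0 * s[:, 0]) + 0.01 * rng.standard_normal(30)   # noisy residual samples

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5) + WhiteKernel(1e-4),
                              normalize_y=True)
gp.fit(s, d)

s_query = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
mean, std = gp.predict(s_query, return_std=True)    # m_d(s*) and sigma_d(s*) in (4)
lower, upper = mean - 2.0 * std, mean + 2.0 * std   # a two-sigma confidence band, cf. (5)
```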

As the following lemma from [30, 31] states, there exists a polynomial mean function that approximates the term $d$ via the GP, which can then predict the output at a query state $s_{*}$ directly.

Lemma 1.

Suppose we have access to $k$ measurements of $d(s)$ in (1) corrupted by Gaussian noise $\mathcal{N}(0,\sigma_{n}^{2})$. If the norm of the unknown term $d(s)$ is bounded, the following GP model of $d(s)$ can be established with polynomial mean function $m_{d}(s_{*})$ and covariance function $\sigma_{d}^{2}(s_{*})$,

$m_{d}(s_{*}) = \varphi(s_{*})^{\mathrm{T}}w$,  $\sigma_{d}^{2}(s_{*}) = k(s_{*},s_{*})-k_{*}^{\mathrm{T}}(K+\sigma_{n}^{2}I)^{-1}k_{*}$,  (6)

with probability in $[(1-\delta)^{k},1]$, where $\delta\in(0,1)$, $s_{*}$ is a query state, $\varphi(s_{*})$ is a monomial vector, $w$ is a coefficient vector, $[K]_{(i,j)}=k(s_{i},s_{j})$ is a kernel Gramian matrix, and $k_{*}=[k(s_{1},s_{*}),k(s_{2},s_{*}),\dots,k(s_{k},s_{*})]^{\mathrm{T}}$.

Lemma 1 yields a polynomial expression for $d(s)$ within the probability range $[(1-\delta)^{k},1]$. Meanwhile, the nonlinear term $f(s)$ in (1) can be approximated on a given domain by Chebyshev interpolants $P_{k}(s)$ with a bounded remainder $\xi(s)$, i.e., $f(s)=P_{k}(s)+\xi(s)$, where $k$ is the degree of the polynomial $P_{k}(s)$ [32]. We can now rewrite (1) as

$s_{t+1}=P_{k}(s_{t})+g(s_{t})a_{t}+d_{\xi}(s_{t})$.  (7)

The polynomial system (7) is equivalent to (1) with the new term $d_{\xi}(s_{t})=d(s_{t})+\xi(s_{t})$. Consequently, by learning $d_{\xi}(s_{t})$ we obtain, within a probability range, the polynomial system

$s_{t+1}=P_{k}(s_{t})+g(s_{t})a_{t}+m_{d_{\xi}}(s_{t})$,  (8)

which will be used to synthesize the controller in Section III.
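The following sketch evaluates one step of the surrogate (8) for a scalar state, assuming the coefficients of $P_{k}$, the input map $g$, and the polynomial GP mean $m_{d_{\xi}}$ have already been identified; all coefficients shown are made up for illustration.

```python
import numpy as np

def surrogate_step(s, a, Pk_coeffs, g_fn, md_coeffs):
    """One step of the polynomial surrogate (8):
       s_{t+1} = P_k(s_t) + g(s_t) a_t + m_{d_xi}(s_t).

    Pk_coeffs, md_coeffs: polynomial coefficients (highest degree first, np.polyval order);
    g_fn: input map g(s). All are placeholders supplied by the identification step above.
    """
    return np.polyval(Pk_coeffs, s) + g_fn(s) * a + np.polyval(md_coeffs, s)

# Illustrative scalar instance with made-up coefficients
s_next = surrogate_step(0.2, 1.0,
                        Pk_coeffs=[-0.15, 0.0, 0.9, 0.0],   # a cubic P_k
                        g_fn=lambda s: 1.0,
                        md_coeffs=[0.02, 0.0])              # a linear GP mean m_{d_xi}
```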

II-C Control Barrier Function

The super-level set of a control barrier function (CBF) $h:\mathcal{S}\rightarrow\mathcal{R}$ defines a safe set $\mathcal{C}$ as

$\mathcal{C}=\{s_{t}\in\mathcal{S}:h(s_{t})\geq 0\}$.  (9)

Throughout this paper, we refer to $\mathcal{C}$ as the safe region and to

$\mathcal{U}=\mathcal{S}\setminus\mathcal{C}=\{s_{t}\in\mathcal{S}:h(s_{t})<0\}$,  (10)

as the unsafe regions of the dynamical system (8). A state $s_{t}\in\mathcal{C}$ will not enter $\mathcal{U}$ if the following forward-invariance constraint is satisfied:

$\forall s_{t}\in\mathcal{C}:\; h(s_{t})\geq 0,\ \Delta h(s_{t},a_{t})\geq 0$,  (11)

where $\Delta h(s_{t},a_{t})=h(s_{t+1})-h(s_{t})$. Before demonstrating how to compute $h$ for the system (8), we first introduce the Positivstellensatz (P-satz) [33].
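As a pointwise numerical illustration of (11) (the SOS conditions introduced next certify it over the whole set), the helper below checks the forward-invariance condition for a candidate barrier and one-step dynamics supplied by the user; the quadratic barrier and contracting map in the example are arbitrary.

```python
def satisfies_cbf_condition(h, step, s_t, a_t):
    """Check the discrete-time condition (11): h(s_t) >= 0 and Delta h(s_t, a_t) >= 0.

    h: candidate barrier function; step: one-step dynamics s_{t+1} = step(s_t, a_t).
    This is only a pointwise numerical check at (s_t, a_t).
    """
    s_next = step(s_t, a_t)
    return h(s_t) >= 0.0 and (h(s_next) - h(s_t)) >= 0.0

# Example with a simple quadratic barrier h(s) = 1 - s^2 and a contracting map
print(satisfies_cbf_condition(lambda s: 1.0 - s**2,
                              lambda s, a: 0.9 * s + 0.0 * a,
                              0.5, 0.0))   # True
```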

Let $\mathcal{P}$ be the set of polynomials and $\mathcal{P}^{\text{SOS}}$ the set of sum-of-squares polynomials, i.e., $P(x)=\sum_{i=1}^{k}p_{i}^{2}(x)$, where $p_{i}(x)\in\mathcal{P}$ and $P(x)\in\mathcal{P}^{\text{SOS}}$.

Lemma 2.

For polynomials $\{a_{i}\}_{i=1}^{m}$, $\{b_{j}\}_{j=1}^{n}$ and $p$, define the set $\mathcal{B}=\{s\in\mathcal{S}:\{a_{i}(s)\}_{i=1}^{m}=0,\ \{b_{j}(s)\}_{j=1}^{n}\geq 0\}$ and let $\mathcal{B}$ be compact. The condition $p(s)\geq 0$ for all $s\in\mathcal{B}$ holds if the following condition is satisfied:

$\exists\, r_{1},\dots,r_{m}\in\mathcal{P},\ s_{1},\dots,s_{n}\in\mathcal{P}^{\text{SOS}}$ such that $p-\sum^{m}_{i=1}r_{i}a_{i}-\sum^{n}_{j=1}s_{j}b_{j}\in\mathcal{P}^{\text{SOS}}$. $\square$

Lemma 2 points out that any polynomial $p$ that is strictly positive on $\mathcal{B}$ lies in the cone generated by the non-negative polynomials $\{b_{j}\}_{j=1}^{n}$ over $\mathcal{B}$. Based on the definitions of $\mathcal{C}$ and $\mathcal{U}$, the P-satz will be used extensively in the safety verification.
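To make the SOS membership test concrete, the sketch below checks whether a univariate polynomial is a sum of squares by searching for a positive semidefinite Gram matrix; it uses cvxpy with its default SDP solver rather than the SOSOPT/Mosek toolchain employed later in the paper.

```python
import cvxpy as cp

def is_sos_univariate(coeffs):
    """Check whether p(x) = sum_k coeffs[k] * x^k (even degree) is a sum of squares,
    by searching a PSD Gram matrix Q with p(x) = z(x)^T Q z(x), z = [1, x, ..., x^d]."""
    deg = len(coeffs) - 1
    assert deg % 2 == 0, "SOS polynomials have even degree"
    d = deg // 2
    Q = cp.Variable((d + 1, d + 1), PSD=True)
    constraints = []
    for k in range(deg + 1):
        # coefficient of x^k in z^T Q z must match coeffs[k]
        terms = [Q[i, k - i] for i in range(max(0, k - d), min(k, d) + 1)]
        constraints.append(sum(terms) == coeffs[k])
    prob = cp.Problem(cp.Minimize(0), constraints)
    prob.solve()
    return prob.status == cp.OPTIMAL

# x^2 - 2x + 2 = (x - 1)^2 + 1 is SOS; p(x) = x is not (coefficients in increasing degree)
print(is_sos_univariate([2.0, -2.0, 1.0]))   # True
print(is_sos_univariate([0.0, 1.0, 0.0]))    # False
```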

Figure 1: Workflow of the safe RL control with SOS program.

A sufficient condition for establishing a CBF is the existence of a deterministic controller $a^{CBF}:\mathcal{S}\rightarrow\mathcal{A}$. Based on Lemma 2, we can compute a polynomial controller $a^{CBF}$ through an SOS program so that the system avoids entering the unsafe regions $\mu_{i}(s)\subset\mathcal{U}$, $i=1,2,\dots,n$, at each step $t$.

Lemma 3.

Given a safe region $\mathcal{C}$ certified by the polynomial barrier function $h$ and unsafe regions $\mu_{i}\subset\mathcal{U}$, $i\in\mathbb{Z}^{+}$, for the polynomial system (8), if there exists a controller $a^{CBF}$ that satisfies

$a^{*} = \arg\min \|a^{CBF}\|_{2}$ over $s_{t}\in\mathcal{S}$, $a^{CBF}\in\mathcal{A}$, $L(s_{t}),M_{i}(s_{t})\in\mathcal{P}^{\text{SOS}}$,  (12)
s.t. $\Delta h(s_{t},a^{CBF})-L(s_{t})h(s_{t})\in\mathcal{P}^{\text{SOS}}$,
$-\Delta h(s_{t},a^{CBF})-M_{i}(s_{t})\mu_{i}(s_{t})\in\mathcal{P}^{\text{SOS}}$,

where $L(s_{t})$ and $M_{i}(s_{t})$ are SOS polynomials for $i=1,2,\dots,n$, then the action $a^{*}$ is a minimum-norm polynomial controller that regulates the system (8) safely.

Proof.

Suppose there exists a CBF $h$ that certifies a safe region $\mathcal{C}$. Then a deterministic input $a_{t}=a^{CBF}(s_{t})$ is needed to establish the condition $\Delta h(s_{t},a^{CBF})\geq 0$ in (11) for the system (8).

Since $h$ is a polynomial function, if $a^{CBF}$ is also polynomial, then $\Delta h$ remains a polynomial. According to Lemma 2, we can formulate this non-negativity condition on $\Delta h$ as an SOS constraint in the cone of $h$:

$\Delta h(s_{t},a^{CBF})-L(s_{t})h(s_{t})\in\mathcal{P}^{\text{SOS}}$,  (13)

where the auxiliary SOS polynomial $L(s_{t})$ restricts the condition to the non-negative compact domain $\mathcal{C}$ certified by the super-level set of $h(s_{t})$.

We leverage the P-satz again to keep the dynamics of (8) away from the unsafe regions $\mu_{i}(s_{t})$:

$-\Delta h(s_{t},a^{CBF})-\mu_{i}(s_{t})M_{i}(s_{t})\in\mathcal{P}^{\text{SOS}}$,  (14)

where $M_{i}(s_{t})$ denote the corresponding SOS polynomials. Guided by the optimization (12), we can then solve for a minimal-cost control that maintains system safety subject to (13) and (14), which completes the proof. ∎

This lemma maintains the safety of the polynomial system (8) while minimizing the current control $a^{CBF}$ via an SOS program. Multiple unsafe regions $\mu_{i}(s_{t})$ are also handled when solving for the target input $a^{*}$ in (12). Since Lemma 1 established the probabilistic relationship between (8) and (1), the solution of (12) for (8) is also a feasible control that regulates the true dynamics (1) safely.

III CBF Guiding Control with Reinforcement Learning

In the first part of this section, we detail the computation steps for (12). In the second part, we go through the details of the SOSP-based CBF guiding control with the DDPG algorithm.

III-A SOS Program Based CBF

Different from [23], our control $a^{CBF}$ is solved by an SOS program with polynomial barrier functions, rather than by QP with linear barrier functions. In [23], a QP with linear barrier constraints is constructed to cope with RL. Clearly, a safe but more permissive search region enhances RL training performance. We therefore propose an SOS program that computes a minimal control with a polynomial barrier function based on Lemma 3.

Lemma 3 states that we can compute a minimal control $a^{CBF}$ directly from the approximated polynomial system (8). However, it is hard to handle the minimization of $\|a^{CBF}\|_{2}$ in (12) directly. The optimal control can instead be obtained with the following two steps:

Step 1: Search for an optimal auxiliary multiplier $L(s_{t})$ by maximizing the scalar $\epsilon_{1}$:

$L^{*}(s_{t}) = \arg\max_{\epsilon_{1}\in\mathcal{R}^{+},\,L(s_{t})\in\mathcal{P}^{\text{SOS}}} \epsilon_{1}$  (15)
s.t. $\Delta h(s_{t},a^{CBF})-L(s_{t})h(s_{t})-\epsilon_{1}\in\mathcal{P}^{\text{SOS}}$,
$-\Delta h(s_{t},a^{CBF})-M_{i}(s_{t})\mu_{i}(s_{t})\in\mathcal{P}^{\text{SOS}}$,

where $L(s_{t})$ is an auxiliary multiplier that yields a permissive constraint for the existing action $a^{CBF}$.

Step 2: Given $L(s_{t})$ from the previous step, search for an optimal control $a^{CBF}$ with minimal control cost by minimizing the scalar $\epsilon_{2}$:

$a^{*}_{t} = \arg\min_{\epsilon_{2}\in\mathcal{R}^{+}} \epsilon_{2}$  (16)
s.t. $\Delta h(s_{t},a^{CBF})-L(s_{t})h(s_{t})-\epsilon_{2}\in\mathcal{P}^{\text{SOS}}$,
$-\Delta h(s_{t},a^{CBF})-M_{i}(s_{t})\mu_{i}(s_{t})\in\mathcal{P}^{\text{SOS}}$.
Remark.

The scalars $\epsilon_{1}$ and $\epsilon_{2}$ in (15) and (16) bound the magnitude of the coefficients being optimized in $L(s_{t})$ and $a_{t}$, respectively.

The SOS programs above detail the computation of the target controller for the system (8); the solution of (16) directly enforces dynamical safety through the control barrier function.
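A structural sketch of this two-step procedure is given below. The two solver calls are hypothetical stand-ins for the SOS programs (15) and (16) (set up with SOSOPT and solved by Mosek in the paper's implementation); only the alternation logic is shown.

```python
# Skeleton of the two-step procedure (15)-(16).  The two solver functions are
# placeholders for the actual SOS programs and must be supplied by an SOS toolbox.

def solve_multiplier_step(h, unsafe_regions, dynamics, a_fixed):
    """Step 1, cf. (15): with the control fixed, maximize epsilon_1 over the SOS
    multiplier L(s).  Hypothetical placeholder for the SOS program."""
    raise NotImplementedError("replace with an SOS-toolbox call")

def solve_control_step(h, unsafe_regions, dynamics, L_fixed):
    """Step 2, cf. (16): with L(s) fixed, minimize epsilon_2 (and hence the control
    effort) over the polynomial control a^CBF(s).  Hypothetical placeholder."""
    raise NotImplementedError("replace with an SOS-toolbox call")

def compute_cbf_control(h, unsafe_regions, dynamics, a_init):
    """One pass of Step 1 followed by Step 2 returns the CBF control a^CBF."""
    L = solve_multiplier_step(h, unsafe_regions, dynamics, a_init)
    return solve_control_step(h, unsafe_regions, dynamics, L)
```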

III-B SOS Program Based CBF with Reinforcement Learning

The workflow of the CBF-guided DDPG control algorithm is shown in Fig. 1; it is akin to the original idea in [23] of pushing the agent's action into a safe training architecture. In Fig. 1, the flowing elements are listed in brackets, and the corresponding inputs of each computation are highlighted.

We illustrate the core idea of CBF-guided RL control as

$a_{t}= a^{RL}_{\theta_{k}}+\bar{a}_{k}+a_{t}^{CBF}$,  (17)

where the final action $a_{t}$ consists of an RL-based controller $a^{RL}_{\theta_{k}}$, a previously deployed CBF controller $\bar{a}_{k}$, and an SOS-program-based CBF controller $a_{t}^{CBF}$. We distinguish the subscripts $t$ and $k$ as follows: $t$ denotes the step number within each policy iteration, and $k$ denotes the policy iteration number over the learning process.

More specifically, the first term in (17) is an action generated by a standard DDPG algorithm with parameters $\theta$. The second term $\bar{a}_{k}=\sum_{i=0}^{k-1}a_{i}^{CBF}$ denotes a global accumulation of the previous CBF controllers. Since it is impractical to compute the exact value of each CBF action $a_{i}^{CBF}$ at every step, following [23] we approximate this term by $\bar{a}_{\phi_{k}}\approx\bar{a}_{k}=\sum^{k-1}_{i=0}a_{i}^{CBF}$, where $\bar{a}_{\phi_{k}}$ is the output of a multilayer perceptron (MLP) with hyperparameters $\phi$. We then fit the MLP and update $\phi_{k}$ with $\sum^{k-1}_{i=0}a_{i}^{CBF}(s,a_{\theta_{0}}^{RL},\dots,a_{\theta_{i-1}}^{RL})$ at each episode.
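A minimal sketch of this fitting step with tf.keras is shown below; the network width, the synthetic training data, and the simplification of feeding only the state to the MLP are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

# Sketch of fitting the MLP \bar{a}_phi that approximates the accumulated CBF
# corrections, as in [23].  Shapes and training data are placeholders.
state_dim, action_dim = 2, 1

abar_mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(state_dim,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(action_dim),
])
abar_mlp.compile(optimizer="adam", loss="mse")

# states visited in past episodes and the summed CBF corrections applied there
states = np.random.uniform(-1.0, 1.0, size=(256, state_dim)).astype(np.float32)
summed_cbf_actions = np.random.uniform(-1.0, 1.0, size=(256, action_dim)).astype(np.float32)

abar_mlp.fit(states, summed_cbf_actions, epochs=10, batch_size=64, verbose=0)
abar_at_s = abar_mlp(states[:1])   # \bar{a}_{phi_k}(s_t) used in (17)
```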

The third term $a^{CBF}_{t}$ in (17) is a real-time value based on the deterministic values $s_{t}$, $a^{RL}_{\theta_{k}}(s_{t})$ and $\bar{a}_{\phi_{k}}(s_{t})$. Although there is an unknown term $d$ in the system (1), we can still enforce dynamical safety on the approximated polynomial system (8). Hence, the optimal $a_{t}$ in (17) can be computed by the SOS program below with deterministic $a^{RL}_{\theta_{k}}$ and $\bar{a}_{\phi_{k}}$:

$a^{*} = \arg\min \|a_{k}\|_{2}$ over $s_{t}\in\mathcal{S}$, $a_{k}\in\mathcal{A}$, $L(s_{t}),M_{i}(s_{t})\in\mathcal{P}^{\text{SOS}}$,  (18)
s.t. $\Delta h(s_{t},a_{k})-L(s_{t})h(s_{t})\in\mathcal{P}^{\text{SOS}}$,
$-\Delta h(s_{t},a_{k})-M_{i}(s_{t})\mu_{i}(s_{t})\in\mathcal{P}^{\text{SOS}}$.

Thus, we can establish a controller satisfying the solution of (18) throughout the learning process.

Theorem 1.

Given the barrier function $h$ in (9), the partially unknown dynamical system (1) and the corresponding approximated system (8), where (8) is a probabilistic statement of (1) within the range $[(1-\delta)^{k},1]$, suppose there exists an action $a_{t}$ satisfying (18) of the following form:

$a_{t}(s_{t})= a^{RL}_{\theta_{k}}(s_{t},a_{t},s_{t+1},r_{t})+\bar{a}_{\phi_{k}}\big(s_{t},\sum_{i=0}^{k-1}(s_{t},a^{CBF}_{i})\big)+a^{CBF}_{t}(s_{t},a^{RL}_{\theta_{k}},\bar{a}_{\phi_{k}})$.  (19)

Then the controller (19) keeps the system (1) inside the safe region with probability in the range $[(1-\delta)^{k},1]$.

Proof.

For the system (8), this result follows directly from Lemma 3 by solving the SOS program (18). The only difference between the SOS program (18) and the SOS program (12) in Lemma 3 is that the action $a_{k}$ here contains the additional deterministic values $a_{\theta_{k}}^{RL}(s_{t})$ and $\bar{a}_{\phi_{k}}(s_{t})$.

The output of (18) regulates the dynamics of the model (8) within the safe region, and the approximated model (8) is a probabilistic statement of the original system (1) with probability greater than or equal to $(1-\delta)^{k}$. Therefore, the solution of (18) can be regarded as a safe control that keeps the system (1) away from danger, which ends the proof. ∎

We now give an overview of how the aforementioned components are combined over the learning process. The detailed workflow is outlined in Algorithm 1.

Input: Original system (1); barrier function $h$.
Output: RL optimal policy $\pi$.
1: Preprocess (1) into (8).
2: Create the tuple $\hat{D}$ to store $\{s_{t},a_{t},s_{t+1},r_{t}\}$.
3: for $k\in\{1,2,\dots,K\}$ do
4:    Execute the SOSP (18) to obtain the real-time CBF controller $a^{CBF}_{k}$ and store it to train the previous-CBF controller $\bar{a}_{\phi_{k}}$.
5:    Actuate the agent with $a_{t}$ in (19).
6:    Construct $\hat{D}$ and update $\theta_{k}$ and $\phi_{k}$ directly.
7: end for
8: return $\pi$.
Algorithm 1 DDPG-SOSP
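The skeleton below mirrors Algorithm 1 at a high level. The environment, DDPG agent, $\bar{a}_{\phi}$ MLP, and SOS routine are hypothetical placeholders for the components described above; only the composition of the action (17)/(19) and the bookkeeping of $\hat{D}$ are spelled out.

```python
# High-level sketch of Algorithm 1 (DDPG-SOSP).  env, agent, abar_mlp and solve_sosp are
# hypothetical objects standing in for the pieces described in Sections II-III.

def run_ddpg_sosp(env, agent, abar_mlp, solve_sosp, barrier_h,
                  n_episodes=150, n_steps=200):
    replay, cbf_history = [], []                       # \hat{D} and the stored a^CBF data
    for episode in range(n_episodes):
        s = env.reset()
        for t in range(n_steps):
            a_rl = agent.act(s)                                  # a^RL_{theta_k}(s_t)
            a_bar = abar_mlp.predict(s)                          # \bar{a}_{phi_k}(s_t)
            a_cbf = solve_sosp(barrier_h, s, a_rl, a_bar)        # SOS program (18)
            a = a_rl + a_bar + a_cbf                             # composite action (17)/(19)
            s_next, r, done = env.step(a)
            replay.append((s, a, s_next, r))                     # tuple stored in \hat{D}
            cbf_history.append((s, a_cbf))
            s = s_next
            if done:
                break
        agent.update(replay)                                     # DDPG update of theta_k
        abar_mlp.update(cbf_history)                             # refit phi_k on past a^CBF
    return agent.policy()
```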

IV Numerical Example

A swing-up pendulum from the OpenAI Gym environment (Pendulum-v1) is used to verify the above results:

$ml^{2}\ddot{\theta}=mgl\sin(\theta)+a$.  (20)

Let $s_{1}=\theta$ and $s_{2}=\dot{\theta}$. The dynamics of (20) are then written as

$\begin{bmatrix}\dot{s}_{1}\\ \dot{s}_{2}\end{bmatrix}=\begin{bmatrix}s_{2}\\ -\frac{g}{l}\sin(s_{1})+\frac{a}{ml^{2}}+d\end{bmatrix}$,  (21)

where $d$ denotes the unknown dynamics generated by the inaccurate parameters $\bar{m}=1.4$, $\bar{l}=1.4$, as introduced in [23]. When constructing the polynomial system (8), the term $\sin(s_{1})$ in (21) is approximated by Chebyshev interpolants on $s_{1}\in[-3,3]$. Then, with the SOSOPT Toolbox [34], the Mosek solver [35] and Tensorflow, we implement Algorithm 1 on the pendulum model.
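The snippet below reproduces this approximation step with numpy's Chebyshev utilities: it interpolates $\sin(s_{1})$ at Chebyshev nodes mapped to $[-3,3]$ and reports the interpolation error. The degree is an assumption; the paper does not state the order actually used.

```python
import numpy as np

deg = 7                                                              # assumed degree of P_k
nodes = np.cos((2 * np.arange(deg + 1) + 1) * np.pi / (2 * (deg + 1)))  # Chebyshev nodes on [-1, 1]
s_nodes = 3.0 * nodes                                                # mapped to [-3, 3]

# Fit a Chebyshev series to sin(s1) sampled at the mapped nodes
coeffs = np.polynomial.chebyshev.chebfit(nodes, np.sin(s_nodes), deg)

s_test = np.linspace(-3.0, 3.0, 1001)
approx = np.polynomial.chebyshev.chebval(s_test / 3.0, coeffs)
print("max |sin(s1) - P_k(s1)| on [-3, 3]:", np.max(np.abs(np.sin(s_test) - approx)))
```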

The related model parameters are given in Table 1.

TABLE 1: Swing-up Pendulum Model Parameters

Model Parameter | Symbol | Value | Units
Pendulum mass | $m$ | 1.0 | kg
Gravity | $g$ | 10.0 | m/s$^2$
Pendulum length | $l$ | 1.0 | m
Input torque | $a$ | $[-15.0, 15.0]$ | N
Pendulum angle | $\theta$ | $[-1.00, 1.00]$ | rad
Angular velocity | $\dot{\theta}$ | $[-60.0, 60.0]$ | rad/s
Angular acceleration | $\ddot{\theta}$ | -- | rad/s$^2$

We want the pendulum angle $\theta$ to remain in the safe range $[-1.0,1.0]$ throughout the learning process. Three RL algorithms are compared in this example: the DDPG-ONLY algorithm [5], the CBF-based DDPG-QP algorithm [23], and the CBF-based DDPG-SOSP algorithm (our work). The code for CBF-based DDPG-SOSP is available at https://github.com/Wogwan/CCTA2022_SafeRL.

All algorithms are trained for 150 episodes of 200 steps each, with each step lasting about 0.05 s. The reward function $r=\theta^{2}+0.1\dot{\theta}^{2}+0.001a^{2}$ is defined so that the RL agent is expected to keep the pendulum upright with minimal $\theta$, $\dot{\theta}$ and $\ddot{\theta}$.
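For reference, a minimal Euler-discretized version of (20)-(21) and this reward (treated as a cost, so lower is better) can be sketched as follows; the step size matches the roughly 0.05 s step quoted above, and the unknown term $d$ arising from the mismatched parameters $\bar{m}=\bar{l}=1.4$ is omitted.

```python
import numpy as np

m, l, g, dt = 1.0, 1.0, 10.0, 0.05   # Table 1 parameters and the assumed step size

def pendulum_step(s, a):
    """One Euler step of (21) without the unknown term d."""
    theta, theta_dot = s
    theta_ddot = -(g / l) * np.sin(theta) + a / (m * l**2)
    return np.array([theta + dt * theta_dot, theta_dot + dt * theta_ddot])

def reward(s, a):
    """r = theta^2 + 0.1*theta_dot^2 + 0.001*a^2, used as a cost (smaller is better)."""
    theta, theta_dot = s
    return theta**2 + 0.1 * theta_dot**2 + 0.001 * a**2

s = np.array([0.1, 0.0])
s = pendulum_step(s, a=2.0)
print(s, reward(s, 2.0))
```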

Figure 2: Comparison of the maximum absolute $\theta$ of different algorithms. The dotted solid line, the dashed line, and the solid line denote the performance of DDPG-ONLY, DDPG-QP and DDPG-SOSP, respectively. The straight line denotes the safe boundary $|s_{1}|=1$.
Figure 3: Comparison of the maximum absolute $\dot{\theta}$ of different algorithms.
Figure 4: Comparison of the accumulated reward of different algorithms.
Figure 5: Comparison of the $1^{st}$ and the $150^{th}$ policy performance on the angle control task for (a) DDPG-QP and (b) DDPG-SOSP.

Fig. 2 compares the maximum $|\theta|$ recorded in each episode under the different algorithms. As a baseline, DDPG-ONLY explores states without any safety consideration, DDPG-QP occasionally violates the safe set, and DDPG-SOSP remains entirely within the safe region.

The historical maximum value of $|\dot{\theta}|$ for these algorithms is compared in Fig. 3. DDPG-SOSP maintains the pendulum at a lower $\dot{\theta}$ more easily than the others across episodes.

In Fig. 4, the reward curves show the efficiency of the different algorithms: DDPG-SOSP attains a comparatively lower average reward (i.e., cost) and thus better performance.

Fig. 5 shows the control performance under the first (2nd episode) and the final (150th episode) policies of DDPG-QP and DDPG-SOSP. We plot 50 steps of the pendulum angle to highlight the differences between the two algorithms. In both Fig. 5(a) and (b), the final policy reaches a smoother control output in less time. DDPG-SOSP stabilizes the pendulum with a quicker and smoother action than DDPG-QP, although its computation time increases due to the dynamics regression and the optimization under uncertainty. Overall, DDPG-SOSP not only stabilizes the model (20) more reliably than DDPG-QP but also strictly maintains the safety of the RL agent's actions.

V Conclusion

In this paper, we have presented a novel approach to guarantee the safety of reinforcement learning (RL) using sum-of-squares programming (SOSP). The objective is a safe RL method for a partially unknown nonlinear system such that the RL agent's actions are always safe. A typical RL algorithm, the Deep Deterministic Policy Gradient (DDPG) algorithm, is combined with the control barrier function (CBF), which is widely used to guarantee dynamical safety in control theory. We leverage SOSP to compute an optimal CBF-based controller and propose an algorithm that embeds this computation into DDPG. Our numerical example shows that the SOSP-based approach achieves better control performance on the inverted pendulum model than the quadratic-programming-based approach, and that the relaxations of SOSP allow the agent to generate safe actions more permissively.

For future work, we will investigate how to incorporate SOSP with other types of RL, including value-based and policy-based RL. Meanwhile, we will examine the possibility of solving SOSP-based RL in higher-dimensional cases.

References

  • [1] R. S. Sutton, A. G. Barto, ”Reinforcement learning: An introduction,” MIT press, 2018.
  • [2] Y. Duan, X. Chen, R. Houthooft, J. Schulman, P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in Proc. Int. Conf. on Machine Learning, pp. 1329–1338, PMLR, 2016.
  • [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
  • [5] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, “Continuous control with deep reinforcement learning,” ArXiv:1509.02971, 2015.
  • [6] Li, Z., Cheng, X., Peng, X. B., Abbeel, P., Levine, S., Berseth, G., Sreenath, K., ”Reinforcement Learning for Robust Parameterized Locomotion Control of Bipedal Robots,” in Proc. Int. Conf. on Robotics and Automation, pp. 2811–2817, 2021.
  • [7] J. Peters, S. Schaal, “Reinforcement learning of motor skills with policy gradients,” Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.
  • [8] R. S. Sutton, et al. ”Policy gradient methods for reinforcement learning with function approximation,” Advances in Neural Information Processing Systems, 1999.
  • [9] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller, “Deterministic policy gradient algorithms,” in Proc. Int. Conf. on Machine Learning, pp. 387–395, PMLR, 2014.
  • [10] J. Kober, J. A. Bagnell, J. Peters, “Reinforcement learning in robotics: A survey,” Int. Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
  • [11] P. Kormushev, S. Calinon, D. G. Caldwell, “Reinforcement learning in robotics: Applications and real-world challenges,” Robotics, vol. 2, no. 3, pp. 122–148, 2013.
  • [12] J. Levinson, J. Askeland, J. Becker, J. Dolson, D. Held, S. Kammel, J. Z. Kolter, D. Langer, O. Pink, V. Pratt, et al., “Towards fully autonomous driving: Systems and algorithms,” IEEE Intelligent Vehicles Symposium, pp. 163–168, 2011.
  • [13] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, P. Pérez, “Deep reinforcement learning for autonomous driving: A survey,” IEEE Trans. on Intelligent Transportation Systems, 2021.
  • [14] L. Wang, D. Han, M. Egerstedt, “Permissive barrier certificates for safe stabilization using sum-of-squares,” in Proc. Amer. Control Conf., pp. 585–590, 2018.
  • [15] D. Han, D. Panagou, “Robust multitask formation control via parametric Lyapunov-like barrier functions,” IEEE Trans. on Automatic Control, vol. 64, no. 11, pp. 4439–4453, 2019.
  • [16] G. Chesi, ”Domain of attraction: analysis and control via SOS programming”, Springer Science & Business Media, vol. 415, 2011.
  • [17] D. Han, G. Chesi, Y. S. Hung, “Robust consensus for a class of uncertain multi-agent dynamical systems,” IEEE Trans. on Industrial Informatics, vol. 9, no. 1, pp. 306–312, 2012.
  • [18] D. Han, G. Chesi, “Robust synchronization via homogeneous parameter-dependent polynomial contraction matrix,” IEEE Trans. on Circuits and Systems I: Regular Papers, vol. 61, no. 10, pp. 2931–2940, 2014.
  • [19] D. Han, M. Althoff, “On estimating the robust domain of attraction for uncertain non-polynomial systems: An LMI approach,” in Proc. Conf. on Decision and Control, pp. 2176–2183, 2016.
  • [20] B. Felix, M. Turchetta, A. Schoellig, A. Krause, ”A Safe model-based reinforcement learning with stability guarantees,” Advances in Neural Information Processing Systems, pp. 909–919, 2018.
  • [21] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, U. Topcu, “Safe reinforcement learning via shielding,” in AAAI Conf. on Artificial Intelligence, vol. 32, no. 1, 2018.
  • [22] C. Steven, N. Jansen, S. Junges, U. Topcu, ”Safe Reinforcement Learning via Shielding for POMDPs,” ArXiv:2204.00755, 2022.
  • [23] R. Cheng, G. Orosz, R. M. Murray, J. W. Burdick, “End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks,” in AAAI Conf. on Artificial Intelligence, vol. 33, no. 1, 2019.
  • [24] R. Cheng, A. Verma, G. Orosz, S. Chaudhuri, Y. Yue, J. W. Burdick, ”Control Regularization for Reduced Variance Reinforcement Learning,” in Proc. Int. Conf. on Machine Learning. pp. 1141–1150, 2019.
  • [25] S. C. Hsu, X. Xu, A. D. Ames, “Control barrier function based quadratic programs with application to bipedal robotic walking,” in Proc. Amer. Control Conf., pp. 4542–4548, 2015.
  • [26] T. Akshay, Z. Jun, K. Sreenath. ”Safety-Critical Control and Planning for Obstacle Avoidance between Polytopes with Control Barrier Functions,” in Proc. Int. Conf. on Robotics And Automation, accepted, 2022.
  • [27] L. Wang, E. A. Theodorou, M. Egerstedt, “Safe learning of quadrotor dynamics using barrier certificates,” in Proc. Int. Conf. on Robotics And Automation, pp. 2460–2465, 2018.
  • [28] P. Martin, ”Markov decision processes: discrete stochastic dynamic programming”, John Wiley & Sons, 2014.
  • [29] A. Lederer, J. Umlauft, S. Hirche, “Uniform error bounds for Gaussian process regression with application to safe control,” Advances in Neural Information Processing Systems, pp. 659–669, 2019.
  • [30] H. Huang, D. Han, ”On Estimating the Probabilistic Region of Attraction for Partially Unknown Nonlinear Systems: An Sum-of-Squares Approach,” in Proc. Chinese Control and Decision Conf., accepted, 2022.
  • [31] D. Han, H. Huang, ”Sum-of-Squares Program and Safe Learning on Maximizing the Region of Attraction of Partially Unknown Systems,” in Proc. Asian Control Conf., 2022.
  • [32] L. N. Trefethen, ”Approximation Theory and Approximation Practice,” SIAM, 2019.
  • [33] M. Putinar, “Positive Polynomials on Compact Semi-algebraic Sets,” Indiana University Mathematics Journal, vol. 42, no. 3, pp. 969–984, 1993.
  • [34] P. Seiler, ”SOSOPT: A toolbox for polynomial optimization,” ArXiv:1308.1889, 2013.
  • [35] M. ApS, ”Mosek optimization toolbox for Matlab,” User’s Guide And Reference Manual, Ver. 4, 2019.