
A Satisficing Control Design Framework with Safety and Performance Guarantees for Constrained Systems under Disturbances

Yuzhen Han Student Member, IEEE and Hamidreza Modares Senior Member, IEEE Yuzhen Han and Hamidreza Modares are with the Department of Mechanical Engineering, Michigan State University, East Lansing, MI, 48863, USA (e-mails:[email protected]; [email protected]).
Abstract

This paper presents a safe robust policy iteration (SR-PI) algorithm to design controllers with satisficing (good enough) performance and safety guarantees. This is in contrast to standard PI-based control design methods, which provide no safety certification, and to existing safe control design approaches, which perform pointwise optimization and are thus myopic. Safety assurance requires satisfying a control barrier function (CBF), which might conflict with the performance-driven Lyapunov solution to the Bellman equation arising at each iteration of the PI. Therefore, a new development is required to robustly certify the safety of an improved policy at each iteration of the PI. The proposed SR-PI algorithm unifies a performance guarantee (provided by a Bellman inequality) with a safety guarantee (provided by a robust CBF) at each iteration. The Bellman inequality resembles the satisficing decision-making framework and parameterizes the sacrifice on performance with an aspiration level when there is a conflict with safety. This aspiration level is optimized at each iteration to minimize the sacrifice on performance. It is shown that the satisficing control policies obtained at each iteration of the SR-PI guarantee robust safety and performance. Robust stability is also guaranteed when there is no conflict with safety. A sum of squares (SOS) program is employed to implement the proposed SR-PI algorithm iteratively. Finally, numerical simulations are carried out to illustrate the proposed satisficing control framework.

Index Terms:
constrained continuous-time system, policy iteration, robust control, satisficing control, safe control

I Introduction

Successful deployment of the next-generation safety-critical systems (e.g., self-driving cars or assistive robots) requires certifying their safety despite uncertainties [1, 2]. Therefore, there is an urgent need for developing robust safe controllers that respect the system's constraints at all times. Safety, however, is the bare minimum requirement for any safety-critical system, and it is desired to design safe controllers that achieve as much performance as possible. To guarantee performance, one can solve an optimal control problem in which a pre-defined cost function that encodes desired system specifications is optimized [3, 4]. Despite the importance of designing safe optimal controllers, safe control design and optimal control design are typically treated separately in the literature. More specifically, while an optimal controller is generally found by solving the so-called Hamilton-Jacobi-Bellman (HJB) equation [5, 6], existing iterative solutions to the HJB equation mainly ignore safety constraints. On the other hand, to satisfy the safety requirements of dynamical systems, safety verification using control barrier functions (CBFs) [7, 8, 9] and reachability analysis [10, 11, 12] has been widely and successfully used. While these frameworks can effectively guarantee the forward invariance of a given safe set and asymptotic convergence of the system's trajectories to a target set, the long-term optimality of the solution is not considered. This can lead to conservative solutions that unnecessarily consume a significant amount of resources or result in poor performance.

To take safety and optimality into account simultaneously, in the reference governor approach [13, 14] a safe controller intervenes with a nominal performance-driven controller to avoid constraint violations when there is a risk of safety violation. However, the reference governor might keep intervening with the nominal controller, making the controller myopic and possibly far from optimal. This is because the nominal controller does not take the safety constraints into account in the design phase to proactively avoid them. Model predictive control (MPC) [15, 16] is another candidate control strategy for addressing safe optimal control design. MPC takes the system constraints into account while optimizing a short-horizon performance. However, despite its tremendous success, MPC needs to solve an optimization problem at every time step, which might not be computationally tractable for nonlinear systems. Moreover, because of its short-sighted nature, guaranteeing feasibility and stability is also hard.

Satisficing decision theory [17] has been widely employed in economic optimization problems to find good enough solutions that are not necessarily optimal. This is motivated by the fact that finding optimal solutions for systems under limited resources and incomplete information might not be feasible. Satisficing stabilizing control design has also been considered in the control community [18, 19]. These approaches, however, are typically plagued by the lack of a safety guarantee. In this paper, we propose a new safe and satisficing control approach in which a satisficing framework is leveraged to sacrifice performance in favor of safety while minimizing the sacrifice level on the performance as much as possible. Starting from a safe control policy, an iterative safe robust policy iteration (SR-PI) algorithm is then proposed to find improved satisficing controllers that certify robust safety against matched disturbances. More specifically, the policy evaluation step finds the value function for the current satisficing safe control policy, and the policy improvement step finds an improved satisficing control policy with guaranteed input-to-state safety (ISSf) [20]. The robust stability of satisficing policies is also guaranteed when there is no conflict between safety and stability. A sum of squares (SOS) program [21] is employed to implement the presented PI algorithm. Fig. 1 shows the schematic of the proposed SR-PI and its comparison with the standard PI algorithm. It should be noticed that only matched disturbances are investigated in this work; the approach could be generalized to the unmatched disturbance case, which has also attracted wide attention [22, 23, 24].

The rest of the paper is organized as follows. Section 2 presents some preliminaries that are used throughout the paper. A satisficing safe control design framework is developed in Section 3. Section 4 presents the simulation results. The paper is concluded in Section 5.

Notations: Throughout the paper, the set of continuously differentiable functions is represented by $C^{1}$, and the set of positive definite and proper functions in $C^{1}$ is denoted by $P$. A polynomial $p(x)$ is a sum of squares (SOS) polynomial, i.e., $p(x)\in\mathcal{P}^{SOS}$ where $\mathcal{P}^{SOS}$ is the set of SOS polynomials, if $p(x)=\sum_{i=1}^{m}p_{i}^{2}(x)$ for some polynomials $p_{i}(x)$, $i=1,\dots,m$. $\mathbb{R}[x]_{d_{1},d_{2}}$ denotes the set of polynomials in $x\in\mathbb{R}^{n}$ with degree at least $d_{1}$ and at most $d_{2}$. $\mathcal{X}\subset\mathbb{R}^{n}$ is the state space, which is a compact set. A continuous function $K:[0,a)\to[0,\infty)$ is a class $\kappa$ function, denoted by $K\in\kappa$, if it is strictly increasing and $K(0)=0$. A function $\alpha(s,t)$ is a class $\kappa\mathcal{L}$ function if for each fixed $t\geq 0$ the function $\alpha(\cdot,t)$ is a $\kappa$ function and for each fixed $s\geq 0$ it decreases to zero as $t\to\infty$. We also denote by $\kappa\kappa$ all functions $\gamma$ such that $\gamma(\cdot,t)\in\kappa$ for each fixed $t\geq 0$ and, similarly, $\gamma(s,\cdot)\in\kappa$ for each fixed $s\geq 0$. $\nabla f(x)$ is the gradient of the function $f$, i.e., $\nabla f(x)=[\frac{\partial f(x)}{\partial x_{1}},\frac{\partial f(x)}{\partial x_{2}},\dots,\frac{\partial f(x)}{\partial x_{n}}]^{T}$. $diag(x_{1},\dots,x_{n})$ denotes a square diagonal matrix with elements $x_{1},\dots,x_{n}$ on the main diagonal. $||x||$ indicates the Euclidean norm $\sqrt{x^{T}x}$ of a real vector $x\in\mathbb{R}^{n}$. For any set $S$, $Int(S)$ and $\partial S$ denote the interior and boundary of the set $S$, respectively; $\overline{Int(S)}$ is the closure of $Int(S)$. $||\xi||_{U}=\min_{a\in U}||\xi-a||$, where $||\cdot||$ is the Euclidean norm. For a given signal $x:\mathbb{R}\to\mathbb{R}^{n}$, its $L^{P}$ norm on the interval $T$ is given by $||x||_{L^{P}(T)}=(\int_{T}||x(t)||^{P}dt)^{1/P}$ and, similarly, its $L^{\infty}$ norm is defined by $||x||_{L^{\infty}(T)}=\operatorname{ess\,sup}_{t\in T}||x(t)||$. For the sake of conciseness, for $T=[0,\infty)$, we denote the $L^{\infty}$ norm of $x$ simply by $||x||_{L^{\infty}}$. For two vectors $x$ and $y$, $x\succeq y$ iff $x_{i}\geq y_{i}$ holds for all elements $x_{i}$ and $y_{i}$ of $x$ and $y$. $f_{max}$ indicates the maximum of the function $f$ over a set of interest.

Figure 1: Comparison between the proposed SR-PI algorithm and existing PI algorithms. (a) Standard PI without safety verification; (b) The proposed SR-PI with safety verification at each iteration to find an improved safe policy.

II Preliminaries

Consider the following continuous-time nonlinear system

\dot{x}=f(x)+g(x)(u+d)=f(x)+g(x)u+\omega   (1)

where $x\in\mathcal{X}$ is the vector of system states, $u\in\mathbb{R}^{m}$ is the vector of control inputs, $\omega$ is the disturbance on the control input, and $\omega=g(x)d$. The nonlinear functions $f:\mathbb{R}^{n}\to\mathbb{R}^{n}$ and $g:\mathbb{R}^{n}\to\mathbb{R}^{n\times m}$ are assumed to be locally Lipschitz continuous with $f(0)=0$.

Assumption 1. The system (1) is stabilizable on the set $\mathcal{X}$.

Assumption 2. The disturbance $d$ is bounded. That is, there exists a constant $d_{\max}$ such that

||d(t)||\leq d_{\max}   (2)

II-A Optimal Control Design Framework

In this section, we present a robust optimal control design framework for the system (1). To find an optimal controller, one can optimize a pre-defined performance index that encodes the designer's intention in achieving the system's specifications. For the case where $d=0$, the following infinite-horizon performance index is usually considered for the system (1):

J(x,u)=\int_{t}^{\infty}r(x,u)\,d\tau   (3)

where

r(x,u)=q(x)+u^{T}Ru   (4)

is the reward function, with $q(x)\in\mathbb{R}$ a positive definite function and $R\in\mathbb{R}^{m\times m}$ a symmetric positive definite matrix. The existence of a stabilizing optimal controller is guaranteed under some mild assumptions on the system dynamics and the performance index [25, 26]. The optimal control found by optimizing (3), however, does not guarantee robust stability of the system due to the disturbance. To guarantee robust stability of the system (1) for the case when $d\neq 0$, the performance index (3) can be modified as follows [27]:

\overline{J}(x,u)=\int_{t}^{\infty}\overline{r}(x,u)\,d\tau   (5)

with the modified reward function

\overline{r}(x,u)=q(x)+u^{T}Ru+\beta(x)   (6)

where $\beta(x)$ is an extra term added to guarantee robust stability despite the disturbance.

Remark 1. Note that several modified performance or reward functions have been presented in the presence of disturbances. For example, $H_{\infty}$ control defines the extra term as $\beta(x)=-\gamma^{2}d(x)^{T}d(x)$. On the other hand, in [29], the extra term is defined as $\beta(x)=\frac{1}{4}\nabla V^{T}\nabla V+d_{\max}^{2}$, where $V(x)$ is the value function corresponding to the control policy $u$, and it is shown that the optimal controller found by minimizing the modified cost guarantees robust stability and provides an upper bound for the original performance. However, since this extra term $\beta(x)$ depends quadratically on the gradient of the value function, solving the modified optimal control problem becomes computationally expensive when SOS is used to implement it. In the following, a modified performance index is presented to avoid this issue. As will be shown later, instead of solving one huge SOS optimization (because of the cross term $\nabla V^{T}\nabla V$), two SOS optimizations with much less complexity will be solved.

Theorem 1. Consider the system (1) with the performance function (5) and (6), and let

\beta(x)=\beta_{u}(x)+d_{\max}^{T}Rd_{\max}   (7)

with $\beta_{u}(x)\geq u_{\max}^{T}Ru_{\max}$. Then, the optimal control solution is

u^{o}(x)=-\frac{1}{2}R^{-1}((\nabla V^{o})^{T}(x)g(x))^{T}   (8)

where $V^{o}$ is the solution to the HJB equation given by

\overline{H}(V^{o})=0   (9)

with

\overline{H}(V)=q(x)+\nabla V^{T}(x)f(x)-\frac{1}{4}\nabla V^{T}(x)g(x)R^{-1}(\nabla V^{T}(x)g(x))^{T}+d_{\max}^{T}Rd_{\max}+\beta_{u}(x)   (12)

That is,

V^{o}(x_{0})=\underset{u}{\min}\,\overline{J}(x_{0},u)=\overline{J}(x_{0},u^{o})   (13)

Moreover, the optimal controller is unique, guarantees robust stability, and provides a suboptimal performance (i.e., an upper bound) for the original cost function (3) and (4).

Proof. The fact that (8) with (12) gives the optimal control policy can be shown as in [25], since only the cost function is modified and the derivation of the HJB equation does not change. We now show that the optimal controller guarantees robust stability and provides an upper bound for the original cost. First, we show that (8) is a solution to the robust control problem; that is, the system (1) is globally asymptotically stable under $u^{o}(x)$. To do this, we show that $V^{o}(x)$ is a Lyapunov function. Clearly,

\left\{\begin{array}{l}V^{o}(x)>0,\quad x\neq 0\\ V^{o}(x)=0,\quad x=0\end{array}\right.   (14)

Also, $\dot{V}^{o}(x)=dV^{o}(x)/dt<0$ for $x\neq 0$, because

\begin{aligned}\dot{V}^{o}(x)&=(dV^{o}(x)/dx)^{T}(dx/dt)\\ &=(\nabla V^{o})^{T}(x)(f(x)+g(x)u^{o}+\omega)\\ &=(\nabla V^{o})^{T}(x)(f(x)+g(x)u^{o})+(\nabla V^{o})^{T}(x)g(x)d\\ &=-\beta_{u}(x)-d_{\max}^{T}Rd_{\max}-q(x)-u^{o^{T}}Ru^{o}+(\nabla V^{o})^{T}(x)g(x)d\\ &=-\beta_{u}(x)-d_{\max}^{T}Rd_{\max}-q(x)-u^{o^{T}}Ru^{o}-2u^{o^{T}}Rd\\ &=-\beta_{u}(x)-d_{\max}^{T}Rd_{\max}-q(x)-(d+u^{o}(x))^{T}R(d+u^{o}(x))+d^{T}Rd\\ &\leq-\beta_{u}(x)-(d_{\max}^{T}Rd_{\max}-d^{T}Rd)-q(x)\\ &\leq-\beta_{u}(x)-q(x)<0.\end{aligned}   (25)

Therefore, the conditions of local Lyapunov stability theory are satisfied. Consequently, there exists a constant $c>0$ and a neighborhood $\mathcal{N}=\{x:\beta_{u}(x)+q(x)<c\}$ such that if $x(t)$ enters $\mathcal{N}$, then $\lim_{t\to\infty}x(t)=0$. However, $x(t)$ cannot remain outside $\mathcal{N}$ forever; otherwise, $\beta_{u}(x(t))+q(x(t))\geq c$ for all $t\geq 0$. Therefore,

\begin{aligned}V(x(t))-V(x(0))&=\int_{0}^{t}\dot{V}(x(\tau))d\tau\\ &\leq-\int_{0}^{t}(\beta_{u}(x)+q(x))d\tau\\ &\leq-\int_{0}^{t}c\,d\tau=-ct\end{aligned}   (30)

Letting $t\to+\infty$, we have

V(x(t))\leq V(x(0))-ct\to-\infty   (32)

which contradicts the fact that $V(x(t))\geq 0$ for all $x(t)$. Therefore, $\lim_{t\to\infty}x(t)=0$ no matter where the trajectory begins, and the optimal controller guarantees robust stability. From (25), the following holds:

\begin{aligned}\dot{V}(x)&\leq-\beta_{u}(x)-q(x)\\ &\leq-u_{\max}^{T}Ru_{\max}-q(x)\\ &\leq-u^{T}Ru-q(x)\end{aligned}   (37)

Integrating both sides of (37) on the time interval $[0,t]$ yields

V(x(0))-V(x(t))\geq\int_{0}^{t}[u^{T}Ru+q(x)]d\tau   (39)

Letting $t\to\infty$, then

V(x(0))\geq\underset{t\to\infty}{\lim}\int_{0}^{t}[u^{T}Ru+q(x)]d\tau=\int_{0}^{\infty}[u^{T}Ru+q(x)]d\tau=J(x,u)   (41)

which implies that the optimal controller provides a suboptimal performance (i.e., an upper bound) for the original cost function (3) and (4). Next, we show the uniqueness of the solution to (9). Given $u$, assume there is another Lyapunov function $V^{a}$ that satisfies the HJB equation

\overline{H}(V^{a})=0,\quad x\in\mathcal{X}.   (42)

Since $\overline{r}(x,u)>0$, we have $\nabla V^{T}(x)(f(x)+g(x)u(x))<0$, $\forall x\in\mathcal{X}\backslash\{0\}$. Subtracting $\overline{H}(V^{o})$ from $\overline{H}(V^{a})$ yields

\overline{H}(V^{a})-\overline{H}(V^{o})=[\nabla V^{a}(x)-\nabla V^{o}(x)]^{T}(f(x)+g(x)u(x))=0.   (43)

Since $f(x)+g(x)u(x)\neq 0$, $\forall x\in\mathcal{X}\backslash\{0\}$, it follows that $V^{a}=V^{o}+\varepsilon$ for some scalar $\varepsilon$. Then $V^{a}(0)=V^{o}(0)=0$ results in $\varepsilon=0$, and therefore $V^{a}(x)=V^{o}(x)$ holds for all $x\in\mathcal{X}$. This contradicts the assumption that another $V^{a}$ satisfies the HJB equation. So, $V^{o}(x)$ is the unique solution of $\overline{H}(V^{o})=0$.

\square
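To make the construction of Theorem 1 concrete, the following numerical sketch simulates the control law (8) on a simple scalar instance of (1). The dynamics, the hand-picked value-function candidate $V(x)=x^{2}$, and the disturbance bound are our own illustrative assumptions; $V$ is not the actual HJB solution $V^{o}$, so the sketch only illustrates the robust decrease of $V$ down to a small residual set around the origin.

```python
import numpy as np

# Scalar example: x_dot = f(x) + g(x)(u + d) with f(x) = -x**3, g(x) = 1.
# V(x) = x**2 is a hand-picked (hypothetical) value-function candidate used
# only to illustrate the control law (8); it is not the HJB solution V^o.
R = 1.0
d_max = 0.2

f = lambda x: -x**3
g = lambda x: 1.0
gradV = lambda x: 2.0 * x                              # dV/dx for V(x) = x**2
u_opt = lambda x: -0.5 * (1.0 / R) * gradV(x) * g(x)   # control law (8)

rng = np.random.default_rng(0)
x, dt = 1.5, 1e-3
V_init = x**2
for k in range(20000):
    d = rng.uniform(-d_max, d_max)                # bounded disturbance, cf. (2)
    x += dt * (f(x) + g(x) * (u_opt(x) + d))      # Euler step of (1)
V_final = x**2
print(f"V(x(0)) = {V_init:.3f}, V(x(T)) = {V_final:.4f}")
# V decays until the trajectory reaches a small residual set around the
# origin whose size shrinks with d_max, consistent with the proof of Theorem 1.
```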

Note that under Assumption 1, there exists a well-defined Lyapunov function $V^{0}\in P$ and a control policy $u^{1}$ such that

L(V^{0},u^{1})=-(\nabla V^{0})^{T}(x)(f(x)+g(x)u^{1})-\overline{r}(x,u^{1})\geq 0   (44)

Remark 2. Note that, in general, the optimal control problem does not necessarily have a smooth value function. However, under some mild assumptions [30], the value function satisfies desirable properties such as continuity and continuous differentiability. In this paper, all derivations are performed under the assumption of the existence of a smooth solution to (9). If the smoothness assumption is relaxed, then one needs to use the theory of viscosity solutions [30] to find a solution. Note, however, that the existence of the disturbance, and thus the addition of the extra term $\beta(x)$ to the HJB equation, might restrict the class of systems or the conditions under which the existence of the solution is guaranteed. This is the case for all other approaches that deal with disturbances. For example, for the disturbance-free case, if the system is stabilizable and $q(x)$ is positive definite, then the existence of a unique solution to the HJB equation is guaranteed. However, if $H_{\infty}$ control is employed for disturbance attenuation, then the solution to its corresponding Hamilton-Jacobi-Isaacs (HJI) equation is not guaranteed under the same conditions, and extra conditions on the performance parameters and the disturbance attenuation level are required to guarantee the existence of a solution.

II-B Safety Assurance Using Control Barrier Certificate

It is of vital importance for a safety-critical system to prevent its state from entering certain unsafe regions. To design a safe controller, the concept of a control barrier function (CBF) can be used. Consider the nonlinear system (1) with $x\in\mathcal{X}$, where $\mathcal{X}$ is the allowable set for the system's states, and let $\mathcal{X}_{0}\subset\mathcal{X}\subset\mathbb{R}^{n}$ and $\mathcal{X}_{u}\subset\mathcal{X}\subset\mathbb{R}^{n}$ be the initial set and the unsafe set, respectively. Let there exist a continuously differentiable function $h\in C^{1}(\mathcal{X}):\mathbb{R}^{n}\to\mathbb{R}$ such that

\begin{aligned}&h(x)\geq 0,\quad\forall x\in\mathcal{X}_{0},\\ &h(x)<0,\quad\forall x\in\mathcal{X}_{u}.\end{aligned}   (48)

Under disturbance, one must guarantee that, regardless of the disturbance value, the system never enters the unsafe set $\mathcal{X}_{u}$. The input-to-state safety and the robust CBF defined below provide the conditions under which the system trajectories never enter an unsafe set despite the disturbances (as inputs to the system).

Definition 1 [31]. Let $\mathcal{X}_{u}$ be the unsafe set. The system (1) is input-to-state safe (ISSf) if there exist $\alpha,\phi\in\kappa\kappa$ and a strictly increasing function $\sigma$ such that

\sigma(||x(t)||_{\mathcal{X}_{u}})\geq\alpha(||x(t)||_{\mathcal{X}_{u}},t)-\phi(||u||_{L^{\infty}},t)   (49)

holds $\forall t$.

Let the safe set $\mathcal{C}$ be defined as

\begin{aligned}&\partial\mathcal{C}=\{x\in\mathcal{X}:h(x)=0\},\\ &\mathcal{C}=\{x\in\mathcal{X}:h(x)\geq 0\},\\ &Int(\mathcal{C})=\{x\in\mathcal{X}:h(x)>0\}.\end{aligned}   (54)

Then, according to [31], in the presence of disturbance $d\neq 0$, $h(x)$ is called a zeroing CBF (ZCBF) if there exists an extended class $\kappa$ function $\varpi\in\kappa$ that satisfies

\underset{u\in U}{\sup}\left\{\nabla h(x)f(x)+\nabla h(x)g(x)u-\nabla h(x)g(x)(\nabla h(x)g(x))^{T}+\varpi(h(x))\right\}\geq 0,\quad x\in\mathcal{X}   (57)

Based on the ZCBF $h(x)$, the safe control space $S(x)$ is defined as

S(x)=\left\{u\in U\,|\,\nabla h(x)f(x)+\nabla h(x)g(x)u-\nabla h(x)g(x)(\nabla h(x)g(x))^{T}+\varpi(h(x))\geq 0\right\},\quad x\in\mathcal{X}.   (60)

The following theorem shows how to design a controller using the ZCBF concept to guarantee that the safe set $\mathcal{C}$ is forward invariant and thus that the system is safe.

Assumption 3 [31]. The admissible control space $S(x)$ is path-connected and nonempty. That is,

Int(S)\neq\emptyset\quad\text{and}\quad\overline{Int(S)}=S.   (61)

Theorem 2 [32]. Consider the set $\mathcal{C}\subset\mathbb{R}^{n}$ defined in (54) and let the ZCBF $h$ satisfy (57). Then, under Assumption 3, any Lipschitz continuous controller $u$ such that $u\in S(x)$ renders the system (1) ISSf and the safe set $\mathcal{C}$ forward invariant. $\square$

Conditions (54) and (57) guarantee that if the system starts from any initial condition $x\in\mathcal{X}_{0}$ within the safe set, its future trajectories will not enter the unsafe region $\mathcal{X}_{u}$ for any disturbance. This is because condition (57) makes the safe set $\mathcal{C}$ robustly invariant.
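For illustration, the following sketch computes a control in the safe control space $S(x)$ of (60) at a single state by projecting a nominal control onto the ZCBF constraint (57). This is precisely the pointwise (myopic) safe-control approach the paper moves away from; it is included only to make membership in $S(x)$ concrete. The dynamics, the barrier $h$, and the class $\kappa$ function $\varpi$ are our own illustrative choices.

```python
import cvxpy as cp
import numpy as np

# Pointwise safety filter: project a nominal control onto the robust safe
# control space S(x) of (60). Illustrative setup (our own choice, not from
# the paper): single integrator x_dot = u (f = 0, g = I), safe set the unit
# disk, h(x) = 1 - ||x||^2, and extended class-kappa function varpi(h) = h.
x = np.array([0.6, 0.0])
u_nom = np.array([1.0, 0.5])         # performance-driven nominal control

grad_h = -2.0 * x                    # gradient of h(x) = 1 - ||x||^2
Lf_h = grad_h @ np.zeros(2)          # grad_h * f(x), zero here since f = 0
hg = grad_h @ np.eye(2)              # grad_h * g(x)
varpi_h = 1.0 - x @ x                # varpi(h(x)) = h(x)

u = cp.Variable(2)
# ZCBF condition (57): Lf_h + hg @ u - hg @ hg^T + varpi(h) >= 0
constraint = [Lf_h + hg @ u - hg @ hg + varpi_h >= 0]
prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_nom)), constraint)
prob.solve()
print("nominal u:", u_nom, " filtered safe u:", u.value)
```

Here the filtered control overrides the outward-pointing nominal control, and the extra term $-\nabla h(x)g(x)(\nabla h(x)g(x))^{T}$ in (57) makes the filter demand inward motion even away from the boundary, which is what provides robustness to the bounded disturbance.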

II-C Satisficing Control Design

The satisficing (good enough) decision-making framework, which originated in economics [17] and was later adopted by the control community [18, 19], defines two utility functions: the selectability $p_{s}(u,x)$ and the rejectability $p_{r}(u,x)$. One then seeks to find a strategy $u$ from a so-called satisficing set defined as

S_{B}(x)=\left\{u\in\mathbb{R}^{m}:p_{s}(u,x)\succeq B(x)p_{r}(u,x)\right\}   (62)

where $B(x)$ is called the aspiration level. Any control strategy that has a larger selectability index than rejectability index belongs to the satisficing set and is considered to have a good enough performance. It was shown in [18] that if $p_{s}(u,x)=-(\nabla V)^{T}(x)(f(x)+g(x)u)$, where $V$ is a control Lyapunov function, and $p_{r}(u,x)=r(x,u)$ with $r(x,u)$ defined in (4), then the satisficing controllers (62) are stabilizing. Moreover, $p_{s}(u,x)$ indicates the stability index of the system and $p_{r}(u,x)$ indicates the cost of implementing the controller. It was also shown in [18] how to parametrize the set of all satisficing controllers. However, it is not clear how to choose one control strategy from the set of satisficing controllers to be applied to the system. We design a policy iteration (PI) algorithm that selects a satisficing strategy that optimizes a performance index.
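As a concrete illustration of (62), the following sketch checks satisficing-set membership for a few candidate controls on a scalar instance of (1), using the stabilizing indices of [18] described above. All problem data (dynamics, the control Lyapunov function $V$, and the aspiration level $b$) are our own illustrative choices.

```python
import numpy as np

# Satisficing-set membership test for (62) with the stabilizing indices of
# [18]: p_s(u,x) = -(dV/dx)(f(x) + g(x)u) and p_r(u,x) = r(x,u) from (4).
# Illustrative scalar data (our own choice): f(x) = -x, g(x) = 1,
# control Lyapunov function V(x) = x^2, q(x) = x^2, R = 1, aspiration b = 0.5.
b = 0.5

def in_satisficing_set(x, u):
    p_s = -(2 * x) * (-x + u)      # selectability: stability index
    p_r = x**2 + u**2              # rejectability: cost of the control
    return p_s >= b * p_r

for u in (-1.0, 0.0, 0.4, 1.0):
    print(f"u = {u:+.1f}: u in S_B(x=0.5)? {in_satisficing_set(0.5, u)}")
```

Controls that stabilize strongly relative to their cost pass the test, while expensive or destabilizing controls are rejected; raising $b$ shrinks the satisficing set.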

III A SATISFICING SAFE CONTROL DESIGN SCHEME

In this section, a novel satisficing safe control framework is designed that guarantees robust safety and a good enough performance. First, a relaxed robust stabilizing control framework is presented in Section III-B, inspired by [33, 34, 35, 36], in which an infinite-dimensional linear program (LP) is derived to find a robust suboptimal controller. The infinite-dimensional LP is then transformed into a sum-of-squares (SOS) program. This framework is then integrated with the CBF in Section III-C to certify robust safety of the resulting controller.

III-A A Satisficing Control Framework for Safe Control with Guaranteed Performance

We now define a satisficing control framework whose selectability index captures robust stability and robust safety of controllers and whose rejectability index captures their cost. Consider the system (1) with the performance function (5) and (6). Let $V$ be a control Lyapunov function and $h(x)$ be a control barrier function. Define

\left\{\begin{array}{l}p_{s}(u,x)=[-(\nabla V)^{T}(x)(f(x)+g(x)u),\\ \qquad\quad\nabla h(x)f(x)+\nabla h(x)g(x)u-\nabla h(x)g(x)(\nabla h(x)g(x))^{T}]^{T}\\ p_{r}(u,x)=[\overline{r}(x,u),\ -\eta\varpi(h(x))]^{T}\end{array}\right.   (63)

where $\overline{r}(x,u)$ is defined in (6), $\eta>0$ is a constant, $p_{s}(u,x)$ is the selectability, associated with the robust stability and robust safety of the system, and $p_{r}(u,x)$ is the rejectability, associated with the controller cost and safety aggressiveness. Define now the satisficing set as (62) where $B(x)=diag(b_{1}(x),b_{2}(x))$, with $b_{1}(x)$ the aspiration level on stability and $b_{2}(x)$ the aspiration level on safety.

Remark 3. The parameter $\eta$ in the rejectability index of (63) reveals how rapidly $\varpi(h(x))$ decays as the system's states move away from the safe boundaries, and thus it is treated as a rejectability index. A larger $\eta$ increases the system's aggressiveness in favor of having more flexibility in the safe set. The aspiration level reflects the designer's expectation on the satisfaction of the controller, and is also related to the solution space in the satisficing set $S_{B}(x)$.

The next theorem shows that, under Assumptions 1-3, there is always a feasible solution to (62), (63) for an appropriate choice of the aspiration level. The feasible solution makes the system both robustly safe and robustly stable if there is no conflict between safety and stability, and sacrifices stability in case of a conflict.

Theorem 3. Let Assumptions 1 and 3 hold. Then, there exists an aspiration level $B(x)$ for which the satisficing set (62), with $p_{s}(u,x)$ and $p_{r}(u,x)$ defined in (63), has a feasible solution.

Proof. By Assumption 3, there is always a safe control policy that satisfies the second inequality in (62), (63). On the other hand, by Assumption 1, there is a stabilizing controller that satisfies the first constraint. If there is no conflict between safety and stability, then there is a controller that satisfies both inequalities in (62), (63), and the satisficing set is nonempty. Assume now that there is a conflict between the two constraints in (62), (63). Let

b_{1}(x)=\frac{\overline{r}(x,u)+\delta}{\overline{r}(x,u)},\qquad b_{2}(x)=\theta   (64)

where $\delta$ is a function of $x$ representing the conflict between the two constraints, and $\theta>0$ is the aspiration on the safety level. Based on (57), the safety constraint is satisfied by choosing any $\theta>0$. On the other hand, based on $b_{1}(x)$ defined in (64), the first constraint in the equality $p_{s}(u,x)=B(x)p_{r}(u,x)$ becomes

-(\nabla V)^{T}(x)(f(x)+g(x)u)-\overline{r}(x,u)=\delta.   (65)

Therefore, any aspiration level of stability $b_{1}^{\prime}(x)$ satisfying $b_{1}^{\prime}(x)\leq b_{1}(x)$ also satisfies the safety constraint as

-(\nabla V)^{T}(x)(f(x)+g(x)u)\geq-\theta\eta\varpi(h(x)).   (66)

This completes the proof. \square

Equation (65) shows that $\delta$ indicates the sacrifice on stability and performance, since (65) resembles the Bellman inequality and will be used later to improve performance. This function will be optimized later to minimize the sacrifice on performance as much as possible. Note that a stabilizing solution to (65) is guaranteed under Assumption 1 when $\delta=0$. When $\delta$ is nonzero, however, the solution to (65) might not be stabilizing, and this can occur only when safety is in conflict with stability. In this paper, instead of parameterizing the set of all satisficing controllers, a PI algorithm is designed to optimize over all satisficing control policies by relating the control Lyapunov function to the Bellman inequality solution, which is a performance-oriented control Lyapunov function and can be iteratively optimized while assuring that every improved policy remains in the satisficing set and thus guarantees stability. To this end, the next subsection shows how to optimize over the set of satisficing control policies by iteratively solving Bellman inequalities while ignoring the safety constraint. Section III-C then combines safety constraints and Bellman inequalities to guarantee safety of the improved policies.

III-B Relaxed Robust Stabilizing Optimal Control Design

In this subsection, a relaxed robust optimal control framework is presented. While safety is ignored here, this framework allows incorporating safety constraints, as shown in the next subsection. Inspired by [33, 34, 35, 36], a finite-dimensional linear program (LP) is first derived. This finite-dimensional optimization problem is then integrated with the CBF to include safety and is solved using SOS in the next subsection.

Problem 1 (Relaxed suboptimal robust stabilizing control design problem)

Consider the system (1) with the performance index (5), (6). Find the value function $V$ by solving

\begin{aligned}&\underset{V}{\min}\int_{\Omega}V(x)dx\\ &\text{s.t.}\quad\overline{H}(V)\leq 0\\ &\qquad\ \ V\in P\end{aligned}   (71)

where $\Omega$ is the area in which the system performance is expected to be improved, and $\overline{H}(V)$ is defined in (12).

This is inspired by [33, 34, 35], in which it is shown that an optimal control problem can be transformed into an infinite-dimensional LP. Instead of searching for the optimal value function over an infinite-dimensional space, in [35] the optimization is performed over a region of interest $\Omega$, which makes the problem finite-dimensional and tractable. In contrast to [35], a modified HJB inequality $\overline{H}(V)\leq 0$ with the modified reward function is used to guarantee robust stability. This framework will allow us to incorporate safety constraints later and find satisficing control solutions that satisfy safety and optimize over stabilizing solutions. The following theorem shows some key properties of Problem 1.

Theorem 4. Consider the system (1) and let Assumptions 1-3 hold. Then,

1) Problem 1 has a feasible value function solution.

2) If $V$ is the unique solution of Problem 1, then the control solution

\overline{u}(x)=-\frac{1}{2}R^{-1}(\nabla V^{T}(x)g(x))^{T}   (72)

is globally robustly stabilizing and belongs to the satisficing set (62) when safety is ignored.

3) The control policy (72) provides an upper bound for the cost (3).

4) The following inequalities hold for $x\in\mathcal{X}$ along the trajectories of the closed-loop system (1) with the controller (72):

V(x_{0})+\int_{0}^{\infty}H(V(x(t)))dt\leq V^{o}(x_{0})\leq V(x_{0})   (73)

5) The value function $V^{o}$ in (9) is a global robust optimal solution to (71).

Proof. See APPENDIX. \square

Policy iteration algorithms can be designed to solve Problem 1. The need for safety assurance, however, makes existing policy iteration algorithms invalid, as they cannot guarantee safety. In the next subsection, to find a robust safe control policy with guaranteed performance, safety constraint satisfaction is incorporated by adding a CBF as another inequality to Problem 1, and a novel PI algorithm is developed to solve it. Its connection to satisficing controllers is also shown.

Remark 4. A feasible solution $V$ to (71) may not be the true cost function associated with $\overline{u}$. However, it is an upper bound on the actual cost.

III-C Robust Safe Satisficing Control with Performance Guarantee: A Novel Framework

While the controller designed by solving Problem 1 guarantees performance and robustness and belongs to the satisficing set that only concerns stability, it cannot assure safety. On the other hand, the control design based on the CBF satisfying (62) guarantees safety but might result in poor performance. To bring the best of both worlds together, in this section we aim to design robust stabilizing safe controllers that provide guaranteed performance within the volume of the certified safe area.

Problem 2 (Satisficing safe control design with performance guarantee)

Consider the system (1) with the performance function (5), (6). Find the value function that solves

\begin{aligned}&\underset{V,\delta}{\min}\int_{\mathcal{L}}V\,dx+k_{\delta}\delta^{2}\\ &\text{s.t.}\quad\overline{H}(V)\leq\delta\\ &\qquad\ \ \nabla h(x)f(x)+\nabla h(x)g(x)u-\nabla h(x)g(x)(\nabla h(x)g(x))^{T}+\varpi(h(x))\geq 0\end{aligned}   (79)

where $\overline{H}(V)$ is defined in (12), $\mathcal{L}$ is the safe region of interest in which the performance is expected to be improved, $k_{\delta}>0$ is a design parameter that trades off the system's aggressiveness toward performance against safety, and $\delta$ is a relaxation factor.

Note that the relaxation factor $\delta$ can be interpreted as the system's aspiration level for the performance, which shows how much performance is sacrificed when safety and performance cannot be satisfied together. This relaxation factor is minimized to retain as much performance as possible.

Lemma 1. The solution to Problem 2 belongs to the satisficing set (62).

Proof. Recalling (64), (65) and (66), one can see that the constraints in (79) can be transformed into (62) by selecting the suitable aspiration levels given in Theorem 3. Therefore, the search for the optimal solution is over the space of the satisficing set (62), and, thus, the solution to Problem 2 belongs to (62). $\square$

Theorem 5. The safe optimization problem (79) has a feasible solution.

Proof. Based on Theorem 3, a robust safe control policy $u$ exists by selecting a suitable aspiration level. Let us write this control policy as $u=u^{*}+u^{safe}$, where $u^{*}=-\frac{1}{2}R^{-1}((\nabla V^{*})^{T}(x)g(x))^{T}$ is the part of the control used to optimize the performance without concern for safety, as given in [37], and $u^{safe}$ is added to $u^{*}$ to guarantee safety. Now reformulate the HJB equation as follows:

\begin{aligned}\overline{H}(V^{*})&=q(x)+(\nabla V^{*})^{T}(x)f(x)-\frac{1}{4}(\nabla V^{*})^{T}(x)g(x)R^{-1}((\nabla V^{*})^{T}(x)g(x))^{T}+d_{\max}^{T}Rd_{\max}+\beta_{u}(x)\\ &=(\nabla V^{*})^{T}(x)(f(x)+g(x)u^{*})+\overline{r}(x,u^{*})\\ &=(\nabla V^{*})^{T}(x)(f(x)+g(x)u)+\overline{r}(x,u)+(u^{*})^{T}Ru^{*}-u^{T}Ru-(\nabla V^{*})^{T}(x)g(x)u^{safe}\\ &=(\nabla V^{*})^{T}(x)(f(x)+g(x)u)+\overline{r}(x,u)+(u^{*})^{T}Ru^{*}-u^{T}Ru+2(u^{*})^{T}Ru^{safe}\\ &=(\nabla V^{*})^{T}(x)(f(x)+g(x)u)+\overline{r}(x,u)+(u^{*})^{T}Ru^{*}-u^{T}Ru+2(u^{*})^{T}R(u-u^{*})\\ &=(\nabla V^{*})^{T}(x)(f(x)+g(x)u)+\overline{r}(x,u)-(u^{*})^{T}Ru^{*}-u^{T}Ru+2(u^{*})^{T}Ru\\ &=(\nabla V^{*})^{T}(x)(f(x)+g(x)u)+\overline{r}(x,u)-(u-u^{*})^{T}R(u-u^{*})\\ &=(\nabla V^{*})^{T}(x)(f(x)+g(x)u)+\overline{r}(x,u)-(u^{safe})^{T}Ru^{safe}\end{aligned}   (95)

where $\overline{r}$ is defined in (6) and (7). While $u^{*}$ is robustly stabilizing, if the robust safe control $u^{safe}$ is in conflict with robust stability, the overall control input $u$ might not be robustly stabilizing, i.e., $(\nabla V^{*})^{T}(f(x)+g(x)u)+\overline{r}(x,u)\leq 0$ might not be satisfied at some points. By choosing an appropriate slack variable $\delta$ to resolve the conflict between safety and stability, however, one has

\overline{H}(V^{*})-\delta=(\nabla V^{*})^{T}(x)(f(x)+g(x)u)+\overline{r}(x,u)-\delta\leq 0   (96)

for some $\delta$. On the other hand, since $u$ is safe, based on the converse CBF theorem [38], there exists a barrier certificate $h(x)$ satisfying

\nabla h(x)f(x)+\nabla h(x)g(x)u-\nabla h(x)g(x)(\nabla h(x)g(x))^{T}+\varpi(h(x))\geq 0.   (99)

\square

Assumption 4 [35]. There exists a smooth mapping $V^{0}:\mathbb{R}^{n}\to\mathbb{R}$ such that $V^{0}\in\mathbb{R}[x]_{2,2r}\cap P$ and $L(V^{0},u^{1})+\delta$ is SOS, where $L(V^{0},u^{1})=-\overline{H}(V^{0})$.

Solving optimization Problem 2 is non-trivial in general. If both the Bellman inequality and the CBF inequality constraints are restricted to SOS constraints, an SOS program can be used to significantly reduce the computational burden of finding a solution to this optimization problem. However, since $\overline{H}(V)$ is quadratic in $\nabla V$, it makes the optimization problem hard or even impossible to solve using SOS. Therefore, we propose a robust safe policy iteration algorithm that iterates on a Bellman inequality, which is linear in $V$, instead of directly solving $\overline{H}(V)\leq\delta$. Using this Bellman inequality, a policy evaluation step finds the value function $V^{i}$ corresponding to a robust safe control policy $u^{i}$, and a policy improvement step finds an improved policy $u^{i+1}$ whose safety is certified by adding the CBF inequality. We assume that an initial robust safe control policy $u^{0}$ is given, which can be found by a control policy that only satisfies the CBF without any concern about optimality.

To evaluate a given policy $u^{i}$, i.e., to find the value function $V^{i}$ corresponding to it, a Bellman inequality based on the modified reward function must be solved. This requires knowing $\beta_{u^{i}}(x)$ in the modified reward function, which in turn requires knowing a bound on ${u^{i}}_{\max}^{T}R\,{u^{i}}_{\max}$, as defined in Theorem 1. Therefore, before the policy evaluation step, we first find this bound and thus $\beta_{u^{i}}(x)$. Since $u^{i}$, and consequently $u^{i^{T}}Ru^{i}$, is polynomial, one can write $u^{i^{T}}Ru^{i}=c_{i}m_{i}^{x}$, where $m_{i}^{x}$ is the vector of monomials and $c_{i}$ is the vector of coefficients. To find ${u^{i}}_{\max}^{T}R\,{u^{i}}_{\max}=\max u^{i^{T}}Ru^{i}$, one can then solve the following optimization problem:

\begin{aligned}&{u^{i}}_{\max}^{T}R\,{u^{i}}_{\max}=\max\ c_{i}m_{i}^{x}\\ &\quad\text{s.t.}\quad x\in\mathcal{X}\end{aligned}   (103)

However, polynomial optimization is NP-hard and, instead, we obtain an upper bound for ${u^{i}}_{\max}^{T}R\,{u^{i}}_{\max}$ by solving

\begin{aligned}&L_{p}=\min\ \gamma\\ &\quad\text{s.t.}\quad\gamma-c_{i}m_{i}^{x}\geq 0\end{aligned}   (107)

which is an SOS optimization and can be efficiently solved. Note that ${u^{i}}_{\max}^{T}R\,{u^{i}}_{\max}\leq L_{p}$, and if we choose $\beta_{u^{i}}(x)=L_{p}$, then $\beta_{u^{i}}(x)\geq(u^{i}_{\max})^{T}R\,u^{i}_{\max}$, which satisfies the condition on the reward function in Theorem 1.
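As an illustration of how the bound (107) can be computed, the following sketch solves a miniature instance with CVXPY by matching coefficients against an explicit Gram matrix. The data ($u^{i}(x)=x-0.5x^{3}$, $R=1$) are our own; since $\gamma-c_{i}m_{i}^{x}\geq 0$ cannot hold globally for a coercive polynomial, the sketch restricts the bound to $\mathcal{X}=[-1,1]$ with an S-procedure multiplier, which we read as the intent of the constraint $x\in\mathcal{X}$ in (103).

```python
import cvxpy as cp
import numpy as np

# SOS upper bound for u(x)^T R u(x) over X = [-1, 1], in the spirit of (107),
# with an explicit S-procedure multiplier for the set constraint x in X.
# Illustrative data (our own choice): scalar u(x) = x - 0.5 x^3, R = 1, so
# p(x) = u(x)^2 = x^2 - x^4 + 0.25 x^6 (coefficients in increasing degree).
p = np.array([0, 0, 1, 0, -1, 0, 0.25])          # degrees 0..6

gamma = cp.Variable()
Q = cp.Variable((4, 4), PSD=True)    # Gram matrix, basis [1, x, x^2, x^3]
S = cp.Variable((3, 3), PSD=True)    # Gram matrix of the multiplier sigma(x)

# Coefficients of sigma(x) (degrees 0..4) from its Gram matrix S.
sigma = [S[0, 0], 2*S[0, 1], 2*S[0, 2] + S[1, 1], 2*S[1, 2], S[2, 2]]
sig = lambda k: sigma[k] if 0 <= k <= 4 else 0
# Coefficients of sigma(x) * (1 - x^2), degrees 0..6.
mult = [sig(k) - sig(k - 2) for k in range(7)]

def sos_coeff(Q, k):  # coefficient of x^k in z^T Q z, z = [1, x, x^2, x^3]
    return sum(Q[i, k - i] for i in range(4) if 0 <= k - i < 4)

# Match coefficients of: gamma - p(x) - sigma(x)(1 - x^2) = z^T Q z
constraints = [sos_coeff(Q, k) == (gamma if k == 0 else 0) - p[k] - mult[k]
               for k in range(7)]
cp.Problem(cp.Minimize(gamma), constraints).solve()
print(f"SOS bound L_p = {gamma.value:.4f}  (true max of p on [-1,1] ~ 0.2963)")
```

Based on this SOS bound, the following policy evaluation step is proposed.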

Safe policy evaluation step: Given a robust safe control policy $u^{i}$, find the bound for $u^{i}_{\max}$ using (107) and then find $V^{i}$ and $\delta_{i}$ that solve the following optimization problem:

\begin{aligned}&\underset{V^{i},\delta_{i}}{\min}\int_{\mathcal{L}}V^{i}dx+k_{\delta}\delta_{i}^{2}\\ &\text{s.t.}\quad L(V^{i},u^{i})=-(\nabla V^{i})^{T}(x)(f(x)+g(x)u^{i})-\overline{r}(x,u^{i})\geq-\delta_{i}\\ &\qquad\ \ V^{i-1}-V^{i}\geq 0\end{aligned}   (112)

In terms of SOS, this optimization problem is transformed into

\begin{aligned}&\underset{V^{i},\delta_{i}}{\min}\int_{\mathcal{L}}V^{i}dx+k_{\delta}\delta_{i}^{2}\\ &\text{s.t.}\quad L(V^{i},u^{i})+\delta_{i}\ \text{is SOS},\quad\forall x\in\mathcal{X}\\ &\qquad\ \ V^{i-1}-V^{i}\ \text{is SOS}\end{aligned}   (117)

where $V=p^{T}\overrightarrow{m}_{2,2r}(x)$ and $V^{i}=p_{i}^{T}\overrightarrow{m}_{2,2r}(x)$.

In the policy evaluation step (112), the value function corresponding to a given policy is found while minimizing the relaxation factor $\delta_{i}$. Note that since a robust safe control policy $u^{i}$ might not necessarily be robust stabilizing, $L(V^{i},u^{i})$ might not be positive semidefinite.
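The following sketch makes the mechanics of (117) concrete on a miniature scalar instance in which the SOS constraints reduce to sign conditions on polynomial coefficients. All data (dynamics, policy, $\beta_{u^{i}}$, $V^{i-1}$, $\mathcal{L}$, $k_{\delta}$) are our own illustrative choices; note how $\delta_{i}$ absorbs the constant robustness margin of $\overline{r}$ at the origin.

```python
import cvxpy as cp

# Miniature instance of the safe policy evaluation SOS program (117).
# Illustrative data (our own choice): f(x) = -x, g(x) = 1, u^i(x) = -0.5 x,
# q(x) = x^2, R = 1, d_max = 0.1, beta = 0.25 + d_max^2 = 0.26 (an upper
# bound of u^i(x)^2 on X = [-1,1] plus the disturbance term from (7)),
# V^{i-1}(x) = 2 x^2, performance region L = [-1, 1], k_delta = 1.
k, q2, R, beta = 0.5, 1.0, 1.0, 0.26
p_prev, k_delta = 2.0, 1.0

p = cp.Variable()        # V^i(x) = p x^2
delta = cp.Variable()    # relaxation factor delta_i

# L(V^i, u^i) + delta = (2p(1+k) - q2 - R k^2) x^2 + (delta - beta).
# For an even quadratic a x^2 + c, "is SOS" reduces to a >= 0 and c >= 0.
constraints = [
    2*p*(1 + k) - q2 - R*k**2 >= 0,   # x^2 coefficient of L + delta
    delta - beta >= 0,                # constant coefficient of L + delta
    p_prev - p >= 0,                  # V^{i-1} - V^i is SOS
    p >= 0,
]
# Objective: integral of V^i over L = [-1,1] plus k_delta * delta^2.
objective = cp.Minimize((2.0/3.0)*p + k_delta*cp.square(delta))
cp.Problem(objective, constraints).solve()
print(f"p = {p.value:.4f}, delta_i = {delta.value:.4f}")
# delta_i stays at beta = 0.26: the constant robustness margin in r_bar
# cannot be dominated at x = 0, so the relaxation absorbs it there.
```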

Remark 5. Instead of performing two SOS optimizations to evaluate a given policy (one for finding $u^{i}_{\max}$ at each step), one can regard every element of $u^{i}_{\max}=[u^{i}_{1},...,u^{i}_{m}]$ as a decision variable and incorporate it through the following optimization problem:

\begin{aligned}&\min\ u^{i}_{j_{\max}}\\ &\ \text{s.t.}\quad u^{i}_{j_{\max}}-u^{i}_{j}\ \text{is SOS}\end{aligned}   (121)

where $u^{i}_{j}$ is the $j$-th element of the improved safe policy $u^{i}$ found in the policy improvement step. Then, ${u^{i}}_{\max}^{T}R\,{u^{i}}_{\max}$ can be calculated since ${u^{i}}_{\max}=[u^{i}_{1_{\max}},...,u^{i}_{m_{\max}}]$. Alternatively, $u^{i}_{j_{\max}}$ can be defined as a polynomial which is to be minimized over a domain of interest. That is,

\begin{aligned}&\underset{u^{i}_{j_{\max}}}{\min}\int_{\mathcal{D}}u^{i}_{j_{\max}}\\ &\ \text{s.t.}\quad u^{i}_{j_{\max}}-u^{i}_{j}\ \text{is SOS}\end{aligned}   (125)

where $\mathcal{D}$ is the domain of interest in which $u^{i}_{\max}$ is to be minimized. Incorporating these extra SOS constraints into the policy evaluation step (112) removes the need to solve the separate SOS optimization (107).

Once a policy is evaluated, an improved control policy with ISSf certification is found. The following lemma shows how to find a safety-certified improved control policy.

Lemma 2 (Safe policy improvement). Let $u^{i}$ be a robust safe control policy with value function $V^{i}$. Then, an improved safety-certified control policy $u^{i+1}$ can be found by solving the following optimization problem:

\begin{aligned}&\underset{u^{safe},Z}{\min}\ (u^{safe})^{T}Ru^{safe}\\ &\text{s.t.}\quad u^{i+1}=u^{safe}-\frac{1}{2}R^{-1}((\nabla V^{i})^{T}(x)g(x))^{T}\\ &\qquad\ \ \nabla h(x)f(x)+\nabla h(x)g(x)u^{i+1}+Z\,h(x)-\nabla h(x)g(x)(\nabla h(x)g(x))^{T}\ \text{is SOS}\\ &\qquad\ \ Z\ \text{is SOS}\end{aligned}   (132)

Proof. To find an improved control policy, we use the stationarity condition [39] that minimizes the Bellman equation while satisfying the CBF. Note that the Bellman equation can be written as

\begin{aligned}L(V^{i},u^{i+1})&=-(\nabla V^{i})^{T}(x)(f(x)+g(x)u^{i+1})-\overline{r}(x,u^{i+1})\\ &=-(\nabla V^{i})^{T}(x)(f(x)+g(x)(u^{opt})^{i+1})+(u^{safe})^{T}Ru^{safe}-\overline{r}(x,(u^{opt})^{i+1})\end{aligned}   (137)

where $u^{i+1}=u^{opt}+u^{safe}$. Minimizing the term $-(\nabla V^{i})^{T}(x)(f(x)+g(x)(u^{opt})^{i+1})-\overline{r}(x,(u^{opt})^{i+1})$ using the stationarity condition results in $u^{opt}=-\frac{1}{2}R^{-1}((\nabla V^{i})^{T}(x)g(x))^{T}$. Therefore, minimizing $(u^{safe})^{T}Ru^{safe}$ as the second term while setting $u^{i+1}=u^{safe}-\frac{1}{2}R^{-1}((\nabla V^{i})^{T}(x)g(x))^{T}$ optimizes the performance. Since the control must certify the safety constraint, the CBF inequality must also be included. $\square$

Remark 6. In (117) and (132), $\delta$ and $u^{safe}$ are polynomials which can be written in the Square Matrix Representation (SMR) form $P^{T}(x)QP(x)$, where $P(x)$ is a vector of monomials and $Q$ is a symmetric coefficient matrix. To solve this optimization problem, we adopt a typical approach in the literature [40] and minimize $trace(Q)$ to obtain smaller $\delta$ and $u^{safe}$ in the objective functions of (117) and (132). The SOS program (132) involves bilinear decision variables. It can be solved efficiently by splitting it into several smaller SOS programs, as presented in [41].
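To make (132) concrete, the following sketch solves a miniature scalar instance in CVXPY. The dynamics, barrier, value function, and the parameterization of $u^{safe}$ and $Z$ are our own illustrative assumptions, and the quadratic coefficient objective is a simple surrogate for minimizing $(u^{safe})^{T}Ru^{safe}$ (cf. the SMR trace heuristic above).

```python
import cvxpy as cp

# Miniature instance of the safe policy improvement program (132).
# Illustrative data (our own choice): f(x) = -x, g(x) = 1, R = 1,
# V^i(x) = p x^2 with p = 0.4142, h(x) = 1 - x^2, u_safe(x) = s0 + s1 x,
# and a constant SOS multiplier Z. Writing out
#   dh/dx*f + dh/dx*g*u^{i+1} + Z*h - (dh/dx*g)^2
# for u^{i+1}(x) = u_safe(x) - p x gives the quadratic
#   Z - 2 s0 x + (2p - 2 - 2 s1 - Z) x^2,
# whose SOS test is a 2x2 PSD (here: second-order cone) condition.
p = 0.4142

s0, s1 = cp.Variable(), cp.Variable()
Z = cp.Variable(nonneg=True)             # Z is SOS (degree 0)

c0 = Z                                   # constant coefficient
c1 = -2 * s0                             # x coefficient
c2 = 2*p - 2 - 2*s1 - Z                  # x^2 coefficient
sos = cp.SOC(c0 + c2, cp.hstack([c0 - c2, c1]))  # [[c0,c1/2],[c1/2,c2]] >> 0

# Surrogate for min (u_safe)^T R u_safe on the coefficients of u_safe.
cp.Problem(cp.Minimize(s0**2 + s1**2), [sos]).solve()
print(f"u_safe(x) = {s0.value:.3f} + {s1.value:.3f} x,"
      f"  u^(i+1)(x) = {s0.value:.3f} + {s1.value - p:.3f} x")
```

In this instance the purely performance-driven part $-px$ alone does not certify the robust barrier condition near the boundary of the safe set, so a nonzero $u^{safe}$ is returned, which is exactly the role (132) assigns to it.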

Remark 7 [42]. To find a feasible $h(x)$ for performing the policy improvement step, one can solve the following optimization problem:

\begin{aligned}&\text{Find}\ h(x),\ \sigma_{1}(x)\ \text{and}\ \sigma_{2}(x)\\ &\text{s.t.}\quad h(x)-\varepsilon\sigma_{1}(x)x_{0}(x)\ \text{is SOS}\\ &\qquad-h(x)-\varepsilon\sigma_{2}(x)x_{u}(x)\ \text{is SOS}\\ &\qquad\ \sigma_{1}(x)\ \text{and}\ \sigma_{2}(x)\ \text{are SOS}\end{aligned}   (143)

where $x_{0}(x)\geq 0$ and $x_{u}(x)\geq 0$ are polynomial descriptions of the initial set $\mathcal{X}_{0}$ and the unsafe set $\mathcal{X}_{u}$, respectively. However, there might be multiple $h(x)$ solutions. By maximizing the margin $\varepsilon$ of the barrier certificate constraint, a good choice of $h(x)$ can be obtained. This method enlarges the feasible solution space of $u^{safe}$ in the subsequent policy improvement step, which speeds up the convergence of the optimization procedure [40, 41].
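As a sketch of the search (143), the following CVXPY snippet finds a quadratic barrier certificate for a scalar example with interval initial and unsafe sets, maximizing the margin $\varepsilon$ as suggested above. The sets, the degree of $h$, the constant multipliers, and the normalization $||a||\leq 1$ (needed to keep $\varepsilon$ bounded under scaling of $h$) are our own illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

# Barrier certificate search in the spirit of Remark 7 / (143), scalar case.
# Illustrative sets (our own choice): initial set X_0 = [-0.5, 0.5] encoded by
# s0(x) = 0.25 - x^2 >= 0, unsafe set X_u = [1.5, 2.5] encoded by
# su(x) = (x - 1.5)(2.5 - x) = -x^2 + 4x - 3.75 >= 0. We search for a
# quadratic h(x) = a0 + a1 x + a2 x^2 with constant SOS multipliers and
# maximize the margin eps (h >= eps on X_0 and h <= -eps on X_u).
a = cp.Variable(3)                   # coefficients a0, a1, a2 of h
sig1 = cp.Variable(nonneg=True)
sig2 = cp.Variable(nonneg=True)
eps = cp.Variable()

def psd2(c0, c1, c2):
    # (c0 + c1 x + c2 x^2) is SOS  <=>  [[c0, c1/2], [c1/2, c2]] >> 0,
    # written as the second-order cone ||(c0 - c2, c1)|| <= c0 + c2.
    return cp.SOC(c0 + c2, cp.hstack([c0 - c2, c1]))

constraints = [
    # h - sig1*s0 - eps is SOS  (so h >= eps on X_0)
    psd2(a[0] - 0.25*sig1 - eps, a[1], a[2] + sig1),
    # -h - sig2*su - eps is SOS  (so h <= -eps on X_u)
    psd2(-a[0] + 3.75*sig2 - eps, -a[1] - 4*sig2, -a[2] + sig2),
    cp.norm(a) <= 1,                 # normalization to keep eps bounded
]
cp.Problem(cp.Maximize(eps), constraints).solve()
print("eps =", round(float(eps.value), 3), " h coeffs =", np.round(a.value, 3))
```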

Combining the safe policy evaluation step (117) with the safe policy improvement step (132), the following algorithm is presented to iteratively solve Problem 2.


Algorithm 1: Satisficing safe control design framework.

1: Initialize with $(V^{0},u^{1})$ satisfying Assumption 4 and set a sum of squares threshold variable $\epsilon$.
2: procedure $\forall\,i=1,2,...,N$
3:     Given $u^{i}$, let $V^{i}=p_{i}^{T}\overrightarrow{m}_{2,2r}(x)$, and calculate the value function $V^{i}$ and the relaxation variable $\delta_{i}$ using (117) through an SOS program. Then check whether $-(V^{i-1}-V^{i})+\epsilon$ is SOS. The algorithm stops if $-(V^{i-1}-V^{i})+\epsilon$ is SOS; otherwise go to Step 4.
4:     Search for an improved policy $u^{i+1}$ using (132). Then use $u^{i+1}$ in Step 3 to calculate a new value function.
5: end procedure

It should be noticed that in Step 3, together with the SOS constraint $V^{i-1}-V^{i}$ in (117), the condition that $-(V^{i-1}-V^{i})+\epsilon$ is SOS implies $|V^{i-1}-V^{i}|\leq\epsilon$. More specifically, Algorithm 1 terminates when the value function $V^{i}$ stops decreasing.
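The overall flow of Algorithm 1 can be sketched as follows. The two SOS programs (117) and (132) are stubbed with the closed-form answers of a miniature scalar instance (our own choice: $f(x)=-x$, $g=1$, $q(x)=x^{2}$, $R=1$, no active safety conflict so $u^{safe}=0$); in a real implementation both stubs are replaced by SOS solver calls.

```python
# Schematic skeleton of Algorithm 1 (SR-PI). Policy evaluation and policy
# improvement are stubs: for V^i = p_i x^2 and u^i = -k_i x, the constraint
# "L(V^i, u^i) + delta_i is SOS" reduces to 2 p (1 + k) >= 1 + k^2, and
# (132) with u_safe = 0 gives u^{i+1}(x) = -p_i x.
eps_stop = 1e-8

def policy_evaluation(k):
    # Stub for (117): smallest feasible p (assumes V^{i-1} - V^i stays SOS).
    return (1.0 + k**2) / (2.0 * (1.0 + k))

def policy_improvement(p):
    # Stub for (132): u^{i+1} = u_safe - (1/2) R^{-1} (dV^i/dx) g = -p x.
    return p

p_prev, k = 2.0, 0.5                  # initialization (V^0, u^1), Assumption 4
for i in range(1, 100):
    p = policy_evaluation(k)          # Step 3: evaluate u^i
    if abs(p_prev - p) < eps_stop:    # stopping test: -(V^{i-1}-V^i)+eps is SOS
        break
    p_prev, k = p, policy_improvement(p)   # Step 4: improve the policy
print(f"stopped at i = {i}: V*(x) ~ {p:.4f} x^2, u*(x) ~ -{k:.4f} x")
# The fixed point p = sqrt(2) - 1 matches the scalar HJB solution for
# f(x) = -x, g = 1, q = x^2, R = 1 (ignoring the constant robustness terms).
```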

Remark 8. The presented robust safe satisficing control scheme integrates a barrier certificate with a performance-driven Lyapunov function to assure safety while sacrificing as little performance as possible.

Theorem 6. Consider Assumptions 1-4 for the system (1). Then,

1) The policy evaluation step (112) has a nonempty feasible set.

2) The closed-loop system (1) with the controller $u^{i}$ derived after each safe policy iteration is robustly safe, and globally robust stability is guaranteed whenever safety and stability are not in conflict.

3) There exists a positive definite $V^{*}\in\mathbb{R}[x]_{2,2r}$ such that for any $x_{0}\in D$, $V^{*}(x_{0})\leq V^{i}(x_{0})$ holds. Besides, $\lim_{i\to\infty}V^{i}(x_{0})\to V^{*}(x_{0})$.

4) Along the solution of the system with $u^{*}(x)=-\frac{1}{2}R^{-1}((\nabla V^{*})^{T}(x)g(x))^{T}$, the following holds:

V^{*}(x_{0})+\int_{0}^{\infty}H(V^{*}(x(t)))dt\leq V^{o}(x_{0})   (144)

Proof:

1) The following mathematical induction steps are used to prove part 1.

i) Suppose i=1i=1. Then, under Assumption 4, L(V0,u1)+δL({{V}^{0}},{{u}^{{}_{1}}})+\delta is SOS. Therefore, V=V0V={{V}^{0}} is a feasible solution to (112).

ii) Assume now that V=Vj1V={{V}^{j-1}} is an optimal solution to the (112) with i=j1>1i=j-1>1. In the following, it is show that V=Vj1V={{V}^{j-1}} is then a feasible solution to the same problem with i=ji=j.

From the safe policy improvement step (132), by definition, ui=usafe12R1(Vi1(x)g(x))T{{u}^{i}}={{u}^{safe}}-\frac{1}{2}{{R}^{-1}}{{(\nabla{{V}^{i-1}}(x)g(x))}^{T}} and

L(Vj1,uj)+δ=(Vj1)T(f(x)+g(x)uj+g(x)ω)r¯(x,uj)+δ=L(Vj1,uj1)+δ(Vj1)Tg(x)(ujuj1)+(uj1)TRuj1+βuj1(x)βuj(x)(u)jTRuj=L(Vj1,uj1)+(uj1uj)TR(uj1uj)+δ+βuj1(x)βuj(x2(usafe)TR(ujuj1)\displaystyle\begin{gathered}L({{V}^{j-1}},{{u}^{{}_{j}}})+\delta=-{{(\nabla{{V}^{j-1}})}^{T}}(f(x)+g(x){{u}^{{}_{j}}}+g(x)\omega)-\bar{r}(x,{{u}^{{}_{j}}})+\delta\hfill\\ =L({{V}^{j-1}},{{u}^{{}_{j-1}}})+\delta-{{(\nabla{{V}^{j-1}})}^{T}}g(x)({{u}^{{}_{j}}}-{{u}^{{}_{j-1}}})+{{(u^{{}^{j-1}})}^{T}}Ru^{{}^{j-1}}+\beta_{u^{j-1}}(x)-\beta_{u^{j}}(x)-{{(u{{}^{j}})}^{T}}Ru^{{}^{j}}\hfill\\ =L({{V}^{j-1}},{{u}^{{}_{j-1}}})+{{({{u}^{{}_{j-1}}}-{{u}^{j}})}^{T}}R({{u}^{{}_{j-1}}}-{{u}^{j}})+\delta+\beta_{u^{j-1}}(x)-\beta_{u^{j}}(x-2{{({{u}^{safe}})}^{T}}R({{u}^{{}_{j}}}-{{u}^{{}_{j-1}}})\hfill\\ \end{gathered} (149)

Under the induction assumption, one has $V^{j-1}\in\mathbb{R}[x]_{2,2r}$ and $L(V^{j-1},u^{j-1})+\delta$ is SOS. By selecting a suitable relaxation variable $\delta$, the effect of the safe policy $u^{safe}$ on the positivity of (149) is eliminated. Hence, $L(V^{j-1},u^{j})+\delta$ is SOS. As a result, $V^{j-1}$ is a feasible solution to the SOS program (112) with $i=j$.
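The completing-the-square step behind the last equality in (149) is worth making explicit. From (132), $(\nabla V^{j-1})^{T}g(x)=-2(u^{j}-u^{safe})^{T}R$, so

```latex
% Algebra behind the last equality in (149), with R = R^T > 0:
\begin{aligned}
-(\nabla V^{j-1})^{T}g\,(u^{j}-u^{j-1})
  &= 2(u^{j}-u^{safe})^{T}R\,(u^{j}-u^{j-1})\\
  &= 2(u^{j})^{T}R(u^{j}-u^{j-1})-2(u^{safe})^{T}R(u^{j}-u^{j-1}),
\end{aligned}
```

and adding $(u^{j-1})^{T}Ru^{j-1}-(u^{j})^{T}Ru^{j}$ to the first term gives exactly $(u^{j-1}-u^{j})^{T}R(u^{j-1}-u^{j})$.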

2) We first show that $u^{i}$ obtained after each safe policy improvement is robustly safe. It can be seen from Algorithm 1 that $u=u^{opt}+u^{safe}$ and the following barrier certificate is satisfied.

\nabla h(x)f(x)+\nabla h(x)g(x)u^{i+1}+Z\,h(x)-\nabla h(x)g(x)\big(\nabla h(x)g(x)\big)^{T}\ \text{is SOS}   (153)

With an initial robust safe control policy u0{{u}^{0}}, the safety of the control policy can always be guaranteed in each iteration.
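Certifying (153) rigorously requires an SOS solver, but the polynomial to be certified is straightforward to assemble symbolically. The sketch below builds it with sympy for an illustrative scalar example and applies a grid-based positivity test, which is only a necessary (not sufficient) surrogate for the SOS certificate; the system data f, g, h, Z and the policy u here are placeholders, not the suspension model of Section IV:

```python
# Assemble the polynomial in (153),
#   p(x) = grad(h)*f + grad(h)*g*u + Z*h - (grad(h)*g)^2,
# and sample it as a sanity check before calling an SOS solver.
import numpy as np
import sympy as sp

x = sp.symbols('x')
f = -x + x**3 / 6      # drift (illustrative placeholder)
g = sp.Integer(1)      # input map (placeholder)
h = 1 - x**2           # barrier: safe set is {x : h(x) >= 0}
u = -2 * x             # candidate policy (placeholder)
Z = sp.Integer(4)      # gain multiplying h in (153)

dh = sp.diff(h, x)
p = sp.expand(dh * f + dh * g * u + Z * h - (dh * g) ** 2)
p_num = sp.lambdify(x, p, 'numpy')

grid = np.linspace(-1.0, 1.0, 2001)            # sample the safe set
print('min of p on grid:', p_num(grid).min())  # >= 0 is necessary evidence
```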

We now show global robust stability of the control solution when there is no conflict between stability and safety. In this case, $\delta=0$, since no relaxation variable is needed to guarantee feasibility once a suitable scalar $k_{\delta}$ is selected. The proof proceeds by induction.

i) Suppose $i=1$. Under Assumption 4, $u^{1}$ is globally robust stabilizing, and we also have $V^{0}\in P$. For each $x_{0}\in D$ with $x_{0}\neq 0$, we can obtain

V1(x0)0r¯(x,u1)𝑑t>0{{V}^{1}}({{x}_{0}})\geq\int\limits_{0}^{\infty}{\overline{r}(x,{{u}^{1}})dt}>0 (154)

Using this inequality and the constraint in (112), under Assumption 1 it follows that

VoV1V0{{V}^{o}}\leq{{V}^{{}_{1}}}\leq{{V}^{{}_{0}}} (155)

Since $V^{o}$ and $V^{0}$ are assumed to be positive definite, it follows that $V^{1}\in P$.

ii) Assume $u^{i-1}$ is globally robust stabilizing and $V^{i-1}\in P$ for $i\geq 2$. We now prove that $u^{i}$ is globally robust stabilizing and $V^{i}\in P$. Along the trajectory of system (1) with $u=u^{i}$, the following holds

V˙i1=(Vi1)T(f+g(ui+ω))=L(Vi1,ui)r¯(x,ui)0{{\dot{V}}^{i-1}}={{(\nabla{{V}^{i-1}})}^{T}}(f+g({{u}^{i}}+\omega))=-L({{V}^{i-1}},{{u}^{i}})-\overline{r}(x,{{u}^{i}})\leq 0 (156)

Therefore, ui{{u}^{i}} is globally robust stabilizing in this situation. Vi1{{V}^{i-1}} is a Lyapunov function for the system and we have

Vi(x0)0r¯(x,ui)𝑑t,x00.{{V}^{i}}({{x}_{0}})\geq\int\limits_{0}^{\infty}{\overline{r}(x,{{u}^{i}})dt},\,\,\,\,\,\forall{{x}_{0}}\neq 0. (157)

Similarly, the following inequalities hold

Vo(x0)Vi(x0)Vi1(x0){{V}^{o}}({{x}_{0}})\leq{{V}^{{}_{i}}}({{x}_{0}})\leq{{V}^{{}_{i-1}}}({{x}_{0}}) (158)

Since Vo{{V}^{o}} and Vi1{{V}^{{}_{i-1}}} are assumed to be positive definite, obviously ViP{{V}^{{}_{i}}}\in P.

3) By 2), the sequence $\{V^{i}(x)\}_{i=0}^{\infty}$ is monotonically decreasing and bounded below by $0$, owing to positive definiteness. Therefore, there exists a limit $V^{*}(x)$ such that $\lim_{i\to\infty}V^{i}(x)=V^{*}(x)$. Let $\{p_{i}\}_{i=0}^{\infty}$ be the coefficient sequence of $\{V^{i}(x)\}_{i=0}^{\infty}$ with $V^{i}=p_{i}^{T}\overrightarrow{m}_{2,2r}(x)$, so that $\lim_{i\to\infty}p_{i}=p^{*}\in\mathbb{R}^{n_{2r}}$ and $V^{*}=(p^{*})^{T}\overrightarrow{m}_{2,2r}(x)$. Similarly, it can be shown that $V^{o}(x)\leq V^{*}(x)\leq V^{0}(x)$. Since $V^{o}$ is positive definite and $V^{*}\geq V^{o}$, $V^{*}\in\mathbb{R}[x]_{2,2r}$ and is positive definite.

4) By 3), we know that

H¯(V)=L(V,u)0.\overline{H}(V^{*})=-L(V^{*},u^{*})\leq 0. (159)

Hence, $V^{*}$ is a solution to Problem 1, and the inequality in 4) follows from the fourth property of Theorem 5. This completes the proof. $\square$

IV SIMULATION RESULTS

In this section, the proposed safe optimal control algorithm is applied to a car suspension system, modeled as

\begin{bmatrix}\dot{x}_{1}\\ \dot{x}_{2}\\ \dot{x}_{3}\\ \dot{x}_{4}\end{bmatrix}=\begin{bmatrix}x_{2}\\ \dfrac{K_{s}(x_{3}-x_{1})-K_{n}(x_{1}-x_{3})^{3}+B_{s}(x_{4}-x_{2})+cu}{M_{b}}\\ x_{4}\\ \dfrac{K_{s}(x_{1}-x_{3})+K_{n}(x_{1}-x_{3})^{3}+B_{s}(x_{2}-x_{4})-K_{t}x_{3}-cu}{M_{w}}\end{bmatrix}   (169)

where $x_{1}$ and $x_{2}$ are the position and velocity of the car body, and $x_{3}$ and $x_{4}$ are the position and velocity of the wheel assembly. $M_{b}$ and $M_{w}$ denote the masses of the car body and the wheel assembly, respectively. $K_{t}$, $K_{s}$, and $K_{n}$ are the tire stiffness, the linear suspension stiffness, and the nonlinear suspension stiffness, respectively, $B_{s}$ is the damping rate of the suspension, and $c$ is a constant gain relating the control signal to the input force. In this experiment, the parameters of the system are set to $M_{b}=300\,kg$, $M_{w}=60\,kg$, $B_{s}=1000\,Ns/m$, $K_{s}=16000\,N/m$, $K_{t}=190000\,N/m$, $K_{n}=1600\,N/m$. The safe region $\mathcal{X}_{o}$ is defined as

𝒳o={x|x4,20x425}.\displaystyle\begin{gathered}\mathcal{X}_{o}=\{x|x\in{{\mathbb{R}}^{4}},\,-20\leq x_{4}\leq 25\}.\end{gathered} (171)

The parameters in the performance index are set to $q(x)=100x_{1}^{2}+x_{2}^{2}+x_{3}^{2}+x_{4}^{2}$ and $R=1$. We are interested in the behavior of the system in the following set

\Theta=\{x\,|\,x\in\mathbb{R}^{4},\ |x_{1}|,|x_{3}|\leq 0.5,\ |x_{2}|,|x_{4}|\leq 10\}.   (172)
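Before discussing the results, the experiment is easy to reproduce in simulation. The sketch below integrates the model (169) with the stated parameters and checks the safety constraint (171) on $x_{4}$; the input gain $c$, the zero (passive) control placeholder, and the initial state are illustrative assumptions, and the learned SOS-based satisficing policy would be substituted for u_fn:

```python
# Minimal simulation of the suspension model (169) plus the safety
# check (171). The control is a placeholder (passive suspension).
import numpy as np
from scipy.integrate import solve_ivp

Mb, Mw = 300.0, 60.0                     # body / wheel masses [kg]
Bs = 1000.0                              # suspension damping [N s/m]
Ks, Kt, Kn = 16000.0, 190000.0, 1600.0   # stiffnesses [N/m]
c = 1.0                                  # input-force gain (assumed)

u_fn = lambda xs: 0.0                    # placeholder for the SOS policy

def suspension(t, xs):
    x1, x2, x3, x4 = xs
    u = u_fn(xs)
    # Suspension force on the car body (equal and opposite on the wheel).
    Fs = Ks * (x1 - x3) + Kn * (x1 - x3) ** 3 + Bs * (x2 - x4)
    return [x2, (-Fs + c * u) / Mb, x4, (Fs - Kt * x3 - c * u) / Mw]

sol = solve_ivp(suspension, (0.0, 5.0), [0.2, 0.0, 0.05, 0.0], max_step=1e-3)

x4 = sol.y[3]                            # wheel-assembly velocity
print('x4 in [%.2f, %.2f]; safe: %s'
      % (x4.min(), x4.max(), bool(x4.min() >= -20 and x4.max() <= 25)))
```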

The state trajectories under the proposed safe optimal controller and in the uncontrolled case are shown in Fig. 2, while the value function and the safe optimal control signal are visualized in Fig. 3 and Fig. 4, respectively. According to Fig. 3, the sequence of value functions evaluated by (112) is monotonically decreasing and reaches a much smaller value than the initial one after 10 iterations. Besides, it can be clearly observed that the trajectory of $x_{4}$ under the proposed algorithm does not violate the safety constraint (171).

Figure 2: The system trajectories of car suspension system under the proposed safe optimal control design.
Figure 3: The 3D plot of value function of car suspension system for the initial control policy at the first iteration and the final control policy found after iteration 10.
Figure 4: The control signal for the proposed safe optimal control when applied to the suspension system.

To analyze and verify the effectiveness of the proposed algorithm, two comparison experiments are conducted in the following subsections. Since optimal control of safety-critical systems is the concern of this paper, both satisfaction of the safety constraints (i.e., staying in the safe region for all time when starting from the safe set) and optimality of the proposed control policy are investigated.

IV-A Safety Verification of the Proposed Algorithm

We now compare our results with the optimal control algorithm presented in [35]. Applying the algorithm in [35] to the car suspension system (169) results in the system performance shown in Fig. 5 and the control signal shown in Fig. 6. From Fig. 5, it can be seen that the trajectory of $x_{4}$ controlled by [35] violates the safety constraint (171). In contrast, the proposed safe optimal control policy does not escape the safe set (171). Therefore, the proposed algorithm guarantees safety, while the algorithm presented in [35] violates safety in this simulation case.

Figure 5: Comparison of the trajectories of the suspension system under the proposed safe optimal control design and the optimal control design without safety guarantee presented in [35].
Figure 6: Comparison of the control signal of the suspension system under the proposed safe optimal control design and the optimal control design without safety guarantee presented in [35].

IV-B Optimality Verification of the Proposed Algorithm

We now compare our proposed safe optimal control design method with the safe control design method presented in [41]. The method in [41] only considers safety and stability and does not incorporate any long-horizon optimality in the control design phase. The simulation results for the suspension system under the controller designed without cost optimization in [41] are shown in Fig. 7. It can be clearly observed from Fig. 7 that neither design violates the safety constraint (171) (represented by the blue dashed line). However, as shown in Fig. 8, the value function corresponding to the proposed safe optimal controller is much smaller than the value function computed with the same reward function for the safe controller of [41]. This clearly shows that the proposed approach outperforms standard safe control design approaches.

Figure 7: Comparison of the trajectories of car suspension system under the proposed safe optimal control and safe control design without optimality consideration.
Figure 8: Comparison of the value function of the system under the proposed safe optimal control and safe control design without optimality consideration.

V CONCLUSION AND FUTURE WORK

In this paper, the problems of safe control and optimal control are investigated simultaneously. The optimal solution is derived by solving the Bellman inequality using SOS programming. The obtained result is then verified against a barrier certificate to guarantee the safety of the system. A relaxation variable is added to handle conflicts between safety and stability. The final controller obtained from the proposed method is not necessarily optimal but assures the safety of the system with guaranteed performance, and is thus called a satisficing solution. Numerical simulations on a car suspension system illustrate the effectiveness of the proposed algorithm. Possible extensions of the presented work to the tracking problem, event-triggered systems, control of systems with unmatched disturbances, and output regulation will be explored in the future.

Appendix A PROOF OF THEOREM 4

1) Define $u_{0}(x)=-\frac{1}{2}R^{-1}\big(\nabla V_{0}^{T}(x)g(x)\big)^{T}$. Since (44) holds, then

\overline{H}(V_{0})=\nabla V_{0}^{T}(f(x)+g(x)u_{0}+w)+\overline{r}(x,u_{0})
=\nabla V_{0}^{T}(f(x)+g(x)u_{1}+w)+\overline{r}(x,u_{1})+\nabla V_{0}^{T}g(x)(u_{0}-u_{1})+u_{0}^{T}Ru_{0}-u_{1}^{T}Ru_{1}+\beta_{u_{0}}(x)-\beta_{u_{1}}(x)
=\nabla V_{0}^{T}(f(x)+g(x)u_{1}+w)+\overline{r}(x,u_{1})-(u_{0}-u_{1})^{T}R(u_{0}-u_{1})+\beta_{u_{0}}(x)-\beta_{u_{1}}(x)
\leq 0   (179)

Therefore, $V_{0}$ is a feasible solution to (71). Now, let us prove that the solution to Problem 1 is unique.

2) Consider system (1) with control policy (72). Along the solutions of the closed-loop system, it follows that

\dot{V}=\nabla V^{T}(x)\big(f(x)+g(x)\overline{u}+g(x)\omega(t)\big)=\nabla V^{T}(x)f(x)+\nabla V^{T}(x)g(x)\overline{u}+\nabla V^{T}(x)g(x)\omega(t)   (183)

Since Problem 1 optimizes over all value functions satisfying $\overline{H}(V)\leq 0$, applying the Hamiltonian $\overline{H}(V)=\nabla V^{T}(x)(f(x)+g(x)u)+\overline{r}(x,u)$ of (12), one has

\dot{V}\leq-\overline{r}(x,\overline{u})+\nabla V^{T}(x)g(x)d
=-q(x)-\overline{u}^{T}R\overline{u}-d_{max}^{T}Rd_{max}-\beta_{u}(x)+\nabla V^{T}(x)g(x)d
=-2\overline{u}^{T}Rd-d_{max}^{T}Rd_{max}-q(x)-\overline{u}^{T}R\overline{u}-\beta_{u}(x)
=-d_{max}^{T}Rd_{max}-q(x)+d^{T}Rd-(d+\overline{u})^{T}R(d+\overline{u})-\beta_{u}(x)
\leq-d_{max}^{T}Rd_{max}-q(x)+d^{T}Rd-\beta_{u}(x)
\leq-q(x)-\beta_{u}(x)\leq-q(x)-\overline{u}^{T}R\overline{u}<0   (191)
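The fourth line of (191) is again a completing-the-square step; for completeness, with $R=R^{T}>0$:

```latex
% Completing the square used in the fourth line of (191):
\begin{aligned}
(d+\overline{u})^{T}R(d+\overline{u})
  &= d^{T}Rd+2\overline{u}^{T}Rd+\overline{u}^{T}R\overline{u}\\
\Longrightarrow\;
-2\overline{u}^{T}Rd-\overline{u}^{T}R\overline{u}
  &= d^{T}Rd-(d+\overline{u})^{T}R(d+\overline{u}).
\end{aligned}
```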

From (191), it is easy to see that $V$ is a well-defined Lyapunov function for the closed-loop system. Therefore, $\overline{u}$ is globally robustly stabilizing. When there is no conflict, selecting $\delta=0$ and $b(x)=1$, it is clear that the solution belongs to (20) with $p_{s}(u,s)=-(\nabla V)^{T}(x)\big(f(x)+g(x)u_{0}\big)$ and $p_{r}(u,s)=r(x,u_{0})$.

3) Next, we show that the performance index (3) has an upper bound. Integrating both sides of inequality (191) over the interval $[0,T]$, we derive:

V(x0)V(x(T))0T(q(x)+uTRu)𝑑tV({{x}_{0}})-V(x(T))\geq\int\limits_{0}^{T}{(q\left(x\right)+{{u}^{T}}Ru})dt (192)

Since $V$ is a well-defined Lyapunov function for the closed-loop system (1), we have $V(x(T))\to 0$ as $T\to\infty$. Thus, (192) yields

V(x0)0(q(x)+uTRu)𝑑tV({{x}_{0}})\geq\int\limits_{0}^{\infty}{(q\left(x\right)+{{u}^{T}}Ru})dt (193)

which implies that $J(x_{0},u)\leq V(x_{0})$ for all small bounded disturbances $d$. This shows that the performance index (3) has the upper bound $V(x_{0})$. Moreover, along the trajectory of the nominal system $\dot{x}=f(x)+g(x)u$, we have

V˙=VT(x)(f(x)+g(x)u)q(x)uTRuβu(x)dmaxTRdmax\dot{V}=\nabla{{V}^{T}}\left(x\right)\left(f\left(x\right)+g\left(x\right)u\right)\leq-q\left(x\right)-{{u}^{T}}Ru-\beta_{u}(x)-d_{max}^{T}Rd_{max} (194)

Integrating both sides of (194) over [0,T][0,T] yields

V(x0)V(x(T))0T(q(x)+uTRu+βu(x)+dmaxTRdmax)𝑑tV({{x}_{0}})-V(x(T))\geq\int\limits_{0}^{T}{(q\left(x\right)+{{u}^{T}}Ru}+\beta_{u}(x)+d_{max}^{T}Rd_{max})dt (195)

Letting TT\to\infty, we obtain J¯(x0,u)V(x0)\overline{J}\left({{x}_{0}},u\right)\leq V({{x}_{0}}).

4) By 3), we know

V(x_{0})\geq\bar{J}(x_{0},\overline{u})\geq\min_{u}\,\bar{J}(x_{0},u)=V^{o}(x_{0})   (196)

Therefore, the second inequality in (73) is proved. Besides,

\overline{H}(V)=\overline{H}(V)-\overline{H}(V^{o})
=(\nabla V-\nabla V^{o})^{T}(f+g\overline{u})+\overline{r}(x,\overline{u})-(\nabla V^{o})^{T}g(u^{o}-\overline{u})-\overline{r}(x,u^{o})
=(\nabla V-\nabla V^{o})^{T}(f+g\overline{u})+(\overline{u}-u^{o})^{T}R(\overline{u}-u^{o})+\beta_{\overline{u}}(x)-\beta_{u^{o}}(x)
\geq(\nabla V-\nabla V^{o})^{T}(f+g\overline{u})   (202)

Integrating the above inequality along the solutions of the closed-loop system (1) with control policy (72) over the interval $[0,\infty)$, we derive

V(x0)Vo(x0)0H¯(V(x(t)))𝑑tV({{x}_{0}})-{{V}^{o}}({{x}_{0}})\leq-\int\limits_{0}^{\infty}{\bar{H}(V(x(t)))dt} (203)

5) By 3), for any feasible solution VV to (71), we have V(x)Vo(x)V(x)\geq{{V}^{o}}(x). Therefore

ΩVo(x)dxΩV(x)dx\mathop{\int}_{\Omega}{{V}^{o}}\left(x\right)dx\leq\mathop{\int}_{\Omega}V\left(x\right)dx (204)

which shows that $V^{o}$ is the global robust optimal solution to (71).

The proof is complete. \square

References

  • [1] A. D. Ames, X. Xu, J. W. Grizzle and P. Tabuada, “Control Barrier Function Based Quadratic Programs for Safety Critical Systems,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3861-3876, Aug. 2017.
  • [2] L. E. G. Martins and T. Gorschek, “Requirements Engineering for Safety-Critical Systems: Overview and Challenges,” IEEE Software, vol. 34, no. 4, pp. 49-57, 2017.
  • [3] K. G. Vamvoudakis, D. Vrabie and F. L. Lewis, “Online adaptive algorithm for optimal control with integral reinforcement learning,” International Journal of Robust and Nonlinear Control, vol. 24, no. 17, pp. 2686-2710, Oct. 2014.
  • [4] R. M. Kretchmar, P. M. Young, C. W. Anderson, D. C. Hittle, M. L. Anderson and C. C. Delnero, “Robust reinforcement learning control with static and dynamic stability,” International Journal of Robust and Nonlinear Control, vol. 11, no. 15, pp. 1469-1500, Oct. 2001.
  • [5] W. M. McEneaney, “Max-plus methods for nonlinear control and estimation,” Springer Science and Business Media, 2006.
  • [6] W. H. Fleming, and W. M. McEneaney. “A Max-Plus-Based Algorithm for a Hamilton–Jacobi–Bellman Equation of Nonlinear Filtering,” SIAM Journal on Control and Optimization, vol. 38, no. 3, pp. 683-710, 2000.
  • [7] X. Xu, J. W. Grizzle, P. Tabuada and A. D. Ames, “Correctness Guarantees for the Composition of Lane Keeping and Adaptive Cruise Control,” IEEE Transactions on Automation Science and Engineering, vol. 15, no. 3, pp. 1216-1229, July 2018.
  • [8] S. Prajna, A. Jadbabaie, “Safety verification of hybrid systems using barrier certificates,” International Workshop on Hybrid Systems: Computation and Control, pp. 477-492, Mar. 2004
  • [9] Q. Nguyen and K. Sreenath, “Exponential Control Barrier Functions for enforcing high relative-degree safety-critical constraints,” in Proc. of American Control Conference, pp. 322-328, 2016.
  • [10] O. Bokanowski, N. Forcadel and H. Zidani, “Reachability and minimal times for state constrained nonlinear problems without any controllability assumption,” SIAM Journal on Control and Optimization, vol. 48, no. 7, pp. 4292-4316, 2010.
  • [11] I. M. Mitchell, A. M. Bayen and C. J. Tomlin, “A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games,” IEEE Transactions on Automatic Control, vol. 50, no. 7, pp. 947-957, July. 2005.
  • [12] J. Ding, J. Sprinkle, S. S. Sastry and C. J. Tomlin, “Reachability calculations for automated aerial refueling,” In Proceedings of IEEE Conference on Decision and Control, pp. 3706-3712, 2008.
  • [13] A. Bemporad, “Reference governor for constrained nonlinear systems,” IEEE Transactions on Automatic Control, vol. 43, no. 3, pp. 415-419, March. 1998.
  • [14] E. G. Gilbert and I. Kolmanovsky, “A generalized reference governor for nonlinear systems,” In Proceedings of IEEE Conference on Decision and Control, pp. 4222-4227, 2001.
  • [15] F. Borrelli, A. Bemporad, M. Fodor and D. Hrovat, “An MPC/hybrid system approach to traction control,” IEEE Transactions on Control Systems Technology, vol. 14, no. 3, pp. 541-552, May. 2006.
  • [16] S. Richter, C. N. Jones and M. Morari, “Computational Complexity Certification for Real-Time MPC With Input Constraints Based on the Fast Gradient Method,” IEEE Transactions on Automatic Control, vol. 57, no. 6, pp. 1391-1403, June. 2012.
  • [17] H. Simon and J. March, “Administrative behavior organization,” New York: Free Press, 1976.
  • [18] J. W. Curtis and R. W. Beard, “Satisficing: a new approach to constructive nonlinear control,” IEEE Transactions on Automatic Control, vol. 49, no. 7, pp. 1090-1102, July. 2004.
  • [19] M. A. Goodrich, W. C. Stirling and R. L. Frost, “A theory of satisficing decisions and control,” IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans vol. 28, no. 6, pp. 763-779, Nov. 1998.
  • [20] H. K. Khalil, “Nonlinear Systems, 3 ed.” Upper Saddle River, NJ: Prentice Hall, 2002.
  • [21] S. Prajna, A. Papachristodoulou and P. A. Parrilo, “Introducing SOSTOOLS: a general purpose sum of squares programming solver,” In Proceedings of IEEE Conference on Decision and Control, pp. 741-746, 2002.
  • [22] C. Mu and Y. Zhang, “Learning-Based Robust Tracking Control of Quadrotor With Time-Varying and Coupling Uncertainties,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 1, pp. 259-273, 2020.
  • [23] C. Mu, Y. Zhang, Z. Gao and C. Sun, “ADP-Based Robust Tracking Control for a Class of Nonlinear Systems With Unmatched Uncertainties,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2019.
  • [24] D. Wang, C. Mu, H. He and D. Liu, “Event-Driven Adaptive Robust Control of Nonlinear Systems With Uncertainties Through NDP Strategy,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 7, pp. 1358-1370, 2017.
  • [25] F. L. Lewis, D. Vrabie and V. L. Syrmos, Optimal control, John Wiley and Sons, 2012.
  • [26] M. Liang, D. Wang and D. Liu, “Neuro-Optimal Control for Discrete Stochastic Processes via a Novel Policy Iteration Algorithm,” IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2019
  • [27] D. Wang, D. Liu and H. Li, “Policy Iteration Algorithm for Online Design of Robust Control for a Class of Continuous-Time Nonlinear Systems,” IEEE Transactions on Automation Science and Engineering, vol. 11, no. 2, pp. 627-632, April. 2014.
  • [28] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32-50, Third Quarter 2009.
  • [29] D. Xu, Q. Wang, Y. Li, “Optimal Guaranteed Cost Tracking of Uncertain Nonlinear Systems Using Adaptive Dynamic Programming with Concurrent Learning,” International Journal of Control, Automation and Systems, pp. 1-12, 2019.
  • [30] M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779-791, 2005.
  • [31] S. Kolathaya and A. D. Ames, “Input-to-State Safety With Control Barrier Functions,” IEEE Control Systems Letters, vol. 3, no. 1, pp. 108-113, Jan. 2019.
  • [32] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath and P. Tabuada, “Control Barrier Functions: Theory and Applications,” European Control Conference , pp. 3420-3431, 2019.
  • [33] D. P. De Farias and B. Van Roy, “The linear programming approach to approximate dynamic programming,” Operations research , vol. 51, no. 6, pp. 850-865, 2003.
  • [34] D. P. De Farias and B. Van Roy, “On constraint sampling in the linear programming approach to approximate dynamic programming,” Mathematics of operations research , vol. 29, no. 3, pp. 462-478, 2004.
  • [35] Y. Jiang and Z. Jiang, “Global Adaptive Dynamic Programming for Continuous-Time Nonlinear Systems,” IEEE Transactions on Automatic Control , vol. 60, no. 11, pp. 2917-2929, Nov. 2015.
  • [36] P. N. Beuchat, A. Georghiou and J. Lygeros, “Performance Guarantees for Model-Based Approximate Dynamic Programming in Continuous Spaces,” IEEE Transactions on Automatic Control, vol. 65, no. 1, pp. 143-158, Jan. 2020.
  • [37] F. L. Lewis, D. Vrabie and K. G. Vamvoudakis, “Reinforcement Learning and Feedback Control: Using Natural Decision Methods to Design Optimal Adaptive Controllers,” IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76-105, Dec. 2012.
  • [38] R. Konda, A. D. Ames and S. Coogan, “Characterizing safety: Minimal barrier functions from scalar comparison systems,” arXiv preprint arXiv:1908.09323, May 2019.
  • [39] D. P. Bertsekas, “Dynamic Programming and Optimal Control.” Belmont, MA, USA: Athena Scientific, 2007.
  • [40] G. Chesi, “Domain of attraction: analysis and control via SOS programming.” Springer Science and Business Media, 2011.
  • [41] L. Wang, D. Han and M. Egerstedt, “Permissive Barrier Certificates for Safe Stabilization Using Sum-of-squares,” in Proc. of American Control Conference, pp. 585-590, 2018.
  • [42] S. Prajna, A. Papachristodoulou, P. Seiler and P. A. Parrilo, “Positive polynomials in control,” Springer, 2005.