
Accelerated Continuous-Time Approximate Dynamic Programming via Data-Assisted Hybrid Control

(Research supported in part by NSF grant number CNS-1947613.)

Daniel E. Ochoa [email protected] Jorge I. Poveda [email protected]
Abstract

We introduce a new closed-loop architecture for the online solution of approximate optimal control problems in the context of continuous-time systems. Specifically, we introduce the first algorithm that incorporates dynamic momentum in actor-critic structures to control continuous-time dynamic plants with an affine structure in the input. By incorporating dynamic momentum in our algorithm, we are able to accelerate the convergence properties of the closed-loop system, achieving superior transient performance compared to traditional gradient-descent based techniques. In addition, by leveraging the existence of past recorded data with sufficiently rich information properties, we dispense with the persistence of excitation condition traditionally imposed on the regressors of the critic and the actor. Given that our continuous-time momentum-based dynamics also incorporate periodic discrete-time resets that emulate restarting techniques used in the machine learning literature, we leverage tools from hybrid dynamical systems theory to establish asymptotic stability properties for the closed-loop system. We illustrate our results with a numerical example.

keywords:
Approximate dynamic programming, concurrent learning, hybrid systems, Lyapunov theory.

Department of Electrical, Energy and Computer Engineering, University of Colorado Boulder, Boulder, Colorado 80305, USA

1 Introduction

Recent technological advances in computation and sensing have incentivized the development and implementation of data-assisted feedback control techniques previously deemed intractable due to their computational complexity. Among these techniques, reinforcement learning (RL) has emerged as a practically viable tool with remarkable degrees of success in robotics [1], autonomous driving [2], and water-distribution systems [3], among other cyber-physical applications; see [4]. These types of algorithms are part of a large landscape of adaptive systems that aim to control a plant while simultaneously optimizing a performance index in a model-free way, with closed-loop stability guarantees.

In this paper, we focus on a particular class of infinite horizon RL problems from the perspective of approximate optimal control and approximate adaptive dynamic programming (AADP). Specifically, we study the optimal control problem for nonlinear continuous-time and control-affine deterministic plants, interconnected with approximate adaptive optimal controllers [5] in an actor-critic configuration. These types of adaptive controllers aim to find, in real time, the solution to the Hamilton-Jacobi-Bellman (HJB) equation by measuring the output of the nonlinear dynamical system while making use of two approximation structures:

  • 1. a critic, used to estimate the optimal value function of the optimal control problem, and

  • 2. an actor, used to estimate the optimal feedback controller.

Our goal is to design online adaptive dynamics for the real-time tuning of the aforementioned structures, while simultaneously achieving closed-loop stability and high transient performance. To achieve this, and motivated by the widespread usage of momentum-based gradient dynamics in practical RL settings [6], we study continuous-time actor-critic dynamics inspired by a class of ordinary differential equations (ODEs) that can be seen as continuous-time counterparts of Nesterov's accelerated optimization algorithm [7]. Such algorithms have gained popularity in optimization and related fields due to the fact that they can minimize smooth convex functions at a rate of order $\mathcal{O}(1/t^2)$ [8]. The main source of the acceleration in these ODEs is the addition of momentum to gradient-based dynamics, in conjunction with a vanishing dynamic damping coefficient. However, as recently shown in [9] and [10], the non-uniform convergence properties that emerge in these types of dynamics complicate their use in feedback systems with plant dynamics in the loop. In this paper, we overcome these challenges by incorporating resets into the proposed momentum-based algorithms, similar to the restarting heuristics studied in the machine learning literature; see [11] and [7]. The resulting actor-critic controller is naturally modeled by a hybrid dynamical system that incorporates continuous-time and discrete-time dynamics, which we analyze using tools from [12].

A traditional assumption in the literature of continuous-time actor-critic RL is that the regressors used in the parameterizations satisfy a persistence of excitation condition along the trajectories of the plant. However, in practice, this condition can be difficult to verify a priori. To circumvent this issue, in this paper we consider a data-assisted approach, where a finite amount of past “sufficiently rich” recorded data is used to guarantee asymptotic learning in the closed-loop system. As a consequence, the resulting data-assisted hybrid control algorithm concurrently uses real-time and recorded data, similar in spirit to concurrent-learning (CL) techniques [13]. By using Lyapunov-based tools for hybrid dynamical systems, we analyze the interconnection of an actor-critic neural-network (NN) controller and the nonlinear plant, establishing that the trajectories of the closed-loop system remain ultimately bounded around the origin of the plant and the optimal actor and critic NN parameters. Since the resulting closed-loop system has suitable regularity properties in terms of continuity of the dynamics, our stability results are in fact robust with respect to arbitrarily small additive disturbances that can be adversarial in nature, or that can arise due to numerical implementations. To the best knowledge of the authors, these are the first theoretical stability guarantees of continuous-time accelerated actor-critic algorithms for neural network-based adaptive dynamic programming controllers in nonlinear deterministic settings.

The rest of this paper is organized as follows: Section 2 presents the notation and some concepts on hybrid dynamical systems, Section 3 presents the problem statement and some preliminaries on optimal control, Section 4 introduces the hybrid momentum-based dynamics for the update of the critic NN, Section 5 presents the update dynamics for the actor NN, and Section 6 studies the properties of the closed-loop system. In Section 7 we study a numerical example illustrating our theoretical results.

2 Preliminaries

Notation: We denote the real numbers by $\mathbb{R}$, and we use $\mathbb{R}_{\geq 0}\subset\mathbb{R}$ to denote the non-negative real line. We use $\mathbb{R}^n$ to represent the $n$-dimensional Euclidean space and $|\cdot|$ to denote its usual vector norm. Given $A\in\mathbb{R}^{n\times n}$, we use $|A|$ to denote the induced 2-norm for matrices; the distinction from the vector norm will be clear from the context. We use $\text{Tr}(A)$ to denote the trace of a matrix. Given a compact set $\mathcal{A}\subset\mathbb{R}^n$ and a vector $z\in\mathbb{R}^n$, we use $|z|_{\mathcal{A}}\coloneqq\min_{s\in\mathcal{A}}|z-s|$ to denote the minimum distance from $z$ to $\mathcal{A}$. We use $r\mathbb{B}$ to denote the closed ball of radius $r>0$ in the Euclidean space, centered at the origin. We use $I_n\in\mathbb{R}^{n\times n}$ to denote the identity matrix, and $(x,y)$ for the concatenation of the vectors $x$ and $y$, i.e., $(x,y)\coloneqq[x^\top,y^\top]^\top$. A function $\gamma:\mathbb{R}_{\geq 0}\to\mathbb{R}_{\geq 0}$ is said to be of class $\mathcal{K}$ ($\gamma\in\mathcal{K}$) if it is continuous, zero at zero, and strictly increasing. A function $\beta:\mathbb{R}_{\geq 0}\times\mathbb{R}_{\geq 0}\to\mathbb{R}_{\geq 0}$ is said to be of class $\mathcal{KL}$ ($\beta\in\mathcal{KL}$) if $\beta(\cdot,s)\in\mathcal{K}$ for each $s\in\mathbb{R}_{\geq 0}$, it is non-increasing in its second argument, and $\lim_{s\to\infty}\beta(r,s)=0$ for each $r\in\mathbb{R}_{\geq 0}$. The gradient of a real-valued function $f:\mathbb{R}^n\to\mathbb{R}$ is defined as a column vector and denoted by $\nabla f$. For a vector-valued function $g:\mathbb{R}^n\to\mathbb{R}^m$, we use $\frac{\partial g(x)}{\partial x}\in\mathbb{R}^{m\times n}$ to denote its Jacobian matrix.

Hybrid Dynamical Systems: To study our algorithms, we will use tools from hybrid dynamical systems (HDS) theory [12]. A HDS with state $x\in\mathbb{R}^n$ has dynamics

$$x\in C,\ \ \dot{x}=F(x),\qquad\text{and}\qquad x\in D,\ \ x^+=G(x), \tag{1}$$

where $F:\mathbb{R}^n\to\mathbb{R}^n$ is called the flow map, $G:\mathbb{R}^n\to\mathbb{R}^n$ is called the jump map, and $C\subset\mathbb{R}^n$ and $D\subset\mathbb{R}^n$ are closed sets, called the flow set and the jump set, respectively. We use $\mathcal{H}=(C,F,D,G)$ to denote the elements of the HDS $\mathcal{H}$. Solutions $x:\text{dom}(x)\to\mathbb{R}^n$ to system (1) are indexed by a continuous-time parameter $t$, which increases continuously during flows, and a discrete-time index $j$, which increases by one during jumps. Thus, the notation $\dot{x}$ in (1) represents the derivative $\frac{dx(t,j)}{dt}$, and $x^+$ in (1) represents the value of $x$ after an instantaneous jump, i.e., $x(t,j+1)$. Accordingly, solutions to system (1) are defined on hybrid time domains. For a precise definition of hybrid time domains and of solutions to HDS of the form (1), we refer the reader to [12, Ch. 2]. The following definition will be instrumental to study the stability and convergence properties of systems of the form (1).
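To make the flow-and-jump semantics of (1) concrete, the following is a minimal simulation sketch in Python; the forward-Euler integration, the step size, the jump-priority rule, and the max_jumps guard are our own implementation choices and not part of the formal model in [12].

```python
import numpy as np

def simulate_hds(F, G, in_C, in_D, x0, t_end=10.0, dt=1e-3, max_jumps=100):
    """Approximate a solution to the HDS (1) on a hybrid time domain.

    F, G       : flow and jump maps, R^n -> R^n
    in_C, in_D : predicates deciding membership in the flow/jump sets
    Returns a list of (t, j, x) samples of the hybrid arc.
    """
    t, j, x = 0.0, 0, np.asarray(x0, dtype=float)
    arc = [(t, j, x.copy())]
    while t < t_end and j < max_jumps:
        if in_D(x):            # jump: x+ = G(x); t frozen, j increments
            x = np.asarray(G(x), dtype=float)
            j += 1
        elif in_C(x):          # flow: Euler step on xdot = F(x); t advances
            x = x + dt * np.asarray(F(x), dtype=float)
            t += dt
        else:                  # the solution cannot be extended further
            break
        arc.append((t, j, x.copy()))
    return arc
```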

Definition 1

The compact set $\mathcal{A}\subset C\cup D$ is said to be uniformly asymptotically stable (UAS) for system (1) if there exist $\beta\in\mathcal{KL}$ and $r>0$ such that every solution $x$ with $x(0,0)\in r\mathbb{B}\cap(C\cup D)$ satisfies:

$$|x(t,j)|_{\mathcal{A}}\leq\beta\big(|x(0,0)|_{\mathcal{A}},t+j\big),\quad\forall\,(t,j)\in\text{dom}(x). \tag{2}$$

When $\beta(r,s)=c_1re^{-c_2s}$ for some $c_1,c_2>0$, the set $\mathcal{A}$ is said to be uniformly exponentially stable (UES). $\square$

3 Problem Statement

Consider a control-affine nonlinear dynamical plant

$$\dot{x}=f(x)+g(x)u, \tag{3}$$

where $x\in\mathbb{R}^n$ is the state of the system, $u\in U\subset\mathbb{R}^m$ is the input, and $f:\mathbb{R}^n\to\mathbb{R}^n$ and $g:\mathbb{R}^n\to\mathbb{R}^{n\times m}$ are locally Lipschitz functions. Our goal is to design a stable algorithm able to find, in real time, a control law $u^*$ that minimizes the cost functional $V:\mathbb{R}^n\times\mathcal{U}_V\to\mathbb{R}$ given by:

$$V(x_0,u)\coloneqq\int_0^\infty r\big(x(\tau),u(x(\tau))\big)\,d\tau, \tag{4}$$

where $x(t)$ represents a solution to (3) from the initial condition $x(0)=x_0$ that results from implementing a feedback law $u$ belonging to a class of admissible control laws $\mathcal{U}_V$, characterized as follows:

Definition 2

[14, Definition 1] Given the dynamical system in (3), a feedback control $u:\mathbb{R}^n\to\mathbb{R}^m$ is admissible with respect to the cost functional $V$ in (4) if

  • 1. $u$ is continuous,

  • 2. $u$ renders system (3) UAS,

  • 3. $V(x_0,u)<\infty$ for all $x_0\in\mathbb{R}^n$. $\square$

We denote the set of admissible feedback laws as $\mathcal{U}_V$.

In (4), we consider cost functions $r:\mathbb{R}^n\times\mathbb{R}^m\to\mathbb{R}$ of the form $r(x,u)\coloneqq Q(x)+R(u)$, where the state cost is given by $Q(x)\coloneqq x^\top\Pi_xx$ with $\Pi_x\succ0$, and the control cost is given by $R(u)\coloneqq u^\top\Pi_uu$ with $\Pi_u\succ0$. To find the optimal control law that minimizes (4), we study the Hamiltonian $H:\mathbb{R}^n\times\mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}$ associated with (3) and (4), given by

$$H(x,u,\nabla V)\coloneqq\nabla V^\top\big(f(x)+g(x)u\big)+Q(x)+R(u). \tag{5}$$

Using (5), a necessary optimality condition for $u^*$ is given by Pontryagin's maximum principle [15]:

$$u^*(x)=\operatorname*{arg\,min}_{u\in\mathcal{U}_V}H(x,u,\nabla V^*)\implies u^*(x)=-\frac{1}{2}\Pi_u^{-1}g(x)^\top\nabla V^*(x), \tag{6}$$

where $V^*$ represents the optimal value function:

$$V^*(x)\coloneqq\inf_{u\in\mathcal{U}_V}V(x,u(\cdot)).$$
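For completeness, the implication in (6) can be verified directly from the first-order condition on the Hamiltonian (5): since $R(u)=u^\top\Pi_uu$ with $\Pi_u\succ0$,

$$\frac{\partial H}{\partial u}(x,u,\nabla V^*)=g(x)^\top\nabla V^*(x)+2\Pi_uu=0\implies u=-\frac{1}{2}\Pi_u^{-1}g(x)^\top\nabla V^*(x),$$

and this stationary point is the unique minimizer because $\nabla_u^2H=2\Pi_u\succ0$.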

On the other hand, under the assumption that $V^*$ is continuously differentiable, the optimal value function can be shown to satisfy the Hamilton-Jacobi-Bellman (HJB) equation [5, Ch. 1.4]:

$$\frac{\partial V^*}{\partial t}=-H(x,u^*,\nabla V^*)\quad\forall x\in\mathbb{R}^n.$$

Since the functional in (4) does not have an explicit dependence on $t$, it follows that $\frac{\partial V^*}{\partial t}=0$, and hence $H(x,u^*,\nabla V^*)=0$, meaning that for all $x\in\mathbb{R}^n$ the following holds:

$$\nabla V^{*\top}\big(f(x)+g(x)u^*(x)\big)+Q(x)+R\big(u^*(x)\big)=0. \tag{7}$$

The time-invariant Hamilton-Jacobi-Bellman equation (7) allows for a state-dependent characterization of optimality. Therefore, by using the optimal control law in (6), and assuming that the system dynamics (3) are known, (7) could be leveraged to find $V^*$. Unfortunately, finding an explicit closed-form expression for $V^*$, and thus for the optimal control law, is in general an intractable problem. However, the utility of (7) is not completely lost. As we show in the following sections, online and historical "measurements" of (7) can be leveraged in real time to estimate the optimal control law $u^*$ while concurrently rendering a neighborhood of the origin of system (3) asymptotically stable.

4 Data-Assisted Critic Dynamics

To leverage the form of (7), we consider the following parameterization of the optimal value function $V^*(x)$:

$$V^*(x)=\theta_c^{*\top}\phi_c(x)+\epsilon_c(x)\quad\forall x\in K, \tag{8}$$

where $K\subset\mathbb{R}^n$ is a compact set, $\theta_c^*\in\mathbb{R}^{l_c}$, $\phi_c:\mathbb{R}^n\to\mathbb{R}^{l_c}$ is a vector of continuously differentiable basis functions, and $\epsilon_c:\mathbb{R}^n\to\mathbb{R}$ is the approximation error. The parameterization (8) is always possible on compact sets due to the continuity properties of $V^*$ and the universal approximation theorem [16]. This parametrization results in an optimal Hamiltonian $H_p^*\coloneqq H\big(x,u^*,\frac{\partial\phi_c}{\partial x}^\top\theta_c^*+\nabla\epsilon_c\big)$ given by:

$$H_p^*(x)=\theta_c^{*\top}\psi(x,u^*(x))+Q(x)+R\big(u^*(x)\big)+\nabla\epsilon_c(x)^\top\big(f(x)+g(x)u^*(x)\big), \tag{9}$$

where we defined $\psi:\mathbb{R}^n\times\mathbb{R}^m\to\mathbb{R}^{l_c}$ as:

$$\psi(x,u)\coloneqq\frac{\partial\phi_c(x)}{\partial x}\big(f(x)+g(x)u\big). \tag{10}$$

We note that the explicit dependence of $\psi$ on the control action $u$ in (10) is a fundamental departure from previous approaches studied in the context of concurrent-learning (CL) NN actor-critic controllers, such as those considered in [17] and [18]. In particular, in the context of CL, the data used to estimate the optimal value function $V^*$ is generated from measurements of the optimal Hamiltonian which, by definition, incorporates the optimal control law $u^*$. Hence, the need to include $u$ as part of the regressor vector $\psi$ becomes crucial: this dependence characterizes how far our recorded measurements of a Hamiltonian are from the optimal Hamiltonian $H_p^*$. Indeed, this distance will explicitly emerge in our convergence and stability analysis. Naturally, the dependence of (10) on $u$ imposes stronger conditions on the recorded data needed to estimate $V^*$.

Assuming we have access to $\phi_c$, we can define a critic neural network as:

$$\hat{V}(x)\coloneqq\theta_c^\top\phi_c(x),\quad\forall x\in K, \tag{11}$$

which will serve as an approximation of the optimal value function $V^*$ in (8). This critic NN results in an estimated Hamiltonian:

$$H\big(x,u,\nabla\hat{V}\big)\coloneqq\theta_c^\top\psi(x,u)+Q(x)+R(u), \tag{12}$$

which we will use to design the update dynamics of the critic parameters $\theta_c$. In particular, our goal is to use previously recorded data from trajectories of the plant to ensure asymptotic stability of the set of optimal critic parameters $\{\theta_c^*\}$, while simultaneously enabling the incorporation of instantaneous measurements from the plant. Towards this end, we will assume enough "richness" in the recorded data, a notion captured by a relaxed (and finite-time) version of persistence of excitation (PE); see [13] and [19].

Assumption 1

Let $\{\psi(x_k,u^*(x_k))\}_{k=1}^N$ be a sequence of recorded data, and define:

$$\Lambda\coloneqq\sum_{k=1}^N\Psi(x_k,u^*(x_k))\Psi(x_k,u^*(x_k))^\top,\qquad\Psi(x,u)\coloneqq\frac{\psi(x,u)}{1+\psi(x,u)^\top\psi(x,u)}. \tag{13}$$

There exists $\underline{\lambda}\in\mathbb{R}_{>0}$ such that $\Lambda\succeq\underline{\lambda}I_{l_c}$, i.e., the data is $\underline{\lambda}$-sufficiently-rich ($\underline{\lambda}$-SR). $\square$
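As a sanity check on Assumption 1, the matrix $\Lambda$ in (13) can be formed directly from the recorded regressors, and the data is $\underline{\lambda}$-SR for any $\underline{\lambda}$ no larger than the smallest eigenvalue of $\Lambda$; a minimal sketch (the array conventions are ours):

```python
import numpy as np

def richness_level(psi_data):
    """psi_data has shape (N, l_c), with row k storing psi(x_k, u*(x_k)).
    Returns (lambda_min(Lambda), Lambda) for the normalized matrix in (13)."""
    l_c = psi_data.shape[1]
    Lam = np.zeros((l_c, l_c))
    for psi in psi_data:
        Psi = psi / (1.0 + psi @ psi)     # normalized regressor Psi(x, u)
        Lam += np.outer(Psi, Psi)
    return np.min(np.linalg.eigvalsh(Lam)), Lam
```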

Remark 1

In this paper, we study reinforcement learning dynamics that do not make explicit use of exploration signals with standard PE properties, which can be difficult to guarantee in practice. Instead, we assume access to samples obtained by observing the optimal inputs $u^*(x_k)$ acting on the plant. Note, however, that this does not imply knowledge of the optimal control policy as a whole, but only of a finite number of demonstrations from an "expert" policy. Similar requirements commonly arise in the literature on imitation learning and inverse reinforcement learning, and have recently been shown in practice to reduce the exploratory requirements of online reinforcement learning algorithms, under mild assumptions on the sampling of the demonstrations. For recent discussions on these topics in the discrete-time stochastic reinforcement learning setting, we refer the reader to [20] and [21].

Now, we consider the instantaneous and data-dependent errors of the estimated Hamiltonian with respect to the optimal one:

$$e^i(\theta_c,x,u)\coloneqq H\big(x,u,\nabla\hat{V}\big)-H\big(x,u^*(x),\nabla V^*\big)=\theta_c^\top\psi(x,u)+Q(x)+R(u),$$
$$e^d_k(\theta_c)\coloneqq H\big(x_k,u^*(x_k),\nabla\hat{V}\big)-H\big(x_k,u^*(x_k),\nabla V^*\big)=\theta_c^\top\psi(x_k,u^*(x_k))+Q(x_k)+R\big(u^*(x_k)\big),$$

where we used the fact that $H(x,u^*(x),\nabla V^*)=0$. Moreover, we define the joint instantaneous and data-dependent error as:

$$e(\theta_c,x,u)\coloneqq\frac{1}{2}\Bigg(\rho_i\frac{e^i(\theta_c,x,u)^2}{\big(1+|\psi(x,u)|^2\big)^2}+\rho_d\sum_{k=1}^N\frac{e^d_k(\theta_c)^2}{\big(1+|\psi(x_k,u^*(x_k))|^2\big)^2}\Bigg), \tag{14}$$

where $\rho_i\in\mathbb{R}_{\geq 0}$ and $\rho_d\in\mathbb{R}_{>0}$ are tunable gains. Since we are interested in designing real-time training dynamics for the estimation of the optimal parameters $\theta_c^*$, we compute the gradient of (14) with respect to $\theta_c$ as follows:

$$\nabla_{\theta_c}e(\theta_c,x,u)=\rho_i\left(\Psi(x,u)\Psi(x,u)^\top\theta_c+\frac{\psi(x,u)\big[Q(x)+R(u)\big]}{\big(1+\psi(x,u)^\top\psi(x,u)\big)^2}\right)+\rho_d\left(\Lambda\theta_c+\sum_{k=1}^N\frac{\psi(x_k,u^*(x_k))\big[Q(x_k)+R\big(u^*(x_k)\big)\big]}{\big(1+\psi(x_k,u^*(x_k))^\top\psi(x_k,u^*(x_k))\big)^2}\right), \tag{15}$$

where $\Lambda$ and $\Psi$ are defined in Assumption 1. The "propagated" error in the HJB equation that results from the approximate parametrization of $V^*$ in (8) is given by:

$$\epsilon_{\text{HJB}}(x)\coloneqq H\big(x,u^*(x),\nabla V^*\big)-H\left(x,u^*,\frac{\partial\phi_c(x)}{\partial x}^\top\theta_c^*\right)=-\nabla\epsilon_c(x)^\top\big(f(x)+g(x)u^*(x)\big). \tag{16}$$
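For concreteness, the gradient (15) can be assembled directly from online measurements and the recorded data; the following sketch assumes user-supplied callables psi_fun, Q, and R implementing (10) and the costs, with psi_data[k] storing $\psi(x_k,u^*(x_k))$ and qr_data[k] storing $Q(x_k)+R(u^*(x_k))$:

```python
import numpy as np

def make_grad_e(psi_data, qr_data, psi_fun, Q, R, rho_i=1.0, rho_d=1.0):
    """Return a function computing the gradient (15) of the joint error (14)."""
    def grad_e(theta_c, x, u):
        # Instantaneous (normalized) term of (15).
        psi = psi_fun(x, u)
        den = (1.0 + psi @ psi) ** 2
        grad = rho_i * (np.outer(psi, psi) @ theta_c
                        + psi * (Q(x) + R(u))) / den
        # Recorded-data term of (15); the theta_c part sums to rho_d * Lambda @ theta_c.
        for psi_k, qr_k in zip(psi_data, qr_data):
            den_k = (1.0 + psi_k @ psi_k) ** 2
            grad += rho_d * (np.outer(psi_k, psi_k) @ theta_c
                             + psi_k * qr_k) / den_k
        return grad
    return grad_e
```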

The following assumption is standard, and it is satisfied when the involved functions are continuous and KK is compact.

Assumption 2

There exist $\overline{\phi_c},\overline{d\phi_c},\overline{\epsilon_c},\overline{d\epsilon_c},\overline{\epsilon_{\text{HJB}}},\overline{g}\in\mathbb{R}_{>0}$ such that

$$|\phi_c(x)|\leq\overline{\phi_c},\quad\left|\frac{\partial\phi_c(x)}{\partial x}\right|\leq\overline{d\phi_c},\quad|\epsilon_c(x)|\leq\overline{\epsilon_c},$$
$$|\nabla\epsilon_c(x)|\leq\overline{d\epsilon_c},\quad|\epsilon_{\text{HJB}}(x)|\leq\overline{\epsilon_{\text{HJB}}},\quad|g(x)|\leq\overline{g}\quad\forall x\in K,$$

where $K$ is the same set considered in (8). $\square$

4.1 Critic Dynamics via Data-Driven Hybrid Momentum-Based Control

To design fast asymptotically stable dynamics for the estimate $\theta_c$, we propose a new class of momentum-based critic dynamics inspired by accelerated gradient flows with restarting mechanisms, such as those studied in [7] and [11]. Specifically, we consider the following hybrid dynamics of the form (1), with state $y\coloneqq(\theta_c,p,\tau)$ and elements:

$$C_c\coloneqq\left\{y\in\mathbb{R}^{2l_c+1}:\tau\in[T_0,T]\right\},\qquad F_c(y,x,u)\coloneqq\begin{pmatrix}\frac{2}{\tau}(p-\theta_c)\\-2k_c\nabla_{\theta_c}e(\theta_c,x,u)\\\frac{1}{2}\end{pmatrix}, \tag{17a}$$
$$D_c\coloneqq\left\{y\in\mathbb{R}^{2l_c+1}:\tau=T\right\},\qquad G_c(y)\coloneqq\begin{pmatrix}\theta_c\\\theta_c\\T_0\end{pmatrix}, \tag{17b}$$

where $k_c\in\mathbb{R}_{>0}$ is a tunable gain, and $(p,\tau)$ are auxiliary states that are periodically reset every time $\tau=T$ via the jump map (17b), with $\infty>T>T_0>0$. The dynamical system in (17) flows in continuous time according to (17a) whenever the timer variable $\tau$ is in $[T_0,T]$. As soon as $\tau$ hits $T$, the algorithm (17) resets the timer variable to $T_0$, and the momentum variable $p$ to $\theta_c$, while leaving $\theta_c$ unaffected. Accordingly, since $\dot{\tau}=\frac{1}{2}$, after the first reset the system exhibits periodic resets every $\Delta T=2(T-T_0)$ units of time. The following assumption provides data-dependent tuning guidelines for the resetting frequency of the timer variable $\tau$, which will be leveraged in our stability results.
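As an illustration, the sketch below encodes the flow and jump maps of (17) in the template of the hybrid-system simulator from Section 2, reusing a gradient routine grad_e implementing (15) (such as the one sketched above); the flat stacking of the state $y=(\theta_c,p,\tau)$ is our own convention, and the plant signals $(x,u)$ must be closed over, or appended to the state, in a full closed-loop simulation.

```python
import numpy as np

def make_critic_hds(grad_e, l_c, k_c=1.0, T0=0.1, T=5.5):
    """Hybrid data (C_c, F_c, D_c, G_c) of the critic dynamics (17),
    with state y = (theta_c, p, tau) stacked in R^{2*l_c + 1}."""
    def F_c(y, x, u):
        theta_c, p, tau = y[:l_c], y[l_c:2 * l_c], y[-1]
        return np.concatenate([
            (2.0 / tau) * (p - theta_c),           # momentum-driven theta_c flow
            -2.0 * k_c * grad_e(theta_c, x, u),    # gradient flow on (14)
            [0.5],                                 # timer: tau_dot = 1/2
        ])
    def G_c(y):
        theta_c = y[:l_c]
        return np.concatenate([theta_c, theta_c, [T0]])  # reset p and tau
    in_C = lambda y: T0 <= y[-1] <= T
    in_D = lambda y: y[-1] >= T
    return F_c, G_c, in_C, in_D
```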

Assumption 3

The tunable parameters $(T_0,T,k_c,\rho_i,\rho_d)$ satisfy $2\rho_d\underline{\lambda}>\rho_i$ and

$$T_0^2+\frac{1}{2k_c\underline{\lambda}\rho_d}<T^2<\frac{8\rho_d\underline{\lambda}}{k_c\rho_i^2}, \tag{18}$$

where $\underline{\lambda}$ is the level of richness of the recorded data defined in Assumption 1. $\square$

For system (17), we study stability properties with respect to the compact set:

$$\mathcal{A}_c\coloneqq\mathcal{A}_{\theta_c,p}\times[T_0,T], \tag{19a}$$
$$\mathcal{A}_{\theta_c,p}\coloneqq\left\{(\theta_c,p)\in\mathbb{R}^{2l_c}:p=\theta_c,\ \theta_c=\theta_c^*\right\}. \tag{19b}$$

The following theorem is the first main result of this paper. All the proofs are presented in the Appendices.

Theorem 1

Given a number $l_c$ of basis functions $\phi_c$ parametrizing the critic NN, and a compact set $K\subset\mathbb{R}^n$, suppose that Assumptions 1, 2, and 3 are satisfied. Then, there exist $(\kappa,c)\in\mathbb{R}_{>0}\times\mathbb{R}_{>0}$ and class-$\mathcal{K}_\infty$ functions $\gamma_1$ and $\gamma_2$ such that, for every solution $y=(\theta_c,p,\tau)$ to (17) with initial condition $y(0,0)=(\theta_c(0,0),p(0,0),\tau(0,0))$, obtained while a control policy $u(\cdot)\in\mathcal{U}_V$ acts on the plant, the critic parameters $\theta_c$ satisfy

$$|\theta_c(t,j)-\theta_c^*|\leq\kappa e^{-c(t+j)}|y(0,0)|_{\mathcal{A}_c}+\gamma_2\big(|\tilde{u}(x(t,j))|\big)+\gamma_1\big(\overline{\epsilon_{\text{HJB}}}\big), \tag{20}$$

for all $(t,j)\in\text{dom}(y)$, where $\tilde{u}(x(t,j))\coloneqq u(x(t,j))-u^*(x(t,j))$. $\square$

The presence in (20) of a residual optimal-control mismatch term of the form $\gamma_2(|u(x)-u^*(x)|)$ represents a crucial difference with respect to previous CL adaptive dynamic programming approaches, such as those studied in [17] and [5, Ch. 4]. This term is a direct byproduct of our definition of $\psi$ in (10), its dependence on the control action $u$, and its appearance in the error gradient (15). In principle, the emergence of this term in Theorem 1 is agnostic to the particular gradient-based update dynamics used for the critic NN, regardless of whether momentum is included. Since $\gamma_2\in\mathcal{K}_\infty$, the larger the difference between the nominal input $u$ and the optimal feedback law $u^*$, the greater the residual error in the convergence of $\theta_c$. In particular, the bound (20) describes a semi-global practical input-to-state stability property that, to the best knowledge of the authors, is novel in the context of CL-based RL. In the next section we show that the residual error $\gamma_2(|\tilde{u}|)$ can be removed by incorporating an additional actor NN into the system.

Remark 2

In contrast to standard data-driven gradient-descent dynamics for the estimation of the optimal value function $V^*$, which can achieve exponential rates of convergence proportional to $\underline{\lambda}$ (cf. [18, 13]), under the assumptions of Theorem 1 the critic update dynamics (17) can achieve exponential convergence with rates proportional to $\sqrt{\underline{\lambda}}$. As shown in [9], momentum-based dynamics of this form can achieve these rates using the restarting parameter

$$T=T^*\coloneqq e\sqrt{\frac{1}{2k_c\rho_d\underline{\lambda}}+T_0^2}. \tag{21}$$

This property is particularly useful in settings where the level of richness of the data set is limited, i.e., when $\underline{\lambda}\ll 1$, which is common in practical applications.
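For illustration, (21) is immediate to evaluate once $\underline{\lambda}$ is known; in the sketch below the value for $\underline{\lambda}$ is a hypothetical placeholder, since the actual richness level depends on the recorded data set:

```python
import numpy as np

def restart_period(k_c, rho_d, lam, T0):
    """Restarting parameter T* from (21); lam is the richness level."""
    return np.e * np.sqrt(1.0 / (2.0 * k_c * rho_d * lam) + T0 ** 2)

# Example: with k_c = rho_d = 1, T0 = 0.1, and a hypothetical lam = 0.5,
# restart_period(1.0, 1.0, 0.5, 0.1) evaluates to about 2.73.
```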

Theorem 1 guarantees exponential convergence to a neighborhood of the optimal parameters $\{\theta_c^*\}$ that define the optimal value function $V^*$. Consequently, by continuity, and on compact sets, $\hat{V}$ converges to an $\epsilon$-approximation of $V^*$, which can be leveraged by the control law (6) to stabilize system (3). However, as noted in [22], implementing only critic structures for the control of nonlinear dynamical systems of the form (3) can lead to poor closed-loop transient performance. To tackle this issue, we consider an auxiliary dynamical system, called the actor, which serves as an estimator of the optimal controller acting on the plant.

Figure 1: Proposed hybrid momentum-based dynamics for the training of the critic subsystem.

5 Actor Dynamics

Figure 2: Actor subsystem.

Using the optimal value function parametrization described in Section 4, the optimal control law can be written as:

$$u^*(x)=-\frac{1}{2}\Pi_u^{-1}g(x)^\top\left[\frac{\partial\phi_c(x)}{\partial x}^\top\theta_c^*+\nabla\epsilon_c(x)\right],\quad\forall x\in K. \tag{22}$$

Therefore, using $\frac{\partial\phi_c(x)}{\partial x}$ and $g(x)$, we can implement an actor neural network given by:

$$\hat{u}(x)=\omega(x)^\top\theta_u, \tag{23}$$

where $\omega:\mathbb{R}^n\to\mathbb{R}^{l_c\times m}$ is defined as:

$$\omega(x)\coloneqq-\frac{1}{2}\frac{\partial\phi_c(x)}{\partial x}g(x)\Pi_u^{-1}. \tag{24}$$

To guarantee convergence of $\hat{u}$ to $u^*$, we design update dynamics for $\theta_u\in\mathbb{R}^{l_c}$ based on the minimization of the error:

$$\varepsilon(x,\theta_c,\theta_u)\coloneqq\frac{1}{2}\left[\alpha_1\frac{\varepsilon_a(x,\theta_c,\theta_u)^\top\varepsilon_a(x,\theta_c,\theta_u)}{1+\text{Tr}\big(\omega(x)^\top\omega(x)\big)}+\alpha_2\,\varepsilon_b(\theta_c,\theta_u)^\top\varepsilon_b(\theta_c,\theta_u)\right],$$
$$\varepsilon_a(x,\theta_c,\theta_u)\coloneqq\hat{u}(x)-\omega(x)^\top\theta_c=\omega(x)^\top(\theta_u-\theta_c),\qquad\varepsilon_b(\theta_c,\theta_u)\coloneqq\theta_u-\theta_c, \tag{25}$$

which satisfies:

$$\nabla_{\theta_u}\varepsilon(x,\theta_c,\theta_u)=\Omega(x)(\theta_u-\theta_c),$$

where

$$\Omega(x)\coloneqq\alpha_1\frac{\omega(x)\omega(x)^\top}{1+\text{Tr}\big(\omega(x)^\top\omega(x)\big)}+\alpha_2I_{l_c}\in\mathbb{R}^{l_c\times l_c}\quad\forall x\in\mathbb{R}^n. \tag{26}$$

Based on these definitions, we consider the following gradient-descent dynamics for the actor neural network:

$$\dot{\theta}_u=F_u(\theta_u,x,\theta_c)\coloneqq-k_u\nabla_{\theta_u}\varepsilon(x,\theta_c,\theta_u), \tag{27}$$

where $k_u\in\mathbb{R}_{>0}$ is a tunable gain. A scheme representing these update dynamics is shown in Figure 2.
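The actor pieces (23), (24), (26), and (27) admit a compact implementation; the following sketch assumes user-supplied callables dphi_c and g returning the Jacobian $\frac{\partial\phi_c(x)}{\partial x}$ (of size $l_c\times n$) and the input matrix $g(x)$:

```python
import numpy as np

def actor_rhs(theta_u, theta_c, x, dphi_c, g, Pi_u_inv,
              k_u=1.0, alpha1=1.0, alpha2=1.0):
    """Actor dynamics (27): theta_u_dot = -k_u * Omega(x) @ (theta_u - theta_c)."""
    omega = -0.5 * dphi_c(x) @ g(x) @ Pi_u_inv              # (24), l_c x m
    Omega = (alpha1 * (omega @ omega.T)
             / (1.0 + np.trace(omega.T @ omega))
             + alpha2 * np.eye(omega.shape[0]))             # (26)
    return -k_u * Omega @ (theta_u - theta_c)

def actor_control(theta_u, x, dphi_c, g, Pi_u_inv):
    """Actor NN feedback (23): u_hat(x) = omega(x)^T theta_u."""
    omega = -0.5 * dphi_c(x) @ g(x) @ Pi_u_inv
    return omega.T @ theta_u
```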

6 Momentum-Based Actor-Critic Feedback System

Consider the closed-loop system resulting from the interconnection of the plant (3), the critic update dynamics (17), the actor update dynamics (27), and the feedback law (23), shown in Figure 3(a) and given by:

$$\dot{x}=f(x)+g(x)\hat{u}(x),\qquad x^+=x, \tag{28a}$$
$$\dot{y}=F_c(y,x,\hat{u}(x)),\qquad y^+=G_c(y), \tag{28b}$$
$$\dot{\theta}_u=F_u(\theta_u,x,\theta_c),\qquad\theta_u^+=\theta_u, \tag{28c}$$

with flow set and jump set given by $C=\mathbb{R}^n\times C_c\times\mathbb{R}^{l_c}$ and $D=\mathbb{R}^n\times D_c\times\mathbb{R}^{l_c}$, respectively, where $C_c$ and $D_c$ are defined in (17). Let $z\coloneqq(x,y,\theta_u)$ be the overall state of the closed-loop system, and define:

$$\mathcal{A}\coloneqq\{0\}\times\mathcal{A}_c\times\{\theta_c^*\}.$$

The following is the main result of this paper.

Theorem 2

Given the vector of basis functions $\phi_c:\mathbb{R}^n\to\mathbb{R}^{l_c}$ parametrizing the critic NN and a compact set $K_z\coloneqq K\times K_y\times K_\theta\subset\mathbb{R}^n\times\mathbb{R}^{2l_c+1}\times\mathbb{R}^{l_c}$, where $K$ is given as in (8), suppose that Assumptions 1-3 are satisfied. Then, there exist $\beta\in\mathcal{KL}$, $\gamma\in\mathcal{K}$, and tunable parameters $(\rho_i,\rho_d,k_c,k_u,\alpha_1,\alpha_2)$ such that, for every solution $z=(x,y,\theta_u)$ to the closed-loop system (28) with initial condition $z(0,0)=(x(0,0),y(0,0),\theta_u(0,0))\in K_z$, there exists $\tilde{T}>0$ such that, for all $(t,j)\in\text{dom}(z)$:

$$|z(t,j)|_{\mathcal{A}}\leq\beta\big(|z(0,0)|_{\mathcal{A}},t+j\big)+\gamma\big(\big|\big(\overline{\epsilon_{\text{HJB}}},\overline{d\epsilon_c}\big)\big|\big)+\nu,$$

for all $0\leq t+j\leq\tilde{T}$, and

$$|z(t,j)|_{\mathcal{A}}\leq\gamma\big(\big|\big(\overline{\epsilon_{\text{HJB}}},\overline{d\epsilon_c}\big)\big|\big)+\nu,\qquad\forall\ t+j\geq\tilde{T},$$

for some constant $\nu>0$. $\square$

Theorem 2 establishes asymptotic convergence, from any compact set $K_z$ and modulo the error $\nu$, to a neighborhood of the compact set $\mathcal{A}$ that shrinks as $(\overline{\epsilon_{\text{HJB}}},\overline{d\epsilon_c})\to 0$, under a suitable choice of the tunable parameters. To the best knowledge of the authors, this is the first result providing stability certificates for continuous-time actor-critic reinforcement learning using recorded data and accelerated value-function estimation dynamics with momentum. In addition, since the resulting closed-loop system (28) is a well-posed hybrid system, the stability results are robust with respect to arbitrarily small additive disturbances on the states and dynamics [12, Ch. 7].

7 Numerical Example

Figure 3: Closed-loop system diagram and numerical example. (a) Closed-loop system. (b) Convergence of the critic (left) and actor (right) neural networks' weights to the optimal values.

In this section, we present a numerical experiment that illustrates our theoretical results. In particular, we study the following nonlinear control-affine plant:

$$\dot{x}=f(x)+g(x)u, \tag{29a}$$
$$f(x)=\begin{bmatrix}-x_1+x_2\\-\frac{1}{2}\Big(x_1+x_2\big(1-(\cos(2x_1)+2)^2\big)\Big)\end{bmatrix}, \tag{29b}$$
$$g(x)\coloneqq\begin{bmatrix}0\\\cos(2x_1)+2\end{bmatrix}, \tag{29c}$$

with local state and control costs given by $Q(x)=x^\top x$ and $R(u)=u^2$ [18]. The optimal value function for this setting is $V^*(x)=\frac{1}{2}x_1^2+x_2^2$, with optimal control law $u^*(x)=-(\cos(2x_1)+2)x_2$. Using this information, we choose $\phi_c(x)=(x_1^2,x_1x_2,x_2^2)$, and we implement the hybrid momentum-based dynamics (17) for the update of the critic neural network, together with the actor update dynamics (27). We obtain the results shown in Figure 3(b) with $x(0,0)=(-10,10)$, $\theta_c(0,0)=(1,1,1)$, and $\theta_u(0,0)\in[0,1]^3$. We compare the results with the case in which the critic neural network is updated with the gradient-descent dynamics of [17], and where the sufficiently rich data consist of 16 data points obtained by sampling the dynamics (29) on a $4\times 4$ grid around the origin. In our simulations we use $T_0=0.1$ and $T=5.5$ for the momentum-based dynamics (17). These particular values are obtained from the level of richness $\underline{\lambda}$ of the data set and the inequalities in (18), ensuring compliance with Assumption 3. For both reinforcement learning dynamics we use $k_c=1$, $k_u=1$, $\rho_d=1$, and $\rho_i=1$. As shown in the figure, both update dynamics converge to $\{\theta_c^*\}$, with $\theta_c^*=(1/2,0,1)$ describing the optimal value function $V^*$. However, the hybrid momentum-based dynamics significantly improve the transient performance of the learning mechanism. (The code used to implement this simulation can be found in the following repository: https://github.com/deot95/Accelerated-Continuous-Time-Approximate-Dynamic-Programming-through-Data-Assisted-Hybrid-Control)
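For reference, the following sketch encodes the plant (29), the basis $\phi_c$, and the known optimal quantities of this example in a form compatible with the simulation sketches of Sections 2, 4, and 5; it reproduces the setup only, not the figures:

```python
import numpy as np

# Plant (29) and basis phi_c(x) = (x1^2, x1*x2, x2^2) from this section.
f = lambda x: np.array([
    -x[0] + x[1],
    -0.5 * (x[0] + x[1] * (1.0 - (np.cos(2 * x[0]) + 2.0) ** 2)),
])
g = lambda x: np.array([[0.0], [np.cos(2 * x[0]) + 2.0]])
dphi_c = lambda x: np.array([[2 * x[0], 0.0],
                             [x[1], x[0]],
                             [0.0, 2 * x[1]]])    # Jacobian of phi_c
Pi_u_inv = np.array([[1.0]])                      # R(u) = u^2, so Pi_u = 1

# Known optimal solution, used to validate the learned weights:
theta_c_star = np.array([0.5, 0.0, 1.0])          # V*(x) = 0.5 x1^2 + x2^2
u_star = lambda x: np.array([-(np.cos(2 * x[0]) + 2.0) * x[1]])

# Regressor (10): psi(x, u) = dphi_c(x) @ (f(x) + g(x) @ u).
psi = lambda x, u: dphi_c(x) @ (f(x) + g(x) @ u)
```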

8 Conclusions

In this paper, we introduced the first stability guarantees for deterministic continuous-time actor-critic reinforcement learning with accelerated training of neural network structures. To do so, we studied a novel hybrid momentum-based estimation dynamical system for the critic NN, which estimates, in real time, the optimal value function. Our stability analysis leveraged the existence of rich recorded data taken from a finite number of samples along optimal trajectories and inputs of the system. We showed that this finite sequence of samples can be used to train the controller to achieve online optimal performance with fast transient performance. Closed-loop stability was established using tools from hybrid dynamical systems theory. Potential extensions include the study of similar accelerated training dynamics for the actor subsystem, as well as considering reinforcement learning problems in hybrid plants.

References

  • [1] J. Ibarz, J. Tan, C. Finn, M. Kalakrishnan, P. Pastor, and S. Levine, “How to train your robot with deep reinforcement learning: lessons we have learned,” The International Journal of Robotics Research, vol. 40, no. 4-5, pp. 698–721, 2021.
  • [2] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez, “Deep reinforcement learning for autonomous driving: A survey,” IEEE Transactions on Intelligent Transportation Systems, 2021.
  • [3] J. Martinez-Piazuelo, D. E. Ochoa, N. Quijano, and L. F. Giraldo, “A multi-critic reinforcement learning method: An application to multi-tank water systems,” IEEE Access, vol. 8, pp. 173227–173238, 2020.
  • [4] K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever, “Handbook of reinforcement learning and control,” 2021.
  • [5] R. Kamalapurkar, P. Walters, J. Rosenfeld, and W. E. Dixon, Reinforcement learning for optimal feedback control: A Lyapunov-based approach. Springer, 2018.
  • [6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [7] W. Su, S. Boyd, and E. Candes, “A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights,” J. of Machine Learning Research, vol. 17, no. 153, pp. 1–43, 2016.
  • [8] A. Wibisono, A. C. Wilson, and M. I. Jordan, “A variational perspective on accelerated methods in optimization,” Proceedings of the National Academy of Sciences, vol. 113, no. 47, pp. E7351–E7358, 2016.
  • [9] J. I. Poveda and N. Li, “Robust hybrid zero-order optimization algorithms with acceleration via averaging in continuous time,” Automatica, vol. 123, 2021.
  • [10] J. I. Poveda and A. R. Teel, “The heavy-ball ode with time-varying damping: Persistence of excitation and uniform asymptotic stability,” in 2020 American Control Conference (ACC), pp. 773–778, IEEE, 2020.
  • [11] B. O’Donoghue and E. J. Candès, “Adaptive restart for accelerated gradient schemes,” Foundations of Computational Mathematics, vol. 15, no. 3, pp. 715–732, 2013.
  • [12] R. Goebel, R. G. Sanfelice, and A. R. Teel, Hybrid Dynamical Systems: Modeling, Stability, and Robustness. Princeton University Press, 2012.
  • [13] G. Chowdhary and E. Johnson, “Concurrent learning for convergence in adaptive control without persistency of excitation,” in 49th IEEE Conference on Decision and Control (CDC), pp. 3674–3679, IEEE, 2010.
  • [14] R. W. Beard, G. N. Saridis, and J. T. Wen, “Galerkin approximations of the generalized hamilton-jacobi-bellman equation,” Automatica, vol. 33, no. 12, pp. 2159–2177, 1997.
  • [15] D. Liberzon, Calculus of variations and optimal control theory. Princeton university press, 2011.
  • [16] K. Hornik, M. Stinchcombe, and H. White, “Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks,” Neural networks, vol. 3, no. 5, pp. 551–560, 1990.
  • [17] K. G. Vamvoudakis, M. F. Miranda, and J. P. Hespanha, “Asymptotically stable adaptive–optimal control algorithm with saturating actuators and relaxed persistence of excitation,” IEEE transactions on neural networks and learning systems, vol. 27, no. 11, pp. 2386–2398, 2015.
  • [18] R. Kamalapurkar, P. Walters, and W. E. Dixon, “Model-based reinforcement learning for approximate optimal regulation,” Automatica, vol. 64, pp. 94–104, 2016.
  • [19] K. J. Astrom and B. Wittenmark, Adaptive Control. Addison-Wesley Publishing Company, 1989.
  • [20] K. Ciosek, “Imitation learning by reinforcement learning,” in International Conference on Learning Representations, 2022.
  • [21] P. Rashidinejad, B. Zhu, C. Ma, J. Jiao, and S. Russell, “Bridging offline reinforcement learning and imitation learning: A tale of pessimism,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [22] K. Doya, “Reinforcement learning in continuous time and space,” Neural computation, vol. 12, no. 1, pp. 219–245, 2000.
  • [23] C. Cai and A. R. Teel, “Characterizations of input-to-state stability for hybrid systems,” Systems & Control Letters, vol. 58, no. 1, pp. 47–53, 2009.
  • [24] H. K. Khalil, Nonlinear Systems. Upper Saddle River, NJ: Prentice Hall, 2002.
  • [25] D. E. Ochoa, J. I. Poveda, A. Subbaraman, G. S. Schmidt, and F. R. Pour-Safaei, “Accelerated concurrent learning algorithms via data-driven hybrid dynamics and nonsmooth odes,” in Learning for Dynamics and Control, pp. 866–878, PMLR, 2021.

Appendix A Proof Theorem 1

A.1 Gradient of Critic Error-Function in Deviation Variables

First, using (16) together with $H(x,u^*(x),\nabla V^*)=0$ for all $x$, we obtain:

$$\psi(x,u^*(x))^\top\theta_c^*+Q(x)+R\big(u^*(x)\big)=\epsilon_{\text{HJB}}(x). \tag{30}$$

Thus, using (15) and (30), we can rewrite the gradient of $e(\theta_c,x,u)$ as follows:

$$\nabla_{\theta_c}e(\theta_c,x,u)=\Theta(x,u)\big(\theta_c-\theta_c^*\big)+\upsilon_\epsilon(x,u)+\chi(x,u), \tag{31}$$

where

$$\Theta(x,u)\coloneqq\rho_i\Psi(x,u)\Psi(x,u)^\top+\rho_d\Lambda, \tag{32}$$

and

$$\upsilon_\epsilon(x,u)\coloneqq\rho_i\frac{\psi(x,u)\epsilon_{\text{HJB}}(x)}{\big(1+|\psi(x,u)|^2\big)^2}+\rho_d\sum_{k=1}^N\frac{\psi(x_k,u^*(x_k))\epsilon_{\text{HJB}}(x_k)}{\big(1+|\psi(x_k,u^*(x_k))|^2\big)^2}\ \in\mathbb{R}^{l_c}, \tag{33}$$
$$\chi(x,u)\coloneqq\frac{\rho_i\psi(x,u)\Big[\frac{\partial\phi_c(x)}{\partial x}g(x)\big(u-u^*(x)\big)\Big]^\top\theta_c^*}{\big(1+|\psi(x,u)|^2\big)^2}+\frac{\rho_i\psi(x,u)\big[R(u)-R(u^*(x))\big]}{\big(1+|\psi(x,u)|^2\big)^2}\ \in\mathbb{R}^{l_c}, \tag{34}$$

which, by using the fact that $\frac{r}{(1+r^2)^2}\leq\frac{3\sqrt{3}}{16}$ for all $r\in\mathbb{R}_{\geq 0}$, satisfy:

$$|\upsilon_\epsilon(x,u)|\leq\frac{3\sqrt{3}}{16}\overline{\epsilon_{\text{HJB}}}\big(\rho_i+N\rho_d\big), \tag{35a}$$
$$|\chi(x,u)|\leq\rho_i\frac{3\sqrt{3}}{16}\Big(\overline{g}\big(\overline{d\phi_c}\big[1+|\theta_c^*|\big]+\overline{d\epsilon_c}\big)|u-u^*(x)|+\lambda_{\max}(\Pi_u)|u-u^*(x)|^2\Big). \tag{35b}$$

The following Lemma will be instrumental for our results.

Lemma 1

If the data is $\underline{\lambda}$-sufficiently-rich, then there exist $\overline{\Theta},\underline{\Theta}\in\mathbb{R}_{>0}$ such that

$$\underline{\Theta}I_{l_c}\preceq\Theta(x,u)\preceq\overline{\Theta}I_{l_c}\qquad\forall x\in\mathbb{R}^n,\ \forall u\in\mathbb{R}^m.$$
  • Proof. Let $\theta\in\mathbb{R}^{l_c}$ be arbitrary. Since, by assumption, the data is $\underline{\lambda}$-SR, it follows that:

$$\theta^\top\Theta(x,u)\theta=\rho_i\theta^\top\Psi(x,u)\Psi(x,u)^\top\theta+\rho_d\theta^\top\Lambda\theta\geq\rho_d\underline{\lambda}|\theta|^2\implies\Theta(x,u)\succeq\underline{\Theta}I_{l_c},\ \forall(x,u)\in\mathbb{R}^n\times\mathbb{R}^m, \tag{36}$$

where $\underline{\Theta}\coloneqq\rho_d\underline{\lambda}$. On the other hand, using the fact that $|aa^\top|=|a|^2$ for all $a\in\mathbb{R}^n$, we have $|\Psi(x,u)\Psi(x,u)^\top|=|\Psi(x,u)|^2\leq 1$ for all $(x,u)\in\mathbb{R}^n\times\mathbb{R}^m$, and hence:

$$\theta^\top\Theta(x,u)\theta=\rho_i\theta^\top\Psi(x,u)\Psi(x,u)^\top\theta+\rho_d\theta^\top\Lambda\theta\leq\big(\rho_i+\rho_d\lambda_{\max}(\Lambda)\big)|\theta|^2\implies\Theta(x,u)\preceq\overline{\Theta}I_{l_c},\quad\forall(x,u)\in\mathbb{R}^n\times\mathbb{R}^m,$$

where $\overline{\Theta}\coloneqq\rho_i+\rho_d\lambda_{\max}(\Lambda)$. $\blacksquare$

A.2 Lyapunov-Based Analysis

Recall from Section 4 that $y=(\theta_c,p,\tau)$, suppose that the assumptions of Theorem 1 hold, and consider the Lyapunov function candidate $V_c:\mathbb{R}^{l_c}\times\mathbb{R}^{l_c}\times\mathbb{R}_{>0}\to\mathbb{R}_{\geq 0}$ given by:

$$V_c(y)\coloneqq\frac{|p-\theta_c|^2}{4}+\frac{|p-\theta_c^*|^2}{4}+k_c\rho_d\tau^2\,\frac{(\theta_c-\theta_c^*)^\top\Lambda(\theta_c-\theta_c^*)}{2}, \tag{37}$$

where $\Lambda$ was defined in Assumption 1, and which satisfies:

$$\underline{c}|y|_{\mathcal{A}_c}^2\leq V_c(y)\leq\overline{c}|y|_{\mathcal{A}_c}^2, \tag{38}$$
$$\underline{c}\coloneqq\min\left\{\frac{1}{4},\ \frac{k_c\rho_dT_0^2\underline{\lambda}}{2}\right\},\qquad\overline{c}\coloneqq\max\left\{\frac{3}{4},\ \frac{1}{2}\big(1+k_c\rho_dT^2\overline{\lambda}\big)\right\},$$

where $\overline{\lambda}\coloneqq\lambda_{\max}(\Lambda)$. Now, let $u\in\mathcal{U}_V$, and consider the time derivative of $V_c$ along the continuous-time evolution of the critic subsystem, i.e., $\dot{V}_c=\nabla_yV_c(y)^\top\dot{y}$. Then, by using (31) and Lemma 1, and after some algebraic manipulation, $\dot{V}_c$ can be shown to satisfy

$$\dot{V}_c\leq-\begin{pmatrix}|p-\theta_c|&|\theta_c-\theta_c^*|\end{pmatrix}M(\tau)\begin{pmatrix}|p-\theta_c|\\|\theta_c-\theta_c^*|\end{pmatrix}+2\sqrt{2}k_c|y|_{\mathcal{A}_c}\big(|\upsilon_\epsilon(x,u(x))|+|\chi(x,u(x))|\big), \tag{39}$$

where

$$M(\tau)\coloneqq\begin{pmatrix}\frac{2}{k_c\tau^2}&-\frac{\rho_i}{2}\\-\frac{\rho_i}{2}&\underline{\Theta}\end{pmatrix}, \tag{40}$$

and $\mathcal{A}_c$ was defined in Section 4. Since $2\rho_d\underline{\lambda}>\rho_i$ and $T^2<\frac{8\rho_d\underline{\lambda}}{k_c\rho_i^2}$ by means of Assumption 3, and since $\tau(t,j)\in[T_0,T]$ for all $(t,j)\in\text{dom}(y)$ by construction of the critic update dynamics (17), it follows that $M(\tau)\succeq\underline{r}I_2$ with $\underline{r}\coloneqq\underline{\Theta}-\frac{\rho_i}{2}$. Hence, from (39) and using (35), we obtain that:

$$\dot{V}_c\leq-\underline{r}|y|_{\mathcal{A}_c}^2+|y|_{\mathcal{A}_c}\Big(\gamma_\nu\big(\overline{\epsilon_{\text{HJB}}}\big)+\gamma_\chi\big(|u(x)-u^*(x)|\big)\Big), \tag{41}$$

where $\gamma_\nu,\gamma_\chi\in\mathcal{K}_\infty$ are given by:

$$\gamma_\nu(r)\coloneqq\frac{3\sqrt{6}}{8}\big(\rho_i+N\rho_d\big)r,\qquad\gamma_\chi(r)\coloneqq c_\chi(r+r^2),$$
$$c_\chi\coloneqq\frac{3\sqrt{6}}{8}\rho_i\max\left\{\overline{g}\big(\overline{d\phi_c}\big[1+|\theta_c^*|\big]+\overline{d\epsilon_c}\big),\ \lambda_{\max}(\Pi_u)\right\}.$$

Thus, letting $d_c\in(0,1)$, and using (38) and (41):

$$\dot{V}_c\leq-\frac{\underline{r}(1-d_c)}{\overline{c}}V_c(y),\quad\forall\ |y|_{\mathcal{A}_c}\geq\frac{1}{d_c\underline{r}}\Big(\gamma_\nu\big(\overline{\epsilon_{\text{HJB}}}\big)+\gamma_\chi\big(|u(x)-u^*(x)|\big)\Big). \tag{42}$$

On the other hand, the change of $V_c$ during the jumps of the critic update dynamics (17) satisfies:

$$V_c\big(y^+\big)-V_c(y)\leq-\eta V_c(y), \tag{43}$$

with $\eta\coloneqq 1-\frac{T_0^2}{T^2}-\frac{1}{2k_c\rho_d\underline{\lambda}T^2}$, which satisfies $\eta\in(0,1)$ by means of Assumption 3. Together, (42) and (43), in conjunction with the quadratic bounds in (38), imply the result of Theorem 1 via [23, Prop. 2.7] and the fact that $|\theta_c(t,j)-\theta_c^*|\leq|y(t,j)|_{\mathcal{A}_c}\leq|(\theta_c(t,j),p(t,j))|_{\mathcal{A}_{\theta_c,p}}$ for all $(t,j)\in\text{dom}(y)$. $\blacksquare$

Appendix B Proof of Theorem 2

B.1 Gradient of Actor Error-Function in Deviation Variables

First, note that we can write the gradient of the error (25) as:

$$\nabla_{\theta_u}\varepsilon(x,\theta_c,\theta_u)=\Omega(x)\big(\theta_u-\theta_c^*-(\theta_c-\theta_c^*)\big),$$

and consider the following lemma, which is instrumental for our results.

Lemma 2

There exist $\overline{\Omega},\underline{\Omega}\in\mathbb{R}_{>0}$ such that

$$\underline{\Omega}I_{l_c}\preceq\Omega(x)\preceq\overline{\Omega}I_{l_c}\qquad\forall x\in\mathbb{R}^n.$$
  • Proof. Let $\theta\in\mathbb{R}^{l_c}$ be arbitrary. Then, by the definition of $\Omega:\mathbb{R}^n\to\mathbb{R}^{l_c\times l_c}$ in (26), it follows that:

$$\theta^\top\Omega(x)\theta=\alpha_1\frac{|\omega(x)^\top\theta|^2}{1+\text{Tr}\big(\omega(x)^\top\omega(x)\big)}+\alpha_2|\theta|^2\geq\alpha_2|\theta|^2\implies\Omega(x)\succeq\underline{\Omega}I_{l_c},\quad\forall x\in\mathbb{R}^n,$$

where $\underline{\Omega}\coloneqq\alpha_2$. On the other hand, we obtain:

$$\theta^\top\Omega(x)\theta\leq\left(\alpha_1\frac{|\omega(x)|^2}{1+|\omega(x)|_F^2}+\alpha_2\right)|\theta|^2\leq\overline{\Omega}|\theta|^2\implies\Omega(x)\preceq\overline{\Omega}I_{l_c},\quad\forall x\in\mathbb{R}^n,$$

where $\overline{\Omega}\coloneqq\alpha_1+\alpha_2$, $|A|_F$ denotes the Frobenius norm, and we used $|A|\leq|A|_F$ for all $A\in\mathbb{R}^{l_c\times l_c}$ and $\frac{r^2}{1+r^2}\leq 1$ for all $r\in\mathbb{R}$. $\blacksquare$

Now, consider the Lyapunov function

$$\mathcal{V}(z)\coloneqq V_o(x)+V_c(y)+V_a(\theta_u), \tag{44a}$$
$$V_o(x)\coloneqq V^*(x),\qquad V_a(\theta_u)\coloneqq\frac{1}{2}|\theta_u-\theta_c^*|^2, \tag{44b}$$

where $V_c$ was defined in (37), and where we recall that $z=(x,y,\theta_u)$. By [24, Lemma 4.3], and since $V_o=V^*$ is a continuous and positive definite function on $\mathbb{R}^n$, there exist $\underline{\gamma}_o,\overline{\gamma}_o\in\mathcal{K}$ such that $\underline{\gamma}_o(|x|)\leq V_o(x)\leq\overline{\gamma}_o(|x|)$. Hence, using (38), and the fact that the sum of class-$\mathcal{K}$ functions is in turn of class $\mathcal{K}$, there exist $\underline{\gamma}_{\mathcal{V}},\overline{\gamma}_{\mathcal{V}}\in\mathcal{K}$ such that:

$$\underline{\gamma}_{\mathcal{V}}\big(|z|_{\mathcal{A}}\big)\leq\mathcal{V}(z)\leq\overline{\gamma}_{\mathcal{V}}\big(|z|_{\mathcal{A}}\big). \tag{45}$$

Now, the time derivative $\dot{V}_o=\nabla V_o(x)^\top\dot{x}$ along the trajectories of (28) satisfies:

$$\dot{V}_o\leq-Q(x)+\frac{\overline{g}^2\lambda_{\max}\big(\Pi_u^{-1}\big)}{2}\big(\overline{d\phi_c}|\theta_c^*|+\overline{d\epsilon_c}\big)\big(\overline{d\phi_c}|\theta_u-\theta_c^*|+\overline{d\epsilon_c}\big). \tag{46}$$

On the other hand, making use of Lemma 2, for the time derivative $\dot{V}_a=\nabla_{\theta_u}V_a(\theta_u)^\top\dot{\theta}_u$ we obtain:

$$\dot{V}_a\leq-k_u\alpha_2|\theta_u-\theta_c^*|^2+k_u\overline{\Omega}|\theta_u-\theta_c^*||\theta_c-\theta_c^*|. \tag{47}$$

Hence, using (39), (46), and (47), together with the upper bounds in (35), we obtain that the time derivative of $\mathcal{V}$ along the trajectories of the closed-loop system satisfies:

$$\dot{\mathcal{V}}\leq-Q(x)-\underline{r}|y|_{\mathcal{A}_c}^2-k_u\alpha_2|\theta_u-\theta_c^*|^2+c_y|y|_{\mathcal{A}_c}+c_u|\theta_u-\theta_c^*|+c_{yu}|\theta_u-\theta_c^*||y|_{\mathcal{A}_c}+c_{yu^2}|y|_{\mathcal{A}_c}|\theta_u-\theta_c^*|^2+c_0, \tag{48}$$

where

cy\displaystyle c_{y} 368kc(ϵHJB¯(ρi+Nρd)+12g¯2ρi[λmax(Πu1)(dϕc¯[1+|θc|]+dϵc¯)dϵc¯+λmax(Πu)λmax(Πu1)2dϵc¯2]),\displaystyle\coloneqq\frac{3\sqrt{6}}{8}k_{c}\Bigg{(}\overline{\epsilon_{\text{HJB}}}\left(\rho_{i}+N\rho_{d}\right)+\frac{1}{2}\overline{g}^{2}\rho_{i}\Bigg{[}\lambda_{\max}\left(\Pi_{u}^{-1}\right)\left(\overline{d\phi_{c}}\left[1+\left\lvert\theta_{c}^{*}\right\rvert\right]+\overline{d\epsilon_{c}}\right)\overline{d\epsilon_{c}}+\lambda_{\max}\left(\Pi_{u}\right)\lambda_{\max}\left(\Pi_{u}^{-1}\right)^{2}\overline{d\epsilon_{c}}^{2}\Bigg{]}\Bigg{)},
cu\displaystyle c_{u} 12(dϕ¯c|θc|+dϵc¯)g¯2λmax(Πu1)dϕc¯,\displaystyle\coloneqq\frac{1}{2}\left(\overline{d\phi}_{c}\left\lvert\theta^{*}_{c}\right\rvert+\overline{d\epsilon_{c}}\right)\overline{g}^{2}\lambda_{\max}\left(\Pi_{u}^{-1}\right)\overline{d\phi_{c}},
cyu\displaystyle c_{yu} 36kc16(2kuΩ¯+g¯2ρiλmax(Πu1)(dϕc¯[1+|θc|]+dϵc¯)dϕc¯),\displaystyle\coloneqq\frac{3\sqrt{6}k_{c}}{16}\Bigg{(}2k_{u}\overline{\Omega}+\overline{g}^{2}\rho_{i}\lambda_{\max}\left(\Pi_{u}^{-1}\right)\left(\overline{d\phi_{c}}\left[1+\left\lvert\theta_{c}^{*}\right\rvert\right]+\overline{d\epsilon_{c}}\right)\overline{d\phi_{c}}\Bigg{)},
cyu2\displaystyle c_{yu^{2}} 3616kcg¯2ρiλmax(Πu)λmax(Πu1)2dϕc¯2,\displaystyle\coloneqq\frac{3\sqrt{6}}{16}k_{c}\overline{g}^{2}\rho_{i}\lambda_{\max}\left(\Pi_{u}\right)\lambda_{\max}\left(\Pi_{u}^{-1}\right)^{2}\overline{d\phi_{c}}^{2},
c0\displaystyle c_{0} 12(dϕ¯c|θc|+dϵc¯)g¯2λmax(Πu1)dϵc¯.\displaystyle\coloneqq\frac{1}{2}\left(\overline{d\phi}_{c}\left\lvert\theta^{*}_{c}\right\rvert+\overline{d\epsilon_{c}}\right)\overline{g}^{2}\lambda_{\max}\left(\Pi_{u}^{-1}\right)\overline{d\epsilon_{c}}.

Then, for all $|\theta_u-\theta_c^*|\leq\frac{c_{yu}}{c_{yu^2}}$, by using $Q(x)=x^\top\Pi_xx$ and letting $d_1\in(0,1)$, from (48), $\dot{\mathcal{V}}$ can be further upper bounded as:

$$\dot{\mathcal{V}}\leq-\lambda_{\min}(\Pi_x)|x|^2-(1-d_1)\big(\underline{r}|y|_{\mathcal{A}_c}^2+k_u\alpha_2|\theta_u-\theta_c^*|^2\big)+c_y|y|_{\mathcal{A}_c}+c_u|\theta_u-\theta_c^*|+c_0-\begin{pmatrix}|y|_{\mathcal{A}_c}&|\theta_u-\theta_c^*|\end{pmatrix}\begin{pmatrix}d_1\underline{r}&-c_{yu}\\-c_{yu}&d_1k_u\alpha_2\end{pmatrix}\begin{pmatrix}|y|_{\mathcal{A}_c}\\|\theta_u-\theta_c^*|\end{pmatrix}. \tag{49}$$

Now, pick a set of tunable parameters $(\rho_i,\rho_d,k_c,k_u)$ such that $\underline{r}\geq\frac{c_{yu}^2}{d_1^2k_u\alpha_2}$, so that from (49) we obtain:

$$\dot{\mathcal{V}}\leq-(1-d_2)d_z|z|_{\mathcal{A}}^2,\quad\forall\ |z|_{\mathcal{A}}\geq\max\left\{\frac{c_0}{2d_{yu}},\ \frac{2d_{yu}}{d_2d_z}\right\},\ |\theta_u-\theta_c^*|\leq\frac{c_{yu}}{c_{yu^2}}, \tag{50}$$

with

$$d_z\coloneqq\min\left\{\lambda_{\min}(\Pi_x),\ (1-d_1)\underline{r},\ k_u\alpha_2\right\},\qquad d_{yu}\coloneqq\max\left\{2c_{yu},\ c_0\right\},\qquad d_2\in(0,1).$$

Notice that for every compact set $K_\theta$ of initial conditions for $\theta_u$, we can pick suitable $(\rho_i,\rho_d,\alpha_1,\alpha_2,k_c,k_u)$ so that $K_\theta\subset\frac{c_{yu}}{c_{yu^2}}\mathbb{B}$ and (50) holds for every trajectory with $\theta_u(0,0)\in K_\theta$. Now, during jumps, $x$ and $\theta_u$ do not change, and hence $\mathcal{V}$ satisfies:

$$\mathcal{V}(z^+)-\mathcal{V}(z)=V_c(y^+)-V_c(y)\leq-\eta V_c(y). \tag{51}$$

The result of the theorem follows by using the strong decrease of $\mathcal{V}$ during flows outside a neighborhood of $\mathcal{A}$ described in (50), the non-increase of $\mathcal{V}$ during jumps given in (51), by noting that, by design, the closed-loop dynamics are a well-posed HDS which experiences periodic jumps separated by intervals of flow of length $2(T-T_0)>0$ (cf. [25]), and by following the same arguments of [12, Prop. 3.27] and [23, Prop. 2.7]. $\blacksquare$