
Accelerated Continuous-Time Approximate Dynamic Programming via Data-Assisted Hybrid Control

(Research supported in part by NSF grant number CNS-1947613.)

Daniel E. Ochoa [email protected] Jorge I. Poveda [email protected]
Abstract

We introduce a new closed-loop architecture for the online solution of approximate optimal control problems in the context of continuous-time systems. Specifically, we introduce the first algorithm that incorporates dynamic momentum in actor-critic structures to control continuous-time dynamic plants with an affine structure in the input. By incorporating dynamic momentum in our algorithm, we are able to accelerate the convergence properties of the closed-loop system, achieving superior transient performance compared to traditional gradient-descent based techniques. In addition, by leveraging the existence of past recorded data with sufficiently rich information properties, we dispense with the persistence of excitation condition traditionally imposed on the regressors of the critic and the actor. Given that our continuous-time momentum-based dynamics also incorporate periodic discrete-time resets that emulate restarting techniques used in the machine learning literature, we leverage tools from hybrid dynamical systems theory to establish asymptotic stability properties for the closed-loop system. We illustrate our results with a numerical example.

keywords:
Approximate dynamic programming, concurrent learning, hybrid systems, Lyapunov theory.

Department of Electrical, Energy and Computer Engineering, University of Colorado Boulder, Boulder, Colorado 80305, USA

1 Introduction

Recent technological advances in computation and sensing have incentivized the development and implementation of data-assisted feedback control techniques previously deemed intractable due to their computational complexity. Among these techniques, reinforcement learning (RL) has emerged as a practically viable tool with remarkable degrees of success in robotics [1], autonomous driving [2], and water-distribution systems [3], among other cyber-physical applications; see [4]. These types of algorithms are part of a large landscape of adaptive systems that aim to control a plant while simultaneously optimizing a performance index in a model-free way, with closed-loop stability guarantees.

In this paper, we focus on a particular class of infinite horizon RL problems from the perspective of approximate optimal control and approximate adaptive dynamic programming (AADP). Specifically, we study the optimal control problem for nonlinear continuous-time and control-affine deterministic plants, interconnected with approximate adaptive optimal controllers [5] in an actor-critic configuration. These types of adaptive controllers aim to find, in real time, the solution to the Hamilton-Jacobi-Bellman (HJB) equation by measuring the output of the nonlinear dynamical system while making use of two approximation structures:

  • 1. a critic, used to estimate the optimal value function of the optimal control problem, and

  • 2. an actor, used to estimate the optimal feedback controller.

Our goal is to design online adaptive dynamics for the real-time tuning of the aforementioned structures, while simultaneously achieving closed-loop stability and high transient performance. To achieve this, and motivated by the widespread usage of momentum-based gradient dynamics in practical RL settings [6], we study continuous-time actor-critic dynamics inspired by a class of ordinary differential equations (ODEs) that can be seen as continuous-time counterparts of Nesterov's accelerated optimization algorithm [7]. Such algorithms have gained popularity in optimization and related fields due to the fact that they can minimize smooth convex functions at a rate of order $\mathcal{O}(1/t^2)$ [8]. The main source of the acceleration in these ODEs is the addition of momentum to gradient-based dynamics, in conjunction with a vanishing dynamic damping coefficient. However, as recently shown in [9] and [10], the non-uniform convergence properties that emerge in these types of dynamics complicate their use in feedback systems with plant dynamics in the loop. In this paper, we overcome these challenges by incorporating resets into the proposed momentum-based algorithms, similar to the restarting heuristics studied in the machine learning literature; see [11] and [7]. The resulting actor-critic controller is naturally modeled by a hybrid dynamical system that incorporates continuous-time and discrete-time dynamics, which we analyze using tools from [12].

A traditional assumption in the literature of continuous-time actor-critic RL is that the regressors used in the parameterizations satisfy a persistence of excitation condition along the trajectories of the plant. However, in practice, this condition can be difficult to verify a priori. To circumvent this issue, in this paper we consider a data-assisted approach, where a finite amount of past “sufficiently rich” recorded data is used to guarantee asymptotic learning in the closed-loop system. As a consequence, the resulting data-assisted hybrid control algorithm concurrently uses real-time and recorded data, similar in spirit to concurrent-learning (CL) techniques [13]. By using Lyapunov-based tools for hybrid dynamical systems, we analyze the interconnection of an actor-critic neural-network (NN) controller and the nonlinear plant, establishing that the trajectories of the closed-loop system remain ultimately bounded around the origin of the plant and the optimal actor and critic NN parameters. Since the resulting closed-loop system has suitable regularity properties in terms of continuity of the dynamics, our stability results are in fact robust with respect to arbitrarily small additive disturbances that can be adversarial in nature, or that can arise due to numerical implementations. To the best knowledge of the authors, these are the first theoretical stability guarantees of continuous-time accelerated actor-critic algorithms for neural network-based adaptive dynamic programming controllers in nonlinear deterministic settings.

The rest of this paper is organized as follows: Section 2 presents the notation and some concepts on hybrid dynamical systems, Section 3 presents the problem statement and some preliminaries on optimal control, Section 4 introduces the hybrid momentum-based dynamics for the update of the critic NN, Section 5 presents the update dynamics for the actor NN, and Section 6 studies the properties of the closed-loop system. In Section 7 we study a numerical example illustrating our theoretical results.

2 Preliminaries

Notation: We denote the real numbers by $\mathbb{R}$, and we use $\mathbb{R}_{\geq 0}\subset\mathbb{R}$ to denote the non-negative real line. We use $\mathbb{R}^n$ to represent the $n$-dimensional Euclidean space and $|\cdot|$ to denote its usual vector norm. Given $A\in\mathbb{R}^{n\times n}$, we use $|A|$ to denote the induced 2-norm for matrices; the distinction from the vector norm will be clear from the context. We use $\text{Tr}(A)$ to denote the trace of a matrix. Given a compact set $\mathcal{A}\subset\mathbb{R}^n$ and a vector $z\in\mathbb{R}^n$, we use $|z|_{\mathcal{A}}\coloneqq\min_{s\in\mathcal{A}}|z-s|$ to denote the minimum distance from $z$ to $\mathcal{A}$. We use $r\mathbb{B}$ to denote the closed ball of radius $r>0$ in the Euclidean space, centered at the origin. We use $I_n\in\mathbb{R}^{n\times n}$ to denote the identity matrix, and $(x,y)$ for the concatenation of the vectors $x$ and $y$, i.e., $(x,y)\coloneqq[x^\top,y^\top]^\top$. A function $\gamma:\mathbb{R}_{\geq 0}\to\mathbb{R}_{\geq 0}$ is said to be of class $\mathcal{K}$ ($\gamma\in\mathcal{K}$) if it is continuous, zero at zero, and strictly increasing. A function $\beta:\mathbb{R}_{\geq 0}\times\mathbb{R}_{\geq 0}\to\mathbb{R}_{\geq 0}$ is said to be of class $\mathcal{KL}$ ($\beta\in\mathcal{KL}$) if $\beta(\cdot,s)\in\mathcal{K}$ for each $s\in\mathbb{R}_{\geq 0}$, it is non-increasing in its second argument, and $\lim_{s\to\infty}\beta(r,s)=0$ for each $r\in\mathbb{R}_{\geq 0}$. The gradient of a real-valued function $f:\mathbb{R}^n\to\mathbb{R}$ is defined as a column vector and denoted by $\nabla f$. For a vector-valued function $g:\mathbb{R}^n\to\mathbb{R}^m$, we use $\frac{\partial g(x)}{\partial x}\in\mathbb{R}^{m\times n}$ to denote its Jacobian matrix.

Hybrid Dynamical Systems: To study our algorithms, we will use tools from hybrid dynamical systems (HDS) theory [12]. A HDS with state $x\in\mathbb{R}^n$ has dynamics

$$x\in C,\ \ \dot{x}=F(x),\qquad\text{and}\qquad x\in D,\ \ x^+=G(x), \tag{1}$$

where $F:\mathbb{R}^n\to\mathbb{R}^n$ is called the flow map, $G:\mathbb{R}^n\to\mathbb{R}^n$ is called the jump map, and $C\subset\mathbb{R}^n$ and $D\subset\mathbb{R}^n$ are closed sets, called the flow set and the jump set, respectively. We use $\mathcal{H}=(C,F,D,G)$ to denote the elements of the HDS $\mathcal{H}$. Solutions $x:\text{dom}(x)\to\mathbb{R}^n$ to system (1) are indexed by a continuous-time parameter $t$, which increases continuously during flows, and a discrete-time index $j$, which increases by one during jumps. Thus, the notation $\dot{x}$ in (1) represents the derivative $\frac{dx(t,j)}{dt}$, and $x^+$ in (1) represents the value of $x$ after an instantaneous jump, i.e., $x(t,j+1)$. Accordingly, solutions to system (1) are defined on hybrid time domains. For a precise definition of hybrid time domains and of solutions to HDS of the form (1), we refer the reader to [12, Ch. 2]. The following definition will be instrumental to study the stability and convergence properties of systems of the form (1).
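To make the flow-and-jump semantics of (1) concrete, the following is a minimal simulation sketch in Python; the forward-Euler integration, the step size, the jump-priority rule, and the max_jumps guard are our own implementation choices and not part of the formal model in [12].

```python
import numpy as np

def simulate_hds(F, G, in_C, in_D, x0, t_end=10.0, dt=1e-3, max_jumps=100):
    """Approximate a solution to the HDS (1) on a hybrid time domain.

    F, G       : flow and jump maps, R^n -> R^n
    in_C, in_D : predicates deciding membership in the flow/jump sets
    Returns a list of (t, j, x) samples of the hybrid arc.
    """
    t, j, x = 0.0, 0, np.asarray(x0, dtype=float)
    arc = [(t, j, x.copy())]
    while t < t_end and j < max_jumps:
        if in_D(x):            # jump: x+ = G(x); t frozen, j increments
            x = np.asarray(G(x), dtype=float)
            j += 1
        elif in_C(x):          # flow: Euler step on xdot = F(x); t advances
            x = x + dt * np.asarray(F(x), dtype=float)
            t += dt
        else:                  # the solution cannot be extended further
            break
        arc.append((t, j, x.copy()))
    return arc
```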

Definition 1

The compact set $\mathcal{A}\subset C\cup D$ is said to be uniformly asymptotically stable (UAS) for system (1) if there exist $\beta\in\mathcal{KL}$ and $r>0$ such that every solution $x$ with $x(0,0)\in r\mathbb{B}\cap(C\cup D)$ satisfies:

$$|x(t,j)|_{\mathcal{A}}\leq\beta\big(|x(0,0)|_{\mathcal{A}},t+j\big),\quad\forall\,(t,j)\in\text{dom}(x). \tag{2}$$

When $\beta(r,s)=c_1re^{-c_2s}$ for some $c_1,c_2>0$, the set $\mathcal{A}$ is said to be uniformly exponentially stable (UES). $\square$

3 Problem Statement

Consider a control-affine nonlinear dynamical plant

$$\dot{x}=f(x)+g(x)u, \tag{3}$$

where $x\in\mathbb{R}^n$ is the state of the system, $u\in U\subset\mathbb{R}^m$ is the input, and $f:\mathbb{R}^n\to\mathbb{R}^n$ and $g:\mathbb{R}^n\to\mathbb{R}^{n\times m}$ are locally Lipschitz functions. Our goal is to design a stable algorithm able to find, in real time, a control law $u^*$ that minimizes the cost functional $V:\mathbb{R}^n\times\mathcal{U}_V\to\mathbb{R}$ given by:

$$V(x_0,u)\coloneqq\int_0^\infty r\big(x(\tau),u(x(\tau))\big)\,d\tau, \tag{4}$$

where $x(t)$ represents a solution to (3) from the initial condition $x(0)=x_0$ that results from implementing a feedback law $u$ belonging to a class of admissible control laws $\mathcal{U}_V$, characterized as follows:

Definition 2

[14, Definition 1] Given the dynamical system in (3), a feedback control $u:\mathbb{R}^n\to\mathbb{R}^m$ is admissible with respect to the cost functional $V$ in (4) if

  • 1. $u$ is continuous,

  • 2. $u$ renders system (3) UAS,

  • 3. $V(x_0,u)<\infty$ for all $x_0\in\mathbb{R}^n$. $\square$

We denote the set of admissible feedback laws as $\mathcal{U}_V$.

In (4), we consider cost functions $r:\mathbb{R}^n\times\mathbb{R}^m\to\mathbb{R}$ of the form $r(x,u)\coloneqq Q(x)+R(u)$, where the state cost is given by $Q(x)\coloneqq x^\top\Pi_xx$ with $\Pi_x\succ0$, and the control cost is given by $R(u)\coloneqq u^\top\Pi_uu$ with $\Pi_u\succ0$. To find the optimal control law that minimizes (4), we study the Hamiltonian $H:\mathbb{R}^n\times\mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}$ associated with (3) and (4), given by

$$H(x,u,\nabla V)\coloneqq\nabla V^\top\big(f(x)+g(x)u\big)+Q(x)+R(u). \tag{5}$$

Using (5), a necessary optimality condition for $u^*$ is given by Pontryagin's maximum principle [15]:

$$u^*(x)=\operatorname*{arg\,min}_{u\in\mathcal{U}_V}H(x,u,\nabla V^*)\implies u^*(x)=-\frac{1}{2}\Pi_u^{-1}g(x)^\top\nabla V^*(x), \tag{6}$$

where $V^*$ represents the optimal value function:

$$V^*(x)\coloneqq\inf_{u\in\mathcal{U}_V}V(x,u(\cdot)).$$
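For completeness, the implication in (6) can be verified directly from the first-order condition on the Hamiltonian (5): since $R(u)=u^\top\Pi_uu$ with $\Pi_u\succ0$,

$$\frac{\partial H}{\partial u}(x,u,\nabla V^*)=g(x)^\top\nabla V^*(x)+2\Pi_uu=0\implies u=-\frac{1}{2}\Pi_u^{-1}g(x)^\top\nabla V^*(x),$$

and this stationary point is the unique minimizer because $\nabla_u^2H=2\Pi_u\succ0$.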

On the other hand, under the assumption that $V^*$ is continuously differentiable, the optimal value function can be shown to satisfy the Hamilton-Jacobi-Bellman (HJB) equation [5, Ch. 1.4]:

$$\frac{\partial V^*}{\partial t}=-H(x,u^*,\nabla V^*)\quad\forall x\in\mathbb{R}^n.$$

Since the functional in (4) does not have an explicit dependence on $t$, it follows that $\frac{\partial V^*}{\partial t}=0$, and hence $H(x,u^*,\nabla V^*)=0$, meaning that for all $x\in\mathbb{R}^n$ the following holds:

$$\nabla V^{*\top}\big(f(x)+g(x)u^*(x)\big)+Q(x)+R\big(u^*(x)\big)=0. \tag{7}$$

The time-invariant Hamilton-Jacobi-Bellman equation (7) allows for a state-dependent characterization of optimality. Therefore, by using the optimal control law in (6), and assuming that the system dynamics (3) are known, (7) could be leveraged to find $V^*$. Unfortunately, finding an explicit closed-form expression for $V^*$, and thus for the optimal control law, is in general an intractable problem. However, the utility of (7) is not completely lost. As we show in the following sections, online and historical "measurements" of (7) can be leveraged in real time to estimate the optimal control law $u^*$ while concurrently rendering a neighborhood of the origin of system (3) asymptotically stable.

4 Data-Assisted Critic Dynamics

To leverage the form of (7), we consider the following parameterization of the optimal value function $V^*(x)$:

$$V^*(x)=\theta_c^{*\top}\phi_c(x)+\epsilon_c(x)\quad\forall x\in K, \tag{8}$$

where $K\subset\mathbb{R}^n$ is a compact set, $\theta_c^*\in\mathbb{R}^{l_c}$, $\phi_c:\mathbb{R}^n\to\mathbb{R}^{l_c}$ is a vector of continuously differentiable basis functions, and $\epsilon_c:\mathbb{R}^n\to\mathbb{R}$ is the approximation error. The parameterization (8) is always possible on compact sets due to the continuity properties of $V^*$ and the universal approximation theorem [16]. This parametrization results in an optimal Hamiltonian $H_p^*\coloneqq H\big(x,u^*,\frac{\partial\phi_c}{\partial x}^\top\theta_c^*+\nabla\epsilon_c\big)$ given by:

$$H_p^*(x)=\theta_c^{*\top}\psi(x,u^*(x))+Q(x)+R\big(u^*(x)\big)+\nabla\epsilon_c(x)^\top\big(f(x)+g(x)u^*(x)\big), \tag{9}$$

where we defined $\psi:\mathbb{R}^n\times\mathbb{R}^m\to\mathbb{R}^{l_c}$ as:

$$\psi(x,u)\coloneqq\frac{\partial\phi_c(x)}{\partial x}\big(f(x)+g(x)u\big). \tag{10}$$

We note that the explicit dependence of $\psi$ on the control action $u$ in (10) is a fundamental departure from previous approaches studied in the context of concurrent-learning (CL) NN actor-critic controllers, such as those considered in [17] and [18]. In particular, in the context of CL, the data used to estimate the optimal value function $V^*$ is generated from measurements of the optimal Hamiltonian which, by definition, incorporates the optimal control law $u^*$. Hence, the need to include $u$ as part of the regressor vector $\psi$ becomes crucial: this dependence characterizes how far our recorded measurements of a Hamiltonian are from the optimal Hamiltonian $H_p^*$. Indeed, this distance will explicitly emerge in our convergence and stability analysis. Naturally, the dependence of (10) on $u$ imposes stronger conditions on the recorded data needed to estimate $V^*$.

Assuming we have access to $\phi_c$, we can define a critic neural network as:

$$\hat{V}(x)\coloneqq\theta_c^\top\phi_c(x),\quad\forall x\in K, \tag{11}$$

which will serve as an approximation of the optimal value function $V^*$ in (8). This critic NN results in an estimated Hamiltonian:

$$H\big(x,u,\nabla\hat{V}\big)\coloneqq\theta_c^\top\psi(x,u)+Q(x)+R(u), \tag{12}$$

which we will use to design the update dynamics of the critic parameters $\theta_c$. In particular, our goal is to use previously recorded data from trajectories of the plant to ensure asymptotic stability of the set of optimal critic parameters $\{\theta_c^*\}$, while simultaneously enabling the incorporation of instantaneous measurements from the plant. Towards this end, we will assume enough "richness" in the recorded data, a notion captured by a relaxed (and finite-time) version of persistence of excitation (PE); see [13] and [19].

Assumption 1

Let $\{\psi(x_k,u^*(x_k))\}_{k=1}^N$ be a sequence of recorded data, and define:

$$\Lambda\coloneqq\sum_{k=1}^N\Psi(x_k,u^*(x_k))\Psi(x_k,u^*(x_k))^\top,\qquad\Psi(x,u)\coloneqq\frac{\psi(x,u)}{1+\psi(x,u)^\top\psi(x,u)}. \tag{13}$$

There exists $\underline{\lambda}\in\mathbb{R}_{>0}$ such that $\Lambda\succeq\underline{\lambda}I_{l_c}$, i.e., the data is $\underline{\lambda}$-sufficiently-rich ($\underline{\lambda}$-SR). $\square$
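As a sanity check on Assumption 1, the matrix $\Lambda$ in (13) can be formed directly from the recorded regressors, and the data is $\underline{\lambda}$-SR for any $\underline{\lambda}$ no larger than the smallest eigenvalue of $\Lambda$; a minimal sketch (the array conventions are ours):

```python
import numpy as np

def richness_level(psi_data):
    """psi_data has shape (N, l_c), with row k storing psi(x_k, u*(x_k)).
    Returns (lambda_min(Lambda), Lambda) for the normalized matrix in (13)."""
    l_c = psi_data.shape[1]
    Lam = np.zeros((l_c, l_c))
    for psi in psi_data:
        Psi = psi / (1.0 + psi @ psi)     # normalized regressor Psi(x, u)
        Lam += np.outer(Psi, Psi)
    return np.min(np.linalg.eigvalsh(Lam)), Lam
```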

Remark 1

In this paper, we study reinforcement learning dynamics that do not make explicit use of exploration signals with standard PE properties, which can be difficult to guarantee in practice. Instead, we assume access to samples obtained by observing the optimal inputs $u^*(x_k)$ acting on the plant. Note, however, that this does not imply knowledge of the optimal control policy as a whole, but only of a finite number of demonstrations from an "expert" policy. Similar requirements commonly arise in the literature on imitation learning and inverse reinforcement learning, and have recently been shown in practice to reduce the exploratory requirements of online reinforcement learning algorithms, under mild assumptions on the sampling of the demonstrations. For recent discussions on these topics in the discrete-time stochastic reinforcement learning setting, we refer the reader to [20] and [21].

Now, we consider the instantaneous and data-dependent errors of the estimated Hamiltonian with respect to the optimal one:

$$e^i(\theta_c,x,u)\coloneqq H\big(x,u,\nabla\hat{V}\big)-H\big(x,u^*(x),\nabla V^*\big)=\theta_c^\top\psi(x,u)+Q(x)+R(u),$$
$$e^d_k(\theta_c)\coloneqq H\big(x_k,u^*(x_k),\nabla\hat{V}\big)-H\big(x_k,u^*(x_k),\nabla V^*\big)=\theta_c^\top\psi(x_k,u^*(x_k))+Q(x_k)+R\big(u^*(x_k)\big),$$

where we used the fact that $H(x,u^*(x),\nabla V^*)=0$. Moreover, we define the joint instantaneous and data-dependent error as:

$$e(\theta_c,x,u)\coloneqq\frac{1}{2}\Bigg(\rho_i\frac{e^i(\theta_c,x,u)^2}{\big(1+|\psi(x,u)|^2\big)^2}+\rho_d\sum_{k=1}^N\frac{e^d_k(\theta_c)^2}{\big(1+|\psi(x_k,u^*(x_k))|^2\big)^2}\Bigg), \tag{14}$$

where $\rho_i\in\mathbb{R}_{\geq 0}$ and $\rho_d\in\mathbb{R}_{>0}$ are tunable gains. Since we are interested in designing real-time training dynamics for the estimation of the optimal parameters $\theta_c^*$, we compute the gradient of (14) with respect to $\theta_c$ as follows:

$$\nabla_{\theta_c}e(\theta_c,x,u)=\rho_i\left(\Psi(x,u)\Psi(x,u)^\top\theta_c+\frac{\psi(x,u)\big[Q(x)+R(u)\big]}{\big(1+\psi(x,u)^\top\psi(x,u)\big)^2}\right)+\rho_d\left(\Lambda\theta_c+\sum_{k=1}^N\frac{\psi(x_k,u^*(x_k))\big[Q(x_k)+R\big(u^*(x_k)\big)\big]}{\big(1+\psi(x_k,u^*(x_k))^\top\psi(x_k,u^*(x_k))\big)^2}\right), \tag{15}$$

where $\Lambda$ and $\Psi$ are defined in Assumption 1. The "propagated" error in the HJB equation that results from the approximate parametrization of $V^*$ in (8) is given by:

$$\epsilon_{\text{HJB}}(x)\coloneqq H\big(x,u^*(x),\nabla V^*\big)-H\left(x,u^*,\frac{\partial\phi_c(x)}{\partial x}^\top\theta_c^*\right)=-\nabla\epsilon_c(x)^\top\big(f(x)+g(x)u^*(x)\big). \tag{16}$$
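For concreteness, the gradient (15) can be assembled directly from online measurements and the recorded data; the following sketch assumes user-supplied callables psi_fun, Q, and R implementing (10) and the costs, with psi_data[k] storing $\psi(x_k,u^*(x_k))$ and qr_data[k] storing $Q(x_k)+R(u^*(x_k))$:

```python
import numpy as np

def make_grad_e(psi_data, qr_data, psi_fun, Q, R, rho_i=1.0, rho_d=1.0):
    """Return a function computing the gradient (15) of the joint error (14)."""
    def grad_e(theta_c, x, u):
        # Instantaneous (normalized) term of (15).
        psi = psi_fun(x, u)
        den = (1.0 + psi @ psi) ** 2
        grad = rho_i * (np.outer(psi, psi) @ theta_c
                        + psi * (Q(x) + R(u))) / den
        # Recorded-data term of (15); the theta_c part sums to rho_d * Lambda @ theta_c.
        for psi_k, qr_k in zip(psi_data, qr_data):
            den_k = (1.0 + psi_k @ psi_k) ** 2
            grad += rho_d * (np.outer(psi_k, psi_k) @ theta_c
                             + psi_k * qr_k) / den_k
        return grad
    return grad_e
```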

The following assumption is standard, and it is satisfied when the involved functions are continuous and KK is compact.

Assumption 2

There exist $\overline{\phi_c},\overline{d\phi_c},\overline{\epsilon_c},\overline{d\epsilon_c},\overline{\epsilon_{\text{HJB}}},\overline{g}\in\mathbb{R}_{>0}$ such that

$$|\phi_c(x)|\leq\overline{\phi_c},\quad\left|\frac{\partial\phi_c(x)}{\partial x}\right|\leq\overline{d\phi_c},\quad|\epsilon_c(x)|\leq\overline{\epsilon_c},$$
$$|\nabla\epsilon_c(x)|\leq\overline{d\epsilon_c},\quad|\epsilon_{\text{HJB}}(x)|\leq\overline{\epsilon_{\text{HJB}}},\quad|g(x)|\leq\overline{g}\quad\forall x\in K,$$

where $K$ is the same set considered in (8). $\square$

4.1 Critic Dynamics via Data-Driven Hybrid Momentum-Based Control

To design fast asymptotically stable dynamics for the estimate $\theta_c$, we propose a new class of momentum-based critic dynamics inspired by accelerated gradient flows with restarting mechanisms, such as those studied in [7] and [11]. Specifically, we consider the following hybrid dynamics of the form (1), with state $y\coloneqq(\theta_c,p,\tau)$ and elements:

$$C_c\coloneqq\left\{y\in\mathbb{R}^{2l_c+1}:\tau\in[T_0,T]\right\},\qquad F_c(y,x,u)\coloneqq\begin{pmatrix}\frac{2}{\tau}(p-\theta_c)\\-2k_c\nabla_{\theta_c}e(\theta_c,x,u)\\\frac{1}{2}\end{pmatrix}, \tag{17a}$$
$$D_c\coloneqq\left\{y\in\mathbb{R}^{2l_c+1}:\tau=T\right\},\qquad G_c(y)\coloneqq\begin{pmatrix}\theta_c\\\theta_c\\T_0\end{pmatrix}, \tag{17b}$$

where $k_c\in\mathbb{R}_{>0}$ is a tunable gain, and $(p,\tau)$ are auxiliary states that are periodically reset every time $\tau=T$ via the jump map (17b), with $\infty>T>T_0>0$. The dynamical system in (17) flows in continuous time according to (17a) whenever the timer variable $\tau$ is in $[T_0,T]$. As soon as $\tau$ hits $T$, the algorithm (17) resets the timer variable to $T_0$, and the momentum variable $p$ to $\theta_c$, while leaving $\theta_c$ unaffected. Accordingly, since $\dot{\tau}=\frac{1}{2}$, after the first reset the system exhibits periodic resets every $\Delta T=2(T-T_0)$ units of time. The following assumption provides data-dependent tuning guidelines for the resetting frequency of the timer variable $\tau$, which will be leveraged in our stability results.
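As an illustration, the sketch below encodes the flow and jump maps of (17) in the template of the hybrid-system simulator from Section 2, reusing a gradient routine grad_e implementing (15) (such as the one sketched above); the flat stacking of the state $y=(\theta_c,p,\tau)$ is our own convention, and the plant signals $(x,u)$ must be closed over, or appended to the state, in a full closed-loop simulation.

```python
import numpy as np

def make_critic_hds(grad_e, l_c, k_c=1.0, T0=0.1, T=5.5):
    """Hybrid data (C_c, F_c, D_c, G_c) of the critic dynamics (17),
    with state y = (theta_c, p, tau) stacked in R^{2*l_c + 1}."""
    def F_c(y, x, u):
        theta_c, p, tau = y[:l_c], y[l_c:2 * l_c], y[-1]
        return np.concatenate([
            (2.0 / tau) * (p - theta_c),           # momentum-driven theta_c flow
            -2.0 * k_c * grad_e(theta_c, x, u),    # gradient flow on (14)
            [0.5],                                 # timer: tau_dot = 1/2
        ])
    def G_c(y):
        theta_c = y[:l_c]
        return np.concatenate([theta_c, theta_c, [T0]])  # reset p and tau
    in_C = lambda y: T0 <= y[-1] <= T
    in_D = lambda y: y[-1] >= T
    return F_c, G_c, in_C, in_D
```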

Assumption 3

The tunable parameters $(T_0,T,k_c,\rho_i,\rho_d)$ satisfy $2\rho_d\underline{\lambda}>\rho_i$ and

$$T_0^2+\frac{1}{2k_c\underline{\lambda}\rho_d}<T^2<\frac{8\rho_d\underline{\lambda}}{k_c\rho_i^2}, \tag{18}$$

where $\underline{\lambda}$ is the level of richness of the recorded data defined in Assumption 1. $\square$

For system (17), we study stability properties with respect to the compact set:

$$\mathcal{A}_c\coloneqq\mathcal{A}_{\theta_c,p}\times[T_0,T], \tag{19a}$$
$$\mathcal{A}_{\theta_c,p}\coloneqq\left\{(\theta_c,p)\in\mathbb{R}^{2l_c}:p=\theta_c,\ \theta_c=\theta_c^*\right\}. \tag{19b}$$

The following theorem is the first main result of this paper. All the proofs are presented in the Appendices.

Theorem 1

Given a number $l_c$ of basis functions $\phi_c$ parametrizing the critic NN, and a compact set $K\subset\mathbb{R}^n$, suppose that Assumptions 1, 2, and 3 are satisfied. Then, there exist $(\kappa,c)\in\mathbb{R}_{>0}\times\mathbb{R}_{>0}$ and class-$\mathcal{K}_\infty$ functions $\gamma_1$ and $\gamma_2$ such that, for every solution $y=(\theta_c,p,\tau)$ to (17) with initial condition $y(0,0)=(\theta_c(0,0),p(0,0),\tau(0,0))$, obtained while a control policy $u(\cdot)\in\mathcal{U}_V$ acts on the plant, the critic parameters $\theta_c$ satisfy

$$|\theta_c(t,j)-\theta_c^*|\leq\kappa e^{-c(t+j)}|y(0,0)|_{\mathcal{A}_c}+\gamma_2\big(|\tilde{u}(x(t,j))|\big)+\gamma_1\big(\overline{\epsilon_{\text{HJB}}}\big), \tag{20}$$

for all $(t,j)\in\text{dom}(y)$, where $\tilde{u}(x(t,j))\coloneqq u(x(t,j))-u^*(x(t,j))$. $\square$

The presence in (20) of a residual optimal-control mismatch term of the form $\gamma_2(|u(x)-u^*(x)|)$ represents a crucial difference with respect to previous CL adaptive dynamic programming approaches, such as those studied in [17] and [5, Ch. 4]. This term is a direct byproduct of our definition of $\psi$ in (10), its dependence on the control action $u$, and its appearance in the error gradient (15). In principle, the emergence of this term in Theorem 1 is agnostic to the particular gradient-based update dynamics used for the critic NN, regardless of whether momentum is included. Since $\gamma_2\in\mathcal{K}_\infty$, the larger the difference between the nominal input $u$ and the optimal feedback law $u^*$, the greater the residual error in the convergence of $\theta_c$. In particular, the bound (20) describes a semi-global practical input-to-state stability property that, to the best knowledge of the authors, is novel in the context of CL-based RL. In the next section we show that the residual error $\gamma_2(|\tilde{u}|)$ can be removed by incorporating an additional actor NN into the system.

Remark 2

In contrast to standard data-driven gradient-descent dynamics for the estimation of the optimal value function $V^*$, which can achieve exponential rates of convergence proportional to $\underline{\lambda}$ (cf. [18, 13]), under the assumptions of Theorem 1 the critic update dynamics (17) can achieve exponential convergence with rates proportional to $\sqrt{\underline{\lambda}}$. As shown in [9], momentum-based dynamics of this form can achieve these rates using the restarting parameter

$$T=T^*\coloneqq e\sqrt{\frac{1}{2k_c\rho_d\underline{\lambda}}+T_0^2}. \tag{21}$$

This property is particularly useful in settings where the level of richness of the data set is limited, i.e., when $\underline{\lambda}\ll 1$, which is common in practical applications.
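For illustration, (21) is immediate to evaluate once $\underline{\lambda}$ is known; in the sketch below the value for $\underline{\lambda}$ is a hypothetical placeholder, since the actual richness level depends on the recorded data set:

```python
import numpy as np

def restart_period(k_c, rho_d, lam, T0):
    """Restarting parameter T* from (21); lam is the richness level."""
    return np.e * np.sqrt(1.0 / (2.0 * k_c * rho_d * lam) + T0 ** 2)

# Example: with k_c = rho_d = 1, T0 = 0.1, and a hypothetical lam = 0.5,
# restart_period(1.0, 1.0, 0.5, 0.1) evaluates to about 2.73.
```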

Theorem 1 guarantees exponential convergence to a neighborhood of the optimal parameters $\{\theta_c^*\}$ that define the optimal value function $V^*$. Consequently, by continuity, and on compact sets, $\hat{V}$ converges to an $\epsilon$-approximation of $V^*$, which can be leveraged by the control law (6) to stabilize system (3). However, as noted in [22], implementing only critic structures for the control of nonlinear dynamical systems of the form (3) can lead to poor closed-loop transient performance. To tackle this issue, we consider an auxiliary dynamical system, called the actor, which serves as an estimator of the optimal controller acting on the plant.

Figure 1: Proposed hybrid momentum-based dynamics for the training of the critic subsystem.

5 Actor Dynamics

Figure 2: Actor subsystem.

Using the optimal value function parametrization described in Section 4, the optimal control law can be written as:

$$u^*(x)=-\frac{1}{2}\Pi_u^{-1}g(x)^\top\left[\frac{\partial\phi_c(x)}{\partial x}^\top\theta_c^*+\nabla\epsilon_c(x)\right],\quad\forall x\in K. \tag{22}$$

Therefore, using $\frac{\partial\phi_c(x)}{\partial x}$ and $g(x)$, we can implement an actor neural network given by:

$$\hat{u}(x)=\omega(x)^\top\theta_u, \tag{23}$$

where $\omega:\mathbb{R}^n\to\mathbb{R}^{l_c\times m}$ is defined as:

$$\omega(x)\coloneqq-\frac{1}{2}\frac{\partial\phi_c(x)}{\partial x}g(x)\Pi_u^{-1}. \tag{24}$$

To guarantee convergence of $\hat{u}$ to $u^*$, we design update dynamics for $\theta_u\in\mathbb{R}^{l_c}$ based on the minimization of the error:

$$\varepsilon(x,\theta_c,\theta_u)\coloneqq\frac{1}{2}\left[\alpha_1\frac{\varepsilon_a(x,\theta_c,\theta_u)^\top\varepsilon_a(x,\theta_c,\theta_u)}{1+\text{Tr}\big(\omega(x)^\top\omega(x)\big)}+\alpha_2\,\varepsilon_b(\theta_c,\theta_u)^\top\varepsilon_b(\theta_c,\theta_u)\right],$$
$$\varepsilon_a(x,\theta_c,\theta_u)\coloneqq\hat{u}(x)-\omega(x)^\top\theta_c=\omega(x)^\top(\theta_u-\theta_c),\qquad\varepsilon_b(\theta_c,\theta_u)\coloneqq\theta_u-\theta_c, \tag{25}$$

which satisfies:

$$\nabla_{\theta_u}\varepsilon(x,\theta_c,\theta_u)=\Omega(x)(\theta_u-\theta_c),$$

where

$$\Omega(x)\coloneqq\alpha_1\frac{\omega(x)\omega(x)^\top}{1+\text{Tr}\big(\omega(x)^\top\omega(x)\big)}+\alpha_2I_{l_c}\in\mathbb{R}^{l_c\times l_c}\quad\forall x\in\mathbb{R}^n. \tag{26}$$

Based on these definitions, we consider the following gradient-descent dynamics for the actor neural network:

$$\dot{\theta}_u=F_u(\theta_u,x,\theta_c)\coloneqq-k_u\nabla_{\theta_u}\varepsilon(x,\theta_c,\theta_u), \tag{27}$$

where $k_u\in\mathbb{R}_{>0}$ is a tunable gain. A scheme representing these update dynamics is shown in Figure 2.
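The actor pieces (23), (24), (26), and (27) admit a compact implementation; the following sketch assumes user-supplied callables dphi_c and g returning the Jacobian $\frac{\partial\phi_c(x)}{\partial x}$ (of size $l_c\times n$) and the input matrix $g(x)$:

```python
import numpy as np

def actor_rhs(theta_u, theta_c, x, dphi_c, g, Pi_u_inv,
              k_u=1.0, alpha1=1.0, alpha2=1.0):
    """Actor dynamics (27): theta_u_dot = -k_u * Omega(x) @ (theta_u - theta_c)."""
    omega = -0.5 * dphi_c(x) @ g(x) @ Pi_u_inv              # (24), l_c x m
    Omega = (alpha1 * (omega @ omega.T)
             / (1.0 + np.trace(omega.T @ omega))
             + alpha2 * np.eye(omega.shape[0]))             # (26)
    return -k_u * Omega @ (theta_u - theta_c)

def actor_control(theta_u, x, dphi_c, g, Pi_u_inv):
    """Actor NN feedback (23): u_hat(x) = omega(x)^T theta_u."""
    omega = -0.5 * dphi_c(x) @ g(x) @ Pi_u_inv
    return omega.T @ theta_u
```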

6 Momentum-Based Actor-Critic Feedback System

Consider the closed-loop system resulting from the interconnection of the plant (3), the critic update dynamics (17), the actor update dynamics (27), and the feedback law (23), shown in Figure 3(a) and given by:

$$\dot{x}=f(x)+g(x)\hat{u}(x),\qquad x^+=x, \tag{28a}$$
$$\dot{y}=F_c(y,x,\hat{u}(x)),\qquad y^+=G_c(y), \tag{28b}$$
$$\dot{\theta}_u=F_u(\theta_u,x,\theta_c),\qquad\theta_u^+=\theta_u, \tag{28c}$$

with flow set and jump set given by $C=\mathbb{R}^n\times C_c\times\mathbb{R}^{l_c}$ and $D=\mathbb{R}^n\times D_c\times\mathbb{R}^{l_c}$, respectively, where $C_c$ and $D_c$ are defined in (17). Let $z\coloneqq(x,y,\theta_u)$ be the overall state of the closed-loop system, and define:

$$\mathcal{A}\coloneqq\{0\}\times\mathcal{A}_c\times\{\theta_c^*\}.$$

The following is the main result of this paper.

Theorem 2

Given the vector of basis functions $\phi_c:\mathbb{R}^n\to\mathbb{R}^{l_c}$ parametrizing the critic NN and a compact set $K_z\coloneqq K\times K_y\times K_\theta\subset\mathbb{R}^n\times\mathbb{R}^{2l_c+1}\times\mathbb{R}^{l_c}$, where $K$ is given as in (8), suppose that Assumptions 1-3 are satisfied. Then, there exist $\beta\in\mathcal{KL}$, $\gamma\in\mathcal{K}$, and tunable parameters $(\rho_i,\rho_d,k_c,k_u,\alpha_1,\alpha_2)$ such that, for every solution $z=(x,y,\theta_u)$ to the closed-loop system (28) with initial condition $z(0,0)=(x(0,0),y(0,0),\theta_u(0,0))\in K_z$, there exists $\tilde{T}>0$ such that, for all $(t,j)\in\text{dom}(z)$:

$$|z(t,j)|_{\mathcal{A}}\leq\beta\big(|z(0,0)|_{\mathcal{A}},t+j\big)+\gamma\big(\big|\big(\overline{\epsilon_{\text{HJB}}},\overline{d\epsilon_c}\big)\big|\big)+\nu,$$

for all $0\leq t+j\leq\tilde{T}$, and

$$|z(t,j)|_{\mathcal{A}}\leq\gamma\big(\big|\big(\overline{\epsilon_{\text{HJB}}},\overline{d\epsilon_c}\big)\big|\big)+\nu,\qquad\forall\ t+j\geq\tilde{T},$$

for some constant $\nu>0$. $\square$

Theorem 2 establishes asymptotic convergence, from any compact set $K_z$ and modulo the error $\nu$, to a neighborhood of the compact set $\mathcal{A}$ that shrinks as $(\overline{\epsilon_{\text{HJB}}},\overline{d\epsilon_c})\to 0$, under a suitable choice of the tunable parameters. To the best knowledge of the authors, this is the first result providing stability certificates for continuous-time actor-critic reinforcement learning using recorded data and accelerated value-function estimation dynamics with momentum. In addition, since the resulting closed-loop system (28) is a well-posed hybrid system, the stability results are robust with respect to arbitrarily small additive disturbances on the states and dynamics [12, Ch. 7].

7 Numerical Example

Figure 3: Closed-loop system diagram and numerical example. (a) Closed-loop system. (b) Convergence of the critic (left) and actor (right) neural networks' weights to the optimal values.

In this section, we present a numerical experiment that illustrates our theoretical results. In particular, we study the following nonlinear control-affine plant:

$$\dot{x}=f(x)+g(x)u, \tag{29a}$$
$$f(x)=\begin{bmatrix}-x_1+x_2\\-\frac{1}{2}\Big(x_1+x_2\big(1-(\cos(2x_1)+2)^2\big)\Big)\end{bmatrix}, \tag{29b}$$
$$g(x)\coloneqq\begin{bmatrix}0\\\cos(2x_1)+2\end{bmatrix}, \tag{29c}$$

with local state and control costs given by $Q(x)=x^\top x$ and $R(u)=u^2$ [18]. The optimal value function for this setting is $V^*(x)=\frac{1}{2}x_1^2+x_2^2$, with optimal control law $u^*(x)=-(\cos(2x_1)+2)x_2$. Using this information, we choose $\phi_c(x)=(x_1^2,x_1x_2,x_2^2)$, and we implement the hybrid momentum-based dynamics (17) for the update of the critic neural network, together with the actor update dynamics (27). We obtain the results shown in Figure 3(b) with $x(0,0)=(-10,10)$, $\theta_c(0,0)=(1,1,1)$, and $\theta_u(0,0)\in[0,1]^3$. We compare the results with the case in which the critic neural network is updated with the gradient-descent dynamics of [17], and where the sufficiently rich data consist of 16 data points obtained by sampling the dynamics (29) on a $4\times 4$ grid around the origin. In our simulations we use $T_0=0.1$ and $T=5.5$ for the momentum-based dynamics (17). These particular values are obtained from the level of richness $\underline{\lambda}$ of the data set and the inequalities in (18), ensuring compliance with Assumption 3. For both reinforcement learning dynamics we use $k_c=1$, $k_u=1$, $\rho_d=1$, and $\rho_i=1$. As shown in the figure, both update dynamics converge to $\{\theta_c^*\}$, with $\theta_c^*=(1/2,0,1)$ describing the optimal value function $V^*$. However, the hybrid momentum-based dynamics significantly improve the transient performance of the learning mechanism. (The code used to implement this simulation can be found in the following repository: https://github.com/deot95/Accelerated-Continuous-Time-Approximate-Dynamic-Programming-through-Data-Assisted-Hybrid-Control)
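For reference, the following sketch encodes the plant (29), the basis $\phi_c$, and the known optimal quantities of this example in a form compatible with the simulation sketches of Sections 2, 4, and 5; it reproduces the setup only, not the figures:

```python
import numpy as np

# Plant (29) and basis phi_c(x) = (x1^2, x1*x2, x2^2) from this section.
f = lambda x: np.array([
    -x[0] + x[1],
    -0.5 * (x[0] + x[1] * (1.0 - (np.cos(2 * x[0]) + 2.0) ** 2)),
])
g = lambda x: np.array([[0.0], [np.cos(2 * x[0]) + 2.0]])
dphi_c = lambda x: np.array([[2 * x[0], 0.0],
                             [x[1], x[0]],
                             [0.0, 2 * x[1]]])    # Jacobian of phi_c
Pi_u_inv = np.array([[1.0]])                      # R(u) = u^2, so Pi_u = 1

# Known optimal solution, used to validate the learned weights:
theta_c_star = np.array([0.5, 0.0, 1.0])          # V*(x) = 0.5 x1^2 + x2^2
u_star = lambda x: np.array([-(np.cos(2 * x[0]) + 2.0) * x[1]])

# Regressor (10): psi(x, u) = dphi_c(x) @ (f(x) + g(x) @ u).
psi = lambda x, u: dphi_c(x) @ (f(x) + g(x) @ u)
```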

8 Conclusions

In this paper, we introduced the first stability guarantees for deterministic continuous-time actor-critic reinforcement learning with accelerated training of neural network structures. To do so, we studied a novel hybrid momentum-based estimation dynamical system for the critic NN, which estimates, in real time, the optimal value function. Our stability analysis leveraged the existence of rich recorded data taken from a finite number of samples along optimal trajectories and inputs of the system. We showed that this finite sequence of samples can be used to train the controller to achieve online optimal performance with fast transient performance. Closed-loop stability was established using tools from hybrid dynamical systems theory. Potential extensions include the study of similar accelerated training dynamics for the actor subsystem, as well as considering reinforcement learning problems in hybrid plants.

References

  • [1] J. Ibarz, J. Tan, C. Finn, M. Kalakrishnan, P. Pastor, and S. Levine, “How to train your robot with deep reinforcement learning: lessons we have learned,” The International Journal of Robotics Research, vol. 40, no. 4-5, pp. 698–721, 2021.
  • [2] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez, “Deep reinforcement learning for autonomous driving: A survey,” IEEE Transactions on Intelligent Transportation Systems, 2021.
  • [3] J. Martinez-Piazuelo, D. E. Ochoa, N. Quijano, and L. F. Giraldo, “A multi-critic reinforcement learning method: An application to multi-tank water systems,” IEEE Access, vol. 8, pp. 173227–173238, 2020.
  • [4] K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever, “Handbook of reinforcement learning and control,” 2021.
  • [5] R. Kamalapurkar, P. Walters, J. Rosenfeld, and W. E. Dixon, Reinforcement learning for optimal feedback control: A Lyapunov-based approach. Springer, 2018.
  • [6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [7] W. Su, S. Boyd, and E. Candes, “A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights,” J. of Machine Learning Research, vol. 17, no. 153, pp. 1–43, 2016.
  • [8] A. Wibisono, A. C. Wilson, and M. I. Jordan, “A variational perspective on accelerated methods in optimization,” Proceedings of the National Academy of Sciences, vol. 113, no. 47, pp. E7351–E7358, 2016.
  • [9] J. I. Poveda and N. Li, “Robust hybrid zero-order optimization algorithms with acceleration via averaging in continuous time,” Automatica, vol. 123, 2021.
  • [10] J. I. Poveda and A. R. Teel, “The heavy-ball ode with time-varying damping: Persistence of excitation and uniform asymptotic stability,” in 2020 American Control Conference (ACC), pp. 773–778, IEEE, 2020.
  • [11] B. O’Donoghue and E. J. Candès, “Adaptive restart for accelerated gradient schemes,” Foundations of Computational Mathematics, vol. 15, no. 3, pp. 715–732, 2013.
  • [12] R. Goebel, R. G. Sanfelice, and A. R. Teel, Hybrid Dynamical Systems: Modeling, Stability, and Robustness. Princeton University Press, 2012.
  • [13] G. Chowdhary and E. Johnson, “Concurrent learning for convergence in adaptive control without persistency of excitation,” in 49th IEEE Conference on Decision and Control (CDC), pp. 3674–3679, IEEE, 2010.
  • [14] R. W. Beard, G. N. Saridis, and J. T. Wen, “Galerkin approximations of the generalized hamilton-jacobi-bellman equation,” Automatica, vol. 33, no. 12, pp. 2159–2177, 1997.
  • [15] D. Liberzon, Calculus of variations and optimal control theory. Princeton university press, 2011.
  • [16] K. Hornik, M. Stinchcombe, and H. White, “Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks,” Neural networks, vol. 3, no. 5, pp. 551–560, 1990.
  • [17] K. G. Vamvoudakis, M. F. Miranda, and J. P. Hespanha, “Asymptotically stable adaptive–optimal control algorithm with saturating actuators and relaxed persistence of excitation,” IEEE transactions on neural networks and learning systems, vol. 27, no. 11, pp. 2386–2398, 2015.
  • [18] R. Kamalapurkar, P. Walters, and W. E. Dixon, “Model-based reinforcement learning for approximate optimal regulation,” Automatica, vol. 64, pp. 94–104, 2016.
  • [19] K. J. Astrom and B. Wittenmark, Adaptive Control. Addison-Wesley Publishing Company, 1989.
  • [20] K. Ciosek, “Imitation learning by reinforcement learning,” in International Conference on Learning Representations, 2022.
  • [21] P. Rashidinejad, B. Zhu, C. Ma, J. Jiao, and S. Russell, “Bridging offline reinforcement learning and imitation learning: A tale of pessimism,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [22] K. Doya, “Reinforcement learning in continuous time and space,” Neural computation, vol. 12, no. 1, pp. 219–245, 2000.
  • [23] C. Cai and A. R. Teel, “Characterizations of input-to-state stability for hybrid systems,” Systems & Control Letters, vol. 58, no. 1, pp. 47–53, 2009.
  • [24] H. K. Khalil, Nonlinear Systems. Upper Saddle River, NJ: Prentice Hall, 2002.
  • [25] D. E. Ochoa, J. I. Poveda, A. Subbaraman, G. S. Schmidt, and F. R. Pour-Safaei, “Accelerated concurrent learning algorithms via data-driven hybrid dynamics and nonsmooth odes,” in Learning for Dynamics and Control, pp. 866–878, PMLR, 2021.

Appendix A Proof Theorem 1

A.1 Gradient of Critic Error-Function in Deviation Variables

First, using (16) together with $H(x,u^*(x),\nabla V^*)=0$ for all $x$, we obtain:

$$\psi(x,u^*(x))^\top\theta_c^*+Q(x)+R\big(u^*(x)\big)=\epsilon_{\text{HJB}}(x). \tag{30}$$

Thus, using (15) and (30), we can rewrite the gradient of $e(\theta_c,x,u)$ as follows:

$$\nabla_{\theta_c}e(\theta_c,x,u)=\Theta(x,u)\big(\theta_c-\theta_c^*\big)+\upsilon_\epsilon(x,u)+\chi(x,u), \tag{31}$$

where

$$\Theta(x,u)\coloneqq\rho_i\Psi(x,u)\Psi(x,u)^\top+\rho_d\Lambda, \tag{32}$$

and

$$\upsilon_\epsilon(x,u)\coloneqq\rho_i\frac{\psi(x,u)\epsilon_{\text{HJB}}(x)}{\big(1+|\psi(x,u)|^2\big)^2}+\rho_d\sum_{k=1}^N\frac{\psi(x_k,u^*(x_k))\epsilon_{\text{HJB}}(x_k)}{\big(1+|\psi(x_k,u^*(x_k))|^2\big)^2}\ \in\mathbb{R}^{l_c}, \tag{33}$$
$$\chi(x,u)\coloneqq\frac{\rho_i\psi(x,u)\Big[\frac{\partial\phi_c(x)}{\partial x}g(x)\big(u-u^*(x)\big)\Big]^\top\theta_c^*}{\big(1+|\psi(x,u)|^2\big)^2}+\frac{\rho_i\psi(x,u)\big[R(u)-R(u^*(x))\big]}{\big(1+|\psi(x,u)|^2\big)^2}\ \in\mathbb{R}^{l_c}, \tag{34}$$

which, by using the fact that $\frac{r}{(1+r^2)^2}\leq\frac{3\sqrt{3}}{16}$ for all $r\in\mathbb{R}_{\geq 0}$, satisfy:

$$|\upsilon_\epsilon(x,u)|\leq\frac{3\sqrt{3}}{16}\overline{\epsilon_{\text{HJB}}}\big(\rho_i+N\rho_d\big), \tag{35a}$$
$$|\chi(x,u)|\leq\rho_i\frac{3\sqrt{3}}{16}\Big(\overline{g}\big(\overline{d\phi_c}\big[1+|\theta_c^*|\big]+\overline{d\epsilon_c}\big)|u-u^*(x)|+\lambda_{\max}(\Pi_u)|u-u^*(x)|^2\Big). \tag{35b}$$

The following Lemma will be instrumental for our results.

Lemma 1

If the data is $\underline{\lambda}$-sufficiently-rich, then there exist $\overline{\Theta},\underline{\Theta}\in\mathbb{R}_{>0}$ such that

$$\underline{\Theta}I_{l_c}\preceq\Theta(x,u)\preceq\overline{\Theta}I_{l_c}\qquad\forall x\in\mathbb{R}^n,\ \forall u\in\mathbb{R}^m.$$
  • Proof. Let $\theta\in\mathbb{R}^{l_c}$ be arbitrary. Since, by assumption, the data is $\underline{\lambda}$-SR, it follows that:

$$\theta^\top\Theta(x,u)\theta=\rho_i\theta^\top\Psi(x,u)\Psi(x,u)^\top\theta+\rho_d\theta^\top\Lambda\theta\geq\rho_d\underline{\lambda}|\theta|^2\implies\Theta(x,u)\succeq\underline{\Theta}I_{l_c},\ \forall(x,u)\in\mathbb{R}^n\times\mathbb{R}^m, \tag{36}$$

where $\underline{\Theta}\coloneqq\rho_d\underline{\lambda}$. On the other hand, using the fact that $|aa^\top|=|a|^2$ for all $a\in\mathbb{R}^n$, we have $|\Psi(x,u)\Psi(x,u)^\top|=|\Psi(x,u)|^2\leq 1$ for all $(x,u)\in\mathbb{R}^n\times\mathbb{R}^m$, and hence:

$$\theta^\top\Theta(x,u)\theta=\rho_i\theta^\top\Psi(x,u)\Psi(x,u)^\top\theta+\rho_d\theta^\top\Lambda\theta\leq\big(\rho_i+\rho_d\lambda_{\max}(\Lambda)\big)|\theta|^2\implies\Theta(x,u)\preceq\overline{\Theta}I_{l_c},\quad\forall(x,u)\in\mathbb{R}^n\times\mathbb{R}^m,$$

where $\overline{\Theta}\coloneqq\rho_i+\rho_d\lambda_{\max}(\Lambda)$. $\blacksquare$

A.2 Lyapunov-Based Analysis

Recall from Section 4 that $y=(\theta_c,p,\tau)$, suppose that the assumptions of Theorem 1 hold, and consider the Lyapunov function candidate $V_c:\mathbb{R}^{l_c}\times\mathbb{R}^{l_c}\times\mathbb{R}_{>0}\to\mathbb{R}_{\geq 0}$ given by:

$$V_c(y)\coloneqq\frac{|p-\theta_c|^2}{4}+\frac{|p-\theta_c^*|^2}{4}+k_c\rho_d\tau^2\,\frac{(\theta_c-\theta_c^*)^\top\Lambda(\theta_c-\theta_c^*)}{2}, \tag{37}$$

where $\Lambda$ was defined in Assumption 1, and which satisfies:

$$\underline{c}|y|_{\mathcal{A}_c}^2\leq V_c(y)\leq\overline{c}|y|_{\mathcal{A}_c}^2, \tag{38}$$
$$\underline{c}\coloneqq\min\left\{\frac{1}{4},\ \frac{k_c\rho_dT_0^2\underline{\lambda}}{2}\right\},\qquad\overline{c}\coloneqq\max\left\{\frac{3}{4},\ \frac{1}{2}\big(1+k_c\rho_dT^2\overline{\lambda}\big)\right\},$$

where $\overline{\lambda}\coloneqq\lambda_{\max}(\Lambda)$. Now, let $u\in\mathcal{U}_V$, and consider the time derivative of $V_c$ along the continuous-time evolution of the critic subsystem, i.e., $\dot{V}_c=\nabla_yV_c(y)^\top\dot{y}$. Then, by using (31) and Lemma 1, and after some algebraic manipulation, $\dot{V}_c$ can be shown to satisfy

$$\dot{V}_c\leq-\begin{pmatrix}|p-\theta_c|&|\theta_c-\theta_c^*|\end{pmatrix}M(\tau)\begin{pmatrix}|p-\theta_c|\\|\theta_c-\theta_c^*|\end{pmatrix}+2\sqrt{2}k_c|y|_{\mathcal{A}_c}\big(|\upsilon_\epsilon(x,u(x))|+|\chi(x,u(x))|\big), \tag{39}$$

where

$$M(\tau)\coloneqq\begin{pmatrix}\frac{2}{k_c\tau^2}&-\frac{\rho_i}{2}\\-\frac{\rho_i}{2}&\underline{\Theta}\end{pmatrix}, \tag{40}$$

and $\mathcal{A}_c$ was defined in Section 4. Since $2\rho_d\underline{\lambda}>\rho_i$ and $T^2<\frac{8\rho_d\underline{\lambda}}{k_c\rho_i^2}$ by means of Assumption 3, and since $\tau(t,j)\in[T_0,T]$ for all $(t,j)\in\text{dom}(y)$ by construction of the critic update dynamics (17), it follows that $M(\tau)\succeq\underline{r}I_2$ with $\underline{r}\coloneqq\underline{\Theta}-\frac{\rho_i}{2}$. Hence, from (39) and using (35), we obtain that:

$$\dot{V}_c\leq-\underline{r}|y|_{\mathcal{A}_c}^2+|y|_{\mathcal{A}_c}\Big(\gamma_\nu\big(\overline{\epsilon_{\text{HJB}}}\big)+\gamma_\chi\big(|u(x)-u^*(x)|\big)\Big), \tag{41}$$

where $\gamma_\nu,\gamma_\chi\in\mathcal{K}_\infty$ are given by:

$$\gamma_\nu(r)\coloneqq\frac{3\sqrt{6}}{8}\big(\rho_i+N\rho_d\big)r,\qquad\gamma_\chi(r)\coloneqq c_\chi(r+r^2),$$
$$c_\chi\coloneqq\frac{3\sqrt{6}}{8}\rho_i\max\left\{\overline{g}\big(\overline{d\phi_c}\big[1+|\theta_c^*|\big]+\overline{d\epsilon_c}\big),\ \lambda_{\max}(\Pi_u)\right\}.$$

Thus, letting $d_c\in(0,1)$, and using (38) and (41):

$$\dot{V}_c\leq-\frac{\underline{r}(1-d_c)}{\overline{c}}V_c(y),\quad\forall\ |y|_{\mathcal{A}_c}\geq\frac{1}{d_c\underline{r}}\Big(\gamma_\nu\big(\overline{\epsilon_{\text{HJB}}}\big)+\gamma_\chi\big(|u(x)-u^*(x)|\big)\Big). \tag{42}$$

On the other hand, the change of $V_c$ during the jumps of the critic update dynamics (17) satisfies:

$$V_c\big(y^+\big)-V_c(y)\leq-\eta V_c(y), \tag{43}$$

with $\eta\coloneqq 1-\frac{T_0^2}{T^2}-\frac{1}{2k_c\rho_d\underline{\lambda}T^2}$, which satisfies $\eta\in(0,1)$ by means of Assumption 3. Together, (42) and (43), in conjunction with the quadratic bounds in (38), imply the result of Theorem 1 via [23, Prop. 2.7] and the fact that $|\theta_c(t,j)-\theta_c^*|\leq|y(t,j)|_{\mathcal{A}_c}\leq|(\theta_c(t,j),p(t,j))|_{\mathcal{A}_{\theta_c,p}}$ for all $(t,j)\in\text{dom}(y)$. $\blacksquare$

Appendix B Proof of Theorem 2

B.1 Gradient of Actor Error-Function in Deviation Variables

First, note that we can write the gradient of the error (25) as:

$$\nabla_{\theta_u}\varepsilon(x,\theta_c,\theta_u)=\Omega(x)\big(\theta_u-\theta_c^*-(\theta_c-\theta_c^*)\big),$$

and consider the following lemma, which is instrumental for our results.

Lemma 2

There exist $\overline{\Omega},\underline{\Omega}\in\mathbb{R}_{>0}$ such that

$$\underline{\Omega}I_{l_c}\preceq\Omega(x)\preceq\overline{\Omega}I_{l_c}\qquad\forall x\in\mathbb{R}^n.$$
  • Proof. Let $\theta\in\mathbb{R}^{l_c}$ be arbitrary. Then, by the definition of $\Omega:\mathbb{R}^n\to\mathbb{R}^{l_c\times l_c}$ in (26), it follows that:

$$\theta^\top\Omega(x)\theta=\alpha_1\frac{|\omega(x)^\top\theta|^2}{1+\text{Tr}\big(\omega(x)^\top\omega(x)\big)}+\alpha_2|\theta|^2\geq\alpha_2|\theta|^2\implies\Omega(x)\succeq\underline{\Omega}I_{l_c},\quad\forall x\in\mathbb{R}^n,$$

where $\underline{\Omega}\coloneqq\alpha_2$. On the other hand, we obtain:

$$\theta^\top\Omega(x)\theta\leq\left(\alpha_1\frac{|\omega(x)|^2}{1+|\omega(x)|_F^2}+\alpha_2\right)|\theta|^2\leq\overline{\Omega}|\theta|^2\implies\Omega(x)\preceq\overline{\Omega}I_{l_c},\quad\forall x\in\mathbb{R}^n,$$

where $\overline{\Omega}\coloneqq\alpha_1+\alpha_2$, $|A|_F$ denotes the Frobenius norm, and we used $|A|\leq|A|_F$ for all $A\in\mathbb{R}^{l_c\times l_c}$ and $\frac{r^2}{1+r^2}\leq 1$ for all $r\in\mathbb{R}$. $\blacksquare$

Now, consider the Lyapunov function

$$\mathcal{V}(z)\coloneqq V_o(x)+V_c(y)+V_a(\theta_u), \tag{44a}$$
$$V_o(x)\coloneqq V^*(x),\qquad V_a(\theta_u)\coloneqq\frac{1}{2}|\theta_u-\theta_c^*|^2, \tag{44b}$$

where $V_c$ was defined in (37), and where we recall that $z=(x,y,\theta_u)$. By [24, Lemma 4.3], and since $V_o=V^*$ is a continuous and positive definite function on $\mathbb{R}^n$, there exist $\underline{\gamma}_o,\overline{\gamma}_o\in\mathcal{K}$ such that $\underline{\gamma}_o(|x|)\leq V_o(x)\leq\overline{\gamma}_o(|x|)$. Hence, using (38), and the fact that the sum of class-$\mathcal{K}$ functions is in turn of class $\mathcal{K}$, there exist $\underline{\gamma}_{\mathcal{V}},\overline{\gamma}_{\mathcal{V}}\in\mathcal{K}$ such that:

$$\underline{\gamma}_{\mathcal{V}}\big(|z|_{\mathcal{A}}\big)\leq\mathcal{V}(z)\leq\overline{\gamma}_{\mathcal{V}}\big(|z|_{\mathcal{A}}\big). \tag{45}$$

Now, the time derivative $\dot{V}_o=\nabla V_o(x)^\top\dot{x}$ along the trajectories of (28) satisfies:

$$\dot{V}_o\leq-Q(x)+\frac{\overline{g}^2\lambda_{\max}\big(\Pi_u^{-1}\big)}{2}\big(\overline{d\phi_c}|\theta_c^*|+\overline{d\epsilon_c}\big)\big(\overline{d\phi_c}|\theta_u-\theta_c^*|+\overline{d\epsilon_c}\big). \tag{46}$$

On the other hand, making use of Lemma 2, for the time derivative $\dot{V}_a=\nabla_{\theta_u}V_a(\theta_u)^\top\dot{\theta}_u$ we obtain:

$$\dot{V}_a\leq-k_u\alpha_2|\theta_u-\theta_c^*|^2+k_u\overline{\Omega}|\theta_u-\theta_c^*||\theta_c-\theta_c^*|. \tag{47}$$

Hence, using (39), (46), and (47), together with the upper bounds in (35), we obtain that the time derivative of $\mathcal{V}$ along the trajectories of the closed-loop system satisfies:

$$\dot{\mathcal{V}}\leq-Q(x)-\underline{r}|y|_{\mathcal{A}_c}^2-k_u\alpha_2|\theta_u-\theta_c^*|^2+c_y|y|_{\mathcal{A}_c}+c_u|\theta_u-\theta_c^*|+c_{yu}|\theta_u-\theta_c^*||y|_{\mathcal{A}_c}+c_{yu^2}|y|_{\mathcal{A}_c}|\theta_u-\theta_c^*|^2+c_0, \tag{48}$$

where

cy\displaystyle c_{y} 368kc(ϵHJB¯(ρi+Nρd)+12g¯2ρi[λmax(Πu1)(dϕc¯[1+|θc|]+dϵc¯)dϵc¯+λmax(Πu)λmax(Πu1)2dϵc¯2]),\displaystyle\coloneqq\frac{3\sqrt{6}}{8}k_{c}\Bigg{(}\overline{\epsilon_{\text{HJB}}}\left(\rho_{i}+N\rho_{d}\right)+\frac{1}{2}\overline{g}^{2}\rho_{i}\Bigg{[}\lambda_{\max}\left(\Pi_{u}^{-1}\right)\left(\overline{d\phi_{c}}\left[1+\left\lvert\theta_{c}^{*}\right\rvert\right]+\overline{d\epsilon_{c}}\right)\overline{d\epsilon_{c}}+\lambda_{\max}\left(\Pi_{u}\right)\lambda_{\max}\left(\Pi_{u}^{-1}\right)^{2}\overline{d\epsilon_{c}}^{2}\Bigg{]}\Bigg{)},
cu\displaystyle c_{u} 12(dϕ¯c|θc|+dϵc¯)g¯2λmax(Πu1)dϕc¯,\displaystyle\coloneqq\frac{1}{2}\left(\overline{d\phi}_{c}\left\lvert\theta^{*}_{c}\right\rvert+\overline{d\epsilon_{c}}\right)\overline{g}^{2}\lambda_{\max}\left(\Pi_{u}^{-1}\right)\overline{d\phi_{c}},
cyu\displaystyle c_{yu} 36kc16(2kuΩ¯+g¯2ρiλmax(Πu1)(dϕc¯[1+|θc|]+dϵc¯)dϕc¯),\displaystyle\coloneqq\frac{3\sqrt{6}k_{c}}{16}\Bigg{(}2k_{u}\overline{\Omega}+\overline{g}^{2}\rho_{i}\lambda_{\max}\left(\Pi_{u}^{-1}\right)\left(\overline{d\phi_{c}}\left[1+\left\lvert\theta_{c}^{*}\right\rvert\right]+\overline{d\epsilon_{c}}\right)\overline{d\phi_{c}}\Bigg{)},
cyu2\displaystyle c_{yu^{2}} 3616kcg¯2ρiλmax(Πu)λmax(Πu1)2dϕc¯2,\displaystyle\coloneqq\frac{3\sqrt{6}}{16}k_{c}\overline{g}^{2}\rho_{i}\lambda_{\max}\left(\Pi_{u}\right)\lambda_{\max}\left(\Pi_{u}^{-1}\right)^{2}\overline{d\phi_{c}}^{2},
c0\displaystyle c_{0} 12(dϕ¯c|θc|+dϵc¯)g¯2λmax(Πu1)dϵc¯.\displaystyle\coloneqq\frac{1}{2}\left(\overline{d\phi}_{c}\left\lvert\theta^{*}_{c}\right\rvert+\overline{d\epsilon_{c}}\right)\overline{g}^{2}\lambda_{\max}\left(\Pi_{u}^{-1}\right)\overline{d\epsilon_{c}}.

Then, for all $|\theta_u-\theta_c^*|\leq\frac{c_{yu}}{c_{yu^2}}$, by using $Q(x)=x^\top\Pi_xx$ and letting $d_1\in(0,1)$, from (48), $\dot{\mathcal{V}}$ can be further upper bounded as:

$$\dot{\mathcal{V}}\leq-\lambda_{\min}(\Pi_x)|x|^2-(1-d_1)\big(\underline{r}|y|_{\mathcal{A}_c}^2+k_u\alpha_2|\theta_u-\theta_c^*|^2\big)+c_y|y|_{\mathcal{A}_c}+c_u|\theta_u-\theta_c^*|+c_0-\begin{pmatrix}|y|_{\mathcal{A}_c}&|\theta_u-\theta_c^*|\end{pmatrix}\begin{pmatrix}d_1\underline{r}&-c_{yu}\\-c_{yu}&d_1k_u\alpha_2\end{pmatrix}\begin{pmatrix}|y|_{\mathcal{A}_c}\\|\theta_u-\theta_c^*|\end{pmatrix}. \tag{49}$$

Now, pick a set of tunable parameters $(\rho_i,\rho_d,k_c,k_u)$ such that $\underline{r}\geq\frac{c_{yu}^2}{d_1^2k_u\alpha_2}$, so that from (49) we obtain:

$$\dot{\mathcal{V}}\leq-(1-d_2)d_z|z|_{\mathcal{A}}^2,\quad\forall\ |z|_{\mathcal{A}}\geq\max\left\{\frac{c_0}{2d_{yu}},\ \frac{2d_{yu}}{d_2d_z}\right\},\ |\theta_u-\theta_c^*|\leq\frac{c_{yu}}{c_{yu^2}}, \tag{50}$$

with

$$d_z\coloneqq\min\left\{\lambda_{\min}(\Pi_x),\ (1-d_1)\underline{r},\ k_u\alpha_2\right\},\qquad d_{yu}\coloneqq\max\left\{2c_{yu},\ c_0\right\},\qquad d_2\in(0,1).$$

Notice that for every compact set $K_\theta$ of initial conditions for $\theta_u$, we can pick suitable $(\rho_i,\rho_d,\alpha_1,\alpha_2,k_c,k_u)$ so that $K_\theta\subset\frac{c_{yu}}{c_{yu^2}}\mathbb{B}$ and (50) holds for every trajectory with $\theta_u(0,0)\in K_\theta$. Now, during jumps, $x$ and $\theta_u$ do not change, and hence $\mathcal{V}$ satisfies:

$$\mathcal{V}(z^+)-\mathcal{V}(z)=V_c(y^+)-V_c(y)\leq-\eta V_c(y). \tag{51}$$

The result of the theorem follows by using the strong decrease of $\mathcal{V}$ during flows outside a neighborhood of $\mathcal{A}$ described in (50), the non-increase of $\mathcal{V}$ during jumps given in (51), by noting that, by design, the closed-loop dynamics are a well-posed HDS which experiences periodic jumps separated by intervals of flow of length $2(T-T_0)>0$ (cf. [25]), and by following the same arguments of [12, Prop. 3.27] and [23, Prop. 2.7]. $\blacksquare$