
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, P. R. China

Model-free Nearly Optimal Control of Constrained-Input Nonlinear Systems Based on Synchronous Reinforcement Learning

Han Zhao, Lei Guo ([email protected])
Abstract

In this paper a novel model-free algorithm is proposed, which learns a nearly optimal control law for constrained-input systems from online data without requiring any a priori knowledge of the system dynamics. Based on the concept of generalized policy iteration, two neural networks (NNs), namely the critic and the actor, are used to approximate the optimal value function and the optimal policy, respectively. The stability of the closed-loop system and the convergence of the weights are guaranteed by Lyapunov analysis.

Keywords: Optimal Control, Synchronous Reinforcement Learning, Constrained-Input, Actor-Critic Networks
This work is supported by the National Natural Science Foundation (NNSF) of China under Grant 61105103.

1 Introduction

The control of nonlinear constrained-input systems has been widely researched and is a problem of practical value. In the framework of optimal control, an optimal policy that minimizes a user-defined performance index, together with the corresponding value function, can be obtained by solving the Hamilton-Jacobi-Bellman (HJB) equation. For continuous-time nonlinear systems the HJB equation is a partial differential equation whose analytical solution is usually hard to find. Therefore, function approximation using neural networks (NNs) is a useful tool for solving the HJB equation. Such algorithms belong to the family of adaptive dynamic programming (ADP) or reinforcement learning (RL) methods.

ADP/RL methods have solved several discrete-time optimal control problems such as nonlinear zero-sum games [1] and multi-agent systems [2], [3]. A considerable body of literature has also focused on solving the continuous-time constrained-input optimal control problem via ADP/RL. In [4], an off-line policy iteration (PI) algorithm that iteratively solves the HJB equation is proposed; full knowledge of the system dynamics and an initial stabilizing controller are needed in this algorithm. In recent years, many works have aimed to relax the requirement of a priori knowledge of the system dynamics. The concepts of integral reinforcement learning (IRL, [5]) and generalized PI [6] are brought into [7], [8] to estimate the optimal value function in constrained-input problems. An initial stabilizing controller is not needed in these methods, but the input gain matrix of the system is still used in the policy improvement step.

Another way to solve this problem is to use an identifier NN. Xue et al. [9] proposed an algorithm based on an actor-critic-identifier structure to estimate the system dynamics and solve the HJB equation online. Compared with other model-free methods, the extra identifier may bring an additional real-time computational burden.

In terms of IRL, Vamvoudakis [10] proposed a completely model-free Q-learning algorithm to solve the continuous-time optimal control problem. The Q-function combines the value function and the Hamiltonian of Pontryagin's minimum principle and provides the information needed for policy improvement without requiring any a priori knowledge. Similarly, the restriction on the system input dynamics can be relaxed by adding an exploration signal [11]. Our previous work [12] also presents a model-free algorithm that combines the exploration signal and synchronous reinforcement learning to solve the adaptive optimal control problem for continuous-time nonlinear systems.

In this paper we present an algorithm that solves the continuous-time constrained-input optimal control problem without any information about the system dynamics or an initial stabilizing controller. The identifier NN of [9] is also not needed in this method. The remainder of this paper is organized as follows: Section 2 provides the mathematical formulation of optimal control for continuous-time systems with input constraints. In Section 3 we present the weight tuning laws of the actor-critic NNs based on the exploration-HJB equation and synchronous reinforcement learning; during learning, policy evaluation and policy improvement are performed simultaneously and no knowledge of the system dynamics is required. The convergence of the weights and the stability of the closed-loop system are proved through Lyapunov analysis. Finally, we design two simulations to verify the effectiveness of our method.

2 Problem Formulation

Consider a continuous-time input-affine system

\dot{x}(t)=f(x(t))+g(x(t))u(t),\quad x(0)=\xi, \qquad (1)

where $x\in\mathbb{R}^{n}$ and $u\in\mathbb{R}^{m}$ are the measurable state and input vectors, respectively, and $\xi$ denotes the initial state of the system. $f(x)\in\mathbb{R}^{n}$ is the system drift dynamics and $g(x)\in\mathbb{R}^{n\times m}$ is the system input dynamics. The system dynamics $f(x)+g(x)u$ is assumed to be Lipschitz on a compact set $\Omega$ and to satisfy $f(0)=0$.

The objective of optimal control is to design a controller that minimize a user-defined performance index. The performance index in this paper is defined as

J(x,u)=\int_{0}^{\infty}\left(Q(x(\tau))+U(u(\tau))\right){\rm d}\tau, \qquad (2)

where both $Q(x)$ and $U(u)$ are positive definite functions.

In this paper, the constrained-input control problem is considered, in which the input vector must satisfy the constraint $|u_{i}|\leq\lambda\ (i=1,\ldots,m)$. To deal with this constraint, a non-quadratic integral cost on the input is usually employed [13]:

U(u)=2\int_{0}^{u}\lambda\bm{\vartheta}^{-1}\left(\frac{s}{\lambda}\right)R\,{\rm d}s, \qquad (3)

where $\bm{\vartheta}(s)=\left[\vartheta(s_{1}),\ldots,\vartheta(s_{m})\right]^{\top}$ and $\bm{\vartheta}^{-1}(s)=\left[\vartheta^{-1}(s_{1}),\ldots,\vartheta^{-1}(s_{m})\right]^{\top}$. $\vartheta$ is a bounded one-to-one monotonic odd function; the hyperbolic tangent $\vartheta(\cdot)=\tanh(\cdot)$ is often used. $R$ is a positive definite symmetric matrix. For convenience of computation and analysis, $R$ is assumed to be diagonal, i.e. $R={\rm diag}(r_{1},\ldots,r_{m})$.
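For illustration, the cost (3) with $\vartheta=\tanh$ can be evaluated numerically; the minimal sketch below integrates it channel by channel with `scipy`. The values $\lambda=1$ and $R=1$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.integrate import quad

# Numerically evaluate the non-quadratic input cost (3) with theta = tanh,
# i.e. U(u) = sum_i 2 * integral_0^{u_i} lambda * arctanh(s/lambda) * r_i ds.
lam = 1.0                     # input bound lambda (illustrative)
r = np.array([1.0])           # diagonal entries of R (illustrative)

def U(u):
    total = 0.0
    for ui, ri in zip(np.atleast_1d(u), r):
        val, _ = quad(lambda s: 2.0 * lam * np.arctanh(s / lam) * ri, 0.0, ui)
        total += val
    return total

print(U(0.5))    # finite cost well inside the constraint
print(U(0.99))   # cost grows steeply as |u_i| approaches lambda
```

The integrand diverges at $|u_{i}|=\lambda$, but the integral stays finite, which is what makes this penalty suitable for enforcing the saturation limit.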

The goal is to find the optimal control law $u^{*}$ that stabilizes system (1) and minimizes the performance index (2). The following value function is used:

V^{\mu}=J(x,u)|_{u=\mu(x)},\quad V^{\mu}(0)=0, \qquad (4)

where $\mu(x)$ is a feedback policy with $\mu(0)=0$. The value function is well defined only if the policy is admissible [4].
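As a concrete illustration of (4), the value of a fixed admissible policy can be estimated by simulating the closed-loop system and accumulating the running cost. The toy drift and input dynamics, the gain $K$, and the horizon below are assumptions made for this sketch only; $U$ uses the closed form of (3) for $\vartheta=\tanh$.

```python
import numpy as np
from scipy.integrate import solve_ivp

lam = 1.0
K = np.array([[-1.0, -0.5]])                          # hand-picked feedback gain

def f(x): return np.array([x[1], -x[0] - x[1]])       # assumed drift dynamics
def g(x): return np.array([[0.0], [1.0]])             # assumed input dynamics
def mu(x): return lam * np.tanh(K @ x / lam)          # constrained policy, |mu_i| <= lam

def Q(x): return float(x @ x)
def U(u):                                             # closed form of (3) for theta = tanh, R = I
    u = np.atleast_1d(u)
    return float(2*lam*u @ np.arctanh(u/lam) + lam**2*np.sum(np.log(1 - (u/lam)**2)))

def rhs(t, z):                                        # augmented state: [x, accumulated cost]
    x = z[:2]
    u = mu(x)
    return np.concatenate((f(x) + (g(x) @ u).ravel(), [Q(x) + U(u)]))

xi = np.array([1.0, -0.5])
sol = solve_ivp(rhs, [0.0, 50.0], np.concatenate((xi, [0.0])), rtol=1e-8)
print("V^mu(xi) ~", sol.y[2, -1])                     # finite accumulated cost
```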

Definition 1.

A constrained policy is admissible on $\Omega$, denoted $\mu\in\mathcal{A}(\Omega)$, if: 1) the policy stabilizes system (1) on $\Omega$; 2) for any state $x\in\Omega$, $V^{\mu}(x)$ is finite; 3) for any state $x\in\Omega$, $|\mu_{i}(x)|\leq\lambda\ (i=1,\ldots,m)$. $\Omega$ is called the admissible domain and $\mathcal{A}(\Omega)$ is the set of admissible policies.

Assume that the set of admissible policies $\mathcal{A}(\Omega)$ of system (1) is not empty and that $V^{\mu}\in\mathcal{C}^{1}(\Omega)$. Then there exists an optimal policy $\mu^{*}(x)\in\mathcal{A}(\Omega)$ such that

V^{\mu^{*}}(\xi)=\min_{\mu\in\mathcal{A}(\Omega)}\int_{0}^{\infty}\left(Q(x(\tau))+U(u(\tau))\right){\rm d}\tau,\quad\forall\xi\in\Omega.

The optimal value function satisfies $V^{*}(x)=V^{\mu^{*}}(x)$; in the remainder of the paper both symbols are used interchangeably.

According to the definition of the value function (4), the infinitesimal version of (2) is obtained as

0=Q(x)+U(\mu(x))+(\nabla V^{\mu}(x))^{\top}(f(x)+g(x)\mu(x)), \qquad (5)

where $\nabla V(x)=\partial V(x)/\partial x\in\mathbb{R}^{n}$. Define the Hamiltonian as

\mathcal{H}(x,u,\nabla V^{\mu})=Q(x)+U(u)+(\nabla V^{\mu})^{\top}(f(x)+g(x)u). \qquad (6)

For the optimal value function $V^{*}$ and the optimal control law $u^{*}$, the following Hamilton-Jacobi-Bellman (HJB) equation is satisfied:

0=Q(x)+U(u^{*})+\nabla V^{*\top}(f(x)+g(x)u^{*}),\quad V^{*}(0)=0. \qquad (7)

The optimal policy can be derived by minimizing the Hamiltonian. For the input-affine system (1) and the performance index (2) with the input cost (3), the optimal policy can be written directly as

u^{*}(x)=\lambda{\bm{\vartheta}}\left(-\frac{1}{2\lambda}R^{-1}g^{\top}(x)\nabla V^{*}(x)\right). \qquad (8)

Let

v^{*}(x)=-\frac{1}{2}R^{-1}g^{\top}(x)\nabla V^{*}(x), \qquad (9)

the optimal policy can be obtained as

u^{*}(x)=\lambda{\bm{\vartheta}}\left(\frac{v^{*}(x)}{\lambda}\right). \qquad (10)
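The saturated structure of (8)-(10) is easy to see in code: the unconstrained direction $v^{*}$ is pushed through $\lambda\tanh(\cdot/\lambda)$, so every input channel stays inside $[-\lambda,\lambda]$. In the sketch below, $g(x)$ is the Case 2 input dynamics from Section 5, while the value-function gradient is a placeholder assumption.

```python
import numpy as np

lam = 0.5
R_inv = np.linalg.inv(np.diag([1.0]))                 # R assumed to be 1

def g(x):                                             # input dynamics of the Case 2 example
    return np.array([[0.0], [np.cos(2.0 * x[0]) + 2.0]])

def grad_V(x):                                        # placeholder for the true gradient of V*
    return np.array([2.0 * x[0], 4.0 * x[1]])

def u_star(x):
    v = -0.5 * R_inv @ g(x).T @ grad_V(x)             # unconstrained direction (9)
    return lam * np.tanh(v / lam)                     # constrained optimal policy (10)

print(u_star(np.array([0.3, -0.2])))                  # every entry lies in [-lam, lam]
```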

It is hard to solve (7) for the optimal value function and policy directly, and doing so requires a priori knowledge of the system dynamics. Therefore, an online data-driven algorithm is a useful tool to relax this restriction.

3 Online Synchronous Reinforcement Learning Algorithm to Solve HJB Equation

In this section a data-driven method for learning the optimal policy is introduced. The framework of integral reinforcement learning (IRL) was proposed in [5]. A completely model-free algorithm based on IRL was developed by Lee et al. [11] to solve the optimal control problem of input-affine systems; however, an initial stabilizing controller is still required to start the iteration.

In our previous work [12], an algorithm called synchronous integral Q-learning was proposed, which continuously updates the weights of both NNs to estimate the optimal value function and policy. In this paper we extend that algorithm to the constrained-input problem. By adding a bounded, piecewise-continuous, non-zero exploration signal $e$ to the input, system (1) becomes

\dot{x}=f(x)+g(x)(u+e). \qquad (11)

Substituting (11) into the HJB equation (7) and, for any time $t>T$ and time interval $T>0$, integrating (7) over $[t-T,t]$ along the trajectory of system (11), the following integral temporal difference equation is obtained:

V^{*}(x(t))-V^{*}(x(t-T))-\int_{t-T}^{t}\nabla V^{*\top}(x)g(x)e(\tau)\,{\rm d}\tau=-\int_{t-T}^{t}\left(Q(x)+U(u^{*})\right){\rm d}\tau. \qquad (12)

The following HJB equation is derived by substituting (9):

V^{*}(x(t))-V^{*}(x(t-T))+2\int_{t-T}^{t}v^{*\top}Re(\tau)\,{\rm d}\tau=-\int_{t-T}^{t}\left(Q(x)+U(u^{*})\right){\rm d}\tau. \qquad (13)

Both $V^{*}(x)$ and $v^{*}(x)$ can be solved from the integral exploration-HJB equation (13), and the optimal policy is then computed by (10). Note that no knowledge of the system dynamics is needed in these two equations.

Since the analytical solutions of $V^{*}(x)$ and $v^{*}(x)$ cannot be determined, we approximate them by choosing a proper structure of neural networks (NNs). Assume $V^{*}(x)$ is a smooth function. Under the assumptions on the system dynamics, $v^{*}(x)$ is then also smooth. Therefore, there exist two single-layer NNs, namely the critic and actor NN, such that:

1) $V^{*}(x)$ and its gradient can be universally approximated:

V^{*}(x)=w_{c}^{\top}\phi_{c}(x)+\varepsilon_{c}(x), \qquad (14)
\nabla V^{*}(x)=\nabla\phi_{c}^{\top}(x)w_{c}+\nabla\varepsilon_{c}(x), \qquad (15)

2) $v^{*}(x)$ can be universally approximated:

v^{*}(x)=-\frac{1}{2}R^{-1}g^{\top}(x)(\nabla\phi_{c}^{\top}(x)w_{c}+\nabla\varepsilon_{c}(x))=w_{a}^{\top}\phi_{a}(x)+\varepsilon_{a}(x), \qquad (16)

where $\phi_{c}(x):\mathbb{R}^{n}\to\mathbb{R}^{N_{c}}$, $w_{c}$, and $\varepsilon_{c}$ are the activation function, weights, and reconstruction error of the critic NN, respectively, and $N_{c}$ is the number of neurons in its hidden layer. Similarly, $\phi_{a}(x):\mathbb{R}^{n}\to\mathbb{R}^{N_{a}}$, $w_{a}$, and $\varepsilon_{a}$ denote the activation function, weights, and reconstruction error of the actor NN, and $N_{a}$ is the number of neurons in its hidden layer. $\varepsilon_{c}(x)$ and $\varepsilon_{a}(x)$ are bounded on a compact set $\Omega$. According to the Weierstrass approximation theorem, as $N_{c},N_{a}\to\infty$ we have $\varepsilon_{c},\nabla\varepsilon_{c},\varepsilon_{a}\to 0$.
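For concreteness, the approximators (14) and (16) are simply linear combinations of fixed basis functions; the snippet below uses the polynomial activations chosen for Case 1 in Section 5, with placeholder (untrained) weights.

```python
import numpy as np

def phi_c(x):                                  # critic activations, N_c = 3
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def phi_a(x):                                  # actor activations, N_a = 2
    return np.array([x[0], x[1]])

w_c = np.zeros(3)                              # critic weights, learned online
w_a = np.zeros((2, 1))                         # actor weights (N_a x m), learned online

V_hat = lambda x: float(w_c @ phi_c(x))        # cf. (21)
v_hat = lambda x: w_a.T @ phi_a(x)             # cf. (22)
```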

Define the Bellman error $\varepsilon_{B}$ based on (13) as

w_{c}^{\top}\Delta\phi_{c}+2\int_{t-T}^{t}\phi_{a}^{\top}(x(\tau))w_{a}Re(\tau)\,{\rm d}\tau+\int_{t-T}^{t}\left(Q(x(\tau))+U(u^{*}(x(\tau)))\right){\rm d}\tau=\varepsilon_{B}, \qquad (17)

where $\Delta\phi_{c}=\phi_{c}(x(t))-\phi_{c}(x(t-T))$. The integral reinforcement is defined as

\rho(x,u)=\int_{t-T}^{t}\left(Q(x(\tau))+U(u(x(\tau)))\right){\rm d}\tau.
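In practice $\rho$ is computed from sampled data over each reinforcement interval. A minimal sketch using the trapezoidal rule is given below, with $Q(x)=x^{\top}x$ and the Case 1 settings ($R=1$, $\lambda=30$) as illustrative choices and randomly generated logs standing in for real measurements.

```python
import numpy as np

def integral_reinforcement(ts, xs, us, Q, U):
    """ts: (k,) sample times in [t-T, t]; xs: (k, n) states; us: (k, m) inputs."""
    costs = np.array([Q(x) + U(u) for x, u in zip(xs, us)])
    return np.trapz(costs, ts)

lam = 30.0
Q = lambda x: float(x @ x)
U = lambda u: float(2*lam*u @ np.arctanh(u/lam) + lam**2*np.sum(np.log(1 - (u/lam)**2)))

ts = np.linspace(0.0, 0.01, 11)                # one interval with T = 0.01 s
xs = 0.1 * np.random.randn(11, 2)              # logged states (placeholder)
us = np.clip(np.random.randn(11, 1), -1, 1)    # logged inputs (placeholder), well below lam
print(integral_reinforcement(ts, xs, us, Q, U))
```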

Using properties of the Kronecker product, the following equation can be derived from (17):

\varepsilon_{B}-\rho(x,u^{*})=W^{\top}\delta, \qquad (18)

where $W=[w_{c}^{\top},{\rm col}\{w_{a}\}^{\top}]^{\top}$, ${\rm col}\{w_{a}\}$ is the column vector obtained by reshaping $w_{a}$, and

\delta=\left[\Delta\phi_{c}^{\top},\ \left(\int_{t-T}^{t}2\phi_{a}\otimes(Re(\tau))\,{\rm d}\tau\right)^{\top}\right]^{\top}.
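The regressor $\delta$ can be assembled from the same logged data, with the integral again replaced by a trapezoidal sum and the Kronecker product supplied by `np.kron`. The features, $R$, and the logged signals below are placeholders, and the stacking order of ${\rm col}\{\cdot\}$ is one particular reshape convention assumed for this sketch.

```python
import numpy as np

def build_delta(ts, xs, es, phi_c, phi_a, R):
    d_phi_c = phi_c(xs[-1]) - phi_c(xs[0])                     # Delta phi_c
    kron_terms = np.array([2.0 * np.kron(phi_a(x), R @ e)      # 2 phi_a (x) (R e)
                           for x, e in zip(xs, es)])
    integral = np.trapz(kron_terms, ts, axis=0)
    return np.concatenate((d_phi_c, integral))

phi_c = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])
phi_a = lambda x: np.array([x[0], x[1]])
R = np.eye(1)
ts = np.linspace(0.0, 0.01, 11)
xs = np.random.randn(11, 2)                                    # logged states (placeholder)
es = 0.1 * np.random.randn(11, 1)                              # logged exploration (placeholder)
print(build_delta(ts, xs, es, phi_c, phi_a, R).shape)          # N_c + N_a*m = 5
```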

Before introducing the weight tuning laws, it is necessary to make some assumptions on the NNs.

Assumption 1.

The system dynamics and NNs satisfy:

a. $f(x)$ is Lipschitz and $g(x)$ is bounded:

\Arrowvert f(x)\Arrowvert\leq b_{f}\Arrowvert x\Arrowvert,\quad\Arrowvert g(x)\Arrowvert\leq b_{g}.

b. The reconstruction errors and the gradient of the critic NN's reconstruction error are bounded:

\Arrowvert\varepsilon_{c}\Arrowvert\leq b_{\varepsilon c},\quad\Arrowvert\varepsilon_{a}\Arrowvert\leq b_{\varepsilon a},\quad\Arrowvert\nabla\varepsilon_{c}\Arrowvert\leq b_{\varepsilon cx}.

c. The activation functions and the gradient of the critic NN's activation function are bounded:

\Arrowvert\phi_{c}(x)\Arrowvert\leq b_{\phi c},\quad\Arrowvert\phi_{a}(x)\Arrowvert\leq b_{\phi a},\quad\Arrowvert\nabla\phi_{c}(x)\Arrowvert\leq b_{\phi cx}.

d. The optimal NN weights $\Arrowvert w_{c}\Arrowvert,\Arrowvert w_{a}\Arrowvert$ are bounded, and the amplitude of the exploration signal is bounded by $b_{e}$.

These assumptions are standard and easily satisfied, except possibly the boundedness of $g(x)$. With the assumption that the system dynamics is Lipschitz, the Bellman error $\varepsilon_{B}$ is bounded on a compact set; when $N_{c},N_{a}\to\infty$, $\varepsilon_{B}\to 0$ uniformly [4].

In this paper we use $\vartheta(\cdot)=\tanh(\cdot)$. The non-quadratic cost $U(u)$ is then derived as

U(u)=2\int_{0}^{u}\lambda\tanh^{-1}\left(\frac{s}{\lambda}\right)R\,{\rm d}s=2\lambda u^{\top}R\tanh^{-1}\left(\frac{u}{\lambda}\right)+\lambda^{2}\overline{R}\ln\left(\bm{1}_{m}-\left(\frac{u}{\lambda}\right)^{2}\right), \qquad (19)

where $\overline{R}=[r_{1},\ldots,r_{m}]$ and $\bm{1}_{m}$ is the $m$-dimensional vector whose elements are all one. According to (17) and (19), the approximate exploration-HJB equation is obtained as

\int_{t-T}^{t}\left(-Q-2\lambda\tanh^{\top}\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)Rw_{a}^{\top}\phi_{a}-\lambda^{2}\overline{R}\ln\Big(\bm{1}_{m}-\tanh^{2}\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)\Big)+\varepsilon_{HJB}\right){\rm d}\tau=W^{\top}\delta, \qquad (20)

in which $\varepsilon_{HJB}(x)$ is the error arising from the reconstruction errors of the two NNs. Since the ideal weights $w_{c}$ and $w_{a}$ are unknown, they are approximated in real time as

\hat{V}(x)=\hat{w}_{c}^{\top}\phi_{c}(x), \qquad (21)
\hat{v}_{1}(x)=\hat{w}_{a1}^{\top}\phi_{a}(x). \qquad (22)

The optimal policy is obtained as

u(x)=\lambda\tanh\left(\frac{w_{a}^{\top}\phi_{a}(x)+\varepsilon_{a}(x)}{\lambda}\right).

We use an actor NN to approximate the optimal policy as

\hat{u}(x)=\lambda\tanh\left(\frac{\hat{v}_{2}(x)}{\lambda}\right), \qquad (23)

where $\hat{v}_{2}(x)=\hat{w}_{a2}^{\top}\phi_{a}(x)$. Note that $\hat{v}_{1}$ and $\hat{v}_{2}$ share the same approximation structure; however, for reasons of Lyapunov stability, the estimated weights in (22) are not implemented directly on the controller. Defining $\hat{W}=[\hat{w}_{c}^{\top},{\rm col}\{\hat{w}_{a1}\}^{\top}]^{\top}$, the approximate Bellman error of the critic NN is obtained as

E=\hat{W}^{\top}\delta+\rho(x,\hat{u}). \qquad (24)

In order to minimize this error, we define the objective function of the critic NN as $K=\frac{1}{2}E^{\top}E$. The modified gradient-descent law is obtained as

\dot{\hat{W}}=-\alpha_{1}\frac{\delta}{1+\delta^{\top}\delta}E, \qquad (25)

where $\alpha_{1}$ is the learning rate that determines the speed of convergence. In order to guarantee the stability of the closed-loop system and the convergence of the weights, define the approximation error of the actor NN as

E_{u}=\lambda R\left(\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-\tanh\Big(\frac{\hat{w}_{a1}^{\top}\phi_{a}}{\lambda}\Big)\right). \qquad (26)

The gradient-descent tuning law of $\hat{w}_{a2}$ is set as

{\rm col}\{\dot{\hat{w}}_{a2}\}=-\alpha_{2}\bigg(Y{\rm col}\{\hat{w}_{a2}\}+E_{u}\otimes\phi_{a}-\Big(E_{u}\tanh^{2}\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big)\otimes\phi_{a}\bigg), \qquad (27)

where $Y$ is a design parameter used to guarantee stability. Defining $m_{s}=1+\delta^{\top}\delta$ and $\overline{\delta}=\delta/m_{s}$, the following theorem is obtained.
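Putting the pieces together, one forward-Euler step of the update laws (25)-(27) can be sketched as follows. The shapes, the column-stacking convention used for ${\rm col}\{\cdot\}$, and the step size `dt` are assumptions made for illustration; `delta` and `rho` are built from logged data as sketched earlier.

```python
import numpy as np

def learning_step(W_hat, w_a1, w_a2, delta, rho, phi_a, R, Y,
                  alpha1, alpha2, lam, dt):
    # Critic / first-actor update (25): normalized gradient descent on
    # the approximate Bellman error E = W_hat' delta + rho(x, u_hat).
    E = W_hat @ delta + rho
    W_hat = W_hat - dt * alpha1 * delta / (1.0 + delta @ delta) * E

    # Actor error (26) between the implemented policy (w_a2) and the estimate (w_a1).
    E_u = lam * R @ (np.tanh(w_a2.T @ phi_a / lam) - np.tanh(w_a1.T @ phi_a / lam))

    # Second-actor update (27), written on the column-stacked weights.
    sat2 = np.tanh(w_a2.T @ phi_a / lam) ** 2
    col = w_a2.reshape(-1, order="F")                # one possible col{.} convention
    grad = Y @ col + np.kron(E_u, phi_a) - np.kron(E_u * sat2, phi_a)
    w_a2 = (col - dt * alpha2 * grad).reshape(w_a2.shape, order="F")
    return W_hat, w_a2
```

In a full implementation this step would run once per reinforcement interval while the exploration signal keeps $\overline{\delta}$ persistently exciting.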

Theorem 1.

Suppose all conditions in Assumption 1 are satisfied and the signal $\overline{\delta}(t)$ is persistently exciting [14], i.e. there exist constants $\beta_{2}>\beta_{1}>0$ such that $\beta_{1}I\leq\int_{t-T}^{t}\overline{\delta}\,\overline{\delta}^{\top}{\rm d}\tau\leq\beta_{2}I,\ \forall t>T$. Then there exist a positive integer $N_{0}$ and a sufficiently small reinforcement interval $T>0$ such that, when the numbers of neurons satisfy $N_{c},N_{a}>N_{0}$, the states of the closed-loop system and the estimation errors of $\hat{V}(x),\hat{v}_{1}(x)$ and $\hat{u}(x)$ are uniformly ultimately bounded.

4 Proof of Theorem 1

Define the weight errors as $\tilde{w}_{c}=w_{c}-\hat{w}_{c}$, $\tilde{w}_{a1}=w_{a}-\hat{w}_{a1}$ and $\tilde{w}_{a2}=w_{a}-\hat{w}_{a2}$. Consider the following Lyapunov candidate:

L=V^{*}(x(t))+\frac{1}{2}\tilde{W}^{\top}(t)\alpha_{1}^{-1}\tilde{W}(t)+\frac{1}{2}\tilde{w}_{a2}^{\top}(t)\alpha_{2}^{-1}\tilde{w}_{a2}(t). \qquad (28)

The time derivative of (28) is

\dot{L}=\dot{V}^{*}(x(t))+\tilde{W}^{\top}(t)\alpha_{1}^{-1}\dot{\tilde{W}}(t)+\tilde{w}_{a2}^{\top}(t)\alpha_{2}^{-1}\dot{\tilde{w}}_{a2}(t). \qquad (29)

The first term in (29) is obtained as

\dot{V}^{*}=w_{c}^{\top}\nabla\phi_{c}\Big(f+g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+ge\Big)+\nabla\varepsilon_{c}^{\top}\Big(f+g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+ge\Big). \qquad (30)

From (16), we have

w_{c}^{\top}\nabla\phi_{c}g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)=-2\phi_{a}^{\top}w_{a}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-2\varepsilon_{a}^{\top}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-\nabla\varepsilon_{c}^{\top}g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big). \qquad (31)

Using $w_{a}=\hat{w}_{a2}+\tilde{w}_{a2}$ and the fact that $x^{\top}\tanh(x)\geq 0,\ \forall x$, the first term on the right-hand side of (31) satisfies

-2\phi_{a}^{\top}w_{a}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\leq-2\phi_{a}^{\top}\tilde{w}_{a2}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big). \qquad (32)

Therefore, (30) turns into

\dot{V}^{*}\leq w_{c}^{\top}\nabla\phi_{c}\Big(f+g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+ge\Big)+\varepsilon_{1}-2\phi_{a}^{\top}w_{a}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-w_{c}^{\top}\nabla\phi_{c}g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big), \qquad (33)

where $\varepsilon_{1}=\nabla\varepsilon_{c}^{\top}(f+ge)-2\varepsilon_{a}^{\top}R\lambda\tanh(\hat{w}_{a2}^{\top}\phi_{a}/\lambda)$. According to Assumption 1, it satisfies

\varepsilon_{1}\leq b_{\varepsilon cx}b_{f}\Arrowvert x\Arrowvert+b_{\varepsilon cx}b_{g}b_{e}+2\lambda b_{\varepsilon a}\sigma_{\min}(R), \qquad (34)

where $\sigma_{\min}(R)$ denotes the minimum singular value of the matrix $R$. Substituting the derivative of the approximate HJB equation (20) into (33), we have

\dot{V}^{*}\leq-Q-U\Big(\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)\Big)+\varepsilon_{HJB}+\varepsilon_{1}-2\phi_{a}^{\top}w_{a}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-w_{c}^{\top}\nabla\phi_{c}g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big). \qquad (35)

The last term in (35) satisfies

-w_{c}^{\top}\nabla\phi_{c}g\lambda\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)\leq\lambda b_{g}b_{\phi cx}\Arrowvert w_{c}\Arrowvert. \qquad (36)

Since $Q$ and $U$ are positive definite, there exists a $q>0$ such that $x^{\top}qx<Q+U$. Therefore, (35) becomes

\dot{V}^{*}\leq-x^{\top}qx+k_{1}\Arrowvert x\Arrowvert+k_{2}-2\phi_{a}^{\top}\tilde{w}_{a2}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big), \qquad (37)

where

k_{1}=b_{\varepsilon cx}b_{f},\quad k_{2}=b_{\varepsilon cx}b_{g}b_{e}+2\lambda b_{\varepsilon a}\sigma_{\min}(R)+\lambda b_{g}b_{\phi cx}\Arrowvert w_{c}\Arrowvert+\varepsilon_{h},

and $\varepsilon_{h}$ is the upper bound of $\varepsilon_{HJB}$.

Now consider the second term in (29). According to the tuning law (25), we have the following error dynamics:

\dot{\tilde{W}}(t)=\alpha_{1}(\overline{\delta}/m_{s})E(t),\quad y(t)=\overline{\delta}^{\top}\tilde{W}(t). \qquad (38)

Substituting HJB equation (20) into (38), one has

E(t)=\hat{W}^{\top}(t)\delta(t)+\int_{t-T}^{t}\big(Q+U(\hat{u})\big){\rm d}\tau=\int_{t-T}^{t}\Big(Q+U(\hat{u})-Q-U(u^{*})+\varepsilon_{HJB}+\hat{w}_{c}^{\top}(t)\nabla\phi_{c}\Big(f+g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+ge\Big)-w_{c}^{\top}\nabla\phi_{c}\Big(f+g\lambda\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)+ge\Big)\Big){\rm d}\tau-2{\rm col}\{\tilde{w}_{a1}\}^{\top}\int_{t-T}^{t}\phi_{a}\otimes Re\,{\rm d}\tau. \qquad (39)

According to the definition of $U$, we have

U(\hat{u})-U(u^{*})=2\lambda\phi_{a}^{\top}\hat{w}_{a2}R\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-2\lambda\phi_{a}^{\top}w_{a}R\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)+\lambda^{2}\overline{R}\ln\Big(\bm{1}_{m}-\tanh^{2}\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big)-\lambda^{2}\overline{R}\ln\Big(\bm{1}_{m}-\tanh^{2}\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)\Big). \qquad (40)

According to [7], the third term in (40) can be written as

\ln\Big(\bm{1}_{m}-\tanh^{2}\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big)=\ln 4-2\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\,{\rm sgn}\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+\varepsilon_{\hat{D}}, \qquad (41)

where $\Arrowvert\varepsilon_{\hat{D}}\Arrowvert\leq\ln 4$. Similarly, we have

\ln\Big(\bm{1}_{m}-\tanh^{2}\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)\Big)=\ln 4-2\frac{w_{a}^{\top}\phi_{a}}{\lambda}\,{\rm sgn}\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)+\varepsilon_{D}, \qquad (42)

where $\Arrowvert\varepsilon_{D}\Arrowvert\leq\ln 4$. The following expression is derived from (40):

U(\hat{u})-U(u^{*})=\lambda^{2}\overline{R}(\varepsilon_{\hat{D}}-\varepsilon_{D})+2\lambda\phi_{a}^{\top}\hat{w}_{a2}R\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-2\lambda\phi_{a}^{\top}w_{a}R\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)-2\lambda^{2}\overline{R}\hat{w}_{a2}^{\top}\phi_{a}\,{\rm sgn}\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+2\lambda^{2}\overline{R}w_{a}^{\top}\phi_{a}\,{\rm sgn}\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big). \qquad (43)

In [7], a continuous approximation of $x\,{\rm sgn}(x)$ is provided:

0\leq x\,{\rm sgn}(x)-x\tanh(\kappa x)\leq\frac{3.59}{\kappa}. \qquad (44)
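The bound (44) is easy to spot-check numerically; the grid and the value of $\kappa$ below are arbitrary choices for illustration.

```python
import numpy as np

kappa = 10.0
x = np.linspace(-50.0, 50.0, 200001)
gap = np.abs(x) - x * np.tanh(kappa * x)      # x*sgn(x) - x*tanh(kappa*x)
print(gap.min() >= 0.0, gap.max() <= 3.59 / kappa, gap.max())
```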

Combining (43) and (44), the approximate HJB error is

E(t)=\int_{t-T}^{t}\Big(2\lambda\phi_{a}^{\top}\hat{w}_{a2}R\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-2\lambda\phi_{a}^{\top}w_{a}R\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)+\varepsilon_{\kappa}+\varepsilon_{HJB}+2\lambda^{2}\overline{R}w_{a}^{\top}\phi_{a}\Big(\tanh\Big(\kappa\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)-\tanh\Big(\kappa\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big)+2\lambda^{2}\overline{R}\tilde{w}_{a2}^{\top}\phi_{a}\tanh\Big(\kappa\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+\lambda^{2}\overline{R}(\varepsilon_{\hat{D}}-\varepsilon_{D})+\hat{w}_{c}^{\top}(t)\nabla\phi_{c}(f+ge)+\hat{w}_{c}^{\top}(t)\nabla\phi_{c}g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-w_{c}^{\top}\nabla\phi_{c}(f+ge)-w_{c}^{\top}\nabla\phi_{c}g\lambda\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)\Big){\rm d}\tau-2{\rm col}\{\tilde{w}_{a1}(t)\}^{\top}\int_{t-T}^{t}\phi_{a}\otimes Re\,{\rm d}\tau, \qquad (45)

in which the approximation error satisfies $0\leq\varepsilon_{\kappa}\leq 7.18/\kappa$.

Substituting (31) into (45), we can get

E(t)=-\overline{\delta}^{\top}\tilde{W}(t)+\int_{t-T}^{t}{\rm col}\{\tilde{w}_{a2}(\tau)\}^{\top}M\,{\rm d}\tau+N(t), \qquad (46)

where

M=2\lambda\phi_{a}\otimes R\Big(\lambda\tanh\Big(\kappa\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big),
N(t)=\int_{t-T}^{t}\Big(\varepsilon_{HJB}+\lambda^{2}\overline{R}(\varepsilon_{\hat{D}}-\varepsilon_{D})+\varepsilon_{\kappa}+\varepsilon_{2}+2\lambda^{2}\overline{R}w_{a}^{\top}\phi_{a}\Big(\tanh\Big(\kappa\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)-\tanh\Big(\kappa\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big)\Big){\rm d}\tau,

and

\varepsilon_{2}=(2\varepsilon_{a}^{\top}R+\nabla\varepsilon_{c}^{\top}g)\lambda\Big(\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)-\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big).

Substituting (46) into (38), the error dynamics can be rewritten as

\dot{\tilde{W}}(t)=\alpha_{1}\Big(-\overline{\delta}\,\overline{\delta}^{\top}\tilde{W}(t)+\frac{\overline{\delta}}{m_{s}}\int_{t-T}^{t}{\rm col}\{\tilde{w}_{a2}\}^{\top}M\,{\rm d}\tau+\frac{\overline{\delta}}{m_{s}}N(t)\Big). \qquad (47)

The second term in (29) is obtained as

\tilde{W}^{\top}(t)\alpha_{1}^{-1}\dot{\tilde{W}}(t)=-\tilde{W}^{\top}(t)\overline{\delta}\,\overline{\delta}^{\top}\tilde{W}(t)+\tilde{W}^{\top}(t)\frac{\overline{\delta}}{m_{s}}\int_{t-T}^{t}{\rm col}\{\tilde{w}_{a2}\}^{\top}M\,{\rm d}\tau+\tilde{W}^{\top}(t)\frac{\overline{\delta}}{m_{s}}N(t). \qquad (48)

For a sufficiently small reinforcement interval, the integral term in (48) can be approximated as

\int_{t-T}^{t}{\rm col}\{\tilde{w}_{a2}(\tau)\}^{\top}M\,{\rm d}\tau\approx TM^{\top}{\rm col}\{\tilde{w}_{a2}(t)\}. \qquad (49)

Substituting (49) into (48) and applying Young's inequality, (48) becomes

\tilde{W}^{\top}(t)\alpha_{1}^{-1}\dot{\tilde{W}}(t)\leq-d\tilde{W}^{\top}(t)\overline{\delta}\,\overline{\delta}^{\top}\tilde{W}(t)+\tilde{W}^{\top}(t)\frac{\overline{\delta}}{m_{s}}N(t)+\frac{\varepsilon}{2m_{s}^{2}}{\rm col}\{\tilde{w}_{a2}(t)\}^{\top}MM^{\top}{\rm col}\{\tilde{w}_{a2}(t)\}, \qquad (50)

where $d=1-T^{2}/2\varepsilon$. Then (29) can be bounded as

\dot{L}\leq-x^{\top}qx+k_{1}\Arrowvert x\Arrowvert+k_{2}-d\tilde{W}^{\top}(t)\overline{\delta}\,\overline{\delta}^{\top}\tilde{W}(t)+\tilde{W}^{\top}(t)\frac{\overline{\delta}}{m_{s}}N(t)+\frac{\varepsilon}{2m_{s}^{2}}{\rm col}\{\tilde{w}_{a2}\}^{\top}MM^{\top}{\rm col}\{\tilde{w}_{a2}\}-2\phi_{a}^{\top}\tilde{w}_{a2}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+{\rm col}\{\tilde{w}_{a2}\}^{\top}\alpha_{2}^{-1}{\rm col}\{\dot{\tilde{w}}_{a2}\}. \qquad (51)

Substituting the tuning law (27) of $\hat{w}_{a2}$ and $\hat{w}_{a2}=w_{a}-\tilde{w}_{a2}$, the last two terms of (51) can be written as

{\rm col}\{\tilde{w}_{a2}\}^{\top}\alpha_{2}^{-1}{\rm col}\{\dot{\tilde{w}}_{a2}\}-2\phi_{a}^{\top}\tilde{w}_{a2}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)=-{\rm col}\{\tilde{w}_{a2}\}^{\top}Y{\rm col}\{\tilde{w}_{a2}\}+{\rm col}\{\tilde{w}_{a2}\}^{\top}k_{3}, \qquad (52)

where

k_{3}=Y{\rm col}\{w_{a}\}-\Big(E_{u}\tanh^{2}\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big)\otimes\phi_{a}-\lambda R\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\otimes\phi_{a}-\lambda R\tanh\Big(\frac{\hat{w}_{a1}^{\top}\phi_{a}}{\lambda}\Big)\otimes\phi_{a}.

Eq. (51) becomes

\dot{L}\leq-x^{\top}qx+k_{1}\Arrowvert x\Arrowvert+k_{2}-d\tilde{W}^{\top}(t)\overline{\delta}\,\overline{\delta}^{\top}\tilde{W}(t)+\tilde{W}^{\top}(t)\frac{\overline{\delta}}{m_{s}}N(t)-{\rm col}\{\tilde{w}_{a2}\}^{\top}B{\rm col}\{\tilde{w}_{a2}\}+{\rm col}\{\tilde{w}_{a2}\}^{\top}k_{3}, \qquad (53)

where $B=Y-\varepsilon MM^{\top}/2m_{s}^{2}$. Choose $T$ and $Y$ properly such that $d>0$ and $B$ is positive definite. Thus, $\dot{L}$ is negative if

\Arrowvert x\Arrowvert>\frac{k_{1}}{2\sigma_{\min}(q)}+\sqrt{\frac{k_{1}^{2}}{4\sigma_{\min}^{2}(q)}+\frac{k_{2}}{\sigma_{\min}(q)}}, \qquad (54)
\Arrowvert\overline{\delta}^{\top}\tilde{W}\Arrowvert>\frac{\Arrowvert N\Arrowvert}{d}, \qquad (55)
\Arrowvert\tilde{w}_{a2}\Arrowvert>\frac{\Arrowvert k_{3}\Arrowvert}{\sigma_{\min}(B)}. \qquad (56)

The inequalities above show that the states of the closed-loop system, the output of the error dynamics, and the error $\tilde{w}_{a2}$ are uniformly ultimately bounded (UUB). Since the signal $\overline{\delta}(t)$ is persistently exciting, the critic NN weight errors are also UUB.

5 Simulation

In this section we design two experiments to verify the effectiveness of our method. Since the nonlinear constrained-input optimal control problem has no analytic solution, the first case is a linear system with an input bound large enough that the input never approaches saturation. The optimal solution in this case should coincide with that of the standard linear quadratic regulator (LQR) problem.

5.1 Case 1: Linear System

The system in the first case is chosen as

\dot{x}=\left[\begin{matrix}1&0\\ 0&-2\end{matrix}\right]x+\left[\begin{matrix}2\\ 1\end{matrix}\right]u. \qquad (57)

The cost function is defined by $Q=x_{1}^{2}+x_{2}^{2}$ and $R=1$, and the input constraint is $\lambda=30$. Near the origin this becomes an LQR problem: the optimal value function is quadratic in the states and the optimal control law is linear. Therefore, we choose the activation functions as $\phi_{c}=[x_{1}^{2},x_{1}x_{2},x_{2}^{2}]^{\top}$ and $\phi_{a}=[x_{1},x_{2}]^{\top}$. By solving the algebraic Riccati equation, the optimal weights are obtained as $W^{*}=[0.8779,-0.1904,0.2492,-1.6601,-0.0577]^{\top}$. As for the hyperparameters, the reinforcement interval is set to $T=0.01\ \rm s$ and the learning rates of the two NNs are set to $\alpha_{1}=10^{3}$ and $\alpha_{2}=20$, respectively. The exploration signal added to the input is $e=\sum_{k=1}^{100}\sin(\omega_{k}t)$, where each $\omega_{k}$ is uniformly sampled from $[-50,50]$.
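As a cross-check of the reference weights, the ARE-based solution can be reproduced directly (the input bound $\lambda=30$ is never active, so the problem reduces to LQR near the origin); the printed values should closely match the $W^{*}$ quoted above, up to rounding.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[1.0, 0.0], [0.0, -2.0]])
B = np.array([[2.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)                 # V*(x) = x' P x
w_c = np.array([P[0, 0], 2.0 * P[0, 1], P[1, 1]])    # matches phi_c = [x1^2, x1 x2, x2^2]
w_a = (-np.linalg.inv(R) @ B.T @ P).ravel()          # v*(x) = w_a' [x1, x2], cf. (9)
print(np.round(w_c, 4), np.round(w_a, 4))
```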

After $180\ \rm s$, the exploration signal is removed, and the simulation ends at $t_{f}=200\ \rm s$. The weights converge to $\hat{W}(t_{f})=[0.8777,-0.1876,0.2461,-1.6618,-0.0589]^{\top}$ and $\hat{w}_{a2}(t_{f})=[-1.6618,-0.0596]^{\top}$, which are close to the optimal weights.

5.2 Case 2: Nonlinear System

The second case is a nonlinear system of the form (1) with $f(x)=[-x_{1}+x_{2},\ -0.5(x_{1}+x_{2})+0.5x_{2}(\cos(2x_{1})+2)^{2}]^{\top}$ and $g(x)=[0,\ \cos(2x_{1})+2]^{\top}$. In this case the input bound is set to $\lambda=0.5$, and we choose $\phi_{c}=[x_{1}^{2},x_{2}^{2},x_{1}x_{2},x_{1}^{4},x_{2}^{4},x_{1}^{3}x_{2},x_{1}^{2}x_{2}^{2},x_{1}x_{2}^{3}]^{\top}$ and $\phi_{a}=[x_{1},x_{2},x_{1}^{2},x_{2}^{2},x_{1}x_{2},x_{1}^{3},x_{2}^{3},x_{1}^{2}x_{2},x_{1}x_{2}^{2}]^{\top}$. The learning rates are set to $\alpha_{1}=10^{5}$ and $\alpha_{2}=10$. Due to the input constraint, the exploration signal is chosen as

e(t)=\lambda\tanh\left(\frac{\frac{1}{30}\sum_{k=1}^{100}\sin(\omega_{k}t)+\hat{w}_{a2}^{\top}(t)\phi_{a}}{\lambda}\right)-u(t).
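The role of this signal is to inject probing noise inside the saturation, so that the applied input $u+e$ always respects $|u_{i}|\leq\lambda$. A sketch of the construction is given below, with the frequencies sampled as in Case 1 and placeholder $\hat{w}_{a2}$ and $\phi_{a}$.

```python
import numpy as np

rng = np.random.default_rng(0)
omega = rng.uniform(-50.0, 50.0, size=100)        # probing frequencies, as in Case 1
lam = 0.5

def probe(t):
    return np.sum(np.sin(omega * t)) / 30.0

def applied_input(t, x, w_a2, phi_a, u):
    # e(t) is defined so that u(t) + e(t) = lam * tanh((probe + w_a2' phi_a) / lam).
    e = lam * np.tanh((probe(t) + w_a2.T @ phi_a(x)) / lam) - u
    return u + e                                   # guaranteed to stay in [-lam, lam]
```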

After learning, the weights converge to

\hat{w}_{c}(t_{f})=[0.0601,\ 1.0401,\ -0.1214,\ -0.0755,\ 0.1327,\ -0.0816,\ -0.0551,\ 0.0048]^{\top},
\hat{w}_{a1}(t_{f})=[0.1876,\ -3.1325,\ -0.0278,\ -0.2763,\ -0.2390,\ -2.0274,\ -0.0607,\ 1.9738,\ 0.6572]^{\top},
\hat{w}_{a2}(t_{f})=[0.1826,\ -3.1356,\ -0.0164,\ -0.5104,\ -0.2490,\ -2.0156,\ -0.0420,\ 1.9659,\ 0.6565]^{\top}.

6 Conclusion

This paper presents a novel adaptive optimal control method that solves the constrained-input problem in a completely model-free way. By adding an exploration signal, the actor and critic NNs update their weights simultaneously, and no a priori knowledge of the system dynamics or an initial admissible policy is required during the learning phase. The efficacy of the proposed method is demonstrated by the simulation results.

References

  • [1] Q. Wei, D. Liu, Q. Lin, and R. Song, “Adaptive dynamic programming for discrete-time zero-sum games,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 4, pp. 957–969, 2018.
  • [2] Z. Peng, Y. Zhao, J. Hu, and B. K. Ghosh, “Data-driven optimal tracking control of discrete-time multi-agent systems with two-stage policy iteration algorithm,” Information Sciences, vol. 481, pp. 189–202, 2019.
  • [3] Z. Peng, Y. Zhao, J. Hu, R. Luo, B. K. Ghosh, and S. K. Nguang, “Input–output data-based output antisynchronization control of multiagent systems using reinforcement learning approach,” IEEE Transactions on Industrial Informatics, vol. 17, no. 11, pp. 7359–7367, 2021.
  • [4] M. Abu-Khalaf and F. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, pp. 779–791, 2005.
  • [5] D. Vrabie, O. Pastravanu, F. Lewis, and M. Abu-Khalaf, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, pp. 477–484, 2009.
  • [6] R. Sutton and A. Barto, Reinforcement learning: an introduction. Cambridge University Press, 1998.
  • [7] H. Modares, F. Lewis, and M.-B. Naghibi-Sistani, “Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems,” Automatica, vol. 50, no. 1, pp. 193–202, 2014.
  • [8] H. Modares and F. Lewis, “Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning,” Automatica, vol. 50, no. 7, pp. 1780–1792, 2014.
  • [9] S. Xue, B. Luo, D. Liu, and Y. Yang, “Constrained event-triggered $H_{\infty}$ control based on adaptive dynamic programming with concurrent learning,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1–13, 2020.
  • [10] K. Vamvoudakis and F. Lewis, “Q-learning for continuous-time linear systems: a model-free infinite horizon optimal control approach,” Systems & Control Letters, vol. 100, pp. 14–20, 2017.
  • [11] J. Lee, J. Park, and Y. Choi, “Integral reinforcement learning for continuous-time input-affine nonlinear systems with simultaneous invariant explorations,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, pp. 916–932, 2015.
  • [12] L. Guo and H. Zhao, “Online adaptive optimal control algorithm based on synchronous integral reinforcement learning with explorations,” arXiv e-prints, 2021. arxiv.org/abs/2105.09006.
  • [13] S. Lyshevski, “Optimal control of nonlinear continuous-time systems: design of bounded controllers via generalized nonquadratic functionals,” in Proceedings of the 1998 American Control Conference, vol. 1, pp. 205–209, 1998.
  • [14] P. Ioannou and B. Fidan, Adaptive control tutorial. SIAM, 2006.