
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, P. R. China

Model-free Nearly Optimal Control of Constrained-Input Nonlinear Systems Based on Synchronous Reinforcement Learning

Han Zhao, Lei Guo ([email protected])
Abstract

In this paper a novel model-free algorithm is proposed, which learns a nearly optimal control law for constrained-input systems from online data without requiring any a priori knowledge of the system dynamics. Based on the concept of generalized policy iteration, two neural networks (NNs), namely the critic and the actor, are used to approximate the optimal value function and the optimal policy, respectively. The stability of the closed-loop system and the convergence of the weights are guaranteed by Lyapunov analysis.

Keywords: Optimal Control, Synchronous Reinforcement Learning, Constrained-Input, Actor-Critic Networks
This work is supported by the National Natural Science Foundation (NNSF) of China under Grant 61105103.

1 Introduction

The control of nonlinear constrained-input systems has been widely researched and is a problem of practical value. In the framework of optimal control, an optimal policy that minimizes a user-defined performance index, together with the corresponding value function, can be obtained by solving the Hamilton-Jacobi-Bellman (HJB) equation. For continuous-time nonlinear systems the HJB equation is a partial differential equation whose analytical solution is usually hard to find. Therefore, function approximation using neural networks (NNs) is a useful tool for solving the HJB equation. Such algorithms belong to the family of adaptive dynamic programming (ADP) or reinforcement learning (RL) methods.

ADP/RL methods have solved several discrete-time optimal control problems such as nonlinear zero-sum games [1] and multi-agent systems [2], [3]. A considerable body of literature has also focused on solving the continuous-time constrained-input optimal control problem via ADP/RL. In [4], an off-line policy iteration (PI) algorithm that iteratively solves the HJB equation is proposed; full knowledge of the system dynamics and an initial stabilizing controller are needed in this algorithm. In recent years, many works have aimed to relax the requirement of a priori knowledge of the system dynamics. The concepts of integral reinforcement learning (IRL, [5]) and generalized PI [6] are brought into [7], [8] to estimate the optimal value function in constrained-input problems. An initial stabilizing controller is not needed in these methods, but the input gain matrix of the system is still used in the policy improvement step.

Another way to solve this problem is to use an identifier NN. Xue et al. [9] proposed an algorithm based on an actor-critic-identifier structure to estimate the system dynamics and solve the HJB equation online. Compared with other model-free methods, the extra identifier may bring an additional real-time computational burden.

In terms of IRL, Vamvoudakis [10] proposed a completely model-free Q-learning algorithm to solve the continuous-time optimal control problem. The Q-function combines the value function and the Hamiltonian of Pontryagin's minimum principle and provides the information needed for policy improvement without requiring any a priori knowledge. Similarly, the restriction on the system input dynamics can be relaxed by adding an exploration signal [11]. Our previous work [12] also presents a model-free algorithm that combines the exploration signal and synchronous reinforcement learning to solve the adaptive optimal control problem for continuous-time nonlinear systems.

In this paper we present an algorithm that solves the continuous-time constrained-input optimal control problem without any information about the system dynamics or an initial stabilizing controller. The identifier NN of [9] is also not needed in this method. The remainder of this paper is organized as follows: Section 2 provides the mathematical formulation of optimal control for continuous-time systems with input constraints. In Section 3 we present the weight tuning laws of the actor-critic NNs based on the exploration-HJB equation and synchronous reinforcement learning; during learning, policy evaluation and policy improvement are performed simultaneously and no knowledge of the system dynamics is required. The convergence of the weights and the stability of the closed-loop system are proved through Lyapunov analysis. Finally, we design two simulations to verify the effectiveness of our method.

2 Problem Formulation

Consider a continuous-time input-affine system

\dot{x}(t)=f(x(t))+g(x(t))u(t),\quad x(0)=\xi, \qquad (1)

where $x\in\mathbb{R}^{n}$ and $u\in\mathbb{R}^{m}$ are the measurable state and input vectors, respectively, and $\xi$ denotes the initial state of the system. $f(x)\in\mathbb{R}^{n}$ is the system drift dynamics and $g(x)\in\mathbb{R}^{n\times m}$ is the system input dynamics. The system dynamics $f(x)+g(x)u$ is assumed to be Lipschitz on a compact set $\Omega$ and to satisfy $f(0)=0$.

The objective of optimal control is to design a controller that minimize a user-defined performance index. The performance index in this paper is defined as

J(x,u)=\int_{0}^{\infty}\left(Q(x(\tau))+U(u(\tau))\right){\rm d}\tau, \qquad (2)

where both $Q(x)$ and $U(u)$ are positive definite functions.

In this paper, the constrained-input control problem is considered, in which the input vector must satisfy the constraint $|u_{i}|\leq\lambda\ (i=1,\ldots,m)$. To deal with this constraint, a non-quadratic integral cost on the input is usually employed [13]:

U(u)=2\int_{0}^{u}\lambda\bm{\vartheta}^{-1}\left(\frac{s}{\lambda}\right)R\,{\rm d}s, \qquad (3)

where $\bm{\vartheta}(s)=\left[\vartheta(s_{1}),\ldots,\vartheta(s_{m})\right]^{\top}$ and $\bm{\vartheta}^{-1}(s)=\left[\vartheta^{-1}(s_{1}),\ldots,\vartheta^{-1}(s_{m})\right]^{\top}$. $\vartheta$ is a bounded one-to-one monotonic odd function; the hyperbolic tangent $\vartheta(\cdot)=\tanh(\cdot)$ is often used. $R$ is a positive definite symmetric matrix. For convenience of computation and analysis, $R$ is assumed to be diagonal, i.e. $R={\rm diag}(r_{1},\ldots,r_{m})$.
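For illustration, the cost (3) with $\vartheta=\tanh$ can be evaluated numerically; the minimal sketch below integrates it channel by channel with `scipy`. The values $\lambda=1$ and $R=1$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.integrate import quad

# Numerically evaluate the non-quadratic input cost (3) with theta = tanh,
# i.e. U(u) = sum_i 2 * integral_0^{u_i} lambda * arctanh(s/lambda) * r_i ds.
lam = 1.0                     # input bound lambda (illustrative)
r = np.array([1.0])           # diagonal entries of R (illustrative)

def U(u):
    total = 0.0
    for ui, ri in zip(np.atleast_1d(u), r):
        val, _ = quad(lambda s: 2.0 * lam * np.arctanh(s / lam) * ri, 0.0, ui)
        total += val
    return total

print(U(0.5))    # finite cost well inside the constraint
print(U(0.99))   # cost grows steeply as |u_i| approaches lambda
```

The integrand diverges at $|u_{i}|=\lambda$, but the integral stays finite, which is what makes this penalty suitable for enforcing the saturation limit.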

The goal is to find the optimal control law $u^{*}$ that stabilizes system (1) and minimizes the performance index (2). The following value function is used:

V^{\mu}=J(x,u)|_{u=\mu(x)},\quad V^{\mu}(0)=0, \qquad (4)

where $\mu(x)$ is a feedback policy with $\mu(0)=0$. The value function is well defined only if the policy is admissible [4].
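As a concrete illustration of (4), the value of a fixed admissible policy can be estimated by simulating the closed-loop system and accumulating the running cost. The toy drift and input dynamics, the gain $K$, and the horizon below are assumptions made for this sketch only; $U$ uses the closed form of (3) for $\vartheta=\tanh$.

```python
import numpy as np
from scipy.integrate import solve_ivp

lam = 1.0
K = np.array([[-1.0, -0.5]])                          # hand-picked feedback gain

def f(x): return np.array([x[1], -x[0] - x[1]])       # assumed drift dynamics
def g(x): return np.array([[0.0], [1.0]])             # assumed input dynamics
def mu(x): return lam * np.tanh(K @ x / lam)          # constrained policy, |mu_i| <= lam

def Q(x): return float(x @ x)
def U(u):                                             # closed form of (3) for theta = tanh, R = I
    u = np.atleast_1d(u)
    return float(2*lam*u @ np.arctanh(u/lam) + lam**2*np.sum(np.log(1 - (u/lam)**2)))

def rhs(t, z):                                        # augmented state: [x, accumulated cost]
    x = z[:2]
    u = mu(x)
    return np.concatenate((f(x) + (g(x) @ u).ravel(), [Q(x) + U(u)]))

xi = np.array([1.0, -0.5])
sol = solve_ivp(rhs, [0.0, 50.0], np.concatenate((xi, [0.0])), rtol=1e-8)
print("V^mu(xi) ~", sol.y[2, -1])                     # finite accumulated cost
```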

Definition 1.

A constrained policy is admissible on $\Omega$, denoted $\mu\in\mathcal{A}(\Omega)$, if: 1) the policy stabilizes system (1) on $\Omega$; 2) for any state $x\in\Omega$, $V^{\mu}(x)$ is finite; 3) for any state $x\in\Omega$, $|\mu_{i}(x)|\leq\lambda\ (i=1,\ldots,m)$. $\Omega$ is called the admissible domain and $\mathcal{A}(\Omega)$ is the set of admissible policies.

Assume that the set of admissible policies $\mathcal{A}(\Omega)$ of system (1) is not empty and that $V^{\mu}\in\mathcal{C}^{1}(\Omega)$. Then there exists an optimal policy $\mu^{*}(x)\in\mathcal{A}(\Omega)$ such that

V^{\mu^{*}}(\xi)=\min_{\mu\in\mathcal{A}(\Omega)}\int_{0}^{\infty}\left(Q(x(\tau))+U(u(\tau))\right){\rm d}\tau,\quad\forall\xi\in\Omega.

The optimal value function satisfies $V^{*}(x)=V^{\mu^{*}}(x)$; in the remainder of the paper both symbols are used interchangeably.

According to the definition of the value function (4), the infinitesimal version of (2) is obtained as

0=Q(x)+U(\mu(x))+(\nabla V^{\mu}(x))^{\top}(f(x)+g(x)\mu(x)), \qquad (5)

where $\nabla V(x)=\partial V(x)/\partial x\in\mathbb{R}^{n}$. Define the Hamiltonian as

\mathcal{H}(x,u,\nabla V^{\mu})=Q(x)+U(u)+(\nabla V^{\mu})^{\top}(f(x)+g(x)u). \qquad (6)

For the optimal value function $V^{*}$ and the optimal control law $u^{*}$, the following Hamilton-Jacobi-Bellman (HJB) equation is satisfied:

0=Q(x)+U(u^{*})+\nabla V^{*\top}(f(x)+g(x)u^{*}),\quad V^{*}(0)=0. \qquad (7)

The optimal policy can be derived by minimizing the Hamiltonian. For the input-affine system (1) and the performance index (2) with the input cost (3), the optimal policy can be written directly as

u^{*}(x)=\lambda{\bm{\vartheta}}\left(-\frac{1}{2\lambda}R^{-1}g^{\top}(x)\nabla V^{*}(x)\right). \qquad (8)

Let

v^{*}(x)=-\frac{1}{2}R^{-1}g^{\top}(x)\nabla V^{*}(x), \qquad (9)

the optimal policy can be obtained as

u^{*}(x)=\lambda{\bm{\vartheta}}\left(\frac{v^{*}(x)}{\lambda}\right). \qquad (10)
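The saturated structure of (8)-(10) is easy to see in code: the unconstrained direction $v^{*}$ is pushed through $\lambda\tanh(\cdot/\lambda)$, so every input channel stays inside $[-\lambda,\lambda]$. In the sketch below, $g(x)$ is the Case 2 input dynamics from Section 5, while the value-function gradient is a placeholder assumption.

```python
import numpy as np

lam = 0.5
R_inv = np.linalg.inv(np.diag([1.0]))                 # R assumed to be 1

def g(x):                                             # input dynamics of the Case 2 example
    return np.array([[0.0], [np.cos(2.0 * x[0]) + 2.0]])

def grad_V(x):                                        # placeholder for the true gradient of V*
    return np.array([2.0 * x[0], 4.0 * x[1]])

def u_star(x):
    v = -0.5 * R_inv @ g(x).T @ grad_V(x)             # unconstrained direction (9)
    return lam * np.tanh(v / lam)                     # constrained optimal policy (10)

print(u_star(np.array([0.3, -0.2])))                  # every entry lies in [-lam, lam]
```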

It is hard to solve (7) for the optimal value function and policy directly, and doing so requires a priori knowledge of the system dynamics. Therefore, an online data-driven algorithm is a useful tool to relax this restriction.

3 Online Synchronous Reinforcement Learning Algorithm to Solve HJB Equation

In this section a data-driven method for learning the optimal policy is introduced. The framework of integral reinforcement learning (IRL) was proposed in [5]. A completely model-free algorithm based on IRL was developed by Lee et al. [11] to solve the optimal control problem of input-affine systems; however, an initial stabilizing controller is still required to start the iteration.

In our previous work [12], an algorithm called synchronous integral Q-learning was proposed, which continuously updates the weights of both NNs to estimate the optimal value function and policy. In this paper we extend that algorithm to the constrained-input problem. By adding a bounded, piecewise-continuous, non-zero exploration signal $e$ to the input, system (1) becomes

\dot{x}=f(x)+g(x)(u+e). \qquad (11)

Substituting (11) into the HJB equation (7) and, for any time $t>T$ and time interval $T>0$, integrating (7) over $[t-T,t]$ along the trajectory of system (11), the following integral temporal difference equation is obtained:

V^{*}(x(t))-V^{*}(x(t-T))-\int_{t-T}^{t}\nabla V^{*\top}(x)g(x)e(\tau)\,{\rm d}\tau=-\int_{t-T}^{t}\left(Q(x)+U(u^{*})\right){\rm d}\tau. \qquad (12)

The following HJB equation is derived by substituting (9):

V^{*}(x(t))-V^{*}(x(t-T))+2\int_{t-T}^{t}v^{*\top}Re(\tau)\,{\rm d}\tau=-\int_{t-T}^{t}\left(Q(x)+U(u^{*})\right){\rm d}\tau. \qquad (13)

Both $V^{*}(x)$ and $v^{*}(x)$ can be solved from the integral exploration-HJB equation (13), and the optimal policy is then computed by (10). Note that no knowledge of the system dynamics is needed in these two equations.

Since the analytical solutions of $V^{*}(x)$ and $v^{*}(x)$ cannot be determined, we approximate them by choosing a proper structure of neural networks (NNs). Assume $V^{*}(x)$ is a smooth function. Under the assumptions on the system dynamics, $v^{*}(x)$ is then also smooth. Therefore, there exist two single-layer NNs, namely the critic and actor NN, such that:

1) $V^{*}(x)$ and its gradient can be universally approximated:

V^{*}(x)=w_{c}^{\top}\phi_{c}(x)+\varepsilon_{c}(x), \qquad (14)
\nabla V^{*}(x)=\nabla\phi_{c}^{\top}(x)w_{c}+\nabla\varepsilon_{c}(x), \qquad (15)

2) $v^{*}(x)$ can be universally approximated:

v^{*}(x)=-\frac{1}{2}R^{-1}g^{\top}(x)(\nabla\phi_{c}^{\top}(x)w_{c}+\nabla\varepsilon_{c}(x))=w_{a}^{\top}\phi_{a}(x)+\varepsilon_{a}(x), \qquad (16)

where $\phi_{c}(x):\mathbb{R}^{n}\to\mathbb{R}^{N_{c}}$, $w_{c}$, and $\varepsilon_{c}$ are the activation function, weights, and reconstruction error of the critic NN, respectively, and $N_{c}$ is the number of neurons in its hidden layer. Similarly, $\phi_{a}(x):\mathbb{R}^{n}\to\mathbb{R}^{N_{a}}$, $w_{a}$, and $\varepsilon_{a}$ denote the activation function, weights, and reconstruction error of the actor NN, and $N_{a}$ is the number of neurons in its hidden layer. $\varepsilon_{c}(x)$ and $\varepsilon_{a}(x)$ are bounded on a compact set $\Omega$. According to the Weierstrass approximation theorem, as $N_{c},N_{a}\to\infty$ we have $\varepsilon_{c},\nabla\varepsilon_{c},\varepsilon_{a}\to 0$.
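For concreteness, the approximators (14) and (16) are simply linear combinations of fixed basis functions; the snippet below uses the polynomial activations chosen for Case 1 in Section 5, with placeholder (untrained) weights.

```python
import numpy as np

def phi_c(x):                                  # critic activations, N_c = 3
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def phi_a(x):                                  # actor activations, N_a = 2
    return np.array([x[0], x[1]])

w_c = np.zeros(3)                              # critic weights, learned online
w_a = np.zeros((2, 1))                         # actor weights (N_a x m), learned online

V_hat = lambda x: float(w_c @ phi_c(x))        # cf. (21)
v_hat = lambda x: w_a.T @ phi_a(x)             # cf. (22)
```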

Define the Bellman error $\varepsilon_{B}$ based on (13) as

w_{c}^{\top}\Delta\phi_{c}+2\int_{t-T}^{t}\phi_{a}^{\top}(x(\tau))w_{a}Re(\tau)\,{\rm d}\tau+\int_{t-T}^{t}\left(Q(x(\tau))+U(u^{*}(x(\tau)))\right){\rm d}\tau=\varepsilon_{B}, \qquad (17)

where $\Delta\phi_{c}=\phi_{c}(x(t))-\phi_{c}(x(t-T))$. The integral reinforcement is defined as

\rho(x,u)=\int_{t-T}^{t}\left(Q(x(\tau))+U(u(x(\tau)))\right){\rm d}\tau.
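In practice $\rho$ is computed from sampled data over each reinforcement interval. A minimal sketch using the trapezoidal rule is given below, with $Q(x)=x^{\top}x$ and the Case 1 settings ($R=1$, $\lambda=30$) as illustrative choices and randomly generated logs standing in for real measurements.

```python
import numpy as np

def integral_reinforcement(ts, xs, us, Q, U):
    """ts: (k,) sample times in [t-T, t]; xs: (k, n) states; us: (k, m) inputs."""
    costs = np.array([Q(x) + U(u) for x, u in zip(xs, us)])
    return np.trapz(costs, ts)

lam = 30.0
Q = lambda x: float(x @ x)
U = lambda u: float(2*lam*u @ np.arctanh(u/lam) + lam**2*np.sum(np.log(1 - (u/lam)**2)))

ts = np.linspace(0.0, 0.01, 11)                # one interval with T = 0.01 s
xs = 0.1 * np.random.randn(11, 2)              # logged states (placeholder)
us = np.clip(np.random.randn(11, 1), -1, 1)    # logged inputs (placeholder), well below lam
print(integral_reinforcement(ts, xs, us, Q, U))
```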

Using properties of the Kronecker product, the following equation can be derived from (17):

\varepsilon_{B}-\rho(x,u^{*})=W^{\top}\delta, \qquad (18)

where $W=[w_{c}^{\top},{\rm col}\{w_{a}\}^{\top}]^{\top}$, ${\rm col}\{w_{a}\}$ is the column vector obtained by reshaping $w_{a}$, and

\delta=\left[\Delta\phi_{c}^{\top},\ \left(\int_{t-T}^{t}2\phi_{a}\otimes(Re(\tau))\,{\rm d}\tau\right)^{\top}\right]^{\top}.
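The regressor $\delta$ can be assembled from the same logged data, with the integral again replaced by a trapezoidal sum and the Kronecker product supplied by `np.kron`. The features, $R$, and the logged signals below are placeholders, and the stacking order of ${\rm col}\{\cdot\}$ is one particular reshape convention assumed for this sketch.

```python
import numpy as np

def build_delta(ts, xs, es, phi_c, phi_a, R):
    d_phi_c = phi_c(xs[-1]) - phi_c(xs[0])                     # Delta phi_c
    kron_terms = np.array([2.0 * np.kron(phi_a(x), R @ e)      # 2 phi_a (x) (R e)
                           for x, e in zip(xs, es)])
    integral = np.trapz(kron_terms, ts, axis=0)
    return np.concatenate((d_phi_c, integral))

phi_c = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])
phi_a = lambda x: np.array([x[0], x[1]])
R = np.eye(1)
ts = np.linspace(0.0, 0.01, 11)
xs = np.random.randn(11, 2)                                    # logged states (placeholder)
es = 0.1 * np.random.randn(11, 1)                              # logged exploration (placeholder)
print(build_delta(ts, xs, es, phi_c, phi_a, R).shape)          # N_c + N_a*m = 5
```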

Before introducing the weight tuning laws, it is necessary to make some assumptions on the NNs.

Assumption 1.

The system dynamics and NNs satisfy:

a. $f(x)$ is Lipschitz and $g(x)$ is bounded:

\Arrowvert f(x)\Arrowvert\leq b_{f}\Arrowvert x\Arrowvert,\quad\Arrowvert g(x)\Arrowvert\leq b_{g}.

b. The reconstruction errors and the gradient of the critic NN's reconstruction error are bounded:

\Arrowvert\varepsilon_{c}\Arrowvert\leq b_{\varepsilon c},\quad\Arrowvert\varepsilon_{a}\Arrowvert\leq b_{\varepsilon a},\quad\Arrowvert\nabla\varepsilon_{c}\Arrowvert\leq b_{\varepsilon cx}.

c. The activation functions and the gradient of the critic NN's activation function are bounded:

\Arrowvert\phi_{c}(x)\Arrowvert\leq b_{\phi c},\quad\Arrowvert\phi_{a}(x)\Arrowvert\leq b_{\phi a},\quad\Arrowvert\nabla\phi_{c}(x)\Arrowvert\leq b_{\phi cx}.

d. The optimal NN weights $\Arrowvert w_{c}\Arrowvert,\Arrowvert w_{a}\Arrowvert$ are bounded, and the amplitude of the exploration signal is bounded by $b_{e}$.

These assumptions are standard and easily satisfied, except possibly the boundedness of $g(x)$. With the assumption that the system dynamics is Lipschitz, the Bellman error $\varepsilon_{B}$ is bounded on a compact set; when $N_{c},N_{a}\to\infty$, $\varepsilon_{B}\to 0$ uniformly [4].

In this paper we use $\vartheta(\cdot)=\tanh(\cdot)$. The non-quadratic cost $U(u)$ is then derived as

U(u)=2\int_{0}^{u}\lambda\tanh^{-1}\left(\frac{s}{\lambda}\right)R\,{\rm d}s=2\lambda u^{\top}R\tanh^{-1}\left(\frac{u}{\lambda}\right)+\lambda^{2}\overline{R}\ln\left(\bm{1}_{m}-\left(\frac{u}{\lambda}\right)^{2}\right), \qquad (19)

where $\overline{R}=[r_{1},\ldots,r_{m}]$ and $\bm{1}_{m}$ is the $m$-dimensional vector whose elements are all one. According to (17) and (19), the approximate exploration-HJB equation is obtained as

\int_{t-T}^{t}\left(-Q-2\lambda\tanh^{\top}\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)Rw_{a}^{\top}\phi_{a}-\lambda^{2}\overline{R}\ln\Big(\bm{1}_{m}-\tanh^{2}\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)\Big)+\varepsilon_{HJB}\right){\rm d}\tau=W^{\top}\delta, \qquad (20)

in which $\varepsilon_{HJB}(x)$ is the error arising from the reconstruction errors of the two NNs. Since the ideal weights $w_{c}$ and $w_{a}$ are unknown, they are approximated in real time as

\hat{V}(x)=\hat{w}_{c}^{\top}\phi_{c}(x), \qquad (21)
\hat{v}_{1}(x)=\hat{w}_{a1}^{\top}\phi_{a}(x). \qquad (22)

The optimal policy is obtained as

u(x)=\lambda\tanh\left(\frac{w_{a}^{\top}\phi_{a}(x)+\varepsilon_{a}(x)}{\lambda}\right).

We use an actor NN to approximate the optimal policy as

\hat{u}(x)=\lambda\tanh\left(\frac{\hat{v}_{2}(x)}{\lambda}\right), \qquad (23)

where $\hat{v}_{2}(x)=\hat{w}_{a2}^{\top}\phi_{a}(x)$. Note that $\hat{v}_{1}$ and $\hat{v}_{2}$ share the same approximation structure; however, for reasons of Lyapunov stability, the estimated weights in (22) are not implemented directly on the controller. Defining $\hat{W}=[\hat{w}_{c}^{\top},{\rm col}\{\hat{w}_{a1}\}^{\top}]^{\top}$, the approximate Bellman error of the critic NN is obtained as

E=\hat{W}^{\top}\delta+\rho(x,\hat{u}). \qquad (24)

In order to minimize this error, we define the objective function of the critic NN as $K=\frac{1}{2}E^{\top}E$. The modified gradient-descent law is obtained as

\dot{\hat{W}}=-\alpha_{1}\frac{\delta}{1+\delta^{\top}\delta}E, \qquad (25)

where $\alpha_{1}$ is the learning rate that determines the speed of convergence. In order to guarantee the stability of the closed-loop system and the convergence of the weights, define the approximation error of the actor NN as

E_{u}=\lambda R\left(\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-\tanh\Big(\frac{\hat{w}_{a1}^{\top}\phi_{a}}{\lambda}\Big)\right). \qquad (26)

The gradient-descent tuning law of $\hat{w}_{a2}$ is set as

{\rm col}\{\dot{\hat{w}}_{a2}\}=-\alpha_{2}\bigg(Y{\rm col}\{\hat{w}_{a2}\}+E_{u}\otimes\phi_{a}-\Big(E_{u}\tanh^{2}\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big)\otimes\phi_{a}\bigg), \qquad (27)

where $Y$ is a design parameter used to guarantee stability. Defining $m_{s}=1+\delta^{\top}\delta$ and $\overline{\delta}=\delta/m_{s}$, the following theorem is obtained.
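Putting the pieces together, one forward-Euler step of the update laws (25)-(27) can be sketched as follows. The shapes, the column-stacking convention used for ${\rm col}\{\cdot\}$, and the step size `dt` are assumptions made for illustration; `delta` and `rho` are built from logged data as sketched earlier.

```python
import numpy as np

def learning_step(W_hat, w_a1, w_a2, delta, rho, phi_a, R, Y,
                  alpha1, alpha2, lam, dt):
    # Critic / first-actor update (25): normalized gradient descent on
    # the approximate Bellman error E = W_hat' delta + rho(x, u_hat).
    E = W_hat @ delta + rho
    W_hat = W_hat - dt * alpha1 * delta / (1.0 + delta @ delta) * E

    # Actor error (26) between the implemented policy (w_a2) and the estimate (w_a1).
    E_u = lam * R @ (np.tanh(w_a2.T @ phi_a / lam) - np.tanh(w_a1.T @ phi_a / lam))

    # Second-actor update (27), written on the column-stacked weights.
    sat2 = np.tanh(w_a2.T @ phi_a / lam) ** 2
    col = w_a2.reshape(-1, order="F")                # one possible col{.} convention
    grad = Y @ col + np.kron(E_u, phi_a) - np.kron(E_u * sat2, phi_a)
    w_a2 = (col - dt * alpha2 * grad).reshape(w_a2.shape, order="F")
    return W_hat, w_a2
```

In a full implementation this step would run once per reinforcement interval while the exploration signal keeps $\overline{\delta}$ persistently exciting.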

Theorem 1.

Suppose all conditions in Assumption 1 are satisfied and the signal $\overline{\delta}(t)$ is persistently exciting [14], i.e. there exist constants $\beta_{2}>\beta_{1}>0$ such that $\beta_{1}I\leq\int_{t-T}^{t}\overline{\delta}\,\overline{\delta}^{\top}{\rm d}\tau\leq\beta_{2}I,\ \forall t>T$. Then there exist a positive integer $N_{0}$ and a sufficiently small reinforcement interval $T>0$ such that, when the numbers of neurons satisfy $N_{c},N_{a}>N_{0}$, the states of the closed-loop system and the estimation errors of $\hat{V}(x),\hat{v}_{1}(x)$ and $\hat{u}(x)$ are uniformly ultimately bounded.

4 Proof of Theorem 1

Define the weight errors as $\tilde{w}_{c}=w_{c}-\hat{w}_{c}$, $\tilde{w}_{a1}=w_{a}-\hat{w}_{a1}$ and $\tilde{w}_{a2}=w_{a}-\hat{w}_{a2}$. Consider the following Lyapunov candidate:

L=V^{*}(x(t))+\frac{1}{2}\tilde{W}^{\top}(t)\alpha_{1}^{-1}\tilde{W}(t)+\frac{1}{2}\tilde{w}_{a2}^{\top}(t)\alpha_{2}^{-1}\tilde{w}_{a2}(t). \qquad (28)

The time derivative of (28) is

\dot{L}=\dot{V}^{*}(x(t))+\tilde{W}^{\top}(t)\alpha_{1}^{-1}\dot{\tilde{W}}(t)+\tilde{w}_{a2}^{\top}(t)\alpha_{2}^{-1}\dot{\tilde{w}}_{a2}(t). \qquad (29)

The first term in (29) is obtained as

\dot{V}^{*}=w_{c}^{\top}\nabla\phi_{c}\Big(f+g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+ge\Big)+\nabla\varepsilon_{c}^{\top}\Big(f+g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+ge\Big). \qquad (30)

From (16), we have

w_{c}^{\top}\nabla\phi_{c}g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)=-2\phi_{a}^{\top}w_{a}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-2\varepsilon_{a}^{\top}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-\nabla\varepsilon_{c}^{\top}g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big). \qquad (31)

Using $w_{a}=\hat{w}_{a2}+\tilde{w}_{a2}$ and the fact that $x^{\top}\tanh(x)\geq 0,\ \forall x$, the first term on the right-hand side of (31) satisfies

-2\phi_{a}^{\top}w_{a}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\leq-2\phi_{a}^{\top}\tilde{w}_{a2}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big). \qquad (32)

Therefore, (30) turns into

\dot{V}^{*}\leq w_{c}^{\top}\nabla\phi_{c}\Big(f+g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+ge\Big)+\varepsilon_{1}-2\phi_{a}^{\top}w_{a}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-w_{c}^{\top}\nabla\phi_{c}g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big), \qquad (33)

where $\varepsilon_{1}=\nabla\varepsilon_{c}^{\top}(f+ge)-2\varepsilon_{a}^{\top}R\lambda\tanh(\hat{w}_{a2}^{\top}\phi_{a}/\lambda)$. According to Assumption 1, it satisfies

\varepsilon_{1}\leq b_{\varepsilon cx}b_{f}\Arrowvert x\Arrowvert+b_{\varepsilon cx}b_{g}b_{e}+2\lambda b_{\varepsilon a}\sigma_{\min}(R), \qquad (34)

where $\sigma_{\min}(R)$ denotes the minimum singular value of the matrix $R$. Substituting the derivative of the approximate HJB equation (20) into (33), we have

\dot{V}^{*}\leq-Q-U\Big(\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)\Big)+\varepsilon_{HJB}+\varepsilon_{1}-2\phi_{a}^{\top}w_{a}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-w_{c}^{\top}\nabla\phi_{c}g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big). \qquad (35)

The last term in (35) satisfies

-w_{c}^{\top}\nabla\phi_{c}g\lambda\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)\leq\lambda b_{g}b_{\phi cx}\Arrowvert w_{c}\Arrowvert. \qquad (36)

Since $Q$ and $U$ are positive definite, there exists a $q>0$ such that $x^{\top}qx<Q+U$. Therefore, (35) becomes

\dot{V}^{*}\leq-x^{\top}qx+k_{1}\Arrowvert x\Arrowvert+k_{2}-2\phi_{a}^{\top}\tilde{w}_{a2}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big), \qquad (37)

where

k_{1}=b_{\varepsilon cx}b_{f},\quad k_{2}=b_{\varepsilon cx}b_{g}b_{e}+2\lambda b_{\varepsilon a}\sigma_{\min}(R)+\lambda b_{g}b_{\phi cx}\Arrowvert w_{c}\Arrowvert+\varepsilon_{h},

and $\varepsilon_{h}$ is the upper bound of $\varepsilon_{HJB}$.

Now consider the second term in (29). According to the tuning law (25), we have the following error dynamics:

\dot{\tilde{W}}(t)=\alpha_{1}(\overline{\delta}/m_{s})E(t),\quad y(t)=\overline{\delta}^{\top}\tilde{W}(t). \qquad (38)

Substituting HJB equation (20) into (38), one has

E(t)=\hat{W}^{\top}(t)\delta(t)+\int_{t-T}^{t}\big(Q+U(\hat{u})\big){\rm d}\tau=\int_{t-T}^{t}\Big(Q+U(\hat{u})-Q-U(u^{*})+\varepsilon_{HJB}+\hat{w}_{c}^{\top}(t)\nabla\phi_{c}\Big(f+g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+ge\Big)-w_{c}^{\top}\nabla\phi_{c}\Big(f+g\lambda\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)+ge\Big)\Big){\rm d}\tau-2{\rm col}\{\tilde{w}_{a1}\}^{\top}\int_{t-T}^{t}\phi_{a}\otimes Re\,{\rm d}\tau. \qquad (39)

According to the definition of $U$, we have

U(\hat{u})-U(u^{*})=2\lambda\phi_{a}^{\top}\hat{w}_{a2}R\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-2\lambda\phi_{a}^{\top}w_{a}R\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)+\lambda^{2}\overline{R}\ln\Big(\bm{1}_{m}-\tanh^{2}\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big)-\lambda^{2}\overline{R}\ln\Big(\bm{1}_{m}-\tanh^{2}\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)\Big). \qquad (40)

According to [7], the third term in (40) can be written as

\ln\Big(\bm{1}_{m}-\tanh^{2}\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big)=\ln 4-2\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\,{\rm sgn}\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+\varepsilon_{\hat{D}}, \qquad (41)

where $\Arrowvert\varepsilon_{\hat{D}}\Arrowvert\leq\ln 4$. Similarly, we have

\ln\Big(\bm{1}_{m}-\tanh^{2}\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)\Big)=\ln 4-2\frac{w_{a}^{\top}\phi_{a}}{\lambda}\,{\rm sgn}\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)+\varepsilon_{D}, \qquad (42)

where $\Arrowvert\varepsilon_{D}\Arrowvert\leq\ln 4$. The following expression is derived from (40):

U(\hat{u})-U(u^{*})=\lambda^{2}\overline{R}(\varepsilon_{\hat{D}}-\varepsilon_{D})+2\lambda\phi_{a}^{\top}\hat{w}_{a2}R\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-2\lambda\phi_{a}^{\top}w_{a}R\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)-2\lambda^{2}\overline{R}\hat{w}_{a2}^{\top}\phi_{a}\,{\rm sgn}\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+2\lambda^{2}\overline{R}w_{a}^{\top}\phi_{a}\,{\rm sgn}\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big). \qquad (43)

In [7], a continuous approximation of $x\,{\rm sgn}(x)$ is provided:

0\leq x\,{\rm sgn}(x)-x\tanh(\kappa x)\leq\frac{3.59}{\kappa}. \qquad (44)
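The bound (44) is easy to spot-check numerically; the grid and the value of $\kappa$ below are arbitrary choices for illustration.

```python
import numpy as np

kappa = 10.0
x = np.linspace(-50.0, 50.0, 200001)
gap = np.abs(x) - x * np.tanh(kappa * x)      # x*sgn(x) - x*tanh(kappa*x)
print(gap.min() >= 0.0, gap.max() <= 3.59 / kappa, gap.max())
```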

Combining (43) and (44), the approximate HJB error is

E(t)=\int_{t-T}^{t}\Big(2\lambda\phi_{a}^{\top}\hat{w}_{a2}R\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-2\lambda\phi_{a}^{\top}w_{a}R\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)+\varepsilon_{\kappa}+\varepsilon_{HJB}+2\lambda^{2}\overline{R}w_{a}^{\top}\phi_{a}\Big(\tanh\Big(\kappa\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)-\tanh\Big(\kappa\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big)+2\lambda^{2}\overline{R}\tilde{w}_{a2}^{\top}\phi_{a}\tanh\Big(\kappa\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+\lambda^{2}\overline{R}(\varepsilon_{\hat{D}}-\varepsilon_{D})+\hat{w}_{c}^{\top}(t)\nabla\phi_{c}(f+ge)+\hat{w}_{c}^{\top}(t)\nabla\phi_{c}g\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-w_{c}^{\top}\nabla\phi_{c}(f+ge)-w_{c}^{\top}\nabla\phi_{c}g\lambda\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)\Big){\rm d}\tau-2{\rm col}\{\tilde{w}_{a1}(t)\}^{\top}\int_{t-T}^{t}\phi_{a}\otimes Re\,{\rm d}\tau, \qquad (45)

in which the approximation error satisfies $0\leq\varepsilon_{\kappa}\leq 7.18/\kappa$.

Substituting (31) into (45), we can get

E(t)=-\overline{\delta}^{\top}\tilde{W}(t)+\int_{t-T}^{t}{\rm col}\{\tilde{w}_{a2}(\tau)\}^{\top}M\,{\rm d}\tau+N(t), \qquad (46)

where

M=2\lambda\phi_{a}\otimes R\Big(\lambda\tanh\Big(\kappa\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)-\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big),
N(t)=\int_{t-T}^{t}\Big(\varepsilon_{HJB}+\lambda^{2}\overline{R}(\varepsilon_{\hat{D}}-\varepsilon_{D})+\varepsilon_{\kappa}+\varepsilon_{2}+2\lambda^{2}\overline{R}w_{a}^{\top}\phi_{a}\Big(\tanh\Big(\kappa\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)-\tanh\Big(\kappa\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big)\Big){\rm d}\tau,

and

\varepsilon_{2}=(2\varepsilon_{a}^{\top}R+\nabla\varepsilon_{c}^{\top}g)\lambda\Big(\tanh\Big(\frac{w_{a}^{\top}\phi_{a}}{\lambda}\Big)-\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big).

Substituting (46) into (38), the error dynamics can be rewritten as

\dot{\tilde{W}}(t)=\alpha_{1}\Big(-\overline{\delta}\,\overline{\delta}^{\top}\tilde{W}(t)+\frac{\overline{\delta}}{m_{s}}\int_{t-T}^{t}{\rm col}\{\tilde{w}_{a2}\}^{\top}M\,{\rm d}\tau+\frac{\overline{\delta}}{m_{s}}N(t)\Big). \qquad (47)

The second term in (29) is obtained as

\tilde{W}^{\top}(t)\alpha_{1}^{-1}\dot{\tilde{W}}(t)=-\tilde{W}^{\top}(t)\overline{\delta}\,\overline{\delta}^{\top}\tilde{W}(t)+\tilde{W}^{\top}(t)\frac{\overline{\delta}}{m_{s}}\int_{t-T}^{t}{\rm col}\{\tilde{w}_{a2}\}^{\top}M\,{\rm d}\tau+\tilde{W}^{\top}(t)\frac{\overline{\delta}}{m_{s}}N(t). \qquad (48)

For a sufficiently small reinforcement interval, the integral term in (48) can be approximated as

\int_{t-T}^{t}{\rm col}\{\tilde{w}_{a2}(\tau)\}^{\top}M\,{\rm d}\tau\approx TM^{\top}{\rm col}\{\tilde{w}_{a2}(t)\}. \qquad (49)

Substituting (49) into (48) and applying Young's inequality, (48) becomes

\tilde{W}^{\top}(t)\alpha_{1}^{-1}\dot{\tilde{W}}(t)\leq-d\tilde{W}^{\top}(t)\overline{\delta}\,\overline{\delta}^{\top}\tilde{W}(t)+\tilde{W}^{\top}(t)\frac{\overline{\delta}}{m_{s}}N(t)+\frac{\varepsilon}{2m_{s}^{2}}{\rm col}\{\tilde{w}_{a2}(t)\}^{\top}MM^{\top}{\rm col}\{\tilde{w}_{a2}(t)\}, \qquad (50)

where $d=1-T^{2}/2\varepsilon$. Then (29) can be bounded as

\dot{L}\leq-x^{\top}qx+k_{1}\Arrowvert x\Arrowvert+k_{2}-d\tilde{W}^{\top}(t)\overline{\delta}\,\overline{\delta}^{\top}\tilde{W}(t)+\tilde{W}^{\top}(t)\frac{\overline{\delta}}{m_{s}}N(t)+\frac{\varepsilon}{2m_{s}^{2}}{\rm col}\{\tilde{w}_{a2}\}^{\top}MM^{\top}{\rm col}\{\tilde{w}_{a2}\}-2\phi_{a}^{\top}\tilde{w}_{a2}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)+{\rm col}\{\tilde{w}_{a2}\}^{\top}\alpha_{2}^{-1}{\rm col}\{\dot{\tilde{w}}_{a2}\}. \qquad (51)

Substituting the tuning law (27) of $\hat{w}_{a2}$ and $\hat{w}_{a2}=w_{a}-\tilde{w}_{a2}$, the last two terms of (51) can be written as

{\rm col}\{\tilde{w}_{a2}\}^{\top}\alpha_{2}^{-1}{\rm col}\{\dot{\tilde{w}}_{a2}\}-2\phi_{a}^{\top}\tilde{w}_{a2}R\lambda\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)=-{\rm col}\{\tilde{w}_{a2}\}^{\top}Y{\rm col}\{\tilde{w}_{a2}\}+{\rm col}\{\tilde{w}_{a2}\}^{\top}k_{3}, \qquad (52)

where

k_{3}=Y{\rm col}\{w_{a}\}-\Big(E_{u}\tanh^{2}\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\Big)\otimes\phi_{a}-\lambda R\tanh\Big(\frac{\hat{w}_{a2}^{\top}\phi_{a}}{\lambda}\Big)\otimes\phi_{a}-\lambda R\tanh\Big(\frac{\hat{w}_{a1}^{\top}\phi_{a}}{\lambda}\Big)\otimes\phi_{a}.

Eq. (51) becomes

\dot{L}\leq-x^{\top}qx+k_{1}\Arrowvert x\Arrowvert+k_{2}-d\tilde{W}^{\top}(t)\overline{\delta}\,\overline{\delta}^{\top}\tilde{W}(t)+\tilde{W}^{\top}(t)\frac{\overline{\delta}}{m_{s}}N(t)-{\rm col}\{\tilde{w}_{a2}\}^{\top}B{\rm col}\{\tilde{w}_{a2}\}+{\rm col}\{\tilde{w}_{a2}\}^{\top}k_{3}, \qquad (53)

where $B=Y-\varepsilon MM^{\top}/2m_{s}^{2}$. Choose $T$ and $Y$ properly such that $d>0$ and $B$ is positive definite. Thus, $\dot{L}$ is negative if

\Arrowvert x\Arrowvert>\frac{k_{1}}{2\sigma_{\min}(q)}+\sqrt{\frac{k_{1}^{2}}{4\sigma_{\min}^{2}(q)}+\frac{k_{2}}{\sigma_{\min}(q)}}, \qquad (54)
\Arrowvert\overline{\delta}^{\top}\tilde{W}\Arrowvert>\frac{\Arrowvert N\Arrowvert}{d}, \qquad (55)
\Arrowvert\tilde{w}_{a2}\Arrowvert>\frac{\Arrowvert k_{3}\Arrowvert}{\sigma_{\min}(B)}. \qquad (56)

The inequalities above show that the states of the closed-loop system, the output of the error dynamics, and the error $\tilde{w}_{a2}$ are uniformly ultimately bounded (UUB). Since the signal $\overline{\delta}(t)$ is persistently exciting, the critic NN weight errors are also UUB.

5 Simulation

In this section we design two experiments to verify the effectiveness of our method. Since the nonlinear constrained-input optimal control problem has no analytic solution, the first case is a linear system with an input bound large enough that the input never approaches saturation. The optimal solution in this case should coincide with that of the standard linear quadratic regulator (LQR) problem.

5.1 Case 1: Linear System

The system in the first case is chosen as

\dot{x}=\left[\begin{matrix}1&0\\ 0&-2\end{matrix}\right]x+\left[\begin{matrix}2\\ 1\end{matrix}\right]u. \qquad (57)

The cost function is defined by $Q=x_{1}^{2}+x_{2}^{2}$ and $R=1$, and the input constraint is $\lambda=30$. Near the origin this becomes an LQR problem: the optimal value function is quadratic in the states and the optimal control law is linear. Therefore, we choose the activation functions as $\phi_{c}=[x_{1}^{2},x_{1}x_{2},x_{2}^{2}]^{\top}$ and $\phi_{a}=[x_{1},x_{2}]^{\top}$. By solving the algebraic Riccati equation, the optimal weights are obtained as $W^{*}=[0.8779,-0.1904,0.2492,-1.6601,-0.0577]^{\top}$. As for the hyperparameters, the reinforcement interval is set to $T=0.01\ \rm s$ and the learning rates of the two NNs are set to $\alpha_{1}=10^{3}$ and $\alpha_{2}=20$, respectively. The exploration signal added to the input is $e=\sum_{k=1}^{100}\sin(\omega_{k}t)$, where each $\omega_{k}$ is uniformly sampled from $[-50,50]$.
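As a cross-check of the reference weights, the ARE-based solution can be reproduced directly (the input bound $\lambda=30$ is never active, so the problem reduces to LQR near the origin); the printed values should closely match the $W^{*}$ quoted above, up to rounding.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[1.0, 0.0], [0.0, -2.0]])
B = np.array([[2.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)                 # V*(x) = x' P x
w_c = np.array([P[0, 0], 2.0 * P[0, 1], P[1, 1]])    # matches phi_c = [x1^2, x1 x2, x2^2]
w_a = (-np.linalg.inv(R) @ B.T @ P).ravel()          # v*(x) = w_a' [x1, x2], cf. (9)
print(np.round(w_c, 4), np.round(w_a, 4))
```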

After $180\ \rm s$, the exploration signal is removed, and the simulation ends at $t_{f}=200\ \rm s$. The weights converge to $\hat{W}(t_{f})=[0.8777,-0.1876,0.2461,-1.6618,-0.0589]^{\top}$ and $\hat{w}_{a2}(t_{f})=[-1.6618,-0.0596]^{\top}$, which are close to the optimal weights.

5.2 Case 2: Nonlinear System

The second case is a nonlinear system of the form (1) with $f(x)=[-x_{1}+x_{2},\ -0.5(x_{1}+x_{2})+0.5x_{2}(\cos(2x_{1})+2)^{2}]^{\top}$ and $g(x)=[0,\ \cos(2x_{1})+2]^{\top}$. In this case the input bound is set to $\lambda=0.5$, and we choose $\phi_{c}=[x_{1}^{2},x_{2}^{2},x_{1}x_{2},x_{1}^{4},x_{2}^{4},x_{1}^{3}x_{2},x_{1}^{2}x_{2}^{2},x_{1}x_{2}^{3}]^{\top}$ and $\phi_{a}=[x_{1},x_{2},x_{1}^{2},x_{2}^{2},x_{1}x_{2},x_{1}^{3},x_{2}^{3},x_{1}^{2}x_{2},x_{1}x_{2}^{2}]^{\top}$. The learning rates are set to $\alpha_{1}=10^{5}$ and $\alpha_{2}=10$. Due to the input constraint, the exploration signal is chosen as

e(t)=\lambda\tanh\left(\frac{\frac{1}{30}\sum_{k=1}^{100}\sin(\omega_{k}t)+\hat{w}_{a2}^{\top}(t)\phi_{a}}{\lambda}\right)-u(t).
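The role of this signal is to inject probing noise inside the saturation, so that the applied input $u+e$ always respects $|u_{i}|\leq\lambda$. A sketch of the construction is given below, with the frequencies sampled as in Case 1 and placeholder $\hat{w}_{a2}$ and $\phi_{a}$.

```python
import numpy as np

rng = np.random.default_rng(0)
omega = rng.uniform(-50.0, 50.0, size=100)        # probing frequencies, as in Case 1
lam = 0.5

def probe(t):
    return np.sum(np.sin(omega * t)) / 30.0

def applied_input(t, x, w_a2, phi_a, u):
    # e(t) is defined so that u(t) + e(t) = lam * tanh((probe + w_a2' phi_a) / lam).
    e = lam * np.tanh((probe(t) + w_a2.T @ phi_a(x)) / lam) - u
    return u + e                                   # guaranteed to stay in [-lam, lam]
```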

After learning, the weights converge to

\hat{w}_{c}(t_{f})=[0.0601,\ 1.0401,\ -0.1214,\ -0.0755,\ 0.1327,\ -0.0816,\ -0.0551,\ 0.0048]^{\top},
\hat{w}_{a1}(t_{f})=[0.1876,\ -3.1325,\ -0.0278,\ -0.2763,\ -0.2390,\ -2.0274,\ -0.0607,\ 1.9738,\ 0.6572]^{\top},
\hat{w}_{a2}(t_{f})=[0.1826,\ -3.1356,\ -0.0164,\ -0.5104,\ -0.2490,\ -2.0156,\ -0.0420,\ 1.9659,\ 0.6565]^{\top}.

6 Conclusion

This paper presents a novel adaptive optimal control method that solves the constrained-input problem in a completely model-free way. By adding an exploration signal, the actor and critic NNs update their weights simultaneously, and no a priori knowledge of the system dynamics or an initial admissible policy is required during the learning phase. The efficacy of the proposed method is demonstrated by the simulation results.

References

  • [1] Q. Wei, D. Liu, Q. Lin, and R. Song, “Adaptive dynamic programming for discrete-time zero-sum games,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 4, pp. 957–969, 2018.
  • [2] Z. Peng, Y. Zhao, J. Hu, and B. K. Ghosh, “Data-driven optimal tracking control of discrete-time multi-agent systems with two-stage policy iteration algorithm,” Information Sciences, vol. 481, pp. 189–202, 2019.
  • [3] Z. Peng, Y. Zhao, J. Hu, R. Luo, B. K. Ghosh, and S. K. Nguang, “Input–output data-based output antisynchronization control of multiagent systems using reinforcement learning approach,” IEEE Transactions on Industrial Informatics, vol. 17, no. 11, pp. 7359–7367, 2021.
  • [4] M. Abu-Khalaf and F. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, pp. 779–791, 2005.
  • [5] D. Vrabie, O. Pastravanu, F. Lewis, and M. Abu-Khalaf, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, pp. 477–484, 2009.
  • [6] R. Sutton and A. Barto, Reinforcement learning: an introduction. Cambridge University Press, 1998.
  • [7] H. Modares, F. Lewis, and M.-B. Naghibi-Sistani, “Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems,” Automatica, vol. 50, no. 1, pp. 193–202, 2014.
  • [8] H. Modares and F. Lewis, “Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning,” Automatica, vol. 50, no. 7, pp. 1780–1792, 2014.
  • [9] S. Xue, B. Luo, D. Liu, and Y. Yang, “Constrained event-triggered $H_{\infty}$ control based on adaptive dynamic programming with concurrent learning,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1–13, 2020.
  • [10] K. Vamvoudakis and F. Lewis, “Q-learning for continuous-time linear systems: a model-free infinite horizon optimal control approach,” Systems & Control Letters, vol. 100, pp. 14–20, 2017.
  • [11] J. Lee, J. Park, and Y. Choi, “Integral reinforcement learning for continuous-time input-affine nonlinear systems with simultaneous invariant explorations,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, pp. 916–932, 2015.
  • [12] L. Guo and H. Zhao, “Online adaptive optimal control algorithm based on synchronous integral reinforcement learning with explorations,” arXiv e-prints, 2021. arxiv.org/abs/2105.09006.
  • [13] S. Lyshevski, “Optimal control of nonlinear continuous-time systems: design of bounded controllers via generalized nonquadratic functionals,” in Proceedings of the 1998 American Control Conference, vol. 1, pp. 205–209, 1998.
  • [14] P. Ioannou and B. Fidan, Adaptive control tutorial. SIAM, 2006.