Model-free Nearly Optimal Control of Constrained-Input Nonlinear Systems Based on Synchronous Reinforcement Learning
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, P. R. China
Abstract
In this paper, a novel model-free algorithm is proposed that learns a nearly optimal control law for constrained-input systems from online data, without requiring any a priori knowledge of the system dynamics. Based on the concept of generalized policy iteration, two neural networks (NNs), namely an actor NN and a critic NN, are used to approximate the optimal policy and the optimal value function. The stability of the closed-loop system and the convergence of the NN weights are guaranteed by Lyapunov analysis.
keywords: Optimal Control, Synchronous Reinforcement Learning, Constrained-Input, Actor-Critic Networks

1 Introduction
The control of constrained nonlinear systems has been widely researched and is a problem of practical value. In the framework of optimal control, an optimal policy that minimizes a user-defined performance index, together with the corresponding value function, can be obtained by solving the Hamilton-Jacobi-Bellman (HJB) equation. For continuous-time nonlinear systems, the HJB equation is a partial differential equation whose analytical solution is usually hard to find. Therefore, function approximation using neural networks (NNs) is a useful tool for solving the HJB equation. These algorithms belong to the family of adaptive dynamic programming (ADP) or reinforcement learning (RL) methods.
ADP/RL methods have solved several discrete-time optimal control problems such as nonlinear zero-sum games [1] and multi-agent systems [2], [3]. A considerable number of studies have also focused on solving the continuous-time constrained-input optimal control problem via ADP/RL. In [4], an off-line policy iteration (PI) algorithm that iteratively solves the HJB equation is proposed; full knowledge of the system dynamics and an initial stabilizing controller are needed in this algorithm. In recent years, many works have aimed to relax the requirement of a priori knowledge of the system dynamics. The concepts of integral reinforcement learning (IRL) [5] and generalized PI [6] are adopted in [7], [8] to estimate the optimal value function in constrained-input problems. An initial stabilizing controller is not needed in these methods, but the input gain matrix of the system is still used in the policy improvement step.
Another way to address this problem is to use an identifier NN. Xue et al. [9] proposed an algorithm based on the actor-critic-identifier structure to estimate the system dynamics and solve the HJB equation online. Compared with other model-free methods, the extra identifier may impose an additional real-time computational burden.
In terms of IRL, Vamvoudakis and Lewis [10] proposed a completely model-free Q-learning algorithm to solve the continuous-time optimal control problem. The Q-function combines the value function with the Hamiltonian from Pontryagin's minimum principle and provides the information needed for policy improvement without requiring any a priori model knowledge. Similarly, the requirement on the system input dynamics can be relaxed by adding an exploration signal [11]. Our previous work [12] also presents a model-free algorithm that combines exploration signals and synchronous reinforcement learning to solve the adaptive optimal control problem for continuous-time nonlinear systems.
In this paper we present an algorithm to solve the continuous-time constrained-input optimal control problem without requiring any knowledge of the system dynamics or an initial stabilizing controller. The identifier NN used in [9] is also not needed in this method. The remainder of this paper is organized as follows: Section 2 provides the mathematical formulation of the optimal control problem for continuous-time systems with input constraints. In Section 3 we present the weight tuning laws of the actor-critic NNs based on the exploration-HJB equation and synchronous reinforcement learning. During learning, policy evaluation and policy improvement are carried out simultaneously, and no knowledge of the system dynamics is required. The convergence of the weights and the stability of the closed-loop system are proved through Lyapunov analysis. Finally, we design two simulations to verify the effectiveness of our method.
2 Problem Formulation
Consider a continuous-time input-affine system
(1)   $\dot{x}(t) = f(x(t)) + g(x(t))\,u(t), \qquad x(0) = x_0$
where $x(t) \in \mathbb{R}^{n}$ and $u(t) \in \mathbb{R}^{m}$ are the measurable state and input vectors, respectively. $x_0$ denotes the initial state of the system. $f(x)$ is called the system drift dynamics and $g(x)$ is the system input dynamics. The system dynamics are assumed to be Lipschitz on a compact set $\Omega$ and to satisfy $f(0) = 0$.
The objective of optimal control is to design a controller that minimizes a user-defined performance index. The performance index in this paper is defined as
(2)   $J(x_0, u) = \int_{0}^{\infty} \big[ Q(x(\tau)) + W(u(\tau)) \big]\, d\tau$
where both $Q(x)$ and $W(u)$ are positive definite functions.
In this paper, the constrained-input control problem is considered. The input vector needs to satisfy the constraint $|u_i(t)| \le \lambda$, $i = 1, \dots, m$. To deal with this constraint, a non-quadratic integral cost on the input is usually employed [13]:
(3)   $W(u) = 2\int_{0}^{u} \lambda \big( \beta^{-1}(v/\lambda) \big)^{T} R \, dv$
where $v \in \mathbb{R}^{m}$ and $\beta^{-1}(v/\lambda) = [\beta^{-1}(v_1/\lambda), \dots, \beta^{-1}(v_m/\lambda)]^{T}$. $\beta(\cdot)$ is a bounded one-to-one monotonic odd function; the hyperbolic tangent function is often used. $R$ is a positive definite symmetric matrix. For convenience of computation and analysis, $R$ is assumed to be diagonal, i.e. $R = \mathrm{diag}(r_1, \dots, r_m)$.
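To make the role of this cost concrete, the following minimal sketch evaluates $W(u)$ numerically for a scalar input with $\beta = \tanh$; the values of $\lambda$ and $R$ are illustrative placeholders, not taken from the paper.

```python
import numpy as np
from scipy.integrate import quad

def nonquadratic_cost(u, lam=1.0, r=1.0):
    """W(u) = 2 * int_0^u lam * atanh(v/lam) * r dv  (scalar input case)."""
    # atanh(v/lam) is the inverse of the saturating map v = lam * tanh(.)
    integrand = lambda v: 2.0 * lam * np.arctanh(v / lam) * r
    val, _ = quad(integrand, 0.0, u)
    return val

print(nonquadratic_cost(0.5))   # moderate input -> moderate penalty
print(nonquadratic_cost(0.99))  # near the bound lam = 1 -> penalty grows sharply
```

The cost stays finite inside the bound but grows rapidly as the input approaches the saturation level, which is what discourages constraint-violating controls.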
The goal is to find the optimal control law that stabilizes system (1) and minimizes the performance index (2). The following value function is used:
(4)   $V^{u}(x(t)) = \int_{t}^{\infty} \big[ Q(x(\tau)) + W(u(\tau)) \big]\, d\tau$
where $u = u(x)$ is a feedback policy and $x(\tau)$ denotes the solution of (1) under this policy. The value function is well defined only if the policy is admissible [4].
Definition 1.
A constrained policy $u(x)$ is admissible on $\Omega$, denoted $u \in \mathcal{A}(\Omega)$, if: 1) the policy stabilizes system (1) on $\Omega$; 2) for any state $x_0 \in \Omega$, $V^{u}(x_0)$ is finite; 3) for any state $x \in \Omega$, the input constraint $|u_i(x)| \le \lambda$ is satisfied. $\Omega$ is called the admissible domain and $\mathcal{A}(\Omega)$ is the set of admissible policies.
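As a rough numerical illustration of Definition 1, the sketch below rolls out a candidate constrained policy from one initial state and checks the three conditions empirically; the dynamics, policy, and running cost are placeholder choices, not the paper's.

```python
import numpy as np

lam, dt, T_end = 1.0, 0.01, 20.0

f = lambda x: np.array([x[1], -x[0] - x[1]])     # placeholder drift dynamics
g = lambda x: np.array([[0.0], [1.0]])           # placeholder input dynamics
u_pi = lambda x: -lam * np.tanh(x[1])            # candidate constrained policy
Q = lambda x: float(x @ x)                       # placeholder running state cost

def check_admissibility(x0):
    x, cost = np.array(x0, float), 0.0
    for _ in range(int(T_end / dt)):
        u = np.atleast_1d(u_pi(x))
        assert np.all(np.abs(u) <= lam + 1e-9)   # 3) input constraint holds
        cost += Q(x) * dt                        # 2) accumulated cost stays finite
        x = x + dt * (f(x) + g(x) @ u)           # Euler step of system (1)
    return np.linalg.norm(x) < 1e-2, cost        # 1) state driven near the origin

print(check_admissibility([1.0, -0.5]))
```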
Assume that the set of admissible policies for system (1) is not empty, i.e. $\mathcal{A}(\Omega) \neq \emptyset$. Then there exists an optimal policy $u^{*}$ such that $V^{u^{*}}(x) = \min_{u \in \mathcal{A}(\Omega)} V^{u}(x)$.
The optimal value function satisfies $V^{*}(x) = V^{u^{*}}(x)$, and in the remainder of the paper both symbols may be used to denote the same quantity.
According to the definition of the value function (4), its infinitesimal version is obtained as
(5)   $0 = Q(x) + W(u(x)) + \nabla V^{u}(x)^{T} \big( f(x) + g(x)u(x) \big)$
where $\nabla V^{u}(x) = \partial V^{u}(x)/\partial x$. Define the Hamiltonian as
(6)   $H(x, u, \nabla V) = Q(x) + W(u) + \nabla V^{T} \big( f(x) + g(x)u \big)$
For the optimal value function $V^{*}$ and the optimal control law $u^{*}$, the following Hamilton-Jacobi-Bellman (HJB) equation is satisfied:
(7)   $0 = \min_{u \in \mathcal{A}(\Omega)} H(x, u, \nabla V^{*}) = Q(x) + W(u^{*}(x)) + \nabla V^{*}(x)^{T} \big( f(x) + g(x)u^{*}(x) \big)$
The optimal policy can be derived by minimizing the Hamiltonian. For the input-affine system (1) and the performance index (2) with the input cost (3), the optimal policy can be written directly as
(8)   $u^{*}(x) = -\lambda\,\beta\!\left( \frac{1}{2\lambda} R^{-1} g^{T}(x) \nabla V^{*}(x) \right)$
Let
(9)   $D^{*}(x) = \frac{1}{2\lambda} R^{-1} g^{T}(x) \nabla V^{*}(x)$
the optimal policy can be obtained as
(10)   $u^{*}(x) = -\lambda\,\beta\big( D^{*}(x) \big)$
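The saturated structure of (10) is straightforward to implement once an estimate of the value gradient is available. The sketch below evaluates a policy of this form for an assumed two-dimensional example with $\beta = \tanh$; the input dynamics, the value gradient, $\lambda$, and $R$ are placeholders, not quantities from the paper.

```python
import numpy as np

lam = 1.0                                   # assumed input bound
R_inv = np.linalg.inv(np.diag([1.0]))       # R assumed diagonal, m = 1

def g(x):                                   # placeholder input dynamics g(x)
    return np.array([[0.0], [1.0]])

def grad_V(x):                              # placeholder value-function gradient
    return 2.0 * x                          # e.g. V(x) = x^T x

def constrained_policy(x):
    # D(x) = (1 / (2*lam)) * R^{-1} g(x)^T grad_V(x),  u = -lam * tanh(D)
    D = (1.0 / (2.0 * lam)) * R_inv @ g(x).T @ grad_V(x)
    return -lam * np.tanh(D)

x = np.array([0.5, -1.2])
print(constrained_policy(x))                # always within [-lam, lam] componentwise
```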
It is hard to solve (7) directly for the optimal value function and policy, and a priori knowledge of the system dynamics would be required. Therefore, an online data-driven algorithm is a useful tool to relax this restriction.
3 Online Synchronous Reinforcement Learning Algorithm to Solve HJB Equation
In this section, a data-driven method that learns the optimal policy is introduced. The framework of integral reinforcement learning (IRL) was proposed in [5]. A completely model-free algorithm based on IRL was developed by Lee et al. [11] to solve the optimal control problem for input-affine systems. However, an initial stabilizing controller is still required to start the iteration.
In our previous work [12], an algorithm called synchronous integral Q-learning was proposed, which continuously updates the weights of both NNs to estimate the optimal value function and policy. In this paper we extend that algorithm to the constrained-input problem. By adding a bounded, piecewise-continuous, non-zero exploration signal $e(t)$ to the input, system (1) becomes
(11)   $\dot{x} = f(x) + g(x)\big( u + e \big)$
Substituting (11) into the HJB equation (7) and, for any time $t$ and time interval $T > 0$, integrating (7) from $t-T$ to $t$ along the trajectory of system (11), the following integral temporal difference equation is obtained:
(12)   $V^{*}(x(t)) - V^{*}(x(t-T)) = \int_{t-T}^{t} \Big[ -Q(x) - W(u^{*}) + \nabla V^{*}(x)^{T} g(x)\big( u + e - u^{*} \big) \Big]\, d\tau$
The following HJB equation is derived by substituting (9):
(13)   $V^{*}(x(t)) - V^{*}(x(t-T)) = \int_{t-T}^{t} \Big[ -Q(x) - W(u^{*}) + 2\lambda\, D^{*}(x)^{T} R \big( u + e + \lambda\,\beta(D^{*}(x)) \big) \Big]\, d\tau$
Both $V^{*}$ and $D^{*}$ can be solved from the integral exploration-HJB equation (13), and the optimal policy is then computed by (10). Note that no knowledge of the system dynamics is needed in these two equations.
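In an implementation, the integral terms in (13) are evaluated from sampled trajectory data. A minimal sketch of this step is given below, with placeholder cost functions for $\beta = \tanh$ and a trapezoidal quadrature (the paper does not prescribe a specific quadrature rule).

```python
import numpy as np

def integral_reinforcement(ts, xs, us, Q, W):
    """Trapezoidal approximation of  int [ Q(x) + W(u) ] dtau  over one
    reinforcement interval, computed from sampled data.
    ts: (N,) sample times, xs: (N, n) states, us: (N, m) inputs."""
    vals = np.array([Q(x) + W(u) for x, u in zip(xs, us)])
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(ts)))

# Placeholder costs (illustrative only, not the paper's choices):
lam, r = 1.0, 1.0
Q = lambda x: float(x @ x)                                   # Q(x) = ||x||^2
W = lambda u: float(np.sum(2 * lam**2 * r * ((u / lam) * np.arctanh(u / lam)
                                             + 0.5 * np.log(1 - (u / lam)**2))))

ts = np.linspace(0.0, 0.05, 6)            # one interval of length T = 0.05 s
xs = np.full((6, 2), 0.5)                 # placeholder state samples
us = np.full((6, 1), 0.3)                 # placeholder input samples
print(integral_reinforcement(ts, xs, us, Q, W))
```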
Since the analytical solutions of $V^{*}$ and $D^{*}$ cannot be determined, we approximate them by choosing a proper neural network (NN) structure. Assume $V^{*}(x)$ is a smooth function. Under the assumptions on the system dynamics, $D^{*}(x)$ is also smooth. Therefore, there exist two single-layer NNs, namely the critic and actor NNs, such that:
1) $V^{*}(x)$ and its gradient can be universally approximated:
(14)   $V^{*}(x) = W_c^{T}\phi(x) + \varepsilon_c(x)$
(15)   $\nabla V^{*}(x) = \nabla\phi(x)^{T} W_c + \nabla\varepsilon_c(x)$
2) $D^{*}(x)$ can be universally approximated:
(16)   $D^{*}(x) = W_a^{T}\sigma(x) + \varepsilon_a(x)$
where $\phi(x) \in \mathbb{R}^{N_c}$, $W_c \in \mathbb{R}^{N_c}$, and $\varepsilon_c(x)$ are the activation function, weights, and reconstruction error of the critic NN, respectively, and $N_c$ is the number of neurons in its hidden layer. Similarly, for the actor NN, $\sigma(x) \in \mathbb{R}^{N_a}$, $W_a \in \mathbb{R}^{N_a \times m}$, and $\varepsilon_a(x)$ denote the activation function, weights, and reconstruction error, respectively, and $N_a$ is the number of neurons in its hidden layer. The reconstruction errors $\varepsilon_c$ and $\varepsilon_a$ are bounded on a compact set $\Omega$. According to the Weierstrass approximation theorem, when $N_c, N_a \to \infty$, we have $\varepsilon_c, \varepsilon_a \to 0$.
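As an illustration of the critic structure (14)-(15), the sketch below uses a quadratic polynomial basis for a two-state example; the basis and the weights are placeholders, since the activation functions are a user design choice.

```python
import numpy as np

# Critic features for a 2-state example: phi(x) = [x1^2, x1*x2, x2^2]
def phi(x):
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2])

def grad_phi(x):
    x1, x2 = x
    # rows: d(phi_i)/dx  -> shape (3, 2)
    return np.array([[2 * x1, 0.0],
                     [x2,     x1],
                     [0.0,    2 * x2]])

Wc = np.array([0.5, 0.1, 1.0])          # example critic weights (placeholder)
x = np.array([0.3, -0.7])
V_hat = Wc @ phi(x)                     # V_hat(x) = Wc^T phi(x)
gradV_hat = grad_phi(x).T @ Wc          # gradient of the critic approximation
print(V_hat, gradV_hat)
```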
Define the Bellman error based on (13) as
(17)
where the integral reinforcement is defined as the cost accumulated over the interval $[t-T, t]$.
Using the properties of the Kronecker product, the following equation can be derived from (17):
(18)
where $\mathrm{vec}(W_a)$ denotes the column vector obtained by reshaping the actor weight matrix $W_a$.
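The Kronecker-product step rests on the standard bilinear-form identity $x^{T}Ay = \mathrm{vec}(A)^{T}(y \otimes x)$, which makes the unknown actor weight matrix appear linearly in the regression. A minimal numerical check with randomly generated placeholder quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
Na, m = 4, 2
sigma = rng.standard_normal(Na)        # actor features sigma(x)
Wa = rng.standard_normal((Na, m))      # actor weight matrix
y = rng.standard_normal(m)             # e.g. R @ (u + e)

# Bilinear term sigma^T Wa y rewritten linearly in vec(Wa):
lhs = sigma @ Wa @ y
rhs = np.kron(y, sigma) @ Wa.reshape(-1, order="F")   # vec(Wa), column-major
print(np.isclose(lhs, rhs))            # True
```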
Before introducing the weight tuning laws, it is necessary to make some assumptions about the NNs.
Assumption 1.
The system dynamics and NNs satisfy:
a. $f(x)$ is Lipschitz and $g(x)$ is bounded.
b. The reconstruction errors $\varepsilon_c$, $\varepsilon_a$ and the gradient $\nabla\varepsilon_c$ of the critic NN's error are bounded.
c. The activation functions $\phi$, $\sigma$ and the gradient $\nabla\phi$ of the critic NN's activation function are bounded.
d. The optimal weights $W_c$, $W_a$ of the NNs are bounded, and the amplitude of the exploration signal $e(t)$ is bounded.
These assumptions are mild and easily satisfied, except for the boundedness of the reconstruction errors: with the system dynamics assumed to be Lipschitz, the Bellman error is bounded on a compact set, and when $N_c, N_a \to \infty$ it converges to zero uniformly [4].
In this paper we use $\beta(\cdot) = \tanh(\cdot)$. The non-quadratic cost is then derived as
(19)   $W\big(\lambda\tanh(D)\big) = 2\lambda^{2} D^{T} R \tanh(D) + \lambda^{2}\,\bar R\,\ln\!\big( \mathbf{1}_m - \tanh^{2}(D) \big)$
where $\bar R = [r_1, \dots, r_m]$, $\mathbf{1}_m$ is the $m$-dimensional vector with all elements equal to one, and $\ln(\cdot)$ and $\tanh^{2}(\cdot)$ are applied elementwise. According to (17) and (19), the approximate exploration-HJB equation is obtained as
(20)   $W_c^{T}\big[ \phi(x(t)) - \phi(x(t-T)) \big] = \int_{t-T}^{t} \Big[ -Q(x) - \lambda^{2}\bar R\,\ln\!\big( \mathbf{1}_m - \tanh^{2}(W_a^{T}\sigma) \big) + 2\lambda \big( W_a^{T}\sigma \big)^{T} R \big( u + e \big) \Big]\, d\tau + \varepsilon_{HJB}(t)$
in which $\varepsilon_{HJB}(t)$ is the error arising from the reconstruction errors of the two NNs. Since the ideal weights $W_c$ and $W_a$ are unknown, they are estimated in real time as
(21)   $\hat V(x) = \hat W_c^{T}\phi(x)$
(22)   $\hat D(x) = \hat W_a^{T}\sigma(x)$
The corresponding estimate of the optimal policy is then obtained from (10) with $D^{*}$ replaced by $\hat D$.
We use an actor NN to approximate the optimal policy as
(23)
Note that (22) and (23) have the same approximation structure; however, for reasons of Lyapunov stability, the estimated weights in (22) are not directly implemented in the controller. The approximate Bellman error of the critic NN is then obtained as
(24)
In order to minimize this error, the objective function of the critic NN is defined as its squared value, and the modified gradient-descent law is obtained as
(25)
where $\alpha_c > 0$ is the learning rate that determines the speed of convergence. In order to guarantee the stability of the closed-loop system and the convergence of the weights, the actor NN is tuned as follows. Define the approximation error of the actor NN as
(26)
The gradient-descent tuning law of the actor weights is set as
(27)
where a designed parameter is introduced to guarantee stability. With these definitions, the following theorem is derived.
Theorem 1.
Suppose all items in Assumption 1 are satisfied and the regression signal is persistently excited [14], i.e. there exist two positive constants such that the windowed Gram matrix of the regression signal is uniformly bounded below and above. Then there exist a positive integer $N_0$ and a sufficiently small reinforcement interval $T$ such that, when the numbers of neurons satisfy $N_c, N_a \ge N_0$, the states of the closed-loop system and the weight estimation errors of the critic and actor NNs are uniformly ultimately bounded.
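In practice, the persistence-of-excitation condition can be checked on recorded regressor data by verifying that the windowed Gram matrix is positive definite. A minimal sketch with a placeholder regressor signal:

```python
import numpy as np

def pe_lower_bound(ts, sig):
    """Smallest eigenvalue of  int_t^{t+T} sig(tau) sig(tau)^T dtau  over the
    recorded window; a strictly positive value indicates persistent excitation.
    ts: (N,) sample times, sig: (N, p) regressor samples."""
    M = np.zeros((sig.shape[1], sig.shape[1]))
    for k in range(len(ts) - 1):
        dt = ts[k + 1] - ts[k]
        M += 0.5 * (np.outer(sig[k], sig[k]) + np.outer(sig[k + 1], sig[k + 1])) * dt
    return float(np.min(np.linalg.eigvalsh(M)))

# Example: a two-component regressor driven by two distinct frequencies is PE.
ts = np.linspace(0.0, 10.0, 2001)
sig = np.column_stack([np.sin(ts), np.sin(2.0 * ts)])
print(pe_lower_bound(ts, sig))   # > 0
```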
4 Proof of Theorem 1
Define the weight estimation errors as $\tilde W_c = W_c - \hat W_c$ and $\tilde W_a = W_a - \hat W_a$. Consider the following Lyapunov candidate:
(28)
The time derivative of (28) is
(29)
Using the relations above, the first term on the right-hand side of (31) can be written as
(32)
Therefore, (30) turns into
(33)
According to Assumption 1, it satisfies
(34)
where $\sigma_{\min}(\cdot)$ denotes the minimum singular value of a matrix. Substituting the derivative of the approximate HJB equation (20) into (33), we have
(35)
The last term in (35) satisfies
(36)
Since $Q(x)$ and $W(u)$ are positive definite, there exists a positive constant such that the following bound holds, and (35) can therefore be written as
(37)
where the constant term denotes an upper bound of the corresponding residual errors.
Now analyze the second term in (29). According to the tuning law (25), we have the following error dynamics:
(38)
Substituting the HJB equation (20) into (38), one has
(39)
According to the definitions above, we have
(40)
According to [7], the third term in (40) can be written as
(41)
Similarly, we have
(42)
The following inequality is derived from (40):
(43)
In [7], a continuous approximation of this term is provided:
(44)
Combining (43) and (44), the approximate HJB error is
(45)
in which the approximation error is bounded.
Substituting (31) into (45), we can get
(46)
where the introduced quantities collect the corresponding bounded terms.
Substituting (46) into (38), the error dynamics can be rewritten as
(47)
The second term in (29) is obtained as
(48)
For a sufficiently small reinforcement interval, the integral term in (48) can be approximated as
(49)
Substituting (49) into (48) and applying Young's inequality, (48) becomes
(50)
Eq. (29) can then be written as
(51)
Substituting the tuning law (27), the last two terms of (51) can be written as
(52)
Eq. (51) then becomes
(53)
Choose the design parameters properly such that the corresponding coefficients are positive. Thus, the derivative of the Lyapunov candidate is negative if
(54)
(55)
(56)
The inequalities above show that the states of the closed-loop system, the output of the error dynamics, and the weight estimation errors are UUB. Since the regression signal is persistently excited, the weights of the critic NN are also UUB, which completes the proof.
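For intuition, the sketch below shows one possible discretized realization of a synchronous critic-actor update of the kind analyzed above. It is only illustrative: the normalization, the learning rates, and the simple actor step are assumptions, and the paper's actual tuning laws (25) and (27) contain additional stabilizing terms that are not reproduced here.

```python
import numpy as np

def critic_actor_step(Wc, Wa, delta_phi, rho, s_bar, alpha_c=1.0, alpha_a=0.5, dt=0.01):
    """One illustrative synchronous update over a reinforcement interval.

    Wc        : critic weight estimate (Nc,)
    Wa        : actor weight estimate (Na, m)
    delta_phi : phi(x(t)) - phi(x(t-T))                      (Nc,)
    rho       : measured integral reinforcement over [t-T,t] (scalar)
    s_bar     : integrated regressor multiplying vec(Wa)     (Na*m,)
    """
    # Approximate Bellman (temporal-difference) error for the current weights.
    e_B = Wc @ delta_phi + rho + Wa.reshape(-1, order="F") @ s_bar

    # Normalized gradient descent on 1/2 * e_B^2 for the critic
    # (the normalization keeps the update bounded, as in synchronous ADP schemes).
    z = delta_phi
    Wc = Wc - dt * alpha_c * z * e_B / (1.0 + z @ z) ** 2

    # Simple actor step: move the actor weights in the direction that reduces e_B
    # (the paper's law (27) adds a stabilizing term with a design parameter).
    grad_a = (s_bar * e_B).reshape(Wa.shape, order="F")
    Wa = Wa - dt * alpha_a * grad_a
    return Wc, Wa
```

A faithful implementation would also include the stabilizing term of (27) with its design parameter and would generate the applied control through the saturated policy (10).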
5 Simulation
In this section, we design two experiments to verify the effectiveness of our method. Since there is no analytical solution to the nonlinear constrained-input optimal control problem, we choose the first case to be a linear system with an input bound large enough that the input never approaches saturation. The optimal solution in this case should be the same as that of the standard linear quadratic regulator (LQR) problem.
5.1 Case 1: Linear System
The system in the first case is chosen as
(57)
The cost function is defined with a quadratic state cost and a constant input weighting, and the input bound is chosen large enough that the constraint is never active. Near the origin the problem therefore reduces to an LQR problem: the optimal value function is quadratic in the states and the optimal control law is linear. Accordingly, we choose quadratic activation functions for the critic NN and linear activation functions for the actor NN. By solving the algebraic Riccati equation, the optimal weights are obtained as a benchmark. As for the hyperparameters, we set the reinforcement interval and the learning rates of the two NNs, and the exploration signal added to the input is generated with uniformly sampled parameters.
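For reference, the LQR benchmark weights can be computed by solving the algebraic Riccati equation, e.g. as follows; the system matrices and cost weights below are placeholders, since the specific values of (57) are not reproduced here.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Placeholder linear system and costs (NOT the values from (57)):
A = np.array([[0.0, 1.0],
              [-1.0, -2.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

# Solve the algebraic Riccati equation; V*(x) = x^T P x gives the optimal
# critic weights, and K gives the linear optimal control law u = -K x.
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)
print("P =", P)
print("K =", K)
```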
After the learning phase, the exploration signal is removed and the simulation is stopped. The weights of the two NNs converge to values close to the optimal weights obtained from the Riccati equation.
5.2 Case 2: Nonlinear System
The second case is a nonlinear system of the form (1). In this case the upper bound of the input, the cost weights, and the learning rates are specified for the constrained setting, and, due to the input constraint, the exploration signal is chosen accordingly.
After learning, the weights of the two NNs converge to their final values.
6 Conclusion
This paper presents a novel adaptive optimal control method that solves the constrained-input problem in a completely model-free way. By adding an exploration signal, the actor and critic NNs can update their weights simultaneously, and no a priori knowledge of the system dynamics or an initial admissible policy is required during the learning phase. The efficacy of the proposed method is also demonstrated by the simulation results.
References
- [1] Q. Wei, D. Liu, Q. Lin, and R. Song, “Adaptive dynamic programming for discrete-time zero-sum games,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 4, pp. 957–969, 2018.
- [2] Z. Peng, Y. Zhao, J. Hu, and B. K. Ghosh, “Data-driven optimal tracking control of discrete-time multi-agent systems with two-stage policy iteration algorithm,” Information Sciences, vol. 481, pp. 189–202, 2019.
- [3] Z. Peng, Y. Zhao, J. Hu, R. Luo, B. K. Ghosh, and S. K. Nguang, “Input–output data-based output antisynchronization control of multiagent systems using reinforcement learning approach,” IEEE Transactions on Industrial Informatics, vol. 17, no. 11, pp. 7359–7367, 2021.
- [4] M. Abu-Khalaf and F. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, pp. 779–791, 2005.
- [5] D. Vrabie, O. Pastravanu, F. Lewis, and M. Abu-Khalaf, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, pp. 477–484, 2009.
- [6] R. Sutton and A. Barto, Reinforcement learning: an introduction. MIT Press, 1998.
- [7] H. Modares, F. Lewis, and M.-B. Naghibi-Sistani, “Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems,” Automatica, vol. 50, no. 1, pp. 193–202, 2014.
- [8] H. Modares and F. Lewis, “Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning,” Automatica, vol. 50, no. 7, pp. 1780–1792, 2014.
- [9] S. Xue, B. Luo, D. Liu, and Y. Yang, “Constrained event-triggered control based on adaptive dynamic programming with concurrent learning,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1–13, 2020.
- [10] K. Vamvoudakis and F. Lewis, “Q-learning for continuous-time linear systems: a model-free infinite horizon optimal control approach,” Systems & Control Letters, vol. 100, pp. 14–20, 2017.
- [11] J. Lee, J. Park, and Y. Choi, “Integral reinforcement learning for continuous-time input-affine nonlinear systems with simultaneous invariant explorations,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, pp. 916–932, 2015.
- [12] L. Guo and H. Zhao, “Online adaptive optimal control algorithm based on synchronous integral reinforcement learning with explorations,” arXiv e-prints, 2021. arxiv.org/abs/2105.09006.
- [13] S. Lyshevski, “Optimal control of nonlinear continuous-time systems: design of bounded controllers via generalized nonquadratic functionals,” in Proceedings of the 1998 American Control Conference, vol. 1, pp. 205–209, 1998.
- [14] P. Ioannou and B. Fidan, Adaptive control tutorial. SIAM, 2006.