Model-Free Design of Stochastic LQR Controller from Reinforcement Learning and Primal-Dual Optimization Perspective

Man Li [email protected]    Jiahu Qin [email protected]    Wei Xing Zheng [email protected]    Yaonan Wang [email protected]    Yu Kang [email protected] Department of Automation, University of Science and Technology of China, Hefei 230027, China School of Computing, Engineering and Mathematics, Western Sydney University, Sydney, NSW 2751, Australia College of Electrical and Information Engineering, Hunan University, Changsha 410082, China National Engineering Laboratory for Robot Visual Perception and Control Technology, Changsha 410082, China
Abstract

To further understand the underlying mechanism of various reinforcement learning (RL) algorithms and also to better use the optimization theory to make further progress in RL, many researchers begin to revisit the linear-quadratic regulator (LQR) problem, whose setting is simple and yet captures the characteristics of RL. Inspired by this, this work is concerned with the model-free design of stochastic LQR controller for linear systems subject to Gaussian noises, from the perspective of both RL and primal-dual optimization. From the RL perspective, we first develop a new model-free off-policy policy iteration (MF-OPPI) algorithm, in which the sampled data is repeatedly used for updating the policy to alleviate the data-hungry problem to some extent. We then provide a rigorous analysis for algorithm convergence by showing that the involved iterations are equivalent to the iterations in the classical policy iteration (PI) algorithm. From the perspective of optimization, we first reformulate the stochastic LQR problem at hand as a constrained non-convex optimization problem, which is shown to have strong duality. Then, to solve this non-convex optimization problem, we propose a model-based primal-dual (MB-PD) algorithm based on the properties of the resulting Karush-Kuhn-Tucker (KKT) conditions. We also give a model-free implementation for the MB-PD algorithm by solving a transformed dual feasibility condition. More importantly, we show that the dual and primal update steps in the MB-PD algorithm can be interpreted as the policy evaluation and policy improvement steps in the PI algorithm, respectively. Finally, we provide one simulation example to show the performance of the proposed algorithms.

keywords:
Stochastic linear-quadratic regulator; model-free design; reinforcement learning; optimization; duality.


1 Introduction

Reinforcement learning (RL), a branch of machine learning, aims to learn policies to optimize a long-term decision-making process in a dynamic environment [1, 2]. In general, there are two main characteristics, which are also readily considered as the difficulties of the RL setting [1]. First, the value of a state is the total amount of reward/cost an agent expects to accumulate over the future, starting from that state. That is, early decisions affect the outcomes both near and far in the future. Second, the learner’s knowledge about the environment is incomplete, e.g., the underlying mechanism governing the dynamics of the state-transition trajectories is unknown. To overcome these difficulties, a lot of effective RL algorithms have been developed [3, 4, 5], and these algorithms have been successfully used in numerous fields, including smart grids [6], multi-robot systems [7], intelligent transportation[8], and networked control [9]. From these works, we know that in spite of the fact that RL cannot be regarded as a standard optimization problem due to the lack of an explicit objective function [10], it still has strong ties to the optimization formulation, since its goal is to optimize the long-term cumulative performance [11]. In view of this, existing fruitful optimization algorithms and theories might enable researchers to make further progress in RL, so that the integration of elementary RL methods with optimization frameworks is worthwhile to study. Some efforts towards this direction have been made, see, for example, [12, 13].

It is worth mentioning that, following this line, numerous researchers have begun to revisit the linear-quadratic regulator (LQR) problem [14, 15, 16, 17, 18, 19], the task of which is to determine an optimal feedback controller minimizing a quadratic cost function for linear systems [20]. One important reason for this revisit is that the LQR problem is simple and yet captures the two characteristics of RL problems mentioned in the previous paragraph. Such properties make the LQR problem convenient for understanding the behavior and the underlying mechanism of various RL algorithms. Another reason is that the traditional methods for addressing the LQR problem require directly solving an algebraic Riccati equation (ARE), which depends upon exact knowledge of the system matrices; however, the exact system matrices are usually unknown in real applications. Although model identification provides a way to obtain an estimated system model, the propagation and accumulation of modeling errors may cause undesirable system behaviors. Therefore, it is meaningful to study the model-free design of the LQR controller by integrating RL algorithms with optimization theories. Some relevant works can be found in [15, 16, 19, 18, 17].

Specifically, in [15] and [16], the authors reformulate the LQR problem as a non-convex optimization problem, and merge ideas from optimal control theory, mathematical optimization, and sampled-based RL to solve it. To seek the solutions to the resulting non-convex optimization problem, model-free stochastic policy gradient algorithms along with zeroth-order optimization techniques are developed with provable global convergence [15] and explicit convergence rate [16]. The work [17] considers distributed RL for decentralized LQR problem, and proposes a zeroth-order distributed policy optimization algorithm to learn the local optimal controllers in a distributed manner, by leveraging the ideas of policy gradient, zeroth-order optimization, as well as consensus algorithms. Note that in the above mentioned works, the key to realizing model-free control is to use the zeroth-order optimization approach to estimate the policy gradient. This approach often suffers from large variance, as it involves the randomness of an entire trajectory. To address this difficulty, some different algorithms have been proposed [18, 19]. The work [18] develops a model-free approximate policy iteration algorithm based on Q-function, in which the way to guarantee a good regret is also discussed. In [19], an on-policy actor-critic algorithm, which is an online alternating update algorithm for bilevel optimization, is developed with a non-asymptotic convergence analysis, and then it is extended to the off-policy case to alleviate the data-hungry problem.

Despite the fact that extensive research has been carried out on the LQR problem when considering the integration of RL frameworks with optimization techniques, these works, such as [15, 16, 19, 18, 17], mainly focus on the theoretical understanding of the performance of common RL algorithms. They do not provide any explanations for the proposed RL algorithms from the perspective of optimization or establish any relationship between RL and optimization-based methods. Different from the above mentioned works, the authors in [21] reformulate the LQR problem as a new optimization problem, and further explain its duality results from a Q-learning viewpoint. The duality results proposed in [21] are based on the results given in [22], where the duality for the stochastic LQR problem pertaining to linear systems with stochastic noise is investigated in terms of semidefinite programming (SDP). These works motivate some questions. Can we formulate the stochastic LQR problem as an optimization problem whose optimal points can be interpreted from the perspective of various RL algorithms? Can we explain the RL algorithms used to solve stochastic LQR problem from the perspective of optimization?

To answer these questions, we investigate the model-free design of stochastic LQR controller from the perspective of both RL and optimization. We first solve the stochastic LQR problem in a model-free way from the perspective of RL. Then, we reconsider this problem by some reformulation and transformations, and solve it from the perspective of primal-dual optimization. Besides that, we also provide a theorem to show that the algorithms developed from the above two perspectives are equivalent in the sense that there is a corresponding relation in the involved iterations. The contributions of this work are summarized as follows:

  1. 1.

    To solve the stochastic LQR problem, we develop a model-free off-policy policy iteration (MF-OPPI) algorithm and a model-based primal-dual (MB-PD) algorithm from the perspectives of RL and primal-dual optimization, respectively. Both algorithms can be implemented in a model-free way, so that one can circumvent model identification, which may cause undesirable behaviors due to the propagation and accumulation of modeling errors. More importantly, we show that the dual and primal update steps in the MB-PD algorithm can be interpreted as the policy evaluation and policy improvement steps in the classical policy iteration (PI) algorithm, respectively. This result provides a novel primal-dual optimization perspective for understanding the classical PI algorithm.

  2. 2.

    From the perspective of RL, we develop a new MF-OPPI algorithm to address the stochastic LQR problem, in which the data generated by exerting the behavior policy on the real system is repeatedly used to evaluate the target policy so as to alleviate the data-hungry problem to some extent. Compared with the off-policy actor-critic algorithm proposed in [19], the newly developed MF-OPPI algorithm is simple and easy to implement, since it does not require to use neural networks for function approximation.

  3. 3.

    From the optimization perspective, we first reformulate the stochastic LQR problem as a constrained non-convex optimization problem, based on the works [21] and [22]. However, due to the consideration of the influence of stochastic noise and the relation with RL algorithms, the proposed optimization formulation is different from the ones given in [21] and [22]. Such differences bring about some challenges in the equivalent formulation of the primal-dual problems as well as in the analysis of strong duality. To address the resulting constrained non-convex optimization problem, we develop a new MB-PD algorithm with a model-free implementation, by using the properties of the resulting Karush-Kuhn-Tucker (KKT) conditions. This algorithm circumvents the requirement of a probing noise exerted on the system for persistent excitation, which is necessary for the newly proposed MF-OPPI algorithm and other RL-based algorithms proposed in [23, 24, 25], thereby avoiding undesirable oscillation.

The rest of this paper is organized as follows. We first formulate the problem of interest and provide some preliminaries in Section 2. Subsequently, we investigate this problem from the perspective of RL and primal-dual optimization in Section 3 and Section 4, respectively. Then, in Section 5, we show that the algorithms developed from the above two perspectives are equivalent in the sense that there is a corresponding relation in the involved iterations. Next, we provide a simulation example in Section 6. Some concluding remarks are finally given in Section 7.

Notations: We use In\textbf{I}_{n} and 0n\textbf{0}_{n} to denote an identity matrix and a null matrix in n×n\mathbb{R}^{n\times n}, respectively. Denote by 0n0_{n} a zero vector of nn dimensions. A set of symmetric matrices of n×nn\times n dimensions is denoted by 𝕊n\mathbb{S}^{n}. 𝕊+n\mathbb{S}^{n}_{+} (𝕊++n\mathbb{S}^{n}_{++}) is a set of symmetric positive semidefinite (positive definite) matrices. For a matrix MM, we use MijM_{ij} to denote the element in the ii-th row and the jj-th column. Tr(M)\mathrm{Tr}(M) is the trace of a matrix MM, and ρ(M)\rho(M) denotes its spectral radius. A symmetric positive (semi-)definite matrix MM is denoted by M0(0)M\succ 0\>(\succeq 0).

For a symmetric matrix Mn×nM\in\mathbb{R}^{n\times n}, vech(M)1×n(n+1)2\mathrm{vech}(M)\in\mathbb{R}^{1\times\frac{n(n+1)}{2}} denotes a row vector consisting of nn diagonal entries and n(n1)/2n(n-1)/2 non-diagonal terms MijM_{ij}, i=1,,ni=1,\ldots,n, i<jni<j\leq n. Specifically, vech(M)=[M11,,Mnn,M12,,M1n,M23,,M2n,,M(n1)n]\mathrm{vech}(M)=[M_{11},\,\ldots,\,M_{nn},\,M_{12},\,\ldots,\,M_{1n},M_{23},\,\ldots,\,M_{2n},\ldots,\,M_{(n-1)n}]. For a symmetric matrix Mn×nM\in\mathbb{R}^{n\times n}, vecs(M)n(n+1)/2\mathrm{vecs}(M)\in\mathbb{R}^{n(n+1)/2} denotes a column vector composed of nn diagonal entries of MM and n(n1)/2n(n-1)/2 distinct sums Mij+MjiM_{ij}+M_{ji}, i.e., vecs(M)=[M11,,Mnn,2M12,,2M1n,2M23,,2M2n,,2M(n1)n]\mathrm{vecs}(M)=[M_{11},\ldots,\!M_{nn},\!2M_{12},\ldots,\!2M_{1n},\!2M_{23},\ldots,\!2M_{2n},\ldots,\!2M_{(n-1)n}]^{\prime}. For a non-square matrix Mn×mM\in\mathbb{R}^{n\times m}, vecn(M)mn\mathrm{vecn}(M)\in\mathbb{R}^{mn} consists of the stack of each column’s elements. That is, vecn(M)=[M11,,Mn1,M12,,Mn2,,M1m,,Mnm]\mathrm{vecn}(M)=[M_{11},\ldots,M_{n1},M_{12},\ldots,M_{n2},\ldots,M_{1m},\ldots,M_{nm}]^{\prime}.
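
For concreteness, a minimal NumPy sketch of these three vectorization operators is given below; the helper names vech, vecs, and vecn simply mirror the notation above and are not part of the paper.

```python
import numpy as np

def vech(M):
    """Row-style vector: the n diagonal entries of M first, then the
    upper-triangular off-diagonal entries M_ij (i < j), row by row."""
    n = M.shape[0]
    diag = [M[i, i] for i in range(n)]
    off = [M[i, j] for i in range(n) for j in range(i + 1, n)]
    return np.array(diag + off)              # length n(n+1)/2

def vecs(M):
    """Column-style vector: the n diagonal entries of M, then the sums
    M_ij + M_ji (= 2*M_ij for symmetric M) over the upper triangle."""
    n = M.shape[0]
    diag = [M[i, i] for i in range(n)]
    off = [M[i, j] + M[j, i] for i in range(n) for j in range(i + 1, n)]
    return np.array(diag + off)              # length n(n+1)/2

def vecn(M):
    """Column-major stacking of a (possibly non-square) matrix."""
    return np.asarray(M).flatten(order="F")  # length m*n for an n-by-m matrix
```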

2 Problem Formulation and Preliminaries

Consider the following discrete-time linear time-invariant system

xk+1=Axk+Buk+wk,\displaystyle x_{k+1}=Ax_{k}+Bu_{k}+w_{k}, (1)

where kk\in\mathbb{N}, xknx_{k}\in\mathbb{R}^{n} is the state vector of the system plant, ukmu_{k}\in\mathbb{R}^{m} is the control action, wknw_{k}\in\mathbb{R}^{n} is the system noise generated according to the Gaussian distribution 𝒩(0n,σw2)\mathcal{N}(0_{n},\sigma_{w}^{2}), and AA and BB are constant matrices of appropriate dimensions.

The initial state is denoted by x0=zx_{0}=z, which is assumed to be a random variable. In this paper, we consider the state-feedback control law, that is, uk=Fxku_{k}=Fx_{k}, where FF is the feedback gain. xk(F,z)x_{k}(F,z) denotes the solution to the dynamical system (1) under the control input uk=Fxku_{k}=Fx_{k} while starting from x0=zx_{0}=z. It is assumed that the noise wkw_{k} is independent of the initial state zz as well as the noise wkw_{k^{\prime}} for all kkk^{\prime}\neq k.

Under the state-feedback control law uk=Fxku_{k}=Fx_{k}, the cost function is denoted by

J(F,z)k=0γk𝔼{[xk(F,z)Fxk(F,z)]Λ[xk(F,z)Fxk(F,z)]}\displaystyle J(F,z)\triangleq\sum_{k=0}^{\infty}\gamma^{k}\mathbb{E}\left\{\left[\begin{array}[]{c}x_{k}(F,z)\\ Fx_{k}(F,z)\end{array}\right]^{\prime}\Lambda\left[\begin{array}[]{c}x_{k}(F,z)\\ Fx_{k}(F,z)\end{array}\right]\right\} (6)

where γ(0,1)\gamma\in(0,1) is the discount factor, and Λ[Q00R]0\Lambda\triangleq\left[\begin{array}[]{cc}Q&0\\ 0&R\end{array}\right]\succeq 0.

Define \mathcal{F}\triangleq\{F\in\mathbb{R}^{m\times n}:\rho(A+BF)<1\}, which is the set of stabilizing state-feedback gains for the matrix pair (A,B). Now, we are ready to formally give the problem of interest, which is called the stochastic LQR problem.

Problem 1 (Stochastic LQR Problem).

Suppose that the initial state z satisfies \tilde{z}\triangleq\mathbb{E}(zz^{\prime})\succ 0. The stochastic LQR problem aims to find the optimal feedback gain F^{*} minimizing the quadratic cost function (6) with the system state evolving along the dynamical system (1). That is,

{minFJ(F,z)s.t.xk+1=Axk+Buk+wk.\displaystyle\left\{\begin{aligned} \min_{F\in\mathcal{F}}&\quad J(F,z)\\ \text{s.t.}&\quad x_{k+1}=Ax_{k}+Bu_{k}+w_{k}.\end{aligned}\right.

The optimal feedback gain FF^{*} is defined as FargminFJ(F,z)F^{*}\triangleq\arg\min_{F\in\mathcal{F}}J(F,z). The corresponding optimal cost for a given initial state znz\in\mathbb{R}^{n} is denoted by J(z)J(F,z)J^{*}(z)\triangleq J(F^{*},z).
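
Since the dynamics (1) are driven by Gaussian noise, the cost (6) involves an expectation over both the noise and the initial state. A minimal Monte-Carlo sketch of how J(F,z) could be approximated by simulation is given below; the horizon T, the number of rollouts N, and the function names are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def estimate_cost(A, B, F, Q, R, gamma, Sigma_w, sample_z, T=200, N=500, seed=0):
    """Monte-Carlo estimate of J(F, z): roll out x_{k+1} = (A + B F) x_k + w_k
    with w_k ~ N(0, Sigma_w) and average the discounted quadratic stage costs."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    total = 0.0
    for _ in range(N):
        x = sample_z(rng)                                  # draw an initial state z
        for k in range(T):
            u = F @ x
            total += gamma**k * (x @ Q @ x + u @ R @ u)    # stage cost under Lambda
            x = A @ x + B @ u + rng.multivariate_normal(np.zeros(n), Sigma_w)
    return total / N
```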

The following assumptions will be used in the remainder of this paper.

Assumption 1.

Assume that

  1. 1.

    (A,B)(A,B) is not known;

  2. 2.

    An initial stabilizing feedback gain FstabF_{\text{stab}}\in\mathcal{F} is known;

  3. 3.

    Q0Q\succeq 0 and R0R\succ 0;

  4. 4.

    The matrix pair (A,B)(A,B) is stabilizable;

  5. 5.

    The matrix pair (A,C) is detectable, where C is a matrix such that C^{\prime}C=Q.

For a stabilizing feedback gain F\in\mathcal{F}, define the value function as

\displaystyle V(F,x_{k})\triangleq\sum_{i=k}^{\infty}\gamma^{i-k}\mathbb{E}[x_{i}^{\prime}(Q+F^{\prime}RF)x_{i}]. (7)

The optimal value function is given by

V(F,xk)minFV(F,xk).\displaystyle V(F^{*},x_{k})\triangleq\min_{F\in\mathcal{F}}V(F,x_{k}). (8)
Lemma 1 ([26, Lemma 3]).

For any stabilizing state-feedback gain F\in\mathcal{F}, the stochastic LQR problem, i.e., Problem 1, is well-posed (that is, the optimal value function satisfies -\infty<V(F^{*},x_{k})<+\infty), and the corresponding value function is

V(F,xk)=𝔼[xkXxk]+γ1γTr(Xσw2),\displaystyle V(F,x_{k})=\mathbb{E}[x_{k}^{\prime}Xx_{k}]+\frac{\gamma}{1-\gamma}\mathrm{Tr}(X\sigma_{w}^{2}), (9)

where X𝕊+nX\in\mathbb{S}_{+}^{n} is the unique solution to the Lyapunov equation

X=γ(A+BF)X(A+BF)+FRF+Q.\displaystyle X=\gamma(A+BF)^{\prime}X(A+BF)+F^{\prime}RF+Q. (10)

From the definition of the value function (7), one has

V(F,xk)=𝔼[xk(Q+FRF)xk]+γV(F,xk+1),\displaystyle V(F,x_{k})=\mathbb{E}[x_{k}^{\prime}(Q+F^{\prime}RF)x_{k}]+\gamma V(F,x_{k+1}), (11)

which is called the Bellman equation. Define the Hamiltonian as

H(F,xk)\displaystyle H(F,x_{k})\triangleq 𝔼[xk(Q+FRF)xk]+γ𝔼(xk+1Xxk+1)\displaystyle\mathbb{E}[x_{k}^{\prime}(Q+F^{\prime}RF)x_{k}]+\gamma\mathbb{E}(x_{k+1}^{\prime}Xx_{k+1})
𝔼(xkXxk)γTr(Xσw2).\displaystyle-\mathbb{E}(x_{k}^{\prime}Xx_{k})-\gamma\mathrm{Tr}(X\sigma_{w}^{2}).

The first-order necessary condition for optimality gives

H(F,xk)F=2(R+γBXB)F𝔼[xkxk]+2γBXA𝔼[xkxk]=0.\displaystyle\frac{\partial H(F,x_{k})}{\partial F}=2(R+\gamma B^{\prime}XB)F\mathbb{E}[x_{k}x_{k}^{\prime}]+2\gamma B^{\prime}XA\mathbb{E}[x_{k}x_{k}^{\prime}]=0.

It follows that the unique optimal control gain is given by

F=γ(R+γBXB)1BXA\displaystyle F^{*}=-\gamma(R+\gamma B^{\prime}X^{*}B)^{-1}B^{\prime}X^{*}A (12)

with XX^{*} being the unique solution to the following ARE

X=Q+γAXAγ2AXB(R+γBXB)1BXA.\displaystyle X^{*}=Q+\gamma A^{\prime}X^{*}A-\gamma^{2}A^{\prime}X^{*}B(R+\gamma B^{\prime}X^{*}B)^{-1}B^{\prime}X^{*}A. (13)

By substituting the optimal feedback gain (12) into the value function (11), one can obtain that

V(F,xk)=𝔼{[xkFxk]P[xkFxk]}+γ1γTr(Xσw2),\displaystyle V(F^{*},x_{k})=\mathbb{E}\left\{\left[\begin{array}[]{c}x_{k}\\ F^{*}x_{k}\end{array}\right]^{\prime}P^{*}\left[\begin{array}[]{c}x_{k}\\ F^{*}x_{k}\end{array}\right]\right\}+\frac{\gamma}{1-\gamma}\mathrm{Tr}(X^{*}\sigma_{w}^{2}), (18)

where

P[Q+γAXAγAXBγBXAR+γBXB].\displaystyle P^{*}\triangleq\left[\begin{array}[]{cc}Q+\gamma A^{\prime}X^{*}A&\gamma A^{\prime}X^{*}B\\ \gamma B^{\prime}X^{*}A&R+\gamma B^{\prime}X^{*}B\end{array}\right]. (21)

In the following analysis, let P11Q+γAXAP_{11}^{*}\triangleq Q+\gamma A^{\prime}X^{*}A, P12(P21)γAXBP_{12}^{*}\triangleq(P_{21}^{*})^{\prime}\triangleq\gamma A^{\prime}X^{*}B, and P22R+γBXBP_{22}^{*}\triangleq R+\gamma B^{\prime}X^{*}B.

Based on the above analysis, we know that calculating the optimal feedback gain F^{*} requires both the system matrix A and the control input matrix B. In the following two sections, we provide two approaches, from an RL-based view and an optimization-based view, respectively, to compute the optimal feedback gain without using the matrices A and B.
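
As a model-based baseline, note that replacing (A,B) by (\gamma^{1/2}A,\gamma^{1/2}B) turns the ARE (13) into a standard discrete-time ARE (see also Section 3.1 below), so X^{*} and F^{*} can be computed with an off-the-shelf Riccati solver when the model is known. A minimal sketch using SciPy (the function name is illustrative):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_via_are(A, B, Q, R, gamma):
    """Solve the discounted ARE (13) by rescaling (A, B) -> (sqrt(gamma) A, sqrt(gamma) B),
    then return X* and the optimal gain F* from (12)."""
    Ag, Bg = np.sqrt(gamma) * A, np.sqrt(gamma) * B
    X = solve_discrete_are(Ag, Bg, Q, R)
    F = -gamma * np.linalg.solve(R + gamma * B.T @ X @ B, B.T @ X @ A)
    return X, F

# With the scalar data of Example 1 (A=[[2.]], B=[[1.]], Q=R=[[1.]], gamma=0.7),
# this reproduces, up to rounding, the fixed point X* ~ 3.9345 and F* ~ -1.4673
# reported there.
```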

3 RL-Based Method

The goal of this section is to solve Problem 1 by designing RL algorithms.

3.1 Classical PI Algorithm

Algorithm 1 Classical PI Algorithm
1:  Initialization: set a stabilizing control gain F0F^{0}, s=0s=0, and ϵ>0\epsilon>0;
2:  Repeat
3:   policy evaluation: solve XsX^{s} from
Xs=γ(A+BFs)Xs(A+BFs)+(Fs)RFs+Q;\displaystyle\ X^{s}=\gamma(A+BF^{s})^{\prime}X^{s}(A+BF^{s})+(F^{s})^{\prime}RF^{s}+Q; (22)
4:    policy improvement: update control gain from
Fs+1=γ(R+γBXsB)1BXsA;\displaystyle F^{s+1}=-\gamma(R+\gamma B^{\prime}X^{s}B)^{-1}B^{\prime}X^{s}A; (23)
5:    ss+1;s\leftarrow s+1;
6:  Until FsFs1ϵ\|F^{s}-F^{s-1}\|\leq\epsilon;
7:  Return FsF^{s}.

Note that by replacing (A,B)(A,B) with (γ1/2A,γ1/2B)(\gamma^{1/2}A,\gamma^{1/2}B), the ARE (13) is identical to that obtained from a standard LQR problem [27, Chapter 2.1.4], in which the dynamical system is not influenced by the stochastic noise. There have been various algorithms to seek the solutions to the ARE, one of which is the policy iteration algorithm. A classical PI algorithm proposed in [27, Chapter 2.2] to solve (13) is given below, cf. Algorithm 1.

The iterations (22) and (23) in Algorithm 1 rely upon repeatedly solving the Lyapunov equation (10). Hence, Algorithm 1 is an offline algorithm that requires complete knowledge of the system matrices AA and BB. Note that Algorithm 1 is also called Hewer’s algorithm, and it has been shown to converge to the solution of the ARE (13) in [28].
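
When the model is available, Algorithm 1 amounts to alternating a discrete-time Lyapunov solve with a gain update; a minimal sketch is given below (the function name and stopping rule are illustrative assumptions).

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def classical_pi(A, B, Q, R, gamma, F0, eps=1e-8, max_iter=100):
    """Model-based PI (Algorithm 1): policy evaluation (22) followed by
    policy improvement (23), repeated until the gain stops changing."""
    F = F0
    for _ in range(max_iter):
        # policy evaluation: X = gamma (A+BF)' X (A+BF) + F'RF + Q
        Acl = np.sqrt(gamma) * (A + B @ F)
        X = solve_discrete_lyapunov(Acl.T, Q + F.T @ R @ F)
        # policy improvement: F <- -gamma (R + gamma B'XB)^{-1} B'XA
        F_new = -gamma * np.linalg.solve(R + gamma * B.T @ X @ B, B.T @ X @ A)
        if np.linalg.norm(F_new - F) <= eps:
            return X, F_new
        F = F_new
    return X, F
```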

3.2 MF-OPPI Algorithm

In this subsection, we intend to propose an off-policy algorithm that can calculate the optimal feedback gain (12) without using the system matrices.

To this end, we first rewrite the original system (1) as

xk+1=(A+BFs)xk+B(ukFsxk)+wk,\displaystyle x_{k+1}=(A+BF^{s})x_{k}+B(u_{k}-F^{s}x_{k})+w_{k}, (24)

where uksFsxku_{k}^{s}\triangleq F^{s}x_{k} is the target policy being learned and updated by iterations, and uku_{k} is the behavior policy that is exerted on the system to generate data for learning.

By using (24) and the Lyapunov equation (10), one can obtain that

𝔼[xkXsxk]γ𝔼[xk+1Xsxk+1]\displaystyle\mathbb{E}[x_{k}^{\prime}X^{s}x_{k}]-\gamma\mathbb{E}[x_{k+1}^{\prime}X^{s}x_{k+1}] (25)
=\displaystyle= 𝔼[xk(Q+(Fs)RFs)xk]2γ𝔼[(ukFsxk)BXsAxk]\displaystyle\mathbb{E}[x_{k}^{\prime}(Q+(F^{s})^{\prime}RF^{s})x_{k}]-2\gamma\mathbb{E}[(u_{k}-F^{s}x_{k})^{\prime}B^{\prime}X^{s}Ax_{k}]
γ𝔼[(ukFsxk)BXsB(Fsxk+uk)]γTr(Xsσw2),\displaystyle-\gamma\mathbb{E}[(u_{k}-F^{s}x_{k})^{\prime}B^{\prime}X^{s}B(F^{s}x_{k}+u_{k})]-\gamma\mathrm{Tr}(X^{s}\sigma_{w}^{2}),

which is termed the off-policy Bellman equation.

Then, we use (25) for policy evaluation, and propose an MF-OPPI algorithm, i.e., Algorithm 2.

Algorithm 2 MF-OPPI Algorithm
1:  Initialization: set a stabilizing control gain F0F^{0}, s=0s=0, and ϵ>0\epsilon>0;
2:  Repeat
3:   policy evaluation: solve XsX^{s}, X1sX_{1}^{s}, and X2sX_{2}^{s} from
𝔼[xkXsxk]γ𝔼[xk+1Xsxk+1]\displaystyle\mathbb{E}[x_{k}^{\prime}X^{s}x_{k}]-\gamma\mathbb{E}[x_{k+1}^{\prime}X^{s}x_{k+1}] (26)
=\displaystyle= 𝔼[xk(Q+(Fs)RFs)xk]2γ𝔼[(ukFsxk)X1sxk]\displaystyle\mathbb{E}[x_{k}^{\prime}(Q+(F^{s})^{\prime}RF^{s})x_{k}]-2\gamma\mathbb{E}[(u_{k}-F^{s}x_{k})^{\prime}X_{1}^{s}x_{k}]
γ𝔼[(ukFsxk)X2s(Fsxk+uk)]γTr(Xsσw2);\displaystyle-\gamma\mathbb{E}[(u_{k}-F^{s}x_{k})^{\prime}X_{2}^{s}(F^{s}x_{k}+u_{k})]-\gamma\mathrm{Tr}(X^{s}\sigma_{w}^{2});
4:   policy improvement: update control gain from
Fs+1=γ(R+γX2s)1X1s;\displaystyle F^{s+1}=-\gamma(R+\gamma X_{2}^{s})^{-1}X_{1}^{s}; (27)
5:    ss+1;s\leftarrow s+1;
6:  Until FsFs1ϵ\|F^{s}-F^{s-1}\|\leq\epsilon;
7:  Return FsF^{s}.
Theorem 1.

In Algorithm 2, it holds that \lim_{s\to\infty}X^{s}=X^{*} and \lim_{s\to\infty}F^{s}=F^{*}, where X^{*} and F^{*} are defined in (13) and (12), respectively.

Proof.

Substituting (24) into the left-hand side of (26), it follows that

𝔼[xkXsxk]γ𝔼[xk+1Xsxk+1]\displaystyle\mathbb{E}[x_{k}^{\prime}X^{s}x_{k}]-\gamma\mathbb{E}[x_{k+1}^{\prime}X^{s}x_{k+1}]
=\displaystyle= 𝔼[xkXsxk]γ𝔼[((A+BFs)xk+wk)Xs((A+BFs)xk+wk)]\displaystyle\mathbb{E}[x_{k}^{\prime}X^{s}x_{k}]-\gamma\mathbb{E}[((A+BF^{s})x_{k}+w_{k})^{\prime}X^{s}((A+BF^{s})x_{k}+w_{k})]
2γ𝔼[(ukFsxk)BXsAxk]\displaystyle-2\gamma\mathbb{E}[(u_{k}-F^{s}x_{k})^{\prime}B^{\prime}X^{s}Ax_{k}]
γ𝔼[(ukFsxk)BXsB(Fsxk+uk)].\displaystyle-\gamma\mathbb{E}[(u_{k}-F^{s}x_{k})^{\prime}B^{\prime}X^{s}B(F^{s}x_{k}+u_{k})].

Therefore, it holds that

𝔼[xkXsxk]γ𝔼[((A+BFs)xk+wk)Xs((A+BFs)xk+wk)]\displaystyle\mathbb{E}[x_{k}^{\prime}X^{s}x_{k}]-\gamma\mathbb{E}[((A+BF^{s})x_{k}+w_{k})^{\prime}X^{s}((A+BF^{s})x_{k}+w_{k})]
=\displaystyle= 𝔼[xk(Q+(Fs)RFs)xk]γTr(Xsσw2).\displaystyle\mathbb{E}[x_{k}^{\prime}(Q+(F^{s})^{\prime}RF^{s})x_{k}]-\gamma\mathrm{Tr}(X^{s}\sigma_{w}^{2}).

Since 𝔼[((A+BFs)xk+wk)Xs((A+BFs)xk+wk)]=𝔼[xk(A+BFs)Xs(A+BFs)xk]+Tr(Xsσw2)\mathbb{E}[((A+BF^{s})x_{k}+w_{k})^{\prime}X^{s}((A+BF^{s})x_{k}+w_{k})]=\mathbb{E}[x_{k}^{\prime}(A+BF^{s})^{\prime}X^{s}(A+BF^{s})x_{k}]+\mathrm{Tr}(X^{s}\sigma_{w}^{2}), one can obtain that for k1k\geq 1,

𝔼[xkXsxk]γ𝔼[xk(A+BFs)Xs(A+BFs)xk]\displaystyle\mathbb{E}[x_{k}^{\prime}X^{s}x_{k}]-\gamma\mathbb{E}[x_{k}^{\prime}(A+BF^{s})^{\prime}X^{s}(A+BF^{s})x_{k}] (28)
=\displaystyle= 𝔼[xk(Q+(Fs)RFs)xk].\displaystyle\mathbb{E}[x_{k}^{\prime}(Q+(F^{s})^{\prime}RF^{s})x_{k}].

From the dynamics (1), it holds that for k1k\geq 1,

𝔼[xkxk]=(A+BF)𝔼[xk1xk1](A+BF)+σw2.\displaystyle\mathbb{E}[x_{k}x_{k}^{\prime}]=(A+BF)\mathbb{E}[x_{k-1}x_{k-1}^{\prime}](A+BF)^{\prime}+\sigma_{w}^{2}.

Due to the positive definiteness of \sigma_{w}^{2} and \tilde{z}, the above equation implies that \mathbb{E}[x_{k}x_{k}^{\prime}] is positive definite. Thus, one can obtain from (28) that

Xs=γ(A+BFs)Xs(A+BFs)+(Fs)RFs+Q,\displaystyle X^{s}=\gamma(A+BF^{s})^{\prime}X^{s}(A+BF^{s})+(F^{s})^{\prime}RF^{s}+Q, (29)

which is exactly the update in policy evaluation step of Algorithm 1.

From (25) and (26), we have X_{1}^{s}=B^{\prime}X^{s}A and X_{2}^{s}=B^{\prime}X^{s}B, which further imply that (27) is identical to (23). ∎

3.3 Implementation of Algorithm 2

To implement Algorithm 2, a natural question is how to solve the matrices X^{s}, X_{1}^{s}, and X_{2}^{s} from (26). In this subsection, we provide an effective way for the implementation of Algorithm 2, where a numerical average is adopted to approximate the mathematical expectation and a batch least-squares (BLS) method [29] is employed to estimate the unknown matrices.

Specifically, by vectorization, the policy evaluation step (26) in Algorithm 2 becomes

{𝔼[vech(xkxk)γvech(xk+1xk+1)]+γvech(σw2)}vecs(Xs)\displaystyle\left\{\mathbb{E}\left[\mathrm{vech}\big{(}x_{k}x_{k}^{\prime}\big{)}-\gamma\mathrm{vech}\big{(}x_{k+1}x_{k+1}^{\prime}\big{)}\right]+\gamma\mathrm{vech}(\sigma_{w}^{2})\right\}\mathrm{vecs}(X^{s})
+2γ𝔼[vecn[(ukFsxk)xk]]vecn(X1s)\displaystyle+2\gamma\mathbb{E}\left[\mathrm{vecn}\big{[}(u_{k}-F^{s}x_{k})x_{k}^{\prime}\big{]}\right]^{\prime}\mathrm{vecn}(X_{1}^{s})
+γ𝔼[vech[(ukFsxk)(uk+Fsxk)]]vecs(X2s)\displaystyle+\gamma\mathbb{E}\Big{[}\mathrm{vech}\big{[}(u_{k}-F^{s}x_{k})(u_{k}+F^{s}x_{k})^{\prime}\big{]}\Big{]}\mathrm{vecs}(X_{2}^{s})
=\displaystyle= 𝔼[xk(Q+(Fs)RFs)xk].\displaystyle\mathbb{E}\left[x_{k}^{\prime}(Q+(F^{s})^{\prime}RF^{s})x_{k}\right]. (30)

The equation (30) has \frac{n(n+1)}{2}+mn+\frac{m(m+1)}{2} unknown parameters. Thus, at least K\geq\frac{n(n+1)}{2}+mn+\frac{m(m+1)}{2} sample points are required to solve X^{s}, X_{1}^{s}, and X_{2}^{s} from (30). Sample the data generated under the behavior policy u_{k} for K time steps, and define

Φs\displaystyle\Phi^{s}\triangleq [𝔼[xk(Q+(Fs)RFs)xk]𝔼[xk+1(Q+(Fs)RFs)xk+1]𝔼[xk+K1(Q+(Fs)RFs)xk+K1]],\displaystyle\left[\begin{array}[]{c}\mathbb{E}[x_{k}^{\prime}(Q+(F^{s})^{\prime}RF^{s})x_{k}]\\ \mathbb{E}[x_{k+1}^{\prime}(Q+(F^{s})^{\prime}RF^{s})x_{k+1}]\\ \vdots\\ \mathbb{E}[x_{k+K-1}^{\prime}(Q+(F^{s})^{\prime}RF^{s})x_{k+K-1}]\end{array}\right], (35)
Ψs\displaystyle\Psi^{s}\triangleq [𝔼[Hxx,1]𝔼[Hxu,1]𝔼[Huu,1]𝔼[Hxx,K]𝔼[Hxu,K]𝔼[Huu,K]],\displaystyle\left[\begin{array}[]{ccc}\mathbb{E}[H_{xx,1}]&\mathbb{E}[H_{xu,1}]&\mathbb{E}[H_{uu,1}]\\ \vdots&\vdots&\vdots\\ \mathbb{E}[H_{xx,K}]&\mathbb{E}[H_{xu,K}]&\mathbb{E}[H_{uu,K}]\end{array}\right], (39)

with

Hxx,i\displaystyle H_{xx,i}\triangleq vech(xk+i1xk+i1)γvech(xk+ixk+i)+γvech(σw2),\displaystyle\mathrm{vech}\big{(}x_{k+i-1}x_{k+i-1}^{\prime}\big{)}-\gamma\mathrm{vech}\big{(}x_{k+i}x_{k+i}^{\prime}\big{)}+\gamma\mathrm{vech}(\sigma_{w}^{2}),
Hxu,i\displaystyle H_{xu,i}\triangleq 2γvecn[(uk+i1Fsxk+i1)xk+i1],\displaystyle 2\gamma\mathrm{vecn}\left[(u_{k+i-1}-F^{s}x_{k+i-1})x_{k+i-1}^{\prime}\right]^{\prime},
Huu,i\displaystyle H_{uu,i}\triangleq γvech[(uk+i1Fsxk+i1)(uk+i1+Fsxk+i1)].\displaystyle\gamma\mathrm{vech}\ \Big{[}(u_{k+i-1}-F^{s}x_{k+i-1})(u_{k+i-1}+F^{s}x_{k+i-1})^{\prime}\Big{]}.

From (30), (35), and (39), one can obtain

Ψs[(vecs(Xs))(vecn(X1s))(vecs(X2s))]=Φs,\displaystyle\Psi^{s}\left[\begin{array}[]{ccc}\Big{(}\mathrm{vecs}(X^{s})\Big{)}^{\prime}&\left(\mathrm{vecn}(X_{1}^{s})\right)^{\prime}&\Big{(}\mathrm{vecs}(X_{2}^{s})\Big{)}^{\prime}\end{array}\right]^{\prime}=\Phi^{s},

which can be solved by

[(vecs(Xs))(vecn(X1s))(vecs(X2s))]\displaystyle\left[\begin{array}[]{ccc}\Big{(}\mathrm{vecs}(X^{s})\Big{)}^{\prime}&\Big{(}\mathrm{vecn}(X_{1}^{s})\Big{)}^{\prime}&\Big{(}\mathrm{vecs}(X_{2}^{s})\Big{)}^{\prime}\end{array}\right]^{\prime} (40)
=\displaystyle= [(Ψs)Ψs]1(Ψs)Φs.\displaystyle\left[(\Psi^{s})^{\prime}\Psi^{s}\right]^{-1}(\Psi^{s})^{\prime}\Phi^{s}.

Note that to guarantee the solvability of the BLS estimator (40), a probing noise is required to be added to the behavior policy u_{k} so as to guarantee the invertibility of (\Psi^{s})^{\prime}\Psi^{s}. It is worthwhile to point out that adding a probing noise does not influence the iteration results of Algorithm 2, which can be regarded as an advantage of the off-policy algorithm compared with the on-policy one [24]. The detailed implementation of Algorithm 2 is given in Algorithm 3.

Algorithm 3 includes two main phases. In the first phase, we exert the behavior policy on the real system for K time steps, and sample the resulting system states and control actions. Then, a numerical average over N trajectories is used to approximate the mathematical expectation. In the second phase, we implement the policy evaluation and policy improvement steps by reusing the data sampled in the first phase. The reuse of data is an important advantage of the proposed MF-OPPI algorithm when compared with on-policy RL algorithms, since the latter are required to collect state-input data when applying every iterated control policy [5, 30]. Therefore, the proposed MF-OPPI algorithm is able to alleviate the data-hungry problem to some extent.

Algorithm 3 Implementation of MF-OPPI Algorithm
1:  Phase 1 (Data Collection):
2:  Initialization: set a stabilizing control gain F0F^{0}, s=0s=0, K>0K>0, and ϵ>0\epsilon>0; let Exkxk=0E_{x_{k}x_{k}}=0, Exkuk=0E_{x_{k}u_{k}}=0, and Eukuk=0E_{u_{k}u_{k}}=0 for k=0,,Kk=0,\cdots,K;
3:  For q=1:Nq=1:N
4:   apply the behavior policy uku_{k} for KK time steps;
5:   sample {xk,uk}k=0,,K\{x_{k},u_{k}\}_{k=0,\ldots,K};
6:   Exkxk=Exkxk+xkxkE_{x_{k}x_{k}}=E_{x_{k}x_{k}}+x_{k}x_{k}^{\prime}, k=0,,Kk=0,\ldots,K;
7:   Exkuk=Exkuk+xkukE_{x_{k}u_{k}}=E_{x_{k}u_{k}}+x_{k}u_{k}^{\prime}, k=0,,Kk=0,\ldots,K;
8:   Eukuk=Eukuk+ukukE_{u_{k}u_{k}}=E_{u_{k}u_{k}}+u_{k}u_{k}^{\prime}, k=0,,Kk=0,\ldots,K;
9:  End
10:  Exkxk=Exkxk/NE_{x_{k}x_{k}}=E_{x_{k}x_{k}}/N, Exkuk=Exkuk/NE_{x_{k}u_{k}}=E_{x_{k}u_{k}}/N, Eukuk=Eukuk/NE_{u_{k}u_{k}}=E_{u_{k}u_{k}}/N, k=0,,Kk=0,\ldots,K;
11:  Phase 2 (Reuse of Collected Data for Iteration):
12:  Initialization: set a stabilizing control gain F^{0}, s=0, and \epsilon>0;
13:  Repeat
14:   calculate Φs\Phi^{s} and Ψs\Psi^{s} using FsF^{s} and collected data;
15:   policy evaluation: solve XsX^{s}, X1sX_{1}^{s}, and X2sX_{2}^{s} from
[(vecs(Xs))(vecn(X1s))(vecs(X2s))]\displaystyle\left[\begin{array}[]{ccc}\Big{(}\mathrm{vecs}(X^{s})\Big{)}^{\prime}&\Big{(}\mathrm{vecn}(X_{1}^{s})\Big{)}^{\prime}&\Big{(}\mathrm{vecs}(X_{2}^{s})\Big{)}^{\prime}\end{array}\right]^{\prime}
=\displaystyle= [(Ψs)Ψs]1(Ψs)Φs;\displaystyle\left[(\Psi^{s})^{\prime}\Psi^{s}\right]^{-1}(\Psi^{s})^{\prime}\Phi^{s};
16:    policy improvement: update control gain from
Fs+1=γ(R+γBXsB)1BXsA;F^{s+1}=-\gamma(R+\gamma B^{\prime}X^{s}B)^{-1}B^{\prime}X^{s}A;
17:   ss+1s\leftarrow s+1;
18:  Until FsFs1ϵ\|F^{s}-F^{s-1}\|\leq\epsilon
19:  Return FsF^{s}.
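
A compact sketch of Phase 2 of Algorithm 3 is given below. It assumes the Phase-1 averages of E[x_k x_k'], E[x_k u_k'], and E[u_k u_k'] are available (every regressor in (30) is linear in these second moments), that K is at least the number of unknowns in (30), and that the vech/vecn helpers sketched after the Notations paragraph are in scope; all function and variable names are illustrative. The symmetric part of E[(u-Fx)(u+Fx)'] is used, since only that part enters the quadratic form in (26).

```python
import numpy as np

def mf_oppi_step(Exx, Exu, Euu, Q, R, gamma, Sigma_w, F):
    """One iteration of Phase 2 of Algorithm 3.  Exx[k] ~ E[x_k x_k'],
    Exu[k] ~ E[x_k u_k'], Euu[k] ~ E[u_k u_k'] are the Phase-1 averages for
    k = 0,...,K.  Builds Psi and Phi as in (35)/(39), solves the BLS problem
    (40), and returns the improved gain from (27)."""
    m, n = F.shape
    Psi, Phi = [], []
    for k in range(len(Exx) - 1):
        Hxx = vech(Exx[k]) - gamma * vech(Exx[k + 1]) + gamma * vech(Sigma_w)
        Hxu = 2 * gamma * vecn(Exu[k].T - F @ Exx[k])          # E[(u - Fx) x']
        Duu = Euu[k] + Exu[k].T @ F.T - F @ Exu[k] - F @ Exx[k] @ F.T
        Huu = gamma * vech(0.5 * (Duu + Duu.T))                # symmetric part only
        Psi.append(np.concatenate([Hxx, Hxu, Huu]))
        Phi.append(np.trace((Q + F.T @ R @ F) @ Exx[k]))
    theta, *_ = np.linalg.lstsq(np.array(Psi), np.array(Phi), rcond=None)
    # theta stacks vecs(X), vecn(X1), vecs(X2); only X1 ~ B'XA and X2 ~ B'XB are needed
    p1 = n * (n + 1) // 2
    X1 = theta[p1:p1 + m * n].reshape(m, n, order="F")
    x2 = theta[p1 + m * n:]
    X2 = np.diag(x2[:m])
    idx = m
    for i in range(m):
        for j in range(i + 1, m):
            X2[i, j] = X2[j, i] = x2[idx] / 2                  # vecs stores 2*X2_ij
            idx += 1
    return -gamma * np.linalg.solve(R + gamma * X2, X1)        # F^{s+1} from (27)
```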

4 Optimization-Based Method

In this section, we reformulate Problem 1 as a constrained optimization problem, and resolve it from the primal-dual optimization perspective.

4.1 Problem Reformulation

For technical reasons, we need to make some modifications to Problem 1. To do this, some new notation is required.

We introduce the augmented state vector vk[xkuk]v_{k}\triangleq\left[x_{k}^{\prime}\;\;u_{k}^{\prime}\right]^{\prime}, and the resulting augmented system is described by

vk+1=AFvk+F¯wk,\displaystyle v_{k+1}=A_{F}v_{k}+\bar{F}w_{k}, (41)

where AF[ABFAFB](n+m)×(n+m)A_{F}\triangleq\left[\begin{array}[]{cc}A&B\\ FA&FB\end{array}\right]\in\mathbb{R}^{(n+m)\times(n+m)} and F¯[InF](n+m)×n\bar{F}\triangleq\left[\begin{array}[]{c}I_{n}\\ F\end{array}\right]\in\mathbb{R}^{(n+m)\times n}.

The following is a useful property that will be used in the remaining part of this paper.

Lemma 2 (Property of Spectral Radius[21, Lemma 1]).

It holds that ρ(A+BF)=ρ(AF)\rho(A+BF)=\rho(A_{F}).

Let z_{(i)}\in\mathbb{R}^{n} for i\in\{1,\ldots,r\} be r different samples from the random variable z. Let u_{(i)}\in\mathbb{R}^{m} for i\in\{1,\ldots,r\} be r different initial stabilizing control laws. The r different initial states for the augmented system (41) are denoted by v_{(i)}=[z_{(i)}^{\prime}\ u_{(i)}^{\prime}]^{\prime} for i\in\{1,\ldots,r\}. Denote by v_{k}(F,v_{(i)}) the solution of (41) starting from the initial state v_{(i)} and evolving under the control u_{k}=Fx_{k}, k\geq 1.

A new cost function is defined as

J^(F,v(i))k=0γk𝔼[vk(F,v(i))Λvk(F,v(i))].\displaystyle\hat{J}(F,v_{(i)})\triangleq\sum_{k=0}^{\infty}\gamma^{k}\mathbb{E}\left[v_{k}(F,v_{(i)})^{\prime}\Lambda v_{k}(F,v_{(i)})\right]. (42)

Note that different from the cost function J(F,z)J(F,z) defined in (6), where the initial state-feedback control law u0=Fzu_{0}=Fz is considered, the initial control u(i)u_{(i)} used in J^(F,v(i))\hat{J}(F,v_{(i)}) is arbitrary, as long as it is stabilizable.

The new problem is formally formulated as below.

Problem 2 (Modified Stochastic LQR Problem).

Suppose that z(i)nz_{(i)}\in\mathbb{R}^{n}, u(i)mu_{(i)}\in\mathbb{R}^{m}, and v(i)=[z(i)u(i)]v_{(i)}=[z_{(i)}^{\prime}\ u_{(i)}^{\prime}]^{\prime} for i{1,,r}i\in\{1,\ldots,r\}, are chosen such that 1ri=1rv(i)v(i)=Γ0\frac{1}{r}\sum_{i=1}^{r}v_{(i)}v_{(i)}^{\prime}=\Gamma\succ 0, with rr being a positive constant. The following problem,

{minF1ri=1rJ^(F,v(i))s.t.xk+1=Axk+Buk+wk,\displaystyle\left\{\begin{aligned} \min_{F\in\mathcal{F}}&\quad\frac{1}{r}\sum_{i=1}^{r}\hat{J}(F,v_{(i)})\\ \text{s.t.}&\quad x_{k+1}=Ax_{k}+Bu_{k}+w_{k},\end{aligned}\right.

is called a modified stochastic LQR problem. The corresponding optimal feedback gain is defined as F^argminF1ri=1rJ^(F,v(i))\hat{F}^{*}\triangleq\arg\min_{F\in\mathcal{F}}\frac{1}{r}\sum_{i=1}^{r}\hat{J}(F,v_{(i)}). The resulting optimal value is denoted by J^1ri=1rJ^(F,v(i))\hat{J}^{*}\triangleq\frac{1}{r}\sum_{i=1}^{r}\hat{J}(F^{*},v_{(i)}).

Proposition 1.

The optimal solution F^\hat{F}^{*} obtained by solving Problem 2 is unique, and it is identical to FF^{*}.

Proof.

From the results in Section 2, we know that although J(F,z) takes different values for different initial states z, the optimal control gain F^{*}=\arg\min_{F\in\mathcal{F}}J(F,z), whose explicit form is given in (12), does not depend on z. It follows that \arg\min_{F\in\mathcal{F}}J(F,z)=\arg\min_{F\in\mathcal{F}}\frac{1}{r}\sum_{i=1}^{r}J(F,z_{i}) for any initial states z and z_{i}, i\in\{1,\ldots,r\}, satisfying z_{i}z_{i}^{\prime}\succ 0.

Based on the definitions of J(,)J(\cdot,\cdot) and J^(,)\hat{J}(\cdot,\cdot), an algebraic manipulation leads to

1ri=1rJ^(F,v(i))=\displaystyle\frac{1}{r}\sum_{i=1}^{r}\hat{J}(F,v_{(i)})= γri=1rJ(F,zi)+1ri=1rv(i)Λv(i)\displaystyle\frac{\gamma}{r}\sum_{i=1}^{r}J(F,z_{i})+\frac{1}{r}\sum_{i=1}^{r}v_{(i)}^{\prime}\Lambda v_{(i)}
+γTr[(Q+FRF)σw2],\displaystyle+\gamma\mathrm{Tr}[(Q+F^{\prime}RF)\sigma_{w}^{2}],

where zi=[AB]v(i)z_{i}=[A\ B]v_{(i)}. Since the last two terms on the right-hand side of the above equation are constants, it holds that argminF1ri=1rJ(F,zi)=argminF1ri=1rJ^(F,v(i))\arg\min_{F\in\mathcal{F}}\frac{1}{r}\sum_{i=1}^{r}J(F,z_{i})=\arg\min_{F\in\mathcal{F}}\frac{1}{r}\sum_{i=1}^{r}\hat{J}(F,v_{(i)}).

From the above, it follows that F=argminFJ(F,z)=argminF1ri=1rJ(F,zi)=argminF1ri=1rJ^(F,v(i))=F^F^{*}=\arg\min_{F\in\mathcal{F}}J(F,z)=\arg\min_{F\in\mathcal{F}}\frac{1}{r}\sum_{i=1}^{r}J(F,z_{i})=\arg\min_{F\in\mathcal{F}}\frac{1}{r}\sum_{i=1}^{r}\hat{J}(F,v_{(i)})=\hat{F}^{*}. Moreover, the uniqueness of FF^{*} guarantees that F^\hat{F}^{*} is unique. ∎

The following lemma collects two standard results that will be frequently used in the sequel.

Lemma 3 (Lyapunov Stability Theorems [31, Chapter 3]).

Let An×nA\in\mathbb{R}^{n\times n}. Then,

  1. 1.

    if ρ(A)<1\rho(A)<1, then for each given matrix M𝕊+nM\in\mathbb{S}_{+}^{n}, there exists a unique matrix P𝕊+nP\in\mathbb{S}_{+}^{n} satisfying APA+M=PA^{\prime}PA+M=P;

  2. 2.

    ρ(A)<1\rho(A)<1 if and only if for each given matrix M𝕊++nM\in\mathbb{S}_{++}^{n}, there exists a unique matrix P𝕊++nP\in\mathbb{S}_{++}^{n} such that APA+M=PA^{\prime}PA+M=P.

4.2 Primal-Dual Formulation

From the previous subsection, we know that Problem 2 is a non-convex optimization problem with the objective function 1ri=1rJ^(F,v(i))\frac{1}{r}\sum_{i=1}^{r}\hat{J}(F,v_{(i)}), the static constraint FF\in\mathcal{F}, and the dynamic constraint (1). In this subsection, we propose an equivalent non-convex optimization formulation of Problem 2, and show that its dual gap is zero.

We first give the definition of the non-convex optimization formulation.

Primal Problem 1.

Solve

Jp\displaystyle J_{p}\triangleq infS𝕊n+m,Fm×nTr(ΛS)\displaystyle\inf_{S\in\mathbb{S}^{n+m},F\in\mathbb{R}^{m\times n}}\mathrm{Tr}(\Lambda S)
s.t.\displaystyle\mathrm{s.t.}\;\;\;\; S0,\displaystyle S\succ 0, (43)
γAFSAF+Γ+γ1γF¯σw2F¯=S.\displaystyle\gamma A_{F}SA_{F}^{\prime}+\Gamma+\frac{\gamma}{1-\gamma}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}=S. (44)

It is easy to see that Primal Problem 1 is a single-objective multi-variable optimization problem consisting of a linear objective function, a linear inequality constraint, and a quadratic equality constraint [32]. Hence, it is a non-convex optimization problem.
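
For a fixed gain F\in\mathcal{F}, the equality constraint (44) is a discrete-time Lyapunov equation in S, so the primal objective Tr(\Lambda S) can be evaluated directly when the model is known. A minimal sketch (the function name is an illustrative assumption):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def primal_objective(A, B, F, Q, R, Gamma, Sigma_w, gamma):
    """Evaluate S from (44) for a given stabilizing F and return Tr(Lambda S),
    the objective of Primal Problem 1 at (S, F)."""
    n, m = A.shape[0], B.shape[1]
    if max(abs(np.linalg.eigvals(A + B @ F))) >= 1:
        raise ValueError("F is not stabilizing, i.e., F lies outside the set F")
    A_F = np.block([[A, B], [F @ A, F @ B]])                   # augmented matrix of (41)
    Fbar = np.vstack([np.eye(n), F])
    const = Gamma + gamma / (1 - gamma) * Fbar @ Sigma_w @ Fbar.T
    S = solve_discrete_lyapunov(np.sqrt(gamma) * A_F, const)   # S = gamma A_F S A_F' + const
    Lam = np.block([[Q, np.zeros((n, m))], [np.zeros((m, n)), R]])
    return np.trace(Lam @ S)
```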

The following proposition shows the equivalence between Problem 2 and Primal Problem 1.

Proposition 2.

The optimal solution of Primal Problem 1, which is denoted by (Sp,Fp)(S_{p},F_{p}), is unique. Furthermore, Primal Problem 1 is equivalent to Problem 2 in the sense that Jp=J^J_{p}=\hat{J}^{*} and Fp=F^F_{p}=\hat{F}^{*}.

Proof.

By the properties of trace, we rewrite the objective function of Problem 2 as

1ri=1rJ^(F,v(i))=Tr(ΛS)\displaystyle\frac{1}{r}\sum_{i=1}^{r}\hat{J}(F,v_{(i)})=\mathrm{Tr}(\Lambda S)

with

SS(F)=1ri=1rk=0γk𝔼[vk(F,v(i))vk(F,v(i))].\displaystyle S\triangleq S(F)=\frac{1}{r}\sum_{i=1}^{r}\sum_{k=0}^{\infty}\gamma^{k}\mathbb{E}[v_{k}(F,v_{(i)})v_{k}(F,v_{(i)})^{\prime}]. (45)

From (41), we have

vk(F,v(i))=AFkv(i)+j=0k1AFjF¯wkj1,k1.\displaystyle v_{k}(F,v_{(i)})=A_{F}^{k}v_{(i)}+\sum_{j=0}^{k-1}A_{F}^{j}\bar{F}w_{k-j-1},\;k\geq 1. (46)

Using the equation (46) and the fact that wk𝒩(0n,σw2)w_{k}\sim\mathcal{N}(0_{n},\sigma_{w}^{2}) is the i.i.d. random noise for each k0k\geq 0, we can obtain that for any FF\in\mathcal{F} and k1k\geq 1, it holds

𝔼[vk(F,v(i))vk(F,v(i))]\displaystyle\mathbb{E}[v_{k}(F,v_{(i)})v_{k}(F,v_{(i)})^{\prime}]
=\displaystyle= 𝔼[AFkv(i)v(i)(AFk)+j=0k1AFjF¯wkj1wkj1F¯(AFj)].\displaystyle\mathbb{E}\left[A_{F}^{k}v_{(i)}v_{(i)}^{\prime}(A_{F}^{k})^{\prime}+\sum_{j=0}^{k-1}A_{F}^{j}\bar{F}w_{k-j-1}w_{k-j-1}^{\prime}\bar{F}^{\prime}(A_{F}^{j})^{\prime}\right].

Thus, SS in (45) becomes

S=\displaystyle S= 1ri=1r[k=1γk[AFkv(i)v(i)(AFk)+j=0k1AFjF¯σw2F¯(AFj)]]\displaystyle\frac{1}{r}\sum_{i=1}^{r}\left[\sum_{k=1}^{\infty}\gamma^{k}\Big{[}A_{F}^{k}v_{(i)}v_{(i)}^{\prime}(A_{F}^{k})^{\prime}+\sum_{j=0}^{k-1}A_{F}^{j}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}(A_{F}^{j})^{\prime}\Big{]}\right]
=\displaystyle= k=1γk[AFkΓ(AFk)+j=0k1AFjF¯σw2F¯(AFj)].\displaystyle\sum_{k=1}^{\infty}\gamma^{k}\left[A_{F}^{k}\Gamma(A_{F}^{k})^{\prime}+\sum_{j=0}^{k-1}A_{F}^{j}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}(A_{F}^{j})^{\prime}\right]. (47)

By an algebraic manipulation, one can obtain that

SγAFSAF=Γ+k=1γkF¯σw2F¯=Γ+γ1γF¯σw2F¯,\displaystyle S-\gamma A_{F}SA_{F}^{\prime}=\Gamma+\sum_{k=1}^{\infty}\gamma^{k}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}=\Gamma+\frac{\gamma}{1-\gamma}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime},

which implies that SS satisfies the Lyapunov equation (44).

Since Γ+γ1γF¯σw2F¯0\Gamma+\frac{\gamma}{1-\gamma}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}\succ 0, it follows from Lemma 2 and the second statement of Lemma 3 that FF\in\mathcal{F} if and only if there exists a unique S0S\succ 0 such that the constraint (44) holds. Therefore, we can replace the constraint FF\in\mathcal{F} in Problem 2 by S0S\succ 0 without changing its optimal solution.

From the above discussion, we know that if (S_{p},F_{p}) is the optimal solution of Primal Problem 1 and the corresponding optimal cost is J_{p}, then it holds that S_{p}\succ 0 and \rho(A_{F_{p}})<1. From Lemma 2, F_{p}\in\mathcal{F}, so F_{p} is a feasible point of Problem 2. Thereby, J_{p} is lower bounded by the optimal value of Problem 2, i.e., J_{p}\geq\hat{J}^{*}. In addition, taking F=\hat{F}^{*}\in\mathcal{F} together with the unique solution S of (44) gives a feasible point of Primal Problem 1 whose objective value equals \hat{J}^{*}, so that J_{p}\leq\hat{J}^{*}. Thus, we can conclude that J_{p}=\hat{J}^{*} and F_{p}=\hat{F}^{*}. The uniqueness of (S_{p},F_{p}) follows from the uniqueness of \hat{F}^{*}. ∎

For any fixed P𝕊n+mP\in\mathbb{S}^{n+m}, P0𝕊+n+mP_{0}\in\mathbb{S}_{+}^{n+m}, define the Lagrangian function as

L(P,P0,\displaystyle L(P,P_{0}, F,S)Tr(ΛS)+Tr(SP0)\displaystyle F,S)\triangleq\mathrm{Tr}(\Lambda S)+\mathrm{Tr}(-SP_{0}) (48)
+Tr[(γAFSAF+Γ+γ1γF¯σw2F¯S)P],\displaystyle+\mathrm{Tr}\left[\Big{(}\gamma A_{F}SA_{F}^{\prime}+\Gamma+\frac{\gamma}{1-\gamma}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}-S\Big{)}P\right],

where PP and P0P_{0} are Lagrangian multipliers. The Lagrangian dual function is defined as

d(P,P0)infS𝕊n+m,Fm×nL(P,P0,F,S).\displaystyle d(P,P_{0})\triangleq\inf_{S\in\mathbb{S}^{n+m},F\in\mathbb{R}^{m\times n}}L(P,P_{0},F,S).

Then, the dual problem corresponding to Primal Problem 1 is defined as

Jd\displaystyle J_{d}\triangleq supP𝕊n+m,P0𝕊+n+md(P,P0)\displaystyle\sup_{P\in\mathbb{S}^{n+m},P_{0}\in\mathbb{S}_{+}^{n+m}}d(P,P_{0})
=\displaystyle= supP𝕊n+m,P0𝕊+n+minfS𝕊n+m,Fm×nL(P,P0,F,S),\displaystyle\sup_{P\in\mathbb{S}^{n+m},P_{0}\in\mathbb{S}_{+}^{n+m}}\inf_{S\in\mathbb{S}^{n+m},F\in\mathbb{R}^{m\times n}}L(P,P_{0},F,S),

where L(P,P0,F,S)L(P,P_{0},F,S) is as defined in (48).

The weak duality [32, Chapter 5] guarantees that JdJpJ_{d}\leq J_{p} holds, where JpJd0J_{p}-J_{d}\geq 0 is called the duality gap. When the duality gap is zero, we say that the optimization problem has strong duality [32, Chapter 5]. The following theorem shows that Primal Problem 1 has strong duality.

Theorem 2 (Strong Duality).

The strong duality holds for Primal Problem 1, that is, Jp=JdJ_{p}=J_{d}.

Proof.

See Appendix A. ∎

4.3 MB-PD Algorithm

It is well known that for a non-convex optimization problem, if its objective and constraint functions are differentiable and strong duality holds, then any primal and dual optimal points must satisfy the corresponding KKT conditions, but not vice versa [32, Chapter 5.5.3]. However, if the solution to the KKT conditions is unique, then this solution must be the primal and dual optimal point. In view of this, we first derive the KKT conditions and show the uniqueness of their solution in this subsection, and then design an algorithm that seeks the solution to Problem 2 via the solution to the KKT conditions.

Lemma 4.

Suppose that (Sp,Fp)(S_{p},F_{p}) is the primal optimal point and (Pp,P0,p)(P_{p},P_{0,p}) is the dual optimal point of Primal Problem 1. Then, (Sp,Fp,Pp,P0,p)(S_{p},F_{p},P_{p},P_{0,p}) satisfies the KKT conditions for (S,F,P,P0)(S,F,P,P_{0})

γAFSAF+Γ+γ1γF¯σw2F¯S=0,\displaystyle\gamma A_{F}SA_{F}^{\prime}+\Gamma+\frac{\gamma}{1-\gamma}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}-S=0, (49)
S0,\displaystyle S\succ 0, (50)
P0=0n+m,\displaystyle P_{0}=\textbf{0}_{n+m}, (51)
γAFPAF+Λ=P,\displaystyle\gamma A_{F}^{\prime}PA_{F}+\Lambda=P, (52)
2(P12+P22F)(γ[AB]S[AB]+γ1γσw2)=0.\displaystyle 2(P_{12}^{\prime}+P_{22}F)\Big{(}\gamma\left[A\;\;B\right]S\left[A\;\;B\right]^{\prime}+\frac{\gamma}{1-\gamma}\sigma_{w}^{2}\Big{)}=0. (53)
Proof.

From the conclusions in [32, Chapter 5.9.2], the KKT conditions of Primal Problem 1 are as follows:

  1. 1.

    primal feasibility condition:

    γAFSAF+Γ+γ1γF¯σw2F¯S=0,\displaystyle\gamma A_{F}SA_{F}^{\prime}+\Gamma+\frac{\gamma}{1-\gamma}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}-S=0,
    S0;\displaystyle S\succ 0;
  2. 2.

    dual feasibility condition:

    P00;\displaystyle P_{0}\succeq 0;
  3. 3.

    complementary slackness condition:

    Tr(SP0)=0;\displaystyle\mathrm{Tr}(SP_{0})=0;
  4. 4.

    stationary conditions:

    LS=Λ+γAFSAFPP0=0,\displaystyle\!\frac{\partial L}{\partial S}=\Lambda+\gamma A_{F}^{\prime}SA_{F}-P-P_{0}=0,
    LF=2(P12+P22F)(γ[AB]S[AB]+γ1γσw2)=0.\displaystyle\!\frac{\partial L}{\partial F}=2(P_{12}^{\prime}+P_{22}F)\Big{(}\gamma\left[A\;\;B\right]S\left[A\;\;B\right]^{\prime}\!+\!\frac{\gamma}{1-\gamma}\sigma_{w}^{2}\Big{)}=0.

By the second statement of Lemma 3, \Gamma+\frac{\gamma}{1-\gamma}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}\succ 0 and F\in\mathcal{F} guarantee that S\succ 0, which, together with P_{0}\succeq 0, further implies that the only solution to the equation \mathrm{Tr}(SP_{0})=0 is P_{0}=\textbf{0}_{n+m}. Hence, the KKT conditions in (49)-(53) are obtained. According to [32, Chapter 5.5.3], the strong duality guarantees that the primal and dual optimal points must satisfy these KKT conditions. ∎

The following lemma shows the uniqueness of the solution to the KKT conditions (49)-(53), which guarantees that the solution satisfying the KKT conditions must be the primal and dual optimal point.

Lemma 5.

Suppose that (S,F,P,P0)(S,F,P,P_{0}) satisfies the KKT conditions (49)-(53). Then, (S,F,P,P0)(S,F,P,P_{0}) is unique. Furthermore, it holds that F=FpF=F_{p}, P=PpP=P_{p}, P0=0m+nP_{0}=\textbf{0}_{m+n}, and SpS_{p} is the unique solution to (49) with F=FpF=F_{p} and P=PpP=P_{p}.

Proof.

By inspection, a sufficient condition for (53) is F=-(P_{22})^{-1}P_{12}^{\prime}. To prove this lemma, we first show that any (F,P) satisfying \gamma A_{F}^{\prime}PA_{F}+\Lambda=P, F=-(P_{22})^{-1}P_{12}^{\prime}, and P\succeq 0 is exactly (F_{p},P_{p}) with F_{p}\in\mathcal{F}. Using the same argument as in the proof of Lemma 7, we have P_{p}=P^{*}. From Proposition 1 and Proposition 2, it holds that F_{p}=F^{*}. In view of this, we only need to show that (F,P) satisfying \gamma A_{F}^{\prime}PA_{F}+\Lambda=P, F=-(P_{22})^{-1}P_{12}^{\prime}, and P\succeq 0 is exactly (F^{*},P^{*}).

By substituting F=(P22)1P12F=-(P_{22})^{-1}P_{12}^{\prime} into (52), it follows that

γ[AB](P11P12P221P12)[AB]+Λ=P.\displaystyle\gamma\left[A\ \ B\right]^{\prime}(P_{11}-P_{12}P_{22}^{-1}P_{12}^{\prime})\left[A\ \ B\right]+\Lambda=P.

Define X¯P11P12P221P12\bar{X}\triangleq P_{11}-P_{12}P_{22}^{-1}P_{12}^{\prime}. The above equation becomes γ[AB]X¯[AB]+Λ=P\gamma\left[A\ B\right]^{\prime}\bar{X}\left[A\ B\right]+\Lambda=P, the expansion of which can be written as

[γAX¯A+QγAX¯BγBX¯AγBX¯B+R]=[P11P12P12P22],\displaystyle\left[\begin{array}[]{cc}\gamma A^{\prime}\bar{X}A+Q&\gamma A^{\prime}\bar{X}B\\ \gamma B^{\prime}\bar{X}A&\gamma B^{\prime}\bar{X}B+R\end{array}\right]=\left[\begin{array}[]{cc}P_{11}&P_{12}\\ P_{12}^{\prime}&P_{22}\end{array}\right],

i.e., P11=γAX¯A+QP_{11}=\gamma A^{\prime}\bar{X}A+Q, P12=γAX¯BP_{12}=\gamma A^{\prime}\bar{X}B, and P22=γBX¯B+RP_{22}=\gamma B^{\prime}\bar{X}B+R. Substituting these expressions into the definition X¯P11P12P221P12\bar{X}\triangleq P_{11}-P_{12}P_{22}^{-1}P_{12}^{\prime} yields that

Q+γAX¯Aγ2AX¯B(R+γBX¯B)1BX¯A=X¯.\displaystyle Q+\gamma A^{\prime}\bar{X}A-\gamma^{2}A^{\prime}\bar{X}B(R+\gamma B^{\prime}\bar{X}B)^{-1}B^{\prime}\bar{X}A=\bar{X}.

This is exactly the ARE (13), which has a unique solution. Therefore, it holds that X¯=X\bar{X}=X^{*}. From the definitions (12) and (21), we have F=FF=F^{*} and P=PP=P^{*}.

Then, based on the fact that Fp=FF_{p}=F^{*}\in\mathcal{F} and Λ0\Lambda\succeq 0, the first statement of Lemma 3 shows that there exists a unique S=SpS=S_{p} satisfying (49) with F=FpF=F_{p} and P=PpP=P_{p}. ∎

Note that to calculate the primal optimal point FpF_{p}, one needs to know PpP_{p}, which is uniquely determined by solving (52). However, (52) is dependent upon the primal optimal point FpF_{p}, which means that the stationary conditions (52) and (53) are coupled with each other, so that FpF_{p} and PpP_{p} cannot be obtained independently. In what follows, we propose an algorithm, cf. Algorithm 4, to iteratively solve (52) and (53). Note that once FpF_{p} and PpP_{p} are attained, one can uniquely determine the primal variable SpS_{p} by solving (49).

Algorithm 4 MB-PD Algorithm
1:  Initialization: Fp0F_{p}^{0}\in\mathcal{F}, ϵ>0\epsilon>0, and set s=0s=0;
2:  Repeat
3:   dual update: solve PsP^{s} from
γ(AFps)PpsAFps+Λ=Pps;\gamma(A_{F_{p}^{s}})^{\prime}P_{p}^{s}A_{F_{p}^{s}}+\Lambda=P_{p}^{s};
4:    primal update: Fps+1=(Pp,22s)1(Pp,12s)F_{p}^{s+1}=-(P_{p,22}^{s})^{-1}(P_{p,12}^{s})^{\prime};
5:   ss+1;s\leftarrow s+1;
6:  Until FpsFps1ϵ\|F_{p}^{s}-F_{p}^{s-1}\|\leq\epsilon;
7:  Return FpsF_{p}^{s}.

We have the following convergence result:

Lemma 6.

Suppose that an initial stabilizing control gain F_{p}^{0}\in\mathcal{F} is given, and consider the two sequences \{P_{p}^{s}\}_{s=0}^{\infty} and \{F_{p}^{s}\}_{s=0}^{\infty} generated by Algorithm 4. Then the following properties hold:

  1. 1.

    {Pps}s=0\{P_{p}^{s}\}_{s=0}^{\infty} is a non-increasing sequence, that is, PpsPps+1P_{p}^{s}\succeq P_{p}^{s+1};

  2. 2.

    FpsF_{p}^{s} obtained at each iteration step is stabilizing, that is, FpsF_{p}^{s}\in\mathcal{F} for each ss;

  3. 3.

    limsPps=Pp=P\lim_{s\to\infty}P_{p}^{s}=P_{p}=P^{*} and limsFps=Fp=F\lim_{s\to\infty}F_{p}^{s}=F_{p}=F^{*}, where PP^{*} and FF^{*} are defined in (21) and (12), respectively.

Proof.

Note that Algorithm 4 can be transformed into Algorithm 1 developed in [21] by replacing (A,B) with (\gamma^{1/2}A,\gamma^{1/2}B). Therefore, this lemma follows directly from the proof of Proposition 8 in [21] after some algebraic manipulations. ∎
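
With a known model, Algorithm 4 can be realized with a standard Lyapunov solver for the dual update. A minimal sketch is given below (the function name and stopping rule are illustrative); on the data of Example 1 below it matches the iterates P_{p}^{s} and F_{p}^{s} listed there.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def mb_pd(A, B, Q, R, gamma, F0, eps=1e-8, max_iter=100):
    """Model-based primal-dual iteration (Algorithm 4): the dual update solves
    gamma A_F' P A_F + Lambda = P; the primal update sets F <- -(P22)^{-1} P12'."""
    n, m = A.shape[0], B.shape[1]
    Lam = np.block([[Q, np.zeros((n, m))], [np.zeros((m, n)), R]])
    F = F0
    for _ in range(max_iter):
        A_F = np.block([[A, B], [F @ A, F @ B]])
        P = solve_discrete_lyapunov(np.sqrt(gamma) * A_F.T, Lam)    # dual update
        P12, P22 = P[:n, n:], P[n:, n:]
        F_new = -np.linalg.solve(P22, P12.T)                        # primal update
        if np.linalg.norm(F_new - F) <= eps:
            return P, F_new
        F = F_new
    return P, F
```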

4.4 Model-Free Implementation of Algorithm 4

From the equation (45) that is defined in the proof of Proposition 2, S(Fps)S(F_{p}^{s}) can be equivalently written as

S(Fps)=1ri=1rk=0γk𝔼[vk(Fps,v(i))vk(Fps,v(i))].\displaystyle S(F_{p}^{s})=\frac{1}{r}\sum_{i=1}^{r}\sum_{k=0}^{\infty}\gamma^{k}\mathbb{E}\left[v_{k}\big{(}F_{p}^{s},v_{(i)}\big{)}v_{k}\big{(}F_{p}^{s},v_{(i)}\big{)}^{\prime}\right]. (54)

We truncate the time horizon by K>0K>0, and define

S~(Fps)=1ri=1rk=0Kγk𝔼[vk(Fps,v(i))vk(Fps,v(i))].\displaystyle\tilde{S}(F_{p}^{s})=\frac{1}{r}\sum_{i=1}^{r}\sum_{k=0}^{K}\gamma^{k}\mathbb{E}\left[v_{k}\big{(}F_{p}^{s},v_{(i)}\big{)}v_{k}\big{(}F_{p}^{s},v_{(i)}\big{)}^{\prime}\right]. (55)

Note that for any K>0K>0, it holds that S~(Fps)0\tilde{S}(F_{p}^{s})\succ 0 due to the positive definiteness of 1ri=1rv(i)v(i)\frac{1}{r}\sum_{i=1}^{r}v_{(i)}v_{(i)}^{\prime}. In view of this, the stationary condition (52) holds if and only if S~(Fps)(γAFpsPpsAFps+ΛPps)S~(Fps)=0\tilde{S}(F_{p}^{s})^{\prime}\Big{(}\gamma A_{F_{p}^{s}}^{\prime}P_{p}^{s}A_{F_{p}^{s}}+\Lambda-P_{p}^{s}\Big{)}\tilde{S}(F_{p}^{s})=0, which can be rewritten as the following linear matrix equation

S~(Fps)PpsS~(Fps)γW~(Fps)PpsW~(Fps)=S~(Fps)ΛS~(Fps),\displaystyle\tilde{S}(F_{p}^{s})^{\prime}P_{p}^{s}\tilde{S}(F_{p}^{s})-\gamma\tilde{W}(F_{p}^{s})^{\prime}P_{p}^{s}\tilde{W}(F_{p}^{s})=\tilde{S}(F_{p}^{s})^{\prime}\Lambda\tilde{S}(F_{p}^{s}), (56)

where

W~(Fps)=1ri=1rk=0Kγk𝔼[vk+1(Fps,v(i))vk(Fps,v(i))].\displaystyle\tilde{W}(F_{p}^{s})=\frac{1}{r}\sum_{i=1}^{r}\sum_{k=0}^{K}\gamma^{k}\mathbb{E}\left[v_{k+1}\big{(}F_{p}^{s},v_{(i)}\big{)}v_{k}\big{(}F_{p}^{s},v_{(i)}\big{)}^{\prime}\right]. (57)

Therefore, for a fixed and stabilizing control gain FpsF_{p}^{s}, the corresponding dual variable PpsP_{p}^{s} can be obtained by solving the linear matrix equation (56).

Based on the above analysis, we give a detailed model-free implementation of the proposed MB-PD algorithm, and summarize it in Algorithm 5. Similar to Algorithm 3, the mathematical expectations are approximated by the numerical averages herein.

Algorithm 5 Model-Free Implementation of MB-PD Algorithm
1:  Initialization: set a stabilizing control gain Fp0F_{p}^{0}, s=0s=0, K>0K>0, and ϵ>0\epsilon>0;
2:  Repeat
3:   primal update: S~=0\tilde{S}=0, W~=0\tilde{W}=0;
4:   for i=1:ri=1:r
5:    S¯=0\bar{S}=0, W¯=0\bar{W}=0;
6:    for q=1:Nq=1:N
7:     apply uk=Fpsxku_{k}=F_{p}^{s}x_{k} for KK time steps;
8:     sample {xk,uk}k=0,,K\{x_{k},u_{k}\}_{k=0,\ldots,K};
9:     calculate S=k=0KγkvkvkS=\sum_{k=0}^{K}\gamma^{k}v_{k}v_{k}^{\prime} with vk=[xkuk]v_{k}=[x_{k}^{\prime}\ u_{k}^{\prime}]^{\prime};
10:     calculate W=k=0Kγkvk+1vkW=\sum_{k=0}^{K}\gamma^{k}v_{k+1}v_{k}^{\prime};
11:     S¯=S¯+S\bar{S}=\bar{S}+S, W¯=W¯+W\bar{W}=\bar{W}+W ;
12:    End
13:   S¯=S¯/N\bar{S}=\bar{S}/N, W¯=W¯/N\bar{W}=\bar{W}/N;
14:   S~=S~+S¯\tilde{S}=\tilde{S}+\bar{S}, W~=W~+W¯\tilde{W}=\tilde{W}+\bar{W};
15:   End
16:   dual update: calculate PpsP_{p}^{s} by solving
S~PpsS~γW~PpsW~=S~ΛS~;\tilde{S}^{\prime}P_{p}^{s}\tilde{S}-\gamma\tilde{W}^{\prime}P_{p}^{s}\tilde{W}=\tilde{S}^{\prime}\Lambda\tilde{S};
17:    primal update: update control gain from
Fps+1=(Pp,22s)1(Pp,12s);F_{p}^{s+1}=-(P_{p,22}^{s})^{-1}(P_{p,12}^{s})^{\prime};
18:    ss+1s\leftarrow s+1;
19:  Until FpsFps1ϵ\|F_{p}^{s}-F_{p}^{s-1}\|\leq\epsilon;
20:  Return FpsF_{p}^{s}.

Note that, different from Algorithm 3, the primal-dual optimization-based algorithm developed in this section avoids the need for an excitation signal, which may cause undesirable system oscillation. However, this comes at the cost of requiring more data. Specifically, it is assumed that we can sample the state-input trajectories generated under a set of initial vectors v_{(i)} that span \mathbb{R}^{n+m}, so that \Gamma\succ 0.
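
The linear matrix equation (56) in the dual update of Algorithm 5 can be solved in several ways; one possible sketch below vectorizes it with Kronecker products (using vec(APB)=(B^{\prime}\otimes A)\mathrm{vec}(P)), solves the resulting linear system in a least-squares sense, and symmetrizes the result. The function name is an illustrative assumption; the paper leaves the choice of solver open.

```python
import numpy as np

def dual_update_from_data(S_tilde, W_tilde, Lam, gamma):
    """Solve S' P S - gamma W' P W = S' Lambda S for the symmetric matrix P,
    where S ~ S_tilde(F) and W ~ W_tilde(F) are the data-based matrices (55), (57)."""
    d = S_tilde.shape[0]
    # vec(S' P S) = (S' kron S') vec(P),  vec(W' P W) = (W' kron W') vec(P)
    M = np.kron(S_tilde.T, S_tilde.T) - gamma * np.kron(W_tilde.T, W_tilde.T)
    rhs = (S_tilde.T @ Lam @ S_tilde).flatten(order="F")
    p, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    P = p.reshape(d, d, order="F")
    return 0.5 * (P + P.T)          # enforce symmetry numerically
```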

5 Equivalency Between PI and MB-PD Algorithms

In the previous two sections, we solved the stochastic LQR problem from the perspectives of RL and primal-dual optimization, respectively. The goal of this section is to show the equivalency of the algorithms developed above. Specifically, we show that the classical PI and MB-PD algorithms are equivalent in the sense that there is a corresponding relation in the involved iterations. The specific relationship is described in the following theorem.

Theorem 3.

The dual and primal update steps in Algorithm 4 can be interpreted as the policy evaluation and policy improvement steps in Algorithm 1, respectively, in the sense that Xs=(F¯ps)PpsF¯psX^{s}=(\bar{F}_{p}^{s})^{\prime}P_{p}^{s}\bar{F}_{p}^{s} and Fps=FsF_{p}^{s}=F^{s} with F¯ps=[In(Fps)]\bar{F}_{p}^{s}=[I_{n}\ (F_{p}^{s})^{\prime}]^{\prime}.

Proof.

By pre- and post-multiplying \gamma(A_{F_{p}^{s}})^{\prime}P_{p}^{s}A_{F_{p}^{s}}+\Lambda=P_{p}^{s} by [I_{n}\ \ (F_{p}^{s})^{\prime}] and its transpose, one can obtain that

\gamma(A+BF_{p}^{s})^{\prime}(\bar{F}_{p}^{s})^{\prime}P_{p}^{s}\bar{F}_{p}^{s}(A+BF_{p}^{s})+Q+(F_{p}^{s})^{\prime}RF_{p}^{s}=(\bar{F}_{p}^{s})^{\prime}P_{p}^{s}\bar{F}_{p}^{s}. (58)

Define $X_{p}^{s}\triangleq(\bar{F}_{p}^{s})^{\prime}P_{p}^{s}\bar{F}_{p}^{s}$. Then, (58) becomes

\gamma(A+BF_{p}^{s})^{\prime}X_{p}^{s}(A+BF_{p}^{s})+Q+(F_{p}^{s})^{\prime}RF_{p}^{s}=X_{p}^{s},

which has a unique solution since $F_{p}^{s}\in\mathcal{F}$ and $Q+(F_{p}^{s})^{\prime}RF_{p}^{s}$ is positive semidefinite. By observation, we know that the dual update step in Algorithm 4 is equivalent to the policy evaluation step in Algorithm 1 in the sense that $X^{s}=X_{p}^{s}=(\bar{F}_{p}^{s})^{\prime}P_{p}^{s}\bar{F}_{p}^{s}$.

Let $P_{p}^{s}=\left[\begin{array}{cc}P_{p,11}^{s}&P_{p,12}^{s}\\ P_{p,21}^{s}&P_{p,22}^{s}\end{array}\right]$ with $P_{p,11}^{s}\in\mathbb{R}^{n\times n}$, $P_{p,22}^{s}\in\mathbb{R}^{m\times m}$, and $P_{p,12}^{s}=(P_{p,21}^{s})^{\prime}\in\mathbb{R}^{n\times m}$. Expanding $(\bar{F}_{p}^{s})^{\prime}P_{p}^{s}\bar{F}_{p}^{s}$ yields that

P_{p,11}^{s}+(F_{p}^{s})^{\prime}P_{p,21}^{s}+P_{p,12}^{s}F_{p}^{s}+(F_{p}^{s})^{\prime}P_{p,22}^{s}F_{p}^{s}=X^{s}. (59)

Expanding the left-hand side of (22) yields that

(Q+\gamma A^{\prime}X^{s}A)+(F^{s})^{\prime}(\gamma B^{\prime}X^{s}A)+\gamma A^{\prime}X^{s}BF^{s}+(F^{s})^{\prime}(R+\gamma B^{\prime}X^{s}B)F^{s}=X^{s}. (60)

Matching the corresponding terms in (59) and (60), it follows that $P_{p,11}^{s}=Q+\gamma A^{\prime}X^{s}A$, $P_{p,12}^{s}=\gamma A^{\prime}X^{s}B$, and $P_{p,22}^{s}=R+\gamma B^{\prime}X^{s}B$. Consequently, the primal update gives $F_{p}^{s+1}=-(P_{p,22}^{s})^{-1}(P_{p,12}^{s})^{\prime}=-(R+\gamma B^{\prime}X^{s}B)^{-1}\gamma B^{\prime}X^{s}A$, which coincides with the policy improvement step in Algorithm 1, i.e., $F_{p}^{s+1}=F^{s+1}$. ∎

Example 1.

Consider the following discrete-time linear system:

x_{k+1}=2x_{k}+u_{k}+w_{k}.

The weight matrices and discount factor are selected as $Q=1$, $R=1$, and $\gamma=0.7$. The system noise is assumed to be generated according to the Gaussian distribution $\mathcal{N}(0,1)$. We choose $F^{0}=F_{p}^{0}=-1$ as the initial control gain. Suppose that the system matrices $A$ and $B$ are known.

By implementing the classical PI algorithm, i.e., Algorithm 1, for 4 steps, one can obtain that

X^{1}=6.6666,\ \ X^{2}=4.0675,\ \ X^{3}=3.9353,\ \ X^{4}=3.9345;
F^{1}=-1.6471,\ \ F^{2}=-1.4801,\ \ F^{3}=-1.4673,\ \ F^{4}=-1.4673.

Then we implement the MB-PD algorithm, i.e., Algorithm 4, for 4 steps, which yields

P_{p}^{1}=\left[\begin{array}{cc}19.6666&9.3333\\ 9.3333&5.6667\end{array}\right],\ \ P_{p}^{2}=\left[\begin{array}{cc}12.3889&5.6945\\ 5.6945&3.8472\end{array}\right],
P_{p}^{3}=\left[\begin{array}{cc}12.0188&5.5094\\ 5.5094&3.7547\end{array}\right],\ \ P_{p}^{4}=\left[\begin{array}{cc}12.0116&5.5083\\ 5.5083&3.7542\end{array}\right];
F_{p}^{1}=-1.6471,\ \ F_{p}^{2}=-1.4801,\ \ F_{p}^{3}=-1.4673,\ \ F_{p}^{4}=-1.4673.

From $X_{p}^{s}=(\bar{F}_{p}^{s})^{\prime}P_{p}^{s}\bar{F}_{p}^{s}$, it follows that

X_{p}^{1}=6.6667,\ \ X_{p}^{2}=4.0675,\ \ X_{p}^{3}=3.9353,\ \ X_{p}^{4}=3.9345.

Comparing these values with those obtained from Algorithm 1 verifies the conclusion of Theorem 3.
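As a quick sanity check, the numbers in Example 1 can be reproduced with a few lines of Python. The script below (variable names are ours) runs the model-based recursions of Algorithms 1 and 4 side by side and evaluates $\bar{F}^{\prime}P\bar{F}$ with the gain used in the corresponding dual update, which reproduces the values $X_{p}^{s}$ listed above.

import numpy as np

A, B, Q, R, gamma = 2.0, 1.0, 1.0, 1.0, 0.7
F, Fp = -1.0, -1.0                               # F^0 = F_p^0 = -1
Lam = np.diag([Q, R])

for s in range(1, 5):
    # classical PI (Algorithm 1): policy evaluation, then policy improvement
    X = (Q + R * F**2) / (1.0 - gamma * (A + B * F)**2)
    F = -(gamma * B * X * A) / (R + gamma * B * X * B)

    # MB-PD (Algorithm 4): dual update gamma*A_F' P A_F + Lam = P, then primal update
    Fbar = np.array([[1.0], [Fp]])               # [1; F_p]
    AF = Fbar @ np.array([[A, B]])
    M = np.eye(4) - gamma * np.kron(AF.T, AF.T)  # row-major vectorization of the Lyapunov equation
    P = np.linalg.solve(M, Lam.reshape(-1)).reshape(2, 2)
    Xp = (Fbar.T @ P @ Fbar).item()              # compare with X^s above
    Fp = -P[0, 1] / P[1, 1]                      # primal update

    print(s, round(X, 4), round(Xp, 4), round(F, 4), round(Fp, 4))
    # s = 1 prints approximately 6.6667 6.6667 -1.6471 -1.6471, matching the values above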

6 Simulations

In this section, we provide a numerical example to evaluate the performance of the proposed model-free algorithms.

Consider the following discrete-time linear system:

x_{k+1}=\left[\begin{array}{cc}0.5&1\\ 0.25&0.5\end{array}\right]x_{k}+\left[\begin{array}{c}1\\ 1\end{array}\right]u_{k}+w_{k}.

The weight matrices and discount factor are selected as $Q=\mathbf{I}_{2}$, $R=1$, and $\gamma=0.7$. The system noise $w_{k}$ is assumed to be generated according to the Gaussian distribution $\mathcal{N}(0_{2},\mathbf{I}_{2})$. When the exact system matrices $A$ and $B$ are known, the optimal control gain $F^{*}$ can be obtained by solving the ARE (13), which gives $F^{*}=[-0.2446\ \ -0.4892]$.
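For reference, this baseline gain can be reproduced numerically. The following Python sketch (variable names are ours) iterates the discounted Riccati recursion implied by the policy evaluation and improvement equations and then reads off the gain; it is offered as one convenient substitute for directly solving the ARE (13), not as the paper's own procedure.

import numpy as np

A = np.array([[0.5, 1.0], [0.25, 0.5]])
B = np.array([[1.0], [1.0]])
Q, R, gamma = np.eye(2), np.array([[1.0]]), 0.7

X = np.zeros((2, 2))
for _ in range(10000):
    # discounted Riccati recursion: X <- Q + g A'XA - g A'XB (R + g B'XB)^{-1} g B'XA
    K = np.linalg.solve(R + gamma * B.T @ X @ B, gamma * B.T @ X @ A)
    X_next = Q + gamma * A.T @ X @ A - gamma * A.T @ X @ B @ K
    if np.max(np.abs(X_next - X)) < 1e-12:
        X = X_next
        break
    X = X_next

F_star = -np.linalg.solve(R + gamma * B.T @ X @ B, gamma * B.T @ X @ A)
print(F_star)   # approximately [[-0.2446, -0.4892]]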

6.1 Performance of Algorithm 3

When implementing Algorithm 3, the initial state $z$ is randomly generated. Inspired by [24], the probing noise is chosen as

e_{k}=0.2\sin(1.009k)+\cos^{2}(0.538k)+\sin(0.9k)+\cos(100k).

The initial stabilizing control gain is chosen as $F^{0}=[-1\ \ 0]$. We use the numerical average of $N=15$ trajectories to approximate the mathematical expectation, and sample $K=20$ data points from each trajectory for the BLS method. The tolerance is $\varepsilon=10^{-3}$. We can observe from Fig. 1 that Algorithm 3 stops after 6 iterations and returns the estimated control gain $F^{6}=[-0.2464\ \ -0.4960]$.
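For completeness, a minimal sketch of the data-collection stage of this experiment is given below. It assumes the exploratory behavior input $u_{k}=Fx_{k}+e_{k}$, which is a standard choice for off-policy PI, and it stops at the sampled trajectories, since the BLS regression itself is specified earlier in the paper; all helper names are ours.

import numpy as np

def probing_noise(k):
    # probing noise e_k used in this example
    return (0.2 * np.sin(1.009 * k) + np.cos(0.538 * k) ** 2
            + np.sin(0.9 * k) + np.cos(100.0 * k))

def collect_trajectories(A, B, F, z, K=20, N=15, seed=0):
    """Sample N trajectories of K+1 points under u_k = F x_k + e_k."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    runs = []
    for _ in range(N):
        x, traj = np.array(z, dtype=float), []
        for k in range(K + 1):
            u = F @ x + probing_noise(k)
            traj.append((x.copy(), u.copy()))
            x = A @ x + B @ u + rng.multivariate_normal(np.zeros(n), np.eye(n))
        runs.append(traj)
    return runs   # expectations are then approximated by averaging over the N runs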

Figure 1: Values of $F^{s}$ at each iteration step. The black dotted lines are baselines for the two components of the optimal control gain $F^{*}$.

Considering that the system noise is randomly generated, we use the Monte Carlo method to validate the effectiveness of Algorithm 3, as well as to show the influence of the variance of the system noise. We implement Algorithm 3 $Y=10$ times, and each experiment is stopped when $s=10$. The average value of the resulting control gain over the $Y$ experiments is denoted by $\check{F}^{s}=\frac{1}{Y}\sum_{i=1}^{Y}F_{i}^{s}$, where $F_{i}^{s}$ is the control gain obtained at the $s$-th step of the $i$-th experiment. We use $\mathrm{E}^{s}=\frac{1}{Y}\sum_{i=1}^{Y}\|F_{i}^{s}-F^{*}\|$ to denote the average approximation error at the $s$-th iteration step over the $Y$ experiments. Fig. 2 shows the value of the control gain $F^{s}$ in each experiment as well as the resulting average gain $\check{F}^{s}$, when the variance of the system noise is set as $\sigma_{w}^{2}=\alpha\mathbf{I}_{2}$ with $\alpha$ being $0$, $0.1$, $0.5$, and $1$, respectively. Note that $\mathrm{E}^{10}=\frac{1}{Y}\sum_{i=1}^{Y}\|F_{i}^{10}-F^{*}\|$ is the average approximation error on convergence, that is, the average deviation of the control gain obtained in each experiment from the optimal control gain $F^{*}$. Fig. 3 shows how $\mathrm{E}^{10}$ varies as $\alpha$ is increased from $0$ to $1$. From Fig. 2 and Fig. 3, we conclude that, roughly speaking, the approximation error grows as the variance of the system noise increases; that is, a stronger system noise may lead to a larger control gain approximation error. Moreover, when $\sigma_{w}^{2}=\mathbf{0}_{2}$, the gain $F^{s}$ obtained from Algorithm 3 converges to the optimal feedback gain $F^{*}$.
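The Monte-Carlo summaries $\check{F}^{s}$ and $\mathrm{E}^{s}$ used in Figs. 2 and 3 amount to simple averaging; a short sketch, with an assumed array layout, is given below.

import numpy as np

def monte_carlo_summary(F_runs, F_star):
    """F_runs has shape (Y, S+1, m, n): the gain F_i^s of experiment i at step s."""
    F_check = F_runs.mean(axis=0)                                   # average gain \check{F}^s
    E = np.linalg.norm(F_runs - F_star, axis=(2, 3)).mean(axis=0)   # average error E^s
    return F_check, E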

Figure 2: Values of $F^{s}$ in each experiment and the resulting average values $\check{F}^{s}$. The black dotted lines are baselines for the two components of the optimal control gain $F^{*}$. (a) $\sigma_{w}^{2}=\mathbf{0}_{2}$. (b) $\sigma_{w}^{2}=0.1\mathbf{I}_{2}$. (c) $\sigma_{w}^{2}=0.5\mathbf{I}_{2}$. (d) $\sigma_{w}^{2}=\mathbf{I}_{2}$.
Figure 3: Average approximation errors over $Y=10$ experiments at different values of $\alpha$, where the variance of the system noise is $\sigma_{w}^{2}=\alpha\mathbf{I}_{2}$.

6.2 Performance of Algorithm 5

When implementing Algorithm 5, we truncate the time horizon at $K=10$ and let $r=3$. The three pairs of initial states and initial control inputs are $z_{(1)}=[-1\ \ 3]$, $u_{(1)}=-2$; $z_{(2)}=[2\ \ -1]$, $u_{(2)}=-5$; and $z_{(3)}=[-3\ \ 3]$, $u_{(3)}=-8$. The initial stabilizing control gain is chosen as $F_{p}^{0}=[-1\ \ 0]$. We use the numerical average of $N=15$ trajectories to approximate the mathematical expectation. The desired convergence precision is set as $\varepsilon=5\times 10^{-3}$. We can observe from Fig. 4 that Algorithm 5 stops after 7 iterations and returns the estimated control gain $F_{p}^{7}=[-0.2443\ \ -0.4884]$.

Figure 4: Values of $F_{p}^{s}$ at each iteration step. The black dotted lines are baselines for the two components of the optimal control gain $F^{*}$.

Similar to the previous subsection, we use the Monte Carlo method to discuss the influence of the system noise. We implement Algorithm 5 $Y=10$ times, and stop each experiment when $s=15$. We use $\check{F}_{p}^{s}=\frac{1}{Y}\sum_{i=1}^{Y}F_{p,i}^{s}$ to denote the average value of the resulting control gain over the $Y$ experiments, where $F_{p,i}^{s}$ is the control gain obtained at the $s$-th step of the $i$-th experiment. Correspondingly, the average approximation error at the $s$-th iteration step over the $Y$ experiments is denoted by $\mathrm{E}_{p}^{s}=\frac{1}{Y}\sum_{i=1}^{Y}\|F_{p,i}^{s}-F^{*}\|$. In Fig. 5, we show the value of $F_{p}^{s}$ in each experiment as well as the resulting average value $\check{F}_{p}^{s}$, when the variance of the system noise is set as $\sigma_{w}^{2}=\mathbf{0}_{2}$, $\sigma_{w}^{2}=0.1\mathbf{I}_{2}$, $\sigma_{w}^{2}=0.5\mathbf{I}_{2}$, and $\sigma_{w}^{2}=\mathbf{I}_{2}$, respectively. We observe that in all four cases $F_{p}^{s}$ enters a small neighborhood of the optimal feedback gain after about 3 iterations, and the size of this neighborhood grows as the variance of the system noise increases. This implies that when the variance of the system noise is large, Algorithm 5 may not converge under a small convergence precision. In spite of this, we can also see from Fig. 5 that the average control gain $\check{F}_{p}^{s}$ is close to the optimal one. Let $\mathrm{E}_{p}^{15}=\frac{1}{Y}\sum_{i=1}^{Y}\|F_{p,i}^{15}-F^{*}\|$, which is the average deviation of the control gain obtained at the $15$-th iteration step from the optimal control gain $F^{*}$. Fig. 6 shows the values of $\mathrm{E}_{p}^{15}$ when $\alpha$ is increased from $0$ to $1$. Roughly speaking, the approximation error grows as the variance of the system noise increases.

Figure 5: Values of $F_{p}^{s}$ at each iteration step. The black dotted lines are baselines for the two components of the optimal control gain $F^{*}$. (a) $\sigma_{w}^{2}=\mathbf{0}_{2}$. (b) $\sigma_{w}^{2}=0.1\mathbf{I}_{2}$. (c) $\sigma_{w}^{2}=0.5\mathbf{I}_{2}$. (d) $\sigma_{w}^{2}=\mathbf{I}_{2}$.
Figure 6: Average approximation errors over $Y=10$ experiments at different values of $\alpha$, where the variance of the system noise is $\sigma_{w}^{2}=\alpha\mathbf{I}_{2}$.

7 Conclusion

In this paper, we have revisited the stochastic LQR problem and developed an MF-OPPI algorithm and an MB-PD algorithm from the perspectives of RL and primal-dual optimization, respectively. Both algorithms can be implemented without using the system matrices, so that one can circumvent model identification, which may cause undesirable behaviors due to the propagation and accumulation of modeling errors. Furthermore, we have shown that the dual and primal update steps in the MB-PD algorithm can be interpreted as the policy evaluation and policy improvement steps in the classical PI algorithm. This advances our understanding of common RL algorithms and may contribute to developing RL algorithms based on optimization formulations.

Appendix A: Proof of Theorem 2

From the proof of Proposition 2, we know that the constraints $S\succ 0$ and $F\in\mathcal{F}$ are equivalent. Based on this, we introduce the following auxiliary problem for further analysis:

Auxiliary Problem 1.

Solve

\hat{J}_{p}\triangleq\inf_{S\in\mathbb{S}^{n+m},F\in\mathbb{R}^{m\times n}}\mathrm{Tr}(\Lambda S)
\mathrm{s.t.}\ \ F\in\mathcal{F}, (61)
\gamma A_{F}SA_{F}^{\prime}+\Gamma+\frac{\gamma}{1-\gamma}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}=S. (62)

This is a non-convex optimization problem, since the set $\mathcal{F}$ is not convex and the equality constraint (62) is a quadratic function of $F$. Note that we can easily deduce from the proof of Proposition 2 that the optimal value of Auxiliary Problem 1 equals that of Problem 2, i.e., $\hat{J}_{p}=\hat{J}^{*}$.

The Lagrangian function of Auxiliary Problem 1 is defined as

\hat{L}(P,F,S)\triangleq\mathrm{Tr}(\Lambda S)+\mathrm{Tr}\left[\Big(\gamma A_{F}SA_{F}^{\prime}+\Gamma+\frac{\gamma}{1-\gamma}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}-S\Big)P\right], (63)

where $P\in\mathbb{S}^{n+m}$ is the Lagrange multiplier. The corresponding Lagrangian dual function is defined as

\hat{d}(P)\triangleq\inf_{S\in\mathbb{S}^{n+m},F\in\mathcal{F}}\hat{L}(P,F,S).

We take the constraint $F\in\mathcal{F}$ as the domain of the variable $F$ in the Lagrangian formulation, instead of writing it directly into the Lagrangian function, since $F\in\mathcal{F}$ is not an explicit equality or inequality constraint.

Then, the dual problem of Auxiliary Problem 1 is given as follows:

\hat{J}_{d}\triangleq\sup_{P\in\mathbb{S}^{n+m}}\hat{d}(P)=\sup_{P\in\mathbb{S}^{n+m}}\inf_{S\in\mathbb{S}_{+}^{n+m},F\in\mathcal{F}}\hat{L}(P,F,S),

where $\hat{L}(P,F,S)$ is defined in (63).

Since $J_{p}=\hat{J}^{*}$ and $\hat{J}_{p}=\hat{J}^{*}$, it holds that $J_{p}=\hat{J}_{p}$. By weak duality, we have $J_{p}\geq J_{d}$. In view of this, we complete the proof of Theorem 2 by the following two parts:

  1. Auxiliary Problem 1 has strong duality, i.e., $\hat{J}_{p}=\hat{J}_{d}$ (Subsection A.1);

  2. The optimal value of the dual problem of Primal Problem 1 is not larger than that of the dual problem of Auxiliary Problem 1, i.e., $J_{d}\leq\hat{J}_{d}$ (Subsection A.2).

A.1 Proof of $\hat{J}_{p}=\hat{J}_{d}$

We first introduce another auxiliary optimization problem which can be shown to be equivalent to the modified stochastic LQR problem, i.e., Problem 2.

Auxiliary Problem 2.

Solve

\tilde{J}_{p}\triangleq\inf_{P\in\mathbb{S}^{n+m},F\in\mathbb{R}^{m\times n}}\mathrm{Tr}\left(\Big(\Gamma+\frac{\gamma}{1-\gamma}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}\Big)P\right)
\mathrm{s.t.}\ \ F\in\mathcal{F}, (64)
\gamma A_{F}^{\prime}PA_{F}+\Lambda=P. (65)
Proposition 3.

Auxiliary Problem 2 has a unique optimal solution, denoted by $(\tilde{P}_{p},\tilde{F}_{p})$, and it is equivalent to Problem 2 in the sense that $\tilde{J}_{p}=\hat{J}^{*}$ and $\tilde{F}_{p}=\hat{F}^{*}$.

Proof.

By equation (46) and the fact that $w_{k}\sim\mathcal{N}(0_{n},\sigma_{w}^{2})$ is i.i.d. noise for each $k\geq 0$, we can obtain that for any $F\in\mathcal{F}$, it holds that

\mathbb{E}\left[v_{k}(F,v_{(i)})^{\prime}\Lambda v_{k}(F,v_{(i)})\right]=\mathrm{Tr}\left[\mathbb{E}\left(v_{(i)}v_{(i)}^{\prime}\right)(A_{F}^{k})^{\prime}\Lambda A_{F}^{k}\right]+\mathrm{Tr}\left(\sum_{j=0}^{k-1}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}(A_{F}^{j})^{\prime}\Lambda A_{F}^{j}\right).

Therefore, the objective function of Problem 2 becomes

\frac{1}{r}\sum_{i=1}^{r}\hat{J}(F,v_{(i)})=\mathrm{Tr}\left(\Gamma P_{1}\right)+\frac{\gamma}{1-\gamma}\mathrm{Tr}\left(\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}P_{2}\right),

with

P_{1}=\sum_{k=0}^{\infty}\gamma^{k}(A_{F}^{k})^{\prime}\Lambda A_{F}^{k},
P_{2}=\frac{1-\gamma}{\gamma}\sum_{k=0}^{\infty}\gamma^{k}\sum_{j=0}^{k-1}(A_{F}^{j})^{\prime}\Lambda A_{F}^{j}.

We can easily verify that both $P_{1}$ and $P_{2}$ satisfy the Lyapunov equation (65). By the first statement of Lemma 3, there exists a unique $P\in\mathbb{S}_{+}^{n+m}$ such that $\gamma A_{F}^{\prime}PA_{F}+\Lambda=P$, which implies that $P_{1}=P_{2}$.
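Indeed, one way to see $P_{1}=P_{2}$ directly is to exchange the order of summation in $P_{2}$:

P_{2}=\frac{1-\gamma}{\gamma}\sum_{j=0}^{\infty}(A_{F}^{j})^{\prime}\Lambda A_{F}^{j}\sum_{k=j+1}^{\infty}\gamma^{k}=\frac{1-\gamma}{\gamma}\sum_{j=0}^{\infty}(A_{F}^{j})^{\prime}\Lambda A_{F}^{j}\,\frac{\gamma^{j+1}}{1-\gamma}=\sum_{j=0}^{\infty}\gamma^{j}(A_{F}^{j})^{\prime}\Lambda A_{F}^{j}=P_{1}.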

Since the optimal solution $(\tilde{P}_{p},\tilde{F}_{p})$ of Auxiliary Problem 2 is a feasible point of Problem 2, it holds that $\tilde{J}_{p}\geq\hat{J}^{*}$. Furthermore, if $\tilde{F}_{p}=\hat{F}^{*}\in\mathcal{F}$ and $\tilde{P}_{p}$ is the unique solution to the constraint (65), then the resulting objective value of Auxiliary Problem 2 is exactly $\hat{J}^{*}$. Therefore, we conclude that $(\tilde{P}_{p},\tilde{F}_{p})$ is the solution of Auxiliary Problem 2 and $\tilde{J}_{p}=\hat{J}^{*}$. The uniqueness of $(\tilde{P}_{p},\tilde{F}_{p})$ follows from the uniqueness of $\hat{F}^{*}$. ∎

To show the strong duality of Auxiliary Problem 1, the following lemmas will be used.

Lemma 7.

The optimal point $(\tilde{P}_{p},\tilde{F}_{p})$ of Auxiliary Problem 2 satisfies the following:

  1. $\tilde{P}_{p}=P^{*}$;

  2. $\tilde{P}_{p}\succeq 0$, $\tilde{P}_{p,22}\succ 0$, and $\tilde{F}_{p}=-(\tilde{P}_{p,22})^{-1}(\tilde{P}_{p,12})^{\prime}\in\mathcal{F}$.

Proof.

Based on the definitions in (12) and (21), direct calculation yields that $\left[\begin{array}{c}I_{n}\\ F^{*}\end{array}\right]^{\prime}P^{*}\left[\begin{array}{c}I_{n}\\ F^{*}\end{array}\right]=X^{*}$. Then, from (21), it follows that $P^{*}=\gamma\left[\begin{array}{cc}A&B\end{array}\right]^{\prime}X^{*}\left[\begin{array}{cc}A&B\end{array}\right]+\Lambda=\gamma\left[\begin{array}{cc}A&B\end{array}\right]^{\prime}\left[\begin{array}{c}I_{n}\\ F^{*}\end{array}\right]^{\prime}P^{*}\left[\begin{array}{c}I_{n}\\ F^{*}\end{array}\right]\left[\begin{array}{cc}A&B\end{array}\right]+\Lambda=\gamma A_{F^{*}}^{\prime}P^{*}A_{F^{*}}+\Lambda$. Since $F^{*}\in\mathcal{F}$, i.e., $\rho(A_{F^{*}})<1$, there exists a unique $P^{*}$ satisfying the above equation.

In addition, we have $\tilde{F}_{p}=F^{*}=\hat{F}^{*}$ by Propositions 1 and 2. Therefore, when $F=\tilde{F}_{p}$, the equality constraint (65) has a unique solution $\tilde{P}_{p}=P^{*}$.

The second statement follows directly from the definitions of $P^{*}$ and $F^{*}$ given in (21) and (12), respectively. ∎

Lemma 8 ([21, Lemma 4]).

If $P\succeq 0$ and $P_{22}\succ 0$, then for all $F\in\mathbb{R}^{m\times n}$, it holds that

\bar{F}^{\prime}P\bar{F}\succeq P_{11}-P_{12}P_{22}^{-1}P_{12}^{\prime}=\left[\begin{array}{c}I_{n}\\ -P_{22}^{-1}P_{12}^{\prime}\end{array}\right]^{\prime}P\left[\begin{array}{c}I_{n}\\ -P_{22}^{-1}P_{12}^{\prime}\end{array}\right],

and the equality holds if and only if $F=-P_{22}^{-1}P_{12}^{\prime}$.

Now, we are ready to show the strong duality of Auxiliary Problem  1.

Lemma 9.

The duality gap for Auxiliary Problem 1 is zero, that is, $\hat{J}_{p}=\hat{J}_{d}$.

Proof.

For a given matrix $P\in\mathbb{S}^{n+m}$, the Lagrangian dual function is

\hat{d}(P)=\inf_{S\in\mathbb{S}_{+}^{n+m},F\in\mathcal{F}}\hat{L}(P,F,S)=\left\{\begin{aligned}&\inf_{F\in\mathcal{F}}\mathrm{Tr}\left(\Gamma P\right)+\frac{\gamma}{1-\gamma}\mathrm{Tr}\left(\sigma_{w}^{2}\bar{F}^{\prime}P\bar{F}\right),&&\text{if}\ P\in\hat{\mathcal{P}},\\ &-\infty,&&\text{otherwise},\end{aligned}\right.

where $\hat{\mathcal{P}}=\{P\in\mathbb{S}^{n+m}:\gamma A_{F}^{\prime}PA_{F}-P+\Lambda\succeq 0,\ \forall F\in\mathcal{F}\}$.
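This piecewise form can be seen by regrouping the Lagrangian (63) as

\hat{L}(P,F,S)=\mathrm{Tr}\left[S\big(\Lambda+\gamma A_{F}^{\prime}PA_{F}-P\big)\right]+\mathrm{Tr}\left[\Big(\Gamma+\frac{\gamma}{1-\gamma}\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}\Big)P\right],

and noting that $\inf_{S\in\mathbb{S}_{+}^{n+m}}\mathrm{Tr}\left[S\big(\Lambda+\gamma A_{F}^{\prime}PA_{F}-P\big)\right]$ equals $0$ if $\Lambda+\gamma A_{F}^{\prime}PA_{F}-P\succeq 0$ and $-\infty$ otherwise, together with $\mathrm{Tr}(\bar{F}\sigma_{w}^{2}\bar{F}^{\prime}P)=\mathrm{Tr}(\sigma_{w}^{2}\bar{F}^{\prime}P\bar{F})$.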

Next, we show that the optimal solution $\tilde{P}_{p}$ of Auxiliary Problem 2 is an element of $\hat{\mathcal{P}}$, and thus the set $\hat{\mathcal{P}}$ is nonempty. By Lemma 7, $\gamma A_{\tilde{F}_{p}}^{\prime}\tilde{P}_{p}A_{\tilde{F}_{p}}+\Lambda=\tilde{P}_{p}$, where $\tilde{F}_{p}=-(\tilde{P}_{p,22})^{-1}(\tilde{P}_{p,12})^{\prime}$. Then, by Lemma 8, it holds that $\gamma A_{F}^{\prime}\tilde{P}_{p}A_{F}+\Lambda\succeq\gamma A_{\tilde{F}_{p}}^{\prime}\tilde{P}_{p}A_{\tilde{F}_{p}}+\Lambda=\tilde{P}_{p}$ for all $F\in\mathbb{R}^{m\times n}$.

Therefore, the dual problem is equivalent to

\hat{J}_{d}=\sup_{P\in\mathbb{S}^{n+m}}\hat{d}(P)=\sup_{P\in\hat{\mathcal{P}}}\inf_{F\in\mathcal{F}}\mathrm{Tr}(\Gamma P)+\frac{\gamma}{1-\gamma}\mathrm{Tr}(\sigma_{w}^{2}\bar{F}^{\prime}P\bar{F}). (66)

For $\tilde{P}_{p}\in\hat{\mathcal{P}}$, we have

\hat{d}(\tilde{P}_{p})=\inf_{F\in\mathcal{F}}\mathrm{Tr}(\Gamma\tilde{P}_{p})+\frac{\gamma}{1-\gamma}\mathrm{Tr}(\sigma_{w}^{2}\bar{F}^{\prime}\tilde{P}_{p}\bar{F}). (67)

Obviously, $\hat{d}(\tilde{P}_{p})\leq\hat{J}_{d}$. Since $\tilde{P}_{p}\succeq 0$ is fixed and the objective function in (67) is a quadratic function of $F$, the infimum in (67) is attained at $F=-(\tilde{P}_{p,22})^{-1}(\tilde{P}_{p,12})^{\prime}=\tilde{F}_{p}\in\mathcal{F}$. Therefore, $\tilde{J}_{p}=\hat{d}(\tilde{P}_{p})$, which implies that $\tilde{J}_{p}\leq\hat{J}_{d}$. By weak duality, it holds that $\hat{J}_{d}\leq\hat{J}_{p}$. Since $\hat{J}_{p}=\hat{J}^{*}=\tilde{J}_{p}$, we have $\hat{J}_{p}=\hat{J}_{d}$. ∎

A.2 Proof of $J_{d}\leq\hat{J}_{d}$

Lemma 10.

It holds that $J_{d}\leq\hat{J}_{d}$.

Proof.

For any given $P\in\mathbb{S}^{n+m}$ and $P_{0}\in\mathbb{S}_{+}^{n+m}$, the Lagrangian dual function $d(P,P_{0})$ is

d(P,P_{0})=\inf_{S\in\mathbb{S}_{+}^{n+m},F\in\mathbb{R}^{m\times n}}L(P,P_{0},F,S)=\left\{\begin{aligned}&\inf_{F\in\mathbb{R}^{m\times n}}\mathrm{Tr}\left(\Gamma P\right)+\frac{\gamma}{1-\gamma}\mathrm{Tr}\left(\sigma_{w}^{2}\bar{F}^{\prime}P\bar{F}\right),&&\text{if}\ P\in\mathcal{P},\\ &-\infty,&&\text{otherwise},\end{aligned}\right.

where $\mathcal{P}=\{P\in\mathbb{S}_{+}^{m+n}:\gamma A_{F}^{\prime}PA_{F}-P+\Lambda-P_{0}\succeq 0,\ \forall F\in\mathcal{F}\}$.

Therefore, the dual problem corresponding to Primal Problem 1 is equivalent to

J_{d}=\sup_{P\in\mathbb{S}_{+}^{m+n},P_{0}\in\mathbb{S}_{+}^{m+n}}d(P,P_{0})=\sup_{P\in\mathcal{P},P_{0}\in\mathbb{S}_{+}^{m+n}}\inf_{F\in\mathbb{R}^{m\times n}}\mathrm{Tr}\left(\Gamma P\right)+\frac{\gamma}{1-\gamma}\mathrm{Tr}\left(\sigma_{w}^{2}\bar{F}^{\prime}P\bar{F}\right). (68)

We can observe from (68) that $P_{0}$ affects the dual optimal value $J_{d}$ only through the size of the set $\mathcal{P}$. To obtain the supremum of the Lagrangian dual function, the constraint $P\in\mathcal{P}$ should be relaxed as much as possible, and from the definition of $\mathcal{P}$ the set is largest when $P_{0}=\mathbf{0}_{m+n}$. Furthermore, if $P_{0}=\mathbf{0}_{m+n}$, then $\mathcal{P}=\{P\in\mathbb{S}_{+}^{m+n}:\gamma A_{F}^{\prime}PA_{F}-P+\Lambda\succeq 0,\ \forall F\in\mathcal{F}\}=\hat{\mathcal{P}}$. As a result, it yields that

J_{d}=\sup_{P\in\hat{\mathcal{P}}}\inf_{F\in\mathbb{R}^{m\times n}}\mathrm{Tr}\left(\Gamma P\right)+\frac{\gamma}{1-\gamma}\mathrm{Tr}\left(\sigma_{w}^{2}\bar{F}^{\prime}P\bar{F}\right).

Since $\mathcal{F}\subseteq\mathbb{R}^{m\times n}$, it holds that $\inf_{F\in\mathbb{R}^{m\times n}}\mathrm{Tr}\left(\Gamma P\right)+\frac{\gamma}{1-\gamma}\mathrm{Tr}\left(\sigma_{w}^{2}\bar{F}^{\prime}P\bar{F}\right)\leq\inf_{F\in\mathcal{F}}\mathrm{Tr}\left(\Gamma P\right)+\frac{\gamma}{1-\gamma}\mathrm{Tr}\left(\sigma_{w}^{2}\bar{F}^{\prime}P\bar{F}\right)$, which implies that $J_{d}\leq\hat{J}_{d}$. ∎

References

  • [1] R. S. Sutton, A. G. Barto. (1998). Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, USA.
  • [2] D. P. Bertsekas. (2019). Reinforcement Learning and Optimal Control, Athena Scientific, Nashua, USA.
  • [3] K. Arulkumaran, M. P. Deisenroth, M. Brundage, A. A. Bharath. (2017). Deep reinforcement learning: A brief survey, IEEE Signal Processing Magazine, 34 (6), 26–38.
  • [4] B. Recht. (2019). A tour of reinforcement learning: The view from continuous control, Annual Review of Control, Robotics, and Autonomous Systems, 2, 253–279.
  • [5] M. Li, J. Qin, N. M. Freris, D. W. C. Ho, Multiplayer Stackelberg-Nash game for nonlinear system via value iteration-based integral reinforcement learning, IEEE Transactions on Neural Networks and Learning Systems, DOI: 10.1109/TNNLS.2020.3042331.
  • [6] F. Li, J. Qin, W. X. Zheng. (2019). Distributed Q-learning-based online optimization algorithm for unit commitment and dispatch in smart grid, IEEE Transactions on Cybernetics, 50 (9), 4146–4156.
  • [7] B. Wang, Z. Liu, Q. Li, A. Prorok. (2020). Mobile robot path planning in dynamic environments through globally guided reinforcement learning, IEEE Robotics and Automation Letters, 5 (4), 6932–6939.
  • [8] A. Haydari, Y. Yilmaz, Deep reinforcement learning for intelligent transportation systems: A survey, IEEE Transactions on Intelligent Transportation Systems, DOI: 10.1109/TITS.2020.3008612.
  • [9] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, L. Shi, Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems, Automatica, DOI: https://doi.org/10.1016/j.automatica.2019.108759.
  • [10] A. Naik, R. Shariff, N. Yasui, R. S. Sutton, Discounted reinforcement learning is not an optimization problem, arXiv preprint arXiv:1910.02140.
  • [11] R. Zhang, T. Yu, Y. Shen, H. Jin, C. Chen. (2019). Text-based interactive recommendation via constraint-augmented reinforcement learning, in: Advances in Neural Information Processing Systems, Vancouver, Canada, pp. 15214–15224.
  • [12] Z. Qin, W. Li, F. Janoos. (2014). Sparse reinforcement learning via convex optimization, in: International Conference on Machine Learning, Beijing, China, pp. 424–432.
  • [13] N. Vieillard, O. Pietquin, M. Geist, On connections between constrained optimization and reinforcement learning, arXiv preprint arXiv:1910.08476.
  • [14] H. Schlüter, F. Solowjow, S. Trimpe, Event-triggered learning for linear quadratic control, IEEE Transactions on Automatic Control, DOI: 10.1109/TAC.2020.3030877.
  • [15] M. Fazel, R. Ge, S. M. Kakade, M. Mesbahi. (2018). Global convergence of policy gradient methods for the linear quadratic regulator, in: International Conference on Machine Learning, Stockholm, Sweden, pp. 1467–1476.
  • [16] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. L. Bartlett, M. J. Wainwright. (2019). Derivative-free methods for policy optimization: Guarantees for linear quadratic systems, in: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2916–2925.
  • [17] Y. Li, Y. Tang, R. Zhang, N. Li, Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach, arXiv preprint arXiv:1912.09135.
  • [18] K. Karl, S. Tu. (2019). Finite-time analysis of approximate policy iteration for the linear quadratic regulator, in: International Conference on Machine Learning, Vancouver, Canada, pp. 8514–8524.
  • [19] Z. Yang, Y. Chen, M. Hong, Z. Wang. (2019). Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost, in: Advances in Neural Information Processing Systems, Vancouver, Canada, pp. 8353–8365.
  • [20] R. Mitze, M. Mönnigmann, A dynamic programming approach to solving constrained linear–quadratic optimal control problems, Automatica, DOI: https://doi.org/10.1016/j.automatica.2020.109132.
  • [21] D. Lee, J. Hu. (2018). Primal-dual Q-learning framework for LQR design, IEEE Transactions on Automatic Control, 64 (9), 3756–3763.
  • [22] D. D. Yao, S. Zhang, X. Y. Zhou. (2001). Stochastic linear-quadratic control via semidefinite programming, SIAM Journal on Control and Optimization, 40 (3), 801–823.
  • [23] Y. Jiang, Z.-P. Jiang. (2012). Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics, Automatica, 48 (10), 2699–2704.
  • [24] B. Kiumarsi, F. L. Lewis, Z.-P. Jiang. (2017). $H_{\infty}$ control of linear discrete-time systems: Off-policy reinforcement learning, Automatica, 78, 144–152.
  • [25] J. Qin, M. Li, Y. Shi, Q. Ma, W. X. Zheng. (2019). Optimal synchronization control of multiagent systems with input saturation via off-policy reinforcement learning, IEEE Transactions on Neural Networks and Learning Systems, 30 (1), 85–96.
  • [26] J. Lai, J. Xiong, Z. Shu, Model-free optimal control of discrete-time systems with additive and multiplicative noises, arXiv preprint arXiv:2008.08734.
  • [27] D. Vrabie, K. G. Vamvoudakis, F. L. Lewis. (2013). Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, Stevenage, U.K.
  • [28] G. Hewer. (1971). An iterative technique for the computation of the steady state gains for the discrete optimal regulator, IEEE Transactions on Automatic Control, 16 (4), 382–384.
  • [29] D. Graupe, V. K. Jain, J. Salahi. (1980). A comparative analysis of various least-squares identification algorithms, Automatica, 16 (6), 663–681.
  • [30] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, F. L. Lewis. (2009). Adaptive optimal control for continuous-time linear systems based on policy iteration, Automatica, 45 (2), 477–484.
  • [31] G. Gu. (2012). Discrete-Time Linear Systems: Theory and Design with Applications, Springer-Verlag, London, U.K.
  • [32] S. Boyd, L. Vandenberghe. (2004). Convex Optimization, Cambridge University Press, Cambridge, U.K.