Model-Free Design of Stochastic LQR Controller from Reinforcement Learning and Primal-Dual Optimization Perspective
Abstract
To better understand the underlying mechanisms of various reinforcement learning (RL) algorithms and to better exploit optimization theory for further progress in RL, many researchers have begun to revisit the linear-quadratic regulator (LQR) problem, whose setting is simple and yet captures the key characteristics of RL. Motivated by this, this work is concerned with the model-free design of a stochastic LQR controller for linear systems subject to Gaussian noise, from the perspective of both RL and primal-dual optimization. From the RL perspective, we first develop a new model-free off-policy policy iteration (MF-OPPI) algorithm, in which the sampled data is repeatedly used for updating the policy so as to alleviate the data-hungry problem to some extent. We then provide a rigorous convergence analysis by showing that the involved iterations are equivalent to the iterations in the classical policy iteration (PI) algorithm. From the optimization perspective, we first reformulate the stochastic LQR problem as a constrained non-convex optimization problem, which is shown to have strong duality. Then, to solve this non-convex optimization problem, we propose a model-based primal-dual (MB-PD) algorithm based on the properties of the resulting Karush-Kuhn-Tucker (KKT) conditions. We also give a model-free implementation of the MB-PD algorithm by solving a transformed dual feasibility condition. More importantly, we show that the dual and primal update steps in the MB-PD algorithm can be interpreted as the policy evaluation and policy improvement steps in the PI algorithm, respectively. Finally, we provide a simulation example to show the performance of the proposed algorithms.
keywords:
Stochastic linear-quadratic regulator; model-free design; reinforcement learning; optimization; duality.
1 Introduction
Reinforcement learning (RL), a branch of machine learning, aims to learn policies that optimize a long-term decision-making process in a dynamic environment [1, 2]. In general, the RL setting has two main characteristics, which are also regarded as its main difficulties [1]. First, the value of a state is the total amount of reward/cost an agent expects to accumulate in the future, starting from that state; that is, early decisions affect outcomes both near and far in the future. Second, the learner’s knowledge about the environment is incomplete, e.g., the underlying mechanism governing the dynamics of the state-transition trajectories is unknown. To overcome these difficulties, many effective RL algorithms have been developed [3, 4, 5] and successfully applied in numerous fields, including smart grids [6], multi-robot systems [7], intelligent transportation [8], and networked control [9]. These works show that although RL cannot be regarded as a standard optimization problem due to the lack of an explicit objective function [10], it still has strong ties to optimization, since its goal is to optimize long-term cumulative performance [11]. In view of this, the rich body of existing optimization algorithms and theory may enable further progress in RL, so the integration of elementary RL methods with optimization frameworks is worth studying. Some efforts in this direction have been made; see, for example, [12, 13].
It is worth mentioning that, following this line, numerous researchers have begun to revisit the linear-quadratic regulator (LQR) problem [14, 15, 16, 17, 18, 19], whose task is to determine an optimal feedback controller that minimizes a quadratic cost function for linear systems [20]. One important reason for revisiting the LQR problem is that it is simple and yet captures the two characteristics of RL problems mentioned in the previous paragraph. These properties make the LQR problem convenient for understanding the behavior and the underlying mechanism of various RL algorithms. Another reason is that the traditional methods for addressing the LQR problem require directly solving an algebraic Riccati equation (ARE), which depends on exact knowledge of the system matrices; however, the exact system matrices are usually unknown in real applications. Although model identification provides a way to obtain an estimated system model, the propagation and accumulation of modeling errors may cause undesirable system behaviors. Therefore, it is meaningful to study the model-free design of the LQR controller by integrating RL algorithms with optimization theories. Some relevant works can be found in [15, 16, 19, 18, 17].
Specifically, in [15] and [16], the authors reformulate the LQR problem as a non-convex optimization problem and merge ideas from optimal control theory, mathematical optimization, and sample-based RL to solve it. To seek solutions to the resulting non-convex optimization problem, model-free stochastic policy gradient algorithms based on zeroth-order optimization techniques are developed, with provable global convergence [15] and an explicit convergence rate [16]. The work [17] considers distributed RL for the decentralized LQR problem and proposes a zeroth-order distributed policy optimization algorithm that learns the local optimal controllers in a distributed manner, by leveraging the ideas of policy gradient, zeroth-order optimization, and consensus algorithms. Note that in the above-mentioned works, the key to realizing model-free control is to use the zeroth-order optimization approach to estimate the policy gradient. This approach often suffers from large variance, as it involves the randomness of an entire trajectory. To address this difficulty, different algorithms have been proposed [18, 19]. The work [18] develops a model-free approximate policy iteration algorithm based on the Q-function and also discusses how to guarantee a good regret. In [19], an on-policy actor-critic algorithm, which is an online alternating update scheme for bilevel optimization, is developed with a non-asymptotic convergence analysis; it is then extended to the off-policy case to alleviate the data-hungry problem.
Despite the extensive research on the LQR problem that integrates RL frameworks with optimization techniques, these works, such as [15, 16, 19, 18, 17], mainly focus on the theoretical understanding of the performance of common RL algorithms. They do not explain the proposed RL algorithms from the perspective of optimization or establish any relationship between RL and optimization-based methods. Different from the above-mentioned works, the authors in [21] reformulate the LQR problem as a new optimization problem and further explain its duality results from a Q-learning viewpoint. The duality results proposed in [21] are based on the results given in [22], where the duality of the stochastic LQR problem for linear systems with stochastic noise is investigated in terms of semidefinite programming (SDP). These works motivate the following questions. Can we formulate the stochastic LQR problem as an optimization problem whose optimal points can be interpreted from the perspective of various RL algorithms? Can we explain the RL algorithms used to solve the stochastic LQR problem from the perspective of optimization?
To answer these questions, we investigate the model-free design of the stochastic LQR controller from the perspectives of both RL and optimization. We first solve the stochastic LQR problem in a model-free way from the RL perspective. Then, we reconsider this problem through a reformulation and several transformations, and solve it from the perspective of primal-dual optimization. In addition, we provide a theorem showing that the algorithms developed from these two perspectives are equivalent in the sense that there is a one-to-one correspondence between the involved iterations. The contributions of this work are summarized as follows:
1. To solve the stochastic LQR problem, we develop a model-free off-policy policy iteration (MF-OPPI) algorithm and a model-based primal-dual (MB-PD) algorithm with a model-free implementation, from the perspective of RL and primal-dual optimization, respectively. Both algorithms can be implemented in a model-free way, so that one can circumvent model identification, which may cause undesirable behaviors due to the propagation and accumulation of modeling errors. More importantly, we show that the dual and primal update steps in the MB-PD algorithm can be interpreted as the policy evaluation and policy improvement steps in the classical policy iteration (PI) algorithm, respectively. This result provides a novel primal-dual optimization perspective for understanding the classical PI algorithm.
2. From the RL perspective, we develop a new MF-OPPI algorithm to address the stochastic LQR problem, in which the data generated by exerting the behavior policy on the real system is repeatedly used to evaluate the target policy, so as to alleviate the data-hungry problem to some extent. Compared with the off-policy actor-critic algorithm proposed in [19], the newly developed MF-OPPI algorithm is simple and easy to implement, since it does not require neural networks for function approximation.
3. From the optimization perspective, we first reformulate the stochastic LQR problem as a constrained non-convex optimization problem, based on the works [21] and [22]. However, due to the consideration of the influence of stochastic noise and the relation to RL algorithms, the proposed optimization formulation is different from the ones given in [21] and [22]. These differences bring some challenges in the equivalent formulation of the primal-dual problems as well as in the analysis of strong duality. To solve the resulting constrained non-convex optimization problem, we develop a new MB-PD algorithm with a model-free implementation by using the properties of the resulting Karush-Kuhn-Tucker (KKT) conditions. This algorithm circumvents the probing noise that must be exerted on the system for persistent excitation in the newly proposed MF-OPPI algorithm and in other RL-based algorithms [23, 24, 25], thereby avoiding undesirable oscillation.
The rest of this paper is organized as follows. We first formulate the problem of interest and provide some preliminaries in Section 2. Subsequently, we investigate this problem from the perspective of RL and primal-dual optimization in Section 3 and Section 4, respectively. Then, in Section 5, we show that the algorithms developed from the above two perspectives are equivalent in the sense that there is a one-to-one correspondence between the involved iterations. Next, we provide a simulation example in Section 6. Some concluding remarks are finally given in Section 7.
Notations: We use $I_n$ and $0_{n\times n}$ to denote the identity matrix and the null matrix in $\mathbb{R}^{n\times n}$, respectively. Denote by $0_n$ a zero vector of dimension $n$. The set of symmetric matrices of dimension $n$ is denoted by $\mathbb{S}^n$, and $\mathbb{S}^n_{+}$ ($\mathbb{S}^n_{++}$) is the set of symmetric positive semidefinite (positive definite) matrices. For a matrix $M$, we use $[M]_{ij}$ to denote the element in the $i$-th row and the $j$-th column, $\mathrm{tr}(M)$ is the trace of $M$, and $\rho(M)$ denotes its spectral radius. A symmetric positive (semi-)definite matrix $M$ is denoted by $M \succ 0$ ($M \succeq 0$).
For a symmetric matrix , denotes a row vector consisting of diagonal entries and non-diagonal terms , , . Specifically, . For a symmetric matrix , denotes a column vector composed of diagonal entries of and distinct sums , i.e., . For a non-square matrix , consists of the stack of each column’s elements. That is, .
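The vectorization operators described above underlie the batch least-squares implementation in Section 3.3. The following is a minimal Python sketch assuming the standard interpretation, in which the two operators for a symmetric matrix are paired so that a quadratic form becomes an inner product; the function names and the ordering of entries are illustrative choices, not the paper's notation.

```python
import numpy as np

def quad_features(x):
    """Quadratic monomials of a vector x: the squares x_i^2 followed by the
    cross terms x_i x_j (i < j).  Assumed interpretation of the row-vector
    operator described in the Notations."""
    n = len(x)
    feats = [x[i] * x[i] for i in range(n)]
    feats += [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.array(feats)

def svec(P):
    """Diagonal entries of a symmetric P followed by the distinct sums
    P_ij + P_ji (i < j), ordered consistently with quad_features, so that
    x' P x = quad_features(x) @ svec(P)."""
    n = P.shape[0]
    v = [P[i, i] for i in range(n)]
    v += [P[i, j] + P[j, i] for i in range(n) for j in range(i + 1, n)]
    return np.array(v)
```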
2 Problem Formulation and Preliminaries
Consider the following discrete-time linear time-invariant system
(1) $x_{t+1} = A x_t + B u_t + w_t, \quad t = 0, 1, 2, \ldots$
where $x_t \in \mathbb{R}^n$ is the state vector of the system plant, $u_t \in \mathbb{R}^m$ is the control action, $w_t \in \mathbb{R}^n$ is the system noise generated according to a zero-mean Gaussian distribution, and $A$ and $B$ are constant matrices of appropriate dimensions.
The initial state is denoted by , which is assumed to be a random variable. In this paper, we consider the state-feedback control law, that is, , where is the feedback gain. denotes the solution to the dynamical system (1) under the control input while starting from . It is assumed that the noise is independent of the initial state as well as the noise for all .
Under the state-feedback control law , the cost function is denoted by
(6)
where $\gamma \in (0,1)$ is the discount factor, and $Q$ and $R$ are the state and control weight matrices, respectively.
Define , which is the set of stabilizing state-feedback gains for the matrix pair . Now, we are ready to formally give the problem of interest, which is called the stochastic LQR problem.
Problem 1 (Stochastic LQR Problem).
The optimal feedback gain is defined as . The corresponding optimal cost for a given initial state is denoted by .
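For concreteness, the cost in Problem 1 can be estimated by simulating (1) under a fixed feedback gain and averaging over noise realizations. The sketch below assumes the feedback convention $u_t = -K x_t$, a noise covariance `Sigma_w`, and the standard discounted quadratic cost form; these are illustrative assumptions rather than the paper's exact displays.

```python
import numpy as np

def estimate_discounted_cost(A, B, K, Q, R, Sigma_w, x0, gamma,
                             horizon=200, num_rollouts=500, rng=None):
    """Monte-Carlo estimate of E[sum_t gamma^t (x_t' Q x_t + u_t' R u_t)]
    for x_{t+1} = A x_t + B u_t + w_t under the assumed law u_t = -K x_t."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    costs = []
    for _ in range(num_rollouts):
        x, total, disc = np.array(x0, dtype=float), 0.0, 1.0
        for _ in range(horizon):
            u = -K @ x
            total += disc * (x @ Q @ x + u @ R @ u)
            x = A @ x + B @ u + rng.multivariate_normal(np.zeros(n), Sigma_w)
            disc *= gamma
        costs.append(total)
    return float(np.mean(costs))
```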
The following assumptions will be used in the remainder of the paper.
Assumption 1.
Assume that
1. The exact system matrices $A$ and $B$ are not known;
2. An initial stabilizing feedback gain is known;
3. and ;
4. The matrix pair is stabilizable;
5. The matrix pair is detectable, where is a matrix such that .
For a stabilizing feedback gain , define the value function as
(7)
The optimal value function is given by
(8) |
Lemma 1 ([26, Lemma 3]).
For any stabilizing state-feedback gain , the stochastic LQR problem, i.e., Problem 1, is well-posed (the stochastic LQR problem is said to be well-posed if the optimal value function is finite), and the corresponding value function is
(9) |
where is the unique solution to the Lyapunov equation
(10) |
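Assuming the Lyapunov equation (10) takes the standard discounted form $P = Q + K^\top R K + \gamma (A-BK)^\top P (A-BK)$ for the convention $u_t = -K x_t$ (an assumption made here for illustration), the value matrix of a stabilizing gain can be computed with an off-the-shelf solver, as in the following sketch.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def evaluate_gain(A, B, K, Q, R, gamma):
    """Model-based policy evaluation: solve a discounted Lyapunov equation
    of the assumed form P = Q + K'RK + gamma (A - BK)' P (A - BK)."""
    Acl = np.sqrt(gamma) * (A - B @ K)
    # solve_discrete_lyapunov(a, q) returns X satisfying X = a X a' + q
    return solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
```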
From the definition of the value function (7), one has
(11)
which is called the Bellman equation. Define the Hamiltonian as
The first-order necessary condition for optimality gives
It follows that the unique optimal control gain is given by
(12)
with being the unique solution to the following ARE
(13) |
By substituting the optimal feedback gain (12) into the value function (11), one can obtain that
(18)
where
(21) |
In the following analysis, let , , and .
Based on the above analysis, we know that calculating the optimal feedback gain requires both the system matrix $A$ and the control input matrix $B$. In the following two sections, we provide two approaches, from an RL-based view and an optimization-based view, respectively, to compute the optimal feedback gain without using the matrices $A$ and $B$.
3 RL-Based Method
The goal of this section is to solve Problem 1 by designing RL algorithms.
3.1 Classical PI Algorithm
(22)
(23)
Note that by replacing with , the ARE (13) is identical to that obtained from a standard LQR problem [27, Chapter 2.1.4], in which the dynamical system is not influenced by stochastic noise. Various algorithms exist for solving the ARE, one of which is policy iteration. A classical PI algorithm proposed in [27, Chapter 2.2] to solve (13) is given below, cf. Algorithm 1.
The iterations (22) and (23) in Algorithm 1 rely upon repeatedly solving the Lyapunov equation (10). Hence, Algorithm 1 is an offline algorithm that requires complete knowledge of the system matrices and . Note that Algorithm 1 is also called Hewer’s algorithm, and it has been shown to converge to the solution of the ARE (13) in [28].
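Under the same assumptions as the evaluation sketch in Section 2 (feedback convention $u_t = -K x_t$ and the standard discounted forms, which are not reproduced in the displays above and are therefore assumptions), a minimal model-based realization of Algorithm 1 alternates a Lyapunov solve with a gain update:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def classical_pi(A, B, Q, R, gamma, K0, max_iter=50, tol=1e-8):
    """Hewer-type policy iteration sketch.
    Policy evaluation : P solves P = Q + K'RK + gamma (A - BK)' P (A - BK)
    Policy improvement: K <- (R + gamma B'PB)^{-1} gamma B'PA."""
    K = K0
    for _ in range(max_iter):
        Acl = np.sqrt(gamma) * (A - B @ K)
        P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)        # (22)-type step
        K_new = np.linalg.solve(R + gamma * B.T @ P @ B,
                                gamma * B.T @ P @ A)               # (23)-type step
        if np.linalg.norm(K_new - K) < tol:
            return K_new, P
        K = K_new
    return K, P
```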
3.2 MF-OPPI Algorithm
In this subsection, we intend to propose an off-policy algorithm that can calculate the optimal feedback gain (12) without using the system matrices.
To this end, we first rewrite the original system (1) as
(24)
where is the target policy being learned and updated by iterations, and is the behavior policy that is exerted on the system to generate data for learning.
Substituting (24) into the Bellman equation (11) yields a new relation, which is termed the off-policy Bellman equation.
Then, we use the off-policy Bellman equation for policy evaluation, and propose an MF-OPPI algorithm, i.e., Algorithm 2.
(26)
(27)
Theorem 1.
Proof.
Substituting (24) into the left-hand side of the policy evaluation equation in Algorithm 2, it follows that
Therefore, it holds that
Since , one can obtain that for ,
(28)
From the dynamics (1), it holds that for ,
Due to the positive definiteness of and , the above equation implies that is positive definite. Thus, one can further obtain that
(29)
which is exactly the update in the policy evaluation step of Algorithm 1.
3.3 Implementation of Algorithm 2
To implement Algorithm 2, a natural question is how to obtain the matrices , , and from the policy evaluation step in Algorithm 2. In this subsection, we provide an effective way to implement Algorithm 2, where a numerical average is adopted to approximate the mathematical expectation and a batch least squares (BLS) method [29] is employed to estimate the unknown matrices.
Specifically, by vectorization, the policy evaluation step in Algorithm 2 becomes
(30)
Equation (30) has unknown parameters. Thus, at least sample points are required to solve , , and from (30). Sample the data generated under the behavior policy for time steps, and define
(35)
(39) |
with
which can be solved by
(40)
Note that to guarantee the solvability of the BLS estimator (40), a probing noise is required to be added to the behavior policy so as to guarantee the invertibility of . It is worth pointing out that adding a probing noise does not influence the iteration results of Algorithm 2, which can be regarded as an advantage of the off-policy algorithm over the on-policy one [24]. The detailed implementation of Algorithm 2 is given in Algorithm 3.
Algorithm 3 includes two main phases. In the first phase, we exert the behavior policy on the real system for time steps, and sample the resulting system states and control actions. Then, a numerical average over trajectories is used to approximate the mathematical expectation. In the second phase, we implement the policy evaluation and policy improvement steps by reusing the data sampled in the first phase. The reuse of data is an important advantage of the proposed MF-OPPI algorithm over on-policy RL algorithms, since the latter are required to collect new state-input data for every iterated control policy [5, 30]. Therefore, the proposed MF-OPPI algorithm is able to alleviate the data-hungry problem to some extent.
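A schematic sketch of the two phases of Algorithm 3 is given below. The plant simulation, the Gaussian probing noise, and the feedback convention $u_t = -K_b x_t + e_t$ are illustrative assumptions; the construction of the regressor matrix and target vector from the sampled data follows the vectorized relations (30) and (35)-(40) and is left abstract here.

```python
import numpy as np

def collect_offpolicy_data(A, B, K_b, Sigma_w, x0, T, M, noise_scale=0.1, rng=None):
    """Phase 1 (schematic): apply the behavior policy u_t = -K_b x_t + e_t,
    where e_t is a probing noise for persistent excitation, for T steps over
    M independent trajectories.  A and B are used only to simulate the plant;
    the learner itself never accesses them."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = B.shape
    X = np.zeros((M, T + 1, n))
    U = np.zeros((M, T, m))
    for j in range(M):
        x = np.array(x0, dtype=float)
        X[j, 0] = x
        for t in range(T):
            u = -K_b @ x + noise_scale * rng.standard_normal(m)
            x = A @ x + B @ u + rng.multivariate_normal(np.zeros(n), Sigma_w)
            U[j, t], X[j, t + 1] = u, x
    return X, U

def bls_solve(Phi, psi):
    """Phase 2 (schematic): batch least squares over the stacked regression
    rows built from the sampled data according to (30); returns the
    estimated parameter vector."""
    theta, *_ = np.linalg.lstsq(Phi, psi, rcond=None)
    return theta
```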
4 Optimization-Based Method
In this section, we reformulate Problem 1 as a constrained optimization problem, and resolve it from the primal-dual optimization perspective.
4.1 Problem Reformulation
For technical reasons, we need to make some modifications to Problem 1. To do this, some new notation is required.
We introduce the augmented state vector , and the resulting augmented system is described by
(41)
where and .
The following is a useful property that will be used in the remaining part of this paper.
Lemma 2 (Property of Spectral Radius[21, Lemma 1]).
It holds that .
Let for be different samples from the random variable . Let for be different initial stabilizing control laws. different initial states for the augmented system (41) are denoted by for . Denote by the solution of (41) starting from the initial state and evolving under the control , .
A new cost function is defined as
(42)
Note that, different from the cost function defined in (6), where an initial state-feedback control law is considered, the initial control used in is arbitrary, as long as it is stabilizing.
The new problem is formally formulated as below.
Problem 2 (Modified Stochastic LQR Problem).
Suppose that , , and for , are chosen such that , with being a positive constant. The following problem,
is called a modified stochastic LQR problem. The corresponding optimal feedback gain is defined as . The resulting optimal value is denoted by .
Proposition 1.
The optimal solution obtained by solving Problem 2 is unique, and it is identical to .
Proof.
From the results in Section 2, we know that although has different values under different initial state , the optimal control gain , whose explicit form is given in (12), does not rely upon . It follows that for any initial states and , , satisfying .
Based on the definitions of and , an algebraic manipulation leads to
where . Since the last two terms on the right-hand side of the above equation are constants, it holds that .
From the above, it follows that . Moreover, the uniqueness of guarantees that is unique. ∎
The following two results will be frequently used in the subsequent analysis.
Lemma 3 (Lyapunov Stability Theorems [31, Chapter 3]).
Let . Then,
1. if , then for each given matrix , there exists a unique matrix satisfying ;
2. if and only if for each given matrix , there exists a unique matrix such that .
4.2 Primal-Dual Formulation
From the previous subsection, we know that Problem 2 is a non-convex optimization problem with the objective function , the static constraint , and the dynamic constraint (1). In this subsection, we propose an equivalent non-convex optimization formulation of Problem 2 and show that its duality gap is zero.
We first give the definition of the non-convex optimization formulation.
Primal Problem 1.
Solve
(43)
(44)
It is easy to see that Primal Problem 1 is a single-objective multi-variable optimization problem consisting of a linear objective function, a linear inequality constraint, and a quadratic equality constraint [32]. Hence, it is a non-convex optimization problem.
Proposition 2.
Proof.
By the properties of trace, we rewrite the objective function of Problem 2 as
with
(45)
From (41), we have
(46)
Using the equation (46) and the fact that is the i.i.d. random noise for each , we can obtain that for any and , it holds
Thus, in (45) becomes
(47)
By an algebraic manipulation, one can obtain that
which implies that satisfies the Lyapunov equation (44).
Since , it follows from Lemma 2 and the second statement of Lemma 3 that if and only if there exists a unique such that the constraint (44) holds. Therefore, we can replace the constraint in Problem 2 by without changing its optimal solution.
From the above discussions, we know that if is the optimal solution of Primal Problem 1 and the corresponding optimal cost is , then it holds that and . From Lemma 2, and is a feasible point of Problem 2. Therefore, is lower bounded by the optimal value of Problem 2, i.e., . In addition, if is the unique solution of (44) with , then the resulting objective function of Primal Problem 1 is . Thus, we can conclude that is the solution of Primal Problem 1 and . The uniqueness of can be shown by the uniqueness of . ∎
For any fixed , , define the Lagrangian function as
(48)
where and are Lagrangian multipliers. The Lagrangian dual function is defined as
Then, the dual problem corresponding to Primal Problem 1 is defined as
where is as defined in (48).
The weak duality [32, Chapter 5] guarantees that holds, where is called the duality gap. When the duality gap is zero, we say that the optimization problem has strong duality [32, Chapter 5]. The following theorem shows that Primal Problem 1 has strong duality.
Theorem 2 (Strong Duality).
The strong duality holds for Primal Problem 1, that is, .
Proof.
See Appendix A. ∎
4.3 MB-PD Algorithm
It is well known that for a non-convex optimization problem, if its objective and constraint functions are differentiable and it has strong duality, then any primal and dual optimal points must satisfy the corresponding KKT conditions, but not vice versa [32, Chapter 5.5.3]. However, if the solution to the KKT conditions is unique, then this solution must be the primal and dual optimal point. In view of this, we first derive the KKT conditions and show the uniqueness of their solution in this subsection, and then design an algorithm to seek the solution to Problem 2 according to the solution to the KKT conditions.
Lemma 4.
Suppose that is the primal optimal point and is the dual optimal point of Primal Problem 1. Then, satisfies the KKT conditions for
(49)
(50)
(51)
(52)
(53)
Proof.
From the conclusions in [32, Chapter 5.9.2], the KKT conditions of Primal Problem 1 are as follows:
1. primal feasibility condition:
2. dual feasibility condition:
3. complementary slackness condition:
4. stationary conditions:
By the second statement of Lemma 3, and guarantee that , which further implies that the only solution to the equation is . Hence, the KKT conditions in (49)-(53) can be obtained. According to [32, Chapter 5.5.3], the strong duality guarantees that the solutions to the KKT conditions must be the primal and dual optimal points. ∎
The following lemma shows the uniqueness of the solution to the KKT conditions (49)-(53), which guarantees that the solution satisfying the KKT conditions must be the primal and dual optimal point.
Lemma 5.
Proof.
By observation, a sufficient condition for (53) is . To show this lemma, we first show that satisfying , , and is exactly with . Using the same argument as in the proof of Lemma 7, we have . From Proposition 1 and Proposition 2, it holds that . In view of this, we only need to show that satisfying , , and is exactly .
By substituting into (52), it follows that
Define . The above equation becomes , the expansion of which can be written as
i.e., , , and . Substituting these expressions into the definition yields that
Note that to calculate the primal optimal point , one needs to know , which is uniquely determined by solving (52). However, (52) is dependent upon the primal optimal point , which means that the stationary conditions (52) and (53) are coupled with each other, so that and cannot be obtained independently. In what follows, we propose an algorithm, cf. Algorithm 4, to iteratively solve (52) and (53). Note that once and are attained, one can uniquely determine the primal variable by solving (49).
We have the following convergence result:
Lemma 6.
Given an initial stabilizing control gain , consider the two sequences and obtained from Algorithm 4. We have the following properties:
1. is a non-increasing sequence, that is, ;
2. obtained at each iteration step is stabilizing, that is, for each ;
3.
4.4 Model-Free Implementation of Algorithm 4
From the equation (45) that is defined in the proof of Proposition 2, can be equivalently written as
(54)
We truncate the time horizon at , and define
(55)
Note that for any , it holds that due to the positive definiteness of . In view of this, the stationary condition (52) holds if and only if , which can be rewritten as the following linear matrix equation
(56)
where
(57)
Therefore, for a fixed and stabilizing control gain , the corresponding dual variable can be obtained by solving the linear matrix equation (56).
Based on the above analysis, we give a detailed model-free implementation of the proposed MB-PD algorithm and summarize it in Algorithm 5. Similar to Algorithm 3, the mathematical expectations are approximated herein by numerical averages.
Note that, different from Algorithm 3, the primal-dual optimization-based algorithm developed in this section has the advantage of avoiding the excitation signal, which may cause undesirable system oscillation. However, this comes at the cost of requiring more data. Specifically, it is assumed that we can sample the state-input trajectories generated under a certain set of initial vectors that are linearly independent in .
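As a rough illustration of the numerical-averaging step in Algorithm 5, the following sketch approximates a truncated, discounted sum of expected outer products of sampled (augmented) states by trajectory averages. Identifying this quantity with the dual-related matrix in (54)-(55) is an assumption made here for illustration, since those displays are not reproduced above.

```python
import numpy as np

def truncated_discounted_moment(trajs, gamma, T):
    """Approximate sum_{t<T} gamma^t E[z_t z_t'] from recorded trajectories.
    `trajs` has shape (num_trajectories, time, dim); the expectation is
    replaced by the numerical average over trajectories."""
    M, length, d = trajs.shape
    T = min(T, length)
    S = np.zeros((d, d))
    for t in range(T):
        outer_avg = np.zeros((d, d))
        for j in range(M):
            z = trajs[j, t]
            outer_avg += np.outer(z, z)
        S += (gamma ** t) * outer_avg / M
    return S
```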
5 Equivalence Between the PI and MB-PD Algorithms
In the previous two sections, we solved the stochastic LQR problem from the perspectives of RL and primal-dual optimization, respectively. The goal of this section is to show the equivalence of the algorithms developed above. Specifically, we show that the classical PI and MB-PD algorithms are equivalent in the sense that there is a one-to-one correspondence between the involved iterations. The specific relationship is described in the following theorem.
Theorem 3.
Proof.
By pre- and post-multiplying by and its transpose, one can obtain that
(58)
Define . Then, (58) becomes
which has a unique solution since and is positive semidefinite. By observation, we know that the dual update step in Algorithm 4 is equivalent to the policy evaluation step in Algorithm 1 in the sense that .
Let with , , and . Expanding yields that
(59)
Expanding the left-hand side of (22) yields that
(60)
Example 1.
Consider the following discrete-time linear system:
The weight matrices and discount factor are selected as , , and . The system noise is assumed to be generated according to the Gaussian distribution . We choose as the initial control gain matrix. Suppose that the system matrices and are known.
By implementing the classical PI algorithm, i.e., Algorithm 1, for steps, one can obtain that
From , it follows that
By simple observation and calculation, we can verify the correctness of the conclusion given in Theorem 3.
6 Simulations
In this section, we provide a numerical example to evaluate the performance of the proposed model-free algorithms.
Consider the following discrete-time linear system:
The weight matrices and discount factor are selected as , , and . The system noise is assumed to be generated according to the Gaussian distribution . When one knows the exact system matrices and , the optimal control gain can be obtained by solving the ARE (13). The result is .
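As noted above, with $A$ and $B$ known the optimal gain follows from the ARE (13). A minimal baseline sketch, assuming the discount enters through the usual $\sqrt{\gamma}$ scaling and the convention $u_t = -K x_t$ (assumptions for illustration), is:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain_model_based(A, B, Q, R, gamma):
    """Model-based reference solution: absorb the discount by scaling the
    dynamics with sqrt(gamma) and solve a standard DARE."""
    As, Bs = np.sqrt(gamma) * A, np.sqrt(gamma) * B
    P = solve_discrete_are(As, Bs, Q, R)
    K = np.linalg.solve(R + Bs.T @ P @ Bs, Bs.T @ P @ As)  # u = -K x
    return K, P
```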
6.1 Performance of Algorithm 3
When implementing Algorithm 3, the initial state is randomly generated. Inspired by [24], the probing noise is chosen as
The initial stabilizing control gain is chosen as . We use the numerical average over trajectories to approximate the mathematical expectation, and sample data points along each trajectory for the BLS method. The tolerance is set as . We can observe from Fig. 1 that Algorithm 3 stops after iterations and returns the estimated control gain .
Considering that the system noise is randomly generated, we use the Monte-Carlo method to validate the effectiveness of Algorithm 3, as well as to show the influence of the variance of the system noise. We implement Algorithm 3 for times, and each experiment is stopped when . The average value of the resulting control gain over experiments is denoted by , where is the control gain obtained at the -th step of the -th experiment. We use to denote the average approximation error at the -th iteration step over experiments. Fig. 2 shows the value of the control gain in each experiment as well as the resulting average gain , when the variance of the system noise is set as with being , , , and , respectively. Note that denotes the average approximation error at convergence, that is, the average deviation of the control gain obtained in the -th experiment from the optimal control gain . Fig. 3 shows the variation of the average approximation error when is increased from to . From Fig. 2 and Fig. 3, we conclude that, roughly speaking, the resulting approximation errors increase as the variance of the system noise increases. That is, a stronger system noise may lead to a larger control gain approximation error. Moreover, when , obtained from Algorithm 3 converges to the optimal feedback gain .
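The Monte-Carlo protocol described above can be organized as in the following sketch, where `run_algorithm` is a hypothetical stand-in for an implementation of Algorithm 3 (or Algorithm 5) that returns the learned gain:

```python
import numpy as np

def monte_carlo_evaluation(run_algorithm, K_star, num_experiments=20, rng_seed=0):
    """Repeat a learning algorithm over independent experiments, then report
    the averaged gain and the average deviation from the optimal gain K_star."""
    gains = []
    for i in range(num_experiments):
        rng = np.random.default_rng(rng_seed + i)   # independent noise per run
        gains.append(run_algorithm(rng))
    gains = np.array(gains)
    K_bar = gains.mean(axis=0)
    avg_error = np.mean([np.linalg.norm(K - K_star) for K in gains])
    return K_bar, avg_error
```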
6.2 Performance of Algorithm 5
When implementing Algorithm 5, we truncate the time horizon at and let . Three pairs of initial states and initial control inputs are , , , , , . The initial stabilizing control gain is chosen as . We use the numerical average over trajectories to approximate the mathematical expectation. The desired convergence precision is set as . We can observe from Fig. 4 that Algorithm 5 stops after iterations and returns the estimated control gain .
Similar to the previous subsection, we use the Monte-Carlo method to discuss the influence of the system noise herein. We implement Algorithm 5 for times, and stop each experiment when . We use to denote the average value of the resulting control gain over experiments, where is the control gain obtained at the -th step of the -th experiment. Correspondingly, the average approximation error at the -th iteration step obtained over experiments is denoted by . In Fig. 5, we show the value of in each experiment as well as the resulting average value , when the variance of the system noise is set as , , , and , respectively. By observation, we know that in these four cases, is in a small neighborhood of the optimal feedback gain after about iterations, and the size of the neighborhood increases as the variance of the system noise increases. This implies that when the variance of the system noise is large, Algorithm 5 may not converge under a small convergence precision. In spite of this, we can also see from Fig. 5 that the average control gain is close to the optimal one. Let , which is the average deviation of the control gain obtained at the -th iteration step of the -th experiment from the optimal control gain . Fig. 6 shows the values of when is increased from to . Roughly speaking, the resulting approximation errors increase as the variance of the system noise increases.
7 Conclusion
In this paper, we have revisited the stochastic LQR problem and have developed an MF-OPPI algorithm and an MB-PD algorithm from the perspectives of RL and primal-dual optimization, respectively. Both algorithms can be implemented without using the system matrices, so that one can circumvent model identification, which may cause undesirable behaviors due to the propagation and accumulation of modeling errors. Furthermore, we have shown that the dual and primal update steps in the MB-PD algorithm can be interpreted as the policy evaluation and policy improvement steps in the classical PI algorithm. This advances our understanding of common RL algorithms and might contribute to developing various RL algorithms based on optimization formulations.
Appendix A A Proof of Theorem 2
From the proof of Proposition 2, we know that the constraints and are equivalent. Based on this, we introduce the following auxiliary problem for further analysis:
Auxiliary Problem 1.
Solve
(61)
(62)
This is a non-convex optimization problem since the set is not convex and the equality constraint (62) is a quadratic function of . Note that we can easily deduce from the proof of Proposition 2 that the optimal value of Auxiliary Problem 1 is equivalent to that of Problem 2, i.e., .
The Lagrangian function of Auxiliary Problem 1 is defined as
(63)
where is the Lagrange multiplier. The corresponding Lagrangian dual function is defined as
We take the constraint as the domain of the variable in the Lagrangian formulation, instead of directly writing it in the Lagrangian function, since the constraint is not an explicit equality or inequality constraint.
Then, the dual problem of Auxiliary Problem 1 is given as follows:
where is as defined in (63).
Since and , it holds that . By weak duality, we have . In view of this, we complete the proof of Theorem 2 by the following two parts:
- 1.
- 2.
A.1 Proof for
We first introduce another auxiliary optimization problem which can be shown to be equivalent to the modified stochastic LQR problem, i.e., Problem 2.
Auxiliary Problem 2.
Solve
(64)
(65)
Proposition 3.
Proof.
By the equation (46) and the fact that is the i.i.d. random noise for each , we can obtain that for any , it holds that
Therefore, the objective function of Problem 2 becomes
with
We can easily verify that both and satisfy the Lyapunov equation (65). By the first statement of Lemma 3, there exists a unique such that , which implies that .
Since the optimal solution of Auxiliary Problem 2 is one feasible point of Problem 2, it holds that . Furthermore, if , and is the unique solution to the constraint (65), then the resulting objective function of Auxiliary Problem 2 is exactly . Therefore, we can conclude that is the solution of Auxiliary Problem 2 and . The uniqueness of can be obtained by the uniqueness of . ∎
To show the strong duality of Auxiliary Problem 1, the following lemmas will be used.
Lemma 7.
The optimal point of Auxiliary Problem 2 satisfies the following:
1. ;
2. , , and .
Proof.
Based on the definitions in (12) and (21), direct calculation yields that . Then, from (21), it follows that . Since , i.e., , there exists a unique satisfying the above equation.
Now, we are ready to show the strong duality of Auxiliary Problem 1.
Lemma 9.
The dual gap for Auxiliary Problem 1 is zero, that is, .
Proof.
For a given matrix , the Lagrangian dual function is
where .
Next, we show that the optimal solution of Auxiliary Problem 2 is an element of , and thus the set is nonempty. By Lemma 7, , where . Then, by Lemma 8, it holds that for all .
Therefore, the dual problem is equivalent to
(66)
For , we have
(67)
A.2 Proof for
Lemma 10.
There holds .
Proof.
For any given , , the Lagrangian dual function is
where .
Therefore, the dual problem corresponding to Primal Problem 1 is equivalent to
(68)
We can observe from (68) that affects the dual optimal value by adjusting the size of the set . To obtain the supremum of the Lagrangian dual function, the constraint should be relaxed as much as possible. We can easily observe from the definition of the set that its maximum is obtained when . Furthermore, if , then it holds that . As a result, it follows that
Since , it holds that , which implies that . ∎
References
- [1] R. S. Sutton, A. G. Barto. (1998). Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, USA.
- [2] D. P. Bertsekas. (2019). Reinforcement Learning and Optimal Control, Athena Scientific, Nashua, USA.
- [3] K. Arulkumaran, M. P. Deisenroth, M. Brundage, A. A. Bharath. (2017). Deep reinforcement learning: A brief survey, IEEE Signal Processing Magazine, 34 (6), 26–38.
- [4] B. Recht. (2019). A tour of reinforcement learning: The view from continuous control, Annual Review of Control, Robotics, and Autonomous Systems, 2, 253–279.
- [5] M. Li, J. Qin, N. M. Freris, D. W. C. Ho, Multiplayer Stackelberg-Nash game for nonlinear system via value iteration-based integral reinforcement learning, IEEE Transactions on Neural Networks and Learning Systems, DOI: 10.1109/TNNLS.2020.3042331.
- [6] F. Li, J. Qin, W. X. Zheng. (2019). Distributed Q-learning-based online optimization algorithm for unit commitment and dispatch in smart grid, IEEE Transactions on Cybernetics, 50 (9), 4146–4156.
- [7] B. Wang, Z. Liu, Q. Li, A. Prorok. (2020). Mobile robot path planning in dynamic environments through globally guided reinforcement learning, IEEE Robotics and Automation Letters, 5 (4), 6932–6939.
- [8] A. Haydari, Y. Yilmaz, Deep reinforcement learning for intelligent transportation systems: A survey, IEEE Transactions on Intelligent Transportation Systems, DOI: 10.1109/TITS.2020.3008612.
- [9] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, L. Shi, Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems, Automatica, DOI: https://doi.org/10.1016/j.automatica.2019.108759.
- [10] A. Naik, R. Shariff, N. Yasui, R. S. Sutton, Discounted reinforcement learning is not an optimization problem, arXiv preprint arXiv:1910.02140.
- [11] R. Zhang, T. Yu, Y. Shen, H. Jin, C. Chen. (2019). Text-based interactive recommendation via constraint-augmented reinforcement learning, in: Advances in Neural Information Processing Systems, Vancouver, Canada, pp. 15214–15224.
- [12] Z. Qin, W. Li, F. Janoos. (2014). Sparse reinforcement learning via convex optimization, in: International Conference on Machine Learning, Beijing, China, pp. 424–432.
- [13] N. Vieillard, O. Pietquin, M. Geist, On connections between constrained optimization and reinforcement learning, arXiv preprint arXiv:1910.08476.
- [14] H. Schlüter, F. Solowjow, S. Trimpe, Event-triggered learning for linear quadratic control, IEEE Transactions on Automatic Control, DOI: 10.1109/TAC.2020.3030877.
- [15] M. Fazel, R. Ge, S. M. Kakade, M. Mesbahi. (2018). Global convergence of policy gradient methods for the linear quadratic regulator, in: International Conference on Machine Learning, Stockholm, Sweden, pp. 1467–1476.
- [16] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. L. Bartlett, M. J. Wainwright. (2019). Derivative-free methods for policy optimization: Guarantees for linear quadratic systems, in: The 22nd International Conference on Artificial Intelligence and Statistics, Naha, Japan, pp. 2916–2925.
- [17] Y. Li, Y. Tang, R. Zhang, N. Li, Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach, arXiv preprint arXiv:1912.09135.
- [18] K. Krauth, S. Tu, B. Recht. (2019). Finite-time analysis of approximate policy iteration for the linear quadratic regulator, in: Advances in Neural Information Processing Systems, Vancouver, Canada, pp. 8514–8524.
- [19] Z. Yang, Y. Chen, M. Hong, Z. Wang. (2019). Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost, in: Advances in Neural Information Processing Systems, Vancouver, Canada, pp. 8353–8365.
- [20] R. Mitze, M. Mönnigmann, A dynamic programming approach to solving constrained linear–quadratic optimal control problems, Automatica, DOI: https://doi.org/10.1016/j.automatica.2020.109132.
- [21] D. Lee, J. Hu. (2018). Primal-dual Q-learning framework for LQR design, IEEE Transactions on Automatic Control, 64 (9), 3756–3763.
- [22] D. D. Yao, S. Zhang, X. Y. Zhou. (2001). Stochastic linear-quadratic control via semidefinite programming, SIAM Journal on Control and Optimization, 40 (3), 801–823.
- [23] Y. Jiang, Z.-P. Jiang. (2012). Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics, Automatica, 48 (10), 2699–2704.
- [24] B. Kiumarsi, F. L. Lewis, Z.-P. Jiang. (2017). $H_\infty$ control of linear discrete-time systems: Off-policy reinforcement learning, Automatica, 78, 144–152.
- [25] J. Qin, M. Li, Y. Shi, Q. Ma, W. X. Zheng. (2019). Optimal synchronization control of multiagent systems with input saturation via off-policy reinforcement learning, IEEE Transactions on Neural Networks and Learning Systems, 30 (1), 85–96.
- [26] J. Lai, J. Xiong, Z. Shu, Model-free optimal control of discrete-time systems with additive and multiplicative noises, arXiv preprint arXiv:2008.08734.
- [27] D. Vrabie, K. G. Vamvoudakis, F. L. Lewis. (2013). Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, Stevenage, U.K.
- [28] G. Hewer. (1971). An iterative technique for the computation of the steady state gains for the discrete optimal regulator, IEEE Transactions on Automatic Control, 16 (4), 382–384.
- [29] D. Graupe, V. K. Jain, J. Salahi. (1980). A comparative analysis of various least-squares identification algorithms, Automatica, 16 (6), 663–681.
- [30] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, F. L. Lewis. (2009). Adaptive optimal control for continuous-time linear systems based on policy iteration, Automatica, 45 (2), 477–484.
- [31] G. Gu. (2012). Discrete-Time Linear Systems: Theory and Design with Applications, Springer-Verlag, London, U.K.
- [32] S. Boyd, L. Vandenberghe. (2004). Convex Optimization, Cambridge University Press, Cambridge, U.K.