
Two system transformation data-driven algorithms for linear quadratic mean-field games

Xun Li [email protected] Guangchen Wang [email protected] Yu Wang [email protected] Jie Xiong [email protected] Heng Zhang [email protected], [email protected] Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong, China School of Control Science and Engineering, Shandong University, Jinan 250061, China Department of Mathematics, Southern University of Science and Technology, Shenzhen 518055, China SUSTech International Center for Mathematics, Southern University of Science and Technology, Shenzhen 518055, China
Abstract

This paper studies a class of continuous-time linear quadratic (LQ) mean-field game problems. We develop two system transformation data-driven algorithms to approximate the decentralized strategies of these LQ mean-field games. The main feature of the obtained data-driven algorithms is that they remove the need for knowledge of all system matrices. First, we transform the original stochastic system into an ordinary differential equation (ODE). Subsequently, we construct some Kronecker product-based matrices from the input/state data of the ODE. By virtue of these matrices, we implement a model-based policy iteration (PI) algorithm and a model-based value iteration (VI) algorithm in a data-driven fashion. In addition, we demonstrate the convergence of these two data-driven algorithms under some mild conditions. Finally, we illustrate the practicality of our algorithms via two numerical examples.

keywords:
Linear quadratic (LQ) mean-field game; decentralized strategy; system transformation; policy iteration (PI); value iteration (VI)

1 Introduction

Game theory is the study of decision making in an interactive environment and has been widely used in industry, management and other fields [1, 2, 3, 4]. In game systems with a large number of agents, i.e., large-population systems, the influence of an individual agent's behavior on the overall system is negligible, whereas the collective behavior of the population has a significant impact on each individual agent. Specifically, the states or cost functionals of the agents are highly coupled through a state-average term, which leads to the curse of dimensionality and considerable computational complexity. Therefore, classical game theory is no longer applicable to large-population systems.

Different from classical game theory, Huang et al. [5] and Lasry and Lions [6] independently proposed a mean-field method to overcome the difficulties caused by this high coupling. Huang et al. [5] studied collective behavior arising from individual interactions, utilized a mean-field term to represent the complex interactions among agents, and proposed decentralized strategies. Independently, the method proposed by Lasry and Lions [6] entailed solving coupled forward-backward partial differential equations. Since then, mean-field game theory has developed rapidly and the game framework has been extended to various settings. For instance, Huang et al. [7] considered an LQ game system where each agent is weakly coupled with the other agents only through its cost functional. Based on a fixed-point method, they developed the Nash certainty equivalence and designed an \epsilon-Nash equilibrium for the infinite-horizon mean-field LQ game. Huang [8] further considered decentralized controls of a game system with a major player and a large number of minor players, where the major player has a significant influence on the others. Utilizing a mean-field approximation, he decomposed the game problem in the population limit into a set of localized limiting two-player games and derived all players' decentralized strategies via the Nash certainty equivalence approach. Readers may also refer to [9, 10, 11] for mean-field Stackelberg differential games, [12, 13] for risk-sensitive mean-field games, [14, 15] for mean-field differential games with partial information, and [16] for a mean-field cooperative differential game.

On the other hand, reinforcement learning (RL), as a well-known numerical approach to LQ control and game problems, has received increasing attention from researchers. The main feature of RL algorithms is that they do not require, or only partially require, information about the system matrices, and thus may have higher practical value. See [17, 18, 19, 20] for some classical RL algorithms related to deterministic LQ control problems and [21, 22, 23, 24] for some RL algorithms for stochastic LQ control problems. For mean-field game problems, uz Zaman et al. [25] designed a stochastic optimization-based RL algorithm to solve a discrete-time LQ mean-field game problem. Fu et al. [26] established an actor-critic RL algorithm for an LQ mean-field game in discrete time. Subramanian and Mahajan [27] proposed two RL algorithms to deal with a stationary mean-field game problem. Carmona et al. [28] obtained a policy gradient strategy to work out a mean-field control problem, and Angiuli et al. [29] introduced a unified Q-learning method for mean-field game and control problems. Carmona et al. [30] developed a Q-learning algorithm for mean-field Markov decision processes and presented its convergence proof. However, it is worth mentioning that the aforementioned RL literature focuses on LQ control problems or discrete-time mean-field game problems, and there are few RL algorithms for continuous-time LQ mean-field games. To the authors' best knowledge, Xu et al. [31] is the first attempt to solve continuous-time LQ mean-field game problems using data-driven RL algorithms. The \epsilon-Nash equilibrium of their problem is closely related to two algebraic Riccati equations (AREs). They first designed a model-based PI algorithm to solve these two AREs, in which all system coefficients are indispensable. Then, they implemented the model-based PI algorithm by collecting the state and input data of a given agent, thus removing the requirement of knowing all system parameters.

Inspired by the above articles, this paper focuses on the same LQ mean-field game problem as in Xu et al. [31], but proposes two novel data-driven RL methods to address it. In particular, we provide a different way to implement the model-based PI algorithm and a model-based VI algorithm, which reduces the computational complexity of our data-driven algorithms. The main contributions of this paper are summarized as follows.

  • 1.

    We propose a system transformation idea to implement the model-based algorithms. Specifically, we transform the original stochastic system into an ODE and then carry out the model-based algorithms with the input/state data of the ODE. Different from the algorithms that directly use stochastic data [21, 22, 23, 24, 31], algorithms adopting our idea have smaller computational complexity. In addition, this idea may also be applicable to other data-driven RL algorithms for stochastic problems, especially for problems where the diffusion term of the system dynamics does not involve the control and state variables.

  • 2.

    We develop a data-driven PI algorithm to solve the LQ mean-field game problem. By virtue of the proposed system transformation idea, a novel data-driven PI algorithm is proposed to circumvent the need for all system coefficients. The simulation results show that this algorithm successfully obtains an \epsilon-Nash equilibrium with errors similar to those of [31, Section 4]. Moreover, our algorithm may be more computationally efficient than theirs (see Remark 1).

  • 3.

    We develop a data-driven VI algorithm to deal with the LQ mean-field game problem. It is worth mentioning that the PI algorithms require a priori knowledge of two Hurwitz matrices, and these Hurwitz matrices are closely related to the system coefficients. When all system matrices are unavailable, it may be difficult to obtain two matrices satisfying this condition. As a consequence, we develop a data-driven VI algorithm to overcome this difficulty. The proposed data-driven VI algorithm requires neither the system parameter information nor the assumption of two initial Hurwitz matrices.

The rest of this paper is organized as follows. In Section 2, we give some standard notations and terminologies, and formulate the continuous-time LQ mean-field game problem. In Section 3, we design the data-driven PI algorithm to work out the game problem. In Section 4, we develop the data-driven VI algorithm to solve the problem. In Section 5, we validate the obtained algorithms by means of two simulation examples. In Section 6, we give some concluding remarks and outlooks.

2 Problem formulation and preliminaries

2.1 Notations

Let \mathbb{Z}^{+} and \mathbb{Z} denote the sets of positive integers and nonnegative integers, respectively. Let \mathbb{R}, \mathbb{R}^{m\times n} and \mathbb{R}^{m} be the collections of real numbers, m\times n real matrices and m-dimensional real vectors, respectively. Denote the m\times m identity matrix by \mathbb{I}_{m}. For notational simplicity, we denote any zero vector or zero matrix by 0. Let diag\{\nu\} be the diagonal matrix whose diagonal is the vector \nu. |\cdot| represents the induced matrix norm. Let \mathbb{S}^{m}, \mathbb{S}^{m}_{+} and \mathbb{S}^{m}_{++} denote the sets of symmetric matrices, positive semidefinite matrices and positive definite matrices in \mathbb{R}^{m\times m}, respectively. A positive semidefinite (respectively, positive definite) matrix M is denoted by M\geq 0 (respectively, M>0). Given M\in\mathbb{R}^{m\times m}, let Re(\lambda_{l}(M)), l=1,2,\cdots,m, denote the real part of the l-th eigenvalue of M. The superscript T denotes the transpose of a vector or matrix. Furthermore, \otimes represents the Kronecker product. Given a matrix M\in\mathbb{R}^{m\times n}, M^{\dagger} denotes its pseudoinverse and vec(M)\triangleq[M_{1}^{T},M_{2}^{T},\cdots,M_{n}^{T}]^{T}, in which M_{l}\in\mathbb{R}^{m}, l=1,2,\cdots,n, denotes the l-th column of the matrix M. Given a symmetric matrix S\in\mathbb{S}^{m}, we define vecs(S)\triangleq[S_{11},2S_{12},2S_{13},\cdots,2S_{1m},S_{22},2S_{23},\cdots,2S_{m-1\,m},S_{mm}]^{T}, where S_{rl}, r,l=1,2,\cdots,m, is the (r,l)-th element of S. For a given vector \zeta\in\mathbb{R}^{m}, let \overline{\zeta}\triangleq[\zeta_{1}^{2},\zeta_{1}\zeta_{2},\cdots,\zeta_{1}\zeta_{m},\zeta_{2}^{2},\zeta_{2}\zeta_{3},\cdots,\zeta_{m-1}\zeta_{m},\zeta_{m}^{2}]^{T}, where \zeta_{l}, l=1,2,\cdots,m, denotes the l-th element of \zeta. For any matrix M\in\mathbb{R}^{m\times n} and vector \varphi\in\mathbb{R}^{n}, we denote by [M]_{rl}, r=1,2,\cdots,m, l=1,2,\cdots,n, the (r,l)-th element of M and by [\varphi]_{r}, r=1,2,\cdots,n, the r-th element of \varphi.

2.2 LQ mean-field game problems

We study a stochastic differential game with N agents in a large-population framework. The state of agent i satisfies

\left\{\begin{aligned} dx_{i}(t)&=\left[Ax_{i}(t)+Bu_{i}(t)\right]dt+CdW_{i}(t),\\ x_{i}(0)&=x_{i0},\quad 1\leq i\leq N,\end{aligned}\right. (1)

where x_{i}(\cdot)\in\mathbb{R}^{n} and u_{i}(\cdot)\in\mathbb{R}^{m} represent the state and the control strategy of agent i, respectively. \left\{W_{i}(\cdot)\right\}_{i=1}^{N} are independent standard p-dimensional Brownian motions defined on a filtered complete probability space \left(\Omega,\mathcal{F},\left(\mathcal{F}_{t}\right)_{t\geq 0},\mathbb{P}\right), where \mathcal{F}_{t}\triangleq\sigma\left(W_{i}(s):1\leq i\leq N,\ 0\leq s\leq t\right). The system coefficients A, B and C are three unknown constant matrices of proper sizes. The initial states \left\{x_{i0}\right\}_{i=1}^{N} are independent and identically distributed with the same expectation. It is assumed that the initial states are independent of \left\{W_{i}(\cdot)\right\}_{i=1}^{N} and their second moments are finite. Moreover, let \mathcal{F}^{W_{i}}_{t}\triangleq\sigma\left(W_{i}(s):0\leq s\leq t\right) be the information available to agent i, 1\leq i\leq N.

The admissible control set of agent i is

\mathcal{U}_{ad,i}\triangleq\left\{u_{i}(\cdot):[0,+\infty)\times\Omega\rightarrow\mathbb{R}^{m}\left|u_{i}(\cdot)\ \text{is adapted to}\ \mathcal{F}^{W_{i}}_{t},\ \mathbb{E}\int_{0}^{+\infty}e^{-\rho t}|u_{i}(t)|^{2}dt<+\infty\right.\right\},\ 1\leq i\leq N,

and all agents' admissible control set is \mathcal{U}_{ad}=\mathcal{U}_{ad,1}\times\mathcal{U}_{ad,2}\times\cdots\times\mathcal{U}_{ad,N}, where \rho>0 is a discount parameter.

The cost functional of agent i is

\mathcal{J}_{i}(u_{i}(\cdot),u_{-i}(\cdot))\triangleq\mathbb{E}\int_{0}^{+\infty}e^{-\rho t}\left[\left(x_{i}(t)-x_{N}(t)\right)^{T}Q\left(x_{i}(t)-x_{N}(t)\right)+u_{i}(t)^{T}Ru_{i}(t)\right]dt,\ 1\leq i\leq N,

where Q>0, R>0 are two weighting matrices, x_{N}(\cdot)\triangleq\frac{1}{N}\sum_{i=1}^{N}x_{i}(\cdot) and u_{-i}(\cdot)\triangleq\left(u_{1}(\cdot),\cdots,u_{i-1}(\cdot),u_{i+1}(\cdot),\cdots,u_{N}(\cdot)\right).

Then the infinite-horizon LQ mean-field game problem is as follows.

Problem (LQG) Find u^{*}(\cdot)=(u_{1}^{*}(\cdot),u_{2}^{*}(\cdot),\cdots,u_{N}^{*}(\cdot))\in\mathcal{U}_{ad} satisfying

\mathcal{J}_{i}(u^{*}_{i}(\cdot),u^{*}_{-i}(\cdot))=\min_{u_{i}(\cdot)\in\mathcal{U}_{ad,i}}\mathcal{J}_{i}(u_{i}(\cdot),u^{*}_{-i}(\cdot)),\ 1\leq i\leq N,

where u^{*}_{-i}(\cdot)=\left(u^{*}_{1}(\cdot),\cdots,u^{*}_{i-1}(\cdot),u^{*}_{i+1}(\cdot),\cdots,u^{*}_{N}(\cdot)\right).

Similar to [8, 31], we are interested in designing decentralized control strategies and the \epsilon-Nash equilibrium of Problem (LQG), which are closely related to two AREs

\rho P=PA+A^{T}P-PBR^{-1}B^{T}P+Q, (2)

and

\rho Y=YA+A^{T}Y-YBR^{-1}B^{T}Y. (3)

In order to guarantee the existence of solutions to these two equations, the following assumption is required.

Assumption 1.

Re(\lambda_{l}(A))\neq\rho, l=1,2,\cdots,n, and (A,B) is stabilizable.

By virtue of AREs (2) and (3), the next lemma presents the \epsilon-Nash equilibrium of Problem (LQG), whose proof follows from [31, Proposition 2.1, Remark 3.1] and [8, Theorem 11] and thus is omitted here.

Lemma 1.

Let Assumption 1 hold. Then we have

(i) ARE (2) admits a solution P^{*}>0 such that A-BK^{*} is Hurwitz, and ARE (3) admits a solution Y^{*}\geq 0 such that A-BK^{*}_{Y}-0.5\rho\mathbb{I}_{n} is Hurwitz, where K^{*}\triangleq R^{-1}B^{T}P^{*} and K^{*}_{Y}\triangleq R^{-1}B^{T}Y^{*};

(ii) The decentralized strategies of Problem (LQG) are

u_{i}(t)=-K^{*}x_{i}(t)-(K_{Y}^{*}-K^{*})\widehat{x}(t),\quad 1\leq i\leq N,

where \widehat{x}(\cdot) is governed by the aggregate quantity

\begin{cases}d\widehat{x}(t)=\big(A-BK^{*}_{Y}\big)\widehat{x}(t)dt,\\ \widehat{x}(0)=\xi_{0}.\end{cases} (4)

Moreover, \{u_{i}(\cdot)\}_{i=1}^{N} constitute an \epsilon-Nash equilibrium of Problem (LQG).

Lemma 1 illustrates that AREs (2) and (3) play a core role in constructing the \epsilon-Nash equilibrium. Notably, in view of the nonlinear nature of (2) and (3), it is hard to solve the two equations directly. Moreover, when the system coefficients are unavailable in the real world, solving these two equations becomes even more difficult. In the next two sections, we will propose two algorithms to solve AREs (2) and (3) without knowing all system matrices.

3 A system transformation data-driven PI algorithm

In this section, we devote ourselves to designing a novel data-driven PI algorithm to tackle Problem (LQG) without knowing all system coefficients.

Our algorithm is based on a model-based PI algorithm developed in [31, Lemmas 3.1 and 3.2]. We summarize the model-based PI algorithm in Algorithm 1 and present its convergence results in the next lemma for future use.

Lemma 2.

Suppose that Assumption 1 holds. Then the sequences \{K^{k}\}_{k=1}^{+\infty} and \{K^{k}_{Y}\}_{k=1}^{+\infty} obtained by Algorithm 1 have the following properties:

(i) \big\{A-0.5\rho\mathbb{I}_{n}-BK^{k}\big\}_{k=1}^{+\infty} and \big\{A-0.5\rho\mathbb{I}_{n}-BK^{k}_{Y}\big\}_{k=1}^{+\infty} are Hurwitz matrices;

(ii) \lim_{k\rightarrow+\infty}K^{k}=K^{*}, \lim_{k\rightarrow+\infty}K^{k}_{Y}=K^{*}_{Y}.

Algorithm 1 Model-based PI algorithm
1: Initialize k=1. Choose K^{0}\in\mathbb{R}^{m\times n} and K^{0}_{Y}\in\mathbb{R}^{m\times n} such that A-BK^{0} and A-BK^{0}_{Y} are Hurwitz. Predefine a small threshold \varepsilon>0.
2: loop
3:     Calculate P^{k} and Y^{k} by
\rho P^{k}=P^{k}(A-BK^{k-1})+(A-BK^{k-1})^{T}P^{k}+(K^{k-1})^{T}RK^{k-1}+Q, (5a)
\rho Y^{k}=Y^{k}(A-BK^{k-1}_{Y})+(A-BK^{k-1}_{Y})^{T}Y^{k}+(K^{k-1}_{Y})^{T}RK^{k-1}_{Y}. (5b)
4:     K^{k}\leftarrow R^{-1}B^{T}P^{k},\ K^{k}_{Y}\leftarrow R^{-1}B^{T}Y^{k}.
5:     if |K^{k}-K^{k-1}|>\varepsilon or |K^{k}_{Y}-K^{k-1}_{Y}|>\varepsilon then
6:         k\leftarrow k+1.
7:     else
8:         return (P^{k},\ K^{k},\ Y^{k},\ K^{k}_{Y}).
9:     end if
10: end loop
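For concreteness, the policy-evaluation step (5a)-(5b) can be carried out with a standard Lyapunov solver, since each equation is linear in the unknown matrix once K^{k-1} is fixed. The following Python sketch is illustrative only (the function and variable names are ours, not part of the algorithm statement); it iterates (5a) for P^{k}, and the recursion (5b) for Y^{k} is identical with Q replaced by the zero matrix.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def model_based_pi(A, B, Q, R, rho, K0, tol=1e-3, max_iter=50):
    """Model-based PI sketch for the P-recursion (5a).

    For fixed K = K^{k-1}, (5a) is the continuous Lyapunov equation
        (A - B K - 0.5*rho*I)^T P + P (A - B K - 0.5*rho*I) + K^T R K + Q = 0.
    """
    n = A.shape[0]
    K = K0
    for _ in range(max_iter):
        Ac = A - B @ K - 0.5 * rho * np.eye(n)
        # solve_continuous_lyapunov(a, q) returns X with a X + X a^T = q,
        # so we pass a = Ac^T and q = -(K^T R K + Q).
        P = solve_continuous_lyapunov(Ac.T, -(K.T @ R @ K + Q))
        K_new = np.linalg.solve(R, B.T @ P)   # K^k = R^{-1} B^T P^k
        if np.linalg.norm(K_new - K) <= tol:  # stopping rule of Algorithm 1
            return P, K_new
        K = K_new
    return P, K
```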

It is worth pointing out that Algorithm 1 needs all information of system matrices, which may be hard to obtain in the real world. In the sequel, we will remove the requirement of all system matrices.

To this end, we first handle system (1) of any given agent i. By defining \mathcal{X}(\cdot)\triangleq\mathbb{E}[x_{i}(\cdot)], \mathcal{V}(\cdot)\triangleq\mathbb{E}[u_{i}(\cdot)] and \mathcal{X}_{0}\triangleq\mathbb{E}[x_{i0}], we transform system (1) into

\begin{cases}d\mathcal{X}(t)=\big[A\mathcal{X}(t)+B\mathcal{V}(t)\big]dt,\\ \mathcal{X}(0)=\mathcal{X}_{0}.\end{cases} (6)
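When A, B and C are unknown, the trajectories \mathcal{X}(\cdot) and \mathcal{V}(\cdot) of system (6) are not available in closed form; as in Section 5, they can be approximated by averaging M independently simulated (or measured) trajectories of agent i. The sketch below is only an illustration of this Monte Carlo construction: the Euler-Maruyama discretization, the step size and the input signal u_fn are assumptions, and A, B, C appear solely to generate synthetic data.

```python
import numpy as np

def mc_ode_data(A, B, C, u_fn, x0, T, dt, M, seed=0):
    """Monte Carlo approximation of the ODE (6) data.

    Simulates M independent Euler-Maruyama trajectories of system (1),
        x <- x + (A x + B u) dt + C dW,
    and averages them to approximate X(t) = E[x_i(t)] and V(t) = E[u_i(t)].
    A, B, C are used only to generate synthetic data; the data-driven
    algorithms never access them.
    """
    rng = np.random.default_rng(seed)
    steps = int(round(T / dt))
    n, p = C.shape
    m = B.shape[1]
    X = np.zeros((steps + 1, n))   # running average of the state
    V = np.zeros((steps + 1, m))   # running average of the input
    for _ in range(M):
        x = np.array(x0, dtype=float)
        for j in range(steps + 1):
            u = np.atleast_1d(u_fn(j * dt, x))
            X[j] += x / M
            V[j] += u / M
            dW = rng.standard_normal(p) * np.sqrt(dt)
            x = x + (A @ x + B @ u) * dt + C @ dW
    return np.arange(steps + 1) * dt, X, V
```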

Next, we give a lemma to reveal a relationship between system (6) and the matrices to be solved in Algorithm 1, which will play a central role in deriving our data-driven PI algorithm.

Lemma 3.

For any k\in\mathbb{Z}^{+}, P^{k} generated by Algorithm 1 satisfies

\begin{split}&\varphi(s_{j},s_{j+1})vecs(P^{k})-2\Big\{\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\big[\mathcal{X}(t)^{T}\otimes\mathcal{X}(t)^{T}\big]dt\Big\}\big(\mathbb{I}_{n}\otimes(K^{k-1})^{T}\big)vec(\mathcal{L}^{k})\\ &-2\Big\{\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\big[\mathcal{X}(t)^{T}\otimes\mathcal{V}(t)^{T}\big]dt\Big\}vec(\mathcal{L}^{k})\\ =\,\,&\Big\{\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\big[\mathcal{X}(t)^{T}\otimes\mathcal{X}(t)^{T}\big]dt\Big\}vec\big(-(K^{k-1})^{T}RK^{k-1}-Q\big),\end{split} (7)

and Y^{k} generated by Algorithm 1 satisfies

\begin{split}&\varphi(s_{j},s_{j+1})vecs(Y^{k})-2\Big\{\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\big[\mathcal{X}(t)^{T}\otimes\mathcal{X}(t)^{T}\big]dt\Big\}\big(\mathbb{I}_{n}\otimes(K^{k-1}_{Y})^{T}\big)vec(\mathcal{L}^{k}_{Y})\\ &-2\Big\{\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\big[\mathcal{X}(t)^{T}\otimes\mathcal{V}(t)^{T}\big]dt\Big\}vec(\mathcal{L}^{k}_{Y})\\ =\,\,&\Big\{\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\big[\mathcal{X}(t)^{T}\otimes\mathcal{X}(t)^{T}\big]dt\Big\}vec\big(-(K^{k-1}_{Y})^{T}RK^{k-1}_{Y}\big),\end{split} (8)

where \varphi(s_{j},s_{j+1})\triangleq e^{-\rho s_{j+1}}\overline{\mathcal{X}}(s_{j+1})^{T}-e^{-\rho s_{j}}\overline{\mathcal{X}}(s_{j})^{T}, \mathcal{L}^{k}\triangleq B^{T}P^{k}, \mathcal{L}^{k}_{Y}\triangleq B^{T}Y^{k}, \{s_{j}\}_{j=0}^{d} is a set of real numbers satisfying s_{0}<s_{1}<s_{2}<\cdots<s_{d} and d\in\mathbb{Z}^{+} is a predefined positive integer.

Proof.

Since the derivations of (7) and (8) are similar, we only provide the proof of (7). According to system (6), we know

d(eρt𝒳(t)TPk𝒳(t))={ρeρt𝒳(t)TPk𝒳(t)+eρt[A𝒳(t)+B𝒱(t)]TPk𝒳(t)+eρt𝒳(t)TPk[A𝒳(t)+B𝒱(t)]}dt=eρt{𝒳(t)T[ρPk+ATPk+PkA]𝒳(t)+2𝒱(t)TBTPk𝒳(t)}dt.\begin{split}&d\big{(}e^{-\rho t}\mathcal{X}(t)^{T}P^{k}\mathcal{X}(t)\big{)}\\ =\,\,&\Big{\{}-\rho e^{-\rho t}\mathcal{X}(t)^{T}P^{k}\mathcal{X}(t)+e^{-\rho t}\big{[}A\mathcal{X}(t)+B\mathcal{V}(t)\big{]}^{T}P^{k}\mathcal{X}(t)+e^{-\rho t}\mathcal{X}(t)^{T}P^{k}\big{[}A\mathcal{X}(t)+B\mathcal{V}(t)\big{]}\Big{\}}dt\\ =\,\,&e^{-\rho t}\Big{\{}\mathcal{X}(t)^{T}\big{[}-\rho P^{k}+A^{T}P^{k}+P^{k}A\big{]}\mathcal{X}(t)+2\mathcal{V}(t)^{T}B^{T}P^{k}\mathcal{X}(t)\Big{\}}dt.\end{split} (9)

With the help of (5a), it is clear that

ρPk+ATPk+PkA=PkBKk1+(Kk1)TBTPk(Kk1)TRKk1Q.\begin{split}-\rho P^{k}+A^{T}P^{k}+P^{k}A=P^{k}BK^{k-1}+(K^{k-1})^{T}B^{T}P^{k}-(K^{k-1})^{T}RK^{k-1}-Q.\end{split} (10)

Inserting (10) into (9), (9) is transformed into

d(eρt𝒳(t)TPk𝒳(t))=eρt{𝒳(t)T[2(Kk1)TBTPk(Kk1)TRKk1Q]𝒳(t)+2𝒱(t)TBTPk𝒳(t)}dt.\begin{split}&d\big{(}e^{-\rho t}\mathcal{X}(t)^{T}P^{k}\mathcal{X}(t)\big{)}\\ =\,\,&e^{-\rho t}\Big{\{}\mathcal{X}(t)^{T}\big{[}2(K^{k-1})^{T}B^{T}P^{k}-(K^{k-1})^{T}RK^{k-1}-Q\big{]}\mathcal{X}(t)+2\mathcal{V}(t)^{T}B^{T}P^{k}\mathcal{X}(t)\Big{\}}dt.\end{split} (11)

Then, integrating the above equation from sjs_{j} to sj+1s_{j+1}, (11) becomes

eρsj+1𝒳(sj+1)TPk𝒳(sj+1)eρsj𝒳(sj)TPk𝒳(sj)=sjsj+1eρt{𝒳(t)T[2(Kk1)TBTPk(Kk1)TRKk1Q]𝒳(t)+2𝒱(t)TBTPk𝒳(t)}𝑑t.\begin{split}&e^{-\rho s_{j+1}}\mathcal{X}(s_{j+1})^{T}P^{k}\mathcal{X}(s_{j+1})-e^{-\rho s_{j}}\mathcal{X}(s_{j})^{T}P^{k}\mathcal{X}(s_{j})\\ =\,\,&\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\Big{\{}\mathcal{X}(t)^{T}\big{[}2(K^{k-1})^{T}B^{T}P^{k}-(K^{k-1})^{T}RK^{k-1}-Q\big{]}\mathcal{X}(t)+2\mathcal{V}(t)^{T}B^{T}P^{k}\mathcal{X}(t)\Big{\}}dt.\end{split}

Thus, it follows from Kronecker product theory that

[eρsj+1𝒳¯(sj+1)Teρsj𝒳¯(sj)T]vecs(Pk)={sjsj+1eρt[𝒳(t)T𝒳(t)T]𝑑t}vec((Kk1)TRKk1Q)+2{sjsj+1eρt[𝒳(t)T𝒳(t)T]𝑑t}(𝕀n(Kk1)T)vec(BTPk)+2{sjsj+1eρt[𝒳(t)T𝒱(t)T]𝑑t}vec(BTPk),\begin{split}&\Big{[}e^{-\rho s_{j+1}}\overline{\mathcal{X}}(s_{j+1})^{T}-e^{-\rho s_{j}}\overline{\mathcal{X}}(s_{j})^{T}\Big{]}vecs(P^{k})\\ =\,\,&\Big{\{}\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\big{[}\mathcal{X}(t)^{T}\otimes\mathcal{X}(t)^{T}\big{]}dt\Big{\}}vec\big{(}-(K^{k-1})^{T}RK^{k-1}-Q\big{)}\\ &+2\Big{\{}\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\big{[}\mathcal{X}(t)^{T}\otimes\mathcal{X}(t)^{T}\big{]}dt\Big{\}}\big{(}\mathbb{I}_{n}\otimes(K^{k-1})^{T}\big{)}vec(B^{T}P^{k})\\ &+2\Big{\{}\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\big{[}\mathcal{X}(t)^{T}\otimes\mathcal{V}(t)^{T}\big{]}dt\Big{\}}vec(B^{T}P^{k}),\\ \end{split}

which yields equation (7). This completes the proof. ∎

To use the lemma proposed above, we define

\begin{split}\mathcal{I}&\triangleq\bigg[\varphi(s_{0},s_{1})^{T},\varphi(s_{1},s_{2})^{T},\cdots,\varphi(s_{d-1},s_{d})^{T}\bigg]^{T},\\ \mathcal{I}_{\mathcal{X}}&\triangleq\bigg[\int_{s_{0}}^{s_{1}}e^{-\rho t}\big[\mathcal{X}(t)\otimes\mathcal{X}(t)\big]dt,\int_{s_{1}}^{s_{2}}e^{-\rho t}\big[\mathcal{X}(t)\otimes\mathcal{X}(t)\big]dt,\cdots,\int_{s_{d-1}}^{s_{d}}e^{-\rho t}\big[\mathcal{X}(t)\otimes\mathcal{X}(t)\big]dt\bigg]^{T},\\ \mathcal{I}_{\mathcal{X}\mathcal{V}}&\triangleq\bigg[\int_{s_{0}}^{s_{1}}e^{-\rho t}\big[\mathcal{X}(t)\otimes\mathcal{V}(t)\big]dt,\int_{s_{1}}^{s_{2}}e^{-\rho t}\big[\mathcal{X}(t)\otimes\mathcal{V}(t)\big]dt,\cdots,\int_{s_{d-1}}^{s_{d}}e^{-\rho t}\big[\mathcal{X}(t)\otimes\mathcal{V}(t)\big]dt\bigg]^{T}.\end{split} (12)
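The data matrices in (12) (together with \widehat{\mathcal{I}}_{\mathcal{X}} in (23) below, which is needed in Section 4) reduce to simple quadrature along the sampled trajectory of (6). A possible implementation, assuming a uniform time grid and a rectangle-rule approximation of the integrals, is sketched below; the helper names are ours.

```python
import numpy as np

def data_matrices(t, X, V, rho, s_idx):
    """Build I, I_X, I_XV of (12) and I_hat_X of (23) from a sampled
    trajectory (t, X, V) of the ODE (6), using the rectangle rule.

    s_idx holds the grid indices of the sample instants s_0 < ... < s_d.
    """
    dt = t[1] - t[0]
    w = np.exp(-rho * t)                       # discount weights e^{-rho t}

    def zeta_bar(x):                           # the map zeta -> zeta_bar of Section 2.1
        return np.concatenate([x[r] * x[r:] for r in range(len(x))])

    rows_I, rows_X, rows_XV, rows_Xhat = [], [], [], []
    for a, b in zip(s_idx[:-1], s_idx[1:]):
        rows_I.append(w[b] * zeta_bar(X[b]) - w[a] * zeta_bar(X[a]))
        rows_X.append(dt * sum(w[j] * np.kron(X[j], X[j]) for j in range(a, b)))
        rows_XV.append(dt * sum(w[j] * np.kron(X[j], V[j]) for j in range(a, b)))
        rows_Xhat.append(dt * sum(w[j] * zeta_bar(X[j]) for j in range(a, b)))
    return (np.array(rows_I), np.array(rows_X),
            np.array(rows_XV), np.array(rows_Xhat))
```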

Thanks to the matrices in (12), equations (7) and (8) imply that

\Delta^{k}\begin{bmatrix}vecs(P^{k})\\ vec(\mathcal{L}^{k})\end{bmatrix}=\Theta^{k},\quad\forall k\in\mathbb{Z}^{+}, (13)
\Delta^{k}_{Y}\begin{bmatrix}vecs(Y^{k})\\ vec(\mathcal{L}^{k}_{Y})\end{bmatrix}=\Theta^{k}_{Y},\quad\forall k\in\mathbb{Z}^{+}, (14)

where \Delta^{k}, \Theta^{k}, \Delta^{k}_{Y} and \Theta^{k}_{Y} are defined by

\begin{split}\Delta^{k}&\triangleq\big[\mathcal{I},-2\mathcal{I}_{\mathcal{X}}\big(\mathbb{I}_{n}\otimes(K^{k-1})^{T}\big)-2\mathcal{I}_{\mathcal{X}\mathcal{V}}\big],\quad\Theta^{k}\triangleq\mathcal{I}_{\mathcal{X}}vec\big(-(K^{k-1})^{T}RK^{k-1}-Q\big),\\ \Delta^{k}_{Y}&\triangleq\big[\mathcal{I},-2\mathcal{I}_{\mathcal{X}}\big(\mathbb{I}_{n}\otimes(K^{k-1}_{Y})^{T}\big)-2\mathcal{I}_{\mathcal{X}\mathcal{V}}\big],\quad\Theta^{k}_{Y}\triangleq\mathcal{I}_{\mathcal{X}}vec\big(-(K^{k-1}_{Y})^{T}RK^{k-1}_{Y}\big).\end{split}

By virtue of the above analysis, we present our data-driven PI algorithm in Algorithm 2 and provide its convergence results in Theorem 1.

Algorithm 2 System transformation data-driven PI algorithm
1: Initialize k=1, select a positive integer d and a set of real numbers \{s_{j}\}_{j=0}^{d}. Choose K^{0}\in\mathbb{R}^{m\times n} and K^{0}_{Y}\in\mathbb{R}^{m\times n} such that A-BK^{0} and A-BK^{0}_{Y} are Hurwitz. Predefine a small threshold \varepsilon>0. Compute \mathcal{I}, \mathcal{I}_{\mathcal{X}} and \mathcal{I}_{\mathcal{X}\mathcal{V}}.
2: loop
3:     Calculate (P^{k}, \mathcal{L}^{k}) and (Y^{k}, \mathcal{L}^{k}_{Y}), respectively, by
\begin{bmatrix}vecs(P^{k})\\ vec(\mathcal{L}^{k})\end{bmatrix}=(\Delta^{k})^{\dagger}\Theta^{k}, (15)
\begin{bmatrix}vecs(Y^{k})\\ vec(\mathcal{L}^{k}_{Y})\end{bmatrix}=(\Delta^{k}_{Y})^{\dagger}\Theta^{k}_{Y}. (16)
4:     K^{k}\leftarrow R^{-1}\mathcal{L}^{k},\ K^{k}_{Y}\leftarrow R^{-1}\mathcal{L}^{k}_{Y}.
5:     if |K^{k}-K^{k-1}|>\varepsilon or |K^{k}_{Y}-K^{k-1}_{Y}|>\varepsilon then
6:         k\leftarrow k+1.
7:     else
8:         return (P^{k},\ K^{k},\ Y^{k},\ K^{k}_{Y}).
9:     end if
10: end loop
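Each pass of Algorithm 2 thus amounts to assembling \Delta^{k} and \Theta^{k} from the fixed data matrices and solving the least-squares problem (15) through the pseudoinverse. The following sketch of the P-recursion is illustrative only (the Y-recursion is the same with Q=0); it assumes the data matrices have been built as in the snippet after (12).

```python
import numpy as np

def data_driven_pi(I_mat, I_X, I_XV, Q, R, K0, tol=1e-3, max_iter=50):
    """Sketch of the P-recursion of Algorithm 2 (no system matrices used)."""
    m, n = K0.shape
    K = K0
    for _ in range(max_iter):
        # Delta^k and Theta^k as defined after (14)
        Delta = np.hstack([I_mat,
                           -2.0 * I_X @ np.kron(np.eye(n), K.T) - 2.0 * I_XV])
        Theta = I_X @ (-(K.T @ R @ K) - Q).reshape(-1, order="F")   # vec(.)
        sol = np.linalg.pinv(Delta) @ Theta                          # (15)
        L = sol[n * (n + 1) // 2:].reshape(m, n, order="F")          # vec^{-1}(L^k)
        K_new = np.linalg.solve(R, L)                                # K^k = R^{-1} L^k
        if np.linalg.norm(K_new - K) <= tol:
            return K_new
        K = K_new
    return K
```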
Theorem 1.

Suppose Assumption 1 holds and there exists a positive integer \hat{d} such that

rank\big([\mathcal{I}_{\mathcal{X}},\mathcal{I}_{\mathcal{X}\mathcal{V}}]\big)=mn+\frac{n(n+1)}{2} (17)

holds for any d\geq\hat{d}, then \{K^{k}\}_{k=1}^{+\infty} and \{K^{k}_{Y}\}_{k=1}^{+\infty} generated by Algorithm 2 satisfy \lim_{k\rightarrow+\infty}K^{k}=K^{*} and \lim_{k\rightarrow+\infty}K^{k}_{Y}=K^{*}_{Y}, respectively.

Proof.

Step 1: Given k+k\in\mathbb{Z}^{+}, we first show that matrices Δk\Delta^{k} and ΔYk\Delta^{k}_{Y} have full column rank under condition (17).

We prove this result by contradiction. If the column ranks of Δk\Delta^{k} and ΔYk\Delta^{k}_{Y} are not full, then there exist two nonzero vectors gmn+n(n+1)/2g\in\mathbb{R}^{mn+n(n+1)/2} and hmn+n(n+1)/2h\in\mathbb{R}^{mn+n(n+1)/2} such that Δkg=0\Delta^{k}g=0 and ΔYkh=0\Delta^{k}_{Y}h=0. Moreover, we can take four matrices G1𝕊nG_{1}\in\mathbb{S}^{n}, G2m×nG_{2}\in\mathbb{R}^{m\times n}, H1𝕊nH_{1}\in\mathbb{S}^{n} and H2m×nH_{2}\in\mathbb{R}^{m\times n} by [vecs(G1)T,vec(G2)T]=g[vecs(G_{1})^{T},vec(G_{2})^{T}]=g and [vecs(H1)T,vec(H2)T]=h[vecs(H_{1})^{T},vec(H_{2})^{T}]=h.

Similar to (9), we have

eρsj+1𝒳(sj+1)TG1𝒳(sj+1)eρsj𝒳(sj)TG1𝒳(sj)=sjsj+1eρt{𝒳(t)T[ρG1+ATG1+G1A]𝒳(t)+2𝒱(t)TBTG1𝒳(t)}𝑑t,\begin{split}&e^{-\rho s_{j+1}}\mathcal{X}(s_{j+1})^{T}G_{1}\mathcal{X}(s_{j+1})-e^{-\rho s_{j}}\mathcal{X}(s_{j})^{T}G_{1}\mathcal{X}(s_{j})\\ =&\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\Big{\{}\mathcal{X}(t)^{T}\big{[}-\rho G_{1}+A^{T}G_{1}+G_{1}A\big{]}\mathcal{X}(t)+2\mathcal{V}(t)^{T}B^{T}G_{1}\mathcal{X}(t)\Big{\}}dt,\\ \end{split} (18)

and

eρsj+1𝒳(sj+1)TH1𝒳(sj+1)eρsj𝒳(sj)TH1𝒳(sj)=sjsj+1eρt{𝒳(t)T[ρH1+ATH1+H1A]𝒳(t)+2𝒱(t)TBTH1𝒳(t)}𝑑t.\begin{split}&e^{-\rho s_{j+1}}\mathcal{X}(s_{j+1})^{T}H_{1}\mathcal{X}(s_{j+1})-e^{-\rho s_{j}}\mathcal{X}(s_{j})^{T}H_{1}\mathcal{X}(s_{j})\\ =&\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\Big{\{}\mathcal{X}(t)^{T}\big{[}-\rho H_{1}+A^{T}H_{1}+H_{1}A\big{]}\mathcal{X}(t)+2\mathcal{V}(t)^{T}B^{T}H_{1}\mathcal{X}(t)\Big{\}}dt.\\ \end{split} (19)

Following the derivations of (13) and (14) and combining (18) and (19), we derive

0=Δkg=Δk[vecs(G1)vec(G2)]=[𝒳,𝒳𝒱][vec(𝔾1)vec(𝔾2)],0=ΔYkh=ΔYk[vecs(H1)vec(H2)]=[𝒳,𝒳𝒱][vec(1)vec(2)],\begin{split}0&=\Delta^{k}g=\Delta^{k}\begin{bmatrix}vecs(G_{1})\\ vec(G_{2})\\ \end{bmatrix}=[\mathcal{I}_{\mathcal{X}},\mathcal{I}_{\mathcal{X\mathcal{V}}}]\begin{bmatrix}vec(\mathbb{G}_{1})\\ vec(\mathbb{G}_{2})\\ \end{bmatrix},\\ 0&=\Delta^{k}_{Y}h=\Delta^{k}_{Y}\begin{bmatrix}vecs(H_{1})\\ vec(H_{2})\\ \end{bmatrix}=[\mathcal{I}_{\mathcal{X}},\mathcal{I}_{\mathcal{X\mathcal{V}}}]\begin{bmatrix}vec(\mathbb{H}_{1})\\ vec(\mathbb{H}_{2})\\ \end{bmatrix},\end{split} (20)

where

𝔾1=ρG1+ATG1+G1A(Kk1)TG2G2TKk1,𝔾2=2BTG12G2,1=ρH1+ATH1+H1A(KYk1)TH2H2TKYk1,2=2BTH12H2.\begin{split}\mathbb{G}_{1}&=-\rho G_{1}+A^{T}G_{1}+G_{1}A-(K^{k-1})^{T}G_{2}-G_{2}^{T}K^{k-1},\\ \mathbb{G}_{2}&=2B^{T}G_{1}-2G_{2},\\ \mathbb{H}_{1}&=-\rho H_{1}+A^{T}H_{1}+H_{1}A-(K^{k-1}_{Y})^{T}H_{2}-H_{2}^{T}K^{k-1}_{Y},\\ \mathbb{H}_{2}&=2B^{T}H_{1}-2H_{2}.\\ \end{split} (21)

Noting that 𝔾1\mathbb{G}_{1} and 1\mathbb{H}_{1} are symmetric matrices, it follows that

𝒳vec(𝔾1)=^𝒳vecs(𝔾1),𝒳vec(1)=^𝒳vecs(1),\mathcal{I}_{\mathcal{X}}vec(\mathbb{G}_{1})=\widehat{\mathcal{I}}_{\mathcal{X}}vecs(\mathbb{G}_{1}),\quad\mathcal{I}_{\mathcal{X}}vec(\mathbb{H}_{1})=\widehat{\mathcal{I}}_{\mathcal{X}}vecs(\mathbb{H}_{1}), (22)

where

^𝒳[s0s1eρt𝒳¯(t)𝑑t,s1s2eρt𝒳¯(t)𝑑t,,sd1sdeρt𝒳¯(t)𝑑t]T.\widehat{\mathcal{I}}_{\mathcal{X}}\triangleq\bigg{[}\int_{s_{0}}^{s_{1}}e^{-\rho t}\overline{\mathcal{X}}(t)dt,\int_{s_{1}}^{s_{2}}e^{-\rho t}\overline{\mathcal{X}}(t)dt,\cdots,\int_{s_{d-1}}^{s_{d}}e^{-\rho t}\overline{\mathcal{X}}(t)dt\bigg{]}^{T}. (23)

Keeping (22) in mind, (20) implies

0=Δkg=[𝒳,𝒳𝒱][vec(𝔾1)vec(𝔾2)]=[^𝒳,𝒳𝒱][vecs(𝔾1)vec(𝔾2)],0=ΔYkh=[𝒳,𝒳𝒱][vec(1)vec(2)]=[^𝒳,𝒳𝒱][vecs(1)vec(2)].\begin{split}0&=\Delta^{k}g=[\mathcal{I}_{\mathcal{X}},\mathcal{I}_{\mathcal{X\mathcal{V}}}]\begin{bmatrix}vec(\mathbb{G}_{1})\\ vec(\mathbb{G}_{2})\\ \end{bmatrix}=[\widehat{\mathcal{I}}_{\mathcal{X}},\mathcal{I}_{\mathcal{X\mathcal{V}}}]\begin{bmatrix}vecs(\mathbb{G}_{1})\\ vec(\mathbb{G}_{2})\\ \end{bmatrix},\\ 0&=\Delta^{k}_{Y}h=[\mathcal{I}_{\mathcal{X}},\mathcal{I}_{\mathcal{X\mathcal{V}}}]\begin{bmatrix}vec(\mathbb{H}_{1})\\ vec(\mathbb{H}_{2})\\ \end{bmatrix}=[\widehat{\mathcal{I}}_{\mathcal{X}},\mathcal{I}_{\mathcal{X\mathcal{V}}}]\begin{bmatrix}vecs(\mathbb{H}_{1})\\ vec(\mathbb{H}_{2})\\ \end{bmatrix}.\end{split} (24)

Under condition (17), it is evident to see that matrix [^𝒳,𝒳𝒱][\widehat{\mathcal{I}}_{\mathcal{X}},\mathcal{I}_{\mathcal{X\mathcal{V}}}] has full column rank. Combining this fact with (24), the definitions of vec()vec(\cdot) and vecs()vecs(\cdot) mean that 𝔾1=𝔾2=1=2=0\mathbb{G}_{1}=\mathbb{G}_{2}=\mathbb{H}_{1}=\mathbb{H}_{2}=0. Thus, (21) yields

\begin{split}(A-0.5\rho\mathbb{I}_{n}-BK^{k-1})^{T}G_{1}+G_{1}(A-0.5\rho\mathbb{I}_{n}-BK^{k-1})&=0,\\ (A-0.5\rho\mathbb{I}_{n}-BK^{k-1}_{Y})^{T}H_{1}+H_{1}(A-0.5\rho\mathbb{I}_{n}-BK^{k-1}_{Y})&=0.\end{split} (25)

According to Kronecker product theory, (25) shows

\begin{split}\Big[\mathbb{I}_{n}\otimes(A-0.5\rho\mathbb{I}_{n}-BK^{k-1})^{T}+(A-0.5\rho\mathbb{I}_{n}-BK^{k-1})^{T}\otimes\mathbb{I}_{n}\Big]vec(G_{1})&=0,\\ \Big[\mathbb{I}_{n}\otimes(A-0.5\rho\mathbb{I}_{n}-BK^{k-1}_{Y})^{T}+(A-0.5\rho\mathbb{I}_{n}-BK^{k-1}_{Y})^{T}\otimes\mathbb{I}_{n}\Big]vec(H_{1})&=0.\end{split} (26)

By virtue of (i) of Lemma 2, the matrices \mathbb{I}_{n}\otimes(A-0.5\rho\mathbb{I}_{n}-BK^{k-1})^{T}+(A-0.5\rho\mathbb{I}_{n}-BK^{k-1})^{T}\otimes\mathbb{I}_{n} and \mathbb{I}_{n}\otimes(A-0.5\rho\mathbb{I}_{n}-BK^{k-1}_{Y})^{T}+(A-0.5\rho\mathbb{I}_{n}-BK^{k-1}_{Y})^{T}\otimes\mathbb{I}_{n} are Hurwitz. Then it follows from (26) that G_{1}=H_{1}=0. Combining this with \mathbb{G}_{2}=\mathbb{H}_{2}=0, we obtain G_{1}=H_{1}=G_{2}=H_{2}=0, which means that g=h=0. However, this contradicts the assumption that g and h are nonzero vectors.

Step 2: We now prove the convergence results of Algorithm 2. For any k+k\in\mathbb{Z}^{+}, since Δk\Delta^{k} and ΔYk\Delta^{k}_{Y} have full column rank, (15) and (16) admit a unique solution, respectively. Furthermore, Lemma 3 shows that (Pk,k)(P^{k},\mathcal{L}^{k}) is a solution of (15) and (Yk,Yk)(Y^{k},\mathcal{L}^{k}_{Y}) is a solution of (16). Therefore, (15) admits unique solution (Pk,k)(P^{k},\mathcal{L}^{k}) and (16) admits unique solution (Yk,Yk)(Y^{k},\mathcal{L}^{k}_{Y}). Thus, the convergence of Algorithm 2 follows from Lemma 2. We have then proved the theorem. ∎

Remark 1.

To further analyze our algorithm, we now consider the computational complexity (i.e., time complexity and space complexity) of Algorithm 2. If each integral in (12) is computed by summation over y discrete points and M Monte Carlo samples are employed, then the time complexity of calculating the matrices \mathcal{I}, \mathcal{I}_{\mathcal{X}} and \mathcal{I}_{\mathcal{X}\mathcal{V}} is O(M+dy). In contrast, the time complexity of computing the corresponding matrices in Xu et al. [31] is O(Mdy). Moreover, it is easy to verify that the space complexity of Algorithm 2 is the same as that of the algorithm in Xu et al. [31]. Since M is typically large in practice, the computational complexity of our algorithm is much smaller than that of the algorithm in Xu et al. [31].

Remark 2.

In practical implementations, rank condition (17) can be met by adding an exploration signal to the input. Many types of exploration signals are commonly used in RL algorithms, such as Gaussian signals [32, 23, 22], random signals [34, 33] and signals generated by sums of sinusoidal functions [18, 31]. In the simulation section of this paper, we adopt exploration signals mainly generated by combinations of sinusoidal functions.
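As a concrete illustration of Remark 2 (an assumption-laden sketch rather than a prescription), the snippet below generates a sum-of-sinusoids exploration signal of the kind used in Section 5.1 for a scalar input, and checks the rank condition (17) on the resulting data matrices.

```python
import numpy as np

def exploration_signal(num_freq=100, amp=0.3, freq_range=1000.0, seed=0):
    """Sum-of-sinusoids exploration signal for a scalar input, as in Section 5.1."""
    rng = np.random.default_rng(seed)
    betas = rng.uniform(-freq_range, freq_range, size=num_freq)
    return lambda t: amp * np.sum(np.sin(betas * t))

def rank_condition_holds(I_X, I_XV, n, m):
    """Check the rank condition (17): rank([I_X, I_XV]) = mn + n(n+1)/2."""
    return np.linalg.matrix_rank(np.hstack([I_X, I_XV])) == m * n + n * (n + 1) // 2
```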

By virtue of a system transformation idea, we have developed a data-driven PI algorithm to solve Problem (LQG) without needing all system coefficients. However, this algorithm requires two matrices K^{0}\in\mathbb{R}^{m\times n} and K^{0}_{Y}\in\mathbb{R}^{m\times n} such that A-BK^{0} and A-BK^{0}_{Y} are Hurwitz. When all system matrices are unknown, it may be difficult to obtain two matrices that meet this condition. Thus, we will develop a data-driven VI algorithm in the next section to resolve this issue.

4 A system transformation data-driven VI algorithm

In this section, we aim to propose a data-driven VI algorithm to remove the assumption of needing two Hurwitz matrices. Moreover, the data-driven VI algorithm also does not require the knowledge of all system coefficients.

In addition to Assumption 1, this section also needs the following assumption.

Assumption 2.

A-0.5\rho\mathbb{I}_{n} is Hurwitz.

Under Assumptions 1-2, it is easy to verify that Y^{*}=0 and K^{*}_{Y}=0. Thus, we only need to deal with ARE (2) and the corresponding K^{*}. To proceed, we define a sequence of positive constants \{\gamma_{k}\}_{k=0}^{+\infty} and a sequence of bounded sets \{\mathcal{D}_{q}\}_{q=0}^{+\infty}, which satisfy

\sum_{k=0}^{+\infty}\gamma_{k}=+\infty,\ \sum_{k=0}^{+\infty}\gamma_{k}^{2}<+\infty,\ \gamma_{k}>0,\ \forall k\in\mathbb{Z}, (27)

and

\lim\limits_{q\rightarrow+\infty}\mathcal{D}_{q}=\mathbb{S}^{n}_{+},\ \mathcal{D}_{q}\subseteq\mathcal{D}_{q+1},\ \forall q\in\mathbb{Z}. (28)

Based on the above notation, we propose a model-based VI algorithm in Algorithm 3, which can be initialized with any positive definite matrix. Since the model-based VI algorithm is a direct extension of [20, Lemma 3.4, Theorem 3.3], we present its convergence results in the next lemma but omit the proof due to space limitations.

Lemma 4.

Let Assumptions 1-2 hold. Then the sequence \{\mathcal{K}^{k}\}_{k=0}^{+\infty} generated by Algorithm 3 satisfies \lim_{k\rightarrow+\infty}\mathcal{K}^{k}=K^{*}.

Algorithm 3 Model-based VI algorithm
1: Initialize k=0 and q=0. Choose \mathcal{P}^{0}>0. Choose a sequence \{\gamma_{k}\}_{k=0}^{+\infty} that meets (27) and a sequence of bounded sets \{\mathcal{D}_{q}\}_{q=0}^{+\infty} that meets (28). Predefine a small threshold \varepsilon>0.
2: loop
3:     \mathcal{K}^{k}\leftarrow R^{-1}B^{T}\mathcal{P}^{k}, \widetilde{\mathcal{P}}\leftarrow\mathcal{P}^{k}+\gamma_{k}\big(\mathcal{P}^{k}A+A^{T}\mathcal{P}^{k}-\rho\mathcal{P}^{k}-(\mathcal{K}^{k})^{T}R\mathcal{K}^{k}+Q\big).
4:     if |\widetilde{\mathcal{P}}-\mathcal{P}^{k}|/\gamma_{k}<\varepsilon then
5:         return (\mathcal{P}^{k},\ \mathcal{K}^{k}).
6:     else if \widetilde{\mathcal{P}}\notin\mathcal{D}_{q} then
7:         \mathcal{P}^{k+1}\leftarrow\mathcal{P}^{0},\ q\leftarrow q+1.
8:     else
9:         \mathcal{P}^{k+1}\leftarrow\widetilde{\mathcal{P}}.
10:     end if
11:     k\leftarrow k+1.
12: end loop
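A minimal sketch of Algorithm 3 is given below. The step sizes \gamma_{k}=c/(k+1) and the growing bounded sets \mathcal{D}_{q}=\{\mathcal{P}\in\mathbb{S}^{n}_{+}:|\mathcal{P}|\leq 100(q+1)\} mirror the choices made later in Section 5.2 and are only one admissible option satisfying (27) and (28).

```python
import numpy as np

def model_based_vi(A, B, Q, R, rho, P0, tol=1e-4, max_iter=200000, c=3.0):
    """Model-based VI sketch following Algorithm 3.

    gamma_k = c/(k+1) satisfies (27); D_q = {P in S^n_+ : ||P|| <= 100(q+1)}
    satisfies (28). Both choices mirror Section 5.2 and are assumptions.
    """
    P, K, q = P0.copy(), None, 0
    for k in range(max_iter):
        gamma = c / (k + 1)
        K = np.linalg.solve(R, B.T @ P)
        # Riccati residual of ARE (2): P A + A^T P - rho P - K^T R K + Q
        H = P @ A + A.T @ P - rho * P - K.T @ R @ K + Q
        P_tilde = P + gamma * H
        if np.linalg.norm(P_tilde - P) / gamma < tol:
            return P, K
        # membership test for D_q (norm bound and positive semidefiniteness)
        if (np.linalg.norm(P_tilde) > 100.0 * (q + 1)
                or np.min(np.linalg.eigvalsh(P_tilde)) < 0.0):
            P, q = P0.copy(), q + 1
        else:
            P = P_tilde
    return P, K
```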

Although Algorithm 3 does not need the assumption of two Hurwitz matrices, it still requires knowledge of the system coefficients A and B. In the sequel, we will develop a data-driven VI algorithm to solve Problem (LQG) without depending on all system matrices. To this end, we give a lemma that constructs a relationship between system (6) and the matrices in Algorithm 3.

Lemma 5.

For any k\in\mathbb{Z}, \mathcal{P}^{k} generated by Algorithm 3 satisfies

\begin{split}&\Big[e^{-\rho s_{j+1}}\overline{\mathcal{X}}(s_{j+1})^{T}-e^{-\rho s_{j}}\overline{\mathcal{X}}(s_{j})^{T}\Big]vecs(\mathcal{P}^{k})\\ =\,\,&\Big\{\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\overline{\mathcal{X}}(t)^{T}dt\Big\}vecs\big(\mathcal{M}^{k}\big)+2\Big\{\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\big[\mathcal{X}(t)^{T}\otimes\mathcal{V}(t)^{T}\big]dt\Big\}vec(\mathcal{N}^{k}),\end{split} (29)

where \mathcal{M}^{k}\triangleq-\rho\mathcal{P}^{k}+A^{T}\mathcal{P}^{k}+\mathcal{P}^{k}A, \mathcal{N}^{k}\triangleq B^{T}\mathcal{P}^{k}, \{s_{j}\}_{j=0}^{d} is a set of real numbers satisfying s_{0}<s_{1}<s_{2}<\cdots<s_{d} and d\in\mathbb{Z}^{+} is a predefined positive integer.

Proof.

By virtue of system (6), we derive

d(eρt𝒳(t)T𝒫k𝒳(t))={ρeρt𝒳(t)T𝒫k𝒳(t)+eρt[A𝒳(t)+B𝒱(t)]T𝒫k𝒳(t)+eρt𝒳(t)T𝒫k[A𝒳(t)+B𝒱(t)]}dt=eρt{𝒳(t)T[ρ𝒫k+AT𝒫k+𝒫kA]𝒳(t)+2𝒱(t)TBT𝒫k𝒳(t)}dt.\begin{split}&d\big{(}e^{-\rho t}\mathcal{X}(t)^{T}\mathcal{P}^{k}\mathcal{X}(t)\big{)}\\ =\,\,&\Big{\{}-\rho e^{-\rho t}\mathcal{X}(t)^{T}\mathcal{P}^{k}\mathcal{X}(t)+e^{-\rho t}\big{[}A\mathcal{X}(t)+B\mathcal{V}(t)\big{]}^{T}\mathcal{P}^{k}\mathcal{X}(t)+e^{-\rho t}\mathcal{X}(t)^{T}\mathcal{P}^{k}\big{[}A\mathcal{X}(t)+B\mathcal{V}(t)\big{]}\Big{\}}dt\\ =\,\,&e^{-\rho t}\Big{\{}\mathcal{X}(t)^{T}\big{[}-\rho\mathcal{P}^{k}+A^{T}\mathcal{P}^{k}+\mathcal{P}^{k}A\big{]}\mathcal{X}(t)+2\mathcal{V}(t)^{T}B^{T}\mathcal{P}^{k}\mathcal{X}(t)\Big{\}}dt.\end{split}

Then, integrating the above equation from sjs_{j} to sj+1s_{j+1}, we have

eρsj+1𝒳(sj+1)T𝒫k𝒳(sj+1)eρsj𝒳(sj)T𝒫k𝒳(sj)=sjsj+1eρt{𝒳(t)T[ρ𝒫k+AT𝒫k+𝒫kA]𝒳(t)+2𝒱(t)TBT𝒫k𝒳(t)}𝑑t.\begin{split}&e^{-\rho s_{j+1}}\mathcal{X}(s_{j+1})^{T}\mathcal{P}^{k}\mathcal{X}(s_{j+1})-e^{-\rho s_{j}}\mathcal{X}(s_{j})^{T}\mathcal{P}^{k}\mathcal{X}(s_{j})\\ =\,\,&\int_{s_{j}}^{s_{j+1}}e^{-\rho t}\Big{\{}\mathcal{X}(t)^{T}\big{[}-\rho\mathcal{P}^{k}+A^{T}\mathcal{P}^{k}+\mathcal{P}^{k}A\big{]}\mathcal{X}(t)+2\mathcal{V}(t)^{T}B^{T}\mathcal{P}^{k}\mathcal{X}(t)\Big{\}}dt.\end{split}

According to Kronecker product theory, we get (29). This completes the proof. ∎

In light of the matrices in (12) and (23), equation (29) means that

\big[\widehat{\mathcal{I}}_{\mathcal{X}},2\mathcal{I}_{\mathcal{X}\mathcal{V}}\big]\begin{bmatrix}vecs(\mathcal{M}^{k})\\ vec(\mathcal{N}^{k})\end{bmatrix}=\mathcal{I}vecs(\mathcal{P}^{k}),\quad\forall k\in\mathbb{Z}.

Now we can summarize our data-driven VI algorithm in Algorithm 4. It should be noted that this algorithm does not require the assumption of Hurwitz matrices and can be implemented in the setting of unknown system matrices.

Algorithm 4 System transformation data-driven VI algorithm
1: Initialize k=0 and q=0. Choose \mathcal{P}^{0}>0. Choose a sequence \{\gamma_{k}\}_{k=0}^{+\infty} that meets (27) and a sequence of bounded sets \{\mathcal{D}_{q}\}_{q=0}^{+\infty} that meets (28). Predefine a small threshold \varepsilon>0. Compute \mathcal{I}, \widehat{\mathcal{I}}_{\mathcal{X}} and \mathcal{I}_{\mathcal{X}\mathcal{V}}.
2: loop
3:     Calculate (\mathcal{M}^{k}, \mathcal{N}^{k}) by
\begin{bmatrix}vecs(\mathcal{M}^{k})\\ vec(\mathcal{N}^{k})\end{bmatrix}=\big[\widehat{\mathcal{I}}_{\mathcal{X}},2\mathcal{I}_{\mathcal{X}\mathcal{V}}\big]^{\dagger}\mathcal{I}vecs(\mathcal{P}^{k}). (30)
4:     \mathcal{K}^{k}\leftarrow R^{-1}\mathcal{N}^{k}, \widetilde{\mathcal{P}}\leftarrow\mathcal{P}^{k}+\gamma_{k}\big(\mathcal{M}^{k}-(\mathcal{K}^{k})^{T}R\mathcal{K}^{k}+Q\big).
5:     if |\widetilde{\mathcal{P}}-\mathcal{P}^{k}|/\gamma_{k}<\varepsilon then
6:         return (\mathcal{P}^{k},\ \mathcal{K}^{k}).
7:     else if \widetilde{\mathcal{P}}\notin\mathcal{D}_{q} then
8:         \mathcal{P}^{k+1}\leftarrow\mathcal{P}^{0},\ q\leftarrow q+1.
9:     else
10:         \mathcal{P}^{k+1}\leftarrow\widetilde{\mathcal{P}}.
11:     end if
12:     k\leftarrow k+1.
13: end loop
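Analogously, each pass of Algorithm 4 only solves the least-squares problem (30) on the pre-computed data matrices; no system matrix enters the loop. A sketch under the same step-size and \mathcal{D}_{q} assumptions as above (and with the vecs(\cdot) convention of Section 2.1) is:

```python
import numpy as np

def data_driven_vi(I_mat, I_hat_X, I_XV, Q, R, n, m, P0, tol=1e-4,
                   max_iter=200000, c=3.0):
    """Data-driven VI sketch following Algorithm 4 (no system matrices used)."""
    def vecs(S):          # vecs(.) of Section 2.1
        return np.concatenate([np.r_[S[r, r], 2.0 * S[r, r + 1:]] for r in range(n)])

    def vecs_inv(v):      # inverse map, returning a symmetric matrix
        S, idx = np.zeros((n, n)), 0
        for r in range(n):
            S[r, r] = v[idx]; idx += 1
            for l in range(r + 1, n):
                S[r, l] = S[l, r] = v[idx] / 2.0; idx += 1
        return S

    G_pinv = np.linalg.pinv(np.hstack([I_hat_X, 2.0 * I_XV]))
    P, K, q = P0.copy(), None, 0
    for k in range(max_iter):
        gamma = c / (k + 1)
        sol = G_pinv @ (I_mat @ vecs(P))                      # equation (30)
        M = vecs_inv(sol[:n * (n + 1) // 2])                  # M^k = -rho P + A^T P + P A
        N = sol[n * (n + 1) // 2:].reshape(m, n, order="F")   # N^k = B^T P^k
        K = np.linalg.solve(R, N)
        P_tilde = P + gamma * (M - K.T @ R @ K + Q)
        if np.linalg.norm(P_tilde - P) / gamma < tol:
            return P, K
        if np.linalg.norm(P_tilde) > 100.0 * (q + 1):         # simplified D_q test
            P, q = P0.copy(), q + 1
        else:
            P = P_tilde
    return P, K
```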
Theorem 2.

Suppose Assumptions 1-2 hold and there exists a positive integer \hat{d} such that

rank\big([\widehat{\mathcal{I}}_{\mathcal{X}},\mathcal{I}_{\mathcal{X}\mathcal{V}}]\big)=mn+\frac{n(n+1)}{2} (31)

holds for any d\geq\hat{d}, then the sequence \{\mathcal{K}^{k}\}_{k=0}^{+\infty} obtained by Algorithm 4 satisfies \lim_{k\rightarrow+\infty}\mathcal{K}^{k}=K^{*}.

Proof.

Given k\in\mathbb{Z}, it follows from Lemma 5 that (\mathcal{M}^{k},\mathcal{N}^{k}) is a solution of (30). When rank condition (31) is guaranteed, we know that (30) admits a unique solution. Therefore, the solution matrices of Algorithm 4 are the same as those of Algorithm 3. Thus, Lemma 4 implies the convergence of Algorithm 4. This completes the proof. ∎

5 Simulations

This section demonstrates the applicability of the proposed data-driven algorithms through two numerical examples.

5.1 The first numerical example

In this example, we let the coefficients of system (1) be the same as in [31, Section 4], i.e.,

A=\begin{bmatrix}5&3\\ 10&12\end{bmatrix},\quad B=\begin{bmatrix}0\\ 1\end{bmatrix},\quad C=\begin{bmatrix}0.1&0.1\\ 0.1&0.1\end{bmatrix}.

The parameters in the performance index are Q=10\mathbb{I}_{n}, R=1 and \rho=0.01. In order to know the true values of K^{*} and K^{*}_{Y}, we first carry out the model-based PI algorithm (Algorithm 1) with a large number of iterations and take its solution as the true values. The true solutions are

\begin{split}P^{*}&=\begin{bmatrix}232.2887&59.3007\\ 59.3007&34.5712\end{bmatrix},\quad K^{*}=\begin{bmatrix}59.3007&34.5712\end{bmatrix},\\ Y^{*}&=\begin{bmatrix}207.1460&56.5767\\ 56.5767&33.9800\end{bmatrix},\quad K^{*}_{Y}=\begin{bmatrix}56.5767&33.9800\end{bmatrix}.\end{split}

Now we devote ourselves to solving the above problem by Algorithm 2, which does not rely on the information of matrices A, B and C. The parameters of Algorithm 2 are set as d=20, s_{0}=0, s_{j+1}=s_{j}+0.1, j=0,1,2,\cdots,d-1, K^{0}=K_{Y}^{0}=[35,25] and \varepsilon=10^{-3}. The initial state is set as [1,1]^{T}. During the simulation, we choose an arbitrary agent i and collect 10^{6} samples of its state and control input data. We set u_{i}(t)=-K^{0}x_{i}(t)+0.3\sum_{r=1}^{100}\sin(\beta_{r}t), where \{\beta_{r}\}_{r=1}^{100} is a set of real numbers randomly selected from [-1000,1000]. The collected information is then used to construct system (6) by the Monte Carlo method. The input and state trajectories of system (6) used in the simulation are depicted in Figs. 1 and 2, respectively.

Figure 1: Input trajectory of system (6) used in Section 5.1.
Figure 2: State trajectories of system (6) used in Section 5.1.
Figure 3: Convergence of matrices P^{k} and Y^{k} in Section 5.1.
Figure 4: Convergence of matrices K^{k} and K^{k}_{Y} in Section 5.1.

Figs. 3 and 4 display the convergence of Algorithm 2. After 6 iteration steps, Algorithm 2 gives the approximate values of (P^{*},K^{*}) and (Y^{*},K^{*}_{Y}). The results of our algorithm are

\begin{split}P^{6}&=\begin{bmatrix}233.4279&59.5896\\ 59.5896&34.5868\end{bmatrix},\quad K^{6}=\begin{bmatrix}59.2254&34.5373\end{bmatrix},\\ Y^{6}&=\begin{bmatrix}208.5480&57.0425\\ 57.0425&34.0963\end{bmatrix},\quad K^{6}_{Y}=\begin{bmatrix}56.3265&33.8537\end{bmatrix}.\end{split}

Clearly, the matrices computed by Algorithm 2 are close to the true values, which agrees well with our theoretical results. For comparison purposes, the numerical results of Xu et al. [31] are presented in Table 1. It is evident from this table that our algorithm has performance similar to theirs. However, as mentioned in Remark 1, the computational complexity of our algorithm is smaller than that of the algorithm in Xu et al. [31].

Table 1: Comparison between the algorithm in Xu et al. [31] and Algorithm 2.
                                                          The algorithm in Xu et al. [31]          Algorithm 2
         Final iteration number                           6                                        6
         Relative error \frac{|K^{6}-K^{*}|}{|K^{*}|}          0.0013                              0.0012
         Relative error \frac{|K^{6}_{Y}-K^{*}_{Y}|}{|K^{*}_{Y}|}          0.0014                  0.0042

To close this subsection, we focus on the decentralized strategies constructed from the solution of Algorithm 2. Suppose that there are 200 agents and that they adopt the decentralized strategies u_{i}(\cdot)=-K^{6}x_{i}(\cdot)-(K_{Y}^{6}-K^{6})\widehat{x}_{100}(\cdot), 1\leq i\leq 200, where \widehat{x}_{100}(\cdot) denotes the trajectory of aggregate quantity (4) simulated by Monte Carlo with 100 samples. All agents' initial states are randomly generated from [0,2]\times[0,2]. Figs. 5 and 6 illustrate the behavior of the population. Furthermore, we plot all agents' average state, denoted by \widetilde{x}_{200}(\cdot), together with the trajectory of \widehat{x}_{100}(\cdot) in Fig. 7. The lines in Fig. 7 show that \widetilde{x}_{200}(\cdot) is close to \widehat{x}_{100}(\cdot), which effectively demonstrates the consistency condition.

Figure 5: State trajectories \big[x_{i}(\cdot)\big]_{1}, 1\leq i\leq 200.
Figure 6: State trajectories \big[x_{i}(\cdot)\big]_{2}, 1\leq i\leq 200.
Figure 7: Trajectories of the aggregate quantity \widehat{x}_{100}(\cdot) and the agents' average state \widetilde{x}_{200}(\cdot).
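To reproduce the consistency check of Figs. 5-7, one can propagate the aggregate quantity (4) with the learned gain and drive every agent by the corresponding decentralized strategy. The sketch below is illustrative: the gains K and K_Y stand for the learned K^{6} and K^{6}_{Y}, the initial data are placeholders, and A, B, C are used only to simulate system (1).

```python
import numpy as np

def consistency_check(A, B, C, K, K_Y, x0_batch, xi0, T=5.0, dt=1e-3, seed=0):
    """Simulate N agents under the learned decentralized strategies and compare
    their average state with the aggregate quantity (4).

    A, B, C are used only to generate trajectories of system (1); K and K_Y
    stand for the learned gains (e.g. K^6 and K_Y^6 of Section 5.1).
    """
    rng = np.random.default_rng(seed)
    steps = int(round(T / dt))
    n, p = C.shape
    N = x0_batch.shape[0]
    x = x0_batch.copy()                  # states of all N agents, shape (N, n)
    x_hat = np.array(xi0, dtype=float)   # aggregate quantity of (4)
    avg_err = np.zeros(steps)
    for j in range(steps):
        u = -x @ K.T - x_hat @ (K_Y - K).T            # decentralized strategies
        dW = rng.standard_normal((N, p)) * np.sqrt(dt)
        x = x + (x @ A.T + u @ B.T) * dt + dW @ C.T   # Euler-Maruyama for (1)
        x_hat = x_hat + (A @ x_hat - B @ (K_Y @ x_hat)) * dt   # ODE (4)
        avg_err[j] = np.linalg.norm(x.mean(axis=0) - x_hat)
    return avg_err
```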

5.2 The second numerical example

Evidently, the first example does not satisfy Assumption 2. In this subsection, we consider an example that satisfies Assumptions 1-2. The system coefficients are

A=\begin{bmatrix}-5&1&-0.0751\\ 0&-0.6250&-39.2699\\ -0.0045&0&-0.4127\end{bmatrix},\quad B=\begin{bmatrix}1.4542\\ -0.0154\\ 0.4127\end{bmatrix},\quad C=\begin{bmatrix}3&0.1\\ 0.5&-2\\ 1&0\end{bmatrix}.

The parameters in the cost functional are Q=diag\{5,1,1\}, R=1 and \rho=0.01. Since Y^{*}=0 and K_{Y}^{*}=0, we only focus on calculating the approximate solutions of P^{*} and K^{*}. For comparison purposes, we present the true values of P^{*} and K^{*}, which are

P^{*}=\begin{bmatrix}0.4976&0.1185&-1.3229\\ 0.1185&0.3377&-2.5877\\ -1.3229&-2.5877&36.5204\end{bmatrix},\quad K^{*}=\begin{bmatrix}0.1758&-0.9008&13.1881\end{bmatrix}.

Next, we solve this example by Algorithms 2 and 4, respectively. To implement Algorithm 2, we set x_{i0}=[-1,0,1]^{T}, K^{0}=[-1,-1,14], u_{i}(t)=-K^{0}x_{i}(t)+\sin(-24.6t) and collect 10^{6} data samples of a given agent i. For Algorithm 4, we set \mathcal{P}^{0}=0.1\mathbb{I}_{n}, \gamma_{k}=3/(k+1), u_{i}(t)=\sin(-6t), \mathcal{D}_{q}=\left\{\mathcal{P}\in\mathbb{S}^{n}_{+}\,\big|\,|\mathcal{P}|\leq 100(q+1)\right\}, \forall q\in\mathbb{Z}, and collect 2\times 10^{6} data samples of agent i. The other parameters are set the same as those in Section 5.1. The convergence of Algorithms 2 and 4 is shown in Figs. 8 and 9.

Figure 8: Convergence of Algorithm 2 in Section 5.2.
Figure 9: Convergence of Algorithm 4 in Section 5.2.

After 4 iteration steps, Algorithm 2 generates the following solution matrices

P^{4}=\begin{bmatrix}0.3825&0.1836&-1.1736\\ 0.1836&0.2990&-2.6751\\ -1.1736&-2.6751&36.8479\end{bmatrix},\quad K^{4}=\begin{bmatrix}0.1416&-0.8537&13.2655\end{bmatrix}

with a relative error |K^{4}-K^{*}|/|K^{*}|=0.0073. Moreover, Algorithm 4 converges after 172 iterations and gives

\mathcal{P}^{172}=\begin{bmatrix}0.3980&0.1251&-1.1929\\ 0.1251&0.3397&-2.5855\\ -1.1929&-2.5855&35.9500\end{bmatrix},\quad \mathcal{K}^{172}=\begin{bmatrix}0.1524&-0.8951&13.2542\end{bmatrix}

with a relative error |\mathcal{K}^{172}-K^{*}|/|K^{*}|=0.0053. It is worth pointing out that the straight dashed lines in the first half of Fig. 9 indicate that \mathcal{P}^{k}\notin\mathcal{D}_{q} in the early iteration steps. This is one of the characteristics of continuous-time VI algorithms.

The above simulation results indicate that the algorithms developed in this paper may be useful for dealing with LQ mean-field game problems in the setting of completely unknown system coefficients. Therefore, the proposed algorithms may be conducive to solving practical mean-field game applications.

6 Conclusions and future works

This paper is concerned with an LQ mean-field game problem in continuous time. We develop a system transformation method to implement a model-based PI algorithm and a model-based VI algorithm in a data-driven fashion. The obtained data-driven algorithms permit the construction of decentralized control strategies without system coefficient information and have smaller computational complexity. Moreover, we simulate two numerical examples to verify the effectiveness of our algorithms. The simulation results show that our algorithms successfully find \epsilon-Nash equilibria with small errors. Thus, the algorithms proposed in this paper may be promising tools for solving continuous-time LQ mean-field games with unknown system parameters. In future work, we plan to explore data-driven RL algorithms for more general LQ mean-field games, such as those with jumps, delays and partial information.

Acknowledgements

X. Li acknowledges the financial support by the Hong Kong General Research Fund, China under Grant Nos. 15216720, 15221621 and 15226922. G. Wang acknowledges the financial support from the National Key R&D Program of China under Grant No. 2022YFA1006103, the National Natural Science Foundation of China under Grant Nos. 61925306, 61821004 and 11831010, and the Natural Science Foundation of Shandong Province under Grant No. ZR2019ZD42. J. Xiong acknowledges the financial support from the National Key R&D Program of China under Grant No. 2022YFA1006102 and the National Natural Science Foundation of China under Grant No. 11831010.

References

  • [1] J. von Neumann, O. Morgenstern. The Theory of Games and Economic Behavior, Princeton University Press, 1944.
  • [2] T. Başar, G. Olsder, Dynamic Noncooperative Game Theory, London, U.K.: Academic Press, 1982.
  • [3] O. Guéant, J. M. Lasry, P. L. Lions, Paris-Princeton Lectures on Mathematical Finance, Mean field Games and Applications, Heidelberg, Germany: Springer-Verlag, 2011.
  • [4] A. Bensoussan, J. Frehse, P. Yam, Mean Field Games and Mean Field Type Control Theory, New York: Springer, 2013.
  • [5] M. Huang, R. P. Malhamé, P. E. Caines, Large population stochastic dynamic games: Closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle, Commun. Inf. Syst. 6 (3) (2006) 221-252.
  • [6] J. M. Lasry, P. L. Lions, Mean field games, Jap. J. Math. 2 (1) (2007) 229-260.
  • [7] M. Huang, P. E. Caines, R. P. Malhamé, Large-population cost-coupled LQG problems with nonuniform agents: Individual-mass behavior and decentralized ϵ\epsilon-Nash equilibria, IEEE Trans. Autom. Control 52 (9) (2007) 1560-1571.
  • [8] M. Huang, Large-population LQG games involving a major player: The Nash certainty equivalence principle, SIAM J. Control Optim. 48 (5) (2010) 3318-3353.
  • [9] A. Bensoussan, M. H. M. Chau, S. C. P. Yam, Mean field games with a dominating player, Appl. Math. Optim. 74 (2016) 91-128.
  • [10] A. Bensoussan, M. H. M. Chau, Y. Lai, S. C. P. Yam, Linear-quadratic mean field stackelberg games with state and control delays, SIAM J. Control Optim. 55 (4) (2017) 2748-2781.
  • [11] J. Moon, T. Başar, Linear quadratic mean field stackelberg differential games, Automatica 97 (2018) 200-213.
  • [12] J. Moon, T. Başar, Linear quadratic risk-sensitive and robust mean field games. IEEE Trans. Autom. Control 62 (3) (2017) 1062-1077.
  • [13] H. Tembine, Q. Zhu, T. Başar, Risk-sensitive mean-field stochastic differential games, IFAC Proceedings 44 (1) (2011) 3222-3227.
  • [14] J. Huang, S. Wang, Dynamic optimization of large-population systems with partial information, J. Optim. Theory Appl. 168 (2015) 231-245.
  • [15] P. Huang, G. Wang, W. Wang, A linear-quadratic mean-field game of backward stochastic differential equation with partial information and common noise, Appl. Math. Comput. 446 (2023) 127899.
  • [16] B. Wang, H. Zhang, J. Zhang, Mean field linear-quadratic control: uniform stabilization and social optimality, Automatica 121 (2020) 109088.
  • [17] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, F. L. Lewis, Adaptive optimal control for continuous-time linear systems based on policy iteration, Automatica 45 (2009) 477-484.
  • [18] Y. Jiang, Z. -P. Jiang, Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics, Automatica 48 (10) (2012) 2699-2704.
  • [19] J. Y. Lee, J. B. Park, Y. H. Choi, Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems, Automatica 48 (11) (2012) 2850-2859.
  • [20] T. Bian, Z. -P. Jiang, Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design, Automatica 71 (2016) 348-360.
  • [21] N. Li, X. Li, J. Peng, Z. Xu, Stochastic linear quadratic optimal control problem: a reinforcement learning method, IEEE Trans. Autom. Control 67 (2022) 5009-5016.
  • [22] H. Zhang, N. Li, Data-driven policy iteration algorithm for continuous-time stochastic linear-quadratic optimal control problems, Asian J. Control (2023).
  • [23] H. Zhang, An adaptive dynamic programming-based algorithm for infinite-horizon linear quadratic stochastic optimal control problems, J. Appl. Math. Comput. 69 (2023) 2741-2760.
  • [24] G. Wang, H. Zhang, Model-free value iteration algorithm for continuous-time stochastic linear quadratic optimal control problems, arXiv:2203.06547, 2022.
  • [25] M. A. uz Zaman, E. Miehling, T. Başar, Reinforcement learning for non-stationary discrete-time linear-quadratic mean-field games in multiple populations, Dyn. Games Appl. 13 (2023) 118-164.
  • [26] Z. Fu, Z. Yang, Y. Chen, Z. Wang, Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games, In 8th International Conference on Learning Representations, 2020.
  • [27] J. Subramanian, A. Mahajan, Reinforcement learning in stationary mean-field games, In Proceedings of the 18th international conference on autonomous agents and multiagent systems (pp. 251–259), 2019.
  • [28] R. Carmona, M. Laurière, Z. Tan, Linear-quadratic mean-field reinforcement learning: Convergence of policy gradient methods, arXiv:1910.04295, 2019.
  • [29] A. Angiuli, J-P. Fouque, M. Laurière, Unified reinforcement Q-learning for mean field game and control problems, Math. Control Signals Syst. 34 (2022) 217-271.
  • [30] R. Carmona, M. Laurière, Z. Tan, Model-free mean-field reinforcement learning: Mean-field MDP and mean-field Q-learning, Ann. Appl. Probab. 33 (2023) 5334-5381.
  • [31] Z. Xu, T. Shen, M. Huang, Model-free policy iteration approach to NCE-based strategy design for linear quadratic Gaussian games, Automatica 155 (2023) 111162.
  • [32] S. J. Bradtke, Reinforcement learning applied to linear quadratic regulation, Adv. Neural Inf. Process. Syst. 5 (1993) 295-302.
  • [33] H. Xu, S. Jagannathan, F. L. Lewis, Stochastic optimal control of unknown linear networked control system in the presence of random delays and packet losses, Automatica 48 (6) (2012) 1017-1030.
  • [34] A. Al-Tamimi, F. L. Lewis, M. Abu-Khalaf, Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control, Automatica 43 (3) (2007) 473-481.