This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Byzantine-Resilient Distributed Optimization of
Multi-Dimensional Functions

Kananart Kuwaranancharoen, Lei Xin, Shreyas Sundaram This research was supported by NSF CAREER award 1653648. The authors are with the School of Electrical and Computer Engineering at Purdue University. Email: {kkuwaran,lxin,sundara2}@purdue.edu. Both of the first two authors contributed equally to this paper.     Lei Xin, George Chiu, Shreyas Sundaram This research was supported by USDA grant 2018-67007-28439. This work represents the opinions of the authors and not the USDA or NIFA. Lei Xin and Shreyas Sundaram are with the Elmore Family School of Electrical and Computer Engineering, Purdue University. George Chiu is with the School of Mechanical Engineering, Purdue University. E-mails: {lxin, gchiu, sundara2}@purdue.edu.

Learning the Dynamics of Autonomous Linear Systems From Multiple Trajectories

Kananart Kuwaranancharoen, Lei Xin, Shreyas Sundaram This research was supported by NSF CAREER award 1653648. The authors are with the School of Electrical and Computer Engineering at Purdue University. Email: {kkuwaran,lxin,sundara2}@purdue.edu. Both of the first two authors contributed equally to this paper.     Lei Xin, George Chiu, Shreyas Sundaram This research was supported by USDA grant 2018-67007-28439. This work represents the opinions of the authors and not the USDA or NIFA. Lei Xin and Shreyas Sundaram are with the Elmore Family School of Electrical and Computer Engineering, Purdue University. George Chiu is with the School of Mechanical Engineering, Purdue University. E-mails: {lxin, gchiu, sundara2}@purdue.edu.
Abstract

We consider the problem of learning the dynamics of autonomous linear systems (i.e., systems that are not affected by external control inputs) from observations of multiple trajectories of those systems, with finite sample guarantees. Existing results on learning rate and consistency of autonomous linear system identification rely on observations of steady state behaviors from a single long trajectory, and are not applicable to unstable systems. In contrast, we consider the scenario of learning system dynamics based on multiple short trajectories, where there are no easily observed steady state behaviors. We provide a finite sample analysis, which shows that the dynamics can be learned at a rate 𝒪(1N)\mathcal{O}(\frac{1}{\sqrt{N}}) for both stable and unstable systems, where NN is the number of trajectories, when the initial state of the system has zero mean (which is a common assumption in the existing literature). We further generalize our result to the case where the initial state has non-zero mean. We show that one can adjust the length of the trajectories to achieve a learning rate of 𝒪(logNN)\mathcal{O}(\sqrt{\frac{\log{N}}{N})} for strictly stable systems and a learning rate of 𝒪((logN)dN)\mathcal{O}(\frac{(\log{N})^{d}}{\sqrt{N}}) for marginally stable systems, where dd is some constant.

I Introduction

The system identification problem is to learn the parameters of a dynamical system, given the measurements of the inputs and outputs. While classical system identification techniques focused primarily on achieving asymptotic consistency [1, 2, 3], recent efforts have sought to characterize the number of samples needed to achieve a desired level of accuracy in the learned model. In particular, non-asymptotic analysis of system identification based on a single trajectory has been studied extensively over the past few years [4, 5, 6, 7, 8, 9].

In practice, performing system identification using a single trajectory could be problematic for many applications. For example, having the system run for a long time could incur risks when the system is unstable. Furthermore, only historical snippets of data about the system may be available, without the ability to easily observe long-run behavior. Additionally, in settings where one has the ability to restart the system or have several copies of the system running in parallel, one may obtain multiple trajectories generated by the system dynamics [10]. The paper [11] studies the sample complexity of identifying a system whose state is fully measured using only the final data points from multiple trajectories. Using a similar setup, the paper [12] explores the benefits of adding an 1\ell_{1} regularizer. The paper [13] studies the sample complexity of partially-measured system identification by including nuclear norm regularization, again only using the final samples from each trajectory. For partially-measured systems, the paper [14] allows for more efficient use of data. As mentioned in [14], compared to the single trajectory setup, the multiple trajectories setup usually allows for more direct application of concentration inequalities due to the assumption of independence over multiple trajectories.

In addition to the potential lack of single long trajectories, in many settings we may not be able to actually apply inputs to the system in order to perform system identification; this could be due to the costs of applying inputs, or due to the fact that we are simply observing an autonomous system that we cannot control. The uncontrolled system may also be serving as a subsystem connected to the main system that one wants to control, and having a better model of the subsystem could be useful in controlling the main system. For partially-measured systems, the characterization of finite sample error of purely stochastic systems (systems that are entirely driven by unmeasurable noise) is more challenging as indicated in [15]. There, the goal is to estimate the system matrices as well as the steady state Kalman filter gain of the corresponding system. The paper [15] shows that classical stochastic system identification algorithm can achieve a learning rate of 𝒪(1N¯)\mathcal{O}(\frac{1}{\sqrt{\bar{N}}}) (up to logarithmic factors) for both strictly stable and marginally stable systems, where N¯\bar{N} denotes the number of samples in a single trajectory.

In this paper, we are motivated by the challenge of system identification for partially-measured and autonomous stochastic linear systems (with no controlled inputs) as in [15], but for the case where a single long-run trajectory is unavailable. Existing results on consistency and learning rate of stochastic system identification algorithms (including [15]) typically convert the original system to a statistically equivalent form of the Kalman filter that is assumed to have reached steady state [1, 16, 15]. The analysis is then performed by assuming that the covariance matrix of the initial state of the system is the same as the steady state Kalman filter error covariance matrix, which simplifies the analysis. We note, however, that this assumption is invalid when one has no long run observation of the system trajectory, since it is in general unclear how long one should wait until the Kalman filter “converges” (even if it converges exponentially fast) for an unknown system. Furthermore, the available short trajectories may not be long enough to guarantee that the underlying filter has converged. Consequently, the single trajectory-based results cannot be directly applied to the multiple (short) trajectories case. Our goal in this paper is to estimate the system matrices (up to similarity transformations) using only multiple trajectories of transient responses of a partially-measured system that is entirely driven by noise.

I-A Contributions

Our work is inspired by recent work on stochastic system identification (with a single long trajectory) [15], and system identification with multiple trajectories (but with controlled inputs) [14], and extends them in the following ways.

  • We provide results on the sample complexity of learning the dynamics of an autonomous stochastic linear system using multiple trajectories, without assuming that the associated Kalman filter has reached steady state (i.e., the initial states can have arbitrary covariance matrix). Compared to [14] and [15], our results neither rely on controlled inputs, nor rely on observations of steady state behaviors of the system.

  • We provide the asymptotic learning rate of the multiple-trajectories-based stochastic system identification algorithm. If NN is the number of trajectories, we prove a learning rate of 𝒪(1N)\mathcal{O}(\frac{1}{\sqrt{N}}) when the initial state in each trajectory has zero mean (which is a common assumption in the existing literature). This rate is consistent with [14] and [15] (up to logarithmic factors). We further generalize the result to the case when the initial state in each trajectory has non-zero mean. In such case, we show that we can adjust the length of the regressor to achieve a learning rate of 𝒪(logNN)\mathcal{O}(\sqrt{\frac{\log{N}}{N})} for strictly stable systems and a learning rate of 𝒪((logN)dN)\mathcal{O}(\frac{(\log{N})^{d}}{\sqrt{N}}) for marginally stable systems, where dd is some constant.

II Mathematical notation and terminology

Let \mathbb{R} and \mathbb{N} denote the sets of real numbers and natural numbers, respectively. Let σn()\sigma_{n}(\cdot) and σmin()\sigma_{min}(\cdot) be the nn-th largest and smallest singular value, respectively, of a symmetric matrix. Similarly, let λmin()\lambda_{min}(\cdot) be the smallest eigenvalue of a symmetric matrix. We use * to denote the conjugate transpose of a given matrix. The spectral radius of a given matrix is denoted as ρ()\rho(\cdot). A square matrix AA is called strictly stable if ρ(A)<1\rho(A)<1, marginally stable if ρ(A)1\rho(A)\leq 1, and unstable if ρ(A)>1\rho(A)>1. We use \|\cdot\| to denote the spectral norm of a given matrix. Vectors are treated as column vectors. A Gaussian distributed random vector is denoted as u𝒩(μ,Σ)u\sim\mathcal{N}(\mu,\Sigma), where μ\mu is the mean and Σ\Sigma is the covariance matrix. We use II to denote the identity matrix. We use 𝔼\mathbb{E} to denote the expectation. The symbol t=ijAt\prod_{t=i}^{j}A_{t} is used to denote the matrix product, AiAi+1AjA_{i}A_{i+1}\cdots A_{j}. The symbol \dagger is used to denote the pseudoinverse.

III Problem formulation

Consider a discrete time linear time-invariant system with no user specified inputs:

xk+1\displaystyle x_{k+1} =Axk+wk,\displaystyle=Ax_{k}+w_{k}, yk\displaystyle y_{k} =Cxk+vk,\displaystyle=Cx_{k}+v_{k}, (1)

where xknx_{k}\in\mathbb{R}^{n}, ykmy_{k}\in\mathbb{R}^{m}, wknw_{k}\in\mathbb{R}^{n}, vkmv_{k}\in\mathbb{R}^{m}, An×nA\in\mathbb{R}^{n\times n} and Cm×nC\in\mathbb{R}^{m\times n}. The noise terms wkw_{k} and vkv_{k} are assumed to be i.i.d Gaussian, i.e., wk𝒩(0,Q)w_{k}\sim\mathcal{N}(0,Q), vk𝒩(0,R)v_{k}\sim\mathcal{N}(0,R). The initial state is also assumed to be independent of wkw_{k} and vkv_{k}, and is distributed as x0𝒩(μ,Σ0)x_{0}\sim\mathcal{N}(\mu,\Sigma_{0}). In addition, whether μ\mu is zero or non-zero is assumed to be known. If μ\mu is non-zero, the system matrix AA is assumed to be strictly stable or marginally stable. The system order nn is also assumed to be known. We refer to the above system as an autonomous stochastic linear system. We will make the following assumption.

Assumption 1

The output covariance matrix RR is positive definite. The pair (A,C)(A,C) is observable and (A,Q12)(A,Q^{\frac{1}{2}}) is controllable.

Under the above assumption, the Kalman Filter associated with system (1) is a system of the form

x^k+1\displaystyle\hat{x}_{k+1} =Ax^k+Kkek,\displaystyle=A\hat{x}_{k}+K_{k}e_{k}, yk\displaystyle y_{k} =Cx^k+ek,\displaystyle=C\hat{x}_{k}+e_{k}, (2)

where x^k\hat{x}_{k} is an estimate of state xkx_{k}, with the initial estimate being the mean of the initial state in system (1), i.e., x^0=μ\hat{x}_{0}=\mu. The sequence of matrices Kkn×mK_{k}\in\mathbb{R}^{n\times m} is called the Kalman gain, given by

Kk=APkC(CPkC+R)1,K_{k}=AP_{k}C^{*}(CP_{k}C^{*}+R)^{-1}, (3)

where the estimation error covariance Pkn×nP_{k}\in\mathbb{R}^{n\times n} is updated based on the Riccati equation

Pk+1=APkA+QAPkC(CPkC+R)1CPkA,P_{k+1}=AP_{k}A^{*}+Q-AP_{k}C^{*}(CP_{k}C^{*}+R)^{-1}CP_{k}A^{*},

with P0=Σ0P_{0}=\Sigma_{0}. Finally, ek=ykCx^ke_{k}=y_{k}-C\hat{x}_{k} are independent zero mean Gaussian innovations with covariance matrix given by

R¯k=CPkC+R.\displaystyle\bar{R}_{k}=CP_{k}C^{*}+R. (4)

Since the outputs of system (1) and system (2) have identical statistical properties [17], we will perform analysis on system (2). The subspace identification problem for stochastic systems that we tackle in this paper is to identify the system matrices (A,C)(A,C) up to a similarity transformation, given multiple trajectories of outputs of the system (1). As a byproduct, we will also simultaneously learn the Kalman filter gain KkK_{k} of the corresponding system, at some time step kk. In particular, we are interested in the quality of the estimates of (A,C)(A,C) given a finite number of samples.

IV Subspace identification technique

Here we describe a variant of classical subspace identification algorithm [17] to estimate (A,C)(A,C) matrices (up to a similarity transformation). We will first establish some definitions.

Suppose that we have access to NN independent output trajectories of system (1), each of some length TT\in\mathbb{N}, and each obtained right after restarting the system from an initial state x0𝒩(μ,Σ0)x_{0}\sim\mathcal{N}(\mu,\Sigma_{0}). We denote the data from these trajectories as {yki:1iN,0kT1}\{y^{i}_{k}:1\leq i\leq N,0\leq k\leq T-1\}, where the superscript denotes the trajectory index and the subscript denotes the time index. Let p+f=Tp+f=T, where p,fp,f are design parameters that satisfy p,f>np,f>n, where nn is the order of the system. We split the output samples from each output trajectory ii into past and future outputs with respect to pp, and denote the past output and future output vectors for trajectory ii as:

Yi[y0iy1iyp1i],\displaystyle Y_{-}^{i}\triangleq\begin{bmatrix}y_{0}^{i*}&y_{1}^{i*}&\cdots&y_{p-1}^{i*}\end{bmatrix}^{*}, (5)
Y+i[ypiyp+1iyp+f1i],\displaystyle Y_{+}^{i}\triangleq\begin{bmatrix}y_{p}^{i*}&y_{p+1}^{i*}&\cdots&y_{p+f-1}^{i*}\end{bmatrix}^{*},

respectively. The batch past output and batch future output matrices are formed by stacking all NN output trajectories:

Y[Y1Y2YN],\displaystyle Y_{-}\triangleq\begin{bmatrix}Y_{-}^{1}&Y_{-}^{2}&\cdots&Y_{-}^{N}\\ \end{bmatrix}, Y+[Y+1Y+2Y+N].\displaystyle Y_{+}\triangleq\begin{bmatrix}Y_{+}^{1}&Y_{+}^{2}&\cdots&Y_{+}^{N}\\ \end{bmatrix}. (6)

The past and future innovations Ei,E+i,E,E+E_{-}^{i},E_{+}^{i},E_{-},E_{+} are defined similarly, using the signals ekie_{k}^{i} rather than ykiy_{k}^{i}.

Let the batch matrix of initial states be X^0[x^01x^02x^0N].\hat{X}_{0}\triangleq\begin{bmatrix}\hat{x}_{0}^{1}&\hat{x}_{0}^{2}&\cdots&\hat{x}_{0}^{N}\\ \end{bmatrix}. Define the largest norm of innovation covariance matrices as ¯Tmaxt0,,T1R¯t,\mathcal{\bar{R}}_{T}\triangleq\max_{t\in 0,\dots,T-1}\|\bar{R}_{t}\|, where R¯t\bar{R}_{t} is defined in (4). For any l1l\geq 1, the extended observability matrix 𝒪lml×n\mathcal{O}_{l}\in\mathbb{R}^{ml\times n} and the reversed extended controllability matrix 𝒦pn×mp\mathcal{K}_{p}\in\mathbb{R}^{n\times mp} are defined as:

𝒪l[C(CA)(CAl1)],\displaystyle\mathcal{O}_{l}\triangleq\begin{bmatrix}C^{*}&(CA)^{*}&\cdots&(CA^{l-1})^{*}\end{bmatrix}^{*},
𝒦p[((AKp1C)(AK1C)K0)((AKp1C)Kp2)Kp1].\displaystyle\mathcal{K}_{p}\triangleq\begin{bmatrix}((A-K_{p-1}C)\cdots(A-K_{1}C)K_{0})^{*}\\ \vdots\\ ((A-K_{p-1}C)K_{p-2})^{*}\\ K_{p-1}^{*}\\ \end{bmatrix}^{*}.

Define

G𝒪f𝒦p.\displaystyle G\triangleq\mathcal{O}_{f}\mathcal{K}_{p}. (7)

Let Kn×mK\in\mathbb{R}^{n\times m} be the steady state Kalman gain K=APC(CPC+R)1K=APC^{*}(CPC^{*}+R)^{-1}, where Pn×nP\in\mathbb{R}^{n\times n} is the solution to the Riccati equation, P=APA+QAPC(CPC+R)1CPA.P=APA^{*}+Q-APC^{*}(CPC^{*}+R)^{-1}CPA^{*}. From Kalman filtering theory, the matrix AKCA-KC has spectral radius strictly less than 1 [18]. Denote the reversed extended controllability matrix formed by the steady state Kalman gain KK as

𝐊p[(AKC)p1K(AKC)p2KK].\displaystyle\mathbf{K}_{p}\triangleq\begin{bmatrix}(A-KC)^{p-1}K&(A-KC)^{p-2}K&\cdots&K\\ \end{bmatrix}.

We further make the following assumption.

Assumption 2

We have rank(𝒦p)=rank(𝐊p)=n\operatorname{rank}(\mathcal{K}_{p})=\operatorname{rank}(\mathbf{K}_{p})=n.

The rank condition on 𝐊p\mathbf{K}_{p} is standard, e.g., [3, 15]. The rank condition on 𝒦p\mathcal{K}_{p} is needed to ensure that GG has rank nn, which can be satisfied in practice by choosing pp to be relatively large if the rank condition on 𝐊p\mathbf{K}_{p} is satisfied (see Proposition 8 in the Appendix).

Finally, for any integer a0a\geq 0 and b2b\geq 2, define the block-Toeplitz matrix 𝒯bamb×mb\mathcal{T}_{b}^{a}\in\mathbb{R}^{mb\times mb} as:

𝒯ba[Im00CKaIm0CAb2KaCAb3Ka+1Im].\displaystyle\mathcal{T}_{b}^{a}\triangleq\begin{bmatrix}I_{m}&0&\cdots&0\\ CK_{a}&I_{m}&\cdots&0\\ \vdots&\vdots&\indent&\vdots\\ CA^{b-2}K_{a}&CA^{b-3}K_{a+1}&\cdots&I_{m}\\ \end{bmatrix}.

IV-A Linear regression

The subspace identification technique first uses linear regression to estimate GG from (7), which will subsequently form the basis for the recovery of the system parameters.

For any output trajectory i{1,,N}i\in\{1,\cdots,N\}, by iterating (2), the future output matrix Y+iY_{+}^{i} satisfies

Y+i=𝒪fx^pi+𝒯fpE+i.Y_{+}^{i}=\mathcal{O}_{f}\hat{x}_{p}^{i}+\mathcal{T}_{f}^{p}E_{+}^{i}. (8)

Note that at any time step kk, the state x^ki\hat{x}_{k}^{i} can be expressed from (2) as

x^ki\displaystyle\hat{x}_{k}^{i} =Ax^k1i+Kk1(yk1iCx^k1i)\displaystyle=A\hat{x}^{i}_{k-1}+K_{k-1}(y^{i}_{k-1}-C\hat{x}^{i}_{k-1})
=Kk1yk1i+(AKk1C)x^k1i.\displaystyle=K_{k-1}y^{i}_{k-1}+(A-K_{k-1}C)\hat{x}^{i}_{k-1}.

By expanding the above relationship recursively, we have

x^pi\displaystyle\hat{x}_{p}^{i} =Kp1yp1i++(AKp1C)(AK1C)K0y0i\displaystyle=K_{p-1}y_{p-1}^{i}+\cdots+(A-K_{p-1}C)\cdots(A-K_{1}C)K_{0}y_{0}^{i}
+(AKp1C)(AK0C)x^0i\displaystyle+(A-K_{p-1}C)\cdots(A-K_{0}C)\hat{x}_{0}^{i}
=𝒦pYi+(AKp1C)(AK0C)x^0i.\displaystyle=\mathcal{K}_{p}Y_{-}^{i}+(A-K_{p-1}C)\cdots(A-K_{0}C)\hat{x}_{0}^{i}.

By substituting the above equality into (8), the relationship between the batch output matrices is given by

Y+=GY+𝒪f(AKp1C)(AK0C)X^0+𝒯fpE+.Y_{+}=GY_{-}+\mathcal{O}_{f}(A-K_{p-1}C)\cdots(A-K_{0}C)\hat{X}_{0}+\mathcal{T}_{f}^{p}E_{+}.

An estimate of GG (motivated by the least squares approach) is

G^=Y+Y(YY)1.\displaystyle\hat{G}=Y_{+}Y_{-}^{*}(Y_{-}Y_{-}^{*})^{-1}. (9)

Consequently, the estimation error for matrix GG can be expressed as

G^G=\displaystyle\hat{G}-G= 𝒯fpE+Y(YY)1+\displaystyle\mathcal{T}_{f}^{p}E_{+}Y_{-}^{*}(Y_{-}Y_{-}^{*})^{-1}+ (10)
𝒪f(AKp1C)(AK0C)X^0Y(YY)1,\displaystyle\mathcal{O}_{f}(A-K_{p-1}C)\cdots(A-K_{0}C)\hat{X}_{0}Y_{-}^{*}(Y_{-}Y_{-}^{*})^{-1},

where the second term can be dropped if X^0=0\|\hat{X}_{0}\|=0, i.e., the initial state of system (1) has zero mean. When X^0\|\hat{X}_{0}\| is known to be non-zero but the system is marginally stable (ρ(A)1\rho(A)\leq 1), we can leverage the fact that the norm (AKp1C)(AK0C)\|(A-K_{p-1}C)\cdots(A-K_{0}C)\| converges to zero exponentially fast with pp (see Proposition 6 in the Appendix) by setting p=clogNp=c\log{N} for some positive constant cc, to make the second term go to zero asymptotically. The above steps are encapsulated in Algorithm 1.

Algorithm 1 Linear regression to calculate an estimate G^\hat{G} of GG

Input NN output trajectories {yki:1iN,0kT1}\{y^{i}_{k}:1\leq i\leq N,0\leq k\leq T-1\}, parameters p,fp,f

1:For each output trajectory i{1,,N}i\in\{1,\cdots,N\}, construct the past output and future output Yi,Y+iY_{-}^{i},Y_{+}^{i} as in (5).
2:Construct the batch past output and batch future output Y,Y+Y_{-},Y_{+} as in (6).
3:Return G^=Y+Y(YY)1\hat{G}=Y_{+}Y_{-}^{*}(Y_{-}Y_{-}^{*})^{-1}.
Remark 1

Note that the matrix GG itself can be used to predict future outputs from past outputs. The value GYiGY_{-}^{i} represents the Kalman prediction of the next ff outputs using the past pp measurements, assuming the initial state has zero mean. Its role is similar to the Markov parameters that map inputs to outputs in the case when one has measured inputs.

IV-B Balanced realization

The following balanced realization algorithm uses a standard Singular Value Decomposition to extract the estimated system matrices (A^,C^,K^p1)(\hat{A},\hat{C},\hat{K}_{p-1}) from the estimate G^\hat{G}.

Algorithm 2 Balanced realization to calculate estimates (A^,C^,K^p1)(\hat{A},\hat{C},\hat{K}_{p-1}) of (A,C,Kp1)(A,C,K_{p-1}) up to a similarity transformation

Input The estimate G^\hat{G}, parameters n,m,fn,m,f

1:Perform the Singular Value Decomposition: G^=[U^1U^2][Σ^100Σ^2][V^1V^2],\hat{G}=\begin{bmatrix}\hat{U}_{1}&\hat{U}_{2}\end{bmatrix}\begin{bmatrix}\hat{\Sigma}_{1}&0\\ 0&\hat{\Sigma}_{2}\\ \end{bmatrix}\begin{bmatrix}\hat{V}_{1}^{*}\\ \hat{V}_{2}^{*}\\ \end{bmatrix}, where Σ1n×n\Sigma_{1}\in\mathbb{R}^{n\times n} contains the nn-largest singular values of G^\hat{G}.
2:Compute the estimated observability matrix 𝒪^f=U^1Σ^112\hat{\mathcal{O}}_{f}=\hat{U}_{1}\hat{\Sigma}_{1}^{\frac{1}{2}}, and let the top mm rows of 𝒪^f\hat{\mathcal{O}}_{f} be C^\hat{C}.
3:Compute the estimated reversed extended controllability matrix 𝒦^p=Σ^112V^1\hat{\mathcal{K}}_{p}=\hat{\Sigma}_{1}^{\frac{1}{2}}\hat{V}_{1}^{*}, and let the last mm columns of 𝒦^p\hat{\mathcal{K}}_{p} be K^p1\hat{K}_{p-1}.
4:Compute A^=(𝒪^fu)𝒪^fl\hat{A}=(\hat{\mathcal{O}}_{f}^{u})^{\dagger}\hat{\mathcal{O}}_{f}^{l}, where 𝒪^fu\hat{\mathcal{O}}_{f}^{u} is the submatrix formed by the top m(f1)m(f-1) rows of 𝒪^f\hat{\mathcal{O}}_{f}, and 𝒪^fl\hat{\mathcal{O}}_{f}^{l} is the submatrix formed by dropping the first mm rows of 𝒪^f\hat{\mathcal{O}}_{f}.
5:Return (A^,C^,K^p1)(\hat{A},\hat{C},\hat{K}_{p-1}).

V Main results

In this section, we will present our main results on bounding the estimation error G^G\|\hat{G}-G\| from (10). We will show that the term (YY)1\|(Y_{-}Y_{-}^{*})^{-1}\| decreases with a rate of 𝒪(1N)\mathcal{O}(\frac{1}{N}), and then upper bound the growth rate of other terms in (10) separately. Using recent results on the balanced realization algorithm with the adjustments to accommodate the non-steady state Kalman filter, we then show that the estimation error of the system matrices A,C,Kp1A,C,K_{p-1} will also be bounded when the error G^G\|\hat{G}-G\| is small enough.

First, denote the covariance matrix of the weighted past innovation 𝒯p0Ei\mathcal{T}_{p}^{0}E_{-}^{i}, 1iN1\leq i\leq N, as:

ΣE\displaystyle\Sigma_{E} 𝔼[𝒯p0EiEi𝒯p0]=𝒯p0diag(R¯0,,R¯p1)𝒯p0.\displaystyle\triangleq\mathbb{E}[\mathcal{T}_{p}^{0}E_{-}^{i}E_{-}^{i*}\mathcal{T}_{p}^{0*}]=\mathcal{T}_{p}^{0}\operatorname{diag}(\bar{R}_{0},\cdots,\bar{R}_{p-1})\mathcal{T}_{p}^{0*}.

Let σEσmin(ΣE)\sigma_{E}\triangleq\sigma_{min}(\Sigma_{E}). We first show that the weighted innovation covariance matrix ΣE\Sigma_{E} is strictly positive definite. The proof is similar to [15, Lemma 2].

Proposition 1

Let σEσmin(ΣE)\sigma_{E}\triangleq\sigma_{min}(\Sigma_{E}). We have σEσmin(R)>0\sigma_{E}\geq\sigma_{min}(R)>0.

Proof:

For any output trajectory ii, its corresponding weighted past innovation 𝒯p0Ei\mathcal{T}_{p}^{0}E_{-}^{i} can be written as

𝒯p0Ei=Yi𝒪px^0i=𝒪p(x0ix^0i)+𝐓Wi+Vi,\displaystyle\mathcal{T}_{p}^{0}E_{-}^{i}=Y_{-}^{i}-\mathcal{O}_{p}\hat{x}_{0}^{i}=\mathcal{O}_{p}(x_{0}^{i}-\hat{x}_{0}^{i})+\mathbf{T}W_{-}^{i}+V_{-}^{i},

where WiW_{-}^{i} and ViV_{-}^{i} are the process and output noises respectively in system (1), and are defined similarly as YiY_{-}^{i}. Matrix 𝐓\mathbf{T} is a block-Toeplitz matrix which accounts for the weight of the process noise in system (1), and its explicit form is omitted in the interest of space. Since x0ix^0ix_{0}^{i}-\hat{x}_{0}^{i}, ViV_{-}^{i} and WiW_{-}^{i} are independent, we have

ΣE\displaystyle\Sigma_{E} =𝔼[𝒯p0EiEi𝒯p0]𝔼[ViVi]=diag(R,,R).\displaystyle=\mathbb{E}[\mathcal{T}_{p}^{0}E_{-}^{i}E_{-}^{i*}\mathcal{T}_{p}^{0*}]\succeq\mathbb{E}[V_{-}^{i}V_{-}^{i*}]=\operatorname{diag}(R,\cdots,R).

Hence, we have σEσmin(R)>0\sigma_{E}\geq\sigma_{min}(R)>0, where the second inequality comes from Assumption 1. ∎

Next we will show that the term (YY)1\|(Y_{-}Y_{-}^{*})^{-1}\| is decreasing with a rate of 𝒪(1N)\mathcal{O}(\frac{1}{N}). Since Y=𝒪pX^0+𝒯p0EY_{-}=\mathcal{O}_{p}\hat{X}_{0}+\mathcal{T}_{p}^{0}E_{-}, the explicit form of YYY_{-}Y_{-}^{*} is

YY\displaystyle Y_{-}Y_{-}^{*} =𝒪pX^0X^0𝒪p+𝒯p0EE𝒯p0\displaystyle=\mathcal{O}_{p}\hat{X}_{0}\hat{X}_{0}^{*}\mathcal{O}_{p}^{*}+\mathcal{T}_{p}^{0}E_{-}E_{-}^{*}\mathcal{T}_{p}^{0*} (11)
+𝒪pX^0E𝒯p0+𝒯p0EX^0𝒪p,\displaystyle+\mathcal{O}_{p}\hat{X}_{0}E_{-}^{*}\mathcal{T}_{p}^{0*}+\mathcal{T}_{p}^{0}E_{-}\hat{X}_{0}^{*}\mathcal{O}_{p}^{*},

and we will bound these terms separately.

We will rely on the following lemma from [11, Lemma 2], which provides a non-asymptotic lower bound of a standard Wishart matrix.

Lemma 1

Let ui𝒩(0,Imp)u_{i}\sim\mathcal{N}(0,I_{mp}), i=1,,Ni=1,\ldots,N be i.i.d random vectors. For any fixed δ>0\delta>0, we have

λmin(i=1Nuiui)Nmp2log1δ\sqrt{\lambda_{min}(\sum_{i=1}^{N}u_{i}u_{i}^{*}})\geq\sqrt{N}-\sqrt{mp}-\sqrt{2\log{\frac{1}{\delta}}}

with probability at least 1δ1-\delta.

Proposition 2

For any fixed δ>0\delta>0, let NN08mp+16log1δN\geq N_{0}\triangleq 8mp+16\log{\frac{1}{\delta}}. We have

𝒯p0EE𝒯p0N4σEImp\mathcal{T}_{p}^{0}E_{-}E_{-}^{*}\mathcal{T}_{p}^{0*}\succeq\frac{N}{4}\sigma_{E}I_{mp}

with probability at least 1δ1-\delta.

Proof:

For any output trajectory ii, note that the past innovation EiE_{-}^{i} has the same distribution as a single Gaussian random vector, Ei𝒩(0,diag(R¯0,,R¯p1))E_{-}^{i}\sim\mathcal{N}(0,\operatorname{diag}(\bar{R}_{0},\cdots,\bar{R}_{p-1})), since the innovations e0i,,ep1ie_{0}^{i},\cdots,e_{p-1}^{i} are independent zero-mean Gaussian random vectors with covariance matrices R¯0,,R¯p1\bar{R}_{0},\cdots,\bar{R}_{p-1}, respectively. We can rewrite EiE_{-}^{i} as

Ei=diag(R¯012,,R¯p112)𝐮i,\displaystyle E_{-}^{i}=\operatorname{diag}(\bar{R}_{0}^{\frac{1}{2}},\cdots,\bar{R}_{p-1}^{\frac{1}{2}})\mathbf{u}_{i},

where 𝐮i\mathbf{u}_{i} are i.i.d random vectors with 𝐮i𝒩(0,Imp)\mathbf{u}_{i}\sim\mathcal{N}(0,I_{mp}). Let U=[𝐮1𝐮2𝐮N]U_{-}=\begin{bmatrix}\mathbf{u}_{1}&\mathbf{u}_{2}&\cdots&\mathbf{u}_{N}\\ \end{bmatrix}. Fixing δ>0\delta>0 and applying Lemma 1, with probability of at least 1δ1-\delta, we have

λmin(UU)Nmp2log1δ.\displaystyle\sqrt{\lambda_{min}(U_{-}U_{-}^{*})}\geq\sqrt{N}-\sqrt{mp}-\sqrt{2\log{\frac{1}{\delta}}}. (12)

Considering the inequality 2(a2+b2)(a+b)22(a^{2}+b^{2})\geq(a+b)^{2} and the assumption that NN08mp+16log1δN\geq N_{0}\triangleq 8mp+16\log{\frac{1}{\delta}}, we have

2(mp+2log1δ)(mp+2log1δ)2,\displaystyle 2(mp+2\log{\frac{1}{\delta}})\geq(\sqrt{mp}+\sqrt{2\log{\frac{1}{\delta}}})^{2},
\displaystyle\Rightarrow N2mp+2log1δ.\displaystyle\frac{\sqrt{N}}{2}\geq\sqrt{mp}+\sqrt{2\log{\frac{1}{\delta}}}.

In conjunction with (12), this yields λmin(UU)12N\sqrt{\lambda_{min}(U_{-}U_{-}^{*})}\geq\frac{1}{2}\sqrt{N} with probability at least 1δ1-\delta. Hence, we have

UUN4Imp\displaystyle U_{-}U_{-}^{*}\succeq\frac{N}{4}I_{mp}

with probability of at least 1δ1-\delta.

Finally, multiplying by 𝒯p0diag(R¯012,,R¯p112)\mathcal{T}_{p}^{0}\operatorname{diag}(\bar{R}_{0}^{\frac{1}{2}},\cdots,\bar{R}_{p-1}^{\frac{1}{2}}) from the left and diag(R¯012,,R¯p112)𝒯p0\operatorname{diag}(\bar{R}_{0}^{\frac{1}{2}},\cdots,\bar{R}_{p-1}^{\frac{1}{2}})\mathcal{T}_{p}^{0*} from the right gives 𝒯p0EE𝒯p0N4ΣEN4σEImp\mathcal{T}_{p}^{0}E_{-}E_{-}^{*}\mathcal{T}_{p}^{0*}\succeq\frac{N}{4}\Sigma_{E}\succeq\frac{N}{4}\sigma_{E}I_{mp} with probability at least 1δ1-\delta. ∎

To bound the cross terms due to the possibly non-zero batch matrix of initial states in (11), X^0E\hat{X}_{0}E_{-}^{*}, we will be using the following lemma from [5, Lemma A.1].

Lemma 2

Let Mm×nM\in\mathbb{R}^{m\times n} be a matrix with mnm\geq n, and let η\eta\in\mathbb{R} be such that Mη\|M\|\leq\eta. Let Zm×kZ\in\mathbb{R}^{m\times k} be a matrix with independent standard normal entries. Then, for all t0t\geq 0, with probability at least 12exp(t22)1-2\exp(\frac{-t^{2}}{2}),

MZη(2(n+k)+t).\|M^{*}Z\|\leq\eta(\sqrt{2(n+k)}+t).
Proposition 3

For any fixed δ>0\delta>0, let γp¯T12(2(n+mp)+2log2δ)\gamma_{p}\triangleq\mathcal{\bar{R}}_{T}^{\frac{1}{2}}(\sqrt{2(n+mp)}+\sqrt{2\log{\frac{2}{\delta}}}), and γf¯T12(2(n+mf)+2log2δ)\gamma_{f}\triangleq\mathcal{\bar{R}}_{T}^{\frac{1}{2}}(\sqrt{2(n+mf)}+\sqrt{2\log{\frac{2}{\delta}}}). For NnN\geq n, each of these inequalities hold with probability at least 1δ1-\delta:

X^0EX^0γp,\|\hat{X}_{0}E_{-}^{*}\|\leq\|\hat{X}_{0}\|\gamma_{p},
X^0E+X^0γf.\|\hat{X}_{0}E_{+}^{*}\|\leq\|\hat{X}_{0}\|\gamma_{f}.
Proof:

We will only show the first inequality as the second one is almost identical. We can rewrite E=diag(R¯012,,R¯p112)UE_{-}=\operatorname{diag}(\bar{R}_{0}^{\frac{1}{2}},\cdots,\bar{R}_{p-1}^{\frac{1}{2}})U_{-}, where U=[𝐮1𝐮2𝐮N]U_{-}=\begin{bmatrix}\mathbf{u}_{1}&\mathbf{u}_{2}&\cdots&\mathbf{u}_{N}\\ \end{bmatrix}, where 𝐮i\mathbf{u}_{i} are i.i.d random vectors with 𝐮i𝒩(0,Imp)\mathbf{u}_{i}\sim\mathcal{N}(0,I_{mp}). Applying Lemma 2, we obtain

X^0E\displaystyle\|\hat{X}_{0}E_{-}^{*}\| =X^0Udiag(R¯012,,R¯p112)\displaystyle=\|\hat{X}_{0}U_{-}^{*}\operatorname{diag}(\bar{R}_{0}^{\frac{1}{2}},\cdots,\bar{R}_{p-1}^{\frac{1}{2}})\|
X^0¯T12(2(n+mp)+t)\displaystyle\leq\|\hat{X}_{0}\|\mathcal{\bar{R}}_{T}^{\frac{1}{2}}(\sqrt{2(n+mp)}+t)

with probability at least 12exp(t22)1-2\exp(\frac{-t^{2}}{2}). Finally, setting δ=2exp(t22)\delta=2\exp(\frac{-t^{2}}{2}), we have t=2log2δt=\sqrt{2\log{\frac{2}{\delta}}}. Plugging tt into the above inequality, we get the desired form. ∎

Now we are ready to show that (YY)1\|(Y_{-}Y_{-}^{*})^{-1}\| is decreasing with a rate of 𝒪(1N)\mathcal{O}({\frac{1}{N}}).

Lemma 3

Fix any δ>0\delta>0 and let Nmax{N0,N1}N\geq\max\{N_{0},N_{1}\}, where N0=8mp+16log1δN_{0}=8mp+16\log{\frac{1}{\delta}}, and N116𝒯p0𝒪pX^0γpσEN_{1}\triangleq\frac{16\|\mathcal{T}_{p}^{0}\|\|\mathcal{O}_{p}\|\|\hat{X}_{0}\|\gamma_{p}}{\sigma_{E}}. Define ζ𝒪pμμ𝒪p\zeta\triangleq\mathcal{O}_{p}\mu\mu^{*}\mathcal{O}_{p}^{*}. We have

(YY)18Nσmin(σEImp+8ζ)\|(Y_{-}Y_{-}^{*})^{-1}\|\leq\frac{8}{N\sigma_{min}(\sigma_{E}I_{mp}+8\zeta)}

with probability at least 12δ1-2\delta.

Proof:

Recall the explicit form of YYY_{-}Y_{-}^{*} in (11). Letting umpu\in\mathbb{R}^{mp} be an arbitrary unit vector, we can write

uYYu\displaystyle u^{*}Y_{-}Y_{-}^{*}u =u𝒪pX^0X^0𝒪pu+u𝒯p0EE𝒯p0u\displaystyle=u^{*}\mathcal{O}_{p}\hat{X}_{0}\hat{X}_{0}^{*}\mathcal{O}_{p}^{*}u+u^{*}\mathcal{T}_{p}^{0}E_{-}E_{-}^{*}\mathcal{T}_{p}^{0*}u
+u𝒪pX^0E𝒯p0u+u𝒯p0EX^0𝒪pu\displaystyle+u^{*}\mathcal{O}_{p}\hat{X}_{0}E_{-}^{*}\mathcal{T}_{p}^{0*}u+u^{*}\mathcal{T}_{p}^{0}E_{-}\hat{X}_{0}^{*}\mathcal{O}_{p}^{*}u
u𝒪pX^0X^0𝒪pu+u𝒯p0EE𝒯p0u\displaystyle\geq u^{*}\mathcal{O}_{p}\hat{X}_{0}\hat{X}_{0}^{*}\mathcal{O}_{p}^{*}u+u^{*}\mathcal{T}_{p}^{0}E_{-}E_{-}^{*}\mathcal{T}_{p}^{0*}u
2X^0E𝒯p0𝒪p,\displaystyle-2\|\hat{X}_{0}E_{-}^{*}\|\|\mathcal{T}_{p}^{0}\|\|\mathcal{O}_{p}\|,

where we used the Cauchy–Schwarz inequality. Fixing δ>0\delta>0, letting NN0N\geq N_{0}, applying Proposition 2 and Proposition 3, and using a union bound, we have

uYYu\displaystyle u^{*}Y_{-}Y_{-}^{*}u u𝒪pX^0X^0𝒪pu+N4σE2𝒯p0𝒪pX^0γp\displaystyle\geq u^{*}\mathcal{O}_{p}\hat{X}_{0}\hat{X}_{0}^{*}\mathcal{O}_{p}^{*}u+\frac{N}{4}\sigma_{E}-2\|\mathcal{T}_{p}^{0}\|\|\mathcal{O}_{p}\|\|\hat{X}_{0}\|\gamma_{p}

with probability at least 12δ1-2\delta.

Conditioning on the above event and letting NN1=16𝒯p0𝒪pX^0γpσEN\geq N_{1}=\frac{16\|\mathcal{T}_{p}^{0}\|\|\mathcal{O}_{p}\|\|\hat{X}_{0}\|\gamma_{p}}{\sigma_{E}}, we have

uYYu\displaystyle u^{*}Y_{-}Y_{-}^{*}u u𝒪pX^0X^0𝒪pu+N8σE\displaystyle\geq u^{*}\mathcal{O}_{p}\hat{X}_{0}\hat{X}_{0}^{*}\mathcal{O}_{p}^{*}u+\frac{N}{8}\sigma_{E}
=u𝒪pNμμ𝒪pu+uN8σEImpu,\displaystyle=u^{*}\mathcal{O}_{p}N\mu\mu^{*}\mathcal{O}_{p}^{*}u+u^{*}\frac{N}{8}\sigma_{E}I_{mp}u,

where the equality is due to the fact that x^0i=μ\hat{x}^{i}_{0}=\mu for all ii.

Consequently, we have

YY\displaystyle\ Y_{-}Y_{-}^{*} 𝒪pNμμ𝒪p+N8σEImp.\displaystyle\succeq\mathcal{O}_{p}N\mu\mu^{*}\mathcal{O}_{p}^{*}+\frac{N}{8}\sigma_{E}I_{mp}.

Taking the inverse we get the desired result. ∎

To see that NN will eventually be greater than N1N_{1} even if X^0\|\hat{X}_{0}\| is non-zero, note that when the system is strictly stable or marginally stable, 𝒯p0\|\mathcal{T}_{p}^{0}\| and 𝒪p\|\mathcal{O}_{p}\| will grow no faster than 𝒪(p𝐝)\mathcal{O}(p^{\mathbf{d}}) for some constant 𝐝\mathbf{d} (see Proposition 5 for 𝒯p0\|\mathcal{T}_{p}^{0}\|, and [15, Corollary E.1] for 𝒪p\|\mathcal{O}_{p}\|). Further, γp\gamma_{p} is 𝒪(p12)\mathcal{O}(p^{\frac{1}{2}}), and X^0=Nμ=𝒪(N)\|\hat{X}_{0}\|=\sqrt{N}\|\mu\|=\mathcal{O}(\sqrt{N}). Thus if p=𝒪(logN)p=\mathcal{O}(\log{N}), NN will eventually be greater than N1N_{1} as NN increases.

Now we will show that the term E+E\|E_{+}E_{-}^{*}\| is 𝒪(N)\mathcal{O}(\sqrt{N}). We will leverage the following lemma from[11, Lemma 1] to bound the product of the innovation terms.

Lemma 4

Let fimf_{i}\in\mathbb{R}^{m}, ging_{i}\in\mathbb{R}^{n} be independent random vectors fi𝒩(0,Σf)f_{i}\sim\mathcal{N}(0,\Sigma_{f}) and gi𝒩(0,Σg)g_{i}\sim\mathcal{N}(0,\Sigma_{g}), for i=1,,Ni=1,\cdots,N. Let N2(n+m)log1δN\geq 2(n+m)\log{\frac{1}{\delta}}. For any fixed δ>0\delta>0, we have

i=1Nfigi4Σf12Σg12N(m+n)log9δ.\|\sum_{i=1}^{N}f_{i}g_{i}^{*}\|\leq 4\|\Sigma_{f}\|^{\frac{1}{2}}\|\Sigma_{g}\|^{\frac{1}{2}}\sqrt{N(m+n)\log{\frac{9}{\delta}}}.

with probability at least 1δ1-\delta.

Proposition 4

For any fixed δ>0\delta>0, let NN22(mf+mp)log1δN\geq N_{2}\triangleq 2(mf+mp)\log{\frac{1}{\delta}}. We have

𝒯fpE+E𝒯p04𝒯fp𝒯p0¯TN(mf+mp)log9δ\displaystyle\|\mathcal{T}_{f}^{p}E_{+}E_{-}^{*}\mathcal{T}_{p}^{0*}\|\leq 4\|\mathcal{T}_{f}^{p}\|\|\mathcal{T}_{p}^{0}\|\mathcal{\bar{R}}_{T}\sqrt{N(mf+mp)\log{\frac{9}{\delta}}}

with probability at least 1δ1-\delta.

Proof:

Note that the columns of E+E_{+} are independent Gaussian random vectors, i.e., E+i𝒩(0,diag(R¯p,,R¯p+f1))E_{+}^{i}\sim\mathcal{N}(0,\operatorname{diag}(\bar{R}_{p},\cdots,\bar{R}_{p+f-1})). Similarly, the columns of EE_{-} are independent Gaussian random vectors, i.e., Ei𝒩(0,diag(R¯0,,R¯p1))E_{-}^{i}\sim\mathcal{N}(0,\operatorname{diag}(\bar{R}_{0},\cdots,\bar{R}_{p-1})). Further, E+iE_{+}^{i} and EiE_{-}^{i} are independent from classical results of Kalman filtering theory [18]. Applying Lemma 4, we get the desired result. ∎

With Proposition 4, we are now in place to prove the bound on the estimation error of GG .

Theorem 1 (Bound on estimation error of GG)

Consider the Kalman filter form (2) of system (1) under Assumptions 1 and 2, and let G{G} be defined as in (7). For any fixed δ>0\delta>0, let G^\hat{G} defined in (9) be the output of the linear regression described in Algorithm 1 given NN trajectories of outputs, where Nmax{N0,N1,N2}N\geq\max\{N_{0},N_{1},N_{2}\}, where N0=8mp+16log1δ,N1=16𝒯p0𝒪pX^0γpσE,N2=2(mf+mp)log1δN_{0}=8mp+16\log{\frac{1}{\delta}},N_{1}=\frac{16\|\mathcal{T}_{p}^{0}\|\|\mathcal{O}_{p}\|\|\hat{X}_{0}\|\gamma_{p}}{\sigma_{E}},N_{2}=2(mf+mp)\log{\frac{1}{\delta}}. We have:

G^Gϵ1Nσmin(σEImp+8ζ)+X^0ϵ2+X^02ϵ3Nσmin(σEImp+8ζ)\displaystyle\|\hat{G}-G\|\leq\frac{\epsilon_{1}}{\sqrt{N}\sigma_{min}(\sigma_{E}I_{mp}+8\zeta)}+\frac{\|\hat{X}_{0}\|\epsilon_{2}+\|\hat{X}_{0}\|^{2}\epsilon_{3}}{N\sigma_{min}(\sigma_{E}I_{mp}+8\zeta)}

with probability at least 14δ1-4\delta, where

ϵ1=32𝒯fp𝒯p0¯T(mf+mp)log9δ\displaystyle\epsilon_{1}=32\|\mathcal{T}_{f}^{p}\|\|\mathcal{T}_{p}^{0}\|\mathcal{\bar{R}}_{T}\sqrt{(mf+mp)\log{\frac{9}{\delta}}}
ϵ2=8γf𝒯fp𝒪p\displaystyle\epsilon_{2}=8\gamma_{f}\|\mathcal{T}_{f}^{p}\|\|\mathcal{O}_{p}\|
+8(AKp1C)(AK0C)γp𝒯p0𝒪f,\displaystyle+8\|(A-K_{p-1}C)\cdots(A-K_{0}C)\|\gamma_{p}\|\mathcal{T}_{p}^{0}\|\|\mathcal{O}_{f}\|,
ϵ3=8𝒪f𝒪p(AKp1C)(AK0C),\displaystyle\epsilon_{3}=8\|\mathcal{O}_{f}\|\|\mathcal{O}_{p}\|\|(A-K_{p-1}C)\cdots(A-K_{0}C)\|,
γp=¯T12(2(n+mp)+2log2δ),\displaystyle\gamma_{p}=\mathcal{\bar{R}}_{T}^{\frac{1}{2}}(\sqrt{2(n+mp)}+\sqrt{{2\log{\frac{2}{\delta}}}}),
γf=¯T12(2(n+mf)+2log2δ),ζ𝒪pμμ𝒪p.\displaystyle\gamma_{f}=\mathcal{\bar{R}}_{T}^{\frac{1}{2}}(\sqrt{2(n+mf)}+\sqrt{{2\log{\frac{2}{\delta}}}}),\quad\zeta\triangleq\mathcal{O}_{p}\mu\mu^{*}\mathcal{O}_{p}^{*}.
Proof:

Recall the expression of the error G^G\hat{G}-G in (10). We first bound the error term 𝒯fpE+Y(YY)1\mathcal{T}_{f}^{p}E_{+}Y_{-}^{*}(Y_{-}Y_{-}^{*})^{-1}. Using Y=𝒪pX^0+𝒯p0EY_{-}=\mathcal{O}_{p}\hat{X}_{0}+\mathcal{T}_{p}^{0}E_{-}, we have

𝒯fpE+Y(YY)1\displaystyle\|\mathcal{T}_{f}^{p}E_{+}Y_{-}^{*}(Y_{-}Y_{-}^{*})^{-1}\| 𝒯fpE+X^0𝒪p(YY)1\displaystyle\leq\|\mathcal{T}_{f}^{p}E_{+}\hat{X}_{0}^{*}\mathcal{O}_{p}^{*}\|\|(Y_{-}Y_{-}^{*})^{-1}\|
+𝒯fpE+E𝒯p0(YY)1.\displaystyle+\|\mathcal{T}_{f}^{p}E_{+}E_{-}^{*}\mathcal{T}_{p}^{0*}\|\|(Y_{-}Y_{-}^{*})^{-1}\|.

Fix δ>0\delta>0 and let Nmax{N0,N1,N2}N\geq\max\{N_{0},N_{1},N_{2}\}. Applying Proposition 3, Proposition 4 and Lemma 3 to the above inequality and using a union bound, we obtain

𝒯fpE+Y(YY)18X^0γf𝒯fp𝒪pNσmin(σEImp+8ζ)\displaystyle\|\mathcal{T}_{f}^{p}E_{+}Y_{-}^{*}(Y_{-}Y_{-}^{*})^{-1}\|\leq\frac{8\|\hat{X}_{0}\|\gamma_{f}\|\mathcal{T}_{f}^{p}\|\|\mathcal{O}_{p}\|}{N\sigma_{min}(\sigma_{E}I_{mp}+8\zeta)} (13)
+32𝒯fp𝒯p0¯T(mf+mp)log9δNσmin(σEImp+8ζ)\displaystyle+\frac{32\|\mathcal{T}_{f}^{p}\|\|\mathcal{T}_{p}^{0}\|\mathcal{\bar{R}}_{T}\sqrt{(mf+mp)\log{\frac{9}{\delta}}}}{\sqrt{N}\sigma_{min}(\sigma_{E}I_{mp}+8\zeta)}

with probability at least 14δ1-4\delta.

Second, we bound the error term 𝒪f(AKp1C)(AK0C)X^0Y(YY)1\mathcal{O}_{f}(A-K_{p-1}C)\cdots(A-K_{0}C)\hat{X}_{0}Y_{-}^{*}(Y_{-}Y_{-}^{*})^{-1}. We have

X^0Y=X^0X^0𝒪p+X^0E𝒯p0.\displaystyle\|\hat{X}_{0}Y_{-}^{*}\|=\|\hat{X}_{0}\hat{X}_{0}^{*}\mathcal{O}_{p}^{*}+\hat{X}_{0}E_{-}^{*}\mathcal{T}_{p}^{0*}\|.

Conditioning on the event X^0EX^0γp\|\hat{X}_{0}E_{-}^{*}\|\leq\|\hat{X}_{0}\|\gamma_{p} from Proposition 3, we have

X^0YX^02𝒪p+X^0γp𝒯p0.\displaystyle\|\hat{X}_{0}Y_{-}^{*}\|\leq\|\hat{X}_{0}\|^{2}\|\mathcal{O}_{p}\|+\|\hat{X}_{0}\|\gamma_{p}\|\mathcal{T}_{p}^{0}\|.

Conditioning on the above event and the event (YY)18Nσmin(σEImp+8ζ)\|(Y_{-}Y_{-}^{*})^{-1}\|\leq\frac{8}{N\sigma_{min}(\sigma_{E}I_{mp}+8\zeta)} from Lemma 3, we have

𝒪f(AKp1C)(AK0C)X^0Y(YY)1\displaystyle\|\mathcal{O}_{f}(A-K_{p-1}C)\cdots(A-K_{0}C)\hat{X}_{0}Y_{-}^{*}(Y_{-}Y_{-}^{*})^{-1}\|\leq (14)
𝒪f(AKp1C)(AK0C)X^0Y(YY)1\displaystyle\|\mathcal{O}_{f}\|\|(A-K_{p-1}C)\cdots(A-K_{0}C)\|\|\hat{X}_{0}Y_{-}^{*}\|\|(Y_{-}Y_{-}^{*})^{-1}\|
8𝒪f(AKp1C)(AK0C)X^02𝒪pNσmin(σEImp+8ζ)\displaystyle\leq\frac{8\|\mathcal{O}_{f}\|\|(A-K_{p-1}C)\cdots(A-K_{0}C)\|\|\hat{X}_{0}\|^{2}\|\mathcal{O}_{p}\|}{N\sigma_{min}(\sigma_{E}I_{mp}+8\zeta)}
+8𝒪f(AKp1C)(AK0C)X^0γp𝒯p0Nσmin(σEImp+8ζ).\displaystyle+\frac{8\|\mathcal{O}_{f}\|\|(A-K_{p-1}C)\cdots(A-K_{0}C)\|\|\hat{X}_{0}\|\gamma_{p}\|\mathcal{T}_{p}^{0}\|}{N\sigma_{min}(\sigma_{E}I_{mp}+8\zeta)}.

Finally, we combine the two upper bounds from (13) and (14) to get the desired form. ∎

Interpretation of Theorem 1.

Learning rate when X^0\|\hat{X}_{0}\| is zero, and the effects of trajectory length: When X^0=0\|\hat{X}_{0}\|=0, i.e., the initial state of the system (1) has zero mean, the upper bound of the error will not depend on ϵ2,ϵ3\epsilon_{2},\epsilon_{3}. Noting the dependencies on p,fp,f in ϵ1\epsilon_{1}, setting pp and ff to be small will generally result in a smaller error bound of GG, since we are estimating a smaller GG. However, p,fp,f should be greater than the order nn (and pp should also be large enough such that Assumption 2 is satisfied), so that Algorithm 2 can recover the system matrices from GG. The estimator G^\hat{G} can achieve a learning rate of 𝒪(1N)\mathcal{O}(\frac{1}{\sqrt{N}}). This rate is faster than the single trajectory case reported in [15] in that there are no logarithmic factors, and it applies to both stable and unstable systems. This confirms the benefits of being able to collect multiple independent trajectories starting from x0𝒩(0,Σ0)x_{0}\sim\mathcal{N}(0,\Sigma_{0}).

Learning rate when X^0\|\hat{X}_{0}\| is nonzero, and the effects of trajectory length: When X^0\|\hat{X}_{0}\| is nonzero, the error bound will depend on ϵ2,ϵ3\epsilon_{2},\epsilon_{3}. Note that X^0=Nμ\|\hat{X}_{0}\|=\sqrt{N}\|\mu\| when the initial state of each trajectory has mean μ\mu. The term X^02ϵ3Nσmin(σEImp+8ζ)\frac{\|\hat{X}_{0}\|^{2}\epsilon_{3}}{N\sigma_{min}(\sigma_{E}I_{mp}+8\zeta)} is 𝒪(1)\mathcal{O}(1) when pp is fixed. In such case, if the system is known to be marginally stable (ρ(A)1\rho(A)\leq 1), we can leverage the fact that the norm (AKp1C)(AK0C)\|(A-K_{p-1}C)\cdots(A-K_{0}C)\| in ϵ3\epsilon_{3} converges to zero exponentially fast with pp (see Proposition 6 in the Appendix), by setting p=clogNp=c\log{N} for some sufficiently large cc, to force the term (AKp1C)(AK0C)\|(A-K_{p-1}C)\cdots(A-K_{0}C)\| to go to zero no slower than 𝒪(1N)\mathcal{O}(\frac{1}{\sqrt{N}}). The term ¯T\mathcal{\bar{R}}_{T} is 𝒪(1)\mathcal{O}(1) since the Kalman filter converges [18]. For the same reason, by fixing a small f>nf>n, 𝒯fp\|\mathcal{T}_{f}^{p}\| is 𝒪(1)\mathcal{O}(1). In addition, 𝒯p0\|\mathcal{T}_{p}^{0}\| and 𝒪p\|\mathcal{O}_{p}\| are 𝒪(1)\mathcal{O}(1) for stable systems, and 𝒪(p𝐝)\mathcal{O}(p^{\mathbf{d}}) for some constant 𝐝\mathbf{d} for marginally stable systems (see Proposition 5 in the Appendix for 𝒯p0\|\mathcal{T}_{p}^{0}\|, and [15, Corollary E.1] for 𝒪p\|\mathcal{O}_{p}\|). As a result, the error will decrease with a rate of 𝒪(logNN)\mathcal{O}(\sqrt{\frac{\log{N}}{N}}) for strictly stable systems, and 𝒪((logN)dN)\mathcal{O}(\frac{(\log{N})^{d}}{\sqrt{N}}) for some constant dd for marginally stable systems, even if X^0\|\hat{X}_{0}\| is non-zero.

The next step shows that the realization error of system matrix estimates (A^,C^,K^p1)(\hat{A},\hat{C},\hat{K}_{p-1}) provided by Algorithm 2 is bounded. Based on our assumption that 𝒪f\mathcal{O}_{f} and 𝒦p\mathcal{K}_{p} have rank nn, the true GG also has rank nn. The proof of the following theorem entirely follows [15, Theorem  4], with the only difference being the replacement of steady state Kalman gain KK by non-steady state Kalman gain Kp1K_{p-1}.

Theorem 2 (Bound on realizations of system matrices)

Let GG and G^\hat{G} be defined in (7) and (9). Let the estimates based on G^\hat{G} using Algorithm 2 be 𝒪^f,𝒦^p,A^,C^,K^p1\hat{\mathcal{O}}_{f},\hat{\mathcal{K}}_{p},\hat{A},\hat{C},\hat{K}_{p-1}, and the corresponding matrices based on the true GG using Algorithm 2 be 𝒪~f,𝒦~p,A~,C~,K~p1\tilde{\mathcal{O}}_{f},\tilde{\mathcal{K}}_{p},\tilde{A},\tilde{C},\tilde{K}_{p-1}. If GG has rank nn and G^Gσn(G)4,\|\hat{G}-G\|\leq\frac{\sigma_{n}(G)}{4}, then there exists an orthonormal matrix 𝒯n×n\mathcal{T}\in\mathbb{R}^{n\times n} such that:

𝒪^f𝒪~f𝒯210nσn(G)G^G,\displaystyle\|\hat{\mathcal{O}}_{f}-\tilde{\mathcal{O}}_{f}\mathcal{T}\|\leq 2\sqrt{\frac{10n}{\sigma_{n}(G)}}\|\hat{G}-G\|,
C^C~𝒯𝒪^f𝒪~f𝒯,\displaystyle\|\hat{C}-\tilde{C}\mathcal{T}\|\leq\|\hat{\mathcal{O}}_{f}-\tilde{\mathcal{O}}_{f}\mathcal{T}\|,
A^𝒯A~𝒯G+σoσo2𝒪^f𝒪~f𝒯,\displaystyle\|\hat{A}-\mathcal{T}^{*}\tilde{A}\mathcal{T}\|\leq\frac{\sqrt{\|G\|}+\sigma_{o}}{\sigma_{o}^{2}}\|\hat{\mathcal{O}}_{f}-\tilde{\mathcal{O}}_{f}\mathcal{T}\|,
K^p1𝒯K~p1210nσn(G)G^G,\displaystyle\|\hat{K}_{p-1}-\mathcal{T}^{*}\tilde{K}_{p-1}\|\leq 2\sqrt{\frac{10n}{\sigma_{n}(G)}}\|\hat{G}-G\|,

where σomin(σn(𝒪^fu),σn(𝒪~fu))\sigma_{o}\triangleq min(\sigma_{n}(\hat{\mathcal{O}}_{f}^{u}),\sigma_{n}(\tilde{\mathcal{O}}_{f}^{u})). Recall that the notation 𝒪^fu,𝒪~fu\hat{\mathcal{O}}_{f}^{u},\tilde{\mathcal{O}}_{f}^{u} refers to the submatrix formed by the top m(f1)m(f-1) rows of the respective matrix.

Remark 2

Note that the matrices A~,C~,K~p1\tilde{A},\tilde{C},\tilde{K}_{p-1} are equivalent to the original A,C,Kp1A,C,K_{p-1} matrices up to a similarity transformation. As pp increases, G=𝒪f𝒦p\|G\|=\|\mathcal{O}_{f}\mathcal{K}_{p}\| is 𝒪(1)\mathcal{O}(1) since ff is fixed, and 𝒦p\|\mathcal{K}_{p}\| is also 𝒪(1)\mathcal{O}(1) (see Proposition 7 in the Appendix). From Proposition 8 in the Appendix, σn(G)\sigma_{n}(G) is lower bounded as pp increases. As suggested in[15, Remark  3], the random term σn(𝒪^fu)\sigma_{n}(\hat{\mathcal{O}}_{f}^{u}) in σo\sigma_{o} can be replaced by a deterministic bound as

σn(𝒪^fu)σn(𝒪~fu)𝒪^f𝒪~f𝒯.\displaystyle\sigma_{n}(\hat{\mathcal{O}}_{f}^{u})\geq\sigma_{n}(\tilde{\mathcal{O}}_{f}^{u})-\|\hat{\mathcal{O}}_{f}-\tilde{\mathcal{O}}_{f}\mathcal{T}\|.

Hence σo\sigma_{o} will be lower bounded by σn(𝒪~fu)2>0\frac{\sigma_{n}(\tilde{\mathcal{O}}_{f}^{u})}{2}>0 when the error 𝒪^f𝒪~f𝒯\|\hat{\mathcal{O}}_{f}-\tilde{\mathcal{O}}_{f}\mathcal{T}\| is small enough, where the inequality is due to the fact that we assumed the system is observable. Consequently, the term G+σoσo2\frac{\sqrt{\|G\|}+\sigma_{o}}{\sigma_{o}^{2}} is always 𝒪(1)\mathcal{O}(1).

As a result, all estimation errors of system matrices depend linearly on G^G\|\hat{G}-G\|, even if pp is increasing. Hence, the realization error will decrease at least as fast as 𝒪(1N)\mathcal{O}(\frac{1}{\sqrt{N}}) when X^0=0\|\hat{X}_{0}\|=0, and pp is fixed. When X^0\|\hat{X}_{0}\| is non-zero, the error can decrease at a rate of 𝒪(logNN)\mathcal{O}(\sqrt{\frac{\log{N}}{N}}) for strictly stable systems, and at a rate of 𝒪((logN)dN)\mathcal{O}(\frac{(\log{N})^{d}}{\sqrt{N}}) for some constant dd for marginally stable systems by setting p=clogNp=c\log{N} for some positive constant cc. Note that as pp goes to infinity, the matrix K^p1\hat{K}_{p-1} estimates the steady state Kalman gain 𝒯K~\mathcal{T}^{*}\tilde{K}.

On the other hand, the dependencies on σn(G)\sigma_{n}(G) and σn(𝒪^fu)\sigma_{n}(\hat{\mathcal{O}}_{f}^{u}) also show that the estimation error of system matrices depends on the “normalized estimation error” of GG. Consequently, although our bound suggests that setting p,fp,f to be small could potentially reduce the estimation error of GG (when μ=0\mu=0), it may not necessarily reduce the error of the system matrices. A similar issue also appears in the recovery of system matrices from Markov parameters [14]. It is of interest to study how trajectory length directly affects the realization error in future work.

VI Conclusion and future work

In this paper, we performed finite sample analysis of learning the dynamics of autonomous systems using multiple trajectories. Our results rely neither on controlled inputs, nor on observations of steady state behaviors of the system. We proved a learning rate that is consistent with [14] and [15] (up to logarithmic factors). Future work could focus on understanding how to effectively utilize multiple trajectories of varying lengths, and how other variants of the balanced realization algorithm affect the error.

Appendix A Appendix

A-A Auxiliary results

Some of the results here are standard, and we include them for completeness.

Lemma 5

([15, Lemma E.2]). Consider the series St=i=0tAiS_{t}=\sum_{i=0}^{t}\|A^{i}\|. If the matrix AA is strictly stable (ρ(A)<1\rho(A)<1), then St=𝒪(1)S_{t}=\mathcal{O}(1); if the matrix AA is marginally stable (ρ(A)=1\rho(A)=1), then St=𝒪(t𝐝)S_{t}=\mathcal{O}(t^{\mathbf{d}}), where 𝐝\mathbf{d} is the largest Jordan block of AA corresponding to a unit circle eigenvalue λ=1\|\lambda\|=1.

Proposition 5

The norm 𝒯p0\|\mathcal{T}_{p}^{0}\| is 𝒪(p𝐝)\mathcal{O}(p^{\mathbf{d}}) with pp for some constant 𝐝\mathbf{d} when the system matrix AA is marginally stable, and is 𝒪(1)\mathcal{O}(1) when the system matrix AA is strictly stable.

Proof:

Letting Kmax(p)=maxt0,,p2KtK_{max}(p)=\max_{t\in 0,\dots,p-2}\|K_{t}\|, where KtK_{t} is defined in (3). We have

𝒯p0Imp+CKmax(p)+\displaystyle\|\mathcal{T}_{p}^{0}\|\leq\|I_{mp}\|+\|C\|K_{max}(p)+
CAKmax(p)++CAp2Kmax(p)\displaystyle\|C\|\|A\|K_{max}(p)+\cdots+\|C\|\|A^{p-2}\|K_{max}(p)
=1+CKmax(p)i=0p2Ai.\displaystyle=1+\|C\|K_{max}(p)\sum_{i=0}^{p-2}\|A^{i}\|.

From Kalman filtering theory, the Kalman gain KtK_{t} converges to its steady state KK under Assumption 1. Hence Kmax(p)=𝒪(1)K_{max}(p)=\mathcal{O}(1). From Lemma 5, we have the above sum is 𝒪(1)\mathcal{O}(1) if AA is strictly stable, and 𝒪(p𝐝)\mathcal{O}(p^{\mathbf{d}}) when AA is marginally stable. ∎

Lemma 6

([19, Theorem 6.6]). Let U+A1,U+A2,U+A_{1},U+A_{2},\cdots be a sequence of n×nn\times n matrices. Given ϵ>0\epsilon>0, there is a δ(ϵ)\delta(\epsilon) such that if Akδ(ϵ)\|A_{k}\|\leq\delta(\epsilon) for all kk, then

(U+Ak)(U+A1)σ(ρ(U)+ϵ)k\displaystyle\|(U+A_{k})\cdots(U+A_{1})\|\leq\sigma(\rho(U)+\epsilon)^{k}

for some constant σ\sigma.

Proposition 6

For any fixed integer kk, where p1k0p-1\geq k\geq 0, we have

(AKp1C)(AKkC)=𝒪(ec0p),\displaystyle\|(A-K_{p-1}C)\cdots(A-K_{k}C)\|=\mathcal{O}(e^{-c_{0}p}),

for some positive constant c0c_{0}.

Proof:

From Kalman filtering theory, the Kalman gain KtK_{t} converges to its steady state KK, and the matrix AKCA-KC has spectral radius less than 1 under Assumption 1. Hence, we can write AKtC=AKC+ηtA-K_{t}C=A-KC+\eta_{t} for t0t\geq 0, where ηt\|\eta_{t}\| converges to 0. Pick ϵ\epsilon such that ϵ+ρ(AKC)<1\epsilon+\rho(A-KC)<1. To apply Lemma 6, let U=AKCU=A-KC, and let k+t(ϵ)k+t(\epsilon) be the smallest index such that ηtδ(ϵ)\|\eta_{t}\|\leq\delta(\epsilon) for all tk+t(ϵ)t\geq k+t(\epsilon). Letting pk+t(ϵ)+1p\geq k+t(\epsilon)+1, we have

(AKp1C)(AKkC)\displaystyle\|(A-K_{p-1}C)\cdots(A-K_{k}C)\|\leq
t=1pkt(ϵ)(U+ηpt)t=pkt(ϵ)+1pk(U+ηpt)\displaystyle\|\prod_{t=1}^{p-k-t(\epsilon)}(U+\eta_{p-t})\|\|\prod_{t=p-k-t(\epsilon)+1}^{p-k}(U+\eta_{p-t})\|
σ(ρ(U)+ϵ)pkt(ϵ)t=pkt(ϵ)+1pk(U+ηpt)\displaystyle\leq\sigma(\rho(U)+\epsilon)^{p-k-t(\epsilon)}\|\prod_{t=p-k-t(\epsilon)+1}^{p-k}(U+\eta_{p-t})\|
=𝒪(ec0p),\displaystyle=\mathcal{O}(e^{-c_{0}p}),

where the second inequality comes from Lemma 6. ∎

Proposition 7

The norm 𝒦p\|\mathcal{K}_{p}\| is 𝒪(1)\mathcal{O}(1) with pp.

Proof:

From Kalman filtering theory, the Kalman gain Kp1K_{p-1} converges to its steady state KK, and the matrix AKCA-KC has spectral radius less than 1 under Assumption 1. Hence for any t1t\geq 1 we can write AKtC=AKC+ηtA-K_{t}C=A-KC+\eta_{t}, where ηt\|\eta_{t}\| converges to 0. Let Kmax(p)=maxt0,,p1KtK_{max}(p)=\max_{t\in 0,\dots,p-1}\|K_{t}\|. We have Kmax(p)=𝒪(1)K_{max}(p)=\mathcal{O}(1) since KtK_{t} converges to KK. Pick ϵ\epsilon such that ϵ+ρ(AKC)<1\epsilon+\rho(A-KC)<1. To apply Lemma 6, let U=AKCU=A-KC, and let t(ϵ)t(\epsilon) be the smallest index such that ηtδ(ϵ)\|\eta_{t}\|\leq\delta(\epsilon) for all tt(ϵ)t\geq t(\epsilon). Letting pt(ϵ)+1p\geq t(\epsilon)+1, we have

𝒦pKmax(p)+Kmax(p)t=2pj=2t(U+ηpj+1)\displaystyle\|\mathcal{K}_{p}\|\leq K_{max}(p)+K_{max}(p)\sum_{t=2}^{p}\|\prod_{j=2}^{t}(U+\eta_{p-j+1})\|
=Kmax(p)+Kmax(p)t=pt(ϵ)+2pj=2t(U+ηpj+1)\displaystyle=K_{max}(p)+K_{max}(p)\sum_{t=p-t(\epsilon)+2}^{p}\|\prod_{j=2}^{t}(U+\eta_{p-j+1})\|
+Kmax(p)t=2pt(ϵ)+1j=2t(U+ηpj+1).\displaystyle+K_{max}(p)\sum_{t=2}^{p-t(\epsilon)+1}\|\prod_{j=2}^{t}(U+\eta_{p-j+1})\|.

From Proposition 6, we have

Kmax(p)t=pt(ϵ)+2pj=2t(U+ηpj+1)=𝒪(ec0p).\displaystyle K_{max}(p)\sum_{t=p-t(\epsilon)+2}^{p}\|\prod_{j=2}^{t}(U+\eta_{p-j+1})\|=\mathcal{O}(e^{-c_{0}p}). (15)

From Lemma 6, we have

Kmax(p)t=2pt(ϵ)+1j=2t(U+ηpj+1)\displaystyle K_{max}(p)\sum_{t=2}^{p-t(\epsilon)+1}\|\prod_{j=2}^{t}(U+\eta_{p-j+1})\|\leq (16)
Kmax(p)t=2pt(ϵ)+1σ(ρ(U)+ϵ)t1Kmax(p)σ1ρ(U)ϵ=𝒪(1),\displaystyle K_{max}(p)\sum_{t=2}^{p-t(\epsilon)+1}\sigma(\rho(U)+\epsilon)^{t-1}\leq\frac{K_{max}(p)\sigma}{1-\rho(U)-\epsilon}=\mathcal{O}(1),

where the second inequality comes from geometric series.

Finally, combining (15) and (16), we obtain 𝒦p=𝒪(1)\|\mathcal{K}_{p}\|=\mathcal{O}(1).

Proposition 8

Assume that rank(𝒪f)=rank(𝐊p)=n\operatorname{rank}(\mathcal{O}_{f})=\operatorname{rank}(\mathbf{K}_{p})=n, where nn is the order of the system. Fix any positive integer k,nk<pk,n\leq k<p. Let 𝐊ss\mathbf{K}_{ss} be the matrix formed by the last kk block columns of the reversed extended controllability matrix 𝐊p\mathbf{K}_{p}. For sufficiently large pp, we have the following inequalities:

σn(G)σn(𝒪f𝐊𝐬𝐬)2>0,\displaystyle\sigma_{n}(G)\geq\frac{\sigma_{n}(\mathcal{O}_{f}\mathbf{K_{ss}})}{2}>0,
σn(𝒦p)σn(𝐊𝐬𝐬)2>0.\displaystyle\sigma_{n}(\mathcal{K}_{p})\geq\frac{\sigma_{n}(\mathbf{K_{ss}})}{2}>0.
Proof:

We will only show the first inequality as the second one is similar. Recall the definition of GG in (7). We can rewrite G=[M𝒪f𝒦tv]G=[M\quad\mathcal{O}_{f}\mathcal{K}_{tv}], where 𝒦tv\mathcal{K}_{tv} is the matrix formed by the last kk block columns of 𝒦p\mathcal{K}_{p}, and MM is some residual matrix. We have

GG𝒪f𝒦tv𝒦tv𝒪f.\displaystyle GG^{*}\succeq\mathcal{O}_{f}\mathcal{K}_{tv}\mathcal{K}^{*}_{tv}\mathcal{O}_{f}^{*}.

Hence, we have

σn(G)\displaystyle\sigma_{n}(G) σn(𝒪f𝒦tv)=σn(𝒪f𝐊ss+𝒪f𝒦tv𝒪f𝐊ss)\displaystyle\geq\sigma_{n}(\mathcal{O}_{f}\mathcal{K}_{tv})=\sigma_{n}(\mathcal{O}_{f}\mathbf{K}_{ss}+\mathcal{O}_{f}\mathcal{K}_{tv}-\mathcal{O}_{f}\mathbf{K}_{ss})
σn(𝒪f𝐊ss)𝒪f𝒦tv𝐊ss,\displaystyle\geq\sigma_{n}(\mathcal{O}_{f}\mathbf{K}_{ss})-\|\mathcal{O}_{f}\|\|\mathcal{K}_{tv}-\mathbf{K}_{ss}\|,

where the last inequality comes from the application of [20, Theorem 3.3.16(c)]. From Kalman filtering theory, the Kalman gain KtK_{t} converges to its steady state KK under Assumption 1. Consequently, we have 𝒦tv𝐊ss\|\mathcal{K}_{tv}-\mathbf{K}_{ss}\| converges to zero as pp increases. We see that σn(G)\sigma_{n}(G) will eventually be lower bounded by σn(𝒪f𝐊ss)2>0\frac{\sigma_{n}(\mathcal{O}_{f}\mathbf{K}_{ss})}{2}>0 as pp increases, where the inequality comes from the assumption that 𝒪f\mathcal{O}_{f} and 𝐊p\mathbf{K}_{p} have full rank and the Cayley-Hamilton Theorem. ∎

References

  • [1] D. Bauer, M. Deistler, and W. Scherrer, “Consistency and asymptotic normality of some subspace algorithms for systems without observed inputs,” Automatica, vol. 35, no. 7, pp. 1243–1254, 1999.
  • [2] M. Jansson and B. Wahlberg, “On consistency of subspace methods for system identification,” Automatica, vol. 34, no. 12, pp. 1507–1519, 1998.
  • [3] T. Knudsen, “Consistency analysis of subspace identification methods based on a linear regression approach,” Automatica, vol. 37, no. 1, pp. 81–89, 2001.
  • [4] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht, “Learning without mixing: Towards a sharp analysis of linear system identification,” in Conference On Learning Theory.   PMLR, 2018, pp. 439–473.
  • [5] S. Oymak and N. Ozay, “Non-asymptotic identification of LTI systems from a single trajectory,” arXiv preprint arXiv:1806.05722, 2018.
  • [6] M. Simchowitz, R. Boczar, and B. Recht, “Learning linear dynamical systems with semi-parametric least squares,” in Conference on Learning Theory.   PMLR, 2019, pp. 2714–2802.
  • [7] T. Sarkar, A. Rakhlin, and M. A. Dahleh, “Finite-time system identification for partially observed LTI systems of unknown order,” arXiv preprint arXiv:1902.01848, 2019.
  • [8] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Finite time identification in unstable linear systems,” Automatica, vol. 96, pp. 342–353, 2018.
  • [9] T. Sarkar and A. Rakhlin, “Near optimal finite time identification of arbitrary linear dynamical systems,” in International Conference on Machine Learning.   PMLR, 2019, pp. 5610–5618.
  • [10] Y. Xing, B. Gravell, X. He, K. H. Johansson, and T. Summers, “Linear system identification under multiplicative noise from multiple trajectory data,” in American Control Conference (ACC).   IEEE, 2020, pp. 5157–5261.
  • [11] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,” Foundations of Computational Mathematics, pp. 1–47, 2019.
  • [12] S. Fattahi and S. Sojoudi, “Data-driven sparse system identification,” in 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2018, pp. 462–469.
  • [13] Y. Sun, S. Oymak, and M. Fazel, “Finite sample system identification: Optimal rates and the role of regularization,” in Learning for Dynamics and Control.   PMLR, 2020, pp. 16–25.
  • [14] Y. Zheng and N. Li, “Non-asymptotic identification of linear dynamical systems using multiple trajectories,” IEEE Control Systems Letters, vol. 5, no. 5, pp. 1693–1698, 2020.
  • [15] A. Tsiamis and G. J. Pappas, “Finite sample analysis of stochastic system identification,” in Conference on Decision and Control (CDC).   IEEE, 2019, pp. 3648–3654, arXiv:1903.09122.
  • [16] M. Deistler, K. Peternell, and W. Scherrer, “Consistency and relative efficiency of subspace methods,” Automatica, vol. 31, no. 12, pp. 1865–1875, 1995.
  • [17] P. Van Overschee and B. De Moor, Subspace identification for linear systems: Theory—Implementation—Applications.   Springer Science & Business Media, 2012.
  • [18] B. D. Anderson and J. B. Moore, Optimal filtering.   Courier Corporation, 2012.
  • [19] D. J. Hartfiel, Nonhomogeneous matrix products.   World Scientific, 2002.
  • [20] R. A. Horn and C. R. Johnson, “Topics in matrix analysis, 1991,” Cambridge University Presss, Cambridge, vol. 37, p. 39, 1991.