Learning the Dynamics of Autonomous Linear Systems From Multiple Trajectories
Abstract
We consider the problem of learning the dynamics of autonomous linear systems (i.e., systems that are not affected by external control inputs) from observations of multiple trajectories of those systems, with finite sample guarantees. Existing results on the learning rate and consistency of autonomous linear system identification rely on observations of steady state behaviors from a single long trajectory, and are not applicable to unstable systems. In contrast, we consider the scenario of learning system dynamics based on multiple short trajectories, where there are no easily observed steady state behaviors. We provide a finite sample analysis, which shows that the dynamics can be learned at a rate $\mathcal{O}(1/\sqrt{N})$ for both stable and unstable systems, where $N$ is the number of trajectories, when the initial state of the system has zero mean (which is a common assumption in the existing literature). We further generalize our result to the case where the initial state has non-zero mean. We show that one can adjust the length of the trajectories to achieve a learning rate of $\mathcal{O}(\sqrt{\log N/N})$ for strictly stable systems and a learning rate of $\mathcal{O}((\log N)^{c}/\sqrt{N})$ for marginally stable systems, where $c$ is some positive constant.
I Introduction
The system identification problem is to learn the parameters of a dynamical system, given the measurements of the inputs and outputs. While classical system identification techniques focused primarily on achieving asymptotic consistency [1, 2, 3], recent efforts have sought to characterize the number of samples needed to achieve a desired level of accuracy in the learned model. In particular, non-asymptotic analysis of system identification based on a single trajectory has been studied extensively over the past few years [4, 5, 6, 7, 8, 9].
In practice, performing system identification using a single trajectory could be problematic for many applications. For example, having the system run for a long time could incur risks when the system is unstable. Furthermore, only historical snippets of data about the system may be available, without the ability to easily observe long-run behavior. Additionally, in settings where one has the ability to restart the system or have several copies of the system running in parallel, one may obtain multiple trajectories generated by the system dynamics [10]. The paper [11] studies the sample complexity of identifying a system whose state is fully measured using only the final data points from multiple trajectories. Using a similar setup, the paper [12] explores the benefits of adding an $\ell_1$ regularizer. The paper [13] studies the sample complexity of partially-measured system identification by including nuclear norm regularization, again only using the final samples from each trajectory. For partially-measured systems, the paper [14] allows for more efficient use of data. As mentioned in [14], compared to the single trajectory setup, the multiple trajectories setup usually allows for more direct application of concentration inequalities due to the assumption of independence over multiple trajectories.
In addition to the potential lack of single long trajectories, in many settings we may not be able to actually apply inputs to the system in order to perform system identification; this could be due to the costs of applying inputs, or due to the fact that we are simply observing an autonomous system that we cannot control. The uncontrolled system may also be serving as a subsystem connected to the main system that one wants to control, and having a better model of the subsystem could be useful in controlling the main system. For partially-measured systems, the characterization of the finite sample error of purely stochastic systems (systems that are entirely driven by unmeasurable noise) is more challenging, as indicated in [15]. There, the goal is to estimate the system matrices as well as the steady state Kalman filter gain of the corresponding system. The paper [15] shows that the classical stochastic system identification algorithm can achieve a learning rate of $\mathcal{O}(1/\sqrt{N})$ (up to logarithmic factors) for both strictly stable and marginally stable systems, where $N$ denotes the number of samples in a single trajectory.
In this paper, we are motivated by the challenge of system identification for partially-measured and autonomous stochastic linear systems (with no controlled inputs) as in [15], but for the case where a single long-run trajectory is unavailable. Existing results on consistency and learning rate of stochastic system identification algorithms (including [15]) typically convert the original system to a statistically equivalent form of the Kalman filter that is assumed to have reached steady state [1, 16, 15]. The analysis is then performed by assuming that the covariance matrix of the initial state of the system is the same as the steady state Kalman filter error covariance matrix, which simplifies the analysis. We note, however, that this assumption is invalid when one has no long run observation of the system trajectory, since it is in general unclear how long one should wait until the Kalman filter “converges” (even if it converges exponentially fast) for an unknown system. Furthermore, the available short trajectories may not be long enough to guarantee that the underlying filter has converged. Consequently, the single trajectory-based results cannot be directly applied to the multiple (short) trajectories case. Our goal in this paper is to estimate the system matrices (up to similarity transformations) using only multiple trajectories of transient responses of a partially-measured system that is entirely driven by noise.
I-A Contributions
Our work is inspired by recent work on stochastic system identification (with a single long trajectory) [15], and system identification with multiple trajectories (but with controlled inputs) [14], and extends them in the following ways.
• We provide results on the sample complexity of learning the dynamics of an autonomous stochastic linear system using multiple trajectories, without assuming that the associated Kalman filter has reached steady state (i.e., the initial states can have an arbitrary covariance matrix). Compared to [14] and [15], our results neither rely on controlled inputs, nor rely on observations of steady state behaviors of the system.
• We provide the asymptotic learning rate of the multiple-trajectories-based stochastic system identification algorithm. If $N$ is the number of trajectories, we prove a learning rate of $\mathcal{O}(1/\sqrt{N})$ when the initial state in each trajectory has zero mean (which is a common assumption in the existing literature). This rate is consistent with [14] and [15] (up to logarithmic factors). We further generalize the result to the case when the initial state in each trajectory has non-zero mean. In that case, we show that we can adjust the length of the regressor to achieve a learning rate of $\mathcal{O}(\sqrt{\log N/N})$ for strictly stable systems and a learning rate of $\mathcal{O}((\log N)^{c}/\sqrt{N})$ for marginally stable systems, where $c$ is some positive constant.
II Mathematical notation and terminology
Let $\mathbb{R}$ and $\mathbb{N}$ denote the sets of real numbers and natural numbers, respectively. Let $\sigma_i(\cdot)$ and $\sigma_{\min}(\cdot)$ be the $i$-th largest and the smallest singular value, respectively, of a matrix. Similarly, let $\lambda_{\min}(\cdot)$ be the smallest eigenvalue of a symmetric matrix. We use $(\cdot)^{*}$ to denote the conjugate transpose of a given matrix. The spectral radius of a given matrix $A$ is denoted as $\rho(A)$. A square matrix $A$ is called strictly stable if $\rho(A) < 1$, marginally stable if $\rho(A) \le 1$, and unstable if $\rho(A) > 1$. We use $\|\cdot\|$ to denote the spectral norm of a given matrix. Vectors are treated as column vectors. A Gaussian distributed random vector is denoted as $\mathcal{N}(\mu, \Sigma)$, where $\mu$ is the mean and $\Sigma$ is the covariance matrix. We use $I$ to denote the identity matrix. We use $\mathbb{E}[\cdot]$ to denote the expectation. The symbol $\prod_{j=a}^{b} A_j$ is used to denote the matrix product $A_b A_{b-1} \cdots A_a$ (taken to be $I$ when $b < a$). The symbol $(\cdot)^{\dagger}$ is used to denote the pseudoinverse.
III Problem formulation
Consider a discrete time linear time-invariant system with no user specified inputs:

$x_{k+1} = A x_k + w_k, \quad y_k = C x_k + v_k,$ (1)

where $x_k \in \mathbb{R}^{n}$ is the state, $y_k \in \mathbb{R}^{m}$ is the output, $A \in \mathbb{R}^{n \times n}$, $C \in \mathbb{R}^{m \times n}$, and $k \in \mathbb{N}$. The noise terms $w_k$ and $v_k$ are assumed to be i.i.d. Gaussian, i.e., $w_k \sim \mathcal{N}(0, Q)$, $v_k \sim \mathcal{N}(0, R)$. The initial state $x_0$ is also assumed to be independent of $w_k$ and $v_k$, and is distributed as $x_0 \sim \mathcal{N}(\bar{x}_0, \Sigma_0)$. In addition, whether $\bar{x}_0$ is zero or non-zero is assumed to be known. If $\bar{x}_0$ is non-zero, the system matrix $A$ is assumed to be strictly stable or marginally stable. The system order $n$ is also assumed to be known. We refer to the above system as an autonomous stochastic linear system. We will make the following assumption.
Assumption 1
The output covariance matrix $R$ is positive definite. The pair $(A, C)$ is observable and the pair $(A, Q^{1/2})$ is controllable.
Under the above assumption, the Kalman filter associated with system (1) is a system of the form

$\hat{x}_{k+1} = A \hat{x}_k + K_k e_k, \quad y_k = C \hat{x}_k + e_k,$ (2)

where $\hat{x}_k$ is an estimate of the state $x_k$, with the initial estimate being the mean of the initial state in system (1), i.e., $\hat{x}_0 = \bar{x}_0$. The sequence of matrices $K_k$ is called the Kalman gain, given by

$K_k = A P_k C^{*} (C P_k C^{*} + R)^{-1},$ (3)

where the estimation error covariance $P_k$ is updated based on the Riccati equation

$P_{k+1} = A P_k A^{*} + Q - K_k (C P_k C^{*} + R) K_k^{*},$

with $P_0 = \Sigma_0$. Finally, $e_k$ are independent zero mean Gaussian innovations with covariance matrix $\Sigma_{e,k}$ given by

$\Sigma_{e,k} = C P_k C^{*} + R.$ (4)
Since the outputs of system (1) and system (2) have identical statistical properties [17], we will perform our analysis on system (2). The subspace identification problem for stochastic systems that we tackle in this paper is to identify the system matrices $A$ and $C$ up to a similarity transformation, given multiple trajectories of outputs of system (1). As a byproduct, we will also simultaneously learn the Kalman filter gain $K_k$ of the corresponding system, at some time step $k$. In particular, we are interested in the quality of these estimates given a finite number of samples.
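As a concrete numerical sketch of the filter recursion in (2)-(4), the following snippet iterates the Riccati update for a hypothetical two-state, one-output system. All matrices here are illustrative choices, not values from the paper; note that the recursion is initialized at an arbitrary $\Sigma_0$, not at the steady state error covariance.

```python
import numpy as np

# Hypothetical two-state, one-output system; A, C, Q, R, Sigma0 are
# illustrative choices, not values from the paper.
A = np.array([[0.9, 0.2], [0.0, 0.7]])
C = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2)      # process noise covariance
R = np.array([[0.5]])    # output noise covariance (positive definite)
Sigma0 = np.eye(2)       # initial state covariance (need not be the
                         # steady state error covariance)

def kalman_gains(A, C, Q, R, P0, steps):
    """Iterate the Riccati update; return the gains K_k and covariances P_k."""
    P, gains, covs = P0, [], []
    for _ in range(steps):
        S = C @ P @ C.T + R                  # innovation covariance
        K = A @ P @ C.T @ np.linalg.inv(S)   # time-varying Kalman gain
        P = A @ P @ A.T + Q - K @ S @ K.T    # Riccati update
        gains.append(K)
        covs.append(P)
    return gains, covs

gains, covs = kalman_gains(A, C, Q, R, Sigma0, 200)
```

For a detectable and stabilizable pair, the gains settle toward the steady state gain as the number of iterations grows, which is the convergence property the later analysis exploits.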
IV Subspace identification technique
Here we describe a variant of the classical subspace identification algorithm [17] to estimate the matrices $A$ and $C$ (up to a similarity transformation). We will first establish some definitions.
Suppose that we have access to $N$ independent output trajectories of system (1), each of some length $T$, and each obtained right after restarting the system from an initial state $x_0^{(i)} \sim \mathcal{N}(\bar{x}_0, \Sigma_0)$. We denote the data from these trajectories as $\{y_k^{(i)}\}$, where the superscript $i \in \{1, \ldots, N\}$ denotes the trajectory index and the subscript $k \in \{0, \ldots, T-1\}$ denotes the time index. Let $T = p + f$, where $p, f$ are design parameters that satisfy $p, f \ge n$, where $n$ is the order of the system. We split the $T$ output samples from each output trajectory into past and future outputs with respect to time step $p$, and denote the past output and future output vectors for trajectory $i$ as:

$y_p^{(i)-} = \begin{bmatrix} y_0^{(i)} \\ \vdots \\ y_{p-1}^{(i)} \end{bmatrix}, \quad y_f^{(i)+} = \begin{bmatrix} y_p^{(i)} \\ \vdots \\ y_{p+f-1}^{(i)} \end{bmatrix},$ (5)

respectively. The batch past output and batch future output matrices are formed by stacking all $N$ output trajectories:

$Y^{-} = \begin{bmatrix} y_p^{(1)-} & \cdots & y_p^{(N)-} \end{bmatrix}, \quad Y^{+} = \begin{bmatrix} y_f^{(1)+} & \cdots & y_f^{(N)+} \end{bmatrix}.$ (6)
The past and future innovations $E^{-}$ and $E^{+}$ are defined similarly, using the innovation signals $e_k^{(i)}$ rather than the outputs $y_k^{(i)}$.
Let the batch matrix of initial states be $X_0 = \begin{bmatrix} \bar{x}_0 & \cdots & \bar{x}_0 \end{bmatrix} \in \mathbb{R}^{n \times N}$ (recalling that $\hat{x}_0^{(i)} = \bar{x}_0$ for each trajectory $i$). Define the largest norm of the innovation covariance matrices as $\bar{\sigma}_e = \max_{0 \le k \le T-1} \|\Sigma_{e,k}\|$, where $\Sigma_{e,k}$ is defined in (4). For any $k \in \mathbb{N}$, the extended observability matrix $\mathcal{O}_k$ and the reversed extended controllability matrix $\mathcal{K}_k$ are defined as:

$\mathcal{O}_k = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{k-1} \end{bmatrix}, \quad \mathcal{K}_k = \begin{bmatrix} \prod_{j=1}^{k-1}(A - K_jC)K_0 & \prod_{j=2}^{k-1}(A - K_jC)K_1 & \cdots & K_{k-1} \end{bmatrix}.$

Define

$G = \mathcal{O}_f \mathcal{K}_p.$ (7)

Let $K$ be the steady state Kalman gain $K = APC^{*}(CPC^{*} + R)^{-1}$, where $P$ is the solution to the Riccati equation $P = APA^{*} + Q - APC^{*}(CPC^{*} + R)^{-1}CPA^{*}$. From Kalman filtering theory, the matrix $A - KC$ has spectral radius strictly less than 1 [18]. Denote the reversed extended controllability matrix formed by the steady state Kalman gain as

$\bar{\mathcal{K}}_k = \begin{bmatrix} (A - KC)^{k-1}K & (A - KC)^{k-2}K & \cdots & K \end{bmatrix}.$
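The property that the closed-loop matrix $A - KC$ is strictly stable at the steady state gain can be checked numerically. The sketch below, for an illustrative system (not taken from the paper), obtains the steady state gain by simply iterating the Riccati equation to convergence.

```python
import numpy as np

# Steady state gain for an illustrative (A, C, Q, R): iterate the Riccati
# equation to convergence; the closed-loop matrix A - K C should then have
# spectral radius strictly less than 1.
A = np.array([[0.95, 0.1], [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
Q = 0.2 * np.eye(2)
R = np.array([[1.0]])

P = np.eye(2)
for _ in range(500):
    S = C @ P @ C.T + R
    K = A @ P @ C.T @ np.linalg.inv(S)
    P = A @ P @ A.T + Q - K @ S @ K.T

rho = max(abs(np.linalg.eigvals(A - K @ C)))  # spectral radius of A - KC
```

Iterating the recursion is a simple stand-in for a dedicated discrete algebraic Riccati equation solver; either approach yields the same fixed point under Assumption 1.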
We further make the following assumption.
Assumption 2
We have $\operatorname{rank}(\mathcal{O}_f) = n$ and $\operatorname{rank}(\bar{\mathcal{K}}_n) = n$.
The rank condition on $\mathcal{O}_f$ is standard, e.g., [3, 15]. The rank condition on $\bar{\mathcal{K}}_n$ is needed to ensure that $\mathcal{K}_p$ has rank $n$, which can be achieved in practice by choosing $p$ to be relatively large when the rank condition on $\bar{\mathcal{K}}_n$ is satisfied (see Proposition 8 in the Appendix).
Finally, for any integer $k \ge 1$ and any $t \in \mathbb{N}$, define the block-Toeplitz matrix $T_{k,t}$ whose $(i,j)$-th $m \times m$ block is $I_m$ if $i = j$, $CA^{i-j-1}K_{t+j-1}$ if $i > j$, and $0$ if $i < j$:

$T_{k,t} = \begin{bmatrix} I_m & 0 & \cdots & 0 \\ CK_t & I_m & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ CA^{k-2}K_t & CA^{k-3}K_{t+1} & \cdots & I_m \end{bmatrix}.$
IV-A Linear regression
The subspace identification technique first uses linear regression to estimate the matrix $G$ from (7), which will subsequently form the basis for the recovery of the system parameters.
For any output trajectory $i$, by iterating (2), the future output vector satisfies

$y_f^{(i)+} = \mathcal{O}_f \hat{x}_p^{(i)} + T_{f,p}\, e_f^{(i)+}.$ (8)

Note that at any time step $k$, the state estimate can be expressed from (2) as

$\hat{x}_{k+1}^{(i)} = (A - K_kC)\hat{x}_k^{(i)} + K_k y_k^{(i)}.$

By expanding the above relationship recursively, we have

$\hat{x}_p^{(i)} = \mathcal{K}_p\, y_p^{(i)-} + \prod_{j=0}^{p-1}(A - K_jC)\,\hat{x}_0^{(i)}.$

By substituting the above equality into (8), the relationship between the batch output matrices is given by

$Y^{+} = G Y^{-} + \mathcal{O}_f \prod_{j=0}^{p-1}(A - K_jC)\, X_0 + T_{f,p} E^{+}.$

An estimate of $G$ (motivated by the least squares approach) is

$\hat{G} = Y^{+}\left(Y^{-}\right)^{\dagger}.$ (9)

Consequently, the estimation error for the matrix $G$ can be expressed as

$\hat{G} - G = T_{f,p} E^{+}\left(Y^{-}\right)^{\dagger} + \mathcal{O}_f \prod_{j=0}^{p-1}(A - K_jC)\, X_0 \left(Y^{-}\right)^{\dagger},$ (10)

where the second term can be dropped if $\bar{x}_0 = 0$, i.e., the initial state of system (1) has zero mean. When $\bar{x}_0$ is known to be non-zero but the system is strictly stable or marginally stable ($\rho(A) \le 1$), we can leverage the fact that the norm $\left\|\prod_{j=0}^{p-1}(A - K_jC)\right\|$ converges to zero exponentially fast with $p$ (see Proposition 6 in the Appendix) by setting $p = c_1 \log N$ for some positive constant $c_1$, to make the second term go to zero asymptotically. The above steps are encapsulated in Algorithm 1.
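The regression step above can be sketched in a few lines. The snippet simulates $N$ short trajectories of a hypothetical system (all matrices are illustrative, not from the paper), stacks past and future outputs column-wise as in (6), and forms the least squares estimate as in (9).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical system matrices (illustrative only, not from the paper).
A = np.array([[0.9, 0.1], [0.0, 0.6]])
C = np.array([[1.0, 0.5]])
q, r = 0.1, 0.1          # process / output noise variances (isotropic)
n, m = 2, 1
p, f = 3, 3              # past / future horizons (design parameters, >= n)
N = 500                  # number of independent trajectories

def simulate(T):
    """One autonomous trajectory of length T with zero-mean initial state."""
    x = rng.standard_normal(n)
    ys = []
    for _ in range(T):
        ys.append(C @ x + np.sqrt(r) * rng.standard_normal(m))
        x = A @ x + np.sqrt(q) * rng.standard_normal(n)
    return np.concatenate(ys)

# Column i holds trajectory i; the first m*p rows are past, the rest future.
Y = np.column_stack([simulate(p + f) for _ in range(N)])
Y_past, Y_future = Y[: m * p, :], Y[m * p :, :]

# Least squares estimate of G: regress future outputs on past outputs.
G_hat = Y_future @ np.linalg.pinv(Y_past)
```

Because the trajectories are independent, the columns of the batch matrices are i.i.d., which is what enables the concentration arguments in the analysis below.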
Remark 1
Note that the matrix $G$ itself can be used to predict future outputs from past outputs. The value $G y_p^{-}$ represents the Kalman prediction of the next $f$ outputs using the past $p$ measurements, assuming the initial state has zero mean. Its role is similar to that of the Markov parameters that map inputs to outputs in the case when one has measured inputs.
IV-B Balanced realization
The following balanced realization algorithm uses a standard Singular Value Decomposition to extract the estimated system matrices from the estimate $\hat{G}$.

Input: The estimate $\hat{G}$, parameters $p$, $f$, $n$.
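The SVD-based extraction can be illustrated in the noiseless case, where $G = \mathcal{O}_f \mathcal{K}_p$ exactly. The sketch below uses an illustrative observable pair $(A, C)$ and, for brevity, builds the controllability-type factor from a single fixed gain $K$ rather than the time-varying gains; it is not the paper's Algorithm 2 verbatim, but it shows the core SVD and shift-invariance steps.

```python
import numpy as np

# Noiseless sketch of the balanced realization step: factor G = O_f * K_p
# by SVD and recover (A, C) up to a similarity transformation. The pair
# (A, C) and the gain K are illustrative choices.
A = np.array([[0.8, 0.3], [0.0, 0.5]])
C = np.array([[1.0, 0.0]])
K = np.array([[0.5], [0.3]])
n, m, p, f = 2, 1, 3, 3

O = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(f)])
Abar = A - K @ C
Kp = np.hstack([np.linalg.matrix_power(Abar, p - 1 - j) @ K for j in range(p)])
G = O @ Kp                                   # rank n by construction

U, s, Vt = np.linalg.svd(G)
O_hat = U[:, :n] @ np.diag(np.sqrt(s[:n]))   # estimated observability matrix
C_hat = O_hat[:m, :]                         # top block recovers C (up to similarity)
# Shift invariance of the observability matrix: O_hat[:-m] @ A_hat = O_hat[m:].
A_hat = np.linalg.pinv(O_hat[:-m, :]) @ O_hat[m:, :]
```

Since the recovered $\hat{A}$ is similar to the true $A$, similarity invariants such as the eigenvalues are recovered exactly in this noiseless setting.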
V Main results
In this section, we will present our main results on bounding the estimation error $\hat{G} - G$ from (10). We will show that the term $\left\|\left(Y^{-}(Y^{-})^{*}\right)^{-1}\right\|$ decreases with a rate of $\mathcal{O}(1/N)$, and then upper bound the growth rate of the other terms in (10) separately. Using recent results on the balanced realization algorithm, with adjustments to accommodate the non-steady state Kalman filter, we then show that the estimation error of the system matrices will also be bounded when the error $\|\hat{G} - G\|$ is small enough.
First, denote the covariance matrix of the weighted past innovation $T_{p,0}\, e_p^{(i)-}$, $i \in \{1, \ldots, N\}$, as:

$\Sigma^{-} = T_{p,0}\,\mathbb{E}\!\left[e_p^{(i)-}\left(e_p^{(i)-}\right)^{*}\right]T_{p,0}^{*}.$

Let $\sigma_{-} = \lambda_{\min}(\Sigma^{-})$. We first show that the weighted innovation covariance matrix $\Sigma^{-}$ is strictly positive definite. The proof is similar to [15, Lemma 2].
Proposition 1
Let $\sigma_{-} = \lambda_{\min}(\Sigma^{-})$. We have $\sigma_{-} \ge \lambda_{\min}(R) > 0$.
Proof:
For any output trajectory , its corresponding weighted past innovation can be written as
where and are the process and output noises respectively in system (1), and are defined similarly as . Matrix is a block-Toeplitz matrix which accounts for the weight of the process noise in system (1), and its explicit form is omitted in the interest of space. Since , and are independent, we have
Hence, we have , where the second inequality comes from Assumption 1. ∎
Next we will show that the term $\left\|\left(Y^{-}(Y^{-})^{*}\right)^{-1}\right\|$ is decreasing with a rate of $\mathcal{O}(1/N)$. Since $Y^{-} = \mathcal{O}_p X_0 + T_{p,0} E^{-}$, the explicit form of $Y^{-}(Y^{-})^{*}$ is

$Y^{-}(Y^{-})^{*} = T_{p,0}E^{-}(E^{-})^{*}T_{p,0}^{*} + \mathcal{O}_p X_0 (E^{-})^{*}T_{p,0}^{*} + T_{p,0}E^{-}X_0^{*}\mathcal{O}_p^{*} + \mathcal{O}_p X_0 X_0^{*}\mathcal{O}_p^{*},$ (11)

and we will bound these terms separately.
We will rely on the following lemma from [11, Lemma 2], which provides a non-asymptotic lower bound of a standard Wishart matrix.
Lemma 1
Let $f_1, \ldots, f_N \in \mathbb{R}^{d}$ be i.i.d. $\mathcal{N}(0, I_d)$ random vectors. For any fixed $\delta \in (0, 1)$, we have

$\lambda_{\min}\!\left(\sum_{i=1}^{N} f_i f_i^{*}\right) \ge \left(\sqrt{N} - \sqrt{d} - \sqrt{2\log(1/\delta)}\right)^{2}$

with probability at least $1 - \delta$.
Proposition 2
For any fixed $\delta \in (0, 1)$, let $N \ge 8pm + 16\log(1/\delta)$. We have

$T_{p,0}E^{-}\left(E^{-}\right)^{*}T_{p,0}^{*} \succeq \frac{N}{4}\,\Sigma^{-}$

with probability at least $1 - \delta$.
Proof:
For any output trajectory , note that the past innovation has the same distribution as a single Gaussian random vector, , since the innovations are independent zero-mean Gaussian random vectors with covariance matrices , respectively. We can rewrite as
where are i.i.d random vectors with . Let . Fixing and applying Lemma 1, with probability of at least , we have
(12)
Considering the inequality and the assumption that , we have
In conjunction with (12), this yields with probability at least . Hence, we have
with probability of at least .
Finally, multiplying by from the left and from the right gives with probability at least . ∎
To bound the cross terms in (11) due to the possibly non-zero batch matrix of initial states $X_0$, we will use the following lemma from [5, Lemma A.1].
Lemma 2
Let be a matrix with , and let be such that . Let be a matrix with independent standard normal entries. Then, for all , with probability at least ,
Proposition 3
For any fixed , let , and . For , each of these inequalities hold with probability at least :
Proof:
We will only show the first inequality as the second one is almost identical. We can rewrite , where , where are i.i.d random vectors with . Applying Lemma 2, we obtain
with probability at least . Finally, setting , we have . Plugging into the above inequality, we get the desired form. ∎
Now we are ready to show that $\left\|\left(Y^{-}(Y^{-})^{*}\right)^{-1}\right\|$ is decreasing with a rate of $\mathcal{O}(1/N)$.
Lemma 3
Fix any and let , where , and . Define . We have
with probability at least .
Proof:
Recall the explicit form of in (11). Letting be an arbitrary unit vector, we can write
where we used the Cauchy–Schwarz inequality. Fixing , letting , applying Proposition 2 and Proposition 3, and using a union bound, we have
with probability at least .
Conditioning on the above event and letting , we have
where the equality is due to the fact that for all .
Consequently, we have
Taking the inverse we get the desired result. ∎
To see that will eventually be greater than even if is non-zero, note that when the system is strictly stable or marginally stable, and will grow no faster than for some constant (see Proposition 5 for , and [15, Corollary E.1] for ). Further, is , and . Thus if , will eventually be greater than as increases.
Now we will show that the term $\left\|E^{+}\left(Y^{-}\right)^{*}\right\|$ is $\mathcal{O}(\sqrt{N})$. We will leverage the following lemma from [11, Lemma 1] to bound the product of the innovation terms.
Lemma 4
Let $f_1, \ldots, f_N \in \mathbb{R}^{d_f}$ and $g_1, \ldots, g_N \in \mathbb{R}^{d_g}$ be independent random vectors with $f_i \sim \mathcal{N}(0, \Sigma_f)$ and $g_i \sim \mathcal{N}(0, \Sigma_g)$, for $1 \le i \le N$. Let $N \ge \frac{1}{2}(d_f + d_g)\log(9/\delta)$. For any fixed $\delta \in (0, 1)$, we have

$\left\|\sum_{i=1}^{N} f_i g_i^{*}\right\| \le 4\|\Sigma_f\|^{1/2}\|\Sigma_g\|^{1/2}\sqrt{N(d_f + d_g)\log(9/\delta)}$

with probability at least $1 - \delta$.
Proposition 4
For any fixed , let . We have
with probability at least .
Proof:
With Proposition 4, we are now in place to prove the bound on the estimation error of .
Theorem 1 (Bound on estimation error of $\hat{G}$)
Proof:
Recall the expression of the error in (10). We first bound the error term . Using , we have
Fix and let . Applying Proposition 3, Proposition 4 and Lemma 3 to the above inequality and using a union bound, we obtain
(13)
with probability at least .
Interpretation of Theorem 1.
Learning rate when $\bar{x}_0$ is zero, and the effects of trajectory length: When $\bar{x}_0 = 0$, i.e., the initial state of the system (1) has zero mean, the upper bound of the error will not depend on $X_0$. Noting the dependencies on $p$ and $f$ in the bound, setting $p$ and $f$ to be small will generally result in a smaller error bound for $\hat{G}$, since we are estimating a smaller $G$. However, $p$ and $f$ should be at least the order $n$ (and $p$ should also be large enough such that Assumption 2 is satisfied), so that Algorithm 2 can recover the system matrices from $\hat{G}$. The estimator can achieve a learning rate of $\mathcal{O}(1/\sqrt{N})$. This rate is faster than the single trajectory case reported in [15] in that there are no logarithmic factors, and it applies to both stable and unstable systems. This confirms the benefits of being able to collect multiple independent trajectories starting from time step 0.
Learning rate when $\bar{x}_0$ is nonzero, and the effects of trajectory length: When $\bar{x}_0$ is nonzero, the error bound will depend on $X_0$. Note that $\|X_0\| = \sqrt{N}\|\bar{x}_0\|$ when the initial state of each trajectory has mean $\bar{x}_0$, so the term $\|X_0\|$ is $\mathcal{O}(\sqrt{N})$ when $p$ is fixed. In such a case, if the system is known to be strictly stable or marginally stable ($\rho(A) \le 1$), we can leverage the fact that the norm $\left\|\prod_{j=0}^{p-1}(A - K_jC)\right\|$ converges to zero exponentially fast with $p$ (see Proposition 6 in the Appendix), by setting $p = c_1 \log N$ for some sufficiently large $c_1$, to force the corresponding term to go to zero no slower than $\mathcal{O}(1/\sqrt{N})$. The term $\|T_{f,p}\|$ is $\mathcal{O}(1)$ since the Kalman filter converges [18]. For the same reason, by fixing a small $f$, $\|\mathcal{O}_f\|$ is $\mathcal{O}(1)$. In addition, the remaining norms are $\mathcal{O}(1)$ for strictly stable systems, and $\mathcal{O}(p^{c})$ for some constant $c$ for marginally stable systems (see Proposition 5 in the Appendix for $\|\mathcal{K}_p\|$, and [15, Corollary E.1]). As a result, the error will decrease with a rate of $\mathcal{O}(\sqrt{\log N/N})$ for strictly stable systems, and $\mathcal{O}((\log N)^{c}/\sqrt{N})$ for some constant $c$ for marginally stable systems, even if $\bar{x}_0$ is non-zero.
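As a rough numerical illustration of the zero-mean rate, the following Monte Carlo sketch (a hypothetical strictly stable two-state system; all matrices are illustrative choices, not from the paper) builds the true $G$ from the time-varying Kalman gains and compares the regression error at two values of $N$; the error should shrink markedly as $N$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical strictly stable system (illustrative), zero-mean initial
# state; we check that the regression error shrinks as N grows.
A = np.array([[0.9, 0.1], [0.0, 0.6]])
C = np.array([[1.0, 0.5]])
q, r = 0.1, 0.1                      # isotropic noise variances
n, m, p, f = 2, 1, 2, 2

# True G = O_f * K_p built from the time-varying Kalman gains (P_0 = I).
P, gains = np.eye(n), []
for _ in range(p):
    S = C @ P @ C.T + r * np.eye(m)
    K = A @ P @ C.T @ np.linalg.inv(S)
    gains.append(K)
    P = A @ P @ A.T + q * np.eye(n) - K @ S @ K.T
cols = []
for k in range(p):
    M = gains[k]
    for j in range(k + 1, p):
        M = (A - gains[j] @ C) @ M   # reversed products of (A - K_j C)
    cols.append(M)
G = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(f)]) @ np.hstack(cols)

def estimate(N):
    """Least squares estimate of G from N independent short trajectories."""
    ys = []
    for _ in range(N):
        x = rng.standard_normal(n)   # x_0 ~ N(0, I), zero mean
        traj = []
        for _ in range(p + f):
            traj.append(C @ x + np.sqrt(r) * rng.standard_normal(m))
            x = A @ x + np.sqrt(q) * rng.standard_normal(n)
        ys.append(np.concatenate(traj))
    Y = np.column_stack(ys)
    return Y[m * p :, :] @ np.linalg.pinv(Y[: m * p, :])

err_small = np.linalg.norm(estimate(100) - G, 2)
err_large = np.linalg.norm(estimate(10000) - G, 2)
```

Since the innovations at future time steps are orthogonal to the past outputs, the population least squares solution is exactly $G = \mathcal{O}_f \mathcal{K}_p$, which is why the sample estimate converges to it.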
The next step shows that the realization error of the system matrix estimates provided by Algorithm 2 is bounded. Based on our assumption that $\mathcal{O}_f$ and $\bar{\mathcal{K}}_n$ have rank $n$, the true $G$ also has rank $n$. The proof of the following theorem entirely follows [15, Theorem 4], with the only difference being the replacement of the steady state Kalman gain $K$ by the non-steady state Kalman gains $K_k$.
Theorem 2 (Bound on realizations of system matrices)
Let and be defined in (7) and (9). Let the estimates based on using Algorithm 2 be , and the corresponding matrices based on the true using Algorithm 2 be . If has rank and then there exists an orthonormal matrix such that:
where . Recall that the notation refers to the submatrix formed by the top rows of the respective matrix.
Remark 2
Note that the matrices are equivalent to the original matrices up to a similarity transformation. As increases, is since is fixed, and is also (see Proposition 7 in the Appendix). From Proposition 8 in the Appendix, is lower bounded as increases. As suggested in [15, Remark 3], the random term in can be replaced by a deterministic bound as
Hence will be lower bounded by when the error is small enough, where the inequality is due to the fact that we assumed the system is observable. Consequently, the term is always .
As a result, all estimation errors of the system matrices depend linearly on $\|\hat{G} - G\|$, even if $p$ is increasing. Hence, the realization error will decrease at least as fast as $\mathcal{O}(1/\sqrt{N})$ when $\bar{x}_0 = 0$ and $p$, $f$ are fixed. When $\bar{x}_0$ is non-zero, the error can decrease at a rate of $\mathcal{O}(\sqrt{\log N/N})$ for strictly stable systems, and at a rate of $\mathcal{O}((\log N)^{c}/\sqrt{N})$ for some constant $c$ for marginally stable systems, by setting $p = c_1 \log N$ for some positive constant $c_1$. Note that as $p$ goes to infinity, the estimated gain matrix approaches an estimate of the steady state Kalman gain $K$.
On the other hand, the dependencies in Theorem 2 also show that the estimation error of the system matrices depends on the "normalized estimation error" of $\hat{G}$. Consequently, although our bound suggests that setting $p$ and $f$ to be small could potentially reduce the estimation error of $\hat{G}$ (when $\bar{x}_0 = 0$), it may not necessarily reduce the error of the system matrices. A similar issue also appears in the recovery of system matrices from Markov parameters [14]. It is of interest to study how the trajectory length directly affects the realization error in future work.
VI Conclusion and future work
In this paper, we performed a finite sample analysis of learning the dynamics of autonomous systems using multiple trajectories. Our results rely neither on controlled inputs, nor on observations of steady state behaviors of the system. We proved a learning rate of $\mathcal{O}(1/\sqrt{N})$ that is consistent with [14] and [15] (up to logarithmic factors), where $N$ is the number of trajectories. Future work could focus on understanding how to effectively utilize multiple trajectories of varying lengths, and how other variants of the balanced realization algorithm affect the error.
Appendix A Appendix
A-A Auxiliary results
Some of the results here are standard, and we include them for completeness.
Lemma 5
([15, Lemma E.2]). Consider the series $S_t = \sum_{k=0}^{t}\|A^k\|$. If the matrix $A$ is strictly stable ($\rho(A) < 1$), then $S_t = \mathcal{O}(1)$; if the matrix $A$ is marginally stable ($\rho(A) \le 1$), then $S_t = \mathcal{O}(t^{c})$, where $c$ is the size of the largest Jordan block of $A$ corresponding to a unit circle eigenvalue.
Proposition 5
The norm $\|\mathcal{K}_p\|$ is $\mathcal{O}(p^{c})$ for some constant $c$ when the system matrix $A$ is marginally stable, and is $\mathcal{O}(1)$ when the system matrix $A$ is strictly stable.
Proof:
Letting $\bar{A}_k = A - K_kC$, where $K_k$ is defined in (3), we have
Lemma 6
([19, Theorem 6.6]). Let $\{A_k\}$ be a sequence of matrices. Given $\epsilon > 0$, there is a $\delta > 0$ such that if $\|A_k - A\| \le \delta$ for all $k$, then

$\|A_k A_{k-1} \cdots A_1\| \le c\,(\rho(A) + \epsilon)^{k}$

for some constant $c$.
for some constant .
Proposition 6
For any fixed integer $p$, where $p \ge 1$, we have

$\left\|\prod_{j=0}^{p-1}\left(A - K_jC\right)\right\| \le c\,\bar{\rho}^{\,p}$

for some positive constant $c$ and some $\bar{\rho} < 1$.
Proof:
From Kalman filtering theory, the Kalman gain $K_k$ converges to its steady state $K$, and the matrix $A - KC$ has spectral radius less than 1 under Assumption 1. Hence, we can write $A - K_kC = (A - KC) + \Delta_k$ for $k \in \mathbb{N}$, where $\Delta_k$ converges to zero. Pick $\epsilon > 0$ such that $\bar{\rho} \triangleq \rho(A - KC) + \epsilon < 1$. To apply Lemma 6, let $\delta$ be the corresponding constant, and let $k_0$ be the smallest index such that $\|\Delta_k\| \le \delta$ for all $k \ge k_0$. We then have

$\left\|\prod_{j=0}^{p-1}(A - K_jC)\right\| \le \left\|\prod_{j=0}^{k_0-1}(A - K_jC)\right\| \cdot \left\|\prod_{j=k_0}^{p-1}(A - K_jC)\right\| \le c'\,\bar{\rho}^{\,p - k_0},$

where the second inequality comes from Lemma 6. ∎
Proposition 7
The norm $\|T_{f,p}\|$ is $\mathcal{O}(1)$ with $f$ fixed.
Proof:
From Kalman filtering theory, the Kalman gain converges to its steady state , and the matrix has spectral radius less than 1 under Assumption 1. Hence for any we can write , where converges to . Let . We have since converges to . Pick such that . To apply Lemma 6, let , and let be the smallest index such that for all . Letting , we have
From Proposition 6, we have
(15)
∎
Proposition 8
Assume that $\operatorname{rank}(\bar{\mathcal{K}}_n) = n$, where $n$ is the order of the system. Fix any positive integer $l \ge n$. Let $\mathcal{K}_p^{l}$ be the matrix formed by the last $l$ block columns of the reversed extended controllability matrix $\mathcal{K}_p$. For sufficiently large $p$, we have the following inequalities:
Proof:
We will only show the first inequality as the second one is similar. Recall the definition of in (7). We can rewrite , where is the matrix formed by the last block columns of , and is some residual matrix. We have
Hence, we have
where the last inequality comes from the application of [20, Theorem 3.3.16(c)]. From Kalman filtering theory, the Kalman gain $K_k$ converges to its steady state $K$ under Assumption 1. Consequently, the residual term converges to zero as $p$ increases. We see that the left hand side will eventually be lower bounded as $p$ increases, where the inequality comes from the rank assumptions and the Cayley-Hamilton Theorem. ∎
References
- [1] D. Bauer, M. Deistler, and W. Scherrer, “Consistency and asymptotic normality of some subspace algorithms for systems without observed inputs,” Automatica, vol. 35, no. 7, pp. 1243–1254, 1999.
- [2] M. Jansson and B. Wahlberg, “On consistency of subspace methods for system identification,” Automatica, vol. 34, no. 12, pp. 1507–1519, 1998.
- [3] T. Knudsen, “Consistency analysis of subspace identification methods based on a linear regression approach,” Automatica, vol. 37, no. 1, pp. 81–89, 2001.
- [4] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht, “Learning without mixing: Towards a sharp analysis of linear system identification,” in Conference On Learning Theory. PMLR, 2018, pp. 439–473.
- [5] S. Oymak and N. Ozay, “Non-asymptotic identification of LTI systems from a single trajectory,” arXiv preprint arXiv:1806.05722, 2018.
- [6] M. Simchowitz, R. Boczar, and B. Recht, “Learning linear dynamical systems with semi-parametric least squares,” in Conference on Learning Theory. PMLR, 2019, pp. 2714–2802.
- [7] T. Sarkar, A. Rakhlin, and M. A. Dahleh, “Finite-time system identification for partially observed LTI systems of unknown order,” arXiv preprint arXiv:1902.01848, 2019.
- [8] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Finite time identification in unstable linear systems,” Automatica, vol. 96, pp. 342–353, 2018.
- [9] T. Sarkar and A. Rakhlin, “Near optimal finite time identification of arbitrary linear dynamical systems,” in International Conference on Machine Learning. PMLR, 2019, pp. 5610–5618.
- [10] Y. Xing, B. Gravell, X. He, K. H. Johansson, and T. Summers, “Linear system identification under multiplicative noise from multiple trajectory data,” in American Control Conference (ACC). IEEE, 2020, pp. 5157–5261.
- [11] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,” Foundations of Computational Mathematics, pp. 1–47, 2019.
- [12] S. Fattahi and S. Sojoudi, “Data-driven sparse system identification,” in 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2018, pp. 462–469.
- [13] Y. Sun, S. Oymak, and M. Fazel, “Finite sample system identification: Optimal rates and the role of regularization,” in Learning for Dynamics and Control. PMLR, 2020, pp. 16–25.
- [14] Y. Zheng and N. Li, “Non-asymptotic identification of linear dynamical systems using multiple trajectories,” IEEE Control Systems Letters, vol. 5, no. 5, pp. 1693–1698, 2020.
- [15] A. Tsiamis and G. J. Pappas, “Finite sample analysis of stochastic system identification,” in Conference on Decision and Control (CDC). IEEE, 2019, pp. 3648–3654, arXiv:1903.09122.
- [16] M. Deistler, K. Peternell, and W. Scherrer, “Consistency and relative efficiency of subspace methods,” Automatica, vol. 31, no. 12, pp. 1865–1875, 1995.
- [17] P. Van Overschee and B. De Moor, Subspace identification for linear systems: Theory—Implementation—Applications. Springer Science & Business Media, 2012.
- [18] B. D. Anderson and J. B. Moore, Optimal filtering. Courier Corporation, 2012.
- [19] D. J. Hartfiel, Nonhomogeneous matrix products. World Scientific, 2002.
- [20] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis. Cambridge University Press, 1991.