FEDERAL UNIVERSITY OF MINAS GERAIS
Department of Electronic Engineering
Technical Report
Robust Bayesian Subspace Identification for Small Data Sets
1 Introduction
Model estimates obtained from traditional subspace identification methods may be subject to significant variance. This elevated variance is aggravated for large models or limited sample sizes. Common remedies for the effect of variance are regularized estimators, shrinkage estimators and Bayesian estimation. In the current work we investigate the latter two, which have not yet been applied to subspace identification. Our experimental results show that our proposed estimators may reduce the estimation risk to a fraction of that of traditional subspace methods.
In the scope of subspace identification, a number of related works explore regularization techniques such as nuclear norm minimization [1, 2, 3, 4, 5]. Under specific hyperparameter choices, regularized estimators can be understood in a Bayesian framework as maximum a posteriori estimators. These regularized estimators carry two important disadvantages in comparison to our approach: i) the choice of hyperparameters is not straightforward and it often does not take advantage of the problem statistics at hand; ii) they are not oriented toward risk minimization.
The use of regularized, shrinkage and Bayesian estimators in system identification is discussed in [6, 7, 4], where it is greeted as a promising and refreshing trend in this field. We are only aware of one work where a shrinking function is used for subspace identification [8]. Differently from our approach, the shrinking is there applied to all elements of the matrix and not only to the singular values. In a somewhat related application, [9] applies wavelet thresholding as a preprocessor in a subspace identification setting. Outside the field of system identification, we also find the application of similar methods to subspace identification in hyperspectral imaging [10].
If we regard subspace identification as a particular application of principal component analysis, our work can be understood as an application of robust principal component analysis, which finds many applications in image and video processing and in econometrics [11]. Bayesian solutions to this problem are found in [12, 13, 14]. Our alternating least squares method was conceived as an improvement upon the Bayesian robust principal component analysis of [15] and it is closely related to the tensor regression methods presented in [16, 17, 18, 19].
A different approach to subspace identification with finite samples is found in [20], where the focus is on deciding the number of samples needed to bound the estimation error with high probability.
2 A brief recapitulation of subspace identification
Consider the linear state space model in innovation form
(1)  $x_{k+1} = A x_k + B u_k + K e_k$
(2)  $y_k = C x_k + D u_k + e_k$
where $x_k$ is the state, $y_k$ is the measured variable, $u_k$ is the input variable and $e_k$ is the innovations process, assumed to be white Gaussian noise. The model parameters $A$, $B$, $C$, $D$ and $K$ are matrices with the appropriate dimensions. At times, it may be useful to express the same model in the predictor form
(3)  $x_{k+1} = A_K x_k + B_K u_k + K y_k$
(4)  $y_k = C x_k + D u_k + e_k$
where $A_K = A - KC$ and $B_K = B - KD$.
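As a concrete illustration of the innovation form (1)-(2), the sketch below simulates such a model; the function name, dimensions and noise level are illustrative placeholders rather than anything prescribed by the method.

```python
import numpy as np

def simulate_innovation_form(A, B, C, D, K, u, sigma_e=1.0, rng=None):
    """Simulate x_{k+1} = A x_k + B u_k + K e_k and y_k = C x_k + D u_k + e_k."""
    rng = np.random.default_rng() if rng is None else rng
    n, n_y, N = A.shape[0], C.shape[0], u.shape[1]
    x = np.zeros(n)
    y = np.zeros((n_y, N))
    for k in range(N):
        e = sigma_e * rng.standard_normal(n_y)      # white Gaussian innovations
        y[:, k] = C @ x + D @ u[:, k] + e
        x = A @ x + B @ u[:, k] + K @ e
    return y
```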
Following the formulation in [21], we make use of the extended state space model
(5)  $Y_f = \Gamma_f X_k + H_f U_f + G_f E_f$
and its predictor form
(6)  $Y_f = H_{fp} Z_p + H_f U_f + G_f E_f$
where the available data is arranged in Hankel matrices defined by
(7)  $Y_f = \begin{bmatrix} y_k & y_{k+1} & \cdots & y_{k+N-1} \\ y_{k+1} & y_{k+2} & \cdots & y_{k+N} \\ \vdots & & & \vdots \\ y_{k+f-1} & y_{k+f} & \cdots & y_{k+f+N-2} \end{bmatrix}$
and similarly are defined $U_f$ and $E_f$. Here $f$ is the future horizon and $N$ is a function of the size of the available data set. The state sequence is defined as
(8)  $X_k = \begin{bmatrix} x_k & x_{k+1} & \cdots & x_{k+N-1} \end{bmatrix}$
The past information is collected in $Z_p$ up to the past horizon $p$ and arranged as
(9)  $Z_p = \begin{bmatrix} z_{k-1} & z_k & \cdots & z_{k+N-2} \\ z_{k-2} & z_{k-1} & \cdots & z_{k+N-3} \\ \vdots & & & \vdots \\ z_{k-p} & z_{k-p+1} & \cdots & z_{k+N-p-1} \end{bmatrix}, \qquad z_k = \begin{bmatrix} u_k \\ y_k \end{bmatrix}$
and likewise for $U_p$ and $Y_p$.
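The Hankel structure above is purely a matter of data arrangement; the helper below is a minimal sketch of how such block Hankel matrices can be built (the function name and calling convention are our own).

```python
import numpy as np

def block_hankel(w, start, rows, cols):
    """Stack a signal w (shape: dim x T) into a block Hankel matrix with
    `rows` block rows and `cols` columns, starting at sample `start`."""
    dim = w.shape[0]
    H = np.zeros((rows * dim, cols))
    for i in range(rows):
        H[i * dim:(i + 1) * dim, :] = w[:, start + i:start + i + cols]
    return H

# Example: future output Hankel matrix with horizon f and N columns
# Y_f = block_hankel(y, start=k, rows=f, cols=N)
```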
As a consequence, $\Gamma_f$ is the extended observability matrix defined by
(10)  $\Gamma_f = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{f-1} \end{bmatrix}$
and $H_f$ and $G_f$ are Toeplitz matrices given by
(11)  $H_f = \begin{bmatrix} D & 0 & \cdots & 0 \\ CB & D & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ CA^{f-2}B & CA^{f-3}B & \cdots & D \end{bmatrix}$
and
(12)  $G_f = \begin{bmatrix} I & 0 & \cdots & 0 \\ CK & I & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ CA^{f-2}K & CA^{f-3}K & \cdots & I \end{bmatrix}$
where $\Sigma_e = \mathbb{E}[e_k e_k^\top]$ is the innovations covariance. The matrix $H_{fp}$ is reminiscent of a matrix of Markov parameters and is given by
(13)  $H_{fp} = \begin{bmatrix} h_{1,1} & h_{1,2} & \cdots & h_{1,p} \\ h_{2,1} & h_{2,2} & \cdots & h_{2,p} \\ \vdots & & & \vdots \\ h_{f,1} & h_{f,2} & \cdots & h_{f,p} \end{bmatrix}$
with
(14)  $h_{i,j} = C A_K^{\,i+j-2}\, \bar{B}_K$
for $i = 1, \ldots, f$, $j = 1, \ldots, p$ and $\bar{B}_K = \begin{bmatrix} B_K & K \end{bmatrix}$. These matrices can also be decomposed as products of the extended observability and controllability matrices:
(15)  $H_{fp} = \bar{\Gamma}_f \bar{\Delta}_p, \qquad \bar{\Gamma}_f = \begin{bmatrix} C \\ C A_K \\ \vdots \\ C A_K^{f-1} \end{bmatrix}, \qquad \bar{\Delta}_p = \begin{bmatrix} \bar{B}_K & A_K \bar{B}_K & \cdots & A_K^{p-1} \bar{B}_K \end{bmatrix}$

where the bars denote the predictor form counterparts.
Several estimation methods have been proposed in the literature departing from this formulation and exploiting the fact that the data as structured in (6) should lie in low dimensional subspaces (see [21, 22] for a comprehensive review). A common approach is to solve (6) via least squares and then estimate the extended observability matrix from a truncated singular value decomposition of $\hat{H}_{fp}$. The matrices $A$ and $C$ can then be recovered from estimates of the extended observability matrix and of the state sequence. The matrices $B$, $D$ and $K$ can then be obtained from a new least squares problem or from estimates of the Toeplitz matrices.
From a general perspective, most methods obtain an initial estimate $\hat{H}_{fp}$ from least squares and, given two weight matrices $W_1$ and $W_2$, obtain a low-dimensional estimate by truncating the singular value decomposition of $W_1 \hat{H}_{fp} W_2$ to the $\hat{n}$ largest components:
(16)  $W_1 \hat{H}_{fp} W_2 = U S V^\top \approx U_{\hat{n}} S_{\hat{n}} V_{\hat{n}}^\top$
In this context, $\hat{n}$ would be an estimate of the system order $n$. Methods of obtaining such an estimate of $n$ have been treated with difficulty in the literature, and one usually resorts to ad hoc solutions or to the visual inspection of the singular values.
Traditional subspace identification methods obtain an initial estimate of $H_{fp}$ from (6) by applying the least squares method as follows:
(17)  $\begin{bmatrix} \hat{H}_{fp} & \hat{H}_f \end{bmatrix} = Y_f \begin{bmatrix} Z_p \\ U_f \end{bmatrix}^\top \left( \begin{bmatrix} Z_p \\ U_f \end{bmatrix} \begin{bmatrix} Z_p \\ U_f \end{bmatrix}^\top \right)^{-1}$
Notice, however, that this estimate only approximates the maximum likelihood estimate, as the noise term in (6) is at best only approximately white and as $G_f$ is a lower triangular Toeplitz matrix.
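A minimal sketch of this least squares step, assuming the data has already been arranged into the Hankel matrices $Y_f$, $Z_p$ and $U_f$; the function name and the use of a generic lstsq solver instead of the explicit normal equations are our own choices.

```python
import numpy as np

def ls_estimate(Yf, Zp, Uf):
    """Regress the future outputs on past data and future inputs, as in (17).
    Returns the coefficient blocks associated with Zp and Uf."""
    Phi = np.vstack([Zp, Uf])                        # stacked regressor matrix
    # Solve Yf ~ Theta @ Phi in the least squares sense
    Theta, *_ = np.linalg.lstsq(Phi.T, Yf.T, rcond=None)
    Theta = Theta.T
    H_fp = Theta[:, :Zp.shape[0]]                    # coefficient of the past data
    H_f = Theta[:, Zp.shape[0]:]                     # coefficient of the future inputs
    return H_fp, H_f
```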
At this point it is worthwhile having a discussion on the approximations typically made in the process of solving (6). Given that $E_f$ is structured as a Hankel matrix as in (9), its components are not truly independent. Nevertheless, the off-diagonal correlations are reasonably sparse, which gives a reasonable approximation of a white matrix. In order to incorporate the Hankel and Toeplitz structure of $E_f$ and $G_f$ in the computation, one would have to vectorize equation (6):
(18)  $y_f = (Z_p^\top \otimes I)\, h_{fp} + (U_f^\top \otimes I)\, h_f + (I \otimes G_f)\, e_f$
where the lower case letters denote the vectorized versions of the corresponding matrices. In addition, $G_f$ and $E_f$ lie in lower dimensional subspaces. This can be handled by introducing known matrices that convert the original lower dimensional data into the vectorized versions of the Toeplitz matrix $G_f$ and the Hankel matrix $E_f$. This would result in the new equation
(19)
However, the least squares problem that arises in (19) is high dimensional, sparse and poorly conditioned numerically. In this regard, one of the virtues of the formulation in (6) is to avoid the numerical difficulties that would arise in a more detailed formulation.
3 Problem Description
Most subspace identification methods rely on the limit case of large enough data sets, $N \to \infty$. In addition, most approaches are based on least squares or, equivalently, on maximum likelihood estimators of $H_{fp}$. However, it is well known that such estimators perform poorly with regard to the estimation risk under quadratic loss (they are not admissible) when the number of parameters is larger than 2 [23, 24], which is classically known as the Stein phenomenon [25]. On the other hand, the number of parameters in $H_{fp}$ may be as large as $f n_y \cdot p (n_u + n_y)$, which may be considerably large. Even if we consider that $H_{fp}$ is low rank and may be parameterized and estimated in a smaller space, it would still depend on the order of $n (f n_y + p (n_u + n_y))$ parameters, which is still large in most applications.
In addition, there seems to be considerable difficulty in the literature in finding appropriate methods to estimate the system order $n$, which is of particular concern given the high sensitivity of the results to this parameter.
Within this context, our goal is to investigate robustified methods to estimate $H_{fp}$. In particular, two groups of methods shall be considered. The first group is comprised of the main shrinkage estimators of singular values available in the literature. The second group consists of Bayesian estimators that can be efficiently computed with a Gibbs sampling scheme.
Given that the estimation errors in the system matrices are upper bounded by a constant times the estimation error in $H_{fp}$, as demonstrated in [20], we concentrate our efforts on the estimators of $H_{fp}$ and leave an investigation of the system matrix estimators to future work.
4 Singular Value Shrinkage
Shrinkage estimators are biased estimators that shrink the maximum likelihood estimate towards a pre-assigned value (often to zero). These estimators were first proposed by Stein [25] and it has been demonstrated that this type of estimator may provide smaller estimation risk than the maximum likelihood estimator under quadratic loss when the number of parameters is larger than 2. The rationale behind such estimators is that, although shrinking introduces bias, it reduces variance and therefore it may be beneficial in reducing the overall quadratic error.
In the context of singular value decompositions, approaches based on random matrix theory obtain optimal shrinkage estimators in the asymptotic case of the matrix dimensions approaching infinity. This assumption may be far from adequate in the case of subspace identification with small data sets. Non-asymptotic results are also available, based on the minimization of unbiased estimates of the risk.
Most approaches consider the problem of estimating some matrix $X$ from a noisy measurement
(20)  $Y = X + \sigma Z$
where $\sigma$ is known and the noise matrix $Z$ is orthogonally invariant, i.e., its distribution is the same as that of $Q_1 Z Q_2$ for any orthogonal matrices $Q_1$ and $Q_2$. In practice, $Y$ is not a single measurement, but the maximum likelihood estimate obtained from multiple measurements, which gives a sufficient statistic based on all available measurements.
Considering the Frobenius norm squared as the risk loss function, and given that the noise is orthogonally invariant, it suffices to restrict our search to shrinkage estimators of the form:
(21)  $\hat{X} = U f(S) V^\top$
where $Y = U S V^\top$ is the singular value decomposition of $Y$ and $f(\cdot)$ is the shrinkage function.
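The sketch below illustrates the generic form (21): compute the SVD of the noisy matrix and apply an arbitrary shrinkage function to its singular values. The specific shrinkers discussed next can be plugged in through the shrink argument; the helper name is our own.

```python
import numpy as np

def shrink_singular_values(Y, shrink):
    """Apply a shrinkage function to the singular values of Y, as in (21)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = shrink(s)                 # element-wise shrinkage of the singular values
    return (U * s_shrunk) @ Vt
```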
4.1 Hard and Soft-Thresholding
In the asymptotic case of very large dimensions, [26] computed the optimal parameters for the hard and soft-thresholding functions:
(22)  $f_{\mathrm{hard}}(s_i) = s_i\, \mathbb{1}\{ s_i > \tau_{\mathrm{hard}} \}$
(23)  $f_{\mathrm{soft}}(s_i) = \left( s_i - \tau_{\mathrm{soft}} \right)_+$
where $\mathbb{1}\{\cdot\}$ is the indicator function, $(\cdot)_+$ is the positive-part function, the functions are applied element-wise on the singular values, and the optimal thresholds are
(24)  $\tau_{\mathrm{hard}} = \sqrt{ 2(\beta + 1) + \frac{8\beta}{\beta + 1 + \sqrt{\beta^2 + 14\beta + 1}} }\; \sqrt{n}\, \sigma$
and
(25)
with $\beta = m/n \le 1$ being the aspect ratio of the $m \times n$ matrix $Y$.
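The following sketch implements both thresholding rules on top of the generic helper above. The hard-threshold constant follows our reading of [26] for known noise level $\sigma$ and an $m \times n$ matrix with $m \le n$; the soft threshold is left as a user-supplied parameter.

```python
import numpy as np

def gavish_donoho_hard_threshold(sigma, m, n):
    """Asymptotically optimal hard threshold for an m x n matrix (m <= n)
    with known noise level sigma, per our reading of [26]."""
    beta = m / n
    lam = np.sqrt(2 * (beta + 1) + 8 * beta / (beta + 1 + np.sqrt(beta**2 + 14 * beta + 1)))
    return lam * np.sqrt(n) * sigma

def hard_threshold(s, tau):
    """Keep singular values above tau and zero out the rest, as in (22)."""
    return np.where(s > tau, s, 0.0)

def soft_threshold(s, tau):
    """Shift singular values down by tau and clip at zero, as in (23)."""
    return np.maximum(s - tau, 0.0)

# Usage with the generic shrinker:
# tau = gavish_donoho_hard_threshold(sigma_hat, *Y.shape)
# X_hat = shrink_singular_values(Y, lambda s: hard_threshold(s, tau))
```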
4.2 Optimal shrinkage in the asymptotic case
Still in the asymptotic case, the overall optimal shrinkage function was obtained in [27]. For the squared Frobenius norm loss function, the optimal shrinkage is given by
(26)  $f_{\mathrm{opt}}(s_i) = \sqrt{n}\,\sigma\, \eta\!\left( \frac{s_i}{\sqrt{n}\,\sigma} \right), \qquad \eta(y) = \begin{cases} \frac{1}{y}\sqrt{ \left( y^2 - \beta - 1 \right)^2 - 4\beta }, & y \ge 1 + \sqrt{\beta} \\ 0, & \text{otherwise} \end{cases}$
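A sketch of this shrinker as we read it from [27], with singular values rescaled to noise units $\sigma\sqrt{n}$ before shrinkage and rescaled back afterwards; the scaling convention and function name are our assumptions.

```python
import numpy as np

def optimal_frobenius_shrinker(s, sigma, m, n):
    """Optimal singular value shrinkage for the Frobenius loss, per our reading of [27].
    s: singular values of the m x n noisy matrix (m <= n), sigma: noise level."""
    beta = m / n
    y = s / (np.sqrt(n) * sigma)                   # singular values in noise units
    shrunk = np.zeros_like(y)
    above = y >= 1 + np.sqrt(beta)                 # values above the noise bulk edge
    shrunk[above] = np.sqrt((y[above]**2 - beta - 1)**2 - 4 * beta) / y[above]
    return shrunk * np.sqrt(n) * sigma             # back to the original scale
```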
4.3 SURE-based thresholding
In the non-asymptotic case, [28] obtained an expression for Stein's unbiased risk estimate (SURE) for estimators based on soft-thresholding functions. The threshold values can then be obtained by minimizing the SURE risk estimate. For known $\sigma$ and the Frobenius norm squared loss, the unbiased risk estimate is given by
(27)
where $s_i$ denotes the $i$-th singular value of $Y$ and $\tau$ is the threshold. Given $\sigma$, the risk estimate can be easily minimized since it is a piecewise quadratic function of $\tau$.
4.4 Subspace identification with shrinkage estimators
Following the traditional subspace identification methods, we obtain the initial estimate of $H_{fp}$ using the least squares estimate of (17). This estimate offers the advantage of being unbiased and of having a covariance that can be easily characterized. This facilitates framing the problem of singular value shrinking in the format given by (20).
In the context of shrinkage estimators, one could naturally propose to also shrink the parameters estimated in (17) by replacing the linear regression with a ridge regression. However, we would then have a biased estimate of $H_{fp}$, and this would make it harder to explore the theory of shrinkage estimators for singular values.
Our goal is to minimize the risk function
(28)  $\mathcal{R} = \mathbb{E}\left[ \left\| W_1 \left( \hat{H}_{fp} - H_{fp} \right) W_2 \right\|_F^2 \right]$
Within the domain of the shrinkage estimators presented above, we shall minimize the risk in (28) if we shrink the singular values of $W_1 \hat{H}_{fp} W_2$. However, the noise level $\sigma$ is assumed known in (20). This means that we still must obtain a reasonable estimate for it.
Defining the projection matrix onto the orthogonal complement of the row space of $U_f$ as $\Pi_{U_f}^{\perp}$, the estimate in (17) is equivalent to [21]
(29)  $\hat{H}_{fp} = Y_f\, \Pi_{U_f}^{\perp} Z_p^\top \left( Z_p \Pi_{U_f}^{\perp} Z_p^\top \right)^{-1}$
Substituting $Y_f$ as in (6), this leads to
(30)  $\hat{H}_{fp} = H_{fp} + G_f E_f\, \Pi_{U_f}^{\perp} Z_p^\top \left( Z_p \Pi_{U_f}^{\perp} Z_p^\top \right)^{-1}$
Multiplying by the weight matrices, we have
(31)  $W_1 \hat{H}_{fp} W_2 = W_1 H_{fp} W_2 + W_1 G_f E_f\, \Pi_{U_f}^{\perp} Z_p^\top \left( Z_p \Pi_{U_f}^{\perp} Z_p^\top \right)^{-1} W_2$
Unless we carefully pick the weight matrices, the noise term in (31) is not orthogonally invariant white noise as assumed in (20). If we consider a white approximation for the vectorized noise $\mathrm{vec}(E_f)$, then the covariance matrix for the vectorization of the noise term in (31) would be
(32)  $\left( W_2^\top \left( Z_p \Pi_{U_f}^{\perp} Z_p^\top \right)^{-1} W_2 \right) \otimes \left( W_1 G_f \left( I_f \otimes \Sigma_e \right) G_f^\top W_1^\top \right)$
To make this covariance orthogonally invariant, one would choose $W_1 = \left( G_f (I_f \otimes \Sigma_e) G_f^\top \right)^{-1/2}$ and $W_2 = \left( Z_p \Pi_{U_f}^{\perp} Z_p^\top \right)^{1/2}$. This is closely related to the weight choice of the popular CVA algorithm for subspace identification [21]. Nonetheless, there might be reasons to choose different weights. For example, the N4SID algorithm uses weights chosen to penalize the prediction error. Alternatively, one might prefer to penalize errors in the input-output relation and choose the weights accordingly.
For general weight matrices, we identify the noise level with the largest possible noise level in a given direction, i.e.,
(33)  $\sigma^2 = \lambda_{\max}\!\left( W_2^\top \left( Z_p \Pi_{U_f}^{\perp} Z_p^\top \right)^{-1} W_2 \right) \lambda_{\max}\!\left( W_1 G_f \left( I_f \otimes \Sigma_e \right) G_f^\top W_1^\top \right)$
In order to estimate $\sigma^2$, we compute the residues
(34)  $\hat{E}_r = Y_f - \hat{H}_{fp} Z_p - \hat{H}_f U_f$
and construct the estimate
(35)  $\widehat{G_f \left( I_f \otimes \Sigma_e \right) G_f^\top} = \frac{1}{\nu}\, \hat{E}_r \hat{E}_r^\top$
where $\nu$ is the number of degrees of freedom of the linear regression. The estimate in (35) is enough to obtain $\sigma^2$ from (33). If we further desire to specify $G_f$ and $\Sigma_e$ separately, one simple approach is to obtain the Cholesky decomposition of (35) and then average over the matrix diagonals to enforce the Toeplitz structure on $G_f$. If an initial truncated estimate of $H_{fp}$ is computed, it may be used in (34) to further account for the noise that is present in the lower singular values of $\hat{H}_{fp}$. In that case, the number of degrees of freedom should be adjusted to account for the rank of the truncated estimate.
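A minimal sketch of such a residual-based noise estimate, taking $\sigma^2$ as the largest eigenvalue of the scaled residual covariance; the function signature and the omission of the weight matrices are simplifications of the procedure above.

```python
import numpy as np

def estimate_noise_level(Yf, Zp, Uf, H_fp, H_f, dof):
    """Estimate the noise level from the least squares residuals, as in (34)-(35)."""
    E = Yf - H_fp @ Zp - H_f @ Uf                 # residuals of the regression
    Sigma = (E @ E.T) / dof                       # residual covariance estimate
    sigma2 = np.linalg.eigvalsh(Sigma).max()      # largest noise level over directions
    return np.sqrt(sigma2), Sigma
```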
5 An Alternating Least Squares Bayesian Approach
Bayesian estimators are inherently robust. Under mild conditions [24], they are admissible, i.e., there exists no estimator that improves upon their risk uniformly over the parameter space. Under group invariance of priors and losses (see [24] for definitions), they are also minimax, i.e., they minimize the risk under the worst case parameter value. The shrinkage estimators presented above may often be characterized as Bayes estimators with an empirical prior, i.e., a prior distribution that is constructed from the data itself.
In our approach to subspace identification, we aim to construct a Bayesian method that is computationally simple. With this in mind, we choose priors that lead to simple regularized least squares steps. The priors are obtained empirically from the data. We have experimented with hierarchical Bayes priors as well, but did not find a convincing improvement with respect to the simpler empirical priors. Our model was inspired by that in [15] but, differently from that paper, we do not explicitly estimate the orthogonal matrices and the singular value matrix of the decomposition. We found that better results are obtained by estimating the two low-rank factors directly instead.
For the noise term, we propose a prior that is invariant with respect to the group of lower triangular Toeplitz matrices. Hence, the covariance estimator will be equivariant with respect to this group. In this particular case, it is possible to compute posterior samples in a reasonably simple manner.
For simplicity, we adopt the following set of independent priors:
(36)
(37)
(38)
(39)
where the former matrices are random matrices with the appropriate dimensions whose elements are independent normally distributed random variables, and the latter matrices are fixed parameters to be specified later. It will be shown later that the improper prior given to the noise term is invariant under the group of multiplication by lower triangular Toeplitz matrices.
Since the full posterior distribution for the estimation problem at hand is too hard to characterize analytically, we estimate its empirical distribution using a Gibbs sampler. In a Gibbs sampler, samples from dependent variables $\theta_1$ and $\theta_2$ are drawn iteratively from their conditional distributions as in:
(41)  $\theta_1^{(t+1)} \sim p\!\left( \theta_1 \mid \theta_2^{(t)}, \text{data} \right)$
(42)  $\theta_2^{(t+1)} \sim p\!\left( \theta_2 \mid \theta_1^{(t+1)}, \text{data} \right)$
The probability distribution of the resulting Markov chain is shown to converge under mild conditions to the joint posterior $p(\theta_1, \theta_2 \mid \text{data})$. The following posterior updates follow from our choice of prior:
(43)
and
(44)
where the auxiliary matrices are random matrices whose components are independently drawn from a unit normal distribution. Intuitively, we are performing independent ridge regressions row by row in (43) and column by column in (44), in addition to adding the corresponding simulated noise terms.
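The sketch below conveys the flavor of one such Gibbs sweep for a generic low-rank factorization $X \approx LR$ with i.i.d. Gaussian priors on the factors; the names L and R, the prior scale tau and the noise level sigma are illustrative choices, not the exact quantities or priors used in this report.

```python
import numpy as np

def gibbs_lowrank_step(X, L, R, sigma, tau, rng):
    """One Gibbs sweep for X ~ L @ R with i.i.d. Gaussian priors on L and R.
    Each row of L (and column of R) is drawn from its Gaussian conditional,
    which amounts to a ridge regression plus simulated posterior noise."""
    r = L.shape[1]
    # Update L given R: rows of X regressed on R
    prec = R @ R.T / sigma**2 + np.eye(r) / tau**2       # posterior precision
    cov = np.linalg.inv(prec)
    chol = np.linalg.cholesky(cov)
    mean = X @ R.T @ cov / sigma**2
    L = mean + rng.standard_normal(L.shape) @ chol.T
    # Update R given L: columns of X regressed on L
    prec = L.T @ L / sigma**2 + np.eye(r) / tau**2
    cov = np.linalg.inv(prec)
    chol = np.linalg.cholesky(cov)
    mean = cov @ L.T @ X / sigma**2
    R = mean + chol @ rng.standard_normal(R.shape)
    return L, R

# A posterior-mean estimate of X would average L @ R over the sweeps after burn-in.
```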
In order to define the posterior update for the noise factor, we first define the matrix that maps the last row of a lower triangular Toeplitz matrix to its vectorized form, and similarly the matrix that does the same for the Hankel noise term. Letting $\chi_k$ denote the chi distribution with $k$ degrees of freedom, the posterior update is given by:
(45)
(46)
(47)
(48)
(49)
(50)
where the last step takes the lower triangular factor of the Cholesky decomposition. In other words, we first obtain a sample of the last row of the Toeplitz factor, then we construct the full lower triangular Toeplitz matrix from this row and compute its inverse. Given that the matrices involved are very large and sparse, (46) may be somewhat tricky to compute. A safer alternative comes from ignoring the Hankel structure of the noise term and assuming that its elements are mutually independent. In this case, the posterior update becomes
(51)
(52)
(53)
(54)
(55)
(56)
Finally, the estimate for $H_{fp}$ is obtained by averaging over the trajectory of the Markov chain:
(57)
where the first samples, up to some burn-in period intended to remove the effect of transients, are discarded. In order to reduce variance and improve convergence, one may prefer to average over the expected values of (43) and (44) in every step (obtained by setting the respective simulated noise matrices to zero):
(58)
Regarding the parameters in the priors, we initially obtain estimates from (17) and then a truncated singular value decomposition of the weighted estimate. Then, we make
(59)
(60)
(61)
(62)
(63)
To justify such a choice, we note that the priors (36) and (38) approximately describe an SVD decomposition:
(64)
where the factor matrices behave as orthogonal matrices in expectation.
6 Numerical Experiments
In order to test our proposed methods, we ran Monte Carlo simulations on a large number of systems and compared the estimation risks.
The system order was uniformly distributed from to and we fixed . The sample size was defined as . The row-length of Hankel matrices was set as .
To obtain a stable system matrix , we first sampled an auxiliary matrix such that and a spectral radius . Then we made . The matrices and were sampled such that . We make . The measurement noise covariance was generated such that . Likewise, the process noise covariance was generated such that . The Kalman gain was therefore obtained from the previous parameters.
The system input is comprised of independent samples, with the signal to noise ratio (SNR) sampled uniformly on a logarithmic scale.
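One possible way to draw a random stable system for such a Monte Carlo study is sketched below; the spectral radius bound, the dimensions and the placeholder Kalman gain are illustrative values rather than the exact settings used in the experiments.

```python
import numpy as np

def random_stable_system(n, n_u, n_y, rho=0.95, rng=None):
    """Sample a random state space model (A, B, C, D, K) with spectral radius <= rho."""
    rng = np.random.default_rng() if rng is None else rng
    A_aux = rng.standard_normal((n, n))
    A = A_aux * (rho / np.max(np.abs(np.linalg.eigvals(A_aux))))   # rescale to radius rho
    B = rng.standard_normal((n, n_u))
    C = rng.standard_normal((n_y, n))
    D = rng.standard_normal((n_y, n_u))
    K = 0.1 * rng.standard_normal((n, n_y))                        # placeholder Kalman gain
    return A, B, C, D, K
```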
We applied the proposed algorithms to each model realization and response realization. For each realization and each estimator, we computed the risk
(65)  $R_i = \left\| W_1 \left( \hat{H}_{fp} - H_{fp} \right) W_2 \right\|_F^2$
Given that the system models have different scales in each realization, we normalized the risk performance by that of a reference estimator and averaged over the $N_{\mathrm{MC}}$ realizations as follows:
(66)  $\bar{R} = \exp\left( \frac{1}{N_{\mathrm{MC}}} \sum_{i=1}^{N_{\mathrm{MC}}} \log \frac{R_i}{R_i^{\mathrm{ref}}} \right)$
The logarithmic function is used to give higher and symmetric weights to risks that are either too low or too high compared to the reference.
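A small sketch of this normalization, assuming (66) is the geometric mean of the risk ratios over the Monte Carlo realizations (which makes the reference estimator score exactly 1.0 in the tables below).

```python
import numpy as np

def average_normalized_risk(risks, risks_ref):
    """Geometric mean of risk ratios against a reference estimator, as in (66)."""
    ratios = np.asarray(risks) / np.asarray(risks_ref)
    return float(np.exp(np.mean(np.log(ratios))))
```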
We used the same estimate of $\sigma$ for all shrinkage methods, as defined by (35). We started by truncating to the largest singular values and then, computing the corresponding noise estimate and hard threshold, we obtained
(67)
In words, we select the least truncation order such that the resulting order estimate falls below it, and then use the corresponding degrees of freedom in (35).
6.1 A heuristic as benchmark
To provide the reference estimator above, we propose a heuristic that seeks to mimic the order selection as done by visual inspection. Namely, what a typical user would do is to look at the plot of singular values and identify the point of sharp change in their rate of decline. In order to do something similar automatically, we borrow from the idea of effective sample size and, from the vector of ordered singular values, define
(68)  $n_{\mathrm{eff}} = \frac{\left( \sum_i s_i \right)^2}{\sum_i s_i^2}$
Next we construct a function that linearly fits the profile of ordered singular values. Finally, we define our heuristic order estimate as
(69)
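A sketch of an order selection heuristic in this spirit, using the effective-size quantity (68) rounded to an integer as the order estimate; the rounding rule is a simplification of the fitting procedure described above.

```python
import numpy as np

def heuristic_order(singular_values):
    """Heuristic model order from the effective size of the singular value vector, as in (68)."""
    s = np.sort(np.asarray(singular_values))[::-1]
    n_eff = s.sum()**2 / (s**2).sum()          # effective number of significant values
    return max(1, int(round(n_eff)))
```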
6.2 Results
Our results are summarized in Tables 1, 2 and 3, corresponding to the three main weight choices. We observe that the optimal shrinkage method and the alternating least squares approach consistently give lower risk estimates. Interestingly, we observe that soft-thresholding and hard-thresholding do not always improve upon the benchmark. Since the SURE-based method also applies soft-thresholding, we see that the problem does not lie in the class of shrinking functions, but in a poor parameter choice based on asymptotic properties of large random matrices.
| Method | Average Normalized Risk (66) |
| --- | --- |
| Heuristic (68) | 1.0 |
| Heuristic (69) | 3.83 |
| Hard-thresholding | 0.97 |
| Soft-thresholding | 0.76 |
| Optimal Shrinkage | 0.75 |
| SURE | 0.58 |
| Alternating Least Squares | 0.39 |
Appendix A Equivariant Estimators of the Covariance on the Lower Triangular Toeplitz Group
The set of lower triangular Toeplitz matrices, as exemplified in (12), is a group under matrix multiplication. This group operation may be interpreted as the cascading of dynamical systems. In this sense, we are interested in estimators that are invariant with respect to dynamical system cascading. For example, if we pass both input and output through a linear filter, we want an estimator that gives the same model regardless of the filtering. In the Bayesian framework, not all priors result in such an equivariant estimator. In this section, we aim at deriving a prior for the noise matrix that is invariant under the group operation of multiplication by lower triangular Toeplitz matrices.
Let $\mathcal{T}$ be one such group and consider two matrices $A, B \in \mathcal{T}$ with $C = AB$. Using the Toeplitz structure, we have that the last row of $C$ is given by
(71)
If we parametrize these matrices using their last rows, we can compute the Jacobian for right multiplication as
(72)
Given the triangular structure of $B$, this Jacobian is itself triangular and its determinant follows directly. As for left multiplication,
(73)
with a determinant of the same form. With this we can define the left and right invariant Haar measure
(74)
Indeed, to check right invariance, let $B \in \mathcal{T}$ and observe that
(75)
where we used the above defined Jacobian in the first equality and the form of the measure in the second. Left invariance may be checked similarly. Therefore, we have arrived at a prior that is invariant under the group operation.
In the sequel we derive the posterior that corresponds to this prior. Recall that the residues are
(76)
Taking the vectorization operation and making use of the noise vector on the corresponding lower dimensional subspace, we have that
(77)
Therefore, the residues covariance is given by
(78)
Since this covariance is rank deficient, we shall make use of its pseudo-determinant (the product of the non-zero singular values) in obtaining its pdf:
(79)
where we used the fact that the factors are lower triangular and that the diagonal has exactly the expected number of nonzero entries. In order to make the vectorization explicit in the likelihood function, we note that
(80)
Therefore,
(81)
and the posterior would be proportional to
(82)
(83)
Defining the change of variables, we have
(84)
Therefore, the transformed components follow the stated distributions. This is precisely the posterior given by equations (45)-(50).
References
- [1] Michel Verhaegen and Anders Hansson. N2SID: Nuclear norm subspace identification of innovation models. Automatica, 72:57–63, 2016.
- [2] Roy S Smith. Frequency domain subspace identification using nuclear norm minimization and Hankel matrix realizations. IEEE Transactions on Automatic Control, 59(11):2886–2896, 2014.
- [3] Gianluigi Pillonetto, Tianshi Chen, Alessandro Chiuso, Giuseppe De Nicolao, and Lennart Ljung. Regularized linear system identification using atomic, nuclear and kernel-based norms: The role of the stability constraint. Automatica, 69:137–149, 2016.
- [4] Alessandro Chiuso and Gianluigi Pillonetto. System identification: A machine learning perspective. Annual Review of Control, Robotics, and Autonomous Systems, 2:281–304, 2019.
- [5] Yue Sun, Samet Oymak, and Maryam Fazel. Finite sample identification of low-order LTI systems via nuclear norm regularization. IEEE Open Journal of Control Systems, 1:237–254, 2022.
- [6] Alessandro Chiuso. Regularization and Bayesian learning in dynamical systems: Past, present and future. Annual Reviews in Control, 41:24–38, 2016.
- [7] Lennart Ljung, Tianshi Chen, and Biqiang Mu. A shift in paradigm for system identification. International Journal of Control, 93(2):173–180, 2020.
- [8] Jie Liu and Bing Li. A novel strategy for response and force reconstruction under impact excitation. Journal of Mechanical Science and Technology, 32(8):3581–3596, 2018.
- [9] Vineet Vajpayee, Siddhartha Mukhopadhyay, and Akhilanand Pati Tiwari. Data-driven subspace predictive control of a nuclear reactor. IEEE Transactions on Nuclear Science, 65(2):666–679, 2017.
- [10] Behnood Rasti, Magnus O Ulfarsson, and Johannes R Sveinsson. Hyperspectral subspace identification using SURE. IEEE Geoscience and Remote Sensing Letters, 12(12):2481–2485, 2015.
- [11] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
- [12] Clément Elvira, Pierre Chainais, and Nicolas Dobigeon. Bayesian nonparametric subspace estimation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2247–2251. IEEE, 2017.
- [13] Qian Zhao, Deyu Meng, Zongben Xu, Wangmeng Zuo, and Lei Zhang. Robust principal component analysis with complex noise. In International conference on machine learning, pages 55–63. PMLR, 2014.
- [14] S Derin Babacan, Martin Luessi, Rafael Molina, and Aggelos K Katsaggelos. Sparse Bayesian methods for low-rank matrix estimation. IEEE Transactions on Signal Processing, 60(8):3964–3977, 2012.
- [15] Xinghao Ding, Lihan He, and Lawrence Carin. Bayesian robust principal component analysis. IEEE Transactions on Image Processing, 20(12):3419–3430, 2011.
- [16] Wei Chu and Zoubin Ghahramani. Probabilistic models for incomplete multi-dimensional arrays. In Artificial Intelligence and Statistics, pages 89–96. PMLR, 2009.
- [17] David Gerard and Peter Hoff. Equivariant minimax dominators of the MLE in the array normal model. Journal of multivariate analysis, 137:32–49, 2015.
- [18] Peter D Hoff. Equivariant and scale-free tucker decomposition models. Bayesian Analysis, 11(3):627–648, 2016.
- [19] Ming Shi, Dan Li, and Jian Qiu Zhang. An alternating Bayesian approach to PARAFAC decomposition of tensors. IEEE Access, 6:36487–36499, 2018.
- [20] Anastasios Tsiamis and George J Pappas. Finite sample analysis of stochastic system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3648–3654. IEEE, 2019.
- [21] S Joe Qin. An overview of subspace identification. Computers & chemical engineering, 30(10-12):1502–1513, 2006.
- [22] Peter Van Overschee and Bart De Moor. Subspace identification for linear systems: Theory—Implementation—Applications. Springer Science & Business Media, 2012.
- [23] Erich L Lehmann and George Casella. Theory of point estimation. Springer Science & Business Media, 2006.
- [24] Christian P Robert et al. The Bayesian choice: from decision-theoretic foundations to computational implementation, volume 2. Springer, 2007.
- [25] William James and Charles Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379. University of California Press, 1961.
- [26] Matan Gavish and David L Donoho. The optimal hard threshold for singular values is $4/\sqrt{3}$. IEEE Transactions on Information Theory, 60(8):5040–5053, 2014.
- [27] Matan Gavish and David L Donoho. Optimal shrinkage of singular values. IEEE Transactions on Information Theory, 63(4):2137–2152, 2017.
- [28] Emmanuel J Candes, Carlos A Sing-Long, and Joshua D Trzasko. Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE transactions on signal processing, 61(19):4643–4657, 2013.
- [29] Lennart Ljung. System identification toolbox for use with MATLAB. Mathworks, 2007.