
FEDERAL UNIVERSITY OF MINAS GERAIS
Department of Electronic Engineering
Technical Report
Robust Bayesian Subspace Identification for Small Data Sets


Alexandre Rodrigues Mesquita

1 Introduction

Model estimates obtained from traditional subspace identification methods may be subject to significant variance. This elevated variance is aggravated for large models or limited sample sizes. Common solutions to reduce the effect of variance are regularized estimators, shrinkage estimators and Bayesian estimation. In the current work we investigate the latter two solutions, which have not yet been applied to subspace identification. Our experimental results show that our proposed estimators may reduce the estimation risk to as little as 40% of that of traditional subspace methods.

In the scope of subspace identification, a number of related works explore regularization techniques such as nuclear norm minimization [1, 2, 3, 4, 5]. Under specific hyperparameter choices, regularized estimators can be understood in a Bayesian framework as maximum a posteriori estimators. These regularized estimators carry two important disadvantages in comparison to our approach: i) the choice of hyperparameters is not straightforward and it often does not take advantage of the problem statistics at hand; ii) they are not oriented toward risk minimization.

The use of regularized, shrinkage and Bayesian estimators in system identification is discussed in [6, 7, 4], where it is greeted as a promising and refreshing trend in the field. We are aware of only one work in which a shrinkage function is used for subspace identification [8]. Unlike our approach, the shrinkage there is applied to all elements of the matrix and not only to the singular values. In a somewhat related application, [9] uses wavelet thresholding as a preprocessor in a subspace identification setting. Outside the field of system identification, similar methods appear in the context of subspace identification in hyperspectral imaging [10].

If we regard subspace identification as a particular application of principal component analysis, our work can be understood as an application of robust principal component analysis, which finds many applications in image and video processing and econometrics [11]. Bayesian solutions to this problem are found in [12, 13, 14]. Our alternating least squares method was conceived as an improvement upon the Bayesian robust principal component analysis of [15] and is closely related to the tensorial regression found in [16, 17, 18, 19].

A different approach to subspace identification with finite samples is found in [20], where the focus is on deciding the number of samples needed to bound the estimation error with high probability.

2 A brief recapitulation of subspace identification

Consider the linear state space model in innovation form

x_{k+1} = A x_k + B u_k + K e_k    (1)
y_k = C x_k + D u_k + e_k,    (2)

where x_k \in \mathbb{R}^{n_x} is the state, y_k \in \mathbb{R}^{n_o} is the measured variable, u_k \in \mathbb{R}^{n_i} is the input variable and e_k is the innovations process, assumed to be white Gaussian noise. The model parameters A, B, C, D and K are matrices with the appropriate dimensions. At times, it may be useful to express the same model in the predictor form

x_{k+1} = A_K x_k + B_K z_k    (3)
y_k = C x_k + D u_k + e_k,    (4)

where z_k = [u_k^T \;\; y_k^T]^T, A_K = A - KC and B_K = [B - KD \;\; K].

Following the formulation in [21], we make use of the extended state space model

Y_f = \Gamma_f X_k + H_f U_f + G_f E_f    (5)

and its predictor form

Y_f = H_{fp} Z_p + H_f U_f + G_f E_f    (6)

where the available data is arranged in Hankel matrices defined by

Y_f = \begin{bmatrix} y_k & y_{k+1} & \cdots & y_{k+N-1} \\ y_{k+1} & y_{k+2} & \cdots & y_{k+N} \\ \vdots & \vdots & \ddots & \vdots \\ y_{k+f-1} & y_{k+f} & \cdots & y_{k+f+N-2} \end{bmatrix}    (7)

and U_f and E_f are defined similarly. Here f > n is the future horizon and N is a function of the size of the available data set. The state sequence is defined as

X_k = \begin{bmatrix} x_k & x_{k+1} & \cdots & x_{k+N-1} \end{bmatrix}.    (8)

The past information is collected in Z_p = [U_p^T \; Y_p^T]^T up to the horizon p > n and arranged as

U_p = \begin{bmatrix} u_{k-p} & u_{k-p+1} & \cdots & u_{k-p+N-1} \\ u_{k-p+1} & u_{k-p+2} & \cdots & u_{k-p+N} \\ \vdots & \vdots & \ddots & \vdots \\ u_{k-1} & u_k & \cdots & u_{k+N-2} \end{bmatrix}    (9)

and likewise for Y_p.
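As an illustration (not part of the original formulation), the Hankel data matrices above could be assembled with numpy as in the sketch below; the function names and the choice N = T - p - f + 1 are our own conventions.

```python
import numpy as np

def block_hankel(w, start, rows, N):
    """Stack `rows` block rows of the signal w (shape n_channels x n_samples),
    starting at sample index `start`, with N columns."""
    n_ch = w.shape[0]
    H = np.zeros((rows * n_ch, N))
    for r in range(rows):
        H[r * n_ch:(r + 1) * n_ch, :] = w[:, start + r:start + r + N]
    return H

def past_future_matrices(y, u, p, f):
    """Assemble Y_f, U_f and Z_p = [U_p; Y_p] from output y and input u arrays."""
    T = y.shape[1]
    N = T - p - f + 1          # number of columns so every window fits in the record
    Yf = block_hankel(y, p, f, N)
    Uf = block_hankel(u, p, f, N)
    Yp = block_hankel(y, 0, p, N)
    Up = block_hankel(u, 0, p, N)
    return Yf, Uf, np.vstack([Up, Yp])
```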

As a consequence, \Gamma_f is the extended observability matrix as defined by

\Gamma_f = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{f-1} \end{bmatrix},    (10)

H_f and G_f are Toeplitz matrices given by

H_f = \begin{bmatrix} D & 0 & \cdots & 0 \\ CB & D & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ CA^{f-2}B & CA^{f-3}B & \cdots & D \end{bmatrix},    (11)

and

G_f = \begin{bmatrix} I & 0 & \cdots & 0 \\ CK & I & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ CA^{f-2}K & CA^{f-3}K & \cdots & I \end{bmatrix} \cdot (I_f \otimes \Sigma^{1/2}),    (12)

where \Sigma is the innovations covariance. The matrix H_{fp} is reminiscent of the matrix of Markov parameters and is given by

H_{fp} = \begin{bmatrix} H_{fp}^{(1)} & H_{fp}^{(2)} \end{bmatrix}    (13)

with

H_{fp}^{(i)} = \begin{bmatrix} CA_K^{p-1}B_K^{(i)} & CA_K^{p-2}B_K^{(i)} & \cdots & CB_K^{(i)} \\ CAA_K^{p-1}B_K^{(i)} & CAA_K^{p-2}B_K^{(i)} & \cdots & CAB_K^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ CA^{f-1}A_K^{p-1}B_K^{(i)} & CA^{f-1}A_K^{p-2}B_K^{(i)} & \cdots & CA^{f-1}B_K^{(i)} \end{bmatrix},    (14)

for i = 1, 2, with B_K^{(1)} = B - KD and B_K^{(2)} = K. These matrices can also be decomposed as products H_{fp}^{(i)} = \Gamma_f L_p^{(i)} of the extended observability matrix \Gamma_f and the extended controllability matrices

L_p^{(i)} = \begin{bmatrix} A_K^{p-1}B_K^{(i)} & A_K^{p-2}B_K^{(i)} & \cdots & B_K^{(i)} \end{bmatrix}    (15)

for i = 1, 2.

Several estimation methods have been proposed in the literature that depart from this formulation and exploit the fact that the data, as structured in (6), should lie in low-dimensional subspaces (see [21, 22] for a comprehensive review). A common approach is to solve (6) via least squares and then estimate \Gamma_f from a truncated singular value decomposition of H_{fp}. The matrices A and C can then be recovered from estimates of \Gamma_f and \Gamma_{f+1}. The matrices B and D can in turn be obtained from a new least squares problem or from estimates of L_p.

From a general perspective, most methods obtain an initial estimate \hat{H}_{fp} from least squares and, given two weight matrices W_1 and W_2, obtain a low-dimensional estimate \hat{\hat{H}}_{fp} by truncating the singular value decomposition of \hat{H}_{fp} to the largest r components:

W_1 \hat{H}_{fp} W_2 = U S V^T \approx U_r S_r V_r^T =: W_1 \hat{\hat{H}}_{fp} W_2.    (16)

In this context, r is an estimate of the system order n. Finding appropriate methods to estimate n has proven difficult in the literature, and one usually resorts to ad hoc solutions or to visual inspection of the singular values.
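A minimal sketch of the truncation step in (16), assuming square, invertible weight matrices; the function name is illustrative.

```python
import numpy as np

def truncate_weighted(H_hat, W1, W2, r):
    """Rank-r truncation of W1 @ H_hat @ W2 as in (16), mapped back to the
    original coordinates; W1 and W2 are assumed square and invertible."""
    U, s, Vt = np.linalg.svd(W1 @ H_hat @ W2, full_matrices=False)
    low_rank = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return np.linalg.solve(W1, low_rank) @ np.linalg.inv(W2)
```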

Traditional subspace identification methods obtain an initial estimate of H_{fp} from (6) by applying the least squares method as follows:

[\hat{H}_{fp} \;\; \hat{H}_f] = Y_f \begin{bmatrix} U_p \\ Y_p \\ U_f \end{bmatrix}^{\dagger}.    (17)

Notice, however, that this estimate only approximates the maximum likelihood estimate, since the noise term G_f E_f in (6) is at best only approximately white and since H_f is constrained to be a lower triangular Toeplitz matrix.
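The least squares step (17) amounts to a single pseudoinverse; a possible numpy sketch (with illustrative names) could read:

```python
import numpy as np

def ls_estimate(Yf, Zp, Uf):
    """Least squares estimate of [H_fp  H_f] in (17) via the pseudoinverse of
    the stacked regressor [Z_p; U_f]."""
    Phi = np.vstack([Zp, Uf])
    Theta = Yf @ np.linalg.pinv(Phi)
    n_zp = Zp.shape[0]
    return Theta[:, :n_zp], Theta[:, n_zp:]     # H_fp_hat, H_f_hat
```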

At this point it is worthwhile to discuss the approximations typically made in the process of solving (6). Given that E_f is structured as a Hankel matrix as in (9), its components are not truly independent. Nevertheless, the off-diagonal elements of \operatorname{E}[E_f E_f^T] are reasonably sparse, which gives a reasonable approximation of a white matrix. In order to incorporate the Hankel and Toeplitz structure of E_f and H_f in the computation, one would have to vectorize equation (6):

y_f = (Z_p^T \otimes I) h_{fp} + (U_f^T \otimes I) h_f + (I \otimes G_f) e_f,    (18)

where the lower case letters denote the vectorized versions of the corresponding matrices. In addition, h_f and e_f lie in lower dimensional subspaces. This can be handled by introducing known matrices B_T and B_W that convert the original lower dimensional data into the vectorized versions of the Toeplitz matrix H_f and the Hankel matrix E_f. This would result in the new equation

y_f = (Z_p^T \otimes I) h_{fp} + (U_f^T \otimes I) B_T \bar{h}_f + (I \otimes G_f) B_W \bar{e}_f.    (19)

However, the least squares problem that arises in (19) is high-dimensional, sparse and numerically ill-conditioned. In this regard, one of the virtues of the formulation in (6) is to avoid the numerical difficulties that would arise in a more detailed formulation.

3 Problem Description

Most subspace identification methods rely on the limit case of large enough data sets, N \to \infty. In addition, most approaches are based on least squares or, equivalently, on maximum likelihood estimators of H_{fp}. However, it is well known that such estimators perform poorly with regard to the estimation risk under quadratic loss (they are not admissible) when the number of parameters is larger than 2 [23, 24], which is classically known as the Stein phenomenon [25]. On the other hand, the number of parameters in H_{fp} may be as large as n_o (n_i + n_o) f p, which can be considerably large. Even if we consider that H_{fp} is low rank and may be parameterized and estimated in a smaller space, it would still depend on (n_i + 2n_o) n_x parameters, which is still large in most applications.

In addition, there seems to be considerable difficulty in the literature in finding appropriate methods to estimate the system order n_x, which is of particular concern given the high sensitivity of the results to this parameter.

Within this context, our goal is to investigate robustified methods to estimate H_{fp}. In particular, two groups of methods shall be considered. The first group comprises the main shrinkage estimators of singular values available in the literature. The second group consists of Bayesian estimators that can be efficiently computed with a Gibbs sampling scheme.

Given that the estimation errors in the matrices (A, B, C, D, K) are upper bounded by a constant times \|H_{fp} - \hat{H}_{fp}\|, as demonstrated in [20], we concentrate our efforts on estimators of H_{fp} and leave an investigation of the matrix estimators to future work.

4 Singular Value Shrinkage

Shrinkage estimators are biased estimators that shrink the maximum likelihood estimate towards a pre-assigned value (often to zero). These estimators were first proposed by Stein [25] and it has been demonstrated that this type of estimator may provide smaller estimation risk than the maximum likelihood estimator under quadratic loss when the number of parameters is larger than 2. The rationale behind such estimators is that, although shrinking introduces bias, it reduces variance and therefore it may be beneficial in reducing the overall quadratic error.

In the context of singular value decompositions, approaches based on random matrix theory obtain optimal shrinkage estimators in the asymptotic case of the matrix dimensions approaching infinity. This assumption may be far from adequate in the case of subspace identification with small data sets. Non-asymptotic results are also available, based on the minimization of unbiased risk estimates.

Most approaches consider the problem of estimating some matrix X \in \mathbb{R}^{i \times j} from a noisy measurement

Y = X + \sigma W,    (20)

where \sigma > 0 is known and the noise matrix W is orthogonally invariant, i.e., its distribution is the same as that of UWV for any orthogonal matrices U and V. In practice, Y is not a single measurement, but the maximum likelihood estimate obtained from multiple measurements, which gives a sufficient statistic based on all available measurements.

Considering the Frobenius norm squared as the risk loss function, and given that the noise is orthogonally invariant, it suffices to restrict our search to shrinkage estimators of the form:

\hat{X} = U \eta(S) V^T,    (21)

where U S V^T is the singular value decomposition of Y and \eta(\cdot) is the shrinkage function.

4.1 Hard and Soft-Thresholding

In the asymptotic case of very large dimensions, [26] computed the optimal parameters for the hard and soft-thresholding functions:

\eta_{\text{hard}}(s) = s \, 1_{s > \lambda_{\text{hard}}}    (22)
\eta_{\text{soft}}(s) = (s - \lambda_{\text{soft}})_+,    (23)

where 1_{(\cdot)} is the indicator function, (\cdot)_+ is the positive-part function, the functions \eta are applied element-wise to the singular values, and the optimal thresholds are

\lambda_{\text{hard}} = \sqrt{2(\beta+1) + \frac{8\beta}{\beta + 1 + \sqrt{\beta^2 + 14\beta + 1}}} \cdot \sigma \sqrt{j}    (24)

and

\lambda_{\text{soft}} = (1 + \sqrt{\beta}) \sigma \sqrt{j},    (25)

with \beta = i/j being the aspect ratio of the matrix Y.
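As an illustration, a small numpy sketch of the shrinkage rules (21)-(25) might look as follows; the helper names are ours.

```python
import numpy as np

def asymptotic_thresholds(i, j, sigma):
    """Optimal hard and soft thresholds (24)-(25) for an i x j matrix (i <= j)
    with known noise level sigma."""
    beta = i / j
    lam_hard = np.sqrt(2 * (beta + 1)
                       + 8 * beta / (beta + 1 + np.sqrt(beta**2 + 14 * beta + 1))) \
               * sigma * np.sqrt(j)
    lam_soft = (1 + np.sqrt(beta)) * sigma * np.sqrt(j)
    return lam_hard, lam_soft

def shrink_singular_values(Y, eta):
    """Apply a scalar shrinkage function eta to the singular values of Y, as in (21)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * eta(s)) @ Vt

hard = lambda s, lam: s * (s > lam)                 # eq. (22)
soft = lambda s, lam: np.maximum(s - lam, 0.0)      # eq. (23)
# Usage: X_hat = shrink_singular_values(Y, lambda s: soft(s, lam_soft))
```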

4.2 Optimal shrinkage in the asymptotic case

Still in the asymptotic case, the overall optimal shrinkage function was obtained in [27]. For the squared Frobenius norm loss function, the optimal shrinkage is given by

\eta^*(s) = \frac{1}{s} \sqrt{(s^2 - (1+\beta)\sigma^2 j)^2 - 4\beta\sigma^4 j^2} \cdot 1_{s > \lambda_{\text{soft}}}.    (26)
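A sketch of (26) in the same style, vectorized over the singular values; the clamp to zero inside the square root is only a numerical safeguard and not part of the formula.

```python
import numpy as np

def eta_opt(s, sigma, i, j):
    """Asymptotically optimal shrinker (26) for the squared Frobenius loss,
    applied element-wise to the singular values s of the i x j matrix Y."""
    beta = i / j
    lam_soft = (1 + np.sqrt(beta)) * sigma * np.sqrt(j)
    inner = (s**2 - (1 + beta) * sigma**2 * j)**2 - 4 * beta * sigma**4 * j**2
    out = np.zeros_like(s)
    keep = s > lam_soft                  # below the soft threshold the shrinker is zero
    out[keep] = np.sqrt(np.maximum(inner[keep], 0.0)) / s[keep]
    return out
```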

4.3 SURE-based thresholding

In the non-asymptotic case, [28] obtained an expression for Stein's unbiased risk estimate (SURE) for estimators based on soft-thresholding functions. The threshold value can then be obtained by minimizing the SURE risk estimate. For \eta(s) = (s - \lambda)_+ and the Frobenius norm squared loss, the unbiased risk estimate is given by

\text{SURE}_\lambda(Y) = -ij\sigma^2 + \sum_{k=1}^{i} \min(\lambda^2, \sigma_k^2) + 2\sigma^2 \left[ (j-i) \sum_{k=1}^{i} \left(1 - \frac{\lambda}{\sigma_k}\right)_+ + \sum_{k=1}^{i} 1_{\sigma_k > \lambda} + \sum_{k \neq l} \frac{\sigma_k (\sigma_k - \lambda)_+}{\sigma_k^2 - \sigma_l^2} \right],    (27)

where \sigma_k denotes the k-th singular value of Y and i < j. Given Y, \text{SURE}_\lambda(Y) can be easily minimized since it is a piecewise quadratic function of \lambda.
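One possible implementation sketch of (27), with a simple grid search over \lambda; exact piecewise-quadratic minimization is also possible, and distinct, nonzero singular values are assumed.

```python
import numpy as np

def sure_soft(Y, sigma, lam):
    """Unbiased risk estimate (27) for soft-thresholding of singular values at lam;
    assumes i <= j and distinct, nonzero singular values."""
    i, j = Y.shape
    s = np.linalg.svd(Y, compute_uv=False)
    risk = -i * j * sigma**2 + np.sum(np.minimum(lam**2, s**2))
    div = (j - i) * np.sum(np.maximum(1.0 - lam / s, 0.0)) + np.sum(s > lam)
    num = s * np.maximum(s - lam, 0.0)
    diff = s[:, None]**2 - s[None, :]**2
    np.fill_diagonal(diff, np.inf)            # exclude the k == l terms
    div += np.sum(num[:, None] / diff)
    return risk + 2 * sigma**2 * div

def sure_threshold(Y, sigma, n_grid=200):
    """Pick lam by a grid search on [0, s_max]; SURE is piecewise quadratic in lam,
    so a fine grid is an adequate surrogate for exact minimization."""
    s_max = np.linalg.svd(Y, compute_uv=False)[0]
    grid = np.linspace(0.0, s_max, n_grid)
    return grid[np.argmin([sure_soft(Y, sigma, lam) for lam in grid])]
```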

4.4 Subspace identification with shrinkage estimators

Following the traditional subspace identification methods, we obtain the initial estimate of H_{fp} using the least squares estimate (17). This estimate offers the advantage of being unbiased and of having a covariance that can be easily characterized. This facilitates framing the problem of singular value shrinkage in the format given by (20).

In the context of shrinkage estimators, one could naturally propose to also shrink the parameters estimated in (17) by replacing the linear regression with a ridge regression. However, this would produce a biased estimate of H_{fp}, which would make it harder to exploit the theory of shrinkage estimators for singular values.

Our goal is to minimize the risk function

\mathcal{R}(\hat{\hat{H}}_{fp}) = \operatorname{E}\left[\operatorname{tr}\left(W_2^T (H_{fp} - \hat{\hat{H}}_{fp})^T W_1^T W_1 (H_{fp} - \hat{\hat{H}}_{fp}) W_2\right)\right].    (28)

Within the class of shrinkage estimators presented above, the risk in (28) is minimized by shrinking the singular value decomposition of W_1 \hat{H}_{fp} W_2. However, the noise level \sigma is assumed known in (20). This means that we still must obtain a reasonable estimate for it.

Defining the projection matrix onto the orthogonal complement of the row space of U_f as \Pi_{U_f}^{\perp} = I - U_f^T (U_f U_f^T)^{-1} U_f, the estimate \hat{H}_{fp} in (17) is equivalent to [21]

\hat{H}_{fp} = Y_f \Pi_{U_f}^{\perp} Z_p^T (Z_p \Pi_{U_f}^{\perp} Z_p^T)^{-1}.    (29)

Substituting Y_f as in (6), this leads to

\hat{H}_{fp} = H_{fp} + G_f E_f \Pi_{U_f}^{\perp} Z_p^T (Z_p \Pi_{U_f}^{\perp} Z_p^T)^{-1}.    (30)

Multiplying by the weight matrices, we have

W_1 \hat{H}_{fp} W_2 = W_1 H_{fp} W_2 + W_1 G_f E_f \Pi_{U_f}^{\perp} Z_p^T (Z_p \Pi_{U_f}^{\perp} Z_p^T)^{-1} W_2.    (31)

Unless we carefully pick the weight matrices, the noise term in (31) is not orthogonally invariant white noise as assumed in (20). If we consider the approximation \operatorname{E}[e_f e_f^T] = I for the vectorized noise E_f, then the covariance matrix of the vectorization of W_1 \hat{H}_{fp} W_2 would be

\Sigma_{h_{fp}} = W_2^T (Z_p \Pi_{U_f}^{\perp} Z_p^T)^{-1} W_2 \otimes W_1 G_f G_f^T W_1^T.    (32)

To make this covariance orthogonally invariant, one would choose W_1 = (G_f G_f^T)^{-1/2} and W_2 = (Z_p \Pi_{U_f}^{\perp} Z_p^T)^{1/2}. This is closely related to the weight choice of the popular CVA algorithm for subspace identification [21]. Nonetheless, there might be reasons to choose different weights. For example, the N4SID algorithm uses W_1 = I and W_2 = Z_p in order to penalize the prediction error. Alternatively, one might prefer to penalize errors in the input-output relation and choose W_1 = I and W_2 = I.

For general weight matrices, we identify the noise level \sigma with the largest possible noise level in a given direction, i.e.,

\sigma^2 = \sigma_{\max}(W_2^T (Z_p \Pi_{U_f}^{\perp} Z_p^T)^{-1} W_2) \cdot \sigma_{\max}(W_1 G_f G_f^T W_1^T).    (33)

In order to estimate G_f, we compute the residues

\mathcal{E} = Y_f - \hat{H}_{fp} Z_p - \hat{H}_f U_f    (34)

and construct the estimate

\hat{G}_f \hat{G}_f^T = \frac{\mathcal{E}\mathcal{E}^T}{j - i(n_o + 2n_i)},    (35)

where we set f = i and N = j, and i(n_o + 2n_i) is the number of degrees of freedom of the linear regression. The estimate in (35) is enough to obtain \sigma from (33). If we further desire to specify G_f, one simple approach is to obtain the Cholesky decomposition of (35) and then average over the matrix diagonals to enforce the Toeplitz structure on \hat{G}_f. If an initial truncated estimate \hat{\hat{H}}_{fp} is computed, it may be used in (34) to further account for the noise that is present in the lower singular values of \hat{H}_{fp}. In that case, the number of degrees of freedom could be changed to i n_i + i(n_i + n_o) - (i(n_i + n_o) - (i + n_i + n_o)r + r^2), where r is the rank of \hat{\hat{H}}_{fp}.
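The noise-level computation of (33)-(35) could be sketched as follows; n_i = n_o = 1 by default and the variable names are illustrative.

```python
import numpy as np

def noise_level(Yf, Zp, Uf, H_fp_hat, H_f_hat, W1, W2, n_i=1, n_o=1):
    """Estimate G_f G_f^T from the residues (34)-(35) and combine it with the
    weighted data covariance into the scalar noise level of (33)."""
    i = Uf.shape[0] // n_i                       # number of block rows (f = i)
    j = Yf.shape[1]                              # N = j columns
    E = Yf - H_fp_hat @ Zp - H_f_hat @ Uf        # residues, eq. (34)
    GGt = E @ E.T / (j - i * (n_o + 2 * n_i))    # eq. (35)
    Pi = np.eye(j) - Uf.T @ np.linalg.solve(Uf @ Uf.T, Uf)   # projector onto ker(U_f)
    M2 = W2.T @ np.linalg.solve(Zp @ Pi @ Zp.T, W2)
    M1 = W1 @ GGt @ W1.T
    sigma2 = np.linalg.eigvalsh(M2)[-1] * np.linalg.eigvalsh(M1)[-1]   # eq. (33)
    return np.sqrt(sigma2), GGt
```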

5 An Alternating Least Squares Bayesian Approach

Bayesian estimators are inherently robust. Under mild conditions [24], they are admissible, i.e., there exists no estimator that improves their risk for all the parameters in the parameter space. Under group invariance of priors and losses (see [24] for definitions), they are also minimax, i.e., they minimize the risk under the worst case parameter value. The shrinkage estimators presented above may often be characterized as Bayes estimators with an empirical prior, i.e., a prior distribution that is constructed from the data itself.

In our approach to subspace identification, we aim to construct a Bayesian method that is computationally simple. With this in mind, we choose priors that lead to simple regularized least squares steps. The priors are obtained empirically from the data. We have experimented with hierarchical Bayes priors as well, but did not find a convincing improvement with respect to the simpler empirical priors. Our model was inspired by that in [15] but, unlike that work, we do not explicitly estimate the orthogonal matrices U and V and the singular value matrix S. We found that better results are obtained by estimating U S^{1/2} and S^{1/2} V^T instead.

For the noise term G_f, we propose a prior invariant with respect to the group of lower triangular Toeplitz matrices. Hence, the covariance estimator will be equivariant with respect to this group. In this particular case, it is possible to compute posterior samples in a reasonably simple manner.

Assuming n_i = n_o = 1 for simplicity, we adopt the following set of independent priors:

\Gamma_f = \bar{G}_f \Xi_\Gamma \Lambda_\Gamma^{-1/2}    (36)
H_f = \bar{G}_f \Xi_H \Lambda_H^{-1/2}    (37)
L_p = \Lambda_L^{-1/2} \Xi_L Z_p^{\dagger}    (38)
G_f \sim \frac{1}{|G_{f[1,1]}|^i}    (39)

where the \Xi_{(\cdot)} matrices are random matrices with the appropriate dimensions whose elements are independent normally distributed random variables with variance 1, the \Lambda_{(\cdot)} matrices are fixed parameters to be specified later, and \bar{G}_f = G_f / G_{f[1,1]}. It will be shown later that the improper prior given to G_f is invariant under the group of multiplication by lower triangular Toeplitz matrices.

Since the full posterior distribution for the estimation problem at hand is too hard to characterize analytically, we estimate its empirical distribution using a Gibbs sampler. In a Gibbs sampler, samples from dependent variables x and y are drawn iteratively from their conditional distributions as in:

y^{(n)} \sim p(y \mid x^{(n-1)})    (41)
x^{(n)} \sim p(x \mid y^{(n)}).    (42)

The probability distribution of the resulting Markov chain (x^{(n)}, y^{(n)}) converges under mild conditions to p(x, y). The following posterior updates follow from our choice of prior:

[\Gamma_f^{(n)} \;\; H_f^{(n)}] = Y_f \begin{bmatrix} X_p^{(n-1)} \\ U_f \end{bmatrix}^T \left( \begin{bmatrix} \Lambda_\Gamma & 0 \\ 0 & \Lambda_H \end{bmatrix} + \gamma^{(n)} \begin{bmatrix} X_p^{(n-1)} \\ U_f \end{bmatrix} \begin{bmatrix} X_p^{(n-1)} \\ U_f \end{bmatrix}^T \right)^{-1} \gamma^{(n)} + \bar{G}_f^{(n-1)} \Xi_{\Gamma,H}^{(n)} \left( \begin{bmatrix} \Lambda_\Gamma & 0 \\ 0 & \Lambda_H \end{bmatrix} + \gamma^{(n)} \begin{bmatrix} X_p^{(n-1)} \\ U_f \end{bmatrix} \begin{bmatrix} X_p^{(n-1)} \\ U_f \end{bmatrix}^T \right)^{-1/2}    (43)

and

L_p^{(n)} = \left( \Gamma_f^{(n)T} (\Sigma_e^{(n)})^{-1} \Gamma_f^{(n)} + \Lambda_L \right)^{-1} \Gamma_f^{(n)T} (\Sigma_e^{(n)})^{-1} (Y_f - H_f^{(n)} U_f) + \left( \Gamma_f^{(n)T} (\Sigma_e^{(n)})^{-1} \Gamma_f^{(n)} + \Lambda_L \right)^{-1/2} \Xi_L^{(n)} Z_p^{\dagger}    (44)

where X_p^{(n)} = L_p^{(n)} Z_p, \gamma^{(n)} = 1/(G_{f[1,1]}^{(n-1)})^2, \Sigma_e^{(n)} = G_f^{(n-1)} (G_f^{(n-1)})^T, and the \Xi_{(\cdot)}^{(n)} matrices are random matrices whose components are independently drawn from a standard normal distribution. Intuitively, we are performing independent regressions row by row in (43) and independent regressions column by column in (44), in addition to adding the corresponding simulated noise terms.

In order to define the posterior update for G_f, we first define the matrix B_T that maps the last row of G_f to \text{vec}(G_f), i.e., \text{vec}(G_f) = B_T (G_{f[i,\cdot]})^T. Similarly, we define the matrix B_W such that \text{vec}(E_f) = B_W e_{[1:i+j-1]}. Letting \chi_k denote the chi distribution with k degrees of freedom, the posterior update for G_f is given by:

\mathcal{E}^{(n)} = Y_f - \Gamma_f^{(n)} L_p^{(n)} Z_p - H_f^{(n)} U_f    (45)
\Omega^{(n)} = B_T^T (\mathcal{E}^{(n)} \otimes I_i) B_W (B_W^T B_W)^{-1} B_W^T ((\mathcal{E}^{(n)})^T \otimes I_i) B_T    (46)
\nu_i^{(n)} \sim \chi_{j+1}    (47)
\nu_k^{(n)} \sim N(0,1), \quad \text{for } k = 1, \ldots, i-1    (48)
(G_f^{(n)})^{-1}_{[i,\cdot]} = \nu_{[1:i]} (\Omega_L^{(n)})^{-1}    (49)
\text{vec}\left((G_f^{(n)})^{-1}\right) = B_T \left((G_f^{(n)})^{-1}_{[i,\cdot]}\right)^T,    (50)

where \Omega_L^{(n)} denotes the lower triangular factor of the Cholesky decomposition of \Omega^{(n)}. In other words, we first obtain a sample of the last row of G_f^{-1}, then construct the full lower triangular Toeplitz matrix G_f^{-1} from this row and compute its inverse. Given that the matrix B_W is very large and sparse, (46) may be somewhat tricky to compute. A safer alternative comes from ignoring the Hankel structure of E_f and assuming that its elements are mutually independent. In this case, the posterior update becomes

\mathcal{E}^{(n)} = Y_f - \Gamma_f^{(n)} L_p^{(n)} Z_p - H_f^{(n)} U_f    (51)
\Omega^{(n)} = B_T^T (\mathcal{E}^{(n)} (\mathcal{E}^{(n)})^T \otimes I_i) B_T    (52)
\nu_i^{(n)} \sim \chi_{ij-i+2}    (53)
\nu_k^{(n)} \sim N(0,1), \quad \text{for } k = 1, \ldots, i-1    (54)
(G_f^{(n)})^{-1}_{[i,\cdot]} = \nu_{[1:i]} (\Omega_L^{(n)})^{-1}    (55)
\text{vec}\left((G_f^{(n)})^{-1}\right) = B_T \left((G_f^{(n)})^{-1}_{[i,\cdot]}\right)^T.    (56)

Finally, the estimate for H_{fp} is obtained by averaging over the trajectory of the Markov chain:

\hat{H}_{fp} = \frac{1}{N_F - N_o} \sum_{n=N_o}^{N_F} \Gamma_f^{(n)} L_p^{(n)},    (57)

where N_o is some burn-in period intended to remove the effect of transients. In order to reduce variance and improve convergence, one may prefer to average over the expected values of (43) and (44) at every step (obtained by setting the respective \Xi_{(\cdot)} matrices to zero):

\hat{H}_{fp} = \frac{1}{2(N_F - N_o)} \sum_{n=N_o}^{N_F} \left( \operatorname{E}[\Gamma_f^{(n)}] L_p^{(n-1)} + \Gamma_f^{(n)} \operatorname{E}[L_p^{(n)}] \right).    (58)
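Putting the pieces together, the overall sampler could be organized as in the sketch below; the update_* callables are placeholders for the posterior draws (43)-(50) and are not part of the paper.

```python
import numpy as np

def gibbs_estimate(update_gamma_h, update_lp, update_gf, init, N_F=250, N_o=1):
    """Skeleton of the sampler: alternately draw (Gamma_f, H_f), L_p and G_f from
    their conditionals and average Gamma_f L_p after burn-in, as in (57).
    The update_* callables stand in for the draws (43)-(50)."""
    Gamma, H, Lp, Gf = init
    H_fp_sum, count = 0.0, 0
    for n in range(1, N_F + 1):
        Gamma, H = update_gamma_h(Lp, Gf)    # draw from (43)
        Lp = update_lp(Gamma, H, Gf)         # draw from (44)
        Gf = update_gf(Gamma, H, Lp)         # draw from (45)-(50)
        if n >= N_o:                         # discard burn-in, then average
            H_fp_sum = H_fp_sum + Gamma @ Lp
            count += 1
    return H_fp_sum / count
```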

Regarding the parameters in the priors, we initially obtain estimates H_{fp}^{(1)} and H_f^{(1)} from (17) and then a truncated singular value decomposition H_{fp}^{(1)} Z_p \approx U_r S_r V_r^T. Then, we make G_f^{(1)} = I_i and

\Gamma_f^{(1)} = U_r S_r^{1/2}    (59)
L_p^{(1)} = S_r^{1/2} V_r^T Z_p^{\dagger}    (60)
\Lambda_\Gamma^{-1} = S_r / i    (61)
\Lambda_L^{-1} = S_r / j    (62)
\Lambda_H^{-1} = I_i \operatorname{tr}((H_f^{(1)})^T H_f^{(1)}) / i^2.    (63)

To justify this choice, we note that the priors (36) and (38) approximately describe a singular value decomposition:

\Gamma_f X_p = \frac{\Xi_\Gamma}{i} S_r \frac{\Xi_L}{j},    (64)

where \Xi_\Gamma / i and \Xi_L / j behave as orthogonal matrices in expectation, i.e., \operatorname{E}[\Xi_\Gamma^T \Xi_\Gamma / i^2] = \operatorname{E}[\Xi_L \Xi_L^T / j^2] = I_r.
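The initialization (59)-(63) is straightforward to code; a sketch follows, assuming n_o = 1 so that i equals the number of rows of H_fp^(1), with illustrative names.

```python
import numpy as np

def initialize_priors(H_fp_hat, H_f_hat, Zp, r):
    """Initialization (59)-(63) from a truncated SVD of H_fp^(1) Z_p."""
    i = H_fp_hat.shape[0]
    j = Zp.shape[1]
    U, s, Vt = np.linalg.svd(H_fp_hat @ Zp, full_matrices=False)
    Sr = np.diag(s[:r])
    Gamma_f = U[:, :r] @ np.sqrt(Sr)                            # (59)
    L_p = np.sqrt(Sr) @ Vt[:r, :] @ np.linalg.pinv(Zp)          # (60)
    Lam_Gamma = np.linalg.inv(Sr / i)                           # (61)
    Lam_L = np.linalg.inv(Sr / j)                               # (62)
    Lam_H = np.eye(i) * i**2 / np.trace(H_f_hat.T @ H_f_hat)    # (63)
    return Gamma_f, L_p, Lam_Gamma, Lam_L, Lam_H, np.eye(i)     # last entry: G_f^(1)
```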

6 Numerical Experiments

In order to test our proposed methods, we ran Monte Carlo simulations on a large number of systems and compared the estimation risks.

The system order n_x was uniformly distributed from 1 to 10 and we fixed n_i = n_o = 1. The sample size was defined as N = \lfloor 80\sqrt{n_x} \rfloor. The row-length of the Hankel matrices was set as i = \lfloor N/10 \rfloor.

To obtain a stable system matrix A, we first sampled an auxiliary matrix \tilde{A} with \tilde{A}_{[k,l]} \sim N(0,1) and a spectral radius \lambda_a \sim \mathcal{U}(0,1), and then set A = \tilde{A} / \lambda_{\max}(\tilde{A}) \cdot \lambda_a. The matrices B and C were sampled such that B_{[k,l]}, C_{[k,l]} \sim N(0,1). We set D = 0. The measurement noise covariance R_v was generated such that (R_v)^{1/2}_{[k,l]} \sim N(0,1). Likewise, the process noise covariance R_w was generated such that (R_w)^{1/2}_{[k,l]} \sim N(0,1). The Kalman gain K was then obtained from the previous parameters.

The system input u_k consists of independent samples from N(0, \textsf{SNR}), where the signal-to-noise ratio SNR was sampled uniformly on a logarithmic scale such that \log_{10} \textsf{SNR} \sim \mathcal{U}(-1, 2).
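A sketch of this system generator is given below. The text does not specify how K is obtained from the noise covariances, so we assume the standard steady-state Kalman gain computed from a discrete algebraic Riccati equation.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def random_system(rng):
    """Monte Carlo system generator of Section 6 (n_i = n_o = 1)."""
    nx = int(rng.integers(1, 11))                              # order uniform on 1..10
    A_tilde = rng.standard_normal((nx, nx))
    rho = rng.uniform(0.0, 1.0)                                # target spectral radius
    A = A_tilde / np.max(np.abs(np.linalg.eigvals(A_tilde))) * rho
    B = rng.standard_normal((nx, 1))
    C = rng.standard_normal((1, nx))
    D = np.zeros((1, 1))
    Rv_half = rng.standard_normal((1, 1))
    Rw_half = rng.standard_normal((nx, nx))
    Rv, Rw = Rv_half @ Rv_half.T, Rw_half @ Rw_half.T          # noise covariances
    P = solve_discrete_are(A.T, C.T, Rw, Rv)                   # prediction error covariance
    K = A @ P @ C.T @ np.linalg.inv(C @ P @ C.T + Rv)          # assumed Kalman gain
    snr = 10.0 ** rng.uniform(-1.0, 2.0)                       # log-uniform SNR
    return A, B, C, D, K, snr
```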

We applied the proposed algorithms to each model realization and response realization. For each realization and each estimator, we computed the risk

\mathcal{R}_k = \operatorname{tr}\left(W_2^T (H_{fp} - \hat{\hat{H}}_{fp})^T W_1^T W_1 (H_{fp} - \hat{\hat{H}}_{fp}) W_2\right).    (65)

Given that the system models have different scales in each realization, we normalized the risk performance \mathcal{R}_k by that of a reference estimator \mathcal{R}_k^o and averaged over \textsf{N} realizations as follows:

\bar{\mathcal{R}} = \exp\left(\frac{1}{\textsf{N}} \sum_{k=1}^{\textsf{N}} \ln\left(\frac{\mathcal{R}_k}{\mathcal{R}_k^o}\right)\right).    (66)

The logarithmic function is used to give symmetric weight to risks that are either much lower or much higher than the reference.

We used the same estimate \hat{G}_f, as defined by (35), for all shrinkage methods. We started by truncating \hat{H}_{fp} to the largest r singular values and then, computing the corresponding \hat{G}_f(r) and threshold \lambda_{\text{soft}}(r), we obtained

r^* = \min\{r : \max\{l : S_{[l]} > \lambda_{\text{soft}}(r)\} < r\}.    (67)

In words, r^* is the least r such that the order estimate is less than r. We then set \hat{G}_f = \hat{G}_f(r^*) and use r^* in (35).

6.1 A heuristic as benchmark

To provide the reference estimator above, we propose a heuristic that seeks to mimic the order selection done by visual inspection. Namely, what a typical user would do is look at the plot of singular values and identify the point of sharp change in their rate of decline. In order to do something similar automatically, we borrow from the idea of effective sample size and, from the vector S of ordered singular values, define

n_{\text{eff}} = \frac{\left(\sum_{l=1}^{i} S_l\right)^2}{\sum_{l=1}^{i} S_l^2}.    (68)

Next we construct a function \eta(l) that linearly fits \ln S_l from l = n_{\text{eff}} + 1 to l = i. Finally we define our heuristic order estimate as

\hat{n}_x = \max\{l : S_l > \exp(\eta(l))\}.    (69)

As a second benchmark, we also considered the selection criterion adopted in [1, 29]:

\hat{n}_{x,2} = \max\{l : S_l > \exp((\ln S_1 + \ln S_i)/2)\}.    (70)
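Both benchmarks could be coded as in the sketch below; rounding n_eff down and requiring at least two tail points for the log-linear fit are our own assumptions.

```python
import numpy as np

def heuristic_order(S):
    """Benchmark order estimate (68)-(69): effective size of the singular value
    vector, then a log-linear fit of the tail as a noise-floor reference."""
    i = len(S)
    n_eff = int(np.floor(np.sum(S)**2 / np.sum(S**2)))   # rounding down is our choice
    tail = np.arange(n_eff, i)                           # 0-based indices n_eff+1 .. i
    slope, intercept = np.polyfit(tail + 1, np.log(S[tail]), 1)
    eta = np.exp(slope * np.arange(1, i + 1) + intercept)
    above = np.where(S > eta)[0]
    return above[-1] + 1 if above.size else 0

def order_from_midpoint(S):
    """Second benchmark (70): threshold at the geometric mean of S_1 and S_i."""
    thr = np.exp((np.log(S[0]) + np.log(S[-1])) / 2.0)
    above = np.where(S > thr)[0]
    return above[-1] + 1 if above.size else 0
```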

6.2 Results

Our results are summarized in Tables 1, 2 and 3, where we considered the three main weight choices. We observe that the optimal shrinkage method and the alternating least squares approach give consistently lower risk estimates. Interestingly, we observe that soft-thresholding and hard-thresholding do not always improve upon the benchmark. Since the SURE-based method also applies soft-thresholding, we see that the problem does not lie in the class of shrinkage functions, but in a poor parameter choice based on asymptotic properties of large random matrices.

Method  Average Normalized Risk (\pm 5\%)
Heuristic (69)  1.0
Heuristic (70)  3.83
Hard-thresholding  0.97
Soft-thresholding  0.76
Optimal Shrinkage  0.75
SURE  0.58
Alternating Least Squares  0.39
Table 1: Risk performance for W_1 = I_i and W_2 = I_{2i} with 3000 Monte Carlo runs. Other parameters are N_F = 250, N_o = 1.
Method  Average Normalized Risk (\pm 5\%)
Heuristic (69)  1.0
Heuristic (70)  1.53
Hard-thresholding  0.75
Soft-thresholding  1.57
Optimal Shrinkage  0.58
SURE  0.73
Alternating Least Squares  0.52
Table 2: Risk performance for W_1 = \hat{G}_f^{-1} and W_2 = (Z_p \Pi_{U_f}^{\perp} Z_p^T)^{1/2} (similarly to CVA) with 3000 Monte Carlo runs. Other parameters are N_F = 250, N_o = 1.
Method  Average Normalized Risk (\pm 5\%)
Heuristic (69)  1.0
Heuristic (70)  1.52
Hard-thresholding  0.68
Soft-thresholding  1.20
Optimal Shrinkage  0.49
SURE  0.64
Alternating Least Squares  0.42
Table 3: Risk performance for W_1 = I_i and W_2 = Z_p (as in N4SID) with 3000 Monte Carlo runs. Other parameters are N_F = 250, N_o = 1.

Appendix A Equivariant Estimators of the Covariance on the Lower Triangular Toeplitz Group

The set of lower triangular Toeplitz matrices, as exemplified in (12), is a group under matrix multiplication. This group operation may be interpreted as the cascading of dynamical systems. In this sense, we are interested in estimators that are invariant with respect to dynamical system cascading. For example, if we pass both input and output through a linear filter, we want an estimator that gives the same model regardless of the filtering. In the Bayesian framework, not all priors result in such an equivariant estimator. In this section, we derive a prior for the matrix G_f that is invariant under the group operation of multiplication by lower triangular Toeplitz matrices.

Let \textsf{G} \subset \mathbb{R}^{i \times i} be one such group and consider two matrices \textsf{A}, \textsf{B} \in \textsf{G}. Let \textsf{C} = \textsf{A}\textsf{B}. Using the fact that a_{k,l} = a_{k+m,l+m} for \textsf{A} = [a_{k,l}] \in \textsf{G}, we have that the last row of \textsf{C} is given by

c_{il} = \sum_m a_{im} b_{ml} = \sum_m a_{i,m} b_{i,l-m+i}.    (71)

If we parametrize these matrices by their last row, we can compute the Jacobian for right multiplication as

J^R_{[l,m]} = \frac{\partial c_{i,l}}{\partial a_{i,m}} = b_{i,l-m+i}.    (72)

Given the triangular structure of \textsf{B}, we have that |J^R| = |b_{i,i}|^i. As for left multiplication,

J^L_{[l,m]} = \frac{\partial c_{i,l}}{\partial b_{i,m}} = a_{i,l-m+i}    (73)

and |J^L| = |a_{i,i}|^i. With this we can define the left and right invariant Haar measure

\mu(\mathcal{A}) = \int_{\mathcal{A}} \frac{da_{i,\cdot}}{|a_{i,i}|^i}.    (74)

Indeed, to check right invariance, let \textsf{C} = \textsf{A}\textsf{B} and observe that

\int f(\textsf{C}\textsf{B}^{-1})\, \frac{dc_{i,\cdot}}{|c_{i,i}|^i} = \int f(\textsf{A})\, \frac{|b_{i,i}|^i\, da_{i,\cdot}}{|c_{i,i}|^i} = \int f(\textsf{A})\, \frac{da_{i,\cdot}}{|a_{i,i}|^i},    (75)

where we used the Jacobian defined above in the first equality and, in the second equality, the fact that c_{i,i} = a_{i,i} b_{i,i}. Left invariance may be checked similarly for f(\textsf{B}^{-1}\textsf{C}). Therefore, we have arrived at a prior that is invariant under the group operation.

Next, we derive the posterior that corresponds to this prior. Recall that the residues are

\mathcal{E} = Y_f - H_{fp} Z_p - H_f U_f = G_f E_f.    (76)

Taking the vectorization and making use of the noise vector \bar{e}_f on the subspace of dimension i+j-1, we have that

\text{vec}(\mathcal{E}) = (I \otimes G_f) B_W \bar{e}_f.    (77)

Therefore, the covariance of the residues is given by

\Sigma_{\mathcal{E}} = (I \otimes G_f) B_W B_W^T (I \otimes G_f^T).    (78)

Since \text{rank}(B_W B_W^T) is i+j-1, \Sigma_{\mathcal{E}} is rank deficient and we shall make use of its pseudo-determinant (the product of its non-zero singular values) in obtaining its pdf:

|\Sigma_{\mathcal{E}}|_+ = |(I \otimes G_f) B_W B_W^T (I \otimes G_f^T)|_+ = |(I \otimes G_f) \cdot \text{chol}(B_W B_W^T)|_+^2 \propto G_{f[i,i]}^{2(i+j-1)},    (79)

where we used the fact that I \otimes G_f and \text{chol}(B_W B_W^T) are lower triangular and the fact that \text{chol}(B_W B_W^T) must have exactly i+j-1 nonzero entries on its diagonal. In order to make the vectorization of G_f explicit in the likelihood function, we note that

E_f = G_f^{-1} \mathcal{E} \;\Rightarrow\; B_W \bar{e}_f = (\mathcal{E}^T \otimes I_i)\, \text{vec}(G_f^{-1}) \;\Rightarrow\; \bar{e}_f = B_W^{\dagger} (\mathcal{E}^T \otimes I_i) B_T (G_{f[i,\cdot]}^{-1})^T.    (80)

Therefore,

p(\mathcal{E}|G_f) \propto \frac{1}{|G_{f[i,i]}|^{i+j-1}} \exp\left(-\frac{1}{2} G_{f[i,\cdot]}^{-1} B_T^T (\mathcal{E} \otimes I_i) (B_W^{\dagger})^T B_W^{\dagger} (\mathcal{E}^T \otimes I_i) B_T (G_{f[i,\cdot]}^{-1})^T\right)    (81)

and the posterior would be proportional to

p(\mathcal{E}|G_f^{-1})\, d\mu(G_f^{-1}) \propto |G_{f[i,i]}^{-1}|^{i+j-1} \exp\left(-\frac{1}{2} G_{f[i,\cdot]}^{-1} \Omega (G_{f[i,\cdot]}^{-1})^T\right) \cdot \frac{1}{|G_{f[i,i]}^{-1}|^i}    (82)
= |G_{f[i,i]}^{-1}|^{j-1} \exp\left(-\frac{1}{2} G_{f[i,\cdot]}^{-1} \Omega (G_{f[i,\cdot]}^{-1})^T\right).    (83)

Defining the change of variables \nu^T = G_{f[i,\cdot]}^{-1} \Omega_L, we have

p(\nu) \propto |\nu_i|^{j-1} \exp\left(-\frac{1}{2}\nu^T\nu\right) = |\nu_i^2|^{\frac{j+1}{2}-1} \exp\left(-\frac{1}{2}\nu^T\nu\right).    (84)

Therefore \nu_i^2 \sim \chi^2_{j+1} and \nu_l \sim N(0,1), for l = 1, \ldots, i-1. This is precisely the posterior given by equations (45)-(50).

If, on the other hand, we assumed that \operatorname{E}[e_f e_f^T] = I_{ij}, then

|\Sigma_{\mathcal{E}}| \propto |G_{f[i,i]}|^{2ij}    (85)

and the posterior of equations (51)-(56) would follow.

References

  • [1] Michel Verhaegen and Anders Hansson. N2SID: Nuclear norm subspace identification of innovation models. Automatica, 72:57–63, 2016.
  • [2] Roy S Smith. Frequency domain subspace identification using nuclear norm minimization and Hankel matrix realizations. IEEE Transactions on Automatic Control, 59(11):2886–2896, 2014.
  • [3] Gianluigi Pillonetto, Tianshi Chen, Alessandro Chiuso, Giuseppe De Nicolao, and Lennart Ljung. Regularized linear system identification using atomic, nuclear and kernel-based norms: The role of the stability constraint. Automatica, 69:137–149, 2016.
  • [4] Alessandro Chiuso and Gianluigi Pillonetto. System identification: A machine learning perspective. Annual Review of Control, Robotics, and Autonomous Systems, 2:281–304, 2019.
  • [5] Yue Sun, Samet Oymak, and Maryam Fazel. Finite sample identification of low-order LTI systems via nuclear norm regularization. IEEE Open Journal of Control Systems, 1:237–254, 2022.
  • [6] Alessandro Chiuso. Regularization and Bayesian learning in dynamical systems: Past, present and future. Annual Reviews in Control, 41:24–38, 2016.
  • [7] Lennart Ljung, Tianshi Chen, and Biqiang Mu. A shift in paradigm for system identification. International Journal of Control, 93(2):173–180, 2020.
  • [8] Jie Liu and Bing Li. A novel strategy for response and force reconstruction under impact excitation. Journal of Mechanical Science and Technology, 32(8):3581–3596, 2018.
  • [9] Vineet Vajpayee, Siddhartha Mukhopadhyay, and Akhilanand Pati Tiwari. Data-driven subspace predictive control of a nuclear reactor. IEEE Transactions on Nuclear Science, 65(2):666–679, 2017.
  • [10] Behnood Rasti, Magnus O Ulfarsson, and Johannes R Sveinsson. Hyperspectral subspace identification using SURE. IEEE Geoscience and Remote Sensing Letters, 12(12):2481–2485, 2015.
  • [11] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
  • [12] Clément Elvira, Pierre Chainais, and Nicolas Dobigeon. Bayesian nonparametric subspace estimation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2247–2251. IEEE, 2017.
  • [13] Qian Zhao, Deyu Meng, Zongben Xu, Wangmeng Zuo, and Lei Zhang. Robust principal component analysis with complex noise. In International conference on machine learning, pages 55–63. PMLR, 2014.
  • [14] S Derin Babacan, Martin Luessi, Rafael Molina, and Aggelos K Katsaggelos. Sparse Bayesian methods for low-rank matrix estimation. IEEE Transactions on Signal Processing, 60(8):3964–3977, 2012.
  • [15] Xinghao Ding, Lihan He, and Lawrence Carin. Bayesian robust principal component analysis. IEEE Transactions on Image Processing, 20(12):3419–3430, 2011.
  • [16] Wei Chu and Zoubin Ghahramani. Probabilistic models for incomplete multi-dimensional arrays. In Artificial Intelligence and Statistics, pages 89–96. PMLR, 2009.
  • [17] David Gerard and Peter Hoff. Equivariant minimax dominators of the MLE in the array normal model. Journal of multivariate analysis, 137:32–49, 2015.
  • [18] Peter D Hoff. Equivariant and scale-free tucker decomposition models. Bayesian Analysis, 11(3):627–648, 2016.
  • [19] Ming Shi, Dan Li, and Jian Qiu Zhang. An alternating Bayesian approach to PARAFAC decomposition of tensors. IEEE Access, 6:36487–36499, 2018.
  • [20] Anastasios Tsiamis and George J Pappas. Finite sample analysis of stochastic system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3648–3654. IEEE, 2019.
  • [21] S Joe Qin. An overview of subspace identification. Computers & chemical engineering, 30(10-12):1502–1513, 2006.
  • [22] Peter Van Overschee and Bart De Moor. Subspace identification for linear systems: Theory—Implementation—Applications. Springer Science & Business Media, 2012.
  • [23] Erich L Lehmann and George Casella. Theory of point estimation. Springer Science & Business Media, 2006.
  • [24] Christian P Robert et al. The Bayesian choice: from decision-theoretic foundations to computational implementation, volume 2. Springer, 2007.
  • [25] William James and Charles Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379. University of California Press, 1961.
  • [26] Matan Gavish and David L Donoho. The optimal hard threshold for singular values is 4/\sqrt{3}. IEEE Transactions on Information Theory, 60(8):5040–5053, 2014.
  • [27] Matan Gavish and David L Donoho. Optimal shrinkage of singular values. IEEE Transactions on Information Theory, 63(4):2137–2152, 2017.
  • [28] Emmanuel J Candes, Carlos A Sing-Long, and Joshua D Trzasko. Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE transactions on signal processing, 61(19):4643–4657, 2013.
  • [29] Lennart Ljung. System identification toolbox for use with MATLAB. Mathworks, 2007.