FEDERAL UNIVERSITY OF MINAS GERAIS
Department of Electronic Engineering
Technical Report
Robust Bayesian Subspace Identification for Small Data Sets
1 Introduction
Model estimates obtained from traditional subspace identification methods may be subject to significant variance. This elevated variance is aggravated for large models or limited sample sizes. Common remedies for the effect of variance are regularized estimators, shrinkage estimators and Bayesian estimation. In the current work we investigate the latter two, which have not yet been applied to subspace identification. Our experimental results show that our proposed estimators may reduce the estimation risk to a fraction of that of traditional subspace methods.
In the scope of subspace identification, a number of related works explore regularization techniques such as nuclear norm minimization [1, 2, 3, 4, 5]. Under specific hyperparameter choices, regularized estimators can be understood in a Bayesian framework as maximum a posteriori estimators. These regularized estimators carry two important disadvantages in comparison to our approach: i) the choice of hyperparameters is not straightforward and it often does not take advantage of the problem statistics at hand; ii) they are not oriented toward risk minimization.
The use of regularized, shrinkage and Bayesian estimators in system identification is discussed in [6, 7, 4], where it is greeted as a promising and refreshing trend in this field. We are only aware of one work where a shrinking function is used for subspace identification [8]. Differently from our approach, the shrinking is there applied to all elements of the matrix and not only to the singular values. In a somewhat related application, [9] applies wavelet thresholding as a preprocessor in a subspace identification setting. Outside the field of system identification, we also find the application of similar methods to subspace identification in hyperspectral imaging [10].
If we regard subspace identification as a particular application of principal component analysis, our work can be understood as an application of robust principal component analysis, which finds many applications in image and video processing and in econometrics [11]. Bayesian solutions to this problem are found in [12, 13, 14]. Our alternating least squares method was conceived as an improvement upon the Bayesian robust principal component analysis of [15] and it is closely related to the tensor regression methods presented in [16, 17, 18, 19].
A different approach to subspace identification with finite samples is found in [20], where the focus is on deciding the number of samples needed to bound the estimation error with high probability.
2 A brief recapitulation of subspace identification
Consider the linear state space model in innovation form
(1)  $x_{k+1} = A x_k + B u_k + K e_k$
(2)  $y_k = C x_k + D u_k + e_k$
where $x_k$ is the state, $y_k$ is the measured variable, $u_k$ is the input variable and $e_k$ is the innovations process, assumed to be white Gaussian noise. The model parameters $A$, $B$, $C$, $D$ and $K$ are matrices with the appropriate dimensions. At times, it may be useful to express the same model in the predictor form
(3)  $x_{k+1} = A_K x_k + B_K u_k + K y_k$
(4)  $y_k = C x_k + D u_k + e_k$
where $A_K = A - KC$ and $B_K = B - KD$.
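As a concrete illustration of the innovation form (1)-(2), the sketch below simulates such a model; the function name, dimensions and noise level are illustrative placeholders rather than anything prescribed by the method.

```python
import numpy as np

def simulate_innovation_form(A, B, C, D, K, u, sigma_e=1.0, rng=None):
    """Simulate x_{k+1} = A x_k + B u_k + K e_k and y_k = C x_k + D u_k + e_k."""
    rng = np.random.default_rng() if rng is None else rng
    n, n_y, N = A.shape[0], C.shape[0], u.shape[1]
    x = np.zeros(n)
    y = np.zeros((n_y, N))
    for k in range(N):
        e = sigma_e * rng.standard_normal(n_y)      # white Gaussian innovations
        y[:, k] = C @ x + D @ u[:, k] + e
        x = A @ x + B @ u[:, k] + K @ e
    return y
```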
Following the formulation in [21], we make use of the extended state space model
(5)  $Y_f = \Gamma_f X_k + H_f U_f + G_f E_f$
and its predictor form
(6)  $Y_f = H_{fp} Z_p + H_f U_f + G_f E_f$
where the available data is arranged in Hankel matrices defined by
(7)  $Y_f = \begin{bmatrix} y_k & y_{k+1} & \cdots & y_{k+N-1} \\ y_{k+1} & y_{k+2} & \cdots & y_{k+N} \\ \vdots & & & \vdots \\ y_{k+f-1} & y_{k+f} & \cdots & y_{k+f+N-2} \end{bmatrix}$
and similarly are defined $U_f$ and $E_f$. Here $f$ is the future horizon and $N$ is a function of the size of the available data set. The state sequence is defined as
(8)  $X_k = \begin{bmatrix} x_k & x_{k+1} & \cdots & x_{k+N-1} \end{bmatrix}$
The past information is collected in $Z_p$ up to the past horizon $p$ and arranged as
(9)  $Z_p = \begin{bmatrix} z_{k-1} & z_k & \cdots & z_{k+N-2} \\ z_{k-2} & z_{k-1} & \cdots & z_{k+N-3} \\ \vdots & & & \vdots \\ z_{k-p} & z_{k-p+1} & \cdots & z_{k+N-p-1} \end{bmatrix}, \qquad z_k = \begin{bmatrix} u_k \\ y_k \end{bmatrix}$
and likewise for $U_p$ and $Y_p$.
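The Hankel structure above is purely a matter of data arrangement; the helper below is a minimal sketch of how such block Hankel matrices can be built (the function name and calling convention are our own).

```python
import numpy as np

def block_hankel(w, start, rows, cols):
    """Stack a signal w (shape: dim x T) into a block Hankel matrix with
    `rows` block rows and `cols` columns, starting at sample `start`."""
    dim = w.shape[0]
    H = np.zeros((rows * dim, cols))
    for i in range(rows):
        H[i * dim:(i + 1) * dim, :] = w[:, start + i:start + i + cols]
    return H

# Example: future output Hankel matrix with horizon f and N columns
# Y_f = block_hankel(y, start=k, rows=f, cols=N)
```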
As a consequence, $\Gamma_f$ is the extended observability matrix defined by
(10)  $\Gamma_f = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{f-1} \end{bmatrix}$
and $H_f$ and $G_f$ are Toeplitz matrices given by
(11)  $H_f = \begin{bmatrix} D & 0 & \cdots & 0 \\ CB & D & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ CA^{f-2}B & CA^{f-3}B & \cdots & D \end{bmatrix}$
and
(12)  $G_f = \begin{bmatrix} I & 0 & \cdots & 0 \\ CK & I & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ CA^{f-2}K & CA^{f-3}K & \cdots & I \end{bmatrix}$
where $\Sigma_e = \mathbb{E}[e_k e_k^\top]$ is the innovations covariance. The matrix $H_{fp}$ is reminiscent of a matrix of Markov parameters and is given by
(13)  $H_{fp} = \begin{bmatrix} h_{1,1} & h_{1,2} & \cdots & h_{1,p} \\ h_{2,1} & h_{2,2} & \cdots & h_{2,p} \\ \vdots & & & \vdots \\ h_{f,1} & h_{f,2} & \cdots & h_{f,p} \end{bmatrix}$
with
(14)  $h_{i,j} = C A_K^{\,i+j-2}\, \bar{B}_K$
for $i = 1, \ldots, f$, $j = 1, \ldots, p$ and $\bar{B}_K = \begin{bmatrix} B_K & K \end{bmatrix}$. These matrices can also be decomposed as products of the extended observability and controllability matrices:
(15)  $H_{fp} = \bar{\Gamma}_f \bar{\Delta}_p, \qquad \bar{\Gamma}_f = \begin{bmatrix} C \\ C A_K \\ \vdots \\ C A_K^{f-1} \end{bmatrix}, \qquad \bar{\Delta}_p = \begin{bmatrix} \bar{B}_K & A_K \bar{B}_K & \cdots & A_K^{p-1} \bar{B}_K \end{bmatrix}$

where the bars denote the predictor form counterparts.
Several estimation methods have been proposed in the literature departing from this formulation and exploiting the fact that the data as structured in (6) should lie in low dimensional subspaces (see [21, 22] for a comprehensive review). A common approach is to solve (6) via least squares and then estimate the extended observability matrix from a truncated singular value decomposition of $\hat{H}_{fp}$. The matrices $A$ and $C$ can then be recovered from estimates of the extended observability matrix and of the state sequence. The matrices $B$, $D$ and $K$ can then be obtained from a new least squares problem or from estimates of the Toeplitz matrices.
From a general perspective, most methods obtain an initial estimate $\hat{H}_{fp}$ from least squares and, given two weight matrices $W_1$ and $W_2$, obtain a low-dimensional estimate by truncating the singular value decomposition of $W_1 \hat{H}_{fp} W_2$ to the $\hat{n}$ largest components:
(16)  $W_1 \hat{H}_{fp} W_2 = U S V^\top \approx U_{\hat{n}} S_{\hat{n}} V_{\hat{n}}^\top$
In this context, $\hat{n}$ would be an estimate of the system order $n$. Methods of obtaining such an estimate of $n$ have been treated with difficulty in the literature, and one usually resorts to ad hoc solutions or to the visual inspection of the singular values.
Traditional subspace identification methods obtain an initial estimate of $H_{fp}$ from (6) by applying the least squares method as follows:
(17)  $\begin{bmatrix} \hat{H}_{fp} & \hat{H}_f \end{bmatrix} = Y_f \begin{bmatrix} Z_p \\ U_f \end{bmatrix}^\top \left( \begin{bmatrix} Z_p \\ U_f \end{bmatrix} \begin{bmatrix} Z_p \\ U_f \end{bmatrix}^\top \right)^{-1}$
Notice, however, that this estimate only approximates the maximum likelihood estimate, as the noise term in (6) is at best only approximately white and as $G_f$ is a lower triangular Toeplitz matrix.
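A minimal sketch of this least squares step, assuming the data has already been arranged into the Hankel matrices $Y_f$, $Z_p$ and $U_f$; the function name and the use of a generic lstsq solver instead of the explicit normal equations are our own choices.

```python
import numpy as np

def ls_estimate(Yf, Zp, Uf):
    """Regress the future outputs on past data and future inputs, as in (17).
    Returns the coefficient blocks associated with Zp and Uf."""
    Phi = np.vstack([Zp, Uf])                        # stacked regressor matrix
    # Solve Yf ~ Theta @ Phi in the least squares sense
    Theta, *_ = np.linalg.lstsq(Phi.T, Yf.T, rcond=None)
    Theta = Theta.T
    H_fp = Theta[:, :Zp.shape[0]]                    # coefficient of the past data
    H_f = Theta[:, Zp.shape[0]:]                     # coefficient of the future inputs
    return H_fp, H_f
```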
At this point it is worthwhile having a discussion on the approximations typically made in the process of solving (6). Given that $E_f$ is structured as a Hankel matrix as in (9), its components are not truly independent. Nevertheless, the off-diagonal correlations are reasonably sparse, which gives a reasonable approximation of a white matrix. In order to incorporate the Hankel and Toeplitz structure of $E_f$ and $G_f$ in the computation, one would have to vectorize equation (6):
(18)  $y_f = (Z_p^\top \otimes I)\, h_{fp} + (U_f^\top \otimes I)\, h_f + (I \otimes G_f)\, e_f$
where the lower case letters denote the vectorized versions of the corresponding matrices. In addition, $G_f$ and $E_f$ lie in lower dimensional subspaces. This can be handled by introducing known matrices that convert the original lower dimensional data into the vectorized versions of the Toeplitz matrix $G_f$ and the Hankel matrix $E_f$. This would result in the new equation
(19)
However, the least squares problem that arises in (19) is high dimensional, sparse and poorly conditioned numerically. In this regard, one of the virtues of the formulation in (6) is to avoid the numerical difficulties that would arise in a more detailed formulation.
3 Problem Description
Most subspace identification methods rely on the limit case of large enough data sets, $N \to \infty$. In addition, most approaches are based on least squares or, equivalently, on maximum likelihood estimators of $H_{fp}$. However, it is well known that such estimators perform poorly with regard to the estimation risk under quadratic loss (they are not admissible) when the number of parameters is larger than 2 [23, 24], which is classically known as the Stein phenomenon [25]. On the other hand, the number of parameters in $H_{fp}$ may be as large as $f n_y \cdot p (n_u + n_y)$, which may be considerably large. Even if we consider that $H_{fp}$ is low rank and may be parameterized and estimated in a smaller space, it would still depend on the order of $n (f n_y + p (n_u + n_y))$ parameters, which is still large in most applications.
In addition, there seems to be considerable difficulty in the literature in finding appropriate methods to estimate the system order $n$, which is of particular concern given the high sensitivity of the results to this parameter.
Within this context, our goal is to investigate robustified methods to estimate $H_{fp}$. In particular, two groups of methods shall be considered. The first group is comprised of the main shrinkage estimators of singular values available in the literature. The second group consists of Bayesian estimators that can be efficiently computed with a Gibbs sampling scheme.
Given that the estimation errors in the system matrices are upper bounded by a constant times the estimation error in $H_{fp}$, as demonstrated in [20], we concentrate our efforts on the estimators of $H_{fp}$ and leave an investigation of the system matrix estimators to future work.
4 Singular Value Shrinkage
Shrinkage estimators are biased estimators that shrink the maximum likelihood estimate towards a pre-assigned value (often to zero). These estimators were first proposed by Stein [25] and it has been demonstrated that this type of estimator may provide smaller estimation risk than the maximum likelihood estimator under quadratic loss when the number of parameters is larger than 2. The rationale behind such estimators is that, although shrinking introduces bias, it reduces variance and therefore it may be beneficial in reducing the overall quadratic error.
In the context of singular value decompositions, approaches based on random matrix theory obtain optimal shrinkage estimators in the asymptotic case of the matrix dimensions approaching infinity. This assumption may be far from adequate in the case of subspace identification with small data sets. Non-asymptotic results are also available, based on the minimization of unbiased estimates of the risk.
Most approaches consider the problem of estimating some matrix $X$ from a noisy measurement
(20)  $Y = X + \sigma Z$
where $\sigma$ is known and the noise matrix $Z$ is orthogonally invariant, i.e., its distribution is the same as that of $Q_1 Z Q_2$ for any orthogonal matrices $Q_1$ and $Q_2$. In practice, $Y$ is not a single measurement, but the maximum likelihood estimate obtained from multiple measurements, which gives a sufficient statistic based on all available measurements.
Considering the Frobenius norm squared as the risk loss function, and given that the noise is orthogonally invariant, it suffices to restrict our search to shrinkage estimators of the form:
(21)  $\hat{X} = U f(S) V^\top$
where $Y = U S V^\top$ is the singular value decomposition of $Y$ and $f(\cdot)$ is the shrinkage function.
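The sketch below illustrates the generic form (21): compute the SVD of the noisy matrix and apply an arbitrary shrinkage function to its singular values. The specific shrinkers discussed next can be plugged in through the shrink argument; the helper name is our own.

```python
import numpy as np

def shrink_singular_values(Y, shrink):
    """Apply a shrinkage function to the singular values of Y, as in (21)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = shrink(s)                 # element-wise shrinkage of the singular values
    return (U * s_shrunk) @ Vt
```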
4.1 Hard and Soft-Thresholding
In the asymptotic case of very large dimensions, [26] computed the optimal parameters for the hard and soft-thresholding functions:
(22)  $f_{\mathrm{hard}}(s_i) = s_i\, \mathbb{1}\{ s_i > \tau_{\mathrm{hard}} \}$
(23)  $f_{\mathrm{soft}}(s_i) = \left( s_i - \tau_{\mathrm{soft}} \right)_+$
where $\mathbb{1}\{\cdot\}$ is the indicator function, $(\cdot)_+$ is the positive-part function, the functions are applied element-wise on the singular values, and the optimal thresholds are
(24)  $\tau_{\mathrm{hard}} = \sqrt{ 2(\beta + 1) + \frac{8\beta}{\beta + 1 + \sqrt{\beta^2 + 14\beta + 1}} }\; \sqrt{n}\, \sigma$
and
(25)
with $\beta = m/n \le 1$ being the aspect ratio of the $m \times n$ matrix $Y$.
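The following sketch implements both thresholding rules on top of the generic helper above. The hard-threshold constant follows our reading of [26] for known noise level $\sigma$ and an $m \times n$ matrix with $m \le n$; the soft threshold is left as a user-supplied parameter.

```python
import numpy as np

def gavish_donoho_hard_threshold(sigma, m, n):
    """Asymptotically optimal hard threshold for an m x n matrix (m <= n)
    with known noise level sigma, per our reading of [26]."""
    beta = m / n
    lam = np.sqrt(2 * (beta + 1) + 8 * beta / (beta + 1 + np.sqrt(beta**2 + 14 * beta + 1)))
    return lam * np.sqrt(n) * sigma

def hard_threshold(s, tau):
    """Keep singular values above tau and zero out the rest, as in (22)."""
    return np.where(s > tau, s, 0.0)

def soft_threshold(s, tau):
    """Shift singular values down by tau and clip at zero, as in (23)."""
    return np.maximum(s - tau, 0.0)

# Usage with the generic shrinker:
# tau = gavish_donoho_hard_threshold(sigma_hat, *Y.shape)
# X_hat = shrink_singular_values(Y, lambda s: hard_threshold(s, tau))
```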
4.2 Optimal shrinkage in the asymptotic case
Still in the asymptotic case, the overall optimal shrinkage function was obtained in [27]. For the squared Frobenius norm loss function, the optimal shrinkage is given by
(26)  $f_{\mathrm{opt}}(s_i) = \sqrt{n}\,\sigma\, \eta\!\left( \frac{s_i}{\sqrt{n}\,\sigma} \right), \qquad \eta(y) = \begin{cases} \frac{1}{y}\sqrt{ \left( y^2 - \beta - 1 \right)^2 - 4\beta }, & y \ge 1 + \sqrt{\beta} \\ 0, & \text{otherwise} \end{cases}$
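A sketch of this shrinker as we read it from [27], with singular values rescaled to noise units $\sigma\sqrt{n}$ before shrinkage and rescaled back afterwards; the scaling convention and function name are our assumptions.

```python
import numpy as np

def optimal_frobenius_shrinker(s, sigma, m, n):
    """Optimal singular value shrinkage for the Frobenius loss, per our reading of [27].
    s: singular values of the m x n noisy matrix (m <= n), sigma: noise level."""
    beta = m / n
    y = s / (np.sqrt(n) * sigma)                   # singular values in noise units
    shrunk = np.zeros_like(y)
    above = y >= 1 + np.sqrt(beta)                 # values above the noise bulk edge
    shrunk[above] = np.sqrt((y[above]**2 - beta - 1)**2 - 4 * beta) / y[above]
    return shrunk * np.sqrt(n) * sigma             # back to the original scale
```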
4.3 SURE-based thresholding
In the non-asymptotic case, [28] obtained an expression for Stein's unbiased risk estimate (SURE) for estimators based on soft-thresholding functions. The threshold values can then be obtained by minimizing the SURE risk estimate. For known $\sigma$ and the Frobenius norm squared loss, the unbiased risk estimate is given by
(27)
where $s_i$ denotes the $i$-th singular value of $Y$ and $\tau$ is the threshold. Given $\sigma$, the risk estimate can be easily minimized since it is a piecewise quadratic function of $\tau$.
4.4 Subspace identification with shrinkage estimators
Following the traditional subspace identification methods, we obtain the initial estimate of $H_{fp}$ using the least squares estimate of (17). This estimate offers the advantage of being unbiased and of having a covariance that can be easily characterized. This facilitates framing the problem of singular value shrinking in the format given by (20).
In the context of shrinkage estimators, one could naturally propose to also shrink the parameters estimated in (17) by replacing the linear regression with a ridge regression. However, we would then have a biased estimate of $H_{fp}$, and this would make it harder to explore the theory of shrinkage estimators for singular values.
Our goal is to minimize the risk function
(28)  $\mathcal{R} = \mathbb{E}\left[ \left\| W_1 \left( \hat{H}_{fp} - H_{fp} \right) W_2 \right\|_F^2 \right]$
Within the domain of the shrinkage estimators presented above, we shall minimize the risk in (28) if we shrink the singular values of $W_1 \hat{H}_{fp} W_2$. However, the noise level $\sigma$ is assumed known in (20). This means that we still must obtain a reasonable estimate for it.
Defining the projection matrix onto the orthogonal complement of the row space of $U_f$ as $\Pi_{U_f}^{\perp}$, the estimate in (17) is equivalent to [21]
(29)  $\hat{H}_{fp} = Y_f\, \Pi_{U_f}^{\perp} Z_p^\top \left( Z_p \Pi_{U_f}^{\perp} Z_p^\top \right)^{-1}$
Substituting $Y_f$ as in (6), this leads to
(30)  $\hat{H}_{fp} = H_{fp} + G_f E_f\, \Pi_{U_f}^{\perp} Z_p^\top \left( Z_p \Pi_{U_f}^{\perp} Z_p^\top \right)^{-1}$
Multiplying by the weight matrices, we have
(31)  $W_1 \hat{H}_{fp} W_2 = W_1 H_{fp} W_2 + W_1 G_f E_f\, \Pi_{U_f}^{\perp} Z_p^\top \left( Z_p \Pi_{U_f}^{\perp} Z_p^\top \right)^{-1} W_2$
Unless we carefully pick the weight matrices, the noise term in (31) is not orthogonally invariant white noise as assumed in (20). If we consider a white approximation for the vectorized noise $\mathrm{vec}(E_f)$, then the covariance matrix for the vectorization of the noise term in (31) would be
(32)  $\left( W_2^\top \left( Z_p \Pi_{U_f}^{\perp} Z_p^\top \right)^{-1} W_2 \right) \otimes \left( W_1 G_f \left( I_f \otimes \Sigma_e \right) G_f^\top W_1^\top \right)$
To make this covariance orthogonally invariant, one would choose $W_1 = \left( G_f (I_f \otimes \Sigma_e) G_f^\top \right)^{-1/2}$ and $W_2 = \left( Z_p \Pi_{U_f}^{\perp} Z_p^\top \right)^{1/2}$. This is closely related to the weight choice of the popular CVA algorithm for subspace identification [21]. Nonetheless, there might be reasons to choose different weights. For example, the N4SID algorithm uses weights chosen to penalize the prediction error. Alternatively, one might prefer to penalize errors in the input-output relation and choose the weights accordingly.
For general weight matrices, we identify the noise level with the largest possible noise level in a given direction, i.e.,
(33)  $\sigma^2 = \lambda_{\max}\!\left( W_2^\top \left( Z_p \Pi_{U_f}^{\perp} Z_p^\top \right)^{-1} W_2 \right) \lambda_{\max}\!\left( W_1 G_f \left( I_f \otimes \Sigma_e \right) G_f^\top W_1^\top \right)$
In order to estimate $\sigma^2$, we compute the residues
(34)  $\hat{E}_r = Y_f - \hat{H}_{fp} Z_p - \hat{H}_f U_f$
and construct the estimate
(35)  $\widehat{G_f \left( I_f \otimes \Sigma_e \right) G_f^\top} = \frac{1}{\nu}\, \hat{E}_r \hat{E}_r^\top$
where $\nu$ is the number of degrees of freedom of the linear regression. The estimate in (35) is enough to obtain $\sigma^2$ from (33). If we further desire to specify $G_f$ and $\Sigma_e$ separately, one simple approach is to obtain the Cholesky decomposition of (35) and then average over the matrix diagonals to enforce the Toeplitz structure on $G_f$. If an initial truncated estimate of $H_{fp}$ is computed, it may be used in (34) to further account for the noise that is present in the lower singular values of $\hat{H}_{fp}$. In that case, the number of degrees of freedom should be adjusted to account for the rank of the truncated estimate.
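A minimal sketch of such a residual-based noise estimate, taking $\sigma^2$ as the largest eigenvalue of the scaled residual covariance; the function signature and the omission of the weight matrices are simplifications of the procedure above.

```python
import numpy as np

def estimate_noise_level(Yf, Zp, Uf, H_fp, H_f, dof):
    """Estimate the noise level from the least squares residuals, as in (34)-(35)."""
    E = Yf - H_fp @ Zp - H_f @ Uf                 # residuals of the regression
    Sigma = (E @ E.T) / dof                       # residual covariance estimate
    sigma2 = np.linalg.eigvalsh(Sigma).max()      # largest noise level over directions
    return np.sqrt(sigma2), Sigma
```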
5 An Alternating Least Squares Bayesian Approach
Bayesian estimators are inherently robust. Under mild conditions [24], they are admissible, i.e., there exists no estimator that improves upon their risk uniformly over the parameter space. Under group invariance of priors and losses (see [24] for definitions), they are also minimax, i.e., they minimize the risk under the worst case parameter value. The shrinkage estimators presented above may often be characterized as Bayes estimators with an empirical prior, i.e., a prior distribution that is constructed from the data itself.
In our approach to subspace identification, we aim to construct a Bayesian method that is computationally simple. With this in mind, we choose priors that lead to simple regularized least squares steps. The priors are obtained empirically from the data. We have experimented with hierarchical Bayes priors as well, but did not find a convincing improvement with respect to the simpler empirical priors. Our model was inspired by that in [15] but, differently from that paper, we do not explicitly estimate the orthogonal matrices and the singular value matrix of the decomposition. We found that better results are obtained by estimating the two low-rank factors directly instead.
For the noise term, we propose a prior that is invariant with respect to the group of lower triangular Toeplitz matrices. Hence, the covariance estimator will be equivariant with respect to this group. In this particular case, it is possible to compute posterior samples in a reasonably simple manner.
For simplicity, we adopt the following set of independent priors:
(36)
(37)
(38)
(39)
where the former matrices are random matrices with the appropriate dimensions whose elements are independent normally distributed random variables, and the latter matrices are fixed parameters to be specified later. It will be shown later that the improper prior given to the noise term is invariant under the group of multiplication by lower triangular Toeplitz matrices.
Since the full posterior distribution for the estimation problem at hand is too hard to characterize analytically, we estimate its empirical distribution using a Gibbs sampler. In a Gibbs sampler, samples from dependent variables $\theta_1$ and $\theta_2$ are drawn iteratively from their conditional distributions as in:
(41)  $\theta_1^{(t+1)} \sim p\!\left( \theta_1 \mid \theta_2^{(t)}, \text{data} \right)$
(42)  $\theta_2^{(t+1)} \sim p\!\left( \theta_2 \mid \theta_1^{(t+1)}, \text{data} \right)$
The probability distribution of the resulting Markov chain is shown to converge under mild conditions to the joint posterior $p(\theta_1, \theta_2 \mid \text{data})$. The following posterior updates follow from our choice of prior:
(43)
and
(44)
where the auxiliary matrices are random matrices whose components are independently drawn from a unit normal distribution. Intuitively, we are performing independent ridge regressions row by row in (43) and column by column in (44), in addition to adding the corresponding simulated noise terms.
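The sketch below conveys the flavor of one such Gibbs sweep for a generic low-rank factorization $X \approx LR$ with i.i.d. Gaussian priors on the factors; the names L and R, the prior scale tau and the noise level sigma are illustrative choices, not the exact quantities or priors used in this report.

```python
import numpy as np

def gibbs_lowrank_step(X, L, R, sigma, tau, rng):
    """One Gibbs sweep for X ~ L @ R with i.i.d. Gaussian priors on L and R.
    Each row of L (and column of R) is drawn from its Gaussian conditional,
    which amounts to a ridge regression plus simulated posterior noise."""
    r = L.shape[1]
    # Update L given R: rows of X regressed on R
    prec = R @ R.T / sigma**2 + np.eye(r) / tau**2       # posterior precision
    cov = np.linalg.inv(prec)
    chol = np.linalg.cholesky(cov)
    mean = X @ R.T @ cov / sigma**2
    L = mean + rng.standard_normal(L.shape) @ chol.T
    # Update R given L: columns of X regressed on L
    prec = L.T @ L / sigma**2 + np.eye(r) / tau**2
    cov = np.linalg.inv(prec)
    chol = np.linalg.cholesky(cov)
    mean = cov @ L.T @ X / sigma**2
    R = mean + chol @ rng.standard_normal(R.shape)
    return L, R

# A posterior-mean estimate of X would average L @ R over the sweeps after burn-in.
```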
In order to define the posterior update for the noise factor, we first define the matrix that maps the last row of a lower triangular Toeplitz matrix to its vectorized form, and similarly the matrix that does the same for the Hankel noise term. Letting $\chi_k$ denote the chi distribution with $k$ degrees of freedom, the posterior update is given by:
(45)
(46)
(47)
(48)
(49)
(50)
where the last step takes the lower triangular factor of the Cholesky decomposition. In other words, we first obtain a sample of the last row of the Toeplitz factor, then we construct the full lower triangular Toeplitz matrix from this row and compute its inverse. Given that the matrices involved are very large and sparse, (46) may be somewhat tricky to compute. A safer alternative comes from ignoring the Hankel structure of the noise term and assuming that its elements are mutually independent. In this case, the posterior update becomes
(51)
(52)
(53)
(54)
(55)
(56)
Finally, the estimate for $H_{fp}$ is obtained by averaging over the trajectory of the Markov chain:
(57)
where the first samples, up to some burn-in period intended to remove the effect of transients, are discarded. In order to reduce variance and improve convergence, one may prefer to average over the expected values of (43) and (44) in every step (obtained by setting the respective simulated noise matrices to zero):
(58)
Regarding the parameters in the priors, we initially obtain estimates from (17) and then a truncated singular value decomposition of the weighted estimate. Then, we make
(59)
(60)
(61)
(62)
(63)
To justify such a choice, we note that the priors (36) and (38) approximately describe an SVD decomposition:
(64)
where the factor matrices behave as orthogonal matrices in expectation.
6 Numerical Experiments
In order to test our proposed methods, we ran Monte Carlo simulations on a large number of systems and compared the estimation risks.
The system order was uniformly distributed from to and we fixed . The sample size was defined as . The row-length of Hankel matrices was set as .
To obtain a stable system matrix , we first sampled an auxiliary matrix such that and a spectral radius . Then we made . The matrices and were sampled such that . We make . The measurement noise covariance was generated such that . Likewise, the process noise covariance was generated such that . The Kalman gain was therefore obtained from the previous parameters.
The system input is comprised of independent samples, with the signal to noise ratio (SNR) sampled uniformly on a logarithmic scale.
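One possible way to draw a random stable system for such a Monte Carlo study is sketched below; the spectral radius bound, the dimensions and the placeholder Kalman gain are illustrative values rather than the exact settings used in the experiments.

```python
import numpy as np

def random_stable_system(n, n_u, n_y, rho=0.95, rng=None):
    """Sample a random state space model (A, B, C, D, K) with spectral radius <= rho."""
    rng = np.random.default_rng() if rng is None else rng
    A_aux = rng.standard_normal((n, n))
    A = A_aux * (rho / np.max(np.abs(np.linalg.eigvals(A_aux))))   # rescale to radius rho
    B = rng.standard_normal((n, n_u))
    C = rng.standard_normal((n_y, n))
    D = rng.standard_normal((n_y, n_u))
    K = 0.1 * rng.standard_normal((n, n_y))                        # placeholder Kalman gain
    return A, B, C, D, K
```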
We applied the proposed algorithms to each model realization and response realization. For each realization and each estimator, we computed the risk
(65)  $R_i = \left\| W_1 \left( \hat{H}_{fp} - H_{fp} \right) W_2 \right\|_F^2$
Given that the system models have different scales in each realization, we normalized the risk performance by that of a reference estimator and averaged over the $N_{\mathrm{MC}}$ realizations as follows:
(66)  $\bar{R} = \exp\left( \frac{1}{N_{\mathrm{MC}}} \sum_{i=1}^{N_{\mathrm{MC}}} \log \frac{R_i}{R_i^{\mathrm{ref}}} \right)$
The logarithmic function is used to give higher and symmetric weights to risks that are either too low or too high compared to the reference.
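A small sketch of this normalization, assuming (66) is the geometric mean of the risk ratios over the Monte Carlo realizations (which makes the reference estimator score exactly 1.0 in the tables below).

```python
import numpy as np

def average_normalized_risk(risks, risks_ref):
    """Geometric mean of risk ratios against a reference estimator, as in (66)."""
    ratios = np.asarray(risks) / np.asarray(risks_ref)
    return float(np.exp(np.mean(np.log(ratios))))
```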
We used the same estimate of $\sigma$ for all shrinkage methods, as defined by (35). We started by truncating to the largest singular values and then, computing the corresponding noise estimate and hard threshold, we obtained
(67)
In words, we select the least truncation order such that the resulting order estimate falls below it, and then use the corresponding degrees of freedom in (35).
6.1 A heuristic as benchmark
To provide the reference estimator above, we propose a heuristic that seeks to mimic the order selection as done by visual inspection. Namely, what a typical user would do is to look at the plot of singular values and identify the point of sharp change in their rate of decline. In order to do something similar automatically, we borrow from the idea of effective sample size and, from the vector of ordered singular values, define
(68)  $n_{\mathrm{eff}} = \frac{\left( \sum_i s_i \right)^2}{\sum_i s_i^2}$
Next we construct a function that linearly fits the profile of ordered singular values. Finally, we define our heuristic order estimate as
(69)
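A sketch of an order selection heuristic in this spirit, using the effective-size quantity (68) rounded to an integer as the order estimate; the rounding rule is a simplification of the fitting procedure described above.

```python
import numpy as np

def heuristic_order(singular_values):
    """Heuristic model order from the effective size of the singular value vector, as in (68)."""
    s = np.sort(np.asarray(singular_values))[::-1]
    n_eff = s.sum()**2 / (s**2).sum()          # effective number of significant values
    return max(1, int(round(n_eff)))
```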
6.2 Results
Our results are summarized in Tables 1, 2 and 3, corresponding to the three main weight choices. We observe that the optimal shrinkage method and the alternating least squares approach consistently give lower risk estimates. Interestingly, we observe that soft-thresholding and hard-thresholding do not always improve upon the benchmark. Since the SURE-based method also applies soft-thresholding, we see that the problem does not lie in the class of shrinking functions, but in a poor parameter choice based on asymptotic properties of large random matrices.
| Method | Average Normalized Risk (66) |
| --- | --- |
| Heuristic (68) | 1.0 |
| Heuristic (69) | 3.83 |
| Hard-thresholding | 0.97 |
| Soft-thresholding | 0.76 |
| Optimal Shrinkage | 0.75 |
| SURE | 0.58 |
| Alternating Least Squares | 0.39 |
Appendix A Equivariant Estimators of the Covariance on the Lower Triangular Toeplitz Group
The set of lower triangular Toeplitz matrices, as exemplified in (12), is a group under matrix multiplication. This group operation may be interpreted as the cascading of dynamical systems. In this sense, we are interested in estimators that are invariant with respect to dynamical system cascading. For example, if we pass both input and output through a linear filter, we want an estimator that gives the same model regardless of the filtering. In the Bayesian framework, not all priors result in such an equivariant estimator. In this section, we aim at deriving a prior for the noise matrix that is invariant under the group operation of multiplication by lower triangular Toeplitz matrices.
Let $\mathcal{T}$ be one such group and consider two matrices $A, B \in \mathcal{T}$ with $C = AB$. Using the Toeplitz structure, we have that the last row of $C$ is given by
(71)
If we parametrize these matrices using their last rows, we can compute the Jacobian for right multiplication as
(72)
Given the triangular structure of $B$, this Jacobian is itself triangular and its determinant follows directly. As for left multiplication,
(73)
with a determinant of the same form. With this we can define the left and right invariant Haar measure
(74)
Indeed, to check right invariance, let $B \in \mathcal{T}$ and observe that
(75)
where we used the above defined Jacobian in the first equality and the form of the measure in the second. Left invariance may be checked similarly. Therefore, we have arrived at a prior that is invariant under the group operation.
In the sequel we derive the posterior that corresponds to this prior. Recall that the residues are
(76)
Taking the vectorization operation and making use of the noise vector on the corresponding lower dimensional subspace, we have that
(77)
Therefore, the residues covariance is given by
(78)
Since this covariance is rank deficient, we shall make use of its pseudo-determinant (the product of the non-zero singular values) in obtaining its pdf:
(79)
where we used the fact that the factors are lower triangular and that the diagonal has exactly the expected number of nonzero entries. In order to make the vectorization explicit in the likelihood function, we note that
(80)
Therefore,
(81)
and the posterior would be proportional to
(82)
(83)
Defining the change of variables, we have
(84)
Therefore, the transformed components follow the stated distributions. This is precisely the posterior given by equations (45)-(50).
References
- [1] Michel Verhaegen and Anders Hansson. N2SID: Nuclear norm subspace identification of innovation models. Automatica, 72:57–63, 2016.
- [2] Roy S Smith. Frequency domain subspace identification using nuclear norm minimization and Hankel matrix realizations. IEEE Transactions on Automatic Control, 59(11):2886–2896, 2014.
- [3] Gianluigi Pillonetto, Tianshi Chen, Alessandro Chiuso, Giuseppe De Nicolao, and Lennart Ljung. Regularized linear system identification using atomic, nuclear and kernel-based norms: The role of the stability constraint. Automatica, 69:137–149, 2016.
- [4] Alessandro Chiuso and Gianluigi Pillonetto. System identification: A machine learning perspective. Annual Review of Control, Robotics, and Autonomous Systems, 2:281–304, 2019.
- [5] Yue Sun, Samet Oymak, and Maryam Fazel. Finite sample identification of low-order LTI systems via nuclear norm regularization. IEEE Open Journal of Control Systems, 1:237–254, 2022.
- [6] Alessandro Chiuso. Regularization and Bayesian learning in dynamical systems: Past, present and future. Annual Reviews in Control, 41:24–38, 2016.
- [7] Lennart Ljung, Tianshi Chen, and Biqiang Mu. A shift in paradigm for system identification. International Journal of Control, 93(2):173–180, 2020.
- [8] Jie Liu and Bing Li. A novel strategy for response and force reconstruction under impact excitation. Journal of Mechanical Science and Technology, 32(8):3581–3596, 2018.
- [9] Vineet Vajpayee, Siddhartha Mukhopadhyay, and Akhilanand Pati Tiwari. Data-driven subspace predictive control of a nuclear reactor. IEEE Transactions on Nuclear Science, 65(2):666–679, 2017.
- [10] Behnood Rasti, Magnus O Ulfarsson, and Johannes R Sveinsson. Hyperspectral subspace identification using SURE. IEEE Geoscience and Remote Sensing Letters, 12(12):2481–2485, 2015.
- [11] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
- [12] Clément Elvira, Pierre Chainais, and Nicolas Dobigeon. Bayesian nonparametric subspace estimation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2247–2251. IEEE, 2017.
- [13] Qian Zhao, Deyu Meng, Zongben Xu, Wangmeng Zuo, and Lei Zhang. Robust principal component analysis with complex noise. In International conference on machine learning, pages 55–63. PMLR, 2014.
- [14] S Derin Babacan, Martin Luessi, Rafael Molina, and Aggelos K Katsaggelos. Sparse Bayesian methods for low-rank matrix estimation. IEEE Transactions on Signal Processing, 60(8):3964–3977, 2012.
- [15] Xinghao Ding, Lihan He, and Lawrence Carin. Bayesian robust principal component analysis. IEEE Transactions on Image Processing, 20(12):3419–3430, 2011.
- [16] Wei Chu and Zoubin Ghahramani. Probabilistic models for incomplete multi-dimensional arrays. In Artificial Intelligence and Statistics, pages 89–96. PMLR, 2009.
- [17] David Gerard and Peter Hoff. Equivariant minimax dominators of the MLE in the array normal model. Journal of multivariate analysis, 137:32–49, 2015.
- [18] Peter D Hoff. Equivariant and scale-free tucker decomposition models. Bayesian Analysis, 11(3):627–648, 2016.
- [19] Ming Shi, Dan Li, and Jian Qiu Zhang. An alternating Bayesian approach to PARAFAC decomposition of tensors. IEEE Access, 6:36487–36499, 2018.
- [20] Anastasios Tsiamis and George J Pappas. Finite sample analysis of stochastic system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3648–3654. IEEE, 2019.
- [21] S Joe Qin. An overview of subspace identification. Computers & chemical engineering, 30(10-12):1502–1513, 2006.
- [22] Peter Van Overschee and Bart De Moor. Subspace identification for linear systems: Theory—Implementation—Applications. Springer Science & Business Media, 2012.
- [23] Erich L Lehmann and George Casella. Theory of point estimation. Springer Science & Business Media, 2006.
- [24] Christian P Robert et al. The Bayesian choice: from decision-theoretic foundations to computational implementation, volume 2. Springer, 2007.
- [25] William James and Charles Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379. University of California Press, 1961.
- [26] Matan Gavish and David L Donoho. The optimal hard threshold for singular values is $4/\sqrt{3}$. IEEE Transactions on Information Theory, 60(8):5040–5053, 2014.
- [27] Matan Gavish and David L Donoho. Optimal shrinkage of singular values. IEEE Transactions on Information Theory, 63(4):2137–2152, 2017.
- [28] Emmanuel J Candes, Carlos A Sing-Long, and Joshua D Trzasko. Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE transactions on signal processing, 61(19):4643–4657, 2013.
- [29] Lennart Ljung. System identification toolbox for use with MATLAB. Mathworks, 2007.