
Approximation of Stationary Processes and Toeplitz Spectra

Giorgio Picci and Bin Zhu

Giorgio Picci is with the Department of Information Engineering, University of Padova, Padova, Italy (e-mail: [email protected]). Bin Zhu is with the School of Intelligent Systems Engineering, Sun Yat-sen University, Waihuan East Road 132, 510006 Guangzhou, China (e-mail: [email protected]).
Abstract

We study the approximation of stationary processes by a simple class of purely deterministic signals. This has an analytic counterpart in the approximation of symmetric positive definite Toeplitz matrices by matrices of finite rank. We propose a notion of distance between them and prove a weak-sense approximation result.

I Introduction and Problem Statement

In our past investigations [1] we discovered that the a posteriori covariance operator of certain random harmonic oscillation signals corrupted by white noise has remarkable properties, very similar to those uncovered in the 1960s and 1970s by D. Slepian and coworkers in a famous series of papers concerning the energy concentration problems of time- and band-limited signals [2, 3, 4, 5]. The key property of the covariance operator in question is that its eigenvalues decay extremely fast to zero for indices greater than an a priori computable number (the so-called Slepian frequency).

From the sharp decay to zero of the eigenvalues it follows that the a posteriori probabilistic description of the signal is essentially finite-dimensional, and in simulated experiments the observations appear to be well approximated by a purely deterministic process [6] corrupted by white noise. Since purely deterministic processes of finite rank can be described by linear finite-dimensional state-space models [6, p. 277], the estimation can be carried out by rather straightforward subspace methods.

However, the precise meaning of this approximation has so far not been understood. The scope of this paper is to propose a possible metric for this approximation and to prove convergence, although only in a weak sense.

Consider the infinite covariance matrix $\boldsymbol{\Sigma}=\left[\sigma(t-s)\right]_{t,s\in\mathbb{Z}}$ (in this paper bold symbols like $\boldsymbol{\Sigma}$ are reserved for infinite arrays, e.g. stochastic processes or covariance matrices) of a scalar stationary zero-mean process having covariance function

$$\sigma(\tau)=\mathbb{E}\,\mathbf{y}(t+\tau)\mathbf{y}(t),\qquad\tau\in\mathbb{Z} \tag{1}$$

which we shall assume admits a Fourier transform

$$\varphi(e^{j\theta})=\sum_{\tau=-\infty}^{+\infty}e^{-j\theta\tau}\sigma(\tau)$$

which is a piecewise smooth function of $\theta$. For example, $\varphi$ will be continuous and bounded when $\sigma$ belongs to $\ell^{1}$. The function $\varphi$ is called the symbol of $\boldsymbol{\Sigma}$. The process $\mathbf{y}$ is not necessarily purely nondeterministic; however, we assume that $\varphi(e^{j\theta})$ is piecewise continuous and bounded, as is, for example, a rectangular function. Then $\boldsymbol{\Sigma}$ induces a bounded operator on $\ell^{2}$ [7].
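To make this concrete, here is a minimal numerical sketch (assuming Python with NumPy/SciPy; the band edge $a$ and truncation size $N$ are arbitrary illustrative choices, not values from the paper) of the truncations of $\boldsymbol{\Sigma}$ for a rectangular symbol, whose Fourier coefficients are $\sigma(\tau)=\sin(a\tau)/(\pi\tau)$:

```python
import numpy as np
from scipy.linalg import toeplitz

# Rectangular (band-limiting) symbol: phi(e^{j theta}) = 1 for |theta| <= a,
# 0 otherwise; its Fourier coefficients are sigma(tau) = sin(a tau)/(pi tau).
a = 0.4 * np.pi          # band edge (illustrative choice)
N = 256                  # truncation size (illustrative choice)

tau = np.arange(N)
sigma = np.where(tau == 0, a / np.pi,
                 np.sin(a * tau) / (np.pi * np.maximum(tau, 1)))
Sigma_N = toeplitz(sigma)                     # the N x N truncation of Sigma

# Symmetric positive semidefinite Toeplitz matrix; eigenvalues in descending order.
eigvals = np.linalg.eigvalsh(Sigma_N)[::-1]
print(eigvals[:5], eigvals[-5:])
```

For a band-limiting symbol of this kind the eigenvalues of the truncations cluster near $0$ and $1$, the essential range of $\varphi$, which is precisely the Slepian-type concentration behavior recalled above.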

Let now

$$\mathbf{y}_{N}(t)=\begin{bmatrix}\mathbf{y}(t)&\mathbf{y}(t+1)&\cdots&\mathbf{y}(t+N-1)\end{bmatrix}^{\top}$$

and consider the $N$-truncation of the matrix $\boldsymbol{\Sigma}$, defined as

$$\Sigma_{N}:=\mathbb{E}\,\mathbf{y}_{N}(t)\mathbf{y}_{N}(t)^{\top}$$

which for each $N$ has a positive point spectrum, say

$$\mathfrak{S}_{N}:=\{\lambda_{N,1},\ldots,\lambda_{N,N}\}$$

where the eigenvalues are ordered in decreasing magnitude. We shall assume that all $\Sigma_{N}$ are nonsingular, so that the eigenvalues are strictly positive for all $N$. By Weyl's theorem [8, p. 203], see also [9, Fact M], the $k$-th eigenvalue of $\Sigma_{N}$ is a nondecreasing function of $N$ and hence has a limit, $\lambda_{k}(\boldsymbol{\Sigma})$, which may possibly be equal to $+\infty$. Each such limit is called an eigenvalue of $\boldsymbol{\Sigma}$. These limits, however, are in general not true eigenvalues, as it is well known that $\boldsymbol{\Sigma}$ may have none. For example, a bounded symmetric Toeplitz matrix has a purely continuous spectrum [10].
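A quick self-contained check of this monotonicity (reusing the illustrative rectangular-symbol covariance from the sketch above) could look as follows:

```python
import numpy as np
from scipy.linalg import toeplitz

# Same rectangular-symbol covariance sequence as above (illustrative values).
a = 0.4 * np.pi
tau = np.arange(256)
sigma = np.where(tau == 0, a / np.pi,
                 np.sin(a * tau) / (np.pi * np.maximum(tau, 1)))

k = 3
for n in (16, 32, 64, 128, 256):
    lam = np.linalg.eigvalsh(toeplitz(sigma[:n]))[::-1]
    print(n, lam[k - 1])   # the k-th eigenvalue weakly increases with n
```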

Assume now that all eigenvalues of $\boldsymbol{\Sigma}$ are finite (this is equivalent to $\boldsymbol{\Sigma}$ being a bounded linear operator on $\ell^{2}$, see [11]) and let us keep in the formal spectral decomposition of $\boldsymbol{\Sigma}$ the $n$ largest eigenvalues, setting all the others to zero. In this way we form an “approximation” of $\boldsymbol{\Sigma}$ of finite rank $n$, which we shall denote $\boldsymbol{\Sigma}^{n}$. Then the (infinite) matrix $\boldsymbol{\Sigma}^{n}$ is the covariance of a purely deterministic process of rank $n$ [6] whose spectral density, say $\varphi_{n}(e^{j\theta})$, is the sum of $n$ Dirac pulses. We would like to know in what sense, if any, $\boldsymbol{\Sigma}^{n}$ could be considered an approximation of $\boldsymbol{\Sigma}$ or, equivalently, $\varphi_{n}(e^{j\theta})$ could be considered an approximation of the symbol $\varphi(e^{j\theta})$. Obviously this last fact could only happen in a weak sense; say, for arbitrary test functions $\psi(e^{j\theta})$ continuous on the unit circle one should have

$$\int_{-\pi}^{\pi}\psi(e^{j\theta})\,\varphi_{n}(e^{j\theta})\,\psi(e^{j\theta})^{*}\,\frac{d\theta}{2\pi}\;\longrightarrow\;\int_{-\pi}^{\pi}\psi(e^{j\theta})\,\varphi(e^{j\theta})\,\psi(e^{j\theta})^{*}\,\frac{d\theta}{2\pi} \tag{2}$$

as $n\to\infty$.

An equivalent question can be posed in terms of $L^{2}$ approximation of a stationary process $\mathbf{y}$ of covariance $\boldsymbol{\Sigma}$ (which could in particular be p.n.d.) by a purely deterministic one, say $\hat{\mathbf{y}}_{n}$, having covariance $\boldsymbol{\Sigma}^{n}$ of rank $n$. As we shall see, this problem should also be naturally formulated in a weak sense.

II Approximation of random vectors

To begin with, suppose we want to approximate an $N$-dimensional zero-mean random vector $y$ with covariance matrix $\Sigma\in\mathbb{R}^{N\times N}$ by another $N$-dimensional vector, say $\hat{y}$, having covariance $\Sigma_{n}$ of rank $n<N$. To make the problem well posed we shall require that the approximation $\hat{y}$ generates a linear subspace of $\mathbf{H}(y)$, which will then have to be $n$-dimensional. This means that we can represent $\hat{y}$ as a linear function of $y$, say

$$\hat{y}:=My$$

where $M\in\mathbb{R}^{N\times N}$ has rank $n$.

Motivated from the above, let us consider the following approximation problem:

Problem 1.

Find a matrix $M\in\mathbb{R}^{N\times N}$ of rank $n$ solving the minimum problem

$$\min_{\operatorname{rank}(M)=n}\mathbb{E}\{\|y-My\|^{2}\}\,. \tag{3}$$

Note that an equivalent geometric formulation is to look for an optimal $n$-dimensional subspace of $\mathbf{H}(y)$ onto which $y$ should be projected in order to minimize the approximation error variance. Let us stress that this is quite different from the usual least-squares approximation problem, which amounts to projecting onto a given subspace.

As usual, minimizing the squared distance in (3) requires that the approximation $My$ be uncorrelated with the approximation error; namely

$$y-My\,\perp\,My \tag{4}$$

which is equivalent to

$$M\Sigma-M\Sigma M^{\top}=0\,.$$

Introducing a square root $\Sigma^{1/2}$ of $\Sigma$ and defining $\hat{M}:=\Sigma^{-1/2}M\Sigma^{1/2}$, this condition is seen to be equivalent to

$$\hat{M}=\hat{M}\,\hat{M}^{\top}$$

which just says that $\hat{M}$ must be symmetric and idempotent (i.e. $\hat{M}=\hat{M}^{2}$), in other words an orthogonal projection from $\mathbb{R}^{N}$ onto some $n$-dimensional subspace. Hence $M$ must have the following structure

$$M=\Sigma^{1/2}\,\Pi\,\Sigma^{-1/2},\qquad\Pi=\Pi^{2},\qquad\Pi=\Pi^{\top} \tag{5}$$

where $\Sigma^{1/2}$ is any square root of $\Sigma$ and $\Pi$ is an orthogonal projection matrix of rank $n$.

Theorem 2.

The solutions of the approximation problem (3) are of the form

$$M=WW^{\top},\qquad W=U_{n}Q_{n}$$

where $U_{n}$ is an $N\times n$ matrix whose columns are the first $n$ normalized eigenvectors of $\Sigma$, ordered according to the decreasing magnitude of the corresponding first $n$ eigenvalues (collected in the diagonal matrix $\Lambda_{n}$), and $Q_{n}$ is an arbitrary $n\times n$ orthogonal matrix.

Proof: Let $\Lambda:=\operatorname{diag}\{\lambda_{1},\ldots,\lambda_{N}\}$ and let $\Sigma=U\Lambda U^{\top}$ be the spectral decomposition of $\Sigma$, in which $U$ is an orthogonal matrix of eigenvectors. We can pick as a square root of $\Sigma$ the matrix $\Sigma^{1/2}:=U\Lambda^{1/2}$.

Now, no matter how $\Sigma^{1/2}$ is chosen, the random vector $\mathbf{e}:=\Sigma^{-1/2}y$ has orthonormal components. Hence, using (5), the cost function of our minimum problem can be rewritten as

$$\begin{aligned}\mathbb{E}\{\|y-My\|^{2}\}&=\mathbb{E}\{\|\Sigma^{1/2}\mathbf{e}-\Sigma^{1/2}\,\Pi\,\Sigma^{-1/2}y\|^{2}\}\\ &=\mathbb{E}\{\|\Sigma^{1/2}(\mathbf{e}-\Pi\,\mathbf{e})\|^{2}\}\\ &=\mathbb{E}\{\|\Lambda^{1/2}(\mathbf{e}-\Pi\,\mathbf{e})\|^{2}\}\\ &=\mathbb{E}\,(\mathbf{e}-\Pi\,\mathbf{e})^{\top}\Lambda\,(\mathbf{e}-\Pi\,\mathbf{e})\\ &=\operatorname{tr}\left[\Lambda\,\mathbb{E}(\mathbf{e}-\Pi\,\mathbf{e})(\mathbf{e}-\Pi\,\mathbf{e})^{\top}\right]\end{aligned}$$

where $\operatorname{tr}A:=\sum_{k}a_{kk}$ is the trace of $A$. Since $\mathbb{E}\,\mathbf{e}\mathbf{e}^{\top}=I$, the last expectation equals $(I-\Pi)(I-\Pi)^{\top}=I-\Pi$. Our minimum problem can therefore be rewritten as

$$\min_{\operatorname{rank}(\Pi)=n}\operatorname{tr}\{\Lambda\Pi^{\perp}\}$$

where $\Pi^{\perp}:=I-\Pi$ is the orthogonal projection onto the orthogonal complement of the subspace $\operatorname{Im}\Pi$.

Since the eigenvalues are ordered in decreasing order, i.e. $\lambda_{1}\geq\ldots\geq\lambda_{N}$, one sees that the minimum of this function of $\Pi$ is reached when $\Pi$ projects onto the subspace spanned by the first $n$ coordinate axes. In other words, $\Pi_{\rm optimal}=\operatorname{diag}\{I_{n},\,0\}$, the minimum being $\lambda_{n+1}+\ldots+\lambda_{N}$. It is then evident that

$$M=U\Lambda^{1/2}\,\Pi_{\rm optimal}\,\Lambda^{-1/2}U^{\top}=U_{n}U_{n}^{\top}$$

Naturally, multiplying $U_{n}$ by any orthogonal $n\times n$ matrix does not change the result. $\Box$

Observe that $\hat{y}=U_{n}U_{n}^{\top}y$ is just the first-$n$ Principal Components Approximation (PCA) of $y$. In fact, it is well known that the PCA vector $\hat{y}$ can be seen as a linear transformation acting on $y$ [12]. This result confirms in particular that the truncated PCA expansion is optimal in the sense that it provides the best $M$ and the best approximation subspace for the criterion (3). This characterization has also been found by a different technique in the study of subspace approximation problems; see e.g. [13].
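As a sanity check of Theorem 2, the following sketch (with a hypothetical random covariance; the dimensions are arbitrary choices, not from the paper) verifies that $M=U_{n}U_{n}^{\top}$ attains the optimal cost $\lambda_{n+1}+\ldots+\lambda_{N}$:

```python
import numpy as np

# Sanity check of Theorem 2: the rank-n minimizer of E||y - M y||^2
# is M = U_n U_n^T, with optimal cost lambda_{n+1} + ... + lambda_N.
rng = np.random.default_rng(0)
N, n = 8, 3
A = rng.standard_normal((N, N))
Sigma = A @ A.T                          # a generic covariance matrix
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]           # descending eigenvalue order

M = U[:, :n] @ U[:, :n].T                # optimal rank-n approximant
# For y with covariance Sigma: E||y - M y||^2 = tr[(I-M) Sigma (I-M)^T]
cost = np.trace((np.eye(N) - M) @ Sigma @ (np.eye(N) - M).T)
print(cost, lam[n:].sum())               # the two numbers agree
```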

Note for future reference that the variance matrix of $\hat{y}$ has rank $n$ since

$$\mathbb{E}\,\hat{y}\hat{y}^{\top}=U_{n}U_{n}^{\top}\,\mathbb{E}\,yy^{\top}\,U_{n}U_{n}^{\top}=U_{n}\operatorname{diag}\{\lambda_{1},\ldots,\lambda_{n}\}U_{n}^{\top} \tag{6}$$

and that this expression holds for arbitrary $N\geq n$. Naturally, one should keep in mind that the eigenvector matrices now depend on $N$ while the eigenvalues stay fixed.

III Extension to infinite dimension

In the same spirit, consider now a stationary zero-mean process $\mathbf{y}$ and any jointly stationary zero-mean process $\mathbf{z}$ (both written as doubly infinite column vectors) spanning a subspace $\mathbf{H}(\mathbf{z})\subset\mathbf{H}(\mathbf{y})$ of dimension $n$. Any such process $\mathbf{z}$ must be purely deterministic of rank $n$ and is then uniquely determined by any finite string of random variables $\{\mathbf{z}(t)\}_{t\in I}$ induced on an interval $I$ of length $N\geq n$. This statement from [6, pp. 276-277] is reported for completeness in the following lemma.

Lemma 1.

Any p.d. process $\mathbf{z}$ of rank $n$ can be represented for all $t\in\mathbb{Z}$ by a state-space model (i.e. a stochastic realization) of the form

$$\boldsymbol{\xi}(t+1)=A\,\boldsymbol{\xi}(t) \tag{7}$$
$$\mathbf{z}(t)=c\,\boldsymbol{\xi}(t) \tag{8}$$

where $\boldsymbol{\xi}(t)=[\,\xi_{1}(t),\,\xi_{2}(t),\,\ldots,\xi_{n}(t)\,]^{\top}$ is an $n$-dimensional basis vector spanning the Hilbert space $\mathbf{H}(\mathbf{z}_{N})$ linearly generated by the $N\geq n$ random variables of the set $\{\mathbf{z}(s)\,:\,t\geq s\geq t-N+1\}$, $A$ is an $n\times n$ unitary matrix, and $c$ is an $n$-vector such that the pair $(A,c)$ is observable.

Proof: See [6, p. 277]. $\Box$

This linear system extends in time the finite family of random variables $\{\mathbf{z}(s)\}$, generators of $\mathbf{H}(\mathbf{z}_{N})$, to generate the stationary p.d. process $\mathbf{z}$ defined on the whole time line $\mathbb{Z}$. Since this realization is uniquely determined by the space $\mathbf{H}(\mathbf{z}_{N})$ modulo a choice of basis, it follows that the process $\mathbf{z}$ is also uniquely determined by $\mathbf{H}(\mathbf{z}_{N})$. Hence all covariances $\sigma(\tau)=\mathbb{E}\,\mathbf{z}(t+\tau)\mathbf{z}(t)$ are also uniquely defined and determine the entries of the covariance operator of the process.
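A minimal sketch of such a realization (the orthogonal $A$ and the vector $c$ below are hypothetical random choices, not taken from the paper) illustrates that the resulting covariance sequence is Toeplitz of rank at most $n$:

```python
import numpy as np
from scipy.linalg import toeplitz

# Sketch of the realization in Lemma 1 with hypothetical A and c.
rng = np.random.default_rng(0)
n = 4
A, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal (unitary) A
c = rng.standard_normal(n)

# With E[xi xi^T] = I (invariant under A, since A I A^T = I for orthogonal A)
# the process z(t) = c A^t xi is stationary with sigma(tau) = c A^tau c^T.
sigma = np.array([c @ np.linalg.matrix_power(A, t) @ c for t in range(12)])

# The induced covariance matrix is Toeplitz of rank (at most) n.
print(np.linalg.matrix_rank(toeplitz(sigma)))     # prints n = 4
```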

In analogy to Problem 1, let us ask if there is a stationary process $\mathbf{z}$ spanning a subspace $\mathbf{H}(\mathbf{z})\subset\mathbf{H}(\mathbf{y})$ of dimension $n$ which minimizes an approximation criterion of the type (3). If such a process exists we shall call it an $n$-PC approximation of $\mathbf{y}$ and denote it by the symbol $\hat{\mathbf{y}}_{n}$.

Let then $I=[\,t,t+1,\ldots,t+N-1\,]$ denote a finite subinterval of the time line $\mathbb{Z}$ of length $N\geq n$ and $X_{I}$ an $N\times\infty$ matrix whose $(j,k)$ entry equals one if $k$ is the $j$-th element of $I$ and zero otherwise. Consider the finite random vectors $\mathbf{y}_{N}:=X_{I}\mathbf{y}$ and $\mathbf{z}_{N}:=X_{I}\mathbf{z}$, and let $\hat{\mathbf{y}}_{N}$ be the random vector minimizing the norm $\mathbb{E}\{\|X_{I}\mathbf{y}-X_{I}\mathbf{z}\|^{2}\}$ for an interval of length $N\geq n$. This problem is analogous to Problem 1, where now $y$ is replaced by $\mathbf{y}_{N}$. The solution string determines by stationary extension (Lemma 1) a purely deterministic process $\hat{\mathbf{y}}$ such that $\hat{\mathbf{y}}_{N}:=X_{I}\hat{\mathbf{y}}$ and $\mathbf{H}(\hat{\mathbf{y}})=\mathbf{H}(\hat{\mathbf{y}}_{N})$ has dimension $n$. Since $\mathbf{H}(\hat{\mathbf{y}})\subset\mathbf{H}(\mathbf{y}_{N})\subset\mathbf{H}(\mathbf{y})$, this process satisfies our requirements. By stationarity, $\hat{\mathbf{y}}$ is invariant with respect to translations of the interval $I$ provided its length $N$ is fixed. Then $\hat{\mathbf{y}}$ is an $n$-PC approximation of $\mathbf{y}$. The question now is to understand in what sense this approximation can get close to $\mathbf{y}$ as $n\to\infty$.

Since we are now studying the behavior of the $n$-PC approximation of $\mathbf{y}$ as the dimension $n$ varies, we shall attach a superscript to $\hat{\mathbf{y}}$ and denote it by $\hat{\mathbf{y}}^{n}$; likewise for its covariance matrix, which will be denoted $\hat{\boldsymbol{\Sigma}}^{n}$.

Theorem 3.

For each $n$ the $n$-PC approximation of $\mathbf{y}$ has an (infinite) covariance operator $\hat{\boldsymbol{\Sigma}}^{n}$ of rank $n$. The sequence $\{\hat{\boldsymbol{\Sigma}}^{n}\}$ converges weakly to $\boldsymbol{\Sigma}$ as $n$ diverges to $\infty$; that is,

$$\psi^{\top}[\boldsymbol{\Sigma}-\hat{\boldsymbol{\Sigma}}^{n}]\psi\rightarrow 0\qquad\text{as}\quad n\to\infty$$

for all functions (row sequences) $\psi^{\top}:=a^{\top}X_{I}$ having support in $I$, where $a\in\mathbb{R}^{N}$ is arbitrary.

Proof: Consider an $n$-PC approximation $\hat{\mathbf{y}}$ of $\mathbf{y}$ and the restriction $X_{I}\hat{\mathbf{y}}$ to any interval $I$ of length $N\geq n$. Recall that $\hat{\mathbf{y}}$ is now denoted $\hat{\mathbf{y}}^{n}$, and likewise its covariance matrix is denoted $\hat{\Sigma}^{n}$. By analogy with (6), the $N\times N$ truncation of this matrix has the structure

$$\hat{\Sigma}^{n}_{N}=U^{n}_{N}\operatorname{diag}\{\lambda_{1},\ldots,\lambda_{n}\}(U^{n}_{N})^{\top} \tag{9}$$

where the eigenvalues $\lambda_{k}$ are fixed and equal to the first $n$ eigenvalues of the $N\times N$ truncation of the covariance operator $\boldsymbol{\Sigma}$ of $\mathbf{y}$. The $N\times n$ eigenvector matrices $U^{n}_{N}$ depend on $N$, as their dimension trivially grows with $N$.

We shall now show that all $\hat{\Sigma}^{n}_{N}$ are the $N$-truncations of an infinite Toeplitz matrix $\hat{\boldsymbol{\Sigma}}^{n}$ of rank $n$ which is the covariance of the purely deterministic process $\hat{\mathbf{y}}^{n}$. That this is so follows from the fact that all covariances $\hat{\sigma}^{n}(\tau)=\mathbb{E}\,\hat{\mathbf{y}}^{n}(t+\tau)\hat{\mathbf{y}}^{n}(t)$ are uniquely defined by the state-space model (7), (8) and constitute exactly the entries of the (infinite) covariance operator $\hat{\boldsymbol{\Sigma}}^{n}$ of the p.d. process $\hat{\mathbf{y}}^{n}$. Then, in particular, each finite matrix $\hat{\Sigma}^{n}_{N}$ must be an $N$-truncation of the same $\hat{\boldsymbol{\Sigma}}^{n}$. From the expansion (9) it follows that this truncation is just the (symmetric) rank-$n$ SVD approximation of the $N\times N$ truncation of $\boldsymbol{\Sigma}$, which is of course well defined for all finite $N$.

By a well-known property of the SVD (see e.g. [14, Chap. 2]), the variance matrix $\hat{\Sigma}^{n}_{N}$ of $X_{I}\hat{\mathbf{y}}^{n}$ is the symmetric $N\times N$ matrix of rank $n$ which has minimum distance from that of $X_{I}\mathbf{y}$ in the Frobenius norm. This in turn implies that $\hat{\boldsymbol{\Sigma}}^{n}$ is the (infinite) covariance matrix of rank $n$ which minimizes

$$\psi^{\top}(\boldsymbol{\Sigma}-\hat{\boldsymbol{\Sigma}}^{n})\psi=a^{\top}(\Sigma_{N}-\hat{\Sigma}^{n}_{N})a \tag{10}$$

for all functions $\psi$ having support in an interval $I\subset\mathbb{Z}$ of length $N\geq n$. The sequence (10) is nonnegative and monotonically decreasing in $n$ for fixed $N\geq n$. In fact, by obvious properties of the SVD, for each $\psi$ of finite support of length $N$ it tends to zero in a finite number of steps, since for $n=N$ the difference is zero. But this happens for all $N$ and hence for all $\psi$ of finite support. $\Box$
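The following sketch illustrates the proof mechanism numerically (reusing the illustrative rectangular-symbol covariance of Sec. I; all parameter values are hypothetical): the quadratic form (10) is nonnegative and decreases to zero as $n$ grows.

```python
import numpy as np
from scipy.linalg import toeplitz

# Rectangular-symbol Toeplitz covariance, as in the earlier sketches.
N = 64
a = 0.4 * np.pi
tau = np.arange(N)
sig = np.where(tau == 0, a / np.pi,
               np.sin(a * tau) / (np.pi * np.maximum(tau, 1)))
S = toeplitz(sig)
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]                   # descending order

psi = np.random.default_rng(0).standard_normal(N)  # arbitrary test vector
for n in (4, 8, 16, 32, 64):
    S_n = (U[:, :n] * lam[:n]) @ U[:, :n].T        # rank-n PC truncation, as in (9)
    print(n, psi @ (S - S_n) @ psi)                # decreases to 0 as n -> N
```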

Remark 1.

Contrary to its finite submatrices $\Sigma_{N}$, the infinite covariance operator $\boldsymbol{\Sigma}$ may not have eigenvalues (nor corresponding eigenvectors), and consequently the idea of PC approximation does not apply to the full matrix. For this reason the approximation and the convergence results may not hold in a strong sense.

IV Approximation in the spectral domain

We now go back to the problem formulation of Sec. I. From [6, Chap. 3], the processes $\mathbf{y}$ and $\hat{\mathbf{y}}^{n}$ have spectral representations with random spectral measures $dZ(e^{i\theta})$ and $dZ^{n}(e^{i\theta})$ such that

$$\mathbb{E}\,dZ(e^{i\theta})\,dZ(e^{i\theta})^{*}=\varphi(e^{i\theta})\,\frac{d\theta}{2\pi}\,,\qquad \mathbb{E}\,dZ^{n}(e^{i\theta})\,dZ^{n}(e^{i\theta})^{*}=\varphi_{n}(e^{i\theta})\,\frac{d\theta}{2\pi}\,.$$

Letting $\hat{\psi}(e^{i\theta}):=\sum_{k=0}^{N-1}\psi(k)e^{i\theta k}$, one has

$$\begin{aligned}\psi^{\top}\boldsymbol{\Sigma}\psi&=\mathbb{E}\Big[\sum_{k=0}^{N-1}\psi(k)\mathbf{y}(k)\Big]^{2}=\mathbb{E}\Big[\int_{-\pi}^{\pi}\hat{\psi}(e^{i\theta})\,dZ(e^{i\theta})\Big]^{2}\\ &=\int_{-\pi}^{\pi}\hat{\psi}(e^{i\theta})\,\varphi(e^{i\theta})\,\hat{\psi}(e^{i\theta})^{*}\,\frac{d\theta}{2\pi}\end{aligned}$$

and similarly for $\psi^{\top}\hat{\boldsymbol{\Sigma}}^{n}\psi$, which can be written

$$\begin{aligned}\psi^{\top}\hat{\boldsymbol{\Sigma}}^{n}\psi&=\mathbb{E}\Big[\sum_{k=0}^{N-1}\psi(k)\hat{\mathbf{y}}^{n}(k)\Big]^{2}=\mathbb{E}\Big[\int_{-\pi}^{\pi}\hat{\psi}(e^{i\theta})\,dZ^{n}(e^{i\theta})\Big]^{2}\\ &=\int_{-\pi}^{\pi}\hat{\psi}(e^{i\theta})\,\varphi_{n}(e^{i\theta})\,\hat{\psi}(e^{i\theta})^{*}\,\frac{d\theta}{2\pi}\,.\end{aligned}$$

Therefore (2) follows from Theorem 3.
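This chain of identities can also be checked numerically. The sketch below (assuming Python with NumPy/SciPy; the parameters and the quadrature grid are illustrative choices) compares $\psi^{\top}\boldsymbol{\Sigma}\psi$ computed in the time domain with the spectral integral, for the rectangular symbol of Sec. I:

```python
import numpy as np
from scipy.linalg import toeplitz

# Time-domain quadratic form psi^T Sigma_N psi ...
N = 32
a = 0.4 * np.pi
tau = np.arange(N)
sig = np.where(tau == 0, a / np.pi,
               np.sin(a * tau) / (np.pi * np.maximum(tau, 1)))
psi = np.random.default_rng(1).standard_normal(N)
lhs = psi @ toeplitz(sig) @ psi

# ... versus the spectral integral (1/2pi) * int |psi_hat|^2 phi dtheta.
theta = np.linspace(-np.pi, np.pi, 20001)
psi_hat = np.exp(1j * np.outer(theta, np.arange(N))) @ psi
phi = (np.abs(theta) <= a).astype(float)
rhs = np.trapz(np.abs(psi_hat) ** 2 * phi, theta) / (2 * np.pi)
print(lhs, rhs)   # the two values agree up to quadrature error
```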

Now note that, because of the orthogonality (4), $\psi^{\top}(\mathbf{y}-\hat{\mathbf{y}}^{n})\perp\psi^{\top}\hat{\mathbf{y}}^{n}$, so the difference (10) can be rewritten as $\mathbb{E}\,[\psi^{\top}(\mathbf{y}-\hat{\mathbf{y}}^{n})]^{2}$, which must also tend to zero as $n\to\infty$ for all functions $\psi$ having support in an interval $I\subset\mathbb{Z}$ of length $N\geq n$. Therefore $\hat{\mathbf{y}}^{n}$ converges weakly to $\mathbf{y}$.

References

  • [1] G. Picci and B. Zhu, “An empirical Bayesian approach to frequency estimation,” Dept. of Information Engineering, University of Padova, arXiv:1910.09475, 2019.
  • [2] D. Slepian and H. Pollak, “Prolate spheroidal wave functions, Fourier analysis and uncertainty I,” Bell Syst. Tech. Jour., vol. 40, pp. 43–63, 1961.
  • [3] H. J. Landau and H. Pollak, “Prolate spheroidal wave functions, Fourier analysis and uncertainty II,” Bell Syst. Tech. Jour., vol. 40, pp. 65–84, 1961.
  • [4] D. Slepian, “Prolate spheroidal wave functions, Fourier analysis and uncertainty V: The discrete case,” Bell Syst. Tech. Jour., vol. 57, no. 5, pp. 1371–1430, 1978.
  • [5] H. J. Landau and H. Widom, “Eigenvalue distribution of time and frequency limiting,” Journal of Mathematical Analysis and Applications, vol. 77, no. 2, pp. 469–481, 1980.
  • [6] A. Lindquist and G. Picci, Linear Stochastic Systems: A Geometric Approach to Modeling, Estimation and Identification. Springer Verlag, 2015.
  • [7] N. Akhiezer and I. M. Glazman, Theory of Linear Operators in Hilbert Space, Vol. I. New York: Frederick Ungar Pub. Co., 1961.
  • [8] G. W. Stewart and J. Sun, Matrix Perturbation Theory. Boston: Academic Press, 1990.
  • [9] M. Forni and M. Lippi, “The generalized dynamic factor model: representation theory,” Econometric Theory, vol. 17, pp. 1113–1141, 2001.
  • [10] P. Hartman and A. Wintner, “The spectra of Toeplitz matrices,” American Journal of Mathematics, vol. 76, pp. 867–882, 1954.
  • [11] G. Bottegal and G. Picci, “Modeling complex systems by generalized factor analysis,” IEEE Trans. Autom. Control, vol. 60, no. 3, pp. 759–774, 2015.
  • [12] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, pp. 321–377, 1936.
  • [13] B. Yang, “Projection approximation subspace tracking,” IEEE Transactions on Signal Processing, vol. 43, pp. 95–107, 1995.
  • [14] G. Golub and C. Van Loan, Matrix Computations, 2nd ed. Johns Hopkins University Press, 1989.