
Optimal Spectral-Norm Approximate Minimization of Weighted Finite Automata

Borja Balle (DeepMind, London, United Kingdom), Clara Lacroce¹ (School of Computer Science, McGill University, Montréal, Canada; Mila, Montréal, Canada), Prakash Panangaden (School of Computer Science, McGill University, Montréal, Canada; Mila, Montréal, Canada), Doina Precup (School of Computer Science, McGill University, Montréal, Canada; Mila, Montréal, Canada), Guillaume Rabusseau² (Mila, Montréal, Canada; DIRO, Université de Montréal, Montréal, Canada)

¹ Corresponding author: [email protected]
² The names of the authors appear in alphabetical order.
Abstract

We address the approximate minimization problem for weighted finite automata (WFAs) with weights in $\mathbb{R}$, over a one-letter alphabet: to compute the best possible approximation of a WFA given a bound on the number of states. This work is grounded in Adamyan-Arov-Krein (AAK) approximation theory, a remarkable collection of results on the approximation of Hankel operators. In addition to its intrinsic mathematical relevance, this theory has proven to be very effective for model reduction. We adapt these results to the framework of weighted automata over a one-letter alphabet. We provide theoretical guarantees and bounds on the quality of the approximation in the spectral and $\ell^{2}$ norms. We develop an algorithm that, based on the properties of Hankel operators, returns the optimal approximation in the spectral norm.

1 Introduction

Weighted finite automata (WFAs) are an expressive class of models representing functions defined over sequences. The approximate minimization problem is concerned with finding an automaton that approximates the behaviour of a given minimal WFA while being smaller in size. This second automaton recognizes a different language, and the objective is to minimize the approximation error [10, 11]. Approximate minimization becomes particularly useful in the context of spectral learning algorithms [5, 7, 9, 22]. When applied to a learning task, such algorithms can be viewed as working in two steps. First, they compute a minimal WFA that explains the training data exactly. Then, they obtain a model that generalizes to unseen data by producing a smaller approximation to the minimal WFA. It is not just a question of saving space by having a smaller state space: the exact machine will overfit the data and generalize poorly. To obtain accurate results it is crucial to correctly guess the size of the minimal WFA, in particular when the data is generated by this type of machine.

The minimization task is greatly shaped by the way we decide to measure the approximation error. It is thus natural to wonder if there are norms that are preferable to others. We believe that the chosen norm should be computationally reasonable to minimize. For instance, the distance between WFAs can be computed using a metric based on bisimulation [8]. While exploring this approach could still be interesting, the fact that this metric is hard to compute makes it unsuitable for our purposes. Moreover, this metric is specifically designed for WFAs, so it is not directly applicable to other models dealing with sequential data. We think that being transferable is a second important feature for the chosen norm. In fact, being able to compare different classes of models is desirable for future applications of this method. For example, one can think of the burgeoning literature on approximating Recurrent Neural Networks (RNNs) using WFAs, where the objective is to extract from a trained RNN an automaton that accurately mimics its behaviour [36, 40, 31, 4, 18]. With this in mind, we think that it is preferable to consider a norm defined on the input-output function – or the Hankel matrix – rather than on the parameters of the specific model considered. Finally, it is important to choose a norm that can be computed accurately. The spectral norm seems to be a good candidate for the task. In particular, it allows us to exploit the work of Adamyan, Arov and Krein, which has come to be known as AAK theory [1]: a series of results connecting the theory of complex functions on the unit circle to Hankel matrices, mathematical objects representing functions defined over sequences. The core of this theory provides us with theoretical guarantees for the exact computation of the spectral norm of the error, and a method to construct the optimal approximation.

We summarize our main contributions:

  • We use AAK theory to study the approximate minimization problem of WFAs. To connect those areas, we establish a correspondence between the parameters of a WFA and the coefficients of a complex function on the unit circle. To the best of our knowledge, this paper represents the first attempt to apply AAK theory to WFAs.

  • We present a theoretical analysis of the optimal spectral-norm approximate minimization problem for WFAs, based on their connection with finite-rank infinite Hankel matrices. We provide a closed form solution for real weighted automata over a one-letter alphabet $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$, under the assumption $\rho(\mathbf{A})<1$ on the spectral radius. We bound the approximation error, both in terms of the Hankel matrix (spectral norm) and of the rational function computed by the WFA ($\ell^{2}$ norm).

  • We propose a self-contained algorithm that returns the unique optimal spectral-norm approximation of a given size.

  • We tighten the connection, made in [11], between WFAs and discrete dynamical systems, by adapting concepts from control theory, like the allpass system [20], to the formalism of WFAs.

2 Background

2.1 Preliminaries

We denote with $\mathbb{N}$, $\mathbb{Z}$ and $\mathbb{R}$ the sets of natural, integer and real numbers, respectively. We use bold letters for vectors and matrices; all vectors considered are column vectors. We denote with $\mathbf{1}$ the identity matrix, specifying its dimension only when not clear from the context. We refer to the $i$-th row and the $j$-th column of $\mathbf{M}$ by $\mathbf{M}(i,:)$ and $\mathbf{M}(:,j)$. Given a matrix $\mathbf{M}\in\mathbb{R}^{p\times q}$ of rank $n$, a rank factorization is a factorization $\mathbf{M}=\mathbf{P}\mathbf{Q}$, where $\mathbf{P}\in\mathbb{R}^{p\times n}$, $\mathbf{Q}\in\mathbb{R}^{n\times q}$ and $\operatorname{rank}(\mathbf{P})=\operatorname{rank}(\mathbf{Q})=n$. Let $\mathbf{M}\in\mathbb{R}^{p\times q}$ have rank $n$; the compact singular value decomposition (SVD) of $\mathbf{M}$ is the factorization $\mathbf{M}=\mathbf{U}\mathbf{D}\mathbf{V}^{\top}$, where $\mathbf{U}\in\mathbb{R}^{p\times n}$, $\mathbf{D}\in\mathbb{R}^{n\times n}$, $\mathbf{V}\in\mathbb{R}^{q\times n}$ are such that $\mathbf{U}^{\top}\mathbf{U}=\mathbf{V}^{\top}\mathbf{V}=\mathbf{1}$, and $\mathbf{D}$ is a diagonal matrix. The columns of $\mathbf{U}$ and $\mathbf{V}$ are called left and right singular vectors, and the entries of $\mathbf{D}$ are the singular values. The Moore-Penrose pseudo-inverse $\mathbf{M}^{+}$ of $\mathbf{M}$ is the unique matrix such that $\mathbf{M}\mathbf{M}^{+}\mathbf{M}=\mathbf{M}$, $\mathbf{M}^{+}\mathbf{M}\mathbf{M}^{+}=\mathbf{M}^{+}$, with $\mathbf{M}^{+}\mathbf{M}$ and $\mathbf{M}\mathbf{M}^{+}$ Hermitian [42]. The spectral radius $\rho(\mathbf{M})$ of a matrix $\mathbf{M}$ is the largest modulus among its eigenvalues.
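These objects translate directly into a few lines of numerical linear algebra. The following is a minimal sketch, assuming NumPy; the matrices are illustrative and not from the paper:

```python
import numpy as np

# A rank-2 matrix M in R^{3x4}
M = np.outer([1., 2., 3.], [1., 0., 1., 0.]) + np.outer([0., 1., 1.], [0., 1., 0., 1.])

# Compact SVD: keep only the n = rank(M) nonzero singular values
U, s, Vt = np.linalg.svd(M, full_matrices=False)
n = int(np.sum(s > 1e-10))                  # numerical rank
U, D, V = U[:, :n], np.diag(s[:n]), Vt[:n, :].T
assert np.allclose(M, U @ D @ V.T)          # M = U D V^T
assert np.allclose(U.T @ U, np.eye(n))      # U^T U = V^T V = 1

# Moore-Penrose pseudo-inverse and its defining identities
Mp = np.linalg.pinv(M)
assert np.allclose(M @ Mp @ M, M)
assert np.allclose(Mp @ M @ Mp, Mp)

# Spectral radius of a square matrix
A = np.array([[0., 0.5], [0.5, 0.]])
rho = np.max(np.abs(np.linalg.eigvals(A)))  # 0.5
```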

A Hilbert space is a complete normed vector space where the norm arises from an inner product. A linear operator $T:X\rightarrow Y$ between Hilbert spaces is bounded if it has finite operator norm, i.e. $\|T\|_{op}=\sup_{\|g\|_{X}\leq 1}\|Tg\|_{Y}<\infty$. We denote by $\mathbf{T}$ the (infinite) matrix associated with $T$ by some (canonical) orthonormal basis on the Hilbert space. An operator is compact if the image of the unit ball in $X$ is relatively compact. Given Hilbert spaces $X,Y$ and a compact operator $T:X\rightarrow Y$, we denote its adjoint by $T^{*}$. The singular numbers $\{\sigma_{n}\}_{n\geq 0}$ of $T$ are the square roots of the eigenvalues of the self-adjoint operator $T^{*}T$, arranged in decreasing order. A $\sigma$-Schmidt pair $\{\boldsymbol{\xi},\boldsymbol{\eta}\}$ for $T$ is a pair of norm-$1$ vectors such that $\mathbf{T}\boldsymbol{\xi}=\sigma\boldsymbol{\eta}$ and $\mathbf{T}^{*}\boldsymbol{\eta}=\sigma\boldsymbol{\xi}$. The Hilbert-Schmidt decomposition provides a generalization of the compact SVD for the infinite matrix of a compact operator $T$ using singular numbers and orthonormal Schmidt pairs: $\mathbf{T}\mathbf{x}=\sum_{n\geq 0}\sigma_{n}\langle\mathbf{x},\boldsymbol{\xi}_{n}\rangle\boldsymbol{\eta}_{n}$ [42]. The spectral norm $\|\mathbf{T}\|$ of the matrix representing the operator $T$ is the largest singular number of $T$. Note that the spectral norm of $\mathbf{T}$ corresponds to the operator norm of $T$.

Let $\ell^{2}$ be the Hilbert space of square-summable sequences over $\Sigma^{*}$, with norm $\|f\|_{2}^{2}=\sum_{x\in\Sigma^{*}}|f(x)|^{2}$ and inner product $\langle f,g\rangle=\sum_{x\in\Sigma^{*}}f(x)g(x)$ for $f,g\in\mathbb{R}^{\Sigma^{*}}$. Let $\mathbb{T}=\{z\in\mathbb{C}:|z|=1\}$ be the complex unit circle and $\mathbb{D}=\{z\in\mathbb{C}:|z|<1\}$ the (open) complex unit disc. For $1<p<\infty$, let $\mathcal{L}^{p}(\mathbb{T})$ be the space of measurable functions on $\mathbb{T}$ for which the $p$-th power of the absolute value is Lebesgue integrable. For $p=\infty$, we denote with $\mathcal{L}^{\infty}(\mathbb{T})$ the space of bounded measurable functions, with norm $\|f\|_{\infty}=\sup\{|f(x)|:x\in\mathbb{T}\}$.

2.2 Weighted Finite Automata

Let $\Sigma$ be a fixed finite alphabet and $\Sigma^{*}$ the set of all finite strings with symbols in $\Sigma$. We use $\varepsilon$ to denote the empty string. Given $p,s\in\Sigma^{*}$, we denote with $ps$ their concatenation.

A weighted finite automaton (WFA) of $n$ states over $\Sigma$ is a tuple $A=\langle\boldsymbol{\alpha},\{\mathbf{A}_{a}\},\boldsymbol{\beta}\rangle$, where $\boldsymbol{\alpha},\boldsymbol{\beta}\in\mathbb{R}^{n}$ are the vectors of initial and final weights, respectively, and $\mathbf{A}_{a}\in\mathbb{R}^{n\times n}$ is the matrix containing the transition weights associated with each symbol $a$. Every WFA $A$ with real weights realizes a function $f_{A}:\Sigma^{*}\to\mathbb{R}$: given a string $x=x_{1}\cdots x_{t}\in\Sigma^{*}$, it returns $f_{A}(x)=\boldsymbol{\alpha}^{\top}\mathbf{A}_{x_{1}}\cdots\mathbf{A}_{x_{t}}\boldsymbol{\beta}=\boldsymbol{\alpha}^{\top}\mathbf{A}_{x}\boldsymbol{\beta}$. A function $f:\Sigma^{*}\to\mathbb{R}$ is called rational if there exists a WFA $A$ that realizes it. The rank of the function is the size of the smallest WFA realizing $f$. Given $f:\Sigma^{*}\to\mathbb{R}$, we can consider a matrix $\mathbf{H}_{f}\in\mathbb{R}^{\Sigma^{*}\times\Sigma^{*}}$ having rows and columns indexed by strings and defined by $\mathbf{H}_{f}(p,s)=f(ps)$ for $p,s\in\Sigma^{*}$.

Definition 2.1.

A (bi-infinite) matrix $\mathbf{H}\in\mathbb{R}^{\Sigma^{*}\times\Sigma^{*}}$ is Hankel if for all $p,p^{\prime},s,s^{\prime}\in\Sigma^{*}$ such that $ps=p^{\prime}s^{\prime}$, we have $\mathbf{H}(p,s)=\mathbf{H}(p^{\prime},s^{\prime})$. Given a Hankel matrix $\mathbf{H}\in\mathbb{R}^{\Sigma^{*}\times\Sigma^{*}}$, there exists a unique function $f:\Sigma^{*}\to\mathbb{R}$ such that $\mathbf{H}_{f}=\mathbf{H}$.

Theorem 2.1 ([15, 19]).

A function $f:\Sigma^{*}\to\mathbb{R}$ can be realized by a WFA if and only if $\mathbf{H}_{f}$ has finite rank $n$. In that case, $n$ is the minimal number of states of any WFA $A$ such that $f=f_{A}$.

Given a WFA $A=\langle\boldsymbol{\alpha},\{\mathbf{A}_{a}\},\boldsymbol{\beta}\rangle$, the forward matrix of $A$ is the infinite matrix $\mathbf{F}_{A}\in\mathbb{R}^{\Sigma^{*}\times n}$ given by $\mathbf{F}_{A}(p,:)=\boldsymbol{\alpha}^{\top}\mathbf{A}_{p}$ for any $p\in\Sigma^{*}$, while the backward matrix of $A$ is $\mathbf{B}_{A}\in\mathbb{R}^{\Sigma^{*}\times n}$, given by $\mathbf{B}_{A}(s,:)=(\mathbf{A}_{s}\boldsymbol{\beta})^{\top}$ for any $s\in\Sigma^{*}$. Let $\mathbf{H}_{f}$ be the Hankel matrix of $f$; its forward-backward (FB) factorization is $\mathbf{H}_{f}=\mathbf{F}\mathbf{B}^{\top}$. A WFA with $n$ states is reachable if $\operatorname{rank}(\mathbf{F}_{A})=n$, and observable if $\operatorname{rank}(\mathbf{B}_{A})=n$. A WFA is minimal if it is reachable and observable. If $A$ is minimal, the FB factorization is a rank factorization [7].
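For a one-letter alphabet the rows of the forward and backward matrices are just $\boldsymbol{\alpha}^{\top}\mathbf{A}^{p}$ and $(\mathbf{A}^{s}\boldsymbol{\beta})^{\top}$, so the FB factorization can be checked numerically on truncations. A minimal sketch, assuming NumPy and an illustrative two-state automaton of our choosing:

```python
import numpy as np

# A small one-letter WFA <alpha, A, beta> (illustrative values, rho(A) = 0.5)
A = np.array([[0.3, 0.1], [0.2, 0.4]])
alpha = np.array([1.0, 0.5])
beta = np.array([1.0, 0.0])

# N-row truncations of the forward and backward matrices:
# F(p, :) = alpha^T A^p,   B(s, :) = (A^s beta)^T
N = 40
F = np.array([alpha @ np.linalg.matrix_power(A, p) for p in range(N)])
B = np.array([np.linalg.matrix_power(A, s) @ beta for s in range(N)])

# FB factorization: H(p, s) = f(ps) = alpha^T A^{p+s} beta = (F B^T)(p, s)
H = np.array([[alpha @ np.linalg.matrix_power(A, p + s) @ beta
               for s in range(N)] for p in range(N)])
assert np.allclose(H, F @ B.T)
print(np.linalg.matrix_rank(F), np.linalg.matrix_rank(B))  # 2 2: minimal
```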

We recall the definition of the singular value automaton, a canonical form for WFAs [10].

Definition 2.2.

Let $f:\Sigma^{*}\rightarrow\mathbb{R}$ be a rational function and suppose $\mathbf{H}_{f}$ admits an SVD, $\mathbf{H}_{f}=\mathbf{U}\mathbf{D}\mathbf{V}^{\top}$. A singular value automaton (SVA) for $f$ is the minimal WFA $A$ realizing $f$ such that $\mathbf{F}_{A}=\mathbf{U}\mathbf{D}^{1/2}$ and $\mathbf{B}_{A}=\mathbf{V}\mathbf{D}^{1/2}$.

The SVA can be computed with an efficient algorithm relying on the following matrices [11].

Definition 2.3.

Let $f:\Sigma^{*}\rightarrow\mathbb{R}$ be a rational function and $\mathbf{H}_{f}=\mathbf{F}\mathbf{B}^{\top}$ an FB factorization. If the matrices $\mathbf{P}=\mathbf{F}^{\top}\mathbf{F}$ and $\mathbf{Q}=\mathbf{B}^{\top}\mathbf{B}$ are well defined, we call $\mathbf{P}$ the reachability Gramian and $\mathbf{Q}$ the observability Gramian.

Note that if $A$ is an SVA, then the Gramians associated with its FB factorization satisfy $\mathbf{P}_{A}=\mathbf{Q}_{A}=\mathbf{D}$, where $\mathbf{D}$ is the matrix of singular values of its Hankel matrix. The Gramians can alternatively be characterized (and computed [11]) using fixed point equations, corresponding to Lyapunov equations when $|\Sigma|=1$ [27].

Theorem 2.2.

Let $|\Sigma|=1$, and let $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$ be a WFA with $n$ states and well-defined Gramians $\mathbf{P}$, $\mathbf{Q}$. Then $X=\mathbf{P}$ and $Y=\mathbf{Q}$ solve:

X-\mathbf{A}X\mathbf{A}^{\top}=\boldsymbol{\beta}\boldsymbol{\beta}^{\top},  (1)
Y-\mathbf{A}^{\top}Y\mathbf{A}=\boldsymbol{\alpha}\boldsymbol{\alpha}^{\top}.  (2)
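Equations (1) and (2) are standard discrete Lyapunov equations, so off-the-shelf solvers apply. A minimal sketch, assuming SciPy, using the two-state automaton of Example 2.3 below (whose Gramians come out as $\mathbf{P}=\mathbf{Q}=\operatorname{diag}(4/5,1/5)$):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# The one-letter WFA of Example 2.3 below (already in SVA form)
A = np.array([[0., 0.5], [0.5, 0.]])
alpha = np.array([np.sqrt(3) / 2, 0.])
beta = np.array([np.sqrt(3) / 2, 0.])

# Equation (1): X - A X A^T = beta beta^T
P = solve_discrete_lyapunov(A, np.outer(beta, beta))
# Equation (2): Y - A^T Y A = alpha alpha^T
Q = solve_discrete_lyapunov(A.T, np.outer(alpha, alpha))

print(np.round(P, 6))   # diag(0.8, 0.2)
print(np.round(Q, 6))   # equal to P: the automaton is in SVA form
```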

Finally, we recall the following definition.

Definition 2.4.

A WFA $A=\langle\boldsymbol{\alpha},\{\mathbf{A}_{a}\},\boldsymbol{\beta}\rangle$ is a generative probabilistic automaton (GPA) if $f_{A}(x)\geq 0$ for every $x$ and $\sum_{x\in\Sigma^{*}}f_{A}(x)=1$, i.e. if $f_{A}$ computes a probability distribution over $\Sigma^{*}$.

Example 2.3.

Let $|\Sigma|=1$, $\Sigma=\{x\}$. The WFA $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$, with:

\mathbf{A}=\begin{pmatrix}0&\frac{1}{2}\\ \frac{1}{2}&0\end{pmatrix},\quad\boldsymbol{\alpha}=\begin{pmatrix}\frac{\sqrt{3}}{2}\\ 0\end{pmatrix},\quad\boldsymbol{\beta}=\begin{pmatrix}\frac{\sqrt{3}}{2}\\ 0\end{pmatrix},

is a GPA. Indeed, $f_{A}(x)\geq 0$ and $\sum_{x\in\Sigma^{*}}f_{A}(x)=1$, since the rational function is:

f_{A}(x\cdots x)=f_{A}(k)=\boldsymbol{\alpha}^{\top}\mathbf{A}^{k}\boldsymbol{\beta}=\begin{cases}0&\text{if $k$ is odd}\\ \frac{3}{4}2^{-k}&\text{if $k$ is even}\end{cases}

where $k$ corresponds to the string where $x$ is repeated $k$ times. We remark that $A$ is minimal and in its SVA form, with Gramians $\mathbf{P}=\mathbf{Q}=\begin{pmatrix}\frac{4}{5}&0\\ 0&\frac{1}{5}\end{pmatrix}$, and $f_{A}$ has rank $2$. Finally, the corresponding Hankel matrix, also of rank $2$, is:

\mathbf{H}=\begin{pmatrix}f_{A}(0)&f_{A}(1)&f_{A}(2)&\dots\\ f_{A}(1)&f_{A}(2)&f_{A}(3)&\dots\\ f_{A}(2)&f_{A}(3)&f_{A}(4)&\dots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix}=\begin{pmatrix}\frac{3}{4}&0&\frac{3}{16}&\dots\\ 0&\frac{3}{16}&0&\dots\\ \frac{3}{16}&0&\frac{3}{64}&\dots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix}.  (3)
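A quick numerical check of this example (a sketch assuming NumPy): the truncated Hankel matrix has numerical rank $2$, and its leading singular values approach the Gramian diagonal $(4/5,1/5)$:

```python
import numpy as np

A = np.array([[0., 0.5], [0.5, 0.]])
alpha = np.array([np.sqrt(3) / 2, 0.])
beta = np.array([np.sqrt(3) / 2, 0.])

f = lambda k: alpha @ np.linalg.matrix_power(A, k) @ beta

# f_A is a probability distribution: nonnegative values summing to 1
vals = np.array([f(k) for k in range(200)])
assert np.all(vals >= 0) and np.isclose(vals.sum(), 1.0)

# N x N truncation of the Hankel matrix H(i, j) = f(i + j)
N = 50
H = np.array([[f(i + j) for j in range(N)] for i in range(N)])
print(np.linalg.matrix_rank(H))                  # 2
print(np.linalg.svd(H, compute_uv=False)[:2])    # ~ [0.8, 0.2]
```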

2.3 AAK Theory

Theorem 2.1 provides us with a way to associate a minimal WFA $A$ with $n$ states to a Hankel matrix $\mathbf{H}$ of rank $n$. The approach we propose to approximate $A$ is to find the WFA corresponding to the low-rank matrix closest to $\mathbf{H}$ in spectral norm. We recall the fundamental result of Schmidt, Eckart, Young and Mirsky [17].

Theorem 2.4 ([17]).

Let $\mathbf{H}$ be a Hankel matrix corresponding to a compact Hankel operator of rank $n$, and let $\sigma_{0}\geq\dots\geq\sigma_{n-1}>0$ be its singular numbers. Then, if $\mathbf{R}$ is a matrix of rank $k$, we have:

\|\mathbf{H}-\mathbf{R}\|\geq\sigma_{k}.  (4)

The equality is attained when $\mathbf{R}$ corresponds to the truncated SVD of $\mathbf{H}$.

Note that a low-rank approximation obtained by truncating the SVD is not in general a Hankel matrix. This is problematic, since the approximating matrix needs to be Hankel in order to be the matrix of a WFA. Surprisingly, we can obtain a result comparable to the one of Theorem 2.4 while preserving the Hankel property. This is possible thanks to a theory of optimal approximation called AAK theory [1]. To apply this theory, we will need to rewrite the approximation problem in functional analysis terms. First, we will associate a linear operator to the Hankel matrix. Then, we will use Fourier analysis to reformulate the problem in a function space. A comprehensive presentation of the concepts recalled in this section can be found in [30, 33, 28].

Let $f:\Sigma^{*}\rightarrow\mathbb{R}$ be a rational function; we interpret the corresponding Hankel matrix $\mathbf{H}_{f}$ as the expression of a linear (Hankel) operator $H_{f}:\ell^{2}\rightarrow\ell^{2}$ in terms of the canonical basis. We recall that a Hankel operator $H_{f}$ is bounded if and only if $f\in\ell^{2}$ [11]. This property, together with the fact that we only consider finite rank operators (corresponding to the Hankel matrices of WFAs), is sufficient to guarantee compactness.

To introduce AAK theory, we need to consider a second realization of Hankel operators on complex spaces. Since in this paper we work with two classes of functions – functions over sequences and complex functions – to avoid any confusion we will make explicit the dependence on the complex variable $z=e^{it}$. We start by recalling a few fundamental definitions from the theory of complex functions. Note that a function $\phi(z)\in\mathcal{L}^{2}(\mathbb{T})$ can be represented, using the orthonormal basis $\{z^{n}\}_{n\in\mathbb{Z}}$, by means of its Fourier series: $\phi(z)=\sum_{n\in\mathbb{Z}}\widehat{\phi}(n)z^{n}$, with Fourier coefficients $\widehat{\phi}(n)=\int_{\mathbb{T}}\phi(z)\bar{z}^{n}dz$, $n\in\mathbb{Z}$. This establishes an isomorphism between the function $\phi(z)$ and the sequence of its Fourier coefficients $\widehat{\phi}$. Thus, we can partition the function space $\mathcal{L}^{2}(\mathbb{T})$ into two subspaces.

Definition 2.5.

For $0<p\leq\infty$, the Hardy space $\mathcal{H}^{p}$ on $\mathbb{T}$ is the subspace of $\mathcal{L}^{p}(\mathbb{T})$ defined as:

\mathcal{H}^{p}=\{\phi(z)\in\mathcal{L}^{p}(\mathbb{T}):\widehat{\phi}(n)=0,\ n<0\},  (5)

while the negative Hardy space on $\mathbb{T}$ is the subspace of $\mathcal{L}^{p}(\mathbb{T})$:

\mathcal{H}^{p}_{-}=\{\phi(z)\in\mathcal{L}^{p}(\mathbb{T}):\widehat{\phi}(n)=0,\ n\geq 0\}.  (6)

It is possible to define Hardy spaces also on the open unit disc $\mathbb{D}$.

Definition 2.6.

The Hardy space $\mathcal{H}^{p}(\mathbb{D})$ on $\mathbb{D}$ for $0<p<\infty$ consists of functions $\phi(z)$ analytic in $\mathbb{D}$ and such that:

\|\phi\|_{p}:=\sup_{0<r<1}\left(\int_{\mathbb{T}}|\phi(r\xi)|^{p}dm(\xi)\right)^{1/p}<\infty  (7)

and it is equipped with the norm $\|\cdot\|_{p}$. For $p=\infty$, $\mathcal{H}^{\infty}(\mathbb{D})$ is the space of bounded analytic functions in $\mathbb{D}$ with norm:

\|\phi\|_{\infty}:=\sup_{\xi\in\mathbb{D}}|\phi(\xi)|.  (8)

Interestingly, $\mathcal{H}^{p}(\mathbb{D})$ and $\mathcal{H}^{p}$ can be canonically identified by associating a function $\phi(z)\in\mathcal{H}^{p}(\mathbb{D})$ with its limit on the boundary, which is a function in $\mathcal{H}^{p}$ (a proof can be found in [30]). Thus, we will make no distinction between functions on the unit disc and their boundary values on the circle.

We can now embed the sequence space $\ell^{2}$ into $\ell^{2}(\mathbb{Z})$ by “duplicating” each vector, i.e. by associating $\boldsymbol{\mu}=(\mu_{0},\mu_{1},\dots)\in\ell^{2}$ to $\boldsymbol{\mu}^{(2)}=(\dots,\mu_{1},\mu_{0},\mu_{1},\dots)\in\ell^{2}(\mathbb{Z})$. Then, we can use the Fourier isomorphism to map the vector $\boldsymbol{\mu}^{(2)}\in\ell^{2}(\mathbb{Z})$ to the function space $\mathcal{L}^{2}(\mathbb{T})$. In this way each vector $\boldsymbol{\mu}\in\ell^{2}$ corresponds to two functions in the Hardy spaces:

\mu^{-}(z)=\sum_{j=0}^{\infty}\boldsymbol{\mu}_{j}z^{-j-1}\in\mathcal{H}^{2}_{-},  (9)
\mu^{+}(z)=\sum_{j=0}^{\infty}\boldsymbol{\mu}_{j}z^{j}\in\mathcal{H}^{2}.  (10)

This leads to an alternative characterization of Hankel operators in Hardy spaces.

Definition 2.7.

Let $\phi(z)$ be a function in the space $\mathcal{L}^{2}(\mathbb{T})$. A Hankel operator is an operator $H_{\phi}:\mathcal{H}^{2}\rightarrow\mathcal{H}^{2}_{-}$ defined by $H_{\phi}f(z)=\mathbb{P}_{-}\phi f(z)$, where $\mathbb{P}_{-}$ is the orthogonal projection from $\mathcal{L}^{2}(\mathbb{T})$ onto $\mathcal{H}^{2}_{-}$. The function $\phi(z)$ is called a symbol of the Hankel operator $H_{\phi}$.

If $H_{\phi}$ is a bounded operator, we can consider without loss of generality $\phi(z)\in\mathcal{L}^{\infty}(\mathbb{T})$. This is a consequence of Nehari’s theorem [29], whose formulation can be found in Appendix A, together with more details about the two definitions of Hankel operators. We remark that a Hankel operator has infinitely many different symbols, since $H_{\phi}=H_{\phi+\psi}$ for $\psi(z)\in\mathcal{H}^{\infty}$.

Remark 1.

In the standard orthonormal bases, $\{z^{k}\}_{k\geq 0}$ in $\mathcal{H}^{2}$ and $\{z^{-(j+1)}\}_{j\geq 0}$ in $\mathcal{H}^{2}_{-}$, the Hankel operator $H_{\phi}$ has Hankel matrix $\mathbf{H}(j,k)=\widehat{\phi}(-j-k-1)$ for $j,k\geq 0$.

Example 2.5.

In the case of the Hankel matrix in Example 2.3, since $\mathbf{H}(j,k)=\widehat{\phi}(-j-k-1)$, we have:

\mathbf{H}=\begin{pmatrix}\frac{3}{4}&0&\frac{3}{16}&\dots\\ 0&\frac{3}{16}&0&\dots\\ \frac{3}{16}&0&\frac{3}{64}&\dots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix}=\begin{pmatrix}\widehat{\phi}(-1)&\widehat{\phi}(-2)&\widehat{\phi}(-3)&\dots\\ \widehat{\phi}(-2)&\widehat{\phi}(-3)&\widehat{\phi}(-4)&\dots\\ \widehat{\phi}(-3)&\widehat{\phi}(-4)&\widehat{\phi}(-5)&\dots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix}.

Hence, we can recover the corresponding symbol:

\mathbb{P}_{-}\phi=\sum_{n\geq 0}\widehat{\phi}(-n-1)z^{-n-1}=\sum_{n\geq 0}\frac{3}{4}4^{-n}z^{-2n-1}=\frac{3z}{4z^{2}-1}.
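As a numerical sanity check (a sketch assuming NumPy), the negative Fourier coefficients of this symbol, computed by discretizing $\widehat{\phi}(n)=\int_{\mathbb{T}}\phi(z)\bar{z}^{n}dz$, reproduce the Hankel entries $f(k)$:

```python
import numpy as np

phi = lambda z: 3 * z / (4 * z**2 - 1)   # the symbol recovered above

# phi_hat(n) via the trapezoid rule on the unit circle (normalized measure)
def fourier_coeff(n, m=4096):
    t = np.linspace(0, 2 * np.pi, m, endpoint=False)
    z = np.exp(1j * t)
    return np.mean(phi(z) * np.exp(-1j * n * t))

# Negative coefficients reproduce the Hankel entries: f(k) = phi_hat(-k-1)
for k in range(5):
    print(k, fourier_coeff(-k - 1).real)   # 3/4, 0, 3/16, 0, 3/64
```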
Definition 2.8.

The complex function $f(z)$ is rational if $f(z)=p(z)/q(z)$, with $p(z)$ and $q(z)$ polynomials. The rank of $f(z)$ is the maximum of the degrees of $p(z)$ and $q(z)$. A rational function is strictly proper if the degree of $p(z)$ is strictly smaller than that of $q(z)$.

We remark that the poles of a complex function $f$ correspond to the zeros of $1/f$. The following result of Kronecker relates finite-rank infinite Hankel matrices to rational functions.

Theorem 2.6 ([24]).

Let $H_{\phi}$ be a bounded Hankel operator with matrix $\mathbf{H}$. Then $\mathbf{H}$ has finite rank if and only if $\mathbb{P}_{-}\phi$ is a strictly proper rational function. Moreover, the rank of $\mathbf{H}$ is equal to the number of poles (with multiplicities) of $\mathbb{P}_{-}\phi$ inside the unit disc.

Example 2.7.

The function in Example 2.5 is rational with degree $2$ and has two poles inside the unit disc, at $z=\pm\frac{1}{2}$. Thus, the associated Hankel matrix has rank $2$.

We state as a remark an important takeaway from this section.

Remark 2.

Given a rank-$n$ Hankel matrix $\mathbf{H}$, we can look at it in two alternative ways. On the one hand, we can consider the Hankel operator over sequences, $H_{f}:\ell^{2}\rightarrow\ell^{2}$, associated to a function $f:\Sigma^{*}\rightarrow\mathbb{R}$. In this case $\mathbf{H}(i,j)=f(i+j)$ for $i,j\geq 0$, and $f$ is rational in the sense that it is realized by a WFA of size $n$. On the other hand, we can consider the Hankel operator over complex (Hardy) spaces, $H_{\phi}:\mathcal{H}^{2}\rightarrow\mathcal{H}^{2}_{-}$, associated to a function $\phi(z)\in\mathcal{L}^{2}(\mathbb{T})$, the symbol. In this case $\mathbf{H}(j,k)=\widehat{\phi}(-j-k-1)$ for $j,k\geq 0$, and $\mathbb{P}_{-}\phi$ is rational of rank $n$ in the sense of Definition 2.8.

We can now introduce the main result of Adamyan, Arov and Krein [1]. The theorem, stated for Hankel operators over Hardy spaces, shows that for infinite-dimensional Hankel matrices the constraint of preserving the Hankel property does not affect the achievable approximation error.

Theorem 2.8 (AAK-1 [1]).

Let $H_{\phi}$ be a compact Hankel operator of rank $n$ and singular numbers $\sigma_{m}$, with $0\leq m<n$ and $\sigma_{0}\geq\dots\geq\sigma_{n-1}>0$. Then there exists a unique Hankel operator $G$ of rank $k<n$ such that:

\|\mathbf{H}-\mathbf{G}\|=\sigma_{k}.  (11)

We denote with $\mathcal{R}_{k}\subset\mathcal{H}^{\infty}_{-}$ the set of strictly proper rational functions of rank $k$, and we consider the set:

\mathcal{H}^{\infty}_{k}=\{\psi(z)\in\mathcal{L}^{\infty}(\mathbb{T}):\ \exists g(z)\in\mathcal{R}_{k},\ \exists l(z)\in\mathcal{H}^{\infty},\ \psi(z)=g(z)+l(z)\}.  (12)

We can reformulate the theorem in terms of symbols.

Theorem 2.9 (AAK-2 [1]).

Let $\phi(z)\in\mathcal{L}^{\infty}(\mathbb{T})$. Then there exists $\psi(z)\in\mathcal{H}^{\infty}_{k}$ such that:

\|\phi(z)-\psi(z)\|_{\infty}=\sigma_{k}(H_{\phi}).  (13)

The solutions of Theorems 2.8 and 2.9 are strictly related (proof in Appendix A).

Corollary 1.

Let $\psi(z)=g(z)+l(z)\in\mathcal{H}^{\infty}_{k}$, with $g(z)\in\mathcal{R}_{k}$ and $l(z)\in\mathcal{H}^{\infty}$. If $\psi(z)$ solves Equation 13, then $G=H_{g}$ is the unique Hankel operator from Theorem 2.8.

We state as a corollary the key point of the proof of the AAK theorem, which provides us with a practical way to find the best approximating symbol.

Corollary 2.

Let $\phi(z)$ and $\{\boldsymbol{\xi}_{k},\boldsymbol{\eta}_{k}\}$ be a symbol and a $\sigma_{k}$-Schmidt pair for $H_{\phi}$. A function $\psi(z)\in\mathcal{L}^{\infty}$ is the best AAK approximation according to Theorem 2.9 if and only if:

(\phi(z)-\psi(z))\xi^{+}_{k}(z)=\sigma_{k}\eta^{-}_{k}(z).  (14)

Moreover, the function $\psi(z)$ does not depend on the particular choice of the pair $\{\boldsymbol{\xi}_{k},\boldsymbol{\eta}_{k}\}$.

3 Approximate Minimization

3.1 Assumptions

To apply AAK theory to the approximate minimization problem, we consider only automata with real weights, defined over a one-letter alphabet. In this case, the free monoid generated by the single letter can be identified with $\mathbb{N}$, and canonically embedded into $\mathbb{Z}$. This passage is fundamental to use Fourier analysis and the isomorphism that leads to Theorems 2.8 and 2.9. Unfortunately, this idea cannot be directly generalized to bigger alphabets, since in that case we would obtain a free non-abelian monoid (not identifiable with $\mathbb{Z}$).

Theorem 2.8 requires the Hankel operator $H$ to be bounded. To ensure that a minimal WFA $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$ satisfies this condition, we assume $\rho(\mathbf{A})<1$, where $\rho$ is the spectral radius of $\mathbf{A}$ [11]. As a matter of fact, to guarantee the boundedness of the Hankel operator it is enough that the considered WFA computes a function $f\in\ell^{2}$ [11]. However, the stricter assumption on the spectral radius is needed when computing the symbol associated to a WFA. This condition directly implies the existence of the SVA, and of the Gramian matrices $\mathbf{P}$ and $\mathbf{Q}$, with $\mathbf{P}=\mathbf{Q}$ diagonal [11]. We assume that $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$ is in SVA form. In this case, given the size of the alphabet, the Hankel matrix $\mathbf{H}$ is symmetric, so $\mathbf{A}=\mathbf{A}^{\top}$. Moreover, if we denote with $\lambda_{i}$ the $i$-th non-zero eigenvalue of $\mathbf{H}$, and we consider the coordinates of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, we have $\boldsymbol{\alpha}_{i}=\operatorname{sgn}(\lambda_{i})\boldsymbol{\beta}_{i}$, where $\operatorname{sgn}(\lambda_{i})=\lambda_{i}/|\lambda_{i}|$.

For example, we note that a minimal GPA computes a function $f\in\ell^{1}$, so the condition on $\rho(\mathbf{A})$ is automatically satisfied by this class of WFAs [11]. Possible relaxations of the spectral radius assumption are discussed in Appendix C, together with an alternative method to find the optimal spectral-norm approximation of a Hankel matrix without extracting a WFA.

Finally, in this paper we only consider automata with weights in $\mathbb{R}$, though the results remain true for complex numbers. The method we present can be easily extended to vector-valued automata [35], but the solution to the optimal approximation problem will not be unique [33].

3.2 Problem Formulation

Let $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$ be a minimal WFA with $n$ states in SVA form, defined over a one-letter alphabet. Let $\mathbf{H}$ be the Hankel matrix of $A$; we denote with $\sigma_{i}$, for $0\leq i<n$, its singular numbers. Given a target number of states $k<n$, we say that a WFA $\widehat{A}_{k}$ with $k$ states solves the optimal spectral-norm approximate minimization problem if the Hankel matrix $\mathbf{G}$ of $\widehat{A}_{k}$ satisfies $\|\mathbf{H}-\mathbf{G}\|=\sigma_{k}(\mathbf{H})$. Note that the optimal spectral-norm approximate minimization problem is equivalent to the problem solved by Theorem 2.8, with the exception that here we insist on representing the inputs and outputs of the problem effectively by means of WFAs. Based on the AAK theory sketched in Section 2.3, we draw the following steps:

  1. Compute a symbol $\phi(z)$ for $H$ using Remark 2. We obtain the negative Fourier coefficients of $\phi(z)$ and derive its Fourier series.

  2. Compute the optimal symbol $\psi(z)$ using Corollary 2. The main challenge here is to find a suitable representation for the functions $\psi(z)$ and $e(z)=\phi(z)-\psi(z)$. We define them in terms of two auxiliary WFAs. The key point is to select constraints on their parameters to leverage the properties of weighted automata, while still keeping the formulation general.

  3. Extract the rational component by solving for $g(z)$ in Corollary 1. This step is arguably the most conceptually challenging, as it requires identifying the position of the function’s poles. In fact, we know from Theorem 2.6 that $g(z)$ has $k$ poles, all inside the unit disc.

  4. Find a WFA representation for $g(z)$. Since in Step 2 we parametrized the functions using WFAs, the expression of $g(z)$ directly reveals the WFA $\widehat{A}_{k}$. (A high-level sketch of the whole pipeline is given after this list.)
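As a roadmap, the following sketch strings the four steps together for a one-letter WFA in SVA form (NumPy assumed). The helper names solve_optimal_symbol and extract_stable_part are ours, not the paper’s; they are sketched in Sections 3.3.2 and 3.3.3 below:

```python
import numpy as np

def aak_approximate_minimization(alpha, A, beta, sigma, k):
    """Hypothetical end-to-end sketch of Steps 1-4 (one-letter alphabet,
    SVA form, rho(A) < 1; sigma holds the singular numbers of H)."""
    # Step 1: the symbol is phi(z) = alpha^T (z1 - A)^{-1} beta
    # (Section 3.3.1); it is given in closed form by the WFA parameters.

    # Step 2: parameters of the auxiliary WFA defining the optimal
    # symbol psi (Theorem 3.4).
    alpha_h, A_h, beta_h = solve_optimal_symbol(alpha, A, beta, sigma, k)

    # Step 3: keep the part of psi with poles inside the unit disc,
    # i.e. the rational component g (Theorem 3.6).
    alpha_p, A_p, beta_p = extract_stable_part(alpha_h, A_h, beta_h, k)

    # Step 4: these parameters are exactly the optimal WFA (Theorem 3.8).
    return alpha_p, A_p, beta_p
```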

3.3 Spectral-Norm Approximate Minimization

In the following sections we will consider a minimal WFA $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$ with $n$ states in SVA form, defined over a one-letter alphabet $\Sigma=\{a\}$, its Hankel matrix $\mathbf{H}$, corresponding to the bounded operator $H$, and the singular numbers $\sigma_{i}$, for $0\leq i<n$. Let $f:\Sigma^{*}\rightarrow\mathbb{R}$ be the function realized by $A$. We denote by $x$ the string where $a$ is repeated $x$ times, so we have $f(x)=\boldsymbol{\alpha}^{\top}\mathbf{A}^{x}\boldsymbol{\beta}$.

3.3.1 Computation of a Symbol for A

To determine the symbol $\phi(z)$ of $H$, we recall that each entry of the Hankel matrix corresponds simultaneously to a value of $f$ and to a negative Fourier coefficient of $\phi(z)$. In fact, as seen in Remark 2, we have:

\mathbf{H}=\begin{pmatrix}f_{A}(0)&f_{A}(1)&f_{A}(2)&\dots\\ f_{A}(1)&f_{A}(2)&f_{A}(3)&\dots\\ f_{A}(2)&f_{A}(3)&f_{A}(4)&\dots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix}=\begin{pmatrix}\widehat{\phi}(-1)&\widehat{\phi}(-2)&\widehat{\phi}(-3)&\dots\\ \widehat{\phi}(-2)&\widehat{\phi}(-3)&\widehat{\phi}(-4)&\dots\\ \widehat{\phi}(-3)&\widehat{\phi}(-4)&\widehat{\phi}(-5)&\dots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix}.  (15)

We obtain:

\mathbb{P}_{-}\phi(z)=\sum_{k\geq 0}f(k)z^{-k-1}=\sum_{k\geq 0}\boldsymbol{\alpha}^{\top}\mathbf{A}^{k}\boldsymbol{\beta}z^{-k-1}=\boldsymbol{\alpha}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\boldsymbol{\beta}  (16)

where we use the fact that $\rho(\mathbf{A})<1$ for the last equality. Since the function obtained is already bounded, we can directly consider $\phi(z)=\boldsymbol{\alpha}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\boldsymbol{\beta}$ as a symbol for $H$.

Example 3.1.

If we apply the formula in Equation 16 to the GPA of Example 2.3, we recover the rational function $\phi(z)=\frac{3z}{4z^{2}-1}$ found in Example 2.5.

3.3.2 Computation of the Optimal Symbol

We consider two auxiliary WFAs. Let $\widehat{A}=\langle\widehat{\boldsymbol{\alpha}},\widehat{\mathbf{A}},\widehat{\boldsymbol{\beta}}\rangle$ be a WFA with $j\geq k$ states, satisfying the following properties:

  1. $1$ is not an eigenvalue of $\widehat{\mathbf{A}}$;

  2. the automaton $E=\langle\boldsymbol{\alpha}_{e},\mathbf{A}_{e},\boldsymbol{\beta}_{e}\rangle$ is minimal, with

\mathbf{A}_{e}=\begin{pmatrix}\mathbf{A}&\mathbf{0}\\ \mathbf{0}&\widehat{\mathbf{A}}\end{pmatrix},\quad\boldsymbol{\alpha}_{e}=\begin{pmatrix}\boldsymbol{\alpha}\\ -\widehat{\boldsymbol{\alpha}}\end{pmatrix},\quad\boldsymbol{\beta}_{e}=\begin{pmatrix}\boldsymbol{\beta}\\ \widehat{\boldsymbol{\beta}}\end{pmatrix}.  (17)

Using the parameters of the automaton $\widehat{A}$ and a constant $C$, we define a function $\psi(z)=\widehat{\boldsymbol{\alpha}}^{\top}(z\mathbf{1}-\widehat{\mathbf{A}})^{-1}\widehat{\boldsymbol{\beta}}+C$. We remark that the poles of $\psi(z)$ correspond to the eigenvalues of $\widehat{\mathbf{A}}$, counted with their multiplicities. By assumption, $1$ is not an eigenvalue of $\widehat{\mathbf{A}}$, so $\psi(z)$ does not have any poles on the unit circle, and therefore $\psi(z)\in\mathcal{L}^{\infty}(\mathbb{T})$. Analogously, the function $e(z)=\phi(z)-\psi(z)=\boldsymbol{\alpha}_{e}^{\top}(z\mathbf{1}-\mathbf{A}_{e})^{-1}\boldsymbol{\beta}_{e}-C$ is also bounded on the circle.

Our objective is to compute the parameters of $\widehat{A}=\langle\widehat{\boldsymbol{\alpha}},\widehat{\mathbf{A}},\widehat{\boldsymbol{\beta}}\rangle$ that make $\psi(z)$ the best approximation of $\phi(z)$ according to Theorem 2.9. In particular, we will use Corollary 2 to find the triple $\widehat{\boldsymbol{\alpha}},\widehat{\mathbf{A}},\widehat{\boldsymbol{\beta}}$ such that $\psi(z)$ satisfies Equation 14. Note that, for this purpose, the constant term $C\in\mathcal{H}^{\infty}$ becomes necessary to characterize $\psi(z)$. In fact, while the $\mathcal{H}^{\infty}$-component of the symbol does not affect the Hankel norm, it plays a role in the computation of the $\mathcal{L}^{\infty}$-norm (in Equation 13) according to Nehari’s theorem (Theorem A.1), so it cannot be dismissed.

The following theorem provides us with an explicit expression for the functions in the Hardy spaces corresponding to a $\sigma_{k}$-Schmidt pair.

Theorem 3.2.

Let $\sigma_{k}$ be a singular number of the Hankel operator $H$. The singular functions associated with the $\sigma_{k}$-Schmidt pair $\{\boldsymbol{\xi}_{k},\boldsymbol{\eta}_{k}\}$ of $H$ are:

\xi^{+}_{k}(z)=\sigma_{k}^{-1/2}\boldsymbol{\alpha}^{\top}(\mathbf{1}-z\mathbf{A})^{-1}\mathbf{e}_{k}  (18)
\eta^{-}_{k}(z)=\sigma_{k}^{-1/2}\boldsymbol{\beta}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\mathbf{e}_{k}.  (19)

If $\psi(z)$ is the best approximation to the symbol, then $\sigma_{k}^{-1}e(z)$ has modulus $1$ almost everywhere on the unit circle (i.e. it is unimodular).

Proof.

Let $\mathbf{F}$ and $\mathbf{B}$ be the forward and backward matrices, respectively, with $\mathbf{H}=\mathbf{F}\mathbf{B}^{\top}$, $\mathbf{P}=\mathbf{F}^{\top}\mathbf{F}$, $\mathbf{Q}=\mathbf{B}^{\top}\mathbf{B}$. We consider the $\sigma_{k}$-Schmidt pair $\{\boldsymbol{\xi}_{k},\boldsymbol{\eta}_{k}\}$. By definition, $\mathbf{H}^{\top}\mathbf{H}\boldsymbol{\xi}_{k}=\sigma_{k}^{2}\boldsymbol{\xi}_{k}$. Rewriting in terms of the FB factorization, we obtain:

\mathbf{H}^{\top}\mathbf{H}\boldsymbol{\xi}_{k}=\sigma_{k}^{2}\boldsymbol{\xi}_{k}  (20)
\mathbf{B}\mathbf{F}^{\top}\mathbf{F}\mathbf{B}^{\top}\boldsymbol{\xi}_{k}=\sigma_{k}^{2}\boldsymbol{\xi}_{k}  (21)
\mathbf{B}\mathbf{P}\mathbf{B}^{\top}\boldsymbol{\xi}_{k}=\sigma_{k}^{2}\boldsymbol{\xi}_{k}  (22)
\mathbf{B}\mathbf{P}\mathbf{e}_{k}=\sigma_{k}^{2}\boldsymbol{\xi}_{k}  (23)

where in the last step we set $\mathbf{e}_{k}=\mathbf{B}^{\top}\boldsymbol{\xi}_{k}$, to reduce the SVD problem of $\mathbf{H}$ to that of $\mathbf{Q}\mathbf{P}$. Note that, since $\mathbf{P}$ and $\mathbf{Q}$ are diagonal, $\mathbf{e}_{k}$ is the $k$-th coordinate vector $(0,\dots,0,1,0,\dots,0)^{\top}$. Since $\mathbf{e}_{k}$ is an eigenvector of $\mathbf{Q}\mathbf{P}$ for $\sigma_{k}^{2}$, we get:

\mathbf{B}\mathbf{Q}^{-1}\mathbf{Q}\mathbf{P}\mathbf{e}_{k}=\sigma_{k}^{2}\boldsymbol{\xi}_{k}  (24)
\mathbf{B}\mathbf{Q}^{-1}\mathbf{e}_{k}=\boldsymbol{\xi}_{k}.  (25)

Moreover, $\mathbf{H}$ is symmetric, so the singular vectors $\boldsymbol{\eta}_{k}$ and $\boldsymbol{\xi}_{k}$ have the same coordinates up to the sign of the corresponding eigenvalues. We obtain:

\xi^{+}_{k}(z)=\sum_{j=0}^{\infty}\sigma_{k}^{-1/2}\boldsymbol{\alpha}^{\top}\mathbf{A}^{j}\mathbf{e}_{k}z^{j}=\sigma_{k}^{-1/2}\boldsymbol{\alpha}^{\top}(\mathbf{1}-z\mathbf{A})^{-1}\mathbf{e}_{k}  (26)
\eta^{-}_{k}(z)=\sum_{j=0}^{\infty}\sigma_{k}^{-1/2}\boldsymbol{\beta}^{\top}\mathbf{A}^{j}\mathbf{e}_{k}z^{-j-1}=\sigma_{k}^{-1/2}\boldsymbol{\beta}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\mathbf{e}_{k}  (27)

where the singular functions have been computed following Equations 9 and 10. If $r$ is the multiplicity of $\sigma_{k}$, from Corollary 2 we get the following fundamental equation:

(\phi(z)-\psi(z))\boldsymbol{\alpha}^{\top}(\mathbf{1}-z\mathbf{A})^{-1}\mathbf{V}=\sigma_{k}\boldsymbol{\beta}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\mathbf{V}

where $\mathbf{V}=\begin{pmatrix}\mathbf{0}&\mathbf{1}_{r}\end{pmatrix}^{\top}$ is an $n\times r$ matrix. Consequently, we obtain the function:

\sigma_{k}^{-1}e(z)=\frac{\boldsymbol{\beta}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\mathbf{V}}{\boldsymbol{\alpha}^{\top}(\mathbf{1}-z\mathbf{A})^{-1}\mathbf{V}}

which is unimodular, since $\boldsymbol{\alpha}_{i}=\operatorname{sgn}(\lambda_{i})\boldsymbol{\beta}_{i}$. ∎
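As a numerical sanity check of Equations (18)-(19) (a sketch assuming NumPy; the automaton is the GPA of Example 2.3, with singular numbers $4/5$ and $1/5$), the coefficient sequences of $\xi_{k}^{+}$ and $\eta_{k}^{-}$ indeed form a $\sigma_{k}$-Schmidt pair of a large truncation of $\mathbf{H}$:

```python
import numpy as np

# The GPA of Example 2.3: H is symmetric with singular numbers 4/5 and 1/5
A = np.array([[0., 0.5], [0.5, 0.]])
alpha = np.array([np.sqrt(3) / 2, 0.])
beta = alpha.copy()
sigma = np.array([4 / 5, 1 / 5])

N, k = 60, 1
Apow = [np.linalg.matrix_power(A, j) for j in range(2 * N)]
H = np.array([[alpha @ Apow[i + j] @ beta for j in range(N)] for i in range(N)])

# Coefficient sequences of xi_k^+ and eta_k^- (Equations 18-19):
# xi_k(j) = sigma_k^{-1/2} alpha^T A^j e_k, eta_k(j) = sigma_k^{-1/2} beta^T A^j e_k
e_k = np.eye(2)[k]
xi = np.array([alpha @ Apow[j] @ e_k for j in range(N)]) / np.sqrt(sigma[k])
eta = np.array([beta @ Apow[j] @ e_k for j in range(N)]) / np.sqrt(sigma[k])

print(np.linalg.norm(xi))                   # ~ 1: Schmidt vectors have norm 1
print(np.allclose(H @ xi, sigma[k] * eta))  # H xi = sigma_k eta
```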

It is reasonable to wonder how the fact that $\sigma_{k}^{-1}e(z)$ is unimodular reflects on the structure of the WFA $E=\langle\boldsymbol{\alpha}_{e},\mathbf{A}_{e},\boldsymbol{\beta}_{e}\rangle$ associated with it. We remark that, a priori, the controllability and observability Gramians of $E$ might not be well defined. The following theorem provides us with two matrices $\mathbf{P}_{e}$ and $\mathbf{Q}_{e}$ satisfying properties similar to those of the Gramians. This theorem is the analogue of a control theory result [16], rephrased in terms of WFAs. A sketch of the proof, which relies on the minimality of the WFA $E$ [38], can be found in Appendix B. For the detailed version of the proof and the original theorem we refer the reader to [16].

Theorem 3.3 ([16]).

Consider the function $e(z)=\boldsymbol{\alpha}_{e}^{\top}(z\mathbf{1}-\mathbf{A}_{e})^{-1}\boldsymbol{\beta}_{e}-C$ and the corresponding minimal WFA $E=\langle\boldsymbol{\alpha}_{e},\mathbf{A}_{e},\boldsymbol{\beta}_{e}\rangle$ associated with it. If $\sigma_{k}^{-1}e(z)$ is unimodular, then there exists a unique pair of symmetric invertible matrices $\mathbf{P}_{e}$ and $\mathbf{Q}_{e}$ satisfying:

  • (a) $\mathbf{P}_{e}-\mathbf{A}_{e}\mathbf{P}_{e}\mathbf{A}_{e}^{\top}=\boldsymbol{\beta}_{e}\boldsymbol{\beta}_{e}^{\top}$

  • (b) $\mathbf{Q}_{e}-\mathbf{A}_{e}^{\top}\mathbf{Q}_{e}\mathbf{A}_{e}=\boldsymbol{\alpha}_{e}\boldsymbol{\alpha}_{e}^{\top}$

  • (c) $\mathbf{P}_{e}\mathbf{Q}_{e}=\sigma^{2}_{k}\mathbf{1}$

We can now derive the parameters of the WFA $\widehat{A}=\langle\widehat{\boldsymbol{\alpha}},\widehat{\mathbf{A}},\widehat{\boldsymbol{\beta}}\rangle$.

Theorem 3.4.

Let $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$ be a minimal WFA with $n$ states in its SVA form, and let $\phi(z)=\boldsymbol{\alpha}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\boldsymbol{\beta}$ be a symbol for its Hankel operator $H$. Let $\sigma_{k}$ be a singular number of multiplicity $r$ for $H$, with $\sigma_{0}\geq\dots>\sigma_{k}=\dots=\sigma_{k+r-1}>\sigma_{k+r}\geq\dots\geq\sigma_{n-1}>0$. We can partition the Gramian matrices $\mathbf{P}$, $\mathbf{Q}$ as follows:

\mathbf{P}=\mathbf{Q}=\begin{pmatrix}\mathbf{\Sigma}&\mathbf{0}\\ \mathbf{0}&\sigma_{k}\mathbf{1}_{r}\end{pmatrix},  (28)

where $\boldsymbol{\Sigma}\in\mathbb{R}^{(n-r)\times(n-r)}$ is the diagonal matrix containing the remaining singular numbers, and partition $\mathbf{A}$, $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ conformally to the Gramians:

\mathbf{A}=\begin{pmatrix}\mathbf{A}_{11}&\mathbf{A}_{12}\\ \mathbf{A}_{12}^{\top}&\mathbf{A}_{22}\end{pmatrix},\quad\boldsymbol{\alpha}=\begin{pmatrix}\boldsymbol{\alpha}_{1}\\ \boldsymbol{\alpha}_{2}\end{pmatrix},\quad\boldsymbol{\beta}=\begin{pmatrix}\boldsymbol{\beta}_{1}\\ \boldsymbol{\beta}_{2}\end{pmatrix}.  (29)

Let $\mathbf{R}=\sigma_{k}^{2}\mathbf{1}_{n-r}-\mathbf{\Sigma}^{2}$, and denote by $(\cdot)^{+}$ the Moore-Penrose pseudo-inverse. If the function $\psi(z)=\widehat{\boldsymbol{\alpha}}^{\top}(z\mathbf{1}-\widehat{\mathbf{A}})^{-1}\widehat{\boldsymbol{\beta}}+C$ is the best approximation of $\phi(z)$, then:

  • If $\boldsymbol{\alpha}_{2}\neq\mathbf{0}$:

\begin{cases}\widehat{\boldsymbol{\beta}}=-\widehat{\mathbf{A}}\mathbf{A}_{12}(\boldsymbol{\beta}_{2}^{\top})^{+}\\ \widehat{\boldsymbol{\alpha}}=\widehat{\mathbf{A}}^{\top}\mathbf{R}\mathbf{A}_{12}(\boldsymbol{\alpha}_{2}^{\top})^{+}\\ \widehat{\mathbf{A}}(\mathbf{A}_{11}-\mathbf{A}_{12}(\boldsymbol{\beta}_{2}^{\top})^{+}\boldsymbol{\beta}_{1}^{\top})=\mathbf{1}\end{cases}  (30)

  • If $\boldsymbol{\alpha}_{2}=\mathbf{0}$:

\begin{cases}\widehat{\boldsymbol{\beta}}=(\mathbf{1}-\widehat{\mathbf{A}}\mathbf{A}_{11})(\boldsymbol{\beta}_{1}^{\top})^{+}\\ \widehat{\boldsymbol{\alpha}}=-(\mathbf{R}-\widehat{\mathbf{A}}^{\top}\mathbf{R}\mathbf{A}_{11})(\boldsymbol{\alpha}_{1}^{\top})^{+}\\ \widehat{\mathbf{A}}\mathbf{A}_{12}=\mathbf{0}\end{cases}  (31)
Proof of Theorem 3.4.

Since $e(z)=\phi(z)-\psi(z)$ and $\sigma_{k}^{-1}e(z)$ is unimodular, from Theorem 3.3 there exist two symmetric nonsingular matrices $\mathbf{P}_{e}$, $\mathbf{Q}_{e}$ satisfying the fixed point equations:

\mathbf{P}_{e}-\mathbf{A}_{e}\mathbf{P}_{e}\mathbf{A}_{e}^{\top}=\boldsymbol{\beta}_{e}\boldsymbol{\beta}_{e}^{\top}  (32)
\mathbf{Q}_{e}-\mathbf{A}_{e}^{\top}\mathbf{Q}_{e}\mathbf{A}_{e}=\boldsymbol{\alpha}_{e}\boldsymbol{\alpha}_{e}^{\top}  (33)

and such that $\mathbf{P}_{e}\mathbf{Q}_{e}=\sigma^{2}_{k}\mathbf{1}$. We can partition $\mathbf{P}_{e}$ and $\mathbf{Q}_{e}$ according to the definition of $\mathbf{A}_{e}$ (see Eq. 17):

\mathbf{P}_{e}=\begin{pmatrix}\mathbf{P}_{11}&\mathbf{P}_{12}\\ \mathbf{P}_{12}^{\top}&\mathbf{P}_{22}\end{pmatrix},\quad\mathbf{Q}_{e}=\begin{pmatrix}\mathbf{Q}_{11}&\mathbf{Q}_{12}\\ \mathbf{Q}_{12}^{\top}&\mathbf{Q}_{22}\end{pmatrix}.

From Equations 32 and 33, we note that $\mathbf{P}_{11}$ and $\mathbf{Q}_{11}$ correspond to the controllability and observability Gramians of $A$:

\mathbf{P}_{11}=\mathbf{Q}_{11}=\mathbf{P}=\begin{pmatrix}\mathbf{\Sigma}&\mathbf{0}\\ \mathbf{0}&\sigma_{k}\mathbf{1}\end{pmatrix}.

Moreover, since $\mathbf{P}_{e}\mathbf{Q}_{e}=\sigma_{k}^{2}\mathbf{1}$, we get $\mathbf{P}_{12}\mathbf{Q}_{12}^{\top}=\sigma_{k}^{2}\mathbf{1}-\mathbf{P}^{2}$. It follows that $\mathbf{P}_{12}\mathbf{Q}_{12}^{\top}$ has rank $n-r$. Without loss of generality we can set $\dim\widehat{\mathbf{A}}=j=n-r$, and choose an appropriate basis for the state space such that $\mathbf{P}_{12}=\begin{pmatrix}\mathbf{1}&\mathbf{0}\end{pmatrix}^{\top}$ and $\mathbf{Q}_{12}=\begin{pmatrix}\mathbf{R}&\mathbf{0}\end{pmatrix}^{\top}$, with $\mathbf{R}=\sigma_{k}^{2}\mathbf{1}-\mathbf{\Sigma}^{2}$. Once $\mathbf{P}_{12}$ and $\mathbf{Q}_{12}$ are fixed, the values of $\mathbf{P}_{22}$ and $\mathbf{Q}_{22}$ are automatically determined. We obtain:

\mathbf{P}_{e}=\begin{pmatrix}\mathbf{\Sigma}&\mathbf{0}&\mathbf{1}\\ \mathbf{0}&\sigma_{k}\mathbf{1}&\mathbf{0}\\ \mathbf{1}&\mathbf{0}&-\mathbf{\Sigma}\mathbf{R}^{-1}\end{pmatrix},\quad\mathbf{Q}_{e}=\begin{pmatrix}\mathbf{\Sigma}&\mathbf{0}&\mathbf{R}\\ \mathbf{0}&\sigma_{k}\mathbf{1}&\mathbf{0}\\ \mathbf{R}&\mathbf{0}&-\mathbf{\Sigma}\mathbf{R}\end{pmatrix}.  (34)

Now that we have an expression for the matrices $\mathbf{P}_{e}$ and $\mathbf{Q}_{e}$ of Theorem 3.3, we can rewrite the fixed point equations to derive the parameters $\widehat{\boldsymbol{\alpha}}$, $\widehat{\mathbf{A}}$ and $\widehat{\boldsymbol{\beta}}$. We obtain the following systems:

\begin{cases}\mathbf{P}-\mathbf{A}\mathbf{P}\mathbf{A}=\boldsymbol{\beta}\boldsymbol{\beta}^{\top}\\ \mathbf{N}-\mathbf{A}\mathbf{N}\widehat{\mathbf{A}}^{\top}=\boldsymbol{\beta}\widehat{\boldsymbol{\beta}}^{\top}\\ -\mathbf{\Sigma}\mathbf{R}^{-1}+\widehat{\mathbf{A}}\mathbf{\Sigma}\mathbf{R}^{-1}\widehat{\mathbf{A}}^{\top}=\widehat{\boldsymbol{\beta}}\widehat{\boldsymbol{\beta}}^{\top}\end{cases}\quad\begin{cases}\mathbf{P}-\mathbf{A}\mathbf{P}\mathbf{A}=\boldsymbol{\alpha}\boldsymbol{\alpha}^{\top}\\ \mathbf{M}-\mathbf{A}^{\top}\mathbf{M}\widehat{\mathbf{A}}=-\boldsymbol{\alpha}\widehat{\boldsymbol{\alpha}}^{\top}\\ -\mathbf{\Sigma}\mathbf{R}+\widehat{\mathbf{A}}^{\top}\mathbf{\Sigma}\mathbf{R}\widehat{\mathbf{A}}=\widehat{\boldsymbol{\alpha}}\widehat{\boldsymbol{\alpha}}^{\top}\end{cases}  (35)

where $\mathbf{N}=\begin{pmatrix}\mathbf{1}\\ \mathbf{0}\end{pmatrix}$ and $\mathbf{M}=\begin{pmatrix}\mathbf{R}\\ \mathbf{0}\end{pmatrix}$. We can rewrite the second equation of each system as follows:

\begin{cases}\mathbf{1}-\mathbf{A}_{11}\widehat{\mathbf{A}}^{\top}=\boldsymbol{\beta}_{1}\widehat{\boldsymbol{\beta}}^{\top}\\ -\mathbf{A}_{12}^{\top}\widehat{\mathbf{A}}^{\top}=\boldsymbol{\beta}_{2}\widehat{\boldsymbol{\beta}}^{\top}\end{cases}\quad\begin{cases}\mathbf{R}-\mathbf{A}_{11}\mathbf{R}\widehat{\mathbf{A}}=-\boldsymbol{\alpha}_{1}\widehat{\boldsymbol{\alpha}}^{\top}\\ \widehat{\mathbf{A}}^{\top}\mathbf{R}\mathbf{A}_{12}=\widehat{\boldsymbol{\alpha}}\boldsymbol{\alpha}_{2}^{\top}\end{cases}  (36)

If $\boldsymbol{\alpha}_{2}\neq\mathbf{0}$, then also $\boldsymbol{\beta}_{2}\neq\mathbf{0}$ (recall that $\boldsymbol{\alpha}_{i}=\operatorname{sgn}(\lambda_{i})\boldsymbol{\beta}_{i}$), and we have:

\begin{cases}\widehat{\boldsymbol{\beta}}=-\widehat{\mathbf{A}}\mathbf{A}_{12}(\boldsymbol{\beta}_{2}^{\top})^{+}\\ \widehat{\boldsymbol{\alpha}}=\widehat{\mathbf{A}}^{\top}\mathbf{R}\mathbf{A}_{12}(\boldsymbol{\alpha}_{2}^{\top})^{+}\\ \widehat{\mathbf{A}}(\mathbf{A}_{11}-\mathbf{A}_{12}(\boldsymbol{\beta}_{2}^{\top})^{+}\boldsymbol{\beta}_{1}^{\top})=\mathbf{1}\end{cases}  (37)

with $(\boldsymbol{\alpha}_{2}^{\top})^{+}=\frac{\boldsymbol{\alpha}_{2}}{\boldsymbol{\alpha}_{2}^{\top}\boldsymbol{\alpha}_{2}}$ and $(\boldsymbol{\beta}_{2}^{\top})^{+}=\frac{\boldsymbol{\beta}_{2}}{\boldsymbol{\beta}_{2}^{\top}\boldsymbol{\beta}_{2}}$.

If $\boldsymbol{\alpha}_{2}=\mathbf{0}$, we have $\widehat{\mathbf{A}}\mathbf{A}_{12}=\mathbf{0}$. We remark that $\widehat{\mathbf{A}}$ has size $(n-r)\times(n-r)$, while $\mathbf{A}_{12}$ is $(n-r)\times r$, so the system of equations corresponding to $\widehat{\mathbf{A}}\mathbf{A}_{12}=\mathbf{0}$ is underdetermined if $r<\frac{n}{2}$, in which case we can find an alternative set of solutions:

\begin{cases}\widehat{\boldsymbol{\beta}}=(\mathbf{1}-\widehat{\mathbf{A}}\mathbf{A}_{11})(\boldsymbol{\beta}_{1}^{\top})^{+}\\ \widehat{\boldsymbol{\alpha}}=-(\mathbf{R}-\widehat{\mathbf{A}}^{\top}\mathbf{R}\mathbf{A}_{11})(\boldsymbol{\alpha}_{1}^{\top})^{+}\\ \widehat{\mathbf{A}}\mathbf{A}_{12}=\mathbf{0}\end{cases}  (38)

with $\widehat{\mathbf{A}}\neq\mathbf{0}$. On the other hand, if $r\geq\frac{n}{2}$, i.e. if the multiplicity of the singular number $\sigma_{k}$ is more than half the size of the original WFA, the system might not have any solution unless $\widehat{\mathbf{A}}=\mathbf{0}$ (or unless $\mathbf{A}_{12}$ was zero to begin with). In this setting the proposed method returns $\widehat{\mathbf{A}}=\mathbf{0}$. An alternative in this case is to search for an approximation of size $k-1$ or $k+1$, so that $r<\frac{n}{2}$ and the system in Equation 38 is underdetermined. ∎
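The $\boldsymbol{\alpha}_{2}\neq\mathbf{0}$ branch translates directly into a few lines of linear algebra. Below is a minimal sketch (our naming, not the paper’s implementation), assuming NumPy, a WFA already in SVA form with the state space ordered so that the $\sigma_{k}$ block of Equation 28 comes last, and a simple singular number ($r=1$):

```python
import numpy as np

def solve_optimal_symbol(alpha, A, beta, sigma, k):
    """Sketch of the alpha_2 != 0 case of Theorem 3.4 (Equation 30).

    alpha, A, beta: WFA in SVA form, states permuted so the sigma_k block
    is last; sigma: singular numbers of H; k: target rank; r = 1 assumed.
    """
    n, r = A.shape[0], 1
    m = n - r
    Sigma = np.diag(np.delete(sigma, k))       # remaining singular numbers
    R = sigma[k] ** 2 * np.eye(m) - Sigma @ Sigma

    A11, A12 = A[:m, :m], A[:m, m:]
    beta1, beta2 = beta[:m], beta[m:]
    alpha2 = alpha[m:]

    beta2_p = beta2 / (beta2 @ beta2)          # (beta_2^T)^+
    alpha2_p = alpha2 / (alpha2 @ alpha2)      # (alpha_2^T)^+

    # Third equation of (30): A_hat (A11 - A12 (beta_2^T)^+ beta_1^T) = 1
    A_hat = np.linalg.inv(A11 - np.outer(A12 @ beta2_p, beta1))
    beta_hat = -A_hat @ A12 @ beta2_p          # first equation of (30)
    alpha_hat = A_hat.T @ R @ A12 @ alpha2_p   # second equation of (30)
    return alpha_hat, A_hat, beta_hat
```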

3.3.3 Extraction of the Rational Component

The function $\psi(z)=\widehat{\boldsymbol{\alpha}}^{\top}(z\mathbf{1}-\widehat{\mathbf{A}})^{-1}\widehat{\boldsymbol{\beta}}+C$ found in Theorem 3.4 corresponds to the solution of Theorem 2.9. To find the solution to the approximation problem we only need to “isolate” the function $g(z)\in\mathcal{R}_{k}$, i.e. the rational component. To do this, we study the position of the poles of $\psi(z)$, since the poles of a strictly proper rational function lie in the unit disc (Theorem 2.6). As noted before, we parametrized $\psi(z)$ so that its poles correspond to the eigenvalues of $\widehat{\mathbf{A}}$. After a change of basis (detailed in Paragraph 3.3.3.1), we can rewrite $\widehat{\mathbf{A}}$ in block-diagonal form:

\widehat{\mathbf{A}}=\begin{pmatrix}\widehat{\mathbf{A}}_{+}&\mathbf{0}\\ \mathbf{0}&\widehat{\mathbf{A}}_{-}\end{pmatrix}  (39)

where the modulus of the eigenvalues of $\widehat{\mathbf{A}}_{+}$ (resp. $\widehat{\mathbf{A}}_{-}$) is smaller (resp. greater) than one. We apply the same change of coordinates to $\widehat{\boldsymbol{\alpha}}$ and $\widehat{\boldsymbol{\beta}}$.

To conclude the study of the eigenvalues of $\widehat{\mathbf{A}}$, we need the following auxiliary result of Ostrowski [32]. A proof of this theorem can be found in [41].

Theorem 3.5 ([32]).

Let $|\Sigma|=1$, and let $\mathbf{P}$ be a solution to the fixed point equation $X-\mathbf{A}X\mathbf{A}^{\top}=\boldsymbol{\beta}\boldsymbol{\beta}^{\top}$ for the WFA $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$. If $A$ is reachable, then:

  • The number of eigenvalues $\lambda$ of $\mathbf{A}$ such that $|\lambda|<1$ is equal to the number of positive eigenvalues of $\mathbf{P}$.

  • The number of eigenvalues $\lambda$ of $\mathbf{A}$ such that $|\lambda|>1$ is equal to the number of negative eigenvalues of $\mathbf{P}$.

We can finally find the rational component of the function $\psi(z)$, i.e. the function $g(z)$ from Corollary 1 necessary to solve the approximate minimization problem.

Theorem 3.6.

Let $\widehat{\mathbf{A}}_{+},\widehat{\boldsymbol{\alpha}}_{+},\widehat{\boldsymbol{\beta}}_{+}$ be as in Theorem 3.4. The rational component of $\psi(z)$ is the function $g(z)=\widehat{\boldsymbol{\alpha}}_{+}^{\top}(z\mathbf{1}-\widehat{\mathbf{A}}_{+})^{-1}\widehat{\boldsymbol{\beta}}_{+}$.

Proof.

Clearly $\psi(z)=g(z)+l(z)$, with $l(z)=\widehat{\boldsymbol{\alpha}}_{-}^{\top}(z\mathbf{1}-\widehat{\mathbf{A}}_{-})^{-1}\widehat{\boldsymbol{\beta}}_{-}$, $l\in\mathcal{H}^{\infty}$. To conclude the proof we need to show that $g(z)$ has $k$ poles inside the unit disc, and therefore has rank $k$. We do this by studying the position of the eigenvalues of $\widehat{\mathbf{A}}_{+}$.

Since $E$ is minimal, by definition $\widehat{A}$ is reachable, so we can use Theorem 3.5 and solve the problem by directly examining the eigenvalues of $-\mathbf{\Sigma}\mathbf{R}$. From the proof of Theorem 3.4 we have $-\boldsymbol{\Sigma}\mathbf{R}=\boldsymbol{\Sigma}(\boldsymbol{\Sigma}^{2}-\sigma^{2}_{k}\mathbf{1})$, where $\boldsymbol{\Sigma}$ is the diagonal matrix having as elements the singular numbers of $H$ different from $\sigma_{k}$. It follows that $-\boldsymbol{\Sigma}\mathbf{R}$ has exactly $k$ strictly positive eigenvalues, and $\widehat{\mathbf{A}}$ has $k$ eigenvalues with modulus smaller than $1$. Thus, $\widehat{\mathbf{A}}_{+}$ has $k$ eigenvalues, corresponding to the poles of $g(z)$. ∎

3.3.3.1 Block Diagonalization.

In this paragraph we detail the technical steps necessary to rewrite $\widehat{\mathbf{A}}$ in block-diagonal form. The problem of computing the Jordan form of a matrix is ill-conditioned, hence it is not suitable for our algorithmic purposes. Instead, we compute the Schur decomposition, i.e. we find an orthogonal matrix $\mathbf{U}$ such that $\mathbf{U}^{\top}\widehat{\mathbf{A}}\mathbf{U}$ is upper triangular, with the eigenvalues of $\widehat{\mathbf{A}}$ on the diagonal. We obtain:

\mathbf{T}=\mathbf{U}^{\top}\widehat{\mathbf{A}}\mathbf{U}=\begin{pmatrix}\widehat{\mathbf{A}}_{+}&\widehat{\mathbf{A}}_{12}\\ \mathbf{0}&\widehat{\mathbf{A}}_{-}\end{pmatrix}  (40)

where the eigenvalues are arranged in increasing order of modulus, and the modulus of those in $\widehat{\mathbf{A}}_{+}$ (resp. $\widehat{\mathbf{A}}_{-}$) is smaller (resp. greater) than one. To transform this upper triangular matrix into a block-diagonal one, we use the following result.

Theorem 3.7 ([37]).

Let 𝐓\mathbf{T} be the matrix defined in Equation 40. The matrix 𝐗\mathbf{X} is a solution of the equation 𝐀^+𝐗𝐗𝐀^+𝐀^12=𝟎\widehat{\mathbf{A}}_{+}\mathbf{X}-\mathbf{X}\widehat{\mathbf{A}}_{-}+\widehat{\mathbf{A}}_{12}=\mathbf{0} if and only if the matrices

𝐌=(𝟏𝐗𝟎𝟏),and𝐌1=(𝟏𝐗𝟎𝟏)\mathbf{M}=\begin{pmatrix}\mathbf{1}&\mathbf{X}\\ \mathbf{0}&\mathbf{1}\end{pmatrix},\quad\text{and}\quad\mathbf{M}^{-1}=\begin{pmatrix}\mathbf{1}&-\mathbf{X}\\ \mathbf{0}&\mathbf{1}\end{pmatrix} (41)

satisfy:

\mathbf{M}^{-1}\mathbf{T}\mathbf{M}=\begin{pmatrix}\widehat{\mathbf{A}}_{+}&\mathbf{0}\\ \mathbf{0}&\widehat{\mathbf{A}}_{-}\end{pmatrix}. (42)

Setting 𝚪=(𝟏k𝟎)\boldsymbol{\Gamma}=\begin{pmatrix}\mathbf{1}_{k}&\mathbf{0}\end{pmatrix} we can now derive the rational component of the WFA:

𝐀^+=𝚪𝐌1𝐔𝐀^𝐔𝐌𝚪\displaystyle\widehat{\mathbf{A}}_{+}=\boldsymbol{\Gamma}\mathbf{M}^{-1}\mathbf{U}^{\top}\widehat{\mathbf{A}}\mathbf{U}\mathbf{M}\boldsymbol{\Gamma}^{\top} (43)
𝜶^+=𝚪𝐌𝐔𝜶^\displaystyle\widehat{\boldsymbol{\alpha}}_{+}=\boldsymbol{\Gamma}\mathbf{M}^{\top}\mathbf{U}^{\top}\widehat{\boldsymbol{\alpha}} (44)
𝜷^+=𝚪𝐌1𝐔𝜷^.\displaystyle\widehat{\boldsymbol{\beta}}_{+}=\boldsymbol{\Gamma}\mathbf{M}^{-1}\mathbf{U}^{\top}\widehat{\boldsymbol{\beta}}. (45)

3.3.4 Solution to the Approximation Problem

In the previous sections, we derived the rational function g(z), the rational component of the symbol of G, the operator that solves the approximation problem of Theorem 2.8. To find the solution to the approximation problem we only need to find the parameters of \widehat{A}_{k}, the optimal approximating WFA. These are directly revealed by the expression of g(z), thanks to the way we parametrized the functions.

Theorem 3.8.

Let A=𝛂,𝐀,𝛃A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle be a minimal WFA with nn states over a one-letter alphabet. Let AA be in its SVA form. The optimal spectral-norm approximation of rank kk is given by the WFA A^k=𝛂^+,𝐀^+,𝛃^+\widehat{A}_{k}=\langle\widehat{\boldsymbol{\alpha}}_{+},\widehat{\mathbf{A}}_{+},\widehat{\boldsymbol{\beta}}_{+}\rangle.

Proof.

From Corollary 1 we know that g(z)g(z) is the rational function associated with the Hankel matrix of the best approximation. Given the correspondence between the Fourier coefficients of g(z)g(z) and the entries of the matrix (Remark 2), we have:

g(z)=\widehat{\boldsymbol{\alpha}}_{+}^{\top}(z\mathbf{1}-\widehat{\mathbf{A}}_{+})^{-1}\widehat{\boldsymbol{\beta}}_{+}=\sum_{j\geq 0}\widehat{\boldsymbol{\alpha}}_{+}^{\top}\widehat{\mathbf{A}}_{+}^{j}\widehat{\boldsymbol{\beta}}_{+}z^{-j-1}=\sum_{j\geq 0}\bar{f}(j)z^{-j-1} (46)

where \bar{f}:\Sigma^{*}\rightarrow\mathbb{R} is the function computed by \widehat{A}_{k}, whose parameters are \widehat{\boldsymbol{\alpha}}_{+},\widehat{\mathbf{A}}_{+},\widehat{\boldsymbol{\beta}}_{+}. ∎
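The expansion in Equation 46 is easy to check numerically at any point outside the spectrum of \widehat{\mathbf{A}}_{+}; a minimal sketch with hypothetical parameters:

```python
import numpy as np

# Check of the expansion in Equation 46 at a point z with |z| > rho(A),
# using hypothetical WFA parameters.
alpha = np.array([1.0, -0.5])
A = np.array([[0.6, 0.1],
              [0.0, 0.4]])
beta = np.array([0.7, 0.2])
z = 2.0

resolvent = alpha @ np.linalg.inv(z * np.eye(2) - A) @ beta
series = sum(alpha @ np.linalg.matrix_power(A, j) @ beta * z**(-j - 1)
             for j in range(200))
print(np.isclose(resolvent, series))  # True
```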

3.4 Error Analysis

The theoretical foundations of AAK theory guarantee that the construction detailed in Section 3.3 produces the rank kk optimal spectral-norm approximation of a WFA satisfying our assumptions, and the singular number σk\sigma_{k} provides the exact error.

Similarly to the case of SVA truncation [11], due to the ordering of the singular numbers, the error decreases as k increases, meaning that allowing \widehat{A}_{k} more states guarantees a better approximation of A. We remark that, while the solution we propose is optimal in the spectral norm, the same is not necessarily true in other norms. Nonetheless, we have the following bound between the \ell^{2} norm and the spectral norm.

Theorem 3.9.

Let A be a minimal WFA computing f:\Sigma^{*}\rightarrow\mathbb{R}, with Hankel matrix \mathbf{H}. Let \widehat{A}_{k} be its optimal spectral-norm approximation, computing g:\Sigma^{*}\rightarrow\mathbb{R}, with Hankel matrix \mathbf{G}. Then:

fg2𝐇𝐆=σk.\|f-g\|_{\ell^{2}}\leq\|\mathbf{H}-\mathbf{G}\|=\sigma_{k}. (47)
Proof.

Let 𝐞0=(10)\mathbf{e}_{0}=\begin{pmatrix}1&0&\cdots\end{pmatrix}^{\top}, f:Σf:\Sigma^{*}\rightarrow\mathbb{R}, g:Σg:\Sigma^{*}\rightarrow\mathbb{R} with Hankel matrices 𝐇\mathbf{H} and 𝐆\mathbf{G}, respectively. We have:

fg2=(n=0|fngn|2)1/2=(𝐇𝐆)𝐞02sup𝐱2=1(𝐇𝐆)𝐱2=𝐇𝐆=σk\|f-g\|_{\ell^{2}}=\left(\sum_{n=0}^{\infty}|f_{n}-g_{n}|^{2}\right)^{1/2}=\|(\mathbf{H}-\mathbf{G})\mathbf{e}_{0}\|_{\ell^{2}}\leq\sup_{\|\mathbf{x}\|_{\ell^{2}}=1}\|(\mathbf{H}-\mathbf{G})\mathbf{x}\|_{\ell^{2}}=\|\mathbf{H}-\mathbf{G}\|=\sigma_{k}

where the second equality follows by observing that the first column of \mathbf{H}-\mathbf{G} contains exactly the values f_{n}-g_{n}, n\geq 0. ∎
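The same chain of inequalities can be observed on finite truncations; a minimal numerical sketch with two hypothetical functions over a one-letter alphabet:

```python
import numpy as np

# Finite-truncation check of Theorem 3.9's inequality, with hypothetical
# functions f(n) = 0.8^n and g(n) = 0.5^n over a one-letter alphabet.
N = 100
seq_f = 0.8 ** np.arange(2 * N)
seq_g = 0.5 ** np.arange(2 * N)

def hankel(seq, m):
    # Truncated Hankel matrix H(i, j) = seq(i + j), 0 <= i, j < m
    return np.array([[seq[i + j] for j in range(m)] for i in range(m)])

D = hankel(seq_f, N) - hankel(seq_g, N)
l2 = np.linalg.norm(seq_f[:N] - seq_g[:N])  # = ||D e_0||, first column of D
spectral = np.linalg.norm(D, 2)
print(l2 <= spectral)  # True
```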

4 Algorithm

In this section we describe the algorithm for spectral-norm approximate minimization. The algorithm takes as input a minimal WFA A with n states, in SVA form, with \rho(\mathbf{A})<1 and \boldsymbol{\alpha}_{2}\neq 0, together with its Gramian \mathbf{P} and a target number of states k<n. In the case \boldsymbol{\alpha}_{2}=0, it is enough to substitute Steps 4, 5 and 6 with their analogues from Equation 31. The constraints that A be minimal and in SVA form are not essential: a WFA with n states can be minimized in time O(n^{3}) [14], and its SVA computed in O(n^{3}) [11].

Using the results of Theorem 3.4, we outline in Algorithm 1, AAKapproximation, the steps necessary to extract the best spectral-norm approximation of a WFA.

input : A minimal WFA AA, with 𝜶20\boldsymbol{\alpha}_{2}\neq 0, nn states and in SVA form,
its Gramian 𝐏\mathbf{P}, a target number of states k<nk<n
output : A WFA A^k\widehat{A}_{k} with kk states
1 Let 𝜶1,𝜶2,𝜷1,𝜷2,𝐀11,𝐀12,𝐀22,𝚺\boldsymbol{\alpha}_{1},\boldsymbol{\alpha}_{2},\boldsymbol{\beta}_{1},\boldsymbol{\beta}_{2},\mathbf{A}_{11},\mathbf{A}_{12},\mathbf{A}_{22},\mathbf{\Sigma} be the blocks defined in Eq. 28
2 Let (𝜶2)+=𝜶2𝜶2𝜶2(\boldsymbol{\alpha}_{2}^{\top})^{+}=\frac{\boldsymbol{\alpha}_{2}}{\boldsymbol{\alpha}_{2}^{\top}\boldsymbol{\alpha}_{2}}, (𝜷2)+=𝜷2𝜷2𝜷2(\boldsymbol{\beta}_{2}^{\top})^{+}=\frac{\boldsymbol{\beta}_{2}}{\boldsymbol{\beta}_{2}^{\top}\boldsymbol{\beta}_{2}}
3 Let 𝐑=σk2𝟏𝚺2\mathbf{R}=\sigma_{k}^{2}\mathbf{1}-\mathbf{\Sigma}^{2}
4 Let 𝐀^=(𝐀11𝐀12(𝜷2)+𝜷1)1\widehat{\mathbf{A}}=(\mathbf{A}_{11}-\mathbf{A}_{12}(\boldsymbol{\beta}_{2}^{\top})^{+}\boldsymbol{\beta}_{1}^{\top})^{-1}
5 Let 𝜶^=𝐀^𝐑𝐀12(𝜶2)+\widehat{\boldsymbol{\alpha}}=\widehat{\mathbf{A}}^{\top}\mathbf{R}\mathbf{A}_{12}(\boldsymbol{\alpha}_{2}^{\top})^{+}
6 Let 𝜷^=𝐀^𝐀12(𝜷2)+\widehat{\boldsymbol{\beta}}=-\widehat{\mathbf{A}}\mathbf{A}_{12}(\boldsymbol{\beta}_{2}^{\top})^{+}
7 Let A^=𝜶^,𝐀^,𝜷^\widehat{A}=\langle\widehat{\boldsymbol{\alpha}},\widehat{\mathbf{A}},\widehat{\boldsymbol{\beta}}\rangle
8 Let A^k\widehat{A}_{k}\leftarrow BlockDiagonalize(A^\widehat{A})
return A^k\widehat{A}_{k}
Algorithm 1 AAKapproximation
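For concreteness, the steps above admit a direct NumPy transcription. The following sketch assumes the blocks of Equation 28 are available as arrays (the signature is hypothetical), and folds Steps 7 and 8 into a call to the block_diagonalize sketch given after Algorithm 2 below:

```python
import numpy as np

def aak_approximation(alpha1, alpha2, beta1, beta2, A11, A12, Sigma, sigma_k):
    # Step 2: pseudo-inverses of the row vectors alpha_2^T and beta_2^T
    alpha2_pinv = alpha2 / (alpha2 @ alpha2)
    beta2_pinv = beta2 / (beta2 @ beta2)
    # Step 3: R = sigma_k^2 I - Sigma^2
    R = sigma_k ** 2 * np.eye(Sigma.shape[0]) - Sigma @ Sigma
    # Step 4: A_hat = (A11 - A12 (beta_2^T)^+ beta_1^T)^{-1}
    A_hat = np.linalg.inv(A11 - np.outer(A12 @ beta2_pinv, beta1))
    # Step 5: alpha_hat = A_hat^T R A12 (alpha_2^T)^+
    alpha_hat = A_hat.T @ R @ (A12 @ alpha2_pinv)
    # Step 6: beta_hat = -A_hat A12 (beta_2^T)^+
    beta_hat = -A_hat @ (A12 @ beta2_pinv)
    # Steps 7-8: extract the rational component (see Algorithm 2 below)
    return block_diagonalize(A_hat, alpha_hat, beta_hat)
```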

The algorithm involves a call to Algorithm 2, BlockDiagonalize. In particular, this corresponds to the steps, outlined in Paragraph 3.3.3.1, necessary to derive the WFA \widehat{A}_{k} corresponding to the rational function g(z). We remark that Step 2 in BlockDiagonalize can be performed using the Bartels-Stewart algorithm [13].

input : A WFA A^\widehat{A}
output : A WFA \widehat{A}_{k} with \rho<1
1 Compute the Schur decomposition of 𝐀^=𝐔𝐓𝐔\widehat{\mathbf{A}}=\mathbf{U}\mathbf{T}\mathbf{U}^{\top}, where |T11||T22||T_{11}|\leq|T_{22}|\leq\dots
2 Solve 𝐀^11𝐗𝐗𝐀^22+𝐀^12=𝟎\widehat{\mathbf{A}}_{11}\mathbf{X}-\mathbf{X}\widehat{\mathbf{A}}_{22}+\widehat{\mathbf{A}}_{12}=\mathbf{0} for 𝐗\mathbf{X}
3 Let 𝐌=(𝟏𝐗𝟎𝟏)\mathbf{M}=\begin{pmatrix}\mathbf{1}&\mathbf{X}\\ \mathbf{0}&\mathbf{1}\end{pmatrix} and 𝐌1=(𝟏𝐗𝟎𝟏)\mathbf{M}^{-1}=\begin{pmatrix}\mathbf{1}&-\mathbf{X}\\ \mathbf{0}&\mathbf{1}\end{pmatrix}
4 Let 𝚪=(𝟏k𝟎)\boldsymbol{\Gamma}=\begin{pmatrix}\mathbf{1}_{k}&\mathbf{0}\end{pmatrix}
5 Let 𝐀^+=𝚪𝐌1𝐔𝐀^𝐔𝐌𝚪\widehat{\mathbf{A}}_{+}=\boldsymbol{\Gamma}\mathbf{M}^{-1}\mathbf{U}^{\top}\widehat{\mathbf{A}}\mathbf{U}\mathbf{M}\boldsymbol{\Gamma}^{\top}
6 Let 𝜶^+=𝚪𝐌𝐔𝜶^\widehat{\boldsymbol{\alpha}}_{+}=\boldsymbol{\Gamma}\mathbf{M}^{\top}\mathbf{U}^{\top}\widehat{\boldsymbol{\alpha}}
7 Let 𝜷^+=𝚪𝐌1𝐔𝜷^\widehat{\boldsymbol{\beta}}_{+}=\boldsymbol{\Gamma}\mathbf{M}^{-1}\mathbf{U}^{\top}\widehat{\boldsymbol{\beta}}
8 Let A^k=𝜶^+,𝐀^+,𝜷^+\widehat{A}_{k}=\langle\widehat{\boldsymbol{\alpha}}_{+},\widehat{\mathbf{A}}_{+},\widehat{\boldsymbol{\beta}}_{+}\rangle
return A^k\widehat{A}_{k}
Algorithm 2 BlockDiagonalize
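A possible SciPy transcription of Algorithm 2 is sketched below: scipy.linalg.schur with sort='iuc' computes the ordered Schur form of Step 1, and scipy.linalg.solve_sylvester implements the Bartels-Stewart algorithm of Step 2. We assume no complex-conjugate pair of eigenvalues straddles the cut, which holds here since \widehat{\mathbf{A}} has no eigenvalues of modulus one:

```python
import numpy as np
from scipy.linalg import schur, solve_sylvester

def block_diagonalize(A_hat, alpha_hat, beta_hat):
    # Step 1: ordered real Schur form; 'iuc' places the eigenvalues strictly
    # inside the unit circle in the leading block, and k counts them.
    T, U, k = schur(A_hat, output='real', sort='iuc')
    A11, A12, A22 = T[:k, :k], T[:k, k:], T[k:, k:]
    # Step 2: solve A11 X - X A22 + A12 = 0; solve_sylvester solves
    # AX + XB = Q, so we pass B = -A22 and Q = -A12.
    X = solve_sylvester(A11, -A22, -A12)
    # Step 3: M and its inverse differ only in the sign of the X block
    n = A_hat.shape[0]
    M = np.eye(n)
    M[:k, k:] = X
    M_inv = np.eye(n)
    M_inv[:k, k:] = -X
    # Steps 4-7: Gamma = (1_k 0) selects the first k coordinates
    A_plus = (M_inv @ T @ M)[:k, :k]          # equals A11 up to round-off
    alpha_plus = (M.T @ (U.T @ alpha_hat))[:k]
    beta_plus = (M_inv @ (U.T @ beta_hat))[:k]
    return alpha_plus, A_plus, beta_plus      # Step 8: the WFA A_k
```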

To analyze the computational cost we recall the following facts [39]:

  • The product of two n×nn\times n matrices can be computed in time O(n3)O(n^{3}) using a standard iterative algorithm, but can be reduced to O(nω)O(n^{\omega}) with ω<2.4\omega<2.4.

  • The inversion of an n\times n matrix can be computed in time O(n^{3}) using Gauss-Jordan elimination, but can be reduced to O(n^{\omega}) with \omega<2.4.

  • The Schur decomposition of an n\times n matrix can be computed with a two-step algorithm, where each step takes O(n^{3}), using the Hessenberg form of the matrix.

  • The Bartels-Stewart algorithm applied to upper triangular matrices to find an m\times n matrix takes O(mn^{2}+nm^{2}).

The running time of BlockDiagonalize on an input WFA \widehat{A} with (n-r) states is thus in O((n-r)^{3}), where r is the multiplicity of the singular value considered. Consequently, the running time of AAKapproximation on an input WFA A with n states is in O((n-r)^{3}).

5 Related Work

The study of approximate minimization for WFAs is very recent, and only a few works have been published on the subject. In [10, 11] the authors present an approximate minimization technique using a canonical expression for WFAs, and provide bounds on the error in the 2\ell^{2} norm. The result is supported by strong theoretical guarantees, but it is not optimal in any norm. An extension of this method to the case of Weighted Tree Automata can be found in [12]. A similar problem is addressed in [25], with less general results. In [26], the authors connect spectral learning to the approximate minimization problem of a small class of Hidden Markov models, bounding the error in terms of the total variation distance.

The control theory community has extensively studied approximate minimization in the context of linear time-invariant systems, and several methods have been proposed [3]. A parallel can be drawn between those results and ours, by noting that the impulse response of a discrete time-invariant Single-Input-Single-Output (SISO) system can be parametrized as a WFA over a one-letter alphabet. In [20] Glover presents a state-space solution for the case of continuous Multi-Input-Multi-Output (MIMO) systems. Glover's method led to a widespread application of these results, thanks to its computational and theoretical simplicity. This simplicity stems from the structure of the Lyapunov equations for continuous systems. The same is not true for discrete control systems, where the Lyapunov equations have a quadratic form. As noted in [16], there is no simple closed-form formula for the state-space solution of a discrete system. Thus, most of the results for the discrete case work with a suboptimal version of the problem, with restrictions on the multiplicity of the singular values [6, 2, 23]. A solution for the SISO case can be found, without additional assumptions, using a polynomial approach, but it does not provide an explicit representation of the state space, nor does it generalize to the MIMO setting. The first to actually extend Glover's results to the discrete case was Gu, who provides an elegant solution for the MIMO discrete problem [21]. Glover's and Gu's solutions rely on embedding the initial system into an extension of it, the all-pass system, equivalent to the WFA E in our method. Part of our contribution is the adaptation of some of the control theory tools to our setting.

6 Conclusion

In this paper we applied the AAK theory for Hankel operators and complex functions to the framework of WFAs in order to construct the best possible approximation to an automaton given a bound on the size. We provide an algorithm to find the parameters of the best WFA approximation in the spectral norm, and bounds on the error. Our method applies to real WFAs A=𝜶,𝐀,𝜷A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle, defined over a one-letter alphabet, with ρ(𝐀)<1\rho(\mathbf{A})<1. While this setting is certainly restricted, we believe that this work constitutes a first fundamental step towards optimal approximation. Furthermore, the use of AAK techniques has proven to be very fruitful in related areas like control theory; we think that automata theory can also benefit from it. The use of such methods can help deepen the understanding of the behaviour of rational functions. This paper highlights and strengthens the interesting connections between functional analysis, automata theory and control theory, unifying tools from different domains in one formalism.

A compelling direction for future work is to extend our results to the multi-letter case. The work of Adamyan, Arov and Krein provides us with a powerful theory connecting sequences to the study of complex functions. We note that, unfortunately, this approach cannot be directly generalized to the multi-letter case because of the non-commutative nature of the monoid considered. Extending this work would require adapting standard harmonic analysis results to the non-abelian case. A recent line of work in functional analysis is centered around extending this theory to the case of non-commutative operators, and in [34] a non-commutative version of the AAK theorem is presented. However, those results are non-constructive, making this direction, already challenging, even harder to pursue.

Acknowledgments

This research has been supported by NSERC Canada (C. Lacroce, P. Panangaden, D. Precup) and Canada CIFAR AI chairs program (Guillaume Rabusseau). The authors would like to thank Tianyu Li, Harsh Satija and Alessandro Sordoni for feedback on earlier drafts of this work, Gheorghe Comanici for a detailed review, and Maxime Wabartha for fruitful discussions and for comments on the proofs.

References

  • [1] Vadim M. Adamyan, Damir Zyamovich Arov, and Mark Grigorievich Krein. Analytic Properties of Schmidt Pairs for a Hankel Operator and the Generalized Schur–Takagi Problem. Mathematics of the USSR-Sbornik, 15:31–73, 1971.
  • [2] M.M. Al-Hussari, I.M. Jaimoukha, and D.J.N. Limebeer. A Descriptor Approach for the Solution of the One-Block Distance Problem. In Proceedings of the IFAC World Congress, 1993.
  • [3] Athanasios C. Antoulas. Approximation of Large-Scale Dynamical Systems. SIAM, 2005.
  • [4] Stéphane Ayache, Rémi Eyraud, and Noé Goudian. Explaining Black Boxes on Sequential Data Using Weighted Automata. In Proceedings of the 14th International Conference on Grammatical Inference, ICGI 2018, Wrocław, Poland, September 5-7, 2018, volume 93 of Proceedings of Machine Learning Research, pages 81–103. PMLR, 2018. URL: http://proceedings.mlr.press/v93/ayache19a.html.
  • [5] Raphaël Bailly, François Denis, and Liva Ralaivola. Grammatical Inference as a Principal Component Analysis Problem. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 33–40, New York, NY, USA, 2009. Association for Computing Machinery. doi:10.1145/1553374.1553379.
  • [6] Joseph A. Ball and Andre C.M. Ran. Optimal Hankel Norm Model Reductions and Wiener–Hopf Factorization I: The Canonical Case. SIAM Journal on Control and Optimization, 25(2):362–382, 1987.
  • [7] Borja Balle, Xavier Carreras, Franco M. Luque, and Ariadna Quattoni. Spectral Learning of Weighted Automata - A Forward–Backward Perspective. Mach. Learn., 96(1-2):33–63, 2014. doi:10.1007/s10994-013-5416-x.
  • [8] Borja Balle, Pascale Gourdeau, and Prakash Panangaden. Bisimulation Metrics and Norms for Weighted Finite Automata. In Ioannis Chatzigiannakis, Piotr Indyk, Fabian Kuhn, and Anca Muscholl, editors, 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, July 10-14, 2017, Warsaw, Poland, volume 80 of LIPIcs, pages 103:1–103:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPIcs.ICALP.2017.103.
  • [9] Borja Balle, William L. Hamilton, and Joelle Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 1386–1394. JMLR.org, 2014. URL: http://proceedings.mlr.press/v32/balle14.html.
  • [10] Borja Balle, Prakash Panangaden, and Doina Precup. A Canonical Form for Weighted Automata and Applications to Approximate Minimization. In 30th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2015, Kyoto, Japan, July 6-10, 2015, pages 701–712. IEEE Computer Society, 2015. doi:10.1109/LICS.2015.70.
  • [11] Borja Balle, Prakash Panangaden, and Doina Precup. Singular Value Automata and Approximate Minimization. Math. Struct. Comput. Sci., 29(9):1444–1478, 2019. doi:10.1017/S0960129519000094.
  • [12] Borja Balle and Guillaume Rabusseau. Approximate minimization of weighted tree automata. Information and Computation, page 104654, 2020.
  • [13] R. H. Bartels and G. W. Stewart. Solution of the matrix equation AX + XB = C [F4]. Commun. ACM, 15(9):820–826, 1972.
  • [14] Jean Berstel and Christophe Reutenauer. Noncommutative rational series with applications, volume 137. Cambridge University Press, 2011.
  • [15] J.W. Carlyle and A. Paz. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 5(1):26–40, 1971.
  • [16] Charles K. Chui and Guanrong Chen. Discrete HH^{\infty} Optimization With Applications in Signal Processing and Control Systems. Springer-Verlag, 1997.
  • [17] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936. doi:10.1007/BF02288367.
  • [18] Rémi Eyraud and Stéphane Ayache. Distillation of Weighted Automata from Recurrent Neural Networks Using a Spectral Approach. CoRR, abs/2009.13101, 2020. URL: https://arxiv.org/abs/2009.13101, arXiv:2009.13101.
  • [19] Michel Fliess. Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 53:197–222, 1974.
  • [20] Keith Glover. All Optimal Hankel-Norm Approximations of Linear Multivariable Systems and their \mathcal{L}^{\infty}-Error Bounds. International Journal of Control, 39(6):1115–1193, 1984. doi:10.1080/00207178408933239.
  • [21] Guoxiang Gu. All Optimal Hankel-Norm Approximations and their Error Bounds in Discrete-Time. International Journal of Control, 78(6):408–423, 2005. doi:10.1080/00207170500110988.
  • [22] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. J. Comput. Syst. Sci., 78(5):1460–1480, September 2012. doi:10.1016/j.jcss.2011.12.025.
  • [23] Vlad Ionescu and Cristian Oara. The four-block Adamjan-Arov-Krein problem for discrete-time systems. In Linear Algebra and its Applications, pages 95–119. Elsevier, 2001.
  • [24] L. Kronecker. Zur Theorie der Elimination einer Variablen aus zwei algebraischen Gleichungen. Monatsber. Königl. Preussischen Akad. Wiss., pages 535–600, 1881.
  • [25] Alex Kulesza, Nan Jiang, and Satinder Singh. Low-Rank Spectral Learning with Weighted Loss Functions. In Artificial Intelligence and Statistics, pages 517–525. PMLR, 2015.
  • [26] Alex Kulesza, N. Raj Rao, and Satinder Singh. Low-Rank Spectral Learning. In Samuel Kaski and Jukka Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 522–530, Reykjavik, Iceland, 22–25 April 2014. PMLR.
  • [27] Aleksandr M. Lyapunov. The General Problem of the Stability of Motion [in Russian]. Gostekhizdat, Moscow, 1950.
  • [28] Jean Meinguet. A Simplified Presentation of the Adamjan-Arov-Krein Approximation Theory. In H. Werner, L. Wuytack, E. Ng, and H. J. Bünger, editors, Computational Aspects of Complex Analysis, pages 217–248, Dordrecht, 1983. Springer Netherlands. doi:10.1007/978-94-009-7121-9_9.
  • [29] Zeev Nehari. On Bounded Bilinear Forms. Annals of Mathematics, 65(1):153–162, 1957.
  • [30] Nikolai K. Nikol'skii. Operators, Functions and Systems: An Easy Reading, volume 92 of Mathematical Surveys and Monographs. American Mathematical Society, 2002.
  • [31] Takamasa Okudono, Masaki Waga, Taro Sekiyama, and Ichiro Hasuo. Weighted Automata Extraction from Recurrent Neural Networks via Regression on State Spaces. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 5306–5314. AAAI Press, 2020. URL: https://aaai.org/ojs/index.php/AAAI/article/view/5977.
  • [32] Alexander Ostrowski and Hans Schneider. Some Theorems on the Inertia of General Matrices. J. Math. Anal. Appl, 4(1):72–84, 1962.
  • [33] Vladimir Peller. Hankel Operators and their Applications. Springer Science & Business Media, 2012.
  • [34] Gelu Popescu. Multivariable Nehari Problem and Interpolation. Journal of Functional Analysis, 200:536–581, 2003. doi:10.1016/S0022-1236(03)00078-8.
  • [35] Guillaume Rabusseau, Borja Balle, and Joelle Pineau. Multitask Spectral Learning of Weighted Automata. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 2585–2594, 2017.
  • [36] Guillaume Rabusseau, Tianyu Li, and Doina Precup. Connecting Weighted Automata and Recurrent Neural Networks through Spectral Learning. In Kamalika Chaudhuri and Masashi Sugiyama, editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 1630–1639. PMLR, 2019. URL: http://proceedings.mlr.press/v89/rabusseau19a.html.
  • [37] William E. Roth. The equations AX - YB = C and AX - XB = C in matrices. Proceedings of the American Mathematical Society, 3(3):392–396, 1952.
  • [38] B. De Schutter. Minimal State-Space Realization in Linear System Theory: an Overview. Journal of Computational and Applied Mathematics, 121(1):331–354, 2000. doi:10.1016/S0377-0427(00)00341-1.
  • [39] Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra, volume 50. SIAM, 1997.
  • [40] Gail Weiss, Yoav Goldberg, and Eran Yahav. Learning Deterministic Weighted Automata with Queries and Counterexamples. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8558–8569, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/d3f93e7766e8e1b7ef66dfdd9a8be93b-Abstract.html.
  • [41] H.K. Wimmer. On the Ostrowski-Schneider Inertia Theorem. Journal of Mathematical Analysis and Applications, 41(1):164–169, 1973. doi:10.1016/0022-247X(73)90190-X.
  • [42] Kehe Zhu. Operator Theory in Function Spaces, volume 138. American Mathematical Society, 1990.

Appendix A Hankel Operators

For more details on the content of this section we refer the reader to [30]. We first recall the definition of a Hankel operator acting on \ell^{2}.

Definition A.1.

A Hankel operator is a mapping H:22H:\ell^{2}\rightarrow\ell^{2} with matrix 𝐇={αj+k}j,k0\mathbf{H}=\{\alpha_{j+k}\}_{j,k\geq 0}. In other words, given a={an}n02a=\{a_{n}\}_{n\geq 0}\in\ell^{2}, we have H(a)=bH(a)=b, where b={bn}n0b=\{b_{n}\}_{n\geq 0} is defined by:

bk=j0αj+kaj,k0.b_{k}=\sum_{j\geq 0}\alpha_{j+k}a_{j},\quad k\geq 0. (48)

This property of the Hankel matrix can be rephrased as an operator identity. Defining the shift operator by S(x_{0},x_{1},\dots)=(0,x_{0},x_{1},\dots) and denoting by S^{*} its left inverse, S^{*}(y_{0},y_{1},\dots)=(y_{1},y_{2},\dots), we have that H is a Hankel operator if and only if:

HS=SH.HS=S^{*}H. (49)
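The identity can be checked on finite sections of the matrices; a minimal sketch with a hypothetical summable sequence:

```python
import numpy as np

# Finite-section check of Equation 49 with the hypothetical sequence
# alpha_n = 0.5^n.
N = 50
a = 0.5 ** np.arange(2 * N)
H = np.array([[a[j + k] for k in range(N)] for j in range(N)])

S = np.eye(N, k=-1)       # right shift: (Sx)_0 = 0, (Sx)_n = x_{n-1}
S_star = np.eye(N, k=1)   # its left inverse

# On an N x N section the identity holds away from the last row and
# column, where the truncation cuts off entries of the infinite matrix.
print(np.allclose((H @ S)[:-1, :-1], (S_star @ H)[:-1, :-1]))  # True
```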

The correspondence between Definition A.1 and Definition 2.7 can be easily made explicit. First, we note that using the isomorphism ϕ(ϕ^(n))n\phi\mapsto(\widehat{\phi}(n))_{n\in\mathbb{Z}}, introduced with Fourier series, we can identify 2\mathcal{H}^{2} with 2\ell^{2}. Moreover, the operator of multiplication by zz acts as right shift SS on the space of Fourier coefficients, in fact (zϕ)^(n)=ϕ^(n1)\widehat{(z\phi)}(n)=\widehat{\phi}(n-1). Analogously, the left inverse SS^{*} corresponds to the truncated multiplication operator. Now, let 1={zk}k0\mathcal{B}_{1}=\{z^{k}\}_{k\geq 0} and 2={zj}j<0\mathcal{B}_{2}=\{z^{j}\}_{j<0} be bases for 2\mathcal{H}^{2} and 2\mathcal{H}^{2}_{-}, respectively, and let

Izn=zn1Iz^{n}=z^{-n-1} (50)

be the involution on 2(𝕋)\mathcal{L}^{2}(\mathbb{T}). Note that I2=2I\mathcal{H}^{2}=\mathcal{H}^{2}_{-}.

Let H¯:22\overline{H}:\ell^{2}\rightarrow\ell^{2} be a Hankel operator with matrix 𝐇¯\overline{\mathbf{H}}. Using the bases 1\mathcal{B}_{1}, 2\mathcal{B}_{2} and the Fourier identification map, we can obtain an operator acting between Hardy spaces. Following this interpretation, the operator H=IH¯:22H=I\overline{H}:\mathcal{H}^{2}\rightarrow\mathcal{H}^{2}_{-} has matrix 𝐇¯\overline{\mathbf{H}} with respect to 1\mathcal{B}_{1}, 2\mathcal{B}_{2}, and satisfies:

HS=SH.HS=\mathbb{P}_{-}SH. (51)

In particular, H¯S=SH¯\overline{H}S=S^{*}\overline{H} if and only if HS=SHHS=\mathbb{P}_{-}SH. It is now easy to see that the characterization of the Hankel operator given in Definition 2.7 satisfies Equation 51.

The following theorem, due to Nehari [29], is of great importance as it highlights a correspondence between bounded Hankel operators and functions in (𝕋)\mathcal{L}^{\infty}(\mathbb{T}).

Theorem A.1 ([29]).

A Hankel operator H:\ell^{2}\rightarrow\ell^{2} with matrix \mathbf{H}=\{\alpha_{j+k}\}_{j,k\geq 0} is bounded on \ell^{2} if and only if there exists a function \psi\in\mathcal{L}^{\infty}(\mathbb{T}) such that

αm=ψ^(m),m0.\alpha_{m}=\widehat{\psi}(m),\quad m\geq 0. (52)

In this case:

\|H\|=\inf\{\|\psi\|_{\infty}:\widehat{\psi}(n)=\alpha_{n},\,n\geq 0\}, (53)

where \widehat{\psi}(n) denotes the n-th Fourier coefficient of \psi.

We can now reformulate the theorem using the characterization of Hankel operators in Hardy spaces.

Theorem A.2 ([29]).

Let ϕ2(𝕋)\phi\in\mathcal{L}^{2}(\mathbb{T}) be a symbol of the Hankel operator on Hardy spaces Hϕ:22H_{\phi}:\mathcal{H}^{2}\rightarrow\mathcal{H}^{2}_{-}. The following are equivalent:

  • (1)

    HϕH_{\phi} is bounded on 2\mathcal{H}^{2},

  • (2)

    there exists ψ(𝕋)\psi\in\mathcal{L}^{\infty}(\mathbb{T}) such that ψ^(m)=ϕ^(m)\widehat{\psi}(m)=\widehat{\phi}(m) for all m<0m<0.

If the conditions above are satisfied, then:

Hϕ=inf{ψ:ψ^(m)=ϕ^(m),m<0},\|H_{\phi}\|=\inf\{\|\psi\|_{\infty}:\widehat{\psi}(m)=\widehat{\phi}(m),\,m<0\}, (54)

or equivalently:

Hϕ=inff(z)ϕ(z)f(z).\|H_{\phi}\|=\inf_{f(z)\in\mathcal{H}^{\infty}}\|\phi(z)-f(z)\|_{\infty}. (55)

Nehari’s Theorem is at the core of the proof of Corollary 1.

Proof of Corollary 1.

Let HϕH_{\phi} be a Hankel operator with symbol ϕ(z)(𝕋)\phi(z)\in\mathcal{L}^{\infty}(\mathbb{T}) and matrix 𝐇\mathbf{H}. Let ψ(z)=g(z)+l(z)k\psi(z)=g(z)+l(z)\in\mathcal{H}^{\infty}_{k} be the solution of Equation 13. We have:

HϕHψ\displaystyle\|H_{\phi}-H_{\psi}\| =Hϕψ\displaystyle=\|H_{\phi-{\psi}}\| (56)
=Hσkηk(z)/ξk+(z)\displaystyle=\|H_{\sigma_{k}\eta^{-}_{k}(z)/\xi^{+}_{k}(z)}\| (57)
σkηk(z)/ξk+(z)=σk\displaystyle\leq\sigma_{k}\|\eta^{-}_{k}(z)/\xi^{+}_{k}(z)\|_{\infty}=\sigma_{k} (58)

where first we used Corollary 2 and then Equation 55. Now, using the definition of Hankel operator, we have:

HϕHψ=HϕHg=𝐇𝐆σk.\|H_{\phi}-H_{\psi}\|=\|H_{\phi}-H_{g}\|=\|\mathbf{H}-\mathbf{G}\|\leq\sigma_{k}. (59)

Since \|\mathbf{H}-\mathbf{G}\|\geq\sigma_{k} (from the Eckart-Young theorem [17]), it follows that \|\mathbf{H}-\mathbf{G}\|=\sigma_{k}. Note that \mathbf{G} has rank k, as required, because g\in\mathcal{R}_{k} (Theorem 2.6). ∎

Appendix B Proofs from Section 3

In order to prove Theorem 3.3 we need an auxiliary lemma, the analogue of a control theory result rephrased in terms of WFAs. The original theorem and lemma, together with the corresponding proofs, can be found in [16]; hence, we only provide a sketch of the proofs.

Lemma B.1 ([16]).

Let E=\langle\boldsymbol{\alpha}_{e},\mathbf{A}_{e},\boldsymbol{\beta}_{e}\rangle be a minimal WFA, and let e(z)=\boldsymbol{\alpha}_{e}^{\top}(z\mathbf{1}-\mathbf{A}_{e})^{-1}\boldsymbol{\beta}_{e}-C. If \sigma_{k}^{-1}e(z) is unimodular, then there exists a unique invertible symmetric matrix \mathbf{T} satisfying:

  • (a)

    𝐀e𝐓𝜷e=𝜶eC\mathbf{A}_{e}^{\top}\mathbf{T}\boldsymbol{\beta}_{e}=\boldsymbol{\alpha}_{e}C

  • (b)

    σk2𝜶e𝐓1𝐀e=C𝜷e\sigma_{k}^{2}\boldsymbol{\alpha}_{e}^{\top}\mathbf{T}^{-1}\mathbf{A}_{e}^{\top}=C\boldsymbol{\beta}_{e}^{\top}

  • (c)

    𝐀e𝐓𝐀eC1𝐀e𝐓𝜷e𝜶e=𝐓\mathbf{A}_{e}^{\top}\mathbf{T}\mathbf{A}_{e}-C^{-1}\mathbf{A}_{e}^{\top}\mathbf{T}\boldsymbol{\beta}_{e}\boldsymbol{\alpha}_{e}^{\top}=\mathbf{T}

Proof.

Since σk1e(z)\sigma_{k}^{-1}e(z) is unimodular, we have that:

e(z)e(z¯1)=σk2𝟏e(z)e^{*}(\bar{z}^{-1})=\sigma_{k}^{2}\mathbf{1} (60)

where we denote with ee^{*} the adjoint function. From the equation above, we obtain:

e(z¯1)\displaystyle e^{*}(\bar{z}^{-1}) =σk2e1(z)=σk2(C+𝜶e(z𝟏𝐀e)1𝜷e)1\displaystyle=\sigma_{k}^{2}e^{-1}(z)=\sigma_{k}^{2}(-C+\boldsymbol{\alpha}_{e}^{\top}(z\mathbf{1}-\mathbf{A}_{e})^{-1}\boldsymbol{\beta}_{e})^{-1} (61)
=-\sigma_{k}^{2}C^{-1}-\sigma_{k}^{2}C^{-1}\boldsymbol{\alpha}_{e}^{\top}(z\mathbf{1}-(\mathbf{A}_{e}+C^{-1}\boldsymbol{\beta}_{e}\boldsymbol{\alpha}_{e}^{\top}))^{-1}\boldsymbol{\beta}_{e}C^{-1} (62)

where we used the matrix inversion lemma. On the other hand we have:

e(z¯1)\displaystyle e^{*}(\bar{z}^{-1}) =C+𝜷e(z1𝟏𝐀e)1𝜶e\displaystyle=-C+\boldsymbol{\beta}_{e}^{\top}(z^{-1}\mathbf{1}-\mathbf{A}_{e}^{\top})^{-1}\boldsymbol{\alpha}_{e} (63)
=C+𝜷e(𝐀e(𝟏z𝐀e)+𝐀e)(𝟏z𝐀e)1𝜶e\displaystyle=-C+\boldsymbol{\beta}_{e}^{\top}(-\mathbf{A}_{e}^{-\top}(\mathbf{1}-z\mathbf{A}_{e}^{\top})+\mathbf{A}_{e}^{-\top})(\mathbf{1}-z\mathbf{A}_{e}^{\top})^{-1}\boldsymbol{\alpha}_{e} (64)
=-(C+\boldsymbol{\beta}_{e}^{\top}\mathbf{A}_{e}^{-\top}\boldsymbol{\alpha}_{e})-\boldsymbol{\beta}_{e}^{\top}\mathbf{A}_{e}^{-\top}(z\mathbf{1}-\mathbf{A}_{e}^{-\top})^{-1}\mathbf{A}_{e}^{-\top}\boldsymbol{\alpha}_{e} (65)

where we used the matrix inversion lemma again, before grouping the terms. If the quantities in Equation 62 and Equation 65 are to be equal, their constant terms must coincide. Then, we want the \mathcal{H}^{\infty}_{-} components to correspond, so we consider the corresponding Hankel matrices. It is easy to see that we can once again associate the coefficients of these complex functions to the parameters of a WFA. From the minimality of E we obtain:

\begin{cases}\sigma_{k}^{2}C^{-1}\boldsymbol{\alpha}_{e}^{\top}=\boldsymbol{\beta}_{e}^{\top}\mathbf{A}_{e}^{-\top}\mathbf{T}\\ \mathbf{A}_{e}+C^{-1}\boldsymbol{\beta}_{e}\boldsymbol{\alpha}_{e}^{\top}=\mathbf{T}^{-1}\mathbf{A}_{e}^{-\top}\mathbf{T}\\ \boldsymbol{\beta}_{e}C^{-1}=\mathbf{T}^{-1}\mathbf{A}_{e}^{-\top}\boldsymbol{\alpha}_{e}\end{cases} (66)

where 𝐓\mathbf{T} is an invertible matrix [7]. This system is equivalent to:

{σk2𝜶e𝐓1𝐀e=C𝜷e𝐀e𝐓𝐀eC1𝐀e𝐓𝜷e𝜶e=𝐓𝐀e𝐓𝜷e=𝜶eC\begin{cases}\sigma_{k}^{2}\boldsymbol{\alpha}_{e}^{\top}\mathbf{T}^{-1}\mathbf{A}_{e}^{\top}=C\boldsymbol{\beta}_{e}^{\top}\\ \mathbf{A}_{e}^{\top}\mathbf{T}\mathbf{A}_{e}-C^{-1}\mathbf{A}_{e}^{\top}\mathbf{T}\boldsymbol{\beta}_{e}\boldsymbol{\alpha}_{e}^{\top}=\mathbf{T}\\ \mathbf{A}_{e}^{\top}\mathbf{T}\boldsymbol{\beta}_{e}=\boldsymbol{\alpha}_{e}C\end{cases} (67)

To conclude the proof it remains to check that \mathbf{T} is symmetric, which can be verified by direct computation. ∎

Proof of Theorem 3.3.

This proof follows easily from Lemma B.1 by setting \mathbf{P}=-\sigma^{2}_{k}\mathbf{T}^{-1} and \mathbf{Q}=-\mathbf{T}. We obtain point (c) by direct multiplication. Then, we substitute the last equation of Equation 67 into the second one, and we obtain:

𝐀e𝐓𝐀e𝜶e𝜶e=𝐓\mathbf{A}_{e}^{\top}\mathbf{T}\mathbf{A}_{e}-\boldsymbol{\alpha}_{e}\boldsymbol{\alpha}_{e}^{\top}=\mathbf{T} (68)

which verifies point (b) with \mathbf{Q}=-\mathbf{T}. Point (a) can be obtained analogously by combining the first and second equations of Equation 67. ∎

Appendix C Possible Extensions

C.1 Relaxing the Spectral Radius Assumption

It is possible to extend part of our method to WFAs over a one-letter alphabet whose transition matrix \mathbf{A} has no eigenvalues of modulus one (so that, in particular, \rho(\mathbf{A}) may exceed 1), but the approximation recovered is not optimal in the spectral norm.

Let A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle be a WFA with n states that we want to minimize. The idea is to block-diagonalize \mathbf{A} as in Section 3.3.3, and to tackle each component separately. The component A_{+}=\langle\boldsymbol{\alpha}_{+},\mathbf{A}_{+},\boldsymbol{\beta}_{+}\rangle, with \rho(\mathbf{A}_{+})<1, can be dealt with in the way presented in the previous sections. This means that we can find an optimal spectral-norm approximation of the desired size for A_{+}. Now we can consider the second component, A_{-}=\langle\boldsymbol{\alpha}_{-},\mathbf{A}_{-},\boldsymbol{\beta}_{-}\rangle. The key idea is to apply the transformation z^{j-1}\mapsto z^{-j} for j\geq 1 to the function \phi^{\prime\prime}(z) associated with A_{-}. Then, the function

ϕ′′(z1)=k0𝜶𝐀kzk𝜷=𝜶(𝟏z𝐀)1𝜷\phi^{\prime\prime}(z^{-1})=\sum_{k\geq 0}\boldsymbol{\alpha}_{-}^{\top}\mathbf{A}_{-}^{k}z^{k}\boldsymbol{\beta}_{-}=\boldsymbol{\alpha}_{-}^{\top}(\mathbf{1}-z\mathbf{A}_{-})^{-1}\boldsymbol{\beta}_{-} (69)

is well defined, as the series converges for z of small enough modulus. Indeed, since -\mathbf{A}_{-}(z\mathbf{1}-\mathbf{A}_{-}^{-1})=\mathbf{1}-z\mathbf{A}_{-}, Equation 69 is the rational function of the WFA \langle\boldsymbol{\alpha}_{-},\mathbf{A}_{-}^{-1},-\mathbf{A}_{-}^{-1}\boldsymbol{\beta}_{-}\rangle, whose transition matrix has all eigenvalues strictly inside the unit disc. Using this transformation we obtain a function with poles inside the unit disc, and we can apply the method presented in the paper. An important choice to make is the size of the approximation of A_{-}, as it can influence the quality of the approximation. Analyzing the effect of this parameter on the approximation error constitutes an interesting direction for future work, on both the theoretical and the experimental side. We refer the reader to the control theory literature [20], where some theoretical work has been done to study an analogous approach for continuous-time systems and their approximation error.

C.2 Polynomial method

We remark that Equation 14 from Corollary 2 can be rewritten as

ψ(z)=ϕ(z)Hξk+(z)ξk+(z),\psi(z)=\phi(z)-\frac{H\xi^{+}_{k}(z)}{\xi^{+}_{k}(z)}, (70)

where \xi^{+}_{k}(z) is the function in \mathcal{H}^{2} associated with the vector \boldsymbol{\xi}_{k}\in\operatorname{Ker}(\mathbf{H}^{*}\mathbf{H}-\sigma_{k}^{2}\mathbf{1}) (and \psi(z) does not depend on the choice of the specific \boldsymbol{\xi}_{k}). There is an alternative way to find the best approximation, particularly useful when the objective is to approximate a finite-rank infinite Hankel matrix with another Hankel matrix, without necessarily extracting a WFA. We can consider the adjoint operator H^{*} and its matrix \mathbf{H}^{*}. The singular numbers and singular vectors of H correspond to the eigenvalues and eigenvectors of \mathbf{R}=(\mathbf{H}^{*}\mathbf{H})^{1/2}. Hence, it is possible to compute \sigma_{k} and a corresponding singular vector \boldsymbol{\xi}_{k}. The function \xi^{+}_{k}(z) is then obtained following Equation 9. The Hankel matrix \mathbf{G} that best approximates \mathbf{H} is given by \mathbf{G}=\mathbf{H}-\mathbf{M}, where \mathbf{M} is the Hankel matrix having \frac{H\xi^{+}_{k}(z)}{\xi^{+}_{k}(z)} as symbol.
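The first steps of this route are straightforward to carry out numerically; the following sketch (with a hypothetical rank-2 function and a finite truncation of \mathbf{H}) computes \sigma_{k} and \boldsymbol{\xi}_{k}, leaving the final series division implicit:

```python
import numpy as np

# Truncated-Hankel sketch of the polynomial method's ingredients, for a
# hypothetical rank-2 function f(n) = alpha^T A^n beta.
N = 100
alpha = np.array([1.0, 0.5])
A = np.diag([0.9, 0.3])
beta = np.array([1.0, 1.0])
f = np.array([alpha @ np.linalg.matrix_power(A, n) @ beta
              for n in range(2 * N)])
H = np.array([[f[i + j] for j in range(N)] for i in range(N)])

# Singular numbers and vectors of H are the eigen-pairs of (H*H)^{1/2}.
k = 1
U, s, Vt = np.linalg.svd(H)
sigma_k = s[k]
xi_k = Vt[k]        # coefficients of xi_k^+(z) in the basis {z^j, j >= 0}
H_xi = H @ xi_k     # coefficients of (H xi_k^+)(z)
# The symbol of M is (H xi_k^+)(z) / xi_k^+(z); its Fourier coefficients
# (and hence the Hankel matrix G = H - M) follow by series division.
print(sigma_k)
```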