
Optimal Spectral-Norm Approximate Minimization of Weighted Finite Automata

Borja Balle (DeepMind, London, United Kingdom), Clara Lacroce¹ (School of Computer Science, McGill University, Montréal, Canada; Mila, Montréal, Canada), Prakash Panangaden (School of Computer Science, McGill University, Montréal, Canada; Mila, Montréal, Canada), Doina Precup (School of Computer Science, McGill University, Montréal, Canada; Mila, Montréal, Canada), Guillaume Rabusseau² (Mila, Montréal, Canada; DIRO, Université de Montréal, Montréal, Canada)

¹ Corresponding author: [email protected]
² The names of the authors appear in alphabetical order.
Abstract

We address the approximate minimization problem for weighted finite automata (WFAs) with weights in $\mathbb{R}$, over a one-letter alphabet: to compute the best possible approximation of a WFA given a bound on the number of states. This work is grounded in Adamyan-Arov-Krein (AAK) approximation theory, a remarkable collection of results on the approximation of Hankel operators. In addition to its intrinsic mathematical relevance, this theory has proven to be very effective for model reduction. We adapt these results to the framework of weighted automata over a one-letter alphabet. We provide theoretical guarantees and bounds on the quality of the approximation in the spectral and $\ell^{2}$ norms. We develop an algorithm that, based on the properties of Hankel operators, returns the optimal approximation in the spectral norm.

1 Introduction

Weighted finite automata (WFAs) are an expressive class of models representing functions defined over sequences. The approximate minimization problem is concerned with finding an automaton that approximates the behaviour of a given minimal WFA while being smaller in size. This second automaton recognizes a different language, and the objective is to minimize the approximation error [10, 11]. Approximate minimization becomes particularly useful in the context of spectral learning algorithms [5, 7, 9, 22]. When applied to a learning task, such algorithms can be viewed as working in two steps. First, they compute a minimal WFA that explains the training data exactly. Then, they obtain a model that generalizes to unseen data by producing a smaller approximation to the minimal WFA. It is not just a question of saving space by having a smaller state space: the exact machine will overfit the data and generalize poorly. To obtain accurate results it is crucial to correctly guess the size of the minimal WFA, in particular when the data is generated by this type of machine.

The minimization task is greatly shaped by the way we decide to measure the approximation error. It is thus natural to wonder if there are norms that are preferable to others. We believe that the chosen norm should be computationally reasonable to minimize. For instance, the distance between WFAs can be computed using a metric based on bisimulation [8]. While exploring this approach could still be interesting, the fact that this metric is hard to compute makes it unsuitable for our purposes. Moreover, this metric is specifically designed for WFAs, so it is not directly applicable to other models dealing with sequential data. We think that being transferable is a second important feature for the chosen norm. In fact, being able to compare different classes of models is desirable for future applications of this method. For example, one can think of the burgeoning literature on approximating Recurrent Neural Networks (RNNs) using WFAs, where the objective is to extract from a trained RNN an automaton that accurately mimics its behaviour [36, 40, 31, 4, 18]. With this in mind, we think that it is preferable to consider a norm defined on the input-output function – or the Hankel matrix – rather than on the parameters of the specific model considered. Finally, it is important to choose a norm that can be computed accurately. The spectral norm seems to be a good candidate for the task. In particular, it allows us to exploit the work of Adamyan, Arov and Krein, which has come to be known as AAK theory [1]: a series of results connecting the theory of complex functions on the unit circle to Hankel matrices, mathematical objects representing functions defined over sequences. The core of this theory provides us with theoretical guarantees for the exact computation of the spectral norm of the error, and a method to construct the optimal approximation.

We summarize our main contributions:

  • We use AAK theory to study the approximate minimization problem of WFAs. To connect those areas, we establish a correspondence between the parameters of a WFA and the coefficients of a complex function on the unit circle. To the best of our knowledge, this paper represents the first attempt to apply AAK theory to WFAs.

  • We present a theoretical analysis of the optimal spectral-norm approximate minimization problem for WFAs, based on their connection with finite-rank infinite Hankel matrices. We provide a closed form solution for real weighted automata over a one-letter alphabet $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$, under the assumption $\rho(\mathbf{A})<1$ on the spectral radius. We bound the approximation error, both in terms of the Hankel matrix (spectral norm) and of the rational function computed by the WFA ($\ell^{2}$ norm).

  • We propose a self-contained algorithm that returns the unique optimal spectral-norm approximation of a given size.

  • We tighten the connection, made in [11], between WFAs and discrete dynamical systems, by adapting concepts from control theory, like the allpass system [20], to the formalism of WFAs.

2 Background

2.1 Preliminaries

We denote with $\mathbb{N}$, $\mathbb{Z}$ and $\mathbb{R}$ the sets of natural, integer and real numbers, respectively. We use bold letters for vectors and matrices; all vectors considered are column vectors. We denote with $\mathbf{1}$ the identity matrix, specifying its dimension only when not clear from the context. We refer to the $i$-th row and the $j$-th column of $\mathbf{M}$ by $\mathbf{M}(i,:)$ and $\mathbf{M}(:,j)$. Given a matrix $\mathbf{M}\in\mathbb{R}^{p\times q}$ of rank $n$, a rank factorization is a factorization $\mathbf{M}=\mathbf{P}\mathbf{Q}$, where $\mathbf{P}\in\mathbb{R}^{p\times n}$, $\mathbf{Q}\in\mathbb{R}^{n\times q}$ and $\operatorname{rank}(\mathbf{P})=\operatorname{rank}(\mathbf{Q})=n$. Let $\mathbf{M}\in\mathbb{R}^{p\times q}$ have rank $n$; the compact singular value decomposition (SVD) of $\mathbf{M}$ is the factorization $\mathbf{M}=\mathbf{U}\mathbf{D}\mathbf{V}^{\top}$, where $\mathbf{U}\in\mathbb{R}^{p\times n}$, $\mathbf{D}\in\mathbb{R}^{n\times n}$, $\mathbf{V}\in\mathbb{R}^{q\times n}$ are such that $\mathbf{U}^{\top}\mathbf{U}=\mathbf{V}^{\top}\mathbf{V}=\mathbf{1}$, and $\mathbf{D}$ is a diagonal matrix. The columns of $\mathbf{U}$ and $\mathbf{V}$ are called left and right singular vectors, and the entries of $\mathbf{D}$ are the singular values. The Moore-Penrose pseudo-inverse $\mathbf{M}^{+}$ of $\mathbf{M}$ is the unique matrix such that $\mathbf{M}\mathbf{M}^{+}\mathbf{M}=\mathbf{M}$, $\mathbf{M}^{+}\mathbf{M}\mathbf{M}^{+}=\mathbf{M}^{+}$, with $\mathbf{M}^{+}\mathbf{M}$ and $\mathbf{M}\mathbf{M}^{+}$ Hermitian [42]. The spectral radius $\rho(\mathbf{M})$ of a matrix $\mathbf{M}$ is the largest modulus among its eigenvalues.
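These objects translate directly into a few lines of numerical linear algebra. The following is a minimal sketch, assuming NumPy; the matrices are illustrative and not from the paper:

```python
import numpy as np

# A rank-2 matrix M in R^{3x4}
M = np.outer([1., 2., 3.], [1., 0., 1., 0.]) + np.outer([0., 1., 1.], [0., 1., 0., 1.])

# Compact SVD: keep only the n = rank(M) nonzero singular values
U, s, Vt = np.linalg.svd(M, full_matrices=False)
n = int(np.sum(s > 1e-10))                  # numerical rank
U, D, V = U[:, :n], np.diag(s[:n]), Vt[:n, :].T
assert np.allclose(M, U @ D @ V.T)          # M = U D V^T
assert np.allclose(U.T @ U, np.eye(n))      # U^T U = V^T V = 1

# Moore-Penrose pseudo-inverse and its defining identities
Mp = np.linalg.pinv(M)
assert np.allclose(M @ Mp @ M, M)
assert np.allclose(Mp @ M @ Mp, Mp)

# Spectral radius of a square matrix
A = np.array([[0., 0.5], [0.5, 0.]])
rho = np.max(np.abs(np.linalg.eigvals(A)))  # 0.5
```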

A Hilbert space is a complete normed vector space where the norm arises from an inner product. A linear operator $T:X\rightarrow Y$ between Hilbert spaces is bounded if it has finite operator norm, i.e. $\|T\|_{op}=\sup_{\|g\|_{X}\leq 1}\|Tg\|_{Y}<\infty$. We denote by $\mathbf{T}$ the (infinite) matrix associated with $T$ by some (canonical) orthonormal basis on the Hilbert space. An operator is compact if the image of the unit ball in $X$ is relatively compact. Given Hilbert spaces $X,Y$ and a compact operator $T:X\rightarrow Y$, we denote its adjoint by $T^{*}$. The singular numbers $\{\sigma_{n}\}_{n\geq 0}$ of $T$ are the square roots of the eigenvalues of the self-adjoint operator $T^{*}T$, arranged in decreasing order. A $\sigma$-Schmidt pair $\{\boldsymbol{\xi},\boldsymbol{\eta}\}$ for $T$ is a pair of norm-$1$ vectors such that $\mathbf{T}\boldsymbol{\xi}=\sigma\boldsymbol{\eta}$ and $\mathbf{T}^{*}\boldsymbol{\eta}=\sigma\boldsymbol{\xi}$. The Hilbert-Schmidt decomposition provides a generalization of the compact SVD for the infinite matrix of a compact operator $T$ using singular numbers and orthonormal Schmidt pairs: $\mathbf{T}\mathbf{x}=\sum_{n\geq 0}\sigma_{n}\langle\mathbf{x},\boldsymbol{\xi}_{n}\rangle\boldsymbol{\eta}_{n}$ [42]. The spectral norm $\|\mathbf{T}\|$ of the matrix representing the operator $T$ is the largest singular number of $T$. Note that the spectral norm of $\mathbf{T}$ corresponds to the operator norm of $T$.

Let $\ell^{2}$ be the Hilbert space of square-summable sequences over $\Sigma^{*}$, with norm $\|f\|_{2}^{2}=\sum_{x\in\Sigma^{*}}|f(x)|^{2}$ and inner product $\langle f,g\rangle=\sum_{x\in\Sigma^{*}}f(x)g(x)$ for $f,g\in\mathbb{R}^{\Sigma^{*}}$. Let $\mathbb{T}=\{z\in\mathbb{C}:|z|=1\}$ be the complex unit circle and $\mathbb{D}=\{z\in\mathbb{C}:|z|<1\}$ the (open) complex unit disc. For $1<p<\infty$, let $\mathcal{L}^{p}(\mathbb{T})$ be the space of measurable functions on $\mathbb{T}$ for which the $p$-th power of the absolute value is Lebesgue integrable. For $p=\infty$, we denote with $\mathcal{L}^{\infty}(\mathbb{T})$ the space of bounded measurable functions, with norm $\|f\|_{\infty}=\sup\{|f(x)|:x\in\mathbb{T}\}$.

2.2 Weighted Finite Automata

Let $\Sigma$ be a fixed finite alphabet and $\Sigma^{*}$ the set of all finite strings with symbols in $\Sigma$. We use $\varepsilon$ to denote the empty string. Given $p,s\in\Sigma^{*}$, we denote with $ps$ their concatenation.

A weighted finite automaton (WFA) of $n$ states over $\Sigma$ is a tuple $A=\langle\boldsymbol{\alpha},\{\mathbf{A}_{a}\},\boldsymbol{\beta}\rangle$, where $\boldsymbol{\alpha},\boldsymbol{\beta}\in\mathbb{R}^{n}$ are the vectors of initial and final weights, respectively, and $\mathbf{A}_{a}\in\mathbb{R}^{n\times n}$ is the matrix containing the transition weights associated with each symbol $a$. Every WFA $A$ with real weights realizes a function $f_{A}:\Sigma^{*}\to\mathbb{R}$: given a string $x=x_{1}\cdots x_{t}\in\Sigma^{*}$, it returns $f_{A}(x)=\boldsymbol{\alpha}^{\top}\mathbf{A}_{x_{1}}\cdots\mathbf{A}_{x_{t}}\boldsymbol{\beta}=\boldsymbol{\alpha}^{\top}\mathbf{A}_{x}\boldsymbol{\beta}$. A function $f:\Sigma^{*}\to\mathbb{R}$ is called rational if there exists a WFA $A$ that realizes it. The rank of the function is the size of the smallest WFA realizing $f$. Given $f:\Sigma^{*}\to\mathbb{R}$, we can consider a matrix $\mathbf{H}_{f}\in\mathbb{R}^{\Sigma^{*}\times\Sigma^{*}}$ having rows and columns indexed by strings and defined by $\mathbf{H}_{f}(p,s)=f(ps)$ for $p,s\in\Sigma^{*}$.

Definition 2.1.

A (bi-infinite) matrix $\mathbf{H}\in\mathbb{R}^{\Sigma^{*}\times\Sigma^{*}}$ is Hankel if for all $p,p^{\prime},s,s^{\prime}\in\Sigma^{*}$ such that $ps=p^{\prime}s^{\prime}$, we have $\mathbf{H}(p,s)=\mathbf{H}(p^{\prime},s^{\prime})$. Given a Hankel matrix $\mathbf{H}\in\mathbb{R}^{\Sigma^{*}\times\Sigma^{*}}$, there exists a unique function $f:\Sigma^{*}\to\mathbb{R}$ such that $\mathbf{H}_{f}=\mathbf{H}$.

Theorem 2.1 ([15, 19]).

A function $f:\Sigma^{*}\to\mathbb{R}$ can be realized by a WFA if and only if $\mathbf{H}_{f}$ has finite rank $n$. In that case, $n$ is the minimal number of states of any WFA $A$ such that $f=f_{A}$.

Given a WFA $A=\langle\boldsymbol{\alpha},\{\mathbf{A}_{a}\},\boldsymbol{\beta}\rangle$, the forward matrix of $A$ is the infinite matrix $\mathbf{F}_{A}\in\mathbb{R}^{\Sigma^{*}\times n}$ given by $\mathbf{F}_{A}(p,:)=\boldsymbol{\alpha}^{\top}\mathbf{A}_{p}$ for any $p\in\Sigma^{*}$, while the backward matrix of $A$ is $\mathbf{B}_{A}\in\mathbb{R}^{\Sigma^{*}\times n}$, given by $\mathbf{B}_{A}(s,:)=(\mathbf{A}_{s}\boldsymbol{\beta})^{\top}$ for any $s\in\Sigma^{*}$. Let $\mathbf{H}_{f}$ be the Hankel matrix of $f$; its forward-backward (FB) factorization is $\mathbf{H}_{f}=\mathbf{F}\mathbf{B}^{\top}$. A WFA with $n$ states is reachable if $\operatorname{rank}(\mathbf{F}_{A})=n$, and observable if $\operatorname{rank}(\mathbf{B}_{A})=n$. A WFA is minimal if it is reachable and observable. If $A$ is minimal, the FB factorization is a rank factorization [7].
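For a one-letter alphabet the rows of the forward and backward matrices are just $\boldsymbol{\alpha}^{\top}\mathbf{A}^{p}$ and $(\mathbf{A}^{s}\boldsymbol{\beta})^{\top}$, so the FB factorization can be checked numerically on truncations. A minimal sketch, assuming NumPy and an illustrative two-state automaton of our choosing:

```python
import numpy as np

# A small one-letter WFA <alpha, A, beta> (illustrative values, rho(A) = 0.5)
A = np.array([[0.3, 0.1], [0.2, 0.4]])
alpha = np.array([1.0, 0.5])
beta = np.array([1.0, 0.0])

# N-row truncations of the forward and backward matrices:
# F(p, :) = alpha^T A^p,   B(s, :) = (A^s beta)^T
N = 40
F = np.array([alpha @ np.linalg.matrix_power(A, p) for p in range(N)])
B = np.array([np.linalg.matrix_power(A, s) @ beta for s in range(N)])

# FB factorization: H(p, s) = f(ps) = alpha^T A^{p+s} beta = (F B^T)(p, s)
H = np.array([[alpha @ np.linalg.matrix_power(A, p + s) @ beta
               for s in range(N)] for p in range(N)])
assert np.allclose(H, F @ B.T)
print(np.linalg.matrix_rank(F), np.linalg.matrix_rank(B))  # 2 2: minimal
```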

We recall the definition of the singular value automaton, a canonical form for WFAs [10].

Definition 2.2.

Let $f:\Sigma^{*}\rightarrow\mathbb{R}$ be a rational function and suppose $\mathbf{H}_{f}$ admits an SVD, $\mathbf{H}_{f}=\mathbf{U}\mathbf{D}\mathbf{V}^{\top}$. A singular value automaton (SVA) for $f$ is the minimal WFA $A$ realizing $f$ such that $\mathbf{F}_{A}=\mathbf{U}\mathbf{D}^{1/2}$ and $\mathbf{B}_{A}=\mathbf{V}\mathbf{D}^{1/2}$.

The SVA can be computed with an efficient algorithm relying on the following matrices [11].

Definition 2.3.

Let $f:\Sigma^{*}\rightarrow\mathbb{R}$ be a rational function and $\mathbf{H}_{f}=\mathbf{F}\mathbf{B}^{\top}$ an FB factorization. If the matrices $\mathbf{P}=\mathbf{F}^{\top}\mathbf{F}$ and $\mathbf{Q}=\mathbf{B}^{\top}\mathbf{B}$ are well defined, we call $\mathbf{P}$ the reachability Gramian and $\mathbf{Q}$ the observability Gramian.

Note that if $A$ is an SVA, then the Gramians associated with its FB factorization satisfy $\mathbf{P}_{A}=\mathbf{Q}_{A}=\mathbf{D}$, where $\mathbf{D}$ is the matrix of singular values of its Hankel matrix. The Gramians can alternatively be characterized (and computed [11]) using fixed point equations, corresponding to Lyapunov equations when $|\Sigma|=1$ [27].

Theorem 2.2.

Let $|\Sigma|=1$, and let $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$ be a WFA with $n$ states and well-defined Gramians $\mathbf{P}$, $\mathbf{Q}$. Then $X=\mathbf{P}$ and $Y=\mathbf{Q}$ solve:

X-\mathbf{A}X\mathbf{A}^{\top}=\boldsymbol{\beta}\boldsymbol{\beta}^{\top},  (1)
Y-\mathbf{A}^{\top}Y\mathbf{A}=\boldsymbol{\alpha}\boldsymbol{\alpha}^{\top}.  (2)
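Equations (1) and (2) are standard discrete Lyapunov equations, so off-the-shelf solvers apply. A minimal sketch, assuming SciPy, using the two-state automaton of Example 2.3 below (whose Gramians come out as $\mathbf{P}=\mathbf{Q}=\operatorname{diag}(4/5,1/5)$):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# The one-letter WFA of Example 2.3 below (already in SVA form)
A = np.array([[0., 0.5], [0.5, 0.]])
alpha = np.array([np.sqrt(3) / 2, 0.])
beta = np.array([np.sqrt(3) / 2, 0.])

# Equation (1): X - A X A^T = beta beta^T
P = solve_discrete_lyapunov(A, np.outer(beta, beta))
# Equation (2): Y - A^T Y A = alpha alpha^T
Q = solve_discrete_lyapunov(A.T, np.outer(alpha, alpha))

print(np.round(P, 6))   # diag(0.8, 0.2)
print(np.round(Q, 6))   # equal to P: the automaton is in SVA form
```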

Finally, we recall the following definition.

Definition 2.4.

A WFA $A=\langle\boldsymbol{\alpha},\{\mathbf{A}_{a}\},\boldsymbol{\beta}\rangle$ is a generative probabilistic automaton (GPA) if $f_{A}(x)\geq 0$ for every $x$ and $\sum_{x\in\Sigma^{*}}f_{A}(x)=1$, i.e. if $f_{A}$ computes a probability distribution over $\Sigma^{*}$.

Example 2.3.

Let $|\Sigma|=1$, $\Sigma=\{x\}$. The WFA $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$, with:

\mathbf{A}=\begin{pmatrix}0&\frac{1}{2}\\ \frac{1}{2}&0\end{pmatrix},\quad\boldsymbol{\alpha}=\begin{pmatrix}\frac{\sqrt{3}}{2}\\ 0\end{pmatrix},\quad\boldsymbol{\beta}=\begin{pmatrix}\frac{\sqrt{3}}{2}\\ 0\end{pmatrix},

is a GPA. Indeed, $f_{A}(x)\geq 0$ and $\sum_{x\in\Sigma^{*}}f_{A}(x)=1$, since the rational function is:

f_{A}(x\cdots x)=f_{A}(k)=\boldsymbol{\alpha}^{\top}\mathbf{A}^{k}\boldsymbol{\beta}=\begin{cases}0&\text{if $k$ is odd}\\ \frac{3}{4}2^{-k}&\text{if $k$ is even}\end{cases}

where $k$ corresponds to the string where $x$ is repeated $k$ times. We remark that $A$ is minimal and in its SVA form, with Gramians $\mathbf{P}=\mathbf{Q}=\begin{pmatrix}\frac{4}{5}&0\\ 0&\frac{1}{5}\end{pmatrix}$, and $f_{A}$ has rank $2$. Finally, the corresponding Hankel matrix, also of rank $2$, is:

\mathbf{H}=\begin{pmatrix}f_{A}(0)&f_{A}(1)&f_{A}(2)&\dots\\ f_{A}(1)&f_{A}(2)&f_{A}(3)&\dots\\ f_{A}(2)&f_{A}(3)&f_{A}(4)&\dots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix}=\begin{pmatrix}\frac{3}{4}&0&\frac{3}{16}&\dots\\ 0&\frac{3}{16}&0&\dots\\ \frac{3}{16}&0&\frac{3}{64}&\dots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix}.  (3)
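A quick numerical check of this example (a sketch assuming NumPy): the truncated Hankel matrix has numerical rank $2$, and its leading singular values approach the Gramian diagonal $(4/5,1/5)$:

```python
import numpy as np

A = np.array([[0., 0.5], [0.5, 0.]])
alpha = np.array([np.sqrt(3) / 2, 0.])
beta = np.array([np.sqrt(3) / 2, 0.])

f = lambda k: alpha @ np.linalg.matrix_power(A, k) @ beta

# f_A is a probability distribution: nonnegative values summing to 1
vals = np.array([f(k) for k in range(200)])
assert np.all(vals >= 0) and np.isclose(vals.sum(), 1.0)

# N x N truncation of the Hankel matrix H(i, j) = f(i + j)
N = 50
H = np.array([[f(i + j) for j in range(N)] for i in range(N)])
print(np.linalg.matrix_rank(H))                  # 2
print(np.linalg.svd(H, compute_uv=False)[:2])    # ~ [0.8, 0.2]
```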

2.3 AAK Theory

Theorem 2.1 provides us with a way to associate a minimal WFA $A$ with $n$ states to a Hankel matrix $\mathbf{H}$ of rank $n$. The approach we propose to approximate $A$ is to find the WFA corresponding to the low-rank matrix closest to $\mathbf{H}$ in spectral norm. We recall the fundamental result of Schmidt, Eckart, Young and Mirsky [17].

Theorem 2.4 ([17]).

Let $\mathbf{H}$ be a Hankel matrix corresponding to a compact Hankel operator of rank $n$, and let $\sigma_{0}\geq\dots\geq\sigma_{n-1}>0$ be its singular numbers. Then, if $\mathbf{R}$ is a matrix of rank $k$, we have:

\|\mathbf{H}-\mathbf{R}\|\geq\sigma_{k}.  (4)

The equality is attained when $\mathbf{R}$ corresponds to the truncated SVD of $\mathbf{H}$.

Note that a low-rank approximation obtained by truncating the SVD is not in general a Hankel matrix. This is problematic, since the approximating matrix needs to be Hankel in order to be the matrix of a WFA. Surprisingly, we can obtain a result comparable to the one of Theorem 2.4 while preserving the Hankel property. This is possible thanks to a theory of optimal approximation called AAK theory [1]. To apply this theory, we will need to rewrite the approximation problem in functional analysis terms. First, we will associate a linear operator to the Hankel matrix. Then, we will use Fourier analysis to reformulate the problem in a function space. A comprehensive presentation of the concepts recalled in this section can be found in [30, 33, 28].

Let $f:\Sigma^{*}\rightarrow\mathbb{R}$ be a rational function; we interpret the corresponding Hankel matrix $\mathbf{H}_{f}$ as the expression of a linear (Hankel) operator $H_{f}:\ell^{2}\rightarrow\ell^{2}$ in terms of the canonical basis. We recall that a Hankel operator $H_{f}$ is bounded if and only if $f\in\ell^{2}$ [11]. This property, together with the fact that we only consider finite rank operators (corresponding to the Hankel matrices of WFAs), is sufficient to guarantee compactness.

To introduce AAK theory, we need to consider a second realization of Hankel operators on complex spaces. Since in this paper we work with two classes of functions – functions over sequences and complex functions – to avoid any confusion we will make explicit the dependence on the complex variable $z=e^{it}$. We start by recalling a few fundamental definitions from the theory of complex functions. Note that a function $\phi(z)\in\mathcal{L}^{2}(\mathbb{T})$ can be represented, using the orthonormal basis $\{z^{n}\}_{n\in\mathbb{Z}}$, by means of its Fourier series: $\phi(z)=\sum_{n\in\mathbb{Z}}\widehat{\phi}(n)z^{n}$, with Fourier coefficients $\widehat{\phi}(n)=\int_{\mathbb{T}}\phi(z)\bar{z}^{n}dz$, $n\in\mathbb{Z}$. This establishes an isomorphism between the function $\phi(z)$ and the sequence of its Fourier coefficients $\widehat{\phi}$. Thus, we can partition the function space $\mathcal{L}^{2}(\mathbb{T})$ into two subspaces.

Definition 2.5.

For $0<p\leq\infty$, the Hardy space $\mathcal{H}^{p}$ on $\mathbb{T}$ is the subspace of $\mathcal{L}^{p}(\mathbb{T})$ defined as:

\mathcal{H}^{p}=\{\phi(z)\in\mathcal{L}^{p}(\mathbb{T}):\widehat{\phi}(n)=0,\ n<0\},  (5)

while the negative Hardy space on $\mathbb{T}$ is the subspace of $\mathcal{L}^{p}(\mathbb{T})$:

\mathcal{H}^{p}_{-}=\{\phi(z)\in\mathcal{L}^{p}(\mathbb{T}):\widehat{\phi}(n)=0,\ n\geq 0\}.  (6)

It is possible to define Hardy spaces also on the open unit disc $\mathbb{D}$.

Definition 2.6.

The Hardy space $\mathcal{H}^{p}(\mathbb{D})$ on $\mathbb{D}$ for $0<p<\infty$ consists of functions $\phi(z)$ analytic in $\mathbb{D}$ and such that:

\|\phi\|_{p}:=\sup_{0<r<1}\left(\int_{\mathbb{T}}|\phi(r\xi)|^{p}dm(\xi)\right)^{1/p}<\infty  (7)

and it is equipped with the norm $\|\cdot\|_{p}$. For $p=\infty$, $\mathcal{H}^{\infty}(\mathbb{D})$ is the space of bounded analytic functions in $\mathbb{D}$ with norm:

\|\phi\|_{\infty}:=\sup_{\xi\in\mathbb{D}}|\phi(\xi)|.  (8)

Interestingly, $\mathcal{H}^{p}(\mathbb{D})$ and $\mathcal{H}^{p}$ can be canonically identified by associating a function $\phi(z)\in\mathcal{H}^{p}(\mathbb{D})$ with its limit on the boundary, which is a function in $\mathcal{H}^{p}$ (a proof can be found in [30]). Thus, we will make no distinction between functions on the unit disc and their boundary values on the circle.

We can now embed the sequence space $\ell^{2}$ into $\ell^{2}(\mathbb{Z})$ by “duplicating” each vector, i.e. by associating $\boldsymbol{\mu}=(\mu_{0},\mu_{1},\dots)\in\ell^{2}$ to $\boldsymbol{\mu}^{(2)}=(\dots,\mu_{1},\mu_{0},\mu_{1},\dots)\in\ell^{2}(\mathbb{Z})$. Then, we can use the Fourier isomorphism to map the vector $\boldsymbol{\mu}^{(2)}\in\ell^{2}(\mathbb{Z})$ to the function space $\mathcal{L}^{2}(\mathbb{T})$. In this way each vector $\boldsymbol{\mu}\in\ell^{2}$ corresponds to two functions in the Hardy spaces:

\mu^{-}(z)=\sum_{j=0}^{\infty}\boldsymbol{\mu}_{j}z^{-j-1}\in\mathcal{H}^{2}_{-},  (9)
\mu^{+}(z)=\sum_{j=0}^{\infty}\boldsymbol{\mu}_{j}z^{j}\in\mathcal{H}^{2}.  (10)

This leads to an alternative characterization of Hankel operators in Hardy spaces.

Definition 2.7.

Let $\phi(z)$ be a function in the space $\mathcal{L}^{2}(\mathbb{T})$. A Hankel operator is an operator $H_{\phi}:\mathcal{H}^{2}\rightarrow\mathcal{H}^{2}_{-}$ defined by $H_{\phi}f(z)=\mathbb{P}_{-}\phi f(z)$, where $\mathbb{P}_{-}$ is the orthogonal projection from $\mathcal{L}^{2}(\mathbb{T})$ onto $\mathcal{H}^{2}_{-}$. The function $\phi(z)$ is called a symbol of the Hankel operator $H_{\phi}$.

If $H_{\phi}$ is a bounded operator, we can consider without loss of generality $\phi(z)\in\mathcal{L}^{\infty}(\mathbb{T})$. This is a consequence of Nehari’s theorem [29], whose formulation can be found in Appendix A, together with more details about the two definitions of Hankel operators. We remark that a Hankel operator has infinitely many different symbols, since $H_{\phi}=H_{\phi+\psi}$ for $\psi(z)\in\mathcal{H}^{\infty}$.

Remark 1.

In the standard orthonormal bases, $\{z^{k}\}_{k\geq 0}$ in $\mathcal{H}^{2}$ and $\{z^{-(j+1)}\}_{j\geq 0}$ in $\mathcal{H}^{2}_{-}$, the Hankel operator $H_{\phi}$ has Hankel matrix $\mathbf{H}(j,k)=\widehat{\phi}(-j-k-1)$ for $j,k\geq 0$.

Example 2.5.

In the case of the Hankel matrix in Example 2.3, since $\mathbf{H}(j,k)=\widehat{\phi}(-j-k-1)$, we have:

\mathbf{H}=\begin{pmatrix}\frac{3}{4}&0&\frac{3}{16}&\dots\\ 0&\frac{3}{16}&0&\dots\\ \frac{3}{16}&0&\frac{3}{64}&\dots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix}=\begin{pmatrix}\widehat{\phi}(-1)&\widehat{\phi}(-2)&\widehat{\phi}(-3)&\dots\\ \widehat{\phi}(-2)&\widehat{\phi}(-3)&\widehat{\phi}(-4)&\dots\\ \widehat{\phi}(-3)&\widehat{\phi}(-4)&\widehat{\phi}(-5)&\dots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix}.

Hence, we can recover the corresponding symbol:

\mathbb{P}_{-}\phi=\sum_{n\geq 0}\widehat{\phi}(-n-1)z^{-n-1}=\sum_{n\geq 0}\frac{3}{4}4^{-n}z^{-2n-1}=\frac{3z}{4z^{2}-1}.
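As a numerical sanity check (a sketch assuming NumPy), the negative Fourier coefficients of this symbol, computed by discretizing $\widehat{\phi}(n)=\int_{\mathbb{T}}\phi(z)\bar{z}^{n}dz$, reproduce the Hankel entries $f(k)$:

```python
import numpy as np

phi = lambda z: 3 * z / (4 * z**2 - 1)   # the symbol recovered above

# phi_hat(n) via the trapezoid rule on the unit circle (normalized measure)
def fourier_coeff(n, m=4096):
    t = np.linspace(0, 2 * np.pi, m, endpoint=False)
    z = np.exp(1j * t)
    return np.mean(phi(z) * np.exp(-1j * n * t))

# Negative coefficients reproduce the Hankel entries: f(k) = phi_hat(-k-1)
for k in range(5):
    print(k, fourier_coeff(-k - 1).real)   # 3/4, 0, 3/16, 0, 3/64
```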
Definition 2.8.

The complex function $f(z)$ is rational if $f(z)=p(z)/q(z)$, with $p(z)$ and $q(z)$ polynomials. The rank of $f(z)$ is the maximum of the degrees of $p(z)$ and $q(z)$. A rational function is strictly proper if the degree of $p(z)$ is strictly smaller than that of $q(z)$.

We remark that the poles of a complex function $f$ correspond to the zeros of $1/f$. The following result of Kronecker relates finite-rank infinite Hankel matrices to rational functions.

Theorem 2.6 ([24]).

Let $H_{\phi}$ be a bounded Hankel operator with matrix $\mathbf{H}$. Then $\mathbf{H}$ has finite rank if and only if $\mathbb{P}_{-}\phi$ is a strictly proper rational function. Moreover, the rank of $\mathbf{H}$ is equal to the number of poles (with multiplicities) of $\mathbb{P}_{-}\phi$ inside the unit disc.

Example 2.7.

The function in Example 2.5 is rational with degree $2$ and has two poles inside the unit disc, at $z=\pm\frac{1}{2}$. Thus, the associated Hankel matrix has rank $2$.

We state as a remark an important takeaway from this section.

Remark 2.

Given a rank-$n$ Hankel matrix $\mathbf{H}$, we can look at it in two alternative ways. On the one hand, we can consider the Hankel operator over sequences, $H_{f}:\ell^{2}\rightarrow\ell^{2}$, associated to a function $f:\Sigma^{*}\rightarrow\mathbb{R}$. In this case $\mathbf{H}(i,j)=f(i+j)$ for $i,j\geq 0$, and $f$ is rational in the sense that it is realized by a WFA of size $n$. On the other hand, we can consider the Hankel operator over complex (Hardy) spaces, $H_{\phi}:\mathcal{H}^{2}\rightarrow\mathcal{H}^{2}_{-}$, associated to a function $\phi(z)\in\mathcal{L}^{2}(\mathbb{T})$, the symbol. In this case $\mathbf{H}(j,k)=\widehat{\phi}(-j-k-1)$ for $j,k\geq 0$, and $\mathbb{P}_{-}\phi$ is rational of rank $n$ in the sense of Definition 2.8.

We can now introduce the main result of Adamyan, Arov and Krein [1]. The theorem, stated for Hankel operators over Hardy spaces, shows that for infinite-dimensional Hankel matrices the constraint of preserving the Hankel property does not affect the achievable approximation error.

Theorem 2.8 (AAK-1 [1]).

Let $H_{\phi}$ be a compact Hankel operator of rank $n$ and singular numbers $\sigma_{m}$, with $0\leq m<n$ and $\sigma_{0}\geq\dots\geq\sigma_{n-1}>0$. Then there exists a unique Hankel operator $G$ of rank $k<n$ such that:

\|\mathbf{H}-\mathbf{G}\|=\sigma_{k}.  (11)

We denote with $\mathcal{R}_{k}\subset\mathcal{H}^{\infty}_{-}$ the set of strictly proper rational functions of rank $k$, and we consider the set:

\mathcal{H}^{\infty}_{k}=\{\psi(z)\in\mathcal{L}^{\infty}(\mathbb{T}):\ \exists g(z)\in\mathcal{R}_{k},\ \exists l(z)\in\mathcal{H}^{\infty},\ \psi(z)=g(z)+l(z)\}.  (12)

We can reformulate the theorem in terms of symbols.

Theorem 2.9 (AAK-2 [1]).

Let $\phi(z)\in\mathcal{L}^{\infty}(\mathbb{T})$. Then there exists $\psi(z)\in\mathcal{H}^{\infty}_{k}$ such that:

\|\phi(z)-\psi(z)\|_{\infty}=\sigma_{k}(H_{\phi}).  (13)

The solutions of Theorems 2.8 and 2.9 are strictly related (proof in Appendix A).

Corollary 1.

Let $\psi(z)=g(z)+l(z)\in\mathcal{H}^{\infty}_{k}$, with $g(z)\in\mathcal{R}_{k}$ and $l(z)\in\mathcal{H}^{\infty}$. If $\psi(z)$ solves Equation 13, then $G=H_{g}$ is the unique Hankel operator from Theorem 2.8.

We state as a corollary the key point of the proof of the AAK theorem, which provides us with a practical way to find the best approximating symbol.

Corollary 2.

Let $\phi(z)$ and $\{\boldsymbol{\xi}_{k},\boldsymbol{\eta}_{k}\}$ be a symbol and a $\sigma_{k}$-Schmidt pair for $H_{\phi}$. A function $\psi(z)\in\mathcal{L}^{\infty}$ is the best AAK approximation according to Theorem 2.9 if and only if:

(\phi(z)-\psi(z))\xi^{+}_{k}(z)=\sigma_{k}\eta^{-}_{k}(z).  (14)

Moreover, the function $\psi(z)$ does not depend on the particular choice of the pair $\{\boldsymbol{\xi}_{k},\boldsymbol{\eta}_{k}\}$.

3 Approximate Minimization

3.1 Assumptions

To apply AAK theory to the approximate minimization problem, we consider only automata with real weights, defined over a one-letter alphabet. In this case, the free monoid generated by the single letter can be identified with $\mathbb{N}$, and canonically embedded into $\mathbb{Z}$. This passage is fundamental to use Fourier analysis and the isomorphism that leads to Theorems 2.8 and 2.9. Unfortunately, this idea cannot be directly generalized to bigger alphabets, since in that case we would obtain a free non-abelian monoid (not identifiable with $\mathbb{Z}$).

Theorem 2.8 requires the Hankel operator $H$ to be bounded. To ensure that a minimal WFA $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$ satisfies this condition, we assume $\rho(\mathbf{A})<1$, where $\rho$ is the spectral radius of $\mathbf{A}$ [11]. As a matter of fact, to guarantee the boundedness of the Hankel operator it is enough that the considered WFA computes a function $f\in\ell^{2}$ [11]. However, the stricter assumption on the spectral radius is needed when computing the symbol associated to a WFA. This condition directly implies the existence of the SVA, and of the Gramian matrices $\mathbf{P}$ and $\mathbf{Q}$, with $\mathbf{P}=\mathbf{Q}$ diagonal [11]. We assume that $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$ is in SVA form. In this case, given the size of the alphabet, the Hankel matrix $\mathbf{H}$ is symmetric, so $\mathbf{A}=\mathbf{A}^{\top}$. Moreover, if we denote with $\lambda_{i}$ the $i$-th non-zero eigenvalue of $\mathbf{H}$, and we consider the coordinates of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, we have $\boldsymbol{\alpha}_{i}=\operatorname{sgn}(\lambda_{i})\boldsymbol{\beta}_{i}$, where $\operatorname{sgn}(\lambda_{i})=\lambda_{i}/|\lambda_{i}|$.

For example, we note that a minimal GPA computes a function $f\in\ell^{1}$, so the condition on $\rho(\mathbf{A})$ is automatically satisfied by this class of WFAs [11]. Possible relaxations of the spectral radius assumption are discussed in Appendix C, together with an alternative method to find the optimal spectral-norm approximation of a Hankel matrix without extracting a WFA.

Finally, in this paper we only consider automata with weights in $\mathbb{R}$, though the results remain true for complex numbers. The method we present can be easily extended to vector-valued automata [35], but the solution to the optimal approximation problem will not be unique [33].

3.2 Problem Formulation

Let $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$ be a minimal WFA with $n$ states in SVA form, defined over a one-letter alphabet. Let $\mathbf{H}$ be the Hankel matrix of $A$; we denote with $\sigma_{i}$, for $0\leq i<n$, its singular numbers. Given a target number of states $k<n$, we say that a WFA $\widehat{A}_{k}$ with $k$ states solves the optimal spectral-norm approximate minimization problem if the Hankel matrix $\mathbf{G}$ of $\widehat{A}_{k}$ satisfies $\|\mathbf{H}-\mathbf{G}\|=\sigma_{k}(\mathbf{H})$. Note that the optimal spectral-norm approximate minimization problem is equivalent to the problem solved by Theorem 2.8, with the exception that here we insist on representing the inputs and outputs of the problem effectively by means of WFAs. Based on the AAK theory sketched in Section 2.3, we draw the following steps:

  1. Compute a symbol $\phi(z)$ for $H$ using Remark 2. We obtain the negative Fourier coefficients of $\phi(z)$ and derive its Fourier series.

  2. Compute the optimal symbol $\psi(z)$ using Corollary 2. The main challenge here is to find a suitable representation for the functions $\psi(z)$ and $e(z)=\phi(z)-\psi(z)$. We define them in terms of two auxiliary WFAs. The key point is to select constraints on their parameters to leverage the properties of weighted automata, while still keeping the formulation general.

  3. Extract the rational component by solving for $g(z)$ in Corollary 1. This step is arguably the most conceptually challenging, as it requires identifying the position of the function’s poles. In fact, we know from Theorem 2.6 that $g(z)$ has $k$ poles, all inside the unit disc.

  4. Find a WFA representation for $g(z)$. Since in Step 2 we parametrized the functions using WFAs, the expression of $g(z)$ directly reveals the WFA $\widehat{A}_{k}$. (A high-level sketch of the whole pipeline is given after this list.)
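As a roadmap, the following sketch strings the four steps together for a one-letter WFA in SVA form (NumPy assumed). The helper names solve_optimal_symbol and extract_stable_part are ours, not the paper’s; they are sketched in Sections 3.3.2 and 3.3.3 below:

```python
import numpy as np

def aak_approximate_minimization(alpha, A, beta, sigma, k):
    """Hypothetical end-to-end sketch of Steps 1-4 (one-letter alphabet,
    SVA form, rho(A) < 1; sigma holds the singular numbers of H)."""
    # Step 1: the symbol is phi(z) = alpha^T (z1 - A)^{-1} beta
    # (Section 3.3.1); it is given in closed form by the WFA parameters.

    # Step 2: parameters of the auxiliary WFA defining the optimal
    # symbol psi (Theorem 3.4).
    alpha_h, A_h, beta_h = solve_optimal_symbol(alpha, A, beta, sigma, k)

    # Step 3: keep the part of psi with poles inside the unit disc,
    # i.e. the rational component g (Theorem 3.6).
    alpha_p, A_p, beta_p = extract_stable_part(alpha_h, A_h, beta_h, k)

    # Step 4: these parameters are exactly the optimal WFA (Theorem 3.8).
    return alpha_p, A_p, beta_p
```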

3.3 Spectral-Norm Approximate Minimization

In the following sections we will consider a minimal WFA $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$ with $n$ states in SVA form, defined over a one-letter alphabet $\Sigma=\{a\}$, its Hankel matrix $\mathbf{H}$, corresponding to the bounded operator $H$, and the singular numbers $\sigma_{i}$, for $0\leq i<n$. Let $f:\Sigma^{*}\rightarrow\mathbb{R}$ be the function realized by $A$. We denote by $x$ the string where $a$ is repeated $x$ times, so we have $f(x)=\boldsymbol{\alpha}^{\top}\mathbf{A}^{x}\boldsymbol{\beta}$.

3.3.1 Computation of a Symbol for A

To determine the symbol $\phi(z)$ of $H$, we recall that each entry of the Hankel matrix corresponds simultaneously to a value of $f$ and to a negative Fourier coefficient of $\phi(z)$. In fact, as seen in Remark 2, we have:

\mathbf{H}=\begin{pmatrix}f_{A}(0)&f_{A}(1)&f_{A}(2)&\dots\\ f_{A}(1)&f_{A}(2)&f_{A}(3)&\dots\\ f_{A}(2)&f_{A}(3)&f_{A}(4)&\dots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix}=\begin{pmatrix}\widehat{\phi}(-1)&\widehat{\phi}(-2)&\widehat{\phi}(-3)&\dots\\ \widehat{\phi}(-2)&\widehat{\phi}(-3)&\widehat{\phi}(-4)&\dots\\ \widehat{\phi}(-3)&\widehat{\phi}(-4)&\widehat{\phi}(-5)&\dots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix}.  (15)

We obtain:

\mathbb{P}_{-}\phi(z)=\sum_{k\geq 0}f(k)z^{-k-1}=\sum_{k\geq 0}\boldsymbol{\alpha}^{\top}\mathbf{A}^{k}\boldsymbol{\beta}z^{-k-1}=\boldsymbol{\alpha}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\boldsymbol{\beta}  (16)

where we use the fact that $\rho(\mathbf{A})<1$ for the last equality. Since the function obtained is already bounded, we can directly consider $\phi(z)=\boldsymbol{\alpha}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\boldsymbol{\beta}$ as a symbol for $H$.

Example 3.1.

If we apply the formula in Equation 16 to the GPA of Example 2.3, we recover the rational function $\phi(z)=\frac{3z}{4z^{2}-1}$ found in Example 2.5.

3.3.2 Computation of the Optimal Symbol

We consider two auxiliary WFAs. Let $\widehat{A}=\langle\widehat{\boldsymbol{\alpha}},\widehat{\mathbf{A}},\widehat{\boldsymbol{\beta}}\rangle$ be a WFA with $j\geq k$ states, satisfying the following properties:

  1. $1$ is not an eigenvalue of $\widehat{\mathbf{A}}$;

  2. the automaton $E=\langle\boldsymbol{\alpha}_{e},\mathbf{A}_{e},\boldsymbol{\beta}_{e}\rangle$ is minimal, with

\mathbf{A}_{e}=\begin{pmatrix}\mathbf{A}&\mathbf{0}\\ \mathbf{0}&\widehat{\mathbf{A}}\end{pmatrix},\quad\boldsymbol{\alpha}_{e}=\begin{pmatrix}\boldsymbol{\alpha}\\ -\widehat{\boldsymbol{\alpha}}\end{pmatrix},\quad\boldsymbol{\beta}_{e}=\begin{pmatrix}\boldsymbol{\beta}\\ \widehat{\boldsymbol{\beta}}\end{pmatrix}.  (17)

Using the parameters of the automaton $\widehat{A}$ and a constant $C$, we define a function $\psi(z)=\widehat{\boldsymbol{\alpha}}^{\top}(z\mathbf{1}-\widehat{\mathbf{A}})^{-1}\widehat{\boldsymbol{\beta}}+C$. We remark that the poles of $\psi(z)$ correspond to the eigenvalues of $\widehat{\mathbf{A}}$, counted with their multiplicities. By assumption, $1$ is not an eigenvalue of $\widehat{\mathbf{A}}$, so $\psi(z)$ does not have any poles on the unit circle, and therefore $\psi(z)\in\mathcal{L}^{\infty}(\mathbb{T})$. Analogously, the function $e(z)=\phi(z)-\psi(z)=\boldsymbol{\alpha}_{e}^{\top}(z\mathbf{1}-\mathbf{A}_{e})^{-1}\boldsymbol{\beta}_{e}-C$ is also bounded on the circle.

Our objective is to compute the parameters of $\widehat{A}=\langle\widehat{\boldsymbol{\alpha}},\widehat{\mathbf{A}},\widehat{\boldsymbol{\beta}}\rangle$ that make $\psi(z)$ the best approximation of $\phi(z)$ according to Theorem 2.9. In particular, we will use Corollary 2 to find the triple $\widehat{\boldsymbol{\alpha}},\widehat{\mathbf{A}},\widehat{\boldsymbol{\beta}}$ such that $\psi(z)$ satisfies Equation 14. Note that, for this purpose, the constant term $C\in\mathcal{H}^{\infty}$ becomes necessary to characterize $\psi(z)$. In fact, while the $\mathcal{H}^{\infty}$-component of the symbol does not affect the Hankel norm, it plays a role in the computation of the $\mathcal{L}^{\infty}$-norm (in Equation 13) according to Nehari’s theorem (Theorem A.1), so it cannot be dismissed.

The following theorem provides us with an explicit expression for the functions in the Hardy spaces corresponding to a $\sigma_{k}$-Schmidt pair.

Theorem 3.2.

Let $\sigma_{k}$ be a singular number of the Hankel operator $H$. The singular functions associated with the $\sigma_{k}$-Schmidt pair $\{\boldsymbol{\xi}_{k},\boldsymbol{\eta}_{k}\}$ of $H$ are:

\xi^{+}_{k}(z)=\sigma_{k}^{-1/2}\boldsymbol{\alpha}^{\top}(\mathbf{1}-z\mathbf{A})^{-1}\mathbf{e}_{k}  (18)
\eta^{-}_{k}(z)=\sigma_{k}^{-1/2}\boldsymbol{\beta}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\mathbf{e}_{k}.  (19)

If $\psi(z)$ is the best approximation to the symbol, then $\sigma_{k}^{-1}e(z)$ has modulus $1$ almost everywhere on the unit circle (i.e. it is unimodular).

Proof.

Let $\mathbf{F}$ and $\mathbf{B}$ be the forward and backward matrices, respectively, with $\mathbf{H}=\mathbf{F}\mathbf{B}^{\top}$, $\mathbf{P}=\mathbf{F}^{\top}\mathbf{F}$, $\mathbf{Q}=\mathbf{B}^{\top}\mathbf{B}$. We consider the $\sigma_{k}$-Schmidt pair $\{\boldsymbol{\xi}_{k},\boldsymbol{\eta}_{k}\}$. By definition, $\mathbf{H}^{\top}\mathbf{H}\boldsymbol{\xi}_{k}=\sigma_{k}^{2}\boldsymbol{\xi}_{k}$. Rewriting in terms of the FB factorization, we obtain:

\mathbf{H}^{\top}\mathbf{H}\boldsymbol{\xi}_{k}=\sigma_{k}^{2}\boldsymbol{\xi}_{k}  (20)
\mathbf{B}\mathbf{F}^{\top}\mathbf{F}\mathbf{B}^{\top}\boldsymbol{\xi}_{k}=\sigma_{k}^{2}\boldsymbol{\xi}_{k}  (21)
\mathbf{B}\mathbf{P}\mathbf{B}^{\top}\boldsymbol{\xi}_{k}=\sigma_{k}^{2}\boldsymbol{\xi}_{k}  (22)
\mathbf{B}\mathbf{P}\mathbf{e}_{k}=\sigma_{k}^{2}\boldsymbol{\xi}_{k}  (23)

where in the last step we set $\mathbf{e}_{k}=\mathbf{B}^{\top}\boldsymbol{\xi}_{k}$, to reduce the SVD problem of $\mathbf{H}$ to that of $\mathbf{Q}\mathbf{P}$. Note that, since $\mathbf{P}$ and $\mathbf{Q}$ are diagonal, $\mathbf{e}_{k}$ is the $k$-th coordinate vector $(0,\dots,0,1,0,\dots,0)^{\top}$. Since $\mathbf{e}_{k}$ is an eigenvector of $\mathbf{Q}\mathbf{P}$ for $\sigma_{k}^{2}$, we get:

\mathbf{B}\mathbf{Q}^{-1}\mathbf{Q}\mathbf{P}\mathbf{e}_{k}=\sigma_{k}^{2}\boldsymbol{\xi}_{k}  (24)
\mathbf{B}\mathbf{Q}^{-1}\mathbf{e}_{k}=\boldsymbol{\xi}_{k}.  (25)

Moreover, $\mathbf{H}$ is symmetric, so the singular vectors $\boldsymbol{\eta}_{k}$ and $\boldsymbol{\xi}_{k}$ have the same coordinates up to the sign of the corresponding eigenvalues. We obtain:

\xi^{+}_{k}(z)=\sum_{j=0}^{\infty}\sigma_{k}^{-1/2}\boldsymbol{\alpha}^{\top}\mathbf{A}^{j}\mathbf{e}_{k}z^{j}=\sigma_{k}^{-1/2}\boldsymbol{\alpha}^{\top}(\mathbf{1}-z\mathbf{A})^{-1}\mathbf{e}_{k}  (26)
\eta^{-}_{k}(z)=\sum_{j=0}^{\infty}\sigma_{k}^{-1/2}\boldsymbol{\beta}^{\top}\mathbf{A}^{j}\mathbf{e}_{k}z^{-j-1}=\sigma_{k}^{-1/2}\boldsymbol{\beta}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\mathbf{e}_{k}  (27)

where the singular functions have been computed following Equations 9 and 10. If $r$ is the multiplicity of $\sigma_{k}$, from Corollary 2 we get the following fundamental equation:

(\phi(z)-\psi(z))\boldsymbol{\alpha}^{\top}(\mathbf{1}-z\mathbf{A})^{-1}\mathbf{V}=\sigma_{k}\boldsymbol{\beta}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\mathbf{V}

where $\mathbf{V}=\begin{pmatrix}\mathbf{0}&\mathbf{1}_{r}\end{pmatrix}^{\top}$ is an $n\times r$ matrix. Consequently, we obtain the function:

\sigma_{k}^{-1}e(z)=\frac{\boldsymbol{\beta}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\mathbf{V}}{\boldsymbol{\alpha}^{\top}(\mathbf{1}-z\mathbf{A})^{-1}\mathbf{V}}

which is unimodular, since $\boldsymbol{\alpha}_{i}=\operatorname{sgn}(\lambda_{i})\boldsymbol{\beta}_{i}$. ∎
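As a numerical sanity check of Equations (18)-(19) (a sketch assuming NumPy; the automaton is the GPA of Example 2.3, with singular numbers $4/5$ and $1/5$), the coefficient sequences of $\xi_{k}^{+}$ and $\eta_{k}^{-}$ indeed form a $\sigma_{k}$-Schmidt pair of a large truncation of $\mathbf{H}$:

```python
import numpy as np

# The GPA of Example 2.3: H is symmetric with singular numbers 4/5 and 1/5
A = np.array([[0., 0.5], [0.5, 0.]])
alpha = np.array([np.sqrt(3) / 2, 0.])
beta = alpha.copy()
sigma = np.array([4 / 5, 1 / 5])

N, k = 60, 1
Apow = [np.linalg.matrix_power(A, j) for j in range(2 * N)]
H = np.array([[alpha @ Apow[i + j] @ beta for j in range(N)] for i in range(N)])

# Coefficient sequences of xi_k^+ and eta_k^- (Equations 18-19):
# xi_k(j) = sigma_k^{-1/2} alpha^T A^j e_k, eta_k(j) = sigma_k^{-1/2} beta^T A^j e_k
e_k = np.eye(2)[k]
xi = np.array([alpha @ Apow[j] @ e_k for j in range(N)]) / np.sqrt(sigma[k])
eta = np.array([beta @ Apow[j] @ e_k for j in range(N)]) / np.sqrt(sigma[k])

print(np.linalg.norm(xi))                   # ~ 1: Schmidt vectors have norm 1
print(np.allclose(H @ xi, sigma[k] * eta))  # H xi = sigma_k eta
```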

It is reasonable to wonder how the fact that $\sigma_{k}^{-1}e(z)$ is unimodular reflects on the structure of the WFA $E=\langle\boldsymbol{\alpha}_{e},\mathbf{A}_{e},\boldsymbol{\beta}_{e}\rangle$ associated with it. We remark that, a priori, the controllability and observability Gramians of $E$ might not be well defined. The following theorem provides us with two matrices $\mathbf{P}_{e}$ and $\mathbf{Q}_{e}$ satisfying properties similar to those of the Gramians. This theorem is the analogue of a control theory result [16], rephrased in terms of WFAs. A sketch of the proof, which relies on the minimality of the WFA $E$ [38], can be found in Appendix B. For the detailed version of the proof and the original theorem we refer the reader to [16].

Theorem 3.3 ([16]).

Consider the function $e(z)=\boldsymbol{\alpha}_{e}^{\top}(z\mathbf{1}-\mathbf{A}_{e})^{-1}\boldsymbol{\beta}_{e}-C$ and the corresponding minimal WFA $E=\langle\boldsymbol{\alpha}_{e},\mathbf{A}_{e},\boldsymbol{\beta}_{e}\rangle$ associated with it. If $\sigma_{k}^{-1}e(z)$ is unimodular, then there exists a unique pair of symmetric invertible matrices $\mathbf{P}_{e}$ and $\mathbf{Q}_{e}$ satisfying:

  • (a) $\mathbf{P}_{e}-\mathbf{A}_{e}\mathbf{P}_{e}\mathbf{A}_{e}^{\top}=\boldsymbol{\beta}_{e}\boldsymbol{\beta}_{e}^{\top}$

  • (b) $\mathbf{Q}_{e}-\mathbf{A}_{e}^{\top}\mathbf{Q}_{e}\mathbf{A}_{e}=\boldsymbol{\alpha}_{e}\boldsymbol{\alpha}_{e}^{\top}$

  • (c) $\mathbf{P}_{e}\mathbf{Q}_{e}=\sigma^{2}_{k}\mathbf{1}$

We can now derive the parameters of the WFA $\widehat{A}=\langle\widehat{\boldsymbol{\alpha}},\widehat{\mathbf{A}},\widehat{\boldsymbol{\beta}}\rangle$.

Theorem 3.4.

Let $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$ be a minimal WFA with $n$ states in its SVA form, and let $\phi(z)=\boldsymbol{\alpha}^{\top}(z\mathbf{1}-\mathbf{A})^{-1}\boldsymbol{\beta}$ be a symbol for its Hankel operator $H$. Let $\sigma_{k}$ be a singular number of multiplicity $r$ for $H$, with $\sigma_{0}\geq\dots>\sigma_{k}=\dots=\sigma_{k+r-1}>\sigma_{k+r}\geq\dots\geq\sigma_{n-1}>0$. We can partition the Gramian matrices $\mathbf{P}$, $\mathbf{Q}$ as follows:

\mathbf{P}=\mathbf{Q}=\begin{pmatrix}\mathbf{\Sigma}&\mathbf{0}\\ \mathbf{0}&\sigma_{k}\mathbf{1}_{r}\end{pmatrix},  (28)

where $\boldsymbol{\Sigma}\in\mathbb{R}^{(n-r)\times(n-r)}$ is the diagonal matrix containing the remaining singular numbers, and partition $\mathbf{A}$, $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ conformally to the Gramians:

\mathbf{A}=\begin{pmatrix}\mathbf{A}_{11}&\mathbf{A}_{12}\\ \mathbf{A}_{12}^{\top}&\mathbf{A}_{22}\end{pmatrix},\quad\boldsymbol{\alpha}=\begin{pmatrix}\boldsymbol{\alpha}_{1}\\ \boldsymbol{\alpha}_{2}\end{pmatrix},\quad\boldsymbol{\beta}=\begin{pmatrix}\boldsymbol{\beta}_{1}\\ \boldsymbol{\beta}_{2}\end{pmatrix}.  (29)

Let $\mathbf{R}=\sigma_{k}^{2}\mathbf{1}_{n-r}-\mathbf{\Sigma}^{2}$, and denote by $(\cdot)^{+}$ the Moore-Penrose pseudo-inverse. If the function $\psi(z)=\widehat{\boldsymbol{\alpha}}^{\top}(z\mathbf{1}-\widehat{\mathbf{A}})^{-1}\widehat{\boldsymbol{\beta}}+C$ is the best approximation of $\phi(z)$, then:

  • If $\boldsymbol{\alpha}_{2}\neq\mathbf{0}$:

\begin{cases}\widehat{\boldsymbol{\beta}}=-\widehat{\mathbf{A}}\mathbf{A}_{12}(\boldsymbol{\beta}_{2}^{\top})^{+}\\ \widehat{\boldsymbol{\alpha}}=\widehat{\mathbf{A}}^{\top}\mathbf{R}\mathbf{A}_{12}(\boldsymbol{\alpha}_{2}^{\top})^{+}\\ \widehat{\mathbf{A}}(\mathbf{A}_{11}-\mathbf{A}_{12}(\boldsymbol{\beta}_{2}^{\top})^{+}\boldsymbol{\beta}_{1}^{\top})=\mathbf{1}\end{cases}  (30)

  • If $\boldsymbol{\alpha}_{2}=\mathbf{0}$:

\begin{cases}\widehat{\boldsymbol{\beta}}=(\mathbf{1}-\widehat{\mathbf{A}}\mathbf{A}_{11})(\boldsymbol{\beta}_{1}^{\top})^{+}\\ \widehat{\boldsymbol{\alpha}}=-(\mathbf{R}-\widehat{\mathbf{A}}^{\top}\mathbf{R}\mathbf{A}_{11})(\boldsymbol{\alpha}_{1}^{\top})^{+}\\ \widehat{\mathbf{A}}\mathbf{A}_{12}=\mathbf{0}\end{cases}  (31)
Proof of Theorem 3.4.

Since $e(z)=\phi(z)-\psi(z)$ and $\sigma_{k}^{-1}e(z)$ is unimodular, from Theorem 3.3 there exist two symmetric nonsingular matrices $\mathbf{P}_{e}$, $\mathbf{Q}_{e}$ satisfying the fixed point equations:

\mathbf{P}_{e}-\mathbf{A}_{e}\mathbf{P}_{e}\mathbf{A}_{e}^{\top}=\boldsymbol{\beta}_{e}\boldsymbol{\beta}_{e}^{\top}  (32)
\mathbf{Q}_{e}-\mathbf{A}_{e}^{\top}\mathbf{Q}_{e}\mathbf{A}_{e}=\boldsymbol{\alpha}_{e}\boldsymbol{\alpha}_{e}^{\top}  (33)

and such that $\mathbf{P}_{e}\mathbf{Q}_{e}=\sigma^{2}_{k}\mathbf{1}$. We can partition $\mathbf{P}_{e}$ and $\mathbf{Q}_{e}$ according to the definition of $\mathbf{A}_{e}$ (see Eq. 17):

\mathbf{P}_{e}=\begin{pmatrix}\mathbf{P}_{11}&\mathbf{P}_{12}\\ \mathbf{P}_{12}^{\top}&\mathbf{P}_{22}\end{pmatrix},\quad\mathbf{Q}_{e}=\begin{pmatrix}\mathbf{Q}_{11}&\mathbf{Q}_{12}\\ \mathbf{Q}_{12}^{\top}&\mathbf{Q}_{22}\end{pmatrix}.

From Equations 32 and 33, we note that $\mathbf{P}_{11}$ and $\mathbf{Q}_{11}$ correspond to the controllability and observability Gramians of $A$:

\mathbf{P}_{11}=\mathbf{Q}_{11}=\mathbf{P}=\begin{pmatrix}\mathbf{\Sigma}&\mathbf{0}\\ \mathbf{0}&\sigma_{k}\mathbf{1}\end{pmatrix}.

Moreover, since $\mathbf{P}_{e}\mathbf{Q}_{e}=\sigma_{k}^{2}\mathbf{1}$, we get $\mathbf{P}_{12}\mathbf{Q}_{12}^{\top}=\sigma_{k}^{2}\mathbf{1}-\mathbf{P}^{2}$. It follows that $\mathbf{P}_{12}\mathbf{Q}_{12}^{\top}$ has rank $n-r$. Without loss of generality we can set $\dim\widehat{\mathbf{A}}=j=n-r$, and choose an appropriate basis for the state space such that $\mathbf{P}_{12}=\begin{pmatrix}\mathbf{1}&\mathbf{0}\end{pmatrix}^{\top}$ and $\mathbf{Q}_{12}=\begin{pmatrix}\mathbf{R}&\mathbf{0}\end{pmatrix}^{\top}$, with $\mathbf{R}=\sigma_{k}^{2}\mathbf{1}-\mathbf{\Sigma}^{2}$. Once $\mathbf{P}_{12}$ and $\mathbf{Q}_{12}$ are fixed, the values of $\mathbf{P}_{22}$ and $\mathbf{Q}_{22}$ are automatically determined. We obtain:

\mathbf{P}_{e}=\begin{pmatrix}\mathbf{\Sigma}&\mathbf{0}&\mathbf{1}\\ \mathbf{0}&\sigma_{k}\mathbf{1}&\mathbf{0}\\ \mathbf{1}&\mathbf{0}&-\mathbf{\Sigma}\mathbf{R}^{-1}\end{pmatrix},\quad\mathbf{Q}_{e}=\begin{pmatrix}\mathbf{\Sigma}&\mathbf{0}&\mathbf{R}\\ \mathbf{0}&\sigma_{k}\mathbf{1}&\mathbf{0}\\ \mathbf{R}&\mathbf{0}&-\mathbf{\Sigma}\mathbf{R}\end{pmatrix}.  (34)

Now that we have an expression for the matrices $\mathbf{P}_{e}$ and $\mathbf{Q}_{e}$ of Theorem 3.3, we can rewrite the fixed point equations to derive the parameters $\widehat{\boldsymbol{\alpha}}$, $\widehat{\mathbf{A}}$ and $\widehat{\boldsymbol{\beta}}$. We obtain the following systems:

\begin{cases}\mathbf{P}-\mathbf{A}\mathbf{P}\mathbf{A}=\boldsymbol{\beta}\boldsymbol{\beta}^{\top}\\ \mathbf{N}-\mathbf{A}\mathbf{N}\widehat{\mathbf{A}}^{\top}=\boldsymbol{\beta}\widehat{\boldsymbol{\beta}}^{\top}\\ -\mathbf{\Sigma}\mathbf{R}^{-1}+\widehat{\mathbf{A}}\mathbf{\Sigma}\mathbf{R}^{-1}\widehat{\mathbf{A}}^{\top}=\widehat{\boldsymbol{\beta}}\widehat{\boldsymbol{\beta}}^{\top}\end{cases}\quad\begin{cases}\mathbf{P}-\mathbf{A}\mathbf{P}\mathbf{A}=\boldsymbol{\alpha}\boldsymbol{\alpha}^{\top}\\ \mathbf{M}-\mathbf{A}^{\top}\mathbf{M}\widehat{\mathbf{A}}=-\boldsymbol{\alpha}\widehat{\boldsymbol{\alpha}}^{\top}\\ -\mathbf{\Sigma}\mathbf{R}+\widehat{\mathbf{A}}^{\top}\mathbf{\Sigma}\mathbf{R}\widehat{\mathbf{A}}=\widehat{\boldsymbol{\alpha}}\widehat{\boldsymbol{\alpha}}^{\top}\end{cases}  (35)

where $\mathbf{N}=\begin{pmatrix}\mathbf{1}\\ \mathbf{0}\end{pmatrix}$ and $\mathbf{M}=\begin{pmatrix}\mathbf{R}\\ \mathbf{0}\end{pmatrix}$. We can rewrite the second equation of each system as follows:

\begin{cases}\mathbf{1}-\mathbf{A}_{11}\widehat{\mathbf{A}}^{\top}=\boldsymbol{\beta}_{1}\widehat{\boldsymbol{\beta}}^{\top}\\ -\mathbf{A}_{12}^{\top}\widehat{\mathbf{A}}^{\top}=\boldsymbol{\beta}_{2}\widehat{\boldsymbol{\beta}}^{\top}\end{cases}\quad\begin{cases}\mathbf{R}-\mathbf{A}_{11}\mathbf{R}\widehat{\mathbf{A}}=-\boldsymbol{\alpha}_{1}\widehat{\boldsymbol{\alpha}}^{\top}\\ \widehat{\mathbf{A}}^{\top}\mathbf{R}\mathbf{A}_{12}=\widehat{\boldsymbol{\alpha}}\boldsymbol{\alpha}_{2}^{\top}\end{cases}  (36)

If $\boldsymbol{\alpha}_{2}\neq\mathbf{0}$, then also $\boldsymbol{\beta}_{2}\neq\mathbf{0}$ (recall that $\boldsymbol{\alpha}_{i}=\operatorname{sgn}(\lambda_{i})\boldsymbol{\beta}_{i}$), and we have:

\begin{cases}\widehat{\boldsymbol{\beta}}=-\widehat{\mathbf{A}}\mathbf{A}_{12}(\boldsymbol{\beta}_{2}^{\top})^{+}\\ \widehat{\boldsymbol{\alpha}}=\widehat{\mathbf{A}}^{\top}\mathbf{R}\mathbf{A}_{12}(\boldsymbol{\alpha}_{2}^{\top})^{+}\\ \widehat{\mathbf{A}}(\mathbf{A}_{11}-\mathbf{A}_{12}(\boldsymbol{\beta}_{2}^{\top})^{+}\boldsymbol{\beta}_{1}^{\top})=\mathbf{1}\end{cases}  (37)

with $(\boldsymbol{\alpha}_{2}^{\top})^{+}=\frac{\boldsymbol{\alpha}_{2}}{\boldsymbol{\alpha}_{2}^{\top}\boldsymbol{\alpha}_{2}}$ and $(\boldsymbol{\beta}_{2}^{\top})^{+}=\frac{\boldsymbol{\beta}_{2}}{\boldsymbol{\beta}_{2}^{\top}\boldsymbol{\beta}_{2}}$.

If $\boldsymbol{\alpha}_{2}=\mathbf{0}$, we have $\widehat{\mathbf{A}}\mathbf{A}_{12}=\mathbf{0}$. We remark that $\widehat{\mathbf{A}}$ has size $(n-r)\times(n-r)$, while $\mathbf{A}_{12}$ is $(n-r)\times r$, so the system of equations corresponding to $\widehat{\mathbf{A}}\mathbf{A}_{12}=\mathbf{0}$ is underdetermined if $r<\frac{n}{2}$, in which case we can find an alternative set of solutions:

\begin{cases}\widehat{\boldsymbol{\beta}}=(\mathbf{1}-\widehat{\mathbf{A}}\mathbf{A}_{11})(\boldsymbol{\beta}_{1}^{\top})^{+}\\ \widehat{\boldsymbol{\alpha}}=-(\mathbf{R}-\widehat{\mathbf{A}}^{\top}\mathbf{R}\mathbf{A}_{11})(\boldsymbol{\alpha}_{1}^{\top})^{+}\\ \widehat{\mathbf{A}}\mathbf{A}_{12}=\mathbf{0}\end{cases}  (38)

with $\widehat{\mathbf{A}}\neq\mathbf{0}$. On the other hand, if $r\geq\frac{n}{2}$, i.e. if the multiplicity of the singular number $\sigma_{k}$ is more than half the size of the original WFA, the system might not have any solution unless $\widehat{\mathbf{A}}=\mathbf{0}$ (or unless $\mathbf{A}_{12}$ was zero to begin with). In this setting the proposed method returns $\widehat{\mathbf{A}}=\mathbf{0}$. An alternative in this case is to search for an approximation of size $k-1$ or $k+1$, so that $r<\frac{n}{2}$ and the system in Equation 38 is underdetermined. ∎
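The $\boldsymbol{\alpha}_{2}\neq\mathbf{0}$ branch translates directly into a few lines of linear algebra. Below is a minimal sketch (our naming, not the paper’s implementation), assuming NumPy, a WFA already in SVA form with the state space ordered so that the $\sigma_{k}$ block of Equation 28 comes last, and a simple singular number ($r=1$):

```python
import numpy as np

def solve_optimal_symbol(alpha, A, beta, sigma, k):
    """Sketch of the alpha_2 != 0 case of Theorem 3.4 (Equation 30).

    alpha, A, beta: WFA in SVA form, states permuted so the sigma_k block
    is last; sigma: singular numbers of H; k: target rank; r = 1 assumed.
    """
    n, r = A.shape[0], 1
    m = n - r
    Sigma = np.diag(np.delete(sigma, k))       # remaining singular numbers
    R = sigma[k] ** 2 * np.eye(m) - Sigma @ Sigma

    A11, A12 = A[:m, :m], A[:m, m:]
    beta1, beta2 = beta[:m], beta[m:]
    alpha2 = alpha[m:]

    beta2_p = beta2 / (beta2 @ beta2)          # (beta_2^T)^+
    alpha2_p = alpha2 / (alpha2 @ alpha2)      # (alpha_2^T)^+

    # Third equation of (30): A_hat (A11 - A12 (beta_2^T)^+ beta_1^T) = 1
    A_hat = np.linalg.inv(A11 - np.outer(A12 @ beta2_p, beta1))
    beta_hat = -A_hat @ A12 @ beta2_p          # first equation of (30)
    alpha_hat = A_hat.T @ R @ A12 @ alpha2_p   # second equation of (30)
    return alpha_hat, A_hat, beta_hat
```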

3.3.3 Extraction of the Rational Component

The function $\psi(z)=\widehat{\boldsymbol{\alpha}}^{\top}(z\mathbf{1}-\widehat{\mathbf{A}})^{-1}\widehat{\boldsymbol{\beta}}+C$ found in Theorem 3.4 corresponds to the solution of Theorem 2.9. To find the solution to the approximation problem we only need to “isolate” the function $g(z)\in\mathcal{R}_{k}$, i.e. the rational component. To do this, we study the position of the poles of $\psi(z)$, since the poles of a strictly proper rational function lie in the unit disc (Theorem 2.6). As noted before, we parametrized $\psi(z)$ so that its poles correspond to the eigenvalues of $\widehat{\mathbf{A}}$. After a change of basis (detailed in Paragraph 3.3.3.1), we can rewrite $\widehat{\mathbf{A}}$ in block-diagonal form:

\widehat{\mathbf{A}}=\begin{pmatrix}\widehat{\mathbf{A}}_{+}&\mathbf{0}\\ \mathbf{0}&\widehat{\mathbf{A}}_{-}\end{pmatrix}  (39)

where the modulus of the eigenvalues of $\widehat{\mathbf{A}}_{+}$ (resp. $\widehat{\mathbf{A}}_{-}$) is smaller (resp. greater) than one. We apply the same change of coordinates to $\widehat{\boldsymbol{\alpha}}$ and $\widehat{\boldsymbol{\beta}}$.

To conclude the study of the eigenvalues of $\widehat{\mathbf{A}}$, we need the following auxiliary result of Ostrowski [32]. A proof of this theorem can be found in [41].

Theorem 3.5 ([32]).

Let $|\Sigma|=1$, and let $\mathbf{P}$ be a solution to the fixed point equation $X-\mathbf{A}X\mathbf{A}^{\top}=\boldsymbol{\beta}\boldsymbol{\beta}^{\top}$ for the WFA $A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle$. If $A$ is reachable, then:

  • The number of eigenvalues $\lambda$ of $\mathbf{A}$ such that $|\lambda|<1$ is equal to the number of positive eigenvalues of $\mathbf{P}$.

  • The number of eigenvalues $\lambda$ of $\mathbf{A}$ such that $|\lambda|>1$ is equal to the number of negative eigenvalues of $\mathbf{P}$.

We can finally find the rational component of the function $\psi(z)$, i.e. the function $g(z)$ from Corollary 1 necessary to solve the approximate minimization problem.

Theorem 3.6.

Let $\widehat{\mathbf{A}}_{+},\widehat{\boldsymbol{\alpha}}_{+},\widehat{\boldsymbol{\beta}}_{+}$ be as in Theorem 3.4. The rational component of $\psi(z)$ is the function $g(z)=\widehat{\boldsymbol{\alpha}}_{+}^{\top}(z\mathbf{1}-\widehat{\mathbf{A}}_{+})^{-1}\widehat{\boldsymbol{\beta}}_{+}$.

Proof.

Clearly $\psi(z)=g(z)+l(z)$, with $l(z)=\widehat{\boldsymbol{\alpha}}_{-}^{\top}(z\mathbf{1}-\widehat{\mathbf{A}}_{-})^{-1}\widehat{\boldsymbol{\beta}}_{-}$, $l\in\mathcal{H}^{\infty}$. To conclude the proof we need to show that $g(z)$ has $k$ poles inside the unit disc, and therefore has rank $k$. We do this by studying the position of the eigenvalues of $\widehat{\mathbf{A}}_{+}$.

Since $E$ is minimal, by definition $\widehat{A}$ is reachable, so we can use Theorem 3.5 and solve the problem by directly examining the eigenvalues of $-\mathbf{\Sigma}\mathbf{R}$. From the proof of Theorem 3.4 we have $-\boldsymbol{\Sigma}\mathbf{R}=\boldsymbol{\Sigma}(\boldsymbol{\Sigma}^{2}-\sigma^{2}_{k}\mathbf{1})$, where $\boldsymbol{\Sigma}$ is the diagonal matrix having as elements the singular numbers of $H$ different from $\sigma_{k}$. It follows that $-\boldsymbol{\Sigma}\mathbf{R}$ has exactly $k$ strictly positive eigenvalues, and $\widehat{\mathbf{A}}$ has $k$ eigenvalues with modulus smaller than $1$. Thus, $\widehat{\mathbf{A}}_{+}$ has $k$ eigenvalues, corresponding to the poles of $g(z)$. ∎

3.3.3.1 Block Diagonalization.

In this paragraph we detail the technical steps necessary to rewrite $\widehat{\mathbf{A}}$ in block-diagonal form. The problem of computing the Jordan form of a matrix is ill-conditioned, hence it is not suitable for our algorithmic purposes. Instead, we compute the Schur decomposition, i.e. we find an orthogonal matrix $\mathbf{U}$ such that $\mathbf{U}^{\top}\widehat{\mathbf{A}}\mathbf{U}$ is upper triangular, with the eigenvalues of $\widehat{\mathbf{A}}$ on the diagonal. We obtain:

\mathbf{T}=\mathbf{U}^{\top}\widehat{\mathbf{A}}\mathbf{U}=\begin{pmatrix}\widehat{\mathbf{A}}_{+}&\widehat{\mathbf{A}}_{12}\\ \mathbf{0}&\widehat{\mathbf{A}}_{-}\end{pmatrix}  (40)

where the eigenvalues are arranged in increasing order of modulus, and the modulus of those in $\widehat{\mathbf{A}}_{+}$ (resp. $\widehat{\mathbf{A}}_{-}$) is smaller (resp. greater) than one. To transform this upper triangular matrix into a block-diagonal one, we use the following result.

Theorem 3.7 ([37]).

Let 𝐓\mathbf{T} be the matrix defined in Equation 40. The matrix 𝐗\mathbf{X} is a solution of the equation 𝐀^+𝐗𝐗𝐀^+𝐀^12=𝟎\widehat{\mathbf{A}}_{+}\mathbf{X}-\mathbf{X}\widehat{\mathbf{A}}_{-}+\widehat{\mathbf{A}}_{12}=\mathbf{0} if and only if the matrices

𝐌=(𝟏𝐗𝟎𝟏),and𝐌1=(𝟏𝐗𝟎𝟏)\mathbf{M}=\begin{pmatrix}\mathbf{1}&\mathbf{X}\\ \mathbf{0}&\mathbf{1}\end{pmatrix},\quad\text{and}\quad\mathbf{M}^{-1}=\begin{pmatrix}\mathbf{1}&-\mathbf{X}\\ \mathbf{0}&\mathbf{1}\end{pmatrix} (41)

satisfy:

\mathbf{M}^{-1}\mathbf{T}\mathbf{M}=\begin{pmatrix}\widehat{\mathbf{A}}_{+}&\mathbf{0}\\ \mathbf{0}&\widehat{\mathbf{A}}_{-}\end{pmatrix}. (42)

Setting 𝚪=(𝟏k𝟎)\boldsymbol{\Gamma}=\begin{pmatrix}\mathbf{1}_{k}&\mathbf{0}\end{pmatrix} we can now derive the rational component of the WFA:

𝐀^+=𝚪𝐌1𝐔𝐀^𝐔𝐌𝚪\displaystyle\widehat{\mathbf{A}}_{+}=\boldsymbol{\Gamma}\mathbf{M}^{-1}\mathbf{U}^{\top}\widehat{\mathbf{A}}\mathbf{U}\mathbf{M}\boldsymbol{\Gamma}^{\top} (43)
𝜶^+=𝚪𝐌𝐔𝜶^\displaystyle\widehat{\boldsymbol{\alpha}}_{+}=\boldsymbol{\Gamma}\mathbf{M}^{\top}\mathbf{U}^{\top}\widehat{\boldsymbol{\alpha}} (44)
𝜷^+=𝚪𝐌1𝐔𝜷^.\displaystyle\widehat{\boldsymbol{\beta}}_{+}=\boldsymbol{\Gamma}\mathbf{M}^{-1}\mathbf{U}^{\top}\widehat{\boldsymbol{\beta}}. (45)

3.3.4 Solution to the Approximation Problem

In the previous sections, we derived the rational function g(z), the rational component of the symbol of G, the operator that solves the approximation problem of Theorem 2.8. To find the solution to the approximation problem we only need to find the parameters of \widehat{A}_{k}, the optimal approximating WFA. These are directly revealed by the expression of g(z), thanks to the way we parametrized the functions.

Theorem 3.8.

Let A=𝛂,𝐀,𝛃A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle be a minimal WFA with nn states over a one-letter alphabet. Let AA be in its SVA form. The optimal spectral-norm approximation of rank kk is given by the WFA A^k=𝛂^+,𝐀^+,𝛃^+\widehat{A}_{k}=\langle\widehat{\boldsymbol{\alpha}}_{+},\widehat{\mathbf{A}}_{+},\widehat{\boldsymbol{\beta}}_{+}\rangle.

Proof.

From Corollary 1 we know that g(z)g(z) is the rational function associated with the Hankel matrix of the best approximation. Given the correspondence between the Fourier coefficients of g(z)g(z) and the entries of the matrix (Remark 2), we have:

g(z)=\widehat{\boldsymbol{\alpha}}_{+}^{\top}(z\mathbf{1}-\widehat{\mathbf{A}}_{+})^{-1}\widehat{\boldsymbol{\beta}}_{+}=\sum_{j\geq 0}\widehat{\boldsymbol{\alpha}}_{+}^{\top}\widehat{\mathbf{A}}_{+}^{j}\widehat{\boldsymbol{\beta}}_{+}z^{-j-1}=\sum_{j\geq 0}\bar{f}(j)z^{-j-1} (46)

where \bar{f}:\Sigma^{*}\rightarrow\mathbb{R} is the function computed by \widehat{A}_{k}, whose parameters are \widehat{\boldsymbol{\alpha}}_{+},\widehat{\mathbf{A}}_{+},\widehat{\boldsymbol{\beta}}_{+}. ∎
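The expansion in Equation 46 is easy to check numerically at any point outside the spectrum of \widehat{\mathbf{A}}_{+}; a minimal sketch with hypothetical parameters:

```python
import numpy as np

# Check of the expansion in Equation 46 at a point z with |z| > rho(A),
# using hypothetical WFA parameters.
alpha = np.array([1.0, -0.5])
A = np.array([[0.6, 0.1],
              [0.0, 0.4]])
beta = np.array([0.7, 0.2])
z = 2.0

resolvent = alpha @ np.linalg.inv(z * np.eye(2) - A) @ beta
series = sum(alpha @ np.linalg.matrix_power(A, j) @ beta * z**(-j - 1)
             for j in range(200))
print(np.isclose(resolvent, series))  # True
```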

3.4 Error Analysis

The theoretical foundations of AAK theory guarantee that the construction detailed in Section 3.3 produces the rank kk optimal spectral-norm approximation of a WFA satisfying our assumptions, and the singular number σk\sigma_{k} provides the exact error.

Similarly to the case of SVA truncation [11], due to the ordering of the singular numbers, the error decreases as k increases, meaning that allowing \widehat{A}_{k} more states guarantees a better approximation of A. We remark that, while the solution we propose is optimal in the spectral norm, the same is not necessarily true in other norms. Nonetheless, we have the following bound between the \ell^{2} norm and the spectral norm.

Theorem 3.9.

Let A be a minimal WFA computing f:\Sigma^{*}\rightarrow\mathbb{R}, with Hankel matrix \mathbf{H}. Let \widehat{A}_{k} be its optimal spectral-norm approximation, computing g:\Sigma^{*}\rightarrow\mathbb{R}, with Hankel matrix \mathbf{G}. Then:

fg2𝐇𝐆=σk.\|f-g\|_{\ell^{2}}\leq\|\mathbf{H}-\mathbf{G}\|=\sigma_{k}. (47)
Proof.

Let 𝐞0=(10)\mathbf{e}_{0}=\begin{pmatrix}1&0&\cdots\end{pmatrix}^{\top}, f:Σf:\Sigma^{*}\rightarrow\mathbb{R}, g:Σg:\Sigma^{*}\rightarrow\mathbb{R} with Hankel matrices 𝐇\mathbf{H} and 𝐆\mathbf{G}, respectively. We have:

fg2=(n=0|fngn|2)1/2=(𝐇𝐆)𝐞02sup𝐱2=1(𝐇𝐆)𝐱2=𝐇𝐆=σk\|f-g\|_{\ell^{2}}=\left(\sum_{n=0}^{\infty}|f_{n}-g_{n}|^{2}\right)^{1/2}=\|(\mathbf{H}-\mathbf{G})\mathbf{e}_{0}\|_{\ell^{2}}\leq\sup_{\|\mathbf{x}\|_{\ell^{2}}=1}\|(\mathbf{H}-\mathbf{G})\mathbf{x}\|_{\ell^{2}}=\|\mathbf{H}-\mathbf{G}\|=\sigma_{k}

where the second equality follows by observing that the first column of \mathbf{H}-\mathbf{G} contains exactly the values f_{n}-g_{n}, n\geq 0. ∎
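The same chain of inequalities can be observed on finite truncations; a minimal numerical sketch with two hypothetical functions over a one-letter alphabet:

```python
import numpy as np

# Finite-truncation check of Theorem 3.9's inequality, with hypothetical
# functions f(n) = 0.8^n and g(n) = 0.5^n over a one-letter alphabet.
N = 100
seq_f = 0.8 ** np.arange(2 * N)
seq_g = 0.5 ** np.arange(2 * N)

def hankel(seq, m):
    # Truncated Hankel matrix H(i, j) = seq(i + j), 0 <= i, j < m
    return np.array([[seq[i + j] for j in range(m)] for i in range(m)])

D = hankel(seq_f, N) - hankel(seq_g, N)
l2 = np.linalg.norm(seq_f[:N] - seq_g[:N])  # = ||D e_0||, first column of D
spectral = np.linalg.norm(D, 2)
print(l2 <= spectral)  # True
```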

4 Algorithm

In this section we describe the algorithm for spectral-norm approximate minimization. The algorithm takes as input a minimal WFA A with n states, in SVA form, with \rho(\mathbf{A})<1 and \boldsymbol{\alpha}_{2}\neq 0, together with its Gramian \mathbf{P} and a target number of states k<n. In the case \boldsymbol{\alpha}_{2}=0, it is enough to substitute Steps 4, 5 and 6 with their analogues from Equation 31. The constraints that A be minimal and in SVA form are not essential: a WFA with n states can be minimized in time O(n^{3}) [14], and its SVA computed in O(n^{3}) [11].

Using the results of Theorem 3.4, we outline in Algorithm 1, AAKapproximation, the steps necessary to extract the best spectral-norm approximation of a WFA.

input : A minimal WFA AA, with 𝜶20\boldsymbol{\alpha}_{2}\neq 0, nn states and in SVA form,
its Gramian 𝐏\mathbf{P}, a target number of states k<nk<n
output : A WFA A^k\widehat{A}_{k} with kk states
1 Let 𝜶1,𝜶2,𝜷1,𝜷2,𝐀11,𝐀12,𝐀22,𝚺\boldsymbol{\alpha}_{1},\boldsymbol{\alpha}_{2},\boldsymbol{\beta}_{1},\boldsymbol{\beta}_{2},\mathbf{A}_{11},\mathbf{A}_{12},\mathbf{A}_{22},\mathbf{\Sigma} be the blocks defined in Eq. 28
2 Let (𝜶2)+=𝜶2𝜶2𝜶2(\boldsymbol{\alpha}_{2}^{\top})^{+}=\frac{\boldsymbol{\alpha}_{2}}{\boldsymbol{\alpha}_{2}^{\top}\boldsymbol{\alpha}_{2}}, (𝜷2)+=𝜷2𝜷2𝜷2(\boldsymbol{\beta}_{2}^{\top})^{+}=\frac{\boldsymbol{\beta}_{2}}{\boldsymbol{\beta}_{2}^{\top}\boldsymbol{\beta}_{2}}
3 Let 𝐑=σk2𝟏𝚺2\mathbf{R}=\sigma_{k}^{2}\mathbf{1}-\mathbf{\Sigma}^{2}
4 Let 𝐀^=(𝐀11𝐀12(𝜷2)+𝜷1)1\widehat{\mathbf{A}}=(\mathbf{A}_{11}-\mathbf{A}_{12}(\boldsymbol{\beta}_{2}^{\top})^{+}\boldsymbol{\beta}_{1}^{\top})^{-1}
5 Let 𝜶^=𝐀^𝐑𝐀12(𝜶2)+\widehat{\boldsymbol{\alpha}}=\widehat{\mathbf{A}}^{\top}\mathbf{R}\mathbf{A}_{12}(\boldsymbol{\alpha}_{2}^{\top})^{+}
6 Let 𝜷^=𝐀^𝐀12(𝜷2)+\widehat{\boldsymbol{\beta}}=-\widehat{\mathbf{A}}\mathbf{A}_{12}(\boldsymbol{\beta}_{2}^{\top})^{+}
7 Let A^=𝜶^,𝐀^,𝜷^\widehat{A}=\langle\widehat{\boldsymbol{\alpha}},\widehat{\mathbf{A}},\widehat{\boldsymbol{\beta}}\rangle
8 Let A^k\widehat{A}_{k}\leftarrow BlockDiagonalize(A^\widehat{A})
return A^k\widehat{A}_{k}
Algorithm 1 AAKapproximation
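For concreteness, the steps above admit a direct NumPy transcription. The following sketch assumes the blocks of Equation 28 are available as arrays (the signature is hypothetical), and folds Steps 7 and 8 into a call to the block_diagonalize sketch given after Algorithm 2 below:

```python
import numpy as np

def aak_approximation(alpha1, alpha2, beta1, beta2, A11, A12, Sigma, sigma_k):
    # Step 2: pseudo-inverses of the row vectors alpha_2^T and beta_2^T
    alpha2_pinv = alpha2 / (alpha2 @ alpha2)
    beta2_pinv = beta2 / (beta2 @ beta2)
    # Step 3: R = sigma_k^2 I - Sigma^2
    R = sigma_k ** 2 * np.eye(Sigma.shape[0]) - Sigma @ Sigma
    # Step 4: A_hat = (A11 - A12 (beta_2^T)^+ beta_1^T)^{-1}
    A_hat = np.linalg.inv(A11 - np.outer(A12 @ beta2_pinv, beta1))
    # Step 5: alpha_hat = A_hat^T R A12 (alpha_2^T)^+
    alpha_hat = A_hat.T @ R @ (A12 @ alpha2_pinv)
    # Step 6: beta_hat = -A_hat A12 (beta_2^T)^+
    beta_hat = -A_hat @ (A12 @ beta2_pinv)
    # Steps 7-8: extract the rational component (see Algorithm 2 below)
    return block_diagonalize(A_hat, alpha_hat, beta_hat)
```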

The algorithm involves a call to Algorithm 2, BlockDiagonalize. In particular, this corresponds to the steps, outlined in Paragraph 3.3.3.1, necessary to derive the WFA \widehat{A}_{k} corresponding to the rational function g(z). We remark that Step 2 in BlockDiagonalize can be performed using the Bartels-Stewart algorithm [13].

input : A WFA A^\widehat{A}
output : A WFA \widehat{A}_{k} with \rho<1
1 Compute the Schur decomposition of 𝐀^=𝐔𝐓𝐔\widehat{\mathbf{A}}=\mathbf{U}\mathbf{T}\mathbf{U}^{\top}, where |T11||T22||T_{11}|\leq|T_{22}|\leq\dots
2 Solve 𝐀^11𝐗𝐗𝐀^22+𝐀^12=𝟎\widehat{\mathbf{A}}_{11}\mathbf{X}-\mathbf{X}\widehat{\mathbf{A}}_{22}+\widehat{\mathbf{A}}_{12}=\mathbf{0} for 𝐗\mathbf{X}
3 Let 𝐌=(𝟏𝐗𝟎𝟏)\mathbf{M}=\begin{pmatrix}\mathbf{1}&\mathbf{X}\\ \mathbf{0}&\mathbf{1}\end{pmatrix} and 𝐌1=(𝟏𝐗𝟎𝟏)\mathbf{M}^{-1}=\begin{pmatrix}\mathbf{1}&-\mathbf{X}\\ \mathbf{0}&\mathbf{1}\end{pmatrix}
4 Let 𝚪=(𝟏k𝟎)\boldsymbol{\Gamma}=\begin{pmatrix}\mathbf{1}_{k}&\mathbf{0}\end{pmatrix}
5 Let 𝐀^+=𝚪𝐌1𝐔𝐀^𝐔𝐌𝚪\widehat{\mathbf{A}}_{+}=\boldsymbol{\Gamma}\mathbf{M}^{-1}\mathbf{U}^{\top}\widehat{\mathbf{A}}\mathbf{U}\mathbf{M}\boldsymbol{\Gamma}^{\top}
6 Let 𝜶^+=𝚪𝐌𝐔𝜶^\widehat{\boldsymbol{\alpha}}_{+}=\boldsymbol{\Gamma}\mathbf{M}^{\top}\mathbf{U}^{\top}\widehat{\boldsymbol{\alpha}}
7 Let 𝜷^+=𝚪𝐌1𝐔𝜷^\widehat{\boldsymbol{\beta}}_{+}=\boldsymbol{\Gamma}\mathbf{M}^{-1}\mathbf{U}^{\top}\widehat{\boldsymbol{\beta}}
8 Let A^k=𝜶^+,𝐀^+,𝜷^+\widehat{A}_{k}=\langle\widehat{\boldsymbol{\alpha}}_{+},\widehat{\mathbf{A}}_{+},\widehat{\boldsymbol{\beta}}_{+}\rangle
return A^k\widehat{A}_{k}
Algorithm 2 BlockDiagonalize
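A possible SciPy transcription of Algorithm 2 is sketched below: scipy.linalg.schur with sort='iuc' computes the ordered Schur form of Step 1, and scipy.linalg.solve_sylvester implements the Bartels-Stewart algorithm of Step 2. We assume no complex-conjugate pair of eigenvalues straddles the cut, which holds here since \widehat{\mathbf{A}} has no eigenvalues of modulus one:

```python
import numpy as np
from scipy.linalg import schur, solve_sylvester

def block_diagonalize(A_hat, alpha_hat, beta_hat):
    # Step 1: ordered real Schur form; 'iuc' places the eigenvalues strictly
    # inside the unit circle in the leading block, and k counts them.
    T, U, k = schur(A_hat, output='real', sort='iuc')
    A11, A12, A22 = T[:k, :k], T[:k, k:], T[k:, k:]
    # Step 2: solve A11 X - X A22 + A12 = 0; solve_sylvester solves
    # AX + XB = Q, so we pass B = -A22 and Q = -A12.
    X = solve_sylvester(A11, -A22, -A12)
    # Step 3: M and its inverse differ only in the sign of the X block
    n = A_hat.shape[0]
    M = np.eye(n)
    M[:k, k:] = X
    M_inv = np.eye(n)
    M_inv[:k, k:] = -X
    # Steps 4-7: Gamma = (1_k 0) selects the first k coordinates
    A_plus = (M_inv @ T @ M)[:k, :k]          # equals A11 up to round-off
    alpha_plus = (M.T @ (U.T @ alpha_hat))[:k]
    beta_plus = (M_inv @ (U.T @ beta_hat))[:k]
    return alpha_plus, A_plus, beta_plus      # Step 8: the WFA A_k
```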

To analyze the computational cost we recall the following facts [39]:

  • The product of two n×nn\times n matrices can be computed in time O(n3)O(n^{3}) using a standard iterative algorithm, but can be reduced to O(nω)O(n^{\omega}) with ω<2.4\omega<2.4.

  • The inversion of an n\times n matrix can be computed in time O(n^{3}) using Gauss-Jordan elimination, but can be reduced to O(n^{\omega}) with \omega<2.4.

  • The Schur decomposition of an n\times n matrix can be computed with a two-step algorithm, where each step takes O(n^{3}), using the Hessenberg form of the matrix.

  • The Bartels-Stewart algorithm applied to upper triangular matrices to find an m\times n matrix takes O(mn^{2}+nm^{2}).

The running time of BlockDiagonalize on an input WFA \widehat{A} with (n-r) states is thus in O((n-r)^{3}), where r is the multiplicity of the singular value considered. Consequently, the running time of AAKapproximation on an input WFA A with n states is in O((n-r)^{3}).

5 Related Work

The study of approximate minimization for WFAs is very recent, and only a few works have been published on the subject. In [10, 11] the authors present an approximate minimization technique using a canonical expression for WFAs, and provide bounds on the error in the 2\ell^{2} norm. The result is supported by strong theoretical guarantees, but it is not optimal in any norm. An extension of this method to the case of Weighted Tree Automata can be found in [12]. A similar problem is addressed in [25], with less general results. In [26], the authors connect spectral learning to the approximate minimization problem of a small class of Hidden Markov models, bounding the error in terms of the total variation distance.

The control theory community has extensively studied approximate minimization in the context of linear time-invariant systems, and several methods have been proposed [3]. A parallel can be drawn between those results and ours, by noting that the impulse response of a discrete time-invariant Single-Input-Single-Output (SISO) system can be parametrized as a WFA over a one-letter alphabet. In [20] Glover presents a state-space solution for the case of continuous Multi-Input-Multi-Output (MIMO) systems. Glover's method led to a widespread application of these results, thanks to its computational and theoretical simplicity. This simplicity stems from the structure of the Lyapunov equations for continuous systems. The same is not true for discrete control systems, where the Lyapunov equations have a quadratic form. As noted in [16], there is no simple closed-form formula for the state-space solution of a discrete system. Thus, most of the results for the discrete case work with a suboptimal version of the problem, with restrictions on the multiplicity of the singular values [6, 2, 23]. A solution for the SISO case can be found, without additional assumptions, using a polynomial approach, but it does not provide an explicit representation of the state space, nor does it generalize to the MIMO setting. The first to actually extend Glover's results to the discrete case was Gu, who provides an elegant solution for the MIMO discrete problem [21]. Glover's and Gu's solutions rely on embedding the initial system into an extension of it, the all-pass system, equivalent to the WFA E in our method. Part of our contribution is the adaptation of some of the control theory tools to our setting.

6 Conclusion

In this paper we applied the AAK theory for Hankel operators and complex functions to the framework of WFAs in order to construct the best possible approximation to an automaton given a bound on the size. We provide an algorithm to find the parameters of the best WFA approximation in the spectral norm, and bounds on the error. Our method applies to real WFAs A=𝜶,𝐀,𝜷A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle, defined over a one-letter alphabet, with ρ(𝐀)<1\rho(\mathbf{A})<1. While this setting is certainly restricted, we believe that this work constitutes a first fundamental step towards optimal approximation. Furthermore, the use of AAK techniques has proven to be very fruitful in related areas like control theory; we think that automata theory can also benefit from it. The use of such methods can help deepen the understanding of the behaviour of rational functions. This paper highlights and strengthens the interesting connections between functional analysis, automata theory and control theory, unifying tools from different domains in one formalism.

A compelling direction for future work is to extend our results to the multi-letter case. The work of Adamyan, Arov and Krein provides us with a powerful theory connecting sequences to the study of complex functions. We note that, unfortunately, this approach cannot be directly generalized to the multi-letter case because of the non-commutative nature of the monoid considered. Extending this work would require adapting standard harmonic analysis results to the non-abelian case. A recent line of work in functional analysis is centered around extending this theory to the case of non-commutative operators, and in [34] a non-commutative version of the AAK theorem is presented. However, those results are non-constructive, making this direction, already challenging, even harder to pursue.

Acknowledgments

This research has been supported by NSERC Canada (C. Lacroce, P. Panangaden, D. Precup) and Canada CIFAR AI chairs program (Guillaume Rabusseau). The authors would like to thank Tianyu Li, Harsh Satija and Alessandro Sordoni for feedback on earlier drafts of this work, Gheorghe Comanici for a detailed review, and Maxime Wabartha for fruitful discussions and for comments on the proofs.

References

  • [1] Vadim M. Adamyan, Damir Zyamovich Arov, and Mark Grigorievich Krein. Analytic Properties of Schmidt Pairs for a Hankel Operator and the Generalized Schur–Takagi Problem. Mathematics of the USSR-Sbornik, 15:31–73, 1971.
  • [2] M.M. Al-Hussari, I.M. Jaimoukha, and D.J.N. Limebeer. A Descriptor Approach for the Solution of the One-Block Distance Problem. In Proceedings of the IFAC World Congress, 1993.
  • [3] Athanasios C. Antoulas. Approximation of Large-Scale Dynamical Systems. SIAM, 2005.
  • [4] Stéphane Ayache, Rémi Eyraud, and Noé Goudian. Explaining Black Boxes on Sequential Data Using Weighted Automata. In Proceedings of the 14th International Conference on Grammatical Inference, ICGI 2018, Wrocław, Poland, September 5-7, 2018, volume 93 of Proceedings of Machine Learning Research, pages 81–103. PMLR, 2018. URL: http://proceedings.mlr.press/v93/ayache19a.html.
  • [5] Raphaël Bailly, François Denis, and Liva Ralaivola. Grammatical Inference as a Principal Component Analysis Problem. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 33–40, New York, NY, USA, 2009. Association for Computing Machinery. doi:10.1145/1553374.1553379.
  • [6] Joseph A. Ball and Andre C.M. Ran. Optimal Hankel Norm Model Reductions and Wiener–Hopf Factorization I: The Canonical Case. SIAM Journal on Control and Optimization, 25(2):362–382, 1987.
  • [7] Borja Balle, Xavier Carreras, Franco M. Luque, and Ariadna Quattoni. Spectral Learning of Weighted Automata - A Forward–Backward Perspective. Mach. Learn., 96(1-2):33–63, 2014. doi:10.1007/s10994-013-5416-x.
  • [8] Borja Balle, Pascale Gourdeau, and Prakash Panangaden. Bisimulation Metrics and Norms for Weighted Finite Automata. In Ioannis Chatzigiannakis, Piotr Indyk, Fabian Kuhn, and Anca Muscholl, editors, 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, July 10-14, 2017, Warsaw, Poland, volume 80 of LIPIcs, pages 103:1–103:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPIcs.ICALP.2017.103.
  • [9] Borja Balle, William L. Hamilton, and Joelle Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 1386–1394. JMLR.org, 2014. URL: http://proceedings.mlr.press/v32/balle14.html.
  • [10] Borja Balle, Prakash Panangaden, and Doina Precup. A Canonical Form for Weighted Automata and Applications to Approximate Minimization. In 30th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2015, Kyoto, Japan, July 6-10, 2015, pages 701–712. IEEE Computer Society, 2015. doi:10.1109/LICS.2015.70.
  • [11] Borja Balle, Prakash Panangaden, and Doina Precup. Singular Value Automata and Approximate Minimization. Math. Struct. Comput. Sci., 29(9):1444–1478, 2019. doi:10.1017/S0960129519000094.
  • [12] Borja Balle and Guillaume Rabusseau. Approximate minimization of weighted tree automata. Information and Computation, page 104654, 2020.
  • [13] R. H. Bartels and G. W. Stewart. Solution of the matrix equation AX + XB = C [F4]. Commun. ACM, 15(9):820–826, 1972.
  • [14] Jean Berstel and Christophe Reutenauer. Noncommutative rational series with applications, volume 137. Cambridge University Press, 2011.
  • [15] J.W. Carlyle and A. Paz. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 5(1):26–40, 1971.
  • [16] Charles K. Chui and Guanrong Chen. Discrete HH^{\infty} Optimization With Applications in Signal Processing and Control Systems. Springer-Verlag, 1997.
  • [17] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936. doi:10.1007/BF02288367.
  • [18] Rémi Eyraud and Stéphane Ayache. Distillation of Weighted Automata from Recurrent Neural Networks Using a Spectral Approach. CoRR, abs/2009.13101, 2020. URL: https://arxiv.org/abs/2009.13101, arXiv:2009.13101.
  • [19] Michel Fliess. Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 53:197–222, 1974.
  • [20] Keith Glover. All Optimal Hankel-Norm Approximations of Linear Multivariable Systems and their \mathcal{L}^{\infty}-Error Bounds. International Journal of Control, 39(6):1115–1193, 1984. doi:10.1080/00207178408933239.
  • [21] Guoxiang Gu. All Optimal Hankel-Norm Approximations and their Error Bounds in Discrete-Time. International Journal of Control, 78(6):408–423, 2005. doi:10.1080/00207170500110988.
  • [22] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. J. Comput. Syst. Sci., 78(5):1460–1480, September 2012. doi:10.1016/j.jcss.2011.12.025.
  • [23] Vlad Ionescu and Cristian Oara. The four-block Adamjan-Arov-Krein problem for discrete-time systems. In Linear Algebra and its Applications, pages 95–119. Elsevier, 2001.
  • [24] L. Kronecker. Zur Theorie der Elimination einer Variablen aus zwei algebraischen Gleichungen. Monatsber. Königl. Preussischen Akad. Wiss., pages 535–600, 1881.
  • [25] Alex Kulesza, Nan Jiang, and Satinder Singh. Low-Rank Spectral Learning with Weighted Loss Functions. In Artificial Intelligence and Statistics, pages 517–525. PMLR, 2015.
  • [26] Alex Kulesza, N. Raj Rao, and Satinder Singh. Low-Rank Spectral Learning. In Samuel Kaski and Jukka Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 522–530, Reykjavik, Iceland, 22–25 April 2014. PMLR.
  • [27] Aleksandr M. Lyapunov. The General Problem of the Stability of Motion [in Russian]. Gostekhizdat, Moscow, 1950.
  • [28] Jean Meinguet. A Simplified Presentation of the Adamjan-Arov-Krein Approximation Theory. In H. Werner, L. Wuytack, E. Ng, and H. J. Bünger, editors, Computational Aspects of Complex Analysis, pages 217–248, Dordrecht, 1983. Springer Netherlands. doi:10.1007/978-94-009-7121-9_9.
  • [29] Zeev Nehari. On Bounded Bilinear Forms. Annals of Mathematics, 65(1):153–162, 1957.
  • [30] Nikolai K. Nikol'skii. Operators, Functions and Systems: An Easy Reading, volume 92 of Mathematical Surveys and Monographs. American Mathematical Society, 2002.
  • [31] Takamasa Okudono, Masaki Waga, Taro Sekiyama, and Ichiro Hasuo. Weighted Automata Extraction from Recurrent Neural Networks via Regression on State Spaces. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 5306–5314. AAAI Press, 2020. URL: https://aaai.org/ojs/index.php/AAAI/article/view/5977.
  • [32] Alexander Ostrowski and Hans Schneider. Some Theorems on the Inertia of General Matrices. J. Math. Anal. Appl, 4(1):72–84, 1962.
  • [33] Vladimir Peller. Hankel Operators and their Applications. Springer Science & Business Media, 2012.
  • [34] Gelu Popescu. Multivariable Nehari Problem and Interpolation. Journal of Functional Analysis, 200:536–581, 2003. doi:10.1016/S0022-1236(03)00078-8.
  • [35] Guillaume Rabusseau, Borja Balle, and Joelle Pineau. Multitask Spectral Learning of Weighted Automata. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 2585–2594, 2017.
  • [36] Guillaume Rabusseau, Tianyu Li, and Doina Precup. Connecting Weighted Automata and Recurrent Neural Networks through Spectral Learning. In Kamalika Chaudhuri and Masashi Sugiyama, editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 1630–1639. PMLR, 2019. URL: http://proceedings.mlr.press/v89/rabusseau19a.html.
  • [37] William E. Roth. The equations AX - YB = C and AX - XB = C in matrices. Proceedings of the American Mathematical Society, 3(3):392–396, 1952.
  • [38] B. De Schutter. Minimal State-Space Realization in Linear System Theory: an Overview. Journal of Computational and Applied Mathematics, 121(1):331–354, 2000. doi:10.1016/S0377-0427(00)00341-1.
  • [39] Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra, volume 50. SIAM, 1997.
  • [40] Gail Weiss, Yoav Goldberg, and Eran Yahav. Learning Deterministic Weighted Automata with Queries and Counterexamples. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8558–8569, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/d3f93e7766e8e1b7ef66dfdd9a8be93b-Abstract.html.
  • [41] H.K. Wimmer. On the Ostrowski-Schneider Inertia Theorem. Journal of Mathematical Analysis and Applications, 41(1):164–169, 1973. doi:10.1016/0022-247X(73)90190-X.
  • [42] Kehe Zhu. Operator Theory in Function Spaces, volume 138. American Mathematical Society, 1990.

Appendix A Hankel Operators

For more details on the content of this section we refer the reader to [30]. We first recall the definition of a Hankel operator acting on \ell^{2}.

Definition A.1.

A Hankel operator is a mapping H:22H:\ell^{2}\rightarrow\ell^{2} with matrix 𝐇={αj+k}j,k0\mathbf{H}=\{\alpha_{j+k}\}_{j,k\geq 0}. In other words, given a={an}n02a=\{a_{n}\}_{n\geq 0}\in\ell^{2}, we have H(a)=bH(a)=b, where b={bn}n0b=\{b_{n}\}_{n\geq 0} is defined by:

bk=j0αj+kaj,k0.b_{k}=\sum_{j\geq 0}\alpha_{j+k}a_{j},\quad k\geq 0. (48)

This property of the Hankel matrix can be rephrased as an operator identity. Defining the shift operator by S(x_{0},x_{1},\dots)=(0,x_{0},x_{1},\dots) and denoting by S^{*} its left inverse, S^{*}(y_{0},y_{1},\dots)=(y_{1},y_{2},\dots), we have that H is a Hankel operator if and only if:

HS=SH.HS=S^{*}H. (49)
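The identity can be checked on finite sections of the matrices; a minimal sketch with a hypothetical summable sequence:

```python
import numpy as np

# Finite-section check of Equation 49 with the hypothetical sequence
# alpha_n = 0.5^n.
N = 50
a = 0.5 ** np.arange(2 * N)
H = np.array([[a[j + k] for k in range(N)] for j in range(N)])

S = np.eye(N, k=-1)       # right shift: (Sx)_0 = 0, (Sx)_n = x_{n-1}
S_star = np.eye(N, k=1)   # its left inverse

# On an N x N section the identity holds away from the last row and
# column, where the truncation cuts off entries of the infinite matrix.
print(np.allclose((H @ S)[:-1, :-1], (S_star @ H)[:-1, :-1]))  # True
```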

The correspondence between Definition A.1 and Definition 2.7 can be easily made explicit. First, we note that using the isomorphism ϕ(ϕ^(n))n\phi\mapsto(\widehat{\phi}(n))_{n\in\mathbb{Z}}, introduced with Fourier series, we can identify 2\mathcal{H}^{2} with 2\ell^{2}. Moreover, the operator of multiplication by zz acts as right shift SS on the space of Fourier coefficients, in fact (zϕ)^(n)=ϕ^(n1)\widehat{(z\phi)}(n)=\widehat{\phi}(n-1). Analogously, the left inverse SS^{*} corresponds to the truncated multiplication operator. Now, let 1={zk}k0\mathcal{B}_{1}=\{z^{k}\}_{k\geq 0} and 2={zj}j<0\mathcal{B}_{2}=\{z^{j}\}_{j<0} be bases for 2\mathcal{H}^{2} and 2\mathcal{H}^{2}_{-}, respectively, and let

Izn=zn1Iz^{n}=z^{-n-1} (50)

be the involution on 2(𝕋)\mathcal{L}^{2}(\mathbb{T}). Note that I2=2I\mathcal{H}^{2}=\mathcal{H}^{2}_{-}.

Let H¯:22\overline{H}:\ell^{2}\rightarrow\ell^{2} be a Hankel operator with matrix 𝐇¯\overline{\mathbf{H}}. Using the bases 1\mathcal{B}_{1}, 2\mathcal{B}_{2} and the Fourier identification map, we can obtain an operator acting between Hardy spaces. Following this interpretation, the operator H=IH¯:22H=I\overline{H}:\mathcal{H}^{2}\rightarrow\mathcal{H}^{2}_{-} has matrix 𝐇¯\overline{\mathbf{H}} with respect to 1\mathcal{B}_{1}, 2\mathcal{B}_{2}, and satisfies:

HS=SH.HS=\mathbb{P}_{-}SH. (51)

In particular, H¯S=SH¯\overline{H}S=S^{*}\overline{H} if and only if HS=SHHS=\mathbb{P}_{-}SH. It is now easy to see that the characterization of the Hankel operator given in Definition 2.7 satisfies Equation 51.

The following theorem, due to Nehari [29], is of great importance as it highlights a correspondence between bounded Hankel operators and functions in (𝕋)\mathcal{L}^{\infty}(\mathbb{T}).

Theorem A.1 ([29]).

A Hankel operator H:\ell^{2}\rightarrow\ell^{2} with matrix \mathbf{H}=\{\alpha_{j+k}\}_{j,k\geq 0} is bounded on \ell^{2} if and only if there exists a function \psi\in\mathcal{L}^{\infty}(\mathbb{T}) such that

αm=ψ^(m),m0.\alpha_{m}=\widehat{\psi}(m),\quad m\geq 0. (52)

In this case:

\|H\|=\inf\{\|\psi\|_{\infty}:\widehat{\psi}(n)=\alpha_{n},\,n\geq 0\}, (53)

where \widehat{\psi}(n) denotes the n-th Fourier coefficient of \psi.

We can now reformulate the theorem using the characterization of Hankel operators in Hardy spaces.

Theorem A.2 ([29]).

Let ϕ2(𝕋)\phi\in\mathcal{L}^{2}(\mathbb{T}) be a symbol of the Hankel operator on Hardy spaces Hϕ:22H_{\phi}:\mathcal{H}^{2}\rightarrow\mathcal{H}^{2}_{-}. The following are equivalent:

  • (1)

    HϕH_{\phi} is bounded on 2\mathcal{H}^{2},

  • (2)

    there exists ψ(𝕋)\psi\in\mathcal{L}^{\infty}(\mathbb{T}) such that ψ^(m)=ϕ^(m)\widehat{\psi}(m)=\widehat{\phi}(m) for all m<0m<0.

If the conditions above are satisfied, then:

Hϕ=inf{ψ:ψ^(m)=ϕ^(m),m<0},\|H_{\phi}\|=\inf\{\|\psi\|_{\infty}:\widehat{\psi}(m)=\widehat{\phi}(m),\,m<0\}, (54)

or equivalently:

Hϕ=inff(z)ϕ(z)f(z).\|H_{\phi}\|=\inf_{f(z)\in\mathcal{H}^{\infty}}\|\phi(z)-f(z)\|_{\infty}. (55)

Nehari’s Theorem is at the core of the proof of Corollary 1.

Proof of Corollary 1.

Let HϕH_{\phi} be a Hankel operator with symbol ϕ(z)(𝕋)\phi(z)\in\mathcal{L}^{\infty}(\mathbb{T}) and matrix 𝐇\mathbf{H}. Let ψ(z)=g(z)+l(z)k\psi(z)=g(z)+l(z)\in\mathcal{H}^{\infty}_{k} be the solution of Equation 13. We have:

HϕHψ\displaystyle\|H_{\phi}-H_{\psi}\| =Hϕψ\displaystyle=\|H_{\phi-{\psi}}\| (56)
=Hσkηk(z)/ξk+(z)\displaystyle=\|H_{\sigma_{k}\eta^{-}_{k}(z)/\xi^{+}_{k}(z)}\| (57)
σkηk(z)/ξk+(z)=σk\displaystyle\leq\sigma_{k}\|\eta^{-}_{k}(z)/\xi^{+}_{k}(z)\|_{\infty}=\sigma_{k} (58)

where first we used Corollary 2 and then Equation 55. Now, using the definition of Hankel operator, we have:

HϕHψ=HϕHg=𝐇𝐆σk.\|H_{\phi}-H_{\psi}\|=\|H_{\phi}-H_{g}\|=\|\mathbf{H}-\mathbf{G}\|\leq\sigma_{k}. (59)

Since \|\mathbf{H}-\mathbf{G}\|\geq\sigma_{k} (from the Eckart-Young theorem [17]), it follows that \|\mathbf{H}-\mathbf{G}\|=\sigma_{k}. Note that \mathbf{G} has rank k, as required, because g\in\mathcal{R}_{k} (Theorem 2.6). ∎

Appendix B Proofs from Section 3

In order to prove Theorem 3.3 we need an auxiliary lemma, the analogue of a control theory result rephrased in terms of WFAs. The original theorem and lemma, together with the corresponding proofs, can be found in [16]; hence, we only provide a sketch of the proofs.

Lemma B.1 ([16]).

Let E=\langle\boldsymbol{\alpha}_{e},\mathbf{A}_{e},\boldsymbol{\beta}_{e}\rangle be a minimal WFA, and let e(z)=\boldsymbol{\alpha}_{e}^{\top}(z\mathbf{1}-\mathbf{A}_{e})^{-1}\boldsymbol{\beta}_{e}-C. If \sigma_{k}^{-1}e(z) is unimodular, then there exists a unique invertible symmetric matrix \mathbf{T} satisfying:

  • (a)

    𝐀e𝐓𝜷e=𝜶eC\mathbf{A}_{e}^{\top}\mathbf{T}\boldsymbol{\beta}_{e}=\boldsymbol{\alpha}_{e}C

  • (b)

    σk2𝜶e𝐓1𝐀e=C𝜷e\sigma_{k}^{2}\boldsymbol{\alpha}_{e}^{\top}\mathbf{T}^{-1}\mathbf{A}_{e}^{\top}=C\boldsymbol{\beta}_{e}^{\top}

  • (c)

    𝐀e𝐓𝐀eC1𝐀e𝐓𝜷e𝜶e=𝐓\mathbf{A}_{e}^{\top}\mathbf{T}\mathbf{A}_{e}-C^{-1}\mathbf{A}_{e}^{\top}\mathbf{T}\boldsymbol{\beta}_{e}\boldsymbol{\alpha}_{e}^{\top}=\mathbf{T}

Proof.

Since σk1e(z)\sigma_{k}^{-1}e(z) is unimodular, we have that:

e(z)e(z¯1)=σk2𝟏e(z)e^{*}(\bar{z}^{-1})=\sigma_{k}^{2}\mathbf{1} (60)

where we denote with ee^{*} the adjoint function. From the equation above, we obtain:

e(z¯1)\displaystyle e^{*}(\bar{z}^{-1}) =σk2e1(z)=σk2(C+𝜶e(z𝟏𝐀e)1𝜷e)1\displaystyle=\sigma_{k}^{2}e^{-1}(z)=\sigma_{k}^{2}(-C+\boldsymbol{\alpha}_{e}^{\top}(z\mathbf{1}-\mathbf{A}_{e})^{-1}\boldsymbol{\beta}_{e})^{-1} (61)
=-\sigma_{k}^{2}C^{-1}-\sigma_{k}^{2}C^{-1}\boldsymbol{\alpha}_{e}^{\top}(z\mathbf{1}-(\mathbf{A}_{e}+C^{-1}\boldsymbol{\beta}_{e}\boldsymbol{\alpha}_{e}^{\top}))^{-1}\boldsymbol{\beta}_{e}C^{-1} (62)

where we used the matrix inversion lemma. On the other hand we have:

e(z¯1)\displaystyle e^{*}(\bar{z}^{-1}) =C+𝜷e(z1𝟏𝐀e)1𝜶e\displaystyle=-C+\boldsymbol{\beta}_{e}^{\top}(z^{-1}\mathbf{1}-\mathbf{A}_{e}^{\top})^{-1}\boldsymbol{\alpha}_{e} (63)
=C+𝜷e(𝐀e(𝟏z𝐀e)+𝐀e)(𝟏z𝐀e)1𝜶e\displaystyle=-C+\boldsymbol{\beta}_{e}^{\top}(-\mathbf{A}_{e}^{-\top}(\mathbf{1}-z\mathbf{A}_{e}^{\top})+\mathbf{A}_{e}^{-\top})(\mathbf{1}-z\mathbf{A}_{e}^{\top})^{-1}\boldsymbol{\alpha}_{e} (64)
=-(C+\boldsymbol{\beta}_{e}^{\top}\mathbf{A}_{e}^{-\top}\boldsymbol{\alpha}_{e})-\boldsymbol{\beta}_{e}^{\top}\mathbf{A}_{e}^{-\top}(z\mathbf{1}-\mathbf{A}_{e}^{-\top})^{-1}\mathbf{A}_{e}^{-\top}\boldsymbol{\alpha}_{e} (65)

where we used the matrix inversion lemma again, before grouping the terms. If the quantities in Equation 62 and Equation 65 are to be equal, their constant terms must coincide. Then, we want the \mathcal{H}^{\infty}_{-} components to correspond, so we consider the corresponding Hankel matrices. It is easy to see that we can once again associate the coefficients of these complex functions to the parameters of a WFA. From the minimality of E we obtain:

\begin{cases}\sigma_{k}^{2}C^{-1}\boldsymbol{\alpha}_{e}^{\top}=\boldsymbol{\beta}_{e}^{\top}\mathbf{A}_{e}^{-\top}\mathbf{T}\\ \mathbf{A}_{e}+C^{-1}\boldsymbol{\beta}_{e}\boldsymbol{\alpha}_{e}^{\top}=\mathbf{T}^{-1}\mathbf{A}_{e}^{-\top}\mathbf{T}\\ \boldsymbol{\beta}_{e}C^{-1}=\mathbf{T}^{-1}\mathbf{A}_{e}^{-\top}\boldsymbol{\alpha}_{e}\end{cases} (66)

where 𝐓\mathbf{T} is an invertible matrix [7]. This system is equivalent to:

{σk2𝜶e𝐓1𝐀e=C𝜷e𝐀e𝐓𝐀eC1𝐀e𝐓𝜷e𝜶e=𝐓𝐀e𝐓𝜷e=𝜶eC\begin{cases}\sigma_{k}^{2}\boldsymbol{\alpha}_{e}^{\top}\mathbf{T}^{-1}\mathbf{A}_{e}^{\top}=C\boldsymbol{\beta}_{e}^{\top}\\ \mathbf{A}_{e}^{\top}\mathbf{T}\mathbf{A}_{e}-C^{-1}\mathbf{A}_{e}^{\top}\mathbf{T}\boldsymbol{\beta}_{e}\boldsymbol{\alpha}_{e}^{\top}=\mathbf{T}\\ \mathbf{A}_{e}^{\top}\mathbf{T}\boldsymbol{\beta}_{e}=\boldsymbol{\alpha}_{e}C\end{cases} (67)

To conclude the proof it remains to check that \mathbf{T} is symmetric, which can be verified by direct computation. ∎

Proof of Theorem 3.3.

This proof follows easily from Lemma B.1 by setting \mathbf{P}=-\sigma^{2}_{k}\mathbf{T}^{-1} and \mathbf{Q}=-\mathbf{T}. We obtain point (c) by direct multiplication. Then, we substitute the last equation of Equation 67 into the second one, and we obtain:

𝐀e𝐓𝐀e𝜶e𝜶e=𝐓\mathbf{A}_{e}^{\top}\mathbf{T}\mathbf{A}_{e}-\boldsymbol{\alpha}_{e}\boldsymbol{\alpha}_{e}^{\top}=\mathbf{T} (68)

which verifies point (b) with \mathbf{Q}=-\mathbf{T}. Point (a) can be obtained analogously by combining the first and second equations of Equation 67. ∎

Appendix C Possible Extensions

C.1 Relaxing the Spectral Radius Assumption

It is possible to extend part of our method to WFAs over a one-letter alphabet whose transition matrix \mathbf{A} has no eigenvalues of modulus one (so that, in particular, \rho(\mathbf{A}) may exceed 1), but the approximation recovered is not optimal in the spectral norm.

Let A=\langle\boldsymbol{\alpha},\mathbf{A},\boldsymbol{\beta}\rangle be a WFA with n states that we want to minimize. The idea is to block-diagonalize \mathbf{A} as in Section 3.3.3, and to tackle each component separately. The component A_{+}=\langle\boldsymbol{\alpha}_{+},\mathbf{A}_{+},\boldsymbol{\beta}_{+}\rangle, with \rho(\mathbf{A}_{+})<1, can be dealt with in the way presented in the previous sections. This means that we can find an optimal spectral-norm approximation of the desired size for A_{+}. Now we can consider the second component, A_{-}=\langle\boldsymbol{\alpha}_{-},\mathbf{A}_{-},\boldsymbol{\beta}_{-}\rangle. The key idea is to apply the transformation z^{j-1}\mapsto z^{-j} for j\geq 1 to the function \phi^{\prime\prime}(z) associated with A_{-}. Then, the function

ϕ′′(z1)=k0𝜶𝐀kzk𝜷=𝜶(𝟏z𝐀)1𝜷\phi^{\prime\prime}(z^{-1})=\sum_{k\geq 0}\boldsymbol{\alpha}_{-}^{\top}\mathbf{A}_{-}^{k}z^{k}\boldsymbol{\beta}_{-}=\boldsymbol{\alpha}_{-}^{\top}(\mathbf{1}-z\mathbf{A}_{-})^{-1}\boldsymbol{\beta}_{-} (69)

is well defined, as the series converges for z of small enough modulus. Indeed, since -\mathbf{A}_{-}(z\mathbf{1}-\mathbf{A}_{-}^{-1})=\mathbf{1}-z\mathbf{A}_{-}, Equation 69 is the rational function of the WFA \langle\boldsymbol{\alpha}_{-},\mathbf{A}_{-}^{-1},-\mathbf{A}_{-}^{-1}\boldsymbol{\beta}_{-}\rangle, whose transition matrix has all eigenvalues strictly inside the unit disc. Using this transformation we obtain a function with poles inside the unit disc, and we can apply the method presented in the paper. An important choice to make is the size of the approximation of A_{-}, as it can influence the quality of the approximation. Analyzing the effect of this parameter on the approximation error constitutes an interesting direction for future work, on both the theoretical and the experimental side. We refer the reader to the control theory literature [20], where some theoretical work has been done to study an analogous approach for continuous-time systems and their approximation error.

C.2 Polynomial method

We remark that Equation 14 from Corollary 2 can be rewritten as

ψ(z)=ϕ(z)Hξk+(z)ξk+(z),\psi(z)=\phi(z)-\frac{H\xi^{+}_{k}(z)}{\xi^{+}_{k}(z)}, (70)

where \xi^{+}_{k}(z) is the function in \mathcal{H}^{2} associated with the vector \boldsymbol{\xi}_{k}\in\operatorname{Ker}(\mathbf{H}^{*}\mathbf{H}-\sigma_{k}^{2}\mathbf{1}) (and \psi(z) does not depend on the choice of the specific \boldsymbol{\xi}_{k}). There is an alternative way to find the best approximation, particularly useful when the objective is to approximate a finite-rank infinite Hankel matrix with another Hankel matrix, without necessarily extracting a WFA. We can consider the adjoint operator H^{*} and its matrix \mathbf{H}^{*}. The singular numbers and singular vectors of H correspond to the eigenvalues and eigenvectors of \mathbf{R}=(\mathbf{H}^{*}\mathbf{H})^{1/2}. Hence, it is possible to compute \sigma_{k} and a corresponding singular vector \boldsymbol{\xi}_{k}. The function \xi^{+}_{k}(z) is then obtained following Equation 9. The Hankel matrix \mathbf{G} that best approximates \mathbf{H} is given by \mathbf{G}=\mathbf{H}-\mathbf{M}, where \mathbf{M} is the Hankel matrix having \frac{H\xi^{+}_{k}(z)}{\xi^{+}_{k}(z)} as symbol.
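The first steps of this route are straightforward to carry out numerically; the following sketch (with a hypothetical rank-2 function and a finite truncation of \mathbf{H}) computes \sigma_{k} and \boldsymbol{\xi}_{k}, leaving the final series division implicit:

```python
import numpy as np

# Truncated-Hankel sketch of the polynomial method's ingredients, for a
# hypothetical rank-2 function f(n) = alpha^T A^n beta.
N = 100
alpha = np.array([1.0, 0.5])
A = np.diag([0.9, 0.3])
beta = np.array([1.0, 1.0])
f = np.array([alpha @ np.linalg.matrix_power(A, n) @ beta
              for n in range(2 * N)])
H = np.array([[f[i + j] for j in range(N)] for i in range(N)])

# Singular numbers and vectors of H are the eigen-pairs of (H*H)^{1/2}.
k = 1
U, s, Vt = np.linalg.svd(H)
sigma_k = s[k]
xi_k = Vt[k]        # coefficients of xi_k^+(z) in the basis {z^j, j >= 0}
H_xi = H @ xi_k     # coefficients of (H xi_k^+)(z)
# The symbol of M is (H xi_k^+)(z) / xi_k^+(z); its Fourier coefficients
# (and hence the Hankel matrix G = H - M) follow by series division.
print(sigma_k)
```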