Optimal Spectral-Norm Approximate Minimization of Weighted Finite Automata
Abstract
We address the approximate minimization problem for weighted finite automata (WFAs) with weights in $\mathbb{R}$, over a one-letter alphabet: to compute the best possible approximation of a WFA given a bound on the number of states. This work is grounded in Adamyan-Arov-Krein approximation theory, a remarkable collection of results on the approximation of Hankel operators. In addition to its intrinsic mathematical relevance, this theory has proven to be very effective for model reduction. We adapt these results to the framework of weighted automata over a one-letter alphabet. We provide theoretical guarantees and bounds on the quality of the approximation in the spectral and $\ell^2$ norms. We develop an algorithm that, based on the properties of Hankel operators, returns the optimal approximation in the spectral norm.
1 Introduction
Weighted finite automata (WFAs) are an expressive class of models representing functions defined over sequences. The approximate minimization problem is concerned with finding an automaton that approximates the behaviour of a given minimal WFA, while being smaller in size. This second automaton recognizes a different language, and the objective is to minimize the approximation error [10, 11]. Approximate minimization becomes particularly useful in the context of spectral learning algorithms [5, 7, 9, 22]. When applied to a learning task, such algorithms can be viewed as working in two steps. First, they compute a minimal WFA that explains the training data exactly. Then, they obtain a model that generalizes to unseen data by producing a smaller approximation of the minimal WFA. It is not just a question of saving space by having a smaller state space; the exact machine will overfit the data and generalize poorly. To obtain accurate results it is crucial to guess correctly the size of the minimal WFA, in particular when the data is generated by this type of machine.
The minimization task is greatly shaped by the way we decide to measure the approximation error. It is thus natural to wonder if there are norms that are preferable to others. We believe that the chosen norm should be computationally reasonable to minimize. For instance, the distance between WFAs can be computed using a metric based on bisimulation [8]. While exploring this approach could still be interesting, the fact that this metric is hard to compute makes it unsuitable for our purposes. Moreover, this metric is specifically designed for WFAs, so it is not directly applicable to other models dealing with sequential data. We think that being transferable is a second important feature for the chosen norm. In fact, being able to compare different classes of models is desirable for future applications of this method. For example, one can think of the burgeoning literature on approximating Recurrent Neural Networks (RNNs) using WFAs, where the objective is to extract from a trained RNN an automaton that accurately mimics its behaviour [36, 40, 31, 4, 18]. With this in mind, we think that it is preferable to consider a norm defined on the input-output function – or the Hankel matrix – rather than the parameters of the specific model considered. Finally, it is important to choose a norm that can be computed accurately. The spectral norm seems to be a good candidate for the task. In particular, it allows us to exploit the work of Adamyan, Arov and Krein which has come to be known as AAK theory [1]: a series of results connecting the theory of complex functions on the unit circle to Hankel matrices, a mathematical object representing functions defined over sequences. The core of this theory provides us with theoretical guarantees for the exact computation of the spectral norm of the error, and a method to construct the optimal approximation.
We summarize our main contributions:
• We use AAK theory to study the approximate minimization problem of WFAs. To connect those areas, we establish a correspondence between the parameters of a WFA and the coefficients of a complex function on the unit circle. To the best of our knowledge, this paper represents the first attempt to apply AAK theory to WFAs.
• We present a theoretical analysis of the optimal spectral-norm approximate minimization problem for WFAs, based on their connection with finite-rank infinite Hankel matrices. We provide a closed form solution for real weighted automata over a one-letter alphabet, under an assumption on the spectral radius of the transition matrix. We bound the approximation error, both in terms of the Hankel matrix (spectral norm) and of the rational function computed by the WFA ($\ell^2$ norm).
• We propose a self-contained algorithm that returns the unique optimal spectral-norm approximation of a given size.
2 Background
2.1 Preliminaries
We denote with $\mathbb{N}$, $\mathbb{Z}$ and $\mathbb{R}$ the sets of natural, integer and real numbers, respectively. We use bold letters for vectors and matrices; all vectors considered are column vectors. We denote with $\mathbf{I}$ the identity matrix, specifying its dimension only when not clear from the context. We refer to the $i$-th row and the $j$-th column of a matrix $\mathbf{M}$ by $\mathbf{M}(i,:)$ and $\mathbf{M}(:,j)$. Given a matrix $\mathbf{M} \in \mathbb{R}^{p \times q}$ of rank $n$, a rank factorization is a factorization $\mathbf{M} = \mathbf{P}\mathbf{Q}$, where $\mathbf{P} \in \mathbb{R}^{p \times n}$, $\mathbf{Q} \in \mathbb{R}^{n \times q}$ and $\operatorname{rank}(\mathbf{P}) = \operatorname{rank}(\mathbf{Q}) = n$. Let $\mathbf{M} \in \mathbb{R}^{p \times q}$ of rank $n$, the compact singular value decomposition (SVD) of $\mathbf{M}$ is the factorization $\mathbf{M} = \mathbf{U}\mathbf{D}\mathbf{V}^\top$, where $\mathbf{U} \in \mathbb{R}^{p \times n}$, $\mathbf{D} \in \mathbb{R}^{n \times n}$, $\mathbf{V} \in \mathbb{R}^{q \times n}$ are such that $\mathbf{U}^\top\mathbf{U} = \mathbf{V}^\top\mathbf{V} = \mathbf{I}$, and $\mathbf{D}$ is a diagonal matrix. The columns of $\mathbf{U}$ and $\mathbf{V}$ are called left and right singular vectors, while the entries of $\mathbf{D}$ are the singular values. The Moore-Penrose pseudo-inverse $\mathbf{M}^+$ of $\mathbf{M}$ is the unique matrix such that $\mathbf{M}\mathbf{M}^+\mathbf{M} = \mathbf{M}$ and $\mathbf{M}^+\mathbf{M}\mathbf{M}^+ = \mathbf{M}^+$, with $\mathbf{M}^+\mathbf{M}$ and $\mathbf{M}\mathbf{M}^+$ Hermitian [42]. The spectral radius $\rho(\mathbf{M})$ of a matrix $\mathbf{M}$ is the largest modulus among its eigenvalues.
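The following minimal numpy sketch (our own illustration, not part of the original development) checks the compact SVD and the defining properties of the Moore-Penrose pseudo-inverse on a small rank-deficient matrix:

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [0.0, 1.0]])   # a 3x2 matrix of rank 2

# Compact SVD: U (3x2), d (2,), Vt (2x2), with U^T U = V^T V = I.
U, d, Vt = np.linalg.svd(M, full_matrices=False)
assert np.allclose(U * d @ Vt, M)

# Moore-Penrose pseudo-inverse and its defining identities.
Mp = np.linalg.pinv(M)
assert np.allclose(M @ Mp @ M, M)
assert np.allclose(Mp @ M @ Mp, Mp)
```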
A Hilbert space is a complete normed vector space where the norm arises from an inner product. A linear operator $T : \mathcal{H}_1 \to \mathcal{H}_2$ between Hilbert spaces is bounded if it has finite operator norm, i.e. $\|T\| = \sup_{\|x\| \leq 1}\|Tx\| < \infty$. We denote by $\mathbf{T}$ the (infinite) matrix associated with $T$ by some (canonical) orthonormal basis on $\mathcal{H}$. An operator is compact if the image of the unit ball in $\mathcal{H}_1$ is relatively compact. Given Hilbert spaces $\mathcal{H}_1$, $\mathcal{H}_2$ and a compact operator $T : \mathcal{H}_1 \to \mathcal{H}_2$, we denote its adjoint by $T^*$. The singular numbers $\{\sigma_n\}_{n \geq 0}$ of $T$ are the square roots of the eigenvalues of the self-adjoint operator $T^*T$, arranged in decreasing order. A $\sigma$-Schmidt pair for $T$ is a couple of norm $1$ vectors $\{\xi, \eta\}$ such that: $T\xi = \sigma\eta$ and $T^*\eta = \sigma\xi$. The Hilbert-Schmidt decomposition provides a generalization of the compact SVD for the infinite matrix of a compact operator $T$ using singular numbers and orthonormal Schmidt pairs: $Tx = \sum_n \sigma_n \langle x, \xi_n\rangle \eta_n$ [42]. The spectral norm $\|\mathbf{T}\|$ of the matrix representing the operator $T$ is the largest singular number of $T$. Note that the spectral norm of $\mathbf{T}$ corresponds to the operator norm of $T$.
Let $\ell^2$ be the Hilbert space of square-summable sequences over $\mathbb{N}$, with norm $\|x\|_{\ell^2}^2 = \sum_{n \geq 0}|x_n|^2$ and inner product $\langle x, y\rangle = \sum_{n \geq 0} x_n \overline{y_n}$ for $x, y \in \ell^2$. Let $\mathbb{T} = \{z \in \mathbb{C} : |z| = 1\}$ be the complex unit circle, and $\mathbb{D} = \{z \in \mathbb{C} : |z| < 1\}$ the (open) complex unit disc. Let $1 \leq p < \infty$, we denote with $\mathcal{L}^p(\mathbb{T})$ the space of measurable functions on $\mathbb{T}$ for which the $p$-th power of the absolute value is Lebesgue integrable. For $p = \infty$, we denote with $\mathcal{L}^\infty(\mathbb{T})$ the space of measurable functions that are bounded, with norm $\|f\|_\infty = \operatorname{ess\,sup}_{z \in \mathbb{T}}|f(z)|$.
2.2 Weighted Finite Automata
Let $\Sigma$ be a fixed finite alphabet and $\Sigma^*$ the set of all finite strings with symbols in $\Sigma$. We use $\varepsilon$ to denote the empty string. Given $p, s \in \Sigma^*$, we denote with $ps$ their concatenation.
A weighted finite automaton (WFA) of $n$ states over $\Sigma$ is a tuple $A = \langle\boldsymbol{\alpha}, \{\mathbf{A}_\sigma\}_{\sigma \in \Sigma}, \boldsymbol{\beta}\rangle$, where $\boldsymbol{\alpha}, \boldsymbol{\beta} \in \mathbb{R}^n$ are the vectors of initial and final weights, respectively, and $\mathbf{A}_\sigma \in \mathbb{R}^{n \times n}$ is the matrix containing the transition weights associated with each symbol $\sigma \in \Sigma$. Every WFA $A$ with real weights realizes a function $f_A : \Sigma^* \to \mathbb{R}$, i.e. given a string $x = x_1 \cdots x_t \in \Sigma^*$, it returns $f_A(x) = \boldsymbol{\alpha}^\top\mathbf{A}_{x_1}\cdots\mathbf{A}_{x_t}\boldsymbol{\beta} = \boldsymbol{\alpha}^\top\mathbf{A}_x\boldsymbol{\beta}$. A function $f : \Sigma^* \to \mathbb{R}$ is called rational if there exists a WFA that realizes it. The rank of the function is the size of the smallest WFA realizing $f$. Given $f : \Sigma^* \to \mathbb{R}$, we can consider the Hankel matrix $\mathbf{H}_f \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$, having rows and columns indexed by strings and defined by $\mathbf{H}_f(p, s) = f(ps)$ for $p, s \in \Sigma^*$.
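As a concrete illustration, here is a minimal Python sketch of how a one-letter WFA realizes a function; the parameters are hypothetical stand-ins of our own, not an example from the paper:

```python
import numpy as np

# Hypothetical one-letter WFA <alpha, A, beta> (illustrative only).
alpha = np.array([1.0, 0.0])    # initial weights
A     = np.array([[0.0, 0.5],
                  [0.5, 0.0]])  # transition matrix of the single letter a
beta  = np.array([0.0, 1.0])    # final weights

def f(t: int) -> float:
    """Value on the string a^t: f(a^t) = alpha^T A^t beta."""
    return float(alpha @ np.linalg.matrix_power(A, t) @ beta)

print([f(t) for t in range(6)])   # 0.0, 0.5, 0.0, 0.125, 0.0, ...
```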
Definition 2.1.
A (bi-infinite) matrix $\mathbf{H} \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ is Hankel if for all $p, s, p', s' \in \Sigma^*$ such that $ps = p's'$, we have $\mathbf{H}(p, s) = \mathbf{H}(p', s')$. Given a Hankel matrix $\mathbf{H} \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$, there exists a unique function $f : \Sigma^* \to \mathbb{R}$ such that $\mathbf{H}_f = \mathbf{H}$.
Theorem 2.1 ([15, 19]).
A function $f : \Sigma^* \to \mathbb{R}$ can be realized by a WFA if and only if $\mathbf{H}_f$ has finite rank $n$. In that case, $n$ is the minimal number of states of any WFA $A$ such that $f_A = f$.
Given a WFA $A = \langle\boldsymbol{\alpha}, \{\mathbf{A}_\sigma\}, \boldsymbol{\beta}\rangle$, the forward matrix of $A$ is the infinite matrix $\mathbf{F} \in \mathbb{R}^{\Sigma^* \times n}$ given by $\mathbf{F}(p,:) = \boldsymbol{\alpha}^\top\mathbf{A}_p$ for any $p \in \Sigma^*$, while the backward matrix of $A$ is $\mathbf{B} \in \mathbb{R}^{\Sigma^* \times n}$, given by $\mathbf{B}(s,:) = (\mathbf{A}_s\boldsymbol{\beta})^\top$ for any $s \in \Sigma^*$. Let $\mathbf{H}_f$ be the Hankel matrix of $f$; its forward-backward (FB) factorization is: $\mathbf{H}_f = \mathbf{F}\mathbf{B}^\top$. A WFA with $n$ states is reachable if $\operatorname{rank}(\mathbf{F}) = n$, while it is observable if $\operatorname{rank}(\mathbf{B}) = n$. A WFA is minimal if it is reachable and observable. If $A$ is minimal, the FB factorization is a rank factorization [7].
We recall the definition of the singular value automaton, a canonical form for WFAs [10].
Definition 2.2.
Let $f$ be a rational function and suppose $\mathbf{H}_f$ admits an SVD, $\mathbf{H}_f = \mathbf{U}\mathbf{D}\mathbf{V}^\top$. A singular value automaton (SVA) for $f$ is the minimal WFA $A$ realizing $f$ such that $\mathbf{F} = \mathbf{U}\mathbf{D}^{1/2}$ and $\mathbf{B} = \mathbf{V}\mathbf{D}^{1/2}$.
The SVA can be computed with an efficient algorithm relying on the following matrices [11].
Definition 2.3.
Let $f$ be a rational function and $\mathbf{H}_f = \mathbf{F}\mathbf{B}^\top$ a FB factorization. If the matrices $\mathbf{P} = \mathbf{F}^\top\mathbf{F}$ and $\mathbf{Q} = \mathbf{B}^\top\mathbf{B}$ are well defined, we call $\mathbf{P}$ the reachability Gramian and $\mathbf{Q}$ the observability Gramian.
Note that if $A$ is an SVA, then the Gramians associated with its FB factorization satisfy $\mathbf{P} = \mathbf{Q} = \mathbf{D}$, where $\mathbf{D}$ is the matrix of singular values of its Hankel matrix. The Gramians can alternatively be characterized (and computed [11]) using fixed point equations, corresponding to Lyapunov equations when $|\Sigma| = 1$ [27].
Theorem 2.2.
Let $f$ be rational, realized by a WFA $A = \langle\boldsymbol{\alpha}, \mathbf{A}, \boldsymbol{\beta}\rangle$ over a one-letter alphabet with $n$ states and well-defined Gramians $\mathbf{P}$, $\mathbf{Q}$. Then $\mathbf{P}$ and $\mathbf{Q}$ solve:
$$\mathbf{P} = \mathbf{A}^\top\mathbf{P}\mathbf{A} + \boldsymbol{\alpha}\boldsymbol{\alpha}^\top \tag{1}$$
$$\mathbf{Q} = \mathbf{A}\mathbf{Q}\mathbf{A}^\top + \boldsymbol{\beta}\boldsymbol{\beta}^\top \tag{2}$$
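Assuming the transposition convention implied by the forward and backward matrices above (which should be checked against the chosen definitions), the Gramians can be computed with an off-the-shelf discrete Lyapunov solver: `scipy.linalg.solve_discrete_lyapunov(a, q)` solves $\mathbf{X} = \mathbf{a}\mathbf{X}\mathbf{a}^\top + \mathbf{q}$. A sketch with the same hypothetical parameters as before:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

alpha = np.array([1.0, 0.0])
A     = np.array([[0.0, 0.5],
                  [0.5, 0.0]])
beta  = np.array([0.0, 1.0])

# P = A^T P A + alpha alpha^T ;  Q = A Q A^T + beta beta^T  (rho(A) < 1).
P = solve_discrete_lyapunov(A.T, np.outer(alpha, alpha))
Q = solve_discrete_lyapunov(A,   np.outer(beta, beta))

# Cross-check P against its series definition sum_t (A^t)^T aa^T A^t.
P_series = sum(np.linalg.matrix_power(A, t).T
               @ np.outer(alpha, alpha)
               @ np.linalg.matrix_power(A, t) for t in range(100))
assert np.allclose(P, P_series)
```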
Finally, we recall the following definition.
Definition 2.4.
A WFA $A = \langle\boldsymbol{\alpha}, \{\mathbf{A}_\sigma\}, \boldsymbol{\beta}\rangle$ is a generative probabilistic automaton (GPA) if $f_A(x) \geq 0$ for every $x \in \Sigma^*$, and $\sum_{x \in \Sigma^*} f_A(x) = 1$, i.e. if $f_A$ computes a probability distribution over $\Sigma^*$.
Example 2.3.
Let $\Sigma = \{a\}$. The WFA $A = \langle\boldsymbol{\alpha}, \mathbf{A}, \boldsymbol{\beta}\rangle$, with:
is a GPA. Indeed, $f_A(x) \geq 0$ for every $x \in \Sigma^*$ and $\sum_{x \in \Sigma^*} f_A(x) = 1$, since the rational function is:
where $a^n$ corresponds to the string where $a$ is repeated $n$ times. We remark that $A$ is minimal and in its SVA form, with diagonal Gramians, and has rank $2$. Finally, the corresponding Hankel matrix, also of rank $2$, is:
(3)
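A finite truncation of the Hankel matrix makes the rank statement easy to check numerically. The sketch below uses hypothetical rank-2 parameters of our own (the concrete values of Example 2.3 are not reproduced here):

```python
import numpy as np

# Hypothetical rank-2 rational function over a one-letter alphabet.
alpha = np.array([1.0, 0.0])
A     = np.array([[0.0, 0.5],
                  [0.5, 0.0]])
beta  = np.array([0.0, 1.0])
f = lambda t: alpha @ np.linalg.matrix_power(A, t) @ beta

# N x N section of the infinite Hankel matrix H(i, j) = f(a^{i+j}).
N = 40
H = np.array([[f(i + j) for j in range(N)] for i in range(N)])

print(np.linalg.matrix_rank(H))                # 2, the rank of f
print(np.linalg.svd(H, compute_uv=False)[:3])  # two nonzero singular numbers
```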
2.3 AAK Theory
Theorem 2.1 provides us with a way to associate a minimal WFA $A$ with $n$ states to a Hankel matrix $\mathbf{H}$ of rank $n$. The approach we propose to approximate $A$ is to find the WFA corresponding to the rank-$k$ matrix $\hat{\mathbf{H}}$ that minimizes $\|\mathbf{H} - \hat{\mathbf{H}}\|$ in the spectral norm. We recall the fundamental result of Schmidt, Eckart, Young and Mirsky [17].
Theorem 2.4 ([17]).
Let $\mathbf{H}$ be a Hankel matrix corresponding to a compact Hankel operator of rank $n$, and let $\sigma_0 \geq \sigma_1 \geq \dots \geq \sigma_{n-1} > 0$ be its singular numbers. Then, if $\mathbf{R}$ is a matrix of rank $k < n$, we have:
$$\|\mathbf{H} - \mathbf{R}\| \geq \sigma_k. \tag{4}$$
The equality is attained when $\mathbf{R}$ corresponds to the truncated SVD of $\mathbf{H}$.
Note that a low-rank approximation obtained by truncating the SVD is not in general a Hankel matrix. This is problematic, since $\hat{\mathbf{H}}$ needs to be Hankel in order to be the Hankel matrix of a WFA. Surprisingly, we can obtain a result comparable to the one of Theorem 2.4 while preserving the Hankel property. This is possible thanks to a theory of optimal approximation called AAK theory [1]. To apply this theory, we will need to rewrite the approximation problem in functional analysis terms. First, we will associate a linear operator to the Hankel matrix. Then, we will use Fourier analysis to reformulate the problem in a function space. A comprehensive presentation of the concepts recalled in this section can be found in [30, 33, 28].
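The following sketch illustrates the issue on a finite section of a Hankel matrix (the same hypothetical rank-2 sequence as above, in closed form): the truncated SVD attains the Eckart-Young bound, but the resulting matrix violates the Hankel constraint, since entries on the same anti-diagonal disagree.

```python
import numpy as np

# Finite section of a (hypothetical) rank-2 Hankel matrix.
f = lambda t: (0.5 ** t - (-0.5) ** t) / 2
N = 8
H = np.array([[f(i + j) for j in range(N)] for i in range(N)])

U, s, Vt = np.linalg.svd(H)
k = 1
R = U[:, :k] * s[:k] @ Vt[:k, :]   # best rank-k approximation (Theorem 2.4)

print(np.linalg.norm(H - R, 2), s[k])  # the error attains sigma_k
# R is not Hankel: these two entries lie on the same anti-diagonal.
print(R[0, 2], R[1, 1])
```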
Let $f : \Sigma^* \to \mathbb{R}$ be a rational function; we interpret the corresponding Hankel matrix $\mathbf{H}_f$ as the expression of a linear (Hankel) operator $H_f : \ell^2 \to \ell^2$ in terms of the canonical basis. We recall that a Hankel operator $H_f$ is bounded when $f \in \ell^2$ [11]. This property, together with the fact that we only consider finite rank operators (corresponding to the Hankel matrices of WFAs), is sufficient to guarantee compactness.
To introduce AAK theory, we need to consider a second realization of Hankel operators on complex spaces. Since in this paper we work with two classes of functions – functions over sequences and complex functions – to avoid any confusion we will make explicit the dependence on the complex variable $z$. We start by recalling a few fundamental definitions from the theory of complex functions. Note that a function $f \in \mathcal{L}^2(\mathbb{T})$ can be represented, using the orthonormal basis $\{z^n\}_{n \in \mathbb{Z}}$, by means of its Fourier series: $f(z) = \sum_{n \in \mathbb{Z}} \hat{f}(n) z^n$, with Fourier coefficients $\hat{f}(n) = \int_{\mathbb{T}} f(z)\bar{z}^n\,dm(z)$. This establishes an isomorphism between the function $f$ and the sequence of its Fourier coefficients $\{\hat{f}(n)\}_{n \in \mathbb{Z}}$. Thus, we can partition the function space $\mathcal{L}^2(\mathbb{T})$ into two subspaces.
Definition 2.5.
For $1 \leq p \leq \infty$, the Hardy space on $\mathbb{T}$ is the subspace of $\mathcal{L}^p(\mathbb{T})$ defined as:
$$H^p = \{f \in \mathcal{L}^p(\mathbb{T}) : \hat{f}(n) = 0,\ n < 0\}, \tag{5}$$
while the negative Hardy space on $\mathbb{T}$ is the subspace of $\mathcal{L}^p(\mathbb{T})$ defined as:
$$H^p_- = \{f \in \mathcal{L}^p(\mathbb{T}) : \hat{f}(n) = 0,\ n \geq 0\}. \tag{6}$$
It is possible to define Hardy spaces also on the open unit disc $\mathbb{D}$.
Definition 2.6.
The Hardy space on $\mathbb{D}$ for $1 \leq p < \infty$ consists of the functions analytic in $\mathbb{D}$ and such that:
$$\|f\|_{H^p} = \sup_{0 < r < 1}\left(\int_{\mathbb{T}} |f(rz)|^p\,dm(z)\right)^{1/p} < \infty, \tag{7}$$
and it is equipped with the norm $\|\cdot\|_{H^p}$. For $p = \infty$, $H^\infty(\mathbb{D})$ is the space of bounded analytic functions in $\mathbb{D}$ with norm:
$$\|f\|_{H^\infty} = \sup_{z \in \mathbb{D}} |f(z)|. \tag{8}$$
Interestingly, $H^p(\mathbb{D})$ and $H^p$ can be canonically identified by associating a function with its limit on the boundary, which is a function in $H^p$ (a proof can be found in [30]). Thus, we will make no difference between those functions in the unit disc and their boundary values on the circle.
We can now embed the sequence space $\ell^2$ over $\mathbb{N}$ into $\ell^2(\mathbb{Z})$ by "duplicating" each vector, i.e. by associating to $\mathbf{x} = (x_0, x_1, \dots)$ both the sequence supported on the non-negative indices and the one supported on the negative indices. Then, we can use the Fourier isomorphism to map the vector to the function space $\mathcal{L}^2(\mathbb{T})$. In this way each vector $\mathbf{x} \in \ell^2$ corresponds to two functions in the Hardy spaces:
$$\mathbf{x}^+(z) = \sum_{n \geq 0} x_n z^n \in H^2, \tag{9}$$
$$\mathbf{x}^-(z) = \sum_{n \geq 0} x_n \bar{z}^{n+1} \in H^2_-. \tag{10}$$
This leads to an alternative characterization of Hankel operators in Hardy spaces.
Definition 2.7.
Let $\phi$ be a function in the space $\mathcal{L}^2(\mathbb{T})$. A Hankel operator $H_\phi : H^2 \to H^2_-$ is an operator defined by $H_\phi f = \mathbb{P}_-\phi f$, where $\mathbb{P}_-$ is the orthogonal projection from $\mathcal{L}^2(\mathbb{T})$ onto $H^2_-$. The function $\phi$ is called a symbol of the Hankel operator $H_\phi$.
If $H_\phi$ is a bounded operator, we can consider without loss of generality $\phi \in \mathcal{L}^\infty(\mathbb{T})$. This is a consequence of Nehari’s theorem [29], whose formulation can be found in Appendix A, together with more details about the two definitions of Hankel operators. We remark that a Hankel operator has infinitely many different symbols, since $H_\phi = H_{\phi + \psi}$ for $\psi \in H^\infty$.
Remark 1.
In the standard orthonormal bases, $\{z^k\}_{k \geq 0}$ in $H^2$ and $\{\bar{z}^{j+1}\}_{j \geq 0}$ in $H^2_-$, the Hankel operator $H_\phi$ has Hankel matrix $\mathbf{H}_\phi(j, k) = \hat{\phi}(-j-k-1)$ for $j, k \geq 0$.
Example 2.5.
In the case of the Hankel matrix of Example 2.3, since $\mathbf{H}(j, k) = \hat{\phi}(-j-k-1)$, we have:
Hence, we can recover the corresponding symbol:
Definition 2.8.
The complex function $g$ is rational if $g = p/q$, with $p$ and $q$ polynomials. The rank of $g$ is the maximum between the degrees of $p$ and $q$. A rational function is strictly proper if the degree of $p$ is strictly smaller than that of $q$.
We remark that the poles of a rational complex function $g = p/q$ correspond to the zeros of $q$. The following result of Kronecker relates finite-rank infinite Hankel matrices to rational functions.
Theorem 2.6 ([24]).
Let $H_\phi$ be a bounded Hankel operator with matrix $\mathbf{H}$. Then $\mathbf{H}$ has finite rank if and only if $\mathbb{P}_-\phi$ is a strictly proper rational function. Moreover, the rank of $\mathbf{H}$ is equal to the number of poles (with multiplicities) of $\mathbb{P}_-\phi$ inside the unit disc.
Example 2.7.
The function $\phi(z)$ in Example 2.5 is rational and has two poles inside the unit disc. Thus, the associated Hankel matrix has rank 2.
We state as a remark an important takeaway from this section.
Remark 2.
Given a rank $n$ Hankel matrix $\mathbf{H}$, we can look at it in two alternative ways. On the one hand we can consider the Hankel operator over sequences $H_f : \ell^2 \to \ell^2$, associated to a function $f : \Sigma^* \to \mathbb{R}$. In this case $\mathbf{H}(j, k) = f(a^{j+k})$ for $j, k \geq 0$, and $f$ is rational in the sense that it is realized by a WFA of size $n$. On the other hand, we can consider the Hankel operator over complex (Hardy) spaces $H_\phi : H^2 \to H^2_-$, associated to a function $\phi \in \mathcal{L}^2(\mathbb{T})$, the symbol. In this case $\mathbf{H}(j, k) = \hat{\phi}(-j-k-1)$, and $\mathbb{P}_-\phi$ is rational of rank $n$ in the sense of Definition 2.8.
We can now introduce the main result of Adamyan, Arov and Krein [1]. The theorem, stated for Hankel operators over Hardy spaces, shows that for infinite dimensional Hankel matrices the constraint of preserving the Hankel property does not affect the achievable approximation error.
Theorem 2.8 (AAK-1[1]).
Let $H$ be a compact Hankel operator of rank $n$ and singular numbers $\sigma_0 \geq \sigma_1 \geq \dots \geq \sigma_{n-1} > 0$, and let $0 \leq k < n$. Then there exists a unique Hankel operator $G$ of rank $k$ such that:
$$\|H - G\| = \sigma_k. \tag{11}$$
We denote with $\mathcal{R}_k$ the set of strictly proper rational functions of rank $k$, and we consider the set:
$$H^\infty_k = \{\psi = g + h : g \in \mathcal{R}_k,\ h \in H^\infty\}. \tag{12}$$
We can reformulate the theorem in terms of symbols.
Theorem 2.9 (AAK-2 [1]).
Let $\phi \in \mathcal{L}^\infty(\mathbb{T})$. Then there exists $\psi \in H^\infty_k$ such that:
$$\|\phi - \psi\|_\infty = \sigma_k(H_\phi). \tag{13}$$
Corollary 1.
Let $\psi = g + h \in H^\infty_k$, with $g \in \mathcal{R}_k$ and $h \in H^\infty$, be the best approximation of Theorem 2.9. Then the Hankel operator $H_g$ has rank $k$ and solves the approximation problem of Theorem 2.8, i.e. $\|H_\phi - H_g\| = \sigma_k$.
We state as a corollary the key point of the proof of the AAK theorem, which provides us with a practical way to find the best approximating symbol.
Corollary 2.
Let $\phi$ and $\{\xi, \eta\}$ be a symbol and a $\sigma_k$-Schmidt pair for $H_\phi$. A function $\psi \in H^\infty_k$ is the best AAK approximation according to Theorem 2.9, if and only if:
$$(\phi - \psi)\,\xi^+ = \sigma_k\,\eta^-. \tag{14}$$
Moreover, the function $\phi - \psi$ does not depend on the particular choice of the pair $\{\xi, \eta\}$.
3 Approximate Minimization
3.1 Assumptions
To apply AAK theory to the approximate minimization problem, we consider only automata with real weights, defined over a one-letter alphabet. In this case, the free monoid generated by the single letter can be identified with $\mathbb{N}$, and canonically embedded into $\mathbb{Z}$. This passage is fundamental to use Fourier analysis and the isomorphism that leads to Theorems 2.8 and 2.9. Unfortunately, this idea cannot be directly generalized to bigger alphabets, since in this case we would obtain a free non-abelian monoid (not identifiable with $\mathbb{Z}$).
Theorem 2.8 requires the Hankel operator to be bounded. To ensure that a minimal WFA $A = \langle\boldsymbol{\alpha}, \mathbf{A}, \boldsymbol{\beta}\rangle$ satisfies this condition, we assume $\rho(\mathbf{A}) < 1$, where $\rho(\mathbf{A})$ is the spectral radius of $\mathbf{A}$ [11]. As a matter of fact, to guarantee the boundedness of the Hankel operator it is enough that the considered WFA computes a function $f \in \ell^2$ [11]. However, the stricter assumption on the spectral radius is needed when computing the symbol associated to a WFA. This condition directly implies the existence of the SVA, and of the Gramian matrices $\mathbf{P}$ and $\mathbf{Q}$, with $\mathbf{P} = \mathbf{Q} = \mathbf{D}$ diagonal [11]. We assume that $A$ is in SVA form. In this case, given the size of the alphabet, the Hankel matrix is symmetric, so $\mathbf{H} = \mathbf{H}^\top$. Moreover, if we denote with $\lambda_i$ the $i$-th non-zero eigenvalue of $\mathbf{H}$, and we consider the coordinates of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, we have that $\alpha_i = s_i\beta_i$, where $s_i = \operatorname{sign}(\lambda_i)$.
For example, we note that a minimal GPA computes a function $f \in \ell^1 \subseteq \ell^2$, so the boundedness condition is automatically satisfied by this class of WFAs [11]. Possible relaxations of the spectral radius assumption are discussed in Appendix C, together with an alternative method to find the optimal spectral-norm approximation of a Hankel matrix without extracting a WFA.
3.2 Problem Formulation
Let $A = \langle\boldsymbol{\alpha}, \mathbf{A}, \boldsymbol{\beta}\rangle$ be a minimal WFA with $n$ states in SVA form, defined over a one-letter alphabet. Let $\mathbf{H}$ be the Hankel matrix of $A$; we denote with $\sigma_i$, for $0 \leq i < n$, its singular numbers. Given a target number of states $k < n$, we say that a WFA $\hat{A}$ with $k$ states solves the optimal spectral-norm approximate minimization problem if the Hankel matrix $\hat{\mathbf{H}}$ of $\hat{A}$ satisfies $\|\mathbf{H} - \hat{\mathbf{H}}\| = \sigma_k$. Note that the content of the “optimal spectral-norm approximate minimization” is equivalent to the problem solved by Theorem 2.8, with the exception that here we insist on representing the inputs and outputs of the problem effectively by means of WFAs. Based on the AAK theory sketched in Section 2.3, we draw the following steps:
1. Compute a symbol $\phi$ for $H$ using Remark 2. We obtain the negative Fourier coefficients of $\phi$ and derive its Fourier series.
2. Compute the optimal symbol $\psi$ using Corollary 2. The main challenge here is to find a suitable representation for the functions involved in Equation 14. We define them in terms of two auxiliary WFAs. The key point is to select constraints on their parameters to leverage the properties of weighted automata, while still keeping the formulation general.
3. Extract the rational component $g$ of the optimal symbol $\psi$ (Section 3.3.3).
4. Find a WFA representation for $g$. Since in Step 2 we parametrized the functions using WFAs, the expression of $g$ directly reveals the WFA $\hat{A}$.
3.3 Spectral-Norm Approximate Minimization
In the following sections we will consider a minimal WFA $A = \langle\boldsymbol{\alpha}, \mathbf{A}, \boldsymbol{\beta}\rangle$ with $n$ states in SVA form, defined over a one-letter alphabet $\Sigma = \{a\}$, its Hankel matrix $\mathbf{H}$, corresponding to the bounded operator $H$, and the singular numbers $\sigma_i$, for $0 \leq i < n$. Let $f$ be the function realized by $A$. We denote by $a^t$ the string where $a$ is repeated $t$ times (with $a^0 = \varepsilon$), so we have $f(a^t) = \boldsymbol{\alpha}^\top\mathbf{A}^t\boldsymbol{\beta}$.
3.3.1 Computation of a Symbol for A
To determine the symbol of $H$, we recall that each entry of the Hankel matrix $\mathbf{H}$ corresponds simultaneously to the values of $f$ and to the negative Fourier coefficients of $\phi$. In fact, as seen in Remark 2, we have:
$$\mathbf{H}(j, k) = f(a^{j+k}) = \hat{\phi}(-j-k-1). \tag{15}$$
We obtain:
$$\phi(z) = \sum_{j \geq 0} f(a^j)\,z^{-j-1} = \sum_{j \geq 0} \boldsymbol{\alpha}^\top\mathbf{A}^j\boldsymbol{\beta}\,z^{-j-1} = \boldsymbol{\alpha}^\top(z\mathbf{I} - \mathbf{A})^{-1}\boldsymbol{\beta}, \tag{16}$$
where we use the fact that $\rho(\mathbf{A}) < 1$ for the last equality. Since the function obtained is already bounded on $\mathbb{T}$, we can directly consider $\phi(z) = \boldsymbol{\alpha}^\top(z\mathbf{I} - \mathbf{A})^{-1}\boldsymbol{\beta}$ as a symbol for $H$.
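Numerically, one can sanity-check Equation 16 by recovering the negative Fourier coefficients of $\phi$ with quadrature on the unit circle and comparing them with the values $f(a^t)$; the parameters below are the hypothetical ones used earlier:

```python
import numpy as np

alpha = np.array([1.0, 0.0])
A     = np.array([[0.0, 0.5],
                  [0.5, 0.0]])
beta  = np.array([0.0, 1.0])

def phi(z):
    """phi(z) = alpha^T (zI - A)^{-1} beta, well defined for |z| = 1 > rho(A)."""
    return alpha @ np.linalg.solve(z * np.eye(2) - A, beta)

# Quadrature on the unit circle: hat(phi)(-t-1) should equal f(a^t).
M = 512
zs = np.exp(2j * np.pi * np.arange(M) / M)
for t in range(4):
    coeff = np.mean([phi(z) * z ** (t + 1) for z in zs])
    ft = alpha @ np.linalg.matrix_power(A, t) @ beta
    print(t, coeff.real, ft)
```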
3.3.2 Computation of the Optimal Symbol
We consider two auxiliary WFAs. Let $\hat{A} = \langle\hat{\boldsymbol{\alpha}}, \hat{\mathbf{A}}, \hat{\boldsymbol{\beta}}\rangle$ be a WFA satisfying the following properties:
1. $\hat{\mathbf{A}}$ has no eigenvalue of modulus one;
2. the automaton is minimal, with
(17)
Using the parameters of the automaton $\hat{A}$ and a constant $C$, we define a function $\hat{\phi}(z) = \hat{\boldsymbol{\alpha}}^\top(z\mathbf{I} - \hat{\mathbf{A}})^{-1}\hat{\boldsymbol{\beta}} + C$. We remark that the poles of $\hat{\phi}$ correspond to the eigenvalues of $\hat{\mathbf{A}}$, counted with their multiplicities. By assumption, $\hat{\mathbf{A}}$ has no eigenvalue of modulus one, so $\hat{\phi}$ does not have any poles on the unit circle, and therefore $\hat{\phi} \in \mathcal{L}^\infty(\mathbb{T})$. Analogously, the error function $\phi - \hat{\phi}$ is also bounded on the circle.
Our objective is to compute the parameters of $\hat{A}$ that make $\hat{\phi}$ the best approximation of $\phi$ according to Theorem 2.9. In particular, we will use Corollary 2 to find the triple $\langle\hat{\boldsymbol{\alpha}}, \hat{\mathbf{A}}, \hat{\boldsymbol{\beta}}\rangle$ such that $\hat{\phi}$ satisfies Equation 14. Note that, with this purpose, the constant term $C$ becomes necessary to characterize $\hat{\phi}$. In fact, while the $H^\infty$-component of the symbol does not affect the Hankel norm, it plays a role in the computation of the $\mathcal{L}^\infty$-norm (in Equation 13) according to Nehari's theorem (Theorem A.1), so it cannot be dismissed.
The following theorem provides us with an explicit expression for the singular functions in the Hardy spaces corresponding to a $\sigma_k$-Schmidt pair.
Theorem 3.2.
Let $\sigma_k$ be a singular number of the Hankel operator $H$. The singular functions associated with the $\sigma_k$-Schmidt pair $\{\xi_k, \eta_k\}$ of $H$ are:
(18)
(19)
If $\hat{\phi}$ is the best approximation to the symbol $\phi$, then the error function $\sigma_k^{-1}(\phi - \hat{\phi})$ has modulus one almost everywhere on the unit circle (i.e. it is unimodular).
Proof.
Let $\mathbf{F}$ and $\mathbf{B}$ be the forward and backward matrices of $A$, respectively, with $\mathbf{H} = \mathbf{F}\mathbf{B}^\top$, $\mathbf{P} = \mathbf{F}^\top\mathbf{F}$ and $\mathbf{Q} = \mathbf{B}^\top\mathbf{B}$. We consider the $\sigma_k$-Schmidt pair $\{\xi_k, \eta_k\}$. By definition, $H\xi_k = \sigma_k\eta_k$ and $H^*\eta_k = \sigma_k\xi_k$. Rewriting in terms of the FB factorization, we obtain:
(20)
(21)
(22)
(23)
where in the last step we set the appropriate change of variables to reduce the SVD problem of the infinite matrix $\mathbf{H}$ to that of a finite one. Note that, since $\mathbf{P}$ and $\mathbf{Q}$ are diagonal, the relevant singular vector is the $k$-th coordinate vector $\mathbf{e}_k$. Since $\mathbf{e}_k$ is an eigenvector associated with $\sigma_k$, we get:
(24)
(25)
Moreover, $\mathbf{H}$ is symmetric, so the singular vectors $\xi_k$ and $\eta_k$ have the same coordinates up to the sign of the corresponding eigenvalues. We obtain:
(26)
(27)
where the singular functions have been computed following Equations 9 and 10. If $r$ is the multiplicity of $\sigma_k$, from Corollary 2 we get the following fundamental equation:
where the unknown coefficient is a matrix to be determined. Consequently, we obtain the function:
which is unimodular. ∎
It is reasonable to wonder how the fact that $\sigma_k^{-1}(\phi - \hat{\phi})$ is unimodular reflects on the structure of the WFA associated with it. We remark that, a priori, the controllability and observability Gramians of $\hat{A}$ might not be well defined. The following theorem provides us with two matrices $\mathbf{P}$ and $\mathbf{Q}$ satisfying properties similar to those of the Gramians. This theorem is the analogue of a control theory result [16], rephrased in terms of WFAs. A sketch of the proof, which relies on the minimality of the WFA [38], can be found in Appendix B. For the detailed version of the proof and the original theorem we refer the reader to [16].
Theorem 3.3 ([16]).
Consider the function $\sigma_k^{-1}(\phi - \hat{\phi})$ and the corresponding minimal WFA associated with it. If the function is unimodular, then there exists a unique pair of symmetric invertible matrices $\mathbf{P}$ and $\mathbf{Q}$ satisfying:
(a) the reachability fixed point equation;
(b) the observability fixed point equation;
(c) $\mathbf{P}\mathbf{Q} = \sigma_k^2\mathbf{I}$.
We can now derive the parameters of the WFA $\hat{A}$.
Theorem 3.4.
Let $A = \langle\boldsymbol{\alpha}, \mathbf{A}, \boldsymbol{\beta}\rangle$ be a minimal WFA with $n$ states in its SVA form, and let $\phi$ be a symbol for its Hankel operator $H$. Let $\sigma_k$ be a singular number of multiplicity $r$ for $H$, with $\sigma_{k-1} > \sigma_k = \dots = \sigma_{k+r-1} > \sigma_{k+r}$. We can partition the Gramian matrices $\mathbf{P}$, $\mathbf{Q}$ as follows:
(28)
where the second block is the diagonal matrix containing the remaining singular numbers, and partition $\mathbf{A}$, $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ conformally to the Gramians:
(29)
We denote by $\cdot^+$ the Moore-Penrose pseudo-inverse. If the function $\hat{\phi}$ is the best approximation of $\phi$, then:
• If $\boldsymbol{\alpha}_2 \neq \mathbf{0}$:
(30)
• If $\boldsymbol{\alpha}_2 = \mathbf{0}$:
(31)
Proof of Theorem 3.4.
Since $\sigma_k^{-1}(\phi - \hat{\phi})$ is unimodular, from Theorem 3.3 there exist two symmetric nonsingular matrices $\mathbf{P}$, $\mathbf{Q}$ satisfying the fixed point equations:
(32)
(33)
and such that $\mathbf{P}\mathbf{Q} = \sigma_k^2\mathbf{I}$. We can partition $\mathbf{P}$ and $\mathbf{Q}$ according to the definition of the auxiliary WFA (see Eq. 17):
From Equations 32 and 33, we note that two of the blocks correspond to the controllability and observability Gramians of $A$:
Moreover, since $\mathbf{P}\mathbf{Q} = \sigma_k^2\mathbf{I}$, we obtain a constraint relating the off-diagonal blocks, from which it follows that one of them has full rank. Without loss of generality we can normalize one of these blocks, and choose an appropriate basis for the state space such that the other is determined accordingly. Once these blocks are fixed, the values of the remaining ones are automatically determined. We obtain:
(34)
Now that we have an expression for the matrices $\mathbf{P}$ and $\mathbf{Q}$ of Theorem 3.3, we can rewrite the fixed point equations to derive the parameters $\hat{\boldsymbol{\alpha}}$, $\hat{\mathbf{A}}$ and $\hat{\boldsymbol{\beta}}$. We obtain the following systems:
(35)
We can rewrite the second equation of each system as follows:
(36)
If $\boldsymbol{\alpha}_2 \neq \mathbf{0}$, then also $\boldsymbol{\beta}_2 \neq \mathbf{0}$ (recall that the coordinates of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ agree up to sign), and we have:
(37)
If $\boldsymbol{\alpha}_2 = \mathbf{0}$, we have $\boldsymbol{\beta}_2 = \mathbf{0}$. We remark that the corresponding system of equations is underdetermined when the multiplicity $r$ is small enough with respect to $n$, in which case we can find an alternative set of solutions:
(38)
On the other hand, if the multiplicity of the singular number $\sigma_k$ is more than half the size of the original WFA, the system might not have any solution (unless the relevant parameters were zero to begin with). In this setting the method proposed does not return an approximation. An alternative in this case is to search for an approximation of a slightly smaller or larger size, chosen so that the system in Equation 38 is underdetermined. ∎
3.3.3 Extraction of the Rational Component
The function $\hat{\phi}$ found in Theorem 3.4 corresponds to the solution $\psi$ of Theorem 2.9. To find the solution to the approximation problem we only need to “isolate” the function $g \in \mathcal{R}_k$, i.e. the rational component. To do this, we study the position of the poles of $\hat{\phi}$, since the poles of a strictly proper rational function lie in the unit disc (Theorem 2.6). As noted before, we parametrized $\hat{\phi}$ so that its poles correspond to the eigenvalues of $\hat{\mathbf{A}}$. After a change of basis (detailed in Paragraph 3.3.3.1), we can rewrite $\hat{\mathbf{A}}$ in block-diagonal form:
$$\hat{\mathbf{A}} = \begin{pmatrix}\hat{\mathbf{A}}^- & \mathbf{0}\\ \mathbf{0} & \hat{\mathbf{A}}^+\end{pmatrix}, \tag{39}$$
where the modulus of the eigenvalues of $\hat{\mathbf{A}}^-$ (resp. $\hat{\mathbf{A}}^+$) is smaller (resp. greater) than one. We apply the same change of coordinates on $\hat{\boldsymbol{\alpha}}$ and $\hat{\boldsymbol{\beta}}$, obtaining $\hat{\boldsymbol{\alpha}}^\pm$ and $\hat{\boldsymbol{\beta}}^\pm$.
To conclude the study of the eigenvalues of $\hat{\mathbf{A}}$, we need the following auxiliary result from Ostrowski and Schneider [32]. A proof of this theorem can be found in [41].
Theorem 3.5 ([32]).
Let $A = \langle\boldsymbol{\alpha}, \mathbf{A}, \boldsymbol{\beta}\rangle$, and let $\mathbf{X}$ be a solution to the (reachability) fixed point equation for the WFA $A$. If $A$ is reachable, then:
• The number of eigenvalues $\lambda$ of $\mathbf{A}$ such that $|\lambda| < 1$ is equal to the number of positive eigenvalues of $\mathbf{X}$.
• The number of eigenvalues $\lambda$ of $\mathbf{A}$ such that $|\lambda| > 1$ is equal to the number of negative eigenvalues of $\mathbf{X}$.
We can finally find the rational component of the function $\hat{\phi}$, i.e. the function $g$ from Corollary 1 necessary to solve the approximate minimization problem.
Theorem 3.6.
Let $\hat{A}$, $\hat{\phi}$ be as in Theorem 3.4. The rational component of $\hat{\phi}$ is the function $g(z) = (\hat{\boldsymbol{\alpha}}^-)^\top(z\mathbf{I} - \hat{\mathbf{A}}^-)^{-1}\hat{\boldsymbol{\beta}}^-$.
Proof.
Clearly $\hat{\phi} = g + h$, with $g(z) = (\hat{\boldsymbol{\alpha}}^-)^\top(z\mathbf{I} - \hat{\mathbf{A}}^-)^{-1}\hat{\boldsymbol{\beta}}^-$ and $h \in H^\infty$. To conclude the proof we need to show that $g$ has $k$ poles inside the unit disc, and therefore has rank $k$. We do this by studying the position of the eigenvalues of $\hat{\mathbf{A}}$.
Since $\hat{A}$ is minimal, by definition it is reachable, so we can use Theorem 3.5 and solve the problem by directly examining the eigenvalues of the solution to the corresponding fixed point equation. From the proof of Theorem 3.4 we have an expression for this solution in terms of the diagonal matrix having as elements the singular numbers of $\mathbf{H}$ different from $\sigma_k$. It follows that the relevant block has only strictly positive eigenvalues, and $\hat{\mathbf{A}}$ has $k$ eigenvalues with modulus smaller than one. Thus, $g$ has $k$ poles inside the unit disc, corresponding to the rank of the associated Hankel operator. ∎
3.3.3.1 Block Diagonalization.
In this paragraph we detail the technical steps necessary to rewrite $\hat{\mathbf{A}}$ in block-diagonal form. The problem of computing the Jordan form of a matrix is ill-conditioned, hence it is not suitable for our algorithmic purposes. Instead, we compute the Schur decomposition, i.e. we find an orthogonal matrix $\mathbf{U}$ such that $\mathbf{U}^\top\hat{\mathbf{A}}\mathbf{U}$ is upper triangular, with the eigenvalues of $\hat{\mathbf{A}}$ on the diagonal. We obtain:
$$\mathbf{U}^\top\hat{\mathbf{A}}\mathbf{U} = \begin{pmatrix}\mathbf{T}^- & \mathbf{T}_{12}\\ \mathbf{0} & \mathbf{T}^+\end{pmatrix}, \tag{40}$$
where the eigenvalues are arranged in increasing order of modulus, and the modulus of those in $\mathbf{T}^-$ (resp. $\mathbf{T}^+$) is smaller (resp. greater) than one. To transform this upper triangular matrix into a block-diagonal one, we use the following result.
Theorem 3.7 ([37]).
The equation $\mathbf{A}\mathbf{X} - \mathbf{X}\mathbf{B} = \mathbf{C}$ has a solution $\mathbf{X}$ if and only if the matrices
$$\begin{pmatrix}\mathbf{A} & \mathbf{C}\\ \mathbf{0} & \mathbf{B}\end{pmatrix} \tag{41}$$
$$\begin{pmatrix}\mathbf{A} & \mathbf{0}\\ \mathbf{0} & \mathbf{B}\end{pmatrix} \tag{42}$$
are similar.
Setting $\mathbf{X}$ to be the solution of the Sylvester equation $\mathbf{T}^-\mathbf{X} - \mathbf{X}\mathbf{T}^+ = -\mathbf{T}_{12}$ (which exists and is unique, since $\mathbf{T}^-$ and $\mathbf{T}^+$ have disjoint spectra), we can now derive the rational component of the WFA:
(43)
(44)
(45)
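In Python, the two steps above can be sketched with `scipy.linalg.schur` (which supports ordering the eigenvalues inside the unit circle first, via `sort='iuc'`) and `scipy.linalg.solve_sylvester`; the matrix below is a hypothetical example with eigenvalues on both sides of the circle:

```python
import numpy as np
from scipy.linalg import schur, solve_sylvester

# Hypothetical matrix with eigenvalues 0.5, 2.0, -0.3.
Ahat = np.array([[0.5, 1.0, 0.0],
                 [0.0, 2.0, 1.0],
                 [0.0, 0.0, -0.3]])

# Ordered real Schur form: 'iuc' puts eigenvalues inside the unit circle first.
T, U, sdim = schur(Ahat, output='real', sort='iuc')
T11, T12, T22 = T[:sdim, :sdim], T[:sdim, sdim:], T[sdim:, sdim:]

# Roth / Bartels-Stewart: solve T11 X - X T22 = -T12 to remove the coupling.
X = solve_sylvester(T11, -T22, -T12)

n = Ahat.shape[0]
W = np.eye(n);    W[:sdim, sdim:] = X     # similarity [[I, X], [0, I]]
Winv = np.eye(n); Winv[:sdim, sdim:] = -X

D = Winv @ (U.T @ Ahat @ U) @ W
print(np.round(D, 8))   # block-diagonal: diag(T11, T22)
```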
3.3.4 Solution to the Approximation Problem
In the previous sections, we have derived the rational function $g$ corresponding to the symbol of the operator that solves Theorem 2.8. To find the solution to the approximation problem we only need to find the parameters of $\hat{A}$, the optimal approximating WFA. These are directly revealed by the expression of $g$, thanks to the way we parametrized the functions.
Theorem 3.8.
Let $A$ be a minimal WFA with $n$ states over a one-letter alphabet. Let $A$ be in its SVA form. The optimal spectral-norm approximation of rank $k$ is given by the WFA $\hat{A} = \langle\hat{\boldsymbol{\alpha}}^-, \hat{\mathbf{A}}^-, \hat{\boldsymbol{\beta}}^-\rangle$.
3.4 Error Analysis
The theoretical foundations of AAK theory guarantee that the construction detailed in Section 3.3 produces the rank $k$ optimal spectral-norm approximation of a WFA satisfying our assumptions, and the singular number $\sigma_k$ provides the exact error.
Similarly to the case of SVA truncation [11], due to the ordering of the singular numbers, the error decreases when $k$ increases, meaning that allowing $\hat{A}$ to have more states guarantees a better approximation of $A$. We remark that, while the solution we propose is optimal in the spectral norm, the same is not necessarily true in other norms. Nonetheless, we have the following bound between $\ell^2$-norm and spectral norm (proof in Appendix B).
Theorem 3.9.
Let $A$ be a minimal WFA computing $f$, with Hankel matrix $\mathbf{H}$. Let $\hat{A}$ be its optimal spectral-norm approximation, computing $\hat{f}$, with Hankel matrix $\hat{\mathbf{H}}$. Then:
$$\|f - \hat{f}\|_{\ell^2} \leq \|\mathbf{H} - \hat{\mathbf{H}}\| = \sigma_k. \tag{47}$$
Proof.
Let $f$, $\hat{f}$ be as above, with Hankel matrices $\mathbf{H}$ and $\hat{\mathbf{H}}$, respectively. We have:
$$\|f - \hat{f}\|_{\ell^2} = \|(\mathbf{H} - \hat{\mathbf{H}})\mathbf{e}_0\|_2 \leq \|\mathbf{H} - \hat{\mathbf{H}}\| = \sigma_k,$$
where the first equality follows by definition and by observing that the matrix difference is computed entry-wise, so that the values of $f - \hat{f}$ fill the first column of $\mathbf{H} - \hat{\mathbf{H}}$. ∎
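The mechanism behind the proof — the values of $f - \hat{f}$ sit in the first column of $\mathbf{H} - \hat{\mathbf{H}}$, whose norm is dominated by the spectral norm — can be observed on finite sections. The sketch below uses two arbitrary Hankel sections rather than an actual optimal approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 40
# Two arbitrary sequences and the corresponding finite Hankel sections.
g, ghat = rng.normal(size=2 * N - 1), rng.normal(size=2 * N - 1)
H    = np.array([[g[i + j]    for j in range(N)] for i in range(N)])
Hhat = np.array([[ghat[i + j] for j in range(N)] for i in range(N)])

# First column of the difference carries the sequence difference (on this
# section); its l2 norm is bounded by the spectral norm of the difference.
lhs = np.linalg.norm((H - Hhat)[:, 0])
rhs = np.linalg.norm(H - Hhat, 2)
print(lhs, rhs)
assert lhs <= rhs
```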
4 Algorithm
In this section we describe the algorithm for spectral-norm approximate minimization. The algorithm takes as input a target number of states $k$, a minimal WFA $A = \langle\boldsymbol{\alpha}, \mathbf{A}, \boldsymbol{\beta}\rangle$ with $\rho(\mathbf{A}) < 1$, $n$ states and in SVA form, and its Gramian $\mathbf{D}$. Note that, in the case of $\boldsymbol{\alpha}_2 = \mathbf{0}$, it is enough to substitute the corresponding steps with the analogous ones from Equation 31. The constraints on the WFA to be minimal and in SVA form are not essential. In fact, a WFA with $n$ states can be minimized in polynomial time [14], and the SVA computed as shown in [11].
Using the results of Theorem 3.4, we outline in Algorithm 1, AAKapproximation, the steps necessary to extract the best spectral-norm approximation of a WFA.
Algorithm 1 (AAKapproximation). Input: a minimal WFA $A$ in SVA form, its Gramian $\mathbf{D}$, a target number of states $k$. Output: the optimal spectral-norm approximation $\hat{A}$.
The algorithm involves a call to Algorithm 2, BlockDiagonalize. In particular, this corresponds to the steps, outlined in Paragraph 3.3.3.1, necessary to derive the WFA corresponding to the rational function $g$. We remark that the Sylvester-equation step in BlockDiagonalize can be performed using the Bartels-Stewart algorithm [13].
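A sketch of BlockDiagonalize as a routine is given below, under the assumptions of this section; the function name and interface are ours, and Equations 43–45 are realized here as the restriction of the transformed parameters to the stable block:

```python
import numpy as np
from scipy.linalg import schur, solve_sylvester

def block_diagonalize(A_hat, alpha_hat, beta_hat):
    """Sketch of BlockDiagonalize: return the parameters of the stable
    subsystem (eigenvalues inside the unit disc), which carries the
    rational component g of the optimal symbol."""
    n = A_hat.shape[0]
    T, U, sdim = schur(A_hat, output='real', sort='iuc')  # ordered Schur
    X = solve_sylvester(T[:sdim, :sdim], -T[sdim:, sdim:], -T[:sdim, sdim:])
    W = np.eye(n);    W[:sdim, sdim:] = X
    Winv = np.eye(n); Winv[:sdim, sdim:] = -X
    M, Minv = U @ W, Winv @ U.T          # change of basis and its inverse
    At = Minv @ A_hat @ M                # block-diagonal transition matrix
    at = M.T @ alpha_hat                 # alpha transforms on the row side
    bt = Minv @ beta_hat
    return at[:sdim], At[:sdim, :sdim], bt[:sdim]

# Hypothetical usage: one stable and one unstable eigenvalue.
a, Am, b = block_diagonalize(np.array([[0.5, 1.0], [0.0, 2.0]]),
                             np.ones(2), np.ones(2))
print(np.linalg.eigvals(Am))   # inside the unit disc
```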
To compute the computational cost we recall the following facts [39]:
• The product of two $n \times n$ matrices can be computed in time $O(n^3)$ using a standard iterative algorithm, but can be reduced to $O(n^\omega)$ with $2 \leq \omega < 2.373$.
• The inversion of an $n \times n$ matrix can be computed in time $O(n^3)$ using Gauss-Jordan elimination, but can be reduced to $O(n^\omega)$.
• The computation of the Schur decomposition of an $n \times n$ matrix can be done with a two-step algorithm, where each step takes $O(n^3)$, using the Hessenberg form of the matrix.
• The Bartels-Stewart algorithm applied to upper triangular matrices of sizes $m$ and $l$ takes $O(ml(m + l))$.
The running time of BlockDiagonalize with input a WFA with $n - r$ states is thus in $O((n-r)^3)$, where $r$ is the multiplicity of the singular value considered. The running time of AAKapproximation for an input WFA with $n$ states is in $O(n^3)$.
5 Related Work
The study of approximate minimization for WFAs is very recent, and only a few works have been published on the subject. In [10, 11] the authors present an approximate minimization technique using a canonical expression for WFAs, and provide bounds on the error in the norm. The result is supported by strong theoretical guarantees, but it is not optimal in any norm. An extension of this method to the case of Weighted Tree Automata can be found in [12]. A similar problem is addressed in [25], with less general results. In [26], the authors connect spectral learning to the approximate minimization problem of a small class of Hidden Markov models, bounding the error in terms of the total variation distance.
The control theory community has extensively studied approximate minimization in the context of linear time-invariant systems, and several methods have been proposed [3]. A parallel can be drawn between those results and ours, by noting that the impulse response of a discrete time-invariant Single-Input Single-Output (SISO) system can be parametrized as a WFA over a one-letter alphabet. In [20] Glover presents a state-space solution for the case of continuous Multi-Input Multi-Output (MIMO) systems. Glover's method led to a widespread application of these results, thanks to its computational and theoretical simplicity. This stems from the structure of the Lyapunov equations for continuous systems. It is however not the case for discrete control systems, where the Lyapunov equations have a quadratic form. As noted in [16], there is no simple closed form formula for the state space solution of a discrete system. Thus, most of the results for the discrete case work with a suboptimal version of the problem, with restrictions on the multiplicity of the singular values [6, 2, 23]. A solution for the SISO case can be found, without additional assumptions, using a polynomial approach, but it does not provide an explicit representation of the state space, nor does it generalize to the MIMO setting. The first to actually extend Glover's results to the discrete case was Gu, who provides an elegant solution for the MIMO discrete problem [21]. Glover's and Gu's solutions rely on embedding the initial system into an extension of it, the all-pass system, which is equivalent to the auxiliary WFA in our method. Part of our contribution is the adaptation of some of the control theory tools to our setting.
6 Conclusion
In this paper we applied AAK theory for Hankel operators and complex functions to the framework of WFAs, in order to construct the best possible approximation to an automaton given a bound on the size. We provide an algorithm to find the parameters of the best WFA approximation in the spectral norm, and bounds on the error. Our method applies to real WFAs defined over a one-letter alphabet, with $\rho(\mathbf{A}) < 1$. While this setting is certainly restricted, we believe that this work constitutes a first fundamental step towards optimal approximation. Furthermore, the use of AAK techniques has proven to be very fruitful in related areas like control theory; we think that automata theory can also benefit from it. The use of such methods can help deepen the understanding of the behaviour of rational functions. This paper highlights and strengthens the interesting connections between functional analysis, automata theory and control theory, unifying tools from different domains in one formalism.
A compelling direction for future work is to extend our results to the multi-letter case. The work of Adamyan, Arov and Krein provides us with a powerful theory connecting sequences to the study of complex functions. We note that, unfortunately, this approach cannot be directly generalized to the multi-letter case because of the non-commutative nature of the monoid considered. Extending this work would require adapting standard harmonic analysis results to the non-abelian case. A recent line of work in functional analysis is centered around extending this theory to the case of non-commutative operators, and in [34] a non-commutative version of the AAK theorem is presented. However, those results are non-constructive, making this direction, already challenging, even harder to pursue.
Acknowledgments
This research has been supported by NSERC Canada (C. Lacroce, P. Panangaden, D. Precup) and Canada CIFAR AI chairs program (Guillaume Rabusseau). The authors would like to thank Tianyu Li, Harsh Satija and Alessandro Sordoni for feedback on earlier drafts of this work, Gheorghe Comanici for a detailed review, and Maxime Wabartha for fruitful discussions and for comments on the proofs.
References
- [1] Vadim M. Adamyan, Damir Zyamovich Arov, and Mark Grigorievich Krein. Analytic Properties of Schmidt Pairs for a Hankel Operator and the Generalized Schur–Takagi problem. Mathematics of The Ussr-sbornik, 15:31–73, 1971.
- [2] M. M. Al-Hussari, I. M. Jaimoukha, and D. J. N. Limebeer. A Descriptor Approach for the Solution of the One-Block Distance Problem. In Proceedings of the IFAC World Congress, 1993.
- [3] Athanasios C. Antoulas. Approximation of Large-Scale Dynamical Systems. SIAM, 2005.
- [4] Stéphane Ayache, Rémi Eyraud, and Noé Goudian. Explaining Black Boxes on Sequential Data Using Weighted Automata. In Proceedings of the 14th International Conference on Grammatical Inference, ICGI 2018, Wrocław, Poland, September 5-7, 2018, volume 93 of Proceedings of Machine Learning Research, pages 81–103. PMLR, 2018. URL: http://proceedings.mlr.press/v93/ayache19a.html.
- [5] Raphaël Bailly, François Denis, and Liva Ralaivola. Grammatical Inference as a Principal Component Analysis Problem. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 33–40, New York, NY, USA, 2009. Association for Computing Machinery. doi:10.1145/1553374.1553379.
- [6] Joseph A. Ball and Andre CM. Ran. Optimal Hankel Norm Model Reductions and Wiener–Hopf Factorization I: The Canonical Case. SIAM Journal on Control and Optimization, 25(2):362–382, 1987.
- [7] Borja Balle, Xavier Carreras, Franco M. Luque, and Ariadna Quattoni. Spectral Learning of Weighted Automata - A Forward–Backward Perspective. Mach. Learn., 96(1-2):33–63, 2014. doi:10.1007/s10994-013-5416-x.
- [8] Borja Balle, Pascale Gourdeau, and Prakash Panangaden. Bisimulation Metrics and Norms for Weighted Finite Automata. In Ioannis Chatzigiannakis, Piotr Indyk, Fabian Kuhn, and Anca Muscholl, editors, 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, July 10-14, 2017, Warsaw, Poland, volume 80 of LIPIcs, pages 103:1–103:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPIcs.ICALP.2017.103.
- [9] Borja Balle, William L. Hamilton, and Joelle Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 1386–1394. JMLR.org, 2014. URL: http://proceedings.mlr.press/v32/balle14.html.
- [10] Borja Balle, Prakash Panangaden, and Doina Precup. A Canonical Form for Weighted Automata and Applications to Approximate Minimization. In 30th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2015, Kyoto, Japan, July 6-10, 2015, pages 701–712. IEEE Computer Society, 2015. doi:10.1109/LICS.2015.70.
- [11] Borja Balle, Prakash Panangaden, and Doina Precup. Singular Value Automata and Approximate Minimization. Math. Struct. Comput. Sci., 29(9):1444–1478, 2019. doi:10.1017/S0960129519000094.
- [12] Borja Balle and Guillaume Rabusseau. Approximate minimization of weighted tree automata. Information and Computation, page 104654, 2020.
- [13] R. H. Bartels and G. W. Stewart. Solution of the matrix equation AX + XB = C [F4]. Commun. ACM, 15(9):820–826, 1972.
- [14] Jean Berstel and Christophe Reutenauer. Noncommutative rational series with applications, volume 137. Cambridge University Press, 2011.
- [15] J.W. Carlyle and A. Paz. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 5(1):26–40, 1971.
- [16] Charles K. Chui and Guanrong Chen. Discrete Optimization With Applications in Signal Processing and Control Systems. Springer-Verlag, 1997.
- [17] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936. doi:10.1007/BF02288367.
- [18] Rémi Eyraud and Stéphane Ayache. Distillation of Weighted Automata from Recurrent Neural Networks Using a Spectral Approach. CoRR, abs/2009.13101, 2020. URL: https://arxiv.org/abs/2009.13101, arXiv:2009.13101.
- [19] Michel Fliess. Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 53:197–222, 1974.
- [20] Keith Glover. All Optimal Hankel-Norm Approximations of Linear Multivariable Systems and their $L^\infty$-Error Bounds. International Journal of Control, 39(6):1115–1193, 1984. doi:10.1080/00207178408933239.
- [21] Guoxiang Gu. All Optimal Hankel-Norm Approximations and their Error Bounds in Discrete-Time. International Journal of Control, 78(6):408–423, 2005. doi:10.1080/00207170500110988.
- [22] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. J. Comput. Syst. Sci., 78(5):1460–1480, September 2012. doi:10.1016/j.jcss.2011.12.025.
- [23] Vlad Ionescu and Cristian Oara. The four-block Adamjan-Arov-Krein problem for discrete-time systems. In Linear Algebra and its Applications, pages 95–119. Elsevier, 2001.
- [24] L. Kronecker. Zur Theorie der Elimination einer Variablen aus zwei algebraischen Gleichungen. Monatsber. Königl. Preussischen Akad. Wiss., pages 535–600, 1881.
- [25] Alex Kulesza, Nan Jiang, and Satinder Singh. Low-Rank Spectral Learning with Weighted Loss Functions. In Artificial Intelligence and Statistics, pages 517–525. PMLR, 2015.
- [26] Alex Kulesza, N. Raj Rao, and Satinder Singh. Low-Rank Spectral Learning. In Samuel Kaski and Jukka Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 522–530, Reykjavik, Iceland, 22–25 April 2014. PMLR.
- [27] Aleksandr M. Lyapunov. The General Problem of the Stability of Motion [in Russian]. Gostekhizdat, Moscow, 1950.
- [28] Jean Meinguet. A Simplified Presentation of the Adamjan-Arov-Krein Approximation Theory. In H. Werner, L. Wuytack, E. Ng, and H. J. Bünger, editors, Computational Aspects of Complex Analysis, pages 217–248, Dordrecht, 1983. Springer Netherlands. doi:10.1007/978-94-009-7121-9_9.
- [29] Zeev Nehari. On Bounded Bilinear Forms. Annals of Mathematics, 65(1):153–162, 1957.
- [30] Nikolai K. Nikol’Skii. Operators, Functions and Systems: An Easy Reading, volume 92 of Mathematical Surveys and Monographs. American Mathematical Society, 2002.
- [31] Takamasa Okudono, Masaki Waga, Taro Sekiyama, and Ichiro Hasuo. Weighted Automata Extraction from Recurrent Neural Networks via Regression on State Spaces. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 5306–5314. AAAI Press, 2020. URL: https://aaai.org/ojs/index.php/AAAI/article/view/5977.
- [32] Alexander Ostrowski and Hans Schneider. Some Theorems on the Inertia of General Matrices. J. Math. Anal. Appl, 4(1):72–84, 1962.
- [33] Vladimir Peller. Hankel Operators and their Applications. Springer Science & Business Media, 2012.
- [34] Gelu Popescu. Multivariable Nehari Problem and Interpolation. Journal of Functional Analysis, 200:536–581, 2003. doi:10.1016/S0022-1236(03)00078-8.
- [35] Guillaume Rabusseau, Borja Balle, and Joelle Pineau. Multitask Spectral Learning of Weighted Automata. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 2585–2594, 2017.
- [36] Guillaume Rabusseau, Tianyu Li, and Doina Precup. Connecting Weighted Automata and Recurrent Neural Networks through Spectral Learning. In Kamalika Chaudhuri and Masashi Sugiyama, editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 1630–1639. PMLR, 2019. URL: http://proceedings.mlr.press/v89/rabusseau19a.html.
- [37] William E. Roth. The equations ax - yb = c and ax - xb = c in matrices. Proceedings of the American Mathematical Society, 3(3):392–396, 1952.
- [38] B. De Schutter. Minimal State-Space Realization in Linear System Theory: an Overview. Journal of Computational and Applied Mathematics, 121(1):331–354, 2000. doi:10.1016/S0377-0427(00)00341-1.
- [39] Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. Siam, 1997.
- [40] Gail Weiss, Yoav Goldberg, and Eran Yahav. Learning Deterministic Weighted Automata with Queries and Counterexamples. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8558–8569, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/d3f93e7766e8e1b7ef66dfdd9a8be93b-Abstract.html.
- [41] H. K. Wimmer. On the Ostrowski-Schneider Inertia Theorem. Journal of Mathematical Analysis and Applications, 41(1):164–169, 1973. doi:10.1016/0022-247X(73)90190-X.
- [42] Kehe Zhu. Operator Theory in Function Spaces, volume 138. American Mathematical Society, 1990.
Appendix A Hankel Operators
For more details on the content of this section we refer the reader to [30]. We recall the first definition of Hankel operator.
Definition A.1.
A Hankel operator is a mapping $\Gamma : \ell^2 \to \ell^2$ with matrix $\boldsymbol{\Gamma}(j, k) = \gamma_{j+k}$, where $\{\gamma_m\}_{m \geq 0}$ is a sequence of complex numbers. In other words, given $\mathbf{x} \in \ell^2$, we have $\Gamma\mathbf{x} = \mathbf{y}$, where $\mathbf{y}$ is defined by:
$$y_j = \sum_{k \geq 0} \gamma_{j+k}\,x_k. \tag{48}$$
This property of the Hankel matrix can be rephrased as an operator identity. Defining the shift operator $S : (x_0, x_1, \dots) \mapsto (0, x_0, x_1, \dots)$ and denoting its left inverse by $S^*$, we have that $\Gamma$ is a Hankel operator if and only if:
$$\Gamma S = S^*\Gamma. \tag{49}$$
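On finite sections the identity of Equation 49 can be checked directly, up to the boundary rows and columns lost to truncation; a minimal sketch:

```python
import numpy as np

N = 6
gamma = np.arange(1.0, 2 * N)   # arbitrary sequence gamma_0, gamma_1, ...
G = np.array([[gamma[i + j] for j in range(N)] for i in range(N)])

S = np.eye(N, k=-1)     # right shift on the finite section
Sstar = S.T             # its left inverse

# Gamma S = S* Gamma holds on the (N-1) x (N-1) leading block
# (the truncation loses the last row/column).
print(np.allclose((G @ S)[:N-1, :N-1], (Sstar @ G)[:N-1, :N-1]))
```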
The correspondence between Definition A.1 and Definition 2.7 can easily be made explicit. First, we note that using the isomorphism introduced with Fourier series, we can identify $\ell^2$ with $H^2$. Moreover, the operator of multiplication by $z$ acts as a right shift on the space of Fourier coefficients; analogously, its left inverse corresponds to the truncated multiplication operator. Now, let $\{z^k\}_{k \geq 0}$ and $\{\bar{z}^{j+1}\}_{j \geq 0}$ be bases for $H^2$ and $H^2_-$, respectively, and let
(50)
be the involution on $\mathcal{L}^2(\mathbb{T})$.
Let $\Gamma$ be a Hankel operator with matrix $\boldsymbol{\Gamma}$. Using the bases above and the Fourier identification map, we can obtain an operator acting between Hardy spaces. Following this interpretation, the operator has Hankel matrix with respect to the chosen bases, and satisfies:
(51)
It is now easy to see that the characterization of the Hankel operator given in Definition 2.7 satisfies Equation 51.
The following theorem, due to Nehari [29], is of great importance as it highlights a correspondence between bounded Hankel operators and functions in $\mathcal{L}^\infty(\mathbb{T})$.
Theorem A.1 ([29]).
A Hankel operator $\Gamma$ with matrix $\boldsymbol{\Gamma}(j, k) = \gamma_{j+k}$ is bounded on $\ell^2$ if and only if there exists a function $\psi \in \mathcal{L}^\infty(\mathbb{T})$ such that
$$\hat{\psi}(-m-1) = \gamma_m, \qquad m \geq 0. \tag{52}$$
In this case:
$$\|\Gamma\| = \inf\{\|\psi\|_\infty : \hat{\psi}(-m-1) = \gamma_m,\ m \geq 0\}, \tag{53}$$
where $\hat{\psi}(m)$ denotes the $m$-th Fourier coefficient of $\psi$.
We can now reformulate the theorem using the characterization of Hankel operators in Hardy spaces.
Theorem A.2 ([29]).
Let $\phi \in \mathcal{L}^2(\mathbb{T})$ be a symbol of the Hankel operator $H_\phi$ on Hardy spaces. The following are equivalent:
(1) $H_\phi$ is bounded on $H^2$;
(2) there exists $\psi \in \mathcal{L}^\infty(\mathbb{T})$ such that $\hat{\psi}(m) = \hat{\phi}(m)$ for all $m < 0$.
If the conditions above are satisfied, then:
$$\|H_\phi\| = \inf\{\|\psi\|_\infty : \hat{\psi}(m) = \hat{\phi}(m),\ m < 0\}, \tag{54}$$
or equivalently:
$$\|H_\phi\| = \operatorname{dist}_{\mathcal{L}^\infty}(\phi, H^\infty). \tag{55}$$
Nehari’s Theorem is at the core of the proof of Corollary 1.
Proof of Corollary 1.
Let $H_\phi$ be a Hankel operator with symbol $\phi$ and matrix $\mathbf{H}$. Let $\psi = g + h$, with $g \in \mathcal{R}_k$ and $h \in H^\infty$, be the solution of Equation 13. We have:
(56)
(57)
(58)
where first we used Corollary 2 and then Equation 55. Now, using the definition of Hankel operator, we have:
(59)
Since $\|H_\phi - H_g\| \geq \sigma_k$ (from the Eckart-Young theorem [17]), it follows that $\|H_\phi - H_g\| = \sigma_k$. Note that $H_g$ has rank $k$, as required, because $g \in \mathcal{R}_k$ (Theorem 2.6). ∎
Appendix B Proofs from Section 3
In order to prove Theorem 3.3 we need an auxiliary lemma. Both are analogues of control theory results, rephrased in terms of WFAs. The original theorem and lemma, together with the corresponding proofs, can be found in [16]. Hence, we only provide a sketch of the proofs.
Lemma B.1 ([16]).
Let $\hat{A}$ be a minimal WFA and let the associated function be unimodular. Then there exists a unique invertible symmetric matrix $\mathbf{P}$ satisfying:
(a)
(b)
(c)
Proof.
Since the function is unimodular, we have that:
(60)
where we denote with the superscript $*$ the adjoint function. From the equation above, we obtain:
(61)
(62)
where we used the matrix inversion lemma. On the other hand we have:
(63)
(64)
(65)
where we used again the matrix inversion lemma before grouping the terms. If the quantities in Equation 62 and Equation 65 are to be equal, we need their constant terms to be the same. Then, we want the remaining components to correspond, so we consider the corresponding Hankel matrices. It is easy to see that we can once again associate the coefficients of these complex functions to the parameters of a WFA. From the minimality of the WFA we obtain:
(66)
for an invertible matrix [7]. This system is equivalent to:
(67)
To conclude the proof it remains to check that the matrix obtained is symmetric, which can be verified by direct computation. ∎
Proof of Theorem 3.3 (sketch).
The statement follows by applying Lemma B.1 and its symmetric counterpart to obtain the pair of matrices $\mathbf{P}$, $\mathbf{Q}$; the detailed computations can be found in [16]. ∎
Appendix C Possible Extensions
C.1 Relaxing the Spectral Radius Assumption
It is possible to extend part of our method to WFAs over a one-letter alphabet with $\rho(\mathbf{A}) \geq 1$, but the approximation recovered is not optimal in the spectral norm.
Let $A$, with $\rho(\mathbf{A}) > 1$, be a WFA with $n$ states that we want to minimize. The idea is to block-diagonalize $\mathbf{A}$ like we did in Section 3.3.3, and tackle each component separately. The case of $A^-$, the component having $\rho < 1$, can be dealt with in the way presented in the previous sections. This means that we can find an optimal spectral-norm approximation of the desired size for $A^-$. Now we can consider the second component, $A^+$. The key idea is to apply the transformation $z \mapsto z^{-1}$ to the function associated to $A^+$. Then, the function
(69)
is well defined, as the series converges for $z$ with small enough modulus. Using this transformation we obtain a function with poles inside the unit disc and we can apply the method presented in the paper. An important choice to make is the size of the approximation of each component, as it can influence the quality of the approximation. Analyzing the effects of this parameter on the approximation error constitutes an interesting direction for future work, both on the theoretical and experimental side. We refer the reader to the control theory literature [20], where some theoretical work has been done to study an analogous approach for continuous time systems and their approximation error.
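A small numerical sketch of the transformation (with a hypothetical expanding component): the poles of the transformed function are the reciprocals of the eigenvalues of the block, hence inside the unit disc.

```python
import numpy as np

# Hypothetical expanding component: all eigenvalues outside the unit disc.
Aplus  = np.array([[2.0, 0.3],
                   [0.0, -1.5]])
alphap = np.array([1.0, 1.0])
betap  = np.array([0.5, 1.0])

phi  = lambda z: alphap @ np.linalg.solve(z * np.eye(2) - Aplus, betap)
tphi = lambda z: phi(1.0 / z)    # transformed function (cf. Equation 69)

# Poles of tphi = reciprocals of the eigenvalues of Aplus: inside the disc.
print(1.0 / np.linalg.eigvals(Aplus))
print(tphi(np.exp(1j * 0.7)))    # finite on the unit circle
```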
C.2 Polynomial method
We remark that Equation 14 from Corollary 2 can be rewritten as
$$\psi = \phi - \sigma_k\,\frac{\eta^-}{\xi^+}, \tag{70}$$
where $\xi^+$ (resp. $\eta^-$) is the function in $H^2$ (resp. $H^2_-$) associated to the corresponding singular vector (and the quotient does not depend on the choice of the specific pair).
There is an alternative way to find the best approximation, particularly useful when the objective is to approximate a finite-rank infinite Hankel matrix with another Hankel matrix, without necessarily extracting a WFA. We can consider the adjoint operator and its matrix. The singular numbers and singular vectors of $\mathbf{H}$ correspond to the square roots of the eigenvalues, and to the eigenvectors, of $\mathbf{H}^*\mathbf{H}$. Hence, it is possible to compute $\sigma_k$ and a corresponding singular vector $\xi$. The function $\xi^+$ is then obtained following Equation 9. Then, the Hankel matrix that best approximates $\mathbf{H}$ is given by $\hat{\mathbf{H}} = \mathbf{H} - \mathbf{H}_e$, where $\mathbf{H}_e$ is the Hankel matrix having $\sigma_k\,\eta^-/\xi^+$ as symbol.