Infinite-dimensional reservoir computing
Abstract
Reservoir computing approximation and generalization bounds are proved for a new concept class of input/output systems that extends the so-called generalized Barron functionals to a dynamic context. This new class is characterized by readouts with a certain integral representation built on infinite-dimensional state-space systems. It is shown that this class is very rich and possesses useful features and universal approximation properties. The reservoir architectures used for the approximation and estimation of elements in the new class are randomly generated echo state networks with either linear or ReLU activation functions. Their readouts are built using randomly generated neural networks in which only the output layer is trained (extreme learning machines or random feature neural networks). The results in the paper yield a fully implementable recurrent neural network-based learning algorithm with provable convergence guarantees that do not suffer from the curse of dimensionality.
Key Words: recurrent neural network, reservoir computing, echo state network, ESN, extreme learning machine, ELM, recurrent linear network, machine learning, Barron functional, recurrent Barron functional, universality, finite memory functional, approximation bound, convolutional filter.
1 Introduction
Reservoir computing (RC) [Jaeg 10, Maas 02, Jaeg 04, Maas 11] and in particular echo state networks (ESNs) [Matt 92, Matt 93, Jaeg 04] have gained much popularity in recent years due to their excellent performance in the forecasting of dynamical systems [Grig 14, Jaeg 04, Path 17, Path 18, Lu 18, Wikn 21, Arco 22] and due to the ease of their implementation. RC aims at approximating nonlinear input/output systems using randomly generated state-space systems (called reservoirs) in which only a linear readout is estimated. It has been theoretically established that this is indeed possible in a variety of deterministic and stochastic contexts [Grig 18b, Grig 18a, Gono 20c, Gono 21b, Gono 23] in which RC systems have been shown to have universal approximation properties.
In this paper, we focus on deriving error bounds for a variant of the architectures that we just cited and consider as approximants randomly generated linear systems with readouts given by randomly generated neural networks in which only the output layer is trained. Thus, from a learning perspective, we combine linear echo state networks with what is referred to in the literature as random features [Rahi 07] / extreme learning machines (ELMs) [Huan 06]. We develop explicit and readily computable approximation and estimation bounds for a newly introduced concept class whose elements we refer to as recurrent (generalized) Barron functionals since they can be viewed as a dynamical analog of the (generalized) Barron functions introduced in [Barr 92, Barr 93] and extended later in [E 20b, E 20a, E 19]. The main novelty of this concept class with respect to others already available in the literature, like the fading memory class [Boyd 85] in deterministic setups or functionals in the stochastic case, is that its elements admit an explicit infinite-dimensional state-space representation, which makes them analytically tractable. As we shall see later on, many interesting families of input/output systems belong to this class, which, additionally, is universal in the $L^2$-sense.
From an approximation-theoretical point of view, the universality properties of linear systems with readouts belonging to dense families have been known for a long time both in deterministic [Boyd 85, Grig 18b] and in stochastic [Gono 20c] setups. Their corresponding dynamical and memory properties have been extensively studied (see, for instance, [Herm 10, Coui 16, Tino 18, Tino 20, Gono 20a, Li 22] and references therein). Our contribution in this paper can hence be considered as an extension of those works to the recurrent Barron class.
All the dynamic learning works cited so far use exclusively finite-dimensional state spaces. Hence, one of the main novelties of our contribution is the infinite-dimensional component in the concept class that we propose. It is worth mentioning that, even though numerous neural functional approximation results exist in static setups (see, for instance, [Chen 95, Stin 99, Krat 20, Bent 22, Cuch 22, Neuf 22], in addition to the works on Barron functions cited above), the use of infinite-dimensional state-space systems has not been much exploited, and it has only very recently begun to be seriously developed; see [Herm 12, Bouv 17b, Bouv 17a, Kira 19, Li 20, Kova 21, Acci 22, Gali 22, Hu 22, Salv 22] for a few examples. Dedicated hardware realizations of RC systems using quantum systems are a potential application of these extensions; in the absence of adequate tools, most of the corresponding theoretical developments have so far been carried out in exclusively finite-dimensional setups [Chen 19, Chen 20, Tran 20, Tran 21, Chen 22, Mart 23].
In order to introduce our contributions in more detail, we start by recalling that an echo state network (ESN) is an input/output system defined by the two equations:
(1.1)  $x_t = \sigma_1(A x_{t-1} + C z_t + \zeta)$
(1.2)  $y_t = W x_t$
for $t \in \mathbb{Z}_-$, where $x_t \in \mathbb{R}^N$ is the state, $z_t \in \mathbb{R}^d$ is the input, the matrix $W$ is trainable, $\sigma_1$ denotes the componentwise application of a given activation function, and $A$, $C$, and $\zeta$ are randomly generated. If for each input $z$ there exists a unique solution $x$ to (1.1), then (1.1)-(1.2) define a mapping $H_W$ via $H_W(z) = y$, with $y$ given by (1.2) for $x$ the unique solution to (1.1) associated with $z$. Given a (typically unknown) functional $H$ to be learned, the readout $W$ is then trained so that $H_W$ is a good approximation of $H$.
For echo state networks (1.1)–(1.2), approximation bounds were given in [Gono 23] for maps which have the property that their restrictions to input sequences of finite length lie in a certain Sobolev space, and with the weights $A$, $C$, and $\zeta$ in (1.1) sampled from a generic distribution with a certain structure. Here we consider a novel architecture for (1.1)–(1.2), where, instead of applying a linear function in (1.2), we apply a random feedforward neural network; that is, (1.2) is replaced by
(1.3)  $y_t = \sum_{i=1}^{n} W_i\,\sigma_2\big(a_i^\top x_t + c_i^\top z_t + b_i\big)$
for $t \in \mathbb{Z}_-$ and with randomly generated coefficients $a_i$, $c_i$, and $b_i$ valued in $\mathbb{R}^N$, $\mathbb{R}^d$, and $\mathbb{R}$, respectively. In most cases, the activation function $\sigma_1$ in (1.1) will be just the identity or possibly a rectified linear unit.
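To fix ideas, here is a minimal NumPy sketch of the architecture (1.1)–(1.3): a randomly generated reservoir, a random-feature readout, and a trainable output layer. All names ($A$, $C$, $\zeta$, $a_i$, $c_i$, $b_i$, $W$), dimensions, scalings, and the tanh readout activation are illustrative assumptions and not prescriptions from the results below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: input, state, readout features, trajectory length.
d, N, n, T = 3, 50, 200, 400

# Randomly generated reservoir parameters for (1.1); A is rescaled so that ||A|| < 1,
# a standard sufficient condition for the existence of a unique solution.
A = rng.normal(size=(N, N)); A *= 0.9 / np.linalg.norm(A, 2)
C = rng.normal(size=(N, d))
zeta = rng.normal(size=N)

# Randomly generated hidden weights for the readout (1.3); only W is meant to be trained.
a = rng.normal(size=(n, N))
c = rng.normal(size=(n, d))
b = rng.normal(size=n)
W = rng.normal(size=n) / n               # trainable output layer (random placeholder here)

def esn_elm_run(z, sigma1=lambda u: u, sigma2=np.tanh):
    """Run the state equation (1.1) and the random-feature readout (1.3)
    over a finite input trajectory z of shape (T, d)."""
    x = np.zeros(N)
    outputs = []
    for z_t in z:
        x = sigma1(A @ x + C @ z_t + zeta)       # state update (1.1)
        phi = sigma2(a @ x + c @ z_t + b)        # random hidden features
        outputs.append(W @ phi)                  # readout (1.3)
    return np.array(outputs)

z = rng.uniform(-1.0, 1.0, size=(T, d))          # bounded inputs
y = esn_elm_run(z)
```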
In order to derive learning error bounds for this architecture, we shall proceed in several steps. First, we examine the new concept class of recurrent (generalized) Barron functionals consisting of reservoir functionals that admit a certain integral representation. This class turns out to possess very useful properties: we show that a large class of functionals, including (but not restricted to) functionals with “sufficient smoothness”, belongs to it and that, under mild conditions, it is dense in the $L^2$-sense. We then examine the approximation of elements in this class by reservoir computing systems and derive approximation error bounds. This shows that a large class of functionals can be approximated by recurrent neural networks whose hidden weights are randomly sampled from generic distributions, and that explicit approximation rates can be derived for them.
The second step consists in obtaining generalization error bounds that match the parameter restrictions emerging from the approximation results for these systems. The key challenge here is that the observational data are non-IID, so classical statistical learning techniques [Bouc 13] cannot be employed directly; see [Gono 20b]. By combining these bounds, we then obtain learning error bounds for such random recurrent neural networks for learning a general class of maps. As a by-product, we obtain new universality results for reservoir systems, see [Grig 18b, Grig 18a, Gono 20c, Gono 23, Gono 21b], with generically randomly generated coefficients. It is worth emphasizing that the construction that we describe in this paper yields a fully implementable recurrent neural network-based learning algorithm with provable convergence guarantees that does not suffer from the curse of dimensionality.
2 A dynamic analog of the generalized Barron functions
In this section, we introduce the class of recurrent (generalized) Barron functionals, which constitutes the concept class that we study and approximate in this paper. We prove elementary properties of this class and identify important constituents in it. In particular, we show that it is a vector space that is dense in the set of square-integrable functionals and contains many important classes of maps, such as linear or “sufficiently smooth” maps.
2.1 Notation
Let , , and let be the Hölder conjugate of . Recall the notation for the space of sequences with , in the case , and , in the case . The symbol denotes the space of sequences that satisfy the same summability conditions but are indexed by the negative integers (including zero). We let denote the set of sequences with the property that . For and we write . Let denote two functions, called activation functions. For both activation functions we denote by the same symbol their componentwise application to sequences or vectors. Furthermore, we assume that is Lipschitz-continuous with constant and . Given , we write and . We will consider inputs in .
2.2 Definition and properties of recurrent Barron functionals
In this section we define the class of recurrent generalized Barron functionals and study some properties of these functionals.
Definition 2.1.
A functional is said to be a recurrent (generalized) Barron functional, if there exist a (Borel) probability measure on , and linear maps , such that for each the system
(2.1) |
admits a unique solution , satisfies that
and, writing ,
(2.2) |
We denote by the class of all recurrent generalized Barron functionals or just recurrent Barron functionals.
Note that the readout (2.2) is built in such a way that the output is constructed out of and instead of exclusively , as is customary in reservoir computing (see (1.2)). The reason for this choice is that it will allow us later on, in Section 3.6, to recover the static situation as a particular case of the recurrent setup in a straightforward manner.
The unique solution property (USP). The existence and uniqueness of solutions of state equations of the type (2.1), which is part of the definition of the recurrent Barron functionals, is a well-studied problem. This property is referred to in the literature as the echo state property (ESP) [Jaeg 10, Yild 12, Manj 13] or the unique solution property (USP) [G Ma 20, Manj 22a], and it has been extensively studied in deterministic (see the references above and [Grig 18a], for instance) and stochastic setups [Manj 22b]. The following proposition is an infinite-dimensional adaptation of a commonly used sufficient condition for the USP to hold in the case of (2.1), together with a full characterization of it when the activation function is the identity and hence (2.1) is a linear system. In both cases, the inputs are assumed to be bounded; possible generalizations to the unbounded case can be found in [Grig 19].
Proposition 2.2.
Consider the state equation given by (2.1) and the two following cases:
- (i)
- (ii)
Proof.
The proof of part (i) is a straightforward generalization of [Grig 18a, Theorem 3.1(ii)] together with the observation that the hypothesis implies that the state equation is a contraction on the state entry. Part (ii) can be obtained by mimicking the proof of [Grig 21b, Proposition 4.2(i)]; a key element in that proof is the use of Gelfand’s formula for the spectral radius, which holds for operators between infinite-dimensional Banach spaces [Lax 02, Theorem 4, page 195]. ∎
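As a numerical illustration of the two regimes in the proof, the following sketch checks the contraction-type sufficient condition (Lipschitz constant of the activation times the operator norm of the state map smaller than one) and, for the identity activation, the spectral-radius condition behind Gelfand's formula, for a finite-dimensional random reservoir matrix. The thresholds, the Gaussian draw, and the rescaling step are illustrative assumptions, not part of the infinite-dimensional statement.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
A = rng.normal(size=(N, N)) / np.sqrt(N)     # a randomly generated (truncated) reservoir map

L_sigma = 1.0                                # Lipschitz constant of sigma_1 (e.g. identity or ReLU)
op_norm = np.linalg.norm(A, 2)               # operator (spectral) norm of A
spec_rad = max(abs(np.linalg.eigvals(A)))    # spectral radius, the relevant quantity in the linear case

# (i) Contraction-type sufficient condition for the unique solution property.
print("L_sigma * ||A|| < 1:", L_sigma * op_norm < 1)

# (ii) Identity activation: Gelfand's formula rho(A) = lim_k ||A^k||^(1/k) makes the
# spectral radius (rather than the possibly larger operator norm) the quantity to check.
print("rho(A) < 1:", spec_rad < 1)

# If the sufficient condition fails, a common remedy is to rescale A.
if L_sigma * op_norm >= 1:
    A = 0.9 * A / op_norm
    print("rescaled ||A||:", np.linalg.norm(A, 2))
```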
The concept class of recurrent Barron functionals is a vector space.
Proposition 2.3.
Suppose and . Then .
Proof.
Without loss of generality, we may assume for . For let be (Borel) probability measures on , let , let , be linear maps such that for each the system
(2.4) |
admits a unique solution , satisfies , and
(2.5) |
Define by and by and note that are linear, they map both and into themselves, , and , . Define and linear maps , by setting , and .
Suppose now that is a solution to (2.1) for the input and the parameters that we just defined. For each denote by the odd components of and by the even components. Then and by construction of , it follows that is a solution of (2.4) for . By the uniqueness of the solutions of (2.4), it follows that there is at most one solution of (2.1). On the other hand, if we now denote by , , the unique solution to (2.4) for the input , then setting defines a sequence which is a solution to (2.1), again by the construction of .
Next, define as for and set , where denotes the pushforward of under . Then the integral transformation theorem shows that
and, by (2.5), for all ,
Hence, , as claimed. ∎
Finite-dimensional recurrent Barron functionals. For , denote by the set of functionals with the following property: there exist a (Borel) probability measure on , , , , such that for each the system
(2.6) |
admits a unique solution , satisfies
and, writing ,
(2.7) |
We shall refer to the elements in as finite-dimensional recurrent Barron functionals. In the following proposition, we show that the set of finite-dimensional recurrent Barron functionals can be naturally seen as a subspace of . In the statement, we use the notion and properties of system morphisms that are recalled, for instance, in [Grig 21b].
Proposition 2.4.
For arbitrary, let be the natural injection . Given an element , there exists a system of type (2.1)-(2.2) that realizes and for which the map is a system morphism. This allows us to conclude that
(2.8) |
that is, every finite-dimensional recurrent Barron functional admits a realization as a recurrent Barron functional.
Proof.
Let be such that (2.7) holds with satisfying (2.6). First, define , , by setting , , for , , . Furthermore, if is the canonical injection, we denote by the same symbol the injection of into and by the pushforward measure of under .
Consider now the system (2.6)–(2.7) and the one given by (2.1) and (2.2) with the parameters that we just defined, as well as the readout given by the measure . We shall prove that the map is a system morphism between these systems. This requires showing that is system equivariant and readout invariant (see [Grig 21b] for this terminology).
First, we notice that is system equivariant because
as required. As a consequence of this fact (see [Grig 21b, Proposition 2.2]) if is a solution of the system determined by (2.6) and (2.7) for , then so is with for the one given by (2.1) and (2.2) with the parameters . This solution is unique because if is another solution for it and we denote by its first components then from (2.1) we obtain for the components that , which shows that is a solution to (2.6) and thus, by uniqueness, necessarily. For components we get from (2.1) that . Altogether, this proves that for each the system given by (2.1) with the parameters has a unique solution and the first entries of each term are given by .
Notice now that the change-of-variables theorem and the equivalence of norms on yields that
which proves that in (2.2) defines an element in . Finally, we show that the map is readout invariant using that the first components of each solution are given by and applying again the change-of-variables theorem:
for , as required. In particular, and are in fact the same functional, which proves the inclusion (2.8). ∎
2.3 Examples of recurrent Barron functionals
We now show that large classes of input/output systems naturally belong to the recurrent Barron class.
Finite-memory functionals. The following proposition shows that finite-memory causal and time-invariant systems that have certain regularity properties can be written as finite-dimensional recurrent Barron functionals with a linear state equation.
Proposition 2.5.
Assume , , is bounded, let , and suppose that is in the Sobolev space for . Then, the functional , , is a finite-dimensional recurrent Barron functional.
Proof.
Choose , ,
(2.9) |
Then the system
(2.10) |
admits as unique solution , . Next, by [E 21, Theorem 3.1] (which is based on [Barr 93, Proposition 1 and Section IX.15)]; alternatively the result also follows using the representation obtained in the proof of [Gono 23, Theorem 1]), there exists a (Borel) probability measure on such that and for Lebesgue-a.e. ,
(2.11) |
But is continuous in by [Evan 10, Theorem 6(ii) in Chapter 5.6] and also the right-hand side in (2.11) is continuous in , hence the representation (2.11) holds for all .
Let be defined as follows: for , we set . Denote by the pushforward measure of under . Then by the change-of-variables formula satisfies
and for all , with ,
(2.12) | ||||
which shows that is a finite-dimensional recurrent Barron functional, that is . ∎
Linear functionals. The following propositions show that certain linear functionals are also in the recurrent Barron class .
Proposition 2.6.
Suppose and either or . Let , for and assume that is bounded, and . Assume that either
- (i)
-
for some
- (ii)
-
or .
Then the functional
satisfies .
Proof.
Let be as in (i) or, in case (ii) holds, let .
Consider first the case . Let and define , by , for , , . Let us now look at the -th component of (2.1). Inserting the definitions yields
(2.13) |
By iterating (2.13) we see that for , , we must have . Using this as the definition of , we hence obtain that (2.1) has a unique solution. In addition, this solution satisfies the condition : suppose that (i) holds, then in the case we verify for any that , where is chosen such that for all . In the case of we verify that . Similarly, if (ii) holds, follows from .
Define, for each , by setting for and by setting the remaining components of equal to zero. Set , then is indeed a (Borel) probability measure on and satisfies
since for and for . Furthermore, for any ,
(2.14) | ||||
which proves that in the case .
The next proposition also covers the linear case under slightly different hypotheses.
Proposition 2.7.
Suppose and either or . Let and . Then the functional
satisfies .
Proof.
The case can be deduced from the case precisely as in the proof of Proposition 2.6, hence it suffices to consider the case . Choosing and as in the proof of Proposition 2.6 yields a unique solution to (2.1) and for , we have . Define now by setting for , , and consider . Then and for any ,
(2.16) |
which proves that . ∎
Remark 2.8.
This proposition covers, in particular, the functionals associated to convolutional filters. Indeed, let and consider the convolutional filter defined by . Then its associated functional defined by is an element of , as shown in Proposition 2.7 above.
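To make the remark concrete, the following sketch realizes a causal convolutional filter with a finite kernel as a linear state-space system with a shift-register state equation and a linear readout, which is the mechanism behind Proposition 2.7. The scalar-input setting, the kernel length, and the matrix names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 5                                    # filter length (illustrative)
h = rng.normal(size=m)                   # convolution kernel

# Shift-register realization: the state stores the last m inputs.
A = np.eye(m, k=-1)                      # lower shift matrix
C = np.zeros((m, 1)); C[0, 0] = 1.0      # writes the current input into the first state entry
W = h.reshape(1, m)                      # linear readout containing the kernel

def conv_filter_statespace(z):
    """Apply the causal filter y_t = sum_j h_j z_{t-j} via the linear recursion
    x_t = A x_{t-1} + C z_t,  y_t = W x_t."""
    x = np.zeros((m, 1))
    out = []
    for z_t in z:
        x = A @ x + C * z_t
        out.append((W @ x).item())
    return np.array(out)

z = rng.normal(size=50)
y_state = conv_filter_statespace(z)
y_direct = np.convolve(z, h)[: len(z)]   # direct causal convolution for comparison
print("max discrepancy:", np.max(np.abs(y_state - y_direct)))
```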
As a further example, we now see that any functional associated with a linear state-space system on a separable Hilbert space admits a realization as an element in for .
Proposition 2.9.
Let be a separable Hilbert space, let , be linear and and assume that for each the system
(2.17) |
admits a unique solution , i.e., satisfying . Let be a (Borel) probability measure on with and consider
(2.18) |
Then with .
Proof.
Denote by an isometric isomorphism between and . Define , , , , , and let be the pushforward measure of under .
Then is a probability measure on and , are linear maps. Furthermore, by the choice of we have that is a solution of (2.1). On the other hand, if is any solution to (2.1), then defines a solution to (2.17). But the latter system has a unique solution by assumption, hence the solution to (2.1) is also unique.
Finally, the change-of-variables formula and the fact that is an isometry imply that satisfies and
(2.19) | ||||
This shows that the functional can be represented as a readout (2.2) of a system (2.1), hence . It also proves that is a system isomorphism between the system determined by , , , and , and the system in determined by , , , and . ∎
The next proposition is a generalization of a universality result in [Gono 20c, Theorem III.13] to the context of recurrent Barron functionals.
Proposition 2.10.
Assume and either or is bounded, continuous and non-constant. Let and let be a probability measure on . Then is dense in .
Proof.
Let and . Consider first the case . Define the activation function , then for . Hence, by [Gono 20c, Theorem III.13] there exists a reservoir functional associated to a linear system with neural network readout such that and . More specifically, there exist , , such that the system
(2.20) |
has the echo state property and for some neural network given by
for some , and . We claim that . To prove this, note that with the choices , and the process is the unique solution to the system (2.6) due to the echo state property of (2.20). From the proof of [Gono 20c, Theorem III.13] and the assumption that we obtain that the unique solution satisfies . Set now . Then is a probability measure on and for
This shows that and . Combining this with Proposition 2.4 yields the claim.
In the case where is bounded, continuous and non-constant we may directly work with (instead of ) and proceed similarly. ∎
Generalized convolutional functionals. We conclude by showing that the class contains functionals that generalize convolutional filters, an important family of transforms.
Proposition 2.11.
Assume , and suppose . Let be a probability measure on with . Then the functional ,
(2.21) |
is in .
Proof.
We only need to construct and linear maps , such that for each the system
(2.22) |
admits the unique solution . Then and so has representation (2.2). Choose , let be the sequence with the components of in the first entries and otherwise, and let be the operator that shifts the -th entry of the input to the -th entry and inserts in the first entries. Then , are linear, and indeed for all is the unique solution to (2.22) and . ∎
Notice that standard convolutional filters are a special case of the functional in (2.21) when is a Dirac measure and is the identity.
3 Approximation
In this section, we establish approximation results for the concept class of recurrent Barron functionals. As approximating systems, we use a modification of the finite-dimensional echo state networks (1.1)–(1.2) in which the linear readout in (1.2) is replaced by an extreme learning machine (ELM) / random features neural network, that is, a neural network of the type (1.3) whose parameters are all randomly generated, except for the output layer given by the weights which are trainable. To derive such bounds, we proceed in several approximation steps and (a) approximate the infinite system (2.1) by a finite system of type (1.1), (b) apply a Monte Carlo approximation of the integral in (2.2), and (c) use an importance sampling procedure to guarantee that the neural network weights can be generated from a generic distribution rather than the (unknown) measure .
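The Monte Carlo step (b) can be illustrated in isolation: the integral readout is replaced by an empirical average over samples drawn from the representing measure. In the toy sketch below, the representing measure is taken to be Gaussian and the readout is a static Barron-type integral; both choices, as well as the displayed error comparison, are illustrative assumptions rather than the quantitative bounds derived below.

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 4

# A toy representing measure mu for (a, b, W): jointly Gaussian (illustrative assumption).
def sample_mu(n):
    a = rng.normal(size=(n, dim))
    b = rng.normal(size=n)
    W = rng.normal(size=n)
    return a, b, W

def readout_mc(x, n):
    """Monte Carlo approximation of an integral readout at the point x:
    (1/n) * sum_i W_i * sigma_2(a_i . x + b_i)  with (a_i, b_i, W_i) ~ mu."""
    a, b, W = sample_mu(n)
    return np.mean(W * np.tanh(a @ x + b))

x = rng.uniform(-1, 1, size=dim)
for n in (100, 10_000):
    runs = [readout_mc(x, n) for _ in range(20)]
    # The fluctuation across independent draws shrinks roughly like 1/sqrt(n).
    print(n, float(np.std(runs)))
```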
3.1 Approximation by a finite-dimensional system
We start by showing that the elements in can be approximated by finite-dimensional truncations, that is, by finite-dimensional recurrent Barron functionals in , for any .
Proposition 3.1.
Assume and . Suppose has representation (2.2) with probability measure on , , and bounded, linear maps , . Let and for , let and . Let , , be given as , , for , . Assume is -Lipschitz and . Then the following statements hold:
- (i)
-
For each the system
(3.1) admits a unique solution .
- (ii)
-
Write and define the functional by
(3.2) Then there exists a constant not depending on such that for all ,
(3.3) In particular, . The constant is given by .
- (iii)
-
.
Proof.
Part (i) is a consequence of Proposition 2.2 and of the inequality which in turn implies that . Regarding (ii), note first that by our choice of , it follows that for all , . Similarly, for , we have that . Consequently, for :
(3.4) | ||||
Denote by the projection onto the first coordinates. Then
(3.5) | ||||
Iterating (3.5) thus yields
(3.6) |
This can now be used to estimate the difference between and its truncation:
(3.7) | ||||
with .
It remains to be shown that . Recall that denotes the projection onto the first components and let be the product map , that is, . Denote by the pushforward measure of under . Then is a (Borel) probability measure on and in we showed that for each the system (3.1) admits a unique solution. In addition, the integral transformation theorem implies that satisfies
and, writing ,
(3.8) | ||||
This proves that . ∎
3.2 Normalized realizations
The next result shows that a large class of elements in the concept class can be transformed into what we call a normalized realization in which the solution map for (2.1) is a multiplicative scaling operator.
Proposition 3.2.
Assume , and is bounded. Suppose has a realization of the type (2.1)-(2.2) with bounded and bounded satisfying , which guarantees the existence of a positive integer such that , for all . Fix . Then there exists a (Borel) probability measure on satisfying such that has the alternative representation
(3.9) |
where , for , .
Remark 3.3.
The assumption that is bounded in the previous statement could be relaxed to the requirement that .
Proof.
First, we recall that by Proposition 2.2, the hypotheses in the statement guarantee that the system (2.1) admits a unique solution for each , and that is explicitly given by the expression (2.3). The existence of a positive integer such that for all is guaranteed by Gelfand’s formula for the expression of the spectral radius .
Next, we define and note that the choice guarantees that . Moreover, for each the map is a bounded linear map between two Banach spaces and hence its adjoint is a bounded linear map from which we can identify with . More specifically, for the adjoint is defined by and so with the identification for some and all we obtain that the adjoint property translates to .
Thus, we may consider the map given by for , . We claim that the image can be identified with a subset of . Indeed,
(3.10) |
since . Thus, the map given by for , is well-defined and from (3.10) we deduce that
(3.11) |
where and we have used that for and for . We may now use (2.3) to rewrite as
(3.12) |
where and the series can be interchanged by Fubini’s theorem, as we see by estimating
since and is bounded. The second term in (3.12) can be rewritten as
(3.13) | ||||
Combining (3.13) and (3.12) and inserting in (2.2) then yields
(3.14) |
Let be given as and denote by the pushforward measure of under . Then the change-of-variables formula and (3.11) show that
and similarly we obtain from (3.14) that for any ,
(3.15) | ||||
∎
3.3 Echo state network approximation of normalized, truncated functionals
Consider now the setting of reservoir computing, where we aim to approximate an unknown element in our class using a finite-dimensional ESN-ELM pair, that is, a randomly generated echo state network-like state equation with a neural network readout in which only the output layer is trained. As a first step, we consider in this section the situation in which the target functional belongs to the class of finite-dimensional recurrent Barron functionals.
More specifically, consider the echo state network
(3.16) |
with randomly generated matrices , , and , which is used to capture the dynamics and then plugged into a random neural network readout. Thus, the element in is approximated by
(3.17) |
with randomly generated coefficients , , valued in , , and , respectively, and where is trainable.
Let and define the Kalman controllability matrix
(3.18) |
where, in case , the map removes the last columns.
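The following sketch shows one way the controllability matrix (3.18) can be assembled and its invertibility checked for a randomly generated pair; the truncation to a square matrix mirrors the description above, and the random draws illustrate Remark 3.6 below, by which invertibility holds almost surely for absolutely continuous weight distributions. Names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 20, 3

A = rng.normal(size=(N, N))
A *= 0.9 / np.linalg.norm(A, 2)     # keep the state map contractive
C = rng.normal(size=(N, d))

def controllability_matrix(A, C):
    """Kalman controllability matrix [C, AC, A^2 C, ...] truncated to an N x N block,
    dropping the surplus last columns when N is not a multiple of d."""
    N, d = C.shape
    k = int(np.ceil(N / d))
    blocks = [np.linalg.matrix_power(A, j) @ C for j in range(k)]
    return np.hstack(blocks)[:, :N]

K = controllability_matrix(A, C)
print("condition number:", np.linalg.cond(K))
print("invertible:", np.linalg.matrix_rank(K) == N)
```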
Proposition 3.4.
Suppose admits a normalized realization of the type
(3.19) |
for some on with and where satisfies for all , with . Assume that , is bounded, and consider an arbitrary system of the type (3.16) such that , and that the matrix in (3.18) is invertible. Then there exists a Borel measure on such that and the mapping
(3.20) |
satisfies for all the bound
(3.21) |
where is the smallest integer such that , is chosen such that for all , and is the diagonal matrix with entries for all , with .
Remark 3.5.
A particular case of the previous statement is when is chosen such that . In that situation, instead of the bound (3.21) one obtains
(3.22) |
Remark 3.6.
It can be shown (see [Grig 21a, Proposition 4.4]) that if and are randomly drawn using regular random variables with values in the corresponding spaces, then the hypothesis on the invertibility of holds almost surely.
Proof.
First notice that by our hypotheses and (2.3), the equation (3.16) has a unique solution in given by
(3.23) |
Let . Then
(3.24) |
In order to obtain the last inequality in the previous expression, we decomposed , with , which allowed us to write
In the last inequality we used the fact that is the smallest integer such that and hence necessarily if .
3.4 Main approximation result
This section contains our main approximation result, in which we extend the possibility of approximating the elements in the class of finite-dimensional recurrent Barron functionals, using finite-dimensional ESN-ELM pairs, to the full class of recurrent Barron functionals.
In this case, we shall assume that the random variables used in the construction of the ELM layer (3.17), that is, the initialization for the readout weight and the hidden weights , , , valued in , , , and , respectively, are independent and identically distributed (IID); we denote by their common distribution.
Let be the diagonal matrix introduced in Proposition 3.4, let be the projection map and for given Borel measure on , , bounded linear maps , with let , where is the map
with for , . Furthermore, let
Using this notation, we shall now formulate a result that shows that the elements in that admit a representation in which the state equation (2.4) is linear and has the unique solution property can be approximated arbitrarily well by randomly generated ESN-ELM pairs in which only the output layer of the ELM is trained. We provide an approximation bound where the dependence on the dimensions of the ESN state space and of the input is explicitly spelled out.
Theorem 3.7.
In this statement assume that , , and that the input set is bounded. Let be such that it has a realization of the type (2.2), with , bounded linear maps , satisfying and . Using the notation introduced before this statement, suppose that and that is bounded. Let . Consider now a randomly constructed ESN-ELM pair such that and for which the controllability matrix in (3.18) is invertible.
Then, there exists a measurable function such that the ESN-ELM pair with the same parameters and new readout satisfies, for any , the approximation error bound
(3.27) |
with
(3.28) |
for only depending on , , , and (see (3.43)).
Proof.
Let , for , . Then Proposition 3.2 implies that there exists a (Borel) probability measure on satisfying and such that the normalized representation
(3.29) |
holds. Notice that from the proof of Proposition 2.6 we obtain that , where is the unique solution to
(3.30) |
with , given by , for , , . Let , by given as , for , . Proposition 3.1 then implies that for each the system
(3.31) |
admits a unique solution and the functional ,
(3.32) |
satisfies and that
(3.33) |
for all and by (3.11)
Choose such that for all . Furthermore, note that and . Hence, using and for all we obtain from (3.33)
(3.34) | ||||
From (3.31) we obtain for , that . Therefore, for all , with . In particular, letting satisfy for all , with we obtain . Let be the projection map , with , and denote by the pushforward measure of under . Then and from (3.32) we get
(3.35) |
Proposition 3.4 hence implies that there exists a Borel measure on such that and the mapping
(3.36) |
satisfies for all the bound (3.22), that is,
(3.37) |
where is the diagonal matrix with entries for all , with . Furthermore, and (3.11) imply
(3.38) | ||||
Note that the measure in (3.36) is given by with as in the proof of Proposition 3.4 and with as in the proof of Proposition 3.2. Then we verify that and the change-of-variables theorem shows that
(3.39) | ||||
for any function for which the integrand in the last line is integrable with respect to and where , and is linear and satisfies (3.11).
Remark 3.8.
Corollary 3.9.
As in Theorem 3.7, suppose the recurrent activation function is , and the input set is bounded. Let be such that it has a realization of the type (2.2) with measure and bounded linear maps , satisfying and . Let . Consider now a randomly constructed ESN-ELM pair such that and for which the controllability matrix in (3.18) is invertible. Then there exists a distribution for the readout hidden layer weights and a readout (an -valued random vector) such that the ESN-ELM satisfies, for any , the approximation error bound
(3.44) |
with as in (3.28).
Remark 3.10.
Curse of dimensionality. The error bound (3.44) consists of the constant and three terms depending explicitly on . The constant could depend on through the norms , , , but for these norms it is reasonable to assume that they do not depend on (in practice, the norms of these matrices will be normalized). Similarly, it appears reasonable to assume that the norm is bounded in , since the norm is likely to be bounded from below. This argument indicates that in practically relevant situations the constant does not depend on and, as is apparent from the explicit expression (3.43), it depends only polynomially on the dimension . To achieve an approximation error of size at most , we could thus take , which grows only polynomially in and . Hence, in these circumstances, there is no curse of dimensionality in the bound (3.44).
Remark 3.11.
Implementation. From a practical point of view, the bounds obtained in Theorem 3.7 and in Corollary 3.9 apply to two different learning procedures: in the first case, when the chosen ELM weight distribution satisfies the absolute continuity assumption in Theorem 3.7, only the ELM output layer needs to be trained. In the second case, when that condition is not available, all the ELM weights also need to be trained. However, note that these are not recurrent weights; consequently, the vanishing gradient problem does not occur here. In contrast, this problem sometimes occurs for standard recurrent neural networks in which all weights are trained by backpropagation through time.
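A minimal sketch of the first procedure: the ESN states and the random hidden features are generated once and kept fixed, and only the output layer is fitted by regularized linear least squares on a toy teacher functional. The teacher, the washout length, and the ridge parameter are illustrative assumptions, not part of the theoretical statements.

```python
import numpy as np

rng = np.random.default_rng(5)
d, N, n, T = 2, 60, 300, 1000

# Randomly generated ESN and ELM hidden parameters: fixed after initialization, never trained.
A = rng.normal(size=(N, N)); A *= 0.9 / np.linalg.norm(A, 2)
C = rng.normal(size=(N, d)); zeta = rng.normal(size=N)
a = rng.normal(size=(n, N)); c = rng.normal(size=(n, d)); b = rng.normal(size=n)

def features(z):
    """Run the linear ESN state equation and collect the random hidden
    features phi_t = sigma_2(a x_t + c z_t + b)."""
    x, Phi = np.zeros(N), []
    for z_t in z:
        x = A @ x + C @ z_t + zeta          # identity recurrent activation
        Phi.append(np.tanh(a @ x + c @ z_t + b))
    return np.array(Phi)

# Toy teacher functional of the input history (illustrative).
z = rng.uniform(-1, 1, size=(T, d))
y = np.array([0.5 * z[t, 0] - 0.3 * (z[t - 1, 1] if t > 0 else 0.0) for t in range(T)])

# Only the output layer is trained, by ridge regression; a washout discards transients.
washout, lam = 50, 1e-6
Phi, target = features(z)[washout:], y[washout:]
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ target)
print("in-sample RMSE:", float(np.sqrt(np.mean((Phi @ W - target) ** 2))))
```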
We conclude this section by considering another implementation scenario, in which the measure used for the sampling of the ELM weights is prescribed, and we show that it is still possible to sample from measures that are arbitrarily close to it in the sense of the 1-Wasserstein metric, in exchange for an additional error term that can be compensated by increasing the dimension of the ESN. Let be the 1-Wasserstein metric. Suppose that we use the measure to generate the hidden layer weights. The next corollary shows how the error increases when we sample “almost” from the given measure .
Corollary 3.12.
Consider the same situation as in Corollary 3.9 and assume that has finite first moment and . Then, for any there exists a probability measure with and a readout (an -valued random vector) such that the ESN-ELM with distribution for the hidden layer weights satisfies for any the approximation error bound
(3.45) |
with , and as in (3.28).
Proof.
First note that , since . In the case where and is bounded, that is, , we may choose and obtain from Theorem 3.7 the bound
(3.46) |
We now derive an alternative bound (regardless of whether is finite or not). Combining both bounds then yields the bound (3.45).
3.5 Universality
As an additional consequence, we obtain the following universality result, which shows that any square-integrable functional (not necessarily in the recurrent Barron class) can be approximated arbitrarily well by the ESN-ELM (3.16)-(3.17) with a suitably chosen readout and hidden weights generated from a distribution arbitrarily close to a prescribed measure .
Corollary 3.13.
Assume that the input set is bounded, and consider an arbitrary functional that we will approximate with an ESN-ELM built with the following specifications: the activation functions are and either or is a bounded, Lipschitz-continuous, and non-constant function. Furthermore, assume that there exists and , , such that for any choice of the ESN parameters satisfy that the controllability matrix in (3.18) is invertible, , , , and , where is the diagonal matrix with entries , for all , with . Let be a given hidden weight distribution with finite first moment. Then for any probability measure on such that , there exists a probability measure with and a readout (an -valued random vector) such that the ESN-ELM with readout and distribution for the hidden layer weights satisfies that
Proof.
Firstly, by Proposition 2.10 there exists such that . From the proof of Proposition 2.10 and the construction in [Gono 20c] it follows that is of the form
(3.48) |
for some , an atomic probability measure , and with satisfying
Now let , , and define , by , for , , . Then, as in the proof of Proposition 2.6 we obtain that (2.1) admits a unique solution and for , we have . Let be the natural injection and let be the natural projection. Then
Let be defined by and denote by the pushforward measure of under . Then
(3.49) | ||||
Thus, has a representation of the type (2.2) with . Furthermore, since is atomic it follows directly that and .
Next, fix arbitrary and note that under the assumptions on the ESN, the constant in (3.28) can be estimated as
(3.50) |
which does not depend on . Similarly, the bound
the assumptions on the ESN matrices, and (3.47) can be used to obtain an upper bound on that only depends on , , , , , , and , but not on . Set .
We now apply Corollary 3.12 to . We select with and
then from Corollary 3.12 we obtain that there exists a probability measure with and a readout (an -valued random vector) such that the ESN-ELM with readout and distribution for the hidden layer weights satisfies for any
(3.51) |
Hence,
∎
3.6 Special case: static situation
As a special case, we consider the situation without recurrence, that is, when is of the form
(3.52) |
for a probability measure on satisfying . Such a functional is clearly a particular case of a recurrent generalized Barron functional, since it can be written as (2.2) with a measure on given by , which satisfies by the integrability assumption on .
In this situation, we show that such static elements of the recurrent Barron class can be approximated by ELMs, that is, by feedforward neural networks
(3.53) |
with randomly generated coefficients , valued in and , respectively, and trainable. can be viewed as a functional , which is of the form (3.17) with . Let be -valued random variables, assume that the random variables are IID and denote by their common distribution.
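For the static case, the following sketch approximates a toy target with a random-features network of the form (3.53): the hidden weights are sampled once from a generic distribution and only the output layer is trained by ridge regression. The target function, the sampling distributions, and the regularization strength are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, m = 3, 500, 2000

# Random hidden weights of the static network (3.53): sampled once, then frozen.
A = rng.normal(size=(n, d))
b = rng.uniform(-1, 1, size=n)

def elm_features(z):
    # Hidden-layer features sigma(A z + b) for a batch of inputs z of shape (batch, d).
    return np.tanh(z @ A.T + b)

# Toy target on a bounded input set (illustrative assumption).
w_true = np.array([1.0, -2.0, 0.5])
Z = rng.uniform(-1, 1, size=(m, d))
y = np.sin(Z @ w_true)

# Only the output layer W is trained, here by ridge regression.
Phi, lam = elm_features(Z), 1e-6
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

Z_test = rng.uniform(-1, 1, size=(500, d))
test_rmse = np.sqrt(np.mean((elm_features(Z_test) @ W - np.sin(Z_test @ w_true)) ** 2))
print("test RMSE:", float(test_rmse))
```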
Corollary 3.14.
Suppose that is as in (3.52) with associated satisfying that . Assume the input set is bounded. Suppose that and that is bounded. Then there exists a measurable function such that the ELM with readout satisfies for any the approximation error bound
(3.54) |
with
(3.55) |
Proof.
As noted above, is a recurrent generalized Barron functional with
and we can choose the maps , and the sequence as . With these choices, the map and the sequence appearing in the proof of Theorem 3.7 are both equal to . Consequently, from (3.39) we obtain that the measure in the proof of Theorem 3.7 satisfies
for any for which the integrand in the last line is integrable with respect to . This shows that in (3.36) coincides with the map
(3.56) |
and furthermore . Therefore, denoting by also the functional , in the first step of the error estimate (3.42) only the last term is non-zero.
Furthermore, and thus . Therefore, the hypothesis implies .
4 Learning from a single trajectory
Consider now the situation in which we aim to learn a functional from a single trajectory of input/output pairs. We observe data points for and aim to learn the input/output relation from them. In contrast to static situations, these data points are not IID; instead, they constitute a single realization of a discrete-time stochastic process. Similar problems were considered, for instance, in [Ziem 22].
More formally, we consider a stationary stochastic process and assume that sequential observations , , coming from a single realization of this process are available. Suppose that takes values in . Let be the unknown functional and assume that the input/output relation between the data is given as . For example, this is satisfied if is any stationary process and for a stationary process independent of and with .
The goal is to learn the functional from the observations of the underlying input/output process. Recall that minimizes over all measurable maps . To learn from the data, one thus aims to find a minimizer of
(4.1) |
where we denote .
To learn from data, we use an approximant of the type introduced in (3.17) and write to emphasize that are the trainable parameters. Note that is now -valued and is constructed by simply using a readout that collects readout vectors in a matrix or, equivalently, by making the in (3.17) -valued. The results in Section 3, in particular the universality statement in Section 3.5, indicate that can be approximated well by such systems. Hence, we expect that a minimizer of over such systems should be a good approximation of . Therefore, we set to solve the minimization problem
(4.2) |
for given by the set of all random matrices which satisfy for each row the bound for some regularization constant and which are measurable with respect to the data and the randomly generated parameters , , , . The latter requirement accounts for the fact that the randomly generated parameters are fixed after their initialization and hence the trainable weights may depend on them. The choice of will be specified later on in the statement of Theorem 4.2.
The problem (4.2) admits an explicit solution (see, for instance, [Gono 21a, Section 4.3]). Once computed, the learned functional is . The learning performance of the algorithm can now be evaluated by assessing its learning error (or generalization error)
where is an IID copy of and is independent of all other random variables introduced so far. An analysis of the learning error is carried out below in Theorem 4.2, in which we assume, inspired by the developments in [Gono 20b], that the input and the output processes are of the type introduced in the following definition.
Definition 4.1.
An -valued random process is said to have a causal Bernoulli shift structure, if there exist , measurable, and an IID collection of -valued random variables such that
The process is said to have geometric decay if there exist , such that the weak dependence coefficient satisfies for all , where for an independent copy of .
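A hedged example of a process with the structure of Definition 4.1: each observation is a fixed measurable function of the IID innovations up to that time, here a linear filter of bounded nonlinear transforms with geometrically decaying coefficients, so that replacing innovations further than a given number of steps in the past changes the observation by a geometrically small amount. The decay rate, the truncation of the infinite past, and the Gaussian innovations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
T, lam, trunc = 500, 0.7, 60     # trajectory length, decay rate, truncation of the infinite past

# IID innovations xi_t and the shift map H(xi_t, xi_{t-1}, ...) = sum_{j>=0} lam^j tanh(xi_{t-j}),
# truncated at j < trunc (the tail is of size lam^trunc and numerically negligible here).
xi = rng.normal(size=T + trunc)
weights = lam ** np.arange(trunc)
Z = np.array([weights @ np.tanh(xi[t + trunc - np.arange(trunc)]) for t in range(T)])

# Empirical check of the geometric decay: replacing all innovations at lags >= r by an
# independent copy changes Z_t by at most 2 * lam^r / (1 - lam).
t, r = T - 1, 20
xi_tilde = xi.copy()
xi_tilde[: t + trunc - r + 1] = rng.normal(size=t + trunc - r + 1)
Z_t_perturbed = weights @ np.tanh(xi_tilde[t + trunc - np.arange(trunc)])
print("perturbation:", abs(Z[t] - Z_t_perturbed), "worst-case bound:", 2 * lam**r / (1 - lam))
```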
Theorem 4.2.
Assume , , , and that the input set is bounded. Consider a functional that has a representation of the type (2.2) with , bounded linear maps , satisfying and . Let . Consider now as approximant an ESN-ELM system such that and that the matrix in (3.18) is invertible. Suppose that is bounded, , and is bounded by a constant . Let , . Assume is bounded by a constant and has a causal Bernoulli shift structure with geometric decay and , where . Then the trained ESN-ELM satisfies the learning error bound
(4.3) |
with
(4.4) |
(4.5) |
for only depending on , , , , and (see (3.43)) and only depending on , , , and (see (4.19)).
Proof.
Let be an IID copy of , independent of all other random variables introduced so far. Independence and the assumption imply that and for any . Therefore, for any we obtain
(4.6) | ||||
where we used in the last step that (4.2) implies .
The first term in the right hand side of (4.6) can be bounded using Theorem 3.7. Indeed, Theorem 3.7 proves that there exists a measurable function such that the ESN-ELM with readout satisfies
(4.7) |
with given in (3.28). Here we used independence and that takes values in . Furthermore, in the proof of Theorem 3.7 and are explicitly chosen as . Therefore, -a.s. the random vector satisfies and consequently .
It remains to analyze the second term in the right hand side of (4.6). For notational simplicity we now consider , , , as fixed; formally this can be justified by performing the subsequent computations conditionally on these random variables as, for instance, in [Gono 21a, Proof of Theorem 4.3]. Then
(4.8) |
Let , denote and note that each can be written as for , . Furthermore, is the reservoir functional associated to the map , , which is just the reservoir map associated to the ESN determined by , , augmented by the current input and the (scaled) previous state.
Then for each with we have
with and with . Furthermore, for any the reservoir map is an -contraction with and for any and we have
In addition, for we have from (3.23)
where with chosen such that for all . Finally, let and note that satisfies for all , and with that and is Lipschitz-continuous with constant . This can be seen by writing
Consider the Rademacher-type complexity as defined in [Gono 20b]: let be IID Rademacher random variables which are independent of the IID copies , , of . For each we define
Denote by the vector with components , . Then we can rewrite and estimate using the Cauchy-Schwarz and Jensen's inequalities, independence, and the fact that
Altogether, the hypotheses of [Gono 20b, Corollary 8(ii)] are satisfied. Writing for the idealized empirical risk and applying this result proves that
(4.9) |
for all satisfying , where and
(4.10) | ||||
(4.11) |
Furthermore, [Gono 20b, Proposition 5] yields that -a.s.
Combining this with (4.9) we obtain
(4.12) | ||||
where
(4.13) | ||||
Recall that was considered fixed; now we take expectation also with respect to its distribution and obtain in the same way as we obtained (4.12)
(4.14) | ||||
The expectations in (4.14) can be estimated using Jensen’s inequality and (3.23):
with
(4.15) | ||||
and consequently
(4.16) |
Inserting this in (4.14) yields
(4.17) |
5 Conclusions
In this paper we have provided reservoir computing approximation and generalization bounds for a new concept class of input/output systems that extends to a dynamical context the so-called generalized Barron functionals. We call the elements of the new class recurrent generalized Barron functionals. Its elements are characterized by the availability of readouts with a certain integral representation built on infinite-dimensional state-space systems. We have also shown that they form a vector space and that the class contains as a subset systems of this type with finite-dimensional state-space representations. This class has been shown to be very rich: it contains all sufficiently smooth finite-memory functionals, linear and convolutional filters, as well as any input-output system built using separable Hilbert state spaces with readouts of integral Barron type. Moreover, this class is dense in the $L^2$-sense.
The reservoir architectures used for the approximation/estimation of elements in the new class are randomly generated echo state networks with either linear (for approximation) or ReLU (for estimation) activation functions, and with readouts built using randomly generated neural networks in which only the output layer is trained (extreme learning machines or random feature neural networks). The results in the paper yield a fully implementable recurrent neural network-based learning algorithm with provable convergence guarantees that does not suffer from the curse of dimensionality.
Acknowledgments. The authors acknowledge partial financial support coming from the Swiss National Science Foundation (grant number 200021_175801/1). L. Grigoryeva and J.-P. Ortega wish to thank the hospitality of the University of Munich and L. Gonon that of the Nanyang Technological University during the academic visits in which some of this work was developed. We thank Christa Cuchiero and Josef Teichmann for interesting exchanges that inspired some of the results in this paper.
References
- [Acci 22] B. Acciaio, A. Kratsios, and G. Pammer. “Metric hypertransformers are universal adapted maps”. arXiv preprint arXiv:2201.13094, 2022.
- [Arco 22] T. Arcomano, I. Szunyogh, A. Wikner, J. Pathak, B. R. Hunt, and E. Ott. “A hybrid approach to atmospheric modeling that combines machine learning with a physics-based numerical model”. Journal of Advances in Modeling Earth Systems, Vol. 14, No. 3, p. e2021MS002712, 2022.
- [Barr 92] A. R. Barron. “Neural net approximation”. In: Proc. 7th Yale Workshop on Adaptive and Learning Systems, pp. 69–72, 1992.
- [Barr 93] A. Barron. “Universal approximation bounds for superpositions of a sigmoidal function”. IEEE Transactions on Information Theory, Vol. 39, No. 3, pp. 930–945, may 1993.
- [Bent 22] F. E. Benth, N. Detering, and L. Galimberti. “Neural networks in Fréchet spaces”. Annals of Mathematics and Artificial Intelligence, pp. 1–29, 2022.
- [Bouc 13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
- [Bouv 17a] J. Bouvrie and B. Hamzi. “Kernel methods for the approximation of nonlinear systems”. SIAM Journal on Control and Optimization, Vol. 55, No. 4, pp. 2460–2492, 2017.
- [Bouv 17b] J. Bouvrie and B. Hamzi. “Kernel methods for the approximation of some key quantities of nonlinear systems”. Journal of Computational Dynamics, Vol. 4, No. (1-2), pp. 1–19, 2017.
- [Boyd 85] S. Boyd and L. Chua. “Fading memory and the problem of approximating nonlinear operators with Volterra series”. IEEE Transactions on Circuits and Systems, Vol. 32, No. 11, pp. 1150–1161, 1985.
- [Chen 19] J. Chen and H. I. Nurdin. “Learning nonlinear input–output maps with dissipative quantum systems”. Quantum Information Processing, Vol. 18, No. 7, pp. 1–36, 2019.
- [Chen 20] J. Chen, H. I. Nurdin, and N. Yamamoto. “Temporal information processing on noisy quantum computers”. Physical Review Applied, Vol. 14, No. 2, p. 24065, 2020.
- [Chen 22] J. Chen. Nonlinear Convergent Dynamics for Temporal Information Processing on Novel Quantum and Classical Devices. PhD thesis, UNSW Sydney, 2022.
- [Chen 95] T. Chen and H. Chen. “Approximation capability to functions of several variables, nonlinear functionals, and operators by radial basis function neural networks”. IEEE Transactions on Neural Networks, Vol. 6, No. 4, pp. 904–910, 1995.
- [Coui 16] R. Couillet, G. Wainrib, H. Sevi, and H. T. Ali. “The asymptotic performance of linear echo state neural networks”. Journal of Machine Learning Research, Vol. 17, No. 178, pp. 1–35, 2016.
- [Cuch 22] C. Cuchiero, F. Primavera, and S. Svaluto-Ferro. “Universal approximation theorems for continuous functions of càdlàg paths and Lévy-type signature models”. arXiv preprint arXiv:2208.02293, 2022.
- [E 19] W. E, C. Ma, and L. Wu. “A priori estimates of the population risk for two-layer neural networks”. Commun. Math. Sci., Vol. 17, No. 5, pp. 1407–1425, 2019.
- [E 20a] W. E, C. Ma, S. Wojtowytsch, and L. Wu. “Towards a mathematical understanding of neural network-based machine learning: what we know and what we don’t”. Preprint, arXiv 2009.10713, 2020.
- [E 20b] W. E and S. Wojtowytsch. “On the Banach spaces associated with multi-layer ReLU networks: Function representation, approximation theory and gradient descent dynamics”. Preprint, arXiv 2007.15623, 2020.
- [E 21] W. E and S. Wojtowytsch. “Representation formulas and pointwise properties for Barron functions”. arXiv:2006.05982, 2021.
- [Evan 10] L. C. Evans. Partial Differential Equations. Vol. 19 of Graduate Studies in Mathematics, American Mathematical Society, Providence, RI, second Ed., 2010.
- [G Ma 20] G. Manjunath. “Stability and memory-loss go hand-in-hand: three results in dynamics & computation”. To appear in Proceedings of the Royal Society London Ser. A Math. Phys. Eng. Sci., pp. 1–25, 2020.
- [Gali 22] L. Galimberti, G. Livieri, and A. Kratsios. “Designing universal causal deep learning models: the case of infinite-dimensional dynamical systems from stochastic analysis”. arXiv preprint arXiv:2210.13300, 2022.
- [Gono 20a] L. Gonon, L. Grigoryeva, and J.-P. Ortega. “Memory and forecasting capacities of nonlinear recurrent networks”. Physica D, Vol. 414, No. 132721, pp. 1–13., 2020.
- [Gono 20b] L. Gonon, L. Grigoryeva, and J.-P. Ortega. “Risk bounds for reservoir computing”. Journal of Machine Learning Research, Vol. 21, No. 240, pp. 1–61, 2020.
- [Gono 20c] L. Gonon and J.-P. Ortega. “Reservoir computing universality with stochastic inputs”. IEEE Transactions on Neural Networks and Learning Systems, Vol. 31, No. 1, pp. 100–112, 2020.
- [Gono 21a] L. Gonon. “Random feature neural networks learn Black-Scholes type PDEs without curse of dimensionality”. Preprint, 2021.
- [Gono 21b] L. Gonon and J.-P. Ortega. “Fading memory echo state networks are universal”. Neural Networks, Vol. 138, pp. 10–13, 2021.
- [Gono 23] L. Gonon, L. Grigoryeva, and J.-P. Ortega. “Approximation error estimates for random neural networks and reservoir systems”. The Annals of Applied Probability, Vol. 33, No. 1, pp. 28–69, 2023.
- [Grig 14] L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. “Stochastic time series forecasting using time-delay reservoir computers: performance and universality”. Neural Networks, Vol. 55, pp. 59–71, 2014.
- [Grig 18a] L. Grigoryeva and J.-P. Ortega. “Echo state networks are universal”. Neural Networks, Vol. 108, pp. 495–508, 2018.
- [Grig 18b] L. Grigoryeva and J.-P. Ortega. “Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems”. Journal of Machine Learning Research, Vol. 19, No. 24, pp. 1–40, 2018.
- [Grig 19] L. Grigoryeva and J.-P. Ortega. “Differentiable reservoir computing”. Journal of Machine Learning Research, Vol. 20, No. 179, pp. 1–62, 2019.
- [Grig 21a] L. Grigoryeva, A. G. Hart, and J.-P. Ortega. “Learning strange attractors with reservoir systems”. arXiv preprint arXiv:2108.05024, 2021.
- [Grig 21b] L. Grigoryeva and J.-P. Ortega. “Dimension reduction in recurrent networks by canonicalization”. Journal of Geometric Mechanics, Vol. 13, No. 4, pp. 647–677, 2021.
- [Herm 10] M. Hermans and B. Schrauwen. “Memory in linear recurrent neural networks in continuous time.”. Neural networks : the official journal of the International Neural Network Society, Vol. 23, No. 3, pp. 341–55, apr 2010.
- [Herm 12] M. Hermans and B. Schrauwen. “Recurrent kernel machines: computation with infinite echo state networks”. Neural Computation, Vol. 24, pp. 104–133, 2012.
- [Hu 22] P. Hu, Q. Meng, B. Chen, S. Gong, Y. Wang, W. Chen, R. Zhu, Z.-M. Ma, and T.-Y. Liu. “Neural operator with regularity structure for modeling dynamics driven by SPDEs”. arXiv preprint arXiv:2204.06255, 2022.
- [Huan 06] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. “Extreme learning machine: Theory and applications”. Neurocomputing, Vol. 70, No. 1-3, pp. 489–501, dec 2006.
- [Jaeg 04] H. Jaeger and H. Haas. “Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication”. Science, Vol. 304, No. 5667, pp. 78–80, 2004.
- [Jaeg 10] H. Jaeger. “The ‘echo state’ approach to analysing and training recurrent neural networks with an erratum note”. Tech. Rep., German National Research Center for Information Technology, 2010.
- [Kira 19] F. J. Király and H. Oberhauser. “Kernels for sequentially ordered data”. Journal of Machine Learning Research, Vol. 20, 2019.
- [Kova 21] N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. “Neural operator: Learning maps between function spaces”. arXiv preprint arXiv:2108.08481, 2021.
- [Krat 20] A. Kratsios and I. Bilokopytov. “Non-Euclidean universal approximation”. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada., 2020.
- [Lax 02] P. Lax. Functional Analysis. Wiley-Interscience, 2002.
- [Li 20] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. “Fourier neural operator for parametric partial differential equations”. arXiv preprint arXiv:2010.08895, 2020.
- [Li 22] Z. Li, J. Han, E. Weinan, and Q. Li. “Approximation and optimization theory for linear continuous-time recurrent neural networks”. Journal of Machine Learning Research, Vol. 23, pp. 41–42, 2022.
- [Lu 18] Z. Lu, B. R. Hunt, and E. Ott. “Attractor reconstruction by machine learning”. Chaos, Vol. 28, No. 6, 2018.
- [Maas 02] W. Maass, T. Natschläger, and H. Markram. “Real-time computing without stable states: a new framework for neural computation based on perturbations”. Neural Computation, Vol. 14, pp. 2531–2560, 2002.
- [Maas 11] W. Maass. “Liquid state machines: motivation, theory, and applications”. In: S. S. Barry Cooper and A. Sorbi, Eds., Computability In Context: Computation and Logic in the Real World, Chap. 8, pp. 275–296, 2011.
- [Manj 13] G. Manjunath and H. Jaeger. “Echo state property linked to an input: exploring a fundamental characteristic of recurrent neural networks”. Neural Computation, Vol. 25, No. 3, pp. 671–696, 2013.
- [Manj 22a] G. Manjunath. “Embedding information onto a dynamical system”. Nonlinearity, Vol. 35, No. 3, p. 1131, 2022.
- [Manj 22b] G. Manjunath and J.-P. Ortega. “Transport in reservoir computing”. arXiv preprint arXiv:2209.07946, 2022.
- [Mart 23] R. Martínez-Peña and J.-P. Ortega. “Quantum reservoir computing in finite dimensions”. Physical Review E - Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics, Vol. 107, No. 3, p. 035306, 2023.
- [Matt 92] M. B. Matthews. On the Uniform Approximation of Nonlinear Discrete-Time Fading-Memory Systems Using Neural Network Models. PhD thesis, ETH Zürich, 1992.
- [Matt 93] M. B. Matthews. “Approximating nonlinear fading-memory operators using neural network models”. Circuits, Systems, and Signal Processing, Vol. 12, No. 2, pp. 279–307, jun 1993.
- [Neuf 22] A. Neufeld and P. Schmocker. “Chaotic hedging with iterated integrals and neural networks”. arXiv preprint arXiv:2209.10166, 2022.
- [Path 17] J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott. “Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data”. Chaos, Vol. 27, No. 12, 2017.
- [Path 18] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. “Model-Free Prediction of Large Spatiotemporally Chaotic Systems from Data: A Reservoir Computing Approach”. Physical Review Letters, Vol. 120, No. 2, p. 24102, 2018.
- [Rahi 07] A. Rahimi and B. Recht. “Random features for large-scale kernel machines”. Advances in neural information, 2007.
- [Salv 22] C. Salvi, M. Lemercier, and A. Gerasimovics. “Neural stochastic PDEs: Resolution-invariant learning of continuous spatiotemporal dynamics”. In: Advances in Neural Information Processing Systems, 2022.
- [Stin 99] M. B. Stinchcombe. “Neural network approximation of continuous functionals and continuous functions on compactifications”. Neural Networks, Vol. 12, No. 3, pp. 467–477, 1999.
- [Tino 18] P. Tino. “Asymptotic Fisher memory of randomized linear symmetric Echo State Networks”. Neurocomputing, Vol. 298, pp. 4–8, 2018.
- [Tino 20] P. Tino. “Dynamical systems as temporal feature spaces”. Journal of Machine Learning Research, Vol. 21, pp. 1–42, 2020.
- [Tran 20] Q. H. Tran and K. Nakajima. “Higher-order quantum reservoir computing”. arXiv preprint arXiv:2006.08999, 2020.
- [Tran 21] Q. H. Tran and K. Nakajima. “Learning temporal quantum tomography”. Physical Review Letters, Vol. 127, No. 26, p. 260401, 2021.
- [Vill 09] C. Villani. Optimal Transport: Old and New. Springer, 2009.
- [Wikn 21] A. Wikner, J. Pathak, B. R. Hunt, I. Szunyogh, M. Girvan, and E. Ott. “Using data assimilation to train a hybrid forecast system that combines machine-learning and knowledge-based components”. Chaos: An Interdisciplinary Journal of Nonlinear Science, Vol. 31, No. 5, p. 53114, 2021.
- [Yild 12] I. B. Yildiz, H. Jaeger, and S. J. Kiebel. “Re-visiting the echo state property.”. Neural Networks, Vol. 35, pp. 1–9, nov 2012.
- [Ziem 22] I. M. Ziemann, H. Sandberg, and N. Matni. “Single Trajectory Nonparametric Learning of Nonlinear Dynamics”. In: P.-L. Loh and M. Raginsky, Eds., Proceedings of Thirty Fifth Conference on Learning Theory, pp. 3333–3364, PMLR, 2022.