
Infinite-dimensional reservoir computing

Lukas Gonon1, Lyudmila Grigoryeva2,3, and Juan-Pablo Ortega4
Abstract

Reservoir computing approximation and generalization bounds are proved for a new concept class of input/output systems that extends the so-called generalized Barron functionals to a dynamic context. This new class is characterized by readouts with a certain integral representation built on infinite-dimensional state-space systems. It is shown that this class is very rich and possesses useful features and universal approximation properties. The reservoir architectures used for the approximation and estimation of elements in the new class are randomly generated echo state networks with either linear or ReLU activation functions. Their readouts are built using randomly generated neural networks in which only the output layer is trained (extreme learning machines or random feature neural networks). The results in the paper yield a fully implementable recurrent neural network-based learning algorithm with provable convergence guarantees that do not suffer from the curse of dimensionality.


Key Words: recurrent neural network, reservoir computing, echo state network, ESN, extreme learning machine, ELM, recurrent linear network, machine learning, Barron functional, recurrent Barron functional, universality, finite memory functional, approximation bound, convolutional filter.

1 Imperial College, Department of Mathematics, London, United Kingdom. [email protected]
2 Universität Sankt Gallen, Faculty of Mathematics and Statistics, Sankt Gallen, Switzerland. [email protected]
3 University of Warwick, Department of Statistics, United Kingdom. [email protected]
4 Nanyang Technological University, Division of Mathematical Sciences, Singapore. [email protected]

1 Introduction

Reservoir computing (RC) [Jaeg 10, Maas 02, Jaeg 04, Maas 11] and in particular echo state networks (ESNs) [Matt 92, Matt 93, Jaeg 04] have gained much popularity in recent years due to their excellent performance in the forecasting of dynamical systems [Grig 14, Jaeg 04, Path 17, Path 18, Lu 18, Wikn 21, Arco 22] and due to the ease of their implementation. RC aims at approximating nonlinear input/output systems using randomly generated state-space systems (called reservoirs) in which only a linear readout is estimated. It has been theoretically established that this is indeed possible in a variety of deterministic and stochastic contexts [Grig 18b, Grig 18a, Gono 20c, Gono 21b, Gono 23] in which RC systems have been shown to have universal approximation properties.

In this paper, we focus on deriving error bounds for a variant of the architectures that we just cited and consider as approximants randomly generated linear systems with readouts given by randomly generated neural networks in which only the output layer is trained. Thus, from a learning perspective, we combine linear echo state networks and what is referred to in the literature as random features [Rahi 07] or extreme learning machines (ELMs) [Huan 06]. We develop explicit and readily computable approximation and estimation bounds for a newly introduced concept class whose elements we refer to as recurrent (generalized) Barron functionals since they can be viewed as a dynamical analog of the (generalized) Barron functions introduced in [Barr 92, Barr 93] and extended later in [E 20b, E 20a, E 19]. The main novelty in this concept class with respect to others already available in the literature, like the fading memory class [Boyd 85] in deterministic setups or $L^{p}$ functionals in the stochastic case, is that it consists of elements that admit an explicit infinite-dimensional state-space representation, which makes them analytically tractable. As we shall see later on, many interesting families of input/output systems belong to this class which, additionally, is universal in the $L^{p}$-sense.

From an approximation-theoretical point of view, the universality properties of linear systems with readouts belonging to dense families have been known for a long time both in deterministic [Boyd 85, Grig 18b] and in stochastic [Gono 20c] setups. Their corresponding dynamical and memory properties have been extensively studied (see, for instance, [Herm 10, Coui 16, Tino 18, Tino 20, Gono 20a, Li 22] and references therein). Our contribution in this paper can hence be considered as an extension of those works to the recurrent Barron class.

All the dynamic learning works cited so far use exclusively finite-dimensional state spaces. Hence, one of the main novelties of our contribution is the infinite-dimensional component in the concept class that we propose. It is worth mentioning that, even though numerous neural functional approximation results exist in static setups (see, for instance, [Chen 95, Stin 99, Krat 20, Bent 22, Cuch 22, Neuf 22], in addition to the works on Barron functions cited above), the use of infinite-dimensional state-space systems has not been much exploited, and it is only very recently that it has begun to be seriously developed; see [Herm 12, Bouv 17b, Bouv 17a, Kira 19, Li 20, Kova 21, Acci 22, Gali 22, Hu 22, Salv 22] for a few examples. Dedicated hardware realizations of RC systems using quantum systems are a potential application of these extensions; in the absence of adequate tools, most of the corresponding theoretical developments have so far been carried out in exclusively finite-dimensional setups [Chen 19, Chen 20, Tran 20, Tran 21, Chen 22, Mart 23].

In order to introduce our contributions in more detail, we start by recalling that an echo state network (ESN) is an input/output system defined by the two equations:

\mathbf{x}_{t} = \sigma(A\mathbf{x}_{t-1} + C\mathbf{z}_{t} + \bm{\zeta}), \qquad (1.1)
\mathbf{y}_{t} = W\mathbf{x}_{t}, \qquad (1.2)

for $t\in\mathbb{Z}_{-}$, where $d,m,N\in\mathbb{N}$, ${\bf z}\in(\mathbb{R}^{d})^{\mathbb{Z}_{-}}$, ${\bf x}_{t}\in\mathbb{R}^{N}$, the matrix $W\in\mathbb{M}_{m,N}$ is trainable, $\sigma$ denotes the componentwise application of a given activation function $\sigma\colon\mathbb{R}\to\mathbb{R}$, and $A\in\mathbb{M}_{N}$, $C\in\mathbb{M}_{N,d}$, and $\bm{\zeta}\in\mathbb{R}^{N}$ are randomly generated. If for each ${\bf z}\in(\mathbb{R}^{d})^{\mathbb{Z}_{-}}$ there exists a unique solution ${\bf x}=({\bf x}_{t})_{t\in\mathbb{Z}_{-}}\in(\mathbb{R}^{N})^{\mathbb{Z}_{-}}$ to (1.1), then (1.1)-(1.2) define a mapping $H_{W}\colon(\mathbb{R}^{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{m}$ via $H_{W}({\bf z})={\bf y}_{0}$, with ${\bf y}\in(\mathbb{R}^{m})^{\mathbb{Z}_{-}}$ given by (1.2) for ${\bf x}$ the unique solution to (1.1) associated with ${\bf z}\in(\mathbb{R}^{d})^{\mathbb{Z}_{-}}$. Given a (typically unknown) functional $H\colon(\mathbb{R}^{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{m}$ to be learned, the readout $W$ is then trained so that $H_{W}$ is a good approximation of $H$.

For echo state networks (1.1)–(1.2), approximation bounds were given in [Gono 23] for maps $H$ which have the property that their restrictions to input sequences of finite length $T$ lie in a certain Sobolev space, for each $T\in\mathbb{N}$, and with weights $A$, $C$, $\bm{\zeta}$ in (1.1) sampled from a generic distribution with a certain structure. Here we consider a novel architecture for (1.1)–(1.2), where, instead of applying a linear function in (1.2), we apply a random feedforward neural network, that is, (1.2) is replaced by

\mathbf{y}_{t} = \sum_{i=1}^{N} W_{i}\,\sigma\big({\bf a}^{(i)}\cdot\mathbf{x}_{t-1} + {\bf c}^{(i)}\cdot{\bf z}_{t} + b_{i}\big) \qquad (1.3)

for $t\in\mathbb{Z}_{-}$ and with randomly generated coefficients ${\bf a}^{(i)}$, ${\bf c}^{(i)}$, $b_{i}$ taking values in $\mathbb{R}^{N}$, $\mathbb{R}^{d}$, and $\mathbb{R}$, respectively. In most cases, the activation function $\sigma$ in (1.1) will be just the identity or, possibly, a rectified linear unit.
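For concreteness, the following Python/NumPy sketch implements this architecture on a toy task: the reservoir parameters of (1.1) and the feature weights in (1.3) are drawn at random, and only the output weights $W_{i}$ are fitted, here by ridge regression. All dimensions, the spectral rescaling of $A$, the target, and the ridge parameter are illustrative assumptions and are not prescribed by the paper.

import numpy as np

rng = np.random.default_rng(0)
N, d, T = 100, 3, 500   # state dimension, input dimension, sample length (illustrative)

# Randomly generated reservoir parameters of (1.1); rescaling A to spectral norm 0.9
# makes the state map a contraction, so a unique solution exists (echo state property).
A = rng.normal(size=(N, N))
A *= 0.9 / np.linalg.norm(A, 2)
C = rng.normal(size=(N, d))
zeta = rng.normal(size=N)

# Randomly generated readout features of (1.3); only the output weights W are trained.
a = rng.normal(size=(N, N))   # rows a^(i)
c = rng.normal(size=(N, d))   # rows c^(i)
b = rng.normal(size=N)

def reservoir_pairs(z, sigma=np.tanh):
    """Iterate (1.1) over an input sequence z of shape (T, d), returning (x_{t-1}, z_t) pairs."""
    x = np.zeros(N)
    pairs = []
    for zt in z:
        pairs.append((x, zt))
        x = sigma(A @ x + C @ zt + zeta)
    return pairs

def features(pairs):
    """Random-feature map of (1.3): phi_i(t) = ReLU(a^(i).x_{t-1} + c^(i).z_t + b_i)."""
    return np.array([np.maximum(a @ x_prev + c @ zt + b, 0.0) for x_prev, zt in pairs])

# Toy supervised problem (hypothetical target with one step of memory); fit W by ridge regression.
z = rng.uniform(-1, 1, size=(T, d))
y = np.sin(z[:, 0])
y[1:] += 0.5 * z[:-1, 1]
Phi = features(reservoir_pairs(z))
lam = 1e-3
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)
print("training MSE:", np.mean((Phi @ W - y) ** 2))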

In order to derive learning error bounds for this architecture, we proceed in several steps. First, we examine the new concept class $\mathcal{C}$ of recurrent (generalized) Barron functionals, consisting of reservoir functionals that admit a certain integral representation. This class $\mathcal{C}$ turns out to possess very useful properties: we show that a large class of functionals $H\colon(\mathbb{R}^{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{m}$, including – but not restricted to – functionals with “sufficient smoothness”, belongs to $\mathcal{C}$ and that, under mild conditions, $\mathcal{C}$ is dense in the $L^{2}$-sense. We then examine the approximation of elements in $\mathcal{C}$ by reservoir computing systems and derive approximation error bounds. This shows that a large class of functionals can be approximated by recurrent neural networks whose hidden weights are randomly sampled from generic distributions, and explicit approximation rates can be derived for them.

The second step consists in obtaining generalization error bounds that match the parameter restrictions emerging from the approximation results for these systems. The key challenge here is that the observational data are non-IID, and so classical statistical learning techniques [Bouc 13] cannot be employed directly; see [Gono 20b]. By combining these bounds, we then obtain learning error bounds for such random recurrent neural networks for learning a general class of maps $\mathcal{C}$. As a by-product, we obtain new universality results for reservoir systems with generically randomly generated coefficients; see [Grig 18b, Grig 18a, Gono 20c, Gono 23, Gono 21b]. It is worth emphasizing that the construction described in this paper yields a fully implementable recurrent neural network-based learning algorithm with provable convergence guarantees not suffering from the curse of dimensionality.

2 A dynamic analog of the generalized Barron functions

In this section, we introduce the class $\mathcal{C}$ of recurrent (generalized) Barron functionals, which constitute the concept class that we study and approximate in this paper. We prove elementary properties of $\mathcal{C}$ and identify important constituents in it. In particular, we show that $\mathcal{C}$ is a vector space that is dense in the set of square-integrable functionals and contains many important classes of maps $H\colon(\mathbb{R}^{d})^{\mathbb{Z}_{-}}\to\mathbb{R}$, such as linear or “sufficiently smooth” maps.

2.1 Notation

Let $D_{d}\subset\mathbb{R}^{d}$, $p\in[1,\infty]$, and let $q$ be the Hölder conjugate of $p$. Recall the notation $\ell^{p}=\ell^{p}(\mathbb{R})$ for the space of sequences $(x_{n})_{n\in\mathbb{N}}\in\mathbb{R}^{\mathbb{N}}$ with $\sum_{i\in\mathbb{N}}|x_{i}|^{p}<\infty$, in the case $p<\infty$, and $\sup_{i\in\mathbb{N}}|x_{i}|<\infty$, in the case $p=\infty$. The symbol $\ell^{p}_{-}=\ell^{p}_{-}(\mathbb{R})$ denotes the space of sequences $(x_{n})_{n\in\mathbb{Z}_{-}}\in\mathbb{R}^{\mathbb{Z}_{-}}$ that satisfy the same summability conditions as those in $\ell^{p}$, with $\mathbb{Z}_{-}$ the negative integers including zero. We let $\ell^{\infty}_{-}(\ell^{q})$ denote the set of sequences $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in(\ell^{q})^{\mathbb{Z}_{-}}$ with the property that $\sup_{t\in\mathbb{Z}_{-}}\|\mathbf{x}_{t}\|_{q}<\infty$. For $\mathbf{x}\in\ell^{p}$ and ${\bf y}\in\ell^{q}$ we write $\mathbf{x}\cdot{\bf y}=\sum_{i\in\mathbb{N}}x_{i}y_{i}$. Let $\sigma_{1},\sigma_{2}\colon\mathbb{R}\to\mathbb{R}$ denote two functions, called activation functions. For both activation functions we denote by the same symbol their componentwise application to sequences or vectors. Furthermore, we assume that $\sigma_{2}$ is Lipschitz-continuous with constant $L_{\sigma_{2}}$ and that $\sigma_{1}(0)=0$. Given $N\in\mathbb{N}$, we write $\mathcal{X}=\mathbb{R}\times\ell^{p}\times\mathbb{R}^{d}\times\mathbb{R}$ and $\mathcal{X}^{N}=\mathbb{R}\times\mathbb{R}^{N}\times\mathbb{R}^{d}\times\mathbb{R}$. We will consider inputs in $\mathcal{I}_{d}\subset(D_{d})^{\mathbb{Z}_{-}}$.

2.2 Definition and properties of recurrent Barron functionals

In this section we define the class of recurrent generalized Barron functionals and study some properties of these functionals.

Definition 2.1.

A functional $H\colon\mathcal{I}_{d}\to\mathbb{R}$ is said to be a recurrent (generalized) Barron functional if there exist a (Borel) probability measure $\mu$ on $\mathcal{X}$, ${\bf B}\in\ell^{q}$, and linear maps $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$ such that for each ${\bf z}\in\mathcal{I}_{d}$ the system

\mathbf{x}_{t} = \sigma_{1}(A\mathbf{x}_{t-1} + C{\bf z}_{t} + {\bf B}), \quad t\in\mathbb{Z}_{-}, \qquad (2.1)

admits a unique solution $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$, $\mu$ satisfies

I_{\mu,p} := \int_{\mathcal{X}}|w|\,(\|{\bf a}\|_{p}+\|{\bf c}\|+|b|)\,\mu(dw,d{\bf a},d{\bf c},db) < \infty

and, writing $\mathbf{x}_{t}=\mathbf{x}_{t}({\bf z})$,

H({\bf z}) = \int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(dw,d{\bf a},d{\bf c},db), \quad {\bf z}\in\mathcal{I}_{d}. \qquad (2.2)

We denote by $\mathcal{C}$ the class of all recurrent generalized Barron functionals, or just recurrent Barron functionals.

Note that the readout (2.2) is built in such a way that the output is constructed out of $\mathbf{x}_{-1}$ and ${\bf z}_{0}$ instead of exclusively $\mathbf{x}_{0}$, as is customary in reservoir computing (see (1.2)). The reason for this choice is that it will allow us later on, in Section 3.6, to recover the static situation as a particular case of the recurrent setup in a straightforward manner.

The unique solution property (USP). The existence and uniqueness of solutions of state equations of the type (2.1), which is part of the definition of the recurrent Barron functionals, is a well-studied problem. This property is referred to in the literature as the echo state property (ESP) [Jaeg 10, Yild 12, Manj 13] or the unique solution property (USP) [G Ma 20, Manj 22a], and it has been extensively studied in deterministic (see the references above and [Grig 18a], for instance) and stochastic setups [Manj 22b]. The following proposition is an infinite-dimensional adaptation of a commonly used sufficient condition for the USP to hold in the case of (2.1), as well as a full characterization of it when the activation function $\sigma_{1}$ is the identity and hence (2.1) is a linear system. In both cases, the inputs are assumed to be bounded; possible generalizations to the unbounded case can be found in [Grig 19].

Proposition 2.2.

Consider the state equation given by (2.1) and the following two cases:

(i)

Suppose that the inputs are defined in $\mathcal{I}_{d}\subset(D_{d})^{\mathbb{Z}_{-}}$, with $D_{d}$ a compact subset of $\mathbb{R}^{d}$. Furthermore, assume that $\sigma_{1}$ is Lipschitz-continuous with constant $L_{\sigma_{1}}$ and that $L_{\sigma_{1}}|||A|||<1$, with $|||A|||$ the operator norm of the linear map $A\colon\ell^{q}\to\ell^{q}$. Then (2.1) has the USP, i.e., for each ${\bf z}\in\mathcal{I}_{d}$ there exists a unique $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ satisfying (2.1).

(ii)

Suppose that $\sigma_{1}$ is the identity and that $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$ are bounded linear operators. If the spectral radius $\rho(A)$ of the operator $A$ satisfies $\rho(A)<1$, then the linear state equation (2.1) has the USP with respect to inputs in $\ell^{\infty}_{-}(\mathbb{R}^{d})$. In that case the unique solution of (2.1) is given by

\mathbf{x}_{t}({\bf z}) = \sum_{k=0}^{\infty} A^{k}(C{\bf z}_{t-k}+{\bf B}), \quad t\in\mathbb{Z}_{-}. \qquad (2.3)
Proof.

The proof of part (i) is a straightforward generalization of [Grig 18a, Theorem 3.1(ii)] together with the observation that the hypothesis $L_{\sigma_{1}}|||A|||<1$ implies that the state equation is a contraction in the state entry. Part (ii) can be obtained by mimicking the proof of [Grig 21b, Proposition 4.2(i)]; a key element in that proof is the use of Gelfand’s formula for the spectral radius, which holds for operators between infinite-dimensional Banach spaces [Lax 02, Theorem 4, page 195]. ∎
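The linear case in part (ii) can be illustrated numerically in a finite-dimensional truncation: when $\rho(A)<1$, iterating (2.1) forgets its initialization and reproduces the closed-form solution (2.3). The dimensions, the spectral rescaling, and the horizon in the sketch below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
n, d, K = 8, 2, 200          # truncated state dimension, input dimension, horizon (illustrative)

A = rng.normal(size=(n, n))
A *= 0.8 / max(abs(np.linalg.eigvals(A)))    # enforce spectral radius rho(A) < 1
C = rng.normal(size=(n, d))
B = rng.normal(size=n)

z = rng.uniform(-1, 1, size=(K, d))          # inputs z_{-K+1}, ..., z_0

# Iterate x_t = A x_{t-1} + C z_t + B from an arbitrary start; rho(A) < 1 washes it out.
x = rng.normal(size=n)
for t in range(K):
    x = A @ x + C @ z[t] + B

# Closed form (2.3), truncated at k = K - 1: x_0 = sum_{k>=0} A^k (C z_{-k} + B).
x_closed = np.zeros(n)
Ak = np.eye(n)
for k in range(K):
    x_closed += Ak @ (C @ z[K - 1 - k] + B)
    Ak = Ak @ A

print("max discrepancy:", np.max(np.abs(x - x_closed)))   # tiny: only the forgotten initial state remains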

The concept class $\mathcal{C}$ of recurrent Barron functionals is a vector space.

Proposition 2.3.

Suppose $H_{1},H_{2}\in\mathcal{C}$ and $\lambda_{1},\lambda_{2}\in\mathbb{R}$. Then $\lambda_{1}H_{1}+\lambda_{2}H_{2}\in\mathcal{C}$.

Proof.

Without loss of generality, we may assume $\lambda_{i}\neq 0$ for $i=1,2$. For $i=1,2$ let $\mu^{(i)}$ be (Borel) probability measures on $\mathcal{X}$, let ${\bf B}^{(i)}\in\ell^{q}$, and let $A^{(i)}\colon\ell^{q}\to\ell^{q}$, $C^{(i)}\colon\mathbb{R}^{d}\to\ell^{q}$ be linear maps such that for each ${\bf z}\in\mathcal{I}_{d}$ the system

\mathbf{x}_{t}^{(i)} = \sigma_{1}(A^{(i)}\mathbf{x}_{t-1}^{(i)} + C^{(i)}{\bf z}_{t} + {\bf B}^{(i)}), \quad t\in\mathbb{Z}_{-}, \qquad (2.4)

admits a unique solution $(\mathbf{x}_{t}^{(i)})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$, $\mu^{(i)}$ satisfies $I_{\mu^{(i)},p}<\infty$, and

H_{i}({\bf z}) = \int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{(i)}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{(i)}(dw,d{\bf a},d{\bf c},db), \quad {\bf z}\in\mathcal{I}_{d}. \qquad (2.5)

Define $\pi_{1}\colon\mathbb{R}^{\mathbb{N}}\to\mathbb{R}^{\mathbb{N}}$ by $\pi_{1}((y_{n})_{n\in\mathbb{N}})=(y_{\frac{n+1}{2}}\mathbbm{1}_{\mathbb{N}}(\frac{n+1}{2}))_{n\in\mathbb{N}}$ and $\pi_{2}\colon\mathbb{R}^{\mathbb{N}}\to\mathbb{R}^{\mathbb{N}}$ by $\pi_{2}((y_{n})_{n\in\mathbb{N}})=(y_{\frac{n}{2}}\mathbbm{1}_{\mathbb{N}}(\frac{n}{2}))_{n\in\mathbb{N}}$, and note that $\pi_{1},\pi_{2}$ are linear, that they map both $\ell^{p}$ and $\ell^{q}$ into themselves, that $\pi_{1}+\pi_{2}=\mathbb{I}$, and that $\pi_{i}\circ\sigma_{1}=\sigma_{1}\circ\pi_{i}$, $i=1,2$. Define ${\bf B}\in\ell^{q}$ and linear maps $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$ by setting $A{\bf x}=\pi_{1}(A^{(1)}{\bf x})+\pi_{2}(A^{(2)}{\bf x})$, ${\bf B}=\pi_{1}({\bf B}^{(1)})+\pi_{2}({\bf B}^{(2)})$, and $C{\bf z}=\pi_{1}(C^{(1)}{\bf z})+\pi_{2}(C^{(2)}{\bf z})$.

Suppose now that $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ is a solution to (2.1) for the input ${\bf z}\in\mathcal{I}_{d}$ and the parameters $A,{\bf B},C$ that we just defined. For each $t\in\mathbb{Z}_{-}$ denote by $\mathbf{x}_{t}^{(1)}=(\mathbf{x}_{t,2n-1})_{n\in\mathbb{N}}$ the odd components of $\mathbf{x}_{t}$ and by $\mathbf{x}_{t}^{(2)}=(\mathbf{x}_{t,2n})_{n\in\mathbb{N}}$ the even components. Then $\mathbf{x}_{t}^{(i)}\in\ell^{q}$ and, by construction of $A,{\bf B},C$, it follows that $(\mathbf{x}_{t}^{(i)})_{t\in\mathbb{Z}_{-}}$ is a solution of (2.4) for $i=1,2$. By the uniqueness of the solutions of (2.4), it follows that there is at most one solution of (2.1). On the other hand, if we now denote by $(\mathbf{x}^{(i)}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$, $i=1,2$, the unique solution to (2.4) for the input ${\bf z}\in\mathcal{I}_{d}$, then setting $\mathbf{x}_{t}=\pi_{1}(\mathbf{x}_{t}^{(1)})+\pi_{2}(\mathbf{x}_{t}^{(2)})$ defines a sequence $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ which is a solution to (2.1), again by the construction of $A,{\bf B},C$.

Next, define $\phi_{i}\colon\mathcal{X}\to\mathcal{X}$ as $\phi_{i}(w,{\bf a},{\bf c},b)=(w(\lambda_{1}+\lambda_{2}),\pi_{i}({\bf a}),{\bf c},b)$ for $i=1,2$ and set $\mu=\frac{\lambda_{1}}{\lambda_{1}+\lambda_{2}}(\mu^{(1)}\circ\phi_{1}^{-1})+\frac{\lambda_{2}}{\lambda_{1}+\lambda_{2}}(\mu^{(2)}\circ\phi_{2}^{-1})$, where $\mu^{(i)}\circ\phi_{i}^{-1}$ denotes the pushforward of $\mu^{(i)}$ under $\phi_{i}$. Then the integral transformation theorem shows that

\int_{\mathcal{X}}|w|(\|{\bf a}\|_{p}+\|{\bf c}\|+|b|)\,\mu(dw,d{\bf a},d{\bf c},db) = \sum_{i=1}^{2}\lambda_{i}\int_{\mathcal{X}}|w|(\|\pi_{i}({\bf a})\|_{p}+\|{\bf c}\|+|b|)\,\mu^{(i)}(dw,d{\bf a},d{\bf c},db) < \infty

and, by (2.5), for all ${\bf z}\in\mathcal{I}_{d}$,

\int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(dw,d{\bf a},d{\bf c},db) = \sum_{i=1}^{2}\lambda_{i}\int_{\mathcal{X}} w\,\sigma_{2}(\pi_{i}({\bf a})\cdot\mathbf{x}_{-1}+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{(i)}(dw,d{\bf a},d{\bf c},db)
= \sum_{i=1}^{2}\lambda_{i}\int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{(i)}+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{(i)}(dw,d{\bf a},d{\bf c},db)
= \lambda_{1}H_{1}({\bf z})+\lambda_{2}H_{2}({\bf z}).

Hence, $\lambda_{1}H_{1}+\lambda_{2}H_{2}\in\mathcal{C}$, as claimed. ∎

Finite-dimensional recurrent Barron functionals. For $N\in\mathbb{N}$, denote by $\mathcal{C}_{N}$ the set of functionals $H^{N}\colon\mathcal{I}_{d}\subset(D_{d})^{\mathbb{Z}_{-}}\to\mathbb{R}$ with the following property: there exist a (Borel) probability measure $\mu^{N}$ on $\mathcal{X}^{N}$, $A\in\mathbb{M}_{N,N}$, ${\bf B}\in\mathbb{R}^{N}$, $C\in\mathbb{M}_{N,d}$ such that for each ${\bf z}\in\mathcal{I}_{d}$ the system

\mathbf{x}_{t}^{N} = \sigma_{1}(A\mathbf{x}_{t-1}^{N}+C{\bf z}_{t}+{\bf B}), \quad t\in\mathbb{Z}_{-}, \qquad (2.6)

admits a unique solution $(\mathbf{x}_{t}^{N})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\mathbb{R}^{N})$, $\mu^{N}$ satisfies

\int_{\mathcal{X}^{N}}|w|(\|{\bf a}\|+\|{\bf c}\|+|b|)\,\mu^{N}(dw,d{\bf a},d{\bf c},db) < \infty

and, writing $\mathbf{x}_{t}^{N}=\mathbf{x}_{t}^{N}({\bf z})$,

H^{N}({\bf z}) = \int_{\mathcal{X}^{N}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{N}(dw,d{\bf a},d{\bf c},db), \quad {\bf z}\in\mathcal{I}_{d}. \qquad (2.7)

We shall refer to the elements in $\mathcal{C}_{N}$ as finite-dimensional recurrent Barron functionals. In the following proposition, we show that the set of finite-dimensional recurrent Barron functionals can be naturally seen as a subspace of $\mathcal{C}$. In the statement, we use the notion and properties of system morphisms recalled, for instance, in [Grig 21b].

Proposition 2.4.

For arbitrary $N\in\mathbb{N}$, let $\iota\colon\mathbb{R}^{N}\to\ell^{q}$ be the natural injection $(x_{1},\ldots,x_{N})^{\top}\longmapsto(x_{1},\ldots,x_{N},0,\ldots)$. Given an element $H^{N}\in\mathcal{C}_{N}$, there exists a system of type (2.1)-(2.2) that realizes $H^{N}$ and for which the map $\iota$ is a system morphism. This allows us to conclude that

\mathcal{C}_{N}\subset\mathcal{C}, \qquad (2.8)

that is, every finite-dimensional recurrent Barron functional admits a realization as a recurrent Barron functional.

Proof.

Let $H^{N}\in\mathcal{C}_{N}$ be such that (2.7) holds with $\mathbf{x}^{N}\in\ell^{\infty}_{-}(\mathbb{R}^{N})$ satisfying (2.6). First, define $\bar{A}\colon\ell^{q}\to\ell^{q}$, $\bar{{\bf B}}\in\ell^{q}$, $\bar{C}\colon\mathbb{R}^{d}\to\ell^{q}$ by setting $(\bar{A}\mathbf{x})_{i}=\mathbbm{1}_{\{i\leq N\}}\sum_{j=1}^{N}A_{ij}x_{j}$, $\bar{B}_{i}=\mathbbm{1}_{\{i\leq N\}}B_{i}$, $(\bar{C}{\bf z})_{i}=\mathbbm{1}_{\{i\leq N\}}\sum_{j=1}^{d}C_{ij}z_{j}$ for $\mathbf{x}\in\ell^{q}$, ${\bf z}\in\mathbb{R}^{d}$, $i\in\mathbb{N}$. Furthermore, if $\iota$ is the canonical injection, we denote by the same symbol the injection of $\mathcal{X}^{N}$ into $\mathcal{X}$ and by $\mu=\mu^{N}\circ\iota^{-1}$ the pushforward measure of $\mu^{N}$ under $\iota$.

Consider now the system (2.6)–(2.7) and the one given by (2.1) and (2.2) with the parameters $\bar{A},\bar{{\bf B}},\bar{C}$ that we just defined, as well as the readout given by the measure $\mu=\mu^{N}\circ\iota^{-1}$. We shall prove that the map $\iota\colon\mathbb{R}^{N}\to\ell^{q}$ is a system morphism between these systems. This requires showing that $\iota$ is system equivariant and readout invariant (see [Grig 21b] for this terminology).

First, we notice that $\iota$ is system equivariant because

\iota\left(\sigma_{1}(A\mathbf{x}^{N}+C{\bf z}+{\bf B})\right) = \sigma_{1}(\iota(A\mathbf{x}^{N}+C{\bf z}+{\bf B})) = \sigma_{1}(\bar{A}\iota(\mathbf{x}^{N})+\bar{C}{\bf z}+\bar{{\bf B}}), \quad \mbox{for any } \mathbf{x}^{N}\in\mathbb{R}^{N},\ {\bf z}\in\mathbb{R}^{d},

as required. As a consequence of this fact (see [Grig 21b, Proposition 2.2]), if $(\mathbf{x}_{t}^{N})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\mathbb{R}^{N})$ is a solution of the system determined by (2.6) and (2.7) for ${\bf z}\in\mathcal{I}_{d}$, then so is $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ with $\mathbf{x}_{t}=\iota(\mathbf{x}_{t}^{N})$ for the one given by (2.1) and (2.2) with the parameters $\bar{A},\bar{{\bf B}},\bar{C}$. This solution is unique because if $(\overline{\mathbf{x}}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ is another solution for it and we denote by $\overline{\mathbf{x}}_{t}^{N}\in\mathbb{R}^{N}$ its first $N$ components, then from (2.1) we obtain for the components $i=1,\ldots,N$ that $\overline{x}_{t,i}^{N}=\sigma_{1}((\bar{A}\overline{\mathbf{x}}_{t-1})_{i}+(\bar{C}{\bf z}_{t})_{i}+\bar{B}_{i})=\sigma_{1}((A\overline{\mathbf{x}}_{t-1}^{N})_{i}+(C{\bf z}_{t})_{i}+B_{i})$, which shows that $\overline{\mathbf{x}}^{N}$ is a solution to (2.6) and thus, by uniqueness, $\overline{\mathbf{x}}^{N}=\mathbf{x}^{N}$ necessarily. For components $i>N$ we get from (2.1) that $\overline{x}_{t,i}=\sigma_{1}(0)=0$. Altogether, this proves that for each ${\bf z}\in\mathcal{I}_{d}$ the system given by (2.1) with the parameters $\bar{A},\bar{{\bf B}},\bar{C}$ has a unique solution $\mathbf{x}$, and the first $N$ entries of each term $\mathbf{x}_{t}$ are given by $\mathbf{x}_{t}^{N}$.

Notice now that the change-of-variables theorem and the equivalence of norms on $\mathbb{R}^{N}$ yield that

\int_{\mathcal{X}}|w|(\|{\bf a}\|_{p}+\|{\bf c}\|+|b|)\,\mu(dw,d{\bf a},d{\bf c},db) = \int_{\mathcal{X}^{N}}|w|(\|\iota({\bf a})\|_{p}+\|{\bf c}\|+|b|)\,\mu^{N}(dw,d{\bf a},d{\bf c},db) < \infty,

which proves that $H$ in (2.2) defines an element in $\mathcal{C}$. Finally, we show that the map $\iota$ is readout invariant using that the first $N$ components of each solution $\mathbf{x}_{t}$ are given by $\mathbf{x}_{t}^{N}$ and applying again the change-of-variables theorem:

H({\bf z}) = \int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(dw,d{\bf a},d{\bf c},db) = \int_{\mathcal{X}^{N}} w\,\sigma_{2}(\iota({\bf a})\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{N}(dw,d{\bf a},d{\bf c},db)
= \int_{\mathcal{X}^{N}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{N}(dw,d{\bf a},d{\bf c},db) = H^{N}({\bf z})

for ${\bf z}\in\mathcal{I}_{d}$, as required. In particular, $H^{N}$ and $H$ are in fact the same functional, which proves the inclusion (2.8). ∎

2.3 Examples of recurrent Barron functionals

We now show that large classes of input/output systems naturally belong to the recurrent Barron class.

Finite-memory functionals. The following proposition shows that finite-memory causal and time-invariant systems that have certain regularity properties can be written as finite-dimensional recurrent Barron functionals with a linear state equation.

Proposition 2.5.

Assume $\sigma_{1}(x)=x$, $\sigma_{2}(x)=\max(x,0)$, and that $D_{d}$ is bounded; let $T\in\mathbb{N}$ and suppose that $h\colon\mathbb{R}^{Td}\to\mathbb{R}$ is in the Sobolev space $H^{s}(\mathbb{R}^{Td})$ for $s>\frac{Td}{2}+2$. Then the functional $H({\bf z})=h({\bf z}_{0},\ldots,{\bf z}_{-T+1})$, ${\bf z}\in\mathcal{I}_{d}$, is a finite-dimensional recurrent Barron functional.

Proof.

Choose $N=dT$, ${\bf B}=\mathbf{0}$,

A = \begin{pmatrix} \mathbf{0}_{d,d(T-1)} & \mathbf{0}_{d,d} \\ \mathbb{I}_{d(T-1)} & \mathbf{0}_{d(T-1),d} \end{pmatrix} \quad\mbox{and}\quad C = \begin{pmatrix} \mathbb{I}_{d} \\ \mathbf{0}_{d(T-1),d} \end{pmatrix}. \qquad (2.9)

Then the system

\mathbf{x}_{t}^{N} = \sigma_{1}(A\mathbf{x}_{t-1}^{N}+C{\bf z}_{t}+{\bf B}), \quad t\in\mathbb{Z}_{-}, \qquad (2.10)

admits as its unique solution $\mathbf{x}_{t}^{N}=({\bf z}_{t}^{\top},\ldots,{\bf z}_{t-T+1}^{\top})^{\top}$, $t\in\mathbb{Z}_{-}$. Next, by [E 21, Theorem 3.1] (which is based on [Barr 93, Proposition 1 and Section IX.15]; alternatively, the result also follows using the representation obtained in the proof of [Gono 23, Theorem 1]), there exists a (Borel) probability measure $\tilde{\mu}$ on $\mathbb{R}\times\mathbb{R}^{Td+1}$ such that $\int_{\mathbb{R}\times\mathbb{R}^{Td+1}}|w|\|{\bf v}\|\tilde{\mu}(dw,d{\bf v})<\infty$ and, for Lebesgue-a.e. ${\bf u}\in(D_{d})^{T}$,

h({\bf u}) = \int_{\mathbb{R}\times\mathbb{R}^{Td+1}} w\,\sigma_{2}({\bf v}\cdot({\bf u}^{\top},1)^{\top})\,\tilde{\mu}(dw,d{\bf v}). \qquad (2.11)

But $h$ is continuous on $(D_{d})^{T}$ by [Evan 10, Theorem 6(ii) in Chapter 5.6] and the right-hand side of (2.11) is also continuous in ${\bf u}$, hence the representation (2.11) holds for all ${\bf u}\in(D_{d})^{T}$.

Let $\pi\colon\mathbb{R}\times\mathbb{R}^{Td+1}\to\mathcal{X}^{N}$ be defined as follows: for $w\in\mathbb{R}$, ${\bf v}\in\mathbb{R}^{Td+1}$ we set $\pi(w,{\bf v})=(w,(v_{d+1},\ldots,v_{Td},0,\ldots,0),(v_{1},\ldots,v_{d}),v_{Td+1})$. Denote by $\mu^{N}=\tilde{\mu}\circ\pi^{-1}$ the pushforward measure of $\tilde{\mu}$ under $\pi$. Then, by the change-of-variables formula, $\mu^{N}$ satisfies

\int_{\mathcal{X}^{N}}|w|(\|{\bf a}\|+\|{\bf c}\|+|b|)\,\mu^{N}(dw,d{\bf a},d{\bf c},db) \leq \sqrt{3}\int_{\mathcal{X}^{N}}|w|(\|{\bf a}\|^{2}+\|{\bf c}\|^{2}+|b|^{2})^{1/2}\,\mu^{N}(dw,d{\bf a},d{\bf c},db)
= \sqrt{3}\int_{\mathbb{R}\times\mathbb{R}^{Td+1}}|w|\,\|{\bf v}\|\,\tilde{\mu}(dw,d{\bf v}) < \infty

and, for all ${\bf z}\in(D_{d})^{\mathbb{Z}_{-}}$, with $g_{\bf z}(w,{\bf a},{\bf c},b)=w\,\sigma_{2}(({\bf c}^{\top},{\bf a}^{\top},b)({\bf z}_{0}^{\top},{\bf z}_{-1}^{\top},\ldots,{\bf z}_{-T}^{\top},1)^{\top})$,

H({\bf z}) = h({\bf z}_{0},\ldots,{\bf z}_{-T+1}) = \int_{\mathbb{R}\times\mathbb{R}^{Td+1}} w\,\sigma_{2}({\bf v}\cdot({\bf z}_{0}^{\top},\ldots,{\bf z}_{-T+1}^{\top},1)^{\top})\,\tilde{\mu}(dw,d{\bf v}) \qquad (2.12)
= \int_{\mathbb{R}\times\mathbb{R}^{Td+1}} g_{\bf z}(\pi(w,{\bf v}))\,\tilde{\mu}(dw,d{\bf v}) = \int_{\mathcal{X}^{N}} w\,\sigma_{2}(({\bf c}^{\top},{\bf a}^{\top},b)({\bf z}_{0}^{\top},{\bf z}_{-1}^{\top},\ldots,{\bf z}_{-T}^{\top},1)^{\top})\,\mu^{N}(dw,d{\bf a},d{\bf c},db)
= \int_{\mathcal{X}^{N}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{N}(dw,d{\bf a},d{\bf c},db),

which shows that $H$ is a finite-dimensional recurrent Barron functional, that is, $H\in\mathcal{C}_{N}$. ∎
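The shift-register construction (2.9)–(2.10) used in this proof is easy to check numerically: with $\sigma_{1}$ the identity and ${\bf B}=\mathbf{0}$, the state $\mathbf{x}_{t}^{N}$ ends up stacking the last $T$ inputs. The following sketch does this for illustrative values of $d$ and $T$.

import numpy as np

d, T = 2, 4
N = d * T

# Block matrices of (2.9): A shifts the stacked inputs down by one block, C writes z_t on top.
A = np.zeros((N, N))
A[d:, :d * (T - 1)] = np.eye(d * (T - 1))
C = np.zeros((N, d))
C[:d, :] = np.eye(d)

rng = np.random.default_rng(2)
z = rng.uniform(-1, 1, size=(50, d))

x = np.zeros(N)
for zt in z:
    x = A @ x + C @ zt            # sigma_1 = identity and B = 0 in (2.10)

# After at least T steps the state stacks (z_t, z_{t-1}, ..., z_{t-T+1}).
expected = np.concatenate([z[-1 - k] for k in range(T)])
print("shift-register check:", np.allclose(x, expected))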

Linear functionals. The following propositions show that certain linear functionals are also in the recurrent Barron class $\mathcal{C}$.

Proposition 2.6.

Suppose $\sigma_{1}(x)=x$ and either $\sigma_{2}(x)=x$ or $\sigma_{2}(x)=\max(x,0)$. Let $\mathbf{w}\in\ell^{1}_{-}$, ${\bf a}_{i}\in\mathbb{R}^{d}$ for $i\in\mathbb{Z}_{-}$, and assume that $D_{d}$ is bounded and $\sup_{i\in\mathbb{Z}_{-}}\|{\bf a}_{i}\|<\infty$. Assume that either

(i)

$\sum_{i\in\mathbb{Z}_{-}}|w_{i}|\lambda^{i}<\infty$ for some $\lambda\in(0,1)$,

(ii)

or $\mathcal{I}_{d}\subset\ell^{q}_{-}(D_{d})$.

Then the functional

H({\bf z}) = \sum_{i\in\mathbb{Z}_{-}} w_{i}\,{\bf a}_{i}\cdot{\bf z}_{i}, \quad {\bf z}\in\mathcal{I}_{d},

satisfies $H\in\mathcal{C}$.

Proof.

Let $\lambda\in(0,1)$ be as in (i) or, in case (ii) holds, let $\lambda=1$.

Consider first the case $\sigma_{2}(x)=x$. Let ${\bf B}={\bf 0}$ and define $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$ by $(A\mathbf{x})_{i}=\mathbbm{1}_{\{i>d\}}\lambda x_{i-d}$, $(C{\bf z})_{i}=\mathbbm{1}_{\{i\leq d\}}z_{i}$ for $\mathbf{x}\in\ell^{q}$, ${\bf z}\in\mathbb{R}^{d}$, $i\in\mathbb{N}$. Let us now look at the $i$-th component of (2.1). Inserting the definitions yields

x_{t,i} = \mathbbm{1}_{\{i>d\}}\lambda x_{t-1,i-d} + \mathbbm{1}_{\{i\leq d\}}z_{t,i}, \quad t\in\mathbb{Z}_{-}. \qquad (2.13)

By iterating (2.13) we see that for $k\in\mathbb{N}$, $j=1,\ldots,d$, we must have $x_{t,(k-1)d+j}=\lambda^{k-1}z_{t-k+1,j}$. Using this as the definition of $\mathbf{x}$, we hence obtain that (2.1) has a unique solution. In addition, this solution satisfies $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$: suppose that (i) holds; then in the case $q<\infty$ we verify for any $t\in\mathbb{Z}_{-}$ that $\sum_{k\in\mathbb{N}}|x_{t,k}|^{q}=\sum_{i\in\mathbb{N}}\lambda^{qi}\|{\bf z}_{t-i}\|_{q}^{q}\leq dM^{q}\lambda^{q}(1-\lambda^{q})^{-1}$, where $M>0$ is chosen such that $\|{\bf u}\|_{\infty}\leq M$ for all ${\bf u}\in D_{d}$. In the case $q=\infty$ we verify that $\sup_{k\in\mathbb{N}}|x_{t,k}|\leq\sup_{s,k\in\mathbb{N}}|z_{s,k}|\leq M$. Similarly, if (ii) holds, $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ follows from ${\bf z}\in\mathcal{I}_{d}\subset\ell^{q}_{-}(D_{d})$.

Define, for each $i\in\mathbb{Z}_{-}\setminus\{0\}$, $\bar{\mathbf{a}}_{i}\in\ell^{p}$ by setting $\bar{a}_{i,j+d(|i|-1)}=\lambda^{i+1}a_{i,j}$ for $j=1,\ldots,d$ and by setting the remaining components of $\bar{\mathbf{a}}_{i}$ equal to zero. Set $\mu=\frac{1}{\|\mathbf{w}\|_{1}}\sum_{i\in\mathbb{Z}_{-}}|w_{i}|\,\delta_{(\mathrm{sign}(w_{i})\|\mathbf{w}\|_{1},\,\bar{{\bf a}}_{i}\mathbbm{1}_{\{i<0\}},\,{\bf a}_{0}\mathbbm{1}_{\{i=0\}},\,0)}$; then $\mu$ is indeed a (Borel) probability measure on $\mathcal{X}$ and $\mu$ satisfies

I_{\mu,p} = \int_{\mathcal{X}}|\bar{w}|(\|{\bf a}\|_{p}+\|{\bf c}\|+|b|)\,\mu(d\bar{w},d{\bf a},d{\bf c},db) = \sum_{i\in\mathbb{Z}_{-}}|w_{i}|(\|\bar{{\bf a}}_{i}\mathbbm{1}_{\{i<0\}}\|_{p}+\|{\bf a}_{0}\mathbbm{1}_{\{i=0\}}\|)
\leq \sqrt{d}\sum_{i\in\mathbb{Z}_{-}}|w_{i}|\lambda^{i+1}\|{\bf a}_{i}\|
\leq \sqrt{d}\,\sup_{i\in\mathbb{Z}_{-}}\{\|{\bf a}_{i}\|\}\sum_{i\in\mathbb{Z}_{-}}|w_{i}|\lambda^{i+1} < \infty,

since $\|\cdot\|_{p}\leq\|\cdot\|$ for $p\geq 2$ and $\|\cdot\|_{p}\leq d^{\frac{1}{p}-\frac{1}{2}}\|\cdot\|$ for $p\in[1,2]$. Furthermore, for any ${\bf z}\in\mathcal{I}_{d}$,

\int_{\mathcal{X}}\bar{w}\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(d\bar{w},d{\bf a},d{\bf c},db) = \sum_{i\in\mathbb{Z}_{-}} w_{i}(\bar{\mathbf{a}}_{i}\mathbbm{1}_{\{i<0\}}\cdot\mathbf{x}_{-1}+{\bf a}_{0}\mathbbm{1}_{\{i=0\}}\cdot{\bf z}_{0}) \qquad (2.14)
= {\bf a}_{0}\cdot{\bf z}_{0}\,w_{0} + \sum_{i\in\mathbb{Z}_{-}\setminus\{0\}} w_{i}\lambda^{i+1}\sum_{j=1}^{d} a_{i,j}\,x_{-1,j+d(|i|-1)}
= {\bf a}_{0}\cdot{\bf z}_{0}\,w_{0} + \sum_{i\in\mathbb{Z}_{-}\setminus\{0\}} w_{i}\sum_{j=1}^{d} a_{i,j}\,z_{-|i|,j}
= H({\bf z}),

which proves that $H\in\mathcal{C}$ in the case $\sigma_{2}(x)=x$.

For the case $\sigma_{2}(x)=\max(x,0)$, let $\varphi\colon\mathcal{X}\to\mathcal{X}$ be defined by $\varphi(\bar{w},{\bf a},{\bf c},b)=(-\bar{w},-{\bf a},-{\bf c},-b)$ and consider $\bar{\mu}=\frac{1}{2}\mu+\frac{1}{2}(\mu\circ\varphi^{-1})$. Then the change-of-variables theorem and the fact that $I_{\mu,p}<\infty$ (as established above) show that $I_{\bar{\mu},p}<\infty$, and by (2.14) and $\sigma_{2}(x)-\sigma_{2}(-x)=x$ we obtain

\int_{\mathcal{X}}\bar{w}\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\bar{\mu}(d\bar{w},d{\bf a},d{\bf c},db) \qquad (2.15)
= \frac{1}{2}\int_{\mathcal{X}}\bar{w}\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(d\bar{w},d{\bf a},d{\bf c},db) - \frac{1}{2}\int_{\mathcal{X}}\bar{w}\,\sigma_{2}(-{\bf a}\cdot\mathbf{x}_{-1}({\bf z})-{\bf c}\cdot{\bf z}_{0}-b)\,\mu(d\bar{w},d{\bf a},d{\bf c},db)
= \frac{1}{2}H({\bf z}),

hence $\frac{1}{2}H\in\mathcal{C}$ and, by Proposition 2.3, also $H\in\mathcal{C}$. ∎

The next proposition also covers the linear case under slightly different hypotheses.

Proposition 2.7.

Suppose $\sigma_{1}(x)=x$ and either $\sigma_{2}(x)=x$ or $\sigma_{2}(x)=\max(x,0)$. Let ${\bf h}\in\ell^{p}_{-}(\mathbb{R}^{d})$ and $\mathcal{I}_{d}\subset\ell^{q}_{-}(D_{d})$. Then the functional

H({\bf z}) = \sum_{i\in\mathbb{Z}_{-}}{\bf h}_{i}\cdot{\bf z}_{i}, \quad {\bf z}\in\mathcal{I}_{d},

satisfies $H\in\mathcal{C}$.

Proof.

The case $\sigma_{2}(x)=\max(x,0)$ can be deduced from the case $\sigma_{2}(x)=x$ precisely as in the proof of Proposition 2.6, hence it suffices to consider the case $\sigma_{2}(x)=x$. Choosing $\lambda=1$ and $A,{\bf B},C$ as in the proof of Proposition 2.6 yields a unique solution $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ to (2.1), and for $k\in\mathbb{N}$, $j=1,\ldots,d$ we have $x_{t,(k-1)d+j}=z_{t-k+1,j}$. Define now $\mathbf{a}\in\ell^{p}$ by setting $a_{(k-1)d+j}=h_{-k,j}$ for $k\in\mathbb{N}$, $j=1,\ldots,d$, and consider $\mu=\delta_{(1,\mathbf{a},{\bf h}_{0},0)}$. Then $I_{\mu,p}<\infty$ and, for any ${\bf z}\in\mathcal{I}_{d}$,

\int_{\mathcal{X}}\bar{w}\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(d\bar{w},d{\bf a},d{\bf c},db) = \mathbf{a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf h}_{0}\cdot{\bf z}_{0} = {\bf h}_{0}\cdot{\bf z}_{0}+\sum_{k\in\mathbb{N}}\sum_{j=1}^{d} h_{-k,j}\,z_{-k,j} = H({\bf z}), \qquad (2.16)

which proves that $H\in\mathcal{C}$. ∎

Remark 2.8.

This proposition covers, in particular, the functionals associated to convolutional filters. Indeed, let ${\bf h}\in\ell^{p}_{-}(\mathbb{R}^{d})$ and consider the convolutional filter $U_{\bf h}\colon\ell^{q}_{-}(D_{d})\to\mathbb{R}^{\mathbb{Z}_{-}}$ defined by $U_{\bf h}({\bf z})_{t}=\sum_{j\in\mathbb{Z}_{-}}{\bf h}_{j}\cdot{\bf z}_{t+j}$. Then its associated functional $H\colon\ell^{q}_{-}(D_{d})\to\mathbb{R}$ defined by $H({\bf z})=\sum_{j\in\mathbb{Z}_{-}}{\bf h}_{j}\cdot{\bf z}_{j}$ is an element of $\mathcal{C}$, as shown in Proposition 2.7 above.
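As a small illustration of the remark, the sketch below evaluates the functional $H({\bf z})=\sum_{j\in\mathbb{Z}_{-}}{\bf h}_{j}\cdot{\bf z}_{j}$ with the kernel truncated to finitely many lags so that it can be computed; the kernel, the truncation length, and the input are all illustrative.

import numpy as np

rng = np.random.default_rng(4)
d, L, T = 2, 5, 30                    # input dimension, kernel lags kept, sequence length (illustrative)
h = rng.normal(size=(L, d))           # h_0, h_{-1}, ..., h_{-L+1}
z = rng.uniform(-1, 1, size=(T, d))   # z_{-T+1}, ..., z_0 (last row is z_0)

# Functional of Remark 2.8, truncated to the last L lags: H(z) = sum_k h_{-k} . z_{-k}.
H = sum(h[k] @ z[-1 - k] for k in range(L))
print("H(z) =", H)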

As a further example, we now see that any functional associated with a linear state-space system on a separable Hilbert space admits a realization as an element in $\mathcal{C}$ for $p=2$.

Proposition 2.9.

Let $\mathcal{Y}$ be a separable Hilbert space, let $\bar{A}\colon\mathcal{Y}\to\mathcal{Y}$, $\bar{C}\colon\mathbb{R}^{d}\to\mathcal{Y}$ be linear, and let $\bar{\bf B}\in\mathcal{Y}$; assume that for each ${\bf z}\in\mathcal{I}_{d}$ the system

\bar{\mathbf{x}}_{t} = \bar{A}\bar{\mathbf{x}}_{t-1}+\bar{C}{\bf z}_{t}+\bar{\bf B}, \quad t\in\mathbb{Z}_{-}, \qquad (2.17)

admits a unique solution $(\bar{\mathbf{x}}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\mathcal{Y})$, i.e., one satisfying $\sup_{t\in\mathbb{Z}_{-}}\|\bar{\mathbf{x}}_{t}\|_{\mathcal{Y}}<\infty$. Let $\bar{\mu}$ be a (Borel) probability measure on $\mathbb{R}\times\mathcal{Y}\times\mathbb{R}^{d}\times\mathbb{R}$ with $\int_{\mathbb{R}\times\mathcal{Y}\times\mathbb{R}^{d}\times\mathbb{R}}|w|(\|{\bf a}\|_{\mathcal{Y}}+\|{\bf c}\|+|b|)\,\bar{\mu}(dw,d{\bf a},d{\bf c},db)<\infty$ and consider

H({\bf z}) = \int_{\mathbb{R}\times\mathcal{Y}\times\mathbb{R}^{d}\times\mathbb{R}} w\,\sigma_{2}(\langle{\bf a},\bar{\mathbf{x}}_{-1}({\bf z})\rangle_{\mathcal{Y}}+{\bf c}\cdot{\bf z}_{0}+b)\,\bar{\mu}(dw,d{\bf a},d{\bf c},db), \quad {\bf z}\in\mathcal{I}_{d}. \qquad (2.18)

Then $H\in\mathcal{C}$ with $p=2$.

Proof.

Denote by $T\colon\mathcal{Y}\to\ell^{2}$ an isometric isomorphism between $\mathcal{Y}$ and $\ell^{2}$. Define $\mathbf{x}_{t}:=T\bar{\mathbf{x}}_{t}$, $A=T\bar{A}T^{-1}$, $C=T\bar{C}$, ${\bf B}=T\bar{\bf B}$, $\phi\colon\mathbb{R}\times\mathcal{Y}\times\mathbb{R}^{d}\times\mathbb{R}\to\mathcal{X}$, $\phi(w,{\bf a},{\bf c},b)=(w,T{\bf a},{\bf c},b)$, and let $\mu=\bar{\mu}\circ\phi^{-1}$ be the pushforward measure of $\bar{\mu}$ under $\phi$.

Then $\mu$ is a probability measure on $\mathbb{R}\times\ell^{2}\times\mathbb{R}^{d}\times\mathbb{R}$ and $A\colon\ell^{2}\to\ell^{2}$, $C\colon\mathbb{R}^{d}\to\ell^{2}$ are linear maps. Furthermore, by the choice of $A,{\bf B},C$ we have that $\mathbf{x}\in\ell^{\infty}_{-}(\ell^{2})$ is a solution of (2.1). On the other hand, if $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{2})$ is any solution to (2.1), then $\bar{\mathbf{x}}_{t}:=T^{-1}\mathbf{x}_{t}$ defines a solution to (2.17). But the latter system has a unique solution by assumption, hence the solution to (2.1) is also unique.

Finally, the change-of-variables formula and the fact that $T$ is an isometry imply that $\mu$ satisfies $I_{\mu,2}=\int_{\mathbb{R}\times\mathcal{Y}\times\mathbb{R}^{d}\times\mathbb{R}}|w|(\|T{\bf a}\|_{2}+\|{\bf c}\|+|b|)\,\bar{\mu}(dw,d{\bf a},d{\bf c},db)<\infty$ and

H({\bf z}) = \int_{\mathbb{R}\times\mathcal{Y}\times\mathbb{R}^{d}\times\mathbb{R}} w\,\sigma_{2}(T{\bf a}\cdot T\bar{\mathbf{x}}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\bar{\mu}(dw,d{\bf a},d{\bf c},db) \qquad (2.19)
= \int_{\mathbb{R}\times\ell^{2}\times\mathbb{R}^{d}\times\mathbb{R}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(dw,d{\bf a},d{\bf c},db).

This shows that the functional $H$ can be represented as a readout (2.2) of a system (2.1), hence $H\in\mathcal{C}$. It also proves that $T$ is a system isomorphism between the system determined by $\bar{A}$, $\bar{C}$, $\bar{B}$, and $\bar{\mu}$, and the system in $\mathcal{C}$ determined by $A$, $C$, $B$, and $\mu$. ∎

The next proposition is a generalization of a universality result in [Gono 20c, Theorem III.13] to the context of recurrent Barron functionals.

Proposition 2.10.

Assume $\sigma_{1}(x)=x$ and either $\sigma_{2}(x)=\max(x,0)$ or $\sigma_{2}$ is bounded, continuous, and non-constant. Let $\bar{p}\in[1,\infty)$ and let $\gamma$ be a probability measure on $\mathcal{I}_{d}\subset\ell^{\infty}_{-}(D_{d})$. Then $\mathcal{C}\cap L^{\bar{p}}(\mathcal{I}_{d},\gamma)$ is dense in $L^{\bar{p}}(\mathcal{I}_{d},\gamma)$.

Proof.

Let $H\in L^{\bar{p}}(\mathcal{I}_{d},\gamma)$ and $\varepsilon>0$. Consider first the case $\sigma_{2}(x)=\max(x,0)$. Define the activation function $\bar{\sigma}(x)=\sigma_{2}(x+1)-\sigma_{2}(x)$; then $|\bar{\sigma}(x)|\leq 1$ for $x\in\mathbb{R}$. Hence, by [Gono 20c, Theorem III.13] there exists a reservoir functional $H^{RC}$ associated to a linear system with neural network readout such that $H^{RC}\in L^{\bar{p}}(\mathcal{I}_{d},\gamma)$ and $\|H-H^{RC}\|_{L^{\bar{p}}(\mathcal{I}_{d},\gamma)}<\varepsilon$. More specifically, there exist $N\in\mathbb{N}$, $A^{N}\in\mathbb{R}^{N\times N}$, $C^{N}\in\mathbb{R}^{N\times d}$ such that the system

\mathbf{x}_{t}^{N} = A^{N}\mathbf{x}_{t-1}^{N}+C^{N}\mathbf{z}_{t}, \quad t\in\mathbb{Z}_{-} \qquad (2.20)

has the echo state property and $H^{RC}(\mathbf{z})=h(\mathbf{x}_{0}^{N}({\bf z}))$ for some neural network $h\colon\mathbb{R}^{N}\to\mathbb{R}$ given by

h(\mathbf{x}) = \sum_{j=1}^{k}\beta_{j}\,\bar{\sigma}(\bm{\alpha}_{j}\cdot\mathbf{x}-\theta_{j})

for some $k\in\mathbb{N}$, $\beta_{j},\theta_{j}\in\mathbb{R}$, and $\bm{\alpha}_{j}\in\mathbb{R}^{N}$. We claim that $H^{RC}\in\mathcal{C}_{N}$. To prove this, note that with the choices $A=A^{N}$, ${\bf B}={\bf 0}$, and $C=C^{N}$ the process $\mathbf{x}^{N}$ is the unique solution to the system (2.6), due to the echo state property of (2.20). From the proof of [Gono 20c, Theorem III.13] and the assumption that $\mathcal{I}_{d}\subset\ell^{\infty}_{-}(D_{d})$ we obtain that the unique solution satisfies $\mathbf{x}^{N}\in\ell^{\infty}_{-}(\mathbb{R}^{N})$. Set now $\mu^{N}=\frac{1}{2k}\sum_{j=1}^{k}\delta_{(2k\beta_{j},A^{\top}\bm{\alpha}_{j},C^{\top}\bm{\alpha}_{j},-\theta_{j}+1)}+\delta_{(-2k\beta_{j},A^{\top}\bm{\alpha}_{j},C^{\top}\bm{\alpha}_{j},-\theta_{j})}$. Then $\mu^{N}$ is a probability measure on $\mathcal{X}^{N}$ and, for ${\bf z}\in\mathcal{I}_{d}$,

\int_{\mathcal{X}^{N}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{N}(dw,d{\bf a},d{\bf c},db)
= \frac{1}{2k}\sum_{j=1}^{k} 2k\beta_{j}\,\sigma_{2}(\bm{\alpha}_{j}\cdot(A\mathbf{x}_{-1}^{N}({\bf z})+C{\bf z}_{0})-\theta_{j}+1) - 2k\beta_{j}\,\sigma_{2}(\bm{\alpha}_{j}\cdot(A\mathbf{x}_{-1}^{N}({\bf z})+C{\bf z}_{0})-\theta_{j})
= \sum_{j=1}^{k}\beta_{j}\,\bar{\sigma}(\bm{\alpha}_{j}\cdot\mathbf{x}_{0}^{N}({\bf z})-\theta_{j}) = h(\mathbf{x}_{0}^{N}({\bf z})) = H^{RC}(\mathbf{z}).

This shows that $H^{RC}\in\mathcal{C}_{N}$ and $\|H-H^{RC}\|_{L^{\bar{p}}(\mathcal{I}_{d},\gamma)}<\varepsilon$. Combining this with Proposition 2.4 yields the claim.

In the case where $\sigma_{2}$ is bounded, continuous, and non-constant we may directly work with $\sigma_{2}$ (instead of $\bar{\sigma}$) and proceed similarly. ∎

Generalized convolutional functionals. We conclude by showing that the class $\mathcal{C}$ contains functionals that generalize convolutional filters, an important family of transforms.

Proposition 2.11.

Assume $\sigma_{1}(x)=x$, $p\in(1,\infty)$, and suppose $\mathcal{I}_{d}\subset\ell^{q}_{-}(D_{d})$. Let $\mu$ be a probability measure on $\mathcal{X}$ with $I_{\mu,p}<\infty$. Then the functional $H\colon\mathcal{I}_{d}\to\mathbb{R}$,

H({\bf z}) = \int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot{\bf z}_{-1:-\infty}+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(dw,d{\bf a},d{\bf c},db), \quad {\bf z}\in\mathcal{I}_{d}, \qquad (2.21)

is in $\mathcal{C}$.

Proof.

We only need to construct ${\bf B}\in\ell^{q}$ and linear maps $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$ such that for each ${\bf z}\in\mathcal{I}_{d}$ the system

\mathbf{x}_{t} = A\mathbf{x}_{t-1}+C{\bf z}_{t}+{\bf B}, \quad t\in\mathbb{Z}_{-}, \qquad (2.22)

admits the unique solution $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}=({\bf z}_{t:-\infty})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$. Then $\mathbf{x}_{-1}={\bf z}_{-1:-\infty}$ and so $H$ has the representation (2.2). Choose ${\bf B}={\bf 0}$, let $C{\bf y}$ be the sequence with the components of ${\bf y}$ in the first $d$ entries and $0$ otherwise, and let $A$ be the operator that shifts the $i$-th entry of the input to the $(i+d)$-th entry and inserts $0$ in the first $d$ entries. Then $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$ are linear, and indeed $\mathbf{x}_{t}={\bf z}_{t:-\infty}$ for all $t\in\mathbb{Z}_{-}$ is the unique solution to (2.22), with $\mathbf{x}\in\ell^{\infty}_{-}(\ell^{q})$. ∎

Notice that standard convolutional filters are a special case of the functional in (2.21), obtained when $\mu$ is a Dirac measure and $\sigma_{2}$ is the identity.

3 Approximation

In this section, we establish approximation results for the concept class $\mathcal{C}$ of recurrent Barron functionals. As approximating systems, we use a modification of the finite-dimensional echo state networks (1.1)–(1.2) in which the linear readout $W$ in (1.2) is replaced by an extreme learning machine (ELM) / random features neural network, that is, a neural network of the type (1.3) whose parameters are all randomly generated, except for the output layer given by the weights $(W_{i})_{i=1,\ldots,N}$, which are trainable. To derive such bounds, we proceed in several approximation steps and (a) approximate the infinite system (2.1) by a finite system of type (1.1), (b) apply a Monte Carlo approximation of the integral in (2.2), and (c) use an importance sampling procedure to guarantee that the neural network weights can be generated from a generic distribution rather than the (unknown) measure $\mu$.
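Step (b) can be illustrated in isolation: for a finite-dimensional state and an explicit (purely illustrative, Gaussian) choice of $\mu$, the integral readout (2.2) is approximated by an empirical average over samples from $\mu$, and the estimate stabilizes as the number of samples grows. Step (c) would additionally reweight samples drawn from a generic distribution instead of $\mu$; that part is not shown here.

import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 3                    # truncated state dimension and input dimension (illustrative)
x_minus1 = rng.normal(size=n)   # stands in for the state x_{-1}(z)
z0 = rng.normal(size=d)         # stands in for the input z_0

def sample_mu(M):
    """Illustrative choice of mu: independent Gaussian coordinates for (w, a, c, b)."""
    return (rng.normal(loc=1.0, size=M),    # w
            rng.normal(size=(M, n)),        # a
            rng.normal(size=(M, d)),        # c
            rng.normal(size=M))             # b

def mc_readout(M):
    """Monte Carlo approximation of the integral readout (2.2) with M samples from mu."""
    w, a, c, b = sample_mu(M)
    return np.mean(w * np.maximum(a @ x_minus1 + c @ z0 + b, 0.0))

for M in (10, 1_000, 100_000):
    print(M, mc_readout(M))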

3.1 Approximation by a finite-dimensional system

We start by showing that the elements in $\mathcal{C}$ can be approximated by finite-dimensional truncations, that is, by finite-dimensional recurrent Barron functionals in $\mathcal{C}_{N}$, for any $N\in\mathbb{N}$.

Proposition 3.1.

Assume $q\in[1,\infty)$ and $\mathcal{I}_{d}\subset\ell^{\infty}_{-}(\mathbb{R}^{d})$. Suppose $H\in\mathcal{C}$ has representation (2.2) with probability measure $\mu$ on $\mathcal{X}$, ${\bf B}\in\ell^{q}$, and bounded linear maps $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$. Let $N\in\mathbb{N}$ and, for $j\in\mathbb{N}$, $k=1,\ldots,d$, let $\bm{\epsilon}^{j}=(\delta_{ij})_{i\in\mathbb{N}}\in\mathbb{R}^{\mathbb{N}}$ and ${\bf e}^{k}=(\delta_{ik})_{i=1,\ldots,d}\in\mathbb{R}^{d}$. Let $\bar{\bf B}\in\mathbb{R}^{N}$, $\bar{A}\in\mathbb{R}^{N\times N}$, $\bar{C}\in\mathbb{R}^{N\times d}$ be given by $\bar{B}_{i}=B_{i}$, $\bar{A}_{ij}=(A\bm{\epsilon}^{j})_{i}$, $\bar{C}_{ik}=(C{\bf e}^{k})_{i}$ for $i,j=1,\ldots,N$, $k=1,\ldots,d$. Assume $\sigma_{1}$ is $L_{\sigma_{1}}$-Lipschitz and $L_{\sigma_{1}}|||A|||<1$. Then the following statements hold:

(i)

For each ${\bf z}\in\mathcal{I}_{d}$ the system

\mathbf{x}_{t}^{N} = \sigma_{1}(\bar{A}\mathbf{x}_{t-1}^{N}+\bar{C}{\bf z}_{t}+\bar{\bf B}), \quad t\in\mathbb{Z}_{-}, \qquad (3.1)

admits a unique solution $(\mathbf{x}_{t}^{N})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\mathbb{R}^{N})$.

(ii)

Write $\mathbf{x}_{t}^{N}({\bf z})=\mathbf{x}_{t}^{N}$ and define the functional $H^{N}\colon\mathcal{I}_{d}\to\mathbb{R}$ by

H^{N}({\bf z}) = \int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot\iota(\mathbf{x}_{-1}^{N}({\bf z}))+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(dw,d{\bf a},d{\bf c},db), \quad {\bf z}\in\mathcal{I}_{d}. \qquad (3.2)

Then there exists a constant $C_{\mathrm{fin}}>0$ not depending on $N$ such that for all ${\bf z}\in\mathcal{I}_{d}$,

|H({\bf z})-H^{N}({\bf z})| \leq C_{\mathrm{fin}}\sum_{k=0}^{\infty}(L_{\sigma_{1}}|||A|||)^{k}\left(\sum_{i=N+1}^{\infty}|x_{-1-k,i}|^{q}\right)^{1/q}. \qquad (3.3)

In particular, $\lim_{N\to\infty}H^{N}({\bf z})=H({\bf z})$. The constant is given by $C_{\mathrm{fin}}=L_{\sigma_{2}}\int_{\mathcal{X}}|w|\,\|{\bf a}\|_{p}\,\mu(dw,d{\bf a},d{\bf c},db)$.

(iii)

$H^{N}\in\mathcal{C}_{N}$.

Proof.

Part (i) is a consequence of Proposition 2.2 and of the inequality $|||\bar{A}|||\leq|||A|||$, which in turn implies that $L_{\sigma_{1}}|||\bar{A}|||<1$. Regarding (ii), note first that, by our choice of $\bar{C}$, it follows that $(C{\bf u})_{i}=\sum_{k=1}^{d}u_{k}(C{\bf e}^{k})_{i}=\sum_{k=1}^{d}u_{k}\bar{C}_{ik}=(\bar{C}{\bf u})_{i}$ for all ${\bf u}\in\mathbb{R}^{d}$, $i=1,\ldots,N$. Similarly, for ${\bf y}\in\mathbb{R}^{N}$, $i=1,\ldots,N$ we have that $(\bar{A}{\bf y})_{i}=\sum_{j=1}^{N}y_{j}\bar{A}_{ij}=\sum_{j=1}^{N}y_{j}(A\bm{\epsilon}^{j})_{i}=(A\iota({\bf y}))_{i}$. Consequently, for $i=1,\ldots,N$:

|x_{t,i}-x_{t,i}^{N}| \leq L_{\sigma_{1}}|(A\mathbf{x}_{t-1})_{i}+(C{\bf z}_{t})_{i}+B_{i}-(\bar{A}\mathbf{x}_{t-1}^{N})_{i}-(\bar{C}{\bf z}_{t})_{i}-\bar{B}_{i}| = L_{\sigma_{1}}|(A(\mathbf{x}_{t-1}-\iota(\mathbf{x}_{t-1}^{N})))_{i}|. \qquad (3.4)

Denote by $\pi\colon\ell^{q}\to\mathbb{R}^{N}$ the projection onto the first $N$ coordinates. Then

\|\mathbf{x}_{t}({\bf z})-\iota(\mathbf{x}_{t}^{N}({\bf z}))\|_{q} \leq \|\mathbf{x}_{t}({\bf z})-\iota(\pi(\mathbf{x}_{t}({\bf z})))\|_{q}+\|\iota(\pi(\mathbf{x}_{t}({\bf z})))-\iota(\mathbf{x}_{t}^{N}({\bf z}))\|_{q} \qquad (3.5)
= \left(\sum_{i=N+1}^{\infty}|x_{t,i}|^{q}\right)^{1/q}+\left(\sum_{i=1}^{N}|x_{t,i}-x_{t,i}^{N}|^{q}\right)^{1/q}
\leq \left(\sum_{i=N+1}^{\infty}|x_{t,i}|^{q}\right)^{1/q}+L_{\sigma_{1}}\left(\sum_{i=1}^{N}|(A(\mathbf{x}_{t-1}-\iota(\mathbf{x}_{t-1}^{N})))_{i}|^{q}\right)^{1/q}
\leq \left(\sum_{i=N+1}^{\infty}|x_{t,i}|^{q}\right)^{1/q}+L_{\sigma_{1}}\|A(\mathbf{x}_{t-1}-\iota(\mathbf{x}_{t-1}^{N}))\|_{q}
\leq \left(\sum_{i=N+1}^{\infty}|x_{t,i}|^{q}\right)^{1/q}+L_{\sigma_{1}}|||A|||\,\|\mathbf{x}_{t-1}-\iota(\mathbf{x}_{t-1}^{N})\|_{q}.

Iterating (3.5) thus yields

𝐱t(𝐳)ι(𝐱tN(𝐳))qk=0(Lσ1|A|)k(i=N+1|xtk,i|q)1/q.\|\mathbf{x}_{t}({\bf z})-\iota(\mathbf{x}_{t}^{N}({\bf z}))\|_{q}\leq\sum_{k=0}^{\infty}(L_{\sigma_{1}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{k}\left(\sum_{i=N+1}^{\infty}|{x}_{t-k,i}|^{q}\right)^{1/q}. (3.6)

This can now be used to estimate the difference between HH and its truncation:

|H(𝐳)HN(𝐳)|\displaystyle|H({\bf z})-H^{N}({\bf z})| 𝒳|w||σ2(𝐚𝐱1(𝐳)+𝐜𝐳0+b)σ2(𝐚ι(𝐱1N(𝐳))+𝐜𝐳0+b)|μ(dw,d𝐚,d𝐜,db)\displaystyle\leq\int_{\mathcal{X}}|w||\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)-\sigma_{2}({\bf a}\cdot\iota(\mathbf{x}_{-1}^{N}({\bf z}))+{\bf c}\cdot{\bf z}_{0}+b)|\mu(dw,d{\bf a},d{\bf c},db) (3.7)
Lσ2𝒳|w||𝐚𝐱1(𝐳)𝐚ι(𝐱1N(𝐳))|μ(dw,d𝐚,d𝐜,db)\displaystyle\leq L_{\sigma_{2}}\int_{\mathcal{X}}|w||{\bf a}\cdot\mathbf{x}_{-1}({\bf z})-{\bf a}\cdot\iota(\mathbf{x}_{-1}^{N}({\bf z}))|\mu(dw,d{\bf a},d{\bf c},db)
Lσ2𝒳|w|𝐚pμ(dw,d𝐚,d𝐜,db)𝐱1(𝐳)ι(𝐱1N(𝐳))q\displaystyle\leq L_{\sigma_{2}}\int_{\mathcal{X}}|w|\|{\bf a}\|_{p}\mu(dw,d{\bf a},d{\bf c},db)\|\mathbf{x}_{-1}({\bf z})-\iota(\mathbf{x}_{-1}^{N}({\bf z}))\|_{q}
Cfink=0(Lσ1|A|)k(i=N+1|x1k,i|q)1/q\displaystyle\leq C_{\mathrm{fin}}\sum_{k=0}^{\infty}(L_{\sigma_{1}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{k}\left(\sum_{i=N+1}^{\infty}|{x}_{-1-k,i}|^{q}\right)^{1/q}

with Cfin=Lσ2𝒳|w|𝐚pμ(dw,d𝐚,d𝐜,db)C_{\mathrm{fin}}=L_{\sigma_{2}}\int_{\mathcal{X}}|w|\|{\bf a}\|_{p}\mu(dw,d{\bf a},d{\bf c},db).

It remains to be shown that $H^{N}\in\mathcal{C}_{N}$. Denote now, with a slight abuse of notation, by $\pi\colon\ell^{p}\to\mathbb{R}^{N}$ the projection onto the first $N$ components and let $\phi\colon\mathcal{X}\to\mathcal{X}^{N}$ be the product map $\phi=\mathrm{Id}_{\mathbb{R}}\times\pi\times\mathrm{Id}_{\mathbb{R}^{d}}\times\mathrm{Id}_{\mathbb{R}}$, that is, $\phi(w,{\bf a},{\bf c},b)=(w,(a_{i})_{i=1,\ldots,N},{\bf c},b)$. Denote by $\mu^{N}=\mu\circ\phi^{-1}$ the pushforward measure of $\mu$ under $\phi$. Then $\mu^{N}$ is a (Borel) probability measure on $\mathcal{X}^{N}$ and in part (i) we showed that for each ${\bf z}\in\mathcal{I}_{d}$ the system (3.1) admits a unique solution. In addition, the integral transformation theorem implies that $\mu^{N}$ satisfies

𝒳N|w|(𝐚+𝐜+|b|)μN(dw,d𝐚,d𝐜,db)=𝒳|w|(π(𝐚)+𝐜+|b|)μ(dw,d𝐚,d𝐜,db)<\int_{\mathcal{X}^{N}}|w|(\|{\bf a}\|+\|{\bf c}\|+|b|)\mu^{N}(dw,d{\bf a},d{\bf c},db)=\int_{\mathcal{X}}|w|(\|\pi({\bf a})\|+\|{\bf c}\|+|b|)\mu(dw,d{\bf a},d{\bf c},db)<\infty

and, writing 𝐱tN=𝐱tN(𝐳)\mathbf{x}_{t}^{N}=\mathbf{x}_{t}^{N}({\bf z}),

HN(𝐳)\displaystyle H^{N}({\bf z}) =𝒳wσ2(𝐚ι(𝐱1N(𝐳))+𝐜𝐳0+b)μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot\iota(\mathbf{x}_{-1}^{N}({\bf z}))+{\bf c}\cdot{\bf z}_{0}+b)\mu(dw,d{\bf a},d{\bf c},db) (3.8)
=𝒳wσ2(π(𝐚)𝐱1N(𝐳)+𝐜𝐳0+b)μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}w\sigma_{2}(\pi({\bf a})\cdot\mathbf{x}_{-1}^{N}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu(dw,d{\bf a},d{\bf c},db)
=𝒳Nwσ2(𝐚𝐱1N(𝐳)+𝐜𝐳0+b)μN(dw,d𝐚,d𝐜,db),𝐳d.\displaystyle=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in\mathcal{I}_{d}.

This proves that HN𝒞NH^{N}\in\mathcal{C}_{N}. ∎
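
Although Proposition 3.1 is stated in an infinite-dimensional setting, its mechanism can be illustrated numerically. The following Python sketch is an illustration only (not part of the formal development): it replaces $\ell^{q}$ by a large but finite ambient dimension $M$, imposes a coordinate decay on $A$, $C$, and ${\bf B}$ so that the tail sums in (3.3) are small, and compares the state of the full system with that of its $N$-dimensional truncation (3.1). The chosen dimensions, decay rate, and random seeds are arbitrary assumptions.

\begin{verbatim}
import numpy as np

# Finite-dimensional surrogate for the truncation in Proposition 3.1: the state
# space l^q is replaced by R^M with M large, and A, C, B have coordinate decay
# so that the tail sums in (3.3) are small.  All values are illustrative.
rng = np.random.default_rng(0)
M, d, T_len = 400, 2, 200                      # surrogate dimension, input dim, time steps

decay = 0.7 ** np.arange(M)                    # assumed coordinate decay profile
A = decay[:, None] * rng.standard_normal((M, M)) / np.sqrt(M)
A *= 0.9 / np.linalg.norm(A, 2)                # enforce a contraction (sigma_1 = id)
C = decay[:, None] * rng.standard_normal((M, d))
B = decay * rng.standard_normal(M)
Z = rng.uniform(-1.0, 1.0, size=(T_len, d))    # bounded input path z_t

def final_state(A_, C_, B_, inputs):
    x = np.zeros(A_.shape[0])
    for z in inputs:                           # x_t = A x_{t-1} + C z_t + B
        x = A_ @ x + C_ @ z + B_
    return x

x_full = final_state(A, C, B, Z)
for N in (10, 50, 100, 200):
    x_trunc = final_state(A[:N, :N], C[:N], B[:N], Z)   # truncated system (3.1)
    err = np.linalg.norm(x_full[:N] - x_trunc) + np.linalg.norm(x_full[N:])
    print(f"N = {N:3d}   truncation error (Euclidean proxy) = {err:.2e}")
\end{verbatim}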

3.2 Normalized realizations

The next result shows that a large class of elements in the concept class 𝒞\mathcal{C} can be transformed into what we call a normalized realization in which the solution map for (2.1) is a multiplicative scaling operator.

Proposition 3.2.

Assume σ1(x)=x\sigma_{1}(x)=x, p(1,)p\in(1,\infty) and DdD_{d} is bounded. Suppose H𝒞H\in\mathcal{C} has a realization of the type (2.1)-(2.2) with CC bounded and bounded A:qqA\colon\ell^{q}\to\ell^{q} satisfying ρ(A)<1\rho(A)<1, which guarantees the existence of a positive integer k0k_{0}\in\mathbb{N} such that |Ak|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1, for all kk0k\geq k_{0}. Fix λ(|Ak0|1/k0,1)\lambda\in\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{1/k_{0}},1\right). Then there exists a (Borel) probability measure μ¯\bar{\mu} on 𝒳\mathcal{X} satisfying 𝒳|w|(𝐚p+𝐜+|b|)μ¯(dw,d𝐚,d𝐜,db)<\int_{\mathcal{X}}|w|(\|{\bf a}\|_{p}+\|{\bf c}\|+|b|)\bar{\mu}(dw,d{\bf a},d{\bf c},db)<\infty such that HH has the alternative representation

H(𝐳)=𝒳wσ2(𝐚(Λ¯𝐳)+𝐜𝐳0+b)μ¯(dw,d𝐚,d𝐜,db),𝐳d,H({\bf z})=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot(\bar{\Lambda}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\bar{\mu}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in\mathcal{I}_{d}, (3.9)

where Λ¯:(Dd)q\bar{\Lambda}\colon(D_{d})^{\mathbb{Z}_{-}}\to\ell^{q}, (Λ¯𝐳)d(k1)+j=λk1zk,j(\bar{\Lambda}{\bf z})_{d(k-1)+j}=\lambda^{k-1}z_{-k,j} for kk\in\mathbb{N}, j=1,,dj=1,\ldots,d.

Remark 3.3.

The assumption that DdD_{d} is bounded in the previous statement could be relaxed to the requirement that k𝐳kλk<\sum_{k\in\mathbb{N}}\|{\bf z}_{-k}\|\lambda^{k}<\infty.
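
The operator $\bar{\Lambda}$ in Proposition 3.2 is simply a weighted lag map: it stacks the past inputs ${\bf z}_{-1},{\bf z}_{-2},\ldots$ and damps the $k$-th lag by $\lambda^{k-1}$. The following minimal Python sketch computes the first $N$ coordinates of $\bar{\Lambda}{\bf z}$ for a finite input history; all names and numerical values are illustrative.

\begin{verbatim}
import numpy as np

# First N coordinates of the weighted lag map Lambda_bar from Proposition 3.2:
# (Lambda_bar z)_{d(k-1)+j} = lambda^{k-1} z_{-k,j}.  Names/values illustrative.
def normalized_state(z_history, lam, N):
    """z_history[k-1] holds z_{-k} in R^d; returns the first N coordinates."""
    d = z_history.shape[1]
    out = np.zeros(N)
    for k in range(1, z_history.shape[0] + 1):
        for j in range(d):                     # j = 0,...,d-1 (0-based coordinate index)
            idx = d * (k - 1) + j
            if idx < N:
                out[idx] = lam ** (k - 1) * z_history[k - 1, j]
    return out

rng = np.random.default_rng(1)
z_hist = rng.uniform(-1.0, 1.0, size=(6, 2))   # z_{-1},...,z_{-6} in R^2
print(normalized_state(z_hist, lam=0.8, N=8))
\end{verbatim}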

Proof.

First, we recall that by Proposition 2.2, the hypotheses in the statement guarantee that the system (2.1) admits a unique solution (𝐱t(𝐳))t(q)(\mathbf{x}_{t}({\bf z}))_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q}) for each 𝐳d{\bf z}\in\mathcal{I}_{d}, and that 𝐱t(𝐳)\mathbf{x}_{t}({\bf z}) is explicitly given by the expression (2.3). The existence of a positive integer k0k_{0}\in\mathbb{N} such that |Ak|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 for all kk0k\geq k_{0} is guaranteed by Gelfand’s formula for the expression of the spectral radius ρ(A)\rho(A).

Next, we define A~=λ1A\tilde{A}=\lambda^{-1}A and note that the choice λ(|Ak0|1/k0,1)\lambda\in\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{1/k_{0}},1\right) guarantees that |A~k0|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1. Moreover, for each k0k\in\mathbb{N}_{0} the map A~kC:dq\tilde{A}^{k}C\colon\mathbb{R}^{d}\to\ell^{q} is a bounded linear map between two Banach spaces and hence its adjoint is a bounded linear map from (q)(d)(\ell^{q})^{*}\to(\mathbb{R}^{d})^{*} which we can identify with (A~kC):pd(\tilde{A}^{k}C)^{*}\colon\ell^{p}\to\mathbb{R}^{d}. More specifically, for y(q),𝐮dy\in(\ell^{q})^{*},{\bf u}\in\mathbb{R}^{d} the adjoint is defined by [(A~kC)(y)](𝐮)=[y(A~kC)](𝐮)=y(A~kC𝐮)[(\tilde{A}^{k}C)^{*}(y)]({\bf u})=[y\circ(\tilde{A}^{k}C)]({\bf u})=y(\tilde{A}^{k}C{\bf u}) and so with the identification y(𝐛)=𝐚𝐲𝐛y({\bf b})={\bf a_{y}}\cdot{\bf b} for some 𝐚𝐲p{\bf a_{y}}\in\ell^{p} and all 𝐛q{\bf b}\in\ell^{q} we obtain that the adjoint property translates to (A~kC)(𝐚𝐲)𝐮=𝐚𝐲(A~kC)𝐮(\tilde{A}^{k}C)^{*}({\bf a_{y}})\cdot{\bf u}={\bf a_{y}}\cdot(\tilde{A}^{k}C){\bf u}.

Thus, we may consider the map :p((d)0)\mathcal{L}\colon\ell^{p}\to((\mathbb{R}^{d})^{\mathbb{N}_{0}}) given by [(𝐚)]k=(A~kC)𝐚[\mathcal{L}({\bf a})]_{k}=(\tilde{A}^{k}C)^{*}{\bf a} for k0k\in\mathbb{N}_{0}, 𝐚p{\bf a}\in\ell^{p}. We claim that the image (p)\mathcal{L}(\ell^{p}) can be identified with a subset of p\ell^{p}. Indeed,

k0[(𝐚)]kp=k0(A~kC)𝐚pk0|(A~kC)|p𝐚pp=k0|A~kC|p𝐚pp(|C|𝐚p)pk0|A~k|p(|C|𝐚p)pj0i=0k01|A~k0|jp|A~i|p=(|C|𝐚p)pi=0k01|A~i|p1|A~k0|p<,\!\!\!\!\!\!\sum_{k\in\mathbb{N}_{0}}\|[\mathcal{L}({\bf a})]_{k}\|^{p}=\sum_{k\in\mathbb{N}_{0}}\|(\tilde{A}^{k}C)^{*}{\bf a}\|^{p}\leq\sum_{k\in\mathbb{N}_{0}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|(\tilde{A}^{k}C)^{*}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}\|{\bf a}\|_{p}^{p}=\sum_{k\in\mathbb{N}_{0}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{k}C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}\|{\bf a}\|_{p}^{p}\leq({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\|{\bf a}\|_{p})^{p}\sum_{k\in\mathbb{N}_{0}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{k}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}\\ \leq({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\|{\bf a}\|_{p})^{p}\sum_{j\in\mathbb{N}_{0}}\sum_{i=0}^{k_{0}-1}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{jp}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}=({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\|{\bf a}\|_{p})^{p}\sum_{i=0}^{k_{0}-1}\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}}<\infty, (3.10)

since |A~k0|=λk0|Ak0|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}=\lambda^{-k_{0}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1. Thus, the map ¯:pp\bar{\mathcal{L}}\colon\ell^{p}\to\ell^{p} given by [¯(𝐚)]d(k1)+j=[(A~k1C)𝐚]j[\bar{\mathcal{L}}({\bf a})]_{d(k-1)+j}=[(\tilde{A}^{k-1}C)^{*}{\bf a}]_{j} for kk\in\mathbb{N}, j=1,,dj=1,\ldots,d is well-defined and from (3.10) we deduce that

\|\bar{\mathcal{L}}({\bf a})\|_{p}=\left(\sum_{k\in\mathbb{N}}\sum_{j=1}^{d}\left|[(\tilde{A}^{k-1}C)^{*}{\bf a}]_{j}\right|^{p}\right)^{1/p}=\left(\sum_{k\in\mathbb{N}_{0}}\|[\mathcal{L}({\bf a})]_{k}\|_{p}^{p}\right)^{1/p}\leq\left({d}\sum_{k\in\mathbb{N}_{0}}\|[\mathcal{L}({\bf a})]_{k}\|^{p}\right)^{1/p}\leq c_{0}\|{\bf a}\|_{p}, (3.11)

where c0=d1/p|C|(i=0k01|A~i|p1|A~k0|p)1/pc_{0}=d^{1/p}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\left(\sum_{i=0}^{k_{0}-1}\frac{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\tilde{A}^{i}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{p}}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\tilde{A}^{k_{0}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{p}}\right)^{1/p} and we have used that p\|\cdot\|_{p}\leq\|\cdot\| for p2p\geq 2 and pd1p12\|\cdot\|_{p}\leq d^{\frac{1}{p}-\frac{1}{2}}\|\cdot\| for p[1,2]p\in[1,2]. We may now use (2.3) to rewrite 𝐚𝐱1(𝐳){\bf a}\cdot\mathbf{x}_{-1}({\bf z}) as

𝐚𝐱1(𝐳)=𝐚𝐁¯0+iaik=0(AkC𝐳1k)i=𝐚𝐁¯0+k=0𝐚(A~kC𝐳1kλk),{\bf a}\cdot\mathbf{x}_{-1}({\bf z})={\bf a}\cdot\bar{\bf B}_{0}+\sum_{i\in\mathbb{N}}a_{i}\sum_{k=0}^{\infty}({A}^{k}C{\bf z}_{-1-k})_{i}={\bf a}\cdot\bar{\bf B}_{0}+\sum_{k=0}^{\infty}{\bf a}\cdot(\tilde{A}^{k}C{\bf z}_{-1-k}\lambda^{k}), (3.12)

where 𝐁¯0=k=0Ak𝐁q\bar{\bf B}_{0}=\sum_{k=0}^{\infty}A^{k}{\bf B}\in\ell^{q} and the series can be interchanged by Fubini’s theorem, as we see by estimating

\sum_{k=0}^{\infty}\sum_{i\in\mathbb{N}}|a_{i}||(A^{k}C{\bf z}_{-1-k})_{i}|\leq\sum_{k=0}^{\infty}\|\mathbf{a}\|_{p}\|{A}^{k}C{\bf z}_{-1-k}\|_{q}\leq\sum_{k=0}^{\infty}\|\mathbf{a}\|_{p}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\|{\bf z}_{-1-k}\|_{q}\\ \leq\|\mathbf{a}\|_{p}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\Big(\sup_{t\in\mathbb{Z}_{-}}\|{\bf z}_{t}\|_{q}\Big)\sum_{j\in\mathbb{N}_{0}}\sum_{i=0}^{k_{0}-1}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{j}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<\infty,

since |Ak0|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and DdD_{d} is bounded. The second term in (3.12) can be rewritten as

k=0𝐚(A~kC𝐳1kλk)\displaystyle\sum_{k=0}^{\infty}{\bf a}\cdot(\tilde{A}^{k}C{\bf z}_{-1-k}\lambda^{k}) =k=0((A~kC)𝐚)(𝐳1kλk)=k=0j=1d[(A~kC)𝐚]j[z1k]jλk\displaystyle=\sum_{k=0}^{\infty}((\tilde{A}^{k}C)^{*}{\bf a})\cdot({\bf z}_{-1-k}\lambda^{k})=\sum_{k=0}^{\infty}\sum_{j=1}^{d}[(\tilde{A}^{k}C)^{*}{\bf a}]_{j}[{z}_{-1-k}]_{j}\lambda^{k} (3.13)
=k=1j=1d[(A~k1C)𝐚]j[zk]jλk1=k=1j=1d[¯(𝐚)]d(k1)+j(Λ¯𝐳)d(k1)+j\displaystyle=\sum_{k=1}^{\infty}\sum_{j=1}^{d}[(\tilde{A}^{k-1}C)^{*}{\bf a}]_{j}[{z}_{-k}]_{j}\lambda^{k-1}=\sum_{k=1}^{\infty}\sum_{j=1}^{d}[\bar{\mathcal{L}}({\bf a})]_{d(k-1)+j}(\bar{\Lambda}{\bf z})_{d(k-1)+j}
=¯(𝐚)(Λ¯𝐳).\displaystyle=\bar{\mathcal{L}}({\bf a})\cdot(\bar{\Lambda}{\bf z}).

Combining (3.13) and (3.12) and inserting in (2.2) then yields

H(𝐳)=𝒳wσ2(𝐚𝐁¯0+¯(𝐚)(Λ¯𝐳)+𝐜𝐳0+b)μ(dw,d𝐚,d𝐜,db),𝐳d.\displaystyle H({\bf z})=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot\bar{\bf B}_{0}+\bar{\mathcal{L}}({\bf a})\cdot(\bar{\Lambda}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in\mathcal{I}_{d}. (3.14)

Let ϕ¯:𝒳𝒳\bar{\phi}\colon\mathcal{X}\to\mathcal{X} be given as ϕ¯(w,𝐚,𝐜,b)=(w,¯(𝐚),𝐜,𝐚𝐁¯0+b)\bar{\phi}(w,{\bf a},{\bf c},b)=(w,\bar{\mathcal{L}}({\bf a}),{\bf c},{\bf a}\cdot\bar{\bf B}_{0}+b) and denote by μ¯=μϕ¯1\bar{\mu}=\mu\circ\bar{\phi}^{-1} the pushforward measure of μ\mu under ϕ¯\bar{\phi}. Then the change-of-variables formula and (3.11) show that

Iμ¯,p\displaystyle I_{\bar{\mu},p} =𝒳|w|(¯(𝐚)p+𝐜+|𝐚𝐁¯0+b|)μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}|w|(\|\bar{\mathcal{L}}({\bf a})\|_{p}+\|{\bf c}\|+|{\bf a}\cdot\bar{\bf B}_{0}+b|)\mu(dw,d{\bf a},d{\bf c},db)
𝒳|w|(c0𝐚p+𝐜+𝐚p𝐁¯0q+|b|)μ(dw,d𝐚,d𝐜,db)\displaystyle\leq\int_{\mathcal{X}}|w|(c_{0}\|{\bf a}\|_{p}+\|{\bf c}\|+\|{\bf a}\|_{p}\|\bar{\bf B}_{0}\|_{q}+|b|)\mu(dw,d{\bf a},d{\bf c},db)
max(c0+𝐁¯0q,1)Iμ,p<\displaystyle\leq\max(c_{0}+\|\bar{\bf B}_{0}\|_{q},1)I_{\mu,p}<\infty

and similarly we obtain from (3.14) that for any 𝐳d{\bf z}\in\mathcal{I}_{d},

H(𝐳)\displaystyle H({\bf z}) =𝒳wσ2(𝐚𝐁¯0+¯(𝐚)(Λ¯𝐳)+𝐜𝐳0+b)μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot\bar{\bf B}_{0}+\bar{\mathcal{L}}({\bf a})\cdot(\bar{\Lambda}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu(dw,d{\bf a},d{\bf c},db) (3.15)
=𝒳wσ2(𝐚(Λ¯𝐳)+𝐜𝐳0+b)μ¯(dw,d𝐚,d𝐜,db).\displaystyle=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot(\bar{\Lambda}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\bar{\mu}(dw,d{\bf a},d{\bf c},db).

which is precisely the representation (3.9), with the measure $\bar{\mu}$ constructed above satisfying the required integrability condition. ∎

3.3 Echo state network approximation of normalized, truncated functionals

Consider now the setting of reservoir computing, where we aim to approximate an unknown element in our class 𝒞\mathcal{C} using a finite-dimensional ELM-ESN pair, that is, a randomly generated echo state network-like state equation with a neural network as readout in which only the output layer is trained. As a first step, we consider in this section the situation when the target functional is in the class 𝒞N\mathcal{C}_{N} of finite-dimensional recurrent Barron functionals.

More specifically, consider the echo state network

𝐱tESN=σ1(AESN𝐱t1ESN+CESN𝐳t+𝐁ESN),t,\mathbf{x}_{t}^{\mathrm{ESN}}=\sigma_{1}({A}^{\mathrm{ESN}}\mathbf{x}_{t-1}^{\mathrm{ESN}}+{C}^{\mathrm{ESN}}{\bf z}_{t}+{\bf B}^{\mathrm{ESN}}),\quad t\in\mathbb{Z}_{-}, (3.16)

with randomly generated matrices AESN𝕄N{A}^{\mathrm{ESN}}\in\mathbb{M}_{N}, CESN𝕄N,d{C}^{\mathrm{ESN}}\in\mathbb{M}_{N,d}, and 𝐁ESNN{\bf B}^{\mathrm{ESN}}\in\mathbb{R}^{N}, which is used to capture the dynamics and then plugged into a random neural network readout. Thus, the element in 𝒞N\mathcal{C}_{N} is approximated by

H^(𝐳)=i=1Nw(i)σ2(𝐚(i)𝐱1ESN(𝐳)+𝐜(i)𝐳0+bi)\widehat{H}({\bf z})=\sum_{i=1}^{N}w^{(i)}\sigma_{2}({\bf a}^{(i)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf z})+{\bf c}^{(i)}\cdot{\bf z}_{0}+b_{i}) (3.17)

with randomly generated coefficients 𝐚(i){\bf a}^{(i)}, 𝐜(i){\bf c}^{(i)}, bib_{i} valued in N\mathbb{R}^{N}, d\mathbb{R}^{d}, and \mathbb{R}, respectively, and where 𝐖=(w(1),,w(N))N{\bf W}=\left(w^{(1)},\ldots,w^{(N)}\right)^{\top}\in\mathbb{R}^{N} is trainable.
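
The following Python sketch illustrates the forward pass of the ESN-ELM pair (3.16)-(3.17): the reservoir matrices and the ELM hidden weights are drawn at random once and frozen, and only the output vector ${\bf W}$ enters linearly. The Gaussian sampling, the choice $\sigma_{2}=\tanh$, and all dimensions are illustrative assumptions, not prescriptions of the results below.

\begin{verbatim}
import numpy as np

# Forward pass of the ESN-ELM pair (3.16)-(3.17): random, frozen reservoir and
# ELM hidden layer; only the output vector W is trainable.  Gaussian sampling,
# sigma_2 = tanh and all dimensions are illustrative assumptions.
rng = np.random.default_rng(0)
N, d = 50, 3                                   # reservoir dimension, input dimension

A_esn = rng.standard_normal((N, N))
A_esn *= 0.8 / np.linalg.norm(A_esn, 2)        # enforce |||A_esn||| < 1
C_esn = rng.standard_normal((N, d))
B_esn = rng.standard_normal(N)

a = rng.standard_normal((N, N))                # ELM hidden weights a^(i) (rows)
c = rng.standard_normal((N, d))                # ELM hidden weights c^(i) (rows)
b = rng.standard_normal(N)                     # ELM hidden biases b_i
sigma2 = np.tanh                               # Lipschitz readout activation (example)

def esn_elm(z_path, W, sigma1=lambda x: x):
    """z_path contains z_{-T+1},...,z_0; returns the scalar output H_hat(z)."""
    x = np.zeros(N)
    for z in z_path[:-1]:                      # state recursion (3.16) up to time -1
        x = sigma1(A_esn @ x + C_esn @ z + B_esn)
    features = sigma2(a @ x + c @ z_path[-1] + b)
    return W @ features                        # readout (3.17); only W is trained

z_path = rng.uniform(-1.0, 1.0, size=(30, d))
W = rng.standard_normal(N)                     # untrained output layer, for illustration
print(esn_elm(z_path, W))
\end{verbatim}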

Let $T=\left\lceil\frac{N}{d}\right\rceil$ and define the Kalman controllability matrix

K=πN×N(CESN|AESNCESN||(AESN)T2CESN|(AESN)T1CESN),K=\pi_{N\times N}({C}^{\mathrm{ESN}}|{A}^{\mathrm{ESN}}{C}^{\mathrm{ESN}}|\cdots|({A}^{\mathrm{ESN}})^{T-2}{C}^{\mathrm{ESN}}|({A}^{\mathrm{ESN}})^{T-1}{C}^{\mathrm{ESN}}), (3.18)

where, in case T>NdT>\frac{N}{d}, the map πN×N\pi_{N\times N} removes the last dTNdT-N columns.
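
As a quick illustration (with randomly drawn matrices and sizes that are purely illustrative), the matrix $K$ in (3.18) can be assembled as follows.

\begin{verbatim}
import numpy as np

# Assembling the Kalman controllability matrix (3.18).  A_esn and C_esn are
# drawn at random purely for illustration; when T > N/d, only the first N
# columns of [C | AC | ... | A^{T-1}C] are kept.
rng = np.random.default_rng(0)
N, d = 50, 3
A_esn = rng.standard_normal((N, N))
A_esn *= 0.8 / np.linalg.norm(A_esn, 2)
C_esn = rng.standard_normal((N, d))

T = int(np.ceil(N / d))
blocks = [np.linalg.matrix_power(A_esn, k) @ C_esn for k in range(T)]
K = np.hstack(blocks)[:, :N]                   # pi_{N x N}: drop the last dT - N columns
print(K.shape, np.linalg.cond(K))              # square and, generically, invertible
\end{verbatim}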

Proposition 3.4.

Suppose HN:(Dd)H^{N}\colon(D_{d})^{\mathbb{Z}_{-}}\to\mathbb{R} admits a normalized realization of the type

HN(𝐳)=𝒳Nwσ2(𝐚(ΛN𝐳)+𝐜𝐳0+b)μN(dw,d𝐚,d𝐜,db),𝐳(Dd),H^{N}({\bf z})=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot(\Lambda_{N}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in(D_{d})^{\mathbb{Z}_{-}}, (3.19)

for some μN\mu^{N} on 𝒳N\mathcal{X}^{N} with IμN=𝒳N|w|(𝐚+𝐜+|b|)μN(dw,d𝐚,d𝐜,db)<I_{\mu^{N}}=\int_{\mathcal{X}^{N}}|w|(\|{\bf a}\|+\|{\bf c}\|+|b|)\mu^{N}(dw,d{\bf a},d{\bf c},db)<\infty and where ΛN:(Dd)N\Lambda_{N}\colon(D_{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{N} satisfies (ΛN𝐳)d(k1)+j=λk1(zk)j(\Lambda_{N}{\bf z})_{d(k-1)+j}=\lambda^{k-1}(z_{-k})_{j} for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq N. Assume that σ1(x)=x\sigma_{1}(x)=x, DdD_{d} is bounded, and consider an arbitrary system of the type (3.16) such that ρ(AESN)<1\rho\left({A}^{\mathrm{ESN}}\right)<1, and that the matrix KK in (3.18) is invertible. Then there exists a Borel measure μ~N\tilde{\mu}^{N} on 𝒳N\mathcal{X}^{N} such that Iμ~N<I_{\tilde{\mu}^{N}}<\infty and the mapping

H¯N(𝐳)=𝒳Nwσ2(𝐚𝐱1ESN(𝐳)+𝐜𝐳0+b)μ~N(dw,d𝐚,d𝐜,db),𝐳(Dd)\bar{H}^{N}({\bf z})=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in(D_{d})^{\mathbb{Z}_{-}} (3.20)

satisfies for all 𝐳(Dd){\bf z}\in(D_{d})^{\mathbb{Z}_{-}} the bound

|HN(𝐳)H¯N(𝐳)|\displaystyle|H^{N}({\bf z})-\bar{H}^{N}({\bf z})| Lσ2(|CESN|M+𝐁ESN)|(AESN)k0|Tk0|AESN|k01\displaystyle\leq L_{\sigma_{2}}\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|\right){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\lfloor\frac{T}{k_{0}}\rfloor}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{k_{0}-1}
×(i=0k01|(AESN)i|1|(AESN)k0|)|KΛ|𝒳N|w|𝐚μN(dw,d𝐚,d𝐜,db),\displaystyle\times\left(\sum_{i=0}^{k_{0}-1}\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}\right){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\int_{\mathcal{X}^{N}}|w|\|{\bf a}\|\mu^{N}(dw,d{\bf a},d{\bf c},db), (3.21)

where k0k_{0}\in\mathbb{N} is the smallest integer such that |(AESN)k0|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1, M>0M>0 is chosen such that 𝐮M\|{\bf u}\|\leq M for all 𝐮Dd{\bf u}\in D_{d}, and ΛN×N\Lambda\in\mathbb{R}^{N\times N} is the diagonal matrix with entries Λd(k1)+j,d(k1)+j=λk1\Lambda_{d(k-1)+j,d(k-1)+j}=\lambda^{k-1} for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq N.

Remark 3.5.

A particular case of the previous statement is when AESN{A}^{\mathrm{ESN}} is chosen such that |AESN|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1. In that situation, instead of the bound (3.21) one obtains

|HN(𝐳)H¯N(𝐳)|Lσ2(|CESN|M+𝐁ESN)|AESN|T1|AESN||KΛ|𝒳N|w|𝐚μN(dw,d𝐚,d𝐜,db).|H^{N}({\bf z})-\bar{H}^{N}({\bf z})|\leq L_{\sigma_{2}}\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|\right)\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}\ {\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\int_{\mathcal{X}^{N}}|w|\|{\bf a}\|\mu^{N}(dw,d{\bf a},d{\bf c},db). (3.22)
Remark 3.6.

It can be shown (see [Grig 21a, Proposition 4.4]) that if AESN{A}^{\mathrm{ESN}} and CESN{C}^{\mathrm{ESN}} are randomly drawn using regular random variables with values in the corresponding spaces, then the hypothesis on the invertibility of KK holds almost surely.
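
A simple numerical sanity check of Remark 3.6, taking Gaussian entries as one example of such random draws (this is an illustration, not a proof), is the following.

\begin{verbatim}
import numpy as np

# Sanity check of Remark 3.6 with Gaussian draws (one example of a "regular"
# distribution): across many trials the controllability matrix K has full rank.
rng = np.random.default_rng(2)
N, d = 20, 3
T = int(np.ceil(N / d))
rank_deficient = 0
for _ in range(1000):
    A = rng.standard_normal((N, N)) / np.sqrt(N)
    C = rng.standard_normal((N, d))
    K = np.hstack([np.linalg.matrix_power(A, k) @ C for k in range(T)])[:, :N]
    if np.linalg.matrix_rank(K) < N:
        rank_deficient += 1
print("rank-deficient draws out of 1000:", rank_deficient)   # expected: 0
\end{verbatim}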

Proof.

First notice that by our hypotheses and (2.3), the equation (3.16) has a unique solution in (N)\ell^{\infty}_{-}(\mathbb{R}^{N}) given by

𝐱tESN=k=0(AESN)k(CESN𝐳tk+𝐁ESN),for any t.\mathbf{x}_{t}^{\mathrm{ESN}}=\sum_{k=0}^{\infty}({A}^{\mathrm{ESN}})^{k}({C}^{\mathrm{ESN}}{\bf z}_{t-k}+{\bf B}^{\mathrm{ESN}}),\quad\mbox{for any $t\in\mathbb{Z}_{-}$.} (3.23)

Let 𝐱tN=k=0T1(AESN)k(CESN𝐳tk+𝐁ESN)\mathbf{x}_{t}^{N}=\sum_{k=0}^{T-1}({A}^{\mathrm{ESN}})^{k}({C}^{\mathrm{ESN}}{\bf z}_{t-k}+{\bf B}^{\mathrm{ESN}}). Then

𝐱tESN𝐱tN\displaystyle\|\mathbf{x}_{t}^{\mathrm{ESN}}-\mathbf{x}_{t}^{N}\| k=T|(AESN)k|(|CESN|𝐳tk+𝐁ESN)\displaystyle\leq\sum_{k=T}^{\infty}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\|{\bf z}_{t-k}\|+\|{\bf B}^{\mathrm{ESN}}\|)
(|CESN|M+𝐁ESN)|(AESN)T|k=0|(AESN)k|\displaystyle\leq({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{T}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\sum_{k=0}^{\infty}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}
(|CESN|M+𝐁ESN)|(AESN)k0|Tk0|AESN|k01(i=0k01|(AESN)i|1|(AESN)k0|).\displaystyle\leq({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\lfloor\frac{T}{k_{0}}\rfloor}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{k_{0}-1}\left(\sum_{i=0}^{k_{0}-1}\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}\right). (3.24)

In order to obtain the last inequality in the previous expression, we decomposed T=Tk0k0+rT=\lfloor\frac{T}{k_{0}}\rfloor k_{0}+r, with 0r<k00\leq r<k_{0}, which allowed us to write

|(AESN)T|=|(AESN)Tk0k0+r||(AESN)k0|Tk0|(AESN)r||(AESN)k0|Tk0|AESN|k01.{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{T}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{\lfloor\frac{T}{k_{0}}\rfloor k_{0}+r}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\lfloor\frac{T}{k_{0}}\rfloor}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{r}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\lfloor\frac{T}{k_{0}}\rfloor}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{k_{0}-1}.

In the last inequality we used the fact that k0k_{0}\in\mathbb{N} is the smallest integer such that |(AESN)k0|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and hence |(AESN)r|1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{r}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\geq 1 necessarily if r<k0r<k_{0}.

Denote by 𝐳~N\tilde{{\bf z}}\in\mathbb{R}^{N} the vector with entries 𝐳~d(k1)+j=(zk)j\tilde{\bf z}_{d(k-1)+j}=(z_{-k})_{j} and by ΛN×N\Lambda\in\mathbb{R}^{N\times N} the diagonal matrix such that Λd(k1)+j,d(k1)+j=λk1\Lambda_{d(k-1)+j,d(k-1)+j}=\lambda^{k-1}, for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq N. Then 𝐱1N=k=1T(AESN)k1(CESN𝐳k+𝐁ESN)=K𝐳~+𝐁~\mathbf{x}_{-1}^{N}=\sum_{k=1}^{T}({A}^{\mathrm{ESN}})^{k-1}({C}^{\mathrm{ESN}}{\bf z}_{-k}+{\bf B}^{\mathrm{ESN}})=K\tilde{{\bf z}}+\tilde{\bf B}, where 𝐁~=k=1T(AESN)k1𝐁ESN\tilde{\bf B}=\sum_{k=1}^{T}({A}^{\mathrm{ESN}})^{k-1}{\bf B}^{\mathrm{ESN}}. Let ϕ~:𝒳N𝒳N\tilde{\phi}\colon\mathcal{X}^{N}\to\mathcal{X}^{N} be given as ϕ~(w,𝐚,𝐜,b)=(w,KΛ𝐚,𝐜,(KΛ𝐚)𝐁~+b)\tilde{\phi}(w,{\bf a},{\bf c},b)=(w,K^{-\top}\Lambda{\bf a},{\bf c},-(K^{-\top}\Lambda{\bf a})\cdot\tilde{\bf B}+b) and denote by μ~N=μNϕ~1\tilde{\mu}^{N}=\mu^{N}\circ\tilde{\phi}^{-1} the pushforward measure of μN\mu^{N} under ϕ~\tilde{\phi}. Then the change-of-variables formula yields that

Iμ~N\displaystyle I_{\tilde{\mu}^{N}} =𝒳N|w|(KΛ𝐚+𝐜+|(KΛ𝐚)𝐁~+b|)μN(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{N}}|w|(\|K^{-\top}\Lambda{\bf a}\|+\|{\bf c}\|+|-(K^{-\top}\Lambda{\bf a})\cdot\tilde{\bf B}+b|)\mu^{N}(dw,d{\bf a},d{\bf c},db)
max(|KΛ|,1)(1+𝐁~)IμN<\displaystyle\leq\max({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},1)(1+\|\tilde{\bf B}\|)I_{\mu^{N}}<\infty

and

HN(𝐳)\displaystyle H^{N}({\bf z}) =𝒳Nwσ2((Λ𝐚)𝐳~+𝐜𝐳0+b)μN(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{N}}w\sigma_{2}((\Lambda{\bf a})\cdot\tilde{\bf z}+{\bf c}\cdot{\bf z}_{0}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db) (3.25)
=𝒳Nwσ2((KΛ𝐚)K𝐳~+𝐜𝐳0+b)μN(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{N}}w\sigma_{2}((K^{-\top}\Lambda{\bf a})\cdot K\tilde{\bf z}+{\bf c}\cdot{\bf z}_{0}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db)
=𝒳Nwσ2((KΛ𝐚)𝐱1N(KΛ𝐚)𝐁~+𝐜𝐳0+b)μN(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{N}}w\sigma_{2}((K^{-\top}\Lambda{\bf a})\cdot\mathbf{x}_{-1}^{N}-(K^{-\top}\Lambda{\bf a})\cdot\tilde{\bf B}+{\bf c}\cdot{\bf z}_{0}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db)
=𝒳Nwσ2(𝐚𝐱1N+𝐜𝐳0+b)μ~N(dw,d𝐚,d𝐜,db).\displaystyle=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}+{\bf c}\cdot{\bf z}_{0}+b)\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db).

Combining (3.25) and (3.24) then yields

\displaystyle|H^{N}({\bf z})-\bar{H}^{N}({\bf z})| \displaystyle\leq\int_{\mathcal{X}^{N}}|w|\,|\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}+{\bf c}\cdot{\bf z}_{0}+b)-\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)|\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db) (3.26)
\displaystyle\leq L_{\sigma_{2}}\int_{\mathcal{X}^{N}}|w|\,|{\bf a}\cdot(\mathbf{x}_{-1}^{N}-\mathbf{x}_{-1}^{\mathrm{ESN}})|\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db)
\displaystyle\leq\|\mathbf{x}_{-1}^{N}-\mathbf{x}_{-1}^{\mathrm{ESN}}\|L_{\sigma_{2}}\int_{\mathcal{X}^{N}}|w|\,\|K^{-\top}\Lambda{\bf a}\|\mu^{N}(dw,d{\bf a},d{\bf c},db)
\displaystyle\leq L_{\sigma_{2}}({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\lfloor\frac{T}{k_{0}}\rfloor}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{k_{0}-1}
\displaystyle\quad\cdot\left(\sum_{i=0}^{k_{0}-1}\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}\right){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\int_{\mathcal{X}^{N}}|w|\,\|{\bf a}\|\mu^{N}(dw,d{\bf a},d{\bf c},db),

which is exactly the bound (3.21). ∎

3.4 Main approximation result

This section contains our main approximation result, in which we extend the possibility of approximating the elements in the class 𝒞N\mathcal{C}_{N} of finite-dimensional recurrent Barron functionals, using finite-dimensional ELM-ESN pairs, to the full class 𝒞\mathcal{C} of recurrent Barron functionals.

In this case, we shall assume that the random variables used in the construction of the ELM layer (3.17), that is, the initialization w(i)w^{(i)} for the readout weight and the hidden weights 𝐚(i){\bf a}^{(i)}, 𝐜(i){\bf c}^{(i)}, bib_{i}, valued in \mathbb{R}, N\mathbb{R}^{N}, d\mathbb{R}^{d}, and \mathbb{R}, respectively, are such that (w(1),𝐚(1),𝐜(1),b1),,(w(N),𝐚(N),𝐜(N),bN)(w^{(1)},{\bf a}^{(1)},{\bf c}^{(1)},b_{1}),\ldots,(w^{(N)},{\bf a}^{(N)},{\bf c}^{(N)},b_{N}) are independent and identically distributed (IID); denote by ν\nu their common distribution.

Let ΛN×N\Lambda\in\mathbb{R}^{N\times N} be the diagonal matrix introduced in Proposition 3.4, let π:pN\pi\colon\ell^{p}\to\mathbb{R}^{N} be the projection map π(𝐚)=(ai)i=1,,N\pi({\bf a})=(a_{i})_{i=1,\ldots,N} and for given Borel measure μ\mu on 𝒳\mathcal{X}, 𝐁q{\bf B}\in\ell^{q}, bounded linear maps C:dqC\colon\mathbb{R}^{d}\to\ell^{q}, A:qqA\colon\ell^{q}\to\ell^{q} with ρ(A)<1\rho{(A)}<1 let μA,B,C=μψA,B,C1{\mu}_{A,B,C}=\mu\circ\psi_{A,B,C}^{-1}, where ψA,B,C:𝒳𝒳N\psi_{A,B,C}\colon\mathcal{X}\to\mathcal{X}^{N} is the map

ψA,B,C(w,𝐚,𝐜,b)=(w,KΛπ(¯(𝐚)),𝐜,(KΛπ(¯(𝐚)))[k=1T(AESN)k1𝐁ESN]+𝐚[k=0Ak𝐁]+b)\psi_{A,B,C}(w,{\bf a},{\bf c},b)=\Bigg{(}w,K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a})),{\bf c},-(K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a})))\cdot\Bigg{[}\sum_{k=1}^{T}({A}^{\mathrm{ESN}})^{k-1}{\bf B}^{\mathrm{ESN}}\Bigg{]}+{\bf a}\cdot\left[\sum_{k=0}^{\infty}A^{k}{\bf B}\right]+b\Bigg{)}

with $[\bar{\mathcal{L}}({\bf a})]_{d(k-1)+j}=[(\tilde{A}^{k-1}C)^{*}{\bf a}]_{j}$ for $k\in\mathbb{N}$, $j=1,\ldots,d$, where $\tilde{A}=\lambda^{-1}A$ as in the proof of Proposition 3.2. Furthermore, let

Iμ,p(2)=(𝒳w2(𝐚p2+𝐜2+|b|2+1)μ(dw,d𝐚,d𝐜,db))1/2.I_{\mu,p}^{(2)}=\left(\int_{\mathcal{X}}w^{2}\left(\|{\bf a}\|_{p}^{2}+\|{\bf c}\|^{2}+|b|^{2}+1\right)\mu(dw,d{\bf a},d{\bf c},db)\right)^{1/2}.

Using this notation, we shall now formulate a result that shows that the elements $H\in\mathcal{C}$ that admit a representation in which the state equation (2.4) is linear and has the unique solution property can be approximated arbitrarily well by randomly generated ESN-ELM pairs in which only the output layer of the ELM is trained. We provide an approximation bound in which the dependence on the dimension $N$ of the ESN state space and on the dimension $d$ of the input is explicitly spelled out.

Theorem 3.7.

In this statement assume that σ1(x)=x\sigma_{1}(x)=x, p(1,)p\in(1,\infty), and that the input set DdD_{d} is bounded. Let H𝒞H\in\mathcal{C} be such that it has a realization of the type (2.2), with 𝐁q{\bf B}\in\ell^{q}, bounded linear maps C:dqC\colon\mathbb{R}^{d}\to\ell^{q}, A:qqA\colon\ell^{q}\to\ell^{q} satisfying |A|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and Iμ,p(2)<I_{\mu,p}^{(2)}<\infty. Using the notation introduced before this statement, suppose that μA,B,Cν\mu_{A,B,C}\ll\nu and that dμA,B,Cdν{\displaystyle\frac{d\mu_{A,B,C}}{d\nu}} is bounded. Let λ(|A|,1)\lambda\in\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},1\right). Consider now a randomly constructed ESN-ELM pair such that |AESN|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and for which the controllability matrix KK in (3.18) is invertible.

Then, there exists a measurable function f:(𝒳N)NNf\colon(\mathcal{X}^{N})^{N}\to\mathbb{R}^{N} such that the ESN-ELM pair H^\widehat{H} with the same parameters and new readout 𝐖=f((w(i),𝐚(i),𝐜(i),bi)i=1,,N){\bf W}=f((w^{(i)},{\bf a}^{(i)},{\bf c}^{(i)},b_{i})_{i=1,\ldots,N}) satisfies, for any 𝐳d{\bf z}\in\mathcal{I}_{d}, the approximation error bound

𝔼[|H(𝐳)H^(𝐳)|2]1/2\displaystyle\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2} CH,ESN[λNd+|AESN|T+1N12dμA,B,Cdν12]\displaystyle\leq C_{H,\mathrm{ESN}}\left[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}^{\frac{1}{2}}\right] (3.27)

with

CH,ESN=c~1max(|C|(1|λ1A|p)1/p,𝐁q1|A|,1)×max(Iμ,p,Iμ,p(2))(1+|CESN|+𝐁ESN1|AESN||KΛ|)C_{H,\mathrm{ESN}}=\tilde{c}_{1}\max\left(\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{\left(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\lambda^{-1}{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}\right)^{1/p}},\frac{\|{\bf B}\|_{q}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}},1\right)\\ \times\max\left(I_{\mu,p},I_{\mu,p}^{(2)}\right)\cdot\left(1+\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\right) (3.28)

for c~1\tilde{c}_{1} only depending on dd, σ2\sigma_{2}, pp, diam(Dd)\mathrm{diam}(D_{d}) and λ\lambda (see (3.43)).

Proof.

Let Λ¯:(Dd)q\bar{\Lambda}\colon(D_{d})^{\mathbb{Z}_{-}}\to\ell^{q}, (Λ¯𝐳)d(k1)+j=λk1(zk)j(\bar{\Lambda}{\bf z})_{d(k-1)+j}=\lambda^{k-1}(z_{-k})_{j} for kk\in\mathbb{N}, j=1,,dj=1,\ldots,d. Then Proposition 3.2 implies that there exists a (Borel) probability measure μ¯\bar{\mu} on 𝒳\mathcal{X} satisfying Iμ¯,p<I_{\bar{\mu},p}<\infty and such that the normalized representation

H(𝐳)=𝒳wσ2(𝐚(Λ¯𝐳)+𝐜𝐳0+b)μ¯(dw,d𝐚,d𝐜,db),𝐳d,H({\bf z})=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot(\bar{\Lambda}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\bar{\mu}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in\mathcal{I}_{d}, (3.29)

holds. Notice that from the proof of Proposition 2.6 we obtain that (Λ¯𝐳)d(k1)+j=λk1(zk)j=x¯1,d(k1)+j(\bar{\Lambda}{\bf z})_{d(k-1)+j}=\lambda^{k-1}(z_{-k})_{j}=\bar{{x}}_{-1,d(k-1)+j}, where 𝐱¯\bar{\mathbf{x}} is the unique solution to

𝐱¯t=σ1(A¯𝐱¯t1+C¯𝐳t),t,\bar{\mathbf{x}}_{t}=\sigma_{1}(\bar{A}\bar{\mathbf{x}}_{t-1}+\bar{C}{\bf z}_{t}),\quad t\in\mathbb{Z}_{-}, (3.30)

with $\bar{A}\colon\ell^{q}\to\ell^{q}$, $\bar{C}\colon\mathbb{R}^{d}\to\ell^{q}$ given by $(\bar{A}{\bf x})_{i}=\mathbbm{1}_{\{i>d\}}\lambda x_{i-d}$, $(\bar{C}{\bf z})_{i}=\mathbbm{1}_{\{i\leq d\}}{z}_{i}$ for ${\bf x}\in\ell^{q}$, ${\bf z}\in\mathbb{R}^{d}$, $i\in\mathbb{N}$. Let $\hat{A}\in\mathbb{R}^{N\times N}$, $\hat{C}\in\mathbb{R}^{N\times d}$ be given as $\hat{A}_{ij}=(\bar{A}\bm{\epsilon}^{j})_{i}=\mathbbm{1}_{\{i>d\}}\lambda\delta_{i-d,j}$, $\hat{C}_{ik}=(\bar{C}{\bf e}^{k})_{i}=\mathbbm{1}_{\{i\leq d\}}\delta_{i,k}$ for $i,j=1,\ldots,N$, $k=1,\ldots,d$. Proposition 3.1 then implies that for each ${\bf z}\in\mathcal{I}_{d}$ the system

𝐱tN=σ1(A^𝐱t1N+C^𝐳t),t,\mathbf{x}_{t}^{N}=\sigma_{1}(\hat{A}\mathbf{x}_{t-1}^{N}+\hat{C}{\bf z}_{t}),\quad t\in\mathbb{Z}_{-}, (3.31)

admits a unique solution (𝐱tN)t(N)(\mathbf{x}_{t}^{N})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\mathbb{R}^{N}) and the functional HN:dH^{N}\colon\mathcal{I}_{d}\to\mathbb{R},

HN(𝐳)=𝒳wσ2(𝐚ι(𝐱1N(𝐳))+𝐜𝐳0+b)μ¯(dw,d𝐚,d𝐜,db),𝐳d,\displaystyle H^{N}({\bf z})=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot\iota(\mathbf{x}_{-1}^{N}({\bf z}))+{\bf c}\cdot{\bf z}_{0}+b)\bar{\mu}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in\mathcal{I}_{d}, (3.32)

satisfies HN𝒞NH^{N}\in\mathcal{C}_{N} and that

|H(𝐳)HN(𝐳)|Cfinl=0|A¯|l(i=N+1|x¯1l,i|q)1/q|H({\bf z})-H^{N}({\bf z})|\leq C_{\mathrm{fin}}\sum_{l=0}^{\infty}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\bar{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{l}\left(\sum_{i=N+1}^{\infty}|\bar{{x}}_{-1-l,i}|^{q}\right)^{1/q} (3.33)

for all 𝐳d{\bf z}\in\mathcal{I}_{d} and by (3.11)

Cfin\displaystyle C_{\mathrm{fin}} =Lσ2𝒳|w|𝐚pμ¯(dw,d𝐚,d𝐜,db)=Lσ2𝒳|w|¯(𝐚)pμ(dw,d𝐚,d𝐜,db)c0Lσ2Iμ,p.\displaystyle=L_{\sigma_{2}}\int_{\mathcal{X}}|w|\|{\bf a}\|_{p}\bar{\mu}(dw,d{\bf a},d{\bf c},db)=L_{\sigma_{2}}\int_{\mathcal{X}}|w|\|\bar{\mathcal{L}}({\bf a})\|_{p}{\mu}(dw,d{\bf a},d{\bf c},db)\leq c_{0}L_{\sigma_{2}}I_{\mu,p}.

Choose M>0M>0 such that 𝐮M\|{\bf u}\|\leq M for all 𝐮Dd{\bf u}\in D_{d}. Furthermore, note that |A¯|λ{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\bar{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\lambda and x¯1l,(k1)d+j=λk1[zlk]j\bar{{x}}_{-1-l,(k-1)d+j}=\lambda^{k-1}[z_{-l-k}]_{j}. Hence, using qqmax(d1q2,1)q\|\cdot\|_{q}^{q}\leq\max(d^{1-\frac{q}{2}},1)\|\cdot\|^{q} and 𝐮M\|{\bf u}\|\leq M for all 𝐮Dd{\bf u}\in D_{d} we obtain from (3.33)

|H(𝐳)HN(𝐳)|\displaystyle|H({\bf z})-H^{N}({\bf z})| c0Lσ2Iμ,pl=0(|A¯|)l(k=Nd+1[λk1]qj=1d|[zlk]j|q)1/q\displaystyle\leq c_{0}L_{\sigma_{2}}I_{\mu,p}\sum_{l=0}^{\infty}({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\bar{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{l}\left(\sum_{k=\lfloor\frac{N}{d}\rfloor+1}^{\infty}[\lambda^{k-1}]^{q}\sum_{j=1}^{d}|[z_{-l-k}]_{j}|^{q}\right)^{1/q} (3.34)
c0Lσ2Iμ,pl=0(|A¯|)l(k=Nd+1[λk1]qmax(d1q2,1)Mq)1/q\displaystyle\leq c_{0}L_{\sigma_{2}}I_{\mu,p}\sum_{l=0}^{\infty}({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\bar{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{l}\left(\sum_{k=\lfloor\frac{N}{d}\rfloor+1}^{\infty}[\lambda^{k-1}]^{q}\max(d^{1-\frac{q}{2}},1)M^{q}\right)^{1/q}
c0Lσ2Iμ,pMmax(d1q12,1)λNd(1λ)(1λq)1/q.\displaystyle\leq c_{0}L_{\sigma_{2}}I_{\mu,p}M\max(d^{\frac{1}{q}-\frac{1}{2}},1)\frac{\lambda^{\lfloor\frac{N}{d}\rfloor}}{(1-\lambda)(1-\lambda^{q})^{1/q}}.

From (3.31) we obtain for i=1,,Ni=1,\ldots,N, tt\in\mathbb{Z}_{-} that (xtN)i=𝟙{i>d}λ(xt1N)id+𝟙{id}(zt)i({x}_{t}^{N})_{i}=\mathbbm{1}_{\{i>d\}}\lambda({x}_{t-1}^{N})_{i-d}+\mathbbm{1}_{\{i\leq d\}}({z}_{t})_{i}. Therefore, (xtN)d(k1)+j=λk1(ztk+1)j({x}_{t}^{N})_{d(k-1)+j}=\lambda^{k-1}(z_{t-k+1})_{j} for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq N. In particular, letting ΛN:(Dd)N\Lambda_{N}\colon(D_{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{N} satisfy (ΛN𝐳)d(k1)+j=λk1(zk)j(\Lambda_{N}{\bf z})_{d(k-1)+j}=\lambda^{k-1}(z_{-k})_{j} for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq N we obtain (x1N)d(k1)+j=λk1(zk)j=(ΛN𝐳)d(k1)+j({x}_{-1}^{N})_{d(k-1)+j}=\lambda^{k-1}(z_{-k})_{j}=(\Lambda_{N}{\bf z})_{d(k-1)+j}. Let ϕ:𝒳𝒳N\phi\colon\mathcal{X}\to\mathcal{X}^{N} be the projection map ϕ(w,𝐚,𝐜,b)=(w,π(𝐚),𝐜,b)\phi(w,{\bf a},{\bf c},b)=(w,\pi({\bf a}),{\bf c},b), with π(𝐚)=(ai)i=1,,N\pi({\bf a})=(a_{i})_{i=1,\ldots,N}, and denote by μN=μ¯ϕ1\mu^{N}=\bar{\mu}\circ\phi^{-1} the pushforward measure of μ¯\bar{\mu} under ϕ\phi. Then IμN<I_{\mu^{N}}<\infty and from (3.32) we get

HN(𝐳)=𝒳Nwσ2(𝐚(ΛN𝐳)+𝐜𝐳0+b)μN(dw,d𝐚,d𝐜,db),𝐳d.\displaystyle H^{N}({\bf z})=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot(\Lambda_{N}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in\mathcal{I}_{d}. (3.35)

Proposition 3.4 hence implies that there exists a Borel measure μ~N\tilde{\mu}^{N} on 𝒳N\mathcal{X}^{N} such that Iμ~N<I_{\tilde{\mu}^{N}}<\infty and the mapping

H¯N(𝐳)=𝒳Nwσ2(𝐚𝐱1ESN(𝐳)+𝐜𝐳0+b)μ~N(dw,d𝐚,d𝐜,db),𝐳(Dd)\bar{H}^{N}({\bf z})=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in(D_{d})^{\mathbb{Z}_{-}} (3.36)

satisfies for all 𝐳d{\bf z}\in\mathcal{I}_{d} the bound (3.22), that is,

|HN(𝐳)H¯N(𝐳)|Lσ2(|CESN|M+𝐁ESN)|AESN|T1|AESN||KΛ|𝒳N|w|𝐚μN(dw,d𝐚,d𝐜,db),|H^{N}({\bf z})-\bar{H}^{N}({\bf z})|\leq L_{\sigma_{2}}({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|)\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\int_{\mathcal{X}^{N}}|w|\|{\bf a}\|\mu^{N}(dw,d{\bf a},d{\bf c},db), (3.37)

where ΛN×N\Lambda\in\mathbb{R}^{N\times N} is the diagonal matrix with entries Λd(k1)+j,d(k1)+j=λk1\Lambda_{d(k-1)+j,d(k-1)+j}=\lambda^{k-1} for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq N. Furthermore, max(1,d121p)p\|\cdot\|\leq\max(1,d^{\frac{1}{2}-\frac{1}{p}})\|\cdot\|_{p} and (3.11) imply

𝒳N|w|𝐚μN(dw,d𝐚,d𝐜,db)\displaystyle\int_{\mathcal{X}^{N}}|w|\|{\bf a}\|\mu^{N}(dw,d{\bf a},d{\bf c},db) =𝒳|w|π(¯(𝐚))μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}|w|\|\pi(\bar{\mathcal{L}}({\bf a}))\|\mu(dw,d{\bf a},d{\bf c},db) (3.38)
max(1,d121p)𝒳|w|¯(𝐚)pμ(dw,d𝐚,d𝐜,db)\displaystyle\leq\max(1,d^{\frac{1}{2}-\frac{1}{p}})\int_{\mathcal{X}}|w|\|\bar{\mathcal{L}}({\bf a})\|_{p}\mu(dw,d{\bf a},d{\bf c},db)
max(1,d121p)c0Iμ,p.\displaystyle\leq\max(1,d^{\frac{1}{2}-\frac{1}{p}})c_{0}I_{\mu,p}.

Note that the measure μ~N\tilde{\mu}^{N} in (3.36) is given by μ~N=μNϕ~1\tilde{\mu}^{N}=\mu^{N}\circ\tilde{\phi}^{-1} with ϕ~\tilde{\phi} as in the proof of Proposition 3.4 and μ¯=μϕ¯1\bar{\mu}=\mu\circ\bar{\phi}^{-1} with ϕ¯\bar{\phi} as in the proof of Proposition 3.2. Then we verify that μ~N=μA,B,C\tilde{\mu}^{N}=\mu_{A,B,C} and the change-of-variables theorem shows that

𝒳N\displaystyle\int_{\mathcal{X}^{N}} f(w,𝐚,𝐜,b)μ~N(dw,d𝐚,d𝐜,db)=𝒳Nf(w,KΛ𝐚,𝐜,(KΛ𝐚)𝐁~+b)μN(dw,d𝐚,d𝐜,db)\displaystyle f(w,{\bf a},{\bf c},b)\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db)=\int_{\mathcal{X}^{N}}f(w,K^{-\top}\Lambda{\bf a},{\bf c},-(K^{-\top}\Lambda{\bf a})\cdot\tilde{\bf B}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db) (3.39)
=𝒳f(w,KΛπ(𝐚),𝐜,(KΛπ(𝐚))𝐁~+b)μ¯(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}f(w,K^{-\top}\Lambda\pi({\bf a}),{\bf c},-(K^{-\top}\Lambda\pi({\bf a}))\cdot\tilde{\bf B}+b)\bar{\mu}(dw,d{\bf a},d{\bf c},db)
=𝒳f(w,KΛπ(¯(𝐚)),𝐜,(KΛπ(¯(𝐚)))𝐁~+𝐚𝐁¯0+b)μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}f(w,K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a})),{\bf c},-(K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a})))\cdot\tilde{\bf B}+{\bf a}\cdot\bar{\bf B}_{0}+b)\mu(dw,d{\bf a},d{\bf c},db)

for any function ff for which the integrand in the last line is integrable with respect to μ\mu and where 𝐁~=k=1T(AESN)k1𝐁ESN\tilde{\bf B}=\sum_{k=1}^{T}({A}^{\mathrm{ESN}})^{k-1}{\bf B}^{\mathrm{ESN}}, 𝐁¯0=k=0Ak𝐁q\bar{\bf B}_{0}=\sum_{k=0}^{\infty}A^{k}{\bf B}\in\ell^{q} and ¯:pp\bar{\mathcal{L}}\colon\ell^{p}\to\ell^{p} is linear and satisfies (3.11).

Set Wi=w(i)Ndμ~Ndν(w(i),𝐚(i),𝐜(i),bi)W_{i}=\dfrac{w^{(i)}}{N}\dfrac{d\tilde{\mu}^{N}}{d\nu}(w^{(i)},{\bf a}^{(i)},{\bf c}^{(i)},b_{i}) and notice that

𝔼[i=1NWiσ2(𝐚(i)𝐱1ESN+𝐜(i)𝐳0+bi)]\displaystyle\mathbb{E}\left[\sum_{i=1}^{N}W_{i}\sigma_{2}({\bf a}^{(i)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}^{(i)}\cdot{\bf z}_{0}+b_{i})\right] =𝒳Nwdμ~Ndν(w,𝐚,𝐜,b)σ2(𝐚𝐱1ESN+𝐜𝐳0+b)ν(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{N}}w\frac{d\tilde{\mu}^{N}}{d\nu}(w,{\bf a},{\bf c},b)\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}\cdot{\bf z}_{0}+b)\nu(dw,d{\bf a},d{\bf c},db)
=𝒳Nwσ2(𝐚𝐱1ESN+𝐜𝐳0+b)μ~N(dw,d𝐚,d𝐜,db)=H¯N(𝐳).\displaystyle=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}\cdot{\bf z}_{0}+b)\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db)=\bar{H}^{N}({\bf z}).

Therefore,

𝔼[|H¯N(𝐳)H^(𝐳)|2]\displaystyle\mathbb{E}[|\bar{H}^{N}({\bf z})-\widehat{H}({\bf z})|^{2}] =𝔼[|H¯N(𝐳)i=1NWiσ2(𝐚(i)𝐱1ESN+𝐜(i)𝐳0+bi)|2]\displaystyle=\mathbb{E}[|\bar{H}^{N}({\bf z})-\sum_{i=1}^{N}W_{i}\sigma_{2}({\bf a}^{(i)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}^{(i)}\cdot{\bf z}_{0}+b_{i})|^{2}] (3.40)
=Var(i=1NWiσ2(𝐚(i)𝐱1ESN+𝐜(i)𝐳0+bi))\displaystyle=\mathrm{Var}\left(\sum_{i=1}^{N}W_{i}\sigma_{2}({\bf a}^{(i)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}^{(i)}\cdot{\bf z}_{0}+b_{i})\right)
=i=1NVar(Wiσ2(𝐚(i)𝐱1ESN+𝐜(i)𝐳0+bi))\displaystyle=\sum_{i=1}^{N}\mathrm{Var}(W_{i}\sigma_{2}({\bf a}^{(i)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}^{(i)}\cdot{\bf z}_{0}+b_{i}))
N𝔼[W12σ2(𝐚(1)𝐱1ESN+𝐜(1)𝐳0+b1)2].\displaystyle\leq N\mathbb{E}[W_{1}^{2}\sigma_{2}({\bf a}^{(1)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}^{(1)}\cdot{\bf z}_{0}+b_{1})^{2}].

As noted above, μ~N=μA,B,C\tilde{\mu}^{N}=\mu_{A,B,C} and by assumption dμA,B,Cdν{\displaystyle\frac{d\mu_{A,B,C}}{d\nu}} is bounded, hence

𝔼\displaystyle\mathbb{E} [W12σ2(𝐚(1)𝐱1ESN+𝐜(1)𝐳0+b1)2]\displaystyle[W_{1}^{2}\sigma_{2}({\bf a}^{(1)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}^{(1)}\cdot{\bf z}_{0}+b_{1})^{2}]
=𝒳Nw2N2dμ~Ndν(w,𝐚,𝐜,b)σ2(𝐚𝐱1ESN+𝐜𝐳0+b)2μ~N(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{N}}\frac{w^{2}}{N^{2}}\frac{d\tilde{\mu}^{N}}{d\nu}(w,{\bf a},{\bf c},b)\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}\cdot{\bf z}_{0}+b)^{2}\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db)
2dμA,B,Cdν𝒳Nw2N2[Lσ22|𝐚𝐱1ESN+𝐜𝐳0+b|2+|σ2(0)|2]μ~N(dw,d𝐚,d𝐜,db)\displaystyle\leq 2\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}\int_{\mathcal{X}^{N}}\frac{w^{2}}{N^{2}}[L_{\sigma_{2}}^{2}|{\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}\cdot{\bf z}_{0}+b|^{2}+|\sigma_{2}(0)|^{2}]\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db)
=2dμA,B,Cdν𝒳w2N2[Lσ22|KΛπ(¯(𝐚))𝐱1ESN+𝐜𝐳0(KΛπ(¯(𝐚)))𝐁~+𝐚𝐁¯0+b|2+|σ2(0)|2]μ(dw,d𝐚,d𝐜,db)\displaystyle=2\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}\int_{\mathcal{X}}\frac{w^{2}}{N^{2}}[L_{\sigma_{2}}^{2}|K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a}))\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}\cdot{\bf z}_{0}-(K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a})))\cdot\tilde{\bf B}+{\bf a}\cdot\bar{\bf B}_{0}+b|^{2}+|\sigma_{2}(0)|^{2}]\mu(dw,d{\bf a},d{\bf c},db)
2dμA,B,Cdν𝒳w2N2[Lσ22|π(¯(𝐚))ΛK1(𝐱1ESN𝐁~)+𝐜𝐳0+𝐚p𝐁¯0q+|b||2+|σ2(0)|2]μ(dw,d𝐚,d𝐜,db)\displaystyle\leq 2\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}\int_{\mathcal{X}}\frac{w^{2}}{N^{2}}[L_{\sigma_{2}}^{2}|\|\pi(\bar{\mathcal{L}}({\bf a}))\|\|\Lambda K^{-1}(\mathbf{x}_{-1}^{\mathrm{ESN}}-\tilde{\bf B})\|+\|{\bf c}\|\|{\bf z}_{0}\|+\|{\bf a}\|_{p}\|\bar{\bf B}_{0}\|_{q}+|b||^{2}+|\sigma_{2}(0)|^{2}]\mu(dw,d{\bf a},d{\bf c},db)
c1N2dμA,B,Cdνmax(c02,𝐁¯0q2,1)(1+|ΛK1|2𝐱1ESN𝐁~2)𝒳w2[𝐚p2+𝐜2+|b|2+1]μ(dw,d𝐚,d𝐜,db)\displaystyle\leq\frac{c_{1}}{N^{2}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}\max(c_{0}^{2},\|\bar{\bf B}_{0}\|_{q}^{2},1)(1+{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\Lambda K^{-1}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{2}\|\mathbf{x}_{-1}^{\mathrm{ESN}}-\tilde{\bf B}\|^{2})\int_{\mathcal{X}}w^{2}[\|{\bf a}\|_{p}^{2}+\|{\bf c}\|^{2}+|b|^{2}+1]\mu(dw,d{\bf a},d{\bf c},db)

where we used (3.11), max(1,d121p)p\|\cdot\|\leq\max(1,d^{\frac{1}{2}-\frac{1}{p}})\|\cdot\|_{p}, c0=d1/(2p)|C|(1|A~|p)1/pc_{0}=d^{1/(2p)}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p})^{-1/p} and set

c1=8max(1,d121p)2max(Lσ2,|σ2(0)|)2max(M2,1).c_{1}=8\max(1,d^{\frac{1}{2}-\frac{1}{p}})^{2}\max(L_{\sigma_{2}},|\sigma_{2}(0)|)^{2}\max(M^{2},1). (3.41)

Combining this with (3.40), (3.34), (3.37) and (3.38) then yields

𝔼[|H(𝐳)H^(𝐳)|2]1/2\displaystyle\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2}
𝔼[|H(𝐳)HN(𝐳)|2]1/2+𝔼[|HN(𝐳)H¯N(𝐳)|2]1/2+𝔼[|H¯N(𝐳)H^(𝐳)|2]1/2\displaystyle\leq\mathbb{E}[|H({\bf z})-H^{N}({\bf z})|^{2}]^{1/2}+\mathbb{E}[|H^{N}({\bf z})-\bar{H}^{N}({\bf z})|^{2}]^{1/2}+\mathbb{E}[|\bar{H}^{N}({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2}
c0Lσ2Iμ,pMmax(1,d1q12)λNd(1λ)(1λq)1/q+Lσ2(|CESN|M+𝐁ESN)|AESN|T1|AESN||KΛ|max(1,d121p)c0Iμ,p\displaystyle\leq c_{0}L_{\sigma_{2}}I_{\mu,p}M\max(1,d^{\frac{1}{q}-\frac{1}{2}})\frac{\lambda^{\lfloor\frac{N}{d}\rfloor}}{(1-\lambda)(1-\lambda^{q})^{1/q}}+L_{\sigma_{2}}\left({\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{C}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|\right)\frac{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{T}}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}}{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|K^{-\top}\Lambda\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}\max\left(1,d^{\frac{1}{2}-\frac{1}{p}}\right)c_{0}I_{\mu,p}
+c11/2N1/2dμA,B,Cdν1/2max(c0,𝐁¯0q,1)(1+|ΛK1|𝐱1ESN𝐁~)Iμ,p(2)\displaystyle\quad\quad+\frac{c_{1}^{1/2}}{N^{1/2}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}^{1/2}\max(c_{0},\|\bar{\bf B}_{0}\|_{q},1)(1+{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\Lambda K^{-1}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}\|\mathbf{x}_{-1}^{\mathrm{ESN}}-\tilde{\bf B}\|)I_{\mu,p}^{(2)}
c~1max(|C|(1|λ1A|p)1/p,𝐁q1|A|,1)[Iμ,pλNd+(|||CESN|||+𝐁ESN)|AESN|T1|AESN||||KΛ|||Iμ,p\displaystyle\leq\tilde{c}_{1}\max\left(\frac{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|C\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}}{\left(1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\lambda^{-1}{A}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{p}\right)^{1/p}},\frac{\|{\bf B}\|_{q}}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|A\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}},1\right)[I_{\mu,p}\lambda^{\frac{N}{d}}+\left({\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{C}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|\right)\frac{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{T}}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}}{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|K^{-\top}\Lambda\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}I_{\mu,p}
+1N12dμA,B,Cdν12(1+|||ΛK1|||(|||CESN|||+𝐁ESN)11|AESN|)Iμ,p(2)]\displaystyle\quad\quad+\frac{1}{N^{\frac{1}{2}}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}^{\frac{1}{2}}(1+{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\Lambda K^{-1}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}\left({\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{C}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|\right)\frac{1}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}})I_{\mu,p}^{(2)}]
c~1max(|C|(1|λ1A|p)1/p,𝐁q1|A|,1)max(Iμ,p,Iμ,p(2))(1+|CESN|+𝐁ESN1|AESN||KΛ|)\displaystyle\leq\tilde{c}_{1}\max\left(\frac{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|C\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}}{\left(1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\lambda^{-1}{A}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{p}\right)^{1/p}},\frac{\|{\bf B}\|_{q}}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|A\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}},1\right)\max(I_{\mu,p},I_{\mu,p}^{(2)})(1+\frac{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{C}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}}{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|K^{-\top}\Lambda\right|\kern-0.75346pt\right|\kern-0.75346pt\right|})
×[λNd+|AESN|T+1N12dμA,B,Cdν12]\displaystyle\quad\quad\times\left[\lambda^{\frac{N}{d}}+{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}^{\frac{1}{2}}\right] (3.42)

with

c~1\displaystyle\tilde{c}_{1} =812max(Lσ2,|σ2(0)|,1)max(1,d121p)d12pmax(1,d1q12)(1λ)1(1λq)1qλ1max(M,1)2.\displaystyle=8^{\frac{1}{2}}\max(L_{\sigma_{2}},|\sigma_{2}(0)|,1)\max(1,d^{\frac{1}{2}-\frac{1}{p}})d^{\frac{1}{2p}}\max(1,d^{\frac{1}{q}-\frac{1}{2}})(1-\lambda)^{-1}(1-\lambda^{q})^{-\frac{1}{q}}\lambda^{-1}\max(M,1)^{2}. (3.43)

This establishes the bound (3.27) with the constant $C_{H,\mathrm{ESN}}$ in (3.28), which concludes the proof. ∎

Remark 3.8.

A bound similar to the one in (3.27) can be obtained if the ESN-ELM used as approximant satisfies the more general condition ρ(AESN)<1\rho({A}^{{\rm ESN}})<1. The theorem is proved in that case by replacing in (3.37) the use of the inequality (3.22) by its more general version (3.21).

Corollary 3.9.

As in Theorem 3.7, suppose the recurrent activation function is σ1(x)=x\sigma_{1}(x)=x, p(1,)p\in(1,\infty) and the input set DdD_{d} is bounded. Let H𝒞H\in\mathcal{C} be such that it has a realization of the type (2.2) with measure μ\mu and bounded linear maps C:dqC\colon\mathbb{R}^{d}\to\ell^{q}, A:qqA\colon\ell^{q}\to\ell^{q} satisfying |A|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and Iμ,p(2)<I_{\mu,p}^{(2)}<\infty. Let λ(|A|,1)\lambda\in\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},1\right). Consider now a randomly constructed ESN-ELM pair such that |AESN|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and for which the controllability matrix KK in (3.18) is invertible. Then there exists a distribution ν\nu for the readout hidden layer weights and a readout 𝐖{\bf W} (an N\mathbb{R}^{N}-valued random vector) such that the ESN-ELM H^\widehat{H} satisfies, for any 𝐳d{\bf z}\in\mathcal{I}_{d}, the approximation error bound

𝔼[|H(𝐳)H^(𝐳)|2]1/2\displaystyle\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2} CH,ESN[λNd+|AESN|Nd+1N12]\displaystyle\leq C_{H,\mathrm{ESN}}\left[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\frac{N}{d}}+\frac{1}{N^{\frac{1}{2}}}\right] (3.44)

with CH,ESNC_{H,\mathrm{ESN}} as in (3.28).

Proof.

This follows by noting |AESN|T|AESN|Nd{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\frac{N}{d}} due to TNdT\geq\frac{N}{d} and choosing ν=μA,B,C\nu=\mu_{A,B,C}, so that dμA,B,Cdν=1\dfrac{d\mu_{A,B,C}}{d\nu}=1 and Theorem 3.7 yields (3.44). ∎

Remark 3.10.

Curse of dimensionality. The error bound (3.44) consists of the constant $C_{H,\mathrm{ESN}}$ and three terms that depend explicitly on $N$. The constant $C_{H,\mathrm{ESN}}$ could depend on $N$ through the norms ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}$, $\|{\bf B}^{\mathrm{ESN}}\|$, and ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}$, but it is reasonable to assume that these norms do not depend on $N$ (in practice, these matrices will be normalized). Similarly, it appears reasonable to assume that the norm ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}$ is bounded in $N$, since the norm ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}$ is likely to be bounded from below. This argument indicates that, in practically relevant situations, the constant does not depend on $N$ and, as is apparent from the explicit expression (3.43), it depends only polynomially on the dimension $d$. To achieve an approximation error of size at most $\varepsilon$, we could thus take $N=\left\lceil\max\left(d\log\left(\dfrac{\varepsilon}{3C_{H,\mathrm{ESN}}}\right)\dfrac{1}{\log(\lambda)},d\log\left(\dfrac{\varepsilon}{3C_{H,\mathrm{ESN}}}\right)\dfrac{1}{\log({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})},9C_{H,\mathrm{ESN}}^{2}\varepsilon^{-2}\right)\right\rceil$, which grows only polynomially in $d$ and $\varepsilon^{-1}$. Hence, in these circumstances, there is no curse of dimensionality in the bound (3.44).
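
For illustration only, the following snippet evaluates the above choice of $N$ for a few target accuracies $\varepsilon$; the values used for the constant $C_{H,\mathrm{ESN}}$, for $\lambda$, for the norm of ${A}^{\mathrm{ESN}}$, and for $d$ are made up.

\begin{verbatim}
import numpy as np

# Evaluating the choice of N from Remark 3.10 for a few target accuracies.
# The constant C, lambda, the norm of A_esn and the input dimension d below
# are made-up illustrative values.
def choose_N(eps, C=2.0, lam=0.9, norm_A_esn=0.8, d=5):
    N1 = d * np.log(eps / (3 * C)) / np.log(lam)
    N2 = d * np.log(eps / (3 * C)) / np.log(norm_A_esn)
    N3 = 9 * C**2 / eps**2
    return int(np.ceil(max(N1, N2, N3)))

for eps in (0.5, 0.1, 0.05):
    print(f"eps = {eps:5.2f}   N = {choose_N(eps)}")
\end{verbatim}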

Remark 3.11.

Implementation. From a practical point of view, the bounds obtained in Theorem 3.7 and in Corollary 3.9 apply to two different learning procedures: in the first case, when the chosen ELM weight distribution ν\nu satisfies the absolute continuity assumption in Theorem 3.7, only the ELM output layer needs to be trained. In the second case, when that condition is not available, all the ELM weights also need to be trained. However, note that these are not recurrent weights; consequently, the vanishing gradient problem does not occur here. In contrast, this problem sometimes occurs for standard recurrent neural networks in which all weights are trained by backpropagation through time.
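To fix ideas, the following Python sketch illustrates the first of these procedures on toy data: the ESN matrices and the ELM hidden weights are drawn at random and then frozen, and only the linear output layer is fitted by least squares. The Gaussian weight laws, the toy target, and all variable names are our own simplifying choices and are not prescribed by the results above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, T_data = 100, 3, 500          # reservoir/ELM dimension, input dimension, sample size

# Randomly generated and then frozen ESN parameters, rescaled so that the
# linear state map is a contraction (spectral norm of A below one).
A = rng.standard_normal((N, N)); A *= 0.9 / np.linalg.norm(A, 2)
C = rng.standard_normal((N, d)); B = rng.standard_normal(N)

# Randomly generated and then frozen ELM hidden weights; only W below is trained.
a = rng.standard_normal((N, N)); c = rng.standard_normal((N, d)); b = rng.standard_normal(N)
relu = lambda x: np.maximum(x, 0.0)

def elm_features(Z):
    """Run the linear ESN over an input sequence Z of shape (T, d) and return,
    for each time step, the ELM features sigma2(a . x_{t-1} + c . z_t + b)."""
    x, feats = np.zeros(N), []
    for z in Z:
        feats.append(relu(a @ x + c @ z + b))
        x = A @ x + C @ z + B       # sigma1 = identity (linear reservoir)
    return np.asarray(feats)

# Toy input/output data; in practice (Z, y) are the observed pairs.
Z = rng.uniform(-1.0, 1.0, size=(T_data, d))
y = np.sin(Z[:, 0]) + 0.1 * rng.standard_normal(T_data)

Phi = elm_features(Z)
W = np.linalg.lstsq(Phi, y, rcond=None)[0]   # only the output layer is trained
y_hat = Phi @ W
```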

We conclude this section by considering another implementation scenario in which the measure used for sampling the ELM weights is prescribed in advance. We show that it is still possible to sample from measures that are arbitrarily close to the prescribed one in the sense of the 11-Wasserstein metric, at the price of an additional error term that can be compensated by increasing the dimension of the ESN. Let 𝒲1\mathcal{W}_{1} be the 11-Wasserstein metric. Suppose that we use the measure ν0\nu_{0} to generate the hidden layer weights. The next corollary shows how the error increases when we sample “almost” from the given measure ν0\nu_{0}.

Corollary 3.12.

Consider the same situation as in Corollary 3.9 and assume that ν0\nu_{0} has finite first moment and 𝒳[|w|+𝐚p+𝐜+|b|]μ(dw,d𝐚,d𝐜,db)<\int_{\mathcal{X}}[|w|+\|{\bf a}\|_{p}+\|{\bf c}\|+|b|]\mu(dw,d{\bf a},d{\bf c},db)<\infty. Then, for any δ(0,1)\delta\in(0,1) there exists a probability measure ν\nu with 𝒲1(ν0,ν)δ\mathcal{W}_{1}(\nu_{0},\nu)\leq\delta and a readout 𝐖{\bf W} (an N\mathbb{R}^{N}-valued random vector) such that the ESN-ELM with distribution ν\nu for the hidden layer weights satisfies for any 𝐳d{\bf z}\in\mathcal{I}_{d} the approximation error bound

𝔼[|H(𝐳)H^(𝐳)|2]1/2\displaystyle\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2} C~H,ESN[λNd+|AESN|Nd+1N12min(δ12,𝒥ν0)]\displaystyle\leq\tilde{C}_{H,\mathrm{ESN}}\left[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\frac{N}{d}}+\frac{1}{N^{\frac{1}{2}}}\min(\delta^{-\frac{1}{2}},\mathcal{J}_{\nu_{0}})\right] (3.45)

with 𝒥ν0=dμA,B,Cdν012\mathcal{J}_{\nu_{0}}=\left\|\frac{d\mu_{A,B,C}}{d\nu_{0}}\right\|_{\infty}^{\frac{1}{2}}, C~H,ESN=CH,ESNmax(1,𝒲1(ν0,μA,B,C))12\tilde{C}_{H,\mathrm{ESN}}=C_{H,\mathrm{ESN}}\max(1,\mathcal{W}_{1}(\nu_{0},\mu_{A,B,C}))^{\frac{1}{2}} and CH,ESNC_{H,\mathrm{ESN}} as in (3.28).

Proof.

First note that |AESN|T|AESN|Nd{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\frac{N}{d}}, since TNdT\geq\frac{N}{d}. In the case where μA,B,Cν0\mu_{A,B,C}\ll\nu_{0} and dμA,B,Cdν0\frac{d\mu_{A,B,C}}{d\nu_{0}} is bounded, that is, 𝒥ν0<\mathcal{J}_{\nu_{0}}<\infty, we may choose ν=ν0\nu=\nu_{0} and obtain from Theorem 3.7 the bound

𝔼[|H(𝐳)H^(𝐳)|2]1/2\displaystyle\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2} C~H,ESN[λNd+|AESN|Nd+1N12𝒥ν0].\displaystyle\leq\tilde{C}_{H,\mathrm{ESN}}\left[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\frac{N}{d}}+\frac{1}{N^{\frac{1}{2}}}\mathcal{J}_{\nu_{0}}\right]. (3.46)

We now derive an alternative bound (regardless of whether 𝒥ν0\mathcal{J}_{\nu_{0}} is finite or not). Combining both bounds then yields the bound (3.45).

From (3.39) and the estimate π(¯(𝐚))max(1,d121p)c0𝐚p\|\pi(\bar{\mathcal{L}}({\bf a}))\|\leq\max(1,d^{\frac{1}{2}-\frac{1}{p}})c_{0}\|{\bf a}\|_{p} we obtain

𝒳N\displaystyle\int_{\mathcal{X}^{N}} (|w|+𝐚+𝐜+|b|)μA,B,C(dw,d𝐚,d𝐜,db)\displaystyle\left(|w|+\|{\bf a}\|+\|{\bf c}\|+|b|\right)\mu_{A,B,C}(dw,d{\bf a},d{\bf c},db) (3.47)
=𝒳(|w|+KΛπ(¯(𝐚))+𝐜+|(KΛπ(¯(𝐚)))𝐁~+𝐚𝐁¯0+b|)μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}\left(|w|+\|K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a}))\|+\|{\bf c}\|+|-(K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a})))\cdot\tilde{\bf B}+{\bf a}\cdot\bar{\bf B}_{0}+b|\right)\mu(dw,d{\bf a},d{\bf c},db)
max(1,|KΛ|max(1,d121p)c0)\displaystyle\leq\max(1,{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\max(1,d^{\frac{1}{2}-\frac{1}{p}})c_{0})
×(1+𝐁~+𝐁¯0q)𝒳(|w|+𝐚p+𝐜+|b|)μ(dw,d𝐚,d𝐜,db)<\displaystyle\times(1+\|\tilde{\bf B}\|+\|\bar{\bf B}_{0}\|_{q})\int_{\mathcal{X}}\left(|w|+\|{\bf a}\|_{p}+\|{\bf c}\|+|b|\right)\mu(dw,d{\bf a},d{\bf c},db)<\infty

and thus μA,B,C\mu_{A,B,C} has a finite first moment. Hence, δ~=δ[max(1,𝒲1(ν0,μA,B,C))]1(0,1)\tilde{\delta}=\delta[\max(1,\mathcal{W}_{1}(\nu_{0},\mu_{A,B,C}))]^{-1}\in(0,1) and also ν=δ~μA,B,C+(1δ~)ν0\nu=\tilde{\delta}\mu_{A,B,C}+(1-\tilde{\delta})\nu_{0} has a finite first moment. Therefore, we may use the Kantorovich-Rubinstein duality (see [Vill 09, Theorem 5.10; Particular Case 5.16]) to calculate

𝒲1(ν0,ν)\displaystyle\mathcal{W}_{1}(\nu_{0},\nu) =supf:f is 1-Lipschitz{𝒳Nf(x)(ν0ν)(dx)}\displaystyle=\sup_{f\colon f\text{ is $1$-Lipschitz}}\left\{\int_{\mathcal{X}^{N}}f(x)(\nu_{0}-\nu)(dx)\right\}
=δ~supf:f is 1-Lipschitz{𝒳Nf(x)(ν0μA,B,C)(dx)}=δ~𝒲1(ν0,μA,B,C)δ.\displaystyle=\tilde{\delta}\sup_{f\colon f\text{ is $1$-Lipschitz}}\left\{\int_{\mathcal{X}^{N}}f(x)(\nu_{0}-\mu_{A,B,C})(dx)\right\}=\tilde{\delta}\mathcal{W}_{1}(\nu_{0},\mu_{A,B,C})\leq\delta.

In addition, μA,B,Cν\mu_{A,B,C}\ll\nu and

δ~dμA,B,Cdν+(1δ~)dν0dν=1,\tilde{\delta}\frac{d\mu_{A,B,C}}{d\nu}+(1-\tilde{\delta})\frac{d\nu_{0}}{d\nu}=1,

hence dν0dν0\frac{d\nu_{0}}{d\nu}\geq 0 implies dμA,B,Cdν1δ~\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}\leq\frac{1}{\tilde{\delta}}. Therefore, Theorem 3.7 yields (3.45). ∎
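The measure ν constructed in this proof is simply a mixture of the prescribed measure ν0 and the measure μA,B,C appearing in Theorem 3.7. For illustration, the following Python sketch draws hidden-layer weights from such a mixture; the two samplers are placeholders that stand in for μA,B,C and ν0.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_mixture(sample_mu, sample_nu0, delta_tilde, n):
    """Draw n hidden-layer weight vectors from the mixture
    nu = delta_tilde * mu_{A,B,C} + (1 - delta_tilde) * nu_0 used in the proof:
    with probability delta_tilde sample from mu_{A,B,C}, otherwise from nu_0."""
    from_mu = rng.random(n) < delta_tilde
    return np.array([sample_mu() if flag else sample_nu0() for flag in from_mu])

# Illustrative placeholder samplers on R^k for some k.
k = 4
weights = sample_mixture(lambda: rng.standard_normal(k),          # stands in for mu_{A,B,C}
                         lambda: rng.uniform(-1.0, 1.0, size=k),  # stands in for nu_0
                         delta_tilde=0.1, n=1000)
```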

3.5 Universality

In addition, we obtain the following universality result, which shows that any square-integrable functional (not necessarily in the recurrent Barron class) can be approximated arbitrarily well by the ESN-ELM (3.16)-(3.17) with a suitably chosen readout 𝐖{\bf W} and hidden weights generated from a distribution ν\nu arbitrarily close to a prescribed measure ν0\nu_{0}.

Corollary 3.13.

Assume that the input set DdD_{d} is bounded, and consider an arbitrary functional H:dH\colon\mathcal{I}_{d}\to\mathbb{R} that we will approximate with an ESN-ELM built with the following specifications: the activation functions are σ1(x)=x\sigma_{1}(x)=x and either σ2(x)=max(x,0)\sigma_{2}(x)=\max(x,0) or σ2\sigma_{2} is a bounded, Lipschitz-continuous, and non-constant function. Furthermore, assume that there exists c¯>0\bar{c}>0 and l¯,l¯(0,1)\underline{l},\bar{l}\in(0,1), l¯<l¯\underline{l}<\bar{l}, such that for any choice of NN\in\mathbb{N} the ESN parameters satisfy that the controllability matrix KK in (3.18) is invertible, l¯<|AESN|<l¯\underline{l}<{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<\bar{l}, 𝐁ESNc¯\|{\bf B}^{\mathrm{ESN}}\|\leq\bar{c}, |CESN|c¯{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\bar{c}, and |K1Λ|c¯{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-1}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\bar{c}, where ΛN×N\Lambda\in\mathbb{R}^{{N}\times{N}} is the diagonal matrix with entries Λd(k1)+j,d(k1)+j=l¯k1\Lambda_{d(k-1)+j,d(k-1)+j}={\underline{l}}^{k-1}, for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq{N}. Let ν0\nu_{0} be a given hidden weight distribution with finite first moment. Then for any probability measure γ\gamma on d(Dd)\mathcal{I}_{d}\subset(D_{d})^{\mathbb{Z}_{-}} such that HL2(d,γ)H\in L^{2}(\mathcal{I}_{d},\gamma), there exists a probability measure ν\nu with 𝒲1(ν0,ν)<ε\mathcal{W}_{1}(\nu_{0},\nu)<\varepsilon and a readout 𝐖{\bf W} (an N\mathbb{R}^{N}-valued random vector) such that the ESN-ELM H^\widehat{H} with readout 𝐖{\bf W} and distribution ν\nu for the hidden layer weights satisfies that

(d𝔼[|H(𝐳)H^(𝐳)|2]γ(d𝐳))1/2<ε.\left(\int_{\mathcal{I}_{d}}\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]\gamma(d{\bf z})\right)^{1/2}<\varepsilon.
Proof.

Firstly, by Proposition 2.10 there exists HRC𝒞L2(d,γ)H^{RC}\in\mathcal{C}\cap L^{2}(\mathcal{I}_{d},\gamma) such that HHRCL2(d,γ)<ε/2\|H-H^{RC}\|_{L^{2}(\mathcal{I}_{d},\gamma)}<\varepsilon/2. From the proof of Proposition 2.10 and the construction in [Gono 20c] it follows that HRCH^{RC} is of the form

HRC(𝐳)=𝒳N¯wσ2(𝐚(πN¯𝐳)+𝐜𝐳0+b)μN¯(dw,d𝐚,d𝐜,db),𝐳(Dd),H^{RC}({\bf z})=\int_{\mathcal{X}^{\bar{N}}}w\sigma_{2}({\bf a}\cdot(\pi_{\bar{N}}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu^{\bar{N}}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in(D_{d})^{\mathbb{Z}_{-}}, (3.48)

for some N¯\bar{N}\in\mathbb{N}, an atomic probability measure μN¯\mu^{\bar{N}}, and with πN¯:(Dd)N¯\pi_{\bar{N}}\colon(D_{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{\bar{N}} satisfying

(πN¯(𝐳))d(k1)+j=(zk)j,for all kj=1,,d with d(k1)+jN¯.(\pi_{\bar{N}}({\bf z}))_{d(k-1)+j}=(z_{-k})_{j},\quad\mbox{for all $k\in\mathbb{N}$, $j=1,\ldots,d$ with $d(k-1)+j\leq\bar{N}$.}

Now let λ¯=l¯(0,1){\bar{\lambda}}=\underline{l}\in(0,1), p(1,)p\in(1,\infty), 𝐁=0{\bf B}=0 and define A:qqA\colon\ell^{q}\to\ell^{q}, C:dqC\colon\mathbb{R}^{d}\to\ell^{q} by (A𝐱)i=𝟙{i>d}λ¯xid(A{\bf x})_{i}=\mathbbm{1}_{\{i>d\}}{\bar{\lambda}}x_{i-d}, (C𝐳)i=𝟙{id}zi(C{\bf z})_{i}=\mathbbm{1}_{\{i\leq d\}}{z}_{i} for 𝐱q{\bf x}\in\ell^{q}, 𝐳d{\bf z}\in\mathbb{R}^{d}, ii\in\mathbb{N}. Then, as in the proof of Proposition 2.6 we obtain that (2.1) admits a unique solution (𝐱t)t(q)(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q}) and for kk\in\mathbb{N}, j=1,,dj=1,\ldots,d we have xt,(k1)d+j=λ¯k1ztk+1,j{x}_{t,(k-1)d+j}={\bar{\lambda}}^{k-1}{z}_{t-k+1,j}. Let ι:N¯p\iota\colon\mathbb{R}^{\bar{N}}\to\ell^{p} be the natural injection (x1,,xN¯)(x1,,xN¯,0,)(x_{1},\ldots,x_{\bar{N}})^{\top}\longmapsto(x_{1},\ldots,x_{\bar{N}},0,\ldots) and let π:qN¯\pi\colon\ell^{q}\to\mathbb{R}^{\bar{N}} be the natural projection. Then

𝐚(πN¯𝐳)=𝐚(Λ1π𝐱1)=(ιΛ1𝐚)𝐱1.{\bf a}\cdot(\pi_{\bar{N}}{\bf z})={\bf a}\cdot(\Lambda^{-1}\pi\mathbf{x}_{-1})=(\iota\Lambda^{-1}{\bf a})\cdot\mathbf{x}_{-1}.

Let ϕ:𝒳N¯𝒳\phi\colon\mathcal{X}^{\bar{N}}\to\mathcal{X} be defined by ϕ(w,𝐚,𝐜,b)=(w,ιΛ1𝐚,𝐜,b)\phi(w,{\bf a},{\bf c},b)=(w,\iota\Lambda^{-1}{\bf a},{\bf c},b) and denote by μ=μN¯ϕ1\mu=\mu^{\bar{N}}\circ\phi^{-1} the pushforward measure of μN¯\mu^{\bar{N}} under ϕ\phi. Then

HRC(𝐳)\displaystyle H^{RC}({\bf z}) =𝒳N¯wσ2((ιΛ1𝐚)𝐱1+𝐜𝐳0+b)μN¯(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{\bar{N}}}w\sigma_{2}((\iota\Lambda^{-1}{\bf a})\cdot\mathbf{x}_{-1}+{\bf c}\cdot{\bf z}_{0}+b)\mu^{\bar{N}}(dw,d{\bf a},d{\bf c},db) (3.49)
=𝒳wσ2(𝐚𝐱1(𝐳)+𝐜𝐳0+b)μ(dw,d𝐚,d𝐜,db).\displaystyle=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu(dw,d{\bf a},d{\bf c},db).

Thus, HRC𝒞H^{RC}\in\mathcal{C} has a representation of the type (2.2) with |A|=λ¯<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}={\bar{\lambda}}<1. Furthermore, since μN¯\mu^{\bar{N}} is atomic it follows directly that Iμ,p(2)<I_{\mu,p}^{(2)}<\infty and 𝒳[|w|+𝐚p+𝐜+|b|]μ(dw,d𝐚,d𝐜,db)<\int_{\mathcal{X}}[|w|+\|{\bf a}\|_{p}+\|{\bf c}\|+|b|]\mu(dw,d{\bf a},d{\bf c},db)<\infty.

Next, fix λ(|A|,1)\lambda\in({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},1) arbitrary and note that under the assumptions on the ESN, the constant CH,ESNC_{H,\mathrm{ESN}} in (3.28) can be estimated as

CH,ESNc~1max(1(1(λ1λ¯)p)1p,1)max(Iμ,p,Iμ,p(2))(1+2c¯21l¯)=:C¯H,0,C_{H,\mathrm{ESN}}\leq\tilde{c}_{1}\max(\frac{1}{(1-(\lambda^{-1}\bar{\lambda})^{p})^{\frac{1}{p}}},1)\max(I_{\mu,p},I_{\mu,p}^{(2)})\left(1+2\frac{\bar{c}^{2}}{1-\bar{l}}\right)=:\bar{C}_{H,0}, (3.50)

which does not depend on NN. Similarly, the bound

𝒲1(ν0,μA,B,C)𝒳Nxν0(dx)+𝒳NxμA,B,C(dx),\mathcal{W}_{1}(\nu_{0},\mu_{A,B,C})\leq\int_{\mathcal{X}^{N}}\|x\|\nu_{0}(dx)+\int_{\mathcal{X}^{N}}\|x\|\mu_{A,B,C}(dx),

the assumptions on the ESN matrices, and (3.47) can be used to obtain an upper bound C¯H,1\bar{C}_{H,1} on 𝒲1(ν0,μA,B,C)\mathcal{W}_{1}(\nu_{0},\mu_{A,B,C}) that only depends on μ\mu, dd, pp, λ\lambda, c¯\bar{c}, l¯\underline{l}, l¯\bar{l} and ν0\nu_{0}, but not on NN. Set c¯H=C¯H,0max(1,C¯H,1)12\bar{c}_{H}=\bar{C}_{H,0}\max(1,\bar{C}_{H,1})^{\frac{1}{2}}.

We now apply Corollary 3.12 to H=HRCH=H^{RC}. We select δ(0,1)\delta\in(0,1) with δ<ε\delta<\varepsilon and

N=max(dlog(ε6c¯H)1log(λ),dlog(ε6c¯H)1log(l¯),(6c¯Hε1)2δ1),N=\left\lceil\max\left(d\log\left(\frac{\varepsilon}{6\bar{c}_{H}}\right)\dfrac{1}{\log(\lambda)},d\log\left(\frac{\varepsilon}{6\bar{c}_{H}}\right)\dfrac{1}{\log(\bar{l})},(6\bar{c}_{H}\varepsilon^{-1})^{2}\delta^{-1}\right)\right\rceil,

then from Corollary 3.12 we obtain that there exists a probability measure ν\nu with 𝒲1(ν0,ν)<ε\mathcal{W}_{1}(\nu_{0},\nu)<\varepsilon and a readout 𝐖{\bf W} (an N\mathbb{R}^{N}-valued random vector) such that the ESN-ELM H^\widehat{H} with readout 𝐖{\bf W} and distribution ν\nu for the hidden layer weights satisfies for any 𝐳d{\bf z}\in\mathcal{I}_{d}

𝔼[|HRC(𝐳)H^(𝐳)|2]1/2\displaystyle\mathbb{E}[|H^{RC}({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2} c¯H[λNd+|AESN|Nd+1δ12N12]<ε2.\displaystyle\leq\bar{c}_{H}\left[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\frac{N}{d}}+\frac{1}{\delta^{\frac{1}{2}}N^{\frac{1}{2}}}\right]<\frac{\varepsilon}{2}. (3.51)

Hence,

(d𝔼[|H(𝐳)H^(𝐳)|2]γ(d𝐳))1/2HHRCL2(d,γ)+(d𝔼[|HRC(𝐳)H^(𝐳)|2]γ(d𝐳))1/2<ε.\left(\int_{\mathcal{I}_{d}}\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]\gamma(d{\bf z})\right)^{1/2}\leq\|H-H^{RC}\|_{L^{2}(\mathcal{I}_{d},\gamma)}+\left(\int_{\mathcal{I}_{d}}\mathbb{E}[|H^{RC}({\bf z})-\widehat{H}({\bf z})|^{2}]\gamma(d{\bf z})\right)^{1/2}<\varepsilon.

3.6 Special case: static situation

As a special case, we consider the situation without recurrence, that is, when H:DdH\colon D_{d}\to\mathbb{R} is of the form

H(𝐮)=×d×wσ2(𝐜𝐮+b)μ0(dw,d𝐜,db),𝐮Ddd,H({\bf u})=\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}w\sigma_{2}({\bf c}\cdot{\bf u}+b)\mu_{0}(dw,d{\bf c},db),\quad{\bf u}\in D_{d}\subset\mathbb{R}^{d}, (3.52)

for a probability measure μ0\mu_{0} on ×d×\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R} satisfying ×d×|w|(𝐜+|b|)μ0(dw,d𝐜,db)<\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}|w|(\|{\bf c}\|+|b|)\mu_{0}(dw,d{\bf c},db)<\infty. HH is clearly a particular case of a recurrent generalized Barron functional, since it can be written as (2.2) with measure μ\mu on 𝒳\mathcal{X} given by μ(dw,d𝐚,d𝐜,db)=δ0(d𝐚)μ0(dw,d𝐜,db)\mu(dw,d{\bf a},d{\bf c},db)=\delta_{0}(d{\bf a})\mu_{0}(dw,d{\bf c},db), which satisfies Iμ,p<I_{\mu,p}<\infty by the integrability assumption on μ0\mu_{0}.

In this situation, we show that such static elements HH of the recurrent Barron class can be approximated by ELMs, that is, by feedforward neural networks

H^(𝐮)=i=1NWiσ2(𝐜(i)𝐮+bi)\widehat{H}({\bf u})=\sum_{i=1}^{N}W_{i}\sigma_{2}({\bf c}^{(i)}\cdot{\bf u}+b_{i}) (3.53)

with randomly generated coefficients 𝐜(i){\bf c}^{(i)}, bib_{i} valued in d\mathbb{R}^{d} and \mathbb{R}, respectively, and 𝐖N{\bf W}\in\mathbb{R}^{N} trainable. H^\widehat{H} can be viewed as a functional 𝐳H^(𝐳0){\bf z}\mapsto\widehat{H}({\bf z}_{0}), which is of the form (3.17) with 𝐚(i)=0{\bf a}^{(i)}=0. Let w(1),,w(N)w^{(1)},\ldots,w^{(N)} be \mathbb{R}-valued random variables, assume that the random variables (w(1),𝐜(1),b1),,(w(N),𝐜(N),bN)(w^{(1)},{\bf c}^{(1)},b_{1}),\ldots,(w^{(N)},{\bf c}^{(N)},b_{N}) are IID, and denote by ν0\nu_{0} their common distribution.

Corollary 3.14.

Suppose that HH is as in (3.52) with associated μ\mu satisfying that Iμ,p(2)<I_{\mu,p}^{(2)}<\infty. Assume the input set DdD_{d} is bounded. Suppose that μ0ν0\mu_{0}\ll\nu_{0} and that dμ0dν0{\displaystyle\frac{d\mu_{0}}{d\nu_{0}}} is bounded. Then there exists a measurable function f:(×d×)NNf\colon(\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R})^{N}\to\mathbb{R}^{N} such that the ELM H^\widehat{H} with readout 𝐖=f((w(i),𝐜(i),bi)i=1,,N)N{\bf W}=f((w^{(i)},{\bf c}^{(i)},b_{i})_{i=1,\ldots,N})\in\mathbb{R}^{N} satisfies for any 𝐮Dd{\bf u}\in D_{d} the approximation error bound

𝔼[|H(𝐮)H^(𝐮)|2]1/2\displaystyle\mathbb{E}[|H({\bf u})-\widehat{H}({\bf u})|^{2}]^{1/2} CHN12\displaystyle\leq\frac{C_{H}}{N^{\frac{1}{2}}} (3.54)

with

CH=(2max(2Lσ22,|σ2(0)|2)max(1,sup𝐯Dd𝐯2))12Iμ,p(2)dμ0dν012.C_{H}=(2\max(2L_{\sigma_{2}}^{2},|\sigma_{2}(0)|^{2})\max(1,\sup_{{\bf v}\in D_{d}}\|{\bf v}\|^{2}))^{\frac{1}{2}}I_{\mu,p}^{(2)}\left\|\frac{d\mu_{0}}{d\nu_{0}}\right\|_{\infty}^{\frac{1}{2}}. (3.55)
Proof.

As noted above, HH is a recurrent generalized Barron functional with

μ(dw,d𝐚,d𝐜,db)=δ0(d𝐚)μ0(dw,d𝐜,db)\mu(dw,d{\bf a},d{\bf c},db)=\delta_{0}(d{\bf a})\mu_{0}(dw,d{\bf c},db)

and we can choose the maps AA, CC and the sequence 𝐁{\bf B} as 0. With these choices, the map ¯\bar{\mathcal{L}} and the sequence 𝐁¯0\bar{\bf B}_{0} appearing in the proof of Theorem 3.7 are both equal to 0. Consequently, from (3.39) we obtain that the measure μ~N\tilde{\mu}^{N} in the proof of Theorem 3.7 satisfies

𝒳Nf(w,𝐚,𝐜,b)μ~N(dw,d𝐚,d𝐜,db)=×d×f(w,0,𝐜,b)μ0(dw,d𝐜,db)\int_{\mathcal{X}^{N}}f(w,{\bf a},{\bf c},b)\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db)=\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}f(w,0,{\bf c},b)\mu_{0}(dw,d{\bf c},db)

for any ff for which the integrand in the last line is integrable with respect to μ0\mu_{0}. This shows that H¯N\bar{H}^{N} in (3.36) coincides with the map

H¯N(𝐳)=×d×wσ2(𝐜𝐳0+b)μ0(dw,d𝐜,db)=H(𝐳0),𝐳(Dd)\bar{H}^{N}({\bf z})=\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}w\sigma_{2}({\bf c}\cdot{\bf z}_{0}+b)\mu_{0}(dw,d{\bf c},db)=H({\bf z}_{0}),\quad{\bf z}\in(D_{d})^{\mathbb{Z}_{-}} (3.56)

and furthermore H¯N=HN\bar{H}^{N}=H^{N}. Therefore, denoting by HH also the functional 𝐳H(𝐳0){\bf z}\mapsto H({\bf z}_{0}), in the first step of the error estimate (3.42) only the last term is non-zero.

Furthermore, ψA,B,C(w,𝐚,𝐜,b)=(w,0,𝐜,b)\psi_{A,B,C}(w,{\bf a},{\bf c},b)=(w,0,{\bf c},b) and thus μA,B,C(dw,d𝐚,d𝐜,db)=δ0(d𝐚)μ0(dw,d𝐜,db)\mu_{A,B,C}(dw,d{\bf a},d{\bf c},db)=\delta_{0}(d{\bf a})\mu_{0}(dw,d{\bf c},db). Therefore, the hypothesis μ0ν0\mu_{0}\ll\nu_{0} implies μA,B,Cδ0(d𝐚)ν0\mu_{A,B,C}\ll\delta_{0}(d{\bf a})\nu_{0}.

Choosing Wi=w(i)Ndμ0dν0(w(i),𝐜(i),bi)W_{i}={\displaystyle\frac{w^{(i)}}{N}\frac{d\mu_{0}}{d\nu_{0}}(w^{(i)},{\bf c}^{(i)},b_{i})}, from (3.40) we thus obtain

𝔼[|H(𝐮)H^(𝐮)|2]N𝔼[W12σ2(𝐜(1)𝐮+b1)2].\displaystyle\mathbb{E}[|H({\bf u})-\widehat{H}({\bf u})|^{2}]\leq N\mathbb{E}[W_{1}^{2}\sigma_{2}({\bf c}^{(1)}\cdot{\bf u}+b_{1})^{2}]. (3.57)

The expectation in (3.57) can be estimated by

𝔼[W12σ2(𝐜(1)𝐮+b1)2]\displaystyle\mathbb{E}[W_{1}^{2}\sigma_{2}({\bf c}^{(1)}\cdot{\bf u}+b_{1})^{2}] =\displaystyle= ×d×w2N2dμ0dν0(w,𝐜,b)σ2(𝐜𝐮+b)2μ0(dw,d𝐜,db)\displaystyle\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}\frac{w^{2}}{N^{2}}\frac{d\mu_{0}}{d\nu_{0}}(w,{\bf c},b)\sigma_{2}({\bf c}\cdot{\bf u}+b)^{2}\mu_{0}(dw,d{\bf c},db)
\displaystyle\leq 2dμ0dν0×d×w2N2[Lσ22|𝐜𝐮+b|2+|σ2(0)|2]μ0(dw,d𝐜,db)\displaystyle 2\left\|\frac{d\mu_{0}}{d\nu_{0}}\right\|_{\infty}\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}\frac{w^{2}}{N^{2}}[L_{\sigma_{2}}^{2}|{\bf c}\cdot{\bf u}+b|^{2}+|\sigma_{2}(0)|^{2}]{\mu}_{0}(dw,d{\bf c},db)
\displaystyle\leq c1N2dμ0dν0×d×w2[𝐜2+|b|2+1]μ0(dw,d𝐜,db)\displaystyle\frac{c_{1}}{N^{2}}\left\|\frac{d\mu_{0}}{d\nu_{0}}\right\|_{\infty}\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}w^{2}[\|{\bf c}\|^{2}+|b|^{2}+1]{\mu}_{0}(dw,d{\bf c},db)

with c1=2max(2Lσ22,|σ2(0)|2)max(1,sup𝐯Dd𝐯2)c_{1}=2\max(2L_{\sigma_{2}}^{2},|\sigma_{2}(0)|^{2})\max(1,\sup_{{\bf v}\in D_{d}}\|{\bf v}\|^{2}). ∎
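The readout constructed in the preceding proof is a plain Monte Carlo average: when ν0 = μ0 the density ratio equals one and the weights reduce to Wi = w(i)/N. The following Python sketch illustrates this on a toy integral representation of the form (3.52); the particular μ0 (with w a deterministic function of (c, b)) and the choice σ2 = tanh are our own and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 2, 2000
sigma2 = np.tanh

def w_of(c, b):
    # Illustrative mu_0: (c, b) standard normal and w a bounded function of (c, b).
    return np.cos(b) / (1.0 + np.sum(c**2, axis=-1))

# Hidden weights sampled from nu_0 = mu_0, so dmu_0/dnu_0 = 1 and W_i = w^(i) / N.
c = rng.standard_normal((N, d)); b = rng.standard_normal(N)
W = w_of(c, b) / N

u = np.array([0.3, -0.7])
H_hat = W @ sigma2(c @ u + b)                     # ELM with Monte Carlo readout

# High-accuracy reference for H(u) = E_{mu_0}[ w * sigma2(c . u + b) ].
c_ref = rng.standard_normal((100_000, d)); b_ref = rng.standard_normal(100_000)
H_ref = np.mean(w_of(c_ref, b_ref) * sigma2(c_ref @ u + b_ref))
print(H_hat, H_ref)                               # the gap is O(N^{-1/2}) on average
```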

4 Learning from a single trajectory

Consider now the situation in which we aim to learn a functional HH from a single trajectory of input/output pairs. We observe nn\in\mathbb{N} data points (𝐳t,𝐲t)({\bf z}_{t},{\bf y}_{t}) for t{0,1,,(n1)}t\in\{0,-1,\ldots,-(n-1)\} and wish to recover the input/output relation HH from them. In contrast to static situations, these data points are not IID; instead, they constitute a single realization of a discrete-time stochastic process. Similar problems were considered, for instance, in [Ziem 22].

More formally, we consider a stationary stochastic process (𝐙t,𝐘t)t({\bf Z}_{t},{\bf Y}_{t})_{t\in\mathbb{Z}_{-}} and assume that nn sequential observations (𝐙t,𝐘t)({\bf Z}_{t},{\bf Y}_{t}), t{0,1,,(n1)}t\in\{0,-1,\ldots,-(n-1)\}, coming from a single realization of this process are available. Suppose that 𝐙{\bf Z} takes values in d\mathcal{I}_{d}. Let H𝒞H\in\mathcal{C} be the unknown functional and assume that the input/output relation between the data is given as H(𝐙)=𝔼[𝐘0|𝐙]H({\bf Z})=\mathbb{E}[{\bf Y}_{0}|{\bf Z}]. For example, this is satisfied if 𝐙{\bf Z} is any stationary process and 𝐘t=H(,𝐙t2,𝐙t1,𝐙t)+𝜺t{\bf Y}_{t}=H(\ldots,{\bf Z}_{t-2},{\bf Z}_{t-1},{\bf Z}_{t})+\bm{\varepsilon}_{t} for a stationary process (𝜺t)t(\bm{\varepsilon}_{t})_{t\in\mathbb{Z}_{-}} independent of 𝐙{\bf Z} and with 𝔼[𝜺0]=0\mathbb{E}[\bm{\varepsilon}_{0}]=0.
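As a concrete instance of this data-generating mechanism, the following Python sketch produces a single synthetic trajectory (Zt, Yt) in which Yt is a noisy evaluation of a finite-memory functional of the input; the functional, the input law, and the noise level are arbitrary illustrative choices (and, for simplicity, time runs forward rather than over the negative integers).

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 2

# Bounded, approximately stationary input process Z (a clipped AR(1) recursion).
Z = np.zeros((n, d))
for t in range(1, n):
    Z[t] = np.clip(0.5 * Z[t - 1] + 0.5 * rng.uniform(-1.0, 1.0, d), -1.0, 1.0)

# Outputs Y_t = H(Z_{t-1}, Z_t) + eps_t for a toy finite-memory functional H and an
# IID mean-zero noise independent of Z, so that E[Y_t | Z] = H(Z_{t-1}, Z_t).
H_true = lambda z_prev, z_now: np.tanh(z_now[0] - 0.5 * z_prev[1])
Y = np.array([H_true(Z[t - 1], Z[t]) for t in range(1, n)]) \
    + 0.05 * rng.standard_normal(n - 1)
```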

The goal is to learn the functional HH from the observations (𝐙t,𝐘t)t=0,1,,n+1({\bf Z}_{t},{\bf Y}_{t})_{t=0,-1,\ldots,-n+1} of the underlying input/output process. Recall that H(𝐙)=𝔼[𝐘0|𝐙]H({\bf Z})=\mathbb{E}[{\bf Y}_{0}|{\bf Z}] minimizes (G)=𝔼[G(𝐙)𝐘02]\mathcal{R}(G)=\mathbb{E}[\|G({\bf Z})-{\bf Y}_{0}\|^{2}] over all measurable maps G:dmG\colon\mathcal{I}_{d}\to\mathbb{R}^{m}. To learn HH from the data, one thus aims to find a minimizer of

n(G)=1ni=0n1G(𝐙in+1)𝐘i2,\mathcal{R}_{n}(G)=\frac{1}{n}\sum_{i=0}^{n-1}\|G({\bf Z}_{-i}^{-n+1})-{\bf Y}_{-i}\|^{2}, (4.1)

where we denote 𝐙in+1=(,0,0,𝐙n+1,,𝐙i1,𝐙i){\bf Z}_{-i}^{-n+1}=(\ldots,0,0,{\bf Z}_{-n+1},\ldots,{\bf Z}_{-i-1},{\bf Z}_{-i}).

To learn HH from data, we use an approximant of the type H^\widehat{H} introduced in (3.17) and write H^W=H^\widehat{H}_{W}=\widehat{H} to emphasize that WW contains the trainable parameters. Note that H^W\widehat{H}_{W} is now m\mathbb{R}^{m}-valued and is constructed by simply using a readout W𝕄m,NW\in\mathbb{M}_{m,N} that collects mm readout vectors in a matrix or, equivalently, by making the w(i)w^{(i)} in (3.17) m\mathbb{R}^{m}-valued. The results in Section 3, in particular the universality statement in Section 3.5, indicate that HH can be approximated well by such systems. Hence, we expect that a minimizer of n()\mathcal{R}_{n}(\cdot) over such systems should be a good approximation of HH. Therefore, we set out to solve the minimization problem

W^=arg minW𝒲R1ni=0n1H^W(𝐙in+1)𝐘i2\widehat{W}=\underset{W\in\mathcal{W}_{R}}{\mbox{{\rm arg min}}}\frac{1}{n}\sum_{i=0}^{n-1}\|\widehat{H}_{W}({\bf Z}_{-i}^{-n+1})-{\bf Y}_{-i}\|^{2} (4.2)

for 𝒲R\mathcal{W}_{R} given by the set of all random matrices W:Ωm×NW\colon\Omega\to\mathbb{R}^{m\times N} which satisfy for each row Wi,W_{i,\cdot} the bound Wi,R\|W_{i,\cdot}\|\leq R for some regularization constant R>0R>0 and which are measurable with respect to the data (𝐙t,𝐘t)t=0,1,,n+1({\bf Z}_{t},{\bf Y}_{t})_{t=0,-1,\ldots,-n+1} and the randomly generated parameters (wi,𝐚(i),𝐜(i),bi)i=1,,N(w_{i},{\bf a}^{(i)},{\bf c}^{(i)},b_{i})_{i=1,\ldots,N}, AESN{A}^{\mathrm{ESN}}, 𝐁ESN{\bf B}^{\mathrm{ESN}}, CESN{C}^{\mathrm{ESN}}. The latter requirement accounts for the fact that the randomly generated parameters are fixed after their initialization and hence the trainable weights may depend on them. The choice of RR will be specified later on in the statement of Theorem 4.2.
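For m = 1, problem (4.2) is a least-squares problem over the ELM features subject to a norm constraint on the readout. The following Python sketch (our own illustrative implementation, not necessarily the explicit solution referenced below) computes a minimizer by checking the unconstrained least-squares solution first and, if it violates the constraint, locating the ridge parameter at which the constraint becomes active.

```python
import numpy as np

def constrained_readout(Phi, y, R, tol=1e-8):
    """Minimize ||Phi w - y||^2 over w with ||w|| <= R (the m = 1 case of (4.2)).
    The KKT conditions show the minimizer is a ridge solution
    w(l) = (Phi^T Phi + l I)^{-1} Phi^T y, with l = 0 if the unconstrained
    solution is feasible and otherwise l > 0 chosen so that ||w(l)|| = R."""
    w_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]
    if np.linalg.norm(w_ls) <= R:
        return w_ls
    G, v = Phi.T @ Phi, Phi.T @ y
    ridge = lambda l: np.linalg.solve(G + l * np.eye(G.shape[0]), v)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(ridge(hi)) > R:          # bracket the active ridge parameter
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):           # bisection: ||w(l)|| decreases in l
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(ridge(mid)) > R else (lo, mid)
    return ridge(hi)
```

For m > 1 the objective and the row-wise constraints in (4.2) decouple across the rows of W, so the same routine can be applied separately to each row.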

The problem (4.2) admits an explicit solution (see, for instance, [Gono 21a, Section 4.3]). Once computed, the learned functional is H^W^\widehat{H}_{\widehat{W}}. The learning performance of the algorithm can now be evaluated by assessing its learning error (or generalization error)

𝔼[H(𝐙¯)H^W^(𝐙¯)2],\mathbb{E}[\|H(\bar{\bf Z})-\widehat{H}_{\widehat{W}}(\bar{\bf Z})\|^{2}],

where 𝐙¯\bar{\bf Z} is an IID copy of 𝐙{\bf Z}, independent of all other random variables introduced so far. An analysis of the learning error is carried out below in Theorem 4.2, in which we assume, inspired by the developments in [Gono 20b], that the input and the output processes are of the type introduced in the following definition.

Definition 4.1.

An k\mathbb{R}^{k}-valued random process 𝐔{\bf U} is said to have a causal Bernoulli shift structure if there exist qq\in\mathbb{N}, a measurable map G:(q)kG\colon(\mathbb{R}^{q})^{\mathbb{Z}_{-}}\to\mathbb{R}^{k}, and an IID collection (𝝃t)t(\bm{\xi}_{t})_{t\in\mathbb{Z}_{-}} of q\mathbb{R}^{q}-valued random variables such that

𝐔t=G(,𝝃t1,𝝃t),t.{\bf U}_{t}=G(\ldots,\bm{\xi}_{t-1},\bm{\xi}_{t}),\quad t\in\mathbb{Z}_{-}.

The process 𝐔{\bf U} is said to have geometric decay if there exist Cdep>0C_{\mathrm{dep}}>0, λdep(0,1)\lambda_{\mathrm{dep}}\in(0,1) such that the weak dependence coefficient θ(τ):=𝔼[𝐔0𝐔~0τ]\theta(\tau):=\mathbb{E}[\|{\bf U}_{0}-\tilde{{\bf U}}^{\tau}_{0}\|] satisfies θ(τ)Cdepλdepτ\theta(\tau)\leq C_{\mathrm{dep}}\lambda_{\mathrm{dep}}^{\tau} for all τ\tau\in\mathbb{N}, where 𝐔~0τ=G(,𝝃~τ1,𝝃~τ,𝝃τ+1,,𝝃0)\tilde{{\bf U}}^{\tau}_{0}=G(\ldots,\tilde{\bm{\xi}}_{-\tau-1},\tilde{\bm{\xi}}_{-\tau},\bm{\xi}_{-\tau+1},\ldots,\bm{\xi}_{0}) for 𝝃~\tilde{\bm{\xi}} an independent copy of 𝝃\bm{\xi}.
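As an illustration of Definition 4.1, the following Python sketch estimates the weak dependence coefficient θ(τ) by Monte Carlo for a simple linear causal Bernoulli shift with geometrically decaying coefficients; the process, the truncation depth, and the sample size are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, depth, n_mc = 0.7, 200, 5000     # decay rate, series truncation, Monte Carlo size

def theta(tau):
    """Monte Carlo estimate of theta(tau) = E[|U_0 - U~_0^tau|] for the linear
    causal Bernoulli shift U_t = sum_{j >= 0} rho^j xi_{t-j} (truncated at `depth`),
    where U~_0^tau replaces xi_{-tau}, xi_{-tau-1}, ... by independent copies."""
    xi = rng.standard_normal((n_mc, depth))                       # column j stores xi_{-j}
    xi_tilde = xi.copy()
    xi_tilde[:, tau:] = rng.standard_normal((n_mc, depth - tau))  # resample lags >= tau
    weights = rho ** np.arange(depth)
    return np.mean(np.abs((xi - xi_tilde) @ weights))

for tau in (1, 2, 4, 8):
    print(tau, theta(tau))            # decays at the geometric rate rho^tau
```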

Theorem 4.2.

Assume σ1(x)=x\sigma_{1}(x)=x, m=1m=1, p(1,)p\in(1,\infty), and that the input set DdD_{d} is bounded. Consider a functional H𝒞H\in\mathcal{C} that has a representation of the type (2.2) with 𝐁q{\bf B}\in\ell^{q}, bounded linear maps C:dqC\colon\mathbb{R}^{d}\to\ell^{q}, A:qqA\colon\ell^{q}\to\ell^{q} satisfying |A|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and Iμ,p(2)<I_{\mu,p}^{(2)}<\infty. Let λ(|A|,1)\lambda\in\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},1\right). Consider now as approximant an ESN-ELM system such that |AESN|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and that the matrix KK in (3.18) is invertible. Suppose that w(1)w^{(1)} is bounded, μA,B,Cν\mu_{A,B,C}\ll\nu, and dμA,B,Cdν\frac{d\mu_{A,B,C}}{d\nu} is bounded by a constant κ>0\kappa>0. Let RκNw(1)R\geq\frac{\kappa}{\sqrt{N}}\|w^{(1)}\|_{\infty}, r(|AESN|,1)r\in({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},1). Assume Y{Y} is bounded by a constant M~\widetilde{M} and (Y,𝐙)({Y},{\bf Z}) has a causal Bernoulli shift structure with geometric decay and log(n)<nlog(λmax1)\log(n)<n\log(\lambda_{max}^{-1}), where λmax=max(r,λdep)\lambda_{max}=\max(r,\lambda_{\mathrm{dep}}). Then the trained ESN-ELM H^𝐖^\widehat{H}_{\widehat{\bf W}} satisfies the learning error bound

𝔼[|H(𝐙¯)H^𝐖^(𝐙¯)|2]1/2Capprox(λNd+|AESN|T+1N12)+Cest(RN12log(n)n)12\displaystyle\mathbb{E}[|H(\bar{\bf Z})-\widehat{H}_{\widehat{\bf W}}(\bar{\bf Z})|^{2}]^{1/2}\leq C_{\mathrm{approx}}\left(\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}\right)+C_{\mathrm{est}}\left(RN^{\frac{1}{2}}\frac{\sqrt{\log(n)}}{\sqrt{n}}\right)^{\frac{1}{2}} (4.3)

with

Capprox=c~1max(κ12,1)max(|C|(1|λ1A|p)1/p,𝐁q1|A|,1)×max(Iμ,p,Iμ,p(2))(1+|CESN|+𝐁ESN1|AESN||KΛ|),C_{\mathrm{approx}}=\tilde{c}_{1}\max(\kappa^{\frac{1}{2}},1)\max\left(\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{\left(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\lambda^{-1}{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}\right)^{1/p}},\frac{\|{\bf B}\|_{q}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}},1\right)\\ \times\max\left(I_{\mu,p},I_{\mu,p}^{(2)}\right)\cdot\left(1+\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\right), (4.4)
Cest=c~2[(|CESN|+𝐁ESN+1)3(1|AESN|)3log(λmax1)max(R,1)(𝔼[𝐚(1)2]12+𝔼[𝐜(1)2]12+𝔼[|b1|2]12+1)×(r2|||AESN|||2)1/2max([κw(1)]1,1)(r1r+1λmax+(|CESN|2+1)12λmax1log(λmax1))]12,C_{\mathrm{est}}=\tilde{c}_{2}\Bigg{[}\frac{({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|+1)^{3}}{(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{3}\sqrt{\log(\lambda_{max}^{-1})}}\max(R,1)(\mathbb{E}[\|{\bf a}^{(1)}\|^{2}]^{\frac{1}{2}}+\mathbb{E}[\|{\bf c}^{(1)}\|^{2}]^{\frac{1}{2}}+\mathbb{E}[|b_{1}|^{2}]^{\frac{1}{2}}+1)\\ \times(r^{2}-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2})^{-1/2}\max([\kappa\|w^{(1)}\|_{\infty}]^{-1},1)\Bigg{(}\frac{r}{1-r}+\frac{1}{\lambda_{max}}+\frac{({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}+1)^{\frac{1}{2}}\lambda_{max}^{-1}}{\log(\lambda_{max}^{-1})}\Bigg{)}\Bigg{]}^{\frac{1}{2}}, (4.5)

for c~1\tilde{c}_{1} only depending on dd, σ2\sigma_{2}, pp, diam(Dd)\mathrm{diam}(D_{d}), and λ\lambda (see (3.43)) and c~2\tilde{c}_{2} only depending on σ2\sigma_{2}, M~\widetilde{M}, diam(Dd)\mathrm{diam}(D_{d}), and CdepC_{\mathrm{dep}} (see (4.19)).

Proof.

Let (𝐙¯,Y¯)(\bar{\bf Z},\bar{{Y}}) be an IID copy of (𝐙,Y)({\bf Z},{Y}), independent of all other random variables introduced so far. Independence and the assumption H(𝐙)=𝔼[Y0|𝐙]H({\bf Z})=\mathbb{E}[{Y}_{0}|{\bf Z}] imply that 𝔼[H^𝐖^(𝐙¯)H(𝐙¯)]=𝔼[H^𝐖^(𝐙¯)Y¯0]\mathbb{E}[\widehat{H}_{\widehat{\bf W}}(\bar{\bf Z})H(\bar{\bf Z})]=\mathbb{E}[\widehat{H}_{\widehat{\bf W}}(\bar{\bf Z})\bar{Y}_{0}] and 𝔼[H^𝐖(𝐙¯)H(𝐙¯)]=𝔼[H^𝐖(𝐙¯)Y¯0]\mathbb{E}[\widehat{H}_{\bf W}(\bar{\bf Z})H(\bar{\bf Z})]=\mathbb{E}[\widehat{H}_{{\bf W}}(\bar{\bf Z})\bar{Y}_{0}] for any 𝐖𝒲R{\bf W}\in\mathcal{W}_{R}. Therefore, for any 𝐖𝒲R{\bf W}\in\mathcal{W}_{R} we obtain

𝔼[|H(𝐙¯)H^𝐖^(𝐙¯)|2]\displaystyle\mathbb{E}[|H(\bar{\bf Z})-\widehat{H}_{\widehat{\bf W}}(\bar{\bf Z})|^{2}] =𝔼[|H(𝐙¯)H^𝐖(𝐙¯)|2]+𝔼[|Y¯0H^𝐖^(𝐙¯)|2]𝔼[|Y¯0H^𝐖(𝐙¯)|2]\displaystyle=\mathbb{E}[|H(\bar{\bf Z})-\widehat{H}_{\bf W}(\bar{\bf Z})|^{2}]+\mathbb{E}[|\bar{Y}_{0}-\widehat{H}_{\widehat{\bf W}}(\bar{\bf Z})|^{2}]-\mathbb{E}[|\bar{Y}_{0}-\widehat{H}_{\bf W}(\bar{\bf Z})|^{2}] (4.6)
𝔼[|H(𝐙¯)H^𝐖(𝐙¯)|2]+𝔼[(H^𝐖^)n(H^𝐖^)+n(H^𝐖)(H^𝐖)],\displaystyle\leq\mathbb{E}[|H(\bar{\bf Z})-\widehat{H}_{\bf W}(\bar{\bf Z})|^{2}]+\mathbb{E}[\mathcal{R}(\widehat{H}_{\widehat{\bf W}})-\mathcal{R}_{n}(\widehat{H}_{\widehat{\bf W}})+\mathcal{R}_{n}(\widehat{H}_{\bf W})-\mathcal{R}(\widehat{H}_{\bf W})],

where we used in the last step that (4.2) implies n(H^𝐖^)n(H^𝐖)\mathcal{R}_{n}(\widehat{H}_{\widehat{\bf W}})\leq\mathcal{R}_{n}(\widehat{H}_{\bf W}).

The first term in the right hand side of (4.6) can be bounded using Theorem 3.7. Indeed, Theorem 3.7 proves that there exists a measurable function f:(𝒳N)NNf\colon(\mathcal{X}^{N})^{N}\to\mathbb{R}^{N} such that the ESN-ELM H^\widehat{H} with readout 𝐖=f((w(i),𝐚(i),𝐜(i),bi)i=1,,N){\bf W}=f((w^{(i)},{\bf a}^{(i)},{\bf c}^{(i)},b_{i})_{i=1,\ldots,N}) satisfies

𝔼[|H(𝐙¯)H^𝐖(𝐙¯)|2]1/2\displaystyle\mathbb{E}[|H(\bar{{\bf Z}})-\widehat{H}_{\bf W}(\bar{{\bf Z}})|^{2}]^{1/2} CH,ESN[λNd+|AESN|T+1N12dμA,B,Cdν12]\displaystyle\leq C_{H,\mathrm{ESN}}\left[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}^{\frac{1}{2}}\right] (4.7)

with CH,ESNC_{H,\mathrm{ESN}} given in (3.28). Here we used independence and that 𝐙¯\bar{{\bf Z}} takes values in d\mathcal{I}_{d}. Furthermore, in the proof of Theorem 3.7 ff and 𝐖{\bf W} are explicitly chosen as Wi=w(i)NdμA,B,Cdν(w(i),𝐚(i),𝐜(i),bi){\displaystyle W_{i}=\frac{w^{(i)}}{N}\frac{d\mu_{A,B,C}}{d\nu}(w^{(i)},{\bf a}^{(i)},{\bf c}^{(i)},b_{i})}. Therefore, \mathbb{P}-a.s. the random vector 𝐖{\bf W} satisfies 𝐖Nmaxi=1NWiκNw(1)R\|{\bf W}\|\leq\sqrt{N}\max_{i=1}^{N}\|W_{i}\|_{\infty}\leq\frac{\kappa}{\sqrt{N}}\|w^{(1)}\|_{\infty}\leq R and consequently 𝐖𝒲R{\bf W}\in\mathcal{W}_{R}.

It remains to analyze the second term in the right hand side of (4.6). For notational simplicity we now consider (wi,𝐚(i),𝐜(i),bi)i=1,,N(w_{i},{\bf a}^{(i)},{\bf c}^{(i)},b_{i})_{i=1,\ldots,N}, AESN{A}^{\mathrm{ESN}}, 𝐁ESN{\bf B}^{\mathrm{ESN}}, CESN{C}^{\mathrm{ESN}} as fixed; formally this can be justified by performing the subsequent computations conditionally on these random variables as, for instance, in [Gono 21a, Proof of Theorem 4.3]. Then

𝔼[(H^𝐖^)n(H^𝐖^)+n(H^𝐖)(H^𝐖)]2𝔼[sup𝐰N:𝐰R|n(H^𝐰)(H^𝐰)|].\mathbb{E}[\mathcal{R}(\widehat{H}_{\widehat{\bf W}})-\mathcal{R}_{n}(\widehat{H}_{\widehat{\bf W}})+\mathcal{R}_{n}(\widehat{H}_{\bf W})-\mathcal{R}(\widehat{H}_{\bf W})]\leq 2\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}(\widehat{H}_{\bf w})-\mathcal{R}(\widehat{H}_{\bf w})|\right]. (4.8)

Let λ¯=(r2|AESN|2)1/2\bar{\lambda}=(r^{2}-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2})^{1/2}, denote 𝒞RC={H^𝐰:𝐰N,𝐰R}\mathcal{C}^{RC}=\{\widehat{H}_{\bf w}\,:\,{\bf w}\in\mathbb{R}^{N},\|{\bf w}\|\leq R\} and note that each H^𝐰\widehat{H}_{\bf w} can be written as H^𝐰=h𝐰HF\widehat{H}_{\bf w}=h_{\bf w}\circ H^{F} for HF(𝐳)=(𝐱0ESN(𝐳),𝐳0,λ¯𝐱1ESN(𝐳))H^{F}({\bf z})=(\mathbf{x}_{0}^{\mathrm{ESN}}({\bf z}),{\bf z}_{0},\bar{\lambda}\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf z})), h𝐰((𝐱0,𝐳0,𝐱1))=i=1Nwiσ2(λ¯1𝐚(i)𝐱1+𝐜(i)𝐳0+bi)h_{\bf w}(({\bf x}_{0},{\bf z}_{0},{\bf x}_{-1}))=\sum_{i=1}^{N}w_{i}\sigma_{2}(\bar{\lambda}^{-1}{\bf a}^{(i)}\cdot\mathbf{x}_{-1}+{\bf c}^{(i)}\cdot{\bf z}_{0}+b_{i}). Furthermore, HF:(Dd)N+d+NH^{F}\colon(D_{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{N+d+N} is the reservoir functional associated to the map F:N+d+N×dN+d+NF\colon\mathbb{R}^{N+d+N}\times\mathbb{R}^{d}\to\mathbb{R}^{N+d+N}, F((𝐱0,𝐱1,𝐱2),𝐳)=(σ1(AESN𝐱0+CESN𝐳+𝐁ESN),𝐳,λ¯𝐱0)F(({\bf x}_{0},{\bf x}_{1},{\bf x}_{2}),{\bf z})=(\sigma_{1}({A}^{\mathrm{ESN}}\mathbf{x}_{0}+{C}^{\mathrm{ESN}}{\bf z}+{\bf B}^{\mathrm{ESN}}),{\bf z},\bar{\lambda}{\bf x}_{0}), which is just the reservoir map associated to the ESN determined by AESN{A}^{\mathrm{ESN}}, 𝐁ESN{\bf B}^{\mathrm{ESN}}, CESN{C}^{\mathrm{ESN}} augmented by the current input and the (scaled) previous state.

Then for each 𝐰N{\bf w}\in\mathbb{R}^{N} with 𝐰R\|{\bf w}\|\leq R we have

|h𝐰((𝐱0,𝐳0,𝐱1))h𝐰((𝐱¯0,𝐳¯0,𝐱¯1))|i=1N|wi|Lσ2|λ¯1𝐚(i)(𝐱1𝐱¯1)+𝐜(i)(𝐳0𝐳¯0)|Lσ2Rmax((i=1Nλ¯1𝐚(i)2)1/2,(i=1N𝐜(i)2)1/2)[𝐱1𝐱¯1+𝐳0𝐳¯0]Lh¯(𝐱0,𝐳0,𝐱1)(𝐱¯0,𝐳¯0,𝐱¯1),|h_{\bf w}(({\bf x}_{0},{\bf z}_{0},{\bf x}_{-1}))-h_{\bf w}((\bar{{\bf x}}_{0},\bar{{\bf z}}_{0},\bar{{\bf x}}_{-1}))|\leq\sum_{i=1}^{N}|w_{i}|L_{\sigma_{2}}|\bar{\lambda}^{-1}{\bf a}^{(i)}\cdot(\mathbf{x}_{-1}-\bar{{\bf x}}_{-1})+{\bf c}^{(i)}\cdot({\bf z}_{0}-\bar{{\bf z}}_{0})|\\ \leq L_{\sigma_{2}}R\max((\sum_{i=1}^{N}\|\bar{\lambda}^{-1}{\bf a}^{(i)}\|^{2})^{1/2},(\sum_{i=1}^{N}\|{\bf c}^{(i)}\|^{2})^{1/2})[\|\mathbf{x}_{-1}-\bar{{\bf x}}_{-1}\|+\|{\bf z}_{0}-\bar{{\bf z}}_{0}\|]\leq\overline{L_{h}}\|({\bf x}_{0},{\bf z}_{0},{\bf x}_{-1})-(\bar{{\bf x}}_{0},\bar{{\bf z}}_{0},\bar{{\bf x}}_{-1})\|,

with Lh¯=2Lσ2Rmax((i=1Nλ¯1𝐚(i)2)1/2,(i=1N𝐜(i)2)1/2)\overline{L_{h}}=2L_{\sigma_{2}}R\max((\sum_{i=1}^{N}\|\bar{\lambda}^{-1}{\bf a}^{(i)}\|^{2})^{1/2},(\sum_{i=1}^{N}\|{\bf c}^{(i)}\|^{2})^{1/2}) and |h𝐰(𝟎)|Lh,0|h_{\bf w}({\bf 0})|\leq L_{h,0} with Lh,0=R(i=1Nσ2(bi)2)1/2L_{h,0}=R(\sum_{i=1}^{N}\sigma_{2}(b_{i})^{2})^{1/2}. Furthermore, for any 𝐳Dd{\bf z}\in D_{d} the reservoir map F(,𝐳)F(\cdot,{\bf z}) is an rr-contraction with r=(|AESN|2+λ¯2)1/2<1r=({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}+\bar{\lambda}^{2})^{1/2}<1 and for any (𝐱0,𝐱1,𝐱2)N×Dd×N({\bf x}_{0},{\bf x}_{1},{\bf x}_{2})\in\mathbb{R}^{N}\times D_{d}\times\mathbb{R}^{N} and 𝐳,𝐳¯d{\bf z},\bar{{\bf z}}\in\mathbb{R}^{d} we have

F((𝐱0,𝐱1,𝐱2),𝐳)F((𝐱0,𝐱1,𝐱2),𝐳¯)=(CESN(𝐳𝐳¯),𝐳𝐳¯,0)≤(|CESN|2+1)1/2𝐳𝐳¯.\|F(({\bf x}_{0},{\bf x}_{1},{\bf x}_{2}),{\bf z})-F(({\bf x}_{0},{\bf x}_{1},{\bf x}_{2}),\bar{{\bf z}})\|=\|({C}^{\mathrm{ESN}}({\bf z}-\bar{{\bf z}}),{\bf z}-\bar{{\bf z}},0)\|\leq({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}+1)^{1/2}\|{\bf z}-\bar{{\bf z}}\|.

In addition, for 𝐳(Dd){\bf z}\in(D_{d})^{\mathbb{Z}_{-}} we have from (3.23)

HF(𝐳)=(𝐱0ESN(𝐳)2+𝐳02+λ¯𝐱1ESN(𝐳)2)1/2M,\|H^{F}({\bf z})\|=(\|\mathbf{x}_{0}^{\mathrm{ESN}}({\bf z})\|^{2}+\|{\bf z}_{0}\|^{2}+\|\bar{\lambda}\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf z})\|^{2})^{1/2}\leq M_{\mathcal{F}},

where M=(2[(1|AESN|)1(|CESN|M+𝐁ESN)]2+M2)1/2M_{\mathcal{F}}=(2[(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{-1}({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|)]^{2}+M^{2})^{1/2} with MM chosen such that 𝐳M\|{\bf z}\|\leq M for all 𝐳Dd{\bf z}\in D_{d}. Finally, let CL=RM+M~C_{L}=RM_{\mathcal{F}}+\widetilde{M} and note that L(x,y)=min(CL,|xy|)2L(x,y)=\min(C_{L},|x-y|)^{2} satisfies for all H𝒞RCH\in\mathcal{C}^{RC}, 𝐳(Dd){\bf z}\in(D_{d})^{\mathbb{Z}_{-}} and y{y}\in\mathbb{R} with |y|M~|{y}|\leq\widetilde{M} that L(H(𝐳),y)=|H(𝐳)y|2L(H({\bf z}),{y})=|H({\bf z})-{y}|^{2} and z[min(CL,|z|)]2z\mapsto[\min(C_{L},|z|)]^{2} is Lipschitz-continuous with constant 2CL2C_{L}. This can be seen by writing

|min(CL,|z1|)2min(CL,|z2|)2|\displaystyle|\min(C_{L},|z_{1}|)^{2}-\min(C_{L},|z_{2}|)^{2}| =|min(CL,|z1|)min(CL,|z2|)||min(CL,|z1|)+min(CL,|z2|)|\displaystyle=|\min(C_{L},|z_{1}|)-\min(C_{L},|z_{2}|)||\min(C_{L},|z_{1}|)+\min(C_{L},|z_{2}|)|
2CL||z2||z1||2CL|z2z1|.\displaystyle\leq 2C_{L}\big{|}|z_{2}|-|z_{1}|\big{|}\leq 2C_{L}|z_{2}-z_{1}|.

Consider the Rademacher-type complexity as defined in [Gono 20b]: let ε0,ε1,\varepsilon_{0},\varepsilon_{1},\ldots be IID Rademacher random variables which are independent of the IID copies 𝐙~(l)\tilde{{\bf Z}}^{(l)}, ll\in\mathbb{N}, of 𝐙{\bf Z}. For each kk\in\mathbb{N} we define

k(𝒞RC)=1k𝔼[supH𝒞RC|l=0k1εlH(𝐙~(l))|].{\mathfrak{R}}_{k}(\mathcal{C}^{RC})=\frac{1}{k}\mathbb{E}\left[\sup_{H\in\mathcal{C}^{RC}}\left|\sum_{l=0}^{k-1}\varepsilon_{l}H(\tilde{{\bf Z}}^{(l)})\right|\right].

Denote by 𝐗l\mathbf{X}^{l} the vector with components Xjl=σ2(𝐚(j)𝐱1ESN(𝐙~(l))+𝐜(j)𝐙~0(l)+bj){X}^{l}_{j}=\sigma_{2}({\bf a}^{(j)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}(\tilde{\bf Z}^{(l)})+{\bf c}^{(j)}\cdot\tilde{\bf Z}_{0}^{(l)}+b_{j}), j=1,,Nj=1,\ldots,N. Then we can rewrite H^𝐰(𝐙~(l))=𝐰𝐗l\widehat{H}_{{\bf w}}(\tilde{\bf Z}^{(l)})={\bf w}^{\top}\mathbf{X}^{l} and estimate using the Cauchy-Schwarz and Jensen inequalities, independence, and the fact that 𝔼[εlεj]=δlj\mathbb{E}[\varepsilon_{l}\varepsilon_{j}]=\delta_{lj}

k(𝒞RC)\displaystyle{\mathfrak{R}}_{k}(\mathcal{C}^{RC}) =1k𝔼[sup𝐰N:𝐰R|l=0k1εlH^𝐰(𝐙~(l))|]=1k𝔼[sup𝐰N:𝐰R|𝐰l=0k1εl𝐗l|]\displaystyle=\frac{1}{k}\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}\left|\sum_{l=0}^{k-1}\varepsilon_{l}\widehat{H}_{\bf w}(\tilde{{\bf Z}}^{(l)})\right|\right]=\frac{1}{k}\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}\left|{\bf w}^{\top}\sum_{l=0}^{k-1}\varepsilon_{l}\mathbf{X}^{l}\right|\right]
Rk𝔼[l=0k1εl𝐗l]Rk𝔼[l=0k1εl𝐗l2]1/2=Rk(𝔼[l=0k1j=0k1εlεj(𝐗l)𝐗j])1/2\displaystyle\leq\frac{R}{k}\mathbb{E}\left[\left\|\sum_{l=0}^{k-1}\varepsilon_{l}\mathbf{X}^{l}\right\|\right]\leq\frac{R}{k}\mathbb{E}\left[\left\|\sum_{l=0}^{k-1}\varepsilon_{l}\mathbf{X}^{l}\right\|^{2}\right]^{1/2}=\frac{R}{k}\left(\mathbb{E}\left[\sum_{l=0}^{k-1}\sum_{j=0}^{k-1}\varepsilon_{l}\varepsilon_{j}(\mathbf{X}^{l})^{\top}\mathbf{X}^{j}\right]\right)^{1/2}
=Rk(l=0k1𝔼[(𝐗l)𝐗l])1/2=Rk(𝔼[(𝐗1)𝐗1])1/2.\displaystyle=\frac{R}{k}\left(\sum_{l=0}^{k-1}\mathbb{E}\left[(\mathbf{X}^{l})^{\top}\mathbf{X}^{l}\right]\right)^{1/2}=\frac{R}{\sqrt{k}}\left(\mathbb{E}\left[(\mathbf{X}^{1})^{\top}\mathbf{X}^{1}\right]\right)^{1/2}.

Altogether, the hypotheses of [Gono 20b, Corollary 8(ii)] are satisfied. Writing n(G)=1ni=0n1|G(𝐙i)Yi|2\mathcal{R}_{n}^{\infty}(G)=\frac{1}{n}\sum_{i=0}^{n-1}|G({\bf Z}_{-i}^{-\infty})-{Y}_{-i}|^{2} for the idealized empirical risk and applying this result proves that

𝔼[sup𝐰N:𝐰R|n(H^𝐰)(H^𝐰)|]C1n+C2log(n)n+C3,abslog(n)n\displaystyle\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}^{\infty}(\widehat{H}_{{\bf w}})-\mathcal{R}(\widehat{H}_{{\bf w}})|\right]\leq\frac{C_{1}}{n}+\frac{C_{2}{\log(n)}}{{n}}+\frac{C_{3,abs}\sqrt{\log(n)}}{\sqrt{n}} (4.9)

for all nn\in\mathbb{N} satisfying log(n)<nlog(λmax1)\log(n)<n\log(\lambda_{max}^{-1}), where λmax=max(r,λdep)\lambda_{max}=\max(r,\lambda_{\mathrm{dep}}) and

C1\displaystyle C_{1} =2CL2MLh¯+Cdepλmax,C3,abs=8CLlog(λmax1)[R(𝔼[(𝐗1)𝐗1])1/2+𝔼[|Y0|22]1/2],\displaystyle=2C_{L}\frac{2M_{\mathcal{F}}\overline{L_{h}}+C_{\mathrm{dep}}}{\lambda_{max}},\quad\quad C_{3,abs}=\frac{8C_{L}}{\sqrt{\log(\lambda_{max}^{-1})}}\left[R\left(\mathbb{E}\left[(\mathbf{X}^{1})^{\top}\mathbf{X}^{1}\right]\right)^{1/2}+\mathbb{E}\left[|{Y}_{0}|^{2}_{2}\right]^{1/2}\right], (4.10)
C2\displaystyle C_{2} =4CL(MLh¯+Lh,0+(|CESN|2+1)1/2Lh¯Cdepλmax1)+2𝔼[|Y0|2]log(λmax1).\displaystyle=\frac{4C_{L}(M_{\mathcal{F}}\overline{L_{h}}+L_{h,0}+({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}+1)^{1/2}\overline{L_{h}}C_{\mathrm{dep}}\lambda_{max}^{-1})+2\mathbb{E}[|{Y}_{0}|^{2}]}{\log(\lambda_{max}^{-1})}. (4.11)

Furthermore, [Gono 20b, Proposition 5] yields that \mathbb{P}-a.s.

sup𝐰N:𝐰R|n(H^𝐰)n(H^𝐰)|4CLrLh¯M(1r)n.\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}\left|\mathcal{R}_{n}(\widehat{H}_{{\bf w}})-\mathcal{R}_{n}^{\infty}(\widehat{H}_{{\bf w}})\right|\leq\frac{4C_{L}r\overline{L_{h}}M_{\mathcal{F}}}{(1-r)n}.

Combining this with (4.9) we obtain

𝔼[sup𝐰N:𝐰R|n(H^𝐰)(H^𝐰)|]\displaystyle\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}(\widehat{H}_{{\bf w}})-\mathcal{R}(\widehat{H}_{{\bf w}})|\right] (4.12)
𝔼[sup𝐰N:𝐰R|n(H^𝐰)n(H^𝐰)|]+𝔼[sup𝐰N:𝐰R|n(H^𝐰)(H^𝐰)|]\displaystyle\leq\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}(\widehat{H}_{{\bf w}})-\mathcal{R}_{n}^{\infty}(\widehat{H}_{{\bf w}})|\right]+\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}^{\infty}(\widehat{H}_{{\bf w}})-\mathcal{R}(\widehat{H}_{{\bf w}})|\right]
4CLrLh¯M(1r)n+C1n+C2log(n)n+C3,abslog(n)n\displaystyle\leq\frac{4C_{L}r\overline{L_{h}}M_{\mathcal{F}}}{(1-r)n}+\frac{C_{1}}{n}+\frac{C_{2}{\log(n)}}{{n}}+\frac{C_{3,abs}\sqrt{\log(n)}}{\sqrt{n}}
max(R,1)max(Lh¯,Lh,0,1,R(𝔼[(𝐗1)𝐗1])1/2)Cest,1log(n)n,\displaystyle\leq\max(R,1)\max(\overline{L_{h}},L_{h,0},1,R\left(\mathbb{E}\left[(\mathbf{X}^{1})^{\top}\mathbf{X}^{1}\right]\right)^{1/2})C_{est,1}\frac{\sqrt{\log(n)}}{\sqrt{n}},

where

Cest,1\displaystyle C_{est,1} =32max(1,Cdep)log(λmax1)max(2(|CESN|M+𝐁ESN)2(1|AESN|)2+M2,1)max(M~,1)\displaystyle=\frac{32\max(1,C_{\mathrm{dep}})}{\sqrt{\log(\lambda_{max}^{-1})}}\max(2\frac{({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|)^{2}}{(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{2}}+M^{2},1)\max(\widetilde{M},1) (4.13)
×(r(1r)+1λmax+(2+(|CESN|2+1)1/2λmax1)+𝔼[|Y0|2]log(λmax1)+𝔼[|Y0|2]1/2).\displaystyle\quad\quad\times\Bigg{(}\frac{r}{(1-r)}+\frac{1}{\lambda_{max}}+\frac{(2+({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}+1)^{1/2}\lambda_{max}^{-1})+\mathbb{E}[|{Y}_{0}|^{2}]}{\log(\lambda_{max}^{-1})}+\mathbb{E}\left[|{Y}_{0}|^{2}\right]^{1/2}\Bigg{)}.

Recall that (w(i),𝐚(i),𝐜(i),bi)i=1,,N(w^{(i)},{\bf a}^{(i)},{\bf c}^{(i)},b_{i})_{i=1,\ldots,N} was considered fixed; now we take expectation also with respect to its distribution ν\nu and obtain in the same way as we obtained (4.12)

𝔼[sup𝐰N:𝐰R|n(H^𝐰)(H^𝐰)|]\displaystyle\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}(\widehat{H}_{{\bf w}})-\mathcal{R}(\widehat{H}_{{\bf w}})|\right] max(R,1)max(2RLσ2𝔼[max((i=1Nλ¯1𝐚(i)2)12,(i=1N𝐜(i)2)12)],\displaystyle\leq\max(R,1)\max\Bigg{(}2RL_{\sigma_{2}}\mathbb{E}\left[\max\left((\sum_{i=1}^{N}\|\bar{\lambda}^{-1}{\bf a}^{(i)}\|^{2})^{\frac{1}{2}},(\sum_{i=1}^{N}\|{\bf c}^{(i)}\|^{2})^{\frac{1}{2}}\right)\right], (4.14)
R𝔼[(i=1Nσ2(bi)2)12],1,R(𝔼[(𝐗1)𝐗1])12)Cest,1log(n)n.\displaystyle\quad R\mathbb{E}\left[(\sum_{i=1}^{N}\sigma_{2}(b_{i})^{2})^{\frac{1}{2}}\right],1,R\left(\mathbb{E}\left[(\mathbf{X}^{1})^{\top}\mathbf{X}^{1}\right]\right)^{\frac{1}{2}}\Bigg{)}C_{est,1}\frac{\sqrt{\log(n)}}{\sqrt{n}}.

The expectations in (4.14) can be estimated using Jensen’s inequality and (3.23):

𝔼[(𝐗1)𝐗1]12=(j=1N𝔼[|Xj1|2])12\displaystyle\mathbb{E}\left[(\mathbf{X}^{1})^{\top}\mathbf{X}^{1}\right]^{\frac{1}{2}}=\left(\sum_{j=1}^{N}\mathbb{E}[|{X}^{1}_{j}|^{2}]\right)^{\frac{1}{2}} max(2Lσ22,2|σ2(0)|2)12N12𝔼[|𝐚(1)𝐱1ESN(𝐙)+𝐜(1)𝐙0+b1|2+1]12\displaystyle\leq\max(2L_{\sigma_{2}}^{2},2|\sigma_{2}(0)|^{2})^{\frac{1}{2}}N^{\frac{1}{2}}\mathbb{E}[|{\bf a}^{(1)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf Z})+{\bf c}^{(1)}\cdot{\bf Z}_{0}+b_{1}|^{2}+1]^{\frac{1}{2}}
N12Cest,2\displaystyle\leq N^{\frac{1}{2}}C_{est,2}

with

Cest,2=\displaystyle C_{est,2}= 232max(M,1)max(Lσ2,|σ2(0)|,1)\displaystyle 2^{\frac{3}{2}}\max(M,1)\max(L_{\sigma_{2}},|\sigma_{2}(0)|,1) (4.15)
×|CESN|+𝐁ESN+11|AESN|(𝔼[λ¯1𝐚(1)2]12+𝔼[𝐜(1)2]12+𝔼[|b1|2]12+1)\displaystyle\times\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|+1}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}(\mathbb{E}[\|\bar{\lambda}^{-1}{\bf a}^{(1)}\|^{2}]^{\frac{1}{2}}+\mathbb{E}[\|{\bf c}^{(1)}\|^{2}]^{\frac{1}{2}}+\mathbb{E}[|b_{1}|^{2}]^{\frac{1}{2}}+1)

and consequently

max(2RLσ2𝔼[max((i=1Nλ¯1𝐚(i)2)12,(i=1N𝐜(i)2)12)],R𝔼[(i=1Nσ2(bi)2)12],1,R(𝔼[(𝐗1)𝐗1])12)max(RN12Cest,2,1).\max\Bigg{(}2RL_{\sigma_{2}}\mathbb{E}\left[\max\left((\sum_{i=1}^{N}\|\bar{\lambda}^{-1}{\bf a}^{(i)}\|^{2})^{\frac{1}{2}},(\sum_{i=1}^{N}\|{\bf c}^{(i)}\|^{2})^{\frac{1}{2}}\right)\right],R\mathbb{E}\left[(\sum_{i=1}^{N}\sigma_{2}(b_{i})^{2})^{\frac{1}{2}}\right],1,R\left(\mathbb{E}\left[(\mathbf{X}^{1})^{\top}\mathbf{X}^{1}\right]\right)^{\frac{1}{2}}\Bigg{)}\\ \quad\leq\max\left(RN^{\frac{1}{2}}C_{est,2},1\right). (4.16)

Inserting this in (4.14) yields

𝔼[sup𝐰N:𝐰R|n(H^𝐰)(H^𝐰)|]\displaystyle\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}(\widehat{H}_{{\bf w}})-\mathcal{R}(\widehat{H}_{{\bf w}})|\right] max(R,1)max(RN12Cest,2,1)Cest,1log(n)n.\displaystyle\leq\max(R,1)\max\left(RN^{\frac{1}{2}}C_{est,2},1\right)C_{est,1}\frac{\sqrt{\log(n)}}{\sqrt{n}}. (4.17)

Combining this with (4.6), (4.7) and (4.8) we thus obtain

𝔼[|H(𝐙¯)H^𝐖^(𝐙¯)|2]1/2\displaystyle\mathbb{E}[|H(\bar{\bf Z})-\widehat{H}_{\widehat{\bf W}}(\bar{\bf Z})|^{2}]^{1/2} 𝔼[|H(𝐙¯)H^𝐖(𝐙¯)|2]1/2+𝔼[(H^𝐖^)n(H^𝐖^)+n(H^𝐖)(H^𝐖)]1/2\displaystyle\leq\mathbb{E}[|H(\bar{\bf Z})-\widehat{H}_{\bf W}(\bar{\bf Z})|^{2}]^{1/2}+\mathbb{E}[\mathcal{R}(\widehat{H}_{\widehat{\bf W}})-\mathcal{R}_{n}(\widehat{H}_{\widehat{\bf W}})+\mathcal{R}_{n}(\widehat{H}_{\bf W})-\mathcal{R}(\widehat{H}_{\bf W})]^{1/2} (4.18)
CH,ESN[λNd+|AESN|T+1N12dμA,B,Cdν12]\displaystyle\leq C_{H,\mathrm{ESN}}[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}^{\frac{1}{2}}]
+2(max(R,1)max(RN12Cest,2,1)Cest,1log(n)n)12\displaystyle\quad+\sqrt{2}\left(\max(R,1)\max\left(RN^{\frac{1}{2}}C_{est,2},1\right)C_{est,1}\frac{\sqrt{\log(n)}}{\sqrt{n}}\right)^{\frac{1}{2}}
Capprox[λNd+|AESN|T+1N12]+C~est(RN12log(n)n)12,\displaystyle\leq C_{\mathrm{approx}}[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}]+\tilde{C}_{\mathrm{est}}\left(RN^{\frac{1}{2}}\frac{\sqrt{\log(n)}}{\sqrt{n}}\right)^{\frac{1}{2}},

where we used max(RN12,1)RN12max([κw(1)]1,1)\max(RN^{\frac{1}{2}},1)\leq RN^{\frac{1}{2}}\max([\kappa\|w^{(1)}\|_{\infty}]^{-1},1) and set Capprox=CH,ESNmax(κ12,1)C_{\mathrm{approx}}=C_{H,\mathrm{ESN}}\max(\kappa^{\frac{1}{2}},1), C~est=(2Cest,1max(R,1)Cest,2max([κw(1)]1,1))12\tilde{C}_{\mathrm{est}}=(2C_{est,1}\max(R,1)C_{est,2}\max([\kappa\|w^{(1)}\|_{\infty}]^{-1},1))^{\frac{1}{2}}. Setting

Cest\displaystyle C_{\mathrm{est}} =c~2[(|CESN|+𝐁ESN+1)3(1|AESN|)3log(λmax1)max(R,1)(𝔼[𝐚(1)2]12+𝔼[𝐜(1)2]12+𝔼[|b1|2]12+1)\displaystyle=\tilde{c}_{2}\Bigg{[}\frac{({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|+1)^{3}}{(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{3}\sqrt{\log(\lambda_{max}^{-1})}}\max(R,1)(\mathbb{E}[\|{\bf a}^{(1)}\|^{2}]^{\frac{1}{2}}+\mathbb{E}[\|{\bf c}^{(1)}\|^{2}]^{\frac{1}{2}}+\mathbb{E}[|b_{1}|^{2}]^{\frac{1}{2}}+1)
×(r2|||AESN|||2)1/2max([κw(1)]1,1)(r1r+1λmax+(|CESN|2+1)12λmax1log(λmax1))]12\displaystyle\quad\quad\times(r^{2}-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2})^{-1/2}\max([\kappa\|w^{(1)}\|_{\infty}]^{-1},1)\Bigg{(}\frac{r}{1-r}+\frac{1}{\lambda_{max}}+\frac{({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}+1)^{\frac{1}{2}}\lambda_{max}^{-1}}{\log(\lambda_{max}^{-1})}\Bigg{)}\Bigg{]}^{\frac{1}{2}}

with

c~2=26(max(1,Cdep)max(M,1)5max(M~,1)3max(Lσ2,|σ2(0)|,1))12\tilde{c}_{2}=2^{6}(\max(1,C_{\mathrm{dep}})\max(M,1)^{5}\max(\widetilde{M},1)^{3}\max(L_{\sigma_{2}},|\sigma_{2}(0)|,1))^{\frac{1}{2}} (4.19)

and inserting (3.28) then shows (4.3) with (4.4) and (4.5), as claimed. ∎

Remark 4.3.

The bound in Theorem 4.2 can be directly extended to the multivariate case m>1m>1 as follows: if H=(H1,,Hm)H=(H_{1},\ldots,H_{m}) and each HiH_{i} satisfies the hypotheses formulated in Theorem 4.2, then

𝔼[H(𝐙¯)H^W^(𝐙¯)2]1/2\displaystyle\mathbb{E}[\|H(\bar{\bf Z})-\widehat{H}_{\widehat{W}}(\bar{\bf Z})\|^{2}]^{1/2} =𝔼[i=1m|Hi(𝐙¯)H^W^i(𝐙¯)|2]1/2i=1m𝔼[|Hi(𝐙¯)H^W^i(𝐙¯)|2]1/2\displaystyle=\mathbb{E}\left[\sum_{i=1}^{m}|H_{i}(\bar{\bf Z})-\widehat{H}_{\widehat{W}_{i}}(\bar{\bf Z})|^{2}\right]^{1/2}\leq\sum_{i=1}^{m}\mathbb{E}\left[|H_{i}(\bar{\bf Z})-\widehat{H}_{\widehat{W}_{i}}(\bar{\bf Z})|^{2}\right]^{1/2} (4.20)
i=1mCapprox,i(λiNd+|AESN|T+1N12)+i=1mCest,i(RN12log(n)n)12\displaystyle\leq\sum_{i=1}^{m}C_{\mathrm{approx},i}\left(\lambda_{i}^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}\right)+\sum_{i=1}^{m}C_{\mathrm{est},i}\left(RN^{\frac{1}{2}}\frac{\sqrt{\log(n)}}{\sqrt{n}}\right)^{\frac{1}{2}}

with Capprox,iC_{\mathrm{approx},i}, Cest,iC_{\mathrm{est},i} the constants from Theorem 4.2 corresponding to HiH_{i}.

5 Conclusions

In this paper we have provided reservoir computing approximation and generalization bounds for a new concept class of input/output systems that extends to a dynamical context the so-called generalized Barron functionals. We call the elements of the new class recurrent generalized Barron functionals; they are characterized by readouts that admit a certain integral representation built on infinite-dimensional state-space systems. We have also shown that they form a vector space and that the class contains as a subset systems of this type with finite-dimensional state-space representations. This class has been shown to be very rich: it contains all sufficiently smooth finite-memory functionals, linear and convolutional filters, as well as any input-output system built using separable Hilbert state spaces with readouts of integral Barron type. Moreover, this class is dense in the L2L^{2}-sense.

The reservoir architectures used for the approximation/estimation of elements in the new class are randomly generated echo state networks with either linear (for approximation) or ReLU (for estimation) activation functions, and with readouts built using randomly generated neural networks, in which only the output layer is trained (extreme learning machines or random feature neural networks). The results in the paper yield a fully implementable recurrent neural network-based learning algorithm with provable convergence guarantees that does not suffer from the curse of dimensionality.

Acknowledgments. The authors acknowledge partial financial support coming from the Swiss National Science Foundation (grant number 200021_175801/1). L. Grigoryeva and J.-P. Ortega wish to thank the hospitality of the University of Munich and L. Gonon that of the Nanyang Technological University during the academic visits in which some of this work was developed. We thank Christa Cuchiero and Josef Teichmann for interesting exchanges that inspired some of the results in this paper.

References

  • [Acci 22] B. Acciaio, A. Kratsios, and G. Pammer. “Metric hypertransformers are universal adapted maps”. arXiv preprint arXiv:2201.13094, 2022.
  • [Arco 22] T. Arcomano, I. Szunyogh, A. Wikner, J. Pathak, B. R. Hunt, and E. Ott. “A hybrid approach to atmospheric modeling that combines machine learning with a physics-based numerical model”. Journal of Advances in Modeling Earth Systems, Vol. 14, No. 3, p. e2021MS002712, 2022.
  • [Barr 92] A. R. Barron. “Neural net approximation”. In: Proc. 7th Yale Workshop on Adaptive and Learning Systems, pp. 69–72, 1992.
  • [Barr 93] A. Barron. “Universal approximation bounds for superpositions of a sigmoidal function”. IEEE Transactions on Information Theory, Vol. 39, No. 3, pp. 930–945, May 1993.
  • [Bent 22] F. E. Benth, N. Detering, and L. Galimberti. “Neural networks in Fréchet spaces”. Annals of Mathematics and Artificial Intelligence, pp. 1–29, 2022.
  • [Bouc 13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
  • [Bouv 17a] J. Bouvrie and B. Hamzi. “Kernel methods for the approximation of nonlinear systems”. SIAM Journal on Control and Optimization, Vol. 55, No. 4, pp. 2460–2492, 2017.
  • [Bouv 17b] J. Bouvrie and B. Hamzi. “Kernel methods for the approximation of some key quantities of nonlinear systems”. Journal of Computational Dynamics, Vol. 4, No. (1-2), pp. 1–19, 2017.
  • [Boyd 85] S. Boyd and L. Chua. “Fading memory and the problem of approximating nonlinear operators with Volterra series”. IEEE Transactions on Circuits and Systems, Vol. 32, No. 11, pp. 1150–1161, 1985.
  • [Chen 19] J. Chen and H. I. Nurdin. “Learning nonlinear input–output maps with dissipative quantum systems”. Quantum Information Processing, Vol. 18, No. 7, pp. 1–36, 2019.
  • [Chen 20] J. Chen, H. I. Nurdin, and N. Yamamoto. “Temporal information processing on noisy quantum computers”. Physical Review Applied, Vol. 14, No. 2, p. 24065, 2020.
  • [Chen 22] J. Chen. Nonlinear Convergent Dynamics for Temporal Information Processing on Novel Quantum and Classical Devices. PhD thesis, UNSW Sydney, 2022.
  • [Chen 95] T. Chen and H. Chen. “Approximation capability to functions of several variables, nonlinear functionals, and operators by radial basis function neural networks”. IEEE Transactions on Neural Networks, Vol. 6, No. 4, pp. 904–910, 1995.
  • [Coui 16] R. Couillet, G. Wainrib, H. Sevi, and H. T. Ali. “The asymptotic performance of linear echo state neural networks”. Journal of Machine Learning Research, Vol. 17, No. 178, pp. 1–35, 2016.
  • [Cuch 22] C. Cuchiero, F. Primavera, and S. Svaluto-Ferro. “Universal approximation theorems for continuous functions of càdlàg paths and Lévy-type signature models”. arXiv preprint arXiv:2208.02293, 2022.
  • [E 19] W. E, C. Ma, and L. Wu. “A priori estimates of the population risk for two-layer neural networks”. Commun. Math. Sci., Vol. 17, No. 5, pp. 1407–1425, 2019.
  • [E 20a] W. E, C. Ma, S. Wojtowytsch, and L. Wu. “Towards a mathematical understanding of neural network-based machine learning: what we know and what we don’t”. arXiv preprint arXiv:2009.10713, 2020.
  • [E 20b] W. E and S. Wojtowytsch. “On the Banach spaces associated with multi-layer ReLU networks: Function representation, approximation theory and gradient descent dynamics”. arXiv preprint arXiv:2007.15623, 2020.
  • [E 21] W. E and S. Wojtowytsch. “Representation formulas and pointwise properties for Barron functions”. arXiv preprint arXiv:2006.05982, 2021.
  • [Evan 10] L. C. Evans. Partial Differential Equations. Vol. 19 of Graduate Studies in Mathematics, American Mathematical Society, Providence, RI, second Ed., 2010.
  • [G Ma 20] G. Manjunath. “Stability and memory-loss go hand-in-hand: three results in dynamics & computation”. To appear in Proceedings of the Royal Society London Ser. A Math. Phys. Eng. Sci., pp. 1–25, 2020.
  • [Gali 22] L. Galimberti, G. Livieri, and A. Kratsios. “Designing universal causal deep learning models: the case of infinite-dimensional dynamical systems from stochastic analysis”. arXiv preprint arXiv:2210.13300, 2022.
  • [Gono 20a] L. Gonon, L. Grigoryeva, and J.-P. Ortega. “Memory and forecasting capacities of nonlinear recurrent networks”. Physica D, Vol. 414, No. 132721, pp. 1–13, 2020.
  • [Gono 20b] L. Gonon, L. Grigoryeva, and J.-P. Ortega. “Risk bounds for reservoir computing”. Journal of Machine Learning Research, Vol. 21, No. 240, pp. 1–61, 2020.
  • [Gono 20c] L. Gonon and J.-P. Ortega. “Reservoir computing universality with stochastic inputs”. IEEE Transactions on Neural Networks and Learning Systems, Vol. 31, No. 1, pp. 100–112, 2020.
  • [Gono 21a] L. Gonon. “Random feature neural networks learn Black-Scholes type PDEs without curse of dimensionality”. Preprint, 2021.
  • [Gono 21b] L. Gonon and J.-P. Ortega. “Fading memory echo state networks are universal”. Neural Networks, Vol. 138, pp. 10–13, 2021.
  • [Gono 23] L. Gonon, L. Grigoryeva, and J.-P. Ortega. “Approximation error estimates for random neural networks and reservoir systems”. The Annals of Applied Probability, Vol. 33, No. 1, pp. 28–69, 2023.
  • [Grig 14] L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. “Stochastic time series forecasting using time-delay reservoir computers: performance and universality”. Neural Networks, Vol. 55, pp. 59–71, 2014.
  • [Grig 18a] L. Grigoryeva and J.-P. Ortega. “Echo state networks are universal”. Neural Networks, Vol. 108, pp. 495–508, 2018.
  • [Grig 18b] L. Grigoryeva and J.-P. Ortega. “Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems”. Journal of Machine Learning Research, Vol. 19, No. 24, pp. 1–40, 2018.
  • [Grig 19] L. Grigoryeva and J.-P. Ortega. “Differentiable reservoir computing”. Journal of Machine Learning Research, Vol. 20, No. 179, pp. 1–62, 2019.
  • [Grig 21a] L. Grigoryeva, A. G. Hart, and J.-P. Ortega. “Learning strange attractors with reservoir systems”. arXiv preprint arXiv:2108.05024, 2021.
  • [Grig 21b] L. Grigoryeva and J.-P. Ortega. “Dimension reduction in recurrent networks by canonicalization”. Journal of Geometric Mechanics, Vol. 13, No. 4, pp. 647–677, 2021.
  • [Herm 10] M. Hermans and B. Schrauwen. “Memory in linear recurrent neural networks in continuous time”. Neural Networks, Vol. 23, No. 3, pp. 341–355, April 2010.
  • [Herm 12] M. Hermans and B. Schrauwen. “Recurrent kernel machines: computation with infinite echo state networks”. Neural Computation, Vol. 24, pp. 104–133, 2012.
  • [Hu 22] P. Hu, Q. Meng, B. Chen, S. Gong, Y. Wang, W. Chen, R. Zhu, Z.-M. Ma, and T.-Y. Liu. “Neural operator with regularity structure for modeling dynamics driven by SPDEs”. arXiv preprint arXiv:2204.06255, 2022.
  • [Huan 06] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. “Extreme learning machine: Theory and applications”. Neurocomputing, Vol. 70, No. 1-3, pp. 489–501, December 2006.
  • [Jaeg 04] H. Jaeger and H. Haas. “Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication”. Science, Vol. 304, No. 5667, pp. 78–80, 2004.
  • [Jaeg 10] H. Jaeger. “The ‘echo state’ approach to analysing and training recurrent neural networks with an erratum note”. Tech. Rep., German National Research Center for Information Technology, 2010.
  • [Kira 19] F. J. Király and H. Oberhauser. “Kernels for sequentially ordered data”. Journal of Machine Learning Research, Vol. 20, 2019.
  • [Kova 21] N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. “Neural operator: Learning maps between function spaces”. arXiv preprint arXiv:2108.08481, 2021.
  • [Krat 20] A. Kratsios and I. Bilokopytov. “Non-Euclidean universal approximation”. In: 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada, 2020.
  • [Lax 02] P. Lax. Functional Analysis. Wiley-Interscience, 2002.
  • [Li 20] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. “Fourier neural operator for parametric partial differential equations”. arXiv preprint arXiv:2010.08895, 2020.
  • [Li 22] Z. Li, J. Han, W. E, and Q. Li. “Approximation and optimization theory for linear continuous-time recurrent neural networks”. Journal of Machine Learning Research, Vol. 23, pp. 41–42, 2022.
  • [Lu 18] Z. Lu, B. R. Hunt, and E. Ott. “Attractor reconstruction by machine learning”. Chaos, Vol. 28, No. 6, 2018.
  • [Maas 02] W. Maass, T. Natschläger, and H. Markram. “Real-time computing without stable states: a new framework for neural computation based on perturbations”. Neural Computation, Vol. 14, pp. 2531–2560, 2002.
  • [Maas 11] W. Maass. “Liquid state machines: motivation, theory, and applications”. In: S. B. Cooper and A. Sorbi, Eds., Computability in Context: Computation and Logic in the Real World, Chap. 8, pp. 275–296, 2011.
  • [Manj 13] G. Manjunath and H. Jaeger. “Echo state property linked to an input: exploring a fundamental characteristic of recurrent neural networks”. Neural Computation, Vol. 25, No. 3, pp. 671–696, 2013.
  • [Manj 22a] G. Manjunath. “Embedding information onto a dynamical system”. Nonlinearity, Vol. 35, No. 3, p. 1131, 2022.
  • [Manj 22b] G. Manjunath and J.-P. Ortega. “Transport in reservoir computing”. arXiv preprint arXiv:2209.07946, 2022.
  • [Mart 23] R. Martínez-Peña and J.-P. Ortega. “Quantum reservoir computing in finite dimensions”. Physical Review E, Vol. 107, No. 3, p. 035306, 2023.
  • [Matt 92] M. B. Matthews. On the Uniform Approximation of Nonlinear Discrete-Time Fading-Memory Systems Using Neural Network Models. PhD thesis, ETH Zürich, 1992.
  • [Matt 93] M. B. Matthews. “Approximating nonlinear fading-memory operators using neural network models”. Circuits, Systems, and Signal Processing, Vol. 12, No. 2, pp. 279–307, June 1993.
  • [Neuf 22] A. Neufeld and P. Schmocker. “Chaotic hedging with iterated integrals and neural networks”. arXiv preprint arXiv:2209.10166, 2022.
  • [Path 17] J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott. “Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data”. Chaos, Vol. 27, No. 12, 2017.
  • [Path 18] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. “Model-Free Prediction of Large Spatiotemporally Chaotic Systems from Data: A Reservoir Computing Approach”. Physical Review Letters, Vol. 120, No. 2, p. 24102, 2018.
  • [Rahi 07] A. Rahimi and B. Recht. “Random features for large-scale kernel machines”. In: Advances in Neural Information Processing Systems, 2007.
  • [Salv 22] C. Salvi, M. Lemercier, and A. Gerasimovics. “Neural stochastic PDEs: Resolution-invariant learning of continuous spatiotemporal dynamics”. In: Advances in Neural Information Processing Systems, 2022.
  • [Stin 99] M. B. Stinchcombe. “Neural network approximation of continuous functionals and continuous functions on compactifications”. Neural Networks, Vol. 12, No. 3, pp. 467–477, 1999.
  • [Tino 18] P. Tino. “Asymptotic Fisher memory of randomized linear symmetric Echo State Networks”. Neurocomputing, Vol. 298, pp. 4–8, 2018.
  • [Tino 20] P. Tino. “Dynamical systems as temporal feature spaces”. Journal of Machine Learning Research, Vol. 21, pp. 1–42, 2020.
  • [Tran 20] Q. H. Tran and K. Nakajima. “Higher-order quantum reservoir computing”. arXiv preprint arXiv:2006.08999, 2020.
  • [Tran 21] Q. H. Tran and K. Nakajima. “Learning temporal quantum tomography”. Physical Review Letters, Vol. 127, No. 26, p. 260401, 2021.
  • [Vill 09] C. Villani. Optimal Transport: Old and New. Springer, 2009.
  • [Wikn 21] A. Wikner, J. Pathak, B. R. Hunt, I. Szunyogh, M. Girvan, and E. Ott. “Using data assimilation to train a hybrid forecast system that combines machine-learning and knowledge-based components”. Chaos: An Interdisciplinary Journal of Nonlinear Science, Vol. 31, No. 5, p. 53114, 2021.
  • [Yild 12] I. B. Yildiz, H. Jaeger, and S. J. Kiebel. “Re-visiting the echo state property”. Neural Networks, Vol. 35, pp. 1–9, November 2012.
  • [Ziem 22] I. M. Ziemann, H. Sandberg, and N. Matni. “Single Trajectory Nonparametric Learning of Nonlinear Dynamics”. In: P.-L. Loh and M. Raginsky, Eds., Proceedings of Thirty Fifth Conference on Learning Theory, pp. 3333–3364, PMLR, 2022.