
Infinite-dimensional reservoir computing

Lukas Gonon1, Lyudmila Grigoryeva2,3, and Juan-Pablo Ortega4
Abstract

Reservoir computing approximation and generalization bounds are proved for a new concept class of input/output systems that extends the so-called generalized Barron functionals to a dynamic context. This new class is characterized by readouts with a certain integral representation built on infinite-dimensional state-space systems. It is shown that this class is very rich and possesses useful features and universal approximation properties. The reservoir architectures used for the approximation and estimation of elements in the new class are randomly generated echo state networks with either linear or ReLU activation functions. Their readouts are built using randomly generated neural networks in which only the output layer is trained (extreme learning machines or random feature neural networks). The results in the paper yield a fully implementable recurrent neural network-based learning algorithm with provable convergence guarantees that do not suffer from the curse of dimensionality.


Key Words: recurrent neural network, reservoir computing, echo state network, ESN, extreme learning machine, ELM, recurrent linear network, machine learning, Barron functional, recurrent Barron functional, universality, finite memory functional, approximation bound, convolutional filter.

1 Imperial College, Department of Mathematics, London, United Kingdom. [email protected]
2 Universität Sankt Gallen, Faculty of Mathematics and Statistics, Sankt Gallen, Switzerland. [email protected]
3 University of Warwick, Department of Statistics, United Kingdom. [email protected]
4 Nanyang Technological University, Division of Mathematical Sciences, Singapore. [email protected]

1 Introduction

Reservoir computing (RC) [Jaeg 10, Maas 02, Jaeg 04, Maas 11] and in particular echo state networks (ESNs) [Matt 92, Matt 93, Jaeg 04] have gained much popularity in recent years due to their excellent performance in the forecasting of dynamical systems [Grig 14, Jaeg 04, Path 17, Path 18, Lu 18, Wikn 21, Arco 22] and due to the ease of their implementation. RC aims at approximating nonlinear input/output systems using randomly generated state-space systems (called reservoirs) in which only a linear readout is estimated. It has been theoretically established that this is indeed possible in a variety of deterministic and stochastic contexts [Grig 18b, Grig 18a, Gono 20c, Gono 21b, Gono 23] in which RC systems have been shown to have universal approximation properties.

In this paper, we focus on deriving error bounds for a variant of the architectures that we just cited and consider as approximants randomly generated linear systems with readouts given by randomly generated neural networks in which only the output layer is trained. Thus, from a learning perspective, we combine linear echo state networks and what is referred to in the literature as random features [Rahi 07] or extreme learning machines (ELMs) [Huan 06]. We develop explicit and readily computable approximation and estimation bounds for a newly introduced concept class whose elements we refer to as recurrent (generalized) Barron functionals since they can be viewed as a dynamical analog of the (generalized) Barron functions introduced in [Barr 92, Barr 93] and extended later in [E 20b, E 20a, E 19]. The main novelty in this concept class with respect to others already available in the literature, like the fading memory class [Boyd 85] in deterministic setups or $L^{p}$ functionals in the stochastic case, is that it consists of elements that admit an explicit infinite-dimensional state-space representation, which makes them analytically tractable. As we shall see later on, many interesting families of input/output systems belong to this class which, additionally, is universal in the $L^{p}$-sense.

From an approximation-theoretical point of view, the universality properties of linear systems with readouts belonging to dense families have been known for a long time both in deterministic [Boyd 85, Grig 18b] and in stochastic [Gono 20c] setups. Their corresponding dynamical and memory properties have been extensively studied (see, for instance, [Herm 10, Coui 16, Tino 18, Tino 20, Gono 20a, Li 22] and references therein). Our contribution in this paper can hence be considered as an extension of those works to the recurrent Barron class.

All the dynamic learning works cited so far use exclusively finite-dimensional state spaces. Hence, one of the main novelties of our contribution is the infinite-dimensional component in the concept class that we propose. It is worth mentioning that, even though numerous neural functional approximation results exist in static setups (see, for instance, [Chen 95, Stin 99, Krat 20, Bent 22, Cuch 22, Neuf 22], in addition to the works on Barron functions cited above), the use of infinite-dimensional state-space systems has not been much exploited, and it is only very recently that it has begun to be seriously developed; see [Herm 12, Bouv 17b, Bouv 17a, Kira 19, Li 20, Kova 21, Acci 22, Gali 22, Hu 22, Salv 22] for a few examples. Dedicated hardware realizations of RC systems using quantum systems are a potential application of these extensions; in the absence of adequate tools, most of the corresponding theoretical developments have so far been carried out in exclusively finite-dimensional setups [Chen 19, Chen 20, Tran 20, Tran 21, Chen 22, Mart 23].

In order to introduce our contributions in more detail, we start by recalling that an echo state network (ESN) is an input/output system defined by the two equations:

\mathbf{x}_{t} = \sigma(A\mathbf{x}_{t-1} + C\mathbf{z}_{t} + \bm{\zeta}), \qquad (1.1)
\mathbf{y}_{t} = W\mathbf{x}_{t}, \qquad (1.2)

for $t\in\mathbb{Z}_{-}$, where $d,m,N\in\mathbb{N}$, ${\bf z}\in(\mathbb{R}^{d})^{\mathbb{Z}_{-}}$, ${\bf x}_{t}\in\mathbb{R}^{N}$, the matrix $W\in\mathbb{M}_{m,N}$ is trainable, $\sigma$ denotes the componentwise application of a given activation function $\sigma\colon\mathbb{R}\to\mathbb{R}$, and $A\in\mathbb{M}_{N}$, $C\in\mathbb{M}_{N,d}$, and $\bm{\zeta}\in\mathbb{R}^{N}$ are randomly generated. If for each ${\bf z}\in(\mathbb{R}^{d})^{\mathbb{Z}_{-}}$ there exists a unique solution ${\bf x}=({\bf x}_{t})_{t\in\mathbb{Z}_{-}}\in(\mathbb{R}^{N})^{\mathbb{Z}_{-}}$ to (1.1), then (1.1)-(1.2) define a mapping $H_{W}\colon(\mathbb{R}^{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{m}$ via $H_{W}({\bf z})={\bf y}_{0}$, with ${\bf y}\in(\mathbb{R}^{m})^{\mathbb{Z}_{-}}$ given by (1.2) for ${\bf x}$ the unique solution to (1.1) associated with ${\bf z}\in(\mathbb{R}^{d})^{\mathbb{Z}_{-}}$. Given a (typically unknown) functional $H\colon(\mathbb{R}^{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{m}$ to be learned, the readout $W$ is then trained so that $H_{W}$ is a good approximation of $H$.

For echo state networks (1.1)–(1.2), approximation bounds were given in [Gono 23] for maps $H$ which have the property that their restrictions to input sequences of finite length $T$ lie in a certain Sobolev space, for each $T\in\mathbb{N}$, and with weights $A$, $C$, $\bm{\zeta}$ in (1.1) sampled from a generic distribution with a certain structure. Here we consider a novel architecture for (1.1)–(1.2), where, instead of applying a linear function in (1.2), we apply a random feedforward neural network, that is, (1.2) is replaced by

\mathbf{y}_{t} = \sum_{i=1}^{N} W_{i}\,\sigma\big({\bf a}^{(i)}\cdot\mathbf{x}_{t-1} + {\bf c}^{(i)}\cdot{\bf z}_{t} + b_{i}\big) \qquad (1.3)

for $t\in\mathbb{Z}_{-}$ and with randomly generated coefficients ${\bf a}^{(i)}$, ${\bf c}^{(i)}$, $b_{i}$ taking values in $\mathbb{R}^{N}$, $\mathbb{R}^{d}$, and $\mathbb{R}$, respectively. In most cases, the activation function $\sigma$ in (1.1) will be just the identity or, possibly, a rectified linear unit.
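For concreteness, the following Python/NumPy sketch implements this architecture on a toy task: the reservoir parameters of (1.1) and the feature weights in (1.3) are drawn at random, and only the output weights $W_{i}$ are fitted, here by ridge regression. All dimensions, the spectral rescaling of $A$, the target, and the ridge parameter are illustrative assumptions and are not prescribed by the paper.

import numpy as np

rng = np.random.default_rng(0)
N, d, T = 100, 3, 500   # state dimension, input dimension, sample length (illustrative)

# Randomly generated reservoir parameters of (1.1); rescaling A to spectral norm 0.9
# makes the state map a contraction, so a unique solution exists (echo state property).
A = rng.normal(size=(N, N))
A *= 0.9 / np.linalg.norm(A, 2)
C = rng.normal(size=(N, d))
zeta = rng.normal(size=N)

# Randomly generated readout features of (1.3); only the output weights W are trained.
a = rng.normal(size=(N, N))   # rows a^(i)
c = rng.normal(size=(N, d))   # rows c^(i)
b = rng.normal(size=N)

def reservoir_pairs(z, sigma=np.tanh):
    """Iterate (1.1) over an input sequence z of shape (T, d), returning (x_{t-1}, z_t) pairs."""
    x = np.zeros(N)
    pairs = []
    for zt in z:
        pairs.append((x, zt))
        x = sigma(A @ x + C @ zt + zeta)
    return pairs

def features(pairs):
    """Random-feature map of (1.3): phi_i(t) = ReLU(a^(i).x_{t-1} + c^(i).z_t + b_i)."""
    return np.array([np.maximum(a @ x_prev + c @ zt + b, 0.0) for x_prev, zt in pairs])

# Toy supervised problem (hypothetical target with one step of memory); fit W by ridge regression.
z = rng.uniform(-1, 1, size=(T, d))
y = np.sin(z[:, 0])
y[1:] += 0.5 * z[:-1, 1]
Phi = features(reservoir_pairs(z))
lam = 1e-3
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)
print("training MSE:", np.mean((Phi @ W - y) ** 2))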

In order to derive learning error bounds for this architecture, we proceed in several steps. First, we examine the new concept class $\mathcal{C}$ of recurrent (generalized) Barron functionals, consisting of reservoir functionals that admit a certain integral representation. This class $\mathcal{C}$ turns out to possess very useful properties: we show that a large class of functionals $H\colon(\mathbb{R}^{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{m}$, including – but not restricted to – functionals with “sufficient smoothness”, belongs to $\mathcal{C}$ and that, under mild conditions, $\mathcal{C}$ is dense in the $L^{2}$-sense. We then examine the approximation of elements in $\mathcal{C}$ by reservoir computing systems and derive approximation error bounds. This shows that a large class of functionals can be approximated by recurrent neural networks whose hidden weights are randomly sampled from generic distributions, and explicit approximation rates can be derived for them.

The second step consists in obtaining generalization error bounds that match the parameter restrictions emerging from the approximation results for these systems. The key challenge here is that the observational data are non-IID, and so classical statistical learning techniques [Bouc 13] cannot be employed directly; see [Gono 20b]. By combining these bounds, we then obtain learning error bounds for such random recurrent neural networks for learning a general class of maps $\mathcal{C}$. As a by-product, we obtain new universality results for reservoir systems with generically randomly generated coefficients; see [Grig 18b, Grig 18a, Gono 20c, Gono 23, Gono 21b]. It is worth emphasizing that the construction described in this paper yields a fully implementable recurrent neural network-based learning algorithm with provable convergence guarantees not suffering from the curse of dimensionality.

2 A dynamic analog of the generalized Barron functions

In this section, we introduce the class $\mathcal{C}$ of recurrent (generalized) Barron functionals, which constitute the concept class that we study and approximate in this paper. We prove elementary properties of $\mathcal{C}$ and identify important constituents in it. In particular, we show that $\mathcal{C}$ is a vector space that is dense in the set of square-integrable functionals and contains many important classes of maps $H\colon(\mathbb{R}^{d})^{\mathbb{Z}_{-}}\to\mathbb{R}$, such as linear or “sufficiently smooth” maps.

2.1 Notation

Let $D_{d}\subset\mathbb{R}^{d}$, $p\in[1,\infty]$, and let $q$ be the Hölder conjugate of $p$. Recall the notation $\ell^{p}=\ell^{p}(\mathbb{R})$ for the space of sequences $(x_{n})_{n\in\mathbb{N}}\in\mathbb{R}^{\mathbb{N}}$ with $\sum_{i\in\mathbb{N}}|x_{i}|^{p}<\infty$, in the case $p<\infty$, and $\sup_{i\in\mathbb{N}}|x_{i}|<\infty$, in the case $p=\infty$. The symbol $\ell^{p}_{-}=\ell^{p}_{-}(\mathbb{R})$ denotes the space of sequences $(x_{n})_{n\in\mathbb{Z}_{-}}\in\mathbb{R}^{\mathbb{Z}_{-}}$ that satisfy the same summability conditions as those in $\ell^{p}$, with $\mathbb{Z}_{-}$ the negative integers including zero. We let $\ell^{\infty}_{-}(\ell^{q})$ denote the set of sequences $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in(\ell^{q})^{\mathbb{Z}_{-}}$ with the property that $\sup_{t\in\mathbb{Z}_{-}}\|\mathbf{x}_{t}\|_{q}<\infty$. For $\mathbf{x}\in\ell^{p}$ and ${\bf y}\in\ell^{q}$ we write $\mathbf{x}\cdot{\bf y}=\sum_{i\in\mathbb{N}}x_{i}y_{i}$. Let $\sigma_{1},\sigma_{2}\colon\mathbb{R}\to\mathbb{R}$ denote two functions, called activation functions. For both activation functions we denote by the same symbol their componentwise application to sequences or vectors. Furthermore, we assume that $\sigma_{2}$ is Lipschitz-continuous with constant $L_{\sigma_{2}}$ and that $\sigma_{1}(0)=0$. Given $N\in\mathbb{N}$, we write $\mathcal{X}=\mathbb{R}\times\ell^{p}\times\mathbb{R}^{d}\times\mathbb{R}$ and $\mathcal{X}^{N}=\mathbb{R}\times\mathbb{R}^{N}\times\mathbb{R}^{d}\times\mathbb{R}$. We will consider inputs in $\mathcal{I}_{d}\subset(D_{d})^{\mathbb{Z}_{-}}$.

2.2 Definition and properties of recurrent Barron functionals

In this section we define the class of recurrent generalized Barron functionals and study some properties of these functionals.

Definition 2.1.

A functional $H\colon\mathcal{I}_{d}\to\mathbb{R}$ is said to be a recurrent (generalized) Barron functional if there exist a (Borel) probability measure $\mu$ on $\mathcal{X}$, ${\bf B}\in\ell^{q}$, and linear maps $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$ such that for each ${\bf z}\in\mathcal{I}_{d}$ the system

\mathbf{x}_{t} = \sigma_{1}(A\mathbf{x}_{t-1} + C{\bf z}_{t} + {\bf B}), \quad t\in\mathbb{Z}_{-}, \qquad (2.1)

admits a unique solution $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$, $\mu$ satisfies

I_{\mu,p} := \int_{\mathcal{X}}|w|\,(\|{\bf a}\|_{p}+\|{\bf c}\|+|b|)\,\mu(dw,d{\bf a},d{\bf c},db) < \infty

and, writing $\mathbf{x}_{t}=\mathbf{x}_{t}({\bf z})$,

H({\bf z}) = \int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(dw,d{\bf a},d{\bf c},db), \quad {\bf z}\in\mathcal{I}_{d}. \qquad (2.2)

We denote by $\mathcal{C}$ the class of all recurrent generalized Barron functionals, or just recurrent Barron functionals.

Note that the readout (2.2) is built in such a way that the output is constructed out of $\mathbf{x}_{-1}$ and ${\bf z}_{0}$ instead of exclusively $\mathbf{x}_{0}$, as is customary in reservoir computing (see (1.2)). The reason for this choice is that it will allow us later on, in Section 3.6, to recover the static situation as a particular case of the recurrent setup in a straightforward manner.

The unique solution property (USP). The existence and uniqueness of solutions of state equations of the type (2.1), which is part of the definition of the recurrent Barron functionals, is a well-studied problem. This property is referred to in the literature as the echo state property (ESP) [Jaeg 10, Yild 12, Manj 13] or the unique solution property (USP) [G Ma 20, Manj 22a], and it has been extensively studied in deterministic (see the references above and [Grig 18a], for instance) and stochastic setups [Manj 22b]. The following proposition is an infinite-dimensional adaptation of a commonly used sufficient condition for the USP to hold in the case of (2.1), as well as a full characterization of it when the activation function $\sigma_{1}$ is the identity and hence (2.1) is a linear system. In both cases, the inputs are assumed to be bounded; possible generalizations to the unbounded case can be found in [Grig 19].

Proposition 2.2.

Consider the state equation given by (2.1) and the following two cases:

(i)

Suppose that the inputs are defined in $\mathcal{I}_{d}\subset(D_{d})^{\mathbb{Z}_{-}}$, with $D_{d}$ a compact subset of $\mathbb{R}^{d}$. Furthermore, assume that $\sigma_{1}$ is Lipschitz-continuous with constant $L_{\sigma_{1}}$ and that $L_{\sigma_{1}}|||A|||<1$, with $|||A|||$ the operator norm of the linear map $A\colon\ell^{q}\to\ell^{q}$. Then (2.1) has the USP, i.e., for each ${\bf z}\in\mathcal{I}_{d}$ there exists a unique $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ satisfying (2.1).

(ii)

Suppose that $\sigma_{1}$ is the identity and that $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$ are bounded linear operators. If the spectral radius $\rho(A)$ of the operator $A$ satisfies $\rho(A)<1$, then the linear state equation (2.1) has the USP with respect to inputs in $\ell^{\infty}_{-}(\mathbb{R}^{d})$. In that case the unique solution of (2.1) is given by

\mathbf{x}_{t}({\bf z}) = \sum_{k=0}^{\infty} A^{k}(C{\bf z}_{t-k}+{\bf B}), \quad t\in\mathbb{Z}_{-}. \qquad (2.3)
Proof.

The proof of part (i) is a straightforward generalization of [Grig 18a, Theorem 3.1(ii)] together with the observation that the hypothesis $L_{\sigma_{1}}|||A|||<1$ implies that the state equation is a contraction in the state entry. Part (ii) can be obtained by mimicking the proof of [Grig 21b, Proposition 4.2(i)]; a key element in that proof is the use of Gelfand’s formula for the spectral radius, which holds for operators between infinite-dimensional Banach spaces [Lax 02, Theorem 4, page 195]. ∎
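The linear case in part (ii) can be illustrated numerically in a finite-dimensional truncation: when $\rho(A)<1$, iterating (2.1) forgets its initialization and reproduces the closed-form solution (2.3). The dimensions, the spectral rescaling, and the horizon in the sketch below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
n, d, K = 8, 2, 200          # truncated state dimension, input dimension, horizon (illustrative)

A = rng.normal(size=(n, n))
A *= 0.8 / max(abs(np.linalg.eigvals(A)))    # enforce spectral radius rho(A) < 1
C = rng.normal(size=(n, d))
B = rng.normal(size=n)

z = rng.uniform(-1, 1, size=(K, d))          # inputs z_{-K+1}, ..., z_0

# Iterate x_t = A x_{t-1} + C z_t + B from an arbitrary start; rho(A) < 1 washes it out.
x = rng.normal(size=n)
for t in range(K):
    x = A @ x + C @ z[t] + B

# Closed form (2.3), truncated at k = K - 1: x_0 = sum_{k>=0} A^k (C z_{-k} + B).
x_closed = np.zeros(n)
Ak = np.eye(n)
for k in range(K):
    x_closed += Ak @ (C @ z[K - 1 - k] + B)
    Ak = Ak @ A

print("max discrepancy:", np.max(np.abs(x - x_closed)))   # tiny: only the forgotten initial state remains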

The concept class $\mathcal{C}$ of recurrent Barron functionals is a vector space.

Proposition 2.3.

Suppose $H_{1},H_{2}\in\mathcal{C}$ and $\lambda_{1},\lambda_{2}\in\mathbb{R}$. Then $\lambda_{1}H_{1}+\lambda_{2}H_{2}\in\mathcal{C}$.

Proof.

Without loss of generality, we may assume $\lambda_{i}\neq 0$ for $i=1,2$. For $i=1,2$ let $\mu^{(i)}$ be (Borel) probability measures on $\mathcal{X}$, let ${\bf B}^{(i)}\in\ell^{q}$, and let $A^{(i)}\colon\ell^{q}\to\ell^{q}$, $C^{(i)}\colon\mathbb{R}^{d}\to\ell^{q}$ be linear maps such that for each ${\bf z}\in\mathcal{I}_{d}$ the system

\mathbf{x}_{t}^{(i)} = \sigma_{1}(A^{(i)}\mathbf{x}_{t-1}^{(i)} + C^{(i)}{\bf z}_{t} + {\bf B}^{(i)}), \quad t\in\mathbb{Z}_{-}, \qquad (2.4)

admits a unique solution $(\mathbf{x}_{t}^{(i)})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$, $\mu^{(i)}$ satisfies $I_{\mu^{(i)},p}<\infty$, and

H_{i}({\bf z}) = \int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{(i)}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{(i)}(dw,d{\bf a},d{\bf c},db), \quad {\bf z}\in\mathcal{I}_{d}. \qquad (2.5)

Define $\pi_{1}\colon\mathbb{R}^{\mathbb{N}}\to\mathbb{R}^{\mathbb{N}}$ by $\pi_{1}((y_{n})_{n\in\mathbb{N}})=(y_{\frac{n+1}{2}}\mathbbm{1}_{\mathbb{N}}(\frac{n+1}{2}))_{n\in\mathbb{N}}$ and $\pi_{2}\colon\mathbb{R}^{\mathbb{N}}\to\mathbb{R}^{\mathbb{N}}$ by $\pi_{2}((y_{n})_{n\in\mathbb{N}})=(y_{\frac{n}{2}}\mathbbm{1}_{\mathbb{N}}(\frac{n}{2}))_{n\in\mathbb{N}}$, and note that $\pi_{1},\pi_{2}$ are linear, that they map both $\ell^{p}$ and $\ell^{q}$ into themselves, that $\pi_{1}+\pi_{2}=\mathbb{I}$, and that $\pi_{i}\circ\sigma_{1}=\sigma_{1}\circ\pi_{i}$, $i=1,2$. Define ${\bf B}\in\ell^{q}$ and linear maps $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$ by setting $A{\bf x}=\pi_{1}(A^{(1)}{\bf x})+\pi_{2}(A^{(2)}{\bf x})$, ${\bf B}=\pi_{1}({\bf B}^{(1)})+\pi_{2}({\bf B}^{(2)})$, and $C{\bf z}=\pi_{1}(C^{(1)}{\bf z})+\pi_{2}(C^{(2)}{\bf z})$.

Suppose now that $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ is a solution to (2.1) for the input ${\bf z}\in\mathcal{I}_{d}$ and the parameters $A,{\bf B},C$ that we just defined. For each $t\in\mathbb{Z}_{-}$ denote by $\mathbf{x}_{t}^{(1)}=(\mathbf{x}_{t,2n-1})_{n\in\mathbb{N}}$ the odd components of $\mathbf{x}_{t}$ and by $\mathbf{x}_{t}^{(2)}=(\mathbf{x}_{t,2n})_{n\in\mathbb{N}}$ the even components. Then $\mathbf{x}_{t}^{(i)}\in\ell^{q}$ and, by construction of $A,{\bf B},C$, it follows that $(\mathbf{x}_{t}^{(i)})_{t\in\mathbb{Z}_{-}}$ is a solution of (2.4) for $i=1,2$. By the uniqueness of the solutions of (2.4), it follows that there is at most one solution of (2.1). On the other hand, if we now denote by $(\mathbf{x}^{(i)}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$, $i=1,2$, the unique solution to (2.4) for the input ${\bf z}\in\mathcal{I}_{d}$, then setting $\mathbf{x}_{t}=\pi_{1}(\mathbf{x}_{t}^{(1)})+\pi_{2}(\mathbf{x}_{t}^{(2)})$ defines a sequence $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ which is a solution to (2.1), again by the construction of $A,{\bf B},C$.

Next, define $\phi_{i}\colon\mathcal{X}\to\mathcal{X}$ as $\phi_{i}(w,{\bf a},{\bf c},b)=(w(\lambda_{1}+\lambda_{2}),\pi_{i}({\bf a}),{\bf c},b)$ for $i=1,2$ and set $\mu=\frac{\lambda_{1}}{\lambda_{1}+\lambda_{2}}(\mu^{(1)}\circ\phi_{1}^{-1})+\frac{\lambda_{2}}{\lambda_{1}+\lambda_{2}}(\mu^{(2)}\circ\phi_{2}^{-1})$, where $\mu^{(i)}\circ\phi_{i}^{-1}$ denotes the pushforward of $\mu^{(i)}$ under $\phi_{i}$. Then the integral transformation theorem shows that

\int_{\mathcal{X}}|w|(\|{\bf a}\|_{p}+\|{\bf c}\|+|b|)\,\mu(dw,d{\bf a},d{\bf c},db) = \sum_{i=1}^{2}\lambda_{i}\int_{\mathcal{X}}|w|(\|\pi_{i}({\bf a})\|_{p}+\|{\bf c}\|+|b|)\,\mu^{(i)}(dw,d{\bf a},d{\bf c},db) < \infty

and, by (2.5), for all ${\bf z}\in\mathcal{I}_{d}$,

\int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(dw,d{\bf a},d{\bf c},db) = \sum_{i=1}^{2}\lambda_{i}\int_{\mathcal{X}} w\,\sigma_{2}(\pi_{i}({\bf a})\cdot\mathbf{x}_{-1}+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{(i)}(dw,d{\bf a},d{\bf c},db)
= \sum_{i=1}^{2}\lambda_{i}\int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{(i)}+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{(i)}(dw,d{\bf a},d{\bf c},db)
= \lambda_{1}H_{1}({\bf z})+\lambda_{2}H_{2}({\bf z}).

Hence, $\lambda_{1}H_{1}+\lambda_{2}H_{2}\in\mathcal{C}$, as claimed. ∎

Finite-dimensional recurrent Barron functionals. For $N\in\mathbb{N}$, denote by $\mathcal{C}_{N}$ the set of functionals $H^{N}\colon\mathcal{I}_{d}\subset(D_{d})^{\mathbb{Z}_{-}}\to\mathbb{R}$ with the following property: there exist a (Borel) probability measure $\mu^{N}$ on $\mathcal{X}^{N}$, $A\in\mathbb{M}_{N,N}$, ${\bf B}\in\mathbb{R}^{N}$, $C\in\mathbb{M}_{N,d}$ such that for each ${\bf z}\in\mathcal{I}_{d}$ the system

\mathbf{x}_{t}^{N} = \sigma_{1}(A\mathbf{x}_{t-1}^{N}+C{\bf z}_{t}+{\bf B}), \quad t\in\mathbb{Z}_{-}, \qquad (2.6)

admits a unique solution $(\mathbf{x}_{t}^{N})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\mathbb{R}^{N})$, $\mu^{N}$ satisfies

\int_{\mathcal{X}^{N}}|w|(\|{\bf a}\|+\|{\bf c}\|+|b|)\,\mu^{N}(dw,d{\bf a},d{\bf c},db) < \infty

and, writing $\mathbf{x}_{t}^{N}=\mathbf{x}_{t}^{N}({\bf z})$,

H^{N}({\bf z}) = \int_{\mathcal{X}^{N}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{N}(dw,d{\bf a},d{\bf c},db), \quad {\bf z}\in\mathcal{I}_{d}. \qquad (2.7)

We shall refer to the elements in $\mathcal{C}_{N}$ as finite-dimensional recurrent Barron functionals. In the following proposition, we show that the set of finite-dimensional recurrent Barron functionals can be naturally seen as a subspace of $\mathcal{C}$. In the statement, we use the notion and properties of system morphisms recalled, for instance, in [Grig 21b].

Proposition 2.4.

For arbitrary $N\in\mathbb{N}$, let $\iota\colon\mathbb{R}^{N}\to\ell^{q}$ be the natural injection $(x_{1},\ldots,x_{N})^{\top}\longmapsto(x_{1},\ldots,x_{N},0,\ldots)$. Given an element $H^{N}\in\mathcal{C}_{N}$, there exists a system of type (2.1)-(2.2) that realizes $H^{N}$ and for which the map $\iota$ is a system morphism. This allows us to conclude that

\mathcal{C}_{N}\subset\mathcal{C}, \qquad (2.8)

that is, every finite-dimensional recurrent Barron functional admits a realization as a recurrent Barron functional.

Proof.

Let $H^{N}\in\mathcal{C}_{N}$ be such that (2.7) holds with $\mathbf{x}^{N}\in\ell^{\infty}_{-}(\mathbb{R}^{N})$ satisfying (2.6). First, define $\bar{A}\colon\ell^{q}\to\ell^{q}$, $\bar{{\bf B}}\in\ell^{q}$, $\bar{C}\colon\mathbb{R}^{d}\to\ell^{q}$ by setting $(\bar{A}\mathbf{x})_{i}=\mathbbm{1}_{\{i\leq N\}}\sum_{j=1}^{N}A_{ij}x_{j}$, $\bar{B}_{i}=\mathbbm{1}_{\{i\leq N\}}B_{i}$, $(\bar{C}{\bf z})_{i}=\mathbbm{1}_{\{i\leq N\}}\sum_{j=1}^{d}C_{ij}z_{j}$ for $\mathbf{x}\in\ell^{q}$, ${\bf z}\in\mathbb{R}^{d}$, $i\in\mathbb{N}$. Furthermore, if $\iota$ is the canonical injection, we denote by the same symbol the injection of $\mathcal{X}^{N}$ into $\mathcal{X}$ and by $\mu=\mu^{N}\circ\iota^{-1}$ the pushforward measure of $\mu^{N}$ under $\iota$.

Consider now the system (2.6)–(2.7) and the one given by (2.1) and (2.2) with the parameters $\bar{A},\bar{{\bf B}},\bar{C}$ that we just defined, as well as the readout given by the measure $\mu=\mu^{N}\circ\iota^{-1}$. We shall prove that the map $\iota\colon\mathbb{R}^{N}\to\ell^{q}$ is a system morphism between these systems. This requires showing that $\iota$ is system equivariant and readout invariant (see [Grig 21b] for this terminology).

First, we notice that $\iota$ is system equivariant because

\iota\left(\sigma_{1}(A\mathbf{x}^{N}+C{\bf z}+{\bf B})\right) = \sigma_{1}(\iota(A\mathbf{x}^{N}+C{\bf z}+{\bf B})) = \sigma_{1}(\bar{A}\iota(\mathbf{x}^{N})+\bar{C}{\bf z}+\bar{{\bf B}}), \quad \mbox{for any } \mathbf{x}^{N}\in\mathbb{R}^{N},\ {\bf z}\in\mathbb{R}^{d},

as required. As a consequence of this fact (see [Grig 21b, Proposition 2.2]), if $(\mathbf{x}_{t}^{N})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\mathbb{R}^{N})$ is a solution of the system determined by (2.6) and (2.7) for ${\bf z}\in\mathcal{I}_{d}$, then so is $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ with $\mathbf{x}_{t}=\iota(\mathbf{x}_{t}^{N})$ for the one given by (2.1) and (2.2) with the parameters $\bar{A},\bar{{\bf B}},\bar{C}$. This solution is unique because if $(\overline{\mathbf{x}}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ is another solution for it and we denote by $\overline{\mathbf{x}}_{t}^{N}\in\mathbb{R}^{N}$ its first $N$ components, then from (2.1) we obtain for the components $i=1,\ldots,N$ that $\overline{x}_{t,i}^{N}=\sigma_{1}((\bar{A}\overline{\mathbf{x}}_{t-1})_{i}+(\bar{C}{\bf z}_{t})_{i}+\bar{B}_{i})=\sigma_{1}((A\overline{\mathbf{x}}_{t-1}^{N})_{i}+(C{\bf z}_{t})_{i}+B_{i})$, which shows that $\overline{\mathbf{x}}^{N}$ is a solution to (2.6) and thus, by uniqueness, $\overline{\mathbf{x}}^{N}=\mathbf{x}^{N}$ necessarily. For components $i>N$ we get from (2.1) that $\overline{x}_{t,i}=\sigma_{1}(0)=0$. Altogether, this proves that for each ${\bf z}\in\mathcal{I}_{d}$ the system given by (2.1) with the parameters $\bar{A},\bar{{\bf B}},\bar{C}$ has a unique solution $\mathbf{x}$, and the first $N$ entries of each term $\mathbf{x}_{t}$ are given by $\mathbf{x}_{t}^{N}$.

Notice now that the change-of-variables theorem and the equivalence of norms on $\mathbb{R}^{N}$ yield that

\int_{\mathcal{X}}|w|(\|{\bf a}\|_{p}+\|{\bf c}\|+|b|)\,\mu(dw,d{\bf a},d{\bf c},db) = \int_{\mathcal{X}^{N}}|w|(\|\iota({\bf a})\|_{p}+\|{\bf c}\|+|b|)\,\mu^{N}(dw,d{\bf a},d{\bf c},db) < \infty,

which proves that $H$ in (2.2) defines an element in $\mathcal{C}$. Finally, we show that the map $\iota$ is readout invariant using that the first $N$ components of each solution $\mathbf{x}_{t}$ are given by $\mathbf{x}_{t}^{N}$ and applying again the change-of-variables theorem:

H({\bf z}) = \int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(dw,d{\bf a},d{\bf c},db) = \int_{\mathcal{X}^{N}} w\,\sigma_{2}(\iota({\bf a})\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{N}(dw,d{\bf a},d{\bf c},db)
= \int_{\mathcal{X}^{N}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{N}(dw,d{\bf a},d{\bf c},db) = H^{N}({\bf z})

for ${\bf z}\in\mathcal{I}_{d}$, as required. In particular, $H^{N}$ and $H$ are in fact the same functional, which proves the inclusion (2.8). ∎

2.3 Examples of recurrent Barron functionals

We now show that large classes of input/output systems naturally belong to the recurrent Barron class.

Finite-memory functionals. The following proposition shows that finite-memory causal and time-invariant systems that have certain regularity properties can be written as finite-dimensional recurrent Barron functionals with a linear state equation.

Proposition 2.5.

Assume $\sigma_{1}(x)=x$, $\sigma_{2}(x)=\max(x,0)$, and that $D_{d}$ is bounded; let $T\in\mathbb{N}$ and suppose that $h\colon\mathbb{R}^{Td}\to\mathbb{R}$ is in the Sobolev space $H^{s}(\mathbb{R}^{Td})$ for $s>\frac{Td}{2}+2$. Then the functional $H({\bf z})=h({\bf z}_{0},\ldots,{\bf z}_{-T+1})$, ${\bf z}\in\mathcal{I}_{d}$, is a finite-dimensional recurrent Barron functional.

Proof.

Choose $N=dT$, ${\bf B}=\mathbf{0}$,

A = \begin{pmatrix} \mathbf{0}_{d,d(T-1)} & \mathbf{0}_{d,d} \\ \mathbb{I}_{d(T-1)} & \mathbf{0}_{d(T-1),d} \end{pmatrix} \quad\mbox{and}\quad C = \begin{pmatrix} \mathbb{I}_{d} \\ \mathbf{0}_{d(T-1),d} \end{pmatrix}. \qquad (2.9)

Then the system

\mathbf{x}_{t}^{N} = \sigma_{1}(A\mathbf{x}_{t-1}^{N}+C{\bf z}_{t}+{\bf B}), \quad t\in\mathbb{Z}_{-}, \qquad (2.10)

admits as its unique solution $\mathbf{x}_{t}^{N}=({\bf z}_{t}^{\top},\ldots,{\bf z}_{t-T+1}^{\top})^{\top}$, $t\in\mathbb{Z}_{-}$. Next, by [E 21, Theorem 3.1] (which is based on [Barr 93, Proposition 1 and Section IX.15]; alternatively, the result also follows using the representation obtained in the proof of [Gono 23, Theorem 1]), there exists a (Borel) probability measure $\tilde{\mu}$ on $\mathbb{R}\times\mathbb{R}^{Td+1}$ such that $\int_{\mathbb{R}\times\mathbb{R}^{Td+1}}|w|\|{\bf v}\|\tilde{\mu}(dw,d{\bf v})<\infty$ and, for Lebesgue-a.e. ${\bf u}\in(D_{d})^{T}$,

h({\bf u}) = \int_{\mathbb{R}\times\mathbb{R}^{Td+1}} w\,\sigma_{2}({\bf v}\cdot({\bf u}^{\top},1)^{\top})\,\tilde{\mu}(dw,d{\bf v}). \qquad (2.11)

But $h$ is continuous on $(D_{d})^{T}$ by [Evan 10, Theorem 6(ii) in Chapter 5.6] and the right-hand side of (2.11) is also continuous in ${\bf u}$, hence the representation (2.11) holds for all ${\bf u}\in(D_{d})^{T}$.

Let $\pi\colon\mathbb{R}\times\mathbb{R}^{Td+1}\to\mathcal{X}^{N}$ be defined as follows: for $w\in\mathbb{R}$, ${\bf v}\in\mathbb{R}^{Td+1}$ we set $\pi(w,{\bf v})=(w,(v_{d+1},\ldots,v_{Td},0,\ldots,0),(v_{1},\ldots,v_{d}),v_{Td+1})$. Denote by $\mu^{N}=\tilde{\mu}\circ\pi^{-1}$ the pushforward measure of $\tilde{\mu}$ under $\pi$. Then, by the change-of-variables formula, $\mu^{N}$ satisfies

\int_{\mathcal{X}^{N}}|w|(\|{\bf a}\|+\|{\bf c}\|+|b|)\,\mu^{N}(dw,d{\bf a},d{\bf c},db) \leq \sqrt{3}\int_{\mathcal{X}^{N}}|w|(\|{\bf a}\|^{2}+\|{\bf c}\|^{2}+|b|^{2})^{1/2}\,\mu^{N}(dw,d{\bf a},d{\bf c},db)
= \sqrt{3}\int_{\mathbb{R}\times\mathbb{R}^{Td+1}}|w|\,\|{\bf v}\|\,\tilde{\mu}(dw,d{\bf v}) < \infty

and, for all ${\bf z}\in(D_{d})^{\mathbb{Z}_{-}}$, with $g_{\bf z}(w,{\bf a},{\bf c},b)=w\,\sigma_{2}(({\bf c}^{\top},{\bf a}^{\top},b)({\bf z}_{0}^{\top},{\bf z}_{-1}^{\top},\ldots,{\bf z}_{-T}^{\top},1)^{\top})$,

H({\bf z}) = h({\bf z}_{0},\ldots,{\bf z}_{-T+1}) = \int_{\mathbb{R}\times\mathbb{R}^{Td+1}} w\,\sigma_{2}({\bf v}\cdot({\bf z}_{0}^{\top},\ldots,{\bf z}_{-T+1}^{\top},1)^{\top})\,\tilde{\mu}(dw,d{\bf v}) \qquad (2.12)
= \int_{\mathbb{R}\times\mathbb{R}^{Td+1}} g_{\bf z}(\pi(w,{\bf v}))\,\tilde{\mu}(dw,d{\bf v}) = \int_{\mathcal{X}^{N}} w\,\sigma_{2}(({\bf c}^{\top},{\bf a}^{\top},b)({\bf z}_{0}^{\top},{\bf z}_{-1}^{\top},\ldots,{\bf z}_{-T}^{\top},1)^{\top})\,\mu^{N}(dw,d{\bf a},d{\bf c},db)
= \int_{\mathcal{X}^{N}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{N}(dw,d{\bf a},d{\bf c},db),

which shows that $H$ is a finite-dimensional recurrent Barron functional, that is, $H\in\mathcal{C}_{N}$. ∎
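The shift-register construction (2.9)–(2.10) used in this proof is easy to check numerically: with $\sigma_{1}$ the identity and ${\bf B}=\mathbf{0}$, the state $\mathbf{x}_{t}^{N}$ ends up stacking the last $T$ inputs. The following sketch does this for illustrative values of $d$ and $T$.

import numpy as np

d, T = 2, 4
N = d * T

# Block matrices of (2.9): A shifts the stacked inputs down by one block, C writes z_t on top.
A = np.zeros((N, N))
A[d:, :d * (T - 1)] = np.eye(d * (T - 1))
C = np.zeros((N, d))
C[:d, :] = np.eye(d)

rng = np.random.default_rng(2)
z = rng.uniform(-1, 1, size=(50, d))

x = np.zeros(N)
for zt in z:
    x = A @ x + C @ zt            # sigma_1 = identity and B = 0 in (2.10)

# After at least T steps the state stacks (z_t, z_{t-1}, ..., z_{t-T+1}).
expected = np.concatenate([z[-1 - k] for k in range(T)])
print("shift-register check:", np.allclose(x, expected))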

Linear functionals. The following propositions show that certain linear functionals are also in the recurrent Barron class $\mathcal{C}$.

Proposition 2.6.

Suppose $\sigma_{1}(x)=x$ and either $\sigma_{2}(x)=x$ or $\sigma_{2}(x)=\max(x,0)$. Let $\mathbf{w}\in\ell^{1}_{-}$, ${\bf a}_{i}\in\mathbb{R}^{d}$ for $i\in\mathbb{Z}_{-}$, and assume that $D_{d}$ is bounded and $\sup_{i\in\mathbb{Z}_{-}}\|{\bf a}_{i}\|<\infty$. Assume that either

(i)

$\sum_{i\in\mathbb{Z}_{-}}|w_{i}|\lambda^{i}<\infty$ for some $\lambda\in(0,1)$,

(ii)

or $\mathcal{I}_{d}\subset\ell^{q}_{-}(D_{d})$.

Then the functional

H({\bf z}) = \sum_{i\in\mathbb{Z}_{-}} w_{i}\,{\bf a}_{i}\cdot{\bf z}_{i}, \quad {\bf z}\in\mathcal{I}_{d},

satisfies $H\in\mathcal{C}$.

Proof.

Let $\lambda\in(0,1)$ be as in (i) or, in case (ii) holds, let $\lambda=1$.

Consider first the case $\sigma_{2}(x)=x$. Let ${\bf B}={\bf 0}$ and define $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$ by $(A\mathbf{x})_{i}=\mathbbm{1}_{\{i>d\}}\lambda x_{i-d}$, $(C{\bf z})_{i}=\mathbbm{1}_{\{i\leq d\}}z_{i}$ for $\mathbf{x}\in\ell^{q}$, ${\bf z}\in\mathbb{R}^{d}$, $i\in\mathbb{N}$. Let us now look at the $i$-th component of (2.1). Inserting the definitions yields

x_{t,i} = \mathbbm{1}_{\{i>d\}}\lambda x_{t-1,i-d} + \mathbbm{1}_{\{i\leq d\}}z_{t,i}, \quad t\in\mathbb{Z}_{-}. \qquad (2.13)

By iterating (2.13) we see that for $k\in\mathbb{N}$, $j=1,\ldots,d$, we must have $x_{t,(k-1)d+j}=\lambda^{k-1}z_{t-k+1,j}$. Using this as the definition of $\mathbf{x}$, we hence obtain that (2.1) has a unique solution. In addition, this solution satisfies $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$: suppose that (i) holds; then in the case $q<\infty$ we verify for any $t\in\mathbb{Z}_{-}$ that $\sum_{k\in\mathbb{N}}|x_{t,k}|^{q}=\sum_{i\in\mathbb{N}}\lambda^{qi}\|{\bf z}_{t-i}\|_{q}^{q}\leq dM^{q}\lambda^{q}(1-\lambda^{q})^{-1}$, where $M>0$ is chosen such that $\|{\bf u}\|_{\infty}\leq M$ for all ${\bf u}\in D_{d}$. In the case $q=\infty$ we verify that $\sup_{k\in\mathbb{N}}|x_{t,k}|\leq\sup_{s,k\in\mathbb{N}}|z_{s,k}|\leq M$. Similarly, if (ii) holds, $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ follows from ${\bf z}\in\mathcal{I}_{d}\subset\ell^{q}_{-}(D_{d})$.

Define, for each $i\in\mathbb{Z}_{-}\setminus\{0\}$, $\bar{\mathbf{a}}_{i}\in\ell^{p}$ by setting $\bar{a}_{i,j+d(|i|-1)}=\lambda^{i+1}a_{i,j}$ for $j=1,\ldots,d$ and by setting the remaining components of $\bar{\mathbf{a}}_{i}$ equal to zero. Set $\mu=\frac{1}{\|\mathbf{w}\|_{1}}\sum_{i\in\mathbb{Z}_{-}}|w_{i}|\,\delta_{(\mathrm{sign}(w_{i})\|\mathbf{w}\|_{1},\,\bar{{\bf a}}_{i}\mathbbm{1}_{\{i<0\}},\,{\bf a}_{0}\mathbbm{1}_{\{i=0\}},\,0)}$; then $\mu$ is indeed a (Borel) probability measure on $\mathcal{X}$ and $\mu$ satisfies

I_{\mu,p} = \int_{\mathcal{X}}|\bar{w}|(\|{\bf a}\|_{p}+\|{\bf c}\|+|b|)\,\mu(d\bar{w},d{\bf a},d{\bf c},db) = \sum_{i\in\mathbb{Z}_{-}}|w_{i}|(\|\bar{{\bf a}}_{i}\mathbbm{1}_{\{i<0\}}\|_{p}+\|{\bf a}_{0}\mathbbm{1}_{\{i=0\}}\|)
\leq \sqrt{d}\sum_{i\in\mathbb{Z}_{-}}|w_{i}|\lambda^{i+1}\|{\bf a}_{i}\|
\leq \sqrt{d}\,\sup_{i\in\mathbb{Z}_{-}}\{\|{\bf a}_{i}\|\}\sum_{i\in\mathbb{Z}_{-}}|w_{i}|\lambda^{i+1} < \infty,

since $\|\cdot\|_{p}\leq\|\cdot\|$ for $p\geq 2$ and $\|\cdot\|_{p}\leq d^{\frac{1}{p}-\frac{1}{2}}\|\cdot\|$ for $p\in[1,2]$. Furthermore, for any ${\bf z}\in\mathcal{I}_{d}$,

\int_{\mathcal{X}}\bar{w}\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(d\bar{w},d{\bf a},d{\bf c},db) = \sum_{i\in\mathbb{Z}_{-}} w_{i}(\bar{\mathbf{a}}_{i}\mathbbm{1}_{\{i<0\}}\cdot\mathbf{x}_{-1}+{\bf a}_{0}\mathbbm{1}_{\{i=0\}}\cdot{\bf z}_{0}) \qquad (2.14)
= {\bf a}_{0}\cdot{\bf z}_{0}\,w_{0} + \sum_{i\in\mathbb{Z}_{-}\setminus\{0\}} w_{i}\lambda^{i+1}\sum_{j=1}^{d} a_{i,j}\,x_{-1,j+d(|i|-1)}
= {\bf a}_{0}\cdot{\bf z}_{0}\,w_{0} + \sum_{i\in\mathbb{Z}_{-}\setminus\{0\}} w_{i}\sum_{j=1}^{d} a_{i,j}\,z_{-|i|,j}
= H({\bf z}),

which proves that $H\in\mathcal{C}$ in the case $\sigma_{2}(x)=x$.

For the case $\sigma_{2}(x)=\max(x,0)$, let $\varphi\colon\mathcal{X}\to\mathcal{X}$ be defined by $\varphi(\bar{w},{\bf a},{\bf c},b)=(-\bar{w},-{\bf a},-{\bf c},-b)$ and consider $\bar{\mu}=\frac{1}{2}\mu+\frac{1}{2}(\mu\circ\varphi^{-1})$. Then the change-of-variables theorem and the fact that $I_{\mu,p}<\infty$ (as established above) show that $I_{\bar{\mu},p}<\infty$, and by (2.14) and $\sigma_{2}(x)-\sigma_{2}(-x)=x$ we obtain

\int_{\mathcal{X}}\bar{w}\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\bar{\mu}(d\bar{w},d{\bf a},d{\bf c},db) \qquad (2.15)
= \frac{1}{2}\int_{\mathcal{X}}\bar{w}\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(d\bar{w},d{\bf a},d{\bf c},db) - \frac{1}{2}\int_{\mathcal{X}}\bar{w}\,\sigma_{2}(-{\bf a}\cdot\mathbf{x}_{-1}({\bf z})-{\bf c}\cdot{\bf z}_{0}-b)\,\mu(d\bar{w},d{\bf a},d{\bf c},db)
= \frac{1}{2}H({\bf z}),

hence $\frac{1}{2}H\in\mathcal{C}$ and, by Proposition 2.3, also $H\in\mathcal{C}$. ∎

The next proposition also covers the linear case under slightly different hypotheses.

Proposition 2.7.

Suppose $\sigma_{1}(x)=x$ and either $\sigma_{2}(x)=x$ or $\sigma_{2}(x)=\max(x,0)$. Let ${\bf h}\in\ell^{p}_{-}(\mathbb{R}^{d})$ and $\mathcal{I}_{d}\subset\ell^{q}_{-}(D_{d})$. Then the functional

H({\bf z}) = \sum_{i\in\mathbb{Z}_{-}}{\bf h}_{i}\cdot{\bf z}_{i}, \quad {\bf z}\in\mathcal{I}_{d},

satisfies $H\in\mathcal{C}$.

Proof.

The case $\sigma_{2}(x)=\max(x,0)$ can be deduced from the case $\sigma_{2}(x)=x$ precisely as in the proof of Proposition 2.6, hence it suffices to consider the case $\sigma_{2}(x)=x$. Choosing $\lambda=1$ and $A,{\bf B},C$ as in the proof of Proposition 2.6 yields a unique solution $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$ to (2.1), and for $k\in\mathbb{N}$, $j=1,\ldots,d$ we have $x_{t,(k-1)d+j}=z_{t-k+1,j}$. Define now $\mathbf{a}\in\ell^{p}$ by setting $a_{(k-1)d+j}=h_{-k,j}$ for $k\in\mathbb{N}$, $j=1,\ldots,d$, and consider $\mu=\delta_{(1,\mathbf{a},{\bf h}_{0},0)}$. Then $I_{\mu,p}<\infty$ and, for any ${\bf z}\in\mathcal{I}_{d}$,

\int_{\mathcal{X}}\bar{w}\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(d\bar{w},d{\bf a},d{\bf c},db) = \mathbf{a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf h}_{0}\cdot{\bf z}_{0} = {\bf h}_{0}\cdot{\bf z}_{0}+\sum_{k\in\mathbb{N}}\sum_{j=1}^{d} h_{-k,j}\,z_{-k,j} = H({\bf z}), \qquad (2.16)

which proves that $H\in\mathcal{C}$. ∎

Remark 2.8.

This proposition covers, in particular, the functionals associated to convolutional filters. Indeed, let ${\bf h}\in\ell^{p}_{-}(\mathbb{R}^{d})$ and consider the convolutional filter $U_{\bf h}\colon\ell^{q}_{-}(D_{d})\to\mathbb{R}^{\mathbb{Z}_{-}}$ defined by $U_{\bf h}({\bf z})_{t}=\sum_{j\in\mathbb{Z}_{-}}{\bf h}_{j}\cdot{\bf z}_{t+j}$. Then its associated functional $H\colon\ell^{q}_{-}(D_{d})\to\mathbb{R}$ defined by $H({\bf z})=\sum_{j\in\mathbb{Z}_{-}}{\bf h}_{j}\cdot{\bf z}_{j}$ is an element of $\mathcal{C}$, as shown in Proposition 2.7 above.
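As a small illustration of the remark, the sketch below evaluates the functional $H({\bf z})=\sum_{j\in\mathbb{Z}_{-}}{\bf h}_{j}\cdot{\bf z}_{j}$ with the kernel truncated to finitely many lags so that it can be computed; the kernel, the truncation length, and the input are all illustrative.

import numpy as np

rng = np.random.default_rng(4)
d, L, T = 2, 5, 30                    # input dimension, kernel lags kept, sequence length (illustrative)
h = rng.normal(size=(L, d))           # h_0, h_{-1}, ..., h_{-L+1}
z = rng.uniform(-1, 1, size=(T, d))   # z_{-T+1}, ..., z_0 (last row is z_0)

# Functional of Remark 2.8, truncated to the last L lags: H(z) = sum_k h_{-k} . z_{-k}.
H = sum(h[k] @ z[-1 - k] for k in range(L))
print("H(z) =", H)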

As a further example, we now see that any functional associated with a linear state-space system on a separable Hilbert space admits a realization as an element in $\mathcal{C}$ for $p=2$.

Proposition 2.9.

Let $\mathcal{Y}$ be a separable Hilbert space, let $\bar{A}\colon\mathcal{Y}\to\mathcal{Y}$, $\bar{C}\colon\mathbb{R}^{d}\to\mathcal{Y}$ be linear, and let $\bar{\bf B}\in\mathcal{Y}$; assume that for each ${\bf z}\in\mathcal{I}_{d}$ the system

\bar{\mathbf{x}}_{t} = \bar{A}\bar{\mathbf{x}}_{t-1}+\bar{C}{\bf z}_{t}+\bar{\bf B}, \quad t\in\mathbb{Z}_{-}, \qquad (2.17)

admits a unique solution $(\bar{\mathbf{x}}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\mathcal{Y})$, i.e., one satisfying $\sup_{t\in\mathbb{Z}_{-}}\|\bar{\mathbf{x}}_{t}\|_{\mathcal{Y}}<\infty$. Let $\bar{\mu}$ be a (Borel) probability measure on $\mathbb{R}\times\mathcal{Y}\times\mathbb{R}^{d}\times\mathbb{R}$ with $\int_{\mathbb{R}\times\mathcal{Y}\times\mathbb{R}^{d}\times\mathbb{R}}|w|(\|{\bf a}\|_{\mathcal{Y}}+\|{\bf c}\|+|b|)\,\bar{\mu}(dw,d{\bf a},d{\bf c},db)<\infty$ and consider

H({\bf z}) = \int_{\mathbb{R}\times\mathcal{Y}\times\mathbb{R}^{d}\times\mathbb{R}} w\,\sigma_{2}(\langle{\bf a},\bar{\mathbf{x}}_{-1}({\bf z})\rangle_{\mathcal{Y}}+{\bf c}\cdot{\bf z}_{0}+b)\,\bar{\mu}(dw,d{\bf a},d{\bf c},db), \quad {\bf z}\in\mathcal{I}_{d}. \qquad (2.18)

Then $H\in\mathcal{C}$ with $p=2$.

Proof.

Denote by $T\colon\mathcal{Y}\to\ell^{2}$ an isometric isomorphism between $\mathcal{Y}$ and $\ell^{2}$. Define $\mathbf{x}_{t}:=T\bar{\mathbf{x}}_{t}$, $A=T\bar{A}T^{-1}$, $C=T\bar{C}$, ${\bf B}=T\bar{\bf B}$, $\phi\colon\mathbb{R}\times\mathcal{Y}\times\mathbb{R}^{d}\times\mathbb{R}\to\mathcal{X}$, $\phi(w,{\bf a},{\bf c},b)=(w,T{\bf a},{\bf c},b)$, and let $\mu=\bar{\mu}\circ\phi^{-1}$ be the pushforward measure of $\bar{\mu}$ under $\phi$.

Then $\mu$ is a probability measure on $\mathbb{R}\times\ell^{2}\times\mathbb{R}^{d}\times\mathbb{R}$ and $A\colon\ell^{2}\to\ell^{2}$, $C\colon\mathbb{R}^{d}\to\ell^{2}$ are linear maps. Furthermore, by the choice of $A,{\bf B},C$ we have that $\mathbf{x}\in\ell^{\infty}_{-}(\ell^{2})$ is a solution of (2.1). On the other hand, if $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{2})$ is any solution to (2.1), then $\bar{\mathbf{x}}_{t}:=T^{-1}\mathbf{x}_{t}$ defines a solution to (2.17). But the latter system has a unique solution by assumption, hence the solution to (2.1) is also unique.

Finally, the change-of-variables formula and the fact that $T$ is an isometry imply that $\mu$ satisfies $I_{\mu,2}=\int_{\mathbb{R}\times\mathcal{Y}\times\mathbb{R}^{d}\times\mathbb{R}}|w|(\|T{\bf a}\|_{2}+\|{\bf c}\|+|b|)\,\bar{\mu}(dw,d{\bf a},d{\bf c},db)<\infty$ and

H({\bf z}) = \int_{\mathbb{R}\times\mathcal{Y}\times\mathbb{R}^{d}\times\mathbb{R}} w\,\sigma_{2}(T{\bf a}\cdot T\bar{\mathbf{x}}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\bar{\mu}(dw,d{\bf a},d{\bf c},db) \qquad (2.19)
= \int_{\mathbb{R}\times\ell^{2}\times\mathbb{R}^{d}\times\mathbb{R}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(dw,d{\bf a},d{\bf c},db).

This shows that the functional $H$ can be represented as a readout (2.2) of a system (2.1), hence $H\in\mathcal{C}$. It also proves that $T$ is a system isomorphism between the system determined by $\bar{A}$, $\bar{C}$, $\bar{B}$, and $\bar{\mu}$, and the system in $\mathcal{C}$ determined by $A$, $C$, $B$, and $\mu$. ∎

The next proposition is a generalization of a universality result in [Gono 20c, Theorem III.13] to the context of recurrent Barron functionals.

Proposition 2.10.

Assume $\sigma_{1}(x)=x$ and either $\sigma_{2}(x)=\max(x,0)$ or $\sigma_{2}$ is bounded, continuous, and non-constant. Let $\bar{p}\in[1,\infty)$ and let $\gamma$ be a probability measure on $\mathcal{I}_{d}\subset\ell^{\infty}_{-}(D_{d})$. Then $\mathcal{C}\cap L^{\bar{p}}(\mathcal{I}_{d},\gamma)$ is dense in $L^{\bar{p}}(\mathcal{I}_{d},\gamma)$.

Proof.

Let $H\in L^{\bar{p}}(\mathcal{I}_{d},\gamma)$ and $\varepsilon>0$. Consider first the case $\sigma_{2}(x)=\max(x,0)$. Define the activation function $\bar{\sigma}(x)=\sigma_{2}(x+1)-\sigma_{2}(x)$; then $|\bar{\sigma}(x)|\leq 1$ for $x\in\mathbb{R}$. Hence, by [Gono 20c, Theorem III.13] there exists a reservoir functional $H^{RC}$ associated to a linear system with neural network readout such that $H^{RC}\in L^{\bar{p}}(\mathcal{I}_{d},\gamma)$ and $\|H-H^{RC}\|_{L^{\bar{p}}(\mathcal{I}_{d},\gamma)}<\varepsilon$. More specifically, there exist $N\in\mathbb{N}$, $A^{N}\in\mathbb{R}^{N\times N}$, $C^{N}\in\mathbb{R}^{N\times d}$ such that the system

\mathbf{x}_{t}^{N} = A^{N}\mathbf{x}_{t-1}^{N}+C^{N}\mathbf{z}_{t}, \quad t\in\mathbb{Z}_{-} \qquad (2.20)

has the echo state property and $H^{RC}(\mathbf{z})=h(\mathbf{x}_{0}^{N}({\bf z}))$ for some neural network $h\colon\mathbb{R}^{N}\to\mathbb{R}$ given by

h(\mathbf{x}) = \sum_{j=1}^{k}\beta_{j}\,\bar{\sigma}(\bm{\alpha}_{j}\cdot\mathbf{x}-\theta_{j})

for some $k\in\mathbb{N}$, $\beta_{j},\theta_{j}\in\mathbb{R}$, and $\bm{\alpha}_{j}\in\mathbb{R}^{N}$. We claim that $H^{RC}\in\mathcal{C}_{N}$. To prove this, note that with the choices $A=A^{N}$, ${\bf B}={\bf 0}$, and $C=C^{N}$ the process $\mathbf{x}^{N}$ is the unique solution to the system (2.6), due to the echo state property of (2.20). From the proof of [Gono 20c, Theorem III.13] and the assumption that $\mathcal{I}_{d}\subset\ell^{\infty}_{-}(D_{d})$ we obtain that the unique solution satisfies $\mathbf{x}^{N}\in\ell^{\infty}_{-}(\mathbb{R}^{N})$. Set now $\mu^{N}=\frac{1}{2k}\sum_{j=1}^{k}\delta_{(2k\beta_{j},A^{\top}\bm{\alpha}_{j},C^{\top}\bm{\alpha}_{j},-\theta_{j}+1)}+\delta_{(-2k\beta_{j},A^{\top}\bm{\alpha}_{j},C^{\top}\bm{\alpha}_{j},-\theta_{j})}$. Then $\mu^{N}$ is a probability measure on $\mathcal{X}^{N}$ and, for ${\bf z}\in\mathcal{I}_{d}$,

\int_{\mathcal{X}^{N}} w\,\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\,\mu^{N}(dw,d{\bf a},d{\bf c},db)
= \frac{1}{2k}\sum_{j=1}^{k} 2k\beta_{j}\,\sigma_{2}(\bm{\alpha}_{j}\cdot(A\mathbf{x}_{-1}^{N}({\bf z})+C{\bf z}_{0})-\theta_{j}+1) - 2k\beta_{j}\,\sigma_{2}(\bm{\alpha}_{j}\cdot(A\mathbf{x}_{-1}^{N}({\bf z})+C{\bf z}_{0})-\theta_{j})
= \sum_{j=1}^{k}\beta_{j}\,\bar{\sigma}(\bm{\alpha}_{j}\cdot\mathbf{x}_{0}^{N}({\bf z})-\theta_{j}) = h(\mathbf{x}_{0}^{N}({\bf z})) = H^{RC}(\mathbf{z}).

This shows that $H^{RC}\in\mathcal{C}_{N}$ and $\|H-H^{RC}\|_{L^{\bar{p}}(\mathcal{I}_{d},\gamma)}<\varepsilon$. Combining this with Proposition 2.4 yields the claim.

In the case where $\sigma_{2}$ is bounded, continuous, and non-constant we may directly work with $\sigma_{2}$ (instead of $\bar{\sigma}$) and proceed similarly. ∎

Generalized convolutional functionals. We conclude by showing that the class $\mathcal{C}$ contains functionals that generalize convolutional filters, an important family of transforms.

Proposition 2.11.

Assume $\sigma_{1}(x)=x$, $p\in(1,\infty)$, and suppose $\mathcal{I}_{d}\subset\ell^{q}_{-}(D_{d})$. Let $\mu$ be a probability measure on $\mathcal{X}$ with $I_{\mu,p}<\infty$. Then the functional $H\colon\mathcal{I}_{d}\to\mathbb{R}$,

H({\bf z}) = \int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot{\bf z}_{-1:-\infty}+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(dw,d{\bf a},d{\bf c},db), \quad {\bf z}\in\mathcal{I}_{d}, \qquad (2.21)

is in $\mathcal{C}$.

Proof.

We only need to construct ${\bf B}\in\ell^{q}$ and linear maps $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$ such that for each ${\bf z}\in\mathcal{I}_{d}$ the system

\mathbf{x}_{t} = A\mathbf{x}_{t-1}+C{\bf z}_{t}+{\bf B}, \quad t\in\mathbb{Z}_{-}, \qquad (2.22)

admits the unique solution $(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}=({\bf z}_{t:-\infty})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q})$. Then $\mathbf{x}_{-1}={\bf z}_{-1:-\infty}$ and so $H$ has the representation (2.2). Choose ${\bf B}={\bf 0}$, let $C{\bf y}$ be the sequence with the components of ${\bf y}$ in the first $d$ entries and $0$ otherwise, and let $A$ be the operator that shifts the $i$-th entry of the input to the $(i+d)$-th entry and inserts $0$ in the first $d$ entries. Then $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$ are linear, and indeed $\mathbf{x}_{t}={\bf z}_{t:-\infty}$ for all $t\in\mathbb{Z}_{-}$ is the unique solution to (2.22), with $\mathbf{x}\in\ell^{\infty}_{-}(\ell^{q})$. ∎

Notice that standard convolutional filters are a special case of the functional in (2.21), obtained when $\mu$ is a Dirac measure and $\sigma_{2}$ is the identity.

3 Approximation

In this section, we establish approximation results for the concept class $\mathcal{C}$ of recurrent Barron functionals. As approximating systems, we use a modification of the finite-dimensional echo state networks (1.1)–(1.2) in which the linear readout $W$ in (1.2) is replaced by an extreme learning machine (ELM) / random features neural network, that is, a neural network of the type (1.3) whose parameters are all randomly generated, except for the output layer given by the weights $(W_{i})_{i=1,\ldots,N}$, which are trainable. To derive such bounds, we proceed in several approximation steps and (a) approximate the infinite system (2.1) by a finite system of type (1.1), (b) apply a Monte Carlo approximation of the integral in (2.2), and (c) use an importance sampling procedure to guarantee that the neural network weights can be generated from a generic distribution rather than the (unknown) measure $\mu$.
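Step (b) can be illustrated in isolation: for a finite-dimensional state and an explicit (purely illustrative, Gaussian) choice of $\mu$, the integral readout (2.2) is approximated by an empirical average over samples from $\mu$, and the estimate stabilizes as the number of samples grows. Step (c) would additionally reweight samples drawn from a generic distribution instead of $\mu$; that part is not shown here.

import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 3                    # truncated state dimension and input dimension (illustrative)
x_minus1 = rng.normal(size=n)   # stands in for the state x_{-1}(z)
z0 = rng.normal(size=d)         # stands in for the input z_0

def sample_mu(M):
    """Illustrative choice of mu: independent Gaussian coordinates for (w, a, c, b)."""
    return (rng.normal(loc=1.0, size=M),    # w
            rng.normal(size=(M, n)),        # a
            rng.normal(size=(M, d)),        # c
            rng.normal(size=M))             # b

def mc_readout(M):
    """Monte Carlo approximation of the integral readout (2.2) with M samples from mu."""
    w, a, c, b = sample_mu(M)
    return np.mean(w * np.maximum(a @ x_minus1 + c @ z0 + b, 0.0))

for M in (10, 1_000, 100_000):
    print(M, mc_readout(M))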

3.1 Approximation by a finite-dimensional system

We start by showing that the elements in $\mathcal{C}$ can be approximated by finite-dimensional truncations, that is, by finite-dimensional recurrent Barron functionals in $\mathcal{C}_{N}$, for any $N\in\mathbb{N}$.

Proposition 3.1.

Assume $q\in[1,\infty)$ and $\mathcal{I}_{d}\subset\ell^{\infty}_{-}(\mathbb{R}^{d})$. Suppose $H\in\mathcal{C}$ has representation (2.2) with probability measure $\mu$ on $\mathcal{X}$, ${\bf B}\in\ell^{q}$, and bounded linear maps $A\colon\ell^{q}\to\ell^{q}$, $C\colon\mathbb{R}^{d}\to\ell^{q}$. Let $N\in\mathbb{N}$ and, for $j\in\mathbb{N}$, $k=1,\ldots,d$, let $\bm{\epsilon}^{j}=(\delta_{ij})_{i\in\mathbb{N}}\in\mathbb{R}^{\mathbb{N}}$ and ${\bf e}^{k}=(\delta_{ik})_{i=1,\ldots,d}\in\mathbb{R}^{d}$. Let $\bar{\bf B}\in\mathbb{R}^{N}$, $\bar{A}\in\mathbb{R}^{N\times N}$, $\bar{C}\in\mathbb{R}^{N\times d}$ be given by $\bar{B}_{i}=B_{i}$, $\bar{A}_{ij}=(A\bm{\epsilon}^{j})_{i}$, $\bar{C}_{ik}=(C{\bf e}^{k})_{i}$ for $i,j=1,\ldots,N$, $k=1,\ldots,d$. Assume $\sigma_{1}$ is $L_{\sigma_{1}}$-Lipschitz and $L_{\sigma_{1}}|||A|||<1$. Then the following statements hold:

(i)

For each ${\bf z}\in\mathcal{I}_{d}$ the system

\mathbf{x}_{t}^{N} = \sigma_{1}(\bar{A}\mathbf{x}_{t-1}^{N}+\bar{C}{\bf z}_{t}+\bar{\bf B}), \quad t\in\mathbb{Z}_{-}, \qquad (3.1)

admits a unique solution $(\mathbf{x}_{t}^{N})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\mathbb{R}^{N})$.

(ii)

Write $\mathbf{x}_{t}^{N}({\bf z})=\mathbf{x}_{t}^{N}$ and define the functional $H^{N}\colon\mathcal{I}_{d}\to\mathbb{R}$ by

H^{N}({\bf z}) = \int_{\mathcal{X}} w\,\sigma_{2}({\bf a}\cdot\iota(\mathbf{x}_{-1}^{N}({\bf z}))+{\bf c}\cdot{\bf z}_{0}+b)\,\mu(dw,d{\bf a},d{\bf c},db), \quad {\bf z}\in\mathcal{I}_{d}. \qquad (3.2)

Then there exists a constant $C_{\mathrm{fin}}>0$ not depending on $N$ such that for all ${\bf z}\in\mathcal{I}_{d}$,

|H({\bf z})-H^{N}({\bf z})| \leq C_{\mathrm{fin}}\sum_{k=0}^{\infty}(L_{\sigma_{1}}|||A|||)^{k}\left(\sum_{i=N+1}^{\infty}|x_{-1-k,i}|^{q}\right)^{1/q}. \qquad (3.3)

In particular, $\lim_{N\to\infty}H^{N}({\bf z})=H({\bf z})$. The constant is given by $C_{\mathrm{fin}}=L_{\sigma_{2}}\int_{\mathcal{X}}|w|\,\|{\bf a}\|_{p}\,\mu(dw,d{\bf a},d{\bf c},db)$.

(iii)

$H^{N}\in\mathcal{C}_{N}$.

Proof.

Part (i) is a consequence of Proposition 2.2 and of the inequality $|||\bar{A}|||\leq|||A|||$, which in turn implies that $L_{\sigma_{1}}|||\bar{A}|||<1$. Regarding (ii), note first that, by our choice of $\bar{C}$, it follows that $(C{\bf u})_{i}=\sum_{k=1}^{d}u_{k}(C{\bf e}^{k})_{i}=\sum_{k=1}^{d}u_{k}\bar{C}_{ik}=(\bar{C}{\bf u})_{i}$ for all ${\bf u}\in\mathbb{R}^{d}$, $i=1,\ldots,N$. Similarly, for ${\bf y}\in\mathbb{R}^{N}$, $i=1,\ldots,N$ we have that $(\bar{A}{\bf y})_{i}=\sum_{j=1}^{N}y_{j}\bar{A}_{ij}=\sum_{j=1}^{N}y_{j}(A\bm{\epsilon}^{j})_{i}=(A\iota({\bf y}))_{i}$. Consequently, for $i=1,\ldots,N$:

|x_{t,i}-x_{t,i}^{N}| \leq L_{\sigma_{1}}|(A\mathbf{x}_{t-1})_{i}+(C{\bf z}_{t})_{i}+B_{i}-(\bar{A}\mathbf{x}_{t-1}^{N})_{i}-(\bar{C}{\bf z}_{t})_{i}-\bar{B}_{i}| = L_{\sigma_{1}}|(A(\mathbf{x}_{t-1}-\iota(\mathbf{x}_{t-1}^{N})))_{i}|. \qquad (3.4)

Denote by $\pi\colon\ell^{q}\to\mathbb{R}^{N}$ the projection onto the first $N$ coordinates. Then

\|\mathbf{x}_{t}({\bf z})-\iota(\mathbf{x}_{t}^{N}({\bf z}))\|_{q} \leq \|\mathbf{x}_{t}({\bf z})-\iota(\pi(\mathbf{x}_{t}({\bf z})))\|_{q}+\|\iota(\pi(\mathbf{x}_{t}({\bf z})))-\iota(\mathbf{x}_{t}^{N}({\bf z}))\|_{q} \qquad (3.5)
= \left(\sum_{i=N+1}^{\infty}|x_{t,i}|^{q}\right)^{1/q}+\left(\sum_{i=1}^{N}|x_{t,i}-x_{t,i}^{N}|^{q}\right)^{1/q}
\leq \left(\sum_{i=N+1}^{\infty}|x_{t,i}|^{q}\right)^{1/q}+L_{\sigma_{1}}\left(\sum_{i=1}^{N}|(A(\mathbf{x}_{t-1}-\iota(\mathbf{x}_{t-1}^{N})))_{i}|^{q}\right)^{1/q}
\leq \left(\sum_{i=N+1}^{\infty}|x_{t,i}|^{q}\right)^{1/q}+L_{\sigma_{1}}\|A(\mathbf{x}_{t-1}-\iota(\mathbf{x}_{t-1}^{N}))\|_{q}
\leq \left(\sum_{i=N+1}^{\infty}|x_{t,i}|^{q}\right)^{1/q}+L_{\sigma_{1}}|||A|||\,\|\mathbf{x}_{t-1}-\iota(\mathbf{x}_{t-1}^{N})\|_{q}.

Iterating (3.5) thus yields

𝐱t(𝐳)ι(𝐱tN(𝐳))qk=0(Lσ1|A|)k(i=N+1|xtk,i|q)1/q.\|\mathbf{x}_{t}({\bf z})-\iota(\mathbf{x}_{t}^{N}({\bf z}))\|_{q}\leq\sum_{k=0}^{\infty}(L_{\sigma_{1}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{k}\left(\sum_{i=N+1}^{\infty}|{x}_{t-k,i}|^{q}\right)^{1/q}. (3.6)

This can now be used to estimate the difference between HH and its truncation:

|H(𝐳)HN(𝐳)|\displaystyle|H({\bf z})-H^{N}({\bf z})| 𝒳|w||σ2(𝐚𝐱1(𝐳)+𝐜𝐳0+b)σ2(𝐚ι(𝐱1N(𝐳))+𝐜𝐳0+b)|μ(dw,d𝐚,d𝐜,db)\displaystyle\leq\int_{\mathcal{X}}|w||\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)-\sigma_{2}({\bf a}\cdot\iota(\mathbf{x}_{-1}^{N}({\bf z}))+{\bf c}\cdot{\bf z}_{0}+b)|\mu(dw,d{\bf a},d{\bf c},db) (3.7)
Lσ2𝒳|w||𝐚𝐱1(𝐳)𝐚ι(𝐱1N(𝐳))|μ(dw,d𝐚,d𝐜,db)\displaystyle\leq L_{\sigma_{2}}\int_{\mathcal{X}}|w||{\bf a}\cdot\mathbf{x}_{-1}({\bf z})-{\bf a}\cdot\iota(\mathbf{x}_{-1}^{N}({\bf z}))|\mu(dw,d{\bf a},d{\bf c},db)
Lσ2𝒳|w|𝐚pμ(dw,d𝐚,d𝐜,db)𝐱1(𝐳)ι(𝐱1N(𝐳))q\displaystyle\leq L_{\sigma_{2}}\int_{\mathcal{X}}|w|\|{\bf a}\|_{p}\mu(dw,d{\bf a},d{\bf c},db)\|\mathbf{x}_{-1}({\bf z})-\iota(\mathbf{x}_{-1}^{N}({\bf z}))\|_{q}
Cfink=0(Lσ1|A|)k(i=N+1|x1k,i|q)1/q\displaystyle\leq C_{\mathrm{fin}}\sum_{k=0}^{\infty}(L_{\sigma_{1}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{k}\left(\sum_{i=N+1}^{\infty}|{x}_{-1-k,i}|^{q}\right)^{1/q}

with Cfin=Lσ2𝒳|w|𝐚pμ(dw,d𝐚,d𝐜,db)C_{\mathrm{fin}}=L_{\sigma_{2}}\int_{\mathcal{X}}|w|\|{\bf a}\|_{p}\mu(dw,d{\bf a},d{\bf c},db).

It remains to be shown that $H^{N}\in\mathcal{C}_{N}$. Denote now, with a slight abuse of notation, by $\pi\colon\ell^{p}\to\mathbb{R}^{N}$ the projection onto the first $N$ components and let $\phi\colon\mathcal{X}\to\mathcal{X}^{N}$ be the product map $\phi=\mathrm{Id}_{\mathbb{R}}\times\pi\times\mathrm{Id}_{\mathbb{R}^{d}}\times\mathrm{Id}_{\mathbb{R}}$, that is, $\phi(w,{\bf a},{\bf c},b)=(w,(a_{i})_{i=1,\ldots,N},{\bf c},b)$. Denote by $\mu^{N}=\mu\circ\phi^{-1}$ the pushforward measure of $\mu$ under $\phi$. Then $\mu^{N}$ is a (Borel) probability measure on $\mathcal{X}^{N}$ and in part (i) we showed that for each ${\bf z}\in\mathcal{I}_{d}$ the system (3.1) admits a unique solution. In addition, the integral transformation theorem implies that $\mu^{N}$ satisfies

𝒳N|w|(𝐚+𝐜+|b|)μN(dw,d𝐚,d𝐜,db)=𝒳|w|(π(𝐚)+𝐜+|b|)μ(dw,d𝐚,d𝐜,db)<\int_{\mathcal{X}^{N}}|w|(\|{\bf a}\|+\|{\bf c}\|+|b|)\mu^{N}(dw,d{\bf a},d{\bf c},db)=\int_{\mathcal{X}}|w|(\|\pi({\bf a})\|+\|{\bf c}\|+|b|)\mu(dw,d{\bf a},d{\bf c},db)<\infty

and, writing 𝐱tN=𝐱tN(𝐳)\mathbf{x}_{t}^{N}=\mathbf{x}_{t}^{N}({\bf z}),

HN(𝐳)\displaystyle H^{N}({\bf z}) =𝒳wσ2(𝐚ι(𝐱1N(𝐳))+𝐜𝐳0+b)μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot\iota(\mathbf{x}_{-1}^{N}({\bf z}))+{\bf c}\cdot{\bf z}_{0}+b)\mu(dw,d{\bf a},d{\bf c},db) (3.8)
=𝒳wσ2(π(𝐚)𝐱1N(𝐳)+𝐜𝐳0+b)μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}w\sigma_{2}(\pi({\bf a})\cdot\mathbf{x}_{-1}^{N}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu(dw,d{\bf a},d{\bf c},db)
=𝒳Nwσ2(𝐚𝐱1N(𝐳)+𝐜𝐳0+b)μN(dw,d𝐚,d𝐜,db),𝐳d.\displaystyle=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in\mathcal{I}_{d}.

This proves that HN𝒞NH^{N}\in\mathcal{C}_{N}. ∎
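
Although Proposition 3.1 is stated in an infinite-dimensional setting, its mechanism can be illustrated numerically. The following Python sketch is an illustration only (not part of the formal development): it replaces $\ell^{q}$ by a large but finite ambient dimension $M$, imposes a coordinate decay on $A$, $C$, and ${\bf B}$ so that the tail sums in (3.3) are small, and compares the state of the full system with that of its $N$-dimensional truncation (3.1). The chosen dimensions, decay rate, and random seeds are arbitrary assumptions.

\begin{verbatim}
import numpy as np

# Finite-dimensional surrogate for the truncation in Proposition 3.1: the state
# space l^q is replaced by R^M with M large, and A, C, B have coordinate decay
# so that the tail sums in (3.3) are small.  All values are illustrative.
rng = np.random.default_rng(0)
M, d, T_len = 400, 2, 200                      # surrogate dimension, input dim, time steps

decay = 0.7 ** np.arange(M)                    # assumed coordinate decay profile
A = decay[:, None] * rng.standard_normal((M, M)) / np.sqrt(M)
A *= 0.9 / np.linalg.norm(A, 2)                # enforce a contraction (sigma_1 = id)
C = decay[:, None] * rng.standard_normal((M, d))
B = decay * rng.standard_normal(M)
Z = rng.uniform(-1.0, 1.0, size=(T_len, d))    # bounded input path z_t

def final_state(A_, C_, B_, inputs):
    x = np.zeros(A_.shape[0])
    for z in inputs:                           # x_t = A x_{t-1} + C z_t + B
        x = A_ @ x + C_ @ z + B_
    return x

x_full = final_state(A, C, B, Z)
for N in (10, 50, 100, 200):
    x_trunc = final_state(A[:N, :N], C[:N], B[:N], Z)   # truncated system (3.1)
    err = np.linalg.norm(x_full[:N] - x_trunc) + np.linalg.norm(x_full[N:])
    print(f"N = {N:3d}   truncation error (Euclidean proxy) = {err:.2e}")
\end{verbatim}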

3.2 Normalized realizations

The next result shows that a large class of elements in the concept class 𝒞\mathcal{C} can be transformed into what we call a normalized realization in which the solution map for (2.1) is a multiplicative scaling operator.

Proposition 3.2.

Assume σ1(x)=x\sigma_{1}(x)=x, p(1,)p\in(1,\infty) and DdD_{d} is bounded. Suppose H𝒞H\in\mathcal{C} has a realization of the type (2.1)-(2.2) with CC bounded and bounded A:qqA\colon\ell^{q}\to\ell^{q} satisfying ρ(A)<1\rho(A)<1, which guarantees the existence of a positive integer k0k_{0}\in\mathbb{N} such that |Ak|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1, for all kk0k\geq k_{0}. Fix λ(|Ak0|1/k0,1)\lambda\in\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{1/k_{0}},1\right). Then there exists a (Borel) probability measure μ¯\bar{\mu} on 𝒳\mathcal{X} satisfying 𝒳|w|(𝐚p+𝐜+|b|)μ¯(dw,d𝐚,d𝐜,db)<\int_{\mathcal{X}}|w|(\|{\bf a}\|_{p}+\|{\bf c}\|+|b|)\bar{\mu}(dw,d{\bf a},d{\bf c},db)<\infty such that HH has the alternative representation

H(𝐳)=𝒳wσ2(𝐚(Λ¯𝐳)+𝐜𝐳0+b)μ¯(dw,d𝐚,d𝐜,db),𝐳d,H({\bf z})=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot(\bar{\Lambda}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\bar{\mu}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in\mathcal{I}_{d}, (3.9)

where Λ¯:(Dd)q\bar{\Lambda}\colon(D_{d})^{\mathbb{Z}_{-}}\to\ell^{q}, (Λ¯𝐳)d(k1)+j=λk1zk,j(\bar{\Lambda}{\bf z})_{d(k-1)+j}=\lambda^{k-1}z_{-k,j} for kk\in\mathbb{N}, j=1,,dj=1,\ldots,d.

Remark 3.3.

The assumption that DdD_{d} is bounded in the previous statement could be relaxed to the requirement that k𝐳kλk<\sum_{k\in\mathbb{N}}\|{\bf z}_{-k}\|\lambda^{k}<\infty.
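
The operator $\bar{\Lambda}$ in Proposition 3.2 is simply a weighted lag map: it stacks the past inputs ${\bf z}_{-1},{\bf z}_{-2},\ldots$ and damps the $k$-th lag by $\lambda^{k-1}$. The following minimal Python sketch computes the first $N$ coordinates of $\bar{\Lambda}{\bf z}$ for a finite input history; all names and numerical values are illustrative.

\begin{verbatim}
import numpy as np

# First N coordinates of the weighted lag map Lambda_bar from Proposition 3.2:
# (Lambda_bar z)_{d(k-1)+j} = lambda^{k-1} z_{-k,j}.  Names/values illustrative.
def normalized_state(z_history, lam, N):
    """z_history[k-1] holds z_{-k} in R^d; returns the first N coordinates."""
    d = z_history.shape[1]
    out = np.zeros(N)
    for k in range(1, z_history.shape[0] + 1):
        for j in range(d):                     # j = 0,...,d-1 (0-based coordinate index)
            idx = d * (k - 1) + j
            if idx < N:
                out[idx] = lam ** (k - 1) * z_history[k - 1, j]
    return out

rng = np.random.default_rng(1)
z_hist = rng.uniform(-1.0, 1.0, size=(6, 2))   # z_{-1},...,z_{-6} in R^2
print(normalized_state(z_hist, lam=0.8, N=8))
\end{verbatim}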

Proof.

First, we recall that by Proposition 2.2, the hypotheses in the statement guarantee that the system (2.1) admits a unique solution (𝐱t(𝐳))t(q)(\mathbf{x}_{t}({\bf z}))_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q}) for each 𝐳d{\bf z}\in\mathcal{I}_{d}, and that 𝐱t(𝐳)\mathbf{x}_{t}({\bf z}) is explicitly given by the expression (2.3). The existence of a positive integer k0k_{0}\in\mathbb{N} such that |Ak|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 for all kk0k\geq k_{0} is guaranteed by Gelfand’s formula for the expression of the spectral radius ρ(A)\rho(A).

Next, we define A~=λ1A\tilde{A}=\lambda^{-1}A and note that the choice λ(|Ak0|1/k0,1)\lambda\in\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{1/k_{0}},1\right) guarantees that |A~k0|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1. Moreover, for each k0k\in\mathbb{N}_{0} the map A~kC:dq\tilde{A}^{k}C\colon\mathbb{R}^{d}\to\ell^{q} is a bounded linear map between two Banach spaces and hence its adjoint is a bounded linear map from (q)(d)(\ell^{q})^{*}\to(\mathbb{R}^{d})^{*} which we can identify with (A~kC):pd(\tilde{A}^{k}C)^{*}\colon\ell^{p}\to\mathbb{R}^{d}. More specifically, for y(q),𝐮dy\in(\ell^{q})^{*},{\bf u}\in\mathbb{R}^{d} the adjoint is defined by [(A~kC)(y)](𝐮)=[y(A~kC)](𝐮)=y(A~kC𝐮)[(\tilde{A}^{k}C)^{*}(y)]({\bf u})=[y\circ(\tilde{A}^{k}C)]({\bf u})=y(\tilde{A}^{k}C{\bf u}) and so with the identification y(𝐛)=𝐚𝐲𝐛y({\bf b})={\bf a_{y}}\cdot{\bf b} for some 𝐚𝐲p{\bf a_{y}}\in\ell^{p} and all 𝐛q{\bf b}\in\ell^{q} we obtain that the adjoint property translates to (A~kC)(𝐚𝐲)𝐮=𝐚𝐲(A~kC)𝐮(\tilde{A}^{k}C)^{*}({\bf a_{y}})\cdot{\bf u}={\bf a_{y}}\cdot(\tilde{A}^{k}C){\bf u}.

Thus, we may consider the map :p((d)0)\mathcal{L}\colon\ell^{p}\to((\mathbb{R}^{d})^{\mathbb{N}_{0}}) given by [(𝐚)]k=(A~kC)𝐚[\mathcal{L}({\bf a})]_{k}=(\tilde{A}^{k}C)^{*}{\bf a} for k0k\in\mathbb{N}_{0}, 𝐚p{\bf a}\in\ell^{p}. We claim that the image (p)\mathcal{L}(\ell^{p}) can be identified with a subset of p\ell^{p}. Indeed,

k0[(𝐚)]kp=k0(A~kC)𝐚pk0|(A~kC)|p𝐚pp=k0|A~kC|p𝐚pp(|C|𝐚p)pk0|A~k|p(|C|𝐚p)pj0i=0k01|A~k0|jp|A~i|p=(|C|𝐚p)pi=0k01|A~i|p1|A~k0|p<,\!\!\!\!\!\!\sum_{k\in\mathbb{N}_{0}}\|[\mathcal{L}({\bf a})]_{k}\|^{p}=\sum_{k\in\mathbb{N}_{0}}\|(\tilde{A}^{k}C)^{*}{\bf a}\|^{p}\leq\sum_{k\in\mathbb{N}_{0}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|(\tilde{A}^{k}C)^{*}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}\|{\bf a}\|_{p}^{p}=\sum_{k\in\mathbb{N}_{0}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{k}C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}\|{\bf a}\|_{p}^{p}\leq({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\|{\bf a}\|_{p})^{p}\sum_{k\in\mathbb{N}_{0}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{k}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}\\ \leq({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\|{\bf a}\|_{p})^{p}\sum_{j\in\mathbb{N}_{0}}\sum_{i=0}^{k_{0}-1}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{jp}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}=({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\|{\bf a}\|_{p})^{p}\sum_{i=0}^{k_{0}-1}\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}}<\infty, (3.10)

since |A~k0|=λk0|Ak0|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}=\lambda^{-k_{0}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1. Thus, the map ¯:pp\bar{\mathcal{L}}\colon\ell^{p}\to\ell^{p} given by [¯(𝐚)]d(k1)+j=[(A~k1C)𝐚]j[\bar{\mathcal{L}}({\bf a})]_{d(k-1)+j}=[(\tilde{A}^{k-1}C)^{*}{\bf a}]_{j} for kk\in\mathbb{N}, j=1,,dj=1,\ldots,d is well-defined and from (3.10) we deduce that

\|\bar{\mathcal{L}}({\bf a})\|_{p}=\left(\sum_{k\in\mathbb{N}}\sum_{j=1}^{d}\left|[(\tilde{A}^{k-1}C)^{*}{\bf a}]_{j}\right|^{p}\right)^{1/p}=\left(\sum_{k\in\mathbb{N}_{0}}\|[\mathcal{L}({\bf a})]_{k}\|_{p}^{p}\right)^{1/p}\leq\left({d}\sum_{k\in\mathbb{N}_{0}}\|[\mathcal{L}({\bf a})]_{k}\|^{p}\right)^{1/p}\leq c_{0}\|{\bf a}\|_{p}, (3.11)

where c0=d1/p|C|(i=0k01|A~i|p1|A~k0|p)1/pc_{0}=d^{1/p}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\left(\sum_{i=0}^{k_{0}-1}\frac{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\tilde{A}^{i}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{p}}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\tilde{A}^{k_{0}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{p}}\right)^{1/p} and we have used that p\|\cdot\|_{p}\leq\|\cdot\| for p2p\geq 2 and pd1p12\|\cdot\|_{p}\leq d^{\frac{1}{p}-\frac{1}{2}}\|\cdot\| for p[1,2]p\in[1,2]. We may now use (2.3) to rewrite 𝐚𝐱1(𝐳){\bf a}\cdot\mathbf{x}_{-1}({\bf z}) as

𝐚𝐱1(𝐳)=𝐚𝐁¯0+iaik=0(AkC𝐳1k)i=𝐚𝐁¯0+k=0𝐚(A~kC𝐳1kλk),{\bf a}\cdot\mathbf{x}_{-1}({\bf z})={\bf a}\cdot\bar{\bf B}_{0}+\sum_{i\in\mathbb{N}}a_{i}\sum_{k=0}^{\infty}({A}^{k}C{\bf z}_{-1-k})_{i}={\bf a}\cdot\bar{\bf B}_{0}+\sum_{k=0}^{\infty}{\bf a}\cdot(\tilde{A}^{k}C{\bf z}_{-1-k}\lambda^{k}), (3.12)

where 𝐁¯0=k=0Ak𝐁q\bar{\bf B}_{0}=\sum_{k=0}^{\infty}A^{k}{\bf B}\in\ell^{q} and the series can be interchanged by Fubini’s theorem, as we see by estimating

\sum_{k=0}^{\infty}\sum_{i\in\mathbb{N}}|a_{i}||(A^{k}C{\bf z}_{-1-k})_{i}|\leq\sum_{k=0}^{\infty}\|\mathbf{a}\|_{p}\|{A}^{k}C{\bf z}_{-1-k}\|_{q}\leq\sum_{k=0}^{\infty}\|\mathbf{a}\|_{p}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\|{\bf z}_{-1-k}\|_{q}\\ \leq\|\mathbf{a}\|_{p}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\Big(\sup_{t\in\mathbb{Z}_{-}}\|{\bf z}_{t}\|_{q}\Big)\sum_{j\in\mathbb{N}_{0}}\sum_{i=0}^{k_{0}-1}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{j}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<\infty,

since |Ak0|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and DdD_{d} is bounded. The second term in (3.12) can be rewritten as

k=0𝐚(A~kC𝐳1kλk)\displaystyle\sum_{k=0}^{\infty}{\bf a}\cdot(\tilde{A}^{k}C{\bf z}_{-1-k}\lambda^{k}) =k=0((A~kC)𝐚)(𝐳1kλk)=k=0j=1d[(A~kC)𝐚]j[z1k]jλk\displaystyle=\sum_{k=0}^{\infty}((\tilde{A}^{k}C)^{*}{\bf a})\cdot({\bf z}_{-1-k}\lambda^{k})=\sum_{k=0}^{\infty}\sum_{j=1}^{d}[(\tilde{A}^{k}C)^{*}{\bf a}]_{j}[{z}_{-1-k}]_{j}\lambda^{k} (3.13)
=k=1j=1d[(A~k1C)𝐚]j[zk]jλk1=k=1j=1d[¯(𝐚)]d(k1)+j(Λ¯𝐳)d(k1)+j\displaystyle=\sum_{k=1}^{\infty}\sum_{j=1}^{d}[(\tilde{A}^{k-1}C)^{*}{\bf a}]_{j}[{z}_{-k}]_{j}\lambda^{k-1}=\sum_{k=1}^{\infty}\sum_{j=1}^{d}[\bar{\mathcal{L}}({\bf a})]_{d(k-1)+j}(\bar{\Lambda}{\bf z})_{d(k-1)+j}
=¯(𝐚)(Λ¯𝐳).\displaystyle=\bar{\mathcal{L}}({\bf a})\cdot(\bar{\Lambda}{\bf z}).

Combining (3.13) and (3.12) and inserting in (2.2) then yields

H(𝐳)=𝒳wσ2(𝐚𝐁¯0+¯(𝐚)(Λ¯𝐳)+𝐜𝐳0+b)μ(dw,d𝐚,d𝐜,db),𝐳d.\displaystyle H({\bf z})=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot\bar{\bf B}_{0}+\bar{\mathcal{L}}({\bf a})\cdot(\bar{\Lambda}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in\mathcal{I}_{d}. (3.14)

Let ϕ¯:𝒳𝒳\bar{\phi}\colon\mathcal{X}\to\mathcal{X} be given as ϕ¯(w,𝐚,𝐜,b)=(w,¯(𝐚),𝐜,𝐚𝐁¯0+b)\bar{\phi}(w,{\bf a},{\bf c},b)=(w,\bar{\mathcal{L}}({\bf a}),{\bf c},{\bf a}\cdot\bar{\bf B}_{0}+b) and denote by μ¯=μϕ¯1\bar{\mu}=\mu\circ\bar{\phi}^{-1} the pushforward measure of μ\mu under ϕ¯\bar{\phi}. Then the change-of-variables formula and (3.11) show that

Iμ¯,p\displaystyle I_{\bar{\mu},p} =𝒳|w|(¯(𝐚)p+𝐜+|𝐚𝐁¯0+b|)μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}|w|(\|\bar{\mathcal{L}}({\bf a})\|_{p}+\|{\bf c}\|+|{\bf a}\cdot\bar{\bf B}_{0}+b|)\mu(dw,d{\bf a},d{\bf c},db)
𝒳|w|(c0𝐚p+𝐜+𝐚p𝐁¯0q+|b|)μ(dw,d𝐚,d𝐜,db)\displaystyle\leq\int_{\mathcal{X}}|w|(c_{0}\|{\bf a}\|_{p}+\|{\bf c}\|+\|{\bf a}\|_{p}\|\bar{\bf B}_{0}\|_{q}+|b|)\mu(dw,d{\bf a},d{\bf c},db)
max(c0+𝐁¯0q,1)Iμ,p<\displaystyle\leq\max(c_{0}+\|\bar{\bf B}_{0}\|_{q},1)I_{\mu,p}<\infty

and similarly we obtain from (3.14) that for any 𝐳d{\bf z}\in\mathcal{I}_{d},

H(𝐳)\displaystyle H({\bf z}) =𝒳wσ2(𝐚𝐁¯0+¯(𝐚)(Λ¯𝐳)+𝐜𝐳0+b)μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot\bar{\bf B}_{0}+\bar{\mathcal{L}}({\bf a})\cdot(\bar{\Lambda}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu(dw,d{\bf a},d{\bf c},db) (3.15)
=𝒳wσ2(𝐚(Λ¯𝐳)+𝐜𝐳0+b)μ¯(dw,d𝐚,d𝐜,db).\displaystyle=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot(\bar{\Lambda}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\bar{\mu}(dw,d{\bf a},d{\bf c},db).

which is precisely the representation (3.9), with the measure $\bar{\mu}$ constructed above satisfying the required integrability condition. ∎

3.3 Echo state network approximation of normalized, truncated functionals

Consider now the setting of reservoir computing, where we aim to approximate an unknown element in our class 𝒞\mathcal{C} using a finite-dimensional ELM-ESN pair, that is, a randomly generated echo state network-like state equation with a neural network as readout in which only the output layer is trained. As a first step, we consider in this section the situation when the target functional is in the class 𝒞N\mathcal{C}_{N} of finite-dimensional recurrent Barron functionals.

More specifically, consider the echo state network

𝐱tESN=σ1(AESN𝐱t1ESN+CESN𝐳t+𝐁ESN),t,\mathbf{x}_{t}^{\mathrm{ESN}}=\sigma_{1}({A}^{\mathrm{ESN}}\mathbf{x}_{t-1}^{\mathrm{ESN}}+{C}^{\mathrm{ESN}}{\bf z}_{t}+{\bf B}^{\mathrm{ESN}}),\quad t\in\mathbb{Z}_{-}, (3.16)

with randomly generated matrices AESN𝕄N{A}^{\mathrm{ESN}}\in\mathbb{M}_{N}, CESN𝕄N,d{C}^{\mathrm{ESN}}\in\mathbb{M}_{N,d}, and 𝐁ESNN{\bf B}^{\mathrm{ESN}}\in\mathbb{R}^{N}, which is used to capture the dynamics and then plugged into a random neural network readout. Thus, the element in 𝒞N\mathcal{C}_{N} is approximated by

H^(𝐳)=i=1Nw(i)σ2(𝐚(i)𝐱1ESN(𝐳)+𝐜(i)𝐳0+bi)\widehat{H}({\bf z})=\sum_{i=1}^{N}w^{(i)}\sigma_{2}({\bf a}^{(i)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf z})+{\bf c}^{(i)}\cdot{\bf z}_{0}+b_{i}) (3.17)

with randomly generated coefficients 𝐚(i){\bf a}^{(i)}, 𝐜(i){\bf c}^{(i)}, bib_{i} valued in N\mathbb{R}^{N}, d\mathbb{R}^{d}, and \mathbb{R}, respectively, and where 𝐖=(w(1),,w(N))N{\bf W}=\left(w^{(1)},\ldots,w^{(N)}\right)^{\top}\in\mathbb{R}^{N} is trainable.
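
The following Python sketch illustrates the forward pass of the ESN-ELM pair (3.16)-(3.17): the reservoir matrices and the ELM hidden weights are drawn at random once and frozen, and only the output vector ${\bf W}$ enters linearly. The Gaussian sampling, the choice $\sigma_{2}=\tanh$, and all dimensions are illustrative assumptions, not prescriptions of the results below.

\begin{verbatim}
import numpy as np

# Forward pass of the ESN-ELM pair (3.16)-(3.17): random, frozen reservoir and
# ELM hidden layer; only the output vector W is trainable.  Gaussian sampling,
# sigma_2 = tanh and all dimensions are illustrative assumptions.
rng = np.random.default_rng(0)
N, d = 50, 3                                   # reservoir dimension, input dimension

A_esn = rng.standard_normal((N, N))
A_esn *= 0.8 / np.linalg.norm(A_esn, 2)        # enforce |||A_esn||| < 1
C_esn = rng.standard_normal((N, d))
B_esn = rng.standard_normal(N)

a = rng.standard_normal((N, N))                # ELM hidden weights a^(i) (rows)
c = rng.standard_normal((N, d))                # ELM hidden weights c^(i) (rows)
b = rng.standard_normal(N)                     # ELM hidden biases b_i
sigma2 = np.tanh                               # Lipschitz readout activation (example)

def esn_elm(z_path, W, sigma1=lambda x: x):
    """z_path contains z_{-T+1},...,z_0; returns the scalar output H_hat(z)."""
    x = np.zeros(N)
    for z in z_path[:-1]:                      # state recursion (3.16) up to time -1
        x = sigma1(A_esn @ x + C_esn @ z + B_esn)
    features = sigma2(a @ x + c @ z_path[-1] + b)
    return W @ features                        # readout (3.17); only W is trained

z_path = rng.uniform(-1.0, 1.0, size=(30, d))
W = rng.standard_normal(N)                     # untrained output layer, for illustration
print(esn_elm(z_path, W))
\end{verbatim}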

Let $T=\left\lceil\frac{N}{d}\right\rceil$ and define the Kalman controllability matrix

K=πN×N(CESN|AESNCESN||(AESN)T2CESN|(AESN)T1CESN),K=\pi_{N\times N}({C}^{\mathrm{ESN}}|{A}^{\mathrm{ESN}}{C}^{\mathrm{ESN}}|\cdots|({A}^{\mathrm{ESN}})^{T-2}{C}^{\mathrm{ESN}}|({A}^{\mathrm{ESN}})^{T-1}{C}^{\mathrm{ESN}}), (3.18)

where, in case T>NdT>\frac{N}{d}, the map πN×N\pi_{N\times N} removes the last dTNdT-N columns.
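
As a quick illustration (with randomly drawn matrices and sizes that are purely illustrative), the matrix $K$ in (3.18) can be assembled as follows.

\begin{verbatim}
import numpy as np

# Assembling the Kalman controllability matrix (3.18).  A_esn and C_esn are
# drawn at random purely for illustration; when T > N/d, only the first N
# columns of [C | AC | ... | A^{T-1}C] are kept.
rng = np.random.default_rng(0)
N, d = 50, 3
A_esn = rng.standard_normal((N, N))
A_esn *= 0.8 / np.linalg.norm(A_esn, 2)
C_esn = rng.standard_normal((N, d))

T = int(np.ceil(N / d))
blocks = [np.linalg.matrix_power(A_esn, k) @ C_esn for k in range(T)]
K = np.hstack(blocks)[:, :N]                   # pi_{N x N}: drop the last dT - N columns
print(K.shape, np.linalg.cond(K))              # square and, generically, invertible
\end{verbatim}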

Proposition 3.4.

Suppose HN:(Dd)H^{N}\colon(D_{d})^{\mathbb{Z}_{-}}\to\mathbb{R} admits a normalized realization of the type

HN(𝐳)=𝒳Nwσ2(𝐚(ΛN𝐳)+𝐜𝐳0+b)μN(dw,d𝐚,d𝐜,db),𝐳(Dd),H^{N}({\bf z})=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot(\Lambda_{N}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in(D_{d})^{\mathbb{Z}_{-}}, (3.19)

for some μN\mu^{N} on 𝒳N\mathcal{X}^{N} with IμN=𝒳N|w|(𝐚+𝐜+|b|)μN(dw,d𝐚,d𝐜,db)<I_{\mu^{N}}=\int_{\mathcal{X}^{N}}|w|(\|{\bf a}\|+\|{\bf c}\|+|b|)\mu^{N}(dw,d{\bf a},d{\bf c},db)<\infty and where ΛN:(Dd)N\Lambda_{N}\colon(D_{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{N} satisfies (ΛN𝐳)d(k1)+j=λk1(zk)j(\Lambda_{N}{\bf z})_{d(k-1)+j}=\lambda^{k-1}(z_{-k})_{j} for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq N. Assume that σ1(x)=x\sigma_{1}(x)=x, DdD_{d} is bounded, and consider an arbitrary system of the type (3.16) such that ρ(AESN)<1\rho\left({A}^{\mathrm{ESN}}\right)<1, and that the matrix KK in (3.18) is invertible. Then there exists a Borel measure μ~N\tilde{\mu}^{N} on 𝒳N\mathcal{X}^{N} such that Iμ~N<I_{\tilde{\mu}^{N}}<\infty and the mapping

H¯N(𝐳)=𝒳Nwσ2(𝐚𝐱1ESN(𝐳)+𝐜𝐳0+b)μ~N(dw,d𝐚,d𝐜,db),𝐳(Dd)\bar{H}^{N}({\bf z})=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in(D_{d})^{\mathbb{Z}_{-}} (3.20)

satisfies for all 𝐳(Dd){\bf z}\in(D_{d})^{\mathbb{Z}_{-}} the bound

|HN(𝐳)H¯N(𝐳)|\displaystyle|H^{N}({\bf z})-\bar{H}^{N}({\bf z})| Lσ2(|CESN|M+𝐁ESN)|(AESN)k0|Tk0|AESN|k01\displaystyle\leq L_{\sigma_{2}}\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|\right){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\lfloor\frac{T}{k_{0}}\rfloor}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{k_{0}-1}
×(i=0k01|(AESN)i|1|(AESN)k0|)|KΛ|𝒳N|w|𝐚μN(dw,d𝐚,d𝐜,db),\displaystyle\times\left(\sum_{i=0}^{k_{0}-1}\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}\right){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\int_{\mathcal{X}^{N}}|w|\|{\bf a}\|\mu^{N}(dw,d{\bf a},d{\bf c},db), (3.21)

where k0k_{0}\in\mathbb{N} is the smallest integer such that |(AESN)k0|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1, M>0M>0 is chosen such that 𝐮M\|{\bf u}\|\leq M for all 𝐮Dd{\bf u}\in D_{d}, and ΛN×N\Lambda\in\mathbb{R}^{N\times N} is the diagonal matrix with entries Λd(k1)+j,d(k1)+j=λk1\Lambda_{d(k-1)+j,d(k-1)+j}=\lambda^{k-1} for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq N.

Remark 3.5.

A particular case of the previous statement is when AESN{A}^{\mathrm{ESN}} is chosen such that |AESN|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1. In that situation, instead of the bound (3.21) one obtains

|HN(𝐳)H¯N(𝐳)|Lσ2(|CESN|M+𝐁ESN)|AESN|T1|AESN||KΛ|𝒳N|w|𝐚μN(dw,d𝐚,d𝐜,db).|H^{N}({\bf z})-\bar{H}^{N}({\bf z})|\leq L_{\sigma_{2}}\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|\right)\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}\ {\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\int_{\mathcal{X}^{N}}|w|\|{\bf a}\|\mu^{N}(dw,d{\bf a},d{\bf c},db). (3.22)
Remark 3.6.

It can be shown (see [Grig 21a, Proposition 4.4]) that if AESN{A}^{\mathrm{ESN}} and CESN{C}^{\mathrm{ESN}} are randomly drawn using regular random variables with values in the corresponding spaces, then the hypothesis on the invertibility of KK holds almost surely.
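
A simple numerical sanity check of Remark 3.6, taking Gaussian entries as one example of such random draws (this is an illustration, not a proof), is the following.

\begin{verbatim}
import numpy as np

# Sanity check of Remark 3.6 with Gaussian draws (one example of a "regular"
# distribution): across many trials the controllability matrix K has full rank.
rng = np.random.default_rng(2)
N, d = 20, 3
T = int(np.ceil(N / d))
rank_deficient = 0
for _ in range(1000):
    A = rng.standard_normal((N, N)) / np.sqrt(N)
    C = rng.standard_normal((N, d))
    K = np.hstack([np.linalg.matrix_power(A, k) @ C for k in range(T)])[:, :N]
    if np.linalg.matrix_rank(K) < N:
        rank_deficient += 1
print("rank-deficient draws out of 1000:", rank_deficient)   # expected: 0
\end{verbatim}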

Proof.

First notice that by our hypotheses and (2.3), the equation (3.16) has a unique solution in (N)\ell^{\infty}_{-}(\mathbb{R}^{N}) given by

𝐱tESN=k=0(AESN)k(CESN𝐳tk+𝐁ESN),for any t.\mathbf{x}_{t}^{\mathrm{ESN}}=\sum_{k=0}^{\infty}({A}^{\mathrm{ESN}})^{k}({C}^{\mathrm{ESN}}{\bf z}_{t-k}+{\bf B}^{\mathrm{ESN}}),\quad\mbox{for any $t\in\mathbb{Z}_{-}$.} (3.23)

Let 𝐱tN=k=0T1(AESN)k(CESN𝐳tk+𝐁ESN)\mathbf{x}_{t}^{N}=\sum_{k=0}^{T-1}({A}^{\mathrm{ESN}})^{k}({C}^{\mathrm{ESN}}{\bf z}_{t-k}+{\bf B}^{\mathrm{ESN}}). Then

𝐱tESN𝐱tN\displaystyle\|\mathbf{x}_{t}^{\mathrm{ESN}}-\mathbf{x}_{t}^{N}\| k=T|(AESN)k|(|CESN|𝐳tk+𝐁ESN)\displaystyle\leq\sum_{k=T}^{\infty}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\|{\bf z}_{t-k}\|+\|{\bf B}^{\mathrm{ESN}}\|)
(|CESN|M+𝐁ESN)|(AESN)T|k=0|(AESN)k|\displaystyle\leq({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{T}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\sum_{k=0}^{\infty}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}
(|CESN|M+𝐁ESN)|(AESN)k0|Tk0|AESN|k01(i=0k01|(AESN)i|1|(AESN)k0|).\displaystyle\leq({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\lfloor\frac{T}{k_{0}}\rfloor}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{k_{0}-1}\left(\sum_{i=0}^{k_{0}-1}\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}\right). (3.24)

In order to obtain the last inequality in the previous expression, we decomposed T=Tk0k0+rT=\lfloor\frac{T}{k_{0}}\rfloor k_{0}+r, with 0r<k00\leq r<k_{0}, which allowed us to write

|(AESN)T|=|(AESN)Tk0k0+r||(AESN)k0|Tk0|(AESN)r||(AESN)k0|Tk0|AESN|k01.{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{T}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{\lfloor\frac{T}{k_{0}}\rfloor k_{0}+r}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\lfloor\frac{T}{k_{0}}\rfloor}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{r}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\lfloor\frac{T}{k_{0}}\rfloor}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{k_{0}-1}.

In the last inequality we used the fact that k0k_{0}\in\mathbb{N} is the smallest integer such that |(AESN)k0|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and hence |(AESN)r|1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{r}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\geq 1 necessarily if r<k0r<k_{0}.

Denote by 𝐳~N\tilde{{\bf z}}\in\mathbb{R}^{N} the vector with entries 𝐳~d(k1)+j=(zk)j\tilde{\bf z}_{d(k-1)+j}=(z_{-k})_{j} and by ΛN×N\Lambda\in\mathbb{R}^{N\times N} the diagonal matrix such that Λd(k1)+j,d(k1)+j=λk1\Lambda_{d(k-1)+j,d(k-1)+j}=\lambda^{k-1}, for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq N. Then 𝐱1N=k=1T(AESN)k1(CESN𝐳k+𝐁ESN)=K𝐳~+𝐁~\mathbf{x}_{-1}^{N}=\sum_{k=1}^{T}({A}^{\mathrm{ESN}})^{k-1}({C}^{\mathrm{ESN}}{\bf z}_{-k}+{\bf B}^{\mathrm{ESN}})=K\tilde{{\bf z}}+\tilde{\bf B}, where 𝐁~=k=1T(AESN)k1𝐁ESN\tilde{\bf B}=\sum_{k=1}^{T}({A}^{\mathrm{ESN}})^{k-1}{\bf B}^{\mathrm{ESN}}. Let ϕ~:𝒳N𝒳N\tilde{\phi}\colon\mathcal{X}^{N}\to\mathcal{X}^{N} be given as ϕ~(w,𝐚,𝐜,b)=(w,KΛ𝐚,𝐜,(KΛ𝐚)𝐁~+b)\tilde{\phi}(w,{\bf a},{\bf c},b)=(w,K^{-\top}\Lambda{\bf a},{\bf c},-(K^{-\top}\Lambda{\bf a})\cdot\tilde{\bf B}+b) and denote by μ~N=μNϕ~1\tilde{\mu}^{N}=\mu^{N}\circ\tilde{\phi}^{-1} the pushforward measure of μN\mu^{N} under ϕ~\tilde{\phi}. Then the change-of-variables formula yields that

Iμ~N\displaystyle I_{\tilde{\mu}^{N}} =𝒳N|w|(KΛ𝐚+𝐜+|(KΛ𝐚)𝐁~+b|)μN(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{N}}|w|(\|K^{-\top}\Lambda{\bf a}\|+\|{\bf c}\|+|-(K^{-\top}\Lambda{\bf a})\cdot\tilde{\bf B}+b|)\mu^{N}(dw,d{\bf a},d{\bf c},db)
max(|KΛ|,1)(1+𝐁~)IμN<\displaystyle\leq\max({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},1)(1+\|\tilde{\bf B}\|)I_{\mu^{N}}<\infty

and

HN(𝐳)\displaystyle H^{N}({\bf z}) =𝒳Nwσ2((Λ𝐚)𝐳~+𝐜𝐳0+b)μN(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{N}}w\sigma_{2}((\Lambda{\bf a})\cdot\tilde{\bf z}+{\bf c}\cdot{\bf z}_{0}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db) (3.25)
=𝒳Nwσ2((KΛ𝐚)K𝐳~+𝐜𝐳0+b)μN(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{N}}w\sigma_{2}((K^{-\top}\Lambda{\bf a})\cdot K\tilde{\bf z}+{\bf c}\cdot{\bf z}_{0}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db)
=𝒳Nwσ2((KΛ𝐚)𝐱1N(KΛ𝐚)𝐁~+𝐜𝐳0+b)μN(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{N}}w\sigma_{2}((K^{-\top}\Lambda{\bf a})\cdot\mathbf{x}_{-1}^{N}-(K^{-\top}\Lambda{\bf a})\cdot\tilde{\bf B}+{\bf c}\cdot{\bf z}_{0}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db)
=𝒳Nwσ2(𝐚𝐱1N+𝐜𝐳0+b)μ~N(dw,d𝐚,d𝐜,db).\displaystyle=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}+{\bf c}\cdot{\bf z}_{0}+b)\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db).

Combining (3.25) and (3.24) then yields

\displaystyle|H^{N}({\bf z})-\bar{H}^{N}({\bf z})| \displaystyle\leq\int_{\mathcal{X}^{N}}|w|\,|\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{N}+{\bf c}\cdot{\bf z}_{0}+b)-\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)|\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db) (3.26)
\displaystyle\leq L_{\sigma_{2}}\int_{\mathcal{X}^{N}}|w|\,|{\bf a}\cdot(\mathbf{x}_{-1}^{N}-\mathbf{x}_{-1}^{\mathrm{ESN}})|\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db)
\displaystyle\leq\|\mathbf{x}_{-1}^{N}-\mathbf{x}_{-1}^{\mathrm{ESN}}\|L_{\sigma_{2}}\int_{\mathcal{X}^{N}}|w|\,\|K^{-\top}\Lambda{\bf a}\|\mu^{N}(dw,d{\bf a},d{\bf c},db)
\displaystyle\leq L_{\sigma_{2}}({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\lfloor\frac{T}{k_{0}}\rfloor}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{k_{0}-1}
\displaystyle\quad\cdot\left(\sum_{i=0}^{k_{0}-1}\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\left({A}^{\mathrm{ESN}}\right)^{k_{0}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}\right){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\int_{\mathcal{X}^{N}}|w|\,\|{\bf a}\|\mu^{N}(dw,d{\bf a},d{\bf c},db),

which is exactly the bound (3.21). ∎

3.4 Main approximation result

This section contains our main approximation result, in which we extend the possibility of approximating the elements in the class 𝒞N\mathcal{C}_{N} of finite-dimensional recurrent Barron functionals, using finite-dimensional ELM-ESN pairs, to the full class 𝒞\mathcal{C} of recurrent Barron functionals.

In this case, we shall assume that the random variables used in the construction of the ELM layer (3.17), that is, the initialization w(i)w^{(i)} for the readout weight and the hidden weights 𝐚(i){\bf a}^{(i)}, 𝐜(i){\bf c}^{(i)}, bib_{i}, valued in \mathbb{R}, N\mathbb{R}^{N}, d\mathbb{R}^{d}, and \mathbb{R}, respectively, are such that (w(1),𝐚(1),𝐜(1),b1),,(w(N),𝐚(N),𝐜(N),bN)(w^{(1)},{\bf a}^{(1)},{\bf c}^{(1)},b_{1}),\ldots,(w^{(N)},{\bf a}^{(N)},{\bf c}^{(N)},b_{N}) are independent and identically distributed (IID); denote by ν\nu their common distribution.

Let ΛN×N\Lambda\in\mathbb{R}^{N\times N} be the diagonal matrix introduced in Proposition 3.4, let π:pN\pi\colon\ell^{p}\to\mathbb{R}^{N} be the projection map π(𝐚)=(ai)i=1,,N\pi({\bf a})=(a_{i})_{i=1,\ldots,N} and for given Borel measure μ\mu on 𝒳\mathcal{X}, 𝐁q{\bf B}\in\ell^{q}, bounded linear maps C:dqC\colon\mathbb{R}^{d}\to\ell^{q}, A:qqA\colon\ell^{q}\to\ell^{q} with ρ(A)<1\rho{(A)}<1 let μA,B,C=μψA,B,C1{\mu}_{A,B,C}=\mu\circ\psi_{A,B,C}^{-1}, where ψA,B,C:𝒳𝒳N\psi_{A,B,C}\colon\mathcal{X}\to\mathcal{X}^{N} is the map

ψA,B,C(w,𝐚,𝐜,b)=(w,KΛπ(¯(𝐚)),𝐜,(KΛπ(¯(𝐚)))[k=1T(AESN)k1𝐁ESN]+𝐚[k=0Ak𝐁]+b)\psi_{A,B,C}(w,{\bf a},{\bf c},b)=\Bigg{(}w,K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a})),{\bf c},-(K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a})))\cdot\Bigg{[}\sum_{k=1}^{T}({A}^{\mathrm{ESN}})^{k-1}{\bf B}^{\mathrm{ESN}}\Bigg{]}+{\bf a}\cdot\left[\sum_{k=0}^{\infty}A^{k}{\bf B}\right]+b\Bigg{)}

with $[\bar{\mathcal{L}}({\bf a})]_{d(k-1)+j}=[(\tilde{A}^{k-1}C)^{*}{\bf a}]_{j}$ for $k\in\mathbb{N}$, $j=1,\ldots,d$, where $\tilde{A}=\lambda^{-1}A$ as in the proof of Proposition 3.2. Furthermore, let

Iμ,p(2)=(𝒳w2(𝐚p2+𝐜2+|b|2+1)μ(dw,d𝐚,d𝐜,db))1/2.I_{\mu,p}^{(2)}=\left(\int_{\mathcal{X}}w^{2}\left(\|{\bf a}\|_{p}^{2}+\|{\bf c}\|^{2}+|b|^{2}+1\right)\mu(dw,d{\bf a},d{\bf c},db)\right)^{1/2}.

Using this notation, we shall now formulate a result that shows that the elements $H\in\mathcal{C}$ that admit a representation in which the state equation (2.4) is linear and has the unique solution property can be approximated arbitrarily well by randomly generated ESN-ELM pairs in which only the output layer of the ELM is trained. We provide an approximation bound in which the dependence on the dimension $N$ of the ESN state space and on the dimension $d$ of the input is explicitly spelled out.

Theorem 3.7.

In this statement assume that σ1(x)=x\sigma_{1}(x)=x, p(1,)p\in(1,\infty), and that the input set DdD_{d} is bounded. Let H𝒞H\in\mathcal{C} be such that it has a realization of the type (2.2), with 𝐁q{\bf B}\in\ell^{q}, bounded linear maps C:dqC\colon\mathbb{R}^{d}\to\ell^{q}, A:qqA\colon\ell^{q}\to\ell^{q} satisfying |A|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and Iμ,p(2)<I_{\mu,p}^{(2)}<\infty. Using the notation introduced before this statement, suppose that μA,B,Cν\mu_{A,B,C}\ll\nu and that dμA,B,Cdν{\displaystyle\frac{d\mu_{A,B,C}}{d\nu}} is bounded. Let λ(|A|,1)\lambda\in\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},1\right). Consider now a randomly constructed ESN-ELM pair such that |AESN|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and for which the controllability matrix KK in (3.18) is invertible.

Then, there exists a measurable function f:(𝒳N)NNf\colon(\mathcal{X}^{N})^{N}\to\mathbb{R}^{N} such that the ESN-ELM pair H^\widehat{H} with the same parameters and new readout 𝐖=f((w(i),𝐚(i),𝐜(i),bi)i=1,,N){\bf W}=f((w^{(i)},{\bf a}^{(i)},{\bf c}^{(i)},b_{i})_{i=1,\ldots,N}) satisfies, for any 𝐳d{\bf z}\in\mathcal{I}_{d}, the approximation error bound

𝔼[|H(𝐳)H^(𝐳)|2]1/2\displaystyle\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2} CH,ESN[λNd+|AESN|T+1N12dμA,B,Cdν12]\displaystyle\leq C_{H,\mathrm{ESN}}\left[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}^{\frac{1}{2}}\right] (3.27)

with

CH,ESN=c~1max(|C|(1|λ1A|p)1/p,𝐁q1|A|,1)×max(Iμ,p,Iμ,p(2))(1+|CESN|+𝐁ESN1|AESN||KΛ|)C_{H,\mathrm{ESN}}=\tilde{c}_{1}\max\left(\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{\left(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\lambda^{-1}{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}\right)^{1/p}},\frac{\|{\bf B}\|_{q}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}},1\right)\\ \times\max\left(I_{\mu,p},I_{\mu,p}^{(2)}\right)\cdot\left(1+\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\right) (3.28)

for c~1\tilde{c}_{1} only depending on dd, σ2\sigma_{2}, pp, diam(Dd)\mathrm{diam}(D_{d}) and λ\lambda (see (3.43)).

Proof.

Let Λ¯:(Dd)q\bar{\Lambda}\colon(D_{d})^{\mathbb{Z}_{-}}\to\ell^{q}, (Λ¯𝐳)d(k1)+j=λk1(zk)j(\bar{\Lambda}{\bf z})_{d(k-1)+j}=\lambda^{k-1}(z_{-k})_{j} for kk\in\mathbb{N}, j=1,,dj=1,\ldots,d. Then Proposition 3.2 implies that there exists a (Borel) probability measure μ¯\bar{\mu} on 𝒳\mathcal{X} satisfying Iμ¯,p<I_{\bar{\mu},p}<\infty and such that the normalized representation

H(𝐳)=𝒳wσ2(𝐚(Λ¯𝐳)+𝐜𝐳0+b)μ¯(dw,d𝐚,d𝐜,db),𝐳d,H({\bf z})=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot(\bar{\Lambda}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\bar{\mu}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in\mathcal{I}_{d}, (3.29)

holds. Notice that from the proof of Proposition 2.6 we obtain that (Λ¯𝐳)d(k1)+j=λk1(zk)j=x¯1,d(k1)+j(\bar{\Lambda}{\bf z})_{d(k-1)+j}=\lambda^{k-1}(z_{-k})_{j}=\bar{{x}}_{-1,d(k-1)+j}, where 𝐱¯\bar{\mathbf{x}} is the unique solution to

𝐱¯t=σ1(A¯𝐱¯t1+C¯𝐳t),t,\bar{\mathbf{x}}_{t}=\sigma_{1}(\bar{A}\bar{\mathbf{x}}_{t-1}+\bar{C}{\bf z}_{t}),\quad t\in\mathbb{Z}_{-}, (3.30)

with $\bar{A}\colon\ell^{q}\to\ell^{q}$, $\bar{C}\colon\mathbb{R}^{d}\to\ell^{q}$ given by $(\bar{A}{\bf x})_{i}=\mathbbm{1}_{\{i>d\}}\lambda x_{i-d}$, $(\bar{C}{\bf z})_{i}=\mathbbm{1}_{\{i\leq d\}}{z}_{i}$ for ${\bf x}\in\ell^{q}$, ${\bf z}\in\mathbb{R}^{d}$, $i\in\mathbb{N}$. Let $\hat{A}\in\mathbb{R}^{N\times N}$, $\hat{C}\in\mathbb{R}^{N\times d}$ be given as $\hat{A}_{ij}=(\bar{A}\bm{\epsilon}^{j})_{i}=\mathbbm{1}_{\{i>d\}}\lambda\delta_{i-d,j}$, $\hat{C}_{ik}=(\bar{C}{\bf e}^{k})_{i}=\mathbbm{1}_{\{i\leq d\}}\delta_{i,k}$ for $i,j=1,\ldots,N$, $k=1,\ldots,d$. Proposition 3.1 then implies that for each ${\bf z}\in\mathcal{I}_{d}$ the system

𝐱tN=σ1(A^𝐱t1N+C^𝐳t),t,\mathbf{x}_{t}^{N}=\sigma_{1}(\hat{A}\mathbf{x}_{t-1}^{N}+\hat{C}{\bf z}_{t}),\quad t\in\mathbb{Z}_{-}, (3.31)

admits a unique solution (𝐱tN)t(N)(\mathbf{x}_{t}^{N})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\mathbb{R}^{N}) and the functional HN:dH^{N}\colon\mathcal{I}_{d}\to\mathbb{R},

HN(𝐳)=𝒳wσ2(𝐚ι(𝐱1N(𝐳))+𝐜𝐳0+b)μ¯(dw,d𝐚,d𝐜,db),𝐳d,\displaystyle H^{N}({\bf z})=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot\iota(\mathbf{x}_{-1}^{N}({\bf z}))+{\bf c}\cdot{\bf z}_{0}+b)\bar{\mu}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in\mathcal{I}_{d}, (3.32)

satisfies HN𝒞NH^{N}\in\mathcal{C}_{N} and that

|H(𝐳)HN(𝐳)|Cfinl=0|A¯|l(i=N+1|x¯1l,i|q)1/q|H({\bf z})-H^{N}({\bf z})|\leq C_{\mathrm{fin}}\sum_{l=0}^{\infty}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\bar{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{l}\left(\sum_{i=N+1}^{\infty}|\bar{{x}}_{-1-l,i}|^{q}\right)^{1/q} (3.33)

for all 𝐳d{\bf z}\in\mathcal{I}_{d} and by (3.11)

Cfin\displaystyle C_{\mathrm{fin}} =Lσ2𝒳|w|𝐚pμ¯(dw,d𝐚,d𝐜,db)=Lσ2𝒳|w|¯(𝐚)pμ(dw,d𝐚,d𝐜,db)c0Lσ2Iμ,p.\displaystyle=L_{\sigma_{2}}\int_{\mathcal{X}}|w|\|{\bf a}\|_{p}\bar{\mu}(dw,d{\bf a},d{\bf c},db)=L_{\sigma_{2}}\int_{\mathcal{X}}|w|\|\bar{\mathcal{L}}({\bf a})\|_{p}{\mu}(dw,d{\bf a},d{\bf c},db)\leq c_{0}L_{\sigma_{2}}I_{\mu,p}.

Choose M>0M>0 such that 𝐮M\|{\bf u}\|\leq M for all 𝐮Dd{\bf u}\in D_{d}. Furthermore, note that |A¯|λ{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\bar{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\lambda and x¯1l,(k1)d+j=λk1[zlk]j\bar{{x}}_{-1-l,(k-1)d+j}=\lambda^{k-1}[z_{-l-k}]_{j}. Hence, using qqmax(d1q2,1)q\|\cdot\|_{q}^{q}\leq\max(d^{1-\frac{q}{2}},1)\|\cdot\|^{q} and 𝐮M\|{\bf u}\|\leq M for all 𝐮Dd{\bf u}\in D_{d} we obtain from (3.33)

|H(𝐳)HN(𝐳)|\displaystyle|H({\bf z})-H^{N}({\bf z})| c0Lσ2Iμ,pl=0(|A¯|)l(k=Nd+1[λk1]qj=1d|[zlk]j|q)1/q\displaystyle\leq c_{0}L_{\sigma_{2}}I_{\mu,p}\sum_{l=0}^{\infty}({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\bar{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{l}\left(\sum_{k=\lfloor\frac{N}{d}\rfloor+1}^{\infty}[\lambda^{k-1}]^{q}\sum_{j=1}^{d}|[z_{-l-k}]_{j}|^{q}\right)^{1/q} (3.34)
c0Lσ2Iμ,pl=0(|A¯|)l(k=Nd+1[λk1]qmax(d1q2,1)Mq)1/q\displaystyle\leq c_{0}L_{\sigma_{2}}I_{\mu,p}\sum_{l=0}^{\infty}({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\bar{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{l}\left(\sum_{k=\lfloor\frac{N}{d}\rfloor+1}^{\infty}[\lambda^{k-1}]^{q}\max(d^{1-\frac{q}{2}},1)M^{q}\right)^{1/q}
c0Lσ2Iμ,pMmax(d1q12,1)λNd(1λ)(1λq)1/q.\displaystyle\leq c_{0}L_{\sigma_{2}}I_{\mu,p}M\max(d^{\frac{1}{q}-\frac{1}{2}},1)\frac{\lambda^{\lfloor\frac{N}{d}\rfloor}}{(1-\lambda)(1-\lambda^{q})^{1/q}}.

From (3.31) we obtain for i=1,,Ni=1,\ldots,N, tt\in\mathbb{Z}_{-} that (xtN)i=𝟙{i>d}λ(xt1N)id+𝟙{id}(zt)i({x}_{t}^{N})_{i}=\mathbbm{1}_{\{i>d\}}\lambda({x}_{t-1}^{N})_{i-d}+\mathbbm{1}_{\{i\leq d\}}({z}_{t})_{i}. Therefore, (xtN)d(k1)+j=λk1(ztk+1)j({x}_{t}^{N})_{d(k-1)+j}=\lambda^{k-1}(z_{t-k+1})_{j} for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq N. In particular, letting ΛN:(Dd)N\Lambda_{N}\colon(D_{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{N} satisfy (ΛN𝐳)d(k1)+j=λk1(zk)j(\Lambda_{N}{\bf z})_{d(k-1)+j}=\lambda^{k-1}(z_{-k})_{j} for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq N we obtain (x1N)d(k1)+j=λk1(zk)j=(ΛN𝐳)d(k1)+j({x}_{-1}^{N})_{d(k-1)+j}=\lambda^{k-1}(z_{-k})_{j}=(\Lambda_{N}{\bf z})_{d(k-1)+j}. Let ϕ:𝒳𝒳N\phi\colon\mathcal{X}\to\mathcal{X}^{N} be the projection map ϕ(w,𝐚,𝐜,b)=(w,π(𝐚),𝐜,b)\phi(w,{\bf a},{\bf c},b)=(w,\pi({\bf a}),{\bf c},b), with π(𝐚)=(ai)i=1,,N\pi({\bf a})=(a_{i})_{i=1,\ldots,N}, and denote by μN=μ¯ϕ1\mu^{N}=\bar{\mu}\circ\phi^{-1} the pushforward measure of μ¯\bar{\mu} under ϕ\phi. Then IμN<I_{\mu^{N}}<\infty and from (3.32) we get

HN(𝐳)=𝒳Nwσ2(𝐚(ΛN𝐳)+𝐜𝐳0+b)μN(dw,d𝐚,d𝐜,db),𝐳d.\displaystyle H^{N}({\bf z})=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot(\Lambda_{N}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in\mathcal{I}_{d}. (3.35)

Proposition 3.4 hence implies that there exists a Borel measure μ~N\tilde{\mu}^{N} on 𝒳N\mathcal{X}^{N} such that Iμ~N<I_{\tilde{\mu}^{N}}<\infty and the mapping

H¯N(𝐳)=𝒳Nwσ2(𝐚𝐱1ESN(𝐳)+𝐜𝐳0+b)μ~N(dw,d𝐚,d𝐜,db),𝐳(Dd)\bar{H}^{N}({\bf z})=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in(D_{d})^{\mathbb{Z}_{-}} (3.36)

satisfies for all 𝐳d{\bf z}\in\mathcal{I}_{d} the bound (3.22), that is,

|HN(𝐳)H¯N(𝐳)|Lσ2(|CESN|M+𝐁ESN)|AESN|T1|AESN||KΛ|𝒳N|w|𝐚μN(dw,d𝐚,d𝐜,db),|H^{N}({\bf z})-\bar{H}^{N}({\bf z})|\leq L_{\sigma_{2}}({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|)\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\int_{\mathcal{X}^{N}}|w|\|{\bf a}\|\mu^{N}(dw,d{\bf a},d{\bf c},db), (3.37)

where ΛN×N\Lambda\in\mathbb{R}^{N\times N} is the diagonal matrix with entries Λd(k1)+j,d(k1)+j=λk1\Lambda_{d(k-1)+j,d(k-1)+j}=\lambda^{k-1} for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq N. Furthermore, max(1,d121p)p\|\cdot\|\leq\max(1,d^{\frac{1}{2}-\frac{1}{p}})\|\cdot\|_{p} and (3.11) imply

𝒳N|w|𝐚μN(dw,d𝐚,d𝐜,db)\displaystyle\int_{\mathcal{X}^{N}}|w|\|{\bf a}\|\mu^{N}(dw,d{\bf a},d{\bf c},db) =𝒳|w|π(¯(𝐚))μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}|w|\|\pi(\bar{\mathcal{L}}({\bf a}))\|\mu(dw,d{\bf a},d{\bf c},db) (3.38)
max(1,d121p)𝒳|w|¯(𝐚)pμ(dw,d𝐚,d𝐜,db)\displaystyle\leq\max(1,d^{\frac{1}{2}-\frac{1}{p}})\int_{\mathcal{X}}|w|\|\bar{\mathcal{L}}({\bf a})\|_{p}\mu(dw,d{\bf a},d{\bf c},db)
max(1,d121p)c0Iμ,p.\displaystyle\leq\max(1,d^{\frac{1}{2}-\frac{1}{p}})c_{0}I_{\mu,p}.

Note that the measure μ~N\tilde{\mu}^{N} in (3.36) is given by μ~N=μNϕ~1\tilde{\mu}^{N}=\mu^{N}\circ\tilde{\phi}^{-1} with ϕ~\tilde{\phi} as in the proof of Proposition 3.4 and μ¯=μϕ¯1\bar{\mu}=\mu\circ\bar{\phi}^{-1} with ϕ¯\bar{\phi} as in the proof of Proposition 3.2. Then we verify that μ~N=μA,B,C\tilde{\mu}^{N}=\mu_{A,B,C} and the change-of-variables theorem shows that

𝒳N\displaystyle\int_{\mathcal{X}^{N}} f(w,𝐚,𝐜,b)μ~N(dw,d𝐚,d𝐜,db)=𝒳Nf(w,KΛ𝐚,𝐜,(KΛ𝐚)𝐁~+b)μN(dw,d𝐚,d𝐜,db)\displaystyle f(w,{\bf a},{\bf c},b)\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db)=\int_{\mathcal{X}^{N}}f(w,K^{-\top}\Lambda{\bf a},{\bf c},-(K^{-\top}\Lambda{\bf a})\cdot\tilde{\bf B}+b)\mu^{N}(dw,d{\bf a},d{\bf c},db) (3.39)
=𝒳f(w,KΛπ(𝐚),𝐜,(KΛπ(𝐚))𝐁~+b)μ¯(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}f(w,K^{-\top}\Lambda\pi({\bf a}),{\bf c},-(K^{-\top}\Lambda\pi({\bf a}))\cdot\tilde{\bf B}+b)\bar{\mu}(dw,d{\bf a},d{\bf c},db)
=𝒳f(w,KΛπ(¯(𝐚)),𝐜,(KΛπ(¯(𝐚)))𝐁~+𝐚𝐁¯0+b)μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}f(w,K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a})),{\bf c},-(K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a})))\cdot\tilde{\bf B}+{\bf a}\cdot\bar{\bf B}_{0}+b)\mu(dw,d{\bf a},d{\bf c},db)

for any function ff for which the integrand in the last line is integrable with respect to μ\mu and where 𝐁~=k=1T(AESN)k1𝐁ESN\tilde{\bf B}=\sum_{k=1}^{T}({A}^{\mathrm{ESN}})^{k-1}{\bf B}^{\mathrm{ESN}}, 𝐁¯0=k=0Ak𝐁q\bar{\bf B}_{0}=\sum_{k=0}^{\infty}A^{k}{\bf B}\in\ell^{q} and ¯:pp\bar{\mathcal{L}}\colon\ell^{p}\to\ell^{p} is linear and satisfies (3.11).

Set Wi=w(i)Ndμ~Ndν(w(i),𝐚(i),𝐜(i),bi)W_{i}=\dfrac{w^{(i)}}{N}\dfrac{d\tilde{\mu}^{N}}{d\nu}(w^{(i)},{\bf a}^{(i)},{\bf c}^{(i)},b_{i}) and notice that

𝔼[i=1NWiσ2(𝐚(i)𝐱1ESN+𝐜(i)𝐳0+bi)]\displaystyle\mathbb{E}\left[\sum_{i=1}^{N}W_{i}\sigma_{2}({\bf a}^{(i)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}^{(i)}\cdot{\bf z}_{0}+b_{i})\right] =𝒳Nwdμ~Ndν(w,𝐚,𝐜,b)σ2(𝐚𝐱1ESN+𝐜𝐳0+b)ν(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{N}}w\frac{d\tilde{\mu}^{N}}{d\nu}(w,{\bf a},{\bf c},b)\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}\cdot{\bf z}_{0}+b)\nu(dw,d{\bf a},d{\bf c},db)
=𝒳Nwσ2(𝐚𝐱1ESN+𝐜𝐳0+b)μ~N(dw,d𝐚,d𝐜,db)=H¯N(𝐳).\displaystyle=\int_{\mathcal{X}^{N}}w\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}\cdot{\bf z}_{0}+b)\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db)=\bar{H}^{N}({\bf z}).

Therefore,

𝔼[|H¯N(𝐳)H^(𝐳)|2]\displaystyle\mathbb{E}[|\bar{H}^{N}({\bf z})-\widehat{H}({\bf z})|^{2}] =𝔼[|H¯N(𝐳)i=1NWiσ2(𝐚(i)𝐱1ESN+𝐜(i)𝐳0+bi)|2]\displaystyle=\mathbb{E}[|\bar{H}^{N}({\bf z})-\sum_{i=1}^{N}W_{i}\sigma_{2}({\bf a}^{(i)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}^{(i)}\cdot{\bf z}_{0}+b_{i})|^{2}] (3.40)
=Var(i=1NWiσ2(𝐚(i)𝐱1ESN+𝐜(i)𝐳0+bi))\displaystyle=\mathrm{Var}\left(\sum_{i=1}^{N}W_{i}\sigma_{2}({\bf a}^{(i)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}^{(i)}\cdot{\bf z}_{0}+b_{i})\right)
=i=1NVar(Wiσ2(𝐚(i)𝐱1ESN+𝐜(i)𝐳0+bi))\displaystyle=\sum_{i=1}^{N}\mathrm{Var}(W_{i}\sigma_{2}({\bf a}^{(i)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}^{(i)}\cdot{\bf z}_{0}+b_{i}))
N𝔼[W12σ2(𝐚(1)𝐱1ESN+𝐜(1)𝐳0+b1)2].\displaystyle\leq N\mathbb{E}[W_{1}^{2}\sigma_{2}({\bf a}^{(1)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}^{(1)}\cdot{\bf z}_{0}+b_{1})^{2}].

As noted above, μ~N=μA,B,C\tilde{\mu}^{N}=\mu_{A,B,C} and by assumption dμA,B,Cdν{\displaystyle\frac{d\mu_{A,B,C}}{d\nu}} is bounded, hence

𝔼\displaystyle\mathbb{E} [W12σ2(𝐚(1)𝐱1ESN+𝐜(1)𝐳0+b1)2]\displaystyle[W_{1}^{2}\sigma_{2}({\bf a}^{(1)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}^{(1)}\cdot{\bf z}_{0}+b_{1})^{2}]
=𝒳Nw2N2dμ~Ndν(w,𝐚,𝐜,b)σ2(𝐚𝐱1ESN+𝐜𝐳0+b)2μ~N(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{N}}\frac{w^{2}}{N^{2}}\frac{d\tilde{\mu}^{N}}{d\nu}(w,{\bf a},{\bf c},b)\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}\cdot{\bf z}_{0}+b)^{2}\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db)
2dμA,B,Cdν𝒳Nw2N2[Lσ22|𝐚𝐱1ESN+𝐜𝐳0+b|2+|σ2(0)|2]μ~N(dw,d𝐚,d𝐜,db)\displaystyle\leq 2\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}\int_{\mathcal{X}^{N}}\frac{w^{2}}{N^{2}}[L_{\sigma_{2}}^{2}|{\bf a}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}\cdot{\bf z}_{0}+b|^{2}+|\sigma_{2}(0)|^{2}]\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db)
=2dμA,B,Cdν𝒳w2N2[Lσ22|KΛπ(¯(𝐚))𝐱1ESN+𝐜𝐳0(KΛπ(¯(𝐚)))𝐁~+𝐚𝐁¯0+b|2+|σ2(0)|2]μ(dw,d𝐚,d𝐜,db)\displaystyle=2\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}\int_{\mathcal{X}}\frac{w^{2}}{N^{2}}[L_{\sigma_{2}}^{2}|K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a}))\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}+{\bf c}\cdot{\bf z}_{0}-(K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a})))\cdot\tilde{\bf B}+{\bf a}\cdot\bar{\bf B}_{0}+b|^{2}+|\sigma_{2}(0)|^{2}]\mu(dw,d{\bf a},d{\bf c},db)
2dμA,B,Cdν𝒳w2N2[Lσ22|π(¯(𝐚))ΛK1(𝐱1ESN𝐁~)+𝐜𝐳0+𝐚p𝐁¯0q+|b||2+|σ2(0)|2]μ(dw,d𝐚,d𝐜,db)\displaystyle\leq 2\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}\int_{\mathcal{X}}\frac{w^{2}}{N^{2}}[L_{\sigma_{2}}^{2}|\|\pi(\bar{\mathcal{L}}({\bf a}))\|\|\Lambda K^{-1}(\mathbf{x}_{-1}^{\mathrm{ESN}}-\tilde{\bf B})\|+\|{\bf c}\|\|{\bf z}_{0}\|+\|{\bf a}\|_{p}\|\bar{\bf B}_{0}\|_{q}+|b||^{2}+|\sigma_{2}(0)|^{2}]\mu(dw,d{\bf a},d{\bf c},db)
c1N2dμA,B,Cdνmax(c02,𝐁¯0q2,1)(1+|ΛK1|2𝐱1ESN𝐁~2)𝒳w2[𝐚p2+𝐜2+|b|2+1]μ(dw,d𝐚,d𝐜,db)\displaystyle\leq\frac{c_{1}}{N^{2}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}\max(c_{0}^{2},\|\bar{\bf B}_{0}\|_{q}^{2},1)(1+{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\Lambda K^{-1}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{2}\|\mathbf{x}_{-1}^{\mathrm{ESN}}-\tilde{\bf B}\|^{2})\int_{\mathcal{X}}w^{2}[\|{\bf a}\|_{p}^{2}+\|{\bf c}\|^{2}+|b|^{2}+1]\mu(dw,d{\bf a},d{\bf c},db)

where we used (3.11), max(1,d121p)p\|\cdot\|\leq\max(1,d^{\frac{1}{2}-\frac{1}{p}})\|\cdot\|_{p}, c0=d1/(2p)|C|(1|A~|p)1/pc_{0}=d^{1/(2p)}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\tilde{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p})^{-1/p} and set

c1=8max(1,d121p)2max(Lσ2,|σ2(0)|)2max(M2,1).c_{1}=8\max(1,d^{\frac{1}{2}-\frac{1}{p}})^{2}\max(L_{\sigma_{2}},|\sigma_{2}(0)|)^{2}\max(M^{2},1). (3.41)

Combining this with (3.40), (3.34), (3.37) and (3.38) then yields

𝔼[|H(𝐳)H^(𝐳)|2]1/2\displaystyle\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2}
𝔼[|H(𝐳)HN(𝐳)|2]1/2+𝔼[|HN(𝐳)H¯N(𝐳)|2]1/2+𝔼[|H¯N(𝐳)H^(𝐳)|2]1/2\displaystyle\leq\mathbb{E}[|H({\bf z})-H^{N}({\bf z})|^{2}]^{1/2}+\mathbb{E}[|H^{N}({\bf z})-\bar{H}^{N}({\bf z})|^{2}]^{1/2}+\mathbb{E}[|\bar{H}^{N}({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2}
c0Lσ2Iμ,pMmax(1,d1q12)λNd(1λ)(1λq)1/q+Lσ2(|CESN|M+𝐁ESN)|AESN|T1|AESN||KΛ|max(1,d121p)c0Iμ,p\displaystyle\leq c_{0}L_{\sigma_{2}}I_{\mu,p}M\max(1,d^{\frac{1}{q}-\frac{1}{2}})\frac{\lambda^{\lfloor\frac{N}{d}\rfloor}}{(1-\lambda)(1-\lambda^{q})^{1/q}}+L_{\sigma_{2}}\left({\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{C}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|\right)\frac{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{T}}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}}{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|K^{-\top}\Lambda\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}\max\left(1,d^{\frac{1}{2}-\frac{1}{p}}\right)c_{0}I_{\mu,p}
+c11/2N1/2dμA,B,Cdν1/2max(c0,𝐁¯0q,1)(1+|ΛK1|𝐱1ESN𝐁~)Iμ,p(2)\displaystyle\quad\quad+\frac{c_{1}^{1/2}}{N^{1/2}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}^{1/2}\max(c_{0},\|\bar{\bf B}_{0}\|_{q},1)(1+{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\Lambda K^{-1}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}\|\mathbf{x}_{-1}^{\mathrm{ESN}}-\tilde{\bf B}\|)I_{\mu,p}^{(2)}
c~1max(|C|(1|λ1A|p)1/p,𝐁q1|A|,1)[Iμ,pλNd+(|||CESN|||+𝐁ESN)|AESN|T1|AESN||||KΛ|||Iμ,p\displaystyle\leq\tilde{c}_{1}\max\left(\frac{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|C\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}}{\left(1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\lambda^{-1}{A}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{p}\right)^{1/p}},\frac{\|{\bf B}\|_{q}}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|A\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}},1\right)[I_{\mu,p}\lambda^{\frac{N}{d}}+\left({\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{C}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|\right)\frac{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{T}}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}}{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|K^{-\top}\Lambda\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}I_{\mu,p}
+1N12dμA,B,Cdν12(1+|||ΛK1|||(|||CESN|||+𝐁ESN)11|AESN|)Iμ,p(2)]\displaystyle\quad\quad+\frac{1}{N^{\frac{1}{2}}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}^{\frac{1}{2}}(1+{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\Lambda K^{-1}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}\left({\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{C}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|\right)\frac{1}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}})I_{\mu,p}^{(2)}]
c~1max(|C|(1|λ1A|p)1/p,𝐁q1|A|,1)max(Iμ,p,Iμ,p(2))(1+|CESN|+𝐁ESN1|AESN||KΛ|)\displaystyle\leq\tilde{c}_{1}\max\left(\frac{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|C\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}}{\left(1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\lambda^{-1}{A}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{p}\right)^{1/p}},\frac{\|{\bf B}\|_{q}}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|A\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}},1\right)\max(I_{\mu,p},I_{\mu,p}^{(2)})(1+\frac{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{C}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|}{1-{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}}{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|K^{-\top}\Lambda\right|\kern-0.75346pt\right|\kern-0.75346pt\right|})
×[λNd+|AESN|T+1N12dμA,B,Cdν12]\displaystyle\quad\quad\times\left[\lambda^{\frac{N}{d}}+{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|{A}^{\mathrm{ESN}}\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}^{\frac{1}{2}}\right] (3.42)

with

c~1\displaystyle\tilde{c}_{1} =812max(Lσ2,|σ2(0)|,1)max(1,d121p)d12pmax(1,d1q12)(1λ)1(1λq)1qλ1max(M,1)2.\displaystyle=8^{\frac{1}{2}}\max(L_{\sigma_{2}},|\sigma_{2}(0)|,1)\max(1,d^{\frac{1}{2}-\frac{1}{p}})d^{\frac{1}{2p}}\max(1,d^{\frac{1}{q}-\frac{1}{2}})(1-\lambda)^{-1}(1-\lambda^{q})^{-\frac{1}{q}}\lambda^{-1}\max(M,1)^{2}. (3.43)

This establishes the bound (3.27) with the constant $C_{H,\mathrm{ESN}}$ in (3.28), which concludes the proof. ∎

Remark 3.8.

A bound similar to the one in (3.27) can be obtained if the ESN-ELM used as approximant satisfies the more general condition ρ(AESN)<1\rho({A}^{{\rm ESN}})<1. The theorem is proved in that case by replacing in (3.37) the use of the inequality (3.22) by its more general version (3.21).

Corollary 3.9.

As in Theorem 3.7, suppose the recurrent activation function is σ1(x)=x\sigma_{1}(x)=x, p(1,)p\in(1,\infty) and the input set DdD_{d} is bounded. Let H𝒞H\in\mathcal{C} be such that it has a realization of the type (2.2) with measure μ\mu and bounded linear maps C:dqC\colon\mathbb{R}^{d}\to\ell^{q}, A:qqA\colon\ell^{q}\to\ell^{q} satisfying |A|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and Iμ,p(2)<I_{\mu,p}^{(2)}<\infty. Let λ(|A|,1)\lambda\in\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},1\right). Consider now a randomly constructed ESN-ELM pair such that |AESN|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and for which the controllability matrix KK in (3.18) is invertible. Then there exists a distribution ν\nu for the readout hidden layer weights and a readout 𝐖{\bf W} (an N\mathbb{R}^{N}-valued random vector) such that the ESN-ELM H^\widehat{H} satisfies, for any 𝐳d{\bf z}\in\mathcal{I}_{d}, the approximation error bound

𝔼[|H(𝐳)H^(𝐳)|2]1/2\displaystyle\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2} CH,ESN[λNd+|AESN|Nd+1N12]\displaystyle\leq C_{H,\mathrm{ESN}}\left[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\frac{N}{d}}+\frac{1}{N^{\frac{1}{2}}}\right] (3.44)

with CH,ESNC_{H,\mathrm{ESN}} as in (3.28).

Proof.

This follows by noting |AESN|T|AESN|Nd{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\frac{N}{d}} due to TNdT\geq\frac{N}{d} and choosing ν=μA,B,C\nu=\mu_{A,B,C}, so that dμA,B,Cdν=1\dfrac{d\mu_{A,B,C}}{d\nu}=1 and Theorem 3.7 yields (3.44). ∎

Remark 3.10.

Curse of dimensionality. The error bound (3.44) consists of the constant $C_{H,\mathrm{ESN}}$ and three terms that depend explicitly on $N$. The constant $C_{H,\mathrm{ESN}}$ could depend on $N$ through the norms ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}$, $\|{\bf B}^{\mathrm{ESN}}\|$, and ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}$, but it is reasonable to assume that these norms do not depend on $N$ (in practice, these matrices will be normalized). Similarly, it appears reasonable to assume that the norm ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}$ is bounded in $N$, since the norm ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}$ is likely to be bounded from below. This argument indicates that, in practically relevant situations, the constant does not depend on $N$ and, as is apparent from the explicit expression (3.43), it depends only polynomially on the dimension $d$. To achieve an approximation error of size at most $\varepsilon$, we could thus take $N=\left\lceil\max\left(d\log\left(\dfrac{\varepsilon}{3C_{H,\mathrm{ESN}}}\right)\dfrac{1}{\log(\lambda)},d\log\left(\dfrac{\varepsilon}{3C_{H,\mathrm{ESN}}}\right)\dfrac{1}{\log({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})},9C_{H,\mathrm{ESN}}^{2}\varepsilon^{-2}\right)\right\rceil$, which grows only polynomially in $d$ and $\varepsilon^{-1}$. Hence, in these circumstances, there is no curse of dimensionality in the bound (3.44).
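
For illustration only, the following snippet evaluates the above choice of $N$ for a few target accuracies $\varepsilon$; the values used for the constant $C_{H,\mathrm{ESN}}$, for $\lambda$, for the norm of ${A}^{\mathrm{ESN}}$, and for $d$ are made up.

\begin{verbatim}
import numpy as np

# Evaluating the choice of N from Remark 3.10 for a few target accuracies.
# The constant C, lambda, the norm of A_esn and the input dimension d below
# are made-up illustrative values.
def choose_N(eps, C=2.0, lam=0.9, norm_A_esn=0.8, d=5):
    N1 = d * np.log(eps / (3 * C)) / np.log(lam)
    N2 = d * np.log(eps / (3 * C)) / np.log(norm_A_esn)
    N3 = 9 * C**2 / eps**2
    return int(np.ceil(max(N1, N2, N3)))

for eps in (0.5, 0.1, 0.05):
    print(f"eps = {eps:5.2f}   N = {choose_N(eps)}")
\end{verbatim}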

Remark 3.11.

Implementation. From a practical point of view, the bounds obtained in Theorem 3.7 and in Corollary 3.9 apply to two different learning procedures: in the first case, when the chosen ELM weight distribution ν\nu satisfies the absolute continuity assumption in Theorem 3.7, only the ELM output layer needs to be trained. In the second case, when that condition is not available, all the ELM weights also need to be trained. However, note that these are not recurrent weights; consequently, the vanishing gradient problem does not occur here. In contrast, this problem sometimes occurs for standard recurrent neural networks in which all weights are trained by backpropagation through time.
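To fix ideas, the following Python sketch illustrates the first of these procedures on toy data: the ESN matrices and the ELM hidden weights are drawn at random and then frozen, and only the linear output layer is fitted by least squares. The Gaussian weight laws, the toy target, and all variable names are our own simplifying choices and are not prescribed by the results above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, T_data = 100, 3, 500          # reservoir/ELM dimension, input dimension, sample size

# Randomly generated and then frozen ESN parameters, rescaled so that the
# linear state map is a contraction (spectral norm of A below one).
A = rng.standard_normal((N, N)); A *= 0.9 / np.linalg.norm(A, 2)
C = rng.standard_normal((N, d)); B = rng.standard_normal(N)

# Randomly generated and then frozen ELM hidden weights; only W below is trained.
a = rng.standard_normal((N, N)); c = rng.standard_normal((N, d)); b = rng.standard_normal(N)
relu = lambda x: np.maximum(x, 0.0)

def elm_features(Z):
    """Run the linear ESN over an input sequence Z of shape (T, d) and return,
    for each time step, the ELM features sigma2(a . x_{t-1} + c . z_t + b)."""
    x, feats = np.zeros(N), []
    for z in Z:
        feats.append(relu(a @ x + c @ z + b))
        x = A @ x + C @ z + B       # sigma1 = identity (linear reservoir)
    return np.asarray(feats)

# Toy input/output data; in practice (Z, y) are the observed pairs.
Z = rng.uniform(-1.0, 1.0, size=(T_data, d))
y = np.sin(Z[:, 0]) + 0.1 * rng.standard_normal(T_data)

Phi = elm_features(Z)
W = np.linalg.lstsq(Phi, y, rcond=None)[0]   # only the output layer is trained
y_hat = Phi @ W
```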

We conclude this section by considering another implementation scenario in which the measure used for sampling the ELM weights is prescribed in advance. We show that it is still possible to sample from measures that are arbitrarily close to the prescribed one in the sense of the 11-Wasserstein metric, at the price of an additional error term that can be compensated by increasing the dimension of the ESN. Let 𝒲1\mathcal{W}_{1} be the 11-Wasserstein metric. Suppose that we use the measure ν0\nu_{0} to generate the hidden layer weights. The next corollary shows how the error increases when we sample “almost” from the given measure ν0\nu_{0}.

Corollary 3.12.

Consider the same situation as in Corollary 3.9 and assume that ν0\nu_{0} has finite first moment and 𝒳[|w|+𝐚p+𝐜+|b|]μ(dw,d𝐚,d𝐜,db)<\int_{\mathcal{X}}[|w|+\|{\bf a}\|_{p}+\|{\bf c}\|+|b|]\mu(dw,d{\bf a},d{\bf c},db)<\infty. Then, for any δ(0,1)\delta\in(0,1) there exists a probability measure ν\nu with 𝒲1(ν0,ν)δ\mathcal{W}_{1}(\nu_{0},\nu)\leq\delta and a readout 𝐖{\bf W} (an N\mathbb{R}^{N}-valued random vector) such that the ESN-ELM with distribution ν\nu for the hidden layer weights satisfies for any 𝐳d{\bf z}\in\mathcal{I}_{d} the approximation error bound

𝔼[|H(𝐳)H^(𝐳)|2]1/2\displaystyle\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2} C~H,ESN[λNd+|AESN|Nd+1N12min(δ12,𝒥ν0)]\displaystyle\leq\tilde{C}_{H,\mathrm{ESN}}\left[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\frac{N}{d}}+\frac{1}{N^{\frac{1}{2}}}\min(\delta^{-\frac{1}{2}},\mathcal{J}_{\nu_{0}})\right] (3.45)

with 𝒥ν0=dμA,B,Cdν012\mathcal{J}_{\nu_{0}}=\left\|\frac{d\mu_{A,B,C}}{d\nu_{0}}\right\|_{\infty}^{\frac{1}{2}}, C~H,ESN=CH,ESNmax(1,𝒲1(ν0,μA,B,C))12\tilde{C}_{H,\mathrm{ESN}}=C_{H,\mathrm{ESN}}\max(1,\mathcal{W}_{1}(\nu_{0},\mu_{A,B,C}))^{\frac{1}{2}} and CH,ESNC_{H,\mathrm{ESN}} as in (3.28).

Proof.

First note that |AESN|T|AESN|Nd{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\frac{N}{d}}, since TNdT\geq\frac{N}{d}. In the case where μA,B,Cν0\mu_{A,B,C}\ll\nu_{0} and dμA,B,Cdν0\frac{d\mu_{A,B,C}}{d\nu_{0}} is bounded, that is, 𝒥ν0<\mathcal{J}_{\nu_{0}}<\infty, we may choose ν=ν0\nu=\nu_{0} and obtain from Theorem 3.7 the bound

𝔼[|H(𝐳)H^(𝐳)|2]1/2\displaystyle\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2} C~H,ESN[λNd+|AESN|Nd+1N12𝒥ν0].\displaystyle\leq\tilde{C}_{H,\mathrm{ESN}}\left[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\frac{N}{d}}+\frac{1}{N^{\frac{1}{2}}}\mathcal{J}_{\nu_{0}}\right]. (3.46)

We now derive an alternative bound (regardless of whether 𝒥ν0\mathcal{J}_{\nu_{0}} is finite or not). Combining both bounds then yields the bound (3.45).

From (3.39) and the estimate π(¯(𝐚))max(1,d121p)c0𝐚p\|\pi(\bar{\mathcal{L}}({\bf a}))\|\leq\max(1,d^{\frac{1}{2}-\frac{1}{p}})c_{0}\|{\bf a}\|_{p} we obtain

𝒳N\displaystyle\int_{\mathcal{X}^{N}} (|w|+𝐚+𝐜+|b|)μA,B,C(dw,d𝐚,d𝐜,db)\displaystyle\left(|w|+\|{\bf a}\|+\|{\bf c}\|+|b|\right)\mu_{A,B,C}(dw,d{\bf a},d{\bf c},db) (3.47)
=𝒳(|w|+KΛπ(¯(𝐚))+𝐜+|(KΛπ(¯(𝐚)))𝐁~+𝐚𝐁¯0+b|)μ(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}}\left(|w|+\|K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a}))\|+\|{\bf c}\|+|-(K^{-\top}\Lambda\pi(\bar{\mathcal{L}}({\bf a})))\cdot\tilde{\bf B}+{\bf a}\cdot\bar{\bf B}_{0}+b|\right)\mu(dw,d{\bf a},d{\bf c},db)
max(1,|KΛ|max(1,d121p)c0)\displaystyle\leq\max(1,{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\max(1,d^{\frac{1}{2}-\frac{1}{p}})c_{0})
×(1+𝐁~+𝐁¯0q)𝒳(|w|+𝐚p+𝐜+|b|)μ(dw,d𝐚,d𝐜,db)<\displaystyle\times(1+\|\tilde{\bf B}\|+\|\bar{\bf B}_{0}\|_{q})\int_{\mathcal{X}}\left(|w|+\|{\bf a}\|_{p}+\|{\bf c}\|+|b|\right)\mu(dw,d{\bf a},d{\bf c},db)<\infty

and thus μA,B,C\mu_{A,B,C} has a finite first moment. Hence, δ~=δ[max(1,𝒲1(ν0,μA,B,C))]1(0,1)\tilde{\delta}=\delta[\max(1,\mathcal{W}_{1}(\nu_{0},\mu_{A,B,C}))]^{-1}\in(0,1) and also ν=δ~μA,B,C+(1δ~)ν0\nu=\tilde{\delta}\mu_{A,B,C}+(1-\tilde{\delta})\nu_{0} has a finite first moment. Therefore, we may use the Kantorovich-Rubinstein duality (see [Vill 09, Theorem 5.10; Particular Case 5.16]) to calculate

𝒲1(ν0,ν)\displaystyle\mathcal{W}_{1}(\nu_{0},\nu) =supf:f is 1-Lipschitz{𝒳Nf(x)(ν0ν)(dx)}\displaystyle=\sup_{f\colon f\text{ is $1$-Lipschitz}}\left\{\int_{\mathcal{X}^{N}}f(x)(\nu_{0}-\nu)(dx)\right\}
=δ~supf:f is 1-Lipschitz{𝒳Nf(x)(ν0μA,B,C)(dx)}=δ~𝒲1(ν0,μA,B,C)δ.\displaystyle=\tilde{\delta}\sup_{f\colon f\text{ is $1$-Lipschitz}}\left\{\int_{\mathcal{X}^{N}}f(x)(\nu_{0}-\mu_{A,B,C})(dx)\right\}=\tilde{\delta}\mathcal{W}_{1}(\nu_{0},\mu_{A,B,C})\leq\delta.

In addition, μA,B,Cν\mu_{A,B,C}\ll\nu and

δ~dμA,B,Cdν+(1δ~)dν0dν=1,\tilde{\delta}\frac{d\mu_{A,B,C}}{d\nu}+(1-\tilde{\delta})\frac{d\nu_{0}}{d\nu}=1,

hence dν0dν0\frac{d\nu_{0}}{d\nu}\geq 0 implies dμA,B,Cdν1δ~\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}\leq\frac{1}{\tilde{\delta}}. Therefore, Theorem 3.7 yields (3.45). ∎
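The measure ν constructed in this proof is simply a mixture of the prescribed measure ν0 and the measure μA,B,C appearing in Theorem 3.7. For illustration, the following Python sketch draws hidden-layer weights from such a mixture; the two samplers are placeholders that stand in for μA,B,C and ν0.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_mixture(sample_mu, sample_nu0, delta_tilde, n):
    """Draw n hidden-layer weight vectors from the mixture
    nu = delta_tilde * mu_{A,B,C} + (1 - delta_tilde) * nu_0 used in the proof:
    with probability delta_tilde sample from mu_{A,B,C}, otherwise from nu_0."""
    from_mu = rng.random(n) < delta_tilde
    return np.array([sample_mu() if flag else sample_nu0() for flag in from_mu])

# Illustrative placeholder samplers on R^k for some k.
k = 4
weights = sample_mixture(lambda: rng.standard_normal(k),          # stands in for mu_{A,B,C}
                         lambda: rng.uniform(-1.0, 1.0, size=k),  # stands in for nu_0
                         delta_tilde=0.1, n=1000)
```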

3.5 Universality

In addition, we obtain the following universality result, which shows that any square-integrable functional (not necessarily in the recurrent Barron class) can be approximated arbitrarily well by the ESN-ELM (3.16)-(3.17) with a suitably chosen readout 𝐖{\bf W} and hidden weights generated from a distribution ν\nu arbitrarily close to a prescribed measure ν0\nu_{0}.

Corollary 3.13.

Assume that the input set DdD_{d} is bounded, and consider an arbitrary functional H:dH\colon\mathcal{I}_{d}\to\mathbb{R} that we will approximate with an ESN-ELM built with the following specifications: the activation functions are σ1(x)=x\sigma_{1}(x)=x and either σ2(x)=max(x,0)\sigma_{2}(x)=\max(x,0) or σ2\sigma_{2} is a bounded, Lipschitz-continuous, and non-constant function. Furthermore, assume that there exists c¯>0\bar{c}>0 and l¯,l¯(0,1)\underline{l},\bar{l}\in(0,1), l¯<l¯\underline{l}<\bar{l}, such that for any choice of NN\in\mathbb{N} the ESN parameters satisfy that the controllability matrix KK in (3.18) is invertible, l¯<|AESN|<l¯\underline{l}<{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<\bar{l}, 𝐁ESNc¯\|{\bf B}^{\mathrm{ESN}}\|\leq\bar{c}, |CESN|c¯{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\bar{c}, and |K1Λ|c¯{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-1}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\bar{c}, where ΛN×N\Lambda\in\mathbb{R}^{{N}\times{N}} is the diagonal matrix with entries Λd(k1)+j,d(k1)+j=l¯k1\Lambda_{d(k-1)+j,d(k-1)+j}={\underline{l}}^{k-1}, for all kk\in\mathbb{N}, j=1,,dj=1,\ldots,d with d(k1)+jNd(k-1)+j\leq{N}. Let ν0\nu_{0} be a given hidden weight distribution with finite first moment. Then for any probability measure γ\gamma on d(Dd)\mathcal{I}_{d}\subset(D_{d})^{\mathbb{Z}_{-}} such that HL2(d,γ)H\in L^{2}(\mathcal{I}_{d},\gamma), there exists a probability measure ν\nu with 𝒲1(ν0,ν)<ε\mathcal{W}_{1}(\nu_{0},\nu)<\varepsilon and a readout 𝐖{\bf W} (an N\mathbb{R}^{N}-valued random vector) such that the ESN-ELM H^\widehat{H} with readout 𝐖{\bf W} and distribution ν\nu for the hidden layer weights satisfies that

(d𝔼[|H(𝐳)H^(𝐳)|2]γ(d𝐳))1/2<ε.\left(\int_{\mathcal{I}_{d}}\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]\gamma(d{\bf z})\right)^{1/2}<\varepsilon.
Proof.

Firstly, by Proposition 2.10 there exists HRC𝒞L2(d,γ)H^{RC}\in\mathcal{C}\cap L^{2}(\mathcal{I}_{d},\gamma) such that HHRCL2(d,γ)<ε/2\|H-H^{RC}\|_{L^{2}(\mathcal{I}_{d},\gamma)}<\varepsilon/2. From the proof of Proposition 2.10 and the construction in [Gono 20c] it follows that HRCH^{RC} is of the form

HRC(𝐳)=𝒳N¯wσ2(𝐚(πN¯𝐳)+𝐜𝐳0+b)μN¯(dw,d𝐚,d𝐜,db),𝐳(Dd),H^{RC}({\bf z})=\int_{\mathcal{X}^{\bar{N}}}w\sigma_{2}({\bf a}\cdot(\pi_{\bar{N}}{\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu^{\bar{N}}(dw,d{\bf a},d{\bf c},db),\quad{\bf z}\in(D_{d})^{\mathbb{Z}_{-}}, (3.48)

for some N¯\bar{N}\in\mathbb{N}, an atomic probability measure μN¯\mu^{\bar{N}}, and with πN¯:(Dd)N¯\pi_{\bar{N}}\colon(D_{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{\bar{N}} satisfying

(πN¯(𝐳))d(k1)+j=(zk)j,for all kj=1,,d with d(k1)+jN¯.(\pi_{\bar{N}}({\bf z}))_{d(k-1)+j}=(z_{-k})_{j},\quad\mbox{for all $k\in\mathbb{N}$, $j=1,\ldots,d$ with $d(k-1)+j\leq\bar{N}$.}

Now let λ¯=l¯(0,1){\bar{\lambda}}=\underline{l}\in(0,1), p(1,)p\in(1,\infty), 𝐁=0{\bf B}=0 and define A:qqA\colon\ell^{q}\to\ell^{q}, C:dqC\colon\mathbb{R}^{d}\to\ell^{q} by (A𝐱)i=𝟙{i>d}λ¯xid(A{\bf x})_{i}=\mathbbm{1}_{\{i>d\}}{\bar{\lambda}}x_{i-d}, (C𝐳)i=𝟙{id}zi(C{\bf z})_{i}=\mathbbm{1}_{\{i\leq d\}}{z}_{i} for 𝐱q{\bf x}\in\ell^{q}, 𝐳d{\bf z}\in\mathbb{R}^{d}, ii\in\mathbb{N}. Then, as in the proof of Proposition 2.6 we obtain that (2.1) admits a unique solution (𝐱t)t(q)(\mathbf{x}_{t})_{t\in\mathbb{Z}_{-}}\in\ell^{\infty}_{-}(\ell^{q}) and for kk\in\mathbb{N}, j=1,,dj=1,\ldots,d we have xt,(k1)d+j=λ¯k1ztk+1,j{x}_{t,(k-1)d+j}={\bar{\lambda}}^{k-1}{z}_{t-k+1,j}. Let ι:N¯p\iota\colon\mathbb{R}^{\bar{N}}\to\ell^{p} be the natural injection (x1,,xN¯)(x1,,xN¯,0,)(x_{1},\ldots,x_{\bar{N}})^{\top}\longmapsto(x_{1},\ldots,x_{\bar{N}},0,\ldots) and let π:qN¯\pi\colon\ell^{q}\to\mathbb{R}^{\bar{N}} be the natural projection. Then

𝐚(πN¯𝐳)=𝐚(Λ1π𝐱1)=(ιΛ1𝐚)𝐱1.{\bf a}\cdot(\pi_{\bar{N}}{\bf z})={\bf a}\cdot(\Lambda^{-1}\pi\mathbf{x}_{-1})=(\iota\Lambda^{-1}{\bf a})\cdot\mathbf{x}_{-1}.

Let ϕ:𝒳N¯𝒳\phi\colon\mathcal{X}^{\bar{N}}\to\mathcal{X} be defined by ϕ(w,𝐚,𝐜,b)=(w,ιΛ1𝐚,𝐜,b)\phi(w,{\bf a},{\bf c},b)=(w,\iota\Lambda^{-1}{\bf a},{\bf c},b) and denote by μ=μN¯ϕ1\mu=\mu^{\bar{N}}\circ\phi^{-1} the pushforward measure of μN¯\mu^{\bar{N}} under ϕ\phi. Then

HRC(𝐳)\displaystyle H^{RC}({\bf z}) =𝒳N¯wσ2((ιΛ1𝐚)𝐱1+𝐜𝐳0+b)μN¯(dw,d𝐚,d𝐜,db)\displaystyle=\int_{\mathcal{X}^{\bar{N}}}w\sigma_{2}((\iota\Lambda^{-1}{\bf a})\cdot\mathbf{x}_{-1}+{\bf c}\cdot{\bf z}_{0}+b)\mu^{\bar{N}}(dw,d{\bf a},d{\bf c},db) (3.49)
=𝒳wσ2(𝐚𝐱1(𝐳)+𝐜𝐳0+b)μ(dw,d𝐚,d𝐜,db).\displaystyle=\int_{\mathcal{X}}w\sigma_{2}({\bf a}\cdot\mathbf{x}_{-1}({\bf z})+{\bf c}\cdot{\bf z}_{0}+b)\mu(dw,d{\bf a},d{\bf c},db).

Thus, HRC𝒞H^{RC}\in\mathcal{C} has a representation of the type (2.2) with |A|=λ¯<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}={\bar{\lambda}}<1. Furthermore, since μN¯\mu^{\bar{N}} is atomic it follows directly that Iμ,p(2)<I_{\mu,p}^{(2)}<\infty and 𝒳[|w|+𝐚p+𝐜+|b|]μ(dw,d𝐚,d𝐜,db)<\int_{\mathcal{X}}[|w|+\|{\bf a}\|_{p}+\|{\bf c}\|+|b|]\mu(dw,d{\bf a},d{\bf c},db)<\infty.

Next, fix λ(|A|,1)\lambda\in({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},1) arbitrary and note that under the assumptions on the ESN, the constant CH,ESNC_{H,\mathrm{ESN}} in (3.28) can be estimated as

CH,ESNc~1max(1(1(λ1λ¯)p)1p,1)max(Iμ,p,Iμ,p(2))(1+2c¯21l¯)=:C¯H,0,C_{H,\mathrm{ESN}}\leq\tilde{c}_{1}\max(\frac{1}{(1-(\lambda^{-1}\bar{\lambda})^{p})^{\frac{1}{p}}},1)\max(I_{\mu,p},I_{\mu,p}^{(2)})\left(1+2\frac{\bar{c}^{2}}{1-\bar{l}}\right)=:\bar{C}_{H,0}, (3.50)

which does not depend on NN. Similarly, the bound

𝒲1(ν0,μA,B,C)𝒳Nxν0(dx)+𝒳NxμA,B,C(dx),\mathcal{W}_{1}(\nu_{0},\mu_{A,B,C})\leq\int_{\mathcal{X}^{N}}\|x\|\nu_{0}(dx)+\int_{\mathcal{X}^{N}}\|x\|\mu_{A,B,C}(dx),

the assumptions on the ESN matrices, and (3.47) can be used to obtain an upper bound C¯H,1\bar{C}_{H,1} on 𝒲1(ν0,μA,B,C)\mathcal{W}_{1}(\nu_{0},\mu_{A,B,C}) that only depends on μ\mu, dd, pp, λ\lambda, c¯\bar{c}, l¯\underline{l}, l¯\bar{l} and ν0\nu_{0}, but not on NN. Set c¯H=C¯H,0max(1,C¯H,1)12\bar{c}_{H}=\bar{C}_{H,0}\max(1,\bar{C}_{H,1})^{\frac{1}{2}}.

We now apply Corollary 3.12 to H=HRCH=H^{RC}. We select δ(0,1)\delta\in(0,1) with δ<ε\delta<\varepsilon and

N=max(dlog(ε6c¯H)1log(λ),dlog(ε6c¯H)1log(l¯),(6c¯Hε1)2δ1),N=\left\lceil\max\left(d\log\left(\frac{\varepsilon}{6\bar{c}_{H}}\right)\dfrac{1}{\log(\lambda)},d\log\left(\frac{\varepsilon}{6\bar{c}_{H}}\right)\dfrac{1}{\log(\bar{l})},(6\bar{c}_{H}\varepsilon^{-1})^{2}\delta^{-1}\right)\right\rceil,

then from Corollary 3.12 we obtain that there exists a probability measure ν\nu with 𝒲1(ν0,ν)<ε\mathcal{W}_{1}(\nu_{0},\nu)<\varepsilon and a readout 𝐖{\bf W} (an N\mathbb{R}^{N}-valued random vector) such that the ESN-ELM H^\widehat{H} with readout 𝐖{\bf W} and distribution ν\nu for the hidden layer weights satisfies for any 𝐳d{\bf z}\in\mathcal{I}_{d}

𝔼[|HRC(𝐳)H^(𝐳)|2]1/2\displaystyle\mathbb{E}[|H^{RC}({\bf z})-\widehat{H}({\bf z})|^{2}]^{1/2} c¯H[λNd+|AESN|Nd+1δ12N12]<ε2.\displaystyle\leq\bar{c}_{H}\left[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\frac{N}{d}}+\frac{1}{\delta^{\frac{1}{2}}N^{\frac{1}{2}}}\right]<\frac{\varepsilon}{2}. (3.51)

Hence,

(d𝔼[|H(𝐳)H^(𝐳)|2]γ(d𝐳))1/2HHRCL2(d,γ)+(d𝔼[|HRC(𝐳)H^(𝐳)|2]γ(d𝐳))1/2<ε.\left(\int_{\mathcal{I}_{d}}\mathbb{E}[|H({\bf z})-\widehat{H}({\bf z})|^{2}]\gamma(d{\bf z})\right)^{1/2}\leq\|H-H^{RC}\|_{L^{2}(\mathcal{I}_{d},\gamma)}+\left(\int_{\mathcal{I}_{d}}\mathbb{E}[|H^{RC}({\bf z})-\widehat{H}({\bf z})|^{2}]\gamma(d{\bf z})\right)^{1/2}<\varepsilon.

3.6 Special case: static situation

As a special case, we consider the situation without recurrence, that is, when H:DdH\colon D_{d}\to\mathbb{R} is of the form

H(𝐮)=×d×wσ2(𝐜𝐮+b)μ0(dw,d𝐜,db),𝐮Ddd,H({\bf u})=\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}w\sigma_{2}({\bf c}\cdot{\bf u}+b)\mu_{0}(dw,d{\bf c},db),\quad{\bf u}\in D_{d}\subset\mathbb{R}^{d}, (3.52)

for a probability measure μ0\mu_{0} on ×d×\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R} satisfying ×d×|w|(𝐜+|b|)μ0(dw,d𝐜,db)<\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}|w|(\|{\bf c}\|+|b|)\mu_{0}(dw,d{\bf c},db)<\infty. HH is clearly a particular case of a recurrent generalized Barron functional, since it can be written as (2.2) with measure μ\mu on 𝒳\mathcal{X} given by μ(dw,d𝐚,d𝐜,db)=δ0(d𝐚)μ0(dw,d𝐜,db)\mu(dw,d{\bf a},d{\bf c},db)=\delta_{0}(d{\bf a})\mu_{0}(dw,d{\bf c},db), which satisfies Iμ,p<I_{\mu,p}<\infty by the integrability assumption on μ0\mu_{0}.

In this situation, we show that such static elements HH of the recurrent Barron class can be approximated by ELMs, that is, by feedforward neural networks

H^(𝐮)=i=1NWiσ2(𝐜(i)𝐮+bi)\widehat{H}({\bf u})=\sum_{i=1}^{N}W_{i}\sigma_{2}({\bf c}^{(i)}\cdot{\bf u}+b_{i}) (3.53)

with randomly generated coefficients 𝐜(i){\bf c}^{(i)}, bib_{i} valued in d\mathbb{R}^{d} and \mathbb{R}, respectively, and 𝐖N{\bf W}\in\mathbb{R}^{N} trainable. H^\widehat{H} can be viewed as a functional 𝐳H^(𝐳0){\bf z}\mapsto\widehat{H}({\bf z}_{0}), which is of the form (3.17) with 𝐚(i)=0{\bf a}^{(i)}=0. Let w(1),,w(N)w^{(1)},\ldots,w^{(N)} be \mathbb{R}-valued random variables, assume that the random variables (w(1),𝐜(1),b1),,(w(N),𝐜(N),bN)(w^{(1)},{\bf c}^{(1)},b_{1}),\ldots,(w^{(N)},{\bf c}^{(N)},b_{N}) are IID, and denote by ν0\nu_{0} their common distribution.

Corollary 3.14.

Suppose that HH is as in (3.52) with associated μ\mu satisfying that Iμ,p(2)<I_{\mu,p}^{(2)}<\infty. Assume the input set DdD_{d} is bounded. Suppose that μ0ν0\mu_{0}\ll\nu_{0} and that dμ0dν0{\displaystyle\frac{d\mu_{0}}{d\nu_{0}}} is bounded. Then there exists a measurable function f:(×d×)NNf\colon(\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R})^{N}\to\mathbb{R}^{N} such that the ELM H^\widehat{H} with readout 𝐖=f((w(i),𝐜(i),bi)i=1,,N)N{\bf W}=f((w^{(i)},{\bf c}^{(i)},b_{i})_{i=1,\ldots,N})\in\mathbb{R}^{N} satisfies for any 𝐮Dd{\bf u}\in D_{d} the approximation error bound

𝔼[|H(𝐮)H^(𝐮)|2]1/2\displaystyle\mathbb{E}[|H({\bf u})-\widehat{H}({\bf u})|^{2}]^{1/2} CHN12\displaystyle\leq\frac{C_{H}}{N^{\frac{1}{2}}} (3.54)

with

CH=(2max(2Lσ22,|σ2(0)|2)max(1,sup𝐯Dd𝐯2))12Iμ,p(2)dμ0dν012.C_{H}=(2\max(2L_{\sigma_{2}}^{2},|\sigma_{2}(0)|^{2})\max(1,\sup_{{\bf v}\in D_{d}}\|{\bf v}\|^{2}))^{\frac{1}{2}}I_{\mu,p}^{(2)}\left\|\frac{d\mu_{0}}{d\nu_{0}}\right\|_{\infty}^{\frac{1}{2}}. (3.55)
Proof.

As noted above, HH is a recurrent generalized Barron functional with

μ(dw,d𝐚,d𝐜,db)=δ0(d𝐚)μ0(dw,d𝐜,db)\mu(dw,d{\bf a},d{\bf c},db)=\delta_{0}(d{\bf a})\mu_{0}(dw,d{\bf c},db)

and we can choose the maps AA, CC and the sequence 𝐁{\bf B} as 0. With these choices, the map ¯\bar{\mathcal{L}} and the sequence 𝐁¯0\bar{\bf B}_{0} appearing in the proof of Theorem 3.7 are both equal to 0. Consequently, from (3.39) we obtain that the measure μ~N\tilde{\mu}^{N} in the proof of Theorem 3.7 satisfies

𝒳Nf(w,𝐚,𝐜,b)μ~N(dw,d𝐚,d𝐜,db)=×d×f(w,0,𝐜,b)μ0(dw,d𝐜,db)\int_{\mathcal{X}^{N}}f(w,{\bf a},{\bf c},b)\tilde{\mu}^{N}(dw,d{\bf a},d{\bf c},db)=\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}f(w,0,{\bf c},b)\mu_{0}(dw,d{\bf c},db)

for any ff for which the integrand in the last line is integrable with respect to μ0\mu_{0}. This shows that H¯N\bar{H}^{N} in (3.36) coincides with the map

H¯N(𝐳)=×d×wσ2(𝐜𝐳0+b)μ0(dw,d𝐜,db)=H(𝐳0),𝐳(Dd)\bar{H}^{N}({\bf z})=\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}w\sigma_{2}({\bf c}\cdot{\bf z}_{0}+b)\mu_{0}(dw,d{\bf c},db)=H({\bf z}_{0}),\quad{\bf z}\in(D_{d})^{\mathbb{Z}_{-}} (3.56)

and furthermore H¯N=HN\bar{H}^{N}=H^{N}. Therefore, denoting by HH also the functional 𝐳H(𝐳0){\bf z}\mapsto H({\bf z}_{0}), in the first step of the error estimate (3.42) only the last term is non-zero.

Furthermore, ψA,B,C(w,𝐚,𝐜,b)=(w,0,𝐜,b)\psi_{A,B,C}(w,{\bf a},{\bf c},b)=(w,0,{\bf c},b) and thus μA,B,C(dw,d𝐚,d𝐜,db)=δ0(d𝐚)μ0(dw,d𝐜,db)\mu_{A,B,C}(dw,d{\bf a},d{\bf c},db)=\delta_{0}(d{\bf a})\mu_{0}(dw,d{\bf c},db). Therefore, the hypothesis μ0ν0\mu_{0}\ll\nu_{0} implies μA,B,Cδ0(d𝐚)ν0\mu_{A,B,C}\ll\delta_{0}(d{\bf a})\nu_{0}.

Choosing Wi=w(i)Ndμ0dν0(w(i),𝐜(i),bi)W_{i}={\displaystyle\frac{w^{(i)}}{N}\frac{d\mu_{0}}{d\nu_{0}}(w^{(i)},{\bf c}^{(i)},b_{i})}, from (3.40) we thus obtain

𝔼[|H(𝐮)H^(𝐮)|2]N𝔼[W12σ2(𝐜(1)𝐮+b1)2].\displaystyle\mathbb{E}[|H({\bf u})-\widehat{H}({\bf u})|^{2}]\leq N\mathbb{E}[W_{1}^{2}\sigma_{2}({\bf c}^{(1)}\cdot{\bf u}+b_{1})^{2}]. (3.57)

The expectation in (3.57) can be estimated by

𝔼[W12σ2(𝐜(1)𝐮+b1)2]\displaystyle\mathbb{E}[W_{1}^{2}\sigma_{2}({\bf c}^{(1)}\cdot{\bf u}+b_{1})^{2}] =\displaystyle= ×d×w2N2dμ0dν0(w,𝐜,b)σ2(𝐜𝐮+b)2μ0(dw,d𝐜,db)\displaystyle\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}\frac{w^{2}}{N^{2}}\frac{d\mu_{0}}{d\nu_{0}}(w,{\bf c},b)\sigma_{2}({\bf c}\cdot{\bf u}+b)^{2}\mu_{0}(dw,d{\bf c},db)
\displaystyle\leq 2dμ0dν0×d×w2N2[Lσ22|𝐜𝐮+b|2+|σ2(0)|2]μ0(dw,d𝐜,db)\displaystyle 2\left\|\frac{d\mu_{0}}{d\nu_{0}}\right\|_{\infty}\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}\frac{w^{2}}{N^{2}}[L_{\sigma_{2}}^{2}|{\bf c}\cdot{\bf u}+b|^{2}+|\sigma_{2}(0)|^{2}]{\mu}_{0}(dw,d{\bf c},db)
\displaystyle\leq c1N2dμ0dν0×d×w2[𝐜2+|b|2+1]μ0(dw,d𝐜,db)\displaystyle\frac{c_{1}}{N^{2}}\left\|\frac{d\mu_{0}}{d\nu_{0}}\right\|_{\infty}\int_{\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}}w^{2}[\|{\bf c}\|^{2}+|b|^{2}+1]{\mu}_{0}(dw,d{\bf c},db)

with c1=2max(2Lσ22,|σ2(0)|2)max(1,sup𝐯Dd𝐯2)c_{1}=2\max(2L_{\sigma_{2}}^{2},|\sigma_{2}(0)|^{2})\max(1,\sup_{{\bf v}\in D_{d}}\|{\bf v}\|^{2}). ∎
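The readout constructed in the preceding proof is a plain Monte Carlo average: when ν0 = μ0 the density ratio equals one and the weights reduce to Wi = w(i)/N. The following Python sketch illustrates this on a toy integral representation of the form (3.52); the particular μ0 (with w a deterministic function of (c, b)) and the choice σ2 = tanh are our own and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 2, 2000
sigma2 = np.tanh

def w_of(c, b):
    # Illustrative mu_0: (c, b) standard normal and w a bounded function of (c, b).
    return np.cos(b) / (1.0 + np.sum(c**2, axis=-1))

# Hidden weights sampled from nu_0 = mu_0, so dmu_0/dnu_0 = 1 and W_i = w^(i) / N.
c = rng.standard_normal((N, d)); b = rng.standard_normal(N)
W = w_of(c, b) / N

u = np.array([0.3, -0.7])
H_hat = W @ sigma2(c @ u + b)                     # ELM with Monte Carlo readout

# High-accuracy reference for H(u) = E_{mu_0}[ w * sigma2(c . u + b) ].
c_ref = rng.standard_normal((100_000, d)); b_ref = rng.standard_normal(100_000)
H_ref = np.mean(w_of(c_ref, b_ref) * sigma2(c_ref @ u + b_ref))
print(H_hat, H_ref)                               # the gap is O(N^{-1/2}) on average
```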

4 Learning from a single trajectory

Consider now the situation in which we aim to learn a functional HH from a single trajectory of input/output pairs. We observe nn\in\mathbb{N} data points (𝐳t,𝐲t)({\bf z}_{t},{\bf y}_{t}) for t{0,1,,(n1)}t\in\{0,-1,\ldots,-(n-1)\} and wish to recover the input/output relation HH from them. In contrast to static situations, these data points are not IID; instead, they constitute a single realization of a discrete-time stochastic process. Similar problems were considered, for instance, in [Ziem 22].

More formally, we consider a stationary stochastic process (𝐙t,𝐘t)t({\bf Z}_{t},{\bf Y}_{t})_{t\in\mathbb{Z}_{-}} and assume that nn sequential observations (𝐙t,𝐘t)({\bf Z}_{t},{\bf Y}_{t}), t{0,1,,(n1)}t\in\{0,-1,\ldots,-(n-1)\}, coming from a single realization of this process are available. Suppose that 𝐙{\bf Z} takes values in d\mathcal{I}_{d}. Let H𝒞H\in\mathcal{C} be the unknown functional and assume that the input/output relation between the data is given as H(𝐙)=𝔼[𝐘0|𝐙]H({\bf Z})=\mathbb{E}[{\bf Y}_{0}|{\bf Z}]. For example, this is satisfied if 𝐙{\bf Z} is any stationary process and 𝐘t=H(,𝐙t2,𝐙t1,𝐙t)+𝜺t{\bf Y}_{t}=H(\ldots,{\bf Z}_{t-2},{\bf Z}_{t-1},{\bf Z}_{t})+\bm{\varepsilon}_{t} for a stationary process (𝜺t)t(\bm{\varepsilon}_{t})_{t\in\mathbb{Z}_{-}} independent of 𝐙{\bf Z} and with 𝔼[𝜺0]=0\mathbb{E}[\bm{\varepsilon}_{0}]=0.
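As a concrete instance of this data-generating mechanism, the following Python sketch produces a single synthetic trajectory (Zt, Yt) in which Yt is a noisy evaluation of a finite-memory functional of the input; the functional, the input law, and the noise level are arbitrary illustrative choices (and, for simplicity, time runs forward rather than over the negative integers).

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 2

# Bounded, approximately stationary input process Z (a clipped AR(1) recursion).
Z = np.zeros((n, d))
for t in range(1, n):
    Z[t] = np.clip(0.5 * Z[t - 1] + 0.5 * rng.uniform(-1.0, 1.0, d), -1.0, 1.0)

# Outputs Y_t = H(Z_{t-1}, Z_t) + eps_t for a toy finite-memory functional H and an
# IID mean-zero noise independent of Z, so that E[Y_t | Z] = H(Z_{t-1}, Z_t).
H_true = lambda z_prev, z_now: np.tanh(z_now[0] - 0.5 * z_prev[1])
Y = np.array([H_true(Z[t - 1], Z[t]) for t in range(1, n)]) \
    + 0.05 * rng.standard_normal(n - 1)
```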

The goal is to learn the functional HH from the observations (𝐙t,𝐘t)t=0,1,,n+1({\bf Z}_{t},{\bf Y}_{t})_{t=0,-1,\ldots,-n+1} of the underlying input/output process. Recall that H(𝐙)=𝔼[𝐘0|𝐙]H({\bf Z})=\mathbb{E}[{\bf Y}_{0}|{\bf Z}] minimizes (G)=𝔼[G(𝐙)𝐘02]\mathcal{R}(G)=\mathbb{E}[\|G({\bf Z})-{\bf Y}_{0}\|^{2}] over all measurable maps G:dmG\colon\mathcal{I}_{d}\to\mathbb{R}^{m}. To learn HH from the data, one thus aims to find a minimizer of

n(G)=1ni=0n1G(𝐙in+1)𝐘i2,\mathcal{R}_{n}(G)=\frac{1}{n}\sum_{i=0}^{n-1}\|G({\bf Z}_{-i}^{-n+1})-{\bf Y}_{-i}\|^{2}, (4.1)

where we denote 𝐙in+1=(,0,0,𝐙n+1,,𝐙i1,𝐙i){\bf Z}_{-i}^{-n+1}=(\ldots,0,0,{\bf Z}_{-n+1},\ldots,{\bf Z}_{-i-1},{\bf Z}_{-i}).

To learn HH from data, we use an approximant of the type H^\widehat{H} introduced in (3.17) and write H^W=H^\widehat{H}_{W}=\widehat{H} to emphasize that WW contains the trainable parameters. Note that H^W\widehat{H}_{W} is now m\mathbb{R}^{m}-valued and is constructed by simply using a readout W𝕄m,NW\in\mathbb{M}_{m,N} that collects mm readout vectors in a matrix or, equivalently, by making the w(i)w^{(i)} in (3.17) m\mathbb{R}^{m}-valued. The results in Section 3, in particular the universality statement in Section 3.5, indicate that HH can be approximated well by such systems. Hence, we expect that a minimizer of n()\mathcal{R}_{n}(\cdot) over such systems should be a good approximation of HH. Therefore, we set out to solve the minimization problem

W^=arg minW𝒲R1ni=0n1H^W(𝐙in+1)𝐘i2\widehat{W}=\underset{W\in\mathcal{W}_{R}}{\mbox{{\rm arg min}}}\frac{1}{n}\sum_{i=0}^{n-1}\|\widehat{H}_{W}({\bf Z}_{-i}^{-n+1})-{\bf Y}_{-i}\|^{2} (4.2)

for 𝒲R\mathcal{W}_{R} given by the set of all random matrices W:Ωm×NW\colon\Omega\to\mathbb{R}^{m\times N} which satisfy for each row Wi,W_{i,\cdot} the bound Wi,R\|W_{i,\cdot}\|\leq R for some regularization constant R>0R>0 and which are measurable with respect to the data (𝐙t,𝐘t)t=0,1,,n+1({\bf Z}_{t},{\bf Y}_{t})_{t=0,-1,\ldots,-n+1} and the randomly generated parameters (wi,𝐚(i),𝐜(i),bi)i=1,,N(w_{i},{\bf a}^{(i)},{\bf c}^{(i)},b_{i})_{i=1,\ldots,N}, AESN{A}^{\mathrm{ESN}}, 𝐁ESN{\bf B}^{\mathrm{ESN}}, CESN{C}^{\mathrm{ESN}}. The latter requirement accounts for the fact that the randomly generated parameters are fixed after their initialization and hence the trainable weights may depend on them. The choice of RR will be specified later on in the statement of Theorem 4.2.
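For m = 1, problem (4.2) is a least-squares problem over the ELM features subject to a norm constraint on the readout. The following Python sketch (our own illustrative implementation, not necessarily the explicit solution referenced below) computes a minimizer by checking the unconstrained least-squares solution first and, if it violates the constraint, locating the ridge parameter at which the constraint becomes active.

```python
import numpy as np

def constrained_readout(Phi, y, R, tol=1e-8):
    """Minimize ||Phi w - y||^2 over w with ||w|| <= R (the m = 1 case of (4.2)).
    The KKT conditions show the minimizer is a ridge solution
    w(l) = (Phi^T Phi + l I)^{-1} Phi^T y, with l = 0 if the unconstrained
    solution is feasible and otherwise l > 0 chosen so that ||w(l)|| = R."""
    w_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]
    if np.linalg.norm(w_ls) <= R:
        return w_ls
    G, v = Phi.T @ Phi, Phi.T @ y
    ridge = lambda l: np.linalg.solve(G + l * np.eye(G.shape[0]), v)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(ridge(hi)) > R:          # bracket the active ridge parameter
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):           # bisection: ||w(l)|| decreases in l
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(ridge(mid)) > R else (lo, mid)
    return ridge(hi)
```

For m > 1 the objective and the row-wise constraints in (4.2) decouple across the rows of W, so the same routine can be applied separately to each row.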

The problem (4.2) admits an explicit solution (see, for instance, [Gono 21a, Section 4.3]). Once computed, the learned functional is H^W^\widehat{H}_{\widehat{W}}. The learning performance of the algorithm can now be evaluated by assessing its learning error (or generalization error)

𝔼[H(𝐙¯)H^W^(𝐙¯)2],\mathbb{E}[\|H(\bar{\bf Z})-\widehat{H}_{\widehat{W}}(\bar{\bf Z})\|^{2}],

where 𝐙¯\bar{\bf Z} is an IID copy of 𝐙{\bf Z}, independent of all other random variables introduced so far. An analysis of the learning error is carried out below in Theorem 4.2, in which we assume, inspired by the developments in [Gono 20b], that the input and the output processes are of the type introduced in the following definition.

Definition 4.1.

An k\mathbb{R}^{k}-valued random process 𝐔{\bf U} is said to have a causal Bernoulli shift structure if there exist qq\in\mathbb{N}, a measurable map G:(q)kG\colon(\mathbb{R}^{q})^{\mathbb{Z}_{-}}\to\mathbb{R}^{k}, and an IID collection (𝝃t)t(\bm{\xi}_{t})_{t\in\mathbb{Z}_{-}} of q\mathbb{R}^{q}-valued random variables such that

𝐔t=G(,𝝃t1,𝝃t),t.{\bf U}_{t}=G(\ldots,\bm{\xi}_{t-1},\bm{\xi}_{t}),\quad t\in\mathbb{Z}_{-}.

The process 𝐔{\bf U} is said to have geometric decay if there exist Cdep>0C_{\mathrm{dep}}>0, λdep(0,1)\lambda_{\mathrm{dep}}\in(0,1) such that the weak dependence coefficient θ(τ):=𝔼[𝐔0𝐔~0τ]\theta(\tau):=\mathbb{E}[\|{\bf U}_{0}-\tilde{{\bf U}}^{\tau}_{0}\|] satisfies θ(τ)Cdepλdepτ\theta(\tau)\leq C_{\mathrm{dep}}\lambda_{\mathrm{dep}}^{\tau} for all τ\tau\in\mathbb{N}, where 𝐔~0τ=G(,𝝃~τ1,𝝃~τ,𝝃τ+1,,𝝃0)\tilde{{\bf U}}^{\tau}_{0}=G(\ldots,\tilde{\bm{\xi}}_{-\tau-1},\tilde{\bm{\xi}}_{-\tau},\bm{\xi}_{-\tau+1},\ldots,\bm{\xi}_{0}) for 𝝃~\tilde{\bm{\xi}} an independent copy of 𝝃\bm{\xi}.
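As an illustration of Definition 4.1, the following Python sketch estimates the weak dependence coefficient θ(τ) by Monte Carlo for a simple linear causal Bernoulli shift with geometrically decaying coefficients; the process, the truncation depth, and the sample size are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, depth, n_mc = 0.7, 200, 5000     # decay rate, series truncation, Monte Carlo size

def theta(tau):
    """Monte Carlo estimate of theta(tau) = E[|U_0 - U~_0^tau|] for the linear
    causal Bernoulli shift U_t = sum_{j >= 0} rho^j xi_{t-j} (truncated at `depth`),
    where U~_0^tau replaces xi_{-tau}, xi_{-tau-1}, ... by independent copies."""
    xi = rng.standard_normal((n_mc, depth))                       # column j stores xi_{-j}
    xi_tilde = xi.copy()
    xi_tilde[:, tau:] = rng.standard_normal((n_mc, depth - tau))  # resample lags >= tau
    weights = rho ** np.arange(depth)
    return np.mean(np.abs((xi - xi_tilde) @ weights))

for tau in (1, 2, 4, 8):
    print(tau, theta(tau))            # decays at the geometric rate rho^tau
```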

Theorem 4.2.

Assume σ1(x)=x\sigma_{1}(x)=x, m=1m=1, p(1,)p\in(1,\infty), and that the input set DdD_{d} is bounded. Consider a functional H𝒞H\in\mathcal{C} that has a representation of the type (2.2) with 𝐁q{\bf B}\in\ell^{q}, bounded linear maps C:dqC\colon\mathbb{R}^{d}\to\ell^{q}, A:qqA\colon\ell^{q}\to\ell^{q} satisfying |A|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and Iμ,p(2)<I_{\mu,p}^{(2)}<\infty. Let λ(|A|,1)\lambda\in\left({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},1\right). Consider now as approximant an ESN-ELM system such that |AESN|<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 and that the matrix KK in (3.18) is invertible. Suppose that w(1)w^{(1)} is bounded, μA,B,Cν\mu_{A,B,C}\ll\nu, and dμA,B,Cdν\frac{d\mu_{A,B,C}}{d\nu} is bounded by a constant κ>0\kappa>0. Let RκNw(1)R\geq\frac{\kappa}{\sqrt{N}}\|w^{(1)}\|_{\infty}, r(|AESN|,1)r\in({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},1). Assume Y{Y} is bounded by a constant M~\widetilde{M} and (Y,𝐙)({Y},{\bf Z}) has a causal Bernoulli shift structure with geometric decay and log(n)<nlog(λmax1)\log(n)<n\log(\lambda_{max}^{-1}), where λmax=max(r,λdep)\lambda_{max}=\max(r,\lambda_{\mathrm{dep}}). Then the trained ESN-ELM H^𝐖^\widehat{H}_{\widehat{\bf W}} satisfies the learning error bound

𝔼[|H(𝐙¯)H^𝐖^(𝐙¯)|2]1/2Capprox(λNd+|AESN|T+1N12)+Cest(RN12log(n)n)12\displaystyle\mathbb{E}[|H(\bar{\bf Z})-\widehat{H}_{\widehat{\bf W}}(\bar{\bf Z})|^{2}]^{1/2}\leq C_{\mathrm{approx}}\left(\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}\right)+C_{\mathrm{est}}\left(RN^{\frac{1}{2}}\frac{\sqrt{\log(n)}}{\sqrt{n}}\right)^{\frac{1}{2}} (4.3)

with

Capprox=c~1max(κ12,1)max(|C|(1|λ1A|p)1/p,𝐁q1|A|,1)×max(Iμ,p,Iμ,p(2))(1+|CESN|+𝐁ESN1|AESN||KΛ|),C_{\mathrm{approx}}=\tilde{c}_{1}\max(\kappa^{\frac{1}{2}},1)\max\left(\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{\left(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\lambda^{-1}{A}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{p}\right)^{1/p}},\frac{\|{\bf B}\|_{q}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}},1\right)\\ \times\max\left(I_{\mu,p},I_{\mu,p}^{(2)}\right)\cdot\left(1+\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|K^{-\top}\Lambda\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\right), (4.4)
Cest=c~2[(|CESN|+𝐁ESN+1)3(1|AESN|)3log(λmax1)max(R,1)(𝔼[𝐚(1)2]12+𝔼[𝐜(1)2]12+𝔼[|b1|2]12+1)×(r2|||AESN|||2)1/2max([κw(1)]1,1)(r1r+1λmax+(|CESN|2+1)12λmax1log(λmax1))]12,C_{\mathrm{est}}=\tilde{c}_{2}\Bigg{[}\frac{({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|+1)^{3}}{(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{3}\sqrt{\log(\lambda_{max}^{-1})}}\max(R,1)(\mathbb{E}[\|{\bf a}^{(1)}\|^{2}]^{\frac{1}{2}}+\mathbb{E}[\|{\bf c}^{(1)}\|^{2}]^{\frac{1}{2}}+\mathbb{E}[|b_{1}|^{2}]^{\frac{1}{2}}+1)\\ \times(r^{2}-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2})^{-1/2}\max([\kappa\|w^{(1)}\|_{\infty}]^{-1},1)\Bigg{(}\frac{r}{1-r}+\frac{1}{\lambda_{max}}+\frac{({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}+1)^{\frac{1}{2}}\lambda_{max}^{-1}}{\log(\lambda_{max}^{-1})}\Bigg{)}\Bigg{]}^{\frac{1}{2}}, (4.5)

for c~1\tilde{c}_{1} only depending on dd, σ2\sigma_{2}, pp, diam(Dd)\mathrm{diam}(D_{d}), and λ\lambda (see (3.43)) and c~2\tilde{c}_{2} only depending on σ2\sigma_{2}, M~\widetilde{M}, diam(Dd)\mathrm{diam}(D_{d}), and CdepC_{\mathrm{dep}} (see (4.19)).

Proof.

Let (𝐙¯,Y¯)(\bar{\bf Z},\bar{{Y}}) be an IID copy of (𝐙,Y)({\bf Z},{Y}), independent of all other random variables introduced so far. Independence and the assumption H(𝐙)=𝔼[Y0|𝐙]H({\bf Z})=\mathbb{E}[{Y}_{0}|{\bf Z}] imply that 𝔼[H^𝐖^(𝐙¯)H(𝐙¯)]=𝔼[H^𝐖^(𝐙¯)Y¯0]\mathbb{E}[\widehat{H}_{\widehat{\bf W}}(\bar{\bf Z})H(\bar{\bf Z})]=\mathbb{E}[\widehat{H}_{\widehat{\bf W}}(\bar{\bf Z})\bar{Y}_{0}] and 𝔼[H^𝐖(𝐙¯)H(𝐙¯)]=𝔼[H^𝐖(𝐙¯)Y¯0]\mathbb{E}[\widehat{H}_{\bf W}(\bar{\bf Z})H(\bar{\bf Z})]=\mathbb{E}[\widehat{H}_{{\bf W}}(\bar{\bf Z})\bar{Y}_{0}] for any 𝐖𝒲R{\bf W}\in\mathcal{W}_{R}. Therefore, for any 𝐖𝒲R{\bf W}\in\mathcal{W}_{R} we obtain

𝔼[|H(𝐙¯)H^𝐖^(𝐙¯)|2]\displaystyle\mathbb{E}[|H(\bar{\bf Z})-\widehat{H}_{\widehat{\bf W}}(\bar{\bf Z})|^{2}] =𝔼[|H(𝐙¯)H^𝐖(𝐙¯)|2]+𝔼[|Y¯0H^𝐖^(𝐙¯)|2]𝔼[|Y¯0H^𝐖(𝐙¯)|2]\displaystyle=\mathbb{E}[|H(\bar{\bf Z})-\widehat{H}_{\bf W}(\bar{\bf Z})|^{2}]+\mathbb{E}[|\bar{Y}_{0}-\widehat{H}_{\widehat{\bf W}}(\bar{\bf Z})|^{2}]-\mathbb{E}[|\bar{Y}_{0}-\widehat{H}_{\bf W}(\bar{\bf Z})|^{2}] (4.6)
𝔼[|H(𝐙¯)H^𝐖(𝐙¯)|2]+𝔼[(H^𝐖^)n(H^𝐖^)+n(H^𝐖)(H^𝐖)],\displaystyle\leq\mathbb{E}[|H(\bar{\bf Z})-\widehat{H}_{\bf W}(\bar{\bf Z})|^{2}]+\mathbb{E}[\mathcal{R}(\widehat{H}_{\widehat{\bf W}})-\mathcal{R}_{n}(\widehat{H}_{\widehat{\bf W}})+\mathcal{R}_{n}(\widehat{H}_{\bf W})-\mathcal{R}(\widehat{H}_{\bf W})],

where we used in the last step that (4.2) implies n(H^𝐖^)n(H^𝐖)\mathcal{R}_{n}(\widehat{H}_{\widehat{\bf W}})\leq\mathcal{R}_{n}(\widehat{H}_{\bf W}).

The first term in the right hand side of (4.6) can be bounded using Theorem 3.7. Indeed, Theorem 3.7 proves that there exists a measurable function f:(𝒳N)NNf\colon(\mathcal{X}^{N})^{N}\to\mathbb{R}^{N} such that the ESN-ELM H^\widehat{H} with readout 𝐖=f((w(i),𝐚(i),𝐜(i),bi)i=1,,N){\bf W}=f((w^{(i)},{\bf a}^{(i)},{\bf c}^{(i)},b_{i})_{i=1,\ldots,N}) satisfies

𝔼[|H(𝐙¯)H^𝐖(𝐙¯)|2]1/2\displaystyle\mathbb{E}[|H(\bar{{\bf Z}})-\widehat{H}_{\bf W}(\bar{{\bf Z}})|^{2}]^{1/2} CH,ESN[λNd+|AESN|T+1N12dμA,B,Cdν12]\displaystyle\leq C_{H,\mathrm{ESN}}\left[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}^{\frac{1}{2}}\right] (4.7)

with CH,ESNC_{H,\mathrm{ESN}} given in (3.28). Here we used independence and that 𝐙¯\bar{{\bf Z}} takes values in d\mathcal{I}_{d}. Furthermore, in the proof of Theorem 3.7 ff and 𝐖{\bf W} are explicitly chosen as Wi=w(i)NdμA,B,Cdν(w(i),𝐚(i),𝐜(i),bi){\displaystyle W_{i}=\frac{w^{(i)}}{N}\frac{d\mu_{A,B,C}}{d\nu}(w^{(i)},{\bf a}^{(i)},{\bf c}^{(i)},b_{i})}. Therefore, \mathbb{P}-a.s. the random vector 𝐖{\bf W} satisfies 𝐖Nmaxi=1NWiκNw(1)R\|{\bf W}\|\leq\sqrt{N}\max_{i=1}^{N}\|W_{i}\|_{\infty}\leq\frac{\kappa}{\sqrt{N}}\|w^{(1)}\|_{\infty}\leq R and consequently 𝐖𝒲R{\bf W}\in\mathcal{W}_{R}.

It remains to analyze the second term in the right hand side of (4.6). For notational simplicity we now consider (wi,𝐚(i),𝐜(i),bi)i=1,,N(w_{i},{\bf a}^{(i)},{\bf c}^{(i)},b_{i})_{i=1,\ldots,N}, AESN{A}^{\mathrm{ESN}}, 𝐁ESN{\bf B}^{\mathrm{ESN}}, CESN{C}^{\mathrm{ESN}} as fixed; formally this can be justified by performing the subsequent computations conditionally on these random variables as, for instance, in [Gono 21a, Proof of Theorem 4.3]. Then

𝔼[(H^𝐖^)n(H^𝐖^)+n(H^𝐖)(H^𝐖)]2𝔼[sup𝐰N:𝐰R|n(H^𝐰)(H^𝐰)|].\mathbb{E}[\mathcal{R}(\widehat{H}_{\widehat{\bf W}})-\mathcal{R}_{n}(\widehat{H}_{\widehat{\bf W}})+\mathcal{R}_{n}(\widehat{H}_{\bf W})-\mathcal{R}(\widehat{H}_{\bf W})]\leq 2\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}(\widehat{H}_{\bf w})-\mathcal{R}(\widehat{H}_{\bf w})|\right]. (4.8)

Let λ¯=(r2|AESN|2)1/2\bar{\lambda}=(r^{2}-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2})^{1/2}, denote 𝒞RC={H^𝐰:𝐰N,𝐰R}\mathcal{C}^{RC}=\{\widehat{H}_{\bf w}\,:\,{\bf w}\in\mathbb{R}^{N},\|{\bf w}\|\leq R\} and note that each H^𝐰\widehat{H}_{\bf w} can be written as H^𝐰=h𝐰HF\widehat{H}_{\bf w}=h_{\bf w}\circ H^{F} for HF(𝐳)=(𝐱0ESN(𝐳),𝐳0,λ¯𝐱1ESN(𝐳))H^{F}({\bf z})=(\mathbf{x}_{0}^{\mathrm{ESN}}({\bf z}),{\bf z}_{0},\bar{\lambda}\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf z})), h𝐰((𝐱0,𝐳0,𝐱1))=i=1Nwiσ2(λ¯1𝐚(i)𝐱1+𝐜(i)𝐳0+bi)h_{\bf w}(({\bf x}_{0},{\bf z}_{0},{\bf x}_{-1}))=\sum_{i=1}^{N}w_{i}\sigma_{2}(\bar{\lambda}^{-1}{\bf a}^{(i)}\cdot\mathbf{x}_{-1}+{\bf c}^{(i)}\cdot{\bf z}_{0}+b_{i}). Furthermore, HF:(Dd)N+d+NH^{F}\colon(D_{d})^{\mathbb{Z}_{-}}\to\mathbb{R}^{N+d+N} is the reservoir functional associated to the map F:N+d+N×dN+d+NF\colon\mathbb{R}^{N+d+N}\times\mathbb{R}^{d}\to\mathbb{R}^{N+d+N}, F((𝐱0,𝐱1,𝐱2),𝐳)=(σ1(AESN𝐱0+CESN𝐳+𝐁ESN),𝐳,λ¯𝐱0)F(({\bf x}_{0},{\bf x}_{1},{\bf x}_{2}),{\bf z})=(\sigma_{1}({A}^{\mathrm{ESN}}\mathbf{x}_{0}+{C}^{\mathrm{ESN}}{\bf z}+{\bf B}^{\mathrm{ESN}}),{\bf z},\bar{\lambda}{\bf x}_{0}), which is just the reservoir map associated to the ESN determined by AESN{A}^{\mathrm{ESN}}, 𝐁ESN{\bf B}^{\mathrm{ESN}}, CESN{C}^{\mathrm{ESN}} augmented by the current input and the (scaled) previous state.

Then for each 𝐰N{\bf w}\in\mathbb{R}^{N} with 𝐰R\|{\bf w}\|\leq R we have

|h𝐰((𝐱0,𝐳0,𝐱1))h𝐰((𝐱¯0,𝐳¯0,𝐱¯1))|i=1N|wi|Lσ2|λ¯1𝐚(i)(𝐱1𝐱¯1)+𝐜(i)(𝐳0𝐳¯0)|Lσ2Rmax((i=1Nλ¯1𝐚(i)2)1/2,(i=1N𝐜(i)2)1/2)[𝐱1𝐱¯1+𝐳0𝐳¯0]Lh¯(𝐱0,𝐳0,𝐱1)(𝐱¯0,𝐳¯0,𝐱¯1),|h_{\bf w}(({\bf x}_{0},{\bf z}_{0},{\bf x}_{-1}))-h_{\bf w}((\bar{{\bf x}}_{0},\bar{{\bf z}}_{0},\bar{{\bf x}}_{-1}))|\leq\sum_{i=1}^{N}|w_{i}|L_{\sigma_{2}}|\bar{\lambda}^{-1}{\bf a}^{(i)}\cdot(\mathbf{x}_{-1}-\bar{{\bf x}}_{-1})+{\bf c}^{(i)}\cdot({\bf z}_{0}-\bar{{\bf z}}_{0})|\\ \leq L_{\sigma_{2}}R\max((\sum_{i=1}^{N}\|\bar{\lambda}^{-1}{\bf a}^{(i)}\|^{2})^{1/2},(\sum_{i=1}^{N}\|{\bf c}^{(i)}\|^{2})^{1/2})[\|\mathbf{x}_{-1}-\bar{{\bf x}}_{-1}\|+\|{\bf z}_{0}-\bar{{\bf z}}_{0}\|]\leq\overline{L_{h}}\|({\bf x}_{0},{\bf z}_{0},{\bf x}_{-1})-(\bar{{\bf x}}_{0},\bar{{\bf z}}_{0},\bar{{\bf x}}_{-1})\|,

with Lh¯=2Lσ2Rmax((i=1Nλ¯1𝐚(i)2)1/2,(i=1N𝐜(i)2)1/2)\overline{L_{h}}=2L_{\sigma_{2}}R\max((\sum_{i=1}^{N}\|\bar{\lambda}^{-1}{\bf a}^{(i)}\|^{2})^{1/2},(\sum_{i=1}^{N}\|{\bf c}^{(i)}\|^{2})^{1/2}) and |h𝐰(𝟎)|Lh,0|h_{\bf w}({\bf 0})|\leq L_{h,0} with Lh,0=R(i=1Nσ2(bi)2)1/2L_{h,0}=R(\sum_{i=1}^{N}\sigma_{2}(b_{i})^{2})^{1/2}. Furthermore, for any 𝐳Dd{\bf z}\in D_{d} the reservoir map F(,𝐳)F(\cdot,{\bf z}) is an rr-contraction with r=(|AESN|2+λ¯2)1/2<1r=({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}+\bar{\lambda}^{2})^{1/2}<1 and for any (𝐱0,𝐱1,𝐱2)N×Dd×N({\bf x}_{0},{\bf x}_{1},{\bf x}_{2})\in\mathbb{R}^{N}\times D_{d}\times\mathbb{R}^{N} and 𝐳,𝐳¯d{\bf z},\bar{{\bf z}}\in\mathbb{R}^{d} we have

F((𝐱0,𝐱1,𝐱2),𝐳)F((𝐱0,𝐱1,𝐱2),𝐳¯)=(CESN(𝐳𝐳¯),𝐳𝐳¯,0)≤(|CESN|2+1)1/2𝐳𝐳¯.\|F(({\bf x}_{0},{\bf x}_{1},{\bf x}_{2}),{\bf z})-F(({\bf x}_{0},{\bf x}_{1},{\bf x}_{2}),\bar{{\bf z}})\|=\|({C}^{\mathrm{ESN}}({\bf z}-\bar{{\bf z}}),{\bf z}-\bar{{\bf z}},0)\|\leq({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}+1)^{1/2}\|{\bf z}-\bar{{\bf z}}\|.

In addition, for 𝐳(Dd){\bf z}\in(D_{d})^{\mathbb{Z}_{-}} we have from (3.23)

HF(𝐳)=(𝐱0ESN(𝐳)2+𝐳02+λ¯𝐱1ESN(𝐳)2)1/2M,\|H^{F}({\bf z})\|=(\|\mathbf{x}_{0}^{\mathrm{ESN}}({\bf z})\|^{2}+\|{\bf z}_{0}\|^{2}+\|\bar{\lambda}\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf z})\|^{2})^{1/2}\leq M_{\mathcal{F}},

where M=(2[(1|AESN|)1(|CESN|M+𝐁ESN)]2+M2)1/2M_{\mathcal{F}}=(2[(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{-1}({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|)]^{2}+M^{2})^{1/2} with MM chosen such that 𝐳M\|{\bf z}\|\leq M for all 𝐳Dd{\bf z}\in D_{d}. Finally, let CL=RM+M~C_{L}=RM_{\mathcal{F}}+\widetilde{M} and note that L(x,y)=min(CL,|xy|)2L(x,y)=\min(C_{L},|x-y|)^{2} satisfies for all H𝒞RCH\in\mathcal{C}^{RC}, 𝐳(Dd){\bf z}\in(D_{d})^{\mathbb{Z}_{-}} and y{y}\in\mathbb{R} with |y|M~|{y}|\leq\widetilde{M} that L(H(𝐳),y)=|H(𝐳)y|2L(H({\bf z}),{y})=|H({\bf z})-{y}|^{2} and z[min(CL,|z|)]2z\mapsto[\min(C_{L},|z|)]^{2} is Lipschitz-continuous with constant 2CL2C_{L}. This can be seen by writing

|min(CL,|z1|)2min(CL,|z2|)2|\displaystyle|\min(C_{L},|z_{1}|)^{2}-\min(C_{L},|z_{2}|)^{2}| =|min(CL,|z1|)min(CL,|z2|)||min(CL,|z1|)+min(CL,|z2|)|\displaystyle=|\min(C_{L},|z_{1}|)-\min(C_{L},|z_{2}|)||\min(C_{L},|z_{1}|)+\min(C_{L},|z_{2}|)|
2CL||z2||z1||2CL|z2z1|.\displaystyle\leq 2C_{L}\big{|}|z_{2}|-|z_{1}|\big{|}\leq 2C_{L}|z_{2}-z_{1}|.

Consider the Rademacher-type complexity as defined in [Gono 20b]: let ε0,ε1,\varepsilon_{0},\varepsilon_{1},\ldots be IID Rademacher random variables which are independent of the IID copies 𝐙~(l)\tilde{{\bf Z}}^{(l)}, ll\in\mathbb{N}, of 𝐙{\bf Z}. For each kk\in\mathbb{N} we define

k(𝒞RC)=1k𝔼[supH𝒞RC|l=0k1εlH(𝐙~(l))|].{\mathfrak{R}}_{k}(\mathcal{C}^{RC})=\frac{1}{k}\mathbb{E}\left[\sup_{H\in\mathcal{C}^{RC}}\left|\sum_{l=0}^{k-1}\varepsilon_{l}H(\tilde{{\bf Z}}^{(l)})\right|\right].

Denote by 𝐗l\mathbf{X}^{l} the vector with components Xjl=σ2(𝐚(j)𝐱1ESN(𝐙~(l))+𝐜(j)𝐙~0(l)+bj){X}^{l}_{j}=\sigma_{2}({\bf a}^{(j)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}(\tilde{\bf Z}^{(l)})+{\bf c}^{(j)}\cdot\tilde{\bf Z}_{0}^{(l)}+b_{j}), j=1,,Nj=1,\ldots,N. Then we can rewrite H^𝐰(𝐙~(l))=𝐰𝐗l\widehat{H}_{{\bf w}}(\tilde{\bf Z}^{(l)})={\bf w}^{\top}\mathbf{X}^{l} and estimate using the Cauchy-Schwarz and Jensen inequalities, independence, and the fact that 𝔼[εlεj]=δlj\mathbb{E}[\varepsilon_{l}\varepsilon_{j}]=\delta_{lj}

k(𝒞RC)\displaystyle{\mathfrak{R}}_{k}(\mathcal{C}^{RC}) =1k𝔼[sup𝐰N:𝐰R|l=0k1εlH^𝐰(𝐙~(l))|]=1k𝔼[sup𝐰N:𝐰R|𝐰l=0k1εl𝐗l|]\displaystyle=\frac{1}{k}\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}\left|\sum_{l=0}^{k-1}\varepsilon_{l}\widehat{H}_{\bf w}(\tilde{{\bf Z}}^{(l)})\right|\right]=\frac{1}{k}\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}\left|{\bf w}^{\top}\sum_{l=0}^{k-1}\varepsilon_{l}\mathbf{X}^{l}\right|\right]
Rk𝔼[l=0k1εl𝐗l]Rk𝔼[l=0k1εl𝐗l2]1/2=Rk(𝔼[l=0k1j=0k1εlεj(𝐗l)𝐗j])1/2\displaystyle\leq\frac{R}{k}\mathbb{E}\left[\left\|\sum_{l=0}^{k-1}\varepsilon_{l}\mathbf{X}^{l}\right\|\right]\leq\frac{R}{k}\mathbb{E}\left[\left\|\sum_{l=0}^{k-1}\varepsilon_{l}\mathbf{X}^{l}\right\|^{2}\right]^{1/2}=\frac{R}{k}\left(\mathbb{E}\left[\sum_{l=0}^{k-1}\sum_{j=0}^{k-1}\varepsilon_{l}\varepsilon_{j}(\mathbf{X}^{l})^{\top}\mathbf{X}^{j}\right]\right)^{1/2}
=Rk(l=0k1𝔼[(𝐗l)𝐗l])1/2=Rk(𝔼[(𝐗1)𝐗1])1/2.\displaystyle=\frac{R}{k}\left(\sum_{l=0}^{k-1}\mathbb{E}\left[(\mathbf{X}^{l})^{\top}\mathbf{X}^{l}\right]\right)^{1/2}=\frac{R}{\sqrt{k}}\left(\mathbb{E}\left[(\mathbf{X}^{1})^{\top}\mathbf{X}^{1}\right]\right)^{1/2}.

Altogether, the hypotheses of [Gono 20b, Corollary 8(ii)] are satisfied. Writing n(G)=1ni=0n1|G(𝐙i)Yi|2\mathcal{R}_{n}^{\infty}(G)=\frac{1}{n}\sum_{i=0}^{n-1}|G({\bf Z}_{-i}^{-\infty})-{Y}_{-i}|^{2} for the idealized empirical risk and applying this result proves that

𝔼[sup𝐰N:𝐰R|n(H^𝐰)(H^𝐰)|]C1n+C2log(n)n+C3,abslog(n)n\displaystyle\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}^{\infty}(\widehat{H}_{{\bf w}})-\mathcal{R}(\widehat{H}_{{\bf w}})|\right]\leq\frac{C_{1}}{n}+\frac{C_{2}{\log(n)}}{{n}}+\frac{C_{3,abs}\sqrt{\log(n)}}{\sqrt{n}} (4.9)

for all nn\in\mathbb{N} satisfying log(n)<nlog(λmax1)\log(n)<n\log(\lambda_{max}^{-1}), where λmax=max(r,λdep)\lambda_{max}=\max(r,\lambda_{\mathrm{dep}}) and

C1\displaystyle C_{1} =2CL2MLh¯+Cdepλmax,C3,abs=8CLlog(λmax1)[R(𝔼[(𝐗1)𝐗1])1/2+𝔼[|Y0|22]1/2],\displaystyle=2C_{L}\frac{2M_{\mathcal{F}}\overline{L_{h}}+C_{\mathrm{dep}}}{\lambda_{max}},\quad\quad C_{3,abs}=\frac{8C_{L}}{\sqrt{\log(\lambda_{max}^{-1})}}\left[R\left(\mathbb{E}\left[(\mathbf{X}^{1})^{\top}\mathbf{X}^{1}\right]\right)^{1/2}+\mathbb{E}\left[|{Y}_{0}|^{2}_{2}\right]^{1/2}\right], (4.10)
C2\displaystyle C_{2} =4CL(MLh¯+Lh,0+(|CESN|2+1)1/2Lh¯Cdepλmax1)+2𝔼[|Y0|2]log(λmax1).\displaystyle=\frac{4C_{L}(M_{\mathcal{F}}\overline{L_{h}}+L_{h,0}+({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}+1)^{1/2}\overline{L_{h}}C_{\mathrm{dep}}\lambda_{max}^{-1})+2\mathbb{E}[|{Y}_{0}|^{2}]}{\log(\lambda_{max}^{-1})}. (4.11)

Furthermore, [Gono 20b, Proposition 5] yields that \mathbb{P}-a.s.

sup𝐰N:𝐰R|n(H^𝐰)n(H^𝐰)|4CLrLh¯M(1r)n.\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}\left|\mathcal{R}_{n}(\widehat{H}_{{\bf w}})-\mathcal{R}_{n}^{\infty}(\widehat{H}_{{\bf w}})\right|\leq\frac{4C_{L}r\overline{L_{h}}M_{\mathcal{F}}}{(1-r)n}.

Combining this with (4.9) we obtain

𝔼[sup𝐰N:𝐰R|n(H^𝐰)(H^𝐰)|]\displaystyle\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}(\widehat{H}_{{\bf w}})-\mathcal{R}(\widehat{H}_{{\bf w}})|\right] (4.12)
𝔼[sup𝐰N:𝐰R|n(H^𝐰)n(H^𝐰)|]+𝔼[sup𝐰N:𝐰R|n(H^𝐰)(H^𝐰)|]\displaystyle\leq\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}(\widehat{H}_{{\bf w}})-\mathcal{R}_{n}^{\infty}(\widehat{H}_{{\bf w}})|\right]+\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}^{\infty}(\widehat{H}_{{\bf w}})-\mathcal{R}(\widehat{H}_{{\bf w}})|\right]
4CLrLh¯M(1r)n+C1n+C2log(n)n+C3,abslog(n)n\displaystyle\leq\frac{4C_{L}r\overline{L_{h}}M_{\mathcal{F}}}{(1-r)n}+\frac{C_{1}}{n}+\frac{C_{2}{\log(n)}}{{n}}+\frac{C_{3,abs}\sqrt{\log(n)}}{\sqrt{n}}
max(R,1)max(Lh¯,Lh,0,1,R(𝔼[(𝐗1)𝐗1])1/2)Cest,1log(n)n,\displaystyle\leq\max(R,1)\max(\overline{L_{h}},L_{h,0},1,R\left(\mathbb{E}\left[(\mathbf{X}^{1})^{\top}\mathbf{X}^{1}\right]\right)^{1/2})C_{est,1}\frac{\sqrt{\log(n)}}{\sqrt{n}},

where

Cest,1\displaystyle C_{est,1} =32max(1,Cdep)log(λmax1)max(2(|CESN|M+𝐁ESN)2(1|AESN|)2+M2,1)max(M~,1)\displaystyle=\frac{32\max(1,C_{\mathrm{dep}})}{\sqrt{\log(\lambda_{max}^{-1})}}\max(2\frac{({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}M+\|{\bf B}^{\mathrm{ESN}}\|)^{2}}{(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{2}}+M^{2},1)\max(\widetilde{M},1) (4.13)
×(r(1r)+1λmax+(2+(|CESN|2+1)1/2λmax1)+𝔼[|Y0|2]log(λmax1)+𝔼[|Y0|2]1/2).\displaystyle\quad\quad\times\Bigg{(}\frac{r}{(1-r)}+\frac{1}{\lambda_{max}}+\frac{(2+({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}+1)^{1/2}\lambda_{max}^{-1})+\mathbb{E}[|{Y}_{0}|^{2}]}{\log(\lambda_{max}^{-1})}+\mathbb{E}\left[|{Y}_{0}|^{2}\right]^{1/2}\Bigg{)}.

Recall that (w(i),𝐚(i),𝐜(i),bi)i=1,,N(w^{(i)},{\bf a}^{(i)},{\bf c}^{(i)},b_{i})_{i=1,\ldots,N} was considered fixed; now we take expectation also with respect to its distribution ν\nu and obtain in the same way as we obtained (4.12)

𝔼[sup𝐰N:𝐰R|n(H^𝐰)(H^𝐰)|]\displaystyle\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}(\widehat{H}_{{\bf w}})-\mathcal{R}(\widehat{H}_{{\bf w}})|\right] max(R,1)max(2RLσ2𝔼[max((i=1Nλ¯1𝐚(i)2)12,(i=1N𝐜(i)2)12)],\displaystyle\leq\max(R,1)\max\Bigg{(}2RL_{\sigma_{2}}\mathbb{E}\left[\max\left((\sum_{i=1}^{N}\|\bar{\lambda}^{-1}{\bf a}^{(i)}\|^{2})^{\frac{1}{2}},(\sum_{i=1}^{N}\|{\bf c}^{(i)}\|^{2})^{\frac{1}{2}}\right)\right], (4.14)
R𝔼[(i=1Nσ2(bi)2)12],1,R(𝔼[(𝐗1)𝐗1])12)Cest,1log(n)n.\displaystyle\quad R\mathbb{E}\left[(\sum_{i=1}^{N}\sigma_{2}(b_{i})^{2})^{\frac{1}{2}}\right],1,R\left(\mathbb{E}\left[(\mathbf{X}^{1})^{\top}\mathbf{X}^{1}\right]\right)^{\frac{1}{2}}\Bigg{)}C_{est,1}\frac{\sqrt{\log(n)}}{\sqrt{n}}.

The expectations in (4.14) can be estimated using Jensen’s inequality and (3.23):

𝔼[(𝐗1)𝐗1]12=(j=1N𝔼[|Xj1|2])12\displaystyle\mathbb{E}\left[(\mathbf{X}^{1})^{\top}\mathbf{X}^{1}\right]^{\frac{1}{2}}=\left(\sum_{j=1}^{N}\mathbb{E}[|{X}^{1}_{j}|^{2}]\right)^{\frac{1}{2}} max(2Lσ22,2|σ2(0)|2)12N12𝔼[|𝐚(1)𝐱1ESN(𝐙)+𝐜(1)𝐙0+b1|2+1]12\displaystyle\leq\max(2L_{\sigma_{2}}^{2},2|\sigma_{2}(0)|^{2})^{\frac{1}{2}}N^{\frac{1}{2}}\mathbb{E}[|{\bf a}^{(1)}\cdot\mathbf{x}_{-1}^{\mathrm{ESN}}({\bf Z})+{\bf c}^{(1)}\cdot{\bf Z}_{0}+b_{1}|^{2}+1]^{\frac{1}{2}}
N12Cest,2\displaystyle\leq N^{\frac{1}{2}}C_{est,2}

with

Cest,2=\displaystyle C_{est,2}= 232max(M,1)max(Lσ2,|σ2(0)|,1)\displaystyle 2^{\frac{3}{2}}\max(M,1)\max(L_{\sigma_{2}},|\sigma_{2}(0)|,1) (4.15)
×|CESN|+𝐁ESN+11|AESN|(𝔼[λ¯1𝐚(1)2]12+𝔼[𝐜(1)2]12+𝔼[|b1|2]12+1)\displaystyle\times\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|+1}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}}(\mathbb{E}[\|\bar{\lambda}^{-1}{\bf a}^{(1)}\|^{2}]^{\frac{1}{2}}+\mathbb{E}[\|{\bf c}^{(1)}\|^{2}]^{\frac{1}{2}}+\mathbb{E}[|b_{1}|^{2}]^{\frac{1}{2}}+1)

and consequently

max(2RLσ2𝔼[max((i=1Nλ¯1𝐚(i)2)12,(i=1N𝐜(i)2)12)],R𝔼[(i=1Nσ2(bi)2)12],1,R(𝔼[(𝐗1)𝐗1])12)max(RN12Cest,2,1).\max\Bigg{(}2RL_{\sigma_{2}}\mathbb{E}\left[\max\left((\sum_{i=1}^{N}\|\bar{\lambda}^{-1}{\bf a}^{(i)}\|^{2})^{\frac{1}{2}},(\sum_{i=1}^{N}\|{\bf c}^{(i)}\|^{2})^{\frac{1}{2}}\right)\right],R\mathbb{E}\left[(\sum_{i=1}^{N}\sigma_{2}(b_{i})^{2})^{\frac{1}{2}}\right],1,R\left(\mathbb{E}\left[(\mathbf{X}^{1})^{\top}\mathbf{X}^{1}\right]\right)^{\frac{1}{2}}\Bigg{)}\\ \quad\leq\max\left(RN^{\frac{1}{2}}C_{est,2},1\right). (4.16)

Inserting this in (4.14) yields

𝔼[sup𝐰N:𝐰R|n(H^𝐰)(H^𝐰)|]\displaystyle\mathbb{E}\left[\sup_{{\bf w}\in\mathbb{R}^{N}\,:\,\|{\bf w}\|\leq R}|\mathcal{R}_{n}(\widehat{H}_{{\bf w}})-\mathcal{R}(\widehat{H}_{{\bf w}})|\right] max(R,1)max(RN12Cest,2,1)Cest,1log(n)n.\displaystyle\leq\max(R,1)\max\left(RN^{\frac{1}{2}}C_{est,2},1\right)C_{est,1}\frac{\sqrt{\log(n)}}{\sqrt{n}}. (4.17)

Combining this with (4.6), (4.7) and (4.8) we thus obtain

𝔼[|H(𝐙¯)H^𝐖^(𝐙¯)|2]1/2\displaystyle\mathbb{E}[|H(\bar{\bf Z})-\widehat{H}_{\widehat{\bf W}}(\bar{\bf Z})|^{2}]^{1/2} 𝔼[|H(𝐙¯)H^𝐖(𝐙¯)|2]1/2+𝔼[(H^𝐖^)n(H^𝐖^)+n(H^𝐖)(H^𝐖)]1/2\displaystyle\leq\mathbb{E}[|H(\bar{\bf Z})-\widehat{H}_{\bf W}(\bar{\bf Z})|^{2}]^{1/2}+\mathbb{E}[\mathcal{R}(\widehat{H}_{\widehat{\bf W}})-\mathcal{R}_{n}(\widehat{H}_{\widehat{\bf W}})+\mathcal{R}_{n}(\widehat{H}_{\bf W})-\mathcal{R}(\widehat{H}_{\bf W})]^{1/2} (4.18)
CH,ESN[λNd+|AESN|T+1N12dμA,B,Cdν12]\displaystyle\leq C_{H,\mathrm{ESN}}[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}\left\|\frac{d\mu_{A,B,C}}{d\nu}\right\|_{\infty}^{\frac{1}{2}}]
+2(max(R,1)max(RN12Cest,2,1)Cest,1log(n)n)12\displaystyle\quad+\sqrt{2}\left(\max(R,1)\max\left(RN^{\frac{1}{2}}C_{est,2},1\right)C_{est,1}\frac{\sqrt{\log(n)}}{\sqrt{n}}\right)^{\frac{1}{2}}
Capprox[λNd+|AESN|T+1N12]+C~est(RN12log(n)n)12,\displaystyle\leq C_{\mathrm{approx}}[\lambda^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}]+\tilde{C}_{\mathrm{est}}\left(RN^{\frac{1}{2}}\frac{\sqrt{\log(n)}}{\sqrt{n}}\right)^{\frac{1}{2}},

where we used max(RN12,1)RN12max([κw(1)]1,1)\max(RN^{\frac{1}{2}},1)\leq RN^{\frac{1}{2}}\max([\kappa\|w^{(1)}\|_{\infty}]^{-1},1) and set Capprox=CH,ESNmax(κ12,1)C_{\mathrm{approx}}=C_{H,\mathrm{ESN}}\max(\kappa^{\frac{1}{2}},1), C~est=(2Cest,1max(R,1)Cest,2max([κw(1)]1,1))12\tilde{C}_{\mathrm{est}}=(2C_{est,1}\max(R,1)C_{est,2}\max([\kappa\|w^{(1)}\|_{\infty}]^{-1},1))^{\frac{1}{2}}. Setting

Cest\displaystyle C_{\mathrm{est}} =c~2[(|CESN|+𝐁ESN+1)3(1|AESN|)3log(λmax1)max(R,1)(𝔼[𝐚(1)2]12+𝔼[𝐜(1)2]12+𝔼[|b1|2]12+1)\displaystyle=\tilde{c}_{2}\Bigg{[}\frac{({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}+\|{\bf B}^{\mathrm{ESN}}\|+1)^{3}}{(1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|})^{3}\sqrt{\log(\lambda_{max}^{-1})}}\max(R,1)(\mathbb{E}[\|{\bf a}^{(1)}\|^{2}]^{\frac{1}{2}}+\mathbb{E}[\|{\bf c}^{(1)}\|^{2}]^{\frac{1}{2}}+\mathbb{E}[|b_{1}|^{2}]^{\frac{1}{2}}+1)
×(r2|||AESN|||2)1/2max([κw(1)]1,1)(r1r+1λmax+(|CESN|2+1)12λmax1log(λmax1))]12\displaystyle\quad\quad\times(r^{2}-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2})^{-1/2}\max([\kappa\|w^{(1)}\|_{\infty}]^{-1},1)\Bigg{(}\frac{r}{1-r}+\frac{1}{\lambda_{max}}+\frac{({\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{C}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}+1)^{\frac{1}{2}}\lambda_{max}^{-1}}{\log(\lambda_{max}^{-1})}\Bigg{)}\Bigg{]}^{\frac{1}{2}}

with

c~2=26(max(1,Cdep)max(M,1)5max(M~,1)3max(Lσ2,|σ2(0)|,1))12\tilde{c}_{2}=2^{6}(\max(1,C_{\mathrm{dep}})\max(M,1)^{5}\max(\widetilde{M},1)^{3}\max(L_{\sigma_{2}},|\sigma_{2}(0)|,1))^{\frac{1}{2}} (4.19)

and inserting (3.28) then shows (4.3) with (4.4) and (4.5), as claimed. ∎

Remark 4.3.

The bound in Theorem 4.2 can be directly extended to the multivariate case m>1m>1 as follows: if H=(H1,,Hm)H=(H_{1},\ldots,H_{m}) and each HiH_{i} satisfies the hypotheses formulated in Theorem 4.2, then

𝔼[H(𝐙¯)H^W^(𝐙¯)2]1/2\displaystyle\mathbb{E}[\|H(\bar{\bf Z})-\widehat{H}_{\widehat{W}}(\bar{\bf Z})\|^{2}]^{1/2} =𝔼[i=1m|Hi(𝐙¯)H^W^i(𝐙¯)|2]1/2i=1m𝔼[|Hi(𝐙¯)H^W^i(𝐙¯)|2]1/2\displaystyle=\mathbb{E}\left[\sum_{i=1}^{m}|H_{i}(\bar{\bf Z})-\widehat{H}_{\widehat{W}_{i}}(\bar{\bf Z})|^{2}\right]^{1/2}\leq\sum_{i=1}^{m}\mathbb{E}\left[|H_{i}(\bar{\bf Z})-\widehat{H}_{\widehat{W}_{i}}(\bar{\bf Z})|^{2}\right]^{1/2} (4.20)
i=1mCapprox,i(λiNd+|AESN|T+1N12)+i=1mCest,i(RN12log(n)n)12\displaystyle\leq\sum_{i=1}^{m}C_{\mathrm{approx},i}\left(\lambda_{i}^{\frac{N}{d}}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{A}^{\mathrm{ESN}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{T}+\frac{1}{N^{\frac{1}{2}}}\right)+\sum_{i=1}^{m}C_{\mathrm{est},i}\left(RN^{\frac{1}{2}}\frac{\sqrt{\log(n)}}{\sqrt{n}}\right)^{\frac{1}{2}}

with Capprox,iC_{\mathrm{approx},i}, Cest,iC_{\mathrm{est},i} the constants from Theorem 4.2 corresponding to HiH_{i}.

5 Conclusions

In this paper we have provided reservoir computing approximation and generalization bounds for a new concept class of input/output systems that extends to a dynamical context the so-called generalized Barron functionals. We call the elements of the new class recurrent generalized Barron functionals; they are characterized by readouts that admit a certain integral representation built on infinite-dimensional state-space systems. We have also shown that they form a vector space and that the class contains as a subset systems of this type with finite-dimensional state-space representations. This class has been shown to be very rich: it contains all sufficiently smooth finite-memory functionals, linear and convolutional filters, as well as any input-output system built using separable Hilbert state spaces with readouts of integral Barron type. Moreover, this class is dense in the L2L^{2}-sense.

The reservoir architectures used for the approximation/estimation of elements in the new class are randomly generated echo state networks with either linear (for approximation) or ReLU (for estimation) activation functions, and with readouts built using randomly generated neural networks, in which only the output layer is trained (extreme learning machines or random feature neural networks). The results in the paper yield a fully implementable recurrent neural network-based learning algorithm with provable convergence guarantees that does not suffer from the curse of dimensionality.

Acknowledgments. The authors acknowledge partial financial support coming from the Swiss National Science Foundation (grant number 200021_175801/1). L. Grigoryeva and J.-P. Ortega wish to thank the hospitality of the University of Munich and L. Gonon that of the Nanyang Technological University during the academic visits in which some of this work was developed. We thank Christa Cuchiero and Josef Teichmann for interesting exchanges that inspired some of the results in this paper.

References

  • [Acci 22] B. Acciaio, A. Kratsios, and G. Pammer. “Metric hypertransformers are universal adapted maps”. arXiv preprint arXiv:2201.13094, 2022.
  • [Arco 22] T. Arcomano, I. Szunyogh, A. Wikner, J. Pathak, B. R. Hunt, and E. Ott. “A hybrid approach to atmospheric modeling that combines machine learning with a physics-based numerical model”. Journal of Advances in Modeling Earth Systems, Vol. 14, No. 3, p. e2021MS002712, 2022.
  • [Barr 92] A. R. Barron. “Neural net approximation”. In: Proc. 7th Yale Workshop on Adaptive and Learning Systems, pp. 69–72, 1992.
  • [Barr 93] A. Barron. “Universal approximation bounds for superpositions of a sigmoidal function”. IEEE Transactions on Information Theory, Vol. 39, No. 3, pp. 930–945, May 1993.
  • [Bent 22] F. E. Benth, N. Detering, and L. Galimberti. “Neural networks in Fréchet spaces”. Annals of Mathematics and Artificial Intelligence, pp. 1–29, 2022.
  • [Bouc 13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
  • [Bouv 17a] J. Bouvrie and B. Hamzi. “Kernel methods for the approximation of nonlinear systems”. SIAM Journal on Control and Optimization, Vol. 55, No. 4, pp. 2460–2492, 2017.
  • [Bouv 17b] J. Bouvrie and B. Hamzi. “Kernel methods for the approximation of some key quantities of nonlinear systems”. Journal of Computational Dynamics, Vol. 4, No. (1-2), pp. 1–19, 2017.
  • [Boyd 85] S. Boyd and L. Chua. “Fading memory and the problem of approximating nonlinear operators with Volterra series”. IEEE Transactions on Circuits and Systems, Vol. 32, No. 11, pp. 1150–1161, 1985.
  • [Chen 19] J. Chen and H. I. Nurdin. “Learning nonlinear input–output maps with dissipative quantum systems”. Quantum Information Processing, Vol. 18, No. 7, pp. 1–36, 2019.
  • [Chen 20] J. Chen, H. I. Nurdin, and N. Yamamoto. “Temporal information processing on noisy quantum computers”. Physical Review Applied, Vol. 14, No. 2, p. 24065, 2020.
  • [Chen 22] J. Chen. Nonlinear Convergent Dynamics for Temporal Information Processing on Novel Quantum and Classical Devices. PhD thesis, UNSW Sydney, 2022.
  • [Chen 95] T. Chen and H. Chen. “Approximation capability to functions of several variables, nonlinear functionals, and operators by radial basis function neural networks”. IEEE Transactions on Neural Networks, Vol. 6, No. 4, pp. 904–910, 1995.
  • [Coui 16] R. Couillet, G. Wainrib, H. Sevi, and H. T. Ali. “The asymptotic performance of linear echo state neural networks”. Journal of Machine Learning Research, Vol. 17, No. 178, pp. 1–35, 2016.
  • [Cuch 22] C. Cuchiero, F. Primavera, and S. Svaluto-Ferro. “Universal approximation theorems for continuous functions of càdlàg paths and Lévy-type signature models”. arXiv preprint arXiv:2208.02293, 2022.
  • [E 19] W. E, C. Ma, and L. Wu. “A priori estimates of the population risk for two-layer neural networks”. Commun. Math. Sci., Vol. 17, No. 5, pp. 1407–1425, 2019.
  • [E 20a] W. E, C. Ma, S. Wojtowytsch, and L. Wu. “Towards a mathematical understanding of neural network-based machine learning: what we know and what we don’t”. arXiv preprint arXiv:2009.10713, 2020.
  • [E 20b] W. E and S. Wojtowytsch. “On the Banach spaces associated with multi-layer ReLU networks: Function representation, approximation theory and gradient descent dynamics”. arXiv preprint arXiv:2007.15623, 2020.
  • [E 21] W. E and S. Wojtowytsch. “Representation formulas and pointwise properties for Barron functions”. arXiv preprint arXiv:2006.05982, 2021.
  • [Evan 10] L. C. Evans. Partial Differential Equations. Vol. 19 of Graduate Studies in Mathematics, American Mathematical Society, Providence, RI, second Ed., 2010.
  • [G Ma 20] G. Manjunath. “Stability and memory-loss go hand-in-hand: three results in dynamics & computation”. To appear in Proceedings of the Royal Society London Ser. A Math. Phys. Eng. Sci., pp. 1–25, 2020.
  • [Gali 22] L. Galimberti, G. Livieri, and A. Kratsios. “Designing universal causal deep learning models: the case of infinite-dimensional dynamical systems from stochastic analysis”. arXiv preprint arXiv:2210.13300, 2022.
  • [Gono 20a] L. Gonon, L. Grigoryeva, and J.-P. Ortega. “Memory and forecasting capacities of nonlinear recurrent networks”. Physica D, Vol. 414, No. 132721, pp. 1–13, 2020.
  • [Gono 20b] L. Gonon, L. Grigoryeva, and J.-P. Ortega. “Risk bounds for reservoir computing”. Journal of Machine Learning Research, Vol. 21, No. 240, pp. 1–61, 2020.
  • [Gono 20c] L. Gonon and J.-P. Ortega. “Reservoir computing universality with stochastic inputs”. IEEE Transactions on Neural Networks and Learning Systems, Vol. 31, No. 1, pp. 100–112, 2020.
  • [Gono 21a] L. Gonon. “Random feature neural networks learn Black-Scholes type PDEs without curse of dimensionality”. Preprint, 2021.
  • [Gono 21b] L. Gonon and J.-P. Ortega. “Fading memory echo state networks are universal”. Neural Networks, Vol. 138, pp. 10–13, 2021.
  • [Gono 23] L. Gonon, L. Grigoryeva, and J.-P. Ortega. “Approximation error estimates for random neural networks and reservoir systems”. The Annals of Applied Probability, Vol. 33, No. 1, pp. 28–69, 2023.
  • [Grig 14] L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. “Stochastic time series forecasting using time-delay reservoir computers: performance and universality”. Neural Networks, Vol. 55, pp. 59–71, 2014.
  • [Grig 18a] L. Grigoryeva and J.-P. Ortega. “Echo state networks are universal”. Neural Networks, Vol. 108, pp. 495–508, 2018.
  • [Grig 18b] L. Grigoryeva and J.-P. Ortega. “Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems”. Journal of Machine Learning Research, Vol. 19, No. 24, pp. 1–40, 2018.
  • [Grig 19] L. Grigoryeva and J.-P. Ortega. “Differentiable reservoir computing”. Journal of Machine Learning Research, Vol. 20, No. 179, pp. 1–62, 2019.
  • [Grig 21a] L. Grigoryeva, A. G. Hart, and J.-P. Ortega. “Learning strange attractors with reservoir systems”. arXiv preprint arXiv:2108.05024, 2021.
  • [Grig 21b] L. Grigoryeva and J.-P. Ortega. “Dimension reduction in recurrent networks by canonicalization”. Journal of Geometric Mechanics, Vol. 13, No. 4, pp. 647–677, 2021.
  • [Herm 10] M. Hermans and B. Schrauwen. “Memory in linear recurrent neural networks in continuous time”. Neural Networks, Vol. 23, No. 3, pp. 341–355, April 2010.
  • [Herm 12] M. Hermans and B. Schrauwen. “Recurrent kernel machines: computation with infinite echo state networks”. Neural Computation, Vol. 24, pp. 104–133, 2012.
  • [Hu 22] P. Hu, Q. Meng, B. Chen, S. Gong, Y. Wang, W. Chen, R. Zhu, Z.-M. Ma, and T.-Y. Liu. “Neural operator with regularity structure for modeling dynamics driven by SPDEs”. arXiv preprint arXiv:2204.06255, 2022.
  • [Huan 06] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. “Extreme learning machine: Theory and applications”. Neurocomputing, Vol. 70, No. 1-3, pp. 489–501, December 2006.
  • [Jaeg 04] H. Jaeger and H. Haas. “Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication”. Science, Vol. 304, No. 5667, pp. 78–80, 2004.
  • [Jaeg 10] H. Jaeger. “The ‘echo state’ approach to analysing and training recurrent neural networks with an erratum note”. Tech. Rep., German National Research Center for Information Technology, 2010.
  • [Kira 19] F. J. Király and H. Oberhauser. “Kernels for sequentially ordered data”. Journal of Machine Learning Research, Vol. 20, 2019.
  • [Kova 21] N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. “Neural operator: Learning maps between function spaces”. arXiv preprint arXiv:2108.08481, 2021.
  • [Krat 20] A. Kratsios and I. Bilokopytov. “Non-Euclidean universal approximation”. In: 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada, 2020.
  • [Lax 02] P. Lax. Functional Analysis. Wiley-Interscience, 2002.
  • [Li 20] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. “Fourier neural operator for parametric partial differential equations”. arXiv preprint arXiv:2010.08895, 2020.
  • [Li 22] Z. Li, J. Han, W. E, and Q. Li. “Approximation and optimization theory for linear continuous-time recurrent neural networks”. Journal of Machine Learning Research, Vol. 23, pp. 41–42, 2022.
  • [Lu 18] Z. Lu, B. R. Hunt, and E. Ott. “Attractor reconstruction by machine learning”. Chaos, Vol. 28, No. 6, 2018.
  • [Maas 02] W. Maass, T. Natschläger, and H. Markram. “Real-time computing without stable states: a new framework for neural computation based on perturbations”. Neural Computation, Vol. 14, pp. 2531–2560, 2002.
  • [Maas 11] W. Maass. “Liquid state machines: motivation, theory, and applications”. In: S. B. Cooper and A. Sorbi, Eds., Computability in Context: Computation and Logic in the Real World, Chap. 8, pp. 275–296, 2011.
  • [Manj 13] G. Manjunath and H. Jaeger. “Echo state property linked to an input: exploring a fundamental characteristic of recurrent neural networks”. Neural Computation, Vol. 25, No. 3, pp. 671–696, 2013.
  • [Manj 22a] G. Manjunath. “Embedding information onto a dynamical system”. Nonlinearity, Vol. 35, No. 3, p. 1131, 2022.
  • [Manj 22b] G. Manjunath and J.-P. Ortega. “Transport in reservoir computing”. arXiv preprint arXiv:2209.07946, 2022.
  • [Mart 23] R. Martínez-Peña and J.-P. Ortega. “Quantum reservoir computing in finite dimensions”. Physical Review E, Vol. 107, No. 3, p. 035306, 2023.
  • [Matt 92] M. B. Matthews. On the Uniform Approximation of Nonlinear Discrete-Time Fading-Memory Systems Using Neural Network Models. PhD thesis, ETH Zürich, 1992.
  • [Matt 93] M. B. Matthews. “Approximating nonlinear fading-memory operators using neural network models”. Circuits, Systems, and Signal Processing, Vol. 12, No. 2, pp. 279–307, June 1993.
  • [Neuf 22] A. Neufeld and P. Schmocker. “Chaotic hedging with iterated integrals and neural networks”. arXiv preprint arXiv:2209.10166, 2022.
  • [Path 17] J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott. “Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data”. Chaos, Vol. 27, No. 12, 2017.
  • [Path 18] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. “Model-Free Prediction of Large Spatiotemporally Chaotic Systems from Data: A Reservoir Computing Approach”. Physical Review Letters, Vol. 120, No. 2, p. 24102, 2018.
  • [Rahi 07] A. Rahimi and B. Recht. “Random features for large-scale kernel machines”. In: Advances in Neural Information Processing Systems, 2007.
  • [Salv 22] C. Salvi, M. Lemercier, and A. Gerasimovics. “Neural stochastic PDEs: Resolution-invariant learning of continuous spatiotemporal dynamics”. In: Advances in Neural Information Processing Systems, 2022.
  • [Stin 99] M. B. Stinchcombe. “Neural network approximation of continuous functionals and continuous functions on compactifications”. Neural Networks, Vol. 12, No. 3, pp. 467–477, 1999.
  • [Tino 18] P. Tino. “Asymptotic Fisher memory of randomized linear symmetric Echo State Networks”. Neurocomputing, Vol. 298, pp. 4–8, 2018.
  • [Tino 20] P. Tino. “Dynamical systems as temporal feature spaces”. Journal of Machine Learning Research, Vol. 21, pp. 1–42, 2020.
  • [Tran 20] Q. H. Tran and K. Nakajima. “Higher-order quantum reservoir computing”. arXiv preprint arXiv:2006.08999, 2020.
  • [Tran 21] Q. H. Tran and K. Nakajima. “Learning temporal quantum tomography”. Physical Review Letters, Vol. 127, No. 26, p. 260401, 2021.
  • [Vill 09] C. Villani. Optimal Transport: Old and New. Springer, 2009.
  • [Wikn 21] A. Wikner, J. Pathak, B. R. Hunt, I. Szunyogh, M. Girvan, and E. Ott. “Using data assimilation to train a hybrid forecast system that combines machine-learning and knowledge-based components”. Chaos: An Interdisciplinary Journal of Nonlinear Science, Vol. 31, No. 5, p. 53114, 2021.
  • [Yild 12] I. B. Yildiz, H. Jaeger, and S. J. Kiebel. “Re-visiting the echo state property”. Neural Networks, Vol. 35, pp. 1–9, November 2012.
  • [Ziem 22] I. M. Ziemann, H. Sandberg, and N. Matni. “Single Trajectory Nonparametric Learning of Nonlinear Dynamics”. In: P.-L. Loh and M. Raginsky, Eds., Proceedings of Thirty Fifth Conference on Learning Theory, pp. 3333–3364, PMLR, 2022.