
Forecasting causal dynamics with universal reservoirs

Lyudmila Grigoryeva$^{1,2}$, James Louw$^{3}$, and Juan-Pablo Ortega$^{3}$
Abstract

An iterated multistep forecasting scheme based on recurrent neural networks (RNNs) is proposed for time series generated by causal chains with infinite memory. This forecasting strategy contains, as a particular case, the iterative prediction strategies for dynamical systems that are customary in reservoir computing. Readily computable error bounds are obtained as a function of the forecasting horizon, of functional and dynamical features of the specific RNN used, and of the approximation error committed by it. The framework in this paper circumvents difficult-to-verify embedding hypotheses that appear in previous references and applies to new situations, like the finite-dimensional observations of functional differential equations or the deterministic parts of stochastic processes, to which standard embedding techniques do not necessarily apply.


Key Words: forecasting, dynamical system, ergodicity, nonuniform hyperbolicity, Oseledets’ theorem, recurrent neural network, reservoir computing, state-space system, echo state network, ESN, generalized synchronization, causal chain with infinite memory, universality, universal approximation.

$^{1}$Faculty of Mathematics and Statistics, University of St. Gallen, Bodanstrasse 6, St. Gallen CH-9000, Switzerland. [email protected]
$^{2}$Honorary Associate Professor, Department of Statistics, University of Warwick, Coventry CV4 7AL, UK. [email protected]
$^{3}$Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, 21 Nanyang Link, Singapore 637371. [email protected], [email protected]

1 Introduction

Recurrent neural networks (RNNs) and, in particular, reservoir computing (RC) [Luko 09, Tana 19] have gained much visibility in recent years due to their remarkable success in learning the chaotic attractors of complex nonlinear infinite-dimensional dynamical systems [Jaeg 04, Path 17, Path 18, Lu 18, Peco 20, Hart 20, Grig 21a, Arco 22] and in a great variety of empirical classification and forecasting applications [Wyff 10, Bute 13, Grig 14]. Even though much research has been conducted in recent years that theoretically explains this good performance, most of the work has concentrated on the estimation [Gono 20a] and approximation [Grig 18b, Grig 18a, Gono 20b, Gono 21, Cuch 22, Gono 23a] properties of input/output systems, and not on the way in which reservoir computing has traditionally been used as an RNN to forecast the observations of a dynamical system.

This paper aims to quantify how the universal approximation properties of a family of recurrent networks in the sense of input/output systems impact their approximation properties in the realm of the modeling and forecasting of the observations of autonomous dynamical systems and, more generally, of causal chains with infinite memory (CCIM) [Dede 07, Douk 08, Alqu 12]. More specifically, we shall quantify the forecasting performance of an RNN that has been trained to mimic the observations of a dynamical system or a causal chain to a certain accuracy as an input/output system (where the output is a time-shifted version of the system observations in the input). This result will be obtained with minimal requirements on the state-space map. The upshot of our main result is that the short-term forecasting (or path-continuation) error committed by an RNN that has been trained to uniformly approximate the dynamics of the observations up to a certain accuracy $\varepsilon>0$ is bounded above by the product of $\varepsilon$ with an exponential factor controlled by the top Lyapunov exponent of the autonomous dynamical system associated with the RNN that has been trained using the observations.

The two most popular forecasting schemes for causal dynamics are the so-called iterated multistep and direct forecasting strategies. There is a consensus that direct forecasting methods are more robust to model misspecification than iterated multistep schemes such as those based on RNNs [Marc 06, Ghys 09, McEl 15]. This statement can be rigorously proved in many contexts, especially when estimation errors can be neglected. In our work, we focus on the iterated multistep approach since we are interested not only in minimizing short-term forecasting errors but ultimately in using the trained system as a proxy for the underlying data-generating process.
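The iterated multistep strategy can be sketched in a few lines: a one-step predictor is applied recursively, feeding each forecast back as an input. The `model` below is a hypothetical stand-in for any trained one-step map (for instance, an RNN with a readout); it is not the estimator studied in this paper.

```python
import numpy as np

def iterated_forecast(model, history, horizon):
    """Iterated multistep forecasting: feed each prediction back as input.

    `model` maps a window of past observations to the next value; it is a
    placeholder for any trained one-step predictor.
    """
    window = list(history)
    preds = []
    for _ in range(horizon):
        y_next = model(np.asarray(window))
        preds.append(y_next)
        window = window[1:] + [y_next]  # slide the window forward
    return np.array(preds)

# Toy one-step model: the data-generating process y_t = 0.9 * y_{t-1},
# so iterated forecasts decay geometrically.
model = lambda w: 0.9 * w[-1]
out = iterated_forecast(model, [1.0, 0.9, 0.81], horizon=3)
```

In a direct scheme one would instead train a separate predictor per horizon; the iterated scheme reuses a single one-step map, which is what lets the trained system double as a proxy for the data-generating process.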

The causal chains that we forecast in this paper contain, as particular cases, two important dynamical elements. First, given a dynamical system $\phi\in\operatorname{Diff}^{1}(M)$ on a $q$-dimensional manifold $M$ and a map $\omega\in C^{2}(M,\mathbb{R}^{d})$ that provides lower-dimensional observations of it, Takens' Embedding Theorem [Take 81] proves that time series of $\omega$-observations of $\phi$ generically form (that is, for an open and dense set of observation maps $\omega$ in the $C^{2}$ topology) a causal chain (actually of finite memory). Second, causal chains also appear as the deterministic parts of many stochastic processes after using the Wold [Broc 06, Theorem 5.7.1] or the Doob [Foll 02] decompositions.

The observations-based modeling and forecasting of dynamical systems is a venerable topic that has traditionally been tackled using attractor reconstruction techniques based on Takens' Embedding Theorem [Take 81, Star 99, Huke 06, Huke 15] and embedology [Saue 91]. More recently, new links have been established between traditional RNN-based forecasting schemes and dynamical systems learnability with classical notions like attractor embeddings and generalized synchronizations. The reader can consult [Hart 20, Lu 20, Manj 20, Verz 20, Verz 21, Hart 21, Grig 21a, Grig 23, Manj 21, Manj 22] and references therein for recent work in this direction.

A particularly interesting recent work in the context of dynamical systems is [Berr 23], which introduces a very general embedding framework that contains, as a particular case, most of the schemes in the references cited above. Moreover, it provides error estimates for the iterated multistep and direct forecasting strategies in that general framework, which are somewhat related to the results that we present in this paper. In that respect, we conclude this introduction by emphasizing some features of our approach that make it useful and, at the same time, differentiate it from the references above:

  • We focus exclusively on the modeling and forecasting of the observations of a given dynamical system and not of the full dynamical system itself. This choice allows us to circumvent the need for embedding and generalized synchronization properties relating the state-space system used for the learning and the dynamical system to be forecasted. This fact is particularly important because it is difficult to establish that a given state-space system driven by prescribed observations of a dynamical system has embedding properties, a hypothesis that appears in most references like [Berr 23]. The embedding property has been proved for certain linear Takens-like systems [Grig 23] but remains to be shown for many standard universal RC and RNN paradigms like echo state networks [Jaeg 04] or state-affine systems [Grig 18b], even though the synchronization maps that they induce have been proved to be typically differentiable [Grig 21a].

  • Notwithstanding what has been said above, when given access to the full states of a dynamical system, our work also provides bounds for forecasting the dynamical system itself via two strategies, namely (i) seeing the identity map on the dynamical system as an observation map, and (ii) the causal embedding learning scheme introduced in [Manj 21]. The latter scheme uses a state map to learn a dynamical system from inputs consisting of the dynamical system’s left-infinite trajectories. When the reservoir map satisfies a condition called state-input invertibility (see [Manj 21] or Section 3.2 later on for details), the scheme is a particular case of our setup.

  • Unlike the bounds for the iterated multistep scheme in [Berr 23], the ones that we propose in this work are explicit (can be derived from constants introduced in the proofs) and are valid within a finite time horizon and not just asymptotically.

  • Our treatment covers not only observations of dynamical systems but also extends to general causal chains with infinite memory. We can hence handle forecasting problems that go beyond the situations usually treated with embedding techniques. These could be the solutions of infinite dimensional or functional differential equations (see example in Section 4) or the deterministic parts of stochastic processes.

  • The short-term forecasting bounds, or at least the rate at which the error grows with the forecasting horizon, are readily computable in terms of the approximation error and analytic and dynamical features of the RNN used for forecasting. The dependence on the approximation error makes it possible to use bounds developed in previous works (see [Gono 23a, Gono 23b]) that estimate it as a function of relevant architecture parameters in many popular reservoir families.

2 Preliminaries

This section describes the dynamical processes that are the object of our research and the forecasting scheme that we will use for them. Additionally, we spell out some classical results from ergodic theory that will be central in our developments.

The data generating process (DGP) and its approximants.

Our results apply to the discrete-time evolution of dynamical variables in a $d$-dimensional Euclidean space whose behavior is determined by a causal chain with infinite memory (CCIM). More specifically, we are interested in the dynamical properties of infinite sequences ${\bf y}=\left({\bf y}_{t}\right)_{t\in\mathbb{Z}}$, where ${\bf y}_{t}\in\mathbb{R}^{d}$ for all $t\in\mathbb{Z}$, for which there exists a functional $H:(\mathbb{R}^{d})^{\mathbb{Z}_{-}}\longrightarrow\mathbb{R}^{d}$ such that

$${\bf y}_{t}=H\left(\ldots,{\bf y}_{t-2},{\bf y}_{t-1}\right)=H\left(\underline{{\bf y}_{t-1}}\right),\quad\text{for all }t\in\mathbb{Z},$$

where the symbol $\underline{{\bf y}_{t-1}}\in({\mathbb{R}}^{d})^{\mathbb{Z}_{-}}$ denotes the semi-infinite sequence $\underline{{\bf y}_{t-1}}=\left(\ldots,{\bf y}_{t-2},{\bf y}_{t-1}\right)$. These dynamics are the deterministic counterparts of various structures with a widespread presence in time series analysis. See, for instance, [Dede 07, Douk 08, Alqu 12] and references therein.
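A minimal concrete example of such a chain (our own toy illustration, not one taken from the cited references) is a functional with geometrically decaying weights, which has fading dependence on the remote past; the truncation to a finite past below is only for numerical evaluation.

```python
import numpy as np

def H(past, a=0.5):
    """Toy fading-memory CCIM functional (illustrative assumption):
    y_t = tanh( sum_{k>=1} a^k * y_{t-k} ), truncated to the available past.
    Geometric weights make the dependence on remote inputs decay.
    """
    past = np.asarray(past)          # past[-1] is y_{t-1}
    k = np.arange(1, len(past) + 1)  # lag of each entry, most recent first
    weights = a ** k
    return np.tanh(np.dot(weights, past[::-1]))

# Iterate the chain forward from a constant remote past.
ys = [0.1] * 50
for _ in range(100):
    ys.append(H(ys))
```

For $a=0.5$ the weights sum to one, and the iterates slowly contract toward the fixed point at the origin, consistent with the chain being generated by a contracting functional.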

The CCIM determined by the functional $H:({\mathbb{R}}^{d})^{\mathbb{Z}_{-}}\longrightarrow{\mathbb{R}}^{d}$ has an associated canonical causal and time-invariant filter $U_{H}:({\mathbb{R}}^{d})^{\mathbb{Z}}\longrightarrow({\mathbb{R}}^{d})^{\mathbb{Z}}$ determined by the equality

$$U_{H}({\bf y})_{t}:=H\left(\underline{{\bf y}_{t-1}}\right),\quad\text{for all }{\bf y}\in({\mathbb{R}}^{d})^{\mathbb{Z}}\text{ and }t\in\mathbb{Z}.\tag{2.1}$$

A classical problem in control and systems theory is the realization of filters like (2.1) as the unique solution (when it exists) of a state-space transformation

$$\mathbf{x}_{t}=F(\mathbf{x}_{t-1},{\bf y}_{t-1}),\tag{2.2}$$
$${\bf y}_{t}=h(\mathbf{x}_{t}),\tag{2.3}$$

where $F:\mathbb{R}^{L}\times\mathbb{R}^{d}\longrightarrow\mathbb{R}^{L}$ and $h:\mathbb{R}^{L}\longrightarrow\mathbb{R}^{d}$ are a state map and a readout map, respectively, $\mathbb{R}^{L}$ is, in this case, the state space and, typically, $L$ is much bigger than $d$. Some classical references in connection with this problem are [Kalm 59b, Kalm 59a, Kalm 60, Kalm 62, Baum 66, Kalm 10] or, more recently, [Sand 78, Sont 79, Dang 84, Nare 90, Matt 94, Lind 15, Grig 21b, Orte 24b]. Even though the realization problem does not always have a solution, it has been shown that if the domain of the functional $H$ is restricted to a space of uniformly bounded sequences of the type $K_{M}:=\left\{{\bf y}\in(\mathbb{R}^{d})^{\mathbb{Z}_{-}}\mid\left\|{\bf y}_{t}\right\|\leq M,\,t\in\mathbb{Z}_{-}\right\}$ for some $M>0$ and, additionally, $H$ satisfies a continuity condition called the fading memory property [Boyd 85, Grig 18b], then various families of state-space transformations of the type (2.2)-(2.3) can be used to approximate it uniformly. This is the case for linear systems with polynomial or neural network readouts [Boyd 85, Grig 18b, Gono 20b], state-affine systems with linear readouts [Grig 18b], or echo state networks [Grig 18a, Gono 20b, Gono 23a, Gono 21]. In other words, all of these families exhibit universality properties in the fading memory category. This allows us to conclude that, given a fading memory functional $H:K_{M}\longrightarrow\mathbb{R}^{d}$ and $\varepsilon>0$, there exists a state-space system $(F,h)$ in that family with an associated functional $H^{F}_{h}:K_{M}\longrightarrow\mathbb{R}^{d}$ such that

$$\left\|H-H^{F}_{h}\right\|_{\infty}=\sup_{{\bf y}\in K_{M}}\left\{\left\|H({\bf y})-H^{F}_{h}({\bf y})\right\|\right\}<\varepsilon.$$

For future reference, a condition that ensures that a state-space system $(F,h)$ has an associated functional $H^{F}_{h}:K_{M}\longrightarrow\mathbb{R}^{d}$ is the so-called echo state property (ESP) [Jaeg 10]: the state equation $F$ has the ESP with respect to inputs in $K_{M}$ when for any ${\bf y}\in K_{M}$ there exists a unique $\mathbf{x}\in(\mathbb{R}^{L})^{\mathbb{Z}_{-}}$ such that (2.2) holds for any $t\in\mathbb{Z}_{-}$. In such a situation, $F$ has an associated filter $U^{F}:K_{M}\longrightarrow(\mathbb{R}^{L})^{\mathbb{Z}_{-}}$ uniquely determined by the relations $U^{F}({\bf y})_{t}=F\left(U^{F}({\bf y})_{t-1},{\bf y}_{t-1}\right)$. The filter $U^{F}$ induces the functional $H^{F}\colon K_{M}\longrightarrow{\mathbb{R}}^{L}$ given by $H^{F}({\bf y})=U^{F}({\bf y})_{0}$. The functional $H^{F}_{h}:K_{M}\longrightarrow\mathbb{R}^{d}$ corresponding to the state-space system is given by $H^{F}_{h}({\bf y})=h\circ H^{F}({\bf y})$. The ESP has important dynamical implications [Grig 19, Manj 20, Manj 22]. In practice, the state space is often taken to be compact. In that case, if $F$ is a continuous map and a contraction in its first entry, the ESP is guaranteed. Indeed, for such a state map, it can be shown that with inputs from $\overline{B({\bf 0},M)}\subset\mathbb{R}^{d}$ there is a compact subset $X\subset\mathbb{R}^{L}$ such that $F(X\times\overline{B({\bf 0},M)})\subset X$. Stochastic generalizations of some of these statements can be found in [Manj 23, Orte 24a].
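The contraction argument admits a quick numerical sanity check. The sketch below (with arbitrary toy hyperparameters of our own choosing, not a construction from the paper) builds an echo-state-type map whose reservoir matrix has spectral norm below one, so that $F$ is a contraction in its first entry; two different initial states driven by the same input then converge, which is the practical signature of the ESP.

```python
import numpy as np

rng = np.random.default_rng(0)
L_dim, d = 50, 1

# Rescale the reservoir matrix so that ||A||_2 = 0.9 < 1.  Since tanh is
# 1-Lipschitz, F(x, y) = tanh(A x + C y) is then a contraction in x.
A = rng.normal(size=(L_dim, L_dim))
A *= 0.9 / np.linalg.norm(A, 2)
C = rng.normal(size=(L_dim, d))

def F(x, y):
    return np.tanh(A @ x + C @ y)

# Drive two different initial states with the same input sequence:
# the states converge, i.e. the system forgets its initialization.
x1, x2 = rng.normal(size=L_dim), rng.normal(size=L_dim)
for _ in range(200):
    y = rng.uniform(-1, 1, size=d)
    x1, x2 = F(x1, y), F(x2, y)
gap = np.linalg.norm(x1 - x2)
```

After 200 steps the gap is bounded by $0.9^{200}$ times the initial state distance, so it is numerically zero; this state-forgetting behavior is exactly the uniqueness of the solution sequence required by the ESP.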

Causal chains with infinite memory and dynamical systems observations.

A particularly important family of CCIMs can be generated using the observations of certain dynamical systems. Indeed, let $M$ be a manifold and let $\phi:M\longrightarrow M$ be a continuous map that determines a dynamical system on it via the evolution recursions

$$m_{t+1}=\phi(m_{t}),$$

using a given point $m_{0}\in M$ as initial condition. The data available for the forecasting of a dynamical system are given in the form of observations that are usually encoded as the values of a continuous map $\omega:M\longrightarrow{\mathbb{R}}^{d}$, that is, they are given by the sequences ${\bf y}_{0}:=\omega(m_{0}),\,{\bf y}_{1}:=\omega(m_{1}),\ldots,{\bf y}_{t}:=\omega(m_{t})$. The problem that we tackle in this work is the forecasting or continuation of those sequences using state-space transformations of the type (2.2)-(2.3) that have been learned/trained using exclusively temporal traces of $\omega$-observations.

We shall now show that the observations of a large class of dynamical systems are CCIMs. More specifically, we shall work in the context of invertible dynamical systems for which an invertible generalized synchronization map can be found. We briefly recall the meaning of these terms. First, we say that the dynamical system $\phi$ on $M$ is invertible when it belongs either to the set ${\rm Hom}(M)$ of homeomorphisms (continuous invertible maps with continuous inverse) of a topological space $M$ or to the set ${\rm Diff}^{1}(M)$ of diffeomorphisms of a finite-dimensional manifold $M$. The invertibility of $\phi$ allows us, given an observation map $\omega:M\longrightarrow{\mathbb{R}}^{d}$, to define the $(\phi,\omega)$-delay map $S_{(\phi,\omega)}:M\longrightarrow(\mathbb{R}^{d})^{\mathbb{Z}}$ as $S_{(\phi,\omega)}(m):=\left(\omega(\phi^{t}(m))\right)_{t\in\mathbb{Z}}$. Second, given a state-space map $F:\mathbb{R}^{L}\times\mathbb{R}^{d}\longrightarrow\mathbb{R}^{L}$, consider the drive-response system associated to the $\omega$-observations of $\phi$ and determined by the recursions

$$\mathbf{x}_{t}=F\left(\mathbf{x}_{t-1},S_{(\phi,\omega)}(m)_{t-1}\right),\quad t\in\mathbb{Z},\,m\in M.\tag{2.4}$$

We say that a generalized synchronization (GS) [Rulk 95, Peco 97, Ott 02, Bocc 02, Erog 17] occurs in this configuration when there exists a map $f_{(\phi,\omega,F)}:M\longrightarrow\mathbb{R}^{L}$ such that for any $\mathbf{x}_{t}$, $t\in\mathbb{Z}$, as in (2.4), it holds that

$$\mathbf{x}_{t}=f_{(\phi,\omega,F)}(\phi^{t}(m)),\tag{2.5}$$

that is, the time evolution of the dynamical system in phase space (not just its observations) drives the response in (2.4). We emphasize that the definition (2.5) presupposes that the recursions (2.4) have a solution, that is, that there exists a sequence $\mathbf{x}\in(\mathbb{R}^{L})^{\mathbb{Z}}$ such that (2.4) holds true. When that existence property holds and, additionally, the solution sequence $\mathbf{x}$ is unique, we say that $F$ has the $(\phi,\omega)$-Echo State Property (ESP). The existence, continuity, and differentiability of GSs have been established, for instance, in [Star 99, Grig 21a] for a rich class of systems that exhibit the so-called fading memory property and that are generated by locally state-contracting maps $F$. In such a framework, the state equation (2.4) uniquely determines a filter $U^{F}:\left({\mathbb{R}}^{d}\right)^{\mathbb{Z}}\longrightarrow\left({\mathbb{R}}^{L}\right)^{\mathbb{Z}}$ that satisfies the recursion $U^{F}({\bf z})_{t}=F\left(U^{F}({\bf z})_{t-1},{\bf z}_{t-1}\right)$, for all $t\in\mathbb{Z}$, in terms of which the GS $f_{(\phi,\omega,F)}:M\longrightarrow\mathbb{R}^{L}$ can be written as

$$f_{(\phi,\omega,F)}(m)=\left(U^{F}\left(S_{(\phi,\omega)}(m)\right)\right)_{0}.\tag{2.6}$$

The following result, introduced in [Grig 21a, Grig 24], shows that the observations of a dynamical system in the presence of an injective GS form a CCIM. More explicitly, we shall be assuming that the GS $f_{(\phi,\omega,F)}:M\longrightarrow\mathbb{R}^{L}$ is an embedding, which means that $f_{(\phi,\omega,F)}$ is an injective immersion ($C^{1}$ map with injective tangent map) and that, additionally, the manifold topology in $f_{(\phi,\omega,F)}(M)$ induced by $f_{(\phi,\omega,F)}$ coincides with the relative topology inherited from the standard topology in $\mathbb{R}^{L}$.

Proposition 2.1

Let $\phi:M\longrightarrow M$ be an invertible dynamical system on the manifold $M$ and let $\omega:M\longrightarrow\mathbb{R}^{d}$ be an observation map. Let $F:\mathbb{R}^{L}\times\mathbb{R}^{d}\longrightarrow\mathbb{R}^{L}$ be a state map that determines a generalized synchronization $f_{(\phi,\omega,F)}:M\longrightarrow\mathbb{R}^{L}$ when driven by the $\omega$-observations of $M$ as in (2.4). Suppose that $f_{(\phi,\omega,F)}$ is an embedding. Then:

(i)

The set $S:=f_{(\phi,\omega,F)}(M)\subset\mathbb{R}^{L}$ is an embedded submanifold of the reservoir space $\mathbb{R}^{L}$.

(ii)

There exists a differentiable observation map $h:S\longrightarrow\mathbb{R}^{d}$ that extracts the corresponding observations of the dynamical system out of the reservoir states. That is, with the notation introduced in (2.4):

$$h(\mathbf{x}_{t})=\omega\left(\phi^{t}(m)\right).\tag{2.7}$$
(iii)

The maps $F$ and $h$ determine a differentiable dynamical system $\Phi\in C^{1}(S,S)$ given by

$$\Phi({\bf s}):=F({\bf s},h({\bf s})),\tag{2.8}$$

which is $C^{1}$-conjugate to $\phi\in{\rm Diff}^{1}(M)$ by $f_{(\phi,\omega,F)}$, that is,

$$f_{(\phi,\omega,F)}\circ\phi=\Phi\circ f_{(\phi,\omega,F)}.\tag{2.9}$$
(iv)

The sequences of $\omega$-observations ${\bf y}_{t}:=\omega(\phi^{t}(m))$, $t\in\mathbb{Z}$, for a fixed $m\in M$, are causal chains with infinite memory.

The proof of this result can be found in [Grig 24] and is an elementary consequence of the fact that $f_{(\phi,\omega,F)}$ is an embedding (see, for instance, [Abra 88] for details). The map in (ii) is $h:=\omega\circ f_{(\phi,\omega,F)}^{-1}:f_{(\phi,\omega,F)}(M)\subset\mathbb{R}^{L}\longrightarrow\mathbb{R}^{d}$ (notice that it can be constructed since, by hypothesis, $f_{(\phi,\omega,F)}$ is invertible onto its image). Regarding (iv), this is a consequence of (2.7), (2.5), and (2.6):

$${\bf y}_{t}=\omega\left(\phi^{t}(m)\right)=h(\mathbf{x}_{t})=h\left(f_{(\phi,\omega,F)}(\phi^{t}(m))\right)=h\left(\left(U^{F}\left(S_{(\phi,\omega)}(\phi^{t}(m))\right)\right)_{0}\right)=:H^{F}_{h}(\underline{{\bf y}_{t-1}}),$$

where the dependence of the newly defined map $H^{F}_{h}$ exclusively on $\underline{{\bf y}_{t-1}}$ is a consequence of the causality of the filter $U^{F}$.

As we show in the next corollary, the last part of this proposition, together with Takens’ Embedding Theorem (or, more generally, the main result in [Grig 23]), implies that generic observations of a large class of dynamical systems form CCIMs.

Corollary 2.2

Let $M$ be a compact manifold of dimension $q\in\mathbb{N}$ and let $\phi\in{\rm Diff}^{2}(M)$ be a twice-differentiable diffeomorphism that satisfies the following two properties:

(i)

$\phi$ has only finitely many periodic points with periods less than or equal to $2q$.

(ii)

If $m\in M$ is any periodic point of $\phi$ with period $k<2q$, then the eigenvalues of the linear map $T_{m}\phi^{k}:T_{m}M\longrightarrow T_{m}M$ are distinct.

Then, for a generic scalar observation function $\omega\in C^{2}(M,\mathbb{R})$, the sequences of $\omega$-observations are causal chains with infinite memory.

Proof. Recall first that, in the hypotheses of the corollary, Takens' Theorem [Take 81, Huke 06] guarantees that the $(2q+1)$-truncated version $S_{(\phi,\omega)}^{2q+1}$ of the $(\phi,\omega)$-delay map, given by $S_{(\phi,\omega)}^{2q+1}(m):=\left(\omega(m),\omega(\phi^{-1}(m)),\ldots,\omega(\phi^{-2q}(m))\right)$, is a continuously differentiable embedding. This map is, in turn, the GS corresponding to the linear state map $F(\mathbf{x},z):=A\mathbf{x}+\mathbf{C}z$, with $A$ the lower shift matrix in dimension $2q+1$ and $\mathbf{C}=(1,0,\ldots,0)\in\mathbb{R}^{2q+1}$. The claim then follows from part (iv) of Proposition 2.1. $\blacksquare$
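The role of the lower shift matrix in this proof can be checked numerically: iterating the linear state map $F(\mathbf{x},z)=A\mathbf{x}+\mathbf{C}z$ stores the incoming observations in a sliding window, newest first, which is exactly the truncated delay map. The observation sequence below is an arbitrary toy example of ours.

```python
import numpy as np

q = 1                             # toy "manifold dimension"; window 2q+1 = 3
n = 2 * q + 1
A = np.diag(np.ones(n - 1), -1)   # lower shift matrix
C = np.zeros(n)
C[0] = 1.0

# Drive the linear state map with a scalar observation sequence z_0, z_1, ...
z = np.array([0.3, -0.7, 0.2, 0.9, -0.1])
x = np.zeros(n)
for zt in z:
    x = A @ x + C * zt

# After at least n steps the state holds the last n observations,
# most recent first: x = (z_{t-1}, z_{t-2}, z_{t-3}).
```

Each iteration shifts the stored window down by one slot and writes the newest observation into the first coordinate, so after $n$ steps the state coincides with the truncated delay vector regardless of the initialization.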

Ergodic theorems for dynamical systems.

An important tool in the study of dynamical systems is the concept of ergodicity, which allows us to translate statements about space averages to averages along trajectories of the system, thus relating global analytic properties of a map to the dynamics it induces. Of particular relevance to our work is the study of nonuniform hyperbolicity and the multiplicative ergodic theorem of Oseledets [Osel 68, Ruel 79, Barr 07]. We briefly introduce the theory as applied to our context, keeping to the details and results necessary for our work. Our overview is based largely on [Barr 07, Chapter 1 and Chapter 3].

Let $M$ be a $q$-dimensional $C^{1}$ submanifold of $\mathbb{R}^{L}$ and let $\phi\in\operatorname{Diff}^{1}(M)$ be a dynamical system acting on $M$. A concept that allows us to quantify the rate of divergence of two points initially close together is that of the Lyapunov exponent. For a point $m\in M$, the forward Lyapunov exponent along a vector $v\in T_{m}M$ in the tangent space $T_{m}M$ of $M$ at the point $m$ is defined as

$$\lambda^{+}(m,v)=\limsup_{t\to\infty}\frac{1}{t}\log\lVert D\phi^{t}(m)v\rVert,\tag{2.10}$$

where $D\phi^{t}(m):T_{m}M\longrightarrow T_{\phi^{t}(m)}M$ is the tangent map of $\phi^{t}$ at the point $m\in M$. Similarly, we define the backward Lyapunov exponent

$$\lambda^{-}(m,v)=\limsup_{t\to-\infty}\frac{1}{|t|}\log\lVert D\phi^{t}(m)v\rVert.\tag{2.11}$$

It can be shown that, for any given point $m\in M$, the forward Lyapunov exponent $\lambda^{+}(m,\cdot)$ satisfies the following three properties:

(i)

$\lambda^{+}(m,\alpha v)=\lambda^{+}(m,v)$ for all $v\in T_{m}M$ and all $\alpha\in\mathbb{R}\setminus\{0\}$,

(ii)

$\lambda^{+}(m,v+w)\leq\max\{\lambda^{+}(m,v),\lambda^{+}(m,w)\}$ for all $v,w\in T_{m}M$, and

(iii)

$\lambda^{+}(m,0)=-\infty$, where we make use of the convention $\log 0=-\infty$.

As a result, $\lambda^{+}(m,\cdot)$ takes on finitely many values $\lambda_{1}^{+}(m)>\cdots>\lambda_{l^{+}(m)}^{+}(m)$ for some $l^{+}(m)\leq q$. These are related to the differing rates of convergence/divergence along different tangent directions. The properties (i)-(iii) also allow us to construct a filtration of vector subspaces in the tangent space, using the ordered Lyapunov exponents to single out linearly independent tangent directions up to a given Lyapunov value. The filtration is $\mathcal{V}^{+}(m)\colon\{0\}\subsetneq V_{l^{+}(m)}^{+}(m)\subsetneq\cdots\subsetneq V_{1}^{+}(m)=T_{m}M$, where we define

$$V_{i}^{+}(m):=\{v\in T_{m}M\colon\lambda^{+}(m,v)\leq\lambda_{i}^{+}(m)\},\tag{2.12}$$

for $i=1,\dots,l^{+}(m)$, that is, $V_{i}^{+}(m)$ is the subspace of the tangent space consisting of all directions with asymptotic rate of divergence at most $\lambda_{i}^{+}(m)$. The number of tangent directions having a given rate of divergence is $k_{i}^{+}(m)=\dim V_{i}^{+}(m)-\dim V_{i+1}^{+}(m)$ for $i=1,\dots,l^{+}(m)$, with the convention $V_{l^{+}(m)+1}^{+}(m)=\{0\}$. The value $k_{i}^{+}(m)$ is referred to as the multiplicity of the Lyapunov exponent $\lambda_{i}^{+}(m)$.
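In practice, the top exponent $\lambda_{1}^{+}(m)$ can be estimated by iterating the tangent dynamics $v_{t+1}=D\phi(m_{t})v_{t}$ with periodic renormalization. The sketch below uses the Hénon map with its classical parameters as our own illustration (it is not an example from the paper); the estimate comes out near the commonly reported value $\lambda_{1}\approx 0.42$.

```python
import numpy as np

def henon(m, a=1.4, b=0.3):
    """One step of the Hénon map."""
    x, y = m
    return np.array([1 - a * x**2 + y, b * x])

def jac(m, a=1.4, b=0.3):
    """Jacobian (tangent map) of the Hénon map at m."""
    x, _ = m
    return np.array([[-2 * a * x, 1.0], [b, 0.0]])

# Push a tangent vector forward, accumulating log growth and
# renormalizing at every step (one-dimensional discrete QR).
m = np.array([0.1, 0.1])
v = np.array([1.0, 0.0])
lam = 0.0
T = 5000
for _ in range(T):
    v = jac(m) @ v
    lam += np.log(np.linalg.norm(v))
    v /= np.linalg.norm(v)
    m = henon(m)
lam /= T
```

The renormalization keeps the tangent vector at unit length so the accumulated logarithms do not overflow; the short initial transient is not discarded here, which slightly perturbs the estimate but does not change its order of magnitude.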

In like manner, for $m\in M$ the backward Lyapunov exponent $\lambda^{-}(m,\cdot)$ takes on finitely many values $\lambda_{1}^{-}(m)<\cdots<\lambda_{l^{-}(m)}^{-}(m)$ for some $l^{-}(m)\leq q$ and has a filtration $\mathcal{V}^{-}(m)\colon\{0\}\subsetneq V_{1}^{-}(m)\subsetneq\cdots\subsetneq V_{l^{-}(m)}^{-}(m)=T_{m}M$ defined by

$$V_{i}^{-}(m):=\{v\in T_{m}M\colon\lambda^{-}(m,v)\leq\lambda_{i}^{-}(m)\},\tag{2.13}$$

for $i=1,\dots,l^{-}(m)$. In a similar fashion to the forward Lyapunov exponent, the multiplicity of the backward Lyapunov exponent $\lambda_{i}^{-}(m)$ is defined as $k_{i}^{-}(m)=\dim V_{i}^{-}(m)-\dim V_{i-1}^{-}(m)$ for $i=1,\dots,l^{-}(m)$, with the convention $V_{0}^{-}(m)=\{0\}$.

While the forward Lyapunov exponent captures the growth of tangent vectors in forward time, the backward exponent measures this growth in backward time. An obstacle in the definitions (2.10) and (2.11) is the $\limsup$, which makes the growth of individual terms in the sequences intractable. To deal with this, we relate the forward and backward Lyapunov exponents and their corresponding filtrations. This relation is expressed in the notion of coherence. The filtrations $\mathcal{V}^{+}(m)$ and $\mathcal{V}^{-}(m)$ are said to be coherent if

(i)

$l^{+}(m)=l^{-}(m)=:l(m)$,

(ii)

there is a decomposition $T_{m}M=\bigoplus_{i=1}^{l(m)}E_{i}(m)$ such that

$$V_{i}^{+}(m)=\bigoplus_{j=i}^{l(m)}E_{j}(m)\quad\text{and}\quad V_{i}^{-}(m)=\bigoplus_{j=1}^{i}E_{j}(m)\quad\text{for }i=1,\dots,l(m),\text{ and}\tag{2.14}$$
(iii)

$\lim_{t\to\pm\infty}\frac{1}{t}\log\lVert D\phi^{t}(m)v\rVert=\lambda_{i}^{+}(m)=-\lambda_{i}^{-}(m)=:\lambda_{i}(m)$ uniformly over $\{v\in E_{i}(m)\colon\lVert v\rVert=1\}$.

The subspaces $E_{i}(m)$ are called the Oseledets subspaces, and the decomposition $T_{m}M=\bigoplus_{i=1}^{l(m)}E_{i}(m)$ is called the Oseledets splitting. The number of distinct Lyapunov exponents and the Oseledets subspaces are invariant along a trajectory of the dynamical system in the sense that $l(m)=l(\phi(m))$ and $D\phi(m)E_{i}(m)=E_{i}(\phi(m))$ for $i=1,\dots,l(m)$. The Oseledets subspaces can be constructed by taking $E_{i}(m)=V_{i}^{+}(m)\cap V_{i}^{-}(m)$ for $i=1,\dots,l(m)$. Expressing them in this way is instructive: the set $V_{i}^{+}(m)$ filters out those tangent directions which, in forward time, have Lyapunov exponent greater than $\lambda_{i}(m)$, while the set $V_{i}^{-}(m)$ filters out those that, in backward time, have Lyapunov exponent greater than $-\lambda_{i}(m)$. Thus the tangent directions left in $E_{i}(m)$ are in some sense pure, in that their Lyapunov exponents in forward and backward time agree.

We focus on a certain regular class of points whose properties allow us to establish a link between the Lyapunov exponent and the rate of divergence of points in a nearby neighborhood using the theory of nonuniform hyperbolicity. In particular, for these points the $\limsup$s in (2.10) and (2.11) become limits, and these points possess a splitting as in (2.14), allowing us to decompose the tangent space into subspaces where the rate of expansion/contraction under the action of the dynamical system is uniform. A point $m\in M$ is said to be forward regular if

$$\lim_{t\to\infty}\frac{1}{t}\log\lvert\det D\phi^{t}(m)\rvert=\sum_{i=1}^{l^{+}(m)}k_{i}^{+}(m)\lambda_{i}^{+}(m)<\infty.$$

Likewise, $m$ is said to be backward regular if

$$\lim_{t\to-\infty}\frac{1}{|t|}\log\lvert\det D\phi^{t}(m)\rvert=\sum_{i=1}^{l^{-}(m)}k_{i}^{-}(m)\lambda_{i}^{-}(m)<\infty.$$

In both cases, the expression on the right is the sum of the distinct values of the Lyapunov exponent counted with multiplicity. Intuitively, this says that a small volume of space around the point $m$ grows at an exponential rate controlled by the sum of the Lyapunov exponents. Finally, we are ready to define the set of regular points: a point is said to be Lyapunov-Perron regular, or simply regular, if it is both forward and backward regular and the filtrations $\mathcal{V}^{+}(m)$ and $\mathcal{V}^{-}(m)$ are coherent. This brings us to Oseledets' Multiplicative Ergodic Theorem [Barr 07, Theorem 3.4.3].

Theorem 2.3 (Oseledets’ Multiplicative Ergodic Theorem)

Let $\phi\in\operatorname{Diff}^{1}(M)$ be a dynamical system acting on the $q$-dimensional $C^{1}$ submanifold $M$ of $\mathbb{R}^{L}$. Let $\mu$ be a Borel probability measure on $M$, invariant with respect to $\phi$, and suppose that $\int_{M}\log^{+}\|D\phi\|\,d\mu<\infty$ and $\int_{M}\log^{+}\|D\phi^{-1}\|\,d\mu<\infty$, where $\log^{+}a=\max\{\log a,0\}$. Then the set $\Lambda\subset M$ of regular points is invariant under $\phi$ and has full $\mu$-measure.

While above we gave the standard definition for a regular point, we require a slightly stronger consequence of regularity found in [Barr 07, Theorem 1.3.11]. The result involves a certain type of basis which is conveniently adapted to the associated filtration at the point.

Definition 2.4

Consider a q-dimensional vector space V and a filtration \{0\}=V_{l+1}\subsetneq V_{l}\subsetneq\cdots\subsetneq V_{1}=V. A basis \{v_{1},\dots,v_{q}\} of V is said to be normal with respect to this filtration if for every i=1,\dots,l there is a basis for V_{i} contained in the set \{v_{1},\dots,v_{q}\}.

The result we need is in fact an equivalent condition for the forward regularity of a point, quantifying how a small volume at the point expands along trajectories. It allows us to bound how the angles between the subspaces in the Oseledets splitting evolve with time. For a set \{v_{1},\dots,v_{r}\} of linearly independent vectors, we denote by \operatorname{vol}(v_{1},\dots,v_{r}) the volume of the r-parallelepiped spanned by these vectors. The following result can be found in [Barr 07, Theorem 1.3.11].

Theorem 2.5

Let M be a q-dimensional C^{1} submanifold of \mathbb{R}^{L} and let \phi\in\operatorname{Diff}^{1}(M) be a dynamical system acting on M. Consider a point m\in M with associated filtration \mathcal{V}(m)\colon\{0\}=V_{l(m)+1}(m)\subsetneq V_{l(m)}(m)\subsetneq\cdots\subsetneq V_{1}(m)=T_{m}M, as defined above. Then m is forward regular if and only if for any basis \{v_{1},\dots,v_{q}\} of T_{m}M, normal with respect to \mathcal{V}(m), and for any K\subset\{1,\dots,q\},

\lim_{t\to\infty}\frac{1}{t}\log\operatorname{vol}(\{D\phi^{t}(m)v_{i}\}_{i\in K})=\sum_{i\in K}\lambda(m,v_{i}).

3 Forecasting causal chains with recurrent neural networks

We now study the iterated multistep forecasting scheme for causal chains using recurrent neural networks and prove our main result, which provides bounds on the forecasting error as a function of the forecasting horizon and the dynamical properties of the RNN used in the prediction exercise.

3.1 The iterated multistep forecasting scheme

We formulate the forecasting problem in a simplified setup in which we are given a bi-infinite sequence {\bf y}=({\bf y}_{t})_{t\in\mathbb{Z}} which is a realization of a causal chain with infinite memory determined by a functional H:({\mathbb{R}}^{d})^{\mathbb{Z}_{-}}\longrightarrow{\mathbb{R}}^{d}. This means that the entries of {\bf y} satisfy {\bf y}_{t}=H(\ldots,{\bf y}_{t-2},{\bf y}_{t-1})=H(\underline{{\bf y}_{t-1}}) for all t\in\mathbb{Z}. The iterated multistep forecasting of {\bf y} consists in producing predictions \widetilde{{\bf y}}_{1},\widetilde{{\bf y}}_{2},\ldots,\widetilde{{\bf y}}_{t} of the positively indexed entries {\bf y}_{1},{\bf y}_{2},\ldots,{\bf y}_{t}, t\in\mathbb{N}, of {\bf y} based on the knowledge of the semi-infinite sequence \underline{{\bf y}_{0}}\in({\mathbb{R}}^{d})^{\mathbb{Z}_{-}}. The RNN-based iterated multistep forecasting scheme consists of first finding a state-space system (F,h) with associated functional H^{F}_{h}:({\mathbb{R}}^{d})^{\mathbb{Z}_{-}}\longrightarrow{\mathbb{R}}^{d} that uniformly \varepsilon-approximates H, that is,

\left\|H-H^{F}_{h}\right\|_{\infty}=\sup_{{\bf y}\in({\mathbb{R}}^{d})^{\mathbb{Z}_{-}}}\left\{\left\|H({\bf y})-H^{F}_{h}({\bf y})\right\|\right\}<\varepsilon. (3.1)

Once such a pair (F,h) has been found, the predictions \widetilde{{\bf y}}_{1},\widetilde{{\bf y}}_{2},\ldots,\widetilde{{\bf y}}_{t} are iteratively produced in tandem with a sequence \widetilde{{\bf x}}_{1},\widetilde{{\bf x}}_{2},\ldots,\widetilde{{\bf x}}_{t} in the state space according to the scheme

\widetilde{\bf x}_{t}=F(\widetilde{\bf x}_{t-1},\widetilde{\bf y}_{t-1}), (3.2)
\widetilde{\bf y}_{t}=h(\widetilde{\bf x}_{t}), (3.3)

for t\in\mathbb{N}, where we initialize \widetilde{\bf x}_{0}=H^{F}(\underline{{\bf y}_{-1}}) and \widetilde{\bf y}_{0}={\bf y}_{0}. If we denote by {\bf y}\mathbf{v}=(\ldots,{\bf y}_{-1},{\bf y}_{0},\mathbf{v}_{1},\ldots,\mathbf{v}_{n})\in({\mathbb{R}}^{d})^{\mathbb{Z}_{-}} the concatenation of {\bf y}\in({\mathbb{R}}^{d})^{\mathbb{Z}_{-}} and \mathbf{v}\in({\mathbb{R}}^{d})^{n}, we can write our two sequences as

\widetilde{{\bf y}}_{t}=H^{F}_{h}(\underline{{\bf y}_{0}}\,\underline{\widetilde{{\bf y}}_{t-1}}),\quad\text{and}\quad\widetilde{{\bf x}}_{t}=H^{F}(\underline{{\bf y}_{0}}\,\underline{\widetilde{{\bf y}}_{t-1}}). (3.4)

We also let {\bf x}_{t}=H^{F}(\underline{{\bf y}_{t-1}}). While our aim is to estimate the forecasting error \Delta{\bf y}_{t}:={\bf y}_{t}-\widetilde{\bf y}_{t}, we do this by mapping the sequences \underline{{\bf y}_{t}} and \underline{{\bf y}_{0}}\,\underline{\widetilde{\bf y}_{t}} into the state-space variables {\bf x}_{t},\widetilde{\bf x}_{t}\in\mathbb{R}^{L} via H^{F} and studying how the error \Delta{\bf x}_{t}:={\bf x}_{t}-\widetilde{\bf x}_{t} accumulates there.
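The scheme (3.2)-(3.3) can be sketched with an echo state network playing the role of the state-space system (F,h). The reservoir matrices, hyperparameters, and the sinusoidal target below are illustrative assumptions only: the readout h is trained by ridge regression on the observed history, and the initial state \widetilde{\bf x}_{0} is obtained by driving the reservoir with that history (a washout), after which the network iterates autonomously.

```python
import numpy as np

rng = np.random.default_rng(0)

# Signal to forecast: a realization of a (here, deterministic) causal chain.
T = 2000
y = np.sin(0.1 * np.arange(T + 1))[:, None]          # shape (T+1, 1), d = 1

# Echo state network: x_t = F(x_{t-1}, y_{t-1}) = tanh(A x_{t-1} + C y_{t-1} + b).
L = 200                                              # reservoir dimension
A = rng.normal(size=(L, L))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))      # spectral radius < 1
C = rng.uniform(-0.1, 0.1, size=(L, 1))
b = rng.uniform(-0.1, 0.1, size=(L, 1))

def F(x, z):
    return np.tanh(A @ x + C @ z + b)

# Drive the reservoir with the observed history (washout approximates H^F).
X = np.zeros((L, 1))
states = []
for t in range(T):
    X = F(X, y[t].reshape(1, 1))
    states.append(X.ravel())
states = np.array(states)                            # states[t] is the state after y[0..t]

# Ridge-regression readout h(x) = W x, trained so that h(states[t]) ~ y[t+1].
washout = 200
S, Y = states[washout:-1], y[washout + 1:T]
W = np.linalg.solve(S.T @ S + 1e-8 * np.eye(L), S.T @ Y).T

# Iterated multistep forecast (3.2)-(3.3): start from the last driven state
# and the last observed value, then feed the predictions back in.
x_tilde, y_tilde = states[-1].reshape(L, 1), y[T].reshape(1, 1)
preds = []
for _ in range(50):
    x_tilde = F(x_tilde, y_tilde)
    y_tilde = W @ x_tilde
    preds.append(y_tilde.item())
```

The only learned ingredient is the linear readout W; the state map F is fixed at random, which is the defining feature of reservoir computing.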

The reason why we called this setup simplified at the beginning of this paragraph is that, in practice, one does not have at one's disposal a full semi-infinite history of the chain that one wants to forecast but only a finite sample, say \{{\bf y}_{-T},{\bf y}_{-T+1},\ldots,{\bf y}_{0}\}. This fact carries in its wake two complementary sources of error that we briefly describe, together with references where they are specifically addressed:

(i)

The estimation error: the universality results cited in the introduction guarantee the existence of uniform approximants H_{h}^{F} that satisfy (3.1) in a variety of situations and for various families of state-space systems. Given a finite sample \{{\bf y}_{-T},{\bf y}_{-T+1},\ldots,{\bf y}_{0}\}, the approximant H_{h}^{F} in the family is in general estimated using empirical risk minimization with a prescribed loss. All estimation methods carry an associated error that decreases with the sample size. Explicit bounds for this error can be found in [Gono 20a].

(ii)

The state starting error: as can be seen in (3.4), the iterations that produce the forecasts are initialized using the state H^{F}(\underline{{\bf y}_{-1}})\in{\mathbb{R}}^{L}, which is uniquely determined if F has the echo state property and the entire semi-infinite sequence \underline{{\bf y}_{-1}} is known. When only a finite history \{{\bf y}_{-T},{\bf y}_{-T+1},\ldots,{\bf y}_{0}\} is available, the initial state H^{F}(\underline{{\bf y}_{-1}}) can be determined only approximately, which leads to what we call the state starting error. If the state map F is contracting in the state variable, then the corresponding filter has the state forgetting property [Grig 19, Theorem 27], elsewhere known as the uniform attraction property (see [Manj 22]). This property implies that if we initialize the state equation at an arbitrary state sufficiently far in the past, then the resulting state at t=0 converges to H^{F}(\underline{{\bf y}_{-1}}) as T\rightarrow\infty, regardless of the initialization chosen. Hence, in the presence of this property, the state H^{F}(\underline{{\bf y}_{-1}}) can be approximated by using a sufficiently long washout. In the dynamical systems context, when an embedding generalized synchronization exists, one can learn a map that allows the so-called cold starting of the RNN without necessarily using a washout (see [Grig 24] for a presentation of this methodology).
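The state forgetting property behind the washout can be checked numerically. The minimal sketch below, with illustrative matrices of our own choosing, drives two reservoir copies started from different arbitrary states with the same input sequence; when F is contracting in the state variable (here \|A\|_{2}<1 and a 1-Lipschitz activation), the two state trajectories converge geometrically, so the initialization is forgotten.

```python
import numpy as np

rng = np.random.default_rng(1)

L, d, T = 100, 1, 300
A = rng.normal(size=(L, L))
A *= 0.8 / np.linalg.norm(A, 2)        # ||A||_2 = 0.8 makes F contracting in x
C = rng.uniform(-1, 1, size=(L, d))
b = rng.uniform(-1, 1, size=(L, 1))

def F(x, z):
    # tanh is 1-Lipschitz, so ||F(x1, z) - F(x2, z)|| <= ||A||_2 ||x1 - x2||
    return np.tanh(A @ x + C @ z + b)

inputs = rng.uniform(-1, 1, size=(T, d, 1))   # one common input sequence
x1 = rng.normal(size=(L, 1))                  # two arbitrary initializations
x2 = rng.normal(size=(L, 1))

gaps = []
for z in inputs:
    x1, x2 = F(x1, z), F(x2, z)
    gaps.append(np.linalg.norm(x1 - x2))
```

After T steps the gap is bounded by 0.8^{T} times the initial gap, which is the quantitative content of the uniform attraction property in this contracting case.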

Bounds for the forecasting error.

We proceed to develop the necessary theory to estimate the error associated with the iterated multistep forecasting scheme as a function of the forecasting horizon. We do this by mapping the true and predicted sequences into the state space using the state-space system; the predicted sequence is then a trajectory of the associated dynamical system, and ergodic theory allows us to track the accumulation of error.

Let L_{h} be a Lipschitz constant for h, that is, \|h({\bf x}^{1})-h({\bf x}^{2})\|\leq L_{h}\|{\bf x}^{1}-{\bf x}^{2}\| for any {\bf x}^{1},{\bf x}^{2}\in\mathbb{R}^{L}. According to the identities (3.4) we have

\left\|\Delta{\bf y}_{t}\right\|=\left\|H(\underline{{\bf y}_{t-1}})-H^{F}_{h}(\underline{{\bf y}_{0}}\,\underline{\widetilde{{\bf y}}_{t-1}})\right\|=\left\|H(\underline{{\bf y}_{t-1}})-H^{F}_{h}(\underline{{\bf y}_{t-1}})+H^{F}_{h}(\underline{{\bf y}_{t-1}})-H^{F}_{h}(\underline{{\bf y}_{0}}\,\underline{\widetilde{{\bf y}}_{t-1}})\right\|\leq\varepsilon+\left\|H^{F}_{h}(\underline{{\bf y}_{t-1}})-H^{F}_{h}(\underline{{\bf y}_{0}}\,\underline{\widetilde{{\bf y}}_{t-1}})\right\|=\varepsilon+\left\|h\left(H^{F}(\underline{{\bf y}_{t-1}})\right)-h\left(H^{F}(\underline{{\bf y}_{0}}\,\underline{\widetilde{{\bf y}}_{t-1}})\right)\right\|=\varepsilon+\left\|h({\bf x}_{t})-h(\widetilde{\bf x}_{t})\right\|\leq\varepsilon+L_{h}\|\Delta{\bf x}_{t}\|. (3.5)

Thus we can track the prediction error by bounding the error in the state space. Associated to the state-space system (F,h) is the dynamical system \Phi\colon\mathbb{R}^{L}\longrightarrow\mathbb{R}^{L} determined by \Phi({\bf x})=F({\bf x},h({\bf x})). According to the iterations (3.2) and (3.3), the sequence (\widetilde{\bf x}_{t}) is a trajectory of the dynamical system \Phi, that is, \widetilde{\bf x}_{t}=\Phi(\widetilde{\bf x}_{t-1}) for t\in\mathbb{N}, with initial condition \widetilde{\bf x}_{0}={\bf x}_{0}. Defining {\bf C}_{t}:=F({\bf x}_{t-1},{\bf y}_{t-1})-F({\bf x}_{t-1},h({\bf x}_{t-1})) for t\in\mathbb{N}, we have

\Delta{\bf x}_{t}={\bf C}_{t}+\Phi({\bf x}_{t-1})-\Phi(\widetilde{\bf x}_{t-1})={\bf C}_{t}+D\Phi(\widetilde{\bf x}_{t-1})\Delta{\bf x}_{t-1}+{\bf O}_{t}, (3.6)

where {\bf O}_{t} is the error resulting from the Taylor approximation of \Phi({\bf x}_{t-1})-\Phi(\widetilde{\bf x}_{t-1}). Let L_{z} be a Lipschitz constant for F in its second argument, that is, \|F({\bf x},{\bf z}_{1})-F({\bf x},{\bf z}_{2})\|\leq L_{z}\|{\bf z}_{1}-{\bf z}_{2}\| for any {\bf x}\in\mathbb{R}^{L} and any {\bf z}_{1},{\bf z}_{2}\in\mathbb{R}^{d}. Since \lVert{\bf y}_{t-1}-h({\bf x}_{t-1})\rVert=\lVert H(\underline{{\bf y}_{t-2}})-H^{F}_{h}(\underline{{\bf y}_{t-2}})\rVert\leq\varepsilon, we can bound \lVert{\bf C}_{t}\rVert=\|F({\bf x}_{t-1},{\bf y}_{t-1})-F({\bf x}_{t-1},h({\bf x}_{t-1}))\|\leq L_{z}\varepsilon.

Bounds for the linearized error.

We begin our analysis by ignoring the Taylor approximation error {\bf O}_{t} and considering the linearized sequence

{\bf a}_{t}={\bf C}_{t}+D\Phi(\widetilde{\bf x}_{t-1}){\bf a}_{t-1},\quad t\in{\mathbb{N}},\quad{\bf a}_{0}=0.

Using Theorems 2.3 and 2.5 introduced in Section 2, we bound the growth of this sequence in Theorem 3.1. The ergodic theorems tell us that, asymptotically, the convergence/divergence of trajectories is dominated by an exponential factor. While this growth estimate may deteriorate over time, it does so only at a small exponential rate, called the leakage rate in [Berr 23]. Including the Taylor error leads to a quadratic term in the sequence of errors. Dealing with the error accumulation in this case is harder, but we bound it up to a finite time horizon later on in Theorem 3.11.

Theorem 3.1

Let \Phi\in\operatorname{Diff}^{1}(X) be a dynamical system acting on the q-dimensional C^{1} submanifold X of \mathbb{R}^{L} and let X^{\prime} be a compact invariant subset of X. Consider a point {\bf x}\in X^{\prime} and its trajectory in forward time (\widetilde{\bf x}_{t})_{t\in\mathbb{N}_{0}} under the dynamical system \Phi, that is, \widetilde{\bf x}_{t}=\Phi(\widetilde{\bf x}_{t-1}) for t\in\mathbb{N}, with the initial condition \widetilde{\bf x}_{0}={\bf x}. Along with this, consider a sequence ({\bf C}_{t})_{t\in\mathbb{N}} such that {\bf C}_{t}\in T_{\widetilde{\bf x}_{t}}X for all t\in\mathbb{N} and, moreover, the sequence is bounded by a constant K>0. Construct the sequence {\bf a}=({\bf a}_{t})\in\left(\mathbb{R}^{L}\right)^{\mathbb{N}} according to the linear recursion

{\bf a}_{t}={\bf C}_{t}+D\Phi(\widetilde{\bf x}_{t-1}){\bf a}_{t-1},\quad t\in{\mathbb{N}},\quad{\bf a}_{0}=0.

Then for any \Phi-invariant Borel probability measure on X^{\prime}, there exists a \Phi-invariant set \Lambda\subset X^{\prime} of full measure such that for every point {\bf x}\in\Lambda and any sufficiently small leakage constant \delta>0, the corresponding sequence {\bf a} is bounded by

\|{\bf a}_{t}\|\leq RKe^{t(\lambda_{1}^{\text{pos}}+\delta)},

where \lambda_{1}=\lambda_{1}({\bf x}) is the largest Lyapunov exponent of {\bf x}, we write \lambda_{1}^{\text{pos}}=\max\{\lambda_{1},0\}, and R=R(\delta,{\bf x}) is a constant depending on the initial point {\bf x} and the leakage constant.
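In the simplest constant-coefficient case, where D\Phi is a fixed diagonal matrix so that the Lyapunov exponents are known exactly and one can take \delta\to 0, the bound of Theorem 3.1 can be checked directly. The matrix, the perturbation bound K, and the crude geometric-series constant R below are illustrative choices of ours, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Constant linearization with exact Lyapunov exponents log(1.2) and log(0.5).
D = np.diag([1.2, 0.5])
lam1 = np.log(1.2)                     # largest Lyapunov exponent

K = 0.3                                # bound on the perturbations C_t
T = 60
a = np.zeros(2)
norms = []
for t in range(1, T + 1):
    C_t = K * rng.uniform(-1, 1, size=2) / np.sqrt(2)   # guarantees ||C_t|| <= K
    a = C_t + D @ a                    # the linear recursion of Theorem 3.1
    norms.append(np.linalg.norm(a))

# Unrolling gives a_t = sum_s D^{t-s} C_s, so ||a_t|| <= K * sum_j e^{j lam1},
# and the geometric series yields the explicit constant R = 1/(1 - e^{-lam1}).
R = 1.0 / (1.0 - np.exp(-lam1))
bound = [R * K * np.exp(t * lam1) for t in range(1, T + 1)]
```

With a trajectory-dependent D\Phi(\widetilde{\bf x}_{t-1}) the exponents are only asymptotic rates, which is exactly why the theorem needs the leakage constant \delta and the point-dependent constant R(\delta,{\bf x}).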

Remark 3.2

We prove in fact that

\|{\bf a}_{t}\|\leq\begin{cases}R^{\prime}R^{\prime\prime}Ke^{t(\lambda_{1}+\delta)},&\mbox{if }\lambda_{1}-3\delta>0,\\ R^{\prime}Kte^{4t\delta},&\mbox{if }\lambda_{1}-3\delta=0,\\ R^{\prime}R^{\prime\prime}Ke^{4t\delta},&\mbox{if }\lambda_{1}-3\delta<0,\end{cases} (3.7)

for constants R^{\prime}=R^{\prime}(\delta,{\bf x}) and R^{\prime\prime}=R^{\prime\prime}(\delta,{\bf x}). Choosing the constant R appropriately then gives the bound in the theorem. In particular, for the first and third cases we may take R(\delta,{\bf x}):=R^{\prime}(\delta,{\bf x})R^{\prime\prime}(\delta,{\bf x}) and R(\delta,{\bf x}):=R^{\prime}(\delta/4,{\bf x})R^{\prime\prime}(\delta/4,{\bf x}), respectively. In the second case, we note that te^{4t\delta} is eventually overtaken by e^{5t\delta}, so there is a constant R^{\prime\prime\prime}(\delta) such that R^{\prime\prime\prime}(\delta)e^{5t\delta}\geq te^{4t\delta} for all t\in{\mathbb{N}}. Taking R(\delta,{\bf x}):=R^{\prime\prime\prime}(\delta/5)R^{\prime}(\delta/5,{\bf x}) yields the desired result.

Remark 3.3

In practice, we are concerned with the case when the largest Lyapunov exponent is positive, which is a necessary condition for chaos. In this case, trajectories of the dynamical system that start arbitrarily near to each other may diverge (sensitivity to initial conditions), and we wish to bound the rate at which this occurs. When the leading Lyapunov exponent is negative, trajectories that start near to one another converge, making this scenario trivial. While we deal with all possible values of the leading Lyapunov exponent in our theorem, for nonuniformly hyperbolic behavior to hold, we require that all the Lyapunov exponents be nonzero.

Remark 3.4

The leakage constant \delta quantifies the rate at which the exponential estimate deteriorates. Typically we take \delta\ll\min_{i=1,\dots,l({\bf x})}|\lambda_{i}|. As noted in the theorem, the constant R=R(\delta,{\bf x}) depends on \delta: taking a smaller leakage constant necessarily entails a larger constant R.

Remark 3.5

We comment on the assumption of the existence of a compact invariant set X^{\prime}. First, since the Borel probability measure is defined on X^{\prime}, compactness guarantees that \Phi satisfies the integrability condition in Oseledets' Theorem, the cornerstone of our proof. Second, by the Krylov-Bogolyubov theorem, the compactness of X^{\prime} ensures that a \Phi-invariant Borel probability measure on X^{\prime} exists. Indeed, we can go even further and guarantee the existence of an ergodic Borel probability measure for \Phi, since the set of invariant probability measures is convex in the space of measures on X^{\prime} and the ergodic measures are precisely the extreme points of this set. For an ergodic invariant probability measure, almost all points have the same Lyapunov exponents. The question of the uniqueness of such measures is well studied: given an ergodic measure \mu on X^{\prime}, there is at most one ergodic invariant measure on X^{\prime} absolutely continuous with respect to \mu. For reference texts on the topic of unique ergodicity see [Walt 00, Chapter 6] and [Laso 13, Chapter 4].

Remark 3.6

The theorem that we just formulated and its proof below correct a misstep in [Berr 23]. Indeed, in the proof we decompose the error accumulation sequence {\bf a} into the Oseledets subspaces and track its growth separately in each subspace. A problem arises when we need to recombine the errors across the Oseledets subspaces: the angles between these subspaces may change along a trajectory of the dynamical system, so that errors committed in the individual Oseledets subspaces at different points along the trajectory do not add together in the same way. We deal with this by employing another property of (Lyapunov) regular trajectories, which bounds the rate at which the angles between Oseledets subspaces change.

Proof of the theorem. Let \Lambda be the set of regular points of the dynamical system \Phi in X^{\prime}. As noted in Theorem 2.3, \Lambda is \Phi-invariant and has full measure for any invariant Borel probability measure. Take {\bf x}\in\Lambda. Since (\widetilde{\bf x}_{t}) is a trajectory of \Phi, we have the identity D\Phi(\widetilde{\bf x}_{t-1})D\Phi(\widetilde{\bf x}_{t-2})\cdots D\Phi(\widetilde{\bf x}_{s})=D\Phi^{t-s}(\widetilde{\bf x}_{s}). Using the initial condition {\bf a}_{0}=0 we can write

{\bf a}_{t}={\bf C}_{t}+\sum_{s=1}^{t-1}D\Phi(\widetilde{\bf x}_{t-1})D\Phi(\widetilde{\bf x}_{t-2})\cdots D\Phi(\widetilde{\bf x}_{s}){\bf C}_{s}={\bf C}_{t}+\sum_{s=1}^{t-1}D\Phi^{t-s}(\widetilde{\bf x}_{s}){\bf C}_{s}={\bf C}_{t}+\sum_{s=1}^{t-1}D\Phi^{t}(\widetilde{\bf x}_{0})\left(D\Phi^{s}(\widetilde{\bf x}_{0})\right)^{-1}{\bf C}_{s}. (3.8)

Since {\bf x} is regular, it has Lyapunov exponents \lambda_{1}({\bf x})>\cdots>\lambda_{l({\bf x})}({\bf x}) and there exists a splitting T_{\bf x}X=\bigoplus_{i=1}^{l({\bf x})}E_{i}({\bf x}) such that

(i)

l({\bf x})=l(\Phi^{t}({\bf x})) for t\in\mathbb{N},

(ii)

D\Phi^{t}({\bf x})E_{i}({\bf x})=E_{i}(\Phi^{t}({\bf x})) for i=1,\dots,l({\bf x}), t\in\mathbb{N}, and

(iii)

\lim_{t\to\infty}\frac{1}{t}\log\lVert D\Phi^{t}({\bf x}){\bf v}\rVert=\lambda_{i}({\bf x}) uniformly over \{{\bf v}\in E_{i}({\bf x})\colon\lVert{\bf v}\rVert=1\}.

Thus by (ii) above, for t\in\mathbb{N} we have the splitting T_{\widetilde{\bf x}_{t}}X=\bigoplus_{i=1}^{l({\bf x})}D\Phi^{t}E_{i}({\bf x})=\bigoplus_{i=1}^{l({\bf x})}E_{i}(\widetilde{\bf x}_{t}) and we can write {\bf C}_{t}=\sum_{i=1}^{l({\bf x})}{\bf C}_{t}^{(i)}, where {\bf C}_{t}^{(i)}\in E_{i}(\widetilde{\bf x}_{t}) for i=1,\dots,l({\bf x}). This allows us to define sequences {\bf a}^{(i)}=({\bf a}_{t}^{(i)}) in the ith Oseledets subspace by

{\bf a}_{t}^{(i)}={\bf C}_{t}^{(i)}+D\Phi(\widetilde{\bf x}_{t-1}){\bf a}_{t-1}^{(i)},\quad t\in{\mathbb{N}},\quad{\bf a}_{0}^{(i)}=0,

for i=1,\ldots,l({\bf x}), so that {\bf a}_{t}=\sum_{i=1}^{l({\bf x})}{\bf a}_{t}^{(i)}, where {\bf a}_{t}^{(i)}\in E_{i}(\widetilde{\bf x}_{t}) for i=1,\dots,l({\bf x}), t\in\mathbb{N}_{0}. Now fix i=1,\ldots,l({\bf x}). By (iii) above we have that for any \delta>0 there is a \tau\in\mathbb{N} such that for all t\geq\tau and for all {\bf v}\in E_{i}({\bf x}) satisfying \lVert{\bf v}\rVert=1,

\lambda_{i}({\bf x})-\delta<\frac{1}{t}\log\lVert D\Phi^{t}({\bf x}){\bf v}\rVert<\lambda_{i}({\bf x})+\delta,

and so

e^{t(\lambda_{i}({\bf x})-\delta)}<\lVert D\Phi^{t}({\bf x}){\bf v}\rVert<e^{t(\lambda_{i}({\bf x})+\delta)}.

Taking a constant C(\delta,{\bf x},i)\geq 1 large enough, we have for all t\geq 0 and for any {\bf v}\in E_{i}({\bf x})\setminus\{{\bf 0}\},

\frac{1}{C(\delta,{\bf x},i)}e^{t(\lambda_{i}({\bf x})-\delta)}\lVert{\bf v}\rVert<\lVert D\Phi^{t}({\bf x}){\bf v}\rVert<C(\delta,{\bf x},i)e^{t(\lambda_{i}({\bf x})+\delta)}\lVert{\bf v}\rVert. (3.9)

Thus, for t\geq 0,

\lVert\left.D\Phi^{t}({\bf x})\right|_{E_{i}({\bf x})}\rVert\leq C(\delta,{\bf x},i)e^{t(\lambda_{i}({\bf x})+\delta)}, (3.10)

and

\lVert\left.\left(D\Phi^{t}({\bf x})\right)^{-1}\right|_{E_{i}(\widetilde{\bf x}_{t})}\rVert\leq C(\delta,{\bf x},i)e^{-t(\lambda_{i}({\bf x})-\delta)}. (3.11)

Using the same expansion as in (3.8) for each of the sequences {\bf a}^{(i)}, we have

\lVert{\bf a}_{t}^{(i)}\rVert\leq\lVert{\bf C}_{t}^{(i)}\rVert+\sum_{s=1}^{t-1}\lVert\left.D\Phi^{t}({\bf x})\right|_{E_{i}({\bf x})}\rVert\lVert\left.\left(D\Phi^{s}({\bf x})\right)^{-1}\right|_{E_{i}(\widetilde{\bf x}_{s})}\rVert\lVert{\bf C}_{s}^{(i)}\rVert\leq\lVert{\bf C}_{t}^{(i)}\rVert+\sum_{s=1}^{t-1}C(\delta,{\bf x},i)^{2}e^{t(\lambda_{i}({\bf x})+\delta)}e^{-s(\lambda_{i}({\bf x})-\delta)}\lVert{\bf C}_{s}^{(i)}\rVert\leq\sum_{s=1}^{t}C(\delta,{\bf x},i)^{2}e^{t(\lambda_{i}({\bf x})+\delta)}e^{-s(\lambda_{i}({\bf x})-\delta)}\lVert{\bf C}_{s}^{(i)}\rVert.

Let C(\delta,{\bf x})=\max_{i=1,\dots,l({\bf x})}\{C(\delta,{\bf x},i)\}. Then we have

\|{\bf a}_{t}\|\leq\sum_{i=1}^{l({\bf x})}\|{\bf a}_{t}^{(i)}\|\leq\sum_{s=1}^{t}\sum_{i=1}^{l({\bf x})}C(\delta,{\bf x},i)^{2}e^{t(\lambda_{i}({\bf x})+\delta)}e^{-s(\lambda_{i}({\bf x})-\delta)}\lVert{\bf C}_{s}^{(i)}\rVert\leq C(\delta,{\bf x})^{2}e^{t(\lambda_{1}({\bf x})+\delta)}\sum_{s=1}^{t}e^{-s(\lambda_{1}({\bf x})-\delta)}\left(\sum_{i=1}^{l({\bf x})}\lVert{\bf C}_{s}^{(i)}\rVert\right). (3.12)

It remains therefore to bound the terms \sum_{i=1}^{l({\bf x})}\lVert{\bf C}_{s}^{(i)}\rVert by the terms \|{\bf C}_{s}\|. This is an important step which was overlooked in [Berr 23]. At this point we pause our proof in order to develop the necessary theory to deal with this factor. In any finite-dimensional space, for linearly independent directions {\bf v}_{1},\ldots,{\bf v}_{l} we can find a constant C>0 such that \sum_{i=1}^{l}|\alpha_{i}|\|{\bf v}_{i}\|\leq C\|\sum_{i=1}^{l}\alpha_{i}{\bf v}_{i}\|; see for instance [Krey 91, Lemma 2.4-1]. In our case, however, the angles between the Oseledets subspaces change along the trajectory, so we cannot hope to find a single C which fulfills this inequality at every point in time. To get around this we need Theorem 2.5, which tells us how volumes change along trajectories. We begin with a few preliminary observations that allow us to relate volumes and angles between spaces.

Definition 3.7

Let V,W be two subspaces of an inner product space (H,\langle\cdot,\cdot\rangle). We define the angle between V and W as

\angle(V,W)=\inf\left\{\arccos\left(\frac{\lvert\langle{\bf v},{\bf w}\rangle\rvert}{\lVert{\bf v}\rVert\lVert{\bf w}\rVert}\right)\colon{\bf v}\in V,{\bf w}\in W,{\bf v},{\bf w}\neq 0\right\}.
Remark 3.8

By definition \angle(V,W)\in[0,\pi/2]. Thus \cos\angle(V,W),\sin\angle(V,W)\in[0,1].

Remark 3.9

For any {\bf v}\in V,{\bf w}\in W, \lvert\cos\angle({\bf v},{\bf w})\rvert\leq\cos\angle(V,W).
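The angle of Definition 3.7 is computable in the finite-dimensional Euclidean setting. The sketch below, our own illustration with subspaces given by spanning columns, uses the fact that the supremum of \lvert\langle{\bf v},{\bf w}\rangle\rvert over unit vectors equals the largest singular value of the product of orthonormal bases, so the infimum of the angles is its arccosine (the first principal angle).

```python
import numpy as np

def subspace_angle(V, W):
    """Angle of Definition 3.7 between the column spans of V and W.

    For orthonormal bases Qv, Qw, the supremum of |<v, w>| over unit
    vectors equals the largest singular value of Qv^T Qw, so the infimum
    of the angles is arccos of that value (the first principal angle).
    """
    Qv, _ = np.linalg.qr(V)
    Qw, _ = np.linalg.qr(W)
    s_max = np.linalg.svd(Qv.T @ Qw, compute_uv=False).max()
    return np.arccos(np.clip(s_max, -1.0, 1.0))

e1, e2, e3 = np.eye(3)
# Two planes in R^3 sharing the x-axis meet at angle 0;
shared = subspace_angle(np.column_stack([e1, e2]), np.column_stack([e1, e3]))
# two orthogonal lines meet at angle pi/2.
orthogonal = subspace_angle(e1[:, None], e3[:, None])
```

Note that intersecting subspaces always have angle 0, which is why the lemma below orders the filtrations so that the compared spaces are complementary spans.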

We have the following lemma.

Lemma 3.10

Let (H,\langle\cdot,\cdot\rangle) be an inner product space. Suppose {\bf v}_{1},\dots,{\bf v}_{r}\in H are linearly independent vectors and define the filtrations \mathcal{V}\colon V_{i}:=\operatorname{span}\{{\bf v}_{i},\dots,{\bf v}_{r}\}, i=1,\dots,r, and \mathcal{W}\colon W_{i}:=\operatorname{span}\{{\bf v}_{1},\dots,{\bf v}_{i}\}, i=1,\dots,r, so that V_{r}\subsetneq\cdots\subsetneq V_{1} and W_{1}\subsetneq\cdots\subsetneq W_{r}. Further define the angles \theta_{i}:=\angle(V_{i+1},W_{i}), i=1,\dots,r-1. Then

\left\|\sum_{i=1}^{r}{\bf v}_{i}\right\|\geq\frac{1}{2^{r-1}}\left(\prod_{i=1}^{r-1}\sin\theta_{i}\right)\left(\sum_{i=1}^{r}\lVert{\bf v}_{i}\rVert\right).

Proof of Lemma. For r=2, we take two linearly independent vectors {\bf v}_{1},{\bf v}_{2} forming an angle \theta and denote by \theta_{1} the angle between the spaces V_{2} and W_{1} defined above. By Remark 3.9 we have

\lVert{\bf v}_{1}+{\bf v}_{2}\rVert^{2}=\lVert{\bf v}_{1}\rVert^{2}+\lVert{\bf v}_{2}\rVert^{2}+2\cos\theta\lVert{\bf v}_{1}\rVert\lVert{\bf v}_{2}\rVert\geq\lVert{\bf v}_{1}\rVert^{2}+\lVert{\bf v}_{2}\rVert^{2}-\lvert\cos\theta_{1}\rvert(\lVert{\bf v}_{1}\rVert^{2}+\lVert{\bf v}_{2}\rVert^{2})=(1-\lvert\cos\theta_{1}\rvert)(\lVert{\bf v}_{1}\rVert^{2}+\lVert{\bf v}_{2}\rVert^{2})\geq\frac{1-\sqrt{1-\sin^{2}\theta_{1}}}{2}(\lVert{\bf v}_{1}\rVert+\lVert{\bf v}_{2}\rVert)^{2}.

Now, for 0\leq a\leq 1 we have the inequality \frac{1-\sqrt{1-a}}{2}\geq\frac{a}{4}; in particular, \frac{1-\sqrt{1-\sin^{2}\theta_{1}}}{2}\geq\frac{\sin^{2}\theta_{1}}{4}. And so, by Remark 3.8,

\lVert{\bf v}_{1}+{\bf v}_{2}\rVert\geq\frac{\sin\theta_{1}}{2}(\lVert{\bf v}_{1}\rVert+\lVert{\bf v}_{2}\rVert). (3.13)

We proceed by induction. Suppose the result is true for any set of r linearly independent vectors and consider a linearly independent set \{{\bf v}_{1},\dots,{\bf v}_{r+1}\}. Noting that \left(\prod_{i=1}^{r-1}\sin\theta_{i}\right)/2^{r-1}\leq 1, we use (3.13) and the induction hypothesis to get

\left\|\sum_{i=1}^{r+1}{\bf v}_{i}\right\|\geq\frac{\sin\theta_{r}}{2}\left(\left\|\sum_{i=1}^{r}{\bf v}_{i}\right\|+\lVert{\bf v}_{r+1}\rVert\right)\geq\frac{\sin\theta_{r}}{2}\left(\frac{1}{2^{r-1}}\left(\prod_{i=1}^{r-1}\sin\theta_{i}\right)\left(\sum_{i=1}^{r}\lVert{\bf v}_{i}\rVert\right)+\lVert{\bf v}_{r+1}\rVert\right)\geq\frac{\sin\theta_{r}}{2}\frac{1}{2^{r-1}}\left(\prod_{i=1}^{r-1}\sin\theta_{i}\right)\left(\left(\sum_{i=1}^{r}\lVert{\bf v}_{i}\rVert\right)+\lVert{\bf v}_{r+1}\rVert\right)=\frac{1}{2^{r}}\left(\prod_{i=1}^{r}\sin\theta_{i}\right)\left(\sum_{i=1}^{r+1}\lVert{\bf v}_{i}\rVert\right),

as desired.  \blacksquare
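As a numerical sanity check of the lemma (ours, not part of the paper), the inequality can be tested on random vectors, computing each \theta_{i} as the first principal angle between the corresponding spans:

```python
import numpy as np

rng = np.random.default_rng(0)

def angle(V, W):
    """Angle of Definition 3.7 via the largest singular value of Qv^T Qw."""
    Qv, _ = np.linalg.qr(V)
    Qw, _ = np.linalg.qr(W)
    s = np.linalg.svd(Qv.T @ Qw, compute_uv=False).max()
    return np.arccos(np.clip(s, -1.0, 1.0))

def lemma_holds(v):
    """Check ||sum_i v_i|| >= 2^{1-r} (prod_i sin theta_i) sum_i ||v_i||."""
    r = v.shape[1]
    lhs = np.linalg.norm(v.sum(axis=1))
    # theta_j = angle(V_{j+1}, W_j) with V_{j+1} = span{v_{j+1},...,v_r}
    # and W_j = span{v_1,...,v_j} (1-based indices as in Lemma 3.10).
    sines = [np.sin(angle(v[:, j:], v[:, :j])) for j in range(1, r)]
    rhs = np.prod(sines) / 2 ** (r - 1) * np.linalg.norm(v, axis=0).sum()
    return lhs >= rhs - 1e-9

# Gaussian columns are almost surely linearly independent.
ok = all(lemma_holds(rng.normal(size=(6, 4))) for _ in range(100))
```

For generic random vectors the inequality holds with a large margin; it becomes tight only when the vectors are nearly contained in one another's spans, which is precisely the regime the sine factors control.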

We use Lemma 3.10 to bound \sum_{i=1}^{l({\bf x})}\lVert{\bf C}_{t}^{(i)}\rVert in terms of \|{\bf C}_{t}\|. Let \{{\bf v}_{1},\ldots,{\bf v}_{N}\} be any basis for T_{\bf x}X=\bigoplus_{i=1}^{l({\bf x})}E_{i}({\bf x}) such that for each i=1,\dots,l({\bf x}) we may pick a basis for E_{i}({\bf x}) from \{{\bf v}_{1},\dots,{\bf v}_{N}\}. This basis is thus normal with respect to the filtration \mathcal{V}^{+}({\bf x}) defined by (2.12). Let the filtrations \mathcal{V} and \mathcal{W} be defined as in Lemma 3.10, that is, V_{i}:=\operatorname{span}\{{\bf v}_{i},\dots,{\bf v}_{N}\} and W_{i}:=\operatorname{span}\{{\bf v}_{1},\dots,{\bf v}_{i}\}, for i=1,\dots,N. Furthermore, for each time step t\in\mathbb{N} define {\bf v}_{i}(t):=D\Phi^{t}({\bf x}){\bf v}_{i}, for i=1,\dots,N. Then we may pick a basis for each E_{i}(\widetilde{\bf x}_{t}) from \{{\bf v}_{1}(t),\dots,{\bf v}_{N}(t)\}. Analogously to Lemma 3.10, we define the filtrations \mathcal{V}(t)\colon V_{i}(t):=\operatorname{span}\{{\bf v}_{i}(t),\dots,{\bf v}_{N}(t)\}, i=1,\dots,N, and \mathcal{W}(t)\colon W_{i}(t):=\operatorname{span}\{{\bf v}_{1}(t),\dots,{\bf v}_{i}(t)\}, i=1,\dots,N, with angles \theta_{i}(t):=\angle\left(V_{i+1}(t),W_{i}(t)\right), i=1,\dots,N-1.

Now, by Theorem 2.5, \lim_{t\to\infty}\frac{1}{t}\log\operatorname{vol}(D\Phi^{t}({\bf x}){\bf v}_{1},\ldots,D\Phi^{t}({\bf x}){\bf v}_{N})=\sum_{i=1}^{N}\lambda^{+}({\bf x},{\bf v}_{i}). Let S:=\sum_{i=1}^{N}\lambda^{+}({\bf x},{\bf v}_{i}). Then there exists a constant C^{\prime}=C^{\prime}(\delta,{\bf x})\geq 1 such that

\frac{1}{C^{\prime}}e^{t(S-\delta)}<\operatorname{vol}({\bf v}_{1}(t),\dots,{\bf v}_{N}(t))<C^{\prime}e^{t(S+\delta)}.

Similarly, by the definition of the forward Lyapunov exponents, for each i=1,\dots,N we have a constant C_{i}=C_{i}(\delta/N,{\bf x})\geq 1 such that for all t\in{\mathbb{N}},

\frac{1}{C_{i}}e^{t(\lambda^{+}({\bf x},{\bf v}_{i})-\delta/N)}<\|{\bf v}_{i}(t)\|<C_{i}e^{t(\lambda^{+}({\bf x},{\bf v}_{i})+\delta/N)}.

Thus

\left(\prod_{i=1}^{N-1}\sin\theta_{i}(t)\right)\left(\prod_{i=1}^{N}C_{i}\right)e^{t(S+\delta)}>\left(\prod_{i=1}^{N-1}\sin\theta_{i}(t)\right)\left(\prod_{i=1}^{N}\lVert{\bf v}_{i}(t)\rVert\right)\geq\operatorname{vol}({\bf v}_{1}(t),\dots,{\bf v}_{N}(t))>\frac{1}{C^{\prime}}e^{t(S-\delta)},

and so

\prod_{i=1}^{N-1}\sin\theta_{i}(t)>\frac{1}{C^{\prime}\prod_{i=1}^{N}C_{i}}e^{-2t\delta}.

Now, we may write {\bf C}_{t}=\sum_{i=1}^{N}\alpha_{i}(t){\bf v}_{i}(t) for some constants \alpha_{1}(t),\dots,\alpha_{N}(t)\in\mathbb{R}. By construction, for each E_{i}(\widetilde{\bf x}_{t}) we may pick a basis from \{{\bf v}_{1}(t),\dots,{\bf v}_{N}(t)\}. Thus, writing \sum_{i=1}^{N}\alpha_{i}(t){\bf v}_{i}(t)=\sum_{i=1}^{l({\bf x})}{\bf C}_{t}^{(i)}, we can group the terms in this sum into their respective Oseledets subspaces. In particular, this implies \sum_{i=1}^{l({\bf x})}\left\|{\bf C}_{t}^{(i)}\right\|\leq\sum_{i=1}^{N}\|\alpha_{i}(t){\bf v}_{i}(t)\|. By Lemma 3.10, we have

\left\|{\bf C}_{t}\right\|=\left\|\sum_{i=1}^{N}\alpha_{i}(t){\bf v}_{i}(t)\right\|\geq\frac{1}{2^{N-1}}\prod_{i=1}^{N-1}\sin\theta_{i}(t)\left(\sum_{i=1}^{N}\lVert\alpha_{i}(t){\bf v}_{i}(t)\rVert\right)>\frac{1}{2^{N-1}C^{\prime}\prod_{i=1}^{N}C_{i}}e^{-2t\delta}\left(\sum_{i=1}^{l({\bf x})}\lVert{\bf C}_{t}^{(i)}\rVert\right).

We obtain the desired bound by returning to (3.1):

𝐚tC(δ,𝐱)2et(λ1(𝐱)+δ)s=1tes(λ1(𝐱)δ)(i=1l(𝐱)𝐂s(i))2N1C(i=1NCi)C(δ,𝐱)2et(λ1(𝐱)+δ)s=1tes(λ1(𝐱)δ)e2sδ𝐂s2N1C(i=1NCi)C(δ,𝐱)2Ket(λ1(𝐱)+δ)s=1tes(λ1(𝐱)3δ).\|{\bf a}_{t}\|\leq C(\delta,{\bf x})^{2}e^{t(\lambda_{1}({\bf x})+\delta)}\sum_{s=1}^{t}e^{-s(\lambda_{1}({\bf x})-\delta)}\left(\sum_{i=1}^{l({\bf x})}\lVert{\bf C}_{s}^{(i)}\rVert\right)\\ \leq 2^{N-1}C^{\prime}\left(\prod_{i=1}^{N}C_{i}\right)C(\delta,{\bf x})^{2}e^{t(\lambda_{1}({\bf x})+\delta)}\sum_{s=1}^{t}e^{-s(\lambda_{1}({\bf x})-\delta)}e^{2s\delta}\lVert{\bf C}_{s}\rVert\\ \leq 2^{N-1}C^{\prime}\left(\prod_{i=1}^{N}C_{i}\right)C(\delta,{\bf x})^{2}Ke^{t(\lambda_{1}({\bf x})+\delta)}\sum_{s=1}^{t}e^{-s(\lambda_{1}({\bf x})-3\delta)}.

We let

R\displaystyle R^{\prime} =R(δ,𝐱):=2N1C(δ,𝐱)(i=1NCi(δ/N,𝐱))C(δ,𝐱)2,and\displaystyle=R^{\prime}(\delta,{\bf x}):=2^{N-1}C^{\prime}(\delta,{\bf x})\left(\prod_{i=1}^{N}C_{i}(\delta/N,{\bf x})\right)C(\delta,{\bf x})^{2},\quad\text{and}
R′′\displaystyle R^{\prime\prime} =R′′(δ,𝐱):=e3δλ1(𝐱)|1eλ1(𝐱)+3δ|.\displaystyle=R^{\prime\prime}(\delta,{\bf x}):=\frac{e^{3\delta-\lambda_{1}({\bf x})}}{|1-e^{-\lambda_{1}({\bf x})+3\delta}|}.

If λ1(𝐱)3δ=0\lambda_{1}({\bf x})-3\delta=0, then s=1tes(λ1(𝐱)3δ)=t,\sum_{s=1}^{t}e^{-s(\lambda_{1}({\bf x})-3\delta)}=t, and et(λ1(𝐱)+δ)=e4tδ,e^{t(\lambda_{1}({\bf x})+\delta)}=e^{4t\delta}, so 𝐚tRKte4tδ.\|{\bf a}_{t}\|\leq R^{\prime}Kte^{4t\delta}. If λ1(𝐱)3δ<0,\lambda_{1}({\bf x})-3\delta<0, then s=1tes(λ1(𝐱)3δ)e(t+1)(λ1(𝐱)3δ)/(eλ1(𝐱)+3δ1),\sum_{s=1}^{t}e^{-s(\lambda_{1}({\bf x})-3\delta)}\leq e^{-(t+1)(\lambda_{1}({\bf x})-3\delta)}/(e^{-\lambda_{1}({\bf x})+3\delta}-1), so 𝐚tRR′′Ke4tδ\|{\bf a}_{t}\|\leq R^{\prime}R^{\prime\prime}Ke^{4t\delta}. The case we are most concerned with is when λ1(𝐱)3δ>0.\lambda_{1}({\bf x})-3\delta>0. In this case s=1tes(λ1(𝐱)3δ)eλ1(𝐱)+3δ/(1eλ1(𝐱)+3δ),\sum_{s=1}^{t}e^{-s(\lambda_{1}({\bf x})-3\delta)}\leq e^{-\lambda_{1}({\bf x})+3\delta}/(1-e^{-\lambda_{1}({\bf x})+3\delta}), so 𝐚tRR′′Ket(λ1(𝐱)+δ).\|{\bf a}_{t}\|\leq R^{\prime}R^{\prime\prime}Ke^{t(\lambda_{1}({\bf x})+\delta)}. This proves the bounds given in (3.7). Remark 3.2 then completes the proof.  \blacksquare
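The three-case treatment of the geometric sum above is elementary and can be confirmed numerically. The following minimal sketch is illustrative only; the variable a plays the role of λ1(𝐱)−3δ:

```python
import math

def partial_sum(a, t):
    """S(t) = sum_{s=1}^{t} exp(-s*a), the geometric sum from the proof."""
    return sum(math.exp(-s * a) for s in range(1, t + 1))

# Case a > 0: the sum is bounded uniformly in t by exp(-a)/(1 - exp(-a)).
a = 0.4
assert all(partial_sum(a, t) <= math.exp(-a) / (1 - math.exp(-a))
           for t in range(1, 51))

# Case a = 0: every summand is 1, so the sum equals t exactly.
assert partial_sum(0.0, 7) == 7

# Case a < 0: the sum is bounded by exp(-(t+1)*a)/(exp(-a) - 1).
a = -0.3
assert all(partial_sum(a, t) <= math.exp(-(t + 1) * a) / (math.exp(-a) - 1)
           for t in range(1, 51))

print("all three geometric-sum bounds hold")
```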

The main result.

Theorem 3.1 deals only with a linearized version of the recursion (3.6). We now turn to bounding the actual state error Δ𝐱t\Delta{\bf x}_{t}. In the bounds above, we simplified the error accumulation by performing a Taylor approximation and ignoring the second-order error incurred. When we factor in the Taylor approximation errors, we can maintain the exponential bound, but only up to a certain finite time threshold. This is made explicit in the following theorem.

Theorem 3.11

Let H:(d)dH:({\mathbb{R}}^{d})^{\mathbb{Z}_{-}}\longrightarrow{\mathbb{R}}^{d} be a causal chain with infinite memory and let HhF:(d)dH^{F}_{h}:({\mathbb{R}}^{d})^{\mathbb{Z}_{-}}\longrightarrow{\mathbb{R}}^{d} be the functional associated to a state-space system (F,h)(F,h) that has the echo state property, where F:X×dXF\colon X\times\mathbb{R}^{d}\longrightarrow X and h:Xdh\colon X\longrightarrow\mathbb{R}^{d}, for a convex qq-dimensional C2C^{2} submanifold XX of L\mathbb{R}^{L}. Let Φ:XX\Phi\colon X\longrightarrow X be the associated dynamical system, that is, Φ(𝐱)=F(𝐱,h(𝐱))\Phi({\bf x})=F({\bf x},h({\bf x})) and suppose ΦDiff2(X)\Phi\in\operatorname{Diff}^{2}(X). Further, let XX^{\prime} be a compact Φ\Phi-invariant subset of XX and define X′′:={HF(𝐲0¯)𝐲0¯ is a realization of the CCIM H}X^{\prime\prime}:=\{H^{F}(\underline{{\bf y}_{0}})\mid\underline{{\bf y}_{0}}\text{ is a realization of the CCIM }H\}, the set of reachable states of the CCIM. We assume that

(i)

there exists an open and convex subset UU of XX such that XX′′UX^{\prime}\cup X^{\prime\prime}\subset U and the second order derivatives of Φ\Phi are bounded on UU, and

(ii)

X′′Φ(X′′)X^{\prime\prime}\cup\Phi(X^{\prime\prime}) is contained in a convex subset of XX.

Suppose that FF is Lipschitz on its second entry with constant LzL_{z}, uniform with respect to the first entry, that is,

F(𝐱,𝐳1)F(𝐱,𝐳2)\displaystyle\left\|F(\mathbf{x},{\bf z}_{1})-F(\mathbf{x},{\bf z}_{2})\right\| Lz𝐳1𝐳2,for all 𝐱X and all 𝐳1,𝐳2d.\displaystyle\leq L_{z}\left\|\mathbf{z}_{1}-\mathbf{z}_{2}\right\|,\quad\mbox{for all $\mathbf{x}\in X$ and all ${\bf z}_{1},{\bf z}_{2}\in\mathbb{R}^{d}$.}

Assume also that the readout h:Ldh:\mathbb{R}^{L}\longrightarrow{\mathbb{R}}^{d} is Lipschitz with constant LhL_{h}. Suppose that HhFH^{F}_{h} is a uniform ε\varepsilon-approximant of HH and hence it satisfies (3.1). Let 𝐲=(𝐲t)(d){\bf y}=({\bf y}_{t})\in({\mathbb{R}}^{d})^{\mathbb{Z}} be a realization of the chain HH, that is, 𝐲t=H(𝐲t1¯){\bf y}_{t}=H\left(\underline{{\bf y}_{t-1}}\right) for all tt\in\mathbb{Z}, and let 𝐲~t=HhF(𝐲0¯𝐲~t1¯)\widetilde{{\bf y}}_{t}=H^{F}_{h}(\underline{{\bf y}_{0}}\,\underline{\widetilde{{\bf y}}_{t-1}}), tt\in\mathbb{N}, be the forecast of the positive terms of 𝐲{\bf y} using (F,h)(F,h) and the iterative scheme introduced in (3.2) and (3.3). Let 𝐱~t=HF(𝐲0¯𝐲~t1¯)\widetilde{\bf x}_{t}=H^{F}(\underline{{\bf y}_{0}}\,\underline{\widetilde{{\bf y}}_{t-1}}) be the corresponding sequence in the state space produced by these recursions and let 𝐱t=HF(𝐲t1¯){\bf x}_{t}=H^{F}(\underline{{\bf y}_{t-1}}), a sequence in the state space parallel to 𝐲{\bf y}. Then for any Φ\Phi-invariant Borel probability measure on XX^{\prime}, there is a set ΛX\Lambda\subset X^{\prime} of full measure such that, if 𝐱0Λ{\bf x}_{0}\in\Lambda, then for sufficiently small δ>0\delta>0, there exists a constant R=R(δ,𝐱0)R=R(\delta,{\bf x}_{0}) and a time horizon T=T(δ,𝐱0)T=T(\delta,{\bf x}_{0})\in\mathbb{N} such that

𝐱t𝐱~t\displaystyle\|{\bf x}_{t}-\widetilde{\bf x}_{t}\| 2εLzRet(λ1pos+δ)\displaystyle\leq 2\varepsilon L_{z}Re^{t(\lambda_{1}^{\text{pos}}+\delta)} (3.14)
𝐲t𝐲~t\displaystyle\|{\bf y}_{t}-\widetilde{\bf y}_{t}\| ε(1+2LhLzRet(λ1pos+δ)),\displaystyle\leq\varepsilon\left(1+2L_{h}L_{z}Re^{t(\lambda_{1}^{\text{pos}}+\delta)}\right), (3.15)

for t=1,,T,t=1,\dots,T, where λ1=λ1(𝐱0)\lambda_{1}=\lambda_{1}({\bf x}_{0}) is the largest Lyapunov exponent of 𝐱0{\bf x}_{0} under the dynamical system (X,Φ),(X,\Phi), and as before we take λ1pos=max{λ1,0}.\lambda_{1}^{\text{pos}}=\max\{\lambda_{1},0\}.

Remark 3.12

The assumptions (i) and (ii) above are omitted in [Berr 23]. However, they are necessary for two reasons: first of all, in the proof below, for a point 𝐱~tX\widetilde{\bf x}_{t}\in X^{\prime} and a left infinite sequence 𝐲t¯\underline{{\bf y}_{t}} from the CCIM, we approximate a difference Φ(HF(𝐲t¯))Φ(𝐱~t)\Phi(H^{F}(\underline{{\bf y}_{t}}))-\Phi(\widetilde{\bf x}_{t}) using Taylor’s theorem, which requires both endpoints to be in an open and convex subset of XX, whence the requirement that XX′′UXX^{\prime}\cup X^{\prime\prime}\subset U\subset X for a convex open set UU. Requiring that the second order derivatives of Φ\Phi are bounded on UU allows us to bound the Taylor approximation from above. Secondly, in the proof we encounter the terms 𝐂t=F(𝐱t1,𝐲t1)F(𝐱t1,h(𝐱t1))=HF(𝐲t¯)Φ(HF(𝐲t1¯)){\bf C}_{t}=F({\bf x}_{t-1},{\bf y}_{t-1})-F({\bf x}_{t-1},h({\bf x}_{t-1}))=H^{F}(\underline{{\bf y}_{t}})-\Phi(H^{F}(\underline{{\bf y}_{t-1}})) for a left infinite sequence 𝐲t¯\underline{{\bf y}_{t}} from the CCIM. We require that 𝐂t{\bf C}_{t} be in the tangent space of XX, which is ensured by the assumption that X′′Φ(X′′)X^{\prime\prime}\cup\Phi(X^{\prime\prime}) is contained in a convex subset of XX.

Remark 3.13

We comment on the embedding requirement in the case when the causal chain consists of observations from a dynamical system. In the paper [Berr 23], the map FF is assumed to embed the underlying dynamical system in the state-space. This is necessary in order to guarantee the existence of a readout as in Proposition 2.1 that forms a conjugate dynamical system in the state-space. The readout is ultimately the map that is learnt and they work with the corresponding learnt dynamical system as an approximation of the dynamical system conjugate to the original. We circumvent the embedding condition by replacing it with the weaker requirement that the functional HhFH^{F}_{h} associated to the state-space system (F,h)(F,h) uniformly ε\varepsilon-approximates the functional of the causal chain. This is an important improvement as the question of determining when a given state-space system will determine a generalised synchronisation that is an embedding when driven by observations from a dynamical system remains open except for a certain class of linear state-space systems [Grig 23]. Furthermore, it enables us to extend the results to a much larger class of time series, namely that of CCIMs. CCIMs appear as the deterministic parts of many stochastic processes after using the Wold or Doob decompositions and also as the solutions of infinite dimensional or functional differential equations. While our bounds show an error scaling controlled by the top Lyapunov exponent of the learnt dynamical system, it is interesting to note that many of these applications may not possess an underlying system with Lyapunov exponents at all.

Remark 3.14

One of the contributions of this paper is that we deal explicitly and rigorously with the error from the Taylor approximation, in contrast to former efforts, where this error was absorbed by giving the error bounds in terms of a factor O(en(λ1pos+δ)).O(e^{n(\lambda_{1}^{\text{pos}}+\delta)}). We do this by controlling the error growth within a factor of the exponential term up to a finite time horizon. The expression we find for the forecasting horizon within which we can control the error is

T=1λ1pos+5δlog(eλ1pos+5δ14εLzLΦKRe8δ)+1,T=\left\lfloor\frac{1}{\lambda_{1}^{\text{pos}}+5\delta}\log\left(\frac{e^{\lambda_{1}^{\text{pos}}+5\delta}-1}{4\varepsilon L_{z}L_{\Phi}KRe^{8\delta}}\right)\right\rfloor+1,

where LΦL_{\Phi} is a constant depending on the second order tangent map of Φ\Phi and K=K(δ,𝐱0)K=K(\delta,{\bf x}_{0}) is a constant. When we expand K(δ,𝐱0)K(\delta,{\bf x}_{0}) and R(δ,𝐱0)R(\delta,{\bf x}_{0}) in terms of the other constants introduced in the proofs we obtain

T=1λ1pos+5δ(log[(eλ1pos+5δ1)(eλ1pos3δ1)e8δ]log[4εLzLΦ(2N1C(i=1NCi)C2)2])+1.T=\left\lfloor\frac{1}{\lambda_{1}^{\text{pos}}+5\delta}\left(\log\left[\frac{(e^{\lambda_{1}^{\text{pos}}+5\delta}-1)(e^{\lambda_{1}^{\text{pos}}-3\delta}-1)}{e^{8\delta}}\right]-\log\left[{4\varepsilon L_{z}L_{\Phi}\left(2^{N-1}C^{\prime}\left(\prod_{i=1}^{N}C_{i}\right)C^{2}\right)^{2}}\right]\right)\right\rfloor+1.

It is difficult to make qualitative deductions on how the different variables influence the value of TT. To begin with, the constants C,CC,C^{\prime} and Ci,i=1,,NC_{i},i=1,\dots,N all depend on δ\delta, so while decreasing δ\delta may serve to increase the first logarithmic term, we do not know how it will influence the second term. Indeed, we expect these constants to grow as δ\delta decreases, so the net influence on TT is unclear. A second observation is that all the constants have an implicit dependence on ε\varepsilon, since taking a smaller ε\varepsilon restricts the set of state-space systems from which we can select (F,h)(F,h). Therefore, although at first glance decreasing ε\varepsilon may seem to lead to an increase in the predictable time horizon, this may not necessarily be true in practice.
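To make the horizon formula concrete, it can be evaluated numerically. In the sketch below the constants R, K, Lz, and LΦ are hypothetical placeholders set to one; the only point being illustrated is that, with everything else fixed, T grows logarithmically as ε decreases:

```python
import math

def horizon(eps, lam1_pos, delta, Lz=1.0, LPhi=1.0, K=1.0, R=1.0):
    """Forecast horizon T from the formula above.

    All the constants (Lz, LPhi, K, R) are hypothetical placeholders;
    in practice they come from the system (F, h) and the proofs.
    """
    rate = lam1_pos + 5 * delta
    arg = (math.exp(rate) - 1) / (4 * eps * Lz * LPhi * K * R * math.exp(8 * delta))
    return math.floor(math.log(arg) / rate) + 1

# Shrinking the approximation error eps by a fixed factor only adds a
# constant number of predictable steps:
for eps in (1e-2, 1e-4, 1e-6):
    print(eps, horizon(eps, lam1_pos=0.9, delta=0.01))
```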

Remark 3.15

Notwithstanding what has been said above, in the error bounds (3.14), (3.15), we do see an interplay between overfitting and approximation error. The top Lyapunov exponent is related to the regularity of the dynamical system generated by the state-space system, while the approximation error ε\varepsilon determines the set of state-space systems we can choose from. Thus decreasing the approximation error may lead to a highly irregular dynamical system and hence a large leading Lyapunov exponent. See [Berr 23, Remark 12] for further discussion on this.

Remark 3.16

In this setup, one can see three distinct contributions to the error bound in (3.15): the approximation error ε\varepsilon of HH by HhFH^{F}_{h}; the Lipschitz constants LhL_{h} and LzL_{z}, which are related to the structure of the RNN; and the exponential factor et(λ1pos+δ)e^{t(\lambda_{1}^{\text{pos}}+\delta)}, which is exclusively related to the dynamical features of the map ΦDiff1(X)\Phi\in\operatorname{Diff}^{1}(X) that (F,h)(F,h) determines. Much work has been done on the ability of state-space systems to reproduce the Lyapunov exponents of the dynamical system being learnt, and two factors come into play here. First, while it is known that, under suitable conditions on the GS of a state-space system that embeds the dynamical system in a higher-dimensional space, the Lyapunov spectrum of the original system is contained in that of the reconstructed system, a number of additional Lyapunov exponents arise, commonly referred to as spurious Lyapunov exponents, some of which may exceed the top Lyapunov exponent of the original dynamical system [Dech 96, Dech 00]. Second, there is the question of the continuity of Lyapunov exponents under perturbations of the generating dynamical system, which is discussed in detail in the survey [Vian 20]. Of particular interest, then, is the stability gap, namely the difference between the top Lyapunov exponent of the reconstructed system and that of the original system. An upper bound for the stability gap is given in [Berr 23, Theorem 3 (iii)]. In our case, however, we are concerned only with the Lyapunov exponents of the dynamical system associated to (F,h)(F,h). Indeed, it is more useful for us to have bounds for the error in terms of the Lyapunov exponents of the learnt dynamical system, which can be calculated, than in terms of those of the dynamical system we are learning, which are unknown to us.
Nevertheless, work on the stability gap is relevant: as discussed in [Berr 23, Theorem 3], the stability gap is always nonnegative, so when learning the observations from a dynamical system, we cannot hope to obtain a reconstructed dynamical system whose top Lyapunov exponent is lower than that of the original. We note that, in general, CCIMs need not have Lyapunov exponents.

Remark 3.17

Let LxL_{x} be a Lipschitz constant for the first entry of FF. It is easy to show that if FF and hh have bounded differentials and λ1\lambda_{1} is the top Lyapunov exponent of the dynamical system ΦDiff1(X)\Phi\in\operatorname{Diff}^{1}(X), then the Lipschitz constants Lx,LhL_{x},L_{h}, and LzL_{z} will satisfy

eλ1Lx+LhLz,e^{\lambda_{1}}\leq L_{x}+L_{h}L_{z}, (3.16)

and we can bound the prediction error as

Δ𝐲tε(1+LhLzj=0t2(Lx+LhLz)j),for all t.\left\|\Delta{{\bf y}}_{t}\right\|\leq\varepsilon\left(1+L_{h}L_{z}\sum_{j=0}^{t-2}\left(L_{x}+L_{h}L_{z}\right)^{j}\right),\quad\mbox{for all $t\in\mathbb{N}$.} (3.17)

Now, if the Lipschitz constants can be chosen so that Lx+LhLz<1L_{x}+L_{h}L_{z}<1, then the forecasting error admits a horizon-uniform bound given by the inequality:

Δ𝐲tε(1Lx1(Lx+LhLz)),for all t.\left\|\Delta{{\bf y}}_{t}\right\|\leq\varepsilon\left(\frac{1-L_{x}}{1-\left(L_{x}+L_{h}L_{z}\right)}\right),\quad\mbox{for all $t\in\mathbb{N}$.} (3.18)

Thus, in the case when λ1<0\lambda_{1}<0, using Lipschitz constants may give a better bound than that established using the theory of Lyapunov exponents in (3.14) and (3.15), which grows exponentially. In the case of a positive maximum Lyapunov exponent, the bound (3.17) grows exponentially with factor Lx+LhLz1L_{x}+L_{h}L_{z}\geq 1. Calculations show that in practice this factor is typically much larger than the factor eλ1+δe^{\lambda_{1}+\delta} using the Lyapunov exponent. Thus ergodic theory provides us with a bound which far outperforms an approach using only tools from analysis.
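The relation between the partial-sum bound (3.17) and its horizon-uniform form (3.18) in the contractive case can be sanity-checked numerically. The Lipschitz constants below are hypothetical values, chosen only so that Lx+LhLz<1:

```python
# Hypothetical Lipschitz constants with L_x + L_h * L_z < 1.
Lx, Lh, Lz, eps = 0.4, 0.5, 0.6, 1e-3

r = Lx + Lh * Lz                      # contraction factor (0.7 here)
uniform = eps * (1 - Lx) / (1 - r)    # horizon-uniform bound (3.18)

for t in range(2, 60):
    # partial-sum bound (3.17) at horizon t
    partial = eps * (1 + Lh * Lz * sum(r ** j for j in range(t - 1)))
    assert partial <= uniform         # (3.18) dominates at every horizon

print(uniform)
```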

Remark 3.18

The Lipschitz constants that appear in the error bounds (3.14) and (3.15) are readily computable for most commonly used families of recurrent neural networks. Consider the echo state network (ESN) [Jaeg 04] (F,h)(F,h), which is defined by a linear readout map h(𝐱)=W𝐱h(\mathbf{x})=W\mathbf{x}, with W𝕄d,LW\in\mathbb{M}_{d,L}, and a state map given by

F(𝐱,𝐳)=𝝈(A𝐱+C𝐳+𝜻),withA𝕄L,L,C𝕄L,d,𝜻L,F(\mathbf{x},{\bf z})=\boldsymbol{\sigma}(A\mathbf{x}+C{\bf z}+\boldsymbol{\zeta}),\quad\mbox{with}\quad A\in\mathbb{M}_{L,L},C\in\mathbb{M}_{L,d},\boldsymbol{\zeta}\in\mathbb{R}^{L}, (3.19)

and where 𝝈:L[1,1]L\boldsymbol{\sigma}:\mathbb{R}^{L}\longrightarrow[-1,1]^{L} is the map obtained by componentwise application of a squashing function σ:[1,1]\sigma:\mathbb{R}\longrightarrow[-1,1] that is non-decreasing and satisfies limxσ(x)=1\lim_{x\rightarrow-\infty}\sigma(x)=-1 and limxσ(x)=1\lim_{x\rightarrow\infty}\sigma(x)=1. Moreover, we assume that Lσ:=supx{|σ(x)|}<+L_{\sigma}:=\sup_{x\in\mathbb{R}}\{|\sigma^{\prime}(x)|\}<+\infty. Under this hypothesis, we note that |DxF(𝐱,𝐳)|Lσ|A|<+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|D_{x}F(\mathbf{x},{\bf z})\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq L_{\sigma}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<+\infty and that |DzF(𝐱,𝐳)|Lσ|C|<+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|D_{z}F(\mathbf{x},{\bf z})\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq L_{\sigma}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<+\infty. Consequently, by the mean value theorem we can choose in this case

Lx=Lσ|A|,Lz=Lσ|C|,andLh=|W|.L_{x}=L_{\sigma}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},\ L_{z}=L_{\sigma}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|C\right|\kern-1.07639pt\right|\kern-1.07639pt\right|},\ \mbox{and}\ L_{h}={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|W\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}.

We recall that for the ESN family, Lσ|A|<1L_{\sigma}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}<1 is a sufficient condition for the echo state and the fading memory properties to hold when the inputs are in (d)\ell^{\infty}(\mathbb{R}^{d}) (see [Grig 19]). Standard algorithms exist for computing the Lyapunov exponents of the dynamical system associated with the ESN.
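For a concrete ESN of the form (3.19) with σ = tanh (so Lσ = 1), both the Lipschitz constants and a numerical estimate of the top Lyapunov exponent of the autonomous system Φ(𝐱) = F(𝐱, h(𝐱)) are easy to obtain. The weights below are random illustrative choices, not a trained network; the estimate uses the standard norm-growth method along a trajectory, and the final comparison illustrates the inequality e^{λ1} ≤ Lx + LhLz from Remark 3.17:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 50, 3

# Random ESN weights; A is scaled so that L_sigma * |||A||| = 0.9 < 1,
# which is sufficient for the echo state property when sigma = tanh.
A = rng.normal(size=(L, L))
A *= 0.9 / np.linalg.norm(A, 2)
C = rng.normal(size=(L, d))
W = rng.normal(size=(d, L)) * 0.1
zeta = rng.normal(size=L) * 0.1

Lx = np.linalg.norm(A, 2)   # L_sigma * |||A|||  (L_sigma = 1 for tanh)
Lz = np.linalg.norm(C, 2)   # L_sigma * |||C|||
Lh = np.linalg.norm(W, 2)   # |||W|||

def Phi(x):
    # Autonomous dynamical system Phi(x) = F(x, h(x)) with h(x) = W x
    return np.tanh(A @ x + C @ (W @ x) + zeta)

def DPhi(x):
    # Jacobian of Phi at x; tanh'(u) = 1 - tanh(u)^2
    u = A @ x + C @ (W @ x) + zeta
    return np.diag(1.0 - np.tanh(u) ** 2) @ (A + C @ W)

# Estimate the top Lyapunov exponent by averaging the log-growth of a
# tangent vector pushed along a long trajectory.
x = rng.normal(size=L)
for _ in range(200):                 # discard a transient
    x = Phi(x)
v = rng.normal(size=L)
v /= np.linalg.norm(v)
lam1, T = 0.0, 2000
for _ in range(T):
    v = DPhi(x) @ v
    lam1 += np.log(np.linalg.norm(v))
    v /= np.linalg.norm(v)
    x = Phi(x)
lam1 /= T

print(f"lambda_1 estimate: {lam1:.3f}, log(Lx + Lh*Lz): {np.log(Lx + Lh * Lz):.3f}")
```

As Remark 3.17 predicts, the estimated λ1 never exceeds log(Lx + LhLz), and in examples of this kind it is typically far below it.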

Remark 3.19

The approximation bound ε>0\varepsilon>0 in (3.14) and (3.15), which has been chosen so that HHhF<ε\left\|H-H^{F}_{h}\right\|_{\infty}<\varepsilon, can in practice be related, via Barron-type bounds, to the architecture parameters of the RNN families used in the forecasting exercise. Relations of this type can be found in [Gono 23a, Gono 23b].

Proof of the theorem. As in the proof of Theorem 3.1, we take Λ\Lambda to be the set of (Lyapunov-Perron) regular points in XX^{\prime} of the dynamical system Φ\Phi. Let 𝐱0Λ.{\bf x}_{0}\in\Lambda. As noted before, according to the iterations (3.2) and (3.3), the sequence (𝐱~t)(\widetilde{\bf x}_{t}) is a trajectory of the dynamical system Φ\Phi, that is, 𝐱~t=Φ(𝐱~t1)\widetilde{\bf x}_{t}=\Phi(\widetilde{\bf x}_{t-1}) for tt\in\mathbb{N} with initial condition 𝐱~0=𝐱0.\widetilde{\bf x}_{0}={\bf x}_{0}. Defining 𝐂t:=F(𝐱t1,𝐲t1)F(𝐱t1,h(𝐱t1)){\bf C}_{t}:=F({\bf x}_{t-1},{\bf y}_{t-1})-F({\bf x}_{t-1},h({\bf x}_{t-1})) for t,t\in\mathbb{N}, we have

Δ𝐱t\displaystyle\Delta{\bf x}_{t} =F(𝐱t1,𝐲t1)F(𝐱t1,h(𝐱t1))+F(𝐱t1,h(𝐱t1))F(𝐱~t1,h(𝐱~t1))\displaystyle=F({\bf x}_{t-1},{\bf y}_{t-1})-F({\bf x}_{t-1},h({\bf x}_{t-1}))+F({\bf x}_{t-1},h({\bf x}_{t-1}))-F(\widetilde{\bf x}_{t-1},h(\widetilde{\bf x}_{t-1}))
=𝐂t+Φ(𝐱t1)Φ(𝐱~t1)=𝐂t+DΦ(𝐱~t1)Δ𝐱t1+𝐎t,\displaystyle={\bf C}_{t}+\Phi({\bf x}_{t-1})-\Phi(\widetilde{\bf x}_{t-1})={\bf C}_{t}+D\Phi(\widetilde{\bf x}_{t-1})\Delta{\bf x}_{t-1}+{\bf O}_{t}, (3.20)

where 𝐎t{\bf O}_{t} is the error resulting from the Taylor approximation of Φ(𝐱t1)Φ(𝐱~t1)\Phi({\bf x}_{t-1})-\Phi(\widetilde{\bf x}_{t-1}). Using the fact that Δ𝐱0=0,{\Delta{\bf x}_{0}}=0, we can write

Δ𝐱t=𝐂t+s=1t1DΦ(𝐱~t1)DΦ(𝐱~s)𝐂s+(𝐎t+s=2t1DΦ(𝐱~t1)DΦ(𝐱~s)𝐎s)=𝐂t+s=1t1DΦts(𝐱~s)𝐂s+(𝐎t+s=2t1DΦts(𝐱~s)𝐎s)=𝐚t+𝐎t+s=2t1DΦts(𝐱~s)𝐎s,\Delta{\bf x}_{t}={\bf C}_{t}+\sum_{s=1}^{t-1}D\Phi(\widetilde{\bf x}_{t-1})\cdots D\Phi(\widetilde{\bf x}_{s}){\bf C}_{s}+\left({\bf O}_{t}+\sum_{s=2}^{t-1}D\Phi(\widetilde{\bf x}_{t-1})\dots D\Phi(\widetilde{\bf x}_{s}){\bf O}_{s}\right)\\ ={\bf C}_{t}+\sum_{s=1}^{t-1}D\Phi^{t-s}(\widetilde{\bf x}_{s}){\bf C}_{s}+\left({\bf O}_{t}+\sum_{s=2}^{t-1}D\Phi^{t-s}(\widetilde{\bf x}_{s}){\bf O}_{s}\right)={\bf a}_{t}+{\bf O}_{t}+\sum_{s=2}^{t-1}D\Phi^{t-s}(\widetilde{\bf x}_{s}){\bf O}_{s}, (3.21)

where 𝐚{\bf a} is the sequence defined by the recursion 𝐚t=𝐂t+DΦ(𝐱~t1)𝐚t1,{\bf a}_{t}={\bf C}_{t}+D\Phi(\widetilde{\bf x}_{t-1}){\bf a}_{t-1}, for t,t\in{\mathbb{N}}, with 𝐚0=0,{\bf a}_{0}=0, as in Theorem 3.1. Since 𝐲t1h(𝐱t1)=H(𝐲t2¯)HhF(𝐲t2¯)ε\lVert{\bf y}_{t-1}-h({\bf x}_{t-1})\rVert=\lVert H(\underline{{\bf y}_{t-2}})-H^{F}_{h}(\underline{{\bf y}_{t-2}})\rVert\leq\varepsilon, using the Lipschitz condition on the second coordinate of FF, we can bound 𝐂tLzε\lVert{\bf C}_{t}\rVert\leq L_{z}\varepsilon. Thus there exists a constant R=R(δ,𝐱0)R=R(\delta,{\bf x}_{0}) such that 𝐚tεLzRet(λ1pos+δ).\|{\bf a}_{t}\|\leq\varepsilon L_{z}Re^{t(\lambda_{1}^{\text{pos}}+\delta)}.

By Taylor’s Theorem we also have 𝐎tLΦΔ𝐱t12\lVert{\bf O}_{t}\rVert\leq L_{\Phi}\lVert\Delta{\bf x}_{t-1}\rVert^{2} where LΦL_{\Phi} is a constant depending on the second order derivatives of Φ\Phi. Using the same strategies as in the proof of Theorem 3.1, we can find a constant K=K(δ,𝐱)1K=K(\delta,{\bf x})\geq 1 such that DΦts(𝐱~s)Ket(λ1pos+δ)es(λ1pos3δ).\lVert D\Phi^{t-s}(\widetilde{\bf x}_{s})\rVert\leq Ke^{t(\lambda_{1}^{\text{pos}}+\delta)}e^{-s(\lambda_{1}^{\text{pos}}-3\delta)}. Thus,

Δ𝐱t\displaystyle\lVert\Delta{\bf x}_{t}\rVert εLzRet(λ1pos+δ)+s=2t1Ket(λ1pos+δ)es(λ1pos3δ)LΦΔ𝐱s12+LΦΔ𝐱t12\displaystyle\leq\varepsilon L_{z}Re^{t(\lambda_{1}^{\text{pos}}+\delta)}+\sum_{s=2}^{t-1}Ke^{t(\lambda_{1}^{\text{pos}}+\delta)}e^{-s(\lambda_{1}^{\text{pos}}-3\delta)}L_{\Phi}\lVert\Delta{\bf x}_{s-1}\rVert^{2}+L_{\Phi}\lVert\Delta{\bf x}_{t-1}\rVert^{2}
et(λ1pos+δ)(εLzR+LΦKs=2tes(λ1pos3δ)Δ𝐱s12).\displaystyle\leq e^{t(\lambda_{1}^{\text{pos}}+\delta)}\left(\varepsilon L_{z}R+L_{\Phi}K\sum_{s=2}^{t}e^{-s(\lambda_{1}^{\text{pos}}-3\delta)}\lVert\Delta{\bf x}_{s-1}\rVert^{2}\right).

We can thus construct a sequence 𝐛=(bt)t0{\bf b}=(b_{t})_{t\geq 0} of upper bounds for (Δ𝐱t)t0(\lVert\Delta{\bf x}_{t}\rVert)_{t\geq 0} by the recursion

bt=et(λ1pos+δ)(εLzR+LΦKs=2tes(λ1pos3δ)bs12),t;b0=0.\displaystyle b_{t}=e^{t(\lambda_{1}^{\text{pos}}+\delta)}\left(\varepsilon L_{z}R+L_{\Phi}K\sum_{s=2}^{t}e^{-s(\lambda_{1}^{\text{pos}}-3\delta)}b_{s-1}^{2}\right),\quad t\in\mathbb{N};\quad b_{0}=0.

We transform this sequence by dividing out the factor εLzRet(λ1pos+δ),\varepsilon L_{z}Re^{t(\lambda_{1}^{\text{pos}}+\delta)}, that is, we form the sequence 𝐜=(ct){\bf c}=(c_{t}) by ct:=bt/(εLzRet(λ1pos+δ)).c_{t}:=b_{t}/(\varepsilon L_{z}Re^{t(\lambda_{1}^{\text{pos}}+\delta)}). We have

ct=1+εLzLΦKRe2(λ1pos+δ)s=2tes(λ1pos+5δ)cs12for t.\displaystyle c_{t}=1+\varepsilon L_{z}L_{\Phi}KRe^{-2(\lambda_{1}^{\text{pos}}+\delta)}\sum_{s=2}^{t}e^{s(\lambda_{1}^{\text{pos}}+5\delta)}c_{s-1}^{2}\quad\mbox{for }t\in\mathbb{N}. (3.22)

We can bound the sequence 𝐜{\bf c} by an exponential growth factor for a finite time horizon. Consider equation (3.22). Suppose that for s=1,,t1s=1,\dots,t-1 we have cs1+η,c_{s}\leq 1+\eta, for some η>0.\eta>0. Then

ct\displaystyle c_{t} 1+εLzLΦKRe2(λ1pos+δ)s=2tes(λ1pos+5δ)(η+1)21+(η+1)2εLzLΦKR(e8δeλ1pos+5δ1)e(t1)(λ1pos+5δ).\displaystyle\leq 1+\varepsilon L_{z}L_{\Phi}KRe^{-2(\lambda_{1}^{\text{pos}}+\delta)}\sum_{s=2}^{t}e^{s(\lambda_{1}^{\text{pos}}+5\delta)}(\eta+1)^{2}\leq 1+(\eta+1)^{2}\varepsilon L_{z}L_{\Phi}KR\left(\frac{e^{8\delta}}{e^{\lambda_{1}^{\text{pos}}+5\delta}-1}\right)e^{(t-1)(\lambda_{1}^{\text{pos}}+5\delta)}.

Thus we can maintain the bound ct1+ηc_{t}\leq 1+\eta as long as

t1λ1pos+5δlog(η(η+1)2eλ1pos+5δ1εLzLΦKRe8δ)+1.\displaystyle t\leq\left\lfloor\frac{1}{\lambda_{1}^{\text{pos}}+5\delta}\log\left(\frac{\eta}{(\eta+1)^{2}}\frac{e^{\lambda_{1}^{\text{pos}}+5\delta}-1}{\varepsilon L_{z}L_{\Phi}KRe^{8\delta}}\right)\right\rfloor+1.

To maximise the number of time steps for which this bound holds, we optimise η/(η+1)2.\eta/(\eta+1)^{2}. We find

ddη(η(η+1)2)=0\frac{d}{d\eta}\left(\frac{\eta}{(\eta+1)^{2}}\right)=0

for η=1\eta=1, where the function attains its maximum value 14.\frac{1}{4}. Thus, if cs2c_{s}\leq 2 for s=1,,t1s=1,\dots,t-1, then ct2c_{t}\leq 2, as long as

t1λ1pos+5δlog(eλ1pos+5δ14εLzLΦKRe8δ)+1,\displaystyle t\leq\left\lfloor\frac{1}{\lambda_{1}^{\text{pos}}+5\delta}\log\left(\frac{e^{\lambda_{1}^{\text{pos}}+5\delta}-1}{4\varepsilon L_{z}L_{\Phi}KRe^{8\delta}}\right)\right\rfloor+1,

that is, we can control the error bound as Δ𝐱t2εLzRet(λ1pos+δ)\lVert\Delta{\bf x}_{t}\rVert\leq 2\varepsilon L_{z}Re^{t(\lambda_{1}^{\text{pos}}+\delta)} within the aforementioned number of time steps. Finally, by (3.5), Δ𝐲tε+LhΔ𝐱tε(1+2LhLzRet(λ1pos+δ)).\left\|\Delta{\bf y}_{t}\right\|\leq\varepsilon+L_{h}\|\Delta{\bf x}_{t}\|\leq\varepsilon\left(1+2L_{h}L_{z}Re^{t(\lambda_{1}^{\text{pos}}+\delta)}\right).\blacksquare
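The optimisation of η/(η+1)² used at the end of the proof is a one-line calculus exercise; it can also be confirmed numerically:

```python
# Maximise eta / (1 + eta)^2 over a fine grid; the optimum is at eta = 1
# with value 1/4, matching the derivative computation in the proof.
etas = [i / 1000 for i in range(1, 5000)]
vals = [eta / (1 + eta) ** 2 for eta in etas]
i_best = max(range(len(vals)), key=vals.__getitem__)
print(etas[i_best], vals[i_best])
```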

3.2 The causal embedding forecasting scheme

As an application of the previous results, we compare the causal embedding forecasting scheme proposed by Manjunath and de Clerq in [Manj 21] with the traditional approach used in reservoir computing and RNNs that we analyzed in the previous section. We start by deriving an upper bound for the prediction error corresponding to Theorem 3.11 for their scheme.

We first give a brief overview of the causal embedding scheme adapted to our setup. The causal embedding scheme consists of a state-space system that takes the states of a dynamical system as inputs and creates an embedding of the dynamical system in a higher-dimensional space. In our setup we only have access to observations from a dynamical system, so we begin by embedding the original dynamical system in a proxy space. For this, we may use a state-space system known to provide an embedding, such as Takens’ delay embedding scheme. We then apply the causal embedding scheme to the proxy space. Another simplification in our setup is that we assume an invertible dynamical system from the start, a property that is treated only as a special case in [Manj 21].

Let ϕ:MM\phi\colon M\longrightarrow M be a dynamical system, with MM a qq-dimensional compact manifold and ϕDiff2(M).\phi\in\operatorname{Diff}^{2}(M). Let ω:Md\omega\colon M\longrightarrow\mathbb{R}^{d} be an observation map and suppose F:L×dLF\colon\mathbb{R}^{L}\times\mathbb{R}^{d}\longrightarrow\mathbb{R}^{L} is a state-space map whose GS f:MLf\colon M\longrightarrow\mathbb{R}^{L} is an embedding, so that the proxy space M=f(M)M^{\prime}=f(M) is a compact submanifold of L.\mathbb{R}^{L}. By Proposition 2.1, there exists a readout h:Mdh\colon M^{\prime}\longrightarrow\mathbb{R}^{d} satisfying (2.7) and we can form the conjugate dynamical system ϕ:MM\phi^{\prime}\colon M^{\prime}\longrightarrow M^{\prime} given by ϕ(m)=F(m,h(m))\phi^{\prime}(m^{\prime})=F(m^{\prime},h(m^{\prime})) for mM.m^{\prime}\in M^{\prime}. We assume that ϕDiff2(M)\phi^{\prime}\in\operatorname{Diff}^{2}(M^{\prime}).

Whereas in the typical RC setup we would learn the readout hh, in the causal embedding scheme, we concatenate a second state-space system onto the first. Let F:L×MLF^{\prime}\colon\mathbb{R}^{L^{\prime}}\times M^{\prime}\longrightarrow\mathbb{R}^{L^{\prime}} be a state-space map with GS f:ML.f^{\prime}\colon M^{\prime}\longrightarrow\mathbb{R}^{L^{\prime}}. We define the map f2:ML×Lf_{2}^{\prime}\colon M^{\prime}\longrightarrow\mathbb{R}^{L^{\prime}}\times\mathbb{R}^{L^{\prime}} by

f2(m)=(f(m),f(ϕ(m))).f_{2}^{\prime}(m^{\prime})=(f^{\prime}(m^{\prime}),f^{\prime}(\phi^{\prime}(m^{\prime}))).

Let the image of MM^{\prime} under f2f_{2}^{\prime} be XX. This is the space of single-delay lags in the state variable of FF^{\prime}. We assume an additional property for the state-space map FF^{\prime}, namely that of state-input invertibility (SI-invertibility): FF^{\prime} is SI-invertible when the map F𝐱():=F(𝐱,):MLF^{\prime}_{\bf x}(\cdot):=F^{\prime}({\bf x},\cdot)\colon M^{\prime}\longrightarrow\mathbb{R}^{L^{\prime}} is invertible for all 𝐱L.{\bf x}\in\mathbb{R}^{L^{\prime}}. SI-invertibility is not a very strict condition to impose, and can in practice easily be guaranteed. When the map FF^{\prime} is SI-invertible, from a single delay in the state space we can compute the corresponding input from the dynamical system MM^{\prime}. This allows us to construct a readout h:XMh^{\prime}\colon X\longrightarrow M^{\prime} as

h(𝐱1,𝐱0):=ϕ(F𝐱1)1(𝐱0),h^{\prime}({\bf x}_{-1},{\bf x}_{0}):=\phi^{\prime}\circ(F^{\prime}_{{\bf x}_{-1}})^{-1}({\bf x}_{0}),

and we can form a dynamical system Φ:XX\Phi\colon X\longrightarrow X given by Φ:=f2h.\Phi:=f_{2}^{\prime}\circ h^{\prime}. The map f2f_{2}^{\prime} is an embedding and the dynamical system (X,Φ)(X,\Phi) is conjugate to (M,ϕ)(M^{\prime},\phi^{\prime}) and hence to (M,ϕ)(M,\phi). The map Φ\Phi maps a single-delay coordinate one step forward under input from the dynamical system. To see this more clearly, let UF:(M)(L)U^{F^{\prime}}\colon(M^{\prime})^{\mathbb{Z}}\longrightarrow(\mathbb{R}^{L^{\prime}})^{\mathbb{Z}} be the filter corresponding to the state-space map FF^{\prime}. Then for a point mMm^{\prime}\in M^{\prime}, let 𝐦=(ϕt(m))t{\bf m}^{\prime}=(\phi^{\prime t}(m^{\prime}))_{t\in\mathbb{Z}} be its trajectory under ϕ\phi^{\prime}. For any tt\in\mathbb{Z} we have

Φ(UF(𝐦)t1,UF(𝐦)t)\displaystyle\Phi(U^{F^{\prime}}({\bf m}^{\prime})_{t-1},U^{F^{\prime}}({\bf m}^{\prime})_{t}) =f2ϕ(FUF(𝐦)t1)1(UF(𝐦)t)\displaystyle=f_{2}^{\prime}\circ\phi^{\prime}\circ\left(F^{\prime}_{U^{F^{\prime}}({\bf m}^{\prime})_{t-1}}\right)^{-1}(U^{F^{\prime}}({\bf m}^{\prime})_{t})
=f2ϕϕt1(m)=(f(ϕt(m)),f(ϕt+1(m)))=(UF(𝐦)t,UF(𝐦)t+1).\displaystyle=f_{2}^{\prime}\circ\phi^{\prime}\circ\phi^{\prime t-1}(m^{\prime})=\left(f^{\prime}(\phi^{\prime t}(m^{\prime})),f^{\prime}(\phi^{\prime t+1}(m^{\prime}))\right)=\left(U^{F^{\prime}}({\bf m}^{\prime})_{t},U^{F^{\prime}}({\bf m}^{\prime})_{t+1}\right).

Essentially, what we have done is to enlarge the state-space map F:L×MLF^{\prime}\colon\mathbb{R}^{L^{\prime}}\times M^{\prime}\longrightarrow\mathbb{R}^{L^{\prime}} to a map F′′:X×MXF^{\prime\prime}\colon X\times M^{\prime}\longrightarrow X given by F′′(𝐱1,𝐱0,m)=(𝐱0,F(𝐱0,m))F^{\prime\prime}({\bf x}_{-1},{\bf x}_{0},m^{\prime})=({\bf x}_{0},F^{\prime}({\bf x}_{0},m^{\prime})) for (𝐱1,𝐱0)X.({\bf x}_{-1},{\bf x}_{0})\in X. Thus Φ(𝐱1,𝐱0)=F′′(𝐱1,𝐱0,h(𝐱1,𝐱0))\Phi({\bf x}_{-1},{\bf x}_{0})=F^{\prime\prime}({\bf x}_{-1},{\bf x}_{0},h^{\prime}({\bf x}_{-1},{\bf x}_{0})). In the causal embedding scheme, we learn the readout hh^{\prime}. Note that we do not require the GS ff^{\prime} to be an embedding but only that the map FF^{\prime} is SI-invertible and has the ESP. The reasoning behind adding a second state-space system onto the first is that the first might be highly irregular, making the readout hh hard to learn. By contrast, the state-space map FF^{\prime} may be chosen without requiring its GS to be an embedding, and hence the map Φ,\Phi, and in particular the readout h,h^{\prime}, are expected to be more regular and thus easier to learn.

For a point mMm\in M, let 𝐲=(𝐲t)t{\bf y}=({\bf y}_{t})_{t\in\mathbb{Z}}, where 𝐲t=ω(ϕt(m)),{\bf y}_{t}=\omega(\phi^{t}(m)), for all tt\in\mathbb{Z}, be the corresponding sequence of observations. The state space maps FF and FF^{\prime} generate sequences 𝐲(L){\bf y}^{\prime}\in(\mathbb{R}^{L})^{\mathbb{Z}} and 𝐱(L){\bf x}\in(\mathbb{R}^{L^{\prime}})^{\mathbb{Z}} according to the iterations

\begin{equation}
\left\{\begin{array}{l}
{\bf y}_{t}^{\prime}=F({\bf y}_{t-1}^{\prime},{\bf y}_{t-1}),\\[1mm]
{\bf x}_{t}=F^{\prime}({\bf x}_{t-1},{\bf y}_{t-1}^{\prime}).
\end{array}\right.\tag{3.25}
\end{equation}

We approximate $h^{\prime}\colon X\longrightarrow M^{\prime}$ with the learnt map $\widetilde{h}^{\prime}$. As in the iterated multistep forecasting scheme discussed previously, we take as input the sequence of past observations $\underline{{\bf y}_{0}}\in(\mathbb{R}^{d})^{\mathbb{Z}_{-}}$, so that we only have access to the left-infinite sequences $\underline{{\bf y}_{0}^{\prime}}$ and $\underline{{\bf x}_{0}}$. Initializing $\widetilde{\bf x}_{0}={\bf x}_{0}$ and $\widetilde{\bf y}_{0}^{\prime}={\bf y}_{0}^{\prime}$, for $t\in\mathbb{N}$ we produce predictions $\widetilde{\bf x}_{1},\widetilde{\bf x}_{2},\dots,\widetilde{\bf x}_{t}$, $\widetilde{\bf y}_{1}^{\prime},\widetilde{\bf y}_{2}^{\prime},\dots,\widetilde{\bf y}_{t}^{\prime}$, and $\widetilde{\bf y}_{1},\widetilde{\bf y}_{2},\dots,\widetilde{\bf y}_{t}$ according to the scheme

\begin{equation}
\left\{\begin{array}{l}
\widetilde{\bf x}_{t}=F^{\prime}(\widetilde{\bf x}_{t-1},\widetilde{\bf y}_{t-1}^{\prime}),\\[1mm]
\widetilde{\bf y}_{t}^{\prime}=\widetilde{h}^{\prime}(\widetilde{\bf x}_{t-1},\widetilde{\bf x}_{t}),\\[1mm]
\widetilde{\bf y}_{t}=f^{-1}(\widetilde{\bf y}_{t}^{\prime}).
\end{array}\right.\tag{3.29}
\end{equation}
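In code, the scheme (3.29) amounts to a simple loop. The following is a minimal sketch in which `F`, `Fp` (standing for $F^{\prime}$), the learnt readout `h_tilde`, and the inverse `f_inv` are toy stand-ins for the maps of the paper, chosen only so that the example is self-contained.

```python
import numpy as np

# Toy stand-ins for the maps in (3.25) and (3.29); any state maps
# with the echo state property would play the same structural role.
def F(y_prev, y_obs):          # first state-space map F
    return np.tanh(0.5 * y_prev + 0.3 * y_obs)

def Fp(x_prev, y_prev):        # second state-space map F'
    return np.tanh(0.4 * x_prev + 0.6 * y_prev)

def h_tilde(x_prev, x_curr):   # learnt readout h~' acting on pairs (x_{t-1}, x_t)
    return 0.5 * (x_prev + x_curr)

def f_inv(y_prime):            # inverse of the GS f, mapping M' back to observations
    return y_prime

def forecast(x0, y0_prime, steps):
    """Iterate the scheme (3.29): state update, readout, then pullback."""
    x, yp = x0, y0_prime
    preds = []
    for _ in range(steps):
        x_next = Fp(x, yp)       # x~_t   = F'(x~_{t-1}, y~'_{t-1})
        yp = h_tilde(x, x_next)  # y~'_t  = h~'(x~_{t-1}, x~_t)
        x = x_next
        preds.append(f_inv(yp))  # y~_t   = f^{-1}(y~'_t)
    return preds

preds = forecast(np.zeros(3), 0.1 * np.ones(3), steps=5)
```

The loop only ever consumes its own previous outputs, which is exactly the iterated multistep character of the scheme.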
Corollary 3.20

Let $\phi\in\operatorname{Diff}^{2}(M)$ be a dynamical system on a finite-dimensional compact manifold $M$, let $\omega\colon M\longrightarrow\mathbb{R}^{d}$ be an observation map, and let the maps $F,f,h,\phi^{\prime},F^{\prime},f^{\prime},f_{2}^{\prime},h^{\prime},\Phi$ and the spaces $M^{\prime}$ and $X$ be as described above. In particular, the GS $f$ belonging to the state-space map is an embedding, so that $M^{\prime}=f(M)$ is a compact submanifold of $\mathbb{R}^{L}$ and $\phi^{\prime}\in\operatorname{Diff}^{2}(M^{\prime})$ is conjugate to $\phi$; the state-space map $F^{\prime}$ is SI-invertible and possesses a GS $f^{\prime}$, and $\Phi$ is conjugate to $\phi^{\prime}$ and hence to $\phi$. Suppose that $X$ is a convex $C^{2}$ submanifold of $\mathbb{R}^{L^{\prime}}$ and that $\Phi\in\operatorname{Diff}^{2}(X)$. Further, let $X^{\prime}$ be a compact $\Phi$-invariant subset of $X$ and define $X^{\prime\prime}:=\{U^{F^{\prime}}({\bf m}^{\prime})_{0}\colon{\bf m}^{\prime}=(\phi^{\prime\,t}(m^{\prime}))_{t\in\mathbb{Z}}\text{ is a trajectory of a point }m^{\prime}\in M^{\prime}\}$, the set of reachable states. We assume that

(i) there exists an open and convex subset $U$ of $X$ such that $X^{\prime}\cup X^{\prime\prime}\subset U$ and the second-order derivatives of $\Phi$ are bounded on $U$, and that

(ii) $X^{\prime\prime}\cup\Phi(X^{\prime\prime})$ is contained in a convex subset of $X$.

Suppose that $f^{-1}$ is Lipschitz on $M^{\prime}$ with constant $L_{f^{-1}}$ and that $F^{\prime}$ is Lipschitz in its second entry with constant $L_{z}^{\prime}$, uniformly with respect to the first entry, that is,

\begin{equation*}
\left\|F^{\prime}(\mathbf{x},{\bf z}_{1})-F^{\prime}(\mathbf{x},{\bf z}_{2})\right\|\leq L_{z}^{\prime}\left\|{\bf z}_{1}-{\bf z}_{2}\right\|,\quad\mbox{for all }\mathbf{x}\in X\mbox{ and all }{\bf z}_{1},{\bf z}_{2}\in\mathbb{R}^{N}.
\end{equation*}

Let $\widetilde{h}^{\prime}\colon X\longrightarrow M^{\prime}$ be a learnt version of the map $h^{\prime}$ and let $\varepsilon=\|h^{\prime}-\widetilde{h}^{\prime}\|_{\infty}$. Assume that $\widetilde{h}^{\prime}\colon X\longrightarrow M^{\prime}$ is Lipschitz with constant $L_{\widetilde{h}^{\prime}}$. For a point $m\in M$, construct the sequences ${\bf y},{\bf y}^{\prime}$, and ${\bf x}$ as in (3.25), and the forecasts $\widetilde{\bf x}_{1},\widetilde{\bf x}_{2},\dots,\widetilde{\bf x}_{t}$, $\widetilde{\bf y}_{1}^{\prime},\widetilde{\bf y}_{2}^{\prime},\dots,\widetilde{\bf y}_{t}^{\prime}$, and $\widetilde{\bf y}_{1},\widetilde{\bf y}_{2},\dots,\widetilde{\bf y}_{t}$ as in (3.29). Then, for any $\Phi$-invariant Borel probability measure on $X^{\prime}$, there is a $\Phi$-invariant set $\Lambda\subset X^{\prime}$ of full measure such that, if ${\bf x}_{0}\in\Lambda$, the following holds: for small $\delta>0$, there exist a constant $R=R(\delta,{\bf x}_{0})$ and a time horizon $T=T(\delta,{\bf x}_{0})\in\mathbb{N}$ such that

\begin{align}
\|{\bf x}_{t}-\widetilde{\bf x}_{t}\| &\leq 2\varepsilon L_{z}^{\prime}Re^{t(\lambda_{1}^{\text{pos}}+\delta)},\tag{3.30}\\
\|{\bf y}_{t}^{\prime}-\widetilde{\bf y}_{t}^{\prime}\| &\leq \varepsilon\left(1+2L_{\widetilde{h}^{\prime}}L_{z}^{\prime}Re^{t(\lambda_{1}^{\text{pos}}+\delta)}\right),\tag{3.31}\\
\|{\bf y}_{t}-\widetilde{\bf y}_{t}\| &\leq \varepsilon L_{f^{-1}}\left(1+2L_{\widetilde{h}^{\prime}}L_{z}^{\prime}Re^{t(\lambda_{1}^{\text{pos}}+\delta)}\right),\tag{3.32}
\end{align}

for $t=1,\dots,T$, where $\lambda_{1}=\lambda_{1}({\bf x}_{0})$ is the largest Lyapunov exponent of ${\bf x}_{0}$ under the dynamical system $(X,\Phi)$ and, as before, $\lambda_{1}^{\text{pos}}=\max\{\lambda_{1},0\}$.

Proof. As noted above, the sequences ${\bf y}^{\prime}$ and ${\bf x}$ satisfy the state-space equations of $(F^{\prime\prime},h^{\prime})$, where $F^{\prime\prime}\colon X\times M^{\prime}\longrightarrow X$ is given by $F^{\prime\prime}({\bf x}_{-1},{\bf x}_{0},m^{\prime})=({\bf x}_{0},F^{\prime}({\bf x}_{0},m^{\prime}))$ for $({\bf x}_{-1},{\bf x}_{0})\in X$. Similarly, the sequences $\widetilde{\bf y}_{1}^{\prime},\widetilde{\bf y}_{2}^{\prime},\dots,\widetilde{\bf y}_{t}^{\prime}$ and $\widetilde{\bf x}_{1},\widetilde{\bf x}_{2},\dots,\widetilde{\bf x}_{t}$ are generated by the state-space system $(F^{\prime\prime},\widetilde{h}^{\prime})$. Thus, ${\bf y}_{t}^{\prime}=H^{F^{\prime\prime}}_{h^{\prime}}(\underline{{\bf y}_{t-1}^{\prime}})$ for $t\in\mathbb{Z}$ and $\widetilde{\bf y}_{t}^{\prime}=H^{F^{\prime\prime}}_{\widetilde{h}^{\prime}}(\underline{{\bf y}_{0}^{\prime}}\,\underline{\widetilde{\bf y}_{t-1}^{\prime}})$ for $t\in\mathbb{Z}_{+}$. Now,

\begin{equation*}
\|H^{F^{\prime\prime}}_{h^{\prime}}-H^{F^{\prime\prime}}_{\widetilde{h}^{\prime}}\|_{\infty}=\|(h^{\prime}-\widetilde{h}^{\prime})\circ H^{F^{\prime\prime}}\|_{\infty}\leq\|h^{\prime}-\widetilde{h}^{\prime}\|_{\infty}=\varepsilon.
\end{equation*}

Thus, (3.30) and (3.31) follow from Theorem 3.11. Finally, we note that

\begin{equation*}
\|{\bf y}_{t}-\widetilde{\bf y}_{t}\|=\|f^{-1}({\bf y}_{t}^{\prime})-f^{-1}(\widetilde{\bf y}_{t}^{\prime})\|\leq L_{f^{-1}}\|{\bf y}_{t}^{\prime}-\widetilde{\bf y}_{t}^{\prime}\|,
\end{equation*}

from which (3.32) follows immediately. \quad\blacksquare

4 Numerical illustration

In this section we illustrate the error bounds given in Theorem 3.11 for the prediction of observations of dynamical systems using reservoir systems. Our results show that the error bound accurately captures the rate of growth of the forecasting errors for these systems. Nevertheless, the bound has some limitations, which become apparent in our numerical work. First, the dynamical systems we learn all have attractors with finite diameter. Hence, while the echo state network (ESN) that we use for forecasting (see (3.19)) typically reconstructs the attractor, the actual predictions deviate from the true values over time until the error saturates at the scale of the attractor: it is bounded by the attractor's diameter and settles at a roughly constant value. After this point, the error bound is no longer applicable. Second, the error bound given in (3.15) includes the multiplicative constant $R$. The existence of this constant is guaranteed by arguments involving Oseledets' Theorem in the proofs; intuitively, the constant gives the system time to settle until the ergodic properties underlying that theorem take effect. However, we do not have an algorithm for computing this constant. Instead, we estimate it by fitting the intercept of the log error bound to the observed log prediction errors. For reservoir systems that do not learn the given system well, this constant may turn out to exceed the saturation error, in which case the bound becomes vacuous. While at first sight this may seem perplexing, it is a practical difficulty inherent to any learning paradigm. Rather than undermining the usefulness of the error bound given in this paper, it signals the inability of the given reservoir computer to track the dynamical system, in which case any error bound is redundant since saturation is quickly reached.
We emphasize that, for reservoir systems that produce accurate long-term iterated multistep predictions, the slope of the log error growth closely matches the value predicted in (3.15). Nevertheless, when dealing with observations from an unknown dynamical system, the question naturally arises up to what time the error bounds are valid. Once again, as noted in Remark 3.14, while we are guaranteed the existence of a finite time horizon $T$ up to which the bound holds, in practice this horizon is difficult to compute. In the empirical exercises below, the bound is valid until saturation. If the reservoir system reconstructs the attractor of the dynamical system well, the saturation error can be estimated by running predictions on any two distinct trajectories of the dynamical system with close initial conditions. The time horizon up to which the error estimates are both valid and useful can then be computed by checking when the predicted error bound crosses the saturation error. However, this still requires knowledge of the multiplicative constant $R$, so a clearer answer to this question requires further work on computing this constant.
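The saturation-error estimate just described, namely running two trajectories with close initial conditions and measuring their separation once it stops growing, can be sketched in a few lines. The logistic map below is a hypothetical stand-in for the dynamical system, chosen only because it requires no integrator.

```python
import numpy as np

def logistic(x, r=3.9):
    # Chaotic one-dimensional stand-in for the dynamical system.
    return r * x * (1.0 - x)

def saturation_error(x0, delta=1e-9, n_steps=200, burn=100):
    """Run two trajectories with close initial conditions and average
    their separation after it has saturated at the attractor scale."""
    a, b = x0, x0 + delta
    seps = []
    for t in range(n_steps):
        a, b = logistic(a), logistic(b)
        if t >= burn:               # discard the transient growth phase
            seps.append(abs(a - b))
    return float(np.mean(seps))

sat = saturation_error(0.2)
```

The resulting value plays the role of the plateau at which the log prediction errors settle in the figures below.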

In our numerical illustrations, we consider the Lorenz and Rössler systems and train an ESN on an input time series consisting of the full states of the given system. In the case of the Lorenz system, we also consider inputs coming from an observation function on the dynamical system. This forms a causal chain as shown in Corollary 2.2. The ESNs we use have a state map of the form (3.19), that is,

\begin{equation}
F(\mathbf{x},{\bf z})=\boldsymbol{\sigma}(A\mathbf{x}+\gamma C{\bf z}+\boldsymbol{\zeta}),\quad\mbox{with}\quad A\in\mathbb{M}_{L,L},\ \gamma\in\mathbb{R},\ C\in\mathbb{M}_{L,d},\ \boldsymbol{\zeta}\in\mathbb{R}^{L},\tag{4.1}
\end{equation}

with an affine readout $h\colon\mathbb{R}^{L}\longrightarrow\mathbb{R}^{d}$, $h({\bf x})=W{\bf x}+{\bf a}$, for some linear map $W\in\mathbb{M}_{d,L}$ and bias ${\bf a}\in\mathbb{R}^{d}$. We take the nonlinear function $\boldsymbol{\sigma}(\cdot)$ to be $\tanh(\cdot)$ applied componentwise and randomly generate the connectivity matrix $A\in\mathbb{M}_{L,L}$ and input mask $C\in\mathbb{M}_{L,d}$ by selecting entries independently from a uniform distribution over the interval $[-1,1]$, with density $0.01$ and $1$, respectively. The input shift $\boldsymbol{\zeta}\in\mathbb{R}^{L}$ is simply taken to be zero. Data for trajectories of the different systems is generated by integrating the given dynamical system over a discrete time step $\Delta t$ using the adaptive Runge-Kutta (RK45) method. In each case, an input trajectory ${\bf y}_{-T},\dots,{\bf y}_{0}$ is fed into the randomly generated state map, generating a trajectory in the state space according to the scheme

\begin{equation}
{\bf x}_{t+1}=F({\bf x}_{t},{\bf y}_{t}),\quad\mbox{for}\quad t=-T,\dots,-1,\quad\mbox{where}\quad{\bf x}_{-T}={\bf 0}.\tag{4.2}
\end{equation}

Although the state-space trajectory is arbitrarily initialized at ${\bf 0}$, an initial washout of $\tau$ time steps allows it to converge to the values given by the GS, so that the two become indistinguishable. This is due to the so-called input forgetting property [Grig 19]. The washout data is discarded, and a ridge regression with regularization constant $\alpha$ is performed between the input points ${\bf y}_{-T+\tau},\dots,{\bf y}_{0}$ and the state-space points ${\bf x}_{-T+\tau},\dots,{\bf x}_{0}$ to fit the output parameters $W$ and ${\bf a}$. Predictions are then made by iterating (3.2) and (3.3), that is,
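A minimal sketch of this training pipeline follows. The reservoir size, sparsity, hyperparameter values, and the synthetic sine input below are illustrative placeholders, not the settings of the experiments, where the input would be a sampled Lorenz or Rössler trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 200, 1                       # reservoir and input dimensions (illustrative)
A = rng.uniform(-1, 1, (L, L)) * (rng.random((L, L)) < 0.01)  # sparse connectivity
C = rng.uniform(-1, 1, (L, d))      # dense input mask
gamma, alpha, tau = 1.0, 1e-8, 100  # input scaling, ridge constant, washout

def esn_step(x, z):
    # State map (4.1) with zero input shift zeta.
    return np.tanh(A @ x + gamma * C @ z)

# Placeholder input trajectory standing in for the sampled system states.
T = 1000
ys = np.sin(0.1 * np.arange(T))[:, None]

# Drive the reservoir as in (4.2), starting from the zero state.
x = np.zeros(L)
states = []
for t in range(T - 1):
    x = esn_step(x, ys[t])
    states.append(x)
states = np.array(states)

# Discard the washout and fit the affine readout by ridge regression.
X, Y = states[tau:], ys[tau + 1:]
Xb = np.hstack([X, np.ones((len(X), 1))])             # append bias column
W = np.linalg.solve(Xb.T @ Xb + alpha * np.eye(L + 1), Xb.T @ Y)
preds = Xb @ W                                        # one-step fitted outputs
```

The ridge solve implements the Tikhonov-regularized least squares used to fit $W$ and ${\bf a}$; the bias column plays the role of ${\bf a}$.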

\begin{align}
\widetilde{\bf x}_{t} &= F(\widetilde{\bf x}_{t-1},\widetilde{\bf y}_{t-1}),\tag{4.3}\\
\widetilde{\bf y}_{t} &= h(\widetilde{\bf x}_{t}),\tag{4.4}
\end{align}

for $t=1,2,3,\dots$, where $\widetilde{\bf x}_{0}={\bf x}_{0}$ and $\widetilde{\bf y}_{0}={\bf y}_{0}$. Several hyperparameters need to be tuned, namely the regularization constant $\alpha$, the scaling constant $\gamma$ for the input mask, and the spectral radius of the connectivity matrix. Selecting these is a delicate task since, to our knowledge, no analytical method exists by which the prediction error can be minimized as a function of these constants. Moreover, the prediction error is not stable with respect to the choice of these constants and, especially for the regularization constant, small changes may lead to largely different results. In order to tune our ESNs individually to each system, we resorted to a brute-force cross-validation procedure that minimizes the mean squared prediction error over a grid of these parameters. This procedure is computationally expensive.
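The brute-force tuning just described is a plain grid search. In the sketch below, `train_and_score` is a hypothetical stand-in for the full train-and-forecast pipeline, replaced here by a synthetic score so that the example is self-contained; the grid values are likewise illustrative.

```python
import itertools

def train_and_score(alpha, gamma, rho):
    # Hypothetical stand-in: in practice this would train the ESN with
    # these hyperparameters and return the mean squared multistep
    # prediction error on a validation trajectory.  A synthetic convex
    # score is used here so the example runs on its own.
    return (alpha - 1e-14) ** 2 + (gamma - 4.0) ** 2 + (rho - 1.0) ** 2

grid = {
    "alpha": [1e-16, 1e-14, 1e-12],   # ridge regularization constant
    "gamma": [2.0, 4.0, 6.0],         # input-mask scaling
    "rho":   [0.2, 1.0, 1.2],         # spectral radius of A
}

# Exhaustively score every hyperparameter combination and keep the best.
best = min(
    itertools.product(grid["alpha"], grid["gamma"], grid["rho"]),
    key=lambda p: train_and_score(*p),
)
```

The instability noted above means the grid for $\alpha$ in particular must be fine, which is what makes the search expensive.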

The Lorenz system.

The Lorenz system is given by the three-dimensional first-order system of differential equations

\begin{align}
\dot{x} &= -\sigma(x-y),\tag{4.5}\\
\dot{y} &= \rho x-y-xz,\tag{4.6}\\
\dot{z} &= -\beta z+xy,\tag{4.7}
\end{align}

where $\sigma=10$, $\beta=8/3$, and $\rho=28$.
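Trajectory data of the kind used here can be generated by integrating (4.5)-(4.7). The sketch below uses a fixed-step fourth-order Runge-Kutta integrator so that it is fully self-contained, whereas the experiments use an adaptive RK45 scheme; the initial condition matches the one used in the experiments.

```python
import numpy as np

def lorenz(u, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Lorenz vector field (4.5)-(4.7).
    x, y, z = u
    return np.array([-sigma * (x - y), rho * x - y - x * z, -beta * z + x * y])

def rk4_trajectory(u0, dt=0.02, n_steps=1000):
    """Fixed-step RK4 integration of the Lorenz system."""
    traj = [np.asarray(u0, dtype=float)]
    for _ in range(n_steps):
        u = traj[-1]
        k1 = lorenz(u)
        k2 = lorenz(u + 0.5 * dt * k1)
        k3 = lorenz(u + 0.5 * dt * k2)
        k4 = lorenz(u + dt * k3)
        traj.append(u + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0)
    return np.array(traj)

traj = rk4_trajectory([0.0, 1.0, 1.05])
```

The rows of `traj` play the role of the inputs ${\bf y}_{-T},\dots,{\bf y}_{0}$ fed to the ESN.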

Figure 1: Data from a trajectory used for training and forecasting the Lorenz system.

Figure 2: Log prediction error for the Lorenz system with full access to system states, averaged over 10 trajectories.

To learn the Lorenz system, we chose an ESN with $1000$ neurons whose connectivity matrix had spectral radius $1.2$ and whose input mask was scaled by $\gamma=13/3$. Inputs were taken as the states of the entire system, sampled at a time step of $\Delta t=0.02$ over $10000$ time steps. A washout of $1000$ time steps was used. Thereafter, using a Tikhonov regularization with regularization constant $\alpha=10^{-14.5}$ between the state-space trajectory and the desired output trajectory on the remaining $9000$ time steps, the readout $h$ was trained and predictions were made for an additional $2500$ time steps. The Lyapunov exponent of the RC was calculated as $\lambda_{1}=0.9796$ using Benettin's algorithm over the length of one trajectory, and the multiplicative constant in the error bound (3.14) was estimated by fitting the predicted error bound to the actual errors. Plotted in Figure 2 are the log prediction errors over $13$ Lyapunov times for an RC trained over $10$ trajectories with initial conditions chosen from a Gaussian distribution around $(0,1,1.05)$, averaged at each point across the $10$ trajectories. The log errors grow at an approximately linear rate until saturation of the attractor around $10$ Lyapunov times. The slope of this linear growth is closely matched by the predicted error bound, whose slope is given by the top Lyapunov exponent of the RC.
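Benettin's algorithm estimates the top Lyapunov exponent by repeatedly evolving a small perturbation of a reference orbit and renormalizing it. The sketch below applies it to the Hénon map as a hypothetical stand-in for the autonomous reservoir dynamics; for the trained ESN one would replace `henon` with the closed-loop map defined by (4.3)-(4.4).

```python
import numpy as np

def henon(u, a=1.4, b=0.3):
    # Discrete-time stand-in for the autonomous (closed-loop) reservoir map.
    x, y = u
    return np.array([1.0 - a * x * x + y, b * x])

def benettin_top_exponent(u0, delta0=1e-8, n_steps=20000, transient=100):
    """Estimate the top Lyapunov exponent by evolving a nearby point
    and renormalizing the separation back to delta0 at every step."""
    u = np.asarray(u0, dtype=float)
    for _ in range(transient):            # settle onto the attractor
        u = henon(u)
    v = u + np.array([delta0, 0.0])
    log_sum = 0.0
    for _ in range(n_steps):
        u, v = henon(u), henon(v)
        d = np.linalg.norm(v - u)
        log_sum += np.log(d / delta0)     # accumulate the one-step expansion
        v = u + (v - u) * (delta0 / d)    # renormalize the perturbation
    return log_sum / n_steps

lam = benettin_top_exponent([0.1, 0.1])
```

For continuous-time systems, the exponent per iteration must additionally be divided by the sampling step $\Delta t$ to obtain a rate per unit time.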

Figure 3: Predictions of a single trajectory for an ESN trained on the first coordinate of the Lorenz system.

Figure 4: Log prediction error for the Lorenz system with access to the first coordinate, averaged over 10 trajectories.

The Lorenz system as a causal chain.

To illustrate our result for a causal chain, we took the observation function $\omega\colon\mathbb{R}^{3}\longrightarrow\mathbb{R}$, $(x,y,z)\mapsto x$, on the Lorenz system. By Corollary 2.2, sequences of observations through $\omega$ form a causal chain. To learn the Lorenz system using inputs only from the first coordinate, we chose an ESN with $1000$ neurons whose connectivity matrix had spectral radius $1.0$ and whose input mask was scaled by $\gamma=2.4$. Inputs were sampled at a time step of $\Delta t=0.005$ over $12000$ time steps. A washout of $1000$ time steps was used. Thereafter, using a Tikhonov regularization with regularization constant $\alpha=10^{-14.75}$ between the state-space trajectory and the desired output trajectory on the remaining $11000$ time steps, the readout $h$ was trained and predictions were made for an additional $4000$ time steps. The Lyapunov exponent of the RC was calculated as $\lambda_{1}=0.9978$ using Benettin's algorithm over the length of one trajectory, and the multiplicative constant in the error bound (3.14) was estimated by fitting the predicted error bound to the actual errors. We chose $10$ initial conditions from a Gaussian distribution around $(0,1,1.05)$ and trained the readout of the RC over a single trajectory. In Figure 4, the log prediction errors averaged across the $10$ trajectories are plotted over $10$ Lyapunov times. The errors grow at an approximately linear rate until saturation of the attractor around $7$ Lyapunov times. The slope of this linear growth is closely matched by the predicted error bound, whose slope is given by the top Lyapunov exponent of the RC.

The Rössler system.

The dynamics of the Rössler system is given by

\begin{align}
\dot{x} &= -y-z,\tag{4.8}\\
\dot{y} &= x+ay,\tag{4.9}\\
\dot{z} &= b+z(x-c),\tag{4.10}
\end{align}

where $a=0.1$, $b=0.1$, and $c=14$.

Figure 5: Log prediction error for the Rössler system with access to the full system state, averaged over 10 trajectories.

To learn the Rössler system, we chose an ESN with $1000$ neurons and a connectivity matrix with spectral radius $0.2$. The input mask was scaled by $\gamma=4.65$. Inputs were taken as the states of the entire system, sampled at a time step of $\Delta t=0.005$ over $23000$ time steps. A washout of $2000$ time steps was used. Thereafter, using a Tikhonov regularization with regularization constant $\alpha=10^{-14.5}$ between the state-space trajectory and the desired output trajectory on the remaining $21000$ time steps, the readout $h$ was trained and predictions were made for an additional $30000$ time steps. The Lyapunov exponent of the RC was calculated as $\lambda_{1}=0.0814$ using Benettin's algorithm over the length of one trajectory, and the multiplicative constant in the error bound (3.14) was estimated by fitting the predicted error bound to the actual errors. We chose $10$ initial conditions from a Gaussian distribution around $(0.1,0.1,14)$ and trained the readout of the RC over a single trajectory. In Figure 5, the log prediction errors averaged across the $10$ trajectories are plotted over $10$ Lyapunov times. The errors grow at an approximately linear rate until saturation of the attractor around $8$ Lyapunov times. The slope of this linear growth is closely matched by the predicted error bound, whose slope is given by the top Lyapunov exponent of the RC.

Discussion.

The numerical experiments illustrate how the error committed when using an ESN to predict three different systems grows at an exponential rate governed by the top Lyapunov exponent associated with the learned ESN. The interplay between the Lyapunov exponents of the dynamical system and the resulting effect on forecasting accuracy can clearly be seen in the numerical examples above. In all three examples, the Lyapunov exponents of the learnt dynamical system associated with the trained RC are close to those of the underlying dynamical system being learnt. The Lyapunov exponent of the Lorenz system is greater than that of the Rössler system, leading to a quicker saturation of the prediction error; by comparison, the RC trained on the Rössler system maintains reasonably accurate predictions long into the future. Nevertheless, scaling the time axis in Lyapunov times shows that the prediction capabilities of ESNs on the two systems are comparable.

It is to be noted, however, that in cases where the RC does not produce accurate long-term predictions of the given system, the bound is not as visually apparent, or may indeed be redundant due to the swift saturation of the attractor. In this case, errors settle at a finite value, determined by the diameter and stationary measure of the attractor, before the bound comes into play. This can be seen in part in the comparison between predicting the Lorenz system with access to the full dynamical states and with access only to a single coordinate. The latter task is more challenging due to the restricted access to information about the system. Thus, while both RCs display similar Lyapunov exponents, the forecasting error in predicting the causal chain reaches saturation before that in predicting the entire system.

5 Conclusions

In this paper, we have studied the iterated multistep forecasting scheme based on recurrent neural networks (RNN) that are customarily used in reservoir computing for the path continuation of dynamical systems. More specifically, we have formulated and rigorously proved error bounds in terms of the forecasting horizon, functional and dynamical features of the specific RNN used, and the approximation error committed by it. The results that we present show, in particular, the interplay between the universal approximation properties of a family of recurrent networks in the sense of input/output systems and their properties in the realm of the modeling and the forecasting of the observations of autonomous dynamical systems and, more generally, of causal chains with infinite memory (CCIMs).

The framework in the paper circumvents difficult-to-verify embedding hypotheses that appear in previous references in the literature and applies to new situations like the finite-dimensional observations of functional differential equations or the deterministic parts of stochastic processes to which standard embedding techniques do not necessarily apply. This is an improvement on previous results, which were restricted to sequences consisting of observations of dynamical systems.

The error bounds in this paper show, in particular, that the growth of the forecasting error is exponential, at a rate governed by the top Lyapunov exponent of the dynamical system associated with the trained RNN in the state space. This is insightful for several reasons: (i) it illuminates the role of state-space dynamics in the accuracy of long-term forecasting, (ii) in the case of observations coming from a dynamical system, it illustrates the relationship between the dynamics of the original system and that of the reconstructed dynamical system in the state space, and (iii) it provides a tool for comparing the forecasting accuracy of different state-space system learning techniques based on the top Lyapunov exponents of their associated state-space dynamical systems in a given learning problem.

As noted in the numerical section, the bounds given have a number of limitations. In particular, it is not clear how to calculate an important multiplicative constant to fit the error bound to real case prediction scenarios. Nevertheless, in the case of well-trained state space systems used in prediction, the bound demonstrates a striking sharpness, accurately predicting the growth rate of the true forecasting error.

Future work will focus on stochastic extensions of these results in which iterative schemes of this type are used to forecast discrete-time stochastic processes, a context in which RNN universal approximation results have also been formulated.


Acknowledgments: The authors thank G. Manjunath for helpful discussions and remarks. JL is funded by a Graduate Scholarship from Nanyang Technological University. JPO acknowledges partial financial support from the School of Physical and Mathematical Sciences of the Nanyang Technological University. LG and JPO thank the Nanyang Technological University and the University of St. Gallen for their hospitality; some of the results in this paper were obtained during respective visits to these two institutions.

References

  • [Abra 88] R. Abraham, J. E. Marsden, and T. S. Ratiu. Manifolds, Tensor Analysis, and Applications. Vol. 75, Applied Mathematical Sciences. Springer-Verlag, 1988.
  • [Alqu 12] P. Alquier and O. Wintenberger. “Model selection for weakly dependent time series forecasting”. Bernoulli, Vol. 18, No. 3, pp. 883–913, 2012.
  • [Arco 22] T. Arcomano, I. Szunyogh, A. Wikner, J. Pathak, B. R. Hunt, and E. Ott. “A hybrid approach to atmospheric modeling that combines machine learning with a physics-based numerical model”. Journal of Advances in Modeling Earth Systems, Vol. 14, No. 3, p. e2021MS002712, 2022.
  • [Barr 07] L. Barreira and Y. B. Pesin. Nonuniform hyperbolicity: Dynamics of systems with nonzero Lyapunov exponents. Vol. 115, Cambridge University Press Cambridge, 2007.
  • [Baum 66] L. E. Baum and T. Petrie. “Statistical inference for probabilistic functions of finite state Markov chains”. The annals of mathematical statistics, Vol. 37, No. 6, pp. 1554–1563, 1966.
  • [Berr 23] T. Berry and S. Das. “Learning theory for dynamical systems”. SIAM Journal on Applied Dynamical Systems, Vol. 22, No. 3, pp. 2082–2122, 2023.
  • [Bocc 02] S. Boccaletti, J. Kurths, G. Osipov, D. L. Valladares, and C. S. Zhou. “The synchronization of chaotic systems”. Physics Reports, Vol. 366, pp. 1–101, 2002.
  • [Boyd 85] S. Boyd and L. Chua. “Fading memory and the problem of approximating nonlinear operators with Volterra series”. IEEE Transactions on Circuits and Systems, Vol. 32, No. 11, pp. 1150–1161, 1985.
  • [Broc 06] P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer-Verlag, 2006.
  • [Bute 13] P. Buteneers, D. Verstraeten, B. V. Nieuwenhuyse, D. Stroobandt, R. Raedt, K. Vonck, P. Boon, and B. Schrauwen. “Real-time detection of epileptic seizures in animal models using reservoir computing”. Epilepsy Research, Vol. 103, No. 2, pp. 124–134, 2013.
  • [Cuch 22] C. Cuchiero, L. Gonon, L. Grigoryeva, J. P. Ortega, and J. Teichmann. “Discrete-time signatures and randomness in reservoir computing”. IEEE Transactions on Neural Networks and Learning Systems, Vol. 33, No. 11, pp. 1–10, 2022.
  • [Dang 84] H. Dang Van Mien and D. Normand-Cyrot. “Nonlinear state affine identification methods: applications to electrical power plants”. Automatica, Vol. 20, No. 2, pp. 175–188, mar 1984.
  • [Dech 00] W. D. Dechert and R. Gençay. “Is the largest Lyapunov exponent preserved in embedded dynamics?”. Physics Letters A, Vol. 276, No. 1-4, pp. 59–64, 2000.
  • [Dech 96] W. D. Dechert and R. Gençay. “The topological invariance of Lyapunov exponents in embedded dynamics”. Physica D: Nonlinear Phenomena, Vol. 90, No. 1-2, pp. 40–55, 1996.
  • [Dede 07] J. Dedecker, P. Doukhan, G. Lang, J. R. León, S. Louhichi, and C. Prieur. Weak Dependence: With Examples and Applications. Springer Science+Business Media, 2007.
  • [Douk 08] P. Doukhan and O. Wintenberger. “Weakly dependent chains with infinite memory”. Stochastic Processes and their Applications, Vol. 118, No. 11, pp. 1997–2013, 2008.
  • [Erog 17] D. Eroglu, J. S. W. Lamb, and T. Pereira. “Synchronisation of chaos and its applications”. Contemporary Physics, Vol. 58, No. 3, pp. 207–243, 2017.
  • [Foll 02] H. Föllmer and A. Schied. Stochastic Finance. Vol. 27 of de Gruyter Studies in Mathematics, Walter de Gruyter and Co., Berlin, 2002.
  • [Ghys 09] E. Ghysels, R. Valkanov, and A. Rubia. “Multi-period forecasts of volatility: direct, iterated, and mixed-data approaches”. EFA 2009 Bergen Meetings Paper, Available at SSRN: https://ssrn.com/abstract=1344742 or http://dx.doi.org/10.2139/ssrn.1344742, 2009.
  • [Gono 20a] L. Gonon, L. Grigoryeva, and J.-P. Ortega. “Risk bounds for reservoir computing”. Journal of Machine Learning Research, Vol. 21, No. 240, pp. 1–61, 2020.
  • [Gono 20b] L. Gonon and J.-P. Ortega. “Reservoir computing universality with stochastic inputs”. IEEE Transactions on Neural Networks and Learning Systems, Vol. 31, No. 1, pp. 100–112, 2020.
  • [Gono 21] L. Gonon and J.-P. Ortega. “Fading memory echo state networks are universal”. Neural Networks, Vol. 138, pp. 10–13, 2021.
  • [Gono 23a] L. Gonon, L. Grigoryeva, and J.-P. Ortega. “Approximation error estimates for random neural networks and reservoir systems”. The Annals of Applied Probability, Vol. 33, No. 1, pp. 28–69, 2023.
  • [Gono 23b] L. Gonon, L. Grigoryeva, and J. P. Ortega. “Infinite-dimensional reservoir computing”. arXiv:2304.00490, 2023.
  • [Grig 14] L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. “Stochastic time series forecasting using time-delay reservoir computers: performance and universality”. Neural Networks, Vol. 55, pp. 59–71, 2014.
  • [Grig 18a] L. Grigoryeva and J.-P. Ortega. “Echo state networks are universal”. Neural Networks, Vol. 108, pp. 495–508, 2018.
  • [Grig 18b] L. Grigoryeva and J.-P. Ortega. “Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems”. Journal of Machine Learning Research, Vol. 19, No. 24, pp. 1–40, 2018.
  • [Grig 19] L. Grigoryeva and J.-P. Ortega. “Differentiable reservoir computing”. Journal of Machine Learning Research, Vol. 20, No. 179, pp. 1–62, 2019.
  • [Grig 21a] L. Grigoryeva, A. G. Hart, and J.-P. Ortega. “Chaos on compact manifolds: Differentiable synchronizations beyond the Takens theorem”. Physical Review E - Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics, Vol. 103, p. 062204, 2021.
  • [Grig 21b] L. Grigoryeva and J.-P. Ortega. “Dimension reduction in recurrent networks by canonicalization”. Journal of Geometric Mechanics, Vol. 13, No. 4, pp. 647–677, 2021.
  • [Grig 23] L. Grigoryeva, A. G. Hart, and J.-P. Ortega. “Learning strange attractors with reservoir systems”. Nonlinearity, Vol. 36, pp. 4674–4708, 2023.
  • [Grig 24] L. Grigoryeva, B. Hamzi, F. P. Kemeth, Y. Kevrekidis, G. Manjunath, J.-P. Ortega, and M. J. Steynberg. “Data-driven cold starting of good reservoirs”. arXiv:2403.10325, 2024.
  • [Hart 20] A. G. Hart, J. L. Hook, and J. H. P. Dawes. “Embedding and approximation theorems for echo state networks”. Neural Networks, Vol. 128, pp. 234–247, 2020.
  • [Hart 21] A. G. Hart, J. L. Hook, and J. H. P. Dawes. “Echo State Networks trained by Tikhonov least squares are $L^2(\mu)$ approximators of ergodic dynamical systems”. Physica D: Nonlinear Phenomena, Vol. 421, p. 132882, 2021.
  • [Huke 06] J. P. Huke. “Embedding nonlinear dynamical systems: a guide to Takens’ theorem”. Tech. Rep., Manchester Institute for Mathematical Sciences. The University of Manchester, 2006.
  • [Huke 15] J. P. Huke and M. R. Muldoon. “Embedding and time series analysis”. Tech. Rep., Manchester Institute for Mathematical Sciences. The University of Manchester, 2015.
  • [Jaeg 04] H. Jaeger and H. Haas. “Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication”. Science, Vol. 304, No. 5667, pp. 78–80, 2004.
  • [Jaeg 10] H. Jaeger. “The ‘echo state’ approach to analysing and training recurrent neural networks with an erratum note”. Tech. Rep., German National Research Center for Information Technology, 2010.
  • [Kalm 10] R. Kalman. “Lectures on Controllability and Observability”. In: Controllability and Observability, pp. 1–149, Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
  • [Kalm 59a] R. E. Kalman and J. E. Bertram. “A unified approach to the theory of sampling systems”. Journal of the Franklin Institute, Vol. 267, No. 5, pp. 405–436, 1959.
  • [Kalm 59b] R. E. Kalman and J. E. Bertram. “General synthesis procedure for computer control of single-loop and multiloop linear systems (An optimal sampling system)”. Transactions of the American Institute of Electrical Engineers, Part II: Applications and Industry, Vol. 77, No. 6, pp. 602–609, 1959.
  • [Kalm 60] R. Kalman. “A new approach to linear filtering and prediction problems”. Trans. ASME, J. Basic Engineering, Vol. 82D, pp. 35–45, 1960.
  • [Kalm 62] R. E. Kalman. “Canonical structure of linear dynamical systems”. Proceedings of National Academy Of Sciences USA, Vol. 48, No. 4, pp. 596–600, 1962.
  • [Krey 91] E. Kreyszig. Introductory functional analysis with applications. Vol. 17, John Wiley & Sons, 1991.
  • [Laso 13] A. Lasota and M. C. Mackey. Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics. Applied Mathematical Sciences, Springer New York, 2013.
  • [Lind 15] A. Lindquist and G. Picci. Linear Stochastic Systems. Springer-Verlag, 2015.
  • [Lu 18] Z. Lu, B. R. Hunt, and E. Ott. “Attractor reconstruction by machine learning”. Chaos, Vol. 28, No. 6, 2018.
  • [Lu 20] Z. Lu and D. S. Bassett. “Invertible generalized synchronization: A putative mechanism for implicit learning in neural systems”. Chaos, Vol. 30, p. 063133, 2020.
  • [Luko 09] M. Lukoševičius and H. Jaeger. “Reservoir computing approaches to recurrent neural network training”. Computer Science Review, Vol. 3, No. 3, pp. 127–149, 2009.
  • [Manj 20] G. Manjunath. “Stability and memory-loss go hand-in-hand: three results in dynamics & computation”. Proceedings of the Royal Society London Ser. A Math. Phys. Eng. Sci., Vol. 476, No. 2242, pp. 1–25, 2020.
  • [Manj 21] G. Manjunath and A. de Clercq. “Universal set of observables for the Koopman operator through causal embedding”. arXiv:2105.10759, 2021.
  • [Manj 22] G. Manjunath. “Embedding information onto a dynamical system”. Nonlinearity, Vol. 35, No. 3, p. 1131, 2022.
  • [Manj 23] G. Manjunath and J.-P. Ortega. “Transport in reservoir computing”. Physica D: Nonlinear Phenomena, Vol. 449, p. 133744, 2023.
  • [Marc 06] M. Marcellino, J. H. Stock, and M. W. Watson. “A comparison of direct and iterated multistep AR methods for forecasting macroeconomic time series”. Journal of Econometrics, Vol. 135, No. 1-2, pp. 499–526, 2006.
  • [Matt 94] M. Matthews and G. Moschytz. “The identification of nonlinear discrete-time fading-memory systems using neural network models”. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 41, No. 11, pp. 740–751, 1994.
  • [McEl 15] T. McElroy. “When are direct multi-step and iterative forecasts identical?”. Journal of Forecasting, Vol. 34, No. 4, pp. 315–336, 2015.
  • [Nare 90] K. Narendra and K. Parthasarathy. “Identification and control of dynamical systems using neural networks”. IEEE Transactions on Neural Networks, Vol. 1, No. 1, pp. 4–27, 1990.
  • [Orte 24a] J.-P. Ortega and F. Rossmannek. “State-space systems as dynamic generative models”. Preprint, 2024.
  • [Orte 24b] J.-P. Ortega and D. Yin. “Learnability of linear port-Hamiltonian systems”. Journal of Machine Learning Research, Vol. 25, pp. 1–56, 2024.
  • [Osel 68] V. I. Oseledets. “A multiplicative ergodic theorem. Characteristic Ljapunov exponents of dynamical systems”. Trudy Moskovskogo Matematicheskogo Obshchestva, Vol. 19, pp. 179–210, 1968.
  • [Ott 02] E. Ott. Chaos in Dynamical Systems. Cambridge University Press, second Ed., 2002.
  • [Path 17] J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott. “Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data”. Chaos, Vol. 27, No. 12, 2017.
  • [Path 18] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. “Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach”. Physical Review Letters, Vol. 120, No. 2, p. 24102, 2018.
  • [Peco 20] L. Pecora and T. Carroll. “The dependence of reservoir computing on system parameters”. Bulletin of the American Physical Society, Vol. 65, 2020.
  • [Peco 97] L. M. Pecora, T. L. Carroll, G. A. Johnson, D. J. Mar, and J. F. Heagy. “Fundamentals of synchronization in chaotic systems, concepts, and applications”. Chaos, Vol. 7, No. 4, pp. 520–543, 1997.
  • [Ruel 79] D. Ruelle. “Ergodic theory of differentiable dynamical systems”. Publications Mathématiques de l’Institut des Hautes Études Scientifiques, Vol. 50, No. 1, pp. 27–58, 1979.
  • [Rulk 95] N. F. Rulkov, M. M. Sushchik, L. S. Tsimring, and H. D. I. Abarbanel. “Generalized synchronization of chaos in directionally coupled chaotic systems”. Physical Review E, Vol. 51, No. 2, p. 980, 1995.
  • [Sand 78] N. R. Sandell and K. I. Yared. “Maximum likelihood identification of state space models for linear dynamic systems”. Electronic Systems Laboratory, Dept. of Electrical Engineering and Computer Science, Massachusetts Institute of Technology., Vol. R-814, 1978.
  • [Saue 91] T. Sauer, J. A. Yorke, and M. Casdagli. “Embedology”. Journal of Statistical Physics, Vol. 65, No. 3, pp. 579–616, 1991.
  • [Sont 79] E. Sontag. “Realization theory of discrete-time nonlinear systems: Part I: The bounded case”. IEEE Transactions on Circuits and Systems, Vol. 26, No. 5, pp. 342–356, 1979.
  • [Star 99] J. Stark. “Regularity of invariant graphs for forced systems”. Ergodic theory and dynamical systems, Vol. 19, No. 1, pp. 155–199, 1999.
  • [Take 81] F. Takens. “Detecting strange attractors in turbulence”. In: Dynamical Systems and Turbulence, Warwick 1980, pp. 366–381, Springer Berlin Heidelberg, 1981.
  • [Tana 19] G. Tanaka, T. Yamane, J. B. Héroux, R. Nakane, N. Kanazawa, S. Takeda, H. Numata, D. Nakano, and A. Hirose. “Recent advances in physical reservoir computing: A review”. Neural Networks, Vol. 115, pp. 100–123, 2019.
  • [Verz 20] P. Verzelli, C. Alippi, L. Livi, and P. Tino. “Input representation in recurrent neural networks dynamics”. Preprint, 2020.
  • [Verz 21] P. Verzelli, C. Alippi, and L. Livi. “Learn to synchronize, synchronize to learn”. Chaos: An Interdisciplinary Journal of Nonlinear Science, Vol. 31, p. 083119, 2021.
  • [Vian 20] M. Viana. “(Dis)continuity of Lyapunov exponents”. Ergodic Theory and Dynamical Systems, Vol. 40, No. 3, pp. 577–611, 2020.
  • [Walt 00] P. Walters. An Introduction to Ergodic Theory. Graduate Texts in Mathematics, Springer New York, 2000.
  • [Wyff 10] F. Wyffels and B. Schrauwen. “A comparative study of Reservoir Computing strategies for monthly time series prediction”. Neurocomputing, Vol. 73, No. 10, pp. 1958–1964, 2010.