
Infinite-dimensional next-generation reservoir computing

Lyudmila Grigoryeva [email protected] Faculty of Mathematics and Statistics, Universität Sankt Gallen, CH-9000 Switzerland    Hannah Lim Jing Ting [email protected]    Juan-Pablo Ortega [email protected] School of Physical and Mathematical Sciences, Nanyang Technological University, 637371 Singapore
Abstract

Next-generation reservoir computing (NG-RC) has attracted much attention due to its excellent performance in spatio-temporal forecasting of complex systems and its ease of implementation. This paper shows that NG-RC can be encoded as a kernel ridge regression that makes training efficient and feasible even when the space of chosen polynomial features is very large. Additionally, an extension to an infinite number of covariates is possible, which makes the methodology agnostic with respect to the lags into the past that are considered as explanatory factors, as well as with respect to the number of polynomial covariates, an important hyperparameter in traditional NG-RC. We show that this approach has solid theoretical backing and good behavior based on kernel universality properties previously established in the literature. Various numerical illustrations show that these generalizations of NG-RC outperform the traditional approach in several forecasting applications.

I Introduction

Reservoir computing (RC) [23, 32, 24, 31] has established itself as an important tool for learning and forecasting dynamical systems [20, 24, 41, 40, 30, 48, 2]. It is a methodology in which a recurrent neural network with a randomly generated state equation and a functionally simple readout layer (usually linear) is trained to proxy the data-generating process of a time series. One example is echo state networks (ESNs) [36, 37, 24], which have demonstrated excellent empirical performance and have been shown to have universal approximation properties in several contexts [21, 16, 17, 15]. Another family with similar theoretical properties is that of state-affine systems (SAS), introduced in [44, 22] and prevalent in engineering applications [9] and in quantum reservoir computing (QRC) [34, 35].

Despite the ease of training associated with the RC methodology, its implementation depends heavily on hyperparameters that are sometimes difficult to estimate robustly. This difficulty has motivated various authors to replace the standard RC approach with nonlinear vector autoregression in which the covariates are monomials constructed using the previous inputs. In these methods, the only hyperparameters to be chosen are the maximum order of the monomials and the number of lags of past signals. This approach has been called next-generation reservoir computing (NG-RC) [13, 5, 4, 26, 25, 42] and has displayed excellent performances in control and spatio-temporal forecasting tasks.

One downside of the NG-RC approach is that its performance and complexity depend strongly on the above-mentioned hyperparameters. Yet, as these hyperparameters grow, the computational effort associated with NG-RC increases exponentially. One goal of this paper is to place NG-RC in the context of kernel methods. Kernels are classical tools employed in static classification, regression, and pattern recognition tasks [43, 39, 7]. Due to the Representer Theorem, kernels provide a way of passing inputs into higher-dimensional feature spaces where learning takes place linearly without scaling with the dimension of the feature space. By kernelizing NG-RC, we show a more computationally tractable approach to carrying out the NG-RC methodology that does not increase complexity with the hyperparameter values. In particular, we shall see that NG-RC is a particular case of polynomial kernel regression.

The idea that we just explained can be pushed all the way to considering all past lags and monomials of arbitrarily high degree in the polynomial kernel regression. Even though this leads to an infinite-dimensional covariate space (hence the title of the paper), the kernel regression can be explicitly and efficiently implemented using the recurrence properties of the Volterra kernel introduced in [14]. Since the Volterra kernel regression method has as covariates the left-infinite sequence of inputs and all the monomials of all degrees constructed from these inputs, this implies that, unlike NG-RC or the polynomial kernel regression, this approach is agnostic with respect to the number of lags and the degree of the monomials. Moreover, the Volterra kernel has been shown in [14] to be universal in the space of fading-memory input/output systems with uniformly bounded inputs and has a computational complexity that compares favorably with that of NG-RC whenever higher-order monomials and longer lags are required to model functionally more complex systems.

The last part of the paper contains various numerical simulations that illustrate that (i) using the more computationally tractable polynomial kernel regression, one has access to a broader feature space, which allows learning more complex systems than those typically presented when using the NG-RC methodology and (ii) the Volterra kernel is a useful tool that can produce more accurate forecasts because it allows one to take infinite lags and monomial orders into account, which is of relevance in the modeling of long-memory phenomena that exhibit functionally complex dependencies. All the codes and data necessary to reproduce the results in the paper are available at https://github.com/Learning-of-Dynamic-Processes/kernelngrcvolterra.

II Preliminary discussion

II.1 Notation

We often work with input or output sequence spaces. Let \mathbb{Z} denote the set of integers and \mathbb{Z}_{-} the set of non-positive integers. Denote by \mathbb{R} the set of real numbers and by \mathbb{R}^{d} the d-dimensional Euclidean space. Let the space of sequences of d-dimensional real vectors indexed by \mathbb{Z}_{-} be denoted \left(\mathbb{R}^{d}\right)^{\mathbb{Z}_{-}}. Denote generically the input space by \mathcal{Z} and the output space by \mathcal{Y}, which, in most cases, will be subsets of a finite-dimensional Euclidean space. Given \tau\in\mathbb{N}, we shall denote by \left(\mathbb{R}^{d}\right)^{\tau} the space of \tau-long sequences of d-dimensional real inputs indexed by \{-\tau+1,\ldots,-1,0\}. Given the input space \mathcal{Z}\subset\mathbb{R}^{d}, we will denote the set of semi-infinite sequences with elements in \mathcal{Z} by \mathcal{Z}^{\mathbb{Z}_{-}}. The components of the sequences \underline{\mathbf{z}}\in\mathcal{Z}^{\mathbb{Z}_{-}} are given by \mathbf{z}_{i}\in\mathbb{R}^{d}, i\in\mathbb{Z}_{-}, that is, \underline{\mathbf{z}}=\left(\mathbf{z}_{i}\right)_{i\in\mathbb{Z}_{-}}. Now, given i\in\mathbb{Z}_{-}, \tau\in\mathbb{N}, and \underline{\mathbf{z}}\in\mathcal{Z}^{\mathbb{Z}_{-}}, we denote \underline{\mathbf{z}_{i}}\coloneqq(\ldots,\mathbf{z}_{i-1},\mathbf{z}_{i})\in\mathcal{Z}^{\mathbb{Z}_{-}} and \underline{\mathbf{z}_{i}}^{\tau}\coloneqq(\mathbf{z}_{i-\tau+1},\ldots,\mathbf{z}_{i})\in\mathcal{Z}^{\tau}.

II.2 Data generating processes

Throughout this paper, we are interested in the learning and forecasting of discrete-time, time-invariant, and causal dynamic processes that are generated by functionals of the form H:\left(\mathbb{R}^{d}\right)^{\mathbb{Z}_{-}}\longrightarrow\mathcal{Y}, that is, given \underline{\mathbf{z}}\in(\mathbb{R}^{d})^{\mathbb{Z}}, we have that \mathbf{y}_{t}=H(\underline{\mathbf{z}_{t}}), t\in\mathbb{Z}. In practice, the inputs \underline{\mathbf{z}}\in(\mathbb{R}^{d})^{\mathbb{Z}} can be just deterministic sequences or realizations of a stochastic process. A related data-generating process that we shall be using is that of causal chains with infinite memory (CCIM) [10, 11, 1]. These are infinite sequences (\mathbf{y}_{i})_{i\in\mathbb{Z}}, with \mathbf{y}_{i}\in\mathcal{Y} for all i\in\mathbb{Z}, such that \mathbf{y}_{i}=H(\underline{\mathbf{y}_{i-1}}) for all i\in\mathbb{Z} and some functional H:\mathcal{Y}^{\mathbb{Z}_{-}}\rightarrow\mathcal{Y}. Takens' Theorem [45] and its generalizations [27, 28, 18, 19] guarantee that the low-dimensional observations of dynamical systems follow, under certain conditions, a dynamical prescription of this type.

II.3 Kernel methods

Kernels and feature maps. A kernel on 𝒵\mathcal{Z} is a function K:𝒵×𝒵K:\mathcal{Z}\times\mathcal{Z}\longrightarrow\mathbb{R} that is symmetric and positive semi-definite. In this context, positive semi-definiteness means that for any αi,αj\alpha_{i},\alpha_{j}\in\mathbb{R}, 𝐳i,𝐳j𝒵\mathbf{z}_{i},\mathbf{z}_{j}\in\mathcal{Z}, i,j{1,,n}i,j\in\{1,\ldots,n\},

i=1nj=1nαiαjK(𝐳i,𝐳j)0.\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}K(\mathbf{z}_{i},\mathbf{z}_{j})\geq 0.

Given \mathbf{z}_{1},\ldots,\mathbf{z}_{n}\in\mathcal{Z}, define the Gram matrix or Gramian to be the matrix \mathsf{K}\coloneqq\left[K(\mathbf{z}_{i},\mathbf{z}_{j})\right]_{i,j=1}^{n}\in\mathbb{R}^{n\times n}. The kernel function being symmetric and positive semi-definite is equivalent to the Gramian being positive semi-definite in the matrix sense. The next example is a family of kernel functions of importance in our discussion.

Example 1 (Polynomial kernels).

Let 𝒵d\mathcal{Z}\subset\mathbb{R}^{d} for some dd\in\mathbb{N}. For any constant c>0c>0, a polynomial kernel of degree pp\in\mathbb{N} is the function Kpoly:𝒵×𝒵K^{\textup{poly}}:\mathcal{Z}\times\mathcal{Z}\longrightarrow\mathbb{R} given by

Kpoly(𝐳,𝐳)(𝐳𝐳+c)p,𝐳,𝐳𝒵.K^{\textup{poly}}({\bf z},{\bf z}^{\prime})\coloneqq(\mathbf{z}^{\top}\mathbf{z}^{\prime}+c)^{p},\ \mathbf{z},\mathbf{z}^{\prime}\in\mathcal{Z}.

Let \mathbb{H} be a Hilbert space of real-valued functions on \mathcal{Z} endowed with pointwise sum and scalar multiplication. Denote the inner product on \mathbb{H} by \langle\cdot,\cdot\rangle_{\mathbb{H}}. We say that \mathbb{H} is a reproducing kernel Hilbert space (RKHS) associated to the kernel K if the following two conditions hold. First, for all \mathbf{z}\in\mathcal{Z}, the functions K(\cdot,\mathbf{z})\in\mathbb{H}, and for all \mathbf{z}\in\mathcal{Z} and all f\in\mathbb{H}, the reproducing property f(\mathbf{z})=\langle f,K(\cdot,\mathbf{z})\rangle_{\mathbb{H}} is satisfied. The maps of the form K_{\mathbf{z}}(\cdot)\coloneqq K(\cdot,\mathbf{z}):\mathcal{Z}\longrightarrow\mathbb{R} are called kernel sections. Second, the Dirac functionals \delta_{\mathbf{z}}:\mathbb{H}\rightarrow\mathbb{R} defined by \delta_{\mathbf{z}}(f):=f(\mathbf{z}) are continuous for all \mathbf{z}\in\mathcal{Z}.

In this setup one may define the map \Phi_{\mathbb{H}}:\mathcal{Z}\longrightarrow\mathbb{H} given by \Phi_{\mathbb{H}}(\mathbf{z})\coloneqq K_{\mathbf{z}}, \mathbf{z}\in\mathcal{Z}. Then, by the reproducing property of the RKHS, the kernel function can be written as the inner product of kernel sections. Indeed, given \mathbf{z},\mathbf{z}^{\prime}\in\mathcal{Z}, we can write

K(𝐳,𝐳)=K𝐳,K𝐳=Φ(𝐳),Φ(𝐳).K(\mathbf{z},\mathbf{z}^{\prime})=\left<K_{\mathbf{z}^{\prime}},K_{\mathbf{z}}\right>_{\mathbb{H}}=\left<\Phi_{\mathbb{H}}(\mathbf{z}^{\prime}),\Phi_{\mathbb{H}}(\mathbf{z})\right>_{\mathbb{H}}.

We call the map Φ\Phi_{\mathbb{H}} the canonical feature map and the RKHS \mathbb{H} is referred to as its canonical feature space.

By the Moore-Aronszajn Theorem [3, 7], any kernel has a unique RKHS, and this RKHS can be written as

span{K𝐳𝐳𝒵}¯,\mathbb{H}\coloneqq\overline{\operatorname*{\textup{span}}\{K_{\mathbf{z}}\mid\mathbf{z}\in\mathcal{Z}\}},\\ (1)

where the bar denotes the completion with respect to the metric induced by the inner product \langle\cdot,\cdot\rangle_{\mathbb{H}}. This inner product satisfies that, for f=\sum^{n}_{i=1}\alpha_{i}K_{\mathbf{z}_{i}},\,g=\sum^{m}_{j=1}\beta_{j}K_{\mathbf{z}_{j}}\in\mathbb{H}, \langle f,g\rangle_{\mathbb{H}}=\sum_{i=1}^{n}\sum_{j=1}^{m}\alpha_{i}\beta_{j}K(\mathbf{z}_{i},\mathbf{z}_{j}). We use the symbol \|\cdot\|_{\mathbb{H}} for the norm induced by \langle\cdot,\cdot\rangle_{\mathbb{H}}.

The construction we just presented produces an RKHS and a feature map out of a kernel map. Conversely, given any Hilbert space (H,,H)(H,\left\langle\cdot,\cdot\right\rangle_{H}) and a map Φ:𝒵H\Phi:\mathcal{Z}\rightarrow H from the set 𝒵{\cal Z} into HH, we can construct a kernel map that has Φ\Phi as a feature map and HH as a feature space, that is, define

K(𝐳,𝐳):=Φ(𝐳),Φ(𝐳)H,𝐳,𝐳𝒵.K(\mathbf{z},\mathbf{z}^{\prime}):=\left<\Phi(\mathbf{z}^{\prime}),\Phi(\mathbf{z})\right>_{H},\quad{\bf z},{\bf z}^{\prime}\in\mathcal{Z}. (2)

Such a function KK is clearly symmetric and positive semi-definite; thus, it satisfies the conditions needed to be a kernel function. It can actually be proved (see [7, Theorem 4.16]) that a map K:𝒵×𝒵K:\mathcal{Z}\times\mathcal{Z}\longrightarrow\mathbb{R} is a kernel if and only if there exists a feature map that allows KK to be represented as in (2). Note that given a kernel function, the feature representation (2) is not unique in the sense that neither the feature space HH nor the feature map Φ:𝒵H\Phi:\mathcal{Z}\rightarrow H are unique. In particular, given KK, the canonical feature map and the RKHS are a choice of feature map and feature space for KK, respectively.

When a kernel function KK is defined as in (2) using a feature map Φ:𝒵H\Phi:\mathcal{Z}\rightarrow H, Theorem 4.21 in [7] proves that the corresponding RKHS \mathbb{H} is given by

={f:𝒵f(𝐳)=w,Φ(𝐳)H,wH,𝐳𝒵},\mathbb{H}=\{f:\mathcal{Z}\rightarrow\mathbb{R}\mid f(\mathbf{z})=\left<w,\Phi(\mathbf{z})\right>_{H},w\in H,\mathbf{z}\in\mathcal{Z}\}, (3)

equipped with the Hilbert norm

f:=inf{wHwH such that f()=w,Φ()H}.\left\|f\right\|_{\mathbb{H}}:=\inf\left\{\left\|w\right\|_{H}\mid w\in H\mbox{ such that }f(\cdot)=\left\langle w,\Phi(\cdot)\right\rangle_{H}\right\}. (4)

Using this fact, we prove the following lemma, which shows the equality of the RKHS spaces associated with kernels with feature maps related by a linear isomorphism.

Lemma 2.

Suppose K1K_{1} and K2K_{2} are kernels with feature spaces H1H_{1} and H2H_{2} and feature maps Φ1:𝒵H1\Phi_{1}:\mathcal{Z}\rightarrow H_{1} and Φ2:𝒵H2\Phi_{2}:\mathcal{Z}\rightarrow H_{2}, respectively. Denote the corresponding RKHSs by 1\mathbb{H}_{1} and 2\mathbb{H}_{2}. Suppose that there exists a bounded linear isomorphism L:H1H2L:H_{1}\rightarrow H_{2} such that Φ2=LΦ1\Phi_{2}=L\circ\Phi_{1}, then 1=2\mathbb{H}_{1}=\mathbb{H}_{2}.

Proof.

By the bounded inverse theorem, L1L^{-1}, which is linear, is bounded. By the Riesz representation theorem, the adjoints of LL and L1L^{-1}, denoted by LL^{*} and (L1)(L^{-1})^{*} are well defined. By (3), for f1f\in\mathbb{H}_{1}, there exists wH1w\in H_{1} such that f(𝐳)=w,Φ1(𝐳)H1f(\mathbf{z})=\left<w,\Phi_{1}(\mathbf{z})\right>_{H_{1}} for any 𝐳𝒵\mathbf{z}\in\mathcal{Z}. Then it is easy to see that for any 𝐳𝒵\mathbf{z}\in\mathcal{Z}, f(𝐳)=(L1)w,Φ2(𝐳)H2f(\mathbf{z})=\left<(L^{-1})^{*}w,\Phi_{2}(\mathbf{z})\right>_{H_{2}} so that 12\mathbb{H}_{1}\subset\mathbb{H}_{2}. The reverse inclusion is similar. ∎

Kernel universality. An important feature of some kernels is their universal approximating properties [38, 7, 8] which we now define. Suppose we have a continuous kernel map KK. For any compact subset 𝒦𝒵\mathcal{K}\subset\mathcal{Z}, define the space of kernel sections K(𝒦)K(\mathcal{K}) to be the subset of C0(𝒦)C^{0}(\mathcal{K}) given by

K(𝒦)span{K𝐳𝐳𝒦}¯.K(\mathcal{K})\coloneqq\overline{\operatorname*{\textup{span}}\{K_{\mathbf{z}}\mid\mathbf{z}\in\mathcal{K}\}}. (5)

This time around, the bar denotes the uniform closure. Note that the continuity of KK, the reproducing property, and the compactness of 𝒦{\mathcal{K}} imply that the uniform closure that defines K(𝒦)K\left({\mathcal{K}}\right) contains the completion of the vector space span{Kzz𝒵}{\rm span}\left\{K_{z}\mid z\in{\cal Z}\right\}. The continuous kernel KK is called kernel universal when for any compact subset 𝒦𝒵\mathcal{K}\subset\mathcal{Z},

K(𝒦)=C0(𝒦).K(\mathcal{K})=C^{0}(\mathcal{K}).

Many kernels used in practice are universal, e.g., the Gaussian kernel on Euclidean space.

Note that in the framework discussed in Section II.2, where the behavior of outputs is determined by causal functionals or CCIMs, whenever the corresponding functional is continuous and the inputs are defined on a compact input space, kernel universality implies that the elements of the corresponding RKHS can be used to uniformly approximate the data generating functional.

Kernel and nonlinear regressions. The universal approximation properties of RKHSs that we just described make them good candidates to serve as concept spaces in nonlinear regressions. More explicitly, suppose that \mathcal{Z}\subset\mathbb{R}^{d}, \mathcal{Y}\subset\mathbb{R}, and that a function f:\mathcal{Z}\rightarrow\mathcal{Y} needs to be estimated using a finite sample of n pairs of inputs and outputs \{(\mathbf{z}_{i},y_{i}:=f(\mathbf{z}_{i}))\}_{i=1}^{n}, where \mathbf{z}_{i}\in\mathcal{Z} and y_{i}\in\mathcal{Y}, i=1,\ldots,n. If the estimation is carried out using an empirical risk with squared loss

R^n(g):=1ni=1n(g(𝐳i)yi)2,\widehat{R}_{n}(g):=\frac{1}{n}\sum_{i=1}^{n}\left(g(\mathbf{z}_{i})-y_{i}\right)^{2},

and the hypothesis set is the RKHS \mathbb{H} corresponding to a kernel KK, the Representer Theorem states that the minimizer of the ridge regularized empirical risk on \mathbb{H}, that is,

f=argming{R^n(g)+λregg2},f^{\ast}=\operatorname*{\arg\min}_{g\in\mathbb{H}}\left\{\widehat{R}_{n}(g)+\lambda_{\textup{reg}}\|g\|_{\mathbb{H}}^{2}\right\}, (6)

for some \lambda_{\textup{reg}}>0, lies in the span of the kernel sections generated by the sample. More explicitly, the solution of (6) has the form f^{*}(\cdot)=\sum_{i=1}^{n}\alpha_{i}^{\ast}K(\cdot,\mathbf{z}_{i}), where the vector \bm{\alpha}^{\ast}=(\alpha_{1}^{\ast},\ldots,\alpha_{n}^{\ast})^{\top}\in\mathbb{R}^{n} can be determined by solving the regularized linear (Gramian) regression problem associated to the Gram matrix \mathsf{K} of the sample, that is,

min𝜶n{1n𝖪𝜶𝖸n22+λreg𝜶𝖪𝜶},\min_{\bm{\alpha}\in\mathbb{R}^{n}}\left\{\frac{1}{n}\|\mathsf{K}\bm{\alpha}-\mathsf{Y}_{n}\|_{2}^{2}+\lambda_{{\textup{reg}}}\bm{\alpha}^{\top}\mathsf{K}\bm{\alpha}\right\}, (7)

where 𝖸n:=(y1,,yn)\mathsf{Y}_{n}:=(y_{1},\ldots,y_{n})^{\top} and 2\|\cdot\|_{2} denotes the Euclidean norm. The optimization problem (7) admits the closed form solution

𝜶=(𝖪𝖪+λreg𝖪)1𝖪𝖸n.\bm{\alpha}^{*}=\left(\mathsf{K}^{\top}\mathsf{K}+\lambda_{\textup{reg}}\mathsf{K}\right)^{-1}\mathsf{K}\mathsf{Y}_{n}. (8)

The optimization problem formulated in (6) is usually referred to as a kernel ridge regression. When the kernel KK that is used to implement it has a feature map Φ:𝒵(H,,H)\Phi:{\cal Z}\rightarrow\left(H,\left\langle\cdot,\cdot\right\rangle_{H}\right) associated, the kernel ridge regression can be reformulated as a standard linear regression in which the covariates are the components of the feature map. More explicitly, due to (3) and (4), the following equality holds true:

ming{R^n(g)+λregg2}=minwH{R^n(w,Φ()H)+λregwH2}.\min_{g\in\mathbb{H}}\left\{\widehat{R}_{n}(g)+\lambda_{\textup{reg}}\|g\|_{\mathbb{H}}^{2}\right\}\\ =\min_{w\in{H}}\left\{\widehat{R}_{n}(\left\langle w,\Phi(\cdot)\right\rangle_{H})+\lambda_{\textup{reg}}\|w\|_{H}^{2}\right\}. (9)

Then if we set

w=argminwH{R^n(w,Φ()H)+λregwH2},w^{\ast}=\operatorname*{\arg\min}_{w\in{H}}\left\{\widehat{R}_{n}(\left\langle w,\Phi(\cdot)\right\rangle_{H})+\lambda_{\textup{reg}}\|w\|_{H}^{2}\right\},

we can conclude that

f()=i=1nαiK(,𝐳i)=w,Φ(),f^{*}(\cdot)=\sum_{i=1}^{n}\alpha_{i}^{\ast}K(\cdot,\mathbf{z}_{i})=\left\langle w^{\ast},\Phi(\cdot)\right\rangle, (10)

which proves our claim in relation to interpreting the kernel ridge regression as a standard linear regression with covariates given by the feature map. In particular, for new inputs 𝐳𝒵\mathbf{z}\in\mathcal{Z}, the estimator ff^{*} generates the out-of-sample outputs

y^=f(𝐳)=i=1nαiK(𝐳,𝐳i)=w,Φ(𝐳).\widehat{y}=f^{*}(\mathbf{z})=\sum_{i=1}^{n}\alpha^{*}_{i}K(\mathbf{z},\mathbf{z}_{i})=\left\langle w^{\ast},\Phi({\bf z})\right\rangle. (11)
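For concreteness, the following minimal sketch (hypothetical helper names; the polynomial kernel of Example 1 is used for illustration) implements the Gramian regression (7)-(8) and the out-of-sample prediction rule (11) with NumPy:

```python
import numpy as np

def poly_kernel(Z1, Z2, c=1.0, p=2):
    # K(z, z') = (z^T z' + c)^p for all pairs of rows of Z1 and Z2
    return (Z1 @ Z2.T + c) ** p

def kernel_ridge_fit(Z, y, kernel, lam_reg):
    # Solve (K^T K + lam_reg K) alpha = K y, i.e., the closed form (8)
    K = kernel(Z, Z)
    return np.linalg.solve(K.T @ K + lam_reg * K, K @ y)

def kernel_ridge_predict(Z_new, Z_train, alpha, kernel):
    # y_hat = sum_i alpha_i K(z_new, z_i), cf. (11)
    return kernel(Z_new, Z_train) @ alpha

# toy usage on synthetic data
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 3))
y = np.sin(Z[:, 0]) + 0.1 * rng.standard_normal(200)
alpha = kernel_ridge_fit(Z, y, poly_kernel, lam_reg=1e-6)
y_hat = kernel_ridge_predict(Z[:5], Z, alpha, poly_kernel)
```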

II.4 The NG-RC methodology

Next-generation reservoir computing, as introduced in [13], proposes to estimate a causal polynomial functional link between an explained variable yty_{t} at time tt\in\mathbb{Z} and the values of certain explanatory variables encoded in the components of 𝐳t¯τ\underline{\mathbf{z}_{t}}^{\tau} at time tt and in all the τ\tau-instants preceding it. Mathematically speaking, its estimation amounts to solving a linear regression problem that has as covariates polynomial functions of the explanatory variables in the past.

More explicitly, assume that we have a collection of nn inputs and outputs {(𝐳1,y1),,(𝐳n,yn)𝐳id,yi}\{(\mathbf{z}_{1},y_{1}),\ldots,(\mathbf{z}_{n},y_{n})\mid\mathbf{z}_{i}\in\mathbb{R}^{d},y_{i}\in\mathbb{R}\}. For each t{τ,,n}t\in\{\tau,\ldots,n\}, construct the tt-th τ\tau-delay vector

𝐳t¯τ((ztτ+1,1ztτ+1,d),,(zt,1zt,d))(d)τ.\underline{\mathbf{z}_{t}}^{\tau}\coloneqq\left(\begin{pmatrix}z_{t-\tau+1,1}\\ \vdots\\ z_{t-\tau+1,d}\end{pmatrix},\ldots,\begin{pmatrix}z_{t,1}\\ \vdots\\ z_{t,d}\end{pmatrix}\right)\in\left(\mathbb{R}^{d}\right)^{\tau}.

Re-indexing 𝐳t¯τ\underline{\mathbf{z}_{t}}^{\tau} so that

𝐳t¯τ=((zt,1zt,d),,(zt,τdd+1zt,τd)),\underline{\mathbf{z}_{t}}^{\tau}=\left(\begin{pmatrix}z_{t,1}\\ \vdots\\ z_{t,d}\end{pmatrix},\ldots,\begin{pmatrix}z_{t,\tau d-d+1}\\ \vdots\\ z_{t,\tau d}\end{pmatrix}\right), (12)

write the vector containing all kk-degree monomials of elements of 𝐳t¯τ\underline{\mathbf{z}_{t}}^{\tau} as

[𝐳t¯τ]k\displaystyle\left[\underline{\mathbf{z}_{t}}^{\tau}\right]^{k} (s=ttτ+1u=1dzs,uks,u)s,uks,u=k,ks,u0\displaystyle\coloneqq\left(\prod_{s=t}^{t-\tau+1}\prod_{u=1}^{d}z_{s,u}^{k_{s,u}}\right)_{\sum_{s,u}k_{s,u}=k,\hskip 0.70004ptk_{s,u}\geq 0}
=(s=1τdzt,skt,s)s=1τdkt,s=k,kt,1,kt,τd0.\displaystyle=\left(\prod_{s=1}^{\tau d}z_{t,s}^{k_{t,s}}\right)_{\begin{subarray}{c}\sum_{s=1}^{\tau d}k_{t,s}=k,\\ k_{t,1},\ldots k_{t,\tau d}\geq 0\end{subarray}}.

For a chosen maximum monomial order pp\in\mathbb{N}, define the feature vector Φ:(d)τN\Phi:\left(\mathbb{R}^{d}\right)^{\tau}\longrightarrow\mathbb{R}^{N} as

Φ(𝐳i¯τ)=(1,[𝐳i¯τ],[𝐳i¯τ]2,,[𝐳i¯τ]p),\Phi(\underline{\mathbf{z}_{i}}^{\tau})=(1,\left[\underline{\mathbf{z}_{i}}^{\tau}\right],\left[\underline{\mathbf{z}_{i}}^{\tau}\right]^{2},\ldots,\left[\underline{\mathbf{z}_{i}}^{\tau}\right]^{p})^{\top}, (13)

where NN denotes the dimension of the feature space, namely,

N=1+k=1p(τd+k1k)=(τd+pp).N=1+\sum_{k=1}^{p}\binom{\tau d+k-1}{k}=\binom{\tau d+p}{p}. (14)

NG-RC proposes as link between inputs and outputs the solution of the nonlinear regression that uses the components of Φ\Phi as covariates, that is, it requires solving the optimization problem:

\operatorname*{\arg\min}_{\mathbf{w}\in\mathbb{R}^{N}}\left\{\frac{1}{n-\tau+1}\sum_{i=\tau}^{n}{\left(\mathbf{w}^{\top}\Phi(\underline{\mathbf{z}_{i}}^{\tau})-y_{i}\right)}^{2}+\lambda_{\textup{reg}}\|\mathbf{w}\|^{2}_{2}\right\}. (15)

This ridge-regularized problem obviously admits a unique solution that we now explicitly write. We first collect the feature vectors and outputs into a design matrix 𝖷\mathsf{X} and an output vector 𝖸\mathsf{Y}, respectively:

𝖷[Φ(𝐳τ¯τ)Φ(𝐳n¯τ)] and 𝖸[yτyn].\mathsf{X}\coloneqq\begin{bmatrix}{\Phi(\underline{\mathbf{z}_{\tau}}^{\tau})}^{\top}\\ \vdots\\ {\Phi(\underline{\mathbf{z}_{n}}^{\tau})}^{\top}\end{bmatrix}\quad\text{ and }\quad\mathsf{Y}\coloneqq\begin{bmatrix}y_{\tau}\\ \vdots\\ y_{n}\end{bmatrix}.

The closed-form solution for the optimization problem (15) is

𝐰=(𝖷𝖷+λreg𝕀)1𝖷𝖸,\mathbf{w}^{*}={(\mathsf{X}^{\top}\mathsf{X}+\lambda_{\textup{reg}}\mathbb{I})}^{-1}\mathsf{X}^{\top}\mathsf{Y}, (16)

where \mathbb{I} denotes the identity matrix. Consequently, the forecast \widehat{y}_{t}\in\mathbb{R} of an NG-RC system at each time step t\in\mathbb{Z} corresponding to an input \underline{\mathbf{z}}\in\left(\mathbb{R}^{d}\right)^{\mathbb{Z}} is given by

y^t=(𝐰)Φ(𝐳t¯τ).\widehat{y}_{t}={(\mathbf{w}^{*})}^{\top}\Phi(\underline{\mathbf{z}_{t}}^{\tau}). (17)
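As an illustration of (13)-(17), the following sketch (hypothetical function names; monomials enumerated with itertools) builds the delay vectors, the polynomial feature map, and the ridge solution (16):

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

def ngrc_features(window, p):
    # window: flattened delay vector of length tau*d; returns (1, [.], [.]^2, ..., [.]^p), cf. (13)
    feats = [1.0]
    for k in range(1, p + 1):
        for idx in combinations_with_replacement(range(len(window)), k):
            feats.append(np.prod(window[list(idx)]))
    return np.array(feats)

def ngrc_fit(Z, y, tau, p, lam_reg):
    # Z: (n, d) inputs, y: (n,) outputs; ridge regression of y_t on Phi(z_t^tau), cf. (15)-(16)
    n, d = Z.shape
    X = np.stack([ngrc_features(Z[t - tau + 1:t + 1].ravel(), p) for t in range(tau - 1, n)])
    assert X.shape[1] == comb(tau * d + p, p)   # feature dimension N, cf. (14)
    return np.linalg.solve(X.T @ X + lam_reg * np.eye(X.shape[1]), X.T @ y[tau - 1:])

# a one-step forecast at time t is then w @ ngrc_features(Z[t - tau + 1:t + 1].ravel(), p), cf. (17)
```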

III Kernelization of NG-RC

We now show that NG-RC can be kernelized using the polynomial kernel function introduced in Example 1. The term kernelization in this context means that, along the lines of what we saw in (10), the solution (17) of the NG-RC nonlinear regression problem can be written as the solution of the kernel ridge regression problem (6) associated to the polynomial kernel. More explicitly, we shall see that the solution function f(𝐳¯τ)=(𝐰)Φ(𝐳¯τ)f^{\ast}(\underline{\mathbf{z}}^{\tau})={(\mathbf{w}^{*})}^{\top}\Phi(\underline{\mathbf{z}}^{\tau}) in (17) can be written as a linear combination of the kernel sections of KpolyK^{\textup{poly}} generated by the data, with the coefficients obtained in (7) coming from the Representer Theorem. A similar result can be obtained for kernels based on Taylor polynomials as in [7].

Proposition 3.

Consider a sample of n input/output observations \{(\mathbf{z}_{t},y_{t})\}_{t\in\{1,\ldots,n\}}, where \mathbf{z}_{t}\in\mathbb{R}^{d} and y_{t}\in\mathbb{R}. Let \tau\in\mathbb{N} be a chosen delay, let p\in\mathbb{N} be a maximum polynomial order, and let K^{\textup{poly}}:\left(\mathbb{R}^{d}\right)^{\tau}\times\left(\mathbb{R}^{d}\right)^{\tau}\longrightarrow\mathbb{R},

Kpoly(𝐳¯τ,𝐳¯τ)=(1+(𝐳¯τ)𝐳¯τ)pK^{\textup{poly}}(\underline{\mathbf{z}}^{\tau},\underline{\mathbf{z}^{\prime}}^{\tau})=(1+(\underline{\mathbf{z}}^{\tau})^{\top}\underline{\mathbf{z}^{\prime}}^{\tau})^{p} (18)

be the τ\tau-lagged polynomial kernel on τd\mathbb{R}^{\tau d}. Then, the kernel regression problem on the left-hand side of (9), corresponding to the input/output set {((𝐳τ¯τ),yτ),,((𝐳n¯τ),yn)}\{\left(\left(\underline{\mathbf{z}_{\tau}}^{\tau}\right),y_{\tau}\right),\ldots,\left(\left(\underline{\mathbf{z}_{n}}^{\tau}\right),y_{n}\right)\} and the kernel KpolyK^{\textup{poly}} has the same solution as the NG-RC optimization problem given in (15), corresponding to {(𝐳1,y1),,(𝐳n,yn)}\{(\mathbf{z}_{1},y_{1}),\ldots,(\mathbf{z}_{n},y_{n})\}. In particular, the corresponding solution functions coincide, that is,

f^{\ast}(\underline{\mathbf{z}}^{\tau})={(\mathbf{w}^{*})}^{\top}\Phi(\underline{\mathbf{z}}^{\tau})=\sum_{i=1}^{n-\tau+1}\alpha^{\ast}_{i}K^{\textup{poly}}_{\underline{\mathbf{z}_{\tau+i-1}}^{\tau}}(\underline{\mathbf{z}}^{\tau}), (19)

for any 𝐳¯τ(d)τ\underline{\mathbf{z}}^{\tau}\in\left(\mathbb{R}^{d}\right)^{\tau} and where 𝐰N\mathbf{w}^{*}\in\mathbb{R}^{N} is the NG-RC solution (16), NN as in (14), and 𝛂:=(α1,,αnτ+1)\bm{\alpha}^{\ast}:=(\alpha_{1}^{\ast},\ldots,\alpha^{\ast}_{n-\tau+1}) is the solution of the Gramian regression in (8) for KpolyK^{\textup{poly}}.

Proof.

For i{τ,,n}i\in\{\tau,\ldots,n\}, recall the reindexing of the τ\tau-delay vector 𝐳i¯τ\underline{\mathbf{z}_{i}}^{\tau} in (12) and additionally let zi,0=1z_{i,0}=1. By the multinomial theorem, for any i,j{τ,,n}i,j\in\{\tau,\ldots,n\},

Kpoly(𝐳i¯τ,𝐳j¯τ)=(1+(𝐳i¯τ)𝐳j¯τ)p=(k=0τdzi,kzj,k)p\displaystyle K^{\textup{poly}}(\underline{\mathbf{z}_{i}}^{\tau},\underline{\mathbf{z}_{j}}^{\tau})=\left(1+(\underline{\mathbf{z}_{i}}^{\tau})^{\top}\underline{\mathbf{z}_{j}}^{\tau}\right)^{p}=\left(\sum_{k=0}^{\tau d}z_{i,k}z_{j,k}\right)^{p}
\displaystyle=\sum_{\begin{subarray}{c}k_{0}+k_{1}+\ldots+k_{\tau d}=p\\ k_{0},k_{1},\ldots,k_{\tau d}\geq 0\end{subarray}}\binom{p}{k_{0},k_{1},\ldots,k_{\tau d}}\prod_{t=0}^{\tau d}z_{i,t}^{k_{t}}z_{j,t}^{k_{t}}
=1+k0=0p1k1++kτd=pk0k1,,kτd0(pk0,k1,,kτd)t=1τdzi,tktt=1τdzj,tkt\displaystyle=1+\sum_{k_{0}=0}^{p-1}\sum_{\begin{subarray}{c}k_{1}+\ldots+k_{\tau d}=p-k_{0}\\ k_{1},\ldots,k_{\tau d}\geq 0\end{subarray}}\binom{p}{k_{0},k_{1},\ldots,k_{\tau d}}\prod_{t=1}^{\tau d}z_{i,t}^{k_{t}}\prod_{t=1}^{\tau d}z_{j,t}^{k_{t}}
=(cΦ(𝐳i¯τ))(cΦ(𝐳j¯τ)),\displaystyle=\left(c\odot\Phi(\underline{\mathbf{z}_{i}}^{\tau})\right)^{\top}\left(c\odot\Phi(\underline{\mathbf{z}_{j}}^{\tau})\right), (20)

where Φ:(d)τN\Phi:\left(\mathbb{R}^{d}\right)^{\tau}\longrightarrow\mathbb{R}^{N} is the NG-RC feature map introduced in (13), \odot denotes component-wise (Hadamard) multiplication, and the constant vector cc is given by

c=\left(\sqrt{\binom{p}{k_{0},k_{1},\ldots,k_{\tau d}}}\right)^{\top}_{\begin{subarray}{c}k_{0}+k_{1}+\ldots+k_{\tau d}=p\\ k_{0},k_{1},\ldots,k_{\tau d}\geq 0\end{subarray}}\in\mathbb{R}^{N}. (21)

Notice that the relation (20) implies that the map c\odot\Phi:\left(\mathbb{R}^{d}\right)^{\tau}\longrightarrow\mathbb{R}^{N} is a feature map for the kernel K^{\textup{poly}}. Additionally, since N<\infty, the component-wise product with the vector c\in\mathbb{R}^{N} is a bounded linear isomorphism: it is represented by the N\times N diagonal matrix with the (strictly positive) elements of c on the diagonal, which is hence invertible.

Define the kernel function KNG-RC:(d)τ×(d)τK^{\textup{NG-RC}}:\left(\mathbb{R}^{d}\right)^{\tau}\times\left(\mathbb{R}^{d}\right)^{\tau}\longrightarrow\mathbb{R} obtained out of the dot product of the NG-RC feature vector (13), that is,

KNG-RC(𝐳¯τ,𝐳¯τ)Φ(𝐳¯τ)Φ(𝐳¯τ),K^{\textup{NG-RC}}(\underline{\mathbf{z}}^{\tau},\underline{\mathbf{z}^{\prime}}^{\tau})\coloneqq\Phi(\underline{\mathbf{z}}^{\tau})^{\top}\Phi(\underline{\mathbf{z}^{\prime}}^{\tau}),

which is obviously a kernel function because the dot product is symmetric and positive semi-definite.

By the Moore-Aronszajn Theorem, K^{\textup{poly}} and K^{\textup{NG-RC}} have unique associated RKHSs \mathbb{H}_{\textup{poly}} and \mathbb{H}_{\textup{NG-RC}}, respectively. Since their feature maps c\odot\Phi and \Phi are related by a bounded linear isomorphism, Lemma 2 implies that

poly=NG-RC.\mathbb{H}_{\textup{poly}}=\mathbb{H}_{\textup{NG-RC}}. (22)

This implies that the kernel regression problems (6) associated with KpolyK^{\textup{poly}} and KNG-RCK^{\textup{NG-RC}} are identical and hence have the same solution. This observation, combined with the identity (9), proves the statement. ∎

Example 4.

In the setup of the previous proposition, consider the case d=1d=1, τ=2\tau=2, and p=2p=2. In that situation, the two kernels in the previous discussion are given by:

Kpoly(𝐳i¯τ,𝐳j¯τ)=(1+(𝐳i¯τ)𝐳j¯τ)2=(1+(zi,zi1)(zj,zj1))2=1+2zizj+2zi1zj1+zi2zj2+zi12zj12+2zizjzi1zj1K^{\textup{poly}}(\underline{\mathbf{z}_{i}}^{\tau},\underline{\mathbf{z}_{j}}^{\tau})\\ =(1+(\underline{\mathbf{z}_{i}}^{\tau})^{\top}\underline{\mathbf{z}_{j}}^{\tau})^{2}=\left(1+(z_{i},z_{i-1})(z_{j},z_{j-1})^{\top}\right)^{2}\\ =1+2z_{i}z_{j}+2z_{i-1}z_{j-1}+z_{i}^{2}z_{j}^{2}+z_{i-1}^{2}z_{j-1}^{2}+2z_{i}z_{j}z_{i-1}z_{j-1}

and given that the NG-RC map is given by:

\Phi(\underline{\mathbf{z}_{i}}^{\tau})=(1,z_{i},z_{i-1},z_{i}^{2},z_{i}z_{i-1},z_{i-1}^{2})^{\top},

we have that

K^{\textup{NG-RC}}(\underline{\mathbf{z}_{i}}^{\tau},\underline{\mathbf{z}_{j}}^{\tau})=\Phi(\underline{\mathbf{z}_{i}}^{\tau})^{\top}\Phi(\underline{\mathbf{z}_{j}}^{\tau})\\ =1+z_{i}z_{j}+z_{i-1}z_{j-1}+z_{i}^{2}z_{j}^{2}+z_{i}z_{j}z_{i-1}z_{j-1}+z_{i-1}^{2}z_{j-1}^{2}.

As we already pointed out in the proof of Proposition 3, the same monomials appear in KpolyK^{\textup{poly}} and KNG-RCK^{\textup{NG-RC}}, which only differ by constants.
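To make Example 4 concrete, the following quick numerical check (a sketch with hypothetical names, for d=1, \tau=2, p=2) verifies the identity (20), that is, that K^{\textup{poly}} coincides with the dot product of the c-weighted NG-RC features:

```python
import numpy as np

def phi(z):
    # NG-RC feature vector of Example 4 for d = 1, tau = 2, p = 2
    zi, zim1 = z
    return np.array([1.0, zi, zim1, zi ** 2, zi * zim1, zim1 ** 2])

# multinomial weights from (21), ordered as the features above
c = np.sqrt(np.array([1.0, 2.0, 2.0, 1.0, 2.0, 1.0]))

rng = np.random.default_rng(1)
x, y = rng.standard_normal(2), rng.standard_normal(2)
k_poly = (1.0 + x @ y) ** 2                # K^poly of Example 4
k_feat = (c * phi(x)) @ (c * phi(y))       # (c ⊙ Φ(x))^T (c ⊙ Φ(y)), cf. (20)
assert np.isclose(k_poly, k_feat)
```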

Another important observation that is visible in these expressions is the difference in computational complexity between NG-RC and its kernelized version introduced in Proposition 3. Recall that NG-RC produces the vector \mathbf{w}^{\ast}\in\mathbb{R}^{N} in (19) while the polynomial kernel regression yields \bm{\alpha}^{\ast}\in\mathbb{R}^{n-\tau+1}. NG-RC is a nonlinear regression on the components of the (six-dimensional in this case) feature map \Phi, and to carry it out, all those components have to be evaluated at all the data points; on the contrary, the kernelized version only requires the evaluation of K^{\textup{poly}} at the data points, which is computationally simpler. The kernelized version of NG-RC is, hence, computationally more efficient. This difference becomes even more visible as \tau, d, and p increase since, by (14), the number of covariates in the NG-RC regression grows rapidly with those parameters, while the computational complexity of the polynomial kernel K^{\textup{poly}} behaves much more favorably. This difference in computational performance between the two approaches will be analyzed more rigorously in Section V.2.

IV Infinite-dimensional NG-RC and Volterra kernels

Apart from the computational efficiency associated with the kernelized version of NG-RC, this approach allows for an extension of the methodology that would be impossible in its original feature map-based version. More explicitly, in this section, we will see that by pursuing the kernel approach, NG-RC can be extended to the limiting cases \tau\rightarrow\infty, p\rightarrow\infty, hence taking into account infinitely many lags into the past and monomials of arbitrarily high degree in the input series. This is a valuable feature in modeling situations in which one is obliged to remain agnostic with respect to \tau and p. The natural tool to carry this out is the Volterra kernel introduced in [14], which is, roughly speaking, an infinite-lag and infinite-monomial-degree counterpart of the polynomial kernel and which we recall in the following paragraphs.

Let \pi_{t}:\left(\mathbb{R}^{d}\right)^{\mathbb{Z}_{-}}\longrightarrow\mathbb{R}^{d}, t\geq 0, be the projection operator onto the (-t)-th term of a semi-infinite sequence, that is, \pi_{t}(\underline{\mathbf{z}_{i}})=\mathbf{z}_{i-t}. Given \tau\in\mathbb{N}, define the \tau-time-delay operator T_{\tau}:\left(\mathbb{R}^{d}\right)^{\mathbb{Z}_{-}}\longrightarrow\left(\mathbb{R}^{d}\right)^{\mathbb{Z}_{-}} by \pi_{t}\left(T_{\tau}(\underline{\mathbf{z}})\right)\coloneqq\pi_{t+\tau}\left(\underline{\mathbf{z}}\right) for all t\geq 0. Given M>0, define the space of M-bounded inputs by K_{M}\coloneqq\{\underline{\mathbf{z}}\in(\mathbb{R}^{d})^{\mathbb{Z}_{-}}\mid\|\pi_{t}\left(\underline{\mathbf{z}}\right)\|_{2}\leq M\text{ for all }t\geq 0\}. Choose \theta>0 such that \theta^{2}M^{2}<1 and choose some \lambda such that 0<\lambda<\sqrt{1-\theta^{2}M^{2}}. Define the Volterra kernel K^{\textup{Volt}}:K_{M}\times K_{M}\longrightarrow\mathbb{R} by the recursion

KVolt(𝐳¯,𝐳¯)=1+λ2KVolt(T1(𝐳¯),T1(𝐳¯))1θ2π0(𝐳¯),π0(𝐳¯).K^{\textup{Volt}}(\underline{\mathbf{z}},\underline{\mathbf{z}^{\prime}})=1+\lambda^{2}\frac{K^{\textup{Volt}}(T_{1}(\underline{\mathbf{z}}),T_{1}(\underline{\mathbf{z}^{\prime}}))}{1-\theta^{2}\left<\pi_{0}(\underline{\mathbf{z}}),\pi_{0}(\underline{\mathbf{z}^{\prime}})\right>}. (23)

The rationale behind this recursion is the definition of the Volterra kernel proposed in [14] as the kernel associated with a feature map obtained as the unique solution of a certain state space equation in an infinite-dimensional tensor space. The recursion in that state space equation implies the defining recursion in (23). Alternatively, the Volterra kernel can be introduced by writing the unique solution of (23), namely:

K^{\textup{Volt}}(\underline{\mathbf{z}},\underline{\mathbf{z}^{\prime}})=1+\sum_{\tau=1}^{\infty}\lambda^{2\tau}\frac{1}{1-\theta^{2}\langle\pi_{0}(\underline{\mathbf{z}}),\pi_{0}(\underline{\mathbf{z}^{\prime}})\rangle}\frac{1}{1-\theta^{2}\langle\pi_{1}(\underline{\mathbf{z}}),\pi_{1}(\underline{\mathbf{z}^{\prime}})\rangle}\cdots\frac{1}{1-\theta^{2}\langle\pi_{\tau-1}(\underline{\mathbf{z}}),\pi_{\tau-1}(\underline{\mathbf{z}^{\prime}})\rangle}. (24)

It can be verified that the Volterra kernel is a kernel map on the space of semi-infinite sequences with real entries in the sense discussed in Section II.3.

The Volterra kernel as an infinite order and lag polynomial kernel. Observe that the τ\tau-lagged polynomial kernel map (18) can be rewritten as

Kpoly(𝐳i¯τ,𝐳j¯τ)=(1+(𝐳i¯τ)𝐳j¯τ)p\displaystyle K^{\textup{poly}}(\underline{\mathbf{z}_{i}}^{\tau},\underline{\mathbf{z}_{j}}^{\tau})=(1+(\underline{\mathbf{z}_{i}}^{\tau})^{\top}\underline{\mathbf{z}_{j}}^{\tau})^{p}
=(1+𝐳i𝐳j++𝐳iτ+1𝐳jτ+1)p=\displaystyle=(1+\mathbf{z}_{i}^{\top}\mathbf{z}_{j}+\ldots+\mathbf{z}_{i-\tau+1}^{\top}\mathbf{z}_{j-\tau+1})^{p}=
k=0pk0++kτ1=kk0,,kτ10(ppk,k0,,kτ1)t=0τ1(𝐳it𝐳jt)kt(),\displaystyle\sum_{k=0}^{p}\underbrace{\sum_{\begin{subarray}{c}k_{0}+\ldots+k_{\tau-1}=k\\ k_{0},\ldots,k_{\tau-1}\geq 0\end{subarray}}\binom{p}{p-k,k_{0},\ldots,k_{\tau-1}}\prod_{t=0}^{\tau-1}(\mathbf{z}_{i-t}^{\top}\mathbf{z}_{j-t})^{k_{t}}}_{(*)},

where we notice that the term marked with (*) is a weighted sum of all the monomials of order k in the inner products \mathbf{z}_{i}^{\top}\mathbf{z}_{j},\ldots,\mathbf{z}_{i-\tau+1}^{\top}\mathbf{z}_{j-\tau+1}. This expression exhibits the polynomial kernel as a polynomial of finite degree p in the components of the inputs up to the finite lag \tau.

Rewriting (24) using the geometric series, we have for 𝐳i¯,𝐳j¯KM(d)\underline{\mathbf{z}_{i}},\underline{\mathbf{z}_{j}}\in K_{M}\subset(\mathbb{R}^{d})^{\mathbb{Z}_{-}},

KVolt(𝐳i¯,𝐳j¯)=1+τ=1λ2τt=0τ1k=0(θ2𝐳it𝐳jt)k=1+τ=1k=0λ2τθ2kk0++kτ1=kk0,,kτ10t=0τ1(𝐳it𝐳jt)kt(),K^{\textup{Volt}}(\underline{\mathbf{z}_{i}},\underline{\mathbf{z}_{j}})=1+\sum_{\tau=1}^{\infty}\lambda^{2\tau}\prod_{t=0}^{\tau-1}\sum_{k=0}^{\infty}\left(\theta^{2}\mathbf{z}_{i-t}^{\top}\mathbf{z}_{j-t}\right)^{k}\\ =1+\sum_{\tau=1}^{\infty}\sum_{k=0}^{\infty}\lambda^{2\tau}\theta^{2k}\underbrace{\sum_{\begin{subarray}{c}k_{0}+\ldots+k_{\tau-1}=k\\ k_{0},\ldots,k_{\tau-1}\geq 0\end{subarray}}\prod_{t=0}^{\tau-1}(\mathbf{z}_{i-t}^{\top}\mathbf{z}_{j-t})^{k_{t}}}_{(*)},

where ()(*) is again a sum of all monomials of order kk on variables similar to the expression for KpolyK^{\textup{poly}}. However, in contrast to the polynomial kernel, note that in this case, we are taking monomial combinations of arbitrarily high degree and lags with respect to 𝐳i¯\underline{\mathbf{z}_{i}} and 𝐳j¯\underline{\mathbf{z}_{j}}. This implies that the Volterra kernel considers additional functional and temporal information about the input, which allows us to use it in situations where we have to remain agnostic about the number of lags and the degree of monomials that need to be used.

Infinite-dimensionality and universality. The discussion above hints that the Volterra kernel can be understood as the kernel induced by the feature map (20) associated with the polynomial kernel, but extended to an infinite-dimensional codomain capable of accommodating all powers and lags of the input variables. This statement has been made rigorous in [14], where the Volterra kernel was constructed out of an infinite-dimensional tensor feature space, which, in particular, makes it universal in the space of continuous functions defined on uniformly bounded semi-infinite sequences. This implies that any continuous data-generating functional with uniformly bounded inputs can be uniformly approximated by elements in the RKHS generated by the Volterra kernel. This is detailed in the following theorem proved in  [14]. The statement uses the notation introduced in (5).

Theorem 5.

Let KVolt:KM×KMK^{\textup{Volt}}:K_{M}\times K_{M}\longrightarrow\mathbb{R} be the Volterra kernel given by (24) and let KVolt(KM)K^{\textup{Volt}}(K_{M}) be the associated space of kernel sections. Then

KVolt(KM)=C0(KM).K^{\textup{Volt}}(K_{M})=C^{0}(K_{M}).

In contrast, the polynomial kernel (equivalently NG-RC) is not universal (see [46]). These arguments suggest that the Volterra kernel should outperform polynomial kernel regressions and the NG-RC in its ability to, for example, learn complex systems. This will indeed be illustrated in numerical simulations in Section V.

Computation of Volterra Gramians. Even though the Volterra kernel is defined in the space of semi-infinite sequences, in applications, only finite samples of size nn of the form {(𝐳i,yi)}i{1,,n}\{(\mathbf{z}_{i},y_{i})\}_{i\in\{1,\ldots,n\}} are available. In that situation, it is customary to construct semi-infinite inputs of the form 𝐳i¯=(,0,0,𝐳1,,𝐳i1,𝐳i)(d)\underline{\mathbf{z}_{i}}=(\ldots,0,0,\mathbf{z}_{1},\ldots,\mathbf{z}_{i-1},\mathbf{z}_{i})\in(\mathbb{R}^{d})^{\mathbb{Z}_{-}}, for each i{1,,n}i\in\{1,...,n\}, and we then define for that sample the Volterra Gram matrix 𝖪Voltn×n\mathsf{K}^{\textup{Volt}}\in\mathbb{R}^{n\times n} as

𝖪i,jVolt=KVolt(𝐳i¯,𝐳j¯),i,j{1,,n}.\mathsf{K}^{\textup{Volt}}_{i,j}=K^{\textup{Volt}}(\underline{\mathbf{z}_{i}},\underline{\mathbf{z}_{j}}),\quad i,j\in\{1,\ldots,n\}.

Due to the recursive nature of the kernel map introduced in (23), the entries of the Gram matrix can be computed also recursively by

\mathsf{K}^{\textup{Volt}}_{i,j}=1+\frac{\lambda^{2}\mathsf{K}^{\textup{Volt}}_{i-1,j-1}}{1-\theta^{2}\left<\mathbf{z}_{i},\mathbf{z}_{j}\right>}, (25)

where 𝖪0,0Volt=𝖪i,0Volt=1/(1θ2)\mathsf{K}^{\textup{Volt}}_{0,0}=\mathsf{K}^{\textup{Volt}}_{i,0}=1/(1-\theta^{2}) for all i{1,,n}i\in\{1,\ldots,n\}.

We recall now that, due to the Representer Theorem, the learning problem (6) associated with the squared loss can be solved using the Gramian that we just constructed by computing (7). Moreover, the solution ff^{\ast} has the form f()=i=1nαiK𝐳i¯Volt()f^{\ast}(\cdot)=\sum_{i=1}^{n}\alpha_{i}^{\ast}K^{\textup{Volt}}_{\underline{\mathbf{z}_{i}}}(\cdot), with α1,,αn\alpha_{1}^{\ast},\ldots,\alpha_{n}^{\ast} given by (8).

For a newly available set of inputs \mathbf{z}_{n+1},\ldots,\mathbf{z}_{n+h}, the estimator can be used to forecast outputs \widehat{y}_{n+1},\ldots,\widehat{y}_{n+h}. Extend the Gram matrix to a rectangular matrix \mathsf{K}^{\textup{Volt}}\in\mathbb{R}^{n\times(n+h)} by using the recursion

\mathsf{K}^{\textup{Volt}}_{i,n+j}=1+\frac{\lambda^{2}\mathsf{K}^{\textup{Volt}}_{i-1,n+j-1}}{1-\theta^{2}\left<\mathbf{z}_{i},\mathbf{z}_{n+j}\right>}

that can be initialized by 𝖪0,n+jVolt=1/(1θ2)\mathsf{K}^{\textup{Volt}}_{0,n+j}=1/(1-\theta^{2}) for all j{1,,h}j\in\{1,\ldots,h\}. Then the forecasted outputs are

\widehat{y}_{n+j}=\sum_{i=1}^{n}\alpha_{i}^{\ast}K^{\textup{Volt}}_{\underline{\mathbf{z}_{i}}}(\underline{\mathbf{z}_{n+j}})=\sum_{i=1}^{n}\alpha_{i}^{\ast}\mathsf{K}_{i,n+j}^{\textup{Volt}},\quad j\in\{1,\ldots,h\}.
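Putting the pieces together, the following sketch (hypothetical helper volterra_gram and toy data; recursions and boundary initialization as in (25) and the rectangular extension above) computes the Volterra Gramian, fits the coefficients via (8), and produces the forecasts:

```python
import numpy as np

def volterra_gram(Z_train, Z_new, lam_v, theta):
    # Recursive computation of the square and rectangular Volterra Gram blocks, cf. (25),
    # with the boundary row and column initialized to 1/(1 - theta^2) as in the text.
    Z_all = np.vstack([Z_train, Z_new])                 # (n + h, d)
    n, m = len(Z_train), len(Z_all)
    K = np.full((n + 1, m + 1), 1.0 / (1.0 - theta ** 2))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            denom = 1.0 - theta ** 2 * (Z_train[i - 1] @ Z_all[j - 1])
            K[i, j] = 1.0 + lam_v ** 2 * K[i - 1, j - 1] / denom
    return K[1:, 1:n + 1], K[1:, n + 1:]                # training block, forecasting block

# toy usage: inputs scaled so that theta^2 M^2 < 1 and lambda < sqrt(1 - theta^2 M^2)
rng = np.random.default_rng(2)
Z_tr, Z_te = rng.uniform(-0.5, 0.5, (300, 2)), rng.uniform(-0.5, 0.5, (20, 2))
y_tr = np.sin(Z_tr[:, 0])
K_tr, K_te = volterra_gram(Z_tr, Z_te, lam_v=0.6, theta=0.5)
alpha = np.linalg.solve(K_tr.T @ K_tr + 1e-6 * K_tr, K_tr @ y_tr)   # cf. (8)
y_hat = K_te.T @ alpha                                  # y_hat_{n+j} = sum_i alpha_i K_{i,n+j}
```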

V Numerics

V.1 Data generating processes and experimental setup

Simulations were performed, using each of the three estimators discussed in the paper, on three dynamic processes: the Lorenz autonomous dynamical system, the Mackey-Glass delay differential equation, and the Baba-Engle-Kraft-Kroner (BEKK) input/output system.

For the Lorenz and Mackey-Glass dynamical systems, the task consisted of performing the usual path-continuation. During training, the inputs are the spatial coordinates at time t and the outputs are the spatial coordinates at time t+1, so the estimators are trained on a collection of n input/output pairs \{(\mathbf{z}_{t},\mathbf{z}_{t+1})\}_{t=1}^{n}. To test their performance, the estimators are run autonomously. That is, after seeing the initial input \mathbf{z}_{n+1}, the outputs \widehat{\mathbf{z}}_{n+2},\ldots,\widehat{\mathbf{z}}_{n+h+1}, for some forecasting horizon h, are fed back into the estimator as inputs. These outputs \widehat{\mathbf{z}}_{n+j} for j=2,\ldots,h+1 are compared against the reserved set of testing values \mathbf{z}_{n+j} for j=2,\ldots,h+1, unseen by the estimators.

For the BEKK input/output system, the goal is to perform input/output forecasting. That is, during training, each estimator is given a set of inputs {𝐳t}t=1n\{\mathbf{z}_{t}\}_{t=1}^{n} and fitted against a set of outputs {𝐲t}t=1n\{\mathbf{y}_{t}\}_{t=1}^{n}. Then, during testing, given a new set of unseen inputs {𝐳t}t=n+1n+h\{\mathbf{z}_{t}\}_{t=n+1}^{n+h}, the outputs of the estimator {𝐲^t}t=n+1n+h\{\widehat{\mathbf{y}}_{t}\}_{t=n+1}^{n+h} are compared against the actual outputs {𝐲t}t=n+1n+h\{\mathbf{y}_{t}\}_{t=n+1}^{n+h}, unseen by the estimator.

The Lorenz system is a three-dimensional system of ordinary differential equations used to model atmospheric convection [29]. The following Lorenz system

x˙=10(xy),y˙=28xyxz,z˙=83z+xy\displaystyle\dot{x}=-10(x-y),\quad\dot{y}=28x-y-xz,\quad\dot{z}=-\frac{8}{3}z+xy

with the initial conditions

x0=0,y0=1,z0=1.05x_{0}=0,\quad y_{0}=1,\quad z_{0}=1.05

was chosen. A discrete-time dynamical system was derived using Runge-Kutta 4(5) (RK45) numerical integration with time step 0.005 to simulate a trajectory with 15001 points. The first 5000 points were reserved for training, and the remaining points were reserved for testing. By the celebrated Takens' embedding theorem [45], the Lorenz dynamical system admits a functional that expresses each observation as a function of seven of its past observations. Thus, the Lorenz dynamical system lies within the premise discussed in Section II.2.
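A sketch of how such a trajectory could be generated with SciPy (the integration tolerances here are our own choices, and solve_ivp uses adaptive internal steps that are only sampled on the 0.005 grid):

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, s):
    x, y, z = s
    return [-10.0 * (x - y), 28.0 * x - y - x * z, -8.0 / 3.0 * z + x * y]

dt, n_points = 0.005, 15001
t_eval = np.arange(n_points) * dt
sol = solve_ivp(lorenz, (0.0, t_eval[-1]), [0.0, 1.0, 1.05],
                method="RK45", t_eval=t_eval, rtol=1e-9, atol=1e-9)
Z = sol.y.T                                   # (15001, 3) trajectory
Z_train, Z_test = Z[:5000], Z[5000:]          # 5000 training points, rest for testing
```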

The Mackey-Glass equation is a first-order nonlinear delay differential equation (DDE) describing physiological control systems given by [33]. We chose the following instance of the Mackey-Glass equation

z˙=0.2z(t17)1+z(t17)100.1z(t)\dot{z}=\frac{0.2z(t-17)}{1+{z(t-17)}^{10}}-0.1z(t)

with the initial condition function being the constant function z(t)=1.2. To numerically solve this DDE, the delay interval was discretized with a time step of 0.02, and the usual RK45 procedure was then performed on the discretized version of the system. The resulting dataset was flattened back into a one-dimensional dataset and, to reduce its size, subsampled by taking every 50th data point. The final dataset consisted of 7650 points, of which the first 3000 were reserved for training. The remaining points were reserved to compare against the path-continued outputs of each estimator. Due to the discretization process, the differential equation becomes a system of equations in which each z_{t} is a function of past observations, as is assumed by our premise in Section II.2.

The BEKK model is an input/output parametric time series model that is used in financial econometrics in the forecasting of the conditional covariances of returns of stocks traded in the financial markets [12]. We consider dd assets and the BEKK(1, 0, 1) model for their log-returns 𝐫t\mathbf{r}_{t} and associated conditional covariances Ht=Cov(𝐫t𝐫t1,𝐫t2,)H_{t}=\textup{Cov}(\mathbf{r}_{t}\mid\mathbf{r}_{t-1},\mathbf{r}_{t-2},\ldots) given by

𝐫t\displaystyle\mathbf{r}_{t} =Ht1/2𝐳t,𝐳tIIDN(0d,𝕀d)\displaystyle=H_{t}^{1/2}\mathbf{z}_{t},\quad\mathbf{z}_{t}\sim\textup{IIDN}(0_{d},\mathbb{I}_{d})
Ht\displaystyle H_{t} =CC+A𝐫t1𝐫t1A+BHt1B,\displaystyle=CC^{\top}+A\mathbf{r}_{t-1}\mathbf{r}_{t-1}^{\top}A^{\top}+BH_{t-1}B^{\top},

where the input innovations 𝐳t\mathbf{z}_{t} are Gaussian IID, and the output observations are the conditional covariances HtH_{t}. The diagonal BEKK specification is chosen where CC is an upper-triangular matrix, and AA and BB are diagonal matrices of dimension dd. It is known that there exists a unique stationary and ergodic solution whenever Aii>0,|Bii|<1A_{ii}>0,|B_{ii}|<1 for i=1,,di=1,\ldots,d, which expresses HtH_{t} as a highly nonlinear function of its past inputs [6] (as per our premise in Section II.2). Since the covariance matrices are symmetric, we only need to learn the outputs 𝐡t=vech(Ht)q\mathbf{h}_{t}=\textup{vech}(H_{t})\in\mathbb{R}^{q}, qd(d+1)/2q\coloneqq d(d+1)/2, where the vech operator stacks the columns of a given square matrix from the principal diagonal downwards. An existing dataset from [14] was used with input dimensions of 15 and output dimensions of 120. The dataset consisted of 3760 input and output points, the first 3007 were reserved for training, and the remaining 753 were reserved for testing. Since the output points were very small, to minimize loss of accuracy due to computational truncation errors, the output values were scaled by 1000. The training output data was further normalized so that each dimension would have 0 mean and variance of 1.
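For readers who want to generate synthetic data of this type, here is a minimal simulation sketch of a diagonal BEKK(1,0,1) process (toy dimensions and parameter values chosen by us, not those of the dataset used in the paper; the Cholesky factor is taken as one choice of H_{t}^{1/2}):

```python
import numpy as np

def simulate_bekk(T, C, A_diag, B_diag, rng):
    # Diagonal BEKK(1,0,1): H_t = C C^T + A r_{t-1} r_{t-1}^T A^T + B H_{t-1} B^T, r_t = H_t^{1/2} z_t
    d = C.shape[0]
    A, B = np.diag(A_diag), np.diag(B_diag)
    H, r = C @ C.T, np.zeros(d)
    Z, vech_H = [], []
    for _ in range(T):
        H = C @ C.T + A @ np.outer(r, r) @ A.T + B @ H @ B.T
        z = rng.standard_normal(d)
        r = np.linalg.cholesky(H) @ z
        Z.append(z)
        vech_H.append(np.concatenate([H[j:, j] for j in range(d)]))   # vech(H_t)
    return np.array(Z), np.array(vech_H)

# toy usage with d = 2 assets
rng = np.random.default_rng(3)
C = np.array([[0.3, 0.1], [0.0, 0.3]])          # upper-triangular
Z, Y = simulate_bekk(3760, C, A_diag=[0.3, 0.25], B_diag=[0.9, 0.85], rng=rng)
```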

Since the NG-RC methodology does not typically require normalization, the NG-RC datasets were not normalized. For the polynomial kernel, kernel values could become too large and result in truncation inaccuracies, so the training inputs were normalized to have a maximum of 1 and a minimum of 0. Due to the construction of the Volterra kernel, the semi-infinite input sequences fed to the kernel are, for a finite sample, padded with zeros. We thus demean the training input data. Moreover, to avoid incurring truncation errors for the \theta and \lambda values, the maximum Euclidean norm of the inputs M is set to 1 by scaling the training input values. Note that for all estimators, normalization is always performed based only on information from the training values; the testing data is then shifted and scaled based on what was used for the training data. This prevents leakage of information such as the mean, standard deviation, maximum, minimum, etc., to the testing datasets.

V.2 Time complexities

Following [39, page 280], we compute the time complexities for the NG-RC, polynomial kernel regression and Volterra kernel regression. We assume for both kernel regressions, as per the procedure used in the numerical simulations, that the usual Euclidean dot product was used.

In each forecasting scheme, training involves computing the closed-form solutions (16) for the NG-RC and (8) for the polynomial and Volterra kernel regressions. For the NG-RC, computing \mathsf{X}^{\top}\mathsf{X} takes O(nN^{2}) steps, recalling the definition of N given in (14), and the matrix inversion in (16) is O(N^{3}). The remaining matrix multiplications have complexities dominated by O(nN^{2}). Thus, the final complexity for training weights in NG-RC is O(nN^{2}+N^{3}). On the other hand, for the polynomial or Volterra kernel regressions, one needs to compute the kernel map for each entry of the Gram matrix. For the polynomial kernel, in view of (18), this is O(\tau d), and for the Volterra kernel map, in view of (25), this is O(d). Computing the Gram matrix is hence O(n^{2}\tau d) and O(n^{2}d) for the polynomial and Volterra kernel regressions, respectively, and solving the regularized Gramian regression (8) adds O(n^{3}) in both cases. Forecasting involves computing (17) for the NG-RC and (11) for the polynomial and Volterra kernel regressions. For each time step, the complexity is O(N) for the NG-RC, O(n\tau d) for the polynomial kernel regression, and O(nd) for the Volterra kernel regression.

The combinatorial NN term for the NG-RC can be bounded above by (p+τd)κ(p+\tau d)^{\kappa} where κ=min(p,τd)\kappa=\min(p,\tau d). Thus, in big-O notation, the NN term can be replaced by (p+τd)κ(p+\tau d)^{\kappa}. It can then be seen that when the sample size is small, and when τ\tau and pp need not be large, the NG-RC will be faster than the polynomial and Volterra kernels. However, as τ\tau and pp grow, as is needed to learn more complex dynamical systems, the complexity for NG-RC grows exponentially, and the polynomial and Volterra kernel regressions will outperform the NG-RC significantly. For each of the forecasting schemes, the complexities associated with the training and generation of a single prediction are given in Table 1.

Table 1: Time complexities for all forecasting schemes discussed in this paper. Denote κ=min(p,τd)\kappa=\min(p,\tau d).
Estimator Training Prediction
NG-RC O(n(p+τd)2κ+(p+τd)3κ)O(n{(p+\tau d)^{2\kappa}}+{(p+\tau d)^{3\kappa}}) O((p+τd)κ)O({(p+\tau d)^{\kappa}})
Polynomial O(n2τd+n3)O({n^{2}}\tau d+n^{3}) O(nτd)O(n\tau d)
Volterra O(n2d+n3)O(n^{2}d+n^{3}) O(nd)O(nd)
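As a rough, constants-ignored illustration of how these bounds scale (parameter values chosen by us):

```python
from math import comb

n, d = 5000, 3
for tau, p in [(3, 2), (7, 4), (15, 6)]:
    N = comb(tau * d + p, p)                 # NG-RC feature dimension, cf. (14)
    ngrc = n * N ** 2 + N ** 3               # O(n N^2 + N^3)
    poly = n ** 2 * tau * d + n ** 3         # O(n^2 tau d + n^3)
    volt = n ** 2 * d + n ** 3               # O(n^2 d + n^3)
    print(f"tau={tau}, p={p}: N={N}, NG-RC~{ngrc:.1e}, poly~{poly:.1e}, Volterra~{volt:.1e}")
```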

V.3 Cross-validation

For each estimator, hyperparameters have to be selected. The hyperparameters that were cross-validated and the chosen values are given in Table 2.

Table 2: Chosen parameters for each estimator and dataset
System Estimator Washout Hyperparameters
Lorenz NG-RC 3 (τ=3,p=2,λreg=1×107)(\tau=3,p=2,\lambda_{\textup{reg}}=1\times 10^{-7})
Polynomial 6 (τ=6,p=2,λreg=1×106)({\tau=6,p=2},\lambda_{\textup{reg}}=1\times 10^{-6})
Volterra 100 (λ0.286,θ=0.3,λreg=1×1010)(\lambda\approx 0.286,\theta=0.3,\lambda_{\textup{reg}}=1\times 10^{-10})
Mackey-Glass NG-RC 4 (τ=4,p=5,λreg=1×107)(\tau=4,p=5,\lambda_{\textup{reg}}=1\times 10^{-7})
Polynomial 17 ({\tau=17,p=4},\lambda_{\textup{reg}}=1\times 10^{-5})
Volterra 100 (λ0.859,θ=0.3,λreg=1×109)(\lambda\approx 0.859,\theta=0.3,\lambda_{\textup{reg}}=1\times 10^{-9})
BEKK NG-RC 1 (τ=1,p=2,λreg=0.1)({\tau=1},p=2,\lambda_{\textup{reg}}=0.1)
Polynomial 1 (τ=1,p=2,λreg=0.1)({\tau=1},p=2,\lambda_{\textup{reg}}=0.1)
Volterra 100 (λ=0.72,θ=0.6,λreg=1×103)(\lambda=0.72,\theta=0.6,\lambda_{\textup{reg}}=1\times 10^{-3})

Note that the washout was not cross-validated. For both the NG-RC and the polynomial kernel regression, the washout is the number of delays taken. For the Volterra kernel, a longer washout is needed to wash out the effect of padding the input samples with zeros. A washout of 100 was sufficient to generate meaningful results both for the full training set and when the training sets were restricted during cross-validation.

For the path-continuation tasks (Lorenz and Mackey-Glass), hyperparameters were selected by cross-validation performed by splitting each training set into overlapping training-validation folds. During validation, path-continuation was performed and compared with the validation set; that is, the outputs of each estimator were fed back as inputs, and these autonomously generated outputs were compared with the validation set. With overlapping datasets, a smaller training set suffices to create multiple training folds starting from different initial points, so that in each training fold the estimator has sufficient time to capture the dynamics during training. Then, during the validation phase of each fold, a good estimator exhibits genuine dynamical evolution in its outputs (that is, it does not simply fit the average). This leads to meaningful validation-set errors, which improves the selection of optimal hyperparameters.

For the BEKK input/output forecasting task, the usual time-series training-validation folds were used. That is, the training dataset was split into equally sized sets, where the i-th training fold was the concatenation of the first i sets and the i-th validation fold was the (i+1)-th set. For input/output forecasting, where the estimator sees a new set of inputs during forecasting, this method of cross-validation turned out to be sufficient for the estimators to capture the dynamics of the input and output variables. Note that cross-validation training and testing were made to mimic as closely as possible the actual task to be performed on the full training set, so normalization in each fold was also performed as it would have been done on the full training dataset.
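A sketch of the expanding-window fold construction just described (hypothetical helper; the number of blocks is our own choice):

```python
import numpy as np

def expanding_folds(n_train, n_blocks):
    # Split {0, ..., n_train-1} into n_blocks equal blocks; the i-th fold trains on the
    # first i blocks and validates on the (i+1)-th one.
    edges = np.linspace(0, n_train, n_blocks + 1, dtype=int)
    for i in range(1, n_blocks):
        yield np.arange(0, edges[i]), np.arange(edges[i], edges[i + 1])

for train_idx, val_idx in expanding_folds(3007, 5):
    print(len(train_idx), len(val_idx))
```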

The ranges of hyperparameters that were cross-validated were chosen so that the regularization constant was searched over the same set of values for all estimators. As detailed in the previous section, the time complexity of the NG-RC grows exponentially for larger lag and degree hyperparameters. It was thus impractical to cross-validate over a large range of parameters, as each increase in the number of lags or in the maximum degree of the monomials makes the computational time grow exponentially, so only a smaller range of parameters could be cross-validated over. The polynomial kernel was cross-validated over a larger space of the same delay and degree hyperparameters. When cross-validating for the Lorenz system, since Takens' embedding says that only seven lags are needed, a smaller set of lags (up to 10) was cross-validated over. For the rest of the datasets, up to 101 lags were cross-validated over. Such a large range of hyperparameters was possible because, by Proposition 3 and Example 4, the polynomial kernel regression uses the same covariates as the NG-RC but is faster when more covariates are considered. Finally, the mean square error was chosen as the metric with which the best hyperparameters were selected.

V.4 Results

Pointwise and climate metrics were used to evaluate the performance of the estimators. Pointwise metrics are distance functions that quantify the error committed by the estimators at each time step. Climate metrics, see also [47], evaluate whether the statistical or physical properties of the estimated system match those of the true system. We also report, for the two chaotic attractors (Lorenz and Mackey-Glass), the valid prediction time of each estimator measured in Lyapunov times.

Denoting by $\mathbf{y}$ the true values, by $\widehat{\mathbf{y}}$ the estimated values, and by $h$ the testing set size, the following pointwise error metrics were chosen: the usual mean square error, the normalized mean square error (NMSE) given by

\textup{NMSE}(\mathbf{y},\widehat{\mathbf{y}})=\frac{\sum_{i=1}^{h}\|\mathbf{y}_{i}-\widehat{\mathbf{y}}_{i}\|_{2}^{2}}{\sum_{i=1}^{h}\|\mathbf{y}_{i}\|^{2}_{2}},

the mean absolute error (MAE) given by

\textup{MAE}(\mathbf{y},\widehat{\mathbf{y}})=\frac{1}{h}\sum_{i=1}^{h}\|\mathbf{y}_{i}-\widehat{\mathbf{y}}_{i}\|_{1}

where $\|\cdot\|_{1}$ is the 1-norm, the median absolute error (MdAE) given by

\textup{MdAE}(\mathbf{y},\widehat{\mathbf{y}})=\frac{1}{d}\sum_{u=1}^{d}\textup{median}\left(\{|y_{u,i}-\widehat{y}_{u,i}|\}_{i=1}^{h}\right)

where the subscript $u,i$ denotes the $u$-th dimension of the $i$-th vector and $d$ is the dimension of the time series, the mean absolute percentage error (MAPE),

\textup{MAPE}(\mathbf{y},\widehat{\mathbf{y}})=\frac{1}{hd}\sum_{u=1}^{d}\sum_{i=1}^{h}\frac{|y_{u,i}-\widehat{y}_{u,i}|}{\max(\varepsilon,|y_{u,i}|)}

for some very small $\varepsilon>0$, and lastly the $R^{2}$-score given by

R^{2}(\mathbf{y},\widehat{\mathbf{y}})=\frac{1}{d}\sum_{u=1}^{d}\left(1-\frac{\sum_{i=1}^{h}(y_{u,i}-\widehat{y}_{u,i})^{2}}{\sum_{i=1}^{h}(y_{u,i}-\bar{y}_{u})^{2}}\right),

where $\bar{y}_{u}$ denotes the average of the $u$-th dimension of $\mathbf{y}$ over the time steps $i=1,\ldots,h$.
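As a reference sketch, the pointwise metrics above can be computed with NumPy as follows. How the plain MSE aggregates over dimensions is not spelled out in the text, so the time-average of the squared Euclidean error is used here as an assumption.

```python
import numpy as np

def pointwise_metrics(y, y_hat, eps=1e-8):
    """y, y_hat: arrays of shape (h, d). Implements the pointwise metrics above;
    the plain MSE is taken as the time-average of the squared Euclidean error (assumption)."""
    err = y - y_hat
    mse = np.mean(np.sum(err ** 2, axis=1))
    nmse = np.sum(err ** 2) / np.sum(y ** 2)
    mae = np.mean(np.sum(np.abs(err), axis=1))                 # (1/h) * sum_i ||y_i - yhat_i||_1
    mdae = np.mean(np.median(np.abs(err), axis=0))             # average over dimensions of the medians over time
    mape = np.mean(np.abs(err) / np.maximum(eps, np.abs(y)))   # (1/(h d)) * double sum
    ss_res = np.sum(err ** 2, axis=0)
    ss_tot = np.sum((y - y.mean(axis=0)) ** 2, axis=0)
    r2 = np.mean(1.0 - ss_res / ss_tot)
    return {"MSE": mse, "NMSE": nmse, "MAE": mae, "MdAE": mdae, "MAPE": mape, "R2": r2}
```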

Whether the estimator replicated the climate of the true time series was measured by comparing the true and estimated power spectral densities (PSD) and by comparing the distributions using the Wasserstein-1 distance. Welch's method with a Hann window was used to compute the PSD. The number of points per segment was chosen by visual inspection to balance frequency resolution against error variance. When the PSD tapers off to zero beyond some frequency $F$, the remaining frequencies are not included in the final difference. The PSD error (PSDE) is then computed as

\textup{PSDE}(\textup{PSD},\widehat{\textup{PSD}})=\sum_{u=1}^{d}\sum_{f=1}^{F}\frac{|\textup{PSD}_{u,f}-\widehat{\textup{PSD}}_{u,f}|}{\textup{PSD}_{u,f}},

where $\textup{PSD}$ is the periodogram of the actual data and $\widehat{\textup{PSD}}$ is the periodogram of the estimated data, with the subscript $u,f$ denoting the $f$-th term of the PSD sequence in the $u$-th dimension.
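A sketch of the PSDE computation using scipy.signal.welch; the segment length nperseg and the cutoff index f_cut are assumptions to be tuned by visual inspection, as described above.

```python
import numpy as np
from scipy.signal import welch

def psde(y, y_hat, nperseg=256, f_cut=None):
    """PSD error: Welch periodograms with a Hann window, then the sum over
    dimensions and retained frequencies of the relative absolute differences."""
    _, psd_true = welch(y, window="hann", nperseg=nperseg, axis=0)
    _, psd_est = welch(y_hat, window="hann", nperseg=nperseg, axis=0)
    if f_cut is not None:  # discard frequencies beyond F where the true PSD has tapered off
        psd_true, psd_est = psd_true[:f_cut], psd_est[:f_cut]
    return float(np.sum(np.abs(psd_true - psd_est) / psd_true))
```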

The Wasserstein-1 distance was computed to compare the distributions of the true and estimated systems. For one-dimensional systems, it can be computed with scipy.stats.wasserstein_distance, which uses the equivalent Cramér-1 distance, that is,

W_{1}(\mu,\widehat{\mu})=\int_{\mathbb{R}}\left|\textup{CDF}(y)-\widehat{\textup{CDF}}(y)\right|dy,

where $\mu,\widehat{\mu}$ are the probability distributions of $y$ and $\widehat{y}$, respectively, and $\textup{CDF}$, $\widehat{\textup{CDF}}$ denote the corresponding cumulative distribution functions. For $d$-dimensional time series, using scipy.stats.wasserstein_distance_nd corresponds to solving the linear programming problem

W_{1}(\mu,\widehat{\mu})=\inf_{\pi\in\Gamma(\mu,\widehat{\mu})}\int\|\mathbf{y}-\widehat{\mathbf{y}}\|_{2}\,d\pi(\mathbf{y},\widehat{\mathbf{y}}),

where $\Gamma(\mu,\widehat{\mu})$ is the set of probability distributions whose first and second marginals are $\mu$ and $\widehat{\mu}$, respectively, and $\mu$ and $\widehat{\mu}$ are now the distributions of the $d$-dimensional vectors $\mathbf{y}$ and $\widehat{\mathbf{y}}$. Computing the Wasserstein distance in the multidimensional case is significantly more expensive than in the one-dimensional case; in the case of the Lorenz dynamical system, subsampling had to be performed to make scipy.stats.wasserstein_distance_nd tractable.
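A sketch of the two Wasserstein-1 computations with SciPy (scipy.stats.wasserstein_distance_nd requires a recent SciPy release); the subsampling size used below is an illustrative assumption.

```python
import numpy as np
from scipy.stats import wasserstein_distance, wasserstein_distance_nd

def w1_distance(y, y_hat, max_points=2000, seed=0):
    """W1 between the empirical distributions of the true and estimated series.
    1-D: closed-form CDF formula; d-D: linear program, subsampled for tractability."""
    y, y_hat = np.atleast_2d(y.T).T, np.atleast_2d(y_hat.T).T  # ensure shape (time, dim)
    if y.shape[1] == 1:
        return wasserstein_distance(y.ravel(), y_hat.ravel())
    rng = np.random.default_rng(seed)
    if len(y) > max_points:        # e.g. the Lorenz testing set had to be subsampled
        y = y[rng.choice(len(y), size=max_points, replace=False)]
    if len(y_hat) > max_points:
        y_hat = y_hat[rng.choice(len(y_hat), size=max_points, replace=False)]
    return wasserstein_distance_nd(y, y_hat)
```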

Finally, for the Lorenz and Mackey-Glass dynamical systems, the Lyapunov time, defined as the inverse of the top Lyapunov exponent, sets the timescale over which a chaotic system is predictable. In the deterministic path-continuation tasks carried out for these two systems, we measure the valid prediction time $T_{\textup{valid}}$, defined as the number of Lyapunov times it takes for the predicted dynamics to deviate from the true dynamics by 20%. A similar metric was used in [47].

The performance of the estimators on the dynamical systems was measured as follows. First, the valid prediction time was computed and the ceiling of the best valid prediction time was taken. The pointwise error metrics MSE, NMSE, MAE, MdAE, MAPE, and $R^{2}$ are measured only up to $\lceil T_{\textup{valid}}\rceil$: beyond the characteristic predictable timescale given by the Lyapunov time, it is not meaningful to measure step-to-step errors, as the trajectories diverge exponentially at a rate set by the Lyapunov exponent. To determine whether the climate is well replicated over the full testing dataset, the climate metrics PSDE and $W_{1}$ are computed on the full testing set (with the Lorenz data subsampled to compute $W_{1}$). The errors are reported in Table 3.
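A sketch of how the valid prediction time and the truncated pointwise evaluation could be computed; the precise normalization used to define the 20% deviation is not specified in the text, so the choice below is an assumption.

```python
import numpy as np

def valid_time(y, y_hat, dt, lyap_exp, tol=0.2):
    """Lyapunov times elapsed before the normalized deviation first exceeds tol (20%).
    Normalizing by the mean norm of the true trajectory is an assumption."""
    dev = np.linalg.norm(y - y_hat, axis=-1) / np.mean(np.linalg.norm(y, axis=-1))
    crossed = np.flatnonzero(dev > tol)
    steps = crossed[0] if crossed.size else len(y)
    return steps * dt * lyap_exp  # elapsed time divided by the Lyapunov time 1 / lyap_exp

# Pointwise metrics are then restricted to the first ceil(T_valid) Lyapunov times:
# t_valid = valid_time(y_test, y_pred, dt, lyap_exp)
# n_steps = int(np.ceil(t_valid) / (dt * lyap_exp))
# pointwise_metrics(y_test[:n_steps], y_pred[:n_steps])   # from the sketch above
```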

Table 3: Error values for each estimator and dataset. A 20% deviation is allowed for $T_{\textup{valid}}$. MSE, NMSE, MAE, MdAE, MAPE, and $R^{2}$ are computed up to $\lceil T_{\textup{valid}}\rceil$. PSDE and $W_{1}$ are computed on the full dataset (Lorenz subsampled). Except for $R^{2}$, lower is better for all errors. PSDE values for BEKK are scaled by $10^{-4}$.
System        Estimator    T_valid   MSE       NMSE      MAE      MdAE      MAPE     R^2      PSDE     W_1
Lorenz        NG-RC        7.178     25.910    0.252     1.827    0.0897    1.598    0.621    7.606    2.114
              Polynomial   9.126     13.383    0.0960    1.155    0.0377    1.230    0.812    7.169    1.683
              Volterra     10.566    6.840     0.0427    0.428    0.00385   0.222    0.907    8.325    1.982
Mackey-Glass  NG-RC        0.3       0.0502    0.0548    0.188    0.174     0.235    0.0202   28.815   0.155
              Polynomial   7.035     0.00358   0.00390   0.0274   0.00751   0.0329   0.930    6.831    0.00147
              Volterra     8.305     0.00171   0.00186   0.0162   0.00202   0.0190   0.967    5.059    0.00138
BEKK          NG-RC        —         0.00125   0.150     0.0204   0.0166    0.644    0.125    1.565    0.332
              Polynomial   —         0.00124   0.150     0.0204   0.0166    0.642    0.123    1.140    0.337
              Volterra     —         0.00102   0.136     0.0170   0.0139    0.634    0.382    0.963    0.319

In the more complex systems, Mackey-Glass and BEKK, the Volterra and polynomial kernel regressions easily outperform NG-RC because they have access to much richer feature spaces. In lower-order systems such as the Lorenz system, even though the polynomial and Volterra kernel regressions achieve better pointwise errors, the climate metrics indicate that NG-RC nevertheless captures the climate of the true Lorenz dynamical system well. It could be that the lower-order model offered by NG-RC acts as a good proxy for a lower-order true dynamical system, which accounts for the difference in climate performance. This difference in climate replication is, however, only slight: as Figure 1(a)-(c) shows, all estimators capture the power spectral density of the original Lorenz system well, even if the Volterra kernel is slightly outperformed by the polynomial kernel. For the more complex Mackey-Glass system, both kernel regression methods capture the climate of the true dynamical system, but the same cannot be said for NG-RC. A similar story holds for the Wasserstein-1 distance; the corresponding Lorenz distributions are shown in Figure 2(a)-(c), and even where the Wasserstein-1 distance is largest, the discrepancy between distributions remains small and the bulk of the distribution is replicated. On the other hand, for the most complex system, BEKK, both the polynomial kernel and NG-RC fail to fully replicate the climate of the original system.

[Figure 1 panels (images omitted): Lorenz (a)-(c), Mackey-Glass (d)-(f), BEKK (g)-(i).]

Figure 1: PSD of the Lorenz system for the Volterra kernel (a), the polynomial kernel (b), and NG-RC (c); dimension 0 corresponds to x, dimension 1 to y, and dimension 2 to z. PSD of the Mackey-Glass system for the Volterra kernel (d), the polynomial kernel (e), and NG-RC (f); dimension 0 refers to the z value. PSD of BEKK for the Volterra kernel (g), the polynomial kernel (h), and NG-RC (i); only the two dimensions with the most visually prominent PSD are displayed (dimensions 84 and 117).


[Figure 2 panels (images omitted): Lorenz (a)-(c), Mackey-Glass (d)-(f), BEKK (g)-(i).]

Figure 2: Distributions of the Lorenz system for the Volterra kernel (a), the polynomial kernel (b), and NG-RC (c). Distributions of the Mackey-Glass system for the Volterra kernel (d), the polynomial kernel (e), and NG-RC (f). Distributions of BEKK for the Volterra kernel (g), the polynomial kernel (h), and NG-RC (i); only one dimension is displayed (dimension 117).


We also observe, especially in the case of BEKK, that when significantly complex dynamical systems are being learned, considering large but finite numbers of lags or monomial degrees may be insufficient. Even when finite lags suffice, the entries of the Gram or feature matrices may become too large to be handled in finite precision. In such cases, the Volterra kernel regression significantly outperforms the other two methods: it is agnostic to the number of lags and monomial powers, so its Gram entries do not grow with the feature choice. Moreover, as we saw in Section IV, taking infinitely many lags and monomial powers into consideration offers a rich feature space and makes the associated RKHS universal, meaning that it can approximate complex systems to any desired accuracy.


Acknowledgments: The authors thank Daniel Gauthier for insightful discussions about the relation between NG-RC and the results in this paper. LG and JPO thank the Nanyang Technological University and the University of St. Gallen, respectively, for their hospitality; some of the results in this paper were obtained during visits to these two institutions. HLJT is funded by a Nanyang President's Graduate Scholarship of Nanyang Technological University. JPO acknowledges partial financial support from the School of Physical and Mathematical Sciences of the Nanyang Technological University.

References

  • Alquier and Wintenberger [2012] Alquier, P., and O. Wintenberger (2012), Bernoulli 18 (3), 883, arXiv:0902.2924v4.
  • Arcomano et al. [2022] Arcomano, T., I. Szunyogh, A. Wikner, J. Pathak, B. R. Hunt, and E. Ott (2022), Journal of Advances in Modeling Earth Systems 14 (3), e2021MS002712.
  • Aronszajn [1950] Aronszajn, N. (1950), Transactions of the American Mathematical Society 68 (3), 337.
  • Barbosa and Gauthier [2022] Barbosa, W. A. S., and D. J. Gauthier (2022), Chaos: An Interdisciplinary Journal of Nonlinear Science 32 (9).
  • Bollt [2021] Bollt, E. (2021), Chaos: An Interdisciplinary Journal of Nonlinear Science 31 (1), 13108.
  • Boussama et al. [2011] Boussama, F., F. Fuchs, and R. Stelzer (2011), Stochastic Processes and their Applications 121 (10), 2331.
  • Christmann and Steinwart [2008] Christmann, A., and I. Steinwart (2008), Support Vector Machines (Springer).
  • Christmann and Steinwart [2010] Christmann, A., and I. Steinwart (2010), in Advances in Neural Information Processing Systems, Vol. 23, edited by J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel,  and A. Culotta (Curran Associates, Inc.).
  • Dang Van Mien and Normand-Cyrot [1984] Dang Van Mien, H., and D. Normand-Cyrot (1984), Automatica 20 (2), 175.
  • Dedecker et al. [2007] Dedecker, J., P. Doukhan, G. Lang, J. R. León, S. Louhichi, and C. Prieur (2007), Weak Dependence: With Examples and Applications (Springer Science+Business Media).
  • Doukhan and Wintenberger [2008] Doukhan, P., and O. Wintenberger (2008), Stochastic Processes and their Applications 118 (11), 1997.
  • Engle and Kroner [1995] Engle, R. F., and F. K. Kroner (1995), Econometric Theory 11, 122.
  • Gauthier et al. [2021] Gauthier, D. J., E. Bollt, A. Griffith, and W. A. S. Barbosa (2021), Nature Communications 12 (1), 5564.
  • Gonon et al. [2022] Gonon, L., L. Grigoryeva, and J.-P. Ortega (2022), arXiv:2212.14641 .
  • Gonon et al. [2023] Gonon, L., L. Grigoryeva, and J.-P. Ortega (2023), The Annals of Applied Probability 33 (1), 28.
  • Gonon and Ortega [2020] Gonon, L., and J.-P. Ortega (2020), IEEE Transactions on Neural Networks and Learning Systems 31 (1), 100, arXiv:1807.02621 .
  • Gonon and Ortega [2021] Gonon, L., and J.-P. Ortega (2021), Neural Networks 138, 10.
  • Grigoryeva et al. [2021] Grigoryeva, L., A. G. Hart, and J.-P. Ortega (2021), Physical Review E - Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics 103, 062204.
  • Grigoryeva et al. [2023] Grigoryeva, L., A. G. Hart, and J.-P. Ortega (2023), Nonlinearity 36, 4674.
  • Grigoryeva et al. [2014] Grigoryeva, L., J. Henriques, L. Larger, and J.-P. Ortega (2014), Neural Networks 55, 59.
  • Grigoryeva and Ortega [2018a] Grigoryeva, L., and J.-P. Ortega (2018a), Neural Networks 108, 495, arXiv:1806.00797.
  • Grigoryeva and Ortega [2018b] Grigoryeva, L., and J.-P. Ortega (2018b), Journal of Machine Learning Research 19 (24), 1, arXiv:1712.00754.
  • Jaeger [2010] Jaeger, H. (2010), Tech. Rep. (German National Research Center for Information Technology).
  • Jaeger and Haas [2004] Jaeger, H., and H. Haas (2004), Science 304 (5667), 78.
  • Kent et al. [2024] Kent, R., W. Barbosa, and D. Gauthier (2024), Nature Communications 15, 10.1038/s41467-024-48133-3.
  • Kent et al. [2023] Kent, R. M., W. A. S. Barbosa, and D. J. Gauthier (2023), arXiv preprint arXiv:2307.03813 .
  • Kocarev and Parlitz [1995] Kocarev, L., and U. Parlitz (1995), Physical Review Letters 74 (25), 5028.
  • Kocarev and Parlitz [1996] Kocarev, L., and U. Parlitz (1996), Physical Review Letters 76 (11), 1816.
  • Lorenz [1963] Lorenz, E. N. (1963), “Deterministic nonperiodic flow,” Journal of the Atmospheric Sciences 20 (2), 130.
  • Lu et al. [2018] Lu, Z., B. R. Hunt, and E. Ott (2018), Chaos 28 (6), 10.1063/1.5039508, arXiv:1805.03362.
  • Maass [2011] Maass, W. (2011), in Computability In Context: Computation and Logic in the Real World, edited by S. S. Barry Cooper and A. Sorbi, Chap. 8 (World Scientific) pp. 275–296.
  • Maass et al. [2002] Maass, W., T. Natschläger, and H. Markram (2002), Neural Computation 14, 2531.
  • Mackey and Glass [1977] Mackey, M. C., and L. Glass (1977), Science 197, 287.
  • Martínez-Peña and Ortega [2023] Martínez-Peña, R., and J.-P. Ortega (2023), Physical Review E - Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics 107 (3), 035306.
  • Martínez-Peña and Ortega [2024] Martínez-Peña, R., and J.-P. Ortega (2024), “Input-dependence in quantum reservoir computing,” .
  • Matthews [1992] Matthews, M. B. (1992), On the Uniform Approximation of Nonlinear Discrete-Time Fading-Memory Systems Using Neural Network Models, Ph.D. thesis (ETH Zürich).
  • Matthews [1993] Matthews, M. B. (1993), Circuits, Systems, and Signal Processing 12 (2), 279.
  • Micchelli et al. [2006] Micchelli, C. A., Y. Xu, and H. Zhang (2006), Journal of Machine Learning Research 7 (12), 2651.
  • Mohri et al. [2018] Mohri, M., A. Rostamizadeh, and A. Tawalkar (2018), Foundations of Machine Learning, 2nd ed. (The MIT Press).
  • Pathak et al. [2018] Pathak, J., B. Hunt, M. Girvan, Z. Lu, and E. Ott (2018), Physical Review Letters 120 (2), 24102.
  • Pathak et al. [2017] Pathak, J., Z. Lu, B. R. Hunt, M. Girvan, and E. Ott (2017), Chaos 27 (12), 10.1063/1.5010300, arXiv:1710.07313.
  • Ratas and Pyragas [2024] Ratas, I., and K. Pyragas (2024), Physical Review E 109 (6), 64215.
  • Schölkopf and Smola [2002] Schölkopf, B., and A. J. Smola (2002), Learning with Kernels (MIT Press).
  • Sontag [1979] Sontag, E. D. (1979), IEEE Transactions on Circuits and Systems 26 (5), 342.
  • Takens [1981] Takens, F. (1981), “Detecting strange attractors in turbulence,” in Dynamical Systems and Turbulence, Warwick 1980, Lecture Notes in Mathematics, Vol. 898 (Springer Berlin Heidelberg) pp. 366–381.
  • Wang and Zhang [2013] Wang, B., and H. Zhang (2013),  arXiv:1310.5543 [stat.ML] .
  • Wikner et al. [2024] Wikner, A., J. Harvey, M. Girvan, B. R. Hunt, A. Pomerance, T. Antonsen, and E. Ott (2024), Neural Networks 170, 94.
  • Wikner et al. [2021] Wikner, A., J. Pathak, B. R. Hunt, I. Szunyogh, M. Girvan, and E. Ott (2021), Chaos: An Interdisciplinary Journal of Nonlinear Science 31 (5), 53114.