Combining machine learning and data assimilation to forecast dynamical systems from noisy partial observations

Georg A. Gottwald
School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia

Sebastian Reich
Institute of Mathematics, University of Potsdam, Germany

[email protected], [email protected]
Abstract.

We present a supervised learning method to learn the propagator map of a dynamical system from partial and noisy observations. In our computationally cheap and easy-to-implement framework a neural network consisting of random feature maps is trained sequentially by incoming observations within a data assimilation procedure. By employing Takens’ embedding theorem, the network is trained on delay coordinates. We show that the combination of random feature maps and data assimilation, called RAFDA, outperforms standard random feature maps for which the dynamics is learned using batch data.

Determining computationally inexpensive surrogate maps, trained on partial and noisy observations, is of utmost significance. Such surrogate maps would allow for long-time simulation of dynamical systems, the governing equations of which are unknown. Similarly, they would allow for long-time integration of complex high-dimensional dynamical systems with known underlying equations where one is however only interested in the dynamics of a select subset of resolved variables. Surrogate maps should produce trajectories which are consistent with the dynamical system even after long time integration. We consider here the situation when such systems are accessible only by partial observations of the resolved variables and where these observations are contaminated by measurement noise. By formulating a machine learning framework in delay coordinate space, we show that a partial and noisy training data set can be used to learn the dynamics in the reconstructed phase space. The learned surrogate map provides a computationally cheap forecast model for single forecasts up to several Lyapunov times as well as for dynamically consistent long-time simulations.

1. Introduction

Recent years have seen an increased interest in data-driven methods with the aim of developing cheap surrogate models for forecasting dynamical systems. A main driver behind this endeavour is the potential saving of computational running time required for simulation. This is particularly important for stiff multi-scale systems for which the fastest time-scale puts strong restrictions on the time step which can be employed (Han et al., 2018, 2019, Raissi and Karniadakis, 2018, Raissi et al., 2019). Promising applications for such surrogate models come from atmospheric and ocean dynamics and from climate dynamics (Schneider et al., 2017, Dueben and Bauer, 2018, Rasp et al., 2018, Gagne II et al., 2020, Bolton and Zanna, 2019, Brajard et al., 2021, Nadiga, 2020, Cleary et al., 2021).

A particularly simple and efficient machine learning architecture is that of random feature maps (Rahimi and Recht, 2008, 2008, Bach, 2017a, b, Sun et al., 2019). Random feature maps provide a representation of the surrogate propagator for a dynamical system by a linear combination of randomly generated high-dimensional nonlinear functions of the input data. The training of random feature map networks only requires linear least-squares regression, and it was proven rigorously that random feature maps enjoy the so-called universal approximation property, which states that they can approximate any continuous function arbitrarily closely (Park and Sandberg, 1991, Cybenko, 1989, Barron, 1993). The framework of random feature maps was extended to include internal dynamics in so-called echo-state networks and reservoir computers, with remarkable success in forecasting dynamical systems (Maass et al., 2002, Jaeger, 2002, Jaeger and Haas, 2004, Pathak et al., 2018a, Jüngling et al., 2019, Algar et al., 2019, Nadiga, 2020, Bollt, 2021, Gauthier et al., 2021, Platt et al., 2021).

These linear regression based training methods assume model errors rather than measurement errors in the data (Gottwald and Reich, 2021). However, the skill of random feature maps in providing a reliable surrogate forecast model can be severely impeded when the data used for training are contaminated by measurement noise. In previous work Gottwald and Reich (2021) showed that by combining random feature maps with ensemble-based sequential data assimilation, in a framework coined RAndom Feature map Data Assimilation (RAFDA), the noise can be effectively controlled leading to remarkable forecast skills for noisy observations. Moreover, RAFDA was shown to be able to provide model closure in multi-scale systems allowing for subgridscale parametrization as well as to provide reliable ensembles for probabilistic forecasting. In RAFDA the parameters of the random feature maps are learned sequentially with incoming observations in conjunction with the analysis of the state variables using an ensemble Kalman filter (EnKF) (Evensen, 2006). RAFDA thus combines a powerful approximation tool in the form of random feature maps with an advanced sequential and derivative-free data assimilation technique, which is particularly well suited for high-dimensional state and parameter estimation problems.

Several alternative combinations of data assimilation and machine learning techniques have recently been explored. Recently proposed methodologies include purely minimization-based approaches (Bocquet et al., 2019), approaches which combine ensemble-based sequential state estimation with minimization-based parameter estimation (Brajard et al., 2020, Bocquet et al., 2020), and purely ensemble-based sequential joint state and parameter estimation (Bocquet et al., 2021). Furthermore, an intimate formal equivalence between certain neural network architectures and data assimilation has been noted by Abarbanel et al. (2018). Finally, sequential data assimilation has also been extended to echo-state networks and reservoir computers (Tomizawa and Sawada, 2021, Wikner et al., 2021).

In this work, we extend RAFDA to the situation when only partial observations are available. In this case, the propagator map that updates the system in the partially observed subspace only is non-Markovian, and knowledge of the past history is required. This problem was recently addressed within a machine learning context using appropriate model closures (Levine and Stuart, 2021). To account for the non-Markovian nature of the propagator map, we instead follow a different path here and employ phase-space reconstruction and Takens’ embedding theorem (Takens, 1981). Takens’ embedding theorem has been successfully used in data assimilation before, where a forecast model was non-parametrically constructed using analogs (Hamilton et al., 2016). Recently, reservoir computers, which may be viewed as providing an embedding of the dynamics in the reservoir space (Lu et al., 2017, Hart et al., 2020), were used to learn the dynamics in the reconstructed phase space using delay coordinates and linear regression for parameter estimation (Nakai and Saiki, 2021). Here we use delay coordinates in combination with sequential data assimilation to learn a surrogate forecast model directly in the reconstructed phase space from noisy and partial observations.

While we follow a non-parametric or model-free approach in this paper, another promising direction is to combine machine learning and knowledge-based modelling components (Pathak et al., 2018b, Gagne II et al., 2020, Bolton and Zanna, 2019, Wikner et al., 2020, Brajard et al., 2021, Wikner et al., 2021). We note that RAFDA also allows for such extensions (Gottwald and Reich, 2021).

The paper is organised as follows. In Section 2 we develop our RAFDA methodology for partial observations. In Section 3 we show how RAFDA performs on the Lorenz 63 system and that RAFDA exhibits significantly increased forecast skill when compared to standard random feature maps trained on delay coordinates using linear regression. We close in Section 4 with a discussion and an outlook.

2. Random feature map and data assimilation (RAFDA)

Consider a $D$-dimensional dynamical system $\dot{\bm{u}}=\bm{\mathcal{F}}(\bm{u})$ which is accessed at equidistant times $t_n=n\Delta t$ of interval length $\Delta t>0$, $n\geq 0$, by partial noisy observations

\[
\bm{y}^{\rm o}_{n}=\bm{G}\bm{u}_{n}+\bm{\Gamma}^{1/2}\,\bm{\eta}_{n} \tag{1}
\]

with $\bm{u}_n=\bm{u}(t_n)$, observation operator $\bm{G}:\mathbb{R}^{D}\to\mathbb{R}^{d}$, measurement error covariance matrix $\bm{\Gamma}\in\mathbb{R}^{d\times d}$ and $d$-dimensional independent and normally distributed noise $\bm{\eta}_n$, that is, $\bm{\eta}_n\sim\mathcal{N}(\bm{0},\mathbf{I})$.

The aim we set out to pursue is the following: using noisy observations $\bm{y}^{\rm o}_j$ for $0\leq j\leq n$, find an approximation to the propagator map which maps previous observations (or a suitable subset of them) to the yet unobserved variable $\bm{y}_{n+1}$ at the future time $t_{n+1}$. In the Markovian case $d=D$ with $\bm{G}=\bm{I}$ the propagator map is a function of the observation at the current time $t_n$ only. This case was studied in our previous work (Gottwald and Reich, 2021). Here the focus is on partial observations with $d<D$, for which the propagator map is non-Markovian. Our method judiciously combines three separate ingredients which we discuss in the following three subsections: 1.) Takens embedding and delay coordinates, to deal with having access only to partial observations and the resulting non-Markovian propagator map, 2.) random feature maps, which form our approximation of the propagator map, and 3.) data assimilation and EnKFs, to control the observation noise and to determine the parameters of the random feature maps, leading to the desired surrogate model.

2.1. Delay coordinates

When only partial observations are available, Takens’ embedding theorem allows the dynamics to be represented faithfully by means of phase space reconstruction and delay vectors (Takens, 1981, Sauer et al., 1991, Sauer and Yorke, 1993). Takens’ embedding theorem assumes noise-free observations, but has been successfully applied to noise-contaminated observations (Kantz and Schreiber, 2004, Schreiber, 1999, Small, 2005, Casdagli et al., 1991, Schreiber and Kantz, 1995).

We describe here the key steps for scalar-valued observations, that is, $d=1$ in (1). Hence, given a time series $y^{\rm o}_n$, define the $m$-dimensional delay vectors

\[
\bm{\zeta}^{\rm o}_{n} = \left(y^{\rm o}_{n},\,y^{\rm o}_{n+\tau},\,y^{\rm o}_{n+2\tau},\,\dots,\,y^{\rm o}_{n+(m-1)\tau}\right)^{\rm T} \tag{2}
\]

for $n=0,\ldots,N$, with integer delay time $\tau>0$. If the underlying dynamics lies inside a $D$-dimensional phase space, then the delay reconstruction map $y^{\rm o}_n\to\bm{\zeta}^{\rm o}_n$ is an embedding provided the embedding dimension is sufficiently large with $m>2D_{\rm box}$, where $D_{\rm box}$ denotes the fractal box-counting dimension of the attractor (Sauer et al., 1991, Sauer and Yorke, 1993). Under these conditions, the reconstructed dynamics $\bm{\zeta}^{\rm o}_0,\bm{\zeta}^{\rm o}_1,\bm{\zeta}^{\rm o}_2,\dots$ in $\mathbb{R}^m$ faithfully represents the underlying dynamics. For finite time series, the embedding dimension $m$ and the delay time $\tau$ are crucial and need to be chosen carefully; see, for example, Kantz and Schreiber (2004), Schreiber (1999), Casdagli et al. (1991), Schreiber and Kantz (1995), Small (2005) for a detailed discussion. Specifically, if the embedding dimension is chosen too small, the delay reconstruction map does not represent the underlying dynamics; if it is chosen too large, the reconstruction involves redundancy, reducing the length of the delay vector time series $\bm{\zeta}^{\rm o}_n$. We choose the embedding dimension $m$ using the false nearest neighbour algorithm (Kennel et al., 1992): the embedding dimension is chosen as the smallest value $m$ such that the fraction of false nearest neighbours is less than $10\%$. False nearest neighbours (at dimension $m$) are defined as those points $\bm{\zeta}^{\rm o}_n$ for which the Euclidean distance to their nearest neighbour changes by a factor of $10$ (relative to their smallest Euclidean distance) upon increasing the embedding dimension to $m+1$. An appropriate delay time $\tau$ is determined as the time for which the average mutual information attains its first minimum (Fraser and Swinney, 1986). We use the command phaseSpaceReconstruction from MATLAB (2020), which implements these algorithms, to determine the embedding dimension $m$ and the delay time $\tau$ as well as to generate the delay vectors.

The extension of the delay vectors (2) to multivariate observations (1) is straightforward and leads to delay vectors of dimension $D_\zeta=dm$.
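For concreteness, the following sketch shows how delay vectors of the form (2) can be assembled from a scalar time series; the function name and array layout are ours, and the embedding dimension $m$ and delay time $\tau$ are assumed to have been determined beforehand as described above.

```python
import numpy as np

def delay_vectors(y_obs, m, tau):
    """Form the m-dimensional delay vectors (2) from a scalar time series y_obs."""
    n_vectors = len(y_obs) - (m - 1) * tau                 # number of complete delay vectors
    # row n is (y_n, y_{n+tau}, ..., y_{n+(m-1)tau})
    return np.stack([y_obs[n:n + (m - 1) * tau + 1:tau] for n in range(n_vectors)])

# example: scalar observations embedded with m = 3 and tau = 10 (the values used in Section 3)
y = np.random.randn(4000)
Z = delay_vectors(y, m=3, tau=10)                          # shape (n_vectors, 3)
```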

2.2. Random feature maps

In this subsection we describe the traditional random feature map network architecture (Rahimi and Recht, 2008, 2008, Bach, 2017a, b, Sun et al., 2019). This is a particularly easy-to-implement network architecture in which the output is given as a linear combination of randomly sampled nonlinear functions of the input signal, and the coefficients of this linear combination are then learned from the whole available training data set via linear ridge regression. Our RAFDA extension, described further below, instead determines the coefficients sequentially within a data assimilation procedure.

Applied to our setting, the delay coordinates $\bm{\zeta}\in\mathbb{R}^{D_\zeta\times 1}$ are first linearly mapped by a random but fixed linear map into a high-dimensional subspace of $\mathbb{R}^{D_r}$ with $D_r\gg D_\zeta$ and then nonlinearly transformed by feature maps $\bm{\phi}$ as

\[
\bm{\phi}(\bm{\zeta})=\tanh(\bm{W}_{\rm in}\bm{\zeta}+\bm{b}_{\rm in})\in\mathbb{R}^{D_r\times 1} \tag{3}
\]

with weight matrix

\[
\bm{W}_{\rm in}=(\bm{w}_{{\rm in},1},\ldots,\bm{w}_{{\rm in},D_r})^{\rm T}\in\mathbb{R}^{D_r\times D_\zeta}
\]

and a bias

\[
\bm{b}_{\rm in}=(b_{{\rm in},1},\ldots,b_{{\rm in},D_r})^{\rm T}\in\mathbb{R}^{D_r\times 1}.
\]

The weight matrix and the bias are chosen randomly and independently of the observed delay coordinates $\bm{\zeta}^{\rm o}_n$, $n=0,\ldots,N$, according to the distributions $p(\bm{w}_{\rm in})$ and $p(b_{\rm in})$, respectively. We choose here

\[
(\bm{W}_{\rm in})_{ij}\sim\mathcal{U}[-w,w]\qquad\text{and}\qquad(\bm{b}_{\rm in})_{i}\sim\mathcal{U}[-b,b]. \tag{4}
\]

The choice of the hyperparameters $w>0$ and $b>0$ is system-dependent and needs to ensure that the observations cover the nonlinear domain of the $\tanh$-function. The reader is referred to Gottwald and Reich (2021) for further details. Note that the hyperparameters $\bm{W}_{\rm in}$ and $\bm{b}_{\rm in}$ are kept fixed once drawn and are not learned. This restriction is made for computational simplicity and can be relaxed following E et al. (2020), Rotskoff and Vanden-Eijnden (2018).
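The random feature map (3) with internal parameters drawn according to (4) amounts to a few lines of code. The sketch below is illustrative only; the sizes and hyperparameter values are taken from Section 3, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

D_zeta, D_r = 3, 300                   # delay-vector and feature dimensions (values of Section 3)
w, b = 0.005, 4.0                      # hyperparameters of the uniform distributions in (4)

# internal parameters: drawn once, then kept fixed and not learned
W_in = rng.uniform(-w, w, size=(D_r, D_zeta))
b_in = rng.uniform(-b, b, size=(D_r, 1))

def phi(zeta):
    """Random feature map (3); zeta has shape (D_zeta, 1) or (D_zeta, N)."""
    return np.tanh(W_in @ zeta + b_in)
```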

An approximation of the propagator map in the delay coordinates $\bm{\zeta}\in\mathbb{R}^{D_\zeta}$ is then provided by

\[
\Psi_{\rm S}(\bm{\zeta})=\bm{W}\bm{\phi}(\bm{\zeta}), \tag{5}
\]

where the matrix $\bm{W}\in\mathbb{R}^{D_\zeta\times D_r}$ maps back into the delay coordinate vector space to approximate the delay vector at the next time step, and is to be learned from the noisy delay vectors $\bm{\zeta}^{\rm o}_n$, $n=0,\ldots,N$. More specifically, the weights $\bm{W}$ should be chosen such that

\[
\bm{\zeta}^{\rm o}_{n} \approx \Psi_{\rm S}(\bm{\zeta}^{\rm o}_{n-1}), \tag{6}
\]

which is achieved by minimizing the regularized cost function

\[
\mathcal{L}(\bm{W}) = \frac{1}{2}\sum_{n=1}^{N}\|\bm{\zeta}^{\rm o}_{n}-\Psi_{\rm S}(\bm{\zeta}^{\rm o}_{n-1})\|^{2}+\frac{\beta}{2}\|\bm{W}\|^{2}_{\rm F}
= \frac{1}{2}\|\bm{Z}^{\rm o}-\bm{W}\bm{\Phi}\|_{\rm F}^{2}+\frac{\beta}{2}\|\bm{W}\|^{2}_{\rm F}, \tag{7}
\]

where $\|\bm{A}\|_{\rm F}$ denotes the Frobenius norm of a matrix $\bm{A}$, $\bm{Z}^{\rm o}\in\mathbb{R}^{D_\zeta\times N}$ is the matrix with columns $\bm{\zeta}^{\rm o}_n$, $n=1,\ldots,N$, and $\bm{\Phi}\in\mathbb{R}^{D_r\times N}$ is the matrix with columns

\[
\bm{\phi}_{n}=\bm{\phi}(\bm{\zeta}^{\rm o}_{n-1}), \tag{8}
\]

$n=1,\ldots,N$. The parameter $\beta>0$ is used for regularization. The solution of the minimization problem for (7) can be determined explicitly as

\[
\bm{W}_{\rm LR}=\bm{Z}^{\rm o}\bm{\Phi}^{\rm T}\left(\bm{\Phi}\bm{\Phi}^{\rm T}+\beta\mathbf{I}\right)^{-1}, \tag{9}
\]

and uses all available training data $\bm{Z}^{\rm o}$ at once. In Gottwald and Reich (2021) it was shown that standard random feature maps have difficulty dealing with noisy observations, and instead it was proposed to learn the output weights $\bm{W}$ sequentially within a data assimilation procedure, which is described in the next section.
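In code, the closed-form ridge regression solution (9) reads as follows; this is a minimal sketch with illustrative names, using a linear solve instead of an explicit matrix inverse.

```python
import numpy as np

def ridge_regression_weights(Z_obs, Phi, beta):
    """Ridge regression solution (9): W_LR = Z^o Phi^T (Phi Phi^T + beta I)^{-1}.

    Z_obs : (D_zeta, N) matrix with columns zeta_1, ..., zeta_N
    Phi   : (D_r, N)   matrix with columns phi(zeta_0), ..., phi(zeta_{N-1})
    beta  : regularization parameter
    """
    D_r = Phi.shape[0]
    A = Phi @ Phi.T + beta * np.eye(D_r)          # (D_r, D_r), symmetric positive definite
    return np.linalg.solve(A, Phi @ Z_obs.T).T    # equals Z_obs Phi^T A^{-1} since A is symmetric
```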

2.3. Data assimilation

RAFDA uses a combined parameter and state estimation within a data assimilation procedure. The main idea is to use the surrogate propagator (5), consisting of the random feature maps, as the forecast model within an EnKF and to estimate the weight matrix $\bm{W}$ sequentially. Concretely, we consider the forecast model for given weight matrix $\bm{W}^{\rm a}_{n-1}$ and given delay coordinates $\bm{\zeta}^{\rm a}_{n-1}$ as

\[
\bm{\zeta}^{\rm f}_{n} = \bm{W}^{\rm a}_{n-1}\,\bm{\phi}(\bm{\zeta}^{\rm a}_{n-1}) \tag{10a}
\]
\[
\bm{W}^{\rm f}_{n} = \bm{W}^{\rm a}_{n-1}, \tag{10b}
\]

where the superscript f denotes the forecast and the superscript a denotes the analysis defined below.

To incorporate the parameter estimation into an EnKF analysis step, we consider an augmented state space $\bm{x}=(\bm{\zeta}^{\rm T},\bm{w}^{\rm T})^{\rm T}\in\mathbb{R}^{D_x\times 1}$ with $D_x=D_\zeta(1+D_r)$. Here $\bm{w}\in\mathbb{R}^{D_\zeta D_r\times 1}$ is the vector consisting of all matrix elements of the weight matrix $\bm{W}$, with its entries defined block-wise, that is, $w_{1:D_r}=(W_{11},\dots,W_{1D_r})^{\rm T}$, $w_{D_r+1:2D_r}=(W_{21},\dots,W_{2D_r})^{\rm T}$ and so forth.

We now also treat $\bm{x}^{\rm f}_n$ and $\bm{x}^{\rm a}_n$, $n\geq 0$, as random variables. Assuming a Gaussian distribution for $\bm{x}^{\rm f}_{n+1}$, the analysis step for the mean $\overline{\bm{x}}^{\rm a}_n$ is given by

\[
\overline{\bm{x}}^{\rm a}_{n}=\overline{\bm{x}}^{\rm f}_{n}-\bm{K}_{n}(\bm{H}\overline{\bm{x}}^{\rm f}_{n}-\bm{\zeta}^{\rm o}_{n}) \tag{11}
\]

with the observation matrix $\bm{H}\in\mathbb{R}^{D_\zeta\times D_x}$ defined by $\bm{H}\bm{x}=\bm{\zeta}$ and the measurement error covariance matrix of the delay vectors given by $\bm{\Gamma}^{\rm dc}=\mathbf{I}\otimes\bm{\Gamma}\in\mathbb{R}^{D_\zeta\times D_\zeta}$ (compare (1)). Here $\bm{A}\otimes\bm{B}$ denotes the Kronecker product of two matrices. The Kalman gain matrix $\bm{K}_n$ is given by

\[
\bm{K}_{n}=\bm{P}^{\rm f}_{n}\bm{H}^{\rm T}\left(\bm{H}\bm{P}^{\rm f}_{n}\bm{H}^{\rm T}+\bm{\Gamma}^{\rm dc}\right)^{-1} \tag{12}
\]

with forecast covariance matrix

\[
\bm{P}^{\rm f}_{n} = \langle\hat{\bm{x}}^{\rm f}_{n}\otimes\hat{\bm{x}}^{\rm f}_{n}\rangle
= \begin{pmatrix}
\langle\hat{\bm{\zeta}}^{\rm f}_{n}\otimes\hat{\bm{\zeta}}^{\rm f}_{n}\rangle & \langle\hat{\bm{\zeta}}^{\rm f}_{n}\otimes\hat{\bm{w}}^{\rm f}_{n}\rangle\\
\langle\hat{\bm{w}}^{\rm f}_{n}\otimes\hat{\bm{\zeta}}^{\rm f}_{n}\rangle & \langle\hat{\bm{w}}^{\rm f}_{n}\otimes\hat{\bm{w}}^{\rm f}_{n}\rangle
\end{pmatrix}. \tag{15}
\]

The angular bracket denotes the expectation value and the hat denotes the deviation of $\bm{x}^{\rm f}_n$ from its mean $\overline{\bm{x}}^{\rm f}_n=\langle\bm{x}^{\rm f}_n\rangle$, as in

\[
\hat{\bm{\zeta}}^{\rm f}_{n}=\bm{\zeta}^{\rm f}_{n}-\overline{\bm{\zeta}}^{\rm f}_{n}. \tag{16}
\]

Since $\bm{H}\bm{x}=\bm{\zeta}$, we can separate the state and parameter update of the Kalman analysis step (11) as

\[
\overline{\bm{\zeta}}^{\rm a}_{n} = \overline{\bm{\zeta}}^{\rm f}_{n}-\bm{P}^{\rm f}_{\bm{\zeta}\bm{\zeta}}\left(\bm{P}^{\rm f}_{\bm{\zeta}\bm{\zeta}}+\bm{\Gamma}^{\rm dc}\right)^{-1}\Delta\bm{I}_{n} \tag{17a}
\]
\[
\overline{\bm{w}}^{\rm a}_{n} = \overline{\bm{w}}^{\rm f}_{n}-\bm{P}^{\rm f}_{\bm{w}\bm{\zeta}}\left(\bm{P}^{\rm f}_{\bm{\zeta}\bm{\zeta}}+\bm{\Gamma}^{\rm dc}\right)^{-1}\Delta\bm{I}_{n} \tag{17b}
\]

with innovation

\[
\Delta\bm{I}_{n}:=\overline{\bm{\zeta}}^{\rm f}_{n}-\bm{\zeta}^{\rm o}_{n} \tag{18}
\]

and covariance matrices $\bm{P}^{\rm f}_{\bm{\zeta}\bm{\zeta}}=\langle\hat{\bm{\zeta}}^{\rm f}_n\otimes\hat{\bm{\zeta}}^{\rm f}_n\rangle$ and $\bm{P}^{\rm f}_{\bm{w}\bm{\zeta}}=\langle\hat{\bm{w}}^{\rm f}_n\otimes\hat{\bm{\zeta}}^{\rm f}_n\rangle$. Note that the delay vectors $\bm{\zeta}^{\rm o}_n$ are correlated in time, as after each $\tau$ steps $(m-1)d$ components reappear. This implies that the setting described above is not strictly Bayesian. However, if the delay time $\tau$ is sufficiently large, the dynamics of the combined forecast-analysis dynamical system will have sufficiently decorrelated, and for practical purposes we can treat the observations as independent.

To implement the Kalman analysis step, we employ a stochastic EnKF (Burgers et al., 1998, Evensen, 2006). This allows us to estimate the forecast covariances adapted to the dynamics and has advantages for the nonlinear forward model (10) and the non-Gaussian augmented state variables. Consider an ensemble of states $\bm{X}\in\mathbb{R}^{D_x\times M}$ consisting of $M$ members $\bm{x}^{(i)}\in\mathbb{R}^{D_x\times 1}$, $i=1,\ldots,M$, that is,

\[
\bm{X}=\left[\bm{x}^{(1)},\bm{x}^{(2)},\dots,\bm{x}^{(M)}\right], \tag{19}
\]

with empirical mean

\[
\overline{\bm{x}}=\frac{1}{M}\sum_{i=1}^{M}\bm{x}^{(i)}, \tag{20}
\]

and associated matrix of ensemble deviations

\[
\hat{\bm{X}}=\left[\bm{x}^{(1)}-\overline{\bm{x}},\,\bm{x}^{(2)}-\overline{\bm{x}},\,\dots,\,\bm{x}^{(M)}-\overline{\bm{x}}\right]. \tag{21}
\]

Ensembles for the forecast are again denoted by superscript f and those for the analysis by superscript a. In the forecast step each ensemble member is propagated individually using (10), updating the previous analysis ensemble $\bm{X}^{\rm a}_{n-1}$ to the next forecast ensemble $\bm{X}^{\rm f}_n$. The forecast covariance matrix (15) used in the analysis step (11) can be estimated as a Monte-Carlo approximation from the forecast ensemble deviation matrix $\hat{\bm{X}}^{\rm f}_n$ via

\[
\bm{P}^{\rm f}_{n}=\frac{1}{M-1}\hat{\bm{X}}^{\rm f}_{n}\,(\hat{\bm{X}}^{\rm f}_{n})^{\rm T}\in\mathbb{R}^{D_x\times D_x}. \tag{22}
\]

In the stochastic ensemble Kalman filter the observations $\bm{\zeta}^{\rm o}_n$ receive stochastic perturbations $\bm{\eta}^{(i)}_n\in\mathbb{R}^{D_\zeta\times 1}$, $i=1,\ldots,M$, drawn independently from the Gaussian observational noise distribution $\mathcal{N}(\bm{0},\bm{\Gamma}^{\rm dc})$. The associated ensemble of perturbed observations $\bm{Z}^{\rm p}_n\in\mathbb{R}^{D_\zeta\times M}$ is given by

\[
\bm{Z}^{\rm p}_{n}=\left[\bm{\zeta}^{\rm o}_{n}-\bm{\eta}^{(1)}_{n},\,\bm{\zeta}^{\rm o}_{n}-\bm{\eta}^{(2)}_{n},\,\ldots,\,\bm{\zeta}^{\rm o}_{n}-\bm{\eta}^{(M)}_{n}\right]. \tag{23}
\]

The EnKF analysis update step is then given by

\[
\bm{X}^{\rm a}_{n}=\bm{X}^{\rm f}_{n}-\bm{K}_{n}\Delta\bm{I}_{n}, \tag{24}
\]

with the Kalman gain defined by (12) using (22) and stochastic innovation

\[
\Delta\bm{I}_{n}=\bm{H}\bm{X}^{\rm f}_{n}-\bm{Z}^{\rm p}_{n}. \tag{25}
\]

To mitigate finite ensemble size effects, covariance inflation with $\bm{P}^{\rm f}_n\to\alpha\bm{P}^{\rm f}_n$ is typically introduced with $\alpha>1$ (Anderson and Anderson, 1999). Such multiplicative inflation preserves the ensemble mean but increases the forecast error covariance. In Gottwald and Reich (2021), which considers the fully observed case with $D=d$, localisation was employed by only considering covariances between a component of the state variable and its associated block in the parameter vector $\bm{w}$. This localisation was found not to be advantageous here; we believe that this is due to the fact that each component of the delay vectors appears in all components at different times. We therefore do not perform any localisation in the numerical simulations presented below.
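A single stochastic EnKF analysis step, including the multiplicative inflation just described, can be sketched as follows; the function and variable names are ours, and the sketch simply implements (12) and (22)-(25) without any claim of being an optimized implementation.

```python
import numpy as np

def enkf_analysis(X_f, zeta_obs, H, Gamma_dc, alpha, rng):
    """One stochastic EnKF analysis step with multiplicative covariance inflation.

    X_f      : (D_x, M) forecast ensemble of augmented states x = (zeta, w)
    zeta_obs : (D_zeta,) observed delay vector at the current time
    H        : (D_zeta, D_x) observation matrix with H x = zeta
    Gamma_dc : (D_zeta, D_zeta) measurement error covariance of the delay vectors
    alpha    : inflation factor, alpha >= 1
    """
    D_x, M = X_f.shape
    X_hat = X_f - X_f.mean(axis=1, keepdims=True)              # ensemble deviations (21)
    P_f = alpha * (X_hat @ X_hat.T) / (M - 1)                   # inflated forecast covariance (22)
    K = P_f @ H.T @ np.linalg.inv(H @ P_f @ H.T + Gamma_dc)     # Kalman gain (12)
    eta = rng.multivariate_normal(np.zeros(len(zeta_obs)), Gamma_dc, size=M).T
    Z_p = zeta_obs[:, None] - eta                                # perturbed observations (23)
    return X_f - K @ (H @ X_f - Z_p)                             # analysis ensemble (24)-(25)
```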

The initial ensemble $\bm{X}^{\rm a}_0$ is drawn according to $\bm{\zeta}^{\rm a}_0\sim\mathcal{N}(\bm{\zeta}^{\rm o}_0,\bm{\Gamma}^{\rm dc})$ and $\bm{w}^{\rm a}_0\sim\mathcal{N}(\bm{w}_{\rm LR},\gamma\,\mathbf{I})$, where $\bm{w}_{\rm LR}$ is the vectorial form of the solution $\bm{W}_{\rm LR}$ to the ridge regression formulation (7) and $\gamma>0$ is a parameter specifying the spread of the initial ensemble.

We remark that the EnKF does not minimize the cost function (7) but rather approximates the posterior distribution of the weights $\bm{W}$ given the observed delay vectors $\bm{\zeta}^{\rm o}_n$, $n=0,\ldots,N$, under the assumed measurement model (1) and vanishing model errors in the propagator (5). Including stochastic model errors into RAFDA would be straightforward.

The ensemble forecast step defined by (10), together with the EnKF analysis step (24), constitutes our combined RAFDA method. We run RAFDA for a single long training data set of length $N$. The approximation of the propagator map is given by the random feature model (5), where the weight matrix $\bm{W}$ is given by the ensemble mean of the weight matrix $\bm{W}^{\rm a}_N$ at the final training time $t_N=\Delta t\,N$ and is denoted by $\bm{W}_{\rm RAFDA}$.

We summarize our RAFDA method in Algorithm 1.

input data: time series $\bm{y}^{\rm o}_n$, $n=0,\dots,N$
parameters: random feature maps: dimension $D_r$, internal parameters $\bm{W}_{\rm in}\in\mathbb{R}^{D_r\times D_\zeta}$, $\bm{b}_{\rm in}\in\mathbb{R}^{D_r\times 1}$;
EnKF: ensemble size $M$, measurement error covariance $\bm{\Gamma}^{\rm dc}=\mathbf{I}\otimes\bm{\Gamma}$, inflation $\alpha$, initial ensemble parameters $(\bm{w}_{\rm LR},\gamma)$
perform the following:
delay coordinate embedding:
     form delay vectors $\bm{\zeta}^{\rm o}_n\in\mathbb{R}^{D_\zeta\times 1}$;
initializing ensemble:
     set $\bm{X}^{\rm a}_0$ with members drawn according to
     $\bm{\zeta}^{\rm a}_0\sim\mathcal{N}(\bm{\zeta}^{\rm o}_0,\bm{\Gamma}^{\rm dc})$ and $\bm{w}^{\rm a}_0\sim\mathcal{N}(\bm{w}_{\rm LR},\gamma\,\mathbf{I})$;
for $n=1:N$ do
     forecast $\bm{X}^{\rm a}_{n-1}\to\bm{X}^{\rm f}_n$: each ensemble member is propagated according to
          $\bm{\zeta}^{\rm f}_n=\bm{W}^{\rm a}_{n-1}\,\bm{\phi}(\bm{\zeta}^{\rm a}_{n-1})$;
          $\bm{W}^{\rm f}_n=\bm{W}^{\rm a}_{n-1}$;
     data assimilation analysis update:
          inflation: $\bm{P}^{\rm f}_n\leftarrow\alpha\bm{P}^{\rm f}_n$;
          $\bm{X}^{\rm a}_n=\bm{X}^{\rm f}_n-\bm{K}_n\left(\bm{H}\bm{X}^{\rm f}_n-\bm{Z}^{\rm p}_n\right)$;
output: $\bm{W}_{\rm RAFDA}=$ ensemble average of $\bm{W}^{\rm a}_N$

Algorithm 1. Random Feature Map Data Assimilation (RAFDA)
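To make Algorithm 1 concrete, the sketch below assembles the training loop from the helper functions introduced above (phi, ridge_regression_weights and enkf_analysis); all function names and array layouts are our own illustrative choices rather than part of the method's specification.

```python
import numpy as np

def rafda_train(zeta_obs, D_r, Gamma_dc, M, alpha, beta, gamma, rng):
    """Sketch of Algorithm 1: sequential learning of W from noisy delay vectors.

    zeta_obs : (N+1, D_zeta) array of delay vectors zeta_0, ..., zeta_N
    Returns the ensemble mean W_RAFDA of the analysis weight matrices at final time.
    """
    N1, D_zeta = zeta_obs.shape
    D_x = D_zeta * (1 + D_r)
    H = np.hstack([np.eye(D_zeta), np.zeros((D_zeta, D_zeta * D_r))])   # H x = zeta

    # initial ensemble: states around zeta_0, weights around the ridge regression solution (9)
    W_LR = ridge_regression_weights(zeta_obs[1:].T, phi(zeta_obs[:-1].T), beta)
    X = np.empty((D_x, M))
    X[:D_zeta] = rng.multivariate_normal(zeta_obs[0], Gamma_dc, size=M).T
    X[D_zeta:] = W_LR.ravel()[:, None] + np.sqrt(gamma) * rng.standard_normal((D_zeta * D_r, M))

    for n in range(1, N1):
        # forecast step (10): each member is propagated with its own weight matrix
        for i in range(M):
            W_i = X[D_zeta:, i].reshape(D_zeta, D_r)
            X[:D_zeta, i] = (W_i @ phi(X[:D_zeta, i:i + 1])).ravel()
        # analysis step (24): perturbed observations, inflation and Kalman update
        X = enkf_analysis(X, zeta_obs[n], H, Gamma_dc, alpha, rng)

    return X[D_zeta:].mean(axis=1).reshape(D_zeta, D_r)                 # W_RAFDA
```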

3. Numerical results

We consider the Lorenz-63 system (Lorenz, 1963)

\[
\dot{x} = 10(y-x) \tag{26a}
\]
\[
\dot{y} = 28x-y-xz \tag{26b}
\]
\[
\dot{z} = -\tfrac{8}{3}z+xy \tag{26c}
\]

with $\bm{u}=(x,y,z)^{\rm T}\in\mathbb{R}^3$. Observations are only available for $x$, i.e. $\bm{G}=(1\;0\;0)$ and $d=1$ in (1), and are taken every $\Delta t=0.02$ time units. We employ observational noise with measurement error covariance $\bm{\Gamma}=\eta\mathbf{I}$ and use $\eta=0.2$ unless stated otherwise (compare (1)). We ensure that the dynamics evolves on the attractor by discarding an initial transient of $40$ model time units. The optimal embedding dimension and delay time for these parameters are estimated as $m=3$ and $\tau=10$. In the following, times are measured in units of the Lyapunov time, $t\lambda_{\rm max}$, with the maximal Lyapunov exponent $\lambda_{\rm max}=0.91$.
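The training data can be generated, for example, by integrating (26) with a standard fixed-step integrator and observing the $x$-component with additive noise as in (1); the following sketch is one such choice (the RK4 integrator, internal step size and random initial condition are our assumptions, not specified by the paper).

```python
import numpy as np

def lorenz63_rhs(u):
    """Right-hand side of the Lorenz-63 system (26)."""
    x, y, z = u
    return np.array([10.0 * (y - x), 28.0 * x - y - x * z, -8.0 / 3.0 * z + x * y])

def rk4_step(u, h):
    k1 = lorenz63_rhs(u); k2 = lorenz63_rhs(u + 0.5 * h * k1)
    k3 = lorenz63_rhs(u + 0.5 * h * k2); k4 = lorenz63_rhs(u + h * k3)
    return u + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def generate_observations(n_obs, dt=0.02, eta=0.2, transient=40.0, seed=0):
    """Noisy partial observations of x as in (1) with G = (1 0 0) and Gamma = eta*I."""
    rng = np.random.default_rng(seed)
    h = dt / 10.0                                 # internal integration step (our choice)
    u = rng.standard_normal(3)                    # random initial condition
    for _ in range(int(transient / h)):           # discard transient so the dynamics is on the attractor
        u = rk4_step(u, h)
    y_obs = np.empty(n_obs)
    for n in range(n_obs):
        y_obs[n] = u[0] + np.sqrt(eta) * rng.standard_normal()   # observe x with measurement noise
        for _ in range(10):                       # advance by one observation interval dt
            u = rk4_step(u, h)
    return y_obs
```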

We use a reservoir of size $D_r=300$ with internal parameters $w=0.005$ and $b=4$, and an ensemble size of $M=300$ for the ensemble Kalman filter. The regularization parameter is set to $\beta=2\times 10^{-5}$ and the inflation parameter to $\alpha=1+0.01\Delta t=1.0002$. We employ a training set of length $N=4{,}000$.

To test the ability of RAFDA to learn a surrogate model (5), we generate a validation data set $\bm{\zeta}_{\rm valid}(t_n)$, $n\geq 0$, sampled from the $x$-component of the Lorenz-63 system (26) at the same rate as the training data set, with $\Delta t=0.02$. To quantify the forecast skill we measure the forecast time $\tau_f$, defined as the largest time such that the relative forecast error satisfies $\mathcal{E}(t_n)=\|\bm{\zeta}_{\rm valid}(t_n)-\bm{\zeta}_n\|^2/\|\bm{\zeta}_{\rm valid}(t_n)\|^2\leq\theta$, where the $\bm{\zeta}_n$ are generated by the learned surrogate model. We choose here $\theta=40$.
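The forecast time $\tau_f$ of a learned weight matrix can then be measured by iterating the surrogate map (5) freely and recording when the relative error first exceeds $\theta$; a minimal sketch (reusing the phi helper from above; all names are illustrative) is:

```python
import numpy as np

def forecast_time(W, zeta_valid, dt=0.02, lyap_max=0.91, theta=40.0):
    """Forecast time tau_f in Lyapunov-time units for a learned weight matrix W.

    zeta_valid : (n_steps, D_zeta) validation delay vectors; zeta_valid[0] starts the free run.
    """
    zeta = zeta_valid[0].copy()
    for n in range(1, len(zeta_valid)):
        zeta = (W @ phi(zeta[:, None])).ravel()                     # iterate the surrogate map (5)
        err = np.sum((zeta_valid[n] - zeta) ** 2) / np.sum(zeta_valid[n] ** 2)
        if err > theta:
            return (n - 1) * dt * lyap_max
    return (len(zeta_valid) - 1) * dt * lyap_max
```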

The forecast skill depends crucially on the randomly chosen internal parameters $(\bm{W}_{\rm in},\bm{b}_{\rm in})$ as well as on the training and validation data sets. We therefore report the mean behaviour over $500$ realisations, differing in the training and validation data sets, in the random draws of the internal parameters $(\bm{W}_{\rm in},\bm{b}_{\rm in})$ and in the initial ensembles for RAFDA. Each training and validation data set is generated from randomly drawn initial conditions which are evolved independently over $40$ model time units to ensure that the dynamics has settled on the attractor.

In the following, we compare RAFDA with standard random feature maps using linear regression (LR). Figure 1 shows the empirical histogram of forecast times for both RAFDA and LR when the surrogate model (5) is trained on data contaminated with noise of strength $\eta=0.2$. The mean forecast time obtained by RAFDA is $\tau_f=2.12$ and is roughly three times larger than the forecast time obtained using standard random feature maps, which yield $\tau_f=0.77$. Furthermore, the distribution of forecast times for RAFDA is heavily skewed towards larger forecast times. As expected, these forecast times are smaller than for the fully observed case with $\bm{G}=\mathbf{I}$, as considered in Gottwald and Reich (2021), where extreme values of $\tau_f=9.28$ were reported. Figure 2 shows a typical example of $x(t)$ for a forecast time of $\tau_f=2.6$, which is close to the mean forecast time. Figure 3 shows that the surrogate model (5) obtained using RAFDA approximates the long-time statistics of the full Lorenz system (26) very well, in the sense that its trajectories reproduce the attractor in delay-coordinate space. In contrast, LR does not lead to surrogate models which are consistent with the dynamics of the Lorenz system (26). Reproducing dynamically consistent trajectories is, of course, paramount when such surrogate models are used for long-time integration and the interest is in the overall statistical behaviour rather than in accurate short-term forecasts.

The ability of RAFDA to control noise present in a training data set depends on the strength $\eta$ of the noise. We show in Figure 4 results for the mean forecast time over a range of $\eta\in[6.25\times 10^{-6},4.3\times 10^{5}]$. RAFDA outperforms standard random feature maps with linear regression for noise levels $\log\eta<5$. Moreover, there is a robust plateau with mean forecast times of around $\tau_f\approx 2$ for a large range of noise strengths with $\eta\lessapprox 1$. Notice that for $\eta\to 0$, maybe surprisingly, the mean forecast time of RAFDA does not converge to the one obtained by standard random feature maps but remains threefold larger. This was discussed in Gottwald and Reich (2021): although LR achieves a smaller value of the cost function (7) for $\eta\ll 1$, it does not generalise as well to unseen data. The cost function is not zero, as generically the output data do not lie in the span of the random feature maps. It is important to note that LR with $\beta>0$ assumes model error (which is not present in our application of simulating the Lorenz-63 system (26)) rather than observational error. We further note that, for very large noise levels $\log\eta>5$, the forecast times $\tau_f$ become zero for RAFDA. This is due to the data assimilation component of our method and is an instance of the well-documented problem of filter divergence (Ehrendorfer, 2007, Nadiga et al., 2013). More precisely, in finite-size ensembles, most ensemble members may align with the most unstable direction (Ng et al., 2011), implying a small forecast error covariance. In combination with large observation noise this leads to the filter trusting its own forecast, and the analysis is not corrected by incoming new observations. Filter divergence can be avoided by increasing the ensemble size and/or employing covariance inflation. We remark that we abstained here from fine-tuning all hyperparameters.

Figure 1. Empirical histogram of forecast times $\tau_f$ in units of the Lyapunov time for noise-contaminated observations with $\eta=0.2$. Results are shown for standard LR and for RAFDA.
Figure 2. Comparison of a representative validation time series (truth) and the corresponding RAFDA forecast, initialized with the first three data points from the truth. The measured forecast time of $\tau_f=2.6$ is close to the mean forecast time.
Figure 3. Attractor in the reconstructed phase space spanned by the delay-coordinates. The truth is depicted by diamonds (online red), RAFDA by filled circles (online blue) and LR by crosses (online cyan).
Figure 4. Forecast time $\tau_f$ as a function of the observational noise strength $\eta$. The error bars denote two standard deviations, estimated from $500$ independent realisations, differing in their randomly drawn internal parameters ($\bm{W}_{\rm in}$, $\bm{b}_{\rm in}$), training and validation data sets, and in the initial ensembles.

4. Discussion

We proposed a new data-driven method to estimate surrogate one-step forecast maps from noisy partial observations. Our method determines the surrogate map in the reconstructed phase space of delay coordinates, thereby dealing with the problem of having access only to partial observations. We employed the RAFDA framework in the reconstructed phase space, in which the surrogate map is constructed as a linear combination of random feature maps, the coefficients of which are learned sequentially using a stochastic EnKF. We showed that learning the coefficients of the linear combination of random feature maps sequentially with incoming new observations, rather than by least squares on the whole data set, greatly increases the forecast capability of the surrogate model and significantly controls the measurement noise.

Our numerical results showed that the quality of the surrogate model exhibits some degree of variance, depending on the draw of the arbitrarily fixed internal parameters $\bm{W}_{\rm in}$ and $\bm{b}_{\rm in}$ of the random feature map model. Being able to choose good candidates for those parameters, or learning them in conjunction with the rest of the surrogate model, would greatly improve the applicability of the method. In our previous work we provided some guidance on what constitutes good hyperparameters enabling good learning (Gottwald and Reich, 2021). Choosing actually optimal values for these hyperparameters requires, for example, an optimization procedure wrapped around the RAFDA method described here, and is planned for future research. Apart from $\bm{W}_{\rm in}$ and $\bm{b}_{\rm in}$ there are numerous hyperparameters which require tuning, such as the regularization parameter $\beta$, the inflation parameter $\alpha$, the size and the variance of the initial ensemble, as well as the assumed observational error covariance $\bm{\Gamma}$. To obtain optimal performance of RAFDA these would need to be tuned, which can be done by additional optimization procedures (Nadiga et al., 2019).

The random feature map architecture is closely related to reservoir computers (Jaeger, 2002, Jaeger and Haas, 2004, Pathak et al., 2018a, Jüngling et al., 2019). In reservoir computers the incoming signal is used in conjunction with an internal reservoir dynamics to produce the output. It is argued that this helps to take into account non-trivial memory of the underlying dynamical system. It will be interesting to see if random feature maps in the space of delay vectors perform as well as reservoir computers. For systems with a fast decay of correlation, such as the Lorenz 63 system, the additional reservoir dynamics did not lead to an improvement of the surrogate map (Gottwald and Reich, 2021), but for systems with slow decay of correlation, incorporating memory by taking into account information from past observations may be important.

Acknowledgments

SR is supported by Deutsche Forschungsgemeinschaft (DFG) - Project-ID 318763901 - SFB1294

References

  • Abarbanel et al. [2018] H. Abarbanel, P. Rozdeba, and S. Shirman. Machine learning: Deepest learning as statistical data assimilation problems. Neural Computation, 30:2025–2055, 2018.
  • Algar et al. [2019] S. D. Algar, T. Lymburn, T. Stemler, M. Small, and T. Jüngling. Learned emergence in selfish collective motion. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(12):123101, 2019. doi:10.1063/1.5120776.
  • Anderson and Anderson [1999] J. L. Anderson and S. L. Anderson. A Monte Carlo implementation of the nonlinear filtering problem to produce ensemble assimilations and forecasts. Monthly Weather Review, 127(12):2741–2758, 1999. doi:10.1175/1520-0493(1999)127<2741:AMCIOT>2.0.CO;2.
  • Bach [2017a] F. Bach. Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Research, 18(19):1–53, 2017a. URL http://jmlr.org/papers/v18/14-546.html.
  • Bach [2017b] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. J. Mach. Learn. Research, 18(21):1–38, 2017b. URL http://www.jmlr.org/papers/v18/15-178.html.
  • Barron [1993] A. Barron. Universal approximation bounds for superposition of a sigmoidal function. IEEE Trans. on Inform. Theory, 39:930–945, 1993. doi:10.1109/18.256500.
  • Bocquet et al. [2019] M. Bocquet, J. Brajard, A. Carrassi, and L. Bertino. Data assimilation as a learning tool to infer ordinary differential equation representations of dynamical models. Nonlinear Processes in Geophysics, 26:143–162, 2019. doi:10.5194/npg-26-143-2019.
  • Bocquet et al. [2020] M. Bocquet, J. Brajard, A. Carrassi, and L. Bertino. Bayesian inference of chaotic dynamics by merging data assimilation, machine learning and expectation-maximization. Foundations of Data Science, 2:55–80, 2020. doi:10.3934/fods.2020004.
  • Bocquet et al. [2021] M. Bocquet, A. Farchi, and Q. Malartic. Online learning of both state and dynamics using ensemble Kalman filters. Foundation of Data Science, published online, 2021. doi:10.3934/fods.2020015.
  • Bollt [2021] E. Bollt. On explaining the surprising success of reservoir computing forecaster of chaos? The universal machine learning dynamical system with contrast to VAR and DMD. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31(1):013108, 2021. doi:10.1063/5.0024890.
  • Bolton and Zanna [2019] T. Bolton and L. Zanna. Applications of deep learning to ocean data inference and subgrid parameterization. Journal of Advances in Modeling Earth Systems, 11(1):376–399, 2019. doi:10.1029/2018MS001472.
  • Brajard et al. [2020] J. Brajard, A. Carrassi, M. Bocquet, and L. Bertino. Combining data assimilation and machine learning to emulate a dynamical model from sparse and noisy observations: A case study with the Lorenz 96 model. Journal of Computational Science, 44:101171, 2020. doi:10.1016/j.jocs.2020.101171.
  • Brajard et al. [2021] J. Brajard, A. Carrassi, M. Bocquet, and L. Bertino. Combining data assimilation and machine learning to infer unresolved scale parametrization. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379(2194):20200086, 2021. doi:10.1098/rsta.2020.0086.
  • Burgers et al. [1998] G. Burgers, P. J. van Leeuwen, and G. Evensen. Analysis scheme in the ensemble Kalman filter. Monthly Weather Review, 126(6):1719–1724, 1998. doi:10.1175/1520-0493(1998)126<1719:ASITEK>2.0.CO;2.
  • Casdagli et al. [1991] M. Casdagli, S. Eubank, J. Farmer, and J. Gibson. State space reconstruction in the presence of noise. Physica D: Nonlinear Phenomena, 51(1):52–98, 1991. ISSN 0167-2789. doi:10.1016/0167-2789(91)90222-U.
  • Cleary et al. [2021] E. Cleary, A. Garbuno-Inigo, S. Lan, T. Schneider, and A. M. Stuart. Calibrate, emulate, sample. Journal of Computational Physics, 424:109716, 2021. ISSN 0021-9991. doi:10.1016/j.jcp.2020.109716.
  • Cybenko [1989] G. Cybenko. Approximation by superposition of a sigmoidal function. Math. Contr., Sign., and Syst., 2:303–314, 1989. doi:10.1007/BF02551274.
  • Dueben and Bauer [2018] P. D. Dueben and P. Bauer. Challenges and design choices for global weather and climate models based on machine learning. Geoscientific Model Development, 11(10):3999–4009, 2018. doi:10.5194/gmd-11-3999-2018.
  • E et al. [2020] W. E, C. Ma, S. Wojtowytsch, and L. Wu. Towards a mathematical understanding of neural network-based machine learning: What we know and what we don’t, 2020.
  • Ehrendorfer [2007] M. Ehrendorfer. A review of issues in ensemble-based Kalman filtering. Meteorologische Zeitschrift, 16(6):795–818, 2007.
  • Evensen [2006] G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer, New York, 2006.
  • Fraser and Swinney [1986] A. M. Fraser and H. L. Swinney. Independent coordinates for strange attractors from mutual information. Phys. Rev. A (3), 33(2):1134–1140, 1986. ISSN 1050-2947. doi:10.1103/PhysRevA.33.1134.
  • Gagne II et al. [2020] D. J. Gagne II, H. M. Christensen, A. C. Subramanian, and A. H. Monahan. Machine learning for stochastic parameterization: Generative adversarial networks in the Lorenz ’96 model. Journal of Advances in Modeling Earth Systems, 12(3), 2020. doi:10.1029/2019MS001896.
  • Gauthier et al. [2021] D. Gauthier, E. Bollt, A. Griffith, and W. Barbosa. Next generation reservoir computing, 2021.
  • Gottwald and Reich [2021] G. A. Gottwald and S. Reich. Supervised learning from noisy observations: Combining machine-learning techniques with data assimilation. Physica D: Nonlinear Phenomena, 423:132911, 2021. ISSN 0167-2789. doi:10.1016/j.physd.2021.132911.
  • Hamilton et al. [2016] F. Hamilton, T. Berry, and T. Sauer. Ensemble Kalman filtering without a model. Phys. Rev. X, 6:011021, Mar 2016. doi:10.1103/PhysRevX.6.011021.
  • Han et al. [2018] J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018. doi:10.1073/pnas.1718942115.
  • Han et al. [2019] J. Han, C. Ma, Z. Ma, and W. E. Uniformly accurate machine learning-based hydrodynamic models for kinetic equations. Proceedings of the National Academy of Sciences, 116(44):21983–21991, 2019. doi:10.1073/pnas.1909854116.
  • Hart et al. [2020] A. Hart, J. Hook, and J. Dawes. Embedding and approximation theorems for echo state networks. Neural Networks, 128:234–247, 2020. ISSN 0893-6080. doi:10.1016/j.neunet.2020.05.013.
  • Jaeger [2002] H. Jaeger. A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the ”echo state network” approach, 2002. GMD-Report 159, German National Research Institute for Computer Science.
  • Jaeger and Haas [2004] H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004. doi:10.1126/science.1091277.
  • Jüngling et al. [2019] T. Jüngling, T. Lymburn, T. Stemler, D. Correa, D. Walker, and M. Small. Reconstruction of complex dynamical systems from time series using reservoir computing. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5, 2019. doi:10.1109/ISCAS.2019.8702137.
  • Kantz and Schreiber [2004] H. Kantz and T. Schreiber. Nonlinear time series analysis. Cambridge University Press, Cambridge, second edition, 2004.
  • Kennel et al. [1992] M. B. Kennel, R. Brown, and H. D. I. Abarbanel. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Phys. Rev. A, 45:3403–3411, Mar 1992. doi:10.1103/PhysRevA.45.3403.
  • Levine and Stuart [2021] M. E. Levine and A. M. Stuart. A framework for machine learning of model error in dynamical systems, 2021.
  • Lorenz [1963] E. N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2):130–141, 1963. doi:10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2.
  • Lu et al. [2017] Z. Lu, J. Pathak, B. Hunt, M. Girvan, R. Brockett, and E. Ott. Reservoir observers: Model-free inference of unmeasured variables in chaotic systems. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(4):041102, 2017. doi:10.1063/1.4979665.
  • Maass et al. [2002] W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002. doi:10.1162/089976602760407955.
  • MATLAB [2020] MATLAB. version 9.9.0.1467703 (R2020b). The MathWorks Inc., Natick, Massachusetts, 2020.
  • Nadiga [2020] B. Nadiga. Reservoir computing as a tool for climate predictability studies. Journal of Advances in Modeling Earth Systems, n/a(n/a):e2020MS002290, 2020. doi:10.1029/2020MS002290.
  • Nadiga et al. [2019] B. Nadiga, C. Jiang, and D. Livescu. Leveraging Bayesian analysis to improve accuracy of approximate models. Journal of Computational Physics, 394:280–297, 2019. ISSN 0021-9991. doi:10.1016/j.jcp.2019.05.015.
  • Nadiga et al. [2013] B. T. Nadiga, W. R. Casper, and P. W. Jones. Ensemble-based global ocean data assimilation. Ocean Modelling, 72:210–230, 2013. ISSN 1463-5003. doi:10.1016/j.ocemod.2013.09.002.
  • Nakai and Saiki [2021] K. Nakai and Y. Saiki. Machine-learning construction of a model for a macroscopic fluid variable using the delay-coordinate of a scalar observable. Discrete & Continuous Dynamical Systems - S, 14(3):1079–1092, 2021.
  • Ng et al. [2011] G.-H. C. Ng, D. McLaughlin, D. Entekhabi, and A. Ahanin. The role of model dynamics in ensemble Kalman filter performance for chaotic systems. Tellus A, 63A:958–977, 2011.
  • Park and Sandberg [1991] J. Park and I. Sandberg. Universal approximation using radial-basis-function networks. Neural Computation, 3:246–257, 1991. doi:10.1162/neco.1991.3.2.246.
  • Pathak et al. [2018a] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Phys. Rev. Lett., 120:024102, Jan 2018a. doi:10.1103/PhysRevLett.120.024102.
  • Pathak et al. [2018b] J. Pathak, A. Wikner, R. Fussell, S. Chandra, B. R. Hunt, M. Girvan, and E. Ott. Hybrid forecasting of chaotic processes: Using machine learning in conjunction with a knowledge-based model. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28:041101, 2018b. doi:10.1063/1.5028373.
  • Platt et al. [2021] J. A. Platt, A. Wong, R. Clark, S. G. Penny, and H. D. I. Abarbanel. Forecasting using reservoir computing: The role of generalized synchronization, 2021.
  • Rahimi and Recht [2008] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf.
  • Rahimi and Recht [2008] A. Rahimi and B. Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561, 2008.
  • Raissi and Karniadakis [2018] M. Raissi and G. E. Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 357:125–141, 2018. doi:10.1016/j.jcp.2017.11.039.
  • Raissi et al. [2019] M. Raissi, P. Perdikaris, and G. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019. doi:10.1016/j.jcp.2018.10.045.
  • Rasp et al. [2018] S. Rasp, M. S. Pritchard, and P. Gentine. Deep learning to represent subgrid processes in climate models. Proceedings of the National Academy of Sciences, 115(39):9684–9689, 2018. ISSN 0027-8424. doi:10.1073/pnas.1810286115.
  • Rotskoff and Vanden-Eijnden [2018] G. M. Rotskoff and E. Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error, 2018.
  • Sauer and Yorke [1993] T. Sauer and J. A. Yorke. How many delay coordinates do you need? International Journal of Bifurcation and Chaos, 3(3):737–744, 1993.
  • Sauer et al. [1991] T. Sauer, J. A. Yorke, and M. Casdagli. Embedology. Journal of Statistical Physics, 65(3):579–616, 1991.
  • Schneider et al. [2017] T. Schneider, S. Lan, A. Stuart, and J. Teixeira. Earth system modeling 2.0: A blueprint for models that learn from observations and targeted high-resolution simulations. Geophysical Research Letters, 44(24):12,396–12,417, 2017. doi:10.1002/2017GL076101.
  • Schreiber [1999] T. Schreiber. Interdisciplinary application of nonlinear time series methods. Physics Reports, 308(1):1–64, 1999. ISSN 0370-1573. doi:10.1016/S0370-1573(98)00035-0.
  • Schreiber and Kantz [1995] T. Schreiber and H. Kantz. Noise in chaotic data: Diagnosis and treatment. Chaos: An Interdisciplinary Journal of Nonlinear Science, 5(1):133–142, 1995. doi:10.1063/1.166095.
  • Small [2005] M. Small. Applied nonlinear time series analysis, volume 52 of World Scientific Series on Nonlinear Science. Series A: Monographs and Treatises. World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ, 2005. ISBN 981-256-117-X. doi:10.1142/9789812567772. Applications in physics, physiology and finance.
  • Sun et al. [2019] Y. Sun, A. Gilbert, and A. Tewari. On the approximation capabilities of ReLU neural networks and random ReLU features, 2019.
  • Takens [1981] F. Takens. Detecting strange attractors in turbulence. In D. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, Warwick 1980, pages 366–381, Berlin, Heidelberg, 1981. Springer Berlin Heidelberg. ISBN 978-3-540-38945-3.
  • Tomizawa and Sawada [2021] F. Tomizawa and Y. Sawada. Combining Ensemble Kalman Filter and Reservoir Computing to predict spatio-temporal chaotic systems from imperfect observations and models, 2021.
  • Wikner et al. [2020] A. Wikner, J. Pathak, B. Hunt, M. Girvan, T. Arcomano, I. Szunyogh, A. Pomerance, and E. Ott. Combining machine learning with knowledge-based modeling for scalable forecasting and subgrid-scale closure of large, complex, spatiotemporal systems. Chaos: An Interdisciplinary Journal of Nonlinear Science, page 053111, 2020. doi:10.1063/5.0005541.
  • Wikner et al. [2021] A. Wikner, J. Pathak, B. R. Hunt, I. Szunyogh, M. Girvan, and E. Ott. Using data assimilation to train a hybrid forecast system that combines machine-learning and knowledge-based components. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31(5):053114, 2021. doi:10.1063/5.0048050.