Combining machine learning and data assimilation to forecast dynamical systems from noisy partial observations

Georg A. Gottwald
School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia

Sebastian Reich
Institute of Mathematics, University of Potsdam, Germany

[email protected], [email protected]
Abstract.

We present a supervised learning method to learn the propagator map of a dynamical system from partial and noisy observations. In our computationally cheap and easy-to-implement framework a neural network consisting of random feature maps is trained sequentially by incoming observations within a data assimilation procedure. By employing Takens’ embedding theorem, the network is trained on delay coordinates. We show that the combination of random feature maps and data assimilation, called RAFDA, outperforms standard random feature maps for which the dynamics is learned using batch data.

Determining computationally inexpensive surrogate maps, trained on partial and noisy observations, is of utmost significance. Such surrogate maps would allow for long-time simulation of dynamical systems, the governing equations of which are unknown. Similarly, they would allow for long-time integration of complex high-dimensional dynamical systems with known underlying equations where one is however only interested in the dynamics of a select subset of resolved variables. Surrogate maps should produce trajectories which are consistent with the dynamical system even after long time integration. We consider here the situation when such systems are accessible only by partial observations of the resolved variables and where these observations are contaminated by measurement noise. By formulating a machine learning framework in delay coordinate space, we show that a partial and noisy training data set can be used to learn the dynamics in the reconstructed phase space. The learned surrogate map provides a computationally cheap forecast model for single forecasts up to several Lyapunov times as well as for dynamically consistent long-time simulations.

1. Introduction

Recent years have seen an increased interest in data-driven methods with the aim of developing cheap surrogate models for forecasting dynamical systems. A main driver behind this endeavour is the potential saving of computational running time required for simulation. This is particularly important for stiff multi-scale systems for which the fastest time-scale puts strong restrictions on the time step which can be employed (Han et al., 2018, 2019, Raissi and Karniadakis, 2018, Raissi et al., 2019). Promising applications for such surrogate models come from atmospheric and ocean dynamics and from climate dynamics (Schneider et al., 2017, Dueben and Bauer, 2018, Rasp et al., 2018, Gagne II et al., 2020, Bolton and Zanna, 2019, Brajard et al., 2021, Nadiga, 2020, Cleary et al., 2021).

A particularly simple and efficient machine learning architecture is that of random feature maps (Rahimi and Recht, 2008, 2008, Bach, 2017a, b, Sun et al., 2019). Random feature maps provide a representation of the surrogate propagator for a dynamical system by a linear combination of randomly generated high-dimensional nonlinear functions of the input data. The training of random feature map networks only requires linear least-squares regression, and it was proven rigorously that random feature maps enjoy the so-called universal approximation property, which states that they can approximate any continuous function arbitrarily closely (Park and Sandberg, 1991, Cybenko, 1989, Barron, 1993). The framework of random feature maps was extended to include internal dynamics in so-called echo-state networks and reservoir computers, with remarkable success in forecasting dynamical systems (Maass et al., 2002, Jaeger, 2002, Jaeger and Haas, 2004, Pathak et al., 2018a, Jüngling et al., 2019, Algar et al., 2019, Nadiga, 2020, Bollt, 2021, Gauthier et al., 2021, Platt et al., 2021).

These linear regression based training methods assume model errors rather than measurement errors in the data (Gottwald and Reich, 2021). However, the skill of random feature maps in providing a reliable surrogate forecast model can be severely impeded when the data used for training are contaminated by measurement noise. In previous work Gottwald and Reich (2021) showed that by combining random feature maps with ensemble-based sequential data assimilation, in a framework coined RAndom Feature map Data Assimilation (RAFDA), the noise can be effectively controlled leading to remarkable forecast skills for noisy observations. Moreover, RAFDA was shown to be able to provide model closure in multi-scale systems allowing for subgridscale parametrization as well as to provide reliable ensembles for probabilistic forecasting. In RAFDA the parameters of the random feature maps are learned sequentially with incoming observations in conjunction with the analysis of the state variables using an ensemble Kalman filter (EnKF) (Evensen, 2006). RAFDA thus combines a powerful approximation tool in the form of random feature maps with an advanced sequential and derivative-free data assimilation technique, which is particularly well suited for high-dimensional state and parameter estimation problems.

Several alternative combinations of data assimilation and machine learning techniques have recently been explored. Recently proposed methodologies include purely minimization-based approaches (Bocquet et al., 2019), approaches which combine ensemble-based sequential state estimation with minimization-based parameter estimation (Brajard et al., 2020, Bocquet et al., 2020), and purely ensemble-based sequential joint state and parameter estimation (Bocquet et al., 2021). Furthermore, an intimate formal equivalence between certain neural network architectures and data assimilation has been noted by Abarbanel et al. (2018). Finally, sequential data assimilation has also been extended to echo-state networks and reservoir computers (Tomizawa and Sawada, 2021, Wikner et al., 2021).

In this work, we extend RAFDA to the situation when only partial observations are available. In this case, the propagator map that updates the system in the partially observed subspace only is non-Markovian, and knowledge of the past history is required. This problem was recently addressed within a machine learning context using appropriate model closures (Levine and Stuart, 2021). To account for the non-Markovian nature of the propagator map, we instead follow a different path here and employ phase-space reconstruction and Takens’ embedding theorem (Takens, 1981). Takens’ embedding theorem has been successfully used in data assimilation before, where a forecast model was non-parametrically constructed using analogs (Hamilton et al., 2016). Recently, reservoir computers, which may be viewed as providing an embedding of the dynamics in the reservoir space (Lu et al., 2017, Hart et al., 2020), were used to learn the dynamics in the reconstructed phase space using delay coordinates and linear regression for parameter estimation (Nakai and Saiki, 2021). Here we use delay coordinates in combination with sequential data assimilation to learn a surrogate forecast model directly in the reconstructed phase space from noisy and partial observations.

While we follow a non-parametric or model-free approach in this paper, another promising direction is to combine machine learning and knowledge-based modelling components (Pathak et al., 2018b, Gagne II et al., 2020, Bolton and Zanna, 2019, Wikner et al., 2020, Brajard et al., 2021, Wikner et al., 2021). We note that RAFDA also allows for such extensions (Gottwald and Reich, 2021).

The paper is organised as follows. In Section 2 we develop our RAFDA methodology for partial observations. In Section 3 we show how RAFDA performs on the Lorenz 63 system and that RAFDA exhibits significantly increased forecast skill when compared to standard random feature maps trained on delay coordinates using linear regression. We close in Section 4 with a discussion and an outlook.

2. Random feature map and data assimilation (RAFDA)

Consider a $D$-dimensional dynamical system $\dot{\bm{u}}=\bm{\mathcal{F}}(\bm{u})$ which is accessed at equidistant times $t_n=n\Delta t$ of interval length $\Delta t>0$, $n\geq 0$, by partial noisy observations

\[
\bm{y}^{\rm o}_{n}=\bm{G}\bm{u}_{n}+\bm{\Gamma}^{1/2}\,\bm{\eta}_{n} \tag{1}
\]

with $\bm{u}_n=\bm{u}(t_n)$, observation operator $\bm{G}:\mathbb{R}^{D}\to\mathbb{R}^{d}$, measurement error covariance matrix $\bm{\Gamma}\in\mathbb{R}^{d\times d}$ and $d$-dimensional independent and normally distributed noise $\bm{\eta}_n$, that is, $\bm{\eta}_n\sim\mathcal{N}(\bm{0},\mathbf{I})$.

The aim we set out to pursue is the following: using noisy observations $\bm{y}^{\rm o}_j$ for $0\leq j\leq n$, find an approximation to the propagator map which maps previous observations (or a suitable subset of them) to the yet unobserved variable $\bm{y}_{n+1}$ at the future time $t_{n+1}$. In the Markovian case $d=D$ with $\bm{G}=\bm{I}$ the propagator map is a function of the observation at the current time $t_n$ only. This case was studied in our previous work (Gottwald and Reich, 2021). Here the focus is on partial observations with $d<D$, for which the propagator map is non-Markovian. Our method judiciously combines three separate ingredients which we discuss in the following three subsections: 1.) Takens embedding and delay coordinates, to deal with having access only to partial observations and the resulting non-Markovian propagator map, 2.) random feature maps, which form our approximation of the propagator map, and 3.) data assimilation and EnKFs, to control the observation noise and to determine the parameters of the random feature maps, leading to the desired surrogate model.

2.1. Delay coordinates

When only partial observations are available, Takens’ embedding theorem allows the dynamics to be represented faithfully by means of phase space reconstruction and delay vectors (Takens, 1981, Sauer et al., 1991, Sauer and Yorke, 1993). Takens’ embedding theorem assumes noise-free observations, but has been successfully applied to noise-contaminated observations (Kantz and Schreiber, 2004, Schreiber, 1999, Small, 2005, Casdagli et al., 1991, Schreiber and Kantz, 1995).

We describe here the key steps for scalar-valued observations, that is, $d=1$ in (1). Hence, given a time series $y^{\rm o}_n$, define the $m$-dimensional delay vectors

\[
\bm{\zeta}^{\rm o}_{n} = \left(y^{\rm o}_{n},\,y^{\rm o}_{n+\tau},\,y^{\rm o}_{n+2\tau},\,\dots,\,y^{\rm o}_{n+(m-1)\tau}\right)^{\rm T} \tag{2}
\]

for $n=0,\ldots,N$, with integer delay time $\tau>0$. If the underlying dynamics lies inside a $D$-dimensional phase space, then the delay reconstruction map $y^{\rm o}_n\to\bm{\zeta}^{\rm o}_n$ is an embedding provided the embedding dimension is sufficiently large with $m>2D_{\rm box}$, where $D_{\rm box}$ denotes the fractal box-counting dimension of the attractor (Sauer et al., 1991, Sauer and Yorke, 1993). Under these conditions, the reconstructed dynamics $\bm{\zeta}^{\rm o}_0,\bm{\zeta}^{\rm o}_1,\bm{\zeta}^{\rm o}_2,\dots$ in $\mathbb{R}^m$ faithfully represents the underlying dynamics. For finite time series, the embedding dimension $m$ and the delay time $\tau$ are crucial and need to be chosen carefully; see, for example, Kantz and Schreiber (2004), Schreiber (1999), Casdagli et al. (1991), Schreiber and Kantz (1995), Small (2005) for a detailed discussion. Specifically, if the embedding dimension is chosen too small, the delay reconstruction map does not represent the underlying dynamics; if it is chosen too large, the reconstruction involves redundancy, reducing the length of the delay vector time series $\bm{\zeta}^{\rm o}_n$. We choose the embedding dimension $m$ using the false nearest neighbour algorithm (Kennel et al., 1992): the embedding dimension is chosen as the smallest value $m$ such that the fraction of false nearest neighbours is less than $10\%$. False nearest neighbours (at dimension $m$) are defined as those points $\bm{\zeta}^{\rm o}_n$ for which the Euclidean distance to their nearest neighbour changes by a factor of $10$ (relative to their smallest Euclidean distance) upon increasing the embedding dimension to $m+1$. An appropriate delay time $\tau$ is determined as the time for which the average mutual information attains its first minimum (Fraser and Swinney, 1986). We use the command phaseSpaceReconstruction from MATLAB (2020), which implements these algorithms, to determine the embedding dimension $m$ and the delay time $\tau$ as well as to generate the delay vectors.

The extension of the delay vectors (2) to multivariate observations (1) is straightforward and leads to delay vectors of dimension $D_\zeta=dm$.
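For concreteness, the following sketch shows how delay vectors of the form (2) can be assembled from a scalar time series; the function name and array layout are ours, and the embedding dimension $m$ and delay time $\tau$ are assumed to have been determined beforehand as described above.

```python
import numpy as np

def delay_vectors(y_obs, m, tau):
    """Form the m-dimensional delay vectors (2) from a scalar time series y_obs."""
    n_vectors = len(y_obs) - (m - 1) * tau                 # number of complete delay vectors
    # row n is (y_n, y_{n+tau}, ..., y_{n+(m-1)tau})
    return np.stack([y_obs[n:n + (m - 1) * tau + 1:tau] for n in range(n_vectors)])

# example: scalar observations embedded with m = 3 and tau = 10 (the values used in Section 3)
y = np.random.randn(4000)
Z = delay_vectors(y, m=3, tau=10)                          # shape (n_vectors, 3)
```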

2.2. Random feature maps

In this subsection we describe the traditional random feature map network architecture (Rahimi and Recht, 2008, 2008, Bach, 2017a, b, Sun et al., 2019). This is a particularly easy-to-implement network architecture in which the output is given as a linear combination of randomly sampled nonlinear functions of the input signal, and the coefficients of this linear combination are then learned from the whole available training data set via linear ridge regression. Our RAFDA extension, described further below, instead determines the coefficients sequentially within a data assimilation procedure.

Applied to our setting, the delay coordinates $\bm{\zeta}\in\mathbb{R}^{D_\zeta\times 1}$ are first linearly mapped by a random but fixed linear map into a high-dimensional subspace of $\mathbb{R}^{D_r}$ with $D_r\gg D_\zeta$ and then nonlinearly transformed by feature maps $\bm{\phi}$ as

\[
\bm{\phi}(\bm{\zeta})=\tanh(\bm{W}_{\rm in}\bm{\zeta}+\bm{b}_{\rm in})\in\mathbb{R}^{D_r\times 1} \tag{3}
\]

with weight matrix

\[
\bm{W}_{\rm in}=(\bm{w}_{{\rm in},1},\ldots,\bm{w}_{{\rm in},D_r})^{\rm T}\in\mathbb{R}^{D_r\times D_\zeta}
\]

and a bias

\[
\bm{b}_{\rm in}=(b_{{\rm in},1},\ldots,b_{{\rm in},D_r})^{\rm T}\in\mathbb{R}^{D_r\times 1}.
\]

The weight matrix and the bias are chosen randomly and independently of the observed delay coordinates $\bm{\zeta}^{\rm o}_n$, $n=0,\ldots,N$, according to the distributions $p(\bm{w}_{\rm in})$ and $p(b_{\rm in})$, respectively. We choose here

\[
(\bm{W}_{\rm in})_{ij}\sim\mathcal{U}[-w,w]\qquad\text{and}\qquad(\bm{b}_{\rm in})_{i}\sim\mathcal{U}[-b,b]. \tag{4}
\]

The choice of the hyperparameters $w>0$ and $b>0$ is system-dependent and needs to ensure that the observations cover the nonlinear domain of the $\tanh$-function. The reader is referred to Gottwald and Reich (2021) for further details. Note that the hyperparameters $\bm{W}_{\rm in}$ and $\bm{b}_{\rm in}$ are kept fixed once drawn and are not learned. This restriction is made for computational simplicity and can be relaxed following E et al. (2020), Rotskoff and Vanden-Eijnden (2018).
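The random feature map (3) with internal parameters drawn according to (4) amounts to a few lines of code. The sketch below is illustrative only; the sizes and hyperparameter values are taken from Section 3, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

D_zeta, D_r = 3, 300                   # delay-vector and feature dimensions (values of Section 3)
w, b = 0.005, 4.0                      # hyperparameters of the uniform distributions in (4)

# internal parameters: drawn once, then kept fixed and not learned
W_in = rng.uniform(-w, w, size=(D_r, D_zeta))
b_in = rng.uniform(-b, b, size=(D_r, 1))

def phi(zeta):
    """Random feature map (3); zeta has shape (D_zeta, 1) or (D_zeta, N)."""
    return np.tanh(W_in @ zeta + b_in)
```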

An approximation of the propagator map in the delay coordinates $\bm{\zeta}\in\mathbb{R}^{D_\zeta}$ is then provided by

\[
\Psi_{\rm S}(\bm{\zeta})=\bm{W}\bm{\phi}(\bm{\zeta}), \tag{5}
\]

where the matrix $\bm{W}\in\mathbb{R}^{D_\zeta\times D_r}$ maps back into the delay coordinate vector space to approximate the delay vector at the next time step, and is to be learned from the noisy delay vectors $\bm{\zeta}^{\rm o}_n$, $n=0,\ldots,N$. More specifically, the weights $\bm{W}$ should be chosen such that

\[
\bm{\zeta}^{\rm o}_{n} \approx \Psi_{\rm S}(\bm{\zeta}^{\rm o}_{n-1}), \tag{6}
\]

which is achieved by minimizing the regularized cost function

\[
\mathcal{L}(\bm{W}) = \frac{1}{2}\sum_{n=1}^{N}\|\bm{\zeta}^{\rm o}_{n}-\Psi_{\rm S}(\bm{\zeta}^{\rm o}_{n-1})\|^{2}+\frac{\beta}{2}\|\bm{W}\|^{2}_{\rm F}
= \frac{1}{2}\|\bm{Z}^{\rm o}-\bm{W}\bm{\Phi}\|_{\rm F}^{2}+\frac{\beta}{2}\|\bm{W}\|^{2}_{\rm F}, \tag{7}
\]

where $\|\bm{A}\|_{\rm F}$ denotes the Frobenius norm of a matrix $\bm{A}$, $\bm{Z}^{\rm o}\in\mathbb{R}^{D_\zeta\times N}$ is the matrix with columns $\bm{\zeta}^{\rm o}_n$, $n=1,\ldots,N$, and $\bm{\Phi}\in\mathbb{R}^{D_r\times N}$ is the matrix with columns

\[
\bm{\phi}_{n}=\bm{\phi}(\bm{\zeta}^{\rm o}_{n-1}), \tag{8}
\]

$n=1,\ldots,N$. The parameter $\beta>0$ is used for regularization. The solution of the minimization problem for (7) can be determined explicitly as

\[
\bm{W}_{\rm LR}=\bm{Z}^{\rm o}\bm{\Phi}^{\rm T}\left(\bm{\Phi}\bm{\Phi}^{\rm T}+\beta\mathbf{I}\right)^{-1}, \tag{9}
\]

and uses all available training data $\bm{Z}^{\rm o}$ at once. In Gottwald and Reich (2021) it was shown that standard random feature maps have difficulty dealing with noisy observations, and instead it was proposed to learn the output weights $\bm{W}$ sequentially within a data assimilation procedure, which is described in the next section.
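In code, the closed-form ridge regression solution (9) reads as follows; this is a minimal sketch with illustrative names, using a linear solve instead of an explicit matrix inverse.

```python
import numpy as np

def ridge_regression_weights(Z_obs, Phi, beta):
    """Ridge regression solution (9): W_LR = Z^o Phi^T (Phi Phi^T + beta I)^{-1}.

    Z_obs : (D_zeta, N) matrix with columns zeta_1, ..., zeta_N
    Phi   : (D_r, N)   matrix with columns phi(zeta_0), ..., phi(zeta_{N-1})
    beta  : regularization parameter
    """
    D_r = Phi.shape[0]
    A = Phi @ Phi.T + beta * np.eye(D_r)          # (D_r, D_r), symmetric positive definite
    return np.linalg.solve(A, Phi @ Z_obs.T).T    # equals Z_obs Phi^T A^{-1} since A is symmetric
```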

2.3. Data assimilation

RAFDA uses a combined parameter and state estimation within a data assimilation procedure. The main idea is to use the surrogate propagator (5), consisting of the random feature maps, as the forecast model within an EnKF and to estimate the weight matrix $\bm{W}$ sequentially. Concretely, we consider the forecast model for given weight matrix $\bm{W}^{\rm a}_{n-1}$ and given delay coordinates $\bm{\zeta}^{\rm a}_{n-1}$ as

\[
\bm{\zeta}^{\rm f}_{n} = \bm{W}^{\rm a}_{n-1}\,\bm{\phi}(\bm{\zeta}^{\rm a}_{n-1}) \tag{10a}
\]
\[
\bm{W}^{\rm f}_{n} = \bm{W}^{\rm a}_{n-1}, \tag{10b}
\]

where the superscript f denotes the forecast and the superscript a denotes the analysis defined below.

To incorporate the parameter estimation into an EnKF analysis step, we consider an augmented state space $\bm{x}=(\bm{\zeta}^{\rm T},\bm{w}^{\rm T})^{\rm T}\in\mathbb{R}^{D_x\times 1}$ with $D_x=D_\zeta(1+D_r)$. Here $\bm{w}\in\mathbb{R}^{D_\zeta D_r\times 1}$ is the vector consisting of all matrix elements of the weight matrix $\bm{W}$, with its entries defined block-wise, that is, $w_{1:D_r}=(W_{11},\dots,W_{1D_r})^{\rm T}$, $w_{D_r+1:2D_r}=(W_{21},\dots,W_{2D_r})^{\rm T}$ and so forth.

We now also treat $\bm{x}^{\rm f}_n$ and $\bm{x}^{\rm a}_n$, $n\geq 0$, as random variables. Assuming a Gaussian distribution for $\bm{x}^{\rm f}_{n+1}$, the analysis step for the mean $\overline{\bm{x}}^{\rm a}_n$ is given by

\[
\overline{\bm{x}}^{\rm a}_{n}=\overline{\bm{x}}^{\rm f}_{n}-\bm{K}_{n}(\bm{H}\overline{\bm{x}}^{\rm f}_{n}-\bm{\zeta}^{\rm o}_{n}) \tag{11}
\]

with the observation matrix $\bm{H}\in\mathbb{R}^{D_\zeta\times D_x}$ defined by $\bm{H}\bm{x}=\bm{\zeta}$ and the measurement error covariance matrix of the delay vectors given by $\bm{\Gamma}^{\rm dc}=\mathbf{I}\otimes\bm{\Gamma}\in\mathbb{R}^{D_\zeta\times D_\zeta}$ (compare (1)). Here $\bm{A}\otimes\bm{B}$ denotes the Kronecker product of two matrices. The Kalman gain matrix $\bm{K}_n$ is given by

\[
\bm{K}_{n}=\bm{P}^{\rm f}_{n}\bm{H}^{\rm T}\left(\bm{H}\bm{P}^{\rm f}_{n}\bm{H}^{\rm T}+\bm{\Gamma}^{\rm dc}\right)^{-1} \tag{12}
\]

with forecast covariance matrix

\[
\bm{P}^{\rm f}_{n} = \langle\hat{\bm{x}}^{\rm f}_{n}\otimes\hat{\bm{x}}^{\rm f}_{n}\rangle
= \begin{pmatrix}
\langle\hat{\bm{\zeta}}^{\rm f}_{n}\otimes\hat{\bm{\zeta}}^{\rm f}_{n}\rangle & \langle\hat{\bm{\zeta}}^{\rm f}_{n}\otimes\hat{\bm{w}}^{\rm f}_{n}\rangle\\
\langle\hat{\bm{w}}^{\rm f}_{n}\otimes\hat{\bm{\zeta}}^{\rm f}_{n}\rangle & \langle\hat{\bm{w}}^{\rm f}_{n}\otimes\hat{\bm{w}}^{\rm f}_{n}\rangle
\end{pmatrix}. \tag{15}
\]

The angular bracket denotes the expectation value and the hat denotes the deviation of $\bm{x}^{\rm f}_n$ from its mean $\overline{\bm{x}}^{\rm f}_n=\langle\bm{x}^{\rm f}_n\rangle$, as in

\[
\hat{\bm{\zeta}}^{\rm f}_{n}=\bm{\zeta}^{\rm f}_{n}-\overline{\bm{\zeta}}^{\rm f}_{n}. \tag{16}
\]

Since $\bm{H}\bm{x}=\bm{\zeta}$, we can separate the state and parameter update of the Kalman analysis step (11) as

\[
\overline{\bm{\zeta}}^{\rm a}_{n} = \overline{\bm{\zeta}}^{\rm f}_{n}-\bm{P}^{\rm f}_{\bm{\zeta}\bm{\zeta}}\left(\bm{P}^{\rm f}_{\bm{\zeta}\bm{\zeta}}+\bm{\Gamma}^{\rm dc}\right)^{-1}\Delta\bm{I}_{n} \tag{17a}
\]
\[
\overline{\bm{w}}^{\rm a}_{n} = \overline{\bm{w}}^{\rm f}_{n}-\bm{P}^{\rm f}_{\bm{w}\bm{\zeta}}\left(\bm{P}^{\rm f}_{\bm{\zeta}\bm{\zeta}}+\bm{\Gamma}^{\rm dc}\right)^{-1}\Delta\bm{I}_{n} \tag{17b}
\]

with innovation

\[
\Delta\bm{I}_{n}:=\overline{\bm{\zeta}}^{\rm f}_{n}-\bm{\zeta}^{\rm o}_{n} \tag{18}
\]

and covariance matrices $\bm{P}^{\rm f}_{\bm{\zeta}\bm{\zeta}}=\langle\hat{\bm{\zeta}}^{\rm f}_n\otimes\hat{\bm{\zeta}}^{\rm f}_n\rangle$ and $\bm{P}^{\rm f}_{\bm{w}\bm{\zeta}}=\langle\hat{\bm{w}}^{\rm f}_n\otimes\hat{\bm{\zeta}}^{\rm f}_n\rangle$. Note that the delay vectors $\bm{\zeta}^{\rm o}_n$ are correlated in time, as after each $\tau$ steps $(m-1)d$ components reappear. This implies that the setting described above is not strictly Bayesian. However, if the delay time $\tau$ is sufficiently large, the dynamics of the combined forecast-analysis dynamical system will have sufficiently decorrelated, and for practical purposes we can treat the observations as independent.

To implement the Kalman analysis step, we employ a stochastic EnKF (Burgers et al., 1998, Evensen, 2006). This allows us to estimate the forecast covariances adapted to the dynamics and has advantages for the nonlinear forward model (10) and the non-Gaussian augmented state variables. Consider an ensemble of states $\bm{X}\in\mathbb{R}^{D_x\times M}$ consisting of $M$ members $\bm{x}^{(i)}\in\mathbb{R}^{D_x\times 1}$, $i=1,\ldots,M$, that is,

\[
\bm{X}=\left[\bm{x}^{(1)},\bm{x}^{(2)},\dots,\bm{x}^{(M)}\right], \tag{19}
\]

with empirical mean

\[
\overline{\bm{x}}=\frac{1}{M}\sum_{i=1}^{M}\bm{x}^{(i)}, \tag{20}
\]

and associated matrix of ensemble deviations

\[
\hat{\bm{X}}=\left[\bm{x}^{(1)}-\overline{\bm{x}},\,\bm{x}^{(2)}-\overline{\bm{x}},\,\dots,\,\bm{x}^{(M)}-\overline{\bm{x}}\right]. \tag{21}
\]

Ensembles for the forecast are again denoted by superscript f and those for the analysis by superscript a. In the forecast step each ensemble member is propagated individually using (10), updating the previous analysis ensemble $\bm{X}^{\rm a}_{n-1}$ to the next forecast ensemble $\bm{X}^{\rm f}_n$. The forecast covariance matrix (15) used in the analysis step (11) can be estimated as a Monte-Carlo approximation from the forecast ensemble deviation matrix $\hat{\bm{X}}^{\rm f}_n$ via

\[
\bm{P}^{\rm f}_{n}=\frac{1}{M-1}\hat{\bm{X}}^{\rm f}_{n}\,(\hat{\bm{X}}^{\rm f}_{n})^{\rm T}\in\mathbb{R}^{D_x\times D_x}. \tag{22}
\]

In the stochastic ensemble Kalman filter the observations $\bm{\zeta}^{\rm o}_n$ receive stochastic perturbations $\bm{\eta}^{(i)}_n\in\mathbb{R}^{D_\zeta\times 1}$, $i=1,\ldots,M$, drawn independently from the Gaussian observational noise distribution $\mathcal{N}(\bm{0},\bm{\Gamma}^{\rm dc})$. The associated ensemble of perturbed observations $\bm{Z}^{\rm p}_n\in\mathbb{R}^{D_\zeta\times M}$ is given by

\[
\bm{Z}^{\rm p}_{n}=\left[\bm{\zeta}^{\rm o}_{n}-\bm{\eta}^{(1)}_{n},\,\bm{\zeta}^{\rm o}_{n}-\bm{\eta}^{(2)}_{n},\,\ldots,\,\bm{\zeta}^{\rm o}_{n}-\bm{\eta}^{(M)}_{n}\right]. \tag{23}
\]

The EnKF analysis update step is then given by

\[
\bm{X}^{\rm a}_{n}=\bm{X}^{\rm f}_{n}-\bm{K}_{n}\Delta\bm{I}_{n}, \tag{24}
\]

with the Kalman gain defined by (12) using (22) and stochastic innovation

\[
\Delta\bm{I}_{n}=\bm{H}\bm{X}^{\rm f}_{n}-\bm{Z}^{\rm p}_{n}. \tag{25}
\]

To mitigate finite ensemble size effects, covariance inflation with $\bm{P}^{\rm f}_n\to\alpha\bm{P}^{\rm f}_n$ is typically introduced with $\alpha>1$ (Anderson and Anderson, 1999). Such multiplicative inflation preserves the ensemble mean but increases the forecast error covariance. In Gottwald and Reich (2021), which considers the fully observed case with $D=d$, localisation was employed by only considering covariances between a component of the state variable and its associated block in the parameter vector $\bm{w}$. This localisation was found not to be advantageous here; we believe that this is due to the fact that each component of the delay vectors appears in all components at different times. We therefore do not perform any localisation in the numerical simulations presented below.
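A single stochastic EnKF analysis step, including the multiplicative inflation just described, can be sketched as follows; the function and variable names are ours, and the sketch simply implements (12) and (22)-(25) without any claim of being an optimized implementation.

```python
import numpy as np

def enkf_analysis(X_f, zeta_obs, H, Gamma_dc, alpha, rng):
    """One stochastic EnKF analysis step with multiplicative covariance inflation.

    X_f      : (D_x, M) forecast ensemble of augmented states x = (zeta, w)
    zeta_obs : (D_zeta,) observed delay vector at the current time
    H        : (D_zeta, D_x) observation matrix with H x = zeta
    Gamma_dc : (D_zeta, D_zeta) measurement error covariance of the delay vectors
    alpha    : inflation factor, alpha >= 1
    """
    D_x, M = X_f.shape
    X_hat = X_f - X_f.mean(axis=1, keepdims=True)              # ensemble deviations (21)
    P_f = alpha * (X_hat @ X_hat.T) / (M - 1)                   # inflated forecast covariance (22)
    K = P_f @ H.T @ np.linalg.inv(H @ P_f @ H.T + Gamma_dc)     # Kalman gain (12)
    eta = rng.multivariate_normal(np.zeros(len(zeta_obs)), Gamma_dc, size=M).T
    Z_p = zeta_obs[:, None] - eta                                # perturbed observations (23)
    return X_f - K @ (H @ X_f - Z_p)                             # analysis ensemble (24)-(25)
```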

The initial ensemble $\bm{X}^{\rm a}_0$ is drawn according to $\bm{\zeta}^{\rm a}_0\sim\mathcal{N}(\bm{\zeta}^{\rm o}_0,\bm{\Gamma}^{\rm dc})$ and $\bm{w}^{\rm a}_0\sim\mathcal{N}(\bm{w}_{\rm LR},\gamma\,\mathbf{I})$, where $\bm{w}_{\rm LR}$ is the vectorial form of the solution $\bm{W}_{\rm LR}$ to the ridge regression formulation (7) and $\gamma>0$ is a parameter specifying the spread of the initial ensemble.

We remark that the EnKF does not minimize the cost function (7) but rather approximates the posterior distribution of the weights $\bm{W}$ given the observed delay vectors $\bm{\zeta}^{\rm o}_n$, $n=0,\ldots,N$, under the assumed measurement model (1) and vanishing model errors in the propagator (5). Including stochastic model errors into RAFDA would be straightforward.

The ensemble forecast step defined by (10), together with the EnKF analysis step (24), constitutes our combined RAFDA method. We run RAFDA for a single long training data set of length $N$. The approximation of the propagator map is given by the random feature model (5), where the weight matrix $\bm{W}$ is given by the ensemble mean of the weight matrix $\bm{W}^{\rm a}_N$ at the final training time $t_N=\Delta t\,N$ and is denoted by $\bm{W}_{\rm RAFDA}$.

We summarize our RAFDA method in Algorithm 1.

input data: time series $\bm{y}^{\rm o}_n$, $n=0,\dots,N$
parameters: random feature maps: dimension $D_r$, internal parameters $\bm{W}_{\rm in}\in\mathbb{R}^{D_r\times D_\zeta}$, $\bm{b}_{\rm in}\in\mathbb{R}^{D_r\times 1}$;
EnKF: ensemble size $M$, measurement error covariance $\bm{\Gamma}^{\rm dc}=\mathbf{I}\otimes\bm{\Gamma}$, inflation $\alpha$, initial ensemble parameters $(\bm{w}_{\rm LR},\gamma)$
perform the following:
delay coordinate embedding:
     form delay vectors $\bm{\zeta}^{\rm o}_n\in\mathbb{R}^{D_\zeta\times 1}$;
initializing ensemble:
     set $\bm{X}^{\rm a}_0$ with members drawn according to
     $\bm{\zeta}^{\rm a}_0\sim\mathcal{N}(\bm{\zeta}^{\rm o}_0,\bm{\Gamma}^{\rm dc})$ and $\bm{w}^{\rm a}_0\sim\mathcal{N}(\bm{w}_{\rm LR},\gamma\,\mathbf{I})$;
for $n=1:N$ do
     forecast $\bm{X}^{\rm a}_{n-1}\to\bm{X}^{\rm f}_n$: each ensemble member is propagated according to
          $\bm{\zeta}^{\rm f}_n=\bm{W}^{\rm a}_{n-1}\,\bm{\phi}(\bm{\zeta}^{\rm a}_{n-1})$;
          $\bm{W}^{\rm f}_n=\bm{W}^{\rm a}_{n-1}$;
     data assimilation analysis update:
          inflation: $\bm{P}^{\rm f}_n\leftarrow\alpha\bm{P}^{\rm f}_n$;
          $\bm{X}^{\rm a}_n=\bm{X}^{\rm f}_n-\bm{K}_n\left(\bm{H}\bm{X}^{\rm f}_n-\bm{Z}^{\rm p}_n\right)$;
output: $\bm{W}_{\rm RAFDA}=$ ensemble average of $\bm{W}^{\rm a}_N$

Algorithm 1. Random Feature Map Data Assimilation (RAFDA)
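To make Algorithm 1 concrete, the sketch below assembles the training loop from the helper functions introduced above (phi, ridge_regression_weights and enkf_analysis); all function names and array layouts are our own illustrative choices rather than part of the method's specification.

```python
import numpy as np

def rafda_train(zeta_obs, D_r, Gamma_dc, M, alpha, beta, gamma, rng):
    """Sketch of Algorithm 1: sequential learning of W from noisy delay vectors.

    zeta_obs : (N+1, D_zeta) array of delay vectors zeta_0, ..., zeta_N
    Returns the ensemble mean W_RAFDA of the analysis weight matrices at final time.
    """
    N1, D_zeta = zeta_obs.shape
    D_x = D_zeta * (1 + D_r)
    H = np.hstack([np.eye(D_zeta), np.zeros((D_zeta, D_zeta * D_r))])   # H x = zeta

    # initial ensemble: states around zeta_0, weights around the ridge regression solution (9)
    W_LR = ridge_regression_weights(zeta_obs[1:].T, phi(zeta_obs[:-1].T), beta)
    X = np.empty((D_x, M))
    X[:D_zeta] = rng.multivariate_normal(zeta_obs[0], Gamma_dc, size=M).T
    X[D_zeta:] = W_LR.ravel()[:, None] + np.sqrt(gamma) * rng.standard_normal((D_zeta * D_r, M))

    for n in range(1, N1):
        # forecast step (10): each member is propagated with its own weight matrix
        for i in range(M):
            W_i = X[D_zeta:, i].reshape(D_zeta, D_r)
            X[:D_zeta, i] = (W_i @ phi(X[:D_zeta, i:i + 1])).ravel()
        # analysis step (24): perturbed observations, inflation and Kalman update
        X = enkf_analysis(X, zeta_obs[n], H, Gamma_dc, alpha, rng)

    return X[D_zeta:].mean(axis=1).reshape(D_zeta, D_r)                 # W_RAFDA
```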

3. Numerical results

We consider the Lorenz-63 system (Lorenz, 1963)

\[
\dot{x} = 10(y-x) \tag{26a}
\]
\[
\dot{y} = 28x-y-xz \tag{26b}
\]
\[
\dot{z} = -\tfrac{8}{3}z+xy \tag{26c}
\]

with $\bm{u}=(x,y,z)^{\rm T}\in\mathbb{R}^3$. Observations are only available for $x$, i.e. $\bm{G}=(1\;0\;0)$ and $d=1$ in (1), and are taken every $\Delta t=0.02$ time units. We employ observational noise with measurement error covariance $\bm{\Gamma}=\eta\mathbf{I}$ and use $\eta=0.2$ unless stated otherwise (compare (1)). We ensure that the dynamics evolves on the attractor by discarding an initial transient of $40$ model time units. The optimal embedding dimension and delay time for these parameters are estimated as $m=3$ and $\tau=10$. In the following, times are measured in units of the Lyapunov time, $t\lambda_{\rm max}$, with the maximal Lyapunov exponent $\lambda_{\rm max}=0.91$.
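The training data can be generated, for example, by integrating (26) with a standard fixed-step integrator and observing the $x$-component with additive noise as in (1); the following sketch is one such choice (the RK4 integrator, internal step size and random initial condition are our assumptions, not specified by the paper).

```python
import numpy as np

def lorenz63_rhs(u):
    """Right-hand side of the Lorenz-63 system (26)."""
    x, y, z = u
    return np.array([10.0 * (y - x), 28.0 * x - y - x * z, -8.0 / 3.0 * z + x * y])

def rk4_step(u, h):
    k1 = lorenz63_rhs(u); k2 = lorenz63_rhs(u + 0.5 * h * k1)
    k3 = lorenz63_rhs(u + 0.5 * h * k2); k4 = lorenz63_rhs(u + h * k3)
    return u + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def generate_observations(n_obs, dt=0.02, eta=0.2, transient=40.0, seed=0):
    """Noisy partial observations of x as in (1) with G = (1 0 0) and Gamma = eta*I."""
    rng = np.random.default_rng(seed)
    h = dt / 10.0                                 # internal integration step (our choice)
    u = rng.standard_normal(3)                    # random initial condition
    for _ in range(int(transient / h)):           # discard transient so the dynamics is on the attractor
        u = rk4_step(u, h)
    y_obs = np.empty(n_obs)
    for n in range(n_obs):
        y_obs[n] = u[0] + np.sqrt(eta) * rng.standard_normal()   # observe x with measurement noise
        for _ in range(10):                       # advance by one observation interval dt
            u = rk4_step(u, h)
    return y_obs
```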

We use a reservoir of size $D_r=300$ with internal parameters $w=0.005$ and $b=4$, and an ensemble size of $M=300$ for the ensemble Kalman filter. The regularization parameter is set to $\beta=2\times 10^{-5}$ and the inflation parameter to $\alpha=1+0.01\Delta t=1.0002$. We employ a training set of length $N=4{,}000$.

To test the ability of RAFDA to learn a surrogate model (5), we generate a validation data set $\bm{\zeta}_{\rm valid}(t_n)$, $n\geq 0$, sampled from the $x$-component of the Lorenz-63 system (26) at the same rate as the training data set, with $\Delta t=0.02$. To quantify the forecast skill we measure the forecast time $\tau_f$, defined as the largest time such that the relative forecast error satisfies $\mathcal{E}(t_n)=\|\bm{\zeta}_{\rm valid}(t_n)-\bm{\zeta}_n\|^2/\|\bm{\zeta}_{\rm valid}(t_n)\|^2\leq\theta$, where the $\bm{\zeta}_n$ are generated by the learned surrogate model. We choose here $\theta=40$.
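The forecast time $\tau_f$ of a learned weight matrix can then be measured by iterating the surrogate map (5) freely and recording when the relative error first exceeds $\theta$; a minimal sketch (reusing the phi helper from above; all names are illustrative) is:

```python
import numpy as np

def forecast_time(W, zeta_valid, dt=0.02, lyap_max=0.91, theta=40.0):
    """Forecast time tau_f in Lyapunov-time units for a learned weight matrix W.

    zeta_valid : (n_steps, D_zeta) validation delay vectors; zeta_valid[0] starts the free run.
    """
    zeta = zeta_valid[0].copy()
    for n in range(1, len(zeta_valid)):
        zeta = (W @ phi(zeta[:, None])).ravel()                     # iterate the surrogate map (5)
        err = np.sum((zeta_valid[n] - zeta) ** 2) / np.sum(zeta_valid[n] ** 2)
        if err > theta:
            return (n - 1) * dt * lyap_max
    return (len(zeta_valid) - 1) * dt * lyap_max
```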

The forecast skill depends crucially on the randomly chosen internal parameters $(\bm{W}_{\rm in},\bm{b}_{\rm in})$ as well as on the training and validation data sets. We therefore report the mean behaviour over $500$ realisations, differing in the training and validation data sets, in the random draws of the internal parameters $(\bm{W}_{\rm in},\bm{b}_{\rm in})$ and in the initial ensembles for RAFDA. Each training and validation data set is generated from randomly drawn initial conditions which are evolved independently over $40$ model time units to ensure that the dynamics has settled on the attractor.

In the following, we compare RAFDA with standard random feature maps using linear regression (LR). Figure 1 shows the empirical histogram of forecast times for both RAFDA and LR when the surrogate model (5) is trained on data contaminated with noise of strength $\eta=0.2$. The mean forecast time obtained by RAFDA is $\tau_f=2.12$ and is roughly three times larger than the forecast time obtained using standard random feature maps, which yield $\tau_f=0.77$. Furthermore, the distribution of forecast times for RAFDA is heavily skewed towards larger forecast times. As expected, these forecast times are smaller than for the fully observed case with $\bm{G}=\mathbf{I}$, as considered in Gottwald and Reich (2021), where extreme values of $\tau_f=9.28$ were reported. Figure 2 shows a typical example of $x(t)$ for a forecast time of $\tau_f=2.6$, which is close to the mean forecast time. Figure 3 shows that the surrogate model (5) obtained using RAFDA approximates the long-time statistics of the full Lorenz system (26) very well, in the sense that its trajectories reproduce the attractor in delay-coordinate space. In contrast, LR does not lead to surrogate models which are consistent with the dynamics of the Lorenz system (26). Reproducing dynamically consistent trajectories is, of course, paramount when such surrogate models are used for long-time integration and the interest is in the overall statistical behaviour rather than in accurate short-term forecasts.

The ability of RAFDA to control noise present in a training data set depends on the strength $\eta$ of the noise. We show in Figure 4 results for the mean forecast time over a range of $\eta\in[6.25\times 10^{-6},4.3\times 10^{5}]$. RAFDA outperforms standard random feature maps with linear regression for noise levels $\log\eta<5$. Moreover, there is a robust plateau with mean forecast times of around $\tau_f\approx 2$ for a large range of noise strengths with $\eta\lessapprox 1$. Notice that for $\eta\to 0$, maybe surprisingly, the mean forecast time of RAFDA does not converge to the one obtained by standard random feature maps but remains threefold larger. This was discussed in Gottwald and Reich (2021): although LR achieves a smaller value of the cost function (7) for $\eta\ll 1$, it does not generalise as well to unseen data. The cost function is not zero, as generically the output data do not lie in the span of the random feature maps. It is important to note that LR with $\beta>0$ assumes model error (which is not present in our application of simulating the Lorenz-63 system (26)) rather than observational error. We further note that, for very large noise levels $\log\eta>5$, the forecast times $\tau_f$ become zero for RAFDA. This is due to the data assimilation component of our method and is an instance of the well-documented problem of filter divergence (Ehrendorfer, 2007, Nadiga et al., 2013). More precisely, in finite-size ensembles, most ensemble members may align with the most unstable direction (Ng et al., 2011), implying a small forecast error covariance. In combination with large observation noise this leads to the filter trusting its own forecast, and the analysis is not corrected by incoming new observations. Filter divergence can be avoided by increasing the ensemble size and/or employing covariance inflation. We remark that we abstained here from fine-tuning all hyperparameters.

Figure 1. Empirical histogram of forecast times $\tau_f$ in units of the Lyapunov time for noise-contaminated observations with $\eta=0.2$. Results are shown for standard LR and for RAFDA.
Figure 2. Comparison of a representative validation time series (truth) and the corresponding RAFDA forecast, initialized with the first three data points from the truth. The measured forecast time of $\tau_f=2.6$ is close to the mean forecast time.
Figure 3. Attractor in the reconstructed phase space spanned by the delay-coordinates. The truth is depicted by diamonds (online red), RAFDA by filled circles (online blue) and LR by crosses (online cyan).
Figure 4. Forecast time $\tau_f$ as a function of the observational noise strength $\eta$. The error bars denote two standard deviations, estimated from $500$ independent realisations, differing in their randomly drawn internal parameters ($\bm{W}_{\rm in}$, $\bm{b}_{\rm in}$), training and validation data sets, and in the initial ensembles.

4. Discussion

We proposed a new data-driven method to estimate surrogate one-step forecast maps from noisy partial observations. Our method determines the surrogate map in the reconstructed phase space of delay coordinates, thereby dealing with the problem of having access only to partial observations. We employed the RAFDA framework in the reconstructed phase space, in which the surrogate map is constructed as a linear combination of random feature maps, the coefficients of which are learned sequentially using a stochastic EnKF. We showed that learning the coefficients of the linear combination of random feature maps sequentially with incoming new observations, rather than by least squares on the whole data set, greatly increases the forecast capability of the surrogate model and significantly controls the measurement noise.

Our numerical results showed that the quality of the surrogate model exhibits some degree of variance, depending on the draw of the arbitrarily fixed internal parameters $\bm{W}_{\rm in}$ and $\bm{b}_{\rm in}$ of the random feature map model. Being able to choose good candidates for those parameters, or learning them in conjunction with the rest of the surrogate model, would greatly improve the applicability of the method. In our previous work we provided some guidance on what constitutes good hyperparameters enabling good learning (Gottwald and Reich, 2021). Choosing actually optimal values for these hyperparameters requires, for example, an optimization procedure wrapped around the RAFDA method described here, and is planned for future research. Apart from $\bm{W}_{\rm in}$ and $\bm{b}_{\rm in}$ there are numerous hyperparameters which require tuning, such as the regularization parameter $\beta$, the inflation parameter $\alpha$, the size and the variance of the initial ensemble, as well as the assumed observational error covariance $\bm{\Gamma}$. To obtain optimal performance of RAFDA these would need to be tuned, which can be done by additional optimization procedures (Nadiga et al., 2019).

The random feature map architecture is closely related to reservoir computers (Jaeger, 2002, Jaeger and Haas, 2004, Pathak et al., 2018a, Jüngling et al., 2019). In reservoir computers the incoming signal is used in conjunction with an internal reservoir dynamics to produce the output. It is argued that this helps to take into account non-trivial memory of the underlying dynamical system. It will be interesting to see if random feature maps in the space of delay vectors perform as well as reservoir computers. For systems with a fast decay of correlation, such as the Lorenz 63 system, the additional reservoir dynamics did not lead to an improvement of the surrogate map (Gottwald and Reich, 2021), but for systems with slow decay of correlation, incorporating memory by taking into account information from past observations may be important.

Acknowledgments

SR is supported by Deutsche Forschungsgemeinschaft (DFG) - Project-ID 318763901 - SFB1294

References

  • Abarbanel et al. [2018] H. Abarbanel, P. Rozdeba, and S. Shirman. Machine learning: Deepest learning as statistical data assimilation problems. Neural Computation, 30:2025–2055, 2018.
  • Algar et al. [2019] S. D. Algar, T. Lymburn, T. Stemler, M. Small, and T. Jüngling. Learned emergence in selfish collective motion. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(12):123101, 2019. doi:10.1063/1.5120776.
  • Anderson and Anderson [1999] J. L. Anderson and S. L. Anderson. A Monte Carlo implementation of the nonlinear filtering problem to produce ensemble assimilations and forecasts. Monthly Weather Review, 127(12):2741–2758, 1999. doi:10.1175/1520-0493(1999)127<2741:AMCIOT>2.0.CO;2.
  • Bach [2017a] F. Bach. Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Research, 18(19):1–53, 2017a. URL http://jmlr.org/papers/v18/14-546.html.
  • Bach [2017b] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. J. Mach. Learn. Research, 18(21):1–38, 2017b. URL http://www.jmlr.org/papers/v18/15-178.html.
  • Barron [1993] A. Barron. Universal approximation bounds for superposition of a sigmoidal function. IEEE Trans. on Inform. Theory, 39:930–945, 1993. doi:10.1109/18.256500.
  • Bocquet et al. [2019] M. Bocquet, J. Brajard, A. Carrassi, and L. Bertino. Data assimilation as a learning tool to infer ordinary differential equation representations of dynamical models. Nonlinear Processes in Geophysics, 26:143–162, 2019. doi:10.5194/npg-26-143-2019.
  • Bocquet et al. [2020] M. Bocquet, J. Brajard, A. Carrassi, and L. Bertino. Bayesian inference of chaotic dynamics by merging data assimilation, machine learning and expectation-maximization. Foundations of Data Science, 2:55–80, 2020. doi:10.3934/fods.2020004.
  • Bocquet et al. [2021] M. Bocquet, A. Farchi, and Q. Malartic. Online learning of both state and dynamics using ensemble Kalman filters. Foundation of Data Science, published online, 2021. doi:10.3934/fods.2020015.
  • Bollt [2021] E. Bollt. On explaining the surprising success of reservoir computing forecaster of chaos? The universal machine learning dynamical system with contrast to VAR and DMD. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31(1):013108, 2021. doi:10.1063/5.0024890.
  • Bolton and Zanna [2019] T. Bolton and L. Zanna. Applications of deep learning to ocean data inference and subgrid parameterization. Journal of Advances in Modeling Earth Systems, 11(1):376–399, 2019. doi:10.1029/2018MS001472.
  • Brajard et al. [2020] J. Brajard, A. Carrassi, M. Bocquet, and L. Bertino. Combining data assimilation and machine learning to emulate a dynamical model from sparse and noisy observations: A case study with the Lorenz 96 model. Journal of Computational Science, 44:101171, 2020. doi:10.1016/j.jocs.2020.101171.
  • Brajard et al. [2021] J. Brajard, A. Carrassi, M. Bocquet, and L. Bertino. Combining data assimilation and machine learning to infer unresolved scale parametrization. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379(2194):20200086, 2021. doi:10.1098/rsta.2020.0086.
  • Burgers et al. [1998] G. Burgers, P. J. van Leeuwen, and G. Evensen. Analysis scheme in the ensemble Kalman filter. Monthly Weather Review, 126(6):1719–1724, 1998. doi:10.1175/1520-0493(1998)126<1719:ASITEK>2.0.CO;2.
  • Casdagli et al. [1991] M. Casdagli, S. Eubank, J. Farmer, and J. Gibson. State space reconstruction in the presence of noise. Physica D: Nonlinear Phenomena, 51(1):52–98, 1991. ISSN 0167-2789. doi:10.1016/0167-2789(91)90222-U.
  • Cleary et al. [2021] E. Cleary, A. Garbuno-Inigo, S. Lan, T. Schneider, and A. M. Stuart. Calibrate, emulate, sample. Journal of Computational Physics, 424:109716, 2021. ISSN 0021-9991. doi:10.1016/j.jcp.2020.109716.
  • Cybenko [1989] G. Cybenko. Approximation by superposition of a sigmoidal function. Math. Contr., Sign., and Syst., 2:303–314, 1989. doi:10.1007/BF02551274.
  • Dueben and Bauer [2018] P. D. Dueben and P. Bauer. Challenges and design choices for global weather and climate models based on machine learning. Geoscientific Model Development, 11(10):3999–4009, 2018. doi:10.5194/gmd-11-3999-2018.
  • E et al. [2020] W. E, C. Ma, S. Wojtowytsch, and L. Wu. Towards a mathematical understanding of neural network-based machine learning: What we know and what we don’t, 2020.
  • Ehrendorfer [2007] M. Ehrendorfer. A review of issues in ensemble-based Kalman filtering. Meteorologische Zeitschrift, 16(6):795–818, 2007.
  • Evensen [2006] G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer, New York, 2006.
  • Fraser and Swinney [1986] A. M. Fraser and H. L. Swinney. Independent coordinates for strange attractors from mutual information. Phys. Rev. A (3), 33(2):1134–1140, 1986. ISSN 1050-2947. doi:10.1103/PhysRevA.33.1134.
  • Gagne II et al. [2020] D. J. Gagne II, H. M. Christensen, A. C. Subramanian, and A. H. Monahan. Machine learning for stochastic parameterization: Generative adversarial networks in the Lorenz ’96 model. Journal of Advances in Modeling Earth Systems, 12(3), 2020. doi:10.1029/2019MS001896.
  • Gauthier et al. [2021] D. Gauthier, E. Bollt, A. Griffith, and W. Barbosa. Next generation reservoir computing, 2021.
  • Gottwald and Reich [2021] G. A. Gottwald and S. Reich. Supervised learning from noisy observations: Combining machine-learning techniques with data assimilation. Physica D: Nonlinear Phenomena, 423:132911, 2021. ISSN 0167-2789. doi:10.1016/j.physd.2021.132911.
  • Hamilton et al. [2016] F. Hamilton, T. Berry, and T. Sauer. Ensemble Kalman filtering without a model. Phys. Rev. X, 6:011021, Mar 2016. doi:10.1103/PhysRevX.6.011021.
  • Han et al. [2018] J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018. doi:10.1073/pnas.1718942115.
  • Han et al. [2019] J. Han, C. Ma, Z. Ma, and W. E. Uniformly accurate machine learning-based hydrodynamic models for kinetic equations. Proceedings of the National Academy of Sciences, 116(44):21983–21991, 2019. doi:10.1073/pnas.1909854116.
  • Hart et al. [2020] A. Hart, J. Hook, and J. Dawes. Embedding and approximation theorems for echo state networks. Neural Networks, 128:234–247, 2020. ISSN 0893-6080. doi:10.1016/j.neunet.2020.05.013.
  • Jaeger [2002] H. Jaeger. A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the ”echo state network” approach, 2002. GMD-Report 159, German National Research Institute for Computer Science.
  • Jaeger and Haas [2004] H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004. doi:10.1126/science.1091277.
  • Jüngling et al. [2019] T. Jüngling, T. Lymburn, T. Stemler, D. Correa, D. Walker, and M. Small. Reconstruction of complex dynamical systems from time series using reservoir computing. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5, 2019. doi:10.1109/ISCAS.2019.8702137.
  • Kantz and Schreiber [2004] H. Kantz and T. Schreiber. Nonlinear time series analysis. Cambridge University Press, Cambridge, second edition, 2004.
  • Kennel et al. [1992] M. B. Kennel, R. Brown, and H. D. I. Abarbanel. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Phys. Rev. A, 45:3403–3411, Mar 1992. doi:10.1103/PhysRevA.45.3403.
  • Levine and Stuart [2021] M. E. Levine and A. M. Stuart. A framework for machine learning of model error in dynamical systems, 2021.
  • Lorenz [1963] E. N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2):130–141, 1963. doi:10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2.
  • Lu et al. [2017] Z. Lu, J. Pathak, B. Hunt, M. Girvan, R. Brockett, and E. Ott. Reservoir observers: Model-free inference of unmeasured variables in chaotic systems. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(4):041102, 2017. doi:10.1063/1.4979665.
  • Maass et al. [2002] W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002. doi:10.1162/089976602760407955.
  • MATLAB [2020] MATLAB. version 9.9.0.1467703 (R2020b). The MathWorks Inc., Natick, Massachusetts, 2020.
  • Nadiga [2020] B. Nadiga. Reservoir computing as a tool for climate predictability studies. Journal of Advances in Modeling Earth Systems, n/a(n/a):e2020MS002290, 2020. doi:10.1029/2020MS002290.
  • Nadiga et al. [2019] B. Nadiga, C. Jiang, and D. Livescu. Leveraging Bayesian analysis to improve accuracy of approximate models. Journal of Computational Physics, 394:280–297, 2019. ISSN 0021-9991. doi:10.1016/j.jcp.2019.05.015.
  • Nadiga et al. [2013] B. T. Nadiga, W. R. Casper, and P. W. Jones. Ensemble-based global ocean data assimilation. Ocean Modelling, 72:210–230, 2013. ISSN 1463-5003. doi:10.1016/j.ocemod.2013.09.002.
  • Nakai and Saiki [2021] K. Nakai and Y. Saiki. Machine-learning construction of a model for a macroscopic fluid variable using the delay-coordinate of a scalar observable. Discrete & Continuous Dynamical Systems - S, 14(3):1079–1092, 2021.
  • Ng et al. [2011] G.-H. C. Ng, D. McLaughlin, D. Entekhabi, and A. Ahanin. The role of model dynamics in ensemble Kalman filter performance for chaotic systems. Tellus A, 63A:958–977, 2011.
  • Park and Sandberg [1991] J. Park and I. Sandberg. Universal approximation using radial-basis-function networks. Neural Computation, 3:246–257, 1991. doi:10.1162/neco.1991.3.2.246.
  • Pathak et al. [2018a] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Phys. Rev. Lett., 120:024102, Jan 2018a. doi:10.1103/PhysRevLett.120.024102.
  • Pathak et al. [2018b] J. Pathak, A. Wikner, R. Fussell, S. Chandra, B. R. Hunt, M. Girvan, and E. Ott. Hybrid forecasting of chaotic processes: Using machine learning in conjunction with a knowledge-based model. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28:041101, 2018b. doi:10.1063/1.5028373.
  • Platt et al. [2021] J. A. Platt, A. Wong, R. Clark, S. G. Penny, and H. D. I. Abarbanel. Forecasting using reservoir computing: The role of generalized synchronization, 2021.
  • Rahimi and Recht [2008] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf.
  • Rahimi and Recht [2008] A. Rahimi and B. Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561, 2008.
  • Raissi and Karniadakis [2018] M. Raissi and G. E. Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 357:125–141, 2018. doi:10.1016/j.jcp.2017.11.039.
  • Raissi et al. [2019] M. Raissi, P. Perdikaris, and G. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019. doi:10.1016/j.jcp.2018.10.045.
  • Rasp et al. [2018] S. Rasp, M. S. Pritchard, and P. Gentine. Deep learning to represent subgrid processes in climate models. Proceedings of the National Academy of Sciences, 115(39):9684–9689, 2018. ISSN 0027-8424. doi:10.1073/pnas.1810286115.
  • Rotskoff and Vanden-Eijnden [2018] G. M. Rotskoff and E. Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error, 2018.
  • Sauer and Yorke [1993] T. Sauer and J. A. Yorke. How many delay coordinates do you need? International Journal of Bifurcation and Chaos, 3(3):737–744, 1993.
  • Sauer et al. [1991] T. Sauer, J. A. Yorke, and M. Casdagli. Embedology. Journal of Statistical Physics, 65(3):579–616, 1991.
  • Schneider et al. [2017] T. Schneider, S. Lan, A. Stuart, and J. Teixeira. Earth system modeling 2.0: A blueprint for models that learn from observations and targeted high-resolution simulations. Geophysical Research Letters, 44(24):12,396–12,417, 2017. doi:10.1002/2017GL076101.
  • Schreiber [1999] T. Schreiber. Interdisciplinary application of nonlinear time series methods. Physics Reports, 308(1):1–64, 1999. ISSN 0370-1573. doi:10.1016/S0370-1573(98)00035-0.
  • Schreiber and Kantz [1995] T. Schreiber and H. Kantz. Noise in chaotic data: Diagnosis and treatment. Chaos: An Interdisciplinary Journal of Nonlinear Science, 5(1):133–142, 1995. doi:10.1063/1.166095.
  • Small [2005] M. Small. Applied nonlinear time series analysis, volume 52 of World Scientific Series on Nonlinear Science. Series A: Monographs and Treatises. World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ, 2005. ISBN 981-256-117-X. doi:10.1142/9789812567772. Applications in physics, physiology and finance.
  • Sun et al. [2019] Y. Sun, A. Gilbert, and A. Tewari. On the approximation capabilities of ReLU neural networks and random ReLU features, 2019.
  • Takens [1981] F. Takens. Detecting strange attractors in turbulence. In D. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, Warwick 1980, pages 366–381, Berlin, Heidelberg, 1981. Springer Berlin Heidelberg. ISBN 978-3-540-38945-3.
  • Tomizawa and Sawada [2021] F. Tomizawa and Y. Sawada. Combining Ensemble Kalman Filter and Reservoir Computing to predict spatio-temporal chaotic systems from imperfect observations and models, 2021.
  • Wikner et al. [2020] A. Wikner, J. Pathak, B. Hunt, M. Girvan, T. Arcomano, I. Szunyogh, A. Pomerance, and E. Ott. Combining machine learning with knowledge-based modeling for scalable forecasting and subgrid-scale closure of large, complex, spatiotemporal systems. Chaos: An Interdisciplinary Journal of Nonlinear Science, page 053111, 2020. doi:10.1063/5.0005541.
  • Wikner et al. [2021] A. Wikner, J. Pathak, B. R. Hunt, I. Szunyogh, M. Girvan, and E. Ott. Using data assimilation to train a hybrid forecast system that combines machine-learning and knowledge-based components. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31(5):053114, 2021. doi:10.1063/5.0048050.