
Efficient high-dimensional variational data assimilation with machine-learned reduced-order models

Romit Maulik1, Vishwas Rao1, Jiali Wang1, Gianmarco Mengaldo2,
Emil Constantinescu1, Bethany Lusch1, Prasanna Balaprakash1,
Ian Foster1, Rao Kotamarthi1

Argonne National Laboratory1, National University of Singapore2
(December 2021)
Abstract

Data assimilation (DA) in the geophysical sciences remains the cornerstone of robust forecasts from numerical models. Indeed, DA plays a crucial role in the quality of numerical weather prediction, and is a crucial building block that has allowed dramatic improvements in weather forecasting over the past few decades. DA is commonly framed in a variational setting, where one solves an optimization problem within a Bayesian formulation using raw model forecasts as a prior, and observations as likelihood. This leads to a DA objective function that needs to be minimized, where the decision variables are the initial conditions specified to the model. In traditional DA, the forward model is numerically and computationally expensive. Here we replace the forward model with a low-dimensional, data-driven, and differentiable emulator. Consequently, gradients of our DA objective function with respect to the decision variables are obtained rapidly via automatic differentiation. We demonstrate our approach by performing an emulator-assisted DA forecast of geopotential height. Our results indicate that emulator-assisted DA is faster than traditional equation-based DA forecasts by four orders of magnitude, allowing computations to be performed on a workstation rather than a dedicated high-performance computer. In addition, we describe accuracy benefits of emulator-assisted DA when compared to simply using the emulator for forecasting (i.e., without DA).

1 Introduction

A physical system can be characterized by our existing knowledge of the system plus a set of observations. Existing knowledge of the system is typically formulated in terms of a mathematical model (hereafter, also referred to as model or computational model) that usually consists of differential equations. Observations can arise from various sensors, both remote (e.g., satellites) and in-place (e.g., instrumented drones).

Data assimilation (DA) combines existing knowledge of a system, usually in the form of a model, with observations to infer the best estimate of the system state at a given time. Both existing knowledge and observations come with errors that lead to uncertainties about the “true” state of the system being investigated. Hence, when combining model results characterizing our knowledge of the system with observations, it is essential to account for these errors and give an appropriate weight to each source of information available. This leads to statistical approaches, which are the basis for state-of-the-art DA methods. Model results and observations, along with their uncertainties, are encapsulated within a Bayesian framework to provide the best estimate of the state of the system conditioned on the model and observation uncertainties (or error statistics) [11, 23, 25]. More specifically, DA can be formulated as a class of inverse problems that combines information from: 1) an uncertain prior (also referred to as background) that encapsulates our best estimate of the system at a given time; 2) an imperfect computational model describing the system; and 3) noisy observations. These three sources of information are combined to construct a posterior probability distribution that is regarded as the best estimate of the system state at the given time(s) of interest and is referred to as the analysis. The analysis can be used for various tasks, including optimal state identification and selection of appropriate initial conditions for computational models.

Two approaches to DA have gained widespread popularity: variational and ensemble-based estimation methods [23]. The former are derived from optimal control theory, while the latter are rooted in statistics. Variational methods formulate DA as a constrained nonlinear optimization problem. Here, the state of the system is adjusted to minimize the discrepancy between the prior or existing knowledge (e.g., in the form of a computational model) and observations, where their associated error statistics are commonly prescribed. Ensemble-based methods use optimal statistical interpolation approaches, and error statistics for the prior and observations are obtained from an ensemble. Regardless of the approach adopted, DA can be performed in two ways: sequential and continuous. In the sequential approach, observations are assimilated in batches as they become available. In the continuous approach, one defines a prescribed time window, called the assimilation window, and uses all the observations available within the window to obtain the analysis.

DA, originally developed for numerical weather prediction (NWP), can be traced back to the original work of Richardson [27]. Today, DA is used extensively in NWP to compute accurate states of the atmosphere that, in turn, are used to estimate appropriate initial conditions for NWP models, to compute reanalysis, and to help better understand properties of the atmosphere. We focus here on NWP applications, although the novel methodology proposed can be applied in other contexts.

In NWP, variational approaches are the workhorse, and have been employed for the past several years by leading operational weather centers, including the European Centre for Medium-Range Weather Forecasts (ECMWF) and the National Centers for Environmental Prediction (NCEP). The two main methods adopted are three-dimensional variational DA (3D-Var) and four-dimensional variational DA (4D-Var). In both cases, one defines an assimilation window and seeks the system state that best fits the available data, comprising the prior and observations. In 3D-Var, all observations are regarded as if they were obtained at the same time snapshot (i.e., there is no use of the time dimension of the assimilation window). In 4D-Var, the observations retain their time information, and one seeks to identify the state evolution (also referred to as the trajectory) that best fits them within the assimilation window. The state evolution is commonly obtained via a computational model, consisting of a set of high-dimensional partial differential equations (PDEs), that is often considered perfect (without errors) within the assimilation window.

The current state of the art is 4D-Var. This is often implemented as a strong-constraint algorithm (SC4D-Var), where one assumes that the observations over the time window are consistent (within a margin of observation errors) with the model if initialized by the true state, where the model is considered to be perfect. In reality, the error in the model is often non-negligible, in which case the SC4D-Var scheme produces an analysis that is inconsistent with the observations. The effect of model error is even more pronounced when the assimilation window is large. Weak-constraint 4D-Var (WC4D-Var) [50, 51] relaxes the ‘perfect model’ assumption and assumes that model error is present at each time snapshot of the assimilation window. This requires estimates of model-error statistics that are commonly simplistic [4]. Yet, recent efforts have focused on improving estimates of model-error statistics and on better understanding their impact on the analysis accuracy in both variational and statistical approaches [1, 7, 19, 43, 50, 51, 53]. Indeed, there has been a push toward aggregating computational NWP model uncertainties, such as those due to incomplete knowledge of the physics associated with sub-grid modeling, errors in boundary conditions, accumulation of numerical errors, and inaccurate parametrization of key physical processes, into a component called model error [16, 36, 37].

It has been shown that 4D-Var systematically outperforms 3D-Var [26], and for this reason it is today the state-of-the-art DA approach for NWP applications. However, the better accuracy of 4D-Var comes at the price of higher computational costs. Indeed, to exploit the time dimension of the assimilation window it is necessary to repeatedly solve the computational model forward in time as well as the tangent linear and adjoint problems [12, 14, 13]. These two additional steps are particularly expensive and lead to a significantly larger computational cost for the 4D-Var algorithm compared to its 3D-Var counterpart. Hence, a significant fraction of the computational cost in NWP is due to DA. This cost can be equivalent to the cost of 30–100 model forecasts, which corresponds to the number of iterations in the optimization procedure. This high cost restricts the amount of data that can be assimilated, and thus only a small fraction of the available observations are typically employed for operational forecasts [2, 18].

Our goal in this work is to alleviate the computational burden associated with the 4D-Var approach by replacing the expensive computational model and its adjoint with a data-driven emulator. (We use the terms emulator and surrogate interchangeably in this document.) Because our emulator is easily differentiable, we can use automatic differentiation (AD) to avoid solving the expensive adjoint problem. AD, in contrast to numerical differentiation, does not introduce any discretization errors such as those encountered in finite differences. The absence of discretization-based gradient computation also leads to accurate computation of higher-order derivatives, where such errors are more pronounced. Moreover, when gradients are needed with respect to many inputs (such as for partial differentiation), AD is more computationally efficient. Unlike numerical and symbolic differentiation, AD relies on the chain rule to decompose differentials and compute them efficiently. It is therefore instrumental in computing gradients with respect to inputs or parameters in neural network applications, where the chain rule may be implemented trivially. As a result, much of the computational burden associated with the 4D-Var method is alleviated by replacing the expensive computational models consisting of high-dimensional PDEs with machine learning (ML)-based surrogates.

A few efforts toward the integration of DA and ML have recently been undertaken. In [5], rather than replacing an entire physics model with an ML emulator, DA is applied to an imperfect physics model, and ML is used to predict the model error. In [15], ML is used to improve DA by learning a mapping from observational data to physical states. [17] proposes an ML approach to video prediction that includes a novel neural network cell inspired by DA, with a component called a “correction Kalman gain”. In [20], the original physical parametrization scheme is not replaced by an emulator, but neural networks are used to derive tangent-linear and adjoint models to be used during 4D-Var. [28] presents a new formulation for 3D-Var DA that uses convolutional autoencoders to create a reduced space in which to perform DA. In contrast to our work, they do not create an emulator to step forward in time, as they are performing 3D-Var DA, not 4D-Var. In [4], the authors use a method that iterates between training an ML surrogate model for a Lorenz system and applying DA. The output analysis then becomes the new training data to further improve the surrogate. In [41], the authors train ML emulators for Lorenz models using a form of RNN based on reservoir computing. DA is then applied (4D-Var and the ensemble transform Kalman filter), estimating the forecast error covariance matrix using their RNN and deriving the corresponding tangent linear model and its adjoint as linear operators. In [9], a recurrent neural network is trained to predict the difference between model outputs and DA computations performed a priori during forecasting in a low-dimensional subspace spanned by truncated principal components. Similar methods may also be found in [38, 40, 42], where ML surrogates are used in lieu of the expensive forward model in ensemble techniques.

Our study is unique in several ways. First, we demonstrate the use of ML emulators to improve variational DA by accelerating the outer-loop optimization problem. By using a differentiable ML surrogate instead of an expensive numerical solver, rapid computation of gradients via automatic differentiation allows for a speed-up of several orders of magnitude. Moreover, in contrast with [41], where a reservoir computer was used as the surrogate, our ML emulator is given by a deep recurrent neural network (i.e., a long short-term memory neural network with several stacked cells) for which automatic differentiation is imperative. Our formulation also employs a model-order reduction methodology to forecast dynamics on a reduced space, thereby leading to significant computational gains even for forecasting very high-dimensional systems. In contrast with the formulation in [9], we perform DA ‘on-the-fly’ during forecasting instead of learning the mismatch between model predictions and DA-corrected values. This makes our forecasts more generalizable during testing conditions. The specific contributions of this study are summarized as follows:

  • We propose a differentiable reduced-order surrogate model using dimensionality reduction coupled with a data-driven time-series forecasting technique.

  • We construct a reduced-order variational DA optimization problem that updates the initial condition of any forecast, given observations from random sensors.

  • We accelerate this DA by several orders of magnitude by using gradients of the differentiable low-order surrogate model. Our a-posteriori assessments show that the reduced-order DA significantly improves the accuracy of the forecast using the updated initial conditions.

We highlight that in contrast to a vast majority of previous DA and ML studies, the proposed technique is demonstrated for a forecasting problem that is of real-world importance (geopotential height).

2 Surrogate modeling

In this section, we will first introduce our surrogate model strategy, which may be used for direct forecasting of a geophysical quantity from data. Subsequently, we introduce the DA procedure for real-time forecasting correction using this surrogate model. The surrogate model relies on a dimensionality reduction given by proper orthogonal decomposition and neural network time-series forecasting of the reduced representation. We review dimensionality reduction, time-series forecasting, and surrogate-based DA in the following.

2.1 Proper orthogonal decomposition

The first step in the surrogate construction is the reduction of the degrees of freedom of the original data set to enable rapid forecasts. Projection-based reduced-order models (ROMs) can effectively compress a high-dimensional model while still preserving its spatio-temporal structure. The dimensionality reduction is performed through a projection step in which the high-dimensional model is projected onto a set of basis functions chosen optimally with respect to the $L_2$ norm [3]. The mechanics of a POD-ROM can be illustrated for a state variable $\mathbf{x}\in\mathbb{R}^{N}$, where $N$ represents the size of the computational grid. The POD-ROM then approximates $\mathbf{x}$ as a linear expansion on a finite number $k$ of orthonormal basis vectors $\boldsymbol{\phi}_{i}\in\mathbb{R}^{N}$, a subset of the POD basis:

\mathbf{x}\approx\sum_{i=1}^{k}r_{i}\boldsymbol{\phi}_{i}, \qquad (1)

where $r_{i}\in\mathbb{R}$ is the $i$th component of $\mathbf{r}\in\mathbb{R}^{k}$, the vector of coefficients of the basis expansion. The $\{\boldsymbol{\phi}_{i}\}$, $i=1,\ldots,k$, with $\boldsymbol{\phi}_{i}\in\mathbb{R}^{N}$, are the POD modes. The POD modes in Equation 1 can be shown to be the left singular vectors of the snapshot matrix $\mathbf{X}=[\mathbf{x}_{1},\ldots,\mathbf{x}_{M}]$ (obtained by stacking $M$ snapshots of $\mathbf{x}$), extracted by performing a singular value decomposition (SVD) on $\mathbf{X}$ [22, 10]. That is,

\mathbf{X}\underset{\text{svd}}{=}\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}, \qquad (2)

where $\mathbf{U}\in\mathbb{R}^{N\times M}$ contains the left singular vectors, and $\mathbf{\Phi}_{k}$ denotes its first $k$ columns, obtained by truncating the last $M-k$ columns based on the relative magnitudes of the cumulative sum of the singular values (see, e.g., [31]). The total $L_2$ error in approximating the snapshots via the truncated POD basis is then

\sum_{j=1}^{M}\left\|\mathbf{x}_{j}-(\mathbf{\Phi}_{k}\mathbf{\Phi}_{k}^{\top})\mathbf{x}_{j}\right\|^{2}_{2}=\sum_{i=k+1}^{M}\sigma_{i}^{2}, \qquad (3)

where $\sigma_{i}$ is the singular value corresponding to the $i$th column of $\mathbf{V}$ and is also the $i$th diagonal element of $\boldsymbol{\Sigma}$. It is well known that POD bases are $L_2$-optimal and present a good choice for efficient compression of high-dimensional data [3]. A recent alternative for compressing the data consists of spectral POD [49, 32, 24].
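The POD compression and reconstruction steps above can be written compactly with a standard SVD routine. The following is a minimal NumPy sketch, assuming the snapshots are stacked column-wise in an array X of shape (N, M); it also verifies the truncation-error identity in Equation 3 on random data. Names and array sizes are illustrative.

```python
import numpy as np

def pod_basis(X, k):
    """Return the first k POD modes (left singular vectors) and all singular values."""
    # Thin SVD of the snapshot matrix: X = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s          # truncated POD basis Phi_k of shape (N, k)

def project(Phi_k, x):
    """Map a full state x in R^N to its reduced coefficients r in R^k (Equation 1)."""
    return Phi_k.T @ x

def reconstruct(Phi_k, r):
    """Approximate the full state from the reduced coefficients."""
    return Phi_k @ r

# Example: check the truncation-error identity of Equation 3 on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 40))            # N = 500 grid points, M = 40 snapshots
Phi_k, s = pod_basis(X, k=5)
residual = X - Phi_k @ (Phi_k.T @ X)
print(np.sum(residual**2), np.sum(s[5:]**2))  # the two values agree up to round-off
```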

2.2 Time-series forecasting

For forecasting a dynamical system from examples of time-series data, we utilize an encoder-decoder framework coupled with long short-term memory neural networks (LSTMs) [21]. In the encoder-decoder formulation, a latent representation is first derived from historical observations (in our case, of the compressed representation of the full state), i.e.,

\mathbf{h}=h(\mathbf{r}_{t-T},\mathbf{r}_{t-T+1},\ldots,\mathbf{r}_{t}), \qquad (4)

where $h$ is the output of a function approximated by an LSTM. The LSTM neural architecture is devised to account for long- and short-term correlations in time-series data through the specification of a hidden state that evolves over time and is affected by each observation of the dynamical system. The value of the hidden state at the end of the input sequence of length $T$ may then be employed as an encoded representation, $\mathbf{h}$, of the input window. After $\mathbf{h}$ is obtained, forecasts at different distances in the future may be obtained by functional predictions. In this study, the ‘decoding’ component of the architecture is also given by an LSTM cell that is provided the encoded state information for each timestep of the output, i.e.,

[\mathbf{r}_{t+1},\ldots,\mathbf{r}_{t+T_{o}}]=\tilde{h}(\boldsymbol{h}_{t+1},\ldots,\boldsymbol{h}_{t+T_{o}}), \qquad (5)
\boldsymbol{h}_{t+1}=\boldsymbol{h}_{t+2}=\cdots=\boldsymbol{h}_{t+T_{o}}=\boldsymbol{h},

where $\tilde{h}$ is also an LSTM cell and $T_{o}$ is the length of the output window. Once forecasts in the POD-compressed space are obtained using the encoder-decoder LSTM network, the full-order state variable can be reconstructed by using the precomputed basis functions. A schematic of the architecture is shown in Figure 1. In the past, several dynamical-system forecasts have been performed solely with the use of POD-LSTM-type learning [39, 33, 30, 29]. However, as will be demonstrated shortly, this approach can be enhanced significantly with the use of real-time DA from sparse observations.
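The following is a minimal Keras sketch of the encoder-decoder LSTM of Equations 4 and 5, with layer sizes taken from Table 1. The exact layer arrangement, the variable names, and the use of a RepeatVector layer to broadcast the latent code over the output window are illustrative assumptions rather than the exact production architecture.

```python
import tensorflow as tf

k = 5          # number of retained POD modes
T_in = 7       # input window length (days)
T_out = 20     # output (forecast) window length (days)
n_units = 20   # neurons per LSTM cell

inputs = tf.keras.Input(shape=(T_in, k))                       # past POD coefficients
# Encoder: two stacked LSTM cells; the final hidden state is the latent code h (Eq. 4)
x = tf.keras.layers.LSTM(n_units, return_sequences=True)(inputs)
h = tf.keras.layers.LSTM(n_units)(x)
# Decoder: repeat h for every output step (Eq. 5) and decode with another LSTM
h_rep = tf.keras.layers.RepeatVector(T_out)(h)
y = tf.keras.layers.LSTM(n_units, return_sequences=True)(h_rep)
outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(k))(y)  # future coefficients

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
model.summary()
```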

Figure 1: A schematic for forecasting POD coefficients using an encoder-decoder LSTM neural network. The various POD coefficients of an input window are encoded to a latent state on which the forecast window is conditioned.

3 Data Assimilation

Data assimilation (DA) is the process of fusing information from priors, imperfect predictions, and noisy data to obtain a consistent description of the true state $\mathbf{x}^{\rm true}$ of a physical system. The resulting estimate represents the maximum a posteriori estimate and is referred to as the analysis $\mathbf{x}^{\rm a}$.

Figure 2: A schematic for 4D-Var. Performing 4D-Var improves the initial conditions of the system, which in turn improves the forecasts. However, 4D-Var requires simulating the expensive model multiple times and this cost can be mitigated by deploying surrogates.

The prior encapsulates our current knowledge of the system and is referred to as the background in the DA community [23]. The background typically consists of an estimate of the state $\mathbf{x}^{\rm b}$ and the corresponding background error covariance matrix $\mathbf{B}$.

The imperfect predictions are generated by a model that captures our knowledge about the physical laws that govern the evolution of the system. This model evolves an initial state $\mathbf{x}_{0}\in\mathbb{R}^{n}$ at initial time $t_{0}$ to states $\mathbf{x}_{i}\in\mathbb{R}^{n}$ at future times $t_{i}$. A general model equation is:

\mathbf{x}_{i}=\mathcal{M}_{t_{0}\rightarrow t_{i}}\left(\mathbf{x}_{0}\right),\quad i=1,\cdots,N. \qquad (6)

The noisy data are observations of reality available at discrete time instances. Specifically, measurements $\mathbf{y}_{i}\in\mathbb{R}^{m}$ of the physical state $\mathbf{x}^{\rm true}\left(t_{i}\right)$ are taken at times $t_{i}$,

\mathbf{y}_{i}=\mathcal{H}(\mathbf{x}_{i})+\varepsilon_{i},\quad\varepsilon_{i}\sim\mathcal{N}(\mathbf{0},\mathbf{R}_{i}),\quad i=1,\cdots,N, \qquad (7)

where the observation operator $\mathcal{H}:\mathbb{R}^{n}\to\mathbb{R}^{m}$ maps the model state space onto the observation space. The random observation errors $\varepsilon_{i}$ are assumed to have normal distributions. In general, both the model and the observation operator are nonlinear. These concepts are detailed in the following references [11, 23, 47, 48, 8].

Variational methods solve the DA problem in an optimal control framework. Specifically, one adjusts a control variable (e.g., model parameters) in order to minimize the discrepancy between model forecasts and observations. We next review the 4D-Var formulation.

3.1 Four-dimensional variational data assimilation

Strong-constraint four-dimensional variational (4D-Var) DA processes simultaneously all observations at all times $t_{1},t_{2},\dots,t_{T}$ within the assimilation window (see Figure 2 for a schematic representation). The control parameters are typically the initial conditions $\mathbf{x}_{0}$, which uniquely determine the state of the system at all future times under the assumption that the model (6) perfectly represents reality. The background state is the prior best estimate of the initial conditions $\mathbf{x}_{0}^{\rm b}$, and has an associated initial background error covariance matrix $\mathbf{B}_{0}$. The 4D-Var problem provides the estimate $\mathbf{x}_{0}^{\rm a}$ of the true initial conditions as the solution of the following optimization problem:

\mathbf{x}_{0}^{\rm a}=\underset{\mathbf{x}_{0}}{\operatorname{arg\,min}}~\mathcal{J}\left(\mathbf{x}_{0}\right)\qquad\text{subject to (6)}, \qquad (8a)
\mathcal{J}\left(\mathbf{x}_{0}\right)=\frac{1}{2}\|\mathbf{x}_{0}-\mathbf{x}_{0}^{\rm b}\|_{\mathbf{B}_{0}^{-1}}^{2}+\frac{1}{2}\,\sum_{i=1}^{T}\|\mathcal{H}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{\mathbf{R}_{i}^{-1}}^{2}. \qquad (8b)

The first term of the sum (8b) quantifies the departure of the solution $\mathbf{x}_{0}$ from the background state $\mathbf{x}_{0}^{\rm b}$ at the initial time $t_{0}$. The second term measures the mismatch between the forecast trajectory (model solutions $\mathbf{x}_{i}$) and observations $\mathbf{y}_{i}$ at all times $t_{i}$ in the assimilation window. The covariance matrices $\mathbf{B}_{0}$ and $\mathbf{R}_{i}$ need to be predefined, and their quality influences the accuracy of the resulting analysis.
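For reference, the gradient of (8b) with respect to the initial condition, which drives the outer-loop optimization and is the quantity classically obtained via tangent linear and adjoint models [12, 14], takes the standard form

\nabla_{\mathbf{x}_{0}}\mathcal{J}\left(\mathbf{x}_{0}\right)=\mathbf{B}_{0}^{-1}\left(\mathbf{x}_{0}-\mathbf{x}_{0}^{\rm b}\right)+\sum_{i=1}^{T}\mathbf{M}_{t_{0}\rightarrow t_{i}}^{\top}\,\mathbf{H}_{i}^{\top}\,\mathbf{R}_{i}^{-1}\left(\mathcal{H}(\mathbf{x}_{i})-\mathbf{y}_{i}\right),

where $\mathbf{M}_{t_{0}\rightarrow t_{i}}$ denotes the Jacobian of the model operator in (6) and $\mathbf{H}_{i}$ the linearized observation operator at time $t_{i}$. Evaluating the transposed (adjoint) operators is the expensive step that the differentiable surrogate introduced below replaces.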

3.2 Data assimilation with a surrogate model

Given a low-dimensional differentiable surrogate model, for instance the proposed POD-LSTM, that approximates the dynamics of the full-order numerical model, the variational DA process can be accelerated dramatically. Suppose such a surrogate model has predicted $\boldsymbol{\hat{x}}_{1}$ to $\boldsymbol{\hat{x}}_{T}$ for a range of timesteps in the reduced space (i.e., $\boldsymbol{\hat{x}}\approx P\mathbf{x}$, $\boldsymbol{\hat{x}}\in\mathbb{R}^{K}$, with $K\ll N$) and observations from the real system, $\mathbf{y}_{1}$ to $\mathbf{y}_{T}$, are available. The objective function of the DA problem may then be obtained by reconstructing the model state from the compressed representation. Here, $P:\mathbb{R}^{N}\rightarrow\mathbb{R}^{K}$ is the function that maps from physical space to reduced space. In the case of POD-based compression, $P$ is a linear operator that projects from the physical space onto the basis spanned by the truncated set of left singular vectors of an SVD performed on snapshots from full-state model evaluations. If inversion from the truncated subspace is represented as $P^{\dagger}$, this gives us

\mathcal{J}\left(\boldsymbol{\hat{x}}_{0}\right)=\frac{1}{2}\|\boldsymbol{\hat{x}}_{0}-P\mathbf{x}_{0}^{\rm b}\|_{\mathbf{B}_{0}^{-1}}^{2}+\frac{1}{2}\,\sum_{i=1}^{T}\|\mathcal{H}(\hat{\mathbf{x}}_{i})-\mathbf{y}_{i}\|_{\mathbf{R}_{i}^{-1}}^{2}. \qquad (9)

Note that $\hat{\mathbf{x}}_{i}$, which is an approximation to the state at time $t_{i}$, is evaluated using the surrogate model, that is, by propagating $\boldsymbol{\hat{x}}_{0}$ using the surrogate model (LSTM in this case) and applying $P^{\dagger}$ to the result. This can be written mathematically as

\hat{\mathbf{x}}_{i}=P^{\dagger}\boldsymbol{h}(\boldsymbol{\hat{x}}_{0}). \qquad (10)

Given a fixed $P^{\dagger}$ with an effective compression ratio (obtained from our dimensionality reduction technique), Equation 8b becomes a cost function (9) expressed in $K$ dimensions, which is amenable to rapid updates of the initial conditions. Moreover, gradients of this objective function with respect to the initial conditions are readily computed via automatic differentiation through our LSTM neural network. The overall approach is as follows:

  • For each forecast window, collect random observations from the true state.

  • Perform an optimization of Equation 9 by perturbing the inputs to the LSTM neural network. This input is the projected version of the initial condition state.

  • If optimization has converged, perform one forecast with the optimized initial condition. This is the DA improved forecast.

The optimization methodology used in this article is the sequential least-squares programming approach [35] implemented in SciPy. The neural architecture was deployed using TensorFlow 2.4. Our overall implementation is in Python.
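The following sketch illustrates how the reduced-space objective (9) and its gradient can be evaluated with TensorFlow's automatic differentiation and passed to SciPy's SLSQP optimizer. It assumes a trained Keras encoder-decoder `model` that maps an input window of POD coefficients to a forecast window, a truncated POD basis `Phi_k` of shape (N, k), random sensor indices `obs_idx`, observations `y_obs` of shape (T_out, n_obs), and identity background and observation error covariances, as in our experiments; all names are illustrative.

```python
import numpy as np
import tensorflow as tf
from scipy.optimize import minimize

def objective_and_grad(r0_flat, r0_background, model, Phi_k, obs_idx, y_obs, T_in, k):
    """Evaluate the reduced-space 4D-Var cost (Eq. 9) and its gradient via AD."""
    r0 = tf.convert_to_tensor(r0_flat.reshape(1, T_in, k), dtype=tf.float32)
    r0_b = tf.constant(r0_background.reshape(1, T_in, k), dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(r0)
        r_traj = model(r0)                                   # (1, T_out, k): reduced forecast
        # Reconstruct full states with the POD basis (the P-dagger step, Eq. 10)
        x_traj = tf.matmul(r_traj, tf.constant(Phi_k.T, dtype=tf.float32))
        # Observation operator H: sample the reconstructed states at the sensor locations
        mismatch = tf.gather(x_traj[0], obs_idx, axis=1) - tf.constant(y_obs, dtype=tf.float32)
        # Departure from the projected background (here the whole projected input window)
        background = r0 - r0_b
        J = 0.5 * tf.reduce_sum(background ** 2) + 0.5 * tf.reduce_sum(mismatch ** 2)
    grad = tape.gradient(J, r0)                              # gradient by automatic differentiation
    return float(J.numpy()), grad.numpy().ravel().astype(np.float64)

def assimilate(r0_background, model, Phi_k, obs_idx, y_obs):
    """Update the reduced initial condition with SciPy's SLSQP optimizer."""
    T_in, k = r0_background.shape
    res = minimize(objective_and_grad, r0_background.ravel().astype(np.float64),
                   jac=True, method="SLSQP",
                   args=(r0_background, model, Phi_k, obs_idx, y_obs, T_in, k))
    return res.x.reshape(T_in, k)                            # DA-corrected initial condition
```

After convergence, a single forward pass of the surrogate with the corrected initial condition produces the DA-improved forecast, as described in the list above.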

4 Results

We describe the dataset used in our experiments and then present our experimental results.

4.1 Dataset

The data that we use in this study are a subset of 20 years of output from the regional climate model WRF version 3.3.1, prepared with methods and configurations described by Wang and Kotamarthi [52]. WRF, a fully compressible, nonhydrostatic, regional numerical prediction system with proven suitability for a broad range of applications, is driven by reanalysis data NCEP-Reanalysis2 (NCEP-R2) for the period 1984–2003. The NCEP-R2 dataset used in  [52] includes many of the observational datasets available to build dynamically consistent gridded fields that are typically used for forecast model initialization and boundary condition setting.

We use daily averaged geopotential height at 500 hPa (Z500) from WRF to demonstrate the approach developed in this study. We choose Z500 because it is a standard field for analysing the quality of weather forecasts, it does not have very strong local gradients (in contrast to fields such as humidity or precipitation) and does not depend on local conditions such as topography, yet most of the important global flow patterns—such as midlatitude jets and a gradient between poles and Equator—are visible in Z500.

We retrieve the Z500 data from the WRF output at a grid spacing of 12 km and at 3-hour intervals, with 515×599 grid cells in the south-north and west-east directions, respectively. We then calculate per-grid-point daily averages to obtain one data value per grid point per day. Spatially, we coarsen the data with a stride of five, which still maintains the spatial patterns of Z500 but reduces the data size significantly. Each daily Z500 snapshot therefore comprises 102×119 grid points of 60 km × 60 km. We have 3287 such daily snapshots for the period 1 January 1984 to 31 December 1991. Despite the coarsening of the original 12 km data, the typical eastward-propagating waves are clearly visible in the northern hemisphere. Such waves are also observed in the real atmosphere and are one of the main features of midlatitude weather variability on timescales of several days.
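The following is a minimal sketch of this preprocessing, assuming the raw 3-hourly Z500 output is held in a NumPy array of shape (n_times, 515, 599); the function name and the simple strided subsampling used for coarsening are illustrative assumptions.

```python
import numpy as np

def daily_average_and_coarsen(z500_3h, ny=515, nx=599, samples_per_day=8, stride=5):
    """Per-grid-point daily averaging followed by spatial coarsening with a stride of five."""
    n_days = z500_3h.shape[0] // samples_per_day
    # One value per grid point per day: average the eight 3-hourly samples of each day
    daily = z500_3h[:n_days * samples_per_day].reshape(n_days, samples_per_day, ny, nx).mean(axis=1)
    # Keep every fifth grid point in both directions (roughly 60 km x 60 km spacing)
    return daily[:, ::stride, ::stride]
```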

We use the first six years of the daily averaged Z500 WRF data (1984–1989) for surrogate training and optimization. Of those data, we select 70% at random as training data, for use in training the supervised ML formulation, and keep the remaining 30% as validation data, for use in tuning the neural architecture hyperparameters and to control overfitting (a situation in which the trained network predicts well on the training data but not on the test data). We also construct a test data set consisting of one year of records (1991) for prediction and evaluation. We skip 1990 so as to ensure that there is no overlap in input windows. We note that we choose identity as our observation error covariance matrix and we use a scaling to improve convergence. In our future studies, in addition to using a realistic observation error covariance matrix, we will also pursue using a flow-dependent background error covariance matrix as described in [6].

4.2 Experiments

We conduct a number of experiments to evaluate the performance of our emulator-assisted DA approach. In the first, we perform a grid-based search in which we vary both the number of POD modes (5, 10, 15) and the size of the input window (7, 14, 28, or 42 days of lead-time information) to identify the optimal combination for forecasting. Other hyperparameters such as learning rate, learning rate scheduler, number of LSTM cells, and number of neurons are determined by a combination of experience in previous modeling tasks, considerations of computational efficiency, and limited manual tuning. We set the output forecast window to 20 days to represent a difficult forecast task for the emulator.

We give the complete set of hyperparameters for our problem in Table 1. We found that the lowest validation errors were achieved when just five modes were retained, a result that matches earlier studies [24], where increasing the number of modes is seen to cause difficulties in long-term forecasting (for example, for more than two weeks). The training and validation data are used in this phase of our experimentation to identify the best model for performing accelerated DA. We train the neural network architecture by penalizing the loss function on the training data, while using the validation data to enforce an early stopping criterion (i.e., to prevent overfitting). Once the different models have been trained, the best model is determined by studying the validation losses of all the different architectures. This model is then used for testing on unseen data and for DA. For a further sensitivity analysis of this model, we trained four other models with varying input window sizes (7, 28, 35, 42) and different random seeds, with other hyperparameters fixed, and utilized them to obtain ensemble test results. We trained and validated architectures with multiple hyperparameter settings in parallel by using Ray [34], a distributed computing framework, on Nvidia A100 GPUs of the Argonne Leadership Computing Facility’s Theta supercomputer.
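A minimal sketch of such a parallel grid search with Ray remote tasks is shown below; the train_and_validate routine is a placeholder for the actual training of one encoder-decoder LSTM, and the configuration names are illustrative assumptions.

```python
import itertools
import ray

def train_and_validate(config):
    # Placeholder for the actual routine: build the encoder-decoder LSTM specified
    # by `config`, train it on the 1984-1989 data, and return the validation loss.
    # A dummy value is returned here so that the sketch runs end to end.
    return float(config["n_modes"] + config["input_window"])

@ray.remote(num_gpus=1)
def run_trial(config):
    return config, train_and_validate(config)

ray.init()  # connect to (or start) a Ray cluster
configs = [{"n_modes": m, "input_window": w}
           for m, w in itertools.product([5, 10, 15], [7, 14, 28, 42])]
results = ray.get([run_trial.remote(c) for c in configs])      # trials run in parallel
best_config, best_val_loss = min(results, key=lambda r: r[1])  # select by validation loss
```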

Table 1: LSTM hyperparameters used when training the low-dimensional surrogate. The input window size and the random seed are varied for the sensitivity analyses of the surrogate model: in addition to the best model chosen here, several other models are trained with different weight initializations and input window sizes.
Neurons per cell 20
Number of stacked cells 2
Initial learning rate 1.0e-3
Learning rate decay rate 0.5 (based on 10 epochs patience)
Activation function ReLU
Input window 7, 14, 28, or 42 days
Number of modes retained 5
Output window 20 days
Weight initialization Glorot
Optimizer Adam

Our first set of assessments tests the POD-LSTM emulator without any DA. We show in Figure 3 how emulator predictions for day 15 (for all forecasts in the test region) compare to climatology. Climatology, here, refers to the average forecast on a specific forecast day based on averaged observations from the training data. For instance, if geopotential height is to be forecast on December 7, 1991, the climatology prediction would be the average of the December 7 geopotential height values for 1984-1989. The first metric we use for comparison is the cosine similarity improvement, shown in Figure 3(a). Here, the cosine similarity (which captures the alignment between prediction and truth) obtained from climatology is subtracted from that obtained from the emulator forecast. Thus, in this case, negative (blue) regions are where climatology captures the temporal trend of the forecast better than our emulator. In Figure 3(b), we subtract the MAE of the emulator predictions from that of climatology. Here, the blue regions are where climatology is more accurate on average than the emulator. Figure 3(c) shows the MAE for the emulator alone. From this analysis, it is clear that for large regions in the data domain (particularly in the north), the emulator performs quite poorly.
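One plausible implementation of these per-grid-point metrics is sketched below; the array names and shapes are illustrative, and the actual evaluation pipeline may differ in detail.

```python
import numpy as np

def mae(pred, truth):
    """Mean absolute error at each grid point, averaged over all test forecasts."""
    return np.mean(np.abs(pred - truth), axis=0)

def cosine_similarity(pred, truth):
    """Alignment of the predicted and true series at each grid point."""
    num = np.sum(pred * truth, axis=0)
    den = np.linalg.norm(pred, axis=0) * np.linalg.norm(truth, axis=0)
    return num / den

# Illustrative placeholder arrays of shape (n_forecasts, ny, nx)
rng = np.random.default_rng(0)
truth = rng.standard_normal((50, 102, 119))
emulator = truth + 0.1 * rng.standard_normal(truth.shape)
climatology = truth + 0.2 * rng.standard_normal(truth.shape)

# Positive values indicate grid points where the emulator beats climatology
similarity_improvement = cosine_similarity(emulator, truth) - cosine_similarity(climatology, truth)
mae_improvement = mae(climatology, truth) - mae(emulator, truth)
```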

We next assess results from applying DA through random observations at 5000 locations of the domain, to see if results obtained with the emulator can be improved. Here, sparse observations at each timestep of the output window are used within the optimization statement introduced in Equation 9 to update the initial conditions (i.e., the input window). We emphasize that the observations are obtained randomly from the true state of the system during forecasts, which corresponds to the test data introduced previously.

The results, in Figure 4, show that the use of sensor information aids the forecast immensely, with significant improvements in all metrics by virtue of DA. In particular, the proposed augmentation (i.e., the use of DA during forecasts) reduces forecast errors considerably in regions where the sole use of the emulator was not competitive. We provide MAE assessments for our 20-day output, averaged over the testing data range, in Figure 5, where we compare the raw emulator outputs, climatology, persistence, and the DA-corrected outputs. We also provide confidence intervals for the regular (i.e., without DA) and DA-corrected emulators, where five different emulators are trained with different random seeds and input window durations to assess the sensitivity to the initialization of the training as well as to the automatic-differentiation-based optimization. The results indicate that the DA-corrected emulation is the most accurate and consistent forecast technique.

We also provide in Figure 6 a comparison with another benchmark, persistence, which uses the state of the last day in the input window as the forecast for each day of the output window. Persistence is seen to be more accurate for short lead times but is outperformed by both climatology and data-driven methods for longer forecast durations.

Our choice of 20 days for the forecast window was motivated by the results of Rasp et al. [44, 45], who performed analyses for five-day forecasts. Our study increased the forecast window to four times the original assessments for geopotential height to demonstrate the possibility for improved forecasting of any emulator using automatic differentiation-enabled variational DA. In many subregions of the computational domain, the use of DA with the emulator is also seen to improve on climatology predictions—even at large lead times (20 days out), where classical data-driven and numerical forecast models are typically less accurate than climatology.

(a) Similarity improvement: Regular emulator without DA
(b) MAE improvement: Regular emulator without DA
(c) MAE: Regular emulator without DA
Figure 3: Improvements over climatology and MAE for regular emulator, when using the best set of hyperparameters as determined on validation data.
(a) Similarity improvement: Regular emulator + DA
(b) MAE improvement: Regular emulator + DA
(c) MAE: Regular emulator + DA
Figure 4: Improvements over climatology and MAE for regular emulator corrected with DA, when using the best set of hyperparameters as determined on validation data.




(a) Alaska
(b) Midwest
(c) North Great Plains
(d) Northeast
(e) Northwest
(f) South Great Plains
(g) Southeast
(h) Southwest
(i) North America
Figure 5: 20-day geopotential height forecast MAEs, relative to the true test data, for eight National Climate Assessment subregions of continental North America [46] and for North America overall, from both a regular ML emulator and the same emulator corrected by variational DA. Confidence intervals, for five separately trained emulators, encapsulate the effects of perturbations to the random seed and the input window. DA-based correction gives both lower MAEs and narrower confidence intervals.
Figure 6: 20-day geopotential height forecast MAEs for continental North America, relative to the true test data. Results describe the performance of a regular ML emulator and the same emulator corrected by variational data assimilation. Here we also show the results of persistence, which, as expected, is outperformed in the long-term prediction regime.

Our method also delivers large computational gains. A one-year forecast, with the emulator-assisted DA performed each day, takes approximately one hour on a single processor core without any accelerator hardware. In contrast, the original PDE-based simulation required 21,600 core hours for a one-year forecast with a 515×599 grid. (A coarse-grained forecast for one year on the 103×120 grid used here takes 172 core hours, but the resulting flow field does not adequately reproduce the fidelity of the 515×599 grid.) One can observe a significant speed-up ($\sim 10^{4}$ times) for the emulation of the geopotential height, even without factoring in the cost of an additional variational DA step. Furthermore, the solution to the 4D-Var problem, which yields an improved initial condition, requires on the order of 100 model runs, where each model run can be several orders of magnitude more expensive than our emulator.

5 Conclusions

We have described how a differentiable reduced-order surrogate geophysical forecasting model may be integrated into an outer-loop optimization technique whereby real-time observations of the true solution are used to improve the forecast of the surrogate. We use such observations to improve the initial condition of the surrogate model, such that an optimization statement given by the classical 4D variational DA objective function is minimized. The use of the reduced-order surrogate converts a high-dimensional optimization to one that is given by the dimensionality of the compressed representation and the duration of the input window for forecasting. Our optimization is thus performed rapidly, given sparse and random observations from the true flow field, without any access to high performance computing resources. Computational costs are reduced by four orders of magnitude, by virtue of the surrogate assisted forecasting and variational DA, when compared to a classical equation-based forecasting of the dynamics. We assess our model on a real-world forecasting task for geopotential height in the continental US, and obtain competitive results with respect to climatology and persistence baselines for mean-absolute-error and cosine similarity.

Acknowledgements

This material is based upon work supported by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357. This material is partially based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research was funded in part by, and used resources of, the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. Romit Maulik acknowledges the support of ASCR Project DOEFOA2493: ‘Data intensive scientific machine learning’ (PM: Dr. Steve Lee). Gianmarco Mengaldo acknowledges support from NUS startup grant R-265-000-A36-133.

Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

References

  • [1] S. Akella and I.M. Navon. Different approaches to model error formulation in 4D-Var: A study with high-resolution advection schemes. Tellus A, 61(1):112–128, 2009.
  • [2] Hans-Stefan Bauer, Thomas Schwitalla, Volker Wulfmeyer, Atoossa Bakhshaii, Uwe Ehret, Malte Neuper, and Olivier Caumont. Quantitative precipitation estimation based on high-resolution numerical weather prediction and data assimilation with WRF–a performance test. Tellus A: Dynamic Meteorology and Oceanography, 67(1):25047, 2015.
  • [3] Gal Berkooz, Philip Holmes, and John L Lumley. The proper orthogonal decomposition in the analysis of turbulent flows. Annual Review of Fluid Mechanics, 25(1):539–575, 1993.
  • [4] Julien Brajard, Alberto Carrassi, Marc Bocquet, and Laurent Bertino. Combining data assimilation and machine learning to emulate a dynamical model from sparse and noisy observations: A case study with the Lorenz 96 model. Journal of Computational Science, 44:101171, 2020.
  • [5] Julien Brajard, Alberto Carrassi, Marc Bocquet, and Laurent Bertino. Combining data assimilation and machine learning to infer unresolved scale parametrization. Philosophical Transactions of the Royal Society A, 379(2194):20200086, 2021.
  • [6] Mark Buehner. Ensemble-derived stationary and flow-dependent background-error covariances: Evaluation in a quasi-operational NWP setting. Quarterly Journal of the Royal Meteorological Society, 131(607):1013–1043, 2005.
  • [7] Carla Cardinali, Nedjeljka Žagar, Gabor Radnoti, and Roberto Buizza. Representing model error in ensemble data assimilation. Nonlinear Processes in Geophysics, 21(5):971–985, 2014.
  • [8] Gregory R Carmichael, Adrian Sandu, Tianfeng Chai, Dacian N Daescu, Emil M Constantinescu, and Youhua Tang. Predicting air quality: Improvements through advanced methods to integrate models and measurements. Journal of Computational Physics, 227(7):3540–3571, 2008.
  • [9] César Quilodrán Casas, Rossella Arcucci, Pin Wu, Christopher Pain, and Yi-Ke Guo. A reduced order deep data assimilation model. Physica D: Nonlinear Phenomena, 412:132615, 2020.
  • [10] Anindya Chatterjee. An introduction to the proper orthogonal decomposition. Current Science, pages 808–817, 2000.
  • [11] R. Daley. Atmospheric Data Analysis, volume 2. Cambridge University Press, 1993.
  • [12] Ronald M Errico. What is an adjoint model? Bulletin of the American Meteorological Society, 78(11):2577–2592, 1997.
  • [13] Ronald M Errico and Kevin D Raeder. An examination of the accuracy of the linearization of a mesoscale model with moist physics. Quarterly Journal of the Royal Meteorological Society, 125(553):169–195, 1999.
  • [14] Ronald M Errico, Tomislava Vukicevic, and Kevin Raeder. Examination of the accuracy of a tangent linear model. Tellus A: Dynamic Meteorology and Oceanography, 45(5):462–477, 1993.
  • [15] Thomas Frerix, Dmitrii Kochkov, Jamie A Smith, Daniel Cremers, Michael P Brenner, and Stephan Hoyer. Variational data assimilation with a learned inverse observation operator. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139. PMLR, 2021.
  • [16] J. Glimm, S. Hou, Y.H. Lee, D.H. Sharp, and K. Ye. Sources of uncertainty and error in the simulation of flow in porous media. Computational and Applied Mathematics, 23:109–120, 2004.
  • [17] Vincent Le Guen and Nicolas Thome. Disentangling physical dynamics from unknown factors for unsupervised video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11474–11484, 2020.
  • [18] Nils Gustafsson, Tijana Janjić, Christoph Schraff, Daniel Leuenberger, Martin Weissmann, Hendrik Reich, Pierre Brousseau, Thibaut Montmerle, Eric Wattrelot, Antonín Bučánek, et al. Survey of data assimilation methods for convective-scale numerical weather prediction at operational centres. Quarterly Journal of the Royal Meteorological Society, 144(713):1218–1256, 2018.
  • [19] James A Hansen. Accounting for model error in ensemble-based state estimation and forecasting. Monthly Weather Review, 130(10):2373–2391, 2002.
  • [20] Sam Hatfield, Matthew Chantry, Peter Dueben, Philippe Lopez, Alan Geer, and Tim Palmer. Building tangent-linear and adjoint models for data assimilation with neural networks. Journal of Advances in Modeling Earth Systems, 13(9):e2021MS002521, 2021.
  • [21] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [22] Philip Holmes, John L Lumley, Gahl Berkooz, and Clarence W Rowley. Turbulence, Coherent Structures, Dynamical Systems and Symmetry. Cambridge University Press, 2012.
  • [23] E. Kalnay. Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press, 2003.
  • [24] Andrea Lario, Romit Maulik, Gianluigi Rozza, and Gianmarco Mengaldo. Neural-network learning of SPOD latent dynamics. arXiv preprint arXiv:2110.09218, 2021.
  • [25] F.X. Le Dimet and O. Talagrand. Variational algorithms for analysis and assimilation of meteorological observations: theoretical aspects. Tellus A, 38(2):97–110, 1986.
  • [26] Andrew C Lorenc and F Rawlins. Why does 4D-Var beat 3D-Var? Quarterly Journal of the Royal Meteorological Society, 131(613):3247–3257, 2005.
  • [27] Peter Lynch. The origins of computer weather prediction and climate modeling. Journal of Computational Physics, 227(7):3431–3444, 2008.
  • [28] Julian Mack, Rossella Arcucci, Miguel Molina-Solana, and Yi-Ke Guo. Attention-based convolutional autoencoders for 3D-variational data assimilation. Computer Methods in Applied Mechanics and Engineering, 372:113291, 2020.
  • [29] Romit Maulik, Romain Egele, Bethany Lusch, and Prasanna Balaprakash. Recurrent neural network architecture search for geophysical emulation. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14. IEEE, 2020.
  • [30] Romit Maulik, Bethany Lusch, and Prasanna Balaprakash. Reduced-order modeling of advection-dominated systems with recurrent neural networks and convolutional autoencoders. Physics of Fluids, 33(3):037106, 2021.
  • [31] Romit Maulik and Gianmarco Mengaldo. PyParSVD: A streaming, distributed and randomized singular-value-decomposition library. arXiv preprint arXiv:2108.08845, 2021.
  • [32] Gianmarco Mengaldo and Romit Maulik. PySPOD: A Python package for Spectral Proper Orthogonal Decomposition (SPOD). Journal of Open Source Software, 6(60):2862, 2021.
  • [33] Arvind T Mohan and Datta V Gaitonde. A deep learning based approach to reduced order modeling for turbulent flow control using LSTM neural networks. arXiv preprint arXiv:1804.09269, 2018.
  • [34] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation, pages 561–577, 2018.
  • [35] Jorge Nocedal and Stephen J Wright. Sequential quadratic programming. Numerical Optimization, pages 529–562, 2006.
  • [36] D. Orrell, L. Smith, J. Barkmeijer, and T.N. Palmer. Model error in weather forecasting. Nonlinear Processes in Geophysics, 8:357–371, 2001.
  • [37] T.N. Palmer, G.J. Shutts, R. Hagedorn, F.J. Doblas-Reyes, T. Jung, and M. Leutbecher. Representing model uncertainty in weather and climate prediction. Annu. Rev. Earth Planet. Sci, 33:163–93, 2005.
  • [38] Suraj Pawar, Shady E Ahmed, Omer San, Adil Rasheed, and Ionel M Navon. Long short-term memory embedded nudging schemes for nonlinear data assimilation of geophysical flows. Physics of Fluids, 32(7):076606, 2020.
  • [39] Suraj Pawar, SM Rahman, H Vaddireddy, Omer San, Adil Rasheed, and Prakash Vedula. A deep learning enabler for nonintrusive reduced order modeling of fluid flows. Physics of Fluids, 31(8):085101, 2019.
  • [40] Suraj Pawar and Omer San. Data assimilation empowered neural network parametrizations for subgrid processes in geophysical flows. Physical Review Fluids, 6(5):050501, 2021.
  • [41] Stephen G Penny, Timothy A Smith, Tse-Chun Chen, Jason A Platt, Hsin-Yi Lin, Michael Goodliff, and Henry D I Abarbanel. Integrating recurrent neural networks with data assimilation for scalable data-driven state estimation. arXiv preprint arXiv:2109.12269, 2021.
  • [42] Andrey A Popov and Adrian Sandu. Multifidelity ensemble Kalman filtering using surrogate models defined by physics-informed autoencoders. arXiv preprint arXiv:2102.13025, 2021.
  • [43] Vishwas Rao and Adrian Sandu. A posteriori error estimates for the solution of variational inverse problems. SIAM/ASA Journal on Uncertainty Quantification, 3(1):737–761, 2015.
  • [44] Stephan Rasp, Peter D Dueben, Sebastian Scher, Jonathan A Weyn, Soukayna Mouatadid, and Nils Thuerey. WeatherBench: A benchmark data set for data-driven weather forecasting. Journal of Advances in Modeling Earth Systems, 12(11):e2020MS002203, 2020.
  • [45] Stephan Rasp and Nils Thuerey. Data-driven medium-range weather prediction with a Resnet pretrained on climate simulations: A new model for WeatherBench. Journal of Advances in Modeling Earth Systems, 13(2):e2020MS002405, 2021.
  • [46] DR Reidmiller, CW Avery, DR Easterling, KE Kunkel, KLM Lewis, T Maycock, and BC Stewart. Fourth national climate assessment. Volume II: Impacts, Risks, and Adaptation in the United States, 2018.
  • [47] A. Sandu and T. Chai. Chemical data assimilation—An overview. Atmosphere, 2(3):426–463, 2011.
  • [48] A. Sandu, D. N. Daescu, G. R. Carmichael, and T. Chai. Adjoint sensitivity analysis of regional air quality models. Journal of Computational Physics, 204(1):222–252, 2005.
  • [49] Oliver T Schmidt, Gianmarco Mengaldo, Gianpaolo Balsamo, and Nils P Wedi. Spectral empirical orthogonal function analysis of weather and climate data. Monthly Weather Review, 147(8):2979–2995, 2019.
  • [50] Y. Trémolet. Accounting for an imperfect model in 4D-Var. Quarterly Journal of the Royal Meteorological Society, 132(621):2483–2504, 2006.
  • [51] Y. Trémolet. Model-error estimation in 4D-Var. Quarterly Journal of the Royal Meteorological Society, 133(626):1267–1280, 2007.
  • [52] Jiali Wang and Veerabhadra R Kotamarthi. Downscaling with a nested regional climate model in near-surface fields over the contiguous United States. Journal of Geophysical Research: Atmospheres, 119(14):8778–8797, 2014.
  • [53] Dusanka Zupanski and Milija Zupanski. Model error estimation employing an ensemble data assimilation approach. Monthly Weather Review, 134(5):1337–1354, 2006.