

Iterative Ensemble Kalman Methods:
A Unified Perspective with Some New Variants

Neil K. Chada, Yuming Chen and Daniel Sanz-Alonso
Applied Mathematics and Computational Science, King Abdullah University of Science and Technology.
Department of Statistics, University of Chicago.
Abstract

This paper provides a unified perspective of iterative ensemble Kalman methods, a family of derivative-free algorithms for parameter reconstruction and other related tasks. We identify, compare and develop three subfamilies of ensemble methods that differ in the objective they seek to minimize and the derivative-based optimization scheme they approximate through the ensemble. Our work emphasizes two principles for the derivation and analysis of iterative ensemble Kalman methods: statistical linearization and continuum limits. Following these guiding principles, we introduce new iterative ensemble Kalman methods that show promising numerical performance in Bayesian inverse problems, data assimilation and machine learning tasks.

1 Introduction

This paper provides an accessible introduction to the derivation and foundations of iterative ensemble Kalman methods, a family of derivative-free algorithms for parameter reconstruction and other related tasks. The overarching theme behind these methods is to iteratively update via Kalman-type formulae an ensemble of candidate reconstructions, aiming to bring the ensemble closer to the unknown parameter with each iteration. The ensemble Kalman updates approximate derivative-based nonlinear least-squares optimization schemes without requiring gradient evaluations. Our presentation emphasizes that iterative ensemble Kalman methods can be naturally classified in terms of the nonlinear least-squares objective they seek to minimize and the derivative-based optimization scheme they approximate through the ensemble. This perspective allows us to identify three subfamilies of iterative ensemble Kalman methods, bringing unity to the growing literature on this subject. Our work also emphasizes two principles for the derivation and analysis of iterative ensemble Kalman methods: statistical linearization and continuum limits. Following these principles, we introduce new iterative ensemble Kalman methods that show promising numerical performance in Bayesian inverse problems, data assimilation and machine learning tasks.

We consider the application of iterative ensemble Kalman methods to the problem of reconstructing an unknown u\in\mathbb{R}^{d} from corrupt data y\in\mathbb{R}^{k} related by

y=h(u)+\eta, (1.1)

where \eta represents measurement or model error and h is a given map. A wide range of inverse problems, data assimilation and machine learning tasks can be cast into the framework (1.1). In these applications the unknown u may represent, for instance, an input parameter of a differential equation, the current state of a time-evolving signal, or a regressor, respectively. Ensemble Kalman methods were first introduced as filtering schemes for sequential data assimilation [21, 22, 42, 47, 50] to reduce the computational cost of the Kalman filter [34]. Their use for state and parameter estimation and inverse problems was further developed in [3, 40, 43, 55]. The idea of iterating these methods was considered in [15, 18]. Iterative ensemble Kalman methods are now popular in inverse problems and data assimilation; they have also shown some potential in machine learning applications [25, 29, 37].

Starting from an initial ensemble \{u_{0}^{(n)}\}_{n=1}^{N}, iterative ensemble Kalman methods use various ensemble-based empirical means and covariances to update

\{u_{i}^{(n)}\}_{n=1}^{N}\to\{u_{i+1}^{(n)}\}_{n=1}^{N}, (1.2)

until a stopping criterion is satisfied; the unknown parameter u is reconstructed by the mean of the final ensemble. The idea is analogous to classical Kalman methods and optimization schemes which, starting from a single initialization u_{0}, use evaluations of derivatives of h to iteratively update u_{i}\to u_{i+1} until a stopping criterion is met. The initial ensemble is viewed as an input to the algorithm, obtained in a problem-dependent fashion. In Bayesian inverse problems and machine learning it may be obtained by sampling a prior, while in data assimilation the initial ensemble may be a given collection of particles that approximates the prediction distribution. In either case, it is useful to view the initial ensemble as a sample from a probability distribution.

There are two main computational benefits in updating an ensemble of candidate reconstructions rather than a single estimate. First, the ensemble update can be performed without evaluating derivatives of h, effectively approximating them using statistical linearization. This is important in applications where computing derivatives of h is expensive, or where the map h needs to be treated as a black-box. Second, the use of empirical rather than model covariances can significantly reduce the computational cost whenever the ensemble size N is smaller than the dimension d of the unknown parameter u. Another key advantage of the ensemble approach is that, for problems that are not strongly nonlinear, the spread of the ensemble may contain meaningful information on the uncertainty in the reconstruction.

1.1 Overview: Three Subfamilies

This paper identifies, compares and further develops three subfamilies of iterative ensemble Kalman methods to implement the ensemble update (1.2). Each subfamily employs a different Kalman-type formula, determined by a choice of objective to minimize and a derivative-based optimization scheme to approximate with the ensemble. All three approaches impose some form of regularization, either explicitly through the choice of the objective, or implicitly through the choice of the optimization scheme. Incorporating regularization is essential in parameter reconstruction problems encountered in applications, which are typically under-determined or ill-posed [41, 50].

The first subfamily considers a Tikhonov-Phillips objective associated with the parameter reconstruction problem (1.1), given by

{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u):=\frac{1}{2}|y-h(u)|^{2}_{R}+\frac{1}{2}|u-m|^{2}_{P}, (1.3)

where R and P are symmetric positive definite matrices that model, respectively, the data measurement precision and the level of regularization (incorporated explicitly through the choice of objective), and m represents a background estimate of u. Here and throughout this paper we use the notation |v|^{2}_{A}:=|A^{-1/2}v|^{2}=v^{T}A^{-1}v for symmetric positive definite A and vector v. The ensemble is used to approximate a Gauss-Newton method applied to the Tikhonov-Phillips objective {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}. Algorithms in this subfamily were first introduced in geophysical data assimilation [1, 15, 18, 24, 39, 48] and were inspired by iterative, derivative-based, extended Kalman filters [5, 6, 33]. Extensions to more challenging problems with strongly nonlinear dynamics are considered in [49, 60]. In this paper we will use a new Iterative Ensemble Kalman Filter (IEKF) method as a prototypical example of an algorithm that belongs to this subfamily.

The second subfamily considers the data-misfit objective

{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}(u):=\frac{1}{2}|y-h(u)|^{2}_{R}. (1.4)

When the parameter reconstruction problem is ill-posed, minimizing {\mathsf{J}}_{\mbox{\tiny{\rm DM}}} leads to unstable reconstructions. For this reason, iterative ensemble Kalman methods in this subfamily are complemented with a Levenberg-Marquardt optimization scheme that implicitly incorporates regularization. The ensemble is used to approximate a regularizing Levenberg-Marquardt optimization algorithm to minimize {\mathsf{J}}_{\mbox{\tiny{\rm DM}}}. Algorithms in this subfamily were introduced in the applied mathematics literature [30, 31] building on ideas from classical inverse problems [26]. Recent theoretical work has focused on developing continuous-time and mean-field limits, as well as various convergence results [8, 9, 14, 27, 37, 51]. Methodological extensions based on Bayesian hierarchical techniques were introduced in [10, 11] and the incorporation of constraints has been investigated in [2, 12]. In this paper we will use the Ensemble Kalman Inversion (EKI) method [31] as a prototypical example of an algorithm that belongs to this subfamily.

Objective   Optimization   Derivative Method   Ensemble Method   New Variant
{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}   GN   IExKF (2.1)   IEKF (3.1)   IEKF-SL (4.1)
{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}   LM   LM-DM (2.2)   EKI (3.2)   EKI-SL (4.2)
{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}   LM   LM-TP (2.3)   TEKI (3.3)   TEKI-SL (4.3)

Table 1: Roadmap to the algorithms considered in this paper. We use the abbreviations GN and LM for Gauss-Newton and Levenberg-Marquardt. The numbers in parentheses represent the subsection in which each algorithm is introduced.

The third subfamily, which has emerged more recently, combines explicit regularization through the Tikhonov-Phillips objective and an implicitly regularizing optimization scheme [13, 14]. Precisely, a Levenberg-Marquardt scheme is approximated through the ensemble in order to minimize the Tikhonov-Phillips objective {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}. In this paper we will use the Tikhonov Ensemble Kalman Inversion (TEKI) method [13] as a prototypical example.

To conclude this overview we note that while in this paper we will only consider least-squares objectives, iterative ensemble Kalman methods that use other regularizers have been recently proposed [37, 38, 53].

1.2 Statistical Linearization, Continuum Limits and New Variants

Each subfamily of iterative ensemble Kalman methods stems from a derivative-based optimization scheme. However, there is substantial freedom as to how to use the ensemble to approximate a derivative-based method. We will focus on randomized-maximum likelihood implementations [24, 35, 50], rather than square-root or ensemble adjustment approaches [3, 59, 28]. Two principles will guide our derivation and analysis of ensemble methods: the use of statistical linearization [60] and their connection with gradient descent methods through the study of continuum limits [51].

The idea behind statistical linearization is to approximate the gradient of h using pairs \{(u_{i}^{(n)},h(u_{i}^{(n)}))\}_{n=1}^{N} in such a way that if h is linear and the ensemble size N is sufficiently large, the approximation is exact. As we shall see, this idea tacitly underlies the derivation of all the ensemble methods considered in this paper, and will be explicitly employed in our derivation of new variants. Statistical linearization has also been used within unscented Kalman filters, see e.g. [60].

Differential equations have long been important in developing and understanding optimization schemes [44], and investigating the connections between differential equations and optimization is still an active area of research [54, 56, 57, 61]. In the context of iterative ensemble methods, continuum limit analyses arise from considering small length-steps and have been developed primarily in the context of EKI-type algorithms [8, 52]. While the derivative-based algorithms that motivate the ensemble methods result in an ODE continuum limit, the ensemble versions lead to a system of SDEs. Continuum limit analyses are useful in at least three ways. First, they unveil the gradient structure of the optimization schemes. Second, viewing optimization schemes as arising from discretizations of SDEs lends itself to the design of algorithms that are easy to tune: the length-step is chosen to be small and the algorithms are run until statistical equilibrium is reached. Third, a simple linear-case analysis of the SDEs may be used to develop new algorithms that satisfy certain desirable properties. Our new iterative ensemble Kalman methods will be designed following these observations.

While our work advocates the study of continuum limits as a useful tool to design ensemble methods, continuum limits cannot capture the full richness and flexibility of discrete, implementable algorithms, since different algorithms may result in the same SDE continuum limit. This insight suggests that it is not only the study of differential equations, but also of their discretizations, that may contribute to the design of iterative ensemble Kalman algorithms.

1.3 Main Contributions and Outline

In addition to providing a unified perspective of the existing literature, this paper contains several original contributions. We highlight some of them in the following outline and refer to Table 1 for a summary of the algorithms considered in this paper.

  • In Section 2 we review three iterative derivative-based methods for nonlinear least-squares optimization. The ensemble-based algorithms studied in subsequent sections can be interpreted as ensemble-based approximations of the derivative-based methods described in this section. We also derive informally ODE continuum limits for each method, which unveils their gradient flow structure.

  • In Section 3 we describe the idea of statistical linearization. We review three subfamilies of iterative ensemble methods, each of which has an update formula analogous to one of the derivative-based methods in Section 2. We analyze the methods when h(u)=Hu is linear by formally deriving SDE continuum limits that unveil their gradient structure. A novelty in this section is the introduction of the IEKF method, which is similar to, but different from, the iterative ensemble method introduced in [60].

  • The material in Section 4 is novel to the best of our knowledge. We introduce new variants of the iterative ensemble Kalman methods discussed in Section 3 and formally derive their SDE continuum limits. We analyze the resulting SDEs when h(u)=Hu is linear. The proposed methods are designed to ensure that (i) no parameter tuning or careful stopping criteria are needed; and (ii) the ensemble covariance contains meaningful information on the uncertainty in the reconstruction in the linear case, avoiding the ensemble collapse of some existing methods.

  • In Section 5 we include an in-depth empirical comparison of the performance of the iterative ensemble Kalman methods discussed in Sections 3 and 4. We consider four problem settings motivated by applications in Bayesian inverse problems, data assimilation and machine learning. Our results illustrate the different behavior of some methods in small noise regimes and the benefits of avoiding ensemble collapse.

  • Section 6 concludes and suggests some open directions for further research.

2 Derivative-based Optimization for Nonlinear Least-Squares

In this section we review the Gauss-Newton and Levenberg-Marquardt methods, two derivative-based approaches for optimization of nonlinear least-squares objectives of the form

{\mathsf{J}}(u)=\frac{1}{2}|r(u)|^{2}. (2.1)

We derive closed formulae for the Gauss-Newton method applied to the Tikhonov-Phillips objective {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}, as well as for the Levenberg-Marquardt method applied to the data-misfit objective {\mathsf{J}}_{\mbox{\tiny{\rm DM}}} and the Tikhonov-Phillips objective {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}. These formulae are the basis for the ensemble, derivative-free methods considered in the next section.

As we shall see, the search directions of Gauss-Newton and Levenberg-Marquardt methods are found by minimizing a linearization of the least-squares objective. It is thus instructive to consider first linear least-squares optimization before delving into the nonlinear setting. The following well-known result, that we will use extensively, characterizes the minimizer \mu of the Tikhonov-Phillips objective {\mathsf{J}}_{\mbox{\tiny{\rm TP}}} in the case of linear h(u)=Hu.

Lemma 2.1.

It holds that

\frac{1}{2}|y-Hu|_{R}^{2}+\frac{1}{2}|u-m|^{2}_{P}=\frac{1}{2}|u-\mu|_{C}^{2}+\beta, (2.2)

where \beta does not depend on u, and

C^{-1}=H^{T}R^{-1}H+P^{-1}, (2.3)
C^{-1}\mu=H^{T}R^{-1}y+P^{-1}m. (2.4)

Equivalently,

\mu=m+K(y-Hm), (2.5)
C=(I-KH)P, (2.6)

where K is the Kalman gain matrix given by

K=PH^{T}(HPH^{T}+R)^{-1}=CH^{T}R^{-1}. (2.7)
Proof.

The formulae (2.3) and (2.4) follow by matching linear and quadratic coefficients in u between

\frac{1}{2}|u-\mu|^{2}_{C}\quad\quad\text{and}\quad\quad\frac{1}{2}|u-m|^{2}_{P}+\frac{1}{2}|y-Hu|^{2}_{R}. (2.8)

The formulae (2.5) and (2.6), as well as the equivalent expressions for the Kalman gain K in Equation (2.7), can be obtained using the matrix inversion lemma [50]. ∎
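The two characterizations in Lemma 2.1 are straightforward to check numerically. The following minimal NumPy sketch (with arbitrary illustrative dimensions and randomly generated H, P, R, m and y; none of these values come from the paper) verifies that the precision form (2.3)-(2.4) and the Kalman-gain form (2.5)-(2.7) produce the same \mu and C.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3                          # illustrative dimensions, not from the paper
H = rng.standard_normal((k, d))
A = rng.standard_normal((d, d)); P = A @ A.T + np.eye(d)    # prior covariance (SPD)
B = rng.standard_normal((k, k)); R = B @ B.T + np.eye(k)    # noise covariance (SPD)
m = rng.standard_normal(d)           # prior mean
y = rng.standard_normal(k)           # data

# Precision form (2.3)-(2.4): C^{-1} = H^T R^{-1} H + P^{-1}, C^{-1} mu = H^T R^{-1} y + P^{-1} m
C_inv = H.T @ np.linalg.inv(R) @ H + np.linalg.inv(P)
C1 = np.linalg.inv(C_inv)
mu1 = C1 @ (H.T @ np.linalg.inv(R) @ y + np.linalg.inv(P) @ m)

# Kalman-gain form (2.5)-(2.7): K = P H^T (H P H^T + R)^{-1}, mu = m + K(y - Hm), C = (I - KH) P
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
mu2 = m + K @ (y - H @ m)
C2 = (np.eye(d) - K @ H) @ P

assert np.allclose(mu1, mu2) and np.allclose(C1, C2)
assert np.allclose(K, C1 @ H.T @ np.linalg.inv(R))          # second identity in (2.7)
```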

Bayesian Interpretation

Lemma 2.1 has a natural statistical interpretation. Consider a statistical model defined by likelihood y|u\sim\mathcal{N}(Hu,R) and prior u\sim\mathcal{N}(m,P). Then Equation (2.2) shows that the posterior distribution is Gaussian, u|y\sim\mathcal{N}(\mu,C), and Equations (2.3)-(2.4) characterize the posterior mean and precision (inverse covariance). We interpret Equation (2.5) as providing a closed formula for the posterior mode, known as the maximum a posteriori (MAP) estimator.

More generally, the generative model

u𝒩(m,P),y|u𝒩(h(u),R),\displaystyle\begin{split}u&\sim\mathcal{N}(m,P),\\ y|u&\sim\mathcal{N}(h(u),R),\end{split} (2.9)

gives rise to a posterior distribution on u|y with density proportional to \exp(-{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u)). Thus, minimization of {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u) corresponds to maximizing the posterior density under the model (2.9).

2.1 Gauss-Newton Optimization of Tikhonov-Phillips Objective

In this subsection we introduce two ways of writing the Gauss-Newton update applied to the Tikhonov-Phillips objective 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}. We recall that the Gauss-Newton method applied to a general least-squares objective 𝖩(u)=12|r(u)|2{\mathsf{J}}(u)=\frac{1}{2}|r(u)|^{2} is a line-search method which, starting from an initialization u0,u_{0}, sets

ui+1=ui+αivi,i=0,1,\displaystyle u_{i+1}=u_{i}+\alpha_{i}v_{i},\quad\quad i=0,1,\ldots

where viv_{i} is a search direction defined by

vi=argminv𝖩i(v),𝖩i(v):=12|r(ui)v+r(ui)|2,\displaystyle v_{i}=\arg\min_{v}{\mathsf{J}}_{i}^{\mbox{\tiny{$\ell$}}}(v),\quad\quad{\mathsf{J}}_{i}^{\mbox{\tiny{$\ell$}}}(v):=\frac{1}{2}|r^{\prime}(u_{i})v+r(u_{i})|^{2}, (2.10)

where αi>0\alpha_{i}>0 is a length-step parameter whose choice will be discussed later. In order to apply the Gauss-Newton method to the Tikhonov-Phillips objective, we write 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}} in the standard nonlinear least-squares form (2.1). Note that

{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u)=\frac{1}{2}|u-m|^{2}_{P}+\frac{1}{2}|y-h(u)|^{2}_{R}=\frac{1}{2}|z-g(u)|^{2}_{Q},

where

z:=\begin{bmatrix}y\\ m\end{bmatrix},\quad\quad\quad g(u):=\begin{bmatrix}h(u)\\ u\end{bmatrix},\quad\quad\quad Q:=\begin{bmatrix}R&0\\ 0&P\end{bmatrix}.

Therefore we have

{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u)=\frac{1}{2}|r_{\mbox{\tiny{\rm TP}}}(u)|^{2},\quad\quad r_{\mbox{\tiny{\rm TP}}}(u):=Q^{-1/2}(z-g(u)). (2.11)
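As a sanity check of the reformulation (2.11), the following sketch confirms that stacking the data and the background into (z, g, Q) reproduces the Tikhonov-Phillips objective. The toy nonlinear map and the dimensions are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 3, 2                                                   # illustrative dimensions
h = lambda u: np.array([np.sin(u[0]) + u[1], u[1] * u[2]])    # toy nonlinear forward map
u = rng.standard_normal(d)
y = rng.standard_normal(k)
m = rng.standard_normal(d)
R = 0.5 * np.eye(k)
P = 2.0 * np.eye(d)

def sq_norm(v, A):
    """Weighted squared norm |v|_A^2 = v^T A^{-1} v."""
    return v @ np.linalg.solve(A, v)

# Tikhonov-Phillips objective, written directly as in (1.3)
J_TP = 0.5 * sq_norm(y - h(u), R) + 0.5 * sq_norm(u - m, P)

# The same objective via the stacked variables z, g, Q of (2.11)
z = np.concatenate([y, m])
g = np.concatenate([h(u), u])
Q = np.block([[R, np.zeros((k, d))], [np.zeros((d, k)), P]])
J_stacked = 0.5 * sq_norm(z - g, Q)

assert np.isclose(J_TP, J_stacked)
```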

The following result is a direct consequence of Lemma 2.1.

Lemma 2.2 ([6]).

The Gauss-Newton method applied to the Tikhonov-Phillips objective 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}} admits the characterizations:

ui+1=ui+αiCi{HiTR1(yh(ui))+P1(mui)},u_{i+1}=u_{i}+\alpha_{i}C_{i}\Bigl{\{}H_{i}^{T}R^{-1}\bigl{(}y-h(u_{i})\bigr{)}+P^{-1}(m-u_{i})\Bigr{\}}, (2.12)

and

ui+1=ui+αi{Ki(yh(ui))+(IKiHi)(mui)},u_{i+1}=u_{i}+\alpha_{i}\Bigl{\{}K_{i}\bigl{(}y-h(u_{i})\bigr{)}+(I-K_{i}H_{i})(m-u_{i})\Bigl{\}}, (2.13)

where Hi=h(ui)H_{i}=h^{\prime}(u_{i}) and

Ki\displaystyle K_{i} =PHiT(HiPHiT+R)1,\displaystyle=PH_{i}^{T}(H_{i}PH_{i}^{T}+R)^{-1},
Ci\displaystyle C_{i} =(IKiHi)P.\displaystyle=(I-K_{i}H_{i})P.
Proof.

The search direction viv_{i} of Gauss-Newton for the objective 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}} is given by

vi\displaystyle v_{i} =argminv𝖩TP,i(v)\displaystyle=\arg\min_{v}{\mathsf{J}}_{\mbox{\tiny{\rm TP,}}i}^{\ell}(v) (2.14)
=argminv12|rTP(ui)v+rTP(ui)|2\displaystyle=\arg\min_{v}\frac{1}{2}\bigl{|}r_{\mbox{\tiny{\rm TP}}}^{\prime}(u_{i})v+r_{\mbox{\tiny{\rm TP}}}(u_{i})\bigr{|}^{2} (2.15)
=argminv12|zg(ui)g(ui)v|Q\displaystyle=\arg\min_{v}\frac{1}{2}\bigl{|}z-g(u_{i})-g^{\prime}(u_{i})v\bigr{|}_{Q} (2.16)
=argminv{12|yh(ui)h(ui)v|R2+12|v(mui)|P2}.\displaystyle=\arg\min_{v}\biggl{\{}\frac{1}{2}\bigl{|}y-h(u_{i})-h^{\prime}(u_{i})v\bigr{|}^{2}_{R}+\frac{1}{2}\bigl{|}v-(m-u_{i})\bigr{|}_{P}^{2}\biggr{\}}. (2.17)

Applying Lemma 2.1, using formulae (2.4) and (2.6), we deduce that

vi=Ci{HiTR1(yh(ui))+P1(mui)},\displaystyle v_{i}=C_{i}\Bigl{\{}H_{i}^{T}R^{-1}\bigl{(}y-h(u_{i})\bigr{)}+P^{-1}(m-u_{i})\Bigr{\}},

which establishes the characterization (2.12). The equivalence between (2.12) and (2.13) follows from the identity (2.7), which implies that CiHiTR1=KiC_{i}H_{i}^{T}R^{-1}=K_{i} and CiP1=IKiHi.C_{i}P^{-1}=I-K_{i}H_{i}.

We refer to the Gauss-Newton method with constant length-step \alpha_{i}=\alpha applied to {\mathsf{J}}_{\mbox{\tiny{\rm TP}}} as the Iterative Extended Kalman Filter (IExKF) algorithm. IExKF was developed in the control theory literature [33] without reference to the Gauss-Newton optimization method; the agreement between the two methods was established in [6]. In order to compare IExKF with an ensemble-based method in Section 3, we summarize it here.

Algorithm 1 Iterative Extended Kalman Filter (IExKF)

Input: Initialization u0=mu_{0}=m, length-step α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=PHiT(HiPHiT+R)1,Hi=h(ui).K_{i}=PH_{i}^{T}(H_{i}PH_{i}^{T}+R)^{-1},\quad\quad H_{i}=h^{\prime}(u_{i}).

  2. 2.

    Set

    ui+1=ui+α{Ki(yh(ui))+(IKiHi)(mui)},u_{i+1}=u_{i}+\alpha\Bigl{\{}K_{i}\big{(}y-h(u_{i})\big{)}+(I-K_{i}H_{i})(m-u_{i})\Bigr{\}}, (2.18)

    or, equivalently,

    ui+1=ui+αCi{HiTR1(yh(ui))+P1(mui)}.u_{i+1}=u_{i}+\alpha C_{i}\Bigl{\{}H_{i}^{T}R^{-1}(y-h(u_{i}))+P^{-1}(m-u_{i})\Bigr{\}}. (2.19)

Output: u1,u2,u_{1},u_{2},\ldots

The next proposition shows that in the linear case, if \alpha is set to 1, IExKF finds the minimizer of the objective (1.3) in one iteration, and further iterations remain at the minimizer.

Proposition 2.3.

Suppose that h(u)=Huh(u)=Hu is linear and α=1\alpha=1. Then the output of Algorithm 1 satisfies

ui=μ,i=1,2,u_{i}=\mu,\quad\quad i=1,2,\ldots

where μ\mu is the minimizer of the Tikhonov-Phillips objective (1.3).

Proof.

In the linear case we have

Hi=H,Ki=K=PHT(HPHT+R)1,i=0,1,H_{i}=H,\quad\quad K_{i}=K=PH^{T}(HPH^{T}+R)^{-1},\quad\quad i=0,1,\ldots

Therefore, update (2.18) simplifies as

ui+1=m+K(yHm),i=0,1,u_{i+1}=m+K(y-Hm),\quad i=0,1,\ldots

This implies that, for all i1,i\geq 1, it holds that ui=μu_{i}=\mu with μ\mu defined in Equation (2.5). ∎
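Proposition 2.3 is easy to illustrate numerically. The sketch below runs the IExKF update (2.18) with \alpha=1 on a toy linear problem of our own choosing and checks that every iterate equals the Tikhonov-Phillips minimizer \mu.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 4, 6                                       # illustrative dimensions
H = rng.standard_normal((k, d))
P = np.eye(d); R = 0.1 * np.eye(k)
m = rng.standard_normal(d)
y = H @ rng.standard_normal(d) + 0.1 * rng.standard_normal(k)

# Minimizer of the Tikhonov-Phillips objective (Lemma 2.1)
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
mu = m + K @ (y - H @ m)

# IExKF with h(u) = Hu and alpha = 1; here H_i = H and K_i = K at every iteration
u = m.copy()
iterates = []
for _ in range(3):
    u = u + (K @ (y - H @ u) + (np.eye(d) - K @ H) @ (m - u))   # update (2.18)
    iterates.append(u.copy())

assert all(np.allclose(ui, mu) for ui in iterates)
```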

Choice of Length-Step

When implementing Gauss-Newton methods, it is standard practice to perform a line search in the direction of v_{i} to adaptively choose the length-step \alpha_{i}. For instance, a common strategy is to guarantee that the Wolfe conditions are satisfied [16, 45]. In this paper we will instead simply set \alpha_{i}=\alpha for some fixed value of \alpha, and we will follow a similar approach for all the derivative-based and ensemble-based algorithms we consider. There are two main motivations for doing so. First, it is appealing from a practical viewpoint to avoid performing a line search for ensemble-based algorithms. Second, when \alpha is small, each derivative-based algorithm we consider can be interpreted as a discretization of an ODE system, while the ensemble-based methods arise as discretizations of SDE systems. These ODEs and SDEs allow us to compare and gain transparent understanding of the gradient structure of the algorithms. They will also allow us to propose some new variants of existing ensemble Kalman methods. We next describe the continuum limit structure of IExKF.

Continuum Limit

It is not hard to check that the term in brackets in the update (2.19)

HiTR1(yh(ui))+P1(mui)H_{i}^{T}R^{-1}(y-h(u_{i}))+P^{-1}(m-u_{i})

is the negative gradient of 𝖩TP(u){\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u), which reveals the following gradient flow structure in the limit of small length-step α:\alpha:

u˙=C(t){h(u(t))TR1(yh(u(t)))+P1(mu(t))}=C(t)𝖩TP(u(t)),\displaystyle\begin{split}\dot{u}&=C(t)\Bigl{\{}h^{\prime}\bigl{(}u(t)\bigr{)}^{T}R^{-1}\bigl{(}y-h\big{(}u(t)\big{)}\bigr{)}+P^{-1}\bigl{(}m-u(t)\bigr{)}\Bigr{\}}\\ &=-C(t){\mathsf{J}}_{\mbox{\tiny{\rm TP}}}^{\prime}\bigr{(}u(t)\bigl{)},\end{split} (2.20)

with preconditioner

C(t):=(h(u(t))TR1h(u(t))+P1)1.C(t):=\biggl{(}h^{\prime}\bigl{(}u(t)\bigr{)}^{T}R^{-1}h^{\prime}\bigl{(}u(t)\bigr{)}+P^{-1}\biggr{)}^{-1}.

We remark that in the linear case, C(t)\equiv C, where C is the posterior covariance given by (2.6), which agrees with the inverse of the Hessian of the Tikhonov-Phillips objective.

2.2 Levenberg-Marquardt Optimization of Data-Misfit Objective

In this subsection we introduce the Levenberg-Marquardt algorithm and describe its application to the data misfit objective 𝖩DM{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}. We recall that the Levenberg-Marquardt method applied to a general least-squares objective 𝖩(u)=12|r(u)|2{\mathsf{J}}(u)=\frac{1}{2}|r(u)|^{2} is a trust region method which, starting from an initialization u0,u_{0}, sets

ui+1=ui+vi,i=0,1,\displaystyle u_{i+1}=u_{i}+v_{i},\quad\quad i=0,1,\ldots

where

vi=argminv𝖩i(v),s.t.|v|P2δi,𝖩i(v):=12|r(ui)v+r(ui)|2.v_{i}=\arg\min_{v}{\mathsf{J}}_{i}^{\mbox{\tiny{$\ell$}}}(v),\quad\text{s.t.}\,\,|v|_{P}^{2}\leq\delta_{i},\quad\quad{\mathsf{J}}_{i}^{\mbox{\tiny{$\ell$}}}(v):=\frac{1}{2}|r^{\prime}(u_{i})v+r(u_{i})|^{2}.

Similar to Gauss-Newton methods, the increment viv_{i} is defined as the minimizer of a linearized objective, but now the minimization is constrained to a ball {|v|P2δi}\{|v|_{P}^{2}\leq\delta_{i}\} in which we trust that the objective can be replaced by its linearization. The increment can also be written as

vi=argminv𝖩iUC  (v),v_{i}=\arg\min_{v}{\mathsf{J}}_{i}^{\mbox{\mbox{\tiny{UC} } }}(v),

where

𝖩iUC  (v)=𝖩i(v)+12αi|v|P2.{\mathsf{J}}_{i}^{\mbox{\mbox{\tiny{UC} } }}(v)={\mathsf{J}}_{i}^{\mbox{\tiny{$\ell$}}}(v)+\frac{1}{2\alpha_{i}}|v|_{P}^{2}. (2.21)

The parameter αi>0\alpha_{i}>0 plays an analogous role to the length-step in Gauss-Newton methods. Note that the Levenberg-Marquardt increment is the unconstrained minimizer of a regularized objective. It is for this reason that we say that Levenberg-Marquardt provides an implicit regularization.

We next consider application of the Levenberg-Marquardt method to the data-misfit objective 𝖩DM,{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}, which we write in standard nonlinear least-squares form:

𝖩DM(u)=12|rDM(u)|2,rDM(u):=R1/2(yh(u)).\displaystyle{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}(u)=\frac{1}{2}|r_{\mbox{\tiny{\rm DM}}}(u)|^{2},\quad\quad r_{\mbox{\tiny{\rm DM}}}(u):=R^{-1/2}\bigl{(}y-h(u)\bigr{)}. (2.22)
Lemma 2.4.

The Levenberg-Marquardt method applied to the data misfit objective 𝖩DM{\mathsf{J}}_{\mbox{\tiny{\rm DM}}} admits the following characterization:

ui+1=ui+Ki{yh(ui)},u_{i+1}=u_{i}+K_{i}\Bigl{\{}y-h(u_{i})\Bigr{\}}, (2.23)

where

Ki=αiPHiT(αiHiPHiT+R)1,Hi=h(ui).K_{i}=\alpha_{i}PH_{i}^{T}(\alpha_{i}H_{i}PH_{i}^{T}+R)^{-1},\quad\quad H_{i}=h^{\prime}(u_{i}).
Proof.

Note that the increment viv_{i} is defined as the unconstrained minimizer of

𝖩DM,iUC  (v)=12|rDM(ui)v+rDM(ui)|2+12αi|v|P2=12|yh(ui)h(ui)v|R2+12αi|v|P2.\displaystyle\begin{split}{\mathsf{J}}_{\mbox{\tiny{\rm DM}},i}^{\mbox{\mbox{\tiny{UC} } }}(v)&=\frac{1}{2}|r_{\mbox{\tiny{\rm DM}}}^{\prime}(u_{i})v+r_{\mbox{\tiny{\rm DM}}}(u_{i})|^{2}+\frac{1}{2\alpha_{i}}|v|^{2}_{P}\\ &=\frac{1}{2}|y-h(u_{i})-h^{\prime}(u_{i})v|^{2}_{R}+\frac{1}{2\alpha_{i}}|v|^{2}_{P}.\end{split} (2.24)

The result follows from Lemma 2.1. ∎

Similar to the previous section, we will focus on implementations with constant length-step αi=α\alpha_{i}=\alpha, which leads to the following algorithm.

Algorithm 2 Iterative Levenberg-Marquardt with Data Misfit (ILM-DM)

Input: Initialization u0=mu_{0}=m, length-step α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=αPHiT(αHiPHiT+R)1,Hi=h(ui).K_{i}=\alpha PH_{i}^{T}(\alpha H_{i}PH_{i}^{T}+R)^{-1},\quad\quad H_{i}=h^{\prime}(u_{i}).

  2. 2.

    Set

    ui+1=ui+Ki{yh(ui)}.u_{i+1}=u_{i}+K_{i}\Bigl{\{}y-h(u_{i})\Bigr{\}}. (2.25)

Output: u1,u2,u_{1},u_{2},\ldots

When \alpha=1, the following linear-case result shows that ILM-DM reaches the minimizer of {\mathsf{J}}_{\mbox{\tiny{\rm TP}}} in one iteration. However, in contrast to IExKF, further iterations of ILM-DM will typically worsen the optimization of {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}, and start moving towards minimizers of {\mathsf{J}}_{\mbox{\tiny{\rm DM}}}.

Proposition 2.5.

Suppose that h(u)=Huh(u)=Hu is linear and α=1\alpha=1. Then the output of Algorithm 2 satisfies

u1=argminu𝖩TP(u),u_{1}=\arg\min_{u}{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u),

where 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}} is the Tikhonov-Phillips objective (1.3).

Proof.

The proof is identical to that of Proposition 2.3, noting that in the linear case ui+1=ui+K(yHui).u_{i+1}=u_{i}+K(y-Hu_{i}).

Example 2.6 (Convergence of ILM-DM with invertible observation map).

Suppose that H\in\mathbb{R}^{d\times d} is invertible and \alpha=1. Then, writing

u_{i+1}=(I-KH)u_{i}+Ky

and noting that \rho(I-KH)<1 [4], it follows that u_{i}\to u^{*}, where u^{*} is the unique solution to y=Hu. That is, the iterates of ILM-DM converge to the unique minimizer of the data misfit objective {\mathsf{J}}_{\mbox{\tiny{\rm DM}}}.
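The following sketch illustrates both Proposition 2.5 and Example 2.6 on a toy linear problem with an invertible observation map and \alpha=1: the first ILM-DM iterate coincides with the Tikhonov-Phillips minimizer, while later iterates drift towards the data-misfit minimizer H^{-1}y. All problem data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
H = rng.standard_normal((d, d)) + 3 * np.eye(d)    # invertible observation map
P = np.eye(d); R = 0.05 * np.eye(d)
m = np.zeros(d)
y = rng.standard_normal(d)

K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)       # Kalman gain with alpha = 1
mu_TP = m + K @ (y - H @ m)                        # Tikhonov-Phillips minimizer
u_star = np.linalg.solve(H, y)                     # data-misfit minimizer H^{-1} y

u = m.copy()
u = u + K @ (y - H @ u)                            # first ILM-DM iterate, update (2.25)
assert np.allclose(u, mu_TP)                       # Proposition 2.5

for _ in range(500):                               # further iterations move towards u*
    u = u + K @ (y - H @ u)
assert np.linalg.norm(u - u_star) < 1e-6           # Example 2.6
```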

Choice of Length-Step

When implementing Levenberg-Marquardt algorithms, the length-step parameter \alpha_{i} is often chosen adaptively, based on the objective. However, similarly to Section 2.1, we fix \alpha_{i}=\alpha to be a small value, which leads to an ODE continuum limit.

Continuum Limit

We notice that, in the limit of small length-step α\alpha, update (2.25) can be written as

ui+1=ui+αPHiTR1{yh(ui)}.u_{i+1}=u_{i}+\alpha PH_{i}^{T}R^{-1}\Bigl{\{}y-h(u_{i})\Bigr{\}}.

The term HiTR1{yh(ui)}H_{i}^{T}R^{-1}\Bigl{\{}y-h(u_{i})\Bigr{\}} is the negative gradient of 𝖩DM(u){\mathsf{J}}_{\mbox{\tiny{\rm DM}}}(u), which reveals the following gradient flow structure

u˙=Ph(u(t))TR1{yh(u(t))}=P𝖩DM(u),\displaystyle\begin{split}\dot{u}&=Ph^{\prime}\bigl{(}u(t)\bigr{)}^{T}R^{-1}\Bigl{\{}y-h\bigl{(}u(t)\bigr{)}\Bigr{\}}\\ &=-P{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}^{\prime}(u),\end{split} (2.26)

where the preconditioner PP is interpreted as the prior covariance in the Bayesian framework.

2.3 Levenberg-Marquardt Optimization of Tikhonov-Phillips Objective

In this subsection we describe the application of the Levenberg-Marquardt algorithm to the Tikhonov-Phillips objective 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}.

Lemma 2.7.

The Levenberg-Marquardt method applied to the Tikhonov-Phillips objective {\mathsf{J}}_{\mbox{\tiny{\rm TP}}} admits the following characterization:

ui+1=ui+Ki{zg(ui)},\displaystyle u_{i+1}=u_{i}+K_{i}\Bigl{\{}z-g(u_{i})\Bigr{\}},

where

Ki=αiPGiT(αiGiPGiT+Q)1,Gi=g(ui).K_{i}=\alpha_{i}PG_{i}^{T}(\alpha_{i}G_{i}PG_{i}^{T}+Q)^{-1},\quad\quad G_{i}=g^{\prime}(u_{i}).
Proof.

Note that the increment viv_{i} is defined as the unconstrained minimizer of

𝖩TP,iUC  (v)\displaystyle{\mathsf{J}}_{\mbox{\tiny{\rm TP}},i}^{\mbox{\mbox{\tiny{UC} } }}(v) =𝖩TP,i(v)+12αi|v|P2\displaystyle={\mathsf{J}}_{\mbox{\tiny{\rm TP,}}i}^{\ell}(v)+\frac{1}{2\alpha_{i}}|v|^{2}_{P} (2.27)
=12|zg(ui)g(ui)v|Q2+12αi|v|P2,\displaystyle=\frac{1}{2}|z-g(u_{i})-g^{\prime}(u_{i})v|_{Q}^{2}+\frac{1}{2\alpha_{i}}|v|^{2}_{P}, (2.28)

which has the same form as Equation (2.24), replacing y with z, h with g, and R with Q. ∎

Setting αi=α\alpha_{i}=\alpha leads to the following algorithm.

Algorithm 3 Iterative Levenberg-Marquardt with Tikhonov-Phillips (ILM-TP)

Input: Initialization u0=mu_{0}=m, length-step α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=αPGiT(αGiPGiT+Q)1,Gi=g(ui).K_{i}=\alpha PG_{i}^{T}(\alpha G_{i}PG_{i}^{T}+Q)^{-1},\quad\quad G_{i}=g^{\prime}(u_{i}).

  2. 2.

    Set

    ui+1=ui+Ki{zg(ui)}.u_{i+1}=u_{i}+K_{i}\Bigl{\{}z-g(u_{i})\Bigr{\}}. (2.29)

Output: u1,u2,u_{1},u_{2},\ldots
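Since ILM-TP is ILM-DM applied to the stacked data (z, g, Q), one update (2.29) can be written by reusing the augmented quantities of Section 2.1. The sketch below is a minimal illustration on a toy nonlinear map with an analytic Jacobian; the map, its Jacobian, the problem sizes and the length-step are our own illustrative choices.

```python
import numpy as np

d, k = 3, 2
alpha = 0.1
h  = lambda u: np.array([np.sin(u[0]) + u[1], u[1] * u[2]])    # toy forward map
dh = lambda u: np.array([[np.cos(u[0]), 1.0, 0.0],
                         [0.0, u[2], u[1]]])                    # its Jacobian h'(u)
m = np.zeros(d); P = np.eye(d)
y = np.array([0.3, -0.2]); R = 0.1 * np.eye(k)

# Augmented data, forward map and noise covariance from Section 2.1
z = np.concatenate([y, m])
Q = np.block([[R, np.zeros((k, d))], [np.zeros((d, k)), P]])

def ilm_tp_step(u):
    g = np.concatenate([h(u), u])
    G = np.vstack([dh(u), np.eye(d)])                           # G_i = g'(u_i)
    K = alpha * P @ G.T @ np.linalg.inv(alpha * G @ P @ G.T + Q)   # gain of Algorithm 3
    return u + K @ (z - g)                                      # update (2.29)

u = m.copy()
for _ in range(200):
    u = ilm_tp_step(u)
print("ILM-TP iterate after 200 steps:", u)
```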

Proposition 2.8.

Suppose that h(u)=Huh(u)=Hu is linear and α=1\alpha=1. The output of Algorithm 3 satisfies

u_{1}=\arg\min_{u}\Bigl\{{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u)+\frac{1}{2}|u-m|_{P}^{2}\Bigr\}. (2.30)
Proof.

The result is a corollary of Proposition 2.5. To see this, note that 𝖩TP(u)=12|zg(u)|Q2{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u)=\frac{1}{2}|z-g(u)|^{2}_{Q} can be viewed as a data-misfit objective with data z,z, forward model g(u)g(u) and observation matrix Q.Q. Then, ILM-TP can be interpreted as applying ILM-DM to 𝖩TP,{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}, and the objective 𝖩TP(u)+12|um|2{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u)+\frac{1}{2}|u-m|^{2} as its Tikhonov-Phillips regularization. ∎

Example 2.9 (Convergence of ILM-TP with linear invertible observation map.).

It is again instructive to consider the case where Hd×dH\in\mathbb{R}^{d\times d} is invertible. Then, following the same reasoning as in Example 2.6 we deduce that the iterates of ILM-TP converge to the unique minimizer of 𝖩TP.{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}.

Continuum Limits

In the limit of small length-step α\alpha, we can derive the gradient flow structure of update (2.29). This is similar to Section 2.2, with 𝖩DM{\mathsf{J}}_{\mbox{\tiny{\rm DM}}} replaced by 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}. Precisely, the gradient flow of ILM-TP is given by

u˙\displaystyle\dot{u} =Pg(u(t))TQ1{zg(u(t))}\displaystyle=Pg^{\prime}\bigl{(}u(t)\bigr{)}^{T}Q^{-1}\Bigl{\{}z-g\bigl{(}u(t)\bigr{)}\Bigr{\}}
=P𝖩TP(u).\displaystyle=-P{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}^{\prime}(u).

3 Ensemble-based Optimization for Nonlinear Least-Squares

In this section we review three subfamilies of iterative methods that update an ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N} employing Kalman-based formulae, where i=0,1,i=0,1,\ldots denotes the iteration index and NN is a fixed ensemble size. Each ensemble member ui(n)u_{i}^{(n)} is updated by optimizing a (random) objective 𝖩i(n){\mathsf{J}}_{i}^{(n)} defined using the current ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N} and/or the initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N}. The optimization is performed without evaluating derivatives by invoking a statistical linearization of a Gauss-Newton or Levenberg-Marquardt algorithm. In analogy with the previous section, the three subfamilies of ensemble methods we consider differ in the choice of the objective and in the choice of the optimization algorithm. Table 2 summarizes the derivative and ensemble methods considered in the previous and the current section.

Objective Optimization Derivative Method Ensemble Method
{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}   GN   IExKF   IEKF
{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}   LM   ILM-DM   EKI
{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}   LM   ILM-TP   TEKI

Table 2: Summary of the main algorithms in Sections 2 and 3.

Given an ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N} we use the following notation for ensemble empirical means

mi\displaystyle m_{i} =1Nn=1Nui(n),hi=1Nn=1Nh(ui(n)),\displaystyle=\frac{1}{N}\sum_{n=1}^{N}u_{i}^{(n)},\quad\quad h_{i}=\frac{1}{N}\sum_{n=1}^{N}h(u_{i}^{(n)}),

and empirical covariances

Piuu\displaystyle P_{i}^{uu} =1Nn=1N(ui(n)mi)(ui(n)mi)T,\displaystyle=\frac{1}{N}\sum_{n=1}^{N}(u_{i}^{(n)}-m_{i})(u_{i}^{(n)}-m_{i})^{T},\quad\quad
Piuy\displaystyle P_{i}^{uy} =1Nn=1N(ui(n)mi)(h(ui(n))hi)T,\displaystyle=\frac{1}{N}\sum_{n=1}^{N}\bigl{(}u_{i}^{(n)}-m_{i}\bigr{)}\bigl{(}h(u_{i}^{(n)})-h_{i}\bigr{)}^{T},
Piyy\displaystyle P_{i}^{yy} =1Nn=1N(h(ui(n))hi)(h(ui(n))hi)T.\displaystyle=\frac{1}{N}\sum_{n=1}^{N}\bigl{(}h(u_{i}^{(n)})-h_{i}\bigr{)}\bigl{(}h(u_{i}^{(n)})-h_{i}\bigr{)}^{T}.

Two overarching themes that underlie the derivation and analysis of the ensemble methods studied in this section and the following one are the use of a statistical linearization to avoid evaluation of derivatives, and the study of continuum limits. We next introduce these two ideas.

Statistical Linearization

If h(u)=Huh(u)=Hu is linear, we have

P_{i}^{uy}=P_{i}^{uu}H^{T},

which motivates the approximation in the general nonlinear case

h^{\prime}(u_{i}^{(n)})\approx(P_{i}^{uy})^{T}(P_{i}^{uu})^{-1}=:H_{i},\quad\quad n=1,\dots,N, (3.1)

where here and in what follows (P_{i}^{uu})^{-1} denotes the pseudoinverse of P_{i}^{uu}. Notice that (3.1) can be regarded as a linear least-squares fit of the pairs \{(u_{i}^{(n)},h(u_{i}^{(n)}))\}_{n=1}^{N} centered around their corresponding empirical means m_{i} and h_{i}. We remark that in order for the approximation in (3.1) to be accurate, the ensemble size N should not be much smaller than the input dimension d.
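A minimal sketch of the statistical linearization (3.1): H_i is computed from the centered ensemble pairs via a pseudoinverse and, as claimed above, it recovers the true matrix exactly when h is linear and the ensemble size exceeds the input dimension. The dimensions and the random linear map are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, N = 3, 2, 50
H_true = rng.standard_normal((k, d))
h = lambda u: H_true @ u                       # linear map, so (3.1) should be exact

U = rng.standard_normal((d, N))                # ensemble, one member per column
Y = np.apply_along_axis(h, 0, U)               # forward evaluations h(u^{(n)})

m_u = U.mean(axis=1, keepdims=True)
m_y = Y.mean(axis=1, keepdims=True)
P_uu = (U - m_u) @ (U - m_u).T / N
P_uy = (U - m_u) @ (Y - m_y).T / N             # cross-covariance, shape (d, k)

H_i = P_uy.T @ np.linalg.pinv(P_uu)            # statistical linearization (3.1)
assert np.allclose(H_i, H_true)
```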

Continuum Limit

We will gain theoretical understanding by studying continuum limits. Specifically, each algorithm includes a length-step parameter α>0,\alpha>0, and the evolution of the ensemble for small α\alpha can be interpreted as a discretization of an SDE system. We denote by {u(n)(t)}n=1N\{u^{(n)}(t)\}_{n=1}^{N} the sample paths of the underlying SDE. For each 1nN1\leq n\leq N, we have u0(n)=u(n)(0)u_{0}^{(n)}=u^{(n)}(0), and we view ui(n)u_{i}^{(n)} as an approximation of u(n)(t)u^{(n)}(t) for t=αit=\alpha i. Similarly as above, we define Puu(t),Puy(t),Pyy(t)P^{uu}(t),P^{uy}(t),P^{yy}(t) as the corresponding empirical covariances at time t0t\geq 0.

3.1 Ensemble Gauss-Newton Optimization of Tikhonov-Phillips Objective

3.1.1 Iterative Ensemble Kalman Filter

Given an ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N}, consider the following Gauss-Newton update for each nn:

ui+1(n)=ui(n)+αvi(n),u_{i+1}^{(n)}=u_{i}^{(n)}+\alpha v_{i}^{(n)}, (3.2)

where α>0\alpha>0 is the length-step, and vi(n)v_{i}^{(n)} is the minimizer of the following (linearized) Tikhonov-Phillips objective (cf. equation (2.17))

𝖩TP,i(n)(v)=12|yi(n)h(ui(n))Hiv|R2+12|u0(n)ui(n)v|P0uu2,yi(n)𝒩(y,α1R).{\mathsf{J}}_{\mbox{\tiny{\rm TP}},i}^{(n)}(v)=\frac{1}{2}\bigl{|}y_{i}^{(n)}-h(u_{i}^{(n)})-H_{i}v\bigr{|}^{2}_{R}+\frac{1}{2}\bigl{|}u_{0}^{(n)}-u_{i}^{(n)}-v\bigr{|}^{2}_{P_{0}^{uu}},\quad\quad y_{i}^{(n)}\sim\mathcal{N}(y,\alpha^{-1}R). (3.3)

Notice that we adopt the statistical linearization (3.1) in the above formulation. Applying Lemma 2.1, the minimizer vi(n)v_{i}^{(n)} can be calculated as

vi(n)=Ci{HiTR1(yi(n)h(ui(n)))+(P0uu)1(u0(n)ui(n))},v_{i}^{(n)}=C_{i}\Bigl{\{}H_{i}^{T}R^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}+(P_{0}^{uu})^{-1}\big{(}u_{0}^{(n)}-u_{i}^{(n)}\big{)}\Bigr{\}}, (3.4)

or, in an equivalent form,

vi(n)=Ki(yi(n)h(ui(n)))+(IKiHi)(u0(n)ui(n)),v_{i}^{(n)}=K_{i}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}+(I-K_{i}H_{i})(u_{0}^{(n)}-u_{i}^{(n)}), (3.5)

where

Ci\displaystyle C_{i} =(HiTR1Hi+(P0uu)1)1,\displaystyle=\big{(}H_{i}^{T}R^{-1}H_{i}+(P_{0}^{uu})^{-1}\big{)}^{-1},
Ki\displaystyle K_{i} =P0uuHiT(HiP0uuHiT+R)1.\displaystyle=P_{0}^{uu}H_{i}^{T}(H_{i}P_{0}^{uu}H_{i}^{T}+R)^{-1}.

Combining (3.2) and (3.5) leads to the Iterative Ensemble Kalman Filter (IEKF) algorithm.

Algorithm 4 Iterative Ensemble Kalman Filter (IEKF)

Input: Initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} sampled independently from the prior, length-step α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=P0uuHiT(HiP0uuHiT+R)1,Hi=(Piuy)T(Piuu)1.K_{i}=P_{0}^{uu}H_{i}^{T}(H_{i}P_{0}^{uu}H_{i}^{T}+R)^{-1},\quad\quad H_{i}=(P_{i}^{uy})^{T}(P_{i}^{uu})^{-1}.

  2. 2.

    Draw yi(n)𝒩(y,α1R)y_{i}^{(n)}\sim\mathcal{N}(y,\alpha^{-1}R) and set

    ui+1(n)=ui(n)+α{Ki(yi(n)h(ui(n)))+(IKiHi)(u0(n)ui(n))},1nN.u_{i+1}^{(n)}=u_{i}^{(n)}+\alpha\Bigl{\{}K_{i}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}+(I-K_{i}H_{i})\big{(}u_{0}^{(n)}-u_{i}^{(n)}\big{)}\Bigr{\}},\quad\quad 1\leq n\leq N. (3.6)

Output: Ensemble means m1,m2,m_{1},m_{2},\ldots
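For concreteness, here is a compact NumPy sketch of one IEKF sweep implementing update (3.6); the ensemble is stored as the columns of a d x N array and the quantities built from the initial ensemble are kept fixed, as in Algorithm 4. The toy forward map, problem sizes and length-step in the usage snippet are our own illustrative choices.

```python
import numpy as np

def empirical_cov(A, B):
    """Empirical cross-covariance of the columns of A (p x N) and B (q x N)."""
    Ac = A - A.mean(axis=1, keepdims=True)
    Bc = B - B.mean(axis=1, keepdims=True)
    return Ac @ Bc.T / A.shape[1]

def iekf_step(U, U0, P0, y, R, h, alpha, rng):
    """One IEKF update (3.6). U, U0: d x N current / initial ensembles."""
    d, N = U.shape
    HU = np.column_stack([h(U[:, n]) for n in range(N)])      # h(u_i^{(n)})
    P_uu = empirical_cov(U, U)
    P_uy = empirical_cov(U, HU)
    H = P_uy.T @ np.linalg.pinv(P_uu)                          # statistical linearization (3.1)
    K = P0 @ H.T @ np.linalg.inv(H @ P0 @ H.T + R)             # Kalman gain of Algorithm 4
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R / alpha, size=N).T
    return U + alpha * (K @ (Y - HU) + (np.eye(d) - K @ H) @ (U0 - U))

# Toy usage with a nonlinear map (illustrative problem, not from the paper)
rng = np.random.default_rng(5)
d, k, N = 2, 2, 100
h = lambda u: np.array([np.exp(u[0]) + u[1], u[1]])
y = np.array([2.0, 0.5]); R = 0.01 * np.eye(k)
U0 = rng.standard_normal((d, N))                               # prior N(0, I) samples
P0 = empirical_cov(U0, U0)
U = U0.copy()
for _ in range(50):
    U = iekf_step(U, U0, P0, y, R, h, alpha=0.1, rng=rng)
print("IEKF reconstruction (ensemble mean):", U.mean(axis=1))
```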

We highlight that IEKF is a natural ensemble-based version of the derivative-based IExKF Algorithm 1 with update (2.18). Algorithm 4 is a slight modification of the iterative ensemble Kalman algorithm proposed in [60]. The difference is that [60] sets H_{i}=(P_{i}^{uy})^{T}(P_{0}^{uu})^{-1} rather than H_{i}=(P_{i}^{uy})^{T}(P_{i}^{uu})^{-1}. Our modification guarantees that Algorithm 4 is well-balanced in the sense that if \alpha=1, u_{0}^{(n)}\sim\mathcal{N}(m,P) and h(u)=Hu is linear, then the output of Algorithm 4 satisfies that, as N\to\infty,

m_{i}\to\mu,\quad\quad i=1,2,\ldots

where \mu is the minimizer of {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u) given in Equation (1.3). This is analogous to Proposition 2.3 for IExKF. A detailed explanation is included in Section 3.1.2 below.

Other statistical linearizations and approximations of the Gauss-Newton scheme are possible. We next give a high-level description of the method proposed in [48], one of the earliest applications of iterative ensemble Kalman methods for inversion in the petroleum engineering literature. It starts from the alternative characterization (3.4) of the Gauss-Newton update but, instead of using a different preconditioner C_{i} for each step, [48] uses a fixed preconditioner C_{*}=P_{0}^{uu}-P_{0}^{uy}(R+P_{0}^{yy})^{-1}(P_{0}^{uy})^{T}. Note that C_{*} can be viewed as an approximation of C_{0}:

C_{*}\approx P_{0}^{uu}-P_{0}^{uu}H_{0}^{T}(R+H_{0}P_{0}^{uu}H_{0}^{T})^{-1}H_{0}P_{0}^{uu}=\big{(}H_{0}^{T}R^{-1}H_{0}+(P_{0}^{uu})^{-1}\big{)}^{-1}=C_{0}.

This leads to the following algorithm.

Algorithm 5 Iterative Ensemble Kalman Filter (IEKF-RZL)

Input: Initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} sampled independently from the prior, length-step α\alpha.

  1. 1.

    Set C=P0uuP0uy(R+P0yy)1(P0uy)T.C_{*}=P_{0}^{uu}-P_{0}^{uy}\bigl{(}R+P_{0}^{yy}\bigr{)}^{-1}(P_{0}^{uy})^{T}.

For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Hi=(Piuy)T(Piuu)1.H_{i}=(P_{i}^{uy})^{T}(P_{i}^{uu})^{-1}.

  2. 2.

    Draw yi(n)𝒩(y,α1R)y_{i}^{(n)}\sim\mathcal{N}(y,\alpha^{-1}R) and set

    ui+1(n)=ui(n)+αC{HiTR1(yi(n)h(ui(n)))+(P0uu)1(u0(n)ui(n))},1nN.u_{i+1}^{(n)}=u_{i}^{(n)}+\alpha\,C_{*}\Bigl{\{}H_{i}^{T}R^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}+(P_{0}^{uu})^{-1}\big{(}u_{0}^{(n)}-u_{i}^{(n)}\big{)}\Bigr{\}},\quad\quad 1\leq n\leq N.

Output: Ensemble means m1,m2,m_{1},m_{2},\ldots

We note that IEKF-RZL is a natural ensemble-based version of the derivative-based IExKF Algorithm 1 with update (2.19). We have empirically observed in a wide range of numerical experiments that Algorithm 4 is more stable than Algorithm 5, and we now give a heuristic argument for the advantage of Algorithm 4 in small noise regimes.

  • The update formula in Algorithm 4 is motivated by the large NN approximation

    P0uuHiT(HiP0uuHiT+R)1PHiT(HiPHiT+R)1,P_{0}^{uu}H_{i}^{T}(H_{i}P_{0}^{uu}H_{i}^{T}+R)^{-1}\approx PH_{i}^{T}(H_{i}PH_{i}^{T}+R)^{-1},

    which only requires that P0uuP_{0}^{uu} is a good approximation of PP.

  • The update formula in Algorithm 5 may be derived by invoking a large NN approximation of several terms. In particular, the error arising from the approximation

    CHiTR1C0HiTR1,C_{*}H_{i}^{T}R^{-1}\approx C_{0}H_{i}^{T}R^{-1},

    gets amplified when RR is small.

Empirical evidence of the instability of Algorithm 5 will be given in Section 5.1.

3.1.2 Analysis of IEKF

In the literature [48, 60], the length-step α\alpha is sometimes set to be 1. Here we state a simple observation about Algorithm 4 when α=1\alpha=1 and h(u)=Huh(u)=Hu is linear. We further assume that HiHH_{i}\equiv H for all ii. Then KiK0:=P0uuHT(HP0uuHT+R)1K_{i}\equiv K_{0}:=P_{0}^{uu}H^{T}(HP_{0}^{uu}H^{T}+R)^{-1}, and the update (3.6) can be simplified as

ui+1(n)=ui(n)+K0(yi(n)Hui(n))+(IK0H)(u0(n)ui(n))=u0(n)+K0(yi(n)Hu0(n)),\begin{split}u_{i+1}^{(n)}&=u_{i}^{(n)}+K_{0}(y_{i}^{(n)}-Hu_{i}^{(n)})+(I-K_{0}H)(u_{0}^{(n)}-u_{i}^{(n)})\\ &=u_{0}^{(n)}+K_{0}(y_{i}^{(n)}-Hu_{0}^{(n)}),\end{split}

where y_{i}^{(n)}\sim\mathcal{N}(y,R). If \{u_{0}^{(n)}\}_{n=1}^{N} are sampled independently from the prior \mathcal{N}(m,P) and we let N\rightarrow\infty, we have K_{0}\rightarrow K=PH^{T}(HPH^{T}+R)^{-1} and, by the law of large numbers, for any i\geq 1,

1Nn=1Nui(n)=(IK0H)1Nn=1Nu0(n)+K01Nn=1Nyi(n)(IKH)m+Ky,\begin{split}\frac{1}{N}\sum_{n=1}^{N}u_{i}^{(n)}&=(I-K_{0}H)\cdot\frac{1}{N}\sum_{n=1}^{N}u_{0}^{(n)}+K_{0}\cdot\frac{1}{N}\sum_{n=1}^{N}y_{i}^{(n)}\\ &\rightarrow(I-KH)m+Ky,\end{split}

which is the posterior mean (and mode) in the linear setting. In other words, in the large ensemble limit, the ensemble mean recovers the posterior mean after one iteration. However, while the choice \alpha=1 may be effective in low-dimensional (nonlinear) inverse problems, we do not recommend it when the dimensionality is high, as H_{i} might not be a good approximation of H. Thus, we introduce Algorithms 4 and 5 with a choice of length-step \alpha, resembling the derivative-based Gauss-Newton method discussed in Section 2.1. Further analysis of a new variant of Algorithm 4 will be conducted in a continuum limit setting in Section 4.
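This observation can be checked numerically. The sketch below uses a toy linear problem of our own choosing, a large but finite ensemble drawn from the prior, \alpha=1 and, as in the simplified analysis above, the exact H in place of H_i; after one IEKF iteration the ensemble mean is close to the posterior mean.

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, N = 3, 5, 200_000                           # illustrative sizes; N approximates the mean-field limit
H = rng.standard_normal((k, d))
m = rng.standard_normal(d); P = np.eye(d)
R = 0.1 * np.eye(k)
y = rng.standard_normal(k)

# Posterior mean in the linear-Gaussian case (Lemma 2.1)
K_exact = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
mu = m + K_exact @ (y - H @ m)

# One IEKF iteration with alpha = 1: u_1^{(n)} = u_0^{(n)} + K_0 (y^{(n)} - H u_0^{(n)})
U0 = m[:, None] + np.linalg.cholesky(P) @ rng.standard_normal((d, N))   # prior samples
P0 = np.cov(U0, bias=True)                                              # P_0^{uu}
K0 = P0 @ H.T @ np.linalg.inv(H @ P0 @ H.T + R)
Y = y[:, None] + np.linalg.cholesky(R) @ rng.standard_normal((k, N))    # perturbed observations
U1 = U0 + K0 @ (Y - H @ U0)

print(np.linalg.norm(U1.mean(axis=1) - mu))        # small for large N
```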

Remark 3.1.

Some remarks:

  1. 1.

    Although this will not be the focus of our paper, the IEKF Algorithm 4 (together with the EKI Algorithm 6 and TEKI Algorithm 7 to be discussed later) enjoys the ‘initial subspace property’ studied in previous works [31, 51, 13] by which, for any ii and any initialization of {u0(n)}n=1N,\{u_{0}^{(n)}\}_{n=1}^{N},

    span({ui(n)}n=1N)span({u0(n)}n=1N).\text{span}\bigl{(}\{u_{i}^{(n)}\}_{n=1}^{N}\bigr{)}\subset\text{span}\bigl{(}\{u_{0}^{(n)}\}_{n=1}^{N}\bigr{)}.

    This can be shown easily for the IEKF Algorithm 4 by expanding the P0uuP_{0}^{uu} term in KiK_{i} in the update formula (3.6).

  2. 2.

    Since we assume a Gaussian prior 𝒩(m,P)\mathcal{N}(m,P) on uu, a natural idea is to replace u0(n)u_{0}^{(n)} by mm and P0uuP_{0}^{uu} by PP in the update formula (3.6). We pursue this idea in Section 4.1, where we introduce a new variant of Algorithm 4 and analyze it in the continuum limit setting. While the initial subspace property breaks down, this new variant is numerically promising, as shown in Section 5.

3.2 Ensemble Levenberg-Marquardt Optimization of Data-Misfit Objective

3.2.1 Ensemble Kalman Inversion

Given an ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N}, consider the following Levenberg-Marquardt update for each nn:

ui+1(n)=ui(n)+vi(n),u_{i+1}^{(n)}=u_{i}^{(n)}+v_{i}^{(n)}, (3.7)

where vi(n)v_{i}^{(n)} is the minimizer of the following regularized (linearized) data-misfit objective (cf. equation (2.24))

𝖩DM,i(n),UC  (v)=12|yi(n)h(ui(n))Hiv|R2+12α|v|Piuu2,yi(n)𝒩(y,α1R),{\mathsf{J}}_{\mbox{\tiny{\rm DM}},i}^{(n),\mbox{\mbox{\tiny{UC} } }}(v)=\frac{1}{2}\bigl{|}y_{i}^{(n)}-h(u_{i}^{(n)})-H_{i}v\bigr{|}^{2}_{R}+\frac{1}{2\alpha}\bigl{|}v\bigr{|}^{2}_{P_{i}^{uu}},\quad\quad y_{i}^{(n)}\sim\mathcal{N}(y,\alpha^{-1}R), (3.8)

and α>0\alpha>0 will be regarded as a length-step. Notice that we adopt the statistical linearization (3.1) in the above formulation. Applying Lemma 2.1, we can calculate the minimizer vi(n)v_{i}^{(n)} explicitly:

vi(n)=(HiTR1Hi+α1(Piuu)1)1HiTR1(yi(n)h(ui(n))),v_{i}^{(n)}=(H_{i}^{T}R^{-1}H_{i}+\alpha^{-1}(P_{i}^{uu})^{-1})^{-1}H_{i}^{T}R^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}, (3.9)

or, in an equivalent form,

vi(n)=PiuuHiT(HiPiuuHiT+α1R)1(yi(n)h(ui(n))).v_{i}^{(n)}=P_{i}^{uu}H_{i}^{T}(H_{i}P_{i}^{uu}H_{i}^{T}+\alpha^{-1}R)^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}. (3.10)

We combine (3.7) and (3.10), substitute PiuuHiT=PiuyP_{i}^{uu}H_{i}^{T}=P_{i}^{uy}, and make another level of approximation HiPiuyPiyyH_{i}P_{i}^{uy}\approx P_{i}^{yy}. This leads to the Ensemble Kalman Inversion (EKI) method [31].

Algorithm 6 Ensemble Kalman Inversion (EKI)

Input: Initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N}, length-step α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=Piuy(Piyy+α1R)1.K_{i}=P_{i}^{uy}(P_{i}^{yy}+\alpha^{-1}R)^{-1}.

  2. 2.

    Draw yi(n)𝒩(y,α1R)y_{i}^{(n)}\sim\mathcal{N}(y,\alpha^{-1}R) and set

    ui+1(n)=ui(n)+Ki{yi(n)h(ui(n))},1nN.u_{i+1}^{(n)}=u_{i}^{(n)}+K_{i}\Bigl{\{}y_{i}^{(n)}-h(u_{i}^{(n)})\Bigr{\}},\quad\quad 1\leq n\leq N. (3.11)

Output: Ensemble means m1,m2,m_{1},m_{2},\ldots
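A minimal sketch of one EKI update (3.11), with the ensemble again stored as columns; only black-box evaluations of h are required. The toy forward map, noise level and length-step in the usage snippet are illustrative choices, not taken from the paper.

```python
import numpy as np

def eki_step(U, y, R, h, alpha, rng):
    """One EKI update (3.11). U: d x N ensemble; h: R^d -> R^k black-box map."""
    d, N = U.shape
    HU = np.column_stack([h(U[:, n]) for n in range(N)])
    mu_u = U.mean(axis=1, keepdims=True)
    mu_h = HU.mean(axis=1, keepdims=True)
    P_uy = (U - mu_u) @ (HU - mu_h).T / N
    P_yy = (HU - mu_h) @ (HU - mu_h).T / N
    K = P_uy @ np.linalg.inv(P_yy + R / alpha)                 # K_i of Algorithm 6
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R / alpha, size=N).T
    return U + K @ (Y - HU)                                    # perturbed-observation update

# Toy usage (illustrative problem, not from the paper)
rng = np.random.default_rng(7)
d, k, N = 2, 3, 200
h = lambda u: np.array([np.exp(u[0]), u[0] * u[1], u[1]])
u_true = np.array([0.4, -0.7])
y = h(u_true) + 0.01 * rng.standard_normal(k)
R = 1e-4 * np.eye(k)
U = rng.standard_normal((d, N))
for _ in range(30):
    U = eki_step(U, y, R, h, alpha=0.5, rng=rng)
print("EKI reconstruction:", U.mean(axis=1), "truth:", u_true)
```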

We note that EKI is a natural ensemble-based version of the derivative-based ILM-DM Algorithm 2. However, an important difference is that the Kalman gain in ILM-DM only uses the iterates to update H_{i}, while P is kept fixed. In contrast, the ensemble is used in EKI to update both P_{i}^{uy} and P_{i}^{yy}.

3.2.2 Analysis of EKI

In view of the definition of KiK_{i}, we can rewrite the update (3.11) as a time-stepping scheme:

ui+1(n)=ui(n)+αPiuy(αPiyy+R)1(y+α1/2R1/2ξi(n)h(ui(n)))=ui(n)+αPiuy(αPiyy+R)1(yh(ui(n)))+α1/2Piuy(αPiyy+R)1R1/2ξi(n),\begin{split}u_{i+1}^{(n)}&=u_{i}^{(n)}+\alpha P_{i}^{uy}(\alpha P_{i}^{yy}+R)^{-1}\bigl{(}y+\alpha^{-1/2}R^{1/2}\xi_{i}^{(n)}-h(u_{i}^{(n)})\bigr{)}\\ &=u_{i}^{(n)}+\alpha P_{i}^{uy}(\alpha P_{i}^{yy}+R)^{-1}\bigl{(}y-h(u_{i}^{(n)})\bigr{)}+\alpha^{1/2}P_{i}^{uy}(\alpha P_{i}^{yy}+R)^{-1}R^{1/2}\xi_{i}^{(n)},\end{split} (3.12)

where ξi(n)𝒩(0,I)\xi_{i}^{(n)}\sim\mathcal{N}(0,I) are independent. Taking the limit α0\alpha\rightarrow 0, we interpret (3.12) as a discretization of the SDE system

du(n)=Puy(t)R1(yh(u(n)))dt+Puy(t)R1/2dW(n).\mathop{}\!\mathrm{d}u^{(n)}=P^{uy}(t)R^{-1}\bigl{(}y-h(u^{(n)})\bigr{)}\mathop{}\!\mathrm{d}t+P^{uy}(t)R^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}. (3.13)

If h(u)=Huh(u)=Hu is linear, the SDE system (3.13) turns into

du(n)=Puu(t)HTR1(yHu(n))dt+Puu(t)HTR1/2dW(n).\mathop{}\!\mathrm{d}u^{(n)}=P^{uu}(t)H^{T}R^{-1}\big{(}y-Hu^{(n)}\big{)}\mathop{}\!\mathrm{d}t+P^{uu}(t)H^{T}R^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}. (3.14)
Proposition 3.2.

For the SDE system (3.13), assume h(u)=Huh(u)=Hu is linear, and suppose that the initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is drawn independently from a continuous distribution with finite second moments. Then, in the large ensemble size limit NN\rightarrow\infty (mean-field), the distribution of u(n)(t)u^{(n)}(t) has mean 𝔪(t)\mathfrak{m}(t) and covariance (t)\mathfrak{C}(t), which satisfy

d𝔪(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{m}(t)}{\mathop{}\!\mathrm{d}t} =(t)HTR1(yH𝔪(t)),\displaystyle=\mathfrak{C}(t)H^{T}R^{-1}\bigl{(}y-H\mathfrak{m}(t)\bigr{)}, (3.15)
d(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t} =(t)HTR1H(t).\displaystyle=-\mathfrak{C}(t)H^{T}R^{-1}H\mathfrak{C}(t). (3.16)

Furthermore, the solution can be computed analytically:

𝔪(t)\displaystyle\mathfrak{m}(t) =((0)1+tHTR1H)1((0)1𝔪(0)+tHTR1y),\displaystyle=\Big{(}\mathfrak{C}(0)^{-1}+tH^{T}R^{-1}H\Big{)}^{-1}\big{(}\mathfrak{C}(0)^{-1}\mathfrak{m}(0)+tH^{T}R^{-1}y\big{)}, (3.17)
(t)\displaystyle\mathfrak{C}(t) =((0)1+tHTR1H)1.\displaystyle=\Big{(}\mathfrak{C}(0)^{-1}+tH^{T}R^{-1}H\Big{)}^{-1}. (3.18)

In particular, if H\in\mathbb{R}^{k\times d} has full column rank (i.e., d\leq k and \mathrm{rank}(H)=d), then, as t\rightarrow\infty,

𝔪(t)\displaystyle\mathfrak{m}(t) (HTR1H)1(HTR1y),\displaystyle\rightarrow(H^{T}R^{-1}H)^{-1}(H^{T}R^{-1}y), (3.19)
(t)\displaystyle\mathfrak{C}(t) 0.\displaystyle\rightarrow 0. (3.20)

If H\in\mathbb{R}^{k\times d} has full row rank (i.e., d\geq k and \mathrm{rank}(H)=k), then, as t\rightarrow\infty,

H𝔪(t)\displaystyle H\mathfrak{m}(t) y,\displaystyle\rightarrow y, (3.21)
H(t)HT\displaystyle H\mathfrak{C}(t)H^{T} 0.\displaystyle\rightarrow 0. (3.22)
Proof.

The proof technique is similar to [23]. We will use that

𝔪(t)\displaystyle\mathfrak{m}(t) =limN𝔼[u(n)(t)],\displaystyle=\lim_{N\to\infty}\operatorname{\mathbb{E}}\bigl{[}u^{(n)}(t)\bigr{]},
(t)\displaystyle\mathfrak{C}(t) =limN𝔼[e(n)(t)e(n)(t)],\displaystyle=\lim_{N\to\infty}\operatorname{\mathbb{E}}\bigl{[}e^{(n)}(t)\otimes e^{(n)}(t)\bigr{]},

where e(n)(t):=u(n)(t)𝔪(t).e^{(n)}(t):=u^{(n)}(t)-\mathfrak{m}(t). First, note that (3.15) follows directly from (3.14) using that in the mean field limit Puu(t)P^{uu}(t) can be replaced by (t)\mathfrak{C}(t). To obtain the evolution of (t)\mathfrak{C}(t), note that

d(t)=limN𝔼[de(n)e(n)+e(n)de(n)+de(n)de(n)],\mathop{}\!\mathrm{d}\mathfrak{C}(t)=\lim_{N\to\infty}\operatorname{\mathbb{E}}\Bigl{[}\mathop{}\!\mathrm{d}e^{(n)}\otimes e^{(n)}+e^{(n)}\otimes\mathop{}\!\mathrm{d}e^{(n)}+\mathop{}\!\mathrm{d}e^{(n)}\otimes\mathop{}\!\mathrm{d}e^{(n)}\Bigr{]},

where the last term accounts for the Itô correction. This simplifies as

d(t)dt=limN𝔼[(t)HTR1H(e(n)e(n))(e(n)e(n))HTR1H(t)+(t)HTR1H(t)]=(t)HTR1H(t),\begin{split}\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t}&=\lim_{N\to\infty}\operatorname{\mathbb{E}}\Bigl{[}-\mathfrak{C}(t)H^{T}R^{-1}H(e^{(n)}\otimes e^{(n)})-(e^{(n)}\otimes e^{(n)})H^{T}R^{-1}H\mathfrak{C}(t)+\mathfrak{C}(t)H^{T}R^{-1}H\mathfrak{C}(t)\Bigr{]}\\ &=-\mathfrak{C}(t)H^{T}R^{-1}H\mathfrak{C}(t),\end{split}

which gives Equation (3.16). To derive exact formulas for 𝔪(t)\mathfrak{m}(t) and (t)\mathfrak{C}(t), we notice that

d(t)1dt=(t)1d(t)dt(t)1=HTR1H,\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)^{-1}}{\mathop{}\!\mathrm{d}t}=-\mathfrak{C}(t)^{-1}\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t}\mathfrak{C}(t)^{-1}=H^{T}R^{-1}H,

and

d((t)1𝔪(t))dt=d(t)1dt𝔪(t)+(t)1d𝔪(t)dt=HTR1y.\frac{\mathop{}\!\mathrm{d}\big{(}\mathfrak{C}(t)^{-1}\mathfrak{m}(t)\big{)}}{\mathop{}\!\mathrm{d}t}=\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)^{-1}}{\mathop{}\!\mathrm{d}t}\mathfrak{m}(t)+\mathfrak{C}(t)^{-1}\frac{\mathop{}\!\mathrm{d}\mathfrak{m}(t)}{\mathop{}\!\mathrm{d}t}=H^{T}R^{-1}y.

Then (3.17) and (3.18) follow easily.

If HH has full column rank, then HTR1HH^{T}R^{-1}H is invertible and therefore, as tt\rightarrow\infty,

(t)=t1(t1(0)1+HTR1H)10\mathfrak{C}(t)=t^{-1}\big{(}t^{-1}\mathfrak{C}(0)^{-1}+H^{T}R^{-1}H\big{)}^{-1}\rightarrow 0

by continuity of the matrix inverse function. The limit of 𝔪(t)\mathfrak{m}(t), (3.19), follows immediately from (3.17).

If HH has full row rank, we make the following substitutions

𝔪~(t)=R1/2H𝔪(t),~(t)=R1/2H(t)HTR1/2.\widetilde{\mathfrak{m}}(t)=R^{-1/2}H\mathfrak{m}(t),\quad\quad\quad\widetilde{\mathfrak{C}}(t)=R^{-1/2}H\mathfrak{C}(t)H^{T}R^{-1/2}.

Then (3.15) and (3.16) can be transformed into

d𝔪~(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\widetilde{\mathfrak{m}}(t)}{\mathop{}\!\mathrm{d}t} =~(t)(R1/2y𝔪~(t)),\displaystyle=\widetilde{\mathfrak{C}}(t)\bigl{(}R^{-1/2}y-\widetilde{\mathfrak{m}}(t)\bigr{)}, (3.23)
d~(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\widetilde{\mathfrak{C}}(t)}{\mathop{}\!\mathrm{d}t} =~(t)2.\displaystyle=-\widetilde{\mathfrak{C}}(t)^{2}. (3.24)

Using the fact that ~(0)=R1/2H(0)HTR1/2\widetilde{\mathfrak{C}}(0)=R^{-1/2}H\mathfrak{C}(0)H^{T}R^{-1/2} is invertible, we can solve these using the same technique as in the previous case:

~(t)=(~(0)1+t)1,𝔪~(t)=(~(0)1+t)1(~(0)1𝔪~(0)+tR1/2y).\widetilde{\mathfrak{C}}(t)=\big{(}\widetilde{\mathfrak{C}}(0)^{-1}+t\big{)}^{-1},\quad\quad\quad\widetilde{\mathfrak{m}}(t)=\big{(}\widetilde{\mathfrak{C}}(0)^{-1}+t\big{)}^{-1}\bigl{(}\widetilde{\mathfrak{C}}(0)^{-1}\widetilde{\mathfrak{m}}(0)+tR^{-1/2}y\bigr{)}.

As tt\rightarrow\infty, we have 𝔪~(t)R1/2y\widetilde{\mathfrak{m}}(t)\rightarrow R^{-1/2}y and ~(t)0\widetilde{\mathfrak{C}}(t)\rightarrow 0, which lead to (3.21) and (3.22). ∎
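The closed-form expressions (3.17)-(3.18) can be checked against a direct numerical integration of the mean-field equations (3.15)-(3.16). The sketch below does so with a simple explicit Euler scheme on a toy linear problem; all problem data and the step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(8)
d, k = 3, 4
H = rng.standard_normal((k, d))
R = 0.2 * np.eye(k)
y = rng.standard_normal(k)
m0 = rng.standard_normal(d)
C0 = np.eye(d)

# Explicit Euler integration of the mean-field ODEs (3.15)-(3.16)
T, dt = 2.0, 1e-4
m, C = m0.copy(), C0.copy()
Rinv = np.linalg.inv(R)
for _ in range(int(T / dt)):
    m = m + dt * C @ H.T @ Rinv @ (y - H @ m)
    C = C - dt * C @ H.T @ Rinv @ H @ C

# Closed-form solution (3.17)-(3.18) at time T
C_exact = np.linalg.inv(np.linalg.inv(C0) + T * H.T @ Rinv @ H)
m_exact = C_exact @ (np.linalg.inv(C0) @ m0 + T * H.T @ Rinv @ y)

print(np.linalg.norm(C - C_exact), np.linalg.norm(m - m_exact))   # both small
```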

Remark 3.3.

Some remarks:

  1. 1.

    In fact, (3.22) always holds, without rank constraints on H. The proof follows a similar idea to that in [51]. This can be seen from Equation (3.24): performing the eigenvalue decomposition \widetilde{\mathfrak{C}}(0)=X\Lambda(0)X^{T}, one can show that \widetilde{\mathfrak{C}}(t)=X\Lambda(t)X^{T}, where \Lambda(t) is diagonal and \Lambda(t)\rightarrow 0 as t\rightarrow\infty. The statement that H\mathfrak{C}(t)H^{T}\rightarrow 0 (or \mathfrak{C}(t)\rightarrow 0 if H has full column rank) is referred to as ‘ensemble collapse’. This can be interpreted as saying that the images of all the particles under H collapse to a single point as time evolves. Our numerical results in Section 5 show that the ensemble collapse phenomenon of Algorithm 6 is also observed in a variety of nonlinear examples and outside the mean-field limit, with moderate ensemble size N. These empirical results justify the practical significance of the linear continuum analysis in Proposition 3.2.

  2. 2.

    Under the same setting of Proposition 3.2, if we further require that the initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is drawn independently from the prior distribution 𝒩(m,P)\mathcal{N}(m,P), then in the mean-field limit NN\rightarrow\infty we have 𝔪(0)=m\mathfrak{m}(0)=m and (0)=P\mathfrak{C}(0)=P, leading to

    𝔪(1)=(P1+HTR1H)1(P1m+HTR1y),(1)=(P1+HTR1H)1,\mathfrak{m}(1)=(P^{-1}+H^{T}R^{-1}H)^{-1}(P^{-1}m+H^{T}R^{-1}y),\quad\quad\mathfrak{C}(1)=(P^{-1}+H^{T}R^{-1}H)^{-1},

    which are the true posterior mean and covariance, respectively. However, we have observed in a variety of numerical examples (not reported here) that in nonlinear problems it is often necessary to run EKI up to times larger than 1 to obtain an adequate approximation of the posterior mean and covariance. Providing a suitable stopping criterion for EKI is a topic of current research [52, 32] beyond the scope of our work.

3.3 Ensemble Levenberg-Marquardt Optimization of Tikhonov-Phillips Objective

3.3.1 Tikhonov Ensemble Kalman Inversion

Recall that we define

z:=[ym],g(u):=[h(u)u],Q:=[R00P].z:=\begin{bmatrix}y\\ m\end{bmatrix},\quad\quad\quad g(u):=\begin{bmatrix}h(u)\\ u\end{bmatrix},\quad\quad\quad Q:=\begin{bmatrix}R&0\\ 0&P\end{bmatrix}.

Then, given an ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N}, we can define

gi=1Nn=1Ng(ui(n)),g_{i}=\frac{1}{N}\sum_{n=1}^{N}g(u_{i}^{(n)}),

and empirical covariances

Pizz\displaystyle P_{i}^{zz} =1Nn=1N(g(ui(n))gi)(g(ui(n))gi)T,\displaystyle=\frac{1}{N}\sum_{n=1}^{N}\bigl{(}g(u_{i}^{(n)})-g_{i}\bigr{)}\bigl{(}g(u_{i}^{(n)})-g_{i}\bigr{)}^{T},
Piuz\displaystyle P_{i}^{uz} =1Nn=1N(ui(n)mi)(g(ui(n))gi)T.\displaystyle=\frac{1}{N}\sum_{n=1}^{N}\bigl{(}u_{i}^{(n)}-m_{i}\bigr{)}\bigl{(}g(u_{i}^{(n)})-g_{i}\bigr{)}^{T}.

Furthermore, we define the statistical linearization GiG_{i}:

g(ui(n))(Piuz)T(Piuu)1=:Gi.g^{\prime}(u_{i}^{(n)})\approx(P_{i}^{uz})^{T}(P_{i}^{uu})^{-1}=:G_{i}. (3.25)

It is not hard to check that

Gi=[HiI],G_{i}=\begin{bmatrix}H_{i}\\ I\end{bmatrix},

with HiH_{i} defined in (3.1).
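To fix ideas, the following minimal NumPy sketch (ours; the array names and the use of a pseudo-inverse are our assumptions) computes the statistical linearization (3.25) from an ensemble and its forward evaluations.

import numpy as np

def statistical_linearization(ens, g_ens):
    """Compute G_i = (P_i^{uz})^T (P_i^{uu})^{-1} from an ensemble.

    ens   : array of shape (N, d), ensemble members u_i^{(n)}
    g_ens : array of shape (N, k + d), evaluations g(u_i^{(n)})
    """
    N = ens.shape[0]
    du = ens - ens.mean(axis=0)          # centered ensemble, shape (N, d)
    dg = g_ens - g_ens.mean(axis=0)      # centered evaluations, shape (N, k + d)
    Puu = du.T @ du / N                  # empirical covariance P_i^{uu}
    Puz = du.T @ dg / N                  # empirical cross-covariance P_i^{uz}
    return Puz.T @ np.linalg.pinv(Puu)   # G_i; pinv guards against rank deficiency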

Given an ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N}, consider the following Levenberg-Marquardt update for each nn:

ui+1(n)=ui(n)+vi(n),u_{i+1}^{(n)}=u_{i}^{(n)}+v_{i}^{(n)},

where vi(n)v_{i}^{(n)} is the minimizer of the following regularized (linearized) Tikhonov-Phillips objective (cf. equation (2.28))

𝖩TP,i(n),UC  (v)=12|zi(n)g(ui(n))Giv|Q2+12α|v|Piuu2,zi(n)𝒩(z,α1Q),{\mathsf{J}}_{\mbox{\tiny{\rm TP}},i}^{(n),\mbox{\mbox{\tiny{UC} } }}(v)=\frac{1}{2}\bigl{|}z_{i}^{(n)}-g(u_{i}^{(n)})-G_{i}v\bigr{|}^{2}_{Q}+\frac{1}{2\alpha}\bigl{|}v\bigr{|}^{2}_{P_{i}^{uu}},\quad\quad z_{i}^{(n)}\sim\mathcal{N}(z,\alpha^{-1}Q), (3.26)

and α>0\alpha>0 will be regarded as a length-step. We can calculate the minimizer vi(n)v_{i}^{(n)} explicitly, applying Lemma 2.1:

vi(n)=(GiTQ1Gi+α1(Piuu)1)1GiTQ1(zi(n)g(ui(n))),v_{i}^{(n)}=(G_{i}^{T}Q^{-1}G_{i}+\alpha^{-1}(P_{i}^{uu})^{-1})^{-1}G_{i}^{T}Q^{-1}\big{(}z_{i}^{(n)}-g(u_{i}^{(n)})\big{)}, (3.27)

or, in an equivalent form,

vi(n)=PiuuGiT(GiPiuuGiT+α1Q)1(zi(n)g(ui(n))).v_{i}^{(n)}=P_{i}^{uu}G_{i}^{T}(G_{i}P_{i}^{uu}G_{i}^{T}+\alpha^{-1}Q)^{-1}\big{(}z_{i}^{(n)}-g(u_{i}^{(n)})\big{)}. (3.28)

Similar to EKI, in Equation (3.28) we substitute PiuuGiT=PiuzP_{i}^{uu}G_{i}^{T}=P_{i}^{uz}, and make the approximation GiPiuzPizzG_{i}P_{i}^{uz}\approx P_{i}^{zz}. This leads to Tikhonov Ensemble Kalman Inversion (TEKI), described in Algorithm 7.

Algorithm 7 Tikhonov Ensemble Kalman Inversion (TEKI)

Input: Initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N}, length-step α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=Piuz(Pizz+α1Q)1.K_{i}=P_{i}^{uz}(P_{i}^{zz}+\alpha^{-1}Q)^{-1}.

  2. 2.

    Draw zi(n)𝒩(z,α1Q)z_{i}^{(n)}\sim\mathcal{N}(z,\alpha^{-1}Q) and set

    ui+1(n)=ui(n)+Ki{zi(n)g(ui(n))},1nN.u_{i+1}^{(n)}=u_{i}^{(n)}+K_{i}\Bigl{\{}z_{i}^{(n)}-g(u_{i}^{(n)})\Bigr{\}},\quad\quad 1\leq n\leq N. (3.29)

Output: Ensemble means m1,m2,m_{1},m_{2},\ldots

We note that TEKI is a natural ensemble-based version of the derivative-based ILM-TP Algorithm 3. However, the Kalman gain in ILM-TP keeps PP fixed, while in TEKI PiuzP_{i}^{uz} and PizzP_{i}^{zz} are updated using the ensemble.
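For concreteness, a minimal NumPy sketch of one TEKI iteration (Algorithm 7) could read as follows; the forward map h, the data y, the prior mean m, the prior covariance P and the noise covariance R are assumed given, and all variable names are ours.

import numpy as np

def teki_step(ens, h, y, m, R, P, alpha, rng):
    """One TEKI update (3.29); ens has shape (N, d)."""
    N, d = ens.shape
    k = len(y)
    z = np.concatenate([y, m])                              # augmented data z = [y; m]
    Q = np.block([[R, np.zeros((k, d))],
                  [np.zeros((d, k)), P]])                   # augmented covariance Q
    g_ens = np.array([np.concatenate([h(u), u]) for u in ens])   # g(u) = [h(u); u]
    du = ens - ens.mean(axis=0)
    dg = g_ens - g_ens.mean(axis=0)
    Puz = du.T @ dg / N
    Pzz = dg.T @ dg / N
    K = Puz @ np.linalg.inv(Pzz + Q / alpha)                # Kalman gain K_i
    zs = rng.multivariate_normal(z, Q / alpha, size=N)      # perturbed data z_i^{(n)}
    return ens + (zs - g_ens) @ K.T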

3.3.2 Analysis of TEKI

When α\alpha is small, we can rewrite the update (3.29) as a time-stepping scheme:

ui+1(n)=ui(n)+αPiuz(αPizz+Q)1(z+α1/2Q1/2ξi(n)g(ui(n)))=ui(n)+αPiuz(αPizz+Q)1(zg(ui(n)))+α1/2Piuz(αPizz+Q)1Q1/2ξi(n),\begin{split}u_{i+1}^{(n)}&=u_{i}^{(n)}+\alpha P_{i}^{uz}(\alpha P_{i}^{zz}+Q)^{-1}\bigl{(}z+\alpha^{-1/2}Q^{1/2}\xi_{i}^{(n)}-g(u_{i}^{(n)})\bigr{)}\\ &=u_{i}^{(n)}+\alpha P_{i}^{uz}(\alpha P_{i}^{zz}+Q)^{-1}\bigl{(}z-g(u_{i}^{(n)})\bigr{)}+\alpha^{1/2}P_{i}^{uz}(\alpha P_{i}^{zz}+Q)^{-1}Q^{1/2}\xi_{i}^{(n)},\end{split} (3.30)

where ξi(n)𝒩(0,Id+k)\xi_{i}^{(n)}\sim\mathcal{N}(0,I_{d+k}) are independent. Taking the limit α0\alpha\rightarrow 0, we can interpret (3.30) as a discretization of the SDE system

du(n)=Puz(t)Q1(zg(u(n)))dt+Puz(t)Q1/2dW(n).\mathop{}\!\mathrm{d}u^{(n)}=P^{uz}(t)Q^{-1}\bigl{(}z-g(u^{(n)})\bigr{)}\mathop{}\!\mathrm{d}t+P^{uz}(t)Q^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}. (3.31)

If h(u)=Huh(u)=Hu is linear, g(u)=Gug(u)=Gu is also linear and the SDE system (3.31) can be rewritten as

du(n)=Puz(t)Q1(zGu(n))dt+Puz(t)Q1/2dW(n)=Puu(t)GTQ1(zGu(n))dt+Puu(t)GTQ1/2dW(n).\begin{split}\mathop{}\!\mathrm{d}u^{(n)}&=P^{uz}(t)Q^{-1}\big{(}z-Gu^{(n)}\big{)}\mathop{}\!\mathrm{d}t+P^{uz}(t)Q^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}\\ &=P^{uu}(t)G^{T}Q^{-1}\big{(}z-Gu^{(n)}\big{)}\mathop{}\!\mathrm{d}t+P^{uu}(t)G^{T}Q^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}.\end{split} (3.32)
Proposition 3.4.

For the SDE system (3.31), assume h(u)=Huh(u)=Hu is linear, and that the initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is made of independent samples from a distribution with finite second moments. Then, in the large ensemble limit (mean-field), the distribution of u(n)(t)u^{(n)}(t) has mean 𝔪(t)\mathfrak{m}(t) and covariance (t)\mathfrak{C}(t), which satisfy:

d𝔪(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{m}(t)}{\mathop{}\!\mathrm{d}t} =(t)GTQ1(zG𝔪(t)),\displaystyle=\mathfrak{C}(t)G^{T}Q^{-1}(z-G\mathfrak{m}(t)), (3.33)
d(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t} =(t)GTQ1G(t).\displaystyle=-\mathfrak{C}(t)G^{T}Q^{-1}G\mathfrak{C}(t). (3.34)

Furthermore, the solution can be computed analytically:

𝔪(t)\displaystyle\mathfrak{m}(t) =((0)1+tGTQ1G)1((0)1𝔪(0)+tGTQ1z),\displaystyle=\Big{(}\mathfrak{C}(0)^{-1}+tG^{T}Q^{-1}G\Big{)}^{-1}\big{(}\mathfrak{C}(0)^{-1}\mathfrak{m}(0)+tG^{T}Q^{-1}z\big{)}, (3.35)
(t)\displaystyle\mathfrak{C}(t) =((0)1+tGTQ1G)1.\displaystyle=\Big{(}\mathfrak{C}(0)^{-1}+tG^{T}Q^{-1}G\Big{)}^{-1}. (3.36)

In particular, as tt\rightarrow\infty,

\mathfrak{m}(t)\rightarrow(G^{T}Q^{-1}G)^{-1}(G^{T}Q^{-1}z)=(H^{T}R^{-1}H+P^{-1})^{-1}(H^{T}R^{-1}y+P^{-1}m), (3.37)
\mathfrak{C}(t)\rightarrow 0. (3.38)

Notice that 𝔪(t)\mathfrak{m}(t) converges to the true posterior mean.

Proof.

Equations (3.33)-(3.36) can be derived as in the proof of Proposition 3.2, replacing H by G, R by Q, and y by z. Since G^{T}Q^{-1}G=H^{T}R^{-1}H+P^{-1} is always invertible, we have that, as t\rightarrow\infty,

(t)=t1(t1(0)1+GTQ1G)10\mathfrak{C}(t)=t^{-1}\big{(}t^{-1}\mathfrak{C}(0)^{-1}+G^{T}Q^{-1}G\big{)}^{-1}\rightarrow 0

by continuity of the matrix inverse function. The limit of 𝔪(t)\mathfrak{m}(t) in (3.37) follows directly from (3.35). ∎

4 Ensemble Kalman methods: New Variants

In the previous section we discussed three popular subfamilies of iterative ensemble Kalman methods, analogous to the derivative-based algorithms in Section 2. The aim of this section is to introduce two new iterative ensemble Kalman methods which are inspired by the SDE continuum limit structure of the algorithms in Section 3. The two new methods that we introduce have in common that they rely on statistical linearization, and that the long-time limit of the ensemble covariance recovers the posterior covariance in a linear setting. Subsection 4.1 contains a new variant of IEKF which, in addition to recovering the posterior covariance, also recovers the posterior mean in the long-time limit. Subsection 4.2 introduces a new variant of the EKI method. Finally, Subsection 4.3 highlights the gradient structure of the algorithms in this and the previous section, shows that our new variant of IEKF can also be interpreted as a modified TEKI algorithm, and places our proposed new methods in the context of the broader literature.

4.1 Iterative Ensemble Kalman Filter with Statistical Linearization

In some of the literature on iterative ensemble Kalman methods [48, 60], the length-step (\alpha in Algorithms 4 and 5) is set to 1. Although this choice of length-step allows one to recover the true posterior mean in the linear case after one iteration (see Section 3.1.2), it leads to numerical instability in complex nonlinear models. Alternative ways to set \alpha include performing a line-search that satisfies the Wolfe conditions, or using other ad-hoc line-search criteria [24]. These methods allow \alpha to be chosen adaptively throughout the iterations, but they introduce other hyperparameters that need to be selected manually.

Our idea here is to slightly modify Algorithm 4 so that in the linear case its continuum limit has the true posterior as its invariant distribution. In this way we can simply choose a small enough α\alpha and run the algorithm until the iterates reach a statistical equilibrium, avoiding the need to specify suitable hyperparameters and stopping criteria. Our empirical results show that this approach also performs well in the nonlinear case.

In the update Equation (3.6), we replace each of the u0(n)u_{0}^{(n)} by a perturbation of the prior mean mm, and we replace P0uuP_{0}^{uu} by the prior covariance matrix PP in the definition of KiK_{i}. Details can be found below in our modified algorithm, which we call IEKF-SL.

Algorithm 8 Iterative Ensemble Kalman Filter with Statistical Linearization (IEKF-SL)

Input: Initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N}, step size α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=PHiT(HiPHiT+R)1,Hi=(Piuy)T(Piuu)1K_{i}=PH_{i}^{T}(H_{i}PH_{i}^{T}+R)^{-1},\quad\quad H_{i}=(P_{i}^{uy})^{T}(P_{i}^{uu})^{-1}.

  2. 2.

    Draw yi(n)𝒩(y,2α1R)y_{i}^{(n)}\sim\mathcal{N}(y,2\alpha^{-1}R), mi(n)𝒩(m,2α1P)m_{i}^{(n)}\sim\mathcal{N}(m,2\alpha^{-1}P) and set

    ui+1(n)=ui(n)+α{Ki(yi(n)h(ui(n)))+(IKiHi)(mi(n)ui(n))},1nN.u_{i+1}^{(n)}=u_{i}^{(n)}+\alpha\Bigl{\{}K_{i}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}+(I-K_{i}H_{i})\big{(}m_{i}^{(n)}-u_{i}^{(n)}\big{)}\Bigr{\}},\quad\quad 1\leq n\leq N. (4.1)

Output: Ensemble means m1,m2,m_{1},m_{2},\ldots
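A minimal NumPy sketch of one IEKF-SL iteration (Algorithm 8) is given below; variable names are ours and a pseudo-inverse is used to guard against a rank-deficient ensemble covariance.

import numpy as np

def iekf_sl_step(ens, h, y, m, R, P, alpha, rng):
    """One IEKF-SL update (4.1); ens has shape (N, d)."""
    N, d = ens.shape
    h_ens = np.array([h(u) for u in ens])                   # h(u_i^{(n)}), shape (N, k)
    du = ens - ens.mean(axis=0)
    dh = h_ens - h_ens.mean(axis=0)
    Puu = du.T @ du / N
    Puy = du.T @ dh / N
    Hi = Puy.T @ np.linalg.pinv(Puu)                        # statistical linearization H_i
    K = P @ Hi.T @ np.linalg.inv(Hi @ P @ Hi.T + R)         # Kalman gain with prior covariance P
    ys = rng.multivariate_normal(y, 2 * R / alpha, size=N)  # perturbed data y_i^{(n)}
    ms = rng.multivariate_normal(m, 2 * P / alpha, size=N)  # perturbed prior mean m_i^{(n)}
    innovation = (ys - h_ens) @ K.T + (ms - ens) @ (np.eye(d) - K @ Hi).T
    return ens + alpha * innovation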

It is natural to regard the update (4.1) as a time-stepping scheme. We rewrite it in an alternative form, in analogy to (3.4):

ui+1(n)=ui(n)+αCi{HiTR1(yi(n)h(ui(n)))+P1(mi(n)ui(n))}=ui(n)+αCi{HiTR1(yh(ui(n)))+P1(mui(n))+(HiTR1ζ+P1η)},\begin{split}u_{i+1}^{(n)}&=u_{i}^{(n)}+\alpha C_{i}\Bigl{\{}H_{i}^{T}R^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}+P^{-1}\big{(}m_{i}^{(n)}-u_{i}^{(n)}\big{)}\Bigr{\}}\\ &=u_{i}^{(n)}+\alpha C_{i}\Bigl{\{}H_{i}^{T}R^{-1}\big{(}y-h(u_{i}^{(n)})\big{)}+P^{-1}\big{(}m-u_{i}^{(n)}\big{)}+(H_{i}^{T}R^{-1}\zeta+P^{-1}\eta)\Bigr{\}},\end{split} (4.2)

where

Ci=(HiTR1Hi+P1)1,ζ𝒩(0,2α1R),η𝒩(0,2α1P).C_{i}=\big{(}H_{i}^{T}R^{-1}H_{i}+P^{-1}\big{)}^{-1},\quad\zeta\sim\mathcal{N}(0,2\alpha^{-1}R),\quad\eta\sim\mathcal{N}(0,2\alpha^{-1}P).

We interpret Equation (4.2) as a discretization of the SDE system

du(n)=C(t)(H(t)TR1(yh(u(n)))+P1(mu(n)))dt+2C(t)dW(n),\mathop{}\!\mathrm{d}u^{(n)}=C(t)\Big{(}H(t)^{T}R^{-1}\big{(}y-h(u^{(n)})\big{)}+P^{-1}\big{(}m-u^{(n)}\big{)}\Big{)}\mathop{}\!\mathrm{d}t+\sqrt{2C(t)}\mathop{}\!\mathrm{d}W^{(n)}, (4.3)

where

H(t)\displaystyle H(t) =(Puy(t))T(Puu(t))1,\displaystyle=\big{(}P^{uy}(t)\big{)}^{T}\big{(}P^{uu}(t)\big{)}^{-1},
C(t)\displaystyle C(t) =(H(t)TR1H(t)+P1)1.\displaystyle=\big{(}H(t)^{T}R^{-1}H(t)+P^{-1}\big{)}^{-1}.

The diffusion term can be derived using the fact that

Ci(HiTR1ζ+P1η)𝒩(0,2α1Ci(HiTR1Hi+P1)Ci)=𝒩(0,2α1Ci).C_{i}(H_{i}^{T}R^{-1}\zeta+P^{-1}\eta)\sim\mathcal{N}\big{(}0,2\alpha^{-1}C_{i}(H_{i}^{T}R^{-1}H_{i}+P^{-1})C_{i}\big{)}=\mathcal{N}(0,2\alpha^{-1}C_{i}).

If h(u)=Huh(u)=Hu is linear and the empirical covariance Puu(t)P^{uu}(t) has full rank for all tt, then H(t)HH(t)\equiv H, C(t)C=(P1+HTR1H)1C(t)\equiv C=(P^{-1}+H^{T}R^{-1}H)^{-1}, and the SDE system can be decoupled and further simplified:

du(n)=C(HTR1(yHu(n))+P1(mu(n)))dt+2CdW=(u(n)+C(HTR1y+P1m))dt+2CdW.\begin{split}\mathop{}\!\mathrm{d}u^{(n)}&=C\Big{(}H^{T}R^{-1}\big{(}y-Hu^{(n)}\big{)}+P^{-1}(m-u^{(n)})\Big{)}\mathop{}\!\mathrm{d}t+\sqrt{2C}\mathop{}\!\mathrm{d}W\\ &=\Big{(}-u^{(n)}+C(H^{T}R^{-1}y+P^{-1}m)\Big{)}\mathop{}\!\mathrm{d}t+\sqrt{2C}\mathop{}\!\mathrm{d}W.\end{split} (4.4)
Proposition 4.1.

For the SDE system (4.3), assume h(u)=Huh(u)=Hu is linear and H(t)HH(t)\equiv H holds. Assume that the initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is made of independent samples from a continuous distribution with finite second moments. Then, for 1nN1\leq n\leq N, the mean 𝔪(t)\mathfrak{m}(t) and covariance (t)\mathfrak{C}(t) of u(n)(t)u^{(n)}(t) satisfy

d𝔪(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{m}(t)}{\mathop{}\!\mathrm{d}t} =𝔪(t)+C(HTR1y+P1m),\displaystyle=-\mathfrak{m}(t)+C(H^{T}R^{-1}y+P^{-1}m), (4.5)
d(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t} =2(t)+2C.\displaystyle=-2\mathfrak{C}(t)+2C. (4.6)

Furthermore, as tt\rightarrow\infty,

𝔪(t)C(HTR1y+P1m)=(P1+HTR1H)1(HTR1y+P1m),\displaystyle\begin{split}\mathfrak{m}(t)&\rightarrow C(H^{T}R^{-1}y+P^{-1}m)\\ &=(P^{-1}+H^{T}R^{-1}H)^{-1}(H^{T}R^{-1}y+P^{-1}m),\end{split} (4.7)
(t)\displaystyle\mathfrak{C}(t) C=(P1+HTR1H)1.\displaystyle\rightarrow C=(P^{-1}+H^{T}R^{-1}H)^{-1}. (4.8)

In other words, 𝔪(t)\mathfrak{m}(t) and (t)\mathfrak{C}(t) converge to the true posterior mean and covariance, respectively.

Proof.

It is clear that, for fixed t>0,t>0, {u(n)(t)}n=1N\{u^{(n)}(t)\}_{n=1}^{N} are independent and identically distributed. The evolution of the mean follows directly from (4.4), and the evolution of the covariance follows from (4.4) by applying Itô’s formula. It is then straightforward to derive (4.7) and (4.8) from (4.5) and (4.6), respectively. ∎

4.2 Ensemble Kalman Inversion with Statistical Linearization

Recall that in the formulation of EKI we define a regularized (linearized) data-misfit objective (3.8), with a regularizer on v with respect to the norm |\cdot|_{P_{i}^{uu}}. However, in view of Proposition 3.2, under a linear forward model h(u)=Hu the particles \{u_{i}^{(n)}\}_{n=1}^{N} ‘collapse’, meaning that the empirical covariance of \{Hu_{i}^{(n)}\}_{n=1}^{N} vanishes in the large i limit. One possible remedy is ‘covariance inflation’, namely injecting a certain amount of random noise after each ensemble update. However, this requires ad-hoc tuning of additional hyperparameters. An alternative approach to avoid ensemble collapse is to modify the regularization term in the Levenberg-Marquardt formulation (3.8). The rough idea is to add another regularizer on H_{i}v in the data space, as we describe in what follows.

We define a new regularized data-misfit objective, slightly different from (3.8):

𝖩DM,i(n),UC  (v)=12|yi(n)h(ui(n))Hiv|R2+12α|v|Ci2yi(n)𝒩(y,2α1R),{\mathsf{J}}_{\mbox{\tiny{\rm DM}},i}^{(n),\mbox{\mbox{\tiny{UC} } }}(v)=\frac{1}{2}\bigl{|}y_{i}^{(n)}-h(u_{i}^{(n)})-H_{i}v\bigr{|}^{2}_{R}+\frac{1}{2\alpha}\bigl{|}v\bigr{|}^{2}_{C_{i}}\quad\quad y_{i}^{(n)}\sim\mathcal{N}(y,2\alpha^{-1}R), (4.9)

where

Ci=(P1+HiTR1Hi)1,C_{i}=(P^{-1}+H_{i}^{T}R^{-1}H_{i})^{-1}, (4.10)

and HiH_{i} is defined in (3.1). The regularization term can be decomposed as

|v|Ci2=|v|P2+|Hiv|R2.|v|_{C_{i}}^{2}=|v|_{P}^{2}+|H_{i}v|_{R}^{2}.

The first term can be regarded as a regularization on vv with respect to the prior covariance PP. The second term can be regarded as a regularization on HivH_{i}v with respect to the noise covariance RR. Applying Lemma 2.1, we can calculate the minimizer of (4.9):

vi(n)=(HiTR1Hi+α1Ci1)1HiTR1(yi(n)h(ui(n)))=(α1P1+(1+α1)HiTR1Hi)1HiTR1(yi(n)h(ui(n)))=αPHiT((1+α)HiPHiT+R)1(yi(n)h(ui(n))),\begin{split}v_{i}^{(n)}&=(H_{i}^{T}R^{-1}H_{i}+\alpha^{-1}C_{i}^{-1})^{-1}H_{i}^{T}R^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}\\ &=\big{(}\alpha^{-1}P^{-1}+(1+\alpha^{-1})H_{i}^{T}R^{-1}H_{i}\big{)}^{-1}H_{i}^{T}R^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}\\ &=\alpha PH_{i}^{T}\big{(}(1+\alpha)H_{i}PH_{i}^{T}+R\big{)}^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)},\end{split} (4.11)

where the second equality follows from the definition of CiC_{i}, and the third equality follows from the matrix identity (2.7). This leads to Algorithm 9.

Algorithm 9 Ensemble Kalman Inversion with Statistical Linearization (EKI-SL)

Input: Initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N}, step size α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=αPHiT((1+α)HiPHiT+R)1,Hi=(Piuy)T(Piuu)1K_{i}=\alpha PH_{i}^{T}\big{(}(1+\alpha)H_{i}PH_{i}^{T}+R\big{)}^{-1},\quad\quad H_{i}=(P_{i}^{uy})^{T}(P_{i}^{uu})^{-1}.

  2. 2.

    Draw yi(n)𝒩(y,2α1R)y_{i}^{(n)}\sim\mathcal{N}(y,2\alpha^{-1}R) and set

    ui+1(n)=ui(n)+Ki{yi(n)h(ui(n))},1nN.u_{i+1}^{(n)}=u_{i}^{(n)}+K_{i}\Bigl{\{}y_{i}^{(n)}-h(u_{i}^{(n)})\Bigr{\}},\quad\quad 1\leq n\leq N. (4.12)

Output: Ensemble means m1,m2,m_{1},m_{2},\ldots
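Analogously, a minimal NumPy sketch of one EKI-SL iteration (Algorithm 9) might read as follows; variable names are again ours.

import numpy as np

def eki_sl_step(ens, h, y, P, R, alpha, rng):
    """One EKI-SL update (4.12); ens has shape (N, d)."""
    N, d = ens.shape
    h_ens = np.array([h(u) for u in ens])
    du = ens - ens.mean(axis=0)
    dh = h_ens - h_ens.mean(axis=0)
    Puu = du.T @ du / N
    Puy = du.T @ dh / N
    Hi = Puy.T @ np.linalg.pinv(Puu)                        # statistical linearization H_i
    K = alpha * P @ Hi.T @ np.linalg.inv((1 + alpha) * Hi @ P @ Hi.T + R)   # gain K_i
    ys = rng.multivariate_normal(y, 2 * R / alpha, size=N)  # perturbed data y_i^{(n)}
    return ens + (ys - h_ens) @ K.T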

For small α>0\alpha>0, we interpret the update (4.12) as a discretization of the coupled SDE system

du(n)=PH(t)T(H(t)PH(t)T+R)1((yh(u(n)))dt+2R1/2dW)=C(t)H(t)TR1(yh(u(n)))dt+2C(t)H(t)TR1/2dW,\begin{split}\mathop{}\!\mathrm{d}u^{(n)}&=PH(t)^{T}\big{(}H(t)PH(t)^{T}+R\big{)}^{-1}\Big{(}\big{(}y-h(u^{(n)})\big{)}\mathop{}\!\mathrm{d}t+\sqrt{2}R^{1/2}\mathop{}\!\mathrm{d}W\Big{)}\\ &=C(t)H(t)^{T}R^{-1}\big{(}y-h(u^{(n)})\big{)}\mathop{}\!\mathrm{d}t+\sqrt{2}C(t)H(t)^{T}R^{-1/2}\mathop{}\!\mathrm{d}W,\end{split} (4.13)

where

H(t)\displaystyle H(t) =(Puy(t))T(Puu(t))1,\displaystyle=\big{(}P^{uy}(t)\big{)}^{T}\big{(}P^{uu}(t)\big{)}^{-1},
C(t)\displaystyle C(t) =(P1+H(t)TR1H(t))1.\displaystyle=\big{(}P^{-1}+H(t)^{T}R^{-1}H(t)\big{)}^{-1}.

For our next result we will work under the assumption that H(t)HH(t)\equiv H is constant, which in particular requires that the empirical covariance Puu(t)P^{uu}(t) has full rank for all t.t. Importantly, under this assumption C(t)C=(P1+HTR1H)1C(t)\equiv C=(P^{-1}+H^{T}R^{-1}H)^{-1}and the SDE system is decoupled:

du(n)=CHTR1(yHu(n))dt+2CHTR1/2dW(n).\mathop{}\!\mathrm{d}u^{(n)}=CH^{T}R^{-1}\big{(}y-Hu^{(n)}\big{)}\mathop{}\!\mathrm{d}t+\sqrt{2}CH^{T}R^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}. (4.14)
Proposition 4.2.

For the SDE system (4.13), assume h(u)=Huh(u)=Hu is linear and H(t)HH(t)\equiv H holds. Assume that the initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is made of independent samples from a continuous distribution with finite second moments. Then the mean 𝔪(t)\mathfrak{m}(t) and covariance (t)\mathfrak{C}(t) of u(n)(t)u^{(n)}(t) satisfy

d𝔪(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{m}(t)}{\mathop{}\!\mathrm{d}t} =CHTR1(yH𝔪(t)),\displaystyle=CH^{T}R^{-1}(y-H\mathfrak{m}(t)), (4.15)
d(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t} =CHTR1H(t)(t)HTR1HC+2CHTR1HC.\displaystyle=-CH^{T}R^{-1}H\mathfrak{C}(t)-\mathfrak{C}(t)H^{T}R^{-1}HC+2CH^{T}R^{-1}HC. (4.16)

In particular, if Hk×dH\in\mathbb{R}^{k\times d} has full column rank (i.e., dkd\leq k, rank(H)=d\emph{rank}(H)=d), then, as tt\rightarrow\infty,

𝔪(t)\displaystyle\mathfrak{m}(t) (HTR1H)1(HTR1y),\displaystyle\rightarrow(H^{T}R^{-1}H)^{-1}(H^{T}R^{-1}y), (4.17)
(t)\displaystyle\mathfrak{C}(t) C=(P1+HTR1H)1.\displaystyle\rightarrow C=(P^{-1}+H^{T}R^{-1}H)^{-1}. (4.18)

If Hk×dH\in\mathbb{R}^{k\times d} has full row rank (i.e., dkd\geq k, rank(H)=k\emph{rank}(H)=k), then, as t,t\rightarrow\infty,

H𝔪(t)\displaystyle H\mathfrak{m}(t) y,\displaystyle\rightarrow y, (4.19)
H(t)HT\displaystyle H\mathfrak{C}(t)H^{T} HCHT.\displaystyle\rightarrow HCH^{T}. (4.20)
Proof.

Note that here, in contrast to the setting of Proposition 3.2, the distribution of u(n)(t)u^{(n)}(t) does not depend on the ensemble size N,N, and we have simply 𝔪(t)=𝔼[u(n)(t)]\mathfrak{m}(t)=\operatorname{\mathbb{E}}\bigl{[}u^{(n)}(t)\bigr{]} and (t)=𝔼[e(n)(t)e(n)(t)],\mathfrak{C}(t)=\operatorname{\mathbb{E}}\big{[}e^{(n)}(t)\otimes e^{(n)}(t)\big{]}, with e(n)(t):=u(n)(t)𝔪(t).e^{(n)}(t):=u^{(n)}(t)-\mathfrak{m}(t). The evolution of 𝔪(t)\mathfrak{m}(t) follows directly from (4.14). To obtain the evolution of (t)\mathfrak{C}(t), we use a similar technique as in the proof of Proposition 3.2. Applying Itô’s formula,

d(t)dt=𝔼[CHTR1H(e(n)e(n))(e(n)e(n))HTR1HC+2CHTR1HC]=CHTR1H(t)(t)HTR1HC+2CHTR1HC,\begin{split}\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t}&=\operatorname{\mathbb{E}}\Big{[}-CH^{T}R^{-1}H(e^{(n)}\otimes e^{(n)})-(e^{(n)}\otimes e^{(n)})H^{T}R^{-1}HC+2CH^{T}R^{-1}HC\Big{]}\\ &=-CH^{T}R^{-1}H\mathfrak{C}(t)-\mathfrak{C}(t)H^{T}R^{-1}HC+2CH^{T}R^{-1}HC,\end{split}

which recovers (4.16).

If HH has full column rank, then HTR1HH^{T}R^{-1}H is invertible, and (4.17) follows immediately from (4.15). By setting the right-hand side of (4.16) to 0, and using the fact that it has a unique solution, we derive (4.18).

If HH has full row rank, then (4.19) follows immediately from (4.15). Next, the substitutions

~(t)=R1/2H(t)HTR1/2,C~=R1/2HCHTR1/2\widetilde{\mathfrak{C}}(t)=R^{-1/2}H\mathfrak{C}(t)H^{T}R^{-1/2},\quad\quad\quad\widetilde{C}=R^{-1/2}HCH^{T}R^{-1/2}

allow us to transform (4.16) into

d~(t)dt=C~~(t)~(t)C~+2C~2.\frac{\mathop{}\!\mathrm{d}\widetilde{\mathfrak{C}}(t)}{\mathop{}\!\mathrm{d}t}=-\widetilde{C}\widetilde{\mathfrak{C}}(t)-\widetilde{\mathfrak{C}}(t)\widetilde{C}+2\widetilde{C}^{2}. (4.21)

Since C~\widetilde{C} is invertible, by setting the right-hand side of (4.21) to 0 we deduce that ~(t)C~\widetilde{\mathfrak{C}}(t)\rightarrow\widetilde{C} as tt\rightarrow\infty, which recovers (4.20). ∎

4.3 Gradient Structure and Discussion

LM Algorithms in the Continuum Limit

Levenberg-Marquardt algorithms have a natural gradient structure in the continuum limit. This was shown in Equation (2.26), where the preconditioner PP corresponds to the regularizer ||P|\cdot|_{P} that is used in the Levenberg-Marquardt algorithm (2.21). Ensemble-based Levenberg-Marquardt algorithms also possess a gradient structure. To see this, we consider an update ui+1(n)=ui(n)+vi(n)u_{i+1}^{(n)}=u_{i}^{(n)}+v_{i}^{(n)}, where vi(n)v_{i}^{(n)} is the unconstrained minimizer of the same objective as (4.9)

𝖩DM,i(n),UC  (v)=12|yi(n)h(ui(n))Hiv|R2+12α|v|Si2yi(n)𝒩(y,2α1R),{\mathsf{J}}_{\mbox{\tiny{\rm DM}},i}^{(n),\mbox{\mbox{\tiny{UC} } }}(v)=\frac{1}{2}\bigl{|}y_{i}^{(n)}-h(u_{i}^{(n)})-H_{i}v\bigr{|}^{2}_{R}+\frac{1}{2\alpha}\bigl{|}v\bigr{|}^{2}_{S_{i}}\quad\quad y_{i}^{(n)}\sim\mathcal{N}(y,2\alpha^{-1}R),

except that we allow any positive (semi)definite matrix SiS_{i} to act as a ‘regularizer’. The resulting continuum limit is given by the SDE system

du(n)=S(t)H(t)TR1(yh(u(n)))dt+2S(t)H(t)TR1/2dW(n),\mathop{}\!\mathrm{d}u^{(n)}=S(t)H(t)^{T}R^{-1}\big{(}y-h(u^{(n)})\big{)}\mathop{}\!\mathrm{d}t+\sqrt{2}S(t)H(t)^{T}R^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}, (4.22)

where u(n)(t),S(t),H(t)u^{(n)}(t),S(t),H(t) are continuous time analogs of ui(n),Si,Hiu_{i}^{(n)},S_{i},H_{i}. Notice that H(t)TR1(yh(u(n)))H(t)^{T}R^{-1}(y-h(u^{(n)})) is exactly 𝖩DM(u(n))-{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}^{\prime}(u^{(n)}), so that we may rewrite (4.22) as

u˙(n)=S(t)𝖩DM(u(n))+2S(t)H(t)TR1/2W˙(n),\dot{u}^{(n)}=-S(t){\mathsf{J}}_{\mbox{\tiny{\rm DM}}}^{\prime}(u^{(n)})+\sqrt{2}S(t)H(t)^{T}R^{-1/2}\dot{W}^{(n)}, (4.23)

which is a perturbed gradient descent with preconditioner S(t)S(t). We recall that S(t)=Puu(t)S(t)=P^{uu}(t) in EKI and S(t)=C(t)=(P1+H(t)TR1H(t))1S(t)=C(t)=\big{(}P^{-1}+H(t)^{T}R^{-1}H(t)\big{)}^{-1} in EKI-SL. Other choices of S(t)S(t) are possible and will be studied in future work.
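To make the shared structure concrete, the following sketch (ours) performs one Euler-Maruyama step of (4.22) for a generic preconditioner S; passing S = P^{uu} corresponds to EKI and S = (P^{-1}+H_{i}^{T}R^{-1}H_{i})^{-1} to EKI-SL, and the statistical linearization Hi is assumed to have been computed from the ensemble as in (3.1).

import numpy as np

def preconditioned_step(ens, h, y, R, S, Hi, dt, rng):
    """One Euler-Maruyama step of (4.22) for a generic preconditioner S (names ours)."""
    N, d = ens.shape
    h_ens = np.array([h(u) for u in ens])
    Rinv = np.linalg.inv(R)
    drift = (y - h_ens) @ Rinv @ Hi @ S.T                   # rows: S Hi^T R^{-1}(y - h(u^{(n)}))
    evals, evecs = np.linalg.eigh(R)
    Rinv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T    # symmetric R^{-1/2}
    dW = np.sqrt(dt) * rng.standard_normal((N, len(y)))
    noise = np.sqrt(2.0) * dW @ Rinv_half @ Hi @ S.T        # rows: sqrt(2) S Hi^T R^{-1/2} dW
    return ens + dt * drift + noise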

Gauss-Newton Algorithms in the Continuum Limit

Gauss-Newton algorithms also have a natural gradient structure in the continuum limit. As shown in Equation (2.20), the Gauss-Newton method applied to a Tikhonov-Phillips objective can be regarded as a gradient flow with a preconditioner that is the inverse Hessian matrix of the objective. Ensemble-based Gauss-Newton methods also possess a similar gradient structure. Recall that in Equation (4.3) we formulated the continuum limit of IEKF-SL:

du(n)=C(t)(H(t)TR1(yh(u(n)))+P1(mu(n)))dt+2C(t)dW(n).\mathop{}\!\mathrm{d}u^{(n)}=C(t)\Big{(}H(t)^{T}R^{-1}\big{(}y-h(u^{(n)})\big{)}+P^{-1}\big{(}m-u^{(n)}\big{)}\Big{)}\mathop{}\!\mathrm{d}t+\sqrt{2C(t)}\mathop{}\!\mathrm{d}W^{(n)}. (4.24)

Notice that the drift term is exactly C(t)𝖩TP(u(n))-C(t){\mathsf{J}}_{\mbox{\tiny{\rm TP}}}^{\prime}(u^{(n)}), so we may rewrite (4.24) as

u˙(n)=C(t)𝖩TP(u(n))+2C(t)W˙(n),\dot{u}^{(n)}=-C(t){\mathsf{J}}_{\mbox{\tiny{\rm TP}}}^{\prime}(u^{(n)})+\sqrt{2C(t)}\dot{W}^{(n)}, (4.25)

which is a perturbed gradient descent with preconditioner C(t)C(t), the inverse Hessian of 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}.

A Unified View of Levenberg-Marquardt and Gauss-Newton Algorithms in the Continuum Limit

To summarize the above discussion: in the continuum limit, Levenberg-Marquardt algorithms (e.g., EKI, EKI-SL) are (perturbed) gradient descent methods, with a preconditioner determined by the regularizer used in the Levenberg-Marquardt update step. Gauss-Newton algorithms (e.g., IEKF, IEKF-SL) are also (perturbed) gradient descent methods, with a preconditioner determined by the inverse Hessian of the objective.

Interestingly, there are cases where the two types of algorithms coincide, even with the same amount of perturbation in the gradient descent step. A way to see this is to set the Levenberg-Marquardt regularizer to be the inverse Hessian of the objective. In Equation (4.23), if we consider the Tikhonov-Phillips objective (i.e., replace {\mathsf{J}}_{\rm DM} by {\mathsf{J}}_{\rm TP}, H by G, and R by Q), and set S(t)=C(t), then (4.23) and (4.25) coincide, leading to the same SDE system. This is the reason why we do not introduce a ‘TEKI-SL’ algorithm: it would be identical to IEKF-SL.

Relationship to Ensemble Kalman Sampler (EKS)

Another algorithm of interest is the Ensemble Kalman Sampler (EKS) [23, 17]. Although not discussed in this work, the EKS update is similar to (4.24). In fact, if we replace C(t)C(t) by the ensemble covariance Puu(t)P^{uu}(t), and use the fact that Puu(t)H(t)T=Puy(t)P^{uu}(t)H(t)^{T}=P^{uy}(t) by definition, we immediately recover the evolution equation of EKS:

\mathrm{d}u^{(n)}=\Big(P^{uy}(t)R^{-1}\big(y-h(u^{(n)})\big)+P^{uu}(t)P^{-1}\big(m-u^{(n)}\big)\Big)\mathrm{d}t+\sqrt{2P^{uu}(t)}\,\mathrm{d}W^{(n)}.

Then an Euler-Maruyama discretization is performed to obtain the update formula in discrete form. However, as we have found in several numerical experiments, the length-step \alpha needs to be chosen carefully. If the noise covariance R is small in scale, EKS often blows up in the first few iterations, while other algorithms with the same length-step do not. Also, if we consider a linear forward model h(u)=Hu, EKS still requires the mean-field limit N\rightarrow\infty in order to converge to the posterior distribution, whereas IEKF-SL only needs approximately N\approx d particles, where d is the input dimension.
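For completeness, a minimal Euler-Maruyama sketch of the EKS evolution equation above, with a fixed length-step and variable names of our choosing, could be implemented as follows.

import numpy as np

def eks_step(ens, h, y, m, R, P, dt, rng):
    """Euler-Maruyama step of the EKS SDE; ens has shape (N, d)."""
    N, d = ens.shape
    h_ens = np.array([h(u) for u in ens])
    du = ens - ens.mean(axis=0)
    dh = h_ens - h_ens.mean(axis=0)
    Puu = du.T @ du / N                                     # empirical covariance P^{uu}
    Puy = du.T @ dh / N                                     # cross-covariance P^{uy}
    drift = (y - h_ens) @ np.linalg.inv(R) @ Puy.T + (m - ens) @ np.linalg.inv(P) @ Puu.T
    # noise with covariance 2 * Puu * dt, via a symmetric square root of Puu
    evals, evecs = np.linalg.eigh(Puu)
    Puu_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T
    noise = np.sqrt(2 * dt) * rng.standard_normal((N, d)) @ Puu_half
    return ens + dt * drift + noise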

5 Numerical Examples

In this section we provide a numerical comparison of the iterative ensemble Kalman methods introduced in Sections 3 and 4. Our experiments highlight the variety of applications that have motivated the development of these methods, and illustrate their use in a wide range of settings.

5.1 Elliptic Boundary Value Problem

In this subsection we consider a simple nonlinear Bayesian inverse problem originally presented in [20], where the unknown parameter is two dimensional and the forward map admits a closed formula. These features facilitate both the visualization and the interpretation of the solution.

5.1.1 Problem Setup

Consider the elliptic boundary value problem

-\frac{\mathrm{d}}{\mathrm{d}x}\Big(\exp(u_{1})\frac{\mathrm{d}}{\mathrm{d}x}p(x)\Big)=1,\quad x\in(0,1),\qquad p(0)=0,\quad p(1)=u_{2}. (5.1)

It can be checked that (5.1) has an explicit solution

pu(x)=u2x12exp(u1)(x2x).p_{u}(x)=u_{2}x-\frac{1}{2}\exp(-u_{1})(x^{2}-x). (5.2)
Figure 1: Ensemble members (green) after 100 iterations, with truth u^{\dagger} (red star) and contour plot of the (unnormalized) posterior density. Panels: (a) truth u^{\dagger}; (b) EKI; (c) TEKI; (d) IEKF; (e) EKI-SL; (f) IEKF-SL.

We seek to recover u=(u1,u2)Tu=(u_{1},u_{2})^{T} from noisy observation of pup_{u} at two points x1=0.25x_{1}=0.25 and x2=0.75x_{2}=0.75. Precisely, we define h(u):=(pu(x1),pu(x2))Th(u):=\bigl{(}p_{u}(x_{1}),p_{u}(x_{2})\bigr{)}^{T} and consider the inverse problem of recovering uu from data yy of the form

y=h(u)+η,y=h(u)+\eta, (5.3)

where η𝒩(0,γ2I2)\eta\sim\mathcal{N}(0,\gamma^{2}I_{2}). We set a Gaussian prior on the unknown parameter u𝒩(0,1)×𝒩(100,16)u\sim\mathcal{N}(0,1)\times\mathcal{N}(100,16). In our numerical experiments we let the true parameter be u=(2.6,104.5)Tu^{\dagger}=(-2.6,104.5)^{T} and use it to generate synthetic data y=h(u)+ηy=h(u^{\dagger})+\eta with noise level γ=0.1.\gamma=0.1.
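Since the forward map admits the closed formula (5.2), the problem setup can be reproduced with a few lines of NumPy; the random seed below is ours.

import numpy as np

def forward(u):
    """Forward map h(u) = (p_u(0.25), p_u(0.75)) using the explicit solution (5.2)."""
    u1, u2 = u
    p = lambda x: u2 * x - 0.5 * np.exp(-u1) * (x ** 2 - x)
    return np.array([p(0.25), p(0.75)])

rng = np.random.default_rng(0)
u_true = np.array([-2.6, 104.5])
gamma = 0.1
y = forward(u_true) + gamma * rng.standard_normal(2)   # synthetic data (5.3)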

5.1.2 Implementation Details and Numerical Results

We set the ensemble size to be N=50N=50. The initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is drawn independently from the prior. The length-step α\alpha is fixed to be 0.1 for all methods. We run each algorithm up to time T=10T=10, which corresponds to 100 iterations.

We plot the level curves of the posterior density of u=(u_{1},u_{2})^{T} in Figure 1. The forward model is approximately linear, as can be seen from the contour plots in Figure 1 or directly from Equation (5.2). Hence this example can be used to validate the claims made for the linear setting in previous sections.

Figure 2: Evolution of the Frobenius norm of the ensemble covariance Puu(t)P^{uu}(t). The norm of IEKF-RZL blows up after a few iterations. The norms of the EKI and TEKI are almost identical and monotonically decreasing. The norms of the new variants EKI-SL and IEKF-SL are similar and stabilize after around 40 iterations. The norm of IEKF lies between those of the old and new variants.
Figure 3: EKI & EKI-SL: Relative errors and data misfit w.r.t time tt.
Figure 4: TEKI, IEKF & IEKF-SL: Relative errors and Tikhonov-Phillips objective w.r.t time tt.

Figure 2 compares the time evolution of the empirical covariance for the different methods. The IEKF-RZL algorithm consistently blows up in the first few iterations due to the small observation noise; as discussed in Section 3.1, this is an important drawback. Because of the poor performance of IEKF-RZL in small noise regimes, we do not include this method in subsequent comparisons. The EKI and TEKI plots show the ‘ensemble collapse’ phenomenon discussed in Sections 3.2 and 3.3: the size of their empirical covariances decreases monotonically, and by the 100-th iteration it is one order of magnitude smaller than that of the new variants IEKF-SL and EKI-SL. The ensemble collapse of EKI and TEKI can also be seen in Figure 1, where we plot all the ensemble members after 100 iterations. In contrast, Figure 2 shows that the size of the empirical covariances of the new variants IEKF-SL and EKI-SL stabilizes after around 40 iterations, and Figure 1 suggests that the spread of the ensemble matches that of the posterior. These results agree with the theoretical results derived in Sections 4.1 and 4.2 in a linear setting. Although we have not analyzed the IEKF method when \alpha is small, its ensemble covariance does not collapse in the linear setting, as can be seen in Figure 2.

In Figures 3 and 4 we show the performance of the different methods along the full iteration sequence. Here and in subsequent numerical examples we use two performance assessments. First, the relative error, defined as |m(t)u|/|u||m(t)-u^{\dagger}|/|u^{\dagger}| where m(t)m(t) is the ensemble’s empirical mean, which evaluates how well the ensemble mean approximates the truth. Second, we assess how each ensemble method performs in terms of its own optimization objective. Precisely, we report the data-misfit objective 𝖩DM(m(t)){\mathsf{J}}_{\mbox{\tiny{\rm DM}}}\big{(}m(t)\big{)} for EKI and EKI-SL, and the Tikhonov-Phillips objective 𝖩TP(m(t)){\mathsf{J}}_{\mbox{\tiny{\rm TP}}}\big{(}m(t)\big{)} for IEKF, TEKI and IEKF-SL. We run 10 trials for each algorithm, using different ensemble initializations (drawn from the same prior), and generate the error bars accordingly. Since this is a simple toy problem, all methods perform well. However, we note that the first iterations of EKI and TEKI reduce the objective faster than other algorithms.
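For reference, the two performance assessments can be computed with a few lines of NumPy; we follow the convention |a|_{B}^{2}=a^{T}B^{-1}a used in the least-squares objectives of this paper, and the function names are ours.

import numpy as np

def relative_error(ens, u_true):
    """Relative error |m(t) - u^dagger| / |u^dagger| of the ensemble mean."""
    m = ens.mean(axis=0)
    return np.linalg.norm(m - u_true) / np.linalg.norm(u_true)

def data_misfit(u, h, y, R):
    """J_DM(u) = 0.5 |y - h(u)|_R^2, with |a|_B^2 = a^T B^{-1} a."""
    r = y - h(u)
    return 0.5 * r @ np.linalg.solve(R, r)

def tikhonov_phillips(u, h, y, m, R, P):
    """J_TP(u) = 0.5 |y - h(u)|_R^2 + 0.5 |u - m|_P^2."""
    p = u - m
    return data_misfit(u, h, y, R) + 0.5 * p @ np.linalg.solve(P, p)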

5.2 High-Dimensional Linear Inverse Problem

In this subsection we consider a linear Bayesian inverse problem from [31]. This example illustrates the use of iterative ensemble Kalman methods in settings where both the size of the ensemble and the dimension of the data are significantly smaller than the dimension of the unknown parameter.

5.2.1 Problem setup

Consider the one dimensional elliptic equation

d2pdx2+p=u,x(0,π),p(0)=p(π)=0.\displaystyle\begin{split}-\frac{\mathop{}\!\mathrm{d}^{2}p}{\mathop{}\!\mathrm{d}x^{2}}+p&=u,\quad x\in(0,\pi),\\ p(0)&=p(\pi)=0.\end{split} (5.4)

We seek to recover uu from noisy observation of pp at k=241k=2^{4}-1 equispaced points xj=j24π.x_{j}=\frac{j}{2^{4}}\pi. We assume that the data is generated from the model

yj=p(xj)+ηj,j=1,,k,y_{j}=p(x_{j})+\eta_{j},\quad j=1,\dots,k, (5.5)

where ηj𝒩(0,γ2)\eta_{j}\sim\mathcal{N}(0,\gamma^{2}) are independent. By defining A=d2dx2+idA=-\frac{\mathop{}\!\mathrm{d}^{2}}{\mathop{}\!\mathrm{d}x^{2}}+id and letting 𝒪\mathcal{O} be the observation operator defined by (𝒪(p))j=p(xj)\big{(}\mathcal{O}(p)\big{)}_{j}=p(x_{j}), we can rewrite (5.5) as

y=h(u)+η,η𝒩(0,γ2Ik),y=h(u)+\eta,\quad\eta\sim\mathcal{N}(0,\gamma^{2}I_{k}),

where h=𝒪A1h=\mathcal{O}\circ A^{-1}. The forward problem (5.4) is solved on a uniform mesh with meshwidth w=28w=2^{-8} by a finite element method with continuous, piecewise linear basis functions. We assume that the unknown parameter uu has a Gaussian prior distribution, u𝒩(0,C0)u\sim\mathcal{N}(0,C_{0}) with covariance operator C0=10(Aid)1C_{0}=10(A-id)^{-1} with homogeneous Dirichlet boundary conditions. This prior can be interpreted as the law of a Brownian bridge between 0 and π\pi. For computational purposes we view uu as a random vector in 28\mathbb{R}^{2^{8}} and the linear map h(u)h(u) is represented by a matrix H(241)×28.H\in\mathbb{R}^{(2^{4}-1)\times 2^{8}}. The true parameter uu^{\dagger} is sampled from this prior (cf. Figure 5), and is used to generate synthetic observation data y=Hu+ηy=Hu^{\dagger}+\eta with noise level γ=0.01\gamma=0.01.
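The following minimal sketch (ours) assembles a finite-difference surrogate of this linear forward map; we stress that the paper uses a piecewise-linear finite-element discretization, so the grid, the nearest-node observation operator and all variable names below are illustrative assumptions.

import numpy as np

n = 2 ** 8                                   # number of interior grid points (our choice)
w = np.pi / (n + 1)                          # meshwidth
x = np.linspace(w, np.pi - w, n)             # interior nodes
# second-difference approximation of -d^2/dx^2 with homogeneous Dirichlet conditions
L = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / w ** 2
A = L + np.eye(n)                            # discretization of A = -d^2/dx^2 + id
# observation operator: nearest grid nodes to the 15 observation points x_j = j pi / 2^4
obs_x = np.arange(1, 2 ** 4) / 2 ** 4 * np.pi
obs_idx = np.array([np.argmin(np.abs(x - xo)) for xo in obs_x])
H = np.linalg.inv(A)[obs_idx, :]             # the linear forward map: h(u) = H u
C0 = 10 * np.linalg.inv(L)                   # discretized prior covariance 10 (A - id)^{-1}

rng = np.random.default_rng(0)
u_true = np.linalg.cholesky(C0 + 1e-12 * np.eye(n)) @ rng.standard_normal(n)  # prior sample
y = H @ u_true + 0.01 * rng.standard_normal(len(obs_idx))                     # synthetic data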

5.2.2 Implementation Details and Numerical Results

We set the ensemble size to be N=50N=50. The initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is drawn independently from the prior. The length-step α\alpha is fixed to be 0.05 for all methods. We run each algorithm up to time T=30T=30, which corresponds to 600 iterations.

We note that the linear inverse problem considered here has input dimension 2^8=256, which is much larger than the ensemble size N. Combining the results from Figures 5, 6 and 7, EKI and EKI-SL clearly overfit the data. In contrast, TEKI over-regularizes, and the ensemble collapse is clearly noticeable. IEKF and IEKF-SL lie in between, and approximate the truth slightly better.

Figure 5: Ensemble mean (red) at the final iteration, with 10- and 90-quantiles (blue). Panels: (a) truth u^{\dagger}; (b) EKI; (c) TEKI; (d) IEKF; (e) EKI-SL; (f) IEKF-SL.
Figure 6: EKI & EKI-SL: Relative errors and data misfit w.r.t time tt.
Figure 7: IEKF, TEKI & IEKF-SL: Relative errors and Tikhonov-Phillips objective w.r.t time tt.

5.3 Lorenz-96 Model

In this subsection we investigate the use of iterative ensemble Kalman methods to recover the initial condition of the Lorenz-96 system from partial and noisy observation of the solution at two positive times. The experimental framework is taken from [14] and is illustrative of the use of iterative ensemble Kalman methods in numerical weather forecasting.

5.3.1 Problem Setup

Consider the dynamical system

dzldt=zl1(zl+1zl2)zl+F,l=1,,d,z0=zd,zd+1=z1,z1=zd1.\begin{gathered}\frac{dz_{l}}{dt}=z_{l-1}(z_{l+1}-z_{l-2})-z_{l}+F,\quad l=1,\ldots,d,\\ z_{0}=z_{d},\quad z_{d+1}=z_{1},\quad z_{-1}=z_{d-1}.\end{gathered} (5.6)

Here z_{l} denotes the l-th coordinate of the current state of the system and F is a constant forcing with default value F=8. The dimension is often chosen as d=40, which is the value we use. We want to recover the initial condition

u:=(z1(0),,zd(0))Tu:=(z_{1}(0),\dots,z_{d}(0))^{T}

of (5.6) from noisy partial measurements at discrete times {si}i=1I\{s_{i}\}_{i=1}^{I}:

y_{i,j}=z_{l_{j}}(s_{i})+\eta_{i,j}, (5.7)

where z(t) denotes the solution of (5.6) with initial condition z(0)=u, \{l_{j}\}_{j=1}^{J}\subset\{1,2,\dots,d\} is the subset of observed coordinates, and \eta_{i,j}\sim\mathcal{N}(0,\gamma^{2}) are assumed to be independent. In our numerical experiments we set I=2, J=20, \{s_{1},s_{2}\}=\{0.3,0.6\}, and \{l_{j}\}_{j=1}^{20}=\{1,3,5,\dots,39\}. We set the prior on u to be a Gaussian \mathcal{N}(0,2I_{d}). The true parameter u^{\dagger}\in\mathbb{R}^{40} is shown in Figure 8, and is used to generate the observation data \{y_{i,j}\} according to (5.7) with noise level \gamma=0.01.
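A minimal sketch of the forward map, using SciPy's solve_ivp as a black-box integrator (our choice; tolerances and names are ours), could read as follows.

import numpy as np
from scipy.integrate import solve_ivp

F, d = 8.0, 40

def lorenz96(t, z):
    """Right-hand side of the Lorenz-96 system (5.6) with cyclic indexing."""
    return (np.roll(z, -1) - np.roll(z, 2)) * np.roll(z, 1) - z + F

obs_idx = np.arange(0, d, 2)              # observed coordinates l_j = 1, 3, ..., 39 (0-based)
obs_times = [0.3, 0.6]

def forward(u):
    """Map the initial condition u to the observed coordinates at times s_1 and s_2."""
    sol = solve_ivp(lorenz96, (0.0, obs_times[-1]), u, t_eval=obs_times, rtol=1e-8)
    return sol.y[obs_idx, :].T.ravel()    # observations at s_1 followed by those at s_2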

5.3.2 Implementation Details and Numerical Results

We set the ensemble size to be N=50N=50. The initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is drawn independently from the prior. The length-step α\alpha is fixed to be 0.05 for all methods. We run each algorithm up to time T=30T=30, which corresponds to 600 iterations.

This is a moderately high dimensional nonlinear problem, where the forward model involves a black-box solver of differential equations. Figure 8 clearly indicates an ensemble collapse for the EKI and TEKI methods. Although they are still capable of finding a descent direction for the loss (see Figures 9 and 10), the iterates get stuck in a local minimum. In contrast, we can see the advantage of IEKF, EKI-SL and IEKF-SL in this setting: intuitively, a broader spread of the ensemble allows these methods to ‘explore’ different regions in the input space, and thereby to find a solution with potentially lower loss.

We notice that in practice ‘covariance inflation’ is often applied in EKI and TEKI algorithms, by manually injecting small random noise after each ensemble update. In general, this technique will ‘force’ a non-zero empirical covariance, prevent ensemble collapse and boost the performance of EKI and TEKI. However, the amount of noise injected is an additional hyperparameter that needs to be chosen manually, and that should depend on the input dimension, observation noise, etc. Here we give a fair comparison of the different methods introduced, under the same time-stepping setting, with as few hyperparameters as possible.

Figure 8: Ensemble mean (red) at the final iteration, with 10- and 90-quantiles (blue). Panels: (a) truth u^{\dagger}; (b) EKI; (c) TEKI; (d) IEKF; (e) EKI-SL; (f) IEKF-SL.
Figure 9: EKI & EKI-SL: Relative errors and data misfit w.r.t time tt.
Figure 10: IEKF, TEKI & IEKF-SL: Relative errors and Tikhonov-Phillips objective w.r.t time tt.

5.4 High-Dimensional Nonlinear Regression

In this subsection we consider a nonlinear regression problem with a highly oscillatory forward map introduced in [19], where the authors investigate the use of iterative ensemble Kalman methods to train neural networks without back propagation.

5.4.1 Problem Setup

We consider a nonlinear regression problem

y=h(u)+η,η𝒩(0,γ2Ik),y=h(u)+\eta,\quad\eta\sim\mathcal{N}(0,\gamma^{2}I_{k}),

where ud,yku\in\mathbb{R}^{d},y\in\mathbb{R}^{k}, and hh is defined by

h(u):=Au+sin(cBu),h(u):=Au+\sin(cBu), (5.8)

where A,Bk×dA,B\in\mathbb{R}^{k\times d} are random matrices with independent 𝒩(0,1)\mathcal{N}(0,1) entries. We set d=200d=200, k=150k=150 and c=20c=20. We want to recover uu from yy. We assume that the unknown uu has a Gaussian prior u𝒩(0,4Id)u\sim\mathcal{N}(0,4I_{d}). The true parameter uu^{\dagger} is set to be 2𝟏2\cdot\bf{1}, where 𝟏\bm{1} is the all-one vector. Observation data is generated as y=h(u)+ηy=h(u^{\dagger})+\eta.
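The forward map and synthetic data can be reproduced with a few lines of NumPy; note that the random matrices A and B (and hence the data) depend on the seed, which is our choice here.

import numpy as np

d, k, c = 200, 150, 20
rng = np.random.default_rng(0)
A = rng.standard_normal((k, d))
B = rng.standard_normal((k, d))

def forward(u):
    """Highly oscillatory forward map (5.8): h(u) = Au + sin(c * Bu)."""
    return A @ u + np.sin(c * (B @ u))

u_true = 2.0 * np.ones(d)
gamma = 0.01
y = forward(u_true) + gamma * rng.standard_normal(k)   # synthetic data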

By definition, h is highly oscillatory, and we may expect that the loss function, either the Tikhonov-Phillips or the data-misfit objective, will have many local minima. Figure 11 visualizes the Tikhonov-Phillips objective function {\mathsf{J}}_{\rm TP}(u) with respect to two randomly chosen coordinates, while the other coordinates are fixed at the value 2. The data-misfit objective exhibits similar behavior.

Figure 11: Tikhonov-Phillips objective function with respect to two randomly chosen coordinates.

5.4.2 Implementation Details and Numerical Results

We set the ensemble size to be N=50N=50. The initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is drawn independently from the prior. The length-step α\alpha is fixed to be 0.05 for all methods. We set γ=0.01\gamma=0.01 for the observation noise. We run each algorithm up to time T=30T=30, which corresponds to 600 iterations.

We notice that this is a difficult problem, due to its high dimensionality and nonlinearity. No method except IEKF-SL is capable of reconstructing the truth. In particular, Figures 12, 13 and 14 show that both EKI and TEKI do a poor job, with relative error larger than 1, while IEKF and EKI-SL perform slightly better. It is worth noticing from Figure 14 that IEKF-SL attains a larger Tikhonov-Phillips objective value, yet a much lower relative error. This may suggest that regularization objectives other than the Tikhonov-Phillips objective could be used to solve this problem.

Figure 12: Ensemble mean (red) at the final iteration, with 10- and 90-quantiles (blue). Panels: (a) truth u^{\dagger}; (b) EKI; (c) TEKI; (d) IEKF; (e) EKI-SL; (f) IEKF-SL.
Figure 13: EKI & EKI-SL: Relative errors and data misfit w.r.t time tt.
Figure 14: IEKF, TEKI & IEKF-SL: Relative errors and Tikhonov-Phillips objective w.r.t time tt.

6 Conclusions and Open Directions

In this paper we have provided a unified perspective of iterative ensemble Kalman methods and introduced some new variants. We hope that our work will stimulate further research in this active area, and we conclude with a list of some open directions.

  • Continuum limits have been formally derived in our work. The rigorous derivation and analysis of SDE continuum limits, possibly in nonlinear settings, deserves further research.

  • We have advocated the analysis of continuum limits for the understanding and design of iterative ensemble Kalman methods, but it would also be desirable to develop a framework for the analysis of discrete, implementable algorithms, and to further understand the potential benefits of various discretizations of a given continuum SDE system.

  • From a theoretical viewpoint, it would be desirable to further analyze the convergence and stability of iterative ensemble methods with small or moderate ensemble size. While mean-field limits can be revealing, in practice the ensemble size is often not sufficiently large to justify the mean-field assumption. It would also be important to further analyze these questions in mildly nonlinear settings.

  • An important methodological question, still largely unresolved, is the development of adaptive and easily implementable line search schemes and stopping criteria for ensemble-based optimization schemes. An important work in this direction is [32].

  • Another avenue for future methodological research is the development of iterative ensemble Kalman methods that are sparse-promoting, considering alternative regularizations beyond the least-squares objectives considered in our paper [37, 38, 53].

  • Ensemble methods are cheap in comparison to derivative-based optimization methods and Markov chain Monte Carlo sampling algorithms. It is thus natural to use ensemble methods to build ensemble preconditioners or surrogate models to be used within more expensive but accurate computational approaches.

  • Finally, a broad area for further work is the application of iterative ensemble Kalman filters to new problems in science and engineering.

Acknowledgments

NKC is supported by KAUST baseline funding. DSA is thankful for the support of NSF and NGA through the grant DMS-2027056. The work of DSA was also partially supported by the NSF Grant DMS-1912818/1912802.

References

  • [1] S. I. Aanonsen, G. Nævdal, D. S. Oliver, A. C. Reynolds, B. Vallès and others. The ensemble Kalman filter in reservoir engineering - a review. SPE Journal, 14(3), 393–412, 2009.
  • [2] D. J. Albers, P.-A. Blancquart, M. E. Levine, E. E. Seylabi and A. M. Stuart. Ensemble Kalman methods with constraints. Inverse Problems, 35(9), 2019.
  • [3] J. L. Anderson. An ensemble adjustment Kalman filter for data assimilation. Monthly Weather Review, 129(12), 2884–2903, 2001.
  • [4] B. D. O. Anderson and J. B. Moore. Optimal Filtering. Information and System Sciences Series, 1979.
  • [5] B. M. Bell. The iterated Kalman smoother as a Gauss-Newton method. SIAM Journal on Optimization, 4, 626–636, 1994.
  • [6] B. M. Bell and F. W. Cathey. The iterated Kalman filter update as a Gauss-Newton method. IEEE Transactions on Automatic Control, 38(2), 294–297, 1993.
  • [7] D. P. Bertsekas. Incremental least squares methods and the extended Kalman filter. SIAM Journal on Optimization, 6, 807–822, 1996.
  • [8] D. Blömker, C. Schillings, P. Wacker and S. Weissmann. Well posedness and convergence analysis of the ensemble Kalman inversion. Inverse Problems, 2019.
  • [9] D. Blömker, C. Schillings and P. Wacker. A strongly convergent numerical scheme from ensemble Kalman inversion. SIAM Journal on Numerical Analysis, 56(4), 2537–2562, 2018.
  • [10] N. K. Chada. Analysis of hierarchical ensemble Kalman inversion. arXiv preprint arXiv:1801.00847, 2018.
  • [11] N. K. Chada, M. A. Iglesias, L. Roininen and A. M. Stuart. Parameterizations for ensemble Kalman inversion. Inverse Problems, 34(5), 2018.
  • [12] N. K. Chada, C. Schillings and S. Weissmann. On the incorporation of box-constraints for ensemble Kalman inversion. arXiv preprint arXiv:1908.00696, 2019.
  • [13] N. K. Chada, A. M. Stuart and X. T. Tong. Tikhonov regularization within ensemble Kalman inversion. SIAM Journal on Numerical Analysis, 58(2), 1263–1294, 2020.
  • [14] N. K. Chada and X. T. Tong. Convergence acceleration of ensemble Kalman inversion in nonlinear settings. arXiv preprint arXiv:1911.02424, 2019.
  • [15] Y. Chen and D. S. Oliver. Ensemble randomized maximum likelihood method as an iterative ensemble smoother. Mathematical Geosciences, 2012.
  • [16] J. E. Dennis and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM, 1996.
  • [17] Z. Ding and Q. Li. Ensemble Kalman sampling: mean-field limit and convergence analysis. arXiv preprint arXiv:1910.12923, 2019.
  • [18] A. Emerick and A. C. Reynolds. Ensemble smoother with multiple data assimilation. Computers and Geosciences, 55, 3–15, 2013.
  • [19] E. Haber, F. Lucka and L. Ruthotto. Never look back - A modified EnKF method and its application to the training of neural networks without back propagation. arXiv preprint arXiv:1805.08034, 2018.
  • [20] O. G. Ernst, B. Sprungk and H. J. Starkloff. Analysis of the ensemble and polynomial chaos Kalman filters in Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification, 3(1), 823–851, 2015.
  • [21] G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer, 2009.
  • [22] G. Evensen and P. J. Van Leeuwen. Assimilation of Geosat altimeter data for the Agulhas current using the ensemble Kalman filter with a quasi-geostrophic model. Monthly Weather Review, 128, 85–86, 1996.
  • [23] A. Garbuno-Inigo, F. Hoffmann, W. Li and A. M. Stuart. Interacting Langevin diffusions: gradient structure and ensemble Kalman sampler. arXiv preprint arXiv:1903.08866, 2019.
  • [24] Y. Gu and D. S. Oliver. An iterative ensemble Kalman filter for multiphase fluid flow data assimilation. SPE Journal, 12(4), 438–446, 2007.
  • [25] E. Haber, F. Lucka and L. Ruthotto. Never look back - A modified EnKF method and its application to the training of neural networks without back propagation. arXiv preprint arXiv:1805.08034, 2018.
  • [26] M. Hanke. A regularizing Levenberg-Marquardt scheme, with applications to inverse groundwater filtration problems. Inverse Problems, 13, 79–95, 1997.
  • [27] M. Herty and G. Visconti. Kinetic methods for inverse problems. arXiv preprint arXiv:1811.09387, 2018.
  • [28] I. Grooms. A note on the formulation of the Ensemble Adjustment Kalman Filter. arXiv preprint arXiv:2006.02941, 2020.
  • [29] P. A. Guth, C. Schillings and S. Weissmann. Ensemble Kalman filter for neural network based one-shot inversion. arXiv preprint arXiv:2005.02039, 2020.
  • [30] M. A. Iglesias. A regularising iterative ensemble Kalman method for PDE-constrained inverse problems. Inverse Problems, 32, 2016.
  • [31] M. A. Iglesias, K. J. H. Law and A. M. Stuart. Ensemble Kalman methods for inverse problems. Inverse Problems, 29, 2013.
  • [32] M. A. Iglesias and Y. Yang. Adaptive regularisation for ensemble Kalman inversion with applications to non-destructive testing and imaging. arXiv preprint arXiv:2006.14980, 2020.
  • [33] A. H. Jazwinski. Stochastic Processes and Filtering Theory. Courier Corporation, 2007.
  • [34] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35–45, 1960.
  • [35] D. Kelly, K. J. H. Law and A. M. Stuart. Well-posedness and accuracy of the ensemble Kalman filter in discrete and continuous time. Nonlinearity, 27(10), 2014.
  • [36] B. Kaltenbacher, A. Neubauer and O. Scherzer. Iterative Regularization Methods for Nonlinear Ill-Posed Problems. Radon Series on Computational and Applied Mathematics, de Gruyter, Berlin, 1st edition, 2008.
  • [37] N. B. Kovachki and A. M. Stuart. Ensemble Kalman inversion: a derivative-free technique for machine learning tasks. Inverse Problems, 2019.
  • [38] Y. Lee. l_p regularization for ensemble Kalman inversion. arXiv preprint arXiv:2009.03470, 2020.
  • [39] G. Li and A. C. Reynolds. An iterative ensemble Kalman filter for data assimilation. In SPE Annual Technical Conference and Exhibition. Society of Petroleum Engineers, 2007.
  • [40] R. J. Lorentzen, K.-K. Fjelde, J. Froyen, A. C. Lage, G. Nævdal and E. H. Vefring. Underbalanced drilling: real time data interpretation and decision support. In SPE/IADC Drilling Conference. Society of Petroleum Engineers, 2001.
  • [41] S. Lu and S. V. Pereverzev. Multi-parameter regularization and its numerical realization. Numerische Mathematik, 118(1), 1–31, 2011.
  • [42] A. J. Majda and J. Harlim. Filtering Complex Turbulent Systems. Cambridge University Press, 2012.
  • [43] G. Nævdal, T. Mannseth and E. H. Vefring. Instrumented wells and near-well reservoir monitoring through ensemble Kalman filter. In Proceedings of the 8th European Conference on the Mathematics of Oil Recovery, Freiberg, Germany, 2001.
  • [44] A. S. Nemirovsky. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics, 1983.
  • [45] J. Nocedal and S. Wright. Numerical Optimization. Springer Science and Business Media, 2006.
  • [46] D. Oliver, A. C. Reynolds and N. Liu. Inverse Theory for Petroleum Reservoir Characterization and History Matching. Cambridge University Press, 1st edition, 2008.
  • [47] S. Reich and C. J. Cotter. Ensemble filter techniques for intermittent data assimilation. Large Scale Inverse Problems. Computational Methods and Applications in the Earth Sciences, 13, 91–134, 2013.
  • [48] A. C. Reynolds, M. Zafari and G. Li. Iterative forms of the ensemble Kalman filter. In ECMOR X - 10th European Conference on the Mathematics of Oil Recovery, pages cp–23. European Association of Geoscientists & Engineers, 2006.
  • [49] P. Sakov, D. S. Oliver and L. Bertino. An iterative EnKF for strongly nonlinear systems. Monthly Weather Review, 140, 1988–2004, 2012.
  • [50] D. Sanz-Alonso, A. M. Stuart and A. Taeb. Inverse Problems and Data Assimilation. arXiv preprint arXiv:1810.06191, 2019.
  • [51] C. Schillings and A. M. Stuart. Analysis of the ensemble Kalman filter for inverse problems. SIAM Journal on Numerical Analysis, 55(3), 1264–1290, 2017.
  • [52] C. Schillings and A. M. Stuart. Convergence analysis of ensemble Kalman inversion: the linear, noisy case. Applicable Analysis, 97(1), 107–123, 2018.
  • [53] T. Schneider, A. M. Stuart and J.-L. Wu. Imposing sparsity within ensemble Kalman inversion. arXiv preprint arXiv:2007.06175, 2020.
  • [54] B. Shi, S. Du, M. I. Jordan and W. J. Su. Understanding the acceleration phenomenon via high-resolution differential equations. arXiv preprint arXiv:1810.08907, 2018.
  • [55] J. A. Skjervheim and G. Evensen. An ensemble smoother for assisted history matching. SPE Reservoir Simulation Symposium, 21–23 February, The Woodlands, Texas, 2001.
  • [56] W. Su, S. Boyd and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Advances in Neural Information Processing Systems, 2510–2518, 2014.
  • [57] W. Su, S. Boyd and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. The Journal of Machine Learning Research, 17(1), 5312–5354, 2016.
  • [58] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, 2015.
  • [59] M. K. Tippett, J. L. Anderson, C. H. Bishop and J. S. Whitaker. Ensemble square root filters. Monthly Weather Review, 131(7), 1485–1490, 2003.
  • [60] S. Ungarala. On the iterated forms of Kalman filters using statistical linearization. Journal of Process Control, 22(5), 935–943, 2012.
  • [61] A. Wibisono, A. C. Wilson and M. I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47), E7351–E7358, 2016.