

Iterative Ensemble Kalman Methods:
A Unified Perspective with Some New Variants

Neil K. Chada, Yuming Chen and Daniel Sanz-Alonso
Applied Mathematics and Computational Science, King Abdullah University of Science and Technology.
Department of Statistics, University of Chicago.
Abstract

This paper provides a unified perspective of iterative ensemble Kalman methods, a family of derivative-free algorithms for parameter reconstruction and other related tasks. We identify, compare and develop three subfamilies of ensemble methods that differ in the objective they seek to minimize and the derivative-based optimization scheme they approximate through the ensemble. Our work emphasizes two principles for the derivation and analysis of iterative ensemble Kalman methods: statistical linearization and continuum limits. Following these guiding principles, we introduce new iterative ensemble Kalman methods that show promising numerical performance in Bayesian inverse problems, data assimilation and machine learning tasks.

1 Introduction

This paper provides an accessible introduction to the derivation and foundations of iterative ensemble Kalman methods, a family of derivative-free algorithms for parameter reconstruction and other related tasks. The overarching theme behind these methods is to iteratively update via Kalman-type formulae an ensemble of candidate reconstructions, aiming to bring the ensemble closer to the unknown parameter with each iteration. The ensemble Kalman updates approximate derivative-based nonlinear least-squares optimization schemes without requiring gradient evaluations. Our presentation emphasizes that iterative ensemble Kalman methods can be naturally classified in terms of the nonlinear least-squares objective they seek to minimize and the derivative-based optimization scheme they approximate through the ensemble. This perspective allows us to identify three subfamilies of iterative ensemble Kalman methods, bringing unity to the growing literature on this subject. Our work also emphasizes two principles for the derivation and analysis of iterative ensemble Kalman methods: statistical linearization and continuum limits. Following these principles, we introduce new iterative ensemble Kalman methods that show promising numerical performance in Bayesian inverse problems, data assimilation and machine learning tasks.

We consider the application of iterative ensemble Kalman methods to the problem of reconstructing an unknown u\in\mathbb{R}^{d} from corrupt data y\in\mathbb{R}^{k} related by

y=h(u)+\eta, (1.1)

where \eta represents measurement or model error and h is a given map. A wide range of inverse problems, data assimilation and machine learning tasks can be cast into the framework (1.1). In these applications the unknown u may represent, for instance, an input parameter of a differential equation, the current state of a time-evolving signal, or a regressor, respectively. Ensemble Kalman methods were first introduced as filtering schemes for sequential data assimilation [21, 22, 42, 47, 50] to reduce the computational cost of the Kalman filter [34]. Their use for state and parameter estimation and inverse problems was further developed in [3, 40, 43, 55]. The idea of iterating these methods was considered in [15, 18]. Iterative ensemble Kalman methods are now popular in inverse problems and data assimilation; they have also shown some potential in machine learning applications [25, 29, 37].

Starting from an initial ensemble \{u_{0}^{(n)}\}_{n=1}^{N}, iterative ensemble Kalman methods use various ensemble-based empirical means and covariances to update

\{u_{i}^{(n)}\}_{n=1}^{N}\to\{u_{i+1}^{(n)}\}_{n=1}^{N}, (1.2)

until a stopping criterion is satisfied; the unknown parameter u is reconstructed by the mean of the final ensemble. The idea is analogous to classical Kalman methods and optimization schemes which, starting from a single initialization u_{0}, use evaluations of derivatives of h to iteratively update u_{i}\to u_{i+1} until a stopping criterion is met. The initial ensemble is viewed as an input to the algorithm, obtained in a problem-dependent fashion. In Bayesian inverse problems and machine learning it may be obtained by sampling a prior, while in data assimilation the initial ensemble may be a given collection of particles that approximates the prediction distribution. In either case, it is useful to view the initial ensemble as a sample from a probability distribution.

There are two main computational benefits in updating an ensemble of candidate reconstructions rather than a single estimate. First, the ensemble update can be performed without evaluating derivatives of h, effectively approximating them using statistical linearization. This is important in applications where computing derivatives of h is expensive, or where the map h needs to be treated as a black-box. Second, the use of empirical rather than model covariances can significantly reduce the computational cost whenever the ensemble size N is smaller than the dimension d of the unknown parameter u. Another key advantage of the ensemble approach is that, for problems that are not strongly nonlinear, the spread of the ensemble may contain meaningful information on the uncertainty in the reconstruction.

1.1 Overview: Three Subfamilies

This paper identifies, compares and further develops three subfamilies of iterative ensemble Kalman methods to implement the ensemble update (1.2). Each subfamily employs a different Kalman-type formula, determined by a choice of objective to minimize and a derivative-based optimization scheme to approximate with the ensemble. All three approaches impose some form of regularization, either explicitly through the choice of the objective, or implicitly through the choice of the optimization scheme. Incorporating regularization is essential in parameter reconstruction problems encountered in applications, which are typically under-determined or ill-posed [41, 50].

The first subfamily considers a Tikhonov-Phillips objective associated with the parameter reconstruction problem (1.1), given by

{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u):=\frac{1}{2}|y-h(u)|^{2}_{R}+\frac{1}{2}|u-m|^{2}_{P}, (1.3)

where R and P are symmetric positive definite matrices that model, respectively, the data measurement precision and the level of regularization (incorporated explicitly through the choice of objective), and m represents a background estimate of u. Here and throughout this paper we use the notation |v|^{2}_{A}:=|A^{-1/2}v|^{2}=v^{T}A^{-1}v for symmetric positive definite A and vector v. The ensemble is used to approximate a Gauss-Newton method applied to the Tikhonov-Phillips objective {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}. Algorithms in this subfamily were first introduced in geophysical data assimilation [1, 15, 18, 24, 39, 48] and were inspired by iterative, derivative-based, extended Kalman filters [5, 6, 33]. Extensions to more challenging problems with strongly nonlinear dynamics are considered in [49, 60]. In this paper we will use a new Iterative Ensemble Kalman Filter (IEKF) method as a prototypical example of an algorithm that belongs to this subfamily.

The second subfamily considers the data-misfit objective

{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}(u):=\frac{1}{2}|y-h(u)|^{2}_{R}. (1.4)

When the parameter reconstruction problem is ill-posed, minimizing {\mathsf{J}}_{\mbox{\tiny{\rm DM}}} leads to unstable reconstructions. For this reason, iterative ensemble Kalman methods in this subfamily are complemented with a Levenberg-Marquardt optimization scheme that implicitly incorporates regularization. The ensemble is used to approximate a regularizing Levenberg-Marquardt optimization algorithm to minimize {\mathsf{J}}_{\mbox{\tiny{\rm DM}}}. Algorithms in this subfamily were introduced in the applied mathematics literature [30, 31] building on ideas from classical inverse problems [26]. Recent theoretical work has focused on developing continuous-time and mean-field limits, as well as various convergence results [8, 9, 14, 27, 37, 51]. Methodological extensions based on Bayesian hierarchical techniques were introduced in [10, 11] and the incorporation of constraints has been investigated in [2, 12]. In this paper we will use the Ensemble Kalman Inversion (EKI) method [31] as a prototypical example of an algorithm that belongs to this subfamily.

Objective   Optimization   Derivative Method   Ensemble Method   New Variant
{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}   GN   IExKF (2.1)   IEKF (3.1)   IEKF-SL (4.1)
{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}   LM   LM-DM (2.2)   EKI (3.2)   EKI-SL (4.2)
{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}   LM   LM-TP (2.3)   TEKI (3.3)   TEKI-SL (4.3)

Table 1: Roadmap to the algorithms considered in this paper. We use the abbreviations GN and LM for Gauss-Newton and Levenberg-Marquardt. The numbers in parentheses represent the subsection in which each algorithm is introduced.

The third subfamily, which has emerged more recently, combines explicit regularization through the Tikhonov-Phillips objective and an implicitly regularizing optimization scheme [13, 14]. Precisely, a Levenberg-Marquardt scheme is approximated through the ensemble in order to minimize the Tikhonov-Phillips objective {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}. In this paper we will use the Tikhonov Ensemble Kalman Inversion (TEKI) method [13] as a prototypical example.

To conclude this overview we note that while in this paper we will only consider least-squares objectives, iterative ensemble Kalman methods that use other regularizers have been recently proposed [37, 38, 53].

1.2 Statistical Linearization, Continuum Limits and New Variants

Each subfamily of iterative ensemble Kalman methods stems from a derivative-based optimization scheme. However, there is substantial freedom as to how to use the ensemble to approximate a derivative-based method. We will focus on randomized-maximum likelihood implementations [24, 35, 50], rather than square-root or ensemble adjustment approaches [3, 59, 28]. Two principles will guide our derivation and analysis of ensemble methods: the use of statistical linearization [60] and their connection with gradient descent methods through the study of continuum limits [51].

The idea behind statistical linearization is to approximate the gradient of h using pairs \{(u_{i}^{(n)},h(u_{i}^{(n)}))\}_{n=1}^{N} in such a way that if h is linear and the ensemble size N is sufficiently large, the approximation is exact. As we shall see, this idea tacitly underlies the derivation of all the ensemble methods considered in this paper, and will be explicitly employed in our derivation of new variants. Statistical linearization has also been used within unscented Kalman filters, see e.g. [60].

Differential equations have long been important in developing and understanding optimization schemes [44], and investigating the connections between differential equations and optimization is still an active area of research [54, 56, 57, 61]. In the context of iterative ensemble methods, continuum limit analyses arise from considering small length-steps and have been developed primarily in the context of EKI-type algorithms [8, 52]. While the derivative-based algorithms that motivate the ensemble methods result in an ODE continuum limit, the ensemble versions lead to a system of SDEs. Continuum limit analyses are useful in at least three ways. First, they unveil the gradient structure of the optimization schemes. Second, viewing optimization schemes as arising from discretizations of SDEs lends itself to the design of algorithms that are easy to tune: the length-step is chosen to be small and the algorithms are run until statistical equilibrium is reached. Third, a simple linear-case analysis of the SDEs may be used to develop new algorithms that satisfy certain desirable properties. Our new iterative ensemble Kalman methods will be designed following these observations.

While our work advocates the study of continuum limits as a useful tool to design ensemble methods, continuum limits cannot capture the full richness and flexibility of discrete, implementable algorithms, since different algorithms may result in the same SDE continuum limit. This insight suggests that it is not only the study of differential equations, but also of their discretizations, that may contribute to the design of iterative ensemble Kalman algorithms.

1.3 Main Contributions and Outline

In addition to providing a unified perspective of the existing literature, this paper contains several original contributions. We highlight some of them in the following outline and refer to Table 1 for a summary of the algorithms considered in this paper.

  • In Section 2 we review three iterative derivative-based methods for nonlinear least-squares optimization. The ensemble-based algorithms studied in subsequent sections can be interpreted as ensemble-based approximations of the derivative-based methods described in this section. We also derive informally ODE continuum limits for each method, which unveils their gradient flow structure.

  • In Section 3 we describe the idea of statistical linearization. We review three subfamilies of iterative ensemble methods, each of which has an update formula analogous to one of the derivative-based methods in Section 2. We analyze the methods when h(u)=Hu is linear by formally deriving SDE continuum limits that unveil their gradient structure. A novelty in this section is the introduction of the IEKF method, which is similar to, but different from, the iterative ensemble method introduced in [60].

  • The material in Section 4 is novel to the best of our knowledge. We introduce new variants of the iterative ensemble Kalman methods discussed in Section 3 and formally derive their SDE continuum limits. We analyze the resulting SDEs when h(u)=Hu is linear. The proposed methods are designed to ensure that (i) no parameter tuning or careful stopping criteria are needed; and (ii) the ensemble covariance contains meaningful information on the uncertainty in the reconstruction in the linear case, avoiding the ensemble collapse of some existing methods.

  • In Section 5 we include an in-depth empirical comparison of the performance of the iterative ensemble Kalman methods discussed in Sections 3 and 4. We consider four problem settings motivated by applications in Bayesian inverse problems, data assimilation and machine learning. Our results illustrate the different behavior of some methods in small noise regimes and the benefits of avoiding ensemble collapse.

  • Section 6 concludes and suggests some open directions for further research.

2 Derivative-based Optimization for Nonlinear Least-Squares

In this section we review the Gauss-Newton and Levenberg-Marquardt methods, two derivative-based approaches for optimization of nonlinear least-squares objectives of the form

{\mathsf{J}}(u)=\frac{1}{2}|r(u)|^{2}. (2.1)

We derive closed formulae for the Gauss-Newton method applied to the Tikhonov-Phillips objective {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}, as well as for the Levenberg-Marquardt method applied to the data-misfit objective {\mathsf{J}}_{\mbox{\tiny{\rm DM}}} and the Tikhonov-Phillips objective {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}. These formulae are the basis for the ensemble, derivative-free methods considered in the next section.

As we shall see, the search directions of Gauss-Newton and Levenberg-Marquardt methods are found by minimizing a linearization of the least-squares objective. It is thus instructive to consider first linear least-squares optimization before delving into the nonlinear setting. The following well-known result, that we will use extensively, characterizes the minimizer \mu of the Tikhonov-Phillips objective {\mathsf{J}}_{\mbox{\tiny{\rm TP}}} in the case of linear h(u)=Hu.

Lemma 2.1.

It holds that

\frac{1}{2}|y-Hu|_{R}^{2}+\frac{1}{2}|u-m|^{2}_{P}=\frac{1}{2}|u-\mu|_{C}^{2}+\beta, (2.2)

where \beta does not depend on u, and

C^{-1}=H^{T}R^{-1}H+P^{-1}, (2.3)
C^{-1}\mu=H^{T}R^{-1}y+P^{-1}m. (2.4)

Equivalently,

\mu=m+K(y-Hm), (2.5)
C=(I-KH)P, (2.6)

where K is the Kalman gain matrix given by

K=PH^{T}(HPH^{T}+R)^{-1}=CH^{T}R^{-1}. (2.7)
Proof.

The formulae (2.3) and (2.4) follow by matching linear and quadratic coefficients in u between

\frac{1}{2}|u-\mu|^{2}_{C}\quad\quad\text{and}\quad\quad\frac{1}{2}|u-m|^{2}_{P}+\frac{1}{2}|y-Hu|^{2}_{R}. (2.8)

The formulae (2.5) and (2.6), as well as the equivalent expressions for the Kalman gain K in Equation (2.7), can be obtained using the matrix inversion lemma [50]. ∎
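The two characterizations in Lemma 2.1 are straightforward to check numerically. The following minimal NumPy sketch (with arbitrary illustrative dimensions and randomly generated H, P, R, m and y; none of these values come from the paper) verifies that the precision form (2.3)-(2.4) and the Kalman-gain form (2.5)-(2.7) produce the same \mu and C.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3                          # illustrative dimensions, not from the paper
H = rng.standard_normal((k, d))
A = rng.standard_normal((d, d)); P = A @ A.T + np.eye(d)    # prior covariance (SPD)
B = rng.standard_normal((k, k)); R = B @ B.T + np.eye(k)    # noise covariance (SPD)
m = rng.standard_normal(d)           # prior mean
y = rng.standard_normal(k)           # data

# Precision form (2.3)-(2.4): C^{-1} = H^T R^{-1} H + P^{-1}, C^{-1} mu = H^T R^{-1} y + P^{-1} m
C_inv = H.T @ np.linalg.inv(R) @ H + np.linalg.inv(P)
C1 = np.linalg.inv(C_inv)
mu1 = C1 @ (H.T @ np.linalg.inv(R) @ y + np.linalg.inv(P) @ m)

# Kalman-gain form (2.5)-(2.7): K = P H^T (H P H^T + R)^{-1}, mu = m + K(y - Hm), C = (I - KH) P
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
mu2 = m + K @ (y - H @ m)
C2 = (np.eye(d) - K @ H) @ P

assert np.allclose(mu1, mu2) and np.allclose(C1, C2)
assert np.allclose(K, C1 @ H.T @ np.linalg.inv(R))          # second identity in (2.7)
```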

Bayesian Interpretation

Lemma 2.1 has a natural statistical interpretation. Consider a statistical model defined by likelihood y|u\sim\mathcal{N}(Hu,R) and prior u\sim\mathcal{N}(m,P). Then Equation (2.2) shows that the posterior distribution is Gaussian, u|y\sim\mathcal{N}(\mu,C), and Equations (2.3)-(2.4) characterize the posterior mean and precision (inverse covariance). We interpret Equation (2.5) as providing a closed formula for the posterior mode, known as the maximum a posteriori (MAP) estimator.

More generally, the generative model

u𝒩(m,P),y|u𝒩(h(u),R),\displaystyle\begin{split}u&\sim\mathcal{N}(m,P),\\ y|u&\sim\mathcal{N}(h(u),R),\end{split} (2.9)

gives rise to a posterior distribution on u|y with density proportional to \exp(-{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u)). Thus, minimization of {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u) corresponds to maximizing the posterior density under the model (2.9).

2.1 Gauss-Newton Optimization of Tikhonov-Phillips Objective

In this subsection we introduce two ways of writing the Gauss-Newton update applied to the Tikhonov-Phillips objective 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}. We recall that the Gauss-Newton method applied to a general least-squares objective 𝖩(u)=12|r(u)|2{\mathsf{J}}(u)=\frac{1}{2}|r(u)|^{2} is a line-search method which, starting from an initialization u0,u_{0}, sets

ui+1=ui+αivi,i=0,1,\displaystyle u_{i+1}=u_{i}+\alpha_{i}v_{i},\quad\quad i=0,1,\ldots

where viv_{i} is a search direction defined by

vi=argminv𝖩i(v),𝖩i(v):=12|r(ui)v+r(ui)|2,\displaystyle v_{i}=\arg\min_{v}{\mathsf{J}}_{i}^{\mbox{\tiny{$\ell$}}}(v),\quad\quad{\mathsf{J}}_{i}^{\mbox{\tiny{$\ell$}}}(v):=\frac{1}{2}|r^{\prime}(u_{i})v+r(u_{i})|^{2}, (2.10)

where αi>0\alpha_{i}>0 is a length-step parameter whose choice will be discussed later. In order to apply the Gauss-Newton method to the Tikhonov-Phillips objective, we write 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}} in the standard nonlinear least-squares form (2.1). Note that

{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u)=\frac{1}{2}|u-m|^{2}_{P}+\frac{1}{2}|y-h(u)|^{2}_{R}=\frac{1}{2}|z-g(u)|^{2}_{Q},

where

z:=\begin{bmatrix}y\\ m\end{bmatrix},\quad\quad\quad g(u):=\begin{bmatrix}h(u)\\ u\end{bmatrix},\quad\quad\quad Q:=\begin{bmatrix}R&0\\ 0&P\end{bmatrix}.

Therefore we have

{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u)=\frac{1}{2}|r_{\mbox{\tiny{\rm TP}}}(u)|^{2},\quad\quad r_{\mbox{\tiny{\rm TP}}}(u):=Q^{-1/2}(z-g(u)). (2.11)
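As a sanity check of the reformulation (2.11), the following sketch confirms that stacking the data and the background into (z, g, Q) reproduces the Tikhonov-Phillips objective. The toy nonlinear map and the dimensions are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 3, 2                                                   # illustrative dimensions
h = lambda u: np.array([np.sin(u[0]) + u[1], u[1] * u[2]])    # toy nonlinear forward map
u = rng.standard_normal(d)
y = rng.standard_normal(k)
m = rng.standard_normal(d)
R = 0.5 * np.eye(k)
P = 2.0 * np.eye(d)

def sq_norm(v, A):
    """Weighted squared norm |v|_A^2 = v^T A^{-1} v."""
    return v @ np.linalg.solve(A, v)

# Tikhonov-Phillips objective, written directly as in (1.3)
J_TP = 0.5 * sq_norm(y - h(u), R) + 0.5 * sq_norm(u - m, P)

# The same objective via the stacked variables z, g, Q of (2.11)
z = np.concatenate([y, m])
g = np.concatenate([h(u), u])
Q = np.block([[R, np.zeros((k, d))], [np.zeros((d, k)), P]])
J_stacked = 0.5 * sq_norm(z - g, Q)

assert np.isclose(J_TP, J_stacked)
```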

The following result is a direct consequence of Lemma 2.1.

Lemma 2.2 ([6]).

The Gauss-Newton method applied to the Tikhonov-Phillips objective 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}} admits the characterizations:

ui+1=ui+αiCi{HiTR1(yh(ui))+P1(mui)},u_{i+1}=u_{i}+\alpha_{i}C_{i}\Bigl{\{}H_{i}^{T}R^{-1}\bigl{(}y-h(u_{i})\bigr{)}+P^{-1}(m-u_{i})\Bigr{\}}, (2.12)

and

ui+1=ui+αi{Ki(yh(ui))+(IKiHi)(mui)},u_{i+1}=u_{i}+\alpha_{i}\Bigl{\{}K_{i}\bigl{(}y-h(u_{i})\bigr{)}+(I-K_{i}H_{i})(m-u_{i})\Bigl{\}}, (2.13)

where Hi=h(ui)H_{i}=h^{\prime}(u_{i}) and

Ki\displaystyle K_{i} =PHiT(HiPHiT+R)1,\displaystyle=PH_{i}^{T}(H_{i}PH_{i}^{T}+R)^{-1},
Ci\displaystyle C_{i} =(IKiHi)P.\displaystyle=(I-K_{i}H_{i})P.
Proof.

The search direction viv_{i} of Gauss-Newton for the objective 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}} is given by

vi\displaystyle v_{i} =argminv𝖩TP,i(v)\displaystyle=\arg\min_{v}{\mathsf{J}}_{\mbox{\tiny{\rm TP,}}i}^{\ell}(v) (2.14)
=argminv12|rTP(ui)v+rTP(ui)|2\displaystyle=\arg\min_{v}\frac{1}{2}\bigl{|}r_{\mbox{\tiny{\rm TP}}}^{\prime}(u_{i})v+r_{\mbox{\tiny{\rm TP}}}(u_{i})\bigr{|}^{2} (2.15)
=argminv12|zg(ui)g(ui)v|Q\displaystyle=\arg\min_{v}\frac{1}{2}\bigl{|}z-g(u_{i})-g^{\prime}(u_{i})v\bigr{|}_{Q} (2.16)
=argminv{12|yh(ui)h(ui)v|R2+12|v(mui)|P2}.\displaystyle=\arg\min_{v}\biggl{\{}\frac{1}{2}\bigl{|}y-h(u_{i})-h^{\prime}(u_{i})v\bigr{|}^{2}_{R}+\frac{1}{2}\bigl{|}v-(m-u_{i})\bigr{|}_{P}^{2}\biggr{\}}. (2.17)

Applying Lemma 2.1, using formulae (2.4) and (2.6), we deduce that

vi=Ci{HiTR1(yh(ui))+P1(mui)},\displaystyle v_{i}=C_{i}\Bigl{\{}H_{i}^{T}R^{-1}\bigl{(}y-h(u_{i})\bigr{)}+P^{-1}(m-u_{i})\Bigr{\}},

which establishes the characterization (2.12). The equivalence between (2.12) and (2.13) follows from the identity (2.7), which implies that CiHiTR1=KiC_{i}H_{i}^{T}R^{-1}=K_{i} and CiP1=IKiHi.C_{i}P^{-1}=I-K_{i}H_{i}.

We refer to the Gauss-Newton method with constant length-step \alpha_{i}=\alpha applied to {\mathsf{J}}_{\mbox{\tiny{\rm TP}}} as the Iterative Extended Kalman Filter (IExKF) algorithm. IExKF was developed in the control theory literature [33] without reference to the Gauss-Newton optimization method; the agreement between the two methods was established in [6]. In order to compare IExKF with an ensemble-based method in Section 3, we summarize it here.

Algorithm 1 Iterative Extended Kalman Filter (IExKF)

Input: Initialization u0=mu_{0}=m, length-step α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=PHiT(HiPHiT+R)1,Hi=h(ui).K_{i}=PH_{i}^{T}(H_{i}PH_{i}^{T}+R)^{-1},\quad\quad H_{i}=h^{\prime}(u_{i}).

  2. 2.

    Set

    ui+1=ui+α{Ki(yh(ui))+(IKiHi)(mui)},u_{i+1}=u_{i}+\alpha\Bigl{\{}K_{i}\big{(}y-h(u_{i})\big{)}+(I-K_{i}H_{i})(m-u_{i})\Bigr{\}}, (2.18)

    or, equivalently,

    ui+1=ui+αCi{HiTR1(yh(ui))+P1(mui)}.u_{i+1}=u_{i}+\alpha C_{i}\Bigl{\{}H_{i}^{T}R^{-1}(y-h(u_{i}))+P^{-1}(m-u_{i})\Bigr{\}}. (2.19)

Output: u1,u2,u_{1},u_{2},\ldots

The next proposition shows that in the linear case, if \alpha is set to 1, IExKF finds the minimizer of the objective (1.3) in one iteration, and further iterations remain at the minimizer.

Proposition 2.3.

Suppose that h(u)=Huh(u)=Hu is linear and α=1\alpha=1. Then the output of Algorithm 1 satisfies

ui=μ,i=1,2,u_{i}=\mu,\quad\quad i=1,2,\ldots

where μ\mu is the minimizer of the Tikhonov-Phillips objective (1.3).

Proof.

In the linear case we have

Hi=H,Ki=K=PHT(HPHT+R)1,i=0,1,H_{i}=H,\quad\quad K_{i}=K=PH^{T}(HPH^{T}+R)^{-1},\quad\quad i=0,1,\ldots

Therefore, update (2.18) simplifies as

ui+1=m+K(yHm),i=0,1,u_{i+1}=m+K(y-Hm),\quad i=0,1,\ldots

This implies that, for all i1,i\geq 1, it holds that ui=μu_{i}=\mu with μ\mu defined in Equation (2.5). ∎
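Proposition 2.3 is easy to illustrate numerically. The sketch below runs the IExKF update (2.18) with \alpha=1 on a toy linear problem of our own choosing and checks that every iterate equals the Tikhonov-Phillips minimizer \mu.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 4, 6                                       # illustrative dimensions
H = rng.standard_normal((k, d))
P = np.eye(d); R = 0.1 * np.eye(k)
m = rng.standard_normal(d)
y = H @ rng.standard_normal(d) + 0.1 * rng.standard_normal(k)

# Minimizer of the Tikhonov-Phillips objective (Lemma 2.1)
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
mu = m + K @ (y - H @ m)

# IExKF with h(u) = Hu and alpha = 1; here H_i = H and K_i = K at every iteration
u = m.copy()
iterates = []
for _ in range(3):
    u = u + (K @ (y - H @ u) + (np.eye(d) - K @ H) @ (m - u))   # update (2.18)
    iterates.append(u.copy())

assert all(np.allclose(ui, mu) for ui in iterates)
```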

Choice of Length-Step

When implementing Gauss-Newton methods, it is standard practice to perform a line search in the direction of v_{i} to adaptively choose the length-step \alpha_{i}. For instance, a common strategy is to guarantee that the Wolfe conditions are satisfied [16, 45]. In this paper we will instead simply set \alpha_{i}=\alpha for some fixed value of \alpha, and we will follow a similar approach for all the derivative-based and ensemble-based algorithms we consider. There are two main motivations for doing so. First, it is appealing from a practical viewpoint to avoid performing a line search for ensemble-based algorithms. Second, when \alpha is small, each derivative-based algorithm we consider can be interpreted as a discretization of an ODE system, while the ensemble-based methods arise as discretizations of SDE systems. These ODEs and SDEs allow us to compare and gain transparent understanding of the gradient structure of the algorithms. They will also allow us to propose some new variants of existing ensemble Kalman methods. We next describe the continuum limit structure of IExKF.

Continuum Limit

It is not hard to check that the term in brackets in the update (2.19)

HiTR1(yh(ui))+P1(mui)H_{i}^{T}R^{-1}(y-h(u_{i}))+P^{-1}(m-u_{i})

is the negative gradient of 𝖩TP(u){\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u), which reveals the following gradient flow structure in the limit of small length-step α:\alpha:

u˙=C(t){h(u(t))TR1(yh(u(t)))+P1(mu(t))}=C(t)𝖩TP(u(t)),\displaystyle\begin{split}\dot{u}&=C(t)\Bigl{\{}h^{\prime}\bigl{(}u(t)\bigr{)}^{T}R^{-1}\bigl{(}y-h\big{(}u(t)\big{)}\bigr{)}+P^{-1}\bigl{(}m-u(t)\bigr{)}\Bigr{\}}\\ &=-C(t){\mathsf{J}}_{\mbox{\tiny{\rm TP}}}^{\prime}\bigr{(}u(t)\bigl{)},\end{split} (2.20)

with preconditioner

C(t):=(h(u(t))TR1h(u(t))+P1)1.C(t):=\biggl{(}h^{\prime}\bigl{(}u(t)\bigr{)}^{T}R^{-1}h^{\prime}\bigl{(}u(t)\bigr{)}+P^{-1}\biggr{)}^{-1}.

We remark that in the linear case, C(t)\equiv C, where C is the posterior covariance given by (2.6), which agrees with the inverse of the Hessian of the Tikhonov-Phillips objective.

2.2 Levenberg-Marquardt Optimization of Data-Misfit Objective

In this subsection we introduce the Levenberg-Marquardt algorithm and describe its application to the data misfit objective 𝖩DM{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}. We recall that the Levenberg-Marquardt method applied to a general least-squares objective 𝖩(u)=12|r(u)|2{\mathsf{J}}(u)=\frac{1}{2}|r(u)|^{2} is a trust region method which, starting from an initialization u0,u_{0}, sets

ui+1=ui+vi,i=0,1,\displaystyle u_{i+1}=u_{i}+v_{i},\quad\quad i=0,1,\ldots

where

vi=argminv𝖩i(v),s.t.|v|P2δi,𝖩i(v):=12|r(ui)v+r(ui)|2.v_{i}=\arg\min_{v}{\mathsf{J}}_{i}^{\mbox{\tiny{$\ell$}}}(v),\quad\text{s.t.}\,\,|v|_{P}^{2}\leq\delta_{i},\quad\quad{\mathsf{J}}_{i}^{\mbox{\tiny{$\ell$}}}(v):=\frac{1}{2}|r^{\prime}(u_{i})v+r(u_{i})|^{2}.

Similar to Gauss-Newton methods, the increment viv_{i} is defined as the minimizer of a linearized objective, but now the minimization is constrained to a ball {|v|P2δi}\{|v|_{P}^{2}\leq\delta_{i}\} in which we trust that the objective can be replaced by its linearization. The increment can also be written as

vi=argminv𝖩iUC  (v),v_{i}=\arg\min_{v}{\mathsf{J}}_{i}^{\mbox{\mbox{\tiny{UC} } }}(v),

where

𝖩iUC  (v)=𝖩i(v)+12αi|v|P2.{\mathsf{J}}_{i}^{\mbox{\mbox{\tiny{UC} } }}(v)={\mathsf{J}}_{i}^{\mbox{\tiny{$\ell$}}}(v)+\frac{1}{2\alpha_{i}}|v|_{P}^{2}. (2.21)

The parameter αi>0\alpha_{i}>0 plays an analogous role to the length-step in Gauss-Newton methods. Note that the Levenberg-Marquardt increment is the unconstrained minimizer of a regularized objective. It is for this reason that we say that Levenberg-Marquardt provides an implicit regularization.

We next consider application of the Levenberg-Marquardt method to the data-misfit objective 𝖩DM,{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}, which we write in standard nonlinear least-squares form:

𝖩DM(u)=12|rDM(u)|2,rDM(u):=R1/2(yh(u)).\displaystyle{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}(u)=\frac{1}{2}|r_{\mbox{\tiny{\rm DM}}}(u)|^{2},\quad\quad r_{\mbox{\tiny{\rm DM}}}(u):=R^{-1/2}\bigl{(}y-h(u)\bigr{)}. (2.22)
Lemma 2.4.

The Levenberg-Marquardt method applied to the data misfit objective 𝖩DM{\mathsf{J}}_{\mbox{\tiny{\rm DM}}} admits the following characterization:

ui+1=ui+Ki{yh(ui)},u_{i+1}=u_{i}+K_{i}\Bigl{\{}y-h(u_{i})\Bigr{\}}, (2.23)

where

Ki=αiPHiT(αiHiPHiT+R)1,Hi=h(ui).K_{i}=\alpha_{i}PH_{i}^{T}(\alpha_{i}H_{i}PH_{i}^{T}+R)^{-1},\quad\quad H_{i}=h^{\prime}(u_{i}).
Proof.

Note that the increment viv_{i} is defined as the unconstrained minimizer of

𝖩DM,iUC  (v)=12|rDM(ui)v+rDM(ui)|2+12αi|v|P2=12|yh(ui)h(ui)v|R2+12αi|v|P2.\displaystyle\begin{split}{\mathsf{J}}_{\mbox{\tiny{\rm DM}},i}^{\mbox{\mbox{\tiny{UC} } }}(v)&=\frac{1}{2}|r_{\mbox{\tiny{\rm DM}}}^{\prime}(u_{i})v+r_{\mbox{\tiny{\rm DM}}}(u_{i})|^{2}+\frac{1}{2\alpha_{i}}|v|^{2}_{P}\\ &=\frac{1}{2}|y-h(u_{i})-h^{\prime}(u_{i})v|^{2}_{R}+\frac{1}{2\alpha_{i}}|v|^{2}_{P}.\end{split} (2.24)

The result follows from Lemma 2.1. ∎

Similar to the previous section, we will focus on implementations with constant length-step αi=α\alpha_{i}=\alpha, which leads to the following algorithm.

Algorithm 2 Iterative Levenberg-Marquardt with Data Misfit (ILM-DM)

Input: Initialization u0=mu_{0}=m, length-step α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=αPHiT(αHiPHiT+R)1,Hi=h(ui).K_{i}=\alpha PH_{i}^{T}(\alpha H_{i}PH_{i}^{T}+R)^{-1},\quad\quad H_{i}=h^{\prime}(u_{i}).

  2. 2.

    Set

    ui+1=ui+Ki{yh(ui)}.u_{i+1}=u_{i}+K_{i}\Bigl{\{}y-h(u_{i})\Bigr{\}}. (2.25)

Output: u1,u2,u_{1},u_{2},\ldots

When \alpha=1, the following linear-case result shows that ILM-DM reaches the minimizer of {\mathsf{J}}_{\mbox{\tiny{\rm TP}}} in one iteration. However, in contrast to IExKF, further iterations of ILM-DM will typically worsen the optimization of {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}, and start moving towards minimizers of {\mathsf{J}}_{\mbox{\tiny{\rm DM}}}.

Proposition 2.5.

Suppose that h(u)=Huh(u)=Hu is linear and α=1\alpha=1. Then the output of Algorithm 2 satisfies

u1=argminu𝖩TP(u),u_{1}=\arg\min_{u}{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u),

where 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}} is the Tikhonov-Phillips objective (1.3).

Proof.

The proof is identical to that of Proposition 2.3, noting that in the linear case ui+1=ui+K(yHui).u_{i+1}=u_{i}+K(y-Hu_{i}).

Example 2.6 (Convergence of ILM-DM with invertible observation map).

Suppose that H\in\mathbb{R}^{d\times d} is invertible and \alpha=1. Then, writing

u_{i+1}=(I-KH)u_{i}+Ky

and noting that \rho(I-KH)<1 [4], it follows that u_{i}\to u^{*}, where u^{*} is the unique solution to y=Hu. That is, the iterates of ILM-DM converge to the unique minimizer of the data misfit objective {\mathsf{J}}_{\mbox{\tiny{\rm DM}}}.
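The following sketch illustrates both Proposition 2.5 and Example 2.6 on a toy linear problem with an invertible observation map and \alpha=1: the first ILM-DM iterate coincides with the Tikhonov-Phillips minimizer, while later iterates drift towards the data-misfit minimizer H^{-1}y. All problem data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
H = rng.standard_normal((d, d)) + 3 * np.eye(d)    # invertible observation map
P = np.eye(d); R = 0.05 * np.eye(d)
m = np.zeros(d)
y = rng.standard_normal(d)

K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)       # Kalman gain with alpha = 1
mu_TP = m + K @ (y - H @ m)                        # Tikhonov-Phillips minimizer
u_star = np.linalg.solve(H, y)                     # data-misfit minimizer H^{-1} y

u = m.copy()
u = u + K @ (y - H @ u)                            # first ILM-DM iterate, update (2.25)
assert np.allclose(u, mu_TP)                       # Proposition 2.5

for _ in range(500):                               # further iterations move towards u*
    u = u + K @ (y - H @ u)
assert np.linalg.norm(u - u_star) < 1e-6           # Example 2.6
```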

Choice of Length-Step

When implementing Levenberg-Marquardt algorithms, the length-step parameter \alpha_{i} is often chosen adaptively, based on the objective. However, similarly to Section 2.1, we fix \alpha_{i}=\alpha to be a small value, which leads to an ODE continuum limit.

Continuum Limit

We notice that, in the limit of small length-step α\alpha, update (2.25) can be written as

ui+1=ui+αPHiTR1{yh(ui)}.u_{i+1}=u_{i}+\alpha PH_{i}^{T}R^{-1}\Bigl{\{}y-h(u_{i})\Bigr{\}}.

The term HiTR1{yh(ui)}H_{i}^{T}R^{-1}\Bigl{\{}y-h(u_{i})\Bigr{\}} is the negative gradient of 𝖩DM(u){\mathsf{J}}_{\mbox{\tiny{\rm DM}}}(u), which reveals the following gradient flow structure

u˙=Ph(u(t))TR1{yh(u(t))}=P𝖩DM(u),\displaystyle\begin{split}\dot{u}&=Ph^{\prime}\bigl{(}u(t)\bigr{)}^{T}R^{-1}\Bigl{\{}y-h\bigl{(}u(t)\bigr{)}\Bigr{\}}\\ &=-P{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}^{\prime}(u),\end{split} (2.26)

where the preconditioner PP is interpreted as the prior covariance in the Bayesian framework.

2.3 Levenberg-Marquardt Optimization of Tikhonov-Phillips Objective

In this subsection we describe the application of the Levenberg-Marquardt algorithm to the Tikhonov-Phillips objective 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}.

Lemma 2.7.

The Levenberg-Marquardt method applied to the Tikhonov-Phillips objective {\mathsf{J}}_{\mbox{\tiny{\rm TP}}} admits the following characterization:

ui+1=ui+Ki{zg(ui)},\displaystyle u_{i+1}=u_{i}+K_{i}\Bigl{\{}z-g(u_{i})\Bigr{\}},

where

Ki=αiPGiT(αiGiPGiT+Q)1,Gi=g(ui).K_{i}=\alpha_{i}PG_{i}^{T}(\alpha_{i}G_{i}PG_{i}^{T}+Q)^{-1},\quad\quad G_{i}=g^{\prime}(u_{i}).
Proof.

Note that the increment viv_{i} is defined as the unconstrained minimizer of

𝖩TP,iUC  (v)\displaystyle{\mathsf{J}}_{\mbox{\tiny{\rm TP}},i}^{\mbox{\mbox{\tiny{UC} } }}(v) =𝖩TP,i(v)+12αi|v|P2\displaystyle={\mathsf{J}}_{\mbox{\tiny{\rm TP,}}i}^{\ell}(v)+\frac{1}{2\alpha_{i}}|v|^{2}_{P} (2.27)
=12|zg(ui)g(ui)v|Q2+12αi|v|P2,\displaystyle=\frac{1}{2}|z-g(u_{i})-g^{\prime}(u_{i})v|_{Q}^{2}+\frac{1}{2\alpha_{i}}|v|^{2}_{P}, (2.28)

which has the same form as Equation (2.24), replacing y with z, h with g, and R with Q. ∎

Setting αi=α\alpha_{i}=\alpha leads to the following algorithm.

Algorithm 3 Iterative Levenberg-Marquardt with Tikhonov-Phillips (ILM-TP)

Input: Initialization u0=mu_{0}=m, length-step α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=αPGiT(αGiPGiT+Q)1,Gi=g(ui).K_{i}=\alpha PG_{i}^{T}(\alpha G_{i}PG_{i}^{T}+Q)^{-1},\quad\quad G_{i}=g^{\prime}(u_{i}).

  2. 2.

    Set

    ui+1=ui+Ki{zg(ui)}.u_{i+1}=u_{i}+K_{i}\Bigl{\{}z-g(u_{i})\Bigr{\}}. (2.29)

Output: u1,u2,u_{1},u_{2},\ldots
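Since ILM-TP is ILM-DM applied to the stacked data (z, g, Q), one update (2.29) can be written by reusing the augmented quantities of Section 2.1. The sketch below is a minimal illustration on a toy nonlinear map with an analytic Jacobian; the map, its Jacobian, the problem sizes and the length-step are our own illustrative choices.

```python
import numpy as np

d, k = 3, 2
alpha = 0.1
h  = lambda u: np.array([np.sin(u[0]) + u[1], u[1] * u[2]])    # toy forward map
dh = lambda u: np.array([[np.cos(u[0]), 1.0, 0.0],
                         [0.0, u[2], u[1]]])                    # its Jacobian h'(u)
m = np.zeros(d); P = np.eye(d)
y = np.array([0.3, -0.2]); R = 0.1 * np.eye(k)

# Augmented data, forward map and noise covariance from Section 2.1
z = np.concatenate([y, m])
Q = np.block([[R, np.zeros((k, d))], [np.zeros((d, k)), P]])

def ilm_tp_step(u):
    g = np.concatenate([h(u), u])
    G = np.vstack([dh(u), np.eye(d)])                           # G_i = g'(u_i)
    K = alpha * P @ G.T @ np.linalg.inv(alpha * G @ P @ G.T + Q)   # gain of Algorithm 3
    return u + K @ (z - g)                                      # update (2.29)

u = m.copy()
for _ in range(200):
    u = ilm_tp_step(u)
print("ILM-TP iterate after 200 steps:", u)
```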

Proposition 2.8.

Suppose that h(u)=Huh(u)=Hu is linear and α=1\alpha=1. The output of Algorithm 3 satisfies

u_{1}=\arg\min_{u}\Bigl\{{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u)+\frac{1}{2}|u-m|_{P}^{2}\Bigr\}. (2.30)
Proof.

The result is a corollary of Proposition 2.5. To see this, note that 𝖩TP(u)=12|zg(u)|Q2{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u)=\frac{1}{2}|z-g(u)|^{2}_{Q} can be viewed as a data-misfit objective with data z,z, forward model g(u)g(u) and observation matrix Q.Q. Then, ILM-TP can be interpreted as applying ILM-DM to 𝖩TP,{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}, and the objective 𝖩TP(u)+12|um|2{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u)+\frac{1}{2}|u-m|^{2} as its Tikhonov-Phillips regularization. ∎

Example 2.9 (Convergence of ILM-TP with linear invertible observation map.).

It is again instructive to consider the case where Hd×dH\in\mathbb{R}^{d\times d} is invertible. Then, following the same reasoning as in Example 2.6 we deduce that the iterates of ILM-TP converge to the unique minimizer of 𝖩TP.{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}.

Continuum Limits

In the limit of small length-step α\alpha, we can derive the gradient flow structure of update (2.29). This is similar to Section 2.2, with 𝖩DM{\mathsf{J}}_{\mbox{\tiny{\rm DM}}} replaced by 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}. Precisely, the gradient flow of ILM-TP is given by

u˙\displaystyle\dot{u} =Pg(u(t))TQ1{zg(u(t))}\displaystyle=Pg^{\prime}\bigl{(}u(t)\bigr{)}^{T}Q^{-1}\Bigl{\{}z-g\bigl{(}u(t)\bigr{)}\Bigr{\}}
=P𝖩TP(u).\displaystyle=-P{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}^{\prime}(u).

3 Ensemble-based Optimization for Nonlinear Least-Squares

In this section we review three subfamilies of iterative methods that update an ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N} employing Kalman-based formulae, where i=0,1,i=0,1,\ldots denotes the iteration index and NN is a fixed ensemble size. Each ensemble member ui(n)u_{i}^{(n)} is updated by optimizing a (random) objective 𝖩i(n){\mathsf{J}}_{i}^{(n)} defined using the current ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N} and/or the initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N}. The optimization is performed without evaluating derivatives by invoking a statistical linearization of a Gauss-Newton or Levenberg-Marquardt algorithm. In analogy with the previous section, the three subfamilies of ensemble methods we consider differ in the choice of the objective and in the choice of the optimization algorithm. Table 2 summarizes the derivative and ensemble methods considered in the previous and the current section.

Objective Optimization Derivative Method Ensemble Method
{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}   GN   IExKF   IEKF
{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}   LM   ILM-DM   EKI
{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}   LM   ILM-TP   TEKI

Table 2: Summary of the main algorithms in Sections 2 and 3.

Given an ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N} we use the following notation for ensemble empirical means

mi\displaystyle m_{i} =1Nn=1Nui(n),hi=1Nn=1Nh(ui(n)),\displaystyle=\frac{1}{N}\sum_{n=1}^{N}u_{i}^{(n)},\quad\quad h_{i}=\frac{1}{N}\sum_{n=1}^{N}h(u_{i}^{(n)}),

and empirical covariances

Piuu\displaystyle P_{i}^{uu} =1Nn=1N(ui(n)mi)(ui(n)mi)T,\displaystyle=\frac{1}{N}\sum_{n=1}^{N}(u_{i}^{(n)}-m_{i})(u_{i}^{(n)}-m_{i})^{T},\quad\quad
Piuy\displaystyle P_{i}^{uy} =1Nn=1N(ui(n)mi)(h(ui(n))hi)T,\displaystyle=\frac{1}{N}\sum_{n=1}^{N}\bigl{(}u_{i}^{(n)}-m_{i}\bigr{)}\bigl{(}h(u_{i}^{(n)})-h_{i}\bigr{)}^{T},
Piyy\displaystyle P_{i}^{yy} =1Nn=1N(h(ui(n))hi)(h(ui(n))hi)T.\displaystyle=\frac{1}{N}\sum_{n=1}^{N}\bigl{(}h(u_{i}^{(n)})-h_{i}\bigr{)}\bigl{(}h(u_{i}^{(n)})-h_{i}\bigr{)}^{T}.

Two overarching themes that underlie the derivation and analysis of the ensemble methods studied in this section and the following one are the use of a statistical linearization to avoid evaluation of derivatives, and the study of continuum limits. We next introduce these two ideas.

Statistical Linearization

If h(u)=Huh(u)=Hu is linear, we have

P_{i}^{uy}=P_{i}^{uu}H^{T},

which motivates the approximation in the general nonlinear case

h^{\prime}(u_{i}^{(n)})\approx(P_{i}^{uy})^{T}(P_{i}^{uu})^{-1}=:H_{i},\quad\quad n=1,\dots,N, (3.1)

where here and in what follows (P_{i}^{uu})^{-1} denotes the pseudoinverse of P_{i}^{uu}. Notice that (3.1) can be regarded as a linear least-squares fit of the pairs \{(u_{i}^{(n)},h(u_{i}^{(n)}))\}_{n=1}^{N} centered around their corresponding empirical means m_{i} and h_{i}. We remark that in order for the approximation in (3.1) to be accurate, the ensemble size N should not be much smaller than the input dimension d.
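A minimal sketch of the statistical linearization (3.1): H_i is computed from the centered ensemble pairs via a pseudoinverse and, as claimed above, it recovers the true matrix exactly when h is linear and the ensemble size exceeds the input dimension. The dimensions and the random linear map are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, N = 3, 2, 50
H_true = rng.standard_normal((k, d))
h = lambda u: H_true @ u                       # linear map, so (3.1) should be exact

U = rng.standard_normal((d, N))                # ensemble, one member per column
Y = np.apply_along_axis(h, 0, U)               # forward evaluations h(u^{(n)})

m_u = U.mean(axis=1, keepdims=True)
m_y = Y.mean(axis=1, keepdims=True)
P_uu = (U - m_u) @ (U - m_u).T / N
P_uy = (U - m_u) @ (Y - m_y).T / N             # cross-covariance, shape (d, k)

H_i = P_uy.T @ np.linalg.pinv(P_uu)            # statistical linearization (3.1)
assert np.allclose(H_i, H_true)
```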

Continuum Limit

We will gain theoretical understanding by studying continuum limits. Specifically, each algorithm includes a length-step parameter α>0,\alpha>0, and the evolution of the ensemble for small α\alpha can be interpreted as a discretization of an SDE system. We denote by {u(n)(t)}n=1N\{u^{(n)}(t)\}_{n=1}^{N} the sample paths of the underlying SDE. For each 1nN1\leq n\leq N, we have u0(n)=u(n)(0)u_{0}^{(n)}=u^{(n)}(0), and we view ui(n)u_{i}^{(n)} as an approximation of u(n)(t)u^{(n)}(t) for t=αit=\alpha i. Similarly as above, we define Puu(t),Puy(t),Pyy(t)P^{uu}(t),P^{uy}(t),P^{yy}(t) as the corresponding empirical covariances at time t0t\geq 0.

3.1 Ensemble Gauss-Newton Optimization of Tikhonov-Phillips Objective

3.1.1 Iterative Ensemble Kalman Filter

Given an ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N}, consider the following Gauss-Newton update for each nn:

ui+1(n)=ui(n)+αvi(n),u_{i+1}^{(n)}=u_{i}^{(n)}+\alpha v_{i}^{(n)}, (3.2)

where α>0\alpha>0 is the length-step, and vi(n)v_{i}^{(n)} is the minimizer of the following (linearized) Tikhonov-Phillips objective (cf. equation (2.17))

𝖩TP,i(n)(v)=12|yi(n)h(ui(n))Hiv|R2+12|u0(n)ui(n)v|P0uu2,yi(n)𝒩(y,α1R).{\mathsf{J}}_{\mbox{\tiny{\rm TP}},i}^{(n)}(v)=\frac{1}{2}\bigl{|}y_{i}^{(n)}-h(u_{i}^{(n)})-H_{i}v\bigr{|}^{2}_{R}+\frac{1}{2}\bigl{|}u_{0}^{(n)}-u_{i}^{(n)}-v\bigr{|}^{2}_{P_{0}^{uu}},\quad\quad y_{i}^{(n)}\sim\mathcal{N}(y,\alpha^{-1}R). (3.3)

Notice that we adopt the statistical linearization (3.1) in the above formulation. Applying Lemma 2.1, the minimizer vi(n)v_{i}^{(n)} can be calculated as

vi(n)=Ci{HiTR1(yi(n)h(ui(n)))+(P0uu)1(u0(n)ui(n))},v_{i}^{(n)}=C_{i}\Bigl{\{}H_{i}^{T}R^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}+(P_{0}^{uu})^{-1}\big{(}u_{0}^{(n)}-u_{i}^{(n)}\big{)}\Bigr{\}}, (3.4)

or, in an equivalent form,

vi(n)=Ki(yi(n)h(ui(n)))+(IKiHi)(u0(n)ui(n)),v_{i}^{(n)}=K_{i}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}+(I-K_{i}H_{i})(u_{0}^{(n)}-u_{i}^{(n)}), (3.5)

where

Ci\displaystyle C_{i} =(HiTR1Hi+(P0uu)1)1,\displaystyle=\big{(}H_{i}^{T}R^{-1}H_{i}+(P_{0}^{uu})^{-1}\big{)}^{-1},
Ki\displaystyle K_{i} =P0uuHiT(HiP0uuHiT+R)1.\displaystyle=P_{0}^{uu}H_{i}^{T}(H_{i}P_{0}^{uu}H_{i}^{T}+R)^{-1}.

Combining (3.2) and (3.5) leads to the Iterative Ensemble Kalman Filter (IEKF) algorithm.

Algorithm 4 Iterative Ensemble Kalman Filter (IEKF)

Input: Initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} sampled independently from the prior, length-step α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=P0uuHiT(HiP0uuHiT+R)1,Hi=(Piuy)T(Piuu)1.K_{i}=P_{0}^{uu}H_{i}^{T}(H_{i}P_{0}^{uu}H_{i}^{T}+R)^{-1},\quad\quad H_{i}=(P_{i}^{uy})^{T}(P_{i}^{uu})^{-1}.

  2. 2.

    Draw yi(n)𝒩(y,α1R)y_{i}^{(n)}\sim\mathcal{N}(y,\alpha^{-1}R) and set

    ui+1(n)=ui(n)+α{Ki(yi(n)h(ui(n)))+(IKiHi)(u0(n)ui(n))},1nN.u_{i+1}^{(n)}=u_{i}^{(n)}+\alpha\Bigl{\{}K_{i}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}+(I-K_{i}H_{i})\big{(}u_{0}^{(n)}-u_{i}^{(n)}\big{)}\Bigr{\}},\quad\quad 1\leq n\leq N. (3.6)

Output: Ensemble means m1,m2,m_{1},m_{2},\ldots
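For concreteness, here is a compact NumPy sketch of one IEKF sweep implementing update (3.6); the ensemble is stored as the columns of a d x N array and the quantities built from the initial ensemble are kept fixed, as in Algorithm 4. The toy forward map, problem sizes and length-step in the usage snippet are our own illustrative choices.

```python
import numpy as np

def empirical_cov(A, B):
    """Empirical cross-covariance of the columns of A (p x N) and B (q x N)."""
    Ac = A - A.mean(axis=1, keepdims=True)
    Bc = B - B.mean(axis=1, keepdims=True)
    return Ac @ Bc.T / A.shape[1]

def iekf_step(U, U0, P0, y, R, h, alpha, rng):
    """One IEKF update (3.6). U, U0: d x N current / initial ensembles."""
    d, N = U.shape
    HU = np.column_stack([h(U[:, n]) for n in range(N)])      # h(u_i^{(n)})
    P_uu = empirical_cov(U, U)
    P_uy = empirical_cov(U, HU)
    H = P_uy.T @ np.linalg.pinv(P_uu)                          # statistical linearization (3.1)
    K = P0 @ H.T @ np.linalg.inv(H @ P0 @ H.T + R)             # Kalman gain of Algorithm 4
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R / alpha, size=N).T
    return U + alpha * (K @ (Y - HU) + (np.eye(d) - K @ H) @ (U0 - U))

# Toy usage with a nonlinear map (illustrative problem, not from the paper)
rng = np.random.default_rng(5)
d, k, N = 2, 2, 100
h = lambda u: np.array([np.exp(u[0]) + u[1], u[1]])
y = np.array([2.0, 0.5]); R = 0.01 * np.eye(k)
U0 = rng.standard_normal((d, N))                               # prior N(0, I) samples
P0 = empirical_cov(U0, U0)
U = U0.copy()
for _ in range(50):
    U = iekf_step(U, U0, P0, y, R, h, alpha=0.1, rng=rng)
print("IEKF reconstruction (ensemble mean):", U.mean(axis=1))
```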

We highlight that IEKF is a natural ensemble-based version of the derivative-based IExKF Algorithm 1 with update (2.18). Algorithm 4 is a slight modification of the iterative ensemble Kalman algorithm proposed in [60]. The difference is that [60] sets H_{i}=(P_{i}^{uy})^{T}(P_{0}^{uu})^{-1} rather than H_{i}=(P_{i}^{uy})^{T}(P_{i}^{uu})^{-1}. Our modification guarantees that Algorithm 4 is well-balanced in the sense that if \alpha=1, u_{0}^{(n)}\sim\mathcal{N}(m,P) and h(u)=Hu is linear, then the output of Algorithm 4 satisfies that, as N\to\infty,

m_{i}\to\mu,\quad\quad i=1,2,\ldots

where \mu is the minimizer of {\mathsf{J}}_{\mbox{\tiny{\rm TP}}}(u) given in Equation (1.3). This is analogous to Proposition 2.3 for IExKF. A detailed explanation is included in Section 3.1.2 below.

Other statistical linearizations and approximations of the Gauss-Newton scheme are possible. We next give a high-level description of the method proposed in [48], one of the earliest applications of iterative ensemble Kalman methods for inversion in the petroleum engineering literature. It starts from the alternative characterization (3.4) of the Gauss-Newton update but, instead of using a different preconditioner C_{i} for each step, [48] uses a fixed preconditioner C_{*}=P_{0}^{uu}-P_{0}^{uy}(R+P_{0}^{yy})^{-1}(P_{0}^{uy})^{T}. Note that C_{*} can be viewed as an approximation of C_{0}:

C_{*}\approx P_{0}^{uu}-P_{0}^{uu}H_{0}^{T}(R+H_{0}P_{0}^{uu}H_{0}^{T})^{-1}H_{0}P_{0}^{uu}=\big{(}H_{0}^{T}R^{-1}H_{0}+(P_{0}^{uu})^{-1}\big{)}^{-1}=C_{0}.

This leads to the following algorithm.

Algorithm 5 Iterative Ensemble Kalman Filter (IEKF-RZL)

Input: Initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} sampled independently from the prior, length-step α\alpha.

  1. 1.

    Set C=P0uuP0uy(R+P0yy)1(P0uy)T.C_{*}=P_{0}^{uu}-P_{0}^{uy}\bigl{(}R+P_{0}^{yy}\bigr{)}^{-1}(P_{0}^{uy})^{T}.

For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Hi=(Piuy)T(Piuu)1.H_{i}=(P_{i}^{uy})^{T}(P_{i}^{uu})^{-1}.

  2. 2.

    Draw yi(n)𝒩(y,α1R)y_{i}^{(n)}\sim\mathcal{N}(y,\alpha^{-1}R) and set

    ui+1(n)=ui(n)+αC{HiTR1(yi(n)h(ui(n)))+(P0uu)1(u0(n)ui(n))},1nN.u_{i+1}^{(n)}=u_{i}^{(n)}+\alpha\,C_{*}\Bigl{\{}H_{i}^{T}R^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}+(P_{0}^{uu})^{-1}\big{(}u_{0}^{(n)}-u_{i}^{(n)}\big{)}\Bigr{\}},\quad\quad 1\leq n\leq N.

Output: Ensemble means m1,m2,m_{1},m_{2},\ldots

We note that IEKF-RZL is a natural ensemble-based version of the derivative-based IExKF Algorithm 1 with update (2.19). We have empirically observed in a wide range of numerical experiments that Algorithm 4 is more stable than Algorithm 5, and we now give a heuristic argument for the advantage of Algorithm 4 in small noise regimes.

  • The update formula in Algorithm 4 is motivated by the large NN approximation

    P0uuHiT(HiP0uuHiT+R)1PHiT(HiPHiT+R)1,P_{0}^{uu}H_{i}^{T}(H_{i}P_{0}^{uu}H_{i}^{T}+R)^{-1}\approx PH_{i}^{T}(H_{i}PH_{i}^{T}+R)^{-1},

    which only requires that P0uuP_{0}^{uu} is a good approximation of PP.

  • The update formula in Algorithm 5 may be derived by invoking a large NN approximation of several terms. In particular, the error arising from the approximation

    CHiTR1C0HiTR1,C_{*}H_{i}^{T}R^{-1}\approx C_{0}H_{i}^{T}R^{-1},

    gets amplified when RR is small.

Empirical evidence of the instability of Algorithm 5 will be given in Section 5.1.

3.1.2 Analysis of IEKF

In the literature [48, 60], the length-step α\alpha is sometimes set to be 1. Here we state a simple observation about Algorithm 4 when α=1\alpha=1 and h(u)=Huh(u)=Hu is linear. We further assume that HiHH_{i}\equiv H for all ii. Then KiK0:=P0uuHT(HP0uuHT+R)1K_{i}\equiv K_{0}:=P_{0}^{uu}H^{T}(HP_{0}^{uu}H^{T}+R)^{-1}, and the update (3.6) can be simplified as

ui+1(n)=ui(n)+K0(yi(n)Hui(n))+(IK0H)(u0(n)ui(n))=u0(n)+K0(yi(n)Hu0(n)),\begin{split}u_{i+1}^{(n)}&=u_{i}^{(n)}+K_{0}(y_{i}^{(n)}-Hu_{i}^{(n)})+(I-K_{0}H)(u_{0}^{(n)}-u_{i}^{(n)})\\ &=u_{0}^{(n)}+K_{0}(y_{i}^{(n)}-Hu_{0}^{(n)}),\end{split}

where y_{i}^{(n)}\sim\mathcal{N}(y,R). If \{u_{0}^{(n)}\}_{n=1}^{N} are sampled independently from the prior \mathcal{N}(m,P) and we let N\rightarrow\infty, we have K_{0}\rightarrow K=PH^{T}(HPH^{T}+R)^{-1} and, by the law of large numbers, for any i\geq 1,

1Nn=1Nui(n)=(IK0H)1Nn=1Nu0(n)+K01Nn=1Nyi(n)(IKH)m+Ky,\begin{split}\frac{1}{N}\sum_{n=1}^{N}u_{i}^{(n)}&=(I-K_{0}H)\cdot\frac{1}{N}\sum_{n=1}^{N}u_{0}^{(n)}+K_{0}\cdot\frac{1}{N}\sum_{n=1}^{N}y_{i}^{(n)}\\ &\rightarrow(I-KH)m+Ky,\end{split}

which is the posterior mean (and mode) in the linear setting. In other words, in the large ensemble limit, the ensemble mean recovers the posterior mean after one iteration. However, while the choice \alpha=1 may be effective in low-dimensional (nonlinear) inverse problems, we do not recommend it when the dimensionality is high, as H_{i} might not be a good approximation of H. Thus, we introduce Algorithms 4 and 5 with a choice of length-step \alpha, resembling the derivative-based Gauss-Newton method discussed in Section 2.1. Further analysis of a new variant of Algorithm 4 will be conducted in a continuum limit setting in Section 4.
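This observation can be checked numerically. The sketch below uses a toy linear problem of our own choosing, a large but finite ensemble drawn from the prior, \alpha=1 and, as in the simplified analysis above, the exact H in place of H_i; after one IEKF iteration the ensemble mean is close to the posterior mean.

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, N = 3, 5, 200_000                           # illustrative sizes; N approximates the mean-field limit
H = rng.standard_normal((k, d))
m = rng.standard_normal(d); P = np.eye(d)
R = 0.1 * np.eye(k)
y = rng.standard_normal(k)

# Posterior mean in the linear-Gaussian case (Lemma 2.1)
K_exact = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
mu = m + K_exact @ (y - H @ m)

# One IEKF iteration with alpha = 1: u_1^{(n)} = u_0^{(n)} + K_0 (y^{(n)} - H u_0^{(n)})
U0 = m[:, None] + np.linalg.cholesky(P) @ rng.standard_normal((d, N))   # prior samples
P0 = np.cov(U0, bias=True)                                              # P_0^{uu}
K0 = P0 @ H.T @ np.linalg.inv(H @ P0 @ H.T + R)
Y = y[:, None] + np.linalg.cholesky(R) @ rng.standard_normal((k, N))    # perturbed observations
U1 = U0 + K0 @ (Y - H @ U0)

print(np.linalg.norm(U1.mean(axis=1) - mu))        # small for large N
```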

Remark 3.1.

Some remarks:

  1. 1.

    Although this will not be the focus of our paper, the IEKF Algorithm 4 (together with the EKI Algorithm 6 and TEKI Algorithm 7 to be discussed later) enjoys the ‘initial subspace property’ studied in previous works [31, 51, 13] by which, for any ii and any initialization of {u0(n)}n=1N,\{u_{0}^{(n)}\}_{n=1}^{N},

    span({ui(n)}n=1N)span({u0(n)}n=1N).\text{span}\bigl{(}\{u_{i}^{(n)}\}_{n=1}^{N}\bigr{)}\subset\text{span}\bigl{(}\{u_{0}^{(n)}\}_{n=1}^{N}\bigr{)}.

    This can be shown easily for the IEKF Algorithm 4 by expanding the P0uuP_{0}^{uu} term in KiK_{i} in the update formula (3.6).

  2. 2.

    Since we assume a Gaussian prior 𝒩(m,P)\mathcal{N}(m,P) on uu, a natural idea is to replace u0(n)u_{0}^{(n)} by mm and P0uuP_{0}^{uu} by PP in the update formula (3.6). We pursue this idea in Section 4.1, where we introduce a new variant of Algorithm 4 and analyze it in the continuum limit setting. While the initial subspace property breaks down, this new variant is numerically promising, as shown in Section 5.

3.2 Ensemble Levenberg-Marquardt Optimization of Data-Misfit Objective

3.2.1 Ensemble Kalman Inversion

Given an ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N}, consider the following Levenberg-Marquardt update for each nn:

ui+1(n)=ui(n)+vi(n),u_{i+1}^{(n)}=u_{i}^{(n)}+v_{i}^{(n)}, (3.7)

where vi(n)v_{i}^{(n)} is the minimizer of the following regularized (linearized) data-misfit objective (cf. equation (2.24))

𝖩DM,i(n),UC  (v)=12|yi(n)h(ui(n))Hiv|R2+12α|v|Piuu2,yi(n)𝒩(y,α1R),{\mathsf{J}}_{\mbox{\tiny{\rm DM}},i}^{(n),\mbox{\mbox{\tiny{UC} } }}(v)=\frac{1}{2}\bigl{|}y_{i}^{(n)}-h(u_{i}^{(n)})-H_{i}v\bigr{|}^{2}_{R}+\frac{1}{2\alpha}\bigl{|}v\bigr{|}^{2}_{P_{i}^{uu}},\quad\quad y_{i}^{(n)}\sim\mathcal{N}(y,\alpha^{-1}R), (3.8)

and α>0\alpha>0 will be regarded as a length-step. Notice that we adopt the statistical linearization (3.1) in the above formulation. Applying Lemma 2.1, we can calculate the minimizer vi(n)v_{i}^{(n)} explicitly:

vi(n)=(HiTR1Hi+α1(Piuu)1)1HiTR1(yi(n)h(ui(n))),v_{i}^{(n)}=(H_{i}^{T}R^{-1}H_{i}+\alpha^{-1}(P_{i}^{uu})^{-1})^{-1}H_{i}^{T}R^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}, (3.9)

or, in an equivalent form,

vi(n)=PiuuHiT(HiPiuuHiT+α1R)1(yi(n)h(ui(n))).v_{i}^{(n)}=P_{i}^{uu}H_{i}^{T}(H_{i}P_{i}^{uu}H_{i}^{T}+\alpha^{-1}R)^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}. (3.10)

We combine (3.7) and (3.10), substitute PiuuHiT=PiuyP_{i}^{uu}H_{i}^{T}=P_{i}^{uy}, and make another level of approximation HiPiuyPiyyH_{i}P_{i}^{uy}\approx P_{i}^{yy}. This leads to the Ensemble Kalman Inversion (EKI) method [31].

Algorithm 6 Ensemble Kalman Inversion (EKI)

Input: Initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N}, length-step α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=Piuy(Piyy+α1R)1.K_{i}=P_{i}^{uy}(P_{i}^{yy}+\alpha^{-1}R)^{-1}.

  2. 2.

    Draw yi(n)𝒩(y,α1R)y_{i}^{(n)}\sim\mathcal{N}(y,\alpha^{-1}R) and set

    ui+1(n)=ui(n)+Ki{yi(n)h(ui(n))},1nN.u_{i+1}^{(n)}=u_{i}^{(n)}+K_{i}\Bigl{\{}y_{i}^{(n)}-h(u_{i}^{(n)})\Bigr{\}},\quad\quad 1\leq n\leq N. (3.11)

Output: Ensemble means m1,m2,m_{1},m_{2},\ldots
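A minimal sketch of one EKI update (3.11), with the ensemble again stored as columns; only black-box evaluations of h are required. The toy forward map, noise level and length-step in the usage snippet are illustrative choices, not taken from the paper.

```python
import numpy as np

def eki_step(U, y, R, h, alpha, rng):
    """One EKI update (3.11). U: d x N ensemble; h: R^d -> R^k black-box map."""
    d, N = U.shape
    HU = np.column_stack([h(U[:, n]) for n in range(N)])
    mu_u = U.mean(axis=1, keepdims=True)
    mu_h = HU.mean(axis=1, keepdims=True)
    P_uy = (U - mu_u) @ (HU - mu_h).T / N
    P_yy = (HU - mu_h) @ (HU - mu_h).T / N
    K = P_uy @ np.linalg.inv(P_yy + R / alpha)                 # K_i of Algorithm 6
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R / alpha, size=N).T
    return U + K @ (Y - HU)                                    # perturbed-observation update

# Toy usage (illustrative problem, not from the paper)
rng = np.random.default_rng(7)
d, k, N = 2, 3, 200
h = lambda u: np.array([np.exp(u[0]), u[0] * u[1], u[1]])
u_true = np.array([0.4, -0.7])
y = h(u_true) + 0.01 * rng.standard_normal(k)
R = 1e-4 * np.eye(k)
U = rng.standard_normal((d, N))
for _ in range(30):
    U = eki_step(U, y, R, h, alpha=0.5, rng=rng)
print("EKI reconstruction:", U.mean(axis=1), "truth:", u_true)
```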

We note that EKI is a natural ensemble-based version of the derivative-based ILM-DM Algorithm 2. However, an important difference is that the Kalman gain in ILM-DM only uses the iterates to update H_{i}, while P is kept fixed. In contrast, the ensemble is used in EKI to update both P_{i}^{uy} and P_{i}^{yy}.

3.2.2 Analysis of EKI

In view of the definition of KiK_{i}, we can rewrite the update (3.11) as a time-stepping scheme:

ui+1(n)=ui(n)+αPiuy(αPiyy+R)1(y+α1/2R1/2ξi(n)h(ui(n)))=ui(n)+αPiuy(αPiyy+R)1(yh(ui(n)))+α1/2Piuy(αPiyy+R)1R1/2ξi(n),\begin{split}u_{i+1}^{(n)}&=u_{i}^{(n)}+\alpha P_{i}^{uy}(\alpha P_{i}^{yy}+R)^{-1}\bigl{(}y+\alpha^{-1/2}R^{1/2}\xi_{i}^{(n)}-h(u_{i}^{(n)})\bigr{)}\\ &=u_{i}^{(n)}+\alpha P_{i}^{uy}(\alpha P_{i}^{yy}+R)^{-1}\bigl{(}y-h(u_{i}^{(n)})\bigr{)}+\alpha^{1/2}P_{i}^{uy}(\alpha P_{i}^{yy}+R)^{-1}R^{1/2}\xi_{i}^{(n)},\end{split} (3.12)

where ξi(n)𝒩(0,I)\xi_{i}^{(n)}\sim\mathcal{N}(0,I) are independent. Taking the limit α0\alpha\rightarrow 0, we interpret (3.12) as a discretization of the SDE system

du(n)=Puy(t)R1(yh(u(n)))dt+Puy(t)R1/2dW(n).\mathop{}\!\mathrm{d}u^{(n)}=P^{uy}(t)R^{-1}\bigl{(}y-h(u^{(n)})\bigr{)}\mathop{}\!\mathrm{d}t+P^{uy}(t)R^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}. (3.13)

If h(u)=Huh(u)=Hu is linear, the SDE system (3.13) turns into

du(n)=Puu(t)HTR1(yHu(n))dt+Puu(t)HTR1/2dW(n).\mathop{}\!\mathrm{d}u^{(n)}=P^{uu}(t)H^{T}R^{-1}\big{(}y-Hu^{(n)}\big{)}\mathop{}\!\mathrm{d}t+P^{uu}(t)H^{T}R^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}. (3.14)
Proposition 3.2.

For the SDE system (3.13), assume h(u)=Huh(u)=Hu is linear, and suppose that the initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is drawn independently from a continuous distribution with finite second moments. Then, in the large ensemble size limit NN\rightarrow\infty (mean-field), the distribution of u(n)(t)u^{(n)}(t) has mean 𝔪(t)\mathfrak{m}(t) and covariance (t)\mathfrak{C}(t), which satisfy

d𝔪(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{m}(t)}{\mathop{}\!\mathrm{d}t} =(t)HTR1(yH𝔪(t)),\displaystyle=\mathfrak{C}(t)H^{T}R^{-1}\bigl{(}y-H\mathfrak{m}(t)\bigr{)}, (3.15)
d(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t} =(t)HTR1H(t).\displaystyle=-\mathfrak{C}(t)H^{T}R^{-1}H\mathfrak{C}(t). (3.16)

Furthermore, the solution can be computed analytically:

𝔪(t)\displaystyle\mathfrak{m}(t) =((0)1+tHTR1H)1((0)1𝔪(0)+tHTR1y),\displaystyle=\Big{(}\mathfrak{C}(0)^{-1}+tH^{T}R^{-1}H\Big{)}^{-1}\big{(}\mathfrak{C}(0)^{-1}\mathfrak{m}(0)+tH^{T}R^{-1}y\big{)}, (3.17)
(t)\displaystyle\mathfrak{C}(t) =((0)1+tHTR1H)1.\displaystyle=\Big{(}\mathfrak{C}(0)^{-1}+tH^{T}R^{-1}H\Big{)}^{-1}. (3.18)

In particular, if H\in\mathbb{R}^{k\times d} has full column rank (i.e., d\leq k and \mathrm{rank}(H)=d), then, as t\rightarrow\infty,

𝔪(t)\displaystyle\mathfrak{m}(t) (HTR1H)1(HTR1y),\displaystyle\rightarrow(H^{T}R^{-1}H)^{-1}(H^{T}R^{-1}y), (3.19)
(t)\displaystyle\mathfrak{C}(t) 0.\displaystyle\rightarrow 0. (3.20)

If H\in\mathbb{R}^{k\times d} has full row rank (i.e., d\geq k and \mathrm{rank}(H)=k), then, as t\rightarrow\infty,

H𝔪(t)\displaystyle H\mathfrak{m}(t) y,\displaystyle\rightarrow y, (3.21)
H(t)HT\displaystyle H\mathfrak{C}(t)H^{T} 0.\displaystyle\rightarrow 0. (3.22)
Proof.

The proof technique is similar to [23]. We will use that

𝔪(t)\displaystyle\mathfrak{m}(t) =limN𝔼[u(n)(t)],\displaystyle=\lim_{N\to\infty}\operatorname{\mathbb{E}}\bigl{[}u^{(n)}(t)\bigr{]},
(t)\displaystyle\mathfrak{C}(t) =limN𝔼[e(n)(t)e(n)(t)],\displaystyle=\lim_{N\to\infty}\operatorname{\mathbb{E}}\bigl{[}e^{(n)}(t)\otimes e^{(n)}(t)\bigr{]},

where e(n)(t):=u(n)(t)𝔪(t).e^{(n)}(t):=u^{(n)}(t)-\mathfrak{m}(t). First, note that (3.15) follows directly from (3.14) using that in the mean field limit Puu(t)P^{uu}(t) can be replaced by (t)\mathfrak{C}(t). To obtain the evolution of (t)\mathfrak{C}(t), note that

d(t)=limN𝔼[de(n)e(n)+e(n)de(n)+de(n)de(n)],\mathop{}\!\mathrm{d}\mathfrak{C}(t)=\lim_{N\to\infty}\operatorname{\mathbb{E}}\Bigl{[}\mathop{}\!\mathrm{d}e^{(n)}\otimes e^{(n)}+e^{(n)}\otimes\mathop{}\!\mathrm{d}e^{(n)}+\mathop{}\!\mathrm{d}e^{(n)}\otimes\mathop{}\!\mathrm{d}e^{(n)}\Bigr{]},

where the last term accounts for the Itô correction. This simplifies as

d(t)dt=limN𝔼[(t)HTR1H(e(n)e(n))(e(n)e(n))HTR1H(t)+(t)HTR1H(t)]=(t)HTR1H(t),\begin{split}\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t}&=\lim_{N\to\infty}\operatorname{\mathbb{E}}\Bigl{[}-\mathfrak{C}(t)H^{T}R^{-1}H(e^{(n)}\otimes e^{(n)})-(e^{(n)}\otimes e^{(n)})H^{T}R^{-1}H\mathfrak{C}(t)+\mathfrak{C}(t)H^{T}R^{-1}H\mathfrak{C}(t)\Bigr{]}\\ &=-\mathfrak{C}(t)H^{T}R^{-1}H\mathfrak{C}(t),\end{split}

which gives Equation (3.16). To derive exact formulas for 𝔪(t)\mathfrak{m}(t) and (t)\mathfrak{C}(t), we notice that

d(t)1dt=(t)1d(t)dt(t)1=HTR1H,\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)^{-1}}{\mathop{}\!\mathrm{d}t}=-\mathfrak{C}(t)^{-1}\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t}\mathfrak{C}(t)^{-1}=H^{T}R^{-1}H,

and

d((t)1𝔪(t))dt=d(t)1dt𝔪(t)+(t)1d𝔪(t)dt=HTR1y.\frac{\mathop{}\!\mathrm{d}\big{(}\mathfrak{C}(t)^{-1}\mathfrak{m}(t)\big{)}}{\mathop{}\!\mathrm{d}t}=\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)^{-1}}{\mathop{}\!\mathrm{d}t}\mathfrak{m}(t)+\mathfrak{C}(t)^{-1}\frac{\mathop{}\!\mathrm{d}\mathfrak{m}(t)}{\mathop{}\!\mathrm{d}t}=H^{T}R^{-1}y.

Then (3.17) and (3.18) follow easily.

If HH has full column rank, then HTR1HH^{T}R^{-1}H is invertible and therefore, as tt\rightarrow\infty,

(t)=t1(t1(0)1+HTR1H)10\mathfrak{C}(t)=t^{-1}\big{(}t^{-1}\mathfrak{C}(0)^{-1}+H^{T}R^{-1}H\big{)}^{-1}\rightarrow 0

by continuity of the matrix inverse function. The limit of 𝔪(t)\mathfrak{m}(t), (3.19), follows immediately from (3.17).

If HH has full row rank, we make the following substitutions

𝔪~(t)=R1/2H𝔪(t),~(t)=R1/2H(t)HTR1/2.\widetilde{\mathfrak{m}}(t)=R^{-1/2}H\mathfrak{m}(t),\quad\quad\quad\widetilde{\mathfrak{C}}(t)=R^{-1/2}H\mathfrak{C}(t)H^{T}R^{-1/2}.

Then (3.15) and (3.16) can be transformed into

d𝔪~(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\widetilde{\mathfrak{m}}(t)}{\mathop{}\!\mathrm{d}t} =~(t)(R1/2y𝔪~(t)),\displaystyle=\widetilde{\mathfrak{C}}(t)\bigl{(}R^{-1/2}y-\widetilde{\mathfrak{m}}(t)\bigr{)}, (3.23)
d~(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\widetilde{\mathfrak{C}}(t)}{\mathop{}\!\mathrm{d}t} =~(t)2.\displaystyle=-\widetilde{\mathfrak{C}}(t)^{2}. (3.24)

Using the fact that ~(0)=R1/2H(0)HTR1/2\widetilde{\mathfrak{C}}(0)=R^{-1/2}H\mathfrak{C}(0)H^{T}R^{-1/2} is invertible, we can solve these using the same technique as in the previous case:

~(t)=(~(0)1+t)1,𝔪~(t)=(~(0)1+t)1(~(0)1𝔪~(0)+tR1/2y).\widetilde{\mathfrak{C}}(t)=\big{(}\widetilde{\mathfrak{C}}(0)^{-1}+t\big{)}^{-1},\quad\quad\quad\widetilde{\mathfrak{m}}(t)=\big{(}\widetilde{\mathfrak{C}}(0)^{-1}+t\big{)}^{-1}\bigl{(}\widetilde{\mathfrak{C}}(0)^{-1}\widetilde{\mathfrak{m}}(0)+tR^{-1/2}y\bigr{)}.

As tt\rightarrow\infty, we have 𝔪~(t)R1/2y\widetilde{\mathfrak{m}}(t)\rightarrow R^{-1/2}y and ~(t)0\widetilde{\mathfrak{C}}(t)\rightarrow 0, which lead to (3.21) and (3.22). ∎
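The closed-form expressions (3.17)-(3.18) can be checked against a direct numerical integration of the mean-field equations (3.15)-(3.16). The sketch below does so with a simple explicit Euler scheme on a toy linear problem; all problem data and the step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(8)
d, k = 3, 4
H = rng.standard_normal((k, d))
R = 0.2 * np.eye(k)
y = rng.standard_normal(k)
m0 = rng.standard_normal(d)
C0 = np.eye(d)

# Explicit Euler integration of the mean-field ODEs (3.15)-(3.16)
T, dt = 2.0, 1e-4
m, C = m0.copy(), C0.copy()
Rinv = np.linalg.inv(R)
for _ in range(int(T / dt)):
    m = m + dt * C @ H.T @ Rinv @ (y - H @ m)
    C = C - dt * C @ H.T @ Rinv @ H @ C

# Closed-form solution (3.17)-(3.18) at time T
C_exact = np.linalg.inv(np.linalg.inv(C0) + T * H.T @ Rinv @ H)
m_exact = C_exact @ (np.linalg.inv(C0) @ m0 + T * H.T @ Rinv @ y)

print(np.linalg.norm(C - C_exact), np.linalg.norm(m - m_exact))   # both small
```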

Remark 3.3.

Some remarks:

  1. 1.

    In fact, (3.22) always holds, without rank constraints on H. The proof follows a similar idea to that in [51]. This can be seen from Equation (3.24): performing the eigenvalue decomposition \widetilde{\mathfrak{C}}(0)=X\Lambda(0)X^{T}, one can show that \widetilde{\mathfrak{C}}(t)=X\Lambda(t)X^{T}, where \Lambda(t) is diagonal and \Lambda(t)\rightarrow 0 as t\rightarrow\infty. The statement that H\mathfrak{C}(t)H^{T}\rightarrow 0 (or \mathfrak{C}(t)\rightarrow 0 if H has full column rank) is referred to as ‘ensemble collapse’. This can be interpreted as saying that the images of all the particles under H collapse to a single point as time evolves. Our numerical results in Section 5 show that the ensemble collapse phenomenon of Algorithm 6 is also observed in a variety of nonlinear examples and outside the mean-field limit, with moderate ensemble size N. These empirical results justify the practical significance of the linear continuum analysis in Proposition 3.2.

  2. 2.

    Under the same setting of Proposition 3.2, if we further require that the initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is drawn independently from the prior distribution 𝒩(m,P)\mathcal{N}(m,P), then in the mean-field limit NN\rightarrow\infty we have 𝔪(0)=m\mathfrak{m}(0)=m and (0)=P\mathfrak{C}(0)=P, leading to

    𝔪(1)=(P1+HTR1H)1(P1m+HTR1y),(1)=(P1+HTR1H)1,\mathfrak{m}(1)=(P^{-1}+H^{T}R^{-1}H)^{-1}(P^{-1}m+H^{T}R^{-1}y),\quad\quad\mathfrak{C}(1)=(P^{-1}+H^{T}R^{-1}H)^{-1},

    which are the true posterior mean and covariance, respectively. However, we have observed in a variety of numerical examples (not reported here) that in nonlinear problems it is often necessary to run EKI up to times larger than 1 to obtain an adequate approximation of the posterior mean and covariance. Providing a suitable stopping criterion for EKI is a topic of current research [52, 32] beyond the scope of our work.

3.3 Ensemble Levenberg-Marquardt Optimization of Tikhonov-Phillips Objective

3.3.1 Tikhonov Ensemble Kalman Inversion

Recall that we define

z:=[ym],g(u):=[h(u)u],Q:=[R00P].z:=\begin{bmatrix}y\\ m\end{bmatrix},\quad\quad\quad g(u):=\begin{bmatrix}h(u)\\ u\end{bmatrix},\quad\quad\quad Q:=\begin{bmatrix}R&0\\ 0&P\end{bmatrix}.

Then, given an ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N}, we can define

gi=1Nn=1Ng(ui(n)),g_{i}=\frac{1}{N}\sum_{n=1}^{N}g(u_{i}^{(n)}),

and empirical covariances

Pizz\displaystyle P_{i}^{zz} =1Nn=1N(g(ui(n))gi)(g(ui(n))gi)T,\displaystyle=\frac{1}{N}\sum_{n=1}^{N}\bigl{(}g(u_{i}^{(n)})-g_{i}\bigr{)}\bigl{(}g(u_{i}^{(n)})-g_{i}\bigr{)}^{T},
Piuz\displaystyle P_{i}^{uz} =1Nn=1N(ui(n)mi)(g(ui(n))gi)T.\displaystyle=\frac{1}{N}\sum_{n=1}^{N}\bigl{(}u_{i}^{(n)}-m_{i}\bigr{)}\bigl{(}g(u_{i}^{(n)})-g_{i}\bigr{)}^{T}.

Furthermore, we define the statistical linearization GiG_{i}:

g(ui(n))(Piuz)T(Piuu)1=:Gi.g^{\prime}(u_{i}^{(n)})\approx(P_{i}^{uz})^{T}(P_{i}^{uu})^{-1}=:G_{i}. (3.25)

It is not hard to check that

Gi=[HiI],G_{i}=\begin{bmatrix}H_{i}\\ I\end{bmatrix},

with HiH_{i} defined in (3.1).
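To fix ideas, the following minimal NumPy sketch (ours; the array names and the use of a pseudo-inverse are our assumptions) computes the statistical linearization (3.25) from an ensemble and its forward evaluations.

import numpy as np

def statistical_linearization(ens, g_ens):
    """Compute G_i = (P_i^{uz})^T (P_i^{uu})^{-1} from an ensemble.

    ens   : array of shape (N, d), ensemble members u_i^{(n)}
    g_ens : array of shape (N, k + d), evaluations g(u_i^{(n)})
    """
    N = ens.shape[0]
    du = ens - ens.mean(axis=0)          # centered ensemble, shape (N, d)
    dg = g_ens - g_ens.mean(axis=0)      # centered evaluations, shape (N, k + d)
    Puu = du.T @ du / N                  # empirical covariance P_i^{uu}
    Puz = du.T @ dg / N                  # empirical cross-covariance P_i^{uz}
    return Puz.T @ np.linalg.pinv(Puu)   # G_i; pinv guards against rank deficiency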

Given an ensemble {ui(n)}n=1N\{u_{i}^{(n)}\}_{n=1}^{N}, consider the following Levenberg-Marquardt update for each nn:

ui+1(n)=ui(n)+vi(n),u_{i+1}^{(n)}=u_{i}^{(n)}+v_{i}^{(n)},

where vi(n)v_{i}^{(n)} is the minimizer of the following regularized (linearized) Tikhonov-Phillips objective (cf. equation (2.28))

𝖩TP,i(n),UC  (v)=12|zi(n)g(ui(n))Giv|Q2+12α|v|Piuu2,zi(n)𝒩(z,α1Q),{\mathsf{J}}_{\mbox{\tiny{\rm TP}},i}^{(n),\mbox{\mbox{\tiny{UC} } }}(v)=\frac{1}{2}\bigl{|}z_{i}^{(n)}-g(u_{i}^{(n)})-G_{i}v\bigr{|}^{2}_{Q}+\frac{1}{2\alpha}\bigl{|}v\bigr{|}^{2}_{P_{i}^{uu}},\quad\quad z_{i}^{(n)}\sim\mathcal{N}(z,\alpha^{-1}Q), (3.26)

and α>0\alpha>0 will be regarded as a length-step. We can calculate the minimizer vi(n)v_{i}^{(n)} explicitly, applying Lemma 2.1:

vi(n)=(GiTQ1Gi+α1(Piuu)1)1GiTQ1(zi(n)g(ui(n))),v_{i}^{(n)}=(G_{i}^{T}Q^{-1}G_{i}+\alpha^{-1}(P_{i}^{uu})^{-1})^{-1}G_{i}^{T}Q^{-1}\big{(}z_{i}^{(n)}-g(u_{i}^{(n)})\big{)}, (3.27)

or, in an equivalent form,

vi(n)=PiuuGiT(GiPiuuGiT+α1Q)1(zi(n)g(ui(n))).v_{i}^{(n)}=P_{i}^{uu}G_{i}^{T}(G_{i}P_{i}^{uu}G_{i}^{T}+\alpha^{-1}Q)^{-1}\big{(}z_{i}^{(n)}-g(u_{i}^{(n)})\big{)}. (3.28)

Similar to EKI, in Equation (3.28) we substitute PiuuGiT=PiuzP_{i}^{uu}G_{i}^{T}=P_{i}^{uz}, and make the approximation GiPiuzPizzG_{i}P_{i}^{uz}\approx P_{i}^{zz}. This leads to Tikhonov Ensemble Kalman Inversion (TEKI), described in Algorithm 7.

Algorithm 7 Tikhonov Ensemble Kalman Inversion (TEKI)

Input: Initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N}, length-step α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=Piuz(Pizz+α1Q)1.K_{i}=P_{i}^{uz}(P_{i}^{zz}+\alpha^{-1}Q)^{-1}.

  2. 2.

    Draw zi(n)𝒩(z,α1Q)z_{i}^{(n)}\sim\mathcal{N}(z,\alpha^{-1}Q) and set

    ui+1(n)=ui(n)+Ki{zi(n)g(ui(n))},1nN.u_{i+1}^{(n)}=u_{i}^{(n)}+K_{i}\Bigl{\{}z_{i}^{(n)}-g(u_{i}^{(n)})\Bigr{\}},\quad\quad 1\leq n\leq N. (3.29)

Output: Ensemble means m1,m2,m_{1},m_{2},\ldots

We note that TEKI is a natural ensemble-based version of the derivative-based ILM-TP Algorithm 3. However, the Kalman gain in ILM-TP keeps PP fixed, while in TEKI PiuzP_{i}^{uz} and PizzP_{i}^{zz} are updated using the ensemble.
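For concreteness, a minimal NumPy sketch of one TEKI iteration (Algorithm 7) could read as follows; the forward map h, the data y, the prior mean m, the prior covariance P and the noise covariance R are assumed given, and all variable names are ours.

import numpy as np

def teki_step(ens, h, y, m, R, P, alpha, rng):
    """One TEKI update (3.29); ens has shape (N, d)."""
    N, d = ens.shape
    k = len(y)
    z = np.concatenate([y, m])                              # augmented data z = [y; m]
    Q = np.block([[R, np.zeros((k, d))],
                  [np.zeros((d, k)), P]])                   # augmented covariance Q
    g_ens = np.array([np.concatenate([h(u), u]) for u in ens])   # g(u) = [h(u); u]
    du = ens - ens.mean(axis=0)
    dg = g_ens - g_ens.mean(axis=0)
    Puz = du.T @ dg / N
    Pzz = dg.T @ dg / N
    K = Puz @ np.linalg.inv(Pzz + Q / alpha)                # Kalman gain K_i
    zs = rng.multivariate_normal(z, Q / alpha, size=N)      # perturbed data z_i^{(n)}
    return ens + (zs - g_ens) @ K.T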

3.3.2 Analysis of TEKI

When α\alpha is small, we can rewrite the update (3.29) as a time-stepping scheme:

ui+1(n)=ui(n)+αPiuz(αPizz+Q)1(z+α1/2Q1/2ξi(n)g(ui(n)))=ui(n)+αPiuz(αPizz+Q)1(zg(ui(n)))+α1/2Piuz(αPizz+Q)1Q1/2ξi(n),\begin{split}u_{i+1}^{(n)}&=u_{i}^{(n)}+\alpha P_{i}^{uz}(\alpha P_{i}^{zz}+Q)^{-1}\bigl{(}z+\alpha^{-1/2}Q^{1/2}\xi_{i}^{(n)}-g(u_{i}^{(n)})\bigr{)}\\ &=u_{i}^{(n)}+\alpha P_{i}^{uz}(\alpha P_{i}^{zz}+Q)^{-1}\bigl{(}z-g(u_{i}^{(n)})\bigr{)}+\alpha^{1/2}P_{i}^{uz}(\alpha P_{i}^{zz}+Q)^{-1}Q^{1/2}\xi_{i}^{(n)},\end{split} (3.30)

where ξi(n)𝒩(0,Id+k)\xi_{i}^{(n)}\sim\mathcal{N}(0,I_{d+k}) are independent. Taking the limit α0\alpha\rightarrow 0, we can interpret (3.30) as a discretization of the SDE system

du(n)=Puz(t)Q1(zg(u(n)))dt+Puz(t)Q1/2dW(n).\mathop{}\!\mathrm{d}u^{(n)}=P^{uz}(t)Q^{-1}\bigl{(}z-g(u^{(n)})\bigr{)}\mathop{}\!\mathrm{d}t+P^{uz}(t)Q^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}. (3.31)

If h(u)=Huh(u)=Hu is linear, g(u)=Gug(u)=Gu is also linear and the SDE system (3.31) can be rewritten as

du(n)=Puz(t)Q1(zGu(n))dt+Puz(t)Q1/2dW(n)=Puu(t)GTQ1(zGu(n))dt+Puu(t)GTQ1/2dW(n).\begin{split}\mathop{}\!\mathrm{d}u^{(n)}&=P^{uz}(t)Q^{-1}\big{(}z-Gu^{(n)}\big{)}\mathop{}\!\mathrm{d}t+P^{uz}(t)Q^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}\\ &=P^{uu}(t)G^{T}Q^{-1}\big{(}z-Gu^{(n)}\big{)}\mathop{}\!\mathrm{d}t+P^{uu}(t)G^{T}Q^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}.\end{split} (3.32)
Proposition 3.4.

For the SDE system (3.31), assume h(u)=Huh(u)=Hu is linear, and that the initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is made of independent samples from a distribution with finite second moments. Then, in the large ensemble limit (mean-field), the distribution of u(n)(t)u^{(n)}(t) has mean 𝔪(t)\mathfrak{m}(t) and covariance (t)\mathfrak{C}(t), which satisfy:

d𝔪(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{m}(t)}{\mathop{}\!\mathrm{d}t} =(t)GTQ1(zG𝔪(t)),\displaystyle=\mathfrak{C}(t)G^{T}Q^{-1}(z-G\mathfrak{m}(t)), (3.33)
d(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t} =(t)GTQ1G(t).\displaystyle=-\mathfrak{C}(t)G^{T}Q^{-1}G\mathfrak{C}(t). (3.34)

Furthermore, the solution can be computed analytically:

𝔪(t)\displaystyle\mathfrak{m}(t) =((0)1+tGTQ1G)1((0)1𝔪(0)+tGTQ1z),\displaystyle=\Big{(}\mathfrak{C}(0)^{-1}+tG^{T}Q^{-1}G\Big{)}^{-1}\big{(}\mathfrak{C}(0)^{-1}\mathfrak{m}(0)+tG^{T}Q^{-1}z\big{)}, (3.35)
(t)\displaystyle\mathfrak{C}(t) =((0)1+tGTQ1G)1.\displaystyle=\Big{(}\mathfrak{C}(0)^{-1}+tG^{T}Q^{-1}G\Big{)}^{-1}. (3.36)

In particular, as tt\rightarrow\infty,

\mathfrak{m}(t)\rightarrow(G^{T}Q^{-1}G)^{-1}(G^{T}Q^{-1}z)=(H^{T}R^{-1}H+P^{-1})^{-1}(H^{T}R^{-1}y+P^{-1}m), (3.37)
\mathfrak{C}(t)\rightarrow 0. (3.38)

Notice that 𝔪(t)\mathfrak{m}(t) converges to the true posterior mean.

Proof.

Equations (3.33)-(3.36) can be derived as in the proof of Proposition 3.2, replacing H by G, R by Q, and y by z. Since G^{T}Q^{-1}G=H^{T}R^{-1}H+P^{-1} is always invertible, we have that, as t\rightarrow\infty,

(t)=t1(t1(0)1+GTQ1G)10\mathfrak{C}(t)=t^{-1}\big{(}t^{-1}\mathfrak{C}(0)^{-1}+G^{T}Q^{-1}G\big{)}^{-1}\rightarrow 0

by continuity of the matrix inverse function. The limit of 𝔪(t)\mathfrak{m}(t) in (3.37) follows directly from (3.35). ∎

4 Ensemble Kalman methods: New Variants

In the previous section we discussed three popular subfamilies of iterative ensemble Kalman methods, analogous to the derivative-based algorithms in Section 2. The aim of this section is to introduce two new iterative ensemble Kalman methods which are inspired by the SDE continuum limit structure of the algorithms in Section 3. The two new methods that we introduce have in common that they rely on statistical linearization, and that the long-time limit of the ensemble covariance recovers the posterior covariance in a linear setting. Subsection 4.1 contains a new variant of IEKF which, in addition to recovering the posterior covariance, also recovers the posterior mean in the long-time limit. Subsection 4.2 introduces a new variant of the EKI method. Finally, Subsection 4.3 highlights the gradient structure of the algorithms in this and the previous section, shows that our new variant of IEKF can also be interpreted as a modified TEKI algorithm, and places our proposed new methods in the context of the broader literature.

4.1 Iterative Ensemble Kalman Filter with Statistical Linearization

In some of the literature on iterative ensemble Kalman methods [48, 60], the length-step (\alpha in Algorithms 4 and 5) is set to 1. Although this choice of length-step allows one to recover the true posterior mean in the linear case after one iteration (see Section 3.1.2), it leads to numerical instability in complex nonlinear models. Alternative ways to set \alpha include performing a line-search that satisfies the Wolfe conditions, or using other ad-hoc line-search criteria [24]. These methods allow \alpha to be chosen adaptively throughout the iterations, but they introduce other hyperparameters that need to be selected manually.

Our idea here is to slightly modify Algorithm 4 so that in the linear case its continuum limit has the true posterior as its invariant distribution. In this way we can simply choose a small enough α\alpha and run the algorithm until the iterates reach a statistical equilibrium, avoiding the need to specify suitable hyperparameters and stopping criteria. Our empirical results show that this approach also performs well in the nonlinear case.

In the update Equation (3.6), we replace each of the u0(n)u_{0}^{(n)} by a perturbation of the prior mean mm, and we replace P0uuP_{0}^{uu} by the prior covariance matrix PP in the definition of KiK_{i}. Details can be found below in our modified algorithm, which we call IEKF-SL.

Algorithm 8 Iterative Ensemble Kalman Filter with Statistical Linearization (IEKF-SL)

Input: Initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N}, step size α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=PHiT(HiPHiT+R)1,Hi=(Piuy)T(Piuu)1K_{i}=PH_{i}^{T}(H_{i}PH_{i}^{T}+R)^{-1},\quad\quad H_{i}=(P_{i}^{uy})^{T}(P_{i}^{uu})^{-1}.

  2. 2.

    Draw yi(n)𝒩(y,2α1R)y_{i}^{(n)}\sim\mathcal{N}(y,2\alpha^{-1}R), mi(n)𝒩(m,2α1P)m_{i}^{(n)}\sim\mathcal{N}(m,2\alpha^{-1}P) and set

    ui+1(n)=ui(n)+α{Ki(yi(n)h(ui(n)))+(IKiHi)(mi(n)ui(n))},1nN.u_{i+1}^{(n)}=u_{i}^{(n)}+\alpha\Bigl{\{}K_{i}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}+(I-K_{i}H_{i})\big{(}m_{i}^{(n)}-u_{i}^{(n)}\big{)}\Bigr{\}},\quad\quad 1\leq n\leq N. (4.1)

Output: Ensemble means m1,m2,m_{1},m_{2},\ldots
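A minimal NumPy sketch of one IEKF-SL iteration (Algorithm 8) is given below; variable names are ours and a pseudo-inverse is used to guard against a rank-deficient ensemble covariance.

import numpy as np

def iekf_sl_step(ens, h, y, m, R, P, alpha, rng):
    """One IEKF-SL update (4.1); ens has shape (N, d)."""
    N, d = ens.shape
    h_ens = np.array([h(u) for u in ens])                   # h(u_i^{(n)}), shape (N, k)
    du = ens - ens.mean(axis=0)
    dh = h_ens - h_ens.mean(axis=0)
    Puu = du.T @ du / N
    Puy = du.T @ dh / N
    Hi = Puy.T @ np.linalg.pinv(Puu)                        # statistical linearization H_i
    K = P @ Hi.T @ np.linalg.inv(Hi @ P @ Hi.T + R)         # Kalman gain with prior covariance P
    ys = rng.multivariate_normal(y, 2 * R / alpha, size=N)  # perturbed data y_i^{(n)}
    ms = rng.multivariate_normal(m, 2 * P / alpha, size=N)  # perturbed prior mean m_i^{(n)}
    innovation = (ys - h_ens) @ K.T + (ms - ens) @ (np.eye(d) - K @ Hi).T
    return ens + alpha * innovation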

It is natural to regard the update (4.1) as a time-stepping scheme. We rewrite it in an alternative form, in analogy to (3.4):

ui+1(n)=ui(n)+αCi{HiTR1(yi(n)h(ui(n)))+P1(mi(n)ui(n))}=ui(n)+αCi{HiTR1(yh(ui(n)))+P1(mui(n))+(HiTR1ζ+P1η)},\begin{split}u_{i+1}^{(n)}&=u_{i}^{(n)}+\alpha C_{i}\Bigl{\{}H_{i}^{T}R^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}+P^{-1}\big{(}m_{i}^{(n)}-u_{i}^{(n)}\big{)}\Bigr{\}}\\ &=u_{i}^{(n)}+\alpha C_{i}\Bigl{\{}H_{i}^{T}R^{-1}\big{(}y-h(u_{i}^{(n)})\big{)}+P^{-1}\big{(}m-u_{i}^{(n)}\big{)}+(H_{i}^{T}R^{-1}\zeta+P^{-1}\eta)\Bigr{\}},\end{split} (4.2)

where

Ci=(HiTR1Hi+P1)1,ζ𝒩(0,2α1R),η𝒩(0,2α1P).C_{i}=\big{(}H_{i}^{T}R^{-1}H_{i}+P^{-1}\big{)}^{-1},\quad\zeta\sim\mathcal{N}(0,2\alpha^{-1}R),\quad\eta\sim\mathcal{N}(0,2\alpha^{-1}P).

We interpret Equation (4.2) as a discretization of the SDE system

du(n)=C(t)(H(t)TR1(yh(u(n)))+P1(mu(n)))dt+2C(t)dW(n),\mathop{}\!\mathrm{d}u^{(n)}=C(t)\Big{(}H(t)^{T}R^{-1}\big{(}y-h(u^{(n)})\big{)}+P^{-1}\big{(}m-u^{(n)}\big{)}\Big{)}\mathop{}\!\mathrm{d}t+\sqrt{2C(t)}\mathop{}\!\mathrm{d}W^{(n)}, (4.3)

where

H(t)\displaystyle H(t) =(Puy(t))T(Puu(t))1,\displaystyle=\big{(}P^{uy}(t)\big{)}^{T}\big{(}P^{uu}(t)\big{)}^{-1},
C(t)\displaystyle C(t) =(H(t)TR1H(t)+P1)1.\displaystyle=\big{(}H(t)^{T}R^{-1}H(t)+P^{-1}\big{)}^{-1}.

The diffusion term can be derived using the fact that

Ci(HiTR1ζ+P1η)𝒩(0,2α1Ci(HiTR1Hi+P1)Ci)=𝒩(0,2α1Ci).C_{i}(H_{i}^{T}R^{-1}\zeta+P^{-1}\eta)\sim\mathcal{N}\big{(}0,2\alpha^{-1}C_{i}(H_{i}^{T}R^{-1}H_{i}+P^{-1})C_{i}\big{)}=\mathcal{N}(0,2\alpha^{-1}C_{i}).

If h(u)=Huh(u)=Hu is linear and the empirical covariance Puu(t)P^{uu}(t) has full rank for all tt, then H(t)HH(t)\equiv H, C(t)C=(P1+HTR1H)1C(t)\equiv C=(P^{-1}+H^{T}R^{-1}H)^{-1}, and the SDE system can be decoupled and further simplified:

du(n)=C(HTR1(yHu(n))+P1(mu(n)))dt+2CdW=(u(n)+C(HTR1y+P1m))dt+2CdW.\begin{split}\mathop{}\!\mathrm{d}u^{(n)}&=C\Big{(}H^{T}R^{-1}\big{(}y-Hu^{(n)}\big{)}+P^{-1}(m-u^{(n)})\Big{)}\mathop{}\!\mathrm{d}t+\sqrt{2C}\mathop{}\!\mathrm{d}W\\ &=\Big{(}-u^{(n)}+C(H^{T}R^{-1}y+P^{-1}m)\Big{)}\mathop{}\!\mathrm{d}t+\sqrt{2C}\mathop{}\!\mathrm{d}W.\end{split} (4.4)
Proposition 4.1.

For the SDE system (4.3), assume h(u)=Huh(u)=Hu is linear and H(t)HH(t)\equiv H holds. Assume that the initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is made of independent samples from a continuous distribution with finite second moments. Then, for 1nN1\leq n\leq N, the mean 𝔪(t)\mathfrak{m}(t) and covariance (t)\mathfrak{C}(t) of u(n)(t)u^{(n)}(t) satisfy

d𝔪(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{m}(t)}{\mathop{}\!\mathrm{d}t} =𝔪(t)+C(HTR1y+P1m),\displaystyle=-\mathfrak{m}(t)+C(H^{T}R^{-1}y+P^{-1}m), (4.5)
d(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t} =2(t)+2C.\displaystyle=-2\mathfrak{C}(t)+2C. (4.6)

Furthermore, as tt\rightarrow\infty,

𝔪(t)C(HTR1y+P1m)=(P1+HTR1H)1(HTR1y+P1m),\displaystyle\begin{split}\mathfrak{m}(t)&\rightarrow C(H^{T}R^{-1}y+P^{-1}m)\\ &=(P^{-1}+H^{T}R^{-1}H)^{-1}(H^{T}R^{-1}y+P^{-1}m),\end{split} (4.7)
(t)\displaystyle\mathfrak{C}(t) C=(P1+HTR1H)1.\displaystyle\rightarrow C=(P^{-1}+H^{T}R^{-1}H)^{-1}. (4.8)

In other words, 𝔪(t)\mathfrak{m}(t) and (t)\mathfrak{C}(t) converge to the true posterior mean and covariance, respectively.

Proof.

It is clear that, for fixed t>0,t>0, {u(n)(t)}n=1N\{u^{(n)}(t)\}_{n=1}^{N} are independent and identically distributed. The evolution of the mean follows directly from (4.4), and the evolution of the covariance follows from (4.4) by applying Itô’s formula. It is then straightforward to derive (4.7) and (4.8) from (4.5) and (4.6), respectively. ∎

4.2 Ensemble Kalman Inversion with Statistical Linearization

Recall that in the formulation of EKI we define a regularized (linearized) data-misfit objective (3.8), with a regularizer on v with respect to the norm |\cdot|_{P_{i}^{uu}}. However, in view of Proposition 3.2, under a linear forward model h(u)=Hu the particles \{u_{i}^{(n)}\}_{n=1}^{N} ‘collapse’, meaning that the empirical covariance of \{Hu_{i}^{(n)}\}_{n=1}^{N} vanishes in the large i limit. One possible remedy is ‘covariance inflation’, namely injecting a certain amount of random noise after each ensemble update. However, this requires ad-hoc tuning of additional hyperparameters. An alternative approach to avoid ensemble collapse is to modify the regularization term in the Levenberg-Marquardt formulation (3.8). The rough idea is to add another regularizer on H_{i}v in the data space, as we describe in what follows.

We define a new regularized data-misfit objective, slightly different from (3.8):

𝖩DM,i(n),UC  (v)=12|yi(n)h(ui(n))Hiv|R2+12α|v|Ci2yi(n)𝒩(y,2α1R),{\mathsf{J}}_{\mbox{\tiny{\rm DM}},i}^{(n),\mbox{\mbox{\tiny{UC} } }}(v)=\frac{1}{2}\bigl{|}y_{i}^{(n)}-h(u_{i}^{(n)})-H_{i}v\bigr{|}^{2}_{R}+\frac{1}{2\alpha}\bigl{|}v\bigr{|}^{2}_{C_{i}}\quad\quad y_{i}^{(n)}\sim\mathcal{N}(y,2\alpha^{-1}R), (4.9)

where

Ci=(P1+HiTR1Hi)1,C_{i}=(P^{-1}+H_{i}^{T}R^{-1}H_{i})^{-1}, (4.10)

and HiH_{i} is defined in (3.1). The regularization term can be decomposed as

|v|Ci2=|v|P2+|Hiv|R2.|v|_{C_{i}}^{2}=|v|_{P}^{2}+|H_{i}v|_{R}^{2}.

The first term can be regarded as a regularization on vv with respect to the prior covariance PP. The second term can be regarded as a regularization on HivH_{i}v with respect to the noise covariance RR. Applying Lemma 2.1, we can calculate the minimizer of (4.9):

vi(n)=(HiTR1Hi+α1Ci1)1HiTR1(yi(n)h(ui(n)))=(α1P1+(1+α1)HiTR1Hi)1HiTR1(yi(n)h(ui(n)))=αPHiT((1+α)HiPHiT+R)1(yi(n)h(ui(n))),\begin{split}v_{i}^{(n)}&=(H_{i}^{T}R^{-1}H_{i}+\alpha^{-1}C_{i}^{-1})^{-1}H_{i}^{T}R^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}\\ &=\big{(}\alpha^{-1}P^{-1}+(1+\alpha^{-1})H_{i}^{T}R^{-1}H_{i}\big{)}^{-1}H_{i}^{T}R^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)}\\ &=\alpha PH_{i}^{T}\big{(}(1+\alpha)H_{i}PH_{i}^{T}+R\big{)}^{-1}\big{(}y_{i}^{(n)}-h(u_{i}^{(n)})\big{)},\end{split} (4.11)

where the second equality follows from the definition of CiC_{i}, and the third equality follows from the matrix identity (2.7). This leads to Algorithm 9.

Algorithm 9 Ensemble Kalman Inversion with Statistical Linearization (EKI-SL)

Input: Initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N}, step size α\alpha.
For i=0,1,i=0,1,\ldots do:

  1. 1.

    Set Ki=αPHiT((1+α)HiPHiT+R)1,Hi=(Piuy)T(Piuu)1K_{i}=\alpha PH_{i}^{T}\big{(}(1+\alpha)H_{i}PH_{i}^{T}+R\big{)}^{-1},\quad\quad H_{i}=(P_{i}^{uy})^{T}(P_{i}^{uu})^{-1}.

  2. 2.

    Draw yi(n)𝒩(y,2α1R)y_{i}^{(n)}\sim\mathcal{N}(y,2\alpha^{-1}R) and set

    ui+1(n)=ui(n)+Ki{yi(n)h(ui(n))},1nN.u_{i+1}^{(n)}=u_{i}^{(n)}+K_{i}\Bigl{\{}y_{i}^{(n)}-h(u_{i}^{(n)})\Bigr{\}},\quad\quad 1\leq n\leq N. (4.12)

Output: Ensemble means m1,m2,m_{1},m_{2},\ldots
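Analogously, a minimal NumPy sketch of one EKI-SL iteration (Algorithm 9) might read as follows; variable names are again ours.

import numpy as np

def eki_sl_step(ens, h, y, P, R, alpha, rng):
    """One EKI-SL update (4.12); ens has shape (N, d)."""
    N, d = ens.shape
    h_ens = np.array([h(u) for u in ens])
    du = ens - ens.mean(axis=0)
    dh = h_ens - h_ens.mean(axis=0)
    Puu = du.T @ du / N
    Puy = du.T @ dh / N
    Hi = Puy.T @ np.linalg.pinv(Puu)                        # statistical linearization H_i
    K = alpha * P @ Hi.T @ np.linalg.inv((1 + alpha) * Hi @ P @ Hi.T + R)   # gain K_i
    ys = rng.multivariate_normal(y, 2 * R / alpha, size=N)  # perturbed data y_i^{(n)}
    return ens + (ys - h_ens) @ K.T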

For small α>0\alpha>0, we interpret the update (4.12) as a discretization of the coupled SDE system

du(n)=PH(t)T(H(t)PH(t)T+R)1((yh(u(n)))dt+2R1/2dW)=C(t)H(t)TR1(yh(u(n)))dt+2C(t)H(t)TR1/2dW,\begin{split}\mathop{}\!\mathrm{d}u^{(n)}&=PH(t)^{T}\big{(}H(t)PH(t)^{T}+R\big{)}^{-1}\Big{(}\big{(}y-h(u^{(n)})\big{)}\mathop{}\!\mathrm{d}t+\sqrt{2}R^{1/2}\mathop{}\!\mathrm{d}W\Big{)}\\ &=C(t)H(t)^{T}R^{-1}\big{(}y-h(u^{(n)})\big{)}\mathop{}\!\mathrm{d}t+\sqrt{2}C(t)H(t)^{T}R^{-1/2}\mathop{}\!\mathrm{d}W,\end{split} (4.13)

where

H(t)\displaystyle H(t) =(Puy(t))T(Puu(t))1,\displaystyle=\big{(}P^{uy}(t)\big{)}^{T}\big{(}P^{uu}(t)\big{)}^{-1},
C(t)\displaystyle C(t) =(P1+H(t)TR1H(t))1.\displaystyle=\big{(}P^{-1}+H(t)^{T}R^{-1}H(t)\big{)}^{-1}.

For our next result we will work under the assumption that H(t)HH(t)\equiv H is constant, which in particular requires that the empirical covariance Puu(t)P^{uu}(t) has full rank for all t.t. Importantly, under this assumption C(t)C=(P1+HTR1H)1C(t)\equiv C=(P^{-1}+H^{T}R^{-1}H)^{-1}and the SDE system is decoupled:

du(n)=CHTR1(yHu(n))dt+2CHTR1/2dW(n).\mathop{}\!\mathrm{d}u^{(n)}=CH^{T}R^{-1}\big{(}y-Hu^{(n)}\big{)}\mathop{}\!\mathrm{d}t+\sqrt{2}CH^{T}R^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}. (4.14)
Proposition 4.2.

For the SDE system (4.13), assume h(u)=Huh(u)=Hu is linear and H(t)HH(t)\equiv H holds. Assume that the initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is made of independent samples from a continuous distribution with finite second moments. Then the mean 𝔪(t)\mathfrak{m}(t) and covariance (t)\mathfrak{C}(t) of u(n)(t)u^{(n)}(t) satisfy

d𝔪(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{m}(t)}{\mathop{}\!\mathrm{d}t} =CHTR1(yH𝔪(t)),\displaystyle=CH^{T}R^{-1}(y-H\mathfrak{m}(t)), (4.15)
d(t)dt\displaystyle\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t} =CHTR1H(t)(t)HTR1HC+2CHTR1HC.\displaystyle=-CH^{T}R^{-1}H\mathfrak{C}(t)-\mathfrak{C}(t)H^{T}R^{-1}HC+2CH^{T}R^{-1}HC. (4.16)

In particular, if Hk×dH\in\mathbb{R}^{k\times d} has full column rank (i.e., dkd\leq k, rank(H)=d\emph{rank}(H)=d), then, as tt\rightarrow\infty,

𝔪(t)\displaystyle\mathfrak{m}(t) (HTR1H)1(HTR1y),\displaystyle\rightarrow(H^{T}R^{-1}H)^{-1}(H^{T}R^{-1}y), (4.17)
(t)\displaystyle\mathfrak{C}(t) C=(P1+HTR1H)1.\displaystyle\rightarrow C=(P^{-1}+H^{T}R^{-1}H)^{-1}. (4.18)

If Hk×dH\in\mathbb{R}^{k\times d} has full row rank (i.e., dkd\geq k, rank(H)=k\emph{rank}(H)=k), then, as t,t\rightarrow\infty,

H𝔪(t)\displaystyle H\mathfrak{m}(t) y,\displaystyle\rightarrow y, (4.19)
H(t)HT\displaystyle H\mathfrak{C}(t)H^{T} HCHT.\displaystyle\rightarrow HCH^{T}. (4.20)
Proof.

Note that here, in contrast to the setting of Proposition 3.2, the distribution of u(n)(t)u^{(n)}(t) does not depend on the ensemble size N,N, and we have simply 𝔪(t)=𝔼[u(n)(t)]\mathfrak{m}(t)=\operatorname{\mathbb{E}}\bigl{[}u^{(n)}(t)\bigr{]} and (t)=𝔼[e(n)(t)e(n)(t)],\mathfrak{C}(t)=\operatorname{\mathbb{E}}\big{[}e^{(n)}(t)\otimes e^{(n)}(t)\big{]}, with e(n)(t):=u(n)(t)𝔪(t).e^{(n)}(t):=u^{(n)}(t)-\mathfrak{m}(t). The evolution of 𝔪(t)\mathfrak{m}(t) follows directly from (4.14). To obtain the evolution of (t)\mathfrak{C}(t), we use a similar technique as in the proof of Proposition 3.2. Applying Itô’s formula,

d(t)dt=𝔼[CHTR1H(e(n)e(n))(e(n)e(n))HTR1HC+2CHTR1HC]=CHTR1H(t)(t)HTR1HC+2CHTR1HC,\begin{split}\frac{\mathop{}\!\mathrm{d}\mathfrak{C}(t)}{\mathop{}\!\mathrm{d}t}&=\operatorname{\mathbb{E}}\Big{[}-CH^{T}R^{-1}H(e^{(n)}\otimes e^{(n)})-(e^{(n)}\otimes e^{(n)})H^{T}R^{-1}HC+2CH^{T}R^{-1}HC\Big{]}\\ &=-CH^{T}R^{-1}H\mathfrak{C}(t)-\mathfrak{C}(t)H^{T}R^{-1}HC+2CH^{T}R^{-1}HC,\end{split}

which recovers (4.16).

If HH has full column rank, then HTR1HH^{T}R^{-1}H is invertible, and (4.17) follows immediately from (4.15). By setting the right-hand side of (4.16) to 0, and using the fact that it has a unique solution, we derive (4.18).

If HH has full row rank, then (4.19) follows immediately from (4.15). Next, the substitutions

~(t)=R1/2H(t)HTR1/2,C~=R1/2HCHTR1/2\widetilde{\mathfrak{C}}(t)=R^{-1/2}H\mathfrak{C}(t)H^{T}R^{-1/2},\quad\quad\quad\widetilde{C}=R^{-1/2}HCH^{T}R^{-1/2}

allow us to transform (4.16) into

d~(t)dt=C~~(t)~(t)C~+2C~2.\frac{\mathop{}\!\mathrm{d}\widetilde{\mathfrak{C}}(t)}{\mathop{}\!\mathrm{d}t}=-\widetilde{C}\widetilde{\mathfrak{C}}(t)-\widetilde{\mathfrak{C}}(t)\widetilde{C}+2\widetilde{C}^{2}. (4.21)

Since C~\widetilde{C} is invertible, by setting the right-hand side of (4.21) to 0 we deduce that ~(t)C~\widetilde{\mathfrak{C}}(t)\rightarrow\widetilde{C} as tt\rightarrow\infty, which recovers (4.20). ∎

4.3 Gradient Structure and Discussion

LM Algorithms in the Continuum Limit

Levenberg-Marquardt algorithms have a natural gradient structure in the continuum limit. This was shown in Equation (2.26), where the preconditioner PP corresponds to the regularizer ||P|\cdot|_{P} that is used in the Levenberg-Marquardt algorithm (2.21). Ensemble-based Levenberg-Marquardt algorithms also possess a gradient structure. To see this, we consider an update ui+1(n)=ui(n)+vi(n)u_{i+1}^{(n)}=u_{i}^{(n)}+v_{i}^{(n)}, where vi(n)v_{i}^{(n)} is the unconstrained minimizer of the same objective as (4.9)

𝖩DM,i(n),UC  (v)=12|yi(n)h(ui(n))Hiv|R2+12α|v|Si2yi(n)𝒩(y,2α1R),{\mathsf{J}}_{\mbox{\tiny{\rm DM}},i}^{(n),\mbox{\mbox{\tiny{UC} } }}(v)=\frac{1}{2}\bigl{|}y_{i}^{(n)}-h(u_{i}^{(n)})-H_{i}v\bigr{|}^{2}_{R}+\frac{1}{2\alpha}\bigl{|}v\bigr{|}^{2}_{S_{i}}\quad\quad y_{i}^{(n)}\sim\mathcal{N}(y,2\alpha^{-1}R),

except that we allow any positive (semi)definite matrix SiS_{i} to act as a ‘regularizer’. The resulting continuum limit is given by the SDE system

du(n)=S(t)H(t)TR1(yh(u(n)))dt+2S(t)H(t)TR1/2dW(n),\mathop{}\!\mathrm{d}u^{(n)}=S(t)H(t)^{T}R^{-1}\big{(}y-h(u^{(n)})\big{)}\mathop{}\!\mathrm{d}t+\sqrt{2}S(t)H(t)^{T}R^{-1/2}\mathop{}\!\mathrm{d}W^{(n)}, (4.22)

where u(n)(t),S(t),H(t)u^{(n)}(t),S(t),H(t) are continuous time analogs of ui(n),Si,Hiu_{i}^{(n)},S_{i},H_{i}. Notice that H(t)TR1(yh(u(n)))H(t)^{T}R^{-1}(y-h(u^{(n)})) is exactly 𝖩DM(u(n))-{\mathsf{J}}_{\mbox{\tiny{\rm DM}}}^{\prime}(u^{(n)}), so that we may rewrite (4.22) as

u˙(n)=S(t)𝖩DM(u(n))+2S(t)H(t)TR1/2W˙(n),\dot{u}^{(n)}=-S(t){\mathsf{J}}_{\mbox{\tiny{\rm DM}}}^{\prime}(u^{(n)})+\sqrt{2}S(t)H(t)^{T}R^{-1/2}\dot{W}^{(n)}, (4.23)

which is a perturbed gradient descent with preconditioner S(t)S(t). We recall that S(t)=Puu(t)S(t)=P^{uu}(t) in EKI and S(t)=C(t)=(P1+H(t)TR1H(t))1S(t)=C(t)=\big{(}P^{-1}+H(t)^{T}R^{-1}H(t)\big{)}^{-1} in EKI-SL. Other choices of S(t)S(t) are possible and will be studied in future work.
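To make the shared structure concrete, the following sketch (ours) performs one Euler-Maruyama step of (4.22) for a generic preconditioner S; passing S = P^{uu} corresponds to EKI and S = (P^{-1}+H_{i}^{T}R^{-1}H_{i})^{-1} to EKI-SL, and the statistical linearization Hi is assumed to have been computed from the ensemble as in (3.1).

import numpy as np

def preconditioned_step(ens, h, y, R, S, Hi, dt, rng):
    """One Euler-Maruyama step of (4.22) for a generic preconditioner S (names ours)."""
    N, d = ens.shape
    h_ens = np.array([h(u) for u in ens])
    Rinv = np.linalg.inv(R)
    drift = (y - h_ens) @ Rinv @ Hi @ S.T                   # rows: S Hi^T R^{-1}(y - h(u^{(n)}))
    evals, evecs = np.linalg.eigh(R)
    Rinv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T    # symmetric R^{-1/2}
    dW = np.sqrt(dt) * rng.standard_normal((N, len(y)))
    noise = np.sqrt(2.0) * dW @ Rinv_half @ Hi @ S.T        # rows: sqrt(2) S Hi^T R^{-1/2} dW
    return ens + dt * drift + noise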

Gauss-Newton Algorithms in the Continuum Limit

Gauss-Newton algorithms also have a natural gradient structure in the continuum limit. As shown in Equation (2.20), the Gauss-Newton method applied to a Tikhonov-Phillips objective can be regarded as a gradient flow with a preconditioner that is the inverse Hessian matrix of the objective. Ensemble-based Gauss-Newton methods also possess a similar gradient structure. Recall that in Equation (4.3) we formulated the continuum limit of IEKF-SL:

du(n)=C(t)(H(t)TR1(yh(u(n)))+P1(mu(n)))dt+2C(t)dW(n).\mathop{}\!\mathrm{d}u^{(n)}=C(t)\Big{(}H(t)^{T}R^{-1}\big{(}y-h(u^{(n)})\big{)}+P^{-1}\big{(}m-u^{(n)}\big{)}\Big{)}\mathop{}\!\mathrm{d}t+\sqrt{2C(t)}\mathop{}\!\mathrm{d}W^{(n)}. (4.24)

Notice that the drift term is exactly C(t)𝖩TP(u(n))-C(t){\mathsf{J}}_{\mbox{\tiny{\rm TP}}}^{\prime}(u^{(n)}), so we may rewrite (4.24) as

u˙(n)=C(t)𝖩TP(u(n))+2C(t)W˙(n),\dot{u}^{(n)}=-C(t){\mathsf{J}}_{\mbox{\tiny{\rm TP}}}^{\prime}(u^{(n)})+\sqrt{2C(t)}\dot{W}^{(n)}, (4.25)

which is a perturbed gradient descent with preconditioner C(t)C(t), the inverse Hessian of 𝖩TP{\mathsf{J}}_{\mbox{\tiny{\rm TP}}}.

A Unified View of Levenberg-Marquardt and Gauss-Newton Algorithms in the Continuum Limit

To summarize the above discussion: in the continuum limit, Levenberg-Marquardt algorithms (e.g., EKI, EKI-SL) are (perturbed) gradient descent methods, with a preconditioner determined by the regularizer used in the Levenberg-Marquardt update step. Gauss-Newton algorithms (e.g., IEKF, IEKF-SL) are also (perturbed) gradient descent methods, with a preconditioner determined by the inverse Hessian of the objective.

Interestingly, there are cases where the two types of algorithms coincide, even with the same amount of perturbation in the gradient descent step. A way to see this is to set the Levenberg-Marquardt regularizer to be the inverse Hessian of the objective. In Equation (4.23), if we consider the Tikhonov-Phillips objective (i.e., replace {\mathsf{J}}_{\rm DM} by {\mathsf{J}}_{\rm TP}, H by G, and R by Q), and set S(t)=C(t), then (4.23) and (4.25) coincide, leading to the same SDE system. This is the reason why we do not introduce a ‘TEKI-SL’ algorithm: it would be identical to IEKF-SL.

Relationship to Ensemble Kalman Sampler (EKS)

Another algorithm of interest is the Ensemble Kalman Sampler (EKS) [23, 17]. Although not discussed in this work, the EKS update is similar to (4.24). In fact, if we replace C(t)C(t) by the ensemble covariance Puu(t)P^{uu}(t), and use the fact that Puu(t)H(t)T=Puy(t)P^{uu}(t)H(t)^{T}=P^{uy}(t) by definition, we immediately recover the evolution equation of EKS:

\mathrm{d}u^{(n)}=\Big(P^{uy}(t)R^{-1}\big(y-h(u^{(n)})\big)+P^{uu}(t)P^{-1}\big(m-u^{(n)}\big)\Big)\mathrm{d}t+\sqrt{2P^{uu}(t)}\,\mathrm{d}W^{(n)}.

Then an Euler-Maruyama discretization is performed to obtain the update formula in discrete form. However, as we have found in several numerical experiments, the length-step \alpha needs to be chosen carefully. If the noise covariance R is small in scale, EKS often blows up in the first few iterations, while other algorithms with the same length-step do not. Also, if we consider a linear forward model h(u)=Hu, EKS still requires the mean-field limit N\rightarrow\infty in order to converge to the posterior distribution, whereas IEKF-SL only needs approximately N\approx d particles, where d is the input dimension.
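For completeness, a minimal Euler-Maruyama sketch of the EKS evolution equation above, with a fixed length-step and variable names of our choosing, could be implemented as follows.

import numpy as np

def eks_step(ens, h, y, m, R, P, dt, rng):
    """Euler-Maruyama step of the EKS SDE; ens has shape (N, d)."""
    N, d = ens.shape
    h_ens = np.array([h(u) for u in ens])
    du = ens - ens.mean(axis=0)
    dh = h_ens - h_ens.mean(axis=0)
    Puu = du.T @ du / N                                     # empirical covariance P^{uu}
    Puy = du.T @ dh / N                                     # cross-covariance P^{uy}
    drift = (y - h_ens) @ np.linalg.inv(R) @ Puy.T + (m - ens) @ np.linalg.inv(P) @ Puu.T
    # noise with covariance 2 * Puu * dt, via a symmetric square root of Puu
    evals, evecs = np.linalg.eigh(Puu)
    Puu_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T
    noise = np.sqrt(2 * dt) * rng.standard_normal((N, d)) @ Puu_half
    return ens + dt * drift + noise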

5 Numerical Examples

In this section we provide a numerical comparison of the iterative ensemble Kalman methods introduced in Sections 3 and 4. Our experiments highlight the variety of applications that have motivated the development of these methods, and illustrate their use in a wide range of settings.

5.1 Elliptic Boundary Value Problem

In this subsection we consider a simple nonlinear Bayesian inverse problem originally presented in [20], where the unknown parameter is two dimensional and the forward map admits a closed formula. These features facilitate both the visualization and the interpretation of the solution.

5.1.1 Problem Setup

Consider the elliptic boundary value problem

-\frac{\mathrm{d}}{\mathrm{d}x}\Big(\exp(u_{1})\frac{\mathrm{d}}{\mathrm{d}x}p(x)\Big)=1,\quad x\in(0,1),\qquad p(0)=0,\quad p(1)=u_{2}. (5.1)

It can be checked that (5.1) has an explicit solution

pu(x)=u2x12exp(u1)(x2x).p_{u}(x)=u_{2}x-\frac{1}{2}\exp(-u_{1})(x^{2}-x). (5.2)
Figure 1: Ensemble members (green) after 100 iterations, with truth u^{\dagger} (red star) and contour plot of the (unnormalized) posterior density. Panels: (a) truth u^{\dagger}; (b) EKI; (c) TEKI; (d) IEKF; (e) EKI-SL; (f) IEKF-SL.

We seek to recover u=(u1,u2)Tu=(u_{1},u_{2})^{T} from noisy observation of pup_{u} at two points x1=0.25x_{1}=0.25 and x2=0.75x_{2}=0.75. Precisely, we define h(u):=(pu(x1),pu(x2))Th(u):=\bigl{(}p_{u}(x_{1}),p_{u}(x_{2})\bigr{)}^{T} and consider the inverse problem of recovering uu from data yy of the form

y=h(u)+η,y=h(u)+\eta, (5.3)

where η𝒩(0,γ2I2)\eta\sim\mathcal{N}(0,\gamma^{2}I_{2}). We set a Gaussian prior on the unknown parameter u𝒩(0,1)×𝒩(100,16)u\sim\mathcal{N}(0,1)\times\mathcal{N}(100,16). In our numerical experiments we let the true parameter be u=(2.6,104.5)Tu^{\dagger}=(-2.6,104.5)^{T} and use it to generate synthetic data y=h(u)+ηy=h(u^{\dagger})+\eta with noise level γ=0.1.\gamma=0.1.
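Since the forward map admits the closed formula (5.2), the problem setup can be reproduced with a few lines of NumPy; the random seed below is ours.

import numpy as np

def forward(u):
    """Forward map h(u) = (p_u(0.25), p_u(0.75)) using the explicit solution (5.2)."""
    u1, u2 = u
    p = lambda x: u2 * x - 0.5 * np.exp(-u1) * (x ** 2 - x)
    return np.array([p(0.25), p(0.75)])

rng = np.random.default_rng(0)
u_true = np.array([-2.6, 104.5])
gamma = 0.1
y = forward(u_true) + gamma * rng.standard_normal(2)   # synthetic data (5.3)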

5.1.2 Implementation Details and Numerical Results

We set the ensemble size to be N=50N=50. The initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is drawn independently from the prior. The length-step α\alpha is fixed to be 0.1 for all methods. We run each algorithm up to time T=10T=10, which corresponds to 100 iterations.

We plot the level curves of the posterior density of u=(u_{1},u_{2})^{T} in Figure 1. The forward model is approximately linear, as can be seen from the contour plots in Figure 1 or directly from Equation (5.2). Hence this example can be used to validate the claims made for the linear setting in previous sections.

Figure 2: Evolution of the Frobenius norm of the ensemble covariance Puu(t)P^{uu}(t). The norm of IEKF-RZL blows up after a few iterations. The norms of the EKI and TEKI are almost identical and monotonically decreasing. The norms of the new variants EKI-SL and IEKF-SL are similar and stabilize after around 40 iterations. The norm of IEKF lies between those of the old and new variants.
Figure 3: EKI & EKI-SL: Relative errors and data misfit w.r.t time tt.
Figure 4: TEKI, IEKF & IEKF-SL: Relative errors and Tikhonov-Phillips objective w.r.t time tt.

Figure 2 compares the time evolution of the empirical covariance for the different methods. The IEKF-RZL algorithm consistently blows up in the first few iterations due to the small observation noise; as discussed in Section 3.1, this is an important drawback. Because of the poor performance of IEKF-RZL in small noise regimes, we do not include this method in subsequent comparisons. The EKI and TEKI plots show the ‘ensemble collapse’ phenomenon discussed in Sections 3.2 and 3.3: the size of their empirical covariances decreases monotonically, and by the 100-th iteration it is one order of magnitude smaller than that of the new variants IEKF-SL and EKI-SL. The ensemble collapse of EKI and TEKI can also be seen in Figure 1, where we plot all the ensemble members after 100 iterations. In contrast, Figure 2 shows that the size of the empirical covariances of the new variants IEKF-SL and EKI-SL stabilizes after around 40 iterations, and Figure 1 suggests that the spread of the ensemble matches that of the posterior. These results agree with the theoretical results derived in Sections 4.1 and 4.2 in a linear setting. Although we have not analyzed the IEKF method when \alpha is small, its ensemble covariance does not collapse in the linear setting, as can be seen in Figure 2.

In Figures 3 and 4 we show the performance of the different methods along the full iteration sequence. Here and in subsequent numerical examples we use two performance assessments. First, the relative error, defined as |m(t)u|/|u||m(t)-u^{\dagger}|/|u^{\dagger}| where m(t)m(t) is the ensemble’s empirical mean, which evaluates how well the ensemble mean approximates the truth. Second, we assess how each ensemble method performs in terms of its own optimization objective. Precisely, we report the data-misfit objective 𝖩DM(m(t)){\mathsf{J}}_{\mbox{\tiny{\rm DM}}}\big{(}m(t)\big{)} for EKI and EKI-SL, and the Tikhonov-Phillips objective 𝖩TP(m(t)){\mathsf{J}}_{\mbox{\tiny{\rm TP}}}\big{(}m(t)\big{)} for IEKF, TEKI and IEKF-SL. We run 10 trials for each algorithm, using different ensemble initializations (drawn from the same prior), and generate the error bars accordingly. Since this is a simple toy problem, all methods perform well. However, we note that the first iterations of EKI and TEKI reduce the objective faster than other algorithms.
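For reference, the two performance assessments can be computed with a few lines of NumPy; we follow the convention |a|_{B}^{2}=a^{T}B^{-1}a used in the least-squares objectives of this paper, and the function names are ours.

import numpy as np

def relative_error(ens, u_true):
    """Relative error |m(t) - u^dagger| / |u^dagger| of the ensemble mean."""
    m = ens.mean(axis=0)
    return np.linalg.norm(m - u_true) / np.linalg.norm(u_true)

def data_misfit(u, h, y, R):
    """J_DM(u) = 0.5 |y - h(u)|_R^2, with |a|_B^2 = a^T B^{-1} a."""
    r = y - h(u)
    return 0.5 * r @ np.linalg.solve(R, r)

def tikhonov_phillips(u, h, y, m, R, P):
    """J_TP(u) = 0.5 |y - h(u)|_R^2 + 0.5 |u - m|_P^2."""
    p = u - m
    return data_misfit(u, h, y, R) + 0.5 * p @ np.linalg.solve(P, p)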

5.2 High-Dimensional Linear Inverse Problem

In this subsection we consider a linear Bayesian inverse problem from [31]. This example illustrates the use of iterative ensemble Kalman methods in settings where both the size of the ensemble and the dimension of the data are significantly smaller than the dimension of the unknown parameter.

5.2.1 Problem setup

Consider the one dimensional elliptic equation

d2pdx2+p=u,x(0,π),p(0)=p(π)=0.\displaystyle\begin{split}-\frac{\mathop{}\!\mathrm{d}^{2}p}{\mathop{}\!\mathrm{d}x^{2}}+p&=u,\quad x\in(0,\pi),\\ p(0)&=p(\pi)=0.\end{split} (5.4)

We seek to recover uu from noisy observation of pp at k=241k=2^{4}-1 equispaced points xj=j24π.x_{j}=\frac{j}{2^{4}}\pi. We assume that the data is generated from the model

yj=p(xj)+ηj,j=1,,k,y_{j}=p(x_{j})+\eta_{j},\quad j=1,\dots,k, (5.5)

where ηj𝒩(0,γ2)\eta_{j}\sim\mathcal{N}(0,\gamma^{2}) are independent. By defining A=d2dx2+idA=-\frac{\mathop{}\!\mathrm{d}^{2}}{\mathop{}\!\mathrm{d}x^{2}}+id and letting 𝒪\mathcal{O} be the observation operator defined by (𝒪(p))j=p(xj)\big{(}\mathcal{O}(p)\big{)}_{j}=p(x_{j}), we can rewrite (5.5) as

y=h(u)+η,η𝒩(0,γ2Ik),y=h(u)+\eta,\quad\eta\sim\mathcal{N}(0,\gamma^{2}I_{k}),

where h=𝒪A1h=\mathcal{O}\circ A^{-1}. The forward problem (5.4) is solved on a uniform mesh with meshwidth w=28w=2^{-8} by a finite element method with continuous, piecewise linear basis functions. We assume that the unknown parameter uu has a Gaussian prior distribution, u𝒩(0,C0)u\sim\mathcal{N}(0,C_{0}) with covariance operator C0=10(Aid)1C_{0}=10(A-id)^{-1} with homogeneous Dirichlet boundary conditions. This prior can be interpreted as the law of a Brownian bridge between 0 and π\pi. For computational purposes we view uu as a random vector in 28\mathbb{R}^{2^{8}} and the linear map h(u)h(u) is represented by a matrix H(241)×28.H\in\mathbb{R}^{(2^{4}-1)\times 2^{8}}. The true parameter uu^{\dagger} is sampled from this prior (cf. Figure 5), and is used to generate synthetic observation data y=Hu+ηy=Hu^{\dagger}+\eta with noise level γ=0.01\gamma=0.01.
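The following minimal sketch (ours) assembles a finite-difference surrogate of this linear forward map; we stress that the paper uses a piecewise-linear finite-element discretization, so the grid, the nearest-node observation operator and all variable names below are illustrative assumptions.

import numpy as np

n = 2 ** 8                                   # number of interior grid points (our choice)
w = np.pi / (n + 1)                          # meshwidth
x = np.linspace(w, np.pi - w, n)             # interior nodes
# second-difference approximation of -d^2/dx^2 with homogeneous Dirichlet conditions
L = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / w ** 2
A = L + np.eye(n)                            # discretization of A = -d^2/dx^2 + id
# observation operator: nearest grid nodes to the 15 observation points x_j = j pi / 2^4
obs_x = np.arange(1, 2 ** 4) / 2 ** 4 * np.pi
obs_idx = np.array([np.argmin(np.abs(x - xo)) for xo in obs_x])
H = np.linalg.inv(A)[obs_idx, :]             # the linear forward map: h(u) = H u
C0 = 10 * np.linalg.inv(L)                   # discretized prior covariance 10 (A - id)^{-1}

rng = np.random.default_rng(0)
u_true = np.linalg.cholesky(C0 + 1e-12 * np.eye(n)) @ rng.standard_normal(n)  # prior sample
y = H @ u_true + 0.01 * rng.standard_normal(len(obs_idx))                     # synthetic data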

5.2.2 Implementation Details and Numerical Results

We set the ensemble size to be N=50N=50. The initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is drawn independently from the prior. The length-step α\alpha is fixed to be 0.05 for all methods. We run each algorithm up to time T=30T=30, which corresponds to 600 iterations.

We note that the linear inverse problem considered here has input dimension 2^8=256, which is much larger than the ensemble size N. Combining the results from Figures 5, 6 and 7, EKI and EKI-SL clearly overfit the data. In contrast, TEKI over-regularizes, and the ensemble collapse is clearly noticeable. IEKF and IEKF-SL lie in between, and approximate the truth slightly better.

Figure 5: Ensemble mean (red) at the final iteration, with 10- and 90-quantiles (blue). Panels: (a) truth u^{\dagger}; (b) EKI; (c) TEKI; (d) IEKF; (e) EKI-SL; (f) IEKF-SL.
Figure 6: EKI & EKI-SL: Relative errors and data misfit w.r.t time tt.
Figure 7: IEKF, TEKI & IEKF-SL: Relative errors and Tikhonov-Phillips objective w.r.t time tt.

5.3 Lorenz-96 Model

In this subsection we investigate the use of iterative ensemble Kalman methods to recover the initial condition of the Lorenz-96 system from partial and noisy observation of the solution at two positive times. The experimental framework is taken from [14] and is illustrative of the use of iterative ensemble Kalman methods in numerical weather forecasting.

5.3.1 Problem Setup

Consider the dynamical system

dzldt=zl1(zl+1zl2)zl+F,l=1,,d,z0=zd,zd+1=z1,z1=zd1.\begin{gathered}\frac{dz_{l}}{dt}=z_{l-1}(z_{l+1}-z_{l-2})-z_{l}+F,\quad l=1,\ldots,d,\\ z_{0}=z_{d},\quad z_{d+1}=z_{1},\quad z_{-1}=z_{d-1}.\end{gathered} (5.6)

Here z_{l} denotes the l-th coordinate of the current state of the system and F is a constant forcing with default value F=8. The dimension is often chosen as d=40, which is the value we use. We want to recover the initial condition

u:=(z1(0),,zd(0))Tu:=(z_{1}(0),\dots,z_{d}(0))^{T}

of (5.6) from noisy partial measurements at discrete times {si}i=1I\{s_{i}\}_{i=1}^{I}:

y_{i,j}=z_{l_{j}}(s_{i})+\eta_{i,j}, (5.7)

where z(t) denotes the solution of (5.6) with initial condition z(0)=u, \{l_{j}\}_{j=1}^{J}\subset\{1,2,\dots,d\} is the subset of observed coordinates, and \eta_{i,j}\sim\mathcal{N}(0,\gamma^{2}) are assumed to be independent. In our numerical experiments we set I=2, J=20, \{s_{1},s_{2}\}=\{0.3,0.6\}, and \{l_{j}\}_{j=1}^{20}=\{1,3,5,\dots,39\}. We set the prior on u to be a Gaussian \mathcal{N}(0,2I_{d}). The true parameter u^{\dagger}\in\mathbb{R}^{40} is shown in Figure 8, and is used to generate the observation data \{y_{i,j}\} according to (5.7) with noise level \gamma=0.01.
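A minimal sketch of the forward map, using SciPy's solve_ivp as a black-box integrator (our choice; tolerances and names are ours), could read as follows.

import numpy as np
from scipy.integrate import solve_ivp

F, d = 8.0, 40

def lorenz96(t, z):
    """Right-hand side of the Lorenz-96 system (5.6) with cyclic indexing."""
    return (np.roll(z, -1) - np.roll(z, 2)) * np.roll(z, 1) - z + F

obs_idx = np.arange(0, d, 2)              # observed coordinates l_j = 1, 3, ..., 39 (0-based)
obs_times = [0.3, 0.6]

def forward(u):
    """Map the initial condition u to the observed coordinates at times s_1 and s_2."""
    sol = solve_ivp(lorenz96, (0.0, obs_times[-1]), u, t_eval=obs_times, rtol=1e-8)
    return sol.y[obs_idx, :].T.ravel()    # observations at s_1 followed by those at s_2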

5.3.2 Implementation Details and Numerical Results

We set the ensemble size to be N=50N=50. The initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is drawn independently from the prior. The length-step α\alpha is fixed to be 0.05 for all methods. We run each algorithm up to time T=30T=30, which corresponds to 600 iterations.

This is a moderately high dimensional nonlinear problem, where the forward model involves a black-box solver of differential equations. Figure 8 clearly indicates an ensemble collapse for the EKI and TEKI methods. Although they are still capable of finding a descent direction for the loss (see Figures 9 and 10), the iterates get stuck in a local minimum. In contrast, we can see the advantage of IEKF, EKI-SL and IEKF-SL in this setting: intuitively, a broader spread of the ensemble allows these methods to ‘explore’ different regions in the input space, and thereby to find a solution with potentially lower loss.

We notice that in practice ‘covariance inflation’ is often applied in EKI and TEKI algorithms, by manually injecting small random noise after each ensemble update. In general, this technique will ‘force’ a non-zero empirical covariance, prevent ensemble collapse and boost the performance of EKI and TEKI. However, the amount of noise injected is an additional hyperparameter that needs to be chosen manually, and that should depend on the input dimension, observation noise, etc. Here we give a fair comparison of the different methods introduced, under the same time-stepping setting, with as few hyperparameters as possible.

Figure 8: Ensemble mean (red) at the final iteration, with 10- and 90-quantiles (blue). Panels: (a) truth u^{\dagger}; (b) EKI; (c) TEKI; (d) IEKF; (e) EKI-SL; (f) IEKF-SL.
Figure 9: EKI & EKI-SL: Relative errors and data misfit w.r.t time tt.
Figure 10: IEKF, TEKI & IEKF-SL: Relative errors and Tikhonov-Phillips objective w.r.t time tt.

5.4 High-Dimensional Nonlinear Regression

In this subsection we consider a nonlinear regression problem with a highly oscillatory forward map introduced in [19], where the authors investigate the use of iterative ensemble Kalman methods to train neural networks without back propagation.

5.4.1 Problem Setup

We consider a nonlinear regression problem

y=h(u)+η,η𝒩(0,γ2Ik),y=h(u)+\eta,\quad\eta\sim\mathcal{N}(0,\gamma^{2}I_{k}),

where ud,yku\in\mathbb{R}^{d},y\in\mathbb{R}^{k}, and hh is defined by

h(u):=Au+sin(cBu),h(u):=Au+\sin(cBu), (5.8)

where A,Bk×dA,B\in\mathbb{R}^{k\times d} are random matrices with independent 𝒩(0,1)\mathcal{N}(0,1) entries. We set d=200d=200, k=150k=150 and c=20c=20. We want to recover uu from yy. We assume that the unknown uu has a Gaussian prior u𝒩(0,4Id)u\sim\mathcal{N}(0,4I_{d}). The true parameter uu^{\dagger} is set to be 2𝟏2\cdot\bf{1}, where 𝟏\bm{1} is the all-one vector. Observation data is generated as y=h(u)+ηy=h(u^{\dagger})+\eta.
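The forward map and synthetic data can be reproduced with a few lines of NumPy; note that the random matrices A and B (and hence the data) depend on the seed, which is our choice here.

import numpy as np

d, k, c = 200, 150, 20
rng = np.random.default_rng(0)
A = rng.standard_normal((k, d))
B = rng.standard_normal((k, d))

def forward(u):
    """Highly oscillatory forward map (5.8): h(u) = Au + sin(c * Bu)."""
    return A @ u + np.sin(c * (B @ u))

u_true = 2.0 * np.ones(d)
gamma = 0.01
y = forward(u_true) + gamma * rng.standard_normal(k)   # synthetic data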

By definition, h is highly oscillatory, and we may expect that the loss function, either the Tikhonov-Phillips or the data-misfit objective, will have many local minima. Figure 11 visualizes the Tikhonov-Phillips objective function {\mathsf{J}}_{\rm TP}(u) with respect to two randomly chosen coordinates, while the other coordinates are fixed at the value 2. The data-misfit objective exhibits similar behavior.

Figure 11: Tikhonov-Phillips objective function with respect to two randomly chosen coordinates.

5.4.2 Implementation Details and Numerical Results

We set the ensemble size to be N=50N=50. The initial ensemble {u0(n)}n=1N\{u_{0}^{(n)}\}_{n=1}^{N} is drawn independently from the prior. The length-step α\alpha is fixed to be 0.05 for all methods. We set γ=0.01\gamma=0.01 for the observation noise. We run each algorithm up to time T=30T=30, which corresponds to 600 iterations.

We notice that this is a difficult problem, due to its high dimensionality and nonlinearity. No method except IEKF-SL is capable of reconstructing the truth. In particular, Figures 12, 13 and 14 show that both EKI and TEKI do a poor job, with relative error larger than 1, while IEKF and EKI-SL perform slightly better. It is worth noticing from Figure 14 that IEKF-SL attains a larger Tikhonov-Phillips objective value, yet a much lower relative error. This may suggest that regularization objectives other than the Tikhonov-Phillips objective could be used to solve this problem.

Figure 12: Ensemble mean (red) at the final iteration, with 10- and 90-quantiles (blue). Panels: (a) truth u^{\dagger}; (b) EKI; (c) TEKI; (d) IEKF; (e) EKI-SL; (f) IEKF-SL.
Figure 13: EKI & EKI-SL: Relative errors and data misfit w.r.t time tt.
Figure 14: IEKF, TEKI & IEKF-SL: Relative errors and Tikhonov-Phillips objective w.r.t time tt.

6 Conclusions and Open Directions

In this paper we have provided a unified perspective of iterative ensemble Kalman methods and introduced some new variants. We hope that our work will stimulate further research in this active area, and we conclude with a list of some open directions.

  • Continuum limits have been formally derived in our work. The rigorous derivation and analysis of SDE continuum limits, possibly in nonlinear settings, deserves further research.

  • We have advocated the analysis of continuum limits for the understanding and design of iterative ensemble Kalman methods, but it would also be desirable to develop a framework for the analysis of discrete, implementable algorithms, and to further understand the potential benefits of various discretizations of a given continuum SDE system.

  • From a theoretical viewpoint, it would be desirable to further analyze the convergence and stability of iterative ensemble methods with small or moderate ensemble size. While mean-field limits can be revealing, in practice the ensemble size is often not sufficiently large to justify the mean-field assumption. It would also be important to further analyze these questions in mildly nonlinear settings.

  • An important methodological question, still largely unresolved, is the development of adaptive and easily implementable line search schemes and stopping criteria for ensemble-based optimization schemes. An important work in this direction is [32].

  • Another avenue for future methodological research is the development of iterative ensemble Kalman methods that are sparse-promoting, considering alternative regularizations beyond the least-squares objectives considered in our paper [37, 38, 53].

  • Ensemble methods are cheap in comparison to derivative-based optimization methods and Markov chain Monte Carlo sampling algorithms. It is thus natural to use ensemble methods to build ensemble preconditioners or surrogate models to be used within more expensive but accurate computational approaches.

  • Finally, a broad area for further work is the application of iterative ensemble Kalman filters to new problems in science and engineering.

Acknowledgments

NKC is supported by KAUST baseline funding. DSA is thankful for the support of NSF and NGA through the grant DMS-2027056. The work of DSA was also partially supported by the NSF Grant DMS-1912818/1912802.

References

  • [1] S. I. Aanonsen, G. Nævdal, D. S. Oliver, A. C. Reynolds, B. Vallès and others. The ensemble Kalman filter in reservoir engineering - a review. SPE Journal, 14(3), 393–412, 2009.
  • [2] D. J. Albers, P.-A. Blancquart, M. E. Levine, E. E. Seylabi and A. M. Stuart. Ensemble Kalman methods with constraints. Inverse Problems, 35(9), 2019.
  • [3] J. L. Anderson. An ensemble adjustment Kalman filter for data assimilation. Monthly Weather Review, 129(12), 2884–2903, 2001.
  • [4] B. D. O. Anderson and J. B. Moore. Optimal Filtering. Information and System Sciences Series, 1979.
  • [5] B. M. Bell. The iterated Kalman smoother as a Gauss-Newton method. SIAM Journal on Optimization, 4, 626–636, 1994.
  • [6] B. M. Bell and F. W. Cathey. The iterated Kalman filter update as a Gauss-Newton method. IEEE Transactions on Automatic Control, 38(2), 294–297, 1993.
  • [7] D. P. Bertsekas. Incremental least squares methods and the extended Kalman filter. SIAM Journal on Optimization, 6, 807–822, 1996.
  • [8] D. Blömker, C. Schillings, P. Wacker and S. Weissmann. Well posedness and convergence analysis of the ensemble Kalman inversion. Inverse Problems, 2019.
  • [9] D. Blömker, C. Schillings and P. Wacker. A strongly convergent numerical scheme from ensemble Kalman inversion. SIAM Journal on Numerical Analysis, 56(4), 2537–2562, 2018.
  • [10] N. K. Chada. Analysis of hierarchical ensemble Kalman inversion. arXiv preprint arXiv:1801.00847, 2018.
  • [11] N. K. Chada, M. A. Iglesias, L. Roininen and A. M. Stuart. Parameterizations for ensemble Kalman inversion. Inverse Problems, 34(5), 2018.
  • [12] N. K. Chada, C. Schillings and S. Weissmann. On the incorporation of box-constraints for ensemble Kalman inversion. arXiv preprint arXiv:1908.00696, 2019.
  • [13] N. K. Chada, A. M. Stuart and X. T. Tong. Tikhonov regularization within ensemble Kalman inversion. SIAM Journal on Numerical Analysis, 58(2), 1263–1294, 2020.
  • [14] N. K. Chada and X. T. Tong. Convergence acceleration of ensemble Kalman inversion in nonlinear settings. arXiv preprint arXiv:1911.02424, 2019.
  • [15] Y. Chen and D. S. Oliver. Ensemble randomized maximum likelihood method as an iterative ensemble smoother. Mathematical Geosciences, 2012.
  • [16] J. E. Dennis and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM, 1996.
  • [17] Z. Ding and Q. Li. Ensemble Kalman sampling: mean-field limit and convergence analysis. arXiv preprint arXiv:1910.12923, 2019.
  • [18] A. Emerick and A. C. Reynolds. Ensemble smoother with multiple data assimilation. Computers and Geosciences, 55, 3–15, 2013.
  • [19] E. Haber, F. Lucka and L. Ruthotto. Never look back - A modified EnKF method and its application to the training of neural networks without back propagation. arXiv preprint arXiv:1805.08034, 2018.
  • [20] O. G. Ernst, B. Sprungk and H. J. Starkloff. Analysis of the ensemble and polynomial chaos Kalman filters in Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification, 3(1), 823–851, 2015.
  • [21] G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer, 2009.
  • [22] G. Evensen and P. J. Van Leeuwen. Assimilation of Geosat altimeter data for the Agulhas current using the ensemble Kalman filter with a quasi-geostrophic model. Monthly Weather Review, 128, 85–86, 1996.
  • [23] A. Garbuno-Inigo, F. Hoffmann, W. Li and A. M. Stuart. Interacting Langevin diffusions: gradient structure and ensemble Kalman sampler. arXiv preprint arXiv:1903.08866, 2019.
  • [24] Y. Gu and D. S. Oliver. An iterative ensemble Kalman filter for multiphase fluid flow data assimilation. SPE Journal, 12(4), 438–446, 2007.
  • [25] E. Haber, F. Lucka and L. Ruthotto. Never look back - A modified EnKF method and its application to the training of neural networks without back propagation. arXiv preprint arXiv:1805.08034, 2018.
  • [26] M. Hanke. A regularizing Levenberg-Marquardt scheme, with applications to inverse groundwater filtration problems. Inverse Problems, 13, 79–95, 1997.
  • [27] M. Herty and G. Visconti. Kinetic methods for inverse problems. arXiv preprint arXiv:1811.09387, 2018.
  • [28] I. Grooms. A note on the formulation of the Ensemble Adjustment Kalman Filter. arXiv preprint arXiv:2006.02941, 2020.
  • [29] P. A. Guth, C. Schillings and S. Weissmann. Ensemble Kalman filter for neural network based one-shot inversion. arXiv preprint arXiv:2005.02039, 2020.
  • [30] M. A. Iglesias. A regularising iterative ensemble Kalman method for PDE-constrained inverse problems. Inverse Problems, 32, 2016.
  • [31] M. A. Iglesias, K. J. H. Law and A. M. Stuart. Ensemble Kalman methods for inverse problems. Inverse Problems, 29, 2013.
  • [32] M. A. Iglesias and Y. Yang. Adaptive regularisation for ensemble Kalman inversion with applications to non-destructive testing and imaging. arXiv preprint arXiv:2006.14980, 2020.
  • [33] A. H. Jazwinski. Stochastic Processes and Filtering Theory. Courier Corporation, 2007.
  • [34] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35–45, 1960.
  • [35] D. Kelly, K. J. H. Law and A. M. Stuart. Well-posedness and accuracy of the ensemble Kalman filter in discrete and continuous time. Nonlinearity, 27(10), 2014.
  • [36] B. Kaltenbacher, A. Neubauer and O. Scherzer. Iterative Regularization Methods for Nonlinear Ill-Posed Problems. Radon Series on Computational and Applied Mathematics, de Gruyter, Berlin, 1st edition, 2008.
  • [37] N. B. Kovachki and A. M. Stuart. Ensemble Kalman inversion: a derivative-free technique for machine learning tasks. Inverse Problems, 2019.
  • [38] Y. Lee. l_p regularization for ensemble Kalman inversion. arXiv preprint arXiv:2009.03470, 2020.
  • [39] G. Li and A. C. Reynolds. An iterative ensemble Kalman filter for data assimilation. In SPE Annual Technical Conference and Exhibition. Society of Petroleum Engineers, 2007.
  • [40] R. J. Lorentzen, K.-K. Fjelde, J. Froyen, A. C. Lage, G. Nævdal and E. H. Vefring. Underbalanced drilling: real time data interpretation and decision support. In SPE/IADC Drilling Conference. Society of Petroleum Engineers, 2001.
  • [41] S. Lu and S. V. Pereverzev. Multi-parameter regularization and its numerical realization. Numerische Mathematik, 118(1), 1–31, 2011.
  • [42] A. J. Majda and J. Harlim. Filtering Complex Turbulent Systems. Cambridge University Press, 2012.
  • [43] G. Nævdal, T. Mannseth and E. H. Vefring. Instrumented wells and near-well reservoir monitoring through ensemble Kalman filter. In Proceedings of the 8th European Conference on the Mathematics of Oil Recovery, Freiberg, Germany, 2001.
  • [44] A. S. Nemirovsky. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics, 1983.
  • [45] J. Nocedal and S. Wright. Numerical Optimization. Springer Science and Business Media, 2006.
  • [46] D. Oliver, A. C. Reynolds and N. Liu. Inverse Theory for Petroleum Reservoir Characterization and History Matching. Cambridge University Press, 1st edition, 2008.
  • [47] S. Reich and C. J. Cotter. Ensemble filter techniques for intermittent data assimilation. Large Scale Inverse Problems. Computational Methods and Applications in the Earth Sciences, 13, 91–134, 2013.
  • [48] A. C. Reynolds, M. Zafari and G. Li. Iterative forms of the ensemble Kalman filter. In ECMOR X - 10th European Conference on the Mathematics of Oil Recovery, pages cp–23. European Association of Geoscientists & Engineers, 2006.
  • [49] P. Sakov, D. S. Oliver and L. Bertino. An iterative EnKF for strongly nonlinear systems. Monthly Weather Review, 140, 1988–2004, 2012.
  • [50] D. Sanz-Alonso, A. M. Stuart and A. Taeb. Inverse Problems and Data Assimilation. arXiv preprint arXiv:1810.06191, 2019.
  • [51] C. Schillings and A. M. Stuart. Analysis of the ensemble Kalman filter for inverse problems. SIAM Journal on Numerical Analysis, 55(3), 1264–1290, 2017.
  • [52] C. Schillings and A. M. Stuart. Convergence analysis of ensemble Kalman inversion: the linear, noisy case. Applicable Analysis, 97(1), 107–123, 2018.
  • [53] T. Schneider, A. M. Stuart and J.-L. Wu. Imposing sparsity within ensemble Kalman inversion. arXiv preprint arXiv:2007.06175, 2020.
  • [54] B. Shi, S. Du, M. I. Jordan and W. J. Su. Understanding the acceleration phenomenon via high-resolution differential equations. arXiv preprint arXiv:1810.08907, 2018.
  • [55] J. A. Skjervheim and G. Evensen. An ensemble smoother for assisted history matching. SPE Reservoir Simulation Symposium, 21–23 February, The Woodlands, Texas, 2001.
  • [56] W. Su, S. Boyd and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Advances in Neural Information Processing Systems, 2510–2518, 2014.
  • [57] W. Su, S. Boyd and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. The Journal of Machine Learning Research, 17(1), 5312–5354, 2016.
  • [58] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, 2015.
  • [59] M. K. Tippett, J. L. Anderson, C. H. Bishop and J. S. Whitaker. Ensemble square root filters. Monthly Weather Review, 131(7), 1485–1490, 2003.
  • [60] S. Ungarala. On the iterated forms of Kalman filters using statistical linearization. Journal of Process Control, 22(5), 935–943, 2012.
  • [61] A. Wibisono, A. C. Wilson and M. I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47), E7351–E7358, 2016.