
Second order ensemble Langevin method for sampling and inverse problems

Ziming Liu (Department of Physics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, [email protected])
Andrew Stuart (Applied and Computational Mathematics, California Institute of Technology, Pasadena, California 91125, [email protected])
Yixuan Wang (Applied and Computational Mathematics, California Institute of Technology, Pasadena, California 91125, [email protected])
Abstract

We propose a sampling method based on an ensemble approximation of second order Langevin dynamics. The log target density is appended with a quadratic term in an auxiliary momentum variable, and damped-driven Hamiltonian dynamics are introduced; the resulting stochastic differential equation leaves the Gibbs measure invariant, with marginal on the position coordinates given by the target. A preconditioner based on the covariance under the law of the dynamics does not change this invariance property, and is introduced to accelerate convergence to the Gibbs measure. The resulting mean-field dynamics may be approximated by an ensemble method; this results in a gradient-free and affine-invariant stochastic dynamical system. Numerical results demonstrate its potential as the basis for a numerical sampler in Bayesian inverse problems.

keywords:
Bayesian inverse problems; Sampling; Ensemble method; Second order Langevin equation; Hybrid Monte Carlo; Mean-field models; Nonlinear Fokker-Planck equation.

1 Introduction

1.1 Set-Up

Consider sampling the distribution

$$\pi(q)=\frac{1}{Z_{q}}\exp(-\Phi(q))\,,$$

where $\Phi:\mathbb{R}^{N}\to\mathbb{R}$ is termed the potential function and $Z_{q}$ the normalization constant. A broad family of problems can be cast into this formulation; the Bayesian approach to inverse problems provides a particular focus for our work [34]. The point of departure for the algorithms considered in this paper is the following mean-field Langevin equation:

$$\frac{\mathrm{d}q}{\mathrm{d}t}=-\mathcal{C}(\rho)D\Phi(q)+\sqrt{2\mathcal{C}(\rho)}\frac{\mathrm{d}W}{\mathrm{d}t}\,,\qquad(1.1)$$

where $D$ stands for the gradient operator, $W$ is an $N$-dimensional standard Brownian motion, $\rho$ is the density associated to the law of $q$, and $\mathcal{C}(\rho)$ is the covariance under this density. This constitutes a mean-field generalization [23] of the standard Langevin equation [44]. Applying a particle approximation to the mean-field model results in an interacting particle system, and coupled Langevin dynamics [37, 23, 42, 24]. The benefit of preconditioning using the covariance is that it leads to mixing rates independent of the problem, provably for quadratic $\Phi$ and empirically beyond this setting, because of the affine invariance [26] of the resulting algorithms [24].

In order to further accelerate mixing and achieve sampling efficiency, we introduce an auxiliary variable $p\in\mathbb{R}^{N}$ and consider the Hamiltonian

$$\mathcal{H}(z)=\frac{1}{2}\langle p,\mathcal{M}^{-1}p\rangle+\Phi(q)\,,\qquad(1.2)$$

where the new state variable is $z:=(q^{\top},p^{\top})^{\top}\in\mathbb{R}^{2N}$. Define a measure on $\mathbb{R}^{2N}$ by

$$\Pi(z)=\frac{1}{Z_{q,p}}\exp(-\mathcal{H}(z))\,,$$

where $Z_{q,p}$ is the normalization constant. The marginal distribution of $\Pi$ in the $q$ variable gives the desired distribution $\pi$, i.e. $\int\Pi(z)\,\mathrm{d}p=\pi(q)$. We now aim at sampling the joint distribution. To this end, consider the following underdamped Langevin dynamics in $\mathbb{R}^{2N}$:

$$\frac{\mathrm{d}z}{\mathrm{d}t}=\mathcal{J}D\mathcal{H}(z)-\mathcal{K}D\mathcal{H}(z)+\sqrt{2\mathcal{K}}\frac{\mathrm{d}W_{0}}{\mathrm{d}t}\,,\qquad(1.3)$$

with the choices

$$\mathcal{J}=\begin{pmatrix}0&\mathcal{C}\\ -\mathcal{C}&0\end{pmatrix}\,,\quad\mathcal{K}=\begin{pmatrix}\mathcal{K}_{1}&0\\ 0&\mathcal{K}_{2}\end{pmatrix}\,.\qquad(1.4)$$

Here $W_{0}$ is a standard Brownian motion in $\mathbb{R}^{2N}$ with components $W^{\prime},W\in\mathbb{R}^{N}$. Then we have the following, proved in Subsection 8.1:

Proposition 1.1.

Assume that $\mathcal{K}_{1}$, $\mathcal{K}_{2}$ are symmetric and non-negative, and that $\mathcal{C}$ is symmetric positive-definite. Assume further that $\mathcal{C}$, $\mathcal{K}$ and $\mathcal{M}$ depend on the law of $z$ under the dynamics defined by (1.3) and (1.4), but are independent of $z$: all derivatives with respect to $z$ are zero. Then the Gibbs measure $\Pi(z)$ is invariant under the dynamics defined by (1.3), (1.4).

In practice, to simulate from such a mean-field model, it will be necessary to consider a particle approximation of the form

$$\frac{\mathrm{d}z^{(i)}}{\mathrm{d}t}=J(Z)D_{z^{(i)}}H_{Z}(z^{(i)})-K(Z)D_{z^{(i)}}H_{Z}(z^{(i)})+\sqrt{2K(Z)}\frac{\mathrm{d}W_{0}^{(i)}}{\mathrm{d}t}\,,\qquad(1.5)$$

for the set of $I$ particles $Z=\{z^{(i)}\}_{i=1}^{I}$, and where $M(Z),K(Z),J(Z)$ are appropriate empirical approximations of $\mathcal{M}(\rho),\mathcal{K}(\rho),\mathcal{J}(\rho)$ based on the approximation

$$\rho\approx\frac{1}{I}\sum_{i=1}^{I}\delta_{z^{(i)}}\,,$$

and the Hamiltonian is given by

$$H_{Z}(z)=\frac{1}{2}\langle p,M(Z)^{-1}p\rangle+\Phi(q)\,.\qquad(1.6)$$

We use the notation $H_{Z}$ to emphasize the dependence on the ensemble $Z$, but when the context is clear we also use the abbreviated notation $H$. Note that $H$ is the appropriate finite particle approximation of $\mathcal{H}$, given the particle approximation $M$ of $\mathcal{M}$.

In this paper we will concentrate on a specific choice of mean-field operators within the above general construction, which we now describe. Let $\mathcal{C}_{q}(\rho)$ denote the $q$-marginal of the covariance under the law of (1.3). We make the choices $\mathcal{K}_{1}=0$, $\mathcal{C}=\mathcal{M}=\mathcal{C}_{q}(\rho)$, and $\mathcal{K}_{2}=\gamma\mathcal{C}_{q}(\rho)$, for a scalar damping parameter $\gamma>0$. Then the underdamped Langevin dynamics yields

$$\begin{aligned}\frac{\mathrm{d}q}{\mathrm{d}t}&=p\,,\\ \frac{\mathrm{d}p}{\mathrm{d}t}&=-\mathcal{C}_{q}(\rho)D\Phi(q)-\gamma p+\sqrt{2\gamma\mathcal{C}_{q}(\rho)}\frac{\mathrm{d}W}{\mathrm{d}t}\,.\end{aligned}\qquad(1.7)$$

To implement a particle approximation of the mean-field dynamics (1.7) we introduce particles in the form $z^{(i)}(t)=\bigl((q^{(i)}(t))^{\top},(p^{(i)}(t))^{\top}\bigr)^{\top}$ and we use the ensemble covariance and mean approximations

$$\mathcal{C}_{q}(\rho)\approx C_{q}(Z):=\frac{1}{I}\sum_{i=1}^{I}\left(q^{(i)}-\bar{q}\right)\otimes\left(q^{(i)}-\bar{q}\right)\,,\qquad(1.8a)$$
$$\bar{q}:=\frac{1}{I}\sum_{i=1}^{I}q^{(i)}\,.\qquad(1.8b)$$

In order to obtain affine invariance, we take the generalized square root of the ensemble covariance $C_{q}(Z)$, similarly to [24]. We introduce the $N\times I$ matrix

$$Q:=\left(q^{(1)}-\bar{q},q^{(2)}-\bar{q},\cdots,q^{(I)}-\bar{q}\right)\,,$$

which allows us to define the empirical covariance and generalized (nonsymmetric) square root via

$$C_{q}(Z)=\frac{1}{I}QQ^{T}\,,\quad\sqrt{C_{q}(Z)}:=\frac{1}{\sqrt{I}}Q\,.$$
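For concreteness, the following minimal NumPy sketch (an illustration, not code from the paper) assembles the ensemble mean, the empirical covariance $C_{q}(Z)=QQ^{T}/I$, and the generalized square root $Q/\sqrt{I}$ of (1.8); the $(N,I)$ array layout, with one column per particle, is our own convention.

```python
# Minimal sketch of the ensemble quantities (1.8); array shapes are an
# assumption: q has one column per particle.
import numpy as np

def ensemble_covariance(q):
    """q: (N, I) array of I particles in R^N."""
    N, I = q.shape
    q_bar = q.mean(axis=1, keepdims=True)   # ensemble mean, shape (N, 1)
    Q = q - q_bar                           # centered particles, (N, I)
    C = (Q @ Q.T) / I                       # empirical covariance, (N, N)
    sqrt_C = Q / np.sqrt(I)                 # generalized (nonsymmetric) square root, (N, I)
    return q_bar, C, sqrt_C

# sanity check: C equals sqrt_C @ sqrt_C.T
q = np.random.randn(4, 100)
_, C, S = ensemble_covariance(q)
assert np.allclose(C, S @ S.T)
```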

Now with $I$ independent standard Brownian motions $\{W^{(i)}\}_{i=1}^{I}\subset\mathbb{R}^{I}$, a natural particle approximation of (1.7) is, for $i=1,\dots,I$,

$$\begin{aligned}\frac{\mathrm{d}q^{(i)}}{\mathrm{d}t}&=p^{(i)}\,,\\ \frac{\mathrm{d}p^{(i)}}{\mathrm{d}t}&=-C_{q}(Z)D\Phi(q^{(i)})-\gamma p^{(i)}+\sqrt{2\gamma C_{q}(Z)}\frac{\mathrm{d}W^{(i)}}{\mathrm{d}t}\,.\end{aligned}\qquad(1.9)$$

In later sections, we will add corrections ensuring preservation of the desired $q$ marginals of the invariant measure. These corrections will account for both the finite particle approximation of the mean field limit, and the fact that (1.9) does not account for the dependence of $H_{Z}(z)$ on $Z$; note that this latter dependence is accounted for in (1.5). To establish the form of the correction we follow the exposition of the well-trodden subject of determining a diffusion with a given invariant measure from [40, Theorem 1].

In subsequent sections we will employ an ensemble approximation of $C_{q}(Z)D\Phi(\cdot)$, as in [23], thereby avoiding the need to compute adjoints of the forward model; we note also that in the linear case this approximation is exact. We will show that the resulting interacting particle system has the potential to provide accurate derivative-free inference for certain classes of inverse problems. In the remainder of this section we provide a literature review, we highlight our contributions, and we outline the structure of the paper.

1.2 Literature Review

The overdamped Langevin equation is the canonical SDE that is invariant with respect to a given target density. Many sampling algorithms are built upon this idea, and in particular, it is shown to govern a large class of Markov chain Monte Carlo (MCMC) methods; see [47, 43, 40]. To enhance mixing and accelerate convergence, a second-order method, Hybrid Monte Carlo (HMC, also referred to as Hamiltonian Monte Carlo) [19, 41], has been proposed, leading to underdamped Langevin dynamics. There have been many attempts to justify the empirically observed fast convergence of second-order methods in comparison with first-order methods [4]. Recently a quantitative convergence rate was established in [10], showing that the underdamped Langevin dynamics converges faster than the overdamped Langevin dynamics when the log of the target $\pi$ has a small Poincaré constant; see also [21].

The idea of introducing preconditioners within the context of interacting particle systems used for sampling is developed in [37]. Preconditioning via the ensemble covariance is shown to boost convergence and numerical performance [23]. Other choices of preconditioners in sampling will lead to different forms of SDEs and associated Fokker-Planck equations with different structures, which can result in different sampling methods effective in different scenarios; see for example [38]. Affine invariance is introduced in [26], where it is argued that this property leads to desirable convergence properties for interacting particle systems used for sampling: methods that satisfy affine invariance are invariant under an affine change of coordinates, and are thus uniformly effective for problems that can be rendered well-conditioned under an affine transformation. An affine invariant version of the mean-field underdamped Langevin dynamics [23] is proposed in [42, 24].

Kalman methods have shown wide success in state estimation problems since their introduction by Evensen; see [22] for an overview of the field, and the papers [45, 31] for discussion of their use in inverse problems. Using the ensemble covariance as a preconditioner leads to affine invariant [24] and gradient-free [23] approximations of Langevin dynamics; this is desirable in practical computations in view of the intractability of derivatives in many large-scale models arising in science and engineering [15, 28, 35, 30]. See [2, 49] for analysis of these methods and, in the context of continuous data-assimilation, see [17, 3, 51]. There are other derivative-free methods that can be derived from the mean-field perspective, and in particular consensus-based methods show promise for optimization [12] and have recently been extended to sampling in [11].

Recent work has established the convergence of ensemble preconditioning methods to mean field limits; see for example [18]. For other works on rigorous derivation of mean field limits of interacting particle systems, see [50, 13, 32]. For underpinning theory of Hamiltonian-based sampling, see [7, 8, 39, 5].

1.3 Our Contributions

The following contributions are made in this paper:

  1. We introduce an underdamped second order mean field Langevin dynamics, with a covariance-based preconditioner.

  2. In the case of Bayesian inverse problems defined by a linear forward map, we show that this mean field model preserves Gaussian distributions under time evolution and, if initialized at a Gaussian, converges to the desired target at a rate independent of the linear map.

  3. We introduce finite particle approximations of the mean field model, and introduce finite size corrections to preserve invariance of the desired target measure, resulting in an affine invariant method.

  4. For Bayesian inverse problems, we introduce a gradient-free approximation of the algorithm, based on ensemble Kalman methodology.

  5. In the context of Bayesian inverse problems we provide numerical examples to demonstrate that the algorithm resulting from the previous considerations has desirable sampling properties.

In Section 2, we introduce the inverse problems context that motivates us. In Section 3 we discuss the equilibrium distribution of the mean field model; and then we introduce a finite particle approximation of the mean field model, identifying corrections which ensure that the equilibrium distribution is an independent product (of dimension equal to the particle cardinality) of copies of the desired equilibrium distribution. Section 4 introduces the ensemble Kalman approximation of the finite particle system; and in that section we also demonstrate affine invariance of the resulting method. Section 5 presents analysis of the finite particle system in the case of linear inverse problems, where the ensemble Kalman approximation is exact; we demonstrate that the relaxation time to equilibrium is independent of the specific linear inverse problem considered, a consequence of affine invariance. In Section 6 we provide numerical results which demonstrate the efficiency and potential value of our method, and in Section 7 we draw conclusions. Proofs of the propositions are given in the appendix, Section 8.

2 Inverse Problem

Consider the Bayesian inverse problem of finding qq from an observation yy determined by the forward model

$$y=\mathcal{G}(q)+\eta\,.$$

Here $\mathcal{G}:\mathbb{R}^{N}\to\mathbb{R}^{J}$ is a (in general) nonlinear forward map. We assume a zero-mean Gaussian prior $\pi_{0}=\mathsf{N}(0,\Gamma_{0})$ on the unknown $q$, and assume that the random variable $\eta\sim\mathsf{N}(0,\Gamma)$, representing measurement error, is independent of $q$. We also assume that $\Gamma,\Gamma_{0}$ are positive-definite. Then by Bayes rule, the posterior density that we aim to sample is given by¹

$$\pi(q)\propto\exp\Bigl(-\frac{1}{2}\|y-\mathcal{G}(q)\|_{\Gamma}^{2}\Bigr)\pi_{0}(q)\equiv\exp(-\Phi(q))\,,$$

¹In what follows $\|\cdot\|_{C}=\|C^{-\frac{1}{2}}\cdot\|$, with analogous notation for the induced inner-product, for any positive-definite covariance $C$ and for $\|\cdot\|$ the Euclidean norm.

with potential function Φ(q)\Phi(q) of the following form:

$$\Phi(q)=\frac{1}{2}\|y-\mathcal{G}(q)\|_{\Gamma}^{2}+\frac{1}{2}\|q\|_{\Gamma_{0}}^{2}\,.\qquad(2.10)$$

In the linear case when $\mathcal{G}(q)=Aq$, $\Phi(q)$ is quadratic and the gradient $D\Phi(q)$ can be written as a linear function:

$$\begin{aligned}\Phi(q)&=\frac{1}{2}\|y-Aq\|_{\Gamma}^{2}+\frac{1}{2}\|q\|_{\Gamma_{0}}^{2}\,,\qquad&(2.11a)\\ D\Phi(q)&=B^{-1}q-c\,,\qquad&(2.11b)\\ B&=(A^{T}\Gamma^{-1}A+\Gamma_{0}^{-1})^{-1}\,,\quad c=A^{T}\Gamma^{-1}y\,.\qquad&(2.11c)\end{aligned}$$

In this linear setting, the posterior distribution $\pi(q)$ is the Gaussian $\mathsf{N}(Bc,B)$.
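As an aside, the linear-case formulas (2.11) are easy to check numerically; the following sketch builds a synthetic linear problem (the matrices $A,\Gamma,\Gamma_{0}$ and the data $y$ are illustrative stand-ins, not from the paper) and forms the posterior mean $Bc$ and covariance $B$.

```python
# Sketch of the linear-case posterior (2.11): with G(q) = A q the posterior
# is Gaussian N(Bc, B). All problem data below are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
N, J = 3, 5
A = rng.standard_normal((J, N))
Gamma = 0.1**2 * np.eye(J)        # noise covariance
Gamma0 = np.eye(N)                # prior covariance
q_true = rng.standard_normal(N)
y = A @ q_true + rng.multivariate_normal(np.zeros(J), Gamma)

B = np.linalg.inv(A.T @ np.linalg.inv(Gamma) @ A + np.linalg.inv(Gamma0))
c = A.T @ np.linalg.inv(Gamma) @ y
posterior_mean, posterior_cov = B @ c, B
```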

3 Equilibrium Distributions

3.1 Fokker-Planck Equation: Mean Field

The mean field underdamped Langevin equation (1.3) has an associated nonlinear and nonlocal Fokker-Planck equation giving the evolution of the law of the particles $z(t)$, denoted $\rho(z,t)$. The equation for this law is (see Subsection 8.1)

$$\partial_{t}\rho=\nabla\cdot\bigl((\mathcal{K}-\mathcal{J})(\rho\nabla\mathcal{H}+\nabla\rho)\bigr)\,.\qquad(3.12)$$

By Proposition 1.1 this Fokker-Planck equation has $\Pi(z)$ as its equilibrium; this follows as for standard linear Fokker-Planck equations [44] since the dependence of the sample paths on $\rho$ involves only the mean and covariance; see Subsection 8.1, and see also [20, 27, 33, 46, 44] for discussions on how to derive Fokker-Planck equations with a prescribed stationary distribution. In the specific case (1.7), recall that $\mathcal{J}$ and $\mathcal{K}$ are given by

$$\mathcal{J}=\begin{pmatrix}0&\mathcal{C}_{q}(\rho)\\ -\mathcal{C}_{q}(\rho)&0\end{pmatrix}\,,\quad\mathcal{K}=\begin{pmatrix}0&0\\ 0&\gamma\mathcal{C}_{q}(\rho)\end{pmatrix}\,.\qquad(3.13)$$

These choices satisfy the assumption of Proposition 1.1.

3.2 Fokker-Planck Equation: Finite Particle System

The approximation of (1.7) by the interacting particle system (1.9) does not result in an equilibrium distribution with marginals on the position coordinates given by independent products of $\pi$. To address this issue we return to the formulation in (1.5), (1.6), with the following empirical approximations to $\mathcal{K}$ and $\mathcal{J}$:

$$K(Z):=\begin{pmatrix}0&0\\ 0&\gamma C_{q}(Z)\end{pmatrix}\,,\quad J(Z):=\begin{pmatrix}0&C_{q}(Z)\\ -C_{q}(Z)&0\end{pmatrix}\,,\qquad(3.14)$$

where $C_{q}(Z)$ is given in (1.8); note that in defining $H_{Z}(z)$ to be consistent with the choices leading to (1.7) we make the approximation $\mathcal{M}(\rho)\approx M(Z)=C_{q}(Z)$, leading to

$$H_{Z}(z)=\frac{1}{2}\langle p,C_{q}(Z)^{-1}p\rangle+\Phi(q)\,.\qquad(3.15)$$

We would like the resulting finite particle system (1.5), (3.14), (3.15) to have $\pi^{\otimes I}$ as the marginal on the position coordinates of its equilibrium solution; in general, however, it does not. We now determine a drift correction term which, when added to the system, results in invariance of $\Pi^{I}$; here, for a normalization constant $Z_{I}$,

$$\Pi^{I}(Z):=\frac{1}{Z_{I}}\exp\bigl(-\sum_{i=1}^{I}H_{Z}(z^{(i)})\bigr)\,,\qquad(3.16a)$$
$$H_{Z}(z)=\frac{1}{2}\langle p,M(Z)^{-1}p\rangle+\Phi(q)\,.\qquad(3.16b)$$

Note that $\Pi^{I}$ does not have product form, but its marginal on the $q$ coordinates does, and is $\pi^{\otimes I}$.

To achieve the desired invariance of $\Pi^{I}$ we add the correction term $\nabla_{z^{(i)}}\cdot\bigl(K(Z)-J(Z)\bigr)$ to the interacting particle system (1.5) to obtain the following coupled system of diffusions in $\mathbb{R}^{2N}$, for $i=1,\dots,I$:

$$\frac{\mathrm{d}z^{(i)}}{\mathrm{d}t}=JD_{z^{(i)}}H_{Z}(z^{(i)})-KD_{z^{(i)}}H_{Z}(z^{(i)})+\nabla_{z^{(i)}}\cdot\bigl(K-J\bigr)+\sqrt{2K}\frac{\mathrm{d}W^{(i)}}{\mathrm{d}t}\,.\qquad(3.17)$$

By Theorem 1 in both [40] and [20] this will result in a linear Fokker-Planck equation (3.18) with $\Pi^{I}$ as its equilibrium. This Fokker-Planck equation has the form

$$\partial_{t}\rho^{I}=\nabla\cdot\bigl((K^{\otimes I}-J^{\otimes I})(\rho^{I}\nabla H^{\otimes I}+\nabla\rho^{I})\bigr)\,,\qquad(3.18)$$

where $\rho^{I}$ stands for the joint probability density of the $I$ particles, $H^{\otimes I}$ is the vector containing $I$ copies of $H$, and $K^{\otimes I},J^{\otimes I}$ are block-diagonal matrices with $I$ copies of $K,J$ on the diagonal respectively. Note that in the case of the finite particle approximation, the preconditioner is no longer dependent on the density $\rho$, and thus the Fokker-Planck equation is indeed linear. For the specific choices of $K(Z),J(Z),M(Z)$ and $H_{Z}(z)$ made in (3.14) and (3.15), equation (3.17) simplifies to the following system of diffusions, holding for $i=1,\dots,I$:

$$\begin{aligned}\frac{\mathrm{d}q^{(i)}}{\mathrm{d}t}&=p^{(i)}\,,\\ \frac{\mathrm{d}p^{(i)}}{\mathrm{d}t}&=-C_{q}(Z)D\Phi(q^{(i)})-\gamma p^{(i)}+\frac{1}{I}p^{(i)}{p^{(i)}}^{T}C_{q}(Z)^{-1}\hat{q}^{(i)}+\frac{1+N}{I}\hat{q}^{(i)}+\sqrt{2\gamma C_{q}(Z)}\frac{\mathrm{d}W^{(i)}}{\mathrm{d}t}\,,\end{aligned}\qquad(3.19)$$

where we have defined $\hat{q}^{(i)}:=q^{(i)}-\bar{q}$. Concerning these equations, we establish the following proposition. We postpone the proof to Subsection 8.2; it is based on direct calculations of the terms $\nabla_{z^{(i)}}\cdot(K-J)$ and $D_{z^{(i)}}H_{Z}(z^{(i)})$.

Proposition 3.1.

The interacting particle system (3.19) has invariant distribution $\Pi^{I}$.
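To make the corrected dynamics concrete, here is a hedged NumPy sketch of the momentum drift in (3.19), including the two finite-size correction terms. The function name momentum_drift, the user-supplied gradient grad_phi, and the $(N,I)$ array convention are our own illustrative choices, not the paper's.

```python
# Sketch of the momentum drift in (3.19): -C DPhi - gamma p plus the two
# finite-size corrections. Assumes C_q(Z) is invertible (I > N).
import numpy as np

def momentum_drift(q, p, grad_phi, gamma):
    """q, p: (N, I) particle arrays; grad_phi: callable (N,) -> (N,)."""
    N, I = q.shape
    q_bar = q.mean(axis=1, keepdims=True)
    q_hat = q - q_bar
    C = (q_hat @ q_hat.T) / I
    C_inv = np.linalg.inv(C)
    drift = np.empty_like(p)
    for i in range(I):
        # (1/I) p p^T C^{-1} q_hat reduces to a scalar times p
        corr = (p[:, i] @ C_inv @ q_hat[:, i]) * p[:, i] / I
        drift[:, i] = (-C @ grad_phi(q[:, i]) - gamma * p[:, i]
                       + corr + (1 + N) / I * q_hat[:, i])
    return drift
```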

4 Ensemble Kalman Approximation

4.1 Derivatives Via Differences

We now invoke the ensemble Kalman approximation, replacing the gradient term by differences, as in [23]:

$$D\mathcal{G}\left(q^{(i)}\right)\left(q^{(k)}-\bar{q}\right)\approx\mathcal{G}\left(q^{(k)}\right)-\bar{\mathcal{G}}\,,$$

where $\bar{\mathcal{G}}:=\frac{1}{I}\sum_{k=1}^{I}\mathcal{G}\left(q^{(k)}\right)$. Invoking this approximation within (3.19), using the specific form of $\Phi$, yields the following system of interacting particles, for $i=1,\dots,I$:

$$\begin{aligned}\dot{q}^{(i)}&=p^{(i)}\,,\\ \dot{p}^{(i)}&=-C_{q}(Z)\Gamma_{0}^{-1}q^{(i)}-\frac{1}{I}\sum_{k=1}^{I}\langle\mathcal{G}(q^{(k)})-\bar{\mathcal{G}},\mathcal{G}(q^{(i)})-y\rangle_{\Gamma}q^{(k)}-\gamma p^{(i)}\\ &\quad+\frac{1}{I}p^{(i)}{p^{(i)}}^{T}C_{q}(Z)^{-1}\hat{q}^{(i)}+\frac{1+N}{I}\hat{q}^{(i)}+\sqrt{2\gamma C_{q}(Z)}\dot{W}^{(i)}\,.\end{aligned}\qquad(4.20)$$

We will use this system as the basis of all our numerical experiments.
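The following sketch assembles the full momentum drift of (4.20), with the gradient term replaced by ensemble differences of the forward map so that no adjoint of $\mathcal{G}$ is needed. The callable G, the data y, and the inverse covariances are user-supplied assumptions; note that the damping term $-\gamma p^{(i)}$ is included here as in (4.20), although the splitting scheme of Section 6 treats it separately.

```python
# Sketch of the momentum drift of (4.20); <a, b>_Gamma = a^T Gamma^{-1} b.
import numpy as np

def ekhmc_force(q, p, G, y, Gamma_inv, Gamma0_inv, gamma):
    """q, p: (N, I); G: callable (N,) -> (J,). Assumes C_q(Z) invertible."""
    N, I = q.shape
    q_bar = q.mean(axis=1, keepdims=True)
    q_hat = q - q_bar
    C = (q_hat @ q_hat.T) / I
    C_inv = np.linalg.inv(C)
    Gq = np.stack([G(q[:, i]) for i in range(I)], axis=1)  # (J, I)
    G_bar = Gq.mean(axis=1, keepdims=True)
    F = np.empty_like(p)
    for i in range(I):
        # ensemble-difference replacement of C_q(Z) D Phi(q^{(i)})
        w = (Gq - G_bar).T @ Gamma_inv @ (Gq[:, i] - y)    # (I,) weights
        data_term = q @ w / I
        F[:, i] = (-C @ Gamma0_inv @ q[:, i] - data_term - gamma * p[:, i]
                   + (p[:, i] @ C_inv @ q_hat[:, i]) * p[:, i] / I
                   + (1 + N) / I * q_hat[:, i])
    return F
```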

4.2 Affine Invariance

In this subsection, we show the affine invariance property [26, 24, 37] for the Fokker-Planck equations in the mean-field regime (3.12) and for the ensemble approximation (3.18); for the particle equations in the mean-field regime (1.7), the ensemble approximation (1.9), and its correction (3.19); and for the gradient-free approximation (4.20). For simplicity of presentation, we only state the results in the case of the ensemble approximation; the mean field case follows analogously upon dropping all of the particle superscripts.

Definition 4.1 (Affine invariance for particle formulation).

We say a particle formulation is affine invariant, if under all affine transformations of the form

$$q^{(i)}=Av^{(i)}+b\,,\quad p^{(i)}=Au^{(i)}\,,$$

the equations for the transformed particle systems are given by the same equations with $q^{(i)}$, $p^{(i)}$ replaced by $v^{(i)}$, $u^{(i)}$ respectively, and with the potential $\Phi$ replaced by $\tilde{\Phi}$ via

$$\tilde{\Phi}(v^{(i)})=\Phi(q^{(i)})=\Phi(Av^{(i)}+b)\,.$$

Here $A$ is any invertible matrix and $b$ is a vector.

Definition 4.2 (Affine invariance for Fokker-Planck equation).

We say a Fokker-Planck equation is affine invariant, if under all affine transformations of the form

$$q^{(i)}=Av^{(i)}+b\,,\quad p^{(i)}=Au^{(i)}\,,$$

the equations for the pushforward PDF $\tilde{\rho}^{I}$ are given by the same equation for $\rho^{I}$ with $q^{(i)}$, $p^{(i)}$ replaced by $v^{(i)}$, $u^{(i)}$ respectively, and with the Hamiltonian $H$ replaced by $\tilde{H}$ via

$$\tilde{H}(v^{(i)},u^{(i)})=H(q^{(i)},p^{(i)})=H(Av^{(i)}+b,Au^{(i)})\,.$$

Here $A$ is any invertible matrix and $b$ is a vector.

Then we have the observation that all of our systems are affine invariant.

Proposition 4.3.

The particle formulations (1.9), (3.19) and (4.20) are affine invariant. The Fokker-Planck equation (3.18) is also affine invariant.

We defer the proof to Subsection 8.3. The significance of affine invariance is that it implies that the rate of convergence is preserved under affine transformations. The proposed methodology is thus uniformly effective for problems that become well-conditioned under an affine transformation. Proposition 5.1, which follows in the next section, illustrates this property in the setting of a linear forward map $\mathcal{G}(\cdot)$.

5 Mean Field Model For Linear Inverse Problems

We consider the mean field SDE (1.7) in the linear inverse problem setting of Section 2 with $\mathcal{G}(q)=Aq$; thus (2.11) holds. We note that $B$ in (2.11) is both well-defined and symmetric positive definite since $\Gamma_{0},\Gamma$ are assumed to be symmetric positive definite. The results demonstrate problem-independent rates of convergence, in this linear setting, a consequence of affine invariance which in turn is a consequence of our choice of preconditioned mean field system.

In the setting of the linear inverse problem, the mean field model (1.7) reduces to

$$\begin{aligned}\dot{q}&=p\,,\\ \dot{p}&=-\mathcal{C}_{q}(\rho)(B^{-1}q-c)-\gamma p+\sqrt{2\gamma\mathcal{C}_{q}(\rho)}\dot{W}\,.\end{aligned}\qquad(5.21)$$

We prove the following result about this system in Subsection 8.4:

Proposition 5.1.

Write the mean $m(\rho)$ and the covariance $\mathcal{C}(\rho)$ of the law $\rho(z)$ of the particles in equation (5.21) in the block form

$$m(\rho)=\begin{pmatrix}m_{q}(\rho)\\ m_{p}(\rho)\end{pmatrix}\,,\quad\mathcal{C}(\rho)=\begin{pmatrix}\mathcal{C}_{q}(\rho)&\mathcal{C}_{q,p}(\rho)\\ \mathcal{C}_{q,p}^{T}(\rho)&\mathcal{C}_{p}(\rho)\end{pmatrix}\,.$$

The evolution of the mean and covariance is prescribed by the following system of ODEs:

$$\begin{aligned}\dot{m_{q}}&=m_{p}\,,\\ \dot{m_{p}}&=-\mathcal{C}_{q}(B^{-1}m_{q}-c)-\gamma m_{p}\,,\\ \dot{\mathcal{C}_{q}}&=\mathcal{C}_{q,p}+\mathcal{C}_{q,p}^{T}\,,\\ \dot{\mathcal{C}_{p}}&=-\mathcal{C}_{q}B^{-1}\mathcal{C}_{q,p}-(\mathcal{C}_{q}B^{-1}\mathcal{C}_{q,p})^{T}-2\gamma\mathcal{C}_{p}+2\gamma\mathcal{C}_{q}\,,\\ \dot{\mathcal{C}}_{q,p}&=-\gamma\mathcal{C}_{q,p}-\mathcal{C}_{q}B^{-1}\mathcal{C}_{q}+\mathcal{C}_{p}\,.\end{aligned}\qquad(5.22)$$

The unique steady solution with positive-definite covariance is the Gibbs measure $m_{q}=Bc$, $m_{p}=0$, $\mathcal{C}_{q,p}=0$, $\mathcal{C}_{q}=\mathcal{C}_{p}=B$; the marginal on $q$ gives the solution of the linear Gaussian Bayesian inverse problem. All other steady state solutions have degenerate covariance, are unstable, and take the form $m_{q}=B(c+m)$, $m_{p}=0$, $\mathcal{C}_{q,p}=0$, $\mathcal{C}_{q}=\mathcal{C}_{p}=B^{1/2}XB^{1/2}$ for a projection matrix $X$ and $m$ in the nullspace of $\mathcal{C}_{q}$. The equilibrium point with positive-definite covariance is hyperbolic and linearly stable and hence its basin of attraction is an open set, containing the equilibrium point itself, in the set of all mean vectors and positive-definite covariances. Furthermore, the mean and covariance converge to this equilibrium, from all points in its basin of attraction, with a speed independent of $B$ and $c$.

Remark 5.2.

Proposition 5.1 demonstrates that the convergence speed to the non-degenerate equilibrium point is independent of the specific linear inverse problem to be solved; the rate does, however, depend on $\gamma$. Analysis of the linear stability problem shows that $\gamma\approx 1.83$ gives the best local convergence rate; see Remark 8.5. In the case where mean field preconditioning is not used, the optimal choice of $\gamma$ for underdamped Langevin dynamics depends on the linear inverse problem being solved, and can be identified explicitly in the scalar setting [44]; for analysis in the non-Gaussian setting see [14]. Motivated by analogies with the work in [44, 14] we expect the optimal choice of $\gamma$ to be problem dependent in the nonlinear case, but motivated by our analysis in the linear setting we expect to see a good choice which is neither too small nor too large and can be identified by straightforward optimization.
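As a check on Proposition 5.1 and Remark 5.2, the moment ODEs (5.22) can be integrated directly. The sketch below uses a plain Euler discretization on a synthetic $(B,c)$ of our own choosing, with an initial condition assumed to lie in the basin of attraction, and verifies convergence to the steady state $(Bc,0,B,B,0)$.

```python
# Euler integration of the moment ODEs (5.22); B, c and the initial data
# are illustrative stand-ins.
import numpy as np

N = 2
B = np.array([[1.0, 0.3], [0.3, 2.0]])   # any SPD matrix
B_inv = np.linalg.inv(B)
c = np.array([1.0, -0.5])
gamma, dt, T = 1.83, 1e-3, 40.0

mq, mp = np.array([2.0, 1.0]), np.zeros(N)
Cq, Cp, Cqp = np.eye(N), np.eye(N), np.zeros((N, N))
for _ in range(int(T / dt)):
    dmq = mp
    dmp = -Cq @ (B_inv @ mq - c) - gamma * mp
    dCq = Cqp + Cqp.T
    dCp = -Cq @ B_inv @ Cqp - (Cq @ B_inv @ Cqp).T - 2 * gamma * Cp + 2 * gamma * Cq
    dCqp = -gamma * Cqp - Cq @ B_inv @ Cq + Cp
    mq, mp = mq + dt * dmq, mp + dt * dmp
    Cq, Cp, Cqp = Cq + dt * dCq, Cp + dt * dCp, Cqp + dt * dCqp

# steady state of Proposition 5.1: (Bc, 0, B, B, 0)
assert np.allclose(mq, B @ c, atol=1e-2) and np.allclose(Cq, B, atol=1e-2)
```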

Now we show that for Gaussian initial data, the solution remains Gaussian and is thus determined by the evolution of the mean and covariance via the ODE (5.22). We prove this by investigating the mean field Fokker-Planck equation in the linear case. The evolution in time of the law $\rho(z)$ of equation (5.21) is governed by the equation

$$\partial_{t}\rho=\nabla\cdot\left(\begin{pmatrix}-p\\ \mathcal{C}_{q}(\rho)(B^{-1}q-c)+\gamma p\end{pmatrix}\rho+\begin{pmatrix}0&-\mathcal{C}_{q}(\rho)\\ \mathcal{C}_{q}(\rho)&\gamma\mathcal{C}_{q}(\rho)\end{pmatrix}\nabla\rho\right)\,.\qquad(5.23)$$

We prove the following in Subsection 8.5:

Proposition 5.3.

Let $m(t)$, $C(t)$ solve the ODE (5.22) with initial conditions $m_{0}$ and $C_{0}$. Assume that $\rho_{0}$ is a Gaussian distribution, so that

$$\rho_{0}(z):=\frac{1}{(2\pi)^{N}}\left(\operatorname{det}C_{0}\right)^{-1/2}\exp\left(-\frac{1}{2}\left\|z-m_{0}\right\|_{C_{0}}^{2}\right)$$

with mean $m_{0}$ and covariance $C_{0}$. Then the Gaussian profile

$$\rho(t,z):=\frac{1}{(2\pi)^{N}}\left(\operatorname{det}C(t)\right)^{-1/2}\exp\left(-\frac{1}{2}\left\|z-m(t)\right\|_{C(t)}^{2}\right)$$

solves the Fokker-Planck equation (5.23) with initial condition $\rho(0,z)=\rho_{0}(z)$.

6 Numerical Results

We introduce, and study, a numerical method for sampling in the Bayesian inverse problem of Section 2; the method is based on numerical time stepping of the interacting particle system (4.20). In this section, we demonstrate that the proposed sampler, which we refer to as ensemble Kalman hybrid Monte Carlo (EKHMC), can effectively approximate posterior distributions for two widely studied inverse problem test cases. We compare EKHMC with its first-order version EKS [23] and a gold standard MCMC [9]. EKHMC inherits two major advantages of EKS: (1) exact gradients are not required (i.e., it is derivative-free); (2) the ensemble can faithfully approximate the spread of the posterior distribution, rather than collapse to a single point as happens with the basic EKI algorithm [49]. Furthermore, we show empirically that EKHMC obtains samples of similar quality to EKS, and converges faster than EKS. We detail our numerical time-stepping scheme in the first subsection, before studying two examples (one low dimensional, one a PDE inverse problem for a field) in the subsequent subsections.

6.1 Time Integration Schemes

We employ a splitting method to integrate the stochastic dynamical system given by equation (4.20), splitting it into two parts: the first capturing the Hamiltonian evolution, including the finite-size correction, and the second capturing an OU process in momentum space. The Hamiltonian evolution follows the equation:

$$\begin{aligned}\dot{q}^{(i)}&=p^{(i)}\,,\\ \dot{p}^{(i)}&=F_{H}\equiv-C_{q}(Z)\Gamma_{0}^{-1}q^{(i)}-\frac{1}{I}\sum_{k=1}^{I}\langle\mathcal{G}(q^{(k)})-\bar{\mathcal{G}},\mathcal{G}(q^{(i)})-y\rangle_{\Gamma}q^{(k)}\\ &\quad+\frac{1}{I}p^{(i)}{p^{(i)}}^{T}C_{q}(Z)^{-1}\hat{q}^{(i)}+\frac{1+N}{I}\hat{q}^{(i)}\,.\end{aligned}\qquad(6.24)$$

The OU process follows the equation:

$$\begin{aligned}\dot{q}^{(i)}&=0\,,\\ \dot{p}^{(i)}&=-\gamma p^{(i)}+\sqrt{2\gamma C_{q}(Z)}\dot{W}^{(i)}\,.\end{aligned}\qquad(6.25)$$

We implement a symplectic Euler integrator [48] for the Hamiltonian part, i.e., we take a half step $\epsilon/2$ of momentum updates, then a full step $\epsilon$ of position updates, and finally a half step $\epsilon/2$ of momentum updates.² Let $Z_{j}$ be the collection of all position and momentum particles at time $j$: $\{q_{j}^{(i)},p_{j}^{(i)}\}_{i=1}^{I}$. Starting from time $j$ this symplectic Euler integration gives a map $Z_{j}\mapsto\hat{Z}_{j}$. We set $q_{j+1}^{(i)}$ to be the $i^{\rm th}$ position coordinates of $\hat{Z}_{j}$ and then update the momentum coordinates using the OU process, within which the damping coefficient $\gamma>0$ is treated as a hyperparameter of EKHMC. The ensemble $Z$ appearing in $C_{q}(Z)$ is set at the value given by the output of the preceding symplectic Euler integrator, namely $\hat{Z}_{j}$. The update of the $i^{\rm th}$ momentum coordinate given by the OU process is then:

²With the ensemble approximation the system is only approximately Hamiltonian; the splitting used in symplectic Euler is still well-defined, however.

$$\begin{aligned}\tilde{p}^{(i)}_{j}&=\exp(-\gamma\epsilon)\hat{p}^{(i)}_{j}\,,\\ p^{(i)}_{j+1}&=\tilde{p}^{(i)}_{j}+\eta\,,\quad\eta\sim\mathsf{N}\Bigl(0,\bigl(1-\exp(-2\gamma\epsilon)\bigr)C_{q}(\hat{Z}_{j})\Bigr)\,,\end{aligned}\qquad(6.26)$$

where $\hat{p}^{(i)}_{j}$ are the momentum coordinates from $\hat{Z}_{j}$.

Similarly to [23], and as there with the goal of improving stability and convergence speed towards the posterior distribution, we implement an adaptive step size, i.e., the true step size is rescaled by the magnitude of the "force field" $F_{H}$:

$$\tilde{\epsilon}=\frac{\epsilon}{a|F_{H}|+1}\,.\qquad(6.27)$$
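Putting the pieces together, one EKHMC step might be sketched as follows: an adaptive step (6.27), a half-kick/drift/half-kick pass for the Hamiltonian part (6.24), and the exact OU update (6.26). The function force, assumed to return $F_{H}$ without the damping term, is user-supplied (for example as in the sketch after (4.20)); using a single global norm of $F_{H}$ in the rescaling is our simplifying assumption.

```python
# A sketch of one EKHMC step, under the assumptions stated above.
import numpy as np

def ekhmc_step(q, p, force, gamma, eps, a=0.01, rng=np.random.default_rng()):
    """q, p: (N, I). Returns updated (q, p)."""
    F = force(q, p)
    eps_t = eps / (a * np.linalg.norm(F) + 1.0)   # adaptive step (6.27)
    # Hamiltonian part: half kick, full drift, half kick
    p = p + 0.5 * eps_t * F
    q = q + eps_t * p
    p = p + 0.5 * eps_t * force(q, p)
    # OU part (6.26): exact damping plus C_q-preconditioned noise
    N, I = q.shape
    q_hat = q - q.mean(axis=1, keepdims=True)
    sqrt_C = q_hat / np.sqrt(I)                   # generalized square root, (N, I)
    xi = rng.standard_normal((I, I))
    p = (np.exp(-gamma * eps_t) * p
         + np.sqrt(1 - np.exp(-2 * gamma * eps_t)) * (sqrt_C @ xi))
    return q, p
```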

6.2 Low Dimensional Parameter Space

We follow the example presented in Section 4.3 of the paper [23]. We start by defining the forward map which is given by the one-dimensional elliptic boundary value problem

$$\begin{aligned}-\frac{d}{dx}\Bigl(\exp(u_{1})\frac{d}{dx}p(x)\Bigr)&=1\,,\quad x\in(0,1)\,,\qquad&(6.28a)\\ p(0)=0\,,\quad p(1)&=u_{2}\,.\qquad&(6.28b)\end{aligned}$$

The solution is given explicitly by

$$p(x)=u_{2}x+\exp(-u_{1})\Bigl(-\frac{x^{2}}{2}+\frac{x}{2}\Bigr)\,.\qquad(6.29)$$

The forward model operator $\mathcal{G}$ is then defined by

$$\mathcal{G}(u)=\begin{pmatrix}p(x_{1})\\ p(x_{2})\end{pmatrix}\,.\qquad(6.30)$$

Here $u=(u_{1},u_{2})^{T}$ is a constant vector that we want to find, and we assume that we are given noisy measurements $y$ of $p(\cdot)$ at locations $x_{1}=0.25$ and $x_{2}=0.75$. Parameters are chosen according to [23], but we summarize them here for completeness:

  • noise $\eta\sim\mathsf{N}(0,\Gamma)$, $\Gamma=0.1^{2}I_{2}$;

  • prior $\pi_{0}=\mathsf{N}(0,\Gamma_{0})$, $\Gamma_{0}=\sigma^{2}I_{2}$, $\sigma=10$;

  • measurement $y=(27.5,79.7)^{T}$;

  • number of particles $J=10^{3}$;

  • initialization: $(u_{1},u_{2})\sim\mathsf{N}(-3.5,0.1^{2})\times\mathsf{U}(70,110)$.

Here $\mathsf{U}$ is the uniform distribution.
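For reference, a short sketch of the explicit forward map (6.29), (6.30) and of the experimental configuration listed above; the function name G, the random seed, and the array layout are our own illustrative choices.

```python
# Sketch of the forward map (6.29)-(6.30) and the experiment's parameters.
import numpy as np

def G(u, x_obs=(0.25, 0.75)):
    """Forward map: u = (u1, u2) -> pressure p at the observation points."""
    u1, u2 = u
    return np.array([u2 * x + np.exp(-u1) * (-x**2 / 2 + x / 2) for x in x_obs])

Gamma = 0.1**2 * np.eye(2)       # noise covariance
Gamma0 = 10.0**2 * np.eye(2)     # prior covariance
y = np.array([27.5, 79.7])       # measurement

# initialization of 10^3 particles, one per column
rng = np.random.default_rng(2)
I = 1000
u_init = np.stack([rng.normal(-3.5, 0.1, I), rng.uniform(70, 110, I)])  # (2, I)
```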

Figure 1: The low dimensional parameter space example. From left to right: samples; mean $u_{1}$; mean $u_{2}$; the first singular value $\sigma_{1}$; the second singular value $\sigma_{2}$.

We choose $a=0.01$, $\epsilon=0.2$ in both EKS and EKHMC. We find that choosing $\gamma=1$ causes overshooting, a phenomenon which can be ameliorated by increasing to $\gamma=100$; indeed this latter value appears, empirically, to be close to optimal in terms of convergence speed. We conjecture that this difference from the linear case, where the optimal value is $\gamma\approx 1.83$ (see Remark 5.2), is due to the non-Gaussianity of the desired target: the particles have accumulated considerable momentum when entering the linear convergence regime, and so extra damping is required to counteract this and avoid overshooting. Despite the desirable problem independence of the optimal $\gamma$ in the linear case, we expect case-by-case optimization to be needed for nonlinear problems.

We evolve the ensemble of particles for 200 iterations, and record their positions in the last iteration as the approximation of the posterior distribution, shown in Figure 1. The EKHMC samples are quite similar to the EKS samples, both of which are reasonable approximations of the samples obtained from the gold standard HMC [19] simulations; however, both approximations of the gold standard miss the full spread of the true distribution, because of the ensemble Kalman approximation they invoke. We also compute four ensemble quantities to characterize the evolution of EKHMC and EKS: the mean of $u_{1}$, the mean of $u_{2}$, and the two eigenvalues $\sigma_{1}$ and $\sigma_{2}$ of the covariance matrix $C_{q}(Z)$. EKHMC exhibits faster convergence towards equilibrium than EKS, benefiting from the second-order dynamics.

6.3 Darcy Flow

Figure 2: Darcy flow. Left: samples obtained from EKHMC (top) and EKS (bottom), compared with MCMC. Middle: evolution of $\|u\|_{H^{-2}}$ for EKHMC and EKS for different $J=128,512,2048$. Right: the same as the middle, but with $\|u\|_{L^{2}}$ instead of $\|u\|_{H^{-2}}$. EKHMC converges faster than EKS.

This example follows the problem setting detailed in [23]. We summarize the essential problem specification here for completeness. The forward problem of porous medium flow, defined by permeability $a(\cdot)$ and source term $f(\cdot)$, is to find the pressure field $p(\cdot)$ solving the following elliptic PDE:

$$\begin{aligned}-\nabla\cdot\Bigl(a(x)\nabla p(x)\Bigr)&=f(x)\,,\quad x\in D=[0,1]^{2}\,,\qquad(6.31)\\ p(x)&=0\,,\quad x\in\partial D\,.\end{aligned}$$

We assume that the permeability $a(x)=a(x;u)$ depends on some unknown parameters $u\in\mathbb{R}^{d}$. The resulting inverse problem is, given (noisy) pointwise measurements of $p(x)$ on a grid, to infer $a(x;u)$ or $u$. We model $a(x;u)$ as a log-Gaussian field. The Gaussian underlying this log-Gaussian model has mean zero and precision operator $\mathcal{C}^{-1}=(-\triangle+\tau^{2}\mathcal{I})^{\alpha}$; here $\triangle$ is equipped with Neumann boundary conditions on the spatial-mean-zero functions. We set $\tau=3$ and $\alpha=2$ in the experiments. This parametrization yields the Karhunen-Loève (KL) expansion

$$\log a(x;u)=\sum_{l\in K}u_{l}\sqrt{\lambda_{l}}\varphi_{l}(x)\,,\qquad(6.32)$$

where $\varphi_{l}(x)=\cos(\pi\langle l,x\rangle)$, $\lambda_{l}=(\pi^{2}|l|^{2}+\tau^{2})^{-\alpha}$, and $K\equiv\mathbb{Z}^{2}$. Draws from this random field are Hölder with any exponent less than one. In practice, we truncate $K$ to have dimension $d=2^{8}$. We generate a ground-truth random field by sampling $u\sim\mathsf{N}(0,I_{d})$. We create data $y$ with noise $\eta\sim\mathsf{N}(0,0.1^{2}I_{K})$. We choose the prior covariance $\Gamma_{0}=10^{2}I_{d}$.
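The KL expansion (6.32) is straightforward to sample on a grid. The sketch below truncates to the lowest $16\times 16=2^{8}$ nonnegative modes (our illustrative reading of the truncation to $d=2^{8}$, since $K\equiv\mathbb{Z}^{2}$), and the grid size is an arbitrary choice.

```python
# Sketch of sampling log a(x; u) via the KL expansion (6.32).
import numpy as np

tau, alpha, n_modes, n_grid = 3.0, 2.0, 16, 64
x = np.linspace(0, 1, n_grid)
X1, X2 = np.meshgrid(x, x, indexing="ij")

rng = np.random.default_rng(3)
u = rng.standard_normal((n_modes, n_modes))   # u ~ N(0, I_d), d = n_modes^2
log_a = np.zeros((n_grid, n_grid))
for l1 in range(n_modes):
    for l2 in range(n_modes):
        lam = (np.pi**2 * (l1**2 + l2**2) + tau**2) ** (-alpha)   # lambda_l
        phi = np.cos(np.pi * (l1 * X1 + l2 * X2))                 # phi_l(x) = cos(pi <l, x>)
        log_a += u[l1, l2] * np.sqrt(lam) * phi
a = np.exp(log_a)   # permeability field
```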

To characterize the evolution of EKHMC and compare it with EKS, we compute two metrics:

$$d_{H^{-2}}(\cdot)=\sqrt{\frac{1}{J}\sum_{j=1}^{J}\|u^{(j)}(t)-\cdot\|_{H^{-2}}^{2}}\,,\quad d_{L^{2}}(\cdot)=\sqrt{\frac{1}{J}\sum_{j=1}^{J}\|u^{(j)}(t)-\cdot\|^{2}_{L^{2}}}\,.\qquad(6.33)$$

For these metrics, we use the norms

$$\|u\|_{H^{-2}}=\sqrt{\sum_{l\in K_{d}}|u_{l}|^{2}\lambda_{l}}\,,\quad\|u\|_{L^{2}}=\sqrt{\sum_{l\in K_{d}}|u_{l}|^{2}}\,,\qquad(6.34)$$

where the first is defined in the negative Sobolev space $H^{-2}$, whilst the second is in the $L^{2}$ space.
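In KL coordinates the metrics (6.33), (6.34) reduce to weighted Euclidean norms; a short sketch, where u_ens holds one particle per column and u_ref is a user-supplied reference point (for example the MCMC posterior mean):

```python
# Sketch of the ensemble error metrics (6.33)-(6.34) in KL coordinates,
# where the H^{-2} norm weights mode l by lambda_l.
import numpy as np

def ensemble_metrics(u_ens, u_ref, lam):
    """u_ens: (d, J); u_ref, lam: (d,). Returns (d_{H^{-2}}, d_{L^2})."""
    diff = u_ens - u_ref[:, None]
    d_Hm2 = np.sqrt(np.mean(np.sum(lam[:, None] * diff**2, axis=0)))
    d_L2 = np.sqrt(np.mean(np.sum(diff**2, axis=0)))
    return d_Hm2, d_L2
```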

We set $a=0.01$ and $\epsilon=1.0$ for both EKHMC and EKS. We set $\gamma=1$ in EKHMC; unlike in the previous example, this choice does not lead to overshooting. We simulate the particles for $100$ iterations, and use their positions in the last iteration as the approximation to the posterior distribution. We compare the samples obtained from MCMC (we use the pCN variant of RWMH [16]), EKS ($J=128,512,2048$) and EKHMC ($J=128,512,2048$) in Figure 2. The evolution of $\|u\|_{H^{-2}}$ and $\|u\|_{L^{2}}$ is plotted to show that convergence is achieved, and that EKHMC converges faster than EKS.

7 Conclusions

Gradient-free methods for inverse problems are increasingly important in large-scale multi-physics science and engineering problems; see [29] and the references therein. In this paper we have provided an initial study of gradient-free methods which leverage the potential power of Hamiltonian based sampling. Analysis of the resulting method is hard, and our initial theoretical results leave many avenues open; in particular convergence to equilibrium is not established for the underlying nonlinear, nonlocal Fokker-Planck equation arising from the mean field model, and optimization of convergence rates over $\gamma$ is not understood, except in the setting of linear Gaussian inverse problems. The Fokker-Planck equation associated with standard underdamped Langevin dynamics has been studied in the context of hypocoercivity (see [1, 52]), and generalization of these methods will be of potential interest. Preconditioned HMC is also proven to converge in [6]; generalization to ensemble approximations would be of interest. On the computational side there are other approaches to avoiding gradient computations, yet leveraging Hamiltonian structure, which need to be evaluated and compared with what is proposed here [36].

8 Appendix

We present the proofs of various results from the paper here.

8.1 Proof of Proposition 1.1

Proof 8.1.

The determination of the most general form of diffusion process which is invariant with respect to a given measure has a long history; see [20, Theorem 1] for a statement and the historical context. Note, however, that this theory is not developed in the mean-field setting and concerns only the linear Fokker-Planck equation. However, the dependence of the matrices $\mathcal{K},\mathcal{J}$ and $\mathcal{M}$ on $\rho$ readily allows use of the approach taken in the linear case. We find it expedient to use the exposition of this topic in [40]. The same derivation as used to obtain Eq. (5) of [40] shows³ that the density $\rho$ associated with the dynamics (1.3), (1.4) satisfies Eq. (3.12); this follows from the symmetry of $\mathcal{K}$, the skew-symmetry of $\mathcal{J}$, and the fact that $\mathcal{J}$, $\mathcal{K}$ and $\mathcal{M}$ are assumed independent of $z$, despite their dependence on $\rho$. It is also manifestly the case that the Gibbs measure is invariant for (3.12) since the mass term $\mathcal{M}$ appearing in $\mathcal{H}$ is independent of $z$, so that

$$\rho\nabla\mathcal{H}+\nabla\rho=0$$

when $\rho$ is the Gibbs measure, and (3.12) then shows that $\partial_{t}\rho=0$.

³We use the notational convention concerning divergence of matrix fields that is standard in continuum mechanics [25]; this differs from the notational convention adopted in [40].

8.2 Proof of Proposition 3.1

Proof 8.2.

By Lemma 2.1 in [42] we can compute:

$$\nabla_{q^{(i)}}\cdot(C_{q}(Z))=\frac{1+N}{I}(q^{(i)}-\bar{q})\,.$$

For our specific choices of $\mathcal{J}$, $\mathcal{K}$ and $\mathcal{H}$, and hence of $J(Z),K(Z),H_{Z}$, we can explicitly write out the term

$$\nabla_{z^{(i)}}\cdot(K-J)=\begin{pmatrix}0\\ \frac{1+N}{I}(q^{(i)}-\bar{q})\end{pmatrix}$$

in (3.17). Meanwhile we also have

$$D_{z^{(i)}}H_{Z}(z)=\begin{pmatrix}D\Phi(q^{(i)})+\frac{1}{2}D_{q^{(i)}}{p^{(i)}}^{T}C_{q}(Z)^{-1}p^{(i)}\\ C_{q}(Z)^{-1}p^{(i)}\end{pmatrix}\,;$$

note that the quadratic momentum term is now dependent on the ensemble of the positions, and hence contributes terms which disappear only in the mean-field limit. It suffices to show that

$$-C_{q}(Z)\frac{1}{2}D_{q^{(i)}}{p^{(i)}}^{T}C_{q}(Z)^{-1}p^{(i)}=\frac{1}{I}p^{(i)}{p^{(i)}}^{T}C_{q}(Z)^{-1}(q^{(i)}-\bar{q})\,.$$

In fact, for $1\leq k\leq N$ and $e_{k}$ the $k$-th standard basis vector, we have

$$\begin{aligned}-\frac{\partial{p^{(i)}}^{T}C_{q}(Z)^{-1}p^{(i)}}{\partial q^{(i)}_{k}}&={p^{(i)}}^{T}C_{q}(Z)^{-1}\frac{\partial(\frac{1}{I}\sum_{j=1}^{I}q^{(j)}{q^{(j)}}^{T}-\bar{q}\bar{q}^{T})}{\partial q^{(i)}_{k}}C_{q}(Z)^{-1}p^{(i)}\\ &=\frac{1}{I}{p^{(i)}}^{T}C_{q}(Z)^{-1}[(q^{(i)}-\bar{q})e_{k}^{T}+e_{k}(q^{(i)}-\bar{q})^{T}]C_{q}(Z)^{-1}p^{(i)}\\ &=\frac{2}{I}e_{k}^{T}C_{q}(Z)^{-1}p^{(i)}{p^{(i)}}^{T}C_{q}(Z)^{-1}(q^{(i)}-\bar{q})\,.\end{aligned}$$

Thus we have

$$-\frac{1}{2}D_{q^{(i)}}{p^{(i)}}^{T}C_{q}(Z)^{-1}p^{(i)}=\frac{1}{I}C_{q}(Z)^{-1}p^{(i)}{p^{(i)}}^{T}C_{q}(Z)^{-1}(q^{(i)}-\bar{q})$$

and the claim follows.
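As an aside (not part of the proof), the divergence identity from Lemma 2.1 of [42] used above is easy to verify by central finite differences; the sketch below does so for a random ensemble of our own choosing.

```python
# Finite-difference check of the identity
# div_{q^{(i)}} C_q(Z) = (1+N)/I (q^{(i)} - q_bar).
import numpy as np

def Cq(q):
    q_hat = q - q.mean(axis=1, keepdims=True)
    return q_hat @ q_hat.T / q.shape[1]

rng = np.random.default_rng(4)
N, I, i, h = 3, 7, 2, 1e-6
q = rng.standard_normal((N, I))

# (div C)_a = sum_b d C_{ab} / d q^{(i)}_b, via central differences
div = np.zeros(N)
for b in range(N):
    qp, qm = q.copy(), q.copy()
    qp[b, i] += h
    qm[b, i] -= h
    div += (Cq(qp)[:, b] - Cq(qm)[:, b]) / (2 * h)

expected = (1 + N) / I * (q[:, i] - q.mean(axis=1))
assert np.allclose(div, expected, atol=1e-6)
```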

8.3 Proof of Proposition 4.3

Proof 8.3.

In fact, consider the affine transformation

$$q^{(i)}=Av^{(i)}+b\,,\quad p^{(i)}=Au^{(i)}\,,$$

along with

$$\tilde{\Phi}(v^{(i)})=\Phi(q^{(i)})=\Phi(Av^{(i)}+b)\,,\quad\tilde{H}(v^{(i)},u^{(i)})=H(q^{(i)},p^{(i)})=H(Av^{(i)}+b,Au^{(i)})\,.$$

Then we have that the gradient term scales as $\nabla_{v^{(i)}}\tilde{\Phi}(v^{(i)})=A^{T}\nabla_{q^{(i)}}\Phi(q^{(i)})$ and the covariance preconditioner scales as $C_{v}=A^{-1}C_{q}A^{-T}$. The generalized square root scales as $\sqrt{C_{v}}=A^{-1}\sqrt{C_{q}}$. Therefore we can check that affine invariance holds for the particle systems (1.9) and (3.19). Affine invariance for the gradient-free approximation (4.20) is even more easily seen to be true. Finally, for the Fokker-Planck equation (3.18), we can check similarly via the scaling of the terms. See a similar argument in the proof of Lemma 4.7 in [24].

8.4 Proof of Proposition 5.1

Proof 8.4.

We can take expectations in (5.21) to obtain the first two ODEs in (5.22). Now define

$$\hat{z}=z-m(\rho)\triangleq\begin{pmatrix}\hat{q}\\ \hat{p}\end{pmatrix}\,,$$

to obtain the following evolution for $\hat{z}$:

$$\begin{aligned}\dot{\hat{q}}&=\hat{p}\,,\\ \dot{\hat{p}}&=-\mathcal{C}_{q}(\rho)B^{-1}\hat{q}-\gamma\hat{p}+\sqrt{2\gamma\mathcal{C}_{q}(\rho)}\dot{W}\,.\end{aligned}$$

Note that $\mathcal{C}_{q}=\mathbb{E}[\hat{q}\otimes\hat{q}]$, $\mathcal{C}_{p}=\mathbb{E}[\hat{p}\otimes\hat{p}]$, and $\mathcal{C}_{q,p}=\mathbb{E}[\hat{q}\otimes\hat{p}]$. We can use Itô's formula and the above evolution to derive the last three ODEs in (5.22). The form of the steady state solutions is immediate from setting the right hand side of (5.22) to zero.

Now, we will establish the independence of the essential dynamics of the mean and covariance from the specific choice of $B,c$. To this end, denote $x_{1}=B^{1/2}(B^{-1}m_{q}-c)$, $x_{2}=B^{-1/2}m_{p}$, $X=B^{-1/2}\mathcal{C}_{q}B^{-1/2}$, $Y=B^{-1/2}\mathcal{C}_{p}B^{-1/2}$, $Z=B^{-1/2}\mathcal{C}_{q,p}B^{-1/2}$. Applying this change of variables we obtain, from (5.22), the system

$$\begin{aligned}\dot{x_{1}}&=x_{2}\,,\\ \dot{x_{2}}&=-Xx_{1}-\gamma x_{2}\,,\\ \dot{X}&=Z+Z^{T}\,,\\ \dot{Y}&=-XZ-Z^{T}X-2\gamma Y+2\gamma X\,,\\ \dot{Z}&=-\gamma Z-X^{2}+Y\,.\end{aligned}\qquad(8.35)$$

We notice that the steady state solutions take the form $(x_{1},x_{2},X,Y,Z)=(w,0,X,X,0)$ for a projection matrix $X^{2}=X$ and $w$ such that $Xw=0$. The steady state with $X=I_{N}$ and $w=0$ corresponds to the desired posterior mean. The transformed equation is indeed independent of $B$ and $c$. Therefore the speed of convergence $(x_{1},x_{2},X,Y,Z)\to(0,0,I_{N},I_{N},0)$, within its basin of attraction, is indeed independent of $B$ and $c$. The same is true in the original variables.

To complete the proof of stability it suffices to establish the local exponential convergence to the steady solution $(x_{1},x_{2},X,Y,Z)=(0,0,I_{N},I_{N},0)$. As a first step, we show that the following system converges to $(X,Y,Z)=(I_{N},I_{N},0)$:

$$\begin{aligned}\dot{X}&=Z+Z^{T}\,,\\ \dot{Y}&=-XZ-Z^{T}X-2\gamma Y+2\gamma X\,,\\ \dot{Z}&=-\gamma Z-X^{2}+Y\,.\end{aligned}\qquad(8.36)$$

Linearization around $(X,Y,Z)=(I_{N},I_{N},0)$ in the variables given by the entries of $(X,Y,Z+Z^{T})$ gives the matrix

$$\begin{pmatrix}0&0&2I_{N^{2}}\\ 2\gamma I_{N^{2}}&-2\gamma I_{N^{2}}&-2I_{N^{2}}\\ -2I_{N^{2}}&I_{N^{2}}&-\gamma I_{N^{2}}\end{pmatrix}$$

whose eigenvalues satisfy $x^{3}+3\gamma x^{2}+(2\gamma^{2}+6)x+4\gamma=0$; these all have strictly negative real part for $\gamma>0$, since we can show the unique real eigenvalue lies in the interval $(-\gamma,0)$. Therefore we conclude the local exponential convergence of the covariances $X,Y,Z$ in a neighbourhood of $(X,Y,Z)=(I_{N},I_{N},0)$. The exponential convergence of the means $(x_{1},x_{2})\to(0,0)$ then follows by linearizing the first two ODEs in the system (8.35) at $X=I_{N}$.

Similarly, for any other steady state $(X,Y,Z)=(X,X,0)$ with a projection matrix $X\neq I_{N}$, we will show that there is an unstable direction in (8.36). In fact, since $X$ is symmetric with only eigenvalues $1$ and $0$, there exists a nonzero vector $v$ such that $Xv=0$. Linearization around $(X,Y,Z)=(X,X,0)$ in the direction $(avv^{T},bvv^{T},cvv^{T})$ gives the following $3\times 3$ matrix for scalars $a,b,c$:

$$\begin{pmatrix}0&0&2\\ 2\gamma&-2\gamma&0\\ 0&1&-\gamma\end{pmatrix}$$

whose eigenvalues satisfy $x^{3}+3\gamma x^{2}+2\gamma^{2}x-4\gamma=0$. For all $\gamma>0$ it may be shown that there exists a real eigenvalue in the interval $(0,1)$; thus there is an unstable direction determined by $(a,b,c)$, and therefore in the original formulation.

Remark 8.5.

We can further investigate the spectral gap around $(X,Y,Z)=(I_{N},I_{N},0)$, which is the absolute value of the real root of the equation $x^{3}+3\gamma x^{2}+(2\gamma^{2}+6)x+4\gamma=0$. To be precise, we can show that for $x_{0}=-\sqrt{12-\sqrt{128}}$ and $\gamma_{0}=-(4+3x_{0}^{2})/(4x_{0})\approx 1.83$, the spectral gap is maximized, with value $-x_{0}$, and we expect the fastest convergence for our method in the linear setting.

To establish this claim we proceed as follows. By the intermediate value theorem, in order to show there always exists a root in $[x_{0},0)$, so that the spectral gap is at most $-x_{0}$, we only need to show that

$$x_{0}^{3}+3\gamma x_{0}^{2}+(2\gamma^{2}+6)x_{0}+4\gamma\leq 0\,.$$

The claim is true because $x_{0}<0$ and $2x_{0}(x_{0}^{3}+6x_{0})=(4+3x_{0}^{2})^{2}/4$. By the basic inequality $a^{2}+b^{2}\geq 2ab$, we have

$$x_{0}^{3}+3\gamma x_{0}^{2}+(2\gamma^{2}+6)x_{0}+4\gamma=2x_{0}\gamma^{2}+(4+3x_{0}^{2})\gamma+x_{0}^{3}+6x_{0}\leq 0\,.$$

The maximal spectral gap is attained if and only if equality holds in $a^{2}+b^{2}\geq 2ab$, i.e. when $\gamma=\gamma_{0}$.

We also plot the spectral gap as a function of $\gamma$ to help visualize the dependence of the rate of convergence on damping for the linear problem; see Figure 3. The clear message from this figure is that there is a natural optimization problem for $\gamma$ in this linear Gaussian setting; this is used to motivate searches for optimal parameters in the non-Gaussian setting.
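The cubic in Remark 8.5 can also be explored numerically; the following sketch computes the spectral gap (the distance of the rightmost eigenvalue from the imaginary axis) over a range of $\gamma$ and recovers the optimum near $\gamma\approx 1.83$, consistent with Figure 3.

```python
# Numerical exploration of the spectral gap of Remark 8.5.
import numpy as np

def spectral_gap(gamma):
    # eigenvalues satisfy x^3 + 3 gamma x^2 + (2 gamma^2 + 6) x + 4 gamma = 0
    roots = np.roots([1.0, 3.0 * gamma, 2.0 * gamma**2 + 6.0, 4.0 * gamma])
    return -np.max(roots.real)   # distance of rightmost eigenvalue from axis

gammas = np.linspace(0.1, 5.0, 500)
gaps = np.array([spectral_gap(g) for g in gammas])
g_opt = gammas[np.argmax(gaps)]

x0 = -np.sqrt(12 - np.sqrt(128))
print(g_opt, -(4 + 3 * x0**2) / (4 * x0))   # both approximately 1.83
```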

Figure 3: Spectral gap as a function of $\gamma$.

8.5 Proof of Proposition 5.3

Proof 8.6.

Note that since $m(\rho)$, $\mathcal{C}(\rho)$ are independent of the particle $z$, we have

$$\nabla\rho=-\mathcal{C}(\rho)^{-1}(z-m(\rho))\rho\,,$$
$$D^{2}\rho=\left(-\mathcal{C}(\rho)^{-1}+\left(\mathcal{C}(\rho)^{-1}(z-m(\rho))\right)\otimes\left(\mathcal{C}(\rho)^{-1}(z-m(\rho))\right)\right)\rho\,.$$

Using this we can substitute the Gaussian profile into equation (5.23). We compute the first term on the right-hand side to obtain

$$\left(-\mathcal{C}(\rho)^{-1}(z-m(\rho))\cdot\begin{pmatrix}-p\\ \mathcal{C}_{q}(\rho)(B^{-1}q-c)+\gamma p\end{pmatrix}+\gamma N\right)\rho\triangleq I_{1}\rho\,.$$

The second term on the right-hand side can be written as:

$$\begin{pmatrix}0&-\mathcal{C}_{q}(\rho)\\ \mathcal{C}_{q}(\rho)&\gamma\mathcal{C}_{q}(\rho)\end{pmatrix}:\left(-\mathcal{C}(\rho)^{-1}+\left(\mathcal{C}(\rho)^{-1}(z-m(\rho))\right)\otimes\left(\mathcal{C}(\rho)^{-1}(z-m(\rho))\right)\right)\rho\triangleq(I_{2}+I_{3})\rho\,.$$

For the left-hand side of (5.23), note that

$$\frac{\mathrm{d}}{\mathrm{d}t}\operatorname{det}\mathcal{C}(\rho)=\operatorname{Tr}\left[\operatorname{det}\mathcal{C}(\rho)\,\mathcal{C}(\rho)^{-1}\frac{\mathrm{d}}{\mathrm{d}t}\mathcal{C}(\rho)\right],\quad\frac{\mathrm{d}}{\mathrm{d}t}\mathcal{C}(\rho)^{-1}=-\mathcal{C}(\rho)^{-1}\left(\frac{\mathrm{d}}{\mathrm{d}t}\mathcal{C}(\rho)\right)\mathcal{C}(\rho)^{-1}\,.$$

Using the evolution of the mean (5.22) we obtain

$$\begin{aligned}\frac{\mathrm{d}}{\mathrm{d}t}\|z-m(\rho)\|_{\mathcal{C}(\rho)}^{2}=&\ 2\left\langle\frac{\mathrm{d}}{\mathrm{d}t}(z-m(\rho)),\mathcal{C}(\rho)^{-1}(z-m(\rho))\right\rangle\\ &+\left\langle(z-m(\rho)),\frac{\mathrm{d}}{\mathrm{d}t}\left(\mathcal{C}(\rho)^{-1}\right)(z-m(\rho))\right\rangle\\ =&\ 2\left\langle\begin{pmatrix}-m_{p}(\rho)\\ \mathcal{C}_{q}(\rho)(B^{-1}m_{q}(\rho)-c)+\gamma m_{p}(\rho)\end{pmatrix},\mathcal{C}(\rho)^{-1}(z-m(\rho))\right\rangle\\ &-\left\langle\mathcal{C}(\rho)^{-1}(z-m(\rho)),\frac{\mathrm{d}}{\mathrm{d}t}\left(\mathcal{C}(\rho)\right)\mathcal{C}(\rho)^{-1}(z-m(\rho))\right\rangle\triangleq 2(I_{4}-I_{5})\,.\end{aligned}$$

Therefore we can compute the left-hand side as:

$$\begin{aligned}\partial_{t}\rho&=\left[-\frac{1}{2}(\operatorname{det}\mathcal{C}(\rho))^{-1}\left(\frac{\mathrm{d}}{\mathrm{d}t}\operatorname{det}\mathcal{C}(\rho)\right)-\frac{1}{2}\frac{\mathrm{d}}{\mathrm{d}t}\left\|z-m(\rho)\right\|_{\mathcal{C}(\rho)}^{2}\right]\rho\\ &=\left[-\frac{1}{2}\operatorname{Tr}\left[\mathcal{C}(\rho)^{-1}\frac{\mathrm{d}}{\mathrm{d}t}\mathcal{C}(\rho)\right]-I_{4}+I_{5}\right]\rho\triangleq(I_{6}-I_{4}+I_{5})\rho\,.\end{aligned}$$

In order to show that the Gaussian profile is indeed a solution to (5.23), we only need to show that $I_{1}+I_{2}+I_{3}=-I_{4}+I_{5}+I_{6}$. Note that

$$\begin{aligned}I_{1}+I_{4}&=\gamma N+\mathcal{C}(\rho)^{-1}(z-m(\rho))\cdot\begin{pmatrix}p-m_{p}(\rho)\\ -\mathcal{C}_{q}(\rho)B^{-1}(q-m_{q}(\rho))-\gamma(p-m_{p}(\rho))\end{pmatrix}\\ &=\gamma N+\mathcal{C}(\rho)^{-1}(z-m(\rho))\cdot\begin{pmatrix}0&I_{N}\\ -\mathcal{C}_{q}(\rho)B^{-1}&-\gamma I_{N}\end{pmatrix}(z-m(\rho))\,.\end{aligned}$$

Defining z^=zm(ρ)\hat{z}=z-m(\rho), we collect the terms in the to-be-proven identity I1+I2+I3=I4+I5+I6I_{1}+I_{2}+I_{3}=-I_{4}+I_{5}+I_{6} as:

\begin{align*}
&\hat{z}^{T}\left[\mathcal{C}(\rho)^{-1}\begin{pmatrix}0&I_{N}\\ -\mathcal{C}_{q}(\rho)B^{-1}&-\gamma I_{N}\end{pmatrix}+\mathcal{C}(\rho)^{-1}\left(\begin{pmatrix}0&-\mathcal{C}_{q}(\rho)\\ \mathcal{C}_{q}(\rho)&\gamma\mathcal{C}_{q}(\rho)\end{pmatrix}-\frac{1}{2}\frac{\mathrm{d}}{\mathrm{d}t}\left(\mathcal{C}(\rho)\right)\right)\mathcal{C}(\rho)^{-1}\right]\hat{z}\\
&+\gamma N-\operatorname{Tr}\left[\begin{pmatrix}0&-\mathcal{C}_{q}(\rho)\\ \mathcal{C}_{q}(\rho)&\gamma\mathcal{C}_{q}(\rho)\end{pmatrix}\mathcal{C}(\rho)^{-1}\right]+\frac{1}{2}\operatorname{Tr}\left[\frac{\mathrm{d}}{\mathrm{d}t}\mathcal{C}(\rho)\,\mathcal{C}(\rho)^{-1}\right]\triangleq\hat{z}^{T}M_{1}\hat{z}+M_{2}=0\,.
\end{align*}

Hence it remains to show that $M_{1}+M_{1}^{T}=M_{2}=0$.

In fact, we compute $\mathcal{C}(\rho)(M_{1}+M_{1}^{T})\mathcal{C}(\rho)$ to be:

\begin{pmatrix}0&I_{N}\\ -\mathcal{C}_{q}(\rho)B^{-1}&-\gamma I_{N}\end{pmatrix}\mathcal{C}(\rho)+\mathcal{C}(\rho)\begin{pmatrix}0&-(B^{-1})^{T}\mathcal{C}_{q}(\rho)\\ I_{N}&-\gamma I_{N}\end{pmatrix}+\begin{pmatrix}0&0\\ 0&2\gamma\mathcal{C}_{q}(\rho)\end{pmatrix}-\frac{\mathrm{d}}{\mathrm{d}t}\left(\mathcal{C}(\rho)\right)\,,

which equals zero upon blockwise computation via (5.22). Thus $M_{1}+M_{1}^{T}=0$. Moreover, note that $M_{2}=-\operatorname{Tr}\left[\mathcal{C}(\rho)M_{1}\right]$: since $\mathcal{C}(\rho)$ is symmetric, $\operatorname{Tr}[\mathcal{C}(\rho)M_{1}^{T}]=\operatorname{Tr}[\mathcal{C}(\rho)M_{1}]$, so taking the trace of $\mathcal{C}(\rho)(M_{1}+M_{1}^{T})=0$ gives $2\operatorname{Tr}[\mathcal{C}(\rho)M_{1}]=0$ and hence $M_{2}=0$. Therefore the Gaussian profile indeed solves the Fokker-Planck equation (5.23).
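As a final illustrative aside (again not part of the proof), the identities $M_{1}+M_{1}^{T}=0$, $M_{2}=-\operatorname{Tr}[\mathcal{C}(\rho)M_{1}]$ and $M_{2}=0$ can be confirmed numerically for randomly generated inputs. The Python/NumPy sketch below assumes, for the purposes of the test, that $B$ is symmetric positive-definite, draws a random covariance in place of $\mathcal{C}(\rho)$ with its $qq$-block in place of $\mathcal{C}_{q}(\rho)$, and reads $\frac{\mathrm{d}}{\mathrm{d}t}\mathcal{C}(\rho)$ off the blockwise relation displayed above:

import numpy as np

rng = np.random.default_rng(0)
N, gamma = 4, 1.3

def random_spd(n):
    # random symmetric positive-definite matrix
    X = rng.standard_normal((n, n))
    return X @ X.T + n * np.eye(n)

C = random_spd(2 * N)                 # stands in for C(rho)
Cq = C[:N, :N]                        # stands in for C_q(rho), the qq-block
B = random_spd(N)                     # assumption: B symmetric positive-definite
I, Z = np.eye(N), np.zeros((N, N))

D = np.block([[Z, I], [-Cq @ np.linalg.inv(B), -gamma * I]])  # drift matrix from I_1 + I_4
A = np.block([[Z, -Cq], [Cq, gamma * Cq]])                    # matrix appearing in I_2 + I_3
# dC/dt, read off from the blockwise relation above (equivalently, from (5.22))
Cdot = D @ C + C @ D.T + np.block([[Z, Z], [Z, 2 * gamma * Cq]])

Cinv = np.linalg.inv(C)
M1 = Cinv @ D + Cinv @ (A - 0.5 * Cdot) @ Cinv
M2 = gamma * N - np.trace(A @ Cinv) + 0.5 * np.trace(Cdot @ Cinv)

print(np.linalg.norm(M1 + M1.T))      # ~ machine precision: M_1 + M_1^T = 0
print(abs(M2))                        # ~ machine precision: M_2 = 0
print(abs(M2 + np.trace(C @ M1)))     # ~ machine precision: M_2 = -Tr[C M_1]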

Acknowledgments

The work of ZL is supported by IAIFI through NSF grant PHY-2019786. The work of AMS is supported by NSF award AGS-1835860, the Office of Naval Research award N00014-17-1-2079 and by a Department of Defense Vannevar Bush Faculty Fellowship.

References

  • [1] D. Bakry and M. Émery. Diffusions hypercontractives. In Séminaire de probabilités XIX 1983/84, pages 177–206. Springer, 1985.
  • [2] K. Bergemann and S. Reich. A mollified ensemble Kalman filter. Quarterly Journal of the Royal Meteorological Society, 136(651):1636–1643, 2010.
  • [3] K. Bergemann and S. Reich. An ensemble Kalman-Bucy filter for continuous data assimilation. Meteorologische Zeitschrift, 21(3):213, 2012.
  • [4] M. Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.
  • [5] M. Betancourt, S. Byrne, S. Livingstone, and M. Girolami. The geometric foundations of Hamiltonian Monte Carlo. Bernoulli, 23(4A):2257–2298, 2017.
  • [6] N. Bou-Rabee and A. Eberle. Two-scale coupling for preconditioned Hamiltonian Monte Carlo in infinite dimensions. Stochastics and Partial Differential Equations: Analysis and Computations, 9(1):207–242, 2021.
  • [7] N. Bou-Rabee and J. M. Sanz-Serna. Randomized Hamiltonian Monte Carlo. The Annals of Applied Probability, 27(4):2159–2194, 2017.
  • [8] N. Bou-Rabee and J. M. Sanz-Serna. Geometric integrators and the Hamiltonian Monte Carlo method. Acta Numerica, 27:113–206, 2018.
  • [9] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.
  • [10] Y. Cao, J. Lu, and L. Wang. On explicit $l^{2}$-convergence rate estimate for underdamped Langevin dynamics. arXiv preprint arXiv:1908.04746, 2019.
  • [11] J. Carrillo, F. Hoffmann, A. Stuart, and U. Vaes. Consensus-based sampling. Studies in Applied Mathematics, 148(3):1069–1140, 2022.
  • [12] J. A. Carrillo, Y.-P. Choi, C. Totzeck, and O. Tse. An analytical framework for consensus-based global optimization method. Mathematical Models and Methods in Applied Sciences, 28(06):1037–1066, 2018.
  • [13] J. A. Carrillo, M. Fornasier, G. Toscani, and F. Vecil. Particle, kinetic, and hydrodynamic models of swarming. In Mathematical modeling of collective behavior in socio-economic and life sciences, pages 297–336. Springer, 2010.
  • [14] M. Chak, N. Kantas, T. Lelièvre, and G. A. Pavliotis. Optimal friction matrix for underdamped Langevin sampling. arXiv preprint arXiv:2112.06844, 2021.
  • [15] E. Cleary, A. Garbuno-Inigo, S. Lan, T. Schneider, and A. M. Stuart. Calibrate, emulate, sample. Journal of Computational Physics, 424:109716, 2021.
  • [16] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, 28(3):424–446, 2013.
  • [17] P. Del Moral, A. Kurtzmann, and J. Tugaut. On the stability and the uniform propagation of chaos of a class of extended ensemble Kalman–Bucy filters. SIAM Journal on Control and Optimization, 55(1):119–155, 2017.
  • [18] Z. Ding and Q. Li. Ensemble Kalman sampler: Mean-field limit and convergence analysis. SIAM Journal on Mathematical Analysis, 53(2):1546–1578, 2021.
  • [19] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
  • [20] A. B. Duncan, N. Nüsken, and G. A. Pavliotis. Using perturbed underdamped Langevin dynamics to efficiently sample from probability distributions. Journal of Statistical Physics, 169(6):1098–1131, 2017.
  • [21] A. Eberle, A. Guillin, and R. Zimmer. Couplings and quantitative contraction rates for Langevin dynamics. The Annals of Probability, 47(4):1982–2010, 2019.
  • [22] G. Evensen et al. Data assimilation: the ensemble Kalman filter, volume 2. Springer, 2009.
  • [23] A. Garbuno-Inigo, F. Hoffmann, W. Li, and A. M. Stuart. Interacting Langevin diffusions: Gradient structure and ensemble Kalman sampler. SIAM Journal on Applied Dynamical Systems, 19(1):412–441, 2020.
  • [24] A. Garbuno-Inigo, N. Nüsken, and S. Reich. Affine invariant interacting Langevin dynamics for Bayesian inference. SIAM Journal on Applied Dynamical Systems, 19(3):1633–1658, 2020.
  • [25] O. Gonzalez and A. M. Stuart. A first course in continuum mechanics, volume 42. Cambridge University Press, 2008.
  • [26] J. Goodman and J. Weare. Ensemble samplers with affine invariance. Communications in Applied Mathematics and Computational Science, 5(1):65–80, 2010.
  • [27] R. Graham. Covariant formulation of non-equilibrium statistical thermodynamics. Zeitschrift für Physik B Condensed Matter, 26(4):397–405, 1977.
  • [28] E. Haber, F. Lucka, and L. Ruthotto. Never look back - a modified EnKF method and its application to the training of neural networks without back propagation. arXiv preprint arXiv:1805.08034, 2018.
  • [29] D. Z. Huang, J. Huang, S. Reich, and A. M. Stuart. Efficient derivative-free Bayesian inference for large-scale inverse problems. arXiv preprint arXiv:2204.04386, 2022.
  • [30] D. Z. Huang, T. Schneider, and A. M. Stuart. Unscented Kalman inversion. arXiv preprint arXiv:2102.01580, 2021.
  • [31] M. A. Iglesias, K. J. Law, and A. M. Stuart. Ensemble Kalman methods for inverse problems. Inverse Problems, 29(4):045001, 2013.
  • [32] P.-E. Jabin and Z. Wang. Mean field limit for stochastic particle systems. In Active Particles, Volume 1, pages 379–402. Springer, 2017.
  • [33] D.-Q. Jiang and D. Jiang. Mathematical theory of nonequilibrium steady states: on the frontier of probability and dynamical systems. Springer Science & Business Media, 2004.
  • [34] J. Kaipio and E. Somersalo. Statistical and computational inverse problems, volume 160. Springer Science & Business Media, 2006.
  • [35] N. B. Kovachki and A. M. Stuart. Ensemble Kalman inversion: a derivative-free technique for machine learning tasks. Inverse Problems, 35(9):095005, 2019.
  • [36] S. Lan, T. Bui-Thanh, M. Christie, and M. Girolami. Emulation of higher-order tensors in manifold Monte Carlo methods for Bayesian inverse problems. Journal of Computational Physics, 308:81–101, 2016.
  • [37] B. Leimkuhler, C. Matthews, and J. Weare. Ensemble preconditioning for Markov chain Monte Carlo simulation. Statistics and Computing, 28(2):277–290, 2018.
  • [38] M. Lindsey, J. Weare, and A. Zhang. Ensemble Markov chain Monte Carlo with teleporting walkers. arXiv preprint arXiv:2106.02686, 2021.
  • [39] S. Livingstone, M. Betancourt, S. Byrne, and M. Girolami. On the geometric ergodicity of Hamiltonian Monte Carlo. Bernoulli, 25(4A):3109–3138, 2019.
  • [40] Y.-A. Ma, T. Chen, and E. Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
  • [41] R. M. Neal. An improved acceptance procedure for the hybrid Monte Carlo algorithm. Journal of Computational Physics, 111(1):194–203, 1994.
  • [42] N. Nüsken and S. Reich. Note on interacting Langevin diffusions: Gradient structure and ensemble Kalman sampler by Garbuno-Inigo, Hoffmann, Li and Stuart. arXiv preprint arXiv:1908.10890, 2019.
  • [43] M. Ottobre and G. Pavliotis. Asymptotic analysis for the generalized Langevin equation. Nonlinearity, 24(5):1629, 2011.
  • [44] G. A. Pavliotis. Stochastic processes and applications: diffusion processes, the Fokker-Planck and Langevin equations, volume 60. Springer, 2014.
  • [45] S. Reich. A dynamical systems framework for intermittent data assimilation. BIT Numerical Mathematics, 51(1):235–249, 2011.
  • [46] H. Risken. Fokker-Planck equation. In The Fokker-Planck Equation, pages 63–95. Springer, 1996.
  • [47] G. O. Roberts and J. S. Rosenthal. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367, 2001.
  • [48] J.-M. Sanz-Serna and M.-P. Calvo. Numerical Hamiltonian Problems. Courier Dover Publications, 2018.
  • [49] C. Schillings and A. M. Stuart. Analysis of the ensemble Kalman filter for inverse problems. SIAM Journal on Numerical Analysis, 55(3):1264–1290, 2017.
  • [50] A.-S. Sznitman. Topics in propagation of chaos. In École d'été de probabilités de Saint-Flour XIX – 1989, pages 165–251. Springer, 1991.
  • [51] A. Taghvaei, J. De Wiljes, P. G. Mehta, and S. Reich. Kalman filter and its modern extensions for the continuous-time nonlinear filtering problem. Journal of Dynamic Systems, Measurement, and Control, 140(3), 2018.
  • [52] C. Villani. Hypocoercivity. arXiv preprint math/0609050, 2006.