
Efficient, Multimodal, and Derivative-Free Bayesian Inference With Fisher-Rao Gradient Flows

Yifan Chen Courant Institute, New York University, NY [email protected]    Daniel Zhengyu Huang (author to whom any correspondence should be addressed) Beijing International Center for Mathematical Research, Center for Machine Learning Research, Peking University, Beijing, China [email protected]    Jiaoyang Huang University of Pennsylvania, Philadelphia, PA [email protected]    Sebastian Reich Universität Potsdam, Potsdam, Germany [email protected]    Andrew M. Stuart California Institute of Technology, Pasadena, CA [email protected]
Abstract

In this paper, we study efficient approximate sampling for probability distributions known up to normalization constants. We specifically focus on a problem class arising in Bayesian inference for large-scale inverse problems in science and engineering applications. The computational challenges we address with the proposed methodology are: (i) the need for repeated evaluations of expensive forward models; (ii) the potential existence of multiple modes; and (iii) the fact that the gradient of, or an adjoint solver for, the forward model may not be available or feasible to use.

While existing Bayesian inference methods meet some of these challenges individually, we propose a framework that tackles all three systematically. Our approach builds upon the Fisher-Rao gradient flow in probability space, yielding a dynamical system for probability densities that converges towards the target distribution at a uniform exponential rate. This rapid convergence is advantageous for the computational burden outlined in (i). We apply Gaussian mixture approximations with operator splitting techniques to simulate the flow numerically; the resulting approximation can capture multiple modes thus addressing (ii). Furthermore, we employ the Kalman methodology to facilitate a derivative-free update of these Gaussian components and their respective weights, addressing the issue in (iii).

The proposed methodology results in an efficient derivative-free posterior approximation method, flexible enough to handle multi-modal distributions: Gaussian Mixture Kalman Inversion (GMKI). The effectiveness of GMKI is demonstrated both theoretically and numerically in several experiments with multimodal target distributions, including proof-of-concept and two-dimensional examples, as well as a large-scale application: recovering the Navier-Stokes initial condition from solution data at positive times.

Submitted to: Inverse Problems

Keywords: Bayesian Inverse Problems, Sampling, Derivative-Free Methods, Multimodal, Kalman Methodology, Fisher-Rao Gradient Flow, Gaussian Mixtures.

1 Introduction

In this paper, we introduce the posterior approximation method called Gaussian Mixture Kalman Inversion (GMKI), designed for the solution of PDE inverse problems for which forward model evaluation is expensive, derivative/adjoint calculations cannot be used, and multiple modes are present. In subsection 1.1 we give the context, followed in subsection 1.2 with details of our guiding motivations. Subsection 1.3 describes the key ingredients of the algorithm and subsection 1.4 the contributions. In subsection 1.5 we give a detailed literature review and in subsection 1.6 we describe the organization of the paper.

1.1 Context

Sampling a target probability distribution known up to normalization constants is a classical problem in science and engineering. In this paper, we focus specifically on targets resulting from Bayesian inverse problems [1, 2] involving the recovery of an unknown parameter \theta\in\mathbb{R}^{N_{\theta}} from a noisy observation y\in\mathbb{R}^{N_{y}}, through the forward model

y=\mathcal{G}(\theta)+\eta. (1)

Here, \mathcal{G} denotes the forward mapping which, for the problems we focus on, is nonlinear and requires the solution of a partial differential equation (PDE) to evaluate. The observational noise \eta is here assumed to be Gaussian: \eta\sim\mathcal{N}(0,\Sigma_{\eta}). By assigning a Gaussian prior \mathcal{N}(r_{0},\Sigma_{0}) to the unknown \theta, the Bayesian framework leads to the posterior distribution

\rho_{\rm post}(\theta) \propto \exp(-\Phi_{R}(\theta)), (2a)
\Phi_{R}(\theta) = \Phi(\theta)+\frac{1}{2}\lVert\Sigma_{0}^{-\frac{1}{2}}(\theta-r_{0})\rVert^{2}, (2b)
\Phi(\theta) = \frac{1}{2}\lVert\Sigma_{\eta}^{-\frac{1}{2}}(y-\mathcal{G}(\theta))\rVert^{2}. (2c)

Here \Phi is the negative log-likelihood. Minimization of \Phi_{R} is a nonlinear least-squares problem which defines the maximum a posteriori (MAP) point estimator for the Bayesian inverse problem. It is the goal of this paper to develop an efficient method for approximating \rho_{\rm post} defined by \Phi_{R} in the specific setting which we now outline.
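To make the least-squares structure concrete, the following minimal sketch (in Python, not the paper's released Julia code) assembles \Phi_{R} and the unnormalized posterior of Eq. 2 for the scalar forward model \mathcal{G}(\theta)=\theta^{2} used later in Section 6.1; the numerical values are illustrative only.

```python
import numpy as np

# Minimal sketch of the objective Phi_R and unnormalized posterior in Eq. (2).
# The scalar forward model G(theta) = theta^2, observation y = 1, prior N(3, 2^2)
# and noise N(0, 0.2^2) are taken from the 1D example of Section 6.1 (Case A).
y = np.array([1.0])
Sigma_eta = np.array([[0.2**2]])
r0 = np.array([3.0])
Sigma_0 = np.array([[2.0**2]])

def G(theta):
    return theta**2

def Phi(theta):
    r = y - G(theta)
    return 0.5 * r @ np.linalg.solve(Sigma_eta, r)             # Eq. (2c)

def Phi_R(theta):
    d = theta - r0
    return Phi(theta) + 0.5 * d @ np.linalg.solve(Sigma_0, d)  # Eq. (2b)

def unnormalized_posterior(theta):
    return np.exp(-Phi_R(theta))                               # Eq. (2a), up to normalization

print(unnormalized_posterior(np.array([1.0])), unnormalized_posterior(np.array([-1.0])))
```

The two printed values correspond to the two posterior modes near \theta=\pm 1; the asymmetry comes from the prior centered at r_{0}=3.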

1.2 Guiding Motivations

We give more detail concerning the motivations behind the specific posterior approximation method developed here. Firstly, we note that an appropriate unit of cost in the solution of Bayesian inverse problems is the evaluation of \mathcal{G}, as this will be required multiple times for methods such as MCMC [3] and SMC [4, 5]; when evaluation of \mathcal{G} requires running large-scale PDE solvers, fast convergence is paramount. Secondly, we note that multiple modes, caused by multiple minimizers of \Phi_{R}, cause many methods to become slow [6], expending multiple steps in one mode before moving to another [7, 8]; in addition, many Gaussian approximation based methods are unable to capture multiple modes. Nevertheless, exploring all these modes is necessary, since missing one could lead to detrimental effects on engineering or science predictions. Thirdly, we note that the gradient of \Phi_{R} may not be available, or may not even be feasible to compute. This might be because the computational models are only given as a black box (e.g. in global climate model calibration [9, 10]), because the numerical methods are not differentiable (e.g. in the embedded boundary method [11, 12, 13, 14] and adaptive mesh refinement [15, 16]), or because of inherently discontinuous physics (e.g., in fracture [17] or cloud modeling [18, 19]). In this paper, we address these three challenges by combining, respectively, Fisher-Rao gradient flows, Gaussian mixture approximations, and the Kalman methodology. The resulting posterior approximation method, Gaussian Mixture Kalman Inversion (GMKI), is fast due to the uniform exponential convergence of Fisher-Rao gradient flows, can capture multiple modes since Gaussian mixture approximations are employed, and is derivative-free thanks to the systematic Kalman methodology.

1.3 Key Ingredients of GMKI

In sampling, it is widely accepted practice to construct a dynamical system for a density that gradually evolves to the posterior distribution, or an approximation of it, after a specified finite time or in the infinite-time limit. Numerical approximation of these dynamics, using either particle or parametric methods, leads to practical algorithms; these include sequential Monte Carlo (SMC, specified finite time) [4] and Markov chain Monte Carlo (MCMC, infinite time) [3], both commonly used in Bayesian inference. In recent years, gradient flows in probability space have become a popular choice of dynamical system [20, 21, 22]; their study presents the opportunity to profoundly influence our understanding and development of sampling algorithms.

In general, the convergence rates of different gradient flows can vary significantly. In this paper, we focus on the Fisher-Rao gradient flow of the Kullback–Leibler (KL) divergence [23, 24, 22]:

\frac{\partial\rho_{t}}{\partial t}=\rho_{t}\bigl(\log\rho_{\rm post}-\log\rho_{t}\bigr)-\rho_{t}\,\mathbb{E}_{\rho_{t}}[\log\rho_{\rm post}-\log\rho_{t}]. (3)

The Fisher-Rao gradient flow converges to its steady state, \rho_{\rm post}, exponentially fast, with a rate of \mathcal{O}(e^{-t}); see Proposition 1, [22, Theorem 4.1], and also [23, 24, 25]. This convergence rate is uniform and independent of \rho_{\rm post}, in particular its log-Sobolev constant, which typically determines the convergence rates of other gradient flows, such as the Wasserstein gradient flow. It is worth noting that the log-Sobolev constant may behave poorly when the posterior distribution \rho_{\rm post} is highly anisotropic or multimodal [7, 8]. Thus, we consider Eq. 3 as a desirable flow for sampling general distributions.

We introduce numerical approximations of Eq. 3 to construct practical algorithms. Particle methods represent the current density \rho_{t} by a (possibly weighted) sum of Dirac measures evaluated at an ensemble of particles. The flow Eq. 3 can then be realized as a birth-death dynamics of these particles [23, 24]. However, the birth-death rate depends on the density, so it is necessary to constantly reconstruct \rho_{t} from the empirical particle distribution. In [23, 24], kernel density estimators have been applied for the reconstruction, but their performance may be affected when the dimension of the problem becomes large. Moreover, birth-death dynamics alone cannot change the support of the distribution, so additional steps need to be added to explore the space [23, 24, 26]; such exploration steps change the dynamics and may also lead to challenges in high dimensional problems.

Parametric methods, which reduce the gradient flow into some parametric density space, constitute another common choice of numerical approximation. One way to do this is to project the flow Eq. 3 into the Gaussian space [27, 28, 29], via a moment closure approach. The resulting system for the mean and covariance is given by [22]:

\frac{\mathrm{d}m_{t}}{\mathrm{d}t}=C_{t}\,\mathbb{E}_{\rho_{a_{t}}}[\nabla_{\theta}\log\rho_{\rm post}],\qquad\frac{\mathrm{d}C_{t}}{\mathrm{d}t}=C_{t}+C_{t}\,\mathbb{E}_{\rho_{a_{t}}}[\nabla_{\theta}\nabla_{\theta}\log\rho_{\rm post}]\,C_{t}, (4)

where \rho_{t} in Eq. 3 is approximated by a Gaussian \rho_{a_{t}}=\mathcal{N}(m_{t},C_{t}) in Eq. 4; here a_{t}=(m_{t},C_{t}) is the unknown parameter. We note that one may also derive the above flow by natural gradient methods in variational inference [30, 31, 32]; see discussions in [22]. Theoretically, it has been shown in [22] that Eq. 4 converges exponentially fast to the best Gaussian approximation of \rho_{\rm post} in the KL divergence sense, when \rho_{\rm post} is log-concave. Therefore, by simulating Eq. 4, we get a Gaussian approximation of the posterior; this can be done through direct time integration or ensemble methods.

More generally, for multimodal problems, Gaussian mixture approximations have been studied in the literature under the variational inference framework [33, 29, 34]. These approaches require the evaluation of the gradient and sometimes even the Hessian matrix of \log\rho_{\rm post}, as shown in Eq. 4, which are not directly feasible for the type of problems which are our focus in this paper.

On the other hand, the Kalman methodology has emerged as an effective approach for sampling in both filtering and inverse problems [35, 36, 37, 38, 39, 40, 41, 42]. Similar to the parametric methods discussed above, it relies on Gaussian approximations; however, it additionally utilizes the structure of the problem, i.e., the least-squares form of the posterior as described in Eq. 2. Notably, the Kalman methodology can lead to derivative-free algorithms such as the Ensemble Kalman Filter (EnKF), Unscented Kalman Filter (UKF), and Ensemble Kalman Inversion (EKI), all defined in [42]. Moreover, the recent work on EKI and its variants in [43] can be interpreted as applying Kalman-type approximations to the Fisher-Rao gradient flow Eq. 3, although this gradient structure was not explicitly pointed out in the original paper. The effectiveness of this method has been demonstrated on large-scale inverse problems in science and engineering, with up to hundreds of dimensions. However, since only Gaussian approximations are used, the method may not be suitable for multimodal posterior distributions.

1.4 Contributions

The primary focus of this paper is to extend the Kalman methodology in [43] to Gaussian mixture approximations of the Fisher-Rao gradient flow. This leads to GMKI, a derivative-free posterior approximation method that converges fast and captures multiple modes for the challenging inversion problems studied here. We make the following contributions:

  1. We propose an operator splitting approach to integrate the Fisher-Rao gradient flow in time, which leads to an exploration step that explores the space freely and an exploitation step that harnesses the data and prior information. We prove the resulting exploration-exploitation scheme converges exponentially fast to the target distribution at the discrete time level (Section 2).

  2. We demonstrate a connection between the continuous time limit of the pre-existing algorithm in [43] and Gaussian variational inference (Section 3).

  3. We apply Gaussian mixture approximations to the exploration-exploitation scheme. We utilize the Kalman methodology to update the weights and locations of the mixtures. This leads to our derivative-free algorithm, GMKI, for sampling multimodal distributions (Section 4).

  4. We analyze GMKI by deriving the continuous time limit of the dynamics. Based on the continuous dynamics, we study its exploration effects, establish its affine invariant property, connect the methodology to variational inference with Gaussian mixtures, and investigate the convergence properties (Section 5).

  5. We demonstrate, on one/two-dimensional model problems as well as a high-dimensional application (recovering the Navier-Stokes initial condition from solution data at positive times), that GMKI is able to capture multiple modes in approximately \mathcal{O}(10) iterations, making it a promising approach for solving large-scale Bayesian inverse problems. Our code is accessible online at https://github.com/Zhengyu-Huang/InverseProblems.jl/tree/master/Multimodal (Section 6).

1.5 Literature Review

The review of relevant literature concerns SMC and MCMC, variational inference, gradient flows and Kalman methodology.

1.5.1 SMC and MCMC

Sequential Monte Carlo (SMC) [44] and Markov chain Monte Carlo (MCMC) [3] are common approaches used in Bayesian inference for sampling posteriors. They lead to dynamical systems of densities that progressively converge to the target distribution. For SMC, the dynamical system operates over a finite time interval and so converges fast at the density level, but numerical approximation of the dynamical system can be challenging, with difficulties such as weight collapse. Such issues are more pronounced in the case of multimodal posteriors, requiring a substantial number of particles and a good initialization for SMC to succeed, due to its lack of exploration. Approximation of the finite-time dynamics in SMC via transport of measures has also been investigated [45, 46, 47]. The Fisher-Rao gradient flow used in this paper can be seen as an infinite-time extension of the SMC dynamics that allows efficient exploration while converging exponentially fast at the density level. MCMC approaches typically require O(10^{4}) model runs, or more, for the type of PDE-based inversion arising in this paper; thus they are too costly. Moreover, most MCMC approaches are based on local moves and face significant challenges in the multimodal scenario.

1.5.2 Variational Inference

Variational inference [48, 49, 50] addresses the sampling problem Eq. 2 using optimization, typically with a lower computational cost compared to MCMC. The objective function, often chosen to be the KL divergence between the target distribution and a variational distribution, is minimized to get a closest approximate distribution within the variational distribution family. Gaussian distributions and Gaussian mixtures are often used as the variational distribution [51, 52, 33, 53]. The concept of natural gradients [30, 28, 33, 31, 32] has been widely used to derive efficient optimization algorithms for variational inference. These algorithms typically require evaluations of gradient information for the log density. We also note that the Gaussian and Gaussian mixture ansatz has been used in conjunction with the Dirac-Frenkel variational principle to solve time-dependent PDEs of wave functions and probability densities [54, 55]. When the PDE is the Fisher-Rao gradient flow, these methods can recover the parameter dynamics obtained by natural gradient flow in variational inference [56].

1.5.3 Fisher-Rao Gradient Flow

The Fisher-Rao gradient flow plays a key role in the design of sampling algorithms studied in this paper. There is a vast literature on the use of gradient flows of the KL divergence in the density space, employing different metric tensors, for sampling. We specifically focus on the Fisher-Rao metric, introduced by C.R. Rao [57], to derive the gradient flow Eq. 3, as it is the only metric, up to scaling, invariant under any diffeomorphism of the parameter space [58, 59, 60]. This invariance leads to a gradient flow converging at a rate independent of the target distribution. In practice, the Fisher-Rao gradient flow and its simulation by birth-death processes have been used in sequential Monte Carlo samplers to reduce the variance of particle weights [4] and accelerate Langevin sampling [23, 24, 26] and statistical learning [61]. Kernel approximation of the flow has also been considered [46, 62]. Gaussian approximation of the Fisher-Rao gradient flow is studied in [27], with close connections to natural gradient methods in variational inference.

1.5.4 Kalman Methodology

The Kalman methodology encompasses a general class of approaches for solving filtering and inverse problems. They are based on replacing the Bayesian inference step in a filter, which may be viewed as governed by a prior to posterior map, by an approximate transport map which is exact for Gaussians; inverse problems are solved by linking them to a filter. Ensemble Kalman methods give rise to derivative-free algorithms, and are appropriate for solving filtering and inverse problems in which the desired probability distribution is close to Gaussian [35, 63, 64, 65, 43].

Beyond Gaussian approximations, a strand of research has extended Kalman filters to operate on Gaussian mixtures [66, 67, 68, 69, 70, 71, 72, 73]. These methods model both prior and posterior distributions using Gaussian mixture distributions, leveraging a componentwise application of the Kalman methodology for each Gaussian component. Various techniques, such as recluster analysis and resampling techniques [74, 75, 76, 77, 78], as well as localization techniques [79, 80, 81], have been developed to enhance the robustness of these approaches. Nevertheless, existing methods in this category are tailored to transform a Gaussian mixture prior into a Gaussian mixture posterior; they can be understood as a Gaussian mixture approximation of the dynamics in SMC. The resulting methods lack full exploration of the space of possible solutions. In contrast, GMKI incorporates gradient flows, resulting in theoretical advantages manifest in its analysis. In practice, GMKI’s exploration component enables effective traversal of the solution space, leading to robust performance without weight collapse.

1.6 Organization

The paper is organized as follows. In Section 2, we introduce the Fisher-Rao gradient flow and the exploration-exploitation scheme for discretizing the flow in time. In Section 3, the Gaussian approximation approach for spatial approximation is reviewed. In Section 4, the proposed GMKI approach is presented, which relies on the Gaussian mixture approximation and Kalman methodology. In Section 5, the continuous time dynamics of the GMKI approach is derived and analyzed. In Section 6, numerical experiments are provided. We make concluding remarks in Section 7.

2 Fisher-Rao Gradient Flow

In the following two subsections, we (i) briefly describe the Fisher-Rao gradient flow in the time-continuous settings; and (ii) introduce our operator splitting approach for simulating this flow in practice.

2.1 Continuous Flow

In this paper we focus on the gradient flow arising from using the KL divergence

{\rm KL}[\rho\|\rho_{\rm post}]=\int\rho\log\Bigl(\frac{\rho}{\rho_{\rm post}}\Bigr)\mathrm{d}\theta=\mathbb{E}_{\rho}[\log\rho-\log\rho_{\rm post}] (5)

as the energy functional, along with the Fisher-Rao metric tensor M^{\rm FR}(\rho)^{-1}\psi=\rho\psi; the resulting gradient flow has the form (see [22, Section 4.1])

\frac{\partial\rho_{t}}{\partial t} = -M^{\rm FR}(\rho_{t})^{-1}\frac{\delta{\rm KL}[\rho_{t}\|\rho_{\rm post}]}{\delta\rho} = \rho_{t}\bigl(\log\rho_{\rm post}-\log\rho_{t}\bigr)-\rho_{t}\,\mathbb{E}_{\rho_{t}}[\log\rho_{\rm post}-\log\rho_{t}]. (6)

We have the following uniform exponential convergence result for this flow ([22, Theorem 4.1]; see also related results in [23, 24, 25, 82]):

Proposition 1.

Let \rho_{t} satisfy Eq. 6. Assume there exist constants K,B>0 such that the initial density \rho_{0} satisfies

e^{-K(1+|\theta|^{2})}\leq\frac{\rho_{0}(\theta)}{\rho_{\rm post}(\theta)}\leq e^{K(1+|\theta|^{2})}, (7)

and \rho_{0},\rho_{\rm post} have bounded second moments

\int|\theta|^{2}\rho_{0}(\theta)\,\mathrm{d}\theta\leq B,\quad\int|\theta|^{2}\rho_{\rm post}(\theta)\,\mathrm{d}\theta\leq B. (8)

Then, for any t\geq\log\bigl((1+B)K\bigr),

{\rm KL}[\rho_{t}\|\rho_{\rm post}]\leq(2+B+eB)Ke^{-t}. (9)

Accurate numerical simulation of Eq. 6 thus has the potential to exhibit uniform exponential convergence across a wide range of targets \rho_{\rm post}.

2.2 Time-Stepping via Operator Splitting

As a first step towards the derivation of an algorithm, we apply operator splitting to Eq. 6. Abusing notation we let \rho_{n} denote our approximation of \rho_{t_{n}} at time t=t_{n}=n\Delta t, where \Delta t denotes the time step, and solve sequentially

\frac{\partial\hat{\rho}_{t}}{\partial t}=-\hat{\rho}_{t}\bigl(\log\hat{\rho}_{t}-\mathbb{E}_{\hat{\rho}_{t}}[\log\hat{\rho}_{t}]\bigr),\quad\hat{\rho}_{t_{n}}=\rho_{n},\quad t_{n}\leq t\leq t_{n+1}, (10)

and

\frac{\partial\check{\rho}_{t}}{\partial t}=\check{\rho}_{t}\bigl(\log\rho_{\rm post}-\mathbb{E}_{\check{\rho}_{t}}[\log\rho_{\rm post}]\bigr),\quad\check{\rho}_{t_{n}}=\hat{\rho}_{t_{n+1}},\quad t_{n}\leq t\leq t_{n+1}. (11)

The map \rho_{n}\mapsto\rho_{n+1} is then defined by setting \rho_{n+1}=\check{\rho}_{t_{n+1}}. Further abusing notation we write \hat{\rho}_{t_{n+1}}=\hat{\rho}_{n+1} and \check{\rho}_{t_{n+1}}=\check{\rho}_{n+1}. Note that all of \rho_{n},\hat{\rho}_{n+1} and \check{\rho}_{n+1} are functions of \theta. Both Eq. 10 and Eq. 11 admit explicit solutions and we may write

\hat{\rho}_{n+1}(\theta)\propto\rho_{n}(\theta)^{e^{-\Delta t}}, (12a)
\check{\rho}_{n+1}(\theta)\propto\hat{\rho}_{n+1}(\theta)\,\rho_{\rm post}(\theta)^{\Delta t}. (12b)

Furthermore, using the first order approximation e^{-\Delta t}\approx 1-\Delta t and the explicit formula (2) for \rho_{\rm post}, we obtain the following time-stepping scheme:

\hat{\rho}_{n+1}(\theta)\propto\rho_{n}(\theta)^{1-\Delta t}, (13a)
\rho_{n+1}(\theta)\propto\hat{\rho}_{n+1}(\theta)\,\rho_{\rm post}(\theta)^{\Delta t}\propto\hat{\rho}_{n+1}(\theta)\,e^{-\Delta t\Phi_{R}(\theta)}. (13b)

It is worth mentioning that the first order approximation corrects the bias introduced by the operator splitting, ensuring that \rho_{\rm post} remains the fixed point of this time-stepping scheme. Moreover, the step (13a), and Eq. 10, can be interpreted as the Fisher-Rao gradient flow of the negative entropy term \mathbb{E}_{\rho}[\log\rho], which tends to increase entropy by expanding the distribution to explore the state space. In contrast, Eq. 13b multiplies the current distribution by the “posterior function” to exploit the data and prior information, concentrating towards regions of high posterior density. It is worth noting that the exploration-exploitation concept distinguishes the present approach from sequential Monte Carlo [4] and other homotopy based approaches for sampling [83], which instead rely on the updating rule

\rho_{n+1}(\theta)\propto\rho_{n}(\theta)\,e^{-\Delta t\Phi(\theta)}

to deform the prior into the posterior in one unit of time. The iteration Eq. 13 was first proposed as the basis for sampling algorithms in [43], as a methodology to remedy the ensemble collapse of EKI in long time asymptotics; however, the connection to gradient flows was not pointed out there. We note that the iteration Eq. 13 also connects to the tempering (or annealing) approaches that are commonly used in the Monte Carlo literature [84, 85, 86]. Finally, Eq. 13 can also be interpreted as an entropic mirror descent algorithm in optimization [87].
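Before introducing any Gaussian or Kalman approximation, the scheme Eq. 13 can be simulated directly on a grid in low dimensions. The sketch below (our own Python illustration) does this for the bimodal target of Section 6.1 and checks the geometric decay of the KL divergence (cf. Proposition 2 below); the grid and step size are illustrative choices.

```python
import numpy as np

# Grid-based sketch of the exploration-exploitation scheme (13): the density is
# stored as values on a uniform grid and renormalized after each substep.
# Phi_R is the bimodal objective of Section 6.1 (G(theta)=theta^2, y=1,
# prior N(3, 2^2), noise N(0, 0.2^2)).
theta = np.linspace(-6.0, 6.0, 2001)
dtheta = theta[1] - theta[0]

def Phi_R(t):
    return 0.5 * (1.0 - t**2)**2 / 0.2**2 + 0.5 * (t - 3.0)**2 / 2.0**2

def normalize(rho):
    return rho / (np.sum(rho) * dtheta)

rho_post = normalize(np.exp(-Phi_R(theta)))
rho = normalize(np.exp(-0.5 * (theta - 3.0)**2 / 2.0**2))     # start from the prior

dt = 0.5
for n in range(30):
    rho = normalize(rho**(1.0 - dt))                           # exploration, Eq. (13a)
    rho = normalize(rho * np.exp(-dt * Phi_R(theta)))          # exploitation, Eq. (13b)

mask = (rho > 1e-300) & (rho_post > 1e-300)
kl = np.sum(rho[mask] * np.log(rho[mask] / rho_post[mask])) * dtheta
print(f"KL[rho_30 || rho_post] ~ {kl:.2e}")                    # decays roughly like (1-dt)^n
```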

The exploration-exploitation time-stepping scheme Eq. 13 inherits the convergence property of the continuous flow; see Proposition 2. The proof can be found in A.

Proposition 2.

Under the assumptions in Proposition 1, let \rho_{n} solve Eq. 13; then for any n\geq\Bigl|\frac{\log((1+B)K)}{\log(1-\Delta t)}\Bigr|, it holds that

\mathrm{KL}\left[\rho_{n}\|\rho_{\rm post}\right]\leq(2+B+eB)K(1-\Delta t)^{n}. (14)

3 Gaussian Approximation and Kalman Methodology

In this section, we discuss the Gaussian approximation of the scheme Eq. 13 through the Kalman methodology. In doing so we review the necessary techniques and pave the way for constructing our Gaussian mixture approximations in the next section.

In [43], the authors used Gaussian distributions to approximate the evolution of densities defined by Eq. 13. More precisely, assume \rho_{n}=\mathcal{N}(m_{n},C_{n}). Then, the first exploration step Eq. 13a leads to

\hat{\rho}_{n+1}=\mathcal{N}(\widehat{m}_{n+1},\widehat{C}_{n+1})=\mathcal{N}\Bigl(m_{n},\frac{1}{1-\Delta t}C_{n}\Bigr). (15)

The distribution thus remains Gaussian. However, the second exploitation step Eq. 13b will map out of the space of Gaussian densities, unless \Phi_{R} is quadratic. In [43], the Kalman methodology is employed to approximate Eq. 13b, in a manner similar to the analysis step in the Kalman filter. More precisely, the methodology starts with the following artificial inverse problem:

x=\mathcal{F}(\theta)+\nu, (16)

where we have

x=\begin{bmatrix}y\\ r_{0}\end{bmatrix},\quad\mathcal{F}(\theta)=\begin{bmatrix}\mathcal{G}(\theta)\\ \theta\end{bmatrix},\quad\Sigma_{\nu}=\begin{bmatrix}\Sigma_{\eta}&0\\ 0&\Sigma_{0}\end{bmatrix}. (17)

Here, we set the prior on \theta as \hat{\rho}_{n+1}(\theta), and the observation noise as \nu\sim\mathcal{N}(0,\frac{1}{\Delta t}\Sigma_{\nu}). Following Bayes' rule, the posterior distribution of the artificial inverse problem is

\rho(\theta|x)=\frac{\hat{\rho}_{n+1}(\theta)\rho(x|\theta)}{\rho(x)}\propto\hat{\rho}_{n+1}(\theta)\,e^{-\Delta t\Phi_{R}(\theta)}=\rho_{n+1}(\theta), (18)

which matches the output of the step Eq. 13b. Here we used the fact that Eq. 2 can be rewritten as

\Phi_{R}(\theta)=\frac{1}{2}\lVert\Sigma_{\nu}^{-\frac{1}{2}}(x-\mathcal{F}(\theta))\rVert^{2}. (19)

The Kalman methodology for approximating the posterior \rho(\theta|x) may now be adopted. One first forms a Gaussian approximation of the joint distribution of \theta and \mathcal{F}(\theta)+\nu, via standard moment matching, yielding

\rho^{\rm G}(\theta,\mathcal{F}(\theta)+\nu)\sim\mathcal{N}\Bigl(\begin{bmatrix}\theta\\ x\end{bmatrix};\begin{bmatrix}\widehat{m}_{n+1}\\ \widehat{x}_{n+1}\end{bmatrix},\begin{bmatrix}\widehat{C}_{n+1}&\widehat{C}^{\theta x}_{n+1}\\ (\widehat{C}^{\theta x}_{n+1})^{T}&\widehat{C}^{xx}_{n+1}\end{bmatrix}\Bigr), (20)

where \widehat{m}_{n+1},\widehat{C}_{n+1} are as specified previously, and

\widehat{x}_{n+1}=\mathbb{E}[\mathcal{F}(\theta)],\quad\widehat{C}^{\theta x}_{n+1}=\mathrm{Cov}[\theta,\mathcal{F}(\theta)],\quad\widehat{C}^{xx}_{n+1}=\mathrm{Cov}[\mathcal{F}(\theta)+\nu]=\mathrm{Cov}[\mathcal{F}(\theta)]+\frac{1}{\Delta t}\Sigma_{\nu}. (21)

In the above, the expectation and covariance are taken over \theta\sim\hat{\rho}_{n+1}. These integrals can be computed using Monte Carlo methods or quadrature rules.

Then, one can condition the joint Gaussian distribution Eq. 20 on the event \mathcal{F}(\theta)+\nu=x, to get a Gaussian approximation of the posterior \rho(\theta|x). In detail, using the Gaussian conditioning formula, one obtains

\rho_{n+1}(\theta)\approx\rho^{\rm G}(\theta|\mathcal{F}(\theta)+\nu=x)=\mathcal{N}(\theta;m_{n+1},C_{n+1}), (22a)

where

m_{n+1} = \widehat{m}_{n+1}+\widehat{C}^{\theta x}_{n+1}(\widehat{C}^{xx}_{n+1})^{-1}(x-\widehat{x}_{n+1}), (23a)
C_{n+1} = \widehat{C}_{n+1}-\widehat{C}^{\theta x}_{n+1}(\widehat{C}^{xx}_{n+1})^{-1}(\widehat{C}^{\theta x}_{n+1})^{T}. (23b)

Combining the two updates Eq. 15 and Eq. 23 leads to a Gaussian approximation scheme for solving the discrete Fisher-Rao gradient flow Eq. 13. The scheme is based on the Kalman methodology and is derivative free. Some theoretical and numerical studies of this scheme can be found in [43].
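The following sketch (our own Python illustration, not the released implementation) carries out one iteration of this Gaussian scheme: the exploration step Eq. 15 followed by the Kalman update of Eqs. 21 and 23, with the Gaussian moments estimated by plain Monte Carlo rather than the modified unscented transform used in [43].

```python
import numpy as np

rng = np.random.default_rng(0)

# One iteration of the Gaussian (single-component) scheme of Section 3:
# exploration step (15) followed by the derivative-free Kalman update (21), (23).
# The Gaussian moments are estimated with plain Monte Carlo. G maps R^{N_theta} -> R^{N_y}.
def gaussian_kalman_inversion_step(m, C, G, y, r0, Sigma_eta, Sigma_0, dt=0.5, J=1000):
    N_theta, N_y = m.size, y.size
    x = np.concatenate([y, r0])                                   # augmented data, Eq. (17)
    Sigma_nu = np.block([[Sigma_eta, np.zeros((N_y, N_theta))],
                         [np.zeros((N_theta, N_y)), Sigma_0]])

    m_hat, C_hat = m, C / (1.0 - dt)                              # exploration, Eq. (15)

    thetas = rng.multivariate_normal(m_hat, C_hat, size=J)        # theta ~ rho_hat_{n+1}
    F = np.array([np.concatenate([G(t), t]) for t in thetas])     # F(theta) = [G(theta); theta]
    x_hat = F.mean(axis=0)
    dth = thetas - thetas.mean(axis=0)
    dF = F - x_hat
    C_theta_x = dth.T @ dF / (J - 1)                              # Cov[theta, F(theta)]
    C_xx = dF.T @ dF / (J - 1) + Sigma_nu / dt                    # Cov[F(theta)] + Sigma_nu/dt

    gain = np.linalg.solve(C_xx, C_theta_x.T).T                   # C_theta_x C_xx^{-1}
    m_new = m_hat + gain @ (x - x_hat)                            # Eq. (23a)
    C_new = C_hat - gain @ C_theta_x.T                            # Eq. (23b)
    return m_new, C_new

# Example on the scalar problem of Section 6.1 with G(theta) = theta^2.
m, C = np.array([3.0]), np.array([[4.0]])
for n in range(20):
    m, C = gaussian_kalman_inversion_step(m, C, lambda t: t**2,
                                          y=np.array([1.0]), r0=np.array([3.0]),
                                          Sigma_eta=np.array([[0.04]]), Sigma_0=np.array([[4.0]]))
print(m, C)   # a single Gaussian typically locks onto one of the two posterior modes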

Remark 1.

We can connect the Gaussian approximation based on the Kalman methodology and the approximation Eq. 4 obtained by Gaussian variational inference. To do so we consider the continuous time limit of Eq. 23, calculated in [43, Eq. A.2], as

\frac{\mathrm{d}m_{t}}{\mathrm{d}t}=\widehat{C}_{t}^{\theta x}\Sigma_{\nu}^{-1}(x-\widehat{x}_{t}),\qquad\frac{\mathrm{d}C_{t}}{\mathrm{d}t}=C_{t}-\widehat{C}_{t}^{\theta x}\Sigma_{\nu}^{-1}(\widehat{C}_{t}^{\theta x})^{T}, (24)

where \widehat{x}_{t}=\mathbb{E}_{\rho_{t}}[\mathcal{F}(\theta)], \widehat{C}_{t}^{\theta x}=\mathbb{E}_{\rho_{t}}\bigl[(\theta-m_{t})\otimes(\mathcal{F}(\theta)-\mathbb{E}\mathcal{F}(\theta))\bigr], and the expectation is taken with respect to the distribution \rho_{t}(\theta)=\mathcal{N}(\theta;m_{t},C_{t}). We can view Eq. 24 as a derivative-free approximation of Eq. 4 through the statistical linearization approach [42, Sec 4.3.2]. More precisely, by Stein's identity, which utilizes the integration by parts formula for Gaussian measures, we have the relation \mathbb{E}_{\rho_{t}}[\nabla_{\theta}\mathcal{F}(\theta)]=C_{t}^{-1}\widehat{C}^{\theta x}_{t}. Statistical linearization makes the approximation \nabla_{\theta}\mathcal{F}(\theta)\approx C_{t}^{-1}\widehat{C}^{\theta x}_{t} for all \theta; the approximation is exact when \mathcal{F} is linear. Based on it, we can approximate the right hand side of the equation for the mean in Eq. 4 as follows:

C_{t}\,\mathbb{E}_{\rho_{t}}[\nabla_{\theta}\log\rho_{\rm post}]=C_{t}\,\mathbb{E}_{\rho_{t}}[\nabla_{\theta}\mathcal{F}(\theta)\,\Sigma_{\nu}^{-1}(x-\mathcal{F}(\theta))]\approx\widehat{C}_{t}^{\theta x}\Sigma_{\nu}^{-1}(x-\widehat{x}_{t}), (25)

where in the first identity we used the fact that \rho_{\rm post}\propto\exp(-\frac{1}{2}\|\Sigma_{\nu}^{-1/2}(\mathcal{F}(\theta)-x)\|^{2}), and in the last step we used the statistical linearization approximation.

The statistical linearization essentially approximates \nabla_{\theta}\mathcal{F}(\theta) by a constant matrix (which is why the term linearization is used); under this approximation, the Hessian \nabla_{\theta}\nabla_{\theta}\mathcal{F}(\theta) is zero. (We can also interpret this step through the Gauss-Newton approximation.) Based on this fact, we obtain, for the equation of the covariance in Eq. 4, that

C_{t}+C_{t}\,\mathbb{E}_{\rho_{t}}[\nabla_{\theta}\nabla_{\theta}\log\rho_{\rm post}]\,C_{t} \approx C_{t}-C_{t}\,\mathbb{E}_{\rho_{t}}[\nabla_{\theta}\mathcal{F}(\theta)\,\Sigma_{\nu}^{-1}\nabla_{\theta}\mathcal{F}(\theta)^{T}]\,C_{t} \approx C_{t}-\widehat{C}^{\theta x}_{t}\Sigma_{\nu}^{-1}(\widehat{C}^{\theta x}_{t})^{T}. (26)

Thus, we recover Eq. 24. In this sense, we can understand the Kalman methodology in the continuous limit as applying a statistical linearization approximation to the dynamics obtained in variational inference. \lozenge
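The Stein identity invoked above is easy to check numerically; the following one-dimensional sketch (our own illustration, with F(\theta)=\theta^{2}) compares a Monte Carlo estimate of \mathbb{E}[F'(\theta)] with C^{-1}\mathrm{Cov}[\theta,F(\theta)].

```python
import numpy as np

# One-dimensional Monte Carlo check of the Stein identity used in Remark 1:
# for theta ~ N(m, sigma^2) and F(theta) = theta^2, both E[F'(theta)] and
# sigma^{-2} Cov[theta, F(theta)] equal 2*m analytically.
rng = np.random.default_rng(1)
m, sigma = 1.3, 0.7
theta = rng.normal(m, sigma, size=200_000)
F = theta**2

lhs = np.mean(2.0 * theta)                 # Monte Carlo estimate of E[F'(theta)]
rhs = np.cov(theta, F)[0, 1] / sigma**2    # C^{-1} Cov[theta, F(theta)]
print(lhs, rhs, 2.0 * m)                   # all approximately equal
```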

4 Gaussian Mixture Kalman Inversion

In this section, we study the use of Gaussian mixture models to approximate the evolution \rho_{n}\mapsto\rho_{n+1} defined by Eq. 13. We assume that at step n the distribution has the form of a K-component Gaussian mixture:

\rho_{n}(\theta)=\sum_{k=1}^{K}w_{n,k}\,\mathcal{N}(\theta;m_{n,k},C_{n,k}),

where w_{n,k}\geq 0, \sum_{k=1}^{K}w_{n,k}=1. It remains to specify updates of the weights and Gaussian components, through the map defined by Eq. 13. In the two subsequent subsections, we consider steps (13a) and (13b) respectively.

4.1 The Exploration Step

The first step Eq. 13a, \hat{\rho}_{n+1}(\theta)\propto\rho_{n}(\theta)^{1-\Delta t}, is not closed in the space of K-component Gaussian mixtures. Thus, we choose to approximate \hat{\rho}_{n+1} by a new Gaussian mixture model

\hat{\rho}_{n+1}\approx\hat{\rho}^{\rm GM}_{n+1}=\sum_{k=1}^{K}\hat{w}_{n+1,k}\,\mathcal{N}(\theta;\widehat{m}_{n+1,k},\widehat{C}_{n+1,k}).

Note that \rho_{n}(\theta)^{1-\Delta t}=\rho_{n}(\theta)^{-\Delta t}\rho_{n}(\theta). We will determine the parameters of the new Gaussian mixture model by applying Gaussian moment matching to each component \rho_{n}(\theta)^{-\Delta t}w_{n,k}\mathcal{N}(\theta;m_{n,k},C_{n,k}) in \rho_{n}(\theta)^{-\Delta t}\rho_{n}(\theta). (This approximation, together with our exploitation step, connects GMKI to Gaussian mixture variational inference and natural gradients; see Remark 2.) More specifically, we first rewrite the power of a Gaussian mixture as follows:

\hat{\rho}_{n+1}(\theta) \propto \Bigl(\sum_{k=1}^{K}w_{n,k}\,\mathcal{N}(\theta;m_{n,k},C_{n,k})\Bigr)^{1-\Delta t}
= \sum_{k=1}^{K}\Bigl[w_{n,k}\,\mathcal{N}(\theta;m_{n,k},C_{n,k})\Bigl(\sum_{i=1}^{K}w_{n,i}\,\mathcal{N}(\theta;m_{n,i},C_{n,i})\Bigr)^{-\Delta t}\Bigr]
= \sum_{k=1}^{K}\Bigl[f_{n,k}(\theta)\,\mathcal{N}\Bigl(\theta;m_{n,k},\frac{C_{n,k}}{1-\Delta t}\Bigr)\Bigr],

where

f_{n,k}(\theta)=\frac{(2\pi)^{\frac{\Delta t N_{\theta}}{2}}}{(1-\Delta t)^{\frac{N_{\theta}}{2}}}\,w_{n,k}^{1-\Delta t}\,\det(C_{n,k})^{\frac{\Delta t}{2}}\Bigl(\frac{w_{n,k}\,\mathcal{N}(\theta;m_{n,k},C_{n,k})}{\sum_{i=1}^{K}w_{n,i}\,\mathcal{N}(\theta;m_{n,i},C_{n,i})}\Bigr)^{\Delta t}.

We approximate each component above by a Gaussian distribution:

f_{n,k}(\theta)\,\mathcal{N}\Bigl(\theta;m_{n,k},\frac{C_{n,k}}{1-\Delta t}\Bigr)\approx\hat{w}_{n+1,k}\,\mathcal{N}(\theta;\widehat{m}_{n+1,k},\widehat{C}_{n+1,k}), (27)

where we set

\hat{w}_{n+1,k} = \int f_{n,k}(\theta)\,\mathcal{N}\Bigl(\theta;m_{n,k},\frac{C_{n,k}}{1-\Delta t}\Bigr)\mathrm{d}\theta, (28a)
\widehat{m}_{n+1,k} = \frac{1}{\hat{w}_{n+1,k}}\int\theta\,f_{n,k}(\theta)\,\mathcal{N}\Bigl(\theta;m_{n,k},\frac{C_{n,k}}{1-\Delta t}\Bigr)\mathrm{d}\theta, (28b)
\widehat{C}_{n+1,k} = \frac{1}{\hat{w}_{n+1,k}}\int(\theta-\widehat{m}_{n+1,k})(\theta-\widehat{m}_{n+1,k})^{T}f_{n,k}(\theta)\,\mathcal{N}\Bigl(\theta;m_{n,k},\frac{C_{n,k}}{1-\Delta t}\Bigr)\mathrm{d}\theta. (28c)

Here, we determine \hat{w}_{n+1,k}, \widehat{m}_{n+1,k}, and \widehat{C}_{n+1,k} by moment matching of both sides of Eq. 27. These integrals can be evaluated by Monte Carlo methods or quadrature rules. Then, we normalize \{\hat{w}_{n+1,k}\}_{k=1}^{K} so that the weights sum to one. The updates (28) determine \hat{\rho}^{\rm GM}_{n+1}.
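As an illustration, the exploration update (28) can be realized by importance-weighted Monte Carlo as in the following sketch (our own Python code, not the released implementation); factors of f_{n,k} common to all components are dropped because the weights are renormalized at the end.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(2)

# Monte Carlo sketch of the exploration step (28): for each component, sample
# from N(m_{n,k}, C_{n,k}/(1-dt)), weight the samples by the k-dependent part of
# f_{n,k}, and match weight, mean and covariance. Inputs: weights (array of
# length K), means and covs (lists of arrays of shape (N_theta,) and (N_theta, N_theta)).
def exploration_step(weights, means, covs, dt=0.5, J=1000):
    K = len(weights)
    w_hat, m_hat, C_hat = np.empty(K), [], []
    for k in range(K):
        thetas = rng.multivariate_normal(means[k], covs[k] / (1.0 - dt), size=J)
        comp = np.array([mvn.pdf(thetas, mean=means[i], cov=covs[i]) for i in range(K)])
        rho_n = weights @ comp                              # mixture density at the samples
        g = (weights[k] * comp[k] / rho_n) ** dt            # theta-dependent part of f_{n,k}
        scale = weights[k] ** (1.0 - dt) * np.linalg.det(covs[k]) ** (dt / 2.0)
        w_hat[k] = scale * g.mean()                         # Eq. (28a), up to a common factor
        gn = g / g.sum()                                    # self-normalized weights
        mk = gn @ thetas                                    # Eq. (28b)
        dk = thetas - mk
        m_hat.append(mk)
        C_hat.append(dk.T @ (dk * gn[:, None]))             # Eq. (28c)
    return w_hat / w_hat.sum(), m_hat, C_hat
```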

4.2 The Exploitation Step

The second step Eq. 13b leads to

\rho_{n+1}(\theta) \propto \hat{\rho}^{\rm GM}_{n+1}(\theta)\,e^{-\Delta t\Phi_{R}(\theta)} \propto \sum_{k=1}^{K}\hat{w}_{n+1,k}\,\mathcal{N}(\theta;\widehat{m}_{n+1,k},\widehat{C}_{n+1,k})\,e^{-\Delta t\Phi_{R}(\theta)}.

Now, the goal is to approximate the above \rho_{n+1} by a K-component Gaussian mixture \sum_{k=1}^{K}w_{n+1,k}\,\mathcal{N}(\theta;m_{n+1,k},C_{n+1,k}). We adopt the Kalman methodology described in Section 3, to update each Gaussian component individually such that

\hat{w}_{n+1,k}\,\mathcal{N}(\theta;\widehat{m}_{n+1,k},\widehat{C}_{n+1,k})\,e^{-\Delta t\Phi_{R}(\theta)}\approx w_{n+1,k}\,\mathcal{N}(\theta;m_{n+1,k},C_{n+1,k}). (29)

More precisely, following (23), for each 1\leq k\leq K, we obtain the mean and covariance updates

m_{n+1,k} = \widehat{m}_{n+1,k}+\widehat{C}_{n+1,k}^{\theta x}(\widehat{C}_{n+1,k}^{xx})^{-1}(x-\widehat{x}_{n+1,k}),
C_{n+1,k} = \widehat{C}_{n+1,k}-\widehat{C}_{n+1,k}^{\theta x}(\widehat{C}_{n+1,k}^{xx})^{-1}(\widehat{C}_{n+1,k}^{\theta x})^{T}, (30)

where

\widehat{x}_{n+1,k}=\mathbb{E}[\mathcal{F}(\theta)],\qquad\widehat{C}_{n+1,k}^{\theta x}=\mathrm{Cov}[\theta,\mathcal{F}(\theta)],\qquad\widehat{C}_{n+1,k}^{xx}=\mathrm{Cov}[\mathcal{F}(\theta)]+\frac{1}{\Delta t}\Sigma_{\nu},

with \theta\sim\mathcal{N}(\theta;\widehat{m}_{n+1,k},\widehat{C}_{n+1,k}). The weight w_{n+1,k} is estimated by matching Eq. 29 via integration:

w_{n+1,k}=\hat{w}_{n+1,k}\int\mathcal{N}(\theta;\widehat{m}_{n+1,k},\widehat{C}_{n+1,k})\,e^{-\Delta t\Phi_{R}(\theta)}\,\mathrm{d}\theta. (31)

Equations 28, 30 and 31 define our Gaussian Mixture Kalman Inversion algorithm (GMKI), which leads to an iteration of Gaussian mixture approximations of Eq. 13 without using derivatives. As mentioned in Section 3, the updates involve the Gaussian integrals in Eqs. 28, 30 and 31, which can be approximated via Monte Carlo or quadrature rules. In this paper, we use the Monte Carlo method to approximate Eq. 28; these integrals do not require forward model evaluations and so are inexpensive. Furthermore, we use the modified unscented transform detailed in [43, Definition 1] to approximate Eqs. 30 and 31, which requires (2N_{\theta}+1)K forward evaluations per iteration; these evaluations can be computed in parallel. The detailed algorithm is presented in B.
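A matching sketch of the exploitation step (30)-(31) is given below (again our own Monte Carlo illustration; the paper's experiments use the modified unscented transform instead). One GMKI iteration consists of the exploration_step sketched in Section 4.1 followed by this exploitation_step.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo sketch of the exploitation step (30)-(31): each Gaussian component
# receives a derivative-free Kalman update for the artificial inverse problem
# (16)-(17), and its weight is updated via the Gaussian integral in (31).
def exploitation_step(w_hat, m_hat, C_hat, G, y, r0, Sigma_eta, Sigma_0, dt=0.5, J=1000):
    K = len(w_hat)
    x = np.concatenate([y, r0])
    Sigma_nu = np.block([[Sigma_eta, np.zeros((y.size, r0.size))],
                         [np.zeros((r0.size, y.size)), Sigma_0]])
    L = np.linalg.cholesky(Sigma_nu)                         # for evaluating Phi_R, Eq. (19)
    w, m, C = np.empty(K), [], []
    for k in range(K):
        thetas = rng.multivariate_normal(m_hat[k], C_hat[k], size=J)
        F = np.array([np.concatenate([G(t), t]) for t in thetas])
        x_hat = F.mean(axis=0)
        dF = F - x_hat
        C_tx = (thetas - thetas.mean(axis=0)).T @ dF / (J - 1)
        C_xx = dF.T @ dF / (J - 1) + Sigma_nu / dt
        gain = np.linalg.solve(C_xx, C_tx.T).T               # C_tx C_xx^{-1}
        m.append(m_hat[k] + gain @ (x - x_hat))              # Eq. (30), mean
        C.append(C_hat[k] - gain @ C_tx.T)                   # Eq. (30), covariance
        Phi_R = 0.5 * np.sum(np.linalg.solve(L, (x - F).T) ** 2, axis=0)
        w[k] = w_hat[k] * np.mean(np.exp(-dt * Phi_R))       # Eq. (31)
    return w / w.sum(), m, C
```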

5 Theoretical Analysis

In this section, theoretical studies of our GMKI methodology are presented, through a continuous time analysis. In Section 5.1, we discuss the exploration effect of the first step Eq. 28 of our GMKI. In Section 5.2, we investigate the convergence properties of our GMKI method in scenarios where the posterior follows a Gaussian distribution, as well as in cases where it corresponds to a Gaussian mixture with well-separated Gaussian components. The analysis of the continuous time limit also allows us to connect GMKI with other variational inference approaches based on Gaussian mixtures. A schematic of properties of GMKI is also shown in Figure 1. In Section 5.3, we discuss the affine invariance property of our GMKI. We summarize the conclusions of the theoretical studies in Section 5.4.

Figure 1: Schematic of properties of GMKI. The grey curve represents the posterior distribution. Blue curves represent Gaussian components of the Gaussian mixture approximation. From left to right: Gaussian components can exhibit exponential convergence toward their respective Gaussian modes if these modes are well separated (see Proposition 7); the repulsion between distinct Gaussian components in the iteration of GMKI helps explore the space and capture multiple modes (see Section 5.1); when multiple Gaussian components converge towards a single Gaussian mode in the posterior distribution, they can provide a good approximation of that mode (see Proposition 6); GMKI can capture multiple modes even when these modes are intertwined (see numerical examples in Section 6.1).

5.1 Exploration Effect

In the derivation of GMKI, the first step Eq. 28 is designed to approximate the exploration phase of the Fisher-Rao gradient flow, Eq. 13a. In this subsection, we investigate whether the exploration effect persists under the approximation made by GMKI. In fact, Eq. 28 tends to expand the distribution for exploration in two ways: (1) through repulsion between Gaussian components; and (2) through an increase of the entropy. The repulsion effect can be understood through the following continuous time limit analysis. In this analysis we abuse notation, replacing the subscript n in the discrete iterations by t, which equals n\Delta t, in the continuous limit of the means, covariances and weights of the Gaussian mixture.

Proposition 3.

The continuous time limit (\Delta t\rightarrow 0) of the exploration step Eq. 28 is

\dot{m}_{t,k} = -\int\mathcal{N}(\theta;m_{t,k},C_{t,k})(\theta-m_{t,k})\log\rho_{t}(\theta)\,\mathrm{d}\theta, (32a)
\dot{C}_{t,k} = -\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\bigl((\theta-m_{t,k})(\theta-m_{t,k})^{T}-C_{t,k}\bigr)\log\rho_{t}(\theta)\,\mathrm{d}\theta, (32b)
\dot{w}_{t,k} = -w_{t,k}\int\bigl[\mathcal{N}(\theta;m_{t,k},C_{t,k})-\rho_{t}(\theta)\bigr]\log\rho_{t}(\theta)\,\mathrm{d}\theta. (32c)

Here \rho_{t}(\theta)=\sum_{k=1}^{K}w_{t,k}\,\mathcal{N}(\theta;m_{t,k},C_{t,k}).

The proof is in C.1. The continuous time limit of the evolution equation of the mean, Eq. 32a, suggests that, if K\geq 2, m_{t,k} will move in the direction where \rho_{t} is small. Indeed, consider a specific scenario where N_{\theta}=1, K=2 and m_{t,1}<m_{t,2}. In this case, we have that

\begin{split}\dot{m}_{t,1}&=-\int(\theta-m_{t,1})\mathcal{N}(\theta;m_{t,1},C_{t,1})\log\rho_{t}(\theta)\,\mathrm{d}\theta\\ &=-\int_{\theta>m_{t,1}\,\cup\,\theta<m_{t,1}}(\theta-m_{t,1})\mathcal{N}(\theta;m_{t,1},C_{t,1})\log\rho_{t}(\theta)\,\mathrm{d}\theta\\ &=-\int_{\theta>m_{t,1}}(\theta-m_{t,1})\mathcal{N}(\theta;m_{t,1},C_{t,1})\bigl(\log\rho_{t}(\theta)-\log\rho_{t}(2m_{t,1}-\theta)\bigr)\,\mathrm{d}\theta\\ &<0,\end{split} (33)

where the third equality results from the change of variable \theta\rightarrow 2m_{t,1}-\theta, and the last inequality is due to the fact that

\log\rho_{t}(\theta)-\log\rho_{t}(2m_{t,1}-\theta)=\log\frac{w_{t,1}\mathcal{N}(\theta;m_{t,1},C_{t,1})+w_{t,2}\mathcal{N}(\theta;m_{t,2},C_{t,2})}{w_{t,1}\mathcal{N}(\theta;m_{t,1},C_{t,1})+w_{t,2}\mathcal{N}(2m_{t,1}-\theta;m_{t,2},C_{t,2})}.

The right hand side is non-negative since, for \theta>m_{t,1}, we have |m_{t,2}-2m_{t,1}+\theta|=(m_{t,2}-m_{t,1})+(\theta-m_{t,1})>|m_{t,2}-\theta| and hence \mathcal{N}(2m_{t,1}-\theta;m_{t,2},C_{t,2})<\mathcal{N}(\theta;m_{t,2},C_{t,2}). Similarly, we can establish that \dot{m}_{t,2}>0. Hence the two Gaussian means repel each other.

We can also understand the exploration effect through the increase of the entropy; see Proposition 4. The proof can be found in C.2.

Proposition 4.

The entropy of the Gaussian mixture

\rho_{t}(\theta)=\sum_{k=1}^{K}w_{t,k}\,\mathcal{N}(\theta;m_{t,k},C_{t,k})

obtained from Eq. 32 is non-decreasing; indeed:

\frac{\mathrm{d}}{\mathrm{d}t}\int-\rho_{t}\log\rho_{t}\,\mathrm{d}\theta=\sum_{k=1}^{K}\Bigl(\frac{\dot{w}_{t,k}^{2}}{w_{t,k}}+w_{t,k}\,\dot{m}_{t,k}^{T}C_{t,k}^{-1}\dot{m}_{t,k}+\frac{w_{t,k}}{2}{\rm tr}[\dot{C}_{t,k}^{T}C_{t,k}^{-1}\dot{C}_{t,k}C_{t,k}^{-1}]\Bigr)\geq 0. (34)

5.2 Convergence Analysis

To provide insights for the convergence of GMKI, we consider its continuous limit in time. Similar to Eq. 32, the continuous time limit of our GMKI is given in Proposition 5. The proof is in C.3.

Proposition 5.

The continuous time limit (\Delta t\rightarrow 0) of the proposed GMKI defines the evolving Gaussian mixture measure

\rho_{t}(\theta)=\sum_{k=1}^{K}w_{t,k}\,\mathcal{N}(\theta;m_{t,k},C_{t,k})

where

\dot{m}_{t,k} = -C_{t,k}\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\nabla_{\theta}\log\rho_{t}(\theta)\,\mathrm{d}\theta+\widehat{C}_{t,k}^{\theta x}\Sigma_{\nu}^{-1}(x-\widehat{x}_{t,k}), (35a)
\dot{C}_{t,k} = -C_{t,k}\Bigl(\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\nabla_{\theta}\nabla_{\theta}\log\rho_{t}(\theta)\,\mathrm{d}\theta\Bigr)C_{t,k}-\widehat{C}_{t,k}^{\theta x}\Sigma_{\nu}^{-1}(\widehat{C}_{t,k}^{\theta x})^{T}, (35b)
\dot{w}_{t,k} = -w_{t,k}\int\bigl[\mathcal{N}(\theta;m_{t,k},C_{t,k})-\rho_{t}(\theta)\bigr]\bigl[\log\rho_{t}(\theta)+\Phi_{R}(\theta)\bigr]\,\mathrm{d}\theta. (35c)

Here

\widehat{x}_{t,k}=\mathbb{E}[\mathcal{F}(\theta)],\qquad\widehat{C}_{t,k}^{\theta x}=\mathrm{Cov}[\theta,\mathcal{F}(\theta)],\quad\textrm{with}\quad\theta\sim\mathcal{N}(m_{t,k},C_{t,k}). (36)
Remark 2.

We can also connect the continuous limit in Proposition 5 with the classical variational inference approach based on Gaussian mixtures. In fact, Eq. 35 can be obtained by combining natural gradient methods in the Gaussian mixture context and derivative-free Kalman approximation (similar to Remark 1); see C.8. \lozenge

To gain insight into the convergence properties of the continuous flow, we first study Eq. 35 for the Gaussian posterior case; the proof can be found in C.4.

Proposition 6 (Linear inverse problems).

Assume \mathcal{G}(\theta)=G\cdot\theta is linear, and the posterior is Gaussian with the form

\Phi_{R}(\theta)=\frac{1}{2}(\theta-m_{\rm post})^{T}C^{-1}_{\rm post}(\theta-m_{\rm post}).

Then, the KL divergence between the Gaussian mixture

\rho_{t}(\theta)=\sum_{k=1}^{K}w_{t,k}\,\mathcal{N}(\theta;m_{t,k},C_{t,k})

obtained from Eq. 35 and \rho_{\rm post} is non-increasing:

\frac{\mathrm{d}}{\mathrm{d}t}\mathrm{KL}[\rho_{t}\|\rho_{\rm post}]=-\sum_{k=1}^{K}\Bigl(\frac{\dot{w}_{t,k}^{2}}{w_{t,k}}+w_{t,k}\,\dot{m}_{t,k}^{T}C_{\rm post}^{-1}\dot{m}_{t,k}+\frac{w_{t,k}}{2}{\rm tr}[\dot{C}_{t,k}^{T}C_{t,k}^{-1}\dot{C}_{t,k}C_{t,k}^{-1}]\Bigr)\leq 0. (37)

Furthermore, the mean and Fisher information matrix of any stationary point \rho_{\infty}(\theta)=\sum_{k}w_{\infty,k}\,\mathcal{N}(\theta;m_{\infty,k},C_{\infty,k}) satisfy

\sum_{k=1}^{K}w_{\infty,k}m_{\infty,k}=m_{\rm post},\qquad{\rm FIM}[\rho_{\infty}]=\int\frac{\nabla_{\theta}\rho_{\infty}\nabla_{\theta}\rho_{\infty}^{T}}{\rho_{\infty}}\,\mathrm{d}\theta=C_{\rm post}^{-1}. (38)
Remark 3.

Proposition 6 shows that, if the posterior is Gaussian, then the KL divergence of the GMKI approximation is non-increasing in time. Furthermore, the Gaussian mixture converges to a distribution \rho_{\infty} from which the correct Gaussian statistics can be extracted. Nevertheless, from our current proof, it is not yet known whether \rho_{\infty} converges to \rho_{\rm post}. We leave this question for future study. \lozenge

Finally, we provide some formal analysis of the convergence of our GMKI in scenarios where the posterior distribution is close to a Gaussian mixture with the same number of components, namely

\rho_{\rm post}(\theta)\propto\exp(-\Phi_{R}(\theta))\underset{\sim}{\propto}\sum_{k=1}^{K}w_{k}^{*}\,\mathcal{N}(\theta;m_{k}^{*},C_{k}^{*})\quad\textrm{with}\quad\inf_{1\leq k\leq K}w_{k}^{*}>0.

For simplicity, we assume these Gaussian components are well separated. It is technical to give a precise definition of the well-separatedness of different Gaussian components; our argument here is purely formal and serves to provide insights into the behavior of GMKI. Suppose the k-th Gaussian component \mathcal{N}(\theta;m_{k}(0),C_{k}(0)) in GMKI is close to its corresponding mode (e.g., the k-th mode) of \rho_{\rm post} while being well separated from the other Gaussian components. In this case, we may simplify the continuous time limit Eq. 35 by neglecting the interaction between different Gaussian components. The simplified continuous time dynamics and its properties are presented in Proposition 7, with derivations in C.5.

Proposition 7.

Consider the simplified continuous time dynamics

\dot{m}_{t,k} = C_{t,k}(C_{k}^{*})^{-1}(m_{k}^{*}-m_{t,k}), (39a)
\dot{C}_{t,k} = C_{t,k}-C_{t,k}(C_{k}^{*})^{-1}C_{t,k}, (39b)
\dot{w}_{t,k} = w_{t,k}\bigl(\log w_{k}^{*}-\log w_{t,k}-\sum_{i=1}^{K}w_{t,i}(\log w_{i}^{*}-\log w_{t,i})\bigr). (39c)

Then, m_{t,k}, C_{t,k}, and w_{t,k} converge to m_{k}^{*}, C_{k}^{*}, and w_{k}^{*} exponentially fast.

The proof of Proposition 7 can be found in C.6.
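The convergence stated in Proposition 7 can also be observed numerically; the sketch below integrates Eq. 39 with forward Euler for a one-dimensional, two-component example with hypothetical targets m_{k}^{*}, C_{k}^{*}, w_{k}^{*} (our own illustration).

```python
import numpy as np

# Forward-Euler integration of the simplified dynamics (39) for a 1D,
# two-component example with hypothetical targets (m*, C*, w*).
m_star, C_star, w_star = np.array([-1.0, 1.0]), np.array([0.3, 0.5]), np.array([0.7, 0.3])
m, C, w = np.array([0.5, 2.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

dt = 0.01
for n in range(2000):                                     # integrate up to t = 20
    m = m + dt * (C / C_star) * (m_star - m)              # Eq. (39a)
    C = C + dt * (C - C**2 / C_star)                      # Eq. (39b)
    drift = np.log(w_star) - np.log(w)
    w = w + dt * w * (drift - np.sum(w * drift))          # Eq. (39c)

print(np.abs(m - m_star).max(), np.abs(C - C_star).max(), np.abs(w - w_star).max())
# all errors are tiny, consistent with exponential convergence to (m*, C*, w*)
```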

5.3 Affine Invariance

Sampling methods are said to be affine invariant if the algorithm is unchanged under any invertible affine mapping. It is a consequence of such a property that convergence rates across all Gaussians with positive covariance are the same. More generally affine invariant algorithms can be highly effective for highly anisotropic distributions [88, 89]. This effectiveness stems from their consistent behavior across all coordinate systems related by affine transformations. Specifically, the convergence properties of these methods can be understood by examining the optimal coordinate system, which minimizes anisotropy to the fullest extent, across all affine transformations. With this in mind, the following is of interest in relation to GMKI:

Proposition 8.

The continuous time limit in Proposition 5 is affine invariant. Specifically, for any invertible affine mapping \varphi:\theta\rightarrow\widetilde{\theta}=A\theta+b, define the corresponding scalar, vector, and matrix transformations

\widetilde{w}_{t,k}=w_{t,k},\quad\widetilde{m}_{t,k}=Am_{t,k}+b,\quad\widetilde{C}_{t,k}=AC_{t,k}A^{T},

and function transformations

\widetilde{\mathcal{F}}(\widetilde{\theta})=\mathcal{F}(A^{-1}(\widetilde{\theta}-b)),\qquad\widetilde{\Phi}_{R}(\widetilde{\theta})=\Phi_{R}(A^{-1}(\widetilde{\theta}-b)).

Then the evolution equations of \widetilde{w}_{t,k}, \widetilde{m}_{t,k} and \widetilde{C}_{t,k} remain the same, retaining the structure of the GMKI evolution in Eq. 35.

The proof of Proposition 8 can be found in C.7.

5.4 Summary of Theoretical Analyses

Combining insights from Propositions 3, 6, 7 and 8, we anticipate that the repulsion between distinct Gaussian components will enable the proposed methodology GMKI to capture possible modes of the posterior distribution more effectively. Moreover, the algorithm will converge efficiently to modes of the posterior when these modes are well separated. Finally, we note that the affine invariance property of the GMKI shows that it will be effective for the approximation of certain highly anisotropic distributions.

6 Numerical Study

In this section, we present numerical studies regarding the proposed GMKI algorithm. We focus on posterior distributions of unknown parameters or fields arising in inverse problems that may exhibit multiple modes. Three types of model problems are considered:

  1. A one-dimensional bimodal problem: we use this problem as a proof-of-concept example. Our results demonstrate that the convergence rate remains unchanged no matter how much the two modes overlap; this indicates that the convergence rate is independent of the potential barrier between the modes.

  2. A two-dimensional bimodal problem: we use this benchmark problem, introduced in [65, 90], to compare GMKI with other sampling methods such as Langevin dynamics and the birth-death process [24]. Our results show that GMKI is not only more accurate but also more cost-effective for this specific problem.

  3. A high-dimensional bimodal problem: we consider the inverse problem of recovering the initial velocity field of a Navier-Stokes flow. The problem is designed to have a symmetry which induces two modes in the posterior. We show that GMKI can capture both modes efficiently; this indicates GMKI's potential for addressing multimodal problems in large-scale and high-dimensional applications.

Regarding the parameters of the GMKI algorithm (see B), we specify the number of mixture components K for each problem below. For all experiments, we use the time-step size \Delta t=0.5 and adopt J=1000 points for the Monte Carlo estimation of Eq. 28.

6.1 One-Dimensional Bimodal Problem

We first consider the following 1D bimodal inverse problem, associated with a forward model

y=\mathcal{G}(\theta)+\eta \quad\textrm{ with }\quad y=1,\quad\mathcal{G}(\theta)=\theta^{2}.

We assume the prior is \rho_{\rm prior}\sim\mathcal{N}(3,2^{2}) and consider different noise levels:

\textrm{Case A:}\quad\eta\sim\mathcal{N}(0,0.2^{2});
\textrm{Case B:}\quad\eta\sim\mathcal{N}(0,0.5^{2});
\textrm{Case C:}\quad\eta\sim\mathcal{N}(0,1.0^{2});
\textrm{Case D:}\quad\eta\sim\mathcal{N}(0,1.5^{2}).

Note that the overlap between the two modes increases with the noise strength. For case A, the two modes are well separated, while for case D, the two modes almost merge.

Figure 2: The one-dimensional bimodal problem with Σ_η values of 0.2² (top), 0.5² (top middle), 1.0² (bottom middle), and 1.5² (bottom). The first three panels of each row display the reference posterior distribution (grey square lines) and the posterior distributions estimated by GMKI (blue lines) at the 30th iteration with mode number K=1, 2, 3 (from left to right); the mean m_k (colored) and initial mean (black) of each Gaussian component are marked. The fourth panel shows the total variation distance between the reference posterior distribution and the posterior distributions estimated by GMKI with mode number K=1, 2, 3.

We apply GMKI with K=1, 2, and 3 modes, which are randomly initialized according to the prior distribution; we assign them equal weights. In each iteration, the GMKI algorithm requires 3, 6, and 9 forward evaluations, respectively. The reference posterior distribution is obtained by evaluating the unnormalized posterior on a uniform grid and then normalizing it.
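To make this setup concrete, the following is a minimal sketch (in Python with NumPy; the grid limits, resolution, and the hand-picked mixture are illustrative choices, not values from our experiments) of how the reference posterior for this 1D problem can be tabulated on a uniform grid and compared against a Gaussian mixture approximation via the total variation distance used in Fig. 2.

```python
import numpy as np

# Forward model and data for the 1D bimodal problem: y = G(theta) + eta, G(theta) = theta^2, y = 1.
y_obs = 1.0
G = lambda theta: theta**2

# Prior N(3, 2^2) and, e.g., Case A noise N(0, 0.2^2).
m_prior, s_prior = 3.0, 2.0
s_eta = 0.2

# Uniform grid (limits and resolution are illustrative choices).
theta = np.linspace(-6.0, 10.0, 4001)
dtheta = theta[1] - theta[0]

# Unnormalized log posterior = log likelihood + log prior.
log_post = -0.5 * ((y_obs - G(theta)) / s_eta) ** 2 \
           - 0.5 * ((theta - m_prior) / s_prior) ** 2
rho_ref = np.exp(log_post - log_post.max())
rho_ref /= rho_ref.sum() * dtheta          # normalized reference density on the grid

def gaussian_mixture(theta, w, m, s):
    """Density of a 1D Gaussian mixture with weights w, means m, standard deviations s."""
    w, m, s = map(np.asarray, (w, m, s))
    comps = np.exp(-0.5 * ((theta[:, None] - m) / s) ** 2) / (np.sqrt(2 * np.pi) * s)
    return comps @ w

def tv_distance(p, q, dtheta):
    """Total variation distance between two densities tabulated on the same grid."""
    return 0.5 * np.sum(np.abs(p - q)) * dtheta

# Example: a hand-picked two-component approximation placed near the two posterior modes at +-1.
rho_gm = gaussian_mixture(theta, w=[0.5, 0.5], m=[-1.0, 1.0], s=[0.1, 0.1])
print("TV(reference, mixture) =", tv_distance(rho_ref, rho_gm, dtheta))
```

The same grid-based reference and TV metric underlie the comparisons reported below.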

The results for the different cases are reported in Fig. 2. Each row first shows the reference posterior and the posteriors approximated by GMKI at the 30th iteration, using mode numbers K=1, 2, 3 from left to right; the fourth panel shows the convergence in terms of the total variation (TV) distance. When K=1, we can only capture one mode, and this mode cannot be weighted correctly since it carries weight one by construction; when K=2 or 3, we capture both modes. It is worth mentioning that GMKI converges in fewer than 30 iterations, and the convergence behavior appears independent of the potential barrier. For case A, where the two modes are well separated, the posteriors approximated by GMKI with K=2 or 3 match the reference very well. This observation supports our formal analysis in Proposition 7. For cases B, C, and D, GMKI still captures both modes; however, some discrepancies arise in the region where the modes overlap. These discrepancies persist when increasing the mode number in GMKI from K=2 to K=3.

6.2 Two-Dimensional Bimodal Problem

In this subsection, we consider the 2D bimodal inverse problem from [91, 90], associated with the forward model

y=𝒢(θ)+η with y=4.2297,𝒢(θ)=(θ(1)θ(2))2.y=\mathcal{G}(\theta)+\eta\quad\textrm{ with }\quad y=4.2297,\quad\mathcal{G}(\theta)=(\theta_{(1)}-\theta_{(2)})^{2}.

Here θ=[θ(1),θ(2)]T\theta=[\theta_{(1)},\theta_{(2)}]^{T}. We assume the noise distribution is η𝒩(0,I)\eta\sim\mathcal{N}(0,I) and consider two different prior distributions:

Case A:ρprior𝒩(0,I);\displaystyle\textrm{Case A:}\quad\rho_{\rm prior}\sim\mathcal{N}(0,I);
Case B:ρprior𝒩([0.5,0]T,I).\displaystyle\textrm{Case B:}\quad\rho_{\rm prior}\sim\mathcal{N}([0.5,0]^{T},I).

For case A, the two modes are symmetric with respect to the line θ(1)θ(2)=0\theta_{(1)}-\theta_{(2)}=0, while for case B, the two modes are not symmetric.

We apply GMKI with K=3 modes, which are randomly initialized based on the prior distribution and assigned equal weights. In each iteration, the algorithm requires (2N_θ+1)K=15 forward evaluations. The reference posterior distribution is obtained by evaluating the unnormalized posterior on a uniform grid and then normalizing it. For both case A and case B, the estimated posterior distributions obtained by GMKI are presented in Figs. 3 and 4. We observe close agreement with the reference: in GMKI, mode 1 converges to one target mode, while modes 2 and 3 converge to the other. Moreover, the evolution of the total variation distance indicates rapid convergence.

Figure 3: Two-dimensional bimodal problem with ρprior𝒩(0,I)\rho_{\rm prior}\sim\mathcal{N}(0,I). From left to right: reference posterior distribution (left), posterior distributions estimated by 3-modal GMKI (middle left) at the 30th iteration (means mkm_{k} (colored) and initial means (black) are marked), BDLS-KL [24] (middle right) at the 1000th iteration, and total variation distance between the reference posterior distribution and the posterior distributions estimated by the GMKI (right).
Figure 4: Two-dimensional bimodal problem with ρprior𝒩([0.5,0]T,I)\rho_{\rm prior}\sim\mathcal{N}([0.5,0]^{T},I). From left to right: reference posterior distribution (left), posterior distributions estimated by 3-modal GMKI (middle left) at the 30th iteration (means mkm_{k} (colored) and initial means (black) are marked), BDLS-KL [24] (middle right) at the 1000th iteration, and total variation distance between the reference posterior distribution and the posterior distributions estimated by the GMKI (right).

As a comparison, the derivative-free affine-invariant Langevin dynamics (dfALDI) [64, 65] and derivative-free Bayesian inversion using multiscale dynamics [90] with 10^6 iterations have been used to sample the posterior distribution for case A. Their results are reported in [90, Figure 5]: dfALDI fails, while the multiscale dynamics captures both modes but with incorrect weights. The local preconditioner introduced in [91], which employs a local empirical covariance with distance-dependent weights, gives rise to an alternative derivative-free affine-invariant Langevin dynamics sampling approach; that approach notably enhances the sampling results in these scenarios. Moreover, for both case A and case B, we apply the BDLS-KL algorithm proposed in [24], a gradient-based sampler relying on birth-death dynamics with kernel density estimators to approximate the Wasserstein-Fisher-Rao gradient flow. For the implementation of BDLS-KL, we use Δt=10^{-2}, T=10, ensemble size J=10^3, and the RBF kernel k(x,x^{\prime})=\exp(-\frac{1}{h}\lVert x-x^{\prime}\rVert^{2}) with the bandwidth h=med²/log J adopted from [92]; here med² is the squared median of the pairwise Euclidean distances between the current particles. The results are shown in Figs. 3 and 4. For both cases, GMKI is not only more cost-effective but also more accurate compared to these existing approaches. Moreover, to study the behavior of GMKI on problems with more modes, a two-dimensional four-modal problem is presented in D, leading to a similar conclusion.
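For concreteness, here is a minimal sketch (in Python with NumPy; the particles are synthetic, and excluding the zero self-distances when taking the median is our implementation choice) of the median-heuristic bandwidth h = med²/log J used above for the RBF kernel.

```python
import numpy as np

def rbf_median_bandwidth(particles):
    """Median-heuristic bandwidth h = med^2 / log J for the RBF kernel k(x, x') = exp(-||x - x'||^2 / h)."""
    J = particles.shape[0]
    diffs = particles[:, None, :] - particles[None, :, :]
    dists = np.sqrt(np.sum(diffs**2, axis=-1))
    med = np.median(dists[np.triu_indices(J, k=1)])   # median of pairwise distances (self-distances excluded)
    return med**2 / np.log(J)

# Synthetic example with J = 1000 particles in 2D.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
h = rbf_median_bandwidth(X)
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1) / h)   # kernel (Gram) matrix
print("bandwidth h =", h, " kernel matrix shape:", K.shape)
```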

6.3 High-Dimensional Bimodal Problem: Navier-Stokes Problem

Finally, we study the problem of recovering the initial vorticity field ω0\omega_{0} of a fluid flow from measurements at later times. The flow is described by the 2D Navier-Stokes equation on a periodic domain D=[0,2π]×[0,2π]D=[0,2\pi]\times[0,2\pi], which can be written in the vorticity-streamfunction ωψ\omega-\psi formulation:

ωt+(v)ωνΔω=×f,ω=Δψ14π2ψ=0v=[ψx2,ψx1]T+vb.\begin{split}&\frac{\partial\omega}{\partial t}+(v\cdot\nabla)\omega-\nu\Delta\omega=\nabla\times f,\\ &\omega=-\Delta\psi\qquad\frac{1}{4\pi^{2}}\int\psi=0\qquad v=[\frac{\partial\psi}{\partial x_{2}},-\frac{\partial\psi}{\partial x_{1}}]^{T}+v_{b}.\end{split} (40)

Here vv denotes the velocity vector, ν=0.01\nu=0.01 denotes the viscosity, vb=[0,2π]Tv_{b}=[0,2\pi]^{T} denotes the non-zero mean background velocity, and f(x)=[0,cos(4x(1))]Tf(x)=[0,\cos(4x_{(1)})]^{T} denotes the external forcing.

The problem is built to be spatially symmetric with respect to x_{(1)}=π. The external forcing is chosen such that

×f([x(1),x(2)]T)=×f([2πx(1),x(2)]T).\displaystyle\nabla\times f([x_{(1)},x_{(2)}]^{T})=-\nabla\times f([2\pi-x_{(1)},x_{(2)}]^{T}).

The observations in the inverse problem are chosen as the difference of pointwise measurements of the vorticity value ω()\omega(\cdot)

ω([x(1),x(2)]T)ω([2πx(1),x(2)]T)\omega([x_{(1)},x_{(2)}]^{T})-\omega([2\pi-x_{(1)},x_{(2)}]^{T})

at 56 equidistant points in the left half of the domain (see Figure 5), at T=0.25 and T=0.5, corrupted with observation error η∼N(0, 0.1²I). Under this set-up, both the initial condition ω_0([x_{(1)},x_{(2)}]^T) and the mirrored field −ω_0([2π−x_{(1)},x_{(2)}]^T) lead to the same measurements. Thus the inverse problem is at least bimodal.

We assume the prior of ω0(x,θ)\omega_{0}(x,\theta) is a Gaussian field with covariance operator C=(Δ)2C=(-\Delta)^{-2}, subject to periodic boundary conditions, on the space of mean zero functions. The corresponding KL expansion of the initial vorticity field is given by

\omega_{0}(x,\theta)=\sum_{l\in\mathbb{L}}\theta^{c}_{(l)}\sqrt{\lambda_{l}}\psi^{c}_{l}(x)+\theta^{s}_{(l)}\sqrt{\lambda_{l}}\psi^{s}_{l}(x), (41)

where 𝕃={(lx,ly)|lx+ly>0 or (lx+ly=0 and lx>0)}\mathbb{L}=\{(l_{x},l_{y})|l_{x}+l_{y}>0\textrm{ or }(l_{x}+l_{y}=0\textrm{ and }l_{x}>0)\}, and the eigenpairs are of the form

ψlc(x)=cos(lx)2πψls(x)=sin(lx)2πλl=1|l|4,\psi^{c}_{l}(x)=\frac{\cos(l\cdot x)}{\sqrt{2}\pi}\quad\psi^{s}_{l}(x)=\frac{\sin(l\cdot x)}{\sqrt{2}\pi}\quad\lambda_{l}=\frac{1}{|l|^{4}},

and θ(l)c,θ(l)s𝒩(0,2π2)\theta^{c}_{(l)},\theta^{s}_{(l)}\sim\mathcal{N}(0,2\pi^{2}). The KL expansion Eq. 41 can be rewritten as a sum over positive integers rather than a lattice:

ω0(x,θ)=l=1θ(l)λlψl(x),\omega_{0}(x,\theta)=\sum_{l=1}^{\infty}\theta_{(l)}\sqrt{\lambda_{l}}\psi_{l}(x), (42)

where the eigenvalues λ_l are arranged in descending order. We truncate the expansion to the first 128 terms and generate the true vorticity field ω_0(x;θ_ref) with θ_ref∈R^{128}; we aim to recover this parameter from the observation data.
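As an illustration of this prior, the following is a minimal sketch (in Python with NumPy; the grid resolution, the finite lattice window, and the ordering of modes are illustrative choices, not the settings used in our experiments) of drawing one sample of ω_0 from the truncated KL expansion Eq. 41 on a periodic grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Periodic grid on D = [0, 2*pi]^2 (resolution is an illustrative choice).
n = 64
x1, x2 = np.meshgrid(np.linspace(0, 2*np.pi, n, endpoint=False),
                     np.linspace(0, 2*np.pi, n, endpoint=False), indexing="ij")

# Half-lattice L = {(lx, ly): lx + ly > 0, or lx + ly = 0 and lx > 0}, restricted to a finite window.
lat = [(lx, ly) for lx in range(-8, 9) for ly in range(-8, 9)
       if (lx + ly > 0) or (lx + ly == 0 and lx > 0)]
lat.sort(key=lambda l: l[0]**2 + l[1]**2)        # sort so the eigenvalues 1/|l|^4 are descending
lat = lat[:64]                                    # keep 64 lattice modes, i.e. 128 KL terms

omega0 = np.zeros_like(x1)
for (lx, ly) in lat:
    lam = 1.0 / (lx**2 + ly**2)**2                                   # eigenvalue 1/|l|^4
    phase = lx * x1 + ly * x2
    theta_c, theta_s = rng.normal(0.0, np.sqrt(2) * np.pi, size=2)   # coefficients ~ N(0, 2*pi^2)
    omega0 += np.sqrt(lam) * (theta_c * np.cos(phase) + theta_s * np.sin(phase)) / (np.sqrt(2) * np.pi)

print("sample initial vorticity field, shape:", omega0.shape, " std:", omega0.std())
```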

Figure 5: The vorticity field ω at T=0.25 and T=0.5, and the observations ω([x_{(1)},x_{(2)}]^T)−ω([2π−x_{(1)},x_{(2)}]^T) at 56 equidistant points (solid black dots). Their mirrored points are marked (empty black dots).

We employ GMKI with K=3 modes, which are randomly initialized based on the prior distribution with equal weights. Since we have 3 modes and N_θ=128, each iteration requires (2N_θ+1)K=771 forward evaluations. We depict the true initial vorticity field ω_0(x;θ_ref), its mirrored field (the mirroring of the velocity field induces the antisymmetry in the vorticity field), and the three recovered initial vorticity fields ω_0(x;m_k) obtained by GMKI at the 50th iteration in Fig. 6. Mode 1 captures the mirrored field of ω_0(x;θ_ref), while modes 2 and 3 capture ω_0(x;θ_ref). Figure 7 presents the relative errors of the vorticity field, the optimization errors Φ_R(m_{n,k}), the Frobenius norms ∥C_{n,k}∥_F, and the Gaussian mixture weights w_{n,k} (from left to right). It shows that GMKI converges in fewer than 50 iterations. Figure 8 displays the marginals of the estimated posterior distributions associated with the first 16 coefficients θ_(i) obtained by GMKI. The marginal distributions exhibit clear bimodality, and the approximate posteriors place high probability on the true coefficients.

Figure 6: The true initial vorticity field ω_0(x;θ_ref), its mirrored field, and the recovered initial vorticity fields ω_0(x;m_k) obtained by GMKI.
Figure 7: Navier-Stokes flow problem: the relative errors of the initial vorticity field, the optimization errors ΦR(mn,k)\Phi_{R}(m_{n,k}), the Frobenius norm Cn,kF\lVert C_{n,k}\rVert_{F}, and the Gaussian mixture weights wn,kw_{n,k} (from left to right) for different modes.
Figure 8: Navier-Stokes flow problem: the true Karhunen-Loève expansion parameters θ_(i) (black crosses), the mean estimates of θ_(i) for each mode (circles), and the associated marginal distributions obtained by GMKI at the 50th iteration.

7 Conclusion

In this paper, we have presented a new framework for solving Bayesian inverse problems. The framework is based on the Fisher-Rao gradient flow. Within this framework, we introduce a novel approach, Gaussian mixture Kalman inversion (GMKI), which leverages Gaussian mixtures and the Kalman methodology for numerical approximation of the flow. GMKI is particularly useful when the posterior distribution has multiple modes and when the derivative of the forward model is unavailable or computationally expensive.

We derive the continuous time dynamics of GMKI, show its connection to Gaussian mixture variational inference, and study its exploration effects and convergence properties. Our numerical experiments showcase GMKI's capability in approximating posterior distributions with multiple modes; GMKI outperforms many existing Bayesian inference methods in terms of efficiency and accuracy.

There are numerous avenues for future research. On the algorithmic side, it is of interest to refine the approximations employed by GMKI in regions where the mixture components overlap significantly; see the experimental results in Section 6.1. Moreover, although the Kalman methodology achieves a derivative-free implementation, it may suffer from degeneracy issues when the modes of the distribution concentrate on a low-dimensional manifold; for a demonstration see E. Therefore, improving the derivative-free methodology in such scenarios is important for enhancing GMKI. On the theoretical side, a thorough analysis of the convergence of GMKI for general target distributions could offer valuable insights for its practical application.

Acknowledgments

YC is supported by the Courant Instructorship. DZH is supported by the Fundamental Research Funds for the Central Universities and the high-performance computing platform of Peking University. JH is supported by NSF grant DMS-2331096 and the Sloan research fellowship. DZH and AMS are supported by NSF award AGS1835860 and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program. AMS is supported by a Department of Defense Vannevar Bush Faculty Fellowship and by the SciAI Center, funded by the Office of Naval Research (ONR), under Grant Number N00014-23-1-2729. SR is supported by Deutsche Forschungsgemeinschaft (DFG) - Project-ID 318763901 - SFB1294. We also thank the anonymous reviewers for their helpful suggestions.

Appendix A Convergence of the Exploration-Exploitation Scheme for the Fisher-Rao Gradient Flow

Proof of Proposition 2.

The Fisher-Rao gradient flow Eq. 3 has the analytical solution

ρt(θ)=1Ztρ0(θ)etρpost(θ)1et,ρt(θ)ρpost(θ)=1Zt(ρ0(θ)ρpost(θ))et.\displaystyle\rho_{t}(\theta)=\frac{1}{Z_{t}}\rho_{0}(\theta)^{e^{-t}}\rho_{\rm post}(\theta)^{1-e^{-t}},\quad\frac{\rho_{t}(\theta)}{\rho_{\rm post}(\theta)}=\frac{1}{Z_{t}}\left(\frac{\rho_{0}(\theta)}{\rho_{\rm post}(\theta)}\right)^{e^{-t}}. (43)

where ZtZ_{t} is the normalization constant. The update Eq. 13 leads to

ρn(θ)=1Znρ0(θ)(1Δt)nρpost(θ)1(1Δt)n,ρn(θ)ρpost(θ)=1Zn(ρ0(θ)ρpost(θ))(1Δt)n,\displaystyle\rho_{n}(\theta)=\frac{1}{Z_{n}}\rho_{0}(\theta)^{(1-\Delta t)^{n}}\rho_{\rm post}(\theta)^{1-(1-\Delta t)^{n}},\quad\frac{\rho_{n}(\theta)}{\rho_{\rm post}(\theta)}=\frac{1}{Z_{n}}\left(\frac{\rho_{0}(\theta)}{\rho_{\rm post}(\theta)}\right)^{(1-\Delta t)^{n}}, (44)

where Z_n is the normalization constant. By comparing Eq. 44 and Eq. 43, we have

ρn(θ)=ρt(θ)fort=nlog(1Δt).\rho_{n}(\theta)=\rho_{t}(\theta)\quad\text{for}\quad t=-n\log(1-\Delta t).

Bringing t=nlog(1Δt)t=-n\log(1-\Delta t) into Eq. 9 leads to

KL[ρnρpost]=KL[ρnlog(1Δt)ρ](2+B+eB)K(1Δt)n,\displaystyle\mathrm{KL}[\rho_{n}\|\rho_{\rm post}]=\mathrm{KL}[\rho_{-n\log(1-\Delta t)}\|\rho]\leq(2+B+eB)K(1-\Delta t)^{n}, (45)

when nlog(1Δt)log((1+B)K)-n\log(1-\Delta t)\geq\log((1+B)K). ∎

Appendix B Gaussian Mixture Kalman Inversion Algorithm

The detailed algorithm is presented in Algorithm 1. There are four hyperparameters: the number of mixtures K, the time-step size Δt, the number of particles J for Monte Carlo integration, and the number of iterations N. Increasing K enhances the expressiveness of the Gaussian mixture model but also increases the computational cost. Therefore, we recommend choosing K based on the available computational resources and prior knowledge of the number of target modes; once all modes are captured, there is no need to increase K further. A larger J improves the accuracy of the integration approximations, but it also increases the computational cost. In the experiments, we adopt J=1000. The time step Δt lies within the range (0,1). While a larger Δt accelerates convergence, it may also cause numerical instability; in practice, we found that Δt=0.5 strikes a good balance between stability and efficiency. All of these parameters could be chosen adaptively, which we plan to explore in future work. Moreover, in our implementation we store log w instead of w, and hence w never actually reaches zero. Additionally, for efficiency, we avoid zero weights by setting a lower bound (e.g., 10^{-10}) for w; a minimal sketch of this detail is given below.
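The following is a minimal sketch of this implementation detail (in Python with NumPy; the function name, the floor value shown, and the example values are ours), normalizing weights stored as logarithms with a small lower bound.

```python
import numpy as np

def normalize_log_weights(log_w, floor=1e-10):
    """Return normalized weights from log-weights, clipped below by `floor` and renormalized."""
    log_w = np.asarray(log_w, dtype=float)
    log_w = log_w - log_w.max()          # subtract the max for numerical stability
    w = np.exp(log_w)
    w /= w.sum()
    w = np.maximum(w, floor)             # lower bound to avoid exactly zero weights
    return w / w.sum()

# Example: one component has a very small (log-)weight.
print(normalize_log_weights([0.0, -1.0, -80.0]))
```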

Algorithm 1 Gaussian Mixture Kalman Inversion
Input: initial guess {w0,k,m0,k,C0,k}k=1K\{w_{0,k},m_{0,k},C_{0,k}\}_{k=1}^{K}, time-step size Δt\Delta t, number of iterations NN, number of particles JJ for Monte Carlo, forward model \mathcal{F}
Output: final solution {wN,k,mN,k,CN,k}k=1K\{w_{N,k},m_{N,k},C_{N,k}\}_{k=1}^{K}
for n0n\leftarrow 0 to N1N-1 do
     for k1k\leftarrow 1 to KK do
         Sample {θj}j=1J𝒩(θ;mn,k,Cn,k1Δt)\{\theta^{j}\}_{j=1}^{J}\sim\mathcal{N}(\theta;m_{n,k},\frac{C_{n,k}}{1-\Delta t}), first step update
w^n+1,k=1Jj=1Jfn,k(θj)\displaystyle\hat{w}_{n+1,k}=\frac{1}{J}\sum_{j=1}^{J}f_{n,k}(\theta^{j})
m^n+1,k=1w^n+1,kJj=1Jθjfn,k(θj)\displaystyle\widehat{m}_{n+1,k}=\frac{1}{\hat{w}_{n+1,k}J}\sum_{j=1}^{J}\theta^{j}f_{n,k}(\theta^{j})
C^n+1,k=1w^n+1,k(J1)j=1J(θjm^n+1,k)(θjm^n+1,k)Tfn,k(θj)\displaystyle\widehat{C}_{n+1,k}=\frac{1}{\hat{w}_{n+1,k}(J-1)}\sum_{j=1}^{J}(\theta^{j}-\widehat{m}_{n+1,k})(\theta^{j}-\widehat{m}_{n+1,k})^{T}f_{n,k}(\theta^{j})
     end for
     Normalize w^n+1,k:=w^n+1,kk=1Kw^n+1,k\hat{w}_{n+1,k}:=\frac{\hat{w}_{n+1,k}}{\sum_{k=1}^{K}\hat{w}_{n+1,k}}
     for k1k\leftarrow 1 to KK do \triangleright Apply modified unscented transform [43, Eq. 37]
         Generate sigma-points (a=max{18,12Nθ})(a=\max\{\frac{1}{8},\frac{1}{2N_{\theta}}\})
θk0=m^n+1,k\displaystyle\theta_{k}^{0}=\widehat{m}_{n+1,k}
θkj=m^n+1,k+12a[C^n+1,k]j(1jNθ)\displaystyle\theta_{k}^{j}=\widehat{m}_{n+1,k}+\frac{1}{\sqrt{2a}}[\sqrt{\widehat{C}_{n+1,k}}]_{j}\quad(1\leq j\leq N_{\theta})
θkj+Nθ=m^n+1,k12a[C^n+1,k]j(1jNθ)\displaystyle\theta_{k}^{j+N_{\theta}}=\widehat{m}_{n+1,k}-\frac{1}{\sqrt{2a}}[\sqrt{\widehat{C}_{n+1,k}}]_{j}\quad(1\leq j\leq N_{\theta})
         Approximate the mean and covariance
x^n+1,k:=(θk0)\displaystyle\widehat{x}_{n+1,k}:=\mathcal{F}(\theta_{k}^{0})
C^n+1,kθx:=j=12Nθa(θkjm^n+1,k)((θkj)x^n+1,k)T\displaystyle\widehat{C}_{n+1,k}^{\theta x}:=\sum_{j=1}^{2N_{\theta}}a(\theta_{k}^{j}-\widehat{m}_{n+1,k})(\mathcal{F}(\theta_{k}^{j})-\widehat{x}_{n+1,k})^{T}
C^n+1,kxx:=j=12Nθa((θkj)x^n+1,k)((θkj)x^n+1,k)T+1ΔtΣν\displaystyle\widehat{C}_{n+1,k}^{xx}:=\sum_{j=1}^{2N_{\theta}}a(\mathcal{F}(\theta_{k}^{j})-\widehat{x}_{n+1,k})(\mathcal{F}(\theta_{k}^{j})-\widehat{x}_{n+1,k})^{T}+\frac{1}{\Delta t}\Sigma_{\nu}
         Second step update
mn+1,k=m^n+1,k+C^n+1,kθx(C^n+1,kxx)1(xx^n+1,k)Cn+1,k=C^n+1,kC^n+1,kθx(C^n+1,kxx)1C^n+1,kθxTwn+1,k=w^n+1,keΔtΦR(θk0)\begin{split}m_{n+1,k}&=\widehat{m}_{n+1,k}+\widehat{C}_{n+1,k}^{\theta x}(\widehat{C}_{n+1,k}^{xx})^{-1}(x-\widehat{x}_{n+1,k})\\ C_{n+1,k}&=\widehat{C}_{n+1,k}-\widehat{C}_{n+1,k}^{\theta x}(\widehat{C}_{n+1,k}^{xx})^{-1}{\widehat{C}_{n+1,k}^{{\theta x}^{T}}}\\ w_{n+1,k}&=\hat{w}_{n+1,k}e^{-\Delta t\Phi_{R}(\theta_{k}^{0})}\end{split}
     end for
     Normalize wn+1,k:=wn+1,kk=1Kwn+1,k{w}_{n+1,k}:=\frac{{w}_{n+1,k}}{\sum_{k=1}^{K}{w}_{n+1,k}}
end for
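To complement the pseudocode, the following is a minimal sketch (in Python with NumPy) of the second step for a single Gaussian component, following the sigma-point construction and Kalman update displayed in Algorithm 1. The forward model, the data x, and Σ_ν in the usage example are toy placeholders; the matrix square root via symmetric eigendecomposition is one possible choice; and the weight update w_{n+1,k}=ŵ_{n+1,k} e^{-Δt Φ_R(θ_k^0)} with its normalization is omitted for brevity.

```python
import numpy as np

def kalman_update_component(m_hat, C_hat, forward, x, Sigma_nu, dt):
    """Second-step GMKI update for one Gaussian component (m_hat, C_hat),
    using the modified unscented transform with a = max(1/8, 1/(2*N_theta))."""
    N_theta = m_hat.size
    a = max(1.0 / 8.0, 1.0 / (2.0 * N_theta))

    # Matrix square root of C_hat via symmetric eigendecomposition (one possible choice).
    vals, vecs = np.linalg.eigh(C_hat)
    sqrtC = vecs @ np.diag(np.sqrt(np.maximum(vals, 0.0))) @ vecs.T

    # Sigma points: the mean plus/minus scaled columns of sqrt(C_hat).
    thetas = [m_hat]
    for j in range(N_theta):
        thetas.append(m_hat + sqrtC[:, j] / np.sqrt(2.0 * a))
        thetas.append(m_hat - sqrtC[:, j] / np.sqrt(2.0 * a))
    F = np.array([forward(t) for t in thetas])        # forward evaluations, shape (2*N_theta+1, N_y)

    x_hat = F[0]                                      # mean approximation F(theta^0)
    dtheta = np.array(thetas[1:]) - m_hat             # deviations of the 2*N_theta sigma points
    dF = F[1:] - x_hat
    C_theta_x = a * dtheta.T @ dF                     # cross covariance
    C_xx = a * dF.T @ dF + Sigma_nu / dt              # forecast covariance in data space

    gain = C_theta_x @ np.linalg.inv(C_xx)            # Kalman gain
    m_new = m_hat + gain @ (x - x_hat)
    C_new = C_hat - gain @ C_theta_x.T
    return m_new, C_new

# Usage with a toy quadratic forward model (placeholder; 1D parameter, 1D data).
m_new, C_new = kalman_update_component(
    m_hat=np.array([3.0]), C_hat=np.array([[1.0]]),
    forward=lambda th: np.array([th[0]**2]), x=np.array([1.0]),
    Sigma_nu=np.array([[0.2**2]]), dt=0.5)
print(m_new, C_new)
```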

Appendix C Theoretical Studies About GMKI

C.1 Proof of Proposition 3

Update equation Eq. 28 and the normalization of weights {w^n+1,k}k\{\hat{w}_{n+1,k}\}_{k} can be combined and rewritten as

w^n+1,k=wn,k𝒩(θ;mn,k,Cn,k)ρnΔtdθρn1Δtdθ,\displaystyle\hat{w}_{n+1,k}=\frac{w_{n,k}\int\mathcal{N}(\theta;m_{n,k},C_{n,k})\rho_{n}^{-\Delta t}\mathrm{d}\theta}{\int\rho_{n}^{1-\Delta t}\mathrm{d}\theta}, (46a)
m^n+1,k=θ𝒩(θ;mn,k,Cn,k)ρnΔtdθ𝒩(θ;mn,k,Cn,k)ρnΔtdθ,\displaystyle\widehat{m}_{n+1,k}=\frac{\int\theta\mathcal{N}(\theta;m_{n,k},C_{n,k})\rho_{n}^{-\Delta t}\mathrm{d}\theta}{\int\mathcal{N}(\theta;m_{n,k},C_{n,k})\rho_{n}^{-\Delta t}\mathrm{d}\theta}, (46b)
C^n+1,k=(θm^n+1,k)(θm^n+1,k)T𝒩(θ;mn,k,Cn,k)ρnΔtdθ𝒩(θ;mn,k,Cn,k)ρnΔtdθ.\displaystyle\widehat{C}_{n+1,k}=\frac{\int(\theta-\widehat{m}_{n+1,k})(\theta-\widehat{m}_{n+1,k})^{T}\mathcal{N}(\theta;m_{n,k},C_{n,k})\rho_{n}^{-\Delta t}\mathrm{d}\theta}{\int\mathcal{N}(\theta;m_{n,k},C_{n,k})\rho_{n}^{-\Delta t}\mathrm{d}\theta}. (46c)

Now we derive the continuous time limit of the exploration step; we ignore the hat notation for simplicity.

w^n+1,kwn,kΔt\displaystyle\frac{\hat{w}_{n+1,k}-w_{n,k}}{\Delta t} =wn,k[𝒩(θ;mn,k,Cn,k)ρn]ρnΔtdθΔtρn1Δtdθ\displaystyle=w_{n,k}\frac{\int\Bigl{[}\mathcal{N}(\theta;m_{n,k},C_{n,k})-\rho_{n}\Bigr{]}\rho_{n}^{-\Delta t}\mathrm{d}\theta}{\Delta t\int\rho_{n}^{1-\Delta t}\mathrm{d}\theta} (47)
=wn,k[𝒩(θ;mn,k,Cn,k)ρn](1Δtlogρn)dθΔt+𝒪(Δt)\displaystyle=w_{n,k}\frac{\int\Bigl{[}\mathcal{N}(\theta;m_{n,k},C_{n,k})-\rho_{n}\Bigr{]}(1-\Delta t\log\rho_{n})\mathrm{d}\theta}{\Delta t}+\mathcal{O}(\Delta t)
=wn,k[𝒩(θ;mn,k,Cn,k)ρn]logρndθ+𝒪(Δt),\displaystyle=-w_{n,k}\int\Bigl{[}\mathcal{N}(\theta;m_{n,k},C_{n,k})-\rho_{n}\Bigr{]}\log\rho_{n}\mathrm{d}\theta+\mathcal{O}(\Delta t),

where in the second identity, we used \rho_{n}^{-\Delta t}=e^{-\Delta t\log\rho_{n}}=1-\Delta t\log\rho_{n}+\mathcal{O}(\Delta t^{2}). Therefore the continuous limit can be written as

w˙t,k=wt,k[𝒩(θ;mt,k,Ct,k)ρt(θ)]logρt(θ)dθ.\dot{w}_{t,k}=-w_{t,k}\int\Bigl{[}\mathcal{N}(\theta;m_{t,k},C_{t,k})-\rho_{t}(\theta)\Bigr{]}\log\rho_{t}(\theta)\mathrm{d}\theta\,.

Similarly, for mn,km_{n,k} we have

m^n+1,kmn,kΔt\displaystyle\frac{\widehat{m}_{n+1,k}-m_{n,k}}{\Delta t} =(θmn,k)𝒩(θ;mn,k,Cn,k)ρnΔtdθΔt𝒩(θ;mn,k,Cn,k)ρnΔtdθ\displaystyle=\frac{\int(\theta-m_{n,k})\mathcal{N}(\theta;m_{n,k},C_{n,k})\rho_{n}^{-\Delta t}\mathrm{d}\theta}{\Delta t\int\mathcal{N}(\theta;m_{n,k},C_{n,k})\rho_{n}^{-\Delta t}\mathrm{d}\theta} (48)
=(θmn,k)𝒩(θ;mn,k,Cn,k)(1Δtlogρn)dθΔt+𝒪(Δt)\displaystyle=\frac{\int(\theta-m_{n,k})\mathcal{N}(\theta;m_{n,k},C_{n,k})(1-\Delta t\log\rho_{n})\mathrm{d}\theta}{\Delta t}+\mathcal{O}(\Delta t)
=𝒩(θ;mn,k,Cn,k)(θmn,k)logρn(θ)dθ+𝒪(Δt)\displaystyle=-\int\mathcal{N}(\theta;m_{n,k},C_{n,k})(\theta-m_{n,k})\log\rho_{n}(\theta)\mathrm{d}\theta+\mathcal{O}(\Delta t)
=Cn,k𝒩(θ;mn,k,Cn,k)θlogρn(θ)dθ+𝒪(Δt),\displaystyle=-C_{n,k}\int\mathcal{N}(\theta;m_{n,k},C_{n,k})\nabla_{\theta}\log\rho_{n}(\theta)\mathrm{d}\theta+\mathcal{O}(\Delta t),

where in the last identity, we used integration by parts; this is also known as Stein's identity. Thus the continuous limit is

m˙t,k=𝒩(θ;mt,k,Ct,k)(θmt,k)logρt(θ)dθ.\dot{m}_{t,k}=-\int\mathcal{N}(\theta;m_{t,k},C_{t,k})(\theta-m_{t,k})\log\rho_{t}(\theta)\mathrm{d}\theta.

Finally, for Cn,kC_{n,k}, we have

C^n+1,kCn,kΔt\displaystyle\frac{\widehat{C}_{n+1,k}-C_{n,k}}{\Delta t} (49)
=\displaystyle= [(θm^n+1,k)(θm^n+1,k)TCn,k]𝒩(θ;mn,k,Cn,k)ρnΔtdθΔt𝒩(θ;mn,k,Cn,k)ρnΔtdθ\displaystyle\frac{\int\Bigl{[}(\theta-\widehat{m}_{n+1,k})(\theta-\widehat{m}_{n+1,k})^{T}-C_{n,k}\Bigr{]}\mathcal{N}(\theta;m_{n,k},C_{n,k})\rho_{n}^{-\Delta t}\mathrm{d}\theta}{\Delta t\int\mathcal{N}(\theta;m_{n,k},C_{n,k})\rho_{n}^{-\Delta t}\mathrm{d}\theta}
=\displaystyle= [(θm^n+1,k)(θm^n+1,k)TCn,k]𝒩(θ;mn,k,Cn,k)(1Δtlogρn)dθΔt𝒩(θ;mn,k,Cn,k)(1Δtlogρn)dθ+𝒪(Δt)\displaystyle\frac{\int\Bigl{[}(\theta-\widehat{m}_{n+1,k})(\theta-\widehat{m}_{n+1,k})^{T}-C_{n,k}\Bigr{]}\mathcal{N}(\theta;m_{n,k},C_{n,k})(1-\Delta t\log\rho_{n})\mathrm{d}\theta}{\Delta t\int\mathcal{N}(\theta;m_{n,k},C_{n,k})(1-\Delta t\log\rho_{n})\mathrm{d}\theta}+\mathcal{O}(\Delta t)
=\displaystyle= [(θm^n+1,k)(θm^n+1,k)TCn,k]𝒩(θ;mn,k,Cn,k)(1Δtlogρn)dθΔt+𝒪(Δt)\displaystyle\frac{\int\Bigl{[}(\theta-\widehat{m}_{n+1,k})(\theta-\widehat{m}_{n+1,k})^{T}-C_{n,k}\Bigr{]}\mathcal{N}(\theta;m_{n,k},C_{n,k})(1-\Delta t\log\rho_{n})\mathrm{d}\theta}{\Delta t}+\mathcal{O}(\Delta t)
=\displaystyle= [(θmn,k)(θmn,k)TCn,k]𝒩(θ;mn,k,Cn,k)logρndθ+𝒪(Δt)\displaystyle-\int\Bigl{[}(\theta-m_{n,k})(\theta-m_{n,k})^{T}-C_{n,k}\Bigr{]}\mathcal{N}(\theta;m_{n,k},C_{n,k})\log\rho_{n}\mathrm{d}\theta+\mathcal{O}(\Delta t)
=\displaystyle= Cn,k𝒩(θ;mn,k,Cn,k)θθlogρndθCn,k+𝒪(Δt),\displaystyle-C_{n,k}\int\mathcal{N}(\theta;m_{n,k},C_{n,k})\nabla_{\theta}\nabla_{\theta}\log\rho_{n}\mathrm{d}\theta C_{n,k}+\mathcal{O}(\Delta t),

where in the fourth identity we used Eq. 48, namely \widehat{m}_{n+1,k}=m_{n,k}+\mathcal{O}(\Delta t), and in the last identity we used integration by parts. This leads to the desired continuous limit stated in the proposition.

C.2 Proof of Proposition 4

The evolution equation of the entropy of ρt\rho_{t} is

ddtρtlogρtdθ=dρtdtlogρtdθ=kw˙t,k𝒩(θ;mt,k,Ct,k)logρtdθkwkm˙t,kTCt,k1(θmt,k)𝒩(θ;mt,k,Ct,k)logρtdθkwt,k2tr[C˙t,kT(Ct,k1(θmt,k)(θmt,k)TCt,k1Ct,k1)𝒩(θ;mt,k,Ct,k)logρtdθ]=k(w˙t,k2wt,k+wt,km˙t,kTCt,k1m˙t,k+wt,k2tr[C˙t,kCt,k1C˙t,kCt,k1]).\begin{split}&\frac{\mathrm{d}}{\mathrm{d}t}\int-\rho_{t}\log\rho_{t}\mathrm{d}\theta=-\int\frac{\mathrm{d}\rho_{t}}{\mathrm{d}t}\log\rho_{t}\mathrm{d}\theta\\ =&-\sum_{k}\dot{w}_{t,k}\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\log\rho_{t}\mathrm{d}\theta\\ &-\sum_{k}w_{k}\dot{m}_{t,k}^{T}C_{t,k}^{-1}\int(\theta-m_{t,k})\mathcal{N}(\theta;m_{t,k},C_{t,k})\log\rho_{t}\mathrm{d}\theta\\ &-\sum_{k}\frac{w_{t,k}}{2}{\rm tr}\Bigl{[}\dot{C}_{t,k}^{T}\int\bigl{(}C_{t,k}^{-1}(\theta-m_{t,k})(\theta-m_{t,k})^{T}C_{t,k}^{-1}-C_{t,k}^{-1}\bigr{)}\mathcal{N}(\theta;m_{t,k},C_{t,k})\log\rho_{t}\mathrm{d}\theta\Bigr{]}\\ =&\sum_{k}\Bigl{(}\frac{\dot{w}_{t,k}^{2}}{w_{t,k}}+w_{t,k}\dot{m}_{t,k}^{T}C_{t,k}^{-1}\dot{m}_{t,k}+\frac{w_{t,k}}{2}{\rm tr}\Bigl{[}\dot{C}_{t,k}C_{t,k}^{-1}\dot{C}_{t,k}C_{t,k}^{-1}\Bigr{]}\Bigr{)}.\end{split} (50)

Here we have used the fact that \int\frac{\mathrm{d}\rho_{t}}{\mathrm{d}t}\mathrm{d}\theta=0 in the first identity; in the second identity we used the fact that \sum_{k}\dot{w}_{t,k}=0 together with the continuous time limit equations Eq. 32.

C.3 Proof of Proposition 5

Combining Proposition 3 and Eq. 30 leads to the following

mn+1,kmn,kΔt=m^n+1,kmn,kΔt+C^n+1,kθx(C^n+1,kxx)1(xx^n+1,k)Δt=Cn,k𝒩(θ;mn,k,Cn,k)θlogρn(θ)dθ+C^n,kθxΣν1(xx^n,k)+𝒪(Δt).Cn+1,kCn,kΔt=C^n+1,kCn,kΔtC^n+1,kθx(C^n+1,kxx)1C^n+1,kθxTΔt=Cn,k𝒩(θ;mn,k,Cn,k)θθlogρn(θ)dθCn,kC^n,kθxΣν1C^n,kθxT+𝒪(Δt).\begin{split}&\frac{m_{n+1,k}-m_{n,k}}{\Delta t}\\ =&\frac{\widehat{m}_{n+1,k}-m_{n,k}}{\Delta t}+\frac{\widehat{C}_{n+1,k}^{\theta x}(\widehat{C}_{n+1,k}^{xx})^{-1}(x-\widehat{x}_{n+1,k})}{\Delta t}\\ =&-C_{n,k}\int\mathcal{N}(\theta;m_{n,k},C_{n,k})\nabla_{\theta}\log\rho_{n}(\theta)\mathrm{d}\theta+\widehat{C}_{n,k}^{\theta x}\Sigma_{\nu}^{-1}(x-\widehat{x}_{n,k})+\mathcal{O}(\Delta t).\\ &\frac{C_{n+1,k}-C_{n,k}}{\Delta t}\\ =&\frac{\widehat{C}_{n+1,k}-C_{n,k}}{\Delta t}-\frac{\widehat{C}_{n+1,k}^{\theta x}(\widehat{C}_{n+1,k}^{xx})^{-1}{\widehat{C}_{n+1,k}^{{\theta x}^{T}}}}{\Delta t}\\ =&-C_{n,k}\int\mathcal{N}(\theta;m_{n,k},C_{n,k})\nabla_{\theta}\nabla_{\theta}\log\rho_{n}(\theta)\mathrm{d}\theta C_{n,k}-\widehat{C}_{n,k}^{\theta x}\Sigma_{\nu}^{-1}{\widehat{C}_{n,k}^{{\theta x}^{T}}}+\mathcal{O}(\Delta t).\end{split} (51)

Using the above formula, we obtain the continuous limit. For the weights, we have

wn+1,k=w^n+1,k𝒩(θ;m^n+1,k,C^n+1,k)eΔtΦR(θ)dθρ^n+1eΔtΦR(θ)dθ.\displaystyle w_{n+1,k}=\frac{\hat{w}_{n+1,k}\int\mathcal{N}(\theta;\widehat{m}_{n+1,k},\widehat{C}_{n+1,k})e^{-\Delta t\Phi_{R}(\theta)}\mathrm{d}\theta}{\int\hat{\rho}_{n+1}e^{-\Delta t\Phi_{R}(\theta)}\mathrm{d}\theta}. (52a)

Therefore,

wn+1,kwn,kΔt=wn+1,kw^n+1,kΔt+w^n+1,kwn,kΔt=w^n+1,k[𝒩(θ;m^n+1,k,C^n+1,k)ρ^n+1]eΔtΦRdθΔtρ^n+1eΔtΦRdθ+w^n+1,kwn,kΔt+𝒪(Δt)=w^n+1,k[𝒩(θ;m^n+1,k,C^n+1,k)ρ^n+1](1ΔtΦR)dθΔt+w^n+1,kwn,kΔt+𝒪(Δt)=wn,k[𝒩(θ;mn,k,Cn,k)ρn](logρn+ΦR)dθ+𝒪(Δt),\begin{split}&\frac{w_{n+1,k}-w_{n,k}}{\Delta t}=\frac{w_{n+1,k}-\hat{w}_{n+1,k}}{\Delta t}+\frac{\hat{w}_{n+1,k}-w_{n,k}}{\Delta t}\\ =&\hat{w}_{n+1,k}\frac{\int\bigl{[}\mathcal{N}(\theta;\widehat{m}_{n+1,k},\widehat{C}_{n+1,k})-\hat{\rho}_{n+1}\bigr{]}e^{-\Delta t\Phi_{R}}\mathrm{d}\theta}{\Delta t\int\hat{\rho}_{n+1}e^{-\Delta t\Phi_{R}}\mathrm{d}\theta}+\frac{\hat{w}_{n+1,k}-w_{n,k}}{\Delta t}+\mathcal{O}(\Delta t)\\ =&\hat{w}_{n+1,k}\frac{\int\bigl{[}\mathcal{N}(\theta;\widehat{m}_{n+1,k},\widehat{C}_{n+1,k})-\hat{\rho}_{n+1}\bigr{]}(1-\Delta t\Phi_{R})\mathrm{d}\theta}{\Delta t}+\frac{\hat{w}_{n+1,k}-w_{n,k}}{\Delta t}+\mathcal{O}(\Delta t)\\ =&-w_{n,k}\int\Bigl{[}\mathcal{N}(\theta;m_{n,k},C_{n,k})-\rho_{n}\Bigr{]}(\log\rho_{n}+\Phi_{R})\mathrm{d}\theta+\mathcal{O}(\Delta t),\end{split} (53)

from which we readily obtain the continuous limit for the equation of the weights. Here in the last step we used the result in C.1.

C.4 Proof of Proposition 6

Under the Gaussian posterior assumption (i.e. Φ_R is quadratic), and based on the formulas in (25) and (26), we get

C^t,kθxΣν1(xx^t,k)=Ct,k𝒩(θ;mt,k,Ct,k)θΦRdθ,\displaystyle\widehat{C}_{t,k}^{\theta x}\Sigma_{\nu}^{-1}(x-\widehat{x}_{t,k})=-C_{t,k}\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\nabla_{\theta}\Phi_{R}\mathrm{d}\theta, (54)
C^t,kθxΣν1C^t,kT=Ct,k(𝒩(θ;mt,k,Ct,k)θθΦRdθ)Ct,kT.\displaystyle\widehat{C}_{t,k}^{\theta x}\Sigma_{\nu}^{-1}{\widehat{C}_{t,k}^{T}}=C_{t,k}\left(\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\nabla_{\theta}\nabla_{\theta}\Phi_{R}\mathrm{d}\theta\right)C_{t,k}^{T}. (55)

Bringing Eq. 54 into the continuous time dynamics Eq. 35 and using integration by parts (also known as Stein’s identities) leads to

m˙t,k=\displaystyle\dot{m}_{t,k}= Ct,k𝒩(θ;mt,k,Ct,k)θ(logρt+ΦR)dθ\displaystyle-C_{t,k}\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\nabla_{\theta}\bigl{(}\log\rho_{t}+\Phi_{R}\bigr{)}\mathrm{d}\theta
=\displaystyle= 𝒩(θ;mt,k,Ct,k)(θmt,k)(logρt+ΦR)dθ,\displaystyle-\int\mathcal{N}(\theta;m_{t,k},C_{t,k})(\theta-m_{t,k})\bigl{(}\log\rho_{t}+\Phi_{R}\bigr{)}\mathrm{d}\theta, (56a)
C˙t,k=\displaystyle\dot{C}_{t,k}= 𝒩(θ;mt,k,Ct,k)θθ(logρt+ΦR)dθ\displaystyle-\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\nabla_{\theta}\nabla_{\theta}\bigl{(}\log\rho_{t}+\Phi_{R}\bigr{)}\mathrm{d}\theta
=\displaystyle= 𝒩(θ;mt,k,Ct,k)((θmt,k)(θmt,k)TCt,k)(logρt+ΦR)dθ,\displaystyle-\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\Bigl{(}(\theta-m_{t,k})(\theta-m_{t,k})^{T}-C_{t,k}\Bigr{)}\bigl{(}\log\rho_{t}+\Phi_{R}\bigr{)}\mathrm{d}\theta, (56b)
w˙t,k=\displaystyle\dot{w}_{t,k}= wt,k[𝒩(θ;mt,k,Ct,k)ρt][logρt+ΦR]dθ.\displaystyle-w_{t,k}\int\Bigl{[}\mathcal{N}(\theta;m_{t,k},C_{t,k})-\rho_{t}\Bigr{]}\Bigl{[}\log\rho_{t}+\Phi_{R}\Bigr{]}\mathrm{d}\theta. (56c)

The evolution equation of the KL divergence between ρt\rho_{t} and ρpost\rho_{\rm post} is then

ddtKL[ρtρpost]=ddtρtlog(ρtρpost)dθ=dρtdt(logρt+ΦR)dθ=kw˙t,k𝒩(θ;mt,k,Ct,k)(logρt+ΦR)dθ+kwkm˙t,kTCt,k1(θmt,k)𝒩(θ;mt,k,Ct,k)(logρt+ΦR)dθ+kwt,k2tr[C˙t,kT(Ct,k1(θmt,k)(θmt,k)TCt,k1Ct,k1)𝒩(θ;mt,k,Ct,k)(logρt+ΦR)dθ]=k(w˙t,k2wt,k+wt,km˙t,kTCt,k1m˙t,k+wt,k2tr[C˙t,kCt,k1C˙t,kCt,k1]).\begin{split}&\frac{\mathrm{d}}{\mathrm{d}t}\mathrm{KL}[\rho_{t}\|\rho_{\rm post}]\\ =&\frac{\mathrm{d}}{\mathrm{d}t}\int\rho_{t}\log\Bigl{(}\frac{\rho_{t}}{\rho_{\rm post}}\Bigr{)}\mathrm{d}\theta=\int\frac{\mathrm{d}\rho_{t}}{\mathrm{d}t}\bigl{(}\log\rho_{t}+\Phi_{R}\bigr{)}\mathrm{d}\theta\\ =&\sum_{k}\dot{w}_{t,k}\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\bigl{(}\log\rho_{t}+\Phi_{R}\bigr{)}\mathrm{d}\theta\\ &+\sum_{k}w_{k}\dot{m}_{t,k}^{T}C_{t,k}^{-1}\int(\theta-m_{t,k})\mathcal{N}(\theta;m_{t,k},C_{t,k})\bigl{(}\log\rho_{t}+\Phi_{R}\bigr{)}\mathrm{d}\theta\\ &+\sum_{k}\frac{w_{t,k}}{2}{\rm tr}\Bigl{[}\dot{C}_{t,k}^{T}\int\bigl{(}C_{t,k}^{-1}(\theta-m_{t,k})(\theta-m_{t,k})^{T}C_{t,k}^{-1}-C_{t,k}^{-1}\bigr{)}\mathcal{N}(\theta;m_{t,k},C_{t,k})\bigl{(}\log\rho_{t}+\Phi_{R}\bigr{)}\mathrm{d}\theta\Bigr{]}\\ =&-\sum_{k}\Bigl{(}\frac{\dot{w}_{t,k}^{2}}{w_{t,k}}+w_{t,k}\dot{m}_{t,k}^{T}C_{t,k}^{-1}\dot{m}_{t,k}+\frac{w_{t,k}}{2}{\rm tr}\Bigl{[}\dot{C}_{t,k}C_{t,k}^{-1}\dot{C}_{t,k}C_{t,k}^{-1}\Bigr{]}\Bigr{)}.\end{split} (57)

Here in the last identity, we used the equation of the continuous time dynamics in (56) to simplify the formula.

Consider any stationary point ρ=kwk,𝒩(θ;mk,,Ck,)\rho_{\infty}=\sum_{k}w_{k,\infty}\mathcal{N}(\theta;m_{k,\infty},C_{k,\infty}) with nonzero wk,w_{k,\infty}. The stationary point condition for the mean mk,m_{k,\infty} (56a) is

𝒩(θ;mk,,Ck,)θlogρdθ=Cpost1(mk,mpost),\int\mathcal{N}(\theta;m_{k,\infty},C_{k,\infty})\nabla_{\theta}\log\rho_{\infty}\mathrm{d}\theta=-C_{\rm post}^{-1}(m_{k,\infty}-m_{\rm post}), (58)

where we used that 𝒩(θ;m,k,C,k)θΦRdθ=Cpost1(mk,mpost)\int\mathcal{N}(\theta;m_{\infty,k},C_{\infty,k})\nabla_{\theta}\Phi_{R}\mathrm{d}\theta=C_{\rm post}^{-1}(m_{k,\infty}-m_{\rm post}). The stationary point condition for the covariance Ck,C_{k,\infty} (56b) is

𝒩(θ;mk,,Ck,)θθlogρdθ=Cpost1,\int\mathcal{N}(\theta;m_{k,\infty},C_{k,\infty})\nabla_{\theta}\nabla_{\theta}\log\rho_{\infty}\mathrm{d}\theta=-C_{\rm post}^{-1}, (59)

where we used that 𝒩(θ;m,k,C,k)θθΦRdθ=Cpost1\int\mathcal{N}(\theta;m_{\infty,k},C_{\infty,k})\nabla_{\theta}\nabla_{\theta}\Phi_{R}\mathrm{d}\theta=C_{\rm post}^{-1}. Multiplying Eq. 58 and Eq. 59 by wk,w_{k,\infty} and summing the results yields

mpostkwk,mk,=Cpostkwk,𝒩(θ;mk,,Ck,)θlogρdθ=0,\displaystyle m_{\rm post}-\sum_{k}w_{k,\infty}m_{k,\infty}=C_{\rm post}\int\sum_{k}w_{k,\infty}\mathcal{N}(\theta;m_{k,\infty},C_{k,\infty})\nabla_{\theta}\log\rho_{\infty}\mathrm{d}\theta=0, (60)
Cpost1=wk,𝒩(θ;mk,,Ck,)θθlogρdθ=FIM[ρ].\displaystyle C_{\rm post}^{-1}=-\int\sum w_{k,\infty}\mathcal{N}(\theta;m_{k,\infty},C_{k,\infty})\nabla_{\theta}\nabla_{\theta}\log\rho_{\infty}\mathrm{d}\theta={\rm FIM}[\rho_{\infty}].

C.5 Derivation of the Simplified Continuous Time Dynamics Eq. 39

In this section, we formally derive Eq. 39, assuming the Gaussian components in ρ_post are well separated. When θ is close to m_k^*, one may make the following approximation

\displaystyle\Phi_{R}(\theta)=-\log\rho_{\rm post}(\theta)\approx-\log\mathcal{N}(\theta,m_{k}^{*},C_{k}^{*})-\log(w_{k}^{*}). (61)

Combining the definition of Φ_R in Eq. 19 with Eq. 61 leads to

12(x(θ))TΣν1(x(θ))12(θmk)TCk1(θmk)+constant.-\frac{1}{2}(x-\mathcal{F}(\theta))^{T}\Sigma_{\nu}^{-1}(x-\mathcal{F}(\theta))\approx-\frac{1}{2}(\theta-m_{k}^{*})^{T}C_{k}^{*^{-1}}(\theta-m_{k}^{*})+\textrm{constant}. (62)

This implies that \mathcal{F}(\theta) is approximately locally linear around m=m_{k}^{*}, with \mathcal{F}(\theta)\approx F_{k}\theta+c such that F_{k}^{T}\Sigma_{\nu}^{-1}F_{k}=C_{k}^{*^{-1}} and C_{k}^{*}F_{k}^{T}\Sigma_{\nu}^{-1}(x-c)=m_{k}^{*}. Based on the above derivation, when the k-th component \mathcal{N}(\theta;m_{t,k},C_{t,k}) in the Gaussian mixture approximation is concentrated around m_{k}^{*}, the expectation and covariance in the continuous time limit of the proposed GMKI (36) can be approximated as

x^t,k\displaystyle\widehat{x}_{t,k} =𝔼[(θ)]𝔼[Fkθ+c]=Fkmt,k+c,\displaystyle=\mathbb{E}[\mathcal{F}(\theta)]\approx\mathbb{E}[F_{k}\theta+c]=F_{k}m_{t,k}+c, (63a)
C^t,kθx\displaystyle\widehat{C}_{t,k}^{\theta x} =Cov[θ,(θ)]𝔼[(θmt,k)(Fk(θmt,k))]=Ct,kFkT.\displaystyle=\textrm{Cov}\bigl{[}\theta,\mathcal{F}(\theta)\bigr{]}\approx\mathbb{E}\bigl{[}\bigl{(}\theta-m_{t,k}\bigr{)}\otimes\bigl{(}F_{k}(\theta-m_{t,k})\bigr{)}\bigr{]}=C_{t,k}F_{k}^{T}. (63b)

Here the expectations are taken with respect to \mathcal{N}(\theta;m_{t,k},C_{t,k}).

We now simplify Eq. 35 by neglecting the interactions between well-separated Gaussian components to obtain Eq. 39. For the mean evolution equation (35a), we have

m˙t,kCt,k𝒩(θ;mt,k,Ct,k)θlog(wk𝒩(θ;mt,k,Ct,k))dθ+C^t,kθxΣν1(xx^t,k)=C^t,kθxΣν1(xx^t,k)Ct,k(Ck)1(mkmt,k).\begin{split}\dot{m}_{t,k}\approx&-C_{t,k}\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\nabla_{\theta}\log\bigl{(}w_{k}\mathcal{N}(\theta;m_{t,k},C_{t,k})\bigr{)}\mathrm{d}\theta+\widehat{C}_{t,k}^{\theta x}\Sigma_{\nu}^{-1}(x-\widehat{x}_{t,k})\\ =&\widehat{C}_{t,k}^{\theta x}\Sigma_{\nu}^{-1}(x-\widehat{x}_{t,k})\\ \approx&C_{t,k}(C_{k}^{*})^{-1}(m_{k}^{*}-m_{t,k}).\end{split} (64)

The first approximation is obtained by substituting \log\rho_{t} with \log(w_{k}\mathcal{N}(\theta;m_{t,k},C_{t,k})), which is justified by the well-separatedness assumption; the resulting integral vanishes, which yields the second identity. The third approximation follows from Eq. 63 and the relations F_{k}^{T}\Sigma_{\nu}^{-1}F_{k}=C_{k}^{*^{-1}} and C_{k}^{*}F_{k}^{T}\Sigma_{\nu}^{-1}(x-c)=m_{k}^{*}.

For the covariance evolution equation (35b), similarly we have

C˙t,kCt,k(𝒩(θ;mt,k,Ct,k)θθlog(wk𝒩(θ;mt,k,Ct,k))dθ)Ct,kC^t,kθxΣν1C^t,kT=Ct,kC^t,kθxΣν1C^t,kT=Ct,kCt,k(Ck)1Ct,k.\begin{split}\dot{C}_{t,k}\approx&-C_{t,k}\left(\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\nabla_{\theta}\nabla_{\theta}\log\bigl{(}w_{k}\mathcal{N}(\theta;m_{t,k},C_{t,k})\bigr{)}\mathrm{d}\theta\right)C_{t,k}-\widehat{C}_{t,k}^{\theta x}\Sigma_{\nu}^{-1}{\widehat{C}_{t,k}^{T}}\\ =&\ C_{t,k}-\widehat{C}_{t,k}^{\theta x}\Sigma_{\nu}^{-1}{\widehat{C}_{t,k}^{T}}\\ =&\ C_{t,k}-C_{t,k}(C_{k}^{*})^{-1}C_{t,k}.\end{split} (65)

The first approximation is obtained by substituting \log\rho_{t} with \log(w_{k}\mathcal{N}(\theta;m_{t,k},C_{t,k})); the last step uses Eq. 63.

Finally, for the weight evolution equation (35c), using the formula \rho_{t}(\theta)=\sum_{i}w_{t,i}\mathcal{N}(\theta;m_{t,i},C_{t,i}), we get

w˙t,k=wt,k𝒩(θ;mt,k,Ct,k)logρt(θ)dθ+wt,kiwt,i𝒩(θ;mt,i,Ct,i)logρt(θ)dθwt,k𝒩(θ;mt,k,Ct,k)ΦR(θ)dθ+wt,kiwt,i𝒩(θ;mt,i,Ct,i)ΦR(θ)dθwt,k𝒩(θ;mt,k,Ct,k)log[wt,k𝒩(θ;mt,k,Ct,k)]dθ+wt,kiwt,i𝒩(θ;mt,i,Ct,i)log[wt,i𝒩(θ;mt,i,Ct,i)]dθ+wt,k𝒩(θ;mt,k,Ct,k)(log𝒩(θ,mk,Ck)+log(wk))dθwt,kiwt,i𝒩(θ;mt,i,Ct,i)(log𝒩(θ,mi,Ci)+log(wi))dθ=wt,k(logwklogwt,kiwt,i(logwilogwt,i)).\begin{split}\dot{w}_{t,k}=&-w_{t,k}\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\log\rho_{t}(\theta)\mathrm{d}\theta+w_{t,k}\sum_{i}\int w_{t,i}\mathcal{N}(\theta;m_{t,i},C_{t,i})\log\rho_{t}(\theta)\mathrm{d}\theta\\ &-w_{t,k}\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\Phi_{R}(\theta)\mathrm{d}\theta+w_{t,k}\sum_{i}\int w_{t,i}\mathcal{N}(\theta;m_{t,i},C_{t,i})\Phi_{R}(\theta)\mathrm{d}\theta\\ \approx&-w_{t,k}\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\log[w_{t,k}\mathcal{N}(\theta;m_{t,k},C_{t,k})]\mathrm{d}\theta\\ &+w_{t,k}\sum_{i}\int w_{t,i}\mathcal{N}(\theta;m_{t,i},C_{t,i})\log[w_{t,i}\mathcal{N}(\theta;m_{t,i},C_{t,i})]\mathrm{d}\theta\\ &+w_{t,k}\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\Bigl{(}\log\mathcal{N}(\theta,m_{k}^{*},C_{k}^{*})+\log(w_{k}^{*})\Bigr{)}\mathrm{d}\theta\\ &-w_{t,k}\sum_{i}\int w_{t,i}\mathcal{N}(\theta;m_{t,i},C_{t,i})\Bigl{(}\log\mathcal{N}(\theta,m_{i}^{*},C_{i}^{*})+\log(w_{i}^{*})\Bigr{)}\mathrm{d}\theta\\ =&w_{t,k}\bigl{(}\log w_{k}^{*}-\log w_{t,k}-\sum_{i}w_{t,i}(\log w_{i}^{*}-\log w_{t,i})\bigr{)}.\end{split} (66)

The first approximation is obtained by substituting logρt\log\rho_{t} with log(wk𝒩(θ;mt,k,Ct,k))\log(w_{k}\mathcal{N}(\theta;m_{t,k},C_{t,k})) and using Eq. 61 for approximating ΦR(θ)\Phi_{R}(\theta). Combining Eqs. 64, 65 and 66 leads to the simplified continuous time dynamics Eq. 39.

C.6 Proof of Proposition 7

The mean, covariance and weight evolution equations Eqs. 39a, 39b and 39c admit analytical solutions

mt,k=mk+et((1et)Ck1+etCk(0)1)1Ck(0)1(mk(0)mk),\displaystyle m_{t,k}=m_{k}^{*}+e^{-t}\Bigl{(}(1-e^{-t})C_{k}^{*^{-1}}+e^{-t}C_{k}(0)^{-1}\Bigr{)}^{-1}C_{k}(0)^{-1}\bigl{(}m_{k}(0)-m_{k}^{*}\bigr{)}, (67a)
Ct,k1=Ck1+et(Ck(0)1Ck1),\displaystyle C_{t,k}^{-1}=C_{k}^{*^{-1}}+e^{-t}(C_{k}(0)^{-1}-C_{k}^{*^{-1}}), (67b)
wk=wk(wk(0)wk)etiwi(wi(0)wi)et.\displaystyle w_{k}=\frac{w_{k}^{*}\Bigl{(}\frac{w_{k}(0)}{w_{k}^{*}}\Bigr{)}^{e^{-t}}}{\sum_{i}w_{i}^{*}\Bigl{(}\frac{w_{i}(0)}{w_{i}^{*}}\Bigr{)}^{e^{-t}}}. (67c)

They will converge to mkm_{k}^{*}, CkC_{k}^{*} and wkw_{k}^{*} exponentially fast.
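As a sanity check on these formulas, the following is a minimal sketch (in Python with NumPy; a scalar parameter, two components, and arbitrary illustrative values for m_k^*, C_k^*, w_k^*) that integrates the simplified dynamics Eqs. 39a, 39b and 39c with forward Euler and compares the result at time T with the closed-form solutions Eq. 67; the printed differences shrink as the step size is reduced.

```python
import numpy as np

# Target quantities (illustrative values) and initial conditions; scalar parameter, K = 2 components.
m_star, C_star, w_star = np.array([-1.0, 1.0]), np.array([0.3, 0.5]), np.array([0.3, 0.7])
m, C, w = np.array([2.0, -2.0]), np.array([4.0, 1.0]), np.array([0.5, 0.5])
m0, C0, w0 = m.copy(), C.copy(), w.copy()

T, dt = 10.0, 1e-3
for _ in range(int(T / dt)):                       # forward Euler on the simplified dynamics Eq. 39
    dm = C / C_star * (m_star - m)
    dC = C - C**2 / C_star
    dw = w * (np.log(w_star) - np.log(w) - np.sum(w * (np.log(w_star) - np.log(w))))
    m, C, w = m + dt * dm, C + dt * dC, w + dt * dw

# Closed-form solutions Eq. 67 evaluated at time T.
e = np.exp(-T)
C_exact = 1.0 / (1.0 / C_star + e * (1.0 / C0 - 1.0 / C_star))
m_exact = m_star + e * (m0 - m_star) / C0 / ((1 - e) / C_star + e / C0)
w_exact = w_star * (w0 / w_star) ** e
w_exact /= w_exact.sum()

print("max |m - m_exact| =", np.abs(m - m_exact).max())
print("max |C - C_exact| =", np.abs(C - C_exact).max())
print("max |w - w_exact| =", np.abs(w - w_exact).max())
```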

C.7 Proof of Proposition 8

Consider any invertible affine mapping φ:θθ~=Aθ+b\varphi:\theta\rightarrow\widetilde{\theta}=A\theta+b, and define corresponding vector and matrix transformations

m~t,k=Amt,k+bC~t,k=ACt,kAT,\displaystyle\widetilde{m}_{t,k}=Am_{t,k}+b\quad\widetilde{C}_{t,k}=AC_{t,k}A^{T},

density transformations

ρ~(θ~)=φρ(θ)=ρ(A1(θ~b))|A|1𝒩(θ~;m~t,k,C~t,k)=𝒩(θ;mt,k,Ct,k)|A|1,\displaystyle{\widetilde{\rho}}(\widetilde{\theta})=\varphi_{\sharp}\rho(\theta)=\rho(A^{-1}(\widetilde{\theta}-b))|A|^{-1}\quad\mathcal{N}(\widetilde{\theta};\widetilde{m}_{t,k},\widetilde{C}_{t,k})=\mathcal{N}(\theta;m_{t,k},C_{t,k})|A|^{-1},

function transformations

~(θ~)=(A1(θ~b))Φ~R(θ~)=ΦR(A1(θ~b)),\displaystyle\widetilde{\mathcal{F}}(\widetilde{\theta})=\mathcal{F}(A^{-1}(\widetilde{\theta}-b))\quad\quad\widetilde{\Phi}_{R}(\widetilde{\theta})=\Phi_{R}(A^{-1}(\widetilde{\theta}-b)),

and their related expectation and covariance

x~t,k=𝔼[~(θ~)],C~t,kθx=Cov[θ~,~(θ~)],withθ~𝒩(m~t,k,C~t,k),\displaystyle\widetilde{x}_{t,k}=\mathbb{E}[\widetilde{\mathcal{F}}(\widetilde{\theta})],\qquad\widetilde{C}^{\theta x}_{t,k}=\mathrm{Cov}[\widetilde{\theta},\widetilde{\mathcal{F}}(\widetilde{\theta})],\quad\textrm{with}\quad\widetilde{\theta}\sim\mathcal{N}(\widetilde{m}_{t,k},\widetilde{C}_{t,k}),

then we have

\begin{split}&\nabla_{\widetilde{\theta}}\log\widetilde{\rho}(\widetilde{\theta})=A^{-T}\nabla_{\theta}\log\rho(\theta)\qquad\nabla_{\widetilde{\theta}}\nabla_{\widetilde{\theta}}\log\widetilde{\rho}(\widetilde{\theta})=A^{-T}\nabla_{\theta}\nabla_{\theta}\log\rho(\theta)A^{-1}\\ &\widetilde{x}_{t,k}=\widehat{x}_{t,k}\qquad\widetilde{C}_{t,k}^{\theta x}=A\widehat{C}_{t,k}^{\theta x}.\end{split} (68)

The evolution equations of m~t,k\widetilde{m}_{t,k}, C~t,k\widetilde{C}_{t,k}, wt,kw_{t,k} in Eq. 35 can be rewritten as

m~˙t,k=\displaystyle\dot{\widetilde{m}}_{t,k}= ACt,k𝒩(θ;mt,k,Ct,k)θlogρt(θ)dθ+AC^t,kθxΣν1(xx^t,k),\displaystyle-AC_{t,k}\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\nabla_{\theta}\log\rho_{t}(\theta)\mathrm{d}\theta+A\widehat{C}_{t,k}^{\theta x}\Sigma_{\nu}^{-1}(x-\widehat{x}_{t,k}),
=\displaystyle= C~t,k𝒩(θ~;m~t,k,C~t,k)θ~logρ~t(θ~)dθ~+C~t,kθxΣν1(xx~t,k),\displaystyle-\widetilde{C}_{t,k}\int\mathcal{N}(\widetilde{\theta};\widetilde{m}_{t,k},\widetilde{C}_{t,k})\nabla_{\widetilde{\theta}}\log\widetilde{\rho}_{t}(\widetilde{\theta})\mathrm{d}\widetilde{\theta}+\widetilde{C}_{t,k}^{\theta x}\Sigma_{\nu}^{-1}(x-\widetilde{x}_{t,k}),
C~˙t,k=\displaystyle\dot{\widetilde{C}}_{t,k}= ACt,k(𝒩(θ;mt,k,Ct,k)θθlogρt(θ)dθ)Ct,kATAC^t,kθxΣν1C^t,kθxTAT,\displaystyle-AC_{t,k}\left(\int\mathcal{N}(\theta;m_{t,k},C_{t,k})\nabla_{\theta}\nabla_{\theta}\log\rho_{t}(\theta)\mathrm{d}\theta\right)C_{t,k}A^{T}-A\widehat{C}_{t,k}^{\theta x}\Sigma_{\nu}^{-1}{\widehat{C}_{t,k}^{{\theta x}T}}A^{T},
=\displaystyle= C~t,k(𝒩(θ~;m~t,k,C~t,k)θ~θ~logρ~t(θ~)dθ~)C~t,kC~t,kθxΣν1C~t,kθxT,\displaystyle-\widetilde{C}_{t,k}\left(\int\mathcal{N}(\widetilde{\theta};\widetilde{m}_{t,k},\widetilde{C}_{t,k})\nabla_{\widetilde{\theta}}\nabla_{\widetilde{\theta}}\log\widetilde{\rho}_{t}(\widetilde{\theta})\mathrm{d}\widetilde{\theta}\right)\widetilde{C}_{t,k}-\widetilde{C}_{t,k}^{\theta x}\Sigma_{\nu}^{-1}{\widetilde{C}_{t,k}^{{\theta x}^{T}}},
w˙t,k=\displaystyle\dot{w}_{t,k}= wt,k[𝒩(θ;mt,k,Ct,k)ρt(θ)][logρt(θ)+ΦR(θ)]dθ\displaystyle-w_{t,k}\int\Bigl{[}\mathcal{N}(\theta;m_{t,k},C_{t,k})-\rho_{t}(\theta)\Bigr{]}\Bigl{[}\log\rho_{t}(\theta)+\Phi_{R}(\theta)\Bigr{]}\mathrm{d}\theta
=\displaystyle= wt,k[𝒩(θ~;m~t,k,C~t,k)ρ~t(θ~)][logρ~t(θ~)+Φ~R(θ~)]dθ~.\displaystyle-w_{t,k}\int\Bigl{[}\mathcal{N}(\widetilde{\theta};\widetilde{m}_{t,k},\widetilde{C}_{t,k})-\widetilde{\rho}_{t}(\widetilde{\theta})\Bigr{]}\Bigl{[}\log\widetilde{\rho}_{t}(\widetilde{\theta})+\widetilde{\Phi}_{R}(\widetilde{\theta})\Bigr{]}\mathrm{d}\widetilde{\theta}.

Hence the continuous time limit equation Eq. 35 of GMKI is affine invariant.

C.8 Connections Between the GMKI Approach and Gaussian Mixture Variational Inference

Gaussian mixture variational inference seeks to identify a minimizer of KL[ρGMρpost]{\rm KL}[\rho^{\rm GM}\|\rho_{\rm post}], where

ρGM(θ;a)=k=1Kwk𝒩(θ;mk,Ck)\rho^{\rm GM}(\theta;a)=\sum_{k=1}^{K}w_{k}\mathcal{N}(\theta;m_{k},C_{k})

is a KK-component Gaussian mixture, parameterized by their means, covariances and weights denoted by

a:=[m_{1},\ldots,m_{k},\ldots,m_{K},C_{1},\ldots,C_{k},\ldots,C_{K},w_{1},\ldots,w_{k},\ldots,w_{K}].

The derivatives of the KL divergence with respect to aa are

KL[ρGM(;a)ρpost]mk=wk𝒩(θ;mk,Ck)(θlogρGMθlogρpost)dθ,\displaystyle\frac{\partial{\rm KL}[\rho^{\rm GM}(\cdot;a)\|\rho_{\rm post}]}{\partial m_{k}}=w_{k}\int\mathcal{N}(\theta;m_{k},C_{k})\Bigl{(}\nabla_{\theta}\log\rho^{\rm GM}-\nabla_{\theta}\log\rho_{\rm post}\Bigr{)}\mathrm{d}\theta, (69a)
KL[ρGM(;a)ρpost]Ck=wk2𝒩(θ;mk,Ck)(θθlogρGMθθlogρpost)dθ,\displaystyle\frac{\partial{\rm KL}[\rho^{\rm GM}(\cdot;a)\|\rho_{\rm post}]}{\partial C_{k}}=\frac{w_{k}}{2}\int\mathcal{N}(\theta;m_{k},C_{k})\Bigl{(}\nabla_{\theta}\nabla_{\theta}\log\rho^{\rm GM}-\nabla_{\theta}\nabla_{\theta}\log\rho_{\rm post}\Bigr{)}\mathrm{d}\theta, (69b)
KL[ρGM(;a)ρpost]wk=𝒩(θ;mk,Ck)(logρGMρpost+1)dθ.\displaystyle\frac{\partial{\rm KL}[\rho^{\rm GM}(\cdot;a)\|\rho_{\rm post}]}{\partial w_{k}}=\int\mathcal{N}(\theta;m_{k},C_{k})\Bigl{(}\log\frac{\rho^{\rm GM}}{\rho_{\rm post}}+1\Bigr{)}\mathrm{d}\theta. (69c)

The natural gradient descent algorithm uses the finite-dimensional version of the Fisher-Rao metric tensor in parameter space, also known as the Fisher information matrix, which has the form

FI(a)=aρGM(θ;a)aρGM(θ;a)ρGM(θ;a)dθ{\rm FI}(a)=\int\frac{\nabla_{a}\rho^{\rm GM}(\theta;a)\otimes\nabla_{a}\rho^{\rm GM}(\theta;a)}{\rho^{\rm GM}(\theta;a)}\mathrm{d}\theta (70)

as the preconditioner for gradient descent. Here, we write down its continuous limit, namely the natural gradient flow. To do so, we take the perspective of the proximal point method and consider

an+1=\displaystyle a_{n+1}= argminaKL[ρGM(;a)ρpost]+12Δtaan,FI(an)(aan),\displaystyle\operatorname*{arg\,min}_{a}{\rm KL}[\rho^{\rm GM}(\cdot;a)\|\rho_{\rm post}]+\frac{1}{2\Delta t}\bigl{\langle}a-a_{n},{\rm FI}(a_{n})(a-a_{n})\bigr{\rangle}, (71a)
subjecttok=1Kwn+1,k=1.\displaystyle\mathrm{\ \ subject\ to\ }\sum_{k=1}^{K}w_{n+1,k}=1. (71b)

Using the derivative formulas in Eq. 69, a Lagrange multiplier to handle the constraint in the above optimization, and taking Δt→0, we arrive at the following natural gradient flow

[m˙kC˙kw˙k]=(FI(a))1[wk𝒩(θ;mk,Ck)(θlogρGMθlogρpost)dθwk2𝒩(θ;mk,Ck)(θθlogρGMθθlogρpost)dθ(𝒩(θ;mk,Ck)ρGM)log(ρGMρpost)dθ].\displaystyle\begin{bmatrix}\dot{m}_{k}\\ \dot{C}_{k}\\ \dot{w}_{k}\end{bmatrix}=({\rm FI}(a))^{-1}\begin{bmatrix}-w_{k}\int\mathcal{N}(\theta;m_{k},C_{k})\Bigl{(}\nabla_{\theta}\log\rho^{\rm GM}-\nabla_{\theta}\log\rho_{\rm post}\Bigr{)}\mathrm{d}\theta\\ -\frac{w_{k}}{2}\int\mathcal{N}(\theta;m_{k},C_{k})\Bigl{(}\nabla_{\theta}\nabla_{\theta}\log\rho^{\rm GM}-\nabla_{\theta}\nabla_{\theta}\log\rho_{\rm post}\Bigr{)}\mathrm{d}\theta\\ -\int\bigl{(}\mathcal{N}(\theta;m_{k},C_{k})-\rho^{\rm GM}\bigr{)}\log\bigl{(}\frac{\rho^{\rm GM}}{\rho_{\rm post}}\bigr{)}\mathrm{d}\theta\end{bmatrix}. (72)

Computation of FI(a) is costly. For better efficiency, diagonal approximations of the Fisher information matrix have been used in the literature [33], which lead to

FI(a)diag(w1C11,,wkCk1,,wKCK1,w1X1,,wkXk,,wKXK,1w1,,1wk,,1wK).{\rm FI}(a)\approx\textrm{diag}\left(w_{1}C_{1}^{-1},...,w_{k}C_{k}^{-1},...,w_{K}C_{K}^{-1},w_{1}X_{1},...,w_{k}X_{k},...,w_{K}X_{K},\frac{1}{w_{1}},...,\frac{1}{w_{k}},...,\frac{1}{w_{K}}\right). (73)

where each X_k is a fourth-order tensor satisfying

XkY=14Ck1(Y+YT)Ck1,YNθ×Nθ.X_{k}Y=\frac{1}{4}C_{k}^{-1}(Y+Y^{T})C_{k}^{-1},\quad\forall\ Y\in\mathbb{R}^{N_{\theta}\times N_{\theta}}. (74)

Bringing the approximated Fisher information matrix Eq. 73 into the natural gradient flow Eq. 72 leads to the following equation:

m˙k=Ck𝒩(θ;mk,Ck)(θlogρGMθlogρpost)dθ,C˙k=Ck(𝒩(θ;mk,Ck)(θθlogρGMθθlogρpost)dθ)Ck,w˙k=wk(𝒩(θ;mk,Ck)ρGM)log(ρGMρpost)dθ.\begin{split}\dot{m}_{k}&=-C_{k}\int\mathcal{N}(\theta;m_{k},C_{k})\Bigl{(}\nabla_{\theta}\log\rho^{\rm GM}-\nabla_{\theta}\log\rho_{\rm post}\Bigr{)}\mathrm{d}\theta,\\ \dot{C}_{k}&=-C_{k}\Bigl{(}\int\mathcal{N}(\theta;m_{k},C_{k})\bigl{(}\nabla_{\theta}\nabla_{\theta}\log\rho^{\rm GM}-\nabla_{\theta}\nabla_{\theta}\log\rho_{\rm post}\bigr{)}\mathrm{d}\theta\Bigr{)}C_{k},\\ \dot{w}_{k}&=-w_{k}\int\Bigl{(}\mathcal{N}(\theta;m_{k},C_{k})-\rho^{\rm GM}\Bigr{)}\log\bigl{(}\frac{\rho^{\rm GM}}{\rho_{\rm post}}\bigr{)}\mathrm{d}\theta.\end{split} (75)

This is the natural gradient flow with diagonal approximations of the Fisher information matrix; its discretization is the approximate natural gradient descent algorithm that has been used in the literature.

By comparing Eq. 75 with the continuous-time limit of our GMKI as presented in Eq. 35, we observe that GMKI can be seen as a derivative-free approximation of the approximate natural gradient descent. The approximation is made through stochastic linearization of \nabla_{\theta}\log\rho_{\rm post} and \nabla_{\theta}\nabla_{\theta}\log\rho_{\rm post}, based on the Kalman methodology explained in Remark 1.

For completeness, in the following we present a derivation of the diagonal approximations of the Fisher information matrix (73). Let \mathcal{N}_{k} denote \mathcal{N}(\theta;m_{k},C_{k}) and let \delta_{k,i} denote the Kronecker delta, which equals one if and only if k=i. We obtain Eq. 73 by keeping only the diagonal blocks of FI(a) and approximating these diagonals under the assumption that the different Gaussian components are well separated. More precisely, for the diagonal block regarding the weights \{w_{k}\}, we have

wkρGMwiρGMρGMdθ=𝒩k𝒩iρGMdθδk,i𝒩k𝒩kwk𝒩kdθ=δk,iwk.\int\frac{\nabla_{w_{k}}\rho^{\rm GM}\otimes\nabla_{w_{i}}\rho^{\rm GM}}{\rho^{\rm GM}}\mathrm{d}\theta=\int\frac{\mathcal{N}_{k}\mathcal{N}_{i}}{\rho^{\rm GM}}\mathrm{d}\theta\approx\delta_{k,i}\int\frac{\mathcal{N}_{k}\mathcal{N}_{k}}{w_{k}\mathcal{N}_{k}}\mathrm{d}\theta=\frac{\delta_{k,i}}{w_{k}}. (76)

Here we substitute \rho^{\rm GM} by w_{k}\mathcal{N}_{k} inside integrals against \mathcal{N}_{k}; we will use this approximation repeatedly in the derivations below.

For the diagonal block regarding the mean mkm_{k}, we have

mkρGMmiρGMρGMdθ=\displaystyle\int\frac{\nabla_{m_{k}}\rho^{\rm GM}\otimes\nabla_{m_{i}}\rho^{\rm GM}}{\rho^{\rm GM}}\mathrm{d}\theta= wkwi𝒩k𝒩iCk1(θmk)(θmi)TCi1ρGMdθ\displaystyle\int\frac{w_{k}w_{i}\mathcal{N}_{k}\mathcal{N}_{i}C_{k}^{-1}(\theta-m_{k})(\theta-m_{i})^{T}C_{i}^{-1}}{\rho^{\rm GM}}\mathrm{d}\theta (77)
\displaystyle\approx δk,iwk2𝒩k2Ck1(θmk)(θmk)TCk1wk𝒩kdθ\displaystyle\delta_{k,i}\int\frac{w_{k}^{2}\mathcal{N}_{k}^{2}C_{k}^{-1}(\theta-m_{k})(\theta-m_{k})^{T}C_{k}^{-1}}{w_{k}\mathcal{N}_{k}}\mathrm{d}\theta
=\displaystyle= δk,iwkCk1.\displaystyle\delta_{k,i}w_{k}C_{k}^{-1}.

For the diagonal block regarding the covariance Ck{C_{k}}, we have

CkρGMCiρGMρGMdθ=wkwi𝒩k𝒩i(Ck1(θmk)(θmk)TCk1Ck1)(Ci1(θmi)(θmi)TCi1Ci1)4ρGMdθδk,iwk2𝒩k2(Ck1(θmk)(θmk)TCk1Ck1)(Ck1(θmk)(θmk)TCk1Ck1)4wk𝒩kdθ=δk,i4wk𝒩k(Ck1(θmk)(θmk)TCk1Ck1)(Ck1(θmk)(θmk)TCk1Ck1)dθ.\begin{split}\int&\frac{\nabla_{C_{k}}\rho^{\rm GM}\otimes\nabla_{C_{i}}\rho^{\rm GM}}{\rho^{\rm GM}}\mathrm{d}\theta\\ =&\int\frac{w_{k}w_{i}\mathcal{N}_{k}\mathcal{N}_{i}\Bigl{(}C_{k}^{-1}(\theta-m_{k})(\theta-m_{k})^{T}C_{k}^{-1}-C_{k}^{-1}\Bigr{)}\otimes\Bigl{(}C_{i}^{-1}(\theta-m_{i})(\theta-m_{i})^{T}C_{i}^{-1}-C_{i}^{-1}\Bigr{)}}{4\rho^{\rm GM}}\mathrm{d}\theta\\ \approx&\delta_{k,i}\int\frac{w_{k}^{2}\mathcal{N}_{k}^{2}\Bigl{(}C_{k}^{-1}(\theta-m_{k})(\theta-m_{k})^{T}C_{k}^{-1}-C_{k}^{-1}\Bigr{)}\otimes\Bigl{(}C_{k}^{-1}(\theta-m_{k})(\theta-m_{k})^{T}C_{k}^{-1}-C_{k}^{-1}\Bigr{)}}{4w_{k}\mathcal{N}_{k}}\mathrm{d}\theta\\ =&\frac{\delta_{k,i}}{4}\int w_{k}\mathcal{N}_{k}\Bigl{(}C_{k}^{-1}(\theta-m_{k})(\theta-m_{k})^{T}C_{k}^{-1}-C_{k}^{-1}\Bigr{)}\otimes\Bigl{(}C_{k}^{-1}(\theta-m_{k})(\theta-m_{k})^{T}C_{k}^{-1}-C_{k}^{-1}\Bigr{)}\mathrm{d}\theta.\end{split} (78)

It is worth noting that Eq. 78 defines a fourth-order tensor. To gain a more detailed understanding of this term, let us denote

Xk:=14𝒩k(Ck1(θmk)(θmk)TCk1Ck1)(Ck1(θmk)(θmk)TCk1Ck1)dθ=14𝒩(y;0,I)Ck1/2(yyTI)Ck1/2Ck1/2(yyTI)Ck1/2dy where y=Ck1/2(θmk).\begin{split}X_{k}:=&\frac{1}{4}\int\mathcal{N}_{k}\Bigl{(}C_{k}^{-1}(\theta-m_{k})(\theta-m_{k})^{T}C_{k}^{-1}-C_{k}^{-1}\Bigr{)}\otimes\Bigl{(}C_{k}^{-1}(\theta-m_{k})(\theta-m_{k})^{T}C_{k}^{-1}-C_{k}^{-1}\Bigr{)}\mathrm{d}\theta\\ =&\frac{1}{4}\int\mathcal{N}(y;0,I)C_{k}^{-1/2}\Bigl{(}yy^{T}-I\Bigr{)}C_{k}^{-1/2}\otimes C_{k}^{-1/2}\Bigl{(}yy^{T}-I\Bigr{)}C_{k}^{-1/2}\mathrm{d}y\\ &\qquad\text{ where }y=C_{k}^{-1/2}(\theta-m_{k}).\end{split} (79)

We can show that X_k satisfies Eq. 74. To do so, note that the (ij,lm) entry of X_k has the form

Xk[ij,lm]=\displaystyle X_{k}[ij,lm]= 14r,s,p,qCk1/2[i,r]Ck1/2[j,s]Ck1/2[l,p]Ck1/2[m,q](yrysδr,s)(ypyqδp,q)𝒩(y;0,I)dy\displaystyle\frac{1}{4}\sum_{r,s,p,q}C_{k}^{-1/2}[i,r]C_{k}^{-1/2}[j,s]C_{k}^{-1/2}[l,p]C_{k}^{-1/2}[m,q]\int(y_{r}y_{s}-\delta_{r,s})(y_{p}y_{q}-\delta_{p,q})\mathcal{N}(y;0,I)\mathrm{d}y
=\displaystyle= 14r,s,p,qCk1/2[i,r]Ck1/2[j,s]Ck1/2[l,p]Ck1/2[m,q](δr,pδs,q+δr,qδs,p)\displaystyle\frac{1}{4}\sum_{r,s,p,q}C_{k}^{-1/2}[i,r]C_{k}^{-1/2}[j,s]C_{k}^{-1/2}[l,p]C_{k}^{-1/2}[m,q]\bigl{(}\delta_{r,p}\delta_{s,q}+\delta_{r,q}\delta_{s,p}\bigr{)}
=\displaystyle= 14(Ck1[i,l]Ck1[j,m]+Ck1[i,m]Ck1[j,l]).\displaystyle\frac{1}{4}\Bigl{(}C_{k}^{-1}[i,l]C_{k}^{-1}[j,m]+C_{k}^{-1}[i,m]C_{k}^{-1}[j,l]\Bigr{)}.

Therefore,

(XkY)ij\displaystyle(X_{k}Y)_{ij} =l,mXk[ij,lm]Y[l,m]\displaystyle=\sum_{l,m}X_{k}[ij,lm]Y[l,m] (80)
=14l,m(Ck1[i,l]Y[l,m]Ck1[j,m]+Ck1[i,m]Y[l,m]Ck1[j,l])\displaystyle=\frac{1}{4}\sum_{l,m}\Bigl{(}C_{k}^{-1}[i,l]Y[l,m]C_{k}^{-1}[j,m]+C_{k}^{-1}[i,m]Y[l,m]C_{k}^{-1}[j,l]\Bigr{)}
=14(Ck1YCk1+Ck1YTCk1)ij.\displaystyle=\frac{1}{4}(C_{k}^{-1}YC_{k}^{-1}+C_{k}^{-1}Y^{T}C_{k}^{-1})_{ij}.

The proof is complete.
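As a numerical illustration of this identity, the following is a minimal sketch (in Python with NumPy; the dimension, the covariance, the test matrix, and the sample size are arbitrary choices) that estimates X_k from Eq. 79 by Monte Carlo and compares its action on a test matrix Y with the closed form Eq. 74; the two agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# A fixed symmetric positive definite covariance C_k and a test matrix Y (arbitrary choices).
A = rng.normal(size=(d, d))
C = A @ A.T + d * np.eye(d)
Y = rng.normal(size=(d, d))

# C^{-1/2} and C^{-1} via symmetric eigendecomposition.
vals, vecs = np.linalg.eigh(C)
C_inv_half = vecs @ np.diag(vals**-0.5) @ vecs.T
C_inv = vecs @ np.diag(1.0 / vals) @ vecs.T

# Monte Carlo estimate of (X_k Y) with X_k from Eq. 79:
# (X_k Y)_{ij} = (1/4) E[ A(y)_{ij} <A(y), Y> ], A(y) = C^{-1/2}(y y^T - I)C^{-1/2}, y ~ N(0, I).
n_samples = 200_000
ys = rng.normal(size=(n_samples, d))
M = np.einsum("ni,nj->nij", ys, ys) - np.eye(d)
Ay = C_inv_half @ M @ C_inv_half                  # batch of matrices A(y), shape (n, d, d)
coef = np.einsum("nij,ij->n", Ay, Y)              # Frobenius inner products <A(y), Y>
XY_mc = (Ay * coef[:, None, None]).mean(axis=0) / 4.0

# Closed form Eq. 74: X_k Y = (1/4) C^{-1}(Y + Y^T)C^{-1}.
XY_exact = 0.25 * C_inv @ (Y + Y.T) @ C_inv

print("relative error:", np.linalg.norm(XY_mc - XY_exact) / np.linalg.norm(XY_exact))
```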

Appendix D Two-Dimensional Four-modal Problem

In this appendix, we consider a 2D four-modal inverse problem, associated with the forward model

y=𝒢(θ)+η with y=[4.22974.2297],𝒢(θ)=[(θ(1)θ(2))2(θ(1)+θ(2))2].y=\mathcal{G}(\theta)+\eta\quad\textrm{ with }\quad y=\begin{bmatrix}4.2297\\ 4.2297\end{bmatrix},\quad\mathcal{G}(\theta)=\begin{bmatrix}(\theta_{(1)}-\theta_{(2)})^{2}\\ (\theta_{(1)}+\theta_{(2)})^{2}\end{bmatrix}.

Here θ=[θ(1),θ(2)]T\theta=[\theta_{(1)},\theta_{(2)}]^{T}. We assume the noise distribution is η𝒩(0,I)\eta\sim\mathcal{N}(0,I) and consider the prior distribution ρprior𝒩([0.5,0]T,I)\rho_{\rm prior}\sim\mathcal{N}([0.5,0]^{T},I). The reference posterior distribution has four modes with varying weights. We apply GMKI with K=3K=3 and 66 modes, randomly initialized based on the prior distribution, and assign equal weights to these components. The estimated posterior distributions obtained by GMKI at the 30th iteration, along with the convergence in terms of total variation distance, are shown in Fig. 9. When K=3K=3 only three target modes are captured; when K=6K=6, all target modes are captured, and the approximation error becomes significantly smaller.

Figure 9: Two-dimensional four-modal problem. From left to right: reference posterior distribution (left), posterior distributions estimated by 3 and 6-modal GMKI (middle) at the 30th iteration (means mkm_{k} (colored) and initial means (black) are marked), and total variation distance between the reference posterior distribution and the posterior distributions estimated by the GMKI (right).

Appendix E Limitation of the GMKI

There can be multimodal problems whose many modes concentrate on a low-dimensional manifold, and GMKI may fail in such cases. To illustrate, we consider the posterior distribution proportional to $\exp(-\Phi_{R}(\theta))$ on $\mathbb{R}^{2}$, where $\Phi_{R}(\theta)=\frac{(1-\theta_{(1)}^{2}-\theta_{(2)}^{2})^{2}}{2\sigma_{\eta}^{2}}$ and $\sigma_{\eta}=0.3$. The mass is distributed along the unit circle, as depicted on the left of Fig. 10. We sample this density with GMKI and with Gaussian mixture variational inference (GMVI), described in Eq. 75. Since GMKI can be seen as a derivative-free approximation of GMVI, this study isolates the effect of the derivative-free approximation step. We initialize both methods with 10 Gaussian components whose means are randomly sampled from $\mathcal{N}(0,I)$ and whose covariances are the identity $I$.

The sampling results obtained by GMKI at the 30th iteration are presented in the middle of Fig. 10. While the means of the Gaussian components migrate towards the unit circle, the covariance of each Gaussian component becomes elongated in the tangent direction; consequently, the overall approximation of the target distribution is inaccurate. The covariances of these Gaussian components resemble those arising in the Laplace approximation. Indeed, the Laplace approximation at any maximum a posteriori (MAP) point of the target distribution has the form $\mathcal{N}(\theta;m,H^{\dagger})$; here, $m=[m_{(1)},m_{(2)}]^{T}$ lies on the unit circle and

H=-\nabla_{\theta}\nabla_{\theta}\log\rho_{\rm post}(\theta)\big|_{\theta=m}=\nabla_{\theta}\nabla_{\theta}\Phi_{R}(\theta)\big|_{\theta=m}=\frac{4}{\sigma_{\eta}^{2}}\begin{bmatrix}m_{(1)}^{2}&m_{(1)}m_{(2)}\\ m_{(1)}m_{(2)}&m_{(2)}^{2}\end{bmatrix}. (81)

It is worth mentioning that $H$ is singular: its null space is the tangent direction of the unit circle at $m$. Hence the Laplace approximation is degenerate and concentrates on the tangent line to the unit circle at $m$.
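The singularity of $H$ can be checked numerically; the following small sketch (illustrative, using a finite-difference Hessian rather than the paper's implementation) verifies that the Hessian of $\Phi_{R}$ at a point $m$ on the unit circle equals $\frac{4}{\sigma_{\eta}^{2}}mm^{T}$ and has a zero eigenvalue.

\begin{verbatim}
# Check (illustrative) that the Laplace Hessian at m on the unit circle equals
# (4/sigma_eta^2) m m^T and is singular along the tangent direction.
import numpy as np

sigma_eta = 0.3

def Phi_R(theta):
    return (1.0 - theta[0] ** 2 - theta[1] ** 2) ** 2 / (2.0 * sigma_eta ** 2)

def hessian_fd(f, x, h=1e-4):
    # central finite-difference approximation of the Hessian of f at x
    d = len(x)
    H = np.zeros((d, d))
    E = h * np.eye(d)
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(x + E[i] + E[j]) - f(x + E[i] - E[j])
                       - f(x - E[i] + E[j]) + f(x - E[i] - E[j])) / (4.0 * h ** 2)
    return H

m = np.array([np.cos(0.7), np.sin(0.7)])          # an arbitrary point on the unit circle
H_exact = (4.0 / sigma_eta ** 2) * np.outer(m, m)
print(np.max(np.abs(hessian_fd(Phi_R, m) - H_exact)))  # small (finite-difference error)
print(np.linalg.eigvalsh(H_exact))                     # one eigenvalue is zero
\end{verbatim}

The null eigenvector of $H$ is proportional to $[-m_{(2)},m_{(1)}]^{T}$, i.e., the tangent direction, consistent with the degeneracy discussed above.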

To further explore this issue, we turn to the GMVI algorithm, which requires evaluating the gradient and Hessian of $\log\rho_{\rm post}$. We approximate the Gaussian integrals in Eq. 75 using the modified unscented transform, as detailed in [43, Eq. 37], and employ a time step of $\Delta t=0.01$, which leads to a stable numerical scheme in our implementation. The outcome of GMVI at the 1000th iteration is depicted on the right of Fig. 10; the result matches the reference well.

Figure 10: Circle-shaped posterior: reference posterior distribution (left), posterior distribution obtained by GMKI with 10 Gaussian components (middle), and posterior distribution obtained by GMVI with 10 Gaussian components (right). The means $m_{k}$ of the Gaussian components are marked. In the middle panel the color appears lighter because the distribution is concentrated along a thin, elongated line, resulting in a visually diluted effect.

Since the main difference between GMVI (Eq. 75) and the continuous-time dynamics of GMKI is the derivative-free Kalman approximation of the gradient terms $\nabla_{\theta}\log\rho_{\rm post}$ and $\nabla_{\theta}\nabla_{\theta}\log\rho_{\rm post}$, we conclude that the Kalman approximation step is responsible for the failure of GMKI in sampling the above distribution. Investigating other derivative-free approximations that circumvent this failure is a goal of future study.

References

  • [1] Jari Kaipio and Erkki Somersalo. Statistical and computational inverse problems, volume 160. Springer Science & Business Media, 2006.
  • [2] Andrew M Stuart. Inverse problems: A Bayesian perspective. Acta numerica, 19:451–559, 2010.
  • [3] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov chain Monte Carlo. CRC press, 2011.
  • [4] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
  • [5] Nicolas Chopin, Omiros Papaspiliopoulos, et al. An introduction to sequential Monte Carlo, volume 4. Springer, 2020.
  • [6] Claudia Tebaldi, Richard L Smith, Doug Nychka, and Linda O Mearns. Quantifying uncertainty in projections of regional climate change: A Bayesian approach to the analysis of multimodel ensembles. Journal of Climate, 18(10):1524–1540, 2005.
  • [7] Véronique Gayrard, Anton Bovier, Michael Eckhoff, and Markus Klein. Metastability in reversible diffusion processes i: Sharp asymptotics for capacities and exit times. Journal of the European Mathematical Society, 6(4):399–424, 2004.
  • [8] Véronique Gayrard, Anton Bovier, and Markus Klein. Metastability in reversible diffusion processes ii: Precise asymptotics for small eigenvalues. Journal of the European Mathematical Society, 7(1):69–99, 2005.
  • [9] Mrinal K Sen and Paul L Stoffa. Global optimization methods in geophysical inversion. Cambridge University Press, 2013.
  • [10] Tapio Schneider, Shiwei Lan, Andrew Stuart, and Joao Teixeira. Earth system modeling 2.0: A blueprint for models that learn from observations and targeted high-resolution simulations. Geophysical Research Letters, 44(24):12–396, 2017.
  • [11] Charles S Peskin. Numerical analysis of blood flow in the heart. Journal of computational physics, 25(3):220–252, 1977.
  • [12] Daniel Z Huang, Dante De Santis, and Charbel Farhat. A family of position-and orientation-independent embedded boundary methods for viscous flow and fluid–structure interaction problems. Journal of Computational Physics, 365:74–104, 2018.
  • [13] Daniel Z Huang, Philip Avery, Charbel Farhat, Jason Rabinovitch, Armen Derkevorkian, and Lee D Peterson. Modeling, simulation and validation of supersonic parachute inflation dynamics during Mars landing. In AIAA Scitech 2020 Forum, page 0313, 2020.
  • [14] Shunxiang Cao and Daniel Zhengyu Huang. Bayesian calibration for large-scale fluid structure interaction problems under embedded/immersed boundary framework. International Journal for Numerical Methods in Engineering, 2022.
  • [15] Marsha J Berger, Phillip Colella, et al. Local adaptive mesh refinement for shock hydrodynamics. Journal of computational Physics, 82(1):64–84, 1989.
  • [16] Raunak Borker, Daniel Huang, Sebastian Grimberg, Charbel Farhat, Philip Avery, and Jason Rabinovitch. Mesh adaptation framework for embedded boundary methods for computational fluid dynamics and fluid-structure interaction. International Journal for Numerical Methods in Fluids, 90(8):389–424, 2019.
  • [17] Nicolas Moës, John Dolbow, and Ted Belytschko. A finite element method for crack growth without remeshing. International journal for numerical methods in engineering, 46(1):131–150, 1999.
  • [18] Zhihong Tan, Colleen M Kaul, Kyle G Pressel, Yair Cohen, Tapio Schneider, and João Teixeira. An extended eddy-diffusivity mass-flux scheme for unified representation of subgrid-scale turbulence and convection. Journal of Advances in Modeling Earth Systems, 10(3):770–800, 2018.
  • [19] Ignacio Lopez-Gomez, Costa D Christopoulos, Haakon Ludvig Ervik, Oliver R Dunbar, Yair Cohen, and Tapio Schneider. Training physics-based machine-learning parameterizations with gradient-free ensemble Kalman methods. 2022.
  • [20] Nicolas Garcia Trillos and Daniel Sanz-Alonso. The Bayesian update: Variational formulations and gradient flows. Bayesian Analysis, 15(1):29–56, 2020.
  • [21] N Garcia Trillos, B Hosseini, and D Sanz-Alonso. From optimization to sampling through gradient flows. Notices of the American Mathematical Society, 70(6), 2023.
  • [22] Yifan Chen, Daniel Zhengyu Huang, Jiaoyang Huang, Sebastian Reich, and Andrew M Stuart. Sampling via gradient flows in the space of probability measures. arXiv preprint arXiv:2310.03597, 2023.
  • [23] Yulong Lu, Jianfeng Lu, and James Nolen. Accelerating Langevin sampling with birth-death. arXiv preprint arXiv:1905.09863, 2019.
  • [24] Yulong Lu, Dejan Slepčev, and Lihan Wang. Birth-death dynamics for sampling: Global convergence, approximations and their asymptotics. arXiv preprint arXiv:2211.00450, 2022.
  • [25] Carles Domingo-Enrich and Aram-Alexandre Pooladian. An explicit expansion of the Kullback-Leibler divergence along its Fisher-Rao gradient flow. arXiv preprint arXiv:2302.12229, 2023.
  • [26] Lezhi Tan and Jianfeng Lu. Accelerate Langevin sampling with birth-death process and exploration component. arXiv preprint arXiv:2305.05529, 2023.
  • [27] Yifan Chen, Daniel Zhengyu Huang, Jiaoyang Huang, Sebastian Reich, and Andrew M Stuart. Gradient flows for sampling: Mean-field models, Gaussian approximations and affine invariance. arXiv preprint arXiv:2302.11024, 2023.
  • [28] Yifan Chen and Wuchen Li. Optimal transport natural gradient for statistical manifolds with continuous sample space. Information Geometry, 3(1):1–32, 2020.
  • [29] Marc Lambert, Sinho Chewi, Francis Bach, Silvère Bonnabel, and Philippe Rigollet. Variational inference via wasserstein gradient flows. arXiv preprint arXiv:2205.15902, 2022.
  • [30] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
  • [31] James Martens. New insights and perspectives on the natural gradient method. The Journal of Machine Learning Research, 21(1):5776–5851, 2020.
  • [32] Guodong Zhang, James Martens, and Roger B Grosse. Fast convergence of natural gradient descent for over-parameterized neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • [33] Wu Lin, Mohammad Emtiyaz Khan, and Mark Schmidt. Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. In International Conference on Machine Learning, pages 3992–4002. PMLR, 2019.
  • [34] Tom Huix, Anna Korba, Alain Durmus, and Eric Moulines. Theoretical guarantees for variational inference with fixed-variance mixture of gaussians. arXiv preprint arXiv:2406.04012, 2024.
  • [35] Yan Chen and Dean S Oliver. Ensemble randomized maximum likelihood method as an iterative ensemble smoother. Mathematical Geosciences, 44(1):1–26, 2012.
  • [36] Alexandre A Emerick and Albert C Reynolds. Investigation of the sampling performance of ensemble-based methods with a simple reservoir model. Computational Geosciences, 17(2):325–350, 2013.
  • [37] Marco A Iglesias, Kody JH Law, and Andrew M Stuart. Ensemble Kalman methods for inverse problems. Inverse Problems, 29(4):045001, 2013.
  • [38] Sahani Pathiraja and Sebastian Reich. Discrete gradients for computational Bayesian inference. arXiv preprint arXiv:1903.00186, 2019.
  • [39] Neil K Chada, Andrew M Stuart, and Xin T Tong. Tikhonov regularization within ensemble Kalman inversion. SIAM Journal on Numerical Analysis, 58(2):1263–1294, 2020.
  • [40] Tapio Schneider, Andrew M Stuart, and Jin-Long Wu. Ensemble Kalman inversion for sparse learning of dynamical systems from time-averaged data. arXiv preprint arXiv:2007.06175, 2020.
  • [41] Daniel Zhengyu Huang, Tapio Schneider, and Andrew M Stuart. Iterated Kalman methodology for inverse problems. Journal of Computational Physics, page 111262, 2022.
  • [42] Edoardo Calvello, Sebastian Reich, and Andrew M Stuart. Ensemble Kalman methods: a mean field perspective. arXiv preprint arXiv:2209.11371, 2022.
  • [43] Daniel Zhengyu Huang, Jiaoyang Huang, Sebastian Reich, and Andrew M Stuart. Efficient derivative-free Bayesian inference for large-scale inverse problems. Inverse Problems, 38(12):125006, 2022.
  • [44] Arnaud Doucet, Adam M Johansen, et al. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of nonlinear filtering, 12(656-704):3, 2009.
  • [45] Ilja Klebanov and Timothy John Sullivan. Transporting higher-order quadrature rules: Quasi-monte carlo points and sparse grids for mixture distributions. arXiv preprint arXiv:2308.10081, 2023.
  • [46] Aimee Maurais and Youssef Marzouk. Sampling in unit time with kernel Fisher-Rao flow. arXiv preprint arXiv:2401.03892, 2024.
  • [47] Nikolas Nüsken. Stein transport for Bayesian inference. arXiv preprint arXiv:2409.01464, 2024.
  • [48] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  • [49] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
  • [50] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • [51] Matias Quiroz, David J Nott, and Robert Kohn. Gaussian variational approximation for high-dimensional state space models. arXiv preprint arXiv:1801.07873, 2018.
  • [52] Mohammad Khan and Wu Lin. Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models. In Artificial Intelligence and Statistics, pages 878–887. PMLR, 2017.
  • [53] Théo Galy-Fajou, Valerio Perrone, and Manfred Opper. Flexible and efficient inference with particles for the variational Gaussian approximation. Entropy, 23:990, 2021.
  • [54] Caroline Lasser and Christian Lubich. Computing quantum dynamics in the semiclassical regime. Acta Numerica, 29:229–401, 2020.
  • [55] William Anderson and Mohammad Farazmand. Fisher information and shape-morphing modes for solving the Fokker–Planck equation in higher dimensions. Applied Mathematics and Computation, 467:128489, 2024.
  • [56] Huan Zhang, Yifan Chen, Eric Vanden-Eijnden, and Benjamin Peherstorfer. Sequential-in-time training of nonlinear parametrizations for solving time-dependent partial differential equations. arXiv preprint arXiv:2404.01145, 2024.
  • [57] C Radhakrishna Rao. Information and the accuracy attainable in the estimation of statistical parameters. In Breakthroughs in statistics, pages 235–247. Springer, 1992.
  • [58] Nikolai Nikolaevich Cencov. Statistical decision rules and optimal inference. Number 53. American Mathematical Soc., 2000.
  • [59] Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer. Information geometry and sufficient statistics. Probability Theory and Related Fields, 162(1):327–364, 2015.
  • [60] Martin Bauer, Martins Bruveris, and Peter W Michor. Uniqueness of the Fisher–Rao metric on the space of smooth densities. Bulletin of the London Mathematical Society, 48(3):499–506, 2016.
  • [61] Yuling Yan, Kaizheng Wang, and Philippe Rigollet. Learning Gaussian mixtures using the wasserstein-fisher-rao gradient flow. arXiv preprint arXiv:2301.01766, 2023.
  • [62] L Wang and N Nüsken. Measure transport with kernel mean embeddings. arXiv preprint arXiv:2401.12967, 2024.
  • [63] Oliver G Ernst, Björn Sprungk, and Hans-Jörg Starkloff. Analysis of the ensemble and polynomial chaos Kalman filters in Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification, 3(1):823–851, 2015.
  • [64] Alfredo Garbuno-Inigo, Franca Hoffmann, Wuchen Li, and Andrew M Stuart. Interacting Langevin diffusions: Gradient structure and ensemble Kalman sampler. SIAM Journal on Applied Dynamical Systems, 19(1):412–441, 2020.
  • [65] Alfredo Garbuno-Inigo, Nikolas Nüsken, and Sebastian Reich. Affine invariant interacting Langevin dynamics for Bayesian inference. SIAM Journal on Applied Dynamical Systems, 19(3):1633–1658, 2020.
  • [66] Daniel Alspach and Harold Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximations. IEEE transactions on automatic control, 17(4):439–448, 1972.
  • [67] Kazufumi Ito and Kaiqi Xiong. Gaussian filters for nonlinear filtering problems. IEEE transactions on automatic control, 45(5):910–927, 2000.
  • [68] Rong Chen and Jun S Liu. Mixture Kalman filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(3):493–508, 2000.
  • [69] Sebastian Reich. A Gaussian-mixture ensemble transform filter. Quarterly Journal of the Royal Meteorological Society, 138(662):222–233, 2012.
  • [70] Ruoxia Li, Vinay Prasad, and Biao Huang. Gaussian mixture model-based ensemble Kalman filtering for state and parameter estimation for a PMMA process. Processes, 4(2):9, 2016.
  • [71] Rui Fan, Renke Huang, and Ruisheng Diao. Gaussian mixture model-based ensemble Kalman filter for machine parameter calibration. IEEE Transactions on Energy Conversion, 33(3):1597–1599, 2018.
  • [72] Dario Grana, Torstein Fjeldstad, and Henning Omre. Bayesian Gaussian mixture linear inversion for geophysical inverse problems. Mathematical Geosciences, 49:493–515, 2017.
  • [73] Yuming Ba and Lijian Jiang. A residual-driven adaptive Gaussian mixture approximation for Bayesian inverse problems. Journal of Computational and Applied Mathematics, 399:113707, 2022.
  • [74] Rudolph Van Der Merwe and Eric Wan. Gaussian mixture sigma-point particle filters for sequential probabilistic inference in dynamic state-space models. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03)., volume 6, pages VI–701. IEEE, 2003.
  • [75] Keston W Smith. Cluster ensemble Kalman filter. Tellus A: Dynamic Meteorology and Oceanography, 59(5):749–757, 2007.
  • [76] Andreas S Stordal, Hans A Karlsen, Geir Nævdal, Hans J Skaug, and Brice Vallès. Bridging the ensemble Kalman filter and particle filters: the adaptive Gaussian mixture filter. Computational Geosciences, 15:293–305, 2011.
  • [77] Ibrahim Hoteit, Xiaodong Luo, and Dinh-Tuan Pham. Particle Kalman filtering: A nonlinear Bayesian framework for ensemble Kalman filters. Monthly Weather Review, 140(2):528–542, 2012.
  • [78] Marco Frei and Hans R Künsch. Mixture ensemble Kalman filters. Computational Statistics & Data Analysis, 58:127–138, 2013.
  • [79] Thomas Bengtsson, Chris Snyder, and Doug Nychka. Toward a nonlinear ensemble filter for high-dimensional systems. Journal of Geophysical Research: Atmospheres, 108(D24), 2003.
  • [80] Alexander Y Sun, Alan P Morris, and Sitakanta Mohanty. Sequential updating of multimodal hydrogeologic parameter fields using localization and clustering techniques. Water Resources Research, 45(7), 2009.
  • [81] Andreas S Stordal, Hans A Karlsen, Geir Nævdal, Dean S Oliver, and Hans J Skaug. Filtering with state space localized Kalman gain. Physica D: Nonlinear Phenomena, 241(13):1123–1135, 2012.
  • [82] José A Carrillo, Yifan Chen, Daniel Zhengyu Huang, Jiaoyang Huang, and Dongyi Wei. Fisher-Rao gradient flow: Geodesic convexity and functional inequalities. arXiv preprint arXiv:2407.15693, 2024.
  • [83] Sebastian Reich. A dynamical systems framework for intermittent data assimilation. BIT Numerical Mathematics, 51(1):235–249, 2011.
  • [84] Andrew Gelman and Xiao-Li Meng. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical science, pages 163–185, 1998.
  • [85] Radford M Neal. Annealed importance sampling. Statistics and computing, 11:125–139, 2001.
  • [86] Haoxuan Chen and Lexing Ying. Ensemble-based annealed importance sampling. arXiv preprint arXiv:2401.15645, 2024.
  • [87] Nicolas Chopin, Francesca R Crucinio, and Anna Korba. A connection between tempering and entropic mirror descent. arXiv preprint arXiv:2310.11914, 2023.
  • [88] Jonathan Goodman and Jonathan Weare. Ensemble samplers with affine invariance. Communications in applied mathematics and computational science, 5(1):65–80, 2010.
  • [89] Daniel Foreman-Mackey, David W Hogg, Dustin Lang, and Jonathan Goodman. EMCEE: The MCMC hammer. Publications of the Astronomical Society of the Pacific, 125(925):306, 2013.
  • [90] Grigoris A Pavliotis, Andrew M Stuart, and Urbain Vaes. Derivative-free Bayesian inversion using multiscale dynamics. SIAM Journal on Applied Dynamical Systems, 21(1):284–326, 2022.
  • [91] Sebastian Reich and Simon Weissmann. Fokker–Planck particle systems for Bayesian inference: Computational approaches. SIAM/ASA Journal on Uncertainty Quantification, 9(2):446–482, 2021.
  • [92] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in neural information processing systems, 29, 2016.