
Reweighted Interacting Langevin Diffusions: an Accelerated Sampling Method for Optimization

Junlong Lyu    Zhitang Chen    Wenlong Lyu    Jianye Hao
Abstract

We propose a new technique to accelerate sampling methods for solving difficult optimization problems. Our method investigates the intrinsic connection between posterior distribution sampling and optimization with Langevin dynamics, and then proposes an interacting particle scheme that approximates a Reweighted Interacting Langevin Diffusion (RILD) system. The underlying system is designed by adding a multiplicative source term to the classical Langevin operator, leading to a higher convergence rate and a more concentrated invariant measure. We analyze the convergence rate of our algorithm and its improvement over existing results in the asymptotic regime. We also design various tests to verify our theoretical results, demonstrating faster convergence and the ability to break through barriers of spurious local minima, especially in high-dimensional non-convex settings. Our algorithms and analysis shed some light on combining gradient and genetic algorithms using Partial Differential Equations (PDEs) with provable guarantees.

Optimization, Stochastic Process, Functional Analysis, Genetic Algorithm, ICML

1 Introduction

Optimization methods have natural connections to sampling methods (Dalalyan, 2017). Consider a potential function V:\mathbb{R}^{d}\to\mathbb{R} that is twice differentiable and, for simplicity, has only one simple global minimum {\bm{x}}^{\ast}=\arg\min_{{\bm{x}}\in\mathbb{R}^{d}}V({\bm{x}}). The first-order Langevin dynamics is defined by the following Stochastic Differential Equation (SDE):

d{\bm{x}}_{t}=-\nabla V({\bm{x}}_{t})dt+\sigma d\mathcal{B}_{t}, (1)

where \sigma>0 is the diffusion parameter, proportional to the square root of the temperature, and \{\mathcal{B}_{t}\}_{t\geq 0} is the standard Brownian motion in \mathbb{R}^{d}. Under certain assumptions on the drift coefficient \nabla V, it was shown that the distribution of {\bm{x}}_{t} in Eq. (1) converges exponentially to its stationary distribution, the Gibbs measure \nu_{\sigma}({\bm{x}})\propto\exp(-2V({\bm{x}})/\sigma^{2}). As \sigma\to 0, \nu_{\sigma} concentrates on the global minimum {\bm{x}}^{\ast}, leading to various Langevin-dynamics-based algorithms such as Gradient Langevin Dynamics (GLD) (Xu et al., 2017; Dalalyan, 2014) and Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011).
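For concreteness, the following is a minimal sketch (in Python, with illustrative parameter values of our own choosing, not taken from the paper's experiments) of the Euler-Maruyama discretization underlying GLD-type algorithms:

  import numpy as np

  def gld(grad_V, x0, tau=1e-2, sigma=0.5, n_steps=10_000, seed=0):
      # Euler-Maruyama discretization of Eq. (1):
      # x_{n+1} = x_n - grad V(x_n) * tau + sigma * sqrt(tau) * xi_n
      rng = np.random.default_rng(seed)
      x = np.array(x0, dtype=float)
      for _ in range(n_steps):
          x = x - tau * grad_V(x) + sigma * np.sqrt(tau) * rng.standard_normal(x.shape)
      return x

  # Toy usage: V(x) = |x|^2 / 2, whose Gibbs measure exp(-2V/sigma^2) is N(0, (sigma^2/2) I).
  sample = gld(lambda x: x, x0=np.ones(2))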

While Langevin-dynamics-based algorithms handle optimization tasks with mildly nonconvex potential functions V effectively (Raginsky et al., 2017; Zhang et al., 2017), they encounter difficulties with global optimization for highly nonconvex V, as convergence requires \sigma_{n}\to 0, and a small \sigma_{n} drastically diminishes the probability of jumping across a potential barrier. Conventional Markov Chain Monte Carlo (MCMC) methods were proposed to alleviate such problems, such as parallel tempering (Swendsen & Wang, 1986; Geyer, 1991), flat-histogram algorithms (Berg & Neuhaus, 1991), and simulated annealing (Kirkpatrick et al., 1983). Among them, Simulated Annealing (SA) provides a reasonable way to handle this problem by maintaining a population of particles to sequentially approximate \nu_{\tilde{\sigma}_{n}}({\bm{x}}) with \tilde{\sigma}_{n}\to 0. However, it requires numerous evaluations of \nabla V, which can be computationally costly, making it less attractive for optimization tasks. In this paper, we therefore aim at accelerating Langevin-dynamics-based algorithms by modifying the underlying continuous Langevin dynamics.

Recently, the Ensemble Kalman Sampler (EKS) (Garbuno-Iñigo et al., 2019; Schillings & Stuart, 2016) was suggested to accelerate convergence toward the equilibrium \nu_{\sigma}({\bm{x}}). This method approximates a preconditioned Langevin dynamics that shares the same equilibrium as the original one but enjoys a faster convergence rate for ill-posed problems. This method has another benefit when the potential function V({\bm{x}})=\|\mathcal{G}({\bm{x}})-y\|^{2} is a least-squares functional (Engl et al., 1996): the gradient of V, i.e., the Jacobian matrix of \mathcal{G}, does not need to be calculated explicitly, as the terms involving it can be approximated using only 0th-order information from the current step; see Eq. (16).

However, the EKS method offers little improvement when solving highly non-convex optimizations. Besides, as our purpose is to find the global minimum of V, restricting the stationary distribution of the underlying dynamics to be precisely \nu_{\sigma}({\bm{x}}) is not necessary. Suppose we have another \sigma-dependent dynamics whose limit distribution approximates \delta_{{\bm{x}}^{\ast}}({\bm{x}}), the Dirac delta distribution at the minimum point {\bm{x}}^{\ast}; then this dynamics can also be used to approximate the global minimum. This inspires us to further modify the Langevin dynamics for faster convergence. Specifically, we modify the Fokker-Planck equation related to the Langevin dynamics by adding a linear source term, which can be proven, by a spectral approach (Pankov, 2001), to improve the convergence rate and the concentration of the invariant measure near the global minimum. The new process belongs to a class of nonlinear operators called Feynman-Kac semigroups, which were developed in large deviation theory for calculating generating functions (Varadhan, 2010) and are also used in important practical applications such as the Diffusion Monte Carlo (DMC) method (Foulkes et al., 2001). To design a practical algorithm, we use interacting particle methods (Moral & Miclo, 2000; Moral, 2013), introducing reweighting and resampling techniques to simulate the effect of this source term, and obtain the so-called Reweighted Interacting Langevin Diffusion (RILD) algorithm. This is, as far as we know, the first time interacting particle methods have been used to solve optimization tasks. Numerical experiments demonstrate the effect in accelerating convergence and flattening potential barriers.

The main contributions are summarized as follows:

  • We provide a simple but effective way to modify the Langevin Dynamics in Section 3.1 for faster convergence and flatter potential barriers.

  • We provide a feasible discretization way to design algorithms in Section 3.2, and compare the new algorithm Alg. 1 with several existing methods, showing its advantages.

  • We provide theoretical results in Section 4 based on spectral analysis to guarantee the advantages.

2 Preliminary

2.1 Overdamped Langevin Dynamics

In this section, we introduce the classical Langevin Dynamics which is related to GLD and SGLD algorithms.

Suppose V:\mathbb{R}^{d}\to\mathbb{R} is twice differentiable and, for simplicity, has a simple global minimum {\bm{x}}^{\ast}=\arg\min_{{\bm{x}}\in\mathbb{R}^{d}}V({\bm{x}}).

The overdamped Langevin dynamics is defined as in Eq. (1). Such a dynamics is closely related to the Fokker-Planck equation. Denote by

\mathcal{L}\cdot:=-\langle\nabla V,\nabla\cdot\rangle+\frac{\sigma^{2}}{2}\Delta\cdot (2)

the infinitesimal generator (Øksendal, 1987) of the Markov process ({\bm{x}}_{t})_{t\geq 0}, and by

\mathcal{L}^{\dagger}\cdot:=\text{div}\big{(}\nabla V\cdot+\frac{\sigma^{2}}{2}\nabla\cdot\big{)} (3)

its L^{2}-adjoint operator. Let us assume the law of {\bm{x}}_{t} at time t has a density p_{t}({\bm{x}}) with respect to the Lebesgue measure. Then p_{t}({\bm{x}}) satisfies the Fokker-Planck equation:

\frac{\partial}{\partial t}p_{t}({\bm{x}})=\mathcal{L}^{\dagger}p_{t}({\bm{x}})=\text{div}\big{(}p_{t}({\bm{x}})\nabla V({\bm{x}})+\frac{\sigma^{2}}{2}\nabla p_{t}({\bm{x}})\big{)} (4)

We may write p_{t}({\bm{x}})=e^{\mathcal{L}^{\dagger}t}p_{0}({\bm{x}}), where p_{0}({\bm{x}}) denotes the initial distribution of the overdamped Langevin dynamics. It is well known that the Markov operator e^{\mathcal{L}^{\dagger}t} admits a unique invariant probability measure \nu({\bm{x}})=Z_{\nu}^{-1}e^{-2\sigma^{-2}V({\bm{x}})}, Z_{\nu}=\int_{\mathbb{R}^{d}}e^{-2\sigma^{-2}V({\bm{x}})}d{\bm{x}}. The rate of convergence to \nu({\bm{x}}) has been widely studied:

Proposition 2.1 (Proposition 2.3 in Lelièvre & Stoltz (2016)).

Under a specific condition (Assumption 4.2), for all p_{0} such that p_{0}/\nu\in L^{2}(\nu) and all t\geq 0 (here L^{2}(\mu):=\{f:\|f\|_{L^{2}(\mu)}<\infty\}, with \langle f,g\rangle_{L^{2}(\mu)}:=\int_{\mathbb{R}^{d}}f(x)g(x)\mu(x)dx and \|f\|_{L^{2}(\mu)}=\langle f,f\rangle_{L^{2}(\mu)}^{\frac{1}{2}}),

\|e^{\mathcal{L}^{\dagger}t}p_{0}/\nu-1\|_{L^{2}(\nu)}\leq\|p_{0}/\nu-1\|_{L^{2}(\nu)}e^{-\delta t},

where \delta=-\lambda_{1} is the first non-zero eigenvalue of the operator -\mathcal{L} on L^{2}(\nu).

Note that \lambda_{1} is real because \mathcal{L} is self-adjoint on L^{2}(\nu). (We say a linear operator A is self-adjoint on a Hilbert space \{\mathcal{H},\langle\cdot,\cdot\rangle_{\mathcal{H}}\} if \langle f,Ag\rangle_{\mathcal{H}}=\langle Af,g\rangle_{\mathcal{H}} for all f,g\in\mathcal{H}.)

For sampling purposes, individual samples are drawn from independent paths of a Markov chain Monte Carlo method (MCMC, (Berg, 2004)) corresponding to the discretized Langevin dynamics. This approach is rather slow for high-dimensional, non-convex problems, a weakness that makes it less attractive for optimization. Fortunately, we can exploit the connections between individual samples, as done in Garbuno-Iñigo et al. (2019). We introduce their work in the next subsection.

2.2 Interacting Langevin Diffusion

In Garbuno-Iñigo et al. (2019), the authors introduced a modified Langevin dynamics, analyzed its properties, and proved its superiority for designing sampling algorithms. We now introduce the main results of their work.

The convergence rate of classical Langevin-dynamics-based algorithms can be very slow when V is highly anisotropic. A common approach to cancel out this effect is to introduce a d\times d positive semi-definite preconditioning matrix C in the corresponding gradient scheme,

d{\bm{x}}_{t}=-C\nabla V({\bm{x}}_{t})dt+\sigma\sqrt{C}d\mathcal{B}_{t}. (5)

Here \sqrt{C} can be any d\times r real matrix U such that UU^{T}=C (note that in this case, \mathcal{B}_{t} can be reduced to an r-dimensional standard Brownian motion, as the rank of C is at most r).

The corresponding Fokker-Planck equation now becomes

\frac{\partial}{\partial t}p_{t}({\bm{x}})=\text{div}\big{(}p_{t}({\bm{x}})C\nabla V({\bm{x}})+\frac{\sigma^{2}}{2}C\nabla p_{t}({\bm{x}})\big{)} (6)

and the infinitesimal generator

\mathcal{L}_{C}\cdot:=-\langle C\nabla V,\nabla\cdot\rangle+\frac{\sigma^{2}}{2}\text{div}(C\nabla\cdot), (7)
\mathcal{L}_{C}^{\dagger}\cdot:=\text{div}\big{(}C\nabla V\cdot+\frac{\sigma^{2}}{2}C\nabla\cdot\big{)}. (8)

One can easily verify that e^{-2\sigma^{-2}V({\bm{x}})} is an invariant measure of the above system for every positive semi-definite C, and the unique positive invariant measure if C is strictly positive definite.

Finding a suitable C is of general interest. One natural choice is C=\text{Hess}\,V, the counterpart of Newton's scheme in optimization, though it is computationally unfriendly. An intrinsic benefit of choosing \text{Hess}\,V is its affine-invariance property when designing numerical schemes.

Such a property can also be preserved if we take C=\mathcal{C}(p), the covariance matrix under the probability measure p({\bm{x}}):

\mathcal{C}(p):=\int_{\mathbb{R}^{d}}\big{(}{\bm{x}}-m(p)\big{)}\otimes\big{(}{\bm{x}}-m(p)\big{)}p({\bm{x}})d{\bm{x}}, (9)
m(p):=\int_{\mathbb{R}^{d}}{\bm{x}}p({\bm{x}})d{\bm{x}}. (10)

The dynamics now becomes a nonlinear flow:

d{\bm{x}}_{t}=-\mathcal{C}(p_{t})\nabla V({\bm{x}}_{t})dt+\sigma\sqrt{\mathcal{C}(p_{t})}d\mathcal{B}_{t}, (11)
\frac{\partial}{\partial t}p_{t}=\text{div}\big{(}p_{t}\mathcal{C}(p_{t})\nabla V+\frac{\sigma^{2}}{2}\mathcal{C}(p_{t})\nabla p_{t}\big{)} (12)

To simulate this mean-field model, the finite interacting particle system \mathcal{X}_{t}=\{{\bm{x}}^{i}_{t}\}_{i=1}^{N} is introduced:

d{\bm{x}}^{i}_{t}=-\mathcal{C}(\mathcal{X}_{t})\nabla V({\bm{x}}_{t}^{i})dt+\sigma\sqrt{\mathcal{C}(\mathcal{X}_{t})}d\mathcal{B}_{t}^{i}, (13)

where \mathcal{B}_{t}^{i} are i.i.d. standard Brownian motions, and

\mathcal{C}(\mathcal{X}_{t}):={\bm{X}}_{t}{\bm{X}}_{t}^{T},\quad\sqrt{\mathcal{C}(\mathcal{X}_{t})}:={\bm{X}}_{t},
{\bm{X}}_{t}:=\frac{1}{\sqrt{N}}\big{(}{\bm{x}}_{t}^{1}-\bar{\bm{x}}_{t},\cdots,{\bm{x}}_{t}^{N}-\bar{\bm{x}}_{t}\big{)},\quad\bar{\bm{x}}_{t}=\frac{1}{N}\sum_{i=1}^{N}{\bm{x}}_{t}^{i}

Particles in this system are then no longer independent of each other, in contrast to the independent simulations in the SGLD algorithm.
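For illustration, a minimal sketch of one Euler-Maruyama step of the interacting system (13) (the function name is ours; the empirical covariance serves as both the preconditioner and the noise square root):

  import numpy as np

  def interacting_step(X, grad_V, tau, sigma, rng):
      # One step of Eq. (13); X has shape (N, d), row i being particle x^i.
      N = X.shape[0]
      A = (X - X.mean(axis=0)).T / np.sqrt(N)   # the d x N matrix X_t, so C = A A^T
      C = A @ A.T
      drift = np.stack([C @ grad_V(x) for x in X])
      # sqrt(C) dB^i is realized as A xi^i with xi^i ~ N(0, I_N), since A A^T = C
      noise = rng.standard_normal((N, N)) @ A.T
      return X - tau * drift + sigma * np.sqrt(tau) * noise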

Now suppose V is a least-squares functional with Tikhonov-Phillips regularization (Engl et al., 1996) (for any positive-definite matrix A, we define \langle a,a^{\prime}\rangle_{A}=\langle a,A^{-1}a^{\prime}\rangle=a^{T}A^{-1}a^{\prime} and \|a\|_{A}=\|A^{-\frac{1}{2}}a\|):

V(x)=\frac{1}{2}\|y-\mathcal{G}(x)\|_{\Gamma}^{2}+\frac{1}{2}\|x\|_{\Gamma_{0}}^{2}, (14)

where y\in\mathbb{R}^{k}, \mathcal{G}:\mathbb{R}^{d}\to\mathbb{R}^{k} is the forward map, and \Gamma and \Gamma_{0} are two positive definite matrices. In this situation, the system (13) can be rewritten as

d{\bm{x}}^{i}_{t}=-\frac{1}{N}\sum_{j=1}^{N}\left\langle D\mathcal{G}(\bm{x}^{i}_{t})(\bm{x}^{j}_{t}-\bm{\bar{x}}_{t}),\mathcal{G}(\bm{x}^{i}_{t})-y\right\rangle_{\Gamma}\bm{x}_{t}^{j}dt-\mathcal{C}(\mathcal{X}_{t})\Gamma_{0}^{-1}\bm{x}_{t}^{i}dt+\sigma\sqrt{\mathcal{C}(\mathcal{X}_{t})}d\mathcal{B}_{t}^{i}. (15)

Using the first-order Taylor approximation

D\mathcal{G}(\bm{x}^{i}_{t})(\bm{x}^{j}_{t}-\bm{\bar{x}}_{t})\approx\mathcal{G}(\bm{x}^{j}_{t})-\mathcal{\bar{G}}_{t},\quad\mathcal{\bar{G}}_{t}:=\frac{1}{N}\sum_{k=1}^{N}\mathcal{G}(\bm{x}^{k}_{t}),

one may approximate Eq. 15 in a derivative-free manner as

d{\bm{x}}^{i}_{t}=-\frac{1}{N}\sum_{j=1}^{N}\left\langle\mathcal{G}(\bm{x}^{j}_{t})-\mathcal{\bar{G}}_{t},\mathcal{G}(\bm{x}^{i}_{t})-y\right\rangle_{\Gamma}\bm{x}_{t}^{j}dt-\mathcal{C}(\mathcal{X}_{t})\Gamma_{0}^{-1}\bm{x}_{t}^{i}dt+\sigma\sqrt{\mathcal{C}(\mathcal{X}_{t})}d\mathcal{B}_{t}^{i}. (16)
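In code, the misfit part of the derivative-free drift in Eq. (16) reduces to an N-by-N matrix of weighted inner products; a vectorized sketch (our naming, with Gamma_inv standing for \Gamma^{-1}):

  import numpy as np

  def eks_drift(X, G_vals, y, Gamma_inv):
      # Misfit part of Eq. (16). X: (N, d) particles; G_vals: (N, k), row i = G(x^i).
      N = X.shape[0]
      G_bar = G_vals.mean(axis=0)
      # coeff[i, j] = < G(x^j) - G_bar, G(x^i) - y >_Gamma
      coeff = (G_vals - y) @ Gamma_inv @ (G_vals - G_bar).T
      return -coeff @ X / N   # row i: -(1/N) sum_j <...> x^j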

3 Reweighted Interacting Langevin Diffusion

3.1 Continuous process analysis

Keeping the invariant measure exactly equal to e^{-2\sigma^{-2}V({\bm{x}})} is not necessary for optimization. It would be preferable to have a new process that converges faster to its invariant measure, with that invariant measure still concentrating on the global optimum as \sigma\to 0.

Now we introduce an additional source term to modify Eq. (6) into

\frac{\partial}{\partial t}\tilde{p}_{t}=\text{div}\big{(}\tilde{p}_{t}C\nabla V+\frac{\sigma^{2}}{2}C\nabla\tilde{p}_{t}\big{)}+W\tilde{p}_{t} (17)
p_{t}({\bm{x}})=\frac{\tilde{p}_{t}({\bm{x}})}{\int_{\mathbb{R}^{d}}\tilde{p}_{t}(\tilde{\bm{x}})d\tilde{\bm{x}}} (18)

Here we mainly consider the situations C=I, corresponding to Eq. (4), and C=\mathcal{C}(p_{t}), corresponding to Eq. (12), but the analysis remains valid for any positive-definite C. We assume throughout this paper that the function W is smooth and bounded above.

Let us look at the role W plays in the evolution of this process. If we take W to be a function of V such that W({\bm{x}}) becomes larger as V({\bm{x}}) becomes smaller, then intuitively the ratio of mass in the better-fitness region (closer to the global minimum) grows, so we expect the invariant measure to concentrate more on the global optimum.

Note that the process (17)-(18) no longer has a gradient-flow structure and does not preserve total mass, so a normalization is added to keep the total mass equal to 1. Such a normalized process has been well studied as the so-called Feynman-Kac semigroup in the linear case, i.e., when C is a constant matrix. As spectral analysis is widely studied for linear operators (Kato, 1966), and C is fixed at each time step in the numerical discretization, we conduct the analysis for an arbitrary fixed matrix C.

We introduce the solution operator corresponding to Eq. (17) and Eq. (18): recalling the infinitesimal generator \mathcal{L}^{\dagger}_{C}, the solution of Eq. (17) can be represented as \tilde{p}_{t}=e^{t(\mathcal{L}^{\dagger}_{C}+W)}p_{0}. We denote the corresponding reweighted operator, also well known as the Feynman-Kac semigroup, by \Phi^{t}_{\mathcal{L}_{C}+W}:

\Phi^{t}_{\mathcal{L}_{C}+W}(p_{0}):=\frac{e^{t(\mathcal{L}^{\dagger}_{C}+W)}p_{0}}{\int_{\mathbb{R}^{d}}e^{t(\mathcal{L}^{\dagger}_{C}+W)}p_{0}(\tilde{\bm{x}})d\tilde{\bm{x}}},

so that p_{t}=\Phi^{t}_{\mathcal{L}_{C}+W}(p_{0}) is exactly the solution to Eq. (18).

The operator \Phi^{t}_{\mathcal{L}_{C}+W} is nonlinear in general, but it has many similarities with the linear operator e^{t(\mathcal{L}^{\dagger}_{C}+W)}. Specifically, the unique positive fixed point p_{\infty} of \Phi^{t}_{\mathcal{L}_{C}+W} is proportional to the principal eigenfunction of \mathcal{L}^{\dagger}_{C}+W, and the convergence rate of \Phi^{t}_{\mathcal{L}_{C}+W}(p_{0}) to p_{\infty} is controlled by the spectral gap of \mathcal{L}_{C}+W. For the Feynman-Kac semigroup, one can refer to Ferré & Stoltz (2017) for the time-invariant case, Lyu et al. (2021) for the time-periodic case, and Moral (2004) for systematic details.

We are interested in the benefits the source term W can contribute. We show that taking W=\varepsilon m(V) for some small \varepsilon>0, where m:\mathbb{R}\to\mathbb{R} is a monotonically decreasing function, brings mainly two benefits:

  • the spectral gap of the operator \mathcal{L}_{C}+W is larger than that of \mathcal{L}_{C} on the same functional space;

  • the invariant measure of \Phi^{t}_{\mathcal{L}_{C}+W} concentrates more on the global minimum than the invariant measure of e^{t\mathcal{L}^{\dagger}_{C}}.

These two benefits show that, at the same noise level, the process (18) converges faster and has a more concentrated invariant measure than the process (4) or (27). We further explain these benefits and give proofs in Section 4.1.

3.2 Discrete algorithm design

Now let us use interacting particle methods (Moral & Miclo, 2000; Moral, 2013) and mean-field theory (Kac et al., 1960) to design algorithms. We introduce the Reweighted Interacting Langevin Diffusion algorithm, obtained by approximating p_{t} with a population of particles and discretizing the time evolution with an operator-splitting technique and a forward-Euler scheme.

First, we convert Eq. (17) and (18) into a discrete-time version: let \tau>0 be a fixed timestep, then

p_{(n+1)\tau}=\frac{e^{\tau(\mathcal{L}^{\dagger}_{C}+W)}p_{n\tau}}{\int_{\mathbb{R}^{d}}e^{\tau(\mathcal{L}^{\dagger}_{C}+W)}p_{n\tau}(\tilde{\bm{x}})d\tilde{\bm{x}}},

and we use the operator-splitting technique, approximating e^{\tau(\mathcal{L}^{\dagger}_{C}+W)} by e^{\tau W}\circ e^{\tau\mathcal{L}^{\dagger}_{C}} with splitting error O(\tau^{2}). (Higher-order splitting schemes can be applied, such as Strang splitting (Strang, 1968), e^{\frac{\tau}{2}W}\circ e^{\tau\mathcal{L}^{\dagger}_{C}}\circ e^{\frac{\tau}{2}W}, with splitting error O(\tau^{3}); however, approximation error is still introduced in later steps, so we choose the simplest scheme here.)

Let us now approximate e^{\tau\mathcal{L}^{\dagger}_{C}}p_{n\tau}. Suppose p_{n\tau} is approximated by a weighted empirical measure \hat{p}_{n\tau}({\bm{x}})=\sum_{i=1}^{N}w^{i}_{n}\delta_{{\bm{x}}_{n}^{i}}({\bm{x}}) generated from sample-weight pairs \{{\bm{x}}_{n}^{i},w_{n}^{i}\}_{i=1}^{N}. We use the simplest Euler-Maruyama approximation (Faniran, 2015):

e^{\tau\mathcal{L}^{\dagger}_{C}}\hat{p}_{n\tau}({\bm{x}})\approx\sum_{i=1}^{N}w_{n}^{i}\delta_{{\bm{x}}_{n+1}^{i}}({\bm{x}}), (19)
{\bm{x}}_{n+1}^{i}={\bm{x}}_{n}^{i}-C\nabla V({\bm{x}}_{n}^{i})\tau+\sqrt{\tau\sigma^{2}C}\xi^{i}_{n} (20)
e^{\tau W}e^{\tau\mathcal{L}^{\dagger}_{C}}\hat{p}_{n\tau}({\bm{x}})\approx\sum_{i=1}^{N}w_{n}^{i}e^{\tau W({\bm{x}}_{n+1}^{i})}\delta_{{\bm{x}}_{n+1}^{i}}({\bm{x}}). (21)

Here C can be I or the covariance matrix of the current step. If we take C to be the covariance matrix, it is computed as (denoting \Lambda=\text{diag}(w_{n}^{i}) and \bar{\bm{x}}_{n}=\sum_{i=1}^{N}w_{n}^{i}{\bm{x}}_{n}^{i})

C=\mathcal{C}(\hat{p}_{n\tau})={\bm{X}}_{n}\Lambda{\bm{X}}_{n}^{T},\quad\sqrt{C}={\bm{X}}_{n}\Lambda^{\frac{1}{2}}, (22)
{\bm{X}}_{n}:=\big{(}{\bm{x}}_{n}^{1}-\bar{\bm{x}}_{n},\cdots,{\bm{x}}_{n}^{N}-\bar{\bm{x}}_{n}\big{)}. (23)

Then, after normalizing, we get the natural approximation

p_{(n+1)\tau}({\bm{x}})\approx\sum_{i=1}^{N}w^{i}_{n+1}\delta_{{\bm{x}}_{n+1}^{i}}({\bm{x}}), (24)
w_{n+1}^{i}=\frac{w_{n}^{i}e^{\tau W({\bm{x}}_{n+1}^{i})}}{\sum_{j=1}^{N}w_{n}^{j}e^{\tau W({\bm{x}}_{n+1}^{j})}} (25)

Note that the weights \{w^{i}_{n}\}_{i=1}^{N} may become polarized, so we use a resampling technique for a better approximation: if \frac{\max_{i}w^{i}_{n}}{\min_{i}w^{i}_{n}} reaches a threshold, we resample the replicas \{\bm{x}_{n}^{i}\}_{i=1}^{N} according to the multinomial distribution associated with \{w_{n}^{i}\}_{i=1}^{N}, which defines a new set of replicas \{\bm{\tilde{x}}_{n}^{i}\}_{i=1}^{N} and the empirical distribution

\tilde{p}_{n\tau}(\bm{x})=\frac{1}{N}\sum_{i=1}^{N}\delta_{\bm{\tilde{x}}_{n}^{i}}(\bm{x}).

Then we replace \hat{p}_{n\tau} by \tilde{p}_{n\tau} in Eq. (21) and continue the computation.
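A sketch of the weight update (25) together with this resampling safeguard (the threshold value is illustrative):

  import numpy as np

  def reweight_resample(X, w, W_vals, tau, threshold=1e3, rng=None):
      # Weight update of Eq. (25), then multinomial resampling if weights polarize.
      rng = np.random.default_rng() if rng is None else rng
      w = w * np.exp(tau * W_vals)
      w = w / w.sum()
      if w.max() / w.min() > threshold:
          idx = rng.choice(len(w), size=len(w), p=w)   # resample the replicas
          X, w = X[idx], np.full(len(w), 1.0 / len(w))
      return X, w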

When V is a least-squares functional with Tikhonov-Phillips regularization as in Eq. (14), we can follow the same analysis, approximating \mathcal{C}(\hat{p}_{n\tau})\nabla V({\bm{x}}_{n}^{i}) in a derivative-free manner similar to Eq. (16) (with \mathcal{\bar{G}}_{n}:=\sum_{k=1}^{N}w_{n}^{k}\mathcal{G}(\bm{x}^{k}_{n})):

\mathcal{C}(\hat{p}_{n\tau})\nabla V({\bm{x}}_{n}^{i})\approx\sum_{j=1}^{N}w_{n}^{j}\left\langle\mathcal{G}(\bm{x}^{j}_{n})-\mathcal{\bar{G}}_{n},\mathcal{G}(\bm{x}^{i}_{n})-y\right\rangle_{\Gamma}\bm{x}_{n}^{j}+\mathcal{C}(\hat{p}_{n\tau})\Gamma_{0}^{-1}\bm{x}_{n}^{i}. (26)

We summarize the method in Algorithm 1. Note that we fix the population size N, stepsize \tau, and noise level \sigma for simplicity, but these can be adjusted dynamically for faster convergence.

Note that our method can also be interpreted as a mutation-selection genetic particle algorithm with MCMC mutations: the update rule \bm{x}_{n}^{i}\to\bm{x}_{n+1}^{i} can be regarded as mutation, and the resampling step can be regarded as selection. This type of algorithm acts as a bridge connecting genetic algorithms and PDEs, which enables sharper convergence analysis.

Algorithm 1 Reweighted Interacting Langevin Diffusion (RILD)
  Input: Function to minimize V({\bm{x}})\in\mathbb{R}, {\bm{x}}\in X\subseteq\mathbb{R}^{d}; gradient \nabla V({\bm{x}})\in\mathbb{R}^{d}; fitness W({\bm{x}})\in\mathbb{R}; population size N; maximum iteration Iter; i.i.d. d-dimensional standard Gaussian samples \{\xi^{i}_{n}\}_{1\leq i\leq N,0\leq n<Iter}; stepsize \tau; noise level \sigma.
  Output: Samples near the global minimum.
  Sample {\bm{x}}_{0}^{i},i=1,\cdots,N from the initial distribution p_{0}; set w_{0}^{i}=\frac{1}{N},i=1,\cdots,N, and n=0.
  while n<Iter do
     Calculate C by (22)-(23), or simply set C=I, depending on whether one uses the covariance modification.
     If using the covariance modification, either calculate C\nabla V({\bm{x}}_{n}^{i}) directly or approximate it by (26), depending on whether V is a least-squares functional (14).
     {\bm{x}}_{n+1}^{i}={\bm{x}}_{n}^{i}-C\nabla V({\bm{x}}_{n}^{i})\tau+\sqrt{\tau\sigma^{2}C}\xi^{i}_{n},\quad w_{n+1}^{i}=\frac{w_{n}^{i}e^{\tau W({\bm{x}}_{n+1}^{i})}}{\sum_{j=1}^{N}w_{n}^{j}e^{\tau W({\bm{x}}_{n+1}^{j})}},\quad 1\leq i\leq N
     if \frac{\max_{i}w^{i}_{n+1}}{\min_{i}w^{i}_{n+1}}>Threshold then
        Sample \{\bm{\tilde{x}}_{n+1}^{i}\}_{i=1}^{N} from \{\bm{x}_{n+1}^{i}\}_{i=1}^{N} according to the multinomial probability \{w_{n+1}^{i}\}_{i=1}^{N}.
        \bm{x}_{n+1}^{i}=\bm{\tilde{x}}_{n+1}^{i},\quad w_{n+1}^{i}=\frac{1}{N},\quad 1\leq i\leq N
     end if
     n=n+1
  end while
  Output \{\bm{x}_{n}^{i}\}_{i=1}^{N}
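For reference, a self-contained sketch of Algorithm 1 in its simplest configuration C=I (our naming; the covariance and derivative-free branches are omitted):

  import numpy as np

  def rild(grad_V, W, X0, tau, sigma, n_iter, threshold=1e3, seed=0):
      # Algorithm 1 with C = I. X0: (N, d) initial ensemble; W: fitness function.
      rng = np.random.default_rng(seed)
      X = np.array(X0, dtype=float)
      N = X.shape[0]
      w = np.full(N, 1.0 / N)
      for _ in range(n_iter):
          # mutation: Langevin step, Eq. (20)
          X = (X - tau * np.stack([grad_V(x) for x in X])
               + sigma * np.sqrt(tau) * rng.standard_normal(X.shape))
          # reweighting, Eq. (25)
          w = w * np.exp(tau * np.array([W(x) for x in X]))
          w = w / w.sum()
          # selection: resample when the weights polarize
          if w.max() / w.min() > threshold:
              X = X[rng.choice(N, size=N, p=w)]
              w = np.full(N, 1.0 / N)
      return X, w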

3.3 Gradient-Free variants

In practice, exact gradients are hard to obtain for many optimization problems: gradient information may be unavailable or computationally expensive. We therefore suggest a gradient-free variant, corresponding to a process with only the diffusion and source terms. We will prove that such a process converges exponentially to its invariant measure, and that this invariant measure concentrates on the global minimum as \sigma\to 0.

We introduce the formula of the modified process as

\frac{\partial}{\partial t}\tilde{p}_{t}=\text{div}\big{(}\frac{\sigma^{2}}{2}C\nabla\tilde{p}_{t}\big{)}+W\tilde{p}_{t} (27)
p_{t}({\bm{x}})=\frac{\tilde{p}_{t}({\bm{x}})}{\int_{\mathbb{R}^{d}}\tilde{p}_{t}(\tilde{\bm{x}})d\tilde{\bm{x}}} (28)

Note that we only delete the term involving \nabla V; the term W can still be chosen to depend on V. We will show in Thm. 4.6 that the system finally converges to a distribution concentrating on the maximum of W.

The discrete algorithm is designed similarly to Algorithm 1, simply ignoring the gradient term C\nabla V; this corresponds to taking V(\bm{x})\equiv\text{const}.

4 Theoretical properties

In this section, we state the main theoretical results associated with our RILD algorithm (Alg. 1). The proofs are given in Appendix A.

4.1 Spectral Gap enhancement of the reweighting modification

Now we analyze how W helps in improving the convergence rate, as well as the concentration of the invariant measure.

First, let us restrict the comparison to the space L^{2}(\nu), where \nu(x)=e^{-\frac{2V(x)}{\sigma^{2}}}. The main reason for choosing this space is the following property.

Lemma 4.1.

For any positive definite matrix C, \mathcal{L}_{C}+W and \mathcal{L}_{C} are self-adjoint on L^{2}(\nu).

To continue our analysis, we need the following assumption on V and W, which is necessary for \mathcal{L}_{C}+W and \mathcal{L}_{C} to have discrete spectra that are bounded above (see Pankov (2001)).

Assumption 4.2.

The functions V and W are assumed to satisfy:

\lim_{|\bm{x}|\to\infty}V=+\infty,\quad W<A\text{ for a constant }A\in\mathbb{R},

and

\lim_{|\bm{x}|\to\infty}\frac{|\nabla V|^{2}}{2\sigma^{2}}-\frac{\Delta V}{2}-W=\lim_{|\bm{x}|\to\infty}\frac{|\nabla V|^{2}}{2\sigma^{2}}-\frac{\Delta V}{2}=+\infty

Next, we need to prove that the Feynman-Kac semigroup \Phi^{t}_{\mathcal{L}_{C}+W} indeed maps any initial density p_{0}\in L^{2}(\nu) to a limiting density. This result is stated as follows.

Theorem 4.3.

Under Asmp. 4.2, for any positive-definite C, there exists a principal eigenvalue \lambda_{0} of \mathcal{L}_{C}+W over L^{2}(\nu) with corresponding normalized eigenfunction \phi({\bm{x}}). Furthermore, for any positive density p_{0}, the normalized probability p_{t}:=\Phi^{t}_{\mathcal{L}_{C}+W}(p_{0}) has the limit \phi({\bm{x}})\nu({\bm{x}}), that is, \lim_{t\to\infty}\|p_{t}/\nu-\phi\|_{L^{2}(\nu)}=0.

Theorem 4.4.

Under the same assumptions as in Thm. 4.3, the convergence rate of the system (17), (18) can be evaluated via the spectral gap of \mathcal{L}_{C}+W over L^{2}(\nu): let \lambda_{0} and \lambda_{1} be the first two eigenvalues of \mathcal{L}_{C}+W; then \|p_{t}/\nu-\phi\|_{L^{2}(\nu)}\leq C\|p_{0}/\nu-\phi\|_{L^{2}(\nu)}e^{-(\lambda_{0}-\lambda_{1})t}.

Next, let us analyze how the convergence speed of \Phi^{t}_{\mathcal{L}_{C}+W}p_{0} improves upon that of the original process e^{t\mathcal{L}_{C}}(p_{0}), which is governed by the spectral gap. Exactly computing the spectral gap of a differential operator is hard in general; many existing analyses consider only the simplest case, where \nabla V is linear (i.e., the Hessian of V is a constant matrix). In contrast to these existing techniques, we analyze the contribution of W to the spectral gap by perturbation theory: we prove that the new process has a better convergence rate than the old one if W=\varepsilon m(V) for a small \varepsilon>0, where m:\mathbb{R}\to\mathbb{R} is a monotonically decreasing function.

Theorem 4.5.

Suppose in addition that V satisfies the same conditions as in Nier (2004). Take W=\varepsilon m(V) and work in the space L^{2}(\nu), where \nu(x)=e^{-\frac{2V(x)}{\sigma^{2}}}. Then, for \sigma small enough, the spectral gap of \mathcal{L}_{C}+\varepsilon m(V) is locally increasing in \varepsilon for small enough \varepsilon>0. Moreover, the principal eigenfunction of \mathcal{L}_{C}+\varepsilon m(V) concentrates more on the global minimum than the principal eigenfunction of \mathcal{L}_{C} for small enough \varepsilon>0.

4.2 Convergence of the Gradient-Free variants

In the gradient-free situation, the original operator \mathcal{D}_{C}:=\text{div}\big{(}\frac{\sigma^{2}}{2}C\nabla\cdot\big{)} has a trivial invariant measure, the uniform distribution, so no optimization property can be expected when W is not included. We therefore take another route, showing that \Phi_{\mathcal{D}_{C}+W}^{t}p_{0} converges exponentially to a distribution concentrated in a neighborhood of the global minimum, and that as \sigma\to 0 the invariant distribution becomes more and more concentrated at the global minimum. We state the result as follows:

Theorem 4.6.

Consider the space L^{2} and assume W is bounded on \mathbb{R}^{d}. Then \Phi_{\mathcal{D}_{C}+W}^{t}p_{0} converges exponentially to the principal normalized eigenfunction \mu_{\sigma}({\bm{x}}) of the operator \mathcal{D}_{C}+W. In addition, for any f\in L^{2} that is smooth and compactly supported, \lim_{\sigma\to 0}\langle\mu_{\sigma}(\bm{x}),f\rangle=f(\bm{x}^{\ast}), where {\bm{x}}^{\ast} is the global maximum of W; that is, \mu_{\sigma}(\bm{x}) converges weakly to \delta_{\bm{x}^{\ast}}(\bm{x}).

5 Numerical experiments

In this section, we first present a Fourier spectral analysis to verify our theoretical results. Then, we conduct two inverse problem tests, showing the positive effect of the reweighting/resampling procedure for sampling. Finally, we test an optimization task, showing that the resampling technique can help escape from local minima.

5.1 Fourier spectral method analysis for enhancing spectral gap and concentrating invariant measure

Let us consider a 1-d periodic problem. We first verify our results in Thm. 4.5. Let x\in[0,1) (although our analysis considers the space \mathbb{R}, the results transfer easily to the periodic case, with Asmp. 4.2 removed). We use the Fourier spectral method (Shen et al., 2011) to discretize the differential operators \mathcal{L} and \mathcal{L}-\varepsilon V, where

\mathcal{L}f(x):=-\frac{d}{dx}V(x)\frac{d}{dx}f(x)+\frac{d^{2}}{dx^{2}}f(x) (29)

for any periodic f\in C^{\infty}([0,1)), and

V(x)=\cos(9\pi x)-\cos(11\pi x) (30)

with the boundary smoothly modified.
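A sketch of this computation (our discretization choices; we omit the boundary smoothing, so V here is only C^1 at the boundary): build the FFT differentiation matrix on a uniform grid, assemble \mathcal{L} and \mathcal{L}-\varepsilon V as dense matrices, and compare spectral gaps.

  import numpy as np

  M = 256
  x = np.arange(M) / M
  k = 2j * np.pi * np.fft.fftfreq(M, d=1.0 / M)       # Fourier symbols of d/dx on [0, 1)
  D = np.real(np.fft.ifft(k[:, None] * np.fft.fft(np.eye(M), axis=0), axis=0))

  V = np.cos(9 * np.pi * x) - np.cos(11 * np.pi * x)  # Eq. (30), without boundary smoothing
  L = -np.diag(D @ V) @ D + D @ D                     # Eq. (29): -V'(x) d/dx + d^2/dx^2

  for eps in (0.0, 0.05, 0.1):
      ev = np.sort(np.linalg.eigvals(L - eps * np.diag(V)).real)[::-1]
      print(f"eps={eps}: spectral gap = {ev[0] - ev[1]:.4f}")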

Figure 1: Left: graphs of the first eigenfunction \nu of \mathcal{L}, the first eigenfunction \mu of \mathcal{L}-0.1V, the function V, and the quotient \mu/\nu, obtained by the Fourier spectral method. Right: the first two eigenvalues \lambda_{0}(\varepsilon),\lambda_{1}(\varepsilon) of \mathcal{L}-\varepsilon V vs. \varepsilon.

In Fig. 1 (left), we plot the principal eigenfunction \nu of the operator \mathcal{L}, the principal eigenfunction \mu of the operator \mathcal{L}-0.1V, the function V, and \frac{\mu}{\nu}, respectively. As expected, \mu concentrates more on the global minimum of V than \nu. In Fig. 1 (right), we see \lambda_{0}(\varepsilon)-\lambda_{1}(\varepsilon) increasing as \varepsilon increases, demonstrating the enhancement of the spectral gap. This experiment gives a visual explanation of Thm. 4.5: the eigenfunction of \mathcal{L} is the Gibbs measure \nu(\bm{x})=\exp(-2V(\bm{x})/\sigma^{2}), which gets larger as V gets smaller; the eigenfunction \mu of \mathcal{L}-0.1V puts even more mass than \nu near the global minimum of V, and the staircase-like graph of the quotient \mu/\nu gives a direct illustration of (42).

Next, let us verify the result in Thm. 4.6. We use the same space x\in[0,1) and the source term W(\bm{x})=-V(\bm{x}), where V(x) is the same as in (30), and examine the spectral properties of the following operator:

(\mathcal{D}_{\sigma}+W)f(x):=\frac{\sigma^{2}}{2}\frac{d^{2}}{dx^{2}}f(x)+W(x)f(x) (31)

for any periodic f\in C^{\infty}([0,1)). We plot in Fig. 2 (left) the principal eigenfunction, i.e., the invariant distribution density \mu_{\sigma}(x) of \mathcal{D}_{\sigma}+W, for different \sigma. We also calculate the mass near the global minimum: we take the interval I=[0.44,0.68] and calculate \int_{I}\mu_{\sigma}(x)dx. The result is plotted in Fig. 2 (right), showing the concentration tendency as \sigma\to 0. This coincides with Thm. 4.6: the invariant measure concentrates more and more on the global maximum of W as \sigma\to 0.

Figure 2: Left: the invariant density \mu_{\sigma}(x) obtained by the Fourier spectral method for different \sigma. Right: the mass ratio \int_{0.44}^{0.68}\mu_{\sigma}(x)dx as a function of \sigma.

5.2 Numerical tests for derivative-free inverse problem solving and sampling

In this section, we design numerical tests in the field of inverse problems. We compare our RILD algorithm with EKS (Garbuno-Iñigo et al., 2019) and EKI (Kovachki & Stuart, 2018). These problems can be seen as optimization problems for a least-squares functional with Tikhonov-Phillips regularization (14), so a derivative-free approximation of the update rule ((16) for EKS and EKI, or (26) for RILD) can be used to design derivative-free schemes.

We first solve a low-dimensional inverse problem. The numerical experiment considered here is the example originally presented by Ernst et al. (2015) and also used in Herty & Visconti (2018). We compare with the results from Garbuno-Iñigo et al. (2019); the experimental settings are exactly the same. The forward map is given by the solution of a one-dimensional elliptic boundary value problem, as defined in Garbuno-Iñigo et al. (2019):

-\frac{d}{du}\left(\exp(x_{1})\frac{d}{du}f(u)\right)=1,\quad u\in[0,1] (32)

with f(0)=0, f(1)=x_{2}. The explicit solution is given by

f(u)=x_{2}u+\exp(-x_{1})\left(-\frac{u^{2}}{2}+\frac{u}{2}\right). (33)

Thus we define the forward map

\mathcal{G}(\bm{x})=\left(f(u_{1}),f(u_{2})\right)^{T}. (34)

Here \bm{x}=(x_{1},x_{2})^{T} is the constant vector we want to recover, and we have noisy measurements \bm{y}=(27.5,79.7)^{T} of f(\cdot) at locations u_{1}=0.25, u_{2}=0.75. This can be solved by minimizing the least-squares functional defined as in Eq. (14). We assume observation noise \Gamma=0.1^{2}I_{2}, prior matrix \Gamma_{0}=10^{2}I_{2}, and an initial ensemble drawn from N(0,1)\times U(90,110). The ensemble size is N=10^{3}. We fix \sigma=\sqrt{2}. The stepsize \tau is updated adaptively as in Garbuno-Iñigo et al. (2019). We take W(\bm{x})=-\|\mathcal{G}(\bm{x})-\bm{y}\|^{2}_{\Gamma}.
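A sketch of this setup in code, using the explicit solution (33) (helper names are ours):

  import numpy as np

  def G(x):
      # Forward map of Eq. (34), via the explicit solution (33).
      x1, x2 = x
      f = lambda u: x2 * u + np.exp(-x1) * (-u**2 / 2 + u / 2)
      return np.array([f(0.25), f(0.75)])

  y = np.array([27.5, 79.7])
  Gamma_inv = np.eye(2) / 0.1**2                       # Gamma = 0.1^2 I_2
  W = lambda x: -(G(x) - y) @ Gamma_inv @ (G(x) - y)   # W(x) = -||G(x) - y||^2_Gamma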

We compare our RILD algorithm (Alg. 1) with the EKS and EKI algorithms. The key difference between RILD and EKS in this setting is the use of reweighting/resampling. The results are plotted in Fig. 3. From the figure, we see that our RILD algorithm converges much faster than the EKI or EKS algorithms with the same hyper-parameters, and since our problem is set up as a posterior sampling problem, the ensemble of RILD stops shrinking toward the minimum point after some iterations unless one decreases the diffusion parameter \sigma. One can expect that, with a decay schedule for \sigma, the RILD algorithm would perform much better than the EKS or EKI algorithms for optimization.

Figure 3: Convergence comparison among the RILD, EKS, and EKI algorithms: the mean loss versus iterations; the ensembles at the 15th iteration; the ensembles at the 30th iteration.

We also tested a high-dimensional case. Specifically, we define the map \mathcal{G}:\mathbb{R}^{d}\to\mathbb{R}^{2d-2},

\mathcal{G}(\bm{x})=\left(10(x_{2}-x_{1}^{2}),\cdots,10(x_{d}-x_{d-1}^{2}),x_{1},\cdots,x_{d-1}\right)^{T},

and \bm{y}=(0,\cdots,0,1,\cdots,1)^{T}, with 0 repeated d-1 times and 1 repeated d-1 times. One can verify that

\|\mathcal{G}(\bm{x})-\bm{y}\|^{2}=\sum_{i=1}^{d-1}\left(100(x_{i+1}-x_{i}^{2})^{2}+(x_{i}-1)^{2}\right)

is exactly the Rosenbrock function. We choose d=100, observation noise matrix \Gamma=0.1^{2}I_{198}, prior matrix \Gamma_{0}=10^{2}I_{100}, and an initial ensemble drawn from N(2,0.3^{2}I_{100}). The global solution of \mathcal{G}(\bm{x})=\bm{y} is \bm{x}=(1,\cdots,1)^{T}. The ensemble size is fixed to N=10^{3}. We take \sigma=\sqrt{2}, and the stepsize \tau is updated adaptively as before. For the RILD algorithm, we choose W(\bm{x})=-5\times 10^{-3}\|\mathcal{G}(\bm{x})-\bm{y}\|^{2}. As in the previous test, RILD converges much faster than EKI or EKS in this high-dimensional sampling task, and thus performs better in optimization settings; see Fig. 4.
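A quick sanity check of this identity (helper names are ours):

  import numpy as np

  d = 100
  def G(x):   # the map above, R^d -> R^{2d-2}
      return np.concatenate([10 * (x[1:] - x[:-1]**2), x[:-1]])

  y = np.concatenate([np.zeros(d - 1), np.ones(d - 1)])

  def rosenbrock(x):
      return np.sum(100 * (x[1:] - x[:-1]**2)**2 + (x[:-1] - 1)**2)

  x = np.random.default_rng(0).standard_normal(d)
  assert np.isclose(np.sum((G(x) - y)**2), rosenbrock(x))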

Figure 4: Comparison among the RILD, EKS, and EKI algorithms: the mean loss versus iterations; the ensembles at the 150th iteration; the ensembles at the 400th iteration.

5.3 Numerical tests for highly nonconvex high-dimensional optimization

We now test our RILD algorithm in a highly nonconvex, high-dimensional setting: the 100-dimensional Ackley function, defined as follows:

V(\bm{x})=-ae^{-b\sqrt{\frac{1}{d}\sum_{i=1}^{d}x_{i}^{2}}}-e^{\frac{1}{d}\sum_{i=1}^{d}\cos(cx_{i})}+a+e, (35)

where \bm{x}=(x_{1},\cdots,x_{d})^{T}, d=100, a=20, b=0.2, c=2\pi. As the difficulty mainly arises from the numerous local minima, the covariance modification, which is designed for ill-posed problems, is not suitable here; we simply take C=I.
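For reference, a direct transcription of Eq. (35) in code:

  import numpy as np

  def ackley(x, a=20.0, b=0.2, c=2 * np.pi):
      # Eq. (35); V(0) = 0 is the global minimum.
      d = x.size
      return (-a * np.exp(-b * np.sqrt(np.sum(x**2) / d))
              - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

  assert abs(ackley(np.zeros(100))) < 1e-12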

We compare our RILD algorithm (Alg. 1) with the classical Gradient Langevin Dynamics (GLD) algorithm, which discretizes a single path of the Langevin dynamics:

\bm{x}_{n+1}=\bm{x}_{n}-\nabla V(\bm{x}_{n})\tau+\sqrt{\tau\sigma^{2}}\xi_{n}.

The difference between RILD and GLD is that RILD maintains an ensemble of size N while GLD maintains only a single individual; at each step, RILD calculates a weight for each individual and then resamples the ensemble according to the weights (see Alg. 1). We now test whether RILD is better at escaping local minima than GLD.

For RILD, we take ensemble size N=50 and randomly draw the initial ensemble from N(0,30^{2}I_{100}) (such an initialization makes finding the global minimum difficult, as the first term in the Ackley function flattens out quickly once \bm{x} moves away from the origin). For GLD, the initial point is randomly chosen from the initial ensemble of RILD. We test a wide range of stepsizes \tau\in[2,32] and noise levels \sigma\in[1,16], and for each fixed \tau and \sigma we repeat 10 trials to compute the pass rate: a trial passes if the RILD or GLD algorithm finds a point \bm{x} with V(\bm{x})<17 within 5\times 10^{4} evaluations (once a point below 17 is found, the remaining task is trivial, as the first term in the Ackley function becomes dominant). All trials in all hyper-parameter settings begin with the same initial ensemble. In Fig. 5, one can see that RILD finds the true descent directions in a wide range of hyper-parameter settings, while GLD fails in all tested settings. We also tested GA and PSO under the same initial condition with different hyper-parameters, and report the best search results in Fig. 5 together with RILD and GLD.

Figure 5: Left: pass-rate heat maps for RILD and GLD. Right: decay curves for PSO, GA, RILD, and GLD, with \tau=10, \sigma=5 for RILD and GLD.

6 Conclusion

In this work, we have demonstrated a methodology for accelerating Langevin-dynamics-based algorithms through the addition of a source term W and the use of a reweighting/resampling technique: the RILD algorithm. Our algorithm and analyses shed some light on combining gradient algorithms and genetic algorithms using Partial Differential Equations (PDEs) with provable guarantees.

In the future, we will combine the reweighting technique with higher-order optimization schemes such as momentum-accelerated gradient methods. We will also conduct a finer analysis of convergence with finitely many particles, which is arguably more important, as asymptotic results are only suitable for large enough ensembles. We expect these studies to bring new insights for designing numerical algorithms.

References

  • Berg (2004) Berg, B. A. Markov chain monte carlo simulations and their statistical analysis. 2004.
  • Berg & Neuhaus (1991) Berg, B. A. and Neuhaus, T. Multicanonical algorithms for first order phase transitions. Physics Letters B, 267:249–253, 1991.
  • Dalalyan (2014) Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and log‐concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 2014.
  • Dalalyan (2017) Dalalyan, A. S. Further and stronger analogy between sampling and optimization: Langevin monte carlo and gradient descent. In Annual Conference Computational Learning Theory, 2017.
  • Engl et al. (1996) Engl, H. W., Hanke, M., and Neubauer, A. Regularization of inverse problems. 1996.
  • Ernst et al. (2015) Ernst, O. G., Sprungk, B., and Starkloff, H.-J. Analysis of the ensemble and polynomial chaos kalman filters in bayesian inverse problems. SIAM/ASA J. Uncertain. Quantification, 3:823–851, 2015.
  • Faniran (2015) Faniran, T. S. Numerical solution of stochastic differential equations. 2015.
  • Ferré & Stoltz (2017) Ferré, G. and Stoltz, G. Error estimates on ergodic properties of discretized feynman–kac semigroups. Numerische Mathematik, 143:261 – 313, 2017.
  • Foulkes et al. (2001) Foulkes, W. M. C., Mitas, L., Needs, R. J., and Rajagopal, G. Quantum monte carlo simulations of solids. Reviews of Modern Physics, 73:33–83, 2001.
  • Garbuno-Iñigo et al. (2019) Garbuno-Iñigo, A., Hoffmann, F., Li, W., and Stuart, A. M. Interacting langevin diffusions: Gradient structure and ensemble kalman sampler. SIAM J. Appl. Dyn. Syst., 19:412–441, 2019.
  • Geyer (1991) Geyer, C. J. Markov chain monte carlo maximum likelihood. 1991.
  • Herty & Visconti (2018) Herty, M. and Visconti, G. Kinetic methods for inverse problems. Kinetic & Related Models, 2018.
  • Kac et al. (1960) Kac, M., Uhlenbeck, G. E., Hibbs, A. R., van der Pol, B., and Gillis, J. Probability and related topics in physical sciences. 1960.
  • Kato (1966) Kato, T. Perturbation theory for linear operators. 1966.
  • Kirkpatrick et al. (1983) Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. Optimization by simulated annealing. Science, 220:671 – 680, 1983.
  • Kovachki & Stuart (2018) Kovachki, N. B. and Stuart, A. M. Ensemble kalman inversion: a derivative-free technique for machine learning tasks. Inverse Problems, 35, 2018.
  • Lelièvre & Stoltz (2016) Lelièvre, T. and Stoltz, G. Partial differential equations and stochastic methods in molecular dynamics*. Acta Numerica, 25:681 – 880, 2016.
  • Lyu et al. (2021) Lyu, J., Wang, Z., Xin, J. X., and Zhang, Z. A convergent interacting particle method and computation of kpp front speeds in chaotic flows. SIAM J. Numer. Anal., 60:1136–1167, 2021.
  • Moral (2004) Moral, P. D. Feynman-kac formulae: Genealogical and interacting particle systems with applications. 2004.
  • Moral (2013) Moral, P. D. Mean field simulation for monte carlo integration. 2013.
  • Moral & Miclo (2000) Moral, P. D. and Miclo, L. Branching and interacting particle systems. approximations of feynman-kac formulae with applications to non-linear filtering. 2000.
  • Nier (2004) Nier, F. Quantitative analysis of metastability in reversible diffusion processes via a witten complex approach. 2004.
  • Øksendal (1987) Øksendal, B. Stochastic differential equations : an introduction with applications. Journal of the American Statistical Association, 82:948, 1987.
  • Pankov (2001) Pankov, A. Introduction to spectral theory of schrödinger operators. Mathematics eJournal, 2001.
  • Parlett (1981) Parlett, B. N. The symmetric eigenvalue problem. 1981.
  • Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Annual Conference Computational Learning Theory, 2017.
  • Reed & Simon (1979) Reed, M. C. and Simon, B. Methods of modern mathematical physics. iii. scattering theory. 1979.
  • Schillings & Stuart (2016) Schillings, C. and Stuart, A. M. Analysis of the ensemble kalman filter for inverse problems. SIAM J. Numer. Anal., 55:1264–1290, 2016.
  • Shen et al. (2011) Shen, J., Tang, T., and Wang, L. Spectral methods: Algorithms, analysis and applications. 2011.
  • Strang (1968) Strang, G. On the construction and comparison of difference schemes. SIAM Journal on Numerical Analysis, 5:506–517, 1968.
  • Swendsen & Wang (1986) Swendsen and Wang. Replica monte carlo simulation of spin glasses. Physical review letters, 57 21:2607–2609, 1986.
  • Varadhan (2010) Varadhan, S. R. S. Large deviations. 2010.
  • Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In International Conference on Machine Learning, 2011.
  • Xu et al. (2017) Xu, P., Chen, J., and Gu, Q. Global convergence of langevin dynamics based algorithms for nonconvex optimization. ArXiv, abs/1707.06618, 2017.
  • Zhang et al. (2017) Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient langevin dynamics. In Annual Conference Computational Learning Theory, 2017.

Appendix A Proofs

In this section we give proofs of the results in Section 4.

Proof of Lemma  4.1

For any f,g\in L^{2}(\nu)\cap C_{0}^{\infty}(\mathbb{R}^{d}), note that multiplication by W is clearly self-adjoint, so we only need to prove the claim for \mathcal{L}_{C}. Notice that

Z_{\nu}\langle f,\mathcal{L}_{C}g\rangle_{L^{2}(\nu)}=\int_{\mathbb{R}^{d}}f(\bm{x})\left(-\langle C\nabla V(\bm{x}),\nabla g(\bm{x})\rangle+\frac{\sigma^{2}}{2}\text{div}(C\nabla g(\bm{x}))\right)e^{-2\sigma^{-2}V(\bm{x})}d\bm{x} (36)
=\int_{\mathbb{R}^{d}}-f(\bm{x})\langle C\nabla V(\bm{x}),\nabla g(\bm{x})\rangle e^{-2\sigma^{-2}V(\bm{x})}d\bm{x}+\frac{\sigma^{2}}{2}\int_{\mathbb{R}^{d}}f(\bm{x})\text{div}(C\nabla g(\bm{x}))e^{-2\sigma^{-2}V(\bm{x})}d\bm{x} (37)
=\int_{\mathbb{R}^{d}}-f(\bm{x})\langle C\nabla V(\bm{x}),\nabla g(\bm{x})\rangle e^{-2\sigma^{-2}V(\bm{x})}d\bm{x}-\frac{\sigma^{2}}{2}\int_{\mathbb{R}^{d}}\left\langle C\nabla g(\bm{x}),\nabla\big{(}f(\bm{x})e^{-2\sigma^{-2}V(\bm{x})}\big{)}\right\rangle d\bm{x} (38)
=\int_{\mathbb{R}^{d}}-f(\bm{x})\big{(}\langle C\nabla V(\bm{x}),\nabla g(\bm{x})\rangle-\langle\nabla V(\bm{x}),C\nabla g(\bm{x})\rangle\big{)}e^{-2\sigma^{-2}V(\bm{x})}d\bm{x}-\frac{\sigma^{2}}{2}\int_{\mathbb{R}^{d}}\langle C\nabla g(\bm{x}),\nabla f(\bm{x})\rangle e^{-2\sigma^{-2}V(\bm{x})}d\bm{x} (39)
=-\frac{\sigma^{2}}{2}\int_{\mathbb{R}^{d}}\langle C\nabla g(\bm{x}),\nabla f(\bm{x})\rangle e^{-2\sigma^{-2}V(\bm{x})}d\bm{x}, (40)

where the first integral in (39) vanishes because C is symmetric. The final expression (40) is symmetric in f and g; thus \mathcal{L}_{C} is self-adjoint on L^{2}(\nu).

\square

Next, let us use a change-of-variables trick to eliminate C for simplicity.

Lemma A.1.

Given any invertible matrix A\in\mathbb{R}^{d\times d} and any function f:\mathbb{R}^{d}\to\mathbb{R}, we define the function

\tilde{f}(\bm{x}):=f(A\bm{x}).

We denote by \mathcal{\tilde{L}}_{C} the differential operator

\mathcal{\tilde{L}}_{C}\cdot:=-\langle C\nabla\tilde{V},\nabla\cdot\rangle+\frac{\sigma^{2}}{2}\text{div}(C\nabla\cdot);

then the following statement is true:

\mathcal{\tilde{L}}_{\tilde{C}}\tilde{g}(\bm{x})=\mathcal{L}_{C}g(A\bm{x})

for any g\in C^{\infty}(\mathbb{R}^{d}), where \tilde{C}=A^{-1}C(A^{T})^{-1}.

Proof.

Using the chain rule, we notice that

(\nabla\tilde{f})(\bm{x})=A^{T}(\nabla f)(A\bm{x}),

thus

\mathcal{\tilde{L}}_{\tilde{C}}\tilde{g}(\bm{x})=-\langle A\tilde{C}A^{T}(\nabla V)(A\bm{x}),(\nabla g)(A\bm{x})\rangle+\frac{\sigma^{2}}{2}\text{div}(A\tilde{C}A^{T}\nabla g)(A\bm{x})=\mathcal{L}_{C}g(A\bm{x}).

Lemma A.1 shows that we can use a simple linear transformation to simplify the operator. Specifically, choosing A=C^{\frac{1}{2}} transforms \mathcal{L}_{C} into the operator \mathcal{\tilde{L}} of the classical form, so we may consider only the case C=I.

To better understand the spectral properties of \mathcal{L}, we introduce the Witten Laplacian, which is unitarily equivalent to \mathcal{L} and \mathcal{L}^{\dagger} and acts on the unweighted L^{2}-space. The Witten Laplacian is defined by

\Delta_{f,h}=(-h\nabla+\nabla f)\cdot(h\nabla+\nabla f)=-h^{2}\Delta+(|\nabla f|^{2}-h\Delta f),

and then we can find the following property:

Proposition A.2.

The operator \mathcal{L} can be unitarily changed into the Witten Laplacian form as follows:

-2\sigma^{-2}\Delta_{V/2,\sigma^{2}/2}=\tilde{U}\mathcal{L}\tilde{U}^{-1},

where \tilde{U} is the unitary transformation

\tilde{U}:L^{2}(\nu)\to L^{2},\quad\phi\mapsto\phi\sqrt{\nu}.

Thus, we may only consider the spectrum of

\mathcal{S}:=\tilde{U}\mathcal{L}\tilde{U}^{-1}=\frac{\sigma^{2}}{2}\Delta-\frac{|\nabla V|^{2}}{2\sigma^{2}}+\frac{\Delta V}{2}

over L^{2} in place of \mathcal{L} over L^{2}(\nu). Notice that W is invariant under this transformation: for any W, \tilde{U}W\tilde{U}^{-1}=W. Thus the spectrum of \mathcal{S}+W over L^{2} is equivalent to the spectrum of \mathcal{L}+W over L^{2}(\nu).

Proofs of Theorems 4.3 and 4.4

By Lemma A.1 and Proposition A.2, the operator \mathcal{L}+W is transformed into the Schrödinger-form operator \mathcal{S}+W. By Assumption 4.2, \mathcal{S}+W has a discrete real spectrum, and thus a sequence of real eigenvalues \lambda_{0}>\lambda_{1}\geq\lambda_{2}\geq\cdots; see Reed & Simon (1979), Pankov (2001).

Recall

\Phi^{t}_{\mathcal{L}+W}(p_{0})=\frac{e^{t(\mathcal{L}^{\dagger}+W)}p_{0}}{\int_{\mathbb{R}^{d}}e^{t(\mathcal{L}^{\dagger}+W)}p_{0}(\tilde{\bm{x}})d\tilde{\bm{x}}}=\frac{e^{t(\mathcal{L}^{\dagger}+W-\lambda_{0})}p_{0}}{\|e^{t(\mathcal{L}^{\dagger}+W-\lambda_{0})}p_{0}\|_{L^{1}}}.

Following the same procedure as Proposition 2 in Ferré & Stoltz (2017), we have that, for any p_{0}\in L^{2}(1/\nu),

\left\|\Phi^{t}_{\mathcal{L}+W}(p_{0})-\frac{\psi_{0}}{\|\psi_{0}\|_{L^{1}}}\right\|_{L^{2}(1/\nu)}\leq C\left\|p_{0}-\frac{\psi_{0}}{\|\psi_{0}\|_{L^{1}}}\right\|_{L^{2}(1/\nu)}e^{-(\lambda_{0}-\lambda_{1})t},

where \psi_{0} is the eigenfunction of \mathcal{L}^{\dagger}+W corresponding to \lambda_{0}. Notice that, by definition and a simple calculation,

\left\|p_{0}-\frac{\psi_{0}}{\|\psi_{0}\|_{L^{1}}}\right\|_{L^{2}(1/\nu)}=\left\|\frac{p_{0}}{\nu}-\frac{\psi_{0}}{\nu\|\psi_{0}\|_{L^{1}}}\right\|_{L^{2}(\nu)},

and \frac{\psi_{0}}{\|\psi_{0}\|_{L^{1}}\nu} is just the normalized eigenfunction \phi_{0} of \mathcal{L}+W corresponding to \lambda_{0}.

\square

Proof of Theorem  4.5

By Lemma A.1 and Proposition A.2, we consider the eigenpairs of the operators \mathcal{S} and \mathcal{S}^{\varepsilon}:=\mathcal{S}+\varepsilon m(V). We denote the eigenpairs of \mathcal{S} by \{(\phi_{i},\lambda_{i}),0\leq i<\infty\} and the eigenpairs of \mathcal{S}^{\varepsilon} by \{(\phi^{\varepsilon}_{i},\lambda^{\varepsilon}_{i}),0\leq i<\infty\}. For simplicity we denote \beta=\frac{2}{\sigma^{2}}. The perturbation theory of spectra (see Chapter 5.4 of Pankov (2001), or Kato (1966)) reads:

\lambda^{\varepsilon}_{i}=\lambda_{i}+\varepsilon\frac{\langle\phi_{i},m(V)\phi_{i}\rangle}{\langle\phi_{i},\phi_{i}\rangle}+O(\varepsilon^{2}), (41)
\phi_{i}^{\varepsilon}=\phi_{i}+\varepsilon\sum_{j\neq i}\frac{\langle\phi_{j},m(V)\phi_{i}\rangle}{(\lambda_{i}-\lambda_{j})\langle\phi_{j},\phi_{j}\rangle}\phi_{j}. (42)

We first prove that, for small enough \varepsilon,

\lambda_{0}^{\varepsilon}-\lambda_{1}^{\varepsilon}>\lambda_{0}-\lambda_{1},

which by (41) is equivalent to

\frac{\langle\phi_{0},m(V)\phi_{0}\rangle}{\langle\phi_{0},\phi_{0}\rangle}>\frac{\langle\phi_{1},m(V)\phi_{1}\rangle}{\langle\phi_{1},\phi_{1}\rangle}.

According to Chapter 2.5 in Lelièvre & Stoltz (2016) and Nier (2004), there are exactly m_{0} eigenvalues close to \lambda_{0}, corresponding to the m_{0} local minima of V, and for 1\leq j\leq m_{0} the \phi_{j} are of the form \chi_{j}\exp(-\beta V/2), where the functions \chi_{j} are locally constant over the basins of attraction of the local minima of V. We denote the minimum corresponding to each \phi_{j} by \bm{x}_{j}.

We denote the basin of attraction of \bm{x}_{1} by B_{1}, with corresponding local minimum value V(\bm{x}_{1}), and set M=\max_{\bm{x}\in B_{1}}V(\bm{x}). According to Nier (2004), for any \delta>0, \chi_{1} is asymptotically constant on the region G_{1}:=\{V(\bm{x})\leq M-\delta\}\cap B_{1}^{c} as \beta\to\infty. Thus we may assume

\chi_{1}|_{B_{1}\cup G_{1}}\approx I_{B_{1}}+c(\beta)I_{G_{1}}. (43)

Observing that V(\bm{x})>M-\delta for \bm{x}\in(B_{1}\cup G_{1})^{c}, we have

0=\langle\phi_{1},\phi_{0}\rangle=\int_{\mathbb{R}^{n}}\chi_{1}\exp(-\beta V)d\bm{x}=\int_{B_{1}}\exp(-\beta V)d\bm{x}+c(\beta)\int_{G_{1}}\exp(-\beta V)d\bm{x}+O(e^{-\beta(M-\delta)}),

and since \bm{x}_{0}\in G_{1}, we deduce that

c(\beta)=-\frac{\int_{B_{1}}\exp(-\beta V)d\bm{x}+O(e^{-\beta(M-\delta)})}{\int_{G_{1}}\exp(-\beta V)d\bm{x}}=-O(e^{-\beta(V(\bm{x}_{1})-V(\bm{x}_{0}))}). (44)

Then

\frac{\langle\phi_{1},m(V)\phi_{1}\rangle}{\langle\phi_{1},\phi_{1}\rangle}=\frac{\int_{B_{1}}m(V)\exp(-\beta V)d\bm{x}-O(e^{-2\beta(V(\bm{x}_{1})-V(\bm{x}_{0}))})\int_{G_{1}}m(V)\exp(-\beta V)d\bm{x}}{\int_{B_{1}}\exp(-\beta V)d\bm{x}-O(e^{-2\beta(V(\bm{x}_{1})-V(\bm{x}_{0}))})\int_{G_{1}}\exp(-\beta V)d\bm{x}}=m(V(\bm{x}_{1}))+O(\beta^{-1}),

as m(V) has no dependence on \beta.

On the other hand,

\frac{\langle\phi_{0},m(V)\phi_{0}\rangle}{\langle\phi_{0},\phi_{0}\rangle}=m(V(\bm{x}^{\ast}))+O(\beta^{-1}),

thus we arrive at

\frac{\langle\phi_{0},m(V)\phi_{0}\rangle}{\langle\phi_{0},\phi_{0}\rangle}>\frac{\langle\phi_{1},m(V)\phi_{1}\rangle}{\langle\phi_{1},\phi_{1}\rangle},

because m is a decreasing function and V(\bm{x}^{\ast})<V(\bm{x}_{1}).

By (42), for the estimation of the eigenfunction \phi_{0}^{\varepsilon}, we aim to prove that

\langle\phi_{j},m(V)\phi_{0}\rangle<0 (45)

for j such that \lambda_{j} is close enough to \lambda_{0}; these terms then suppress the height of \phi_{0} near the local minima while retaining its height near the global minimum.

We prove this only for j=1; for j=2,\cdots,m_{0} the procedure is similar, and for j>m_{0} the corresponding terms in (42) are asymptotically negligible.

Note that, omitting small terms (from the integration region outside B_{1}\cup G_{1}) and using (43) and (44), the statement (45) is equivalent to

\int_{B_{1}}m(V)e^{-\beta V}d\bm{x}-\frac{\int_{B_{1}}e^{-\beta V}d\bm{x}}{\int_{G_{1}}e^{-\beta V}d\bm{x}}\int_{G_{1}}m(V)e^{-\beta V}d\bm{x}<0,
i.e., \frac{\int_{B_{1}}m(V)e^{-\beta V}d\bm{x}}{\int_{B_{1}}e^{-\beta V}d\bm{x}}<\frac{\int_{G_{1}}m(V)e^{-\beta V}d\bm{x}}{\int_{G_{1}}e^{-\beta V}d\bm{x}},

which is asymptotically true as \beta\to\infty, since the left-hand side tends to m(V(\bm{x}_{1})) while the right-hand side tends to m(V(\bm{x}^{\ast})), and m(V(\bm{x}_{1}))<m(V(\bm{x}^{\ast})) because m is decreasing.

\square

Proof of Theorem  4.6

We begin by considering the spectrum of W as a multiplication operator. Let W_{min}=\inf\{W(\bm{x})\}, W_{max}=\sup\{W(\bm{x})\}, where W_{min} may be -\infty. We assume W_{max} is attained at one and only one maximizer \bm{x}^{\ast}.

Note that for any \lambda>W_{max} or \lambda<W_{min}, the operator \lambda-W is invertible in L^{2}, so the spectrum of W is exactly [W_{min},W_{max}]. We now consider the operator \frac{\sigma^{2}}{2}\Delta+W. By the Rayleigh-quotient formula (Parlett, 1981), the principal eigenvalue \lambda_{0}(\sigma) of \frac{\sigma^{2}}{2}\Delta+W over L^{2} is

\lambda_{0}(\sigma)=\sup_{\phi\in H^{1}}\frac{\int_{\mathbb{R}^{d}}\phi\big{(}(\frac{\sigma^{2}}{2}\Delta+W)\phi\big{)}d\bm{x}}{\int_{\mathbb{R}^{d}}\phi^{2}d\bm{x}}=\sup_{\phi\in H^{1}}\left(\frac{\int_{\mathbb{R}^{d}}\phi^{2}Wd\bm{x}}{\int_{\mathbb{R}^{d}}\phi^{2}d\bm{x}}-\frac{\sigma^{2}}{2}\frac{\int_{\mathbb{R}^{d}}(\nabla\phi)^{T}\nabla\phi d\bm{x}}{\int_{\mathbb{R}^{d}}\phi^{2}d\bm{x}}\right).

It is worth noticing that \lambda_{0}(\sigma) is decreasing with respect to \sigma; thus for any \sigma>0, \lambda_{0}(\sigma)<\lambda_{0}(0)=W_{max}.

Take a sequence of test functions \phi_{n}(\bm{x})\in H^{1} that tends to the delta function \delta_{\bm{x}^{\ast}}(\bm{x}). Taking \sigma_{n}:=\sqrt{\frac{2}{n}\frac{\langle\phi_{n},\phi_{n}\rangle}{\langle\nabla\phi_{n},\nabla\phi_{n}\rangle}}, we have

\lim_{\sigma\to 0}\lambda_{0}(\sigma)\geq\lim_{n\to\infty}\left(\frac{\langle\phi_{n},W\phi_{n}\rangle}{\langle\phi_{n},\phi_{n}\rangle}-\frac{\sigma^{2}_{n}}{2}\frac{\langle\nabla\phi_{n},\nabla\phi_{n}\rangle}{\langle\phi_{n},\phi_{n}\rangle}\right)=W_{max}.

Thus \lambda_{0}(\sigma) is continuous at \sigma=0.

Now denote by \psi_{0}(\sigma) the principal normalized eigenfunction of \frac{\sigma^{2}}{2}\Delta+W. For any f\in L^{2} that is smooth and compactly supported, noticing that

\langle\psi_{0}(\sigma),f\rangle=\lambda_{0}(\sigma)^{-1}\langle(\frac{\sigma^{2}}{2}\Delta+W)\psi_{0}(\sigma),f\rangle (46)
=\lambda_{0}(\sigma)^{-1}\frac{\sigma^{2}}{2}\langle\psi_{0}(\sigma),\Delta f\rangle+\lambda_{0}(\sigma)^{-1}\langle\psi_{0}(\sigma),Wf\rangle, (47)

and taking \sigma\to 0, the first term tends to 0, and thus

\lim_{\sigma\to 0}\langle\psi_{0}(\sigma),f\rangle=\lim_{\sigma\to 0}\langle\frac{W}{W_{max}}\psi_{0}(\sigma),f\rangle.

This implies that as \sigma\to 0, \psi_{0}(\sigma) behaves more and more like \delta_{\bm{x}^{\ast}}(\bm{x}).

\square