
Reweighted Interacting Langevin Diffusions: an Accelerated Sampling Method for Optimization

Junlong Lyu    Zhitang Chen    Wenlong Lyu    Jianye Hao
Abstract

We propose a new technique to accelerate sampling methods for solving difficult optimization problems. Our method investigates the intrinsic connection between posterior distribution sampling and optimization with Langevin dynamics, and then proposes an interacting particle scheme that approximates a Reweighted Interacting Langevin Diffusion (RILD) system. The underlying system is designed by adding a multiplicative source term to the classical Langevin operator, leading to a higher convergence rate and a more concentrated invariant measure. We analyze the convergence rate of our algorithm and its improvement over existing results in the asymptotic regime. We also design various tests to verify our theoretical results, demonstrating faster convergence and the ability to break through barriers of spurious local minima, especially in high-dimensional non-convex settings. Our algorithms and analysis shed some light on combining gradient and genetic algorithms using Partial Differential Equations (PDEs) with provable guarantees.

Optimization, Stochastic Process, Functional Analysis, Genetic Algorithm, ICML

1 Introduction

Optimization methods have natural connections to sampling methods (Dalalyan, 2017). Consider a potential function V:\mathbb{R}^{d}\to\mathbb{R} that is twice differentiable and, for simplicity, has only one simple global minimum {\bm{x}}^{\ast}=\arg\min_{{\bm{x}}\in\mathbb{R}^{d}}V({\bm{x}}). The first-order Langevin dynamics is defined by the following Stochastic Differential Equation (SDE):

d{\bm{x}}_{t}=-\nabla V({\bm{x}}_{t})dt+\sigma d\mathcal{B}_{t}, (1)

where \sigma>0 is the diffusion parameter, proportional to the square root of the temperature, and \{\mathcal{B}_{t}\}_{t\geq 0} is the standard Brownian motion in \mathbb{R}^{d}. Under certain assumptions on the drift coefficient \nabla V, it was shown that the distribution of {\bm{x}}_{t} in Eq. (1) converges exponentially to its stationary distribution, the Gibbs measure \nu_{\sigma}({\bm{x}})\propto\exp(-2V({\bm{x}})/\sigma^{2}). As \sigma\to 0, \nu_{\sigma} concentrates on the global minimum {\bm{x}}^{\ast}, leading to various Langevin-dynamics-based algorithms such as Gradient Langevin Dynamics (GLD) (Xu et al., 2017; Dalalyan, 2014) and Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011).
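For concreteness, the following is a minimal sketch (in Python, with illustrative parameter values of our own choosing, not taken from the paper's experiments) of the Euler-Maruyama discretization underlying GLD-type algorithms:

  import numpy as np

  def gld(grad_V, x0, tau=1e-2, sigma=0.5, n_steps=10_000, seed=0):
      # Euler-Maruyama discretization of Eq. (1):
      # x_{n+1} = x_n - grad V(x_n) * tau + sigma * sqrt(tau) * xi_n
      rng = np.random.default_rng(seed)
      x = np.array(x0, dtype=float)
      for _ in range(n_steps):
          x = x - tau * grad_V(x) + sigma * np.sqrt(tau) * rng.standard_normal(x.shape)
      return x

  # Toy usage: V(x) = |x|^2 / 2, whose Gibbs measure exp(-2V/sigma^2) is N(0, (sigma^2/2) I).
  sample = gld(lambda x: x, x0=np.ones(2))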

While Langevin-dynamics-based algorithms handle optimization tasks with mildly nonconvex potential functions V effectively (Raginsky et al., 2017; Zhang et al., 2017), they encounter difficulties with global optimization for highly nonconvex V, as convergence requires \sigma_{n}\to 0, and a small \sigma_{n} drastically diminishes the probability of jumping across a potential barrier. Conventional Markov Chain Monte Carlo (MCMC) methods were proposed to alleviate such problems, such as parallel tempering (Swendsen & Wang, 1986; Geyer, 1991), flat-histogram algorithms (Berg & Neuhaus, 1991), and simulated annealing (Kirkpatrick et al., 1983). Among them, Simulated Annealing (SA) provides a reasonable way to handle this problem by maintaining a population of particles to sequentially approximate \nu_{\tilde{\sigma}_{n}}({\bm{x}}) with \tilde{\sigma}_{n}\to 0. However, it requires numerous evaluations of \nabla V, which can be computationally costly, making it less attractive for optimization tasks. In this paper, we therefore aim at accelerating Langevin-dynamics-based algorithms by modifying the underlying continuous Langevin dynamics.

Recently, the Ensemble Kalman Sampler (EKS) (Garbuno-Iñigo et al., 2019; Schillings & Stuart, 2016) was suggested to accelerate convergence toward the equilibrium \nu_{\sigma}({\bm{x}}). This method approximates a preconditioned Langevin dynamics that shares the same equilibrium as the original one but enjoys a faster convergence rate for ill-posed problems. This method has another benefit when the potential function V({\bm{x}})=\|\mathcal{G}({\bm{x}})-y\|^{2} is a least-squares functional (Engl et al., 1996): the gradient of V, i.e., the Jacobian matrix of \mathcal{G}, does not need to be calculated explicitly, as the terms involving it can be approximated using only 0th-order information from the current step; see Eq. (16).

However, the EKS method offers little improvement when solving highly non-convex optimizations. Besides, as our purpose is to find the global minimum of V, restricting the stationary distribution of the underlying dynamics to be precisely \nu_{\sigma}({\bm{x}}) is not necessary. Suppose we have another \sigma-dependent dynamics whose limit distribution approximates \delta_{{\bm{x}}^{\ast}}({\bm{x}}), the Dirac delta distribution at the minimum point {\bm{x}}^{\ast}; then this dynamics can also be used to approximate the global minimum. This inspires us to further modify the Langevin dynamics for faster convergence. Specifically, we modify the Fokker-Planck equation related to the Langevin dynamics by adding a linear source term, which can be proven, by a spectral approach (Pankov, 2001), to improve the convergence rate and the concentration of the invariant measure near the global minimum. The new process belongs to a class of nonlinear operators called Feynman-Kac semigroups, which were developed in large deviation theory for calculating generating functions (Varadhan, 2010) and are also used in important practical applications such as the Diffusion Monte Carlo (DMC) method (Foulkes et al., 2001). To design a practical algorithm, we use interacting particle methods (Moral & Miclo, 2000; Moral, 2013), introducing reweighting and resampling techniques to simulate the effect of this source term, and obtain the so-called Reweighted Interacting Langevin Diffusion (RILD) algorithm. This is, as far as we know, the first time interacting particle methods have been used to solve optimization tasks. Numerical experiments demonstrate the effect in accelerating convergence and flattening potential barriers.

The main contributions are summarized as follows:

  • We provide a simple but effective way to modify the Langevin Dynamics in Section 3.1 for faster convergence and flatter potential barriers.

  • We provide a feasible discretization way to design algorithms in Section 3.2, and compare the new algorithm Alg. 1 with several existing methods, showing its advantages.

  • We provide theoretical results in Section 4 based on spectral analysis to guarantee the advantages.

2 Preliminary

2.1 Overdamped Langevin Dynamics

In this section, we introduce the classical Langevin Dynamics which is related to GLD and SGLD algorithms.

Suppose V:\mathbb{R}^{d}\to\mathbb{R} is twice differentiable and, for simplicity, has a simple global minimum {\bm{x}}^{\ast}=\arg\min_{{\bm{x}}\in\mathbb{R}^{d}}V({\bm{x}}).

The overdamped Langevin dynamics is defined as in Eq. (1). Such a dynamics is closely related to the Fokker-Planck equation. Denote by

\mathcal{L}\cdot:=-\langle\nabla V,\nabla\cdot\rangle+\frac{\sigma^{2}}{2}\Delta\cdot (2)

the infinitesimal generator (Øksendal, 1987) of the Markov process ({\bm{x}}_{t})_{t\geq 0}, and by

\mathcal{L}^{\dagger}\cdot:=\text{div}\big{(}\nabla V\cdot+\frac{\sigma^{2}}{2}\nabla\cdot\big{)} (3)

its L^{2}-adjoint operator. Let us assume the law of {\bm{x}}_{t} at time t has a density p_{t}({\bm{x}}) with respect to the Lebesgue measure. Then p_{t}({\bm{x}}) satisfies the Fokker-Planck equation:

\frac{\partial}{\partial t}p_{t}({\bm{x}})=\mathcal{L}^{\dagger}p_{t}({\bm{x}})=\text{div}\big{(}p_{t}({\bm{x}})\nabla V({\bm{x}})+\frac{\sigma^{2}}{2}\nabla p_{t}({\bm{x}})\big{)} (4)

We may write p_{t}({\bm{x}})=e^{\mathcal{L}^{\dagger}t}p_{0}({\bm{x}}), where p_{0}({\bm{x}}) denotes the initial distribution of the overdamped Langevin dynamics. It is well known that the Markov operator e^{\mathcal{L}^{\dagger}t} admits a unique invariant probability measure \nu({\bm{x}})=Z_{\nu}^{-1}e^{-2\sigma^{-2}V({\bm{x}})}, Z_{\nu}=\int_{\mathbb{R}^{d}}e^{-2\sigma^{-2}V({\bm{x}})}d{\bm{x}}. The rate of convergence to \nu({\bm{x}}) has been widely studied:

Proposition 2.1 (Proposition 2.3 in Lelièvre & Stoltz (2016)).

Under a specific condition (Assumption 4.2), for all p_{0} such that p_{0}/\nu\in L^{2}(\nu) and all t\geq 0 (here L^{2}(\mu):=\{f:\|f\|_{L^{2}(\mu)}<\infty\}, with \langle f,g\rangle_{L^{2}(\mu)}:=\int_{\mathbb{R}^{d}}f(x)g(x)\mu(x)dx and \|f\|_{L^{2}(\mu)}=\langle f,f\rangle_{L^{2}(\mu)}^{\frac{1}{2}}),

\|e^{\mathcal{L}^{\dagger}t}p_{0}/\nu-1\|_{L^{2}(\nu)}\leq\|p_{0}/\nu-1\|_{L^{2}(\nu)}e^{-\delta t},

where \delta=-\lambda_{1} is the first non-zero eigenvalue of the operator -\mathcal{L} on L^{2}(\nu).

Note that \lambda_{1} is real because \mathcal{L} is self-adjoint on L^{2}(\nu). (We say a linear operator A is self-adjoint on a Hilbert space \{\mathcal{H},\langle\cdot,\cdot\rangle_{\mathcal{H}}\} if \langle f,Ag\rangle_{\mathcal{H}}=\langle Af,g\rangle_{\mathcal{H}} for all f,g\in\mathcal{H}.)

For sampling purposes, individual samples are drawn from independent paths of a Markov chain Monte Carlo method (MCMC, (Berg, 2004)) corresponding to the discretized Langevin dynamics. This approach is rather slow for high-dimensional, non-convex problems, a weakness that makes it less attractive for optimization. Fortunately, we can exploit the connections between individual samples, as done in Garbuno-Iñigo et al. (2019). We introduce their work in the next subsection.

2.2 Interacting Langevin Diffusion

In Garbuno-Iñigo et al. (2019), the authors introduced a modified Langevin dynamics, analyzed its properties, and proved its superiority for designing sampling algorithms. We now introduce the main results of their work.

The convergence rate of classical Langevin-dynamics-based algorithms can be very slow when V is highly anisotropic. A common approach to cancel out this effect is to introduce a d\times d positive semi-definite preconditioning matrix C in the corresponding gradient scheme,

d{\bm{x}}_{t}=-C\nabla V({\bm{x}}_{t})dt+\sigma\sqrt{C}d\mathcal{B}_{t}. (5)

Here \sqrt{C} can be any d\times r real matrix U such that UU^{T}=C (note that in this case, \mathcal{B}_{t} can be reduced to an r-dimensional standard Brownian motion, as the rank of C is at most r).

The corresponding Fokker-Planck equation now becomes

\frac{\partial}{\partial t}p_{t}({\bm{x}})=\text{div}\big{(}p_{t}({\bm{x}})C\nabla V({\bm{x}})+\frac{\sigma^{2}}{2}C\nabla p_{t}({\bm{x}})\big{)} (6)

and the infinitesimal generator

\mathcal{L}_{C}\cdot:=-\langle C\nabla V,\nabla\cdot\rangle+\frac{\sigma^{2}}{2}\text{div}(C\nabla\cdot), (7)
\mathcal{L}_{C}^{\dagger}\cdot:=\text{div}\big{(}C\nabla V\cdot+\frac{\sigma^{2}}{2}C\nabla\cdot\big{)}. (8)

One can easily verify that e^{-2\sigma^{-2}V({\bm{x}})} is an invariant measure of the above system for every positive semi-definite C, and the unique positive invariant measure if C is strictly positive definite.

Finding a suitable C is of general interest. One natural choice is C=\text{Hess}\,V, the counterpart of Newton's scheme in optimization, though it is computationally unfriendly. An intrinsic benefit of choosing \text{Hess}\,V is its affine-invariance property when designing numerical schemes.

Such a property can also be preserved if we take C=\mathcal{C}(p), the covariance matrix under the probability measure p({\bm{x}}):

\mathcal{C}(p):=\int_{\mathbb{R}^{d}}\big{(}{\bm{x}}-m(p)\big{)}\otimes\big{(}{\bm{x}}-m(p)\big{)}p({\bm{x}})d{\bm{x}}, (9)
m(p):=\int_{\mathbb{R}^{d}}{\bm{x}}p({\bm{x}})d{\bm{x}}. (10)

The dynamics now becomes a nonlinear flow:

d{\bm{x}}_{t}=-\mathcal{C}(p_{t})\nabla V({\bm{x}}_{t})dt+\sigma\sqrt{\mathcal{C}(p_{t})}d\mathcal{B}_{t}, (11)
\frac{\partial}{\partial t}p_{t}=\text{div}\big{(}p_{t}\mathcal{C}(p_{t})\nabla V+\frac{\sigma^{2}}{2}\mathcal{C}(p_{t})\nabla p_{t}\big{)} (12)

To simulate this mean-field model, the finite interacting particle system \mathcal{X}_{t}=\{{\bm{x}}^{i}_{t}\}_{i=1}^{N} is introduced:

d{\bm{x}}^{i}_{t}=-\mathcal{C}(\mathcal{X}_{t})\nabla V({\bm{x}}_{t}^{i})dt+\sigma\sqrt{\mathcal{C}(\mathcal{X}_{t})}d\mathcal{B}_{t}^{i}, (13)

where \mathcal{B}_{t}^{i} are i.i.d. standard Brownian motions, and

\mathcal{C}(\mathcal{X}_{t}):={\bm{X}}_{t}{\bm{X}}_{t}^{T},\quad\sqrt{\mathcal{C}(\mathcal{X}_{t})}:={\bm{X}}_{t},
{\bm{X}}_{t}:=\frac{1}{\sqrt{N}}\big{(}{\bm{x}}_{t}^{1}-\bar{\bm{x}}_{t},\cdots,{\bm{x}}_{t}^{N}-\bar{\bm{x}}_{t}\big{)},\quad\bar{\bm{x}}_{t}=\frac{1}{N}\sum_{i=1}^{N}{\bm{x}}_{t}^{i}

Particles in this system are then no longer independent of each other, in contrast to the independent simulations in the SGLD algorithm.
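For illustration, a minimal sketch of one Euler-Maruyama step of the interacting system (13) (the function name is ours; the empirical covariance serves as both the preconditioner and the noise square root):

  import numpy as np

  def interacting_step(X, grad_V, tau, sigma, rng):
      # One step of Eq. (13); X has shape (N, d), row i being particle x^i.
      N = X.shape[0]
      A = (X - X.mean(axis=0)).T / np.sqrt(N)   # the d x N matrix X_t, so C = A A^T
      C = A @ A.T
      drift = np.stack([C @ grad_V(x) for x in X])
      # sqrt(C) dB^i is realized as A xi^i with xi^i ~ N(0, I_N), since A A^T = C
      noise = rng.standard_normal((N, N)) @ A.T
      return X - tau * drift + sigma * np.sqrt(tau) * noise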

Now suppose V is a least-squares functional with Tikhonov-Phillips regularization (Engl et al., 1996) (for any positive-definite matrix A, we define \langle a,a^{\prime}\rangle_{A}=\langle a,A^{-1}a^{\prime}\rangle=a^{T}A^{-1}a^{\prime} and \|a\|_{A}=\|A^{-\frac{1}{2}}a\|):

V(x)=\frac{1}{2}\|y-\mathcal{G}(x)\|_{\Gamma}^{2}+\frac{1}{2}\|x\|_{\Gamma_{0}}^{2}, (14)

where y\in\mathbb{R}^{k}, \mathcal{G}:\mathbb{R}^{d}\to\mathbb{R}^{k} is the forward map, and \Gamma and \Gamma_{0} are two positive definite matrices. In this situation, the system (13) can be rewritten as

d{\bm{x}}^{i}_{t}=-\frac{1}{N}\sum_{j=1}^{N}\left\langle D\mathcal{G}(\bm{x}^{i}_{t})(\bm{x}^{j}_{t}-\bm{\bar{x}}_{t}),\mathcal{G}(\bm{x}^{i}_{t})-y\right\rangle_{\Gamma}\bm{x}_{t}^{j}dt-\mathcal{C}(\mathcal{X}_{t})\Gamma_{0}^{-1}\bm{x}_{t}^{i}dt+\sigma\sqrt{\mathcal{C}(\mathcal{X}_{t})}d\mathcal{B}_{t}^{i}. (15)

Using the first-order Taylor approximation

D\mathcal{G}(\bm{x}^{i}_{t})(\bm{x}^{j}_{t}-\bm{\bar{x}}_{t})\approx\mathcal{G}(\bm{x}^{j}_{t})-\mathcal{\bar{G}}_{t},\quad\mathcal{\bar{G}}_{t}:=\frac{1}{N}\sum_{k=1}^{N}\mathcal{G}(\bm{x}^{k}_{t}),

one may approximate Eq. 15 in a derivative-free manner as

d{\bm{x}}^{i}_{t}=-\frac{1}{N}\sum_{j=1}^{N}\left\langle\mathcal{G}(\bm{x}^{j}_{t})-\mathcal{\bar{G}}_{t},\mathcal{G}(\bm{x}^{i}_{t})-y\right\rangle_{\Gamma}\bm{x}_{t}^{j}dt-\mathcal{C}(\mathcal{X}_{t})\Gamma_{0}^{-1}\bm{x}_{t}^{i}dt+\sigma\sqrt{\mathcal{C}(\mathcal{X}_{t})}d\mathcal{B}_{t}^{i}. (16)
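In code, the misfit part of the derivative-free drift in Eq. (16) reduces to an N-by-N matrix of weighted inner products; a vectorized sketch (our naming, with Gamma_inv standing for \Gamma^{-1}):

  import numpy as np

  def eks_drift(X, G_vals, y, Gamma_inv):
      # Misfit part of Eq. (16). X: (N, d) particles; G_vals: (N, k), row i = G(x^i).
      N = X.shape[0]
      G_bar = G_vals.mean(axis=0)
      # coeff[i, j] = < G(x^j) - G_bar, G(x^i) - y >_Gamma
      coeff = (G_vals - y) @ Gamma_inv @ (G_vals - G_bar).T
      return -coeff @ X / N   # row i: -(1/N) sum_j <...> x^j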

3 Reweighted Interacting Langevin Diffusion

3.1 Continuous process analysis

Keeping the invariant measure exactly equal to e^{-2\sigma^{-2}V({\bm{x}})} is not necessary for optimization. It would be preferable to have a new process that converges faster to its invariant measure, with that invariant measure still concentrating on the global optimum as \sigma\to 0.

Now we introduce an additional source term to modify Eq. (6) into

\frac{\partial}{\partial t}\tilde{p}_{t}=\text{div}\big{(}\tilde{p}_{t}C\nabla V+\frac{\sigma^{2}}{2}C\nabla\tilde{p}_{t}\big{)}+W\tilde{p}_{t} (17)
p_{t}({\bm{x}})=\frac{\tilde{p}_{t}({\bm{x}})}{\int_{\mathbb{R}^{d}}\tilde{p}_{t}(\tilde{\bm{x}})d\tilde{\bm{x}}} (18)

Here we mainly consider the situations C=I, corresponding to Eq. (4), and C=\mathcal{C}(p_{t}), corresponding to Eq. (12), but the analysis remains valid for any positive-definite C. We assume throughout this paper that the function W is smooth and bounded above.

Let us look at the role W plays in the evolution of this process. If we take W to be a function of V such that W({\bm{x}}) becomes larger as V({\bm{x}}) becomes smaller, then intuitively the ratio of mass in the better-fitness region (closer to the global minimum) grows, so we expect the invariant measure to concentrate more on the global optimum.

Note that the process (17)-(18) no longer has a gradient-flow structure and does not preserve total mass, so a normalization is added to keep the total mass equal to 1. Such a normalized process has been well studied as the so-called Feynman-Kac semigroup in the linear case, i.e., when C is a constant matrix. As spectral analysis is widely studied for linear operators (Kato, 1966), and C is fixed at each time step in the numerical discretization, we conduct the analysis for an arbitrary fixed matrix C.

We introduce the solution operator corresponding to Eq. (17) and Eq. (18): recalling the infinitesimal generator \mathcal{L}^{\dagger}_{C}, the solution of Eq. (17) can be represented as \tilde{p}_{t}=e^{t(\mathcal{L}^{\dagger}_{C}+W)}p_{0}. We denote the corresponding reweighted operator, also well known as the Feynman-Kac semigroup, by \Phi^{t}_{\mathcal{L}_{C}+W}:

\Phi^{t}_{\mathcal{L}_{C}+W}(p_{0}):=\frac{e^{t(\mathcal{L}^{\dagger}_{C}+W)}p_{0}}{\int_{\mathbb{R}^{d}}e^{t(\mathcal{L}^{\dagger}_{C}+W)}p_{0}(\tilde{\bm{x}})d\tilde{\bm{x}}},

so that p_{t}=\Phi^{t}_{\mathcal{L}_{C}+W}(p_{0}) is exactly the solution to Eq. (18).

The operator \Phi^{t}_{\mathcal{L}_{C}+W} is nonlinear in general, but it has many similarities with the linear operator e^{t(\mathcal{L}^{\dagger}_{C}+W)}. Specifically, the unique positive fixed point p_{\infty} of \Phi^{t}_{\mathcal{L}_{C}+W} is proportional to the principal eigenfunction of \mathcal{L}^{\dagger}_{C}+W, and the convergence rate of \Phi^{t}_{\mathcal{L}_{C}+W}(p_{0}) to p_{\infty} is controlled by the spectral gap of \mathcal{L}_{C}+W. For the Feynman-Kac semigroup, one can refer to Ferré & Stoltz (2017) for the time-invariant case, Lyu et al. (2021) for the time-periodic case, and Moral (2004) for systematic details.

We are interested in the benefits the source term W can contribute. We show that taking W=\varepsilon m(V) for some small \varepsilon>0, where m:\mathbb{R}\to\mathbb{R} is a monotonically decreasing function, brings mainly two benefits:

  • the spectral gap of the operator \mathcal{L}_{C}+W is larger than that of \mathcal{L}_{C} on the same functional space;

  • the invariant measure of \Phi^{t}_{\mathcal{L}_{C}+W} concentrates more on the global minimum than the invariant measure of e^{t\mathcal{L}^{\dagger}_{C}}.

These two benefits show that, at the same noise level, the process (18) converges faster and has a more concentrated invariant measure than the process (4) or (27). We further explain these benefits and give proofs in Section 4.1.

3.2 Discrete algorithm design

Now let us use interacting particle methods (Moral & Miclo, 2000; Moral, 2013) and mean-field theory (Kac et al., 1960) to design algorithms. We introduce the Reweighted Interacting Langevin Diffusion algorithm, obtained by approximating p_{t} with a population of particles and discretizing the time evolution with an operator-splitting technique and a forward-Euler scheme.

First, we convert Eq. (17) and (18) into a discrete-time version: let \tau>0 be a fixed timestep, then

p_{(n+1)\tau}=\frac{e^{\tau(\mathcal{L}^{\dagger}_{C}+W)}p_{n\tau}}{\int_{\mathbb{R}^{d}}e^{\tau(\mathcal{L}^{\dagger}_{C}+W)}p_{n\tau}(\tilde{\bm{x}})d\tilde{\bm{x}}},

and we use the operator-splitting technique, approximating e^{\tau(\mathcal{L}^{\dagger}_{C}+W)} by e^{\tau W}\circ e^{\tau\mathcal{L}^{\dagger}_{C}} with splitting error O(\tau^{2}). (Higher-order splitting schemes can be applied, such as Strang splitting (Strang, 1968), e^{\frac{\tau}{2}W}\circ e^{\tau\mathcal{L}^{\dagger}_{C}}\circ e^{\frac{\tau}{2}W}, with splitting error O(\tau^{3}); however, approximation error is still introduced in later steps, so we choose the simplest scheme here.)

Let us now approximate e^{\tau\mathcal{L}^{\dagger}_{C}}p_{n\tau}. Suppose p_{n\tau} is approximated by a weighted empirical measure \hat{p}_{n\tau}({\bm{x}})=\sum_{i=1}^{N}w^{i}_{n}\delta_{{\bm{x}}_{n}^{i}}({\bm{x}}) generated from sample-weight pairs \{{\bm{x}}_{n}^{i},w_{n}^{i}\}_{i=1}^{N}. We use the simplest Euler-Maruyama approximation (Faniran, 2015):

e^{\tau\mathcal{L}^{\dagger}_{C}}\hat{p}_{n\tau}({\bm{x}})\approx\sum_{i=1}^{N}w_{n}^{i}\delta_{{\bm{x}}_{n+1}^{i}}({\bm{x}}), (19)
{\bm{x}}_{n+1}^{i}={\bm{x}}_{n}^{i}-C\nabla V({\bm{x}}_{n}^{i})\tau+\sqrt{\tau\sigma^{2}C}\xi^{i}_{n} (20)
e^{\tau W}e^{\tau\mathcal{L}^{\dagger}_{C}}\hat{p}_{n\tau}({\bm{x}})\approx\sum_{i=1}^{N}w_{n}^{i}e^{\tau W({\bm{x}}_{n+1}^{i})}\delta_{{\bm{x}}_{n+1}^{i}}({\bm{x}}). (21)

Here C can be I or the covariance matrix of the current step. If we take C to be the covariance matrix, it is computed as (denoting \Lambda=\text{diag}(w_{n}^{i}) and \bar{\bm{x}}_{n}=\sum_{i=1}^{N}w_{n}^{i}{\bm{x}}_{n}^{i})

C=\mathcal{C}(\hat{p}_{n\tau})={\bm{X}}_{n}\Lambda{\bm{X}}_{n}^{T},\quad\sqrt{C}={\bm{X}}_{n}\Lambda^{\frac{1}{2}}, (22)
{\bm{X}}_{n}:=\big{(}{\bm{x}}_{n}^{1}-\bar{\bm{x}}_{n},\cdots,{\bm{x}}_{n}^{N}-\bar{\bm{x}}_{n}\big{)}. (23)

Then, after normalizing, we get the natural approximation

p_{(n+1)\tau}({\bm{x}})\approx\sum_{i=1}^{N}w^{i}_{n+1}\delta_{{\bm{x}}_{n+1}^{i}}({\bm{x}}), (24)
w_{n+1}^{i}=\frac{w_{n}^{i}e^{\tau W({\bm{x}}_{n+1}^{i})}}{\sum_{j=1}^{N}w_{n}^{j}e^{\tau W({\bm{x}}_{n+1}^{j})}} (25)

Note that the weights \{w^{i}_{n}\}_{i=1}^{N} may become polarized, so we use a resampling technique for a better approximation: if \frac{\max_{i}w^{i}_{n}}{\min_{i}w^{i}_{n}} reaches a threshold, we resample the replicas \{\bm{x}_{n}^{i}\}_{i=1}^{N} according to the multinomial distribution associated with \{w_{n}^{i}\}_{i=1}^{N}, which defines a new set of replicas \{\bm{\tilde{x}}_{n}^{i}\}_{i=1}^{N} and the empirical distribution

\tilde{p}_{n\tau}(\bm{x})=\frac{1}{N}\sum_{i=1}^{N}\delta_{\bm{\tilde{x}}_{n}^{i}}(\bm{x}).

Then we replace \hat{p}_{n\tau} by \tilde{p}_{n\tau} in Eq. (21) and continue the computation.
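A sketch of the weight update (25) together with this resampling safeguard (the threshold value is illustrative):

  import numpy as np

  def reweight_resample(X, w, W_vals, tau, threshold=1e3, rng=None):
      # Weight update of Eq. (25), then multinomial resampling if weights polarize.
      rng = np.random.default_rng() if rng is None else rng
      w = w * np.exp(tau * W_vals)
      w = w / w.sum()
      if w.max() / w.min() > threshold:
          idx = rng.choice(len(w), size=len(w), p=w)   # resample the replicas
          X, w = X[idx], np.full(len(w), 1.0 / len(w))
      return X, w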

When V is a least-squares functional with Tikhonov-Phillips regularization as in Eq. (14), we can follow the same analysis, approximating \mathcal{C}(\hat{p}_{n\tau})\nabla V({\bm{x}}_{n}^{i}) in a derivative-free manner similar to Eq. (16) (with \mathcal{\bar{G}}_{n}:=\sum_{k=1}^{N}w_{n}^{k}\mathcal{G}(\bm{x}^{k}_{n})):

\mathcal{C}(\hat{p}_{n\tau})\nabla V({\bm{x}}_{n}^{i})\approx\sum_{j=1}^{N}w_{n}^{j}\left\langle\mathcal{G}(\bm{x}^{j}_{n})-\mathcal{\bar{G}}_{n},\mathcal{G}(\bm{x}^{i}_{n})-y\right\rangle_{\Gamma}\bm{x}_{n}^{j}+\mathcal{C}(\hat{p}_{n\tau})\Gamma_{0}^{-1}\bm{x}_{n}^{i}. (26)

We summarize the method in Algorithm 1. Note that we fix the population size N, stepsize \tau, and noise level \sigma for simplicity, but these can be adjusted dynamically for faster convergence.

Note that our method can also be interpreted as a mutation-selection genetic particle algorithm with MCMC mutations: the update rule \bm{x}_{n}^{i}\to\bm{x}_{n+1}^{i} can be regarded as mutation, and the resampling step can be regarded as selection. This type of algorithm acts as a bridge connecting genetic algorithms and PDEs, which enables sharper convergence analysis.

Algorithm 1 Reweighted Interacting Langevin Diffusion (RILD)
  Input: Function to minimize V({\bm{x}})\in\mathbb{R}, {\bm{x}}\in X\subseteq\mathbb{R}^{d}; gradient \nabla V({\bm{x}})\in\mathbb{R}^{d}; fitness W({\bm{x}})\in\mathbb{R}; population size N; maximum iteration Iter; i.i.d. d-dimensional standard Gaussian samples \{\xi^{i}_{n}\}_{1\leq i\leq N,0\leq n<Iter}; stepsize \tau; noise level \sigma.
  Output: Samples near the global minimum.
  Sample {\bm{x}}_{0}^{i},i=1,\cdots,N from the initial distribution p_{0}; set w_{0}^{i}=\frac{1}{N},i=1,\cdots,N, and n=0.
  while n<Iter do
     Calculate C by (22)-(23), or simply set C=I, depending on whether one uses the covariance modification.
     If using the covariance modification, either calculate C\nabla V({\bm{x}}_{n}^{i}) directly or approximate it by (26), depending on whether V is a least-squares functional (14).
     {\bm{x}}_{n+1}^{i}={\bm{x}}_{n}^{i}-C\nabla V({\bm{x}}_{n}^{i})\tau+\sqrt{\tau\sigma^{2}C}\xi^{i}_{n},\quad w_{n+1}^{i}=\frac{w_{n}^{i}e^{\tau W({\bm{x}}_{n+1}^{i})}}{\sum_{j=1}^{N}w_{n}^{j}e^{\tau W({\bm{x}}_{n+1}^{j})}},\quad 1\leq i\leq N
     if \frac{\max_{i}w^{i}_{n+1}}{\min_{i}w^{i}_{n+1}}>Threshold then
        Sample \{\bm{\tilde{x}}_{n+1}^{i}\}_{i=1}^{N} from \{\bm{x}_{n+1}^{i}\}_{i=1}^{N} according to the multinomial probability \{w_{n+1}^{i}\}_{i=1}^{N}.
        \bm{x}_{n+1}^{i}=\bm{\tilde{x}}_{n+1}^{i},\quad w_{n+1}^{i}=\frac{1}{N},\quad 1\leq i\leq N
     end if
     n=n+1
  end while
  Output \{\bm{x}_{n}^{i}\}_{i=1}^{N}
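For reference, a self-contained sketch of Algorithm 1 in its simplest configuration C=I (our naming; the covariance and derivative-free branches are omitted):

  import numpy as np

  def rild(grad_V, W, X0, tau, sigma, n_iter, threshold=1e3, seed=0):
      # Algorithm 1 with C = I. X0: (N, d) initial ensemble; W: fitness function.
      rng = np.random.default_rng(seed)
      X = np.array(X0, dtype=float)
      N = X.shape[0]
      w = np.full(N, 1.0 / N)
      for _ in range(n_iter):
          # mutation: Langevin step, Eq. (20)
          X = (X - tau * np.stack([grad_V(x) for x in X])
               + sigma * np.sqrt(tau) * rng.standard_normal(X.shape))
          # reweighting, Eq. (25)
          w = w * np.exp(tau * np.array([W(x) for x in X]))
          w = w / w.sum()
          # selection: resample when the weights polarize
          if w.max() / w.min() > threshold:
              X = X[rng.choice(N, size=N, p=w)]
              w = np.full(N, 1.0 / N)
      return X, w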

3.3 Gradient-Free variants

In practice, exact gradients are hard to obtain for many optimization problems: gradient information may be unavailable or computationally expensive. We therefore suggest a gradient-free variant, corresponding to a process with only the diffusion and source terms. We will prove that such a process converges exponentially to its invariant measure, and that this invariant measure concentrates on the global minimum as \sigma\to 0.

We introduce the formula of the modified process as

\frac{\partial}{\partial t}\tilde{p}_{t}=\text{div}\big{(}\frac{\sigma^{2}}{2}C\nabla\tilde{p}_{t}\big{)}+W\tilde{p}_{t} (27)
p_{t}({\bm{x}})=\frac{\tilde{p}_{t}({\bm{x}})}{\int_{\mathbb{R}^{d}}\tilde{p}_{t}(\tilde{\bm{x}})d\tilde{\bm{x}}} (28)

Note that we only delete the term involving \nabla V; the term W can still be chosen to depend on V. We will show in Thm. 4.6 that the system finally converges to a distribution concentrating on the maximum of W.

The discrete algorithm is designed similarly to Algorithm 1, simply ignoring the gradient term C\nabla V; this corresponds to taking V(\bm{x})\equiv\text{const}.

4 Theoretical properties

In this section, we state the main theoretical results associated with our RILD algorithm (Alg. 1). The proofs are given in Appendix A.

4.1 Spectral Gap enhancement of the reweighting modification

Now we analyze how W helps in improving the convergence rate, as well as the concentration of the invariant measure.

First, let us restrict the comparison to the space L^{2}(\nu), where \nu(x)=e^{-\frac{2V(x)}{\sigma^{2}}}. The main reason for choosing this space is the following property.

Lemma 4.1.

For any positive definite matrix C, \mathcal{L}_{C}+W and \mathcal{L}_{C} are self-adjoint on L^{2}(\nu).

To continue our analysis, we need the following assumption on V and W, which is necessary for \mathcal{L}_{C}+W and \mathcal{L}_{C} to have discrete spectra that are bounded above (see Pankov (2001)).

Assumption 4.2.

The functions V and W are assumed to satisfy:

\lim_{|\bm{x}|\to\infty}V=+\infty,\quad W<A\text{ for a constant }A\in\mathbb{R},

and

\lim_{|\bm{x}|\to\infty}\frac{|\nabla V|^{2}}{2\sigma^{2}}-\frac{\Delta V}{2}-W=\lim_{|\bm{x}|\to\infty}\frac{|\nabla V|^{2}}{2\sigma^{2}}-\frac{\Delta V}{2}=+\infty

Next, we need to prove that the Feynman-Kac semigroup \Phi^{t}_{\mathcal{L}_{C}+W} indeed maps any initial density p_{0}\in L^{2}(\nu) to a limiting density. This result is stated as follows.

Theorem 4.3.

Under Asmp. 4.2, for any positive-definite C, there exists a principal eigenvalue \lambda_{0} of \mathcal{L}_{C}+W over L^{2}(\nu) with corresponding normalized eigenfunction \phi({\bm{x}}). Furthermore, for any positive density p_{0}, the normalized probability p_{t}:=\Phi^{t}_{\mathcal{L}_{C}+W}(p_{0}) has the limit \phi({\bm{x}})\nu({\bm{x}}), that is, \lim_{t\to\infty}\|p_{t}/\nu-\phi\|_{L^{2}(\nu)}=0.

Theorem 4.4.

Under the same assumptions as in Thm. 4.3, the convergence rate of the system (17), (18) can be evaluated via the spectral gap of \mathcal{L}_{C}+W over L^{2}(\nu): let \lambda_{0} and \lambda_{1} be the first two eigenvalues of \mathcal{L}_{C}+W; then \|p_{t}/\nu-\phi\|_{L^{2}(\nu)}\leq C\|p_{0}/\nu-\phi\|_{L^{2}(\nu)}e^{-(\lambda_{0}-\lambda_{1})t}.

Next, let us analyze how the convergence speed of \Phi^{t}_{\mathcal{L}_{C}+W}p_{0} improves upon that of the original process e^{t\mathcal{L}_{C}}(p_{0}), which is governed by the spectral gap. Exactly computing the spectral gap of a differential operator is hard in general; many existing analyses consider only the simplest case, where \nabla V is linear (i.e., the Hessian of V is a constant matrix). In contrast to these existing techniques, we analyze the contribution of W to the spectral gap by perturbation theory: we prove that the new process has a better convergence rate than the old one if W=\varepsilon m(V) for a small \varepsilon>0, where m:\mathbb{R}\to\mathbb{R} is a monotonically decreasing function.

Theorem 4.5.

Suppose in addition that V satisfies the same conditions as in Nier (2004). Take W=\varepsilon m(V) and work in the space L^{2}(\nu), where \nu(x)=e^{-\frac{2V(x)}{\sigma^{2}}}. Then, for \sigma small enough, the spectral gap of \mathcal{L}_{C}+\varepsilon m(V) is locally increasing in \varepsilon for small enough \varepsilon>0. Moreover, the principal eigenfunction of \mathcal{L}_{C}+\varepsilon m(V) concentrates more on the global minimum than the principal eigenfunction of \mathcal{L}_{C} for small enough \varepsilon>0.

4.2 Convergence of the Gradient-Free variants

In the gradient-free situation, the original operator \mathcal{D}_{C}:=\text{div}\big{(}\frac{\sigma^{2}}{2}C\nabla\cdot\big{)} has a trivial invariant measure, the uniform distribution, so no optimization property can be expected when W is not included. We therefore take another route, showing that \Phi_{\mathcal{D}_{C}+W}^{t}p_{0} converges exponentially to a distribution concentrated in a neighborhood of the global minimum, and that as \sigma\to 0 the invariant distribution becomes more and more concentrated at the global minimum. We state the result as follows:

Theorem 4.6.

Consider the space L^{2} and assume W is bounded on \mathbb{R}^{d}. Then \Phi_{\mathcal{D}_{C}+W}^{t}p_{0} converges exponentially to the principal normalized eigenfunction \mu_{\sigma}({\bm{x}}) of the operator \mathcal{D}_{C}+W. In addition, for any f\in L^{2} that is smooth and compactly supported, \lim_{\sigma\to 0}\langle\mu_{\sigma}(\bm{x}),f\rangle=f(\bm{x}^{\ast}), where {\bm{x}}^{\ast} is the global maximum of W; that is, \mu_{\sigma}(\bm{x}) converges weakly to \delta_{\bm{x}^{\ast}}(\bm{x}).

5 Numerical experiments

In this section, we first present a Fourier spectral analysis to verify our theoretical results. Then, we conduct two inverse problem tests, showing the positive effect of the reweighting/resampling procedure for sampling. Finally, we test an optimization task, showing that the resampling technique can help escape from local minima.

5.1 Fourier spectral method analysis for enhancing spectral gap and concentrating invariant measure

Let us consider a 1-d periodic problem. We first verify our results in Thm. 4.5. Let x\in[0,1) (although our analysis considers the space \mathbb{R}, the results transfer easily to the periodic case, with Asmp. 4.2 removed). We use the Fourier spectral method (Shen et al., 2011) to discretize the differential operators \mathcal{L} and \mathcal{L}-\varepsilon V, where

\mathcal{L}f(x):=-\frac{d}{dx}V(x)\frac{d}{dx}f(x)+\frac{d^{2}}{dx^{2}}f(x) (29)

for any periodic f\in C^{\infty}([0,1)), and

V(x)=\cos(9\pi x)-\cos(11\pi x) (30)

with the boundary smoothly modified.
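A sketch of this computation (our discretization choices; we omit the boundary smoothing, so V here is only C^1 at the boundary): build the FFT differentiation matrix on a uniform grid, assemble \mathcal{L} and \mathcal{L}-\varepsilon V as dense matrices, and compare spectral gaps.

  import numpy as np

  M = 256
  x = np.arange(M) / M
  k = 2j * np.pi * np.fft.fftfreq(M, d=1.0 / M)       # Fourier symbols of d/dx on [0, 1)
  D = np.real(np.fft.ifft(k[:, None] * np.fft.fft(np.eye(M), axis=0), axis=0))

  V = np.cos(9 * np.pi * x) - np.cos(11 * np.pi * x)  # Eq. (30), without boundary smoothing
  L = -np.diag(D @ V) @ D + D @ D                     # Eq. (29): -V'(x) d/dx + d^2/dx^2

  for eps in (0.0, 0.05, 0.1):
      ev = np.sort(np.linalg.eigvals(L - eps * np.diag(V)).real)[::-1]
      print(f"eps={eps}: spectral gap = {ev[0] - ev[1]:.4f}")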

Figure 1: Left: graphs of the first eigenfunction \nu of \mathcal{L}, the first eigenfunction \mu of \mathcal{L}-0.1V, the function V, and the quotient \mu/\nu, obtained by the Fourier spectral method. Right: the first two eigenvalues \lambda_{0}(\varepsilon),\lambda_{1}(\varepsilon) of \mathcal{L}-\varepsilon V vs. \varepsilon.

In Fig. 1 (left), we plot the principal eigenfunction \nu of the operator \mathcal{L}, the principal eigenfunction \mu of the operator \mathcal{L}-0.1V, the function V, and \frac{\mu}{\nu}, respectively. As expected, \mu concentrates more on the global minimum of V than \nu. In Fig. 1 (right), we see \lambda_{0}(\varepsilon)-\lambda_{1}(\varepsilon) increasing as \varepsilon increases, demonstrating the enhancement of the spectral gap. This experiment gives a visual explanation of Thm. 4.5: the eigenfunction of \mathcal{L} is the Gibbs measure \nu(\bm{x})=\exp(-2V(\bm{x})/\sigma^{2}), which gets larger as V gets smaller; the eigenfunction \mu of \mathcal{L}-0.1V puts even more mass than \nu near the global minimum of V, and the staircase-like graph of the quotient \mu/\nu gives a direct illustration of (42).

Next, let us verify the result in Thm. 4.6. We use the same space x\in[0,1) and the source term W(\bm{x})=-V(\bm{x}), where V(x) is the same as in (30), and examine the spectral properties of the following operator:

(\mathcal{D}_{\sigma}+W)f(x):=\frac{\sigma^{2}}{2}\frac{d^{2}}{dx^{2}}f(x)+W(x)f(x) (31)

for any periodic f\in C^{\infty}([0,1)). We plot in Fig. 2 (left) the principal eigenfunction, i.e., the invariant distribution density \mu_{\sigma}(x) of \mathcal{D}_{\sigma}+W, for different \sigma. We also calculate the mass near the global minimum: we take the interval I=[0.44,0.68] and calculate \int_{I}\mu_{\sigma}(x)dx. The result is plotted in Fig. 2 (right), showing the concentration tendency as \sigma\to 0. This coincides with Thm. 4.6: the invariant measure concentrates more and more on the global maximum of W as \sigma\to 0.

Figure 2: Left: the invariant density \mu_{\sigma}(x) obtained by the Fourier spectral method for different \sigma. Right: the mass ratio \int_{0.44}^{0.68}\mu_{\sigma}(x)dx as a function of \sigma.

5.2 Numerical tests for derivative-free inverse problem solving and sampling

In this section, we design numerical tests in the field of inverse problems. We compare our RILD algorithm with EKS (Garbuno-Iñigo et al., 2019) and EKI (Kovachki & Stuart, 2018). These problems can be seen as optimization problems for a least-squares functional with Tikhonov-Phillips regularization (14), so a derivative-free approximation of the update rule ((16) for EKS and EKI, or (26) for RILD) can be used to design derivative-free schemes.

We first solve a low-dimensional inverse problem. The numerical experiment considered here is the example originally presented by Ernst et al. (2015) and also used in Herty & Visconti (2018). We compare with the results from Garbuno-Iñigo et al. (2019); the experimental settings are exactly the same. The forward map is given by the solution of a one-dimensional elliptic boundary value problem, as defined in Garbuno-Iñigo et al. (2019):

-\frac{d}{du}\left(\exp(x_{1})\frac{d}{du}f(u)\right)=1,\quad u\in[0,1] (32)

with f(0)=0, f(1)=x_{2}. The explicit solution is given by

f(u)=x_{2}u+\exp(-x_{1})\left(-\frac{u^{2}}{2}+\frac{u}{2}\right). (33)

Thus we define the forward map

\mathcal{G}(\bm{x})=\left(f(u_{1}),f(u_{2})\right)^{T}. (34)

Here \bm{x}=(x_{1},x_{2})^{T} is the constant vector we want to recover, and we have noisy measurements \bm{y}=(27.5,79.7)^{T} of f(\cdot) at locations u_{1}=0.25, u_{2}=0.75. This can be solved by minimizing the least-squares functional defined as in Eq. (14). We assume observation noise \Gamma=0.1^{2}I_{2}, prior matrix \Gamma_{0}=10^{2}I_{2}, and an initial ensemble drawn from N(0,1)\times U(90,110). The ensemble size is N=10^{3}. We fix \sigma=\sqrt{2}. The stepsize \tau is updated adaptively as in Garbuno-Iñigo et al. (2019). We take W(\bm{x})=-\|\mathcal{G}(\bm{x})-\bm{y}\|^{2}_{\Gamma}.
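A sketch of this setup in code, using the explicit solution (33) (helper names are ours):

  import numpy as np

  def G(x):
      # Forward map of Eq. (34), via the explicit solution (33).
      x1, x2 = x
      f = lambda u: x2 * u + np.exp(-x1) * (-u**2 / 2 + u / 2)
      return np.array([f(0.25), f(0.75)])

  y = np.array([27.5, 79.7])
  Gamma_inv = np.eye(2) / 0.1**2                       # Gamma = 0.1^2 I_2
  W = lambda x: -(G(x) - y) @ Gamma_inv @ (G(x) - y)   # W(x) = -||G(x) - y||^2_Gamma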

We compare our RILD algorithm (Alg. 1) with the EKS and EKI algorithms. The key difference between RILD and EKS in this setting is the use of reweighting/resampling. The results are plotted in Fig. 3. From the figure, we see that our RILD algorithm converges much faster than the EKI or EKS algorithms with the same hyper-parameters, and since our problem is set up as a posterior sampling problem, the ensemble of RILD stops shrinking toward the minimum point after some iterations unless one decreases the diffusion parameter \sigma. One can expect that, with a decay schedule for \sigma, the RILD algorithm would perform much better than the EKS or EKI algorithms for optimization.

Figure 3: Convergence comparison among the RILD, EKS, and EKI algorithms: the mean loss versus iterations; the ensembles at the 15th iteration; the ensembles at the 30th iteration.

We also tested a high-dimensional case. Specifically, we define the map \mathcal{G}:\mathbb{R}^{d}\to\mathbb{R}^{2d-2},

\mathcal{G}(\bm{x})=\left(10(x_{2}-x_{1}^{2}),\cdots,10(x_{d}-x_{d-1}^{2}),x_{1},\cdots,x_{d-1}\right)^{T},

and \bm{y}=(0,\cdots,0,1,\cdots,1)^{T}, with 0 repeated d-1 times and 1 repeated d-1 times. One can verify that

\|\mathcal{G}(\bm{x})-\bm{y}\|^{2}=\sum_{i=1}^{d-1}\left(100(x_{i+1}-x_{i}^{2})^{2}+(x_{i}-1)^{2}\right)

is exactly the Rosenbrock function. We choose d=100, observation noise matrix \Gamma=0.1^{2}I_{198}, prior matrix \Gamma_{0}=10^{2}I_{100}, and an initial ensemble drawn from N(2,0.3^{2}I_{100}). The global solution of \mathcal{G}(\bm{x})=\bm{y} is \bm{x}=(1,\cdots,1)^{T}. The ensemble size is fixed to N=10^{3}. We take \sigma=\sqrt{2}, and the stepsize \tau is updated adaptively as before. For the RILD algorithm, we choose W(\bm{x})=-5\times 10^{-3}\|\mathcal{G}(\bm{x})-\bm{y}\|^{2}. As in the previous test, RILD converges much faster than EKI or EKS in this high-dimensional sampling task, and thus performs better in optimization settings; see Fig. 4.
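A quick sanity check of this identity (helper names are ours):

  import numpy as np

  d = 100
  def G(x):   # the map above, R^d -> R^{2d-2}
      return np.concatenate([10 * (x[1:] - x[:-1]**2), x[:-1]])

  y = np.concatenate([np.zeros(d - 1), np.ones(d - 1)])

  def rosenbrock(x):
      return np.sum(100 * (x[1:] - x[:-1]**2)**2 + (x[:-1] - 1)**2)

  x = np.random.default_rng(0).standard_normal(d)
  assert np.isclose(np.sum((G(x) - y)**2), rosenbrock(x))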

Figure 4: Comparison among the RILD, EKS, and EKI algorithms: the mean loss versus iterations; the ensembles at the 150th iteration; the ensembles at the 400th iteration.

5.3 Numerical tests for highly nonconvex high-dimensional optimization

We now test our RILD algorithm in a highly nonconvex, high-dimensional setting: the 100-dimensional Ackley function, defined as follows:

V(\bm{x})=-ae^{-b\sqrt{\frac{1}{d}\sum_{i=1}^{d}x_{i}^{2}}}-e^{\frac{1}{d}\sum_{i=1}^{d}\cos(cx_{i})}+a+e, (35)

where \bm{x}=(x_{1},\cdots,x_{d})^{T}, d=100, a=20, b=0.2, c=2\pi. As the difficulty mainly arises from the numerous local minima, the covariance modification, which is designed for ill-posed problems, is not suitable here; we simply take C=I.
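For reference, a direct transcription of Eq. (35) in code:

  import numpy as np

  def ackley(x, a=20.0, b=0.2, c=2 * np.pi):
      # Eq. (35); V(0) = 0 is the global minimum.
      d = x.size
      return (-a * np.exp(-b * np.sqrt(np.sum(x**2) / d))
              - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

  assert abs(ackley(np.zeros(100))) < 1e-12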

We compare our RILD algorithm (Alg. 1) with the classical Gradient Langevin Dynamics (GLD) algorithm, which discretizes a single path of the Langevin dynamics:

\bm{x}_{n+1}=\bm{x}_{n}-\nabla V(\bm{x}_{n})\tau+\sqrt{\tau\sigma^{2}}\xi_{n}.

The difference between RILD and GLD is that RILD maintains an ensemble of size N while GLD maintains only a single individual; at each step, RILD calculates a weight for each individual and then resamples the ensemble according to the weights (see Alg. 1). We now test whether RILD is better at escaping local minima than GLD.

For RILD, we take ensemble size N=50 and randomly draw the initial ensemble from N(0,30^{2}I_{100}) (such an initialization makes finding the global minimum difficult, as the first term in the Ackley function flattens out quickly once \bm{x} moves away from the origin). For GLD, the initial point is randomly chosen from the initial ensemble of RILD. We test a wide range of stepsizes \tau\in[2,32] and noise levels \sigma\in[1,16], and for each fixed \tau and \sigma we repeat 10 trials to compute the pass rate: a trial passes if the RILD or GLD algorithm finds a point \bm{x} with V(\bm{x})<17 within 5\times 10^{4} evaluations (once a point below 17 is found, the remaining task is trivial, as the first term in the Ackley function becomes dominant). All trials in all hyper-parameter settings begin with the same initial ensemble. In Fig. 5, one can see that RILD finds the true descent directions in a wide range of hyper-parameter settings, while GLD fails in all tested settings. We also tested GA and PSO under the same initial condition with different hyper-parameters, and report the best search results in Fig. 5 together with RILD and GLD.

Figure 5: Left: pass-rate heat maps for RILD and GLD. Right: decay curves for PSO, GA, RILD, and GLD, with \tau=10, \sigma=5 for RILD and GLD.

6 Conclusion

In this work, we have demonstrated a methodology for accelerating Langevin-dynamics-based algorithms through the addition of a source term W and the use of a reweighting/resampling technique: the RILD algorithm. Our algorithm and analyses shed some light on combining gradient algorithms and genetic algorithms using Partial Differential Equations (PDEs) with provable guarantees.

In the future, we will combine the reweighting technique with higher-order optimization schemes such as momentum-accelerated gradient methods. We will also conduct a finer analysis of convergence with finitely many particles, which is arguably more important, as asymptotic results are only suitable for large enough ensembles. We expect these studies to bring new insights for designing numerical algorithms.

References

  • Berg (2004) Berg, B. A. Markov chain monte carlo simulations and their statistical analysis. 2004.
  • Berg & Neuhaus (1991) Berg, B. A. and Neuhaus, T. Multicanonical algorithms for first order phase transitions. Physics Letters B, 267:249–253, 1991.
  • Dalalyan (2014) Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and log‐concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 2014.
  • Dalalyan (2017) Dalalyan, A. S. Further and stronger analogy between sampling and optimization: Langevin monte carlo and gradient descent. In Annual Conference Computational Learning Theory, 2017.
  • Engl et al. (1996) Engl, H. W., Hanke, M., and Neubauer, A. Regularization of inverse problems. 1996.
  • Ernst et al. (2015) Ernst, O. G., Sprungk, B., and Starkloff, H.-J. Analysis of the ensemble and polynomial chaos kalman filters in bayesian inverse problems. SIAM/ASA J. Uncertain. Quantification, 3:823–851, 2015.
  • Faniran (2015) Faniran, T. S. Numerical solution of stochastic differential equations. 2015.
  • Ferré & Stoltz (2017) Ferré, G. and Stoltz, G. Error estimates on ergodic properties of discretized feynman–kac semigroups. Numerische Mathematik, 143:261 – 313, 2017.
  • Foulkes et al. (2001) Foulkes, W. M. C., Mitas, L., Needs, R. J., and Rajagopal, G. Quantum monte carlo simulations of solids. Reviews of Modern Physics, 73:33–83, 2001.
  • Garbuno-Iñigo et al. (2019) Garbuno-Iñigo, A., Hoffmann, F., Li, W., and Stuart, A. M. Interacting langevin diffusions: Gradient structure and ensemble kalman sampler. SIAM J. Appl. Dyn. Syst., 19:412–441, 2019.
  • Geyer (1991) Geyer, C. J. Markov chain monte carlo maximum likelihood. 1991.
  • Herty & Visconti (2018) Herty, M. and Visconti, G. Kinetic methods for inverse problems. Kinetic & Related Models, 2018.
  • Kac et al. (1960) Kac, M., Uhlenbeck, G. E., Hibbs, A. R., van der Pol, B., and Gillis, J. Probability and related topics in physical sciences. 1960.
  • Kato (1966) Kato, T. Perturbation theory for linear operators. 1966.
  • Kirkpatrick et al. (1983) Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. Optimization by simulated annealing. Science, 220:671 – 680, 1983.
  • Kovachki & Stuart (2018) Kovachki, N. B. and Stuart, A. M. Ensemble kalman inversion: a derivative-free technique for machine learning tasks. Inverse Problems, 35, 2018.
  • Lelièvre & Stoltz (2016) Lelièvre, T. and Stoltz, G. Partial differential equations and stochastic methods in molecular dynamics*. Acta Numerica, 25:681 – 880, 2016.
  • Lyu et al. (2021) Lyu, J., Wang, Z., Xin, J. X., and Zhang, Z. A convergent interacting particle method and computation of kpp front speeds in chaotic flows. SIAM J. Numer. Anal., 60:1136–1167, 2021.
  • Moral (2004) Moral, P. D. Feynman-kac formulae: Genealogical and interacting particle systems with applications. 2004.
  • Moral (2013) Moral, P. D. Mean field simulation for monte carlo integration. 2013.
  • Moral & Miclo (2000) Moral, P. D. and Miclo, L. Branching and interacting particle systems. approximations of feynman-kac formulae with applications to non-linear filtering. 2000.
  • Nier (2004) Nier, F. Quantitative analysis of metastability in reversible diffusion processes via a witten complex approach. 2004.
  • Øksendal (1987) Øksendal, B. Stochastic differential equations : an introduction with applications. Journal of the American Statistical Association, 82:948, 1987.
  • Pankov (2001) Pankov, A. Introduction to spectral theory of schrödinger operators. Mathematics eJournal, 2001.
  • Parlett (1981) Parlett, B. N. The symmetric eigenvalue problem. 1981.
  • Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Annual Conference Computational Learning Theory, 2017.
  • Reed & Simon (1979) Reed, M. C. and Simon, B. Methods of modern mathematical physics. iii. scattering theory. 1979.
  • Schillings & Stuart (2016) Schillings, C. and Stuart, A. M. Analysis of the ensemble kalman filter for inverse problems. SIAM J. Numer. Anal., 55:1264–1290, 2016.
  • Shen et al. (2011) Shen, J., Tang, T., and Wang, L. Spectral methods: Algorithms, analysis and applications. 2011.
  • Strang (1968) Strang, G. On the construction and comparison of difference schemes. SIAM Journal on Numerical Analysis, 5:506–517, 1968.
  • Swendsen & Wang (1986) Swendsen and Wang. Replica monte carlo simulation of spin glasses. Physical review letters, 57 21:2607–2609, 1986.
  • Varadhan (2010) Varadhan, S. R. S. Large deviations. 2010.
  • Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In International Conference on Machine Learning, 2011.
  • Xu et al. (2017) Xu, P., Chen, J., and Gu, Q. Global convergence of langevin dynamics based algorithms for nonconvex optimization. ArXiv, abs/1707.06618, 2017.
  • Zhang et al. (2017) Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient langevin dynamics. In Annual Conference Computational Learning Theory, 2017.

Appendix A Proofs

In this section we give proofs of the results in Section 4.

Proof of Lemma  4.1

For any f,g\in L^{2}(\nu)\cap C_{0}^{\infty}(\mathbb{R}^{d}), note that multiplication by W is clearly self-adjoint, so we only need to prove the claim for \mathcal{L}_{C}. Notice that

Z_{\nu}\langle f,\mathcal{L}_{C}g\rangle_{L^{2}(\nu)}=\int_{\mathbb{R}^{d}}f(\bm{x})\left(-\langle C\nabla V(\bm{x}),\nabla g(\bm{x})\rangle+\frac{\sigma^{2}}{2}\text{div}(C\nabla g(\bm{x}))\right)e^{-2\sigma^{-2}V(\bm{x})}d\bm{x} (36)
=\int_{\mathbb{R}^{d}}-f(\bm{x})\langle C\nabla V(\bm{x}),\nabla g(\bm{x})\rangle e^{-2\sigma^{-2}V(\bm{x})}d\bm{x}+\frac{\sigma^{2}}{2}\int_{\mathbb{R}^{d}}f(\bm{x})\text{div}(C\nabla g(\bm{x}))e^{-2\sigma^{-2}V(\bm{x})}d\bm{x} (37)
=\int_{\mathbb{R}^{d}}-f(\bm{x})\langle C\nabla V(\bm{x}),\nabla g(\bm{x})\rangle e^{-2\sigma^{-2}V(\bm{x})}d\bm{x}-\frac{\sigma^{2}}{2}\int_{\mathbb{R}^{d}}\left\langle C\nabla g(\bm{x}),\nabla\big{(}f(\bm{x})e^{-2\sigma^{-2}V(\bm{x})}\big{)}\right\rangle d\bm{x} (38)
=\int_{\mathbb{R}^{d}}-f(\bm{x})\big{(}\langle C\nabla V(\bm{x}),\nabla g(\bm{x})\rangle-\langle\nabla V(\bm{x}),C\nabla g(\bm{x})\rangle\big{)}e^{-2\sigma^{-2}V(\bm{x})}d\bm{x}-\frac{\sigma^{2}}{2}\int_{\mathbb{R}^{d}}\langle C\nabla g(\bm{x}),\nabla f(\bm{x})\rangle e^{-2\sigma^{-2}V(\bm{x})}d\bm{x} (39)
=-\frac{\sigma^{2}}{2}\int_{\mathbb{R}^{d}}\langle C\nabla g(\bm{x}),\nabla f(\bm{x})\rangle e^{-2\sigma^{-2}V(\bm{x})}d\bm{x}, (40)

where the first integral in (39) vanishes because C is symmetric. The final expression (40) is symmetric in f and g; thus \mathcal{L}_{C} is self-adjoint on L^{2}(\nu).

\square

Next, let us use a change-of-variables trick to eliminate C for simplicity.

Lemma A.1.

Given any invertible matrix A\in\mathbb{R}^{d\times d} and any function f:\mathbb{R}^{d}\to\mathbb{R}, we define the function

\tilde{f}(\bm{x}):=f(A\bm{x}).

We denote by \mathcal{\tilde{L}}_{C} the differential operator

\mathcal{\tilde{L}}_{C}\cdot:=-\langle C\nabla\tilde{V},\nabla\cdot\rangle+\frac{\sigma^{2}}{2}\text{div}(C\nabla\cdot);

then the following statement is true:

\mathcal{\tilde{L}}_{\tilde{C}}\tilde{g}(\bm{x})=\mathcal{L}_{C}g(A\bm{x})

for any g\in C^{\infty}(\mathbb{R}^{d}), where \tilde{C}=A^{-1}C(A^{T})^{-1}.

Proof.

Using the chain rule, we notice that

(\nabla\tilde{f})(\bm{x})=A^{T}(\nabla f)(A\bm{x}),

thus

\mathcal{\tilde{L}}_{\tilde{C}}\tilde{g}(\bm{x})=-\langle A\tilde{C}A^{T}(\nabla V)(A\bm{x}),(\nabla g)(A\bm{x})\rangle+\frac{\sigma^{2}}{2}\text{div}(A\tilde{C}A^{T}\nabla g)(A\bm{x})=\mathcal{L}_{C}g(A\bm{x}).

Lemma A.1 shows that we can use a simple linear transformation to simplify the operator. Specifically, choosing A=C^{\frac{1}{2}} transforms \mathcal{L}_{C} into the operator \mathcal{\tilde{L}} of the classical form, so we may consider only the case C=I.

To better understand the spectral properties of \mathcal{L}, we introduce the Witten Laplacian, which is unitarily equivalent to \mathcal{L} and \mathcal{L}^{\dagger} and acts on the unweighted L^{2}-space. The Witten Laplacian is defined by

\Delta_{f,h}=(-h\nabla+\nabla f)\cdot(h\nabla+\nabla f)=-h^{2}\Delta+(|\nabla f|^{2}-h\Delta f),

and then we can find the following property:

Proposition A.2.

The operator \mathcal{L} can be unitarily changed into the Witten Laplacian form as follows:

-2\sigma^{-2}\Delta_{V/2,\sigma^{2}/2}=\tilde{U}\mathcal{L}\tilde{U}^{-1},

where \tilde{U} is the unitary transformation

\tilde{U}:L^{2}(\nu)\to L^{2},\quad\phi\mapsto\phi\sqrt{\nu}.

Thus, we may only consider the spectrum of

\mathcal{S}:=\tilde{U}\mathcal{L}\tilde{U}^{-1}=\frac{\sigma^{2}}{2}\Delta-\frac{|\nabla V|^{2}}{2\sigma^{2}}+\frac{\Delta V}{2}

over L^{2} in place of \mathcal{L} over L^{2}(\nu). Notice that W is invariant under this transformation: for any W, \tilde{U}W\tilde{U}^{-1}=W. Thus the spectrum of \mathcal{S}+W over L^{2} is equivalent to the spectrum of \mathcal{L}+W over L^{2}(\nu).

Proofs of Theorems 4.3 and 4.4

By Lemma A.1 and Proposition A.2, the operator \mathcal{L}+W is transformed into the Schrödinger-form operator \mathcal{S}+W. By Assumption 4.2, \mathcal{S}+W has a discrete real spectrum, and thus a sequence of real eigenvalues \lambda_{0}>\lambda_{1}\geq\lambda_{2}\geq\cdots; see Reed & Simon (1979), Pankov (2001).

Recall

\Phi^{t}_{\mathcal{L}+W}(p_{0})=\frac{e^{t(\mathcal{L}^{\dagger}+W)}p_{0}}{\int_{\mathbb{R}^{d}}e^{t(\mathcal{L}^{\dagger}+W)}p_{0}(\tilde{\bm{x}})d\tilde{\bm{x}}}=\frac{e^{t(\mathcal{L}^{\dagger}+W-\lambda_{0})}p_{0}}{\|e^{t(\mathcal{L}^{\dagger}+W-\lambda_{0})}p_{0}\|_{L^{1}}}.

Following the same procedure as Proposition 2 in Ferré & Stoltz (2017), we have that, for any p_{0}\in L^{2}(1/\nu),

\left\|\Phi^{t}_{\mathcal{L}+W}(p_{0})-\frac{\psi_{0}}{\|\psi_{0}\|_{L^{1}}}\right\|_{L^{2}(1/\nu)}\leq C\left\|p_{0}-\frac{\psi_{0}}{\|\psi_{0}\|_{L^{1}}}\right\|_{L^{2}(1/\nu)}e^{-(\lambda_{0}-\lambda_{1})t},

where \psi_{0} is the eigenfunction of \mathcal{L}^{\dagger}+W corresponding to \lambda_{0}. Notice that, by definition and a simple calculation,

\left\|p_{0}-\frac{\psi_{0}}{\|\psi_{0}\|_{L^{1}}}\right\|_{L^{2}(1/\nu)}=\left\|\frac{p_{0}}{\nu}-\frac{\psi_{0}}{\nu\|\psi_{0}\|_{L^{1}}}\right\|_{L^{2}(\nu)},

and \frac{\psi_{0}}{\|\psi_{0}\|_{L^{1}}\nu} is just the normalized eigenfunction \phi_{0} of \mathcal{L}+W corresponding to \lambda_{0}.

\square

Proof of Theorem  4.5

By Lemma A.1 and Proposition A.2, we consider the eigenpairs of the operators \mathcal{S} and \mathcal{S}^{\varepsilon}:=\mathcal{S}+\varepsilon m(V). We denote the eigenpairs of \mathcal{S} by \{(\phi_{i},\lambda_{i}),0\leq i<\infty\} and the eigenpairs of \mathcal{S}^{\varepsilon} by \{(\phi^{\varepsilon}_{i},\lambda^{\varepsilon}_{i}),0\leq i<\infty\}. For simplicity we denote \beta=\frac{2}{\sigma^{2}}. The perturbation theory of spectra (see Chapter 5.4 of Pankov (2001), or Kato (1966)) reads:

\lambda^{\varepsilon}_{i}=\lambda_{i}+\varepsilon\frac{\langle\phi_{i},m(V)\phi_{i}\rangle}{\langle\phi_{i},\phi_{i}\rangle}+O(\varepsilon^{2}), (41)
\phi_{i}^{\varepsilon}=\phi_{i}+\varepsilon\sum_{j\neq i}\frac{\langle\phi_{j},m(V)\phi_{i}\rangle}{(\lambda_{i}-\lambda_{j})\langle\phi_{j},\phi_{j}\rangle}\phi_{j}. (42)

We first prove that, for small enough \varepsilon,

\lambda_{0}^{\varepsilon}-\lambda_{1}^{\varepsilon}>\lambda_{0}-\lambda_{1},

which by (41) is equivalent to

\frac{\langle\phi_{0},m(V)\phi_{0}\rangle}{\langle\phi_{0},\phi_{0}\rangle}>\frac{\langle\phi_{1},m(V)\phi_{1}\rangle}{\langle\phi_{1},\phi_{1}\rangle}.

According to Chapter 2.5 in Lelièvre & Stoltz (2016) and Nier (2004), there are exactly m_{0} eigenvalues close to \lambda_{0}, corresponding to the m_{0} local minima of V, and for 1\leq j\leq m_{0} the \phi_{j} are of the form \chi_{j}\exp(-\beta V/2), where the functions \chi_{j} are locally constant over the basins of attraction of the local minima of V. We denote the minimum corresponding to each \phi_{j} by \bm{x}_{j}.

We denote the basin of attraction of \bm{x}_{1} by B_{1}, with corresponding local minimum value V(\bm{x}_{1}), and set M=\max_{\bm{x}\in B_{1}}V(\bm{x}). According to Nier (2004), for any \delta>0, \chi_{1} is asymptotically constant on the region G_{1}:=\{V(\bm{x})\leq M-\delta\}\cap B_{1}^{c} as \beta\to\infty. Thus we may assume

\chi_{1}|_{B_{1}\cup G_{1}}\approx I_{B_{1}}+c(\beta)I_{G_{1}}. (43)

Observing that V(\bm{x})>M-\delta for \bm{x}\in(B_{1}\cup G_{1})^{c}, we have

0=\langle\phi_{1},\phi_{0}\rangle=\int_{\mathbb{R}^{n}}\chi_{1}\exp(-\beta V)d\bm{x}=\int_{B_{1}}\exp(-\beta V)d\bm{x}+c(\beta)\int_{G_{1}}\exp(-\beta V)d\bm{x}+O(e^{-\beta(M-\delta)}),

and since \bm{x}_{0}\in G_{1}, we deduce that

c(\beta)=-\frac{\int_{B_{1}}\exp(-\beta V)d\bm{x}+O(e^{-\beta(M-\delta)})}{\int_{G_{1}}\exp(-\beta V)d\bm{x}}=-O(e^{-\beta(V(\bm{x}_{1})-V(\bm{x}_{0}))}). (44)

Then

\frac{\langle\phi_{1},m(V)\phi_{1}\rangle}{\langle\phi_{1},\phi_{1}\rangle}=\frac{\int_{B_{1}}m(V)\exp(-\beta V)d\bm{x}-O(e^{-2\beta(V(\bm{x}_{1})-V(\bm{x}_{0}))})\int_{G_{1}}m(V)\exp(-\beta V)d\bm{x}}{\int_{B_{1}}\exp(-\beta V)d\bm{x}-O(e^{-2\beta(V(\bm{x}_{1})-V(\bm{x}_{0}))})\int_{G_{1}}\exp(-\beta V)d\bm{x}}=m(V(\bm{x}_{1}))+O(\beta^{-1}),

as m(V) has no dependence on \beta.

On the other hand,

\frac{\langle\phi_{0},m(V)\phi_{0}\rangle}{\langle\phi_{0},\phi_{0}\rangle}=m(V(\bm{x}^{\ast}))+O(\beta^{-1}),

thus we arrive at

\frac{\langle\phi_{0},m(V)\phi_{0}\rangle}{\langle\phi_{0},\phi_{0}\rangle}>\frac{\langle\phi_{1},m(V)\phi_{1}\rangle}{\langle\phi_{1},\phi_{1}\rangle},

because m is a decreasing function and V(\bm{x}^{\ast})<V(\bm{x}_{1}).

By (42), for the estimation of the eigenfunction \phi_{0}^{\varepsilon}, we aim to prove that

\langle\phi_{j},m(V)\phi_{0}\rangle<0 (45)

for j such that \lambda_{j} is close enough to \lambda_{0}; these terms then suppress the height of \phi_{0} near the local minima while retaining its height near the global minimum.

We prove this only for j=1; for j=2,\cdots,m_{0} the procedure is similar, and for j>m_{0} the corresponding terms in (42) are asymptotically negligible.

Note that, omitting small terms (from the integration region outside B_{1}\cup G_{1}) and using (43) and (44), the statement (45) is equivalent to

\int_{B_{1}}m(V)e^{-\beta V}d\bm{x}-\frac{\int_{B_{1}}e^{-\beta V}d\bm{x}}{\int_{G_{1}}e^{-\beta V}d\bm{x}}\int_{G_{1}}m(V)e^{-\beta V}d\bm{x}<0,
i.e., \frac{\int_{B_{1}}m(V)e^{-\beta V}d\bm{x}}{\int_{B_{1}}e^{-\beta V}d\bm{x}}<\frac{\int_{G_{1}}m(V)e^{-\beta V}d\bm{x}}{\int_{G_{1}}e^{-\beta V}d\bm{x}},

which is asymptotically true as \beta\to\infty, since the left-hand side tends to m(V(\bm{x}_{1})) while the right-hand side tends to m(V(\bm{x}^{\ast})), and m(V(\bm{x}_{1}))<m(V(\bm{x}^{\ast})) because m is decreasing.

\square

Proof of Theorem  4.6

We begin by considering the spectrum of W as a multiplication operator. Let W_{min}=\inf\{W(\bm{x})\}, W_{max}=\sup\{W(\bm{x})\}, where W_{min} may be -\infty. We assume W_{max} is attained at one and only one maximizer \bm{x}^{\ast}.

Note that for any \lambda>W_{max} or \lambda<W_{min}, the operator \lambda-W is invertible in L^{2}, so the spectrum of W is exactly [W_{min},W_{max}]. We now consider the operator \frac{\sigma^{2}}{2}\Delta+W. By the Rayleigh-quotient formula (Parlett, 1981), the principal eigenvalue \lambda_{0}(\sigma) of \frac{\sigma^{2}}{2}\Delta+W over L^{2} is

\lambda_{0}(\sigma)=\sup_{\phi\in H^{1}}\frac{\int_{\mathbb{R}^{d}}\phi\big{(}(\frac{\sigma^{2}}{2}\Delta+W)\phi\big{)}d\bm{x}}{\int_{\mathbb{R}^{d}}\phi^{2}d\bm{x}}=\sup_{\phi\in H^{1}}\left(\frac{\int_{\mathbb{R}^{d}}\phi^{2}Wd\bm{x}}{\int_{\mathbb{R}^{d}}\phi^{2}d\bm{x}}-\frac{\sigma^{2}}{2}\frac{\int_{\mathbb{R}^{d}}(\nabla\phi)^{T}\nabla\phi d\bm{x}}{\int_{\mathbb{R}^{d}}\phi^{2}d\bm{x}}\right).

It is worth noticing that \lambda_{0}(\sigma) is decreasing with respect to \sigma; thus for any \sigma>0, \lambda_{0}(\sigma)<\lambda_{0}(0)=W_{max}.

Take a sequence of test functions \phi_{n}(\bm{x})\in H^{1} that tends to the delta function \delta_{\bm{x}^{\ast}}(\bm{x}). Taking \sigma_{n}:=\sqrt{\frac{2}{n}\frac{\langle\phi_{n},\phi_{n}\rangle}{\langle\nabla\phi_{n},\nabla\phi_{n}\rangle}}, we have

\lim_{\sigma\to 0}\lambda_{0}(\sigma)\geq\lim_{n\to\infty}\left(\frac{\langle\phi_{n},W\phi_{n}\rangle}{\langle\phi_{n},\phi_{n}\rangle}-\frac{\sigma^{2}_{n}}{2}\frac{\langle\nabla\phi_{n},\nabla\phi_{n}\rangle}{\langle\phi_{n},\phi_{n}\rangle}\right)=W_{max}.

Thus \lambda_{0}(\sigma) is continuous at \sigma=0.

Now denote by \psi_{0}(\sigma) the principal normalized eigenfunction of \frac{\sigma^{2}}{2}\Delta+W. For any f\in L^{2} that is smooth and compactly supported, noticing that

\langle\psi_{0}(\sigma),f\rangle=\lambda_{0}(\sigma)^{-1}\langle(\frac{\sigma^{2}}{2}\Delta+W)\psi_{0}(\sigma),f\rangle (46)
=\lambda_{0}(\sigma)^{-1}\frac{\sigma^{2}}{2}\langle\psi_{0}(\sigma),\Delta f\rangle+\lambda_{0}(\sigma)^{-1}\langle\psi_{0}(\sigma),Wf\rangle, (47)

and taking \sigma\to 0, the first term tends to 0, and thus

\lim_{\sigma\to 0}\langle\psi_{0}(\sigma),f\rangle=\lim_{\sigma\to 0}\langle\frac{W}{W_{max}}\psi_{0}(\sigma),f\rangle.

This implies that as \sigma\to 0, \psi_{0}(\sigma) behaves more and more like \delta_{\bm{x}^{\ast}}(\bm{x}).

\square