Reweighted Interacting Langevin Diffusions: an Accelerated Sampling Method for Optimization
Abstract
We propose a new technique to accelerate sampling methods for solving difficult optimization problems. Our method exploits the intrinsic connection between posterior distribution sampling and optimization with Langevin dynamics: we propose an interacting particle scheme that approximates a Reweighted Interacting Langevin Diffusion (RILD) system. The underlying system is designed by adding a multiplicative source term to the classical Langevin operator, leading to a higher convergence rate and a more concentrated invariant measure. We analyze the convergence rate of our algorithm and its improvement over existing results in the asymptotic regime. We also design various tests to verify our theoretical results, demonstrating accelerated convergence and the ability to break through barriers of spurious local minima, especially in high-dimensional non-convex settings. Our algorithms and analysis shed some light on combining gradient and genetic algorithms using Partial Differential Equations (PDEs) with provable guarantees.
1 Introduction
Optimization methods have natural connections to sampling methods (Dalalyan, 2017). For simplicity, consider a potential function that is twice differentiable and has a single simple global minimum. The first-order Langevin dynamics is defined by the following Stochastic Differential Equation (SDE):
$\mathrm{d}X_t = -\nabla f(X_t)\,\mathrm{d}t + \sqrt{2\sigma}\,\mathrm{d}W_t$   (1)
where $\sigma$ is the diffusion parameter, proportional to the square of the temperature parameter, and $W_t$ is the standard Brownian motion. Under certain assumptions on the drift coefficient, the distribution of $X_t$ in Eq. 1 converges exponentially to its stationary distribution, the Gibbs measure $\pi_\sigma \propto e^{-f/\sigma}$. As $\sigma \to 0$, the Gibbs measure concentrates on the global minimum, leading to various Langevin-dynamics-based algorithms such as Gradient Langevin Dynamics (GLD) (Xu et al., 2017; Dalalyan, 2014) and Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011).
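As a concrete illustration, the Euler-Maruyama discretization of Eq. (1) gives the basic GLD iterate. The following is only a minimal sketch (not the paper's implementation); the quadratic test potential and all parameter values are our own illustrative choices:

```python
import numpy as np

def gld(grad_f, x0, step=0.05, sigma=0.01, n_iters=5000, rng=None):
    """Gradient Langevin Dynamics: Euler-Maruyama discretization of
    dX_t = -grad f(X_t) dt + sqrt(2*sigma) dW_t."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_f(x) + np.sqrt(2.0 * sigma * step) * noise
    return x

# Toy quadratic potential f(x) = |x|^2 / 2, so grad f(x) = x; for small
# sigma the iterates concentrate near the global minimum at the origin.
x_final = gld(lambda x: x, x0=np.ones(2) * 5.0)
```

For small `sigma` the stationary spread per coordinate is of order `sqrt(sigma)`, so `x_final` lands close to the minimizer.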
While Langevin-dynamics-based algorithms handle optimization tasks with mildly nonconvex potential functions effectively (Raginsky et al., 2017; Zhang et al., 2017), they encounter difficulties with global optimization for highly nonconvex potentials, as a vanishing diffusion parameter is required to guarantee convergence. However, a small diffusion parameter makes the probability of jumping across a potential barrier diminish drastically. Conventional Markov Chain Monte Carlo (MCMC) methods were proposed to alleviate such problems, such as parallel tempering (Swendsen & Wang, 1986; Geyer, 1991), flat-histogram algorithms (Berg & Neuhaus, 1991) and simulated annealing (Kirkpatrick et al., 1983). Among them, Simulated Annealing (SA) provides a reasonable way to handle this problem by maintaining a population of particles that sequentially approximates the annealed Gibbs measures. However, it requires numerous evaluations of the potential, which can be computationally costly, making it less attractive for optimization tasks. Therefore, in this paper we aim at accelerating Langevin-dynamics-based algorithms by modifying the underlying continuous Langevin dynamics.
Recently, an Ensemble Kalman Sampler (EKS) (Garbuno-Iñigo et al., 2019; Schillings & Stuart, 2016) was suggested to accelerate the convergence rate toward the equilibrium. This method approximates a preconditioning-modified Langevin dynamics that shares the same equilibrium with the original one but enjoys a faster convergence rate for ill-posed problems. Besides, this method has another benefit when the potential function is a least-squares functional (Engl et al., 1996): neither the gradient of the potential nor the Jacobian matrix of the forward map needs to be calculated explicitly, as the terms involving them can be approximated by first-order information from the current ensemble; see Eq. (16).
However, the EKS method offers little improvement when solving highly non-convex optimization problems. Besides, since our purpose is to find the global minimum, restricting the stationary distribution of the underlying dynamics to be precisely the Gibbs measure is not necessary. If another dynamics has a limit distribution approximating the Delta distribution at the global minimum, it can equally well be used to approximate the global minimum. This inspires us to further modify the Langevin dynamics for faster convergence. Specifically, we modify the Fokker-Planck equation associated with the Langevin dynamics by adding a linear source term, which can be proven, by a spectral approach (Pankov, 2001), to improve both the convergence rate and the concentration of the invariant measure near the global minimum. The new process belongs to a class of nonlinear operators called Feynman-Kac semigroups, developed in Large Deviation Theory for calculating generating functions (Varadhan, 2010) and also used in important practical applications such as the Diffusion Monte Carlo (DMC) method (Foulkes et al., 2001). To design a practical algorithm, we use interacting particle methods (Moral & Miclo, 2000; Moral, 2013), introducing reweighting and resampling techniques to simulate the effect of this source term, obtaining the Reweighted Interacting Langevin Diffusion (RILD) algorithm. This is, as far as we know, the first use of interacting particle methods for optimization tasks. Extensive numerical experiments demonstrate the effect of accelerating convergence and flattening potential barriers.
The main contributions are summarized as follows:
2 Preliminary
2.1 Overdamped Langevin Dynamics
In this section, we introduce the classical Langevin Dynamics which is related to GLD and SGLD algorithms.
Suppose that is twice differentiable and has a simple global minimum for simplicity.
The overdamped Langevin dynamics is defined as in Eq. 1. Such a dynamics is closely related to the Fokker-Planck equation. Denote
$\mathcal{L}\phi = -\nabla f \cdot \nabla\phi + \sigma\,\Delta\phi$   (2)
the infinitesimal generator (Øksendal, 1987) of the Markov process , and its adjoint operator:
$\mathcal{L}^{*}\rho = \nabla\cdot(\rho\,\nabla f) + \sigma\,\Delta\rho$   (3)
Let us assume the law of $X_t$ at time $t$ has a density $\rho_t$ with respect to the Lebesgue measure. Then $\rho_t$ satisfies the Fokker-Planck equation:
$\partial_t \rho_t = \mathcal{L}^{*}\rho_t$   (4)
We denote by $\rho_0$ the initial distribution of the overdamped Langevin dynamics; it is well known that the Markov operator admits a unique invariant probability measure. The rate of convergence to it has been widely studied:
Proposition 2.1 (Proposition 2.3 in Lelièvre & Stoltz (2016)).
Under specific conditions (Assumption 4.2),¹ (¹we define , where , for all such that , and for all ,)
where is the first non-zero eigenvalue of the operator on .
Note that it is real because the operator is self-adjoint² (²we say a linear operator is self-adjoint on a Hilbert space , if for any on ).
For sampling purposes, individual samples are drawn from independent paths by Markov Chain Monte Carlo (MCMC, (Berg, 2004)) corresponding to the discretized Langevin dynamics. This method is rather slow for high-dimensional, non-convex problems, and such a weakness makes it less attractive for optimization. Fortunately, we can exploit the connections between individual samples, as done in Garbuno-Iñigo et al. (2019). We introduce their work in the next subsection.
2.2 Interacting Langevin Diffusion
In Garbuno-Iñigo et al. (2019), the authors introduced a modified Langevin dynamics and analyzed its property, proving its superiority in designing sampling algorithms. We now introduce the main result of their work.
The convergence rate of classical Langevin-dynamics-based algorithms can be very slow when the potential varies in a highly anisotropic way. A common approach to cancel out this effect is to introduce a preconditioning positive semi-definite matrix in the corresponding gradient scheme,
$\mathrm{d}X_t = -C\,\nabla f(X_t)\,\mathrm{d}t + \sqrt{2\sigma}\,B\,\mathrm{d}W_t$   (5)
Here $B$ can be any real matrix such that $BB^{\top} = C$ (note that in this case, the Brownian motion can be reduced to an $r$-dimensional standard Brownian motion, as the essential rank of $C$ is no larger than $r$).
The corresponding Fokker-Planck equation now becomes
$\partial_t \rho = \nabla\cdot\big(\rho\, C\nabla f\big) + \sigma\,\nabla\cdot\big(C\nabla\rho\big)$   (6)
and the infinitesimal generator
(7) | ||||
(8) |
One can easily verify that the Gibbs measure is invariant for the above system for any positive semi-definite $C$, and that it is the unique positive invariant measure if $C$ is strictly positive definite.
Finding a suitable preconditioner is of general interest. One natural choice, as a counterpart of Newton's scheme in optimization, is the inverse Hessian, which is however unfriendly for computation. One intrinsic benefit of this choice is its affine-invariance when designing numerical schemes.
Such a property can also be preserved if we take , the covariance matrix under the probability measure ,
(9) |
(10) |
The dynamic now becomes a nonlinear flow
(11) | ||||
(12) |
To simulate from such a mean-field model, the finite interacting particle system is introduced:
(13) |
where are i.i.d. standard Brownian motions, and
Particles in this system are then no longer independent of each other, in contrast to the independent simulations in the SGLD algorithm.
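A minimal sketch of one discretized step of the interacting system (13), with the empirical ensemble covariance as the preconditioner. The jitter term and all parameter values below are our own illustrative choices, not part of the original scheme:

```python
import numpy as np

def interacting_langevin_step(X, grad_f, step=0.05, sigma=0.1, rng=None):
    """One Euler-Maruyama step of the finite interacting particle system:
    drift and noise are both preconditioned by the ensemble covariance."""
    rng = np.random.default_rng(0) if rng is None else rng
    J, d = X.shape
    mean = X.mean(axis=0)
    C = (X - mean).T @ (X - mean) / J          # empirical covariance
    C += 1e-8 * np.eye(d)                      # small jitter for stability
    sqrtC = np.linalg.cholesky(C)
    grads = np.apply_along_axis(grad_f, 1, X)
    noise = rng.standard_normal((J, d))
    return X - step * grads @ C.T + np.sqrt(2.0 * sigma * step) * noise @ sqrtC.T

# Toy quadratic potential f(x) = |x|^2 / 2; the ensemble starts far from
# the minimizer and is driven toward it.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2)) + 4.0
for _ in range(500):
    X = interacting_langevin_step(X, lambda x: x, rng=rng)
ens_mean_norm = np.linalg.norm(X.mean(axis=0))
```

Note that as the ensemble contracts, the covariance preconditioner shrinks too, which is exactly the self-adjusting behavior of the nonlinear flow (11)-(12).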
Now suppose the potential is a least-squares functional³ (³for any positive-definite matrix , we define , and .) with Tikhonov-Phillips regularization (Engl et al., 1996):
(14) |
where is the forward map, and are two positive definite matrices. In this situation, the system (13) can be rewritten as
(15) |
Using the first-order Taylor approximation
one may approximate Eq. 15 in a derivative-free manner as
(16) |
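The derivative-free idea can be sketched as follows for a linear forward map. This is only an illustrative EKI-style update under our own assumptions: the exact form of Eq. (16) may differ in details, and the regularization and noise terms are omitted here:

```python
import numpy as np

def eks_deriv_free_step(X, G, y, Gamma_inv, step=0.02):
    """Derivative-free ensemble update: gradients of the least-squares
    misfit are replaced by ensemble statistics of the forward map G, so
    the Jacobian of G is never formed (cf. Eq. 16)."""
    J = X.shape[0]
    GX = np.array([G(x) for x in X])            # forward evaluations only
    Gbar, Xbar = GX.mean(axis=0), X.mean(axis=0)
    # D[i, j] = < Gamma^{-1} (G(x_i) - y), G(x_j) - Gbar >
    D = (GX - y) @ Gamma_inv @ (GX - Gbar).T
    return X - (step / J) * D @ (X - Xbar)

# Linear toy problem G(theta) = A @ theta with noiseless data.
A = np.array([[1.0, 0.5], [0.0, 2.0], [1.0, 1.0]])
theta_true = np.array([1.0, -2.0])
y = A @ theta_true
G = lambda th: A @ th

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2)) * 2.0
misfit = lambda E: np.mean([np.sum((G(x) - y) ** 2) for x in E])
m0 = misfit(X)
for _ in range(200):
    X = eks_deriv_free_step(X, G, y, np.eye(3))
m1 = misfit(X)
```

For this linear, noiseless toy problem the ensemble misfit decreases by orders of magnitude without ever evaluating the Jacobian of `G`.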
3 Reweighted Interacting Langevin Diffusion
3.1 Continuous process analysis
Keeping the invariant measure exactly equal to the Gibbs measure is not necessary for optimization. It would be preferable if a new process could converge faster to its invariant measure, with that invariant measure still concentrating on the global optimum as the noise level vanishes.
Now we introduce an additional source term to modify Eq. (6) into
$\partial_t \rho_t = \nabla\cdot(\rho_t\, C\nabla f) + \sigma\,\nabla\cdot(C\nabla\rho_t) + g\,\rho_t$   (17)
$\partial_t \bar{\rho}_t = \nabla\cdot(\bar{\rho}_t\, C\nabla f) + \sigma\,\nabla\cdot(C\nabla\bar{\rho}_t) + \Big(g - \int g\,\bar{\rho}_t\,\mathrm{d}x\Big)\bar{\rho}_t$   (18)
Here we mainly consider the situations where the preconditioner is the identity, corresponding to Eq. 4, or the covariance, corresponding to Eq. 12, but the analysis remains valid for any positive-definite preconditioner. We assume throughout this paper that the source function is smooth and bounded above.
Let us examine the role the source term plays in the evolution of this process. If we take the source as a function of the potential that becomes larger where the potential is smaller, then intuitively the proportion of mass in the better-fitness region (closer to the global minimum) becomes larger, so we expect the invariant measure to concentrate more on the global optimum.
Note that the process (17), (18) no longer has a gradient-flow structure and does not preserve total mass, so a normalization is added to keep the total mass equal to 1. Such a normalized process has been well studied as the Feynman-Kac semigroup in the linear case, i.e. when the preconditioner is a constant matrix. Since spectral analysis is widely studied for linear operators (Kato, 1966), and in the numerical discretization the preconditioner is fixed at each time step, we conduct the analysis for an arbitrary fixed matrix.
We introduce the solution operator corresponding to Eq. 17 and Eq. 18: recalling the infinitesimal generator, the solution of Eq. 17 can be represented by its semigroup. We denote the corresponding reweighted operator, also well known as the Feynman-Kac semigroup, as
then is exactly the solution to Eq. 18.
The operator is nonlinear in general, but it shares many similarities with the linear operator. Specifically, its unique positive fixed point is proportional to the principal eigenfunction of the linear operator, and the rate of convergence to it is controlled by the spectral gap. For the Feynman-Kac semigroup, one can refer to Ferré & Stoltz (2017) for the time-invariant case, Lyu et al. (2021) for the time-periodic case, and Moral (2004) for systematic details.
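The fixed-point behavior of the normalized Feynman-Kac semigroup can be illustrated on a finite state space, where the propagation reduces to a normalized matrix iteration. The 3-state generator and source below are toy choices of ours, not from the paper:

```python
import numpy as np

# Toy 3-state rate matrix Q (rows sum to 0) and a diagonal source
# W = diag(w), with larger w on the "fitter" states.
Q = np.array([[-1.0, 1.0, 0.0],
              [0.5, -1.0, 0.5],
              [0.0, 1.0, -1.0]])
w = np.array([0.0, 0.2, 0.5])
A = Q.T + np.diag(w)                    # generator of the weighted flow
dt = 0.01
P = np.eye(3) + dt * A                  # forward-Euler transition (entrywise >= 0)

p = np.full(3, 1.0 / 3.0)
for _ in range(20000):
    p = P @ p
    p /= p.sum()                        # normalization: the nonlinear semigroup

# The fixed point is the (normalized) principal eigenvector of A.
vals, vecs = np.linalg.eig(A)
v = np.real(vecs[:, np.argmax(np.real(vals))])
v = np.abs(v) / np.abs(v).sum()
```

The normalized iteration converges to the principal eigenvector `v`, mirroring the statement that the unique positive fixed point of the nonlinear semigroup is the principal eigenfunction of the linear operator.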
We are interested in the benefits the source term can contribute. We show that, when taking the source to be a small multiple of a monotonically decreasing function of the potential, it brings two main benefits:
•  the spectral gap of the modified operator is larger than that of the original operator, considered on the same functional space;
•  the invariant measure of the modified operator concentrates more on the global minimum than the invariant measure of the original one.
These two benefits show that, at the same noise level, the process (18) converges faster and has a more concentrated invariant measure than the process (4) or (27). We further explain these benefits and give proofs in Section 4.1.
3.2 Discrete algorithm design
Now let us use interacting particle methods (Moral & Miclo, 2000; Moral, 2013) and mean-field theory (Kac et al., 1960) to design algorithms. We introduce the Reweighted Interacting Langevin Diffusion algorithm, obtained by approximating the density with a population of particles and discretizing in time with an operator-splitting technique and a forward-Euler scheme.
Then we use the operator-splitting technique⁴ (⁴higher-order splitting techniques can be applied, such as Strang splitting (Strang, 1968) with smaller splitting error; however, approximation error is still induced in later steps, so we choose the simplest one here): approximate by with splitting error .
Let us now approximate the density. Suppose it is approximated by a weighted empirical measure generated from sample-weight pairs. We use the simplest Euler-Maruyama (Faniran, 2015) approximation,
(19) | |||
(20) | |||
(21) |
Here the preconditioner can be the identity or the covariance matrix of the current step; if we take it to be the covariance matrix, it is computed as (denoting )
(22) | |||
(23) |
Then, after normalizing, we get the natural approximation
(24) | |||
(25) |
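One splitting step (a Langevin move followed by reweighting and normalization) can be sketched as below. The specific source W(x) = -beta*f(x) is an illustrative monotone-decreasing choice of ours, and the parameters are not those of the paper's experiments:

```python
import numpy as np

def rild_step(X, w, f, grad_f, step=0.05, sigma=0.5, beta=1.0, rng=None):
    """One operator-splitting step of RILD (sketch): an Euler-Maruyama
    Langevin move, then multiplication of the weights by exp(dt * W(x))
    with the illustrative source W(x) = -beta * f(x), then normalization."""
    rng = np.random.default_rng(0) if rng is None else rng
    X = X - step * np.apply_along_axis(grad_f, 1, X) \
          + np.sqrt(2.0 * sigma * step) * rng.standard_normal(X.shape)
    fX = np.apply_along_axis(f, 1, X)
    w = w * np.exp(-step * beta * fX)   # reweighting half-step
    w = w / w.sum()                     # normalization (cf. Eqs. 24-25)
    return X, w

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 2)) * 2.0
w = np.full(100, 0.01)
f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2.0 * x
for _ in range(30):
    X, w = rild_step(X, w, f, grad_f, rng=rng)

weighted_mean_f = np.sum(w * np.apply_along_axis(f, 1, X))
plain_mean_f = np.mean(np.apply_along_axis(f, 1, X))
```

Since the weights tilt mass toward low-potential particles, the weighted ensemble average of the potential sits below the unweighted one.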
Note that the weights may become polarized, so we use a resampling technique for a better approximation: if the polarization reaches a threshold, we resample the replicas according to the multinomial distribution associated with the weights, which defines a new set of replicas and the empirical distribution
Then we replace by in Eq. 21 and conduct further computation.
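The resampling step can be sketched as follows, using the effective sample size 1/Σᵢ wᵢ² as the polarization criterion. This is a common choice in particle methods and an assumption of ours; the paper's exact threshold rule may differ:

```python
import numpy as np

def resample_if_needed(X, w, ess_threshold=0.5, rng=None):
    """Multinomial resampling: when the effective sample size
    1 / sum(w_i^2) drops below a fraction of N, draw N replicas from the
    weighted empirical measure and reset the weights to uniform."""
    rng = np.random.default_rng(0) if rng is None else rng
    N = len(w)
    ess = 1.0 / np.sum(w ** 2)
    if ess < ess_threshold * N:
        idx = rng.choice(N, size=N, p=w)
        return X[idx], np.full(N, 1.0 / N)
    return X, w

# Polarized weights trigger a resample; nearly uniform weights do not.
X = np.arange(10, dtype=float).reshape(-1, 1)
w = np.array([0.91] + [0.01] * 9)
X2, w2 = resample_if_needed(X, w)                 # ESS ~ 1.2 < 5: resampled
X3, w3 = resample_if_needed(X, np.full(10, 0.1))  # ESS = 10: left unchanged
```

After a resample the weights are uniform and most replicas duplicate the dominant particle, which is exactly the selection effect described above.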
When the potential is a least-squares functional with Tikhonov-Phillips regularization as in Eq. 14, we can follow the same analysis, approximating the update in a derivative-free manner similar to Eq. 16:
(26) | ||||
We summarize the procedure in Algorithm 1. Note that we fix the population size, stepsize, and noise level for simplicity, but these can be adjusted dynamically for faster convergence.
Note that our method can also be interpreted as a mutation-selection genetic particle algorithm with MCMC mutations: the Langevin update rule can be regarded as mutation, and the resampling step as selection. This type of algorithm acts as a bridge connecting genetic algorithms and PDEs, enabling sharper convergence analysis.
3.3 Gradient-Free variants
In practice, exact gradients are hard to obtain for many problems: gradient information may be unavailable or computationally expensive. We therefore suggest a gradient-free variant, corresponding to a process with only the diffusion and source terms. We will prove that such a process converges exponentially to its invariant measure, and that this invariant measure concentrates on the global minimum as the noise level vanishes.
We introduce the formula of the modified process as
(27) | ||||
(28) |
Note that we only delete the drift term; the source term can still be chosen to depend on the potential. We will show in Thm. 4.6 that the system finally converges to a distribution concentrating on the maximum of the source.
The discrete algorithm is designed similarly to Algorithm 1 but ignores the gradient term; this can be seen as taking the drift coefficient to be zero.
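A sketch of the gradient-free variant on a one-dimensional toy problem: the search is driven by diffusion plus reweighting/resampling alone, with no drift. All parameter values and the resample-every-step schedule are illustrative choices of ours:

```python
import numpy as np

def gradient_free_rild(f, x0, step=0.05, sigma=0.2, beta=2.0,
                       n_iters=500, rng=None):
    """Gradient-free variant: the drift term is dropped entirely; particles
    diffuse, are reweighted by exp(-step * beta * f), and are resampled."""
    rng = np.random.default_rng(0) if rng is None else rng
    X = np.array(x0, dtype=float)
    N = len(X)
    for _ in range(n_iters):
        X = X + np.sqrt(2.0 * sigma * step) * rng.standard_normal(N)
        w = np.exp(-step * beta * f(X))
        w /= w.sum()
        X = X[rng.choice(N, size=N, p=w)]   # resample every iteration
    return X

f = lambda x: (x - 3.0) ** 2                # global minimum at x = 3
X = gradient_free_rild(f, np.zeros(200))
center = np.median(X)
```

Even though no gradient of `f` is ever evaluated, the population drifts from the origin to the minimizer: selection keeps the particles that diffusion happens to push downhill.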
4 Theoretical properties
In this section, we state the main theoretical results associated with our RILD algorithm 1. The proofs are offered in Appendix A.
4.1 Spectral Gap enhancement of the reweighting modification
Now we analyze how the source term helps improve the convergence rate, as well as the sharpness of the invariant measure.
First let us restrict the comparison to the weighted space , where . The reason for choosing this space is mainly the following property:
Lemma 4.1.
For any positive definite matrix , and are self-adjoint over .
To continue our analysis, we need the following assumption on the potential and the source, which is necessary for the operators to have a discrete spectrum that is bounded above (see Pankov (2001)).
Assumption 4.2.
The functions and are assumed to satisfy:
and
Next, we need to prove that the Feynman-Kac semigroup maps any initial density to a limit density. Such a result is stated as follows:
Theorem 4.3.
Under Asmp. 4.2, for any that is positive-definite, there exists a principal eigenvalue of over with the corresponding normalized eigenfunction . Furthermore, for any positive density , the normalized probability has a limit equal to , that is, .
Theorem 4.4.
Next, let us analyze how the convergence speed is improved compared to the original process, which is closely related to the spectral gap. Exactly analyzing the spectral gap of a differential operator is hard in general, and many existing analyses only consider the simplest case of a constant preconditioner. In contrast to these existing techniques, we analyze the contribution of the source term to the spectral gap by perturbation theory: we prove that the new process has a better convergence rate than the old one if the source is a small multiple of a monotonically decreasing function of the potential.
Theorem 4.5.
Suppose in addition the potential satisfies the same condition as in Nier (2004). If we take , and consider the operators on the space , where , then for small enough the spectral gap of is locally increasing versus that of . Besides, the principal eigenfunction of concentrates more on the global minimum than the principal eigenfunction of for small enough .
4.2 Convergence of the Gradient-Free variants
In the gradient-free situation, the original operator has a trivial invariant measure, the uniform distribution, so no optimization property can be expected without the source term. We therefore state our result differently, showing that the process converges exponentially to a distribution concentrated on a neighborhood of the global minimum, and that, as the noise level vanishes, the invariant distribution becomes more and more concentrated on the global minimum. We state the result as follows:
Theorem 4.6.
Consider under the space , assuming is bounded on ; then converges exponentially to the normalized principal eigenfunction of the operator . In addition, for any smooth, compactly supported , , where is the global maximum of ; that is, weakly converges to .
5 Numerical experiments
In this section, we first present a Fourier spectral analysis to verify the theoretical properties of the proposed method. Then we conduct two inverse-problem tests, showing the positive effect of the reweighting/resampling procedure for sampling. Finally, we test an optimization task, showing that the resampling technique helps escape local minima.
5.1 Fourier spectral method analysis for enhancing spectral gap and concentrating invariant measure
Let us consider a one-dimensional periodic problem. We first verify our results in Thm. 4.5. Let ⁵ (⁵although our analysis considers the space , the results can be easily transferred to the periodic case with Asmp. 4.2 removed), and use the Fourier spectral method (Shen et al., 2011) to discretize the differential operators and , where
(29) |
for any periodic , and
(30) |
with boundary smoothly modified.
In Fig. 1, we plot the principal eigenfunction of the original operator, the principal eigenfunction of the modified operator, and the functions and , respectively. As expected, the modified eigenfunction concentrates more on the global minimum of the potential. In Fig. 1, we also see the spectral gap increase with the perturbation strength, showing the enhancement of the spectral gap. This experiment gives a visual explanation of Thm. 4.5: the eigenfunction of the original operator is the Gibbs measure, which sharpens as the noise decreases; the eigenfunction of the modified operator clearly has more mass concentrated near the global minimum, and the staircase-like graph of their quotient gives a direct illustration of (42).
Next, let us verify our result in Thm. 4.6. We use the same space and a source term where is the same as in (30), and test the spectral properties of the following operator
(31) |
for any periodic function. We plot in Fig. 2 the principal eigenfunction, i.e. the invariant distribution density, of the operator for different noise levels. We also calculate the mass near the global minimum: we take the interval and calculate the contained mass. The results are plotted in Fig. 2, showing the concentration tendency as the noise vanishes. This coincides with the result in Thm. 4.6: the invariant measure concentrates more and more on the global maximum of the source.
5.2 Numerical tests for inverse problem derivative-free solving and sampling
In this section, we design numerical tests from the field of inverse problems. We compare our RILD algorithm with EKS (Garbuno-Iñigo et al., 2019) and EKI (Kovachki & Stuart, 2018). These problems can be cast as the optimization of a least-squares functional with Tikhonov-Phillips regularization (14), so a derivative-free approximation of the updating rule ((16) for EKS and EKI, or (26) for RILD) can be used to design derivative-free schemes.
We first try to solve a low-dimensional inverse problem. The numerical experiment considered here is the example originally presented by Ernst et al. (2015), and also used in Herty & Visconti (2018). We compare with the result from Garbuno-Iñigo et al. (2019), and the experimental settings are exactly the same. The forward map is given by the solution of a one-dimensional elliptic boundary value problem as defined in Garbuno-Iñigo et al. (2019),
(32) |
with . The explicit solution is given by
(33) |
Thus we define the forward map
(34) |
Here is a constant vector we want to recover, and we have noisy measurements of at locations . This can be solved by minimizing the least-squares functional defined in Eq. 14. We assume observation noise , prior matrix , and initial ensemble drawn from . The ensemble size is . We fix . The stepsize is updated adaptively as in Garbuno-Iñigo et al. (2019). We take .
We compare our RILD Algorithm 1 with the EKS and EKI algorithms. The key difference between RILD and EKS in this setting is the use of reweighting/resampling. The results are plotted in Fig. 3. From the figure, we see that RILD converges much faster than EKI or EKS with the same hyperparameters, and, since our problem is posed as posterior sampling, the RILD ensemble stops shrinking toward the minimum after some iterations unless one decreases the diffusion parameter. One can expect that, with a decay schedule on the diffusion parameter, the RILD algorithm will perform much better than EKS or EKI for optimization.
We also test a high-dimensional case. Specifically, we define the map
with repeats times and repeats times. One can verify that
is exactly a function. We choose , observation noise matrix , prior distribution matrix , and initial ensemble drawn from . The global solution of is . The ensemble size is fixed to . We take and the stepsize is updated adaptively as before. For the RILD algorithm, we choose . As in the previous test, one finds that RILD converges much faster than EKI or EKS in this high-dimensional sampling task, and thus performs better in optimization settings.
5.3 Numerical tests for highly nonconvex high-dimensional optimization
We now test our RILD algorithm in a highly nonconvex, high-dimensional setting. We test with a 100-dimensional function, defined as follows:
(35) |
where . As the difficulty mainly arises from the numerous local minima, the covariance modification designed for ill-posed problems is not suitable here, so we simply take the identity preconditioner.
We compare our algorithm RILD in Alg. 1 with the classical Gradient Langevin Dynamics (GLD) algorithm. GLD algorithm discretizes a single path of the Langevin Dynamics:
The difference between RILD and GLD is that RILD maintains an ensemble of particles while GLD maintains only one individual; at each step, RILD computes a weight for each individual and then resamples the ensemble according to the weights, see Alg. 1. We now test whether RILD is better at escaping local minima than GLD.
For RILD, we take ensemble size , and randomly pick the initial ensemble⁶ from (⁶such an initial setting makes it difficult to find the global minimum, as the first term in the function quickly dominates once one moves away from the origin). For GLD, the initial point is randomly chosen from the initial ensemble of RILD. We test a wide range of stepsizes and noise levels , and for each fixed pair we repeat 10 trials to calculate the pass rate: we say a trial passes if the RILD or GLD algorithm finds a point below the threshold within the evaluation budget⁷ (⁷if one finds a point with value smaller than 17, the remaining task is trivial, as the first term of the function becomes dominant). All trials in all hyperparameter settings begin with the same initial ensemble. In Fig. 5 one can see that RILD finds the true descent directions over a wide range of hyperparameter settings, while the GLD algorithm fails in all tested settings. We also tested GA and PSO under the same initial condition with different hyperparameters, and report the best search results in Fig. 5 together with RILD and GLD.
6 Conclusion
In this work, we have demonstrated a methodology for accelerating Langevin-dynamics-based algorithms through the addition of a source term and the use of the reweighting/resampling technique: the RILD algorithm. Our algorithm and analyses shed some light on combining gradient algorithms and genetic algorithms using Partial Differential Equations (PDEs) with provable guarantees.
In the future, we will combine the reweighting technique with higher-order optimization schemes such as momentum-accelerated gradients. We will also conduct a finer convergence analysis for finitely many particles, which is all the more important since asymptotic results are only suitable for large enough ensembles. We expect these studies to bring insights for designing new numerical algorithms.
References
- Berg (2004) Berg, B. A. Markov chain monte carlo simulations and their statistical analysis. 2004.
- Berg & Neuhaus (1991) Berg, B. A. and Neuhaus, T. Multicanonical algorithms for first order phase transitions. Physics Letters B, 267:249–253, 1991.
- Dalalyan (2014) Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 2014.
- Dalalyan (2017) Dalalyan, A. S. Further and stronger analogy between sampling and optimization: Langevin monte carlo and gradient descent. In Annual Conference Computational Learning Theory, 2017.
- Engl et al. (1996) Engl, H. W., Hanke, M., and Neubauer, A. Regularization of inverse problems. 1996.
- Ernst et al. (2015) Ernst, O. G., Sprungk, B., and Starkloff, H.-J. Analysis of the ensemble and polynomial chaos kalman filters in bayesian inverse problems. SIAM/ASA J. Uncertain. Quantification, 3:823–851, 2015.
- Faniran (2015) Faniran, T. S. Numerical solution of stochastic differential equations. 2015.
- Ferré & Stoltz (2017) Ferré, G. and Stoltz, G. Error estimates on ergodic properties of discretized feynman–kac semigroups. Numerische Mathematik, 143:261 – 313, 2017.
- Foulkes et al. (2001) Foulkes, W. M. C., Mitas, L., Needs, R. J., and Rajagopal, G. Quantum monte carlo simulations of solids. Reviews of Modern Physics, 73:33–83, 2001.
- Garbuno-Iñigo et al. (2019) Garbuno-Iñigo, A., Hoffmann, F., Li, W., and Stuart, A. M. Interacting langevin diffusions: Gradient structure and ensemble kalman sampler. SIAM J. Appl. Dyn. Syst., 19:412–441, 2019.
- Geyer (1991) Geyer, C. J. Markov chain monte carlo maximum likelihood. 1991.
- Herty & Visconti (2018) Herty, M. and Visconti, G. Kinetic methods for inverse problems. Kinetic & Related Models, 2018.
- Kac et al. (1960) Kac, M., Uhlenbeck, G. E., Hibbs, A. R., van der Pol, B., and Gillis, J. Probability and related topics in physical sciences. 1960.
- Kato (1966) Kato, T. Perturbation theory for linear operators. 1966.
- Kirkpatrick et al. (1983) Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. Optimization by simulated annealing. Science, 220:671 – 680, 1983.
- Kovachki & Stuart (2018) Kovachki, N. B. and Stuart, A. M. Ensemble kalman inversion: a derivative-free technique for machine learning tasks. Inverse Problems, 35, 2018.
- Lelièvre & Stoltz (2016) Lelièvre, T. and Stoltz, G. Partial differential equations and stochastic methods in molecular dynamics. Acta Numerica, 25:681 – 880, 2016.
- Lyu et al. (2021) Lyu, J., Wang, Z., Xin, J. X., and Zhang, Z. A convergent interacting particle method and computation of kpp front speeds in chaotic flows. SIAM J. Numer. Anal., 60:1136–1167, 2021.
- Moral (2004) Moral, P. D. Feynman-kac formulae: Genealogical and interacting particle systems with applications. 2004.
- Moral (2013) Moral, P. D. Mean field simulation for monte carlo integration. 2013.
- Moral & Miclo (2000) Moral, P. D. and Miclo, L. Branching and interacting particle systems. approximations of feynman-kac formulae with applications to non-linear filtering. 2000.
- Nier (2004) Nier, F. Quantitative analysis of metastability in reversible diffusion processes via a witten complex approach. 2004.
- Øksendal (1987) Øksendal, B. Stochastic differential equations : an introduction with applications. Journal of the American Statistical Association, 82:948, 1987.
- Pankov (2001) Pankov, A. Introduction to spectral theory of schrödinger operators. Mathematics eJournal, 2001.
- Parlett (1981) Parlett, B. N. The symmetric eigenvalue problem. 1981.
- Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Annual Conference Computational Learning Theory, 2017.
- Reed & Simon (1979) Reed, M. C. and Simon, B. Methods of modern mathematical physics. iii. scattering theory. 1979.
- Schillings & Stuart (2016) Schillings, C. and Stuart, A. M. Analysis of the ensemble kalman filter for inverse problems. SIAM J. Numer. Anal., 55:1264–1290, 2016.
- Shen et al. (2011) Shen, J., Tang, T., and Wang, L. Spectral methods: Algorithms, analysis and applications. 2011.
- Strang (1968) Strang, G. On the construction and comparison of difference schemes. SIAM Journal on Numerical Analysis, 5:506–517, 1968.
- Swendsen & Wang (1986) Swendsen and Wang. Replica monte carlo simulation of spin glasses. Physical review letters, 57 21:2607–2609, 1986.
- Varadhan (2010) Varadhan, S. R. S. Large deviations. 2010.
- Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In International Conference on Machine Learning, 2011.
- Xu et al. (2017) Xu, P., Chen, J., and Gu, Q. Global convergence of langevin dynamics based algorithms for nonconvex optimization. ArXiv, abs/1707.06618, 2017.
- Zhang et al. (2017) Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient langevin dynamics. In Annual Conference Computational Learning Theory, 2017.
Appendix A Proofs
In this section we prove the results of Section 4.
Proof of Lemma 4.1
For any , note that the diffusion part is clearly a self-adjoint operator, so we only need to prove the claim for the remaining part. Notice that
(36) | ||||
(37) | ||||
(38) | ||||
(39) | ||||
(40) |
it is not hard to see that this is a symmetric form; thus the operator is self-adjoint on .
Next, let us use a change-of-variables trick to eliminate the preconditioner for simplicity.
Lemma A.1.
Given any invertible matrix , for any function , we define the function
We denote the differential operator as
then the following statement is true:
for any , where
Proof.
Using the chain rule, we notice that
thus
∎
Lemma A.1 shows that we can use a simple linear transformation to simplify the operator. Specifically, choosing , we can transform the operator into its classical form. Thus we may only consider the case .
To better understand the spectral properties, we introduce the Witten Laplacian, which is unitarily equivalent to the operators above and acts on the non-weighted -space. The Witten Laplacian is defined by
and then we can find the following property:
Proposition A.2.
The operator can be unitarily changed into the Witten Laplacian form as follows:
where is the unitary transformation
Thus, we may only consider the spectrum of
over in place of over . Notice that is invariant under this transformation, that is, for any , ; thus the spectrum of over is equivalent to the spectrum of over .
By Lemma A.1 and Proposition A.2, the operator is transformed into the Schrödinger-form operator . By Assumption 4.2, it has a discrete real spectrum, and thus a sequence of real eigenvalues; see Reed & Simon (1979), Pankov (2001).
Recall
Following the same procedure as Proposition 2 in Ferré & Stoltz (2017), we have that, for any ,
where is the eigenfunction of corresponding to . Noticing that, by definition and simple calculation,
and is just the normalized eigenfunction of corresponding to .
Proof of Theorem 4.5
By Lemma A.1 and Proposition A.2, we consider the eigenpairs of the operators and . We denote the eigenpair of as , and the eigenpair of as . For simplicity we write . Perturbation theory for the spectrum (see Chapter 5.4 of Pankov (2001), or Kato (1966)) gives:
(41) |
(42) |
We first prove that, for small enough ,
According to Chapter 2.5 of Lelièvre & Stoltz (2016) and Nier (2004), there are exactly eigenvalues close enough to corresponding to the local minima of , and for , they are of the form , where the functions are locally constant over the basins of attraction of the local minima of ; we denote the minimum corresponding to each by .
We then denote the attraction basins by with corresponding local minima , and denote . According to Nier (2004), for any , the region is asymptotically constant as . Thus we may assume
(43) |
Observing that when , we have
and , and we deduce that
(44) |
Then
as has no dependence on .
On the other side,
thus we arrive at
because is a decreasing function and
By (42), for the eigenfunction estimation of , we aim at proving that
(45) |
for close enough to ; these terms will suppress the height near the local minima of , while retaining the height near the global minimum.
We only prove this for ; for , the procedure is similar. For , these terms in (42) are asymptotically negligible.
Note that, by omitting the small term (the integration region outside ) and using (43), (44), the statement (45) is equivalent to
which is asymptotically true as , since the left-hand side tends to while the right-hand side tends to .
Proof of Theorem 4.6
We begin by considering the spectrum of . Let , where can be . We assume the maximum is attained at one and only one point .
Note that for any or , the operator is invertible in , so the spectrum of is just . We now consider the operator . By the Rayleigh-quotient formula (Parlett, 1981), the principal eigenvalue of over is
It is worth noticing that is a decreasing function with respect to ; thus for any , .
Taking a sequence of test functions that tends to the delta function , and noticing that if we take , we have
Thus is continuous at .
Now denote by the normalized principal eigenfunction of . For any that is smooth and compactly supported, noticing that
(46) | ||||
(47) |
taking , the first term tends to 0, and thus
This implies that as acts closer and closer to .