Global optimization via Schrödinger-Föllmer diffusion
Abstract.
We study the problem of finding global minimizers of a non-convex smooth function $f$ approximately via sampling from the probability distribution $\mu_{\epsilon}$ with density proportional to $\exp(-f(x)/\epsilon)$ with respect to the Lebesgue measure, for $\epsilon>0$ small enough. We analyze a sampler based on the Euler-Maruyama discretization of the Schrödinger-Föllmer diffusion process with stochastic approximation, under appropriate assumptions on the step size and the potential $f$. We prove that the output of the proposed sampler is an approximate global minimizer of $f$ with high probability, at the cost of sampling a controlled number of standard normal random variables. Numerical studies illustrate the effectiveness of the proposed method and its superiority to the Langevin method.
Key words and phrases:
Global optimization, Schrödinger-Föllmer diffusion, Stochastic approximation
1. Introduction
In this paper we study the challenging problem of finding the global minimizers of a non-convex smooth function $f:\mathbb{R}^{d}\to\mathbb{R}$. Suppose $\mathcal{X}^{*}$ is the set of the global minima of $f$ with finite cardinality, i.e.,
(1) $\mathcal{X}^{*}=\operatorname*{arg\,min}_{x\in\mathbb{R}^{d}}f(x)=\{x_{1}^{*},\ldots,x_{m}^{*}\}\subset B(0,R),$
where $B(0,R)$ denotes the ball centered at the origin with radius $R$. Precisely speaking, we have $f(x_{1}^{*})=\cdots=f(x_{m}^{*})=\min_{x\in\mathbb{R}^{d}}f(x)$ and $f(x^{*})<f(x)$ for any $x\notin\mathcal{X}^{*}$. Without loss of generality, we may modify $f$ outside the ball without changing the global minimizers of $f$. For any given $\epsilon>0$, we define the normalizing constant $Z_{\epsilon}=\int_{\mathbb{R}^{d}}\exp(-f(x)/\epsilon)\,\mathrm{d}x$ and a probability density function on $\mathbb{R}^{d}$ as
$\mu_{\epsilon}(x)=Z_{\epsilon}^{-1}\exp\left(-f(x)/\epsilon\right).$
Let $\mu_{\epsilon}$ also denote the probability distribution corresponding to this density. If the function $f$ is twice continuously differentiable, the measure $\mu_{\epsilon}$ converges weakly, as $\epsilon$ goes to $0$, to a probability measure supported on $\mathcal{X}^{*}$ with weight at $x_{i}^{*}$ proportional to $\left(\det\nabla^{2}f(x_{i}^{*})\right)^{-1/2}$, i.e.,
$\mu_{\epsilon}\Rightarrow\sum_{i=1}^{m}w_{i}\,\delta_{x_{i}^{*}},\qquad w_{i}\propto\left(\det\nabla^{2}f(x_{i}^{*})\right)^{-1/2}.$
Therefore, solving the optimization problem (1) can be converted into sampling from the probability measure $\mu_{\epsilon}$ for sufficiently small $\epsilon$.
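For intuition, consider a one-dimensional illustration (an illustrative sketch, not part of the formal argument): if $f$ has two global minima $x_{1}^{*},x_{2}^{*}$ with $f(x_{i}^{*})=0$ and $f''(x_{i}^{*})>0$, then expanding $f$ around each minimizer gives, for small $\delta>0$,
$\int_{|x-x_{i}^{*}|\leq\delta}e^{-f(x)/\epsilon}\,\mathrm{d}x\approx\int_{\mathbb{R}}e^{-\frac{f''(x_{i}^{*})}{2\epsilon}(x-x_{i}^{*})^{2}}\,\mathrm{d}x=\sqrt{\frac{2\pi\epsilon}{f''(x_{i}^{*})}},$
so the limiting weight of $\mu_{\epsilon}$ at $x_{i}^{*}$ is proportional to $\left(f''(x_{i}^{*})\right)^{-1/2}$, matching the general formula above.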
An efficient method for sampling from $\mu_{\epsilon}$ is based on the overdamped Langevin stochastic differential equation (SDE), which is given by
(2) $\mathrm{d}X_{t}=-\nabla f(X_{t})\,\mathrm{d}t+\sqrt{2\epsilon}\,\mathrm{d}B_{t},$
where $B_{t}$ is a $d$-dimensional Brownian motion. Under certain conditions, the Langevin SDE (2) admits $\mu_{\epsilon}$ as its unique invariant measure (Bakry et al., 2008). Hence the Langevin sampler is obtained by applying the Euler-Maruyama discretization to this process in order to sample from $\mu_{\epsilon}$. Convergence properties of the Langevin sampler under the strongly convex potential assumption have been established by Durmus and Moulines (2016); Dalalyan (2017a, b); Cheng and Bartlett (2018); Durmus and Moulines (2019); Dalalyan and Karagulyan (2019). Moreover, the strongly convex potential assumption can be replaced by different conditions that guarantee a log-Sobolev inequality for the target distribution, including the dissipativity condition on the drift term (Raginsky et al., 2017; Zhang et al., 2019; Mou et al., 2022) and the local convexity condition on the potential function outside a ball (Durmus and Moulines, 2017; Ma et al., 2019; Bou-Rabee et al., 2020).
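For concreteness, a minimal sketch of the Langevin sampler obtained from the Euler-Maruyama discretization of (2) is given below; this is our own illustrative Python implementation (not code from the cited works), and the function grad_f, the step size h, the temperature eps and the iteration count n_iter are placeholder names.

import numpy as np

def langevin_sampler(grad_f, x0, eps=0.1, h=1e-3, n_iter=10_000, rng=None):
    # Unadjusted Langevin algorithm: x_{k+1} = x_k - h * grad_f(x_k) + sqrt(2*eps*h) * xi_k.
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        xi = rng.standard_normal(x.shape)
        x = x - h * grad_f(x) + np.sqrt(2.0 * eps * h) * xi
    return x  # approximately a draw from the Gibbs measure after sufficient burn-in

After a sufficiently long burn-in, the iterates are approximately distributed according to the Gibbs measure; this is the baseline compared against in Section 4.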
An alternative to the Langevin sampler is the class of algorithms based on the Schrödinger-Föllmer diffusion process (3). This process has been proposed for sampling and generative modelling (Tzen and Raginsky, 2019; Huang et al., 2021; Jiao et al., 2021; Wang et al., 2021). Subsequently, Ruzayqat et al. (2022) studied the problem of unbiased estimation of expectations based on the Schrödinger-Föllmer diffusion process. Vargas et al. (2021) applied this process to Bayesian inference for large datasets, where the posterior is reached in finite time. Zhang and Chen (2021) proposed the Path Integral Sampler (PIS), a generic sampler that generates samples by simulating a target-dependent SDE and that can be trained with free-form network architectures; the PIS is built on the Schrödinger-Föllmer diffusion process to reach the terminal distribution. Different from these existing works, the purpose of this paper is to solve non-convex smooth optimization problems. To that end, we need to rescale the Schrödinger-Föllmer diffusion process so that it samples from the target distribution $\mu_{\epsilon}$.
To be precise, the Schrödinger-Föllmer diffusion process associated to $\mu_{\epsilon}$ is defined as
(3) $\mathrm{d}X_{t}=b(X_{t},t)\,\mathrm{d}t+\mathrm{d}B_{t},\qquad X_{0}=0,\quad t\in[0,1],$
where the drift function is
$b(x,t)=\nabla\log\mathbb{E}_{Z\sim N(0,I_{d})}\left[r_{\epsilon}\big(x+\sqrt{1-t}\,Z\big)\right],$
with the density ratio $r_{\epsilon}=\mathrm{d}\mu_{\epsilon}/\mathrm{d}N(0,I_{d})$, i.e., $r_{\epsilon}(x)=\mu_{\epsilon}(x)/\phi(x)$, where $\phi$ is the density function of the normal distribution $N(0,I_{d})$. According to Léonard (2014) and Eldan et al. (2020), the process in (3) was first formulated by Föllmer (Föllmer, 1985, 1986, 1988) when studying the Schrödinger bridge problem (Schrödinger, 1932). The main feature of the above Schrödinger-Föllmer diffusion process is that it interpolates $\delta_{0}$ and $\mu_{\epsilon}$ on the time interval $[0,1]$, i.e., $X_{1}\sim\mu_{\epsilon}$; see Proposition 2.3. We can therefore solve the optimization problem (1) by sampling from $\mu_{\epsilon}$ via the following Euler-Maruyama discretization of (3),
$Y_{t_{k+1}}=Y_{t_{k}}+s\,b(Y_{t_{k}},t_{k})+\sqrt{s}\,\varepsilon_{k+1},\qquad Y_{0}=0,\quad k=0,1,\ldots,K-1,$
where $s=1/K$ is the step size, $t_{k}=ks$, and $\varepsilon_{1},\ldots,\varepsilon_{K}$ are independent and identically distributed from $N(0,I_{d})$. If the expectations in the drift term do not have analytical forms, one can use the Monte Carlo method to evaluate $b$ approximately, i.e., one can sample from $\mu_{\epsilon}$ according to
$Y_{t_{k+1}}=Y_{t_{k}}+s\,\tilde{b}_{m}(Y_{t_{k}},t_{k})+\sqrt{s}\,\varepsilon_{k+1},\qquad Y_{0}=0,$
where $\tilde{b}_{m}$ is a Monte Carlo estimate of $b$ based on $Z_{1},\ldots,Z_{m}$ i.i.d. $N(0,I_{d})$.
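The Monte Carlo evaluation of the drift can be sketched as follows; this is an illustrative Python routine under our notation (the names f, eps and m are placeholders), where the unnormalized density ratio of $\mu_{\epsilon}$ with respect to $N(0,I_{d})$, introduced formally in Section 2, is computed inside the function, and the Stein's-lemma form of the drift is used.

import numpy as np

def mc_drift(x, t, f, eps, m=100, rng=None):
    # Monte Carlo estimate of the Schrodinger-Follmer drift b(x, t) for t in [0, 1).
    # Uses b(x, t) = E[Z * r(x + sqrt(1-t) Z)] / (sqrt(1-t) * E[r(x + sqrt(1-t) Z)]),
    # with r(y) = exp(-f(y)/eps + ||y||^2 / 2) the unnormalized density ratio;
    # log-weights are shifted by their maximum for numerical stability (the shift cancels).
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    Z = rng.standard_normal((m, x.shape[0]))
    Y = x + np.sqrt(1.0 - t) * Z
    log_r = -np.array([f(y) for y in Y]) / eps + 0.5 * np.sum(Y ** 2, axis=1)
    w = np.exp(log_r - log_r.max())
    num = (w[:, None] * Z).mean(axis=0)
    den = np.sqrt(1.0 - t) * w.mean()
    return num / den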
The main result of this paper is summarized in the following.
Theorem 1.1.
(Informal) Under appropriate assumptions on the potential $f$, the step size and $\epsilon$, with high probability the output of the proposed sampler is an approximate global minimizer of $f$, provided that the number of iterations and the number of Gaussian samples drawn per iteration are taken sufficiently large.
The rest of this paper is organized as follows. In Section 2, we give the motivation and details of the proposed approximation method, i.e., Algorithm 1. In Section 3, we present the theoretical analysis of the proposed method. In Section 4, a numerical example is given to validate the efficiency of the method. We conclude in Section 5. Proofs of all propositions and theorems are provided in Appendix 7.
2. Methodology Description
In this section we first provide some background on the Schrödinger-Föllmer diffusion. Then we propose Algorithm 1 to solve the minimization problem (1) based on the Euler-Maruyama discretization scheme of the Schrödinger-Föllmer diffusion.
2.1. Background on Schrödinger-Föllmer diffusion
We first recall the Schrödinger bridge problem, then introduce the Schrödinger-Föllmer diffusion.
2.1.1. Schrödinger bridge problem
Let $\Omega=C([0,1];\mathbb{R}^{d})$ be the space of $\mathbb{R}^{d}$-valued continuous functions on the time interval $[0,1]$. Denote by $X=(X_{t})_{t\in[0,1]}$ the canonical process on $\Omega$, where $X_{t}(\omega)=\omega_{t}$ for each $\omega\in\Omega$ and $t\in[0,1]$. The canonical $\sigma$-field on $\Omega$ is then generated by the canonical process, built from the Borel $\sigma$-algebra of $\mathbb{R}^{d}$. Denote by $\mathcal{P}(\Omega)$ the space of probability measures on the path space $\Omega$, and by $\mathbf{W}_{x}$ the Wiener measure (with a fixed variance parameter) whose initial marginal is $\delta_{x}$. The law of the reversible Brownian motion is then defined as $\mathbf{W}=\int_{\mathbb{R}^{d}}\mathbf{W}_{x}\,\mathrm{d}x$, which is an unbounded measure on $\Omega$. One can observe that $\mathbf{W}$ has a marginal distribution coinciding with the Lebesgue measure at each time $t\in[0,1]$. Schrödinger (1932) studied the problem of finding the most likely random evolution between two probability distributions $\mu_{0}$ and $\mu_{1}$. This problem is referred to as the Schrödinger bridge problem (SBP). SBP can be formulated as seeking a probability law on the path space that interpolates between $\mu_{0}$ and $\mu_{1}$ and that is closest to the prior law of the Brownian diffusion with respect to the relative entropy (Jamison, 1975; Léonard, 2014), i.e., finding a path measure $\mathbf{Q}^{*}$ with marginals $\mathbf{Q}^{*}_{0}=\mu_{0}$ and $\mathbf{Q}^{*}_{1}=\mu_{1}$ such that
$\mathbf{Q}^{*}\in\operatorname*{arg\,min}\left\{D(\mathbf{Q}\,\|\,\mathbf{W}):\ \mathbf{Q}\in\mathcal{P}(\Omega),\ \mathbf{Q}_{0}=\mu_{0},\ \mathbf{Q}_{1}=\mu_{1}\right\},$
where the relative entropy is defined by $D(\mathbf{Q}\,\|\,\mathbf{W})=\int_{\Omega}\log\left(\frac{\mathrm{d}\mathbf{Q}}{\mathrm{d}\mathbf{W}}\right)\mathrm{d}\mathbf{Q}$ if $\mathbf{Q}\ll\mathbf{W}$ (i.e., $\mathbf{Q}$ is absolutely continuous w.r.t. $\mathbf{W}$), and $D(\mathbf{Q}\,\|\,\mathbf{W})=+\infty$ otherwise. The following theorem characterizes the solution of SBP.
Theorem 2.1 (Léonard (2014)).
If measures , then SBP admits a unique solution , where and are -measurable non-negative functions satisfying the Schrödinger system
Furthermore, the pair with
solves the minimum action problem
such that
Let be the transition density of the Wiener process with variance , and be the density of and , respectively. Denote by
Then the Schrödinger system in Theorem 2.1 can also be characterized by
with the following forward and backward time harmonic equations (Chen et al., 2021),
Let denote marginal density of , i.e., , then it can be represented by the product of and (Chen et al., 2021). Let consist of admissible Markov controls with finite energy. Then, the vector field
(4) |
solves the following stochastic control problem.
Theorem 2.2 (Dai Pra (1991)).
$\min_{u\in\mathcal{U}}\ \mathbb{E}\left[\frac{1}{2}\int_{0}^{1}\|u(X_{t},t)\|_{2}^{2}\,\mathrm{d}t\right]$
such that
(5) $\mathrm{d}X_{t}=u(X_{t},t)\,\mathrm{d}t+\mathrm{d}B_{t},\qquad X_{0}\sim\mu_{0},\quad X_{1}\sim\mu_{1}.$
According to Theorem 2.2, the dynamics determined by the SDE in (5) with the time-varying drift term in (4) drive particles sampled from the initial distribution $\mu_{0}$ to particles distributed according to the target distribution $\mu_{1}$ over the unit time interval. This is exactly the property we need in designing samplers: we can sample from the underlying target distribution by pushing forward a simple reference distribution. In particular, if we take the initial distribution to be $\delta_{0}$, the degenerate distribution at the origin, then the Schrödinger-Föllmer diffusion process (7) defined below is a solution to (5), i.e., it transports $\delta_{0}$ to the target distribution.
2.1.2. Schrödinger-Föllmer diffusion process
From now on, without loss of generality, we assume that the minimum value of $f$ is $0$, i.e., $\min_{x\in\mathbb{R}^{d}}f(x)=0$; otherwise, we consider $f$ replaced by $f-\min_{x}f(x)$. Since $\mu_{\epsilon}$ is absolutely continuous with respect to the $d$-dimensional standard Gaussian distribution $N(0,I_{d})$, we denote the Radon-Nikodym derivative of $\mu_{\epsilon}$ with respect to $N(0,I_{d})$ as follows:
(6) $r_{\epsilon}(x)=\frac{\mathrm{d}\mu_{\epsilon}}{\mathrm{d}N(0,I_{d})}(x)=\frac{\mu_{\epsilon}(x)}{\phi(x)}\propto\exp\left(-\frac{f(x)}{\epsilon}+\frac{\|x\|_{2}^{2}}{2}\right),$
where $\phi$ denotes the density of $N(0,I_{d})$.
Let $(Q_{t})_{t\in[0,1]}$ be the heat semigroup defined by
$Q_{t}g(x)=\mathbb{E}_{Z\sim N(0,I_{d})}\left[g\big(x+\sqrt{t}\,Z\big)\right].$
The Schrödinger-Föllmer diffusion process (Föllmer, 1985, 1986, 1988) is defined as
(7) $\mathrm{d}X_{t}=b(X_{t},t)\,\mathrm{d}t+\mathrm{d}B_{t},\qquad X_{0}=0,\quad t\in[0,1],$
where $b(x,t)$ is the drift term given by
$b(x,t)=\nabla\log Q_{1-t}r_{\epsilon}(x)=\frac{Q_{1-t}\nabla r_{\epsilon}(x)}{Q_{1-t}r_{\epsilon}(x)}.$
This process defined in (7) is a solution to (5) with $\mu_{0}=\delta_{0}$ and $\mu_{1}=\mu_{\epsilon}$ (Dai Pra, 1991; Lehec, 2013; Eldan et al., 2020). Let
$\tilde{r}_{\epsilon}(x)=\exp\left(-\frac{f(x)}{\epsilon}+\frac{\|x\|_{2}^{2}}{2}\right).$
The drift term is scale-invariant with respect to $r_{\epsilon}$ in the sense that $\nabla\log Q_{1-t}(c\,r_{\epsilon})=\nabla\log Q_{1-t}r_{\epsilon}$ for any constant $c>0$. Therefore, the Schrödinger-Föllmer diffusion can be used for sampling from an unnormalized distribution, that is, the normalizing constant $Z_{\epsilon}$ of $\mu_{\epsilon}$ does not need to be known. Indeed, $r_{\epsilon}=\tilde{r}_{\epsilon}/\tilde{Z}_{\epsilon}$ for some normalizing constant $\tilde{Z}_{\epsilon}$, so that $b(x,t)=\nabla\log Q_{1-t}\tilde{r}_{\epsilon}(x)$ and the drift can be evaluated from $\tilde{r}_{\epsilon}$ alone.
To ensure that the SDE (7) admits a unique strong solution, we assume that
-
$f$ is twice continuously differentiable on $\mathbb{R}^{d}$ and satisfies an additional regularity condition outside a ball $B(0,R)$.
As mentioned earlier, we can enforce this assumption by smoothly modifying $f$ outside the ball, which does not change the global minimizers of $f$. Under this assumption, we have the following two properties, which further imply that the SDE (7) admits a unique strong solution.
-
(P1) For each $t\in[0,1]$, the maps
(8) $x\mapsto Q_{1-t}r_{\epsilon}(x)\quad\text{and}\quad x\mapsto Q_{1-t}\nabla r_{\epsilon}(x)$
are Lipschitz continuous with a common constant $\gamma>0$.
-
(P2) For each $t\in[0,1]$, there exists a constant $\xi>0$ such that
(9) $Q_{1-t}r_{\epsilon}(x)\geq\xi,\qquad\forall\,x\in\mathbb{R}^{d}.$
Properties (P1)-(P2) are verified in Section 7.1 of the Appendix. We should mention that (P1)-(P2) are used as assumptions in Lehec (2013) and Tzen and Raginsky (2019).
Thanks to (P1) and (P2), some calculations show that
and
We conclude that
Furthermore, we can also deduce that the drift term satisfies a linear growth condition and a Lipschitz continuity condition (Revuz and Yor, 2013; Pavliotis, 2014), that is,
(C1) $\|b(x,t)\|_{2}\leq C_{0}\,(1+\|x\|_{2}),$
and
(C2) $\|b(x,t)-b(y,t)\|_{2}\leq C_{1}\,\|x-y\|_{2},$
where $C_{0}$ and $C_{1}$ are two finite positive constants that only depend on $\gamma$ and $\xi$. The linear growth condition (C1) and the Lipschitz continuity condition (C2) ensure the existence of the unique strong solution of the Schrödinger-Föllmer SDE (7). We summarize the above discussion in the following proposition, whose proof is given in Appendix 7.
Proposition 2.3.
Under the above assumption, the Schrödinger-Föllmer SDE (7) has a unique strong solution $(X_{t})_{t\in[0,1]}$ with $X_{0}=0$ and $X_{1}\sim\mu_{\epsilon}$.
2.2. Euler-Maruyama discretization for Schrödinger-Föllmer Diffusion
Proposition 2.3 shows that the Schrödinger-Föllmer diffusion transports $\delta_{0}$ to the probability measure $\mu_{\epsilon}$ over the unit time interval. Because the drift term is scale-invariant with respect to the density ratio, in the sense that $\nabla\log Q_{1-t}(c\,r_{\epsilon})=\nabla\log Q_{1-t}r_{\epsilon}$ for any constant $c>0$, the Schrödinger-Föllmer diffusion can be used for sampling from $\mu_{\epsilon}$ even when the normalizing constant $Z_{\epsilon}$ is unknown. To this end, we use the Euler-Maruyama method to discretize the Schrödinger-Föllmer diffusion (7). Let $s=1/K$ be the step size and $t_{k}=ks$ for $k=0,1,\ldots,K$; then
the Euler-Maruyama discretization scheme reads
(10) $Y_{t_{k+1}}=Y_{t_{k}}+s\,b(Y_{t_{k}},t_{k})+\sqrt{s}\,\varepsilon_{k+1},\qquad Y_{0}=0,\quad k=0,1,\ldots,K-1,$
where $\varepsilon_{1},\ldots,\varepsilon_{K}$ are i.i.d. $N(0,I_{d})$ and
(11) $b(x,t)=\frac{\mathbb{E}_{Z}\big[\nabla\tilde{r}_{\epsilon}(x+\sqrt{1-t}\,Z)\big]}{\mathbb{E}_{Z}\big[\tilde{r}_{\epsilon}(x+\sqrt{1-t}\,Z)\big]}=\frac{\mathbb{E}_{Z}\big[Z\,\tilde{r}_{\epsilon}(x+\sqrt{1-t}\,Z)\big]}{\sqrt{1-t}\ \mathbb{E}_{Z}\big[\tilde{r}_{\epsilon}(x+\sqrt{1-t}\,Z)\big]},\qquad Z\sim N(0,I_{d}),$
where the second equality follows from Stein’s lemma (Stein, 1972, 1986; Landsman and Nešlehová, 2008). The expectations in the definition of $b$ in (11) generally do not have explicit expressions. We therefore consider an estimator of $b$ obtained by replacing the expectations in the drift term with $m$-sample means, i.e.,
(12) $\tilde{b}_{m}(x,t)=\frac{\frac{1}{m}\sum_{j=1}^{m}\nabla\tilde{r}_{\epsilon}\big(x+\sqrt{1-t}\,Z_{j}\big)}{\frac{1}{m}\sum_{j=1}^{m}\tilde{r}_{\epsilon}\big(x+\sqrt{1-t}\,Z_{j}\big)},$
or
(13) $\hat{b}_{m}(x,t)=\frac{\frac{1}{m}\sum_{j=1}^{m}Z_{j}\,\tilde{r}_{\epsilon}\big(x+\sqrt{1-t}\,Z_{j}\big)}{\sqrt{1-t}\ \frac{1}{m}\sum_{j=1}^{m}\tilde{r}_{\epsilon}\big(x+\sqrt{1-t}\,Z_{j}\big)},$
where $Z_{1},\ldots,Z_{m}$ are i.i.d. $N(0,I_{d})$. The detailed description of the proposed method is summarized in Algorithm 1 below.
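A minimal end-to-end sketch of the procedure described above is given below (our own illustrative Python code, not the authors' implementation of Algorithm 1); it reuses the mc_drift routine sketched in the introduction, and the parameter names and default values are placeholders.

import numpy as np

def sf_optimizer(f, dim, eps=0.05, n_steps=100, m=100, rng=None):
    # Euler-Maruyama discretization of the Schrodinger-Follmer diffusion:
    # starting from Y_0 = 0, take K = n_steps steps of size s = 1/K using the
    # Monte Carlo drift estimate, and return Y_1, an approximate draw from mu_eps.
    rng = np.random.default_rng() if rng is None else rng
    s = 1.0 / n_steps
    y = np.zeros(dim)
    for k in range(n_steps):
        t = k * s                                   # current time, always strictly below 1
        drift = mc_drift(y, t, f, eps, m=m, rng=rng)
        y = y + s * drift + np.sqrt(s) * rng.standard_normal(dim)
    return y

Each call returns one approximate sample from $\mu_{\epsilon}$; repeating independent runs and keeping the best function value yields the optimization procedure evaluated in Section 4.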
3. Theoretical Property
In this section, we show that the Gibbs measure $\mu_{\epsilon}$ weakly converges to a distribution concentrating on the set of optimal points $\mathcal{X}^{*}$. Since the minimum value of $f$ is $0$, we then estimate the probabilities that $f$ exceeds a small threshold under $\mu_{\epsilon}$ and under the law of the output of Algorithm 1, and we establish non-asymptotic bounds between the law of the samples generated by Algorithm 1 and the target distribution $\mu_{\epsilon}$ in the Wasserstein-2 distance. Recall that the linear growth condition (C1) and the Lipschitz continuity condition (C2) hold under properties (P1) and (P2), which guarantee that the Schrödinger-Föllmer SDE (7) has a unique strong solution. Besides, we assume that the drift term is Lipschitz continuous in $x$ and $1/2$-Hölder continuous in $t$,
(C3) $\|b(x,t)-b(y,u)\|_{2}\leq C_{2}\left(\|x-y\|_{2}+|t-u|^{1/2}\right),$
where $C_{2}$ is a positive constant depending on $\gamma$ and $\xi$.
Remark 3.1.
(C1) and (C2) are essentially the sufficient conditions under which the Schrödinger-Föllmer SDE (7) admits a unique strong solution. (C3) was introduced in Theorem 4.1 of Tzen and Raginsky (2019) and is also similar to condition (H2) of Chau et al. (2021) and Assumption 3.2 of Barkhagen et al. (2021). Obviously, (C3) implies (C2), and (C1) holds if the drift term is bounded over $\mathbb{R}^{d}\times[0,1]$.
Firstly, we show that the Gibbs measure $\mu_{\epsilon}$ weakly converges to a distribution concentrating on the global minimizers. This result can be traced back to the 1980s; for completeness, we use Laplace's method as in Hwang (1980, 1981) to give a detailed proof. The key idea is to show that, for every sufficiently small neighborhood of each global minimizer, its $\mu_{\epsilon}$-mass converges to the corresponding limiting weight as $\epsilon\to 0$.
Next, we estimate the probabilities that $f$ exceeds a small threshold under $\mu_{\epsilon}$ and under the law of the output of Algorithm 1. The second estimate is more involved because of the discretization; the main idea comes from Dalalyan (2017b) and Cheng et al. (2018), who construct a continuous-time interpolation SDE for the Euler-Maruyama discretization. In their works, the relative entropy is controlled by using Girsanov's theorem to estimate Radon-Nikodym derivatives.
Another method of controlling the relative entropy was proposed by Mou et al. (2022). By direct calculation, the time derivative of the relative entropy between the interpolated process and the original SDE (7) is controlled by the mean squared difference between the drift terms of the Fokker-Planck equations for the original and the interpolated processes. Compared to the bound obtained from Lemma 7.2, the bound in Mou et al. (2022) has an additional backward conditional expectation inside the norm, which is the key reason it attains higher-order accuracy; however, it requires a dissipativity condition on the drift term of the SDE and smoothness of the initial distribution.
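To fix ideas, the Girsanov-based control underlying the first approach can be summarized by the following standard path-space bound (a sketch for orientation; the precise statement used in the proofs is Lemma 7.2): if $X$ and $\tilde{X}$ solve SDEs on $[0,1]$ with unit diffusion coefficient, drifts $b_{1}$ and $b_{2}$, and the same initial value, then
$D\big(\mathrm{Law}(X)\,\big\|\,\mathrm{Law}(\tilde{X})\big)\;\leq\;\frac{1}{2}\,\mathbb{E}\int_{0}^{1}\big\|b_{1}(X_{t},t)-b_{2}(X_{t},t)\big\|_{2}^{2}\,\mathrm{d}t,$
where the relative entropy is taken between the path laws; bounds on the time-$t$ marginals then follow from the data-processing inequality, and total-variation bounds follow from Pinsker's inequality.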
The precise results are given in Theorems 3.2, 3.5 and 3.7 below. See Appendix 7 for detailed proofs.
Theorem 3.2.
Let $f$ be a twice continuously differentiable function. Suppose the set of global minimizers $\mathcal{X}^{*}=\{x_{1}^{*},\ldots,x_{m}^{*}\}$ is finite and each Hessian $\nabla^{2}f(x_{i}^{*})$ is positive definite. Then, as $\epsilon\to 0$,
$\mu_{\epsilon}\Rightarrow\sum_{i=1}^{m}w_{i}\,\delta_{x_{i}^{*}},\qquad w_{i}\propto\left(\det\nabla^{2}f(x_{i}^{*})\right)^{-1/2}.$
Given Theorem 3.2, a natural question is the rate at which the measure $\mu_{\epsilon}$ converges to this limiting distribution. To that end, we apply tools from the theory of large deviations for Gibbs measures that are absolutely continuous with respect to the Lebesgue measure; see Chiang et al. (1987); Holley et al. (1989); Márquez (1997). We obtain the following property.
Proposition 3.3.
Assume that the conditions of Theorem 3.2 hold. Then for all $\delta>0$,
$\limsup_{\epsilon\to 0}\ \epsilon\log\mu_{\epsilon}\big(\{x:f(x)\geq\delta\}\big)\leq-\delta.$
Remark 3.4.
Proposition 3.3 follows directly from the large deviation principle; it shows that the Gibbs measure concentrates on the global minimum points of the potential function and gives the corresponding convergence rate. However, the specific form of the limit distribution cannot be obtained from Proposition 3.3 alone; it is provided by Theorem 3.2.
Theorem 3.5.
Remark 3.6.
The first term on the right-hand side of (14) originates from the difference between Gibbs sampling and optimization, which is intrinsic to the simulated-annealing viewpoint. The latter two terms are caused by the Euler discretization of the SDE (7) and by the Monte Carlo estimation of the drift coefficient in (11), and they depend on the dimension polynomially. This is in contrast to some Langevin methods, whose convergence error bounds depend on the dimension exponentially (Wang, 2009; Hale, 2010; Menz and Schlichting, 2014; Raginsky et al., 2017), implying that the efficiency of Langevin samplers may suffer from the curse of dimensionality.
By Theorem 3.5, with high probability the output of Algorithm 1 is an approximate global minimizer of $f$, provided that the number of iterations and the number of Gaussian samples per iteration are sufficiently large and $\epsilon$ is sufficiently small.
Finally, we establish non-asymptotic bounds between the law of the samples generated by Algorithm 1 and the distribution $\mu_{\epsilon}$ in the Wasserstein-2 distance. We first recall the definition. Let $\Gamma(\nu_{1},\nu_{2})$ be the collection of coupling probability measures on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ whose respective marginal distributions are $\nu_{1}$ and $\nu_{2}$. The Wasserstein distance of order $2$ measuring the discrepancy between $\nu_{1}$ and $\nu_{2}$ is defined as
$W_{2}(\nu_{1},\nu_{2})=\left(\inf_{\pi\in\Gamma(\nu_{1},\nu_{2})}\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\|x-y\|_{2}^{2}\,\mathrm{d}\pi(x,y)\right)^{1/2}.$
Theorem 3.7.
Remark 3.8.
The Langevin sampling method has been studied under the (strongly) convex potential assumption (Durmus and Moulines, 2016, 2017, 2019; Dalalyan, 2017a, b; Cheng and Bartlett, 2018; Dalalyan and Karagulyan, 2019); there are also mean-field Langevin methods, see Garbuno-Inigo et al. (2020); Wang and Li (2022) for more details. However, the Langevin diffusion process tends to the target distribution only as time goes to infinity, while the Schrödinger bridge achieves this in unit time. The Schrödinger bridge has close connections with a number of problems in statistical physics, optimal transport and optimal control (Léonard, 2014), yet only a few recent works use this tool for statistical sampling. Bernton et al. (2019) proposed Schrödinger bridge samplers based on the iterative proportional fitting method (the Sinkhorn algorithm (Sinkhorn, 1964; Peyré and Cuturi, 2019)). For the Gibbs measure $\mu_{\epsilon}$, Schrödinger bridge samplers iteratively modify the transition kernels of a reference Markov chain to obtain a process whose marginal distribution at the terminal time approximates $\mu_{\epsilon}$, via regression-based approximations of the corresponding iterative proportional fitting recursion.
4. Simulation studies
In this section, we conduct numerical simulations to evaluate the performance of the proposed method (Algorithm 1) and compare it with the Langevin method. We consider a non-convex smooth test function from Carrillo et al. (2021), which maps from $\mathbb{R}^{d}$ to $\mathbb{R}$,
with its shift and offset constants specified as in that reference. Figure 1 depicts this target function for $d=1$; the number of local minimizers is infinite and only one global minimizer exists. In the numerical experiment, we consider this function in low dimension. The tuning parameters of Algorithm 1 ($\epsilon$, the number of iterations $K$ and the Monte Carlo size $m$) are fixed in advance, and the Langevin method is implemented with the R package yuima (Iacus and Yoshida, 2018). As shown in Proposition 2.3, the target distribution is obtained exactly at time one. Thus, we only need to keep the last iterate of the Euler-Maruyama discretization of the Schrödinger-Föllmer diffusion in each run and repeat this scheme until the desired sample size is reached, i.e., we obtain one sample per run of Algorithm 1 (each run draws $m$ Gaussian samples at each of the $K$ iterations). In comparison, the Langevin diffusion (2) tends to the target distribution only as time goes to infinity, so a sufficiently long burn-in must be discarded empirically in each Euler-Maruyama run. To make the comparison fair, we run the Langevin method 50 times; at each run we generate a batch of samples and keep the one with the best function value. Figure 2 shows the results of 50 independent runs, where the red line marks the global minimum (GM). From Figure 2, we conclude that our proposed method approximately attains the global minimizer and performs better than the Langevin method.
(Figure 1: the target function in dimension one.)
(Figure 2: function values at the outputs of 50 independent runs of the proposed method and of the Langevin method; the red line marks the global minimum.)
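For concreteness, a minimal sketch of this experiment in Python follows (our own illustrative code, not the scripts behind the reported figures); rastrigin_like assumes the shifted Rastrigin-type form of the benchmark in Carrillo et al. (2021) with placeholder shift b and offset c, the dimension and tuning parameters are placeholders, and sf_optimizer refers to the Algorithm 1 sketch in Section 2.

import numpy as np

def rastrigin_like(x, b=0.0, c=0.0):
    # Shifted Rastrigin-type benchmark (placeholder constants b and c);
    # the unique global minimizer is x = (b, ..., b) with value c.
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - b) ** 2 - 10.0 * np.cos(2.0 * np.pi * (x - b)) + 10.0) + c)

# 50 independent runs of the proposed sampler; record the function value of each output.
rng = np.random.default_rng(0)
f = rastrigin_like
values = [f(sf_optimizer(f, dim=1, eps=0.05, n_steps=100, m=100, rng=rng)) for _ in range(50)]
print(min(values), sum(values) / len(values))  # best and average value over the 50 runs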
5. Conclusion
We study the problem of finding global minimizers of a non-convex smooth function $f$ approximately via sampling from the probability distribution $\mu_{\epsilon}$ with density proportional to $\exp(-f(x)/\epsilon)$ with respect to the Lebesgue measure, for $\epsilon>0$ small enough. We analyze a sampler based on the Euler discretization scheme of the Schrödinger-Föllmer diffusion process with stochastic approximation, under appropriate assumptions on the step size and the potential $f$. We prove that the output of the proposed sampler is an approximate global minimizer of $f$ with high probability. Moreover, simulation studies verify the effectiveness of the proposed method for solving non-convex smooth optimization problems and show that it performs better than the Langevin method.
6. Acknowledgments
We would like to thank the anonymous referees for their useful comments and suggestions, which have led to considerable improvements in the paper. This work is supported by the National Key Research and Development Program of China (No. 2020YFA0714200) and by the National Science Foundation of China (No. 12125103, No. 12071362, No. 11871474, No. 11871385). The numerical calculations have been done at the Supercomputing Center of Wuhan University.
7. Appendix
In the appendix, we verify (P1)-(P2) and prove all the propositions and theorems, i.e., Propositions 2.3 and 3.3 and Theorems 3.2, 3.5 and 3.7.
7.1. Verify (P1)-(P2)
Proof.
Recall the definition of in (6) and the assumption when . Through some simple calculations, we have
Then, the properties (P1)-(P2) hold if for each ,
and
where we use the fact that is non-increasing for . ∎
7.2. Proof of Proposition 2.3
Proof.
By (P1) and (P2), it yields that for all and ,
(A.1) |
Then, by (P1)-(P2) and (A.1), for all and ,
Taking $C_{1}$ to be this constant yields the Lipschitz continuity condition (C2). Combining (A.1) and (C2) with the triangle inequality, we have
Taking $C_{0}$ to be the resulting constant, (C1) holds. Therefore, the drift term satisfies the linear growth condition (C1) and the Lipschitz condition (C2), and hence the Schrödinger-Föllmer SDE (7) has a unique strong solution (Revuz and Yor, 2013; Pavliotis, 2014).
Moreover, Schrödinger-Föllmer diffusion process defined in (7) admits the transition probability density
where
is the transition probability density of a -dimensional Brownian motion . See Dai Pra (1991); Lehec (2013) for details. It follows that for any Borel measurable set ,
where the last equality follows from the definition of $r_{\epsilon}$ in (6). Therefore, $X_{1}$ is distributed as the probability distribution $\mu_{\epsilon}$. This completes the proof. ∎
7.3. Proof of Proposition 3.3
Proof.
Under the conditions of Theorem 3.2, the conclusion holds for all . According to Varadhan's Theorem 1.2.3 in Dupuis and Ellis (2011), it follows that the family $(\mu_{\epsilon})_{\epsilon>0}$ satisfies a large deviation principle with rate function , that is, for every function in , the space of bounded continuous functions on ,
(A.2) |
where the rate function is defined by
Next, we only need to prove (A.2). On the one hand,
By Lemma 7.1, we have
(A.3) |
On the other hand, for any positive , combining Lemma 2.2 in Varadhan (1966) and the dominated convergence theorem yields that
(A.4) |
Hence, by (A.3) and (A.4), we get
Since the measure satisfies the large deviation principle with rate function , if we take the closed set , then
∎
7.4. Preliminary lemmas for Theorem 3.2 and Theorem 3.5
In order to prove that the Gibbs measure weakly converges to a multidimensional distribution and estimate the probabilities of and for any , we first need to prove the following Lemmas 7.1-7.2.
Lemma 7.1.
Assume is twice continuously differentiable and satisfies , and the Hessian matrix is positive definite. When is sufficiently small, then
for any .
Proof.
For each , there exists such that
holds for any . Moreover, we have
(A.5) |
There is an orthogonal matrix such that , where are the eigenvalues of the Hessian matrix . Denoting , we then have
Hence
(A.6) |
Further, we can get
(A.7) |
where the equality holds by setting . By (A.5), (A.6) and (A.7), it follows that
Similarly, we have
Therefore, letting , we get
∎
Lemma 7.2.
Let be strong solutions of the following two stochastic differential equations
and is a -measurable random variable. In addition, if the drift terms and satisfy , then we have
(A.8) |
and the relative entropy of with respect to satisfies
where probability distributions are induced by process and , respectively.
Proof.
By the Novikov condition (Revuz and Yor, 2013), we know that
is exponential martingale and for all . We can denote a new probability measure such that . By Girsanov’s theorem (Revuz and Yor, 2013), under the new probability measure , we can conclude that
is a -Brownian motion. Hence, under the new probability measure ,
Thus, we have , where is the distribution of under the measure . Furthermore, we can obtain (A.8). On the other hand, by the definition of the relative entropy of with respect to , we have
Therefore, the proof of Lemma 7.2 is completed. ∎
7.5. Proof of Theorem 3.2
Proof.
The result can be traced back to the 1980s. For completeness, we use Laplace's method as in Hwang (1980, 1981) to give a detailed proof. The key idea is to show that, for each sufficiently small neighborhood of each global minimizer, its $\mu_{\epsilon}$-mass converges to the corresponding limiting weight as $\epsilon\to 0$. We first introduce the following notations
Hence, we have
(A.9) |
On the one hand, is a positive definite symmetric matrix. For any , we can choose such that
holds for any . Thus, for any , we obtain
By Lemma 7.1 and letting , we have
As , we get
(A.10) |
On the other hand, we have
Since and holds in , then for any , we have
Also, it follows that
Thus we get
(A.11) |
Taking limit in (A.9) and applying (A.10), (A.11), we get
Therefore, the proof of Theorem 3.2 is completed. ∎
7.6. Proof of Theorem 3.5
Proof.
Note that
(A.12) |
According to for any , then has at least linear growth at infinity, that is, there exists a constant such that for large enough
We can choose sufficiently large such that . Hence,
(A.13) |
where Vol is the volume of a ball with radius . On the other hand, since , then there exists such that when , we have
(A.14) |
By injecting (A.13) and (A.14) into (A.12), we get
where
(A.15) |
Next, we will prove that the second conclusion holds in the discrete case. Recall that is the step size, is the cumulative step size up to iteration , and is the Schrödinger-Föllmer diffusion process (7). Let be the probability measure of defined by (10). Then for fixed , we have
(A.16) |
where we use Pinsker’s inequality (Bakry et al., 2014) in the last inequality and the second inequality holds due to the fact that letting , then
where the total variation metric between two probability measures on is defined by .
Firstly, from the first part of the proof, we can get a bound for the first term on the right-hand side of (A.16). That is, for each , there exists a constant defined in (A.15) such that
(A.17) |
Secondly, we estimate the boundedness of . To make use of continuous-time tools, we construct a continuous-time interpolation of the discrete-time algorithm (10). In particular, we define a stochastic process via the SDE
(A.18) |
with the non-anticipative drift . Meanwhile, because of Proposition 2.3, the process (7) satisfies . We also denote by and the marginal distributions on of . Thus, combining (10), (A.18) and Lemma 7.2, we obtain
(A.19) |
where
the second inequality holds due to Remark 4.1 in Huang et al. (2021) and the fact that , the second equality holds by the continuous-time interpolation (A.18), and the fourth inequality follows from .
So it remains to estimate the relative entropy . Similar to the proof of the relative entropy , we need to construct a continuous-time interpolation process defined by
(A.20) |
with the non-anticipative drift , where is defined by (12) or (13). Denote by and the marginal distributions on of . Therefore, combining (A.18), (A.20), Lemma 7.2 and Lemma 7.5, we get
(A.21) |
where
with . By injecting (A.17), (A.19) and (A.21) into (A.16), we obtain the desired results. ∎
7.7. Preliminary lemmas for Theorem 3.7
Lemma 7.3.
Assume (P1) and (P2) hold, then
Proof.
Lemma 7.4.
Assume (P1) and (P2) hold, then for any ,
Proof.
Using (C1) and the elementary inequality , one can derive that for any ,
Letting yields
(A.22) |
From the definition of in (7), we have
On the one hand, by the Itô formula and (A.22), for any , we have
where . Furthermore, we have
The Grönwall inequality yields
(A.23) |
On the other hand, by the elementary inequality then we get
Thus, using the Cauchy-Schwarz inequality, Burkholder-Davis-Gundy inequality, (C1) and (A.23), we deduce that
∎
Lemma 7.5.
Assume (P1) and (P2) hold, then
where
with .
Proof.
Denote two independent sequences of independent copies of , that is, and . For notational convenience, we also denote
Since , then . Moreover, we have
(A.24) |
where the second inequality holds by (P1) and the last inequality follows from the fact that and are independent standard normal distribution. Similarly, we also have
(A.25) |
where the second inequality holds due to (P1). Thus, by (7.7) and (7.7), it follows that
(A.26) |
(A.27) |
Then, by (P1) and (P2), some simple calculations yield that
(A.28) |
Recall that with
then for any . Further, by (A.28), it follows that for all and ,
(A.29) |
Then, combining (A.26)-(A.27) and (A.29), it follows that
where
∎
Lemma 7.6.
Assume (P1) and (P2) hold, then for any ,
Proof.
Define , hence, we get , where with . By (P1) and (P2), it follows that for all and ,
(A.30) |
Then, by (A.30) and , we have
Further, we can get
Therefore,
Since , then by induction, we have
∎
7.8. Proof of Theorem 3.7
Proof.
From the definitions of and , we have
where the second inequality holds due to for and the fourth inequality holds by condition (C3). Then, we obtain
where follows from Lemma 7.4, and the last inequality holds by Lemma 7.5. Owing to , we can conclude that there exists a constant such that
where
Therefore,
This completes the proof. ∎
References
- Bakry et al. (2008) Bakry D, Cattiaux P, Guillin A (2008) Rate of convergence for ergodic continuous markov processes: Lyapunov versus poincaré. Journal of Functional Analysis 254(3):727–759
- Bakry et al. (2014) Bakry D, Gentil I, Ledoux M (2014) Analysis and geometry of Markov diffusion operators, vol 103. Springer
- Barkhagen et al. (2021) Barkhagen M, Chau NH, Moulines É, Rásonyi M, Sabanis S, Zhang Y (2021) On stochastic gradient langevin dynamics with dependent data streams in the logconcave case. Bernoulli 27(1):1–33
- Bernton et al. (2019) Bernton E, Heng J, Doucet A, Jacob PE (2019) Schrödinger bridge samplers. arXiv preprint arXiv:191213170
- Bou-Rabee et al. (2020) Bou-Rabee N, Eberle A, Zimmer R (2020) Coupling and convergence for hamiltonian monte carlo. The Annals of applied probability 30(3):1209–1250
- Carrillo et al. (2021) Carrillo JA, Jin S, Li L, Zhu Y (2021) A consensus-based global optimization method for high dimensional machine learning problems. ESAIM: Control, Optimisation and Calculus of Variations 27:S5
- Chau et al. (2021) Chau NH, Moulines É, Rásonyi M, Sabanis S, Zhang Y (2021) On stochastic gradient langevin dynamics with dependent data streams: The fully non-convex case. SIAM Journal on Mathematics of Data Science 3(3):959–986
- Chen et al. (2021) Chen Y, Georgiou TT, Pavon M (2021) Stochastic control liaisons: Richard sinkhorn meets gaspard monge on a schrödinger bridge. SIAM Review 63(2):249–313
- Cheng and Bartlett (2018) Cheng X, Bartlett P (2018) Convergence of langevin mcmc in kl-divergence. Proceedings of Machine Learning Research, Volume 83: Algorithmic Learning Theory pp 186–211
- Cheng et al. (2018) Cheng X, Chatterji NS, Bartlett PL, Jordan MI (2018) Underdamped langevin mcmc: A non-asymptotic analysis. In: Conference on learning theory, PMLR, pp 300–323
- Chiang et al. (1987) Chiang TS, Hwang CR, Sheu SJ (1987) Diffusion for global optimization in R^n. SIAM Journal on Control and Optimization 25(3):737–753
- Dai Pra (1991) Dai Pra P (1991) A stochastic control approach to reciprocal diffusion processes. Applied mathematics and Optimization 23(1):313–329
- Dalalyan (2017a) Dalalyan AS (2017a) Further and stronger analogy between sampling and optimization: Langevin monte carlo and gradient descent. pp 678–689
- Dalalyan (2017b) Dalalyan AS (2017b) Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(3):651–676
- Dalalyan and Karagulyan (2019) Dalalyan AS, Karagulyan AG (2019) User-friendly guarantees for the langevin monte carlo with inaccurate gradient. Stochastic Processes and their Applications 129(12):5278–5311
- Dupuis and Ellis (2011) Dupuis P, Ellis RS (2011) A weak convergence approach to the theory of large deviations. John Wiley & Sons
- Durmus and Moulines (2016) Durmus A, Moulines E (2016) Sampling from a strongly log-concave distribution with the unadjusted langevin algorithm. arXiv preprint arXiv:160501559
- Durmus and Moulines (2017) Durmus A, Moulines E (2017) Nonasymptotic convergence analysis for the unadjusted langevin algorithm. The Annals of Applied Probability 27(3):1551–1587
- Durmus and Moulines (2019) Durmus A, Moulines E (2019) High-dimensional bayesian inference via the unadjusted langevin algorithm. Bernoulli 25(4A):2854–2882
- Eldan et al. (2020) Eldan R, Lehec J, Shenfeld Y (2020) Stability of the logarithmic sobolev inequality via the föllmer process. In: Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, Institut Henri Poincaré, vol 56, pp 2253–2269
- Föllmer (1985) Föllmer H (1985) An entropy approach to the time reversal of diffusion processes. In: Stochastic Differential Systems Filtering and Control, Springer, pp 156–163
- Föllmer (1986) Föllmer H (1986) Time reversal on wiener space. In: Stochastic processes–mathematics and physics, Springer, pp 119–129
- Föllmer (1988) Föllmer H (1988) Random fields and diffusion processes. In: École d’Été de Probabilités de Saint-Flour XV–XVII, 1985–87, Springer, pp 101–203
- Garbuno-Inigo et al. (2020) Garbuno-Inigo A, Hoffmann F, Li W, Stuart AM (2020) Interacting langevin diffusions: Gradient structure and ensemble kalman sampler. SIAM Journal on Applied Dynamical Systems 19(1):412–441
- Hale (2010) Hale JK (2010) Asymptotic behavior of dissipative systems. 25, American Mathematical Soc.
- Holley et al. (1989) Holley RA, Kusuoka S, Stroock DW (1989) Asymptotics of the spectral gap with applications to the theory of simulated annealing. Journal of functional analysis 83(2):333–347
- Huang et al. (2021) Huang J, Jiao Y, Kang L, Liao X, Liu J, Liu Y (2021) Schrödinger-föllmer sampler: Sampling without ergodicity. arXiv preprint arXiv: 210610880
- Hwang (1980) Hwang CR (1980) Laplace’s method revisited: weak convergence of probability measures. The Annals of Probability pp 1177–1182
- Hwang (1981) Hwang CR (1981) A generalization of laplace’s method. Proceedings of the American Mathematical Society 82(3):446–451
- Iacus and Yoshida (2018) Iacus SM, Yoshida N (2018) Simulation and inference for stochastic processes with YUIMA. Springer
- Jamison (1975) Jamison B (1975) The markov processes of schrödinger. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 32(4):323–331
- Jiao et al. (2021) Jiao Y, Kang L, Liu Y, Zhou Y (2021) Convergence analysis of schrödinger-föllmer sampler without convexity. arXiv preprint arXiv:210704766
- Landsman and Nešlehová (2008) Landsman Z, Nešlehová J (2008) Stein’s lemma for elliptical random vectors. Journal of Multivariate Analysis 99(5):912–927
- Lehec (2013) Lehec J (2013) Representation formula for the entropy and functional inequalities. In: Annales de l’IHP Probabilités et statistiques, vol 49, pp 885–899
- Léonard (2014) Léonard C (2014) A survey of the schrödinger problem and some of its connections with optimal transport. Discrete & Continuous Dynamical Systems-A 34(4):1533
- Ma et al. (2019) Ma YA, Chen Y, Jin C, Flammarion N, Jordan MI (2019) Sampling can be faster than optimization. Proceedings of the National Academy of Sciences 116(42):20881–20885
- Márquez (1997) Márquez D (1997) Convergence rates for annealing diffusion processes. The Annals of Applied Probability pp 1118–1139
- Menz and Schlichting (2014) Menz G, Schlichting A (2014) Poincaré and logarithmic sobolev inequalities by decomposition of the energy landscape. Annals of Probability 42(5):1809–1884
- Mou et al. (2022) Mou W, Flammarion N, Wainwright MJ, Bartlett PL (2022) Improved bounds for discretization of langevin diffusions: Near-optimal rates without convexity. Bernoulli 28(3):1577–1601
- Pavliotis (2014) Pavliotis GA (2014) Stochastic processes and applications: diffusion processes, the Fokker-Planck and Langevin equations, vol 60. Springer
- Peyré and Cuturi (2019) Peyré G, Cuturi M (2019) Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning 11(5-6):355–607
- Raginsky et al. (2017) Raginsky M, Rakhlin A, Telgarsky M (2017) Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In: Conference on Learning Theory, PMLR, pp 1674–1703
- Revuz and Yor (2013) Revuz D, Yor M (2013) Continuous martingales and Brownian motion, vol 293. Springer Science & Business Media
- Ruzayqat et al. (2022) Ruzayqat H, Beskos A, Crisan D, Jasra A, Kantas N (2022) Unbiased estimation using a class of diffusion processes. arXiv preprint arXiv:220303013
- Schrödinger (1932) Schrödinger E (1932) Sur la théorie relativiste de l’électron et l’interprétation de la mécanique quantique. In: Annales de l’institut Henri Poincaré, vol 2, pp 269–310
- Sinkhorn (1964) Sinkhorn R (1964) A relationship between arbitrary positive matrices and doubly stochastic matrices. The annals of mathematical statistics 35(2):876–879
- Stein (1972) Stein C (1972) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In: Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 2: Probability theory, University of California Press, pp 583–602
- Stein (1986) Stein C (1986) Approximate computation of expectations. IMS
- Tzen and Raginsky (2019) Tzen B, Raginsky M (2019) Theoretical guarantees for sampling and inference in generative models with latent diffusions. In: Conference on Learning Theory, PMLR, pp 3084–3114
- Varadhan (1966) Varadhan SS (1966) Asymptotic probabilities and differential equations. Communications on Pure and Applied Mathematics 19(3):261–286
- Vargas et al. (2021) Vargas F, Ovsianas A, Fernandes D, Girolami M, Lawrence N, Nüsken N (2021) Bayesian learning via neural schrödinger-föllmer flows. arXiv preprint arXiv:211110510
- Wang (2009) Wang FY (2009) Log-sobolev inequalities: different roles of ric and hess. The Annals of Probability 37(4):1587–1604
- Wang et al. (2021) Wang G, Jiao Y, Xu Q, Wang Y, Yang C (2021) Deep generative learning via schrödinger bridge. In: International Conference on Machine Learning, PMLR, pp 10794–10804
- Wang and Li (2022) Wang Y, Li W (2022) Accelerated information gradient flow. Journal of Scientific Computing 90(1):1–47
- Zhang and Chen (2021) Zhang Q, Chen Y (2021) Path integral sampler: a stochastic control approach for sampling. arXiv preprint arXiv:211115141
- Zhang et al. (2019) Zhang Y, Akyildiz ÖD, Damoulas T, Sabanis S (2019) Nonasymptotic estimates for stochastic gradient langevin dynamics under local conditions in nonconvex optimization. arXiv preprint arXiv:191002008