Convergence Analysis of Schrödinger-Föllmer Sampler without Convexity
Abstract
The Schrödinger-Föllmer sampler (SFS) [26] is a novel and efficient approach for sampling from possibly unnormalized distributions without ergodicity. SFS is based on the Euler-Maruyama discretization of the Schrödinger-Föllmer diffusion process
on the unit interval $[0,1]$, which transports the degenerate distribution $\delta_0$ at time zero to the target distribution at time one. In [26], the consistency of SFS is established under a restrictive assumption that the potential is uniformly (on $[0,1]$) strongly convex (on $\mathbb{R}^d$). In this paper we provide a nonasymptotic error bound for SFS in Wasserstein distance under smoothness and boundedness conditions on the density ratio of the target distribution over the standard normal distribution, but without requiring strong convexity of the potential.
1 Introduction
Sampling from possibly unnormalized distributions is an important task in Bayesian statistics and machine learning. Ever since the Metropolis-Hastings (MH) algorithm [32, 25] was introduced, various random sampling methods have been proposed, including the Gibbs sampler, random walk sampler, independent sampler, Langevin sampler, bouncy particle sampler, and zig-zag sampler [23, 22, 37, 29, 35, 4, 2], among others; see [5, 14] and the references therein. The above-mentioned sampling algorithms generate random samples by running an ergodic Markov chain whose stationary distribution is the target distribution.
In [26], the Schrödinger-Föllmer sampler (SFS), a novel sampling approach that does not require ergodicity, was proposed. SFS is based on the Schrödinger-Föllmer diffusion process, defined as
$$\mathrm{d}X_t = b(X_t, t)\,\mathrm{d}t + \mathrm{d}B_t, \qquad X_0 = 0, \quad t \in [0, 1], \tag{1}$$
where $\{B_t\}_{t \in [0,1]}$ is a $d$-dimensional Brownian motion and the drift function is
$$b(x, t) = \nabla \log Q_{1-t} f(x), \qquad Q_{1-t} f(x) = \mathbb{E}_{Z \sim N(0, I_d)}\big[f(x + \sqrt{1-t}\, Z)\big],$$
with $f = \mathrm{d}\mu / \mathrm{d}N(0, I_d)$ the density ratio of the target distribution $\mu$ over the $d$-dimensional standard normal distribution. According to [28] and [18], the process in (1) was first formulated by Föllmer [19, 20, 21] when studying the Schrödinger bridge problem [36]. The main feature of the above Schrödinger-Föllmer process is that it interpolates between $\delta_0$ and the target $\mu$ in time $[0,1]$, i.e., $X_0 \sim \delta_0$ and $X_1 \sim \mu$; see Proposition 1. SFS samples from $\mu$ via the following Euler-Maruyama discretization of (1),
$$Y_{t_{k+1}} = Y_{t_k} + s\, b(Y_{t_k}, t_k) + \sqrt{s}\, \epsilon_{k+1}, \qquad Y_{t_0} = 0, \quad k = 0, 1, \ldots, K-1,$$
where $s = 1/K$ is the step size, $t_k = k s$, and $\epsilon_1, \ldots, \epsilon_K$ are independent and identically distributed from $N(0, I_d)$. If the expectations in the drift term $b$ do not have analytical forms, one can use the Monte Carlo method to evaluate $b$ approximately, i.e., one can sample according to
$$\tilde{Y}_{t_{k+1}} = \tilde{Y}_{t_k} + s\, \tilde{b}_m(\tilde{Y}_{t_k}, t_k) + \sqrt{s}\, \epsilon_{k+1}, \qquad \tilde{Y}_{t_0} = 0,$$
where $\tilde{b}_m$ is an $m$-sample Monte Carlo estimate of $b$ built from i.i.d. $Z_1, \ldots, Z_m \sim N(0, I_d)$; see (6)-(7) below. The numerical simulations in [26] demonstrate that SFS outperforms the existing samplers based on ergodicity.
In Section 4.2 of [26], they prove that
$$W_2\big(\mathrm{Law}(\tilde{Y}_{t_K}), \mu\big) \to 0$$
under a restrictive assumption that the potential $U(x, t) := \|x\|_2^2 / 2 - \log Q_{1-t} f(x)$ is uniformly strongly convex, i.e.,
$$\nabla_x^2 U(x, t) \succeq \kappa I_d \quad \text{for all } (x, t) \in \mathbb{R}^d \times [0, 1], \tag{2}$$
where $\kappa$ is a finite and positive constant. In this paper we provide a new analysis of the above SFS iteration. We establish a nonasymptotic error bound on $W_2(\mathrm{Law}(\tilde{Y}_{t_K}), \mu)$ under the condition that $f$ and $\nabla f$ are Lipschitz continuous and $f$ has a positive lower bound, but without using the uniform strong convexity requirement (2).
2 Schrödinger-Föllmer sampler
In this section we briefly recall the Schrödinger-Föllmer sampler. For more background on the Schrödinger-Föllmer diffusion process, please see [10, 28, 7, 26].
Let $\mu$ be the target distribution on $\mathbb{R}^d$, absolutely continuous with respect to the $d$-dimensional standard normal measure $\gamma_d = N(0, I_d)$. Let
$$f = \frac{\mathrm{d}\mu}{\mathrm{d}\gamma_d}$$
denote the corresponding density ratio. We assume that
- (A1) $f$ and $\nabla f$ are Lipschitz continuous with constant $\gamma > 0$;
- (A2) there exists $\xi > 0$ such that $f(x) \ge \xi$ for all $x \in \mathbb{R}^d$.
Define the heat semigroup $\{Q_t\}_{t \ge 0}$ as
$$Q_t u(x) = \mathbb{E}_{Z \sim N(0, I_d)}\big[u(x + \sqrt{t}\, Z)\big].$$
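As a simple closed-form illustration of the action of $Q_t$ (our example, not from [26]): for $u(x) = e^{a^{\top} x}$ with a fixed vector $a \in \mathbb{R}^d$,
$$Q_t u(x) = \mathbb{E}_{Z \sim N(0, I_d)}\big[e^{a^{\top}(x + \sqrt{t}\, Z)}\big] = e^{a^{\top} x}\, \mathbb{E}\big[e^{\sqrt{t}\, a^{\top} Z}\big] = e^{a^{\top} x + t \|a\|_2^2 / 2},$$
using the Gaussian moment generating function $\mathbb{E}[e^{v^{\top} Z}] = e^{\|v\|_2^2 / 2}$.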
Proposition 1.
Define a drift function
$$b(x, t) = \nabla \log Q_{1-t} f(x), \qquad (x, t) \in \mathbb{R}^d \times [0, 1].$$
If $f$ satisfies assumptions (A1) and (A2), then the Schrödinger-Föllmer diffusion
$$\mathrm{d}X_t = b(X_t, t)\,\mathrm{d}t + \mathrm{d}B_t, \qquad X_0 = 0, \quad t \in [0, 1], \tag{3}$$
has a unique strong solution and $X_1 \sim \mu$.
Remark 2.1.
- (i) From the definition of $b$ in Proposition 1, it follows that
$$b(x, t) = \frac{Q_{1-t} \nabla f(x)}{Q_{1-t} f(x)} = \frac{\mathbb{E}_{Z \sim N(0, I_d)}\big[\nabla f(x + \sqrt{1-t}\, Z)\big]}{\mathbb{E}_{Z \sim N(0, I_d)}\big[f(x + \sqrt{1-t}\, Z)\big]}.$$
- (ii) If the target distribution is $\mu(\mathrm{d}x) = \exp(-V(x))\,\mathrm{d}x / C$ with normalizing constant $C$, then $f(x) \propto \exp\big(\|x\|_2^2 / 2 - V(x)\big)$. Once $V$ is twice differentiable and the first and second derivatives of $\exp\big(\|x\|_2^2 / 2 - V(x)\big)$ are uniformly bounded over $\mathbb{R}^d$, both $f$ and $\nabla f$ are Lipschitz continuous, i.e., (A1) holds. (A2) is equivalent to the growth condition on the potential that $V(x) \le \|x\|_2^2 / 2 + c$ for some finite constant $c$.
- (iii) Under (A1) and (A2), some calculation shows that
$$\sup_{(x, t) \in \mathbb{R}^d \times [0, 1]} \|b(x, t)\|_2 \le \frac{\gamma}{\xi},$$
and
$$\|b(x, t) - b(y, t)\|_2 \le \frac{\gamma}{\xi}\Big(1 + \frac{\gamma}{\xi}\Big) \|x - y\|_2, \qquad x, y \in \mathbb{R}^d,$$
and that $b(x, \cdot)$ is Hölder continuous of order $1/2$ in the time variable; a worked derivation of the first bound is sketched after this remark. We conclude that $b$ satisfies the growth and regularity conditions (C1)-(C3) stated in Section 3.
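To make the "some calculation" in item (iii) concrete, here is a short derivation of the uniform bound on the drift (our sketch, under (A1)-(A2) with the constants $\gamma$ and $\xi$ introduced above). Since $f$ is $\gamma$-Lipschitz, $\|\nabla f(x)\|_2 \le \gamma$ for all $x$, and $Q_{1-t} f(x) = \mathbb{E}_Z[f(x + \sqrt{1-t}\, Z)] \ge \xi$ by (A2); hence
$$\|b(x, t)\|_2 = \frac{\big\|\mathbb{E}_Z\big[\nabla f(x + \sqrt{1-t}\, Z)\big]\big\|_2}{\mathbb{E}_Z\big[f(x + \sqrt{1-t}\, Z)\big]} \le \frac{\mathbb{E}_Z\big[\|\nabla f(x + \sqrt{1-t}\, Z)\|_2\big]}{\xi} \le \frac{\gamma}{\xi}.$$
The space-Lipschitz bound in (iii) follows in the same way after adding and subtracting $Q_{1-t} \nabla f(y) / Q_{1-t} f(x)$.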
Proposition 1 shows that the Schrödinger-Föllmer diffusion transports $\delta_0$ to the target $\mu$ over the unit time interval. Moreover, the drift term is scale-invariant with respect to $f$ in the sense that, for any constant $c > 0$, $\nabla \log Q_{1-t}(c f)(x) = \nabla \log Q_{1-t} f(x)$. Therefore, the Schrödinger-Föllmer diffusion can be used for sampling from $\mu$ even when the normalizing constant of $\mu$ is unknown. To this end, we use the Euler-Maruyama method to discretize the Schrödinger-Föllmer diffusion (3). Let $s = 1/K$ be the step size and $t_k = k s$ for $k = 0, 1, \ldots, K$;
the Euler-Maruyama scheme reads
$$Y_{t_{k+1}} = Y_{t_k} + s\, b(Y_{t_k}, t_k) + \sqrt{s}\, \epsilon_{k+1}, \qquad Y_{t_0} = 0, \quad k = 0, 1, \ldots, K-1, \tag{4}$$
where $\epsilon_1, \ldots, \epsilon_K$ are i.i.d. $N(0, I_d)$ and
$$b(x, t_k) = \frac{\mathbb{E}_{Z \sim N(0, I_d)}\big[\nabla f(x + \sqrt{1-t_k}\, Z)\big]}{\mathbb{E}_{Z \sim N(0, I_d)}\big[f(x + \sqrt{1-t_k}\, Z)\big]}. \tag{5}$$
The expectations in (5) may not admit explicit expressions. In that case, we can obtain an estimator of $b$ by replacing each expectation in (5) with an $m$-sample mean, i.e.,
$$\tilde{b}_m(x, t_k) = \frac{\frac{1}{m} \sum_{j=1}^m \nabla f(x + \sqrt{1-t_k}\, Z_j)}{\frac{1}{m} \sum_{j=1}^m f(x + \sqrt{1-t_k}\, Z_j)}, \tag{6}$$
or, using the Gaussian integration-by-parts identity $\mathbb{E}_Z[\nabla f(x + \sqrt{1-t}\, Z)] = \mathbb{E}_Z[Z f(x + \sqrt{1-t}\, Z)] / \sqrt{1-t}$,
$$\tilde{b}_m(x, t_k) = \frac{\frac{1}{m} \sum_{j=1}^m Z_j f(x + \sqrt{1-t_k}\, Z_j)}{\sqrt{1-t_k}\, \cdot\, \frac{1}{m} \sum_{j=1}^m f(x + \sqrt{1-t_k}\, Z_j)}, \tag{7}$$
where $Z_1, \ldots, Z_m$ are i.i.d. $N(0, I_d)$.
The detailed description of SFS is summarized in Algorithm 1 below, which is Algorithm 2 in [26].
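To make the procedure concrete, the following is a minimal NumPy sketch of the iteration (4) with the Monte Carlo drift estimator (6). It is our illustration rather than the authors' code; the names `sfs_sampler`, `f`, and `grad_f` are ours, and `f`, `grad_f` are assumed to be vectorized over the rows of their input.

```python
import numpy as np

def sfs_sampler(f, grad_f, d, K=100, m=1000, rng=None):
    """Schrödinger-Föllmer sampler: Euler-Maruyama scheme (4)
    with the m-sample Monte Carlo drift estimator (6).

    f      : density ratio d(mu)/d(N(0, I_d)); maps an (n, d) array to (n,)
    grad_f : gradient of f; maps an (n, d) array to (n, d)
    """
    rng = np.random.default_rng() if rng is None else rng
    s = 1.0 / K                              # step size
    y = np.zeros(d)                          # Y_{t_0} = 0
    for k in range(K):
        t = k * s                            # t_k = k s, so 1 - t_k >= 1/K > 0
        z = rng.standard_normal((m, d))      # Z_1, ..., Z_m ~ N(0, I_d)
        x = y + np.sqrt(1.0 - t) * z         # points y + sqrt(1 - t_k) Z_j
        b = grad_f(x).mean(axis=0) / f(x).mean()              # estimator (6)
        y = y + s * b + np.sqrt(s) * rng.standard_normal(d)   # update (4)
    return y

# Sanity check: for f == 1 the target is N(0, I_d) itself and the drift vanishes.
sample = sfs_sampler(lambda x: np.ones(len(x)), lambda x: np.zeros_like(x), d=2)
```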
In Section 4.2 of [26], they proved that $W_2(\mathrm{Law}(\tilde{Y}_{t_K}), \mu) \to 0$
under the restrictive assumption that the potential is uniformly strongly convex, i.e., satisfies (2). However, (2) is not easy to verify. In the next section, we establish a nonasymptotic bound on the Wasserstein distance between the law of $\tilde{Y}_{t_K}$ generated by SFS (Algorithm 1) and the target $\mu$ under the smoothness and boundedness conditions (A1) and (A2), but without using the uniform strong convexity assumption (2) on the potential.
3 Nonasymptotic bound without convexity
Under conditions (A1) and (A2), one can easily deduce the growth condition and the Lipschitz/Hölder continuity of the drift term $b$ [26], i.e., for all $x, y \in \mathbb{R}^d$ and $t, u \in [0, 1]$,
$$\|b(x, t)\|_2 \le C_1 (1 + \|x\|_2), \tag{C1}$$
$$\|b(x, t) - b(y, t)\|_2 \le C_2 \|x - y\|_2, \tag{C2}$$
$$\|b(x, t) - b(y, u)\|_2 \le C_2 \big(\|x - y\|_2 + (1 + \|x\|_2)\, |t - u|^{1/2}\big), \tag{C3}$$
where $C_1$ and $C_2$ are two finite and positive constants.
Remark 3.1.
(C1) and (C2) are essentially the sufficient conditions under which the Schrödinger-Föllmer SDE (3) admits a unique strong solution. (C3) was introduced in Theorem 4.1 of [38], and it is also similar to condition H2 of [6] and Assumption 3.2 of [1]. Obviously, (C3) implies (C2), and (C1) holds if the drift term is bounded over $\mathbb{R}^d \times [0, 1]$, which is the case here by Remark 2.1(iii).
Let $\mathcal{C}(\mu_1, \mu_2)$ be the collection of coupling probability measures on $\mathbb{R}^d \times \mathbb{R}^d$ whose respective marginal distributions are $\mu_1$ and $\mu_2$. The Wasserstein distance of order 2, with which we measure the discrepancy between $\mathrm{Law}(\tilde{Y}_{t_K})$ and $\mu$, is defined as
$$W_2(\mu_1, \mu_2) = \inf_{\nu \in \mathcal{C}(\mu_1, \mu_2)} \Big( \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|_2^2 \, \nu(\mathrm{d}x, \mathrm{d}y) \Big)^{1/2}.$$
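For intuition about the quantity being bounded, the following minimal sketch (our illustration, not part of the analysis) estimates $W_2$ in the one-dimensional case, where the optimal coupling between two equal-size empirical samples is the monotone (quantile) coupling, so $W_2$ reduces to matching sorted samples:

```python
import numpy as np

def w2_empirical_1d(x, y):
    """Order-2 Wasserstein distance between two equal-size empirical
    measures on the real line, via the monotone (sorted) coupling."""
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

rng = np.random.default_rng(0)
a = rng.standard_normal(100_000)           # sample from N(0, 1)
b = 0.5 + rng.standard_normal(100_000)     # sample from N(0.5, 1)
print(w2_empirical_1d(a, b))               # approximately 0.5, the exact W_2
```

For two Gaussians with equal variance, $W_2$ equals the distance between the means, which the empirical estimate recovers up to sampling error.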
Theorem 2.
Assume (A1) and (A2) hold; then
where $s = 1/K$ is the step size and $m$ is the Monte Carlo sample size in (6)-(7).
Remark 3.2.
This theorem provides guidance on the selection of $s$ and $m$. To ensure convergence of the distribution of $\tilde{Y}_{t_K}$, we should take the step size $s$ sufficiently small and the Monte Carlo sample size $m$ sufficiently large. In high-dimensional models with a large $d$, we need to generate a large number of random vectors from $N(0, I_d)$ to obtain an accurate estimate of the drift term $b$. If we assume that $f$ is bounded from above, we can improve the nonasymptotic error bound, in the sense that the required Monte Carlo sample size $m$ can be reduced; see Theorem 3.
Theorem 3.
Assume that, in addition to the conditions of Theorem 2, $f$ has a finite upper bound; then
where $s = 1/K$ is the step size.
Remark 3.3.
With the boundedness condition on $f$, to ensure convergence of the sampling distribution, we can again take the step size $s$ sufficiently small and $m$ sufficiently large. Note that the sample size $m$ required for approximating the drift term is significantly less stringent than that in Theorem 2.
Remark 3.4.
Langevin sampling methods have been studied under the (strongly) convex potential assumption [15, 16, 17, 11, 12, 8, 13], the dissipativity condition on the drift term [34, 33, 40], and the local convexity condition on the potential function outside a ball [17, 9, 30, 3]. However, these conditions may not hold for models with multiple modes, for example Gaussian mixtures, whose potentials are not convex and for which the log-Sobolev inequality may not be satisfied. Moreover, the constant in the log-Sobolev inequality may depend exponentially on the dimension [39, 24, 31, 34], implying that Langevin samplers can suffer from the curse of dimensionality. SFS does not require the underlying Markov process to be ergodic; therefore, our results in Theorems 2 and 3 are established under the smoothness and boundedness assumptions (A1) and (A2) on $f$, and do not need the above-mentioned conditions used in the analysis of Langevin samplers.
In Theorem 2 and Theorem 3 we use (A2), i.e., $f$ has a positive lower bound; however, (A2) may not hold if the target distribution admits compact support. To circumvent this difficulty, we consider the regularized probability measure
$$\mu_\epsilon = (1 - \epsilon)\, \mu + \epsilon\, \gamma_d, \qquad \epsilon \in (0, 1).$$
The corresponding density ratio is
$$f_\epsilon = \frac{\mathrm{d}\mu_\epsilon}{\mathrm{d}\gamma_d} = (1 - \epsilon) f + \epsilon.$$
Obviously, $f_\epsilon \ge \epsilon$, so $f_\epsilon$ satisfies (A1) and (A2) as long as $f$ and $\nabla f$ are Lipschitz continuous. Since $\mu_\epsilon$ approximates $\mu$ well if we set $\epsilon$ small enough, we consider sampling from $\mu_\epsilon$ by running SFS (Algorithm 1) with $f$ replaced by $f_\epsilon$, as sketched below. We use $\tilde{Y}^{\epsilon}_{t_K}$ to denote the last iterate of SFS.
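To make this modification concrete, here is a minimal sketch (ours, not from [26]) of the Monte Carlo drift estimator with $f$ replaced by $f_\epsilon$; it uses $\nabla f_\epsilon = (1 - \epsilon) \nabla f$ and the fact that the denominator is now at least $\epsilon$:

```python
import numpy as np

def drift_regularized(y, t, f, grad_f, m, eps, rng):
    """Monte Carlo drift estimator (6) for the regularized ratio
    f_eps = (1 - eps) * f + eps, whose denominator is at least eps > 0."""
    z = rng.standard_normal((m, y.shape[0]))     # Z_1, ..., Z_m ~ N(0, I_d)
    x = y + np.sqrt(1.0 - t) * z                 # points y + sqrt(1 - t) Z_j
    num = (1.0 - eps) * grad_f(x).mean(axis=0)   # grad f_eps = (1 - eps) grad f
    den = (1.0 - eps) * f(x).mean() + eps        # f_eps >= eps by construction
    return num / den
```

Plugging this drift into the Euler-Maruyama loop of the earlier sketch yields samples approximately distributed as $\mu_\epsilon$, and hence, for small $\epsilon$, approximately as $\mu$.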
Theorem 4.
Assume (A1) holds and choose $\epsilon$ accordingly; then
where $s = 1/K$ is the step size and the constant in the bound depends on $\epsilon$. Moreover, if $f$ has a finite upper bound and $\epsilon$ is chosen accordingly, then
4 Conclusion
In [26], the Schrödinger-Föllmer sampler (SFS) was proposed for sampling from possibly unnormalized distributions. The key feature of SFS is that it does not need ergodicity as its theoretical basis. However, the consistency of SFS proved in [26] relies on a restrictive assumption that the potential function is uniformly strongly convex. In this paper we provide a new convergence analysis of SFS without the strong convexity condition on the potential. We establish a nonasymptotic error bound in Wasserstein distance between the law of the output of SFS and the target distribution, under smoothness and boundedness assumptions on the density ratio of the target distribution over the standard normal distribution.
5 Acknowledgment
The authors would like to thank Professor Liming Wu at Université Clermont-Auvergne for helpful discussions on this topic.
Appendix A Appendix
A.1 Proof of Proposition 1
A.2 Preliminary lemmas for Theorems 2-3
Lemma 5.
Assume (A1) and (A2) hold; then
Proof.
Lemma 6.
Assume (A1) and (A2) hold; then for any
Proof.
Lemma 7.
Assume (A1) and (A2) hold; then for any
Moreover, if $f$ has a finite upper bound, then
Proof.
Denote by $\{Z_j\}_{j=1}^m$ and $\{Z'_j\}_{j=1}^m$ two independent sets of independent copies of $Z \sim N(0, I_d)$. For notational convenience, we denote
By (A2), $f \ge \xi$, and hence the denominators appearing below are bounded from below by $\xi$. Then,
(A.1) |
where the second inequality holds by (A1). Similarly, we also have
(A.2) |
where the second inequality holds due to (A1). Thus, by (A.1) and (A.2), it follows that
(A.3) |
(A.4) |
Then, by (A1) and (A2), a simple calculation yields
(A.5) |
Let , then
(A.6) |
Therefore, by (A.3)-(A.6), it can be concluded that
Lemma 8.
Assume (A1) and (A2) hold; then for
Proof.
Define and , where with . By (A1) and (A2), it follows that for all and ,
(A.8) |
Then, by (A.8), we have
Further, we can get
Therefore,
Since the scheme starts from $\tilde{Y}_{t_0} = 0$, by induction we have
∎
Lemma 9.
Assume (A1) and (A2) hold; then for
Moreover, if $f$ has a finite upper bound, then
Proof.
Let , then
(A.9) |
Next, we bound the two terms on the right-hand side of (A.9). First, by Lemma 7, we have
Second, combining (A.8) and Lemma 8 with the Markov inequality, we obtain
Hence,
(A.10) |
Choosing the free parameter in (A.10) appropriately, we then have
Moreover, if $f$ has a finite upper bound, then by Lemma 7 we can similarly get
This completes the proof. ∎
A.3 Proof of Theorem 2
A.4 Proof of Theorem 3
A.5 Preliminary lemmas for Theorem 4
Lemma 10.
Assume (A1) holds; then for any
where . Moreover, if $f$ has a finite upper bound, then
Proof.
Denote by $\{Z_j\}_{j=1}^m$ and $\{Z'_j\}_{j=1}^m$ two independent sets of independent copies of $Z \sim N(0, I_d)$. For notational convenience, we denote
where $f_\epsilon = (1 - \epsilon) f + \epsilon$. Since $f_\epsilon \ge \epsilon$, the denominators appearing below are bounded from below by $\epsilon$. By (A1), $f_\epsilon$ and $\nabla f_\epsilon$ are Lipschitz continuous. Thus there exists a finite and positive constant such that for all $x \in \mathbb{R}^d$,
(A.12) |
(A.13) |
Therefore,
(A.14) |
where the second inequality follows from (A.13). Similarly, we also have
(A.15) |
where the second inequality follows from (A.12). Hence, by (A.14) and (A.15), we have
(A.16) |
(A.17) |
Then, by (A.12) and (A.13), a simple calculation yields
(A.18) |
Let , then
(A.19) |
Therefore, by (A.16)-(A.19), it can be concluded that
Lemma 11.
Assume (A1) holds; then for
where .
Proof.
Define and , where with . By (A1), there exists a finite and positive constant such that the drift is Lipschitz continuous. Then, for all and , we have
(A.21) |
By (A.21), we have
Furthermore, it can be shown that
Therefore,
Since the scheme starts from $\tilde{Y}^{\epsilon}_{t_0} = 0$, by induction we have
∎
Lemma 12.
Assume (A1) holds; then for
where . Moreover, if $f$ has a finite upper bound, then
Proof.
Let , then
(A.22) |
Next, we bound the two terms on the right-hand side of (A.22). First, by Lemma 10, we have
Second, by combining (A.21) and Lemma 11 with the Markov inequality, we have
Therefore,
(A.23) |
Choosing the free parameter in (A.23) appropriately, we have
Moreover, if $f$ has a finite upper bound, then by Lemma 10 we can similarly get
This completes the proof. ∎
A.6 Proof of Theorem 4
Proof.
By the triangle inequality, we have
$$W_2\big(\mathrm{Law}(\tilde{Y}^{\epsilon}_{t_K}), \mu\big) \le W_2\big(\mathrm{Law}(\tilde{Y}^{\epsilon}_{t_K}), \mu_\epsilon\big) + W_2(\mu_\epsilon, \mu),$$
and we bound the two terms on the right-hand side in turn.
First, similar to the proof of Theorem 2, by Lemma 12 and through some calculation, we can conclude that
(A.24) |
Second, we need an upper bound on $W_2(\mu_\epsilon, \mu)$. Let $X \sim \mu$ and $Z \sim \gamma_d$, and let $\eta$ be a Bernoulli random variable satisfying $P(\eta = 1) = \epsilon$ and $P(\eta = 0) = 1 - \epsilon$. Assume $X$, $Z$ and $\eta$ are independent of each other. Then $\big((1 - \eta) X + \eta Z,\, X\big)$ is a coupling of $(\mu_\epsilon, \mu)$; denote its joint distribution by $\Theta$. Therefore, we have
$$\mathbb{E}_{\Theta}\big[\|(1 - \eta) X + \eta Z - X\|_2^2\big] = \epsilon\, \mathbb{E}\big[\|Z - X\|_2^2\big] = \epsilon \big(d + \mathbb{E}_{\mu}[\|X\|_2^2]\big).$$
Then we have
$$W_2(\mu_\epsilon, \mu) \le \Big(\epsilon \big(d + \mathbb{E}_{\mu}[\|X\|_2^2]\big)\Big)^{1/2}. \tag{A.25}$$
Combining (A.24) with (A.25), it follows that
(A.26) |
Choosing $\epsilon$ appropriately in (A.26), there exists a finite and positive constant such that
Moreover, if $f$ has a finite upper bound, then, similarly to the proof of Theorem 3, by (A.25) and Lemma 12 we have
(A.27) |
Choosing $\epsilon$ appropriately in (A.27), there exists a finite and positive constant such that
∎
References
- [1] Mathias Barkhagen, Ngoc Huy Chau, Éric Moulines, Miklós Rásonyi, Sotirios Sabanis, and Ying Zhang, On stochastic gradient Langevin dynamics with dependent data streams in the logconcave case, arXiv preprint arXiv:1812.02709, (2018).
- [2] Joris Bierkens, Paul Fearnhead, and Gareth Roberts, The zig-zag process and super-efficient sampling for Bayesian analysis of big data, The Annals of Statistics, 47 (2019), pp. 1288–1320.
- [3] Nawaf Bou-Rabee, Andreas Eberle, and Raphael Zimmer, Coupling and convergence for Hamiltonian Monte Carlo, The Annals of Applied Probability, 30 (2020), pp. 1209–1250.
- [4] Alexandre Bouchard-Côté, Sebastian J. Vollmer, and Arnaud Doucet, The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method, Journal of the American Statistical Association, 113 (2018), pp. 855–867.
- [5] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, Handbook of Markov Chain Monte Carlo, CRC Press, 2011.
- [6] Ngoc Huy Chau, Éric Moulines, Miklós Rásonyi, Sotirios Sabanis, and Ying Zhang, On stochastic gradient Langevin dynamics with dependent data streams: the fully non-convex case, arXiv preprint arXiv:1905.13142, (2019).
- [7] Yongxin Chen, Tryphon T. Georgiou, and Michele Pavon, Stochastic control liaisons: Richard Sinkhorn meets Gaspard Monge on a Schrödinger bridge, SIAM Review, 63 (2021), pp. 249–313.
- [8] Xiang Cheng and Peter Bartlett, Convergence of Langevin MCMC in KL-divergence, Proceedings of Machine Learning Research, Volume 83: Algorithmic Learning Theory, (2018), pp. 186–211.
- [9] Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, and Michael I. Jordan, Sharp convergence rates for Langevin dynamics in the nonconvex setting, arXiv preprint arXiv:1805.01648, (2018).
- [10] Paolo Dai Pra, A stochastic control approach to reciprocal diffusion processes, Applied Mathematics and Optimization, 23 (1991), pp. 313–329.
- [11] Arnak S. Dalalyan, Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent, arXiv: Statistics Theory, (2017).
- [12] Arnak S. Dalalyan, Theoretical guarantees for approximate sampling from smooth and log-concave densities, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79 (2017), pp. 651–676.
- [13] Arnak S. Dalalyan and Avetik G. Karagulyan, User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient, Stochastic Processes and their Applications, 129 (2019), pp. 5278–5311.
- [14] David B. Dunson and J. E. Johndrow, The Hastings algorithm at fifty, Biometrika, 107 (2020), pp. 1–23.
- [15] Alain Durmus and Éric Moulines, High-dimensional Bayesian inference via the unadjusted Langevin algorithm, arXiv: Statistics Theory, (2016).
- [16] Alain Durmus and Éric Moulines, Sampling from a strongly log-concave distribution with the unadjusted Langevin algorithm, arXiv: Statistics Theory, (2016).
- [17] Alain Durmus and Éric Moulines, Nonasymptotic convergence analysis for the unadjusted Langevin algorithm, The Annals of Applied Probability, 27 (2017), pp. 1551–1587.
- [18] Ronen Eldan, Joseph Lehec, and Yair Shenfeld, Stability of the logarithmic Sobolev inequality via the Föllmer process, Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 56 (2020), pp. 2253–2269.
- [19] Hans Föllmer, An entropy approach to the time reversal of diffusion processes, in Stochastic Differential Systems Filtering and Control, Springer, 1985, pp. 156–163.
- [20] Hans Föllmer, Time reversal on Wiener space, in Stochastic Processes - Mathematics and Physics, Springer, 1986, pp. 119–129.
- [21] Hans Föllmer, Random fields and diffusion processes, in École d'Été de Probabilités de Saint-Flour XV–XVII, 1985–87, Springer, 1988, pp. 101–203.
- [22] Alan E. Gelfand and Adrian F. M. Smith, Sampling-based approaches to calculating marginal densities, Journal of the American Statistical Association, 85 (1990), pp. 398–409.
- [23] Stuart Geman and Donald Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence, (1984), pp. 721–741.
- [24] Jack K. Hale, Asymptotic Behavior of Dissipative Systems, no. 25, American Mathematical Society, 2010.
- [25] W. Keith Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, 57 (1970), pp. 97–109.
- [26] Jian Huang, Yuling Jiao, Lican Kang, Xu Liao, Jin Liu, and Yanyan Liu, Schrödinger-Föllmer sampler: Sampling without ergodicity, arXiv preprint arXiv:2106.10880, (2021).
- [27] Joseph Lehec, Representation formula for the entropy and functional inequalities, Annales de l'IHP Probabilités et Statistiques, 49 (2013), pp. 885–899.
- [28] Christian Léonard, A survey of the Schrödinger problem and some of its connections with optimal transport, Discrete and Continuous Dynamical Systems - A, 34 (2014), p. 1533.
- [29] Jun S. Liu, Monte Carlo Strategies in Scientific Computing, Springer Science & Business Media, 2008.
- [30] Yi-An Ma, Yuansi Chen, Chi Jin, Nicolas Flammarion, and Michael I. Jordan, Sampling can be faster than optimization, Proceedings of the National Academy of Sciences, 116 (2019), pp. 20881–20885.
- [31] Georg Menz and André Schlichting, Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape, The Annals of Probability, 42 (2014), pp. 1809–1884.
- [32] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller, Equation of state calculations by fast computing machines, The Journal of Chemical Physics, 21 (1953), pp. 1087–1092.
- [33] Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, and Peter L. Bartlett, Improved bounds for discretization of Langevin diffusions: Near-optimal rates without convexity, arXiv preprint arXiv:1907.11331, (2019).
- [34] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky, Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis, in Conference on Learning Theory, PMLR, 2017, pp. 1674–1703.
- [35] Christian P. Robert and George Casella, Introducing Monte Carlo Methods with R, vol. 18, Springer, 2010.
- [36] Erwin Schrödinger, Sur la théorie relativiste de l'électron et l'interprétation de la mécanique quantique, Annales de l'Institut Henri Poincaré, 2 (1932), pp. 269–310.
- [37] Luke Tierney, Markov chains for exploring posterior distributions, The Annals of Statistics, 22 (1994), pp. 1701–1728.
- [38] Belinda Tzen and Maxim Raginsky, Theoretical guarantees for sampling and inference in generative models with latent diffusions, in Conference on Learning Theory, PMLR, 2019, pp. 3084–3114.
- [39] Feng-Yu Wang, Log-Sobolev inequalities: different roles of Ric and Hess, The Annals of Probability, 37 (2009), pp. 1587–1604.
- [40] Ying Zhang, Ömer Deniz Akyildiz, Theo Damoulas, and Sotirios Sabanis, Nonasymptotic estimates for stochastic gradient Langevin dynamics under local conditions in nonconvex optimization, arXiv preprint arXiv:1910.02008, (2019).