Provable Benefit of Annealed Langevin Monte Carlo for Non-log-concave Sampling
Abstract
We address the outstanding problem of sampling from an unnormalized density that may be non-log-concave and multimodal. To enhance the performance of simple Markov chain Monte Carlo (MCMC) methods, techniques of annealing type have been widely used. However, quantitative theoretical guarantees of these techniques are under-explored. This study takes a first step toward providing a non-asymptotic analysis of annealed MCMC. Specifically, we establish, for the first time, an oracle complexity of $\widetilde{O}\left(\frac{d\beta^2 A^2}{\varepsilon^6}\right)$ for the simple annealed Langevin Monte Carlo algorithm to achieve $\varepsilon^2$ accuracy in Kullback-Leibler divergence to the target distribution $\pi \propto e^{-V}$ on $\mathbb{R}^d$ with $\beta$-smooth potential $V$. Here, $A$ represents the action of a curve of probability measures interpolating the target distribution and a readily sampleable distribution.
1 Introduction
We study the task of efficient sampling from a probability distribution $\pi \propto e^{-V}$ on $\mathbb{R}^d$. This fundamental problem is pivotal across various fields including computational statistics [42, 50, 5, 48], Bayesian inference [30], statistical physics [46, 37], and finance [17], and has been extensively studied in the literature [12]. The most common approach to this problem is Markov chain Monte Carlo (MCMC), among which Langevin Monte Carlo (LMC) [20, 61, 13, 44] is a particularly popular choice. LMC can be understood as a time-discretization of a diffusion process, known as Langevin diffusion (LD), whose stationary distribution is the target distribution $\pi$, and has been attractive partly due to its robust performance despite conceptual simplicity.
Although LMC and its variants converge rapidly when the target distribution is strongly log-concave or satisfies isoperimetric inequalities such as the log-Sobolev inequality (LSI) [20, 61, 13], its effectiveness diminishes when dealing with target distributions that are not strongly log-concave or are multimodal, such as mixtures of Gaussians. In such scenarios, the sampler often becomes confined to a single mode, severely limiting its ability to explore the entire distribution effectively. This results in significant challenges in transitioning between modes, which can dramatically increase the mixing time, making it exponential in the dimension $d$. Such limitations highlight the need for enhanced MCMC methodologies that can efficiently navigate the complex landscapes of multimodal distributions, thereby improving convergence rates and overall sampling efficiency.
To address the challenges posed by multimodality, techniques around the notion of annealing have been widely employed [29, 45]. The general philosophy involves constructing a sequence of intermediate distributions that bridge the gap between an easily sampleable distribution $\pi_0$ (e.g., Gaussian or Dirac-like) and the target distribution $\pi$. The process starts with sampling from $\pi_0$ and progressively samples from each subsequent distribution until $\pi$ is reached. When two consecutive distributions are close enough, approximate samples from one can serve as a warm start for sampling from the next, thereby facilitating this transition. Employing LMC within this framework gives rise to what is known as the annealed LMC algorithm, which is the focus of our study. Despite its empirical success [54, 55, 69, 68], a thorough theoretical understanding of annealed LMC, particularly its non-asymptotic complexity bounds, remains elusive.
In this work, we take a first step toward providing a non-asymptotic analysis of annealed MCMC. By leveraging the Girsanov theorem to quantify the differences between the sampling dynamics and a reference process, we establish an upper bound on the error of annealed MCMC as the sum of two terms. One term equals the ratio of an action integral, measured in the Wasserstein geometry induced by optimal transport, to the duration of the process; it diminishes when the annealing is sufficiently slow. The other term quantifies the discretization error in implementation. Our approach challenges the traditional point of view that annealed MCMC is a cascade of warm starts; analysis based on this viewpoint still requires log-concavity or isoperimetry. Instead, our theoretical analysis of annealed Langevin-based MCMC advances the field by eliminating the reliance on assumptions related to log-concavity or isoperimetry, and represents a paradigm shift in analyzing annealed MCMC. Although a fully polynomial complexity bound has not yet been achieved, we hope our strategy can inspire future work toward this ambitious goal.
Our key technical contributions are summarized as follows.
1. We discover a novel strategy to analyze the non-asymptotic complexity bound of annealed MCMC algorithms that can circumvent the need for log-concavity or isoperimetry.
2. In Section 4, we investigate the annealed LD, which involves running LD with a dynamically changing target distribution, and derive a surprising bound on the time required to simulate the SDE to achieve $\varepsilon^2$-accuracy in KL divergence.
3. Building on the insights from the analysis of the continuous dynamics and accounting for discretization errors, we establish a non-asymptotic oracle complexity bound for annealed LMC in Section 5, which is applicable to a broad range of annealing schemes.
The quantitative results are summarized and compared to other sampling algorithms in Table 1. Notably, our approach operates under the least stringent assumptions and exhibits the most favorable parameter dependence among all isoperimetry-free sampling methods.
Table 1: Comparison of sampling algorithms, listing for each its assumptions (e.g., $\alpha$-LSI, smooth potential), error criterion (KL, TV, or TV+W2), and oracle complexity.
Related works and comparison. We provide a brief overview of the literature, mainly focusing on the algorithms for non-log-concave sampling and their theoretical analysis.
• Samplers based on tempering. The fundamental concept of tempering involves sampling the system at various temperatures simultaneously: at higher temperatures, the distribution flattens, allowing particles to easily transition between modes, while at lower temperatures, particles can more effectively explore local structures. In simulated tempering [43, 64], the system's temperature is randomly switched, while in parallel tempering (also known as replica exchange) [57, 38], the temperatures of two particles are swapped according to a specific rule. However, quantitative theoretical results for tempering are limited, and the existing results (e.g., [27, 28, 19]) apply primarily to certain special classes of non-log-concave distributions.
• Samplers based on general diffusions. Inspired by score-based diffusion models [33, 53, 56], recent advances have introduced sampling methods that reverse the Ornstein-Uhlenbeck process, as detailed in [35, 34, 32]. These samplers exhibit reduced sensitivity to isoperimetric conditions, but rely on estimating score functions (gradients of log-density) via importance sampling, which poses significant challenges in high-dimensional settings. Concurrently, studies such as [66, 59, 60, 49] have employed neural networks to approximate unknown drift terms, enabling an SDE to transport a readily sampleable distribution to the target distribution. This approach has shown excellent performance in handling complex distributions, albeit at the expense of significant computational resources required for neural network training. In contrast, annealed LMC runs on a known interpolation of probability distributions, thus simplifying sampling by obviating the need for intensive score estimation or neural network training.
• Non-asymptotic analysis for non-log-concave sampling. Drawing upon the stationary-point analysis in non-convex optimization, the seminal work [4] characterizes the convergence of non-log-concave sampling via Fisher divergence. Subsequently, [11] applies this methodology to examine the local mixing of LMC. However, Fisher divergence is a relatively weak criterion compared to more commonly employed metrics such as total-variation distance or Wasserstein distances. In contrast, our study provides a convergence guarantee in terms of KL divergence, which implies convergence in total-variation distance and offers a stronger result.
Notations and definitions. For $a, b \in \mathbb{R}$, we define $a \wedge b := \min\{a, b\}$, $a \vee b := \max\{a, b\}$, and $[n] := \{1, \ldots, n\}$ for $n \in \mathbb{N}$. For $f, g \ge 0$, the notations $f \lesssim g$, $g \gtrsim f$, and $f = O(g)$ indicate that $f \le c g$ for some universal constant $c > 0$, and the notations $f \asymp g$ and $f = \Theta(g)$ stand for both $f \lesssim g$ and $f \gtrsim g$. $\widetilde{O}(\cdot)$ hides logarithmic dependence in $O(\cdot)$. A function $f$ is $\alpha$-strongly-convex if $f - \frac{\alpha}{2}\|\cdot\|^2$ is convex, and is $\beta$-smooth if $\|\nabla f(x) - \nabla f(y)\| \le \beta \|x - y\|$ for all $x, y$. The total-variation (TV) distance is defined as $\mathrm{TV}(\mu, \nu) := \sup_A |\mu(A) - \nu(A)|$, and the Kullback-Leibler (KL) divergence is defined as $\mathrm{KL}(\mu \,\|\, \nu) := \int \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \,\mathrm{d}\mu$. $\|\cdot\|$ represents the $\ell^2$ norm on $\mathbb{R}^d$. For $f: \mathbb{R}^d \to \mathbb{R}$ and a probability measure $\mu$ on $\mathbb{R}^d$, $\mathbb{E}_\mu[f] := \int f \,\mathrm{d}\mu$, and the second-order moment of $\mu$ is defined as $\mathbb{E}_\mu[\|x\|^2]$.
2 Preliminaries
2.1 Stochastic Differential Equations and Girsanov Theorem
A stochastic differential equation (SDE) is a stochastic process $(X_t)_{t \in [0, T]}$ on $C([0, T]; \mathbb{R}^d)$, the space of continuous functions from $[0, T]$ to $\mathbb{R}^d$. The dynamics of $X$ are typically represented by the equation $\mathrm{d}X_t = b_t(X_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}B_t$, $X_0 \sim \mu_0$, where $(B_t)_{t \ge 0}$ is a standard Brownian motion in $\mathbb{R}^d$. The path measure of $X$, denoted $P^X$, characterizes the distribution of $X$ over $C([0, T]; \mathbb{R}^d)$ and is defined by $P^X(E) := \mathbb{P}(X \in E)$ for all measurable subsets $E$ of $C([0, T]; \mathbb{R}^d)$. The following lemma, as a corollary of the Girsanov theorem [36, 58, 41], provides a methodology for computing the KL divergence between two path measures and serves as a crucial technical tool in our proof.
Lemma 1.
Assume we have the following two SDEs on $[0, T]$ with the same initialization $X_0 = Y_0 = x_0$:
$\mathrm{d}X_t = a_t(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t, \qquad \mathrm{d}Y_t = b_t(Y_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t.$
Let $P^X$ and $P^Y$ denote the path measures of $(X_t)_{t \in [0, T]}$ and $(Y_t)_{t \in [0, T]}$, respectively. Then
$\mathrm{KL}(P^X \,\|\, P^Y) = \frac{1}{4} \int_0^T \mathbb{E}\left[\|a_t(X_t) - b_t(X_t)\|^2\right] \mathrm{d}t.$
2.2 Langevin Diffusion and Langevin Monte Carlo
The Langevin diffusion (LD) with target distribution $\pi \propto e^{-V}$ is the solution to the SDE

$\mathrm{d}X_t = -\nabla V(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t.$  (1)

It is well-known that under mild conditions, $\pi$ is the unique stationary distribution of this SDE, and when $\pi$ has good regularity properties, the marginal distribution of $X_t$ converges to $\pi$ as $t \to \infty$, so we can sample from $\pi$ by simulating Equation 1 for a long time. However, in most cases, LD is intractable to simulate exactly, and the Euler-Maruyama discretization of Equation 1 leads to the Langevin Monte Carlo (LMC) algorithm. LMC with step size $h > 0$ and target distribution $\pi$ is a Markov chain constructed by iterating the following update rule:

$x_{k+1} = x_k - h \nabla V(x_k) + \sqrt{2h}\,\xi_k,$  (2)

where $\xi_k \sim \mathcal{N}(0, I_d)$ are i.i.d. standard Gaussian vectors.
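As a concrete illustration (not tied to any experiment in the paper), the update rule (2) takes only a few lines of code; the standard-Gaussian target below is a hypothetical example chosen because the correct answer is known:

```python
import numpy as np

def lmc(grad_V, x0, h, n_steps, rng):
    """Langevin Monte Carlo: Euler-Maruyama discretization of LD (1).

    Iterates x_{k+1} = x_k - h * grad_V(x_k) + sqrt(2h) * xi_k,
    with xi_k ~ N(0, I).
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - h * grad_V(x) + np.sqrt(2.0 * h) * rng.normal(size=x.shape)
    return x

# Target pi ∝ exp(-V) with V(x) = x^2 / 2 (standard Gaussian), so grad_V(x) = x
rng = np.random.default_rng(0)
samples = np.array([lmc(lambda x: x, x0=[3.0], h=0.01, n_steps=1500, rng=rng)[0]
                    for _ in range(300)])
# For this log-concave target the chain mixes quickly, and the end points
# should be approximately N(0, 1) up to a small discretization bias.
```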
2.3 Wasserstein Distance and Curves of Probability Measures
We briefly introduce several fundamental concepts in optimal transport, and direct readers to authoritative textbooks [62, 63, 2, 1] for an in-depth exploration.
For two probability measures $\mu, \nu$ on $\mathbb{R}^d$ with finite second-order moments, the Wasserstein-2 (W2) distance between $\mu$ and $\nu$ is defined as

$W_2(\mu, \nu) := \left(\inf_{\gamma \in \Pi(\mu, \nu)} \int \|x - y\|^2 \,\mathrm{d}\gamma(x, y)\right)^{1/2},$

where $\Pi(\mu, \nu)$ is the set of all couplings of $(\mu, \nu)$, i.e., probability measures $\gamma$ on $\mathbb{R}^d \times \mathbb{R}^d$ with $\gamma(A \times \mathbb{R}^d) = \mu(A)$ and $\gamma(\mathbb{R}^d \times A) = \nu(A)$ for all measurable sets $A$.
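In one dimension the optimal coupling is the monotone rearrangement, so $W_2$ can be estimated by matching sorted samples; for 1-D Gaussians there is also the closed form $W_2^2(\mathcal{N}(m_1, s_1^2), \mathcal{N}(m_2, s_2^2)) = (m_1 - m_2)^2 + (s_1 - s_2)^2$. A minimal sketch (an illustration added here, not part of the paper) checking the two against each other:

```python
import numpy as np

def w2_empirical_1d(x, y):
    """Empirical W2 between two equal-size 1-D samples.

    In 1-D the optimal coupling matches order statistics, so W2^2 is the
    mean squared gap between sorted samples.
    """
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 1.0, n)   # N(0, 1)
y = rng.normal(3.0, 2.0, n)   # N(3, 4)

est = w2_empirical_1d(x, y)
exact = np.sqrt((0.0 - 3.0) ** 2 + (1.0 - 2.0) ** 2)  # Gaussian closed form
```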
Given a vector field $(v_t)_{t \in [0, 1]}$ and a curve of probability measures $(\rho_t)_{t \in [0, 1]}$ on $\mathbb{R}^d$ with finite second-order moments, we say that $v$ generates $\rho$ if the continuity equation $\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0$, $t \in [0, 1]$, holds. The metric derivative of $\rho$ at time $t$ is defined as

$|\dot{\rho}_t| := \lim_{\delta \to 0} \frac{W_2(\rho_{t + \delta}, \rho_t)}{|\delta|},$

which can be interpreted as the “speed” of this curve. If $|\dot{\rho}_t|$ exists and is finite for all $t$, we say that $\rho$ is absolutely continuous (AC). The metric derivative and the continuity equation are closely related by the following important fact (see [2, Theorem 8.3.1]):
Lemma 2.
For an AC curve of probability measures $(\rho_t)_{t \in [0, 1]}$, any vector field $v$ that generates $\rho$ satisfies $\|v_t\|_{L^2(\rho_t)} \ge |\dot{\rho}_t|$ for all $t$. Moreover, the equality is attainable.
Finally, define the action of an AC curve of probability measures $(\rho_t)_{t \in [0, 1]}$ as $\mathcal{A}(\rho) := \int_0^1 |\dot{\rho}_t|^2 \,\mathrm{d}t$.
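As a quick numerical illustration (added here, not from the paper), the action of a curve of unit-variance 1-D Gaussians $\rho_t = \mathcal{N}(m(t), 1)$ can be approximated by a Riemann sum of squared W2 increments, since $W_2(\rho_s, \rho_t) = |m(s) - m(t)|$ for this family; with $m(t) = t^2$ the metric derivative is $|m'(t)| = 2t$ and the action is $\int_0^1 (2t)^2 \,\mathrm{d}t = 4/3$:

```python
import numpy as np

def action_gaussian_curve(mean_fn, n_grid=10_000):
    """Approximate the action of t -> N(mean_fn(t), 1) on [0, 1].

    For unit-variance 1-D Gaussians, W2 between curve points is the
    distance between means, so sum of W2(rho_{t+dt}, rho_t)^2 / dt
    approximates the integral of the squared metric derivative.
    """
    t = np.linspace(0.0, 1.0, n_grid + 1)
    m = mean_fn(t)
    dt = 1.0 / n_grid
    return np.sum(np.diff(m) ** 2) / dt

A = action_gaussian_curve(lambda t: t ** 2)  # ≈ ∫₀¹ (2t)² dt = 4/3
```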
2.4 Isoperimetric Inequalities
A probability measure $\pi$ on $\mathbb{R}^d$ satisfies a log-Sobolev inequality (LSI) with constant $\alpha > 0$, or $\alpha$-LSI, if for all smooth functions $f: \mathbb{R}^d \to \mathbb{R}$ with $\mathbb{E}_\pi[f^2] < \infty$,

$\mathbb{E}_\pi[f^2 \log f^2] - \mathbb{E}_\pi[f^2] \log \mathbb{E}_\pi[f^2] \le \frac{2}{\alpha} \mathbb{E}_\pi[\|\nabla f\|^2].$

It is worth noting that $\alpha$-strongly-log-concave distributions satisfy $\alpha$-LSI (Bakry-Émery theorem, [3]). When $\pi$ satisfies $\alpha$-LSI and the potential $V$ is $\beta$-smooth, it is established in [61] that the LD converges exponentially fast in KL divergence, while the LMC also converges exponentially with a bias that vanishes as the step size approaches $0$.
3 Problem Setup
In this paper, we consider the following mild assumption on the target distribution $\pi \propto e^{-V}$ on $\mathbb{R}^d$, which is widely used in the field of sampling (see, e.g., [20, 61, 4, 13]):
Assumption 1.
The potential $V$ is $\beta$-smooth, and there exists a global minimizer $x_\star$ of $V$ such that $\|x_\star\| \le R$. Moreover, $\pi$ has finite second-order moment.
Recall that the rationale behind annealing involves a gradual transition from a distribution that is easy to sample from to the more complex target distribution, which is crucial to our algorithm. Throughout this paper, we define the following curve of probability measures on $\mathbb{R}^d$:

$\pi_\theta \propto \exp\left(-\lambda(\theta) V - \frac{\eta(\theta)}{2} \|\cdot\|^2\right), \quad \theta \in [0, 1].$  (3)

We use the term annealing schedule to refer to the functions $\lambda$ and $\eta$. They need to be differentiable and monotone, satisfying the boundary conditions $\lambda(1) = 1$ and $\eta(1) = 0$. The values of $\lambda(0)$ and $\eta(0)$ will be determined later. This flexible interpolation scheme includes many general cases; for example, the schedules used in [6] and [26], as well as the one used in [45], are special cases. The key motivation for this interpolation is that when $\theta \approx 0$, the quadratic term predominates, making the potential of $\pi_\theta$ strongly-convex with a moderate condition number, thus $\pi_\theta$ is easy to sample from; on the other hand, when $\theta = 1$, $\pi_\theta$ is just the target distribution $\pi$. As $\theta$ gradually increases from $0$ to $1$, the readily sampleable distribution slowly transforms into the target distribution.
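Concretely, an annealed sampler at parameter $\theta$ uses the gradient of the interpolated potential $\lambda(\theta) V + \frac{\eta(\theta)}{2}\|\cdot\|^2$ as its drift. A minimal sketch, where the linear schedules `lam` and `eta`, the value `eta0 = 4.0`, and the double-well example potential are illustrative choices of ours rather than the paper's:

```python
import numpy as np

def lam(theta):
    """Illustrative schedule: weight on the target potential, 0 -> 1."""
    return theta

def eta(theta):
    """Illustrative schedule: weight on the quadratic term, eta0 -> 0."""
    eta0 = 4.0  # hypothetical value of eta(0)
    return eta0 * (1.0 - theta)

def grad_annealed_potential(grad_V, x, theta):
    """Gradient of V_theta(x) = lam(theta) V(x) + eta(theta) |x|^2 / 2."""
    return lam(theta) * grad_V(x) + eta(theta) * x

# Example: double-well potential V(x) = (x^2 - 1)^2 / 4, so grad_V = x^3 - x
grad_V = lambda x: x ** 3 - x
g = grad_annealed_potential(grad_V, np.array([0.5]), theta=0.25)
# lam = 0.25, eta = 3.0, grad_V(0.5) = -0.375, so g = 0.25*(-0.375) + 1.5
```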
In Lemma 5, we prove that $\pi_\theta$ also has finite second-order moment, so the W2 distance between the $\pi_\theta$'s makes sense. We further make a mild assumption on the absolute continuity of the curve:
Assumption 2.
The curve $(\pi_\theta)_{\theta \in [0, 1]}$ is AC with finite action $A := \int_0^1 |\dot{\pi}_\theta|^2 \,\mathrm{d}\theta$.
To facilitate easy sampling from $\pi_0$, we consider two choices of the parameters in Equation 3, which have the following complexity guarantees.
Lemma 3.
For the first choice of parameters, $\pi_0$ is Gaussian and can be sampled directly; for the second, under Assumption 1, it takes an explicit expected number of queries to the oracle of $V$ and $\nabla V$, depending on $\beta$ and $R$, to precisely sample from $\pi_0$ via rejection sampling.
See Appendix A for the rejection sampling algorithm as well as the full proof of this lemma. The parameter $R$ reflects prior knowledge about the global minimizer(s) of the potential function $V$. Unless it is exceptionally large, indicating scarce prior information about the global minimizer(s) of $V$, this complexity is negligible compared to the overall complexity of sampling. In particular, when the exact location of a global minimizer of $V$ is known, we can shift the potential so that $0$ becomes a global minimizer, thereby reducing the complexity to $O(1)$.
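For intuition, here is a generic rejection-sampling sketch (an illustration of the textbook scheme, not the paper's specific proposal in Appendix A): draw from a proposal whose density, scaled by $M$, dominates the unnormalized target, and accept with probability target$/(M \cdot$proposal$)$. The target $\propto e^{-x^2/2 - 0.1 x^4}$ and the Gaussian envelope are hypothetical choices:

```python
import numpy as np

def rejection_sample(log_target, log_proposal, sample_proposal, log_M, rng):
    """Draw one exact sample from the (unnormalized) target.

    Requires log_target(x) <= log_M + log_proposal(x) for all x.
    Returns the sample and the number of proposals used.
    """
    n_tries = 0
    while True:
        n_tries += 1
        x = sample_proposal(rng)
        # accept with probability target(x) / (M * proposal(x))
        if np.log(rng.uniform()) < log_target(x) - log_M - log_proposal(x):
            return x, n_tries

rng = np.random.default_rng(0)
log_target = lambda x: -0.5 * x**2 - 0.1 * x**4               # unnormalized
log_proposal = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)  # N(0, 1)
sample_proposal = lambda rng: rng.normal()
log_M = 0.5 * np.log(2 * np.pi)  # then acceptance log-prob = -0.1 x^4 <= 0

samples = np.array([rejection_sample(log_target, log_proposal,
                                     sample_proposal, log_M, rng)[0]
                    for _ in range(2000)])
```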
Equipped with this foundational setup, we now proceed to introduce the annealed LD and annealed LMC algorithms.
4 Analysis of Annealed Langevin Diffusion
To elucidate the concept of annealing more clearly, we first consider the annealed Langevin diffusion (ALD) algorithm, which samples from the target $\pi$ by running LD with a dynamically changing target distribution. For the purpose of this discussion, let us assume that we can exactly simulate any SDE with known drift and diffusion terms.

Fix an annealing schedule and choose a sufficiently long time $T > 0$. We define a reparametrized curve of probability measures $\mu_t := \pi_{t/T}$, $t \in [0, T]$. Starting with an initial sample $X_0 \sim \pi_0$, we run the SDE

$\mathrm{d}X_t = \nabla \log \mu_t(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t, \quad X_0 \sim \pi_0,$  (4)

and ultimately output $X_T$ as an approximate sample from the target distribution $\pi$. Intuitively, when $\mu_t$ changes slowly, the distribution of $X_t$ should closely resemble $\mu_t$, leading to an output distribution that approximates the target distribution. This turns out to be true, as confirmed by the following theorem, which provides a convergence guarantee for the ALD process.
Theorem 1.
When choosing $T = \frac{A}{4\varepsilon^2}$, it follows that $\mathrm{KL}(\pi \,\|\, \mathrm{law}(X_T)) \le \varepsilon^2$.
Proof.
Let $P$ be the path measure of ALD (Equation 4) initialized at $X_0 \sim \pi_0$, and define $Q$ as the path measure corresponding to the following reference SDE:

$\mathrm{d}Y_t = \left(\nabla \log \mu_t(Y_t) + v_t(Y_t)\right)\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t, \quad Y_0 \sim \pi_0.$  (5)

The vector field $v$ is designed such that $Y_t \sim \mu_t$ for all $t \in [0, T]$. According to the Fokker-Planck equation, $v$ must satisfy the PDE

$\partial_t \mu_t + \nabla \cdot (\mu_t v_t) = 0,$

which means that $v$ generates $(\mu_t)_{t \in [0, T]}$. We can compute $\mathrm{KL}(Q \,\|\, P)$ using Lemma 1:

$\mathrm{KL}(Q \,\|\, P) = \frac{1}{4} \int_0^T \mathbb{E}\left[\|v_t(Y_t)\|^2\right] \mathrm{d}t = \frac{1}{4} \int_0^T \|v_t\|_{L^2(\mu_t)}^2 \,\mathrm{d}t.$

Leveraging Lemma 2, among all vector fields that generate $(\mu_t)_{t \in [0, T]}$, we can choose the one that minimizes $\|v_t\|_{L^2(\mu_t)}$, thereby making $\|v_t\|_{L^2(\mu_t)} = |\dot{\mu}_t|$, the metric derivative. With the reparametrization $\mu_t = \pi_{t/T}$, we have the following relation by the chain rule:

$|\dot{\mu}_t| = \frac{1}{T} |\dot{\pi}_{t/T}|.$

Employing the change-of-variable formula $\theta = t/T$ leads to

$\mathrm{KL}(Q \,\|\, P) = \frac{1}{4} \int_0^T \frac{|\dot{\pi}_{t/T}|^2}{T^2} \,\mathrm{d}t = \frac{1}{4T} \int_0^1 |\dot{\pi}_\theta|^2 \,\mathrm{d}\theta = \frac{A}{4T}.$

Finally, using the data-processing inequality ([12, Theorem 1.5.3]), with $T = \frac{A}{4\varepsilon^2}$,

$\mathrm{KL}(\pi \,\|\, \mathrm{law}(X_T)) = \mathrm{KL}(q_T \,\|\, p_T) \le \mathrm{KL}(Q \,\|\, P) = \frac{A}{4T} = \varepsilon^2,$

where $p_T$ and $q_T$ stand for the marginal distributions of $P$ and $Q$ at time $T$, respectively. ∎
Although we adopt a specific parametrization of the interpolating distributions (Equation 3), which is widely used in practice, the results in Theorem 1 are applicable to any curve of probability measures that bridges $\pi_0$ and $\pi$. This applicability extends as long as the closed forms of the drift and diffusion terms in the SDE that precisely follows the path are known.
Let us delve deeper into the mechanics of the ALD. Although at time $t$ the SDE (Equation 4) targets the distribution $\mu_t$, the distribution of $X_t$ does not precisely match $\mu_t$. Nevertheless, by choosing a sufficiently long time $T$, we move along the curve sufficiently slowly, thus minimizing the discrepancy between the path measure of $(X_t)_{t \in [0, T]}$ and the reference path measure that follows the curve $(\mu_t)_{t \in [0, T]}$ exactly. By applying the data-processing inequality, we can upper bound the error between the marginal distributions at time $T$ by the KL divergence between the two path measures. In essence: moving more slowly, sampling more precisely.
Our analysis predominantly addresses the global error across the entire curve of probability measures, rather than focusing solely on the local error at time $T$. This approach is inspired by [18] (see also [12, Section 4.4]) and [9], and stands in contrast to the traditional isoperimetry-based analyses of LD (e.g., [61, 13, 4]), which emphasize the decay of the KL divergence from the distribution of $X_t$ to the target distribution and require the LSI to bound the time derivative of this quantity. Notably, the total time needed to run the SDE depends solely on the action of the curve $(\pi_\theta)_{\theta \in [0, 1]}$, obviating the need for assumptions related to log-concavity or isoperimetry.
We also remark that the ALD plays a significant role in another important subject known as non-equilibrium stochastic thermodynamics [52]. Recently, a refinement of the fluctuation theorem was discovered in [10, 24], stating that the irreversible entropy production in a stochastic thermodynamic system equals the ratio of an action integral, similar to the one in Theorem 1, to the duration of the process.
5 Analysis of Annealed Langevin Monte Carlo
It is crucial to recognize that the ALD (Equation 4) cannot be precisely simulated in practice. Transitioning from LD to LMC introduces additional errors due to discretization. This section will present a detailed convergence analysis for the annealed Langevin Monte Carlo (ALMC) algorithm, which is implementable in practical scenarios.
A straightforward yet non-optimal method to discretize Equation 4 involves employing the Euler-Maruyama scheme, i.e., freezing the drift $\nabla \log \mu_t$ at the beginning of each step. However, since the quadratic term in Equation 3 contributes a linear component to the drift whose integral can be computed in closed form, we can use the exponential-integrator scheme [65, 67] to further reduce the discretization error.
Given the total time $T$, we define a sequence of discretization points $0 = t_0 < t_1 < \cdots < t_N = T$, and set $h_k := t_{k+1} - t_k$. The exponential-integrator scheme (Equation 6) freezes the nonlinear part of the drift at the beginning of each step while integrating the linear part exactly. The explicit update rule (Equation 7) is detailed in Algorithm 1, and its derivation is presented in Section B.1. Notably, sending the step sizes to $0$ recovers the ALD (Equation 4), and freezing the annealing parameter reduces the scheme to the classical LMC iterations.
We illustrate the ALMC algorithm in Figure 1. The underlying intuition behind this algorithm is that by setting a sufficiently long total time $T$, the trajectory of the continuous dynamics (i.e., annealed LD) closely approaches the reference trajectory, as established in Theorem 1. Additionally, by adopting sufficiently small step sizes $h_k$, the discretization error can be substantially reduced. Unlike traditional annealed LMC methods, which require multiple LMC update steps for each intermediate distribution, our approach views the annealed LMC as a discretization of a continuous-time process. Consequently, it is adequate to perform a single step of LMC towards each intermediate distribution, provided that $T$ is sufficiently large and the step sizes are appropriately small.
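The one-LMC-step-per-intermediate-distribution idea can be sketched in a few lines. The sketch below uses plain Euler-Maruyama rather than the paper's exponential integrator (which would treat the linear part of the drift exactly), and the schedules, the value of $\eta(0)$, and the bimodal test potential are all illustrative assumptions:

```python
import numpy as np

def annealed_lmc(grad_V, d, T=10.0, N=2000, lam=lambda th: th,
                 eta=lambda th: 4.0 * (1.0 - th), rng=None):
    """Annealed LMC sketch: one Langevin step per intermediate distribution.

    Targets pi_theta ∝ exp(-(lam(theta) V + eta(theta) |x|^2 / 2)) with
    theta = t / T, discretized by Euler-Maruyama with uniform step h = T / N.
    """
    rng = rng or np.random.default_rng()
    h = T / N
    x = rng.normal(size=d) / np.sqrt(eta(0.0))  # pi_0 is Gaussian here
    for k in range(N):
        theta = k / N
        drift = -(lam(theta) * grad_V(x) + eta(theta) * x)
        x = x + h * drift + np.sqrt(2.0 * h) * rng.normal(size=d)
    return x

# Demo: bimodal 1-D target V(x) = (x^2 - 4)^2 / 4, with modes near ±2
grad_V = lambda x: x * (x ** 2 - 4.0)
samples = np.array([annealed_lmc(grad_V, d=1,
                                 rng=np.random.default_rng(i))[0]
                    for i in range(200)])
# By symmetry, runs should land near both modes, not collapse onto one.
```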
The subsequent theorem provides a convergence guarantee for the annealed LMC algorithm, with the complete proof detailed in Section B.2.
Theorem 2.
Under Assumptions 1 and 2, Algorithm 1 can generate a distribution $\hat{\mu}$ satisfying $\mathrm{KL}(\pi \,\|\, \hat{\mu}) \le \varepsilon^2$ within

$\widetilde{O}\left(\frac{d \beta^2 A^2}{\varepsilon^6}\right)$

calls to the oracle of $V$ and $\nabla V$ in expectation.
Sketch of Proof.
Let $P$ be the path measure of ALMC (Equation 6), whose marginal distribution at time $T$ is the output distribution $\hat{\mu}$. Again, let $Q$ denote the reference path measure of Equation 5 used in the proof of Theorem 1, in which the same vector field $v$ ensures that $Y_t \sim \mu_t$ under $Q$ for all $t \in [0, T]$.

Applying the Girsanov theorem (Lemma 1) and carefully dealing with the discretization error, we can upper bound $\mathrm{KL}(Q \,\|\, P)$ by the sum of two summations. The first summation is governed by the total time $T$, which pertains to the convergence of the continuous dynamics (i.e., ALD). Setting $T \asymp \frac{A}{\varepsilon^2}$ ensures that the first summation remains $O(\varepsilon^2)$, provided that the step sizes are sufficiently small. The second summation addresses the discretization error, and it suffices to determine the appropriate number of steps $N$ to minimize the total number of calls to the oracle of $\nabla V$ for discretizing the SDE. Combining this with the complexity of sampling from $\pi_0$ (Lemma 3) determines the overall complexity of the algorithm. ∎
Once again, our analysis relies on bounding the global error between two path measures via the Girsanov theorem. The annealing schedule functions influence the complexity exclusively through the action $A$. Identifying the optimal annealing schedule to minimize this action remains a significant challenge, and we propose this as an area for future research.
We also note that our Assumption 1 encompasses strongly-log-concave target distributions. For sampling from such well-conditioned distributions via LMC, the complexity required to achieve $\varepsilon$ accuracy in TV distance scales only polynomially with low-order exponents in $d$ and $1/\varepsilon$ [12]. However, using the Pinsker inequality $\mathrm{TV}(\mu, \nu) \le \sqrt{\frac{1}{2} \mathrm{KL}(\mu \,\|\, \nu)}$, our complexity to meet the same error criterion is considerably larger, indicating a significantly higher computational demand. Refining the parameter dependence in our algorithm to reduce this complexity remains a key area for future work.
Finally, we present a conjecture concerning the action , leaving its proof or disproof as an open question for future research. Following this conjecture, we demonstrate its applicability to a specific class of non-log-concave distributions in the subsequent example, with the detailed proof provided in Appendix C.
Conjecture 1.
Under Assumption 1, the action $A$ is bounded above by a polynomial function of the problem parameters, including $d$, $\beta$, $R$, the second-order moment of $\pi$, etc.
Example 1.
Consider a mixture of Gaussian distributions in $\mathbb{R}^d$ whose components share a common covariance, with mixing weights summing to $1$ and uniformly bounded means. Consequently, the potential $V$ is $\beta$-smooth for an explicit $\beta$ determined by the covariance. With a suitable annealing schedule, it follows that the action $A$ admits a bound polynomial in the problem parameters.
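As a concrete instance of this class (with illustrative parameters of our choosing, not the paper's), the potential of an equal-weight Gaussian mixture with shared isotropic covariance and its gradient can be written via a numerically stable log-sum-exp:

```python
import numpy as np

def make_gaussian_mixture(means, sigma2=1.0):
    """Potential V = -log(mean_i N(mu_i, sigma2 I)) for equal weights,
    plus its gradient, for a mixture with shared isotropic covariance."""
    means = np.asarray(means, dtype=float)  # shape (K, d)

    def V(x):
        # negative log-density up to an additive constant (stable log-sum-exp)
        sq = -np.sum((x - means) ** 2, axis=1) / (2 * sigma2)
        m = sq.max()
        return -(m + np.log(np.mean(np.exp(sq - m))))

    def grad_V(x):
        sq = -np.sum((x - means) ** 2, axis=1) / (2 * sigma2)
        w = np.exp(sq - sq.max())
        w /= w.sum()  # softmax responsibilities of the components
        # grad V(x) = (x - sum_i w_i(x) mu_i) / sigma2
        return (x - w @ means) / sigma2

    return V, grad_V

V, grad_V = make_gaussian_mixture([[-2.0, 0.0], [2.0, 0.0]])
g0 = grad_V(np.array([0.0, 0.0]))  # zero by symmetry: equal responsibilities
g1 = grad_V(np.array([0.5, 0.3]))  # equals (0.5 - 2*tanh(1), 0.3) here
```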
To demonstrate the superiority of our theoretical results, let us consider a simplified scenario in which the mixture components are well-separated and the remaining parameters are of constant order. We derive from Example 1 and the Pinsker inequality that the total complexity required to obtain a sample that is $\varepsilon$-accurate in TV is polynomial in the problem parameters. In contrast, studies such as [51, 8, 19] indicate that the LSI constant of such a mixture degrades exponentially with the separation between the modes, so the existing bounds in Table 1 suggest that LMC, to achieve the same accuracy, can only provide an exponential complexity guarantee.
6 Conclusions and Future Work
In this paper, we have explored the complexity of annealed LMC for sampling from a non-log-concave probability measure without relying on log-concavity or isoperimetry. Central to our analysis are the Girsanov theorem and optimal transport techniques, thus circumventing the need for isoperimetric inequalities. While our proof primarily focuses on the annealing scheme described in Equation 3, there is potential for adaptation to more general interpolations, and further exploration of these applications will be a focus of our future research. Technically, our proof methodology could be expanded to include a broader range of target distributions beyond those with smooth potentials, such as those with Hölder-continuous gradients [7, 21, 40, 47, 23, 22], or even heavy-tailed target distributions [44, 31]. Furthermore, eliminating the necessity of assuming knowledge about the global minimizers of the potential function would enhance the practical utility of our algorithm. Theoretically, proving Conjecture 1 or obtaining a polynomial bound on the action $A$ under less restrictive conditions would further validate that our complexity bounds are genuinely polynomial. Finally, while this work emphasizes the upper bounds for non-log-concave sampling, an intriguing future avenue would be to verify the tightness of these bounds and explore potential lower bounds for this problem [16, 15, 14].
References
- [1] Luigi Ambrosio, Elia Brué and Daniele Semola “Lectures on optimal transport” 130, UNITEXT Springer Cham, 2021 DOI: 10.1007/978-3-030-72162-6
- [2] Luigi Ambrosio, Nicola Gigli and Giuseppe Savaré “Gradient Flows: In Metric Spaces and in the Space of Probability Measures”, Lectures in Mathematics. ETH Zürich Birkhäuser Basel, 2008 DOI: 10.1007/978-3-7643-8722-8
- [3] Dominique Bakry, Ivan Gentil and Michel Ledoux “Analysis and geometry of Markov diffusion operators” 103, Grundlehren der mathematischen Wissenschaften Springer Cham, 2014 DOI: 10.1007/978-3-319-00227-9
- [4] Krishna Balasubramanian et al. “Towards a Theory of Non-Log-Concave Sampling: First-Order Stationarity Guarantees for Langevin Monte Carlo” In Proceedings of Thirty Fifth Conference on Learning Theory 178, Proceedings of Machine Learning Research PMLR, 2022, pp. 2896–2923 URL: https://proceedings.mlr.press/v178/balasubramanian22a.html
- [5] Steve Brooks, Andrew Gelman, Galin Jones and Xiao-Li Meng “Handbook of Markov chain Monte Carlo” CRC press, 2011
- [6] Nicolas Brosse, Alain Durmus and Éric Moulines “Normalizing constants of log-concave densities” In Electronic Journal of Statistics 12.1 Institute of Mathematical Statistics and Bernoulli Society, 2018, pp. 851–889 DOI: 10.1214/18-EJS1411
- [7] Niladri Chatterji, Jelena Diakonikolas, Michael I. Jordan and Peter Bartlett “Langevin Monte Carlo without smoothness” In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics 108, Proceedings of Machine Learning Research PMLR, 2020, pp. 1716–1726 URL: https://proceedings.mlr.press/v108/chatterji20a.html
- [8] Hong-Bin Chen, Sinho Chewi and Jonathan Niles-Weed “Dimension-free log-Sobolev inequalities for mixture distributions” In Journal of Functional Analysis 281.11, 2021, pp. 109236 DOI: https://doi.org/10.1016/j.jfa.2021.109236
- [9] Sitan Chen et al. “Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions” In The Eleventh International Conference on Learning Representations, 2023 URL: https://openreview.net/forum?id=zyLVMgsZ0U_
- [10] Yongxin Chen, Tryphon T. Georgiou and Allen Tannenbaum “Stochastic Control and Nonequilibrium Thermodynamics: Fundamental Limits” In IEEE Transactions on Automatic Control 65.7, 2020, pp. 2979–2991 DOI: 10.1109/TAC.2019.2939625
- [11] Xiang Cheng, Bohan Wang, Jingzhao Zhang and Yusong Zhu “Fast Conditional Mixing of MCMC Algorithms for Non-log-concave Distributions” In Advances in Neural Information Processing Systems 36 Curran Associates, Inc., 2023, pp. 13374–13394 URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/2b00b3331bd0f5fbfdd966ac06338f6d-Paper-Conference.pdf
- [12] Sinho Chewi “Log-Concave Sampling” Book draft, in preparation, 2024 URL: https://chewisinho.github.io
- [13] Sinho Chewi et al. “Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev” In Proceedings of Thirty Fifth Conference on Learning Theory 178, Proceedings of Machine Learning Research PMLR, 2022, pp. 1–2 URL: https://proceedings.mlr.press/v178/chewi22a.html
- [14] Sinho Chewi, Patrik Gerber, Holden Lee and Chen Lu “Fisher information lower bounds for sampling” In Proceedings of The 34th International Conference on Algorithmic Learning Theory 201, Proceedings of Machine Learning Research PMLR, 2023, pp. 375–410 URL: https://proceedings.mlr.press/v201/chewi23b.html
- [15] Sinho Chewi et al. “Query lower bounds for log-concave sampling” In 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), 2023, pp. 2139–2148 DOI: 10.1109/FOCS57990.2023.00131
- [16] Sinho Chewi et al. “The query complexity of sampling from strongly log-concave distributions in one dimension” In Proceedings of Thirty Fifth Conference on Learning Theory 178, Proceedings of Machine Learning Research PMLR, 2022, pp. 2041–2059 URL: https://proceedings.mlr.press/v178/chewi22b.html
- [17] John S. Dagpunar “Simulation and Monte Carlo: With Applications in Finance and MCMC” John Wiley & Sons, Ltd, 2007 DOI: 10.1002/9780470061336
- [18] Arnak Dalalyan and Alexandre B. Tsybakov “Sparse regression learning by aggregation and Langevin Monte-Carlo” JCSS Special Issue: Cloud Computing 2011 In Journal of Computer and System Sciences 78.5, 2012, pp. 1423–1443 DOI: https://doi.org/10.1016/j.jcss.2011.12.023
- [19] Jing Dong and Xin T. Tong “Spectral gap of replica exchange Langevin diffusion on mixture distributions” In Stochastic Processes and their Applications 151, 2022, pp. 451–489 DOI: https://doi.org/10.1016/j.spa.2022.06.006
- [20] Alain Durmus, Szymon Majewski and Błażej Miasojedow “Analysis of Langevin Monte Carlo via Convex Optimization” In Journal of Machine Learning Research 20.73, 2019, pp. 1–46 URL: http://jmlr.org/papers/v20/18-173.html
- [21] Murat A Erdogdu and Rasa Hosseinzadeh “On the Convergence of Langevin Monte Carlo: The Interplay between Tail Growth and Smoothness” In Proceedings of Thirty Fourth Conference on Learning Theory 134, Proceedings of Machine Learning Research PMLR, 2021, pp. 1776–1822 URL: https://proceedings.mlr.press/v134/erdogdu21a.html
- [22] Jiaojiao Fan, Bo Yuan and Yongxin Chen “Improved dimension dependence of a proximal algorithm for sampling” In Proceedings of Thirty Sixth Conference on Learning Theory 195, Proceedings of Machine Learning Research PMLR, 2023, pp. 1473–1521 URL: https://proceedings.mlr.press/v195/fan23a.html
- [23] Jiaojiao Fan, Bo Yuan, Jiaming Liang and Yongxin Chen “Nesterov Smoothing for Sampling Without Smoothness” In 2023 62nd IEEE Conference on Decision and Control (CDC), 2023, pp. 5313–5318 DOI: 10.1109/CDC49753.2023.10383529
- [24] Rui Fu, Amirhossein Taghvaei, Yongxin Chen and Tryphon T Georgiou “Maximal power output of a stochastic thermodynamic engine” In Automatica 123 Elsevier, 2021, pp. 109366
- [25] Guillaume Garrigos and Robert M Gower “Handbook of convergence theorems for (stochastic) gradient methods” In arXiv preprint arXiv:2301.11235, 2023
- [26] Rong Ge, Holden Lee and Jianfeng Lu “Estimating Normalizing Constants for Log-Concave Distributions: Algorithms and Lower Bounds” In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020 Chicago, IL, USA: Association for Computing Machinery, 2020, pp. 579–586 DOI: 10.1145/3357713.3384289
- [27] Rong Ge, Holden Lee and Andrej Risteski “Beyond Log-concavity: Provable Guarantees for Sampling Multi-modal Distributions using Simulated Tempering Langevin Monte Carlo” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018 URL: https://proceedings.neurips.cc/paper_files/paper/2018/file/c6ede20e6f597abf4b3f6bb30cee16c7-Paper.pdf
- [28] Rong Ge, Holden Lee and Andrej Risteski “Simulated tempering Langevin Monte Carlo II: An improved proof using soft Markov chain decomposition” In arXiv preprint arXiv:1812.00793, 2018
- [29] Saul Brian Gelfand and Sanjoy K Mitter “On sampling methods and annealing algorithms” In technical report, 1990
- [30] Andrew Gelman, John B. Carlin, Hal S. Stern and Donald B. Rubin “Bayesian data analysis” Chapman & Hall/CRC, 2013
- [31] Ye He, Krishnakumar Balasubramanian and Murat A. Erdogdu “An Analysis of Transformed Unadjusted Langevin Algorithm for Heavy-Tailed Sampling” In IEEE Transactions on Information Theory 70.1, 2024, pp. 571–593 DOI: 10.1109/TIT.2023.3318152
- [32] Ye He, Kevin Rojas and Molei Tao “Zeroth-Order Sampling Methods for Non-Log-Concave Distributions: Alleviating Metastability by Denoising Diffusion” In arXiv preprint arXiv:2402.17886, 2024
- [33] Jonathan Ho, Ajay Jain and Pieter Abbeel “Denoising diffusion probabilistic models” In Advances in Neural Information Processing Systems 33, 2020, pp. 6840–6851
- [34] Xunpeng Huang et al. “Faster Sampling without Isoperimetry via Diffusion-based Monte Carlo” In arXiv preprint arXiv:2401.06325, 2024
- [35] Xunpeng Huang et al. “Reverse Diffusion Monte Carlo” In The Twelfth International Conference on Learning Representations, 2024 URL: https://openreview.net/forum?id=kIPEyMSdFV
- [36] Ioannis Karatzas and Steven E. Shreve “Brownian Motion and Stochastic Calculus”, Graduate Texts in Mathematics Springer New York, NY, 1991 DOI: 10.1007/978-1-4612-0949-2
- [37] David Landau and Kurt Binder “A Guide to Monte Carlo Simulations in Statistical Physics” Cambridge University Press, 2021 DOI: 10.1017/9781108780346
- [38] Holden Lee and Zeyu Shen “Improved Bound for Mixing Time of Parallel Tempering” In arXiv preprint arXiv:2304.01303, 2023
- [39] Jiaming Liang and Yongxin Chen “A Proximal Algorithm for Sampling” In Transactions on Machine Learning Research, 2023 URL: https://openreview.net/forum?id=CkXOwlhf27
- [40] Jiaming Liang and Yongxin Chen “A Proximal Algorithm for Sampling from Non-Smooth Potentials” In 2022 Winter Simulation Conference (WSC), 2022, pp. 3229–3240 DOI: 10.1109/WSC57314.2022.10015293
- [41] Robert S. Liptser and Albert N. Shiryaev “Statistics of Random Processes”, Stochastic Modelling and Applied Probability Springer Berlin, Heidelberg, 2013 DOI: 10.1007/978-3-662-13043-8
- [42] Jun S. Liu “Monte Carlo strategies in scientific computing”, Springer Series in Statistics Springer New York, NY, 2004 DOI: 10.1007/978-0-387-76371-2
- [43] Enzo Marinari and Giorgio Parisi “Simulated Tempering: A New Monte Carlo Scheme” In Europhysics Letters 19.6, 1992, pp. 451 DOI: 10.1209/0295-5075/19/6/002
- [44] Alireza Mousavi-Hosseini et al. “Towards a Complete Analysis of Langevin Monte Carlo: Beyond Poincaré Inequality” In Proceedings of Thirty Sixth Conference on Learning Theory 195, Proceedings of Machine Learning Research PMLR, 2023, pp. 1–35 URL: https://proceedings.mlr.press/v195/mousavi-hosseini23a.html
- [45] Radford M. Neal “Annealed importance sampling” In Statistics and Computing 11.2, 2001, pp. 125–139 DOI: 10.1023/A:1008923215028
- [46] Mark E. J. Newman and Gerard T. Barkema “Monte Carlo Methods in Statistical Physics” Oxford University Press, 1999 DOI: 10.1093/oso/9780198517962.001.0001
- [47] Dao Nguyen, Xin Dang and Yixin Chen “Unadjusted Langevin Algorithm for Non-convex Weakly Smooth Potentials” In Communications in Mathematics and Statistics, 2023 DOI: 10.1007/s40304-023-00350-w
- [48] Art B. Owen “Monte Carlo theory, methods and examples”, 2013 URL: https://artowen.su.domains/mc/
- [49] Lorenz Richter and Julius Berner “Improved sampling via learned diffusions” In The Twelfth International Conference on Learning Representations, 2024 URL: https://openreview.net/forum?id=h4pNROsO06
- [50] Christian P. Robert and George Casella “Monte Carlo Statistical Methods”, Springer Texts in Statistics Springer New York, NY, 2004 DOI: 10.1007/978-1-4757-4145-2
- [51] André Schlichting “Poincaré and Log–Sobolev Inequalities for Mixtures” In Entropy 21.1, 2019 DOI: 10.3390/e21010089
- [52] Udo Seifert “Stochastic thermodynamics, fluctuation theorems and molecular machines” In Reports on progress in physics 75.12 IOP Publishing, 2012, pp. 126001
- [53] Jiaming Song, Chenlin Meng and Stefano Ermon “Denoising Diffusion Implicit Models” In International Conference on Learning Representations, 2021 URL: https://openreview.net/forum?id=St1giarCHLP
- [54] Yang Song and Stefano Ermon “Generative Modeling by Estimating Gradients of the Data Distribution” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019 URL: https://proceedings.neurips.cc/paper_files/paper/2019/file/3001ef257407d5a371a96dcd947c7d93-Paper.pdf
- [55] Yang Song and Stefano Ermon “Improved Techniques for Training Score-Based Generative Models” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 12438–12448 URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/92c3b916311a5517d9290576e3ea37ad-Paper.pdf
- [56] Yang Song et al. “Score-Based Generative Modeling through Stochastic Differential Equations” In International Conference on Learning Representations, 2021 URL: https://openreview.net/forum?id=PxTIG12RRHS
- [57] Robert H. Swendsen and Jian-Sheng Wang “Replica Monte Carlo Simulation of Spin-Glasses” In Phys. Rev. Lett. 57 American Physical Society, 1986, pp. 2607–2609 DOI: 10.1103/PhysRevLett.57.2607
- [58] Ali Süleyman Üstünel and Moshe Zakai “Transformation of Measure on Wiener Space”, Springer Monographs in Mathematics Springer Berlin, Heidelberg, 2013 DOI: 10.1007/978-3-662-13225-8
- [59] Francisco Vargas, Will Sussman Grathwohl and Arnaud Doucet “Denoising Diffusion Samplers” In The Eleventh International Conference on Learning Representations, 2023 URL: https://openreview.net/forum?id=8pvnfTAbu1f
- [60] Francisco Vargas, Shreyas Padhy, Denis Blessing and Nikolas Nüsken “Transport meets Variational Inference: Controlled Monte Carlo Diffusions” In The Twelfth International Conference on Learning Representations, 2024 URL: https://openreview.net/forum?id=PP1rudnxiW
- [61] Santosh S. Vempala and Andre Wibisono “Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019 URL: https://proceedings.neurips.cc/paper_files/paper/2019/file/65a99bb7a3115fdede20da98b08a370f-Paper.pdf
- [62] Cédric Villani “Optimal Transport: Old and New”, Grundlehren der mathematischen Wissenschaften Springer Berlin, Heidelberg, 2008 DOI: 10.1007/978-3-540-71050-9
- [63] Cédric Villani “Topics in optimal transportation” American Mathematical Society, 2021
- [64] Dawn B. Woodard, Scott C. Schmidler and Mark Huber “Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions” In The Annals of Applied Probability 19.2 Institute of Mathematical Statistics, 2009, pp. 617–640 DOI: 10.1214/08-AAP555
- [65] Qinsheng Zhang and Yongxin Chen “Fast Sampling of Diffusion Models with Exponential Integrator” In The Eleventh International Conference on Learning Representations, 2023 URL: https://openreview.net/forum?id=Loek7hfb46P
- [66] Qinsheng Zhang and Yongxin Chen “Path Integral Sampler: A Stochastic Control Approach For Sampling” In International Conference on Learning Representations, 2022 URL: https://openreview.net/forum?id=_uCb2ynRu7Y
- [67] Qinsheng Zhang, Molei Tao and Yongxin Chen “gDDIM: Generalized denoising diffusion implicit models” In The Eleventh International Conference on Learning Representations, 2023 URL: https://openreview.net/forum?id=1hKE9qjvz-
- [68] Nicolas Zilberstein, Ashutosh Sabharwal and Santiago Segarra “Solving Linear Inverse Problems Using Higher-Order Annealed Langevin Diffusion” In IEEE Transactions on Signal Processing 72, 2024, pp. 492–505 DOI: 10.1109/TSP.2023.3348202
- [69] Nicolas Zilberstein et al. “Annealed Langevin Dynamics for Massive MIMO Detection” In IEEE Transactions on Wireless Communications 22.6, 2023, pp. 3762–3776 DOI: 10.1109/TWC.2022.3221057
Appendix A Proof of Lemma 3
Our proof is inspired by [39].
We only consider the nontrivial case . The potential of is , which is -strongly-convex and -smooth. Note that for any fixed point ,
The right-hand side is the unnormalized density of . The rejection sampling algorithm is as follows: sample , and accept as a sample from with probability
By [12, Theorem 7.1.1], the number of queries to the oracle of until acceptance follows a geometric distribution with mean
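As an illustration of this step, the following is a minimal one-dimensional rejection sampler. The unnormalized target exp(-x²/2), the Gaussian proposal N(0, 2), and the bound M are hypothetical stand-ins for the measures in the proof; the trial count until acceptance is geometric with mean M/Z, matching [12, Theorem 7.1.1].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: unnormalized target p(x) = exp(-x^2/2) (i.e., N(0,1)
# up to normalization), Gaussian proposal q = N(0, SIGMA2).
SIGMA2 = 2.0

def log_target(x):
    """Log of the unnormalized target density."""
    return -0.5 * x**2

def log_proposal(x):
    """Log density of the proposal N(0, SIGMA2)."""
    return -0.5 * x**2 / SIGMA2 - 0.5 * np.log(2 * np.pi * SIGMA2)

# M = sup_x p(x)/q(x) = sqrt(4*pi), attained at x = 0, so p/(M*q) <= 1.
LOG_M = 0.5 * np.log(4 * np.pi)

def rejection_sample():
    """Draw one exact sample from the target; also return the number of
    proposals used, which is geometric with mean M/Z."""
    trials = 0
    while True:
        trials += 1
        y = rng.normal(0.0, np.sqrt(SIGMA2))
        # Accept with probability p(y) / (M * q(y)), a value in [0, 1].
        if np.log(rng.uniform()) < log_target(y) - log_proposal(y) - LOG_M:
            return y, trials
```

For this toy instance Z = sqrt(2π), so the expected number of proposals per accepted sample is M/Z = sqrt(2) ≈ 1.41.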
We choose such that . With this , is also as long as .
Let be the global minimizer of the strongly convex potential function , which satisfies
Given the smoothness of ,
Therefore, to guarantee , it suffices to find an such that .
We first derive an upper bound of under Assumption 1. Since is a global minimizer of ,
When , i.e., is a known global minimizer of both and , we can directly set ; otherwise, we need to run an optimization algorithm to find such an . By standard results in convex optimization (see, e.g., [25, Theorem 3.6]), running gradient descent on the function with step size yields the following convergence rate: starting from , the -th iterate satisfies
where the last inequality holds when . Thus, iterations suffice to find a desired . We summarize the rejection sampling algorithm in Algorithm 2. In conclusion, exact sampling from requires
calls to the oracle of and in expectation.
∎
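The gradient-descent warm start used in the proof can be sketched as follows. The quadratic objective is a hypothetical stand-in for the strongly convex, L-smooth potential; the step size 1/L matches the choice above.

```python
import numpy as np

def gradient_descent(grad, x0, step, n_iters):
    """Plain gradient descent: x_{k+1} = x_k - step * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step * grad(x)
    return x

# Hypothetical smooth, strongly convex quadratic f(x) = 0.5 * x^T A x,
# minimized at the origin; L is the largest eigenvalue of A.
A = np.diag([1.0, 4.0, 10.0])
L = 10.0
x = gradient_descent(lambda z: A @ z, x0=np.ones(3), step=1.0 / L, n_iters=200)
```

With step size 1/L each coordinate contracts geometrically at rate 1 - λᵢ/L, so after a few hundred iterations the iterate is within negligible distance of the minimizer, as the theory above requires.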
Appendix B Proofs for Annealed LMC
B.1 Proof of Equation 7
Define , whose derivative is . By Itô’s formula, we have
Integrating over (note that in this case ), we obtain
i.e.,
Since
by defining , we have . Similarly, we can show that
and is a zero-mean Gaussian random vector with covariance
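For concreteness, the discretized dynamics derived above reduce to Euler–Maruyama Langevin steps along an annealing schedule. The sketch below is a generic annealed LMC loop under illustrative assumptions (quadratic potentials V(x, β) = βx²/2 and a hand-picked schedule), not the schedule analyzed in Theorem 2.

```python
import numpy as np

def annealed_lmc(grad_V, schedule, n_inner, h, x0, rng):
    """Run n_inner discretized Langevin steps at each annealing level:
    x <- x - h * grad_V(x, beta) + sqrt(2*h) * standard Gaussian noise."""
    x = np.array(x0, dtype=float)
    for beta in schedule:
        for _ in range(n_inner):
            noise = rng.standard_normal(np.shape(x))
            x = x - h * grad_V(x, beta) + np.sqrt(2.0 * h) * noise
    return x

# Toy target N(0, 1): V(x, beta) = beta * x^2 / 2 interpolates from a
# flatter Gaussian (beta < 1) to the target (beta = 1).  x0 holds 4000
# independent chains, updated in parallel by vectorized NumPy operations.
rng = np.random.default_rng(1)
chains = annealed_lmc(lambda x, b: b * x,
                      schedule=[0.25, 0.5, 1.0],
                      n_inner=100, h=0.1,
                      x0=2.0 * rng.standard_normal(4000), rng=rng)
```

At the final level the chains are approximately N(0, 1) samples, up to the O(h) discretization bias of unadjusted Langevin on a quadratic potential.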
B.2 Proof of Theorem 2
We denote the path measure of ALMC (Equation 6) by . Then, , the marginal distribution of , serves as the output distribution . As in the proof of Theorem 1, we use to denote the reference path measure of Equation 5, in which the vector field generates the curve of probability distributions .
Using the data-processing inequality, it suffices to show that . By the Girsanov theorem (Lemma 1) and the triangle inequality, we have
The last inequality follows from the smoothness of and the monotonicity of . To bound , note that under , for , we have
Thanks to the fact that under ,
The second inequality follows from the Cauchy-Schwarz inequality, and the last inequality is due to the definition . Integrating over ,
Recall that the potential of is -smooth. By Lemma 4 and the monotonicity of and , we have
Therefore, is upper bounded by
For the remaining integral, given that generates , according to Lemma 2, we may choose such that . Thus,
through a change of variables analogous to that used in the proof of Theorem 1. Therefore,
Assume (which will be verified later), so we can further simplify the above expression to
To bound the above expression by , we first select , mirroring the total time required for running annealed LD as specified in Theorem 1. This guarantees that the continuous dynamics closely approximate the reference path measure. Given that , it remains only to ensure
which is equivalent to
(8)
To minimize , we apply the Cauchy-Schwarz inequality:
which establishes a lower bound for . The equality is achieved when for all . Thus, selecting
satisfies the constraint given in Equation 8, in which case the step size . Combining this with the complexity of sampling from completes the proof.
Appendix C Proof of Example 1
The smoothness of follows from [11, Lemma 4].
Note that , and define
We use a coupling argument to upper bound . We first sample with distribution , , and then independently sample . Then,
By the definition of the W2 distance, we have
This implies
By time reparameterization , . With , . Therefore,
It follows from the proof that as long as is bounded by a polynomial function of , and , so is the action .
Appendix D Supplementary Lemmas
Lemma 4 ([12, Lemma 4.E.1]).
Consider a probability measure on . If for some , then .
Lemma 5.
Under Assumption 1, defined in Equation 3 has finite second-order moment when .
Proof.
When , , and the claim is straightforward. Otherwise, by convexity of , we have
Multiplying both sides by and integrating over , we see that . ∎