
Provable Benefit of Annealed Langevin Monte Carlo for Non-log-concave Sampling

Wei Guo, Molei Tao, Yongxin Chen
Georgia Institute of Technology
{wei.guo,mtao,yongchen}@gatech.edu
Abstract

We address the outstanding problem of sampling from an unnormalized density that may be non-log-concave and multimodal. To enhance the performance of simple Markov chain Monte Carlo (MCMC) methods, techniques of annealing type have been widely used. However, quantitative theoretical guarantees of these techniques are under-explored. This study takes a first step toward providing a non-asymptotic analysis of annealed MCMC. Specifically, we establish, for the first time, an oracle complexity of $\widetilde{O}\left(\frac{d\beta^{2}\mathcal{A}^{2}}{\varepsilon^{6}}\right)$ for the simple annealed Langevin Monte Carlo algorithm to achieve $\varepsilon^{2}$ accuracy in Kullback-Leibler divergence to the target distribution $\pi\propto\mathrm{e}^{-V}$ on $\mathbb{R}^{d}$ with $\beta$-smooth potential $V$. Here, $\mathcal{A}$ represents the action of a curve of probability measures interpolating the target distribution $\pi$ and a readily sampleable distribution.

1 Introduction

We study the task of efficient sampling from a probability distribution $\pi\propto\mathrm{e}^{-V}$ on $\mathbb{R}^{d}$. This fundamental problem is pivotal across various fields including computational statistics [42, 50, 5, 48], Bayesian inference [30], statistical physics [46, 37], and finance [17], and has been extensively studied in the literature [12]. The most common approach to this problem is Markov chain Monte Carlo (MCMC), among which Langevin Monte Carlo (LMC) [20, 61, 13, 44] is a particularly popular choice. LMC can be understood as a time-discretization of a diffusion process, known as Langevin diffusion (LD), whose stationary distribution is the target distribution $\pi$, and has been attractive partly due to its robust performance despite conceptual simplicity.

Although LMC and its variants converge rapidly when the target distribution $\pi$ is strongly log-concave or satisfies isoperimetric inequalities such as the log-Sobolev inequality (LSI) [20, 61, 13], its effectiveness diminishes when dealing with target distributions that are not strongly log-concave or are multimodal, such as mixtures of Gaussians. In such scenarios, the sampler often becomes confined to a single mode, severely limiting its ability to explore the entire distribution effectively. This results in significant challenges in transitioning between modes, which can dramatically increase the mixing time, making it exponential in the dimension $d$. Such limitations highlight the need for enhanced MCMC methodologies that can efficiently navigate the complex landscapes of multimodal distributions, thereby improving convergence rates and overall sampling efficiency.

To address the challenges posed by multimodality, techniques around the notion of annealing have been widely employed [29, 45]. The general philosophy involves constructing a sequence of intermediate distributions $\pi_{0},\pi_{1},\ldots,\pi_{M}$ that bridge the gap between an easily sampleable distribution $\pi_{0}$ (e.g., Gaussian or Dirac-like) and the target distribution $\pi_{M}=\pi$. The process starts with sampling from $\pi_{0}$ and progressively samples from each subsequent distribution until $\pi_{M}$ is reached. When $\pi_{i}$ and $\pi_{i+1}$ are close enough, approximate samples from $\pi_{i}$ can serve as a warm start for sampling from $\pi_{i+1}$, thereby facilitating this transition. Employing LMC within this framework gives rise to what is known as the annealed LMC algorithm, which is the focus of our study. Despite its empirical success [54, 55, 69, 68], a thorough theoretical understanding of annealed LMC, particularly its non-asymptotic complexity bounds, remains elusive.

In this work, we take a first step toward providing a non-asymptotic analysis of annealed MCMC. By leveraging the Girsanov theorem to quantify the difference between the sampling dynamics and a reference process, we establish an upper bound on the error of annealed MCMC as the sum of two terms. The first term equals the ratio of an action integral, taken in the Wasserstein geometry induced by optimal transport, to the duration of the process, and it diminishes when the annealing is sufficiently slow. The second term quantifies the discretization error incurred in implementation. Our approach challenges the traditional point of view that annealed MCMC is a cascade of warm starts; analysis based on that viewpoint still requires log-concavity or isoperimetry. Instead, our theoretical analysis of annealed Langevin-based MCMC advances the field by eliminating the reliance on assumptions related to log-concavity or isoperimetry. Our strategy represents a paradigm shift in analyzing annealed MCMC. Although a fully polynomial complexity bound has not yet been achieved, we hope our strategy can inspire future work toward this ambitious goal.

Our key technical contributions are summarized as follows.

  1. We discover a novel strategy for analyzing the non-asymptotic complexity of annealed MCMC algorithms that circumvents the need for log-concavity or isoperimetry.

  2. In Section 4, we investigate the annealed LD, which runs LD with a dynamically changing target distribution, and derive a surprising bound on the time for which the SDE must be simulated to achieve $\varepsilon^{2}$-accuracy in KL divergence.

  3. Building on the insights from the analysis of the continuous dynamics and accounting for discretization errors, we establish a non-asymptotic oracle complexity bound for annealed LMC in Section 5, which is applicable to a broad range of annealing schemes.

The quantitative results are summarized and compared to other sampling algorithms in Table 1. Notably, our approach operates under the least stringent assumptions and exhibits the most favorable $\varepsilon$-dependence among all isoperimetry-free sampling methods.

Table 1: Comparison of oracle complexities in terms of $d$, $\varepsilon$, and the LSI constant for sampling from $\pi\propto\mathrm{e}^{-V}$. The notation $\operatorname{poly}(\cdot)$ denotes a polynomial function of the indicated parameters.

| Algorithm | Isoperimetric Assumptions | Other Assumptions | Criterion | Complexity |
|---|---|---|---|---|
| LMC [61] | $C$-LSI | Potential smooth | $\varepsilon^{2}$, $\operatorname{KL}(\cdot\Vert\pi)$ | $\widetilde{O}(C^{2}d\varepsilon^{-2})$ |
| PS [22] | $C$-LSI | Potential smooth | $\varepsilon$, TV | $\widetilde{O}(Cd^{1/2}\log\varepsilon^{-1})$ |
| STLMC [28] | / | Translated mixture of a well-conditioned distribution | $\varepsilon$, TV | $O(\operatorname{poly}(d,\varepsilon^{-1}))$ |
| RDMC [35] | / | Potential smooth, nearly convex at $\infty$ | $\varepsilon$, TV | $O(\operatorname{poly}(d)\,\mathrm{e}^{\operatorname{poly}(\varepsilon^{-1})})$ |
| ZOD-MC [32] | / | Potential growing at most quadratically | $\varepsilon$, TV+W2 | $\mathrm{e}^{\widetilde{O}(d)}O(\log\varepsilon^{-1})$ |
| ALMC (ours) | / | Potential smooth | $\varepsilon^{2}$, $\operatorname{KL}(\pi\Vert\cdot)$ | $\widetilde{O}(d\mathcal{A}(d)^{2}\varepsilon^{-6})$ |

Related works and comparison. We provide a brief overview of the literature, mainly focusing on the algorithms for non-log-concave sampling and their theoretical analysis.

  • Samplers based on tempering. The fundamental concept of tempering involves sampling the system at various temperatures simultaneously: at higher temperatures, the distribution flattens, allowing particles to easily transition between modes, while at lower temperatures, particles can more effectively explore local structures. In simulated tempering [43, 64], the system’s temperature is randomly switched, while in parallel tempering (also known as replica exchange) [57, 38], the temperatures of two particles are swapped according to a specific rule. However, quantitative theoretical results for tempering are limited, and the existing results (e.g., [27, 28, 19]) apply primarily to certain special classes of non-log-concave distributions.

  • Samplers based on general diffusions. Inspired by score-based diffusion models [33, 53, 56], recent advances have introduced sampling methods that reverse the Ornstein-Uhlenbeck process, as detailed in [35, 34, 32]. These samplers exhibit reduced sensitivity to isoperimetric conditions, but rely on estimating score functions (gradients of log-density) via importance sampling, which poses significant challenges in high-dimensional settings. Concurrently, studies such as [66, 59, 60, 49] have employed neural networks to approximate unknown drift terms, enabling an SDE to transport a readily sampleable distribution to the target distribution. This approach has shown excellent performance in handling complex distributions, albeit at the expense of significant computational resources required for neural network training. In contrast, annealed LMC runs on a known interpolation of probability distributions, thus simplifying sampling by obviating the need for intensive score estimation or neural network training.

  • Non-asymptotic analysis for non-log-concave sampling. Drawing upon the stationary-point analysis in non-convex optimization, the seminal work [4] characterizes the convergence of non-log-concave sampling via Fisher divergence. Subsequently, [11] applies this methodology to examine the local mixing of LMC. However, Fisher divergence is a relatively weak criterion compared to more commonly employed metrics such as the total-variation distance or Wasserstein distances. In contrast, our study provides a convergence guarantee in terms of KL divergence, which implies convergence in total-variation distance and thus offers a stronger result.

Notations and definitions. For $a,b\in\mathbb{R}$, we define $\left[\!\left[a,b\right]\!\right]:=[a,b]\cap\mathbb{Z}$, $a\wedge b:=\min(a,b)$, and $a\vee b:=\max(a,b)$. For $a,b>0$, the notations $a\lesssim b$, $a=O(b)$, and $b=\Omega(a)$ indicate that $a\leq Cb$ for some universal constant $C>0$, and the notations $a\asymp b$ and $a=\Theta(b)$ stand for both $a=O(b)$ and $b=O(a)$. $\widetilde{O}(\cdot)$ hides logarithmic dependence in $O(\cdot)$. A function $U\in C^{2}(\mathbb{R}^{d})$ is $\alpha(>0)$-strongly-convex if $\nabla^{2}U\succeq\alpha I$, and is $\beta(>0)$-smooth if $-\beta I\preceq\nabla^{2}U\preceq\beta I$. The total-variation (TV) distance is defined as $\operatorname{TV}(\mu,\nu)=\sup_{A\subset\mathbb{R}^{d}}\left|\mu(A)-\nu(A)\right|$, and the Kullback-Leibler (KL) divergence is defined as $\operatorname{KL}\left(\mu\middle\|\nu\right)=\operatorname{\mathbb{E}}_{\mu}\left[\log\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\right]$. $\left\|\cdot\right\|$ represents the $\ell^{2}$ norm on $\mathbb{R}^{d}$. For $f:\mathbb{R}^{d}\to\mathbb{R}^{d'}$ and a probability measure $\mu$ on $\mathbb{R}^{d}$, $\left\|f\right\|_{L^{2}(\mu)}:=\left(\int\left\|f\right\|^{2}\mathrm{d}\mu\right)^{1/2}$, and the second-order moment of $\mu$ is defined as $\operatorname{\mathbb{E}}_{\mu}\left[\left\|\cdot\right\|^{2}\right]$.

2 Preliminaries

2.1 Stochastic Differential Equations and Girsanov Theorem

A stochastic differential equation (SDE) $X=(X_{t})_{t\in[0,T]}$ is a stochastic process on $\Omega=C([0,T];\mathbb{R}^{d})$, the space of continuous functions from $[0,T]$ to $\mathbb{R}^{d}$. The dynamics of $X$ are typically represented by the equation $\mathrm{d}X_{t}=b(X_{t},t)\,\mathrm{d}t+\sigma(X_{t},t)\,\mathrm{d}B_{t}$, $t\in[0,T]$, where $(B_{t})_{t\in[0,T]}$ is a standard Brownian motion in $\mathbb{R}^{d}$. The path measure of $X$, denoted $\mathbb{P}^{X}$, characterizes the distribution of $X$ over $\Omega$ and is defined by $\mathbb{P}^{X}(A)=\operatorname{\mathbb{P}}\left(X\in A\right)$ for all measurable subsets $A$ of $\Omega$. The following lemma, a corollary of the Girsanov theorem [36, 58, 41], provides a methodology for computing the KL divergence between two path measures and serves as a crucial technical tool in our proof.

Lemma 1.

Assume we have the following two SDEs on $\Omega$:

\mathrm{d}X_{t}=a_{t}(X_{t})\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}B_{t},~X_{0}\sim\mu;\qquad\mathrm{d}Y_{t}=b_{t}(Y_{t})\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}B_{t},~Y_{0}\sim\nu.

Let $\mathbb{P}^{X}$ and $\mathbb{P}^{Y}$ denote the path measures of $X$ and $Y$, respectively. Then

\operatorname{KL}\left(\mathbb{P}^{X}\middle\|\mathbb{P}^{Y}\right)=\operatorname{KL}\left(\mu\middle\|\nu\right)+\frac{1}{4}\operatorname{\mathbb{E}}_{X\sim\mathbb{P}^{X}}\left[\int_{0}^{T}\left\|a_{t}(X_{t})-b_{t}(X_{t})\right\|^{2}\mathrm{d}t\right].

2.2 Langevin Diffusion and Langevin Monte Carlo

The Langevin diffusion (LD) with target distribution $\pi\propto\mathrm{e}^{-V}$ is the solution to the SDE

\mathrm{d}X_{t}=-\nabla V(X_{t})\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}B_{t},~t\in[0,\infty);~X_{0}\sim\mu_{0}. \qquad (1)

It is well-known that under mild conditions, $\pi$ is the unique stationary distribution of this SDE, and when $\pi$ has good regularity properties, the marginal distribution of $X_{t}$ converges to $\pi$ as $t\to+\infty$, so we can sample from $\pi$ by simulating Equation 1 for a long time. However, in most cases, LD is intractable to simulate exactly, and the Euler-Maruyama discretization of Equation 1 leads to the Langevin Monte Carlo (LMC) algorithm. LMC with step size $h>0$ and target distribution $\pi\propto\mathrm{e}^{-V}$ is a Markov chain $\{X_{kh}\}_{k=0,1,\ldots}$ constructed by iterating the following update rule:

X_{(k+1)h}=X_{kh}-h\nabla V(X_{kh})+\sqrt{2}(B_{(k+1)h}-B_{kh}),~k=0,1,\ldots;~X_{0}\sim\mu_{0}, \qquad (2)

where $\{B_{(k+1)h}-B_{kh}\}_{k=0,1,\ldots}\stackrel{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,hI)$.
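As a point of reference, the following is a minimal NumPy sketch of the LMC update in Equation 2; the target, step size, and iteration counts below are illustrative choices rather than recommendations from the paper.

```python
import numpy as np

def lmc(grad_V, x0, h, n_steps, rng):
    """Langevin Monte Carlo (Equation 2): Euler-Maruyama discretization of LD."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        xi = rng.standard_normal(x.shape)            # standard Gaussian
        # sqrt(2h) * xi has the law of sqrt(2) * (B_{(k+1)h} - B_{kh})
        x = x - h * grad_V(x) + np.sqrt(2.0 * h) * xi
    return x

# Toy usage on a standard Gaussian target, V(x) = ||x||^2 / 2
rng = np.random.default_rng(0)
samples = np.array([lmc(lambda x: x, np.zeros(3), h=0.01, n_steps=2000, rng=rng)
                    for _ in range(200)])
print(samples.var(axis=0))   # roughly 1 in each coordinate at stationarity
```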

2.3 Wasserstein Distance and Curves of Probability Measures

We briefly introduce several fundamental concepts in optimal transport, and direct readers to authoritative textbooks [62, 63, 2, 1] for an in-depth exploration.

For two probability measures $\mu,\nu$ on $\mathbb{R}^{d}$ with finite second-order moments, the Wasserstein-2 (W2) distance between $\mu$ and $\nu$ is defined as

W_{2}(\mu,\nu)=\inf_{\gamma\in\Pi(\mu,\nu)}\left(\int\left\|x-y\right\|^{2}\gamma(\mathrm{d}x,\mathrm{d}y)\right)^{\frac{1}{2}},

where $\Pi(\mu,\nu)$ is the set of all couplings of $(\mu,\nu)$, i.e., probability measures $\gamma$ on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ with $\gamma(A\times\mathbb{R}^{d})=\mu(A)$ and $\gamma(\mathbb{R}^{d}\times A)=\nu(A)$ for all measurable sets $A\subset\mathbb{R}^{d}$.

Given a vector field $v=(v_{t}:\mathbb{R}^{d}\to\mathbb{R}^{d})_{t\in[a,b]}$ and a curve of probability measures $\rho=(\rho_{t})_{t\in[a,b]}$ on $\mathbb{R}^{d}$ with finite second-order moments, we say that $v$ generates $\rho$ if the continuity equation $\partial_{t}\rho_{t}+\nabla\cdot(\rho_{t}v_{t})=0$, $t\in[a,b]$, holds. The metric derivative of $\rho$ at $t\in[a,b]$ is defined as

\left|\dot{\rho}\right|_{t}:=\lim_{\delta\to 0}\frac{W_{2}(\rho_{t+\delta},\rho_{t})}{\left|\delta\right|},

which can be interpreted as the “speed” of this curve. If $\left|\dot{\rho}\right|_{t}$ exists and is finite for all $t\in[a,b]$, we say that $\rho$ is absolutely continuous (AC). The metric derivative and the continuity equation are closely related by the following important fact (see [2, Theorem 8.3.1]):

Lemma 2.

For an AC curve of probability measures $(\rho_{t})_{t\in[a,b]}$, any vector field $(v_{t})_{t\in[a,b]}$ that generates $(\rho_{t})_{t\in[a,b]}$ satisfies $\left|\dot{\rho}\right|_{t}\leq\left\|v_{t}\right\|_{L^{2}(\rho_{t})}$ for all $t\in[a,b]$. Moreover, equality is attainable.

Finally, define the action of an AC curve of probability measures $(\rho_{t})_{t\in[a,b]}$ as $\mathcal{A}(\rho):=\int_{a}^{b}\left|\dot{\rho}\right|_{t}^{2}\mathrm{d}t$.
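As a simple illustration of these notions (a standard fact about Gaussians, not a result of this paper): for the curve $\rho_{\theta}=\mathcal{N}(\theta y,I)$, $\theta\in[0,1]$, with a fixed $y\in\mathbb{R}^{d}$, one has $W_{2}(\rho_{\theta},\rho_{\theta'})=|\theta-\theta'|\left\|y\right\|$, hence

\left|\dot{\rho}\right|_{\theta}=\left\|y\right\|\quad\text{and}\quad\mathcal{A}(\rho)=\int_{0}^{1}\left|\dot{\rho}\right|_{\theta}^{2}\,\mathrm{d}\theta=\left\|y\right\|^{2},

and the constant vector field $v_{\theta}(x)\equiv y$ generates this curve and attains equality in Lemma 2.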

2.4 Isoperimetric Inequalities

A probability measure $\pi$ on $\mathbb{R}^{d}$ satisfies a log-Sobolev inequality (LSI) with constant $C$, or $C$-LSI, if for all $f\in C^{1}(\mathbb{R}^{d})$ with $\operatorname{\mathbb{E}}_{\pi}\left[f^{2}\right]>0$,

\operatorname{\mathbb{E}}_{\pi}\left[f^{2}\log\frac{f^{2}}{\operatorname{\mathbb{E}}_{\pi}\left[f^{2}\right]}\right]\leq 2C\operatorname{\mathbb{E}}_{\pi}\left[\left\|\nabla f\right\|^{2}\right].

It is worth noting that $\alpha$-strongly-log-concave distributions satisfy $\frac{1}{\alpha}$-LSI (Bakry-Émery theorem, [3]). When $\pi\propto\mathrm{e}^{-V}$ satisfies $C$-LSI and the potential $V$ is $\beta$-smooth, it is established in [61] that the LD converges exponentially fast in KL divergence, while the LMC also converges exponentially, with a bias that vanishes as the step size approaches $0$.

3 Problem Setup

In this paper, we consider the following mild assumption on the target distribution $\pi\propto\mathrm{e}^{-V}$ on $\mathbb{R}^{d}$, which is widely used in the field of sampling (see, e.g., [20, 61, 4, 13]):

Assumption 1.

The potential $V$ is $\beta$-smooth, and there exists a global minimizer $x_{*}$ of $V$ such that $\left\|x_{*}\right\|\leq R$. Moreover, $\pi$ has finite second-order moment.

Recall that the rationale behind annealing involves a gradual transition from a distribution that is easy to sample from to the more complex target distribution, which is crucial to our algorithm. Throughout this paper, we define the following curve of probability measures on $\mathbb{R}^{d}$:

\pi_{\theta}\propto\exp\left(-\eta(\theta)V-\frac{\lambda(\theta)}{2}\left\|\cdot\right\|^{2}\right),~\theta\in[0,1]. \qquad (3)

We use the term annealing schedule to refer to the functions $\eta(\cdot)$ and $\lambda(\cdot)$. They need to be differentiable and monotone, satisfying the boundary conditions $\eta_{0}=\eta(0)\nearrow\eta(1)=1$ and $\lambda_{0}=\lambda(0)\searrow\lambda(1)=0$. The values of $\eta_{0}\in[0,1]$ and $\lambda_{0}\in(1,+\infty)$ will be determined later. This flexible interpolation scheme covers many cases of interest: for example, [6] and [26] used the schedule $\eta(\cdot)\equiv 1$, while [45] used the schedule $\lambda(\theta)=c(1-\eta(\theta))$. The key motivation for this interpolation is that when $\theta=0$, the quadratic term predominates, making the potential of $\pi_{0}$ strongly convex with a moderate condition number, so that $\pi_{0}$ is easy to sample from; on the other hand, when $\theta=1$, $\pi_{1}$ is just the target distribution $\pi$. As $\theta$ gradually increases from $0$ to $1$, the readily sampleable distribution $\pi_{0}$ slowly transforms into the target distribution $\pi_{1}$.
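For concreteness, the score of the interpolation in Equation 3 is available in closed form whenever $\nabla V$ is; the snippet below is a minimal sketch, and the particular schedule and toy potential in it are illustrative assumptions rather than choices prescribed by the paper.

```python
import numpy as np

def annealed_score(x, theta, grad_V, eta, lam):
    """grad log pi_theta(x) = -eta(theta) grad V(x) - lam(theta) x, cf. Equation 3."""
    return -eta(theta) * grad_V(x) - lam(theta) * x

# One illustrative schedule satisfying the boundary conditions:
# eta increases from eta_0 to 1, lam decreases from lam_0 to 0.
eta_0, lam_0 = 0.0, 10.0
eta = lambda th: eta_0 + (1.0 - eta_0) * th
lam = lambda th: lam_0 * (1.0 - th)
grad_V = lambda x: x ** 3 - x        # toy double-well potential, illustrative only
print(annealed_score(np.array([1.5, -0.5]), 0.3, grad_V, eta, lam))
```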

In Lemma 5, we prove that $\pi_{\theta}$ also has finite second-order moment, so the W2 distance between the $\pi_{\theta}$'s makes sense. We further make a mild assumption on the absolute continuity of the curve:

Assumption 2.

The curve $(\pi_{\theta})_{\theta\in[0,1]}$ is AC with finite action $\mathcal{A}_{\eta,\lambda}(\pi):=\int_{0}^{1}\left|\dot{\pi}\right|_{\theta}^{2}\mathrm{d}\theta$.

To facilitate easy sampling from $\pi_{0}$, we consider two choices of the parameters $\eta_{0}$ and $\lambda_{0}$, which have the following complexity guarantees.

Lemma 3.

When $\eta_{0}=0$, $\pi_{0}=\mathcal{N}\left(0,\frac{1}{\lambda_{0}}I\right)$ can be sampled directly; when $\eta_{0}\in(0,1]$, we choose $\lambda_{0}=\eta_{0}d\beta$, so that under Assumption 1, it takes $O\left(1\vee\log\frac{\eta_{0}\beta R^{2}}{d^{2}}\right)=\widetilde{O}(1)$ queries to the oracle of $V$ and $\nabla V$ in expectation to precisely sample from $\pi_{0}$ via rejection sampling.

See Appendix A for the rejection sampling algorithm as well as the full proof of this lemma. The parameter $R$ reflects prior knowledge about the global minimizer(s) of the potential function $V$. Unless it is exceptionally large, indicating scarce prior information about the global minimizer(s) of $V$, this $\widetilde{O}(1)$ complexity is negligible compared to the overall complexity of sampling. In particular, when the exact location of a global minimizer of $V$ is known, we can shift the potential $V$ so that $0$ becomes a global minimizer, thereby reducing the complexity to $O(1)$.
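To make this concrete, here is a minimal sketch of the rejection sampling step described in Appendix A. The warm point `x_prime` (a point close to the minimizer of $V_{0}$) is assumed to be given, e.g., by a few gradient descent steps on $V_{0}$, and the potential in the usage example is an illustrative assumption.

```python
import numpy as np

def sample_pi0_rejection(V, grad_V, eta0, beta, d, x_prime, rng, max_iters=100000):
    """Rejection sampling from pi_0 proportional to exp(-V_0), where
    V_0 = eta0*V + (lam0/2)||x||^2, lam0 = eta0*d*beta, using a Gaussian
    proposal built around x_prime (see Appendix A)."""
    lam0 = eta0 * d * beta
    alpha = lam0 - eta0 * beta                 # strong-convexity constant of V_0
    V0 = lambda x: eta0 * V(x) + 0.5 * lam0 * np.sum(x ** 2)
    grad_V0 = lambda x: eta0 * grad_V(x) + lam0 * x

    g = grad_V0(x_prime)
    mean = x_prime - g / alpha                 # mean of the Gaussian proposal pi_0'
    for _ in range(max_iters):
        X = mean + rng.standard_normal(d) / np.sqrt(alpha)
        # log acceptance probability; <= 0 by strong convexity of V_0
        log_acc = (-V0(X) + V0(x_prime) + g @ (X - x_prime)
                   + 0.5 * alpha * np.sum((X - x_prime) ** 2))
        if np.log(rng.uniform()) < log_acc:
            return X
    raise RuntimeError("no acceptance within max_iters")

# Illustrative usage with a 1-smooth potential V(x) = sum(log cosh(x_i)), minimized at 0
rng = np.random.default_rng(0)
d, beta = 2, 1.0
V = lambda x: np.sum(np.log(np.cosh(x)))
grad_V = np.tanh
x0 = sample_pi0_rejection(V, grad_V, eta0=1.0, beta=beta, d=d,
                          x_prime=np.zeros(d), rng=rng)
print(x0)
```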

Equipped with this foundational setup, we now proceed to introduce the annealed LD and annealed LMC algorithms.

4 Analysis of Annealed Langevin Diffusion

To elucidate the concept of annealing more clearly, we first consider the annealed Langevin diffusion (ALD) algorithm, which samples from $\pi\propto\mathrm{e}^{-V}$ by running LD with a dynamically changing target distribution. For the purpose of this discussion, let us assume that we can exactly simulate any SDE with known drift and diffusion terms.

Fix an annealing schedule $\eta(\cdot),\lambda(\cdot)$ and choose a sufficiently long time $T$. We define a reparametrized curve of probability measures $\left(\widetilde{\pi}_{t}:=\pi_{t/T}\right)_{t\in[0,T]}$. Starting with an initial sample $X_{0}\sim\pi_{0}=\widetilde{\pi}_{0}$, we run the SDE

\mathrm{d}X_{t}=\nabla\log\widetilde{\pi}_{t}(X_{t})\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}B_{t},~t\in[0,T], \qquad (4)

and ultimately output $X_{T}\sim\nu^{\mathrm{ALD}}$ as an approximate sample from the target distribution $\pi$. Intuitively, when $\widetilde{\pi}_{t}$ is changing slowly, the distribution of $X_{t}$ should closely resemble $\widetilde{\pi}_{t}$, leading to an output distribution $\nu^{\mathrm{ALD}}$ that approximates the target distribution. This turns out to be true, as is confirmed by the following theorem, which provides a convergence guarantee for the ALD process.

Theorem 1.

When choosing $T=\frac{\mathcal{A}_{\eta,\lambda}(\pi)}{4\varepsilon^{2}}$, it follows that $\operatorname{KL}\left(\pi\middle\|\nu^{\mathrm{ALD}}\right)\leq\varepsilon^{2}$.

Proof.

Let $\mathbb{Q}$ be the path measure of ALD (Equation 4) initialized at $X_{0}\sim\widetilde{\pi}_{0}$, and define $\mathbb{P}$ as the path measure corresponding to the following reference SDE:

\mathrm{d}X_{t}=(\nabla\log\widetilde{\pi}_{t}+v_{t})(X_{t})\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}B_{t},~X_{0}\sim\widetilde{\pi}_{0},~t\in[0,T]. \qquad (5)

The vector field $v=(v_{t})_{t\in[0,T]}$ is designed such that $X_{t}\sim\widetilde{\pi}_{t}$ for all $t\in[0,T]$. According to the Fokker-Planck equation, $v$ must satisfy the PDE

\partial_{t}\widetilde{\pi}_{t}=-\nabla\cdot(\widetilde{\pi}_{t}\left(\nabla\log\widetilde{\pi}_{t}+v_{t}\right))+\Delta\widetilde{\pi}_{t}=-\nabla\cdot(\widetilde{\pi}_{t}v_{t}),~t\in[0,T],

which means that $v$ generates $\left(\widetilde{\pi}_{t}\right)_{t\in[0,T]}$. We can compute $\operatorname{KL}\left(\mathbb{P}\middle\|\mathbb{Q}\right)$ using Lemma 1:

\operatorname{KL}\left(\mathbb{P}\middle\|\mathbb{Q}\right)=\frac{1}{4}\operatorname{\mathbb{E}}_{\mathbb{P}}\left[\int_{0}^{T}\left\|v_{t}(X_{t})\right\|^{2}\mathrm{d}t\right]=\frac{1}{4}\int_{0}^{T}\left\|v_{t}\right\|_{L^{2}(\widetilde{\pi}_{t})}^{2}\mathrm{d}t.

Leveraging Lemma 2, among all vector fields $v$ that generate $\left(\widetilde{\pi}_{t}\right)_{t\in[0,T]}$, we can choose the one that minimizes $\left\|v_{t}\right\|_{L^{2}(\widetilde{\pi}_{t})}$, thereby making $\left\|v_{t}\right\|_{L^{2}(\widetilde{\pi}_{t})}=\left|\dot{\widetilde{\pi}}\right|_{t}$, the metric derivative. With the reparametrization $\widetilde{\pi}_{t}=\pi_{t/T}$, we have the following relation by the chain rule:

\left|\dot{\widetilde{\pi}}\right|_{t}=\lim_{\delta\to 0}\frac{W_{2}(\widetilde{\pi}_{t+\delta},\widetilde{\pi}_{t})}{\left|\delta\right|}=\lim_{\delta\to 0}\frac{W_{2}(\pi_{(t+\delta)/T},\pi_{t/T})}{T\left|\delta/T\right|}=\frac{1}{T}\left|\dot{\pi}\right|_{t/T}.

Employing the change-of-variable formula leads to

\operatorname{KL}\left(\mathbb{P}\middle\|\mathbb{Q}\right)=\frac{1}{4}\int_{0}^{T}\left|\dot{\widetilde{\pi}}\right|_{t}^{2}\mathrm{d}t=\frac{1}{4T}\int_{0}^{1}\left|\dot{\pi}\right|_{\theta}^{2}\mathrm{d}\theta=\frac{\mathcal{A}_{\eta,\lambda}(\pi)}{4T}.

Finally, using the data-processing inequality ([12, Theorem 1.5.3]), with $T=\frac{\mathcal{A}_{\eta,\lambda}(\pi)}{4\varepsilon^{2}}$,

\operatorname{KL}\left(\pi\middle\|\nu^{\mathrm{ALD}}\right)=\operatorname{KL}\left(\mathbb{P}_{T}\middle\|\mathbb{Q}_{T}\right)\leq\operatorname{KL}\left(\mathbb{P}\middle\|\mathbb{Q}\right)=\varepsilon^{2},

where $\mathbb{P}_{T}$ and $\mathbb{Q}_{T}$ stand for the marginal distributions of $\mathbb{P}$ and $\mathbb{Q}$ at time $T$, respectively. ∎

Although we adopt a specific parametrization of the interpolating distributions (Equation 3), which is widely used in practice, the result in Theorem 1 applies to any curve of probability measures $(\pi_{\theta})_{\theta\in[0,1]}$ that bridges $\pi_{0}$ and $\pi_{1}=\pi$, as long as the drift and diffusion terms of an SDE that exactly follows the path $(\pi_{\theta})_{\theta\in[0,1]}$ are known in closed form.

Let us delve deeper into the mechanics of the ALD. Although at time $t$ the SDE (Equation 4) targets the distribution $\widetilde{\pi}_{t}$, the distribution of $X_{t}$ does not precisely match $\widetilde{\pi}_{t}$. Nevertheless, by choosing a sufficiently long time $T$, we move along the curve $(\pi_{\theta})_{\theta\in[0,1]}$ sufficiently slowly, thus minimizing the discrepancy between the path measure of $(X_{t})_{t\in[0,T]}$ and that of the reference curve $(\widetilde{\pi}_{t})_{t\in[0,T]}$. By applying the data-processing inequality, we can upper bound the error between the marginal distributions at time $T$ by the discrepancy between the joint distributions of the two paths. In essence, the more slowly we move, the more precisely we sample.

Our analysis predominantly addresses the global error across the entire curve of probability measures, rather than focusing solely on the local error at time $T$. This approach is inspired by [18] (see also [12, Section 4.4]) and [9], and stands in contrast to the traditional isoperimetry-based analyses of LD (e.g., [61, 13, 4]), which emphasize the decay of the KL divergence from the distribution of $X_{t}$ to the target distribution and require LSI to bound the time derivative of this quantity. Notably, the total time $T$ needed to run the SDE depends solely on the action of the curve $(\pi_{\theta})_{\theta\in[0,1]}$, obviating the need for assumptions related to log-concavity or isoperimetry.

We also remark that the ALD plays a significant role in another important subject known as non-equilibrium stochastic thermodynamics [52]. Recently, a refinement of the fluctuation theorem was discovered in [10, 24], stating that the irreversible entropy production in a stochastic thermodynamic system equals the ratio of an action integral, similar to the one in Theorem 1, to the duration $T$ of the process.

5 Analysis of Annealed Langevin Monte Carlo

It is crucial to recognize that the ALD (Equation 4) cannot be precisely simulated in practice. Transitioning from LD to LMC introduces additional errors due to discretization. This section will present a detailed convergence analysis for the annealed Langevin Monte Carlo (ALMC) algorithm, which is implementable in practical scenarios.

A straightforward yet non-optimal method to discretize Equation 4 involves employing the Euler-Maruyama scheme, i.e.,

X_{t+\Delta t}\approx X_{t}+\nabla\log\widetilde{\pi}_{t}(X_{t})\Delta t+\sqrt{2}(B_{t+\Delta t}-B_{t}),~0\leq t<t+\Delta t\leq T.

However, considering that $\nabla\log\widetilde{\pi}_{t}(x)=-\eta\left(\frac{t}{T}\right)\nabla V(x)-\lambda\left(\frac{t}{T}\right)x$, the integral of the linear term can be computed in closed form, so we can use the exponential-integrator scheme [65, 67] to further reduce the discretization error.
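To spell out the closed-form integration of the linear part (a standard variation-of-constants computation, sketched here with the gradient term frozen at the left endpoint of each step, as in Equation 6 below; see Section B.1 for the derivation used in the paper): on a step $[s_{0},s_{1}]\subseteq[0,T]$,

X_{s_{1}}=\mathrm{e}^{-\int_{s_{0}}^{s_{1}}\lambda(u/T)\,\mathrm{d}u}X_{s_{0}}-\left(\int_{s_{0}}^{s_{1}}\eta\left(\frac{s}{T}\right)\mathrm{e}^{-\int_{s}^{s_{1}}\lambda(u/T)\,\mathrm{d}u}\,\mathrm{d}s\right)\nabla V(X_{s_{0}})+\sqrt{2}\int_{s_{0}}^{s_{1}}\mathrm{e}^{-\int_{s}^{s_{1}}\lambda(u/T)\,\mathrm{d}u}\,\mathrm{d}B_{s},

and the stochastic integral is a centered Gaussian with covariance $2\int_{s_{0}}^{s_{1}}\mathrm{e}^{-2\int_{s}^{s_{1}}\lambda(u/T)\,\mathrm{d}u}\,\mathrm{d}s\,I$. After the change of variables $s=Tu$, these three coefficients correspond to the quantities $\Lambda_{0}$, $H$, and $\Lambda_{1}$ defined in Algorithm 1, which yields the update in Equation 7.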

Given the total time $T$, we define a sequence of points $0=\theta_{0}<\theta_{1}<\ldots<\theta_{M}=1$, and set $T_{\ell}=T\theta_{\ell}$, $h_{\ell}=T(\theta_{\ell}-\theta_{\ell-1})$. The exponential-integrator scheme is then expressed as

\mathrm{d}X_{t}=\left(-\eta\left(\frac{t}{T}\right)\nabla V(X_{t_{-}})-\lambda\left(\frac{t}{T}\right)X_{t}\right)\mathrm{d}t+\sqrt{2}\,\mathrm{d}B_{t},~t\in[0,T],~X_{0}\sim\pi_{0}, \qquad (6)

where $t_{-}:=T_{\ell-1}$ when $t\in[T_{\ell-1},T_{\ell})$, $\ell\in\left[\!\left[1,M\right]\!\right]$. The explicit update rule is detailed in Algorithm 1, with $x_{\ell}$ denoting $X_{T_{\ell}}$, and the derivation of Equation 7 is presented in Section B.1. Notably, replacing $\nabla V(X_{t_{-}})$ with $\nabla V(X_{t})$ recovers the ALD (Equation 4), and setting $\eta\equiv 1$ and $\lambda\equiv 0$ reduces to the classical LMC iterations.

Algorithm 1: Annealed LMC

Input: Target distribution $\pi\propto\mathrm{e}^{-V}$, total time $T$, annealing schedule $\eta(\cdot)$ and $\lambda(\cdot)$, discrete points $\theta_{0},\ldots,\theta_{M}$.

1. For $0\leq\theta<\theta'\leq 1$, define $\Lambda_{0}(\theta',\theta)=\exp\left(-T\int_{\theta}^{\theta'}\lambda(u)\,\mathrm{d}u\right)$, $H(\theta',\theta)=T\int_{\theta}^{\theta'}\eta(u)\Lambda_{0}(u,\theta')\,\mathrm{d}u$, and $\Lambda_{1}(\theta',\theta)=\sqrt{2T\int_{\theta}^{\theta'}\Lambda_{0}^{2}(u,\theta')\,\mathrm{d}u}$.

2. Obtain a sample $x_{0}\sim\pi_{0}$ using rejection sampling (Algorithm 2).

3. For $\ell=1,2,\ldots,M$:
   (a) Sample an independent Gaussian noise $\xi_{\ell}\sim\mathcal{N}(0,I)$.
   (b) Update

       x_{\ell}=\Lambda_{0}(\theta_{\ell},\theta_{\ell-1})x_{\ell-1}-H(\theta_{\ell},\theta_{\ell-1})\nabla V(x_{\ell-1})+\Lambda_{1}(\theta_{\ell},\theta_{\ell-1})\xi_{\ell}. \qquad (7)

Output: $x_{M}\sim\nu^{\mathrm{ALMC}}$, an approximate sample from $\pi$.
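Below is a minimal NumPy sketch of Algorithm 1, with the coefficients of the update in Equation 7 computed by simple quadrature from the variation-of-constants formula sketched above (so the decay factors follow that convention). The uniform grid, total time, schedule, toy potential, and stand-in for the $\pi_{0}$ draw are all illustrative assumptions, not the paper's prescriptions.

```python
import numpy as np

def almc_sample(grad_V, d, T, M, eta, lam, x0, rng, n_quad=64):
    """One run of annealed LMC: a single exponential-integrator step per theta_l."""
    thetas = np.linspace(0.0, 1.0, M + 1)        # uniform grid (illustrative choice)
    x = np.asarray(x0, dtype=float).copy()
    for l in range(1, M + 1):
        a, b = thetas[l - 1], thetas[l]
        # midpoint quadrature nodes on [a, b]
        u = a + (b - a) * (np.arange(n_quad) + 0.5) / n_quad
        du = (b - a) / n_quad
        lam_u = lam(u)
        tail = np.cumsum((lam_u * du)[::-1])[::-1]      # approx. int_u^b lambda(s) ds
        decay_u = np.exp(-T * tail)                     # exp(-T int_u^b lambda)
        Lambda0 = np.exp(-T * np.sum(lam_u) * du)       # exp(-T int_a^b lambda)
        H = T * np.sum(eta(u) * decay_u) * du           # gradient coefficient
        Lambda1 = np.sqrt(2.0 * T * np.sum(decay_u ** 2) * du)  # noise scale
        xi = rng.standard_normal(d)
        x = Lambda0 * x - H * grad_V(x) + Lambda1 * xi  # update, cf. Equation (7)
    return x

# Illustrative usage on a toy double-well potential (not from the paper)
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, beta = 2, 1.0
    grad_V = lambda x: x ** 3 - x                       # V(x) = sum(x^4/4 - x^2/2)
    eta = lambda th: np.ones_like(th)                   # eta == 1
    lam0 = d * beta
    lam = lambda th: lam0 * (1.0 - th)                  # lambda(theta) = lam0 * (1 - theta)
    draws = []
    for _ in range(10):
        x0 = rng.standard_normal(d) / np.sqrt(lam0 + beta)  # rough stand-in for a pi_0 draw
        draws.append(almc_sample(grad_V, d, T=50.0, M=2000, eta=eta, lam=lam,
                                 x0=x0, rng=rng))
    print(np.array(draws))
```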

We illustrate the ALMC algorithm in Figure 1. The underlying intuition behind this algorithm is that by setting a sufficiently long total time $T$, the trajectory of the continuous dynamics (i.e., annealed LD) stays close to the reference trajectory, as established in Theorem 1. Additionally, by adopting sufficiently small step sizes $h_{1},\ldots,h_{M}$, the discretization error can be substantially reduced. Unlike traditional annealed LMC methods, which require multiple LMC update steps for each intermediate distribution $\pi_{1},\ldots,\pi_{M}$, our approach views annealed LMC as a discretization of a continuous-time process. Consequently, it is adequate to perform a single step of LMC towards each intermediate distribution $\widetilde{\pi}_{T_{1}},\ldots,\widetilde{\pi}_{T_{M}}$, provided that $T$ is sufficiently large and the step sizes are appropriately small.

[Figure 1: schematic of the annealed LMC path measure $\mathbb{Q}$ and the reference path measure $\mathbb{P}$ passing through $\widetilde{\pi}_{0},\widetilde{\pi}_{T_{1}},\ldots,\widetilde{\pi}_{T_{M-1}},\widetilde{\pi}_{T_{M}}=\pi$, with $\mathbb{Q}$ ending at $\nu^{\mathrm{ALMC}}$.]

Figure 1: Illustration of the ALMC algorithm. The $\ell$-th green arrow, proceeding from left to right, represents one step of LMC towards $\widetilde{\pi}_{T_{\ell}}$ with step size $h_{\ell}$, while each red arrow corresponds to the application of the same transition kernel, initialized at $\widetilde{\pi}_{T_{\ell-1}}$ on the reference trajectory $\mathbb{P}$, which is depicted in purple. To evaluate $\operatorname{KL}\left(\mathbb{P}\middle\|\mathbb{Q}\right)$, the Girsanov theorem implies that we only need to bound the aggregate KL divergence across each small interval (i.e., the sum of the blue “distances”).

The subsequent theorem provides a convergence guarantee for the annealed LMC algorithm, with the complete proof detailed in Section B.2.

Theorem 2.

Under Assumptions 1 and 2, Algorithm 1 can generate a distribution $\nu^{\mathrm{ALMC}}$ satisfying $\operatorname{KL}\left(\pi\middle\|\nu^{\mathrm{ALMC}}\right)\leq\varepsilon^{2}$ within

\widetilde{O}\left(\frac{d\beta^{2}\mathcal{A}_{\eta,\lambda}(\pi)^{2}}{\varepsilon^{6}}\right)

calls to the oracle of $V$ and $\nabla V$ in expectation.

Sketch of Proof.

Let $\mathbb{Q}$ be the path measure of ALMC (Equation 6), whose marginal distribution at time $T$ is the output distribution $\nu^{\mathrm{ALMC}}$. Again, let $\mathbb{P}$ denote the reference path measure of Equation 5 used in the proof of Theorem 1, in which the same vector field $(v_{t})_{t\in[0,T]}$ ensures that $X_{t}\sim\widetilde{\pi}_{t}$ under $\mathbb{P}$ for all $t\in[0,T]$.

Applying the Girsanov theorem (Lemma 1) and carefully dealing with the discretization error, we can upper bound $\operatorname{KL}\left(\mathbb{P}\middle\|\mathbb{Q}\right)$ by

\operatorname{KL}\left(\mathbb{P}\middle\|\mathbb{Q}\right)\lesssim\sum_{\ell=1}^{M}\left(\frac{1+\eta(\theta_{\ell})^{2}\beta^{2}h_{\ell}^{2}}{T}\int_{\theta_{\ell-1}}^{\theta_{\ell}}\left|\dot{\pi}\right|^{2}_{\theta}\mathrm{d}\theta+\eta(\theta_{\ell})^{2}\beta^{2}dh_{\ell}^{2}\left(1+h_{\ell}\left(\beta\eta(\theta_{\ell})+\lambda(\theta_{\ell-1})\right)\right)\right).

The first summation is governed by the total time $T$, which pertains to the convergence of the continuous dynamics (i.e., ALD). Setting $T\asymp\frac{\mathcal{A}_{\eta,\lambda}(\pi)}{\varepsilon^{2}}$ ensures that the first summation remains $O(\varepsilon^{2})$, provided that the step sizes $h_{\ell}$ are sufficiently small. The second summation addresses the discretization error, and it suffices to determine appropriate values of $\theta_{\ell}$ to minimize $M$, the total number of calls to the oracle of $\nabla V$ for discretizing the SDE. Combining $M$ with the complexity of sampling from $\pi_{0}$ determines the overall complexity of the algorithm. ∎

Once again, our analysis relies on bounding the global error between two path measures via the Girsanov theorem. The annealing schedule $\eta(\cdot)$ and $\lambda(\cdot)$ influence the complexity exclusively through the action $\mathcal{A}_{\eta,\lambda}(\pi)$. Identifying the optimal annealing schedule to minimize this action remains a significant challenge, and we propose this as an area for future research.

We also note that our Assumption 1 encompasses strongly-log-concave target distributions. For sampling from these well-conditioned distributions via LMC, the complexity required to achieve $\varepsilon$-accuracy in TV distance scales as $\widetilde{O}(\varepsilon^{-2})$ [12]. However, using the Pinsker inequality $\mathrm{KL}\geq 2\,\mathrm{TV}^{2}$, our complexity to meet the same error criterion is $O(\varepsilon^{-6})$, indicating a significantly higher computational demand. Refining the parameter dependence of our algorithm to reduce this complexity remains a key area for future work.

Finally, we present a conjecture concerning the action $\mathcal{A}_{\eta,\lambda}(\pi)$, leaving its proof or disproof as an open question for future research. Following this conjecture, we demonstrate its applicability to a specific class of non-log-concave distributions in the subsequent example, with the detailed proof provided in Appendix C.

Conjecture 1.

Under Assumption 1, the action $\mathcal{A}_{\eta,\lambda}(\pi)$ is bounded above by a polynomial function of the problem parameters, including $d$, $\beta$, $R$, $\operatorname{\mathbb{E}}_{\pi}\left[\left\|\cdot\right\|^{2}\right]$, $\eta_{0}$, $\lambda_{0}$, etc.

Example 1.

Consider a mixture of Gaussian distributions in $\mathbb{R}^{d}$ defined by $\pi=\sum_{i=1}^{N}p_{i}\mathcal{N}\left(y_{i},\beta^{-1}I\right)$, where the weights $p_{i}>0$ satisfy $\sum_{i=1}^{N}p_{i}=1$, and $\left\|y_{i}\right\|=r$ for all $i\in\left[\!\left[1,N\right]\!\right]$. Consequently, the potential $V=-\log\pi$ is $B$-smooth, where $B=\beta(4r^{2}\beta+1)$. With an annealing schedule defined by $\eta(\cdot)\equiv 1$ and $\lambda(\theta)=dB(1-\theta)^{\gamma}$ for some $1\leq\gamma=O(1)$, it follows that $\mathcal{A}_{\eta,\lambda}(\pi)=O\left(d(r^{2}\beta+1)\left(r^{2}+\frac{d}{\beta}\right)\right)$.

To demonstrate the superiority of our theoretical results, let us consider a simplified scenario where $N=2$, $y_{1}=-y_{2}$, and $r^{2}\gg\frac{1}{\beta}$. We derive from Example 1 and the Pinsker inequality that the total complexity required to obtain a sample that is $\varepsilon$-accurate in TV is $\widetilde{O}\left(d^{3}\beta^{2}r^{4}(r^{4}\beta^{2}\vee d^{2})\varepsilon^{-6}\right)$. In contrast, studies such as [51, 8, 19] indicate that the LSI constant of $\pi$ is $\Omega\left(\mathrm{e}^{\Theta(\beta r^{2})}\right)$, so existing bounds in Table 1 suggest that LMC, to achieve the same accuracy, can only provide an exponential complexity guarantee of $\widetilde{O}\left(\mathrm{e}^{\Theta(\beta r^{2})}d\varepsilon^{-2}\right)$.
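The mixture potential in Example 1 and its gradient can be evaluated in a numerically stable way with a log-sum-exp; the sketch below could be plugged into the ALMC sketch in Section 5 as `grad_V`. The specific two-mode instance at the end is an illustrative assumption matching the simplified scenario above.

```python
import numpy as np

def make_gaussian_mixture_potential(ys, ps, beta):
    """V = -log pi (up to an additive constant) and grad V for
    pi = sum_i p_i N(y_i, beta^{-1} I)."""
    ys = np.asarray(ys, dtype=float)     # (N, d) means
    ps = np.asarray(ps, dtype=float)     # (N,) positive weights summing to 1

    def log_weights(x):
        # log of p_i * exp(-beta ||x - y_i||^2 / 2), dropping x-independent constants
        return np.log(ps) - 0.5 * beta * np.sum((x - ys) ** 2, axis=1)

    def V(x):
        lw = log_weights(x)
        m = lw.max()
        return -(m + np.log(np.exp(lw - m).sum()))      # stable -logsumexp

    def grad_V(x):
        lw = log_weights(x)
        w = np.exp(lw - lw.max())
        w /= w.sum()                                     # posterior responsibilities
        return beta * (x - w @ ys)                       # beta * (x - E[y_i | x])

    return V, grad_V

# Two symmetric modes at distance r from the origin (N = 2, y_1 = -y_2)
d, beta, r = 2, 1.0, 3.0
y = np.zeros(d); y[0] = r
V, grad_V = make_gaussian_mixture_potential(np.stack([y, -y]), np.array([0.5, 0.5]), beta)
print(V(np.zeros(d)), grad_V(y))
```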

6 Conclusions and Future Work

In this paper, we have explored the complexity of annealed LMC for sampling from a non-log-concave probability measure, without relying on log-concavity or isoperimetry. Central to our analysis are the Girsanov theorem and optimal transport techniques, which circumvent the need for isoperimetric inequalities. While our proof primarily focuses on the annealing scheme described in Equation 3, there is potential for adaptation to more general interpolations, and further exploration of these applications will be a focus of our future research. Technically, our proof methodology could be expanded to include a broader range of target distributions beyond those with smooth potentials, such as those with Hölder-continuous gradients [7, 21, 40, 47, 23, 22], or even heavy-tailed target distributions [44, 31]. Furthermore, eliminating the need to assume knowledge of the global minimizer(s) of the potential function would enhance the practical utility of our algorithm. Theoretically, proving Conjecture 1 or obtaining a polynomial bound on $\mathcal{A}_{\eta,\lambda}(\pi)$ under less restrictive conditions would further validate that our complexity bounds are genuinely polynomial. Finally, while this work emphasizes upper bounds for non-log-concave sampling, an intriguing future avenue would be to verify the tightness of these bounds and explore potential lower bounds for this problem [16, 15, 14].

References

  • [1] Luigi Ambrosio, Elia Brué and Daniele Semola “Lectures on optimal transport” 130, UNITEXT Springer Cham, 2021 DOI: 10.1007/978-3-030-72162-6
  • [2] Luigi Ambrosio, Nicola Gigli and Giuseppe Savaré “Gradient Flows: In Metric Spaces and in the Space of Probability Measures”, Lectures in Mathematics. ETH Zürich Birkhäuser Basel, 2008 DOI: 10.1007/978-3-7643-8722-8
  • [3] Dominique Bakry, Ivan Gentil and Michel Ledoux “Analysis and geometry of Markov diffusion operators” 103, Grundlehren der mathematischen Wissenschaften Springer Cham, 2014 DOI: 10.1007/978-3-319-00227-9
  • [4] Krishna Balasubramanian et al. “Towards a Theory of Non-Log-Concave Sampling: First-Order Stationarity Guarantees for Langevin Monte Carlo” In Proceedings of Thirty Fifth Conference on Learning Theory 178, Proceedings of Machine Learning Research PMLR, 2022, pp. 2896–2923 URL: https://proceedings.mlr.press/v178/balasubramanian22a.html
  • [5] Steve Brooks, Andrew Gelman, Galin Jones and Xiao-Li Meng “Handbook of Markov chain Monte Carlo” CRC press, 2011
  • [6] Nicolas Brosse, Alain Durmus and Éric Moulines “Normalizing constants of log-concave densities” In Electronic Journal of Statistics 12.1 Institute of Mathematical Statistics and Bernoulli Society, 2018, pp. 851–889 DOI: 10.1214/18-EJS1411
  • [7] Niladri Chatterji, Jelena Diakonikolas, Michael I. Jordan and Peter Bartlett “Langevin Monte Carlo without smoothness” In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics 108, Proceedings of Machine Learning Research PMLR, 2020, pp. 1716–1726 URL: https://proceedings.mlr.press/v108/chatterji20a.html
  • [8] Hong-Bin Chen, Sinho Chewi and Jonathan Niles-Weed “Dimension-free log-Sobolev inequalities for mixture distributions” In Journal of Functional Analysis 281.11, 2021, pp. 109236 DOI: https://doi.org/10.1016/j.jfa.2021.109236
  • [9] Sitan Chen et al. “Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions” In The Eleventh International Conference on Learning Representations, 2023 URL: https://openreview.net/forum?id=zyLVMgsZ0U_
  • [10] Yongxin Chen, Tryphon T. Georgiou and Allen Tannenbaum “Stochastic Control and Nonequilibrium Thermodynamics: Fundamental Limits” In IEEE Transactions on Automatic Control 65.7, 2020, pp. 2979–2991 DOI: 10.1109/TAC.2019.2939625
  • [11] Xiang Cheng, Bohan Wang, Jingzhao Zhang and Yusong Zhu “Fast Conditional Mixing of MCMC Algorithms for Non-log-concave Distributions” In Advances in Neural Information Processing Systems 36 Curran Associates, Inc., 2023, pp. 13374–13394 URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/2b00b3331bd0f5fbfdd966ac06338f6d-Paper-Conference.pdf
  • [12] Sinho Chewi “Log-Concave Sampling” Book draft, in preparation, 2024 URL: https://chewisinho.github.io
  • [13] Sinho Chewi et al. “Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev” In Proceedings of Thirty Fifth Conference on Learning Theory 178, Proceedings of Machine Learning Research PMLR, 2022, pp. 1–2 URL: https://proceedings.mlr.press/v178/chewi22a.html
  • [14] Sinho Chewi, Patrik Gerber, Holden Lee and Chen Lu “Fisher information lower bounds for sampling” In Proceedings of The 34th International Conference on Algorithmic Learning Theory 201, Proceedings of Machine Learning Research PMLR, 2023, pp. 375–410 URL: https://proceedings.mlr.press/v201/chewi23b.html
  • [15] Sinho Chewi et al. “Query lower bounds for log-concave sampling” In 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), 2023, pp. 2139–2148 DOI: 10.1109/FOCS57990.2023.00131
  • [16] Sinho Chewi et al. “The query complexity of sampling from strongly log-concave distributions in one dimension” In Proceedings of Thirty Fifth Conference on Learning Theory 178, Proceedings of Machine Learning Research PMLR, 2022, pp. 2041–2059 URL: https://proceedings.mlr.press/v178/chewi22b.html
  • [17] John S. Dagpunar “Simulation and Monte Carlo: With Applications in Finance and MCMC” John Wiley & Sons, Ltd, 2007 DOI: 10.1002/9780470061336
  • [18] Arnak Dalalyan and Alexandre B. Tsybakov “Sparse regression learning by aggregation and Langevin Monte-Carlo” JCSS Special Issue: Cloud Computing 2011 In Journal of Computer and System Sciences 78.5, 2012, pp. 1423–1443 DOI: https://doi.org/10.1016/j.jcss.2011.12.023
  • [19] Jing Dong and Xin T. Tong “Spectral gap of replica exchange Langevin diffusion on mixture distributions” In Stochastic Processes and their Applications 151, 2022, pp. 451–489 DOI: https://doi.org/10.1016/j.spa.2022.06.006
  • [20] Alain Durmus, Szymon Majewski and Błażej Miasojedow “Analysis of Langevin Monte Carlo via Convex Optimization” In Journal of Machine Learning Research 20.73, 2019, pp. 1–46 URL: http://jmlr.org/papers/v20/18-173.html
  • [21] Murat A Erdogdu and Rasa Hosseinzadeh “On the Convergence of Langevin Monte Carlo: The Interplay between Tail Growth and Smoothness” In Proceedings of Thirty Fourth Conference on Learning Theory 134, Proceedings of Machine Learning Research PMLR, 2021, pp. 1776–1822 URL: https://proceedings.mlr.press/v134/erdogdu21a.html
  • [22] Jiaojiao Fan, Bo Yuan and Yongxin Chen “Improved dimension dependence of a proximal algorithm for sampling” In Proceedings of Thirty Sixth Conference on Learning Theory 195, Proceedings of Machine Learning Research PMLR, 2023, pp. 1473–1521 URL: https://proceedings.mlr.press/v195/fan23a.html
  • [23] Jiaojiao Fan, Bo Yuan, Jiaming Liang and Yongxin Chen “Nesterov Smoothing for Sampling Without Smoothness” In 2023 62nd IEEE Conference on Decision and Control (CDC), 2023, pp. 5313–5318 DOI: 10.1109/CDC49753.2023.10383529
  • [24] Rui Fu, Amirhossein Taghvaei, Yongxin Chen and Tryphon T Georgiou “Maximal power output of a stochastic thermodynamic engine” In Automatica 123 Elsevier, 2021, pp. 109366
  • [25] Guillaume Garrigos and Robert M Gower “Handbook of convergence theorems for (stochastic) gradient methods” In arXiv preprint arXiv:2301.11235, 2023
  • [26] Rong Ge, Holden Lee and Jianfeng Lu “Estimating Normalizing Constants for Log-Concave Distributions: Algorithms and Lower Bounds” In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020 Chicago, IL, USA: Association for Computing Machinery, 2020, pp. 579–586 DOI: 10.1145/3357713.3384289
  • [27] Rong Ge, Holden Lee and Andrej Risteski “Beyond Log-concavity: Provable Guarantees for Sampling Multi-modal Distributions using Simulated Tempering Langevin Monte Carlo” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018 URL: https://proceedings.neurips.cc/paper_files/paper/2018/file/c6ede20e6f597abf4b3f6bb30cee16c7-Paper.pdf
  • [28] Rong Ge, Holden Lee and Andrej Risteski “Simulated tempering Langevin Monte Carlo II: An improved proof using soft Markov chain decomposition” In arXiv preprint arXiv:1812.00793, 2018
  • [29] Saul Brian Gelfand and Sanjoy K Mitter “On sampling methods and annealing algorithms” In technical report, 1990
  • [30] Andrew Gelman, John B. Carlin, Hal S. Stern and Donald B. Rubin “Bayesian data analysis” Chapman & Hall/CRC, 2013
  • [31] Ye He, Krishnakumar Balasubramanian and Murat A. Erdogdu “An Analysis of Transformed Unadjusted Langevin Algorithm for Heavy-Tailed Sampling” In IEEE Transactions on Information Theory 70.1, 2024, pp. 571–593 DOI: 10.1109/TIT.2023.3318152
  • [32] Ye He, Kevin Rojas and Molei Tao “Zeroth-Order Sampling Methods for Non-Log-Concave Distributions: Alleviating Metastability by Denoising Diffusion” In arXiv preprint arXiv:2402.17886, 2024
  • [33] Jonathan Ho, Ajay Jain and Pieter Abbeel “Denoising diffusion probabilistic models” In Advances in Neural Information Processing Systems 33, 2020, pp. 6840–6851
  • [34] Xunpeng Huang et al. “Faster Sampling without Isoperimetry via Diffusion-based Monte Carlo” In arXiv preprint arXiv:2401.06325, 2024
  • [35] Xunpeng Huang et al. “Reverse Diffusion Monte Carlo” In The Twelfth International Conference on Learning Representations, 2024 URL: https://openreview.net/forum?id=kIPEyMSdFV
  • [36] Ioannis Karatzas and Steven E. Shreve “Brownian Motion and Stochastic Calculus”, Graduate Texts in Mathematics Springer New York, NY, 1991 DOI: 10.1007/978-1-4612-0949-2
  • [37] David Landau and Kurt Binder “A Guide to Monte Carlo Simulations in Statistical Physics” Cambridge University Press, 2021 DOI: 10.1017/9781108780346
  • [38] Holden Lee and Zeyu Shen “Improved Bound for Mixing Time of Parallel Tempering” In arXiv preprint arXiv:2304.01303, 2023
  • [39] Jiaming Liang and Yongxin Chen “A Proximal Algorithm for Sampling” In Transactions on Machine Learning Research, 2023 URL: https://openreview.net/forum?id=CkXOwlhf27
  • [40] Jiaming Liang and Yongxin Chen “A Proximal Algorithm for Sampling from Non-Smooth Potentials” In 2022 Winter Simulation Conference (WSC), 2022, pp. 3229–3240 DOI: 10.1109/WSC57314.2022.10015293
  • [41] Robert S. Liptser and Albert N. Shiryaev “Statistics of Random Processes”, Stochastic Modelling and Applied Probability Springer Berlin, Heidelberg, 2013 DOI: 10.1007/978-3-662-13043-8
  • [42] Jun S. Liu “Monte Carlo strategies in scientific computing”, Springer Series in Statistics Springer New York, NY, 2004 DOI: 10.1007/978-0-387-76371-2
  • [43] Enzo Marinari and Giorgio Parisi “Simulated Tempering: A New Monte Carlo Scheme” In Europhysics Letters 19.6, 1992, pp. 451 DOI: 10.1209/0295-5075/19/6/002
  • [44] Alireza Mousavi-Hosseini et al. “Towards a Complete Analysis of Langevin Monte Carlo: Beyond Poincaré Inequality” In Proceedings of Thirty Sixth Conference on Learning Theory 195, Proceedings of Machine Learning Research PMLR, 2023, pp. 1–35 URL: https://proceedings.mlr.press/v195/mousavi-hosseini23a.html
  • [45] Radford M. Neal “Annealed importance sampling” In Statistics and Computing 11.2, 2001, pp. 125–139 DOI: 10.1023/A:1008923215028
  • [46] Mark E. Newman and Gerard T. Barkema “Monte Carlo Methods in Statistical Physics” Oxford University Press, 1999 DOI: 10.1093/oso/9780198517962.001.0001
  • [47] Dao Nguyen, Xin Dang and Yixin Chen “Unadjusted Langevin Algorithm for Non-convex Weakly Smooth Potentials” In Communications in Mathematics and Statistics, 2023 DOI: 10.1007/s40304-023-00350-w
  • [48] Art B. Owen “Monte Carlo theory, methods and examples”, 2013 URL: https://artowen.su.domains/mc/
  • [49] Lorenz Richter and Julius Berner “Improved sampling via learned diffusions” In The Twelfth International Conference on Learning Representations, 2024 URL: https://openreview.net/forum?id=h4pNROsO06
  • [50] Christian P. Robert and George Casella “Monte Carlo Statistical Methods”, Springer Texts in Statistics Springer New York, NY, 2004 DOI: 10.1007/978-1-4757-4145-2
  • [51] André Schlichting “Poincaré and Log–Sobolev Inequalities for Mixtures” In Entropy 21.1, 2019 DOI: 10.3390/e21010089
  • [52] Udo Seifert “Stochastic thermodynamics, fluctuation theorems and molecular machines” In Reports on progress in physics 75.12 IOP Publishing, 2012, pp. 126001
  • [53] Jiaming Song, Chenlin Meng and Stefano Ermon “Denoising Diffusion Implicit Models” In International Conference on Learning Representations, 2021 URL: https://openreview.net/forum?id=St1giarCHLP
  • [54] Yang Song and Stefano Ermon “Generative Modeling by Estimating Gradients of the Data Distribution” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019 URL: https://proceedings.neurips.cc/paper_files/paper/2019/file/3001ef257407d5a371a96dcd947c7d93-Paper.pdf
  • [55] Yang Song and Stefano Ermon “Improved Techniques for Training Score-Based Generative Models” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 12438–12448 URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/92c3b916311a5517d9290576e3ea37ad-Paper.pdf
  • [56] Yang Song et al. “Score-Based Generative Modeling through Stochastic Differential Equations” In International Conference on Learning Representations, 2021 URL: https://openreview.net/forum?id=PxTIG12RRHS
  • [57] Robert H. Swendsen and Jian-Sheng Wang “Replica Monte Carlo Simulation of Spin-Glasses” In Phys. Rev. Lett. 57 American Physical Society, 1986, pp. 2607–2609 DOI: 10.1103/PhysRevLett.57.2607
  • [58] Ali Süleyman Üstünel and Moshe Zakai “Transformation of Measure on Wiener Space”, Springer Monographs in Mathematics Springer Berlin, Heidelberg, 2013 DOI: 10.1007/978-3-662-13225-8
  • [59] Francisco Vargas, Will Sussman Grathwohl and Arnaud Doucet “Denoising Diffusion Samplers” In The Eleventh International Conference on Learning Representations, 2023 URL: https://openreview.net/forum?id=8pvnfTAbu1f
  • [60] Francisco Vargas, Shreyas Padhy, Denis Blessing and Nikolas Nüsken “Transport meets Variational Inference: Controlled Monte Carlo Diffusions” In The Twelfth International Conference on Learning Representations, 2024 URL: https://openreview.net/forum?id=PP1rudnxiW
  • [61] Santosh S. Vempala and Andre Wibisono “Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019 URL: https://proceedings.neurips.cc/paper_files/paper/2019/file/65a99bb7a3115fdede20da98b08a370f-Paper.pdf
  • [62] Cédric Villani “Optimal Transport: Old and New”, Grundlehren der mathematischen Wissenschaften Springer Berlin, Heidelberg, 2008 DOI: 10.1007/978-3-540-71050-9
  • [63] Cédric Villani “Topics in optimal transportation” American Mathematical Society, 2021
  • [64] Dawn B. Woodard, Scott C. Schmidler and Mark Huber “Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions” In The Annals of Applied Probability 19.2 Institute of Mathematical Statistics, 2009, pp. 617–640 DOI: 10.1214/08-AAP555
  • [65] Qinsheng Zhang and Yongxin Chen “Fast Sampling of Diffusion Models with Exponential Integrator” In The Eleventh International Conference on Learning Representations, 2023 URL: https://openreview.net/forum?id=Loek7hfb46P
  • [66] Qinsheng Zhang and Yongxin Chen “Path Integral Sampler: A Stochastic Control Approach For Sampling” In International Conference on Learning Representations, 2022 URL: https://openreview.net/forum?id=_uCb2ynRu7Y
  • [67] Qinsheng Zhang, Molei Tao and Yongxin Chen “gDDIM: Generalized denoising diffusion implicit models” In The Eleventh International Conference on Learning Representations, 2023 URL: https://openreview.net/forum?id=1hKE9qjvz-
  • [68] Nicolas Zilberstein, Ashutosh Sabharwal and Santiago Segarra “Solving Linear Inverse Problems Using Higher-Order Annealed Langevin Diffusion” In IEEE Transactions on Signal Processing 72, 2024, pp. 492–505 DOI: 10.1109/TSP.2023.3348202
  • [69] Nicolas Zilberstein et al. “Annealed Langevin Dynamics for Massive MIMO Detection” In IEEE Transactions on Wireless Communications 22.6, 2023, pp. 3762–3776 DOI: 10.1109/TWC.2022.3221057

Appendix A Proof of Lemma 3

Our proof is inspired by [39].

We only consider the nontrivial case $\eta_{0}\in(0,1]$. The potential of $\pi_{0}$ is $V_{0}:=\eta_{0}V+\frac{\lambda_{0}}{2}\left\|\cdot\right\|^{2}$, which is $(\lambda_{0}-\eta_{0}\beta)$-strongly-convex and $(\lambda_{0}+\eta_{0}\beta)$-smooth. Note that for any fixed point $x'\in\mathbb{R}^{d}$,

\pi_{0}(x)\propto\exp\left(-V_{0}(x)\right)\leq\exp\left(-V_{0}(x')-\left\langle\nabla V_{0}(x'),x-x'\right\rangle-\frac{\lambda_{0}-\eta_{0}\beta}{2}\left\|x-x'\right\|^{2}\right).

The right-hand side is the unnormalized density of $\pi_{0}':=\mathcal{N}\left(x'-\frac{\nabla V_{0}(x')}{\lambda_{0}-\eta_{0}\beta},\frac{1}{\lambda_{0}-\eta_{0}\beta}I\right)$. The rejection sampling algorithm is as follows: sample $X\sim\pi_{0}'$, and accept $X$ as a sample from $\pi_{0}$ with probability

\exp\left(-V_{0}(X)+V_{0}(x')+\left\langle\nabla V_{0}(x'),X-x'\right\rangle+\frac{\lambda_{0}-\eta_{0}\beta}{2}\left\|X-x'\right\|^{2}\right)\in(0,1].

By [12, Theorem 7.1.1], the number of queries to the oracle of $V$ until acceptance follows a geometric distribution with mean

\frac{\int\exp\left(-V_{0}(x')-\left\langle\nabla V_{0}(x'),x-x'\right\rangle-\frac{\lambda_{0}-\eta_{0}\beta}{2}\left\|x-x'\right\|^{2}\right)\mathrm{d}x}{\int\exp\left(-V_{0}(x)\right)\mathrm{d}x}
\leq\frac{\int\exp\left(-V_{0}(x')-\left\langle\nabla V_{0}(x'),x-x'\right\rangle-\frac{\lambda_{0}-\eta_{0}\beta}{2}\left\|x-x'\right\|^{2}\right)\mathrm{d}x}{\int\exp\left(-V_{0}(x')-\left\langle\nabla V_{0}(x'),x-x'\right\rangle-\frac{\lambda_{0}+\eta_{0}\beta}{2}\left\|x-x'\right\|^{2}\right)\mathrm{d}x}
=\frac{\int\exp\left(-\left\langle\nabla V_{0}(x'),x\right\rangle-\frac{\lambda_{0}-\eta_{0}\beta}{2}\left\|x\right\|^{2}\right)\mathrm{d}x}{\int\exp\left(-\left\langle\nabla V_{0}(x'),x\right\rangle-\frac{\lambda_{0}+\eta_{0}\beta}{2}\left\|x\right\|^{2}\right)\mathrm{d}x}
=\frac{\int\exp\left(-\frac{\lambda_{0}-\eta_{0}\beta}{2}\left\|x+\frac{\nabla V_{0}(x')}{\lambda_{0}-\eta_{0}\beta}\right\|^{2}+\frac{\left\|\nabla V_{0}(x')\right\|^{2}}{2(\lambda_{0}-\eta_{0}\beta)}\right)\mathrm{d}x}{\int\exp\left(-\frac{\lambda_{0}+\eta_{0}\beta}{2}\left\|x+\frac{\nabla V_{0}(x')}{\lambda_{0}+\eta_{0}\beta}\right\|^{2}+\frac{\left\|\nabla V_{0}(x')\right\|^{2}}{2(\lambda_{0}+\eta_{0}\beta)}\right)\mathrm{d}x}
=\left(\frac{\lambda_{0}+\eta_{0}\beta}{\lambda_{0}-\eta_{0}\beta}\right)^{\frac{d}{2}}\exp\left(\frac{\eta_{0}\beta}{\lambda_{0}^{2}-\eta_{0}^{2}\beta^{2}}\left\|\nabla V_{0}(x')\right\|^{2}\right)
\leq\exp\left(\frac{\eta_{0}\beta d}{\lambda_{0}-\eta_{0}\beta}\right)\exp\left(\frac{\eta_{0}\beta}{\lambda_{0}^{2}-\eta_{0}^{2}\beta^{2}}\left\|\nabla V_{0}(x')\right\|^{2}\right).

We choose λ0=η0βd\lambda_{0}=\eta_{0}\beta d such that exp(η0βdλ0η0β)1\exp\left(\frac{\eta_{0}\beta d}{\lambda_{0}-\eta_{0}\beta}\right)\lesssim 1. With this λ0\lambda_{0}, exp(η0βλ02η02β2V0(x)2)\exp\left(\frac{\eta_{0}\beta}{\lambda_{0}^{2}-\eta_{0}^{2}\beta^{2}}\left\|\nabla V_{0}(x^{\prime})\right\|^{2}\right) is also O(1)O(1) as long as V0(x)2η0βd2\left\|\nabla V_{0}(x^{\prime})\right\|^{2}\lesssim\eta_{0}\beta d^{2}.

Let x′′x^{\prime\prime} be the global minimizer of the strongly convex potential function V0V_{0}, which satisfies

0=V0(x′′)=η0V(x′′)+λ0x′′=η0V(x′′)+η0βdx′′.0=\nabla V_{0}(x^{\prime\prime})=\eta_{0}\nabla V(x^{\prime\prime})+\lambda_{0}x^{\prime\prime}=\eta_{0}\nabla V(x^{\prime\prime})+\eta_{0}\beta dx^{\prime\prime}.

Given the smoothness of V0V_{0},

V0(x)2=V0(x)V0(x′′)2(η0βd+η0β)2xx′′2.\left\|\nabla V_{0}(x^{\prime})\right\|^{2}=\left\|\nabla V_{0}(x^{\prime})-\nabla V_{0}(x^{\prime\prime})\right\|^{2}\leq(\eta_{0}\beta d+\eta_{0}\beta)^{2}\left\|x^{\prime}-x^{\prime\prime}\right\|^{2}.

Therefore, to guarantee V0(x)2η0βd2\left\|\nabla V_{0}(x^{\prime})\right\|^{2}\lesssim\eta_{0}\beta d^{2}, it suffices to find an xx^{\prime} such that xx′′1η0β\left\|x^{\prime}-x^{\prime\prime}\right\|\lesssim\frac{1}{\sqrt{\eta_{0}\beta}}.

We first derive an upper bound of x′′\left\|x^{\prime\prime}\right\| under Assumption 1. Since xx_{*} is a global minimizer of VV,

βdx′′\displaystyle\beta d\left\|x^{\prime\prime}\right\| =V(x′′)=V(x′′)V(x)\displaystyle=\left\|\nabla V(x^{\prime\prime})\right\|=\left\|\nabla V(x^{\prime\prime})-\nabla V(x_{*})\right\|
βx′′xβ(x′′+R),\displaystyle\leq\beta\left\|x^{\prime\prime}-x_{*}\right\|\leq\beta(\left\|x^{\prime\prime}\right\|+R),
x′′\displaystyle\implies\left\|x^{\prime\prime}\right\| Rd1Rd.\displaystyle\leq\frac{R}{d-1}\lesssim\frac{R}{d}.

When R=0R=0, i.e., 0 is a known global minimizer of both VV and V0V_{0}, we can directly set x=0x^{\prime}=0; otherwise, we need to run optimization algorithms to find such an xx^{\prime}. According to standard results in convex optimization (see, e.g., [25, Theorem 3.6]), running gradient descent on function V0V_{0} with step size 1λ0+η0β\frac{1}{\lambda_{0}+\eta_{0}\beta} yields the following convergence rate: starting from x0=0x_{0}=0, the tt-th iterate xtx_{t} satisfies

xtx′′2(1λ0η0βλ0+η0β)t0x′′2(2d)tR2d2R2etd2,\left\|x_{t}-x^{\prime\prime}\right\|^{2}\leq\left(1-\frac{\lambda_{0}-\eta_{0}\beta}{\lambda_{0}+\eta_{0}\beta}\right)^{t}\left\|0-x^{\prime\prime}\right\|^{2}\lesssim\left(\frac{2}{d}\right)^{t}\frac{R^{2}}{d^{2}}\leq\frac{R^{2}}{{\rm e}^{t}d^{2}},

where the last inequality holds when d6d\geq 6. Thus, logη0βR2d2+O(1)\log\frac{\eta_{0}\beta R^{2}}{d^{2}}+O(1) iterations of gradient descent suffice to find such an xx^{\prime}. We summarize the rejection sampling procedure in Algorithm 2. In conclusion, exact sampling from π0\pi_{0} requires

O(1logη0βR2d2)=O~(1)O\left(1\vee\log\frac{\eta_{0}\beta R^{2}}{d^{2}}\right)=\widetilde{O}\left(1\right)

calls to the oracle of VV and V\nabla V in expectation.

Input: πexp(V)\pi\propto\exp\left(-V\right), η0(0,1]\eta_{0}\in(0,1], λ0=η0βd\lambda_{0}=\eta_{0}\beta d.
1 Let V0:=η0V+λ022V_{0}:=\eta_{0}V+\frac{\lambda_{0}}{2}\left\|\cdot\right\|^{2};
2 Use gradient descent to find an xx^{\prime} that is O(1η0β)O\left(\frac{1}{\sqrt{\eta_{0}\beta}}\right)-close to the global minimizer of V0V_{0};
3 Let π0:=𝒩(xV0(x)λ0η0β,1λ0η0βI)\pi_{0}^{\prime}:=\operatorname{{\cal N}}\left(x^{\prime}-\frac{\nabla V_{0}(x^{\prime})}{\lambda_{0}-\eta_{0}\beta},\frac{1}{\lambda_{0}-\eta_{0}\beta}I\right);
4 while True do
5       Sample Xπ0X\sim\pi_{0}^{\prime} and UUnif([0,1])U\sim\operatorname{Unif}([0,1]) independently;
6       Accept XX as a sample of π0\pi_{0} when
Uexp(V0(X)+V0(x)+V0(x),Xx+λ0η0β2Xx2).U\leq\exp\left(-V_{0}(X)+V_{0}(x^{\prime})+\left\langle\nabla V_{0}(x^{\prime}),X-x^{\prime}\right\rangle+\frac{\lambda_{0}-\eta_{0}\beta}{2}\left\|X-x^{\prime}\right\|^{2}\right).
7 end while
Output: Xπ0X\sim\pi_{0}.
Algorithm 2 Rejection Sampling for π0\pi_{0}.
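For concreteness, we include a minimal NumPy sketch of Algorithm 2; it is not part of the analysis, the callables V and grad_V, the constants beta, eta0, d, and the number of gradient-descent steps are assumed inputs, and the function name and signature are illustrative.

```python
import numpy as np

def sample_pi0(V, grad_V, beta, eta0, d, n_gd_steps=50, rng=None):
    """Sketch of Algorithm 2: rejection sampling for pi_0 ∝ exp(-eta0*V - lambda0/2*||x||^2)."""
    rng = np.random.default_rng() if rng is None else rng
    lam0 = eta0 * beta * d                              # the choice lambda_0 = eta_0 * beta * d
    V0 = lambda x: eta0 * V(x) + 0.5 * lam0 * np.dot(x, x)
    grad_V0 = lambda x: eta0 * grad_V(x) + lam0 * x
    m, L = lam0 - eta0 * beta, lam0 + eta0 * beta       # strong convexity / smoothness of V0

    # Gradient descent from 0 (step size 1/L) to get x' close to the minimizer of V0.
    x = np.zeros(d)
    for _ in range(n_gd_steps):
        x = x - grad_V0(x) / L
    g, V0x = grad_V0(x), V0(x)

    # Proposal N(x' - g/m, I/m); accept with the probability stated in Algorithm 2.
    while True:
        X = x - g / m + rng.standard_normal(d) / np.sqrt(m)
        log_acc = -V0(X) + V0x + g @ (X - x) + 0.5 * m * np.dot(X - x, X - x)
        if np.log(rng.uniform()) <= log_acc:            # log_acc <= 0 by strong convexity of V0
            return X
```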

Appendix B Proofs for Annealed LMC

B.1 Proof of Equation 7

Define Λ(t):=0tλ(τT)dτ\varLambda(t):=\int_{0}^{t}\lambda\left(\frac{\tau}{T}\right){\rm d}\tau, whose derivative is Λ(t)=λ(tT)\varLambda^{\prime}(t)=\lambda\left(\frac{t}{T}\right). By Itô’s formula, we have

d(eΛ(t)Xt)=eΛ(t)(λ(tT)Xtdt+dXt)=eΛ(t)(η(tT)V(Xt)dt+2dBt).{\rm d}\left({\rm e}^{\varLambda(t)}X_{t}\right)={\rm e}^{\varLambda(t)}\left(\lambda\left(\frac{t}{T}\right)X_{t}{\rm d}t+{\rm d}X_{t}\right)={\rm e}^{\varLambda(t)}\left(-\eta\left(\frac{t}{T}\right)\nabla V(X_{t_{-}}){\rm d}t+\sqrt{2}{\rm d}B_{t}\right).

Integrating over t[T1,T)t\in[T_{\ell-1},T_{\ell}) (note that in this case t=T1t_{-}=T_{\ell-1}), we obtain

eΛ(T)XTeΛ(T1)XT1\displaystyle{\rm e}^{\varLambda(T_{\ell})}X_{T_{\ell}}-{\rm e}^{\varLambda(T_{\ell-1})}X_{T_{\ell-1}} =(T1Tη(tT)eΛ(t)dt)V(XT1)+2T1TeΛ(t)dBt,\displaystyle=-\left(\int_{T_{\ell-1}}^{T_{\ell}}\eta\left(\frac{t}{T}\right){\rm e}^{\varLambda(t)}{\rm d}t\right)\nabla V(X_{T_{\ell-1}})+\sqrt{2}\int_{T_{\ell-1}}^{T_{\ell}}{\rm e}^{\varLambda(t)}{\rm d}B_{t},

i.e.,

XT\displaystyle X_{T_{\ell}} =e(Λ(T)Λ(T1))XT1(T1Tη(tT)e(Λ(T)Λ(t))dt)V(XT1)\displaystyle={\rm e}^{-(\varLambda(T_{\ell})-\varLambda(T_{\ell-1}))}X_{T_{\ell-1}}-\left(\int_{T_{\ell-1}}^{T_{\ell}}\eta\left(\frac{t}{T}\right){\rm e}^{-(\varLambda(T_{\ell})-\varLambda(t))}{\rm d}t\right)\nabla V(X_{T_{\ell-1}})
+2T1Te(Λ(T)Λ(t))dBt.\displaystyle+\sqrt{2}\int_{T_{\ell-1}}^{T_{\ell}}{\rm e}^{-(\varLambda(T_{\ell})-\varLambda(t))}{\rm d}B_{t}.

Since

Λ(T)Λ(t)=tTλ(τT)dτ=Tt/TT/Tλ(u)du,\varLambda(T_{\ell})-\varLambda(t)=\int_{t}^{T_{\ell}}\lambda\left(\frac{\tau}{T}\right){\rm d}\tau=T\int_{t/T}^{T_{\ell}/T}\lambda(u){\rm d}u,

by defining Λ0(θ,θ)=exp(Tθθλ(u)du)\Lambda_{0}(\theta^{\prime},\theta)=\exp\left(-T\int_{\theta}^{\theta^{\prime}}\lambda(u){\rm d}u\right), we have e(Λ(T)Λ(T1))=Λ0(θ,θ1){\rm e}^{-(\varLambda(T_{\ell})-\varLambda(T_{\ell-1}))}=\Lambda_{0}(\theta_{\ell},\theta_{\ell-1}). Similarly, we can show that

T1Tη(tT)e(Λ(T)Λ(t))dt=T1Tη(tT)Λ0(TT,tT)dt=Tθ1θη(u)Λ0(θ,u)du,\int_{T_{\ell-1}}^{T_{\ell}}\eta\left(\frac{t}{T}\right){\rm e}^{-(\varLambda(T_{\ell})-\varLambda(t))}{\rm d}t=\int_{T_{\ell-1}}^{T_{\ell}}\eta\left(\frac{t}{T}\right)\Lambda_{0}\left(\frac{T_{\ell}}{T},\frac{t}{T}\right){\rm d}t=T\int_{\theta_{\ell-1}}^{\theta_{\ell}}\eta(u)\Lambda_{0}(\theta_{\ell},u){\rm d}u,

and 2T1Te(Λ(T)Λ(t))dBt\sqrt{2}\int_{T_{\ell-1}}^{T_{\ell}}{\rm e}^{-(\varLambda(T_{\ell})-\varLambda(t))}{\rm d}B_{t} is a zero-mean Gaussian random vector with covariance

2T1Te2(Λ(T)Λ(t))dtI=2T1TΛ02(TT,tT)dtI=2Tθ1θΛ02(θ,u)duI.2\int_{T_{\ell-1}}^{T_{\ell}}{\rm e}^{-2(\varLambda(T_{\ell})-\varLambda(t))}{\rm d}t\cdot I=2\int_{T_{\ell-1}}^{T_{\ell}}\Lambda_{0}^{2}\left(\frac{T_{\ell}}{T},\frac{t}{T}\right){\rm d}t\cdot I=2T\int_{\theta_{\ell-1}}^{\theta_{\ell}}\Lambda_{0}^{2}(\theta_{\ell},u){\rm d}u\cdot I.
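For reference, the closed-form update derived above requires a single gradient evaluation per interval, with the three coefficients obtained by one-dimensional quadrature. The following sketch (assuming callables eta, lam, grad_V and an illustrative signature) approximates the integrals with a trapezoidal rule.

```python
import numpy as np

def almc_step(x, grad_V, eta, lam, T, theta_prev, theta_next, rng, n_quad=200):
    """Sketch of one interval of annealed LMC: exact integration of the linear SDE on
    [T*theta_prev, T*theta_next) with grad V frozen at the left endpoint."""
    u = np.linspace(theta_prev, theta_next, n_quad)     # quadrature grid on [theta_{l-1}, theta_l]
    du = u[1] - u[0]
    lam_u = np.array([lam(v) for v in u])
    eta_u = np.array([eta(v) for v in u])
    # Lambda0(theta_l, u) = exp(-T * int_u^{theta_l} lam), via a cumulative trapezoidal rule.
    cum = np.concatenate(([0.0], np.cumsum((lam_u[:-1] + lam_u[1:]) / 2 * du)))
    Lam0 = np.exp(-T * (cum[-1] - cum))
    trap = lambda f: du * (f[0] / 2 + f[1:-1].sum() + f[-1] / 2)
    decay = Lam0[0]                                     # contraction factor exp(-(Lambda(T_l) - Lambda(T_{l-1})))
    drift = T * trap(eta_u * Lam0)                      # T * int eta(u) Lambda0(theta_l, u) du
    var = 2 * T * trap(Lam0 ** 2)                       # variance scale 2T * int Lambda0^2(theta_l, u) du
    return decay * x - drift * grad_V(x) + np.sqrt(var) * rng.standard_normal(x.shape)
```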

B.2 Proof of Theorem 2

We denote the path measure of ALMC (Equation 6) by \mathbb{Q}. Then, T\mathbb{Q}_{T}, the marginal distribution of XTX_{T}, serves as the output distribution νALMC\nu^{\rm ALMC}. As in the proof of Theorem 1, we use \mathbb{P} to denote the reference path measure of Equation 5, in which the vector field (vt)t[0,T](v_{t})_{t\in[0,T]} generates the curve of probability distributions (π~t)t[0,T](\widetilde{\pi}_{t})_{t\in[0,T]}.

Using the data-processing inequality, it suffices to show that KL()ε2\operatorname{KL}\left(\mathbb{P}\middle\|\mathbb{Q}\right)\leq\varepsilon^{2}. By the Girsanov theorem (Lemma 1) and the triangle inequality, we have

KL()\displaystyle\operatorname{KL}\left(\mathbb{P}\middle\|\mathbb{Q}\right) =140T𝔼[η(tT)(V(Xt)V(Xt))vt(Xt)2]dt\displaystyle=\frac{1}{4}\int_{0}^{T}\operatorname{\mathbb{E}}_{\mathbb{P}}\left[\left\|\eta\left(\frac{t}{T}\right)\left(\nabla V(X_{t})-\nabla V(X_{t_{-}})\right)-v_{t}(X_{t})\right\|^{2}\right]{\rm d}t
0T𝔼[η(tT)2V(Xt)V(Xt)2+vt(Xt)2]dt\displaystyle\lesssim\int_{0}^{T}\operatorname{\mathbb{E}}_{\mathbb{P}}\left[\eta\left(\frac{t}{T}\right)^{2}\left\|\nabla V(X_{t})-\nabla V(X_{t_{-}})\right\|^{2}+\left\|v_{t}(X_{t})\right\|^{2}\right]{\rm d}t
=1Mη(TT)2β2T1T𝔼[XtXt2]dt+0TvtL2(π~t)2dt.\displaystyle\leq\sum_{\ell=1}^{M}\eta\left(\frac{T_{\ell}}{T}\right)^{2}\beta^{2}\int_{T_{\ell-1}}^{T_{\ell}}\operatorname{\mathbb{E}}_{\mathbb{P}}\left[\left\|X_{t}-X_{t_{-}}\right\|^{2}\right]{\rm d}t+\int_{0}^{T}\left\|v_{t}\right\|_{L^{2}(\widetilde{\pi}_{t})}^{2}{\rm d}t.

The last inequality follows from the β\beta-smoothness of VV and the fact that η()\eta(\cdot) is non-decreasing. To bound 𝔼[XtXt2]\operatorname{\mathbb{E}}_{\mathbb{P}}\left[\left\|X_{t}-X_{t_{-}}\right\|^{2}\right], note that under \mathbb{P}, for t[T1,T)t\in[T_{\ell-1},T_{\ell}), we have

XtXt=T1t(logπ~τ+vτ)(Xτ)dτ+2(BtBT1).X_{t}-X_{t_{-}}=\int_{T_{\ell-1}}^{t}\left(\nabla\log\widetilde{\pi}_{\tau}+v_{\tau}\right)(X_{\tau}){\rm d}\tau+\sqrt{2}(B_{t}-B_{T_{\ell-1}}).

Thanks to the fact that Xtπ~tX_{t}\sim\widetilde{\pi}_{t} under \mathbb{P},

𝔼[XtXt2]\displaystyle\operatorname{\mathbb{E}}_{\mathbb{P}}\left[\left\|X_{t}-X_{t_{-}}\right\|^{2}\right] 𝔼[T1t(logπ~τ+vτ)(Xτ)dτ2]+𝔼[2(BtBT1)2]\displaystyle\lesssim\operatorname{\mathbb{E}}_{\mathbb{P}}\left[\left\|\int_{T_{\ell-1}}^{t}\left(\nabla\log\widetilde{\pi}_{\tau}+v_{\tau}\right)(X_{\tau}){\rm d}\tau\right\|^{2}\right]+\operatorname{\mathbb{E}}\left[\left\|\sqrt{2}(B_{t}-B_{T_{\ell-1}})\right\|^{2}\right]
(tT1)T1t𝔼[(logπ~τ+vτ)(Xτ)2]dτ+d(tT1)\displaystyle\lesssim(t-{T_{\ell-1}})\int_{T_{\ell-1}}^{t}\operatorname{\mathbb{E}}_{\mathbb{P}}\left[\left\|(\nabla\log\widetilde{\pi}_{\tau}+v_{\tau})(X_{\tau})\right\|^{2}\right]{\rm d}\tau+d(t-{T_{\ell-1}})
(tT1)T1t(logπ~τL2(π~τ)2+vτL2(π~τ)2)dτ+d(tT1)\displaystyle\lesssim(t-T_{\ell-1})\int_{T_{\ell-1}}^{t}\left(\left\|\nabla\log\widetilde{\pi}_{\tau}\right\|^{2}_{L^{2}(\widetilde{\pi}_{\tau})}+\left\|v_{\tau}\right\|_{L^{2}(\widetilde{\pi}_{\tau})}^{2}\right){\rm d}\tau+d(t-T_{\ell-1})
hT1T(logπ~τL2(π~τ)2+vτL2(π~τ)2)dτ+dh.\displaystyle\lesssim h_{\ell}\int_{T_{\ell-1}}^{T_{\ell}}\left(\left\|\nabla\log\widetilde{\pi}_{\tau}\right\|^{2}_{L^{2}(\widetilde{\pi}_{\tau})}+\left\|v_{\tau}\right\|_{L^{2}(\widetilde{\pi}_{\tau})}^{2}\right){\rm d}\tau+dh_{\ell}.

The second inequality follows from the Cauchy-Schwarz inequality, and the last one uses the definition h=TT1h_{\ell}=T_{\ell}-T_{\ell-1}. Integrating over t[T1,T]t\in[T_{\ell-1},T_{\ell}],

T1T𝔼[XtXt2]dth2T1T(logπ~tL2(π~t)2+vtL2(π~t)2)dt+dh2.\int_{T_{\ell-1}}^{T_{\ell}}\operatorname{\mathbb{E}}\left[\left\|X_{t}-X_{t_{-}}\right\|^{2}\right]{\rm d}t\lesssim h_{\ell}^{2}\int_{T_{\ell-1}}^{T_{\ell}}\left(\left\|\nabla\log\widetilde{\pi}_{t}\right\|^{2}_{L^{2}(\widetilde{\pi}_{t})}+\left\|v_{t}\right\|_{L^{2}(\widetilde{\pi}_{t})}^{2}\right){\rm d}t+dh_{\ell}^{2}.

Recall that the potential of π~t\widetilde{\pi}_{t} is (η(tT)β+λ(tT))\left(\eta\left(\frac{t}{T}\right)\beta+\lambda\left(\frac{t}{T}\right)\right)-smooth. By Lemma 4 and the monotonicity of η()\eta(\cdot) and λ()\lambda(\cdot), we have

T1Tlogπ~tL2(π~t)2dt\displaystyle\int_{T_{\ell-1}}^{T_{\ell}}\left\|\nabla\log\widetilde{\pi}_{t}\right\|^{2}_{L^{2}(\widetilde{\pi}_{t})}{\rm d}t T1Td(η(tT)β+λ(tT))dt\displaystyle\leq\int_{T_{\ell-1}}^{T_{\ell}}d\left(\eta\left(\frac{t}{T}\right)\beta+\lambda\left(\frac{t}{T}\right)\right){\rm d}t
dh(βη(TT)+λ(T1T))\displaystyle\leq dh_{\ell}\left(\beta\eta\left(\frac{T_{\ell}}{T}\right)+\lambda\left(\frac{T_{\ell-1}}{T}\right)\right)
=dh(βη(θ)+λ(θ1)).\displaystyle=dh_{\ell}\left(\beta\eta\left(\theta_{\ell}\right)+\lambda\left(\theta_{\ell-1}\right)\right).

Therefore, KL()\operatorname{KL}\left(\mathbb{P}\middle\|\mathbb{Q}\right) is upper bounded by

=1Mη(θ)2β2T1T𝔼[XtXt2]dt+0TvtL2(π~t)2dt\displaystyle\lesssim\sum_{\ell=1}^{M}\eta\left(\theta_{\ell}\right)^{2}\beta^{2}\int_{T_{\ell-1}}^{T_{\ell}}\operatorname{\mathbb{E}}_{\mathbb{P}}\left[\left\|X_{t}-X_{t_{-}}\right\|^{2}\right]{\rm d}t+\int_{0}^{T}\left\|v_{t}\right\|_{L^{2}(\widetilde{\pi}_{t})}^{2}{\rm d}t
=1Mη(θ)2β2(dh3(βη(θ)+λ(θ1))+h2T1TvtL2(π~t)2dt+dh2)+0TvtL2(π~t)2dt\displaystyle\lesssim\sum_{\ell=1}^{M}\eta\left(\theta_{\ell}\right)^{2}\beta^{2}\left(dh_{\ell}^{3}\left(\beta\eta\left(\theta_{\ell}\right)+\lambda\left(\theta_{\ell-1}\right)\right)+h_{\ell}^{2}\int_{T_{\ell-1}}^{T_{\ell}}\left\|v_{t}\right\|_{L^{2}(\widetilde{\pi}_{t})}^{2}{\rm d}t+dh_{\ell}^{2}\right)+\int_{0}^{T}\left\|v_{t}\right\|_{L^{2}(\widetilde{\pi}_{t})}^{2}{\rm d}t
==1M((1+η(θ)2β2h2)T1TvtL2(π~t)2dt+η(θ)2β2dh2(1+h(βη(θ)+λ(θ1)))).\displaystyle=\sum_{\ell=1}^{M}\left(\left(1+\eta(\theta_{\ell})^{2}\beta^{2}h_{\ell}^{2}\right)\int_{T_{\ell-1}}^{T_{\ell}}\left\|v_{t}\right\|_{L^{2}(\widetilde{\pi}_{t})}^{2}{\rm d}t+\eta(\theta_{\ell})^{2}\beta^{2}dh_{\ell}^{2}\left(1+h_{\ell}\left(\beta\eta(\theta_{\ell})+\lambda(\theta_{\ell-1})\right)\right)\right).

For the remaining integral, given that (vt)t[0,T](v_{t})_{t\in[0,T]} generates (π~t)t[0,T](\widetilde{\pi}_{t})_{t\in[0,T]}, according to Lemma 2, we may choose vtv_{t} such that vtL2(π~t)=|π~˙|t\left\|v_{t}\right\|_{L^{2}(\widetilde{\pi}_{t})}=\left|\dot{\widetilde{\pi}}\right|_{t}. Thus,

T1TvtL2(π~t)2dt\displaystyle\int_{T_{\ell-1}}^{T_{\ell}}\left\|v_{t}\right\|_{L^{2}(\widetilde{\pi}_{t})}^{2}{\rm d}t =T1T|π~˙|t2dt=T1T1T2|π˙|t/T2dt=1Tθ1θ|π˙|θ2dθ,\displaystyle=\int_{T_{\ell-1}}^{T_{\ell}}\left|\dot{\widetilde{\pi}}\right|_{t}^{2}{\rm d}t=\int_{T_{\ell-1}}^{T_{\ell}}\frac{1}{T^{2}}\left|\dot{\pi}\right|^{2}_{t/T}{\rm d}t=\frac{1}{T}\int_{\theta_{\ell-1}}^{\theta_{\ell}}\left|\dot{\pi}\right|^{2}_{\theta}{\rm d}\theta,

through a change of variables analogous to that used in the proof of Theorem 1. Therefore,

KL()\displaystyle\operatorname{KL}\left(\mathbb{P}\middle\|\mathbb{Q}\right) =1M(1+η(θ)2β2h2Tθ1θ|π˙|θ2dθ+η(θ)2β2dh2(1+h(βη(θ)+λ(θ1)))).\displaystyle\lesssim\sum_{\ell=1}^{M}\left(\frac{1+\eta(\theta_{\ell})^{2}\beta^{2}h_{\ell}^{2}}{T}\int_{\theta_{\ell-1}}^{\theta_{\ell}}\left|\dot{\pi}\right|^{2}_{\theta}{\rm d}\theta+\eta(\theta_{\ell})^{2}\beta^{2}dh_{\ell}^{2}\left(1+h_{\ell}\left(\beta\eta(\theta_{\ell})+\lambda(\theta_{\ell-1})\right)\right)\right).

Assume h1βdh_{\ell}\lesssim\frac{1}{\beta d} (which will be verified later), so we can further simplify the above expression to

KL()\displaystyle\operatorname{KL}\left(\mathbb{P}\middle\|\mathbb{Q}\right) =1M(1Tθ1θ|π˙|θ2dθ+η(θ)2β2dh2)\displaystyle\lesssim\sum_{\ell=1}^{M}\left(\frac{1}{T}\int_{\theta_{\ell-1}}^{\theta_{\ell}}\left|\dot{\pi}\right|^{2}_{\theta}{\rm d}\theta+\eta(\theta_{\ell})^{2}\beta^{2}dh_{\ell}^{2}\right)
=𝒜η,λ(π)T+β2d=1Mη(θ)2h2\displaystyle=\frac{{\cal A}_{\eta,\lambda}(\pi)}{T}+\beta^{2}d\sum_{\ell=1}^{M}\eta(\theta_{\ell})^{2}h_{\ell}^{2}
=𝒜η,λ(π)T+β2d=1Mη(θ)2T2(θθ1)2.\displaystyle=\frac{{\cal A}_{\eta,\lambda}(\pi)}{T}+\beta^{2}d\sum_{\ell=1}^{M}\eta(\theta_{\ell})^{2}T^{2}(\theta_{\ell}-\theta_{\ell-1})^{2}.

To bound the above expression by ε2\varepsilon^{2}, we first select T𝒜η,λ(π)ε2T\asymp\frac{{\cal A}_{\eta,\lambda}(\pi)}{\varepsilon^{2}}, mirroring the total time TT required for running annealed LD as specified in Theorem 1. This guarantees that the continuous dynamics closely approximate the reference path measure. Given that η()1\eta(\cdot)\leq 1, it remains only to ensure

β2d=1Mη(θ)2T2(θθ1)2β2d𝒜η,λ(π)2ε4=1M(θθ1)2ε2,\beta^{2}d\sum_{\ell=1}^{M}\eta(\theta_{\ell})^{2}T^{2}(\theta_{\ell}-\theta_{\ell-1})^{2}\leq\beta^{2}d\frac{{\cal A}_{\eta,\lambda}(\pi)^{2}}{\varepsilon^{4}}\sum_{\ell=1}^{M}(\theta_{\ell}-\theta_{\ell-1})^{2}\lesssim\varepsilon^{2},

which is equivalent to

=1M(θθ1)2ε6dβ2𝒜η,λ(π)2.\sum_{\ell=1}^{M}(\theta_{\ell}-\theta_{\ell-1})^{2}\lesssim\frac{\varepsilon^{6}}{d\beta^{2}{\cal A}_{\eta,\lambda}(\pi)^{2}}. (8)

To minimize MM, we apply the Cauchy-Schwarz inequality:

(=1M1)(=1M(θθ1)2)(=1M(θθ1))2=1,\left(\sum_{\ell=1}^{M}1\right)\left(\sum_{\ell=1}^{M}(\theta_{\ell}-\theta_{\ell-1})^{2}\right)\geq\left(\sum_{\ell=1}^{M}(\theta_{\ell}-\theta_{\ell-1})\right)^{2}=1,

which, together with Equation 8, yields a lower bound on MM. Equality is achieved by the uniform schedule θθ1=1M\theta_{\ell}-\theta_{\ell-1}=\frac{1}{M} for all [[1,M]]\ell\in\left[\!\left[1,M\right]\!\right]. Thus, selecting

Mdβ2𝒜η,λ(π)2ε6M\asymp\frac{d\beta^{2}{\cal A}_{\eta,\lambda}(\pi)^{2}}{\varepsilon^{6}}

satisfies the constraint given in Equation 8. In this case, the step size h=TMε4dβ2𝒜η,λ(π)h_{\ell}=\frac{T}{M}\asymp\frac{\varepsilon^{4}}{d\beta^{2}{\cal A}_{\eta,\lambda}(\pi)}, which satisfies h1βdh_{\ell}\lesssim\frac{1}{\beta d} for sufficiently small ε\varepsilon, as assumed earlier. Combining this with the O~(1)\widetilde{O}\left(1\right) complexity for sampling from π0\pi_{0} completes the proof.
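To make the resulting schedule concrete, the sketch below instantiates the choices of TT, MM, and the uniform grid from the argument above; the unspecified absolute constants are set to 1 purely for illustration.

```python
import numpy as np

def almc_schedule(eps, d, beta, action, c_T=1.0, c_M=1.0):
    """Illustrative schedule matching Theorem 2's choices (constants c_T, c_M are placeholders):
    T ~ action / eps^2, M ~ d * beta^2 * action^2 / eps^6, uniform grid theta_l = l / M."""
    T = c_T * action / eps ** 2
    M = int(np.ceil(c_M * d * beta ** 2 * action ** 2 / eps ** 6))
    thetas = np.linspace(0.0, 1.0, M + 1)   # theta_0 = 0, ..., theta_M = 1, spacing 1/M
    h = T / M                                # per-interval time, h ~ eps^4 / (d * beta^2 * action)
    return T, M, thetas, h
```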

Appendix C Proof of Example 1

The smoothness of VV comes from [11, Lemma 4].

Note that π(x)i=1Npiexp(β2xyi2)\pi(x)\propto\sum_{i=1}^{N}p_{i}\exp\left(-\frac{\beta}{2}\left\|x-y_{i}\right\|^{2}\right), and define

π^λ(x)\displaystyle\widehat{\pi}_{\lambda}(x) π(x)exp(λ2x2)\displaystyle\propto\pi(x)\exp\left(-\frac{\lambda}{2}\left\|x\right\|^{2}\right)
=i=1Npiexp(λβ2(λ+β)yi2λ+β2xβλ+βyi2)\displaystyle=\sum_{i=1}^{N}p_{i}\exp\left(-\frac{\lambda\beta}{2(\lambda+\beta)}\left\|y_{i}\right\|^{2}-\frac{\lambda+\beta}{2}\left\|x-\frac{\beta}{\lambda+\beta}y_{i}\right\|^{2}\right)
i=1Npiexp(λ+β2xβλ+βyi2)\displaystyle\propto\sum_{i=1}^{N}p_{i}\exp\left(-\frac{\lambda+\beta}{2}\left\|x-\frac{\beta}{\lambda+\beta}y_{i}\right\|^{2}\right)
=i=1Npi𝒩(βλ+βyi,1λ+βI).\displaystyle=\sum_{i=1}^{N}p_{i}\operatorname{{\cal N}}\left(\frac{\beta}{\lambda+\beta}y_{i},\frac{1}{\lambda+\beta}I\right).

We use a coupling argument to upper bound W22(π^λ,π^λ+δ)W_{2}^{2}(\widehat{\pi}_{\lambda},\widehat{\pi}_{\lambda+\delta}). We first sample II with distribution (I=i)=pi\operatorname{\mathbb{P}}\left(I=i\right)=p_{i}, i[[1,N]]i\in\left[\!\left[1,N\right]\!\right], and then independently sample η𝒩(0,I)\eta\sim\operatorname{{\cal N}}\left(0,I\right). Then,

X\displaystyle X :=βλ+βyI+1λ+βηπ^λ,\displaystyle:=\frac{\beta}{\lambda+\beta}y_{I}+\frac{1}{\sqrt{\lambda+\beta}}\eta\sim\widehat{\pi}_{\lambda},
Y\displaystyle Y :=βλ+δ+βyI+1λ+δ+βηπ^λ+δ.\displaystyle:=\frac{\beta}{\lambda+\delta+\beta}y_{I}+\frac{1}{\sqrt{\lambda+\delta+\beta}}\eta\sim\widehat{\pi}_{\lambda+\delta}.

By the definition of the W2 distance, we have

W22(π^λ,π^λ+δ)\displaystyle W_{2}^{2}(\widehat{\pi}_{\lambda},\widehat{\pi}_{\lambda+\delta}) 𝔼[XY2]\displaystyle\leq\operatorname{\mathbb{E}}\left[\left\|X-Y\right\|^{2}\right]
=𝔼I[𝔼η[(βλ+ββλ+δ+β)yI+(1λ+β1λ+δ+β)η2]]\displaystyle=\operatorname{\mathbb{E}}_{I}\left[\operatorname{\mathbb{E}}_{\eta}\left[\left\|\left(\frac{\beta}{\lambda+\beta}-\frac{\beta}{\lambda+\delta+\beta}\right)y_{I}+\left(\frac{1}{\sqrt{\lambda+\beta}}-\frac{1}{\sqrt{\lambda+\delta+\beta}}\right)\eta\right\|^{2}\right]\right]
=𝔼I[(βλ+ββλ+δ+β)2yI2+(1λ+β1λ+δ+β)2d]\displaystyle=\operatorname{\mathbb{E}}_{I}\left[{\left(\frac{\beta}{\lambda+\beta}-\frac{\beta}{\lambda+\delta+\beta}\right)^{2}\left\|y_{I}\right\|^{2}+\left(\frac{1}{\sqrt{\lambda+\beta}}-\frac{1}{\sqrt{\lambda+\delta+\beta}}\right)^{2}d}\right]
=(βλ+ββλ+δ+β)2r2+(1λ+β1λ+δ+β)2d.\displaystyle=\left(\frac{\beta}{\lambda+\beta}-\frac{\beta}{\lambda+\delta+\beta}\right)^{2}r^{2}+\left(\frac{1}{\sqrt{\lambda+\beta}}-\frac{1}{\sqrt{\lambda+\delta+\beta}}\right)^{2}d.

This implies

|π^˙|λ2=limδ0W22(π^λ,π^λ+δ)δ2β2r2(λ+β)4+d4(λ+β)3.\left|\dot{\widehat{\pi}}\right|^{2}_{\lambda}=\lim_{\delta\to 0}\frac{W_{2}^{2}(\widehat{\pi}_{\lambda},\widehat{\pi}_{\lambda+\delta})}{\delta^{2}}\leq\frac{\beta^{2}r^{2}}{(\lambda+\beta)^{4}}+\frac{d}{4(\lambda+\beta)^{3}}.
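This bound on the metric derivative can be checked numerically by simulating the explicit coupling (X,Y)(X,Y) above with a small δ\delta. The sketch below assumes, for simplicity, equal weights and centers yiy_{i} on a sphere of radius rr; all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, beta, lam, delta, r = 5, 4, 2.0, 1.0, 1e-4, 3.0
p = np.ones(N) / N
ys = rng.standard_normal((N, d))
ys *= r / np.linalg.norm(ys, axis=1, keepdims=True)    # centers y_i with ||y_i|| = r

# Simulate the coupling: shared (I, eta), X ~ pi_hat_lam, Y ~ pi_hat_{lam+delta}.
n = 200_000
I = rng.choice(N, size=n, p=p)
eta = rng.standard_normal((n, d))
b1, s1 = beta / (lam + beta), 1 / np.sqrt(lam + beta)
b2, s2 = beta / (lam + delta + beta), 1 / np.sqrt(lam + delta + beta)
sq_dist = np.sum(((b1 - b2) * ys[I] + (s1 - s2) * eta) ** 2, axis=1)

estimate = sq_dist.mean() / delta ** 2                 # finite-difference proxy for |pi_hat_dot|_lam^2
bound = beta ** 2 * r ** 2 / (lam + beta) ** 4 + d / (4 * (lam + beta) ** 3)
print(estimate, bound)                                 # estimate should not exceed the bound, up to Monte Carlo error
```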

By time reparameterization πθ=π^λ(θ)\pi_{\theta}=\widehat{\pi}_{\lambda(\theta)}, |π˙|θ=|π^˙|λ(θ)|λ˙(θ)|\left|\dot{\pi}\right|_{\theta}=\left|\dot{\widehat{\pi}}\right|_{\lambda(\theta)}\left|\dot{\lambda}(\theta)\right|. With λ(θ)=λ0(1θ)γ\lambda(\theta)=\lambda_{0}(1-\theta)^{\gamma}, |λ˙(θ)|=|γλ0(1θ)γ1|γλ0λ0\left|\dot{\lambda}(\theta)\right|=\left|\gamma\lambda_{0}(1-\theta)^{\gamma-1}\right|\leq\gamma\lambda_{0}\lesssim\lambda_{0}. Therefore,

𝒜η,λ(π)\displaystyle{\cal A}_{\eta,\lambda}(\pi) =01|π˙|θ2dθ=01|π^˙|λ(θ)2|λ˙(θ)|2dθ\displaystyle=\int_{0}^{1}\left|\dot{\pi}\right|_{\theta}^{2}{\rm d}\theta=\int_{0}^{1}\left|\dot{\widehat{\pi}}\right|_{\lambda(\theta)}^{2}\left|\dot{\lambda}(\theta)\right|^{2}{\rm d}\theta
λ001|π^˙|λ(θ)2|λ˙(θ)|dθ=λ00λ0|π^˙|λ2dλ\displaystyle\lesssim\lambda_{0}\int_{0}^{1}\left|\dot{\widehat{\pi}}\right|_{\lambda(\theta)}^{2}\left|\dot{\lambda}(\theta)\right|{\rm d}\theta=\lambda_{0}\int_{0}^{\lambda_{0}}\left|\dot{\widehat{\pi}}\right|_{\lambda}^{2}{\rm d}\lambda
=λ00λ0(β2r2(λ+β)4+d4(λ+β)3)dλ\displaystyle=\lambda_{0}\int_{0}^{\lambda_{0}}\left(\frac{\beta^{2}r^{2}}{(\lambda+\beta)^{4}}+\frac{d}{4(\lambda+\beta)^{3}}\right){\rm d}\lambda
λ0(β2r21β3+d1β2)\displaystyle\lesssim\lambda_{0}\left(\beta^{2}r^{2}\frac{1}{\beta^{3}}+d\frac{1}{\beta^{2}}\right)
dβ(r2β+1)(r2β+dβ2)\displaystyle\lesssim d\beta(r^{2}\beta+1)\left(\frac{r^{2}}{\beta}+\frac{d}{\beta^{2}}\right)
=d(r2β+1)(r2+dβ).\displaystyle=d(r^{2}\beta+1)\left(r^{2}+\frac{d}{\beta}\right).

It follows from the proof that as long as maxθ[0,1]|λ˙(θ)|\max_{\theta\in[0,1]}\left|\dot{\lambda}(\theta)\right| is bounded by a polynomial function of β\beta, dd and rr, so is the action 𝒜η,λ(π){\cal A}_{\eta,\lambda}(\pi).

Appendix D Supplementary Lemmas

Lemma 4 ([12, Lemma 4.E.1]).

Consider a probability measure μeU\mu\propto{\rm e}^{-U} on d\mathbb{R}^{d}. If 2UβI\nabla^{2}U\preceq\beta I for some β>0\beta>0, then 𝔼μ[U2]βd\operatorname{\mathbb{E}}_{\mu}\left[\left\|\nabla U\right\|^{2}\right]\leq\beta d.
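As a quick sanity check of Lemma 4 (not part of its proof), consider the Gaussian case μ=𝒩(0,Σ)\mu=\operatorname{{\cal N}}(0,\Sigma) with U(x)=12xΣ1xU(x)=\frac{1}{2}x^{\top}\Sigma^{-1}x, for which 𝔼μ[U2]=tr(Σ1)βd\operatorname{\mathbb{E}}_{\mu}\left[\left\|\nabla U\right\|^{2}\right]=\operatorname{tr}(\Sigma^{-1})\leq\beta d with β=1/λmin(Σ)\beta=1/\lambda_{\min}(\Sigma). The NumPy sketch below verifies this numerically; the covariance is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)                   # a generic positive-definite covariance
Sigma_inv = np.linalg.inv(Sigma)
beta = 1 / np.linalg.eigvalsh(Sigma)[0]       # Hessian of U is Sigma^{-1}, which is <= beta * I

X = rng.multivariate_normal(np.zeros(d), Sigma, size=100_000)
grad_U = X @ Sigma_inv                        # grad U(x) = Sigma^{-1} x (Sigma_inv is symmetric)
lhs = np.mean(np.sum(grad_U ** 2, axis=1))    # Monte Carlo estimate of E||grad U||^2 = tr(Sigma^{-1})
print(lhs, np.trace(Sigma_inv), beta * d)     # lhs ≈ tr(Sigma^{-1}) <= beta * d
```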

Lemma 5.

Under Assumption 1, πθ\pi_{\theta} defined in Equation 3 has a finite second moment when η(θ)[0,1]\eta(\theta)\in[0,1].

Proof.

When η(θ)=1\eta(\theta)=1, πθexp(Vλ(θ)22)exp(V)\pi_{\theta}\propto\exp\left(-V-\frac{\lambda(\theta)}{2}\left\|\cdot\right\|^{2}\right)\leq\exp\left(-V\right), and the claim is straightforward. Otherwise, by convexity of ueuu\mapsto{\rm e}^{-u}, we have

exp(η(θ)Vλ(θ)22)\displaystyle\exp\left(-\eta(\theta)V-\frac{\lambda(\theta)}{2}\left\|\cdot\right\|^{2}\right) =exp(η(θ)V(1η(θ))λ(θ)2(1η(θ))2)\displaystyle=\exp\left(-\eta(\theta)V-(1-\eta(\theta))\frac{\lambda(\theta)}{2(1-\eta(\theta))}\left\|\cdot\right\|^{2}\right)
η(θ)exp(V)+(1η(θ))exp(λ(θ)2(1η(θ))2).\displaystyle\leq\eta(\theta)\exp\left(-V\right)+(1-\eta(\theta))\exp\left(-\frac{\lambda(\theta)}{2(1-\eta(\theta))}\left\|\cdot\right\|^{2}\right).

Multiplying both sides by 2\left\|\cdot\right\|^{2} and integrating over d\mathbb{R}^{d}, we see that 𝔼πθ[2]<+\operatorname{\mathbb{E}}_{\pi_{\theta}}\left[\left\|\cdot\right\|^{2}\right]<{+\infty}. ∎