
Provable Benefit of Annealed Langevin Monte Carlo for Non-log-concave Sampling

Wei Guo, Molei Tao, Yongxin Chen
Georgia Institute of Technology
{wei.guo,mtao,yongchen}@gatech.edu
Abstract

We consider the outstanding problem of sampling from an unnormalized density that may be non-log-concave and multimodal. To enhance the performance of simple Markov chain Monte Carlo (MCMC) methods, techniques of annealing type have been widely used. However, quantitative theoretical guarantees of these techniques are under-explored. This study takes a first step toward providing a non-asymptotic analysis of annealed MCMC. Specifically, we establish, for the first time, an oracle complexity of $\widetilde{O}\left(\frac{d\beta^{2}\mathcal{A}^{2}}{\varepsilon^{6}}\right)$ for the simple annealed Langevin Monte Carlo algorithm to achieve $\varepsilon^{2}$ accuracy in Kullback-Leibler divergence to the target distribution $\pi\propto{\rm e}^{-V}$ on $\mathbb{R}^{d}$ with $\beta$-smooth potential $V$. Here, $\mathcal{A}$ represents the action of a curve of probability measures interpolating the target distribution $\pi$ and a readily sampleable distribution.

1 Introduction

We study the task of efficient sampling from a probability distribution $\pi\propto{\rm e}^{-V}$ on $\mathbb{R}^{d}$. This fundamental problem is pivotal across various fields including computational statistics (Liu, 2004; Brooks et al., 2011), Bayesian inference (Gelman et al., 2013), statistical physics (Newman & Barkema, 1999), and finance (Dagpunar, 2007), and has been extensively studied in the literature (Chewi, 2024). The most common approach to this problem is Markov Chain Monte Carlo (MCMC), among which Langevin Monte Carlo (LMC) (Durmus et al., 2019; Vempala & Wibisono, 2019; Chewi et al., 2022; Mousavi-Hosseini et al., 2023) is a particularly popular choice. LMC can be understood as a time-discretization of a diffusion process, known as Langevin diffusion (LD), whose stationary distribution is the target distribution $\pi$, and has been attractive partly due to its robust performance despite conceptual simplicity.

Although LMC and its variants converge rapidly when the target distribution $\pi$ is strongly log-concave or satisfies isoperimetric inequalities such as the log-Sobolev inequality (LSI) (Durmus et al., 2019; Vempala & Wibisono, 2019; Chewi et al., 2022), its effectiveness diminishes when dealing with target distributions that are not strongly log-concave or are multimodal, such as mixtures of Gaussians. In such scenarios, the sampler often becomes confined to a single mode, severely limiting its ability to explore the entire distribution effectively. This results in significant challenges in transitioning between modes, which can dramatically increase the mixing time, making it exponential in problem parameters such as dimension, distance between modes, etc. (Dong & Tong, 2022; Ma et al., 2019). Such limitations highlight the need for enhanced MCMC methodologies that can efficiently navigate the complex landscapes of multimodal distributions, thereby improving convergence rates and overall sampling efficiency.

To address the challenges posed by multimodality, techniques around the notion of annealing have been widely employed (Gelfand et al., 1990; Neal, 2001). The general philosophy involves constructing a sequence of intermediate distributions $\pi_{0},\pi_{1},...,\pi_{M}$ that bridge the gap between an easily sampleable distribution $\pi_{0}$ (e.g., Gaussian or Dirac-like) and the target distribution $\pi_{M}=\pi$. The process starts with sampling from $\pi_{0}$ and progressively samples from each subsequent distribution until $\pi_{M}$ is reached. When $\pi_{i}$ and $\pi_{i+1}$ are close enough, approximate samples from $\pi_{i}$ can serve as a warm start for sampling from $\pi_{i+1}$, thereby facilitating this transition. Employing LMC within this framework gives rise to what is known as the annealed LMC algorithm, which is the focus of our study. Despite its empirical success (Song & Ermon, 2019; 2020; Zilberstein et al., 2023; 2024), a thorough theoretical understanding of annealed LMC, particularly its non-asymptotic complexity bounds, remains elusive.

In this work, we take a first step toward developing a non-asymptotic analysis of annealed MCMC. Utilizing the Girsanov theorem to quantify the differences between the sampling dynamics and a reference process, we derive an upper bound on the error of annealed MCMC, which consists of two key terms. The first term is the ratio of an action functional in Wasserstein geometry, induced by optimal transport, to the duration of the process. This term decreases when the annealing schedule is sufficiently slow. The second term captures the discretization error inherent in practical implementations. Our approach challenges the traditional view that annealed MCMC is simply a series of warm starts. Analyses based on this perspective typically require assumptions of log-concavity or isoperimetric inequalities. In contrast, our theoretical framework for annealed MCMC dispenses with these assumptions, marking a significant shift in the understanding and analysis of annealed MCMC.

Contributions.

Our key technical contributions are summarized as follows.

  • We propose a novel strategy to analyze the non-asymptotic complexity bounds of annealed MCMC algorithms, bypassing the need for assumptions such as log-concavity or isoperimetry.

  • In Sec. 4, we investigate the annealed LD, which involves running LD with a dynamically changing target distribution. We derive a notable bound on the time required to simulate the SDE for achieving $\varepsilon^{2}$-accuracy in KL divergence.

  • Building on the insights from the analysis of the continuous dynamic and incorporating discretization errors, we establish a non-asymptotic oracle complexity bound for annealed LMC in Sec. 5, which is applicable to a wide range of annealing schemes.

Comparison of quantitative results.

The quantitative results are summarized and compared to other sampling algorithms in Tab. 1. For algorithms requiring isoperimetric assumptions, we include Langevin Monte Carlo (LMC, Vempala & Wibisono (2019)) and the Proximal Sampler (PS, Fan et al. (2023)), which converge rapidly but lack theoretical guarantees in the absence of isoperimetric assumptions. For isoperimetry-free samplers, we include Simulated Tempering Langevin Monte Carlo (STLMC, Ge et al. (2018b)), a tempering-based sampler converging rapidly for a specific family of non-log-concave distributions, and three algorithms inspired by score-based generative models: Reverse Diffusion Monte Carlo (RDMC, Huang et al. (2024a)), Recursive Score Diffusion-based Monte Carlo (RS-DMC, Huang et al. (2024b)), and Zeroth Order Diffusion-Monte Carlo (ZOD-MC, He et al. (2024)), which involve simulating the time-reversal of the Ornstein-Uhlenbeck (OU) process and require estimating the scores of the intermediate distributions. Notably, our approach operates under the least stringent assumptions and exhibits the most favorable $\varepsilon$-dependence among all isoperimetry-free sampling methods.

Table 1: Comparison of oracle complexities in terms of $d$, $\varepsilon$, and the LSI constant for sampling from $\pi\propto{\rm e}^{-V}$. "$\operatorname{poly}(\cdot)$" indicates a polynomial dependence on the specified parameters.

| Algorithm | Isoperimetric Assumptions | Other Assumptions | Criterion | Complexity |
|---|---|---|---|---|
| LMC | $C$-LSI | Potential smooth | $\varepsilon^{2}$, $\operatorname{KL}(\cdot\Vert\pi)$ | $\widetilde{O}(C^{2}d\varepsilon^{-2})$ |
| PS | $C$-LSI | Potential smooth | $\varepsilon$, $\operatorname{TV}$ | $\widetilde{O}(Cd^{1/2}\log\varepsilon^{-1})$ |
| STLMC | / | Translated mixture of a well-conditioned distribution | $\varepsilon$, $\operatorname{TV}$ | $O(\operatorname{poly}(d,\varepsilon^{-1}))$ |
| RDMC | / | Potential smooth, nearly convex at $\infty$ | $\varepsilon$, $\operatorname{TV}$ | $O(\operatorname{poly}(d){\rm e}^{\operatorname{poly}(\varepsilon^{-1})})$ |
| RS-DMC | / | Potential smooth | $\varepsilon^{2}$, $\operatorname{KL}(\pi\Vert\cdot)$ | $\exp(O(\log^{3}d\,\varepsilon^{-2}))$ |
| ZOD-MC | / | Potential growing at most quadratically | $\varepsilon$, ${\rm TV}+{\rm W}_{2}$ | $\exp(\widetilde{O}(d)O(\log\varepsilon^{-1}))$ |
| ALMC (ours) | / | Potential smooth | $\varepsilon^{2}$, $\operatorname{KL}(\pi\Vert\cdot)$ | $\widetilde{O}(d\mathcal{A}(d)^{2}\varepsilon^{-6})$ |
Related works.

We provide a brief overview of the literature, mainly focusing on the algorithms for non-log-concave sampling and their theoretical analysis.

  1. Samplers based on tempering. The fundamental concept of tempering involves sampling the system at various temperatures simultaneously: at higher temperatures, the distribution flattens, allowing particles to easily transition between modes, while at lower temperatures, particles can more effectively explore local structures. In simulated tempering (Marinari & Parisi, 1992; Woodard et al., 2009), the system’s temperature is randomly switched, while in parallel tempering (also known as replica exchange) (Swendsen & Wang, 1986; Lee & Shen, 2023), the temperatures of two particles are swapped according to a specific rule. However, quantitative theoretical results for tempering are limited, and the existing results (e.g., Ge et al. (2018a; b); Dong & Tong (2022)) apply primarily to certain special classes of non-log-concave distributions.

  2. Samplers based on general diffusions. Inspired by score-based diffusion models (Ho et al., 2020; Song et al., 2021a; b), recent advances have introduced sampling methods that reverse the OU process, as detailed in Huang et al. (2024a; b); He et al. (2024). These samplers exhibit reduced sensitivity to isoperimetric conditions, but rely on estimating score functions (gradients of log-density) via importance sampling, which poses significant challenges in high-dimensional settings. Concurrently, studies such as Zhang & Chen (2022); Vargas et al. (2023; 2024); Richter & Berner (2024) have employed neural networks to approximate unknown drift terms, enabling an SDE to transport a readily sampleable distribution to the target distribution. This approach has shown excellent performance in handling complex distributions, albeit at the expense of significant computational resources required for neural network training. In contrast, annealed LMC runs on a known interpolation of probability distributions, thus simplifying sampling by obviating the need for intensive score estimation or neural network training.

  3. Non-asymptotic analysis for non-log-concave sampling. Drawing upon the stationary-point analysis in non-convex optimization, the seminal work Balasubramanian et al. (2022) characterizes the convergence of non-log-concave sampling via Fisher divergence. Subsequently, Cheng et al. (2023) applies this methodology to examine the local mixing of LMC. However, Fisher divergence is a relatively weak criterion compared to more commonly employed metrics such as the total-variation distance or Wasserstein distances. In contrast, our study provides a convergence guarantee in terms of KL divergence, which implies convergence in total-variation distance and offers a stronger result.

Notations and definitions.

For $a,b\in\mathbb{R}$, we define $\left[\!\left[a,b\right]\!\right]:=[a,b]\cap\mathbb{Z}$, $a\wedge b:=\min(a,b)$, and $a\vee b:=\max(a,b)$. For $a,b>0$, the notations $a\lesssim b$, $a=O(b)$, and $b=\Omega(a)$ indicate that $a\leq Cb$ for some universal constant $C>0$, and the notations $a\asymp b$ and $a=\Theta(b)$ stand for both $a=O(b)$ and $b=O(a)$. $\widetilde{O}(\cdot)$ hides logarithmic dependence in $O(\cdot)$. A function $U\in C^{2}(\mathbb{R}^{d})$ is $\alpha(>0)$-strongly-convex if $\nabla^{2}U\succeq\alpha I$, and is $\beta(>0)$-smooth if $-\beta I\preceq\nabla^{2}U\preceq\beta I$ (smoothness has several different definitions, such as being $C^{1}$ or $C^{\infty}$; the one used here is a common one appearing in many standard textbooks in optimization and sampling, such as Chewi (2024)). The total-variation (TV) distance is defined as $\operatorname{TV}(\mu,\nu)=\sup_{A\subset\mathbb{R}^{d}}|\mu(A)-\nu(A)|$, and the Kullback-Leibler (KL) divergence is defined as $\operatorname{KL}(\mu\|\nu)=\mathbb{E}_{\mu}\log\frac{{\rm d}\mu}{{\rm d}\nu}$. $\|\cdot\|$ represents the $\ell^{2}$ norm on $\mathbb{R}^{d}$. For $f:\mathbb{R}^{d}\to\mathbb{R}^{d^{\prime}}$ and a probability measure $\mu$ on $\mathbb{R}^{d}$, $\|f\|_{L^{2}(\mu)}:=\left(\int\|f\|^{2}\mathrm{d}\mu\right)^{\frac{1}{2}}$, and the second-order moment of $\mu$ is defined as $\mathbb{E}_{\mu}\|\cdot\|^{2}$.

2 Preliminaries

2.1 Stochastic Differential Equations and Girsanov Theorem

A stochastic differential equation (SDE) $X=(X_{t})_{t\in[0,T]}$ is a stochastic process on $\Omega=C([0,T];\mathbb{R}^{d})$, the space of continuous functions from $[0,T]$ to $\mathbb{R}^{d}$. The dynamics of $X$ are typically represented by the equation $\mathrm{d}X_{t}=b_{t}(X)\mathrm{d}t+\sigma_{t}(X)\mathrm{d}B_{t}$, $t\in[0,T]$, where $(B_{t})_{t\in[0,T]}$ is a standard Brownian motion in $\mathbb{R}^{d}$, and $b_{t}(X)\in\mathbb{R}^{d}$, $\sigma_{t}(X)\in\mathbb{R}^{d\times d}$ depend on $(X_{s})_{s\in[0,t]}$. The path measure of $X$, denoted $\mathbb{P}^{X}$, characterizes the distribution of $X$ over $\Omega$ and is defined by $\mathbb{P}^{X}(A)=\operatorname{Pr}(X\in A)$ for all measurable subsets $A$ of $\Omega$. The following lemma, a corollary of the Girsanov theorem (Üstünel & Zakai, 2013), provides a methodology for computing the KL divergence between two path measures and serves as a crucial technical tool in our proof.

Lemma 1.

Assume we have the following two SDEs on $\Omega$:

$$\mathrm{d}X_{t}=a_{t}(X)\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t},~X_{0}\sim\mu;\qquad\mathrm{d}Y_{t}=b_{t}(Y)\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t},~Y_{0}\sim\nu.$$

Let $\mathbb{P}^{X}$ and $\mathbb{P}^{Y}$ denote the path measures of $X$ and $Y$, respectively. Then

$$\operatorname{KL}(\mathbb{P}^{X}\|\mathbb{P}^{Y})=\operatorname{KL}(\mu\|\nu)+\frac{1}{4}\mathbb{E}_{X\sim\mathbb{P}^{X}}\int_{0}^{T}\|a_{t}(X)-b_{t}(X)\|^{2}\mathrm{d}t.$$
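As a concrete instance (our own illustration, not part of the lemma), consider two Langevin diffusions started from the same initialization $\mu=\nu$, one with drift $a_{t}(X)=-\nabla V(X_{t})$ and one with drift $b_{t}(Y)=-\nabla W(Y_{t})$. Lemma 1 then specializes to

$$\operatorname{KL}(\mathbb{P}^{X}\|\mathbb{P}^{Y})=\frac{1}{4}\,\mathbb{E}_{X\sim\mathbb{P}^{X}}\int_{0}^{T}\|\nabla V(X_{t})-\nabla W(X_{t})\|^{2}\mathrm{d}t,$$

so the path-level discrepancy is controlled by the average squared difference of the two drifts along the trajectory of $X$; this is precisely the mechanism exploited in the proofs of Thms. 1 and 2.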

2.2 Langevin Diffusion and Langevin Monte Carlo

The Langevin diffusion (LD) with target distribution $\pi\propto{\rm e}^{-V}$ is the solution to the SDE

$$\mathrm{d}X_{t}=-\nabla V(X_{t})\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t},~t\in[0,\infty);~X_{0}\sim\mu_{0}. \tag{1}$$

It is well-known that under mild conditions, $\pi$ is the unique stationary distribution of this SDE, and when $\pi$ has good regularity properties, the marginal distribution of $X_{t}$ converges to $\pi$ as $t\to+\infty$, so we can sample from $\pi$ by simulating Eq. 1 for a long time. However, in most cases, LD is intractable to simulate exactly, and the Euler-Maruyama discretization of Eq. 1 leads to the Langevin Monte Carlo (LMC) algorithm. LMC with step size $h>0$ and target distribution $\pi\propto{\rm e}^{-V}$ is a Markov chain $\{X_{kh}\}_{k=0,1,...}$ constructed by iterating the following update rule:

$$X_{(k+1)h}=X_{kh}-h\nabla V(X_{kh})+\sqrt{2}(B_{(k+1)h}-B_{kh}),~k=0,1,...;~X_{0}\sim\mu_{0}, \tag{2}$$

where $\{B_{(k+1)h}-B_{kh}\}_{k=0,1,...}\stackrel{\rm i.i.d.}{\sim}\mathcal{N}(0,hI)$.
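For concreteness, the LMC update (Eq. 2) can be implemented in a few lines. The following is a minimal sketch of our own (not the authors' code), assuming access to a gradient oracle grad_V:

```python
import numpy as np

def lmc(grad_V, x0, h, n_steps, rng=None):
    """Langevin Monte Carlo (Eq. 2): x_{k+1} = x_k - h * grad V(x_k) + sqrt(2h) * N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        # The Brownian increment B_{(k+1)h} - B_{kh} ~ N(0, h I), so sqrt(2) times it is sqrt(2h) * N(0, I).
        x = x - h * grad_V(x) + np.sqrt(2 * h) * rng.standard_normal(x.shape)
    return x

# Example usage: sample from the standard Gaussian target, V(x) = ||x||^2 / 2.
sample = lmc(grad_V=lambda x: x, x0=np.zeros(2), h=0.01, n_steps=5_000)
```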

2.3 Isoperimetric Inequalities

A probability measure $\pi$ on $\mathbb{R}^{d}$ satisfies a log-Sobolev inequality (LSI) with constant $C$, or $C$-LSI, if for all $f\in C^{1}(\mathbb{R}^{d})$ with $\mathbb{E}_{\pi}f^{2}>0$,

$$\mathbb{E}_{\pi}\left[f^{2}\log\frac{f^{2}}{\mathbb{E}_{\pi}f^{2}}\right]\leq 2C\,\mathbb{E}_{\pi}\|\nabla f\|^{2}.$$

A probability measure $\pi$ on $\mathbb{R}^{d}$ satisfies a Poincaré inequality (PI) with constant $C$, or $C$-PI, if for all $f\in C^{1}(\mathbb{R}^{d})$,

$$\operatorname{Var}_{\pi}f\leq C\,\mathbb{E}_{\pi}\|\nabla f\|^{2}.$$

It is worth noting that $\alpha$-strongly-log-concave distributions satisfy $\frac{1}{\alpha}$-LSI, and $C$-LSI implies $C$-PI (Bakry et al., 2014). It is established in Vempala & Wibisono (2019) that when $\pi\propto{\rm e}^{-V}$ satisfies $C$-LSI, the LD converges exponentially fast in KL divergence; furthermore, when the potential $V$ is $\beta$-smooth, the LMC also converges exponentially with a bias that vanishes as the step size approaches $0$.
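As a quick illustration of these facts (our own example, not from the paper): for a Gaussian target,

$$\pi=\mathcal{N}(m,\sigma^{2}I):\quad V(x)=\frac{\|x-m\|^{2}}{2\sigma^{2}}+\mathrm{const},\quad\nabla^{2}V=\frac{1}{\sigma^{2}}I\ \Longrightarrow\ \pi\text{ satisfies }\sigma^{2}\text{-LSI and }\sigma^{2}\text{-PI},$$

so LD and LMC targeting it mix rapidly; the multimodal targets discussed in Sec. 1, by contrast, can have LSI constants that grow exponentially in the separation between modes (see Ex. 1 and Ex. 2).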

2.4 Wasserstein Distance and Curves of Probability Measures

We briefly introduce several fundamental concepts in optimal transport, and direct readers to authoritative textbooks (Villani, 2008; 2021; Ambrosio et al., 2008; 2021) for an in-depth exploration.

For two probability measures $\mu,\nu$ on $\mathbb{R}^{d}$ with finite second-order moments, the Wasserstein-2 ($\mathrm{W}_{2}$) distance between $\mu$ and $\nu$ is defined as

$$W_{2}(\mu,\nu)=\inf_{\gamma\in\Pi(\mu,\nu)}\left(\int\|x-y\|^{2}\gamma(\mathrm{d}x,\mathrm{d}y)\right)^{\frac{1}{2}},$$

where $\Pi(\mu,\nu)$ is the set of all couplings of $(\mu,\nu)$, i.e., probability measures $\gamma$ on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ with $\gamma(A\times\mathbb{R}^{d})=\mu(A)$ and $\gamma(\mathbb{R}^{d}\times A)=\nu(A)$ for all measurable sets $A\subset\mathbb{R}^{d}$.
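For intuition (our own example), when $\mu=\mathcal{N}(m_{1},\sigma^{2}I)$ and $\nu=\mathcal{N}(m_{2},\sigma^{2}I)$ share the same covariance, the coupling induced by the deterministic translation $y=x+m_{2}-m_{1}$ is optimal, so that

$$W_{2}(\mu,\nu)=\|m_{1}-m_{2}\|.$$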

Given a vector field $v=(v_{t}:\mathbb{R}^{d}\to\mathbb{R}^{d})_{t\in[a,b]}$ and a curve of probability measures $\rho=(\rho_{t})_{t\in[a,b]}$ on $\mathbb{R}^{d}$ with finite second-order moments, we say that $v$ generates $\rho$ if the continuity equation $\partial_{t}\rho_{t}+\nabla\cdot(\rho_{t}v_{t})=0$, $t\in[a,b]$, holds. The metric derivative of $\rho$ at $t\in[a,b]$ is defined as

$$|\dot{\rho}|_{t}:=\lim_{\delta\to 0}\frac{W_{2}(\rho_{t+\delta},\rho_{t})}{|\delta|},$$

which can be interpreted as the “speed” of this curve. If $|\dot{\rho}|_{t}$ exists and is finite for all $t\in[a,b]$, we say that $\rho$ is absolutely continuous (AC). AC is a fairly weak regularity condition on the curve of probability measures; in Lem. 4, we present a sufficient condition for AC as a formal statement.

The metric derivative and the continuity equation are closely related by the following important fact from Ambrosio et al. (2008, Theorems 8.3.1 and 8.4.5):

Lemma 2.

For an AC curve of probability measures $(\rho_{t})_{t\in[a,b]}$, any vector field $(v_{t})_{t\in[a,b]}$ that generates $(\rho_{t})_{t\in[a,b]}$ satisfies $|\dot{\rho}|_{t}\leq\|v_{t}\|_{L^{2}(\rho_{t})}$ for a.e. $t\in[a,b]$. Moreover, there exists a unique vector field $(v^{*}_{t})_{t\in[a,b]}$ generating $(\rho_{t})_{t\in[a,b]}$ that satisfies $|\dot{\rho}|_{t}=\|v^{*}_{t}\|_{L^{2}(\rho_{t})}$ for a.e. $t\in[a,b]$.

Finally, we define the action of an AC curve of probability measures $(\rho_{t})_{t\in[a,b]}$ as $\int_{a}^{b}|\dot{\rho}|_{t}^{2}\mathrm{d}t$. As will be shown in the next section, the action is a key property characterizing the effectiveness of a curve in annealed sampling. The following lemma provides additional intuition about the action and may be helpful for interested readers.

Lemma 3.

Given an AC curve of probability measures $(\rho_{t})_{t\in[0,1]}$, let $\mathcal{A}$ be its action. Then

  1. $\mathcal{A}\geq W_{2}^{2}(\rho_{0},\rho_{1})$, and the equality is attained when $(\rho_{t})_{t\in[0,1]}$ is a constant-speed Wasserstein geodesic, i.e., let two random variables $(X_{0},X_{1})$ follow the optimal coupling of $(\rho_{0},\rho_{1})$ such that $\mathbb{E}\|X_{0}-X_{1}\|^{2}=W_{2}^{2}(\rho_{0},\rho_{1})$, and define $\rho_{t}$ as the law of $(1-t)X_{0}+tX_{1}$.

  2. If $\rho_{t}$ satisfies $C_{\rm LSI}(\rho_{t})$-LSI for all $t$, then $\mathcal{A}\leq\int_{0}^{1}C_{\rm LSI}(\rho_{t})^{2}\|\partial_{t}\nabla\log\rho_{t}\|^{2}_{L^{2}(\rho_{t})}\mathrm{d}t$.

  3. If $\rho_{t}$ satisfies $C_{\rm PI}(\rho_{t})$-PI for all $t$, then $\mathcal{A}\leq\int_{0}^{1}2C_{\rm PI}(\rho_{t})\|\partial_{t}\log\rho_{t}\|_{L^{2}(\rho_{t})}^{2}\mathrm{d}t$.

The proof of Lem. 3 can be found in App. A. The lower bound above can be derived by definition or via variational representations of the action and the $\mathrm{W}_{2}$ distance, while the two upper bounds relate the action to isoperimetric inequalities, indicating that assuming a finite action is weaker than requiring LSI or PI constants to be controlled along the trajectory. In fact, these bounds may not be tight, as demonstrated by the following surprising example, in which the action is polynomial in certain problem parameters and even independent of some of them, whereas the LSI and PI constants exhibit exponential dependence. Its proof is detailed in App. A.
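For instance (our own check of item 1 of Lem. 3), take $\rho_{0}=\mathcal{N}(0,I)$ and $\rho_{1}=\mathcal{N}(m,I)$. The constant-speed geodesic is $\rho_{t}=\mathcal{N}(tm,I)$, whose metric derivative is $|\dot{\rho}|_{t}=\|m\|$ for every $t$, so

$$\mathcal{A}=\int_{0}^{1}|\dot{\rho}|_{t}^{2}\,\mathrm{d}t=\|m\|^{2}=W_{2}^{2}(\rho_{0},\rho_{1}),$$

attaining the lower bound with equality.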

Example 1.

Consider the curve that bridges a target distribution $\rho_{0}$ to a readily sampleable distribution $\rho_{1}$, in the form $\rho_{t}=\rho_{0}*\mathcal{N}(0,2StI)$, $t\in[0,1]$, for some large positive number $S$. Let $\mathcal{A}$ be the action of this curve. Then

  1. $\mathcal{A}=S\int_{0}^{S}\|\nabla\log p_{s}\|^{2}_{L^{2}(p_{s})}\mathrm{d}s$, where $p_{s}=\rho_{s/S}=\rho_{0}*\mathcal{N}(0,2sI)$, $s\in[0,S]$.

  2. In particular, when $\rho_{0}$ is a mixture of Gaussian distributions $\sum_{i=1}^{N}w_{i}\mathcal{N}(\mu_{i},\sigma^{2}I)$, then $\mathcal{A}\leq\frac{Sd}{2}\log\left(1+\frac{2S}{\sigma^{2}}\right)$, which is independent of the number of components $N$, the weights $w_{i}$, and the means $\mu_{i}$. However, in general, the LSI and PI constants may depend exponentially on $\max_{i,j}\|\mu_{i}-\mu_{j}\|$, the maximum distance between the means.

3 Problem Setup

Recall that the rationale behind annealing involves a gradual transition from $\pi_{0}$, a simple distribution that is easy to sample from, to $\pi_{1}=\pi$, the more complex target distribution. Throughout this paper, we define a curve of probability measures $(\pi_{\theta})_{\theta\in[0,1]}$, along which we will apply annealing-based MCMC for sampling from $\pi$. For now, we do not specify the exact form of this curve, but instead introduce the following mild regularity assumption:

Assumption 1.

Each $\pi_{\theta}$ has a finite second-order moment, and the curve $(\pi_{\theta})_{\theta\in[0,1]}$ is AC with finite action $\mathcal{A}=\int_{0}^{1}|\dot{\pi}|_{\theta}^{2}\mathrm{d}\theta$.

For the purpose of non-asymptotic analysis, we further introduce the following mild assumption on the target distribution $\pi\propto{\rm e}^{-V}$ on $\mathbb{R}^{d}$, which is widely used in the field of sampling (see Chewi (2024) for an overview):

Assumption 2.

The potential $V$ is $\beta$-smooth, and there exists a global minimizer $x_{*}$ of $V$ such that $\|x_{*}\|\leq R$. Moreover, $\pi$ has a finite second-order moment.

With this foundational setup, we now proceed to introduce the annealed LD and annealed LMC algorithms. Our goal is to characterize the non-asymptotic complexity of annealed MCMC algorithms for sampling from possibly non-log-concave distributions.

4 Analysis of Annealed Langevin Diffusion

To elucidate the concept of annealing more clearly, we first consider the annealed Langevin diffusion (ALD) algorithm, which samples from $\pi\propto{\rm e}^{-V}$ by running LD with a dynamically changing target distribution. For the sake of this discussion, we assume the following: (i) we can exactly sample from $\pi_{0}$, (ii) the scores $(\nabla\log\pi_{\theta})_{\theta\in[0,1]}$ are known in closed form, and (iii) we can exactly simulate any SDE with known drift and diffusion terms.

Fix a sufficiently long time $T$. We define a reparametrized curve of probability measures $(\widetilde{\pi}_{t}:=\pi_{t/T})_{t\in[0,T]}$. Starting with an initial sample $X_{0}\sim\pi_{0}=\widetilde{\pi}_{0}$, we run the SDE

$$\mathrm{d}X_{t}=\nabla\log\widetilde{\pi}_{t}(X_{t})\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t},~t\in[0,T], \tag{3}$$

and ultimately output $X_{T}\sim\nu^{\rm ALD}$ as an approximate sample from the target distribution $\pi$. Intuitively, when $\widetilde{\pi}_{t}$ is changing slowly, the distribution of $X_{t}$ should closely resemble $\widetilde{\pi}_{t}$, leading to an output distribution $\nu^{\rm ALD}$ that approximates the target distribution. This turns out to be true, as confirmed by the following theorem, which provides a convergence guarantee for the ALD process.

Theorem 1.

When choosing $T=\frac{\mathcal{A}}{4\varepsilon^{2}}$, it follows that $\operatorname{KL}(\pi\|\nu^{\rm ALD})\leq\varepsilon^{2}$.

Proof.

Let $\mathbb{Q}$ be the path measure of ALD (Eq. 3) initialized at $X_{0}\sim\widetilde{\pi}_{0}$, and define $\mathbb{P}$ as the path measure corresponding to the following reference SDE:

$$\mathrm{d}X_{t}=(\nabla\log\widetilde{\pi}_{t}+v_{t})(X_{t})\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t},~X_{0}\sim\widetilde{\pi}_{0},~t\in[0,T]. \tag{4}$$

The vector field $v=(v_{t})_{t\in[0,T]}$ is designed such that $X_{t}\sim\widetilde{\pi}_{t}$ for all $t\in[0,T]$. According to the Fokker-Planck equation (here, we assume that the solution, in terms of the curve of probability measures, of the Fokker-Planck equation given the drift and diffusion terms exists and is unique, which holds under weak regularity conditions (Le Bris & Lions, 2008)), this is equivalent to the following PDE:

$$\partial_{t}\widetilde{\pi}_{t}=-\nabla\cdot(\widetilde{\pi}_{t}(\nabla\log\widetilde{\pi}_{t}+v_{t}))+\Delta\widetilde{\pi}_{t}=-\nabla\cdot(\widetilde{\pi}_{t}v_{t}),~t\in[0,T],$$

which means that $v$ generates $(\widetilde{\pi}_{t}:=\pi_{t/T})_{t\in[0,T]}$. We can compute $\operatorname{KL}(\mathbb{P}\|\mathbb{Q})$ using Lem. 1:

$$\operatorname{KL}(\mathbb{P}\|\mathbb{Q})=\frac{1}{4}\mathbb{E}_{\mathbb{P}}\int_{0}^{T}\|v_{t}(X_{t})\|^{2}\mathrm{d}t=\frac{1}{4}\int_{0}^{T}\|v_{t}\|_{L^{2}(\widetilde{\pi}_{t})}^{2}\mathrm{d}t.$$

Leveraging Lem. 2, among all vector fields $v$ that generate $(\widetilde{\pi}_{t})_{t\in[0,T]}$, we can choose the one that minimizes $\|v_{t}\|_{L^{2}(\widetilde{\pi}_{t})}$, thereby making $\|v_{t}\|_{L^{2}(\widetilde{\pi}_{t})}=|\dot{\widetilde{\pi}}|_{t}$, the metric derivative. With the reparametrization $\widetilde{\pi}_{t}=\pi_{t/T}$, we have the following relation by the chain rule:

$$|\dot{\widetilde{\pi}}|_{t}=\lim_{\delta\to 0}\frac{W_{2}(\widetilde{\pi}_{t+\delta},\widetilde{\pi}_{t})}{|\delta|}=\lim_{\delta\to 0}\frac{W_{2}(\pi_{(t+\delta)/T},\pi_{t/T})}{T|\delta/T|}=\frac{1}{T}|\dot{\pi}|_{t/T}.$$

Employing the change-of-variable formula leads to

$$\operatorname{KL}(\mathbb{P}\|\mathbb{Q})=\frac{1}{4}\int_{0}^{T}|\dot{\widetilde{\pi}}|_{t}^{2}\mathrm{d}t=\frac{1}{4T}\int_{0}^{1}|\dot{\pi}|_{\theta}^{2}\mathrm{d}\theta=\frac{\mathcal{A}}{4T}.$$

Finally, using the data-processing inequality (see, e.g., Chewi (2024, Theorem 1.5.3)), with $T=\frac{\mathcal{A}}{4\varepsilon^{2}}$,

$$\operatorname{KL}(\pi\|\nu^{\rm ALD})=\operatorname{KL}(\mathbb{P}_{T}\|\mathbb{Q}_{T})\leq\operatorname{KL}(\mathbb{P}\|\mathbb{Q})=\varepsilon^{2},$$

where $\mathbb{P}_{T}$ and $\mathbb{Q}_{T}$ stand for the marginal distributions of $\mathbb{P}$ and $\mathbb{Q}$ at time $T$, respectively. ∎

Let us delve deeper into the mechanics of the ALD. Although at time $t$ the SDE (Eq. 3) targets the distribution $\widetilde{\pi}_{t}$, the distribution of $X_{t}$ does not precisely align with $\widetilde{\pi}_{t}$. Nevertheless, by choosing a sufficiently long time $T$, we move along the curve $(\pi_{\theta})_{\theta\in[0,1]}$ sufficiently slowly, thus minimizing the discrepancy between the path measure of $(X_{t})_{t\in[0,T]}$ and the reference process built on $(\widetilde{\pi}_{t})_{t\in[0,T]}$. Using the data-processing inequality, we can then upper bound the error between the marginal distributions at time $T$ by the KL divergence between the two path measures. In essence: the more slowly we move, the more precisely we sample.

Our analysis primarily addresses the global error across the entire curve of probability measures, rather than focusing solely on the local error at time $T$. This approach is inspired by Dalalyan & Tsybakov (2012); Chen et al. (2023), and stands in contrast to the isoperimetry-based analyses of LD (e.g., Vempala & Wibisono (2019); Chewi et al. (2022); Balasubramanian et al. (2022)), which focus on the decay of the KL divergence from the distribution of $X_{t}$ to the target distribution and require LSI to bound the time derivative of this quantity. Notably, the total time $T$ needed to run the SDE depends solely on the action of the curve $(\pi_{\theta})_{\theta\in[0,1]}$, obviating the need for assumptions related to log-concavity or isoperimetry.

It is also worth noting that ALD plays a critical role in the field of non-equilibrium stochastic thermodynamics (Seifert, 2012). Recently, a refinement of the fluctuation theorem was discovered in Chen et al. (2020); Fu et al. (2021), showing that the irreversible entropy production in a stochastic thermodynamic system equals the ratio of a similar action integral to the duration of the process $T$, closely resembling Thm. 1.

5 Analysis of Annealed Langevin Monte Carlo

It is crucial to recognize that, in practice, running the algorithm requires knowledge of the score functions $(\nabla\log\pi_{\theta})_{\theta\in[0,1]}$ in closed form. Additionally, simulating ALD (Eq. 3) necessitates discretization schemes, which introduce further errors. This section presents a detailed non-asymptotic convergence analysis of the annealed Langevin Monte Carlo (ALMC) algorithm, a practical approach for real-world implementations.

The outline of this section is as follows. We first introduce a family of interpolation curves $(\pi_{\theta})_{\theta\in[0,1]}$ that serves as the basis for our analysis. Next, we propose a discretization scheme tailored to the special structure of $(\pi_{\theta})_{\theta\in[0,1]}$. Finally, we state the complexity analysis in Thm. 2 and provide an example demonstrating the improvement of our bounds over existing results.

5.1 Choice of the Interpolation Curve

We consider the following curve of probability measures on $\mathbb{R}^{d}$:

$$\pi_{\theta}\propto\exp\left(-\eta(\theta)V-\frac{\lambda(\theta)}{2}\|\cdot\|^{2}\right),~\theta\in[0,1], \tag{5}$$

where the functions $\eta(\cdot)$ and $\lambda(\cdot)$ are called the annealing schedule. These functions must be differentiable and monotonic, satisfying the boundary conditions $\eta_{0}=\eta(0)\nearrow\eta(1)=1$ and $\lambda_{0}=\lambda(0)\searrow\lambda(1)=0$. The values of $\eta_{0}\in[0,1]$ and $\lambda_{0}\in(1,+\infty)$ are chosen so that $\pi_{0}$ corresponds to a Gaussian distribution $\mathcal{N}(0,\lambda_{0}^{-1}I)$ or can be efficiently sampled within $\widetilde{O}(1)$ steps of rejection sampling (see Lem. 5 for details). We also remark that, under Assump. 2, $\pi_{\theta}$ has a finite second-order moment (see Lem. 8 and its proof), ensuring that the $\mathrm{W}_{2}$ distance between the $\pi_{\theta}$'s is well-defined.

This flexible interpolation scheme includes many general cases. For example, Brosse et al. (2018) and Ge et al. (2020) used the schedule $\eta(\cdot)\equiv 1$, while Neal (2001) used the schedule $\lambda(\theta)=c(1-\eta(\theta))$. The key motivation for this interpolation is that when $\theta=0$, the quadratic term predominates, making the potential of $\pi_{0}$ strongly-convex with a moderate condition number, so that $\pi_{0}$ is easy to sample from; on the other hand, when $\theta=1$, $\pi_{1}$ is just the target distribution $\pi$. As $\theta$ gradually increases from $0$ to $1$, the readily sampleable distribution $\pi_{0}$ slowly transforms into the target distribution $\pi_{1}$.
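To make the interpolation concrete, the score of $\pi_{\theta}$ in Eq. 5 is available in closed form: $\nabla\log\pi_{\theta}(x)=-\eta(\theta)\nabla V(x)-\lambda(\theta)x$. The snippet below is a minimal sketch of our own, with an illustrative linear schedule $\eta(\theta)=\eta_{0}+(1-\eta_{0})\theta$ and $\lambda(\theta)=\lambda_{0}(1-\theta)$ that satisfies the boundary conditions above but is not prescribed by the paper:

```python
import numpy as np

def make_schedule(eta0=0.0, lam0=10.0):
    """Illustrative annealing schedule: eta increases from eta0 to 1, lambda decreases from lam0 to 0."""
    eta = lambda theta: eta0 + (1.0 - eta0) * theta
    lam = lambda theta: lam0 * (1.0 - theta)
    return eta, lam

def grad_log_pi_theta(x, theta, grad_V, eta, lam):
    """Closed-form score of pi_theta (Eq. 5): -eta(theta) * grad V(x) - lambda(theta) * x."""
    return -eta(theta) * grad_V(x) - lam(theta) * x

# Example with a one-dimensional double-well potential V(x) = (x^2 - 1)^2.
eta, lam = make_schedule()
score = grad_log_pi_theta(np.array([0.5]), theta=0.3,
                          grad_V=lambda x: 4 * x * (x**2 - 1), eta=eta, lam=lam)
```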

5.2 The Discretization Algorithm

With the interpolation curve $(\pi_{\theta})_{\theta\in[0,1]}$ chosen as above, a straightforward yet non-optimal method to discretize Eq. 3 is the Euler-Maruyama scheme, i.e.,

$$X_{t+\Delta t}\approx X_{t}+\nabla\log\widetilde{\pi}_{t}(X_{t})\Delta t+\sqrt{2}(B_{t+\Delta t}-B_{t}),~0\leq t<t+\Delta t\leq T.$$

However, considering that $\nabla\log\widetilde{\pi}_{t}(x)=-\eta\left(\frac{t}{T}\right)\nabla V(x)-\lambda\left(\frac{t}{T}\right)x$, the integral of the linear term can be computed in closed form, so we can use the exponential-integrator scheme (Zhang & Chen, 2023; Zhang et al., 2023) to further reduce the discretization error. Given the total time $T$, we define a sequence of points $0=\theta_{0}<\theta_{1}<...<\theta_{M}=1$, and set $T_{\ell}=T\theta_{\ell}$, $h_{\ell}=T(\theta_{\ell}-\theta_{\ell-1})$. The exponential-integrator scheme is then expressed as

$$\mathrm{d}X_{t}=\left(-\eta\left(\frac{t}{T}\right)\nabla V(X_{t_{-}})-\lambda\left(\frac{t}{T}\right)X_{t}\right)\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t},~t\in[0,T],~X_{0}\sim\pi_{0}, \tag{6}$$

where $t_{-}:=T_{\ell-1}$ when $t\in[T_{\ell-1},T_{\ell})$, $\ell\in\left[\!\left[1,M\right]\!\right]$. The explicit update rule is detailed in Alg. 1, with $x_{\ell}$ denoting $X_{T_{\ell}}$, and the derivation of Eq. 7 is presented in Sec. C.1. Notably, replacing $\nabla V(X_{t_{-}})$ with $\nabla V(X_{t})$ recovers the ALD (Eq. 3), and setting $\eta\equiv 1$ and $\lambda\equiv 0$ reduces to the classical LMC iterations.

Input: Target distribution $\pi\propto{\rm e}^{-V}$, total time $T$, annealing schedule $\eta(\cdot)$ and $\lambda(\cdot)$, discrete points $\theta_{0},...,\theta_{M}$.
For $0\leq\theta<\theta^{\prime}\leq 1$, define $\Lambda_{0}(\theta^{\prime},\theta)=\exp\left(-T\int_{\theta}^{\theta^{\prime}}\lambda(u)\mathrm{d}u\right)$, $H(\theta^{\prime},\theta)=T\int_{\theta}^{\theta^{\prime}}\eta(u)\Lambda_{0}(\theta^{\prime},u)\mathrm{d}u$, and $\Lambda_{1}(\theta^{\prime},\theta)=\sqrt{2T\int_{\theta}^{\theta^{\prime}}\Lambda_{0}^{2}(\theta^{\prime},u)\mathrm{d}u}$.
Obtain a sample $x_{0}\sim\pi_{0}$ (e.g., using rejection sampling (Alg. 2)).
for $\ell=1,2,...,M$ do
    Sample an independent Gaussian noise $\xi_{\ell}\sim\mathcal{N}(0,I)$;
    Update
    $$x_{\ell}=\Lambda_{0}(\theta_{\ell},\theta_{\ell-1})x_{\ell-1}-H(\theta_{\ell},\theta_{\ell-1})\nabla V(x_{\ell-1})+\Lambda_{1}(\theta_{\ell},\theta_{\ell-1})\xi_{\ell}. \tag{7}$$
end for
Output: $x_{M}\sim\nu^{\rm ALMC}$, an approximate sample from $\pi$.
Algorithm 1: Annealed LMC Algorithm
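The following is a minimal numerical sketch of Alg. 1 (our own illustration, not the authors' implementation). It evaluates $\Lambda_{0}$, $H$, and $\Lambda_{1}$ by simple quadrature rather than in closed form, assumes $\eta_{0}=0$ so that $\pi_{0}=\mathcal{N}(0,\lambda_{0}^{-1}I)$ can be sampled directly without rejection sampling, and uses uniformly spaced $\theta_{\ell}$; all of these choices are illustrative, not prescribed by the paper.

```python
import numpy as np

def almc(grad_V, d, T, eta, lam, M, n_quad=64, rng=None):
    """Annealed LMC (Alg. 1) with uniformly spaced theta_l and quadrature for Lambda_0, H, Lambda_1."""
    rng = np.random.default_rng() if rng is None else rng
    thetas = np.linspace(0.0, 1.0, M + 1)

    def Lambda0(tp, t):  # exp(-T * int_t^tp lambda(u) du)
        u = np.linspace(t, tp, n_quad)
        return np.exp(-T * np.trapz([lam(ui) for ui in u], u))

    def H(tp, t):        # T * int_t^tp eta(u) * Lambda0(tp, u) du
        u = np.linspace(t, tp, n_quad)
        return T * np.trapz([eta(ui) * Lambda0(tp, ui) for ui in u], u)

    def Lambda1(tp, t):  # sqrt(2T * int_t^tp Lambda0(tp, u)^2 du)
        u = np.linspace(t, tp, n_quad)
        return np.sqrt(2 * T * np.trapz([Lambda0(tp, ui) ** 2 for ui in u], u))

    # Assumes eta(0) = 0, so pi_0 = N(0, lam(0)^{-1} I) is sampled exactly.
    x = rng.standard_normal(d) / np.sqrt(lam(0.0))
    for l in range(1, M + 1):
        t, tp = thetas[l - 1], thetas[l]
        xi = rng.standard_normal(d)
        # One-step exponential-integrator update (Eq. 7) towards pi_{theta_l}.
        x = Lambda0(tp, t) * x - H(tp, t) * grad_V(x) + Lambda1(tp, t) * xi
    return x
```

Pairing this routine with the schedule and score sketch from Sec. 5.1 gives a complete, if naive, end-to-end sampler; in practice one would choose $T$ and the points $\theta_{\ell}$ as dictated by Thm. 2.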

We illustrate the ALMC algorithm in Fig. 1. The underlying intuition behind this algorithm is that by setting a sufficiently long total time $T$, the trajectory of the continuous dynamic (i.e., annealed LD) closely approaches the reference trajectory, as established in Thm. 1. Additionally, by adopting sufficiently small step sizes $h_{1},...,h_{M}$, the discretization error can be substantially reduced. Unlike traditional annealed LMC methods, which require multiple LMC update steps for each intermediate distribution $\pi_{1},...,\pi_{M}$, our approach views annealed LMC as a discretization of a continuous-time process. Consequently, it is adequate to perform a single step of LMC towards each intermediate distribution $\widetilde{\pi}_{T_{1}},...,\widetilde{\pi}_{T_{M}}$, provided that $T$ is sufficiently large and the step sizes are appropriately small.

Figure 1: Illustration of the ALMC algorithm. The sampling path measure $\mathbb{Q}$ (annealed LMC, terminating at $\nu^{\rm ALMC}$) is compared against the reference path measure $\mathbb{P}$, which passes through $\widetilde{\pi}_{0},\widetilde{\pi}_{T_{1}},...,\widetilde{\pi}_{T_{M}}=\pi$. The $\ell$-th green arrow, proceeding from left to right, represents one step of LMC towards $\widetilde{\pi}_{T_{\ell}}$ with step size $h_{\ell}$, while each red arrow corresponds to the application of the same transition kernel, initialized at $\widetilde{\pi}_{T_{\ell-1}}$ on the reference trajectory $\mathbb{P}$, which is depicted in purple. To evaluate $\operatorname{KL}(\mathbb{P}\|\mathbb{Q})$, the Girsanov theorem implies that we only need to bound the aggregate KL divergence across each small interval (i.e., the sum of the blue "distances").

5.3 Convergence Analysis of the Algorithm

The subsequent theorem provides a convergence guarantee for the annealed LMC algorithm, with a detailed proof available in Sec. C.2.

Theorem 2.

Under Assumps. 2 and 1, Alg. 1 can generate a distribution $\nu^{\rm ALMC}$ satisfying $\operatorname{KL}(\pi\|\nu^{\rm ALMC})\leq\varepsilon^{2}$ within

$$\widetilde{O}\left(\frac{d\beta^{2}\mathcal{A}^{2}}{\varepsilon^{6}}\right)$$

calls to the oracle of $V$ and $\nabla V$ in expectation.

Sketch of Proof.

Let $\mathbb{Q}$ be the path measure of ALMC (Eq. 6), the time-discretized sampling process, whose marginal distribution at time $T$ is the output distribution $\nu^{\rm ALMC}$. Again, let $\mathbb{P}$ denote the reference path measure of Eq. 4 used in the proof of Thm. 1, in which the same vector field $(v_{t})_{t\in[0,T]}$ ensures that $X_{t}\sim\widetilde{\pi}_{t}$ under $\mathbb{P}$ for all $t\in[0,T]$. Applying the Girsanov theorem (Lem. 1) and carefully dealing with the discretization error, we can upper bound $\operatorname{KL}(\mathbb{P}\|\mathbb{Q})$ by

$$\operatorname{KL}(\mathbb{P}\|\mathbb{Q})\lesssim\sum_{\ell=1}^{M}\left(\frac{1+\eta(\theta_{\ell})^{2}\beta^{2}h_{\ell}^{2}}{T}\int_{\theta_{\ell-1}}^{\theta_{\ell}}|\dot{\pi}|^{2}_{\theta}\mathrm{d}\theta+\eta(\theta_{\ell})^{2}\beta^{2}dh_{\ell}^{2}\left(1+h_{\ell}\left(\beta\eta(\theta_{\ell})+\lambda(\theta_{\ell-1})\right)\right)\right).$$

The first summation is governed by the total time $T$, which pertains to the convergence of the continuous dynamic (i.e., ALD). Setting $T\asymp\frac{\mathcal{A}}{\varepsilon^{2}}$ ensures that the first summation remains $O(\varepsilon^{2})$, provided that the step sizes $h_{\ell}$ are sufficiently small. The second summation addresses the discretization error, and it suffices to determine appropriate values of $\theta_{\ell}$ to minimize $M$, the total number of calls to the oracle of $\nabla V$ for discretizing the SDE. Combining $M$ with the complexity of sampling from $\pi_{0}$ determines the overall complexity of the algorithm. ∎

Once again, our analysis relies on bounding the global error between two path measures via the Girsanov theorem. The annealing schedule $\eta(\cdot)$ and $\lambda(\cdot)$ influences the complexity exclusively through the action $\mathcal{A}$. This crucial quantity, determined by the choice of the interpolation curve $(\pi_{\theta})_{\theta\in[0,1]}$, significantly affects the effectiveness of ALMC. Our specific choice (Eq. 5), among all interpolation curves with closed-form scores, aims at simplifying the discretization error analysis. However, alternative annealing schemes could be explored to find the optimal one in practice. For instance, a recent paper (Fan et al., 2024) proposed an annealing scheme $\pi_{\theta}(x)\propto\pi_{0}((1-a\theta)x)^{1-\theta}\pi_{1}\left(\frac{x}{b+(1-b)\theta}\right)^{\theta}$ for some $a\in[0,1]$ and $b\in(0,1]$. The extra scaling factor $\frac{1}{b+(1-b)\theta}$ might improve mixing speed. We leave the choice of the interpolation curve as a topic for future research.

We also note that our Assump. 2 encompasses strongly-log-concave target distributions. For sampling from these well-conditioned distributions via LMC, the complexity required to achieve $\varepsilon$-accuracy in TV distance scales as $\widetilde{O}(\varepsilon^{-2})$ (Chewi, 2024). However, using the Pinsker inequality $\operatorname{KL}\geq 2\operatorname{TV}^{2}$, our complexity to meet the same error criterion is $O(\varepsilon^{-6})$, indicating a significantly higher computational demand. While our discretization error is as sharp as in existing works, the main reason for our worse $\varepsilon$-dependence is the time required for the continuous dynamics to converge, whose analysis is based on the Girsanov theorem rather than LSI. While sharpening the $\varepsilon$-dependence based on the Girsanov theorem remains a challenge, our $O(\varepsilon^{-6})$ dependence still outperforms all the other isoperimetry-free bounds in Tab. 1.

Finally, we conclude by demonstrating an example involving a class of Gaussian mixture distributions, which illustrates how our analysis can improve the complexity bound from exponential to polynomial. The detailed proof is provided in App. D.

Example 2.

Consider a mixture of Gaussian distributions in $\mathbb{R}^{d}$ defined by $\pi=\sum_{i=1}^{N}p_{i}\mathcal{N}(y_{i},\beta^{-1}I)$, where the weights $p_{i}>0$, $\sum_{i=1}^{N}p_{i}=1$, and $\|y_{i}\|=r$ for all $i\in\left[\!\left[1,N\right]\!\right]$. Consequently, the potential $V=-\log\pi$ is $B$-smooth, where $B=\beta(4r^{2}\beta+1)$. With an annealing schedule defined by $\eta(\cdot)\equiv 1$ and $\lambda(\theta)=dB(1-\theta)^{\gamma}$ for some $1\leq\gamma=O(1)$, it follows that $\mathcal{A}=O\left(d(r^{2}\beta+1)\left(r^{2}+\frac{d}{\beta}\right)\right)$.

To show the superiority of our theoretical results, consider a simplified scenario with $N=2$, $y_{1}=-y_{2}$, and $r^{2}\gg\beta^{-1}$. By applying Ex. 2 and the Pinsker inequality, the total complexity required to obtain an $\varepsilon$-accurate sample in TV distance is $\widetilde{O}(d^{3}\beta^{2}r^{4}(r^{4}\beta^{2}\vee d^{2})\varepsilon^{-6})$. In contrast, studies such as Schlichting (2019); Chen et al. (2021); Dong & Tong (2022) indicate that the LSI constant of $\pi$ is $\Omega({\rm e}^{\Theta(\beta r^{2})})$, so existing bounds in Tab. 1 suggest that LMC, to achieve the same accuracy, can only provide an exponential complexity guarantee of $\widetilde{O}({\rm e}^{\Theta(\beta r^{2})}d\varepsilon^{-2})$.
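To visualize this gap (our own back-of-the-envelope computation, using the orders stated above with all hidden constants and logarithmic factors set to one), one can compare how the two guarantees grow with the mode separation $r$:

```python
import numpy as np

d, beta, eps = 2, 1.0, 0.1
for r in [2, 5, 10, 20]:
    # Polynomial bound from Ex. 2: d^3 * beta^2 * r^4 * max(r^4 beta^2, d^2) / eps^6.
    log10_almc = np.log10(d**3 * beta**2 * r**4 * max(r**4 * beta**2, d**2) / eps**6)
    # LSI-based LMC bound: exp(beta * r^2) * d / eps^2, computed in log10 to avoid overflow.
    log10_lmc = beta * r**2 / np.log(10) + np.log10(d / eps**2)
    print(f"r={r:>2}: log10(ALMC bound) ~ {log10_almc:.1f}, log10(LSI-based LMC bound) ~ {log10_lmc:.1f}")
```

As $r$ grows, the exponential guarantee quickly becomes astronomically larger than the polynomial one, which is the point of the comparison above.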

6 Experiments

We conduct simple numerical experiments to verify the findings in Ex. 2, which demonstrate that for a certain class of Gaussian mixture distributions, ALMC achieves polynomial convergence with respect to $r$. Specifically, we consider a target distribution comprising a mixture of $6$ Gaussians with uniform weights $\frac{1}{6}$, means $\left\{\left(r\cos\frac{k\pi}{3},r\sin\frac{k\pi}{3}\right):k\in\left[\!\left[0,5\right]\!\right]\right\}$, and the same covariance $0.1I$ (corresponding to $\beta=10$ in Ex. 2). We experimented with $r\in\{2,5,10,15,20,25,30\}$, and computed the number of iterations required for the empirical KL divergence from the target distribution to the sampled distribution to fall below $0.2$ (blue curve) and $0.1$ (orange curve). The results, displayed in Fig. 2, use a log-log scale for both axes. The near-linear behavior of the curves in this plot confirms that the complexity depends polynomially on $r$, validating our theory. Further details of the experimental setup and implementation can be found in App. F.

Figure 2: Relationship between the norm of the means $r$ and the number of iterations needed to reach $0.2$ and $0.1$ accuracy in KL divergence, with both axes in $\log$ scale.
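For readers who wish to reproduce a similar setup, the target of this experiment can be written down directly; the snippet below is our own sketch of the mixture potential gradient (via the usual responsibility-weighted form), not the authors' experiment code:

```python
import numpy as np

r, sigma2 = 10.0, 0.1  # mode radius and per-component variance (beta = 1/sigma2 = 10)
means = np.array([[r * np.cos(k * np.pi / 3), r * np.sin(k * np.pi / 3)] for k in range(6)])

def grad_V(x):
    """Gradient of V = -log pi for the 6-component Gaussian mixture with uniform weights."""
    diffs = x - means                                   # shape (6, 2)
    logw = -np.sum(diffs**2, axis=1) / (2 * sigma2)     # unnormalized log responsibilities
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # grad(-log pi)(x) = sum_i w_i(x) * (x - mu_i) / sigma2
    return (w[:, None] * diffs).sum(axis=0) / sigma2
```

This gradient oracle can then be passed to the ALMC sketch in Sec. 5.2 to experiment with different values of $r$.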

7 Conclusions and Future Work

In this paper, we have explored the complexity of ALMC for sampling from a non-log-concave probability measure, circumventing the reliance on log-concavity or isoperimetric inequalities. Central to our analysis are the Girsanov theorem and optimal transport techniques, providing a novel approach. While our proof primarily focuses on the annealing scheme as described in Eq. 5, it can potentially be adapted to more general interpolations. Further exploration of these applications will be a key direction for future research. Technically, our proof methodology could be expanded to a broader range of target distributions beyond those with smooth potentials, such as those with Hölder-continuous gradients (Chatterji et al., 2020; Liang & Chen, 2022; Fan et al., 2023). Eliminating the assumption of global minimizers of the potential function would further enhance the practical applicability of our algorithm. Finally, while this work emphasizes the upper bounds for non-log-concave sampling, exploring the tightness of these bounds and investigating potential lower bounds for this problem (Chewi et al., 2023a; b) are intriguing avenues for future research.

Acknowledgments

We would like to thank Ye He, Sinho Chewi, Matthew S. Zhang, Omar Chehab, Adrien Vacher, and Anna Korba for insightful discussions. MT is supported by the National Science Foundation under Award No. DMS-1847802, and WG and YC are supported by the National Science Foundation under Award No. ECCS-1942523 and DMS-2206576.

References

  • Ambrosio et al. (2008) Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel, 2 edition, 2008. doi: 10.1007/978-3-7643-8722-8.
  • Ambrosio et al. (2021) Luigi Ambrosio, Elia Brué, and Daniele Semola. Lectures on optimal transport, volume 130 of UNITEXT. Springer Cham, 2021. doi: 10.1007/978-3-030-72162-6. URL https://link.springer.com/book/10.1007/978-3-030-72162-6.
  • Bakry et al. (2014) Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of Markov diffusion operators, volume 103 of Grundlehren der mathematischen Wissenschaften. Springer Cham, 1 edition, 2014. doi: 10.1007/978-3-319-00227-9.
  • Balasubramanian et al. (2022) Krishna Balasubramanian, Sinho Chewi, Murat A Erdogdu, Adil Salim, and Shunshi Zhang. Towards a theory of non-log-concave sampling: First-order stationarity guarantees for Langevin Monte Carlo. In Po-Ling Loh and Maxim Raginsky (eds.), Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pp.  2896–2923. PMLR, 02–05 Jul 2022. URL https://proceedings.mlr.press/v178/balasubramanian22a.html.
  • Brooks et al. (2011) Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov chain Monte Carlo. CRC press, 2011.
  • Brosse et al. (2018) Nicolas Brosse, Alain Durmus, and Éric Moulines. Normalizing constants of log-concave densities. Electronic Journal of Statistics, 12(1):851 – 889, 2018. doi: 10.1214/18-EJS1411. URL https://doi.org/10.1214/18-EJS1411.
  • Chatterji et al. (2020) Niladri Chatterji, Jelena Diakonikolas, Michael I. Jordan, and Peter Bartlett. Langevin Monte Carlo without smoothness. In Silvia Chiappa and Roberto Calandra (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp.  1716–1726. PMLR, 26–28 Aug 2020. URL https://proceedings.mlr.press/v108/chatterji20a.html.
  • Chen et al. (2021) Hong-Bin Chen, Sinho Chewi, and Jonathan Niles-Weed. Dimension-free log-Sobolev inequalities for mixture distributions. Journal of Functional Analysis, 281(11):109236, 2021. ISSN 0022-1236. doi: https://doi.org/10.1016/j.jfa.2021.109236. URL https://www.sciencedirect.com/science/article/pii/S0022123621003189.
  • Chen et al. (2023) Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=zyLVMgsZ0U_.
  • Chen et al. (2020) Yongxin Chen, Tryphon T. Georgiou, and Allen Tannenbaum. Stochastic control and nonequilibrium thermodynamics: Fundamental limits. IEEE Transactions on Automatic Control, 65(7):2979–2991, 2020. doi: 10.1109/TAC.2019.2939625.
  • Cheng et al. (2023) Xiang Cheng, Bohan Wang, Jingzhao Zhang, and Yusong Zhu. Fast conditional mixing of MCMC algorithms for non-log-concave distributions. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  13374–13394. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/2b00b3331bd0f5fbfdd966ac06338f6d-Paper-Conference.pdf.
  • Chewi (2024) Sinho Chewi. Log-Concave Sampling. Book draft, in preparation, 2024. URL https://chewisinho.github.io.
  • Chewi et al. (2022) Sinho Chewi, Murat A Erdogdu, Mufan Li, Ruoqi Shen, and Shunshi Zhang. Analysis of Langevin Monte Carlo from Poincaré to log-Sobolev. In Po-Ling Loh and Maxim Raginsky (eds.), Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pp.  1–2. PMLR, 02–05 Jul 2022. URL https://proceedings.mlr.press/v178/chewi22a.html.
  • Chewi et al. (2023a) Sinho Chewi, Jaume De Dios Pont, Jerry Li, Chen Lu, and Shyam Narayanan. Query lower bounds for log-concave sampling. In 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pp.  2139–2148, 2023a. doi: 10.1109/FOCS57990.2023.00131.
  • Chewi et al. (2023b) Sinho Chewi, Patrik Gerber, Holden Lee, and Chen Lu. Fisher information lower bounds for sampling. In Shipra Agrawal and Francesco Orabona (eds.), Proceedings of The 34th International Conference on Algorithmic Learning Theory, volume 201 of Proceedings of Machine Learning Research, pp.  375–410. PMLR, 20 Feb–23 Feb 2023b. URL https://proceedings.mlr.press/v201/chewi23b.html.
  • Dagpunar (2007) John S. Dagpunar. Simulation and Monte Carlo: With Applications in Finance and MCMC. John Wiley & Sons, Ltd, 2007. doi: 10.1002/9780470061336.
  • Dalalyan & Tsybakov (2012) Arnak Dalalyan and Alexandre B. Tsybakov. Sparse regression learning by aggregation and Langevin Monte-Carlo. Journal of Computer and System Sciences, 78(5):1423–1443, 2012. ISSN 0022-0000. doi: https://doi.org/10.1016/j.jcss.2011.12.023. URL https://www.sciencedirect.com/science/article/pii/S0022000012000220. JCSS Special Issue: Cloud Computing 2011.
  • Dong & Tong (2022) Jing Dong and Xin T. Tong. Spectral gap of replica exchange langevin diffusion on mixture distributions. Stochastic Processes and their Applications, 151:451–489, 2022. ISSN 0304-4149. doi: https://doi.org/10.1016/j.spa.2022.06.006. URL https://www.sciencedirect.com/science/article/pii/S0304414922001314.
  • Durmus et al. (2019) Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of langevin monte carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019. URL http://jmlr.org/papers/v20/18-173.html.
  • Fan et al. (2023) Jiaojiao Fan, Bo Yuan, and Yongxin Chen. Improved dimension dependence of a proximal algorithm for sampling. In Gergely Neu and Lorenzo Rosasco (eds.), Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pp.  1473–1521. PMLR, 12–15 Jul 2023. URL https://proceedings.mlr.press/v195/fan23a.html.
  • Fan et al. (2024) Mingzhou Fan, Ruida Zhou, Chao Tian, and Xiaoning Qian. Path-guided particle-based sampling. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=Kt4fwiuKqf.
  • Fu et al. (2021) Rui Fu, Amirhossein Taghvaei, Yongxin Chen, and Tryphon T Georgiou. Maximal power output of a stochastic thermodynamic engine. Automatica, 123:109366, 2021.
  • Garrigos & Gower (2023) Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023.
  • Ge et al. (2018a) Rong Ge, Holden Lee, and Andrej Risteski. Beyond log-concavity: Provable guarantees for sampling multi-modal distributions using simulated tempering Langevin Monte Carlo. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018a. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/c6ede20e6f597abf4b3f6bb30cee16c7-Paper.pdf.
  • Ge et al. (2018b) Rong Ge, Holden Lee, and Andrej Risteski. Simulated tempering Langevin Monte Carlo II: An improved proof using soft Markov chain decomposition. arXiv preprint arXiv:1812.00793, 2018b.
  • Ge et al. (2020) Rong Ge, Holden Lee, and Jianfeng Lu. Estimating normalizing constants for log-concave distributions: Algorithms and lower bounds. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, pp.  579–586, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450369794. doi: 10.1145/3357713.3384289. URL https://doi.org/10.1145/3357713.3384289.
  • Gelfand et al. (1990) Saul Brian Gelfand, Sanjoy K Mitter, et al. On sampling methods and annealing algorithms. technical report, 1990.
  • Gelman et al. (2013) Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian data analysis. Chapman and Hall/CRC, 3 edition, 2013.
  • He et al. (2024) Ye He, Kevin Rojas, and Molei Tao. Zeroth-order sampling methods for non-log-concave distributions: Alleviating metastability by denoising diffusion. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=X3Aljulsw5.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Huang et al. (2024a) Xunpeng Huang, Hanze Dong, Yifan Hao, Yian Ma, and Tong Zhang. Reverse diffusion Monte Carlo. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=kIPEyMSdFV.
  • Huang et al. (2024b) Xunpeng Huang, Difan Zou, Hanze Dong, Yi-An Ma, and Tong Zhang. Faster sampling without isoperimetry via diffusion-based Monte Carlo. In Shipra Agrawal and Aaron Roth (eds.), Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pp.  2438–2493. PMLR, 30 Jun–03 Jul 2024b. URL https://proceedings.mlr.press/v247/huang24a.html.
  • Le Bris & Lions (2008) C. Le Bris and P.-L. Lions. Existence and uniqueness of solutions to Fokker–Planck type equations with irregular coefficients. Communications in Partial Differential Equations, 33(7):1272–1317, 2008. doi: 10.1080/03605300801970952. URL https://doi.org/10.1080/03605300801970952.
  • Lee & Shen (2023) Holden Lee and Zeyu Shen. Improved bound for mixing time of parallel tempering. arXiv preprint arXiv:2304.01303, 2023.
  • Liang & Chen (2022) Jiaming Liang and Yongxin Chen. A proximal algorithm for sampling from non-smooth potentials. In 2022 Winter Simulation Conference (WSC), pp.  3229–3240, 2022. doi: 10.1109/WSC57314.2022.10015293.
  • Liang & Chen (2023) Jiaming Liang and Yongxin Chen. A proximal algorithm for sampling. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=CkXOwlhf27.
  • Liu (2004) Jun S. Liu. Monte Carlo strategies in scientific computing. Springer Series in Statistics. Springer New York, NY, 2004. doi: 10.1007/978-0-387-76371-2.
  • Liu (2020) Yuan Liu. The Poincaré inequality and quadratic transportation-variance inequalities. Electronic Journal of Probability, 25(none):1 – 16, 2020. doi: 10.1214/19-EJP403. URL https://doi.org/10.1214/19-EJP403.
  • Ma et al. (2019) Yi-An Ma, Yuansi Chen, Chi Jin, Nicolas Flammarion, and Michael I. Jordan. Sampling can be faster than optimization. Proceedings of the National Academy of Sciences, 116(42):20881–20885, 2019. doi: 10.1073/pnas.1820003116. URL https://www.pnas.org/doi/abs/10.1073/pnas.1820003116.
  • Marinari & Parisi (1992) Enzo Marinari and Giorgio Parisi. Simulated tempering: A new Monte Carlo scheme. Europhysics Letters, 19(6):451, jul 1992. doi: 10.1209/0295-5075/19/6/002. URL https://dx.doi.org/10.1209/0295-5075/19/6/002.
  • Mousavi-Hosseini et al. (2023) Alireza Mousavi-Hosseini, Tyler K. Farghly, Ye He, Krishna Balasubramanian, and Murat A. Erdogdu. Towards a complete analysis of Langevin Monte Carlo: Beyond Poincaré inequality. In Gergely Neu and Lorenzo Rosasco (eds.), Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pp.  1–35. PMLR, 12–15 Jul 2023. URL https://proceedings.mlr.press/v195/mousavi-hosseini23a.html.
  • Neal (2001) Radford M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, April 2001. ISSN 1573-1375. doi: 10.1023/A:1008923215028. URL https://doi.org/10.1023/A:1008923215028.
  • Newman & Barkema (1999) Mark E. J. Newman and Gerard T. Barkema. Monte Carlo Methods in Statistical Physics. Oxford University Press, 02 1999. ISBN 9780198517962. doi: 10.1093/oso/9780198517962.001.0001. URL https://doi.org/10.1093/oso/9780198517962.001.0001.
  • Paty & Cuturi (2019) François-Pierre Paty and Marco Cuturi. Subspace robust Wasserstein distances. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp.  5072–5081. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/paty19a.html.
  • Richter & Berner (2024) Lorenz Richter and Julius Berner. Improved sampling via learned diffusions. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=h4pNROsO06.
  • Schlichting (2019) André Schlichting. Poincaré and log–Sobolev inequalities for mixtures. Entropy, 21(1), 2019. ISSN 1099-4300. doi: 10.3390/e21010089. URL https://www.mdpi.com/1099-4300/21/1/89.
  • Seifert (2012) Udo Seifert. Stochastic thermodynamics, fluctuation theorems and molecular machines. Reports on progress in physics, 75(12):126001, 2012.
  • Song et al. (2021a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=St1giarCHLP.
  • Song & Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/3001ef257407d5a371a96dcd947c7d93-Paper.pdf.
  • Song & Ermon (2020) Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  12438–12448. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/92c3b916311a5517d9290576e3ea37ad-Paper.pdf.
  • Song et al. (2021b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=PxTIG12RRHS.
  • Swendsen & Wang (1986) Robert H. Swendsen and Jian-Sheng Wang. Replica Monte Carlo simulation of spin-glasses. Phys. Rev. Lett., 57:2607–2609, Nov 1986. doi: 10.1103/PhysRevLett.57.2607. URL https://link.aps.org/doi/10.1103/PhysRevLett.57.2607.
  • Szabó (2014) Zoltán Szabó. Information theoretical estimators toolbox. Journal of Machine Learning Research, 15(9):283–287, 2014. URL http://jmlr.org/papers/v15/szabo14a.html.
  • Üstünel & Zakai (2013) Ali Süleyman Üstünel and Moshe Zakai. Transformation of Measure on Wiener Space. Springer Monographs in Mathematics. Springer Berlin, Heidelberg, 1 edition, 2013. doi: 10.1007/978-3-662-13225-8.
  • Vargas et al. (2023) Francisco Vargas, Will Sussman Grathwohl, and Arnaud Doucet. Denoising diffusion samplers. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=8pvnfTAbu1f.
  • Vargas et al. (2024) Francisco Vargas, Shreyas Padhy, Denis Blessing, and Nikolas Nüsken. Transport meets variational inference: Controlled Monte Carlo diffusions. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PP1rudnxiW.
  • Vempala & Wibisono (2019) Santosh S. Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/65a99bb7a3115fdede20da98b08a370f-Paper.pdf.
  • Villani (2008) Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin, Heidelberg, 1 edition, 2008. doi: 10.1007/978-3-540-71050-9. URL https://link.springer.com/book/10.1007/978-3-540-71050-9.
  • Villani (2021) Cédric Villani. Topics in optimal transportation, volume 58. American Mathematical Society, 2021.
  • Woodard et al. (2009) Dawn B. Woodard, Scott C. Schmidler, and Mark Huber. Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. The Annals of Applied Probability, 19(2):617 – 640, 2009. doi: 10.1214/08-AAP555. URL https://doi.org/10.1214/08-AAP555.
  • Zhang & Chen (2022) Qinsheng Zhang and Yongxin Chen. Path integral sampler: A stochastic control approach for sampling. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=_uCb2ynRu7Y.
  • Zhang & Chen (2023) Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Loek7hfb46P.
  • Zhang et al. (2023) Qinsheng Zhang, Molei Tao, and Yongxin Chen. gDDIM: Generalized denoising diffusion implicit models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1hKE9qjvz-.
  • Zilberstein et al. (2023) Nicolas Zilberstein, Chris Dick, Rahman Doost-Mohammady, Ashutosh Sabharwal, and Santiago Segarra. Annealed Langevin dynamics for massive MIMO detection. IEEE Transactions on Wireless Communications, 22(6):3762–3776, 2023. doi: 10.1109/TWC.2022.3221057.
  • Zilberstein et al. (2024) Nicolas Zilberstein, Ashutosh Sabharwal, and Santiago Segarra. Solving linear inverse problems using higher-order annealed Langevin diffusion. IEEE Transactions on Signal Processing, 72:492–505, 2024. doi: 10.1109/TSP.2023.3348202.

Appendix A Proof of Results on Action

Lemma 4 (Sufficient Condition for Absolute Continuity (Informal)).

Assume that a curve of probability distributions (πθ)θ[0,1](\pi_{\theta})_{\theta\in[0,1]} on d\mathbb{R}^{d} has a density π:[0,1]×d(θ,x)πθ(x)>0\pi:[0,1]\times\mathbb{R}^{d}\ni(\theta,x)\mapsto\pi_{\theta}(x)>0 that is jointly C1C^{1}. Then, this curve is AC.

Proof.

The following proof is informal. We leave the task of formalizing this statement for future work.

We first prove the result in one dimension. Let F_{\theta}(x)=\int_{-\infty}^{x}\pi_{\theta}(u){\rm d}u be the c.d.f. of \pi_{\theta}, which is strictly increasing since \pi_{\theta}(x)>0 everywhere. As a result, its inverse F_{\theta}^{-1} is well-defined, and (\theta,q)\mapsto F_{\theta}^{-1}(q) is also jointly C^{1}. We know from Ambrosio et al. (2008, Theorem 6.0.2) that

W22(πθ,πθ+δ)=01|Fθ+δ1(q)Fθ1(q)|2dq.W_{2}^{2}(\pi_{\theta},\pi_{\theta+\delta})=\int_{0}^{1}|F_{\theta+\delta}^{-1}(q)-F_{\theta}^{-1}(q)|^{2}{\rm d}q.

The above quantity should be O(δ2)O(\delta^{2}) as δ0\delta\to 0 due to the C1C^{1} property, and hence the curve is AC.

In d dimensions, consider the one-dimensional subspace robust W_{2} distance (Paty & Cuturi, 2019), defined as

S2(μ,ν)=supv𝕊d1W2(vμ,vν),S_{2}(\mu,\nu)=\sup_{v\in\mathbb{S}^{d-1}}W_{2}(v_{\sharp}\mu,v_{\sharp}\nu),

where \mathbb{S}^{d-1}=\{v\in\mathbb{R}^{d}:\|v\|=1\} is the set of unit vectors in \mathbb{R}^{d}, and v_{\sharp}\mu (respectively, v_{\sharp}\nu) is the law of \left\langle v,X\right\rangle when X\sim\mu (respectively, X\sim\nu), which is a probability measure on \mathbb{R}. By Paty & Cuturi (2019, Proposition 2), W_{2}(\mu,\nu)\leq\sqrt{d}S_{2}(\mu,\nu). An argument similar to the one-dimensional case shows that W_{2}^{2}(v_{\sharp}\pi_{\theta},v_{\sharp}\pi_{\theta+\delta})=O(\delta^{2}) uniformly in v, and hence W_{2}^{2}(\pi_{\theta},\pi_{\theta+\delta})\leq dS_{2}^{2}(\pi_{\theta},\pi_{\theta+\delta})=O(\delta^{2}). ∎
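As a quick numerical illustration of the one-dimensional quantile formula used above (not part of the proof), the following sketch evaluates the integral for two Gaussians, for which the squared W_2 distance has the closed form (\mu_1-\mu_2)^2+(\sigma_1-\sigma_2)^2; the particular distributions and grid size are arbitrary choices.

import numpy as np
from scipy.stats import norm

mu1, sigma1 = 0.0, 1.0
mu2, sigma2 = 1.0, 2.0

# Quantile grid avoiding the endpoints, where the quantile functions diverge.
q = np.linspace(1e-6, 1 - 1e-6, 200_000)

# int_0^1 |F_2^{-1}(q) - F_1^{-1}(q)|^2 dq, evaluated with the trapezoidal rule.
gap = norm.ppf(q, loc=mu2, scale=sigma2) - norm.ppf(q, loc=mu1, scale=sigma1)
w2_sq_numeric = np.trapz(gap ** 2, q)

w2_sq_exact = (mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2
print(w2_sq_numeric, w2_sq_exact)  # both close to 2.0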

Proof of Lem. 3.
  1.

    By the Cauchy-Schwarz inequality,

    {\cal A}=\int_{0}^{1}|\dot{\rho}|_{t}^{2}\mathrm{d}t=\int_{0}^{1}|\dot{\rho}|_{t}^{2}\mathrm{d}t\cdot\int_{0}^{1}1\mathrm{d}t\geq\left(\int_{0}^{1}|\dot{\rho}|_{t}\mathrm{d}t\right)^{2}.

    It remains to prove that W2(ρ0,ρ1)01|ρ˙|tdtW_{2}(\rho_{0},\rho_{1})\leq\int_{0}^{1}|\dot{\rho}|_{t}\mathrm{d}t. In fact, take any sequence 0=t0<t1<<tN=10=t_{0}<t_{1}<...<t_{N}=1, and let Δt:=max1iN(titi1)\Delta t:=\max_{1\leq i\leq N}(t_{i}-t_{i-1}). By triangle inequality of the W2\text{W}_{\text{2}} distance,

    W2(ρ0,ρ1)\displaystyle W_{2}(\rho_{0},\rho_{1}) i=1NW2(ρti,ρti1)\displaystyle\leq\sum_{i=1}^{N}W_{2}(\rho_{t_{i}},\rho_{t_{i-1}})
    =i=1NW2(ρti,ρti1)titi1(titi1)\displaystyle=\sum_{i=1}^{N}\frac{W_{2}(\rho_{t_{i}},\rho_{t_{i-1}})}{t_{i}-t_{i-1}}(t_{i}-t_{i-1})
    \displaystyle=\sum_{i=1}^{N}\left(|\dot{\rho}|_{t_{i}}+o(1)\right)(t_{i}-t_{i-1})\to\int_{0}^{1}|\dot{\rho}|_{t}\mathrm{d}t\quad\text{as }\Delta t\to 0.

    Another derivation is by investigating the variational representations of these two quantities. First, by Lem. 2, one can write

    𝒜=01|ρ˙|t2dt=infvt:tρt+(ρtvt)=001vtL2(ρt)2dt.{\cal A}=\int_{0}^{1}|\dot{\rho}|_{t}^{2}\mathrm{d}t=\inf_{v_{t}:~{}\partial_{t}\rho_{t}+\nabla\cdot(\rho_{t}v_{t})=0}\int_{0}^{1}\|v_{t}\|_{L^{2}(\rho_{t})}^{2}{\rm d}t.

    On the other hand, the Benamou-Brenier formula (Ambrosio et al., 2008, Equation 8.0.3) implies

    W22(ρ0,ρ1)=inf(ρt,vt):tρt+(ρtvt)=0;ρt=0=ρ0,ρt=1=ρ101vtL2(ρt)2dt.W_{2}^{2}(\rho_{0},\rho_{1})=\inf_{(\rho_{t},v_{t}):~{}\partial_{t}\rho_{t}+\nabla\cdot(\rho_{t}v_{t})=0;~{}\rho_{t=0}=\rho_{0},\rho_{t=1}=\rho_{1}}\int_{0}^{1}\|v_{t}\|_{L^{2}(\rho_{t})}^{2}{\rm d}t.

    Thus, one can easily conclude the desired inequality.

    To show that the inequality is attained when (ρt)t[0,1](\rho_{t})_{t\in[0,1]} is the constant-speed Wasserstein geodesic, note that the derivation implies that the inequality is attained when |ρ˙|t|\dot{\rho}|_{t} is a constant for all t[0,1]t\in[0,1], and for any 0t1<t2<t310\leq t_{1}<t_{2}<t_{3}\leq 1, W2(ρt1,ρt2)+W2(ρt2,ρt3)=W2(ρt1,ρt3)W_{2}(\rho_{t_{1}},\rho_{t_{2}})+W_{2}(\rho_{t_{2}},\rho_{t_{3}})=W_{2}(\rho_{t_{1}},\rho_{t_{3}}). We refer readers to Ambrosio et al. (2008, Chapter 7.2) for the construction of the constant-speed Wasserstein geodesic.

  2.

    It is known from Bakry et al. (2014) that if \pi satisfies a C-LSI, then

    KL(μπ)C2𝔼μlogdμdπ2,KL(μπ)12CW22(μ,π),μ.\operatorname{KL}(\mu\|\pi)\leq\frac{C}{2}\mathbb{E}_{\mu}\left\|\nabla\log\frac{{\rm d}\mu}{{\rm d}\pi}\right\|^{2},~{}~{}~{}~{}\operatorname{KL}(\mu\|\pi)\geq\frac{1}{2C}W_{2}^{2}(\mu,\pi),~{}\forall\mu.

    Therefore,

    1δ2W22(ρt+δ,ρt)CLSI(ρt)2logρt+δlogρt2δ2dρt+δ.\frac{1}{\delta^{2}}W_{2}^{2}(\rho_{t+\delta},\rho_{t})\leq C_{\rm LSI}(\rho_{t})^{2}\int\frac{\|\nabla\log\rho_{t+\delta}-\nabla\log\rho_{t}\|^{2}}{\delta^{2}}{\rm d}\rho_{t+\delta}.

    By letting δ0\delta\to 0 and assuming regularity conditions, we have

    |\dot{\rho}|_{t}^{2}\leq C_{\rm LSI}(\rho_{t})^{2}\int\|\partial_{t}\nabla\log\rho_{t}\|^{2}{\rm d}\rho_{t}=C_{\rm LSI}(\rho_{t})^{2}\|\partial_{t}\nabla\log\rho_{t}\|^{2}_{L^{2}(\rho_{t})}.

  3.

    By Liu (2020), we know that if \pi satisfies a C-PI, then

    W22(μ,π)2C(dμdπdμ1),μ.W_{2}^{2}(\mu,\pi)\leq 2C\left(\int\frac{{\rm d}\mu}{{\rm d}\pi}\mathrm{d}\mu-1\right),~{}\forall\mu.

    Therefore,

    1δ2W22(ρt+δ,ρt)2CPI(ρt)(ρt+δρt)2δ2ρtdx.\frac{1}{\delta^{2}}W_{2}^{2}(\rho_{t+\delta},\rho_{t})\leq 2C_{\rm PI}(\rho_{t})\int\frac{(\rho_{t+\delta}-\rho_{t})^{2}}{\delta^{2}\rho_{t}}{\rm d}x.

    By letting δ0\delta\to 0, we have

    |ρ˙|t22CPI(ρt)(tρt)2ρtdx=2CPI(ρt)tlogρtL2(ρt)2.|\dot{\rho}|_{t}^{2}\leq 2C_{\rm PI}(\rho_{t})\int\frac{(\partial_{t}\rho_{t})^{2}}{\rho_{t}}{\rm d}x=2C_{\rm PI}(\rho_{t})\|\partial_{t}\log\rho_{t}\|_{L^{2}(\rho_{t})}^{2}.

∎

Proof of Ex. 1.
  1.

    The reparameterized curve (ps)s[0,S](p_{s})_{s\in[0,S]} satisfies the standard heat equation:

    sps=Δpssps+(ps(logps))=0,\partial_{s}p_{s}=\Delta p_{s}\implies\partial_{s}p_{s}+\nabla\cdot(p_{s}(-\nabla\log p_{s}))=0,

    which means (-\nabla\log p_{s})_{s\in[0,S]} generates (p_{s})_{s\in[0,S]}. According to the uniqueness result in Lem. 2 and also Ambrosio et al. (2008, Theorem 8.3.1), the Lebesgue-a.e. unique vector field (v_{t}^{*})_{t\in[a,b]} that generates (\rho_{t})_{t\in[a,b]} and satisfies |\dot{\rho}|_{t}=\|v^{*}_{t}\|_{L^{2}(\rho_{t})} for Lebesgue-a.e. t\in[a,b] can be written as a gradient field v^{*}_{t}=\nabla\phi_{t} for some \phi_{t}:\mathbb{R}^{d}\to\mathbb{R}. Since -\nabla\log p_{s} is itself a gradient field that generates (p_{s})_{s\in[0,S]}, it coincides with this optimal vector field, which implies that

    |\dot{p}|_{s}=\|\nabla\log p_{s}\|_{L^{2}(p_{s})}.

    By the time change [0,S]\ni s\mapsto\theta=s/S\in[0,1] and integrating, we obtain the desired result.

  2.

    Since ps=i=1Nwi𝒩(μi,(σ2+2s)I)p_{s}=\sum_{i=1}^{N}w_{i}\operatorname{{\cal N}}\left(\mu_{i},(\sigma^{2}+2s)I\right), by Lem. 7, we know that 2logps1σ2+2sI-\nabla^{2}\log p_{s}\preceq\frac{1}{\sigma^{2}+2s}I. Thus Lem. 6 implies logpsL2(ps)2dσ2+2s\|\nabla\log p_{s}\|^{2}_{L^{2}(p_{s})}\leq\frac{d}{\sigma^{2}+2s}, so

    𝒜S0Sdσ2+2sds=Sd2log(1+2Sσ2).{\cal A}\leq S\int_{0}^{S}\frac{d}{\sigma^{2}+2s}\mathrm{d}s=\frac{Sd}{2}\log\left(1+\frac{2S}{\sigma^{2}}\right).

    Finally, the LSI and PI constants of Gaussian mixture distributions are, in general, exponential in the maximum distance between the means. To illustrate this, consider the simpler case of only two Gaussian components: \rho=\frac{1}{2}{\cal N}(0,\sigma^{2}I)+\frac{1}{2}{\cal N}(y,\sigma^{2}I). Schlichting (2019, Section 4.1) shows that C_{\rm PI}(\rho)\leq C_{\rm LSI}(\rho)\leq\frac{\sigma^{2}}{2}\left({\rm e}^{\|y\|^{2}/\sigma^{2}}+3\right), while Dong & Tong (2022, Proposition 1) shows that when \|y\|\gtrsim\sigma\sqrt{d}, C_{\rm LSI}(\rho)\geq C_{\rm PI}(\rho)\gtrsim\frac{\sigma^{4}}{\|y\|^{2}}{\rm e}^{\Omega(\|y\|^{2}/\sigma^{2})}. ∎
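To make the contrast concrete, the following small sketch plugs illustrative (and arbitrary) parameter values into the two displayed bounds: the action bound \frac{Sd}{2}\log\left(1+\frac{2S}{\sigma^{2}}\right) from this example and the two-component LSI upper bound of Schlichting (2019, Section 4.1).

import numpy as np

d, sigma, S = 10, 1.0, 100.0          # dimension, component standard deviation, heat-flow horizon
y_norms = np.array([2.0, 5.0, 10.0])  # distances between the two means

action_bound = S * d / 2 * np.log(1 + 2 * S / sigma ** 2)
lsi_upper_bound = sigma ** 2 / 2 * (np.exp(y_norms ** 2 / sigma ** 2) + 3)

print(action_bound)      # about 2.7e3, independent of the mode separation
print(lsi_upper_bound)   # about [2.9e1, 3.6e10, 1.3e43], exponential in ||y||^2 / sigma^2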

Appendix B Sampling from π0\pi_{0}

Lemma 5.

When η0=0\eta_{0}=0, π0=𝒩(0,λ01I)\pi_{0}=\operatorname{{\cal N}}\left(0,\lambda_{0}^{-1}I\right) can be sampled directly; when η0(0,1]\eta_{0}\in(0,1], we choose λ0=η0dβ\lambda_{0}=\eta_{0}d\beta, so that under Assump. 2, it takes O(1logη0βR2d2)=O~(1)O\left(1\vee\log\frac{\eta_{0}\beta R^{2}}{d^{2}}\right)=\widetilde{O}(1) queries to the oracle of VV and V\nabla V in expectation to precisely sample from π0\pi_{0} via rejection sampling.

Proof.

Our proof is inspired by Liang & Chen (2023). We only consider the nontrivial case η0(0,1]\eta_{0}\in(0,1]. The potential of π0\pi_{0} is V0:=η0V+λ022V_{0}:=\eta_{0}V+\frac{\lambda_{0}}{2}\|\cdot\|^{2}, which is (λ0η0β)(\lambda_{0}-\eta_{0}\beta)-strongly-convex and (λ0+η0β)(\lambda_{0}+\eta_{0}\beta)-smooth. Note that for any fixed point xdx^{\prime}\in\mathbb{R}^{d},

π0(x)exp(V0(x))exp(V0(x)V0(x),xxλ0η0β2xx2).\pi_{0}(x)\propto\exp\left(-V_{0}(x)\right)\leq\exp\left(-V_{0}(x^{\prime})-\left\langle\nabla V_{0}(x^{\prime}),x-x^{\prime}\right\rangle-\frac{\lambda_{0}-\eta_{0}\beta}{2}\|x-x^{\prime}\|^{2}\right).

The r.h.s. is the unnormalized density of π0:=𝒩(xV0(x)λ0η0β,1λ0η0βI)\pi_{0}^{\prime}:=\operatorname{{\cal N}}\left(x^{\prime}-\frac{\nabla V_{0}(x^{\prime})}{\lambda_{0}-\eta_{0}\beta},\frac{1}{\lambda_{0}-\eta_{0}\beta}I\right). The rejection sampling algorithm is as follows: sample Xπ0X\sim\pi_{0}^{\prime}, and accept XX as a sample from π0\pi_{0} with probability

\exp\left(-V_{0}(X)+V_{0}(x^{\prime})+\left\langle\nabla V_{0}(x^{\prime}),X-x^{\prime}\right\rangle+\frac{\lambda_{0}-\eta_{0}\beta}{2}\|X-x^{\prime}\|^{2}\right)\in(0,1].

By Chewi (2024, Theorem 7.1.1), the number of queries to the oracle of VV until acceptance follows a geometric distribution with mean

exp(V0(x)V0(x),xxλ0η0β2xx2)dxexp(V0(x))dx\displaystyle\frac{\int\exp\left(-V_{0}(x^{\prime})-\left\langle\nabla V_{0}(x^{\prime}),x-x^{\prime}\right\rangle-\frac{\lambda_{0}-\eta_{0}\beta}{2}\|x-x^{\prime}\|^{2}\right)\mathrm{d}x}{\int\exp\left(-V_{0}(x)\right)\mathrm{d}x}
exp(V0(x)V0(x),xxλ0η0β2xx2)dxexp(V0(x)V0(x),xxλ0+η0β2xx2)dx\displaystyle\leq\frac{\int\exp\left(-V_{0}(x^{\prime})-\left\langle\nabla V_{0}(x^{\prime}),x-x^{\prime}\right\rangle-\frac{\lambda_{0}-\eta_{0}\beta}{2}\|x-x^{\prime}\|^{2}\right)\mathrm{d}x}{\int\exp\left(-V_{0}(x^{\prime})-\left\langle\nabla V_{0}(x^{\prime}),x-x^{\prime}\right\rangle-\frac{\lambda_{0}+\eta_{0}\beta}{2}\|x-x^{\prime}\|^{2}\right)\mathrm{d}x}
=exp(V0(x),xλ0η0β2x2)dxexp(V0(x),xλ0+η0β2x2)dx\displaystyle=\frac{\int\exp\left(-\left\langle\nabla V_{0}(x^{\prime}),x\right\rangle-\frac{\lambda_{0}-\eta_{0}\beta}{2}\|x\|^{2}\right)\mathrm{d}x}{\int\exp\left(-\left\langle\nabla V_{0}(x^{\prime}),x\right\rangle-\frac{\lambda_{0}+\eta_{0}\beta}{2}\|x\|^{2}\right)\mathrm{d}x}
=exp(λ0η0β2x+V0(x)λ0η0β2+V0(x)22(λ0η0β))dxexp(λ0+η0β2x+V0(x)λ0+η0β2+V0(x)22(λ0+η0β))dx\displaystyle=\frac{\int\exp\left(-\frac{\lambda_{0}-\eta_{0}\beta}{2}\left\|x+\frac{\nabla V_{0}(x^{\prime})}{\lambda_{0}-\eta_{0}\beta}\right\|^{2}+\frac{\|\nabla V_{0}(x^{\prime})\|^{2}}{2(\lambda_{0}-\eta_{0}\beta)}\right)\mathrm{d}x}{\int\exp\left(-\frac{\lambda_{0}+\eta_{0}\beta}{2}\left\|x+\frac{\nabla V_{0}(x^{\prime})}{\lambda_{0}+\eta_{0}\beta}\right\|^{2}+\frac{\|\nabla V_{0}(x^{\prime})\|^{2}}{2(\lambda_{0}+\eta_{0}\beta)}\right)\mathrm{d}x}
=(λ0+η0βλ0η0β)d2exp(η0βλ02η02β2V0(x)2)\displaystyle=\left(\frac{\lambda_{0}+\eta_{0}\beta}{\lambda_{0}-\eta_{0}\beta}\right)^{\frac{d}{2}}\exp\left(\frac{\eta_{0}\beta}{\lambda_{0}^{2}-\eta_{0}^{2}\beta^{2}}\|\nabla V_{0}(x^{\prime})\|^{2}\right)
exp(η0βdλ0η0β)exp(η0βλ02η02β2V0(x)2)\displaystyle\leq\exp\left(\frac{\eta_{0}\beta d}{\lambda_{0}-\eta_{0}\beta}\right)\exp\left(\frac{\eta_{0}\beta}{\lambda_{0}^{2}-\eta_{0}^{2}\beta^{2}}\|\nabla V_{0}(x^{\prime})\|^{2}\right)

We choose λ0=η0βd\lambda_{0}=\eta_{0}\beta d such that exp(η0βdλ0η0β)1\exp\left(\frac{\eta_{0}\beta d}{\lambda_{0}-\eta_{0}\beta}\right)\lesssim 1. With this λ0\lambda_{0}, exp(η0βλ02η02β2V0(x)2)\exp\left(\frac{\eta_{0}\beta}{\lambda_{0}^{2}-\eta_{0}^{2}\beta^{2}}\|\nabla V_{0}(x^{\prime})\|^{2}\right) is also O(1)O(1) as long as V0(x)2η0βd2\|\nabla V_{0}(x^{\prime})\|^{2}\lesssim\eta_{0}\beta d^{2}.

Let x′′x^{\prime\prime} be the global minimizer of the strongly convex potential function V0V_{0}, which satisfies

0=V0(x′′)=η0V(x′′)+λ0x′′=η0V(x′′)+η0βdx′′.0=\nabla V_{0}(x^{\prime\prime})=\eta_{0}\nabla V(x^{\prime\prime})+\lambda_{0}x^{\prime\prime}=\eta_{0}\nabla V(x^{\prime\prime})+\eta_{0}\beta dx^{\prime\prime}.

Given the smoothness of V0V_{0},

V0(x)2=V0(x)V0(x′′)2(η0βd+η0β)2xx′′2.\|\nabla V_{0}(x^{\prime})\|^{2}=\|\nabla V_{0}(x^{\prime})-\nabla V_{0}(x^{\prime\prime})\|^{2}\leq(\eta_{0}\beta d+\eta_{0}\beta)^{2}\|x^{\prime}-x^{\prime\prime}\|^{2}.

Therefore, to guarantee V0(x)2η0βd2\|\nabla V_{0}(x^{\prime})\|^{2}\lesssim\eta_{0}\beta d^{2}, it suffices to find an xx^{\prime} such that xx′′1η0β\|x^{\prime}-x^{\prime\prime}\|\lesssim\frac{1}{\sqrt{\eta_{0}\beta}}.

We first derive an upper bound of x′′\|x^{\prime\prime}\| under Assump. 2. Since xx_{*} is a global minimizer of VV,

βdx′′\displaystyle\beta d\|x^{\prime\prime}\| =V(x′′)=V(x′′)V(x)\displaystyle=\|\nabla V(x^{\prime\prime})\|=\|\nabla V(x^{\prime\prime})-\nabla V(x_{*})\|
βx′′xβ(x′′+R),\displaystyle\leq\beta\|x^{\prime\prime}-x_{*}\|\leq\beta(\|x^{\prime\prime}\|+R),
x′′\displaystyle\implies\|x^{\prime\prime}\| Rd1Rd.\displaystyle\leq\frac{R}{d-1}\lesssim\frac{R}{d}.

When R=0R=0, i.e., 0 is a known global minimizer of both VV and V0V_{0}, we can directly set x=0x^{\prime}=0; otherwise, we need to run optimization algorithms to find such an xx^{\prime}. According to standard results in convex optimization (see, e.g., Garrigos & Gower (2023, Theorem 3.6)), running gradient descent on function V0V_{0} with step size 1λ0+η0β\frac{1}{\lambda_{0}+\eta_{0}\beta} yields the following convergence rate: starting from x0=0x_{0}=0, the tt-th iterate xtx_{t} satisfies

xtx′′2(1λ0η0βλ0+η0β)t0x′′2(2d)tR2d2R2etd2,\|x_{t}-x^{\prime\prime}\|^{2}\leq\left(1-\frac{\lambda_{0}-\eta_{0}\beta}{\lambda_{0}+\eta_{0}\beta}\right)^{t}\|0-x^{\prime\prime}\|^{2}\lesssim\left(\frac{2}{d}\right)^{t}\frac{R^{2}}{d^{2}}\leq\frac{R^{2}}{{\rm e}^{t}d^{2}},

where the last inequality holds when d6d\geq 6. Thus, logη0βR2d2+O(1)\log\frac{\eta_{0}\beta R^{2}}{d^{2}}+O(1) iterations are sufficient to find a desired xx^{\prime}. We summarize the rejection sampling in Alg. 2. In conclusion, precisely sampling from π0\pi_{0} requires

O(1logη0βR2d2)=O~(1)O\left(1\vee\log\frac{\eta_{0}\beta R^{2}}{d^{2}}\right)=\widetilde{O}(1)

calls to the oracle of VV and V\nabla V in expectation. ∎

Input: πexp(V)\pi\propto\exp\left(-V\right), η0(0,1]\eta_{0}\in(0,1], λ0=η0βd\lambda_{0}=\eta_{0}\beta d.
1 Let V0:=η0V+λ022V_{0}:=\eta_{0}V+\frac{\lambda_{0}}{2}\|\cdot\|^{2};
2 Use gradient descent to find an xx^{\prime} that is O(1η0β)O\left(\frac{1}{\sqrt{\eta_{0}\beta}}\right)-close to the global minimizer of V0V_{0};
3 Let π0:=𝒩(xV0(x)λ0η0β,1λ0η0βI)\pi_{0}^{\prime}:=\operatorname{{\cal N}}\left(x^{\prime}-\frac{\nabla V_{0}(x^{\prime})}{\lambda_{0}-\eta_{0}\beta},\frac{1}{\lambda_{0}-\eta_{0}\beta}I\right);
4 while True do
5       Sample X\sim\pi_{0}^{\prime} and U\sim\mathrm{Unif}([0,1]) independently;
6       Accept X as a sample from \pi_{0} and exit the loop if
U\leq\exp\left(-V_{0}(X)+V_{0}(x^{\prime})+\left\langle\nabla V_{0}(x^{\prime}),X-x^{\prime}\right\rangle+\frac{\lambda_{0}-\eta_{0}\beta}{2}\|X-x^{\prime}\|^{2}\right).
7 end while
Output: Xπ0X\sim\pi_{0}.
Algorithm 2 Rejection Sampling for π0\pi_{0}.
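For concreteness, here is a minimal NumPy sketch of Alg. 2. It is only an illustration under the stated assumptions: the callables V and grad_V, the fixed number of gradient-descent iterations, and the final smoke test are choices made here, not part of the algorithm's statement.

import numpy as np

def sample_pi0(V, grad_V, beta, d, eta0, n_gd_steps=100, rng=None):
    # Draw one exact sample from pi_0 proportional to exp(-eta0*V(x) - lambda0/2*||x||^2),
    # with lambda0 = eta0*beta*d as in Lem. 5.
    rng = np.random.default_rng() if rng is None else rng
    lam0 = eta0 * beta * d
    V0 = lambda x: eta0 * V(x) + 0.5 * lam0 * np.dot(x, x)       # potential of pi_0
    grad_V0 = lambda x: eta0 * grad_V(x) + lam0 * x
    mu, L = lam0 - eta0 * beta, lam0 + eta0 * beta                # strong convexity / smoothness of V_0

    # Gradient descent with step size 1/L, started from 0, to locate an approximate minimizer x'.
    x_prime = np.zeros(d)
    for _ in range(n_gd_steps):
        x_prime = x_prime - grad_V0(x_prime) / L

    g = grad_V0(x_prime)
    mean = x_prime - g / mu                                       # proposal pi_0' = N(mean, I/mu)
    while True:
        X = mean + rng.standard_normal(d) / np.sqrt(mu)
        log_accept = (-V0(X) + V0(x_prime) + g @ (X - x_prime)
                      + 0.5 * mu * np.dot(X - x_prime, X - x_prime))  # <= 0 by strong convexity
        if np.log(rng.uniform()) <= log_accept:
            return X

# Smoke test with the beta-smooth potential V(x) = (beta/2)*||x||^2.
beta = 2.0
print(sample_pi0(lambda x: 0.5 * beta * np.dot(x, x), lambda x: beta * x, beta=beta, d=3, eta0=1.0))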
Remark.

The parameter RR reflects prior knowledge about global minimizer(s) of the potential function VV. Unless it is exceptionally large, indicating scarce prior information about the global minimizer(s) of VV, this O~(1)\widetilde{O}(1) complexity is negligible compared to the overall complexity of sampling. In particular, when the exact location of a global minimizer of VV is known, we can adjust the potential VV so that 0 becomes a global minimizer, thereby reducing the complexity to O(1)O(1).

Appendix C Proofs for Annealed LMC

C.1 Proof of Eq. 7

Define Λ(t):=0tλ(τT)dτ\varLambda(t):=\int_{0}^{t}\lambda\left(\frac{\tau}{T}\right)\mathrm{d}\tau, whose derivative is Λ(t)=λ(tT)\varLambda^{\prime}(t)=\lambda\left(\frac{t}{T}\right). By Itô’s formula, we have

d(eΛ(t)Xt)=eΛ(t)(λ(tT)Xtdt+dXt)=eΛ(t)(η(tT)V(Xt)dt+2dBt).\mathrm{d}\left({\rm e}^{\varLambda(t)}X_{t}\right)={\rm e}^{\varLambda(t)}\left(\lambda\left(\frac{t}{T}\right)X_{t}\mathrm{d}t+\mathrm{d}X_{t}\right)={\rm e}^{\varLambda(t)}\left(-\eta\left(\frac{t}{T}\right)\nabla V(X_{t_{-}})\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t}\right).

Integrating over t[T1,T)t\in[T_{\ell-1},T_{\ell}) (note that in this case t=T1t_{-}=T_{\ell-1}), we obtain

eΛ(T)XTeΛ(T1)XT1\displaystyle{\rm e}^{\varLambda(T_{\ell})}X_{T_{\ell}}-{\rm e}^{\varLambda(T_{\ell-1})}X_{T_{\ell-1}} =(T1Tη(tT)eΛ(t)dt)V(XT1)+2T1TeΛ(t)dBt,\displaystyle=-\left(\int_{T_{\ell-1}}^{T_{\ell}}\eta\left(\frac{t}{T}\right){\rm e}^{\varLambda(t)}\mathrm{d}t\right)\nabla V(X_{T_{\ell-1}})+\sqrt{2}\int_{T_{\ell-1}}^{T_{\ell}}{\rm e}^{\varLambda(t)}\mathrm{d}B_{t},

i.e.,

XT\displaystyle X_{T_{\ell}} =e(Λ(T)Λ(T1))XT1(T1Tη(tT)e(Λ(T)Λ(t))dt)V(XT1)\displaystyle={\rm e}^{-(\varLambda(T_{\ell})-\varLambda(T_{\ell-1}))}X_{T_{\ell-1}}-\left(\int_{T_{\ell-1}}^{T_{\ell}}\eta\left(\frac{t}{T}\right){\rm e}^{-(\varLambda(T_{\ell})-\varLambda(t))}\mathrm{d}t\right)\nabla V(X_{T_{\ell-1}})
+2T1Te(Λ(T)Λ(t))dBt.\displaystyle+\sqrt{2}\int_{T_{\ell-1}}^{T_{\ell}}{\rm e}^{-(\varLambda(T_{\ell})-\varLambda(t))}\mathrm{d}B_{t}.

Since

Λ(T)Λ(t)=tTλ(τT)dτ=Tt/TT/Tλ(u)du,\varLambda(T_{\ell})-\varLambda(t)=\int_{t}^{T_{\ell}}\lambda\left(\frac{\tau}{T}\right)\mathrm{d}\tau=T\int_{t/T}^{T_{\ell}/T}\lambda(u)\mathrm{d}u,

by defining Λ0(θ,θ)=exp(Tθθλ(u)du)\Lambda_{0}(\theta^{\prime},\theta)=\exp\left(-T\int_{\theta}^{\theta^{\prime}}\lambda(u)\mathrm{d}u\right), we have e(Λ(T)Λ(T1))=Λ0(θ,θ1){\rm e}^{-(\varLambda(T_{\ell})-\varLambda(T_{\ell-1}))}=\Lambda_{0}(\theta_{\ell},\theta_{\ell-1}). Similarly, we can show that

T1Tη(tT)e(Λ(T)Λ(t))dt=T1Tη(tT)Λ0(TT,tT)dt=Tθ1θη(u)Λ0(θ,u)du,\int_{T_{\ell-1}}^{T_{\ell}}\eta\left(\frac{t}{T}\right){\rm e}^{-(\varLambda(T_{\ell})-\varLambda(t))}\mathrm{d}t=\int_{T_{\ell-1}}^{T_{\ell}}\eta\left(\frac{t}{T}\right)\Lambda_{0}\left(\frac{T_{\ell}}{T},\frac{t}{T}\right)\mathrm{d}t=T\int_{\theta_{\ell-1}}^{\theta_{\ell}}\eta(u)\Lambda_{0}(\theta_{\ell},u)\mathrm{d}u,

and 2T1Te(Λ(T)Λ(t))dBt\sqrt{2}\int_{T_{\ell-1}}^{T_{\ell}}{\rm e}^{-(\varLambda(T_{\ell})-\varLambda(t))}\mathrm{d}B_{t} is a zero-mean Gaussian random vector with covariance

2T1Te2(Λ(T)Λ(t))dtI=2T1TΛ02(TT,tT)dtI=2Tθ1θΛ02(θ,u)duI.2\int_{T_{\ell-1}}^{T_{\ell}}{\rm e}^{-2(\varLambda(T_{\ell})-\varLambda(t))}\mathrm{d}t\cdot I=2\int_{T_{\ell-1}}^{T_{\ell}}\Lambda_{0}^{2}\left(\frac{T_{\ell}}{T},\frac{t}{T}\right)\mathrm{d}t\cdot I=2T\int_{\theta_{\ell-1}}^{\theta_{\ell}}\Lambda_{0}^{2}(\theta_{\ell},u)\mathrm{d}u\cdot I.

∎
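As a sketch of how the update just derived (Eq. 7) can be implemented, the following code evaluates its three scalar coefficients by numerical quadrature, assuming the schedules \eta(\cdot) and \lambda(\cdot) are available as Python callables; the use of scipy.integrate.quad and the constant-schedule example are illustrative choices, not part of the paper's algorithm.

import numpy as np
from scipy.integrate import quad

def almc_step_coefficients(eta, lam, T, theta_prev, theta_next):
    # Lambda_0(theta', theta) = exp(-T * int_theta^{theta'} lambda(u) du)
    Lambda0 = lambda tp, t: np.exp(-T * quad(lam, t, tp)[0])

    decay = Lambda0(theta_next, theta_prev)                       # coefficient of X_{T_{l-1}}
    drift = T * quad(lambda u: eta(u) * Lambda0(theta_next, u),
                     theta_prev, theta_next)[0]                   # coefficient of grad V(X_{T_{l-1}})
    noise_var = 2 * T * quad(lambda u: Lambda0(theta_next, u) ** 2,
                             theta_prev, theta_next)[0]           # variance of the Gaussian term (times I)
    return decay, drift, noise_var

# Sanity check: with eta = 1 and lambda = 0 the update reduces to plain LMC,
# i.e. decay = 1, drift = T*(theta_next - theta_prev), noise_var = 2*T*(theta_next - theta_prev).
print(almc_step_coefficients(lambda u: 1.0, lambda u: 0.0, T=10.0, theta_prev=0.0, theta_next=0.01))

One step of the update then reads X_next = decay * X_prev - drift * grad_V(X_prev) + sqrt(noise_var) * xi, with xi a standard Gaussian vector.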

C.2 Proof of Thm. 2

We denote the path measure of ALMC (Eq. 6) by \mathbb{Q}. Then \mathbb{Q}_{T}, the marginal distribution of X_{T}, serves as the output distribution \nu^{\rm ALMC}. Similar to the methodology in the proof of Thm. 1, we use \mathbb{P} to denote the reference path measure of Eq. 4, in which the vector field (v_{t})_{t\in[0,T]} generates the curve of probability distributions (\widetilde{\pi}_{t})_{t\in[0,T]}.

Using the data-processing inequality, it suffices to demonstrate that \operatorname{KL}(\mathbb{P}\|\mathbb{Q})\leq\varepsilon^{2}. By the Girsanov theorem (Lem. 1) and the triangle inequality, we have

KL()\displaystyle\operatorname{KL}(\mathbb{P}\|\mathbb{Q}) =140T𝔼η(tT)(V(Xt)V(Xt))vt(Xt)2dt\displaystyle=\frac{1}{4}\int_{0}^{T}\mathbb{E}_{\mathbb{P}}{\left\|\eta\left(\frac{t}{T}\right)\left(\nabla V(X_{t})-\nabla V(X_{t_{-}})\right)-v_{t}(X_{t})\right\|^{2}}\mathrm{d}t
0T𝔼[η(tT)2V(Xt)V(Xt)2+vt(Xt)2]dt\displaystyle\lesssim\int_{0}^{T}\mathbb{E}_{\mathbb{P}}{\left[\eta\left(\frac{t}{T}\right)^{2}\|\nabla V(X_{t})-\nabla V(X_{t_{-}})\|^{2}+\|v_{t}(X_{t})\|^{2}\right]}\mathrm{d}t
=1Mη(TT)2β2T1T𝔼XtXt2dt+0TvtL2(π~t)2dt.\displaystyle\leq\sum_{\ell=1}^{M}\eta\left(\frac{T_{\ell}}{T}\right)^{2}\beta^{2}\int_{T_{\ell-1}}^{T_{\ell}}\mathbb{E}_{\mathbb{P}}{\|X_{t}-X_{t_{-}}\|^{2}}\mathrm{d}t+\int_{0}^{T}\|v_{t}\|_{L^{2}(\widetilde{\pi}_{t})}^{2}\mathrm{d}t.

The last inequality arises from the smoothness of VV and the increasing property of η()\eta(\cdot). To bound 𝔼XtXt2\mathbb{E}_{\mathbb{P}}{\|X_{t}-X_{t_{-}}\|^{2}}, note that under \mathbb{P}, for t[T1,T)t\in[T_{\ell-1},T_{\ell}), we have

XtXt=T1t(logπ~τ+vτ)(Xτ)dτ+2(BtBT1).X_{t}-X_{t_{-}}=\int_{T_{\ell-1}}^{t}\left(\nabla\log\widetilde{\pi}_{\tau}+v_{\tau}\right)(X_{\tau})\mathrm{d}\tau+\sqrt{2}(B_{t}-B_{T_{\ell-1}}).

Thanks to the fact that Xtπ~tX_{t}\sim\widetilde{\pi}_{t} under \mathbb{P},

𝔼XtXt2\displaystyle\mathbb{E}_{\mathbb{P}}{\|X_{t}-X_{t_{-}}\|^{2}} 𝔼T1t(logπ~τ+vτ)(Xτ)dτ2+𝔼2(BtBT1)2\displaystyle\lesssim\mathbb{E}_{\mathbb{P}}{\left\|\int_{T_{\ell-1}}^{t}\left(\nabla\log\widetilde{\pi}_{\tau}+v_{\tau}\right)(X_{\tau})\mathrm{d}\tau\right\|^{2}}+\mathbb{E}{\|\sqrt{2}(B_{t}-B_{T_{\ell-1}})\|^{2}}
(tT1)T1t𝔼(logπ~τ+vτ)(Xτ)2dτ+d(tT1)\displaystyle\lesssim(t-{T_{\ell-1}})\int_{T_{\ell-1}}^{t}\mathbb{E}_{\mathbb{P}}{\|(\nabla\log\widetilde{\pi}_{\tau}+v_{\tau})(X_{\tau})\|^{2}}\mathrm{d}\tau+d(t-{T_{\ell-1}})
(tT1)T1t(logπ~τL2(π~τ)2+vτL2(π~τ)2)dτ+d(tT1)\displaystyle\lesssim(t-T_{\ell-1})\int_{T_{\ell-1}}^{t}\left(\|\nabla\log\widetilde{\pi}_{\tau}\|^{2}_{L^{2}(\widetilde{\pi}_{\tau})}+\|v_{\tau}\|_{L^{2}(\widetilde{\pi}_{\tau})}^{2}\right)\mathrm{d}\tau+d(t-T_{\ell-1})
hT1T(logπ~τL2(π~τ)2+vτL2(π~τ)2)dτ+dh.\displaystyle\lesssim h_{\ell}\int_{T_{\ell-1}}^{T_{\ell}}\left(\|\nabla\log\widetilde{\pi}_{\tau}\|^{2}_{L^{2}(\widetilde{\pi}_{\tau})}+\|v_{\tau}\|_{L^{2}(\widetilde{\pi}_{\tau})}^{2}\right)\mathrm{d}\tau+dh_{\ell}.

The second inequality follows from the Cauchy-Schwarz inequality, and the last one is due to the definition h_{\ell}=T_{\ell}-T_{\ell-1}. Integrating over t\in[T_{\ell-1},T_{\ell}],

T1T𝔼XtXt2dth2T1T(logπ~tL2(π~t)2+vtL2(π~t)2)dt+dh2.\int_{T_{\ell-1}}^{T_{\ell}}\mathbb{E}{\|X_{t}-X_{t_{-}}\|^{2}}\mathrm{d}t\lesssim h_{\ell}^{2}\int_{T_{\ell-1}}^{T_{\ell}}\left(\|\nabla\log\widetilde{\pi}_{t}\|^{2}_{L^{2}(\widetilde{\pi}_{t})}+\|v_{t}\|_{L^{2}(\widetilde{\pi}_{t})}^{2}\right)\mathrm{d}t+dh_{\ell}^{2}.

Recall that the potential of π~t\widetilde{\pi}_{t} is (η(tT)β+λ(tT))\left(\eta\left(\frac{t}{T}\right)\beta+\lambda\left(\frac{t}{T}\right)\right)-smooth. By Lem. 6 and the monotonicity of η()\eta(\cdot) and λ()\lambda(\cdot), we have

T1Tlogπ~tL2(π~t)2dt\displaystyle\int_{T_{\ell-1}}^{T_{\ell}}\|\nabla\log\widetilde{\pi}_{t}\|^{2}_{L^{2}(\widetilde{\pi}_{t})}\mathrm{d}t T1Td(η(tT)β+λ(tT))dt\displaystyle\leq\int_{T_{\ell-1}}^{T_{\ell}}d\left(\eta\left(\frac{t}{T}\right)\beta+\lambda\left(\frac{t}{T}\right)\right)\mathrm{d}t
dh(βη(TT)+λ(T1T))\displaystyle\leq dh_{\ell}\left(\beta\eta\left(\frac{T_{\ell}}{T}\right)+\lambda\left(\frac{T_{\ell-1}}{T}\right)\right)
=dh(βη(θ)+λ(θ1)).\displaystyle=dh_{\ell}\left(\beta\eta\left(\theta_{\ell}\right)+\lambda\left(\theta_{\ell-1}\right)\right).

Therefore, KL()\operatorname{KL}(\mathbb{P}\|\mathbb{Q}) is upper bounded by

=1Mη(θ)2β2T1T𝔼XtXt2dt+0TvtL2(π~t)2dt\displaystyle\lesssim\sum_{\ell=1}^{M}\eta\left(\theta_{\ell}\right)^{2}\beta^{2}\int_{T_{\ell-1}}^{T_{\ell}}\mathbb{E}_{\mathbb{P}}{\|X_{t}-X_{t_{-}}\|^{2}}\mathrm{d}t+\int_{0}^{T}\|v_{t}\|_{L^{2}(\widetilde{\pi}_{t})}^{2}\mathrm{d}t
=1Mη(θ)2β2(dh3(βη(θ)+λ(θ1))+h2T1TvtL2(π~t)2dt+dh2)+0TvtL2(π~t)2dt\displaystyle\lesssim\sum_{\ell=1}^{M}\eta\left(\theta_{\ell}\right)^{2}\beta^{2}\left(dh_{\ell}^{3}\left(\beta\eta\left(\theta_{\ell}\right)+\lambda\left(\theta_{\ell-1}\right)\right)+h_{\ell}^{2}\int_{T_{\ell-1}}^{T_{\ell}}\|v_{t}\|_{L^{2}(\widetilde{\pi}_{t})}^{2}\mathrm{d}t+dh_{\ell}^{2}\right)+\int_{0}^{T}\|v_{t}\|_{L^{2}(\widetilde{\pi}_{t})}^{2}\mathrm{d}t
==1M((1+η(θ)2β2h2)T1TvtL2(π~t)2dt+η(θ)2β2dh2(1+h(βη(θ)+λ(θ1)))).\displaystyle=\sum_{\ell=1}^{M}\left(\left(1+\eta(\theta_{\ell})^{2}\beta^{2}h_{\ell}^{2}\right)\int_{T_{\ell-1}}^{T_{\ell}}\|v_{t}\|_{L^{2}(\widetilde{\pi}_{t})}^{2}\mathrm{d}t+\eta(\theta_{\ell})^{2}\beta^{2}dh_{\ell}^{2}\left(1+h_{\ell}\left(\beta\eta(\theta_{\ell})+\lambda(\theta_{\ell-1})\right)\right)\right).

For the remaining integral, given that (vt)t[0,T](v_{t})_{t\in[0,T]} generates (π~t)t[0,T](\widetilde{\pi}_{t})_{t\in[0,T]}, according to Lem. 2, we may choose vtv_{t} such that vtL2(π~t)=|π~˙|t\|v_{t}\|_{L^{2}(\widetilde{\pi}_{t})}=|\dot{\widetilde{\pi}}|_{t}. Thus,

T1TvtL2(π~t)2dt\displaystyle\int_{T_{\ell-1}}^{T_{\ell}}\|v_{t}\|_{L^{2}(\widetilde{\pi}_{t})}^{2}\mathrm{d}t =T1T|π~˙|t2dt=T1T1T2|π˙|t/T2dt=1Tθ1θ|π˙|θ2dθ,\displaystyle=\int_{T_{\ell-1}}^{T_{\ell}}|\dot{\widetilde{\pi}}|_{t}^{2}\mathrm{d}t=\int_{T_{\ell-1}}^{T_{\ell}}\frac{1}{T^{2}}|\dot{\pi}|^{2}_{t/T}\mathrm{d}t=\frac{1}{T}\int_{\theta_{\ell-1}}^{\theta_{\ell}}|\dot{\pi}|^{2}_{\theta}\mathrm{d}\theta,

through a change-of-variable analogous to that used in the proof of Thm. 1. Therefore,

KL()\displaystyle\operatorname{KL}(\mathbb{P}\|\mathbb{Q}) =1M(1+η(θ)2β2h2Tθ1θ|π˙|θ2dθ+η(θ)2β2dh2(1+h(βη(θ)+λ(θ1)))).\displaystyle\lesssim\sum_{\ell=1}^{M}\left(\frac{1+\eta(\theta_{\ell})^{2}\beta^{2}h_{\ell}^{2}}{T}\int_{\theta_{\ell-1}}^{\theta_{\ell}}|\dot{\pi}|^{2}_{\theta}\mathrm{d}\theta+\eta(\theta_{\ell})^{2}\beta^{2}dh_{\ell}^{2}\left(1+h_{\ell}\left(\beta\eta(\theta_{\ell})+\lambda(\theta_{\ell-1})\right)\right)\right).

Assume h1βdh_{\ell}\lesssim\frac{1}{\beta d} (which will be verified later), so we can further simplify the above expression to

KL()\displaystyle\operatorname{KL}(\mathbb{P}\|\mathbb{Q}) =1M(1Tθ1θ|π˙|θ2dθ+η(θ)2β2dh2)\displaystyle\lesssim\sum_{\ell=1}^{M}\left(\frac{1}{T}\int_{\theta_{\ell-1}}^{\theta_{\ell}}|\dot{\pi}|^{2}_{\theta}\mathrm{d}\theta+\eta(\theta_{\ell})^{2}\beta^{2}dh_{\ell}^{2}\right)
=𝒜T+β2d=1Mη(θ)2h2\displaystyle=\frac{{\cal A}}{T}+\beta^{2}d\sum_{\ell=1}^{M}\eta(\theta_{\ell})^{2}h_{\ell}^{2}
=𝒜T+β2d=1Mη(θ)2T2(θθ1)2.\displaystyle=\frac{{\cal A}}{T}+\beta^{2}d\sum_{\ell=1}^{M}\eta(\theta_{\ell})^{2}T^{2}(\theta_{\ell}-\theta_{\ell-1})^{2}.

To bound the above expression by ε2\varepsilon^{2}, we first select T𝒜ε2T\asymp\frac{{\cal A}}{\varepsilon^{2}}, mirroring the total time TT required for running annealed LD as specified in Thm. 1. This guarantees that the continuous dynamics closely approximate the reference path measure. Given that η()1\eta(\cdot)\leq 1, it remains only to ensure

β2d=1Mη(θ)2T2(θθ1)2β2d𝒜2ε4=1M(θθ1)2ε2,\beta^{2}d\sum_{\ell=1}^{M}\eta(\theta_{\ell})^{2}T^{2}(\theta_{\ell}-\theta_{\ell-1})^{2}\leq\beta^{2}d\frac{{\cal A}^{2}}{\varepsilon^{4}}\sum_{\ell=1}^{M}(\theta_{\ell}-\theta_{\ell-1})^{2}\lesssim\varepsilon^{2},

which is equivalent to

=1M(θθ1)2ε6dβ2𝒜2.\sum_{\ell=1}^{M}(\theta_{\ell}-\theta_{\ell-1})^{2}\lesssim\frac{\varepsilon^{6}}{d\beta^{2}{\cal A}^{2}}. (8)

To minimize M, we apply the Cauchy-Schwarz inequality:

(=1M1)(=1M(θθ1)2)(=1M(θθ1))2=1,\left(\sum_{\ell=1}^{M}1\right)\left(\sum_{\ell=1}^{M}(\theta_{\ell}-\theta_{\ell-1})^{2}\right)\geq\left(\sum_{\ell=1}^{M}(\theta_{\ell}-\theta_{\ell-1})\right)^{2}=1,

which establishes a lower bound for MM. The equality is achieved when θθ1=1M\theta_{\ell}-\theta_{\ell-1}=\frac{1}{M} for all [[1,M]]\ell\in\left[\!\left[1,M\right]\!\right]. Thus, selecting

Mdβ2𝒜2ε6M\asymp\frac{d\beta^{2}{\cal A}^{2}}{\varepsilon^{6}}

satisfies the constraint given in Eq. 8. In this case, the step size h_{\ell}=\frac{T}{M}\asymp\frac{\varepsilon^{4}}{d\beta^{2}{\cal A}}, which satisfies the earlier assumption h_{\ell}\lesssim\frac{1}{\beta d} whenever \varepsilon^{4}\lesssim\beta{\cal A}, i.e., for all sufficiently small \varepsilon. Combining this with the \widetilde{O}(1) complexity for sampling from \pi_{0}, we have completed the proof. ∎

Appendix D Proof of Ex. 2

The smoothness of VV comes from Lem. 7.

Note that π(x)i=1Npiexp(β2xyi2)\pi(x)\propto\sum_{i=1}^{N}p_{i}\exp\left(-\frac{\beta}{2}\|x-y_{i}\|^{2}\right), and define

π^λ(x)\displaystyle\widehat{\pi}_{\lambda}(x) π(x)exp(λ2x2)\displaystyle\propto\pi(x)\exp\left(-\frac{\lambda}{2}\|x\|^{2}\right)
=i=1Npiexp(λβ2(λ+β)yi2λ+β2xβλ+βyi2)\displaystyle=\sum_{i=1}^{N}p_{i}\exp\left(-\frac{\lambda\beta}{2(\lambda+\beta)}\|y_{i}\|^{2}-\frac{\lambda+\beta}{2}\left\|x-\frac{\beta}{\lambda+\beta}y_{i}\right\|^{2}\right)
i=1Npiexp(λ+β2xβλ+βyi2)\displaystyle\propto\sum_{i=1}^{N}p_{i}\exp\left(-\frac{\lambda+\beta}{2}\left\|x-\frac{\beta}{\lambda+\beta}y_{i}\right\|^{2}\right)
=i=1Npi𝒩(βλ+βyi,1λ+βI).\displaystyle=\sum_{i=1}^{N}p_{i}\operatorname{{\cal N}}\left(\frac{\beta}{\lambda+\beta}y_{i},\frac{1}{\lambda+\beta}I\right).

The second proportionality above uses that all \|y_{i}\| are equal to r in this example, so the factor \exp\left(-\frac{\lambda\beta}{2(\lambda+\beta)}\|y_{i}\|^{2}\right) does not depend on i and can be absorbed into the normalizing constant. We use a coupling argument to upper bound W_{2}^{2}(\widehat{\pi}_{\lambda},\widehat{\pi}_{\lambda+\delta}). We first sample I following the distribution \mathbb{P}(I=i)=p_{i}, i\in\left[\!\left[1,N\right]\!\right], and then independently sample \eta\sim\operatorname{{\cal N}}\left(0,I\right). We have

X\displaystyle X :=βλ+βyI+1λ+βηπ^λ,\displaystyle:=\frac{\beta}{\lambda+\beta}y_{I}+\frac{1}{\sqrt{\lambda+\beta}}\eta\sim\widehat{\pi}_{\lambda},
Y\displaystyle Y :=βλ+δ+βyI+1λ+δ+βηπ^λ+δ.\displaystyle:=\frac{\beta}{\lambda+\delta+\beta}y_{I}+\frac{1}{\sqrt{\lambda+\delta+\beta}}\eta\sim\widehat{\pi}_{\lambda+\delta}.

By the definition of the W2\text{W}_{\text{2}} distance, we have

W22(π^λ,π^λ+δ)\displaystyle W_{2}^{2}(\widehat{\pi}_{\lambda},\widehat{\pi}_{\lambda+\delta}) 𝔼XY2\displaystyle\leq\mathbb{E}\|X-Y\|^{2}
=𝔼I𝔼η(βλ+ββλ+δ+β)yI+(1λ+β1λ+δ+β)η2\displaystyle=\mathbb{E}_{I}\mathbb{E}_{\eta}\left\|\left(\frac{\beta}{\lambda+\beta}-\frac{\beta}{\lambda+\delta+\beta}\right)y_{I}+\left(\frac{1}{\sqrt{\lambda+\beta}}-\frac{1}{\sqrt{\lambda+\delta+\beta}}\right)\eta\right\|^{2}
=𝔼I(βλ+ββλ+δ+β)2yI2+(1λ+β1λ+δ+β)2d\displaystyle=\mathbb{E}_{I}\left(\frac{\beta}{\lambda+\beta}-\frac{\beta}{\lambda+\delta+\beta}\right)^{2}\|y_{I}\|^{2}+\left(\frac{1}{\sqrt{\lambda+\beta}}-\frac{1}{\sqrt{\lambda+\delta+\beta}}\right)^{2}d
=(βλ+ββλ+δ+β)2r2+(1λ+β1λ+δ+β)2d.\displaystyle=\left(\frac{\beta}{\lambda+\beta}-\frac{\beta}{\lambda+\delta+\beta}\right)^{2}r^{2}+\left(\frac{1}{\sqrt{\lambda+\beta}}-\frac{1}{\sqrt{\lambda+\delta+\beta}}\right)^{2}d.

This implies

|π^˙|λ2=limδ0W22(π^λ,π^λ+δ)δ2β2r2(λ+β)4+d4(λ+β)3.|\dot{\widehat{\pi}}|^{2}_{\lambda}=\lim_{\delta\to 0}\frac{W_{2}^{2}(\widehat{\pi}_{\lambda},\widehat{\pi}_{\lambda+\delta})}{\delta^{2}}\leq\frac{\beta^{2}r^{2}}{(\lambda+\beta)^{4}}+\frac{d}{4(\lambda+\beta)^{3}}.

By time reparameterization πθ=π^λ(θ)\pi_{\theta}=\widehat{\pi}_{\lambda(\theta)}, |π˙|θ=|π^˙|λ(θ)|λ˙(θ)||\dot{\pi}|_{\theta}=|\dot{\widehat{\pi}}|_{\lambda(\theta)}|\dot{\lambda}(\theta)|. With λ(θ)=λ0(1θ)γ\lambda(\theta)=\lambda_{0}(1-\theta)^{\gamma}, |λ˙(θ)|=γλ0(1θ)γ1γλ0λ0|\dot{\lambda}(\theta)|=\gamma\lambda_{0}(1-\theta)^{\gamma-1}\leq\gamma\lambda_{0}\lesssim\lambda_{0}. Therefore,

𝒜\displaystyle{\cal A} =01|π˙|θ2dθ=01|π^˙|λ(θ)2|λ˙(θ)|2dθ\displaystyle=\int_{0}^{1}|\dot{\pi}|_{\theta}^{2}\mathrm{d}\theta=\int_{0}^{1}|\dot{\widehat{\pi}}|_{\lambda(\theta)}^{2}|\dot{\lambda}(\theta)|^{2}\mathrm{d}\theta
λ001|π^˙|λ(θ)2|λ˙(θ)|dθ=λ00λ0|π^˙|λ2dλ\displaystyle\lesssim\lambda_{0}\int_{0}^{1}|\dot{\widehat{\pi}}|_{\lambda(\theta)}^{2}|\dot{\lambda}(\theta)|\mathrm{d}\theta=\lambda_{0}\int_{0}^{\lambda_{0}}|\dot{\widehat{\pi}}|_{\lambda}^{2}\mathrm{d}\lambda
=λ00λ0(β2r2(λ+β)4+d4(λ+β)3)dλ\displaystyle=\lambda_{0}\int_{0}^{\lambda_{0}}\left(\frac{\beta^{2}r^{2}}{(\lambda+\beta)^{4}}+\frac{d}{4(\lambda+\beta)^{3}}\right)\mathrm{d}\lambda
λ0(β2r21β3+d1β2)\displaystyle\lesssim\lambda_{0}\left(\beta^{2}r^{2}\frac{1}{\beta^{3}}+d\frac{1}{\beta^{2}}\right)
dβ(r2β+1)(r2β+dβ2)\displaystyle\lesssim d\beta(r^{2}\beta+1)\left(\frac{r^{2}}{\beta}+\frac{d}{\beta^{2}}\right)
=d(r2β+1)(r2+dβ).\displaystyle=d(r^{2}\beta+1)\left(r^{2}+\frac{d}{\beta}\right).

It follows from the proof that as long as \max_{\theta\in[0,1]}|\dot{\lambda}(\theta)| is bounded by a polynomial function of \beta, d and r, so is the action {\cal A}. ∎
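As a quick numerical sanity check (with arbitrary parameter values), the finite-\delta coupling bound on W_{2}^{2}(\widehat{\pi}_{\lambda},\widehat{\pi}_{\lambda+\delta})/\delta^{2} derived above can be compared against its stated limit \frac{\beta^{2}r^{2}}{(\lambda+\beta)^{4}}+\frac{d}{4(\lambda+\beta)^{3}}:

import numpy as np

beta, r, d, lam = 2.0, 3.0, 10, 0.5   # illustrative values

def coupling_bound(delta):
    mean_term = (beta / (lam + beta) - beta / (lam + delta + beta)) ** 2 * r ** 2
    noise_term = (1 / np.sqrt(lam + beta) - 1 / np.sqrt(lam + delta + beta)) ** 2 * d
    return (mean_term + noise_term) / delta ** 2

limit = beta ** 2 * r ** 2 / (lam + beta) ** 4 + d / (4 * (lam + beta) ** 3)
for delta in [1e-1, 1e-2, 1e-3]:
    print(delta, coupling_bound(delta), limit)  # the finite-delta bound approaches the limit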

Appendix E Supplementary Lemmas

Lemma 6 (Chewi (2024, Lemma 4.E.1)).

Consider a probability measure μeU\mu\propto{\rm e}^{-U} on d\mathbb{R}^{d}. If 2UβI\nabla^{2}U\preceq\beta I for some β>0\beta>0, then 𝔼μU2βd\mathbb{E}_{\mu}{\|\nabla U\|^{2}}\leq\beta d.

Lemma 7 (Cheng et al. (2023, Lemma 4)).

For a Gaussian mixture distribution π=iwi𝒩(μi,σ2I)\pi=\sum_{i}w_{i}\operatorname{{\cal N}}\left(\mu_{i},\sigma^{2}I\right), we have

(1σ2maxi,jμiμj22σ4)I2logπ1σ2I.\left(\frac{1}{\sigma^{2}}-\frac{\max_{i,j}\|\mu_{i}-\mu_{j}\|^{2}}{2\sigma^{4}}\right)I\preceq-\nabla^{2}\log\pi\preceq\frac{1}{\sigma^{2}}I.
Remark.

This is a slightly improved version, as in the original lemma the authors omitted the factor 22 in the denominator on the l.h.s.

Lemma 8.

Under Assump. 2, \pi_{\theta} defined in Eq. 5 has a finite second moment when \eta(\theta)\in[0,1].

Proof.

When \eta(\theta)=1, \pi_{\theta}\propto\exp\left(-V-\frac{\lambda(\theta)}{2}\|\cdot\|^{2}\right)\leq\exp\left(-V\right), and the claim is straightforward. Otherwise (i.e., when \eta(\theta)<1), by convexity of u\mapsto{\rm e}^{-u}, we have

exp(η(θ)Vλ(θ)22)\displaystyle\exp\left(-\eta(\theta)V-\frac{\lambda(\theta)}{2}\|\cdot\|^{2}\right) =exp(η(θ)V(1η(θ))λ(θ)2(1η(θ))2)\displaystyle=\exp\left(-\eta(\theta)V-(1-\eta(\theta))\frac{\lambda(\theta)}{2(1-\eta(\theta))}\|\cdot\|^{2}\right)
η(θ)exp(V)+(1η(θ))exp(λ(θ)2(1η(θ))2).\displaystyle\leq\eta(\theta)\exp\left(-V\right)+(1-\eta(\theta))\exp\left(-\frac{\lambda(\theta)}{2(1-\eta(\theta))}\|\cdot\|^{2}\right).

Multiplying both sides by 2\|\cdot\|^{2} and taking integral over d\mathbb{R}^{d}, we see that 𝔼πθ2<+\mathbb{E}_{\pi_{\theta}}\|\cdot\|^{2}<{+\infty}. ∎

Appendix F Further Details of Experiments in Sec. 6

Hyperparameter settings.

For all the experiments, we use the annealing schedule λ(θ)=5(1θ)10\lambda(\theta)=5(1-\theta)^{10} and η(θ)1\eta(\theta)\equiv 1. The step size for ALMC is designed to follow a quadratic schedule: the step size at the \ell-th iteration (out of MM total iterations, [[1,M]]\ell\in\left[\!\left[1,M\right]\!\right]) is given by smaxsminM2/4(M2)2+smax-\frac{s_{\max}-s_{\min}}{M^{2}/4}\left(\ell-\frac{M}{2}\right)^{2}+s_{\max}, which increases on [0,M2]\left[0,\frac{M}{2}\right] and decreases on [M2,M]\left[\frac{M}{2},M\right], with a maximum of smaxs_{\max} and a minimum of smins_{\min}. For r=2,5,10,15,20,25,30r=2,5,10,15,20,25,30, the MM we choose are 200,500,2500,10000,20000,40000,60000200,500,2500,10000,20000,40000,60000, respectively, and we set smax=0.05s_{\max}=0.05 and smin=0.01s_{\min}=0.01, which achieve the best performance across all settings after tuning.
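As a sketch, the step-size schedule described above can be generated as follows; the function name is ours, while the formula and the 1-indexed iterations follow the text.

import numpy as np

def quadratic_step_sizes(M, s_min=0.01, s_max=0.05):
    l = np.arange(1, M + 1)
    return -(s_max - s_min) / (M ** 2 / 4) * (l - M / 2) ** 2 + s_max

steps = quadratic_step_sizes(M=200)
print(steps.min(), steps.max())  # s_min at the endpoints, s_max at l = M/2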

KL divergence estimation.

We use the Information Theoretical Estimators (ITE) toolbox (Szabó, 2014) to empirically estimate the KL divergence. In all cases, we generate 10001000 samples using ALMC, and an additional 10001000 samples from the target distribution. The KL divergence is estimated using the ite.cost.BDKL_KnnK() function, which leverages kk-nearest-neighbor techniques to compute the divergence from the target distribution to the sampled distribution.
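A minimal usage sketch of this estimation step is given below; the arrays are placeholders for the two sample sets, and the estimation(y1, y2) call signature (two n-by-d arrays with samples in rows) reflects our reading of the ITE toolbox interface and should be checked against the installed version.

import numpy as np
import ite

target_samples = np.random.randn(1000, 2)  # placeholder for 1000 samples from the target distribution
almc_samples = np.random.randn(1000, 2)    # placeholder for 1000 samples produced by ALMC

estimator = ite.cost.BDKL_KnnK()           # k-nearest-neighbor KL divergence estimator
kl_hat = estimator.estimation(target_samples, almc_samples)  # KL(target || sampled)
print(kl_hat)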

Regression coefficients.

We also compute the linear regression approximation of the curves in Fig. 2 via sklearn.linear_model.LinearRegression. The blue curve has a slope of 2.8412.841 and an intercept of 1.2571.257, while the orange one has a slope of 2.8902.890 and an intercept of 0.9040.904. The R2R^{2} scores, calculated using sklearn.metrics.r2_score, are 0.9950.995 and 0.9970.997, respectively.
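For reference, the regression step can be reproduced with a few lines of scikit-learn; the arrays x and y below are placeholders for the coordinates of one curve in Fig. 2, not the actual data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

x = np.array([0.7, 1.6, 2.3, 2.7, 3.0, 3.2, 3.4]).reshape(-1, 1)  # placeholder curve coordinates
y = np.array([1.2, 3.8, 6.0, 7.1, 8.0, 8.6, 9.1])                 # placeholder curve coordinates

reg = LinearRegression().fit(x, y)
slope, intercept = reg.coef_[0], reg.intercept_
r2 = r2_score(y, reg.predict(x))
print(slope, intercept, r2)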