
Reverse Diffusion Monte Carlo

Xunpeng Huang†  Hanze Dong†§  Yifan Hao  Yi-An Ma  Tong Zhang

The Hong Kong University of Science and Technology
§ Salesforce AI Research
University of California, San Diego
University of Illinois Urbana-Champaign
Equal contribution. Random order.
Abstract

We propose a Monte Carlo sampler from the reverse diffusion process. Unlike the practice of diffusion models, where the intermediary updates—the score functions—are learned with a neural network, we transform the score matching problem into a mean estimation one. By estimating the means of the regularized posterior distributions, we derive a novel Monte Carlo sampling algorithm called reverse diffusion Monte Carlo (rdMC), which is distinct from the Markov chain Monte Carlo (MCMC) methods. We determine the sample size from the error tolerance and the properties of the posterior distribution to yield an algorithm that can approximately sample the target distribution with any desired accuracy. Additionally, we demonstrate and prove under suitable conditions that sampling with rdMC can be significantly faster than that with MCMC. For multi-modal target distributions such as those in Gaussian mixture models, rdMC greatly improves over the Langevin-style MCMC sampling methods both theoretically and in practice. The proposed rdMC method offers a new perspective and solution beyond classical MCMC algorithms for the challenging complex distributions.

1 Introduction

Recent success of diffusion models has shown great promise for the reverse diffusion process in generating samples from a complex distribution (Song et al., 2020; Rombach et al., 2022). In the existing line of works, one is given samples from the target distribution and aims to generate more samples from the same target. One would diffuse the target distribution into a standard normal one, and use score matching to learn the transitions between the consecutive intermediary distributions (Ho et al., 2020; Song et al., 2020). Reversing the learned diffusion process leads us back to the target distribution. The benefit of the reverse diffusion process lies in the efficiency of convergence from any complex distribution to a normal one (Chen et al., 2022a; Lee et al., 2023). For example, diffusing a target multi-modal distribution into a normal one is not harder than diffusing a single mode. Backtracking the process from the normal distribution directly yields the desired multi-modal target. If one instead adopts the forward diffusion process from a normal distribution to the multi-modal one, there is the classical challenge of mixing among the modes, as illustrated in Figure 1. This observation motivates us to ask:

Can we create an efficient, general purpose Monte Carlo sampler from reverse diffusion processes?

For Monte Carlo sampling, while we have access to an unnormalized density function p(𝒙)exp(f(𝒙))p_{*}({\bm{x}})\propto\exp(-f_{*}({\bm{x}})), samples from the target distribution are unavailable (Neal, 1993; Jerrum & Sinclair, 1996; Robert et al., 1999). As a result, we need a different and yet efficient method of score estimation to perform the reverse SDE. This leads to the first contribution of this paper. We leverage the fact that the diffusion process from our target distribution pp_{*} towards a standard normal one pp_{\infty} is an Ornstein-Uhlenbeck (OU) process, which admits explicit solutions. We thereby transform the score matching problem into a non-parametric mean estimation one, without training a parameterized diffusion model. We name this new algorithm as reverse diffusion Monte Carlo (rdMC).

To implement rdMC and solve the aforementioned mean estimation problem, we propose two approaches. One is to sample from a normal distribution \rho_{t}, determined at each time t by the transition kernel of the OU process, and then compute the mean estimates weighted by the target p_{*}. This approach translates all of the computational challenge into the sample complexity of the importance sampling estimator. The iteration complexity required to achieve an overall \epsilon TV accuracy is O(\epsilon^{-2}), under minimal assumptions. Another approach is to use the Unadjusted Langevin Algorithm (ULA) to generate samples from the product distribution of the target density p_{*} and the normal one \rho_{t}. This approach greatly reduces the sample complexity in high dimensions, and the resulting subproblem is better conditioned than running ULA directly on p_{*}, thanks to the multiplication by the normal distribution. In our experiments, we find that a combination of the two approaches excels at distributions with multiple modes.

We then analyze the efficacy and efficiency of our rdMC method. We study the benefits of the reverse diffusion approach for both multi-modal distributions and high-dimensional heavy-tailed distributions that break the isoperimetric properties (Gross, 1975; Poincaré, 1890; Vempala & Wibisono, 2019). For multi-modal distributions, Langevin-algorithm-based MCMC approaches suffer an exponential cost due to the need for barrier crossing, which makes the mixing time extremely long. The rdMC approach can circumvent this hurdle and solve the problem. For high-dimensional heavy-tailed distributions, our rdMC method circumvents the often-required isoperimetric properties of the target distributions, thereby avoiding the curse of dimensionality.

Figure 1: Langevin dynamics (first row) versus reverse SDE (second row). The first and second rows depict the intermediate states of the Langevin algorithm and the reverse SDE, respectively, illustrating the transition from a standard normal p0p_{0} to a Gaussian mixture pp_{*}. It can be observed that due to the local nature of the information contained in lnp\nabla\ln p_{*}, the Langevin algorithm tends to get stuck in modes close to the initializations. In contrast, the reverse SDE excels at transporting particles to different modes proportional to the target densities.

Contributions. We propose a non-parametric sampling algorithm that leverages the reverse SDE of the OU process. Our proposed approach involves estimating the score function by a mean estimation sub-problem for posteriors, enabling the efficient generation of samples through the reverse SDE. We focus on the complexity of the sub-problems and establish the convergence of our algorithm. We found that our approach effectively tackles sampling tasks with ill-conditioned log-Sobolev constants. For example, it excels in challenging scenarios characterized by multi-modal and long-tailed distributions. Our analysis sheds light on the varying complexity of the sub-problems at different time points, providing a fresh perspective for revisiting score estimation in diffusion-based models.

2 Preliminaries

In this section, we begin by introducing related work from the perspectives of Markov Chain Monte Carlo (MCMC) and diffusion models, and we discuss the connection between these works and the present paper. Next, we introduce the notation for the reverse process of diffusion models, which specifies the stochastic differential equation (SDE) that particles follow in rdMC.

We first introduce the related work below.

Langevin-based MCMC. The mainstream gradient-based sampling algorithms are mainly based on the continuous Langevin dynamics (LD) for sampling from a target distribution p_{*}\propto\exp(-f_{*}). The Unadjusted Langevin Algorithm (ULA) discretizes LD using the Euler-Maruyama scheme and obtains a biased stationary distribution. Due to its simplicity, ULA is widely used in machine learning. Its convergence has been investigated for different criteria, including Total Variation (TV) distance, Wasserstein distances (Durmus & Moulines, 2019), and Kullback-Leibler (KL) divergence (Dalalyan, 2017; Cheng & Bartlett, 2018; Ma et al., 2019; Freund et al., 2022), in different settings, such as strong log-concavity, the log-Sobolev inequality (LSI) (Vempala & Wibisono, 2019), and the Poincaré inequality (PI) (Chewi et al., 2021). Several works have achieved accelerated convergence for ULA by decreasing the discretization error with higher-order SDEs, e.g., underdamped Langevin dynamics (Cheng et al., 2018; Ma et al., 2021; Mou et al., 2021), and by aligning the discretized stationary distribution to the target p_{*}, e.g., the Metropolis-adjusted Langevin algorithm (Dwivedi et al., 2018) and proximal samplers (Lee et al., 2021b; Chen et al., 2022b; Altschuler & Chewi, 2023). Liu & Wang (2016) and Dong et al. (2023) also attempt to perform sampling tasks with deterministic algorithms whose limiting ODE is derived from the Langevin dynamics.

Regarding the convergence guarantees, most of these works have landscape assumptions for the target distribution pp_{*}, e.g., strong log-concavity, LSI, or PI. For more general distributions, Erdogdu & Hosseinzadeh (2021) and Mousavi-Hosseini et al. (2023) consider KL convergence in modified LSI and weak PI. These extensions allow for slower tail-growth of negative log-density ff_{*}, compared to the quadratic or even linear case. Although these works extend ULA to more general distributions, the computational burden of ill-conditioned isoperimetry still exists, e.g., exponentially dependent on the dimension (Raginsky et al., 2017). In this paper, we introduce another SDE to guide the sampling algorithm, which is non-Markovian and time-inhomogeneous. Our algorithm discretizes such an SDE and can reduce the isoperimetric and dimension dependence wrt TV convergence.

Diffusion Models and Stochastic Localization. In recent years, diffusion models have gained significant attention due to their ability to generate high-quality samples (Ho et al., 2020; Rombach et al., 2022). The core idea of diffusion models is to parameterize the score, i.e., the gradient of the log-density, along the entire forward OU process from the target distribution p_{*}. The reverse SDE is then associated with the inference process of diffusion models, which performs unnormalized sampling for intricate target distributions. Apart from conventional MCMC trajectories, the most desirable property of this process is that, if the score function can be well approximated, it can sample from a general distribution (Chen et al., 2022a; Lee et al., 2022; 2023). This implies the reverse SDE has a lower dependency on the isoperimetric requirements of the target distribution, which inspires the design of new sampling algorithms. On the other hand, the stochastic localization framework (Eldan, 2013; 2020; 2022; El Alaoui et al., 2022; Montanari, 2023) formalizes the general reverse diffusion and discusses its relationship with current diffusion models. Along this line, Montanari & Wu (2023) consider unnormalized sampling for the symmetric spiked model. However, the investigation of the properties of general unnormalized sampling remains limited.

This work employs the backward path of diffusion models to design a sampling strategy which we can prove to be more effective than the forward path of the Langevin algorithm under suitable conditions. Several other recent studies (Vargas et al., 2023a; Berner et al., 2022; Zhang et al., 2023; Vargas et al., 2023c; b) have also utilized the diffusion path in their posterior samplers. Instead of the closed-form Monte Carlo sampling approach, these works learn an approximate score function via a parameterized model, following the standard practice of diffusion generative models. The resulting approximate distribution following the backward diffusion path is akin to the ones obtained from variational inference, and contains errors that depend on the expressiveness of the parametric models for the score functions. In addition, the convergence of the sampling process using such an approach depends on the generalization error of the learned score functions (Tzen & Raginsky, 2019) (see the Appendix for more details). It remains open how to bound the sample complexity of such approaches, due to the need for bounding the error of the learned score function along the entire trajectory.

In contrast, the rdMC proposed in this paper can be directly analyzed. Specifically, our algorithm draws samples from the unnormalized distribution without training data, and it is designed to avoid learning-based score function estimation. We are interested in the comparisons between rdMC and conventional MCMC algorithms. Thus, we analyze the complexity of estimating the score by drawing samples from the posterior distribution and find that our proposed algorithm is much more efficient in certain cases where the log-Sobolev constant of the target distribution is small.

Reverse SDE in Diffusion Models. In this part, we introduce the notations and formulation of the reverse SDE associated with the inference process in diffusion models, which will also be commonly used in the following sections. We begin with an OU process starting from pp_{*} formulated as

\mathrm{d}{\bm{x}}_{t}=-{\bm{x}}_{t}\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t},\quad{\bm{x}}_{0}\sim p_{0}\propto e^{-f_{*}}, (1)

where (Bt)t0(B_{t})_{t\geq 0} is Brownian motion and p0p_{0} is assigned as pp_{*}. This is similar to the forward process of diffusion models (Song et al., 2020) whose stationary distribution is 𝒩(𝟎,𝑰){\mathcal{N}}\left({\bm{0}},{\bm{I}}\right) and we can track the random variable 𝒙{\bm{x}} at time tt with density function ptp_{t}.
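To make the forward process concrete, here is a minimal sketch (ours, not from the paper) of sampling from the OU transition kernel, which is the Gaussian p_{t|0} made explicit in Section 3.1; the function name is illustrative and rng is a NumPy random generator.

```python
import numpy as np

def ou_forward_sample(x0, t, rng):
    """Sample x_t | x_0 from the OU transition kernel of SDE (1):
    x_t = e^{-t} x_0 + sqrt(1 - e^{-2t}) * z, with z ~ N(0, I_d)."""
    x0 = np.asarray(x0, dtype=float)
    return np.exp(-t) * x0 + np.sqrt(1.0 - np.exp(-2.0 * t)) * rng.normal(size=x0.shape)
```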

According to Cattiaux et al. (2021), under mild conditions on the forward process \mathrm{d}{\bm{x}}_{t}=b_{t}({\bm{x}}_{t})\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t}, the reverse process also admits an SDE description. If we fix the terminal time T>0 and set \tilde{{\bm{x}}}_{t}={\bm{x}}_{T-t} for t\in[0,T], the process (\tilde{{\bm{x}}}_{t})_{t\in[0,T]} satisfies the SDE \mathrm{d}\tilde{{\bm{x}}}_{t}=\tilde{b}_{t}(\tilde{{\bm{x}}}_{t})\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t}, where the reverse drift satisfies the relation b_{t}+\tilde{b}_{T-t}=2\nabla\ln p_{t} with {\bm{x}}_{t}\sim p_{t}. In this case, the reverse process of Eq. (1) is \mathrm{d}\tilde{{\bm{x}}}_{t}=\left(\tilde{{\bm{x}}}_{t}+2\nabla\ln p_{T-t}(\tilde{{\bm{x}}}_{t})\right)\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t}, with \tilde{{\bm{x}}}_{t}\sim\tilde{p}_{t}. Thus, once the score function \nabla\ln p_{T-t} is obtained, the reverse SDE induces a sampling algorithm.

Discretization and realization of the reverse process. To numerically solve the previous SDE, let k\coloneqq\lfloor t/\eta\rfloor for any t\in[0,T], and approximate the score-dependent drift term 2\nabla\ln p_{T-t} with a piecewise-constant estimator {\bm{v}}_{k} for t\in[k\eta,(k+1)\eta]. This modification results in a new SDE, given by

\mathrm{d}\tilde{{\bm{x}}}_{t}=\left(\tilde{{\bm{x}}}_{t}+{\bm{v}}_{k}(\tilde{{\bm{x}}}_{k\eta})\right)\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t},\quad t\in\left[k\eta,(k+1)\eta\right]. (2)

Specifically, when k=0, we set \tilde{{\bm{x}}}_{0} to be sampled from \tilde{p}_{0}, which approximates p_{T}. To find the closed form of the solution, we introduce an auxiliary random variable {\mathbf{r}}_{i}\left(\tilde{{\bm{x}}}_{t},t\right)\coloneqq\tilde{{\bm{x}}}_{t,i}e^{-t} and obtain

\begin{aligned}
\mathrm{d}{\mathbf{r}}_{i}(\tilde{{\bm{x}}}_{t},t)&=\left(\frac{\partial{\mathbf{r}}_{i}}{\partial t}+\frac{\partial{\mathbf{r}}_{i}}{\partial\tilde{{\bm{x}}}_{t,i}}\cdot\left[{\bm{v}}_{k,i}+\tilde{{\bm{x}}}_{t,i}\right]+\operatorname{Tr}\left(\frac{\partial^{2}{\mathbf{r}}_{i}}{\partial\tilde{{\bm{x}}}_{t,i}^{2}}\right)\right)\mathrm{d}t+\frac{\partial{\mathbf{r}}_{i}}{\partial\tilde{{\bm{x}}}_{t,i}}\cdot\sqrt{2}\mathrm{d}B_{t}\\
&={\bm{v}}_{k,i}e^{-t}\mathrm{d}t+\sqrt{2}e^{-t}\mathrm{d}B_{t},
\end{aligned}

where the equalities follow from Itô's Lemma. Then, we set the initial value {\mathbf{r}}_{i}(\tilde{{\bm{x}}}_{t},0)=\tilde{{\bm{x}}}_{t,i} and integrate both sides of the equation, which yields

\tilde{{\bm{x}}}_{t+s}=e^{s}\tilde{{\bm{x}}}_{t}+\left(e^{s}-1\right){\bm{v}}_{k}+\mathcal{N}\left(0,\left(e^{2s}-1\right){\bm{I}}_{d}\right). (3)

For the specific construction of 𝒗k{\bm{v}}_{k} to approximate the score, we defer it to Section 3.
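For reference, a minimal NumPy sketch (our own, illustrative) of the one-step update in Eq. (3); v_k stands for whatever estimator of 2\nabla\ln p_{T-k\eta} is plugged in, as constructed in Section 3.

```python
import numpy as np

def reverse_step(x, v_k, eta, rng):
    """One discretized reverse-SDE step, Eq. (3) with s = eta:
    x_{t+eta} = e^eta * x_t + (e^eta - 1) * v_k + N(0, (e^{2*eta} - 1) I_d)."""
    noise = np.sqrt(np.exp(2.0 * eta) - 1.0) * rng.normal(size=np.shape(x))
    return np.exp(eta) * x + (np.exp(eta) - 1.0) * v_k + noise
```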

3 The Reverse Diffusion Monte Carlo Approach

As shown in SDE (2), we introduce an estimator {\bm{v}}_{k}\approx 2\nabla_{{\bm{x}}}\ln p_{T-t}({\bm{x}}) to implement the reverse diffusion. This section is dedicated to exploring viable estimators and the benefits of the reverse diffusion process. We find that we can reformulate the score as an expectation with respect to a posterior induced by the transition kernel of the forward OU process. We also develop the intuition that rdMC can reduce the dependence on the isoperimetric properties of the target distribution compared with the conventional Langevin algorithm. Lastly, we introduce the detailed implementation of rdMC in practice with different score estimators.

3.1 Score function is the expectation of the posterior

We start with the formulation of \nabla_{{\bm{x}}}\ln p_{t}({\bm{x}}). For general SDEs, the score functions \nabla\ln p_{t} do not have an analytic form. However, our forward process is an OU process (SDE (1)) whose transition kernel is given by p_{t|0}({\bm{x}}|{\bm{x}}_{0})=\left(2\pi\left(1-e^{-2t}\right)\right)^{-d/2}\cdot\exp\left[\frac{-\left\|{\bm{x}}-e^{-t}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2t}\right)}\right]. This conditional density gives the probability of obtaining {\bm{x}}_{t}={\bm{x}} given {\bm{x}}_{0} in SDE (1). Since p_{t}({\bm{x}})={\mathbb{E}}_{p_{0}({\bm{x}}_{0})}p_{t|0}({\bm{x}}|{\bm{x}}_{0}), we can use this property to derive other formulations of the score. Bayes' theorem shows that the score can be reformulated as an expectation, as stated in the following lemma.

Lemma 1.

Assume that Eq. (1) defines the forward process. The score function can be rewritten as

\nabla_{{\bm{x}}}\ln p_{T-t}({\bm{x}})=\mathbb{E}_{{\bm{x}}_{0}\sim q_{T-t}(\cdot|{\bm{x}})}\frac{e^{-(T-t)}{\bm{x}}_{0}-{\bm{x}}}{\left(1-e^{-2(T-t)}\right)}, (4)
q_{T-t}({\bm{x}}_{0}|{\bm{x}})\propto\exp\left(-f_{*}({\bm{x}}_{0})-\frac{\left\|{\bm{x}}-e^{-(T-t)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-t)}\right)}\right).

The proof of Lemma 1 is presented in Appendix C.1. For any t>0t>0, we observe that logqTt-\log q_{T-t} incorporates an additional quadratic term. In scenarios where pp_{*} adheres to the log-Sobolev inequality (LSI), this term enhances qTtq_{T-t}’s log-Sobolev (LSI) constant, thereby accelerating convergence. Conversely, with heavy-tailed pp_{*} (where ff_{*}’s growth is slower than a quadratic function), the extra term retains quadratic growth, yielding sub-Gaussian tails and log-Sobolev properties. Notably, as tt approaches TT, the quadratic component becomes predominant, rendering qTtq_{T-t} strongly log-concave and facilitating sampling. In summary, every qTtq_{T-t} exhibits a larger LSI constant than pp_{*}. As tt increases, this constant grows, ultimately leading qTtq_{T-t} towards strong convexity. Consequently, this provides a sequence of distributions with LSI constants surpassing those of pp_{*}, enabling efficient score estimation for lnpTt\nabla\ln p_{T-t}.
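As a quick sanity check, Eq. (4) follows directly from Bayes' rule and the Gaussian transition kernel above: differentiating p_{t}({\bm{x}})={\mathbb{E}}_{p_{0}({\bm{x}}_{0})}p_{t|0}({\bm{x}}|{\bm{x}}_{0}) under the integral gives, at a generic time t,

\nabla_{{\bm{x}}}\ln p_{t}({\bm{x}})=\frac{\int p_{0}({\bm{x}}_{0})\,\nabla_{{\bm{x}}}p_{t|0}({\bm{x}}|{\bm{x}}_{0})\,\mathrm{d}{\bm{x}}_{0}}{p_{t}({\bm{x}})}=\mathbb{E}_{{\bm{x}}_{0}\sim q_{t}(\cdot|{\bm{x}})}\left[\frac{e^{-t}{\bm{x}}_{0}-{\bm{x}}}{1-e^{-2t}}\right],\quad q_{t}({\bm{x}}_{0}|{\bm{x}})\propto p_{0}({\bm{x}}_{0})\,p_{t|0}({\bm{x}}|{\bm{x}}_{0}),

and evaluating at time T-t recovers Lemma 1.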

From Lemma 1, the expectation over q_{T-t}(\cdot|{\bm{x}}) can be approximated by an empirical mean estimator with sufficiently many samples from q_{T-t}(\cdot|{\bm{x}}). Thus, the gradient complexity required by this sampling subproblem becomes the bottleneck of our algorithm. Suppose \{{\bm{x}}_{k}^{(i)}\} are samples drawn from q_{T-k\eta}(\cdot|{\bm{x}}) given \tilde{{\bm{x}}}_{k\eta}={\bm{x}} for any {\bm{x}}\in\mathbb{R}^{d}; then {\bm{v}}_{k}({\bm{x}}) in Eq. (2) can be constructed as

{\bm{v}}_{k}({\bm{x}})=\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}{\bm{v}}_{k}^{(i)}({\bm{x}})\quad\mathrm{where}\quad{\bm{v}}^{(i)}_{k}({\bm{x}})=2\left(1-e^{-2(T-k\eta)}\right)^{-1}\cdot\left(e^{-(T-k\eta)}{{\bm{x}}^{(i)}_{k}}-{\bm{x}}\right). (5)
Algorithm 1 rdMC: reverse diffusion Monte Carlo
1:  Input: Initial particle 𝒙~0\tilde{{\bm{x}}}_{0} sampled from p~0\tilde{p}_{0}, Terminal time TT, Step size η,η\eta,\eta^{\prime}, Sample size nn.
2:  for k=0k=0 to T/η1\lfloor T/\eta\rfloor-1 do
3:     Set 𝒗k=𝟎{\bm{v}}_{k}={\bm{0}};
4:     Create n Monte Carlo samples to estimate {\bm{v}}_{k}\approx 2\,\mathbb{E}_{{\bm{x}}\sim q_{T-k\eta}}\left[-\frac{\tilde{{\bm{x}}}_{k\eta}-e^{-(T-k\eta)}{\bm{x}}}{\left(1-e^{-2(T-k\eta)}\right)}\right],\text{ where }q_{T-k\eta}({\bm{x}}|\tilde{{\bm{x}}}_{k\eta})\propto\exp\left(-f_{*}({\bm{x}})-\frac{\left\|\tilde{{\bm{x}}}_{k\eta}-e^{-(T-k\eta)}{\bm{x}}\right\|^{2}}{2\left(1-e^{-2(T-k\eta)}\right)}\right).
5:     𝒙~(k+1)η=eη𝒙~kη+(eη1)𝒗k+ξwhere ξ is sampled from 𝒩(0,(e2η1)𝑰d)\tilde{{\bm{x}}}_{{(k+1)\eta}}=e^{\eta}\tilde{{\bm{x}}}_{k\eta}+\left(e^{\eta}-1\right){\bm{v}}_{k}+\xi\quad\text{where $\xi$ is sampled from }\mathcal{N}\left(0,\left(e^{2\eta}-1\right){\bm{I}}_{d}\right).
6:  end for
7:  Return: x~T/ηη\tilde{{\bm{x}}}_{\lfloor T/\eta\rfloor\eta} .
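As a complement to the pseudocode, below is a minimal NumPy sketch (illustrative names, not the authors' implementation) of the outer loop of Algorithm 1; it assumes a user-supplied estimate_v(x, t_remaining) that returns v_k ≈ 2\nabla\ln p_{t_remaining}(x), e.g., one of the estimators of Section 3.3.

```python
import numpy as np

def rdmc_outer_loop(x0, T, eta, estimate_v, rng):
    """Sketch of Algorithm 1: run the discretized reverse SDE starting from x0 ~ p~_0.

    estimate_v(x, t_remaining) should return v_k, an estimate of
    2 * grad log p_{t_remaining}(x), e.g. the posterior-mean estimator of Eq. (5).
    """
    x = np.asarray(x0, dtype=float)
    for k in range(int(np.floor(T / eta))):
        t_remaining = T - k * eta              # forward time still to be reversed
        v_k = estimate_v(x, t_remaining)       # score-estimation subproblem (Step 4)
        # Exponential-integrator update of Eq. (3) (Step 5)
        noise = np.sqrt(np.exp(2.0 * eta) - 1.0) * rng.normal(size=x.shape)
        x = np.exp(eta) * x + (np.exp(eta) - 1.0) * v_k + noise
    return x
```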
Figure 2: Illustrations of ptp_{t}, qtq_{t}, and their log-Sobolev (LSI) constants. The target distribution pp_{*} is a Gaussian mixture. We choose qt(|x=0)q_{t}(\cdot|x=0) for illustration. As tt increases, the modes of ptp_{t} collapse to zero rapidly, corresponding to an improving LSI constant. For qtq_{t}, the barrier height of qtq_{t} remains small, resulting in a relatively large LSI constant as long as T=O(1)T=O(1). Thus initializing with pTp_{T} and performing rdMC reduces computation complexity for multi-modal pp_{*}.

3.2 Reverse SDE vs Langevin dynamics: Intuition

From Figure 1, we observe that the rdMC method deviates significantly from conventional gradient-based MCMC techniques, such as Langevin dynamics. The figure visualizes the paths from a Gaussian distribution to a mixture of Gaussians. It can be observed that Langevin dynamics, being solely driven by the gradient information of the target density p_{*}, tends to get trapped in local regions, resulting in uneven sampling of the Gaussian mixture. More precisely, p_{*} admits a small LSI constant due to the separation of the modes (Ma et al., 2019; Schlichting, 2019; Menz & Schlichting, 2014). Consequently, the convergence of conventional MCMC methods becomes notably challenging in such scenarios (Tosh & Dasgupta, 2014).

To better demonstrate the effect of our proposed SDE, we compute LSI constant estimates for the 1-d case in Figure 2. Due to the shrinkage property of the forward process, the LSI constant of p_{t} can be exponentially better than the original one. Meanwhile, for T=O(1), the LSI constant of q_{t} is also well-behaved. In our algorithm, well-conditioned LSI constants for both p_{T} and q_{T} reduce the computation overhead. Thus, we can choose an intermediate T to connect those local modes and perform the reverse SDE towards p_{*}. rdMC can distribute samples more evenly across different modes and enjoys faster convergence in those challenging cases. Moreover, if the growth rate of -\log p_{*} is slower than a quadratic function (heavy tail), we can choose a large T and use p_{\infty} to estimate p_{T}. Then, all -\log q_{t} have quadratic growth, which implies the log-Sobolev property. These intuitions also explain why diffusion models excel at modeling complex distributions in high-dimensional spaces. We provide the quantitative explanation in Section 4.

3.3 Algorithms for score estimation with qTt(|x)q_{T-t}(\cdot|x)

According to the expectation form of the score shown in Lemma 1, a detailed reverse sampling algorithm can be naturally proposed in Algorithm 1. Specifically, it can be summarized as the following steps: (1) choose a proper T and \tilde{p}_{0} such that p_{T}\approx\tilde{p}_{0}; this step can be done either by setting \tilde{p}_{0}=p_{\infty} for large T or by performing ULA for p_{T} (Algorithm 3 in the Appendix); (2) sample from a distribution that approximates q_{T-t} (Step 4 of Algorithm 1); (3) follow \tilde{p}_{t} with the approximate q_{T-t} samples (Step 5 of Algorithm 1); (4) repeat until t\approx T. After (4), we can also perform the Langevin algorithm to fine-tune the samples if the gradient complexity budget is not exhausted.

The key to implementing Algorithm 1 is to estimate the scores \nabla\ln p_{T-t} via Step 4 with samples from q_{T-t}. In what follows, we discuss implementations based on importance sampling and on the unadjusted Langevin algorithm (ULA), as well as their combination.

Importance weighted score estimator. We first consider the importance sampling approach for estimating the score \nabla\ln p_{T-t}. From Eq. (4), we know that the key is to estimate:

\nabla_{{\bm{x}}}\ln p_{T-t}({\bm{x}})=\mathbb{E}_{{\bm{x}}_{0}\sim q_{T-t}(\cdot|{\bm{x}})}\left[\frac{e^{-(T-t)}{\bm{x}}_{0}-{\bm{x}}}{\left(1-e^{-2(T-t)}\right)}\right]=\frac{1}{Z_{*}}\mathbb{E}_{{\bm{x}}_{0}\sim\rho_{T-t}(\cdot|{\bm{x}})}\left[\frac{e^{-(T-t)}{\bm{x}}_{0}-{\bm{x}}}{\left(1-e^{-2(T-t)}\right)}\cdot e^{-f_{*}({\bm{x}}_{0})}\right],

and Z=𝔼𝒙0ρTt(|𝒙)[exp(f(𝒙0))],Z_{*}=\mathbb{E}_{{\bm{x}}_{0}\sim\rho_{T-t}(\cdot|{\bm{x}})}\left[\exp(-f_{*}({\bm{x}}_{0}))\right], where ρTt(|𝒙)exp(𝒙e(Tt)𝒙022(1e2(Tt))).\rho_{T-t}(\cdot|{\bm{x}})\propto\exp\left(\frac{-\left\|{\bm{x}}-e^{-(T-t)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-t)}\right)}\right). Note that sampling from ρTt\rho_{T-t} takes negligible computation resource. The main challenge is the sample complexity of estimating the two expectations.

Since ρTt(𝒙0|𝒙)\rho_{T-t}({\bm{x}}_{0}|{\bm{x}}) is Gaussian with variance σt2=1e2(Tt)e2(Tt)\sigma_{t}^{2}=\frac{1-e^{-2(T-t)}}{e^{-2(T-t)}}, we know that as long as 𝒙e(Tt)𝒙0(1e2(Tt))exp(f(𝒙0))-\frac{{\bm{x}}-e^{-(T-t)}{\bm{x}}_{0}}{\left(1-e^{-2(T-t)}\right)}\cdot\exp(-f_{*}({\bm{x}}_{0})) is GG-Lipschitz, the sample complexity scales as N=O~(σt2G2ϵ2)N=\tilde{O}\left({\sigma_{t}^{2}}\frac{G^{2}}{\epsilon^{2}}\right) for the resulting errors of the two mean estimators to be less than ϵ\epsilon (Wainwright, 2019). However, the sample size required of the importance sampling method to achieve an overall small error is known to scale exponentially with the KL divergence between ρTt\rho_{T-t} and qTtq_{T-t} (Chatterjee & Diaconis, 2018), which can depend on the dimension. In our current formulation, this is due to the fact that the true denominator Z=𝔼𝒙0ρTt(|𝒙)[ef(𝒙0)]Z_{*}=\mathbb{E}_{{\bm{x}}_{0}\sim\rho_{T-t}(\cdot|{\bm{x}})}[e^{-f_{*}({\bm{x}}_{0})}] can be as little as exp(d)\exp(-d). As a result, to make the overall score estimation error small, the error tolerance and in turn the sample size required for estimating ZZ_{*} can scale exponentially with the dimension.
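For concreteness, here is a minimal sketch (ours; f_star denotes the negative log-density f_*, and all names are illustrative) of this self-normalized importance sampling estimator. It returns an estimate of \nabla\ln p_{T-t}({\bm{x}}); following Eq. (5), the estimate is doubled when used as {\bm{v}}_{k} in Eq. (2).

```python
import numpy as np

def importance_score_estimate(x, t_remaining, f_star, n_samples, rng):
    """Self-normalized importance sampling estimate of grad log p_{T-t}(x):
    draw x0 ~ rho_{T-t}(.|x) (a Gaussian), reweight by exp(-f_*(x0)), and average
    (e^{-(T-t)} x0 - x) / (1 - e^{-2(T-t)})."""
    x = np.asarray(x, dtype=float)
    a = np.exp(-t_remaining)                       # e^{-(T-t)}
    std = np.sqrt((1.0 - a ** 2) / a ** 2)         # sigma_t of rho_{T-t}(.|x)
    x0 = x / a + std * rng.normal(size=(n_samples,) + x.shape)
    log_w = -np.array([f_star(s) for s in x0])     # unnormalized log-weights
    w = np.exp(log_w - log_w.max())                # stabilize before normalizing
    w /= w.sum()
    integrand = (a * x0 - x) / (1.0 - a ** 2)
    return np.tensordot(w, integrand, axes=1)      # weighted mean of the integrand
```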

ULA score estimator. An alternative score estimator requires that the distribution q^{\prime}_{T-t}(\cdot|{\bm{x}}) underlying the drawn samples sufficiently approaches q_{T-t}(\cdot|{\bm{x}}), which can be achieved by closing the gap in KL divergence or Wasserstein distance between q^{\prime}_{T-t}(\cdot|{\bm{x}}) and q_{T-t}(\cdot|{\bm{x}}). Since the additional quadratic term shown in Eq. (4) enforces quadratic tail growth for q_{T-t}(\cdot|{\bm{x}}), which implies an isoperimetric condition, we expect that convergence can be achieved by performing ULA on the sampling subproblem whose target distribution is q_{T-t}(\cdot|{\bm{x}}). We provide the detailed implementation in Algorithm 2.

Algorithm 2 ULA inner-loop for the qt(|𝒙)q_{t}(\cdot|{\bm{x}}) sampler (Step 4 of Algorithm 1)
1:  Input: Condition 𝒙{\bm{x}} and time tt, Sample size nn, Initial particles {𝒙0i}i=1n\{{{\bm{x}}}^{i}_{0}\}_{i=1}^{n}, Iters KK, Step size η\eta.
2:  for k=1k=1 to KK do
3:     for i=1i=1 to nn do
4:        𝒙ki=𝒙k1iη(f(𝒙k1i)+et(et𝒙k1i𝒙)1e2t)+2ηξk{\bm{x}}_{k}^{i}={\bm{x}}_{k-1}^{i}-\eta\left(\nabla f_{*}({\bm{x}}_{k-1}^{i})+\frac{e^{-t}\left(e^{-t}{\bm{x}}_{k-1}^{i}-{\bm{x}}\right)}{1-e^{-2t}}\right)+\sqrt{2\eta}\xi_{k}, where ξk𝒩(0,𝑰d)\xi_{k}\sim\mathcal{N}\left(0,{\bm{I}}_{d}\right)
5:     end for
6:  end for
7:  Return: {xKi}i=1n\{{{\bm{x}}}^{i}_{K}\}_{i=1}^{n} .
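Below is a minimal NumPy sketch (illustrative, not the authors' code) of Algorithm 2; grad_f_star(x) returns \nabla f_{*}({\bm{x}}), `particles` holds the n initial particles, and the update mirrors Line 4 of Algorithm 2.

```python
import numpy as np

def ula_inner_loop(x_cond, t, grad_f_star, particles, num_iters, step, rng):
    """ULA targeting q_t(.|x_cond) ∝ exp(-f_*(x) - ||x_cond - e^{-t} x||^2 / (2(1 - e^{-2t})))."""
    a = np.exp(-t)
    x = np.asarray(particles, dtype=float)         # shape (n, d)
    for _ in range(num_iters):
        # Gradient of the negative log-density of q_t(.|x_cond) at each particle
        grad = np.array([grad_f_star(p) for p in x]) + a * (a * x - x_cond) / (1.0 - a ** 2)
        x = x - step * grad + np.sqrt(2.0 * step) * rng.normal(size=x.shape)
    return x
```

The returned particles are then averaged as in Eq. (5) to form the score estimate.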

ULA score estimator with importance sampling initialization. Inspired by the previous estimators, we can combine the importance sampling approach with ULA. In particular, we first run the importance sampling method to form a rough score estimate. We then perform ULA initialized at this mean estimate and obtain a refined, accurate score estimate. Via this combination, we efficiently obtain accurate score estimates by virtue of ULA when t is close to T, while when t is close to 0 we quickly obtain rough score estimates via the importance sampling approach. We observe empirically that this combination generally performs well when t interpolates between the two regimes (Figure 3 in Section 4).

4 Analyses and Examples of the rdMC Approach

In this section, we analyze the overall complexity of the rdMC via ULA inner loop estimation. Since the complexity of the importance sampling estimate is discussed in Section 3.3, we only consider Algorithm 1 with direct ULA sampling of qTt(|𝒙)q_{T-t}(\cdot|{\bm{x}}) rather than the smart importance sampling initialization to make our analysis clear.

To provide the analysis of the convergence of rdMC, we make the following assumptions.

  1. [A1] For all t\geq 0, the score \nabla\ln p_{t} is L-Lipschitz.

  2. [A2] The second moment of the target distribution p_{*} is upper bounded, i.e., \mathbb{E}_{p_{*}}\left[\left\|\cdot\right\|^{2}\right]=m_{2}^{2}.

These assumptions are standard in diffusion analysis to guarantee convergence to the target distribution (Chen et al., 2022a). Specifically, Assumption [A1] governs the smoothness of the forward process, which ensures the feasibility of the numerical discretization used to approximate the solution of the continuous SDE. In addition, Assumption [A2] introduces essential constraints on the moments of the target distribution. These constraints prevent an excessive accumulation of probability mass in the tail region, thereby ensuring a more balanced and well-behaved target distribution.

4.1 Outer loop complexity

According to Algorithm 1, the overall gradient complexity depends on the number of outer loops k, as well as the complexity of obtaining accurate score estimates (Line 4 in Algorithm 1). When the score is well approximated and satisfies \mathbb{E}_{p_{T-k\eta}}\left[\left\|{\bm{v}}_{k}-\nabla\ln p_{T-k\eta}\right\|^{2}\right]\leq\epsilon^{2}, the overall error in TV distance, D_{\mathrm{TV}}(p_{*},\tilde{p}_{T}), can be made \tilde{\mathcal{O}}(\epsilon) by choosing a small \eta. Under this condition, the number of outer loops satisfies k=\Omega(L^{2}d\epsilon^{-2}), which matches the complexity in diffusion analyses (Chen et al., 2022a; Lee et al., 2022; 2023). Such a complexity of the score estimation oracles is independent of the log-Sobolev (LSI) constant of the target density p_{*}, which means that the isoperimetric dependence of rdMC is dominated by the subproblem of sampling from q_{T-t}. Specifically, the following theorem gives conditions for achieving D_{\mathrm{TV}}(p_{*},\tilde{p}_{T})=\tilde{\mathcal{O}}(\epsilon).

Theorem 1.

For any \epsilon>0, assume that D_{\mathrm{KL}}(p_{T}\|\tilde{p}_{0})<\epsilon for some T>0, \hat{p} is chosen as suggested in Algorithm 1, and \eta=C_{1}(d+m_{2}^{2})^{-1}\epsilon^{2}. Suppose the OU process induced by p_{*} satisfies Assumptions [A1] and [A2], and each q_{T-k\eta} satisfies the log-Sobolev inequality (LSI) with constant \mu_{k} (k\eta\leq T). We set

n_{k}=64Td\mu_{k}^{-1}\eta^{-3}\epsilon^{-2}\delta^{-1},\quad\mathcal{E}_{k}=2^{-13}\cdot T^{-4}d^{-2}\mu_{k}^{2}\eta^{8}\epsilon^{4}\delta^{4} (6)

where n_{k} is the number of samples used to estimate the score, and \mathcal{E}_{k} is the KL error tolerance for the inner loop. With probability at least 1-\delta, Algorithm 1 converges in TV distance with \tilde{\mathcal{O}}(\epsilon) error.
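For concreteness, the quantities in Eq. (6) can be computed as below (an illustrative helper of ours that simply transcribes the formulas above).

```python
def theorem1_hyperparams(T, d, mu_k, eta, eps, delta):
    """Hyper-parameters of Eq. (6): n_k samples for the score estimate and the
    inner-loop KL tolerance E_k, given the LSI constant mu_k of q_{T - k*eta}."""
    n_k = 64.0 * T * d / (mu_k * eta ** 3 * eps ** 2 * delta)
    E_k = 2.0 ** -13 * mu_k ** 2 * eta ** 8 * eps ** 4 * delta ** 4 / (T ** 4 * d ** 2)
    return n_k, E_k
```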

Therefore, the key points for the convergence of rdMC can be summarized as

  • The LSI of the target distributions of the sampling subproblems, i.e., q_{T-k\eta}, is maintained.

  • The initialization of the reverse process p~0\tilde{p}_{0} is sufficiently close to pTp_{T}.

4.2 Overall computation complexity

In this section, we consider the overall computation complexity of rdMC. Note that the LSI constants of qTkηq_{T-k\eta} depend on the properties of pp_{*}. As a result we consider more specific assumptions that bound the LSI constants for qTkηq_{T-k\eta}. In particular, we demonstrate the benefits of using qTkηq_{T-k\eta} for targets pp_{*} with infinite or exponentially large LSI constants. Compared with ULA, rdMC improves the gradient complexity, due to the improved LSI constant of qTkηq_{T-k\eta} over that of pp_{*} in finite TT. Even for heavy-tailed pp_{*} that does not satisfy the Poincaré inequality, target distributions of sampling subproblems, i.e., qTkηq_{T-k\eta}, can still preserve the LSI property, which helps rdMC to alleviate the exponential dimension dependence of gradient complexity for achieving TV distance convergence.

Specifically, Section 4.2.1 considers the case where the LSI constant of p_{*} depends exponentially on a radius R, which is typical in mixture models; our proposed rdMC can reduce the exponent by a constant factor. Section 4.2.2 considers the case where p_{*} does not satisfy LSI, but rdMC can create a sequence of LSI subproblems, which makes the dimension dependency polynomial.

We first provide an estimate of the LSI constant of q_{t} under the general Lipschitzness assumption [A1].

Lemma 2.

Under [A1], the LSI constant for q_{t} in the forward OU process is \frac{e^{-2t}}{2\left(1-e^{-2t}\right)} when 0\leq t\leq\frac{1}{2}\ln\left(1+\frac{1}{2L}\right). This estimate indicates that when the quadratic term dominates the log-density of q_{t}, the log-Sobolev property is well guaranteed.

4.2.1 Improving the LSI constant dependence for mixture models

Apart from Assumption [A1], [A2], we study the case where pp_{*} satisfies the following assumption:

  1. [A3] There exists R>0 such that f_{*}({\bm{x}}) is m-strongly convex when \|{\bm{x}}\|\geq R.

In this case, the target density p_{*} admits an LSI constant which scales exponentially with the radius R of the region of nonconvexity, i.e., \frac{m}{2}\exp(-16LR^{2}), as shown in (Ma et al., 2019). This implies that if we draw samples from p_{*} with ULA, the gradient complexity has exponential dependency on R^{2}. However, in rdMC with suitable choices of T=O(\log R) and \tilde{p}_{0}, this exponential dependency on R, which is the bottleneck for mixture models, is removed.

In Proposition 1 below, we select values of T and \tilde{p}_{0} to achieve the desired level of accuracy. Notably, Lemma 2 suggests that we can choose an O(1) termination time such that the LSI constant of q_{t} is well-behaved and p_{T} is much simpler than p_{0}. That is to say, rdMC exhibits lower isoperimetric dependency compared to conventional MCMC techniques. Thus, we find that the overall computation complexity of rdMC reduces the dependency on R.

Proposition 1.

Assume that the OU process induced by p_{*} satisfies [A1], [A2], [A3]. We estimate p_{T} with \tilde{p}_{0} by ULA, where the number of iterations with respect to the LSI constant is \Omega\left(m^{-1}\exp(16LR^{2}e^{-2T}-2T)\right). For any \epsilon>0 and T\leq\frac{1}{2}\ln\left(1+\frac{1}{2L}\right), Algorithm 1 initialized at \tilde{p}_{0} converges to the target distribution with probability at least 1-\delta and total gradient complexity \Omega\left(\mathrm{poly}(\epsilon,\delta)\right) independent of R.

Example: Gaussian Mixture Model. We consider an example that

p_{*}(x)\propto\exp\left(-\frac{1}{2}\|x\|^{2}\right)+\exp\left(-\frac{1}{2}\|x-y\|^{2}\right),\quad y\gg 1.

The LSI constant is \Theta(e^{-C_{0}\|y\|^{2}}) (Ma et al., 2019; Schlichting, 2019; Menz & Schlichting, 2014), reflecting the complexity of the target distribution. Tosh & Dasgupta (2014) prove that the iteration complexity lower bound for MCMC-based algorithms sampling from this Gaussian mixture scales exponentially with the squared distance \|y\|^{2} between the two modes: \Omega(\exp(\|y\|^{2}/8)). Note that rdMC is not a conventional MCMC method. With the importance sampling score estimator, \tilde{O}(\epsilon^{-2}) iterations, and \tilde{O}(\epsilon^{-2}) samples at each step, the TV distance converges with \tilde{O}(\epsilon) error (Appendix B).

The computation overhead of the outer-loop process does not depend on y. For the inner-loop score estimation, we can choose T=\frac{1}{2}\log\frac{3}{2} to make the LSI constant of q_{t} be O(1). We can perform ULA to initialize p_{T}, which reduces the barrier between modes significantly. Specifically, the LSI constant of p_{T} is \Theta(e^{-C_{0}\|e^{-T}y\|^{2}}), which improves the dependence on \|y\|^{2} by a factor of e^{-2T}=\frac{2}{3}. Since this factor sits in the exponent, the reduction is exponential.
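For illustration, here is a sketch (ours) of this equal-weight two-mode target in code; the returned f_* and \nabla f_* can be plugged into the score estimators of Section 3.3.

```python
import numpy as np

def make_gmm_target(y):
    """Two-mode Gaussian mixture p_*(x) ∝ exp(-||x||^2/2) + exp(-||x - y||^2/2)."""
    y = np.asarray(y, dtype=float)

    def f_star(x):
        # f_*(x) = -log p_*(x) up to an additive constant
        return -np.logaddexp(-0.5 * np.sum(x ** 2), -0.5 * np.sum((x - y) ** 2))

    def grad_f_star(x):
        a = -0.5 * np.sum(x ** 2)
        b = -0.5 * np.sum((x - y) ** 2)
        w = 1.0 / (1.0 + np.exp(a - b))        # posterior weight of the second mode
        return (1.0 - w) * x + w * (x - y)     # mixture of the two modes' gradients
    return f_star, grad_f_star
```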

Figure 3 demonstrates this example. We choose different r to represent changes of R in [A3]. We compare the convergence of Langevin Monte Carlo (LMC), underdamped LMC (ULMC), and rdMC in terms of gradient complexity. As r increases, we find that LMC fails to converge within a limited time. ULMC, with the introduction of velocity, can alleviate this situation. Notably, our algorithm still ensures fast mixing for large r. The inner loop is initialized with the importance sampling mean estimator. Hyper-parameters and more results are provided in Appendix F.

4.2.2 Improving the dimension dependence in heavy-tailed target distributions

r=0.5    r=1.0    r=2.0
Figure 3: Maximum Mean Discrepancy (MMD) convergence of LMC, ULMC, and rdMC. The first row shows different target distributions, with increasing mode separation r and barrier heights, leading to reduced log-Sobolev (LSI) constants. The second row displays the algorithms’ convergence, revealing rdMC’s pronounced convergence advantage over ULMC/LMC, especially for large r.

In this subsection, we study the case where pp_{*} satisfies Assumption [A1], [A2], and

  1. [A4] For any r>0, we can find some R(r) such that f_{*}({\bm{x}})+r\left\|{\bm{x}}\right\|^{2} is convex for \left\|{\bm{x}}\right\|\geq R(r). Without loss of generality, we suppose R(r)=c_{R}/r^{n} for some n>0, c_{R}>0.

Assumption [A4] can be considered as a soft version of [A3]. Specifically, it permits the tail growth of the negative log-density ff_{*} to be slower than quadratic functions. This encompasses certain target distributions that fall outside the constraints of LSI and PI. Furthermore, given that the additional quadratic term present in Eq. (4) dominates the tail, qTkηq_{T-k\eta} satisfies LSI for all t>0t>0.

Lemma 3.

Under [A1], [A4], the LSI constant for qtq_{t} in the forward OU process is e2t6(1e2t)e163LR2(e2t6(1e2t))\frac{e^{-2t}}{6(1-e^{-2t})}\cdot e^{-16\cdot 3L\cdot R^{2}\left(\frac{e^{-2t}}{6(1-e^{-2t})}\right)} for any t0t\geq 0. The tail property guarantees a uniform LSI constant.

The uniform bound on the LSI constant enables us to estimate the score for any p_{t}. We can thus drop the constraints on the properties of p_{T} and let T be sufficiently large. By setting T at the \tilde{\mathcal{O}}(\ln 1/\epsilon) level, we can approximate p_{T} with p_{\infty}, the stationary distribution of the forward process. Furthermore, since the q_{t} are log-Sobolev, we can perform rdMC to sample from heavy-tailed p_{*} in the absence of a log-Sobolev inequality. The explicit computational complexity of rdMC, needed to converge to any specified accuracy, is detailed in the subsequent proposition.

Proposition 2.

Assume that the target distribution pp_{*} satisfies Assumption [A1], [A2], and [A4]. We take pp_{\infty} to be p~0\tilde{p}_{0}. For any ϵ>0\epsilon>0, by performing Algorithm 1 with ULA inner loop and hyper-parameters in Theorem 1, with a probability at least 1δ1-\delta, we have DTV(p~t,p)O~(ϵ)D_{\mathrm{TV}}\left(\tilde{p}_{t},p_{*}\right)\leq\tilde{O}(\epsilon) with Ω(d18ϵ16n83δ6exp(ϵ16n))\Omega\left(d^{18}\epsilon^{-16n-83}\delta^{-6}\exp\left(\epsilon^{-16n}\right)\right) gradient complexity.

Example: Potentials with Sub-Linear Tails. We consider an example that

p_{*}(x)\propto\exp\left(-\left(\|x\|^{2}+1\right)^{a}\right)\quad\mathrm{where}\quad a\in(0,0.5).

Lemma 5 demonstrates that this p_{*} satisfies Assumption [A4] with n=(2-2a)^{-1}\leq 1. Moreover, these target distributions with sub-linear tails also satisfy the weak Poincaré inequality (wPI) introduced in Mousavi-Hosseini et al. (2023), under which the dimension dependence of ULA to achieve convergence in TV distance is proven to be \tilde{\Omega}(d^{4a^{-1}+3}). Compared with this result, the complexity of rdMC shown in Proposition 2 has a lower dimension dependence when a\leq 4/15.
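For completeness, a sketch (ours) of this sub-linear-tail potential and its gradient, which can be fed to the ULA inner loop of Algorithm 2.

```python
import numpy as np

def make_sublinear_target(a):
    """Heavy-tailed target p_*(x) ∝ exp(-(||x||^2 + 1)^a) with a in (0, 0.5)."""
    def f_star(x):
        return (np.sum(np.asarray(x, dtype=float) ** 2) + 1.0) ** a

    def grad_f_star(x):
        x = np.asarray(x, dtype=float)
        return 2.0 * a * x * (np.sum(x ** 2) + 1.0) ** (a - 1.0)
    return f_star, grad_f_star
```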

Conclusion. This paper presents a novel sampling algorithm based on the reverse SDE of the OU process. The algorithm efficiently generates samples by using the mean estimation of a sub-problem to estimate the score function. It is shown to converge in total variation distance and proves effective in general sampling tasks with or without isoperimetric conditions. The algorithm exhibits lower isoperimetric dependency compared to conventional MCMC techniques, making it well-suited for challenging multi-modal and high-dimensional sampling problems. The analysis provides insights into the complexity of score estimation within the OU process, given the conditional posterior distribution.

Acknowledgements

The research is partially supported by the NSF awards: SCALE MoDL-2134209, CCF-2112665 (TILOS). It is also supported in part by the DARPA AIE program, the U.S. Department of Energy, Office of Science, the Facebook Research Award, as well as CDC-RFA-FT-23-0069 from the CDC’s Center for Forecasting and Outbreak Analytics.

References

  • Altschuler & Chewi (2023) Jason M Altschuler and Sinho Chewi. Faster high-accuracy log-concave sampling via algorithmic warm starts. arXiv preprint arXiv:2302.10249, 2023.
  • Bakry et al. (2014) Dominique Bakry, Ivan Gentil, Michel Ledoux, et al. Analysis and geometry of Markov diffusion operators, volume 103. Springer, 2014.
  • Berner et al. (2022) Julius Berner, Lorenz Richter, and Karen Ullrich. An optimal control perspective on diffusion-based generative modeling. arXiv preprint arXiv:2211.01364, 2022.
  • Cattiaux et al. (2021) Patrick Cattiaux, Giovanni Conforti, Ivan Gentil, and Christian Léonard. Time reversal of diffusion processes under a finite entropy condition. arXiv preprint arXiv:2104.07708, 2021.
  • Chafaï (2004) Djalil Chafaï. Entropies, convexity, and functional inequalities: on φ-entropies and φ-Sobolev inequalities. Journal of Mathematics of Kyoto University, 44(2):325–363, 2004.
  • Chatterjee & Diaconis (2018) Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. Annals of Applied Probability, 28(2), 2018.
  • Chen et al. (2023) Hongrui Chen, Holden Lee, and Jianfeng Lu. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pp. 4735–4763. PMLR, 2023.
  • Chen et al. (2022a) Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022a.
  • Chen et al. (2022b) Yongxin Chen, Sinho Chewi, Adil Salim, and Andre Wibisono. Improved analysis for a proximal algorithm for sampling. In Conference on Learning Theory, pp.  2984–3014. PMLR, 2022b.
  • Cheng & Bartlett (2018) Xiang Cheng and Peter Bartlett. Convergence of langevin mcmc in kl-divergence. In Algorithmic Learning Theory, pp.  186–211. PMLR, 2018.
  • Cheng et al. (2018) Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped langevin mcmc: A non-asymptotic analysis. In Conference on learning theory, pp.  300–323. PMLR, 2018.
  • Chewi et al. (2021) Sinho Chewi, Murat A Erdogdu, Mufan Bill Li, Ruoqi Shen, and Matthew Zhang. Analysis of langevin monte carlo from poincaré to log-sobolev. arXiv preprint arXiv:2112.12662, 2021.
  • Dalalyan (2017) Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin monte carlo and gradient descent. In Conference on Learning Theory, pp.  678–689. PMLR, 2017.
  • Dong et al. (2023) Hanze Dong, Xi Wang, LIN Yong, and Tong Zhang. Particle-based variational inference with preconditioned functional gradient flow. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=6OphWWAE3cS.
  • Durmus & Moulines (2019) Alain Durmus and Eric Moulines. High-dimensional bayesian inference via the unadjusted langevin algorithm. 2019.
  • Dwivedi et al. (2018) Raaz Dwivedi, Yuansi Chen, Martin J Wainwright, and Bin Yu. Log-concave sampling: Metropolis-hastings algorithms are fast! In Conference on learning theory, pp.  793–797. PMLR, 2018.
  • El Alaoui et al. (2022) Ahmed El Alaoui, Andrea Montanari, and Mark Sellke. Sampling from the sherrington-kirkpatrick gibbs measure via algorithmic stochastic localization. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp.  323–334. IEEE, 2022.
  • Eldan (2013) Ronen Eldan. Thin shell implies spectral gap up to polylog via a stochastic localization scheme. Geometric and Functional Analysis, 23(2):532–569, 2013.
  • Eldan (2020) Ronen Eldan. Taming correlations through entropy-efficient measure decompositions with applications to mean-field approximation. Probability Theory and Related Fields, 176(3-4):737–755, 2020.
  • Eldan (2022) Ronen Eldan. Analysis of high-dimensional distributions using pathwise methods. Proceedings of ICM, to appear, 2022.
  • Erdogdu & Hosseinzadeh (2021) Murat A Erdogdu and Rasa Hosseinzadeh. On the convergence of langevin monte carlo: The interplay between tail growth and smoothness. In Conference on Learning Theory, pp.  1776–1822. PMLR, 2021.
  • Freund et al. (2022) Yoav Freund, Yi-An Ma, and Tong Zhang. When is the convergence time of langevin algorithms dimension independent? a composite optimization viewpoint. Journal of Machine Learning Research, 23(214), 2022.
  • Gross (1975) Leonard Gross. Logarithmic sobolev inequalities. American Journal of Mathematics, 97(4):1061–1083, 1975.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Huang et al. (2021) Jian Huang, Yuling Jiao, Lican Kang, Xu Liao, Jin Liu, and Yanyan Liu. Schrödinger-föllmer sampler: sampling without ergodicity. arXiv preprint arXiv:2106.10880, 2021.
  • Jerrum & Sinclair (1996) Mark Jerrum and Alistair Sinclair. The markov chain monte carlo method: an approach to approximate counting and integration. Approximation algorithms for NP-hard problems, pp.  482–520, 1996.
  • Le Gall (2016) Jean-François Le Gall. Brownian motion, martingales, and stochastic calculus. Springer, 2016.
  • Lee et al. (2021a) Holden Lee, Chirag Pabbaraju, Anish Sevekari, and Andrej Risteski. Universal approximation for log-concave distributions using well-conditioned normalizing flows. arXiv preprint arXiv:2107.02951, 2021a.
  • Lee et al. (2022) Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative modeling with polynomial complexity. arXiv preprint arXiv:2206.06227, 2022.
  • Lee et al. (2023) Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. In International Conference on Algorithmic Learning Theory, pp.  946–985. PMLR, 2023.
  • Lee et al. (2021b) Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Structured logconcave sampling with a restricted gaussian oracle. In Conference on Learning Theory, pp.  2993–3050. PMLR, 2021b.
  • Liu & Wang (2016) Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. Advances in neural information processing systems, 29, 2016.
  • Ma et al. (2019) Yi-An Ma, Yuansi Chen, Chi Jin, Nicolas Flammarion, and Michael I Jordan. Sampling can be faster than optimization. Proceedings of the National Academy of Sciences, 116(42):20881–20885, 2019.
  • Ma et al. (2021) Yi-An Ma, Niladri S Chatterji, Xiang Cheng, Nicolas Flammarion, Peter L Bartlett, and Michael I Jordan. Is there an analog of Nesterov acceleration for gradient-based MCMC? Bernoulli, 27(3), 2021.
  • Menz & Schlichting (2014) Georg Menz and André Schlichting. Poincaré and logarithmic sobolev inequalities by decomposition of the energy landscape. 2014.
  • Montanari (2023) Andrea Montanari. Sampling, diffusions, and stochastic localization. arXiv preprint arXiv:2305.10690, 2023.
  • Montanari & Wu (2023) Andrea Montanari and Yuchen Wu. Posterior sampling from the spiked models via diffusion processes. arXiv preprint arXiv:2304.11449, 2023.
  • Mou et al. (2021) Wenlong Mou, Yi-An Ma, Martin J Wainwright, Peter L Bartlett, and Michael I Jordan. High-order langevin diffusion yields an accelerated mcmc algorithm. The Journal of Machine Learning Research, 22(1):1919–1959, 2021.
  • Mousavi-Hosseini et al. (2023) Alireza Mousavi-Hosseini, Tyler Farghly, Ye He, Krishnakumar Balasubramanian, and Murat A Erdogdu. Towards a complete analysis of Langevin Monte Carlo: Beyond Poincaré inequality. arXiv preprint arXiv:2303.03589, 2023.
  • Neal (1993) Radford M Neal. Probabilistic inference using Markov chain Monte Carlo methods. Department of Computer Science, University of Toronto Toronto, ON, Canada, 1993.
  • Poincaré (1890) Henri Poincaré. Sur les équations aux dérivées partielles de la physique mathématique. American Journal of Mathematics, pp.  211–294, 1890.
  • Raginsky et al. (2017) Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pp.  1674–1703. PMLR, 2017.
  • Robert et al. (1999) Christian P Robert, George Casella, and George Casella. Monte Carlo statistical methods, volume 2. Springer, 1999.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10684–10695, 2022.
  • Rothaus (1981) OS Rothaus. Diffusion on compact riemannian manifolds and logarithmic sobolev inequalities. Journal of functional analysis, 42(1):102–109, 1981.
  • Schlichting (2019) André Schlichting. Poincaré and log–sobolev inequalities for mixtures. Entropy, 21(1):89, 2019.
  • Song et al. (2020) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • Tosh & Dasgupta (2014) Christopher Tosh and Sanjoy Dasgupta. Lower bounds for the gibbs sampler over mixtures of gaussians. In International Conference on Machine Learning, pp. 1467–1475. PMLR, 2014.
  • Tzen & Raginsky (2019) Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference on Learning Theory, pp.  3084–3114. PMLR, 2019.
  • Vargas et al. (2023a) Francisco Vargas, Will Grathwohl, and Arnaud Doucet. Denoising diffusion samplers. arXiv preprint arXiv:2302.13834, 2023a.
  • Vargas et al. (2023b) Francisco Vargas, Andrius Ovsianas, David Fernandes, Mark Girolami, Neil D Lawrence, and Nikolas Nüsken. Bayesian learning via neural schrödinger–föllmer flows. Statistics and Computing, 33(1):3, 2023b.
  • Vargas et al. (2023c) Francisco Vargas, Teodora Reu, and Anna Kerekes. Expressiveness remarks for denoising diffusion based sampling. In Fifth Symposium on Advances in Approximate Bayesian Inference, 2023c.
  • Vempala & Wibisono (2019) Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted langevin algorithm: Isoperimetry suffices. Advances in neural information processing systems, 32, 2019.
  • Villani (2021) Cédric Villani. Topics in optimal transportation, volume 58. American Mathematical Soc., 2021.
  • Wainwright (2019) Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
  • Zhang et al. (2023) Dinghuai Zhang, Ricky Tian Qi Chen, Cheng-Hao Liu, Aaron Courville, and Yoshua Bengio. Diffusion generative flow samplers: Improving learning signals through partial trajectory optimization. arXiv preprint arXiv:2310.02679, 2023.

Appendix A Proof Sketch

For a better understanding of our paper, we provide a proof sketch below.

Lemma 1 is a direct result of Bayes' theorem (Appendix C.1) and is the main motivation of our algorithm: estimating the score with samples from q.

Our main contribution is to prove the convergence and evaluate the complexity of Algorithm 1, where the inner loop is performed by Algorithm 2.

Our analysis is based on the TV distance (please refer to Table 1 for the notation definitions), where we use the data-processing inequality, the triangle inequality, and Pinsker’s inequality (refer to Eq. (19) for details) to obtain the upper bound below:

D_{\mathrm{TV}}\left(\tilde{p}_{T},p_{*}\right)\leq\underbrace{\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\tilde{p}_{0}\|p_{T}\right)}}_{\mathrm{Term\ 1}}+\underbrace{\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\hat{P}_{T}\big{\|}\tilde{P}_{T}^{p_{T}}\right)}}_{\mathrm{Term\ 2}},

where Term 1 is the error between \tilde{p}_{0} and p_{T}, and Term 2 is the score estimation loss over the whole trajectory.

Theorem 1 considers the case where every q_{T-k\eta} is log-Sobolev with constant \mu_{k} (k\eta\leq T); these log-Sobolev constants can further be estimated under additional assumptions on p_{*}.

By definition,

2(\text{Term 2})^{2}=\sum_{k=0}^{N-1}\int_{k\eta}^{(k+1)\eta}\mathbb{E}_{\hat{P}_{T}}\left[\frac{1}{4}\cdot\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right]\mathrm{d}t,

where Term 2 is defined by an integration.

To bound the error between the integral and the discretized algorithm, Lemma 7 shows that when \eta=O(\epsilon^{2}),

\frac{1}{4}\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right]\leq 4\epsilon^{2}+\frac{1}{2}\cdot\underbrace{\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]}_{\epsilon_{\mathrm{score}}} (7)

Note that the ϵscore\epsilon_{\mathrm{score}} can be controlled by

2𝐯k(𝒙)2𝔼𝐱0qTkη(|𝒙)[𝒙e(Tkη)𝐱0(1e2(Tkη))]2Term 2.1\displaystyle 2\underbrace{\left\|{\mathbf{v}}_{k}({\bm{x}})-2\mathbb{E}_{{\mathbf{x}}^{\prime}_{0}\sim q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})}\left[-\frac{{\bm{x}}-e^{-(T-k\eta)}{\mathbf{x}}^{\prime}_{0}}{\left(1-e^{-2(T-k\eta)}\right)}\right]\right\|^{2}}_{\mathrm{Term\ 2.1}}

and

22e(Tkη)1e2(Tkη)[𝔼𝐱0qTkη(|𝒙)[𝐱0]𝔼𝐱0qTkη(|𝒙)[𝐱0]]2Term 2.2,2\underbrace{\left\|\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\left[\mathbb{E}_{{\mathbf{x}}^{\prime}_{0}\sim q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})}[{\mathbf{x}}^{\prime}_{0}]-\mathbb{E}_{{\mathbf{x}}_{0}\sim q_{T-k\eta}(\cdot|{\bm{x}})}[{\mathbf{x}}_{0}]\right]\right\|^{2}}_{\mathrm{Term\ 2.2}},

where qq^{\prime} is the estimated inner loop distribution by ULA.

Lemma 8 provides the concentration bound for Term 2.1.

We also have

Term 2.28η2CLSI,k1DKL(qTkη(|𝒙)qTkη(|𝒙)),\mathrm{Term\ 2.2}\leq 8\eta^{-2}C_{\mathrm{LSI},k}^{-1}D_{\mathrm{KL}}\left({q}^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\|{q}_{T-k\eta}(\cdot|{\bm{x}})\right),

where the KL divergence is controlled by the convergence of ULA (Lemma 9). Thus, the desired convergence can be obtained.

Note that Theorem 1 relies on the fact that all qTkηq_{T-k\eta} are log-Sobolev, so we aim to estimate the log-Sobolev constant of every qTkηq_{T-k\eta}.

Lemmas 2 and 3 provide two approaches to estimate the log-Sobolev constant for different time steps.

When pp_{*} is strongly convex outside a ball of radius RR, we can show that pTp_{T} effectively reduces this radius and thereby improves the log-Sobolev constant. Thus, we can choose a proper TT such that pTp_{T} has a larger log-Sobolev constant than p0p_{0} and the qtq_{t} are strongly log-concave, so that the log-Sobolev constants are independent of RR, which makes the algorithm mix fast.

For more general distributions without a log-Sobolev constant, Lemma 10 provides log-Sobolev constants along the whole trajectory, from which the computational complexity can be obtained.

Appendix B Notations and discussions

B.1 Algorithm

Algorithm 3 Initialization of p^\hat{p} if p^p\hat{p}\neq p_{\infty}
1:  Input: Initial particle 𝒙0{{\bm{x}}}_{0}, OU process termination time TT, Iterations T0T_{0}, Step size η0\eta_{0}, Sample size nn.
2:  for k=1k=1 to T0T_{0} do
3:     Create nkn_{k} Monte Carlo samples to estimate
𝔼𝒙qT[𝒙k1eT𝒙(1e2T)], s.t. qT(𝒙|𝒙k1)exp(f(𝒙)𝒙k1eT𝒙22(1e2T)).\mathbb{E}_{{\bm{x}}\sim q_{T}}\left[-\frac{{\bm{x}}_{k-1}-e^{-T}{\bm{x}}}{\left(1-e^{-2T}\right)}\right],\text{ s.t. }q_{T}({\bm{x}}|{\bm{x}}_{k-1})\propto\exp\left(-f_{*}({\bm{x}})-\frac{\left\|{\bm{x}}_{k-1}-e^{-T}{\bm{x}}\right\|^{2}}{2\left(1-e^{-2T}\right)}\right).
4:     Compute the corresponding estimator 𝒗k{\bm{v}}_{k}.
5:     𝒙k=𝒙k1+η0𝒗k+2η0ξwhere ξ is sampled from 𝒩(0,𝑰d){{\bm{x}}}_{k}={{\bm{x}}}_{k-1}+\eta_{0}{\bm{v}}_{k}+\sqrt{2\eta_{0}}\xi\quad\text{where $\xi$ is sampled from }\mathcal{N}\left(0,{\bm{I}}_{d}\right).
6:  end for
7:  Return: xT0{\bm{x}}_{T_{0}} .
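For concreteness, a minimal Python sketch of this initialization loop is given below. Algorithm 3 leaves the Monte Carlo estimator in step 3 unspecified; the sketch uses a self-normalized importance-sampling estimate with the Gaussian factor of the posterior as the proposal, which is an assumed choice rather than a prescribed one, and the names (init_particle, f_star, etc.) are ours.

```python
import numpy as np

def init_particle(f_star, x0, T, T0, eta0, n, rng=None):
    """Sketch of Algorithm 3: Langevin-style warm-up targeting p_T.

    f_star(x): negative log-density of the target p_* (unnormalized).
    The score of p_T at the current particle is estimated via the posterior
    mean in Lemma 1, here by self-normalized importance sampling.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    d = x0.shape[0]
    a, var = np.exp(-T), 1.0 - np.exp(-2.0 * T)
    x = x0.copy()
    for _ in range(T0):
        # Proposal = the Gaussian factor of the posterior given x, i.e. N(x / a, (var / a^2) I).
        z = x / a + (np.sqrt(var) / a) * rng.standard_normal((n, d))
        w = np.exp(-np.array([f_star(zi) for zi in z]))        # unnormalized weights exp(-f_*)
        mean_x0 = (w[:, None] * z).sum(axis=0) / w.sum()       # estimated posterior mean
        v = -(x - a * mean_x0) / var                           # estimated grad ln p_T(x)
        x = x + eta0 * v + np.sqrt(2.0 * eta0) * rng.standard_normal(d)
    return x
```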

B.2 Notations

According to Cattiaux et al. (2021), under mild conditions on the following forward process d𝐱t=bt(𝐱t)dt+2dBt,\mathrm{d}{\mathbf{x}}_{t}=b_{t}({\mathbf{x}}_{t})\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t}, the reverse process also admits an SDE description. If we fix the terminal time T>0T>0 and set 𝐱^t=𝐱Tt,fort[0,T],\hat{{\mathbf{x}}}_{t}={\mathbf{x}}_{T-t},\quad\mathrm{for}\ t\in[0,T], the process (𝐱^t)t[0,T](\hat{{\mathbf{x}}}_{t})_{t\in[0,T]} satisfies the SDE d𝐱^t=b^t(𝐱^t)dt+2dBt,\mathrm{d}\hat{{\mathbf{x}}}_{t}=\hat{b}_{t}(\hat{{\mathbf{x}}}_{t})\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t}, where the reverse drift satisfies the relation bt+b^Tt=2lnpt,𝐱tpt.b_{t}+\hat{b}_{T-t}=2\nabla\ln p_{t},{\mathbf{x}}_{t}\sim p_{t}. Under this condition, the reverse process of SDE (1) is as follows

d𝐱^t=(𝐱^t+2lnpTt(𝐱^t))dt+2dBt.\mathrm{d}\hat{{\mathbf{x}}}_{t}=\left(\hat{{\mathbf{x}}}_{t}+2\nabla\ln p_{T-t}(\hat{{\mathbf{x}}}_{t})\right)\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t}. (8)

Thus, once the score function lnpTt\nabla\ln p_{T-t} is obtained, the reverse SDE induces a sampling algorithm. To obtain the particles along SDE (8), the first step is to initialize the particle with a tractable starting distribution. In practice, it is usually hard to sample from the ideal initialization pTp_{T} directly due to its unknown properties. Instead, we sample from an approximated distribution p^\hat{p}. For large TT, pp_{\infty} is chosen to approximate pTp_{T}, as their gap can be controlled. For the iteration, we utilize the numerical discretization method, i.e., DDPM (Ho et al., 2020), which is widely used in the diffusion model literature. Different from ULA, DDPM divides SDE (8) into different time segments and considers the following SDE for each segment

d𝐱¯t=(𝐱¯t+2lnpTkη(𝐱¯kη))dt+2dBt,t[kη,(k+1)η],𝐱¯0pT\mathrm{d}\bar{{\mathbf{x}}}_{t}=\left(\bar{{\mathbf{x}}}_{t}+2\nabla\ln p_{T-k\eta}(\bar{{\mathbf{x}}}_{k\eta})\right)\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t},\quad t\in\left[k\eta,(k+1)\eta\right],\quad\bar{{\mathbf{x}}}_{0}\sim p_{T} (9)

to discretize SDE (8). Suppose we obtain 𝐯k{\mathbf{v}}_{k} to approximate 2lnpTkη2\nabla\ln p_{T-k\eta} at each iteration. Then, we obtain SDE (2) for the practical updates of particles.
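Since the drift of Eq. (9) is affine in 𝐱¯t\bar{{\mathbf{x}}}_{t} once the frozen score estimate 𝐯k{\mathbf{v}}_{k} is substituted for 2lnpTkη2\nabla\ln p_{T-k\eta}, each segment can be integrated exactly in law. A minimal sketch of the resulting one-step update is shown below; the name step_reverse is ours, and whether one uses this exact-in-law integrator or a plain Euler–Maruyama step is an implementation choice that the text does not fix.

```python
import numpy as np

def step_reverse(x, v_k, eta, rng):
    """One segment of the discretized reverse SDE (9) with frozen drift v_k ~ 2 grad ln p_{T-k eta}.

    Exactly integrating dx = (x + v_k) dt + sqrt(2) dB over a window of length eta gives
    x_{(k+1) eta} = e^eta x_{k eta} + (e^eta - 1) v_k + N(0, (e^{2 eta} - 1) I).
    """
    noise = np.sqrt(np.exp(2.0 * eta) - 1.0) * rng.standard_normal(np.shape(x))
    return np.exp(eta) * x + (np.exp(eta) - 1.0) * v_k + noise
```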

Restatement of Algorithm 1

Once the score function can be estimated, the sampling algorithm can be presented. The detailed algorithm is described in Algorithm 1. By following these steps, our proposed algorithm efficiently addresses the given problem and demonstrates its effectiveness in practical applications. Specifically, we summarize our algorithm as follows: (1) choose a proper TT and a proper p^\hat{p} such that pTp^p_{T}\approx\hat{p}; this step can be done either by setting p^=p\hat{p}=p_{\infty} for large TT or by performing Langevin Monte Carlo for pTp_{T}; (2) sample from p^\hat{p}; (3) sample from a distribution that approximates qTtq_{T-t}; (4) update p~t\tilde{p}_{t} with the approximate qTtq_{T-t} samples; this step can be done by a Langevin Monte Carlo inner loop, as logqTt\nabla\log q_{T-t} is explicit; (5) repeat (3) and (4) until tTt\approx T. After (5), we can also run the Langevin algorithm to fine-tune the samples if the gradient complexity budget has not been reached. The main algorithm is given as Algorithm 1, and a code sketch is provided below.
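Putting steps (1)–(5) together, a self-contained sketch of the outer loop with a ULA inner loop is given below. It initializes from p^=p\hat{p}=p_{\infty}, assumes access to the gradient of ff_{*}, and uses the exact-in-law segment update from above; the step sizes, loop lengths, and warm start of the inner chain are illustrative choices rather than the constants prescribed by the theory.

```python
import numpy as np

def rdmc_sample(grad_f_star, d, T=5.0, eta=0.05, n=256, inner_iters=50, tau=1e-2, rng=None):
    """Sketch of the rdMC outer loop (Algorithm 1) with a ULA inner loop (Algorithm 2).

    grad_f_star(x): gradient of f_* = -ln p_* (unnormalized target).
    Returns one approximate sample from p_*.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(d)                    # steps (1)-(2): p_hat = p_inf = N(0, I)
    for k in range(int(T / eta)):
        s = T - k * eta                           # remaining forward time T - k eta
        a, var = np.exp(-s), 1.0 - np.exp(-2.0 * s)
        # Step (3): ULA targeting q_s(x0 | x) ~ exp(-f_*(x0) - ||x - a x0||^2 / (2 var));
        # its score in x0 is -grad f_*(x0) + a (x - a x0) / var, which is explicit.
        z = x / a + (np.sqrt(var) / a) * rng.standard_normal((n, d))   # warm start at the Gaussian factor
        for _ in range(inner_iters):
            score = -np.apply_along_axis(grad_f_star, 1, z) + a * (x - a * z) / var
            z = z + tau * score + np.sqrt(2.0 * tau) * rng.standard_normal((n, d))
        # Step (4): mean estimation gives v_k ~ 2 grad ln p_s(x); then one reverse segment.
        v_k = -2.0 * (x - a * z.mean(axis=0)) / var
        x = (np.exp(eta) * x + (np.exp(eta) - 1.0) * v_k
             + np.sqrt(np.exp(2.0 * eta) - 1.0) * rng.standard_normal(d))
    return x
```

For instance, with grad_f_star = lambda x: x (so that pp_{*} is a standard Gaussian), the returned samples should be approximately standard normal.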

Here, we reiterate our notation in Table 1 and the definition of log-Sobolev inequality as follows.

Definition 1.

(Logarithmic Sobolev inequality) A distribution with density function pp satisfies the log-Sobolev inequality with a constant μ>0\mu>0 if, for all smooth functions g:dg\colon\mathbb{R}^{d}\rightarrow\mathbb{R} with 𝔼p[g2]<\mathbb{E}_{p}[g^{2}]<\infty,

𝔼p[g2lng2]𝔼p[g2]ln𝔼p[g2]2μ𝔼p[g2].\mathbb{E}_{p}\left[g^{2}\ln g^{2}\right]-\mathbb{E}_{p}\left[g^{2}\right]\ln\mathbb{E}_{p}\left[g^{2}\right]\leq\frac{2}{\mu}\mathbb{E}_{p}\left[\left\|\nabla g\right\|^{2}\right]. (10)
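As a numerical illustration of Definition 1, the snippet below evaluates both sides of Eq. (10) for p=𝒩(0,1)p=\mathcal{N}(0,1) (which satisfies the log-Sobolev inequality with μ=1\mu=1) and a test function g(x)=2+cos(x)g(x)=2+\cos(x); the choice of gg and the quadrature grid are arbitrary.

```python
import numpy as np

xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]
p = np.exp(-0.5 * xs**2) / np.sqrt(2.0 * np.pi)     # density of N(0, 1), LSI constant mu = 1

g = 2.0 + np.cos(xs)
dg = -np.sin(xs)

E = lambda f: np.sum(f * p) * dx                    # expectation under p by quadrature
lhs = E(g**2 * np.log(g**2)) - E(g**2) * np.log(E(g**2))
rhs = (2.0 / 1.0) * E(dg**2)

print(lhs, rhs)    # lhs should not exceed rhs, per Eq. (10)
```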
Table 1: Notation List
Symbols Description
φσ2\varphi_{\sigma^{2}} The density function of the centered Gaussian distribution, i.e., 𝒩(𝟎,σ2𝑰)\mathcal{N}\left({\bm{0}},\sigma^{2}{\bm{I}}\right).
p,p0p_{*},p_{0} The target density function (initial distribution of the forward process)
(𝐱t)t[0,T]\left({\mathbf{x}}_{t}\right)_{t\in[0,T]} The forward process, i.e., SDE (1)
ptp_{t} The density function of 𝐱t{\mathbf{x}}_{t}, i.e., 𝐱tpt{\mathbf{x}}_{t}\sim p_{t}
pp_{\infty} The density function of stationary distribution of the forward process.
(𝐱^t)t[0,T]\left(\hat{{\mathbf{x}}}_{t}\right)_{t\in[0,T]} The ideal reverse process, i.e., SDE (8)
p^t\hat{p}_{t} The density function of 𝐱^t\hat{{\mathbf{x}}}_{t}, i.e., 𝐱^tp^t\hat{{\mathbf{x}}}_{t}\sim\hat{p}_{t} and pt=p^Ttp_{t}=\hat{p}_{T-t}
P^T\hat{P}_{T} The law of the ideal reverse process SDE (8) over the path space 𝒞([0,T];d)\mathcal{C}\left([0,T];\mathbb{R}^{d}\right).
(𝐱¯t)t[0,T]\left(\bar{{\mathbf{x}}}_{t}\right)_{t\in[0,T]} The reverse process following from SDE (9)
p¯t\bar{p}_{t} The density function of 𝐱¯t\bar{{\mathbf{x}}}_{t}, i.e., 𝐱¯tp¯t\bar{{\mathbf{x}}}_{t}\sim\bar{p}_{t}
(𝐱~t)t[0,T]\left(\tilde{{\mathbf{x}}}_{t}\right)_{t\in[0,T]} The practical reverse process following from SDE (2) with initial distribution qq
p~t\tilde{p}_{t} The density function of 𝐱~t\tilde{{\mathbf{x}}}_{t}, i.e., 𝐱~tp~t\tilde{{\mathbf{x}}}_{t}\sim\tilde{p}_{t}
P~T\tilde{P}_{T} The law of the reverse process (𝐱~t)t[0,T]\left(\tilde{{\mathbf{x}}}_{t}\right)_{t\in[0,T]} over the path space 𝒞([0,T];d)\mathcal{C}\left([0,T];\mathbb{R}^{d}\right).
(𝐱~tpT)t[0,T]\left(\tilde{{\mathbf{x}}}^{p_{T}}_{t}\right)_{t\in[0,T]} The reverse process following from SDE (2) with initial distribution pTp_{T}
p~tpT\tilde{p}^{p_{T}}_{t} The density function of 𝐱~tpT\tilde{{\mathbf{x}}}^{p_{T}}_{t}, i.e., 𝐱~tpTp~tpT\tilde{{\mathbf{x}}}^{p_{T}}_{t}\sim\tilde{p}^{p_{T}}_{t}
P~TpT\tilde{P}^{p_{T}}_{T} The law of the reverse process (𝐱~tpT)t[0,T]\left(\tilde{{\mathbf{x}}}^{p_{T}}_{t}\right)_{t\in[0,T]} over the path space 𝒞([0,T];d)\mathcal{C}\left([0,T];\mathbb{R}^{d}\right).

Besides, many constants are used in our proofs. We list them here to prevent confusion.

Constant Value Constant Value
C0C_{0} DKL(p0p)D_{\mathrm{KL}}\left(p_{0}\|p_{\infty}\right) C1C_{1} 216L22^{-16}\cdot L^{-2}
C2C_{2} 48L62n28nC08n48L\cdot 6^{2n}\cdot 2^{-8n}\cdot C_{0}^{8n} C3C_{3} 64C0C13C664\cdot C_{0}C_{1}^{-3}C_{6}
C4C_{4} 213C04C18C622^{-13}\cdot C_{0}^{-4}C_{1}^{8}C_{6}^{-2} C5C_{5} 283452L2C2C41C62ln(24L2C04C41C6)2^{8}\cdot 3^{4}\cdot 5^{2}L^{2}\cdot C_{2}C_{4}^{-1}C_{6}^{2}\ln\left(2^{-4}L^{2}C_{0}^{4}C_{4}^{-1}C_{6}\right)
C6C_{6} 624C046\cdot 2^{-4}\cdot C_{0}^{4} C5C_{5}^{\prime} 222345L2C04C18ln(28LC08C18)2^{22}\cdot 3^{4}\cdot 5\cdot L^{-2}C_{0}^{4}C_{1}^{-8}\ln\left(\frac{2^{8}}{L}\cdot C_{0}^{8}C_{1}^{-8}\right)
C3C_{3}^{\prime} 64L1C0C1364L^{-1}\cdot C_{0}C_{1}^{-3}
Table 2: Constant List

B.3 Examples

Lemma 4.

(Gaussian mixture example) The iteration and sample complexity of rdMC with the importance sampling estimator are O(ϵ2)O(\epsilon^{-2}).

Proof.

Similar to Eq. (19) in Theorem 1, we have

DTV(p~T,p)12DKL(p~0pT)Term 1+12DKL(P^TP~TpT)Term 2.\displaystyle D_{\mathrm{TV}}\left(\tilde{p}_{T},p_{*}\right)\leq\underbrace{\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\tilde{p}_{0}\|p_{T}\right)}}_{\mathrm{Term\ 1}}+\underbrace{\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\hat{P}_{T}\big{\|}\tilde{P}_{T}^{p_{T}}\right)}}_{\mathrm{Term\ 2}}. (11)

by using the data-processing inequality, the triangle inequality, and Pinsker's inequality.

To ensure that Term 1 is controllable, we choose T=2lnDKL(pp)2ϵ2T=2\ln\frac{D_{\mathrm{KL}}(p_{*}\|p_{\infty})}{2\epsilon^{2}}.

For Term 2,

DKL(P^TP~TpT)=𝔼P^T[lndP^TdP~TpT]=\displaystyle D_{\mathrm{KL}}\left(\hat{P}_{T}\big{\|}\tilde{P}_{T}^{p_{T}}\right)=\mathbb{E}_{\hat{P}_{T}}\left[\ln\frac{\mathrm{d}\hat{P}_{T}}{\mathrm{d}\tilde{P}_{T}^{p_{T}}}\right]= 14k=0N1𝔼P^T[kη(k+1)η𝐯k(𝐱kη)2lnpTt(𝐱t)2dt]\displaystyle\frac{1}{4}\sum_{k=0}^{N-1}\mathbb{E}_{\hat{P}_{T}}\left[\int_{k\eta}^{(k+1)\eta}\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\mathrm{d}t\right]
=\displaystyle= k=0N1kη(k+1)η𝔼P^T[14𝐯k(𝐱kη)2lnpTt(𝐱t)2]dt,\displaystyle\sum_{k=0}^{N-1}\int_{k\eta}^{(k+1)\eta}\mathbb{E}_{\hat{P}_{T}}\left[\frac{1}{4}\cdot\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right]\mathrm{d}t,

By Lemma 7,

14𝔼P^T[𝐯k(𝐱kη)2lnpTt(𝐱t)2]\displaystyle\frac{1}{4}\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right]\leq 4ϵ2+12𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2]ϵscore\displaystyle 4\epsilon^{2}+\frac{1}{2}\cdot\underbrace{\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]}_{\epsilon_{\mathrm{score}}} (12)

Thus, the iteration complexity of the rdMC algorithm is O~(ϵ2){\tilde{O}}(\epsilon^{-2}) when the score estimator's L2L_{2} error is O~(ϵ){\tilde{O}}(\epsilon), which is controlled by concentration inequalities.

For time tt, since we have DKL(p~TtpTt)ϵD_{\mathrm{KL}}(\tilde{p}_{T-t}\|{p}_{T-t})\leq\epsilon and pTtp_{T-t} is sub-Gaussian, the density p~Tt\tilde{p}_{T-t} is also sub-Gaussian with variance σt2\sigma_{t}^{\prime 2}.

We have

(|x𝔼x|>2σtln(3/δ))δ3\mathbb{P}(|x-{\mathbb{E}}x|>\sqrt{2}\sigma_{t}^{\prime}\ln(3/\delta))\leq\frac{\delta}{3}

For the Gaussian mixture model, exp(f(x0))\exp(-f_{*}(x_{0})) is 22-Lipschitz and the function ff_{*} is 11-Lipschitz smooth, so xe(Tt)x0(1e2(Tt))exp(f(x0))-\frac{x-e^{-(T-t)}x_{0}}{\left(1-e^{-2(T-t)}\right)}\cdot\exp(-f_{*}(x_{0})) is Gx,tG_{x,t}-Lipschitz.

For time tt, the variance is σt2=1e2(Tt)e2(Tt)\sigma_{t}^{2}=\frac{1-e^{-2(T-t)}}{e^{-2(T-t)}}.

Assume that the expectation and the estimator of xe(Tt)x0(1e2(Tt))exp(f(x0))-\frac{x-e^{-(T-t)}x_{0}}{\left(1-e^{-2(T-t)}\right)}\cdot\exp(-f_{*}(x_{0})) are μX\mu_{X} and XX, and that the expectation and the estimator of exp(f(x0))\exp(-f_{*}(x_{0})) are μY\mu_{Y} and YY, respectively. Then

(|XμX|>ϵ)exp(nϵ22σt2Gx,t2)\mathbb{P}(|X-\mu_{X}|>\epsilon)\leq\exp\left(-\frac{n\epsilon^{2}}{2\sigma_{t}^{2}G_{x,t}^{2}}\right)
(|YμY|>ϵ)exp(nϵ28σt2)\mathbb{P}(|Y-\mu_{Y}|>\epsilon)\leq\exp\left(-\frac{n\epsilon^{2}}{8\sigma_{t}^{2}}\right)

We choose x+=𝔼x+2σtln(3/δ)x_{+}={\mathbb{E}}x+\sqrt{2}\sigma_{t}^{\prime}\ln(3/\delta), G=Gx+,tG=G_{x_{+},t}.

If n=max(G2,4)σt2ϵ2ln(3δ1)n=\max(G^{2},4)\sigma_{t}^{2}\epsilon^{-2}\ln(3\delta^{-1}), with probability 1δ1-\delta, the error of the estimator is at most ϵ\epsilon.

Moreover, if we instead use the inner-loop iterations to estimate the posterior, we can start from some T>0T>0 and estimate pTp_{T} initially.

Specifically, for the Gaussian mixture, we can get a tighter log-Sobolev bound. For time tt, we have pte12x2+e12xety2p_{t}\propto e^{-\frac{1}{2}\|x\|^{2}}+e^{-\frac{1}{2}\|x-e^{-t}y\|^{2}}, which indicates that the log-Sobolev constant satisfies CLSI,y,t11+14(eety2+1)C_{\mathrm{LSI},y,t}^{-1}\leq 1+\frac{1}{4}(e^{\|e^{-t}y\|^{2}}+1). Considering the smoothness, we have

|d2dx2logp(x)|=|1+(ex2(ex2/2+e1/2(xy)2)2+ex2/2(ex2/2+e1/2(xy)2))y2|1.\left|\frac{\mathrm{d}^{2}}{\mathrm{d}x^{2}}\log p(x)\right|=\left|-1+\left(-\frac{e^{x^{2}}}{(e^{x^{2}/2}+e^{1/2(x-y)^{2}})^{2}}+\frac{e^{x^{2}/2}}{(e^{x^{2}/2}+e^{1/2(x-y)^{2}})}\right)y^{2}\right|\leq 1.

When we choose T=12log2L2L+1=12log32T=-\frac{1}{2}\log\frac{2L}{2L+1}=\frac{1}{2}\log\frac{3}{2}, the estimation of pTp_{T} needs O(e23y2)O(e^{\frac{2}{3}y^{2}}) iterations, which improves on the original O(ey2)O(e^{y^{2}}). ∎
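To make the estimator in this example concrete, the sketch below computes the self-normalized ratio X/YX/Y for the one-dimensional mixture pe12x2+e12(xy)2p_{*}\propto e^{-\frac{1}{2}x^{2}}+e^{-\frac{1}{2}(x-y)^{2}} and compares it with the closed-form score of pTtp_{T-t} (whose components keep unit variance under the OU flow); the values of yy, the time, the query point, and the sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
y, s, x, n = 4.0, 0.5, 1.0, 200_000           # mode separation, time s = T - t, query point, samples
a, var = np.exp(-s), 1.0 - np.exp(-2.0 * s)

f_star = lambda x0: -np.log(np.exp(-0.5 * x0**2) + np.exp(-0.5 * (x0 - y)**2))

# Proposal = Gaussian factor of q_{T-t}(x0 | x), i.e. N(x / a, var / a^2); weights = exp(-f_*).
x0 = x / a + (np.sqrt(var) / a) * rng.standard_normal(n)
w = np.exp(-f_star(x0))
score_is = np.sum(w * (-(x - a * x0) / var)) / np.sum(w)     # the self-normalized ratio X / Y

# Closed form: p_{T-t}(x) is proportional to exp(-x^2/2) + exp(-(x - a y)^2 / 2).
m = a * y
num = -x * np.exp(-0.5 * x**2) - (x - m) * np.exp(-0.5 * (x - m)**2)
den = np.exp(-0.5 * x**2) + np.exp(-0.5 * (x - m)**2)
print(score_is, num / den)                    # the two values should agree closely
```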

Lemma 5.

Suppose the negative log density of target distribution f=lnpf_{*}=-\ln p_{*} satisfies

f(x)=(x2+1)af_{*}(x)=\left(\|x\|^{2}+1\right)^{a}

where a[0,0.5]a\in[0,0.5]. Then, for any r>0r>0, the regularized function f(𝒙)+r𝒙2f_{*}({\bm{x}})+r\|{\bm{x}}\|^{2} is convex whenever 𝒙r1/(22a)\|{\bm{x}}\|\geq r^{-1/(2-2a)}.

Proof.

Considering the Hessian of ff_{*}, we have

2f(𝒙)=2a(x2+1)a2((2a1)x2+1).\nabla^{2}f_{*}({\bm{x}})=2a\cdot(x^{2}+1)^{a-2}\cdot\left((2a-1)x^{2}+1\right).

If we require |x|r1/(22a)|x|\geq r^{-1/(2-2a)}, then

|x|r1/(22a)|x|22ar1r|x|2|x|42a\displaystyle|x|\geq r^{-1/(2-2a)}\quad\Rightarrow|x|^{2-2a}\geq r^{-1}\quad\Leftrightarrow\quad r\geq\frac{|x|^{2}}{|x|^{4-2a}}
ra(12a)|x|21(|x|2+1)2a2a(x2+1)a2((2a1)x2+1)+2r=2(f(𝒙)+r|x|2)0.\displaystyle\Rightarrow\quad\frac{r}{a}\geq\frac{(1-2a)|x|^{2}-1}{(|x|^{2}+1)^{2-a}}\Rightarrow\quad 2a\cdot(x^{2}+1)^{a-2}\cdot\left((2a-1)x^{2}+1\right)+2r=\nabla^{2}(f_{*}({\bm{x}})+r|x|^{2})\geq 0.

It means that if we choose CR=1C_{R}=1 and n=1/(22a)n=1/(2-2a), then f(𝒙)+r𝒙2f_{*}({\bm{x}})+r\|{\bm{x}}\|^{2} is convex for all 𝒙CRrn\|{\bm{x}}\|\geq C_{R}\cdot r^{-n}, which completes the proof. ∎
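As a quick sanity check of the Hessian formula above in one dimension, the snippet below compares it against a central second difference of ff_{*}; the exponent a=0.3a=0.3, the step, and the grid are arbitrary.

```python
import numpy as np

a, h = 0.3, 1e-4
f = lambda x: (x**2 + 1.0)**a
hess = lambda x: 2.0 * a * (x**2 + 1.0)**(a - 2.0) * ((2.0 * a - 1.0) * x**2 + 1.0)

for x in np.linspace(-3.0, 3.0, 7):
    fd = (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2    # central second difference
    print(x, hess(x), fd)                             # the last two columns should agree to ~1e-6
```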

B.4 More discussion about previous works

Recent studies have underscored the potential of diffusion models, exploring their integration across various domains, including approximate Bayesian computation methods. One line of research (Vargas et al., 2023a; Berner et al., 2022; Zhang et al., 2023; Vargas et al., 2023c; b) applies reverse diffusion in the VI framework to create posterior samplers with neural networks. These studies have examined the conditional expectation form of the score function, similar to Lemma 1. Such score-based VI algorithms have been shown to offer improvements over traditional VI. However, upon adopting a neural network estimator, VI-based algorithms are subject to an inherent, unknown error.

Other research (Tzen & Raginsky, 2019; Chen et al., 2022a) has also delved into the characteristics of parameterized diffusion-like processes under assumed error conditions. Yet, the comparative advantages of diffusion models against MCMC methods and the computational complexity involved in learning the score function are not well-investigated. This gap hinders a straightforward comparison with Langevin-based dynamics.

Another related work is the Schrödinger-Föllmer Sampler (SFS) (Huang et al., 2021), which also estimates the drift term with non-parametric estimators. The main difference from our proposed algorithm is that SFS leverages the Schrödinger-Föllmer process. In this process, the target distribution pp_{*} is transformed into a Dirac delta distribution. This transformation often results in the gradient logpt\nabla\log p_{t} becoming problematic when ptp_{t} closely resembles a delta distribution, posing challenges for maintaining the Lipschitz continuity of the drift term. Huang et al. (2021); Tzen & Raginsky (2019) note that the assumption holds when both pexp(x2/2)p_{*}\exp(\|x\|^{2}/2) and its gradient are Lipschitz continuous, and the former is bounded below by a positive value. However, this condition may not be met when the variance of pp_{*} exceeds 1, limiting its general applicability in the Schrödinger-Föllmer process. The comparison of SFS with traditional MCMC methods under general conditions remains an open question. In contrast, since the pp_{\infty} of the OU process is a smooth distribution, namely a standard Gaussian, the requirement for the Lipschitz continuity of pt\nabla p_{t} is much weaker, as diffusion analyses suggest (Chen et al., 2022a; 2023). Additionally, Lee et al. (2021a) indicated that LL-smoothness of a log-concave pp_{*} implies the smoothness of ptp_{t}. Moreover, the SFS algorithm considers the importance sampling estimator, and its error analysis is mainly based on the Gaussian mixture model. As we mention in Section 3.3, the importance sampling estimator suffers from the curse of dimensionality in practice. Our ULA-based analysis can be adapted to more general distributions, covering both ill-behaved LSI and non-LSI distributions. In short, our proposed rdMC provides complexity guarantees for general distributions, whereas SFS is more task-specific. It remains open to further investigate the complexity of SFS-like algorithms, which is an interesting direction for future work. Moreover, it is also possible to adapt the ULA estimator idea to SFS.

Our approach stands as an asymptotically exact sampling algorithm, akin to traditional MCMC methods. It allows us to determine an overall convergence rate that diminishes with increasing computational complexity, enabling a direct comparison of complexity with MCMC approaches. The main technique of our analysis is to characterize the complexity of the non-parametric score estimation, from which we identify the merits of the proposed method compared with MCMC. Our theory can also support diffusion-based VI over Langevin-based alternatives.

Appendix C Main Proofs

C.1 Proof of Lemma 1

Proof.

When the OU process, i.e., Eq. 1, is selected as our forward path, the transition kernel of (𝐱t)t0({\mathbf{x}}_{t})_{t\geq 0} has a closed form, i.e.,

p(𝒙,t|𝒙0,0)=(2π(1e2t))d/2exp[𝒙et𝒙022(1e2t)], 0tT.p({\bm{x}},t|{\bm{x}}_{0},0)=\left(2\pi\left(1-e^{-2t}\right)\right)^{-d/2}\cdot\exp\left[\frac{-\left\|{\bm{x}}-e^{-t}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2t}\right)}\right],\quad\forall\ 0\leq t\leq T.

In this condition, we have

pTt(𝒙)=\displaystyle p_{T-t}({\bm{x}})= dp0(𝒙0)pTt|0(𝒙|𝒙0)d𝒙0\displaystyle\int_{\mathbb{R}^{d}}p_{0}({\bm{x}}_{0})\cdot p_{T-t|0}({\bm{x}}|{\bm{x}}_{0})\mathrm{d}{\bm{x}}_{0}
=\displaystyle= dp0(𝒙0)(2π(1e2(Tt)))d/2exp[𝒙e(Tt)𝒙022(1e2(Tt))]d𝒙0\displaystyle\int_{\mathbb{R}^{d}}p_{0}({\bm{x}}_{0})\cdot\left(2\pi\left(1-e^{-2(T-t)}\right)\right)^{-d/2}\cdot\exp\left[\frac{-\left\|{\bm{x}}-e^{-(T-t)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-t)}\right)}\right]\mathrm{d}{\bm{x}}_{0}

Plugging this formulation into the following equation

𝒙lnpTt(𝒙)=pTt(𝒙)pTt(𝒙),\nabla_{{\bm{x}}}\ln p_{T-t}({\bm{x}})=\frac{\nabla p_{T-t}({\bm{x}})}{p_{T-t}({\bm{x}})},

we have

𝒙lnpTt(𝒙)=\displaystyle\nabla_{{\bm{x}}}\ln p_{T-t}({\bm{x}})= dp0(𝒙0)(2π(1e2(Tt)))d/2exp[𝒙e(Tt)𝒙022(1e2(Tt))]d𝒙0dp0(𝒙0)(2π(1e2(Tt)))d/2exp[𝒙e(Tt)𝒙022(1e2(Tt))]d𝒙0\displaystyle\frac{\nabla\int_{\mathbb{R}^{d}}p_{0}({\bm{x}}_{0})\cdot\left(2\pi\left(1-e^{-2(T-t)}\right)\right)^{-d/2}\cdot\exp\left[\frac{-\left\|{\bm{x}}-e^{-(T-t)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-t)}\right)}\right]\mathrm{d}{\bm{x}}_{0}}{\int_{\mathbb{R}^{d}}p_{0}({\bm{x}}_{0})\cdot\left(2\pi\left(1-e^{-2(T-t)}\right)\right)^{-d/2}\cdot\exp\left[\frac{-\left\|{\bm{x}}-e^{-(T-t)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-t)}\right)}\right]\mathrm{d}{\bm{x}}_{0}} (13)
=\displaystyle= dp0(𝒙0)exp(𝒙e(Tt)𝒙022(1e2(Tt)))(𝒙e(Tt)𝒙0(1e2(Tt)))d𝒙0dp0(𝒙0)exp(𝒙e(Tt)𝒙022(1e2(Tt)))d𝒙0\frac{\int_{\mathbb{R}^{d}}p_{0}({\bm{x}}_{0})\cdot\exp\left(\frac{-\left\|{\bm{x}}-e^{-(T-t)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-t)}\right)}\right)\cdot\left(-\frac{{\bm{x}}-e^{-(T-t)}{\bm{x}}_{0}}{\left(1-e^{-2(T-t)}\right)}\right)\mathrm{d}{\bm{x}}_{0}}{\int_{\mathbb{R}^{d}}p_{0}({\bm{x}}_{0})\cdot\exp\left(\frac{-\left\|{\bm{x}}-e^{-(T-t)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-t)}\right)}\right)\mathrm{d}{\bm{x}}_{0}}
=\displaystyle= 𝔼𝐱0qTt(|𝒙)[𝒙e(Tt)𝐱0(1e2(Tt))]\displaystyle\mathbb{E}_{{\mathbf{x}}_{0}\sim q_{T-t}(\cdot|{\bm{x}})}\left[-\frac{{\bm{x}}-e^{-(T-t)}{\mathbf{x}}_{0}}{\left(1-e^{-2(T-t)}\right)}\right]

where the density function qTt(|𝒙)q_{T-t}(\cdot|{\bm{x}}) is defined as

qTt(𝒙0|𝒙)=p0(𝒙0)exp(𝒙e(Tt)𝒙022(1e2(Tt)))dp0(𝒙0)exp(𝒙e(Tt)𝒙022(1e2(Tt)))d𝒙0exp(f(𝒙0)𝒙e(Tt)𝒙022(1e2(Tt))).q_{T-t}({\bm{x}}_{0}|{\bm{x}})=\frac{p_{0}({\bm{x}}_{0})\cdot\exp\left(\frac{-\left\|{\bm{x}}-e^{-(T-t)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-t)}\right)}\right)}{\int_{\mathbb{R}^{d}}p_{0}({\bm{x}}_{0})\cdot\exp\left(\frac{-\left\|{\bm{x}}-e^{-(T-t)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-t)}\right)}\right)\mathrm{d}{\bm{x}}_{0}}\propto\exp\left(-f_{*}({\bm{x}}_{0})-\frac{\left\|{\bm{x}}-e^{-(T-t)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-t)}\right)}\right).

Hence, the proof is completed. ∎
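Because every quantity is Gaussian when pp_{*} itself is Gaussian, the identity above can be verified in closed form. The snippet below does so for a one-dimensional p=𝒩(μ0,σ02)p_{*}=\mathcal{N}(\mu_{0},\sigma_{0}^{2}); the particular numbers are arbitrary.

```python
import numpy as np

mu0, sig0, s, x = 1.5, 2.0, 0.7, 0.4          # target N(mu0, sig0^2), time s = T - t, query point
a, var = np.exp(-s), 1.0 - np.exp(-2.0 * s)

# Direct score of p_{T-t} = N(a mu0, a^2 sig0^2 + var), the OU marginal of a Gaussian target.
score_direct = -(x - a * mu0) / (a**2 * sig0**2 + var)

# Lemma 1: score = E_{q_{T-t}(.|x)}[-(x - a x0) / var]; here the posterior is Gaussian
# with precision 1/sig0^2 + a^2/var, so its mean is available in closed form.
prec = 1.0 / sig0**2 + a**2 / var
post_mean = (mu0 / sig0**2 + a * x / var) / prec
score_lemma = -(x - a * post_mean) / var

print(score_direct, score_lemma)              # identical up to floating-point rounding
```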

C.2 Proof of Lemmas 2 and 3

Lemma 6.

(Proposition 1 in Ma et al. (2019)) For peUp_{*}\propto e^{-U}, where UU is mm-strongly convex outside of a region of radius RR and LL-Lipschitz smooth, the log-Sobolev constant ρU\rho_{U} of pp_{*} satisfies

ρUm2e16LR2.\rho_{U}\geq\frac{m}{2}e^{-16LR^{2}}.
Proof.

By Lemma 6, for any t=Tkηt=T-k\eta, we have that the LSI constant of qTkηq_{T-k\eta} satisfies

CLSI,ke2(Tkη)6(1e2(Tkη))exp(163LR2(e2(Tkη)6(1e2(Tkη)))).C_{\mathrm{LSI},k}\geq\frac{e^{-2(T-k\eta)}}{6(1-e^{-2(T-k\eta)})}\exp\left(-16\cdot 3L\cdot R^{2}\left(\frac{e^{-2(T-k\eta)}}{6(1-e^{-2(T-k\eta)})}\right)\right).

When 2L1+2Le2(Tkη)1\frac{2L}{1+2L}\leq e^{-2(T-k\eta)}\leq 1, we have

L+e2(Tkη)2(1e2(Tkη))0,-L+\frac{e^{-2(T-k\eta)}}{2\left(1-e^{-2(T-k\eta)}\right)}\geq 0,

which implies

Σk,max𝑰3e2(Tkη)2(1e2(Tkη))𝑰2gTkη(𝒙)e2(Tkη)2(1e2(Tkη))𝑰Σk,min𝑰.\displaystyle\Sigma_{k,\max}{\bm{I}}\coloneqq\frac{3e^{-2(T-k\eta)}}{2\left(1-e^{-2(T-k\eta)}\right)}\cdot{\bm{I}}\succeq\nabla^{2}g_{T-k\eta}({\bm{x}})\succeq\frac{e^{-2(T-k\eta)}}{2\left(1-e^{-2(T-k\eta)}\right)}\cdot{\bm{I}}\coloneqq\Sigma_{k,\min}{\bm{I}}. (14)

Since gTkηg_{T-k\eta} is Σk,min\Sigma_{k,\min}-strongly convex and Σk,max\Sigma_{k,\max}-smooth, Lemma 20 implies CLSI,kΣk,minC_{\mathrm{LSI},k}\geq\Sigma_{k,\min}. ∎
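The following small sketch evaluates how these two lower bounds behave along the backward schedule, using the Lemma 6 style bound everywhere and the strongly-log-concave bound Σk,min\Sigma_{k,\min} on the steps where e2(Tkη)2L/(1+2L)e^{-2(T-k\eta)}\geq 2L/(1+2L); the values of LL, RR, TT, and η\eta are illustrative.

```python
import numpy as np

L, R, T, eta = 1.0, 3.0, 5.0, 0.05
s = T - eta * np.arange(int(T / eta))                       # remaining forward times T - k eta
r = np.exp(-2.0 * s) / (6.0 * (1.0 - np.exp(-2.0 * s)))     # e^{-2s} / (6 (1 - e^{-2s}))

lsi_general = r * np.exp(-16.0 * 3.0 * L * R**2 * r)        # bound in the Lemma 6 style
sigma_min = 3.0 * r                                         # Sigma_{k,min} = e^{-2s} / (2 (1 - e^{-2s}))
strongly_log_concave = np.exp(-2.0 * s) >= 2.0 * L / (1.0 + 2.0 * L)
lsi_bound = np.where(strongly_log_concave, np.maximum(lsi_general, sigma_min), lsi_general)

# The bound is largest near the end of the schedule, where q_{T-k eta} becomes strongly log-concave.
print(lsi_bound[0], lsi_bound[-1])
```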

C.3 Proof of Main Theorem

Proof.

We have

DTV(p~T,p)\displaystyle D_{\mathrm{TV}}\left(\tilde{p}_{T},p_{*}\right)\leq DTV(P~T,P^T)DTV(P~T,P~TpT)+DTV(P~TpT,P^T)\displaystyle D_{\mathrm{TV}}\left(\tilde{P}_{T},\hat{P}_{T}\right)\leq D_{\mathrm{TV}}\left(\tilde{P}_{T},\tilde{P}_{T}^{p_{T}}\right)+D_{\mathrm{TV}}\left(\tilde{P}_{T}^{p_{T}},\hat{P}_{T}\right) (15)
\displaystyle\leq DTV(p~0,pT)+DTV(P~TpT,P^T)12DKL(p~0pT)Term 1+12DKL(P^TP~TpT)Term 2,\displaystyle D_{\mathrm{TV}}\left(\tilde{p}_{0},p_{T}\right)+D_{\mathrm{TV}}\left(\tilde{P}_{T}^{p_{T}},\hat{P}_{T}\right)\leq\underbrace{\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\tilde{p}_{0}\|p_{T}\right)}}_{\mathrm{Term\ 1}}+\underbrace{\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\hat{P}_{T}\big{\|}\tilde{P}_{T}^{p_{T}}\right)}}_{\mathrm{Term\ 2}},

where the first and the third inequalities follow from data-processing inequality, the second inequality follows from the triangle inequality, and the last inequality follows from Pinsker’s inequality.

For Term 2,

DKL(P^TP~TpT)=𝔼P^T[lndP^TdP~TpT]=\displaystyle D_{\mathrm{KL}}\left(\hat{P}_{T}\big{\|}\tilde{P}_{T}^{p_{T}}\right)=\mathbb{E}_{\hat{P}_{T}}\left[\ln\frac{\mathrm{d}\hat{P}_{T}}{\mathrm{d}\tilde{P}_{T}^{p_{T}}}\right]= 14k=0N1𝔼P^T[kη(k+1)η𝐯k(𝐱kη)2lnpTt(𝐱t)2dt]\displaystyle\frac{1}{4}\sum_{k=0}^{N-1}\mathbb{E}_{\hat{P}_{T}}\left[\int_{k\eta}^{(k+1)\eta}\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\mathrm{d}t\right]
=\displaystyle= k=0N1kη(k+1)η𝔼P^T[14𝐯k(𝐱kη)2lnpTt(𝐱t)2]dt,\displaystyle\sum_{k=0}^{N-1}\int_{k\eta}^{(k+1)\eta}\mathbb{E}_{\hat{P}_{T}}\left[\frac{1}{4}\cdot\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right]\mathrm{d}t,

By Lemma 7,

14𝔼P^T[𝐯k(𝐱kη)2lnpTt(𝐱t)2]\displaystyle\frac{1}{4}\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right]\leq 4ϵ2+12𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2]ϵscore\displaystyle 4\epsilon^{2}+\frac{1}{2}\cdot\underbrace{\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]}_{\epsilon_{\mathrm{score}}} (16)

According to Eq. 4, for each term of the summation, we have

𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2]\displaystyle\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]
=\displaystyle= 𝔼P^T[𝐯k(𝐱kη)2𝔼𝐱0qTkη(|𝐱kη)[𝐱kηe(Tkη)𝐱0(1e2(Tkη))]2].\displaystyle\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\mathbb{E}_{{\mathbf{x}}_{0}\sim q_{T-k\eta}(\cdot|{\mathbf{x}}_{k\eta})}\left[-\frac{{\mathbf{x}}_{k\eta}-e^{-(T-k\eta)}{\mathbf{x}}_{0}}{\left(1-e^{-2(T-k\eta)}\right)}\right]\right\|^{2}\right].

For each 𝐱kη=𝒙{\mathbf{x}}_{k\eta}={\bm{x}}, we have

𝐯k(𝒙)2𝔼𝐱0qTkη(|𝒙)[𝒙e(Tkη)𝐱0(1e2(Tkη))]2\displaystyle\left\|{\mathbf{v}}_{k}({\bm{x}})-2\mathbb{E}_{{\mathbf{x}}_{0}\sim q_{T-k\eta}(\cdot|{\bm{x}})}\left[-\frac{{\bm{x}}-e^{-(T-k\eta)}{\mathbf{x}}_{0}}{\left(1-e^{-2(T-k\eta)}\right)}\right]\right\|^{2} (17)
=\displaystyle= 𝐯k(𝒙)2𝔼𝐱0qTkη(|𝒙)[𝒙e(Tkη)𝐱0(1e2(Tkη))]\displaystyle\left\|{\mathbf{v}}_{k}({\bm{x}})-2\mathbb{E}_{{\mathbf{x}}^{\prime}_{0}\sim q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})}\left[-\frac{{\bm{x}}-e^{-(T-k\eta)}{\mathbf{x}}^{\prime}_{0}}{\left(1-e^{-2(T-k\eta)}\right)}\right]\right.
+2e(Tkη)1e2(Tkη)[𝔼𝐱0qTkη(|𝒙)[𝐱0]𝔼𝐱0qTkη(|𝒙)[𝐱0]]2\displaystyle+\left.\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\left[\mathbb{E}_{{\mathbf{x}}^{\prime}_{0}\sim q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})}[{\mathbf{x}}^{\prime}_{0}]-\mathbb{E}_{{\mathbf{x}}_{0}\sim q_{T-k\eta}(\cdot|{\bm{x}})}[{\mathbf{x}}_{0}]\right]\right\|^{2}
\displaystyle\leq 2𝐯k(𝒙)2𝔼𝐱0qTkη(|𝒙)[𝒙e(Tkη)𝐱0(1e2(Tkη))]2Term 2.1\displaystyle 2\underbrace{\left\|{\mathbf{v}}_{k}({\bm{x}})-2\mathbb{E}_{{\mathbf{x}}^{\prime}_{0}\sim q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})}\left[-\frac{{\bm{x}}-e^{-(T-k\eta)}{\mathbf{x}}^{\prime}_{0}}{\left(1-e^{-2(T-k\eta)}\right)}\right]\right\|^{2}}_{\mathrm{Term\ 2.1}}
+22e(Tkη)1e2(Tkη)[𝔼𝐱0qTkη(|𝒙)[𝐱0]𝔼𝐱0qTkη(|𝒙)[𝐱0]]2Term 2.2,\displaystyle+2\underbrace{\left\|\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\left[\mathbb{E}_{{\mathbf{x}}^{\prime}_{0}\sim q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})}[{\mathbf{x}}^{\prime}_{0}]-\mathbb{E}_{{\mathbf{x}}_{0}\sim q_{T-k\eta}(\cdot|{\bm{x}})}[{\mathbf{x}}_{0}]\right]\right\|^{2}}_{\mathrm{Term\ 2.2}},

where qTkη(|𝒙)q^{\prime}_{T-k\eta}(\cdot|{\bm{x}}) denotes the underlying distribution of the output particles of the auxiliary sampling task.

For Term 2.2\mathrm{Term\ 2.2}, we denote an optimal coupling between qTkη(|𝒙)q_{T-k\eta}(\cdot|{\bm{x}}) and qTkη(|𝒙)q^{\prime}_{T-k\eta}(\cdot|{\bm{x}}) to be

γΓopt(qTkη(|𝒙),qTkη(|𝒙)).\gamma\in\Gamma_{\mathrm{opt}}(q_{T-k\eta}(\cdot|{\bm{x}}),q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})).

Hence, we have

2e(Tkη)1e2(Tkη)[𝔼𝐱0qTkη(|𝒙)[𝐱0]𝔼𝐱0qTkη(|𝒙)[𝐱0]]2\displaystyle\left\|\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\left[\mathbb{E}_{{{\mathbf{x}}}^{\prime}_{0}\sim{q}^{\prime}_{T-k\eta}(\cdot|{\bm{x}})}[{{\mathbf{x}}}^{\prime}_{0}]-\mathbb{E}_{{{\mathbf{x}}}_{0}\sim{q}_{T-k\eta}(\cdot|{\bm{x}})}[{{\mathbf{x}}}_{0}]\right]\right\|^{2} (18)
=\displaystyle= 𝔼(𝐱0,𝐱0)γ[2e(Tkη)1e2(Tkη)(𝐱0𝐱0)]24e2(Tkη)(1e2(Tkη))2𝔼(𝐱0,𝐱0)γ[𝐱0𝐱02]\displaystyle\left\|\mathbb{E}_{({{\mathbf{x}}}_{0},{{\mathbf{x}}}_{0}^{\prime})\sim\gamma}\left[\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\cdot\left({{\mathbf{x}}}_{0}-{{\mathbf{x}}}_{0}^{\prime}\right)\right]\right\|^{2}\leq\frac{4e^{-2(T-k\eta)}}{(1-e^{-2(T-k\eta)})^{2}}\cdot\mathbb{E}_{({{\mathbf{x}}}_{0},{{\mathbf{x}}}_{0}^{\prime})\sim\gamma}\left[\left\|{{\mathbf{x}}}_{0}-{{\mathbf{x}}}_{0}^{\prime}\right\|^{2}\right]
=\displaystyle= 4e2(Tkη)(1e2(Tkη))2W22(qTkη(|𝒙),qTkη(|𝒙))\displaystyle\frac{4e^{-2(T-k\eta)}}{(1-e^{-2(T-k\eta)})^{2}}W_{2}^{2}\left({q}_{T-k\eta}(\cdot|{\bm{x}}),{q}^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\right)
\displaystyle\leq 4e2(Tkη)(1e2(Tkη))22CLSI,kDKL(qTkη(|𝒙)qTkη(|𝒙))\displaystyle\frac{4e^{-2(T-k\eta)}}{\left(1-e^{-2(T-k\eta)}\right)^{2}}\cdot\frac{2}{{C}_{\mathrm{LSI},k}}D_{\mathrm{KL}}\left({q}^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\|{q}_{T-k\eta}(\cdot|{\bm{x}})\right)
\displaystyle\leq 8η2CLSI,k1DKL(qTkη(|𝒙)qTkη(|𝒙)),\displaystyle 8\eta^{-2}C_{\mathrm{LSI},k}^{-1}D_{\mathrm{KL}}\left({q}^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\|{q}_{T-k\eta}(\cdot|{\bm{x}})\right),

where the first inequality follows from Jensen’s inequality, the second inequality follows from the Talagrand inequality and the last inequality follows from

e2(Tkη)e2η1ηe2(Tkη)(1e2(Tkη))2η2,e^{-2(T-k\eta)}\leq e^{-2\eta}\leq 1-\eta\ \Rightarrow\ \frac{e^{-2(T-k\eta)}}{(1-e^{-2(T-k\eta)})^{2}}\leq\eta^{-2},

when η1/2\eta\leq 1/2.

By Lemmas 8 and 9, the desired convergence can be obtained. ∎

C.4 Proof of the main Propositions

Proof.

As in Eq. (19), DTV(p~T,p)D_{\mathrm{TV}}\left(\tilde{p}_{T},p_{*}\right) is upper bounded by

12DKL(p~0pT)Term 1+12DKL(P^TP~TpT)Term 2.\displaystyle\underbrace{\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\tilde{p}_{0}\|p_{T}\right)}}_{\mathrm{Term\ 1}}+\underbrace{\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(\hat{P}_{T}\big{\|}\tilde{P}_{T}^{p_{T}}\right)}}_{\mathrm{Term\ 2}}. (19)

We aim to upper bound these two terms in our analysis.

Errors from the forward process

For Term 1 of Eq. 19, we can either control DKL(p^pT)D_{\mathrm{KL}}\left(\hat{p}\|p_{T}\right) directly or choose a large TT with p^=p\hat{p}=p_{\infty}.

If we choose p^=p\hat{p}=p_{\infty}, we have

DKL(p~0p)=DKL(pTp)C0exp(T2),D_{\mathrm{KL}}\left(\tilde{p}_{0}\|p_{\infty}\right)=D_{\mathrm{KL}}\left(p_{T}\|p_{\infty}\right)\leq C_{0}\exp\left(-\frac{T}{2}\right),

where the inequality follows from Lemma 21. By requiring

C0exp(T2)2ϵ2T2lnC02ϵ2,C_{0}\cdot\exp\left(-\frac{T}{2}\right)\leq 2\epsilon^{2}\Longleftrightarrow T\geq 2\ln\frac{C_{0}}{2\epsilon^{2}},

we have Term 1ϵ\mathrm{Term\ 1}\leq\epsilon. To simplify the proof, we choose TT as its lower bound.

If we instead choose p^\hat{p} by performing Langevin Monte Carlo for pTp_{T}, then the iteration complexity depends on the log-Sobolev constant of pTp_{T}. In (Ma et al., 2019), it is demonstrated that any distribution satisfying Assumptions [A1] and [A3] has a log-Sobolev constant of m2exp(16LR2)\frac{m}{2}\exp(-16LR^{2}), which scales exponentially with the radius RR.

Considering the pTp_{T}, we have

XT=eTX0+1e2Tε.X_{T}=e^{-T}X_{0}+\sqrt{1-e^{-2T}}\varepsilon.

Assume that the density of eTX0e^{-T}X_{0} is hh,

logh(eTx)+log|eTI|=logp0(x)\log h(e^{-T}x)+\log|e^{-T}I|=\log p_{0}(x)
logh(eTx)=logp0(x)+dT.\log h(e^{-T}x)=\log p_{0}(x)+dT.

Let y=eTxy=e^{-T}x and suppose that Assumption [A3] holds; then

2logh(y)=e2T2logp0(eTy)e2TmI.-\nabla^{2}\log h(y)=-e^{2T}\nabla^{2}\log p_{0}(e^{T}y)\geq e^{2T}mI.

Hence, outside a ball of radius eTRe^{-T}R, the negative log-density is e2Tme^{2T}m-strongly convex. By Lemma 16, the final log-Sobolev constant is

12mexp(16LR2eT+2T)+11e2T=O(mexp(16LR2eT+2T)).\frac{1}{\frac{2}{m\exp(-16LR^{2}e^{-T}+2T)}+\frac{1}{1-e^{-2T}}}=O({m\exp(-16LR^{2}e^{-T}+2T)}).
Errors from the backward process

Without loss of generality, we consider the Assumption [A4] case, where tt is split into two intervals. The Assumption [A3] case can be recognized as the first interval of the Assumption [A4] case. For Term 2, we first present the proof when Novikov's condition holds, for simplification. A more rigorous analysis without Novikov's condition can be obtained with the techniques shown in (Chen et al., 2022a). Considering Corollary 2, we have

DKL(P^TP~TpT)=𝔼P^T[lndP^TdP~TpT]=\displaystyle D_{\mathrm{KL}}\left(\hat{P}_{T}\big{\|}\tilde{P}_{T}^{p_{T}}\right)=\mathbb{E}_{\hat{P}_{T}}\left[\ln\frac{\mathrm{d}\hat{P}_{T}}{\mathrm{d}\tilde{P}_{T}^{p_{T}}}\right]= 14k=0N1𝔼P^T[kη(k+1)η𝐯k(𝐱kη)2lnpTt(𝐱t)2dt]\displaystyle\frac{1}{4}\sum_{k=0}^{N-1}\mathbb{E}_{\hat{P}_{T}}\left[\int_{k\eta}^{(k+1)\eta}\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\mathrm{d}t\right] (20)
=\displaystyle= k=0N1kη(k+1)η𝔼P^T[14𝐯k(𝐱kη)2lnpTt(𝐱t)2]dt,\displaystyle\sum_{k=0}^{N-1}\int_{k\eta}^{(k+1)\eta}\mathbb{E}_{\hat{P}_{T}}\left[\frac{1}{4}\cdot\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right]\mathrm{d}t,

where N=T/ηN=\lfloor T/\eta\rfloor.

We also have

14𝔼P^T[𝐯k(𝐱kη)2lnpTt(𝐱t)2]\displaystyle\frac{1}{4}\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right]\leq 4ϵ2+12𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2]ϵscore\displaystyle 4\epsilon^{2}+\frac{1}{2}\cdot\underbrace{\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]}_{\epsilon_{\mathrm{score}}} (21)

following Lemma 7 by choosing the step size of the backward path satisfying

ηC1(d+m22)1ϵ2.\eta\leq C_{1}\left(d+m_{2}^{2}\right)^{-1}\epsilon^{2}. (22)

To simplify the proof, we choose η\eta as its upper bound. Plugging Eq. 32 into Eq. 20, we have

DKL(P^TP~TpT)8ϵ2lnC02ϵ2+12k=0N1η𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2].D_{\mathrm{KL}}\left(\hat{P}_{T}\big{\|}\tilde{P}_{T}^{p_{T}}\right)\leq 8\epsilon^{2}\ln\frac{C_{0}}{2\epsilon^{2}}+\frac{1}{2}\cdot\sum_{k=0}^{N-1}\eta\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right].

Besides, due to Lemma 10, we have

k=0N1η𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2]20ϵ2lnC02ϵ2\sum_{k=0}^{N-1}\eta\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]\leq 20\epsilon^{2}\ln\frac{C_{0}}{2\epsilon^{2}}

with probability at least 1ϵ1-\epsilon by requiring an

𝒪(max(C3C5,C3C5)C11C0(d+m22)18ϵ16n88exp(5C2ϵ16n))\mathcal{O}\left(\max\left(C_{3}C_{5},C_{3}^{\prime}C_{5}^{\prime}\right)\cdot C_{1}^{-1}C_{0}\cdot(d+m_{2}^{2})^{18}\epsilon^{-16n-88}\exp\left(5C_{2}\epsilon^{-16n}\right)\right)

gradient complexity. Hence, we have

Term 212(4ϵ2+5ϵ2)T3ϵln(C2ϵ2),\mathrm{Term\ 2}\leq\sqrt{\frac{1}{2}\cdot\left(4\epsilon^{2}+5\epsilon^{2}\right)\cdot T}\leq 3\epsilon\sqrt{\ln\left(\frac{C}{2\epsilon^{2}}\right)},

which implies

DTV(p~t,p)ϵ+3ϵln(C2ϵ2)=O~(ϵ).D_{\mathrm{TV}}\left(\tilde{p}_{t},p_{*}\right)\leq\epsilon+3\epsilon\sqrt{\ln\left(\frac{C}{2\epsilon^{2}}\right)}=\tilde{O}(\epsilon). (23)

Hence, the proof is completed.
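The proof fixes TT from Term 1 and η\eta from Lemma 7; the helper below simply evaluates those two choices, together with the resulting number of outer steps N=T/ηN=\lfloor T/\eta\rfloor, for given ϵ\epsilon, C0C_{0}, LL, dd, and m2m_{2}, using C1=216L2C_{1}=2^{-16}\cdot L^{-2} from Table 2. It is bookkeeping for the constants in the proof, not part of the algorithm, and the sample inputs are arbitrary.

```python
import math

def rdmc_schedule(eps, C0, L, d, m2):
    """Evaluate the proof's choices: T = 2 ln(C0 / (2 eps^2)),
    eta = C1 (d + m2^2)^{-1} eps^2 with C1 = 2^{-16} L^{-2}, and N = floor(T / eta)."""
    T = 2.0 * math.log(C0 / (2.0 * eps**2))
    C1 = 2.0**-16 / L**2
    eta = C1 * eps**2 / (d + m2**2)
    return T, eta, math.floor(T / eta)

# The constants are deliberately conservative, so N is large even for moderate accuracy.
print(rdmc_schedule(eps=0.1, C0=10.0, L=1.0, d=10, m2=2.0))
```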

Appendix D Important Lemmas

Lemma 7.

(Errors from the discretization) With Algorithm 1 and the notation in Table 1, if we choose the step size of the outer loop to satisfy

ηC1(d+m22)1ϵ2,\eta\leq C_{1}\left(d+m_{2}^{2}\right)^{-1}\epsilon^{2},

then for t[kη,(k+1)η]t\in[k\eta,(k+1)\eta] we have

𝔼P^T[lnpTkη(𝐱kη)pTt(𝐱kη)2]+L2𝔼P^T[𝐱kη𝐱t2]ϵ2.\mathbb{E}_{\hat{P}_{T}}\left[\left\|\nabla\ln\frac{p_{T-k\eta}({\mathbf{x}}_{k\eta})}{p_{T-t}({\mathbf{x}}_{k\eta})}\right\|^{2}\right]+L^{2}\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{x}}_{k\eta}-{\mathbf{x}}_{t}\right\|^{2}\right]\leq\epsilon^{2}.
14𝔼P^T[𝐯k(𝐱kη)2lnpTt(𝐱t)2]\displaystyle\frac{1}{4}\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right]\leq 4ϵ2+12𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2]ϵscore.\displaystyle 4\epsilon^{2}+\frac{1}{2}\cdot\underbrace{\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]}_{\epsilon_{\mathrm{score}}}.
Proof.

According to the choice of tt, we have TkηTtT-k\eta\geq T-t. With the transition kernel of the forward process, we have the following connection

pTkη(𝒙)=\displaystyle p_{T-k\eta}({\bm{x}})= pTt(𝒚)(𝒙,Tkη|𝒚,Tt)d𝒚\displaystyle\int p_{T-t}({\bm{y}})\cdot\mathbb{P}\left({\bm{x}},T-k\eta|{\bm{y}},T-t\right)\mathrm{d}{\bm{y}} (24)
=\displaystyle= pTt(𝒚)[2π(1e2(tkη))]d2exp[𝒙e(tkη)𝒚22(1e2(tkη))]d𝒚\displaystyle\int p_{T-t}({\bm{y}})\left[2\pi\left(1-e^{-2(t-k\eta)}\right)\right]^{-\frac{d}{2}}\cdot\exp\left[\frac{-\left\|{\bm{x}}-e^{-(t-k\eta)}{\bm{y}}\right\|^{2}}{2\left(1-e^{-2(t-k\eta)}\right)}\right]\mathrm{d}{\bm{y}}
=\displaystyle= e(tkη)dpTt(e(tkη)𝒛)[2π(1e2(tkη))]d2exp[𝒙𝒛22(1e2(tkη))]d𝒛,\displaystyle\int e^{(t-k\eta)d}p_{T-t}\left(e^{(t-k\eta)}{\bm{z}}\right)\left[2\pi\left(1-e^{-2(t-k\eta)}\right)\right]^{-\frac{d}{2}}\cdot\exp\left[-\frac{\left\|{\bm{x}}-{\bm{z}}\right\|^{2}}{2\left(1-e^{-2(t-k\eta)}\right)}\right]\mathrm{d}{\bm{z}},

where the last equation follows from setting 𝒛=e(tkη)𝒚{\bm{z}}=e^{-(t-k\eta)}{\bm{y}}. We should note that

pTt(𝒛)e(tkη)dpTt(e(tkη)𝒛)p^{\prime}_{T-t}({\bm{z}})\coloneqq e^{(t-k\eta)d}p_{T-t}(e^{(t-k\eta)}{\bm{z}})

is also a density function. For each element 𝐱kη=𝒙{\mathbf{x}}_{k\eta}={\bm{x}}, we have

lnpTt(𝒙)pTkη(𝒙)2=\displaystyle\left\|\nabla\ln\frac{p_{T-t}({\bm{x}})}{p_{T-k\eta}({\bm{x}})}\right\|^{2}= lnpTt(𝒙)e(tkη)dpTt(e(tkη)𝒙)+lne(tkη)dpTt(e(tkη)𝒙)pTkη(𝒙)2\displaystyle\left\|\nabla\ln\frac{p_{T-t}({\bm{x}})}{e^{(t-k\eta)d}p_{T-t}\left(e^{(t-k\eta)}{\bm{x}}\right)}+\nabla\ln\frac{e^{(t-k\eta)d}p_{T-t}\left(e^{(t-k\eta)}{\bm{x}}\right)}{p_{T-k\eta}({\bm{x}})}\right\|^{2}
\displaystyle\leq 2lnpTt(𝒙)e(tkη)dpTt(e(tkη)𝒙)2+2lne(tkη)dpTt(e(tkη)𝒙)(pTtφ(1e2(tkη)))(𝒙)2,\displaystyle 2\left\|\nabla\ln\frac{p_{T-t}({\bm{x}})}{e^{(t-k\eta)d}p_{T-t}\left(e^{(t-k\eta)}{\bm{x}}\right)}\right\|^{2}+2\left\|\nabla\ln\frac{e^{(t-k\eta)d}p_{T-t}\left(e^{(t-k\eta)}{\bm{x}}\right)}{\left(p^{\prime}_{T-t}\ast\varphi_{\left(1-e^{-2(t-k\eta)}\right)}\right)({\bm{x}})}\right\|^{2},

where the inequality follows from the triangle inequality and Eq. 24. For the first term, we have

lnpTt(𝒙)e(tkη)dpTt(e(tkη)𝒙)=lnpTt(𝒙)e(tkη)lnpTt(e(tkη)𝒙)\displaystyle\left\|\nabla\ln\frac{p_{T-t}({\bm{x}})}{e^{(t-k\eta)d}p_{T-t}\left(e^{(t-k\eta)}{\bm{x}}\right)}\right\|=\left\|\nabla\ln p_{T-t}({\bm{x}})-e^{(t-k\eta)}\cdot\nabla\ln p_{T-t}\left(e^{(t-k\eta)}{\bm{x}}\right)\right\| (25)
\displaystyle\leq lnpTt(𝒙)e(tkη)lnpTt(𝒙)+e(tkη)lnpTt(𝒙)lnpTt(e(tkη)𝒙)\displaystyle\left\|\nabla\ln p_{T-t}({\bm{x}})-e^{(t-k\eta)}\cdot\nabla\ln p_{T-t}({\bm{x}})\right\|+e^{(t-k\eta)}\cdot\left\|\nabla\ln p_{T-t}({\bm{x}})-\nabla\ln p_{T-t}\left(e^{(t-k\eta)}{\bm{x}}\right)\right\|
\displaystyle\leq (e(tkη)1)lnpTt(𝒙)+e(tkη)(e(tkη)1)L𝒙.\displaystyle\left(e^{(t-k\eta)}-1\right)\left\|\nabla\ln p_{T-t}({\bm{x}})\right\|+e^{(t-k\eta)}\cdot\left(e^{(t-k\eta)}-1\right)L\left\|{\bm{x}}\right\|.

For the second term, the score lnpTt-\nabla\ln p^{\prime}_{T-t} is (e2(tkη)L)\left(e^{2(t-k\eta)}L\right)-smooth. Therefore, with Lemma 13 and the requirement

2e2(tkη)(1e2(tkη))1L{4(tkη)12Ltkη12ηmin{18L,12},2\cdot e^{2(t-k\eta)}\cdot\left(1-e^{-2(t-k\eta)}\right)\leq\frac{1}{L}\ \Leftarrow\left\{\begin{aligned} &4(t-k\eta)\leq\frac{1}{2L}\\ &t-k\eta\leq\frac{1}{2}\end{aligned}\right.\ \Leftarrow\eta\leq\min\left\{\frac{1}{8L},\frac{1}{2}\right\},

we have

lnpTt(𝒙)ln(pTtφ(1e2(tkη)))(𝒙)\displaystyle\left\|\nabla\ln p^{\prime}_{T-t}({\bm{x}})-\nabla\ln\left(p^{\prime}_{T-t}\ast\varphi_{\left(1-e^{-2(t-k\eta)}\right)}\right)({\bm{x}})\right\| (26)
\displaystyle\leq 6e2(tkη)L(1e2(tkη))d1/2+2e3(tkη)L(1e2(tkη))lnpTt(e(tkη)𝒙)\displaystyle 6e^{2(t-k\eta)}L\sqrt{\left(1-e^{-2(t-k\eta)}\right)}d^{1/2}+2e^{3(t-k\eta)}L\left(1-e^{-2(t-k\eta)}\right)\left\|\nabla\ln p_{T-t}\left(e^{(t-k\eta)}{\bm{x}}\right)\right\|
\displaystyle\leq 6e2(tkη)L(1e2(tkη))d1/2\displaystyle 6e^{2(t-k\eta)}L\sqrt{\left(1-e^{-2(t-k\eta)}\right)}d^{1/2}
+2Le(tkη)(e2(tkη)1)lnpTt(e(tkη)𝒙)lnpTt(𝒙)+lnpTt(𝒙)\displaystyle+2Le^{(t-k\eta)}\cdot\left(e^{2(t-k\eta)}-1\right)\left\|\nabla\ln p_{T-t}\left(e^{(t-k\eta)}{\bm{x}}\right)-\nabla\ln p_{T-t}({\bm{x}})+\nabla\ln p_{T-t}({\bm{x}})\right\|
\displaystyle\leq 6e2(tkη)L(1e2(tkη))d1/2+2L2e(tkη)(e2(tkη)1)(e(tkη)1)𝒙\displaystyle 6e^{2(t-k\eta)}L\sqrt{\left(1-e^{-2(t-k\eta)}\right)}d^{1/2}+2L^{2}e^{(t-k\eta)}\cdot\left(e^{2(t-k\eta)}-1\right)\left(e^{(t-k\eta)}-1\right)\left\|{\bm{x}}\right\|
+2Le(tkη)(e2(tkη)1)lnpTt(𝒙).\displaystyle+2Le^{(t-k\eta)}\cdot\left(e^{2(t-k\eta)}-1\right)\left\|\nabla\ln p_{T-t}({\bm{x}})\right\|.

Due to the range η1/2\eta\leq 1/2, we have the following inequalities

e2(tkη)e2η1+4η,1e2(tkη)2(tkη)2ηande(tkη)eη1+2η.e^{2(t-k\eta)}\leq e^{2\eta}\leq 1+4\eta,\quad 1-e^{-2(t-k\eta)}\leq 2(t-k\eta)\leq 2\eta\quad\mathrm{and}\quad e^{(t-k\eta)}\leq e^{\eta}\leq 1+2\eta.

Thus, Eq. 25 and Eq. 26 can be reformulated as

lnpTt(𝒙)e(tkη)dpTt(e(tkη)𝒙)2ηlnpTt(𝒙)+4ηL𝒙\displaystyle\left\|\nabla\ln\frac{p_{T-t}({\bm{x}})}{e^{(t-k\eta)d}p_{T-t}\left(e^{(t-k\eta)}{\bm{x}}\right)}\right\|\leq 2\eta\left\|\nabla\ln p_{T-t}({\bm{x}})\right\|+4\eta L\left\|{\bm{x}}\right\| (27)
\displaystyle\Rightarrow lnpTt(𝒙)e(tkη)dpTt(e(tkη)𝒙)28η2lnpTt(𝒙)2+32η2L2𝒙2\displaystyle\left\|\nabla\ln\frac{p_{T-t}({\bm{x}})}{e^{(t-k\eta)d}p_{T-t}\left(e^{(t-k\eta)}{\bm{x}}\right)}\right\|^{2}\leq 8\eta^{2}\left\|\nabla\ln p_{T-t}({\bm{x}})\right\|^{2}+32\eta^{2}L^{2}\left\|{\bm{x}}\right\|^{2}

and

lnpTt(𝒙)ln(pTtφ(1e2(tkη)))(𝒙)\displaystyle\left\|\nabla\ln p^{\prime}_{T-t}({\bm{x}})-\nabla\ln\left(p^{\prime}_{T-t}\ast\varphi_{\left(1-e^{-2(t-k\eta)}\right)}\right)({\bm{x}})\right\|
\displaystyle\leq 6(4η+1)L2ηd+2L2(2η+1)4η2η𝒙+2L(2η+1)4ηlnpTt(𝒙)\displaystyle 6\left(4\eta+1\right)L\sqrt{2\eta d}+2L^{2}\cdot(2\eta+1)\cdot 4\eta\cdot 2\eta\cdot\left\|{\bm{x}}\right\|+2L\cdot(2\eta+1)\cdot 4\eta\cdot\left\|\nabla\ln p_{T-t}({\bm{x}})\right\|
\displaystyle\leq 18L2ηd+32L2η2𝒙+16LηlnpTt(𝒙),\displaystyle 18L\sqrt{2\eta d}+32L^{2}\eta^{2}\cdot\left\|{\bm{x}}\right\|+16L\eta\cdot\left\|\nabla\ln p_{T-t}({\bm{x}})\right\|,

which is equivalent to

lnpTt(𝒙)ln(pTtφ(1e2(tkη)))(𝒙)2\displaystyle\left\|\nabla\ln p^{\prime}_{T-t}({\bm{x}})-\nabla\ln\left(p^{\prime}_{T-t}\ast\varphi_{\left(1-e^{-2(t-k\eta)}\right)}\right)({\bm{x}})\right\|^{2} (28)
\displaystyle\leq 3(211L2ηd+210L4η4𝒙2+28L2η2lnpTt(𝒙)2)\displaystyle 3\cdot\left(2^{11}\cdot L^{2}\eta d+2^{10}\cdot L^{4}\eta^{4}\left\|{\bm{x}}\right\|^{2}+2^{8}\cdot L^{2}\eta^{2}\left\|\nabla\ln p_{T-t}({\bm{x}})\right\|^{2}\right)
\displaystyle\leq 213L2ηd+212L4η4𝒙2+210L2η2lnpTt(𝒙)2.\displaystyle 2^{13}\cdot L^{2}\eta d+2^{12}\cdot L^{4}\eta^{4}\left\|{\bm{x}}\right\|^{2}+2^{10}\cdot L^{2}\eta^{2}\left\|\nabla\ln p_{T-t}({\bm{x}})\right\|^{2}.

Without loss of generality, we suppose L1L\geq 1. Combining Eq. 27 and Eq. 28, we have the following bound

𝔼P^T[lnpTt(𝐱kη)pTkη(𝐱kη)2]\displaystyle\mathbb{E}_{\hat{P}_{T}}\left[\left\|\nabla\ln\frac{p_{T-t}({\mathbf{x}}_{k\eta})}{p_{T-k\eta}({\mathbf{x}}_{k\eta})}\right\|^{2}\right]\leq 214Lηd+28L2η2𝔼P^[𝐱kη2]+212L2η2𝔼P^[lnpTt(𝐱kη)2]\displaystyle 2^{14}\cdot L\eta d+2^{8}\cdot L^{2}\eta^{2}\mathbb{E}_{\hat{P}}\left[\left\|{\mathbf{x}}_{k\eta}\right\|^{2}\right]+2^{12}\cdot L^{2}\eta^{2}\mathbb{E}_{\hat{P}}\left[\left\|\nabla\ln p_{T-t}({\mathbf{x}}_{k\eta})\right\|^{2}\right] (29)
\displaystyle\leq 214Lηd+28L2η2𝔼P^[𝐱kη2]+213L2η2𝔼P^[lnpTt(𝐱t)2]\displaystyle 2^{14}\cdot L\eta d+2^{8}\cdot L^{2}\eta^{2}\mathbb{E}_{\hat{P}}\left[\left\|{\mathbf{x}}_{k\eta}\right\|^{2}\right]+2^{13}\cdot L^{2}\eta^{2}\mathbb{E}_{\hat{P}}\left[\left\|\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right]
+213L4η2𝔼P^[𝐱kη𝐱t2].\displaystyle+2^{13}\cdot L^{4}\eta^{2}\mathbb{E}_{\hat{P}}\left[\left\|{\mathbf{x}}_{k\eta}-{\mathbf{x}}_{t}\right\|^{2}\right].

Besides, we have

4[𝔼P^T[lnpTkη(𝐱kη)pTt(𝐱kη)2]+L2𝔼P^T[𝐱kη𝐱t2]]\displaystyle 4\left[\mathbb{E}_{\hat{P}_{T}}\left[\left\|\nabla\ln\frac{p_{T-k\eta}({\mathbf{x}}_{k\eta})}{p_{T-t}({\mathbf{x}}_{k\eta})}\right\|^{2}\right]+L^{2}\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{x}}_{k\eta}-{\mathbf{x}}_{t}\right\|^{2}\right]\right] (30)
\displaystyle\leq 4[214Lηd+28L2η2𝔼P^T[𝐱kη2]+213L2η2𝔼P^T[lnpTt(𝐱t)2]\displaystyle 4\left[2^{14}\cdot L\eta d+2^{8}\cdot L^{2}\eta^{2}\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{x}}_{k\eta}\right\|^{2}\right]+2^{13}\cdot L^{2}\eta^{2}\mathbb{E}_{\hat{P}_{T}}\left[\left\|\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right]\right.
+(213L2η2+1)L2𝔼P^T[𝐱kη𝐱t2]]\displaystyle\left.+\left(2^{13}\cdot L^{2}\eta^{2}+1\right)L^{2}\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{x}}_{k\eta}-{\mathbf{x}}_{t}\right\|^{2}\right]\right]
\displaystyle\leq 216Lηd+210L2η2(d+m22)+215L3η2d+28L2(2(m22+d)η2+4dη),\displaystyle 2^{16}\cdot L\eta d+2^{10}\cdot L^{2}\eta^{2}(d+m_{2}^{2})+2^{15}\cdot L^{3}\eta^{2}d+2^{8}\cdot L^{2}\left(2(m_{2}^{2}+d)\eta^{2}+4d\eta\right),

where the last inequality follows from Lemma 14 and Lemma 15. To diminish the discretization error, we require the step size of backward sampling, i.e., η\eta, to satisfy

{216Lηdϵ2210L2η2(d+m22)ϵ2215L3η2dϵ228L2(2(m22+d)η2+4dη)ϵ2{η216L1d1ϵ2η25L1(d+m22)0.5ϵη27.5L1.5d0.5ϵη25L0.5(d+m22)0.5ϵη210L2d1ϵ2\left\{\begin{aligned} &2^{16}\cdot L\eta d\leq\epsilon^{2}\\ &2^{10}\cdot L^{2}\eta^{2}(d+m_{2}^{2})\leq\epsilon^{2}\\ &2^{15}\cdot L^{3}\eta^{2}d\leq\epsilon^{2}\\ &2^{8}\cdot L^{2}\left(2(m_{2}^{2}+d)\eta^{2}+4d\eta\right)\leq\epsilon^{2}\end{aligned}\right.\quad\Leftarrow\quad\left\{\begin{aligned} &\eta\leq 2^{-16}\cdot L^{-1}d^{-1}\epsilon^{2}\\ &\eta\leq 2^{-5}\cdot L^{-1}\left(d+m_{2}^{2}\right)^{-0.5}\epsilon\\ &\eta\leq 2^{-7.5}\cdot L^{-1.5}d^{-0.5}\epsilon\\ &\eta\leq 2^{-5}L^{-0.5}\left(d+m_{2}^{2}\right)^{-0.5}\epsilon\\ &\eta\leq 2^{-10}L^{-2}d^{-1}\epsilon^{2}\end{aligned}\right.

Specifically, if we choose

η216L2(d+m22)1ϵ2=C1(d+m22)1ϵ2,\eta\leq 2^{-16}\cdot L^{-2}\left(d+m_{2}^{2}\right)^{-1}\epsilon^{2}=C_{1}(d+m_{2}^{2})^{-1}\epsilon^{2},

we have

𝔼P^T[lnpTkη(𝐱kη)pTt(𝐱kη)2]+L2𝔼P^T[𝐱kη𝐱t2]ϵ2.\mathbb{E}_{\hat{P}_{T}}\left[\left\|\nabla\ln\frac{p_{T-k\eta}({\mathbf{x}}_{k\eta})}{p_{T-t}({\mathbf{x}}_{k\eta})}\right\|^{2}\right]+L^{2}\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{x}}_{k\eta}-{\mathbf{x}}_{t}\right\|^{2}\right]\leq\epsilon^{2}. (31)

This establishes the first claimed inequality, i.e., Eq. 31.

Thus, for t[kη,(k+1)η]t\in[k\eta,(k+1)\eta], we have

14𝔼P^T[𝐯k(𝐱kη)2lnpTt(𝐱t)2]\displaystyle\frac{1}{4}\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right] (32)
\displaystyle\leq 2𝔼P^T[lnpTkη(𝐱kη)lnpTt(𝐱t)2]+12𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2]\displaystyle 2\mathbb{E}_{\hat{P}_{T}}\left[\left\|\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})-\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\right]+\frac{1}{2}\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]
\displaystyle\leq 4𝔼P^T[lnpTkη(𝐱kη)pTt(𝐱kη)2]+4L2𝔼P^T[𝐱kη𝐱t2]\displaystyle 4\mathbb{E}_{\hat{P}_{T}}\left[\left\|\nabla\ln\frac{p_{T-k\eta}({\mathbf{x}}_{k\eta})}{p_{T-t}({\mathbf{x}}_{k\eta})}\right\|^{2}\right]+4L^{2}\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{x}}_{k\eta}-{\mathbf{x}}_{t}\right\|^{2}\right]
+12𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2]\displaystyle+\frac{1}{2}\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]
\displaystyle\leq 4ϵ2+12𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2]ϵscore.\displaystyle 4\epsilon^{2}+\frac{1}{2}\cdot\underbrace{\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]}_{\epsilon_{\mathrm{score}}}.

where the second inequality follows from Assumption [A1]. This completes the proof. ∎

Lemma 8.

For each inner loop, we denote by qz(|𝒙)q_{z}(\cdot|{\bm{x}}) and q(|𝒙)q(\cdot|{\bm{x}}) the underlying distribution of the output particles and the target distribution, respectively, where qq satisfies LSI with constant μ\mu. When we set the step size of the outer loop to be η\eta, by requiring

n64Tdμ1η3ϵ2δ1andDKL(qzq)213T4d2μ2η8ϵ4δ4,n\geq 64Td\mu^{-1}\eta^{-3}\epsilon^{-2}\delta^{-1}\quad\mathrm{and}\quad D_{\mathrm{KL}}\left(q_{z}\|q\right)\leq 2^{-13}\cdot T^{-4}d^{-2}\mu^{2}\eta^{8}\epsilon^{4}\delta^{4},

we have

{𝐱0(i)}i=1nqz(n)(|𝒙)[1ni=1n𝐯i(𝒙)1n𝔼[i=1n𝐯i(𝒙)]2ϵ]exp(1δ/(2T/η))+δ2T/η,\displaystyle\mathbb{P}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q^{(n)}_{z}(\cdot|{\bm{x}})}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}})-\frac{1}{n}\mathbb{E}\left[\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}})\right]\right\|\geq 2\epsilon\right]\leq\exp\left(-\frac{1}{\delta/(2\lfloor T/\eta\rfloor)}\right)+\frac{\delta}{2\lfloor T/\eta\rfloor},

where

𝐯i(𝒙)2𝒙e(Tkη)𝐱0(i)1e2(Tkη)i{1,,n}.\displaystyle{\mathbf{v}}_{i}({\bm{x}})\coloneqq-2\cdot\frac{{\bm{x}}-e^{-(T-k\eta)}{\mathbf{x}}_{0}^{(i)}}{1-e^{-2(T-k\eta)}}\quad i\in\left\{1,\ldots,n\right\}.
Proof.

For each inner loop, we abbreviate the target distribution as q~\tilde{q} and the initial distribution as q0q_{0}. Then the iteration of the inner loop is

𝐱z+1=𝐱z+τlnq(𝐱z|𝒙)+2τ𝒩(𝟎,𝑰).{\mathbf{x}}_{z+1}={\mathbf{x}}_{z}+\tau\nabla\ln q({\mathbf{x}}_{z}|{\bm{x}})+\sqrt{2\tau}\mathcal{N}({\bm{0}},{\bm{I}}).

We suppose the underlying distribution of the zz-th iteration to be qz(|𝒙)q_{z}(\cdot|{\bm{x}}). Hence, we expect the following inequality

[𝐯(𝒙)qz(𝒙0)(2)𝒙e(Tkη)𝒙01e2(Tkη)d𝒙02ϵ2]1δ\mathbb{P}\left[\left\|{\mathbf{v}}({\bm{x}})-\int q_{z}({\bm{x}}_{0})(-2)\cdot\frac{{\bm{x}}-e^{-(T-k\eta)}{\bm{x}}_{0}}{1-e^{-2(T-k\eta)}}\mathrm{d}{\bm{x}}_{0}\right\|^{2}\leq\epsilon^{2}\right]\geq 1-\delta

is established, where 𝐯(𝒙)=1/ni=1n𝐯i(𝒙){\mathbf{v}}({\bm{x}})=1/n\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}}). In this condition, we have

{𝐱0(i)}i=1nqz(n)(|𝒙)[1ni=1n𝐯i(𝒙)1n𝔼[i=1n𝐯i(𝒙)]𝔼1ni=1n𝐯i(𝒙)𝔼𝐯1(𝒙)+ϵ]\displaystyle\mathbb{P}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q_{z}^{(n)}(\cdot|{\bm{x}})}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}})-\frac{1}{n}\mathbb{E}\left[\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}})\right]\right\|\geq\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}})-\mathbb{E}{\mathbf{v}}_{1}({\bm{x}})\right\|+\epsilon\right] (33)
=\displaystyle= {𝐱0(i)}i=1nqz(n)(|𝒙)[i=1n𝐱0(i)𝔼[i=1n𝐱0(i)]𝔼i=1n𝐱i𝔼[i=1n𝐱0(i)]+1e2(Tkη)2e(Tkη)nϵ].\displaystyle\mathbb{P}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q_{z}^{(n)}(\cdot|{\bm{x}})}\left[\left\|\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}-\mathbb{E}\left[\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}\right]\right\|\geq\mathbb{E}\left\|\sum_{i=1}^{n}{\mathbf{x}}_{i}-\mathbb{E}\left[\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}\right]\right\|+\frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}\cdot n\epsilon\right].

To simplify the notations, we set

𝒃z𝔼qz(|𝒙)[𝐱0],vz𝔼{𝐱0(i)}i=1nqzn(|𝒙)i=1n𝐱i𝔼[i=1n𝐱0(i)],\displaystyle{\bm{b}}_{z}\coloneqq\mathbb{E}_{q_{z}(\cdot|{\bm{x}})}[{\mathbf{x}}_{0}],\quad v_{z}\coloneqq\mathbb{E}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q_{z}^{n}(\cdot|{\bm{x}})}\left\|\sum_{i=1}^{n}{\mathbf{x}}_{i}-\mathbb{E}\left[\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}\right]\right\|,
𝒃𝔼q(|𝒙)[𝐱0],andv𝔼{𝐱0(i)}i=1nqn(|𝒙)i=1n𝐱i𝔼[i=1n𝐱0(i)].\displaystyle{\bm{b}}\coloneqq\mathbb{E}_{q(\cdot|{\bm{x}})}[{\mathbf{x}}_{0}],\quad\mathrm{and}\quad v\coloneqq\mathbb{E}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q^{n}(\cdot|{\bm{x}})}\left\|\sum_{i=1}^{n}{\mathbf{x}}_{i}-\mathbb{E}\left[\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}\right]\right\|.

Then, we have

{𝐱0(i)}i=1nqz(n)(|𝒙)[i=1n𝐱0(i)n𝒃zvz+1e2(Tkη)2e(Tkη)nϵ]\displaystyle\mathbb{P}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q_{z}^{(n)}(\cdot|{\bm{x}})}\left[\left\|\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}-n{\bm{b}}_{z}\right\|\geq v_{z}+\frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}\cdot n\epsilon\right] (34)
\displaystyle\leq {𝐱0(i)}i=1nq(n)(|𝒙)[i=1n𝐱0(i)n𝒃zvz+1e2(Tkη)2e(Tkη)nϵ]+DTV(q(n)(|𝒙),qz(n)(|𝒙))\displaystyle\mathbb{P}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q^{(n)}(\cdot|{\bm{x}})}\left[\left\|\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}-n{\bm{b}}_{z}\right\|\geq v_{z}+\frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}\cdot n\epsilon\right]+D_{\mathrm{TV}}(q^{(n)}(\cdot|{\bm{x}}),q_{z}^{(n)}(\cdot|{\bm{x}}))
\displaystyle\leq {𝐱0(i)}i=1Nkq(n)(|𝒙)[i=1n𝐱0(i)n𝒃zvz+1e2(Tkη)2e(Tkη)nϵ]+nDTV(q(|𝒙),qz(|𝒙)).\displaystyle\mathbb{P}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{N_{k}}\sim q^{(n)}(\cdot|{\bm{x}})}\left[\left\|\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}-n{\bm{b}}_{z}\right\|\geq v_{z}+\frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}\cdot n\epsilon\right]+n\cdot D_{\mathrm{TV}}(q(\cdot|{\bm{x}}),q_{z}(\cdot|{\bm{x}})).

Considering the first term, we have the following relation

i=1n𝐱0(i)n𝒃zvz+1e2(Tkη)2e(Tkη)nϵ\displaystyle\left\|\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}-n{\bm{b}}_{z}\right\|\geq v_{z}+\frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}\cdot n\epsilon (35)
\displaystyle\Rightarrow i=1n𝐱0(i)n𝒃i=1n𝐱0(i)n𝒃zn𝒃𝒃z\displaystyle\left\|\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}-n{\bm{b}}\right\|\geq\left\|\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}-n{\bm{b}}_{z}\right\|-n\left\|{\bm{b}}-{\bm{b}}_{z}\right\|
vz+1e2(Tkη)2e(Tkη)nϵn𝒃𝒃z\displaystyle\geq v_{z}+\frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}\cdot n\epsilon-n\left\|{\bm{b}}-{\bm{b}}_{z}\right\|
=v+1e2(Tkη)2e(Tkη)nϵn𝒃𝒃z+(vzv)\displaystyle=v+\frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}\cdot n\epsilon-n\left\|{\bm{b}}-{\bm{b}}_{z}\right\|+(v_{z}-v)
v+1e2(Tkη)2e(Tkη)nϵnW2(q(|𝒙),qz(|𝒙))ndμ1,\displaystyle\geq v+\frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}\cdot n\epsilon-n\cdot W_{2}(q(\cdot|{\bm{x}}),q_{z}(\cdot|{\bm{x}}))-\sqrt{nd\mu^{-1}},

where the last inequality follows from

𝒃𝒃z=\displaystyle\left\|{\bm{b}}-{\bm{b}}_{z}\right\|= (q(𝒙0|𝒙)qz(𝒙0|𝒙))𝒙0d𝒙0=(𝒙0𝒙z)γ(𝒙0,𝒙z)d(𝒙0,𝒙z)\displaystyle\left\|\int\left(q({\bm{x}}_{0}|{\bm{x}})-q_{z}({\bm{x}}_{0}|{\bm{x}})\right){\bm{x}}_{0}\mathrm{d}{\bm{x}}_{0}\right\|=\left\|\int\left({\bm{x}}_{0}-{\bm{x}}_{z}\right)\gamma({\bm{x}}_{0},{\bm{x}}_{z})\mathrm{d}({\bm{x}}_{0},{\bm{x}}_{z})\right\| (36)
\displaystyle\leq (γ(𝒙0,𝒙z)d(𝒙0,𝒙z))1/2(γ(𝒙0,𝒙z)𝒙0𝒙z2d(𝒙0,𝒙z))1/2\displaystyle\left(\int\gamma({\bm{x}}_{0},{\bm{x}}_{z})\mathrm{d}({\bm{x}}_{0},{\bm{x}}_{z})\right)^{1/2}\cdot\left(\int\gamma({\bm{x}}_{0},{\bm{x}}_{z})\cdot\left\|{\bm{x}}_{0}-{\bm{x}}_{z}\right\|^{2}\mathrm{d}({\bm{x}}_{0},{\bm{x}}_{z})\right)^{1/2}
\displaystyle\leq W2(q(|𝒙),qz(|𝒙)),\displaystyle W_{2}(q(\cdot|{\bm{x}}),q_{z}(\cdot|{\bm{x}})),

and

v=\displaystyle v= n𝔼{𝐱0(i)}i=1nq(n)(|𝒙)1ni=1n𝐱0(i)𝒃nvar(1ni=1n𝐱0(i))\displaystyle n\cdot\mathbb{E}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q^{(n)}(\cdot|{\bm{x}})}\left\|\frac{1}{n}\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}-{\bm{b}}\right\|\leq n\cdot\sqrt{\mathrm{var}\left(\frac{1}{n}\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}\right)}
=\displaystyle= nvar(𝐱0(1))ndμ1\displaystyle\sqrt{n\mathrm{var}\left({\mathbf{x}}_{0}^{(1)}\right)}\leq\sqrt{nd\mu^{-1}}

deduced by Lemma 23. By requiring

W2(q(|𝒙),qz(|𝒙))1e2(Tkη)8e(Tkη)ϵandn64e2(Tkη)d(1e2(Tkη))2μϵ2W_{2}(q(\cdot|{\bm{x}}),q_{z}(\cdot|{\bm{x}}))\leq\frac{1-e^{-2(T-k\eta)}}{8{e^{-(T-k\eta)}}}\cdot\epsilon\quad\mathrm{and}\quad n\geq\frac{64e^{-2(T-k\eta)}d}{\left(1-e^{-2(T-k\eta)}\right)^{2}\mu\epsilon^{2}} (37)

in Eq. 35, we have

{𝐱0(i)}i=1nq(n)(|𝒙)[i=1n𝐱0(i)n𝒃zvz+1e2(Tkη)2e(Tkη)nϵ]\displaystyle\mathbb{P}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q^{(n)}(\cdot|{\bm{x}})}\left[\left\|\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}-n{\bm{b}}_{z}\right\|\geq v_{z}+\frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}\cdot n\epsilon\right] (38)
\displaystyle\leq {𝐱0(i)}i=1nq(n)(|𝒙)[i=1n𝐱0(i)n𝒃v+1e2(Tkη)4e(Tkη)nϵ]\displaystyle\mathbb{P}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q^{(n)}(\cdot|{\bm{x}})}\left[\left\|\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}-n{\bm{b}}\right\|\geq v+\frac{1-e^{-2(T-k\eta)}}{4e^{-(T-k\eta)}}\cdot n\epsilon\right]

According to Lemma 16, the LSI constant of

i=1n𝐱0(i)qqqn\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}\sim\underbrace{q\ast q\cdots\ast q}_{n}

is μ/n\mu/n. Besides, considering the function F(𝒙)=𝒙:dF\left({\bm{x}}\right)=\left\|{\bm{x}}\right\|\colon\mathbb{R}^{d}\rightarrow\mathbb{R} is 11-Lipschitz because

FLip=sup𝒙𝒚|F(𝒙)F(𝒚)|𝒙𝒚=sup𝒙𝒚|𝒙𝒚|(𝒙𝒚)=1,\left\|F\right\|_{\mathrm{Lip}}=\sup_{{\bm{x}}\not={\bm{y}}}\frac{\left|F({\bm{x}})-F({\bm{y}})\right|}{\left\|{\bm{x}}-{\bm{y}}\right\|}=\sup_{{\bm{x}}\not={\bm{y}}}\frac{\left|\left\|{\bm{x}}\right\|-\left\|{\bm{y}}\right\|\right|}{\left\|\left({\bm{x}}-{\bm{y}}\right)\right\|}=1,

we have

{𝐱0(i)}i=1nq(n)(|𝒙)[i=1n𝐱0(i)n𝒃v+1e2(Tkη)4e(Tkη)nϵ]exp{(1e(Tkη))2nϵ2μ32e2(Tkη)}\begin{split}&\mathbb{P}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q^{(n)}(\cdot|{\bm{x}})}\left[\left\|\sum_{i=1}^{n}{\mathbf{x}}_{0}^{(i)}-n{\bm{b}}\right\|\geq v+\frac{1-e^{-2(T-k\eta)}}{4e^{-(T-k\eta)}}\cdot n\epsilon\right]\\ \leq&\text{exp}\left\{-\frac{(1-e^{-(T-k\eta)})^{2}n\epsilon^{2}\mu}{32e^{-2(T-k\eta)}}\right\}\end{split} (39)

due to Lemma 18. Plugging Eq. 39 and Eq. 38 into Eq. 34 and Eq. 33, we have

{𝐱0(i)}i=1nqz(n)(|𝒙)[1ni=1n𝐯i(𝒙)1n𝔼[i=1n𝐯i(𝒙)]𝔼1ni=1n𝐯i(𝒙)𝔼𝐯1(𝒙)+ϵ]\displaystyle\mathbb{P}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q_{z}^{(n)}(\cdot|{\bm{x}})}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}})-\frac{1}{n}\mathbb{E}\left[\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}})\right]\right\|\geq\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}})-\mathbb{E}{\mathbf{v}}_{1}({\bm{x}})\right\|+\epsilon\right] (40)
\displaystyle\leq exp{(1e(Tkη))2nϵ2μ32e2(Tkη)}+nDTV(q(|𝒙),qz(|𝒙)).\displaystyle\exp\left\{-\frac{(1-e^{-(T-k\eta)})^{2}n\epsilon^{2}\mu}{32e^{-2(T-k\eta)}}\right\}+n\cdot D_{\mathrm{TV}}(q(\cdot|{\bm{x}}),q_{z}(\cdot|{\bm{x}})).

Besides, we have

𝔼{𝐱0(i)}i=1nqzn(|𝒙)1ni=1n𝐯i(𝒙)𝔼𝐯1(𝒙)var(1ni=1n𝐯i)\displaystyle\mathbb{E}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q_{z}^{n}(\cdot|{\bm{x}})}\left\|\frac{1}{n}\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}})-\mathbb{E}{\mathbf{v}}_{1}({\bm{x}})\right\|\leq\sqrt{\mathrm{var}\left(\frac{1}{n}\sum_{i=1}^{n}{\mathbf{v}}_{i}\right)}
=\displaystyle= var(𝐯1)n=2e(Tkη)1e2(Tkη)var(𝐱0)n.\displaystyle\sqrt{\frac{\mathrm{var}({\mathbf{v}}_{1})}{n}}=\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\sqrt{\frac{\mathrm{var}({\mathbf{x}}_{0})}{n}}.

Suppose the optimal coupling of qz(|𝒙)q_{z}(\cdot|{\bm{x}}) and q(|𝒙)q(\cdot|{\bm{x}}) is γzΓopt(qz(|𝒙),q(|𝒙))\gamma_{z}\in\Gamma_{\mathrm{opt}}\left(q_{z}(\cdot|{\bm{x}}),q(\cdot|{\bm{x}})\right), then we have

varqz(|𝒙)(𝐱0)=\displaystyle\mathrm{var}_{q_{z}(\cdot|{\bm{x}})}\left({\mathbf{x}}_{0}\right)= qz(𝒙z|𝒙)𝒙z𝒃z2d𝒙z=𝒙z𝒃z2dγ(𝒙z,𝒙0)\displaystyle\int q_{z}({\bm{x}}_{z}|{\bm{x}})\left\|{\bm{x}}_{z}-{\bm{b}}_{z}\right\|^{2}\mathrm{d}{\bm{x}}_{z}=\int\left\|{\bm{x}}_{z}-{\bm{b}}_{z}\right\|^{2}\mathrm{d}\gamma({\bm{x}}_{z},{\bm{x}}_{0})
=\displaystyle= 𝒙z𝒙0+𝒙0𝒃+𝒃𝒃z2dγ(𝒙z,𝒙0)\displaystyle\int\left\|{\bm{x}}_{z}-{\bm{x}}_{0}+{\bm{x}}_{0}-{\bm{b}}+{\bm{b}}-{\bm{b}}_{z}\right\|^{2}\mathrm{d}\gamma({\bm{x}}_{z},{\bm{x}}_{0})
\displaystyle\leq 3𝒙z𝒙02dγ(𝒙z,𝒙0)+3𝒙0𝒃2dγ(𝒙z,𝒙0)+3𝒃𝒃z2\displaystyle 3\int\left\|{\bm{x}}_{z}-{\bm{x}}_{0}\right\|^{2}\mathrm{d}\gamma({\bm{x}}_{z},{\bm{x}}_{0})+3\int\left\|{\bm{x}}_{0}-{\bm{b}}\right\|^{2}\mathrm{d}\gamma({\bm{x}}_{z},{\bm{x}}_{0})+3\left\|{\bm{b}}-{\bm{b}}_{z}\right\|^{2}
\displaystyle\leq 6W22(q~,qz)+3dμ1\displaystyle 6W^{2}_{2}(\tilde{q},q_{z})+3d\mu^{-1}

where the last inequality follows from Eq. 36 and Lemma 23. By requiring W22(q~,qz)d6μW^{2}_{2}(\tilde{q},q_{z})\leq\frac{d}{6\mu}, we have

2e(Tkη)1e2(Tkη)varqz(𝐱0)n4ne(Tkη)1e2(Tkη)dμ.\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\sqrt{\frac{\mathrm{var}_{q_{z}}({\mathbf{x}}_{0})}{n}}\leq\frac{4}{\sqrt{n}}\cdot\frac{e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\cdot\sqrt{\frac{d}{\mu}}.

Combining this result with Eq. 40, we have

{𝐱0(i)}i=1nqzn(|𝒙)[1ni=1n𝐯i(𝒙)1n𝔼[i=1Nk𝐯i(𝒙)]4ne(Tkη)1e2(Tkη)dμ+ϵ]\displaystyle\mathbb{P}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q_{z}^{n}(\cdot|{\bm{x}})}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}})-\frac{1}{n}\mathbb{E}\left[\sum_{i=1}^{N_{k}}{\mathbf{v}}_{i}({\bm{x}})\right]\right\|\geq\frac{4}{\sqrt{n}}\cdot\frac{e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\cdot\sqrt{\frac{d}{\mu}}+\epsilon\right]
\displaystyle\leq exp{(1e(Tkη))2nϵ2μ32e2(Tkη)}+nDTV(q(|𝒙),qz(|𝒙)).\displaystyle\exp\left\{-\frac{(1-e^{-(T-k\eta)})^{2}n\epsilon^{2}\mu}{32e^{-2(T-k\eta)}}\right\}+n\cdot D_{\mathrm{TV}}(q(\cdot|{\bm{x}}),q_{z}(\cdot|{\bm{x}})).

By requiring

4ne(Tkη)1e2(Tkη)dμϵ\displaystyle\frac{4}{\sqrt{n}}\cdot\frac{e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\cdot\sqrt{\frac{d}{\mu}}\leq\epsilon n16e2(Tkη)d(1e2(Tkη))2μϵ2,\displaystyle\Rightarrow\quad n\geq\frac{16e^{-2(T-k\eta)}d}{\left(1-e^{-2(T-k\eta)}\right)^{2}\mu\epsilon^{2}}, (41)
(1e(Tkη))2nϵ2μ32e2(Tkη)T/η2δ\displaystyle-\frac{(1-e^{-(T-k\eta)})^{2}n\epsilon^{2}\mu}{32e^{-2(T-k\eta)}}\leq-\lfloor T/\eta\rfloor\cdot\frac{2}{\delta} nT/η64e2(Tkη)(1e(Tkη))2μϵ2δ\displaystyle\Rightarrow\quad n\geq\lfloor T/\eta\rfloor\cdot\frac{64e^{-2(T-k\eta)}}{(1-e^{-(T-k\eta)})^{2}\mu\epsilon^{2}\delta}
andDTV(q~,qz)δ2nT/η.\displaystyle\mathrm{and}\quad D_{\mathrm{TV}}(\tilde{q},q_{z})\leq\frac{\delta}{2n\cdot\lfloor T/\eta\rfloor}.

we have

{𝐱0(i)}i=1nqz(n)(|𝒙)[1ni=1n𝐯i(𝒙)1n𝔼[i=1n𝐯i(𝒙)]2ϵ]exp(1δ/(2T/η))+δ2T/η.\displaystyle\mathbb{P}_{\left\{{\mathbf{x}}_{0}^{(i)}\right\}_{i=1}^{n}\sim q_{z}^{(n)}(\cdot|{\bm{x}})}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}})-\frac{1}{n}\mathbb{E}\left[\sum_{i=1}^{n}{\mathbf{v}}_{i}({\bm{x}})\right]\right\|\geq 2\epsilon\right]\leq\exp\left(-\frac{1}{\delta/(2\lfloor T/\eta\rfloor)}\right)+\frac{\delta}{2\lfloor T/\eta\rfloor}.

Combining the choice of nn and the gap between q(|𝒙)q(\cdot|{\bm{x}}) and qz(|𝒙)q_{z}(\cdot|{\bm{x}}) in Eq. 37 and Eq. 41, we have

{n64e2(Tkη)d(1e2(Tkη))2μϵ2nT/η64e2(Tkη)(1e(Tkη))2μϵ2δn16e2(Tkη)d(1e2(Tkη))2μϵ2and{W2(q(|𝒙),qz(|𝒙))1e2(Tkη)8e(Tkη)ϵW22(q(|𝒙),qz(|𝒙))d/(6μ)DTV(q(|𝒙),qz(|𝒙))δ2nT/η.\left\{\begin{aligned} n\geq&\frac{64e^{-2(T-k\eta)}d}{\left(1-e^{-2(T-k\eta)}\right)^{2}\mu\epsilon^{2}}\\ n\geq&\lfloor T/\eta\rfloor\cdot\frac{64e^{-2(T-k\eta)}}{(1-e^{-(T-k\eta)})^{2}\mu\epsilon^{2}\delta}\\ n\geq&\frac{16e^{-2(T-k\eta)}d}{\left(1-e^{-2(T-k\eta)}\right)^{2}\mu\epsilon^{2}}\end{aligned}\right.\quad\mathrm{and}\quad\left\{\begin{aligned} W_{2}(q(\cdot|{\bm{x}}),q_{z}(\cdot|{\bm{x}}))\leq&\frac{1-e^{-2(T-k\eta)}}{8{e^{-(T-k\eta)}}}\cdot\epsilon\\ W^{2}_{2}(q(\cdot|{\bm{x}}),q_{z}(\cdot|{\bm{x}}))\leq&d/(6\mu)\\ D_{\mathrm{TV}}(q(\cdot|{\bm{x}}),q_{z}(\cdot|{\bm{x}}))\leq&\frac{\delta}{2n\cdot\lfloor T/\eta\rfloor}\end{aligned}\right.. (42)

Without loss of generality, we suppose \eta\leq 1/2, so that e^{-2(T-k\eta)} satisfies

e2(Tkη)e2η1ηe2(Tkη)(1e2(Tkη))2η2,e^{-2(T-k\eta)}\leq e^{-2\eta}\leq 1-\eta\ \Rightarrow\ \frac{e^{-2(T-k\eta)}}{(1-e^{-2(T-k\eta)})^{2}}\leq\eta^{-2},

Hence, a sufficient condition for achieving Eq. 42 is

n64Tdμ1η3ϵ2δ1max{64dμ1η2ϵ2,64Tη3μ1ϵ2δ1,16dμ1η2ϵ2}\displaystyle n\geq 64Td\mu^{-1}\eta^{-3}\epsilon^{-2}\delta^{-1}\geq\max\left\{64d\mu^{-1}\eta^{-2}\epsilon^{-2},64T\eta^{-3}\mu^{-1}\epsilon^{-2}\delta^{-1},16d\mu^{-1}\eta^{-2}\epsilon^{-2}\right\}
and\displaystyle\mathrm{and} DTV(q~,qz)12δηn1T1DKL(qzq~)12δ2η2n2T2213T4d2μ2η8ϵ4δ4.\displaystyle D_{\mathrm{TV}}(\tilde{q},q_{z})\leq\frac{1}{2}\cdot\delta\eta n^{-1}T^{-1}\Leftarrow D_{\mathrm{KL}}\left(q_{z}\|\tilde{q}\right)\leq\frac{1}{2}\cdot\delta^{2}\eta^{2}n^{-2}T^{-2}\leq 2^{-13}\cdot T^{-4}d^{-2}\mu^{2}\eta^{8}\epsilon^{4}\delta^{4}.

Hence, the proof is completed. ∎

Lemma 9.

With Algorithm 1 and notation list 1, if we choose the initial distribution of the kk-th inner loop to be

qTkη(𝒙0|𝒙)exp(𝒙e(Tkη)𝒙022(1e2(Tkη))),q^{\prime}_{T-k\eta}({\bm{x}}_{0}|{\bm{x}})\propto\exp\left(-\frac{\left\|{\bm{x}}-e^{-(T-k\eta)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-k\eta)}\right)}\right),

then, assuming the LSI constant of q_{T-k\eta} is C_{\mathrm{LSI},k}, the KL divergence between them can be upper bounded as

DKL(qTkη(|𝒙)qTkη(|𝒙))L22CLSI,ke2(Tkη)(d+𝒙2).D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\|q_{T-k\eta}(\cdot|{\bm{x}})\right)\leq\frac{L^{2}}{2C_{\mathrm{LSI},k}}\cdot e^{2(T-k\eta)}\left(d+\|{\bm{x}}\|^{2}\right).
Proof.

Since the LSI constant of q_{T-k\eta} is C_{\mathrm{LSI},k}, we have

DKL(qTkη(|𝒙)qTkη(|𝒙))(2CLSI,k)1qTkη(𝒙0|𝒙)f(𝒙0)2d𝒙0\displaystyle D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\|q_{T-k\eta}(\cdot|{\bm{x}})\right)\leq(2C_{\mathrm{LSI},k})^{-1}\cdot\int q_{T-k\eta}^{\prime}({\bm{x}}_{0}|{\bm{x}})\left\|\nabla f({\bm{x}}_{0})\right\|^{2}\mathrm{d}{\bm{x}}_{0}
\displaystyle\leq L2(2CLSI,k)1𝔼𝐱0qTkη(|𝒙)[𝐱02]=L2(2CLSI,k)1(var(𝐱0)+𝔼[𝐱0]2).\displaystyle L^{2}(2C_{\mathrm{LSI},k})^{-1}\cdot\mathbb{E}_{{\mathbf{x}}_{0}\sim q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})}\left[\left\|{\mathbf{x}}_{0}\right\|^{2}\right]=L^{2}(2C_{\mathrm{LSI},k})^{-1}\cdot\left(\mathrm{var}({\mathbf{x}}_{0})+\left\|\mathbb{E}\left[{\mathbf{x}}_{0}\right]\right\|^{2}\right).

Because qTkη(|𝒙)q^{\prime}_{T-k\eta}(\cdot|{\bm{x}}) is a high-dimensional Gaussian, its mean value satisfies 𝔼[𝐱0]=e(Tkη)𝒙\mathbb{E}[{\mathbf{x}}_{0}]=e^{(T-k\eta)}{\bm{x}}. Besides, we have

2lnqTkη(𝒙0)=e2(Tkη)1e2(Tkη)𝑰.-\nabla^{2}\ln q^{\prime}_{T-k\eta}({\bm{x}}_{0})=\frac{e^{-2(T-k\eta)}}{1-e^{-2(T-k\eta)}}\cdot{\bm{I}}.

According to Lemma 20, q^{\prime}_{T-k\eta}(\cdot|{\bm{x}}) satisfies the log-Sobolev inequality (and hence the Poincaré inequality). It then follows from Lemma 23 that

var𝐱0qTkη(|𝒙)[𝐱0]d(1e2(Tkη))e2(Tkη)de2(Tkη).\mathrm{var}_{{\mathbf{x}}_{0}\sim q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})}\left[{\mathbf{x}}_{0}\right]\leq d\cdot(1-e^{-2(T-k\eta)})\cdot e^{2(T-k\eta)}\leq de^{2(T-k\eta)}.

Hence, we have

DKL(qTkη(|𝒙)qTkη(|𝒙))L22CLSI,ke2(Tkη)(d+𝒙2)\displaystyle D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\|q_{T-k\eta}(\cdot|{\bm{x}})\right)\leq\frac{L^{2}}{2C_{\mathrm{LSI},k}}\cdot e^{2(T-k\eta)}\left(d+\|{\bm{x}}\|^{2}\right)

and the proof is completed. ∎

Lemma 10.

(Errors from the inner-loop sampling task) Suppose Assumptions [A1], [A2], and [A4] hold. With Algorithm 1, notation list 1, and a suitable step size \eta=C_{1}\left(d+m_{2}^{2}\right)^{-1}\epsilon^{2}, we have

k=0N1η𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2]20ϵ2lnC02ϵ2,\sum_{k=0}^{N-1}\eta\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]\leq 20\epsilon^{2}\ln\frac{C_{0}}{2\epsilon^{2}},

with probability at least 1-\delta. The gradient complexity required to achieve this result is

max(C3C5,C3C5)C11C0(d+m22)18ϵ16n83exp(5C2ϵ16n)δ6\max\left(C_{3}C_{5},C_{3}^{\prime}C_{5}^{\prime}\right)\cdot C_{1}^{-1}C_{0}\cdot(d+m_{2}^{2})^{18}\epsilon^{-16n-83}\exp\left(5C_{2}\epsilon^{-16n}\right)\delta^{-6}

where the constants C_{i} and C_{i}^{\prime} are independent of d and \epsilon.

Proof.

To obtain a tighter upper bound, we divide the backward process into two stages.

Stage 1: when e2(Tkη)2L/(1+2L)e^{-2(T-k\eta)}\leq 2L/(1+2L).

This implies that the iteration index k satisfies

k12η(2Tln1+2L2L)N1.k\leq\frac{1}{2\eta}\left(2T-\ln\frac{1+2L}{2L}\right)\coloneqq N_{1}. (43)

In this condition, we set

qTkη(𝒙0|𝒙)exp(gTkη(𝒙0|𝒙))exp(f(𝒙0)𝒙e(Tkη)𝒙022(1e2(Tkη))).q_{T-k\eta}({\bm{x}}_{0}|{\bm{x}})\propto\exp(-g_{T-k\eta}({\bm{x}}_{0}|{\bm{x}}))\coloneqq\exp\left(-f_{*}({\bm{x}}_{0})-\frac{\left\|{\bm{x}}-e^{-(T-k\eta)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-k\eta)}\right)}\right). (44)

Hence, we can reformulate gTkη(𝒙0|𝒙)g_{T-k\eta}({\bm{x}}_{0}|{\bm{x}}) as

gTkη(𝒙0|𝒙)=\displaystyle g_{T-k\eta}({\bm{x}}_{0}|{\bm{x}})= f(𝒙0)+e2(Tkη)3(1e2(Tkη))𝒙02part 1\displaystyle\underbrace{f_{*}({\bm{x}}_{0})+\frac{e^{-2(T-k\eta)}}{3(1-e^{-2(T-k\eta)})}\left\|{\bm{x}}_{0}\right\|^{2}}_{\mathrm{part\ 1}}
+e2(Tkη)6(1e2(Tkη))𝒙02e(Tkη)(1e2(Tkη))𝒙0𝒙+𝒙22(1e2(Tkη))part 2.\displaystyle+\underbrace{\frac{e^{-2(T-k\eta)}}{6(1-e^{-2(T-k\eta)})}\left\|{\bm{x}}_{0}\right\|^{2}-\frac{e^{-(T-k\eta)}}{(1-e^{-2(T-k\eta)})}{\bm{x}}_{0}^{\top}{\bm{x}}+\frac{\left\|{\bm{x}}\right\|^{2}}{2(1-e^{-2(T-k\eta)})}}_{\mathrm{part\ 2}}.

According to Assumption [A4], we know part 1 and part 2 are both strongly convex outside the ball with radius R(e2(Tkη)/(6(1e2(Tkη))))R(e^{-2(T-k\eta)}/(6(1-e^{-2(T-k\eta)}))). With Lemma 22, the function gTkη(|𝒙)g_{T-k\eta}(\cdot|{\bm{x}}) is e2(Tkη)/(3(1e2(Tkη)))e^{-2(T-k\eta)}/(3(1-e^{-2(T-k\eta)}))-strongly convex outside the ball. Besides, the gradient Lipschitz constant of gTkη(|𝒙)g_{T-k\eta}(\cdot|{\bm{x}}) can be upper bounded as

2gTkη(𝒙0|𝒙)2f(𝒙0)+e2(Tkη)1e2(Tkη)𝑰(L+e2(Tkη)1e2(Tkη))𝑰3L𝑰\nabla^{2}g_{T-k\eta}({\bm{x}}_{0}|{\bm{x}})\preceq\nabla^{2}f_{*}({\bm{x}}_{0})+\frac{e^{-2(T-k\eta)}}{1-e^{-2(T-k\eta)}}\cdot{\bm{I}}\preceq\left(L+\frac{e^{-2(T-k\eta)}}{1-e^{-2(T-k\eta)}}\right)\cdot{\bm{I}}\preceq 3L{\bm{I}}

where the last inequality follows from the choice of e2(Tkη)e^{-2(T-k\eta)} in the stage.

When the total time satisfies exp(T/2)=2ϵ2/C0\exp(-T/2)=2\epsilon^{2}/C_{0}, we have

CLSI,k1\displaystyle C^{-1}_{\mathrm{LSI},k}\leq 6(1e2(Tkη))e2(Tkη)exp(48LR2(e2(Tkη)6(1e2(Tkη))))\displaystyle 6(1-e^{-2(T-k\eta)})e^{2(T-k\eta)}\cdot\exp\left(48L\cdot R^{2}\left(\frac{e^{-2(T-k\eta)}}{6(1-e^{-2(T-k\eta)})}\right)\right)
\displaystyle\leq 6exp(2(Tkη)+48L(6(1e2(Tkη))e2(Tkη))2n)\displaystyle 6\exp\left(2(T-k\eta)+48L\cdot\left(6(1-e^{-2(T-k\eta)})e^{2(T-k\eta)}\right)^{2n}\right)
\displaystyle\leq 6exp(2(Tkη)+48L62ne4n(Tkη)).\displaystyle 6\exp\left(2(T-k\eta)+48L\cdot 6^{2n}\cdot e^{4n(T-k\eta)}\right).

The second inequality follows from Assumption [A4], and the last inequality follows from the setting n,L1n,L\geq 1 without loss of generality.

CLSI,k1624C04ϵ8exp(48L62n28nC08nϵ16n)=C6ϵ8exp(C2ϵ16n)C^{-1}_{\mathrm{LSI},k}\leq 6\cdot 2^{-4}\cdot C_{0}^{4}\epsilon^{-8}\exp\left(48L\cdot 6^{2n}\cdot 2^{-8n}\cdot C_{0}^{8n}\cdot\epsilon^{-16n}\right)=C_{6}\epsilon^{-8}\exp\left(C_{2}\cdot\epsilon^{-16n}\right) (45)

For \mathrm{Term\ 1}, due to Lemma 8, if we set the step size of the outer loop to be

η=C1(d+m22)1ϵ2,\eta=C_{1}\left(d+m_{2}^{2}\right)^{-1}\epsilon^{2},

the number of samples at each iteration k satisfies

nk=\displaystyle n_{k}= C3(d+m22)4ϵ18exp(C2ϵ16n)δ1=64C0C13C6(d+m22)4ϵ18exp(C2ϵ16n)δ1\displaystyle C_{3}\cdot\left(d+m_{2}^{2}\right)^{4}\epsilon^{-18}\exp\left(C_{2}\cdot\epsilon^{-16n}\right)\delta^{-1}=64\cdot C_{0}C_{1}^{-3}C_{6}\left(d+m_{2}^{2}\right)^{4}\epsilon^{-18}\exp\left(C_{2}\cdot\epsilon^{-16n}\right)\delta^{-1}
\displaystyle\geq 642lnC02ϵ2d(C1(d+m22)1ϵ2)3ϵ2C6ϵ8exp(C2ϵ16n)δ64Tdη3ϵ2δ1CLSI,k1\displaystyle 64\cdot 2\ln\frac{C_{0}}{2\epsilon^{2}}\cdot d\cdot\left(C_{1}(d+m_{2}^{2})^{-1}\epsilon^{2}\right)^{-3}\cdot\epsilon^{-2}\cdot C_{6}\epsilon^{-8}\exp\left(C_{2}\cdot\epsilon^{-16n}\right)\cdot\delta\geq 64Td\eta^{-3}\epsilon^{-2}\delta^{-1}C_{\mathrm{LSI},k}^{-1}

and the accuracy of the inner loop meets

DKL(qTkη(|𝒙)qTkη(|𝒙))C4(d+m22)10ϵ44exp(2C2ϵ16n)δ4\displaystyle D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\|q_{T-k\eta}(\cdot|{\bm{x}})\right)\leq C_{4}\left(d+m_{2}^{2}\right)^{-10}\epsilon^{44}\exp\left(-2C_{2}\cdot\epsilon^{-16n}\right)\delta^{4} (46)
=\displaystyle= 213C04C18(d+m22)10ϵ28δ4C62ϵ16exp(2C2ϵ16n)\displaystyle 2^{-13}\cdot C_{0}^{-4}\cdot C_{1}^{8}\left(d+m_{2}^{2}\right)^{-10}\epsilon^{28}\delta^{4}\cdot C_{6}^{-2}\epsilon^{16}\exp\left(-2C_{2}\cdot\epsilon^{-16n}\right)
\displaystyle\leq 213(2lnC02ϵ2)4d2ϵ4δ4C18(d+m22)8ϵ16CLSI,k2213T4d2ϵ4δ4η8CLSI,k2\displaystyle 2^{-13}\cdot\left(2\ln\frac{C_{0}}{2\epsilon^{2}}\right)^{-4}\cdot d^{-2}\cdot\epsilon^{4}\cdot\delta^{4}\cdot C_{1}^{8}\left(d+m_{2}^{2}\right)^{-8}\epsilon^{16}\cdot C^{2}_{\mathrm{LSI},k}\leq 2^{-13}\cdot T^{-4}d^{-2}\epsilon^{4}\delta^{4}\eta^{8}C^{2}_{\mathrm{LSI},k}

when ϵ21/2\epsilon^{2}\leq 1/2. In this condition, we have

[𝐯k(𝒙)2𝔼𝐱0qTkη(|𝒙)[𝒙e(Tkη)𝐱0(1e2(Tkη))]24ϵ2]\displaystyle\mathbb{P}\left[\left\|{\mathbf{v}}_{k}({\bm{x}})-2\mathbb{E}_{{\mathbf{x}}^{\prime}_{0}\sim q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})}\left[-\frac{{\bm{x}}-e^{-(T-k\eta)}{\mathbf{x}}_{0}}{\left(1-e^{-2(T-k\eta)}\right)}\right]\right\|^{2}\leq 4\epsilon^{2}\right] (47)
\displaystyle\geq 1exp(1δ/(2T/η))δ2T/η1δT/η,\displaystyle 1-\exp\left(-\frac{1}{\delta/(2\lfloor T/\eta\rfloor)}\right)-\frac{\delta}{2\lfloor T/\eta\rfloor}\geq 1-\frac{\delta}{\lfloor T/\eta\rfloor},

where qTkη(|𝒙)q^{\prime}_{T-k\eta}(\cdot|{\bm{x}}) denotes the underlying distribution of output particles of the kk-th inner loop. To achieve

DKL(qTkη(|𝒙kη)qTkη(|𝒙kη))C4(d+m22)10ϵ44exp(2C2ϵ16n)δ4δKL,D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta}(\cdot|{\bm{x}}_{k\eta})\|q_{T-k\eta}(\cdot|{\bm{x}}_{k\eta})\right)\leq C_{4}\left(d+m_{2}^{2}\right)^{-10}\epsilon^{44}\exp\left(-2C_{2}\cdot\epsilon^{-16n}\right)\delta^{4}\coloneqq\delta_{\mathrm{KL}},

Lemma 19 requires the step size to satisfy

τk\displaystyle\tau_{k}\leq 2432L2C4C61(d+m22)11ϵ52exp(3C2ϵ16n)δ4\displaystyle 2^{-4}\cdot 3^{-2}\cdot L^{-2}C_{4}C_{6}^{-1}\left(d+m_{2}^{2}\right)^{-11}\epsilon^{52}\exp\left(-3C_{2}\epsilon^{-16n}\right)\delta^{4}
\displaystyle\leq C61ϵ8exp(C2ϵ16n)14(3L)214dC4(d+m22)10ϵ44exp(2C2ϵ16n)δ4\displaystyle C_{6}^{-1}\epsilon^{8}\exp\left(-C_{2}\epsilon^{-16n}\right)\cdot\frac{1}{4(3L)^{2}}\cdot\frac{1}{4d}\cdot C_{4}\left(d+m_{2}^{2}\right)^{-10}\epsilon^{44}\exp\left(-2C_{2}\cdot\epsilon^{-16n}\right)\delta^{4}
\displaystyle\leq CLSI,k42gTkη(|𝒙)2214dδKL\displaystyle\frac{C_{\mathrm{LSI},k}}{4\|\nabla^{2}g_{T-k\eta}(\cdot|{\bm{x}})\|_{2}^{2}}\cdot\frac{1}{4d}\cdot\delta_{\mathrm{KL}}

and the iteration number ZkZ_{k} meets

Zk1CLSI,kτkln2DKL(qTkη,0(|𝒙)qTkη(|𝒙))δKLZ_{k}\geq\frac{1}{C_{\mathrm{LSI},k}\tau_{k}}\cdot\ln\frac{2D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta,0}(\cdot|{\bm{x}})\|q_{T-k\eta}(\cdot|{\bm{x}})\right)}{\delta_{\mathrm{KL}}}

where q^{\prime}_{T-k\eta,0}(\cdot|{\bm{x}}) denotes the initial distribution of the k-th inner loop. By choosing \tau_{k} to be its upper bound and the initial distribution of the k-th inner loop to be

qTkη,0(𝒙0|𝒙)exp(𝒙e(Tkη)𝒙022(1e2(Tkη))),q^{\prime}_{T-k\eta,0}({\bm{x}}_{0}|{\bm{x}})\propto\exp\left(-\frac{\left\|{\bm{x}}-e^{-(T-k\eta)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-k\eta)}\right)}\right),

we have

DKL(qTkη,0(|𝒙)qTkη(|𝒙))\displaystyle D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta,0}(\cdot|{\bm{x}})\|q_{T-k\eta}(\cdot|{\bm{x}})\right)\leq L22CLSI,ke2(Tkη)(d+𝒙2)\displaystyle\frac{L^{2}}{2C_{\mathrm{LSI},k}}\cdot e^{2(T-k\eta)}\left(d+\|{\bm{x}}\|^{2}\right)
\displaystyle\leq 21L2C6ϵ8exp(C2ϵ16n)e2(Tkη)(d+𝒙2)\displaystyle 2^{-1}L^{2}C_{6}\epsilon^{-8}\exp\left(C_{2}\cdot\epsilon^{-16n}\right)\cdot e^{2(T-k\eta)}\left(d+\|{\bm{x}}\|^{2}\right)

by Lemma 9 and Eq. 45. This implies that the required number of inner-loop iterations Z_{k} satisfies

Zk\displaystyle Z_{k}\geq C5(d+m22)12ϵ16n61exp(4C2ϵ16n)(d+𝒙2)δ5\displaystyle C_{5}\cdot(d+m_{2}^{2})^{12}\epsilon^{-16n-61}\exp\left(4C_{2}\epsilon^{-16n}\right)\cdot\left(d+\|{\bm{x}}\|^{2}\right)\delta^{-5}
=\displaystyle= 283452L2C2C41C62ln(24L2C04C41C6)(d+m22)12ϵ16n61exp(4C2ϵ16n)(d+𝒙2)δ5\displaystyle 2^{8}\cdot 3^{4}\cdot 5^{2}L^{2}\cdot C_{2}C_{4}^{-1}C_{6}^{2}\ln\left(2^{-4}L^{2}C_{0}^{4}C_{4}^{-1}C_{6}\right)\cdot(d+m_{2}^{2})^{12}\epsilon^{-16n-61}\exp\left(4C_{2}\epsilon^{-16n}\right)\cdot\left(d+\|{\bm{x}}\|^{2}\right)\delta^{-5}
=\displaystyle= C6ϵ8exp(C2ϵ16n)(2432L2C41C6(d+m22)11ϵ52exp(3C2ϵ16n)δ4)\displaystyle C_{6}\epsilon^{-8}\exp\left(C_{2}\cdot\epsilon^{-16n}\right)\cdot\left(2^{4}\cdot 3^{2}L^{2}C_{4}^{-1}C_{6}\cdot\left(d+m_{2}^{2}\right)^{11}\epsilon^{-52}\exp\left(3C_{2}\epsilon^{-16n}\right)\delta^{-4}\right)
(ln(24L2C04C41C6)+3C2ϵ16n+60ln1ϵ+4ln1δ+10ln(d+m22)+ln(d+𝒙2))\displaystyle\cdot\left(\ln\left(2^{-4}L^{2}C_{0}^{4}C_{4}^{-1}C_{6}\right)+3C_{2}\epsilon^{-16n}+60\ln\frac{1}{\epsilon}+4\ln\frac{1}{\delta}+10\ln(d+m_{2}^{2})+\ln(d+\|{\bm{x}}\|^{2})\right)
\displaystyle\geq 1CLSI,kτkln2DKL(qTkη,0(|𝒙)qTkη(|𝒙))δKL.\displaystyle\frac{1}{C_{\mathrm{LSI},k}\tau_{k}}\cdot\ln\frac{2D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta,0}(\cdot|{\bm{x}})\|q_{T-k\eta}(\cdot|{\bm{x}})\right)}{\delta_{\mathrm{KL}}}.

when ln(1/ϵ)2\ln(1/\epsilon)\geq 2 and lnd2\ln d\geq 2 without loss of generality.

A sufficient condition to obtain \mathrm{Term\ 2}\leq\epsilon^{2} is that the following inequality holds

DKL(qTkη(|𝒙)qTkη(|𝒙))23η2ϵ2C61ϵ8exp(C2ϵ16n)D_{\mathrm{KL}}\left({q}^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\|{q}_{T-k\eta}(\cdot|{\bm{x}})\right)\leq 2^{-3}\cdot\eta^{2}\epsilon^{2}\cdot C_{6}^{-1}\epsilon^{8}\exp\left(-C_{2}\epsilon^{-16n}\right)

which is dominated by the requirement in Eq. 46 in essentially all cases.

Hence, combining Eq. 17, Eq. 47, and Eq. 18, we obtain

k=0N1η𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2]10N1ηϵ2\displaystyle\sum_{k=0}^{N_{1}}\eta\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]\leq 10N_{1}\eta\cdot\epsilon^{2} (48)

with probability at least 1-N_{1}\cdot\delta/(\lfloor T/\eta\rfloor), which is obtained by a union bound. The gradient complexity required in this stage is

cost=k=0N1nk𝔼P^T(Zk)=k=0N1\displaystyle\mathrm{cost}=\sum_{k=0}^{N_{1}}n_{k}\mathbb{E}_{\hat{P}_{T}}(Z_{k})=\sum_{k=0}^{N_{1}} C3(d+m22)4ϵ18exp(C2ϵ16n)δ1\displaystyle C_{3}\cdot\left(d+m_{2}^{2}\right)^{4}\epsilon^{-18}\exp\left(C_{2}\cdot\epsilon^{-16n}\right)\delta^{-1} (49)
𝔼P^T[C5(d+m22)12ϵ16n61exp(4C2ϵ16n)(d+𝐱kη2)δ5]\displaystyle\cdot\mathbb{E}_{\hat{P}_{T}}\left[C_{5}\cdot(d+m_{2}^{2})^{12}\epsilon^{-16n-61}\exp\left(4C_{2}\epsilon^{-16n}\right)\cdot\left(d+\|{\mathbf{x}}_{k\eta}\|^{2}\right)\delta^{-5}\right]
\displaystyle\leq N1C3C5(d+m22)17ϵ16n79exp(5C2ϵ16n)δ6\displaystyle N_{1}\cdot C_{3}C_{5}(d+m_{2}^{2})^{17}\epsilon^{-16n-79}\exp\left(5C_{2}\epsilon^{-16n}\right)\delta^{-6}

where the last inequality follows from Lemma 14.

Stage 2: when 2L1+2Le2(Tkη)1\frac{2L}{1+2L}\leq e^{-2(T-k\eta)}\leq 1.

In this stage, the LSI constant is of constant order, which means we should choose the sample size and the iteration number similarly to Stage 1. Therefore, for \mathrm{Term\ 1}, by requiring

nk=\displaystyle n_{k}= 64LC0C13(d+m22)4ϵ10δ1\displaystyle\frac{64}{L}\cdot C_{0}C_{1}^{-3}\cdot\left(d+m_{2}^{2}\right)^{4}\epsilon^{-10}\delta^{-1}
\displaystyle\geq 642lnC02ϵ2d(C1(d+m22)1ϵ2)3ϵ2δ1L164Tdη3ϵ2δ1CLSI,k1\displaystyle 64\cdot 2\ln\frac{C_{0}}{2\epsilon^{2}}\cdot d\cdot\left(C_{1}(d+m_{2}^{2})^{-1}\epsilon^{2}\right)^{-3}\cdot\epsilon^{-2}\delta^{-1}\cdot L^{-1}\geq 64Td\eta^{-3}\epsilon^{-2}\delta^{-1}C_{\mathrm{LSI},k}^{-1}

and

DKL(qTkη(|𝒙)qTkη(|𝒙))213L2C04C18(d+m22)10ϵ28δ4\displaystyle D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\|q_{T-k\eta}(\cdot|{\bm{x}})\right)\leq 2^{-13}L^{2}\cdot C_{0}^{-4}C_{1}^{8}\cdot(d+m_{2}^{2})^{-10}\epsilon^{28}\delta^{4} (50)
\displaystyle\leq 213(2lnC02ϵ2)4d2ϵ4δ4C18(d+m22)8ϵ16CLSI,k2213T4d2ϵ4δ4η8CLSI,k2.\displaystyle 2^{-13}\cdot\left(2\ln\frac{C_{0}}{2\epsilon^{2}}\right)^{-4}\cdot d^{-2}\cdot\epsilon^{4}\delta^{4}\cdot C_{1}^{8}\left(d+m_{2}^{2}\right)^{-8}\epsilon^{16}\cdot C^{2}_{\mathrm{LSI},k}\leq 2^{-13}\cdot T^{-4}d^{-2}\epsilon^{4}\delta^{4}\eta^{8}C^{2}_{\mathrm{LSI},k}.

In this condition, we have

[𝐯k(𝒙)2𝔼𝐱0qTkη(|𝒙)[𝒙e(Tkη)𝐱0(1e2(Tkη))]24ϵ2]\displaystyle\mathbb{P}\left[\left\|{\mathbf{v}}_{k}({\bm{x}})-2\mathbb{E}_{{\mathbf{x}}^{\prime}_{0}\sim q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})}\left[-\frac{{\bm{x}}-e^{-(T-k\eta)}{\mathbf{x}}_{0}}{\left(1-e^{-2(T-k\eta)}\right)}\right]\right\|^{2}\leq 4\epsilon^{2}\right] (51)
\displaystyle\geq 1exp(1δ/(2T/η))δ2T/η1δT/η,\displaystyle 1-\exp\left(-\frac{1}{\delta/(2\lfloor T/\eta\rfloor)}\right)-\frac{\delta}{2\lfloor T/\eta\rfloor}\geq 1-\frac{\delta}{\lfloor T/\eta\rfloor},

where qTkη(|𝒙)q^{\prime}_{T-k\eta}(\cdot|{\bm{x}}) denotes the underlying distribution of output particles of the kk-th inner loop. To achieve

DKL(qTkη(|𝒙kη)qTkη(|𝒙kη))213L2C04C18(d+m22)10ϵ28δ4δKL,D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta}(\cdot|{\bm{x}}_{k\eta})\|q_{T-k\eta}(\cdot|{\bm{x}}_{k\eta})\right)\leq 2^{-13}L^{2}\cdot C_{0}^{-4}C_{1}^{8}\cdot(d+m_{2}^{2})^{-10}\epsilon^{28}\delta^{4}\coloneqq\delta_{\mathrm{KL}},

Lemma 19 requires the step size to satisfy

τk\displaystyle\tau_{k}\leq 21731Σk,max1L2C04C18(d+m22)11ϵ28δ4\displaystyle 2^{-17}\cdot 3^{-1}\Sigma_{k,\max}^{-1}\cdot L^{2}C_{0}^{-4}C_{1}^{8}\left(d+m_{2}^{2}\right)^{-11}\epsilon^{28}\delta^{4}
\displaystyle\leq Σk,minΣk,max14Σk,max14d213L2C04C18(d+m22)10ϵ28δ4\displaystyle\frac{\Sigma_{k,\min}}{\Sigma_{k,\max}}\cdot\frac{1}{4\Sigma_{k,\max}}\cdot\frac{1}{4d}\cdot 2^{-13}L^{2}\cdot C_{0}^{-4}C_{1}^{8}\cdot(d+m_{2}^{2})^{-10}\epsilon^{28}\delta^{4}
\displaystyle\leq CLSI,k42gTkη(|𝒙)2214dδKL\displaystyle\frac{C_{\mathrm{LSI},k}}{4\|\nabla^{2}g_{T-k\eta}(\cdot|{\bm{x}})\|_{2}^{2}}\cdot\frac{1}{4d}\cdot\delta_{\mathrm{KL}}

and the iteration number ZkZ_{k} meets

Zk1CLSI,kτkln2DKL(qTkη,0(|𝒙)qTkη(|𝒙))δKLZ_{k}\geq\frac{1}{C_{\mathrm{LSI},k}\tau_{k}}\cdot\ln\frac{2D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta,0}(\cdot|{\bm{x}})\|q_{T-k\eta}(\cdot|{\bm{x}})\right)}{\delta_{\mathrm{KL}}}

where q^{\prime}_{T-k\eta,0}(\cdot|{\bm{x}}) denotes the initial distribution of the k-th inner loop. By choosing \tau_{k} to be its upper bound and the initial distribution of the k-th inner loop to be

qTkη,0(𝒙0|𝒙)exp(𝒙e(Tkη)𝒙022(1e2(Tkη))),q^{\prime}_{T-k\eta,0}({\bm{x}}_{0}|{\bm{x}})\propto\exp\left(-\frac{\left\|{\bm{x}}-e^{-(T-k\eta)}{\bm{x}}_{0}\right\|^{2}}{2\left(1-e^{-2(T-k\eta)}\right)}\right),

we have

DKL(qTkη,0(|𝒙)qTkη(|𝒙))\displaystyle D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta,0}(\cdot|{\bm{x}})\|q_{T-k\eta}(\cdot|{\bm{x}})\right)\leq L22CLSI,ke2(Tkη)(d+𝒙2)L2e2(Tkη)(d+𝒙2)\displaystyle\frac{L^{2}}{2C_{\mathrm{LSI},k}}\cdot e^{2(T-k\eta)}\left(d+\|{\bm{x}}\|^{2}\right)\leq\frac{L}{2}\cdot e^{2(T-k\eta)}\left(d+\|{\bm{x}}\|^{2}\right)

by Lemma 9 and Eq. 14. This implies that the required number of inner-loop iterations Z_{k} satisfies

Zk\displaystyle Z_{k}\geq C5(d+m22)12ϵ29(d+𝒙2)δ5\displaystyle C^{\prime}_{5}\cdot(d+m_{2}^{2})^{12}\epsilon^{-29}\cdot\left(d+\|{\bm{x}}\|^{2}\right)\delta^{-5}
=\displaystyle= 222345L2C04C18ln(28LC08C18)(d+m22)12ϵ29δ5(d+𝒙2)\displaystyle 2^{22}\cdot 3^{4}\cdot 5\cdot L^{-2}C_{0}^{4}C_{1}^{-8}\ln\left(\frac{2^{8}}{L}\cdot C_{0}^{8}C_{1}^{-8}\right)\cdot(d+m_{2}^{2})^{12}\epsilon^{-29}\delta^{-5}\left(d+\|{\bm{x}}\|^{2}\right)
=\displaystyle= 1Σk,min(21731Σk,max1L2C04C18(d+m22)11ϵ28δ4)1\displaystyle\frac{1}{\Sigma_{k,\min}}\cdot\left(2^{-17}\cdot 3^{-1}\Sigma_{k,\max}^{-1}\cdot L^{2}C_{0}^{-4}C_{1}^{8}\left(d+m_{2}^{2}\right)^{-11}\epsilon^{28}\delta^{4}\right)^{-1}
(ln(28LC08C18)+36ln1ϵ+4ln1δ+10ln(d+m22)+ln(d+𝒙2))\displaystyle\cdot\left(\ln\left(\frac{2^{8}}{L}\cdot C_{0}^{8}C_{1}^{-8}\right)+36\ln\frac{1}{\epsilon}+4\ln\frac{1}{\delta}+10\ln(d+m_{2}^{2})+\ln(d+\|{\bm{x}}\|^{2})\right)
\displaystyle\geq 1CLSI,kτkln2DKL(qTkη,0(|𝒙)qTkη(|𝒙))δKL.\displaystyle\frac{1}{C_{\mathrm{LSI},k}\tau_{k}}\cdot\ln\frac{2D_{\mathrm{KL}}\left(q^{\prime}_{T-k\eta,0}(\cdot|{\bm{x}})\|q_{T-k\eta}(\cdot|{\bm{x}})\right)}{\delta_{\mathrm{KL}}}.

when ln(1/ϵ)2\ln(1/\epsilon)\geq 2 and lnd2\ln d\geq 2 without loss of generality.

For Term 2\mathrm{Term\ 2}, we denote an optimal coupling between qTkη(|𝒙)q_{T-k\eta}(\cdot|{\bm{x}}) and qTkη(|𝒙)q^{\prime}_{T-k\eta}(\cdot|{\bm{x}}) to be

γΓopt(qTkη(|𝒙),qTkη(|𝒙)).\gamma\in\Gamma_{\mathrm{opt}}(q_{T-k\eta}(\cdot|{\bm{x}}),q^{\prime}_{T-k\eta}(\cdot|{\bm{x}})).

Hence, we have

2e(Tkη)1e2(Tkη)[𝔼𝐱0qTkη(|𝒙)[𝐱0]𝔼𝐱0qTkη(|𝒙)[𝐱0]]2\displaystyle\left\|\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\left[\mathbb{E}_{{{\mathbf{x}}}^{\prime}_{0}\sim{q}^{\prime}_{T-k\eta}(\cdot|{\bm{x}})}[{{\mathbf{x}}}^{\prime}_{0}]-\mathbb{E}_{{{\mathbf{x}}}_{0}\sim{q}_{T-k\eta}(\cdot|{\bm{x}})}[{{\mathbf{x}}}_{0}]\right]\right\|^{2} (52)
=\displaystyle= 𝔼(𝐱0,𝐱0)γ[2e(Tkη)1e2(Tkη)(𝐱0𝐱0)]24e2(Tkη)(1e2(Tkη))2𝔼(𝐱0,𝐱0)γ[𝐱0𝐱02]\displaystyle\left\|\mathbb{E}_{({{\mathbf{x}}}_{0},{{\mathbf{x}}}_{0}^{\prime})\sim\gamma}\left[\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\cdot\left({{\mathbf{x}}}_{0}-{{\mathbf{x}}}_{0}^{\prime}\right)\right]\right\|^{2}\leq\frac{4e^{-2(T-k\eta)}}{(1-e^{-2(T-k\eta)})^{2}}\cdot\mathbb{E}_{({{\mathbf{x}}}_{0},{{\mathbf{x}}}_{0}^{\prime})\sim\gamma}\left[\left\|{{\mathbf{x}}}_{0}-{{\mathbf{x}}}_{0}^{\prime}\right\|^{2}\right]
=\displaystyle= 4e2(Tkη)(1e2(Tkη))2W22(qTkη(|𝒙),qTkη(|𝒙))\displaystyle\frac{4e^{-2(T-k\eta)}}{(1-e^{-2(T-k\eta)})^{2}}W_{2}^{2}\left({q}_{T-k\eta}(\cdot|{\bm{x}}),{q}^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\right)
\displaystyle\leq 4e2(Tkη)(1e2(Tkη))22CLSI,kDKL(qTkη(|𝒙)qTkη(|𝒙))\displaystyle\frac{4e^{-2(T-k\eta)}}{\left(1-e^{-2(T-k\eta)}\right)^{2}}\cdot\frac{2}{{C}_{\mathrm{LSI},k}}D_{\mathrm{KL}}\left({q}^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\|{q}_{T-k\eta}(\cdot|{\bm{x}})\right)
\displaystyle\leq 8η2CLSI,k1DKL(qTkη(|𝒙)qTkη(|𝒙)),\displaystyle 8\eta^{-2}C_{\mathrm{LSI},k}^{-1}D_{\mathrm{KL}}\left({q}^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\|{q}_{T-k\eta}(\cdot|{\bm{x}})\right),

where the first inequality follows from Jensen’s inequality, the second inequality follows from the Talagrand inequality and the last inequality follows from

e2(Tkη)e2η1ηe2(Tkη)(1e2(Tkη))2η2,e^{-2(T-k\eta)}\leq e^{-2\eta}\leq 1-\eta\ \Rightarrow\ \frac{e^{-2(T-k\eta)}}{(1-e^{-2(T-k\eta)})^{2}}\leq\eta^{-2},

when \eta\leq 1/2. Therefore, a sufficient condition to obtain \mathrm{Term\ 2}\leq\epsilon^{2} is that the following inequality holds

DKL(qTkη(|𝒙)qTkη(|𝒙))23η2ϵ2L1D_{\mathrm{KL}}\left({q}^{\prime}_{T-k\eta}(\cdot|{\bm{x}})\|{q}_{T-k\eta}(\cdot|{\bm{x}})\right)\leq 2^{-3}\cdot\eta^{2}\epsilon^{2}\cdot L^{-1}

which is dominated by the requirement in Eq. 50 in essentially all cases.

Hence, combining Eq. 17, Eq. 51, and Eq. 52, we obtain

k=N1+1T/ηη𝔼P^T[𝐯k(𝐱kη)2lnpTkη(𝐱kη)2]10(T/ηN1)ηϵ2\displaystyle\sum_{k=N_{1}+1}^{\lfloor T/\eta\rfloor}\eta\cdot\mathbb{E}_{\hat{P}_{T}}\left[\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-k\eta}({\mathbf{x}}_{k\eta})\right\|^{2}\right]\leq 10(\lfloor T/\eta\rfloor-N_{1})\eta\cdot\epsilon^{2} (53)

with probability at least 1-\left(\lfloor T/\eta\rfloor-N_{1}\right)\cdot\delta/(\lfloor T/\eta\rfloor), which is obtained by a union bound. The gradient complexity required in this stage is

cost=k=N1+1T/ηnk𝔼P^T(Zk)=k=N1+1T/η\displaystyle\mathrm{cost}=\sum_{k=N_{1}+1}^{\lfloor T/\eta\rfloor}n_{k}\mathbb{E}_{\hat{P}_{T}}(Z_{k})=\sum_{k=N_{1}+1}^{\lfloor T/\eta\rfloor} 64LC0C13(d+m22)4ϵ10δ1\displaystyle\frac{64}{L}\cdot C_{0}C_{1}^{-3}\cdot\left(d+m_{2}^{2}\right)^{4}\epsilon^{-10}\delta^{-1} (54)
𝔼P^T[C5(d+m22)12ϵ29(d+𝐱kη2)δ5]\displaystyle\cdot\mathbb{E}_{\hat{P}_{T}}\left[C^{\prime}_{5}\cdot(d+m_{2}^{2})^{12}\epsilon^{-29}\cdot\left(d+\|{\mathbf{x}}_{k\eta}\|^{2}\right)\delta^{-5}\right]
\displaystyle\leq (T/ηN1)C3C5(d+m22)17ϵ39δ6\displaystyle(\lfloor T/\eta\rfloor-N_{1})\cdot C^{\prime}_{3}C^{\prime}_{5}(d+m_{2}^{2})^{17}\epsilon^{-39}\delta^{-6}

where the last inequality follows from Lemma 14. Combining Eq. 49 and Eq. 54, the total gradient complexity is at most

max(C3C5,C3C5)C11C0(d+m22)18ϵ16n83exp(5C2ϵ16n)δ6.\max\left(C_{3}C_{5},C_{3}^{\prime}C_{5}^{\prime}\right)\cdot C_{1}^{-1}C_{0}\cdot(d+m_{2}^{2})^{18}\epsilon^{-16n-83}\exp\left(5C_{2}\epsilon^{-16n}\right)\delta^{-6}.

Hence the proof is completed. ∎

Appendix E Auxiliary Lemmas

Lemma 11.

(Lemma 11 of Vempala & Wibisono (2019)) Assume ν=exp(f)\nu=\exp(-f) is LL-smooth. Then 𝔼νf2dL{\mathbb{E}}_{\nu}\|\nabla f\|^{2}\leq dL.

Lemma 12.

(Girsanov’s theorem, Theorem 5.22 in Le Gall (2016)) Let PTP_{T} and QTQ_{T} be two probability measures on path space 𝒞([0,T],d)\mathcal{C}\left([0,T],\mathbb{R}^{d}\right). Suppose under PT{P}_{T}, the process (𝐱~t)t[0,T]({\tilde{{\mathbf{x}}}}_{t})_{t\in[0,T]} follows

d𝐱~t=b~tdt+σtdB~t.\mathrm{d}{\tilde{{\mathbf{x}}}}_{t}=\tilde{b}_{t}\mathrm{d}t+\sigma_{t}\mathrm{d}\tilde{B}_{t}.

Under QTQ_{T}, the process (𝐱^t)t[0,T]({\hat{{\mathbf{x}}}}_{t})_{t\in[0,T]} follows

d𝐱^t=b^tdt+σtdB^tand𝐱^0=𝐱~0.\mathrm{d}{\hat{{\mathbf{x}}}}_{t}=\hat{b}_{t}\mathrm{d}t+\sigma_{t}\mathrm{d}\hat{B}_{t}\quad\mathrm{and}\quad\hat{{\mathbf{x}}}_{0}=\tilde{{\mathbf{x}}}_{0}.

We assume that for each t0t\geq 0, σtd×d\sigma_{t}\in\mathbb{R}^{d\times d} is a non-singular diffusion matrix. Then, provided that Novikov’s condition holds

𝔼QT[exp(120Tσt1(b~tb^t)2)]<,\mathbb{E}_{Q_{T}}\left[\exp\left(\frac{1}{2}\int_{0}^{T}\left\|\sigma_{t}^{-1}\left(\tilde{b}_{t}-\hat{b}_{t}\right)\right\|^{2}\right)\right]<\infty,

we have

dPTdQT=exp(0Tσt1(b~tb^t)dBt120Tσt1(b~tb^t)2dt).\frac{\mathrm{d}{P}_{T}}{\mathrm{d}{Q}_{T}}=\exp\left(\int_{0}^{T}\sigma_{t}^{-1}\left(\tilde{b}_{t}-\hat{b}_{t}\right)\mathrm{d}B_{t}-\frac{1}{2}\int_{0}^{T}\left\|\sigma_{t}^{-1}\left(\tilde{b}_{t}-\hat{b}_{t}\right)\right\|^{2}\mathrm{d}t\right).
Corollary 2.

Plugging following settings

PTP~TpT,QTP^T,b~t𝐱t+𝐯k(𝐱kη),b^t𝐱t+2σ2lnpTt(𝐱t),σt=2σ,andt[kη,(k+1)η],{P}_{T}\coloneqq\tilde{P}_{T}^{p_{T}},\ {Q}_{T}\coloneqq\hat{P}_{T},\ \tilde{b}_{t}\coloneqq{\mathbf{x}}_{t}+{\mathbf{v}}_{k}({{\mathbf{x}}}_{k\eta}),\ \hat{b}_{t}\coloneqq{\mathbf{x}}_{t}+2\sigma^{2}\nabla\ln p_{T-t}({\mathbf{x}}_{t}),\sigma_{t}=\sqrt{2}\sigma,\ \mathrm{and}\ t\in[k\eta,(k+1)\eta],

into Lemma 12 and assuming Novikov’s condition holds, then we have

DKL(P^TP~TpT)=𝔼P^T[lndP^TdP~TpT]=14k=0N1𝔼P^T[kη(k+1)η𝐯k(𝐱kη)2lnpTt(𝐱t)2dt].D_{\mathrm{KL}}\left(\hat{P}_{T}\big{\|}\tilde{P}_{T}^{p_{T}}\right)=\mathbb{E}_{\hat{P}_{T}}\left[\ln\frac{\mathrm{d}\hat{P}_{T}}{\mathrm{d}\tilde{P}_{T}^{p_{T}}}\right]=\frac{1}{4}\sum_{k=0}^{N-1}\mathbb{E}_{\hat{P}_{T}}\left[\int_{k\eta}^{(k+1)\eta}\left\|{\mathbf{v}}_{k}({\mathbf{x}}_{k\eta})-2\nabla\ln p_{T-t}({\mathbf{x}}_{t})\right\|^{2}\mathrm{d}t\right].
Lemma 13.

(Lemma C.11 in Lee et al. (2022)) Suppose that p(𝐱)ef(𝐱)p({\bm{x}})\propto e^{-f({\bm{x}})} is a probability density function on d\mathbb{R}^{d}, where f(𝐱)f({\bm{x}}) is LL-smooth, and let φσ2(𝐱)\varphi_{\sigma^{2}}({\bm{x}}) be the density function of 𝒩(𝟎,σ2𝐈d)\mathcal{N}({\bm{0}},\sigma^{2}{\bm{I}}_{d}). Then for L12σ2L\leq\frac{1}{2\sigma^{2}}, it has

lnp(𝒙)(pφσ2)(𝒙)6Lσd1/2+2Lσ2f(𝒙).\left\|\nabla\ln\frac{p({\bm{x}})}{\left(p\ast\varphi_{\sigma^{2}}\right)({\bm{x}})}\right\|\leq 6L\sigma d^{1/2}+2L\sigma^{2}\left\|\nabla f({\bm{x}})\right\|.
Lemma 14.

(Lemma 9 in Chen et al. (2022a)) Suppose that Assumptions [A1] and [A2] hold. Let \left({\mathbf{x}}_{t}\right)_{t\in[0,T]} denote the forward process 1.

  1. (moment bound) For all t\geq 0, \mathbb{E}\left[\left\|{\mathbf{x}}_{t}\right\|^{2}\right]\leq d\vee m_{2}^{2}.

  2. (score function bound) For all t\geq 0, \mathbb{E}\left[\left\|\nabla\ln p_{t}({\mathbf{x}}_{t})\right\|^{2}\right]\leq Ld.
Lemma 15.

(Variant of Lemma 10 in Chen et al. (2022a)) Suppose that Assumption [A2] holds. Let \left({\mathbf{x}}_{t}\right)_{t\in[0,T]} denote the forward process 1. For 0\leq s<t, if t-s\leq 1, then

𝔼[𝐱t𝐱s2]2(m22+d)(ts)2+4d(ts)\mathbb{E}\left[\left\|{\mathbf{x}}_{t}-{\mathbf{x}}_{s}\right\|^{2}\right]\leq 2\left(m_{2}^{2}+d\right)\cdot\left(t-s\right)^{2}+4d\cdot\left(t-s\right)
Proof.

According to the forward process, we have

𝔼[𝐱t𝐱s2]=\displaystyle\mathbb{E}\left[\left\|{\mathbf{x}}_{t}-{\mathbf{x}}_{s}\right\|^{2}\right]= 𝔼[st𝐱rdr+2(BtBs)2]𝔼[2st𝐱rdr2+4BtBs2]\displaystyle\mathbb{E}\left[\left\|\int_{s}^{t}-{\mathbf{x}}_{r}\mathrm{d}r+\sqrt{2}\left(B_{t}-B_{s}\right)\right\|^{2}\right]\leq\mathbb{E}\left[2\left\|\int_{s}^{t}{\mathbf{x}}_{r}\mathrm{d}r\right\|^{2}+4\left\|B_{t}-B_{s}\right\|^{2}\right]
\displaystyle\leq 2𝔼[(st𝐱rdr)2]+4d(ts)2st𝔼[𝐱r2]dr(ts)+4d(ts)\displaystyle 2\mathbb{E}\left[\left(\int_{s}^{t}\left\|{\mathbf{x}}_{r}\right\|\mathrm{d}r\right)^{2}\right]+4d\cdot(t-s)\leq 2\int_{s}^{t}\mathbb{E}\left[\left\|{\mathbf{x}}_{r}\right\|^{2}\right]\mathrm{d}r\cdot(t-s)+4d\cdot(t-s)
\displaystyle\leq 2(m22+d)(ts)2+4d(ts),\displaystyle 2\left(m_{2}^{2}+d\right)\cdot\left(t-s\right)^{2}+4d\cdot\left(t-s\right),

where the third inequality follows from Hölder's inequality and the last one follows from Lemma 14. Hence, the proof is completed. ∎

Lemma 16.

(Corollary 3.1 in Chafaï (2004)) If ν,ν~\nu,\tilde{\nu} satisfy LSI with constants α,α~>0\alpha,\tilde{\alpha}>0, respectively, then νν~\nu*\tilde{\nu} satisfies LSI with constant (1α+1α~)1(\frac{1}{\alpha}+\frac{1}{\tilde{\alpha}})^{-1}.

Lemma 17.

Let 𝐱{\mathbf{x}} be a real random variable. If there exist constants C,A<C,A<\infty such that 𝔼[eλ𝐱]CeAλ2\mathbb{E}\left[e^{\lambda{\mathbf{x}}}\right]\leq Ce^{A\lambda^{2}} for all λ>0\lambda>0 then

{𝐱t}Cexp(t24A)\mathbb{P}\left\{{\mathbf{x}}\geq t\right\}\leq C\exp\left(-\frac{t^{2}}{4A}\right)
Proof.

According to the non-decreasing property of exponential function eλ𝒙e^{\lambda{\bm{x}}}, we have

{𝐱t}={eλ𝐱eλt}𝔼[eλ𝐱]eλtCexp(Aλ2λt),\mathbb{P}\left\{{\mathbf{x}}\geq t\right\}=\mathbb{P}\left\{e^{\lambda{\mathbf{x}}}\geq e^{\lambda t}\right\}\leq\frac{\mathbb{E}\left[e^{\lambda{\mathbf{x}}}\right]}{e^{\lambda t}}\leq C\exp\left(A\lambda^{2}-\lambda t\right),

The first inequality follows from Markov's inequality and the second follows from the given conditions. By minimizing the RHS, i.e., choosing \lambda=t/(2A), the proof is completed. ∎

Lemma 18.

If ν\nu satisfies a log-Sobolev inequality with log-Sobolev constant μ\mu then every 11-Lipschitz function ff is integrable with respect to ν\nu and satisfies the concentration inequality

ν{f𝔼ν[f]+t}exp(μt22).\nu\left\{f\geq\mathbb{E}_{\nu}[f]+t\right\}\leq\exp\left(-\frac{\mu t^{2}}{2}\right).
Proof.

According to Lemma 17, it suffices to prove that for any 11-Lipschitz function ff with expectation 𝔼ν[f]=0\mathbb{E}_{\nu}[f]=0,

𝔼[eλf]eλ2/(2μ).\mathbb{E}\left[e^{\lambda f}\right]\leq e^{\lambda^{2}/(2\mu)}.

To prove this, it suffices, by a routine truncation and smoothing argument, to prove it for bounded, smooth, compactly supported functions ff such that f1\|\nabla f\|\leq 1. Assume that ff is such a function. Then for every λ0\lambda\geq 0 the log-Sobolev inequality implies

Entν(eλf)2μ𝔼ν[eλf/22],\mathrm{Ent}_{\nu}\left(e^{\lambda f}\right)\leq\frac{2}{\mu}\mathbb{E}_{\nu}\left[\left\|\nabla e^{\lambda f/2}\right\|^{2}\right],

which is written as

𝔼ν[λfeλf]𝔼ν[eλf]log𝔼[eλf]λ22μ𝔼ν[f2eλf].\mathbb{E}_{\nu}\left[\lambda fe^{\lambda f}\right]-\mathbb{E}_{\nu}\left[e^{\lambda f}\right]\log\mathbb{E}\left[e^{\lambda f}\right]\leq\frac{\lambda^{2}}{2\mu}\mathbb{E}_{\nu}\left[\left\|\nabla f\right\|^{2}e^{\lambda f}\right].

With the notation φ(λ)=𝔼[eλf]\varphi(\lambda)=\mathbb{E}\left[e^{\lambda f}\right] and ψ(λ)=logφ(λ)\psi(\lambda)=\log\varphi(\lambda), the above inequality can be reformulated as

λφ(λ)\displaystyle\lambda\varphi^{\prime}(\lambda)\leq φ(λ)logφ(λ)+λ22μ𝔼ν[f2eλf]\displaystyle\varphi(\lambda)\log\varphi(\lambda)+\frac{\lambda^{2}}{2\mu}\mathbb{E}_{\nu}\left[\left\|\nabla f\right\|^{2}e^{\lambda f}\right]
\displaystyle\leq φ(λ)logφ(λ)+λ22μφ(λ),\displaystyle\varphi(\lambda)\log\varphi(\lambda)+\frac{\lambda^{2}}{2\mu}\varphi(\lambda),

where the last step follows from the fact f1\left\|\nabla f\right\|\leq 1. Dividing both sides by λ2φ(λ)\lambda^{2}\varphi(\lambda) gives

(log(φ(λ))λ)12μ.\big{(}\frac{\log(\varphi(\lambda))}{\lambda}\big{)}^{\prime}\leq\frac{1}{2\mu}.

Noting that the limiting value \frac{\log(\varphi(\lambda))}{\lambda}\mid_{\lambda=0}=\lim_{\lambda\to 0^{+}}\frac{\log(\varphi(\lambda))}{\lambda}=\mathbb{E}_{\nu}[f]=0, we have

log(φ(λ))λ=0λ(log(φ(t))t)𝑑tλ2μ,\frac{\log(\varphi(\lambda))}{\lambda}=\int_{0}^{\lambda}\big{(}\frac{\log(\varphi(t))}{t}\big{)}^{\prime}dt\leq\frac{\lambda}{2\mu},

which implies that

ψ(λ)λ22μφ(λ)exp(λ22μ)\psi(\lambda)\leq\frac{\lambda^{2}}{2\mu}\Longrightarrow\varphi(\lambda)\leq\exp\left(\frac{\lambda^{2}}{2\mu}\right)

The proof is then completed by applying Lemma 17. ∎

Lemma 19.

(Theorem 1 in Vempala & Wibisono (2019)) Suppose p_{*} satisfies LSI with constant \mu>0 and is L-smooth. For any {\mathbf{x}}_{0}\sim p_{0} with D_{\mathrm{KL}}(p_{0}\|p_{*})<\infty, the iterates {\mathbf{x}}_{k}\sim p_{k} of ULA with step size 0<\tau\leq\frac{\mu}{4L^{2}} satisfy

D_{\mathrm{KL}}\left(p_{k}\|p_{*}\right)\leq e^{-\mu\tau k}D_{\mathrm{KL}}\left(p_{0}\|p_{*}\right)+\frac{8\tau dL^{2}}{\mu}.

Thus, for any \delta>0, to achieve D_{\mathrm{KL}}\left(p_{k}\|p_{*}\right)\leq\delta, it suffices to run ULA with step size

0<τμ4L2min{1,δ4d}0<\tau\leq\frac{\mu}{4L^{2}}\min\left\{1,\frac{\delta}{4d}\right\}

for

k\geq\frac{1}{\mu\tau}\log\frac{2D_{\mathrm{KL}}\left(p_{0}\|p_{*}\right)}{\delta}.
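For reference, the following is a minimal Python sketch of the ULA iteration analyzed in Lemma 19; the gradient oracle, step size, and iteration count below are illustrative placeholders rather than tuned settings.

```python
import numpy as np

def ula(grad_f, x0, step, n_iters, rng=None):
    """Unadjusted Langevin algorithm targeting p_* proportional to exp(-f).

    Each update is x <- x - step * grad_f(x) + sqrt(2 * step) * standard normal noise,
    i.e., the iteration whose KL convergence is bounded in Lemma 19.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step * grad_f(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x

# Example: sample from a standard Gaussian (f(x) = ||x||^2 / 2, so grad_f(x) = x).
sample = ula(grad_f=lambda x: x, x0=np.zeros(2), step=0.05, n_iters=1000)
```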
Lemma 20.

(Variant of Lemma 10 in Cheng & Bartlett (2018)) Suppose -\ln p_{*} is an m-strongly convex function. Then for any distribution with density function p, we have

D_{\mathrm{KL}}\left(p\|p_{*}\right)\leq\frac{1}{2m}\int p({\bm{x}})\left\|\nabla\ln\frac{p({\bm{x}})}{p_{*}({\bm{x}})}\right\|^{2}\mathrm{d}{\bm{x}}.

By choosing p(𝐱)=g2(𝐱)p(𝐱)/𝔼p[g2(𝐱)]p({\bm{x}})=g^{2}({\bm{x}})p_{*}({\bm{x}})/\mathbb{E}_{p_{*}}\left[g^{2}({\mathbf{x}})\right] for the test function g:dg\colon\mathbb{R}^{d}\rightarrow\mathbb{R} and 𝔼p[g2(𝐱)]<\mathbb{E}_{p_{*}}\left[g^{2}({\mathbf{x}})\right]<\infty, we have

𝔼p[g2lng2]𝔼p[g2]ln𝔼p[g2]2m𝔼p[g2],\mathbb{E}_{p_{*}}\left[g^{2}\ln g^{2}\right]-\mathbb{E}_{p_{*}}\left[g^{2}\right]\ln\mathbb{E}_{p_{*}}\left[g^{2}\right]\leq\frac{2}{m}\mathbb{E}_{p_{*}}\left[\left\|\nabla g\right\|^{2}\right],

which implies pp_{*} satisfies mm-log-Sobolev inequality.

Lemma 21.

Using the notation in Table 1, for each t\in[0,\infty), the underlying distribution p_{t} of the forward process satisfies

DKL(ptp)4(dL+m22)exp(t2)D_{\mathrm{KL}}\left(p_{t}\|p_{\infty}\right)\leq 4(dL+m_{2}^{2})\cdot\exp\left(-\frac{t}{2}\right)
Proof.

Consider the Fokker–Planck equation of the forward process, i.e.,

d𝐱t=𝐱tdt+2dBt,𝐱0p0ef,\mathrm{d}{\mathbf{x}}_{t}=-{\mathbf{x}}_{t}\mathrm{d}t+\sqrt{2}\mathrm{d}B_{t},\quad{\mathbf{x}}_{0}\sim p_{0}\propto e^{-f_{*}},

we have

tpt(𝒙)=(pt(𝒙)𝒙)+Δpt(𝒙)=(pt(𝒙)lnpt(𝒙)e12𝒙2).\partial_{t}p_{t}({\bm{x}})=\nabla\cdot\left(p_{t}({\bm{x}}){\bm{x}}\right)+\Delta p_{t}({\bm{x}})=\nabla\cdot\left(p_{t}({\bm{x}})\nabla\ln\frac{p_{t}({\bm{x}})}{e^{-\frac{1}{2}\|{\bm{x}}\|^{2}}}\right).

It implies that the stationary distribution is

pexp(12𝒙2).p_{\infty}\propto\exp\left(-\frac{1}{2}\cdot\left\|{\bm{x}}\right\|^{2}\right). (55)

Then, we consider the KL convergence of (𝐱t)t0({\mathbf{x}}_{t})_{t\geq 0}, and have

dDKL(ptp)dt=\displaystyle\frac{\mathrm{d}D_{\mathrm{KL}}(p_{t}\|p_{\infty})}{\mathrm{d}t}= ddtpt(𝒙)lnpt(𝒙)p(𝒙)d𝒙=tpt(𝒙)lnpt(𝒙)p(𝒙)d𝒙\displaystyle\frac{\mathrm{d}}{\mathrm{d}t}\int p_{t}({\bm{x}})\ln\frac{p_{t}({\bm{x}})}{p_{\infty}({\bm{x}})}\mathrm{d}{\bm{x}}=\int\partial_{t}p_{t}({\bm{x}})\ln\frac{p_{t}({\bm{x}})}{p_{\infty}({\bm{x}})}\mathrm{d}{\bm{x}} (56)
=\displaystyle= (pt(𝒙)lnpt(𝒙)p(𝒙))lnpt(𝒙)p(𝒙)d𝒙\displaystyle\int\nabla\cdot\left(p_{t}({\bm{x}})\nabla\ln\frac{p_{t}({\bm{x}})}{p_{\infty}({\bm{x}})}\right)\cdot\ln\frac{p_{t}({\bm{x}})}{p_{\infty}({\bm{x}})}\mathrm{d}{\bm{x}}
=\displaystyle= pt(𝒙)lnpt(𝒙)p(𝒙)2d𝒙.\displaystyle-\int p_{t}({\bm{x}})\left\|\nabla\ln\frac{p_{t}({\bm{x}})}{p_{\infty}({\bm{x}})}\right\|^{2}\mathrm{d}{\bm{x}}.

According to Proposition 5.5.1 of Bakry et al. (2014), if pp_{\infty} is a centered Gaussian measure on d\mathbb{R}^{d} with covariance matrix Σ\Sigma, for every smooth function ff on d\mathbb{R}^{d}, we have

𝔼p[f2logf2]𝔼p[f2]log𝔼p[f2]2𝔼p[Σff]\mathbb{E}_{p_{\infty}}\left[f^{2}\log f^{2}\right]-\mathbb{E}_{p_{\infty}}\left[f^{2}\right]\log\mathbb{E}_{p_{\infty}}\left[f^{2}\right]\leq 2\mathbb{E}_{p_{\infty}}\left[\Sigma\nabla f\cdot\nabla f\right]

For the forward stationary distribution Eq. 55, we have Σ=𝑰\Sigma={\bm{I}}. Hence, by choosing f2(𝒙)=pt(𝒙)/p(𝒙)f^{2}({\bm{x}})=p_{t}({\bm{x}})/p_{\infty}({\bm{x}}), we have

DKL(ptp)2pt(𝒙)lnpt(𝒙)p(𝒙)2d𝒙D_{\mathrm{KL}}\left(p_{t}\|p_{\infty}\right)\leq 2\int p_{t}({\bm{x}})\left\|\nabla\ln\frac{p_{t}({\bm{x}})}{p_{\infty}({\bm{x}})}\right\|^{2}\mathrm{d}{\bm{x}}

Plugging this inequality into Eq. 56, we have

dDKL(ptp)dt=pt(𝒙)lnpt(𝒙)p(𝒙)2d𝒙12DKL(ptp).\frac{\mathrm{d}D_{\mathrm{KL}}(p_{t}\|p_{\infty})}{\mathrm{d}t}=-\int p_{t}({\bm{x}})\left\|\nabla\ln\frac{p_{t}({\bm{x}})}{p_{\infty}({\bm{x}})}\right\|^{2}\mathrm{d}{\bm{x}}\leq-\frac{1}{2}D_{\mathrm{KL}}(p_{t}\|p_{\infty}).

Integrating yields the desired bound, i.e.,

DKL(ptp)exp(t2)DKL(p0p)=C0exp(t2).D_{\mathrm{KL}}(p_{t}\|p_{\infty})\leq\exp\left(-\frac{t}{2}\right)D_{\mathrm{KL}}(p_{0}\|p_{\infty})=C_{0}\exp\left(-\frac{t}{2}\right).

Lemma 22.

Suppose f_{1}\colon\mathbb{R}^{d}\rightarrow\mathbb{R} and f_{2}\colon\mathbb{R}^{d}\rightarrow\mathbb{R} are \mu-strongly convex for \left\|{\bm{x}}\right\|\geq R. That means v_{1}({\bm{x}})\coloneqq f_{1}({\bm{x}})-\mu/2\cdot\left\|{\bm{x}}\right\|^{2} (and v_{2}({\bm{x}})\coloneqq f_{2}({\bm{x}})-\mu/2\cdot\left\|{\bm{x}}\right\|^{2}) is convex on \Omega=\mathbb{R}^{d}\setminus B({\bm{0}},R). Specifically, we require that any {\bm{x}}\in\Omega written as a convex combination {\bm{x}}=\sum_{i=1}^{k}\lambda_{i}{\bm{x}}_{i} with {\bm{x}}_{1},\ldots,{\bm{x}}_{k}\in\Omega satisfies

v_{1}({\bm{x}})\leq\sum_{i=1}^{k}\lambda_{i}v_{1}({\bm{x}}_{i}).

Then, we have f1+f2f_{1}+f_{2} is 2μ2\mu-strongly convex for 𝐱R\|{\bm{x}}\|\geq R.

Proof.

We define v({\bm{x}})=(f_{1}+f_{2})({\bm{x}})-\mu\left\|{\bm{x}}\right\|^{2}. Hence, considering {\bm{x}}\in\Omega and a convex combination {\bm{x}}=\sum_{i=1}^{k}\lambda_{i}{\bm{x}}_{i} with {\bm{x}}_{1},\ldots,{\bm{x}}_{k}\in\Omega, we have

v(𝒙)=v1(𝒙)+v2(𝒙)i=1kλiv1(𝒙i)+i=1kλiv2(𝒙i)=i=1kλiv(𝒙i).\displaystyle v({\bm{x}})=v_{1}({\bm{x}})+v_{2}({\bm{x}})\leq\sum_{i=1}^{k}\lambda_{i}v_{1}({\bm{x}}_{i})+\sum_{i=1}^{k}\lambda_{i}v_{2}({\bm{x}}_{i})=\sum_{i=1}^{k}\lambda_{i}v({\bm{x}}_{i}).

Hence, the proof is completed. ∎

Lemma 23.

Suppose qq is a distribution which satisfies LSI with constant μ\mu, then its variance satisfies

\int q({\bm{x}})\left\|{\bm{x}}-\mathbb{E}_{q}\left[{\mathbf{x}}\right]\right\|^{2}\mathrm{d}{\bm{x}}\leq\frac{d}{\mu}.
Proof.

It is known that LSI implies Poincaré inequality with the same constant (Rothaus, 1981; Villani, 2021; Vempala & Wibisono, 2019), which can be derived by taking ρ(1+ηg)ν\rho\rightarrow(1+\eta g)\nu in Eq. (10). Thus, for μ\mu-LSI distribution qq, we have

varq(g(𝐱))1μ𝔼q[g(𝐱)2].\mathrm{var}_{q}\left(g({\mathbf{x}})\right)\leq\frac{1}{\mu}\mathbb{E}_{q}\left[\left\|\nabla g({\mathbf{x}})\right\|^{2}\right].

for all smooth functions g\colon\mathbb{R}^{d}\rightarrow\mathbb{R}.

In this condition, we suppose 𝒃=𝔼q[𝐱]{\bm{b}}=\mathbb{E}_{q}[{\mathbf{x}}], and have the following equation

q(𝒙)𝒙𝔼q[𝐱]2d𝒙=q(𝒙)𝒙𝒃2d𝒙\displaystyle\int q({\bm{x}})\left\|{\bm{x}}-\mathbb{E}_{q}\left[{\mathbf{x}}\right]\right\|^{2}\mathrm{d}{\bm{x}}=\int q({\bm{x}})\left\|{\bm{x}}-{\bm{b}}\right\|^{2}\mathrm{d}{\bm{x}}
=\displaystyle= i=1dq(𝒙)(𝒙i𝒃i)2d𝒙=i=1dq(𝒙)(𝒙,𝒆i𝒃,𝒆i)2d𝒙\displaystyle\int\sum_{i=1}^{d}q({\bm{x}})\left({\bm{x}}_{i}-{\bm{b}}_{i}\right)^{2}\mathrm{d}{\bm{x}}=\sum_{i=1}^{d}\int q({\bm{x}})\left(\left<{\bm{x}},{\bm{e}}_{i}\right>-\left<{\bm{b}},{\bm{e}}_{i}\right>\right)^{2}\mathrm{d}{\bm{x}}
=\displaystyle= i=1dq(𝒙)(𝒙,𝒆i𝔼q[𝐱,𝒆i])2d𝒙=i=1dvarq(gi(𝐱))\displaystyle\sum_{i=1}^{d}\int q({\bm{x}})\left(\left<{\bm{x}},{\bm{e}}_{i}\right>-\mathbb{E}_{q}\left[\left<{\mathbf{x}},{\bm{e}}_{i}\right>\right]\right)^{2}\mathrm{d}{\bm{x}}=\sum_{i=1}^{d}\mathrm{var}_{q}\left(g_{i}({\mathbf{x}})\right)

where g_{i}({\bm{x}})\coloneqq\left<{\bm{x}},{\bm{e}}_{i}\right> and {\bm{e}}_{i} is a one-hot vector (the i-th element of {\bm{e}}_{i} is 1 and the others are 0).

Combining this equation with the Poincaré inequality, for each i, we have

varq(gi(𝐱))1μ𝔼q[𝒆i2]=1μ.\mathrm{var}_{q}\left(g_{i}({\mathbf{x}})\right)\leq\frac{1}{\mu}\mathbb{E}_{q}\left[\left\|{\bm{e}}_{i}\right\|^{2}\right]=\frac{1}{\mu}.

By combining the equation and inequality above, we have

q(𝒙)𝒙𝔼q[𝐱]2d𝒙=i=1dvarq(gi(𝐱))dμ\int q({\bm{x}})\left\|{\bm{x}}-\mathbb{E}_{q}\left[{\mathbf{x}}\right]\right\|^{2}\mathrm{d}{\bm{x}}=\sum_{i=1}^{d}\mathrm{var}_{q}\left(g_{i}({\mathbf{x}})\right)\leq\frac{d}{\mu}

Hence, the proof is completed. ∎

Appendix F Empirical Results

F.1 Experiment settings and more empirical results

We use 1,000 particles in the experiments and MMD (with an RBF kernel) as the metric. We choose T\in\{-\ln 0.99,-\ln 0.95,-\ln 0.9,-\ln 0.8,-\ln 0.7\}. Depending on the problem, we use 10, 50, or 100 iterations to approximate \hat{p}. The inner loop is initialized with an importance-sampling mean estimator using 100 particles. The number of inner iterations and the inner-loop sample size are chosen from \{1,5,10,100\}. The outer learning rate is chosen from \{T/20,T/10,T/5\}. When the algorithm converges, we further perform LMC until the gradient-complexity budget is exhausted. Note that the gradient complexity is evaluated as the product of the outer-loop and inner-loop iteration counts.
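For completeness, here is a minimal sketch of the MMD metric used above, computed as a biased V-statistic under an RBF kernel. The bandwidth value is an assumption of this sketch (the paragraph above does not fix it); a median heuristic is a common alternative.

```python
import numpy as np

def mmd2_rbf(x, y, bandwidth=1.0):
    """Biased V-statistic estimate of MMD^2 between samples x (n, d) and y (m, d)
    under the RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2))."""
    def gram(a, b):
        sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-sq / (2.0 * bandwidth**2))
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()

# Example: MMD^2 between two Gaussian sample sets with shifted means.
rng = np.random.default_rng(0)
print(mmd2_rbf(rng.standard_normal((1000, 2)), rng.standard_normal((1000, 2)) + 1.0))
```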

Figure 4: Maximum Mean Discrepancy (MMD) convergence of LMC, ULMC, and rdMC (panels: r=0.5, r=1.0, r=2.0).
Figure 5: Maximum Mean Discrepancy (MMD) convergence of LMC, ULMC, and rdMC (panels: r=1.0, r=2.0, r=4.0).

F.2 More Investigation on Ill-behaved Gaussian Case

Figure 6 demonstrates the differences between Langevin dynamics and the OU process in terms of their trajectories. The former utilizes the gradient information of the target distribution, \nabla\ln p_{*}, to facilitate optimization. However, it converges slowly in directions with small gradients. On the other hand, the OU process constructs paths with equal velocities in each direction, thereby avoiding the influence of gradient-vanishing directions. Consequently, leveraging the reverse process of the OU process is advantageous for addressing the issue of uneven gradients across different directions.

Figure 6: Langevin dynamics vs. the (reverse) OU process (panels: \mathcal{N}(0,{\bm{I}}_{d}), p_{*}, Langevin \mu_{t}, OU process \mu_{t}). We consider the sampling path between the 2-dimensional normal distributions \mathcal{N}(0,{\bm{I}}_{2}) and \mathcal{N}((20,20),\operatorname{diag}(400,1)). The mean \mu_{t} of Langevin dynamics shows varying convergence speeds in different directions, while the OU process demonstrates more uniform changes.
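The contrast in Figure 6 can also be seen analytically for this Gaussian pair: under Langevin dynamics the mean follows d\mu_{t}/dt=-\Sigma^{-1}(\mu_{t}-m), so the variance-400 coordinate relaxes 400 times more slowly, whereas along the reverse OU path the marginal mean e^{-(T-s)}m moves at the same exponential rate in every coordinate. A minimal sketch of the two mean trajectories follows; the horizon and discretization are illustrative assumptions.

```python
import numpy as np

# Target N((20, 20), diag(400, 1)); both mean trajectories start near the origin.
m = np.array([20.0, 20.0])
Sigma_inv = np.diag([1.0 / 400.0, 1.0])     # grad(-ln p_*)(x) = Sigma^{-1} (x - m)

T, n = 40.0, 4000
ts = np.linspace(0.0, T, n)
dt = ts[1] - ts[0]

# Langevin dynamics: the mean obeys d mu / dt = -Sigma^{-1} (mu - m),
# so the variance-400 coordinate converges far more slowly than the other one.
mu_langevin = np.zeros((n, 2))
for k in range(1, n):
    mu_langevin[k] = mu_langevin[k - 1] - dt * Sigma_inv @ (mu_langevin[k - 1] - m)

# Reverse OU process started from p_T: its marginal at reverse time s equals p_{T-s},
# whose mean e^{-(T-s)} m approaches m at the same exponential rate in every direction.
mu_reverse_ou = np.exp(-(T - ts))[:, None] * m
```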

In order to further demonstrate the effectiveness of our algorithm, we conducted additional experiments comparing Langevin dynamics with our proposed method in this sampling scenario. To better highlight the impact of the different components, we chose the 2-dimensional ill-conditioned Gaussian distribution \mathcal{N}\left((20,20),\operatorname{diag}(400,1)\right) (shown in Figure 7) as the target distribution. In this setting, we have an oracle sampler for q_{t}, enabling a more precise experimental analysis of the sample complexity.
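For this Gaussian target, the posterior q_{T-k\eta}(\cdot|{\bm{x}}) is itself Gaussian, so its mean, and hence the score of p_{t}, is available in closed form. The following is a minimal sketch of the resulting oracle-score reverse-diffusion update; the horizon, step size, and seed are illustrative assumptions and not the settings used for the figures.

```python
import numpy as np

m = np.array([20.0, 20.0])
Sigma = np.diag([400.0, 1.0])
Sigma_inv = np.linalg.inv(Sigma)

def score_pt(x, t):
    """Score of the OU-diffused target p_t, via the posterior mean of x_0 given x_t = x."""
    a, s2 = np.exp(-t), 1.0 - np.exp(-2.0 * t)          # x_t = a * x_0 + sqrt(s2) * noise
    post_prec = Sigma_inv + (a**2 / s2) * np.eye(2)     # posterior precision of x_0 | x_t
    post_mean = np.linalg.solve(post_prec, Sigma_inv @ m + (a / s2) * x)
    return -(x - a * post_mean) / s2

T, eta = 5.0, 0.05
rng = np.random.default_rng(0)
x = rng.standard_normal(2)                              # approximate p_T by N(0, I)
for k in range(int(T / eta)):
    t = T - k * eta
    x = x + eta * (x + 2.0 * score_pt(x, t)) + np.sqrt(2.0 * eta) * rng.standard_normal(2)
# x is now an approximate draw from N((20, 20), diag(400, 1)).
```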

Figure 7: Sample illustration of the initial distribution \mathcal{N}(0,I) and the target distribution \mathcal{N}((20,20),\operatorname{diag}(400,1)).

Figure 8 illustrates the distributions of samples generated by the LMC algorithm at iterations 10, 100, 1000, and 10000. It can be observed that these distributions deviate from the target distribution.

Figure 8: Illustration of LMC with different numbers of iterations (T=10, 100, 1000, 10000).

Figure 9 demonstrates the performance of the rdMC algorithm in the scenario of infinite samples, revealing its significantly superior convergence compared to the LMC algorithm.

Figure 9: Illustration of rdMC (oracle sampler, infinite samples) with different numbers of iterations (T/(k\eta)=10, 100, 1000, 10000).

Figure 10 showcases the convergence of the rdMC algorithm in estimating the score under different sample sizes. It can be observed that our algorithm is not sensitive to the sample quantity, demonstrating its robustness in practical applications.

Figure 10: Illustration of rdMC (oracle sampler, finite samples, \frac{T}{k\eta}=100) with sample sizes 1, 10, 100, and 1000.

Figure 11 illustrates the convergence behavior of the rdMC algorithm with an inexact solver. It can be observed that even when employing LMC as the solver for the inner loop, the final convergence of our algorithm surpasses that of the original LMC. This is attributed to the insensitivity of the algorithm to the precision of the inner loop when tt is large. Additionally, when tt is small, the log-Sobolev constant of the inner problem is relatively large, simplifying the problem as a whole and guiding the samples towards the target distribution through the diffusion path.
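For concreteness, here is a minimal Python sketch of the inexact inner-loop idea described above: the posterior mean is estimated with a short ULA chain, and the resulting score estimate is plugged into the reverse-diffusion update. The gradient oracle grad_f_star, the initialization, and all step sizes and sample counts are illustrative assumptions; this is a sketch of the idea, not a line-by-line implementation of Algorithm 1.

```python
import numpy as np

def inner_score_estimate(grad_f_star, x, t, n_samples, inner_steps, tau, rng):
    """Monte Carlo estimate of the score of p_t at x.

    Each inner chain runs ULA on the posterior potential
    g_t(x0 | x) = f_*(x0) + ||x - e^{-t} x0||^2 / (2 (1 - e^{-2t})),
    and the score estimate averages -(x - e^{-t} x0) / (1 - e^{-2t}) over the samples.
    """
    a, s2 = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    estimates = []
    for _ in range(n_samples):
        # Initialize from the Gaussian q'_{t,0}(. | x), whose mean is x / a.
        x0 = (x + np.sqrt(s2) * rng.standard_normal(x.shape)) / a
        for _ in range(inner_steps):
            grad_g = grad_f_star(x0) - (a / s2) * (x - a * x0)
            x0 = x0 - tau * grad_g + np.sqrt(2.0 * tau) * rng.standard_normal(x0.shape)
        estimates.append(-(x - a * x0) / s2)
    return np.mean(estimates, axis=0)

# Outer reverse-diffusion loop on the 2-D ill-conditioned Gaussian of Figure 7.
rng = np.random.default_rng(0)
m, Sigma_inv = np.array([20.0, 20.0]), np.diag([1.0 / 400.0, 1.0])
grad_f_star = lambda x0: Sigma_inv @ (x0 - m)
T, eta = 5.0, 0.05
x = rng.standard_normal(2)                      # approximate p_T by N(0, I)
for k in range(int(T / eta)):
    score = inner_score_estimate(grad_f_star, x, T - k * eta, 10, 100, 1e-3, rng)
    x = x + eta * (x + 2.0 * score) + np.sqrt(2.0 * eta) * rng.standard_normal(2)
```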

Figure 11: Illustration of rdMC (inexact sampler, finite samples, \frac{T}{k\eta}=100) with inner-loop lengths 50, 100, 500, and 1000.

F.3 Sampling from high-dimensional distributions

To validate the dimensional dependency of the rdMC algorithm, we extend our Gaussian mixture model experiments.

In the log-Sobolev distribution context (see Section 4.2.1), both the Langevin-based algorithm and our proposed method exhibit polynomial dependence on dimensionality. However, the log-Sobolev constant grows exponentially with the radius R, which is the primary influencing factor. This leads to behavior akin to the 2-D example presented earlier. A significant limitation of the Langevin-based algorithm is its inability to converge within finite time for large R values, in contrast to the robustness of our algorithm. In higher-dimensional scenarios (e.g., r=2 for d=50 and d=100), we observe a notable decrease in rdMC performance after approximately 100 computations. This decline may stem from the kernel-based computation of MMD, which tends to be less sensitive in higher dimensions. For these large r cases, LMC and ULMC fail to converge in finite time. Overall, Figure 12 exhibits trends consistent with Figure 3, corroborating our theoretical findings.

Figure 12: Illustration of rdMC and MCMC for the high-dimensional Gaussian mixture model (6 modes), for d=5, 10, 50, and 100. The first row shows the case of r=1; the second row shows the case of r=2.

For heavy-tailed distributions (refer to Section 4.2.2), the complexity of both rdMC and MCMC is quite high, so it is not feasible to sample from them in limited time. Nonetheless, since our algorithm has only polynomial dependency on d while the Langevin-based algorithms have exponential dependency with respect to d, we may have more of an advantage. For example, we can use the n-th moment to demonstrate the similarity of different distributions. To illustrate the phenomenon, we consider the extreme case, the Cauchy distribution (the mean of the Cauchy distribution does not exist; we use this case for a clearer heavy-tail demonstration). According to Figure 13, rdMC approximates the true distribution better, and the approximation gap increases with respect to the dimension.

Figure 13: Moment estimators (1st, 2nd, and 3rd) for the Cauchy distribution when n=1000. It reflects the closeness between different distributions. For high-dimensional moments, we compute the sum across all dimensions (trace) for proper plotting. The algorithms are evaluated with 1K gradient complexity.

F.4 Discussion on the choice of p~0\tilde{p}_{0}

Note that different choices of \tilde{p}_{0} may affect the algorithm. In this subsection, we discuss the impact of the choice of \tilde{p}_{0} and illustrate its influence in practice.

Practically, selecting T determines the initial distribution for reverse diffusion. While any choice of T can achieve convergence, the computational complexity varies with different T values. According to Figure 14, as T increases, the modes of p_{T} tend to merge into a single mode, and the merging is exponentially fast (Lemma 21). Even with T=-\ln 0.95, the disconnectedness of the modes can be alleviated significantly. Thus, the algorithm is not sensitive to the choice of T when T ranges from -\ln 0.95 to -\ln 0.7.

Figure 14: Choice of p_{T} and the approximation by \tilde{p}_{0} (panels: T=-\ln 0.7, -\ln 0.8, -\ln 0.9, -\ln 0.95). For T from -\ln 0.95 to -\ln 0.7, \tilde{p}_{0} can approximate p_{T} properly, where the modes are connected so that the approximated distribution is well-mixed. Then the rdMC can be performed properly.

However, choosing T too small or too large has consequences:

  • If T is too small, the approximation by \tilde{p}_{0} would be extremely hard. For example, if T\rightarrow 0, the algorithm would be similar to the Langevin algorithm;

  • If T is too large, it would be wasteful to transport from T to -\ln 0.7, since the distribution in this interval is highly homogeneous (a varied step-size schedule could address this, which is interesting future work).

In summary, aside from the T0T\rightarrow 0 scenario, our algorithm exhibits insensitivity to the choice of TT (or p~0\tilde{p}_{0}). Selecting an appropriate TT can reduce computational demands when using a constant step size schedule.

F.5 Neal’s Funnel

Neal’s Funnel is a classic demonstration of how traditional MCMC methods struggle with convergence unless specific parameterization strategies are employed. We further investigate the performance of our algorithm for this scenario. As indicated in Figure 15, our method demonstrates more rapid convergence compared to Langevin-based MCMC approaches. Additionally, Figure 16 reveals that while LMC lacks efficient exploration capabilities, ULMC fails to accurately represent density. This discrepancy stems from incorporating momentum/velocity. Our algorithm strikes an improved balance between exploration and precision, primarily due to the efficacy of the reverse diffusion path, thereby enhancing convergence speed.
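For reference, Neal's funnel places v\sim\mathcal{N}(0,3^{2}) and x_{i}\mid v\sim\mathcal{N}(0,e^{v}) for the remaining coordinates. Below is a minimal sketch of the negative log-density and its gradient, which is the only interface the compared samplers need; the constants follow the standard funnel, and any rescaling used in our experiments may differ.

```python
import numpy as np

def funnel_neg_log_density(z):
    """Negative log-density of Neal's funnel at z = (v, x_1, ..., x_{d-1}), up to a constant."""
    v, x = z[0], z[1:]
    return v**2 / 18.0 + 0.5 * x.size * v + 0.5 * np.exp(-v) * np.sum(x**2)

def funnel_grad(z):
    """Gradient of the negative log-density above."""
    v, x = z[0], z[1:]
    g = np.empty_like(z)
    g[0] = v / 9.0 + 0.5 * x.size - 0.5 * np.exp(-v) * np.sum(x**2)
    g[1:] = np.exp(-v) * x
    return g

# Example: evaluate the gradient at a random point in d = 10.
rng = np.random.default_rng(0)
print(funnel_grad(rng.standard_normal(10)))
```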

Figure 15: Convergence of Neal's Funnel for rdMC, LMC, and ULMC (d=2, 5, 10).
Figure 16: Samples from Neal's Funnel (Ground Truth, LMC, ULMC, rdMC).