
Convergence of score-based generative modeling
for general data distributions

Holden Lee (Johns Hopkins University), Jianfeng Lu (Duke University), Yixin Tan (Duke University)
Abstract

Score-based generative modeling (SGM) has grown to be a hugely successful method for learning to generate samples from complex data distributions such as those of images and audio. It is based on evolving an SDE that transforms white noise into a sample from the learned distribution, using estimates of the score function, or gradient of the log-pdf. Previous convergence analyses for these methods have suffered either from strong assumptions on the data distribution or exponential dependencies, and hence fail to give efficient guarantees for the multimodal and non-smooth distributions that arise in practice and for which good empirical performance is observed. We consider a popular kind of SGM—denoising diffusion models—and give polynomial convergence guarantees for general data distributions, with no assumptions related to functional inequalities or smoothness. Assuming $L^{2}$-accurate score estimates, we obtain Wasserstein distance guarantees for any distribution of bounded support or sufficiently decaying tails, as well as TV guarantees for distributions with further smoothness assumptions.

1 Introduction

Diffusion models have gained huge popularity in recent years in machine learning, as a method to learn and generate new samples from a data distribution. Score-based generative modeling (SGM), as a particular kind of diffusion model, uses learned score functions (gradients of the log-pdf) to transform white noise to the data distribution by following a stochastic differential equation. While SGM has achieved state-of-the-art performance for artificial image and audio generation [SE19, Dat+19, Gra+19, SE20, Son+20, Men+21, Son+21a, Son+21, Jin+22], including being a key component of text-to-image systems [Ram+22], our theoretical understanding of these models is still nascent.

In particular, basic questions on the convergence of the generated distribution to the data distribution remain unanswered. Recent theoretical work on SGM has attempted to answer these questions [De+21, LLT22, De22], but it either suffers from exponential dependence on parameters or relies on strong assumptions on the data distribution, such as functional inequalities or smoothness, which are rarely satisfied in practical situations. For example, considering the hallmark application of generating images from text, we expect the distribution of images to be (a) multimodal, and hence not satisfying functional inequalities with reasonable constants, and (b) supported on lower-dimensional manifolds, and hence not smooth. However, SGM still performs remarkably well in these settings. Indeed, this is one relative advantage over other approaches to generative modeling such as generative adversarial networks, which can struggle to learn multimodal distributions [ARZ18].

In this work, we aim to develop theoretical convergence guarantees with polynomial complexity for SGM under minimal data assumptions.

1.1 Problem setting

Given samples from a data distribution $P_{\textup{data}}$, the problem of generative modeling is to learn the distribution in a way that allows generation of new samples. A general framework for many score-based generative models is where noise is injected into $P_{\textup{data}}$ via a forward SDE [Son+20]

$d\widetilde{x}_{t}=f(\widetilde{x}_{t},t)\,dt+g(t)\,dw_{t},\quad t\in[0,T],$   (1)

where $\widetilde{x}_{0}\sim\widetilde{P}_{0}:=P_{\textup{data}}$. Let $\widetilde{p}_{t}$ denote the density of $\widetilde{x}_{t}$. Remarkably, $\widetilde{x}_{t}$ also satisfies a reverse-time SDE,

$d\widetilde{x}_{t}=[f(\widetilde{x}_{t},t)-g(t)^{2}\nabla\ln\widetilde{p}_{t}(\widetilde{x}_{t})]\,dt+g(t)\,d\widetilde{w}_{t},\quad t\in[0,T],$   (2)

where $\widetilde{w}_{t}$ is a backward Brownian motion [And82]. Because the forward process transforms the data distribution to noise, the hope is to use the backward process to transform noise into samples.

In practice, when we only have sample access to $P_{\textup{data}}$, the score function $\nabla\ln\widetilde{p}_{t}$ is not available. A key mechanism behind SGM is that the score function is learnable from data, through empirically minimizing a denoising objective evaluated at noisy samples $\widetilde{x}_{t}$ [Vin11]. The samples $\widetilde{x}_{t}$ are obtained by evolving the forward SDE starting from the data samples $\widetilde{x}_{0}$, and the optimization is done within an expressive function class such as neural networks. If the score function is successfully approximated in this way, then the $L^{2}$-error $\mathbb{E}_{\widetilde{p}_{t}}[\|\nabla\ln\widetilde{p}_{t}(x)-s(x,t)\|^{2}]$ will be small. The natural question is then the following:

Given $L^{2}$-error bounds of the score function, how close is the distribution generated by (2) (with score estimate $s(x,t)$ in place of $\nabla\ln\widetilde{p}_{t}$, and appropriate discretization) to the data distribution $P_{\textup{data}}$?

We note it is more realistic to consider $L^{2}$ rather than $L^{\infty}$-error, and this makes the analysis more challenging. Indeed, prior work on Langevin Monte Carlo [EHZ21] and related sampling algorithms only applies when the score function is known exactly, or with suitable modification, known up to $L^{\infty}$-error. $L^{2}$-error has a genuinely different effect from $L^{\infty}$-error, as it can cause the stationary distribution for Langevin Monte Carlo to be arbitrarily different [LLT22], necessitating a “medium-time” analysis.
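To make the learning step above concrete, the following is a minimal sketch of the denoising objective of [Vin11] for the forward process (3) below with $g\equiv 1$, written in Python with PyTorch; the names (`score_net`, `dsm_loss`) and batch conventions are our own illustrative choices, not anything specified in this paper. It uses the fact that, conditioned on $\widetilde{x}_{0}$, the noised sample $\widetilde{x}_{t}=m_{t}\widetilde{x}_{0}+\sigma_{t}z$ is Gaussian with conditional score $-z/\sigma_{t}$, so regressing onto this target minimizes the $L^{2}$ score error in expectation.

    import torch

    def dsm_loss(score_net, x0, t):
        # Denoising score matching for the OU forward process with g = 1:
        # x_t = m_t x_0 + sigma_t z, m_t = exp(-t/2), sigma_t^2 = 1 - exp(-t).
        m_t = torch.exp(-t / 2)                    # t has shape (batch, 1)
        sigma_t = torch.sqrt(1.0 - torch.exp(-t))
        z = torch.randn_like(x0)                   # x0 has shape (batch, d)
        x_t = m_t * x0 + sigma_t * z
        target = -z / sigma_t                      # conditional score of x_t given x_0
        return ((score_net(x_t, t) - target) ** 2).sum(dim=1).mean()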

In addition, we hope to obtain a result with as few structural assumptions as possible on $P_{\textup{data}}$, so that it can be useful in realistic scenarios where SGM is applied.

1.2 Prior work on convergence guarantees

We highlight two recent works which make progress on this problem. [LLT22] are the first to give polynomial convergence guarantees in TV distance under $L^{2}$-accurate score for a reasonable family of distributions. They introduce a framework to reduce the analysis under $L^{2}$-accurate score to $L^{\infty}$-accurate score. However, they rely on the data distribution satisfying smoothness conditions and a log-Sobolev inequality—a strong assumption which essentially limits the guarantees to unimodal distributions.

[De22] instead make minimal data assumptions, giving convergence in Wasserstein distance for distributions with bounded support $\mathcal{M}$. In particular, this covers the case of distributions supported on lower-dimensional manifolds, where guarantees in TV distance are unattainable. However, for general distributions, their guarantees have exponential dependence on the diameter of $\mathcal{M}$ and the inverse of the desired error ($\exp(O(\operatorname{diam}(\mathcal{M})^{2}/\varepsilon))$), and for smooth distributions, an improved but still exponential dependence on the growth rate of the Hessian $\nabla^{2}\ln\widetilde{p}_{t}$ as the noise approaches 0 ($\exp(\widetilde{O}(\Gamma))$ for distributions with $\|\nabla^{2}\ln\widetilde{p}_{t}\|\leq\Gamma/\sigma_{t}^{2}$).

We note that other works also analyze the generalization error of a learned score estimate [BMR20, De22]. This is an important question because without further assumptions, learning an $L^{2}$-accurate score estimate requires a number of samples exponential in the dimension. As this is beyond the scope of our paper, we assume that an $L^{2}$-accurate score estimate is obtainable.

1.3 Our contributions

In this work, we analyze convergence in the most general setting of distributions of bounded support, as in [De22]. We give Wasserstein bounds for any distribution of bounded support (or with sufficiently decaying tails), and TV bounds for distributions under smoothness assumptions, that are polynomial in all parameters and do not rely on the data distribution satisfying any functional inequality. This gives theoretical grounding to the empirical success of SGM on data distributions that are often multimodal and non-smooth.

We streamline the $\chi^{2}$-based analysis of [LLT22], with significant changes so as to completely remove the use of functional inequalities. In particular, the biggest challenge—and our key improvement—is to bound a certain KL divergence without reliance on a global functional inequality. For this, we prove a key lemma stating that distributions which are close in $\chi^{2}$-divergence have score functions that are close in $L^{2}$ (which may be of independent interest), and then a structural result showing that the distributions arising from the diffusion process can be slightly modified so as to satisfy the desired inequality, through decomposition into distributions that do satisfy a log-Sobolev inequality.

Upon finishing our paper, we learned of concurrent and independent work [Che+22] which obtained theoretical guarantees for score-based generative modeling under similarly general assumptions on the data distribution. We note that although our bounds are obtained under similar assumptions (with our assumption on the score estimate accuracy slightly weaker than theirs), our proof techniques are quite different. Following the “bad set” idea from [LLT22], we derive a change-of-measure inequality in Theorem 7.1, while the analysis in [Che+22] is based on the Girsanov approach.

2 Main results

To state our results, we will consider a specific type of SGM called denoising diffusion probabilistic modeling (DDPM) [HJA20], where in the forward SDE (1), $f(x,t)=-\frac{1}{2}g(t)^{2}x$ for some non-decreasing function $g$ to be chosen. The forward process is an Ornstein-Uhlenbeck process with time rescaling: $\widetilde{x}_{t}$ has the same distribution as

$m_{t}\widetilde{x}_{0}+\sigma_{t}z$, where $m_{t}=\exp\left[-\frac{1}{2}\int_{0}^{t}g(s)^{2}\,ds\right]$, $\sigma_{t}^{2}=1-\exp\left[-\int_{0}^{t}g(s)^{2}\,ds\right]$, and $z\sim N(0,I)$.   (3)
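For example, with the default choice $g\equiv 1$ used in our main results, (3) specializes to $m_{t}=e^{-t/2}$ and $\sigma_{t}^{2}=1-e^{-t}$, so that

$\widetilde{x}_{t}\overset{d}{=}e^{-t/2}\,\widetilde{x}_{0}+\sqrt{1-e^{-t}}\,z,\qquad z\sim N(0,I),$

and $\widetilde{p}_{t}$ converges to $N(0,I)$ exponentially fast as $t\to\infty$; this motivates the choice of prior in (7) below.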

Given an estimated score function $s(x,t)$ approximating $\nabla\ln\widetilde{p}_{t}(x)$, we can simulate the reverse process (reparameterizing $t\mapsto T-t$ and denoting $p_{t}:=\widetilde{p}_{T-t}$)

$dx_{t}=\frac{1}{2}g(T-t)^{2}\left(x_{t}+2\nabla\ln p_{t}(x_{t})\right)dt+g(T-t)\,dw_{t}$   (4)

with the exponential integrator discretization [ZC22], where $h_{k}=t_{k+1}-t_{k}$ and $\eta_{k+1}\sim N(0,I_{d})$:

$z_{t_{k+1}}=z_{t_{k}}+\gamma_{1,k}(z_{t_{k}}+2s(z_{t_{k}},T-t_{k}))+\sqrt{\gamma_{2,k}}\,\eta_{k+1},$   (5)
where $\gamma_{1,k}=\exp\left[\frac{1}{2}G_{t_{k},t_{k+1}}\right]-1$, $\gamma_{2,k}=\exp\left[G_{t_{k},t_{k+1}}\right]-1$, and $G_{t',t}:=\int_{t'}^{t}g(T-s)^{2}\,ds$.   (6)

We initialize $z_{0}$ with a prior distribution that approximates $p_{0}=\widetilde{p}_{T}$ for sufficiently large $T$:

$z_{0}\sim q_{0}=p_{\textup{prior}}:=N(0,\sigma_{T}^{2}I_{d})\approx N(0,I_{d}).$   (7)

While we focus on DDPM, we note that the continuous process underlying DDPM is equivalent to that of score-matching Langevin diffusion (SMLD) under reparameterization in time and space (see [LLT22, §C.2]). We will further take $g\equiv 1$ for convenience in stating our results.

Our goal is to obtain a quantitative guarantee for the distance between the distribution $q_{t_{K}}$ of $z_{t_{K}}$ (for appropriate $t_{K}\approx T$) and $P_{\textup{data}}$, under an $L^{2}$-score error guarantee. In the following, we assume a sequence of discretization points $0=t_{0}<t_{1}<\cdots<t_{K}\leq T$ has been chosen.

Assumption 1 ($L^{2}$ score error).

For any $t\in\{T-t_{0},\ldots,T-t_{K}\}$, the error in the score estimate is bounded in $L^{2}(\widetilde{p}_{t})$:

$\left\|\nabla\ln\widetilde{p}_{t}-s(\cdot,t)\right\|_{L^{2}(\widetilde{p}_{t})}^{2}=\mathbb{E}_{\widetilde{p}_{t}}\left[\left\|\nabla\ln\widetilde{p}_{t}(x)-s(x,t)\right\|^{2}\right]\leq\varepsilon_{t}^{2}:=\frac{\varepsilon_{\sigma}^{2}}{\sigma_{t}^{4}}.$

We note that the gradient $\nabla\ln\widetilde{p}_{t}$ grows as $\frac{1}{\sigma_{t}^{2}}$ as $t\to 0$, so this is a reasonable assumption, and quantitatively weaker than a uniform bound over $t$.
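As a simple illustration (our example, not drawn from the paper): if $P_{\textup{data}}=\delta_{0}$ is a point mass at the origin, then $\widetilde{p}_{t}=N(0,\sigma_{t}^{2}I_{d})$ and

$\nabla\ln\widetilde{p}_{t}(x)=-\frac{x}{\sigma_{t}^{2}},$

so the true score indeed blows up at rate $1/\sigma_{t}^{2}$ as $t\to 0$, and Assumption 1 only asks the estimate to track it at the same relative scale, $\varepsilon_{t}=\varepsilon_{\sigma}/\sigma_{t}^{2}$.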

Assumption 2 (Bounded support).

$P_{\textup{data}}$ is supported on $B_{R}(0):=\left\{x\in\mathbb{R}^{d}:\left\|x\right\|\leq R\right\}$.

For simplicity, we assume bounded support when stating our main theorems, but note that our results generalize to distributions with sufficiently fast power decay. In the application of image generation, pixel values are bounded, so Assumption 2 is satisfied with $R$ typically on the order of $\sqrt{d}$.

These are the only assumptions we need to obtain a polynomial complexity guarantee. We also consider the following stronger smoothness assumption, which is Assumption A.6 in [De22] and will give better dependencies. Note that [De22, Theorem I.8] shows a (nonuniform) version of Assumption 3 holds when $p_{0}$ is a smooth density on a convex submanifold.

Assumption 3.

The following bound on the Hessian of the log-pdf holds for any $t>0$ and any $x$:

$\left\|\nabla^{2}\ln p_{t}(x)\right\|\leq\frac{C}{\sigma_{t}^{2}},$

for some constant $C>0$.

Finally, the following smoothness assumption on $\widetilde{p}_{0}$ will allow us to obtain TV guarantees.

Assumption 4.

$P_{\textup{data}}$ admits a density $\widetilde{p}_{0}\propto e^{-V(x)}$, where $V(x)$ is $L$-smooth.

We are now ready to state our main theorems.

INPUT: Time $T$; discretization points $0=t_{0}<t_{1}<\cdots<t_{N}\leq T$; score estimates $s(\cdot,T-t_{k})$; radius $R$; function $g$ (default: $g\equiv 1$)
Draw $z_{0}\sim p_{\textup{prior}}$ from the prior distribution $p_{\textup{prior}}$ given by (7).
for $k$ from $1$ to $N$ do
     Compute $z_{t_{k}}$ from $z_{t_{k-1}}$ using (5).
end for
Let $\widehat{z}_{t_{N}}=z_{t_{N}}$ if $z_{t_{N}}\in B_{R}(0)$; otherwise, let $\widehat{z}_{t_{N}}=0$.
Algorithm 1 DDPM with exponential integrator [Son+20, ZC22]
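The following is a minimal NumPy sketch of Algorithm 1 for the default $g\equiv 1$, in which case $G_{t_{k},t_{k+1}}=h_{k}$; the function name and signature are our own illustrative choices, and the score estimate `s(x, t)` (approximating $\nabla\ln\widetilde{p}_{t}(x)$) is assumed to be supplied by the user.

    import numpy as np

    def ddpm_sample(s, ts, T, d, R, seed=0):
        # Algorithm 1 with g = 1: prior (7), update (5)-(6), final truncation.
        rng = np.random.default_rng(seed)
        sigma2_T = 1.0 - np.exp(-T)            # sigma_T^2, cf. (7)
        z = np.sqrt(sigma2_T) * rng.standard_normal(d)
        for k in range(len(ts) - 1):
            h = ts[k + 1] - ts[k]              # h_k = t_{k+1} - t_k = G_{t_k, t_{k+1}}
            gamma1 = np.exp(0.5 * h) - 1.0     # gamma_{1,k}, cf. (6)
            gamma2 = np.exp(h) - 1.0           # gamma_{2,k}, cf. (6)
            z = z + gamma1 * (z + 2.0 * s(z, T - ts[k])) \
                  + np.sqrt(gamma2) * rng.standard_normal(d)
        return z if np.linalg.norm(z) <= R else np.zeros(d)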
Theorem 2.1 (Wasserstein+TV error for distributions with bounded support).

Suppose that Assumptions 1 and 2 hold with $R\geq\sqrt{d}$. Then there is a sequence of discretization points $0=t_{0}<t_{1}<\cdots<t_{N}<T$ with $N=O(\operatorname{poly}(d,R,1/\varepsilon_{\operatorname{TV}},1/\varepsilon_{\textup{W}}))$ such that if $\varepsilon_{\sigma}=\widetilde{O}\left(\frac{\varepsilon_{\operatorname{TV}}^{6.5}\varepsilon_{\textup{W}}^{5}}{R^{9}d^{2.25}}\right)$, then the distribution $q_{t_{N}}$ of the output $z_{t_{N}}$ of DDPM is $\varepsilon_{\operatorname{TV}}$-close in TV distance to a distribution that is $\varepsilon_{\textup{W}}$-close in $W_{2}$-distance to $P_{\textup{data}}$. If in addition Assumption 3 holds with $C\geq R^{2}$, it suffices for $\varepsilon_{\sigma}=\widetilde{O}\left(\frac{\varepsilon_{\operatorname{TV}}^{4}}{C^{2}d}\right)$ (note that the $\widetilde{O}(\cdot)$ hides logarithmic dependence on $\varepsilon_{\textup{W}}$).

This result is perhaps surprising at first glance, as it is well known that for sampling algorithms such as Langevin Monte Carlo, structural assumptions on the target distribution—such as a log-Sobolev inequality—are required to obtain similar theoretical guarantees, even with knowledge of the exact score function. The key reason that we can do better is that we utilize a sequence of score functions $s_{t}$ along the reverse SDE, which is not available in standard sampling settings. Moreover, we choose $T$ large enough so that $q_{0}=p_{\textup{prior}}$ is close to $p_{0}$, so that it suffices to track the evolution of the true process (2), that is, to maintain rather than decrease the error. To some extent, this result shows the power of DDPM and other reverse-SDE-based methods compared with generative modeling based on standard Langevin Monte Carlo.

A statement with more precise dependencies, and which works for unbounded distributions with sufficiently decaying tails, can be found as Theorem 7.2. We note that under the Hessian bound (Assumption 3), up to logarithmic factors, the same score error bound suffices to obtain a fixed TV distance to a distribution arbitrarily close in $W_{2}$ distance. By truncating the resulting distribution, we can also obtain purely Wasserstein error bounds.

Theorem 2.2 (Wasserstein error for distributions with bounded support).

In the same setting as Theorem 2.1, consider the distribution $\widehat{q}_{t_{N}}$ of the truncated output $\widehat{z}_{t_{N}}$ of DDPM. If Assumptions 1 and 2 hold with $R\geq\sqrt{d}$ and $\varepsilon_{\sigma}=\widetilde{O}\left(\frac{\varepsilon_{\textup{W}}^{18}}{R^{22}d^{2.25}}\right)$, then with appropriate (polynomial) choice of parameters, $W_{2}(\widehat{q}_{t_{N}},P_{\textup{data}})\leq\varepsilon_{\textup{W}}$. If in addition Assumption 3 holds with $C\geq R^{2}$, then $\varepsilon_{\sigma}=\widetilde{O}\left(\frac{\varepsilon_{\textup{W}}^{8}}{C^{2}R^{8}d}\right)$ suffices.

With an extra assumption on the smoothness of $P_{\textup{data}}$, we can also obtain purely TV error bounds:

Theorem 2.3 (TV error for distributions under smoothness assumption).

Suppose that Assumptions 1 and 4 hold, $P_{\textup{data}}$ is subexponential (with a fixed constant), and denote $R=\max\left\{\sqrt{d},\mathbb{E}_{P_{\textup{data}}}\left\|X\right\|\right\}$. Then there is a sequence of discretization points $0=t_{0}<t_{1}<\cdots<t_{N}<T$ with $N=O(\operatorname{poly}(d,R,1/\varepsilon_{\operatorname{TV}}))$ such that if $\varepsilon_{\sigma}=\widetilde{O}\left(\frac{\varepsilon_{\operatorname{TV}}^{11.5}}{R^{14}d^{2.25}L^{5}}\right)$, then $\operatorname{TV}(q_{t_{N}},P_{\textup{data}})\leq\varepsilon_{\operatorname{TV}}$. If in addition Assumption 3 holds with $C\geq R^{2}$, then $\varepsilon_{\sigma}=\widetilde{O}\left(\frac{\varepsilon_{\operatorname{TV}}^{4}}{C^{2}d}\right)$ suffices.

A more precise statement can be found as Theorem 7.3, which also works more generally with sufficient tail decay. We note that this result can be derived directly by combining Theorem 7.2 with a TV error bound between $P_{\textup{data}}$ and $p_{t_{N}}$ (Lemma 6.4) depending on the smoothness of $P_{\textup{data}}$.

3 Proof overview

Our proof uses the framework of [LLT22] to convert guarantees under an $L^{\infty}$-accurate score function to guarantees under an $L^{2}$-accurate score function. For the analysis under an $L^{\infty}$-accurate score function, we interpolate the discrete process with estimated score, $z_{t}\sim q_{t}$, and derive the differential inequality

$\frac{d}{dt}\chi^{2}(q_{t}||p_{t})=-g(T-t)^{2}\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right)+2\mathbb{E}\left[\left\langle g(T-t)^{2}\left(s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t}(z_{t})\right),\nabla\frac{q_{t}}{p_{t}}(z_{t})\right\rangle\right].$

We bound the resulting error terms, making ample use of the Donsker-Varadhan variational principle to convert expectations to be under $p_{t}$. Under small enough step sizes, this shows that $\chi^{2}(q_{t}||p_{t})$ grows slowly (Theorem 4.10), which suffices as the $\chi^{2}$-divergence decays exponentially in the forward process.

The most challenging error term to deal with is the KL divergence term $\operatorname{KL}(q_{t}\psi_{t}||p_{t})$. Our main innovation over the analysis of [LLT22] is bounding this term without a global log-Sobolev inequality for $p_{t}$. We note that it suffices for $p_{t}$ to be a mixture of distributions each satisfying a log-Sobolev inequality, with the logarithm of the minimum mixture weight bounded below, and in Lemma 5.2, we show that we can decompose any distribution of bounded support in this manner if we move a small amount of its mass.

In Section 6, we show that this does not significantly affect the estimate of the score function, by interpreting the score function as solving a Bayesian inference problem: that of de-noising a noised data point. More precisely, we show in Lemma 6.5 that the difference between the score functions of two different distributions can be bounded in $L^{2}$ in terms of their $\chi^{2}$-divergence, which may be of independent interest.
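To see the Bayesian interpretation concretely, recall the standard identity (stated here for intuition; it follows by differentiating the Gaussian convolution defining $\widetilde{p}_{t}$ under the integral) that the score is determined by the posterior mean of the de-noising problem:

$\nabla\ln\widetilde{p}_{t}(x)=\frac{m_{t}\,\mathbb{E}[\widetilde{x}_{0}\mid\widetilde{x}_{t}=x]-x}{\sigma_{t}^{2}}.$

Thus two data distributions inducing similar de-noising posteriors have similar scores, and Lemma 6.5 quantifies this in terms of $\chi^{2}$-divergence.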

Finally, we reduce from the $L^{2}$ to the $L^{\infty}$ setting by bounding the probability of hitting a bad set where the score error is large, and carefully choose parameters (Section 7). This gives a TV error bound to $\widetilde{p}_{\delta}$—the forward distribution at a small positive time. We then bound the Wasserstein distance of $\widetilde{p}_{\delta}$ to $\widetilde{P}_{0}$ (in the general case) or the TV distance (under additional smoothness of $\widetilde{P}_{0}$).

In Section A we show that the Hessian is always bounded by $O\left(\frac{d}{\sigma_{t}^{2}}\right)$ with high probability (cf. Assumption 3). We speculate that a high-probability rather than uniform bound on the Hessian (as in Lemma 4.13) can be used to obtain better dependencies, and leave this as an open problem.

Notation and definitions

We let $\widetilde{p}_{t}$ denote the density of $\widetilde{x}_{t}$ under the forward process (1). Note that $\widetilde{x}_{0}\sim\widetilde{P}_{0}$ may not admit a density, but $\widetilde{x}_{t}$ will for $t>0$. For the reverse process, we use the notation $p_{t}=\widetilde{p}_{T-t}$, $x_{t}=\widetilde{x}_{T-t}$. We defined $m_{t}$ and $\sigma_{t}$ in (3),

$m_{t}=\exp\left[-\frac{1}{2}\int_{0}^{t}g(s)^{2}\,ds\right],\qquad\sigma_{t}^{2}=1-\exp\left[-\int_{0}^{t}g(s)^{2}\,ds\right],$

and note that $\widetilde{p}_{t}=(M_{m_{t}\sharp}\widetilde{P}_{0})*\varphi_{\sigma_{t}^{2}}$, where $M_{m}(x)=mx$ denotes multiplication by $m$, $F_{\sharp}P$ denotes the pushforward of the measure $P$ by $F$, and $\varphi_{\sigma^{2}}$ is the density of $N(0,\sigma^{2}I_{d})$. When $g\equiv 1$, we note the bound $\sigma_{t}^{2}\leq\min\{1,t\}$ and $\sigma_{t}^{2}=\Theta(\min\{1,t\})$.

We will let $z_{t}$ denote the (interpolated) discrete process (see (12)) and let $q_{t}$ be the density of $z_{t}$. We define

$\phi_{t}(x)=\frac{q_{t}(x)}{p_{t}(x)},\qquad\psi_{t}(x)=\frac{\phi_{t}(x)}{\mathbb{E}_{p_{t}}\phi_{t}^{2}},$   (8)

and note that $q_{t}\psi_{t}$ is a probability density. We defined $G_{t',t}=\int_{t'}^{t}g(T-s)^{2}\,ds$ in (6).

We denote the estimated score function by $s(x,t)$ or $s_{t}(x)$ interchangeably.

A random variable $X$ is subgaussian with constant $C$ if

$C=\left\|X\right\|_{\psi_{2}}:=\inf\left\{t>0:\mathbb{E}\exp(X^{2}/t^{2})\leq 2\right\}<\infty.$

An $\mathbb{R}^{d}$-valued random variable $X$ is subgaussian with constant $C$ if for all $v\in\mathbb{S}^{d-1}$, $\left\langle X,v\right\rangle$ is subgaussian. We also define

$\left\|X\right\|_{2,\psi_{2}}:=\left\|\left\|X\right\|_{2}\right\|_{\psi_{2}}.$

Given a probability measure $P$ on $\mathbb{R}^{d}$ with density $p$, the associated Dirichlet form is $\mathscr{E}_{p}(f,g):=\int_{\mathbb{R}^{d}}\left\langle\nabla f,\nabla g\right\rangle\,P(dx)=\int_{\mathbb{R}^{d}}\left\langle\nabla f,\nabla g\right\rangle\,p(x)\,dx$; denote $\mathscr{E}_{p}(f)=\mathscr{E}_{p}(f,f)$. We say that a log-Sobolev inequality (LSI) holds with constant $C_{\textup{LS}}$ if for any probability density $q$,

$\operatorname{KL}(q||p)\leq\frac{C_{\textup{LS}}}{2}\mathscr{E}_{p}\left(\frac{q}{p},\ln\frac{q}{p}\right)=\frac{C_{\textup{LS}}}{2}\int_{\mathbb{R}^{d}}\left\|\nabla\ln\frac{q(x)}{p(x)}\right\|^{2}q(x)\,dx.$   (9)

Note that $\int_{\mathbb{R}^{d}}\left\|\nabla\ln\frac{q(x)}{p(x)}\right\|^{2}q(x)\,dx$ is also known as the Fisher information of $q$ with respect to $p$. Alternatively, defining the entropy by $\operatorname{Ent}_{p}(f)=\mathbb{E}_{p}f(x)\ln f(x)-\mathbb{E}_{p}f(x)\ln\mathbb{E}_{p}f(x)$, for any $f\geq 0$,

$\operatorname{Ent}_{p}(f)\leq\frac{C_{\textup{LS}}}{2}\mathscr{E}_{p}\left(f,\ln f\right)=\frac{C_{\textup{LS}}}{2}\int_{\mathbb{R}^{d}}\left\|\nabla\ln f(x)\right\|^{2}f(x)\,p(x)\,dx.$   (10)
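(The two formulations agree: taking $f=\frac{q}{p}$ in (10) gives $\operatorname{Ent}_{p}(f)=\mathbb{E}_{q}\ln\frac{q}{p}=\operatorname{KL}(q||p)$, since $\mathbb{E}_{p}\frac{q}{p}=1$, and $\mathscr{E}_{p}\left(\frac{q}{p},\ln\frac{q}{p}\right)=\int_{\mathbb{R}^{d}}\left\|\nabla\ln\frac{q(x)}{p(x)}\right\|^{2}q(x)\,dx$, recovering (9).)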

4 DDPM with $L^{\infty}$-accurate score estimate

We consider the error between the exact backwards SDE (4) and the exponential integrator with estimated score (5). In this section, we bound the error assuming that the score estimate $s$ is accurate in $L^{\infty}$.

Assumption 5 ($L^{\infty}$ score error).

For any $t\in\{T-t_{0},\ldots,T-t_{K}\}$, the error in the score estimate is uniformly bounded:

$\left\|\nabla\ln\widetilde{p}_{t}-s(\cdot,t)\right\|_{\infty}=\sup_{x\in\mathbb{R}^{d}}\left\|\nabla\ln\widetilde{p}_{t}(x)-s(x,t)\right\|\leq\varepsilon_{\infty,t}$   (11)

for some non-decreasing function $\varepsilon_{\infty,t}$.

In Section 7, we will relax this condition to the score function being accurate in $L^{2}$.

First, we construct the following continuous-time process, which interpolates the discrete-time process (5), for $t\in[t_{k},t_{k+1}]$:

$dz_{t}=g(T-t)^{2}\left(\frac{1}{2}z_{t}+s(z_{t_{k}},T-t_{k})\right)dt+g(T-t)\,dw_{t}.$   (12)

Integration gives that

$z_{t}-z_{t_{k}}=\left(\exp\left(\frac{1}{2}G_{t_{k},t}\right)-1\right)(z_{t_{k}}+2s(z_{t_{k}},T-t_{k}))+\int_{t_{k}}^{t}\exp\left(\frac{1}{2}\int_{t_{k}}^{t'}g(T-t'')^{2}\,dt''\right)g(T-t')\,dw_{t'},$   (13)

where $G_{t',t}$ is defined in (6).

Letting $q_{t}$ be the distribution of $z_{t}$ and $p_{t}$ be the distribution of $x_{t}$, we have by [LLT22, Lemma A.2] that

$\frac{d}{dt}\chi^{2}(q_{t}||p_{t})=-g(T-t)^{2}\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right)+2\mathbb{E}\left[\left\langle g(T-t)^{2}\left(s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t}(z_{t})\right),\nabla\frac{q_{t}}{p_{t}}(z_{t})\right\rangle\right].$   (14)

(Note that in our case, $\widehat{f}$ also depends on $z_{t}$ rather than just $z_{t_{k}}$, but this does not change the calculation.) Define $\phi_{t},\psi_{t}$ as in (8): $\phi_{t}(x)=\frac{q_{t}(x)}{p_{t}(x)}$, $\psi_{t}(x)=\frac{\phi_{t}(x)}{\mathbb{E}_{p_{t}}\phi_{t}^{2}}$.

To bound (14), we use the following lemma.

Lemma 4.1 (cf. [EHZ21, Lemma 1], [LLT22, Lemma A.3]).

For any $C>0$ and any $\mathbb{R}^{d}$-valued random variable $u$, we have

$\mathbb{E}\left[\left\langle u,\nabla\frac{q_{t}}{p_{t}}(z_{t})\right\rangle\right]\leq C\,(\chi^{2}(q_{t}||p_{t})+1)\,\mathbb{E}\left[\left\|u\right\|^{2}\psi_{t}(z_{t})\right]+\frac{1}{4C}\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right).$
Proof.

By Young's inequality,

$\mathbb{E}\left[\left\langle u,\nabla\frac{q_{t}}{p_{t}}(z_{t})\right\rangle\right]=\mathbb{E}\left[\left\langle u\sqrt{\frac{q_{t}(z_{t})}{p_{t}(z_{t})}},\sqrt{\frac{p_{t}(z_{t})}{q_{t}(z_{t})}}\nabla\frac{q_{t}}{p_{t}}(z_{t})\right\rangle\right]
\leq C\,\mathbb{E}\left[\left\|u\right\|^{2}\frac{q_{t}(z_{t})}{p_{t}(z_{t})}\right]+\frac{1}{4C}\mathbb{E}_{p_{t}}\left[\left\|\nabla\frac{q_{t}(x)}{p_{t}(x)}\right\|^{2}\right]
=C\,\mathbb{E}_{p_{t}}\phi_{t}^{2}\cdot\mathbb{E}\left[\left\|u\right\|^{2}\psi_{t}(z_{t})\right]+\frac{1}{4C}\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right)
=C\,(\chi^{2}(q_{t}||p_{t})+1)\,\mathbb{E}\left[\left\|u\right\|^{2}\psi_{t}(z_{t})\right]+\frac{1}{4C}\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right).\qquad\qed$
Lemma 4.2.

Suppose that (11) holds for $t=T-t_{k}$, $\nabla\ln p_{t_{k}}(x)$ is $L_{T-t_{k}}$-Lipschitz, $g$ is non-decreasing, and $h_{k}\leq\frac{1}{20L_{T-t_{k}}g(T-t_{k})^{2}}$. Then for $t\in[t_{k},t_{k+1}]$,

$\frac{d}{dt}\chi^{2}(q_{t}||p_{t})\leq-\frac{1}{2}g(T-t)^{2}\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right)+12g(T-t)^{2}(\chi^{2}(q_{t}||p_{t})+1)\cdot\Big[\varepsilon_{\infty,T-t_{k}}^{2}+16G_{t_{k},t}^{2}L_{T-t_{k}}^{2}\big[\mathbb{E}[\psi_{t}(z_{t})\left\|z_{t}\right\|^{2}]+16\,\mathbb{E}[\psi_{t}(z_{t})\left\|\nabla\ln p_{t}(z_{t})\right\|^{2}]\big]+64G_{t_{k},t}L_{T-t_{k}}^{2}\big(8\operatorname{KL}(\psi_{t}q_{t}||p_{t})+2d+16\ln 2\big)+\mathbb{E}\left[\left\|\nabla\ln p_{t_{k}}(z_{t})-\nabla\ln p_{t}(z_{t})\right\|^{2}\psi_{t}(z_{t})\right]\Big].$
Proof.

We bound the second term on the RHS of (14). By Lemma 4.1,

$\mathbb{E}\left[\left\langle s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t}(z_{t}),\nabla\frac{q_{t}}{p_{t}}(z_{t})\right\rangle\right]\leq(\chi^{2}(q_{t}||p_{t})+1)\,\mathbb{E}\left[\left\|s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t}(z_{t})\right\|^{2}\psi_{t}(z_{t})\right]+\frac{1}{4}\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right).$   (15)

Now

$\left\|s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t}(z_{t})\right\|^{2}\leq 3\left[\left\|s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t_{k}}(z_{t_{k}})\right\|^{2}+\left\|\nabla\ln p_{t_{k}}(z_{t_{k}})-\nabla\ln p_{t_{k}}(z_{t})\right\|^{2}+\left\|\nabla\ln p_{t_{k}}(z_{t})-\nabla\ln p_{t}(z_{t})\right\|^{2}\right]\leq 3\left[\left\|s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t_{k}}(z_{t_{k}})\right\|^{2}+L_{T-t_{k}}^{2}\left\|z_{t_{k}}-z_{t}\right\|^{2}+\left\|\nabla\ln p_{t_{k}}(z_{t})-\nabla\ln p_{t}(z_{t})\right\|^{2}\right]$

and

$\mathbb{E}\left[\left\|s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t_{k}}(z_{t_{k}})\right\|^{2}\psi_{t}(z_{t})\right]\leq\varepsilon_{\infty,T-t_{k}}^{2}$

by definition of $\varepsilon_{\infty,t}$, so by Lemma 4.3,

$\mathbb{E}\left[\left\|s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t}(z_{t})\right\|^{2}\psi_{t}(z_{t})\right]\leq 3\Big[\varepsilon_{\infty,T-t_{k}}^{2}+16G_{t_{k},t}^{2}L_{T-t_{k}}^{2}\big[\mathbb{E}[\psi_{t}(z_{t})\left\|z_{t}\right\|^{2}]+4\,\mathbb{E}[\psi_{t}(z_{t})\left\|s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t}(z_{t})\right\|^{2}]+16\,\mathbb{E}[\psi_{t}(z_{t})\left\|\nabla\ln p_{t}(z_{t})\right\|^{2}]\big]+64G_{t_{k},t}L_{T-t_{k}}^{2}\big(8\operatorname{KL}(\psi_{t}q_{t}||p_{t})+2d+16\ln 2\big)+\mathbb{E}\left[\left\|\nabla\ln p_{t_{k}}(z_{t})-\nabla\ln p_{t}(z_{t})\right\|^{2}\psi_{t}(z_{t})\right]\Big].$

The condition on $h_{k}$ and the fact that $g$ is non-decreasing imply $192G_{t_{k},t}^{2}L_{T-t_{k}}^{2}\leq\frac{1}{2}$. Rearranging gives

$\mathbb{E}\left[\left\|s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t}(z_{t})\right\|^{2}\psi_{t}(z_{t})\right]\leq 6\Big[\varepsilon_{\infty,T-t_{k}}^{2}+16G_{t_{k},t}^{2}L_{T-t_{k}}^{2}\big[\mathbb{E}[\psi_{t}(z_{t})\left\|z_{t}\right\|^{2}]+16\,\mathbb{E}[\psi_{t}(z_{t})\left\|\nabla\ln p_{t}(z_{t})\right\|^{2}]\big]+64G_{t_{k},t}L_{T-t_{k}}^{2}\big(8\operatorname{KL}(\psi_{t}q_{t}||p_{t})+2d+16\ln 2\big)+\mathbb{E}\left[\left\|\nabla\ln p_{t_{k}}(z_{t})-\nabla\ln p_{t}(z_{t})\right\|^{2}\psi_{t}(z_{t})\right]\Big].$

Substituting this into (15), and that inequality into (14), gives the conclusion. ∎

Lemma 4.3.

Suppose that $h_{k}\leq\frac{1}{2g(T-t_{k})^{2}}$. Then for $t\in[t_{k},t_{k+1}]$,

$\mathbb{E}\left[\left\|z_{t}-z_{t_{k}}\right\|^{2}\psi_{t}(z_{t})\right]\leq 16G_{t_{k},t}^{2}\Big[\mathbb{E}[\psi_{t}(z_{t})\left\|z_{t}\right\|^{2}]+4\,\mathbb{E}[\psi_{t}(z_{t})\left\|s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t}(z_{t})\right\|^{2}]+16\,\mathbb{E}[\psi_{t}(z_{t})\left\|\nabla\ln p_{t}(z_{t})\right\|^{2}]\Big]+64G_{t_{k},t}\big(8\operatorname{KL}(\psi_{t}q_{t}||p_{t})+2d+16\ln 2\big).$
Proof.

Consider (13). The assumption on $h_{k}$ implies $G_{t_{k},t}\leq G_{t_{k},t_{k+1}}\leq\frac{1}{2}$, so $\exp\left(\frac{1}{2}G_{t_{k},t}\right)-1\leq G_{t_{k},t}$. Let $Y$ denote the last term of (13). Then

$\left\|z_{t}-z_{t_{k}}\right\|\leq G_{t_{k},t}\left[\left\|z_{t_{k}}\right\|+2\left\|s(z_{t_{k}},T-t_{k})\right\|\right]+\left\|Y\right\|\leq G_{t_{k},t}\left[\left\|z_{t}\right\|+\left\|z_{t_{k}}-z_{t}\right\|+2\left\|s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t}(z_{t})\right\|+2\left\|\nabla\ln p_{t}(z_{t})\right\|\right]+\left\|Y\right\|.$

Again using $G_{t_{k},t}\leq\frac{1}{2}$, rearranging gives

$\left\|z_{t}-z_{t_{k}}\right\|\leq 2G_{t_{k},t}\left[\left\|z_{t}\right\|+2\left\|s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t}(z_{t})\right\|+4\left\|\nabla\ln p_{t}(z_{t})\right\|\right]+2\left\|Y\right\|,$

and

$\mathbb{E}\left[\left\|z_{t}-z_{t_{k}}\right\|^{2}\psi_{t}(z_{t})\right]\leq 16G_{t_{k},t}^{2}\Big[\mathbb{E}[\psi_{t}(z_{t})\left\|z_{t}\right\|^{2}]+4\,\mathbb{E}[\psi_{t}(z_{t})\left\|s(z_{t_{k}},T-t_{k})-\nabla\ln p_{t}(z_{t})\right\|^{2}]+16\,\mathbb{E}[\psi_{t}(z_{t})\left\|\nabla\ln p_{t}(z_{t})\right\|^{2}]\Big]+16\,\mathbb{E}[\psi_{t}(z_{t})\left\|Y\right\|^{2}].$

By Lemma 4.4,

$\mathbb{E}[\psi_{t}(z_{t})\left\|Y\right\|^{2}]\leq 4G_{t_{k},t}\big(8\operatorname{KL}(\psi_{t}q_{t}||p_{t})+2d+16\ln 2\big).$

The lemma follows. ∎

Lemma 4.4.

For $t\in[t_{k},t_{k+1}]$,

$\mathbb{E}\left[\psi_{t}(z_{t})\left\|\int_{t_{k}}^{t}\exp\left(\frac{1}{2}\int_{t_{k}}^{t'}g(T-t'')^{2}\,dt''\right)g(T-t')\,dw_{t'}\right\|^{2}\right]\leq 2(\exp(G_{t_{k},t})-1)\big(8\operatorname{KL}(\psi_{t}q_{t}||p_{t})+2d+16\ln 2\big).$
Proof.

Note that $Y:=\int_{t_{k}}^{t}\exp\left(\frac{1}{2}\int_{t_{k}}^{t'}g(T-t'')^{2}\,dt''\right)g(T-t')\,dw_{t'}$ is a Gaussian random vector with variance

$\int_{t_{k}}^{t}\exp\left(\int_{t_{k}}^{t'}g(T-t'')^{2}\,dt''\right)g(T-t')^{2}\,dt'\cdot I_{d}=\exp\left(\int_{t_{k}}^{t'}g(T-t'')^{2}\,dt''\right)\Big|_{t'=t_{k}}^{t'=t}\cdot I_{d}=(\exp(G_{t_{k},t})-1)\cdot I_{d}.$

(Note that this calculation shows that the continuous-time process (12) agrees with the discrete-time process (5) at $t=t_{k+1}$.) Using the Donsker-Varadhan variational principle, for any random variable $X$,

$\widetilde{\mathbb{E}}X\leq\operatorname{KL}(\widetilde{\mathbb{P}}||\mathbb{P})+\ln\mathbb{E}\exp X.$
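(For completeness, this is the Gibbs variational principle; when $\widetilde{\mathbb{P}}$ has a positive density with respect to $\mathbb{P}$, as here, it follows from Jensen's inequality under $\widetilde{\mathbb{P}}$:

$\ln\mathbb{E}\exp X=\ln\widetilde{\mathbb{E}}\left[e^{X}\,\frac{d\mathbb{P}}{d\widetilde{\mathbb{P}}}\right]\geq\widetilde{\mathbb{E}}\left[X+\ln\frac{d\mathbb{P}}{d\widetilde{\mathbb{P}}}\right]=\widetilde{\mathbb{E}}X-\operatorname{KL}(\widetilde{\mathbb{P}}||\mathbb{P}).)$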

Applying this to $X=c\left(\left\|Y\right\|-\mathbb{E}\left\|Y\right\|\right)^{2}$ for a constant $c>0$ to be chosen later, and $\widetilde{\mathbb{P}}$ such that $\frac{d\widetilde{\mathbb{P}}}{d\mathbb{P}}(z_{t})=\psi_{t}(z_{t})$, we can bound

$\widetilde{\mathbb{E}}\left\|Y\right\|^{2}\leq 2\mathbb{E}\left[\left\|Y\right\|^{2}\right]+2\widetilde{\mathbb{E}}\left[(\left\|Y\right\|-\mathbb{E}\left\|Y\right\|)^{2}\right]$
$\leq 2\mathbb{E}\left[\left\|Y\right\|^{2}\right]+\frac{2}{c}\left[\operatorname{KL}(\widetilde{\mathbb{P}}||\mathbb{P})+\ln\mathbb{E}\exp\left(c\left(\left\|Y\right\|-\mathbb{E}\left\|Y\right\|\right)^{2}\right)\right]$   (16)
$\leq 2d(\exp(G_{t_{k},t})-1)+\frac{2}{c}\left[\operatorname{KL}(\widetilde{\mathbb{P}}||\mathbb{P})+\ln\mathbb{E}\exp\left(c\left(\left\|Y\right\|-\mathbb{E}\left\|Y\right\|\right)^{2}\right)\right].$   (17)

Now following [Che+21, Theorem 4], we set $c=\frac{1}{8(\exp(G_{t_{k},t})-1)}$, so that

$\mathbb{E}\exp\left[\frac{(\left\|Y\right\|-\mathbb{E}\left\|Y\right\|)^{2}}{8(\exp(G_{t_{k},t})-1)}\right]\leq 2.$

Next, we have

$\operatorname{KL}(\widetilde{\mathbb{P}}||\mathbb{P})=\mathbb{E}_{\psi_{t}q_{t}}\ln\psi_{t}=\mathbb{E}_{\psi_{t}q_{t}}\ln\frac{\phi_{t}}{\mathbb{E}_{p_{t}}\phi_{t}^{2}}=\frac{1}{2}\mathbb{E}_{\psi_{t}q_{t}}\ln\frac{\phi_{t}^{2}}{(\mathbb{E}_{p_{t}}\phi_{t}^{2})^{2}}=\frac{1}{2}\left[\mathbb{E}_{\psi_{t}q_{t}}\ln\frac{\phi_{t}^{2}}{\mathbb{E}_{p_{t}}\phi_{t}^{2}}-\ln\mathbb{E}_{p_{t}}\phi_{t}^{2}\right]=\frac{1}{2}\left[\mathbb{E}_{\psi_{t}q_{t}}\ln\frac{\psi_{t}q_{t}}{p_{t}}-\ln\mathbb{E}_{p_{t}}\phi_{t}^{2}\right].$

Noting that $\mathbb{E}_{p_{t}}\phi_{t}^{2}=\chi^{2}(q_{t}||p_{t})+1\geq 1$, we have that

$\operatorname{KL}(\widetilde{\mathbb{P}}||\mathbb{P})\leq\frac{1}{2}\operatorname{KL}(\psi_{t}q_{t}||p_{t}).$

Substituting everything into (17) gives the desired inequality. ∎

Let

$K_{z}=\mathbb{E}\left[\psi_{t}(z_{t})\left\|z_{t}\right\|^{2}\right],$   (18)
$K_{V}=\mathbb{E}\left[\psi_{t}(z_{t})\left\|\nabla\ln p_{t}(z_{t})\right\|^{2}\right],$   (19)
$K_{\Delta V}=\mathbb{E}\left[\psi_{t}(z_{t})\left\|\nabla\ln p_{t_{k}}(z_{t})-\nabla\ln p_{t}(z_{t})\right\|^{2}\right],$   (20)
$K=\operatorname{KL}(\psi_{t}q_{t}||p_{t}).$   (21)

In order to bound the RHS in Lemma 4.2, we need to bound all four of these quantities, which we do in Lemmas 4.5, 4.6, and 4.8, and in Section 5, respectively. The main innovation in our analysis compared to [LLT22] is a new way to bound $K$, which we present in a separate section.

First we bound $K_{z}$. Recall the norm

$\left\|X\right\|_{2,\psi_{2}}=\inf\left\{L>0:\mathbb{E}e^{\frac{\left\|X\right\|_{2}^{2}}{L^{2}}}\leq 2\right\}.$

(In other words, this is the usual Orlicz norm applied to $\left\|X\right\|_{2}$.)

Lemma 4.5.

For $t\in[t_{k},t_{k+1}]$,

$\mathbb{E}\left[\psi_{t}(z_{t})\left\|z_{t}\right\|^{2}\right]\leq\left\|x_{t}\right\|_{2,\psi_{2}}^{2}\cdot\left[\operatorname{KL}(\psi_{t}q_{t}||p_{t})+\ln 2\right].$
Proof.

By the Donsker-Varadhan variational principle,

$\mathbb{E}\left[\psi_{t}(z_{t})\left\|z_{t}\right\|^{2}\right]=\frac{2}{s}\mathbb{E}_{\psi_{t}q_{t}}\left[\frac{s}{2}\left\|x\right\|^{2}\right]\leq\frac{2}{s}\left[\operatorname{KL}(\psi_{t}q_{t}||p_{t})+\ln\mathbb{E}_{p_{t}}\left[e^{\frac{s}{2}\left\|x\right\|^{2}}\right]\right]$

for any $s>0$. Choosing $s=2\left\|x_{t}\right\|_{2,\psi_{2}}^{-2}$, we have $\mathbb{E}_{p_{t}}\left[e^{\frac{s}{2}\left\|x\right\|^{2}}\right]\leq 2$, which gives the desired inequality. ∎

The following bounds $K_{V}$; note that the proof does not depend on the definition of $q_{t}$, only on the fact that it is a probability density.

Lemma 4.6 ([LLT22, Corollary C.7], [Che+21, Lemma 16]).

Suppose that $\nabla\ln p_{t}$ is $L$-Lipschitz. Then

$\mathbb{E}\left[\psi_{t}(z_{t})\left\|\nabla\ln p_{t}(z_{t})\right\|^{2}\right]\leq\frac{4}{\chi^{2}(q_{t}||p_{t})+1}\cdot\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right)+2dL.$

We use the following lemma to bound $K_{\Delta V}$ in Lemma 4.8.

Lemma 4.7 ([LLT22, Lemma C.12]).

Suppose that $p(x)\propto e^{-V(x)}$ is a probability density on $\mathbb{R}^{d}$, where $V(x)$ is $L$-smooth. Let $p_{\alpha}(x)=\alpha^{d}p(\alpha x)$ and let $\varphi_{\sigma^{2}}(x)$ denote the density function of $N(0,\sigma^{2}I_{d})$. Then for $\sigma^{2}\leq\frac{1}{2\alpha^{2}L}$,

$\left\|\nabla\ln\frac{p(x)}{(p_{\alpha}*\varphi_{\sigma^{2}})(x)}\right\|\leq 6\alpha^{2}L\sigma d^{1/2}+(\alpha+2\alpha^{3}L\sigma^{2})(\alpha-1)L\left\|x\right\|+(\alpha-1+2\alpha^{3}L\sigma^{2})\left\|\nabla V(x)\right\|.$
Lemma 4.8.

Suppose that $h_{k}\leq\frac{1}{4Lg(T-t_{k})^{2}}$, where $\nabla\ln p_{t}$ is $L_{T-t}$-smooth ($L_{T-t}\geq 1$) and $L=\max_{t\in[t_{k},t_{k+1}]}L_{T-t}$. For $t\in[t_{k},t_{k+1}]$,

$\mathbb{E}\left[\psi_{t}(z_{t})\left\|\nabla\ln p_{t_{k}}(z_{t})-\nabla\ln p_{t}(z_{t})\right\|^{2}\right]\leq 25L_{T-t}^{2}\left(8G_{t_{k},t}d+G_{t_{k},t}^{2}\,\mathbb{E}\left[\psi_{t}(z_{t})\left\|z_{t}\right\|^{2}\right]\right)+100L_{T-t}^{2}G_{t_{k},t}^{2}\,\mathbb{E}\left[\psi_{t}(z_{t})\left\|\nabla\ln p_{t}(z_{t})\right\|^{2}\right].$
Proof.

We have the following relationship for $t\in[t_{k},t_{k+1}]$:

$p_{t_{k}}=(p_{t})_{\alpha}*\varphi_{\sigma^{2}},$

where $p_{\alpha}(x)=\alpha^{d}p(\alpha x)$, $\alpha=e^{\frac{1}{2}\int_{t_{k}}^{t}g(T-s)^{2}\,ds}$, and $\sigma^{2}=1-e^{-\int_{t_{k}}^{t}g(T-s)^{2}\,ds}$. Observe that since $h_{k}\leq\frac{1}{4g(T-t_{k})^{2}}$,

$\alpha\leq 1+\int_{t_{k}}^{t}g(T-s)^{2}\,ds\leq 1+h_{k}g(T-t_{k})^{2}\leq 1+\frac{1}{4},$
$\sigma^{2}=1-e^{-\int_{t_{k}}^{t}g(T-s)^{2}\,ds}\leq\int_{t_{k}}^{t}g(T-s)^{2}\,ds\leq h_{k}g(T-t_{k})^{2}\leq\frac{1}{4}.$

We note that

$\sigma^{2}\leq h_{k}g(T-t_{k})^{2}\leq\frac{1}{4L}\leq\frac{1}{2\alpha^{2}L_{T-t}},$

so the hypothesis of Lemma 4.7 is satisfied. Using Lemma 4.7, we obtain

$\mathbb{E}\left[\psi_{t}(z_{t})\left\|\nabla\ln p_{t_{k}}(z_{t})-\nabla\ln p_{t}(z_{t})\right\|^{2}\right]\leq 72\alpha^{4}L_{T-t}^{2}\sigma^{2}d+4(\alpha+2\alpha^{3}L_{T-t}\sigma^{2})^{2}(\alpha-1)^{2}L_{T-t}^{2}\,\mathbb{E}\left[\psi_{t}(z_{t})\left\|z_{t}\right\|^{2}\right]+4(\alpha-1+2\alpha^{3}L_{T-t}\sigma^{2})^{2}\,\mathbb{E}\left[\psi_{t}(z_{t})\left\|\nabla\ln p_{t}(z_{t})\right\|^{2}\right]
\leq 72(5/4)^{4}L_{T-t}^{2}G_{t_{k},t}d+4(2\alpha)^{2}G_{t_{k},t}^{2}L_{T-t}^{2}\,\mathbb{E}\left[\psi_{t}(z_{t})\left\|z_{t}\right\|^{2}\right]+4(G_{t_{k},t}+4L_{T-t}G_{t_{k},t})^{2}\,\mathbb{E}\left[\psi_{t}(z_{t})\left\|\nabla\ln p_{t}(z_{t})\right\|^{2}\right]
\leq 200L_{T-t}^{2}dG_{t_{k},t}+25L_{T-t}^{2}G_{t_{k},t}^{2}\,\mathbb{E}\left[\psi_{t}(z_{t})\left\|z_{t}\right\|^{2}\right]+100L_{T-t}^{2}G_{t_{k},t}^{2}\,\mathbb{E}\left[\psi_{t}(z_{t})\left\|\nabla\ln p_{t}(z_{t})\right\|^{2}\right].\qquad\qed$

Now we put everything together. Write $G_{t}=G_{t_{k},t}$ for short. Suppose $L_{t}$ is non-increasing. By Lemma 4.2,

$\frac{d}{dt}\chi^{2}(q_{t}||p_{t})\leq-\frac{1}{2}g(T-t)^{2}\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right)+12g(T-t)^{2}(\chi^{2}(q_{t}||p_{t})+1)\cdot E,$
where $E\leq 16G_{t}^{2}L_{T-t_{k}}^{2}(K_{z}+16K_{V})+64G_{t}L_{T-t_{k}}^{2}(8K+2d+16\ln 2)+\varepsilon_{\infty,T-t_{k}}^{2}+K_{\Delta V}.$

By Lemma 4.8, $K_{\Delta V}\leq 25L_{T-t}^{2}(8G_{t}d+G_{t}^{2}K_{z})+100L_{T-t}^{2}G_{t}^{2}K_{V}$, so

$E\leq 41G_{t}^{2}L_{T-t}^{2}K_{z}+356G_{t}^{2}L_{T-t}^{2}K_{V}+64G_{t}L_{T-t}^{2}(8K+6d+16\ln 2)+\varepsilon_{\infty,T-t_{k}}^{2}.$

By Lemma 4.5, $K_{z}\leq\left\|x_{t}\right\|_{2,\psi_{2}}^{2}(K+\ln 2)$, and by Lemma 4.6, $K_{V}\leq\frac{4}{\chi^{2}(q_{t}||p_{t})+1}\cdot\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right)+2dL$, so

$E\leq 41G_{t}^{2}L_{T-t}^{2}\left\|x_{t}\right\|_{2,\psi_{2}}^{2}(K+\ln 2)+356G_{t}^{2}L_{T-t}^{2}\left(\frac{4}{\chi^{2}(q_{t}||p_{t})+1}\cdot\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right)+2dL\right)+64G_{t}L_{T-t}^{2}(8K+6d+16\ln 2)+\varepsilon_{\infty,T-t_{k}}^{2}.$

Now, if $h_{k}\leq\frac{\varepsilon'_{h_{k}}}{20g(T-t_{k})^{2}L_{T-t_{k+1}}}$, then

$E\leq{\varepsilon'_{h_{k}}}^{2}\left[\left\|x_{t}\right\|_{2,\psi_{2}}^{2}(K+\ln 2)+\frac{4}{\chi^{2}(q_{t}||p_{t})+1}\cdot\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right)+2dL_{T-t}\right]+4\varepsilon'_{h_{k}}L_{T-t}(8K+2d+16\ln 2)+\varepsilon_{\infty,T-t_{k}}^{2}.$

Let $M_{T-t}:=\left\|x_{t}\right\|_{2,\psi_{2}}^{2}$. Assume that $K\leq\frac{A_{T-t}}{\chi^{2}(q_{t}||p_{t})+1}+B_{T-t}$. Then we obtain

$12g(T-t)^{2}(\chi^{2}(q_{t}||p_{t})+1)\cdot E\leq 12g(T-t)^{2}\Big[\mathscr{E}_{p_{t}}\left(\frac{q_{t}}{p_{t}}\right)\left({\varepsilon'_{h_{k}}}^{2}(A_{T-t}M_{T-t}+4)+32\varepsilon'_{h_{k}}L_{T-t}A_{T-t}\right)+(\chi^{2}(q_{t}||p_{t})+1)\left({\varepsilon'_{h_{k}}}^{2}\big((B_{T-t}+\ln 2)M_{T-t}+2dL\big)+\varepsilon'_{h_{k}}L_{T-t}(8B_{T-t}+6d+16\ln 2)\right)+\varepsilon_{\infty,T-t_{k}}^{2}\Big].$

If $\varepsilon'_{h_{k}}\leq\min\left\{\frac{1}{\sqrt{48(A_{T-t}M_{T-t}+4)}},\frac{1}{128L_{T-t}A_{T-t}}\right\}$, then

$\frac{d}{dt}\chi^{2}(q_{t}||p_{t})\leq 12g(T-t)^{2}\Big[(\chi^{2}(q_{t}||p_{t})+1)\left({\varepsilon'_{h_{k}}}^{2}\big((B_{T-t}+\ln 2)M_{T-t}+2dL_{T-t}\big)+\varepsilon'_{h_{k}}L_{T-t}(8B_{T-t}+6d+16\ln 2)\right)+\varepsilon_{\infty,T-t_{k}}^{2}\Big].$

If $\varepsilon'_{h_{k}}\leq\min\left\{\frac{\sqrt{\varepsilon'}}{g(T-t)\sqrt{24(T-t_{k})((B_{T-t}+\ln 2)M_{T-t}+2dL_{T-t})}},\frac{\varepsilon'}{24g(T-t)^{2}(T-t)L_{T-t}(8B_{T-t}+6d+16\ln 2)}\right\}$, we get

$\frac{d}{dt}\chi^{2}(q_{t}||p_{t})\leq\frac{\varepsilon'}{T-t}(\chi^{2}(q_{t}||p_{t})+1)+\varepsilon_{\infty,T-t_{k}}^{2}\,g(T-t)^{2}.$

Integration gives

$\chi^{2}(q_{t_{k}}||p_{t_{k}})\leq e^{\varepsilon'\int_{0}^{t_{k}}\frac{dt}{T-t}}(\chi^{2}(q_{0}||p_{0})+1)+\int_{0}^{t_{k}}e^{\int_{t}^{t_{k}}\frac{\varepsilon'}{T-s}\,ds}\,\varepsilon_{\infty,T-t}^{2}\,g(T-t)^{2}\,dt\leq\left(\frac{T}{T-t_{k}}\right)^{\varepsilon'}\chi^{2}(q_{0}||p_{0})+\left(\left(\frac{T}{T-t_{k}}\right)^{\varepsilon'}-1\right)+\int_{0}^{t_{k}}\left(\frac{T-t}{T-t_{k}}\right)^{\varepsilon'}\varepsilon_{\infty,T-t}^{2}\,g(T-t)^{2}\,dt.$

Taking $\varepsilon'=\frac{\varepsilon}{\ln\left(\frac{T}{T-t_{N}}\right)}$ then gives the following Theorem 4.10. We first introduce a technical definition.

Definition 4.9.

Let $f:\mathbb{R}_{>0}\to\mathbb{R}_{>0}$. We say that $f$ has at most power growth and decay (with constant $c>0$) if $\max_{u\in[\frac{t}{2},t]}f(u)\in\left[\frac{f(t)}{c},cf(t)\right]$ for all $t$. For example, $f(t)=t^{a}$ satisfies this with $c=2^{|a|}$.

Theorem 4.10.

Suppose that the following hold.

  1. Assumption 5 holds for $\varepsilon_{\infty,t}$.

  2. $\left\|\widetilde{x}_{t}\right\|_{2,\psi_{2}}^{2}\leq M_{t}$.

  3. The KL bound $\operatorname{KL}(\psi_{t}q_{t}||p_{t})\leq\frac{A_{T-t}}{\chi^{2}(q_{t}||p_{t})+1}+B_{T-t}$ holds for any density $q_{t}$ and $t<t_{N}$, where $\psi_{t}(x)=\frac{q_{t}(x)/p_{t}(x)}{\chi^{2}(q_{t}||p_{t})+1}$.

  4. $g(t),A_{t},B_{t},L_{t},M_{t}$ have at most power growth and decay (with some constant $c$), in the sense of Definition 4.9.

Then there is some constant $c'$ (depending on $c$) such that if the step sizes satisfy

$h_{k}\leq\min\left\{\frac{T-t_{k}}{2},\frac{c'\varepsilon'_{h_{k}}}{g(T-t_{k})^{2}L_{T-t_{k}}}\right\},$

where

$\varepsilon'_{h_{k}}=\min\Bigg\{\frac{1}{\sqrt{A_{T-t_{k}}M_{T-t_{k}}+1}},\ \frac{1}{L_{T-t_{k}}A_{T-t_{k}}},\ \frac{\sqrt{\varepsilon/\ln\left(\frac{T}{T-t_{N}}\right)}}{g(T-t_{k})\sqrt{(T-t_{k})((B_{T-t_{k}}+1)M_{T-t_{k}}+dL_{T-t_{k}})}},\ \frac{\varepsilon/\ln\left(\frac{T}{T-t_{N}}\right)}{g(T-t_{k})^{2}(T-t_{k})L_{T-t_{k}}(B_{T-t_{k}}+d)}\Bigg\},$

then for $0\leq k\leq N$,

$\chi^{2}(q_{t_{k}}||p_{t_{k}})\leq e^{\varepsilon}\chi^{2}(q_{0}||p_{0})+(e^{\varepsilon}-1)+e^{\varepsilon}\int_{0}^{t_{k}}\varepsilon_{\infty,T-t}^{2}\,g(T-t)^{2}\,dt.$
Proof.

This follows from the above calculations and the observation that if we replace $F(T-t)$ by $F(T-t_{k})$ for some $F$ satisfying the power growth and decay assumption, then we change the bound by at most a constant factor, because the step size satisfies $h_{k}=t_{k+1}-t_{k}\leq\frac{T-t_{k}}{2}$. ∎

We specialize this theorem to the case of distributions with bounded support. Note that although not every initial distribution $\widetilde{p}_{t}$ may satisfy a KL inequality as required by condition 3 of Theorem 4.10, Lemma 5.2 will give the existence of a distribution that does and that is close in TV error. Later, in Section 6, we show that this has a small effect on the score function, which allows us to prove our main theorems.

Corollary 4.11.

Suppose that Assumptions 5 and 2 hold, $R^{2}\geq d$, $g\equiv 1$, and that $\widetilde{P}_{0}$ is such that the KL inequality (30) holds. Let $\delta=T-t_{N}$. If $0<\delta,\varepsilon<\frac{1}{2}$ and $h_{k}=O\left(\frac{\varepsilon}{\max\{T-t_{k},(T-t_{k})^{-3}\}R^{4}d\ln\left(\frac{T}{\delta}\right)\ln\left(\frac{R}{\delta\varepsilon_{K}}\right)}\right)$, then for any $0\leq k\leq N$,

$\chi^{2}(q_{t_{k}}||p_{t_{k}})\leq e^{\varepsilon}\chi^{2}(q_{0}||p_{0})+\varepsilon+e^{\varepsilon}\int_{0}^{t_{k}}\varepsilon_{\infty,T-t}^{2}\,dt.$
Proof.

For $g\equiv 1$, note that $\sigma_{T-t}^{2}=\Theta(\min\{T-t,1\})$. From Lemma 4.13, we can choose

$L_{t}=\frac{R^{2}}{\sigma_{t}^{4}}$, so that $L_{T-t}=O\left(\frac{R^{2}}{\min\{(T-t)^{2},1\}}\right)$.

From Lemma 4.15, we can choose

$M_{t}=\max\{R^{2},d\}.$

The KL inequality (30) gives us

$A_{t}=6(e+1)\sigma_{t}^{2}$, so that $A_{T-t}=O(\min\{T-t,1\})$, and
$B_{t}=\ln\left(\frac{1}{\varepsilon}\right)+d\ln\left(1+O\left(\frac{R}{\sqrt{T-t_{N}}}\right)\right).$

We now check the requirements on hkh_{k}. We need

εhk\displaystyle\varepsilon^{\prime}_{h_{k}} =O(1ATtkMTtk+1)\displaystyle=O\left({\frac{1}{\sqrt{A_{T-t_{k}}M_{T-t_{k}}+1}}}\right) εhk\displaystyle\Longleftarrow\varepsilon^{\prime}_{h_{k}} =O(1max{R,d})\displaystyle=O\left({\frac{1}{\max\{R,\sqrt{d}\}}}\right) (22)
εhk\displaystyle\varepsilon^{\prime}_{h_{k}} =O(1LTtkATtk)\displaystyle=O\left({\frac{1}{L_{T-t_{k}}A_{T-t_{k}}}}\right) εhk\displaystyle\Longleftarrow\varepsilon^{\prime}_{h_{k}} =O(TtkR2)\displaystyle=O\left({\frac{T-t_{k}}{R^{2}}}\right) (23)
εhk\displaystyle\varepsilon^{\prime}_{h_{k}} =O(ε/ln(Tδ)(Ttk)((BTtk+1)MTtk+dLTtk)).\displaystyle=O\left({\frac{\sqrt{\varepsilon/\ln\left({\frac{T}{\delta}}\right)}}{\sqrt{(T-t_{k})((B_{T-t_{k}}+1)M_{T-t_{k}}+dL_{T-t_{k}})}}}\right). (24)

For Ttk1T-t_{k}\leq 1, (24) is implied by

εhk\displaystyle\varepsilon^{\prime}_{h_{k}} =O(ε/ln(Tδ)(Ttk)(ln(1εK)+dln(Rδ))max{R2,d}+dR2Ttk)\displaystyle=O\left({\frac{\sqrt{\varepsilon/\ln\left({\frac{T}{\delta}}\right)}}{\sqrt{(T-t_{k})\left({\ln\left({\frac{1}{\varepsilon_{K}}}\right)+d\ln\left({\frac{R}{\delta}}\right)}\right)\max\{R^{2},d\}+\frac{dR^{2}}{T-t_{k}}}}}\right)
εhk\displaystyle\Longleftarrow\varepsilon^{\prime}_{h_{k}} =O(ε(Ttk)dmax{R2,d}ln(Tδ)ln(RδεK)),\displaystyle=O\left({\sqrt{\frac{\varepsilon(T-t_{k})}{d\max\{R^{2},d\}\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{R}{\delta\varepsilon_{K}}}\right)}}}\right),

and for Ttk>1T-t_{k}>1,

εhk\displaystyle\varepsilon^{\prime}_{h_{k}} =O(ε/ln(Tδ)T(ln(1εK)+dln(Rδ))max{R2,d}+dR2)\displaystyle=O\left({\frac{\sqrt{\varepsilon/\ln\left({\frac{T}{\delta}}\right)}}{\sqrt{T\left({\ln\left({\frac{1}{\varepsilon_{K}}}\right)+d\ln\left({\frac{R}{\delta}}\right)}\right)\max\{R^{2},d\}+dR^{2}}}}\right)
εhk\displaystyle\Longleftarrow\varepsilon^{\prime}_{h_{k}} =O(εTdmax{R2,d}ln(Tδ)ln(RδεK)).\displaystyle=O\left({\sqrt{\frac{\varepsilon}{Td\max\{R^{2},d\}\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{R}{\delta\varepsilon_{K}}}\right)}}}\right).

Finally, the last requirement is

εhk\displaystyle\varepsilon^{\prime}_{h_{k}} =O(ε/ln(Tδ)(Ttk)LTtk(BTtk+d))\displaystyle=O\left({\frac{\varepsilon/\ln\left({\frac{T}{\delta}}\right)}{(T-t_{k})L_{T-t_{k}}(B_{T-t_{k}}+d)}}\right)
εhk\displaystyle\Longleftarrow\varepsilon^{\prime}_{h_{k}} =O(εR2max{Ttk,(Ttk)1}dln(Tδ)ln(RδεK)).\displaystyle=O\left({\frac{\varepsilon}{R^{2}\max\{T-t_{k},(T-t_{k})^{-1}\}d\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{R}{\delta\varepsilon_{K}}}\right)}}\right).

As long as R2=Ω(d)R^{2}=\Omega(d) and ε<1\varepsilon<1, the last equation implies all the others. Plugging this into Theorem 4.10 gives the result. ∎
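To make the step-size schedule concrete, the following short sketch (ours, not from the paper; the absolute constant c and all parameter values are placeholders) generates discretization times satisfying the condition of Corollary 4.11 together with the cap hk ≤ (T−tk)/2 used in Theorem 4.10, and counts the resulting number of steps.

```python
# Illustrative sketch (ours): a step-size schedule with
#   h_k = c * eps / (max{T - t_k, (T - t_k)^(-3)} * B),
#   B   = R^4 * d * ln(T/delta) * ln(R/(delta*eps_K)),
# capped at (T - t_k)/2; c = 1 is a placeholder constant.
import math

def step_schedule(T, delta, eps, R, d, eps_K, c=1.0):
    """Times 0 = t_0 < t_1 < ... < t_N = T - delta."""
    B = R**4 * d * math.log(T / delta) * math.log(R / (delta * eps_K))
    ts = [0.0]
    while True:
        u = T - ts[-1]                     # remaining forward time T - t_k
        if u <= delta * (1 + 1e-9):
            break
        h = c * eps / (max(u, u**-3) * B)  # Corollary 4.11 step size
        h = min(h, u / 2, u - delta)       # cap at (T - t_k)/2; land on T - delta
        ts.append(ts[-1] + h)
    return ts

ts = step_schedule(T=4.0, delta=0.1, eps=0.5, R=2.0, d=2, eps_K=1e-2)
print(f"N = {len(ts) - 1} steps")          # grows like B * (T + 1/delta^2) / eps
```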

Above, we use the Hessian bound 2lnpt(x)R2σt4\left\|{\nabla^{2}\ln p_{t}(x)}\right\|\leq\frac{R^{2}}{\sigma_{t}^{4}} given in Lemma 4.13. Under the stronger smoothness assumption given by Assumption 3, we can take the step sizes to be larger.

Corollary 4.12.

Suppose that Assumptions 5, 2, and 3 hold, CR2dC\geq R^{2}\geq d, g1g\equiv 1, and that P~0\widetilde{P}_{0} is such that the KL inequality (30) holds. Let δ=TtN\delta=T-t_{N}. If 0<δ,ε<120<\delta,\varepsilon<\frac{1}{2} and ε<1/T\varepsilon<1/\sqrt{T}, hk=O(εmax{Ttk,(Ttk)1}C2dln(Tδ)ln(RδεK))h_{k}=O\left({\frac{\varepsilon}{\max\{T-t_{k},(T-t_{k})^{-1}\}C^{2}d\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{R}{\delta\varepsilon_{K}}}\right)}}\right), then for any 0kN0\leq k\leq N,

χ2(qtk||ptk)\displaystyle\chi^{2}(q_{t_{k}}||p_{t_{k}}) eεχ2(q0||p0)+ε+eε0tkε,Tt2dt.\displaystyle\leq e^{\varepsilon}\chi^{2}(q_{0}||p_{0})+\varepsilon+e^{\varepsilon}\int_{0}^{t_{k}}\varepsilon_{\infty,T-t}^{2}\,dt.
Proof.

We instead have the bound Lt=Cσt2L_{t}=\frac{C}{\sigma_{t}^{2}}. The requirement (22) stays the same, while (23) is implied by εhk=O(1/C)\varepsilon^{\prime}_{h_{k}}=O(1/C). Inequality (24), for Ttk1T-t_{k}\leq 1, is implied by

εhk\displaystyle\varepsilon^{\prime}_{h_{k}} =O(εdmax{C,R2}ln(Tδ)ln(RδεK)),\displaystyle=O\left({\sqrt{\frac{\varepsilon}{d\max\{C,R^{2}\}\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{R}{\delta\varepsilon_{K}}}\right)}}}\right),

and for Ttk>1T-t_{k}>1,

εhk\displaystyle\varepsilon^{\prime}_{h_{k}} =O(εTdmax{C,R2}ln(Tδ)ln(RδεK)).\displaystyle=O\left({\sqrt{\frac{\varepsilon}{Td\max\{C,R^{2}\}\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{R}{\delta\varepsilon_{K}}}\right)}}}\right).

Finally, the last requirement is implied by

εhk\displaystyle\varepsilon^{\prime}_{h_{k}} =O(εCdln(Tδ)ln(RδεK)),\displaystyle=O\left({\frac{\varepsilon}{Cd\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{R}{\delta\varepsilon_{K}}}\right)}}\right),

and for CR2C\geq R^{2} and ε1/T\varepsilon\leq 1/\sqrt{T}, this implies all the others. ∎

4.1 Auxiliary bounds

In this section we give bounds on the Hessian (LtL_{t}, Lemma 4.13), initial χ2\chi^{2} divergence χ2(q0||p0)\chi^{2}(q_{0}||p_{0}) (Lemma 4.14), and Orlicz norm (MtM_{t}, Lemma 4.15).

Lemma 4.13 (Hessian bound).

Suppose that μ\mu is a probability measure supported on a bounded set d\mathcal{M}\subset\mathbb{R}^{d} with radius RR. Then letting φσ2\varphi_{\sigma^{2}} denote the density of N(0,σ2Id)N(0,\sigma^{2}I_{d}),

2ln(μφσ2(x))max{R2σ4,1σ2}.\displaystyle\left\|{\nabla^{2}\ln(\mu*\varphi_{\sigma^{2}}(x))}\right\|\leq\max\left\{{\frac{R^{2}}{\sigma^{4}},\frac{1}{\sigma^{2}}}\right\}. (25)

Therefore, for P~0\widetilde{P}_{0} supported on BR(0)B_{R}(0), R1R\geq 1, we have

2lnp~t(x)R2σt4.\displaystyle\left\|{\nabla^{2}\ln\widetilde{p}_{t}(x)}\right\|\leq\frac{R^{2}}{\sigma_{t}^{4}}. (26)
Proof.

Let μx,σ2\mu_{x,\sigma^{2}} denote the density μ(du)\mu(du) weighted with the gaussian φσ2(ux)\varphi_{\sigma^{2}}(u-x), that is, μx,σ2(du)=exu22σ2μ(du)dexu22σ2μ(du)\mu_{x,\sigma^{2}}(du)=\frac{e^{-\frac{\left\|{x-u}\right\|^{2}}{2\sigma^{2}}}\mu(du)}{\int_{\mathbb{R}^{d}}e^{-\frac{\left\|{x-u}\right\|^{2}}{2\sigma^{2}}}\mu(du)}. We note the following calculations:

ln(μφσ2(x))\displaystyle\nabla\ln(\mu*\varphi_{\sigma^{2}}(x)) =dexu22σ2μ(du)dexu22σ2μ(du)=dxuσ2exu22σ2μ(du)dexu22σ2μ(du)=1σ2𝔼μx,σ2(xu)\displaystyle=\frac{\nabla\int_{\mathbb{R}^{d}}e^{-\frac{\left\|{x-u}\right\|^{2}}{2\sigma^{2}}}\mu(du)}{\int_{\mathbb{R}^{d}}e^{-\frac{\left\|{x-u}\right\|^{2}}{2\sigma^{2}}}\mu(du)}=\frac{\int_{\mathbb{R}^{d}}-\frac{x-u}{\sigma^{2}}e^{-\frac{\left\|{x-u}\right\|^{2}}{2\sigma^{2}}}\mu(du)}{\int_{\mathbb{R}^{d}}e^{-\frac{\left\|{x-u}\right\|^{2}}{2\sigma^{2}}}\mu(du)}=-\frac{1}{\sigma^{2}}\mathbb{E}_{\mu_{x,\sigma^{2}}}(x-u) (27)
2ln(μφσ2(x))\displaystyle\nabla^{2}\ln(\mu*\varphi_{\sigma^{2}}(x)) =1σ4Covμx,σ2(xu)1σ2Id=1σ4Covμx,σ2(u)1σ2Id.\displaystyle=\frac{1}{\sigma^{4}}\operatorname{Cov}_{\mu_{x,\sigma^{2}}}(x-u)-\frac{1}{\sigma^{2}}I_{d}=\frac{1}{\sigma^{4}}\operatorname{Cov}_{\mu_{x,\sigma^{2}}}(u)-\frac{1}{\sigma^{2}}I_{d}. (28)

The covariance of a distribution supported on a set of radius RR is bounded by R2R^{2} in operator norm. Inequality (25) then follows from (28).

For (26), note that p~t=MmtP~0φσt2\widetilde{p}_{t}=M_{m_{t}\sharp}\widetilde{P}_{0}*\varphi_{\sigma_{t}^{2}}, where mtm_{t} is given by (2) and MmM_{m} denotes multiplication by mm. Since MmtP~0M_{m_{t}\sharp}\widetilde{P}_{0} is supported on BmtR(0)BR(0)B_{m_{t}R}(0)\subset B_{R}(0) and σt1\sigma_{t}\leq 1, the result follows. ∎
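As a numerical sanity check (ours, not part of the paper), one can evaluate the covariance formula (28) for a discrete measure supported in BR(0) and confirm the operator-norm bound (25) at random test points.

```python
# Check (ours) of the Hessian bound (25): for mu supported in B_R(0),
# grad^2 ln(mu * phi_{sigma^2})(x) = Cov_{mu_{x,sigma^2}}(u)/sigma^4 - I/sigma^2,
# whose operator norm is at most max{R^2/sigma^4, 1/sigma^2}.
import numpy as np

rng = np.random.default_rng(0)
d, R, sigma = 3, 2.0, 0.7
U = rng.normal(size=(50, d))
U = R * U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), R)  # atoms in B_R(0)

def hessian_log_density(x):
    logw = -np.sum((x - U)**2, axis=1) / (2 * sigma**2)
    w = np.exp(logw - logw.max()); w /= w.sum()        # tilted measure mu_{x,sigma^2}
    mean = w @ U
    cov = (U - mean).T @ ((U - mean) * w[:, None])     # Cov_{mu_{x,sigma^2}}(u)
    return cov / sigma**4 - np.eye(d) / sigma**2       # formula (28)

for _ in range(100):
    x = rng.normal(scale=3.0, size=d)
    opnorm = np.linalg.norm(hessian_log_density(x), 2)
    assert opnorm <= max(R**2 / sigma**4, 1 / sigma**2) + 1e-8
print("Hessian bound (25) verified at 100 random points")
```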

Lemma 4.14 (Bound on initial χ2\chi^{2}-divergence).

Suppose that P~0\widetilde{P}_{0} is supported on BR(0)B_{R}(0). Let pprior=N(0,(1eG0,T)Id)p_{\textup{prior}}=N(0,(1-e^{-G_{0,T}})I_{d}). Then

χ2(pprior||p~T)\displaystyle\chi^{2}(p_{\textup{prior}}||\widetilde{p}_{T}) exp[R2exp(G0,T)1exp(G0,T)]1\displaystyle\leq\exp\left[{\frac{R^{2}\exp(-G_{0,T})}{1-\exp(-G_{0,T})}}\right]-1

and for 0<ε<120<\varepsilon<\frac{1}{2} and G0,Tln(4R2ε2)1G_{0,T}\geq\ln\left({\frac{4R^{2}}{\varepsilon^{2}}}\right)\vee 1, we have χ2(pprior||p~T)ε2\chi^{2}(p_{\textup{prior}}||\widetilde{p}_{T})\leq\varepsilon^{2}.

Proof.

We have for x0P~0x_{0}\sim\widetilde{P}_{0} that

χ2(N(0,(1eG0,T)Id)||N(x0exp(12G0,T),(1exp(G0,T))Id))\displaystyle\chi^{2}\left({N(0,(1-e^{-G_{0,T}})I_{d})||N(x_{0}\exp\left({-\frac{1}{2}G_{0,T}}\right),(1-\exp(-G_{0,T}))I_{d})}\right)
=exp[x02exp(G0,T)1exp(G0,T)]1exp(R2exp(G0,T)1exp(G0,T))1\displaystyle=\exp\left[{\frac{\left\|{x_{0}}\right\|^{2}\exp(-G_{0,T})}{1-\exp(-G_{0,T})}}\right]-1\leq\exp\left({\frac{R^{2}\exp(-G_{0,T})}{1-\exp(-G_{0,T})}}\right)-1

Using convexity of χ2\chi^{2}-divergence then gives the result. For G0,Tln(4R2ε2)1G_{0,T}\geq\ln\left({\frac{4R^{2}}{\varepsilon^{2}}}\right)\vee 1, we have

exp[R2exp(G0,T)1exp(G0,T)]1\displaystyle\exp\left[{\frac{R^{2}\exp(-G_{0,T})}{1-\exp(-G_{0,T})}}\right]-1 exp[ε2/41/2]1ε2.\displaystyle\leq\exp\left[{\frac{\varepsilon^{2}/4}{1/2}}\right]-1\leq\varepsilon^{2}.\qed
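For intuition, here is a quick numeric check (ours, with arbitrary sample values of R and ε) that the stated choice of G0,T indeed forces χ²(pprior||p̃T) ≤ ε².

```python
# Numeric check (ours) of Lemma 4.14: with G >= max(ln(4R^2/eps^2), 1),
# exp(R^2 e^{-G} / (1 - e^{-G})) - 1 <= eps^2.
import math

R, eps = 5.0, 0.2
G = max(math.log(4 * R**2 / eps**2), 1.0)
chi2 = math.exp(R**2 * math.exp(-G) / (1 - math.exp(-G))) - 1
print(chi2 <= eps**2, chi2, eps**2)   # True, ~0.010, 0.04
```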
Lemma 4.15 (Subgaussian bound).

Suppose P~0\widetilde{P}_{0} is supported on BR(0)B_{R}(0). Then for Xp~tX\sim\widetilde{p}_{t},

X2,ψ2eln2(4mtR+6C1σtd)=O(max{R,d}),\left\|{X}\right\|_{2,\psi_{2}}\leq\sqrt{\frac{e}{\ln 2}}\cdot\left({4m_{t}R+6C_{1}\sigma_{t}\sqrt{d}}\right)=O(\max\{R,\sqrt{d}\}),

where mt,σtm_{t},\sigma_{t} are as in (2) and C1C_{1} is an absolute constant.

Proof.

Let YP~0Y\sim\widetilde{P}_{0} s.t. X=mtY+σtξX=m_{t}Y+\sigma_{t}\xi for some ξN(0,Id)\xi\sim N(0,I_{d}) independent of YY. Define U=X2:=(i=1dXi2)1/2U=\left\|{X}\right\|_{2}:=\left({\sum_{i=1}^{d}X_{i}^{2}}\right)^{1/2}, then for p1p\geq 1,

𝔼|U|p=𝔼X2p\displaystyle\mathbb{E}|U|^{p}=\mathbb{E}\left\|{X}\right\|_{2}^{p} 𝔼(mtY2+σtξ2)p\displaystyle\leq\mathbb{E}\left({\left\|{m_{t}Y}\right\|_{2}+\left\|{\sigma_{t}\xi}\right\|_{2}}\right)^{p}
2p1𝔼[mtY2p+σtξ2p]\displaystyle\leq 2^{p-1}\mathbb{E}\left[{\left\|{m_{t}Y}\right\|_{2}^{p}+\left\|{\sigma_{t}\xi}\right\|_{2}^{p}}\right]
2p1[(mtR)p+σtp2p/2Γ((d+p)/2)Γ(d/2)]\displaystyle\leq 2^{p-1}\left[{(m_{t}R)^{p}+\sigma_{t}^{p}\cdot 2^{p/2}\frac{\Gamma((d+p)/2)}{\Gamma(d/2)}}\right]
2p1[(mtR)p+C1(2σt)p(dp/2+pp/2)]\displaystyle\leq 2^{p-1}\left[{(m_{t}R)^{p}+C_{1}(\sqrt{2}\sigma_{t})^{p}\cdot\left({d^{p/2}+p^{p/2}}\right)}\right]

where Γ\Gamma is the gamma function and C1C_{1} is an absolute constant. Therefore

(𝔼|U|p)1/p2mtR+2C1σt(d+p)Kp,\displaystyle\left({\mathbb{E}|U|^{p}}\right)^{1/p}\leq 2m_{t}R+\sqrt{2}C_{1}\sigma_{t}(\sqrt{d}+\sqrt{p})\leq K\sqrt{p},

where K=2mtR+3C1σtdK=2m_{t}R+3C_{1}\sigma_{t}\sqrt{d}. Now consider V=U/KV=U/K, then for some λ>0\lambda>0 small enough, by Taylor expansion,

𝔼[eλ2V2]=𝔼[1+p=1(λ2V2)pp!]=1+p=1λ2p𝔼[V2p]p!.\displaystyle\mathbb{E}\left[{e^{\lambda^{2}V^{2}}}\right]=\mathbb{E}\left[{1+\sum_{p=1}^{\infty}\frac{\left({\lambda^{2}V^{2}}\right)^{p}}{p!}}\right]=1+\sum_{p=1}^{\infty}\frac{\lambda^{2p}\mathbb{E}\left[{V^{2p}}\right]}{p!}.

Note that 𝔼[V2p](2p)p\mathbb{E}\left[{V^{2p}}\right]\leq(2p)^{p}, while Stirling’s approximation yields p!(p/e)pp!\geq(p/e)^{p}. Substituting these two bounds, we get

𝔼eλ2V21+p=1(2λ2pp/e)p=p=0(2eλ2)p=112eλ2,\displaystyle\mathbb{E}e^{\lambda^{2}V^{2}}\leq 1+\sum_{p=1}^{\infty}\left({\frac{2\lambda^{2}p}{p/e}}\right)^{p}=\sum_{p=0}^{\infty}(2e\lambda^{2})^{p}=\frac{1}{1-2e\lambda^{2}},

provided that 2eλ2<12e\lambda^{2}<1, in which case the geometric series above converges. To bound this quantity further, we can use the numeric inequality 1/(1x)e2x1/(1-x)\leq e^{2x} which is valid for x[0,1/2]x\in[0,1/2]. It follows that

𝔼eλ2V2e4eλ2for all λ satisfying |λ|1/(2e).\displaystyle\mathbb{E}e^{\lambda^{2}V^{2}}\leq e^{4e\lambda^{2}}\ \ \text{for all $\lambda$ satisfying $|\lambda|\leq 1/(2\sqrt{e})$}.

Now set 4eλ2=ln24e\lambda^{2}=\ln 2, then

𝔼[eln24eK2X22]2,\displaystyle\mathbb{E}\left[{e^{\frac{\ln 2}{4eK^{2}}\left\|{X}\right\|_{2}^{2}}}\right]\leq 2,

which implies that

X2,ψ24eln2K\displaystyle\left\|{X}\right\|_{2,\psi_{2}}\leq\sqrt{\frac{4e}{\ln 2}}K =eln2(4mtR+6C1σtd).\displaystyle=\sqrt{\frac{e}{\ln 2}}\cdot\left({4m_{t}R+6C_{1}\sigma_{t}\sqrt{d}}\right).\qed
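The following Monte Carlo sketch (ours; C1 = 1 is a placeholder for the absolute constant, and the remaining parameters are arbitrary) illustrates the final inequality E[exp(ln2·‖X‖²/(4eK²))] ≤ 2 for X = mtY + σtξ with Y supported in BR(0).

```python
# Monte Carlo illustration (ours) of Lemma 4.15 with placeholder C1 = 1.
import numpy as np

rng = np.random.default_rng(1)
d, R, m_t, sigma_t, C1 = 8, 3.0, 0.8, 0.6, 1.0
K = 2 * m_t * R + 3 * C1 * sigma_t * np.sqrt(d)

Y = rng.normal(size=(200_000, d))
Y = R * Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), R)  # Y in B_R(0)
X = m_t * Y + sigma_t * rng.normal(size=Y.shape)

mgf = np.mean(np.exp(np.log(2) / (4 * np.e * K**2) * np.sum(X**2, axis=1)))
print(f"E exp(ln2 ||X||^2 / (4 e K^2)) = {mgf:.3f}  (should be <= 2)")
```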

5 Bounding the KL divergence

In this section, we bound the quantity K=KL(ψtqt||pt)K=\operatorname{KL}(\psi_{t}q_{t}||p_{t}), where ψt\psi_{t} is as in (8). While ptp_{t} is defined by the DDPM process, in this section we do not assume qtq_{t} is the density of the discretized process; rather, it is any density for which pt(qtpt)\mathscr{E}_{p_{t}}\left({\frac{q_{t}}{p_{t}}}\right) and χ2(qt||pt)\chi^{2}(q_{t}||p_{t}) are finite.

Lemma 5.1.

Suppose that P~0\widetilde{P}_{0} is a probability measure on d\mathbb{R}^{d} such that

P~0\displaystyle\widetilde{P}_{0} =j=1mwjP~j,0,\displaystyle=\sum_{j=1}^{m}w_{j}\widetilde{P}_{j,0}, (29)

where wj>0w_{j}>0, j=1mwj=1\sum_{j=1}^{m}w_{j}=1, and each P~j,0\widetilde{P}_{j,0} is a probability measure. For t>0t>0, let p~t\widetilde{p}_{t} and p~j,t\widetilde{p}_{j,t} be the densities obtained by running the forward DDPM process (1) for time tt, and pt=p~Ttp_{t}=\widetilde{p}_{T-t}, pj,t=p~j,Ttp_{j,t}=\widetilde{p}_{j,T-t}. Let wmin=min1jmwjw_{\min}=\min_{1\leq j\leq m}w_{j} and suppose all the P~j,t\widetilde{P}_{j,t} satisfy a log-Sobolev inequality with constant CtC_{t}. Then for any qtq_{t}, with ψt\psi_{t} as in (8),

KL(ψtqt||pt)\displaystyle\operatorname{KL}(\psi_{t}q_{t}||p_{t}) 2CTtχ2(qt||pt)+1pt(qtpt)+ln(1wmin).\displaystyle\leq\frac{2C_{T-t}}{\chi^{2}(q_{t}||p_{t})+1}\cdot\mathscr{E}_{p_{t}}\left({\frac{q_{t}}{p_{t}}}\right)+\ln\left({\frac{1}{w_{\min}}}\right).

While we need ptp_{t} to satisfy a log-Sobolev inequality to get a bound of the form Cχ2(qt||pt)+1pt(qtpt)\frac{C}{\chi^{2}(q_{t}||p_{t})+1}\mathscr{E}_{p_{t}}\left({\frac{q_{t}}{p_{t}}}\right) ([LLT22, Lemma C.8]), we note that if we allow additive slack, it suffices for ptp_{t} to be a mixture of distributions satisfying a log-Sobolev inequality, with the logarithm of the minimum mixture weight bounded below. In Lemma 5.2 we will see that we can almost decompose any distribution of bounded support in this manner, if we move a small amount of the mass.

Proof.

Let f¯t:[m]\overline{f}_{t}:[m]\to\mathbb{R} be the function

f¯t(j)\displaystyle\overline{f}_{t}(j) =dψt(x)qt(x)pt(x)Pj,t(x)𝑑x.\displaystyle=\int_{\mathbb{R}^{d}}\frac{\psi_{t}(x)q_{t}(x)}{p_{t}(x)}P_{j,t}(x)\,dx.

By decomposition of entropy and the fact that each Pi,tP_{i,t} satisfies LSI with constant CTtC_{T-t},

KL(ψtqt||pt)\displaystyle\operatorname{KL}(\psi_{t}q_{t}||p_{t}) Entpt(ψtqtpt)\displaystyle\leq\operatorname{Ent}_{p_{t}}\left({\frac{\psi_{t}q_{t}}{p_{t}}}\right)
=i=1mwiEntPi,t(ψtqtpt)+Entw(f¯t)\displaystyle=\sum_{i=1}^{m}w_{i}\operatorname{Ent}_{P_{i,t}}\left({\frac{\psi_{t}q_{t}}{p_{t}}}\right)+\operatorname{Ent}_{w}(\overline{f}_{t})
CTt2i=1mwiPi,t(lnψtqtpt,ψtqtpt)+Entw(f¯t)\displaystyle\leq\frac{C_{T-t}}{2}\sum_{i=1}^{m}w_{i}\mathscr{E}_{P_{i,t}}\left({\ln\frac{\psi_{t}q_{t}}{p_{t}},\frac{\psi_{t}q_{t}}{p_{t}}}\right)+\operatorname{Ent}_{w}(\overline{f}_{t})
CTt2pt(lnψtqtpt,ψtqtpt)+Entw(f¯t)\displaystyle\leq\frac{C_{T-t}}{2}\mathscr{E}_{p_{t}}\left({\ln\frac{\psi_{t}q_{t}}{p_{t}},\frac{\psi_{t}q_{t}}{p_{t}}}\right)+\operatorname{Ent}_{w}(\overline{f}_{t})
=CTt2dlnψt(x)qt(x)pt(x)2ψt(x)qt(x)dx+Entw(f¯t)\displaystyle=\frac{C_{T-t}}{2}\int_{\mathbb{R}^{d}}\left\|{\nabla\ln\frac{\psi_{t}(x)q_{t}(x)}{p_{t}(x)}}\right\|^{2}\psi_{t}(x)q_{t}(x)\,dx+\operatorname{Ent}_{w}(\overline{f}_{t})
=2CTtdlnqt(x)pt(x)2ψt(x)qt(x)dx+Entw(f¯t)\displaystyle=2C_{T-t}\int_{\mathbb{R}^{d}}\left\|{\nabla\ln\frac{q_{t}(x)}{p_{t}(x)}}\right\|^{2}\psi_{t}(x)q_{t}(x)\,dx+\operatorname{Ent}_{w}(\overline{f}_{t})
=2CTtdqt(x)pt(x)2ψt(x)pt(x)2qt(x)dx+Entw(f¯t)\displaystyle=2C_{T-t}\int_{\mathbb{R}^{d}}\left\|{\nabla\frac{q_{t}(x)}{p_{t}(x)}}\right\|^{2}\frac{\psi_{t}(x)p_{t}(x)^{2}}{q_{t}(x)}\,dx+\operatorname{Ent}_{w}(\overline{f}_{t})
=2CTtχ2(qt||pt)+1qt(x)pt(x)2pt(x)dx+Entw(f¯t)\displaystyle=\frac{2C_{T-t}}{\chi^{2}(q_{t}||p_{t})+1}\cdot\int\left\|{\nabla\frac{q_{t}(x)}{p_{t}(x)}}\right\|^{2}p_{t}(x)\,dx+\operatorname{Ent}_{w}(\overline{f}_{t})
2CTtχ2(qt||pt)+1pt(qtpt)+ln(1wmin),\displaystyle\leq\frac{2C_{T-t}}{\chi^{2}(q_{t}||p_{t})+1}\cdot\mathscr{E}_{p_{t}}\left({\frac{q_{t}}{p_{t}}}\right)+\ln\left({\frac{1}{w_{\min}}}\right),

where the last inequality follows from noting wjf¯t(j)w_{j}\overline{f}_{t}(j) is a probability mass function on [m][m], so that f¯t(j)1wj\overline{f}_{t}(j)\leq\frac{1}{w_{j}} and

Entw(f¯t)=j=1mwjf¯t(j)ln(f¯t(j))j=1mwjf¯t(j)ln(1wmin)=ln(1wmin).\operatorname{Ent}_{w}(\overline{f}_{t})=\sum_{j=1}^{m}w_{j}\overline{f}_{t}(j)\ln(\overline{f}_{t}(j))\leq\sum_{j=1}^{m}w_{j}\overline{f}_{t}(j)\ln\left({\frac{1}{w_{\min}}}\right)=\ln\left({\frac{1}{w_{\min}}}\right).\qed
Lemma 5.2.

Suppose 0<εK<120<\varepsilon_{K}<\frac{1}{2}, and that P¯0\overline{P}_{0} is a probability measure such that P¯0()1εK8\overline{P}_{0}(\mathcal{M})\geq 1-\frac{\varepsilon_{K}}{8}. Let 𝒩(,r)\mathcal{N}(\mathcal{M},r) denote the covering number of \mathcal{M} by balls of radius rr. Given δ>0\delta>0, there exists a distribution P~0\widetilde{P}_{0} such that χ2(P~0||P¯0)εK\chi^{2}(\widetilde{P}_{0}||\overline{P}_{0})\leq\varepsilon_{K} and considering the DDPM process started with P~0\widetilde{P}_{0}, for all 0tTδ0\leq t\leq T-\delta,

KL(ψtqt||pt)\displaystyle\operatorname{KL}(\psi_{t}q_{t}||p_{t}) (6(1+e)σTt2χ2(qt||pt)+1pt(qtpt)+ln(𝒩(,σδ/2)εK)).\displaystyle\leq\left({\frac{6(1+e)\sigma_{T-t}^{2}}{\chi^{2}(q_{t}||p_{t})+1}\mathscr{E}_{p_{t}}\left({\frac{q_{t}}{p_{t}}}\right)+\ln\left({\frac{\mathcal{N}(\mathcal{M},\sigma_{\delta}/2)}{\varepsilon_{K}}}\right)}\right).

In particular, for =BR(0)\mathcal{M}=B_{R}(0) in d\mathbb{R}^{d},

KL(ψtqt||pt)\displaystyle\operatorname{KL}(\psi_{t}q_{t}||p_{t}) (6(1+e)σTt2χ2(qt||pt)+1pt(qtpt)+ln(1εK)+dln(1+4Rσδ)).\displaystyle\leq\left({\frac{6(1+e)\sigma_{T-t}^{2}}{\chi^{2}(q_{t}||p_{t})+1}\mathscr{E}_{p_{t}}\left({\frac{q_{t}}{p_{t}}}\right)+\ln\left({\frac{1}{\varepsilon_{K}}}\right)+d\ln\left({1+\frac{4R}{\sigma_{\delta}}}\right)}\right). (30)
Proof.

Partition \mathcal{M} into disjoint subsets j\mathcal{M}_{j}, 1jN:=𝒩(,σδ/2)1\leq j\leq N:=\mathcal{N}(\mathcal{M},\sigma_{\delta}/2) of diameter at most σδ\sigma_{\delta}, and decompose

P¯0\displaystyle\overline{P}_{0} =wP+j=1Nw¯jP~j,0\displaystyle=w_{*}P_{*}+\sum_{j=1}^{N}\overline{w}_{j}\widetilde{P}_{j,0}

where P~j,0\widetilde{P}_{j,0} is supported on j\mathcal{M}_{j} and P=P¯0(|c)P_{*}=\overline{P}_{0}(\cdot|\mathcal{M}^{c}). We will zero out the coefficients of all small components: let Z=j:w¯jεK8Nw¯jZ=\sum_{j:\overline{w}_{j}\geq\frac{\varepsilon_{K}}{8N}}\overline{w}_{j} and

wj\displaystyle w_{j} ={w¯jZ,j[N],w¯jεK8N0,otherwise,\displaystyle=\begin{cases}\frac{\overline{w}_{j}}{Z},&j\in[N],\,\overline{w}_{j}\geq\frac{\varepsilon_{K}}{8N}\\ 0,&\text{otherwise,}\end{cases}

and define

P~0\displaystyle\widetilde{P}_{0} =j=1NwjP~j,0.\displaystyle=\sum_{j=1}^{N}w_{j}\widetilde{P}_{j,0}.

Note that Z1εK8j:w¯j<εK8Nw¯j1εK4Z\geq 1-\frac{\varepsilon_{K}}{8}-\sum_{j:\overline{w}_{j}<\frac{\varepsilon_{K}}{8N}}\overline{w}_{j}\geq 1-\frac{\varepsilon_{K}}{4}. As probability distributions on [N]{}[N]\cup\{*\},

χ2(w||w¯)(11εK4)21εK,\chi^{2}(w||\overline{w})\leq\left({\frac{1}{1-\frac{\varepsilon_{K}}{4}}}\right)^{2}-1\leq\varepsilon_{K},

and hence the same bound holds for χ2(P~0||P¯0)\chi^{2}(\widetilde{P}_{0}||\overline{P}_{0}). Note each MmtP~j,0M_{m_{t}\sharp}\widetilde{P}_{j,0} is supported on a set of diameter mtσδσδσtm_{t}\sigma_{\delta}\leq\sigma_{\delta}\leq\sigma_{t} for tδt\geq\delta. By Theorem 1 of [CCN21], noting that

χ2(N(μ2,Σ)||N(μ1,Σ))\displaystyle\chi^{2}(N(\mu_{2},\Sigma)||N(\mu_{1},\Sigma)) =exp[(μ2μ1)Σ1(μ2μ1)]1e\displaystyle=\exp\left[{(\mu_{2}-\mu_{1})^{\top}\Sigma^{-1}(\mu_{2}-\mu_{1})}\right]-1\leq e

when Σ=σt2I\Sigma=\sigma_{t}^{2}I and μ2μ1σt\left\|{\mu_{2}-\mu_{1}}\right\|\leq\sigma_{t}, P~j,t=(MmtP~j,0)φσt2\widetilde{P}_{j,t}=(M_{m_{t}\sharp}\widetilde{P}_{j,0})*\varphi_{\sigma_{t}^{2}} satisfies a log-Sobolev inequality with constant 6(1+e)σt26(1+e)\sigma_{t}^{2}. The result then follows from Lemma 5.1. For =BR(0)\mathcal{M}=B_{R}(0), we use the bound 𝒩(BR(0),σδ/2)(1+4Rσδ)d\mathcal{N}(B_{R}(0),\sigma_{\delta}/2)\leq\left({1+\frac{4R}{\sigma_{\delta}}}\right)^{d} [Ver18, Corollary 4.2.13]. ∎
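The construction in this proof is easy to carry out on an empirical measure; the sketch below (ours, with arbitrary parameters) partitions samples into grid cells of diameter σδ, zeroes out components of weight below εK/(8N), renormalizes, and checks the χ² bound on the weights.

```python
# Sketch (ours) of the decomposition in Lemma 5.2 on an empirical measure;
# grid cells of diameter sigma_delta play the role of the M_j.
import numpy as np

rng = np.random.default_rng(2)
eps_K, sigma_delta = 0.1, 0.25

X = rng.normal(scale=0.7, size=(200_000, 2))        # empirical P_0
cells = np.floor(X / (sigma_delta / np.sqrt(2)))    # cells of diameter sigma_delta
_, counts = np.unique(cells, axis=0, return_counts=True)
w_bar = counts / counts.sum()                       # mixture weights w_bar_j
N = len(w_bar)

keep = w_bar >= eps_K / (8 * N)                     # zero out small components
Z = w_bar[keep].sum()
w = np.where(keep, w_bar / Z, 0.0)                  # renormalized weights w_j

chi2 = np.sum((w - w_bar)**2 / w_bar)               # chi^2(w || w_bar)
print(f"N = {N} cells, kept {keep.sum()}, chi^2 = {chi2:.2e} <= {eps_K}")
```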

In the next section, we show that we can move a small amount of mass ε\varepsilon without significantly affecting the score function. This is necessary, as our guarantees on the score estimate are for the original distribution and not the perturbed one in Lemma 5.2.

6 The effect of perturbing the data distribution on the score function

In this section we consider the effect of perturbing the data distribution on the score function. The key observation is that the score function can be interpreted as the solution to an inference problem: recovering the original data point from a noisy sample, with the data distribution as the prior. We show through a coupling argument that we can bound the difference between the score functions in terms of the distance between the two data distributions. This will allow us to “massage” the data distribution in order to optimally bound KL(ψtqt||pt)\operatorname{KL}(\psi_{t}q_{t}||p_{t}) in Section 5.

6.1 Perturbation under χ2\chi^{2} error and truncation

We first give a general lemma on denoising error from a mismatched prior.

Lemma 6.1 (Denoising error from mismatched prior).

Let φ\varphi be a probability density on d\mathbb{R}^{d}, and P0,x,P1,xP_{0,x},P_{1,x} be measures on d\mathbb{R}^{d}. For i=0,1i=0,1, let PiP_{i} denote the joint distribution of xiPi,xx_{i}\sim P_{i,x} and yi=xi+ξiy_{i}=x_{i}+\xi_{i} where ξiφ\xi_{i}\sim\varphi, and let Pi,yP_{i,y} denote the marginal distribution of yiy_{i}. Let

m(k)(ε):\displaystyle m^{(k)}(\varepsilon): =sup0f1,dfφ𝑑xεdf(x)xkφ(x)𝑑x.\displaystyle=\sup_{0\leq f\leq 1,\int_{\mathbb{R}^{d}}f\varphi\,dx\leq\varepsilon}\int_{\mathbb{R}^{d}}f(x)\left\|{x}\right\|^{k}\varphi(x)\,dx.

Let εTV=TV(P0,x,P1,x)\varepsilon_{\operatorname{TV}}=\operatorname{TV}(P_{0,x},P_{1,x}) and εχ2=χ2(P0,x||P1,x)\varepsilon_{\chi}^{2}=\chi^{2}(P_{0,x}||P_{1,x}). Then

dP0,y(dy0)dx0P0(dx0|y0)dx1P1(dx1|y0)2\displaystyle\int_{\mathbb{R}^{d}}P_{0,y}(dy_{0})\left\|{\int_{\mathbb{R}^{d}}x_{0}P_{0}(dx_{0}|y_{0})-\int_{\mathbb{R}^{d}}x_{1}P_{1}(dx_{1}|y_{0})}\right\|^{2}
8m(2)(εTV)+εχm(4)(εTV)\displaystyle\leq 8m^{(2)}(\varepsilon_{\operatorname{TV}})+\varepsilon_{\chi}\sqrt{m^{(4)}(\varepsilon_{\operatorname{TV}})}

For φ=φσ2\varphi=\varphi_{\sigma^{2}}, the upper bound is O(σ2εχ(d+ln(1εTV)))O\left({\sigma^{2}\varepsilon_{\chi}\left({d+\ln\left({\frac{1}{\varepsilon_{\operatorname{TV}}}}\right)}\right)}\right).

Note the tricky part of the proof is to deal with P1(dx1|y0)P_{1}(dx_{1}|y_{0}), which can be thought of as inferring xx assuming the incorrect prior P1,xP_{1,x}, rather than the actual prior P0,xP_{0,x}.

Proof.

For notational clarity, we will denote draws from the conditional distribution as x^0\widehat{x}_{0} and x^1\widehat{x}_{1}, for example P0(dx^0|y0)P_{0}(d\widehat{x}_{0}|y_{0}). Let ri(y)=d(x^iy)Pi(dx^i|y)r_{i}(y)=\int_{\mathbb{R}^{d}}(\widehat{x}_{i}-y)P_{i}(d\widehat{x}_{i}|y). Let P0,1P_{0,1} be a coupling of (x0,y0=x0+ξ0,x1,y1=x1+ξ1)(x_{0},y_{0}=x_{0}+\xi_{0},x_{1},y_{1}=x_{1}+\xi_{1}) such that x0=x1x_{0}=x_{1} with probability 1εTV1-\varepsilon_{\operatorname{TV}} and ξ0=ξ1\xi_{0}=\xi_{1} with probability 11. We have

dP0,y(dy0)r0(y0)r1(y0)2\displaystyle\int_{\mathbb{R}^{d}}P_{0,y}(dy_{0})\left\|{r_{0}(y_{0})-r_{1}(y_{0})}\right\|^{2} ={y0=y1}P0,1,y(dy0,dy1)r0(y0)r1(y0)2(I)\displaystyle=\underbrace{\int_{\{y_{0}=y_{1}\}}P_{0,1,y}(dy_{0},dy_{1})\left\|{r_{0}(y_{0})-r_{1}(y_{0})}\right\|^{2}}_{(I)}
+{y0y1}P0,1,y(dy0,dy1)r0(y0)r1(y0)2(II).\displaystyle\quad+\underbrace{\int_{\{y_{0}\neq y_{1}\}}P_{0,1,y}(dy_{0},dy_{1})\left\|{r_{0}(y_{0})-r_{1}(y_{0})}\right\|^{2}}_{(II)}.

Define a measure QQ (not necessarily a probability measure) on d\mathbb{R}^{d} by

Q(A):=P0,1(y0A and y0=y1).\displaystyle Q(A):=P_{0,1}(y_{0}\in A\text{ and }y_{0}=y_{1}).

Note that

Q(A)\displaystyle Q(A) min{P0,y(A),P1,y(A)},\displaystyle\leq\min\{P_{0,y}(A),P_{1,y}(A)\},

so QQ is absolutely continuous with respect to P0,yP_{0,y} and P1,yP_{1,y}, and by assumption on the coupling,

Q(d)\displaystyle Q(\mathbb{R}^{d}) 1εTV.\displaystyle\geq 1-\varepsilon_{\operatorname{TV}}. (31)

Under P0,1P_{0,1}, when y0=y1y_{0}=y_{1}, we can couple P0(dx^0|y0)P_{0}(d\widehat{x}_{0}|y_{0}) and P1(dx^1|y0)P_{1}(d\widehat{x}_{1}|y_{0}) so that x^0=x^1\widehat{x}_{0}=\widehat{x}_{1} with probability min{dQdP0,y,dQdP1,y}\min\left\{{\frac{dQ}{dP_{0,y}},\frac{dQ}{dP_{1,y}}}\right\}. Let P^(dx^0,dx^1|y0)\widehat{P}(d\widehat{x}_{0},d\widehat{x}_{1}|y_{0}) denote this coupled distribution. Then as in Lemma 6.5,

(I)\displaystyle(I) {y0=y1}P0,1,y(dy0,dy1){x^0x^1}((x^0y0)(x^1y0))P^(dx^0,dx^1|y0)2\displaystyle\leq\int_{\{y_{0}=y_{1}\}}P_{0,1,y}(dy_{0},dy_{1})\left\|{\int_{\{\widehat{x}_{0}\neq\widehat{x}_{1}\}}((\widehat{x}_{0}-y_{0})-(\widehat{x}_{1}-y_{0}))\widehat{P}(d\widehat{x}_{0},d\widehat{x}_{1}|y_{0})}\right\|^{2}
2dP0,1,y(dy0,dy1)({x^0x^1}ξ02P^(dx^0,dx^1|y0)+{x^0x^1}ξ12P^(dx^0,dx^1|y1))\displaystyle\leq 2\int_{\mathbb{R}^{d}}P_{0,1,y}(dy_{0},dy_{1})\left({\int_{\{\widehat{x}_{0}\neq\widehat{x}_{1}\}}\left\|{\xi_{0}}\right\|^{2}\widehat{P}(d\widehat{x}_{0},d\widehat{x}_{1}|y_{0})+\int_{\{\widehat{x}_{0}\neq\widehat{x}_{1}\}}\left\|{\xi_{1}}\right\|^{2}\widehat{P}(d\widehat{x}_{0},d\widehat{x}_{1}|y_{1})}\right)

We bound this by first bounding

dP0,1,y(dy0,dy1)P^(x^0x^1)\displaystyle\int_{\mathbb{R}^{d}}P_{0,1,y}(dy_{0},dy_{1})\widehat{P}(\widehat{x}_{0}\neq\widehat{x}_{1}) dP0,y(dy)max{1dQdP0,y,1dQdP1,y}2εTV,\displaystyle\leq\int_{\mathbb{R}^{d}}P_{0,y}(dy)\max\left\{{1-\frac{dQ}{dP_{0,y}},1-\frac{dQ}{dP_{1,y}}}\right\}\leq 2\varepsilon_{\operatorname{TV}}, (32)

which follows from the two inequalities (using (31))

dP0,y(dy)(1dQdP0,y)\displaystyle\int_{\mathbb{R}^{d}}P_{0,y}(dy)\left({1-\frac{dQ}{dP_{0,y}}}\right) =1Q(d)εTV\displaystyle=1-Q(\mathbb{R}^{d})\leq\varepsilon_{\operatorname{TV}}
dP0,y(dy)(1dQdP1,y)\displaystyle\int_{\mathbb{R}^{d}}P_{0,y}(dy)\left({1-\frac{dQ}{dP_{1,y}}}\right) dP1,y(dy)(1dQdP1,y)+TV(P0,y,P1,y)\displaystyle\leq\int_{\mathbb{R}^{d}}P_{1,y}(dy)\left({1-\frac{dQ}{dP_{1,y}}}\right)+\operatorname{TV}(P_{0,y},P_{1,y})
(1Q(d))+εTV2εTV.\displaystyle\leq(1-Q(\mathbb{R}^{d}))+\varepsilon_{\operatorname{TV}}\leq 2\varepsilon_{\operatorname{TV}}.

From (32), and the fact that the distribution of (xi,yi)(x_{i},y_{i}) is the same as (x^i,yi)(\widehat{x}_{i},y_{i}) by Nishimori’s identity, we obtain

(I)\displaystyle(I) 2(m(2)(2εTV)+m(2)(2εTV))=4m(2)(2εTV)8m(2)(εTV).\displaystyle\leq 2(m^{(2)}(2\varepsilon_{\operatorname{TV}})+m^{(2)}(2\varepsilon_{\operatorname{TV}}))=4m^{(2)}(2\varepsilon_{\operatorname{TV}})\leq 8m^{(2)}(\varepsilon_{\operatorname{TV}}).

Now for the second term (II),

(II)\displaystyle(II) 2{y0y1}P0,1,y(dy0,dy1)(r0(y0)2+r1(y0)2).\displaystyle\leq 2\int_{\{y_{0}\neq y_{1}\}}P_{0,1,y}(dy_{0},dy_{1})(\left\|{r_{0}(y_{0})}\right\|^{2}+\left\|{r_{1}(y_{0})}\right\|^{2}).

The first term satisfies {y0y1}P0,1,y(dy0,dy1)r0(y0)2m(2)(εTV)\int_{\{y_{0}\neq y_{1}\}}P_{0,1,y}(dy_{0},dy_{1})\left\|{r_{0}(y_{0})}\right\|^{2}\leq m^{(2)}(\varepsilon_{\operatorname{TV}}). For the second term, we note that Cauchy-Schwarz gives for any measures PP and QQ that

Ωf(x)P(dx)\displaystyle\int_{\Omega}f(x)P(dx) Ωf(x)Q(dx)+Ω(dPdQ1)f(x)Q(dx)\displaystyle\leq\int_{\Omega}f(x)Q(dx)+\int_{\Omega}\left({\frac{dP}{dQ}-1}\right)f(x)Q(dx)
Ωf(x)Q(dx)+χ2(P||Q)Ωf(x)2Q(dx)\displaystyle\leq\int_{\Omega}f(x)Q(dx)+\sqrt{\chi^{2}(P||Q)\int_{\Omega}f(x)^{2}Q(dx)}

to switch from the measure P0,yP_{0,y} to P1,yP_{1,y}:

{y0y1}P0,1,y(dy0)r1(y0)2=dP0,y(dy0)P0,1,y(y0y1|y0)r1(y0)2\displaystyle\int_{\{y_{0}\neq y_{1}\}}P_{0,1,y}(dy_{0})\left\|{r_{1}(y_{0})}\right\|^{2}=\int_{\mathbb{R}^{d}}P_{0,y}(dy_{0})P_{0,1,y}(y_{0}\neq y_{1}|y_{0})\left\|{r_{1}(y_{0})}\right\|^{2}
dP1,y(dy0)P0,1,y(y0y1|y0)r1(y0)2+χ2(P0,y||P1,y)P1,y(dy0)P0,1,y(y0y1|y0)r1(y0)4\displaystyle\leq\int_{\mathbb{R}^{d}}P_{1,y}(dy_{0})P_{0,1,y}(y_{0}\neq y_{1}|y_{0})\left\|{r_{1}(y_{0})}\right\|^{2}+\sqrt{\chi^{2}(P_{0,y}||P_{1,y})\int P_{1,y}(dy_{0})P_{0,1,y}(y_{0}\neq y_{1}|y_{0})\left\|{r_{1}(y_{0})}\right\|^{4}}

(Note that intentionally, the measure is P1,yP_{1,y}, though we use y0y_{0} for the variable.) Hence,

dP1,y(dy0)P0,1,y(y0y1|y0)\displaystyle\int_{\mathbb{R}^{d}}P_{1,y}(dy_{0})P_{0,1,y}(y_{0}\neq y_{1}|y_{0}) TV(P0,y,P1,y)+dP0,y(dy0)P0,1,y(y0y1|y0)2εTV\displaystyle\leq\operatorname{TV}(P_{0,y},P_{1,y})+\int_{\mathbb{R}^{d}}P_{0,y}(dy_{0})P_{0,1,y}(y_{0}\neq y_{1}|y_{0})\leq 2\varepsilon_{\operatorname{TV}}

so

{y0y1}P0,1,y(dy0)r1(y0)2\displaystyle\int_{\{y_{0}\neq y_{1}\}}P_{0,1,y}(dy_{0})\left\|{r_{1}(y_{0})}\right\|^{2} m(2)(2εTV)+χ2(P0,x||P1,x)m(4)(2εTV),\displaystyle\leq m^{(2)}(2\varepsilon_{\operatorname{TV}})+\sqrt{\chi^{2}(P_{0,x}||P_{1,x})m^{(4)}(2\varepsilon_{\operatorname{TV}})},

where we used the data processing inequality.

For φ=φσ2\varphi=\varphi_{\sigma^{2}}, we obtain by Lemma 6.6 that the bound is

O(σ2(εTV+εχεTV1/2)(d+ln(1εTV)))=O(σ2εχ(d+ln(1εTV))).O\left({\sigma^{2}(\varepsilon_{\operatorname{TV}}+\varepsilon_{\chi}\varepsilon_{\operatorname{TV}}^{1/2})\left({d+\ln\left({\frac{1}{\varepsilon_{\operatorname{TV}}}}\right)}\right)}\right)=O\left({\sigma^{2}\varepsilon_{\chi}\left({d+\ln\left({\frac{1}{\varepsilon_{\operatorname{TV}}}}\right)}\right)}\right).\qed

We use this lemma to obtain a bound on the L2L^{2} score error under perturbation of the distribution, by interpreting the score as the solution to a de-noising problem.

Lemma 6.2 (L2L^{2} score error under perturbation).

Let P~(0)=P~0(0)\widetilde{P}^{(0)}=\widetilde{P}^{(0)}_{0} and P~(1)=P~0(1)\widetilde{P}^{(1)}=\widetilde{P}^{(1)}_{0} be two probability distributions on d\mathbb{R}^{d} such that χ2(P~(1)||P~(0))εχ21\chi^{2}(\widetilde{P}^{(1)}||\widetilde{P}^{(0)})\leq\varepsilon_{\chi}^{2}\leq 1.

1.

    For any σ>0\sigma>0,

    ln(P~(0)φσ2)(x)ln(P~(1)φσ2)(x)2(P~(1)φσ2)(dx)=O(εχ(d+ln(1εχ))σ2).\int\left\|{\nabla\ln(\widetilde{P}^{(0)}*\varphi_{\sigma^{2}})(x)-\nabla\ln(\widetilde{P}^{(1)}*\varphi_{\sigma^{2}})(x)}\right\|^{2}(\widetilde{P}^{(1)}*\varphi_{\sigma^{2}})(dx)=O\left({\frac{\varepsilon_{\chi}\left({d+\ln\left({\frac{1}{\varepsilon_{\chi}}}\right)}\right)}{\sigma^{2}}}\right).
2.

    Let p~t(i)\widetilde{p}_{t}^{(i)} be the density resulting from running (1) starting from P~(i)\widetilde{P}^{(i)}, and let σt\sigma_{t} be as in (2). Then for any t>0t>0,

    lnp~t(0)(x)lnp~t(1)(x)2p~t(1)(x)𝑑x\displaystyle\int\left\|{\nabla\ln\widetilde{p}^{(0)}_{t}(x)-\nabla\ln\widetilde{p}^{(1)}_{t}(x)}\right\|^{2}\widetilde{p}^{(1)}_{t}(x)\,dx =O(εχ(d+ln(1εχ))σt2).\displaystyle=O\left({\frac{\varepsilon_{\chi}\left({d+\ln\left({\frac{1}{\varepsilon_{\chi}}}\right)}\right)}{\sigma_{t}^{2}}}\right).
Proof.

For part 1, note by (27) that

ln(P~(i)φσ2)(y)\displaystyle\nabla\ln(\widetilde{P}^{(i)}*\varphi_{\sigma^{2}})(y) =1σ2𝔼P~y,σ2(i)(xy),\displaystyle=\frac{1}{\sigma^{2}}\mathbb{E}_{\widetilde{P}^{(i)}_{y,\sigma^{2}}}(x-y),

where P~y,σ2(i)\widetilde{P}^{(i)}_{y,\sigma^{2}} is the “tilted" probability distribution defined by

dP~y,σ2(i)dP~(i)(x)\displaystyle\frac{d\widetilde{P}^{(i)}_{y,\sigma^{2}}}{d\widetilde{P}^{(i)}}(x) exy22σ2.\displaystyle\propto e^{-\frac{\left\|{x-y}\right\|^{2}}{2\sigma^{2}}}.

By Bayes’s rule, this can be viewed as the conditional probability that x0=xx_{0}=x given xt=yx_{t}=y, where x0P~(i)x_{0}\sim\widetilde{P}^{(i)} and y=x0+σξy=x_{0}+\sigma\xi, ξN(0,Id)\xi\sim N(0,I_{d}). Hence this fits in the framework of Lemma 6.1 and

ln(P~(0)φσ2)(y)ln(P~(1)φσ2)(y)2(P~(1)φσ2)(dy)\displaystyle\int\left\|{\nabla\ln(\widetilde{P}^{(0)}*\varphi_{\sigma^{2}})(y)-\nabla\ln(\widetilde{P}^{(1)}*\varphi_{\sigma^{2}})(y)}\right\|^{2}(\widetilde{P}^{(1)}*\varphi_{\sigma^{2}})(dy)
=1σ4d𝔼P~y,σ2(0)[x]𝔼P~y,σ2(1)[x]2(P~(1)φσ2)(dy)\displaystyle=\frac{1}{\sigma^{4}}\int_{\mathbb{R}^{d}}\left\|{\mathbb{E}_{\widetilde{P}^{(0)}_{y,\sigma^{2}}}[x]-\mathbb{E}_{\widetilde{P}^{(1)}_{y,\sigma^{2}}}[x]}\right\|^{2}(\widetilde{P}^{(1)}*\varphi_{\sigma^{2}})(dy)
=O(1σ4σ2εχ(d+ln(1εTV))),\displaystyle=O\left({\frac{1}{\sigma^{4}}\sigma^{2}\varepsilon_{\chi}\left({d+\ln\left({\frac{1}{\varepsilon_{\operatorname{TV}}}}\right)}\right)}\right),

giving the result.

For part 2, note that p~t(i)=(MmtP~(i))φσt2\widetilde{p}^{(i)}_{t}=(M_{m_{t}\sharp}\widetilde{P}^{(i)})*\varphi_{\sigma_{t}^{2}}. Applying part 1 with P~(i)MmtP~(i)\widetilde{P}^{(i)}\mapsfrom M_{m_{t}\sharp}\widetilde{P}^{(i)} (which preserves χ2\chi^{2}-divergence) and σ=σt\sigma=\sigma_{t} gives the result. ∎
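To see the denoising interpretation in action, here is a small sketch (ours, with arbitrary illustrative parameters) that evaluates the score of P*φσ² through the posterior-mean identity (27) for two empirical priors differing on 2% of their mass, and estimates the resulting L² score error.

```python
# Illustration (ours) of Lemma 6.2: sigma^2 * grad ln(P * phi_{sigma^2})(y)
# equals E[x | y] - y, the posterior mean of x given y = x + sigma*xi.
import numpy as np

rng = np.random.default_rng(3)
d, sigma, n = 2, 0.5, 2000

X0 = rng.normal(size=(n, d))                                   # prior P^(0)
X1 = np.vstack([X0[: n - n // 50],
                rng.normal(loc=3.0, size=(n // 50, d))])       # perturbed P^(1)

def score(P, y):
    logw = -np.sum((y - P)**2, axis=1) / (2 * sigma**2)
    w = np.exp(logw - logw.max()); w /= w.sum()                # tilted prior
    return (w @ P - y) / sigma**2                              # identity (27)

Y = (X1 + sigma * rng.normal(size=X1.shape))[:500]             # y ~ P^(1) * phi
err = np.mean([np.sum((score(X0, y) - score(X1, y))**2) for y in Y])
print(f"L2 score error under a 2% mass perturbation: {err:.3f}")
```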

Finally, we argue that a score estimate that is accurate with respect to p~t(1)\widetilde{p}^{(1)}_{t} will still be accurate with respect to p~t(0)\widetilde{p}^{(0)}_{t}, with high probability. When using this lemma, we will substitute in the bound from Lemma 6.2.

Lemma 6.3.

Let P~0(0)\widetilde{P}^{(0)}_{0} and P~0(1)\widetilde{P}^{(1)}_{0} be two probability distributions on d\mathbb{R}^{d} with TV distance ε\varepsilon. Suppose the estimated score function st(x)s_{t}(x) satisfies

lnp~t(0)stL2(p~t(0))2=𝔼p~t(0)[lnp~t(0)(x)st(x)2]εt2\left\|{\nabla\ln\widetilde{p}^{(0)}_{t}-s_{t}}\right\|_{L^{2}(\widetilde{p}^{(0)}_{t})}^{2}=\mathbb{E}_{\widetilde{p}^{(0)}_{t}}\left[{\left\|{\nabla\ln\widetilde{p}^{(0)}_{t}(x)-s_{t}(x)}\right\|^{2}}\right]\leq\varepsilon_{t}^{2}

for t(0,T]t\in(0,T], and lnp~t(0)\nabla\ln\widetilde{p}_{t}^{(0)} is LtL_{t}-Lipschitz. Then for t(0,T]t\in(0,T] and any ε>0\varepsilon_{\infty}>0,

Pp~t(1)(stlnp~t(1)ε)ε+4ε2[εt2+lnp~t(1)(x)lnp~t(0)(x)2p~t(1)(x)𝑑x].P_{\widetilde{p}_{t}^{(1)}}\left({\left\|{s_{t}-\nabla\ln\widetilde{p}_{t}^{(1)}}\right\|\geq\varepsilon_{\infty}}\right)\leq\varepsilon+\frac{4}{\varepsilon_{\infty}^{2}}\cdot\left[{\varepsilon_{t}^{2}+\int\left\|{\nabla\ln\widetilde{p}^{(1)}_{t}(x)-\nabla\ln\widetilde{p}^{(0)}_{t}(x)}\right\|^{2}\widetilde{p}^{(1)}_{t}(x)\,dx}\right].
Proof.

We have

Pp~t(1)(stlnp~t(1)ε)\displaystyle P_{\widetilde{p}_{t}^{(1)}}\left({\left\|{s_{t}-\nabla\ln\widetilde{p}_{t}^{(1)}}\right\|\geq\varepsilon_{\infty}}\right)
Pp~t(1)(stlnp~t(0)ε/2)+Pp~t(1)(lnp~t(0)lnp~t(1)ε/2)\displaystyle\leq P_{\widetilde{p}_{t}^{(1)}}\left({\left\|{s_{t}-\nabla\ln\widetilde{p}_{t}^{(0)}}\right\|\geq\varepsilon_{\infty}/2}\right)+P_{\widetilde{p}_{t}^{(1)}}\left({\left\|{\nabla\ln\widetilde{p}_{t}^{(0)}-\nabla\ln\widetilde{p}_{t}^{(1)}}\right\|\geq\varepsilon_{\infty}/2}\right)
TV(p~t(0),p~t(1))+Pp~t(0)(stlnp~t(0)ε/2)+Pp~t(1)(lnp~t(0)lnp~t(1)ε/2).\displaystyle\leq\operatorname{TV}(\widetilde{p}_{t}^{(0)},\widetilde{p}_{t}^{(1)})+P_{\widetilde{p}_{t}^{(0)}}\left({\left\|{s_{t}-\nabla\ln\widetilde{p}_{t}^{(0)}}\right\|\geq\varepsilon_{\infty}/2}\right)+P_{\widetilde{p}_{t}^{(1)}}\left({\left\|{\nabla\ln\widetilde{p}_{t}^{(0)}-\nabla\ln\widetilde{p}_{t}^{(1)}}\right\|\geq\varepsilon_{\infty}/2}\right).

The first term is bounded by TV(P~(0),P~(1))ε\operatorname{TV}(\widetilde{P}^{(0)},\widetilde{P}^{(1)})\leq\varepsilon. For the second term, by Chebyshev’s Inequality,

Pp~t(0)(stlnp~t(0)ε/2)4ε2𝔼p~t(0)[stlnp~t(0)2]4εt2ε2;\displaystyle P_{\widetilde{p}_{t}^{(0)}}\left({\left\|{s_{t}-\nabla\ln\widetilde{p}_{t}^{(0)}}\right\|\geq\varepsilon_{\infty}/2}\right)\leq\frac{4}{\varepsilon_{\infty}^{2}}\mathbb{E}_{\widetilde{p}^{(0)}_{t}}\left[{\left\|{s_{t}-\nabla\ln\widetilde{p}^{(0)}_{t}}\right\|^{2}}\right]\leq\frac{4\varepsilon_{t}^{2}}{\varepsilon_{\infty}^{2}};

For the last term, again by Chebyshev’s Inequality,

Pp~t(1)(lnp~t(0)lnp~t(1)ε/2)4ε2lnp~t(1)(x)lnp~t(0)(x)2p~t(1)(x)dx.\displaystyle P_{\widetilde{p}_{t}^{(1)}}\left({\left\|{\nabla\ln\widetilde{p}_{t}^{(0)}-\nabla\ln\widetilde{p}_{t}^{(1)}}\right\|\geq\varepsilon_{\infty}/2}\right)\leq\frac{4}{\varepsilon_{\infty}^{2}}\int\left\|{\nabla\ln\widetilde{p}^{(1)}_{t}(x)-\nabla\ln\widetilde{p}^{(0)}_{t}(x)}\right\|^{2}\widetilde{p}^{(1)}_{t}(x)dx.

We conclude the proof by combining these three inequalities. ∎

Finally, we will need the following to obtain a TV error bound to p~0\widetilde{p}_{0} in Theorem 2.3.

Lemma 6.4.

Suppose that p~0eV(x)\widetilde{p}_{0}\propto e^{-V(x)} is a probability density on d\mathbb{R}^{d} with bounded first moment 𝔼p~0X\mathbb{E}_{\widetilde{p}_{0}}\left\|{X}\right\|, and VV is LL-smooth. Then for t>0t>0 such that αtσt12L\alpha_{t}\sigma_{t}\leq\frac{1}{2L}, we have

TV(p~t,p~0)2(αt1)(L𝔼p~0X+d)+32dLαtσt.\displaystyle\operatorname{TV}(\widetilde{p}_{t},\widetilde{p}_{0})\leq 2\left({\alpha_{t}-1}\right)\cdot\left({L\mathbb{E}_{\widetilde{p}_{0}}\left\|{X}\right\|+d}\right)+\frac{3}{2}dL\alpha_{t}\sigma_{t}.

Here αt=1/mt\alpha_{t}=1/m_{t} and σt\sigma_{t} are defined in (2). In particular, TV(p~δ,p~0)εTVTV(\widetilde{p}_{\delta},\widetilde{p}_{0})\leq\varepsilon_{\operatorname{TV}} if δ=O(εTV2R2L2)\delta=O(\frac{\varepsilon_{TV}^{2}}{R^{2}L^{2}}) and R=max{d,𝔼p~0X}R=\max\left\{{\sqrt{d},\mathbb{E}_{\widetilde{p}_{0}}\left\|{X}\right\|}\right\}.

Proof.

Without loss of generality, we assume that p~0(x)=eV(x)\widetilde{p}_{0}(x)=e^{-V(x)}. Note that p~t(x)=αtdp~0(αty)φσt2(xy)𝑑y\widetilde{p}_{t}(x)=\int\alpha_{t}^{d}\widetilde{p}_{0}(\alpha_{t}y)\varphi_{\sigma_{t}^{2}}(x-y)\,dy. Let q~t(x):=αtdp~0(αtx)\widetilde{q}_{t}(x):=\alpha_{t}^{d}\widetilde{p}_{0}(\alpha_{t}x), which is also a probability density on d\mathbb{R}^{d}. Then by the triangle inequality,

TV(p~t,p~0)TV(p~t,q~t)+TV(q~t,p~0).\displaystyle\operatorname{TV}(\widetilde{p}_{t},\widetilde{p}_{0})\leq\operatorname{TV}(\widetilde{p}_{t},\widetilde{q}_{t})+\operatorname{TV}(\widetilde{q}_{t},\widetilde{p}_{0}).

For the second term,

|q~t(x)p~0(x)|\displaystyle\left|{\widetilde{q}_{t}(x)-\widetilde{p}_{0}(x)}\right| =|αtdp~0(αtx)p~0(x)|\displaystyle=\left|{\alpha_{t}^{d}\widetilde{p}_{0}(\alpha_{t}x)-\widetilde{p}_{0}(x)}\right|
=|eV(αtx)+dlnαteV(x)|\displaystyle=\left|{e^{-V(\alpha_{t}x)+d\ln\alpha_{t}}-e^{-V(x)}}\right|
max{eV(x),eV(αtx)+dlnαt}(1e|V(x)V(αtx)+dlnαt|)\displaystyle\leq\max\left\{{e^{-V(x)},e^{-V(\alpha_{t}x)+d\ln\alpha_{t}}}\right\}\cdot\left({1-e^{-\left|{V(x)-V(\alpha_{t}x)+d\ln\alpha_{t}}\right|}}\right)
(p~0(x)+q~t(x))(|V(x)V(αtx)|+dlnαt)\displaystyle\leq\left({\widetilde{p}_{0}(x)+\widetilde{q}_{t}(x)}\right)\cdot\left({\left|{V(x)-V(\alpha_{t}x)}\right|+d\ln\alpha_{t}}\right)
(p~0(x)+q~t(x))[Lx(αt1)+dlnαt],\displaystyle\leq\left({\widetilde{p}_{0}(x)+\widetilde{q}_{t}(x)}\right)\cdot\left[{L\left\|{x}\right\|\left({\alpha_{t}-1}\right)+d\ln\alpha_{t}}\right],

where in the second inequality, we use the fact that 1ex|x|1-e^{x}\leq\left|{x}\right| for all x0x\leq 0. Thus

TV(q~t(x),p~0(x))\displaystyle\operatorname{TV}(\widetilde{q}_{t}(x),\widetilde{p}_{0}(x)) =12|q~t(x)p~0(x)|𝑑x\displaystyle=\frac{1}{2}\int|\widetilde{q}_{t}(x)-\widetilde{p}_{0}(x)|\,dx
[L(αt1)x+dlnαt]p~0(x)𝑑x+[L(αt1)x+dlnαt]q~t(x)𝑑x\displaystyle\leq\int\left[{L\left({\alpha_{t}-1}\right)\left\|{x}\right\|+d\ln\alpha_{t}}\right]\widetilde{p}_{0}(x)\,dx+\int\left[{L\left({\alpha_{t}-1}\right)\left\|{x}\right\|+d\ln\alpha_{t}}\right]\widetilde{q}_{t}(x)\,dx
L(αt1)(xp~0(x)𝑑x+xq~t(x)𝑑x)+2dlnαt\displaystyle\leq L(\alpha_{t}-1)\left({\int\left\|{x}\right\|\widetilde{p}_{0}(x)dx+\int\left\|{x}\right\|\widetilde{q}_{t}(x)dx}\right)+2d\ln\alpha_{t}
2L(αt1)xp~0(x)𝑑x+2dlnαt.\displaystyle\leq 2L(\alpha_{t}-1)\int\left\|{x}\right\|\widetilde{p}_{0}(x)dx+2d\ln\alpha_{t}.

Now for the first term,

p~t(x)q~t(x)\displaystyle\widetilde{p}_{t}(x)-\widetilde{q}_{t}(x) =q~t(xy)φσt2(y)𝑑yq~t(x)=(q~t(xσty)q~t(x))φ(y)𝑑y,\displaystyle=\int\widetilde{q}_{t}(x-y)\varphi_{\sigma_{t}^{2}}(y)\,dy-\widetilde{q}_{t}(x)=\int\left({\widetilde{q}_{t}(x-\sigma_{t}y)-\widetilde{q}_{t}(x)}\right)\varphi(y)dy,

where φ(y)\varphi(y) is the density of the dd-dimensional standard Gaussian distribution. Apply Minkowski’s inequality for integrals:

|p~t(x)q~t(x)|𝑑x\displaystyle\int\left|{\widetilde{p}_{t}(x)-\widetilde{q}_{t}(x)}\right|dx =|(q~t(xσty)q~t(x))φ(y)𝑑y|𝑑x\displaystyle=\int\left|{\int\left({\widetilde{q}_{t}(x-\sigma_{t}y)-\widetilde{q}_{t}(x)}\right)\varphi(y)dy}\right|dx
[|q~t(xσty)q~t(x)|𝑑x]φ(y)𝑑y\displaystyle\leq\int\left[{\int\left|{\widetilde{q}_{t}(x-\sigma_{t}y)-\widetilde{q}_{t}(x)}\right|dx}\right]\varphi(y)\,dy
[(eLαtσty1)q~t(x)𝑑x]φ(y)𝑑y\displaystyle\leq\int\left[{\int\left({e^{L\left\|{\alpha_{t}\sigma_{t}y}\right\|}-1}\right)\widetilde{q}_{t}(x)dx}\right]\varphi(y)\,dy
=(eLαtσty1)φ(y)𝑑y\displaystyle=\int\left({e^{L\left\|{\alpha_{t}\sigma_{t}y}\right\|}-1}\right)\varphi(y)\,dy
=(2π)d/2eLαtσtyy22𝑑y1\displaystyle=\left({2\pi}\right)^{-d/2}\int e^{L\alpha_{t}\sigma_{t}\left\|{y}\right\|-\frac{\left\|{y}\right\|^{2}}{2}}\,dy-1
(2π)d/2e[12+(Lαtσt)2]y2𝑑y+Lαtσtyφ(y)𝑑y1\displaystyle\leq\left({2\pi}\right)^{-d/2}\int e^{\left[{-\frac{1}{2}+\left({L\alpha_{t}\sigma_{t}}\right)^{2}}\right]\left\|{y}\right\|^{2}}dy+L\alpha_{t}\sigma_{t}\int\left\|{y}\right\|\varphi(y)\,dy-1
[112(Lαtσt)2]d/2+dLαtσt1\displaystyle\leq\left[{\frac{1}{1-2\left({L\alpha_{t}\sigma_{t}}\right)^{2}}}\right]^{d/2}+\sqrt{d}L\alpha_{t}\sigma_{t}-1
e2d(Lαtσt)21+dLαtσt\displaystyle\leq e^{2d\left({L\alpha_{t}\sigma_{t}}\right)^{2}}-1+\sqrt{d}L\alpha_{t}\sigma_{t}
4d(Lαtσt)2+dLαtσt,\displaystyle\leq 4d\left({L\alpha_{t}\sigma_{t}}\right)^{2}+\sqrt{d}L\alpha_{t}\sigma_{t},

where in the third inequality, we use the elementary inequality exx+ex2e^{x}\leq x+e^{x^{2}}, which is valid for all xx\in\mathbb{R}, and in the fifth inequality, we use 112xe4x\frac{1}{1-2x}\leq e^{4x}, which holds for x[0,1/3]x\in[0,1/3]. Hence if Lαtσt1/2L\alpha_{t}\sigma_{t}\leq 1/2, we have

TV(p~t,q~t)12|p~t(x)q~t(x)|𝑑x32dLαtσt.\displaystyle\operatorname{TV}(\widetilde{p}_{t},\widetilde{q}_{t})\leq\frac{1}{2}\int\left|{\widetilde{p}_{t}(x)-\widetilde{q}_{t}(x)}\right|dx\leq\frac{3}{2}dL\alpha_{t}\sigma_{t}.

Now we conclude the proof by combining the bounds for TV(p~t,q~t)\operatorname{TV}(\widetilde{p}_{t},\widetilde{q}_{t}) and TV(p~0,q~t)\operatorname{TV}(\widetilde{p}_{0},\widetilde{q}_{t}):

TV(p~t,p~0)\displaystyle\operatorname{TV}(\widetilde{p}_{t},\widetilde{p}_{0}) TV(p~t,q~t)+TV(q~t,p~0)\displaystyle\leq\operatorname{TV}(\widetilde{p}_{t},\widetilde{q}_{t})+\operatorname{TV}(\widetilde{q}_{t},\widetilde{p}_{0})
2L(αt1)xp~0(x)𝑑x+2dlnαt+32dLαtσt\displaystyle\leq 2L(\alpha_{t}-1)\int\left\|{x}\right\|\widetilde{p}_{0}(x)dx+2d\ln\alpha_{t}+\frac{3}{2}dL\alpha_{t}\sigma_{t}
2(αt1)(L𝔼p~0X+d)+32dLαtσt,\displaystyle\leq 2\left({\alpha_{t}-1}\right)\cdot\left({L\mathbb{E}_{\widetilde{p}_{0}}\left\|{X}\right\|+d}\right)+\frac{3}{2}dL\alpha_{t}\sigma_{t},

where we use the fact that lnxx1\ln x\leq x-1 for all x1x\geq 1. Recall that αt=1/mt=et/2\alpha_{t}=1/m_{t}=e^{t/2} and σt2=1et\sigma_{t}^{2}=1-e^{-t} when g1g\equiv 1. It suffices for

max{2(L𝔼p~0X+d)(αδ1),32dLαδσδ}εTV2,\displaystyle\max\left\{{2\left({L\mathbb{E}_{\widetilde{p}_{0}}\left\|{X}\right\|+d}\right)\left({\alpha_{\delta}-1}\right),\frac{3}{2}dL\alpha_{\delta}\sigma_{\delta}}\right\}\leq\frac{\varepsilon_{\operatorname{TV}}}{2},

which is implied by

δ\displaystyle\delta min{εTVL𝔼p~0X+d,εTV2d2L2}εTV2R2L2,\displaystyle\precsim{\min\left\{{\frac{\varepsilon_{\operatorname{TV}}}{L\mathbb{E}_{\widetilde{p}_{0}}\left\|{X}\right\|+d},\frac{\varepsilon_{\operatorname{TV}}^{2}}{d^{2}L^{2}}}\right\}}\asymp{\frac{\varepsilon_{\operatorname{TV}}^{2}}{R^{2}L^{2}}},

for appropriate constants, as Rmax{d,𝔼p~0X}R\geq\max\left\{{\sqrt{d},\mathbb{E}_{\widetilde{p}_{0}}\left\|{X}\right\|}\right\}. ∎
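As a numeric illustration (ours), take d = 1 and V(x) = x²/2 + cos(x), which is L-smooth with L = 2; computing p̃t on a grid by scaling and convolution confirms the TV bound.

```python
# Numeric check (ours) of Lemma 6.4 in d = 1 with V(x) = x^2/2 + cos(x) (L = 2).
import numpy as np

xs = np.linspace(-12, 12, 4001); dx = xs[1] - xs[0]
p0 = np.exp(-(xs**2 / 2 + np.cos(xs))); p0 /= p0.sum() * dx    # p_0 = e^{-V}/Z

t, L, d = 0.01, 2.0, 1
alpha, sigma = np.exp(t / 2), np.sqrt(1 - np.exp(-t))

q = alpha * np.interp(alpha * xs, xs, p0, left=0, right=0)     # q_t(x) = a*p0(a*x)
ker = np.exp(-xs**2 / (2 * sigma**2)); ker /= ker.sum()
pt = np.convolve(q, ker, mode="same")                          # p_t = q_t * phi

tv = 0.5 * np.sum(np.abs(pt - p0)) * dx
EX = np.sum(np.abs(xs) * p0) * dx
bound = 2 * (alpha - 1) * (L * EX + d) + 1.5 * d * L * alpha * sigma
print(f"TV = {tv:.4f} <= bound = {bound:.4f}; L*alpha*sigma = {L*alpha*sigma:.2f}")
```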

6.2 Perturbation under TV error

Although we will not need it in our proof, we note that we can derive a similar perturbation result under TV error, which might be of independent interest.

Lemma 6.5.

Let K(x,dy)K(x,dy) be a probability kernel on d\mathbb{R}^{d}, let P0,x,P1,xP_{0,x},P_{1,x} be measures on d\mathbb{R}^{d}. Let PiP_{i} denote the joint distribution of xiPi,xx_{i}\sim P_{i,x} and yiK(xi,)y_{i}\sim K(x_{i},\cdot), and let Pi,yP_{i,y} denote the marginal distribution of yy. Suppose there is a coupling P0,1P_{0,1} of (x0,y0)P0(x_{0},y_{0})\sim P_{0} and (x1,y1)P1(x_{1},y_{1})\sim P_{1} such that

  • x0=x1x_{0}=x_{1} with probability 1ε1-\varepsilon,

  • x0=x1x_{0}=x_{1} implies y0=y1y_{0}=y_{1}, and

  • 𝔼[y0y12]εW2\mathbb{E}[\left\|{y_{0}-y_{1}}\right\|^{2}]\leq\varepsilon_{\textup{W}}^{2}.

Define the tail error by

mi(ε):\displaystyle m_{i}(\varepsilon): =sup0f1,dfdPi,xεdf(x)x2Pi,x(dx).\displaystyle=\sup_{0\leq f\leq 1,\int_{\mathbb{R}^{d}}f\,dP_{i,x}\leq\varepsilon}\int_{\mathbb{R}^{d}}f(x)\left\|{x}\right\|^{2}P_{i,x}(dx).

Let ri(y)=dx^iPi(dx^i|y)r_{i}(y)=\int_{\mathbb{R}^{d}}\widehat{x}_{i}P_{i}(d\widehat{x}_{i}|y), and suppose that r1r_{1} is L1L_{1}-Lipschitz. Then

dP0,y(dy0)dx0P0(dx0|y0)dx1P1(dx1|y0)2\displaystyle\int_{\mathbb{R}^{d}}P_{0,y}(dy_{0})\left\|{\int_{\mathbb{R}^{d}}x_{0}P_{0}(dx_{0}|y_{0})-\int_{\mathbb{R}^{d}}x_{1}P_{1}(dx_{1}|y_{0})}\right\|^{2}
4(m0(2ε)+m0(ε)+m1(2ε)+m1(ε))+2L12εW2\displaystyle\qquad\leq 4(m_{0}(2\varepsilon)+m_{0}(\varepsilon)+m_{1}(2\varepsilon)+m_{1}(\varepsilon))+2L_{1}^{2}\varepsilon_{\textup{W}}^{2}
4(m0(2ε)+m1(2ε))+4(1+L12)(m0(ε)+m1(ε)).\displaystyle\qquad\leq 4(m_{0}(2\varepsilon)+m_{1}(2\varepsilon))+4(1+L_{1}^{2})(m_{0}(\varepsilon)+m_{1}(\varepsilon)).
Proof.

For notational clarity, we will denote draws from the conditional distribution as x^0\widehat{x}_{0} and x^1\widehat{x}_{1}, for example P0(dx^0|y0)P_{0}(d\widehat{x}_{0}|y_{0}). We have

dP0,y(dy0)r0(y0)r1(y0)2\displaystyle\int_{\mathbb{R}^{d}}P_{0,y}(dy_{0})\left\|{r_{0}(y_{0})-r_{1}(y_{0})}\right\|^{2} 2d×dP0,1,y(dy0,dy1)r0(y0)r1(y1)2(I)\displaystyle\leq 2\underbrace{\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}P_{0,1,y}(dy_{0},dy_{1})\left\|{r_{0}(y_{0})-r_{1}(y_{1})}\right\|^{2}}_{(I)}
+2d×dP0,1,y(dy0,dy1)r1(y1)r1(y0)2(II).\displaystyle\quad+2\underbrace{\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}P_{0,1,y}(dy_{0},dy_{1})\left\|{r_{1}(y_{1})-r_{1}(y_{0})}\right\|^{2}}_{(II)}.

For the first term (I), we split it as

(I)\displaystyle(I) {y0=y1}P0,1,y(dy0,dy1)r0(y0)r1(y0)2(i)+{y0y1}P0,1,y(dy0,dy1)r0(y0)r1(y1)2(ii).\displaystyle\leq\underbrace{\int_{\{y_{0}=y_{1}\}}P_{0,1,y}(dy_{0},dy_{1})\left\|{r_{0}(y_{0})-r_{1}(y_{0})}\right\|^{2}}_{(i)}+\underbrace{\int_{\{y_{0}\neq y_{1}\}}P_{0,1,y}(dy_{0},dy_{1})\left\|{r_{0}(y_{0})-r_{1}(y_{1})}\right\|^{2}}_{(ii)}.

Define the measure QQ on d\mathbb{R}^{d} by

Q(A):\displaystyle Q(A): =P0,1(y0A and y0=y1).\displaystyle=P_{0,1}(y_{0}\in A\text{ and }y_{0}=y_{1}).

As in Lemma 6.1, under P0,1P_{0,1}, when y0=y1y_{0}=y_{1}, we can couple P0(dx^0|y0)P_{0}(d\widehat{x}_{0}|y_{0}) and P1(dx^1|y0)P_{1}(d\widehat{x}_{1}|y_{0}) so that x^0=x^1\widehat{x}_{0}=\widehat{x}_{1} with probability min{dQdP0,y,dQdP1,y}\min\left\{{\frac{dQ}{dP_{0,y}},\frac{dQ}{dP_{1,y}}}\right\}. Let P^(dx^0,dx^1|y0)\widehat{P}(d\widehat{x}_{0},d\widehat{x}_{1}|y_{0}) denote this coupled distribution. Then

(i)\displaystyle(i) {y0=y1}P0,1,y(dy0,dy1){x^0x^1}(x^0x^1)P^(dx^0,dx^1|y0)2\displaystyle\leq\int_{\{y_{0}=y_{1}\}}P_{0,1,y}(dy_{0},dy_{1})\left\|{\int_{\{\widehat{x}_{0}\neq\widehat{x}_{1}\}}(\widehat{x}_{0}-\widehat{x}_{1})\widehat{P}(d\widehat{x}_{0},d\widehat{x}_{1}|y_{0})}\right\|^{2}
2dP0,1,y(dy0,dy1)({x^0x^1}x^02P^(dx^0,dx^1|y0)+{x^0x^1}x^12P^(dx^0,dx^1|y1))\displaystyle\leq 2\int_{\mathbb{R}^{d}}P_{0,1,y}(dy_{0},dy_{1})\left({\int_{\{\widehat{x}_{0}\neq\widehat{x}_{1}\}}\left\|{\widehat{x}_{0}}\right\|^{2}\widehat{P}(d\widehat{x}_{0},d\widehat{x}_{1}|y_{0})+\int_{\{\widehat{x}_{0}\neq\widehat{x}_{1}\}}\left\|{\widehat{x}_{1}}\right\|^{2}\widehat{P}(d\widehat{x}_{0},d\widehat{x}_{1}|y_{1})}\right)
2(m0(2ε)+m1(2ε))\displaystyle\leq 2(m_{0}(2\varepsilon)+m_{1}(2\varepsilon))

as in Lemma 6.1. Now

(ii)\displaystyle(ii) 2{y0y1}P0,1,y(dy0,dy1)(r0(y0)2+r1(y1)2)2(m0(ε)+m1(ε)).\displaystyle\leq 2\int_{\{y_{0}\neq y_{1}\}}P_{0,1,y}(dy_{0},dy_{1})(\left\|{r_{0}(y_{0})}\right\|^{2}+\left\|{r_{1}(y_{1})}\right\|^{2})\leq 2(m_{0}(\varepsilon)+m_{1}(\varepsilon)).

Finally, for the second term (II), we use the fact that r1r_{1} is L1L_{1} Lipschitz and the coupling to conclude

(II)\displaystyle(II) dP0,1,y(dy0,dy1)L12y0y12L12εW2.\displaystyle\leq\int_{\mathbb{R}^{d}}P_{0,1,y}(dy_{0},dy_{1})L_{1}^{2}\left\|{y_{0}-y_{1}}\right\|^{2}\leq L_{1}^{2}\varepsilon_{\textup{W}}^{2}.

We conclude the proof by combining the inequalities for (i), (ii), and (II).

For the second upper bound, we note that

𝔼[y0y12]2(𝔼[y02]+𝔼[y12])2(m0(ε)+m1(ε)).\mathbb{E}[\left\|{y_{0}-y_{1}}\right\|^{2}]\leq 2(\mathbb{E}[\left\|{y_{0}}\right\|^{2}]+\mathbb{E}[\left\|{y_{1}}\right\|^{2}])\leq 2(m_{0}(\varepsilon)+m_{1}(\varepsilon)).\qed

6.3 Gaussian tail calculation

We use the following Gaussian tail calculation in the proof of Lemma 6.2.

Lemma 6.6.

Let μ\mu be the standard Gaussian measure N(0,Id)N(0,I_{d}) on d\mathbb{R}^{d}. Then

supμ(A)εAx2μ(dx)\displaystyle\sup_{\mu(A)\leq\varepsilon}\int_{A}\left\|{x}\right\|^{2}\mu(dx) ε(2d+3ln(1ε)+3)=O(ε(d+ln(1ε)))\displaystyle\leq\varepsilon\left({2d+3\ln\left({\frac{1}{\varepsilon}}\right)+3}\right)=O\left({\varepsilon\left({d+\ln\left({\frac{1}{\varepsilon}}\right)}\right)}\right)
supμ(A)εAx4μ(dx)\displaystyle\sup_{\mu(A)\leq\varepsilon}\int_{A}\left\|{x}\right\|^{4}\mu(dx) ε(2d+3ln(1ε))2+3ε(2d+3ln(1ε))+9ε=O(ε(d2+ln(1ε)2)).\displaystyle\leq\varepsilon\left({2d+3\ln\left({\frac{1}{\varepsilon}}\right)}\right)^{2}+3\varepsilon\left({2d+3\ln\left({\frac{1}{\varepsilon}}\right)}\right)+9\varepsilon=O\left({\varepsilon\left({d^{2}+\ln\left({\frac{1}{\varepsilon}}\right)^{2}}\right)}\right).
Proof.

By the χ2\chi^{2} tail bound in [LM00], for t0t\geq 0,

μ(X22d+3t)\displaystyle\mu(\left\|{X}\right\|^{2}\geq 2d+3t) (X2d+2dt+2t)et,\displaystyle\leq\mathbb{P}(\left\|{X}\right\|^{2}\geq d+2\sqrt{dt}+2t)\leq e^{-t}, (33)

so X2\left\|{X}\right\|^{2} is stochastically dominated by a random variable with cdf F(y)=1ey2d3F(y)=1-e^{-\frac{y-2d}{3}}. Then letting PYP_{Y} be the measure corresponding to FF,

supμ(A)εAx2μ(dx)\displaystyle\sup_{\mu(A)\leq\varepsilon}\int_{A}\left\|{x}\right\|^{2}\mu(dx) supPY(A)εAyPY(dy)=2d+3ln(1ε)y𝑑F(y)\displaystyle\leq\sup_{P_{Y}(A)\leq\varepsilon}\int_{A}yP_{Y}(dy)=\int_{2d+3\ln\left({\frac{1}{\varepsilon}}\right)}^{\infty}ydF(y)
=ε(2d+3ln(1ε))+2d+3ln(1ε)ey2d3𝑑y=ε(2d+3ln(1ε))+3ε\displaystyle=\varepsilon\left({2d+3\ln\left({\frac{1}{\varepsilon}}\right)}\right)+\int_{2d+3\ln\left({\frac{1}{\varepsilon}}\right)}^{\infty}e^{-\frac{y-2d}{3}}dy=\varepsilon\left({2d+3\ln\left({\frac{1}{\varepsilon}}\right)}\right)+3\varepsilon

and

supμ(A)εAx4μ(dx)\displaystyle\sup_{\mu(A)\leq\varepsilon}\int_{A}\left\|{x}\right\|^{4}\mu(dx) supPY(A)εAy2PY(dy)=2d+3ln(1ε)y2𝑑F(y)\displaystyle\leq\sup_{P_{Y}(A)\leq\varepsilon}\int_{A}y^{2}P_{Y}(dy)=\int_{2d+3\ln\left({\frac{1}{\varepsilon}}\right)}^{\infty}y^{2}dF(y)
=ε(2d+3ln(1ε))2+2d+3ln(1ε)2yey2d3𝑑y\displaystyle=\varepsilon\left({2d+3\ln\left({\frac{1}{\varepsilon}}\right)}\right)^{2}+\int_{2d+3\ln\left({\frac{1}{\varepsilon}}\right)}^{\infty}2ye^{-\frac{y-2d}{3}}dy
=ε(2d+3ln(1ε))2[3yey2d3]|2d+3ln(1ε)+2d+3ln(1ε)3ey2d3𝑑y\displaystyle=\varepsilon\left({2d+3\ln\left({\frac{1}{\varepsilon}}\right)}\right)^{2}-\left[3ye^{-\frac{y-2d}{3}}\right]\Big{|}^{\infty}_{2d+3\ln\left({\frac{1}{\varepsilon}}\right)}+\int_{2d+3\ln\left({\frac{1}{\varepsilon}}\right)}^{\infty}3e^{-\frac{y-2d}{3}}\,dy
=ε(2d+3ln(1ε))2+3ε(2d+3ln(1ε))+9ε.\displaystyle=\varepsilon\left({2d+3\ln\left({\frac{1}{\varepsilon}}\right)}\right)^{2}+3\varepsilon\left({2d+3\ln\left({\frac{1}{\varepsilon}}\right)}\right)+9\varepsilon.\qed
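A quick Monte Carlo check (ours): the supremum in the first bound is attained on a tail set {‖x‖² ≥ τ} with μ-measure ε, whose contribution can be estimated empirically and compared with ε(2d + 3ln(1/ε) + 3).

```python
# Monte Carlo illustration (ours) of the first bound in Lemma 6.6.
import numpy as np

rng = np.random.default_rng(4)
d, eps, n = 10, 0.05, 200_000
sq = np.sum(rng.normal(size=(n, d))**2, axis=1)    # ||x||^2, x ~ N(0, I_d)

thresh = np.quantile(sq, 1 - eps)                  # tail set of measure eps
tail_moment = np.mean(np.where(sq >= thresh, sq, 0.0))
bound = eps * (2 * d + 3 * np.log(1 / eps) + 3)
print(f"tail second moment {tail_moment:.3f} <= bound {bound:.3f}")
```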

7 Guarantees under L2L^{2}-accurate score estimate

We will state our results under a more general tail bound assumption.

Assumption 6 (Tail bound).

R:[0,1][0,)R:[0,1]\to[0,\infty) is a function such that Pdata(BR(ε)(0))1εP_{\textup{data}}(B_{R(\varepsilon)}(0))\geq 1-\varepsilon.

Our result will require R(ε)R(\varepsilon) to grow at most as a sufficiently small power of ε1\varepsilon^{-1} as ε0\varepsilon\to 0; in particular, this holds for subexponential distributions. By taking RR to be a constant function, this contains the assumption of bounded support (Assumption 2) as a special case.

7.1 TV error guarantees

We follow the framework of [LLT22] to convert guarantees under an LL^{\infty}-accurate score estimate to guarantees under an L2L^{2}-accurate score estimate.

Theorem 7.1 ([LLT22, Theorem 4.1]).

Let (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) be a probability space and {n}\{\mathcal{F}_{n}\} be a filtration of the sigma field \mathcal{F}. Suppose XnpnX_{n}\sim p_{n}, ZnqnZ_{n}\sim q_{n}, and Z¯nq¯n\overline{Z}_{n}\sim\overline{q}_{n} are n\mathcal{F}_{n}-adapted random processes taking values in Ω\Omega, and BnΩB_{n}\subseteq\Omega are sets such that the following hold for every n0n\in\mathbb{N}_{0}.

1.

    If ZkBkcZ_{k}\in B_{k}^{c} for all 0kn10\leq k\leq n-1, then Zn=Z¯nZ_{n}=\overline{Z}_{n}.

2.

    χ2(q¯n||pn)Dn2\chi^{2}(\overline{q}_{n}||p_{n})\leq D_{n}^{2}.

3.

    (XnBn)δn\mathbb{P}(X_{n}\in B_{n})\leq\delta_{n}.

Then the following hold.

TV(qn,q¯n)\displaystyle\operatorname{TV}(q_{n},\overline{q}_{n}) k=0n1(Dk2+1)1/2δk1/2\displaystyle\leq\sum_{k=0}^{n-1}(D_{k}^{2}+1)^{1/2}\delta_{k}^{1/2} TV(pn,qn)\displaystyle\operatorname{TV}(p_{n},q_{n}) Dn+k=0n1(Dk2+1)1/2δk1/2\displaystyle\leq D_{n}+\sum_{k=0}^{n-1}(D_{k}^{2}+1)^{1/2}\delta_{k}^{1/2} (34)
Theorem 7.2 (DDPM with L2L^{2}-accurate score estimate).

Let 0<εχ,εTV,δ<120<\varepsilon_{\chi},\varepsilon_{\operatorname{TV}},\delta<\frac{1}{2}. Suppose that Assumption 6 holds, that for a sufficiently small constant cc the radius R0R_{0} satisfies R(cεTV3δ6εχ12R019d5)R0R\left({\frac{c\varepsilon_{\operatorname{TV}}^{3}\delta^{6}\varepsilon_{\chi}^{12}}{R_{0}^{19}d^{5}}}\right)\leq R_{0}, and that R02dR_{0}^{2}\geq d. Suppose one of the following cases holds.

1.

    Let Pdata,s(,t)P_{\textup{data}},s(\cdot,t) be such that Assumption 1 holds, with R02dR_{0}^{2}\geq d. Suppose that

    εσ=O(εTVδ5/2εχ11/2B9/4),\varepsilon_{\sigma}=O\left({\frac{\varepsilon_{\operatorname{TV}}\delta^{5/2}\varepsilon_{\chi}^{11/2}}{B^{9/4}}}\right),

    where B=R04dln(Tδ)ln(R0dδεTVεχ)B=R_{0}^{4}d\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{R_{0}d}{\delta\varepsilon_{\operatorname{TV}}\varepsilon_{\chi}}}\right), and we run (5) starting from ppriorp_{\textup{prior}} for time T=ln(16R02εχ2)T=\ln\left({\frac{16R_{0}^{2}}{\varepsilon_{\chi}^{2}}}\right), N=O(B(T+1δ2)εχ2)N=O\left({\frac{B\left({T+\frac{1}{\delta^{2}}}\right)}{\varepsilon_{\chi}^{2}}}\right) steps with step sizes satisfying hk=O(εχ2Bmax{Ttk,(Ttk)3})h_{k}=O\left({\frac{\varepsilon_{\chi}^{2}}{B\max\{T-t_{k},(T-t_{k})^{-3}\}}}\right).

2.

    Let Pdata,s(,t)P_{\textup{data}},s(\cdot,t) be such that Assumptions 1 and 3 hold, with CR02C\geq R_{0}^{2}. Suppose

    εσ\displaystyle\varepsilon_{\sigma} =O(εTVεχ3T5/2B),\displaystyle=O\left({\frac{\varepsilon_{\operatorname{TV}}\varepsilon_{\chi}^{3}}{T^{5/2}B}}\right),

    where B=C2dln(Tδ)ln(R0dδεTVεχ)B=C^{2}d\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{R_{0}d}{\delta\varepsilon_{\operatorname{TV}}\varepsilon_{\chi}}}\right), and we run (5) starting from ppriorp_{\textup{prior}} for time T=ln(16R02εχ2)T=\ln\left({\frac{16R_{0}^{2}}{\varepsilon_{\chi}^{2}}}\right), N=O(B(T+ln(1δ))εχ2)N=O\left({\frac{B\left({T+\ln\left({\frac{1}{\delta}}\right)}\right)}{\varepsilon_{\chi}^{2}}}\right) steps with step sizes satisfying hk=O(εχ2Bmax{Ttk,(Ttk)1})h_{k}=O\left({\frac{\varepsilon_{\chi}^{2}}{B\max\{T-t_{k},(T-t_{k})^{-1}\}}}\right).

Then the resulting distribution qtNq_{t_{N}} is within εTV\varepsilon_{\operatorname{TV}} in TV distance of a distribution q¯tN\overline{q}_{t_{N}}, where q¯tN\overline{q}_{t_{N}} satisfies χ2(q¯tN||ptN)εχ2\chi^{2}(\overline{q}_{t_{N}}||p_{t_{N}})\leq\varepsilon_{\chi}^{2}. In particular, taking εχ=εTV\varepsilon_{\chi}=\varepsilon_{\operatorname{TV}}, we have TV(qT,Pdata)2εTV\operatorname{TV}(q_{T},P_{\textup{data}})\leq 2\varepsilon_{\operatorname{TV}}.

Note that the condition on RR can be satisfied if R(ε)=O(εβ)R(\varepsilon)=O(\varepsilon^{-\beta}) for some β<1/19\beta<1/19 (no effort has been made to optimize the exponent).

Proof.

We invoke Lemma 5.2 for an εK\varepsilon_{K} to be chosen, to obtain a distribution P~0\widetilde{P}_{0} on BR0(0)B_{R_{0}}(0), where R0R(εK/8)R_{0}\geq R(\varepsilon_{K}/8). Let B=R04dln(Tδ)ln(R0δεK)B=R_{0}^{4}d\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{R_{0}}{\delta\varepsilon_{K}}}\right) and B=C2dln(Tδ)ln(R0δεK)B=C^{2}d\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{R_{0}}{\delta\varepsilon_{K}}}\right) in case 1 and case 2, respectively; our choice of εK=O(εTV2δ6n2R06)\varepsilon_{K}=O\left({\frac{\varepsilon_{\operatorname{TV}}^{2}\delta^{6}}{n^{2}R_{0}^{6}}}\right) will give the definition of BB in the theorem statement. In the following, we define p~t\widetilde{p}_{t} with P~0\widetilde{P}_{0}, rather than PdataP_{\textup{data}}, as the initial distribution. Note that since TV(Pdata,P~0)εK=o(εTV)\operatorname{TV}(P_{\textup{data}},\widetilde{P}_{0})\leq\sqrt{\varepsilon_{K}}=o(\varepsilon_{\operatorname{TV}}) (and the same holds for their evolutions under (1)), it suffices to consider convergence to p~δ\widetilde{p}_{\delta}.

We first define the bad sets where the error in the score estimate is large,

Bt:\displaystyle B_{t}: ={lnp~t(x)s(x,t)>ε,t}\displaystyle=\left\{{\left\|{\nabla\ln\widetilde{p}_{t}(x)-s(x,t)}\right\|>\varepsilon_{\infty,t}}\right\} (35)

for some ε,t\varepsilon_{\infty,t} to be chosen.

Given t0t\geq 0, let t=tkt_{-}=t_{k} where kk is such that t[tk,tk+1)t\in[t_{k},t_{k+1}). Given bad sets BtB_{t}, define the interpolated process on [tk,tk+1)[t_{k},t_{k+1}) by

dz¯t\displaystyle d\overline{z}_{t} =g(Tt)2(12z¯t+b(z¯t,Tt))dt+g(Tt)dwt\displaystyle=g(T-t)^{2}\left({\frac{1}{2}\overline{z}_{t}+b(\overline{z}_{t_{-}},T-t_{-})}\right)\,dt+g(T-t)\,dw_{t} (36)
where b(z,t)\displaystyle\text{where }b(z,t) ={s(z,t),zBtlnp~t(z),zBt.\displaystyle=\begin{cases}s(z,t),&z\not\in B_{t}\\ \nabla\ln\widetilde{p}_{t}(z),&z\in B_{t}\end{cases}.

In other words, simulate the reverse SDE using the score estimate as long as the point is in the good set at the previous discretization timepoint tkt_{k}, and otherwise use the actual gradient lnpt\nabla\ln p_{t}. Let q¯t\overline{q}_{t} denote the distribution of z¯t\overline{z}_{t} when z¯0q0\overline{z}_{0}\sim q_{0}. Note that this process is defined only for purposes of analysis, as we do not have access to lnpt\nabla\ln p_{t}. As before, we let qtq_{t} denote the distribution of ztz_{t} defined by (12).

We can couple this process with the exponential integrator (5) using ss so that as long as xtmBTtmx_{t_{m}}\not\in B_{T-t_{m}}, the processes agree, thus satisfying condition 1 of Theorem 7.1.
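For intuition, one step of an exponential-integrator update of this type can be written as follows. This is a sketch (ours) specialized to $g\equiv 1$; the exact form of (5) is fixed earlier in the paper, and the placeholder score function in the usage line is hypothetical. Over a step of length $h$, the linear drift $\frac12 z$ is integrated exactly while the score term is frozen at the previous gridpoint:

```python
import numpy as np

# Sketch (ours): one exponential-integrator step for
#   dz = (z/2 + s(z_k, tau_k)) dt + dw,   tau_k = T - t_k,
# i.e. the reverse SDE with g = 1 and the score estimate s frozen over the step.
def exp_integrator_step(z, score, tau, h, rng):
    a = np.exp(h / 2.0)
    mean = a * z + 2.0 * (a - 1.0) * score(z, tau)   # exact OU map + frozen score
    noise = np.sqrt(np.expm1(h)) * rng.standard_normal(z.shape)  # Var = e^h - 1
    return mean + noise

# usage with a hypothetical placeholder score:
rng = np.random.default_rng(0)
z = exp_integrator_step(np.zeros(3), lambda z, tau: -z, tau=1.0, h=0.01, rng=rng)
```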

Then by Lemma 6.3,

$\widetilde{P}_t^{(0)}(B_t) \leq \varepsilon_K+\frac{4}{\varepsilon_{\infty,t}^2}\left(\varepsilon_t^2+O\left(\frac{\varepsilon_K\left(d+\ln\left(\frac{1}{\varepsilon_K}\right)\right)}{\sigma_t^2}\right)\right)=:\delta_t.$

Then by the choice of $h_k$ and either Corollary 4.11 or 4.12, when $\int_0^{t_n}\varepsilon_{\infty,T-t}^2\,dt=O(1)$,

$\chi^2(\overline{q}_{t_k}\|p_{t_k}) \leq e^{\varepsilon}\chi^2(q_0\|p_0)+\varepsilon+e^{\varepsilon}\int_0^{t_n}\varepsilon_{\infty,T-t}^2\,dt$ (37)

$\leq 2\chi^2(q_0\|p_0)+O(1),$

where $\varepsilon=\frac{\varepsilon_\chi^2}{4}$ (so that $e^{\varepsilon}\leq 2$ for $\varepsilon_\chi\leq 1$, giving the second line). For $\chi^2(\overline{q}_{t_k}\|p_{t_k})$ to be bounded by $\varepsilon_\chi^2$, it suffices for the three terms in (37) to be bounded by $\frac{\varepsilon_\chi^2}{2},\frac{\varepsilon_\chi^2}{4},\frac{\varepsilon_\chi^2}{4}$ respectively; this is implied by

$T=\ln\left(\frac{16R_0^2}{\varepsilon_\chi^2}\right)$, which ensures $\chi^2(q_0\|p_0)\leq\frac{\varepsilon_\chi^2}{4}$, and

$\int_0^{t_n}\varepsilon_{\infty,T-t}^2\,dt=O(\varepsilon_\chi^2).$ (38)

By Theorem 7.1,

$\operatorname{TV}(q_{t_n},\overline{q}_{t_n}) \leq \sum_{k=0}^{n-1}(1+\chi^2(q_{t_k}\|p_{t_k}))^{1/2}P(B_{t_k})^{1/2}$

$\leq \sum_{k=0}^{n-1}\left(2\chi^2(q_0\|p_0)^{1/2}+O(1)\right)\delta_{t_k}^{1/2}$ (39)

$= O\left(\sum_{k=0}^{n-1}\left[\frac{\varepsilon_{t_k}}{\varepsilon_{\infty,t_k}}+\sqrt{\varepsilon_K}\left(1+\frac{\sqrt{d+\ln(1/\varepsilon_K)}}{\varepsilon_{\infty,t_k}\sigma_{T-t_k}}\right)\right]\right).$ (40)

For this to be bounded by εTV\varepsilon_{\operatorname{TV}}, it suffices for

k=0n1εtε,t\displaystyle\sum_{k=0}^{n-1}\frac{\varepsilon_{t}}{\varepsilon_{\infty,t}} =O(εTV)\displaystyle=O(\varepsilon_{\operatorname{TV}}) (41)
εK\displaystyle\varepsilon_{K} =O(minkεtk2σTtk2d+ln(1/εK)).\displaystyle=O\left({\frac{\min_{k}\varepsilon_{t_{k}}^{2}\sigma_{T-t_{k}}^{2}}{d+\ln(1/\varepsilon_{K})}}\right). (42)

We bound (42) crudely, as the dependence on εK\varepsilon_{K} will be logarithmic. Using εtk2=εσ2/σtk4\varepsilon_{t_{k}}^{2}=\varepsilon_{\sigma}^{2}/\sigma_{t_{k}}^{4}, it suffices that

εK\displaystyle\varepsilon_{K} =O(εσ2d+ln(1/εK)).\displaystyle=O\left({\frac{\varepsilon_{\sigma}^{2}}{d+\ln(1/\varepsilon_{K})}}\right). (43)

We will return to this after deriving a condition on εσ\varepsilon_{\sigma}. It remains to bound (38) and (41). We break up the timepoints depending on whether Tt>1T-t>1. Let

(t0,t1,,tN)\displaystyle(t_{0},t_{1},\ldots,t_{N}) =(t0,,tncoarse1,t0,,tnfine)\displaystyle=(t_{0},\ldots,t_{n^{\textup{coarse}}-1},t_{0}^{\prime},\ldots,t_{n^{\textup{fine}}}^{\prime})

and $u_k=T-t_k'$, where $t_{n^{\textup{coarse}}-1}\leq T-1\leq t_0'$. Let $h_k'=t_{k+1}'-t_k'$. Note that the “fine” timepoints are closer together than the “coarse” timepoints. We break up the integral (38) and the sum (41) into the parts involving the coarse and fine timepoints. For (38), it suffices to have

(38), coarse:0t0ε,Tt2𝑑t\displaystyle\eqref{e:error-int},\text{ coarse:}\quad\int_{0}^{t_{0}^{\prime}}\varepsilon_{\infty,T-t}^{2}\,dt Tmax0kncoarseε,Ttk2=O(εχ2)\displaystyle\leq T\max_{0\leq k\leq n^{\textup{coarse}}}\varepsilon_{\infty,T-t_{k}}^{2}=O(\varepsilon_{\chi}^{2})

so it suffices to take ε,Ttk2εχ2T\varepsilon_{\infty,T-t_{k}}^{2}\asymp\frac{\varepsilon_{\chi}^{2}}{T}. Let α=3\alpha=3 in case 1 and α=1\alpha=1 in case 2. For the fine part, recalling our choice of hkh_{k}^{\prime}, it suffices to have (note we can redefine εt=εtk\varepsilon_{t}=\varepsilon_{t_{k}} when t[tk,tk+1)t\in[t_{k},t_{k+1}) without any harm)

(38), fine:t0tnfineε,Tt2𝑑t\displaystyle\eqref{e:error-int},\text{ fine:}\quad\int_{t_{0}^{\prime}}^{t_{n^{\textup{fine}}}^{\prime}}\varepsilon_{\infty,T-t}^{2}\,dt =k=0nfine1hkε,Ttk2=O(εχ2)\displaystyle=\sum_{k=0}^{n^{\textup{fine}}-1}h_{k}^{\prime}\varepsilon_{\infty,T-t_{k}^{\prime}}^{2}=O(\varepsilon_{\chi}^{2})
k=0nfine1εχ2ukαBε,uk2\displaystyle\Longleftarrow\sum_{k=0}^{n^{\textup{fine}}-1}\frac{\varepsilon_{\chi}^{2}u_{k}^{\alpha}}{B}\varepsilon_{\infty,u_{k}}^{2} =O(εχ2)\displaystyle=O(\varepsilon_{\chi}^{2})
k=0nfine1ukαε,uk2B\displaystyle\Longleftarrow\sum_{k=0}^{n^{\textup{fine}}-1}\frac{u_{k}^{\alpha}\varepsilon_{\infty,u_{k}}^{2}}{B} =O(1).\displaystyle=O(1). (44)

For (41), it suffices to have

(41), coarse:k=0ncoarse1εTtkε,Ttk\displaystyle\eqref{e:error-sum},\text{ coarse:}\quad\sum_{k=0}^{n^{\textup{coarse}}-1}\frac{\varepsilon_{T-t_{k}}}{\varepsilon_{\infty,T-t_{k}}} ncoarseεσεχ/T=O(εTV)\displaystyle\asymp n^{\textup{coarse}}\frac{\varepsilon_{\sigma}}{\varepsilon_{\chi}/\sqrt{T}}=O(\varepsilon_{\operatorname{TV}})
εσ\displaystyle\Longleftarrow\varepsilon_{\sigma} =O(εTVεχncoarseT)\displaystyle=O\left({\frac{\varepsilon_{\operatorname{TV}}\varepsilon_{\chi}}{n^{\textup{coarse}}\sqrt{T}}}\right) (45)

and

(41), fine:k=0nfine1εukε,ukk=0nfine1εσukε,uk=O(εTV).\displaystyle\eqref{e:error-sum},\text{ fine:}\quad\sum_{k=0}^{n^{\textup{fine}}-1}\frac{\varepsilon_{u_{k}}}{\varepsilon_{\infty,u_{k}}}\asymp\sum_{k=0}^{n^{\textup{fine}}-1}\frac{\varepsilon_{\sigma}}{u_{k}\varepsilon_{\infty,u_{k}}}=O(\varepsilon_{\operatorname{TV}}). (46)

Note that in light of the required step sizes, we can take $n^{\textup{coarse}}\asymp\frac{T^2B}{\varepsilon_\chi^2}$: the coarse steps have size $h_k\asymp\frac{\varepsilon_\chi^2}{B(T-t_k)}\geq\frac{\varepsilon_\chi^2}{BT}$, so covering the coarse interval takes $O(T^2B/\varepsilon_\chi^2)$ steps. Considering the equality case of Hölder’s inequality on (44)$^{1/3}$(46)$^{2/3}$ suggests that we take

εσ\displaystyle\varepsilon_{\sigma} εTVB1/2(k=0nfine1ukα23)3/2\displaystyle\asymp\frac{\varepsilon_{\operatorname{TV}}B^{1/2}}{\left({\sum_{k=0}^{n^{\textup{fine}}-1}{u_{k}}^{\frac{\alpha-2}{3}}}\right)^{3/2}} (47)
ε,uk\displaystyle\varepsilon_{\infty,u_{k}} B1/2ukα+13(k=0nfine1ukα23)1/2\displaystyle\asymp\frac{B^{1/2}}{{u_{k}}^{\frac{\alpha+1}{3}}\left({\sum_{k=0}^{n^{\textup{fine}}-1}{u_{k}}^{\frac{\alpha-2}{3}}}\right)^{1/2}} (48)

Note that the number of steps needed in the fine part is $O\left(\frac{B}{\varepsilon_\chi^2\delta^2}\right)$ in case 1 and $O\left(\frac{B}{\varepsilon_\chi^2}\ln\left(\frac{1}{\delta}\right)\right)$ in case 2. One can check that with the choices (47) and (48), both (44) and (46) are satisfied; the computation is below.
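Explicitly, writing $S:=\sum_{k=0}^{n^{\textup{fine}}-1}u_k^{\frac{\alpha-2}{3}}$ (a routine check of the claim just made): substituting (48) into (44) gives
\[
\sum_{k=0}^{n^{\textup{fine}}-1}\frac{u_k^{\alpha}\varepsilon_{\infty,u_k}^2}{B}=\sum_{k=0}^{n^{\textup{fine}}-1}\frac{u_k^{\alpha}}{B}\cdot\frac{B}{u_k^{\frac{2(\alpha+1)}{3}}S}=\frac{1}{S}\sum_{k=0}^{n^{\textup{fine}}-1}u_k^{\frac{\alpha-2}{3}}=1,
\]
and substituting (47) and (48) into (46) gives
\[
\sum_{k=0}^{n^{\textup{fine}}-1}\frac{\varepsilon_\sigma}{u_k\varepsilon_{\infty,u_k}}\asymp\frac{\varepsilon_{\operatorname{TV}}B^{1/2}}{S^{3/2}}\cdot\frac{S^{1/2}}{B^{1/2}}\sum_{k=0}^{n^{\textup{fine}}-1}u_k^{\frac{\alpha+1}{3}-1}=\frac{\varepsilon_{\operatorname{TV}}}{S}\sum_{k=0}^{n^{\textup{fine}}-1}u_k^{\frac{\alpha-2}{3}}=\varepsilon_{\operatorname{TV}}.
\]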

Finally, we calculate the denominator for εσ\varepsilon_{\sigma}. In case 1, note that starting from Tt0=O(1)T-t_{0}^{\prime}=O(1) and taking steps of size hkεχ2B(Ttk)3h_{k}^{\prime}\asymp\frac{\varepsilon_{\chi}^{2}}{B(T-t_{k}^{\prime})^{3}}, it takes nfine=Θ(Bεχ2δ2)n^{\textup{fine}}=\Theta\left({\frac{B}{\varepsilon_{\chi}^{2}\delta^{2}}}\right) steps to reach Tt=δT-t=\delta.

uk\displaystyle u_{k} =Ttk=(1+Θ(kεχ2B))12\displaystyle=T-t_{k}^{\prime}=\left({1+\Theta\left({\frac{k\varepsilon_{\chi}^{2}}{B}}\right)}\right)^{-\frac{1}{2}}
k=0nfine1uk1/3\displaystyle\sum_{k=0}^{n^{\textup{fine}}-1}u_{k}^{1/3} k=0nfine1(1+Θ(kεχ2B))16Bεχ2(nfine)56(Bεχ2)11/61δ5/3.\displaystyle\asymp\sum_{k=0}^{n^{\textup{fine}}-1}\left({1+\Theta\left({\frac{k\varepsilon_{\chi}^{2}}{B}}\right)}\right)^{-\frac{1}{6}}\asymp\frac{B}{\varepsilon_{\chi}^{2}}(n^{\textup{fine}})^{\frac{5}{6}}\asymp\left({\frac{B}{\varepsilon_{\chi}^{2}}}\right)^{11/6}\frac{1}{\delta^{5/3}}.

Then we obtain

εσ\displaystyle\varepsilon_{\sigma} εTVB1/2εχ11/2B11/4δ5/2=εTVδ5/2εχ11/2B9/4.\displaystyle\asymp\varepsilon_{\operatorname{TV}}B^{1/2}\frac{\varepsilon_{\chi}^{11/2}}{B^{11/4}}\delta^{5/2}=\frac{\varepsilon_{\operatorname{TV}}\delta^{5/2}\varepsilon_{\chi}^{11/2}}{B^{9/4}}.

In case 1, combining the above with (45), our requirement is

$\varepsilon_\sigma = O\left(\frac{\varepsilon_{\operatorname{TV}}\delta^{5/2}\varepsilon_\chi^{11/2}}{B^{9/4}}\wedge\frac{\varepsilon_{\operatorname{TV}}\varepsilon_\chi^3}{T^{5/2}B}\right),$

but note that the first bound is the more stringent one. Now, returning to (43), we see that it suffices to take $\varepsilon_K=O\left(\frac{1}{d}\left(\frac{\varepsilon_{\operatorname{TV}}\delta^{5/2}\varepsilon_\chi^{11/2}}{R_0^9d^{9/4}}\right)^{2+\beta}\right)$ for any fixed $\beta>0$; the extra power $\beta$ absorbs the $\log(1/\varepsilon_K)$ factor appearing in $B$, as the following computation indicates.
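Indeed (a routine self-consistency check): write $x=\frac{\varepsilon_{\operatorname{TV}}\delta^{5/2}\varepsilon_\chi^{11/2}}{R_0^9d^{9/4}}\leq 1$, so that $\varepsilon_\sigma\asymp x$ up to logarithmic factors. With $\varepsilon_K\asymp x^{2+\beta}/d$ we have $\ln(1/\varepsilon_K)=O_\beta(\ln(d/x))$, hence
\[
\varepsilon_K\left(d+\ln\frac{1}{\varepsilon_K}\right)=O\left(x^{2+\beta}\left(1+\frac{\ln(d/x)}{d}\right)\right)=O(x^2),
\]
since $x^\beta\ln(1/x)=O_\beta(1)$ and $\frac{\ln d}{d}=O(1)$; this is exactly what (43) requires.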

In case 2, we have instead uk=exp(Θ(εχ2Bk))u_{k}=\exp\left({-\Theta\left({\frac{\varepsilon_{\chi}^{2}}{B}k}\right)}\right) so

εσεTVB1/2(εχ2B)3/2=εTVεχ3B.\varepsilon_{\sigma}\asymp\varepsilon_{\operatorname{TV}}B^{1/2}\left({\frac{\varepsilon_{\chi}^{2}}{B}}\right)^{3/2}=\frac{\varepsilon_{\operatorname{TV}}\varepsilon_{\chi}^{3}}{B}.\qed
Theorem 7.3 (TV error for DDPM with L2L^{2}-accurate score estimate and smoothness).

Let $0<\varepsilon_{\operatorname{TV}}<\frac{1}{2}$. Suppose that Assumptions 4 and 6 hold, that for a sufficiently small constant $c$, $R_0$ satisfies $R\left(\frac{c\varepsilon_{\operatorname{TV}}^{15}}{R_0^{31}d^5L^{12}}\right)\leq R_0$ and $R_0^2\geq\max\left\{d,\mathbb{E}_{P_{\textup{data}}}\left[\|X\|^2\right]\right\}$, and that one of the following cases holds.

  1.

    Let Pdata,s(,t)P_{\textup{data}},s(\cdot,t) be such that Assumption 1 holds. Suppose that

    εσ=O(εTV11.5B9/4R05L5),\varepsilon_{\sigma}=O\left({\frac{\varepsilon_{\operatorname{TV}}^{11.5}}{B^{9/4}R_{0}^{5}L^{5}}}\right),

    where $B=R_0^4d\ln\left(\frac{TR_0^2L^2}{\varepsilon_{\operatorname{TV}}^2}\right)\ln\left(\frac{R_0^3dL^2}{\varepsilon_{\operatorname{TV}}^4}\right)$, and we run (5) starting from $p_{\textup{prior}}$ for time $T=\ln\left(\frac{16R_0^2}{\varepsilon_{\operatorname{TV}}^2}\right)$, $N=O\left(\frac{B\left(T+\left(\frac{R_0L}{\varepsilon_{\operatorname{TV}}}\right)^4\right)}{\varepsilon_{\operatorname{TV}}^2}\right)$ steps with step sizes satisfying $h_k=O\left(\frac{\varepsilon_{\operatorname{TV}}^2}{B\max\{T-t_k,(T-t_k)^{-3}\}}\right)$.

  2.

    Let Pdata,s(,t)P_{\textup{data}},s(\cdot,t) be such that Assumptions 1 and 3 hold, with CR02C\geq R_{0}^{2}. Suppose

    εσ\displaystyle\varepsilon_{\sigma} =O(εTV4T5/2B),\displaystyle=O\left({\frac{\varepsilon_{\operatorname{TV}}^{4}}{T^{5/2}B}}\right),

    where $B=C^2d\ln\left(\frac{TR_0^2L^2}{\varepsilon_{\operatorname{TV}}^2}\right)\ln\left(\frac{R_0^3dL^2}{\varepsilon_{\operatorname{TV}}^4}\right)$, and we run (5) starting from $p_{\textup{prior}}$ for time $T=\ln\left(\frac{16R_0^2}{\varepsilon_{\operatorname{TV}}^2}\right)$, $N=O\left(\frac{B\left(T+\ln\left(\frac{R_0L}{\varepsilon_{\operatorname{TV}}}\right)\right)}{\varepsilon_{\operatorname{TV}}^2}\right)$ steps with step sizes satisfying $h_k=O\left(\frac{\varepsilon_{\operatorname{TV}}^2}{B\max\{T-t_k,(T-t_k)^{-1}\}}\right)$.

Then the resulting distribution satisfies $\operatorname{TV}(q_{t_N},P_{\textup{data}})\leq 3\varepsilon_{\operatorname{TV}}$.

Proof.

By Theorem 7.2 (applied with $\varepsilon_\chi=\varepsilon_{\operatorname{TV}}$), we have $\operatorname{TV}(q_{t_N},p_{t_N})\leq 2\varepsilon_{\operatorname{TV}}$. Now by Lemma 6.4, if we further assume

δ=O(εTV2R02L2),\displaystyle\delta=O\left({\frac{\varepsilon_{\operatorname{TV}}^{2}}{R_{0}^{2}L^{2}}}\right),

then $\operatorname{TV}(p_{t_N},P_{\textup{data}})\leq\varepsilon_{\operatorname{TV}}$. We conclude by the triangle inequality that $\operatorname{TV}(q_{t_N},P_{\textup{data}})\leq 3\varepsilon_{\operatorname{TV}}$, after substituting $\delta=O\left(\frac{\varepsilon_{\operatorname{TV}}^2}{R_0^2L^2}\right)$ for the $\delta$-dependence in the previous theorem. ∎

Proof of Theorem 2.3.

If $P_{\textup{data}}$ is subexponential with a fixed constant $b$, then Assumption 6 holds with $R(\varepsilon)=O\left(\ln\left(\frac{1}{\varepsilon}\right)\right)$, and hence $R_0$ is logarithmic in all parameters; see the one-line check below.
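Indeed (assuming, say, a tail bound of the form $\mathbb{P}(\|X\|>r)\leq 2e^{-r/b}$ with $b=O(1)$): taking $R(\varepsilon)=b\ln\left(\frac{2}{\varepsilon}\right)$ gives
\[
\mathbb{P}(\|X\|>R(\varepsilon))\leq 2e^{-R(\varepsilon)/b}=\varepsilon,
\]
so the truncation of Lemma 5.2 retains all but $\varepsilon$ of the mass with $R(\varepsilon)=O(\ln(1/\varepsilon))$. ∎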

7.2 Wasserstein error guarantees

Proof of Theorem 2.1.

If $T-t_N=\delta$, then $W_2(\widetilde{p}_0,\widetilde{p}_\delta)\leq\sigma_\delta\leq\sqrt{\delta}$. Choosing $\delta=\varepsilon_{\textup{W}}^2$, we see by Theorem 7.2 that it suffices to take

εσ\displaystyle\varepsilon_{\sigma} =O(εTV13/2(εW2)5/2(R4dln(Tδ)ln(RNδεW))9/4).\displaystyle=O\left({\frac{\varepsilon_{\operatorname{TV}}^{13/2}(\varepsilon_{\textup{W}}^{2})^{5/2}}{\left({R^{4}d\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{RN}{\delta\varepsilon_{\textup{W}}}}\right)}\right)^{9/4}}}\right).

Simplifying gives εσ=o~(εTV6.5εW5R9d2.25)\varepsilon_{\sigma}=\widetilde{o}\left({\frac{\varepsilon_{\operatorname{TV}}^{6.5}\varepsilon_{\textup{W}}^{5}}{R^{9}d^{2.25}}}\right). If Assumption 3 also holds, then it suffices to take

εσ\displaystyle\varepsilon_{\sigma} =O(εTV4T5/2C2dln(Tδ)ln(RNδεW)).\displaystyle=O\left({\frac{\varepsilon_{\operatorname{TV}}^{4}}{T^{5/2}C^{2}d\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{RN}{\delta\varepsilon_{\textup{W}}}}\right)}}\right).

Simplifying gives εσ=o~(εTV4C2d)\varepsilon_{\sigma}=\widetilde{o}\left({\frac{\varepsilon_{\operatorname{TV}}^{4}}{C^{2}d}}\right). ∎

Proof of Theorem 2.2.

To obtain purely Wasserstein error guarantees, we include an extra step of replacing any sample ztNqtNz_{t_{N}}\sim q_{t_{N}} falling outside BR(0)B_{R}(0) by 0. Suppose TtN=δT-t_{N}=\delta. Let q^tN\widehat{q}_{t_{N}} be the resulting distribution. Then

$W_2(\widetilde{p}_0,\widehat{q}_{t_N}) \leq W_2(\widetilde{p}_0,\widetilde{p}_\delta)+W_2(\widetilde{p}_\delta,\widehat{q}_{t_N}) \leq \sigma_\delta+W_2(\widetilde{p}_\delta,\widehat{q}_{t_N}) \leq \sqrt{\delta}+W_2(\widetilde{p}_\delta,\widehat{q}_{t_N}).$

We choose $\delta=\frac{\varepsilon_{\textup{W}}^2}{4}$ so that the first term is $\leq\frac{\varepsilon_{\textup{W}}}{2}$. It suffices to bound the second term $W_2(\widetilde{p}_\delta,\widehat{q}_{t_N})$ by $\frac{\varepsilon_{\textup{W}}}{2}$ as well. We bound it in terms of $\operatorname{TV}(\widetilde{p}_\delta,\widehat{q}_{t_N})$, using the fact that $\widehat{q}_{t_N}$ is supported on $B_R(0)$ together with a Gaussian tail calculation for $\widetilde{p}_\delta$. Consider a coupling of $x_{t_N}=\widetilde{x}_\delta\sim\widetilde{p}_\delta$ and $\widehat{z}_{t_N}\sim\widehat{q}_{t_N}$ such that $\widetilde{x}_\delta\neq\widehat{z}_{t_N}$ with probability at most $\varepsilon_{\operatorname{TV}}$. Express $\widetilde{x}_\delta=m_\delta\widetilde{x}_0+\sigma_\delta\xi$ where $\widetilde{x}_0\sim\widetilde{p}_0$. Now

$\mathbb{E}[\|\widetilde{x}_\delta-\widehat{z}_{t_N}\|^2] \leq \sup_{P(A)\leq\varepsilon_{\operatorname{TV}}}2\left(\mathbb{E}[\|m_\delta\widetilde{x}_0-\widehat{z}_{t_N}\|^2\mathbbm{1}_A]+\sigma_\delta^2\mathbb{E}[\|\xi\|^2\mathbbm{1}_A]\right)$

$\leq 2\left(4R^2\varepsilon_{\operatorname{TV}}+\sigma_\delta^2\varepsilon_{\operatorname{TV}}\cdot O\left(d+\ln\left(\frac{1}{\varepsilon_{\operatorname{TV}}}\right)\right)\right),$

where the bound on the second term uses Lemma 6.6. Using $R^2\geq d$ and $\sigma_\delta^2\leq 1$, the right-hand side is $\widetilde{O}(R^2\varepsilon_{\operatorname{TV}})$, so it suffices to choose $\varepsilon_{\operatorname{TV}}=O\left(\frac{\varepsilon_{\textup{W}}^2}{R^2}\right)$ for an appropriate choice of constants. By Theorem 7.2, it suffices to take

εσ\displaystyle\varepsilon_{\sigma} =O((εW2/R2)13/2(εW2)5/2(R4dln(Tδ)ln(RNδεW))9/4).\displaystyle=O\left({\frac{(\varepsilon_{\textup{W}}^{2}/R^{2})^{13/2}\left({\varepsilon_{\textup{W}}^{2}}\right)^{5/2}}{\left({R^{4}d\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{RN}{\delta\varepsilon_{\textup{W}}}}\right)}\right)^{9/4}}}\right).

Simplifying gives $\varepsilon_\sigma=\widetilde{o}\left(\frac{\varepsilon_{\textup{W}}^{18}}{R^{22}d^{2.25}}\right)$.

In case 2, it suffices to take

εσ\displaystyle\varepsilon_{\sigma} =O((εW2/R2)4T5/2(C2dln(Tδ)ln(RNδεW))).\displaystyle=O\left({\frac{(\varepsilon_{\textup{W}}^{2}/R^{2})^{4}}{T^{5/2}(C^{2}d\ln\left({\frac{T}{\delta}}\right)\ln\left({\frac{RN}{\delta\varepsilon_{\textup{W}}}}\right))}}\right).

Simplifying gives εσ=o~(εW8C2R8d)\varepsilon_{\sigma}=\widetilde{o}\left({\frac{\varepsilon_{\textup{W}}^{8}}{C^{2}R^{8}d}}\right). ∎

References

  • [And82] Brian DO Anderson “Reverse-time diffusion equation models” In Stochastic Processes and their Applications 12.3 Elsevier, 1982, pp. 313–326
  • [ARZ18] Sanjeev Arora, Andrej Risteski and Yi Zhang “Do GANs learn the distribution? some theory and empirics” In International Conference on Learning Representations, 2018
  • [BMR20] Adam Block, Youssef Mroueh and Alexander Rakhlin “Generative modeling with denoising auto-encoders and Langevin sampling” In arXiv preprint arXiv:2002.00107, 2020
  • [CCN21] Hong-Bin Chen, Sinho Chewi and Jonathan Niles-Weed “Dimension-free log-Sobolev inequalities for mixture distributions” In Journal of Functional Analysis 281.11 Elsevier, 2021, pp. 109236
  • [Che+21] Sinho Chewi, Murat A Erdogdu, Mufan Bill Li, Ruoqi Shen and Matthew Zhang “Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev” In arXiv preprint arXiv:2112.12662, 2021
  • [Che+22] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim and Anru R. Zhang “Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions” arXiv:2209.11215, 2022
  • [Dat+19] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski and Rosanne Liu “Plug and play language models: A simple approach to controlled text generation” In arXiv preprint arXiv:1912.02164, 2019
  • [De+21] Valentin De Bortoli, James Thornton, Jeremy Heng and Arnaud Doucet “Diffusion Schrödinger bridge with applications to score-based generative modeling” In Advances in Neural Information Processing Systems 34, 2021
  • [De22] Valentin De Bortoli “Convergence of denoising diffusion models under the manifold hypothesis” In arXiv preprint arXiv:2208.05314, 2022
  • [EHZ21] Murat A. Erdogdu, Rasa Hosseinzadeh and Matthew S. Zhang “Convergence of Langevin Monte Carlo in Chi-Squared and Renyi Divergence” In arXiv preprint arXiv:2007.11612, 2021
  • [Gra+19] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi and Kevin Swersky “Your classifier is secretly an energy based model and you should treat it like one” In arXiv preprint arXiv:1912.03263, 2019
  • [HJA20] Jonathan Ho, Ajay Jain and Pieter Abbeel “Denoising diffusion probabilistic models” In Advances in Neural Information Processing Systems 33, 2020, pp. 6840–6851
  • [Jin+22] Bowen Jing, Gabriele Corso, Renato Berlinghieri and Tommi Jaakkola “Subspace Diffusion Generative Models” In arXiv preprint arXiv:2205.01490, 2022
  • [LLT22] Holden Lee, Jianfeng Lu and Yixin Tan “Convergence for score-based generative modeling with polynomial complexity” In arXiv preprint arXiv:2206.06227, 2022
  • [LM00] Beatrice Laurent and Pascal Massart “Adaptive estimation of a quadratic functional by model selection” In Annals of Statistics JSTOR, 2000, pp. 1302–1338
  • [Men+21] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu and Stefano Ermon “SDEdit: Guided image synthesis and editing with stochastic differential equations” In International Conference on Learning Representations, 2021
  • [Ram+22] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu and Mark Chen “Hierarchical text-conditional image generation with clip latents” In arXiv preprint arXiv:2204.06125, 2022
  • [SE19] Yang Song and Stefano Ermon “Generative Modeling by Estimating Gradients of the Data Distribution” In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, 2019
  • [SE20] Yang Song and Stefano Ermon “Improved techniques for training score-based generative models” In arXiv preprint arXiv:2006.09011, 2020
  • [Son+20] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon and Ben Poole “Score-Based Generative Modeling through Stochastic Differential Equations” In International Conference on Learning Representations, 2020
  • [Son+21] Yang Song, Conor Durkan, Iain Murray and Stefano Ermon “Maximum likelihood training of score-based diffusion models” In Advances in Neural Information Processing Systems 34, 2021
  • [Son+21a] Yang Song, Liyue Shen, Lei Xing and Stefano Ermon “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models” In arXiv preprint arXiv:2111.08005, 2021
  • [Ver18] Roman Vershynin “High-dimensional probability: An introduction with applications in data science” Cambridge University Press, 2018
  • [Vin11] Pascal Vincent “A connection between score matching and denoising autoencoders” In Neural computation 23.7 MIT Press, 2011, pp. 1661–1674
  • [ZC22] Qinsheng Zhang and Yongxin Chen “Fast Sampling of Diffusion Models with Exponential Integrator” In arXiv preprint arXiv:2204.13902, 2022

Appendix A High-probability bound on the Hessian

In this section we obtain a high-probability bound on the Hessian of lnp~t\ln\widetilde{p}_{t}, i.e., the Jacobian of the score function.

To see why we expect the Hessian to typically be smaller than the worst-case bound given by Lemma 4.13, note that we can express (27) and (28) as

ln(μφσ2(y))\displaystyle\nabla\ln(\mu*\varphi_{\sigma^{2}}(y)) =1σ2𝔼[YX|Y=y]\displaystyle=-\frac{1}{\sigma^{2}}\mathbb{E}[Y-X|Y=y] (49)
2ln(μφσ2(y))\displaystyle\nabla^{2}\ln(\mu*\varphi_{\sigma^{2}}(y)) =1σ4Cov[YX|Y=y]1σ2Id\displaystyle=\frac{1}{\sigma^{4}}\operatorname{Cov}[Y-X|Y=y]-\frac{1}{\sigma^{2}}I_{d} (50)

where $X\sim\mu$ and $Y=X+\sigma\xi$, $\xi\sim N(0,I_d)$. Unconditionally, $Y-X$ is distributed as $N(0,\sigma^2I_d)$, which suggests that for typical $y$ the Hessian in (50) may be bounded by $O(1/\sigma^2)$ with high probability, rather than only by the worst-case bound of Lemma 4.13. Indeed, we can easily construct an example where the worst case of Lemma 4.13 is attained—for example, $\mu=\frac{1}{2}(\delta_{-v}+\delta_v)$ for $\|v\|_2=R$, at $x=0$—but this point has exponentially small probability density under $\mu*\varphi_{\sigma^2}$ when $\sigma\ll R$.
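A minimal numerical sketch of this dichotomy in one dimension (ours, not from the paper; the values of $R$ and $\sigma$ are illustrative): for $\mu=\frac{1}{2}(\delta_{-R}+\delta_R)$, the smoothed log-density is $\ln p(y)=\mathrm{const}-\frac{y^2}{2\sigma^2}+\ln\cosh\left(\frac{Ry}{\sigma^2}\right)$, so its second derivative is $-\frac{1}{\sigma^2}+\frac{R^2}{\sigma^4}\operatorname{sech}^2\left(\frac{Ry}{\sigma^2}\right)$: of order $R^2/\sigma^4$ at the midpoint $y=0$, but $\approx-1/\sigma^2$ at a typical sample $y\approx\pm R$.

```python
import numpy as np

# Second derivative of ln p for p = (1/2)(N(-R, sigma^2) + N(R, sigma^2)):
#   (ln p)''(y) = -1/sigma^2 + (R^2/sigma^4) * sech(R y / sigma^2)^2
def hessian_log_p(y, R, sigma):
    t = np.clip(R * y / sigma**2, -350.0, 350.0)  # clip to avoid cosh overflow
    return -1.0 / sigma**2 + (R**2 / sigma**4) / np.cosh(t) ** 2

R, sigma = 10.0, 0.1
print(hessian_log_p(0.0, R, sigma))  # ~ R^2/sigma^4 = 1e6: worst case at the midpoint
print(hessian_log_p(R, R, sigma))    # ~ -1/sigma^2 = -100: typical under mu * phi
```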

The following lemma uses an $\varepsilon$-net argument to bound, with high probability, the operator norm of the conditional second moment matrix of a subgaussian random variable.

Lemma A.1.

Suppose $X$ is an $\mathbb{R}^d$-valued random variable over the probability space $(\Omega,\mathcal{G},P)$, and $\mathcal{F}\subseteq\mathcal{G}$ is a sub-$\sigma$-algebra. If $X$ is subgaussian, then

$\mathbb{P}\left(\left\|\mathbb{E}\left[XX^\top\,\middle|\,\mathcal{F}\right]\right\|\geq 2\|X\|_{\psi_2}^2\ln\left(\frac{2\cdot5^d}{\varepsilon}\right)\right)\leq\varepsilon.$
Proof.

By Jensen’s inequality and Markov’s inequality, for any v𝕊d1v\in\mathbb{S}^{d-1},

(𝔼[vXXv|]λ2)\displaystyle\mathbb{P}\left({\mathbb{E}[v^{\top}XX^{\top}v|\mathcal{F}]\geq\lambda^{2}}\right) =(e𝔼[vXXv|]/c2eλ2/c2)\displaystyle=\mathbb{P}\left({e^{\mathbb{E}[v^{\top}XX^{\top}v|\mathcal{F}]/c^{2}}\geq e^{\lambda^{2}/c^{2}}}\right)
(𝔼[eX,v2/c2|]eλ2/c2)\displaystyle\leq\mathbb{P}\left({\mathbb{E}\left[{e^{\left\langle{X,v}\right\rangle^{2}/c^{2}}|\mathcal{F}}\right]\geq e^{\lambda^{2}/c^{2}}}\right)
$\leq\frac{\mathbb{E}\left[\mathbb{E}[e^{\langle X,v\rangle^2/c^2}|\mathcal{F}]\right]}{e^{\lambda^2/c^2}}=\frac{\mathbb{E}\left[e^{\langle X,v\rangle^2/c^2}\right]}{e^{\lambda^2/c^2}}\leq 2e^{-\lambda^2/\|X\|_{\psi_2}^2},$

where the last inequality follows from taking $c=\|X\|_{\psi_2}$ and using the definition of the subgaussian norm. Now take a $\frac{1}{2}$-net $\mathcal{N}$ of $\mathbb{S}^{d-1}$ of size $\leq 5^d$ [Ver18, Cor. 4.2.13]. By a union bound,

(v𝒩:𝔼[vXXv|]λ2)\displaystyle\mathbb{P}\left({\exists v\in\mathcal{N}:\mathbb{E}[v^{\top}XX^{\top}v|\mathcal{F}]\geq\lambda^{2}}\right) 5d2eλ2/Xψ22=ε\displaystyle\leq 5^{d}\cdot 2\cdot e^{-\lambda^{2}/\left\|{X}\right\|_{\psi_{2}}^{2}}=\varepsilon

when we take $\lambda=\|X\|_{\psi_2}\sqrt{\ln\left(\frac{2\cdot5^d}{\varepsilon}\right)}$. By [Ver18, Lemma 4.4.1], for a symmetric matrix $A$ the operator norm can be bounded by the quadratic form on the net:

$\|A\|\leq 2\sup_{v\in\mathcal{N}}\left|\langle Av,v\rangle\right|=2\sup_{v\in\mathcal{N}}|v^\top Av|.$

The result follows from applying this to $A=\mathbb{E}[XX^\top|\mathcal{F}]$. ∎

From this we obtain the desired high-probability bound.

Lemma A.2.

There is a universal constant CC such that the following holds. For any starting distribution P~0\widetilde{P}_{0}, letting P~t\widetilde{P}_{t} be the law of the DDPM process (1) at time tt, we have

P~t(2lnp~t(x)C(d+ln(1ε))σt2)\displaystyle\widetilde{P}_{t}\left({\left\|{\nabla^{2}\ln\widetilde{p}_{t}(x)}\right\|\leq\frac{C(d+\ln\left({\frac{1}{\varepsilon}}\right))}{\sigma_{t}^{2}}}\right) 1ε.\displaystyle\geq 1-\varepsilon.

Note that there is no dependence on the radius.

Proof.

Apply (50) with $\mu=(M_{m_t})_\sharp\widetilde{P}_0$, the pushforward of $\widetilde{P}_0$ under scaling by $m_t$, to obtain $\nabla^2\ln\widetilde{p}_t$. Noting that $Y-X\sim N(0,\sigma_t^2I_d)$ is subgaussian with $\|Y-X\|_{\psi_2}\leq C_2\sigma_t$ for some universal constant $C_2$, and that $\operatorname{Cov}[Y-X|Y=y]\preceq\mathbb{E}[(Y-X)(Y-X)^\top|Y=y]$, the result follows from applying Lemma A.1 to $Y-X$ with $\mathcal{F}=\sigma(Y)$. ∎