Convergence of score-based generative modeling
for general data distributions
Abstract
Score-based generative modeling (SGM) has grown into a hugely successful method for learning to generate samples from complex data distributions such as those of images and audio. It is based on evolving an SDE that transforms white noise into a sample from the learned distribution, using estimates of the score function, i.e., the gradient of the log-pdf. Previous convergence analyses for these methods have suffered either from strong assumptions on the data distribution or from exponential dependencies, and hence fail to give efficient guarantees for the multimodal and non-smooth distributions that arise in practice and for which good empirical performance is observed. We consider a popular kind of SGM—denoising diffusion models—and give polynomial convergence guarantees for general data distributions, with no assumptions related to functional inequalities or smoothness. Assuming L²-accurate score estimates, we obtain Wasserstein distance guarantees for any distribution of bounded support or sufficiently decaying tails, as well as TV guarantees for distributions under further smoothness assumptions.
1 Introduction
Diffusion models have gained huge popularity in recent years in machine learning, as a method to learn and generate new samples from a data distribution. Score-based generative modeling (SGM), as a particular kind of diffusion model, uses learned score functions (gradients of the log-pdf) to transform white noise into the data distribution by following a stochastic differential equation. While SGM has achieved state-of-the-art performance for artificial image and audio generation [SE19, Dat+19, Gra+19, SE20, Son+20, Men+21, Son+21a, Son+21, Jin+22], including being a key component of text-to-image systems [Ram+22], our theoretical understanding of these models is still nascent.
In particular, basic questions on the convergence of the generated distribution to the data distribution remain unanswered. Recent theoretical work on SGM has attempted to answer these questions [De+21, LLT22, De22], but these works either suffer from exponential dependence on parameters or rely on strong assumptions on the data distribution such as functional inequalities or smoothness, which are rarely satisfied in practical situations. For example, considering the hallmark application of generating images from text, we expect the distribution of images to be (a) multimodal, and hence not satisfy functional inequalities with reasonable constants, and (b) supported on lower-dimensional manifolds, and hence not be smooth. However, SGM still performs remarkably well in these settings. Indeed, this is one relative advantage over other approaches to generative modeling such as generative adversarial networks, which can struggle to learn multimodal distributions [ARZ18].
In this work, we aim to develop theoretical convergence guarantees with polynomial complexity for SGM under minimal data assumptions.
1.1 Problem setting
Given samples from a data distribution q, the problem of generative modeling is to learn the distribution in a way that allows generation of new samples. A general framework for many score-based generative models is one where noise is injected into q via a forward SDE [Son+20]
dx_t = f(x_t, t) dt + g(t) dB_t,   x_0 ∼ q,   (1)
where B_t is a standard Brownian motion. Let q_t denote the density of x_t. Remarkably, the reverse process x̄_t := x_{T−t} also satisfies a reverse-time SDE,
dx̄_t = [−f(x̄_t, T−t) + g(T−t)² ∇ ln q_{T−t}(x̄_t)] dt + g(T−t) dB̄_t,   (2)
where B̄_t is a backward Brownian motion [And82]. Because the forward process transforms the data distribution into noise, the hope is to use the backward process to transform noise into samples.
In practice, when we only have sample access to q, the score function ∇ ln q_t is not available. A key mechanism behind SGM is that the score function is learnable from data, by empirically minimizing a denoising objective evaluated at noisy samples [Vin11]. The noisy samples are obtained by evolving the forward SDE starting from the data samples, and the optimization is done within an expressive function class such as neural networks. If the score function is successfully approximated in this way, then the L²-error will be small. The natural question is then the following:
Given L²-error bounds on the score function, how close is the distribution generated by (2) (with the score estimate in place of ∇ ln q_t, and appropriate discretization) to the data distribution q?
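As a concrete and purely illustrative instance of the denoising objective described above (a toy sketch of ours, not the paper's algorithm; all names and parameter values are assumptions), consider one-dimensional Gaussian data. For q = N(0, v) the forward OU marginal at time t is N(0, var_t) with var_t = e^{-2t} v + (1 - e^{-2t}), whose score is -x/var_t, and least squares on the denoising objective recovers exactly this slope:

```python
import math, random

# Toy denoising score matching in 1-D (illustrative sketch, not from the paper).
# Noised samples: x_t = exp(-t)*x0 + sigma*xi with sigma^2 = 1 - exp(-2t).
# Fit a linear score model s(x) = a*x by least squares on the denoising
# objective E[(a*x_t + xi/sigma)^2]; the minimizer satisfies a ≈ -1/var_t.

random.seed(0)
v, t, n = 2.0, 0.5, 200_000
sigma = math.sqrt(1 - math.exp(-2 * t))

sxx = sxy = 0.0
for _ in range(n):
    x0 = random.gauss(0.0, math.sqrt(v))      # data sample from q = N(0, v)
    xi = random.gauss(0.0, 1.0)               # injected noise
    xt = math.exp(-t) * x0 + sigma * xi       # noised sample
    # accumulate least-squares statistics for a: minimize sum (a*xt + xi/sigma)^2
    sxx += xt * xt
    sxy += xt * (xi / sigma)
a_hat = -sxy / sxx

var_t = math.exp(-2 * t) * v + (1 - math.exp(-2 * t))
true_slope = -1.0 / var_t                     # slope of the true score of q_t
```

In practice the same objective is minimized with a neural network in place of the scalar slope a.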
We note it is more realistic to consider L² rather than L∞-error, and this makes the analysis more challenging. Indeed, prior work on Langevin Monte Carlo [EHZ21] and related sampling algorithms only applies when the score function is known exactly, or, with suitable modification, known up to L∞-error. L²-error has a genuinely different effect from L∞-error, as it can cause the stationary distribution for Langevin Monte Carlo to be arbitrarily different [LLT22], necessitating a “medium-time” analysis.
In addition, we hope to obtain a result with as few structural assumptions as possible on , so that it can be useful in realistic scenarios where SGM is applied.
1.2 Prior work on convergence guarantees
We highlight two recent works which make progress on this problem. [LLT22] are the first to give polynomial convergence guarantees in TV distance under an L²-accurate score estimate for a reasonable family of distributions. They introduce a framework that reduces the analysis under an L²-accurate score to that under an L∞-accurate score. However, they rely on the data distribution satisfying smoothness conditions and a log-Sobolev inequality—a strong assumption which essentially limits the guarantees to unimodal distributions.
[De22] instead make minimal data assumptions, giving convergence in Wasserstein distance for distributions with bounded support. In particular, this covers the case of distributions supported on lower-dimensional manifolds, where guarantees in TV distance are unattainable. However, for general distributions, their guarantees have exponential dependence on the diameter of the support and on the inverse of the desired error, and for smooth distributions, an improved, but still exponential, dependence on the growth rate of the Hessian as the noise approaches 0.
We note that other works also analyze the generalization error of a learned score estimate [BMR20, De22]. This is an important question because without further assumptions, learning an L²-accurate score estimate requires a number of samples exponential in the dimension. As this is beyond the scope of our paper, we assume that an L²-accurate score estimate is given.
1.3 Our contributions
In this work, we analyze convergence in the most general setting of distributions of bounded support, as in [De22]. We give Wasserstein bounds for any distribution of bounded support (or sufficiently decaying tails), and TV bounds for distributions under smoothness assumptions, that are polynomial in all parameters and do not rely on the data distribution satisfying any functional inequality. This gives theoretical grounding to the empirical success of SGM on data distributions that are often multimodal and non-smooth.
We streamline the L∞-based analysis of [LLT22], with significant changes so as to completely remove the use of functional inequalities. In particular, the biggest challenge—and our key improvement—is to bound a certain KL-divergence term without reliance on a global functional inequality. For this, we prove a key lemma that distributions which are close in χ²-divergence have score functions that are close in L² (which may be of independent interest), and then a structural result that the distributions arising from the diffusion process can be slightly modified so as to satisfy the desired inequality, through decomposition into distributions that do satisfy a log-Sobolev inequality.
Upon finishing our paper, we learned of concurrent and independent work [Che+22] which obtained theoretical guarantees for score-based generative modeling under similarly general assumptions on the data distribution. We note that although our bounds are obtained under similar assumptions (with our assumption on the score estimate accuracy slightly weaker than theirs), our proof techniques are quite different. Following the “bad set” idea from [LLT22], we derive a change-of-measure inequality in Theorem 7.1, while the analysis in [Che+22] is based on a Girsanov approach.
2 Main results
To state our results, we will consider a specific type of SGM called denoising diffusion probabilistic modeling (DDPM) [HJA20], where in the forward SDE (1) we take f(x, t) = −g(t)² x and diffusion coefficient √2 g(t), for some non-decreasing function g to be chosen. The forward process is then an Ornstein–Uhlenbeck process with time rescaling: x_t has the same distribution as
e^{−T(t)} x_0 + (1 − e^{−2T(t)})^{1/2} ξ,   where ξ ∼ N(0, I_d) and T(t) = ∫₀ᵗ g(s)² ds.   (3)
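As a sanity check of this closed form (a toy simulation written for this note, not taken from the paper, with g ≡ 1 so that the forward SDE is dx = −x dt + √2 dB), an Euler–Maruyama discretization started from a fixed point should reproduce the mean e^{−t} x₀ and variance 1 − e^{−2t}:

```python
import math, random

# Monte Carlo check (illustrative, not the paper's code): simulate the OU SDE
# dx = -x dt + sqrt(2) dB from x0 = 1.5 and compare against the closed form
# x_t ~ exp(-t)*x0 + sqrt(1 - exp(-2t)) * N(0, 1).

random.seed(1)
x0, t_end, steps, paths = 1.5, 1.0, 100, 10_000
dt = t_end / steps

xs = []
for _ in range(paths):
    x = x0
    for _ in range(steps):
        x += -x * dt + math.sqrt(2 * dt) * random.gauss(0.0, 1.0)
    xs.append(x)

mean_mc = sum(xs) / paths
var_mc = sum((y - mean_mc) ** 2 for y in xs) / paths
mean_exact = math.exp(-t_end) * x0       # ≈ 0.5518
var_exact = 1 - math.exp(-2 * t_end)     # ≈ 0.8647
```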
Given an estimated score function s(x, t) approximating ∇ ln q_t(x), we can simulate the reverse process (reparameterizing time and writing x̄_t for the reverse process)
dx̄_t = g(T−t)² [x̄_t + 2 s(x̄_t, T−t)] dt + √2 g(T−t) dB̄_t,   (4)
with the exponential integrator discretization [ZC22], where 0 = t_0 < t_1 < ⋯ < t_N are the discretization points and h_k = t_{k+1} − t_k:
(5)
(6) |
We initialize at a prior distribution that approximates q_T for sufficiently large T:
x̄_0 ∼ N(0, I_d).   (7)
While we focus on DDPM, we note that the continuous process underlying DDPM is equivalent to that of score-matching Langevin diffusion (SMLD) under reparameterization in time and space (see [LLT22, §C.2]). We will further take g ≡ 1 for convenience in stating our results.
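To make the exponential integrator concrete, here is a hypothetical minimal sampler (our own sketch with g ≡ 1, not the paper's code; all names are assumptions). For Gaussian data q = N(μ, 1) the forward marginal is q_t = N(e^{−t} μ, 1), so the exact score −(x − e^{−t} μ) can stand in for the learned estimate. Over a step of size h with the score frozen, the linear SDE dx̄ = (x̄ + 2s) dt + √2 dB̄ solves in closed form, giving the update below:

```python
import math, random

# Minimal exponential-integrator sampler for the reverse SDE with g ≡ 1
# (illustrative sketch). With the score value s frozen over a step of size h,
# dx = (x + 2*s) dt + sqrt(2) dB integrates exactly to
#   x' = e^h * x + 2*(e^h - 1)*s + sqrt(e^{2h} - 1) * N(0, 1).

random.seed(2)
mu, T, h, paths = 2.0, 4.0, 0.02, 8_000
steps = round(T / h)

def score(x, t):                      # exact score of q_t = N(exp(-t)*mu, 1)
    return -(x - math.exp(-t) * mu)   # stands in for a learned estimate

out = []
for _ in range(paths):
    x = random.gauss(0.0, 1.0)        # prior N(0, I), approximating q_T
    t = T                             # current forward time
    for _ in range(steps):
        s = score(x, t)
        x = (math.exp(h) * x + 2 * (math.exp(h) - 1) * s
             + math.sqrt(math.exp(2 * h) - 1) * random.gauss(0.0, 1.0))
        t -= h
    out.append(x)

mean_out = sum(out) / paths
var_out = sum((y - mean_out) ** 2 for y in out) / paths
```

With the true score plugged in, the output should be approximately N(μ, 1), up to O(h) discretization error and the small e^{−T} prior mismatch.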
Our goal is to obtain a quantitative guarantee for the distance between the distribution of the algorithm's output (at a time just before T) and q, under an L² score error guarantee. In the following, we assume a sequence of discretization points has been chosen.
Assumption 1 (L² score error).
For any t, the error in the score estimate is bounded in L²(q_t):
We note that the score ∇ ln q_t grows as t → 0, so this is a reasonable assumption, and quantitatively weaker than a uniform bound over all times.
Assumption 2 (Bounded support).
The data distribution q is supported on the ball B(0, R).
For simplicity, we assume bounded support when stating our main theorems, but note that our results generalize to distributions with sufficiently fast power decay. In the application of image generation, pixel values are bounded, so Assumption 2 is satisfied with R typically on the order of √d.
These are the only assumptions we need to obtain a polynomial complexity guarantee. We also consider the following stronger smoothness assumption, which is Assumption A.6 in [De22] and will give better dependencies. Note that [De22, Theorem I.8] shows a (nonuniform) version of Assumption 3 holds when the data distribution is a smooth density on a convex submanifold.
Assumption 3.
The following bound on the Hessian of the log-pdf holds for all t and x:
for some constant.
Finally, the following smoothness assumption on will allow us to obtain TV guarantees.
Assumption 4.
q admits a density of the form e^{−V}, where V is L-smooth.
We are now ready to state our main theorems.
Theorem 2.1 (Wasserstein+TV error for distributions with bounded support).
Suppose that Assumptions 1 and 2 hold. Then there is a sequence of discretization points such that if the score error in Assumption 1 is small enough (polynomially in all parameters), then the distribution of the output of DDPM is close in TV distance to a distribution that is close in W₂-distance to q. If in addition Assumption 3 holds, a polynomially weaker bound on the score error suffices (note that the Õ(·) notation hides logarithmic dependence).
This result is perhaps surprising at first glance, as it is well known that for sampling algorithms such as Langevin Monte Carlo, structural assumptions on the target distribution—such as a log-Sobolev inequality—are required to obtain similar theoretical guarantees, even with knowledge of the exact score function. The key reason that we can do better is that we utilize a sequence of score functions along the reverse SDE, which is not available in standard sampling settings. Moreover, we choose T large enough that q_T is close to the prior, so it suffices to track the evolution of the true process (2), that is, to maintain rather than decrease the error. To some extent, this result shows the power of DDPM and other reverse-SDE-based methods compared with generative modeling based on standard Langevin Monte Carlo.
A statement with more precise dependencies, which also applies to unbounded distributions with sufficiently decaying tails, can be found as Theorem 7.2. We note that under the Hessian bound (Assumption 3), up to logarithmic factors, the same score error bound suffices to obtain a fixed TV distance to a distribution arbitrarily close in W₂-distance. By truncating the resulting distribution, we can also obtain purely Wasserstein error bounds.
Theorem 2.2 (Wasserstein error for distributions with bounded support).
With an extra assumption on the smoothness of q, we can also obtain purely TV error bounds:
Theorem 2.3 (TV error for distributions under smoothness assumption).
3 Proof overview
Our proof uses the framework of [LLT22] to convert guarantees under an L∞-accurate score function into guarantees under an L²-accurate score function. For the analysis under an L∞-accurate score function, we interpolate the discrete process with estimated score, and derive a differential inequality for the χ²-divergence between the interpolated and true processes.
We bound the resulting error terms, making ample use of the Donsker–Varadhan variational principle to convert expectations into expectations under the reference process. Under small enough step sizes, this shows that the χ²-divergence grows slowly (Theorem 4.10), which suffices, as χ²-divergence decays exponentially in the forward process.
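For illustration (a toy Gaussian example of ours, not from the paper): under the OU forward process with g ≡ 1, p₀ = N(m, 1) evolves to p_t = N(m e^{−t}, 1), and its χ²-divergence to the stationary N(0, 1) decays at least at rate e^{−2t}. The quadrature below checks this numerically:

```python
import math

# chi^2(N(mu,1) || N(0,1)) by midpoint quadrature of  int p^2/gamma dx - 1.
# (Closed form: exp(mu^2) - 1; the quadrature avoids trusting that formula.)

def chi2_vs_std_normal(mu, lo=-12.0, hi=12.0, n=24_000):
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        # p(x)^2 / gamma(x) = exp(-(x-mu)^2 + x^2/2) / sqrt(2*pi)
        total += math.exp(-(x - mu) ** 2 + 0.5 * x * x) / math.sqrt(2 * math.pi) * dx
    return total - 1.0

m = 1.0
chi2_0 = chi2_vs_std_normal(m)                 # divergence at t = 0
chi2_1 = chi2_vs_std_normal(m * math.exp(-1))  # divergence at t = 1
```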
The most challenging error term to deal with is a KL divergence term. Our main innovation over the analysis of [LLT22] is bounding this term without a global log-Sobolev inequality for the data distribution. We note that it suffices for the distribution to be a mixture of distributions each satisfying a log-Sobolev inequality, with the logarithm of the minimum mixture weight bounded below, and in Lemma 5.2, we show that we can decompose any distribution of bounded support in this manner if we move a small amount of its mass.
In Section 6, we show that this does not significantly affect the estimate of the score function, by interpreting the score function as the solution to a Bayesian inference problem: that of denoising a noised data point. More precisely, we show in Lemma 6.5 that the difference between the score functions of two different distributions can be bounded in terms of the distance between the distributions, which may be of independent interest.
Finally, we reduce from the L∞ to the L² setting by bounding the probabilities of hitting a bad set where the score error is large, and carefully choosing parameters (Section 7). This gives a TV error bound to the forward distribution at a small positive time. We can then bound the Wasserstein distance from this distribution to q (in the general case) or the TV distance (under additional smoothness of q).
In Section A we show that the Hessian of the log-pdf is bounded with high probability (cf. Assumption 3). We speculate that a high-probability rather than uniform bound on the Hessian (as in Lemma 4.13) can be used to obtain better dependencies, and leave this as an open problem.
Notation and definitions
We let q_t denote the density of x_t under the forward process (1). Note that q_0 = q may not admit a density, but q_t will for t > 0. For the reverse process, we use the notation q̄_t = q_{T−t}. We defined the reverse process and its coefficients in (2),
and note that , where denotes multiplication by , denotes the pushforward of the measure by , and is the density of . When , we note the bound and .
We will let denote the (interpolated) discrete process (see (12)) and let be the density of . We define
(8)
and note that is a probability density. We defined in (6).
We denote the estimated score function by s(x, t) or s_t(x), interchangeably.
A random variable is subgaussian with constant if
A -valued random variable is subgaussian with constant if for all , is subgaussian. We also define
Given a probability measure on ℝ^d with density, the associated Dirichlet form is defined as usual. We say that a log-Sobolev inequality (LSI) holds with constant C if for any probability density,
(9) |
Note that this quantity is also known as the Fisher information of one density with respect to the other. Alternatively, defining the entropy by Ent_p(g) = E_p[g ln g] − (E_p[g]) ln E_p[g], the LSI states that for any g ≥ 0,
(10) |
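As a numerical illustration of the LSI (normalizations of the constant vary; the check below uses Gross's form Ent_γ(f²) ≤ 2 E_γ|f′|² for the standard Gaussian γ, which is an assumption of this sketch rather than the paper's convention), the functions f(x) = e^{λx/2} attain equality:

```python
import math

# Quadrature check of the Gaussian log-Sobolev inequality in Gross's
# normalization; f(x) = exp(lam*x/2) is an equality case.

def gauss_quad(func, lo=-12.0, hi=12.0, n=24_000):
    # midpoint rule for E_gamma[func] under the standard Gaussian gamma
    dx = (hi - lo) / n
    s = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        g = math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
        s += func(x) * g * dx
    return s

lam = 0.8
f_sq = lambda x: math.exp(lam * x)                      # f(x)^2
df_sq = lambda x: (0.5 * lam) ** 2 * math.exp(lam * x)  # |f'(x)|^2

mean_f_sq = gauss_quad(f_sq)
ent = gauss_quad(lambda x: f_sq(x) * math.log(f_sq(x))) - mean_f_sq * math.log(mean_f_sq)
rhs = 2 * gauss_quad(df_sq)                             # Dirichlet-form side
```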
4 DDPM with L∞-accurate score estimate
We consider the error between the exact backwards SDE (4) and the exponential integrator with estimated score (5). In this section, we bound the error assuming that the score estimate is accurate in .
Assumption 5 (L∞ score error).
For any , the error in the score estimate is bounded:
(11) |
for some non-decreasing function .
In Section 7, we will relax this condition to the score function being accurate in L².
First, we construct the following continuous-time process which interpolates the discrete-time process (5), for :
(12) |
Integration gives that
(13) |
where is defined in (6).
Letting be the distribution of and be the distribution of , we have by [LLT22, Lemma A.2] that
(14) |
(Note that in our case, also depends on rather than just , but this does not change the calculation.) Define as in (8): , .
To bound (14), we use the following lemma.
Lemma 4.1 (cf. [EHZ21, Lemma 1], [LLT22, Lemma A.3]).
For any and any -valued random variable , we have
Proof.
By Young’s inequality,
Lemma 4.2.
Suppose that (11) holds for , is -Lipschitz, is non-decreasing, and that . Then we have for that
Proof.
Lemma 4.3.
Suppose that . Then for ,
Proof.
Lemma 4.4.
For ,
Proof.
Note that is a Gaussian random vector with variance
(Note that this calculation shows that the continuous-time process (12) does agree with the discrete-time process (5) at the discretization times.) Using the Donsker–Varadhan variational principle, for any random variable X,
Applying this to for a constant to be chosen later, and such that , we can bound
(16)
(17) |
Now following [Che+21, Theorem 4], we set , so that
Next, we have
Noting that , we have that
Substituting everything into (17) gives the desired inequality. ∎
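The Donsker–Varadhan variational principle invoked above, E_P[f] ≤ KL(P‖Q) + ln E_Q[e^f], can be checked exactly on a toy discrete example (ours, for illustration), with equality at f = ln(dP/dQ):

```python
import math

# Exact check of Donsker-Varadhan on three-point distributions.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
f = [1.0, -0.5, 2.0]          # arbitrary test function

kl = sum(p * math.log(p / q) for p, q in zip(P, Q))
lhs = sum(p * fi for p, fi in zip(P, f))
rhs = kl + math.log(sum(q * math.exp(fi) for q, fi in zip(Q, f)))

# equality case: f = log(dP/dQ)
f_star = [math.log(p / q) for p, q in zip(P, Q)]
lhs_star = sum(p * fi for p, fi in zip(P, f_star))
rhs_star = kl + math.log(sum(q * math.exp(fi) for q, fi in zip(Q, f_star)))
```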
Let
(18)
(19)
(20)
(21) |
In order to bound the RHS in Lemma 4.2, we need to bound all four of these quantities, which we do in Lemma 4.5, Corollary 4.6, Lemma 4.8, and Section 5, respectively. The main innovation in our analysis compared to [LLT22] is a new way to bound the last quantity, which we present in a separate section.
First we bound . Recall the norm
(In other words, this is the usual Orlicz norm applied to .)
Lemma 4.5.
For ,
Proof.
By the Donsker-Varadhan variational principle,
for any . Choosing , we have which gives the desired inequality. ∎
The following bounds ; note that the proof does not depend on the definition of , only that it is a probability density.
We use the following lemma to bound in Lemma 4.8.
Lemma 4.7 ([LLT22, Lemma C.12]).
Suppose that is a probability density on , where is -smooth. Let and denote the density function of . Then for ,
Lemma 4.8.
Suppose that where is -smooth () and . For ,
Proof.
Now we put everything together. Write for short. Suppose is non-increasing. By Lemma 4.2,
By Lemma 4.8, , so
By Lemma 4.5, , and by Corollary 4.6, , so
Now, if , then
Let . Assume that . Then we obtain
If , then
If , we get
Integration gives
Taking appropriate step sizes then gives the following Theorem 4.10. We first introduce a technical definition.
Definition 4.9.
Let . We say that has at most power growth and decay (with some constant ) if .
Theorem 4.10.
Suppose that the following hold.
-
1.
Assumption 5 holds for .
-
2.
.
-
3.
The KL bound holds for any density and , where .
-
4.
have at most power growth and decay (with some constant), in the sense of Definition 4.9.
Then there is some constant (depending on ) such that if the step sizes satisfy
then for ,
Proof.
This follows from the above calculations and the observation that if we replace by , for some satisfying the power growth and decay assumption, then we change the bound by at most a constant factor, because the step size satisfies . ∎
We specialize this theorem to the case of distributions with bounded support. Note that although not every initial distribution may satisfy a KL inequality as required by condition 3 of Theorem 4.10, Lemma 5.2 will give the existence of a distribution that does, and is close in TV-error. Later in Section 6, we show that this will have a small effect on the score function, and hence allow us to prove our main theorems.
Corollary 4.11.
Proof.
For , note that . From Lemma 4.13, we can choose
From Lemma 4.15, we can choose
The KL inequality (30) gives us
We now check the requirements on . We need
(22)
(23)
(24) |
For , (24) is implied by
and for ,
Finally, the last requirement is
As long as and , the last equation implies all the others. Plugging this into Theorem 4.10 gives the result. ∎
Above, we use the Hessian bound given in Lemma 4.13. Under the stronger smoothness assumption given by Assumption 3, we can take the step sizes to be larger.
Corollary 4.12.
Proof.
4.1 Auxiliary bounds
In this section we give bounds on the Hessian of the log-pdf (Lemma 4.13), the initial χ²-divergence (Lemma 4.14), and the Orlicz norm (Lemma 4.15).
Lemma 4.13 (Hessian bound).
Suppose that is a probability measure supported on a bounded set with radius . Then letting denote the density of ,
(25) |
Therefore, for supported on , , we have
(26) |
Proof.
Let denote the density weighted with the Gaussian, that is, . We note the following calculations:
(27)
(28) |
Lemma 4.14 (Bound on initial χ²-divergence).
Suppose that is supported on . Let . Then
and for and , we have .
Proof.
We have for that
Using convexity of -divergence then gives the result. For , we have
Lemma 4.15 (Subgaussian bound).
Proof.
Let s.t. for some independent of . Define , then for ,
where Γ is the gamma function and the remaining constant is absolute. Therefore
where . Now consider , then for some small enough, by Taylor expansion,
Note that , while Stirling’s approximation yields . Substituting these two bounds, we get
provided that , in which case the geometric series above converges. To bound this quantity further, we can use the numeric inequality which is valid for . It follows that
Now set , then
which implies that
5 Bounding the KL divergence
In this section, we bound the quantity , where is as in (8). While is defined by the DDPM process, in this section we do not assume it is the density of the discretized process; rather, it can be any density for which the relevant quantities are finite.
Lemma 5.1.
Suppose that is a probability measure on such that
(29) |
where , , and each is a probability measure. For , let and be the densities obtained by running the forward DDPM process (1) for time , and , . Let and suppose all the satisfy a log-Sobolev inequality with constant . Then for any , where is as in (8)
While we need the distribution to satisfy a log-Sobolev inequality to get a bound of the form in [LLT22, Lemma C.8], we note that if we allow additive slack, it suffices for it to be a mixture of distributions satisfying a log-Sobolev inequality, with the logarithm of the minimum mixture weight bounded below. In Lemma 5.2 we will see that we can decompose almost any distribution of bounded support in this manner, if we move a small amount of the mass.
Proof.
Let be the function
By decomposition of entropy and the fact that each satisfies LSI with constant ,
where the last inequality follows from noting is a probability mass function on , so that and
Lemma 5.2.
Suppose , and that is a probability measure such that . Let denote the covering number of with balls of radius . Given , there exists a distribution such that and considering the DDPM process started with , for all ,
In particular, for in ,
(30) |
Proof.
Partition into disjoint subsets , of diameter at most , and decompose
where is supported on and . We will zero out the coefficients of all small components: let and
and define
Note that . As probability distributions on ,
and hence the same bound holds for . Note each is supported on a set of diameter . By Theorem 1 of [CCN21], noting that
when and , satisfies a log-Sobolev inequality with constant . The result then follows from Lemma 5.1. For , we use the bound [Ver18, Corollary 4.2.13]. ∎
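A discrete toy analogue of this mass-moving step (our illustration, not the paper's construction): zeroing out mixture components below a weight threshold and renormalizing moves total variation exactly equal to the dropped mass, while every surviving mixture weight stays bounded below:

```python
# Zero out small mixture components and renormalize; the TV distance to the
# original distribution equals the dropped mass, and the logarithm of the
# minimum surviving weight is controlled by the threshold.

weights = [0.40, 0.30, 0.15, 0.10, 0.04, 0.009, 0.001]
threshold = 0.01

kept = [w if w >= threshold else 0.0 for w in weights]
dropped = sum(weights) - sum(kept)           # total zeroed-out mass
tilde = [w / sum(kept) for w in kept]        # renormalized distribution

tv = 0.5 * sum(abs(w - t) for w, t in zip(weights, tilde))
min_kept_weight = min(t for t in tilde if t > 0)
```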
In the next section, we show that we can move a small amount of mass without significantly affecting the score function. This is necessary, as our guarantees on the score estimate are for the original distribution and not the perturbed one in Lemma 5.2.
6 The effect of perturbing the data distribution on the score function
In this section we consider the effect of perturbing the data distribution on the score function. The key observation is that the score function can be interpreted as the solution to an inference problem, that of recovering the original data point from a noisy sample, with data distribution as the given prior distribution. We show through a coupling argument that we can bound the difference between the score functions in terms of the distance between the two data distributions. This will allow us to “massage” the data distribution in order to optimally bound in Section 5.
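This Bayesian interpretation is Tweedie's identity: with forward marginal x_t = e^{−t} x_0 + σ_t ξ and σ_t² = 1 − e^{−2t}, one has ∇ ln q_t(x) = (e^{−t} E[x_0 | x_t = x] − x)/σ_t². The toy check below (our example, with data two point masses at ±a, so of bounded support) compares this posterior-mean formula against a finite-difference derivative of ln q_t:

```python
import math

# Tweedie check for q0 = (1/2) delta_{+a} + (1/2) delta_{-a}:
# q_t is a two-component Gaussian mixture with means ±m = ±exp(-t)*a.

a, t = 1.0, 0.3
m = math.exp(-t) * a
s2 = 1 - math.exp(-2 * t)          # sigma_t^2

def log_qt(x):                      # log-density of q_t up to a constant
    return math.log(
        0.5 * math.exp(-(x - m) ** 2 / (2 * s2))
        + 0.5 * math.exp(-(x + m) ** 2 / (2 * s2))
    )

def posterior_mean(x):              # E[x0 | x_t = x] by Bayes' rule
    wp = math.exp(-(x - m) ** 2 / (2 * s2))
    wm = math.exp(-(x + m) ** 2 / (2 * s2))
    return a * (wp - wm) / (wp + wm)

max_gap = 0.0
for i in range(-20, 21):
    x = 0.1 * i
    eps = 1e-5
    score_fd = (log_qt(x + eps) - log_qt(x - eps)) / (2 * eps)   # finite diff
    score_tw = (math.exp(-t) * posterior_mean(x) - x) / s2       # Tweedie
    max_gap = max(max_gap, abs(score_fd - score_tw))
```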
6.1 Perturbation under error and truncation
We first give a general lemma on denoising error from a mismatched prior.
Lemma 6.1 (Denoising error from mismatched prior).
Let be a probability density on , and be measures on . For , let denote the joint distribution of and where , and let denote the marginal distribution of . Let
Let and . Then
For , the upper bound is .
Note that the tricky part of the proof is to deal with the estimator that can be thought of as inferring the signal assuming the incorrect prior, rather than the actual prior.
Proof.
For notational clarity, we will denote draws from the conditional distribution as and , for example . Let . Let be a coupling of such that with probability and with probability . We have
Define a measure (not necessarily a probability measure) on by
Note that
so is absolutely continuous with respect to and , and by assumption on the coupling,
(31) |
Under , when , we can couple and so that with probability . Let denote this coupled distribution. Then as in Lemma 6.5,
We bound this by first bounding
(32) |
which follows from the two inequalities (using (31))
From (32), and the fact that the distribution of is the same as by Nishimori’s identity, we obtain
Now for the second term (II),
The first term satisfies . For the second term, we note that Cauchy-Schwarz gives for any measures and that
to switch from the measure to :
(Note that intentionally, the measure is , though we use for the variable.) Hence,
so
where we used the data processing inequality.
For , we obtain by Lemma 6.6 that the bound is
We use this lemma to obtain a bound on the score error under perturbation of the distribution, by interpreting the score as the solution to a de-noising problem.
Lemma 6.2 ( score error under perturbation).
Proof.
For part 1, note by (27) that
where the “tilted” probability distribution is defined by
By Bayes’s rule, this can be viewed as the conditional probability that given , where and , . Hence this fits in the framework of Lemma 6.1 and
giving the result.
For part 2, note that . Applying part 1 with (which preserves -divergence) and gives the result.
∎
Finally, we argue that a score estimate that is accurate with respect to will still be accurate with respect to , with high probability. When using this lemma, we will substitute in the bound from Lemma 6.2.
Lemma 6.3.
Let and be two probability distributions on with TV distance . Suppose the estimated score function satisfies
for , and is -Lipschitz. Then for and any ,
Proof.
We have
The first term is bounded by . For the second term, by Chebyshev’s Inequality,
For the last term, again by Chebyshev’s Inequality,
We conclude the proof by combining these three inequalities. ∎
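The Chebyshev step used in this proof can be illustrated in isolation (a toy example of ours, with a stand-in error variable): if the pointwise score error has small second moment under the current distribution, the bad set where it exceeds a threshold τ has measure at most the second moment divided by τ²:

```python
import math, random

# Chebyshev bad-set bound:  p(|err| >= tau) <= E_p[err^2] / tau^2.
# The errors below are synthetic stand-ins, 0.1 * N(0, 1).

random.seed(3)
n, tau = 100_000, 0.5
errs = [0.1 * random.gauss(0.0, 1.0) for _ in range(n)]

second_moment = sum(e * e for e in errs) / n             # ~ 0.01
bad_frac = sum(1 for e in errs if abs(e) >= tau) / n     # empirical bad-set mass
chebyshev_bound = second_moment / tau ** 2               # ~ 0.04
```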
Finally, we will need the following to obtain a TV error bound to in Theorem 2.3.
Lemma 6.4.
Suppose that is a probability density on with bounded first moment , and is -smooth. Then for such that , we have
Here and are defined in (2). In particular, if and .
Proof.
Without loss of generality, we assume that . Note that . Let , which is also a probability density on . Then by the triangle inequality,
For the second term,
where in the second inequality, we use the fact that for all . Thus
Now for the first term,
where is the density of the -dimensional standard Gaussian distribution. Apply Minkowski’s inequality for integrals:
where in the third inequality, we use the elementary inequality , which is valid for all , and in the fifth inequality, we use , which holds for . Hence if , we have
Now we conclude the proof by combining the bounds for and :
where we use the fact that for all . Recall that and when . It suffices for
which is implied by
for appropriate constants, as . ∎
6.2 Perturbation under TV error
Although we will not need it in our proof, we note that we can derive a similar perturbation result under TV error, which might be of independent interest.
Lemma 6.5.
Let be a probability kernel on , let be measures on . Let denote the joint distribution of and , and let denote the marginal distribution of . Suppose there is a coupling of and such that
-
•
with probability ,
-
•
implies , and
-
•
.
Define the tail error by
Let , and suppose that is -Lipschitz. Then
Proof.
For notational clarity, we will denote draws from the conditional distribution as and , for example . We have
For the first term (I), we split it as
Define the measure on by
As in Lemma 6.2, under , when , we can couple and so that with probability . Let denote this coupled distribution. Then
as in Lemma 6.2. Now
Finally, for the second term (II), we use the fact that is Lipschitz and the coupling to conclude
We conclude the proof by combining the inequalities for (i), (ii), and (II).
For the second upper bound, we note that
6.3 Gaussian tail calculation
We use the following Gaussian tail calculation in the proof of Lemma 6.2.
Lemma 6.6.
Let be the standard Gaussian measure on . Then
Proof.
By the tail bound in [LM00], for ,
(33) |
so is stochastically dominated by a random variable with cdf . Then letting be the measure corresponding to ,
and
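The [LM00] bound being used states that for Z ∼ N(0, I_d), P(‖Z‖² ≥ d + 2√(dx) + 2x) ≤ e^{−x}; a quick Monte Carlo check (ours, with arbitrary small d and x):

```python
import math, random

# Monte Carlo check of the Laurent-Massart chi-square tail bound.
random.seed(4)
d, x, n = 5, 2.0, 50_000
threshold = d + 2 * math.sqrt(d * x) + 2 * x   # Laurent-Massart deviation level

count = 0
for _ in range(n):
    norm2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(d))
    if norm2 >= threshold:
        count += 1
emp_tail = count / n
bound = math.exp(-x)
```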
7 Guarantees under L²-accurate score estimate
We will state our results under a more general tail bound assumption.
Assumption 6 (Tail bound).
is a function such that .
Our result will require the tail bound to grow at most as a sufficiently small power of its argument; in particular, this holds for subexponential distributions. By taking the tail bound to be a constant function, this contains the assumption of bounded support (Assumption 2) as a special case.
7.1 TV error guarantees
We follow the framework of [LLT22] to convert guarantees under an L∞-accurate score estimate to guarantees under an L²-accurate score estimate.
Theorem 7.1 ([LLT22, Theorem 4.1]).
Let be a probability space and be a filtration of the sigma field . Suppose , , and are -adapted random processes taking values in , and are sets such that the following hold for every .
-
1.
If for all , then .
-
2.
.
-
3.
.
Then the following hold.
(34) |
Theorem 7.2 (DDPM with -accurate score estimate).
Let . Suppose that Assumption 6 holds for a sufficiently small value of such that , and . Suppose one of the following cases holds.
- 1.
- 2.
Then the resulting distribution is such that is -far in TV distance from a distribution , where satisfies . In particular, taking , we have .
Note that the condition on can be satisfied if (no effort has been made to optimize the exponent).
Proof.
We invoke Lemma 5.2 for a to be chosen, to obtain a distribution on , where . Let and in case 1 and case 2, respectively; our choice of will give the definition of in the theorem statement. In the following, we define with , rather than , as the initial distribution. Note that since (and the same holds for their evolutions under (1)), it suffices to consider convergence to .
We first define the bad sets where the error in the score estimate is large,
(35) |
for some to be chosen.
Given , let where is such that . Given bad sets , define the interpolated process on by
(36)
In other words, we simulate the reverse SDE using the score estimate as long as the point is in the good set at the previous discretization timepoint , and otherwise use the actual score. Let denote the distribution of when . Note that this process is defined only for purposes of analysis, as we do not have access to the true score. As before, we let denote the distribution of defined by (12).
We can couple this process with the exponential integrator (5) using so that as long as , the processes agree, thus satisfying condition 1 of Theorem 7.1.
Then by Lemma 6.3,
Then by choice of and either Corollary 4.11 or 4.12, when ,
(37)
where . For to be bounded by , it suffices for the terms in (37) to be bounded by ; this is implied by
(38)
By Theorem 7.1,
(39)
(40)
For this to be bounded by , it suffices for
(41)
(42)
We bound (42) crudely, as the dependence on will be logarithmic. Using , it suffices that
(43)
We will return to this after deriving a condition on . It remains to bound (38) and (41). We break up the timepoints depending on whether . Let
and , where . Let . Note that the “fine” timepoints will be closer together than the “coarse” timepoints. We break up the integral (38) and the sum (41) into the parts involving the coarse and fine timepoints. For (38), it suffices to have
so it suffices to take . Let in case 1 and in case 2. For the fine part, recalling our choice of , it suffices to have (note we can redefine when without any harm)
(44)
For (41), it suffices to have
(45)
and
(46)
Note that in light of the required step sizes, we can take . Considering the equality case of Hölder’s inequality on suggests that we take
(47)
(48)
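For reference, the principle being invoked is generic: Hölder's inequality for sums,

```latex
\sum_k a_k b_k \;\le\; \Bigl(\sum_k a_k^p\Bigr)^{1/p} \Bigl(\sum_k b_k^q\Bigr)^{1/q},
\qquad \tfrac{1}{p} + \tfrac{1}{q} = 1,
```

holds with equality iff $a_k^p \propto b_k^q$. Choosing the step sizes so that the per-step error contributions are proportional in this sense equalizes them and minimizes the total error for a fixed number of steps, which is what motivates the choices (47)–(48).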
Note that the number of steps needed in the fine part is in the first case and in the second case. We can check that (47) and (48) ensure that (44) and (46) are satisfied.
Finally, we calculate the denominator for . In case 1, note that starting from and taking steps of size , it takes steps to reach .
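The step count asserted here is an instance of a generic calculation for geometric step-size schedules, which can be sanity-checked numerically (a standalone illustration; `steps_to_reach` and the particular parameters below are ours, not the paper's):

```python
import math

def steps_to_reach(delta, T, c):
    """Count steps of the geometric schedule t_{k+1} = (1 + c) * t_k,
    starting at t_0 = delta, until time T is reached."""
    t, k = delta, 0
    while t < T:
        t *= 1 + c
        k += 1
    return k

# Matches the closed form ceil(log(T/delta) / log(1+c)) = O(log(T/delta)/c).
delta, T, c = 1e-6, 1.0, 0.1
print(steps_to_reach(delta, T, c), math.ceil(math.log(T / delta) / math.log(1 + c)))
```

The loop count agrees with the closed form $\lceil \log(T/\delta)/\log(1+c) \rceil = O(c^{-1}\log(T/\delta))$, which is why step sizes proportional to the current time reach $T$ in only logarithmically many steps.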
Then we obtain
In case 1, our requirement is
but note that the first bound is more stringent. Now, returning to (43), we see that it suffices to take for any (this will “solve” the appearing in .)
In case 2, we have instead so . ∎
Theorem 7.3 (TV error for DDPM with -accurate score estimate and smoothness).
Proof.
7.2 Wasserstein error guarantees
Proof of Theorem 2.1.
Proof of Theorem 2.2.
To obtain purely Wasserstein error guarantees, we include an extra step of replacing any sample falling outside by . Suppose . Let be the resulting distribution. Then
We choose so the first term is . It suffices to bound the second term also by . We bound it in terms of using the fact that is supported on and using a Gaussian tail calculation for . Consider a coupling of and such that with probability . Express where . Now
where the bound on the second term uses Lemma 6.6. Using , we see that it suffices to choose for appropriate choice of constants. By Theorem 7.2, it suffices to take
Simplifying gives .
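The mechanism behind the displays above is the standard TV-to-Wasserstein conversion on a bounded set (notation hedged: $\varepsilon_{\mathrm{TV}}$ for the TV error from Theorem 7.2, $R$ for the radius of the ball onto which samples are projected): for any coupling $(x, y)$ of two distributions supported in the ball of radius $R$ with $\mathbb{P}(x \neq y) \le \varepsilon_{\mathrm{TV}}$,

```latex
W_1 \;\le\; \mathbb{E}\,\|x - y\|
\;=\; \mathbb{E}\bigl[\|x - y\|\,\mathbf{1}\{x \neq y\}\bigr]
\;\le\; 2R\,\mathbb{P}(x \neq y)
\;\le\; 2R\,\varepsilon_{\mathrm{TV}},
```

while the mass that the true distribution places outside the ball is handled by the Gaussian tail bound (Lemma 6.6) and is exponentially small in $R$.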
In case 2, it suffices to take
Simplifying gives . ∎
References
- [And82] Brian DO Anderson “Reverse-time diffusion equation models” In Stochastic Processes and their Applications 12.3 Elsevier, 1982, pp. 313–326
- [ARZ18] Sanjeev Arora, Andrej Risteski and Yi Zhang “Do GANs learn the distribution? some theory and empirics” In International Conference on Learning Representations, 2018
- [BMR20] Adam Block, Youssef Mroueh and Alexander Rakhlin “Generative modeling with denoising auto-encoders and Langevin sampling” In arXiv preprint arXiv:2002.00107, 2020
- [CCN21] Hong-Bin Chen, Sinho Chewi and Jonathan Niles-Weed “Dimension-free log-Sobolev inequalities for mixture distributions” In Journal of Functional Analysis 281.11 Elsevier, 2021, pp. 109236
- [Che+21] Sinho Chewi, Murat A Erdogdu, Mufan Bill Li, Ruoqi Shen and Matthew Zhang “Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev” In arXiv preprint arXiv:2112.12662, 2021
- [Che+22] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim and Anru R. Zhang “Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions” arXiv:2209.11215, 2022
- [Dat+19] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski and Rosanne Liu “Plug and play language models: A simple approach to controlled text generation” In arXiv preprint arXiv:1912.02164, 2019
- [De+21] Valentin De Bortoli, James Thornton, Jeremy Heng and Arnaud Doucet “Diffusion Schrödinger bridge with applications to score-based generative modeling” In Advances in Neural Information Processing Systems 34, 2021
- [De22] Valentin De Bortoli “Convergence of denoising diffusion models under the manifold hypothesis” In arXiv preprint arXiv:2208.05314, 2022
- [EHZ21] Murat A. Erdogdu, Rasa Hosseinzadeh and Matthew S. Zhang “Convergence of Langevin Monte Carlo in Chi-Squared and Renyi Divergence” In arXiv preprint arXiv:2007.11612, 2021
- [Gra+19] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi and Kevin Swersky “Your classifier is secretly an energy based model and you should treat it like one” In arXiv preprint arXiv:1912.03263, 2019
- [HJA20] Jonathan Ho, Ajay Jain and Pieter Abbeel “Denoising diffusion probabilistic models” In Advances in Neural Information Processing Systems 33, 2020, pp. 6840–6851
- [Jin+22] Bowen Jing, Gabriele Corso, Renato Berlinghieri and Tommi Jaakkola “Subspace Diffusion Generative Models” In arXiv preprint arXiv:2205.01490, 2022
- [LLT22] Holden Lee, Jianfeng Lu and Yixin Tan “Convergence for score-based generative modeling with polynomial complexity” In arXiv preprint arXiv:2206.06227, 2022
- [LM00] Beatrice Laurent and Pascal Massart “Adaptive estimation of a quadratic functional by model selection” In Annals of Statistics JSTOR, 2000, pp. 1302–1338
- [Men+21] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu and Stefano Ermon “SDEdit: Guided image synthesis and editing with stochastic differential equations” In International Conference on Learning Representations, 2021
- [Ram+22] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu and Mark Chen “Hierarchical text-conditional image generation with clip latents” In arXiv preprint arXiv:2204.06125, 2022
- [SE19] Yang Song and Stefano Ermon “Generative Modeling by Estimating Gradients of the Data Distribution” In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, 2019
- [SE20] Yang Song and Stefano Ermon “Improved techniques for training score-based generative models” In arXiv preprint arXiv:2006.09011, 2020
- [Son+20] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon and Ben Poole “Score-Based Generative Modeling through Stochastic Differential Equations” In International Conference on Learning Representations, 2020
- [Son+21] Yang Song, Conor Durkan, Iain Murray and Stefano Ermon “Maximum likelihood training of score-based diffusion models” In Advances in Neural Information Processing Systems 34, 2021
- [Son+21a] Yang Song, Liyue Shen, Lei Xing and Stefano Ermon “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models” In arXiv preprint arXiv:2111.08005, 2021
- [Ver18] Roman Vershynin “High-dimensional probability: An introduction with applications in data science” Cambridge University Press, 2018
- [Vin11] Pascal Vincent “A connection between score matching and denoising autoencoders” In Neural computation 23.7 MIT Press, 2011, pp. 1661–1674
- [ZC22] Qinsheng Zhang and Yongxin Chen “Fast Sampling of Diffusion Models with Exponential Integrator” In arXiv preprint arXiv:2204.13902, 2022
Appendix A High-probability bound on the Hessian
In this section we obtain a high-probability bound on the Hessian of , i.e., the Jacobian of the score function.
To see why we expect the Hessian to usually be smaller than the worst-case bound given by Lemma 4.13, note that we can express (27) and (28) as
(49)
(50)
where and , . We expect that the random variable is distributed as , which suggests that the covariance (50) may be bounded by rather than with high probability. Indeed, we can easily construct an example where the worst case of Lemma 4.13 is attained—for example, for , at —but this point has exponentially small probability density under .
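Displays (49)–(50) were lost in extraction; for orientation, the standard identity of this shape for the OU/DDPM forward process $x_t = e^{-t} x_0 + \sigma_t \xi$ with $\sigma_t^2 = 1 - e^{-2t}$, $\xi \sim N(0, I_d)$ (our notation, which may differ from the paper's) is

```latex
\nabla^2 \ln q_t(y)
\;=\; -\frac{1}{\sigma_t^2}\, I_d
\;+\; \frac{e^{-2t}}{\sigma_t^4}\,\operatorname{Cov}\bigl(x_0 \mid x_t = y\bigr),
```

so any high-probability operator-norm bound on the conditional covariance translates directly into a high-probability bound on the Hessian of $\ln q_t$.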
The following lemma uses a -net argument to bound the operator norm of the variance of a conditional distribution, with high probability.
Lemma A.1.
Suppose is a -valued random variable over the probability space , and is a -subalgebra. If is subgaussian, then
Proof.
By Jensen’s inequality and Markov’s inequality, for any ,
where the last inequality follows from taking . Now take a -net of of size [Ver18, Cor. 4.2.13]. By a union bound,
when we take . By [Ver18, Lemma 4.4.1], the operator norm can be bounded by the norm on an -net,
where the second inequality holds when is symmetric. The result follows from applying this to . ∎
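The ε-net step in the proof can be illustrated numerically in dimension two (a toy check of the [Ver18, Lemma 4.4.1]-style bound for symmetric matrices, not the paper's construction; all names below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_net_circle(eps):
    # Unit vectors with angular spacing <= eps; since chord <= arc,
    # this is an eps-net of the unit sphere in R^2.
    n = int(np.ceil(2 * np.pi / eps))
    angles = 2 * np.pi * np.arange(n) / n
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

eps = 0.25  # the bound below needs eps < 1/2
net = eps_net_circle(eps)

# A random symmetric matrix stands in for the (symmetric) conditional covariance.
M = rng.standard_normal((2, 2))
A = (M + M.T) / 2

op_norm = np.linalg.norm(A, 2)              # exact operator norm
net_max = max(abs(x @ A @ x) for x in net)  # max of |<Ax, x>| over the net

# For symmetric A: ||A|| <= (1 - 2*eps)^{-1} * max over the net of |<Ax, x>|.
assert op_norm <= net_max / (1 - 2 * eps) + 1e-9
```

With a dense net the two quantities nearly coincide; the $(1 - 2\varepsilon)^{-1}$ factor is the price paid for passing from the sphere to the finite net.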
From this we obtain the desired high-probability bound.
Lemma A.2.
There is a universal constant such that the following holds. For any starting distribution , letting be the law of the DDPM process (1) at time , we have
Note that there is no dependence on the radius.