Convergence of score-based generative modeling
for general data distributions
Abstract
Score-based generative modeling (SGM) has grown into a hugely successful method for learning to generate samples from complex data distributions such as those of images and audio. It is based on evolving an SDE that transforms white noise into a sample from the learned distribution, using estimates of the score function, i.e., the gradient of the log-pdf. Previous convergence analyses for these methods have suffered either from strong assumptions on the data distribution or from exponential dependencies, and hence fail to give efficient guarantees for the multimodal and non-smooth distributions that arise in practice and for which good empirical performance is observed. We consider a popular kind of SGM—denoising diffusion models—and give polynomial convergence guarantees for general data distributions, with no assumptions related to functional inequalities or smoothness. Assuming L²-accurate score estimates, we obtain Wasserstein distance guarantees for any distribution of bounded support or sufficiently decaying tails, as well as TV guarantees for distributions under further smoothness assumptions.
1 Introduction
Diffusion models have gained huge popularity in recent years in machine learning, as a method to learn and generate new samples from a data distribution. Score-based generative modeling (SGM), as a particular kind of diffusion model, uses learned score functions (gradients of the log-pdf) to transform white noise into the data distribution by following a stochastic differential equation. While SGM has achieved state-of-the-art performance for artificial image and audio generation [SE19, Dat+19, Gra+19, SE20, Son+20, Men+21, Son+21a, Son+21, Jin+22], including being a key component of text-to-image systems [Ram+22], our theoretical understanding of these models is still nascent.
In particular, basic questions on the convergence of the generated distribution to the data distribution remain unanswered. Recent theoretical work on SGM has attempted to answer these questions [De+21, LLT22, De22], but these works either suffer from exponential dependence on parameters or rely on strong assumptions on the data distribution such as functional inequalities or smoothness, which are rarely satisfied in practical situations. For example, considering the hallmark application of generating images from text, we expect the distribution of images to be (a) multimodal, and hence not satisfy functional inequalities with reasonable constants, and (b) supported on lower-dimensional manifolds, and hence not be smooth. However, SGM still performs remarkably well in these settings. Indeed, this is one relative advantage over other approaches to generative modeling such as generative adversarial networks, which can struggle to learn multimodal distributions [ARZ18].
In this work, we aim to develop theoretical convergence guarantees with polynomial complexity for SGM under minimal data assumptions.
1.1 Problem setting
Given samples from a data distribution q, the problem of generative modeling is to learn the distribution in a way that allows generation of new samples. A general framework for many score-based generative models is one where noise is injected into q via a forward SDE [Son+20]
dx_t = f(x_t, t) dt + g(t) dB_t,   x_0 ∼ q,   (1)
where B_t is a standard Brownian motion. Let q_t denote the density of x_t. Remarkably, the reverse process x̄_t := x_{T−t} also satisfies a reverse-time SDE,
dx̄_t = [−f(x̄_t, T−t) + g(T−t)² ∇ ln q_{T−t}(x̄_t)] dt + g(T−t) dB̄_t,   (2)
where B̄_t is a backward Brownian motion [And82]. Because the forward process transforms the data distribution into noise, the hope is to use the backward process to transform noise into samples.
In practice, when we only have sample access to q, the score function ∇ ln q_t is not available. A key mechanism behind SGM is that the score function is learnable from data, by empirically minimizing a denoising objective evaluated at noisy samples [Vin11]. The noisy samples are obtained by evolving the forward SDE starting from the data samples, and the optimization is done within an expressive function class such as neural networks. If the score function is successfully approximated in this way, then the L²-error will be small. The natural question is then the following:
Given L²-error bounds on the score function, how close is the distribution generated by (2) (with the score estimate in place of ∇ ln q_t, and appropriate discretization) to the data distribution q?
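As a concrete and purely illustrative instance of the denoising objective described above (a toy sketch of ours, not the paper's algorithm; all names and parameter values are assumptions), consider one-dimensional Gaussian data. For q = N(0, v) the forward OU marginal at time t is N(0, var_t) with var_t = e^{-2t} v + (1 - e^{-2t}), whose score is -x/var_t, and least squares on the denoising objective recovers exactly this slope:

```python
import math, random

# Toy denoising score matching in 1-D (illustrative sketch, not from the paper).
# Noised samples: x_t = exp(-t)*x0 + sigma*xi with sigma^2 = 1 - exp(-2t).
# Fit a linear score model s(x) = a*x by least squares on the denoising
# objective E[(a*x_t + xi/sigma)^2]; the minimizer satisfies a ≈ -1/var_t.

random.seed(0)
v, t, n = 2.0, 0.5, 200_000
sigma = math.sqrt(1 - math.exp(-2 * t))

sxx = sxy = 0.0
for _ in range(n):
    x0 = random.gauss(0.0, math.sqrt(v))      # data sample from q = N(0, v)
    xi = random.gauss(0.0, 1.0)               # injected noise
    xt = math.exp(-t) * x0 + sigma * xi       # noised sample
    # accumulate least-squares statistics for a: minimize sum (a*xt + xi/sigma)^2
    sxx += xt * xt
    sxy += xt * (xi / sigma)
a_hat = -sxy / sxx

var_t = math.exp(-2 * t) * v + (1 - math.exp(-2 * t))
true_slope = -1.0 / var_t                     # slope of the true score of q_t
```

In practice the same objective is minimized with a neural network in place of the scalar slope a.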
We note it is more realistic to consider L² rather than L∞-error, and this makes the analysis more challenging. Indeed, prior work on Langevin Monte Carlo [EHZ21] and related sampling algorithms only applies when the score function is known exactly, or, with suitable modification, known up to L∞-error. L²-error has a genuinely different effect from L∞-error, as it can cause the stationary distribution for Langevin Monte Carlo to be arbitrarily different [LLT22], necessitating a “medium-time” analysis.
In addition, we hope to obtain a result with as few structural assumptions as possible on , so that it can be useful in realistic scenarios where SGM is applied.
1.2 Prior work on convergence guarantees
We highlight two recent works which make progress on this problem. [LLT22] are the first to give polynomial convergence guarantees in TV distance under an L²-accurate score estimate for a reasonable family of distributions. They introduce a framework that reduces the analysis under an L²-accurate score to that under an L∞-accurate score. However, they rely on the data distribution satisfying smoothness conditions and a log-Sobolev inequality—a strong assumption which essentially limits the guarantees to unimodal distributions.
[De22] instead make minimal data assumptions, giving convergence in Wasserstein distance for distributions with bounded support. In particular, this covers the case of distributions supported on lower-dimensional manifolds, where guarantees in TV distance are unattainable. However, for general distributions, their guarantees have exponential dependence on the diameter of the support and on the inverse of the desired error, and for smooth distributions, an improved, but still exponential, dependence on the growth rate of the Hessian as the noise approaches 0.
We note that other works also analyze the generalization error of a learned score estimate [BMR20, De22]. This is an important question because without further assumptions, learning an L²-accurate score estimate requires a number of samples exponential in the dimension. As this is beyond the scope of our paper, we assume that an L²-accurate score estimate is given.
1.3 Our contributions
In this work, we analyze convergence in the most general setting of distributions of bounded support, as in [De22]. We give Wasserstein bounds for any distribution of bounded support (or sufficiently decaying tails), and TV bounds for distributions under smoothness assumptions, that are polynomial in all parameters and do not rely on the data distribution satisfying any functional inequality. This gives theoretical grounding to the empirical success of SGM on data distributions that are often multimodal and non-smooth.
We streamline the L∞-based analysis of [LLT22], with significant changes so as to completely remove the use of functional inequalities. In particular, the biggest challenge—and our key improvement—is to bound a certain KL-divergence term without reliance on a global functional inequality. For this, we prove a key lemma that distributions which are close in χ²-divergence have score functions that are close in L² (which may be of independent interest), and then a structural result that the distributions arising from the diffusion process can be slightly modified so as to satisfy the desired inequality, through decomposition into distributions that do satisfy a log-Sobolev inequality.
Upon finishing our paper, we learned of concurrent and independent work [Che+22] which obtained theoretical guarantees for score-based generative modeling under similarly general assumptions on the data distribution. We note that although our bounds are obtained under similar assumptions (with our assumption on the score estimate accuracy slightly weaker than theirs), our proof techniques are quite different. Following the “bad set” idea from [LLT22], we derive a change-of-measure inequality in Theorem 7.1, while the analysis in [Che+22] is based on a Girsanov approach.
2 Main results
To state our results, we will consider a specific type of SGM called denoising diffusion probabilistic modeling (DDPM) [HJA20], where in the forward SDE (1) we take f(x, t) = −g(t)² x and diffusion coefficient √2 g(t), for some non-decreasing function g to be chosen. The forward process is then an Ornstein–Uhlenbeck process with time rescaling: x_t has the same distribution as
e^{−T(t)} x_0 + (1 − e^{−2T(t)})^{1/2} ξ,   where ξ ∼ N(0, I_d) and T(t) = ∫₀ᵗ g(s)² ds.   (3)
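As a sanity check of this closed form (a toy simulation written for this note, not taken from the paper, with g ≡ 1 so that the forward SDE is dx = −x dt + √2 dB), an Euler–Maruyama discretization started from a fixed point should reproduce the mean e^{−t} x₀ and variance 1 − e^{−2t}:

```python
import math, random

# Monte Carlo check (illustrative, not the paper's code): simulate the OU SDE
# dx = -x dt + sqrt(2) dB from x0 = 1.5 and compare against the closed form
# x_t ~ exp(-t)*x0 + sqrt(1 - exp(-2t)) * N(0, 1).

random.seed(1)
x0, t_end, steps, paths = 1.5, 1.0, 100, 10_000
dt = t_end / steps

xs = []
for _ in range(paths):
    x = x0
    for _ in range(steps):
        x += -x * dt + math.sqrt(2 * dt) * random.gauss(0.0, 1.0)
    xs.append(x)

mean_mc = sum(xs) / paths
var_mc = sum((y - mean_mc) ** 2 for y in xs) / paths
mean_exact = math.exp(-t_end) * x0       # ≈ 0.5518
var_exact = 1 - math.exp(-2 * t_end)     # ≈ 0.8647
```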
Given an estimated score function s(x, t) approximating ∇ ln q_t(x), we can simulate the reverse process (reparameterizing time and writing x̄_t for the reverse process)
dx̄_t = g(T−t)² [x̄_t + 2 s(x̄_t, T−t)] dt + √2 g(T−t) dB̄_t,   (4)
with the exponential integrator discretization [ZC22], where 0 = t_0 < t_1 < ⋯ < t_N are the discretization points and h_k = t_{k+1} − t_k:
(5)
(6) |
We initialize at a prior distribution that approximates q_T for sufficiently large T:
x̄_0 ∼ N(0, I_d).   (7)
While we focus on DDPM, we note that the continuous process underlying DDPM is equivalent to that of score-matching Langevin diffusion (SMLD) under reparameterization in time and space (see [LLT22, §C.2]). We will further take g ≡ 1 for convenience in stating our results.
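To make the exponential integrator concrete, here is a hypothetical minimal sampler (our own sketch with g ≡ 1, not the paper's code; all names are assumptions). For Gaussian data q = N(μ, 1) the forward marginal is q_t = N(e^{−t} μ, 1), so the exact score −(x − e^{−t} μ) can stand in for the learned estimate. Over a step of size h with the score frozen, the linear SDE dx̄ = (x̄ + 2s) dt + √2 dB̄ solves in closed form, giving the update below:

```python
import math, random

# Minimal exponential-integrator sampler for the reverse SDE with g ≡ 1
# (illustrative sketch). With the score value s frozen over a step of size h,
# dx = (x + 2*s) dt + sqrt(2) dB integrates exactly to
#   x' = e^h * x + 2*(e^h - 1)*s + sqrt(e^{2h} - 1) * N(0, 1).

random.seed(2)
mu, T, h, paths = 2.0, 4.0, 0.02, 8_000
steps = round(T / h)

def score(x, t):                      # exact score of q_t = N(exp(-t)*mu, 1)
    return -(x - math.exp(-t) * mu)   # stands in for a learned estimate

out = []
for _ in range(paths):
    x = random.gauss(0.0, 1.0)        # prior N(0, I), approximating q_T
    t = T                             # current forward time
    for _ in range(steps):
        s = score(x, t)
        x = (math.exp(h) * x + 2 * (math.exp(h) - 1) * s
             + math.sqrt(math.exp(2 * h) - 1) * random.gauss(0.0, 1.0))
        t -= h
    out.append(x)

mean_out = sum(out) / paths
var_out = sum((y - mean_out) ** 2 for y in out) / paths
```

With the true score plugged in, the output should be approximately N(μ, 1), up to O(h) discretization error and the small e^{−T} prior mismatch.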
Our goal is to obtain a quantitative guarantee for the distance between the distribution of the algorithm's output (at a time just before T) and q, under an L² score error guarantee. In the following, we assume a sequence of discretization points has been chosen.
Assumption 1 (L² score error).
For any t, the error in the score estimate is bounded in L²(q_t):
We note that the score ∇ ln q_t grows as t → 0, so this is a reasonable assumption, and quantitatively weaker than a uniform bound over all times.
Assumption 2 (Bounded support).
The data distribution q is supported on the ball B(0, R).
For simplicity, we assume bounded support when stating our main theorems, but note that our results generalize to distributions with sufficiently fast power decay. In the application of image generation, pixel values are bounded, so Assumption 2 is satisfied with R typically on the order of √d.
These are the only assumptions we need to obtain a polynomial complexity guarantee. We also consider the following stronger smoothness assumption, which is Assumption A.6 in [De22] and will give better dependencies. Note that [De22, Theorem I.8] shows a (nonuniform) version of Assumption 3 holds when the data distribution is a smooth density on a convex submanifold.
Assumption 3.
The following bound on the Hessian of the log-pdf holds for all t and x:
for some constant.
Finally, the following smoothness assumption on will allow us to obtain TV guarantees.
Assumption 4.
q admits a density of the form e^{−V}, where V is L-smooth.
We are now ready to state our main theorems.
Theorem 2.1 (Wasserstein+TV error for distributions with bounded support).
Suppose that Assumptions 1 and 2 hold. Then there is a sequence of discretization points such that if the score error in Assumption 1 is small enough (polynomially in all parameters), then the distribution of the output of DDPM is close in TV distance to a distribution that is close in W₂-distance to q. If in addition Assumption 3 holds, a polynomially weaker bound on the score error suffices (note that the Õ(·) notation hides logarithmic dependence).
This result is perhaps surprising at first glance, as it is well known that for sampling algorithms such as Langevin Monte Carlo, structural assumptions on the target distribution—such as a log-Sobolev inequality—are required to obtain similar theoretical guarantees, even with knowledge of the exact score function. The key reason that we can do better is that we utilize a sequence of score functions along the reverse SDE, which is not available in standard sampling settings. Moreover, we choose T large enough that q_T is close to the prior, so it suffices to track the evolution of the true process (2), that is, to maintain rather than decrease the error. To some extent, this result shows the power of DDPM and other reverse-SDE-based methods compared with generative modeling based on standard Langevin Monte Carlo.
A statement with more precise dependencies, which also applies to unbounded distributions with sufficiently decaying tails, can be found as Theorem 7.2. We note that under the Hessian bound (Assumption 3), up to logarithmic factors, the same score error bound suffices to obtain a fixed TV distance to a distribution arbitrarily close in W₂-distance. By truncating the resulting distribution, we can also obtain purely Wasserstein error bounds.
Theorem 2.2 (Wasserstein error for distributions with bounded support).
With an extra assumption on the smoothness of q, we can also obtain purely TV error bounds:
Theorem 2.3 (TV error for distributions under smoothness assumption).
3 Proof overview
Our proof uses the framework of [LLT22] to convert guarantees under an L∞-accurate score function into guarantees under an L²-accurate score function. For the analysis under an L∞-accurate score function, we interpolate the discrete process with estimated score, and derive a differential inequality for the χ²-divergence between the interpolated and true processes.
We bound the resulting error terms, making ample use of the Donsker–Varadhan variational principle to convert expectations into expectations under the reference process. Under small enough step sizes, this shows that the χ²-divergence grows slowly (Theorem 4.10), which suffices, as χ²-divergence decays exponentially in the forward process.
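For illustration (a toy Gaussian example of ours, not from the paper): under the OU forward process with g ≡ 1, p₀ = N(m, 1) evolves to p_t = N(m e^{−t}, 1), and its χ²-divergence to the stationary N(0, 1) decays at least at rate e^{−2t}. The quadrature below checks this numerically:

```python
import math

# chi^2(N(mu,1) || N(0,1)) by midpoint quadrature of  int p^2/gamma dx - 1.
# (Closed form: exp(mu^2) - 1; the quadrature avoids trusting that formula.)

def chi2_vs_std_normal(mu, lo=-12.0, hi=12.0, n=24_000):
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        # p(x)^2 / gamma(x) = exp(-(x-mu)^2 + x^2/2) / sqrt(2*pi)
        total += math.exp(-(x - mu) ** 2 + 0.5 * x * x) / math.sqrt(2 * math.pi) * dx
    return total - 1.0

m = 1.0
chi2_0 = chi2_vs_std_normal(m)                 # divergence at t = 0
chi2_1 = chi2_vs_std_normal(m * math.exp(-1))  # divergence at t = 1
```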
The most challenging error term to deal with is a KL divergence term. Our main innovation over the analysis of [LLT22] is bounding this term without a global log-Sobolev inequality for the data distribution. We note that it suffices for the distribution to be a mixture of distributions each satisfying a log-Sobolev inequality, with the logarithm of the minimum mixture weight bounded below, and in Lemma 5.2, we show that we can decompose any distribution of bounded support in this manner if we move a small amount of its mass.
In Section 6, we show that this does not significantly affect the estimate of the score function, by interpreting the score function as the solution to a Bayesian inference problem: that of denoising a noised data point. More precisely, we show in Lemma 6.5 that the difference between the score functions of two different distributions can be bounded in terms of the distance between the distributions, which may be of independent interest.
Finally, we reduce from the L∞ to the L² setting by bounding the probabilities of hitting a bad set where the score error is large, and carefully choosing parameters (Section 7). This gives a TV error bound to the forward distribution at a small positive time. We can then bound the Wasserstein distance from this distribution to q (in the general case) or the TV distance (under additional smoothness of q).
In Section A we show that the Hessian of the log-pdf is bounded with high probability (cf. Assumption 3). We speculate that a high-probability rather than uniform bound on the Hessian (as in Lemma 4.13) can be used to obtain better dependencies, and leave this as an open problem.
Notation and definitions
We let q_t denote the density of x_t under the forward process (1). Note that q_0 = q may not admit a density, but q_t will for t > 0. For the reverse process, we use the notation q̄_t = q_{T−t}. We defined the reverse process and its coefficients in (2),
and note that , where denotes multiplication by , denotes the pushforward of the measure by , and is the density of . When , we note the bound and .
We will let denote the (interpolated) discrete process (see (12)) and let be the density of . We define
(8)
and note that is a probability density. We defined in (6).
We denote the estimated score function by s(x, t) or s_t(x), interchangeably.
A random variable is subgaussian with constant if
A -valued random variable is subgaussian with constant if for all , is subgaussian. We also define
Given a probability measure on ℝ^d with density, the associated Dirichlet form is defined as usual. We say that a log-Sobolev inequality (LSI) holds with constant C if for any probability density,
(9) |
Note that this quantity is also known as the Fisher information of one density with respect to the other. Alternatively, defining the entropy by Ent_p(g) = E_p[g ln g] − (E_p[g]) ln E_p[g], the LSI states that for any g ≥ 0,
(10) |
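As a numerical illustration of the LSI (normalizations of the constant vary; the check below uses Gross's form Ent_γ(f²) ≤ 2 E_γ|f′|² for the standard Gaussian γ, which is an assumption of this sketch rather than the paper's convention), the functions f(x) = e^{λx/2} attain equality:

```python
import math

# Quadrature check of the Gaussian log-Sobolev inequality in Gross's
# normalization; f(x) = exp(lam*x/2) is an equality case.

def gauss_quad(func, lo=-12.0, hi=12.0, n=24_000):
    # midpoint rule for E_gamma[func] under the standard Gaussian gamma
    dx = (hi - lo) / n
    s = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        g = math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
        s += func(x) * g * dx
    return s

lam = 0.8
f_sq = lambda x: math.exp(lam * x)                      # f(x)^2
df_sq = lambda x: (0.5 * lam) ** 2 * math.exp(lam * x)  # |f'(x)|^2

mean_f_sq = gauss_quad(f_sq)
ent = gauss_quad(lambda x: f_sq(x) * math.log(f_sq(x))) - mean_f_sq * math.log(mean_f_sq)
rhs = 2 * gauss_quad(df_sq)                             # Dirichlet-form side
```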
4 DDPM with L∞-accurate score estimate
We consider the error between the exact backwards SDE (4) and the exponential integrator with estimated score (5). In this section, we bound the error assuming that the score estimate is accurate in .
Assumption 5 (L∞ score error).
For any , the error in the score estimate is bounded:
(11) |
for some non-decreasing function .
In Section 7, we will relax this condition to the score function being accurate in L².
First, we construct the following continuous-time process which interpolates the discrete-time process (5), for :
(12) |
Integration gives that
(13) |
where is defined in (6).
Letting be the distribution of and be the distribution of , we have by [LLT22, Lemma A.2] that
(14) |
(Note that in our case, also depends on rather than just , but this does not change the calculation.) Define as in (8): , .
To bound (14), we use the following lemma.
Lemma 4.1 (cf. [EHZ21, Lemma 1], [LLT22, Lemma A.3]).
For any and any -valued random variable , we have
Proof.
By Young’s inequality,
Lemma 4.2.
Suppose that (11) holds for , is -Lipschitz, is non-decreasing, and that . Then we have for that
Proof.
Lemma 4.3.
Suppose that . Then for ,
Proof.
Lemma 4.4.
For ,
Proof.
Note that is a Gaussian random vector with variance
(Note that this calculation shows that the continuous-time process (12) does agree with the discrete-time process (5) at the discretization times.) Using the Donsker–Varadhan variational principle, for any random variable X,
Applying this to for a constant to be chosen later, and such that , we can bound
(16)
(17) |
Now following [Che+21, Theorem 4], we set , so that
Next, we have
Noting that , we have that
Substituting everything into (17) gives the desired inequality. ∎
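The Donsker–Varadhan variational principle invoked above, E_P[f] ≤ KL(P‖Q) + ln E_Q[e^f], can be checked exactly on a toy discrete example (ours, for illustration), with equality at f = ln(dP/dQ):

```python
import math

# Exact check of Donsker-Varadhan on three-point distributions.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
f = [1.0, -0.5, 2.0]          # arbitrary test function

kl = sum(p * math.log(p / q) for p, q in zip(P, Q))
lhs = sum(p * fi for p, fi in zip(P, f))
rhs = kl + math.log(sum(q * math.exp(fi) for q, fi in zip(Q, f)))

# equality case: f = log(dP/dQ)
f_star = [math.log(p / q) for p, q in zip(P, Q)]
lhs_star = sum(p * fi for p, fi in zip(P, f_star))
rhs_star = kl + math.log(sum(q * math.exp(fi) for q, fi in zip(Q, f_star)))
```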
Let
(18)
(19)
(20)
(21) |
In order to bound the RHS in Lemma 4.2, we need to bound all four of these quantities, which we do in Lemma 4.5, Corollary 4.6, Lemma 4.8, and Section 5, respectively. The main innovation in our analysis compared to [LLT22] is a new way to bound the last quantity, which we present in a separate section.
First we bound . Recall the norm
(In other words, this is the usual Orlicz norm applied to .)
Lemma 4.5.
For ,
Proof.
By the Donsker-Varadhan variational principle,
for any . Choosing , we have which gives the desired inequality. ∎
The following bounds ; note that the proof does not depend on the definition of , only that it is a probability density.
We use the following lemma to bound in Lemma 4.8.
Lemma 4.7 ([LLT22, Lemma C.12]).
Suppose that is a probability density on , where is -smooth. Let and denote the density function of . Then for ,
Lemma 4.8.
Suppose that where is -smooth () and . For ,
Proof.
Now we put everything together. Write for short. Suppose is non-increasing. By Lemma 4.2,
By Lemma 4.8, , so
By Lemma 4.5, , and by Corollary 4.6, , so
Now, if , then
Let . Assume that . Then we obtain
If , then
If , we get
Integration gives
Taking appropriate step sizes then gives the following Theorem 4.10. We first introduce a technical definition.
Definition 4.9.
Let . We say that has at most power growth and decay (with some constant ) if .
Theorem 4.10.
Suppose that the following hold.
-
1.
Assumption 5 holds for .
-
2.
.
-
3.
The KL bound holds for any density and , where .
-
4.
have at most power growth and decay (with some constant), in the sense of Definition 4.9.
Then there is some constant (depending on ) such that if the step sizes satisfy
then for ,
Proof.
This follows from the above calculations and the observation that if we replace by , for some satisfying the power growth and decay assumption, then we change the bound by at most a constant factor, because the step size satisfies . ∎
We specialize this theorem to the case of distributions with bounded support. Note that although not every initial distribution may satisfy a KL inequality as required by condition 3 of Theorem 4.10, Lemma 5.2 will give the existence of a distribution that does, and is close in TV-error. Later in Section 6, we show that this will have a small effect on the score function, and hence allow us to prove our main theorems.
Corollary 4.11.
Proof.
For , note that . From Lemma 4.13, we can choose
From Lemma 4.15, we can choose
The KL inequality (30) gives us
We now check the requirements on . We need
(22)
(23)
(24) |
For , (24) is implied by
and for ,
Finally, the last requirement is
As long as and , the last equation implies all the others. Plugging this into Theorem 4.10 gives the result. ∎
Above, we use the Hessian bound given in Lemma 4.13. Under the stronger smoothness assumption given by Assumption 3, we can take the step sizes to be larger.
Corollary 4.12.
Proof.
4.1 Auxiliary bounds
In this section we give bounds on the Hessian of the log-pdf (Lemma 4.13), the initial χ²-divergence (Lemma 4.14), and the Orlicz norm (Lemma 4.15).
Lemma 4.13 (Hessian bound).
Suppose that is a probability measure supported on a bounded set with radius . Then letting denote the density of ,
(25) |
Therefore, for supported on , , we have
(26) |
Proof.
Let denote the density weighted with the Gaussian, that is, . We note the following calculations:
(27)
(28) |
Lemma 4.14 (Bound on initial χ²-divergence).
Suppose that is supported on . Let . Then
and for and , we have .
Proof.
We have for that
Using convexity of -divergence then gives the result. For , we have
Lemma 4.15 (Subgaussian bound).
Proof.
Let s.t. for some independent of . Define , then for ,
where Γ is the gamma function and the remaining constant is absolute. Therefore
where . Now consider , then for some small enough, by Taylor expansion,
Note that , while Stirling’s approximation yields . Substituting these two bounds, we get
provided that , in which case the geometric series above converges. To bound this quantity further, we can use the numeric inequality which is valid for . It follows that
Now set , then
which implies that
5 Bounding the KL divergence
In this section, we bound the quantity , where is as in (8). While is defined by the DDPM process, in this section we do not assume it is the density of the discretized process; rather, it can be any density for which the relevant quantities are finite.
Lemma 5.1.
Suppose that is a probability measure on such that
(29) |
where , , and each is a probability measure. For , let and be the densities obtained by running the forward DDPM process (1) for time , and , . Let and suppose all the satisfy a log-Sobolev inequality with constant . Then for any , where is as in (8)
While we need the distribution to satisfy a log-Sobolev inequality to get a bound of the form in [LLT22, Lemma C.8], we note that if we allow additive slack, it suffices for it to be a mixture of distributions satisfying a log-Sobolev inequality, with the logarithm of the minimum mixture weight bounded below. In Lemma 5.2 we will see that we can decompose almost any distribution of bounded support in this manner, if we move a small amount of the mass.
Proof.
Let be the function
By decomposition of entropy and the fact that each satisfies LSI with constant ,
where the last inequality follows from noting is a probability mass function on , so that and
Lemma 5.2.
Suppose , and that is a probability measure such that . Let denote the covering number of with balls of radius . Given , there exists a distribution such that and considering the DDPM process started with , for all ,
In particular, for in ,
(30) |
Proof.
Partition into disjoint subsets , of diameter at most , and decompose
where is supported on and . We will zero out the coefficients of all small components: let and
and define
Note that . As probability distributions on ,
and hence the same bound holds for . Note each is supported on a set of diameter . By Theorem 1 of [CCN21], noting that
when and , satisfies a log-Sobolev inequality with constant . The result then follows from Lemma 5.1. For , we use the bound [Ver18, Corollary 4.2.13]. ∎
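A discrete toy analogue of this mass-moving step (our illustration, not the paper's construction): zeroing out mixture components below a weight threshold and renormalizing moves total variation exactly equal to the dropped mass, while every surviving mixture weight stays bounded below:

```python
# Zero out small mixture components and renormalize; the TV distance to the
# original distribution equals the dropped mass, and the logarithm of the
# minimum surviving weight is controlled by the threshold.

weights = [0.40, 0.30, 0.15, 0.10, 0.04, 0.009, 0.001]
threshold = 0.01

kept = [w if w >= threshold else 0.0 for w in weights]
dropped = sum(weights) - sum(kept)           # total zeroed-out mass
tilde = [w / sum(kept) for w in kept]        # renormalized distribution

tv = 0.5 * sum(abs(w - t) for w, t in zip(weights, tilde))
min_kept_weight = min(t for t in tilde if t > 0)
```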
In the next section, we show that we can move a small amount of mass without significantly affecting the score function. This is necessary, as our guarantees on the score estimate are for the original distribution and not the perturbed one in Lemma 5.2.
6 The effect of perturbing the data distribution on the score function
In this section we consider the effect of perturbing the data distribution on the score function. The key observation is that the score function can be interpreted as the solution to an inference problem, that of recovering the original data point from a noisy sample, with data distribution as the given prior distribution. We show through a coupling argument that we can bound the difference between the score functions in terms of the distance between the two data distributions. This will allow us to “massage” the data distribution in order to optimally bound in Section 5.
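This Bayesian interpretation is Tweedie's identity: with forward marginal x_t = e^{−t} x_0 + σ_t ξ and σ_t² = 1 − e^{−2t}, one has ∇ ln q_t(x) = (e^{−t} E[x_0 | x_t = x] − x)/σ_t². The toy check below (our example, with data two point masses at ±a, so of bounded support) compares this posterior-mean formula against a finite-difference derivative of ln q_t:

```python
import math

# Tweedie check for q0 = (1/2) delta_{+a} + (1/2) delta_{-a}:
# q_t is a two-component Gaussian mixture with means ±m = ±exp(-t)*a.

a, t = 1.0, 0.3
m = math.exp(-t) * a
s2 = 1 - math.exp(-2 * t)          # sigma_t^2

def log_qt(x):                      # log-density of q_t up to a constant
    return math.log(
        0.5 * math.exp(-(x - m) ** 2 / (2 * s2))
        + 0.5 * math.exp(-(x + m) ** 2 / (2 * s2))
    )

def posterior_mean(x):              # E[x0 | x_t = x] by Bayes' rule
    wp = math.exp(-(x - m) ** 2 / (2 * s2))
    wm = math.exp(-(x + m) ** 2 / (2 * s2))
    return a * (wp - wm) / (wp + wm)

max_gap = 0.0
for i in range(-20, 21):
    x = 0.1 * i
    eps = 1e-5
    score_fd = (log_qt(x + eps) - log_qt(x - eps)) / (2 * eps)   # finite diff
    score_tw = (math.exp(-t) * posterior_mean(x) - x) / s2       # Tweedie
    max_gap = max(max_gap, abs(score_fd - score_tw))
```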
6.1 Perturbation under error and truncation
We first give a general lemma on denoising error from a mismatched prior.
Lemma 6.1 (Denoising error from mismatched prior).
Let be a probability density on , and be measures on . For , let denote the joint distribution of and where , and let denote the marginal distribution of . Let
Let and . Then
For , the upper bound is .
Note that the tricky part of the proof is to deal with the estimator that can be thought of as inferring the signal assuming the incorrect prior, rather than the actual prior.
Proof.
For notational clarity, we will denote draws from the conditional distribution as and , for example . Let . Let be a coupling of such that with probability and with probability . We have
Define a measure (not necessarily a probability measure) on by
Note that
so is absolutely continuous with respect to and , and by assumption on the coupling,
(31) |
Under , when , we can couple and so that with probability . Let denote this coupled distribution. Then as in Lemma 6.5,
We bound this by first bounding
(32) |
which follows from the two inequalities (using (31))
From (32), and the fact that the distribution of is the same as by Nishimori’s identity, we obtain
Now for the second term (II),
The first term satisfies . For the second term, we note that Cauchy-Schwarz gives for any measures and that
to switch from the measure to :
(Note that intentionally, the measure is , though we use for the variable.) Hence,
so
where we used the data processing inequality.
For , we obtain by Lemma 6.6 that the bound is
We use this lemma to obtain a bound on the score error under perturbation of the distribution, by interpreting the score as the solution to a de-noising problem.
Lemma 6.2 ( score error under perturbation).
Proof.
For part 1, note by (27) that
where the “tilted” probability distribution is defined by
By Bayes’s rule, this can be viewed as the conditional probability that given , where and , . Hence this fits in the framework of Lemma 6.1 and
giving the result.
For part 2, note that . Applying part 1 with (which preserves -divergence) and gives the result.
∎
Finally, we argue that a score estimate that is accurate with respect to will still be accurate with respect to , with high probability. When using this lemma, we will substitute in the bound from Lemma 6.2.
Lemma 6.3.
Let and be two probability distributions on with TV distance . Suppose the estimated score function satisfies
for , and is -Lipschitz. Then for and any ,
Proof.
We have
The first term is bounded by . For the second term, by Chebyshev’s Inequality,
For the last term, again by Chebyshev’s Inequality,
We conclude the proof by combining these three inequalities. ∎
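The Chebyshev step used in this proof can be illustrated in isolation (a toy example of ours, with a stand-in error variable): if the pointwise score error has small second moment under the current distribution, the bad set where it exceeds a threshold τ has measure at most the second moment divided by τ²:

```python
import math, random

# Chebyshev bad-set bound:  p(|err| >= tau) <= E_p[err^2] / tau^2.
# The errors below are synthetic stand-ins, 0.1 * N(0, 1).

random.seed(3)
n, tau = 100_000, 0.5
errs = [0.1 * random.gauss(0.0, 1.0) for _ in range(n)]

second_moment = sum(e * e for e in errs) / n             # ~ 0.01
bad_frac = sum(1 for e in errs if abs(e) >= tau) / n     # empirical bad-set mass
chebyshev_bound = second_moment / tau ** 2               # ~ 0.04
```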
Finally, we will need the following to obtain a TV error bound to in Theorem 2.3.
Lemma 6.4.
Suppose that is a probability density on with bounded first moment , and is -smooth. Then for such that , we have
Here and are defined in (2). In particular, if and .
Proof.
Without loss of generality, we assume that . Note that . Let , which is also a probability density on . Then by the triangle inequality,
For the second term,
where in the second inequality, we use the fact that for all . Thus
Now for the first term,
where is the density of the -dimensional standard Gaussian distribution. Apply Minkowski’s inequality for integrals:
where in the third inequality, we use the elementary inequality , which is valid for all , and in the fifth inequality, we use , which holds for . Hence if , we have
Now we conclude the proof by combining the bounds for and :
where we use the fact that for all . Recall that and when . It suffices for
which is implied by
for appropriate constants, as . ∎
6.2 Perturbation under TV error
Although we will not need it in our proof, we note that we can derive a similar perturbation result under TV error, which might be of independent interest.
Lemma 6.5.
Let be a probability kernel on , let be measures on . Let denote the joint distribution of and , and let denote the marginal distribution of . Suppose there is a coupling of and such that
-
•
with probability ,
-
•
implies , and
-
•
.
Define the tail error by
Let , and suppose that is -Lipschitz. Then
Proof.
For notational clarity, we will denote draws from the conditional distribution as and , for example . We have
For the first term (I), we split it as
Define the measure on by
As in Lemma 6.2, under , when , we can couple and so that with probability . Let denote this coupled distribution. Then
as in Lemma 6.2. Now
Finally, for the second term (II), we use the fact that is Lipschitz and the coupling to conclude
We conclude the proof by combining the inequalities for (i), (ii), and (II).
For the second upper bound, we note that
6.3 Gaussian tail calculation
We use the following Gaussian tail calculation in the proof of Lemma 6.2.
Lemma 6.6.
Let be the standard Gaussian measure on . Then
Proof.
By the tail bound in [LM00], for ,
(33) |
so is stochastically dominated by a random variable with cdf . Then letting be the measure corresponding to ,
and
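The [LM00] bound being used states that for Z ∼ N(0, I_d), P(‖Z‖² ≥ d + 2√(dx) + 2x) ≤ e^{−x}; a quick Monte Carlo check (ours, with arbitrary small d and x):

```python
import math, random

# Monte Carlo check of the Laurent-Massart chi-square tail bound.
random.seed(4)
d, x, n = 5, 2.0, 50_000
threshold = d + 2 * math.sqrt(d * x) + 2 * x   # Laurent-Massart deviation level

count = 0
for _ in range(n):
    norm2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(d))
    if norm2 >= threshold:
        count += 1
emp_tail = count / n
bound = math.exp(-x)
```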
7 Guarantees under L²-accurate score estimate
We will state our results under a more general tail bound assumption.
Assumption 6 (Tail bound).
is a function such that .
Our result will require the tail bound to grow at most as a sufficiently small power of its argument; in particular, this holds for subexponential distributions. By taking the tail bound to be a constant function, this contains the assumption of bounded support (Assumption 2) as a special case.
7.1 TV error guarantees
We follow the framework of [LLT22] to convert guarantees under an L∞-accurate score estimate to guarantees under an L²-accurate score estimate.
Theorem 7.1 ([LLT22, Theorem 4.1]).
Let be a probability space and be a filtration of the sigma field . Suppose , , and are -adapted random processes taking values in , and are sets such that the following hold for every .
-
1.
If for all , then .
-
2.
.
-
3.
.
Then the following hold.
(34) |
Theorem 7.2 (DDPM with -accurate score estimate).
Let . Suppose that Assumption 6 holds for a sufficiently small value of such that , and . Suppose one of the following cases holds.
- 1.
- 2.
Then the resulting distribution is such that is -far in TV distance from a distribution , where satisfies . In particular, taking , we have .
Note that the condition on can be satisfied if (no effort has been made to optimize the exponent).
Proof.
We invoke Lemma 5.2 for a to be chosen, to obtain a distribution on , where . Let and in case 1 and case 2, respectively; our choice of will give the definition of in the theorem statement. In the following, we define with , rather than , as the initial distribution. Note that since (and the same holds for their evolutions under (1)), it suffices to consider convergence to .
We first define the bad sets where the error in the score estimate is large,
(35) |
for some to be chosen.
Given , let where is such that . Given bad sets , define the interpolated process on by
(36)
In other words, we simulate the reverse SDE using the score estimate as long as the point is in the good set at the previous discretization timepoint , and otherwise use the actual score. Let denote the distribution of when . Note that this process is defined only for purposes of analysis, as we do not have access to the true score. As before, we let denote the distribution of defined by (12).
We can couple this process with the exponential integrator (5) using so that as long as , the processes agree, thus satisfying condition 1 of Theorem 7.1.
Then by Lemma 6.3,
Then by choice of and either Corollary 4.11 or 4.12, when ,
(37)
where . For to be bounded by , it suffices for the terms in (37) to be bounded by ; this is implied by
(38)
By Theorem 7.1,
(39)
(40)
For this to be bounded by , it suffices for
(41)
(42)
We bound (42) crudely, as the dependence on will be logarithmic. Using , it suffices that
(43)
We will return to this after deriving a condition on . It remains to bound (38) and (41). We break up the timepoints depending on whether . Let
and , where . Let . Note that the “fine” timepoints will be closer together than the “coarse” timepoints. We break up the integral (38) and the sum (41) into the parts involving the coarse and fine timepoints. For (38), it suffices to have
so it suffices to take . Let in case 1 and in case 2. For the fine part, recalling our choice of , it suffices to have (note we can redefine when without any harm)
(44)
For (41), it suffices to have
(45)
and
(46)
Note that in light of the required step sizes, we can take . Considering the equality case of Hölder’s inequality on suggests that we take
(47)
(48)
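For reference, the principle being invoked is generic: Hölder's inequality for sums,

```latex
\sum_k a_k b_k \;\le\; \Bigl(\sum_k a_k^p\Bigr)^{1/p} \Bigl(\sum_k b_k^q\Bigr)^{1/q},
\qquad \tfrac{1}{p} + \tfrac{1}{q} = 1,
```

holds with equality iff $a_k^p \propto b_k^q$. Choosing the step sizes so that the per-step error contributions are proportional in this sense equalizes them and minimizes the total error for a fixed number of steps, which is what motivates the choices (47)–(48).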
Note that the number of steps needed in the fine part is in the first case and in the second case. We can check that (47) and (48) ensure that (44) and (46) are satisfied.
Finally, we calculate the denominator for . In case 1, note that starting from and taking steps of size , it takes steps to reach .
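The step count asserted here is an instance of a generic calculation for geometric step-size schedules, which can be sanity-checked numerically (a standalone illustration; `steps_to_reach` and the particular parameters below are ours, not the paper's):

```python
import math

def steps_to_reach(delta, T, c):
    """Count steps of the geometric schedule t_{k+1} = (1 + c) * t_k,
    starting at t_0 = delta, until time T is reached."""
    t, k = delta, 0
    while t < T:
        t *= 1 + c
        k += 1
    return k

# Matches the closed form ceil(log(T/delta) / log(1+c)) = O(log(T/delta)/c).
delta, T, c = 1e-6, 1.0, 0.1
print(steps_to_reach(delta, T, c), math.ceil(math.log(T / delta) / math.log(1 + c)))
```

The loop count agrees with the closed form $\lceil \log(T/\delta)/\log(1+c) \rceil = O(c^{-1}\log(T/\delta))$, which is why step sizes proportional to the current time reach $T$ in only logarithmically many steps.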
Then we obtain
In case 1, our requirement is
but note that the first bound is more stringent. Now, returning to (43), we see that it suffices to take for any (this will “solve” the appearing in .)
In case 2, we have instead so . ∎
Theorem 7.3 (TV error for DDPM with -accurate score estimate and smoothness).
Proof.
7.2 Wasserstein error guarantees
Proof of Theorem 2.1.
Proof of Theorem 2.2.
To obtain purely Wasserstein error guarantees, we include an extra step of replacing any sample falling outside by . Suppose . Let be the resulting distribution. Then
We choose so the first term is . It suffices to bound the second term also by . We bound it in terms of using the fact that is supported on and using a Gaussian tail calculation for . Consider a coupling of and such that with probability . Express where . Now
where the bound on the second term uses Lemma 6.6. Using , we see that it suffices to choose for appropriate choice of constants. By Theorem 7.2, it suffices to take
Simplifying gives .
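The mechanism behind the displays above is the standard TV-to-Wasserstein conversion on a bounded set (notation hedged: $\varepsilon_{\mathrm{TV}}$ for the TV error from Theorem 7.2, $R$ for the radius of the ball onto which samples are projected): for any coupling $(x, y)$ of two distributions supported in the ball of radius $R$ with $\mathbb{P}(x \neq y) \le \varepsilon_{\mathrm{TV}}$,

```latex
W_1 \;\le\; \mathbb{E}\,\|x - y\|
\;=\; \mathbb{E}\bigl[\|x - y\|\,\mathbf{1}\{x \neq y\}\bigr]
\;\le\; 2R\,\mathbb{P}(x \neq y)
\;\le\; 2R\,\varepsilon_{\mathrm{TV}},
```

while the mass that the true distribution places outside the ball is handled by the Gaussian tail bound (Lemma 6.6) and is exponentially small in $R$.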
In case 2, it suffices to take
Simplifying gives . ∎
References
- [And82] Brian DO Anderson “Reverse-time diffusion equation models” In Stochastic Processes and their Applications 12.3 Elsevier, 1982, pp. 313–326
- [ARZ18] Sanjeev Arora, Andrej Risteski and Yi Zhang “Do GANs learn the distribution? some theory and empirics” In International Conference on Learning Representations, 2018
- [BMR20] Adam Block, Youssef Mroueh and Alexander Rakhlin “Generative modeling with denoising auto-encoders and Langevin sampling” In arXiv preprint arXiv:2002.00107, 2020
- [CCN21] Hong-Bin Chen, Sinho Chewi and Jonathan Niles-Weed “Dimension-free log-Sobolev inequalities for mixture distributions” In Journal of Functional Analysis 281.11 Elsevier, 2021, pp. 109236
- [Che+21] Sinho Chewi, Murat A Erdogdu, Mufan Bill Li, Ruoqi Shen and Matthew Zhang “Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev” In arXiv preprint arXiv:2112.12662, 2021
- [Che+22] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim and Anru R. Zhang “Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions” arXiv:2209.11215, 2022
- [Dat+19] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski and Rosanne Liu “Plug and play language models: A simple approach to controlled text generation” In arXiv preprint arXiv:1912.02164, 2019
- [De+21] Valentin De Bortoli, James Thornton, Jeremy Heng and Arnaud Doucet “Diffusion Schrödinger bridge with applications to score-based generative modeling” In Advances in Neural Information Processing Systems 34, 2021
- [De22] Valentin De Bortoli “Convergence of denoising diffusion models under the manifold hypothesis” In arXiv preprint arXiv:2208.05314, 2022
- [EHZ21] Murat A. Erdogdu, Rasa Hosseinzadeh and Matthew S. Zhang “Convergence of Langevin Monte Carlo in Chi-Squared and Renyi Divergence” In arXiv preprint arXiv:2007.11612, 2021
- [Gra+19] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi and Kevin Swersky “Your classifier is secretly an energy based model and you should treat it like one” In arXiv preprint arXiv:1912.03263, 2019
- [HJA20] Jonathan Ho, Ajay Jain and Pieter Abbeel “Denoising diffusion probabilistic models” In Advances in Neural Information Processing Systems 33, 2020, pp. 6840–6851
- [Jin+22] Bowen Jing, Gabriele Corso, Renato Berlinghieri and Tommi Jaakkola “Subspace Diffusion Generative Models” In arXiv preprint arXiv:2205.01490, 2022
- [LLT22] Holden Lee, Jianfeng Lu and Yixin Tan “Convergence for score-based generative modeling with polynomial complexity” In arXiv preprint arXiv:2206.06227, 2022
- [LM00] Beatrice Laurent and Pascal Massart “Adaptive estimation of a quadratic functional by model selection” In Annals of Statistics JSTOR, 2000, pp. 1302–1338
- [Men+21] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu and Stefano Ermon “SDEdit: Guided image synthesis and editing with stochastic differential equations” In International Conference on Learning Representations, 2021
- [Ram+22] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu and Mark Chen “Hierarchical text-conditional image generation with clip latents” In arXiv preprint arXiv:2204.06125, 2022
- [SE19] Yang Song and Stefano Ermon “Generative Modeling by Estimating Gradients of the Data Distribution” In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, 2019
- [SE20] Yang Song and Stefano Ermon “Improved techniques for training score-based generative models” In arXiv preprint arXiv:2006.09011, 2020
- [Son+20] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon and Ben Poole “Score-Based Generative Modeling through Stochastic Differential Equations” In International Conference on Learning Representations, 2020
- [Son+21] Yang Song, Conor Durkan, Iain Murray and Stefano Ermon “Maximum likelihood training of score-based diffusion models” In Advances in Neural Information Processing Systems 34, 2021
- [Son+21a] Yang Song, Liyue Shen, Lei Xing and Stefano Ermon “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models” In arXiv preprint arXiv:2111.08005, 2021
- [Ver18] Roman Vershynin “High-dimensional probability: An introduction with applications in data science” Cambridge University Press, 2018
- [Vin11] Pascal Vincent “A connection between score matching and denoising autoencoders” In Neural computation 23.7 MIT Press, 2011, pp. 1661–1674
- [ZC22] Qinsheng Zhang and Yongxin Chen “Fast Sampling of Diffusion Models with Exponential Integrator” In arXiv preprint arXiv:2204.13902, 2022
Appendix A High-probability bound on the Hessian
In this section we obtain a high-probability bound on the Hessian of , i.e., the Jacobian of the score function.
To see why we expect the Hessian to usually be smaller than the worst-case bound given by Lemma 4.13, note that we can express (27) and (28) as
(49)
(50)
where and , . We expect that the random variable is distributed as , which suggests that the covariance (50) may be bounded by rather than with high probability. Indeed, we can easily construct an example where the worst case of Lemma 4.13 is attained—for example, for , at —but this point has exponentially small probability density under .
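Displays (49)–(50) were lost in extraction; for orientation, the standard identity of this shape for the OU/DDPM forward process $x_t = e^{-t} x_0 + \sigma_t \xi$ with $\sigma_t^2 = 1 - e^{-2t}$, $\xi \sim N(0, I_d)$ (our notation, which may differ from the paper's) is

```latex
\nabla^2 \ln q_t(y)
\;=\; -\frac{1}{\sigma_t^2}\, I_d
\;+\; \frac{e^{-2t}}{\sigma_t^4}\,\operatorname{Cov}\bigl(x_0 \mid x_t = y\bigr),
```

so any high-probability operator-norm bound on the conditional covariance translates directly into a high-probability bound on the Hessian of $\ln q_t$.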
The following lemma uses a -net argument to bound the operator norm of the variance of a conditional distribution, with high probability.
Lemma A.1.
Suppose is a -valued random variable over the probability space , and is a -subalgebra. If is subgaussian, then
Proof.
By Jensen’s inequality and Markov’s inequality, for any ,
where the last inequality follows from taking . Now take a -net of of size [Ver18, Cor. 4.2.13]. By a union bound,
when we take . By [Ver18, Lemma 4.4.1], the operator norm can be bounded by the norm on an -net,
where the second inequality holds when is symmetric. The result follows from applying this to . ∎
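The ε-net step in the proof can be illustrated numerically in dimension two (a toy check of the [Ver18, Lemma 4.4.1]-style bound for symmetric matrices, not the paper's construction; all names below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_net_circle(eps):
    # Unit vectors with angular spacing <= eps; since chord <= arc,
    # this is an eps-net of the unit sphere in R^2.
    n = int(np.ceil(2 * np.pi / eps))
    angles = 2 * np.pi * np.arange(n) / n
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

eps = 0.25  # the bound below needs eps < 1/2
net = eps_net_circle(eps)

# A random symmetric matrix stands in for the (symmetric) conditional covariance.
M = rng.standard_normal((2, 2))
A = (M + M.T) / 2

op_norm = np.linalg.norm(A, 2)              # exact operator norm
net_max = max(abs(x @ A @ x) for x in net)  # max of |<Ax, x>| over the net

# For symmetric A: ||A|| <= (1 - 2*eps)^{-1} * max over the net of |<Ax, x>|.
assert op_norm <= net_max / (1 - 2 * eps) + 1e-9
```

With a dense net the two quantities nearly coincide; the $(1 - 2\varepsilon)^{-1}$ factor is the price paid for passing from the sphere to the finite net.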
From this we obtain the desired high-probability bound.
Lemma A.2.
There is a universal constant such that the following holds. For any starting distribution , letting be the law of the DDPM process (1) at time , we have
Note that there is no dependence on the radius.