A Good Score Does not Lead to A Good Generative Model
Abstract
Score-based Generative Models (SGMs) are a leading family of generative models, renowned for their ability to generate high-quality samples from complex, high-dimensional data distributions. The method enjoys empirical success and is supported by rigorous theoretical convergence properties. In particular, it has been shown that SGMs can generate samples from a distribution that is close to the ground-truth if the underlying score function is learned well, suggesting the success of SGMs as generative models. We provide a counter-example in this paper. Through a sample complexity argument, we exhibit one specific setting where the score function is learned well. Yet, SGMs in this setting can only output samples that are Gaussian blurrings of training data points, mimicking the effects of kernel density estimation. The finding resonates with a series of recent results revealing that SGMs can demonstrate a strong memorization effect and fail to generate.
1 Introduction
Generative modeling aims to understand the structure of a dataset so as to generate similar examples. It has been widely used in image and text generation (Wang et al., 2018; Huang et al., 2018; Rombach et al., 2022; Li et al., 2022; Gong et al., 2022), speech and audio synthesis (Donahue et al., 2018; Kong et al., 2020a, b; Huang et al., 2022), and even the discovery of protein structures (Watson et al., 2023).
Among the various types of generative models, Score-based Generative Models (SGMs) (Song et al., 2020; Ho et al., 2020; Karras et al., 2022) have recently emerged as a forefront method, achieving state-of-the-art empirical results across diverse domains. An SGM views the structure of the existing examples as encoded in a probability distribution, which we call the target distribution. Once the SGM learns the target distribution from the data, it generates new samples from it.
Despite their empirical successes, a thorough theoretical understanding of why SGMs perform well remains elusive. A more fundamental question is:
What are the criteria to evaluate the performance of a generative model?
Heuristically, two key components of generative models are “imitating” and “generating”. “Imitating” is about learning from existing examples, while “generating” calls for creativity to produce new ones. A successful generative model should exhibit imitation ability, so as to produce samples that resemble the training data, and at the same time manifest creativity, generating samples that are not mere replicas of existing ones.
In the past few years, significant theoretical progress has been made on assessing the imitation ability of SGMs. In particular, recently developed theory provides a rich collection of error bounds that evaluate the difference between the learned distribution and the ground-truth distribution. Such bounds are available in various statistical distances, including total variation, KL divergence, the Wasserstein distance, and others. These discoveries suggest that SGMs have strong imitation ability, i.e. they can approximate the ground-truth distribution well if the score function (gradient of the log-density) of the target distribution along the diffusion process is effectively learned.
We would like to discuss the other side of the story: relying solely on these upper error bounds can be misleading in assessing the overall performance of SGMs. In particular, this criterion does not adequately address the issue of memorization, namely the possibility that the produced samples are simply replicas of the training data. In other words, SGMs with strong imitation ability can lack creativity.
1.1 A toy model argument
At the heart of our argument is the observation that a simple Kernel Density Estimation (KDE) can be arbitrarily close to the ground-truth target distribution. Yet, drawing a sample from the ground-truth and drawing one from a KDE present very different features. The latter fails at the task of “generation.”
To be mathematically more precise, let $p_0$ be the ground-truth distribution, and let $\{x_i\}_{i=1}^N$ be a set of i.i.d. samples drawn from it. The empirical distribution is $p_0^N = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}$. We denote by $p^{\sigma}_{\mathrm{KDE}} = G_\sigma * p_0^N$ the distribution obtained by smoothing $p_0^N$ with a Gaussian kernel $G_\sigma$. This definition naturally makes $p^{\sigma}_{\mathrm{KDE}}$ one kind of Kernel Density Estimation (KDE) of $p_0$ with bandwidth $\sigma$.
It is intuitive that when the sample size $N$ is large and the bandwidth $\sigma$ is properly chosen, the KDE $p^{\sigma}_{\mathrm{KDE}}$ approximates the true distribution $p_0$. In the most extreme case, when the bandwidth $\sigma \to 0$, the kernel density estimate degenerates to the empirical distribution $p_0^N$. Throughout the paper we view the empirical distribution as a special case of KDE.
Though $p_0$ and $p^{\sigma}_{\mathrm{KDE}}$ are close, generating samples from $p_0$ and from $p^{\sigma}_{\mathrm{KDE}}$ are drastically different stories. Drawing from $p_0$ amounts to generating a completely new sample, independent of the dataset, while generating from $p^{\sigma}_{\mathrm{KDE}}$ essentially means selecting a sample uniformly from the set $\{x_i\}_{i=1}^N$ and then applying a Gaussian blurring. Regardless of how closely $p^{\sigma}_{\mathrm{KDE}}$ approximates the ground-truth $p_0$, sampling from the KDE ultimately gives replicas of the existing samples.
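To make the contrast concrete, the following minimal Python sketch (our illustration; the function names are ours) draws from a Gaussian KDE exactly as described: pick a training point uniformly at random, then add Gaussian noise of bandwidth $\sigma$.

```python
import numpy as np

def sample_kde(train_data, sigma, n_samples, seed=None):
    """Draw from a Gaussian KDE: uniform pick of a training point + Gaussian blur."""
    rng = np.random.default_rng(seed)
    n, d = train_data.shape
    idx = rng.integers(0, n, size=n_samples)             # uniform choice of x_i
    noise = sigma * rng.standard_normal((n_samples, d))  # bandwidth-sigma blur
    return train_data[idx] + noise

# Toy check: every KDE sample sits within a few bandwidths of some training point.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))                        # stand-in training set
samples = sample_kde(X, sigma=0.1, n_samples=5, seed=1)
dists = np.linalg.norm(samples[:, None, :] - X[None, :, :], axis=-1).min(axis=1)
print(dists)  # all small: each sample is a blurred replica of a training point
```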
Would SGM be different from KDE? SGM is built on a complicated procedure, incorporating forward noise injection, score matching, and backward sampling processes. The machinery is significantly more convoluted than the straightforward KDE approach. Would it be able to generate new samples?
We are to show in this paper that the perfect SGM is actually a KDE itself. The mathematical statement is presented in Theorem 4.3. “Perfect” means that the minimizer of the empirical score-matching objective is achieved during the score-matching procedure. We term the learned score function the empirical optimal score function. Since an SGM equipped with the empirical optimal score function is effectively a KDE, it inherits the limitation of KDE and fails to “generate.” This phenomenon is clearly demonstrated in Figure 1 with the test conducted on the CIFAR10 dataset.

It is important to note that this observation does not contradict existing theories stating that SGMs can approximate the target distribution when the score function is accurately learned. Indeed, in Theorem 3.1 we provide a sample complexity estimate and derive a lower bound on the sample size $N$. When the sample size is sufficiently large, the empirical optimal score function approximates the ground-truth score function. Consequently, according to the existing theories, the output of the SGM is a distribution close to the ground-truth target distribution. Yet, two distributions being close is not sufficient for the task of generation.
1.2 Contributions
The primary contribution of this paper is presenting a counter-example of score-based generative models (SGMs) with an accurately approximated score function that nevertheless produce unoriginal, replicated samples. Our findings are substantiated through the following two steps:
• We establish in Theorem 3.1 the score-matching error of the empirical optimal score function, and present an explicit non-asymptotic error bound with the sample complexity. This result illustrates that the empirical optimal score function satisfies the standard bound on the score estimation error used in the convergence analyses in the existing literature (Chen et al., 2022, 2023c, 2023d; Benton et al., 2023a), which presumably should lead to the conclusion that SGMs equipped with the empirical optimal score function produce a distribution close to the target distribution.
• We show in Theorem 4.3 that an SGM equipped with the empirical optimal score function resembles a Gaussian KDE, and thus presents strong memorization effects and fails to produce novel samples.
These results combined rigorously demonstrate that an SGM with precise empirical score matching is capable of producing a distribution close to the target, but this does not ensure the efficacy of the SGM in its ability to generate innovative and diverse samples. This observation underscores the limitation of current upper-bound-type guarantees and highlights the need for new theoretical criteria to assess the performance of generative models.
Notations: Let $\mathbb{R}^d$ be the $d$-dimensional Euclidean space and $T$ the time horizon. Denote by $x \in \mathbb{R}^d$ and $t \in [0,T]$ the spatial variable and time variable respectively. We denote by $p_0$ the target data distribution supported on a subset of $\mathbb{R}^d$, and indicate the empirical distribution by $p_0^N$. The Gaussian kernel with bandwidth $\sigma$ is denoted by $G_\sigma$. For the special case $\sigma = 1$, i.e. the standard Gaussian, we use the notation $G$. We denote the Gaussian KDE with bandwidth $\sigma$ as $p^{\sigma}_{\mathrm{KDE}}$. In general, we use $p_t$ and $\overleftarrow{p}_t$ (or $p_t^N$ and $\overleftarrow{p}_t^N$) to represent the laws of the forward and backward SDEs at time $t$ respectively (a thorough summary of the PDE and SDE notations used in this paper is provided in Appendix A). We denote by $t_0$ the early stopping time for running the SDEs.
1.3 Literature review
We are mainly concerned with three distinct lines of research related to SGM performance, as summarized below.
Convergence of SGMs. The first line of research concerns the theoretical convergence properties of SGMs. This addresses the most fundamental performance question for the algorithm: what elements are needed for an SGM to perform well? In this context, good performance amounts to generating a new sample from a learned distribution that is close to the ground-truth. This line of research has garnered a large amount of interest, drawing on its relation to sampling. For most studies, the analysis amounts to quantifying the deviation between the distributions generated by SGMs and the ground-truth distributions. This includes earlier studies such as (Lee et al., 2022; Wibisono & Yingxi Yang, 2022; De Bortoli et al., 2021; De Bortoli, 2022; Kwon et al., 2022; Block et al., 2022), and later (Chen et al., 2022, 2023a, 2023b; Benton et al., 2023a; Li et al., 2023), which significantly relaxed the Lipschitz condition on the score function and achieved polynomial convergence rates. In these discoveries, Girsanov's theorem turns out to be a crucial proof strategy. Parallel to these findings, convergence properties of ODE-based SGMs have also been explored (Chen et al., 2023c, d; Benton et al., 2023b; Albergo et al., 2023; Li et al., 2023), and comparisons to SDE-based SGMs have been drawn.
Sample complexity studies of SGMs. Another line of research focuses on sample complexity. How many samples/training data points are needed to learn the score? In line with the convergence rate analysis, sample complexity studies have been conducted with the criterion set to be an $L^2$-approximation of the score function (Block et al., 2022; Cui et al., 2023; Chen et al., 2023b; Oko et al., 2023). The techniques involved range from deploying Rademacher complexity for certain hypothesis classes to utilizing specific neural network structures. Oftentimes, there are also assumptions made on the structure of the data.
Memorization effect of SGMs. The third line of research on SGMs concerns the memorization effect. This line of research was triggered by experimental discoveries and gained prominence through some high-profile lawsuits (New York Times, 2023). Experimentally it was found that SGMs, when trained well, tend to produce replicas of training samples (Somepalli et al., 2022, 2023; Carlini et al., 2023). This phenomenon raises serious privacy concerns, and motivates studies on the fundamental nature of SGMs: are SGMs memorizers or generalizers? In (Yoon et al., 2023), the authors presented a dichotomy, showing through numerical experiments that SGMs can generate novel samples when they fail to memorize training data. Furthermore, when confined to a basis of harmonic functions adapted to the geometry of image features, (Kadkhodaie et al., 2023) suggest that neural network denoisers in SGMs might have an inductive bias that aids the generation of novel samples. In (Gu et al., 2023; Yi et al., 2023), the authors derive the optimal solution to the empirical score-matching problem and show that SGMs equipped with this score function exhibit a strong memorization effect. This suggests that with a limited amount of training data and a large neural network capacity, SGMs tend to memorize rather than generalize.
To summarize: the convergence results for SGMs suggest a well-learned score function can be deployed to produce a sample drawn from a distribution close to the ground-truth, while the studies on the memorization effect of SGMs suggest the new draws are simple replicas of the training dataset. It is worth noting that the two sets of results do not contradict each other. In particular, the convergence results do not rule out an explicit dependence of newly generated samples on the training data. The connection between the two aspects of SGM performance is yet to be developed, and this is the main task of the current paper. We show that SGMs, despite having favorable convergence properties, can still resort to memorization, in the form of kernel density estimation. The finding underscores the need for a new theoretical framework to evaluate SGM performance, taking into account both the imitation ability and the creativity of SGMs.
2 Score-based Generative Models
We provide a brief exposition of Score-based Generative Models (SGMs) (Song et al., 2020) in this section. Mathematically, SGM is equivalent to denoising diffusion probabilistic modeling (DDPM) (Ho et al., 2020), so we use the two terms interchangeably.
2.1 Mathematical foundation for DDPM
The foundation of SGM stems from two mathematical observations. Firstly, a diffusion-type partial differential equation (PDE) drives an arbitrary distribution to a Gaussian distribution, forming a bridge between the complex target distribution and the standard Gaussian, an easy-to-sample distribution. Secondly, such a diffusion process can be simulated through its samples, translating the complicated PDE into a set of stochastic differential equations (SDEs) that are computationally easy to manipulate.
More precisely, denote by $p(x,t)$ the solution to the PDE:
$$\partial_t p = \nabla\cdot\big(x\,p\big) + \Delta p\,. \tag{1}$$
It can be shown that, for arbitrary initial data $p(x,0)$, when $T$ is big enough,
$$p(\cdot, T) \approx \mathcal{N}(0, I_d)\,,$$
and the convergence is exponentially fast (Bakry et al., 2014). In our context, we set the initial data $p(x,0) = p_0$, the to-be-learned target distribution.
This PDE can be run backward in time. Denote $\overleftarrow{p}(x,t) = p(x, T-t)$; a quick calculation shows
$$\partial_t \overleftarrow{p} = -\nabla\cdot\Big(\big(x + 2\nabla\log p(x, T-t)\big)\,\overleftarrow{p}\Big) + \Delta \overleftarrow{p}\,. \tag{2}$$
This means that with full knowledge of $p(x,t)$, the flow field drives the standard Gaussian ($\overleftarrow{p}(\cdot,0) = p(\cdot,T) \approx \mathcal{N}(0, I_d)$) back to the original distribution, the target $p_0$. The term $\nabla\log p$ is called the score function.
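For completeness, here is the quick calculation behind (2). Using $\Delta p = \nabla\cdot(p\,\nabla\log p)$ and the change of variable $s = T - t$,
$$\partial_t \overleftarrow{p}(x,t) = -\partial_s p(x,s)\big|_{s = T-t} = -\nabla\cdot\big(x\,p\big) - \Delta p = -\nabla\cdot\Big(\big(x + 2\nabla\log p\big)\,p\Big) + \Delta p\,,$$
which is (2) after rewriting $p(x, T-t)$ as $\overleftarrow{p}(x,t)$.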
Simulating these two PDEs (1) and (2) directly is computationally infeasible, especially when the dimension $d$ is large, but both equations can be represented by samples whose dynamics satisfy the corresponding SDEs. In particular, letting
$$\mathrm{d}X_t = -X_t\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,, \tag{3}$$
the standard OU process, and
$$\mathrm{d}\overleftarrow{X}_t = \Big(\overleftarrow{X}_t + 2\nabla\log p\big(\overleftarrow{X}_t,\, T-t\big)\Big)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}\overleftarrow{B}_t\,, \tag{4}$$
where $B_t$ and $\overleftarrow{B}_t$ are two Brownian motions, then, with proper initial conditions:
$$\mathrm{Law}(X_t) = p(\cdot, t)\,, \qquad \mathrm{Law}(\overleftarrow{X}_t) = \overleftarrow{p}(\cdot, t) = p(\cdot, T-t)\,.$$
This relation translates directly simulating the two PDEs (1) and (2) into running their representative samples governed by the SDEs (3)-(4), significantly reducing the computational complexity. It is worth noting that if one draws $\overleftarrow{X}_0 \sim p(\cdot, T)$ and runs (4), then:
$$\overleftarrow{X}_T \sim p(\cdot, 0) = p_0\,,$$
meaning the dynamics of (4) return a sample from the target distribution $p_0$, achieving the task of sampling. Here the notation $\sim$ stands for drawing an i.i.d. sample from a distribution.
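As a quick illustration of the first observation, the self-contained sketch below (ours; all parameter choices are illustrative) simulates the forward OU process (3) by Euler-Maruyama and checks that the samples approach the standard Gaussian regardless of the initial distribution.

```python
import numpy as np

def forward_ou(x0, T=5.0, dt=1e-2, seed=None):
    """Euler-Maruyama for dX = -X dt + sqrt(2) dB: drives any law toward N(0, I)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(int(T / dt)):
        x += -x * dt + np.sqrt(2 * dt) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
x0 = 5.0 + 2.0 * rng.standard_normal((10000, 2))  # far-from-Gaussian initial law
xT = forward_ou(x0, T=5.0, seed=1)
print(xT.mean(axis=0), xT.std(axis=0))            # approx [0, 0] and [1, 1]
```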
2.2 Score-function, explicit solution and score matching
It is clear that the success of SGM, namely being able to draw a sample from the target distribution $p_0$, hinges on finding a good approximation of the score function $\nabla\log p(x,t)$. In the idealized setting, this score function can be expressed explicitly. In practical computation, it is learned from the existing dataset through the score-matching procedure.
To explicitly express the score function amounts to solving (1), or equivalently (3). Taking the SDE perspective, we analyze the OU process in (3) and obtain an explicit solution:
$$X_t = e^{-t}X_0 + \sigma_t Z\,, \qquad Z \sim \mathcal{N}(0, I_d)\,, \tag{5}$$
where $X_0$ is the initial data and $\sigma_t = \sqrt{1 - e^{-2t}}$. Equivalently, using the PDE perspective, one sets $p(x,0) = \delta_y(x)$ as the initial condition to run (1), forming a set of Green's functions:
$$G(x,t\,|\,y) = \frac{1}{\big(2\pi\sigma_t^2\big)^{d/2}}\exp\left(-\frac{|x - e^{-t}y|^2}{2\sigma_t^2}\right)\,. \tag{6}$$
These functions are Gaussian in $x$, centered at $e^{-t}y$ with isotropic variance $\sigma_t^2$. This set of functions is also referred to as the transition kernel from time $0$, conditioned on $X_0 = y$, to time $t$.
In the idealized setting with the target distribution $p_0$ fully known, setting $p(x,0) = p_0$, the solution of (1) becomes the superposition of Green's functions weighted by $p_0$, namely:
$$p(x,t) = \int_{\mathbb{R}^d} G(x,t\,|\,y)\,p_0(y)\,\mathrm{d}y\,, \tag{7}$$
thus by definition, the score function is explicit:
$$\nabla\log p(x,t) = \frac{\int_{\mathbb{R}^d} \nabla_x G(x,t\,|\,y)\,p_0(y)\,\mathrm{d}y}{\int_{\mathbb{R}^d} G(x,t\,|\,y)\,p_0(y)\,\mathrm{d}y} = \int_{\mathbb{R}^d} f(x,t\,|\,y)\,\frac{G(x,t\,|\,y)\,p_0(y)}{\int_{\mathbb{R}^d} G(x,t\,|\,y')\,p_0(y')\,\mathrm{d}y'}\,\mathrm{d}y\,, \tag{8}$$
where we called (7) and used the notation $f(x,t\,|\,y) := \nabla_x\log G(x,t\,|\,y)$ to denote the conditional flow field. This function maps $\mathbb{R}^d \times [0,T]$ to $\mathbb{R}^d$. Using the explicit formula (6), we have an explicit form for the conditional flow field:
$$f(x,t\,|\,y) = -\frac{x - e^{-t}y}{1 - e^{-2t}}\,. \tag{9}$$
It is a linear function of $x$ with a Lipschitz constant $\big(1 - e^{-2t}\big)^{-1}$ that blows up at $t = 0$.
Score matching. The practical setting is not idealized: the lack of an explicit form of $p_0$ prevents the direct computation of (8). Algorithmically, one needs to learn the score from existing samples. A neural network (NN) is then deployed.
Intuitively, the NN should provide a function as close as possible to the true score function, meaning it solves:
$$\min_{s\in\mathcal{F}}\;\mathcal{L}_{\mathrm{SM}}(s) = \mathbb{E}_{t\sim U[0,T]}\,\mathbb{E}_{x\sim p(\cdot,t)}\Big[\big|s(x,t) - \nabla\log p(x,t)\big|^2\Big]\,,$$
where $U[0,T]$ is the uniform distribution over the time interval, and $\mathcal{F}$ is a hypothesis space, in this context the function space representable by a class of neural networks. However, neither $p(\cdot,t)$ nor $\nabla\log p(\cdot,t)$ is known in this formulation, so we turn to an equivalent problem:
$$\min_{s\in\mathcal{F}}\;\mathcal{L}_{\mathrm{CSM}}(s) = \mathbb{E}_{t\sim U[0,T]}\,\mathbb{E}_{y\sim p_0}\,\mathbb{E}_{x\sim G(\cdot,t\,|\,y)}\Big[\big|s(x,t) - f(x,t\,|\,y)\big|^2\Big]\,,$$
where the subindex CSM stands for conditional score matching. The two problems can be shown to be mathematically equivalent up to a constant independent of $s$; see Lemma B.1. Practically, however, this new problem is much more tractable, now with both $G(x,t\,|\,y)$ and $f(x,t\,|\,y)$ explicit; see (6) and (9).
The target distribution $p_0$ is still unknown. At hand, we have many samples drawn from it: $\{x_i\}_{i=1}^N \overset{\mathrm{i.i.d.}}{\sim} p_0$. This allows us to reformulate the problem into an empirical risk minimization (ERM) problem:
$$\min_{s\in\mathcal{F}}\;\mathcal{L}^{N}_{\mathrm{CSM}}(s) = \mathbb{E}_{t\sim U[0,T]}\left[\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{x\sim G(\cdot,t\,|\,x_i)}\Big[\big|s(x,t) - f(x,t\,|\,x_i)\big|^2\Big]\right]\,, \tag{10}$$
with $t \sim U[0,T]$ and $x \sim G(\cdot,t\,|\,x_i)$ simulated through (5).
In the execution of a practical DDPM algorithm, (10) is first run to find an NN serving as a good approximation to the score function, termed $s_\theta(x,t)$, and the user end then deploys this $s_\theta$ in (4) in place of $\nabla\log p$ to generate a new sample from $p_0$. Sample $Y_0 \sim \mathcal{N}(0, I_d)$ and run:
$$\mathrm{d}Y_t = \Big(Y_t + 2\,s_\theta\big(Y_t,\, T-t\big)\Big)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,. \tag{11}$$
The law of $Y_t$ is denoted by $q_t$. We note two differences between (4) and (11): the initial data $\overleftarrow{X}_0 \sim p(\cdot,T)$ is replaced by $Y_0 \sim \mathcal{N}(0, I_d)$, and the score function $\nabla\log p$ is replaced by the empirically learned score function $s_\theta$. If both approximations are accurate, we expect $q_t \approx \overleftarrow{p}_t$ for all $t$.
When minimizing the objective (10), in view of the singularity of $f(x,t\,|\,y)$ at $t = 0$ as seen in (9), it is standard practice to conduct “early stopping” (Song et al., 2020). This is to remove a small time interval $[0, t_0]$ around the origin from the training objective (10) and learn the score only for $t \in [t_0, T]$. Consequently, the sampling is also only run up to time $T - t_0$. The algorithm returns samples drawn from $q_{T-t_0}$. The hope is that $q_{T-t_0}$ approximates the target $p_0$ through the following approximation chain:
$$q_{T-t_0} \approx \overleftarrow{p}_{T-t_0} = p_{t_0} \approx p_0\,.$$
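In practice, running (11) up to the early stopping time is a plain Euler-Maruyama loop. The sketch below is our schematic (the function name and defaults are ours); `score(x, t)` stands for any approximation $s_\theta(x,t)$ of the score at forward time $t$.

```python
import numpy as np

def ddpm_sample(score, n, d, T=5.0, t0=1e-3, n_steps=500, seed=None):
    """Run dY = (Y + 2 score(Y, T - t)) dt + sqrt(2) dB from t = 0 to t = T - t0.

    Returns samples whose law approximates q_{T - t0} ~ p(., t0) ~ p_0.
    """
    rng = np.random.default_rng(seed)
    y = rng.standard_normal((n, d))     # Y_0 ~ N(0, I_d), replacing p(., T)
    dt = (T - t0) / n_steps
    for k in range(n_steps):
        t = k * dt                      # backward clock; forward time is T - t
        y += (y + 2.0 * score(y, T - t)) * dt \
             + np.sqrt(2 * dt) * rng.standard_normal((n, d))
    return y

# Sanity check: if p_0 = N(0, I_d), the OU flow is stationary and the exact score
# is grad log p(x, t) = -x for all t, so the output stays standard Gaussian.
out = ddpm_sample(lambda x, t: -x, n=2000, d=2, seed=0)
print(out.mean(axis=0), out.std(axis=0))  # approx [0, 0] and [1, 1]
```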
2.3 Error analysis for DDPM
In the idealized setting, $T \to \infty$, $t_0 = 0$, $s_\theta = \nabla\log p$, and the backward SDE (11) is run perfectly; then a sample initially drawn from the Gaussian represents the target distribution at time $T$. Computationally, these assumptions all break: all four factors (finite $T$, nontrivial $t_0$, imperfect $s_\theta$, and the discretization of (11)) induce error. These errors were beautifully analyzed in (Chen et al., 2022; Benton et al., 2023a). We summarize their results briefly.
All analyses require the target distribution to have a bounded second moment.
Assumption 2.1 (bounded second moment).
We assume that $m_2^2 := \mathbb{E}_{x\sim p_0}\big[|x|^2\big] < \infty$.
The learned score function is also assumed to be close to the ground-truth in $L^2$:
Assumption 2.2 (score estimation error).
The score estimate $s_\theta$ satisfies
$$\frac{1}{T - t_0}\int_{t_0}^{T}\mathbb{E}_{x\sim p(\cdot,t)}\Big[\big|s_\theta(x,t) - \nabla\log p(x,t)\big|^2\Big]\,\mathrm{d}t \le \epsilon^2\,.$$
Under these assumptions, it was concluded that DDPM samples well:
Theorem 2.3 (Modified version of Theorem 1 in (Benton et al., 2023a)).
Under Assumptions 2.1 and 2.2, the law $q_{T-t_0}$ of the DDPM (11) satisfies
$$\mathrm{TV}\big(q_{T-t_0},\; p_{t_0}\big) \lesssim \epsilon + e^{-T}\sqrt{\mathrm{KL}\big(p_0\,\|\,\mathcal{N}(0, I_d)\big)}\,.$$
The discretization error in the original result is irrelevant to the discussion here and is omitted. This upper error bound consists of two parts. The first term comes from the score approximation error, while the second term comes from the finite time truncation, where we forcefully replace $p(\cdot, T)$ by $\mathcal{N}(0, I_d)$.
The theorem states that, when $T$ is large enough and the score function is approximated well in the $L^2$ sense, the TV distance between the law of the generated samples and $p_{t_0}$ is very small, concluding that DDPM is a good sampling strategy.
It is tempting to push this statement further and claim that DDPM is also a good generative model. Indeed, on the surface, it is typically claimed that generative modeling is equivalent to drawing samples from a target distribution $p_0$. However, we should note a stark difference between sampling and generation: a meaningful generative model should be able to produce samples that are not mere replicas of known ones. The error bound in Theorem 2.3 does not exclude this possibility. As will be shown in Section 3, it is possible to design a DDPM whose score function is learned well, so according to Theorem 2.3 it produces a distribution close to the target. Yet in Section 4, we demonstrate that this model fails to produce new samples. These two sections combined suggest that a DDPM with a well-learned score function does not necessarily constitute a meaningful generative model.
3 A good score estimate: sample complexity analysis
Inspired by Theorem 2.3, we are to design a DDPM whose learned score function satisfies Assumption 2.2. Throughout the section, we assume the hypothesis space $\mathcal{F}$ is large enough (all smooth functions, for example), and the learned score estimate achieves the global minimum of the ERM (10). In practical training, the error heavily depends on the specific NN structure utilized in the optimization. The approximation error of NN training is beyond the scope of the current paper.
Noting that the objective (10) is a convex functional of $s$, the optimizer has a closed form. As derived in Lemma B.2, for $t > 0$, the empirical optimal score function is:
$$s^N(x,t) = \frac{\frac{1}{N}\sum_{i=1}^{N} \nabla_x G(x,t\,|\,x_i)}{\frac{1}{N}\sum_{i=1}^{N} G(x,t\,|\,x_i)} = \sum_{i=1}^{N}\frac{G(x,t\,|\,x_i)}{\sum_{j=1}^{N} G(x,t\,|\,x_j)}\;f(x,t\,|\,x_i)\,, \tag{12}$$
where $f(x,t\,|\,y)$ is the conditional flow field; see (9).
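For concreteness, (12) can be evaluated in closed form from the training set: the weights form a softmax of $-|x - e^{-t}x_i|^2/(2\sigma_t^2)$ over $i$, since the common normalizing constant of the Green's functions cancels. A minimal numpy sketch (ours), with a log-sum-exp shift for numerical stability:

```python
import numpy as np

def empirical_optimal_score(x, t, train_data):
    """Evaluate s^N(x, t) of (12): a softmax-weighted average of flow fields.

    x: (d,) query point; t > 0; train_data: (N, d) samples x_i.
    """
    var = 1.0 - np.exp(-2.0 * t)                  # sigma_t^2
    centers = np.exp(-t) * train_data             # (N, d): e^{-t} x_i
    logw = -np.sum((x - centers) ** 2, axis=1) / (2.0 * var)
    w = np.exp(logw - logw.max())                 # stable softmax weights
    w /= w.sum()
    flows = -(x - centers) / var                  # f(x, t | x_i), see (9)
    return w @ flows                              # (d,): the empirical score
```

Vectorized over query points, this function can be plugged into the `ddpm_sample` sketch above to realize the sampler (13) below.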
Accordingly, the DDPM draws an initial datum $\mathsf{Y}_0$ from $\mathcal{N}(0, I_d)$ and evolves the following SDE:
$$\mathrm{d}\mathsf{Y}_t = \Big(\mathsf{Y}_t + 2\,s^N\big(\mathsf{Y}_t,\, T-t\big)\Big)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,. \tag{13}$$
We denote the law of the samples $\mathsf{Y}_t$ by $\mathsf{q}_t$. The choice of the sans-serif font indicates that the law is produced by a finite dimensional object, the samples $\{x_i\}_{i=1}^N$.
To understand the empirical optimal score function, we compare (12) with the ground-truth score function (8). It is clear that $s^N$ can be interpreted as a Monte-Carlo (MC) approximation of $\nabla\log p$, replacing both integrals in the numerator and the denominator of (8) by empirical means. The law of large numbers suggests the empirical mean should converge to the true mean when the number of samples is big. Therefore, it is expected that $s^N$ approximates $\nabla\log p$ well with very high probability when $N$ is large.
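In display form, the comparison reads (a restatement of (8) and (12), with LLN standing for the law of large numbers):
$$s^N(x,t) = \frac{\frac{1}{N}\sum_{i=1}^{N}\nabla_x G(x,t\,|\,x_i)}{\frac{1}{N}\sum_{i=1}^{N} G(x,t\,|\,x_i)} \;\xrightarrow[N\to\infty]{\mathrm{LLN}}\; \frac{\mathbb{E}_{y\sim p_0}\big[\nabla_x G(x,t\,|\,y)\big]}{\mathbb{E}_{y\sim p_0}\big[G(x,t\,|\,y)\big]} = \nabla\log p(x,t)\,.$$
We formulate this result in the following theorem.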
Theorem 3.1 (Approximation error of the empirical optimal score function).
Let $\{x_i\}_{i=1}^N$ be i.i.d. samples drawn from the target data distribution $p_0$. Denote by $\nabla\log p$ and $s^N$ the true and the empirical optimal score function, respectively, as defined in (8) and (12). Then for any fixed $0 < t_0 < T$, $\epsilon > 0$ and $\eta \in (0,1)$, we have
$$\frac{1}{T - t_0}\int_{t_0}^{T}\mathbb{E}_{x\sim p(\cdot,t)}\Big[\big|s^N(x,t) - \nabla\log p(x,t)\big|^2\Big]\,\mathrm{d}t \le \epsilon^2$$
with probability at least $1 - \eta$, provided that the number of training samples satisfies $N \ge N_0$. In particular:
• Case 1: If $p_0$ is an isotropic Gaussian, i.e. $p_0 = \mathcal{N}(\mu, \sigma_0^2 I_d)$, with second moment of order $\sqrt{d}$, then $N_0 \propto \eta^{-1}\epsilon^{-2}$, with the proportionality constant depending on $d$, $t_0$ and $T$;
• Case 2: If $p_0$ is supported on the Euclidean ball of radius $R$ such that $R = O(\sqrt{d})$, then likewise $N_0 \propto \eta^{-1}\epsilon^{-2}$, with a different constant depending on $d$, $t_0$ and $T$.
The theorem implies that when the sample size is large, with $N \ge N_0$, we have high confidence ($1-\eta$) that the empirical optimal score function $s^N$, computed using the i.i.d. samples $\{x_i\}_{i=1}^N$, is within $\epsilon$ distance of the true score function in the time-averaged $L^2(p_t)$ sense.
Remark 3.2.
A few comments are in line:
• (a) Second moment and support radius: The second moment and support radius $R$ being of the same order as $\sqrt{d}$ is only for notational convenience. In the proof, the assumption can be relaxed. When we do so, the success rate needs to be adjusted accordingly (see the discussions in Appendix C).
• (b) Implication on DDPM performance: Combining Theorem 3.1 with Theorem 2.3, it is straightforward to draw a conclusion on the performance of DDPM in terms of sample complexity. Under the same assumptions as in Theorem 3.1, for any tolerance $\epsilon > 0$, by choosing $T$ of order $\ln(1/\epsilon)$ and $N \ge N_0$, it holds that the DDPM algorithm run according to (13) with the empirical optimal score function computed from (12) gives:
$$\mathrm{TV}\big(\mathsf{q}_{T-t_0},\; p_{t_0}\big) \lesssim \epsilon$$
with probability at least $1 - \eta$.
• (c) Error dependence on parameters: Both the confidence level parameter $\eta$ and the accuracy parameter $\epsilon$ appear algebraically in $N_0$. The rate $\epsilon^{-2}$ comes from the MC sampling convergence of $s^N$ and is expected to be optimal. The rate $\eta^{-1}$ reflects the fact that the proof uses the simple Markov inequality.
We leave the main proof to Appendix C and only briefly discuss the proof strategy, using Case 1 as an example.
Sketch of proof.
Denote the error term
$$E(x,t) := s^N(x,t) - \nabla\log p(x,t)\,, \tag{14}$$
and
$$\mathcal{E} := \frac{1}{T - t_0}\int_{t_0}^{T}\mathbb{E}_{x\sim p(\cdot,t)}\big[|E(x,t)|^2\big]\,\mathrm{d}t\,.$$
$\mathcal{E}$ defines a function that maps the samples $\{x_i\}_{i=1}^N$ to $[0,\infty)$, and is a random variable itself. According to Markov's inequality:
$$\mathbb{P}\big(\mathcal{E} \ge \epsilon^2\big) \le \frac{\mathbb{E}\big[\mathcal{E}\big]}{\epsilon^2}\,. \tag{15}$$
To compute the right-hand side, we note
$$\mathbb{E}\big[\mathcal{E}\big] = \frac{1}{T - t_0}\int_{t_0}^{T}\mathbb{E}_{x\sim p(\cdot,t)}\,\mathbb{E}_{\{x_i\}}\big[|E(x,t)|^2\big]\,\mathrm{d}t\,, \tag{16}$$
and for fixed $(x,t)$, according to the definition (14), one can show a variance-type bound of MC order:
$$\mathbb{E}_{\{x_i\}}\big[|E(x,t)|^2\big] \lesssim \frac{C(x,t)}{N}\,. \tag{17}$$
Taking the expectation with respect to $x \sim p(\cdot,t)$ in (17) and averaging in $t$, we have
$$\mathbb{E}\big[\mathcal{E}\big] \lesssim \frac{C(d, t_0, T)}{N}\,,$$
finishing the proof when combined with (15). ∎
4 A bad SGM: memorization effects
The results of Theorem 2.3 and Theorem 3.1 combined imply that the DDPM (11) run with the empirical optimal score function $s^N$ provides a good sampling method with high probability. It is tempting to push this statement further and call it a good generative model. We are to show in this section that this is not the case. In particular, we claim that DDPM run with $s^N$ leads to a kernel density estimation (KDE).
To be more precise, with $\{x_i\}_{i=1}^N$ i.i.d. drawn from the target distribution $p_0$, the DDPM (11) run with $s^N$ produces a distribution that is a convolution of a Gaussian with the empirical distribution $p_0^N$, and hence is a KDE of $p_0$. Since the context is clear, throughout the section we drop the lower index when no confusion arises.
The statement above stems from two simple observations. Firstly, the solution to the system (1) with the initial distribution set to be $p_0^N$ is a simple Gaussian convolution with $p_0^N$; and secondly, the exact score function of this new system (initialized at $p_0^N$) happens to be the empirical optimal score function (12).
To expand on this, we first set the initial data for (1) as $p_0^N$, the empirical distribution. The theory in Section 2.2 still applies. In particular, the solution to (1), denoted by $p^N(x,t)$, and the solution to (2), denoted by $\overleftarrow{p}^N(x,t)$, still have explicit forms using the Green's functions:
$$p^N(x,t) = \frac{1}{N}\sum_{i=1}^{N} G(x,t\,|\,x_i)\,, \qquad \overleftarrow{p}^N(x,t) = p^N(x, T-t)\,. \tag{18}$$
For small $t$, we have $e^{-t} \approx 1$ and $\sigma_t \approx \sqrt{2t}$, so the PDE solution (18) presents a strong similarity to a KDE of $p_0$ with bandwidth $\sigma_t$:
$$p^N(\cdot, t) \approx G_{\sigma_t} * p_0^N = p^{\sigma_t}_{\mathrm{KDE}}\,,$$
where $*$ is the convolution operator. The resemblance can be characterized mathematically precisely:
Proposition 4.1.
Suppose the training samples satisfy $|x_i| \le R$ for $1 \le i \le N$, with $R = O(\sqrt{d})$. Then for small $t$,
$$\mathrm{TV}\big(p^N(\cdot,t),\; p^{\sigma_t}_{\mathrm{KDE}}\big) \lesssim R\sqrt{t}\,,$$
where $\sigma_t$ is defined in (5).
This means the forward and backward procedure described in (1)-(2) approximately provides a simple KDE to the target distribution when initialized with the empirical distribution.
We now further claim that this forward and backward procedure is realized by running the SGM with the empirical optimal score $s^N$. To see this, we follow the computation in (8) and call (18) to obtain:
$$\nabla\log p^N(x,t) = \frac{\frac{1}{N}\sum_{i=1}^{N}\nabla_x G(x,t\,|\,x_i)}{\frac{1}{N}\sum_{i=1}^{N} G(x,t\,|\,x_i)} = s^N(x,t)\,.$$
This means the exact score function of the system initialized at $p_0^N$ exactly recovers $s^N$, the empirical optimal score for $p_0$, and thus the SGM with the empirical optimal score realizes the KDE approximation, as seen in the following proposition.
Proposition 4.2.
Under the same assumptions as Proposition 4.1 and for $T \ge 1$, the law $\mathsf{q}_t$ of the SGM algorithm (13) run with the empirical optimal score function satisfies
$$\mathrm{TV}\big(\mathsf{q}_{T-t},\; p^N(\cdot,t)\big) \lesssim e^{-T}\sqrt{d + R^2}\,,$$
where the error accounts for initializing (13) at $\mathcal{N}(0, I_d)$ in place of $p^N(\cdot, T)$.
Combining Propositions 4.1 and 4.2 using the triangle inequality, we see that $\mathsf{q}_{T-t}$ is essentially a kernel density estimation when $t$ approaches $0$. Furthermore, if one pushes $t \to 0$ and $T \to \infty$, we obtain the finite-support result:
Theorem 4.3 (SGM with the empirical optimal score function resembles KDE).
Under the same assumptions as Proposition 4.2, the SGM algorithm (13) with the empirical optimal score function returns a simple Gaussian convolution with the empirical distribution in the form of (18), and it presents the following behavior:
• (with early stopping) for any $\epsilon > 0$, setting $t_0 \lesssim \epsilon^2/R^2$ and $T \gtrsim \ln\big(\sqrt{d+R^2}/\epsilon\big)$, we have
$$\mathrm{TV}\big(\mathsf{q}_{T-t_0},\; p^{\sigma_{t_0}}_{\mathrm{KDE}}\big) \lesssim \epsilon\,;$$
• (without early stopping) by taking the limits $t_0 \to 0$ and $T \to \infty$, $\mathsf{q}_{T-t_0}$ converges to the empirical distribution $p_0^N$.
The theorem suggests that DDPM with the empirical optimal score function is, in the end, simply a KDE of the target $p_0$. However close the KDE is to the target $p_0$, it is nevertheless only an object carrying a finite amount of information.
Unlike drawing from $p_0$, where one can generate a completely new sample independent of the training samples, drawing from $\mathsf{q}_{T-t_0}$ can only provide replicas of $\{x_i\}_{i=1}^N$ (with a slight shift and polluted by Gaussian noise). In summary, an SGM run with the empirical optimal score function fails the task of generation.
5 Numerical Experiments
This section is dedicated to providing numerical evidence for Theorem 3.1 and Theorem 4.3. Throughout the experiments, we choose the target data distribution to be a $d$-dimensional isotropic Gaussian, denoted by $\mathcal{N}(\mu, \sigma_0^2 I_d)$. The implementation details are provided in Appendix E.
We first estimate the score approximation error of the empirical optimal score function, as delineated in Theorem 3.1, for various sizes of the training sample $N$. Figure 2 shows that the error decreases at a rate of approximately $N^{-1/2}$, confirming the theoretical finding in Theorem 3.1; see also Remark 3.2(c).

Secondly, we showcase Theorem 4.3 and demonstrate that DDPM behaves as a KDE when equipped with the empirical optimal score function. As seen in Figure 3, samples produced by DDPM run with $s^N$ exhibit a high concentration around the training samples. Conversely, while the samples generated by DDPM run with the true score function appear to be drawn from the same distribution as the training samples, they are not mere duplicates of the existing ones.


6 Discussion and Conclusion
Classical theory measures the success of a score-based generative model by the distance between the learned distribution and the ground-truth distribution. Under this criterion, an SGM would be successful if the score function is learned well.
In this paper, we provide a counter-example of an SGM that has a good score approximation yet produces meaningless samples. On the one hand, Theorem 2.3 and Theorem 3.1 combined suggest that an SGM equipped with the empirical optimal score function learns a distribution close to the ground-truth. On the other hand, Theorem 4.3 shows that this scenario amounts to Gaussian kernel density estimation and can only reproduce existing training samples with Gaussian blurring.
This apparent paradox between sound theoretical convergence and poor empirical generation of new samples indicates that current theoretical criteria may not be sufficient to fully evaluate the performance of generative models. The criteria focus strongly on the “imitation” capability and lose out on quantifying “creativity”. Similar features have been observed in other generative models such as generative adversarial networks (Vardanyan et al., 2023), and different criteria have been proposed (Vardanyan et al., 2023; Yi et al., 2023), yet a comprehensive end-to-end convergence analysis under these criteria has not been done for SGMs. We leave this exploration to future research.
Broader Impact
Our results, while theoretical in nature, have potential positive impacts in motivating better frameworks to ensure that generative models do not create unintended leakage of private information. We believe that there are no clear negative societal consequences of this theoretical work.
Acknowledgements
The three authors are supported in part by NSF-DMS 1750488, and NSF-DMS 2308440. S. Li is further supported by NSF-DMS 2236447.
References
- Albergo et al. (2023) Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
- Bakry et al. (2014) Bakry, D., Gentil, I., Ledoux, M., et al. Analysis and geometry of Markov diffusion operators, volume 103. Springer, 2014.
- Benton et al. (2023a) Benton, J., De Bortoli, V., Doucet, A., and Deligiannidis, G. Linear convergence bounds for diffusion models via stochastic localization. arXiv preprint arXiv:2308.03686, 2023a.
- Benton et al. (2023b) Benton, J., Deligiannidis, G., and Doucet, A. Error bounds for flow matching methods. arXiv preprint arXiv:2305.16860, 2023b.
- Block et al. (2022) Block, A., Mroueh, Y., and Rakhlin, A. Generative modeling with denoising auto-encoders and langevin sampling, 2022.
- Carlini et al. (2023) Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramèr, F., Balle, B., Ippolito, D., and Wallace, E. Extracting training data from diffusion models, 2023.
- Chen et al. (2023a) Chen, H., Lee, H., and Lu, J. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pp. 4735–4763. PMLR, 2023a.
- Chen et al. (2023b) Chen, M., Huang, K., Zhao, T., and Wang, M. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. arXiv preprint arXiv:2302.07194, 2023b.
- Chen et al. (2022) Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. R. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022.
- Chen et al. (2023c) Chen, S., Chewi, S., Lee, H., Li, Y., Lu, J., and Salim, A. The probability flow ode is provably fast. arXiv preprint arXiv:2305.11798, 2023c.
- Chen et al. (2023d) Chen, S., Daras, G., and Dimakis, A. G. Restoration-degradation beyond linear diffusions: A non-asymptotic analysis for ddim-type samplers. arXiv preprint arXiv:2303.03384, 2023d.
- Cui et al. (2023) Cui, H., Krzakala, F., Vanden-Eijnden, E., and Zdeborová, L. Analysis of learning a flow-based generative model from limited sample complexity, 2023.
- De Bortoli (2022) De Bortoli, V. Convergence of denoising diffusion models under the manifold hypothesis. arXiv preprint arXiv:2208.05314, 2022.
- De Bortoli et al. (2021) De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.
- Donahue et al. (2018) Donahue, C., McAuley, J., and Puckette, M. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.
- Gong et al. (2022) Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933, 2022.
- Gu et al. (2023) Gu, X., Du, C., Pang, T., Li, C., Lin, M., and Wang, Y. On memorization in diffusion models. arXiv preprint arXiv:2310.02664, 2023.
- Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models, 2020.
- Huang et al. (2018) Huang, H., He, R., Sun, Z., Tan, T., et al. Introvae: Introspective variational autoencoders for photographic image synthesis. Advances in neural information processing systems, 31, 2018.
- Huang et al. (2022) Huang, R., Lam, M. W., Wang, J., Su, D., Yu, D., Ren, Y., and Zhao, Z. Fastdiff: A fast conditional diffusion model for high-quality speech synthesis. arXiv preprint arXiv:2204.09934, 2022.
- Kadkhodaie et al. (2023) Kadkhodaie, Z., Guth, F., Simoncelli, E. P., and Mallat, S. Generalization in diffusion models arises from geometry-adaptive harmonic representation, 2023.
- Karras et al. (2022) Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
- Kong et al. (2020a) Kong, J., Kim, J., and Bae, J. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33:17022–17033, 2020a.
- Kong et al. (2020b) Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020b.
- Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images, 2009.
- Kwon et al. (2022) Kwon, D., Fan, Y., and Lee, K. Score-based generative modeling secretly minimizes the wasserstein distance. Advances in Neural Information Processing Systems, 35:20205–20217, 2022.
- Lee et al. (2022) Lee, H., Lu, J., and Tan, Y. Convergence for score-based generative modeling with polynomial complexity. Advances in Neural Information Processing Systems, 35:22870–22882, 2022.
- Li et al. (2023) Li, G., Wei, Y., Chen, Y., and Chi, Y. Towards faster non-asymptotic convergence for diffusion-based generative models. arXiv preprint arXiv:2306.09251, 2023.
- Li et al. (2022) Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328–4343, 2022.
- Lipman et al. (2022) Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- New York Times (2023) New York Times. The New York Times sued OpenAI and Microsoft for copyright infringement, 2023. URL https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf.
- Oko et al. (2023) Oko, K., Akiyama, S., and Suzuki, T. Diffusion models are minimax optimal distribution estimators, 2023.
- Rombach et al. (2022) Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
- Somepalli et al. (2022) Somepalli, G., Singla, V., Goldblum, M., Geiping, J., and Goldstein, T. Diffusion art or digital forgery? investigating data replication in diffusion models, 2022.
- Somepalli et al. (2023) Somepalli, G., Singla, V., Goldblum, M., Geiping, J., and Goldstein, T. Understanding and mitigating copying in diffusion models. arXiv preprint arXiv:2305.20086, 2023.
- Song et al. (2020) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- Terrell & Scott (1992) Terrell, G. R. and Scott, D. W. Variable kernel density estimation. The Annals of Statistics, pp. 1236–1265, 1992.
- Vardanyan et al. (2023) Vardanyan, E., Minasyan, A., Hunanyan, S., Galstyan, T., and Dalalyan, A. Guaranteed optimal generative modeling with maximum deviation from the empirical distribution. arXiv preprint arXiv:2307.16422, 2023.
- Wang et al. (2018) Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807, 2018.
- Watson et al. (2023) Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., Ahern, W., Borst, A. J., Ragotte, R. J., Milles, L. F., et al. De novo design of protein structure and function with rfdiffusion. Nature, 620(7976):1089–1100, 2023.
- Wibisono & Yingxi Yang (2022) Wibisono, A. and Yingxi Yang, K. Convergence in kl divergence of the inexact langevin algorithm with application to score-based generative models. arXiv e-prints, pp. arXiv–2211, 2022.
- Yi et al. (2023) Yi, M., Sun, J., and Li, Z. On the generalization of diffusion model, 2023.
- Yoon et al. (2023) Yoon, T., Choi, J. Y., Kwon, S., and Ryu, E. K. Diffusion probabilistic models generalize when they fail to memorize. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
Appendix A Notations
Partial differential equations (PDEs). Let $\mathbb{R}^d$ be the $d$-dimensional Euclidean space and $T$ the time horizon. Denote by $x \in \mathbb{R}^d$ and $t \in [0,T]$ the spatial variable and time variable respectively. The gradient of a real-valued function $u(x,t)$ with respect to the spatial variable and the time-derivative of $u$ are denoted by $\nabla u$ and $\partial_t u$ respectively. The Laplacian of $u$ is denoted by $\Delta u = \nabla\cdot(\nabla u)$. Here, $\nabla\cdot v$ indicates the divergence of a vector field $v$ with respect to the spatial variable $x$.
Stochastic differential equations (SDEs) and their laws.
• The target data distribution is $p_0$.
• The forward process (3) initialized at the target distribution is denoted $X_t$, and $X_t \sim p_t := p(\cdot, t)$.
• The backward process (4) is denoted $\overleftarrow{X}_t$, where $\overleftarrow{X}_t \sim \overleftarrow{p}_t := p(\cdot, T-t)$.
• The DDPM algorithm (11) with an arbitrary learned score function is denoted $Y_t$, and $Y_t \sim q_t$. We initialize the process at $Y_0 \sim \mathcal{N}(0, I_d)$, the standard Gaussian distribution.
• The DDPM algorithm (13) with the empirical optimal score function is denoted by $\mathsf{Y}_t$. We indicate the law at time $t$ as $\mathsf{q}_t$ and let $\mathsf{Y}_0 \sim \mathcal{N}(0, I_d)$.
• The law of the forward process (3) initialized at the empirical distribution $p_0^N$ at time $t$ is indicated by $p^N_t := p^N(\cdot, t)$. The law of the corresponding backward process at time $t$ is denoted by $\overleftarrow{p}^N_t := p^N(\cdot, T-t)$.
Other notations. We denote by $p_0$ the target data distribution supported on a subset of $\mathbb{R}^d$, and indicate the empirical distribution by $p_0^N$. The Gaussian kernel with bandwidth $\sigma$ is denoted by $G_\sigma$. For the special case $\sigma = 1$, i.e. the standard Gaussian, we use the notation $G$. We denote the Gaussian KDE with bandwidth $\sigma$ as $p^{\sigma}_{\mathrm{KDE}}$. The early stopping time for running the SDEs is indicated by $t_0$. We use $a \lesssim b$ to denote $a \le C\,b$ for some absolute constant $C > 0$.
Appendix B Empirical optimal score function
Lemma B.1.
Assume that $\mathbb{E}_{x\sim p(\cdot,t)}\big[|s(x,t)|^2\big] < \infty$ for all $s \in \mathcal{F}$ and $t \in [0,T]$. Then, up to a constant independent of the function $s$, $\mathcal{L}_{\mathrm{SM}}(s)$ and $\mathcal{L}_{\mathrm{CSM}}(s)$ are equal.
Proof.
We follow the proof of Theorem 2 in (Lipman et al., 2022). We assume that $p_0$ and $G(\cdot,t\,|\,y)$ decrease to zero at a sufficient speed as $|x| \to \infty$, and that $s$ and $\nabla\log p$ are bounded in both the time and space variables. These assumptions ensure the existence of all integrals and allow changing the order of integration (by Fubini's theorem).
To prove that $\mathcal{L}_{\mathrm{SM}}$ and $\mathcal{L}_{\mathrm{CSM}}$ are equal up to a constant independent of the function $s$, we only need to show that for any fixed $t$,
$$\mathbb{E}_{x\sim p(\cdot,t)}\Big[\big|s(x,t) - \nabla\log p(x,t)\big|^2\Big] = \mathbb{E}_{y\sim p_0}\,\mathbb{E}_{x\sim G(\cdot,t\,|\,y)}\Big[\big|s(x,t) - f(x,t\,|\,y)\big|^2\Big] + C\,,$$
where $C$ is a constant independent of the function $s$. We can compute that
$$\mathbb{E}_{x\sim p(\cdot,t)}\big[\langle s(x,t),\, \nabla\log p(x,t)\rangle\big] = \int \langle s(x,t),\, \nabla_x p(x,t)\rangle\,\mathrm{d}x = \int\!\!\int \langle s(x,t),\, \nabla_x G(x,t\,|\,y)\rangle\, p_0(y)\,\mathrm{d}y\,\mathrm{d}x = \mathbb{E}_{y\sim p_0}\,\mathbb{E}_{x\sim G(\cdot,t\,|\,y)}\big[\langle s(x,t),\, f(x,t\,|\,y)\rangle\big]\,,$$
where in the second equality we use the definition of $p(x,t)$ in (7), and in the third equality we change the order of integration and use $\nabla_x G = G\,f$.
Therefore we have
$$\mathbb{E}_{x\sim p(\cdot,t)}\big[|s - \nabla\log p|^2\big] - \mathbb{E}_{y\sim p_0}\,\mathbb{E}_{x\sim G(\cdot,t\,|\,y)}\big[|s - f|^2\big] = \mathbb{E}_{x\sim p(\cdot,t)}\big[|\nabla\log p|^2\big] - \mathbb{E}_{y\sim p_0}\,\mathbb{E}_{x\sim G(\cdot,t\,|\,y)}\big[|f|^2\big]\,,$$
where the last identity follows because the quadratic terms in $s$ coincide, and the two remaining terms are independent of $s$. ∎
Lemma B.2.
The optimizer $s^N$ of the objective function $\mathcal{L}^{N}_{\mathrm{CSM}}$ in (10) has the form
$$s^N(x,t) = \frac{\sum_{i=1}^{N}\nabla_x G(x,t\,|\,x_i)}{\sum_{i=1}^{N} G(x,t\,|\,x_i)}\,.$$
Proof.
Since the objective is a convex functional of $s$, by the first-order optimality condition the optimizer should satisfy, for almost every $(x,t)$,
$$\sum_{i=1}^{N} G(x,t\,|\,x_i)\big(s(x,t) - f(x,t\,|\,x_i)\big) = 0\,,$$
which implies that for $t > 0$,
$$s^N(x,t) = \frac{\sum_{i=1}^{N} G(x,t\,|\,x_i)\,f(x,t\,|\,x_i)}{\sum_{j=1}^{N} G(x,t\,|\,x_j)} = \frac{\sum_{i=1}^{N}\nabla_x G(x,t\,|\,x_i)}{\sum_{i=1}^{N} G(x,t\,|\,x_i)}\,.$$
∎
Appendix C Approximation error of empirical optimal score function
In this section, we provide the full proof of Theorem 3.1. For completeness, we restate the theorem:
Theorem C.1 (Approximation error of the empirical optimal score function).
Let $\{x_i\}_{i=1}^N$ be i.i.d. samples drawn from the target data distribution $p_0$. Denote by $\nabla\log p$ and $s^N$ the true and the empirical optimal score function respectively, as defined in (8) and (12). Then for any fixed $0 < t_0 < T$, $\epsilon > 0$ and $\eta \in (0,1)$, we have
$$\frac{1}{T - t_0}\int_{t_0}^{T}\mathbb{E}_{x\sim p(\cdot,t)}\Big[\big|s^N(x,t) - \nabla\log p(x,t)\big|^2\Big]\,\mathrm{d}t \le \epsilon^2$$
with probability at least $1 - \eta$, provided that the number of training samples satisfies $N \ge N_0$, where $N_0$ is defined based on the nature of $p_0$:
• Case 1: If $p_0$ is an isotropic Gaussian, i.e. $p_0 = \mathcal{N}(\mu, \sigma_0^2 I_d)$, with second moment of order $\sqrt{d}$, then $N_0 \propto \eta^{-1}\epsilon^{-2}$, with the proportionality constant depending on $d$, $t_0$ and $T$;
• Case 2: If $p_0$ is supported on the Euclidean ball of radius $R$ such that $R = O(\sqrt{d})$, then likewise $N_0 \propto \eta^{-1}\epsilon^{-2}$, with a different constant depending on $d$, $t_0$ and $T$.
Proof of Theorem 3.1.
Denote the error term
$$E(x,t) := s^N(x,t) - \nabla\log p(x,t)\,, \tag{19}$$
and
$$\mathcal{E} := \frac{1}{T - t_0}\int_{t_0}^{T}\mathbb{E}_{x\sim p(\cdot,t)}\big[|E(x,t)|^2\big]\,\mathrm{d}t\,.$$
$\mathcal{E}$ defines a function that maps the samples $\{x_i\}_{i=1}^N$ to $[0,\infty)$, and is a random variable itself. According to Markov's inequality:
$$\mathbb{P}\big(\mathcal{E} \ge \epsilon^2\big) \le \frac{\mathbb{E}\big[\mathcal{E}\big]}{\epsilon^2}\,. \tag{20}$$
The final conclusions are mainly built on an upper bound of the right-hand side in the Markov inequality above. We prove the results for Case 1 and Case 2 respectively.
• Case 1: Taking the expectation over $x \sim p(\cdot,t)$ of the pointwise bound in Lemma C.2 (Case 1) and averaging over $t \in [t_0, T]$ shows $\mathbb{E}[\mathcal{E}] \lesssim C/N$; inserting this into (20) and requiring the right-hand side to be at most $\eta$ yields the claimed threshold $N_0$.
• Case 2: The same argument, with the bound of Lemma C.2 (Case 2), gives the corresponding $N_0$. ∎
Lemma C.2.
Under the same assumptions as in Theorem C.1, for fixed $t \in [t_0, T]$, we have:
• Case 1: If $p_0$ is an isotropic Gaussian with second moment of order $\sqrt{d}$, then
$$\mathbb{E}_{x\sim p(\cdot,t)}\,\mathbb{E}_{\{x_i\}}\big[|E(x,t)|^2\big] \le \frac{C_1(d,t)}{N}$$
for an explicit constant $C_1(d,t)$;
• Case 2: If $p_0$ is supported on the Euclidean ball of radius $R$ such that $R = O(\sqrt{d})$, then
$$\mathbb{E}_{x\sim p(\cdot,t)}\,\mathbb{E}_{\{x_i\}}\big[|E(x,t)|^2\big] \le \frac{C_2(d,t)}{N}$$
for an explicit constant $C_2(d,t)$.
Proof.
• Case 1: By the definitions of $\nabla\log p$ and $s^N$ in (8) and (12), we can rewrite them as
$$\nabla\log p(x,t) = \frac{b(x,t)}{a(x,t)}\,, \tag{24}$$
$$s^N(x,t) = \frac{b^N(x,t)}{a^N(x,t)}\,, \tag{25}$$
where we denote
$$a := \mathbb{E}_{y\sim p_0}\big[G(x,t\,|\,y)\big]\,,\quad b := \mathbb{E}_{y\sim p_0}\big[\nabla_x G(x,t\,|\,y)\big]\,,\quad a^N := \frac{1}{N}\sum_{i=1}^{N} G(x,t\,|\,x_i)\,,\quad b^N := \frac{1}{N}\sum_{i=1}^{N}\nabla_x G(x,t\,|\,x_i)\,.$$
Then we can compute
$$|E(x,t)|^2 = \left|\frac{b^N}{a^N} - \frac{b}{a}\right|^2 \le 2\left|\frac{b^N - b}{a^N}\right|^2 + 2\left|\frac{b}{a}\right|^2\frac{|a - a^N|^2}{|a^N|^2}\,,$$
where the last inequality is Young's. Taking expectations over the samples and over $x \sim p(\cdot,t)$, and controlling the two terms using the auxiliary lemmas below (in particular Lemmas C.4, C.10 and C.12), we obtain the claimed bound with constant $C_1(d,t)$.
• Case 2: We use the same notations as in Case 1 and split the error into the two terms
$$I := 2\,\mathbb{E}\left[\left|\frac{b^N - b}{a^N}\right|^2\right]\,, \qquad II := 2\,\mathbb{E}\left[\left|\frac{b}{a}\right|^2\frac{|a - a^N|^2}{|a^N|^2}\right]\,.$$
We then bound the terms $I$ and $II$ respectively, now using the support assumption $|x_i| \le R$ in place of the Gaussian structure. Combining the upper bounds for terms $I$ and $II$, we obtain the claimed bound with constant $C_2(d,t)$. ∎
Lemma C.3.
With the notation of Lemma C.2, $\mathbb{E}_{\{x_i\}}\big[a^N(x,t)\big] = a(x,t)$ and $\mathbb{E}_{\{x_i\}}\big[b^N(x,t)\big] = b(x,t)$.
Proof.
This is immediate from the fact that the $x_i$ are i.i.d. draws from $p_0$, so each term $G(x,t\,|\,x_i)$ (resp. $\nabla_x G(x,t\,|\,x_i)$) has mean $a(x,t)$ (resp. $b(x,t)$). ∎
Lemma C.4.
With the notation of Lemma C.2,
$$\mathbb{E}_{\{x_i\}}\big[|a^N(x,t) - a(x,t)|^2\big] \le \frac{1}{N}\,\mathbb{E}_{y\sim p_0}\big[G(x,t\,|\,y)^2\big]\,,\qquad \mathbb{E}_{\{x_i\}}\big[|b^N(x,t) - b(x,t)|^2\big] \le \frac{1}{N}\,\mathbb{E}_{y\sim p_0}\big[|\nabla_x G(x,t\,|\,y)|^2\big]\,.$$
Remark C.5.
Define $g(y) := G(x,t\,|\,y)$ for fixed $(x,t)$. Due to the randomness in $y \sim p_0$, $g$ is also a random variable. According to the definition (24), $a$ is the mean of the random variable $g$, and $a^N$ is the ensemble average of $N$ realizations of $g$. It is always true that the variance of the ensemble average is $1/N$ of the variance of the original random variable, so naturally:
$$\mathrm{Var}\big(a^N\big) = \frac{1}{N}\,\mathrm{Var}(g) \le \frac{1}{N}\,\mathbb{E}\big[g^2\big]\,.$$
Proof.
We denote $z_i := G(x,t\,|\,x_i)$. By the definitions of $a^N$ and $a$, we can compute
$$\mathbb{E}_{\{x_i\}}\big[|a^N - a|^2\big] = \mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^{N} z_i\right) = \frac{1}{N}\,\mathrm{Var}(z_1) \le \frac{1}{N}\,\mathbb{E}_{y\sim p_0}\big[G(x,t\,|\,y)^2\big]\,.$$
With similar computations, one can show the corresponding bound for $b^N$. ∎
Lemma C.6.
With the notation of Lemma C.2 and under the support assumption $|x_i| \le R$ for $1 \le i \le N$, the denominator $a^N(x,t)$ admits a pointwise lower bound that is uniform in the samples.
Proof.
We can compute the bound directly from the explicit form of the Green's function (6), applying the lower bound of Lemma C.12 to each term $G(x,t\,|\,x_i)$. If we assume that $|x_i| \le R$ for $1 \le i \le N$, then each exponent is bounded below uniformly in $i$, and the stated bound follows. ∎
Lemma C.7.
With the notation of Lemma C.2, under the assumptions of Case 2, the quantity $\mathbb{E}_{x\sim p(\cdot,t)}\,\mathbb{E}_{\{x_i\}}\big[|E(x,t)|^2\big]$ is bounded by $C(d,t)/N$ for an explicit constant $C(d,t)$.
Proof.
Denote, as in the proof of Lemma C.2,
$$I := 2\,\mathbb{E}\left[\left|\frac{b^N - b}{a^N}\right|^2\right]\,, \qquad II := 2\,\mathbb{E}\left[\left|\frac{b}{a}\right|^2\frac{|a - a^N|^2}{|a^N|^2}\right]\,.$$
We now bound the terms $I$ and $II$ respectively. For term $I$, combining the variance bound of Lemma C.4 with the lower bound on the denominator from Lemma C.6 gives
$$I \lesssim \frac{1}{N}\,\mathbb{E}_{x\sim p(\cdot,t)}\left[\frac{\mathbb{E}_{y\sim p_0}\big[|\nabla_x G(x,t\,|\,y)|^2\big]}{a(x,t)^2}\right]\,. \tag{26}$$
For term $II$, by Lemma C.6 we control the denominator $a^N$; by Lemma C.10, we bound the Gaussian moments involved; and by Lemma C.8, we control the remaining expectation. Then we obtain the upper bound for term $II$:
$$II \lesssim \frac{C(d,t)}{N}\,. \tag{27}$$
We finish the proof by combining the upper bounds of terms $I$ and $II$ derived in (26) and (27). ∎
Lemma C.8.
Under the same assumptions as in Lemma C.7, the Gaussian moment ratios appearing in the bound of term $II$ are controlled, uniformly in $t \in [t_0, T]$, by explicit quantities depending only on $d$, $R$ and $t$.
Proof.
For notational simplicity, we denote $G_i := G(x,t\,|\,x_i)$ and use $f_i$ as a short notation for $f(x,t\,|\,x_i)$ when the context is clear. For every $1 \le i \le N$, we can compute the relevant Gaussian integrals explicitly from (6). We then split the resulting expression into two terms, $II_1$ and $II_2$, and bound each in turn. For term $II_1$, the support assumption $|x_i| \le R$ together with Lemma C.11 yields an upper bound depending only on $d$, $R$ and $t$. Similar to the computations for term $II_1$, we can compute the upper bound of $II_2$ in the same fashion. Combining the two bounds completes the proof. ∎
Lemma C.9 (Convolution of two Gaussian distributions).
Let $p = \mathcal{N}(\mu_1, \sigma_1^2 I_d)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2 I_d)$; then
$$p * q = \mathcal{N}\big(\mu_1 + \mu_2,\; (\sigma_1^2 + \sigma_2^2)\, I_d\big)\,.$$
Proof.
One can compute that
$$(p * q)(x) = \int_{\mathbb{R}^d}\frac{1}{(2\pi\sigma_1^2)^{d/2}}\,e^{-\frac{|y - \mu_1|^2}{2\sigma_1^2}}\;\frac{1}{(2\pi\sigma_2^2)^{d/2}}\,e^{-\frac{|x - y - \mu_2|^2}{2\sigma_2^2}}\,\mathrm{d}y\,.$$
Define $\sigma^2 := \sigma_1^2 + \sigma_2^2$; completing the square in $y$ and integrating out the Gaussian in $y$ yields
$$(p * q)(x) = \frac{1}{(2\pi\sigma^2)^{d/2}}\exp\left(-\frac{|x - \mu_1 - \mu_2|^2}{2\sigma^2}\right)\,, \tag{28}$$
which is the density of $\mathcal{N}(\mu_1 + \mu_2, \sigma^2 I_d)$. ∎
Lemma C.10.
Suppose $p_0 = \mathcal{N}(\mu, \sigma_0^2 I_d)$; then one can compute the following quantities:
1. $p(\cdot, t) = \mathcal{N}\big(e^{-t}\mu,\; (e^{-2t}\sigma_0^2 + \sigma_t^2)\, I_d\big)$;
2. $\nabla\log p(x,t) = -\dfrac{x - e^{-t}\mu}{e^{-2t}\sigma_0^2 + \sigma_t^2}$;
3. $\mathbb{E}_{x\sim p(\cdot,t)}\big[|x|^2\big] = e^{-2t}\big(|\mu|^2 + d\,\sigma_0^2\big) + d\,\sigma_t^2$,
where $\sigma_t^2 = 1 - e^{-2t}$. Both $\lesssim$ and $\gtrsim$ indicate ignoring absolute constants.
Lemma C.11.
Given a collection of $d$-dimensional vectors $\{v_i\}_{i=1}^{n}$, the following inequality holds:
$$\left|\frac{1}{n}\sum_{i=1}^{n} v_i\right|^2 \le \frac{1}{n}\sum_{i=1}^{n}|v_i|^2\,.$$
Proof.
Denote $\bar{v} := \frac{1}{n}\sum_{i=1}^{n} v_i$; then we can compute, by the Cauchy-Schwarz inequality,
$$|\bar{v}|^2 \le \left(\frac{1}{n}\sum_{i=1}^{n}|v_i|\right)^2 \le \frac{1}{n}\sum_{i=1}^{n}|v_i|^2\,. \qquad \blacksquare$$
Lemma C.12.
For any $x, y \in \mathbb{R}^d$ and $t > 0$, with the Green's function defined in (6), $G(x,t\,|\,y)$ is lower bounded by
$$G(x,t\,|\,y) \ge \frac{1}{(2\pi\sigma_t^2)^{d/2}}\exp\left(-\frac{|x|^2 + e^{-2t}|y|^2}{\sigma_t^2}\right)\,.$$
Proof.
It comes from the direct computation:
$$G(x,t\,|\,y) = \frac{1}{(2\pi\sigma_t^2)^{d/2}}\exp\left(-\frac{|x - e^{-t}y|^2}{2\sigma_t^2}\right) \ge \frac{1}{(2\pi\sigma_t^2)^{d/2}}\exp\left(-\frac{|x|^2 + e^{-2t}|y|^2}{\sigma_t^2}\right)\,,$$
where the inequality comes from Young's inequality, i.e. $|a - b|^2 \le 2|a|^2 + 2|b|^2$ for any $a, b \in \mathbb{R}^d$. ∎
Appendix D Memorization effects
In this section, we provide the proofs of Propositions 4.1 and 4.2, and Theorem 4.3. For completeness, we restate the propositions and the theorem before each proof.
Proposition D.1 (Restatement of Proposition 4.1).
Suppose the training samples satisfy $|x_i| \le R$ for $1 \le i \le N$, with $R = O(\sqrt{d})$. Then for small $t$,
$$\mathrm{TV}\big(p^N(\cdot,t),\; p^{\sigma_t}_{\mathrm{KDE}}\big) \lesssim R\sqrt{t}\,,$$
where $\sigma_t$ is defined in (5).
Proof of Proposition 4.1.
By the definition of total variation and its convexity over mixtures, we have
$$\mathrm{TV}\big(p^N(\cdot,t),\; p^{\sigma_t}_{\mathrm{KDE}}\big) \le \frac{1}{N}\sum_{i=1}^{N}\mathrm{TV}\Big(\mathcal{N}\big(e^{-t}x_i,\, \sigma_t^2 I_d\big),\; \mathcal{N}\big(x_i,\, \sigma_t^2 I_d\big)\Big) \le \frac{1}{N}\sum_{i=1}^{N}\frac{(1 - e^{-t})\,|x_i|}{2\,\sigma_t} \le \frac{(1 - e^{-t})\,R}{2\,\sigma_t} \lesssim R\sqrt{t}\,,$$
where we used the standard bound on the TV distance between two Gaussians with equal covariance, together with $1 - e^{-t} \le t$ and $\sigma_t \gtrsim \sqrt{t}$ for small $t$. ∎
Proposition D.2 (Restatement of Proposition 4.2).
Under the same assumptions as Proposition 4.1 and for $T \ge 1$, the law $\mathsf{q}_t$ of the SGM algorithm (13) satisfies
$$\mathrm{TV}\big(\mathsf{q}_{T-t},\; p^N(\cdot,t)\big) \lesssim e^{-T}\sqrt{d + R^2}\,.$$
Proof of Proposition 4.2.
The SGM dynamics (13) and the exact backward dynamics of the system initialized at $p_0^N$ share the same drift, since $s^N = \nabla\log p^N$. The two laws therefore differ only through their initializations, $\mathcal{N}(0, I_d)$ versus $p^N(\cdot, T)$, and the data processing inequality gives
$$\mathrm{TV}\big(\mathsf{q}_{T-t},\; p^N(\cdot,t)\big) \le \mathrm{TV}\big(\mathcal{N}(0, I_d),\; p^N(\cdot, T)\big) \lesssim e^{-T}\sqrt{d + R^2}\,,$$
where the last inequality is Lemma D.5. ∎
Theorem D.3 (SGM with the empirical optimal score function resembles KDE; restatement of Theorem 4.3).
Under the same assumptions as Proposition 4.2, the SGM algorithm (13) with the empirical optimal score function returns a simple Gaussian convolution with the empirical distribution in the form of (18), and it presents the following behavior:
• (with early stopping) for any $\epsilon > 0$, setting $t_0 \lesssim \epsilon^2/R^2$ and $T \gtrsim \ln\big(\sqrt{d+R^2}/\epsilon\big)$, we have
$$\mathrm{TV}\big(\mathsf{q}_{T-t_0},\; p^{\sigma_{t_0}}_{\mathrm{KDE}}\big) \lesssim \epsilon\,;$$
• (without early stopping) by taking the limits $t_0 \to 0$ and $T \to \infty$, $\mathsf{q}_{T-t_0}$ converges to the empirical distribution $p_0^N$.
Proof of Theorem 4.3.
(with early stopping) For $t = t_0$, combining Proposition 4.1 and Proposition 4.2 using the triangle inequality, we have
$$\mathrm{TV}\big(\mathsf{q}_{T-t_0},\; p^{\sigma_{t_0}}_{\mathrm{KDE}}\big) \lesssim R\sqrt{t_0} + e^{-T}\sqrt{d + R^2}\,. \tag{30}$$
For any $\epsilon > 0$, by choosing $t_0 \lesssim \epsilon^2/R^2$ and $T \gtrsim \ln\big(\sqrt{d+R^2}/\epsilon\big)$, we obtain $\mathrm{TV}\big(\mathsf{q}_{T-t_0},\, p^{\sigma_{t_0}}_{\mathrm{KDE}}\big) \lesssim \epsilon$.
(without early stopping) By taking the limits $t_0 \to 0$ and $T \to \infty$ in inequality (30), we have $\mathrm{TV}\big(\mathsf{q}_{T-t_0},\, p^{\sigma_{t_0}}_{\mathrm{KDE}}\big) \to 0$, while $p^{\sigma_{t_0}}_{\mathrm{KDE}} \to p_0^N$. This implies that the limiting law equals the empirical distribution $p_0^N$.
∎
Lemma D.4 (KL divergence between two Gaussian distributions).
Let $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$ be two Gaussian distributions on $\mathbb{R}^d$. Then the KL divergence between $p$ and $q$ is
$$\mathrm{KL}(p\,\|\,q) = \frac{1}{2}\left(\mathrm{tr}\big(\Sigma_2^{-1}\Sigma_1\big) - d + (\mu_2 - \mu_1)^{\top}\Sigma_2^{-1}(\mu_2 - \mu_1) + \ln\frac{\det\Sigma_2}{\det\Sigma_1}\right)\,.$$
Lemma D.5 (Convergence of the forward OU process).
Denote by $p_T^N$ the distribution of the forward OU process at time $T$ initialized with the empirical distribution $p_0^N$, where $\{x_i\}_{i=1}^N$ are i.i.d. samples such that $|x_i| \le R$. Then for $T \ge 1$,
$$\mathrm{TV}\big(p_T^N,\; \mathcal{N}(0, I_d)\big) \lesssim e^{-T}\sqrt{d + R^2}\,.$$
Proof.
Since $p_T^N = \frac{1}{N}\sum_{i=1}^{N}\mathcal{N}\big(e^{-T}x_i,\, (1 - e^{-2T})\, I_d\big)$, by Lemma D.4 we have
$$\mathrm{KL}\Big(\mathcal{N}\big(e^{-T}x_i,\, (1 - e^{-2T}) I_d\big)\,\Big\|\,\mathcal{N}(0, I_d)\Big) = \frac{1}{2}\Big(e^{-2T}|x_i|^2 - d\,e^{-2T} - d\ln\big(1 - e^{-2T}\big)\Big) \lesssim e^{-2T}\big(|x_i|^2 + d\big)\,.$$
By the convexity of the KL divergence,
$$\mathrm{KL}\big(p_T^N\,\|\,\mathcal{N}(0, I_d)\big) \le \frac{1}{N}\sum_{i=1}^{N}\mathrm{KL}\Big(\mathcal{N}\big(e^{-T}x_i,\, (1 - e^{-2T}) I_d\big)\,\Big\|\,\mathcal{N}(0, I_d)\Big) \lesssim e^{-2T}\big(R^2 + d\big)\,.$$
By Pinsker's inequality, we have
$$\mathrm{TV}\big(p_T^N,\; \mathcal{N}(0, I_d)\big) \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(p_T^N\,\|\,\mathcal{N}(0, I_d)\big)} \lesssim e^{-T}\sqrt{d + R^2}\,. \qquad \blacksquare$$
Appendix E Numerical experiments
In this section, we provide the details of the numerical experiments. The code is available at https://github.com/SixuLi/DDPM_and_KDE.¹
¹The implementation of the KDE generation is built based on the code at https://github.com/patrickphat/Generate-Handwritten-Digits-Kernel-Density-Estimation; the implementation of DDPM on the CIFAR10 dataset follows the code at https://github.com/sail-sg/DiffMemorize/tree/main.
E.1 Synthetic data distribution
We consider the target data distribution to be a $d$-dimensional isotropic Gaussian, i.e. $p_0 = \mathcal{N}(\mu, \sigma_0^2 I_d)$. In this case, the law of the forward OU process (3) and the exact score function defined in (8) have explicit formulations. Specifically, by Lemma C.10, we obtain
$$p(\cdot, t) = \mathcal{N}\big(e^{-t}\mu,\; (e^{-2t}\sigma_0^2 + \sigma_t^2)\, I_d\big)\,, \tag{31}$$
$$\nabla\log p(x,t) = -\frac{x - e^{-t}\mu}{e^{-2t}\sigma_0^2 + \sigma_t^2}\,, \tag{32}$$
where $\sigma_t^2 = 1 - e^{-2t}$ as defined in (5). We fix the values of $\mu$ and $\sigma_0$ in our experiments.
We first estimate the score approximation error of the empirical optimal score function across various training sample sizes $N$. Setting the early stopping time $t_0$, the time interval length $T$, and the sample size $N$ over a range of values, we numerically estimate
$$\frac{1}{T - t_0}\int_{t_0}^{T}\mathbb{E}_{x\sim p(\cdot,t)}\Big[\big|s^N(x,t) - \nabla\log p(x,t)\big|^2\Big]\,\mathrm{d}t \tag{33}$$
using the empirical average
$$\frac{1}{N_t}\sum_{k=1}^{N_t}\frac{1}{M}\sum_{j=1}^{M}\Big|s^N(x_{k,j},\, t_k) - \nabla\log p(x_{k,j},\, t_k)\Big|^2\,, \tag{34}$$
where $\{t_k\}_{k=1}^{N_t}$ are time points in $[t_0, T]$ and, for each $t_k$, $\{x_{k,j}\}_{j=1}^{M}$ are samples from $p(\cdot, t_k)$.
This is achieved by repeating the following steps several times and computing the average output:
1. Randomly sample $N$ training data $\{x_i\}_{i=1}^N$ from the target distribution $p_0$;
2. Uniformly sample $\{t_k\}_{k=1}^{N_t}$ from the time interval $[t_0, T]$ with a fixed step size, giving $N_t$ time points in total;
3. For each $t_k$, sample $x_{k,j}$ (where $1 \le j \le M$) from the distribution $p(\cdot, t_k)$ as derived in (31);
4. Compute the empirical average (34) using $\{x_i\}$, $\{t_k\}$ and $\{x_{k,j}\}$.
The results (shown in Figure 2) align with the convergence rate provided in Theorem 3.1.
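A condensed version of this estimation loop is sketched below (our code, mirroring steps 1-4 with small illustrative values of $N_t$ and $M$ for speed); it uses the closed-form score (32) of the isotropic Gaussian target. The squared error (34) should decay roughly like $1/N$, i.e. the error itself like $N^{-1/2}$.

```python
import numpy as np

def score_error(N, d=2, mu=1.0, s0sq=1.0, t0=1e-3, T=5.0, n_t=50, m=200, seed=0):
    """Monte-Carlo estimate of (33) via the empirical average (34)."""
    rng = np.random.default_rng(seed)
    X = mu + np.sqrt(s0sq) * rng.standard_normal((N, d))  # step 1: training data
    total = 0.0
    for t in rng.uniform(t0, T, size=n_t):                # step 2: time samples
        var_f = 1.0 - np.exp(-2.0 * t)                    # sigma_t^2
        var_p = np.exp(-2.0 * t) * s0sq + var_f           # variance of p(., t), (31)
        x = np.exp(-t) * mu + np.sqrt(var_p) * rng.standard_normal((m, d))  # step 3
        centers = np.exp(-t) * X                          # (N, d)
        diff = x[:, None, :] - centers[None, :, :]        # (m, N, d)
        logw = -(diff ** 2).sum(-1) / (2.0 * var_f)
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        sN = np.einsum('mn,mnd->md', w, -diff / var_f)    # s^N(x, t), cf. (12)
        s_true = -(x - np.exp(-t) * mu) / var_p           # exact score, (32)
        total += ((sN - s_true) ** 2).sum(axis=1).mean()  # step 4
    return total / n_t

for N in [16, 64, 256]:
    print(N, score_error(N))   # squared error shrinks roughly like 1/N
```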
In the second part of our experiments, we generate samples from DDPM using either the exact score function (32) or the empirical optimal score function $s^N$. We discretize and simulate the SDEs (4) and (13) using the Euler-Maruyama method. The experiment parameters are: the time interval length $T$, the discretization step, the early stopping times, and the number of training data $N$. We generate new samples from DDPM with the exact and the empirical optimal score functions respectively. Visualization results for two parameter settings are shown in Figure 4 and Figure 5. The samples generated by DDPM with the empirical optimal score function exhibit strong memorization effects, while those from DDPM with the exact score function appear independent of the training data, yet maintain the same distribution. This numerical observation corroborates our theoretical findings in Theorem 4.3.




E.2 Real-world data distribution
We consider $p_0$ to be the underlying distribution generating the CIFAR10 dataset images (Krizhevsky et al., 2009), comprising $50{,}000$ training samples of dimension $32 \times 32 \times 3$. We denote by $\{x_i\}$ the images in the CIFAR10 dataset, and we use them to construct the following two generative models.
• The first one is a simple Gaussian Kernel Density Estimation (KDE), i.e. $p^{h}_{\mathrm{KDE}} = G_h * p_0^N$, where $h$ is the Gaussian kernel's bandwidth. To generate a sample, we first uniformly sample a data point $x_i$ from the dataset, then apply Gaussian blurring with bandwidth $h$ to $x_i$. The bandwidth is set to a constant multiple of the optimal bandwidth $h^{*} = \hat{\sigma}\,N^{-1/(d+4)}$, as per Scott's rule (Terrell & Scott, 1992), where $\hat{\sigma}$ is the training data's standard deviation. (A minimal sketch of this generator is given after this list.) The sampling results are shown in the second row of Figure 1. Comparing with the training data (the first row in Figure 1), we can clearly see that the generated samples have a strong dependence on existing ones.
• The second one is DDPM equipped with the empirical optimal score function, as defined in (13). We follow the implementation in (Gu et al., 2023), and to illustrate the details, we also follow their notation. Recall the backward SDE (13):
$$\mathrm{d}\mathsf{Y}_t = \Big(\mathsf{Y}_t + 2\,s^N\big(\mathsf{Y}_t,\, T-t\big)\Big)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,.$$
For sample generation, we discretize the time interval into steps $t_k = k\,\Delta t$, with $T$ being the time interval length and $K$ the total number of steps, and apply the Euler-Maruyama solver. The update rule is as follows:
$$\mathsf{Y}_{k+1} = \mathsf{Y}_k + \Big(\mathsf{Y}_k + 2\,s^N\big(\mathsf{Y}_k,\, T - t_k\big)\Big)\,\Delta t + \sqrt{2\,\Delta t}\; z_k\,, \tag{35}$$
where $z_k \sim \mathcal{N}(0, I_d)$. We terminate this update rule (35) before the final step, at an early stopping index.² We fix $T$ and $K$, and vary the early stopping index. Figure 1's third row shows the generated samples for one choice of the early stopping index. We can observe that the samples generated by DDPM equipped with the empirical optimal score function behave very similarly to the samples generated by the Gaussian KDE (the second row in Figure 1). This aligns with our theoretical findings in Theorem 4.3. Additionally, Figure 6 displays samples for two further choices of the early stopping index, highlighting the strong memorization effect in DDPM with the empirical optimal score function, irrespective of the early stopping time.
²Here we abuse the notation and refer to it as the early stopping index; it is different from the $t_0$ used in the main paper.
Figure 6: Images generated by DDPM equipped with the empirical optimal score function on the CIFAR10 dataset. The first row shows original images from the CIFAR10 dataset. The second and third rows correspond to the results of setting the early stopping index to two different values respectively.