Unraveling the Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective
Diffusion models have made rapid progress in generating high-quality samples across various domains. However, a theoretical understanding of the Lipschitz continuity and second momentum properties of the diffusion process is still lacking. In this paper, we bridge this gap by providing a detailed examination of these smoothness properties for the case where the target data distribution is a mixture of Gaussians, which serves as a universal approximator for smooth densities such as image data. We prove that if the target distribution is a k-mixture of Gaussians, the density of the entire diffusion process will also be a k-mixture of Gaussians. We then derive tight upper bounds on the Lipschitz constant and second momentum that are independent of the number of mixture components k. Finally, we apply our analysis to various diffusion solvers, both SDE and ODE based, to establish concrete error guarantees in terms of the total variation distance and KL divergence between the target and learned distributions. Our results provide deeper theoretical insights into the dynamics of the diffusion process under common data distributions.
1 Introduction
Diffusion models, a prominent generative modeling framework, have made rapid progress and garnered significant attention in recent years due to their powerful ability to generate high-quality samples across various domains and diverse applications. Score-based generative diffusion models [HJA20, SSDK+20] can generate high-quality image samples comparable to GANs, which require adversarial optimization. Built on the U-Net [RFB15], Stable Diffusion [RBL+22], a conditional multi-modality generation model, can successfully generate commercially usable images. Built on the Diffusion Transformer (DiT) [PX23], OpenAI released a video diffusion model, Sora [Ope24], with impressive performance.
However, despite these technological strides, a critical gap persists in the theoretical understanding of diffusion processes, especially concerning the Lipschitz continuity and second momentum properties of these models. Many existing studies ([DB22, CCL+22, LLT22, CLL23, CCL+23] and many more) make simplifying assumptions about these key smoothness properties but lack a rigorous formulation or comprehensive analysis. This paper aims to bridge this theoretical gap by providing a detailed examination of the Lipschitz and second momentum characteristics. To make the data distribution analyzable, we consider a mixture of Gaussians as the data distribution, which is a universal approximator (Fact 1.1) for any smooth density, such as complex image and video data distributions.




Fact 1.1.
A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific nonzero amount of error by a Gaussian mixture model with enough components [Sco15].
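As a quick numerical illustration of Fact 1.1 (not part of the original argument), the sketch below fits Gaussian mixture models with an increasing number of components to samples from a smooth non-Gaussian density; the average log-likelihood typically improves as the number of components grows. The dataset and component counts are arbitrary choices made for this example.

```python
# A minimal sketch of Gaussian-mixture density approximation (assumed setup).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Samples from a smooth non-Gaussian density (a skewed Gamma distribution).
data = rng.gamma(shape=2.0, scale=1.5, size=5000).reshape(-1, 1)

for k in [1, 2, 4, 8, 16]:
    gmm = GaussianMixture(n_components=k, random_state=0).fit(data)
    # Higher average log-likelihood indicates a better density approximation.
    print(f"k = {k:2d}, avg log-likelihood = {gmm.score(data):.4f}")
```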
Furthermore, we prove that if the target data distribution is a k-mixture of Gaussians, the density of the whole diffusion process will be a k-mixture of Gaussians (Lemma 3.2). Thus, our focus is studying a mixture-of-Gaussians target data distribution in the diffusion process, for which we obtain a concrete Lipschitz constant and second momentum (Lemma 3.3 and Lemma 3.5). Moreover, we explore the implications of these properties through the lens of various solvers, both Stochastic Differential Equation (SDE) and Ordinary Differential Equation (ODE) based, providing deeper insight into the dynamic behavior of diffusion processes and the concrete guarantees summarized in Table 1.
Our contributions:
- • We prove that if the target data distribution is a k-mixture of Gaussians, then the density at every time of the diffusion process is also a k-mixture of Gaussians (Lemma 3.2).
- • We derive upper bounds on the Lipschitz constant and second momentum of the diffusion process that are independent of the number of mixture components k (Lemma 3.3 and Lemma 3.5).
- • After applying our analysis to DDPM, the SDE version of the reverse process, we prove that the dynamics of the diffusion process satisfy a concrete total variation bound (Theorem 5.6) and concrete KL divergence bounds (Theorem 5.7 and Theorem 5.8) under a suitable choice of the total number of discretization steps. See a summary in Table 1.
- • We further apply our analysis to the ODE-based solvers DPOM and DPUM, obtaining concrete total variation guarantees (Theorem 5.9 and Theorem 5.10).
Other studies of the mixture of Gaussian under diffusion.
Recently, there has been a rich line of work that studies mixture-of-Gaussians data distributions under the diffusion process, a similarly popular setting. However, none focuses on the Lipschitz smoothness and second momentum properties as we do. We provide a short summary here for convenience. [WCL+24] analyzes the effect of diffusion guidance and provides theoretical insights on how task-specific information is instilled into the score function and conditional generation, e.g., text-to-image, by studying the mixture-of-Gaussians data distribution as a case study. [GLB+24] propose a new SDE solver for diffusion models under mixture-of-Gaussians data. [SCK23, GKL24] learn mixtures of Gaussians using the DDPM objective and give sample complexity bounds by assuming all covariance matrices are the identity, but they do not use Lipschitz properties, while our work does not need identity assumptions. [CKS24] mainly focuses on learning a k-mixture of Gaussians in diffusion models by leveraging properties of the mixture of Gaussians and polynomial approximation methods, and gives sample complexity bounds by assuming the covariance matrices have bounded condition number and that the mean vectors and covariance matrices lie in a bounded ball.
Table 1: Summary of our concrete error guarantees for different solvers.

Type | Error guarantee | Steps for error | Reference |
---|---|---|---|
SDE [CCL+22] | Total variation | | Theorem 5.6 |
SDE [CLL23] | KL divergence | | Theorem 5.7 |
SDE [CLL23] | KL divergence | | Theorem 5.8 |
ODE [CCL+23] | Total variation | | Theorem 5.9 |
ODE [CCL+23] | Total variation | | Theorem 5.10 |
2 Related Work
Diffusion models and score-based generative models. [HJA20] introduced denoising diffusion probabilistic models (DDPM), which learn to reverse a gradual noising process to generate samples and have been applied in many areas [WSD+23, WCZ+23, WXZ+24]. Then, [SSDK+20] used Stochastic Differential Equations (SDE) to build the diffusion model and explored the equivalence between score-based generative models (SGMs) and diffusion models, generalizing the diffusion model to continuous time. There is a line of work [SDWMG15] studying the connection between diffusion models and non-equilibrium thermodynamics, providing a theoretical foundation for these models. Another line of work has focused on improving the efficiency and quality of diffusion models [NRW21]. In particular, [SE19] leverages score matching to train diffusion models more efficiently; [ND21] improved architectures and training techniques for diffusion models. Diffusion models have been applied to various domains beyond image generation, achieving impressive results, e.g., text-to-image synthesis [CZZ+20], audio synthesis [KPH+20], image super-resolution [SHC+22], and so on.
Mixture of Gaussians. Mixtures of Gaussians are among the most fundamental and widely used statistical models [Das99, FOS08, MV10, BS10, CDSS13, SOAJ14, DK14, DKK+19, ADLS17, LS17, ABDH+18, DK20, BK20, DHKK20, SCL+22, BDJ+22, XSW+23, BS23, SMF+24] and have been studied in neural network learning [RGKZ21, SWL21, LSSS24, SWL24]. Recently, they have also been widely studied in diffusion models [SCK23, WCL+24, GLB+24, GKL24, CKS24] (see the detailed discussion in Section 1).
Lipschitz and second momentum in score estimation. There is a line of work using bounded Lipschitz constants and second momentum as key smoothness assumptions for the whole diffusion process without giving any concrete formulation [CCL+22, KFL22, LLT22, LKB+23, CLL23, CHZW23, CCL+23, CDD23, BDD23, ZHF+24, KLL+24], while our work gives a “closed-form” formulation of the Lipschitz constant and second momentum. There is another rich line of work studying how to train diffusion models with better theoretical guarantees [SE19, SE20, SK21, SGSE20, SDME21, SDCS23, LKB+23, CDD23, RHX+23, CHZW23, SCK23, YFZ+23, LLSS24, GKL24, CCL+23, GLB+24, CKS24, HRX24, HWSL24] and many more.
Roadmap. In Section 3, we provide the notation we use, several definitions, and lemmas related to mixtures of Gaussian. In Section 4, we provide preliminary knowledge on score-based generative models (SGMs) and diffusion models. Section 6 provides the tools that we use from previous papers. In Section 5, we present our main results. In Section 7, we conclude the paper.
3 Lipschitz and Second Momentum of Mixture of Gaussian
In this section, we discuss the notation used, several key definitions for mixtures of Gaussian distributions, and lemmas concerning the Lipschitz continuity and second momentum of these mixtures. We begin by presenting the notation that is used throughout the paper. Notations. For two vectors and , we use to denote the inner product between them, i.e., . We use to denote probability and to denote expectation. We use to denote the vector whose -th coordinate is and whose other entries are . For each , we use to denote the vector whose -th entry is for all ; this is the Hadamard product. We use to denote the length- vector of all ones. We use to denote the -th coordinate of . We use to denote the norm of a vector , i.e., , , and . For , for any matrix , we use to denote the spectral norm of , i.e., . We use to denote the minimum/maximum singular value of the matrix . For a square matrix , we use to denote the trace of , i.e., . We use to denote the determinant of the matrix . We use to denote the convolution of two functions . In addition, for two functions , we use the shorthand (resp. ) to indicate that (resp. ) for an absolute constant .
Mixtures of Gaussians. Now, we are ready to introduce mixtures of Gaussians. Formally, we have the following definition of the pdf of a k-mixture of Gaussians.
Definition 3.1 (k-mixture of Gaussians).
Let , , . For a fixed timestamp , (1) let be the weight for the -th Gaussian component at time and ; (2) let and be the mean vector and covariance matrix for the -th Gaussian component at time . Then, we define
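To make Definition 3.1 concrete, here is a minimal numerical sketch (not from the paper) that evaluates a mixture-of-Gaussians density at a point; the weights are nonnegative and sum to one, and all variable names below are assumptions made for illustration.

```python
# Evaluate p(x) = sum_i w_i * N(x; mu_i, Sigma_i) for a k-mixture of Gaussians.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, weights, means, covs):
    """Density of a k-mixture of Gaussians at a point x in R^d."""
    return sum(
        w * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
        for w, mu, Sigma in zip(weights, means, covs)
    )

# Example: a 2-component mixture in R^2.
weights = [0.3, 0.7]
means = [np.zeros(2), np.array([3.0, -1.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_pdf(np.array([1.0, 0.5]), weights, means, covs))
```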
Then, the following lemma shows that a linear combination of a mixture of Gaussians and a single standard Gaussian is still a mixture of Gaussians. The proof is in Appendix G.
Lemma 3.2 (Informal version of Lemma G.1).
Let . Let be a k-mixture of Gaussian distribution, and let be its pdf defined in Definition 3.1, i.e.,
Let be sampled from . Let and , which is independent of . Then, we have a new random variable , which is also a k-mixture of Gaussian distribution , whose pdf is
where .
Note that the Gaussian mixture model is a universal approximator of densities (Fact 1.1). Thus, it is natural to model the target/image () data distribution as a mixture of Gaussians. Lemma 3.2 tells us that if the target data distribution is a mixture of Gaussians, then by Eq. (2), the pdf of the whole diffusion process is a mixture of Gaussians, i.e., the pdf for any . See details in Section 4.
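The following sketch is a Monte-Carlo sanity check of Lemma 3.2 in one dimension (with assumed symbols and parameters): scaling a mixture-of-Gaussians sample and adding independent Gaussian noise yields a distribution whose CDF matches the mixture predicted by the lemma, with scaled means and inflated variances.

```python
# Check that y = alpha * x_0 + beta * z follows sum_i w_i N(alpha*mu_i, alpha^2*sigma_i^2 + beta^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
w, mu, sig = np.array([0.4, 0.6]), np.array([-2.0, 3.0]), np.array([0.5, 1.0])
alpha, beta = 0.8, 0.6

# Sample y = alpha * x_0 + beta * z with x_0 from the 2-component mixture.
comp = rng.choice(2, size=200_000, p=w)
x0 = rng.normal(mu[comp], sig[comp])
y = alpha * x0 + beta * rng.normal(size=x0.shape)

# Compare the empirical CDF of y with the CDF predicted by Lemma 3.2.
grid = np.linspace(-6, 6, 200)
pred_cdf = sum(
    w_i * norm.cdf(grid, loc=alpha * m_i, scale=np.sqrt(alpha**2 * s_i**2 + beta**2))
    for w_i, m_i, s_i in zip(w, mu, sig)
)
emp_cdf = np.searchsorted(np.sort(y), grid) / y.size
print("max CDF deviation:", np.abs(pred_cdf - emp_cdf).max())  # should be small
```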
Main properties. Thus, we only need to analyze the properties, i.e., the Lipschitz constant and second momentum, of mixtures of Gaussians to obtain a clear guarantee for the whole diffusion process. In the following two lemmas, we give concrete bounds on the Lipschitz constant and second momentum of mixtures of Gaussians. Both proofs can be found in Appendix G.
Lemma 3.3 (Lipschitz, informal version of Lemma G.3).
If the following conditions hold: Let , where and for each . Let be defined as Eq. (10) and , where . Let , . Let .
The Lipschitz constant for the score function is given by:
In Lemma 3.3, we can clearly see that roughly , which means the Lipschitz constant depends only on the smallest singular value among all Gaussian components but is independent of k.
Remark 3.4.
We believe our results in Lemma 3.3 are non-trivial. First, the Lipschitz bound depends inversely exponentially on the dimension ; if the dimension grows larger, the Lipschitz constant becomes much smaller. Second, in practice, k will be very large for complicated data distributions, e.g., millions of Gaussian components for an image distribution. Many studies need to overcome the hardness of large k in learning Gaussian mixture models [BDJ+22, BS23, RGKZ21, SWL24], while our Lipschitz upper bound is independent of k.
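As a rough empirical probe of this discussion (a heuristic check, not a substitute for the proof of Lemma 3.3), one can compute the score of a k-mixture of isotropic Gaussians in closed form and estimate its local Lipschitz constant from random nearby pairs; the component variances below share the same lower bound, and all parameter choices are assumptions made for illustration.

```python
import numpy as np

def gmm_score(x, weights, means, var):
    """Score of a mixture of isotropic Gaussians, computed via posterior weights."""
    log_w = np.log(weights) - 0.5 * np.sum((x - means) ** 2, axis=1) / var
    r = np.exp(log_w - log_w.max())
    r /= r.sum()
    return (r[:, None] * (means - x) / var).sum(axis=0)

rng = np.random.default_rng(2)
d, var = 8, 0.5 ** 2                       # every component has covariance var * I
for k in [2, 16, 128]:
    weights = np.full(k, 1.0 / k)
    means = 0.5 * rng.normal(size=(k, d))  # heavily overlapping components (a toy choice)
    ratios = []
    for _ in range(2000):
        x = rng.normal(size=d)
        y = x + 1e-3 * rng.normal(size=d)
        diff = gmm_score(x, weights, means, var) - gmm_score(y, weights, means, var)
        ratios.append(np.linalg.norm(diff) / np.linalg.norm(x - y))
    # Under the conditions of Lemma 3.3, the Lipschitz bound does not grow with k.
    print(f"k = {k:4d}, empirical local-Lipschitz estimate: {max(ratios):.2f}")
```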
Lemma 3.5 (Second momentum, informal version of Lemma G.2).
Let , where is defined by Eq. (9). Then, we have
4 Score Based Model and Diffusion Model
In Section 4.1, we first briefly introduce DDPM [HJA20], a stochastic differential equation (SDE) version of the reverse diffusion process, and score-based generative models (SGMs) [SSDK+20], which are a generalization of DDPM. Then, in Section 4.2, we introduce multiple solvers for the reverse process.
4.1 Background on score based model and diffusion model
First, we denote the input data as and the target original data distribution as . Assuming that the noisy latent state of score-based generative models (SGMs) at time is a stochastic process for , the forward SDE is defined by:
(1) |
where is the standard Brownian motion, is called the drift coefficient, and is called the diffusion coefficient. We use to denote the marginal probability density function of . The pdf at the last time step is the prior distribution, which is often defined as the standard Gaussian .
The continuous forward SDE Eq. (1) also has the discrete Markov chain form as
(2) |
where and , and are functions of time. Additionally, we assume as , and . Thus, . Also, clearly when , and for ; this is the boundary condition. More specifically, Eq. (2) can be viewed as the iterative equation in [HJA20].
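For concreteness, the following sketch implements the discrete forward chain in Eq. (2) under the standard DDPM parameterization of [HJA20] (the noise schedule below is an assumed choice): one can sample the noisy state at any step directly from the clean data, and at the final step the marginal is approximately the standard Gaussian prior.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # an assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)    # cumulative products of (1 - beta_t)

def forward_sample(x0, t):
    """Sample the noisy state at step t directly from clean data x0 (DDPM forward chain)."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.normal(loc=5.0, scale=0.1, size=(10_000, 2))   # toy "data" far from N(0, I)
xT = forward_sample(x0, T - 1)
print("mean of x_T:", xT.mean(axis=0))   # close to 0
print("var  of x_T:", xT.var(axis=0))    # close to 1
```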
From [And82], we know also satisfies the reverse SDE:
(3) |
where is the backward Brownian motion [And82], and is the score function. For convenience, we rewrite the reverse SDE Eq. (3) in a forward version by switching the time direction, replacing with . Let . The law of is identical to the law of . We use to denote the density of :
(4) |
The process converts noise into samples from , thereby achieving the objective of generative modeling.
Since we cannot obtain the score function directly, we use a neural network to approximate it and denote the estimated score function by . By replacing the score function with the approximated score function , we can rewrite Eq. (4) as:
(5) |
where is the process we approximate by our SGM .
For clarity, we mainly focus on the Ornstein-Uhlenbeck (OU) process, which is a diffusion process with the following coefficients:
Definition 4.1.
The forward SDE of OU process (, and ) is: The corresponding discrete Markov chain form of OU process is given by: where we can see that , in Eq. (2).
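A minimal simulation of the OU forward process in Definition 4.1 is sketched below, assuming the common normalization in which the forward SDE is dx_t = -x_t dt + sqrt(2) dB_t, so that the stationary law is the standard Gaussian; the step size and sample counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, total_time, steps = 20_000, 2, 5.0, 500
h = total_time / steps
x = rng.normal(loc=4.0, scale=0.2, size=(n, d))   # far-from-Gaussian initialization

for _ in range(steps):
    # Euler discretization of dx_t = -x_t dt + sqrt(2) dB_t.
    x = x - x * h + np.sqrt(2.0 * h) * rng.normal(size=x.shape)

print("mean:", x.mean(axis=0))   # close to 0
print("var :", x.var(axis=0))    # close to 1
```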
From Eq. (5) under the OU process, we can have:
(6) |
4.2 Definitions of different solvers
In practical applications, it is necessary to adopt a discrete-time approximation of the sampling dynamics. We have the following definition.
Definition 4.2 (Time discretization).
Define the discretization points as , where is the early stopping parameter, with representing the standard setting. For each discretization step , where , the step size is denoted by .
Let be the discrete approximation of in Eq. (6) starting from . We use to denote the density of . Let be defined in Definition 4.2 and . To solve Eq. (6), we have the following two numerical solvers:
Definition 4.3.
We define as the numerical solver satisfying
where is the output, is the total time, and is the score estimate. The Euler-Maruyama scheme is (Equation 5 in [CLL23]), for
(7) |
Definition 4.4.
We define as the numerical solver satisfying
where is the output, is the total time, and is the score estimate. The Exponential Integrator scheme is (Equation 6 in [CLL23]), for
(8) |
The two methods above are used for solving the SDE. The difference is that the first term on the RHS of the Euler-Maruyama scheme uses , while the Exponential Integrator uses . The Exponential Integrator scheme has a closed-form solution (see details in Section 1.1 of [CLL23]).
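The following sketch contrasts one step of the two schemes for the reverse OU process dy_t = (y_t + 2 * score(y_t, T - t)) dt + sqrt(2) dB_t (the exact discretizations are Eqs. (7) and (8); the update rules below assume the OU normalization used above and a placeholder score, and are only an illustration).

```python
import numpy as np

rng = np.random.default_rng(5)

def em_step(y, t, h, score):
    """Euler-Maruyama: freeze the whole drift at the current iterate, then add noise."""
    z = rng.normal(size=y.shape)
    return y + h * (y + 2.0 * score(y, t)) + np.sqrt(2.0 * h) * z

def ei_step(y, t, h, score):
    """Exponential Integrator: solve the linear part exactly, freeze only the score."""
    z = rng.normal(size=y.shape)
    return (np.exp(h) * y
            + 2.0 * (np.exp(h) - 1.0) * score(y, t)
            + np.sqrt(np.exp(2.0 * h) - 1.0) * z)

def toy_score(y, t):
    # Placeholder score of a standard Gaussian target: grad log N(0, I) is -y.
    return -y

y = rng.normal(size=(4, 2))
print(em_step(y, 0.0, 0.01, toy_score))
print(ei_step(y, 0.0, 0.01, toy_score))
```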
Now, we introduce two ordinary differential equation (ODE) solvers, DPOM and DPUM, which ignore the Brownian motion term in Eq. (6). We will provide their concrete step bounds in Section 5.
Definition 4.5.
We define DPOM (Diffusion Predictor with Overdamped Modeling) as Algorithm 1 in [CCL+23], satisfying where is the early stopping parameter, is the output, is the total number of steps, is the predictor step size, is the corrector step size (see the detailed definition in Algorithm 1 of [CCL+23]), and is the score estimate.
Definition 4.6.
We define DPUM (Diffusion Predictor with Underdamped Modeling) as Algorithm 2 in [CCL+23], satisfying where is the early stopping parameter, is the output, is the total number of steps, is the predictor step size, is the corrector step size (see the detailed definition in Algorithm 2 of [CCL+23]), and is the score estimate.
5 Main Result for Application
In this section, we provide our main results for applications. In Section 5.1, we provide the key definitions and assumptions used. In Section 5.2, we provide our results for the total variation bound. In Section 5.3, we provide our results for the KL divergence bound. In Section 5.4, we provide our results for the probability flow ODE method, including DPOM and DPUM.
5.1 Key definitions and assumptions
We first assume that the loss of the learned score estimator is upper bounded by in Assumption 5.1. We then show that we can recover the target/image data distribution up to a small total variation or KL divergence gap.
Assumption 5.1 (Score estimation error, Assumption 1 in [CLL23], page 6, and Assumption 3 in [CCL+22], page 6).
The learned score function satisfies for any ,
where is the step size defined in Definition 4.2 for step , is the total steps, and .
Definition 5.2.
We define as the total variation distance between and .
Let the data distribution be a -mixture of Gaussians:
(9) |
Using Lemma 3.2, we know that the pdf of at any time is given by:
(10) |
Then, we define the following conditions on the mixture of Gaussians.
Condition 5.3.
All conditions here are related to the beginning time density in Eq. (9).
- • Let and .
- • Let and .
In Condition 5.3, we denote as the minimum/maximum singular value among the covariance matrices of the components at the original data distribution , as the maximum -norm among the mean vectors of the components at , and as the minimum determinant among the covariance matrices of the components at .
Condition 5.4.
In Condition 5.4, we denote as the minimum/maximum singular value among the covariance matrices of the components at , and as the lower/upper bounds of the distance between the input data and all scaled mean vectors at timestep . Let be the lower bound of , the pdf at time , and let be the minimum determinant at .
Clearly, we have , so Condition 5.4 is a stronger condition.
Fact 5.5 (Pinsker’s inequality [CK11]).
Let be two probability distributions; then we have
From Fact 5.5, we can see that the total variation distance is a weaker bound than the KL divergence.
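A quick numerical check of Pinsker's inequality on two discrete distributions is sketched below; the distributions are arbitrary examples.

```python
# Verify TV(P, Q) <= sqrt(KL(P || Q) / 2) on a toy example.
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

tv = 0.5 * np.abs(p - q).sum()          # total variation distance
kl = np.sum(p * np.log(p / q))          # KL divergence KL(p || q)
print(f"TV = {tv:.4f}, sqrt(KL / 2) = {np.sqrt(kl / 2):.4f}")   # TV is the smaller one
```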
5.2 Total variation
Now, we are ready to present our result for the total variation bound, assuming the data distribution is a k-mixture of Gaussians ( in Eq. (9)). Recall that is defined in Assumption 5.1, is defined in Definition 5.2, and is defined in Definition 4.2. See the proof in Appendix H.
Theorem 5.6 (DDPM, total variation, informal version of Theorem H.1).
In Theorem 5.6, suppose that and choose , ; then we have the stated total variation bound. In particular, in order to have , it suffices to have score error , where hides , which is a term. Thus, Theorem 5.6 provides a guarantee on the total variation distance between the target data distribution and the learned output distribution with concrete and .
5.3 KL divergence
Similarly, we present our results for the KL divergence bound in the following two theorems, assuming the data distribution is a k-mixture of Gaussians. Recall that is defined in Assumption 5.1, is defined in Definition 5.2, and is defined in Definition 4.2. See the proofs in Appendix H.
Theorem 5.7 (, KL divergence, informal version of Theorem H.2).
Assume Condition 5.4. We use uniform discretization points.
(1) Let denote the density of the output of the solver in Definition 4.4; then we have
In particular, choosing and , then .
Theorem 5.8 (, KL divergence for smooth data distribution, informal version of Theorem H.3).
Assume Conditions 5.3 and 5.4. We use the exponentially decreasing (then constant) step size . Let denote the density of the output of the solver defined in Definition 4.4. Then, we have
where and . In particular, if we set and , then . In addition, for the Euler-Maruyama scheme defined in Definition 4.3, the same bounds hold with an additional term.
Theorem 5.8 and Theorem 5.7 provide guarantees on the KL divergence between the target data distribution and the learned output distribution with concrete and . Theorem 5.8 has instead of in the bound for the total number of discretization points , but includes an additional compared to Theorem 5.7. On the other hand, Theorem 5.8 requires the data distribution to be Lipschitz and second-order differentiable [CLL23], allowing a better bound in terms of , while all other theorems [CCL+22, CCL+23] require Lipschitz conditions on for any . However, our assumed k-mixture of Gaussians satisfies all the conditions they use.
From Fact 5.5, we can compare the KL divergence results with the total variation results: the square of the TV distance is bounded by the KL divergence (up to a constant), which explains the squared relationship of the second momentum term.
5.4 Probability flow ODE
Notice that the previous results consider SDE-based models. However, from [SSDK+20], we know that we can also use an ODE to run the reverse process, which has the same marginal distributions as the SDE reverse process but is deterministic. In this section, we provide results for the DPOM and DPUM algorithms (Algorithms 1 and 2 in [CCL+23]), which are based on the ODE reverse process.
Theorem 5.9 (DPOM, informal version of Theorem H.4).
Theorem 5.10 (DPUM, informal version of Theorem H.5).
Theorem 5.9 and Theorem 5.10 provide guarantees on the total variation distance between the target data distribution and the learned output distribution with concrete and for the ODE reverse process. The difference between DPOM (Theorem 5.9) and DPUM (Theorem 5.10) lies in the complexity of and the final iteration complexity term. Using the DPUM algorithm, we can reduce the total iteration complexity from to .
6 Tools From Previous Work
In this section we present several tools we use from previous work.
Assumption 6.1 (Lipschitz score, Assumption 1 in [CCL+22], page 6).
For all , the score is -Lipschitz.
Assumption 6.2 (Second momentum bound, Assumption 2 in [CCL+22], page 6 and Assumption 2 in [CLL23], page 6).
We assume that .
Assumption 6.3 (Smooth data distributions, Assumption 4 in [CLL23], page 10).
The data distribution admits a density and is -Lipschitz, where means second-order differentiable.
Remark 6.4.
We state a tool from previous work [CCL+22],
Theorem 6.5 (DDPM, Theorem 2 in [CCL+22], page 7).
We state a tool from previous work [CLL23].
Theorem 6.6 (Theorem 1 in [CLL23]).
Theorem 6.7 (Theorem 5 in [CLL23], page 10).
There is a universal constant such that the following holds. Under Assumptions 6.2, 5.1, and 6.3, using the exponentially decreasing (then constant) step size , the sampling dynamics (8) result in a distribution such that
Choosing and makes this . In addition, for the Euler-Maruyama scheme (7), the same bounds hold with an additional term.
We state a tool from previous work [CCL+23],
Theorem 6.8 (DPOM, Theorem 2 in [CCL+23], page 6).
Theorem 6.9 (DPUM, Theorem 3 in [CCL+23], page 7).
7 Conclusion
We have presented a theoretical analysis of the Lipschitz continuity and second momentum properties of diffusion models, focusing on the case where the target data distribution is a mixture of Gaussians. Our results provide concrete bounds on key properties of the diffusion process and establish error guarantees for various diffusion solvers. These findings contribute to a deeper understanding of diffusion models and pave the way for further theoretical and practical advancements in this field.
Acknowledgement
Research is partially supported by the National Science Foundation (NSF) Grants 2023239-DMS, CCF-2046710, and Air Force Grant FA9550-18-1-0166.
References
- [ABDH+18] Hassan Ashtiani, Shai Ben-David, Nicholas Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Nearly tight sample complexity bounds for learning mixtures of gaussians via sample compression schemes. Advances in Neural Information Processing Systems, 31, 2018.
- [ADLS17] Jayadev Acharya, Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1278–1289. SIAM, 2017.
- [And82] Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
- [BDD23] Joe Benton, George Deligiannidis, and Arnaud Doucet. Error bounds for flow matching methods. arXiv preprint arXiv:2305.16860, 2023.
- [BDJ+22] Ainesh Bakshi, Ilias Diakonikolas, He Jia, Daniel M Kane, Pravesh K Kothari, and Santosh S Vempala. Robustly learning mixtures of k arbitrary gaussians. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1234–1247, 2022.
- [BK20] Ainesh Bakshi and Pravesh Kothari. Outlier-robust clustering of non-spherical mixtures. arXiv preprint arXiv:2005.02970, 2020.
- [BS10] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 103–112. IEEE, 2010.
- [BS23] Rares-Darius Buhai and David Steurer. Beyond parallel pancakes: Quasi-polynomial time guarantees for non-spherical gaussian mixtures. In The Thirty Sixth Annual Conference on Learning Theory, pages 548–611. PMLR, 2023.
- [CCL+22] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022.
- [CCL+23] Sitan Chen, Sinho Chewi, Holden Lee, Yuanzhi Li, Jianfeng Lu, and Adil Salim. The probability flow ode is provably fast. Advances in Neural Information Processing Systems, 36, 2023.
- [CDD23] Sitan Chen, Giannis Daras, and Alex Dimakis. Restoration-degradation beyond linear diffusions: A non-asymptotic analysis for ddim-type samplers. In International Conference on Machine Learning, pages 4462–4484. PMLR, 2023.
- [CDSS13] Siu-On Chan, Ilias Diakonikolas, Xiaorui Sun, and Rocco A Servedio. Learning mixtures of structured distributions over discrete domains. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1380–1394. SIAM, 2013.
- [CHZW23] Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. In International Conference on Machine Learning, pages 4672–4712. PMLR, 2023.
- [CK11] Imre Csiszár and János Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
- [CKS24] Sitan Chen, Vasilis Kontonis, and Kulin Shah. Learning general gaussian mixtures with efficient score matching. arXiv preprint arXiv:2404.18893, 2024.
- [CLL23] Hongrui Chen, Holden Lee, and Jianfeng Lu. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pages 4735–4763. PMLR, 2023.
- [CZZ+20] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, 2020.
- [Das99] Sanjoy Dasgupta. Learning mixtures of gaussians. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 634–644. IEEE, 1999.
- [DB22] Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. arXiv preprint arXiv:2208.05314, 2022.
- [DHKK20] Ilias Diakonikolas, Samuel B Hopkins, Daniel Kane, and Sushrut Karmalkar. Robustly learning any clusterable mixture of gaussians. arXiv preprint arXiv:2005.06417, 2020.
- [DK14] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of gaussians. In Conference on Learning Theory, pages 1183–1213. PMLR, 2014.
- [DK20] Ilias Diakonikolas and Daniel M Kane. Small covers for near-zero sets of polynomials and learning latent variable models. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 184–195. IEEE, 2020.
- [DKK+19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.
- [FOS08] Jon Feldman, Ryan O’Donnell, and Rocco A Servedio. Learning mixtures of product distributions over discrete domains. SIAM Journal on Computing, 37(5):1536–1564, 2008.
- [GKL24] Khashayar Gatmiry, Jonathan Kelner, and Holden Lee. Learning mixtures of gaussians using diffusion models. arXiv preprint arXiv:2404.18869, 2024.
- [GLB+24] Hanzhong Guo, Cheng Lu, Fan Bao, Tianyu Pang, Shuicheng Yan, Chao Du, and Chongxuan Li. Gaussian mixture solvers for diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
- [HJA20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- [HO07] John R Hershey and Peder A Olsen. Approximating the kullback leibler divergence between gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, volume 4, pages IV–317. IEEE, 2007.
- [HRX24] Yinbin Han, Meisam Razaviyayn, and Renyuan Xu. Neural network-based score estimation in diffusion models: Optimization and generalization. In The Twelfth International Conference on Learning Representations, 2024.
- [HWSL24] Jerry Yao-Chieh Hu, Weimin Wu, Zhao Song, and Han Liu. On statistical rates and provably efficient criteria of latent diffusion transformers (dits). arXiv preprint arXiv:2407.01079, 2024.
- [KFL22] Dohyun Kwon, Ying Fan, and Kangwook Lee. Score-based generative modeling secretly minimizes the wasserstein distance. Advances in Neural Information Processing Systems, 35:20205–20217, 2022.
- [KLL+24] Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In The Twelfth International Conference on Learning Representations, 2024.
- [KPH+20] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.
- [LKB+23] Jae Hyun Lim, Nikola B Kovachki, Ricardo Baptista, Christopher Beckham, Kamyar Azizzadenesheli, Jean Kossaifi, Vikram Voleti, Jiaming Song, Karsten Kreis, Jan Kautz, et al. Score-based diffusion models in function space, 2023. URL https://arxiv.org/abs/2302.07400, 2023.
- [LLSS24] Chenyang Li, Yingyu Liang, Zhenmei Shi, and Zhao Song. Exploring the frontiers of softmax: Provable optimization, applications in diffusion model, and beyond. arXiv preprint arXiv:2405.03251, 2024.
- [LLT22] Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative modeling with polynomial complexity. Advances in Neural Information Processing Systems, 35:22870–22882, 2022.
- [LS17] Jerry Li and Ludwig Schmidt. Robust and proper learning for mixtures of gaussians via systems of polynomial inequalities. In Conference on Learning Theory, pages 1302–1382. PMLR, 2017.
- [LSSS24] Yingyu Liang, Zhizhou Sha, Zhenmei Shi, and Zhao Song. Differential privacy mechanisms in neural tangent kernel regression. arXiv preprint arXiv:2407.13621, 2024.
- [MV10] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 93–102. IEEE, 2010.
- [ND21] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021.
- [NRW21] Eliya Nachmani, Robin San Roman, and Lior Wolf. Non gaussian denoising diffusion models. arXiv preprint arXiv:2106.07582, 2021.
- [Ope24] OpenAI. Video generation models as world simulators, 2024. https://openai.com/research/video-generation-models-as-world-simulators.
- [PKK22] João M Pereira, Joe Kileel, and Tamara G Kolda. Tensor moments of gaussian mixture models: Theory and applications. arXiv preprint arXiv:2202.06930, 2022.
- [PX23] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- [RBL+22] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- [RFB15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
- [RGKZ21] Maria Refinetti, Sebastian Goldt, Florent Krzakala, and Lenka Zdeborová. Classifying high-dimensional gaussian mixtures: Where kernel methods fail and neural networks succeed. In International Conference on Machine Learning, pages 8936–8947. PMLR, 2021.
- [RHX+23] Alex Reneau, Jerry Yao-Chieh Hu, Chenwei Xu, Weijian Li, Ammar Gilani, and Han Liu. Feature programming for multivariate time series prediction. In Fortieth International Conference on Machine Learning (ICML), 2023.
- [SCK23] Kulin Shah, Sitan Chen, and Adam Klivans. Learning mixtures of gaussians using the ddpm objective. Advances in Neural Information Processing Systems, 36:19636–19649, 2023.
- [SCL+22] Zhenmei Shi, Jiefeng Chen, Kunyang Li, Jayaram Raghuram, Xi Wu, Yingyu Liang, and Somesh Jha. The trade-off between universality and label efficiency of representations from contrastive learning. In The Eleventh International Conference on Learning Representations, 2022.
- [Sco15] David W Scott. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.
- [SDCS23] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Proceedings of the 40th International Conference on Machine Learning, pages 32211–32252, 2023.
- [SDME21] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. Advances in neural information processing systems, 34:1415–1428, 2021.
- [SDWMG15] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
- [SE19] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
- [SE20] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
- [SGSE20] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020.
- [SHC+22] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE transactions on pattern analysis and machine intelligence, 45(4):4713–4726, 2022.
- [SK21] Yang Song and Diederik P Kingma. How to train your energy-based models. arXiv preprint arXiv:2101.03288, 2021.
- [SMF+24] Zhenmei Shi, Yifei Ming, Ying Fan, Frederic Sala, and Yingyu Liang. Domain generalization via nuclear norm regularization. In Conference on Parsimony and Learning, pages 179–201. PMLR, 2024.
- [SOAJ14] Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal-sample estimators for spherical gaussian mixtures. Advances in Neural Information Processing Systems, 27, 2014.
- [SSDK+20] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- [SWL21] Zhenmei Shi, Junyi Wei, and Yingyu Liang. A theoretical analysis on feature learning in neural networks: Emergence from inputs and advantage over fixed features. In International Conference on Learning Representations, 2021.
- [SWL24] Zhenmei Shi, Junyi Wei, and Yingyu Liang. Provable guarantees for neural networks via gradient feature learning. Advances in Neural Information Processing Systems, 36, 2024.
- [Vin04] Susana Vinga. Convolution integrals of normal distribution functions. Supplementary material to Vinga and Almeida (2004), “Rényi continuous entropy of DNA sequences”, 2004.
- [WCL+24] Yuchen Wu, Minshuo Chen, Zihao Li, Mengdi Wang, and Yuting Wei. Theoretical insights for diffusion guidance: A case study for gaussian mixture models. arXiv preprint arXiv:2403.01639, 2024.
- [WCZ+23] Yilin Wang, Zeyuan Chen, Liangjun Zhong, Zheng Ding, Zhizhou Sha, and Zhuowen Tu. Dolfin: Diffusion layout transformers without autoencoder. arXiv preprint arXiv:2310.16305, 2023.
- [WSD+23] Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, and Zhuowen Tu. Tokencompose: Grounding diffusion with token-level supervision. arXiv preprint arXiv:2312.03626, 2023.
- [WXZ+24] Yilin Wang, Haiyang Xu, Xiang Zhang, Zeyuan Chen, Zhizhou Sha, Zirui Wang, and Zhuowen Tu. Omnicontrolnet: Dual-stage integration for conditional image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7436–7448, 2024.
- [XSW+23] Zhuoyan Xu, Zhenmei Shi, Junyi Wei, Fangzhou Mu, Yin Li, and Yingyu Liang. Towards few-shot adaptation of foundation models via multitask finetuning. In The Twelfth International Conference on Learning Representations, 2023.
- [YFZ+23] Zhantao Yang, Ruili Feng, Han Zhang, Yujun Shen, Kai Zhu, Lianghua Huang, Yifei Zhang, Yu Liu, Deli Zhao, Jingren Zhou, et al. Lipschitz singularities in diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
- [ZHF+24] Jianbin Zheng, Minghui Hu, Zhongyi Fan, Chaoyue Wang, Changxing Ding, Dacheng Tao, and Tat-Jen Cham. Trajectory consistency distillation. arXiv preprint arXiv:2402.19159, 2024.
Appendix
Roadmap.
- • Section A discusses the limitations of the paper.
- • Section B discusses the societal impacts of the paper.
- • Section C provides the preliminaries for the paper.
- • Section D provides the case when we consider a continuous-time score function, specifically a single Gaussian.
- • Section E provides the case when we consider the score function of a mixture of 2 Gaussians.
- • Section F provides the case when we consider the score function of a k-mixture of Gaussians.
- • Section G puts everything together and provides the formal lemmas for the k-mixture of Gaussians used in our main results.
- • Section H provides our main results when we consider the data distribution to be a k-mixture of Gaussians.
Appendix A Limitations
This work has not directly addressed the practical applications of our results. Additionally, we did not provide a sample complexity bound for our settings. Future research could explore how these findings might be implemented in real-world scenarios and work on improving these limitations.
Appendix B Societal Impacts
We explore and provide a deeper understanding of the diffusion models and also explicitly give the Lipschitz constant for -mixture of Gaussians, which may inspire a better algorithm design.
On the other hand, our paper is purely theoretical in nature, so we foresee no immediate negative ethical impact.
Appendix C Preliminary
This section provides some preliminary knowledge and is organized as follows: Section C.1 provides basic facts, Section C.2 provides properties of functions, and Section C.3 provides the Lipschitz multiplication property.
C.1 Facts
We provide several basic facts from calculus and linear algebra that are used in the proofs.
Fact C.1 (Calculus).
For , , , , , it is well-known that
- • (chain rule)
- • (product rule)
- • (power rule)
- • (derivative of the inner product)
- • (derivative of the exponential function)
- • (derivative of the logarithm function)
- • (derivative of the norm)
- • if is independent from . (derivative of independent variables)
Fact C.2 (Norm Bounds).
For , , , , , is symmetric and p.s.d., we have
- • (absolute homogeneity)
- • (triangle inequality)
- • (Cauchy–Schwarz inequality)
- •
- •
- •
- •
- •
- • .
- • .
Fact C.3 (Matrix Calculus).
Let denote a symmetric matrix. Let and . Suppose that is independent of . Then, we know
-
•
C.2 Properties of functions
During the course of proving the Lipschitz continuity for mixtures of Gaussians, we found that we need to use the following bound for the function.
Fact C.4.
For , where , , we have
Proof.
We have
where the first step follows from simple algebra, the second step follows from , and the last step follows from for all . ∎
Fact C.5.
For , where , we have
Proof.
We have
where the first step follows from the notation of the Hadamard product, the second step follows from , and the last step follows from for all . ∎
Fact C.6 (Mean value theorem for vector function).
For vector , vector function , , let be differentiable on open convex domain , we have
- • Part 1:
- • Part 2: for some , where denotes the matrix whose -th entry is .
- • Part 3: If for all , then for all .
Proof.
Proof of Part 1
Part 1 can be verified by applying Mean Value Theorem of 1-variable function on .
where .
Proof of Part 2
Let , we have
where the first step follows from basic calculation, the second step follows from Part 1, the third step uses the chain rule, and the fourth step follows from the Cauchy–Schwarz inequality. Removing on both sides gives the result.
Proof of Part 3
Part 3 directly follows from Part 2. ∎
Fact C.7.
Let denote a matrix whose -th term is . For , , , where , , we have
Proof.
We can show
where the first step follows from , the second step follows from Fact C.2, the third step follows from the fact that the spectral norm of a diagonal matrix is the largest absolute value of its diagonal entries, the fourth step follows from , and the last step follows from . ∎
Fact C.8.
For , , , where , we have
Proof.
Fact C.9.
For , , , where , we have
Proof.
We can show, for ,
where the first step follows from Mean Value Theorem, the second step follows from Fact C.1, the third step follows from , and the last step follows from .
∎
C.3 Lipschitz multiplication property
Our overall proof of the Lipschitz constant for the k-mixture of Gaussians follows the idea of the Fact below.
Fact C.10.
If the following conditions hold
-
•
-
•
Then, we have
Proof.
We can show
where the first step follows from simple algebra, the second step follows from Fact C.2, the third step follows from rearranging terms, the fourth step follows from the assumptions of the lemma, the fifth step follows from the same logic of above, the sixth step follows from simple algebra, and the last step follows from the recursive process. ∎
Appendix D Single Gaussian Case
In this section, we consider the continuous case of , which is the probability density function (pdf) of the input data and also a function of time . More specifically, we consider the cases when is: (1) a single Gaussian where either the mean is a function of time (Section D.1) or the covariance is a function of time (Section D.2), (2) a single Gaussian where both the mean and the covariance are functions of time (Section D.3). We then compute the upper bound and the Lipschitz constant for the score function, i.e., the gradient of the log pdf.
D.1 Case when the mean of is a function of time
We start our calculation by a simple case. Consider such that
Let denote . We have that is . Then, the gradient is a function of because of . Substituting and into the gradient function, we are done.
Below we define the pdf for the continuous case when the mean is a function of time.
Definition D.1.
If the following conditions hold
-
•
Let .
-
•
Let , and .
We define
Further, we have
Below we calculate the score function of for the continuous case when the mean is a function of time.
Lemma D.2.
If the following conditions hold
-
•
Let .
-
•
Let , and .
Then,
Proof.
Below we calculate the upper bound for the score function of for continuous case when the mean is a function of time.
Lemma D.3 (Linear growth).
If the following conditions hold
-
•
Let .
-
•
Let , and .
Then,
Proof.
Below we calculate the Lipschitz constant for the score function of for the continuous case when the mean is a function of time.
Lemma D.4 (Lipschitz).
If the following conditions hold
-
•
Let .
-
•
Let , and .
Then,
Proof.
We can show
where the first step follows from Lemma D.2, and the last step follows from simple algebra. ∎
D.2 Case when the covariance of is a function of time
Below we define the pdf for the continuous case when the covariance is a function of time.
Definition D.5.
If the following conditions hold
-
•
Let .
-
•
Let , and .
We define
Further, we have
Below we calculate the score function of for the continuous case when the covariance is a function of time.
Lemma D.6.
If the following conditions hold
-
•
Let .
-
•
Let , and .
Then,
Proof.
Below we calculate the upper bound of the score function of for the continuous case when the covariance is a function of time.
Lemma D.7 (Linear growth).
If the following conditions hold
-
•
Let .
-
•
Let , and .
Then,
Proof.
Below we calculate the Lipschitz constant of the score function of for the continuous case when the covariance is a function of time.
Lemma D.8 (Lipschitz).
If the following conditions hold
-
•
Let .
-
•
Let , and .
Then,
D.3 A general version for single Gaussian
Now we combine the previous results by calculate a slightly more complex case. Consider such that
where , and they are differentiable with respect to , and is a symmetric p.s.d. matrix whose smallest singular value is always larger than a fixed .
Definition D.9.
If the following conditions hold
-
•
Let .
-
•
Let , and .
We define
Further, we have
Below we calculate the score function of for the continuous case when both the mean and covariance are functions of time.
Lemma D.10.
If the following conditions hold
-
•
Let .
-
•
Let , and .
Then,
Proof.
Below we calculate the upper bound of the score function of for the continuous case when both the mean and covariance are functions of time.
Lemma D.11 (Linear growth).
If the following conditions hold
-
•
Let .
-
•
Let , and .
Then,
Proof.
Below we calculate the Lipschitz constant of the score function of for the continuous case when both the mean and covariance are functions of time.
Lemma D.12 (Lipschitz).
If the following conditions hold
-
•
Let .
-
•
Let , and .
Then,
Appendix E A General Version for Two Gaussian
In this section, we compute the linear growth and Lipschitz constant for a mixture of 2 Gaussians where both the means and covariances are functions of time. This section is organized as follows:
- • Section E.1 defines the probability density function (pdf) that we use, which is a mixture of 2 Gaussians.
- • Section E.2 provides lemmas that are used for the calculation of the score function, i.e., the gradient of the log pdf.
- • Section E.3 provides the expression of the score function.
- • Section E.4 provides lemmas that are used for the calculation of the upper bound of the score function.
- • Section E.5 provides the expression of the upper bound of the score function.
- • Section E.6 provides upper bounds for some base functions that are used in the calculation of the Lipschitz constant of the score function.
- • Section E.7 provides Lipschitz constants for some base functions that are used in the calculation of the Lipschitz constant of the score function.
- • Section E.8 provides Lipschitz lemmas for that are used in the calculation of the Lipschitz constant of the score function.
- • Section E.9 provides Lipschitz lemmas for that are used in the calculation of the Lipschitz constant of the score function.
- • Section E.10 provides the expression of the Lipschitz constant of the score function.
First, we define the following. Let and also is a function of time . Consider such that
where , and they are differentiable with respect to , and is a symmetric p.s.d. matrix whose smallest singular value is always larger than a fixed value .
For further simplicity of calculation, we denote to be .
E.1 Definitions
Below we define function and .
Definition E.1.
If the following conditions hold
-
•
Let .
-
•
Let , and .
We define
and
It is clear that since takes its maximum when .
Below we define the for 2 mixtures of Gaussians.
Definition E.2.
If the following conditions hold
-
•
Let .
-
•
Let , and .
-
•
Let and .
-
•
Let be defined as Definition E.1.
We define
This can be further rewritten as follows:
Further, we have
E.2 Lemmas for calculation of the score function
This subsection describes lemmas that are used for further calculation of the score function.
This lemma calculates the gradient of function .
Lemma E.3.
If the following conditions hold
-
•
Let .
-
•
Let , and .
-
•
Let and .
-
•
Let be defined as Definition E.1.
Then, for , we have
Proof.
This lemma calculates the gradient of function .
Lemma E.4.
If the following conditions hold
Then,
E.3 Calculation of the score function
Below we define and that simplify further calculation.
Definition E.5.
For further simplicity, we define the following functions:
If the following conditions hold
-
•
Let .
-
•
Let , and .
-
•
Let and .
-
•
Let be defined as Definition E.1.
We define
and
It is clear that , and .
This lemma calculates the score function.
Lemma E.6.
If the following conditions hold
Then,
E.4 Lemmas for the calculation of the upper bound of the score function
This section provides lemmas that are used in calculation of upper bound of the score function.
This lemma calculates the upper bound of function .
Lemma E.7.
If the following conditions hold
-
•
Let .
-
•
Let , and .
Then, for each , we have
E.5 Upper bound of the score function
This lemma calculates the upper bound of the score function.
Lemma E.8 (Linear growth).
If the following conditions hold
Then,
E.6 Lemmas for Lipschitz calculation: upper bound of base functions
This section provides the lemmas of bounds of base functions that are used in calculation of Lipschitz of the score function.
This lemma calculates the upper bound of the function .
Lemma E.9.
If the following conditions hold
-
•
Let .
-
•
Let , and .
-
•
Let , where , for each .
Then, for each , we have
Proof.
This lemma calculates the lower bound of the function .
Lemma E.10.
If the following conditions hold
-
•
Let .
-
•
Let , and .
-
•
Let , where , for each .
-
•
Let , where , for each .
Then,
Proof.
This lemma calculates the upper bound of the function .
Lemma E.11.
If the following conditions hold
-
•
Let .
-
•
Let , and .
-
•
Let , where , for each .
-
•
Let , where , for each .
Then,
E.7 Lemmas for Lipschitz calculation: Lipschitz constant of base functions
This section provides the lemmas of Lipschitz constant of base functions that are used in calculation of Lipschitz of the score function.
This lemma calculates Lipschitz constant of function .
Lemma E.12.
If the following conditions hold
-
•
Let .
-
•
Let , and .
Then, for , we have
Proof.
This lemma calculates Lipschitz constant of function .
Lemma E.13.
If the following conditions hold
-
•
Let .
-
•
Let , and .
-
•
Let , , where , for each .
Then, for each , we have
Proof.
This lemma calculates Lipschitz constant of function .
Lemma E.14.
If the following conditions hold
-
•
Let .
-
•
Let , and .
-
•
Let be defined as Definition E.1.
-
•
Let , where , for each .
-
•
Let , where , for each .
Then, for each , we have
Proof.
This lemma calculates Lipschitz constant of function .
Lemma E.15.
If the following conditions hold
-
•
Let .
-
•
Let , and .
-
•
Let be defined as Definition E.1.
-
•
Let and .
-
•
Let , where , for each .
-
•
Let , where , for each .
-
•
Let .
-
•
Let .
-
•
Let .
Then, we have
Proof.
This lemma calculates Lipschitz constant of function .
Lemma E.16.
If the following conditions hold
-
•
Let .
-
•
Let , and .
-
•
Let be defined as Definition E.1.
-
•
Let and .
-
•
Let , where , for each .
-
•
Let , where , for each .
-
•
Let , where .
-
•
Let .
-
•
Let .
-
•
Let .
Then,
Proof.
We can show
where the first step follows from simple algebra, the second step follows from , and the last step follows from Lemma E.15. ∎
E.8 Lemmas for Lipschitz calculation:
This lemma calculates Lipschitz constant of function .
Lemma E.17.
If the following conditions hold
Then,
Proof.
We can show
where the first step follows from Definition E.5, the second step follows from Fact C.2, and the last step follows from simple algebra.
For the first term in the above, we have
(11) |
where the first step follows from , the second step follows from Lemma E.16 and the last step follows from definition of .
For the second term in the above, we have
(12) |
where the first step follows from , the second step follows from Lemma E.15.
This lemma calculates Lipschitz constant of function .
Lemma E.18.
If the following conditions hold
Then, we have
E.9 Lemmas for Lipschitz calculation:
This lemma calculates Lipschitz constant of function .
Lemma E.19.
If the following conditions hold
Then,
Proof.
We can show
where the first step follows from Definition E.5, the second step follows from Fact C.2, and the last step follows from simple algebra.
For the first term in the above, we have
(15) |
where the first step follows from , the second step follows from Lemma E.16.
For the second term in the above, we have
(16) |
where the first step follows from , the second step follows from Lemma E.15.
This lemma calculates Lipschitz constant of function .
Lemma E.20.
If the following conditions hold
Then, we have
E.10 Lipschitz constant of the score function
This lemma calculates the Lipschitz constant of the score function.
Lemma E.21 (Lipschitz).
Appendix F A General Version for Gaussian
In this section, we consider the more general case of a k-mixture of Gaussians.
- • Section F.1 provides the definition of the k-mixture of Gaussians.
- • Section F.2 provides the expression of the score function.
- • Section F.3 provides the upper bound of the score function.
- • Section F.4 provides lemmas that are used in the further calculation of the Lipschitz constant.
- • Section F.5 provides the Lipschitz constant for the k-mixture of Gaussians.
F.1 Definitions
Let . Let , , and be a function of time . Consider such that
where , and they are differentiable with respect to , and is a symmetric p.s.d. matrix whose smallest singular value is always larger than a fixed value .
Below we define the pdf for a single multivariate Gaussian.
Definition F.1.
If the following conditions hold
-
•
Let .
-
•
Let .
-
•
Let , and .
We define
This is the pdf of a single Gaussian, so it is clear that since takes its maximum when .
Below we define the pdf for the k-mixture of Gaussians.
Definition F.2.
If the following conditions hold
-
•
Let .
-
•
Let .
-
•
Let , and .
-
•
Let , , and .
-
•
Let be defined as Definition F.1.
We define
This can be further rewritten as follows:
Further, we have
This lemma calculates the gradient of the pdf for the k-mixture of Gaussians.
Lemma F.3.
If the following conditions hold
We have
Proof.
We can show
where the first step follows from Definition F.2, the second step follows from Fact C.1, and the last step follows from Lemma E.3.
∎
Below we define that simplifies further calculation.
Definition F.4.
If the following conditions hold
-
•
Let .
-
•
Let .
-
•
Let , and .
-
•
Let , , and .
-
•
Let be defined as Definition F.1.
For further simplicity, we define
It is clear that and
F.2 Calculation of the score function
This lemma calculates the score function for the k-mixture of Gaussians.
Lemma F.5.
If the following conditions hold
We have
F.3 Upper bound of the score function
This lemma calculates the upper bound of the score function for the k-mixture of Gaussians.
Lemma F.6.
If the following conditions hold
Then, we have
F.4 Lemmas for Lipschitz calculation
This section provides lemmas for the calculation of the Lipschitz constant of the score function for the k-mixture of Gaussians.
This lemma calculates Lipschitz constant of function .
Lemma F.7.
If the following conditions hold
-
•
Let .
-
•
Let .
-
•
Let , and .
-
•
Let , , and .
-
•
Let be defined as Definition F.1.
-
•
Let , where , for each .
-
•
Let , where , for each .
-
•
Let .
-
•
Let .
-
•
Let .
Then, we have
Proof.
This lemma calculates Lipschitz constant of function .
Lemma F.8.
If the following conditions hold
-
•
Let .
-
•
Let .
-
•
Let , and .
-
•
Let , , and .
-
•
Let be defined as Definition F.1.
-
•
Let , where , for each .
-
•
Let , where , for each .
-
•
Let , where .
-
•
Let .
-
•
Let .
-
•
Let .
Then, we have
Proof.
We can show
where the first step follows from simple algebra, the second step follows from , and the last step follows from Lemma F.7. ∎
This lemma calculates Lipschitz constant of function .
Lemma F.9.
If the following conditions hold
Then, for each , we have
Proof.
We can show
where the first step follows from Definition F.4, the second step follows from Fact C.2, and the last step follows from simple algebra.
For the first term in the above, we have
(19) |
where the first step follows from , the second step follows from Lemma F.8, and the last step follows from definition of .
For the second term in the above, we have
(20) |
where the first step follows from , the second step follows from Lemma F.7.
This lemma calculates Lipschitz constant of function .
Lemma F.10.
If the following conditions hold
Then, for each , we have
F.5 Lipschitz constant of the score function
This lemma calculates the Lipschitz constant of the score function for the k-mixture of Gaussians.
Lemma F.11.
If the following conditions hold
Then, we have
Appendix G Putting It All Together
Our overall goal is to provide a more concrete calculation for the theorems in Section 6 by assuming the data distribution is a mixture of Gaussians. Now we provide the lemmas that are used in the further calculation.
Now we provide the lemma for the k-mixture of Gaussians, which states that if is a mixture of Gaussians, then all the pdfs in the diffusion process are also mixtures of Gaussians.
Lemma G.1 (Formal version of Lemma 3.2).
Let . Let be a -mixture of Gaussian distribution, and let be its , i.e.,
Let sample from . Let and , which is independent from . Then we have a new random variable which is also a -mixture of Gaussian distribution , whose is
where .
Proof.
First, we know that the pdf of the sum of two independent random variables is the convolution of their pdfs.
From [Vin04], we know that the convolution of two Gaussians is another Gaussian, i.e., , where is the convolution operator.
We also know that the pdf of a linear transformation of a random variable , say where , is .
If we consider the transformation where , this transformation can be written as . Therefore, the pdf of is .
Now we prove the lemma. To find the pdf of , where , we can show
where the first step follows from the definition of , the second step follows from , and the last step follows from the definition of the Gaussian distribution.
For a single standard Gaussian random variable , the pdf of is simply .
To find the pdf of , we can show
where the first step follows from the fact that the pdf of the sum of two independent random variables is the convolution of their pdfs, the second step follows from the pdf of , the third step follows from the distributive property of convolution, and the last step follows from .
Thus, the pdf of can be written as a mixture of Gaussians:
where . ∎
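As a numerical aside (not part of the original proof), the following sketch checks the convolution identity from [Vin04] used above: the convolution of two 1-D Gaussian densities is again Gaussian, with means and variances adding.

```python
import numpy as np
from scipy.stats import norm

grid = np.linspace(-15, 15, 6001)
dx = grid[1] - grid[0]
f = norm.pdf(grid, loc=1.0, scale=1.5)   # N(1, 1.5^2)
g = norm.pdf(grid, loc=-2.0, scale=0.8)  # N(-2, 0.8^2)

# Discrete approximation of the convolution integral (f * g)(x).
conv = np.convolve(f, g, mode="same") * dx
target = norm.pdf(grid, loc=1.0 - 2.0, scale=np.sqrt(1.5**2 + 0.8**2))
print("max abs error:", np.abs(conv - target).max())  # small discretization error
```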
Now we provide the lemma for the second momentum of the k-mixture of Gaussians.
Lemma G.2 (Formal version of Lemma 3.5).
If the following conditions hold:
-
•
, where is defined by Eq. (9).
Then, we have
Proof.
From [PKK22], we know that the second momentum of the data distribution is given by:
(23) |
Then, we can show
where the first step follows from definition of -norm, the second step follows from where is a vector, the third step follows from the linearity of the trace operator, the fourth step follows from Eq. (23), and the last step follows from . ∎
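The closed-form second momentum used in this proof can be checked numerically; the sketch below (with assumed notation) compares the formula sum_i w_i (||mu_i||^2 + tr(Sigma_i)) against a Monte-Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(6)
d, k = 3, 3
w = np.array([0.2, 0.5, 0.3])
mus = [rng.normal(size=d) for _ in range(k)]
covs = []
for _ in range(k):
    A = rng.normal(size=(d, d))
    covs.append(A @ A.T + np.eye(d))    # random symmetric p.s.d. covariance

closed_form = sum(wi * (mu @ mu + np.trace(S)) for wi, mu, S in zip(w, mus, covs))

comp = rng.choice(k, size=500_000, p=w)
samples = np.vstack([
    rng.multivariate_normal(mus[i], covs[i], size=int((comp == i).sum())) for i in range(k)
])
print("closed form:", closed_form)
print("Monte Carlo:", (samples ** 2).sum(axis=1).mean())
```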
Now we give the Lipschitz constant explicitly.
Appendix H More Calculation for Application
In this section, we will provide a more concrete calculation for Theorem 6.5, Theorem 6.6, Theorem 6.7, Theorem 6.8 and Theorem 6.9.
H.1 Concrete calculation of Theorem 6.5
Theorem H.1 (DDPM, total variation, formal version of Theorem 5.6).
If the following conditions hold:
- •
-
•
The step size satisfies and for .
-
•
Let denote the density of the output of the defined by Definition 4.3.
We have
where
-
•
,
-
•
,
-
•
.
Proof.
Now we want to find a more concrete in Assumption 6.1. Notice that from Lemma G.1, we know that at any time between , is also a k-mixture of Gaussians, except that the mean and covariance change with time.
Using Lemma G.3, we can get .
In Assumption 5.1, we also assume the same thing.
Now we want to have a more concrete setting for Theorem 6.5 by calculating each term directly. Notice that we now have all the quantities except for the KL divergence term. Thus, we calculate , which is the KL divergence between the data distribution and the standard Gaussian.
In our notation, we have
However, this integral has no closed form, but we can find an upper bound for this KL divergence instead.
We know the KL divergence between two normal distributions is given by:
We define , , . From [HO07], we can show
where the first step follows from the convexity of KL divergence, the second step follows from KL divergence of 2 normal distribution, the third step follows from and , and the last step follows from the definition of , , .
Then we have all the quantities in Theorem 6.5. After directly applying the theorem, we finish the proof. ∎
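The convexity-based upper bound used in this proof can also be illustrated numerically; the sketch below (not from the paper, with assumed parameters) compares the bound sum_i w_i KL(N(mu_i, Sigma_i) || N(0, I)), computed from the Gaussian KL closed form, with a Monte-Carlo estimate of the true KL divergence of the mixture.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(7)
d = 2
w = np.array([0.6, 0.4])
mus = [np.array([1.0, -1.0]), np.array([-2.0, 0.5])]
covs = [0.5 * np.eye(d), np.array([[1.5, 0.3], [0.3, 0.8]])]

# KL(N(mu, Sigma) || N(0, I)) = 0.5 * (tr(Sigma) + ||mu||^2 - d - log det Sigma).
upper = sum(
    wi * 0.5 * (np.trace(S) + mu @ mu - d - np.log(np.linalg.det(S)))
    for wi, mu, S in zip(w, mus, covs)
)

# Monte-Carlo estimate of the true KL(p || N(0, I)) for comparison.
comp = rng.choice(2, size=200_000, p=w)
x = np.vstack([
    rng.multivariate_normal(mus[i], covs[i], size=int((comp == i).sum())) for i in range(2)
])
log_p = np.logaddexp(
    np.log(w[0]) + multivariate_normal.logpdf(x, mus[0], covs[0]),
    np.log(w[1]) + multivariate_normal.logpdf(x, mus[1], covs[1]),
)
log_q = multivariate_normal.logpdf(x, np.zeros(d), np.eye(d))
print("Monte-Carlo KL:", (log_p - log_q).mean(), " convexity upper bound:", upper)
```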
H.2 Concrete calculation for Theorem 6.6
H.3 Concrete calculation for Theorem 6.7
Theorem H.3 (, KL divergence for smooth data distribution, formal version of Theorem 5.8).
If the following conditions hold:
- •
-
•
We use the exponentially decreasing (then constant) step size .
-
•
Let denote the density of the output of the defined by Definition 4.4.
We have
where
-
•
,
-
•
.
Furthermore, if we choose and , then we can show
In addition, for the Euler-Maruyama scheme defined in Definition 4.3, the same bounds hold with an additional term.
H.4 Concrete calculation for Theorem 6.8
Theorem H.4 (, formal version of Theorem 5.9).
If the following conditions hold:
We have
where
-
•
,
-
•
.
In particular, if we set , , , and if the score estimation error satisfies , then we can obtain TV error with a total iteration complexity of steps.
H.5 Concrete calculation for Theorem 6.9
Theorem H.5 (, formal version of Theorem 5.10).
If the following conditions hold:
We have
where
-
•
,
-
•
.
In particular, if we set , , , and if the score estimation error satisfies , then we can obtain TV error with a total iteration complexity of steps.