
Unraveling the Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective

Yingyu Liang ([email protected], The University of Hong Kong; [email protected], University of Wisconsin-Madison), Zhenmei Shi ([email protected], University of Wisconsin-Madison), Zhao Song ([email protected], The Simons Institute for the Theory of Computing at the University of California, Berkeley), Yufa Zhou ([email protected], University of Pennsylvania)

Diffusion models have made rapid progress in generating high-quality samples across various domains. However, a theoretical understanding of the Lipschitz continuity and second momentum properties of the diffusion process is still lacking. In this paper, we bridge this gap by providing a detailed examination of these smoothness properties for the case where the target data distribution is a mixture of Gaussians, which serves as a universal approximator for smooth densities such as image data. We prove that if the target distribution is a $k$-mixture of Gaussians, the density of the entire diffusion process will also be a $k$-mixture of Gaussians. We then derive tight upper bounds on the Lipschitz constant and second momentum that are independent of the number of mixture components $k$. Finally, we apply our analysis to various diffusion solvers, both SDE and ODE based, to establish concrete error guarantees in terms of the total variation distance and KL divergence between the target and learned distributions. Our results provide deeper theoretical insights into the dynamics of the diffusion process under common data distributions.

1 Introduction

Diffusion models, a prominent generative modeling framework, have made rapid progress and garnered significant attention in recent years, owing to their powerful ability to generate high-quality samples across diverse domains and applications. Score-based generative diffusion models [HJA20, SSDK+20] can generate high-quality image samples comparable to those of GANs, which require adversarial optimization. Based on the U-Net [RFB15], stable diffusion [RBL+22], a conditional multi-modality generation model, can generate commercially usable images. Based on the Diffusion Transformer (DiT) [PX23], OpenAI released a video diffusion model, Sora [Ope24], with remarkable performance.

However, despite these technological strides, a critical gap persists in the theoretical understanding of diffusion processes, especially concerning the Lipschitz continuity and second momentum properties of these models. Many existing studies ([DB22, CCL+22, LLT22, CLL23, CCL+23], among many others) make simplifying assumptions about these key smoothness properties but lack a rigorous formulation or comprehensive analysis. This paper aims to bridge this theoretical gap by providing a detailed examination of the Lipschitz and second momentum characteristics. To make the data distribution analyzable, we consider mixture of Gaussian data distributions, which are universal approximators (Fact 1.1) of any smooth density, such as complex image and video data distributions.

[Figure 1 omitted: four panels.]
Figure 1: An illustration of the discrete diffusion process for an 8-mixture of Gaussians, as in Eq. (2). The left figure shows the target 2-dimensional data distribution $p_0$. The right figure shows the standard normal distribution $p_T=\mathcal{N}(0,I_{2\times 2})$, where $T=3$.
Fact 1.1.

A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific nonzero amount of error by a Gaussian mixture model with enough components [Sco15].

Furthermore, we prove that if the target data distribution is a $k$-mixture of Gaussians, the density of the whole diffusion process will be a $k$-mixture of Gaussians (Lemma 3.2). Thus, we focus on studying a mixture-of-Gaussians target data distribution in the diffusion process, for which we obtain a concrete Lipschitz constant and second momentum (Lemma 3.3 and Lemma 3.5). Moreover, we explore the implications of these properties through the lens of various solvers, both Stochastic Differential Equation (SDE) and Ordinary Differential Equation (ODE) based, providing deeper insight into the dynamic behavior of diffusion processes and the concrete guarantees in Table 1.

Our contributions:

  • As the Gaussian mixture model is a universal approximator of densities (Fact 1.1), we model the target/image data distribution as a $k$-mixture of Gaussians. Then, we show that the density of the whole diffusion process is a $k$-mixture of Gaussians (Lemma 3.2).

  • We analyze the Lipschitz constant and second momentum of the $k$-mixture of Gaussians data distribution and provide tight upper bounds that are independent of $k$ (Lemma 3.3 and Lemma 3.5), even when $k$ goes to infinity (see more discussion in Remark 3.4).

  • Applying our analysis to $\mathsf{DDPM}$, the SDE version of the reverse process, we prove that the dynamics of the diffusion process satisfy concrete total variation bounds (Theorem 5.6) and concrete KL divergence bounds (Theorem 5.7 and Theorem 5.8) under suitable choices of the total number of discretization steps. See a summary in Table 1.

  • Applying our analysis to $\mathsf{DPOM}$ and $\mathsf{DPUM}$, the ODE versions of the reverse process, we prove that the dynamics of the diffusion process satisfy concrete total variation bounds (Theorem 5.9 and Theorem 5.10) under suitable choices of the total number of discretization steps.

Other studies of mixtures of Gaussians under diffusion.

Recently, there has been a rich line of work studying mixture of Gaussian data distributions under the diffusion process, sharing a similarly popular setting. However, none focuses on the Lipschitz smoothness and second momentum properties as we do. We provide a short summary here for convenience. [WCL+24] analyzes the effect of diffusion guidance and provides theoretical insights into the instillation of task-specific information into the score function and conditional generation, e.g., text-to-image, by studying the mixture of Gaussian data distribution as a case study. [GLB+24] proposes a new SDE solver for diffusion models under mixture of Gaussian data. [SCK23, GKL24] learn mixtures of Gaussians using the $\mathsf{DDPM}$ objective and give sample complexity bounds under the assumption that all covariance matrices are the identity, without using Lipschitzness, while our work needs no identity assumption. [CKS24] mainly focuses on solving the $k$-mixture of Gaussians in diffusion models by leveraging properties of Gaussian mixtures and polynomial approximation methods, and gives sample complexity bounds by assuming the covariance matrices have bounded condition number and that the mean vectors and covariance matrices lie in a bounded ball.

Type | Error guarantee | Steps for $\widetilde{O}(\epsilon_0^2)$ error | Reference
$\mathsf{DDPM}$ [CCL+22] | $\mathrm{TV}(p_0,\widehat{q})^2$ | $\widetilde{\Theta}(d/\epsilon_0^2)$ | Theorem 5.6
$\mathsf{DDPM}$ [CLL23] | $\mathrm{KL}(p_0,\widehat{q})$ | $\widetilde{\Theta}(d/\epsilon_0^2)$ | Theorem 5.7
$\mathsf{DDPM}$ [CLL23] | $\mathrm{KL}(p_0,\widehat{q})$ | $\widetilde{\Theta}(d^2/\epsilon_0^2)$ | Theorem 5.8
$\mathsf{DPOM}$ [CCL+23] | $\mathrm{TV}(p_0,\widehat{q})^2$ | $\widetilde{\Theta}(d/\epsilon_0^2)$ | Theorem 5.9
$\mathsf{DPUM}$ [CCL+23] | $\mathrm{TV}(p_0,\widehat{q})^2$ | $\widetilde{\Theta}(\sqrt{d}/\epsilon_0)$ | Theorem 5.10
Table 1: A summary of our applications using our Lipschitz and second momentum bounds, assuming $\sigma_{\min(p_t)}$ is a constant (defined in Condition 5.4). The third column gives the number of discretization points required to guarantee a small total variation/KL divergence between the target data distribution $p_0$ and the learned distribution $\widehat{q}$.

2 Related Work

Diffusion models and score-based generative models. [HJA20] introduced the concept of denoising diffusion probabilistic models ($\mathsf{DDPM}$), which learn to reverse a gradual noising process to generate samples and have been applied to many areas [WSD+23, WCZ+23, WXZ+24]. Then, [SSDK+20] used Stochastic Differential Equations (SDEs) to build diffusion models and explored the equivalence between score-based generative models (SGMs) and diffusion models, generalizing diffusion models to continuous time. A line of work [SDWMG15] studies the connection between diffusion models and non-equilibrium thermodynamics, providing a theoretical foundation for these models. Another line of work has focused on improving the efficiency and quality of diffusion models [NRW21]. In particular, [SE19] leverages score matching to train diffusion models more efficiently, and [ND21] improved architectures and training techniques for diffusion models. Diffusion models have been applied to various domains beyond image generation, achieving impressive results, e.g., text-to-image synthesis [CZZ+20], audio synthesis [KPH+20], image super-resolution [SHC+22], and so on.

Mixtures of Gaussians. Mixtures of Gaussians are among the most fundamental and widely used statistical models [Das99, FOS08, MV10, BS10, CDSS13, SOAJ14, DK14, DKK+19, ADLS17, LS17, ABDH+18, DK20, BK20, DHKK20, SCL+22, BDJ+22, XSW+23, BS23, SMF+24] and have been studied in neural network learning [RGKZ21, SWL21, LSSS24, SWL24]. Recently, they have also been widely studied in diffusion models [SCK23, WCL+24, GLB+24, GKL24, CKS24] (see the detailed discussion in Section 1).

Lipschitz and second momentum in score estimation. A line of work uses bounded Lipschitzness and second momentum as key smoothness assumptions for the whole diffusion process without giving any concrete formulation [CCL+22, KFL22, LLT22, LKB+23, CLL23, CHZW23, CCL+23, CDD23, BDD23, ZHF+24, KLL+24], while our work gives a closed-form formulation of the Lipschitz constant and second momentum. Another rich line of work studies how to train diffusion models with better theoretical guarantees [SE19, SE20, SK21, SGSE20, SDME21, SDCS23, LKB+23, CDD23, RHX+23, CHZW23, SCK23, YFZ+23, LLSS24, GKL24, CCL+23, GLB+24, CKS24, HRX24, HWSL24], among many more.

Roadmap. In Section 3, we introduce our notation, several definitions, and lemmas related to $k$-mixtures of Gaussians. In Section 4, we provide preliminary knowledge on score-based generative models (SGMs) and diffusion models. In Section 5, we present our main results. In Section 6, we provide the tools that we use from previous papers. In Section 7, we conclude the paper.

3 Lipschitz and Second Momentum of Mixture of Gaussian

In this section, we discuss the notation used, several key definitions for $k$-mixtures of Gaussian distributions, and lemmas concerning the Lipschitz continuity and second momentum of these mixtures. We begin by presenting the notation used throughout the paper.

Notations. For two vectors $x\in\mathbb{R}^n$ and $y\in\mathbb{R}^n$, we use $\langle x,y\rangle$ to denote the inner product between $x,y$, i.e., $\langle x,y\rangle=\sum_{i=1}^n x_i y_i$. We use $\Pr[\cdot]$ to denote probability and $\operatorname*{\mathbb{E}}[\cdot]$ to denote expectation. We use $e_i$ to denote the vector whose $i$-th coordinate is $1$ and whose other entries are $0$. For $a,b\in\mathbb{R}^n$, we use $a\circ b\in\mathbb{R}^n$ to denote the vector whose $i$-th entry is $(a\circ b)_i=a_ib_i$ for all $i\in[n]$; this is the Hadamard product. We use ${\bf 1}_n$ to denote the length-$n$ all-ones vector. We use $x_{i,j}$ to denote the $j$-th coordinate of $x_i\in\mathbb{R}^n$. We use $\|x\|_p$ to denote the $\ell_p$ norm of a vector $x\in\mathbb{R}^n$, i.e., $\|x\|_1:=\sum_{i=1}^n|x_i|$, $\|x\|_2:=(\sum_{i=1}^n x_i^2)^{1/2}$, and $\|x\|_\infty:=\max_{i\in[n]}|x_i|$. For $k>n$, for any matrix $A\in\mathbb{R}^{k\times n}$, we use $\|A\|$ to denote the spectral norm of $A$, i.e., $\|A\|:=\sup_{x\in\mathbb{R}^n}\|Ax\|_2/\|x\|_2$. We use $\sigma_{\min}(A),\sigma_{\max}(A)$ to denote the minimum/maximum singular value of matrix $A$. For a square matrix $A$, we use $\operatorname{tr}[A]$ to denote the trace of $A$, i.e., $\operatorname{tr}[A]=\sum_{i=1}^n A_{i,i}$. We use $\det(A)$ to denote the determinant of matrix $A$. We use $f*g$ to denote the convolution of two functions $f,g$. In addition to $O(\cdot)$ notation, for two functions $f,g$, we use the shorthand $f\lesssim g$ (resp. $\gtrsim$) to indicate that $f\leq Cg$ (resp. $\geq$) for an absolute constant $C$.

$k$-mixtures of Gaussians. Now, we are ready to introduce $k$-mixtures of Gaussians. Formally, we have the following definition of the $\mathsf{pdf}$ of a $k$-mixture of Gaussians.

Definition 3.1 ($k$-mixture of Gaussians $\mathsf{pdf}$).

Let $x\in\mathbb{R}^d$, $i\in[k]$, and $t\geq 0$. For a fixed timestamp $t$, (1) let $\alpha_i(t)\in(0,1)$ be the weight of the $i$-th Gaussian component at time $t$, with $\sum_{i=1}^k\alpha_i(t)=1$; (2) let $\mu_i(t)\in\mathbb{R}^d$ and $\Sigma_i(t)\in\mathbb{R}^{d\times d}$ be the mean vector and covariance matrix of the $i$-th Gaussian component at time $t$. Then, we define

$$p_t(x):=\sum_{i=1}^k\frac{\alpha_i(t)}{(2\pi)^{d/2}\det(\Sigma_i(t))^{1/2}}\cdot\exp\Big(-\frac{1}{2}(x-\mu_i(t))^\top\Sigma_i(t)^{-1}(x-\mu_i(t))\Big).$$
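To make Definition 3.1 concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the helper name `gmm_pdf` and the example parameters are assumptions) that evaluates $p_t(x)$ for fixed weights, means, and covariances:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, alphas, mus, Sigmas):
    # p_t(x) = sum_i alpha_i * N(x; mu_i, Sigma_i), as in Definition 3.1,
    # for a fixed timestamp t (the t-dependence is absorbed into the arguments).
    return sum(a * multivariate_normal.pdf(x, mean=mu, cov=S)
               for a, mu, S in zip(alphas, mus, Sigmas))

# Example: a 2-component mixture in d = 2.
alphas = [0.3, 0.7]
mus = [np.array([2.0, 0.0]), np.array([-1.0, 1.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_pdf(np.zeros(2), alphas, mus, Sigmas))
```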

The following lemma shows that a linear combination of a $k$-mixture of Gaussians and a single standard Gaussian is still a $k$-mixture of Gaussians. The proof is in Appendix G.

Lemma 3.2 (Informal version of Lemma G.1).

Let $a,b\in\mathbb{R}$. Let ${\cal D}$ be a $k$-mixture of Gaussians distribution, and let $p$ be its $\mathsf{pdf}$ as defined in Definition 3.1, i.e.,

$$p(x):=\sum_{i=1}^k\frac{\alpha_i}{(2\pi)^{d/2}\det(\Sigma_i)^{1/2}}\cdot\exp\Big(-\frac{1}{2}(x-\mu_i)^\top\Sigma_i^{-1}(x-\mu_i)\Big).$$

Let $x\in\mathbb{R}^d$ be sampled from ${\cal D}$, and let $z\in\mathbb{R}^d$ with $z\sim{\cal N}(0,I)$ be independent of $x$. Then the new random variable $y=ax+bz$ also follows a $k$-mixture of Gaussians distribution $\widetilde{\cal D}$, whose $\mathsf{pdf}$ is

$$\widetilde{p}(x):=\sum_{i=1}^k\frac{\alpha_i}{(2\pi)^{d/2}\det(\widetilde{\Sigma}_i)^{1/2}}\cdot\exp\Big(-\frac{1}{2}(x-\widetilde{\mu}_i)^\top\widetilde{\Sigma}_i^{-1}(x-\widetilde{\mu}_i)\Big),$$

where $\widetilde{\mu}_i=a\mu_i$ and $\widetilde{\Sigma}_i=a^2\Sigma_i+b^2I$.

Note that the Gaussian mixture model is a universal approximator of densities (Fact 1.1), so it is natural to model the target/image data distribution $p_0$ as a $k$-mixture of Gaussians. Lemma 3.2 then tells us that if the target data distribution is a $k$-mixture of Gaussians, then by Eq. (2), the $\mathsf{pdf}$ of the whole diffusion process, i.e., $p_t$ for any $t\in[0,T]$, is a $k$-mixture of Gaussians. See details in Section 4.
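As a quick numerical sanity check of Lemma 3.2 (our own sketch, with illustrative parameters), the mean implied by the lemma, $\operatorname*{\mathbb{E}}[y]=a\sum_i\alpha_i\mu_i$ from the component means $a\mu_i$, can be compared against Monte Carlo samples of $y=ax+bz$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 200_000
alphas = np.array([0.3, 0.7])
mus = np.array([[2.0, 0.0], [-1.0, 1.0]])
vars_ = np.array([1.0, 2.0])      # isotropic components Sigma_i = vars_[i] * I
a, b = 0.6, 0.8

# Sample x from the 2-mixture, then form y = a*x + b*z as in Lemma 3.2.
comp = rng.choice(2, size=n, p=alphas)
x = mus[comp] + np.sqrt(vars_[comp])[:, None] * rng.standard_normal((n, d))
z = rng.standard_normal((n, d))
y = a * x + b * z

# Lemma 3.2 gives component means a*mu_i, hence E[y] = a * sum_i alpha_i * mu_i.
print(np.allclose(y.mean(axis=0), a * alphas @ mus, atol=0.02))
```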

Main properties. Thus, we only need to analyze the properties, namely the Lipschitz constant and second momentum, of $k$-mixtures of Gaussians to obtain clear guarantees for the whole diffusion process. In the following two lemmas, we give concrete bounds on the Lipschitz constant and second momentum of a $k$-mixture of Gaussians. Both proofs can be found in Appendix G.

Lemma 3.3 (Lipschitz, informal version of Lemma G.3).

Suppose the following conditions hold: let $\beta\leq\|x-a_t\mu_i\|_2\leq R$ for each $i\in[k]$, where $R\geq 1$ and $\beta\in(0,0.1)$; let $p_t(x)$ be defined as in Eq. (10) with $p_t(x)\geq\gamma$, where $\gamma\in(0,0.1)$; let $\sigma_{\min(p_t)}:=\min_{i\in[k]}\{\sigma_{\min}(a_t^2\Sigma_i+b_t^2I)\}$ and $\sigma_{\max(p_t)}:=\max_{i\in[k]}\{\sigma_{\max}(a_t^2\Sigma_i+b_t^2I)\}$; let ${\det}_{\min(p_t)}:=\min_{i\in[k]}\{\det(a_t^2\Sigma_i+b_t^2I)\}$.

The Lipschitz constant for the score function $\frac{\mathrm{d}\log(p_t(x))}{\mathrm{d}x}$ is given by:

$$L=\frac{1}{\sigma_{\min(p_t)}}+\frac{2R^2}{\gamma^2\sigma_{\min(p_t)}^2}\cdot\exp\Big(-\frac{\beta^2}{2\sigma_{\max(p_t)}}\Big)\cdot\Big(\frac{1}{(2\pi)^d{\det}_{\min(p_t)}}+\frac{1}{(2\pi)^{d/2}{\det}_{\min(p_t)}^{1/2}}\Big).$$

From Lemma 3.3, we can clearly see that roughly $L=O(1/\operatorname{poly}(\sigma_{\min(p_t)}))$, which means the Lipschitz constant depends only on the smallest singular value over all Gaussian components and is independent of $k$.
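The bound in Lemma 3.3 is fully explicit, so it can be evaluated directly. The sketch below (our own illustration; the function name and the example values are assumptions, chosen to satisfy the lemma's conditions) computes $L$ from $\sigma_{\min(p_t)}$, $\sigma_{\max(p_t)}$, ${\det}_{\min(p_t)}$, $R$, $\beta$, $\gamma$, and $d$:

```python
import numpy as np

def lipschitz_bound(sigma_min, sigma_max, det_min, R, beta, gamma, d):
    # L = 1/sigma_min
    #     + (2 R^2 / (gamma^2 sigma_min^2)) * exp(-beta^2 / (2 sigma_max))
    #       * (1/((2 pi)^d det_min) + 1/((2 pi)^{d/2} det_min^{1/2})),
    # exactly as stated in Lemma 3.3.
    tail = (1.0 / ((2 * np.pi) ** d * det_min)
            + 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(det_min)))
    return (1.0 / sigma_min
            + 2 * R**2 / (gamma**2 * sigma_min**2)
              * np.exp(-beta**2 / (2 * sigma_max)) * tail)

# Illustrative values with R >= 1 and beta, gamma in (0, 0.1):
print(lipschitz_bound(sigma_min=0.5, sigma_max=2.0, det_min=0.25,
                      R=3.0, beta=0.05, gamma=0.05, d=2))
```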

Remark 3.4.

We believe our result in Lemma 3.3 is non-trivial. First, the Lipschitz bound depends inverse-exponentially on the dimension $d$: if $d$ grows larger, the Lipschitz constant $L$ becomes much smaller. Second, in practice, $k$ will be very large for complicated data distributions, e.g., millions of Gaussian components for an image distribution. Many studies need to overcome the hardness of large $k$ in learning Gaussian mixture models [BDJ+22, BS23, RGKZ21, SWL24], while our Lipschitz upper bound is independent of $k$.

Lemma 3.5 (Second momentum, informal version of Lemma G.2).

Let $x_0\sim p_0$, where $p_0$ is defined by Eq. (9). Then, we have

$$m_2^2:=\operatorname*{\mathbb{E}}_{x_0\sim p_0}[\|x_0\|_2^2]=\sum_{i=1}^k\alpha_i(\|\mu_i\|_2^2+\operatorname{tr}[\Sigma_i])\leq\max_{i\in[k]}\{\|\mu_i\|_2^2+\operatorname{tr}[\Sigma_i]\}.$$

The proof idea for Lemma 3.3 and Lemma 3.5 is to first consider the case $k=1$ in Appendix D, then extend to the $2$-mixture of Gaussians setting in Appendix E, and finally generalize to the $k$-mixture setting in Appendix F, with a summary in Appendix G.

From Lemma 3.5, we can see that roughly $m_2^2=O(1)$, which is independent of $k$ as well. Later, in Section 5, we apply Lemma 3.3 and Lemma 3.5 to obtain concrete bounds for each diffusion solver, e.g., $\mathsf{DDPM}$, $\mathsf{DPOM}$, and $\mathsf{DPUM}$.
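Lemma 3.5 is likewise directly computable. The sketch below (our own, with illustrative parameters) evaluates $m_2^2=\sum_i\alpha_i(\|\mu_i\|_2^2+\operatorname{tr}[\Sigma_i])$ and checks it against the $\max_i$ upper bound:

```python
import numpy as np

def second_momentum_sq(alphas, mus, Sigmas):
    # m_2^2 = sum_i alpha_i * (||mu_i||_2^2 + tr(Sigma_i)), as in Lemma 3.5.
    return sum(a * (mu @ mu + np.trace(S))
               for a, mu, S in zip(alphas, mus, Sigmas))

alphas = [0.3, 0.7]
mus = [np.array([2.0, 0.0]), np.array([-1.0, 1.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

m2_sq = second_momentum_sq(alphas, mus, Sigmas)
upper = max(mu @ mu + np.trace(S) for mu, S in zip(mus, Sigmas))
print(m2_sq, upper, m2_sq <= upper)  # a convex combination never exceeds the max
```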

4 Score Based Model and Diffusion Model

In Section 4.1, we first briefly introduce $\mathsf{DDPM}$ [HJA20], a stochastic differential equation (SDE) version of the reverse diffusion process, and score-based generative models (SGMs) [SSDK+20], a generalization of $\mathsf{DDPM}$. Then, in Section 4.2, we introduce multiple solvers for the reverse process.

4.1 Background on score based model and diffusion model

First, we denote the input data by $x\in\mathbb{R}^d$ and the target original data distribution by $p_0(x)$. Letting the noisy latent state of score-based generative models (SGMs) at time $t$ be a stochastic process $x_t$ for $0\leq t\leq T$, the forward SDE is defined by:

$$\mathrm{d}x_t=f(x_t,t)\,\mathrm{d}t+g(t)\,\mathrm{d}w_t,\qquad x_0\sim p_0,\qquad 0\leq t\leq T, \tag{1}$$

where $w_t\in\mathbb{R}^d$ is standard Brownian motion, $f(\cdot,t):\mathbb{R}^d\rightarrow\mathbb{R}^d$ is called the drift coefficient, and $g(t):\mathbb{R}\rightarrow\mathbb{R}$ is called the diffusion coefficient. We use $p_t(x)$ to denote the marginal probability density function of $x_t$. The $\mathsf{pdf}$ at the last time step, $p_T$, is the prior distribution, often defined as the standard Gaussian $p_T(x)=\mathcal{N}(0,I)$.

The continuous forward SDE in Eq. (1) also has the discrete Markov chain form

$$x_t=a_tx_0+b_tz,\qquad x_0\sim p_0, \tag{2}$$

where $x_0\sim p_0(x)$, $z\sim\mathcal{N}(0,I)$, and $a_t,b_t\in\mathbb{R}$ are functions of time. Additionally, we assume that as $t\rightarrow T$, $a_t\rightarrow 0$ and $b_t\rightarrow 1$, so that $x_T\sim\mathcal{N}(0,I)$. Also, as $t\to 0$, $a_t\to 1$ and $b_t\to 0$, which is the boundary condition. More specifically, Eq. (2) can be viewed as the iterative equation in $\mathsf{DDPM}$ [HJA20].

From [And82], we know that $x_t$ also satisfies the reverse SDE:

$$\mathrm{d}x_t=\big(f(x_t,t)-g(t)^2\nabla\log p_t(x_t)\big)\,\mathrm{d}t+g(t)\,\mathrm{d}\widetilde{w}_t, \tag{3}$$

where $\widetilde{w}_t\in\mathbb{R}^d$ is backward Brownian motion [And82] and $\nabla\log p_t(x_t)$ is the score function. For convenience, we rewrite the reverse SDE in Eq. (3) in a forward version by switching the time direction, replacing $t$ with $T-t$. Let $\widetilde{x}_t:=x_{T-t}$. The law of $(\widetilde{x}_t)_{0\leq t\leq T}$ is identical to the law of $(x_{T-t})_{0\leq t\leq T}$. We use $q_t$ to denote the density of $\widetilde{x}_t$:

$$\mathrm{d}\widetilde{x}_t=\big(-f(\widetilde{x}_t,T-t)+g(T-t)^2\nabla\log p_{T-t}(\widetilde{x}_t)\big)\,\mathrm{d}t+g(T-t)\,\mathrm{d}w_t. \tag{4}$$

The process $(\widetilde{x}_t)_{0\leq t\leq T}$ converts noise into samples from $p_0$, thereby achieving the objective of generative modeling.

Since we cannot obtain the score function $\nabla\log p_t$ directly, we use a neural network to approximate it, and we denote the estimated score function by $s_t(x)$. Replacing the score function $\nabla\log p_t$ in Eq. (4) with our approximate score function $s_t(x)$, we can rewrite Eq. (4) as:

$$\mathrm{d}y_t=\big(-f(y_t,T-t)+g(T-t)^2s_{T-t}(y_t)\big)\,\mathrm{d}t+g(T-t)\,\mathrm{d}w_t,\qquad y_0\sim p_T,\qquad 0\leq t\leq T, \tag{5}$$

where $y_t$ is the process approximated by our SGM $s_t$.

For clarity, we mainly focus on the Ornstein-Uhlenbeck (OU) process, a diffusion process with exponential coefficients:

Definition 4.1.

The forward SDE of the OU process ($f(x_t,t)=-x_t$ and $g(t)\equiv\sqrt{2}$) is $\mathrm{d}x_t=-x_t\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}w_t$. The corresponding discrete Markov chain form of the OU process is $x_t=e^{-t}x_0+\sqrt{1-e^{-2t}}\,z$, so that $a_t=e^{-t}$ and $b_t=\sqrt{1-e^{-2t}}$ in Eq. (2).
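A minimal simulation of the discrete OU forward map (our own sketch; the paper only states the formula, and the example distribution is an assumption) applies $x_t=e^{-t}x_0+\sqrt{1-e^{-2t}}\,z$ to samples $x_0\sim p_0$:

```python
import numpy as np

def ou_forward(x0, t, rng):
    # Discrete Markov chain form of the OU process (Definition 4.1):
    # a_t = e^{-t}, b_t = sqrt(1 - e^{-2t}).
    a_t = np.exp(-t)
    b_t = np.sqrt(1.0 - np.exp(-2.0 * t))
    return a_t * x0 + b_t * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
x0 = rng.multivariate_normal([2.0, 0.0], np.eye(2), size=100_000)
xT = ou_forward(x0, t=3.0, rng=rng)   # at T = 3, x_T is already close to N(0, I)
print(xT.mean(axis=0), np.cov(xT.T))
```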

From Eq. (5) under the OU process, we have:

$$\mathrm{d}y_t=(y_t+2s_{T-t}(y_t))\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}w_t. \tag{6}$$

4.2 Definitions of different solvers

In practical applications, it is necessary to adopt a discrete-time approximation of the sampling dynamics. We have the following definition.

Definition 4.2 (Time discretization).

Define the $N$ discretization points as $\delta=t_0\leq t_1\leq\cdots\leq t_N=T$, where $\delta\geq 0$ is the early stopping parameter, with $\delta=0$ representing the standard setting. For each discretization step $k$, where $1\leq k\leq N$, the step size is denoted by $h_k:=t_k-t_{k-1}$.

Let $\widehat{y}_t$ be the discrete approximation of $y_t$ in Eq. (6) starting from $\widehat{y}_0\sim\mathcal{N}(0,I)$. We use $\widehat{q}_t$ to denote the density of $\widehat{y}_t$. Let $N$ be defined in Definition 4.2 and $t'_k=T-t_{N-k}$. To solve Eq. (6), we have the following two numerical solvers:

Definition 4.3.

We define $\mathsf{EulerMaruyama}$ as the numerical solver satisfying

$$\widehat{y}_{T-\delta}=\mathsf{EulerMaruyama}(T,s),$$

where $\widehat{y}_{T-\delta}\in\mathbb{R}^d$ is the output, $T\in\mathbb{R}^+$ is the total time, and $s$ is the score estimate. The Euler-Maruyama scheme is (Equation 5 in [CLL23]), for $t\in[t'_k,t'_{k+1}]$,

$$\mathrm{d}\widehat{y}_t=(\widehat{y}_{t'_k}+2s_{T-t'_k}(\widehat{y}_{t'_k}))\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}w_t. \tag{7}$$
Definition 4.4.

We define $\mathsf{ExponentialIntegrator}$ as the numerical solver satisfying

$$\widehat{y}_{T-\delta}=\mathsf{ExponentialIntegrator}(T,s),$$

where $\widehat{y}_{T-\delta}\in\mathbb{R}^d$ is the output, $T\in\mathbb{R}^+$ is the total time, and $s$ is the score estimate. The Exponential Integrator scheme is (Equation 6 in [CLL23]), for $t\in[t'_k,t'_{k+1}]$,

$$\mathrm{d}\widehat{y}_t=(\widehat{y}_t+2s_{T-t'_k}(\widehat{y}_{t'_k}))\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}w_t. \tag{8}$$

The two methods above are used for solving the SDE. The difference is that the first term on the right-hand side of the Euler-Maruyama scheme uses $\widehat{y}_{t'_k}$, while the Exponential Integrator uses $\widehat{y}_t$. The Exponential Integrator scheme has a closed-form solution (see details in Section 1.1 of [CLL23]).
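To make the contrast concrete, here is a sketch (our own; the exponential-integrator update follows from solving the linear SDE in Eq. (8) exactly over one step, in the spirit of Section 1.1 of [CLL23]) of a single step of size $h$ for each scheme, with the score value `score_val` $=s_{T-t'_k}(\widehat{y}_{t'_k})$ frozen at the left endpoint:

```python
import numpy as np

def em_step(y, score_val, h, rng):
    # Euler-Maruyama (Eq. (7)): freeze the entire drift y + 2*s at the left endpoint.
    return y + h * (y + 2.0 * score_val) + np.sqrt(2.0 * h) * rng.standard_normal(y.shape)

def ei_step(y, score_val, h, rng):
    # Exponential Integrator (Eq. (8)): integrate the linear part exactly (factor e^h),
    # freezing only the score term; the noise over the step has variance e^{2h} - 1.
    c = 2.0 * score_val
    noise = np.sqrt(np.exp(2.0 * h) - 1.0) * rng.standard_normal(y.shape)
    return np.exp(h) * y + (np.exp(h) - 1.0) * c + noise
```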

Now, we introduce two ordinary differential equation (ODE) solvers, $\mathsf{DPOM}$ and $\mathsf{DPUM}$, which ignore the Brownian motion term in Eq. (6). We provide their concrete step bounds in Section 5.

Definition 4.5.

We define $\mathsf{DPOM}$ (Diffusion Predictor with Overdamped Modeling) as Algorithm 1 in [CCL+23], satisfying $\widehat{y}_{T-\delta}=\mathsf{DPOM}(T,h_{\mathrm{pred}},h_{\mathrm{corr}},s)$, where $\delta$ is the early stopping parameter, $\widehat{y}_{T-\delta}\in\mathbb{R}^d$ is the output, $T\in\mathbb{R}^+$ is the total time, $h_{\mathrm{pred}}$ is the predictor step size, $h_{\mathrm{corr}}$ is the corrector step size (see the detailed definition in Algorithm 1 of [CCL+23]), and $s$ is the score estimate.

Definition 4.6.

We define $\mathsf{DPUM}$ (Diffusion Predictor with Underdamped Modeling) as Algorithm 2 in [CCL+23], satisfying $\widehat{y}_{T-\delta}=\mathsf{DPUM}(T,h_{\mathrm{pred}},h_{\mathrm{corr}},s)$, where $\delta$ is the early stopping parameter, $\widehat{y}_{T-\delta}\in\mathbb{R}^d$ is the output, $T\in\mathbb{R}^+$ is the total time, $h_{\mathrm{pred}}$ is the predictor step size, $h_{\mathrm{corr}}$ is the corrector step size (see the detailed definition in Algorithm 2 of [CCL+23]), and $s$ is the score estimate.

5 Main Result for Application

In this section, we provide the main results for our applications. In Section 5.1, we provide the key definitions and assumptions used. In Section 5.2, we provide our results for the total variation bound. In Section 5.3, we provide our results for the KL divergence bound. In Section 5.4, we provide our results for the probability flow ODE methods, including $\mathsf{DPOM}$ and $\mathsf{DPUM}$.

5.1 Key definitions and assumptions

We first assume that the loss of the learned score estimator is upper bounded by $\epsilon_0^2$ in Assumption 5.1. We then show that we can recover the target/image data distribution up to a small total variation or KL divergence gap.

Assumption 5.1 (Score estimation error, Assumption 1 in [CLL23], page 6, and Assumption 3 in [CCL+22], page 6).

The learned score function $s_t(x)$ satisfies

$$\frac{1}{T}\sum_{k=1}^Nh_k\operatorname*{\mathbb{E}}_{p_{t_k}}[\|\nabla\log p_{t_k}(x)-s_{t_k}(x)\|_2^2]\leq\epsilon_0^2,$$

where $h_k$ is the step size for step $k$ defined in Definition 4.2, $N$ is the total number of steps, and $\sum_{k=1}^Nh_k=T$.
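When the true score is available (as it is in closed form for a Gaussian mixture), the left-hand side of Assumption 5.1 can be estimated by Monte Carlo. The sketch below is our own illustration; `true_score` and `est_score` are assumed callables taking a step index and a batch of samples:

```python
import numpy as np

def score_error_sq(hs, sample_batches, true_score, est_score, T):
    # Monte Carlo version of (1/T) * sum_k h_k * E_{p_{t_k}}||grad log p_{t_k} - s_{t_k}||_2^2
    # from Assumption 5.1; sample_batches[k] holds samples x ~ p_{t_k}.
    total = 0.0
    for k, (h, xs) in enumerate(zip(hs, sample_batches)):
        diff = true_score(k, xs) - est_score(k, xs)   # shape (n_samples, d)
        total += h * np.mean(np.sum(diff**2, axis=-1))
    return total / T
```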

To avoid an ill-behaved distribution $q_T$, we adopt the following definition, following [CLL23, CCL+23].

Definition 5.2.

We define $\epsilon$ as the total variation distance between $q_{T-\delta}$ and $q_T$.

Let the data distribution $p_0(x)$ be a $k$-mixture of Gaussians:

$$p_0(x):=\sum_{i=1}^k\alpha_i\mathcal{N}(\mu_i,\Sigma_i). \tag{9}$$

Using Lemma 3.2, we know that the $\mathsf{pdf}$ of $x_t$ at any time $t$ is given by:

$$p_t(x)=\sum_{i=1}^k\alpha_i\mathcal{N}(a_t\mu_i,a_t^2\Sigma_i+b_t^2I). \tag{10}$$

Then, we define the following conditions on the $k$-mixture of Gaussians.

Condition 5.3.

All conditions here are related to the initial density $p_0$ in Eq. (9).

  • Let $\sigma_{\min(p_0)}=\min_{i\in[k]}\{\sigma_{\min}(\Sigma_i)\}$ and $\sigma_{\max(p_0)}=\max_{i\in[k]}\{\sigma_{\max}(\Sigma_i)\}$.

  • Let $\mu_{\max(p_0)}=\max_{i\in[k]}\{\|\mu_i\|_2^2\}$ and ${\det}_{\min(p_0)}=\min_{i\in[k]}\{\det(\Sigma_i)\}$.

In Condition 5.3, $\sigma_{\min(p_0)},\sigma_{\max(p_0)}$ denote the minimum/maximum singular values over the covariance matrices of the $k$ components of the original data distribution $p_0$; $\mu_{\max(p_0)}$ is the maximum squared $\ell_2$ norm over the mean vectors of the $k$ components of $p_0$; and ${\det}_{\min(p_0)}$ is the minimum determinant over the covariance matrices of the $k$ components of $p_0$.

Condition 5.4.

All conditions here are related to the density $p_t$ in Eq. (10) at all times $t\in[0,T]$. Let $x\in\mathbb{R}^d$, and let $a_t,b_t$ be defined by Definition 4.1. Assume Assumption 5.1.

  • Let $\beta\leq\|x-a_t\mu_i\|_2\leq R$, where $R\geq 1$ and $\beta\in(0,0.1)$, for each $i\in[k]$.

  • Let $p_t(x)$ be defined as in Eq. (10) with $p_t(x)\geq\gamma$, where $\gamma\in(0,0.1)$.

  • Let $\sigma_{\min(p_t)}:=\min_{i\in[k]}\{\sigma_{\min}(a_t^2\Sigma_i+b_t^2I)\}$ and $\sigma_{\max(p_t)}:=\max_{i\in[k]}\{\sigma_{\max}(a_t^2\Sigma_i+b_t^2I)\}$.

  • Let ${\det}_{\min(p_t)}:=\min_{i\in[k]}\{\det(a_t^2\Sigma_i+b_t^2I)\}$.

In Condition 5.4, $\sigma_{\min(p_t)},\sigma_{\max(p_t)}$ denote the minimum/maximum singular values over the covariance matrices of the $k$ components of $p_t$; $\beta,R$ are the lower/upper bounds on the $\ell_2$ distance between the input data $x$ and all $k$ scaled mean vectors $a_t\mu_i$ at timestep $t$; $\gamma$ is the lower bound on $p_t(x)$, the $\mathsf{pdf}$ at time $t$; and ${\det}_{\min(p_t)}$ is the minimum determinant at $p_t$.

Clearly, we have $\sigma_{\max(p_t)}\geq\sigma_{\max(p_0)}$, so Condition 5.4 is a stronger condition.
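The quantities in Conditions 5.3 and 5.4 are straightforward to compute from the component parameters. The following sketch (our own; function and variable names are assumptions) evaluates them at time $t$ under the OU schedule of Definition 4.1, with $t=0$ recovering the Condition 5.3 quantities:

```python
import numpy as np

def mixture_conditions(mus, Sigmas, t=0.0):
    # Condition 5.4 quantities for p_t under the OU schedule a_t = e^{-t},
    # b_t = sqrt(1 - e^{-2t}); t = 0 gives the Condition 5.3 quantities for p_0.
    a2, b2 = np.exp(-2.0 * t), 1.0 - np.exp(-2.0 * t)
    covs = [a2 * S + b2 * np.eye(S.shape[0]) for S in Sigmas]
    svals = [np.linalg.svd(C, compute_uv=False) for C in covs]
    return {
        "sigma_min": min(s.min() for s in svals),
        "sigma_max": max(s.max() for s in svals),
        "det_min":   min(float(np.linalg.det(C)) for C in covs),
        "mu_max":    max(float(mu @ mu) for mu in mus),  # Condition 5.3 quantity (at p_0)
    }
```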

Fact 5.5 (Pinsker’s inequality [CK11]).

Let $p,q$ be two probability distributions. Then we have $\mathrm{TV}(p,q)\leq\sqrt{\frac{1}{2}\mathrm{KL}(p\|q)}$.

From Fact 5.5, we can see that total variation is a weaker notion of distance than KL divergence.
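As a concrete instance (a standard computation, not taken from the paper): for two unit-variance Gaussians, $\mathrm{KL}(\mathcal{N}(0,1)\,\|\,\mathcal{N}(\mu,1))=\mu^2/2$, so Pinsker's inequality gives $\mathrm{TV}(\mathcal{N}(0,1),\mathcal{N}(\mu,1))\leq|\mu|/2$.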

5.2 Total variation

Now, we are ready to present our result for the total variation bound, assuming the data distribution is a $k$-mixture of Gaussians ($p_0$ in Eq. (9)). Recall that $\epsilon_0$ is defined in Assumption 5.1, $\epsilon$ is defined in Definition 5.2, and $h_k$ is defined in Definition 4.2. See the proof in Appendix H.

Theorem 5.6 ($\mathsf{DDPM}$, total variation, informal version of Theorem H.1).

Assume Conditions 5.3 and 5.4 hold. The step size $h_k:=T/N$ satisfies $h_k=O(1/L)$ with $L\geq 1$ for $k\in[N]$. Let $\widehat{q}$ denote the density of the output of $\mathsf{EulerMaruyama}$ defined in Definition 4.3. Then, we have

$$\mathrm{TV}(\widehat{q},p_0)\lesssim\underbrace{\sqrt{\mathrm{KL}(p_0\|\mathcal{N}(0,I))}\exp(-T)}_{\mathrm{convergence~of~forward~process}}+\underbrace{(L\sqrt{dh}+Lm_2h)\sqrt{T}}_{\mathrm{discretization~error}}+\underbrace{\epsilon_0\sqrt{T}}_{\mathrm{score~estimation~error}},$$

where $L=\frac{1}{\sigma_{\min(p_t)}}+\frac{2R^2}{\gamma^2\sigma_{\min(p_t)}^2}\cdot(\frac{1}{(2\pi)^d{\det}_{\min(p_t)}}+\frac{1}{(2\pi)^{d/2}{\det}_{\min(p_t)}^{1/2}})\cdot\exp(-\frac{\beta^2}{2\sigma_{\max(p_t)}})$, $m_2=(\sum_{i=1}^k\alpha_i(\|\mu_i\|_2^2+\operatorname{tr}[\Sigma_i]))^{1/2}$, and $\mathrm{KL}(p_0(x)\|\mathcal{N}(0,I))\leq\frac{1}{2}(-\log({\det}_{\min(p_0)})+d\sigma_{\max(p_0)}+\mu_{\max(p_0)}-d)$.

In Theorem 5.6, suppose that $m_2\leq d$ and choose $T=\Theta(\log(\mathrm{KL}(p_0\|\mathcal{N}(0,I))/\epsilon))$ and $h_k=\Theta(\frac{\epsilon^2}{L^2d})$; then we have $\mathrm{TV}(\widehat{q},p_0)\leq\widetilde{O}(\epsilon+\epsilon_0)$ for $N=\widetilde{O}(L^2d/\epsilon^2)$. In particular, to have $\mathrm{TV}(\widehat{q},p_0)\leq\epsilon$, it suffices to have score error $\epsilon_0\leq\widetilde{O}(\epsilon)$, where $\widetilde{O}(\cdot)$ hides $T$, which is a $\log$ term. Thus, Theorem 5.6 provides a guarantee on the total variation between the target data distribution $p_0$ and the learned output distribution $\widehat{q}$ with concrete $L$ and $m_2$.
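The step count in Theorem 5.6 is easy to instantiate numerically once $L$, $d$, and $\epsilon$ are fixed. The sketch below is our own illustration; absolute constants and the hidden $\log$ factors are dropped, so only the $N=\widetilde{O}(L^2d/\epsilon^2)$ scaling is shown:

```python
def ddpm_steps(L, d, eps):
    # N = O~(L^2 * d / eps^2) from Theorem 5.6, ignoring constants and log factors.
    return int(L**2 * d / eps**2)

# E.g., L = 10, d = 64, eps = 0.1 gives on the order of 6.4e5 discretization points.
print(ddpm_steps(10.0, 64, 0.1))
```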

5.3 KL divergence

Similarly, we present our results for the KL divergence bound in the following two theorems, assuming the data distribution is a $k$-mixture of Gaussians. Recall that $\epsilon_0$ is defined in Assumption 5.1, $\epsilon$ is defined in Definition 5.2, and $h_k$ is defined in Definition 4.2. See the proofs in Appendix H.

Theorem 5.7 ($\mathsf{DDPM}$, KL divergence, informal version of Theorem H.2).

Assume Condition 5.4. We use uniform discretization points.

(1) Let $\widehat{q}$ denote the density of the output of $\mathsf{ExponentialIntegrator}$ (Definition 4.4); we have

$$\mathrm{KL}(p_0\|\widehat{q})\lesssim(M_2+d)\exp(-T)+T\epsilon_0^2+\frac{dT^2L^2}{N}.$$

In particular, choosing $T=\Theta(\log(M_2d/\epsilon_0))$ and $N=\Theta(dT^2L^2/\epsilon_0^2)$, then $\mathrm{KL}(p_0\|\widehat{q})=\widetilde{O}(\epsilon_0^2)$.

(2) Let $\widehat{q}$ denote the density of the output of $\mathsf{EulerMaruyama}$ (Definition 4.3); we have

$$\mathrm{KL}(p_0\|\widehat{q})\lesssim(M_2+d)\exp(-T)+T\epsilon_0^2+\frac{dT^2L^2}{N}+\frac{T^3M_2}{N^2},$$

where $M_2=\sum_{i=1}^k\alpha_i(\|\mu_i\|_2^2+\operatorname{tr}[\Sigma_i])$ and $L=\frac{1}{\sigma_{\min(p_t)}}+\frac{2R^2}{\gamma^2\sigma_{\min(p_t)}^2}\cdot(\frac{1}{(2\pi)^d{\det}_{\min(p_t)}}+\frac{1}{(2\pi)^{d/2}{\det}_{\min(p_t)}^{1/2}})\cdot\exp(-\frac{\beta^2}{2\sigma_{\max(p_t)}})$.

Theorem 5.8 ($\mathsf{DDPM}$, KL divergence for smooth data distributions, informal version of Theorem H.3).

Assume Conditions 5.3 and 5.4. We use the exponentially decreasing (then constant) step size $h_k=c\min\{\max\{t_k,1/L\},1\}$, with $c=\frac{T+\log L}{N}\leq\frac{1}{Kd}$. Let $\widehat{q}$ denote the density of the output of $\mathsf{ExponentialIntegrator}$ defined in Definition 4.4. Then, we have

$$\mathrm{KL}(p_0\|\widehat{q})\lesssim(M_2+d)\exp(-T)+T\epsilon_0^2+\frac{d^2(T+\log L)^2}{N},$$

where $M_2=\sum_{i=1}^k\alpha_i(\|\mu_i\|_2^2+\operatorname{tr}[\Sigma_i])$ and $L=\frac{1}{\sigma_{\min(p_t)}}+\frac{2R^2}{\gamma^2\sigma_{\min(p_t)}^2}\cdot(\frac{1}{(2\pi)^d{\det}_{\min(p_t)}}+\frac{1}{(2\pi)^{d/2}{\det}_{\min(p_t)}^{1/2}})\cdot\exp(-\frac{\beta^2}{2\sigma_{\max(p_t)}})$. In particular, if we set $T=\Theta(\log(M_2d/\epsilon_0))$ and $N=\Theta(d^2(T+\log L)^2/\epsilon_0^2)$, then $\mathrm{KL}(p_0\|\widehat{q})\leq\widetilde{O}(\epsilon_0^2)$. In addition, for the Euler-Maruyama scheme defined in Definition 4.3, the same bounds hold with an additional $M_2\sum_{k=1}^Nh_k^3$ term.

Theorem 5.7 and Theorem 5.8 provide guarantees on the KL divergence between the target data distribution $p_0$ and the learned output distribution $\widehat{q}$ with concrete $L$ and $M_2$. Theorem 5.8 has $\log^2L$ instead of $L^2$ in the bound on the total number of discretization points $N$, but includes an additional factor of $d$ compared to Theorem 5.7. On the other hand, Theorem 5.8 requires the data distribution $p_0$ to be Lipschitz and second-order differentiable [CLL23], allowing a better bound in terms of $L$, while all other theorems [CCL+22, CCL+23] require Lipschitz conditions on $p_t$ for all $0\leq t\leq T$. However, our assumption that $p_0$ is a $k$-mixture of Gaussians satisfies all the conditions they use.

From Fact 5.5, we can compare the KL divergence results with the total variation results. Notice that $M_2=m_2^2$ when comparing the theorems for total variation and KL divergence. According to Fact 5.5, the square of the TV distance is bounded by the KL divergence, which explains the squared relationship of the second momentum term.

5.4 Probability flow ODE

Notice that the previous results consider SDE-based models. However, from [SSDK+20], we know that we can also use an ODE to run the reverse process, which has the same marginal distributions as the SDE reverse process but is deterministic. In this section, we provide results for the $\mathsf{DPOM}$ and $\mathsf{DPUM}$ algorithms (Algorithms 1 and 2 in [CCL+23]), which are based on the ODE reverse process.

Theorem 5.9 ($\mathsf{DPOM}$, informal version of Theorem H.4).

Assume Condition 5.4. We use the $\mathsf{DPOM}$ algorithm defined in Definition 4.5, and let $\widehat{q}$ be its output density. Then, we have

$$\mathrm{TV}(\widehat{q},p_0)\lesssim(\sqrt{d}+m_2)\exp(-T)+L^2Td^{1/2}h_{\mathrm{pred}}+L^{3/2}Td^{1/2}h_{\mathrm{corr}}^{1/2}+L^{1/2}T\epsilon_0+\epsilon,$$

where $m_2=(\sum_{i=1}^k\alpha_i(\|\mu_i\|_2^2+\operatorname{tr}[\Sigma_i]))^{1/2}$ and $L=\frac{1}{\sigma_{\min(p_t)}}+\frac{2R^2}{\gamma^2\sigma_{\min(p_t)}^2}\cdot(\frac{1}{(2\pi)^d{\det}_{\min(p_t)}}+\frac{1}{(2\pi)^{d/2}{\det}_{\min(p_t)}^{1/2}})\cdot\exp(-\frac{\beta^2}{2\sigma_{\max(p_t)}})$. In particular, if we set $T=\Theta(\log(dm_2/\epsilon))$, $h_{\mathrm{pred}}=\widetilde{\Theta}(\frac{\epsilon}{L^2d^{1/2}})$, $h_{\mathrm{corr}}=\widetilde{\Theta}(\frac{\epsilon}{L^3d})$, and if the score estimation error satisfies $\epsilon_0\leq\widetilde{O}(\frac{\epsilon}{\sqrt{L}})$, then we can obtain TV error $\epsilon$ with a total iteration complexity of $\widetilde{\Theta}(L^3d/\epsilon^2)$ steps.

Theorem 5.10 ($\mathsf{DPUM}$, informal version of Theorem H.5).

Assume Condition 5.4. We use the $\mathsf{DPUM}$ algorithm defined in Definition 4.6, and let $\widehat{q}$ be its output density. Then, we have

$$\mathrm{TV}(\widehat{q},p_0)\lesssim(\sqrt{d}+m_2)\exp(-T)+L^2Td^{1/2}h_{\mathrm{pred}}+L^{3/2}Td^{1/2}h_{\mathrm{corr}}^{1/2}+L^{1/2}T\epsilon_0+\epsilon,$$

where $L=\frac{1}{\sigma_{\min(p_t)}}+\frac{2R^2}{\gamma^2\sigma_{\min(p_t)}^2}\cdot(\frac{1}{(2\pi)^d{\det}_{\min(p_t)}}+\frac{1}{(2\pi)^{d/2}{\det}_{\min(p_t)}^{1/2}})\cdot\exp(-\frac{\beta^2}{2\sigma_{\max(p_t)}})$ and $m_2=(\sum_{i=1}^k\alpha_i(\|\mu_i\|_2^2+\operatorname{tr}[\Sigma_i]))^{1/2}$. In particular, if we set $T=\Theta(\log(dm_2/\epsilon))$, $h_{\mathrm{pred}}=\widetilde{\Theta}(\frac{\epsilon}{L^2d^{1/2}})$, $h_{\mathrm{corr}}=\widetilde{\Theta}(\frac{\epsilon}{L^{3/2}d^{1/2}})$, and if the score estimation error satisfies $\epsilon_0\leq\widetilde{O}(\frac{\epsilon}{\sqrt{L}})$, then we can obtain TV error $\epsilon$ with a total iteration complexity of $\widetilde{\Theta}(L^2d^{1/2}/\epsilon)$ steps.

Theorem 5.9 and Theorem 5.10 provide guarantees on the total variation between the target data distribution $p_0$ and the learned output distribution $\widehat{q}$ with concrete $L$ and $m_2$ for the ODE reverse process. The difference between $\mathsf{DPOM}$ (Theorem 5.9) and $\mathsf{DPUM}$ (Theorem 5.10) lies in the choice of $h_{\mathrm{corr}}$ and the final iteration complexity. Using the $\mathsf{DPUM}$ algorithm, we can reduce the total iteration complexity from $\widetilde{\Theta}(L^3d/\epsilon^2)$ to $\widetilde{\Theta}(L^2\sqrt{d}/\epsilon)$.

6 Tools From Previous Work

In this section, we present several tools that we use from previous work.

Assumption 6.1 (Lipschitz score, Assumption 1 in [CCL+22], page 6).

For all $t\geq 0$, the score $\nabla\log p_t$ is $L$-Lipschitz.

Assumption 6.2 (Second momentum bound, Assumption 2 in [CCL+22], page 6 and Assumption 2 in [CLL23], page 6).

We assume that $m_2^2:=M_2:=\operatorname*{\mathbb{E}}_{p_0}[\|x\|_2^2]<\infty$.

Assumption 6.3 (Smooth data distributions, Assumption 4 in [CLL23], page 10).

The data distribution admits a density $p_0\in C^2(\mathbb{R}^d)$, and $\nabla\log p_0$ is $L$-Lipschitz, where $C^2$ means second-order differentiable.

Remark 6.4.

We note that $M_2=m_2^2$. The theorems from [CLL23] use $M_2$ for KL divergence. The theorems from [CCL+22] and [CCL+23] use $m_2=\sqrt{m_2^2}=\sqrt{M_2}$ for total variation, because of Pinsker's inequality (Fact 5.5).

We state a tool from previous work [CCL+22].

Theorem 6.5 (DDPM, Theorem 2 in [CCL+22], page 7).

Suppose that Assumptions 6.1, 6.2, and 5.1 hold. Let $\widehat{q}_T$ be the output of the DDPM algorithm at time $T$, and suppose that the step size $h:=T/N$ satisfies $h\lesssim 1/L$, where $L\geq 1$. Then, it holds that

$$\mathrm{TV}(\widehat{q}_T,p_0)\lesssim\underbrace{\sqrt{\mathrm{KL}(p_0\|\mathcal{N}(0,I))}\exp(-T)}_{\mathrm{convergence~of~forward~process}}+\underbrace{(L\sqrt{dh}+Lm_2h)\sqrt{T}}_{\mathrm{discretization~error}}+\underbrace{\epsilon_0\sqrt{T}}_{\mathrm{score~estimation~error}}.$$

We state a tool from previous work [CLL23].

Theorem 6.6 (Theorem 1 in [CLL23]).

Suppose that Assumptions 6.1, 6.2, and 5.1 hold. If $L\geq 1$, $h_k\leq 1$ for $k=1,\dots,N$, and $T\geq 1$, then using uniform discretization points yields the following:

  • Using the Exponential Integrator scheme (8), we have

    $$\mathrm{KL}(p_0\|\widehat{q}_T)\lesssim(M_2+d)\exp(-T)+T\epsilon_0^2+\frac{dT^2L^2}{N}.$$

    In particular, choosing $T=\Theta(\log(dM_2/\epsilon_0^2))$ and $N=\Theta(dT^2L^2/\epsilon_0^2)$ makes this $\widetilde{O}(\epsilon_0^2)$.

  • Using the Euler-Maruyama scheme (7), we have

    $$\mathrm{KL}(p_0\|\widehat{q}_T)\lesssim(M_2+d)\exp(-T)+T\epsilon_0^2+\frac{dT^2L^2}{N}+\frac{T^3M_2}{N^2}.$$
Theorem 6.7 (Theorem 5 in [CLL23], page 10).

There is a universal constant $K$ such that the following holds. Suppose that Assumptions 6.2, 5.1, and 6.3 hold. Using the exponentially decreasing (then constant) step size $h_k=c\min\{\max\{t_k,\frac{1}{L}\},1\}$, $c=\frac{T+\log L}{N}\leq\frac{1}{Kd}$, the sampling dynamic (8) results in a distribution $\widehat{q}_T$ such that

$$\mathrm{KL}(p_0\|\widehat{q}_T)\lesssim(M_2+d)\exp(-T)+T\epsilon_0^2+\frac{d^2(T+\log L)^2}{N}.$$

Choosing $T=\Theta(\log(dM_2/\epsilon_0^2))$ and $N=\Theta(d^2(T+\log L)^2/\epsilon_0^2)$ makes this $\widetilde{O}(\epsilon_0^2)$. In addition, for the Euler-Maruyama scheme (7), the same bounds hold with an additional $M_2\sum_{k=1}^Nh_k^3$ term.

We state a tool from previous work [CCL+23].

Theorem 6.8 (DPOM, Theorem 2 in [CCL+23], page 6).

Suppose that Assumptions 6.1, 6.2, and 5.1 hold. Let $\widehat{q}_T$ denote the output of DPOM (see Algorithm 1 in [CCL+23]) with early stopping. Then, it holds that

$$\mathrm{TV}(\widehat{q}_T,p_0)\lesssim(\sqrt{d}+m_2)\exp(-T)+L^2Td^{1/2}h_{\mathrm{pred}}+L^{3/2}Td^{1/2}h_{\mathrm{corr}}^{1/2}+L^{1/2}T\epsilon_0+\epsilon.$$

In particular, if we set $T=\Theta(\log(dm_2^2/\epsilon^2))$, $h_{\mathrm{pred}}=\widetilde{\Theta}(\frac{\epsilon}{L^2d^{1/2}})$, $h_{\mathrm{corr}}=\widetilde{\Theta}(\frac{\epsilon}{L^3d})$, and if the score estimation error satisfies $\epsilon_0\leq\widetilde{O}(\frac{\epsilon}{\sqrt{L}})$, then we can obtain TV error $\epsilon$ with a total iteration complexity of $\widetilde{\Theta}(L^3d/\epsilon^2)$ steps.

Theorem 6.9 (DPUM, Theorem 3 in [CCL+23], page 7).

Suppose that Assumptions 6.1, 6.2, and 5.1 hold. Let $\widehat{q}_T$ denote the output of DPUM (see Algorithm 2 in [CCL+23]) with early stopping. Then, it holds that

$$\mathrm{TV}(\widehat{q}_T,p_0)\lesssim(\sqrt{d}+m_2)\exp(-T)+L^2Td^{1/2}h_{\mathrm{pred}}+L^{3/2}Td^{1/2}h_{\mathrm{corr}}^{1/2}+L^{1/2}T\epsilon_0+\epsilon.$$

In particular, if we set $T=\Theta(\log(dm_2^2/\epsilon^2))$, $h_{\mathrm{pred}}=\widetilde{\Theta}(\frac{\epsilon}{L^2d^{1/2}})$, $h_{\mathrm{corr}}=\widetilde{\Theta}(\frac{\epsilon}{L^{3/2}d^{1/2}})$, and if the score estimation error satisfies $\epsilon_0\leq\widetilde{O}(\frac{\epsilon}{\sqrt{L}})$, then we can obtain TV error $\epsilon$ with a total iteration complexity of $\widetilde{\Theta}(L^2d^{1/2}/\epsilon)$ steps.

7 Conclusion

We have presented a theoretical analysis of the Lipschitz continuity and second momentum properties of diffusion models, focusing on the case where the target data distribution is a mixture of Gaussians. Our results provide concrete bounds on key properties of the diffusion process and establish error guarantees for various diffusion solvers. These findings contribute to a deeper understanding of diffusion models and pave the way for further theoretical and practical advancements in this field.

Acknowledgement

Research is partially supported by the National Science Foundation (NSF) Grants 2023239-DMS, CCF-2046710, and Air Force Grant FA9550-18-1-0166.

References

  • [ABDH+18] Hassan Ashtiani, Shai Ben-David, Nicholas Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Nearly tight sample complexity bounds for learning mixtures of gaussians via sample compression schemes. Advances in Neural Information Processing Systems, 31, 2018.
  • [ADLS17] Jayadev Acharya, Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1278–1289. SIAM, 2017.
  • [And82] Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
  • [BDD23] Joe Benton, George Deligiannidis, and Arnaud Doucet. Error bounds for flow matching methods. arXiv preprint arXiv:2305.16860, 2023.
  • [BDJ+22] Ainesh Bakshi, Ilias Diakonikolas, He Jia, Daniel M Kane, Pravesh K Kothari, and Santosh S Vempala. Robustly learning mixtures of k arbitrary gaussians. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1234–1247, 2022.
  • [BK20] Ainesh Bakshi and Pravesh Kothari. Outlier-robust clustering of non-spherical mixtures. arXiv preprint arXiv:2005.02970, 2020.
  • [BS10] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 103–112. IEEE, 2010.
  • [BS23] Rares-Darius Buhai and David Steurer. Beyond parallel pancakes: Quasi-polynomial time guarantees for non-spherical gaussian mixtures. In The Thirty Sixth Annual Conference on Learning Theory, pages 548–611. PMLR, 2023.
  • [CCL+22] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022.
  • [CCL+23] Sitan Chen, Sinho Chewi, Holden Lee, Yuanzhi Li, Jianfeng Lu, and Adil Salim. The probability flow ode is provably fast. Advances in Neural Information Processing Systems, 36, 2023.
  • [CDD23] Sitan Chen, Giannis Daras, and Alex Dimakis. Restoration-degradation beyond linear diffusions: A non-asymptotic analysis for ddim-type samplers. In International Conference on Machine Learning, pages 4462–4484. PMLR, 2023.
  • [CDSS13] Siu-On Chan, Ilias Diakonikolas, Xiaorui Sun, and Rocco A Servedio. Learning mixtures of structured distributions over discrete domains. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1380–1394. SIAM, 2013.
  • [CHZW23] Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. In International Conference on Machine Learning, pages 4672–4712. PMLR, 2023.
  • [CK11] Imre Csiszár and János Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
  • [CKS24] Sitan Chen, Vasilis Kontonis, and Kulin Shah. Learning general gaussian mixtures with efficient score matching. arXiv preprint arXiv:2404.18893, 2024.
  • [CLL23] Hongrui Chen, Holden Lee, and Jianfeng Lu. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pages 4735–4763. PMLR, 2023.
  • [CZZ+20] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, 2020.
  • [Das99] Sanjoy Dasgupta. Learning mixtures of gaussians. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 634–644. IEEE, 1999.
  • [DB22] Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. arXiv preprint arXiv:2208.05314, 2022.
  • [DHKK20] Ilias Diakonikolas, Samuel B Hopkins, Daniel Kane, and Sushrut Karmalkar. Robustly learning any clusterable mixture of gaussians. arXiv preprint arXiv:2005.06417, 2020.
  • [DK14] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of gaussians. In Conference on Learning Theory, pages 1183–1213. PMLR, 2014.
  • [DK20] Ilias Diakonikolas and Daniel M Kane. Small covers for near-zero sets of polynomials and learning latent variable models. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 184–195. IEEE, 2020.
  • [DKK+19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.
  • [FOS08] Jon Feldman, Ryan O’Donnell, and Rocco A Servedio. Learning mixtures of product distributions over discrete domains. SIAM Journal on Computing, 37(5):1536–1564, 2008.
  • [GKL24] Khashayar Gatmiry, Jonathan Kelner, and Holden Lee. Learning mixtures of gaussians using diffusion models. arXiv preprint arXiv:2404.18869, 2024.
  • [GLB+24] Hanzhong Guo, Cheng Lu, Fan Bao, Tianyu Pang, Shuicheng Yan, Chao Du, and Chongxuan Li. Gaussian mixture solvers for diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
  • [HJA20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [HO07] John R Hershey and Peder A Olsen. Approximating the kullback leibler divergence between gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, volume 4, pages IV–317. IEEE, 2007.
  • [HRX24] Yinbin Han, Meisam Razaviyayn, and Renyuan Xu. Neural network-based score estimation in diffusion models: Optimization and generalization. In The Twelfth International Conference on Learning Representations, 2024.
  • [HWSL24] Jerry Yao-Chieh Hu, Weimin Wu, Zhao Song, and Han Liu. On statistical rates and provably efficient criteria of latent diffusion transformers (dits). arXiv preprint arXiv:2407.01079, 2024.
  • [KFL22] Dohyun Kwon, Ying Fan, and Kangwook Lee. Score-based generative modeling secretly minimizes the wasserstein distance. Advances in Neural Information Processing Systems, 35:20205–20217, 2022.
  • [KLL+24] Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In The Twelfth International Conference on Learning Representations, 2024.
  • [KPH+20] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.
  • [LKB+23] Jae Hyun Lim, Nikola B Kovachki, Ricardo Baptista, Christopher Beckham, Kamyar Azizzadenesheli, Jean Kossaifi, Vikram Voleti, Jiaming Song, Karsten Kreis, Jan Kautz, et al. Score-based diffusion models in function space. arXiv preprint arXiv:2302.07400, 2023.
  • [LLSS24] Chenyang Li, Yingyu Liang, Zhenmei Shi, and Zhao Song. Exploring the frontiers of softmax: Provable optimization, applications in diffusion model, and beyond. arXiv preprint arXiv:2405.03251, 2024.
  • [LLT22] Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative modeling with polynomial complexity. Advances in Neural Information Processing Systems, 35:22870–22882, 2022.
  • [LS17] Jerry Li and Ludwig Schmidt. Robust and proper learning for mixtures of gaussians via systems of polynomial inequalities. In Conference on Learning Theory, pages 1302–1382. PMLR, 2017.
  • [LSSS24] Yingyu Liang, Zhizhou Sha, Zhenmei Shi, and Zhao Song. Differential privacy mechanisms in neural tangent kernel regression. arXiv preprint arXiv:2407.13621, 2024.
  • [MV10] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 93–102. IEEE, 2010.
  • [ND21] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021.
  • [NRW21] Eliya Nachmani, Robin San Roman, and Lior Wolf. Non gaussian denoising diffusion models. arXiv preprint arXiv:2106.07582, 2021.
  • [Ope24] OpenAI. Video generation models as world simulators, 2024. https://openai.com/research/video-generation-models-as-world-simulators.
  • [PKK22] João M Pereira, Joe Kileel, and Tamara G Kolda. Tensor moments of gaussian mixture models: Theory and applications. arXiv preprint arXiv:2202.06930, 2022.
  • [PX23] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  • [RBL+22] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [RFB15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  • [RGKZ21] Maria Refinetti, Sebastian Goldt, Florent Krzakala, and Lenka Zdeborová. Classifying high-dimensional gaussian mixtures: Where kernel methods fail and neural networks succeed. In International Conference on Machine Learning, pages 8936–8947. PMLR, 2021.
  • [RHX+23] Alex Reneau, Jerry Yao-Chieh Hu, Chenwei Xu, Weijian Li, Ammar Gilani, and Han Liu. Feature programming for multivariate time series prediction. In Fortieth International Conference on Machine Learning (ICML), 2023.
  • [SCK23] Kulin Shah, Sitan Chen, and Adam Klivans. Learning mixtures of gaussians using the ddpm objective. Advances in Neural Information Processing Systems, 36:19636–19649, 2023.
  • [SCL+22] Zhenmei Shi, Jiefeng Chen, Kunyang Li, Jayaram Raghuram, Xi Wu, Yingyu Liang, and Somesh Jha. The trade-off between universality and label efficiency of representations from contrastive learning. In The Eleventh International Conference on Learning Representations, 2022.
  • [Sco15] David W Scott. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.
  • [SDCS23] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Proceedings of the 40th International Conference on Machine Learning, pages 32211–32252, 2023.
  • [SDME21] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. Advances in neural information processing systems, 34:1415–1428, 2021.
  • [SDWMG15] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • [SE19] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • [SE20] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
  • [SGSE20] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020.
  • [SHC+22] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE transactions on pattern analysis and machine intelligence, 45(4):4713–4726, 2022.
  • [SK21] Yang Song and Diederik P Kingma. How to train your energy-based models. arXiv preprint arXiv:2101.03288, 2021.
  • [SMF+24] Zhenmei Shi, Yifei Ming, Ying Fan, Frederic Sala, and Yingyu Liang. Domain generalization via nuclear norm regularization. In Conference on Parsimony and Learning, pages 179–201. PMLR, 2024.
  • [SOAJ14] Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal-sample estimators for spherical gaussian mixtures. Advances in Neural Information Processing Systems, 27, 2014.
  • [SSDK+20] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • [SWL21] Zhenmei Shi, Junyi Wei, and Yingyu Liang. A theoretical analysis on feature learning in neural networks: Emergence from inputs and advantage over fixed features. In International Conference on Learning Representations, 2021.
  • [SWL24] Zhenmei Shi, Junyi Wei, and Yingyu Liang. Provable guarantees for neural networks via gradient feature learning. Advances in Neural Information Processing Systems, 36, 2024.
  • [Vin04] Susana Vinga. Convolution integrals of normal distribution functions. Supplementary material to Vinga and Almeida (2004), “Rényi continuous entropy of DNA sequences”, 2004.
  • [WCL+24] Yuchen Wu, Minshuo Chen, Zihao Li, Mengdi Wang, and Yuting Wei. Theoretical insights for diffusion guidance: A case study for gaussian mixture models. arXiv preprint arXiv:2403.01639, 2024.
  • [WCZ+23] Yilin Wang, Zeyuan Chen, Liangjun Zhong, Zheng Ding, Zhizhou Sha, and Zhuowen Tu. Dolfin: Diffusion layout transformers without autoencoder. arXiv preprint arXiv:2310.16305, 2023.
  • [WSD+23] Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, and Zhuowen Tu. Tokencompose: Grounding diffusion with token-level supervision. arXiv preprint arXiv:2312.03626, 2023.
  • [WXZ+24] Yilin Wang, Haiyang Xu, Xiang Zhang, Zeyuan Chen, Zhizhou Sha, Zirui Wang, and Zhuowen Tu. Omnicontrolnet: Dual-stage integration for conditional image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7436–7448, 2024.
  • [XSW+23] Zhuoyan Xu, Zhenmei Shi, Junyi Wei, Fangzhou Mu, Yin Li, and Yingyu Liang. Towards few-shot adaptation of foundation models via multitask finetuning. In The Twelfth International Conference on Learning Representations, 2023.
  • [YFZ+23] Zhantao Yang, Ruili Feng, Han Zhang, Yujun Shen, Kai Zhu, Lianghua Huang, Yifei Zhang, Yu Liu, Deli Zhao, Jingren Zhou, et al. Lipschitz singularities in diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
  • [ZHF+24] Jianbin Zheng, Minghui Hu, Zhongyi Fan, Chaoyue Wang, Changxing Ding, Dacheng Tao, and Tat-Jen Cham. Trajectory consistency distillation. arXiv preprint arXiv:2402.19159, 2024.

Appendix

Roadmap.

  • Section A discusses the limitations of the paper.

  • Section B discusses the societal impacts of the paper.

  • Section C provides the preliminary for the paper.

  • Section D analyzes the continuous-time score function for the case of a single Gaussian.

  • Section E analyzes the case where the score function corresponds to a mixture of 2 Gaussians.

  • Section F analyzes the case where the score function corresponds to a k-mixture of Gaussians.

  • Section G provides lemmas used for the more concrete calculations behind the theorems in Section 6.

  • Section H presents our main results for the case where the data distribution is a k-mixture of Gaussians.

Appendix A Limitations

This work has not directly addressed the practical applications of our results. Additionally, we did not provide a sample complexity bound for our settings. Future research could explore how these findings might be implemented in real-world scenarios and work on improving these limitations.

Appendix B Societal Impacts

We explore and provide a deeper understanding of diffusion models, and we explicitly give the Lipschitz constant for the k-mixture of Gaussians, which may inspire better algorithm design.

On the other hand, our paper is purely theoretical in nature, so we foresee no immediate negative ethical impact.

Appendix C Preliminary

This section provides some preliminary knowledge and is organized as below:

  • Section C.1 provides the facts we use.

  • Section C.2 provides the properties of the \exp function that we use.

  • Section C.3 provides the Lipschitz multiplication property we use.

C.1 Facts

We provide several basic facts from calculus and linear algebra that are used in the proofs.

Fact C.1 (Calculus).

For x,y,t\in\mathbb{R} and u,v\in\mathbb{R}^{n}, it is well-known that

  • \frac{\mathrm{d}x}{\mathrm{d}t}=\frac{\mathrm{d}x}{\mathrm{d}y}\frac{\mathrm{d}y}{\mathrm{d}t} (chain rule)

  • \frac{\mathrm{d}(xy)}{\mathrm{d}t}=\frac{\mathrm{d}x}{\mathrm{d}t}y+\frac{\mathrm{d}y}{\mathrm{d}t}x (product rule)

  • \frac{\mathrm{d}x^{n}}{\mathrm{d}x}=nx^{n-1} (power rule)

  • \frac{\mathrm{d}\langle u,v\rangle}{\mathrm{d}u}=v (derivative of the inner product)

  • \frac{\mathrm{d}\exp(x)}{\mathrm{d}x}=\exp(x) (derivative of the exponential function)

  • \frac{\mathrm{d}\log x}{\mathrm{d}x}=\frac{1}{x} (derivative of the logarithm function)

  • \frac{\mathrm{d}}{\mathrm{d}u}\|u\|_{2}^{2}=2u (derivative of the squared \ell_{2} norm)

  • \frac{\mathrm{d}y}{\mathrm{d}x}=0 if y is independent of x (derivative with respect to an independent variable)

Fact C.2 (Norm Bounds).

For a,b\in\mathbb{R}, u,v\in\mathbb{R}^{n}, A\in\mathbb{R}^{k\times n}, and a symmetric p.s.d. matrix W\in\mathbb{R}^{n\times n}, we have

  • \|au\|_{2}=|a|\cdot\|u\|_{2} (absolute homogeneity)

  • \|u+v\|_{2}\leq\|u\|_{2}+\|v\|_{2} (triangle inequality)

  • |u^{\top}v|\leq\|u\|_{2}\cdot\|v\|_{2} (Cauchy–Schwarz inequality)

  • \|u^{\top}\|_{2}=\|u\|_{2}

  • \|Au\|_{2}\leq\|A\|\cdot\|u\|_{2}

  • \|aA\|=|a|\cdot\|A\|

  • \|A\|=\sigma_{\max}(A)

  • \|A^{-1}\|=\frac{1}{\sigma_{\min}(A)} for invertible A

  • u^{\top}Wu\geq\|u\|_{2}^{2}\cdot\sigma_{\min}(W)

  • \sigma_{\min}(W^{-1})=\frac{1}{\sigma_{\max}(W)} for invertible W

Fact C.3 (Matrix Calculus).

Let W\in\mathbb{R}^{n\times n} denote a symmetric matrix. Let x,s\in\mathbb{R}^{n}, and suppose that s is independent of x. Then, we know

  • ddx(xs)W(xs)=2W(xs)\frac{\mathrm{d}}{\mathrm{d}x}(x-s)^{\top}W(x-s)=2W(x-s)

C.2 Properties of exp\exp functions

During the course of proving the Lipschitz continuity for mixtures of Gaussians, we found that we need to use the following bound for the exp\exp function.

Fact C.4.

For a,b\in\mathbb{R} with |a-b|\leq 0.1, we have

|exp(a)exp(b)||exp(a)|2|ab|\displaystyle|\exp(a)-\exp(b)|\leq|\exp(a)|\cdot 2|a-b|
Proof.

We have

|exp(a)exp(b)|=\displaystyle|\exp(a)-\exp(b)|= |exp(a)(1exp(ba))|\displaystyle~{}|\exp(a)\cdot(1-\exp(b-a))|
=\displaystyle= |exp(a)||(1exp(ba))|\displaystyle~{}|\exp(a)|\cdot|(1-\exp(b-a))|
\displaystyle\leq |exp(a)|2|ab|\displaystyle~{}|\exp(a)|\cdot 2|a-b|

where the first step follows from simple algebra, the second step follows from |a\cdot b|=|a|\cdot|b|, and the last step follows from |\exp(x)-1|\leq 2|x| for all |x|\leq 0.1. ∎

Fact C.5.

For u,v\in\mathbb{R}^{n} with \|u-v\|_{\infty}\leq 0.1, we have

exp(u)exp(v)2exp(u)22uv\displaystyle\|\exp(u)-\exp(v)\|_{2}\leq\|\exp(u)\|_{2}\cdot 2\|u-v\|_{\infty}
Proof.

We have

exp(u)exp(v)2=\displaystyle\|\exp(u)-\exp(v)\|_{2}= exp(u)(𝟏nexp(vu))2\displaystyle~{}\|\exp(u)\circ({\bf 1}_{n}-\exp(v-u))\|_{2}
\displaystyle\leq exp(u)2𝟏nexp(vu)\displaystyle~{}\|\exp(u)\|_{2}\cdot\|{\bf 1}_{n}-\exp(v-u)\|_{\infty}
\displaystyle\leq exp(u)22uv\displaystyle~{}\|\exp(u)\|_{2}\cdot 2\|u-v\|_{\infty}

where the first step follows from the definition of the Hadamard product, the second step follows from \|u\circ v\|_{2}\leq\|u\|_{2}\cdot\|v\|_{\infty}, and the last step follows from |\exp(x)-1|\leq 2|x| for all |x|\leq 0.1. ∎
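As a quick numerical sanity check (not part of the proofs), the bounds in Facts C.4 and C.5 can be tested on random inputs. The following Python/NumPy sketch is our own illustration; the sampling ranges and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fact C.4: |exp(a) - exp(b)| <= |exp(a)| * 2|a - b| whenever |a - b| <= 0.1.
a = rng.uniform(-2, 2, size=10_000)
b = a + rng.uniform(-0.1, 0.1, size=10_000)
assert np.all(np.abs(np.exp(a) - np.exp(b)) <= np.exp(a) * 2 * np.abs(a - b) + 1e-12)

# Fact C.5: ||exp(u) - exp(v)||_2 <= ||exp(u)||_2 * 2 ||u - v||_inf
# whenever ||u - v||_inf <= 0.1.
u = rng.uniform(-2, 2, size=(1_000, 8))
v = u + rng.uniform(-0.1, 0.1, size=(1_000, 8))
lhs = np.linalg.norm(np.exp(u) - np.exp(v), axis=1)
rhs = np.linalg.norm(np.exp(u), axis=1) * 2 * np.max(np.abs(u - v), axis=1)
assert np.all(lhs <= rhs + 1e-12)
```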

Fact C.6 (Mean value theorem for vector function).

For x,y\in C\subset\mathbb{R}^{n} and functions f:C\to\mathbb{R} and g:C\to\mathbb{R}^{m} that are differentiable on the open convex domain C, we have

  • Part 1: f(y)-f(x)=\nabla f(x+t(y-x))^{\top}(y-x) for some t\in(0,1)

  • Part 2: \|g(y)-g(x)\|_{2}\leq\|g^{\prime}(x+t(y-x))\|\cdot\|y-x\|_{2} for some t\in(0,1), where g^{\prime}(a) denotes the matrix whose (i,j)-th entry is \frac{\mathrm{d}g(a)_{j}}{\mathrm{d}a_{i}}.

  • Part 3: If \|g^{\prime}(a)\|\leq M for all a\in C, then \|g(y)-g(x)\|_{2}\leq M\|y-x\|_{2} for all x,y\in C.

Proof.

Proof of Part 1

Part 1 can be verified by applying the Mean Value Theorem for single-variable functions to \gamma(c)=f(x+c(y-x)).

f(y)f(x)=γ(1)γ(0)=γ(t)(10)=f(x+t(yx))(yx)\displaystyle f(y)-f(x)=\gamma(1)-\gamma(0)=\gamma^{\prime}(t)(1-0)=\nabla f(x+t(y-x))^{\top}(y-x)

where t(0,1)t\in(0,1).

Proof of Part 2

Let G(c):=(g(y)g(x))g(c)G(c):=(g(y)-g(x))^{\top}g(c), we have

g(y)g(x)22=\displaystyle\|g(y)-g(x)\|_{2}^{2}= G(y)G(x)\displaystyle~{}G(y)-G(x)
=\displaystyle= G(x+t(yx))(yx)\displaystyle~{}\nabla G(x+t(y-x))^{\top}(y-x)
=\displaystyle= (g(x+t(yx))n×m(g(y)g(x))m×1)(yx)n×1\displaystyle~{}(\underbrace{g^{\prime}(x+t(y-x))}_{n\times m}\cdot\underbrace{(g(y)-g(x))}_{m\times 1})^{\top}\cdot\underbrace{(y-x)}_{n\times 1}
\displaystyle\leq g(x+t(yx))g(y)g(x)2yx2\displaystyle~{}\|g^{\prime}(x+t(y-x))\|\cdot\|g(y)-g(x)\|_{2}\cdot\|y-x\|_{2}

where the first step follows from simple algebra, the second step follows from Part 1, the third step uses the chain rule, and the fourth step follows from the Cauchy–Schwarz inequality together with \|Au\|_{2}\leq\|A\|\cdot\|u\|_{2}. Dividing both sides by \|g(y)-g(x)\|_{2} gives the result.

Proof of Part 3

Part 3 directly follows from Part 2. ∎

Fact C.7.

Let g^{\prime}(a) denote the matrix whose (i,j)-th entry is \frac{\mathrm{d}g(a)_{j}}{\mathrm{d}a_{i}}. For u,v\in\mathbb{R}^{n} with \|u\|_{2},\|v\|_{2}\leq R, where R\geq 0, and for t\in(0,1), we have

exp(u+t(vu))exp(R)\displaystyle\|\exp^{\prime}(u+t(v-u))\|\leq\exp(R)
Proof.

We can show

exp(u+t(vu))=\displaystyle\|\exp^{\prime}(u+t(v-u))\|= diag(exp(u+t(vu)))\displaystyle~{}\|\mathrm{diag}(\exp(u+t(v-u)))\|
\displaystyle\leq σmax(diag(exp(u+t(vu))))\displaystyle~{}\sigma_{\max}(\mathrm{diag}(\exp(u+t(v-u))))
=\displaystyle= maxi[n]exp(ui+t(viui))\displaystyle~{}\max_{i\in[n]}\exp(u_{i}+t(v_{i}-u_{i}))
\displaystyle\leq maxi[n]max{exp(vi),exp(ui)}\displaystyle~{}\max_{i\in[n]}\max\{\exp(v_{i}),\exp(u_{i})\}
\displaystyle\leq exp(R)\displaystyle~{}\exp(R)

where the first step follows from \frac{\mathrm{d}\exp(x)}{\mathrm{d}x}=\mathrm{diag}(\exp(x)), the second step follows from Fact C.2, the third step follows from the fact that the spectral norm of a diagonal matrix with positive entries is its largest entry, the fourth step follows from the monotonicity of \exp and t\in(0,1), and the last step follows from \|\exp(v)\|_{\infty}\leq\exp(\|v\|_{\infty})\leq\exp(\|v\|_{2}). ∎

Fact C.8.

For u,v\in\mathbb{R}^{n} with \|u\|_{2},\|v\|_{2}\leq R, where R\geq 0, we have

exp(u)exp(v)2exp(R)uv2\displaystyle\|\exp(u)-\exp(v)\|_{2}\leq\exp(R)\|u-v\|_{2}
Proof.

We can show, for t(0,1)t\in(0,1),

exp(u)exp(v)2\displaystyle\|\exp(u)-\exp(v)\|_{2}\leq exp(u+t(vu))uv2\displaystyle~{}\|\exp^{\prime}(u+t(v-u))\|\cdot\|u-v\|_{2}
\displaystyle\leq exp(R)uv2\displaystyle~{}\exp(R)\|u-v\|_{2}

where the first step follows from Fact C.6, the second step follows from Fact C.7. ∎
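The \exp(R)-Lipschitz bound of Fact C.8 can be checked the same way; the sketch below (our own choices of R, dimension, and sample size) rescales random vectors into the ball of radius R before testing the inequality.

```python
import numpy as np

rng = np.random.default_rng(1)
R, n = 2.0, 6
for _ in range(10_000):
    u, v = rng.normal(size=n), rng.normal(size=n)
    u *= R * rng.uniform() / np.linalg.norm(u)  # force ||u||_2 <= R
    v *= R * rng.uniform() / np.linalg.norm(v)  # force ||v||_2 <= R
    assert (np.linalg.norm(np.exp(u) - np.exp(v))
            <= np.exp(R) * np.linalg.norm(u - v) + 1e-12)
```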

Fact C.9.

For a,b\in\mathbb{R} with a,b\leq R, where R\geq 0, we have

|exp(a)exp(b)|exp(R)|ab|\displaystyle|\exp(a)-\exp(b)|\leq\exp(R)|a-b|
Proof.

We can show, for t(0,1)t\in(0,1),

|\exp(a)-\exp(b)|=\displaystyle~{}|\exp^{\prime}(a+t(b-a))|\cdot|a-b|
=\displaystyle~{}|\exp(a+t(b-a))|\cdot|a-b|
\displaystyle\leq~{}\max\{\exp(a),\exp(b)\}\cdot|a-b|
\displaystyle\leq~{}\exp(R)\cdot|a-b|

where the first step follows from the Mean Value Theorem, the second step follows from Fact C.1, the third step follows from t\in(0,1) and the monotonicity of \exp, and the last step follows from a,b\leq R. ∎

C.3 Lipschitz multiplication property

Our overall proofs of the Lipschitz constant for the k-mixture of Gaussians follow the idea of the fact below.

Fact C.10.

If the following conditions hold

  • |f_{i}(x)-f_{i}(y)|\leq L\cdot\|x-y\|_{2} for each i\in[k]

  • R:=\max_{i\in[k],x}|f_{i}(x)|

Then, we have

|i=1kfi(x)i=1kfi(y)|kRk1Lxy2\displaystyle|\prod_{i=1}^{k}f_{i}(x)-\prod_{i=1}^{k}f_{i}(y)|\leq k\cdot R^{k-1}\cdot L\cdot\|x-y\|_{2}
Proof.

We can show

|i=1kfi(x)i=1kfi(y)|\displaystyle|\prod_{i=1}^{k}f_{i}(x)-\prod_{i=1}^{k}f_{i}(y)|
=\displaystyle= |fk(x)i=1k1fi(x)fk(y)i=1k1fi(y)|\displaystyle~{}|f_{k}(x)\prod_{i=1}^{k-1}f_{i}(x)-f_{k}(y)\prod_{i=1}^{k-1}f_{i}(y)|
\displaystyle\leq |fk(x)i=1k1fi(x)fk(y)i=1k1fi(x)|+|fk(y)i=1k1fi(x)fk(y)i=1k1fi(y)|\displaystyle~{}|f_{k}(x)\prod_{i=1}^{k-1}f_{i}(x)-f_{k}(y)\prod_{i=1}^{k-1}f_{i}(x)|+|f_{k}(y)\prod_{i=1}^{k-1}f_{i}(x)-f_{k}(y)\prod_{i=1}^{k-1}f_{i}(y)|
=\displaystyle= |(fk(x)fk(y))i=1k1fi(x)|+|fk(y)(i=1k1fi(x)i=1k1fi(y))|\displaystyle~{}|(f_{k}(x)-f_{k}(y))\prod_{i=1}^{k-1}f_{i}(x)|+|f_{k}(y)(\prod_{i=1}^{k-1}f_{i}(x)-\prod_{i=1}^{k-1}f_{i}(y))|
\displaystyle\leq Lxy2Rk1+R|i=1k1fi(x)i=1k1fi(y)|\displaystyle~{}L\cdot\|x-y\|_{2}\cdot R^{k-1}+R\cdot|\prod_{i=1}^{k-1}f_{i}(x)-\prod_{i=1}^{k-1}f_{i}(y)|
\displaystyle\leq~{}L\cdot\|x-y\|_{2}\cdot R^{k-1}+R\cdot(L\cdot\|x-y\|_{2}\cdot R^{k-2}+R\cdot|\prod_{i=1}^{k-2}f_{i}(x)-\prod_{i=1}^{k-2}f_{i}(y)|)
=\displaystyle= 2Lxy2Rk1+R2|i=1k2fi(x)i=1k2fi(y)|\displaystyle~{}2\cdot L\cdot\|x-y\|_{2}\cdot R^{k-1}+R^{2}\cdot|\prod_{i=1}^{k-2}f_{i}(x)-\prod_{i=1}^{k-2}f_{i}(y)|
\displaystyle\leq kRk1Lxy2\displaystyle~{}k\cdot R^{k-1}\cdot L\cdot\|x-y\|_{2}

where the first step follows from simple algebra, the second step follows from Fact C.2, the third step follows from rearranging terms, the fourth step follows from the assumptions of the lemma, the fifth step follows from the same logic of above, the sixth step follows from simple algebra, and the last step follows from the recursive process. ∎
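To illustrate Fact C.10 concretely, one can take f_{i}(x)=\sin(c_{i}^{\top}x) with unit vectors c_{i}, so that R=1 and L=1, and compare the product difference against the bound k\cdot R^{k-1}\cdot L\cdot\|x-y\|_{2}. The following sketch (the choice of factors is ours, purely for illustration) does this for k=3.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 3, 5
C = rng.normal(size=(k, n))
C /= np.linalg.norm(C, axis=1, keepdims=True)  # unit rows: |f_i| <= 1 and 1-Lipschitz

def prod_f(x):
    # Product of the k factors f_i(x) = sin(c_i . x).
    return np.prod(np.sin(C @ x))

for _ in range(10_000):
    x, y = rng.normal(size=n), rng.normal(size=n)
    bound = k * 1.0 ** (k - 1) * 1.0 * np.linalg.norm(x - y)  # k * R^(k-1) * L * ||x-y||_2
    assert abs(prod_f(x) - prod_f(y)) <= bound + 1e-12
```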

Appendix D Single Gaussian Case

In this section, we consider the continuous case of p_{t}(x), the probability density function (\mathsf{pdf}) of the input data x, viewed as a function of the time t as well. More specifically, we consider the cases where p_{t}(x) is (1) a single Gaussian in which either the mean (Section D.1) or the covariance (Section D.2) is a function of time, and (2) a single Gaussian in which both the mean and the covariance are functions of time (Section D.3). We then compute the upper bound and the Lipschitz constant of the score function, i.e., the gradient of the log \mathsf{pdf}, \frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}.

D.1 Case when the mean of pt(x)p_{t}(x) is a function of time

We start our calculation with a simple case. Consider p_{t} such that

p_{t}(x)=\Pr_{x^{\prime}\sim\mathcal{N}(t{\bf 1}_{d},I_{d})}[x^{\prime}=x]

Let \mathsf{pdf}:\mathbb{R}^{d}\rightarrow\mathbb{R} denote p_{t}. Then \log(\mathsf{pdf}):\mathbb{R}^{d}\rightarrow\mathbb{R}, and the gradient \nabla\log(\mathsf{pdf}) is a function of t through \mathcal{N}(t{\bf 1}_{d},I_{d}). Plugging x and y into this gradient function then completes the argument.

Below we define the \mathsf{pdf} for the continuous case when the mean is a function of time.

Definition D.1.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

We define

pt(x):=1(2π)d/2exp(12xt𝟏d22)\displaystyle p_{t}(x):=\frac{1}{(2\pi)^{d/2}}\exp(-\frac{1}{2}\|x-t\mathbf{1}_{d}\|_{2}^{2})

Further, we have

logpt(x)=d2log(2π)12xt𝟏d22\displaystyle\log p_{t}(x)=-\frac{d}{2}\log(2\pi)-\frac{1}{2}\|x-t\mathbf{1}_{d}\|_{2}^{2}

Below we calculate the score function of 𝗉𝖽𝖿\mathsf{pdf} for the continuous case when the mean is a function of time.

Lemma D.2.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

Then,

dlogpt(x)dx=t𝟏dx\displaystyle\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}=t\mathbf{1}_{d}-x
Proof.

We can show

dlogpt(x)dx=\displaystyle\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}= ddx(d2log(2π)12xt𝟏d22)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x}(-\frac{d}{2}\log(2\pi)-\frac{1}{2}\|x-t\mathbf{1}_{d}\|_{2}^{2})
=\displaystyle= 12ddxxt𝟏d22\displaystyle~{}-\frac{1}{2}\cdot\frac{\mathrm{d}}{\mathrm{d}x}\|x-t\mathbf{1}_{d}\|_{2}^{2}
=\displaystyle= 122(xt𝟏d)\displaystyle~{}-\frac{1}{2}\cdot 2(x-t\mathbf{1}_{d})
=\displaystyle= t𝟏dx\displaystyle~{}t\mathbf{1}_{d}-x

where the first step follows from Definition D.1, the second step follows from variables are independent, the third step follows from Fact C.1, and the last step follows from simple algebra. ∎
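The closed form t\mathbf{1}_{d}-x can be compared against a finite-difference gradient of \log p_{t}; the sketch below (d, t, and the test point are arbitrary choices of ours) performs this check.

```python
import numpy as np

d, t = 4, 1.5
x = np.array([0.3, -1.2, 0.7, 2.0])

def log_pt(x):
    # log p_t(x) for N(t * 1_d, I_d); cf. Definition D.1.
    return -d / 2 * np.log(2 * np.pi) - 0.5 * np.sum((x - t) ** 2)

eps = 1e-6
num_grad = np.array([(log_pt(x + eps * e) - log_pt(x - eps * e)) / (2 * eps)
                     for e in np.eye(d)])
assert np.allclose(num_grad, t * np.ones(d) - x, atol=1e-5)  # Lemma D.2
```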

Below we calculate the upper bound for the score function of \mathsf{pdf} for the continuous case when the mean is a function of time.

Lemma D.3 (Linear growth).

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

Then,

dlogpt(x)dx2t+x2\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}\|_{2}\leq t+\|x\|_{2}
Proof.

We can show

dlogpt(x)dx2=\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}\|_{2}= t𝟏dx2\displaystyle~{}\|t\mathbf{1}_{d}-x\|_{2}
\displaystyle\leq t𝟏d2+x2\displaystyle~{}\|t\mathbf{1}_{d}\|_{2}+\|-x\|_{2}
\displaystyle\leq t+x2\displaystyle~{}t+\|x\|_{2}

where the first step follows from Lemma D.2, the second step follows from Fact C.2, and the last step follows from simple algebra. ∎

Below we calculate the Lipschitz constant for the score function of \mathsf{pdf} for the continuous case when the mean is a function of time.

Lemma D.4 (Lipschitz).

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

Then,

dlogpt(x)dxdlogpt(x~)dx~2=x~x2\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}-\frac{\mathrm{d}\log p_{t}(\widetilde{x})}{\mathrm{d}\widetilde{x}}\|_{2}=\|\widetilde{x}-x\|_{2}
Proof.

We can show

dlogpt(x)dxdlogpt(x~)dx~2=\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}-\frac{\mathrm{d}\log p_{t}(\widetilde{x})}{\mathrm{d}\widetilde{x}}\|_{2}= t𝟏dx(t𝟏dx~)2\displaystyle~{}\|t\mathbf{1}_{d}-x-(t\mathbf{1}_{d}-\widetilde{x})\|_{2}
=\displaystyle= x~x2\displaystyle~{}\|\widetilde{x}-x\|_{2}

where the first step follows from Lemma D.2, and the last step follows from simple algebra. ∎

D.2 Case when the covariance of pt(x)p_{t}(x) is a function of time

Below we define the \mathsf{pdf} for the continuous case when the covariance is a function of time.

Definition D.5.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

We define

pt(x):=1t1/2(2π)d/2exp(12tx𝟏d22)\displaystyle p_{t}(x):=\frac{1}{t^{1/2}(2\pi)^{d/2}}\exp(-\frac{1}{2t}\|x-\mathbf{1}_{d}\|_{2}^{2})

Further, we have

logpt(x)=12logtd2log(2π)12tx𝟏d22\displaystyle\log p_{t}(x)=-\frac{1}{2}\log t-\frac{d}{2}\log(2\pi)-\frac{1}{2t}\|x-\mathbf{1}_{d}\|_{2}^{2}

Below we calculate the score function of \mathsf{pdf} for the continuous case when the covariance is a function of time.

Lemma D.6.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

Then,

dlogpt(x)dx=1t(𝟏dx)\displaystyle\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}=\frac{1}{t}(\mathbf{1}_{d}-x)
Proof.

We can show

dlogpt(x)dx=\displaystyle\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}= ddx(12logtd2log(2π)12tx𝟏d22)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x}(-\frac{1}{2}\log t-\frac{d}{2}\log(2\pi)-\frac{1}{2t}\|x-\mathbf{1}_{d}\|_{2}^{2})
=\displaystyle= 12tddxx𝟏d22\displaystyle~{}-\frac{1}{2t}\cdot\frac{\mathrm{d}}{\mathrm{d}x}\|x-\mathbf{1}_{d}\|_{2}^{2}
=\displaystyle= 12t2(x𝟏d)\displaystyle~{}-\frac{1}{2t}\cdot 2(x-\mathbf{1}_{d})
=\displaystyle= 1t(𝟏dx)\displaystyle~{}\frac{1}{t}(\mathbf{1}_{d}-x)

where the first step follows from Definition D.5, the second step follows from variables are independent, the third step follows from Fact C.1, and the last step follows from simple algebra. ∎

Below we calculate the upper bound of the score function of 𝗉𝖽𝖿\mathsf{pdf} for the continuous case when the covariance is a function of time.

Lemma D.7 (Linear growth).

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

Then,

dlogpt(x)dx21t(1+x2)\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}\|_{2}\leq\frac{1}{t}(1+\|x\|_{2})
Proof.

We can show

dlogpt(x)dx2=\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}\|_{2}= 1t(𝟏dx)2\displaystyle~{}\|\frac{1}{t}(\mathbf{1}_{d}-x)\|_{2}
=\displaystyle= |1t|𝟏dx2\displaystyle~{}|\frac{1}{t}|\cdot\|\mathbf{1}_{d}-x\|_{2}
=\displaystyle= 1t𝟏dx2\displaystyle~{}\frac{1}{t}\|\mathbf{1}_{d}-x\|_{2}
\displaystyle\leq 1t(𝟏d2+x2)\displaystyle~{}\frac{1}{t}(\|\mathbf{1}_{d}\|_{2}+\|-x\|_{2})
=\displaystyle= 1t(1+x2)\displaystyle~{}\frac{1}{t}(1+\|x\|_{2})

where the first step follows from Lemma D.6, the second step follows from Fact C.2, the third step follows from t0t\geq 0, the fourth step follows from Fact C.2, and the last step follows from simple algebra. ∎

Below we calculate the Lipschitz constant of the score function of \mathsf{pdf} for the continuous case when the covariance is a function of time.

Lemma D.8 (Lipschitz).

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

Then,

dlogpt(x)dxdlogpt(x~)dx~2=1txx~2\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}-\frac{\mathrm{d}\log p_{t}(\widetilde{x})}{\mathrm{d}\widetilde{x}}\|_{2}=\frac{1}{t}\|x-\widetilde{x}\|_{2}
Proof.

We can show

dlogpt(x)dxdlogpt(x~)dx~2=\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}-\frac{\mathrm{d}\log p_{t}(\widetilde{x})}{\mathrm{d}\widetilde{x}}\|_{2}= 1t(𝟏dx)1t(𝟏dx~)2\displaystyle~{}\|\frac{1}{t}(\mathbf{1}_{d}-x)-\frac{1}{t}(\mathbf{1}_{d}-\widetilde{x})\|_{2}
=\displaystyle= 1t(x~x)2\displaystyle~{}\|\frac{1}{t}(\widetilde{x}-x)\|_{2}
=\displaystyle= 1txx~2\displaystyle~{}\frac{1}{t}\|x-\widetilde{x}\|_{2}

where the first step follows from Lemma D.6, the second step follows from simple algebra, the third step follows from Fact C.2. ∎

D.3 A general version for single Gaussian

Now we combine the previous results by calculating a slightly more complex case. Consider p_{t} such that

pt(x)=Prx𝒩(μ(t),Σ(t))[x=x]\displaystyle p_{t}(x)=\Pr_{x^{\prime}\sim\mathcal{N}(\mu(t),\Sigma(t))}[x^{\prime}=x]

where \mu(t)\in\mathbb{R}^{d} and \Sigma(t)\in\mathbb{R}^{d\times d} are differentiable in t, and \Sigma(t) is a symmetric p.s.d. matrix whose smallest singular value is always larger than a fixed \sigma_{\min}>0.

Definition D.9.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

We define

pt(x):=1(2π)d/2det(Σ(t))1/2exp(12(xμ(t))Σ(t)1(xμ(t))).\displaystyle p_{t}(x):=\frac{1}{(2\pi)^{d/2}\det(\Sigma(t))^{1/2}}\exp(-\frac{1}{2}(x-\mu(t))^{\top}\Sigma(t)^{-1}(x-\mu(t))).

Further, we have

logpt(x)=d2log(2π)12logdet(Σ(t))12(xμ(t))Σ(t)1(xμ(t))\displaystyle\log p_{t}(x)=-\frac{d}{2}\log(2\pi)-\frac{1}{2}\log\det(\Sigma(t))-\frac{1}{2}(x-\mu(t))^{\top}\Sigma(t)^{-1}(x-\mu(t))

Below we calculate the score function of \mathsf{pdf} for the continuous case when both the mean and the covariance are functions of time.

Lemma D.10.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

Then,

dlogpt(x)dx=Σ(t)1(xμ(t))\displaystyle\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}=-\Sigma(t)^{-1}(x-\mu(t))
Proof.

We can show

dlogpt(x)dx=\displaystyle\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}= ddx(d2log(2π)12logdet(Σ(t))12(xμ(t))Σ(t)1(xμ(t)))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x}(-\frac{d}{2}\log(2\pi)-\frac{1}{2}\log\det(\Sigma(t))-\frac{1}{2}(x-\mu(t))^{\top}\Sigma(t)^{-1}(x-\mu(t)))
=\displaystyle= 12ddx(xμ(t))Σ(t)1(xμ(t))\displaystyle~{}-\frac{1}{2}\cdot\frac{\mathrm{d}}{\mathrm{d}x}(x-\mu(t))^{\top}\Sigma(t)^{-1}(x-\mu(t))
=\displaystyle= 122Σ(t)1(xμ(t))\displaystyle~{}-\frac{1}{2}\cdot 2\Sigma(t)^{-1}(x-\mu(t))
=\displaystyle= Σ(t)1(xμ(t))\displaystyle~{}-\Sigma(t)^{-1}(x-\mu(t))

where the first step follows from Definition D.9, the second step follows from Fact C.1, the third step follows from Fact C.3, and the last step follows from simple algebra. ∎
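As with the simpler cases, the closed form -\Sigma(t)^{-1}(x-\mu(t)) can be verified numerically for a frozen time t; the sketch below fixes an arbitrary \mu and a random symmetric p.s.d. \Sigma (our own test instance) and compares against finite differences.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)        # symmetric p.s.d. with sigma_min >= 1
Sigma_inv = np.linalg.inv(Sigma)

def log_pt(x):
    # log p_t(x) for N(mu, Sigma); cf. Definition D.9 (t frozen).
    diff = x - mu
    return (-d / 2 * np.log(2 * np.pi)
            - 0.5 * np.linalg.slogdet(Sigma)[1]
            - 0.5 * diff @ Sigma_inv @ diff)

x = rng.normal(size=d)
eps = 1e-6
num_grad = np.array([(log_pt(x + eps * e) - log_pt(x - eps * e)) / (2 * eps)
                     for e in np.eye(d)])
assert np.allclose(num_grad, -Sigma_inv @ (x - mu), atol=1e-5)  # Lemma D.10
```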

Below we calculate the upper bound of the score function of \mathsf{pdf} for the continuous case when both the mean and the covariance are functions of time.

Lemma D.11 (Linear growth).

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

Then,

dlogpt(x)dx21σmin(Σ(t))(μ(t)2+x2)\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}\|_{2}\leq\frac{1}{\sigma_{\min}(\Sigma(t))}\cdot(\|\mu(t)\|_{2}+\|x\|_{2})
Proof.

We can show

dlogpt(x)dx2=\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}\|_{2}= Σ(t)1(xμ(t))2\displaystyle~{}\|-\Sigma(t)^{-1}(x-\mu(t))\|_{2}
\displaystyle\leq Σ(t)1xμ(t)2\displaystyle~{}\|-\Sigma(t)^{-1}\|\cdot\|x-\mu(t)\|_{2}
=\displaystyle= Σ(t)1xμ(t)2\displaystyle~{}\|\Sigma(t)^{-1}\|\cdot\|x-\mu(t)\|_{2}
=\displaystyle= 1σmin(Σ(t))xμ(t)2\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma(t))}\cdot\|x-\mu(t)\|_{2}
\displaystyle\leq 1σmin(Σ(t))(x2+μ(t)2)\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma(t))}\cdot(\|x\|_{2}+\|-\mu(t)\|_{2})
=\displaystyle= 1σmin(Σ(t))(μ(t)2+x2)\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma(t))}\cdot(\|\mu(t)\|_{2}+\|x\|_{2})

where the first step follows from Lemma D.10, the second, third, and fourth steps follow from Fact C.2, the fifth step follows from the triangle inequality (Fact C.2), and the last step follows from simple algebra. ∎

Below we calculate the Lipschitz constant of the score function of \mathsf{pdf} for the continuous case when both the mean and the covariance are functions of time.

Lemma D.12 (Lipschitz).

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

Then,

dlogpt(x)dxdlogpt(x~)dx~2\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}-\frac{\mathrm{d}\log p_{t}(\widetilde{x})}{\mathrm{d}\widetilde{x}}\|_{2}\leq 1σmin(Σ(t))xx~2\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma(t))}\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

dlogpt(x)dxdlogpt(x~)dx~2=\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}-\frac{\mathrm{d}\log p_{t}(\widetilde{x})}{\mathrm{d}\widetilde{x}}\|_{2}= Σ(t)1(xμ(t))(Σ(t)1(x~μ(t)))2\displaystyle~{}\|-\Sigma(t)^{-1}(x-\mu(t))-(-\Sigma(t)^{-1}(\widetilde{x}-\mu(t)))\|_{2}
=\displaystyle= Σ(t)1(xx~)2\displaystyle~{}\|-\Sigma(t)^{-1}(x-\widetilde{x})\|_{2}
\displaystyle\leq Σ(t)1xx~2\displaystyle~{}\|-\Sigma(t)^{-1}\|\cdot\|x-\widetilde{x}\|_{2}
=\displaystyle= 1σmin(Σ(t))xx~2\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma(t))}\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from Lemma D.10, the second step follows from simple algebra, the third step follows from Fact C.2, and the last step follows from Fact C.2. ∎
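The Lipschitz constant 1/\sigma_{\min}(\Sigma(t)) of Lemma D.12 is simply the operator norm of \Sigma(t)^{-1}; a direct numerical check on random pairs (with our own test instance of \Sigma, t frozen) looks as follows.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)
Sigma_inv = np.linalg.inv(Sigma)
sigma_min = np.linalg.svd(Sigma, compute_uv=False).min()
mu = rng.normal(size=d)

score = lambda x: -Sigma_inv @ (x - mu)  # Lemma D.10
for _ in range(10_000):
    x, y = rng.normal(size=d), rng.normal(size=d)
    assert (np.linalg.norm(score(x) - score(y))
            <= np.linalg.norm(x - y) / sigma_min + 1e-12)  # Lemma D.12
```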

Appendix E A General Version for Two Gaussians

In this section, we compute the linear growth and the Lipschitz constant for a mixture of 2 Gaussians where both the mean and the covariance are functions of time. The organization of this section is as follows:

  • Section E.1 defines the probability density function (\mathsf{pdf}) p_{t}(x) that we use, which is a mixture of 2 Gaussians.

  • Section E.2 provides lemmas that are used for calculation of the score function i.e. gradient of the log 𝗉𝖽𝖿\mathsf{pdf} dlogpt(x)dx\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}.

  • Section E.3 provides the expression of the score function.

  • Section E.4 provides lemmas that are used for calculation of the upper bound of the score function.

  • Section E.5 provides the expression of the upper bound of the score function.

  • Section E.6 provides lemmas of upper bound for some base functions that are used for calculation of the Lipschitz constant of the score function.

  • Section E.7 provides lemmas of Lipschitz constant for some base functions that are used for calculation of the Lipschitz constant of the score function.

  • Section E.8 provides lemmas of Lipschitz constant for f(x)f(x) that are used for calculation of the Lipschitz constant of the score function.

  • Section E.9 provides lemmas of Lipschitz constant for g(x)g(x) that are used for calculation of the Lipschitz constant of the score function.

  • Section E.10 provides the expression of the Lipschitz constant of the score function.

First, we define the following. Let \alpha(t)\in(0,1) be a function of the time t. Consider p_{t} such that

p_{t}(x)=\Pr_{x^{\prime}\sim\alpha(t)\mathcal{N}(\mu_{1}(t),\Sigma_{1}(t))+(1-\alpha(t))\mathcal{N}(\mu_{2}(t),\Sigma_{2}(t))}[x^{\prime}=x]

where \mu_{1}(t),\mu_{2}(t)\in\mathbb{R}^{d} and \Sigma_{1}(t),\Sigma_{2}(t)\in\mathbb{R}^{d\times d} are differentiable in t, and \Sigma_{1}(t),\Sigma_{2}(t) are symmetric p.s.d. matrices whose smallest singular values are always larger than a fixed value \sigma_{\min}>0.

For further simplicity of calculation, we write \alpha for \alpha(t).

E.1 Definitions

Below we define function N1N_{1} and N2N_{2}.

Definition E.1.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

We define

N1(x):=1(2π)d/2det(Σ1(t))1/2exp(12(xμ1(t))Σ1(t)1(xμ1(t)))\displaystyle N_{1}(x):=\frac{1}{(2\pi)^{d/2}\det(\Sigma_{1}(t))^{1/2}}\exp(-\frac{1}{2}(x-\mu_{1}(t))^{\top}\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))

and

N2(x):=1(2π)d/2det(Σ2(t))1/2exp(12(xμ2(t))Σ2(t)1(xμ2(t)))\displaystyle N_{2}(x):=\frac{1}{(2\pi)^{d/2}\det(\Sigma_{2}(t))^{1/2}}\exp(-\frac{1}{2}(x-\mu_{2}(t))^{\top}\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))

It is clear that N_{i}(x)\leq\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}, since N_{i}(x) attains its maximum at x=\mu_{i}.

Below we define the \mathsf{pdf} for a mixture of 2 Gaussians.

Definition E.2.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

We define

pt(x):=\displaystyle p_{t}(x):= α(2π)d/2det(Σ1(t))1/2exp(12(xμ1(t))Σ1(t)1(xμ1(t)))+\displaystyle\frac{\alpha}{(2\pi)^{d/2}\det(\Sigma_{1}(t))^{1/2}}\exp(-\frac{1}{2}(x-\mu_{1}(t))^{\top}\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))+
1α(2π)d/2det(Σ2(t))1/2exp(12(xμ2(t))Σ2(t)1(xμ2(t))).\displaystyle\frac{1-\alpha}{(2\pi)^{d/2}\det(\Sigma_{2}(t))^{1/2}}\exp(-\frac{1}{2}(x-\mu_{2}(t))^{\top}\Sigma_{2}(t)^{-1}(x-\mu_{2}(t))).

This can be further rewritten as follows:

pt(x)=αN1(x)+(1α)N2(x)\displaystyle p_{t}(x)=\alpha N_{1}(x)+(1-\alpha)N_{2}(x)

Further, we have

logpt(x)=log(αN1(x)+(1α)N2(x))\displaystyle\log p_{t}(x)=\log(\alpha N_{1}(x)+(1-\alpha)N_{2}(x))

E.2 Lemmas for calculation of the score function

This subsection describes lemmas that are used for further calculation of the score function.

This lemma calculates the gradient of function NiN_{i}.

Lemma E.3.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

Then, for i{1,2}i\in\{1,2\}, we have

dNi(x)dx=Ni(x)(Σi(t)1(xμi(t)))\displaystyle\frac{\mathrm{d}N_{i}(x)}{\mathrm{d}x}=N_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))
Proof.

We can show

dNi(x)dx=\displaystyle\frac{\mathrm{d}N_{i}(x)}{\mathrm{d}x}= ddx(1(2π)d/2det(Σi(t))1/2exp(12(xμi(t))Σi(t)1(xμi(t))))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x}(\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\exp(-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))))
=\displaystyle= Ni(x)ddx(12(xμi(t))Σi(t)1(xμi(t)))\displaystyle~{}N_{i}(x)\cdot\frac{\mathrm{d}}{\mathrm{d}x}(-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))
=\displaystyle= Ni(x)(122Σi(t)1(xμi(t)))\displaystyle~{}N_{i}(x)(-\frac{1}{2}\cdot 2\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))
=\displaystyle= Ni(x)(Σi(t)1(xμi(t)))\displaystyle~{}N_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))

where the first step follows from Definition E.1, the second step follows from Fact C.1, the third step follows from Fact C.3, and the last step follows from simple algebra. ∎

This lemma calculates the gradient of function pt(x)p_{t}(x).

Lemma E.4.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let pt(x)p_{t}(x) be defined as Definition E.2.

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

Then,

dpt(x)dx=αN1(x)(Σ1(t)1(xμ1(t)))+(1α)N2(x)(Σ2(t)1(xμ2(t)))\displaystyle\frac{\mathrm{d}p_{t}(x)}{\mathrm{d}x}=\alpha N_{1}(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))+(1-\alpha)N_{2}(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))
Proof.

We can show

dpt(x)dx=\displaystyle\frac{\mathrm{d}p_{t}(x)}{\mathrm{d}x}= ddx(αN1(x)+(1α)N2(x))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x}(\alpha N_{1}(x)+(1-\alpha)N_{2}(x))
=\displaystyle= αddxN1(x)+(1α)ddxN2(x)\displaystyle~{}\alpha\frac{\mathrm{d}}{\mathrm{d}x}N_{1}(x)+(1-\alpha)\frac{\mathrm{d}}{\mathrm{d}x}N_{2}(x)
=\displaystyle= αN1(x)(Σ1(t)1(xμ1(t)))+(1α)N2(x)(Σ2(t)1(xμ2(t)))\displaystyle~{}\alpha N_{1}(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))+(1-\alpha)N_{2}(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))

where the first step follows from Definition E.2, the second step follows from Fact C.1, and the last step follows from Lemma E.3. ∎

E.3 Calculation of the score function

Below we define f(x)f(x) and g(x)g(x) that simplify further calculation.

Definition E.5.

For further simplicity, we define the following functions:

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

We define

f(x):=αN1(x)αN1(x)+(1α)N2(x)\displaystyle f(x):=\frac{\alpha N_{1}(x)}{\alpha N_{1}(x)+(1-\alpha)N_{2}(x)}

and

g(x):=(1α)N2(x)αN1(x)+(1α)N2(x)\displaystyle g(x):=\frac{(1-\alpha)N_{2}(x)}{\alpha N_{1}(x)+(1-\alpha)N_{2}(x)}

It is clear that 0\leq f(x)\leq 1, 0\leq g(x)\leq 1, and f(x)+g(x)=1.

This lemma calculates the score function.

Lemma E.6.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let pt(x)p_{t}(x) be defined as Definition E.2.

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

  • Let f(x),g(x)f(x),g(x) be defined as Definition E.5.

Then,

dlogpt(x)dx=αN1(x)(Σ1(t)1(xμ1(t)))αN1(x)+(1α)N2(x)+(1α)N2(x)(Σ2(t)1(xμ2(t)))αN1(x)+(1α)N2(x)\displaystyle\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}=\frac{\alpha N_{1}(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))}{\alpha N_{1}(x)+(1-\alpha)N_{2}(x)}+\frac{(1-\alpha)N_{2}(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))}{\alpha N_{1}(x)+(1-\alpha)N_{2}(x)}
Proof.

We can show

dlogpt(x)dx=\displaystyle\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}= dlogpt(x)dpt(x)dpt(x)dx\displaystyle~{}\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}p_{t}(x)}\frac{\mathrm{d}p_{t}(x)}{\mathrm{d}x}
=\displaystyle= 1pt(x)dpt(x)dx\displaystyle~{}\frac{1}{p_{t}(x)}\frac{\mathrm{d}p_{t}(x)}{\mathrm{d}x}
=\displaystyle= 1pt(x)(αN1(x)(Σ1(t)1(xμ1(t)))+(1α)N2(x)(Σ2(t)1(xμ2(t))))\displaystyle~{}\frac{1}{p_{t}(x)}(\alpha N_{1}(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))+(1-\alpha)N_{2}(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t))))
=\displaystyle= αN1(x)(Σ1(t)1(xμ1(t)))αN1(x)+(1α)N2(x)+(1α)N2(x)(Σ2(t)1(xμ2(t)))αN1(x)+(1α)N2(x)\displaystyle~{}\frac{\alpha N_{1}(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))}{\alpha N_{1}(x)+(1-\alpha)N_{2}(x)}+\frac{(1-\alpha)N_{2}(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))}{\alpha N_{1}(x)+(1-\alpha)N_{2}(x)}
=\displaystyle~{}f(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))+g(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))

where the first step follows from Fact C.1, the second step follows from Fact C.1, the third step follows from Lemma E.4, the fourth step follows from Definition E.2, and the last step follows from Definition E.5. ∎
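The mixture score formula of Lemma E.6 can likewise be checked against a finite-difference gradient; the sketch below (Python with NumPy/SciPy, on a random two-component instance of our own choosing, with t frozen) compares the two.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
d, alpha = 3, 0.4
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A1, A2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
S1, S2 = A1 @ A1.T + np.eye(d), A2 @ A2.T + np.eye(d)
N1, N2 = multivariate_normal(mu1, S1).pdf, multivariate_normal(mu2, S2).pdf

log_pt = lambda x: np.log(alpha * N1(x) + (1 - alpha) * N2(x))

x = rng.normal(size=d)
f = alpha * N1(x) / (alpha * N1(x) + (1 - alpha) * N2(x))  # Definition E.5
closed = (f * (-np.linalg.solve(S1, x - mu1))
          + (1 - f) * (-np.linalg.solve(S2, x - mu2)))      # Lemma E.6

eps = 1e-6
num = np.array([(log_pt(x + eps * e) - log_pt(x - eps * e)) / (2 * eps)
                for e in np.eye(d)])
assert np.allclose(num, closed, atol=1e-5)
```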

E.4 Lemmas for the calculation of the upper bound of the score function

This section provides lemmas that are used in the calculation of the upper bound of the score function.

This lemma calculates the upper bound of function Σi(t)1(xμi(t))2\|-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))\|_{2}.

Lemma E.7.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

Then, for each i{1,2}i\in\{1,2\}, we have

Σi(t)1(xμi(t))21σmin(Σi(t))(x2+μi(t)2)\displaystyle\|-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))\|_{2}\leq\frac{1}{\sigma_{\min}(\Sigma_{i}(t))}\cdot(\|x\|_{2}+\|\mu_{i}(t)\|_{2})
Proof.

We can show

Σi(t)1(xμi(t))2\displaystyle\|-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))\|_{2}\leq Σi(t)1xμi(t)2\displaystyle~{}\|-\Sigma_{i}(t)^{-1}\|\cdot\|x-\mu_{i}(t)\|_{2}
=\displaystyle= Σi(t)1xμi(t)2\displaystyle~{}\|\Sigma_{i}(t)^{-1}\|\cdot\|x-\mu_{i}(t)\|_{2}
=\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\mu_{i}(t)\|_{2}
\displaystyle\leq 1σmin(Σi(t))(x2+μi(t)2)\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma_{i}(t))}\cdot(\|x\|_{2}+\|-\mu_{i}(t)\|_{2})
=\displaystyle= 1σmin(Σi(t))(x2+μi(t)2)\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma_{i}(t))}\cdot(\|x\|_{2}+\|\mu_{i}(t)\|_{2})

where the first step follows from Fact C.2, the second step follows from Fact C.2, the third step follows from Fact C.2, the fourth step follows from Fact C.2, and the last step follows from simple algebra. ∎

E.5 Upper bound of the score function

This lemma calculates the upper bound of the score function.

Lemma E.8 (Linear growth).

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let pt(x)p_{t}(x) be defined as Definition E.2.

  • Let f(x),g(x)f(x),g(x) be defined as Definition E.5.

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t))\}.

  • Let μmax:=max{μ1(t)2,μ2(t)2,1}\mu_{\max}:=\max\{\|\mu_{1}(t)\|_{2},\|\mu_{2}(t)\|_{2},1\}.

Then,

dlogpt(x)dx2\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}\|_{2}\leq σmin1μmax(1+x2)\displaystyle~{}\sigma_{\min}^{-1}\cdot\mu_{\max}\cdot(1+\|x\|_{2})
Proof.

We can show

\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}\|_{2}=\displaystyle~{}\|f(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))+g(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))\|_{2}
\displaystyle\leq~{}\|f(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))\|_{2}+\|g(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))\|_{2}
\displaystyle\leq maxi[2]Σi(t)1(xμi(t))2\displaystyle~{}\max_{i\in[2]}\|-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))\|_{2}
\displaystyle\leq maxi[2](1σmin(Σi(t))(x2+μi(t)2))\displaystyle~{}\max_{i\in[2]}(\frac{1}{\sigma_{\min}(\Sigma_{i}(t))}\cdot(\|x\|_{2}+\|\mu_{i}(t)\|_{2}))
\displaystyle\leq σmin1(μmax+x2)\displaystyle~{}\sigma_{\min}^{-1}(\mu_{\max}+\|x\|_{2})
\displaystyle\leq σmin1μmax(1+x2)\displaystyle~{}\sigma_{\min}^{-1}\cdot\mu_{\max}\cdot(1+\|x\|_{2})

where the first step follows from Lemma E.6, the second step follows from Fact C.2, the third step follows from f(x)+g(x)=1f(x)+g(x)=1 and f(x),g(x)0f(x),g(x)\geq 0, the fourth step follows from Lemma E.7, the fifth step follows from definition of μmax\mu_{\max} and σmin\sigma_{\min}, and the last step follows from μmax1\mu_{\max}\geq 1. ∎
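The linear-growth bound of Lemma E.8 can be stress-tested on random points; the sketch below sets up a random mixture instance (again an arbitrary choice of ours) and checks \|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}\|_{2}\leq\sigma_{\min}^{-1}\cdot\mu_{\max}\cdot(1+\|x\|_{2}).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(6)
d, alpha = 3, 0.4
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A1, A2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
S1, S2 = A1 @ A1.T + np.eye(d), A2 @ A2.T + np.eye(d)
sigma_min = min(np.linalg.svd(S, compute_uv=False).min() for S in (S1, S2))
mu_max = max(np.linalg.norm(mu1), np.linalg.norm(mu2), 1.0)
N1, N2 = multivariate_normal(mu1, S1).pdf, multivariate_normal(mu2, S2).pdf

def score(x):
    # Mixture score from Lemma E.6, with f, g as in Definition E.5.
    w1, w2 = alpha * N1(x), (1 - alpha) * N2(x)
    f = w1 / (w1 + w2)
    return (f * (-np.linalg.solve(S1, x - mu1))
            + (1 - f) * (-np.linalg.solve(S2, x - mu2)))

for _ in range(10_000):
    x = rng.normal(scale=3.0, size=d)
    bound = mu_max / sigma_min * (1 + np.linalg.norm(x))  # Lemma E.8
    assert np.linalg.norm(score(x)) <= bound + 1e-12
```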

E.6 Lemmas for Lipschitz calculation: upper bound of base functions

This section provides bounds on base functions that are used in the calculation of the Lipschitz constant of the score function.

This lemma calculates the upper bound of the function \|-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))\|_{2}.

Lemma E.9.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i{1,2}i\in\{1,2\}.

Then, for each i{1,2}i\in\{1,2\}, we have

Σi(t)1(xμi(t))2Rσmin(Σi(t))\displaystyle\|-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))\|_{2}\leq\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}
Proof.

We can show

Σi(t)1(xμi(t))2\displaystyle\|-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))\|_{2}\leq Σi(t)1xμi(t)2\displaystyle~{}\|-\Sigma_{i}(t)^{-1}\|\cdot\|x-\mu_{i}(t)\|_{2}
=\displaystyle= 1σmin(Σi(t))xμi(t)2\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\mu_{i}(t)\|_{2}
\displaystyle\leq Rσmin(Σi(t))\displaystyle~{}\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}

where the first step follows from Fact C.2, the second step follows from Fact C.2, and the last step follows from xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R. ∎

This lemma calculates the lower bound of the function (x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)).

Lemma E.10.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i{1,2}i\in\{1,2\}.

  • Let xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta, where β(0,0.1)\beta\in(0,0.1), for each i{1,2}i\in\{1,2\}.

Then,

(xμi(t))Σi(t)1(xμi(t))β2σmax(Σi(t))\displaystyle(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))\geq\frac{\beta^{2}}{\sigma_{\max}(\Sigma_{i}(t))}
Proof.

We can show

LHS\displaystyle\mathrm{LHS}\geq xμi(t)22σmin(Σi(t)1)\displaystyle~{}\|x-\mu_{i}(t)\|_{2}^{2}\cdot\sigma_{\min}(\Sigma_{i}(t)^{-1})
=\displaystyle= xμi(t)221σmax(Σi(t))\displaystyle~{}\|x-\mu_{i}(t)\|_{2}^{2}\cdot\frac{1}{\sigma_{\max}(\Sigma_{i}(t))}
\displaystyle\geq β2σmax(Σi(t))\displaystyle~{}\frac{\beta^{2}}{\sigma_{\max}(\Sigma_{i}(t))}

where the first step follows from Fact C.2, the second step follows from Fact C.2, and the last step follows from xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta. ∎

This lemma calculates the upper bound of the function \exp(-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))).

Lemma E.11.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i{1,2}i\in\{1,2\}.

  • Let xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta, where β(0,0.1)\beta\in(0,0.1), for each i{1,2}i\in\{1,2\}.

Then,

exp(12(xμi(t))Σi(t)1(xμi(t)))exp(β22σmax(Σi(t)))\displaystyle\exp(-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))\leq\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{i}(t))})
Proof.

We can show

LHS=\displaystyle\mathrm{LHS}= exp(12(xμi(t))Σi(t)1(xμi(t)))\displaystyle~{}\exp(-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))
\displaystyle\leq exp(β22σmax(Σi(t)))\displaystyle~{}\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{i}(t))})

where the first step writes out the left-hand side, and the second step follows from Lemma E.10 together with the monotonicity of \exp. ∎

E.7 Lemmas for Lipschitz calculation: Lipschitz constant of base functions

This section provides Lipschitz constants of base functions that are used in the calculation of the Lipschitz constant of the score function.

This lemma calculates Lipschitz constant of function Σi(t)1(xμi(t))(Σi(t)1(x~μi(t)))2\|-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))-(-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))\|_{2}.

Lemma E.12.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

Then, for i{1,2}i\in\{1,2\}, we have

Σi(t)1(xμi(t))(Σi(t)1(x~μi(t)))21σmin(Σi(t))xx~2\displaystyle\|-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))-(-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))\|_{2}\leq\frac{1}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

LHS=\displaystyle\mathrm{LHS}= Σi(t)1(xx~)2\displaystyle~{}\|-\Sigma_{i}(t)^{-1}(x-\widetilde{x})\|_{2}
\displaystyle\leq Σi(t)1xx~2\displaystyle~{}\|-\Sigma_{i}(t)^{-1}\|\cdot\|x-\widetilde{x}\|_{2}
=\displaystyle= 1σmin(Σi(t))xx~2\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from simple algebra, the second step follows from Fact C.2, and the last step follows from Fact C.2. ∎

This lemma calculates Lipschitz constant of function |12(xμi(t))Σi(t)1(xμi(t))(12(x~μi(t))Σi(t)1(x~μi(t)))||-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))-(-\frac{1}{2}(\widetilde{x}-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))|.

Lemma E.13.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, x~μi(t)2R\|\widetilde{x}-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i{1,2}i\in\{1,2\}.

Then, for each i{1,2}i\in\{1,2\}, we have

|12(xμi(t))Σi(t)1(xμi(t))(12(x~μi(t))Σi(t)1(x~μi(t)))|Rσmin(Σi(t))xx~2\displaystyle|-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))-(-\frac{1}{2}(\widetilde{x}-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))|\leq\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

LHS\displaystyle\mathrm{LHS}\leq |12(xμi(t))Σi(t)1(xμi(t))(12(xμi(t))Σi(t)1(x~μi(t)))|\displaystyle~{}|-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))-(-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))|
+|12(xμi(t))Σi(t)1(x~μi(t))(12(x~μi(t))Σi(t)1(x~μi(t)))|\displaystyle~{}+|-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t))-(-\frac{1}{2}(\widetilde{x}-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))|
\displaystyle\leq |12(xμi(t))Σi(t)1(xx~)|+|12(xx~)Σi(t)1(x~μi(t))|\displaystyle~{}|-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\widetilde{x})|+|-\frac{1}{2}(x-\widetilde{x})^{\top}\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t))|
\displaystyle\leq 12Σi(t)1(xμi(t))2xx~2+12Σi(t)1(xx~)2x~μi(t)2\displaystyle~{}\frac{1}{2}\cdot\|\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))\|_{2}\cdot\|x-\widetilde{x}\|_{2}+\frac{1}{2}\cdot\|\Sigma_{i}(t)^{-1}(x-\widetilde{x})\|_{2}\cdot\|\widetilde{x}-\mu_{i}(t)\|_{2}
\displaystyle\leq 121σmin(Σi(t))Rxx~2+12Σi(t)1(xx~)2R\displaystyle~{}\frac{1}{2}\cdot\frac{1}{\sigma_{\min}(\Sigma_{i}(t))}\cdot R\cdot\|x-\widetilde{x}\|_{2}+\frac{1}{2}\cdot\|\Sigma_{i}(t)^{-1}(x-\widetilde{x})\|_{2}\cdot R
\displaystyle\leq 121σmin(Σi(t))Rxx~2+121σmin(Σi(t))xx~2R\displaystyle~{}\frac{1}{2}\cdot\frac{1}{\sigma_{\min}(\Sigma_{i}(t))}\cdot R\cdot\|x-\widetilde{x}\|_{2}+\frac{1}{2}\cdot\frac{1}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2}\cdot R
=\displaystyle= Rσmin(Σi(t))xx~2\displaystyle~{}\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from Fact C.2, the second step follows from simple algebra, the third step follows from Fact C.2, the fourth step follows from xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, x~μi(t)2R\|\widetilde{x}-\mu_{i}(t)\|_{2}\leq R, the fifth step follows from Lemma E.12, and the last step follows from simple algebra. ∎

This lemma calculates Lipschitz constant of function |Ni(x)Ni(x~)||N_{i}(x)-N_{i}(\widetilde{x})|.

Lemma E.14.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

  • Let \|x-\mu_{i}(t)\|_{2}\leq R and \|\widetilde{x}-\mu_{i}(t)\|_{2}\leq R, where R\geq 1, for each i\in\{1,2\}.

  • Let \|x-\mu_{i}(t)\|_{2}\geq\beta and \|\widetilde{x}-\mu_{i}(t)\|_{2}\geq\beta, where \beta\in(0,0.1), for each i\in\{1,2\}.

Then, for each i{1,2}i\in\{1,2\}, we have

|Ni(x)Ni(x~)|1(2π)d/2det(Σi(t))1/2exp(β22σmax(Σi(t)))Rσmin(Σi(t))xx~2\displaystyle|N_{i}(x)-N_{i}(\widetilde{x})|\leq\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{i}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

|Ni(x)Ni(x~)|=\displaystyle|N_{i}(x)-N_{i}(\widetilde{x})|= |1(2π)d/2det(Σi(t))1/2exp(12(xμi(t))Σi(t)1(xμi(t)))\displaystyle~{}|\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\exp(-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))
1(2π)d/2det(Σi(t))1/2exp(12(x~μi(t))Σi(t)1(x~μi(t)))|\displaystyle~{}-\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\exp(-\frac{1}{2}(\widetilde{x}-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))|
=\displaystyle= 1(2π)d/2det(Σi(t))1/2|exp(12(xμi(t))Σi(t)1(xμi(t)))\displaystyle~{}\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\cdot|\exp(-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))
exp(12(x~μi(t))Σi(t)1(x~μi(t)))|\displaystyle~{}-\exp(-\frac{1}{2}(\widetilde{x}-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))|
\displaystyle\leq 1(2π)d/2det(Σi(t))1/2exp(β22σmax(Σi(t)))\displaystyle~{}\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{i}(t))})
|12(xμi(t))Σi(t)1(xμi(t))(12(x~μi(t))Σi(t)1(x~μi(t)))|\displaystyle~{}\cdot|-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))-(-\frac{1}{2}(\widetilde{x}-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))|
\displaystyle\leq 1(2π)d/2det(Σi(t))1/2exp(β22σmax(Σi(t)))Rσmin(Σi(t))xx~2\displaystyle~{}\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{i}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from Definition E.1, the second step follows from simple algebra, the third step follows from Fact C.9 combined with Lemma E.10, and the last step follows from Lemma E.13. ∎
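The Lipschitz bound of Lemma E.14 can also be checked numerically by sampling pairs of points in the shell \beta\leq\|x-\mu_{i}(t)\|_{2}\leq R; the sketch below does this for a single random component (our own instance, with t frozen).

```python
import numpy as np

rng = np.random.default_rng(7)
d, beta, R = 3, 0.05, 2.0
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
S = A @ A.T + np.eye(d)
S_inv = np.linalg.inv(S)
svals = np.linalg.svd(S, compute_uv=False)
coeff = (2 * np.pi) ** (-d / 2) * np.linalg.det(S) ** (-0.5)

def N(x):
    # Gaussian density as in Definition E.1.
    diff = x - mu
    return coeff * np.exp(-0.5 * diff @ S_inv @ diff)

def sample_shell():
    # A point x with beta <= ||x - mu||_2 <= R.
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    return mu + rng.uniform(beta, R) * u

L = coeff * np.exp(-beta ** 2 / (2 * svals.max())) * R / svals.min()  # Lemma E.14
for _ in range(10_000):
    x, y = sample_shell(), sample_shell()
    assert abs(N(x) - N(y)) <= L * np.linalg.norm(x - y) + 1e-15
```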

This lemma calculates Lipschitz constant of function |αN1(x)+(1α)N2(x)(αN1(x~)+(1α)N2(x~))||\alpha N_{1}(x)+(1-\alpha)N_{2}(x)-(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))|.

Lemma E.15.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let \|x-\mu_{i}(t)\|_{2}\leq R and \|\widetilde{x}-\mu_{i}(t)\|_{2}\leq R, where R\geq 1, for each i\in\{1,2\}.

  • Let \|x-\mu_{i}(t)\|_{2}\geq\beta and \|\widetilde{x}-\mu_{i}(t)\|_{2}\geq\beta, where \beta\in(0,0.1), for each i\in\{1,2\}.

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t))\}.

  • Let σmax:=max{σmax(Σ1(t)),σmax(Σ2(t))}\sigma_{\max}:=\max\{\sigma_{\max}(\Sigma_{1}(t)),\sigma_{\max}(\Sigma_{2}(t))\}.

  • Let detmin:=min{det(Σ1(t)),det(Σ2(t))}\det_{\min}:=\min\{\det(\Sigma_{1}(t)),\det(\Sigma_{2}(t))\}.

Then, we have

|αN1(x)+(1α)N2(x)(αN1(x~)+(1α)N2(x~))|1(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle|\alpha N_{1}(x)+(1-\alpha)N_{2}(x)-(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))|\leq\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

LHS=\displaystyle\mathrm{LHS}= |αN1(x)αN1(x~)+(1α)N2(x)(1α)N2(x~)|\displaystyle~{}|\alpha N_{1}(x)-\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(x)-(1-\alpha)N_{2}(\widetilde{x})|
\displaystyle\leq α|N1(x)N1(x~)|+(1α)|N2(x)N2(x~)|\displaystyle~{}\alpha|N_{1}(x)-N_{1}(\widetilde{x})|+(1-\alpha)|N_{2}(x)-N_{2}(\widetilde{x})|
\displaystyle\leq α(2π)d/2det(Σ1(t))1/2exp(β22σmax(Σ1(t)))Rσmin(Σ1(t))xx~2\displaystyle~{}\frac{\alpha}{(2\pi)^{d/2}\det(\Sigma_{1}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{1}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{1}(t))}\cdot\|x-\widetilde{x}\|_{2}
+1α(2π)d/2det(Σ2(t))1/2exp(β22σmax(Σ2(t)))Rσmin(Σ2(t))xx~2\displaystyle~{}+\frac{1-\alpha}{(2\pi)^{d/2}\det(\Sigma_{2}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{2}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{2}(t))}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq 1(2π)d/2maxi[2]1det(Σi(t))1/2exp(β22σmax(Σi(t)))Rσmin(Σi(t))xx~2\displaystyle~{}\frac{1}{(2\pi)^{d/2}}\max_{i\in[2]}\frac{1}{\det(\Sigma_{i}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{i}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq 1(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle~{}\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from simple algebra, the second step follows from Fact C.2, the third step follows from Lemma E.14, the fourth step follows from α(0,1)\alpha\in(0,1), and the last step follows from the definition of detmin,σmax,σmin\det_{\min},\sigma_{\max},\sigma_{\min}. ∎

This lemma calculates the Lipschitz constant of the reciprocal of the mixture density, i.e., it bounds $|(\alpha N_{1}(x)+(1-\alpha)N_{2}(x))^{-1}-(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}|$.

Lemma E.16.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i{1,2}i\in\{1,2\}.

  • Let xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta, where β(0,0.1)\beta\in(0,0.1), for each i{1,2}i\in\{1,2\}.

  • Let $\alpha N_{1}(z)+(1-\alpha)N_{2}(z)\geq\gamma$ for each $z\in\{x,\widetilde{x}\}$, where $\gamma\in(0,0.1)$.

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t))\}.

  • Let σmax:=max{σmax(Σ1(t)),σmax(Σ2(t))}\sigma_{\max}:=\max\{\sigma_{\max}(\Sigma_{1}(t)),\sigma_{\max}(\Sigma_{2}(t))\}.

  • Let detmin:=min{det(Σ1(t)),det(Σ2(t))}\det_{\min}:=\min\{\det(\Sigma_{1}(t)),\det(\Sigma_{2}(t))\}.

Then,

|(αN1(x)+(1α)N2(x))1(αN1(x~)+(1α)N2(x~))1|\displaystyle~{}|(\alpha N_{1}(x)+(1-\alpha)N_{2}(x))^{-1}-(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}|
\displaystyle\leq γ21(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle~{}\gamma^{-2}\cdot\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

LHS\displaystyle\mathrm{LHS}\leq (αN1(x)+(1α)N2(x))1(αN1(x~)+(1α)N2(x~))1\displaystyle~{}(\alpha N_{1}(x)+(1-\alpha)N_{2}(x))^{-1}\cdot(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}
|αN1(x)+(1α)N2(x)(αN1(x~)+(1α)N2(x~))|\displaystyle~{}\cdot|\alpha N_{1}(x)+(1-\alpha)N_{2}(x)-(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))|
\displaystyle\leq γ2|αN1(x)+(1α)N2(x)(αN1(x~)+(1α)N2(x~))|\displaystyle~{}\gamma^{-2}\cdot|\alpha N_{1}(x)+(1-\alpha)N_{2}(x)-(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))|
\displaystyle\leq γ21(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle~{}\gamma^{-2}\cdot\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from simple algebra, the second step follows from $\alpha N_{1}(z)+(1-\alpha)N_{2}(z)\geq\gamma$ for $z\in\{x,\widetilde{x}\}$, and the last step follows from Lemma E.15. ∎

E.8 Lemmas for Lipschitz calculation: f(x)f(x)

This lemma calculates the Lipschitz constant of the function $f$, i.e., it bounds $|f(x)-f(\widetilde{x})|$.

Lemma E.17.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

  • Let f(x)f(x) be defined as Definition E.5.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i{1,2}i\in\{1,2\}.

  • Let xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta, where β(0,0.1)\beta\in(0,0.1), for each i{1,2}i\in\{1,2\}.

  • Let $\alpha N_{1}(z)+(1-\alpha)N_{2}(z)\geq\gamma$ for each $z\in\{x,\widetilde{x}\}$, where $\gamma\in(0,0.1)$.

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t))\}.

  • Let σmax:=max{σmax(Σ1(t)),σmax(Σ2(t))}\sigma_{\max}:=\max\{\sigma_{\max}(\Sigma_{1}(t)),\sigma_{\max}(\Sigma_{2}(t))\}.

  • Let detmin:=min{det(Σ1(t)),det(Σ2(t))}\det_{\min}:=\min\{\det(\Sigma_{1}(t)),\det(\Sigma_{2}(t))\}.

Then,

|f(x)f(x~)|2αγ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle|f(x)-f(\widetilde{x})|\leq 2\alpha\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

|f(x)f(x~)|=\displaystyle|f(x)-f(\widetilde{x})|= |αN1(x)αN1(x)+(1α)N2(x)αN1(x~)αN1(x~)+(1α)N2(x~)|\displaystyle~{}|\frac{\alpha N_{1}(x)}{\alpha N_{1}(x)+(1-\alpha)N_{2}(x)}-\frac{\alpha N_{1}(\widetilde{x})}{\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x})}|
\displaystyle\leq |αN1(x)αN1(x)+(1α)N2(x)αN1(x)αN1(x~)+(1α)N2(x~)|\displaystyle~{}|\frac{\alpha N_{1}(x)}{\alpha N_{1}(x)+(1-\alpha)N_{2}(x)}-\frac{\alpha N_{1}(x)}{\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x})}|
+|αN1(x)αN1(x~)+(1α)N2(x~)αN1(x~)αN1(x~)+(1α)N2(x~)|\displaystyle~{}+|\frac{\alpha N_{1}(x)}{\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x})}-\frac{\alpha N_{1}(\widetilde{x})}{\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x})}|
=\displaystyle= α|N1(x)||(αN1(x)+(1α)N2(x))1(αN1(x~)+(1α)N2(x~))1|\displaystyle~{}\alpha\cdot|N_{1}(x)|\cdot|(\alpha N_{1}(x)+(1-\alpha)N_{2}(x))^{-1}-(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}|
+α|N1(x)N1(x~)||(αN1(x~)+(1α)N2(x~))1|\displaystyle~{}+\alpha\cdot|N_{1}(x)-N_{1}(\widetilde{x})|\cdot|(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}|

where the first step follows from Definition E.5, the second step follows from Fact C.2, and the last step follows from simple algebra.

For the first term in the above, we have

α|N1(x)||(αN1(x)+(1α)N2(x))1(αN1(x~)+(1α)N2(x~))1|\displaystyle~{}\alpha\cdot|N_{1}(x)|\cdot|(\alpha N_{1}(x)+(1-\alpha)N_{2}(x))^{-1}-(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}|
\displaystyle\leq α1(2π)d/2det(Σ1(t))1/2|(αN1(x)+(1α)N2(x))1(αN1(x~)+(1α)N2(x~))1|\displaystyle~{}\alpha\cdot\frac{1}{(2\pi)^{d/2}\det(\Sigma_{1}(t))^{1/2}}\cdot|(\alpha N_{1}(x)+(1-\alpha)N_{2}(x))^{-1}-(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}|
\displaystyle\leq α1(2π)d/2det(Σ1(t))1/2γ21(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle~{}\alpha\cdot\frac{1}{(2\pi)^{d/2}\det(\Sigma_{1}(t))^{1/2}}\cdot\gamma^{-2}\cdot\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq αγ21(2π)ddetminexp(β22σmax)Rσminxx~2\displaystyle~{}\alpha\cdot\gamma^{-2}\cdot\frac{1}{(2\pi)^{d}\det_{\min}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2} (11)

where the first step follows from $N_{1}(x)\leq\frac{1}{(2\pi)^{d/2}\det(\Sigma_{1}(t))^{1/2}}$, the second step follows from Lemma E.16, and the last step follows from the definition of $\det_{\min}$.

For the second term in the above, we have

α|N1(x)N1(x~)||(αN1(x~)+(1α)N2(x~))1|\displaystyle~{}\alpha\cdot|N_{1}(x)-N_{1}(\widetilde{x})|\cdot|(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}|
\displaystyle\leq αγ1|N1(x)N1(x~)|\displaystyle~{}\alpha\cdot\gamma^{-1}\cdot|N_{1}(x)-N_{1}(\widetilde{x})|
\displaystyle\leq αγ11(2π)d/2det(Σ1(t))1/2exp(β22σmax(Σ1(t)))Rσmin(Σ1(t))xx~2\displaystyle~{}\alpha\cdot\gamma^{-1}\cdot\frac{1}{(2\pi)^{d/2}\det(\Sigma_{1}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{1}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{1}(t))}\cdot\|x-\widetilde{x}\|_{2} (12)

where the first step follows from $\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x})\geq\gamma$, and the second step follows from Lemma E.14.

Combining Eq. (11) and Eq. (12), we have

|f(x)f(x~)|\displaystyle|f(x)-f(\widetilde{x})|\leq αγ21(2π)ddetminexp(β22σmax)Rσminxx~2\displaystyle~{}\alpha\cdot\gamma^{-2}\cdot\frac{1}{(2\pi)^{d}\det_{\min}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
+αγ11(2π)d/2det(Σ1(t))1/2exp(β22σmax(Σ1(t)))Rσmin(Σ1(t))xx~2\displaystyle~{}+\alpha\cdot\gamma^{-1}\cdot\frac{1}{(2\pi)^{d/2}\det(\Sigma_{1}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{1}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{1}(t))}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq αγ21(2π)ddetminexp(β22σmax)Rσminxx~2\displaystyle~{}\alpha\cdot\gamma^{-2}\cdot\frac{1}{(2\pi)^{d}\det_{\min}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
+αγ11(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle~{}+\alpha\cdot\gamma^{-1}\cdot\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq 2αγ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle~{}2\alpha\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from the bound of the first term and the second term, the second step follows from the definition of detmin,σmax,σmin\det_{\min},\sigma_{\max},\sigma_{\min}, and the third step follows from γ<0.1\gamma<0.1. ∎

This lemma calculates the Lipschitz constant of the function $x\mapsto f(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))$, i.e., it bounds $\|f(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))-f(\widetilde{x})(-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t)))\|_{2}$.

Lemma E.18.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

  • Let f(x)f(x) be defined as Definition E.5.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i{1,2}i\in\{1,2\}.

  • Let xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta, where β(0,0.1)\beta\in(0,0.1), for each i{1,2}i\in\{1,2\}.

  • Let $\alpha N_{1}(z)+(1-\alpha)N_{2}(z)\geq\gamma$ for each $z\in\{x,\widetilde{x}\}$, where $\gamma\in(0,0.1)$.

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t))\}.

  • Let σmax:=max{σmax(Σ1(t)),σmax(Σ2(t))}\sigma_{\max}:=\max\{\sigma_{\max}(\Sigma_{1}(t)),\sigma_{\max}(\Sigma_{2}(t))\}.

  • Let detmin:=min{det(Σ1(t)),det(Σ2(t))}\det_{\min}:=\min\{\det(\Sigma_{1}(t)),\det(\Sigma_{2}(t))\}.

Then, we have

f(x)(Σ1(t)1(xμ1(t)))f(x~)(Σ1(t)1(x~μ1(t)))2\displaystyle~{}\|f(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))-f(\widetilde{x})(-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t)))\|_{2}
\displaystyle\leq (1σmin+2αγ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)(Rσmin)2)xx~2\displaystyle~{}(\frac{1}{\sigma_{\min}}+2\alpha\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot(\frac{R}{\sigma_{\min}})^{2})\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

LHS\displaystyle\mathrm{LHS}\leq f(x)(Σ1(t)1(xμ1(t)))f(x)(Σ1(t)1(x~μ1(t)))2\displaystyle~{}\|f(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))-f(x)(-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t)))\|_{2}
+f(x)(Σ1(t)1(x~μ1(t)))f(x~)(Σ1(t)1(x~μ1(t)))2\displaystyle~{}+\|f(x)(-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t)))-f(\widetilde{x})(-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t)))\|_{2}
\displaystyle\leq |f(x)|(Σ1(t)1(xμ1(t)))(Σ1(t)1(x~μ1(t)))2\displaystyle~{}|f(x)|\cdot\|(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))-(-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t)))\|_{2}
+|f(x)f(x~)|Σ1(t)1(x~μ1(t))2\displaystyle~{}+|f(x)-f(\widetilde{x})|\cdot\|-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t))\|_{2}

where the first step follows from Fact C.2, the second step follows from Fact C.2.

For the first term in the above, we have

|f(x)|(Σ1(t)1(xμ1(t)))(Σ1(t)1(x~μ1(t)))2\displaystyle~{}|f(x)|\cdot\|(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))-(-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t)))\|_{2}
\displaystyle\leq (Σ1(t)1(xμ1(t)))(Σ1(t)1(x~μ1(t)))2\displaystyle~{}\|(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))-(-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t)))\|_{2}
\displaystyle\leq 1σmin(Σ1(t))xx~2\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma_{1}(t))}\cdot\|x-\widetilde{x}\|_{2} (13)

where the first step follows from f(x)1f(x)\leq 1, the second step follows from Lemma E.12.

For the second term in the above, we have

|f(x)f(x~)|Σ1(t)1(x~μ1(t))2\displaystyle~{}|f(x)-f(\widetilde{x})|\cdot\|-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t))\|_{2}
\displaystyle\leq Rσmin(Σ1(t))|f(x)f(x~)|\displaystyle~{}\frac{R}{\sigma_{\min}(\Sigma_{1}(t))}\cdot|f(x)-f(\widetilde{x})|
\displaystyle\leq Rσmin(Σ1(t))2αγ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle~{}\frac{R}{\sigma_{\min}(\Sigma_{1}(t))}\cdot 2\alpha\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2} (14)

where the first step follows from Lemma E.9, the second step follows from Lemma E.17.

Combining Eq. (13) and Eq. (14), we have

f(x)(Σ1(t)1(xμ1(t)))f(x~)(Σ1(t)1(x~μ1(t)))2\displaystyle~{}\|f(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))-f(\widetilde{x})(-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t)))\|_{2}
\displaystyle\leq 1σmin(Σ1(t))xx~2\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma_{1}(t))}\cdot\|x-\widetilde{x}\|_{2}
+Rσmin(Σ1(t))2αγ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle~{}+\frac{R}{\sigma_{\min}(\Sigma_{1}(t))}\cdot 2\alpha\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq 1σminxx~2\displaystyle~{}\frac{1}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
+Rσmin2αγ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle~{}+\frac{R}{\sigma_{\min}}\cdot 2\alpha\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
=\displaystyle= (1σmin+2αγ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)(Rσmin)2)xx~2\displaystyle~{}(\frac{1}{\sigma_{\min}}+2\alpha\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot(\frac{R}{\sigma_{\min}})^{2})\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from the bound of the first term and the second term, the second step follows from the definition of detmin,σmax,σmin\det_{\min},\sigma_{\max},\sigma_{\min}, and the last step follows from simple algebra. ∎

E.9 Lemmas for Lipschitz calculation: g(x)g(x)

This lemma calculates the Lipschitz constant of the function $g$, i.e., it bounds $|g(x)-g(\widetilde{x})|$.

Lemma E.19.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

  • Let g(x)g(x) be defined as Definition E.5.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i{1,2}i\in\{1,2\}.

  • Let xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta, where β(0,0.1)\beta\in(0,0.1), for each i{1,2}i\in\{1,2\}.

  • Let $\alpha N_{1}(z)+(1-\alpha)N_{2}(z)\geq\gamma$ for each $z\in\{x,\widetilde{x}\}$, where $\gamma\in(0,0.1)$.

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t))\}.

  • Let σmax:=max{σmax(Σ1(t)),σmax(Σ2(t))}\sigma_{\max}:=\max\{\sigma_{\max}(\Sigma_{1}(t)),\sigma_{\max}(\Sigma_{2}(t))\}.

  • Let detmin:=min{det(Σ1(t)),det(Σ2(t))}\det_{\min}:=\min\{\det(\Sigma_{1}(t)),\det(\Sigma_{2}(t))\}.

Then,

|g(x)g(x~)|2(1α)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle|g(x)-g(\widetilde{x})|\leq 2(1-\alpha)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

|g(x)g(x~)|=\displaystyle|g(x)-g(\widetilde{x})|= |(1α)N2(x)αN1(x)+(1α)N2(x)(1α)N2(x~)αN1(x~)+(1α)N2(x~)|\displaystyle~{}|\frac{(1-\alpha)N_{2}(x)}{\alpha N_{1}(x)+(1-\alpha)N_{2}(x)}-\frac{(1-\alpha)N_{2}(\widetilde{x})}{\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x})}|
\displaystyle\leq |(1α)N2(x)αN1(x)+(1α)N2(x)(1α)N2(x)αN1(x~)+(1α)N2(x~)|\displaystyle~{}|\frac{(1-\alpha)N_{2}(x)}{\alpha N_{1}(x)+(1-\alpha)N_{2}(x)}-\frac{(1-\alpha)N_{2}(x)}{\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x})}|
+|(1α)N2(x)αN1(x~)+(1α)N2(x~)(1α)N2(x~)αN1(x~)+(1α)N2(x~)|\displaystyle~{}+|\frac{(1-\alpha)N_{2}(x)}{\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x})}-\frac{(1-\alpha)N_{2}(\widetilde{x})}{\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x})}|
=\displaystyle= (1α)|N2(x)||(αN1(x)+(1α)N2(x))1(αN1(x~)+(1α)N2(x~))1|\displaystyle~{}(1-\alpha)\cdot|N_{2}(x)|\cdot|(\alpha N_{1}(x)+(1-\alpha)N_{2}(x))^{-1}-(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}|
+(1α)|N2(x)N2(x~)||(αN1(x~)+(1α)N2(x~))1|\displaystyle~{}+(1-\alpha)\cdot|N_{2}(x)-N_{2}(\widetilde{x})|\cdot|(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}|

where the first step follows from Definition E.5, the second step follows from Fact C.2, and the last step follows from simple algebra.

For the first term in the above, we have

(1α)|N2(x)||(αN1(x)+(1α)N2(x))1(αN1(x~)+(1α)N2(x~))1|\displaystyle~{}(1-\alpha)\cdot|N_{2}(x)|\cdot|(\alpha N_{1}(x)+(1-\alpha)N_{2}(x))^{-1}-(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}|
\displaystyle\leq (1α)1(2π)d/2det(Σ2(t))1/2|(αN1(x)+(1α)N2(x))1(αN1(x~)+(1α)N2(x~))1|\displaystyle~{}(1-\alpha)\cdot\frac{1}{(2\pi)^{d/2}\det(\Sigma_{2}(t))^{1/2}}\cdot|(\alpha N_{1}(x)+(1-\alpha)N_{2}(x))^{-1}-(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}|
\displaystyle\leq (1α)1(2π)d/2det(Σ2(t))1/2γ21(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle~{}(1-\alpha)\cdot\frac{1}{(2\pi)^{d/2}\det(\Sigma_{2}(t))^{1/2}}\cdot\gamma^{-2}\cdot\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq (1α)γ21(2π)ddetminexp(β22σmax)Rσminxx~2\displaystyle~{}(1-\alpha)\cdot\gamma^{-2}\cdot\frac{1}{(2\pi)^{d}\det_{\min}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2} (15)

where the first step follows from $N_{2}(x)\leq\frac{1}{(2\pi)^{d/2}\det(\Sigma_{2}(t))^{1/2}}$, the second step follows from Lemma E.16, and the last step follows from the definition of $\det_{\min}$.

For the second term in the above, we have

(1α)|N2(x)N2(x~)||(αN1(x~)+(1α)N2(x~))1|\displaystyle~{}(1-\alpha)\cdot|N_{2}(x)-N_{2}(\widetilde{x})|\cdot|(\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x}))^{-1}|
\displaystyle\leq (1α)γ1|N2(x)N2(x~)|\displaystyle~{}(1-\alpha)\cdot\gamma^{-1}\cdot|N_{2}(x)-N_{2}(\widetilde{x})|
\displaystyle\leq (1α)γ11(2π)d/2det(Σ2(t))1/2exp(β22σmax(Σ2(t)))Rσmin(Σ2(t))xx~2\displaystyle~{}(1-\alpha)\cdot\gamma^{-1}\cdot\frac{1}{(2\pi)^{d/2}\det(\Sigma_{2}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{2}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{2}(t))}\cdot\|x-\widetilde{x}\|_{2} (16)

where the first step follows from $\alpha N_{1}(\widetilde{x})+(1-\alpha)N_{2}(\widetilde{x})\geq\gamma$, and the second step follows from Lemma E.14.

Combining Eq. (15) and Eq. (16), we have

|g(x)g(x~)|\displaystyle|g(x)-g(\widetilde{x})|\leq (1α)γ21(2π)ddetminexp(β22σmax)Rσminxx~2\displaystyle~{}(1-\alpha)\cdot\gamma^{-2}\cdot\frac{1}{(2\pi)^{d}\det_{\min}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
+(1α)γ11(2π)d/2det(Σ2(t))1/2exp(β22σmax(Σ2(t)))Rσmin(Σ2(t))xx~2\displaystyle~{}+(1-\alpha)\cdot\gamma^{-1}\cdot\frac{1}{(2\pi)^{d/2}\det(\Sigma_{2}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{2}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{2}(t))}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq (1α)γ21(2π)ddetminexp(β22σmax)Rσminxx~2\displaystyle~{}(1-\alpha)\cdot\gamma^{-2}\cdot\frac{1}{(2\pi)^{d}\det_{\min}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
+(1α)γ11(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle~{}+(1-\alpha)\cdot\gamma^{-1}\cdot\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq 2(1α)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle~{}2(1-\alpha)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from the bound of the first term and the second term, the second step follows from the definition of detmin,σmax,σmin\det_{\min},\sigma_{\max},\sigma_{\min}, and the last step follows from γ<0.1\gamma<0.1. ∎

This lemma calculates the Lipschitz constant of the function $x\mapsto g(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))$, i.e., it bounds $\|g(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))-g(\widetilde{x})(-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t)))\|_{2}$.

Lemma E.20.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

  • Let g(x)g(x) be defined as Definition E.5.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i{1,2}i\in\{1,2\}.

  • Let xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta, where β(0,0.1)\beta\in(0,0.1), for each i{1,2}i\in\{1,2\}.

  • Let $\alpha N_{1}(z)+(1-\alpha)N_{2}(z)\geq\gamma$ for each $z\in\{x,\widetilde{x}\}$, where $\gamma\in(0,0.1)$.

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t))\}.

  • Let σmax:=max{σmax(Σ1(t)),σmax(Σ2(t))}\sigma_{\max}:=\max\{\sigma_{\max}(\Sigma_{1}(t)),\sigma_{\max}(\Sigma_{2}(t))\}.

  • Let detmin:=min{det(Σ1(t)),det(Σ2(t))}\det_{\min}:=\min\{\det(\Sigma_{1}(t)),\det(\Sigma_{2}(t))\}.

Then, we have

g(x)(Σ2(t)1(xμ2(t)))g(x~)(Σ2(t)1(x~μ2(t)))2\displaystyle~{}\|g(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))-g(\widetilde{x})(-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t)))\|_{2}
\displaystyle\leq (1σmin+2(1α)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)(Rσmin)2)xx~2\displaystyle~{}(\frac{1}{\sigma_{\min}}+2(1-\alpha)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot(\frac{R}{\sigma_{\min}})^{2})\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

LHS\displaystyle\mathrm{LHS}\leq g(x)(Σ2(t)1(xμ2(t)))g(x)(Σ2(t)1(x~μ2(t)))2\displaystyle~{}\|g(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))-g(x)(-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t)))\|_{2}
\displaystyle~{}+\|g(x)(-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t)))-g(\widetilde{x})(-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t)))\|_{2}
\displaystyle\leq |g(x)|(Σ2(t)1(xμ2(t)))(Σ2(t)1(x~μ2(t)))2\displaystyle~{}|g(x)|\cdot\|(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))-(-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t)))\|_{2}
+|g(x)g(x~)|Σ2(t)1(x~μ2(t))2\displaystyle~{}+|g(x)-g(\widetilde{x})|\cdot\|-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t))\|_{2}

where the first step follows from Fact C.2, the second step follows from Fact C.2.

For the first term in the above, we have

|g(x)|(Σ2(t)1(xμ2(t)))(Σ2(t)1(x~μ2(t)))2\displaystyle~{}|g(x)|\cdot\|(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))-(-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t)))\|_{2}
\displaystyle\leq (Σ2(t)1(xμ2(t)))(Σ2(t)1(x~μ2(t)))2\displaystyle~{}\|(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))-(-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t)))\|_{2}
\displaystyle\leq 1σmin(Σ2(t))xx~2\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma_{2}(t))}\cdot\|x-\widetilde{x}\|_{2} (17)

where the first step follows from g(x)1g(x)\leq 1, the second step follows from Lemma E.12.

For the second term in the above, we have

|g(x)g(x~)|Σ2(t)1(x~μ2(t))2\displaystyle~{}|g(x)-g(\widetilde{x})|\cdot\|-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t))\|_{2}
\displaystyle\leq Rσmin(Σ2(t))|g(x)g(x~)|\displaystyle~{}\frac{R}{\sigma_{\min}(\Sigma_{2}(t))}\cdot|g(x)-g(\widetilde{x})|
\displaystyle\leq Rσmin(Σ2(t))2(1α)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle~{}\frac{R}{\sigma_{\min}(\Sigma_{2}(t))}\cdot 2(1-\alpha)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2} (18)

where the first step follows from Lemma E.9, the second step follows from Lemma E.19.

Combining Eq. (17) and Eq. (18), we have

g(x)(Σ2(t)1(xμ2(t)))g(x~)(Σ2(t)1(x~μ2(t)))2\displaystyle~{}\|g(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))-g(\widetilde{x})(-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t)))\|_{2}
\displaystyle\leq 1σmin(Σ2(t))xx~2\displaystyle~{}\frac{1}{\sigma_{\min}(\Sigma_{2}(t))}\cdot\|x-\widetilde{x}\|_{2}
+Rσmin(Σ2(t))2(1α)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle~{}+\frac{R}{\sigma_{\min}(\Sigma_{2}(t))}\cdot 2(1-\alpha)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq 1σminxx~2\displaystyle~{}\frac{1}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
+Rσmin2(1α)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle~{}+\frac{R}{\sigma_{\min}}\cdot 2(1-\alpha)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
=\displaystyle= (1σmin+2(1α)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)(Rσmin)2)xx~2\displaystyle~{}(\frac{1}{\sigma_{\min}}+2(1-\alpha)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot(\frac{R}{\sigma_{\min}})^{2})\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from the bound of the first term and the second term, the second step follows from the definition of detmin,σmax,σmin\det_{\min},\sigma_{\max},\sigma_{\min}, and the last step follows from simple algebra. ∎

E.10 Lipschitz constant of the score function

This lemma calculates the Lipschitz constant of the score function.

Lemma E.21 (Lipschitz).

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let N1(x),N2(x)N_{1}(x),N_{2}(x) be defined as Definition E.1.

  • Let α\alpha\in\mathbb{R} and α(0,1)\alpha\in(0,1).

  • Let pt(x)p_{t}(x) be defined as Definition E.2.

  • Let f(x),g(x)f(x),g(x) be defined as Definition E.5.

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i{1,2}i\in\{1,2\}.

  • Let xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta, where β(0,0.1)\beta\in(0,0.1), for each i{1,2}i\in\{1,2\}.

  • Let $\alpha N_{1}(z)+(1-\alpha)N_{2}(z)\geq\gamma$ for each $z\in\{x,\widetilde{x}\}$, where $\gamma\in(0,0.1)$.

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t))\}.

  • Let σmax:=max{σmax(Σ1(t)),σmax(Σ2(t))}\sigma_{\max}:=\max\{\sigma_{\max}(\Sigma_{1}(t)),\sigma_{\max}(\Sigma_{2}(t))\}.

  • Let detmin:=min{det(Σ1(t)),det(Σ2(t))}\det_{\min}:=\min\{\det(\Sigma_{1}(t)),\det(\Sigma_{2}(t))\}.

Then,

dlogpt(x)dxdlogpt(x~)dx~2\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}-\frac{\mathrm{d}\log p_{t}(\widetilde{x})}{\mathrm{d}\widetilde{x}}\|_{2}\leq (2σmin+2R2γ2σmin2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax))xx~2\displaystyle~{}(\frac{2}{\sigma_{\min}}+\frac{2R^{2}}{\gamma^{2}\sigma_{\min}^{2}}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}}))\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

LHS=\displaystyle\mathrm{LHS}= f(x)(Σ1(t)1(xμ1(t)))+g(x)(Σ2(t)1(xμ2(t)))\displaystyle~{}\|f(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))+g(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))
(f(x~)(Σ1(t)1(x~μ1(t)))+g(x~)(Σ2(t)1(x~μ2(t))))2\displaystyle-(f(\widetilde{x})(-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t)))+g(\widetilde{x})(-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t))))\|_{2}
\displaystyle\leq f(x)(Σ1(t)1(xμ1(t)))f(x~)(Σ1(t)1(x~μ1(t)))2\displaystyle~{}\|f(x)(-\Sigma_{1}(t)^{-1}(x-\mu_{1}(t)))-f(\widetilde{x})(-\Sigma_{1}(t)^{-1}(\widetilde{x}-\mu_{1}(t)))\|_{2}
+g(x)(Σ2(t)1(xμ2(t)))g(x~)(Σ2(t)1(x~μ2(t)))2\displaystyle~{}+\|g(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))-g(\widetilde{x})(-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t)))\|_{2}
\displaystyle\leq (1σmin+2αγ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)(Rσmin)2)xx~2\displaystyle~{}(\frac{1}{\sigma_{\min}}+2\alpha\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot(\frac{R}{\sigma_{\min}})^{2})\cdot\|x-\widetilde{x}\|_{2}
+g(x)(Σ2(t)1(xμ2(t)))g(x~)(Σ2(t)1(x~μ2(t)))2\displaystyle~{}+\|g(x)(-\Sigma_{2}(t)^{-1}(x-\mu_{2}(t)))-g(\widetilde{x})(-\Sigma_{2}(t)^{-1}(\widetilde{x}-\mu_{2}(t)))\|_{2}
\displaystyle\leq (1σmin+2αγ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)(Rσmin)2)xx~2\displaystyle~{}(\frac{1}{\sigma_{\min}}+2\alpha\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot(\frac{R}{\sigma_{\min}})^{2})\cdot\|x-\widetilde{x}\|_{2}
+(1σmin+2(1α)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)(Rσmin)2)xx~2\displaystyle~{}+(\frac{1}{\sigma_{\min}}+2(1-\alpha)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot(\frac{R}{\sigma_{\min}})^{2})\cdot\|x-\widetilde{x}\|_{2}
=\displaystyle= (2σmin+2R2γ2σmin2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax))xx~2\displaystyle~{}(\frac{2}{\sigma_{\min}}+\frac{2R^{2}}{\gamma^{2}\sigma_{\min}^{2}}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}}))\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from Lemma E.6, the second step follows from Fact C.2, the third step follows from Lemma E.18, the fourth step follows from Lemma E.20, and the last step follows from simple algebra. ∎
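As a numerical probe of Lemma E.21 (again a sketch of ours with arbitrary illustrative parameters, not part of the paper's development; here $\gamma$ is estimated as the smallest mixture density seen on the sampled points, a stand-in for the stated assumption), the empirical Lipschitz ratio of the score can be compared against the constant above:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
d, R, beta, alpha = 2, 3.0, 0.05, 0.4

# Two nearby components, so sampled points lie within R of both means.
base = rng.standard_normal(d)
mus = [base, base + 0.3 * rng.standard_normal(d)]
Sigmas = []
for _ in range(2):
    A = rng.standard_normal((d, d))
    Sigmas.append(A @ A.T + np.eye(d))
comps = [multivariate_normal(mean=m, cov=S) for m, S in zip(mus, Sigmas)]
w = np.array([alpha, 1.0 - alpha])

def score(x):
    # f(x), g(x) as in Definition E.5, then the closed form of Lemma E.6.
    p = w * np.array([c.pdf(x) for c in comps])
    f = p / p.sum()
    return sum(fi * (-np.linalg.solve(S, x - m))
               for fi, m, S in zip(f, mus, Sigmas))

# Points roughly satisfying beta <= ||x - mu_i||_2 <= R.
pts = [base + rng.uniform(-1.0, 1.0, d) for _ in range(2000)]
gamma = min(w @ np.array([c.pdf(x) for c in comps]) for x in pts)
evals = np.concatenate([np.linalg.eigvalsh(S) for S in Sigmas])
det_min = min(np.linalg.det(S) for S in Sigmas)
C = (2 / evals.min()
     + 2 * R**2 / (gamma**2 * evals.min() ** 2)
     * (1 / ((2 * np.pi) ** d * det_min) + 1 / ((2 * np.pi) ** (d / 2) * det_min**0.5))
     * np.exp(-beta**2 / (2 * evals.max())))
ratio = max(np.linalg.norm(score(x) - score(y)) / np.linalg.norm(x - y)
            for x, y in zip(pts[::2], pts[1::2]))
print(f"empirical Lipschitz ratio {ratio:.3g} vs. Lemma E.21 constant {C:.3g}")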

Appendix F A General Version for kk Gaussian

In this section, we consider the more general case of a $k$-mixture of Gaussians.

  • Section F.1 provides the definitions for a $k$-mixture of Gaussians.

  • Section F.2 provides the expression of the score function.

  • Section F.3 provides an upper bound on the score function.

  • Section F.4 provides lemmas that are used in the further calculation of the Lipschitz constant.

  • Section F.5 provides the Lipschitz constant for a $k$-mixture of Gaussians.

F.1 Definitions

Let $i\in[k]$. Let $\alpha_{i}(t)\in(0,1)$ with $\sum_{i=1}^{k}\alpha_{i}(t)=1$, where each $\alpha_{i}(t)$ is a function of the time $t$. Consider $p_{t}$ such that

pt(x)=Prxi=1kαi(t)𝒩(μi(t),Σi(t))[x=x]\displaystyle p_{t}(x)=\Pr_{x^{\prime}\sim\sum_{i=1}^{k}\alpha_{i}(t)\mathcal{N}(\mu_{i}(t),\Sigma_{i}(t))}[x^{\prime}=x]

where $\mu_{i}(t)\in\mathbb{R}^{d}$ and $\Sigma_{i}(t)\in\mathbb{R}^{d\times d}$ are differentiable with respect to $t$, and each $\Sigma_{i}(t)$ is a symmetric p.s.d. matrix whose smallest singular value is always at least a fixed value $\sigma_{\min}>0$.

Below we define the 𝗉𝖽𝖿\mathsf{pdf} for a single multivariate Gaussian.

Definition F.1.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let i[k]i\in[k].

  • Let tt\in\mathbb{R}, and t0t\geq 0.

We define

\displaystyle N_{i}(x):=\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\exp(-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))

This is the $\mathsf{pdf}$ of a single Gaussian, so it is clear that $0\leq N_{i}(x)\leq\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}$, since $N_{i}(x)$ attains its maximum at $x=\mu_{i}(t)$.

Below we define the $\mathsf{pdf}$ for a $k$-mixture of Gaussians.

Definition F.2.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let i[k]i\in[k].

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let αi(t)\alpha_{i}(t)\in\mathbb{R}, i=1kαi(t)=1\sum_{i=1}^{k}\alpha_{i}(t)=1, and αi(t)(0,1)\alpha_{i}(t)\in(0,1).

  • Let Ni(x)N_{i}(x) be defined as Definition F.1.

We define

pt(x):=i=1kαi(t)(2π)d/2det(Σi(t))1/2exp(12(xμi(t))Σi(t)1(xμi(t)))\displaystyle p_{t}(x):=\sum_{i=1}^{k}\frac{\alpha_{i}(t)}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\exp(-\frac{1}{2}(x-\mu_{i}(t))^{\top}\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))

This can be further rewritten as follows:

pt(x)=i=1kαi(t)Ni(x)\displaystyle p_{t}(x)=\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x)

Further, we have

logpt(x)=log(i=1kαi(t)Ni(x))\displaystyle\log p_{t}(x)=\log(\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x))
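As a computational aside (a sketch of ours, not part of the formal development), $\log p_{t}(x)$ is best evaluated with a log-sum-exp reduction over the component log-densities, since the individual $N_{i}(x)$ can be exponentially small:

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_p_t(x, alphas, mus, Sigmas):
    # log p_t(x) = log(sum_i alpha_i(t) N_i(x)), evaluated stably.
    log_terms = [np.log(a) + multivariate_normal(mean=m, cov=S).logpdf(x)
                 for a, m, S in zip(alphas, mus, Sigmas)]
    return logsumexp(log_terms)

# Example: an equal-weight 2-mixture in R^2.
val = log_p_t(np.zeros(2), [0.5, 0.5], [np.zeros(2), np.ones(2)], [np.eye(2)] * 2)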

This lemma calculates the gradient of the $\mathsf{pdf}$ for a $k$-mixture of Gaussians.

Lemma F.3.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let i[k]i\in[k].

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let αi(t)\alpha_{i}(t)\in\mathbb{R}, i=1kαi(t)=1\sum_{i=1}^{k}\alpha_{i}(t)=1, and αi(t)(0,1)\alpha_{i}(t)\in(0,1).

  • Let Ni(x)N_{i}(x) be defined as Definition F.1.

  • Let pt(x)p_{t}(x) be defined as Definition F.2

We have

dpt(x)dx=i=1kαi(t)Ni(x)(Σi(t)1(xμi(t)))\displaystyle\frac{\mathrm{d}p_{t}(x)}{\mathrm{d}x}=\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))
Proof.

We can show

dpt(x)dx=\displaystyle\frac{\mathrm{d}p_{t}(x)}{\mathrm{d}x}= ddxi=1kαi(t)Ni(x)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x}\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x)
=\displaystyle= i=1kαi(t)dNi(x)dx\displaystyle~{}\sum_{i=1}^{k}\alpha_{i}(t)\frac{\mathrm{d}N_{i}(x)}{\mathrm{d}x}
=\displaystyle= i=1kαi(t)Ni(x)(Σi(t)1(xμi(t)))\displaystyle~{}\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))

where the first step follows from Definition F.2, the second step follows from Fact C.1, and the last step follows from Lemma E.3. ∎

Below we define $f_{i}$, which simplifies further calculations.

Definition F.4.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let i[k]i\in[k].

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let αi(t)\alpha_{i}(t)\in\mathbb{R}, i=1kαi(t)=1\sum_{i=1}^{k}\alpha_{i}(t)=1, and αi(t)(0,1)\alpha_{i}(t)\in(0,1).

  • Let Ni(x)N_{i}(x) be defined as Definition F.1.

For further simplicity, we define

\displaystyle f_{i}(x):=\frac{\alpha_{i}(t)N_{i}(x)}{\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(x)}

It is clear that $0\leq f_{i}(x)\leq 1$ and $\sum_{i=1}^{k}f_{i}(x)=1$.
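Numerically, $f_{i}(x)$ is exactly a softmax over the per-component log-weights $\log\alpha_{i}(t)+\log N_{i}(x)$, which is how one would compute it stably in practice; a sketch of ours (the helper name mixture_weights is our own):

import numpy as np
from scipy.special import softmax
from scipy.stats import multivariate_normal

def mixture_weights(x, alphas, mus, Sigmas):
    # f_i(x) = alpha_i(t) N_i(x) / sum_j alpha_j(t) N_j(x), as a stable softmax.
    logits = np.array([np.log(a) + multivariate_normal(mean=m, cov=S).logpdf(x)
                       for a, m, S in zip(alphas, mus, Sigmas)])
    return softmax(logits)  # entries lie in [0, 1] and sum to 1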

F.2 Calculation of the score function

This lemma calculates the score function for a $k$-mixture of Gaussians.

Lemma F.5.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let i[k]i\in[k].

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let αi(t)\alpha_{i}(t)\in\mathbb{R}, i=1kαi(t)=1\sum_{i=1}^{k}\alpha_{i}(t)=1, and αi(t)(0,1)\alpha_{i}(t)\in(0,1).

  • Let Ni(x)N_{i}(x) be defined as Definition F.1.

  • Let pt(x)p_{t}(x) be defined as Definition F.2.

  • Let fi(x)f_{i}(x) be defined as Definition F.4.

We have

dlogpt(x)dx=i=1kfi(x)(Σi(t)1(xμi(t)))\displaystyle\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}=\sum_{i=1}^{k}f_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))
Proof.

We can show

dlogpt(x)dx=\displaystyle\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}= dlogpt(x)dpt(x)dpt(x)dx\displaystyle~{}\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}p_{t}(x)}\frac{\mathrm{d}p_{t}(x)}{\mathrm{d}x}
=\displaystyle= 1pt(x)dpt(x)dx\displaystyle~{}\frac{1}{p_{t}(x)}\frac{\mathrm{d}p_{t}(x)}{\mathrm{d}x}
=\displaystyle= 1pt(x)i=1kαi(t)Ni(x)(Σi(t)1(xμi(t)))\displaystyle~{}\frac{1}{p_{t}(x)}\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))
=\displaystyle= i=1kαi(t)Ni(x)(Σi(t)1(xμi(t)))i=1kαi(t)Ni(x)\displaystyle~{}\frac{\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))}{\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x)}
=\displaystyle= i=1kfi(x)(Σi(t)1(xμi(t)))\displaystyle~{}\sum_{i=1}^{k}f_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))

where the first step follows from Fact C.1, the second step follows from Fact C.1, the third step follows from Lemma F.3, the fourth step follows from Definition F.2, and the last step follows from Definition F.4. ∎
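This closed form is direct to implement; in the sketch below (ours, with our own helper names and parameter choices), the formula is also validated against a central finite-difference gradient of $\log p_{t}$:

import numpy as np
from scipy.stats import multivariate_normal

def score(x, alphas, mus, Sigmas):
    # Lemma F.5: d log p_t(x)/dx = sum_i f_i(x) * (-Sigma_i(t)^{-1} (x - mu_i(t))).
    p = np.array([a * multivariate_normal(mean=m, cov=S).pdf(x)
                  for a, m, S in zip(alphas, mus, Sigmas)])
    f = p / p.sum()  # in practice, prefer the softmax form sketched above
    return sum(fi * (-np.linalg.solve(S, x - m))
               for fi, m, S in zip(f, mus, Sigmas))

def log_p(y, alphas, mus, Sigmas):
    return np.log(sum(a * multivariate_normal(mean=m, cov=S).pdf(y)
                      for a, m, S in zip(alphas, mus, Sigmas)))

# Finite-difference check at a random point in d = 2.
rng = np.random.default_rng(0)
alphas, mus, Sigmas = [0.3, 0.7], [np.zeros(2), np.ones(2)], [np.eye(2), 2 * np.eye(2)]
x, eps = rng.standard_normal(2), 1e-5
fd = np.array([(log_p(x + eps * e, alphas, mus, Sigmas)
                - log_p(x - eps * e, alphas, mus, Sigmas)) / (2 * eps)
               for e in np.eye(2)])
assert np.allclose(score(x, alphas, mus, Sigmas), fd, atol=1e-6)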

F.3 Upper bound of the score function

This lemma calculates an upper bound on the score function for a $k$-mixture of Gaussians.

Lemma F.6.

If the following conditions hold

  • Let xdx\in\mathbb{R}^{d}.

  • Let i[k]i\in[k].

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let αi(t)\alpha_{i}(t)\in\mathbb{R}, i=1kαi(t)=1\sum_{i=1}^{k}\alpha_{i}(t)=1, and αi(t)(0,1)\alpha_{i}(t)\in(0,1).

  • Let pt(x)p_{t}(x) be defined as Definition F.2.

  • Let fi(x)f_{i}(x) be defined as Definition F.4.

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t)),,σmin(Σk(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t)),\dots,\sigma_{\min}(\Sigma_{k}(t))\}.

  • Let μmax:=max{1,μ1(t)2,μ2(t)2,,μk(t)2}\mu_{\max}:=\max\{1,\|\mu_{1}(t)\|_{2},\|\mu_{2}(t)\|_{2},\dots,\|\mu_{k}(t)\|_{2}\}.

Then, we have

dlogpt(x)dx2σmin1μmax(1+x2)\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}\|_{2}\leq\sigma_{\min}^{-1}\cdot\mu_{\max}\cdot(1+\|x\|_{2})
Proof.

We can show

dlogpt(x)dx2=\displaystyle\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}\|_{2}= i=1kfi(x)(Σi(t)1(xμi(t)))2\displaystyle~{}\|\sum_{i=1}^{k}f_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))\|_{2}
\displaystyle\leq i=1kfi(x)Σi(t)1(xμi(t))2\displaystyle~{}\sum_{i=1}^{k}f_{i}(x)\|-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))\|_{2}
\displaystyle\leq maxi[k]Σi(t)1(xμi(t))2\displaystyle~{}\max_{i\in[k]}\|-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t))\|_{2}
\displaystyle\leq maxi[k]1σmin(Σi(t))(x2+μi(t)2)\displaystyle~{}\max_{i\in[k]}\frac{1}{\sigma_{\min}(\Sigma_{i}(t))}\cdot(\|x\|_{2}+\|\mu_{i}(t)\|_{2})
\displaystyle\leq σmin1(μmax+x2)\displaystyle~{}\sigma_{\min}^{-1}(\mu_{\max}+\|x\|_{2})
\displaystyle\leq σmin1μmax(1+x2)\displaystyle~{}\sigma_{\min}^{-1}\cdot\mu_{\max}\cdot(1+\|x\|_{2})

where the first step follows from Lemma F.5, the second step follows from triangle inequality, the third step follows from i=1kfi(x)=1\sum_{i=1}^{k}f_{i}(x)=1 and fi(x)0f_{i}(x)\geq 0, the fourth step follows from Lemma E.7, the fifth step follows from definition of μmax\mu_{\max} and σmin\sigma_{\min}, and the last step follows from μmax1\mu_{\max}\geq 1. ∎
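Unlike the Lipschitz lemmas below, this bound needs no annulus condition on $x$, so it can be asserted directly at random points; a sketch of ours with arbitrary parameters:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
d, k = 2, 4
mus = [rng.standard_normal(d) for _ in range(k)]
Sigmas = []
for _ in range(k):
    A = rng.standard_normal((d, d))
    Sigmas.append(A @ A.T + np.eye(d))
alphas = np.full(k, 1.0 / k)
comps = [multivariate_normal(mean=m, cov=S) for m, S in zip(mus, Sigmas)]

def score(x):
    p = alphas * np.array([c.pdf(x) for c in comps])
    f = p / p.sum()  # f_i(x) from Definition F.4
    return sum(fi * (-np.linalg.solve(S, x - m))
               for fi, m, S in zip(f, mus, Sigmas))

sig_min = min(np.linalg.eigvalsh(S)[0] for S in Sigmas)
mu_max = max(1.0, max(np.linalg.norm(m) for m in mus))
for _ in range(1000):
    x = 3.0 * rng.standard_normal(d)
    assert np.linalg.norm(score(x)) <= mu_max * (1 + np.linalg.norm(x)) / sig_min
print("Lemma F.6 bound verified on 1000 random points")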

F.4 Lemmas for Lipschitz calculation

This section provides lemmas used in the calculation of the Lipschitz constant of the score function for a $k$-mixture of Gaussians.

This lemma calculates the Lipschitz constant of the mixture density, i.e., it bounds $|\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x)-\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(\widetilde{x})|$.

Lemma F.7.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let i[k]i\in[k].

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let αi(t)\alpha_{i}(t)\in\mathbb{R}, i=1kαi(t)=1\sum_{i=1}^{k}\alpha_{i}(t)=1, and αi(t)(0,1)\alpha_{i}(t)\in(0,1).

  • Let Ni(x)N_{i}(x) be defined as Definition F.1.

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i[k]i\in[k].

  • Let xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta, where β(0,0.1)\beta\in(0,0.1), for each i[k]i\in[k].

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t)),,σmin(Σk(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t)),\dots,\sigma_{\min}(\Sigma_{k}(t))\}.

  • Let σmax:=max{σmax(Σ1(t)),σmax(Σ2(t)),,σmax(Σk(t))}\sigma_{\max}:=\max\{\sigma_{\max}(\Sigma_{1}(t)),\sigma_{\max}(\Sigma_{2}(t)),\dots,\sigma_{\max}(\Sigma_{k}(t))\}.

  • Let detmin:=min{det(Σ1(t)),det(Σ2(t)),,det(Σk(t))}\det_{\min}:=\min\{\det(\Sigma_{1}(t)),\det(\Sigma_{2}(t)),\dots,\det(\Sigma_{k}(t))\}.

Then, we have

|i=1kαi(t)Ni(x)i=1kαi(t)Ni(x~)|1(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle|\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x)-\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(\widetilde{x})|\leq\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

LHS=\displaystyle\mathrm{LHS}= |i=1kαi(t)(Ni(x)Ni(x~))|\displaystyle~{}|\sum_{i=1}^{k}\alpha_{i}(t)(N_{i}(x)-N_{i}(\widetilde{x}))|
\displaystyle\leq i=1kαi(t)|Ni(x)Ni(x~)|\displaystyle~{}\sum_{i=1}^{k}\alpha_{i}(t)|N_{i}(x)-N_{i}(\widetilde{x})|
\displaystyle\leq i=1kαi(t)1(2π)d/2det(Σi(t))1/2exp(β22σmax(Σi(t)))Rσmin(Σi(t))xx~2\displaystyle~{}\sum_{i=1}^{k}\alpha_{i}(t)\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{i}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq maxi[k]1(2π)d/2det(Σi(t))1/2exp(β22σmax(Σi(t)))Rσmin(Σi(t))xx~2\displaystyle~{}\max_{i\in[k]}\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{i}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq 1(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle~{}\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from simple algebra, the second step follows from Fact C.2, the third step follows from Lemma E.14, the fourth step follows from i=1kαi(t)=1\sum_{i=1}^{k}\alpha_{i}(t)=1, and αi(t)(0,1)\alpha_{i}(t)\in(0,1), and the last step follows from the definition of detmin,σmax,σmin\det_{\min},\sigma_{\max},\sigma_{\min}. ∎

This lemma calculates the Lipschitz constant of the reciprocal of the mixture density, i.e., it bounds $|(\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x))^{-1}-(\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(\widetilde{x}))^{-1}|$.

Lemma F.8.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let i[k]i\in[k].

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let αi(t)\alpha_{i}(t)\in\mathbb{R}, i=1kαi(t)=1\sum_{i=1}^{k}\alpha_{i}(t)=1, and αi(t)(0,1)\alpha_{i}(t)\in(0,1).

  • Let Ni(x)N_{i}(x) be defined as Definition F.1.

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i[k]i\in[k].

  • Let xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta, where β(0,0.1)\beta\in(0,0.1), for each i[k]i\in[k].

  • Let $\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(z)\geq\gamma$ for each $z\in\{x,\widetilde{x}\}$, where $\gamma\in(0,0.1)$.

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t)),,σmin(Σk(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t)),\dots,\sigma_{\min}(\Sigma_{k}(t))\}.

  • Let σmax:=max{σmax(Σ1(t)),σmax(Σ2(t)),,σmax(Σk(t))}\sigma_{\max}:=\max\{\sigma_{\max}(\Sigma_{1}(t)),\sigma_{\max}(\Sigma_{2}(t)),\dots,\sigma_{\max}(\Sigma_{k}(t))\}.

  • Let detmin:=min{det(Σ1(t)),det(Σ2(t)),,det(Σk(t))}\det_{\min}:=\min\{\det(\Sigma_{1}(t)),\det(\Sigma_{2}(t)),\dots,\det(\Sigma_{k}(t))\}.

Then, we have

|(i=1kαi(t)Ni(x))1(i=1kαi(t)Ni(x~))1|γ21(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle|(\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x))^{-1}-(\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(\widetilde{x}))^{-1}|\leq\gamma^{-2}\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

LHS=\displaystyle\mathrm{LHS}= (i=1kαi(t)Ni(x))1(i=1kαi(t)Ni(x~))1|i=1kαi(t)Ni(x)i=1kαi(t)Ni(x~)|\displaystyle~{}(\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x))^{-1}\cdot(\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(\widetilde{x}))^{-1}\cdot|\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x)-\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(\widetilde{x})|
\displaystyle\leq γ2|i=1kαi(t)Ni(x)i=1kαi(t)Ni(x~)|\displaystyle~{}\gamma^{-2}\cdot|\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x)-\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(\widetilde{x})|
\displaystyle\leq γ21(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle~{}\gamma^{-2}\cdot\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from simple algebra, the second step follows from $\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(z)\geq\gamma$ for $z\in\{x,\widetilde{x}\}$, and the last step follows from Lemma F.7. ∎

This lemma calculates the Lipschitz constant of the function $f_{i}$, i.e., it bounds $|f_{i}(x)-f_{i}(\widetilde{x})|$.

Lemma F.9.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let i[k]i\in[k].

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let αi(t)\alpha_{i}(t)\in\mathbb{R}, i=1kαi(t)=1\sum_{i=1}^{k}\alpha_{i}(t)=1, and αi(t)(0,1)\alpha_{i}(t)\in(0,1).

  • Let Ni(x)N_{i}(x) be defined as Definition F.1.

  • Let fi(x)f_{i}(x) be defined as Definition F.4.

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i[k]i\in[k].

  • Let xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta, where β(0,0.1)\beta\in(0,0.1), for each i[k]i\in[k].

  • Let $\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(z)\geq\gamma$ for each $z\in\{x,\widetilde{x}\}$, where $\gamma\in(0,0.1)$.

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t)),,σmin(Σk(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t)),\dots,\sigma_{\min}(\Sigma_{k}(t))\}.

  • Let σmax:=max{σmax(Σ1(t)),σmax(Σ2(t)),,σmax(Σk(t))}\sigma_{\max}:=\max\{\sigma_{\max}(\Sigma_{1}(t)),\sigma_{\max}(\Sigma_{2}(t)),\dots,\sigma_{\max}(\Sigma_{k}(t))\}.

  • Let detmin:=min{det(Σ1(t)),det(Σ2(t)),,det(Σk(t))}\det_{\min}:=\min\{\det(\Sigma_{1}(t)),\det(\Sigma_{2}(t)),\dots,\det(\Sigma_{k}(t))\}.

Then, for each i[k]i\in[k], we have

\displaystyle|f_{i}(x)-f_{i}(\widetilde{x})|\leq 2\alpha_{i}(t)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

\displaystyle|f_{i}(x)-f_{i}(\widetilde{x})|=~{}|\alpha_{i}(t)N_{i}(x)\cdot(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(x))^{-1}-\alpha_{i}(t)N_{i}(\widetilde{x})\cdot(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(\widetilde{x}))^{-1}|
\displaystyle\leq~{}|\alpha_{i}(t)N_{i}(x)\cdot(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(x))^{-1}-\alpha_{i}(t)N_{i}(x)\cdot(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(\widetilde{x}))^{-1}|
\displaystyle~{}+|\alpha_{i}(t)N_{i}(x)\cdot(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(\widetilde{x}))^{-1}-\alpha_{i}(t)N_{i}(\widetilde{x})\cdot(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(\widetilde{x}))^{-1}|
\displaystyle\leq~{}\alpha_{i}(t)N_{i}(x)\cdot|(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(x))^{-1}-(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(\widetilde{x}))^{-1}|
\displaystyle~{}+\alpha_{i}(t)(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(\widetilde{x}))^{-1}|N_{i}(x)-N_{i}(\widetilde{x})|

where the first step follows from Definition F.4, the second step follows from Fact C.2, and the last step follows from simple algebra.

For the first term in the above, we have

\displaystyle~{}\alpha_{i}(t)N_{i}(x)\cdot|(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(x))^{-1}-(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(\widetilde{x}))^{-1}|
\displaystyle\leq~{}\alpha_{i}(t)\cdot\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\cdot|(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(x))^{-1}-(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(\widetilde{x}))^{-1}|
\displaystyle\leq~{}\alpha_{i}(t)\cdot\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\cdot\gamma^{-2}\cdot\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq~{}\alpha_{i}(t)\cdot\gamma^{-2}\cdot\frac{1}{(2\pi)^{d}\det_{\min}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2} \qquad (19)

where the first step follows from $N_{i}(x)\leq\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}$, the second step follows from Lemma F.8, and the last step follows from the definition of $\det_{\min}$.

For the second term in the above, we have

\displaystyle~{}\alpha_{i}(t)(\sum_{j=1}^{k}\alpha_{j}(t)N_{j}(\widetilde{x}))^{-1}|N_{i}(x)-N_{i}(\widetilde{x})|
\displaystyle\leq~{}\alpha_{i}(t)\cdot\gamma^{-1}\cdot|N_{i}(x)-N_{i}(\widetilde{x})|
\displaystyle\leq~{}\alpha_{i}(t)\cdot\gamma^{-1}\cdot\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{i}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2} \qquad (20)

where the first step follows from $\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(\widetilde{x})\geq\gamma$, and the second step follows from Lemma E.14.

Combining Eq. (19) and Eq. (20), we have

|fi(x)fi(x~)|\displaystyle|f_{i}(x)-f_{i}(\widetilde{x})|\leq αi(t)γ21(2π)ddetminexp(β22σmax)Rσminxx~2\displaystyle~{}\alpha_{i}(t)\cdot\gamma^{-2}\cdot\frac{1}{(2\pi)^{d}\det_{\min}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
+αi(t)γ11(2π)d/2det(Σi(t))1/2exp(β22σmax(Σi(t)))Rσmin(Σi(t))xx~2\displaystyle~{}+\alpha_{i}(t)\cdot\gamma^{-1}\cdot\frac{1}{(2\pi)^{d/2}\det(\Sigma_{i}(t))^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}(\Sigma_{i}(t))})\cdot\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq αi(t)γ21(2π)ddetminexp(β22σmax)Rσminxx~2\displaystyle~{}\alpha_{i}(t)\cdot\gamma^{-2}\cdot\frac{1}{(2\pi)^{d}\det_{\min}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
+αi(t)γ11(2π)d/2detmin1/2exp(β22σmax)Rσminxx~2\displaystyle~{}+\alpha_{i}(t)\cdot\gamma^{-1}\cdot\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}}\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq 2αi(t)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle~{}2\alpha_{i}(t)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from the bound of the first term and the second term, the second step follows from the definition of detmin,σmax,σmin\det_{\min},\sigma_{\max},\sigma_{\min}, and the last step follows from γ<0.1\gamma<0.1. ∎

This lemma calculates the Lipschitz constant of the function $x\mapsto f_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))$, i.e., it bounds $\|f_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))-f_{i}(\widetilde{x})(-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))\|_{2}$.

Lemma F.10.

If the following conditions hold

  • Let x,x~dx,\widetilde{x}\in\mathbb{R}^{d}.

  • Let i[k]i\in[k].

  • Let tt\in\mathbb{R}, and t0t\geq 0.

  • Let αi(t)\alpha_{i}(t)\in\mathbb{R}, i=1kαi(t)=1\sum_{i=1}^{k}\alpha_{i}(t)=1, and αi(t)(0,1)\alpha_{i}(t)\in(0,1).

  • Let Ni(x)N_{i}(x) be defined as Definition F.1.

  • Let fi(x)f_{i}(x) be defined as Definition F.4.

  • Let xμi(t)2R\|x-\mu_{i}(t)\|_{2}\leq R, where R1R\geq 1, for each i[k]i\in[k].

  • Let xμi(t)2β\|x-\mu_{i}(t)\|_{2}\geq\beta, where β(0,0.1)\beta\in(0,0.1), for each i[k]i\in[k].

  • Let $\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(z)\geq\gamma$ for each $z\in\{x,\widetilde{x}\}$, where $\gamma\in(0,0.1)$.

  • Let σmin:=min{σmin(Σ1(t)),σmin(Σ2(t)),,σmin(Σk(t))}\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t)),\dots,\sigma_{\min}(\Sigma_{k}(t))\}.

  • Let σmax:=max{σmax(Σ1(t)),σmax(Σ2(t)),,σmax(Σk(t))}\sigma_{\max}:=\max\{\sigma_{\max}(\Sigma_{1}(t)),\sigma_{\max}(\Sigma_{2}(t)),\dots,\sigma_{\max}(\Sigma_{k}(t))\}.

  • Let detmin:=min{det(Σ1(t)),det(Σ2(t)),,det(Σk(t))}\det_{\min}:=\min\{\det(\Sigma_{1}(t)),\det(\Sigma_{2}(t)),\dots,\det(\Sigma_{k}(t))\}.

Then, for each i[k]i\in[k], we have

fi(x)(Σi(t)1(xμi(t)))fi(x~)(Σi(t)1(x~μi(t)))2\displaystyle~{}\|f_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))-f_{i}(\widetilde{x})(-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))\|_{2}
\displaystyle\leq (|fi(x)|σmin+2αi(t)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)(Rσmin)2)xx~2\displaystyle~{}(\frac{|f_{i}(x)|}{\sigma_{\min}}+2\alpha_{i}(t)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot(\frac{R}{\sigma_{\min}})^{2})\cdot\|x-\widetilde{x}\|_{2}
Proof.

We can show

LHS\displaystyle\mathrm{LHS}\leq fi(x)(Σi(t)1(xμi(t)))fi(x)(Σi(t)1(x~μi(t)))2\displaystyle~{}\|f_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))-f_{i}(x)(-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))\|_{2}
+fi(x)(Σi(t)1(x~μi(t)))fi(x~)(Σi(t)1(x~μi(t)))2\displaystyle~{}+\|f_{i}(x)(-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))-f_{i}(\widetilde{x})(-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))\|_{2}
\displaystyle\leq |fi(x)|(Σi(t)1(xμi(t)))(Σi(t)1(x~μi(t)))2\displaystyle~{}|f_{i}(x)|\cdot\|(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))-(-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))\|_{2}
+|fi(x)fi(x~)|Σi(t)1(x~μi(t))2\displaystyle~{}+|f_{i}(x)-f_{i}(\widetilde{x})|\cdot\|-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t))\|_{2}

where the first step follows from Fact C.2, the second step follows from Fact C.2.

For the first term in the above, we have

|fi(x)|(Σi(t)1(xμi(t)))(Σi(t)1(x~μi(t)))2\displaystyle~{}|f_{i}(x)|\cdot\|(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))-(-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))\|_{2}
\displaystyle\leq |fi(x)|σmin(Σi(t))xx~2\displaystyle~{}\frac{|f_{i}(x)|}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2} (21)

where the first step follows from Lemma E.12.

For the second term in the above, we have

|fi(x)fi(x~)|Σi(t)1(x~μi(t))2\displaystyle~{}|f_{i}(x)-f_{i}(\widetilde{x})|\cdot\|-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t))\|_{2}
\displaystyle\leq Rσmin(Σi(t))|fi(x)fi(x~)|\displaystyle~{}\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}\cdot|f_{i}(x)-f_{i}(\widetilde{x})|
\displaystyle\leq Rσmin(Σi(t))2αi(t)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle~{}\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}\cdot 2\alpha_{i}(t)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2} (22)

where the first step follows from Lemma E.9, the second step follows from Lemma F.9.

Combining Eq. (21) and Eq. (22), we have

fi(x)(Σi(t)1(xμi(t)))fi(x~)(Σi(t)1(x~μi(t)))2\displaystyle~{}\|f_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))-f_{i}(\widetilde{x})(-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))\|_{2}
\displaystyle\leq |fi(x)|σmin(Σi(t))xx~2\displaystyle~{}\frac{|f_{i}(x)|}{\sigma_{\min}(\Sigma_{i}(t))}\cdot\|x-\widetilde{x}\|_{2}
+Rσmin(Σi(t))2αi(t)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle~{}+\frac{R}{\sigma_{\min}(\Sigma_{i}(t))}\cdot 2\alpha_{i}(t)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
\displaystyle\leq |fi(x)|σminxx~2\displaystyle~{}\frac{|f_{i}(x)|}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
+Rσmin2αi(t)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)Rσminxx~2\displaystyle~{}+\frac{R}{\sigma_{\min}}\cdot 2\alpha_{i}(t)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot\frac{R}{\sigma_{\min}}\cdot\|x-\widetilde{x}\|_{2}
=\displaystyle= (|fi(x)|σmin+2αi(t)γ2(1(2π)ddetmin+1(2π)d/2detmin1/2)exp(β22σmax)(Rσmin)2)xx~2\displaystyle~{}(\frac{|f_{i}(x)|}{\sigma_{\min}}+2\alpha_{i}(t)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot(\frac{R}{\sigma_{\min}})^{2})\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from the bound of the first term and the second term, the second step follows from the definition of detmin,σmax,σmin\det_{\min},\sigma_{\max},\sigma_{\min}, and the last step follows from simple algebra. ∎

F.5 Lipschitz constant of the score function

This lemma calculates the Lipschitz constant of the score function for a $k$-mixture of Gaussians.

Lemma F.11.

If the following conditions hold

  • Let $x,\widetilde{x}\in\mathbb{R}^{d}$.

  • Let $i\in[k]$.

  • Let $t\in\mathbb{R}$, and $t\geq 0$.

  • Let $\alpha_{i}(t)\in\mathbb{R}$, $\sum_{i=1}^{k}\alpha_{i}(t)=1$, and $\alpha_{i}(t)\in(0,1)$.

  • Let $N_{i}(x)$ be defined as in Definition F.1.

  • Let $p_{t}(x)$ be defined as in Definition F.2.

  • Let $f_{i}(x)$ be defined as in Definition F.4.

  • Let $\|x-\mu_{i}(t)\|_{2}\leq R$, where $R\geq 1$, for each $i\in[k]$.

  • Let $\|x-\mu_{i}(t)\|_{2}\geq\beta$, where $\beta\in(0,0.1)$, for each $i\in[k]$.

  • Let $\sum_{i=1}^{k}\alpha_{i}(t)N_{i}(x)\geq\gamma$, where $\gamma\in(0,0.1)$.

  • Let $\sigma_{\min}:=\min\{\sigma_{\min}(\Sigma_{1}(t)),\sigma_{\min}(\Sigma_{2}(t)),\dots,\sigma_{\min}(\Sigma_{k}(t))\}$.

  • Let $\sigma_{\max}:=\max\{\sigma_{\max}(\Sigma_{1}(t)),\sigma_{\max}(\Sigma_{2}(t)),\dots,\sigma_{\max}(\Sigma_{k}(t))\}$.

  • Let $\det_{\min}:=\min\{\det(\Sigma_{1}(t)),\det(\Sigma_{2}(t)),\dots,\det(\Sigma_{k}(t))\}$.

Then, we have

\displaystyle~{}\|\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}-\frac{\mathrm{d}\log p_{t}(\widetilde{x})}{\mathrm{d}\widetilde{x}}\|_{2}
\displaystyle\leq~{}(\frac{1}{\sigma_{\min}}+\frac{2R^{2}}{\gamma^{2}\sigma_{\min}^{2}}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}}))\cdot\|x-\widetilde{x}\|_{2}.
Proof.

We can show

\displaystyle~{}\mathrm{LHS}
\displaystyle=~{}\|\sum_{i=1}^{k}f_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))-\sum_{i=1}^{k}f_{i}(\widetilde{x})(-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))\|_{2}
\displaystyle\leq~{}\sum_{i=1}^{k}\|f_{i}(x)(-\Sigma_{i}(t)^{-1}(x-\mu_{i}(t)))-f_{i}(\widetilde{x})(-\Sigma_{i}(t)^{-1}(\widetilde{x}-\mu_{i}(t)))\|_{2}
\displaystyle\leq~{}\sum_{i=1}^{k}(\frac{|f_{i}(x)|}{\sigma_{\min}}+2\alpha_{i}(t)\cdot\gamma^{-2}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})\cdot(\frac{R}{\sigma_{\min}})^{2})\cdot\|x-\widetilde{x}\|_{2}
\displaystyle=~{}(\frac{1}{\sigma_{\min}}+\frac{2R^{2}}{\gamma^{2}\sigma_{\min}^{2}}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}}))\cdot\|x-\widetilde{x}\|_{2}

where the first step follows from Lemma F.5, the second step follows from the triangle inequality, the third step follows from Lemma F.10, and the last step follows from $\sum_{i=1}^{k}|f_{i}(x)|=\sum_{i=1}^{k}f_{i}(x)=1$ and $\sum_{i=1}^{k}\alpha_{i}(t)=1$. ∎
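As a numerical sanity check (not part of the formal argument), the following Python sketch estimates the local Lipschitz constant of the score function of a toy 2-dimensional mixture by finite differences and compares it with the closed-form bound of Lemma F.11. All mixture parameters are illustrative assumptions, and the constants $R,\beta,\gamma$ are instantiated from the sampled points (with $\beta,\gamma$ clipped into $(0,0.1)$, which only enlarges the bound) so that the stated conditions hold.

import numpy as np

rng = np.random.default_rng(0)
d = 2
alphas = np.array([0.4, 0.6])                        # assumed mixture weights
mus = [np.array([1.0, 0.0]), np.array([-1.0, 0.5])]  # assumed means
Sigmas = [np.diag([1.0, 2.0]), np.diag([1.5, 0.5])]  # assumed covariances

def score(x):
    # Returns (density p(x), score d log p(x)/dx) for the mixture above.
    dens, grad = 0.0, np.zeros(d)
    for a, mu, S in zip(alphas, mus, Sigmas):
        Sinv = np.linalg.inv(S)
        diff = x - mu
        Ni = np.exp(-0.5 * diff @ Sinv @ diff) / np.sqrt((2*np.pi)**d * np.linalg.det(S))
        dens += a * Ni
        grad += a * Ni * (-Sinv @ diff)
    return dens, grad / dens

ratios, dists, dens_vals = [], [], []
for _ in range(2000):
    x = rng.normal(size=d)
    xt = x + 1e-4 * rng.normal(size=d)
    p1, s1 = score(x)
    _, s2 = score(xt)
    ratios.append(np.linalg.norm(s1 - s2) / np.linalg.norm(x - xt))
    dists += [np.linalg.norm(x - mu) for mu in mus]
    dens_vals.append(p1)

# Instantiate R, beta, gamma from the samples so the lemma's conditions hold.
R = max(dists)
beta = min(min(dists), 0.09)
gamma = min(min(dens_vals), 0.09)
sig_min = min(np.linalg.eigvalsh(S).min() for S in Sigmas)
sig_max = max(np.linalg.eigvalsh(S).max() for S in Sigmas)
det_min = min(np.linalg.det(S) for S in Sigmas)
L = 1/sig_min + (2*R**2/(gamma**2*sig_min**2)) * (
    1/((2*np.pi)**d*det_min) + 1/((2*np.pi)**(d/2)*np.sqrt(det_min))
) * np.exp(-beta**2/(2*sig_max))
print("empirical Lipschitz estimate:", max(ratios))
print("closed-form bound from Lemma F.11:", L)

Since the bound is worst-case in $\gamma$, it is typically much larger than the empirical estimate, but the empirical estimate should never exceed it.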

Appendix G Putting It All Together

Our overall goal is to provide more concrete calculations for the theorems in Section 6 under the assumption that the data distribution is a $k$-mixture of Gaussians. We now provide the lemmas used in these calculations.

We begin with the lemma for a $k$-mixture of Gaussians, which states that if $p_{0}$ is a mixture of Gaussians, then all the $\mathsf{pdf}$s in the diffusion process are also mixtures of Gaussians.

Lemma G.1 (Formal version of Lemma 3.2).

Let $a,b\in\mathbb{R}$. Let ${\cal D}$ be a $k$-mixture of Gaussian distribution, and let $p$ be its $\mathsf{pdf}$, i.e.,

\displaystyle p(x):=\sum_{i=1}^{k}\frac{\alpha_{i}}{(2\pi)^{d/2}\det(\Sigma_{i})^{1/2}}\exp(-\frac{1}{2}(x-\mu_{i})^{\top}\Sigma_{i}^{-1}(x-\mu_{i})).

Let $x\in\mathbb{R}^{d}$ be sampled from ${\cal D}$. Let $z\in\mathbb{R}^{d}$ with $z\sim{\cal N}(0,I)$, independent of $x$. Then the new random variable $y=ax+bz$ follows a $k$-mixture of Gaussian distribution $\widetilde{\cal D}$, whose $\mathsf{pdf}$ is

\displaystyle\widetilde{p}(x):=\sum_{i=1}^{k}\frac{\alpha_{i}}{(2\pi)^{d/2}\det(\widetilde{\Sigma}_{i})^{1/2}}\exp(-\frac{1}{2}(x-\widetilde{\mu}_{i})^{\top}\widetilde{\Sigma}_{i}^{-1}(x-\widetilde{\mu}_{i})),

where $\widetilde{\mu}_{i}=a\mu_{i}$ and $\widetilde{\Sigma}_{i}=a^{2}\Sigma_{i}+b^{2}I$.

Proof.

First, we know that the $\mathsf{pdf}$ of the sum of two independent random variables is the convolution of their $\mathsf{pdf}$s.

From [Vin04] we know that the convolution of two Gaussians is another Gaussian, i.e., $\mathcal{N}(\mu_{1},\Sigma_{1})*\mathcal{N}(\mu_{2},\Sigma_{2})=\mathcal{N}(\mu_{1}+\mu_{2},\Sigma_{1}+\Sigma_{2})$, where $*$ is the convolution operator.

We also know that the $\mathsf{pdf}$ of a linear transformation $Ax+b$ of a random variable $x\in\mathbb{R}^{d}$, where $A\in\mathbb{R}^{d\times d}$ and $b\in\mathbb{R}^{d}$, is $\frac{1}{|\det(A)|}p(A^{-1}(x-b))$.

If we consider the transformation $ax$, where $a\in\mathbb{R}$ and $x\in\mathbb{R}^{d}$, it can be written as $aIx$. Therefore the $\mathsf{pdf}$ of $ax$ is $\frac{1}{|\det(aI)|}p((aI)^{-1}x)=\frac{1}{|a^{d}|}p(x/a)$.

Now we prove the lemma. To find the $\mathsf{pdf}$ of $ax$, where $x\sim p(x)$, we can show

\displaystyle\frac{1}{|a^{d}|}p(x/a)=~{}\frac{1}{|a^{d}|}\sum_{i=1}^{k}\frac{\alpha_{i}}{(2\pi)^{d/2}\det(\Sigma_{i})^{1/2}}\exp(-\frac{1}{2}(\frac{x}{a}-\mu_{i})^{\top}\Sigma_{i}^{-1}(\frac{x}{a}-\mu_{i}))
\displaystyle=~{}\sum_{i=1}^{k}\frac{\alpha_{i}}{(2\pi)^{d/2}\det(a^{2}\Sigma_{i})^{1/2}}\exp(-\frac{1}{2}(x-a\mu_{i})^{\top}a^{-2}\Sigma_{i}^{-1}(x-a\mu_{i}))
\displaystyle=~{}\sum_{i=1}^{k}\alpha_{i}\mathcal{N}(a\mu_{i},a^{2}\Sigma_{i})

where the first step follows from the definition of $p(x)$, the second step follows from $a^{2d}\det(\Sigma_{i})=\det(a^{2}\Sigma_{i})$, and the last step follows from the definition of the Gaussian distribution.

For a single standard Gaussian random variable $z$, the $\mathsf{pdf}$ of $bz$ is simply that of $\mathcal{N}(0,b^{2}I)$.

To find the $\mathsf{pdf}$ of $y=ax+bz$, we can show

\displaystyle\widetilde{p}(x)=~{}\frac{1}{|a^{d}|}p(x/a)*\mathcal{N}(0,b^{2}I)
\displaystyle=~{}(\sum_{i=1}^{k}\alpha_{i}\mathcal{N}(a\mu_{i},a^{2}\Sigma_{i}))*\mathcal{N}(0,b^{2}I)
\displaystyle=~{}\sum_{i=1}^{k}(\alpha_{i}\mathcal{N}(a\mu_{i},a^{2}\Sigma_{i})*\mathcal{N}(0,b^{2}I))
\displaystyle=~{}\sum_{i=1}^{k}\alpha_{i}\mathcal{N}(a\mu_{i},a^{2}\Sigma_{i}+b^{2}I)

where the first step follows from the fact that the $\mathsf{pdf}$ of the sum of two independent random variables is the convolution of their $\mathsf{pdf}$s, the second step follows from the $\mathsf{pdf}$ of $\frac{1}{|a^{d}|}p(x/a)$ computed above, the third step follows from the distributive property of convolution, and the last step follows from $\mathcal{N}(a\mu_{i},a^{2}\Sigma_{i})*\mathcal{N}(0,b^{2}I)=\mathcal{N}(a\mu_{i},a^{2}\Sigma_{i}+b^{2}I)$.

Thus, the $\mathsf{pdf}$ of $y$ can be written as a mixture of $k$ Gaussians:

\displaystyle\widetilde{p}(x):=\sum_{i=1}^{k}\frac{\alpha_{i}}{(2\pi)^{d/2}\det(\widetilde{\Sigma}_{i})^{1/2}}\exp(-\frac{1}{2}(x-\widetilde{\mu}_{i})^{\top}\widetilde{\Sigma}_{i}^{-1}(x-\widetilde{\mu}_{i})),

where $\widetilde{\mu}_{i}=a\mu_{i}$ and $\widetilde{\Sigma}_{i}=a^{2}\Sigma_{i}+b^{2}I$. ∎
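To illustrate Lemma G.1, the following short Monte Carlo sketch (with assumed values of $a$, $b$, and one component's parameters) checks that $y=ax+bz$ has empirical mean close to $a\mu_{i}$ and empirical covariance close to $a^{2}\Sigma_{i}+b^{2}I$; since the transformation acts on each component separately, matching a single component suffices.

import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 200_000
a, b = 0.8, 0.6                      # assumed scalars a_t, b_t at some fixed t
mu = np.array([1.0, -2.0, 0.5])      # illustrative component mean
Sigma = np.diag([1.0, 0.5, 2.0])     # illustrative component covariance

x = rng.multivariate_normal(mu, Sigma, size=n)
z = rng.standard_normal((n, d))      # z ~ N(0, I), independent of x
y = a * x + b * z

print("max |mean error|:", np.abs(y.mean(axis=0) - a * mu).max())
print("max |cov  error|:", np.abs(np.cov(y.T) - (a**2 * Sigma + b**2 * np.eye(d))).max())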

Next, we provide the lemma for the second momentum of a $k$-mixture of Gaussians.

Lemma G.2 (Formal version of Lemma 3.5).

If the following conditions hold:

  • $x_{0}\sim p_{0}$, where $p_{0}$ is defined by Eq. (9).

Then, we have

\displaystyle m_{2}^{2}:=\operatorname*{{\mathbb{E}}}_{x_{0}\sim p_{0}}[\|x_{0}\|_{2}^{2}]=\sum_{i=1}^{k}\alpha_{i}(\|\mu_{i}\|_{2}^{2}+\operatorname{tr}[\Sigma_{i}])
Proof.

From [PKK22], we know the second-moment matrix of the data distribution $p_{0}(x)$ is given by:

\displaystyle\operatorname*{{\mathbb{E}}}[x_{0}x_{0}^{\top}]=\sum_{i=1}^{k}\alpha_{i}\cdot(\mu_{i}\mu_{i}^{\top}+\Sigma_{i}) (23)

Then, we can show

\displaystyle\operatorname*{{\mathbb{E}}}[\|x_{0}\|_{2}^{2}]=~{}\operatorname*{{\mathbb{E}}}[x_{0}^{\top}x_{0}]
\displaystyle=~{}\operatorname*{{\mathbb{E}}}[\operatorname{tr}[x_{0}x_{0}^{\top}]]
\displaystyle=~{}\operatorname{tr}[\operatorname*{{\mathbb{E}}}[x_{0}x_{0}^{\top}]]
\displaystyle=~{}\operatorname{tr}[\sum_{i=1}^{k}\alpha_{i}(\mu_{i}\mu_{i}^{\top}+\Sigma_{i})]
\displaystyle=~{}\sum_{i=1}^{k}\alpha_{i}(\|\mu_{i}\|_{2}^{2}+\operatorname{tr}[\Sigma_{i}])

where the first step follows from the definition of the $\ell_{2}$-norm, the second step follows from $\operatorname{tr}[aa^{\top}]=a^{\top}a$ for a vector $a$, the third step follows from the linearity of the trace operator, the fourth step follows from Eq. (23), and the last step follows from $\operatorname{tr}[\mu_{i}\mu_{i}^{\top}]=\|\mu_{i}\|_{2}^{2}$. ∎
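A quick Monte Carlo sketch of Lemma G.2 with an assumed toy mixture: the empirical mean of $\|x_{0}\|_{2}^{2}$ should match $\sum_{i=1}^{k}\alpha_{i}(\|\mu_{i}\|_{2}^{2}+\operatorname{tr}[\Sigma_{i}])$.

import numpy as np

rng = np.random.default_rng(2)
d, n = 2, 200_000
alphas = np.array([0.3, 0.7])                        # assumed mixture weights
mus = [np.array([1.0, 2.0]), np.array([-1.0, 0.0])]  # assumed means
Sigmas = [np.eye(d), np.diag([2.0, 0.5])]            # assumed covariances

# Sample from the mixture: pick a component, then sample that Gaussian.
comp = rng.choice(len(alphas), size=n, p=alphas)
x = np.vstack([rng.multivariate_normal(mus[c], Sigmas[c], size=(comp == c).sum())
               for c in range(len(alphas))])
analytic = sum(a * (mu @ mu + np.trace(S)) for a, mu, S in zip(alphas, mus, Sigmas))
print("empirical m2^2:", (x**2).sum(axis=1).mean(), " analytic m2^2:", analytic)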

Now we give the Lipschitz constant explicitly.

Lemma G.3 (Formal version of Lemma 3.3).

If the following conditions hold

  • Let $\|x-a_{t}\mu_{i}\|_{2}\leq R$, where $R\geq 1$, for each $i\in[k]$.

  • Let $\|x-a_{t}\mu_{i}\|_{2}\geq\beta$, where $\beta\in(0,0.1)$, for each $i\in[k]$.

  • Let $p_{t}(x)$ be defined as in Eq. (10), and $p_{t}(x)\geq\gamma$, where $\gamma\in(0,0.1)$.

  • Let $\sigma_{\min}:=\min_{i\in[k]}\{\sigma_{\min}(a_{t}^{2}\Sigma_{i}+b_{t}^{2}I)\}$.

  • Let $\sigma_{\max}:=\max_{i\in[k]}\{\sigma_{\max}(a_{t}^{2}\Sigma_{i}+b_{t}^{2}I)\}$.

  • Let $\det_{\min}:=\min_{i\in[k]}\{\det(a_{t}^{2}\Sigma_{i}+b_{t}^{2}I)\}$.

Then the Lipschitz constant for the score function $\frac{\mathrm{d}\log p_{t}(x)}{\mathrm{d}x}$ is given by:

\displaystyle L=\frac{1}{\sigma_{\min}}+\frac{2R^{2}}{\gamma^{2}\sigma_{\min}^{2}}\cdot(\frac{1}{(2\pi)^{d}\det_{\min}}+\frac{1}{(2\pi)^{d/2}\det_{\min}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max}})
Proof.

Combining Lemma F.11 with Lemma G.1 yields the result. ∎
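The following sketch evaluates the explicit constant of Lemma G.3 for a toy mixture; the schedule values $a_{t},b_{t}$ and the constants $R,\beta,\gamma$ are assumed inputs satisfying the stated conditions, not quantities derived in the paper.

import numpy as np

d = 2
a_t = 0.9                                   # assumed schedule value at some t
b_t = np.sqrt(1 - a_t**2)                   # e.g., a variance-preserving choice
Sigmas = [np.diag([1.0, 2.0]), np.diag([0.5, 1.5])]   # assumed covariances
Sig_t = [a_t**2 * S + b_t**2 * np.eye(d) for S in Sigmas]

sig_min = min(np.linalg.eigvalsh(S).min() for S in Sig_t)
sig_max = max(np.linalg.eigvalsh(S).max() for S in Sig_t)
det_min = min(np.linalg.det(S) for S in Sig_t)
R, beta, gamma = 5.0, 0.05, 0.01            # assumed: R >= 1, beta, gamma in (0, 0.1)

L = 1/sig_min + (2*R**2/(gamma**2 * sig_min**2)) \
    * (1/((2*np.pi)**d * det_min) + 1/((2*np.pi)**(d/2) * np.sqrt(det_min))) \
    * np.exp(-beta**2/(2*sig_max))
print("Lipschitz constant L:", L)

Note that the bound grows as $\gamma\to 0$, reflecting that the mixture score can change rapidly in low-density regions.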

Appendix H More Calculation for Application

In this section, we provide more concrete calculations for Theorem 6.5, Theorem 6.6, Theorem 6.7, Theorem 6.8, and Theorem 6.9.

H.1 Concrete calculation for Theorem 6.5

Theorem H.1 ($\mathsf{DDPM}$, total variation, formal version of Theorem 5.6).

If the following conditions hold:

  • Conditions 5.3 and 5.4.

  • The step size $h_{k}:=T/N$ satisfies $h_{k}=O(1/L)$ and $L\geq 1$ for $k\in[N]$.

  • Let $\widehat{q}$ denote the density of the output of the $\mathsf{EulerMaruyama}$ algorithm defined in Definition 4.3.

We have

\displaystyle\mathrm{TV}(\widehat{q},p_{0})\lesssim\underbrace{\sqrt{\mathrm{KL}(p_{0}\|\mathcal{N}(0,I))}\exp(-T)}_{\mathrm{convergence~{}of~{}forward~{}process}}+\underbrace{(L\sqrt{dh}+Lm_{2}h)\sqrt{T}}_{\mathrm{discretization~{}error}}+\underbrace{\epsilon_{0}\sqrt{T}}_{\mathrm{score~{}estimation~{}error}},

where

  • $L=\frac{1}{\sigma_{\min(p_{t})}}+\frac{2R^{2}}{\gamma^{2}\sigma_{\min(p_{t})}^{2}}\cdot(\frac{1}{(2\pi)^{d}\det_{\min(p_{t})}}+\frac{1}{(2\pi)^{d/2}\det_{\min(p_{t})}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max(p_{t})}})$,

  • $m_{2}=(\sum_{i=1}^{k}\alpha_{i}(\|\mu_{i}\|_{2}^{2}+\operatorname{tr}[\Sigma_{i}]))^{1/2}$,

  • $\mathrm{KL}(p_{0}(x)\|\mathcal{N}(0,I))\leq\frac{1}{2}(-\log(\det_{\min(p_{0})})+d\sigma_{\max(p_{0})}+\mu_{\max(p_{0})}-d)$.

Proof.

Now we want to find a more concrete $L$ for Assumption 6.1. Notice that, by Lemma G.1, at any time $0\leq t\leq T$, $p_{t}$ is also a $k$-mixture of Gaussians, except that the means and covariances change with time.

Using Lemma G.3, we can get $L$.

Now we want to find the second momentum in Assumption 6.2. Using Lemma G.2, we know that $m_{2}=(\sum_{i=1}^{k}\alpha_{i}(\|\mu_{i}\|_{2}^{2}+\operatorname{tr}[\Sigma_{i}]))^{1/2}$.

Assumption 5.1 makes the same assumption on the second momentum.

Now we want to obtain a more concrete setting for Theorem 6.5 by calculating each term directly. Notice that we now have all the quantities except the KL divergence term. Thus, we calculate $\mathrm{KL}(p_{0}\|\mathcal{N}(0,I))$, the KL divergence between the data distribution and the standard Gaussian.

In our notation, we have

\displaystyle\mathrm{KL}(p_{0}(x)\|\mathcal{N}(0,I))=~{}\mathrm{KL}(\sum_{i=1}^{k}\alpha_{i}\mathcal{N}(\mu_{i},\Sigma_{i})\|\mathcal{N}(0,I))
\displaystyle=~{}\int\sum_{i=1}^{k}\alpha_{i}\mathcal{N}(\mu_{i},\Sigma_{i})\log(\frac{\sum_{i=1}^{k}\alpha_{i}\mathcal{N}(\mu_{i},\Sigma_{i})}{\mathcal{N}(0,I)})\mathrm{d}x.

However, this integral has no closed form, so we instead derive an upper bound for the KL divergence.

We know the KL divergence between two normal distributions is given by:

\displaystyle~{}\mathrm{KL}(\mathcal{N}(\mu_{1},\Sigma_{1})\|\mathcal{N}(\mu_{2},\Sigma_{2}))
\displaystyle=~{}-\frac{1}{2}\log(\frac{\det(\Sigma_{1})}{\det(\Sigma_{2})})+\frac{1}{2}\operatorname{tr}[(\Sigma_{2})^{-1}\Sigma_{1}]+\frac{1}{2}(\mu_{1}-\mu_{2})^{\top}(\Sigma_{2})^{-1}(\mu_{1}-\mu_{2})-\frac{d}{2}.

We define $\sigma_{\max(p_{0})}:=\max_{i\in[k]}\{\sigma_{\max}(\Sigma_{i})\}$, $\det_{\min(p_{0})}:=\min_{i\in[k]}\{\det(\Sigma_{i})\}$, and $\mu_{\max(p_{0})}:=\max_{i\in[k]}\{\|\mu_{i}\|_{2}^{2}\}$. From [HO07], we can show

\displaystyle\mathrm{KL}(\sum_{i=1}^{k}\alpha_{i}\mathcal{N}(\mu_{i},\Sigma_{i})\|\mathcal{N}(0,I))\leq~{}\sum_{i=1}^{k}\alpha_{i}\mathrm{KL}(\mathcal{N}(\mu_{i},\Sigma_{i})\|\mathcal{N}(0,I))
\displaystyle=~{}\sum_{i=1}^{k}\frac{\alpha_{i}}{2}(-\log(\det(\Sigma_{i}))+\operatorname{tr}[\Sigma_{i}]+\|\mu_{i}\|_{2}^{2}-d)
\displaystyle\leq~{}\max_{i\in[k]}\frac{1}{2}(-\log(\det(\Sigma_{i}))+\operatorname{tr}[\Sigma_{i}]+\|\mu_{i}\|_{2}^{2}-d)
\displaystyle\leq~{}\frac{1}{2}(-\log(\det_{\min(p_{0})})+d\sigma_{\max(p_{0})}+\mu_{\max(p_{0})}-d)

where the first step follows from the convexity of the KL divergence, the second step follows from the KL divergence between two normal distributions, the third step follows from $\sum_{i=1}^{k}\alpha_{i}=1$ and $0\leq\alpha_{i}\leq 1$, and the last step follows from the definitions of $\det_{\min(p_{0})}$, $\sigma_{\max(p_{0})}$, and $\mu_{\max(p_{0})}$.

Then we have all the quantities in Theorem 6.5. After directly applying the theorem, we finish the proof. ∎
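The convexity upper bound above can also be sanity-checked numerically. The sketch below (for an assumed toy mixture) estimates $\mathrm{KL}(p_{0}\|\mathcal{N}(0,I))$ by Monte Carlo and compares it with $\sum_{i}\alpha_{i}\mathrm{KL}(\mathcal{N}(\mu_{i},\Sigma_{i})\|\mathcal{N}(0,I))$.

import numpy as np

rng = np.random.default_rng(3)
d, n = 2, 200_000
alphas = np.array([0.5, 0.5])                        # assumed mixture weights
mus = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]  # assumed means
Sigmas = [np.eye(d), np.diag([0.5, 2.0])]            # assumed covariances

def gauss_pdf(x, mu, S):
    # Multivariate Gaussian density evaluated row-wise on x (shape (n, d)).
    Sinv, det = np.linalg.inv(S), np.linalg.det(S)
    diff = x - mu
    quad = np.einsum('ni,ij,nj->n', diff, Sinv, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi)**d * det)

comp = rng.choice(2, size=n, p=alphas)
x = np.vstack([rng.multivariate_normal(mus[c], Sigmas[c], size=(comp == c).sum())
               for c in range(2)])
p0 = sum(a * gauss_pdf(x, mu, S) for a, mu, S in zip(alphas, mus, Sigmas))
q = gauss_pdf(x, np.zeros(d), np.eye(d))
kl_mc = np.mean(np.log(p0 / q))      # Monte Carlo estimate of KL(p0 || N(0, I))
kl_ub = sum(a * 0.5 * (-np.log(np.linalg.det(S)) + np.trace(S) + mu @ mu - d)
            for a, mu, S in zip(alphas, mus, Sigmas))
print("MC estimate:", kl_mc, " convexity upper bound:", kl_ub)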

H.2 Concrete calculation for Theorem 6.6

Theorem H.2 ($\mathsf{DDPM}$, KL divergence, formal version of Theorem 5.7).

If the following conditions hold:

  • Condition 5.4.

  • We use uniform discretization points.

We have

  • Let $\widehat{q}$ denote the density of the output of the $\mathsf{ExponentialIntegrator}$ defined by Definition 4.4. Then we have

    \displaystyle\mathrm{KL}(p_{0}\|\widehat{q})\lesssim(M_{2}+d)e^{-T}+T\epsilon_{0}^{2}+\frac{dT^{2}L^{2}}{N}.

    In particular, choosing $T=\Theta(\log(M_{2}d/\epsilon_{0}))$ and $N=\Theta(dT^{2}L^{2}/\epsilon_{0}^{2})$, we can show that

    \displaystyle\mathrm{KL}(p_{0}\|\widehat{q})=\widetilde{O}(\epsilon_{0}^{2}).

  • Let $\widehat{q}$ denote the density of the output of the $\mathsf{EulerMaruyama}$ defined by Definition 4.3. Then we have

    \displaystyle\mathrm{KL}(p_{0}\|\widehat{q})\lesssim(M_{2}+d)e^{-T}+T\epsilon_{0}^{2}+\frac{dT^{2}L^{2}}{N}+\frac{T^{3}M_{2}}{N^{2}}.

where

  • $L=\frac{1}{\sigma_{\min(p_{t})}}+\frac{2R^{2}}{\gamma^{2}\sigma_{\min(p_{t})}^{2}}\cdot(\frac{1}{(2\pi)^{d}\det_{\min(p_{t})}}+\frac{1}{(2\pi)^{d/2}\det_{\min(p_{t})}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max(p_{t})}})$,

  • $M_{2}=\sum_{i=1}^{k}\alpha_{i}(\|\mu_{i}\|_{2}^{2}+\operatorname{tr}[\Sigma_{i}])$.

Proof.

Using Lemma G.3, we can get $L$. Using Lemma G.2, we can get $M_{2}$. Then we directly apply Theorem 6.6. ∎
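For intuition on the parameter choices in Theorem H.2, the following back-of-the-envelope sketch evaluates $T=\Theta(\log(M_{2}d/\epsilon_{0}))$ and $N=\Theta(dT^{2}L^{2}/\epsilon_{0}^{2})$ with the $\Theta$-constants dropped; the values of $L,M_{2},d,\epsilon_{0}$ are assumed examples.

import math

L, M2, d, eps0 = 10.0, 5.0, 2, 0.01   # assumed example quantities
T = math.log(M2 * d / eps0)           # T = Theta(log(M2 * d / eps0)), constant dropped
N = d * T**2 * L**2 / eps0**2         # N = Theta(d T^2 L^2 / eps0^2), constant dropped
print(f"T ~ {T:.2f}, N ~ {N:.3e} discretization steps")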

H.3 Concrete calculation for Theorem 6.7

Theorem H.3 ($\mathsf{DDPM}$, KL divergence for smooth data distribution, formal version of Theorem 5.8).

If the following conditions hold:

  • Conditions 5.3 and 5.4.

  • We use the exponentially decreasing (then constant) step size $h_{k}=c\min\{\max\{t_{k},\frac{1}{L}\},1\}$ with $c=\frac{T+\log L}{N}\leq\frac{1}{Kd}$; a small schedule sketch is given after the proof below.

  • Let $\widehat{q}$ denote the density of the output of the $\mathsf{ExponentialIntegrator}$ defined by Definition 4.4.

We have

\displaystyle\mathrm{KL}(p_{0}\|\widehat{q})\lesssim(M_{2}+d)\exp(-T)+T\epsilon_{0}^{2}+\frac{d^{2}(T+\log L)^{2}}{N},

where

  • $L=\frac{1}{\sigma_{\min(p_{0})}}+\frac{2R^{2}}{\gamma^{2}(\sigma_{\min(p_{0})})^{2}}\cdot(\frac{1}{(2\pi)^{d}\det_{\min(p_{0})}}+\frac{1}{(2\pi)^{d/2}(\det_{\min(p_{0})})^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max(p_{0})}})$,

  • $M_{2}=\sum_{i=1}^{k}\alpha_{i}(\|\mu_{i}\|_{2}^{2}+\operatorname{tr}[\Sigma_{i}])$.

Furthermore, if we choose $T=\Theta(\log(M_{2}d/\epsilon_{0}))$ and $N=\Theta(d^{2}(T+\log L)^{2}/\epsilon_{0}^{2})$, then we can show

\displaystyle\mathrm{KL}(p_{0}\|\widehat{q})\leq\widetilde{O}(\epsilon_{0}^{2}).

In addition, for the Euler-Maruyama scheme defined in Definition 4.3, the same bounds hold with an additional $M_{2}\sum_{k=1}^{N}h_{k}^{3}$ term.

Proof.

Clearly, $p_{0}$ is second-order differentiable. Using Lemma G.3, we can get $L$. Using Lemma G.2, we can get $M_{2}$. Then we directly apply Theorem 6.7. ∎
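The step-size schedule of Theorem H.3 can be illustrated with the following sketch; we assume $t_{k}$ denotes the remaining time of the reverse process (an assumption about the convention, consistent with steps shrinking as $t_{k}\to 0$), so that steps are constant at $c$ early on and decrease toward $c/L$.

import numpy as np

L, T, N = 10.0, 8.0, 1000               # assumed example values
c = (T + np.log(L)) / N                 # c = (T + log L) / N, as in Theorem H.3
t, steps = T, []
for _ in range(N):
    h = c * min(max(t, 1.0 / L), 1.0)   # h_k = c * min{max{t_k, 1/L}, 1}
    steps.append(h)
    t = max(t - h, 0.0)
print("first step:", steps[0], " last step:", steps[-1],
      " total time covered:", sum(steps))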

H.4 Concrete calculation for Theorem 6.8

Theorem H.4 ($\mathsf{DPOM}$, formal version of Theorem 5.9).

If the following conditions hold:

  • Condition 5.4.

  • We use the $\mathsf{DPOM}$ algorithm defined in Definition 4.5, and let $\widehat{q}$ be its output density.

We have

\displaystyle\mathrm{TV}(\widehat{q},p_{0})\lesssim(\sqrt{d}+m_{2})\exp(-T)+L^{2}Td^{1/2}h_{\mathrm{pred}}+L^{3/2}Td^{1/2}h_{\mathrm{corr}}^{1/2}+L^{1/2}T\epsilon_{0}+\epsilon,

where

  • $L=\frac{1}{\sigma_{\min(p_{t})}}+\frac{2R^{2}}{\gamma^{2}\sigma_{\min(p_{t})}^{2}}\cdot(\frac{1}{(2\pi)^{d}\det_{\min(p_{t})}}+\frac{1}{(2\pi)^{d/2}\det_{\min(p_{t})}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max(p_{t})}})$,

  • $m_{2}=(\sum_{i=1}^{k}\alpha_{i}(\|\mu_{i}\|_{2}^{2}+\operatorname{tr}[\Sigma_{i}]))^{1/2}$.

In particular, if we set $T=\Theta(\log(dm_{2}/\epsilon))$, $h_{\mathrm{pred}}=\widetilde{\Theta}(\frac{\epsilon}{L^{2}d^{1/2}})$, $h_{\mathrm{corr}}=\widetilde{\Theta}(\frac{\epsilon}{L^{3}d})$, and if the score estimation error satisfies $\epsilon_{0}\leq\widetilde{O}(\frac{\epsilon}{\sqrt{L}})$, then we can obtain TV error $\epsilon$ with a total iteration complexity of $\widetilde{\Theta}(L^{3}d/\epsilon^{2})$ steps.

Proof.

Using Lemma G.3, we can get $L$. Using Lemma G.2, we can get $m_{2}$. Then we directly apply Theorem 6.8. ∎

H.5 Concrete calculation for Theorem 6.9

Theorem H.5 ($\mathsf{DPUM}$, formal version of Theorem 5.10).

If the following conditions hold:

  • Condition 5.4.

  • We use the $\mathsf{DPUM}$ algorithm defined in Definition 4.6, and let $\widehat{q}$ be its output density.

We have

\displaystyle\mathrm{TV}(\widehat{q},p_{0})\lesssim(\sqrt{d}+m_{2})\exp(-T)+L^{2}Td^{1/2}h_{\mathrm{pred}}+L^{3/2}Td^{1/2}h_{\mathrm{corr}}^{1/2}+L^{1/2}T\epsilon_{0}+\epsilon,

where

  • $L=\frac{1}{\sigma_{\min(p_{t})}}+\frac{2R^{2}}{\gamma^{2}\sigma_{\min(p_{t})}^{2}}\cdot(\frac{1}{(2\pi)^{d}\det_{\min(p_{t})}}+\frac{1}{(2\pi)^{d/2}\det_{\min(p_{t})}^{1/2}})\cdot\exp(-\frac{\beta^{2}}{2\sigma_{\max(p_{t})}})$,

  • $m_{2}=(\sum_{i=1}^{k}\alpha_{i}(\|\mu_{i}\|_{2}^{2}+\operatorname{tr}[\Sigma_{i}]))^{1/2}$.

In particular, if we set $T=\Theta(\log(dm_{2}/\epsilon))$, $h_{\mathrm{pred}}=\widetilde{\Theta}(\frac{\epsilon}{L^{2}d^{1/2}})$, $h_{\mathrm{corr}}=\widetilde{\Theta}(\frac{\epsilon}{L^{3/2}d^{1/2}})$, and if the score estimation error satisfies $\epsilon_{0}\leq\widetilde{O}(\frac{\epsilon}{\sqrt{L}})$, then we can obtain TV error $\epsilon$ with a total iteration complexity of $\widetilde{\Theta}(L^{2}d^{1/2}/\epsilon)$ steps.

Proof.

Using Lemma G.3, we can get $L$. Using Lemma G.2, we can get $m_{2}$. Then we directly apply Theorem 6.9. ∎
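Comparing Theorems H.4 and H.5, $\mathsf{DPUM}$ improves the iteration complexity from $\widetilde{\Theta}(L^{3}d/\epsilon^{2})$ to $\widetilde{\Theta}(L^{2}d^{1/2}/\epsilon)$. A tiny sketch (with assumed example values and the $\widetilde{\Theta}$-constants and log factors dropped) makes the gap concrete.

# Iteration-complexity comparison, constants and log factors dropped.
L, d, eps = 10.0, 2, 0.01            # assumed example values
dpom = L**3 * d / eps**2             # ~ Theta~(L^3 d / eps^2), Theorem H.4
dpum = L**2 * d**0.5 / eps           # ~ Theta~(L^2 d^{1/2} / eps), Theorem H.5
print(f"DPOM ~ {dpom:.3e} steps, DPUM ~ {dpum:.3e} steps, ratio ~ {dpom/dpum:.1f}x")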