
A Good Score Does not Lead to A Good Generative Model

Sixu Li    Shi Chen    Qin Li
Abstract

Score-based Generative Models (SGMs) are a leading method in generative modeling, renowned for their ability to generate high-quality samples from complex, high-dimensional data distributions. The method enjoys empirical success and is supported by rigorous theoretical convergence properties. In particular, it has been shown that SGMs can generate samples from a distribution close to the ground truth if the underlying score function is learned well, suggesting the success of SGM as a generative model. We provide a counter-example in this paper. Through a sample complexity argument, we exhibit one specific setting in which the score function is learned well. Yet, SGMs in this setting can only output samples that are Gaussian blurrings of the training data points, mimicking the effect of kernel density estimation. The finding resonates with a series of recent results revealing that SGMs can exhibit a strong memorization effect and fail to generate.


1 Introduction

Generative modeling aims to understand the structure of a dataset so as to generate similar examples. It has been widely used in image and text generation (Wang et al., 2018; Huang et al., 2018; Rombach et al., 2022; Li et al., 2022; Gong et al., 2022), speech and audio synthesis (Donahue et al., 2018; Kong et al., 2020a, b; Huang et al., 2022), and even the discovery of protein structures (Watson et al., 2023).

Among the various types of generative models, Score-based Generative Models (SGMs) (Song et al., 2020; Ho et al., 2020; Karras et al., 2022) have recently emerged as a forefront method and achieved state-of-the-art empirical results across diverse domains. SGM views the structure of the existing examples as being encoded in a probability distribution, which we call the target distribution. Once SGM learns the target distribution from the data, it generates new samples from it.

Despite their empirical successes, a thorough theoretical understanding of why SGMs perform well remains elusive. A more fundamental question is:

What are the criteria to evaluate the performance of a generative model?

Heuristically, two key components of generative models are “imitating” and “generating”. Imitating is about learning from existing examples, while generating calls for creativity to produce something new. A successful generative model should exhibit imitation ability, so as to produce samples that resemble the training data, and at the same time manifest creativity and generate samples that are not mere replicas of existing ones.

In the past few years, significant theoretical progress has been made on assessing the imitation ability of SGMs. In particular, recent theory provides a rich collection of error bounds on the difference between the learned distribution and the ground-truth distribution, measured in various statistical distances, including total variation, KL divergence, and Wasserstein distance. These results suggest that SGMs have strong imitation ability, i.e., they can approximate the ground-truth distribution well if the score function (the gradient of the log-density) of the target distribution along the diffusion process can be learned effectively.

We would like to discuss the other side of the story: relying solely on these upper error bounds can be misleading in assessing the overall performance of SGMs. In particular, this criterion does not adequately address the issue of memorization, namely the possibility that the produced samples are simply replicas of the training data. In other words, SGMs with strong imitation ability can lack creativity.

1.1 A toy model argument

At the heart of our argument is the observation that a simple Kernel Density Estimation (KDE) of the ground-truth target distribution can be arbitrarily close to it. Yet drawing a sample from the ground truth and drawing one from a KDE are very different operations: the latter fails the task of “generation.”

To be mathematically more precise, let p_{\ast}(x) be the ground-truth distribution, and \{y_{i}\}_{i=1}^{N} be a set of i.i.d. samples drawn from it. The empirical distribution is \mathsf{p}_{\ast}:=\frac{1}{N}\sum_{i=1}^{N}\delta_{y_{i}}. We denote by \mathsf{p}_{\ast}^{\gamma}:=\mathsf{p}_{\ast}*\mathcal{N}_{\gamma} the distribution obtained by smoothing \mathsf{p}_{\ast} with the Gaussian kernel \mathcal{N}_{\gamma}:=\mathcal{N}(0,\gamma^{2}I_{d\times d}). This definition makes \mathsf{p}_{\ast}^{\gamma} a Kernel Density Estimation (KDE) of p_{\ast} with bandwidth \gamma.

It is intuitive that when the sample size N is large and the bandwidth \gamma is chosen properly, the KDE \mathsf{p}_{\ast}^{\gamma} approximates the true distribution p_{\ast}. In the most extreme case, when the bandwidth \gamma\to 0, the kernel density estimate \mathsf{p}_{\ast}^{\gamma} degenerates to the empirical distribution \mathsf{p}_{\ast}. Throughout the paper we view the empirical distribution as a special case of KDE.

Though p_{\ast} and \mathsf{p}_{\ast}^{\gamma} are close, generating samples from p_{\ast} and from \mathsf{p}_{\ast}^{\gamma} are drastically different stories. Drawing from p_{\ast} amounts to generating a completely new sample, independent of the dataset, while generating from \mathsf{p}_{\ast}^{\gamma} essentially means selecting a sample uniformly from the set \{y_{i}\}_{i=1}^{N} and then applying a Gaussian blurring. Regardless of how closely \mathsf{p}_{\ast}^{\gamma} approximates the ground truth p_{\ast}, sampling from the KDE ultimately gives replicas of the existing samples.
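The contrast can be made concrete in a few lines of code. Below is a minimal sketch (not from the paper; the 2-D standard-Gaussian target, sample sizes, and bandwidth are illustrative choices) contrasting fresh draws from p_{\ast} with draws from the Gaussian KDE \mathsf{p}_{\ast}^{\gamma}, which always sit within roughly \gamma of some training point.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, gamma = 2, 500, 0.05            # dimension, training-set size, KDE bandwidth (illustrative)

# Training samples y_i drawn i.i.d. from a toy ground-truth p_* (standard Gaussian here).
train = rng.standard_normal((N, d))

def sample_from_target(n):
    """Fresh draws from p_*: completely new points, independent of the training set."""
    return rng.standard_normal((n, d))

def sample_from_kde(n):
    """Draws from p_*^gamma = (1/N) sum_i N(y_i, gamma^2 I):
    pick a training point uniformly, then blur it with Gaussian noise."""
    idx = rng.integers(0, N, size=n)
    return train[idx] + gamma * rng.standard_normal((n, d))

# Both clouds look distributed like p_*, but KDE samples sit within ~gamma of a training point.
nearest = lambda x: np.min(np.linalg.norm(x[:, None, :] - train[None], axis=-1), axis=1)
print("median distance to nearest y_i:",
      "target draw %.3f" % np.median(nearest(sample_from_target(1000))),
      "KDE draw %.3f" % np.median(nearest(sample_from_kde(1000))))
```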

Would SGM be different from KDE? SGM is built on a complicated procedure, incorporating forward noise injection, score matching, and backward sampling processes. The machinery is significantly more convoluted than the straightforward KDE approach. Would it be able to generate new samples?

We show in this paper that the perfect SGM is in fact a KDE itself. The mathematical statement is presented in Theorem 4.3. “Perfect” means that the minimizer of the empirical score matching objective is achieved during the score-matching procedure. We term the learned score function the empirical optimal score function. Since an SGM equipped with the empirical optimal score function is effectively a KDE, it inherits the limitation of KDE and fails to “generate.” This phenomenon is clearly demonstrated in Figure 1 with a test conducted on the CIFAR10 dataset.

Figure 1: Images generated based on the CIFAR10 dataset. The first row shows the original images, the second row presents the images blurred according to the Gaussian KDE, and the third row shows images generated by SGM equipped with the perfect score function learned from samples. Both KDE and SGM present simple replicas (with Gaussian blurring) of the original images.

It is important to note that this observation does not contradict existing theories showing that SGMs can approximate the target distribution p_{\ast} when the score function is accurately learned. Indeed, in Theorem 3.1 we provide a sample complexity estimate and derive a lower bound on the sample size N. When the sample size is sufficiently large, the empirical optimal score function approximates the ground-truth score function. Consequently, according to the existing theories, the output of SGM is a distribution close to the ground-truth target distribution. Yet two distributions being close is not sufficient for the task of generation.

1.2 Contributions

The primary contribution of this paper is to present a counter-example of a score-based generative model (SGM) with an accurately approximated score function that nevertheless produces unoriginal, replicated samples. Our findings are substantiated through the following two steps:

  • We establish in Theorem 3.1 the score-matching error of the empirical optimal score function, and present an explicit non-asymptotic error bound with the sample complexity. This result shows that the empirical optimal score function satisfies the standard L^{2} bound on the score estimation error used in the convergence analysis of the existing literature (Chen et al., 2022, 2023c, 2023d; Benton et al., 2023a), which presumably should lead to the conclusion that SGMs equipped with the empirical optimal score function produce a distribution close to the target distribution.

  • We show in Theorem 4.3 that SGMs equipped with the empirical optimal score function resemble a Gaussian KDE, and thus present strong memorization effects and fail to produce novel samples.

These results combined rigorously demonstrate that an SGM with precise empirical score matching is capable of producing a distribution close to the target, yet this does not ensure the efficacy of the SGM in generating innovative and diverse samples. This observation underscores the limitation of current upper-bound-type guarantees and highlights the need for new theoretical criteria to assess the performance of generative models.

Notations: Let \mathbb{R}^{d} be the d-dimensional Euclidean space and T>0 the time horizon. Denote by x=(x_{1},x_{2},\dots,x_{d})^{\top}\in\mathbb{R}^{d} and t\in[0,T] the spatial and time variables respectively. We denote by p_{\ast} the target data distribution supported on a subset of \mathbb{R}^{d}, and indicate the empirical distribution by \mathsf{p}_{\ast}. The Gaussian kernel with bandwidth \gamma is denoted by \mathcal{N}_{\gamma}:=\mathcal{N}(0,\gamma^{2}I_{d\times d}). For the special case \gamma=1, i.e. the standard Gaussian, we use the notation \pi^{d}:=\mathcal{N}(0,I_{d\times d}). We denote the Gaussian KDE with bandwidth \gamma by \mathsf{p}_{\ast}^{\gamma}:=\mathsf{p}_{\ast}*\mathcal{N}_{\gamma}. In general, we use p_{t} and q_{t} (or \mathsf{p}_{t} and \mathsf{q}_{t}) to represent the laws of the forward and backward SDEs at time t respectively (a thorough summary of the PDE and SDE notations used in this paper is provided in Appendix A). We denote by \delta\in[0,T) the early stopping time for running the SDEs.

1.3 Literature review

We are mainly concerned with three distinct lines of research related to SGM performance, as summarized below.

Convergence of SGMs. The first line of research concerns theoretical convergence properties of SGMs. This addresses the most fundamental aspect of the algorithm’s performance: What elements are needed for SGM to perform well? In this context, good performance amounts to generating a new sample from a learned distribution that is close to the ground truth. This line of research has garnered a large amount of interest, owing to its relation to sampling. In most studies, the analysis amounts to quantifying the deviation between the distributions generated by SGMs and the ground-truth distributions. This includes earlier studies such as (Lee et al., 2022; Wibisono & Yingxi Yang, 2022; De Bortoli et al., 2021; De Bortoli, 2022; Kwon et al., 2022; Block et al., 2022), and later work (Chen et al., 2022, 2023a, 2023b; Benton et al., 2023a; Li et al., 2023) that significantly relaxed the Lipschitz condition on the score function and achieved polynomial convergence rates. In these discoveries, Girsanov’s theorem turns out to be a crucial proof strategy. Parallel to these findings, convergence properties of ODE-based SGMs have also been explored (Chen et al., 2023c, d; Benton et al., 2023b; Albergo et al., 2023; Li et al., 2023), and comparisons to SDE-based SGMs have been drawn.

Sample complexity studies of SGMs. Another line of research focuses on sample complexity. How many samples/training data points are needed to learn the score? In line with convergence rate analysis, sample complexity studies have been conducted with the criterion set to be the L^{2}-approximation of the score function (Block et al., 2022; Cui et al., 2023; Chen et al., 2023b; Oko et al., 2023). The techniques involved range from deploying Rademacher complexity for certain hypothesis classes to utilizing specific neural network structures. Oftentimes, assumptions are also made on the structure of the data.

Memorization effect of SGMs. The third line of research on SGM concerns its memorization effect. This line of research was triggered by experimental discoveries and amplified by high-profile lawsuits (New York Times, 2023). Experimentally it was found that SGMs, when trained well, tend to produce replicas of training samples (Somepalli et al., 2022, 2023; Carlini et al., 2023). This phenomenon raises serious privacy concerns, and motivates studies on the fundamental nature of SGMs: Are SGMs memorizers or generalizers? In (Yoon et al., 2023), the authors presented a dichotomy, showing through numerical experiments that SGMs can generate novel samples when they fail to memorize training data. Furthermore, when confined to a basis of harmonic functions adapted to the geometry of image features, (Kadkhodaie et al., 2023) suggest that neural network denoisers in SGMs might have an inductive bias that aids generalization. In (Gu et al., 2023; Yi et al., 2023), the authors derive the optimal solution to the empirical score-matching problem and show that SGMs equipped with this score function exhibit a strong memorization effect. This suggests that with a limited amount of training data and a large neural network capacity, SGMs tend to memorize rather than generalize.

To summarize: the convergence results for SGMs suggest that a well-learned score function can be used to produce a sample drawn from a distribution close to the ground truth, while the studies on the memorization effect of SGMs suggest that the new draws are simple replicas of the training dataset. It is worth noting that the two sets of results do not contradict each other. In particular, the convergence results do not rule out the explicit dependence of newly generated samples on the training data. The connection between these two aspects of SGM performance is yet to be developed, and this is the main task of the current paper. We show that SGMs, despite having favorable convergence properties, can still resort to memorization, in the form of kernel density estimation. The finding underscores the need for a new theoretical framework to evaluate the performance of SGMs, taking into account both their imitation ability and their creativity.

2 Score-based Generative Models

We provide a brief exposition of Score-based Generative Models (SGMs) (Song et al., 2020) in this section. Mathematically, SGM is equivalent to denoising diffusion probabilistic modeling (DDPM) (Ho et al., 2020), so we use the two terms interchangeably.

2.1 Mathematical foundation for DDPM

The foundation of SGM stems from two mathematical observations. Firstly, a diffusion-type partial differential equation (PDE) drives an arbitrary distribution to a Gaussian, forming a bridge between the complex target distribution and the standard Gaussian, an easy-to-sample distribution. Secondly, this diffusion process can be simulated through its samples, translating the complicated PDE into a set of stochastic differential equations (SDEs) that are computationally easy to manipulate.

More precisely, denote by p_{t}(x) the solution to the PDE:

\partial_{t}p_{t}=\nabla\cdot(xp_{t})+\Delta p_{t}\,. (1)

It can be shown that, for arbitrary initial data p_{0}, when T is big enough,

p_{T}\approx\lim_{t\to\infty}p_{t}=\pi^{d}\,,

and the convergence is exponentially fast (Bakry et al., 2014). In our context, we set the initial data p_{0}=p_{\ast}, the to-be-learned target distribution.

This PDE can be run backward in time. Denoting q_{t}=p_{T-t}, a quick calculation shows

\partial_{t}q_{t}=-\nabla\cdot((x+2\nabla\ln p_{T-t})q_{t})+\Delta q_{t}\,. (2)

This means that with full knowledge of \nabla\ln p_{T-t}, the flow field x+2\nabla\ln p_{T-t}(x)=x+2u(T-t,x) drives the standard Gaussian (q_{0}=p_{T}\approx\pi^{d}) back to its original distribution, the target q_{T}=p_{0}=p_{\ast}. The term u(t,x)=\nabla\ln p_{t}(x) is called the score function.

Simulating these two PDEs (1) and (2) directly is computationally infeasible, especially when the dimension d\gg 1, but both equations can be represented by samples whose dynamics satisfy the corresponding SDEs. In particular, letting

dX_{t}^{\rightarrow}=-X_{t}^{\rightarrow}dt+\sqrt{2}dB_{t}\,, (3)

the standard OU process, and

dX_{t}^{\leftarrow}=\left[X_{t}^{\leftarrow}+2u(T-t,X_{t}^{\leftarrow})\right]dt+\sqrt{2}dB_{t}^{\prime}\,, (4)

where B_{t} and B_{t}^{\prime} are two Brownian motions, then, with proper initial conditions:

\mathrm{Law}(X_{t}^{\leftarrow})=q_{t}=p_{T-t}=\mathrm{Law}(X_{T-t}^{\rightarrow})\,.

This relation translates the direct simulation of the two PDEs (1) and (2) into running their representative samples governed by the SDEs (3)-(4), significantly reducing the computational complexity. It is worth noting that if one draws X_{t=0}^{\leftarrow}\sim p_{T} and runs (4), then:

\mathrm{Law}(X_{T}^{\leftarrow})=p_{\ast}\,,

meaning the dynamics of (4) returns a sample from the target distribution p_{\ast}, achieving the task of sampling. Here the notation \sim stands for drawing an i.i.d. sample from a distribution.

2.2 Score-function, explicit solution and score matching

It is clear that the success of SGM, namely being able to draw a sample from the target distribution p_{\ast}, lies in finding a good approximation of the score function u(t,x). In the idealized setting, this score function can be expressed explicitly. In practical computation, it is learned from the existing dataset through the score-matching procedure.

Expressing the score function explicitly amounts to solving (1), or equivalently (3). Taking the SDE perspective, we analyze the OU process in (3) and obtain an explicit solution:

X_{t}^{\rightarrow}:=\mu(t)y+\sigma(t)Z\quad\text{with}\quad\begin{cases}\mu(t):=e^{-t}\\ \sigma(t):=\sqrt{1-e^{-2t}}\,,\end{cases} (5)

where y is the initial data and Z\sim\pi^{d}. Equivalently, using the PDE perspective, one sets p_{0}=\delta_{y} as the initial condition to run (1) and forms a set of Green’s functions:

p_{t}(x|y):=\mathcal{N}\left(x;\mu(t)y,\sigma(t)^{2}I_{d\times d}\right)\,. (6)

These are Gaussian functions of x centered at \mu(t)y with isotropic variance \sigma(t)^{2}. This set of functions is also referred to as the transition kernel from time 0, conditioned on X_{0}^{\rightarrow}=y, to time t with X_{t}^{\rightarrow}=x.
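The forward noising step (5)-(6) is simple to simulate. The snippet below is a minimal sketch (illustrative, not from the paper): it implements \mu(t), \sigma(t), and a draw from the transition kernel p_{t}(\cdot|y).

```python
import numpy as np

def mu(t):                 # mean contraction factor of the OU process, eq. (5)
    return np.exp(-t)

def sigma(t):              # noise level of the OU process, eq. (5)
    return np.sqrt(1.0 - np.exp(-2.0 * t))

def sample_forward(y, t, rng):
    """Draw X_t^-> ~ p_t(.|y) = N(mu(t) y, sigma(t)^2 I), the transition kernel (6)."""
    return mu(t) * y + sigma(t) * rng.standard_normal(y.shape)

rng = np.random.default_rng(0)
y = np.array([1.0, -2.0])                      # an illustrative data point in R^2
x_half = sample_forward(y, t=0.5, rng=rng)     # a noised copy of y at time t = 0.5
```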

In the idealized setting where the target distribution p_{\ast} is fully known, with p_{0}=p_{\ast}, the solution of (1) becomes the superposition of Green’s functions weighted by p_{\ast}, namely:

p_{t}(x)=\int p_{t}(x|y)p_{\ast}(y)dy\,, (7)

and thus, by definition, the score function is explicit:

u(t,x)=\nabla\ln p_{t}(x)=\frac{\nabla p_{t}(x)}{p_{t}(x)}=\frac{\int u(t,x|y)p_{t}(x|y)p_{\ast}(y)dy}{\int p_{t}(x|y)p_{\ast}(y)dy}\,, (8)

where we used (7) and the notation u(t,x|y)=\nabla\ln p_{t}(x|y) for the conditional flow field. This function maps \mathbb{R}_{+}\times\mathbb{R}^{d} to \mathbb{R}^{d}. Using the explicit formula (6), we have the explicit form of the conditional flow field:

u(t,x|y)=-\frac{x-\mu(t)y}{\sigma(t)^{2}}\,. (9)

It is a linear function of x with Lipschitz constant \frac{1}{\sigma(t)^{2}}, which blows up at t=0.

Score matching. The practical setting is not idealized: the lack of an explicit formula for p_{\ast} prevents direct computation of (8). Algorithmically, one needs to learn u(t,x) from existing samples, and a neural network (NN) is deployed for this purpose.

Intuitively, the NN should provide a function as close as possible to the true score function, meaning it solves:

\min_{s\in\mathcal{F}}\;\mathcal{L}_{\text{SM}}(s):=\mathbb{E}_{t,x}\left[\left\|s(t,x)-u(t,x)\right\|^{2}\right],

where t\sim U[0,T], the uniform distribution over the time interval, and x\sim p_{t}(x). Here \mathcal{F} is a hypothesis space, in this context the function space representable by a class of neural networks. However, neither p_{t} nor u(t,x) is known in this formulation, so we turn to an equivalent problem:

\min_{s\in\mathcal{F}}\;\mathcal{L}_{\text{CSM}}(s):=\mathbb{E}_{t,y,x}\left[\left\|s(t,x)-u(t,x|y)\right\|^{2}\right],

where t\sim U[0,T], y\sim p_{\ast} and x\sim p_{t}(x|y). The subindex CSM stands for conditional score matching. The two problems can be shown to be mathematically equivalent, see Lemma B.1. Practically, however, this new problem is much more tractable, since both p_{t}(x|y) and u(t,x|y) are now explicit, see (6) and (9).

The target distribution p_{\ast} is still unknown. At hand, we have many samples drawn from it: \{y_{i}\}_{i=1}^{N}. This allows us to reformulate the problem into an empirical risk minimization (ERM) problem:

\min_{s\in\mathcal{F}}\;\mathcal{L}^{N}_{\text{CSM}}(s):=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{t,x}\left[\left\|s(t,x)-u(t,x|y_{i})\right\|^{2}\right] (10)

with t\sim U[0,T] and x\sim p_{t}(x|y_{i}).
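A minimal sketch of one stochastic evaluation of the empirical objective (10) is given below (assuming PyTorch and a generic network score_net(t, x); the names, the small guard delta near t=0, and the hyperparameters are illustrative, not from the paper; the guard anticipates the early stopping discussed later).

```python
import torch

def csm_loss(score_net, ys, delta=1e-3, T=1.0):
    """One mini-batch Monte-Carlo estimate of the empirical objective (10):
    average over i of E_{t, x ~ p_t(.|y_i)} || s(t,x) - u(t,x|y_i) ||^2."""
    N, d = ys.shape
    t = delta + (T - delta) * torch.rand(N, 1)        # one time per training point
    mu, sigma = torch.exp(-t), torch.sqrt(1 - torch.exp(-2 * t))
    z = torch.randn(N, d)
    x = mu * ys + sigma * z                           # x ~ p_t(.|y_i), eq. (5)-(6)
    target = -(x - mu * ys) / sigma**2                # u(t, x|y_i), eq. (9); equals -z/sigma
    return ((score_net(t, x) - target) ** 2).sum(dim=1).mean()
```

In practice the loss is usually reweighted by \sigma(t)^{2} and the network is parameterized to predict the noise z rather than the score, but the minimizer over a rich enough hypothesis space is unchanged.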

In the execution of a practical DDPM algorithm, (10) is first solved to find an NN that serves as a good approximation to the score function, denoted s(t,x), and the user then deploys this s(t,x) in (4) in place of u(t,x) to generate a new sample from p_{\ast}: sample \bar{X}_{0}^{\leftarrow}\sim\pi^{d} and run:

d\bar{X}_{t}^{\leftarrow}=\left(\bar{X}_{t}^{\leftarrow}+2s(T-t,\bar{X}_{t}^{\leftarrow})\right)dt+\sqrt{2}dB_{t}\,. (11)

The law is denoted by \bar{\mathsf{q}}_{t}:=\mathrm{Law}(\bar{X}_{t}^{\leftarrow}). We note two differences between (4) and (11): the initial data p_{T} is replaced by \pi^{d}, and the score function u(t,x) is replaced by the empirically learned score function s(t,x). If both approximations are accurate, we expect \bar{\mathsf{q}}_{t}\approx q_{t} for all t.

When minimizing the objective (10), due to the singularity of u(t,x|y_{i}) at t=0 as seen in (9), it is standard practice to conduct “early stopping” (Song et al., 2020). This removes a small fraction of time near the origin from the training (10), learning the score with t\sim U[\delta,T]. Consequently, the sampling is also run only up to T-\delta. The algorithm returns samples \bar{X}_{T-\delta}^{\leftarrow} drawn from \bar{\mathsf{q}}_{T-\delta}. The hope is that \bar{\mathsf{q}}_{T-\delta} approximates the target p_{\ast} through the following approximation chain:

\underbrace{\bar{\mathsf{q}}_{T-\delta}\approx q_{T-\delta}}_{\text{if}\;s\approx u\,,\;\pi^{d}\approx p_{T}}=\underbrace{p_{\delta}\approx p_{0}}_{\text{if}\;\delta\to 0}=p_{\ast}\,.
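For reference, the sampling step (11) with early stopping reduces to a few lines once a score function is available. The sketch below (illustrative; an Euler-Maruyama discretization with hypothetical step counts, not the paper's implementation) accepts any callable score(t, x), learned or exact.

```python
import numpy as np

def ddpm_sample(score, d, T=5.0, delta=1e-2, n_steps=500, rng=None):
    """Euler-Maruyama discretization of the backward SDE (11),
    run from backward time 0 to T - delta, initialized at the standard Gaussian pi^d."""
    rng = rng if rng is not None else np.random.default_rng()
    x = rng.standard_normal(d)                    # X_0^<- ~ pi^d
    dt = (T - delta) / n_steps
    for k in range(n_steps):
        t = k * dt                                # backward time; forward time is T - t
        drift = x + 2.0 * score(T - t, x)
        x = x + drift * dt + np.sqrt(2.0 * dt) * rng.standard_normal(d)
    return x                                      # approximately a draw from q_{T - delta}
```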

2.3 Error analysis for DDPM

In the idealized setting, T\to\infty, s(t,x)=u(t,x), \delta\to 0, and the backward SDE (11) is run perfectly; the sample initially drawn from the Gaussian \pi^{d} then represents the target distribution p_{\ast} at time T. Computationally, these assumptions all break: the four factors of finite T, nontrivial \delta, imperfect s(t,x), and the discretization of (11) each induce error. These errors were beautifully analyzed in (Chen et al., 2022; Benton et al., 2023a). We summarize their results briefly.

All analyses require the target distribution to have a bounded second moment.

Assumption 2.1 (bounded second moment).

We assume that \mathfrak{m}_{2}^{2}:=\mathbb{E}_{y\sim p_{\ast}}\left[\|y\|^{2}\right]<\infty.

The learned score function is also assumed to be close to the ground truth in L_{2}(dt,p_{t}dx):

Assumption 2.2 (score estimation error).

The score estimate s(t,x) satisfies

\mathbb{E}_{t\sim U[\delta,T],x\sim p_{t}}\left[\left\|s(t,x)-u(t,x)\right\|^{2}\right]\leq\varepsilon_{\text{score}}^{2}\,.

Under these assumptions, it was concluded that DDPM samples well:

Theorem 2.3 (Modified version of Theorem 1 in (Benton et al., 2023a)).

Suppose Assumptions 2.1 and 2.2 hold and T\geq 1, \delta>0. Let \bar{\mathsf{q}}_{T-\delta} be the output of the DDPM algorithm (11) at time T-\delta. Then it holds that

\mathrm{TV}\left(\bar{\mathsf{q}}_{T-\delta},p_{\delta}\right)\lesssim\varepsilon_{\text{score}}+\sqrt{d}\exp(-T)\,.

The discretization error in the original result is irrelevant to the discussion here and is omitted. This upper error bound consists of two parts. The first term \varepsilon_{\text{score}} comes from the score approximation error, while the second term \sqrt{d}\exp(-T) comes from the finite truncation, where we forcefully replace p_{T} by \pi^{d}.

The theorem states that, when T is large enough and the score function is approximated well in the L_{2}(dt,p_{t}dx) sense, the TV distance between the law of the generated samples \bar{\mathsf{q}}_{T-\delta} and p_{\delta}\approx p_{\ast} is very small, concluding that DDPM is a good sampling strategy.

It is tempting to push this statement further and claim that DDPM is also a good generative model. Indeed, on the surface, it is typically claimed that generative modeling is equivalent to drawing samples from a target distribution p_{\ast}. However, we should note a stark difference between sampling and generation: a meaningful generative model should be able to produce samples that are not mere replicas of known ones. The error bound in Theorem 2.3 does not exclude this possibility. As will be shown in Section 3, it is possible to design a DDPM whose score function is learned well, and which therefore, according to Theorem 2.3, produces a distribution close to the target. Yet in Section 4, we demonstrate that this model fails to produce new samples. These two sections combined suggest that a DDPM with a well-learned score function does not necessarily constitute a meaningful generative model.

3 A good score estimate: sample complexity analysis

Inspired by Theorem 2.3, we are to design a DDPM whose learned score function satisfies Assumption 2.2. Throughout the section, we assume the hypothesis space is large enough (\mathcal{F}\supseteq L^{2}([0,T]\times\mathbb{R}^{d}), for example), and that the learned score estimate achieves the global minimum of the ERM (10). In practical training, the error depends heavily on the specific NN structure utilized in the optimization. The approximation error of NN training is beyond the scope of the current paper.

Noting that the objective \mathcal{L}_{\text{CSM}}^{N}(s) is a convex functional of s, the optimizer has a closed form. As derived in Lemma B.2, for (t,x)\in[0,T]\times\mathbb{R}^{d}, the empirical optimal score function is:

s^{N}_{\{y_{i}\}}(t,x):=\frac{\sum_{i=1}^{N}u(t,x|y_{i})p_{t}(x|y_{i})}{\sum_{j=1}^{N}p_{t}(x|y_{j})}\,, (12)

where u(t,x|y) is the conditional flow field, see (9).
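The closed form (12) is a softmax-weighted average of the conditional scores and can be evaluated directly. The sketch below (illustrative; assuming numpy) computes the weights in log space so that the Gaussian factors p_{t}(x|y_{i}) do not underflow when \sigma(t) is small; the common normalizing constant cancels.

```python
import numpy as np

def empirical_optimal_score(t, x, ys):
    """Closed-form minimizer (12) of the empirical objective (10):
    a softmax-weighted average of the conditional scores u(t, x|y_i)."""
    m, sig2 = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    diffs = x[None, :] - m * ys                        # shape (N, d)
    logw = -np.sum(diffs ** 2, axis=1) / (2.0 * sig2)  # log p_t(x|y_i) up to a shared constant
    w = np.exp(logw - logw.max())
    w /= w.sum()                                       # weights p_t(x|y_i) / sum_j p_t(x|y_j)
    cond_scores = -diffs / sig2                        # u(t, x|y_i) = -(x - mu(t) y_i) / sigma(t)^2
    return (w[:, None] * cond_scores).sum(axis=0)
```

Plugging this callable into the sampler sketched after (11) gives the DDPM (13) below.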

Accordingly, the DDPM draws initial data \widehat{X}_{0}^{\leftarrow}\sim\pi^{d} and evolves the following SDE:

d\widehat{X}_{t}^{\leftarrow}=\left(\widehat{X}_{t}^{\leftarrow}+2s^{N}_{\{y_{i}\}}(T-t,\widehat{X}_{t}^{\leftarrow})\right)dt+\sqrt{2}dB_{t}\,. (13)

We denote the law of the samples by \widehat{\mathsf{q}}_{t}:=\mathrm{Law}(\widehat{X}_{t}^{\leftarrow}). The choice of font indicates that the law is produced by a finite-dimensional object s^{N}_{\{y_{i}\}}.

To understand the empirical optimal score function, we compare (12) with the ground-truth score function (8). It is clear that s^{N}_{\{y_{i}\}} can be interpreted as a Monte-Carlo (MC) approximation of u(t,x), replacing both integrals in the numerator and the denominator of (8) by empirical means. The law of large numbers suggests that the empirical mean converges to the true mean when the number of samples is large. Therefore, s^{N}_{\{y_{i}\}} is expected to approximate u well with very high probability when N\gg 1. We formalize this result in the following theorem.

Theorem 3.1 (Approximation error of empirical optimal score function).

Let \{y_{i}\}_{i=1}^{N} be N i.i.d. samples drawn from the target data distribution p_{\ast}. Denote by u(t,x) and s^{N}_{\{y_{i}\}}(t,x) the true and the empirical optimal score function, respectively, as defined in (8) and (12). Then for any fixed 0<\delta<T<\infty, \varepsilon_{\text{score}}>0 and \tau>0, we have

\mathbb{E}_{t\sim U[\delta,T],x\sim p_{t}}\left[\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2}\right]\leq\varepsilon_{\text{score}}^{2},

with probability at least 1-\tau, provided that the number of training samples satisfies N\geq N(\varepsilon_{\text{score}},\delta,\tau). In particular:

  • Case 1: If p_{\ast} is an isotropic Gaussian, i.e. p_{\ast}(y)=\mathcal{N}(y;\mu_{p_{\ast}},\sigma_{p_{\ast}}^{2}I_{d\times d}), with second moment \mathfrak{m}_{2}^{2}=O(d), then N(\varepsilon_{\text{score}},\delta,\tau)=\frac{1}{\tau\varepsilon_{\text{score}}^{2}}\frac{O(d)}{(2\delta)^{(d+4)/2}};

  • Case 2: If p_{\ast} is supported on the Euclidean ball of radius R such that R^{2}=O(d), then N(\varepsilon_{\text{score}},\delta,\tau)=\frac{1}{\tau\varepsilon_{\text{score}}^{2}}\exp\left(\frac{O(d)}{\delta}\right).

The theorem implies that when the sample size is large, N\geq N(\varepsilon_{\text{score}},\delta,\tau), we have confidence at least 1-\tau that the empirical optimal score function s^{N}_{\{y_{i}\}}, computed from the i.i.d. samples \{y_{i}\}, is within \varepsilon_{\text{score}} of the true score function u(t,x) in L_{2}(dt,p_{t}dx).

Remark 3.2.

A few comments are in order:

  • (a)

    Second moment \mathfrak{m}_{2}^{2}=O(d) and support radius R^{2}=O(d): requiring the second moment and the support radius to be of the same order as d is only for notational convenience. The assumption can be relaxed in the proof; when we do so, the success rate needs to be adjusted accordingly (see the discussion in Appendix C).

  • (b)

    Implication for DDPM performance: combining Theorem 3.1 with Theorem 2.3, it is straightforward to draw a conclusion on the performance of DDPM in terms of sample complexity. Under the same assumptions as Theorem 3.1, for any tolerance \varepsilon>0, choosing T=\log\frac{\sqrt{d}}{\varepsilon} and N\geq N(\varepsilon,\delta,\tau), the DDPM algorithm (13) run with the empirical optimal score function s^{N} computed from (12) gives:

     \mathrm{TV}(\widehat{\mathsf{q}}_{T-\delta},p_{\delta})\lesssim\varepsilon

     with probability at least 1-\tau.

  • (c)

    Error dependence on parameters: both the confidence parameter \tau and the accuracy parameter \varepsilon_{\text{score}} appear algebraically in N(\varepsilon_{\text{score}},\delta,\tau). The rate \varepsilon_{\text{score}}^{-2} comes from the 1/\sqrt{N} convergence of MC sampling and is expected to be optimal. The rate \tau^{-1} reflects the fact that the proof uses the simple Markov inequality.

We leave the full proof to Appendix C and only briefly discuss the proof strategy, using Case 2 as an example.

Sketch of proof.

Denote the error term

\left|E^{t}_{\{y_{i}\}}\right|^{2}=\mathbb{E}_{x\sim p_{t}}\left[\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2}\right] (14)

and

\left|E_{\{y_{i}\}}\right|^{2}=\mathbb{E}_{t\sim U[\delta,T]}\left|E^{t}_{\{y_{i}\}}\right|^{2}=\frac{1}{T-\delta}\int_{\delta}^{T}\left|E^{t}_{\{y_{i}\}}\right|^{2}dt\,.

E_{\{y_{i}\}} defines a function that maps \{y_{i}\}\in\mathbb{R}^{Nd} to \mathbb{R}^{+}, and is itself a random variable. According to Markov’s inequality:

\mathbb{P}\left(E_{\{y_{i}\}}>\varepsilon_{\text{score}}\right)\leq\frac{\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left|E_{\{y_{i}\}}\right|^{2}}{\varepsilon_{\text{score}}^{2}}\,. (15)

To compute the right hand side, we note

\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left|E_{\{y_{i}\}}\right|^{2}=\mathbb{E}_{t,\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left|E^{t}_{\{y_{i}\}}\right|^{2}\,, (16)

and for fixed t\in[\delta,T], according to the definition (14), one can show:

\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left|E^{t}_{\{y_{i}\}}\right|^{2}\lesssim\frac{1}{N}\frac{1}{t}\exp\left(\frac{O(d)}{t}\right)\,. (17)

Taking the expectation with respect to t over [\delta,T], we have

\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left|E_{\{y_{i}\}}\right|^{2}\lesssim\frac{1}{N}\exp\left(\frac{O(d)}{\delta}\right)\,,

finishing the proof when combined with (15). ∎

It is clear that the entire proof is built upon a direct use of Markov’s inequality, and the most technical component of the proof is the estimate of the mean of the error term |E^{t}_{\{y_{i}\}}|^{2} in (17). We provide this estimate in Lemma C.2.

4 A bad SGM: memorization effects

Theorems 2.3 and 3.1 combined imply that the DDPM (11) run with the empirical optimal score function s^{N}_{\{y_{i}\}} provides a good sampling method with high probability. It is tempting to push this statement further and call it a good generative model. We show in this section that this is not the case. In particular, we claim that a DDPM run with s^{N}_{\{y_{i}\}} leads to a kernel density estimation (KDE).

To be more precise, with \{y_{i}\}_{i=1}^{N} drawn i.i.d. from the target distribution p_{\ast}, the DDPM (11) run with s^{N}_{\{y_{i}\}} produces a distribution that is the convolution of a Gaussian with \mathsf{p}_{\ast}=\frac{1}{N}\sum_{i=1}^{N}\delta_{y_{i}}, and hence is a KDE of p_{\ast}. Since the context is clear, throughout this section we drop the lower index \{y_{i}\} in s^{N}_{\{y_{i}\}}.

The statement above stems from two simple observations. Firstly, the solution to (1) with the initial distribution set to \mathsf{p}_{\ast} is a simple Gaussian convolution with \mathsf{p}_{\ast}; secondly, the exact score function of this new system (initialized at \mathsf{p}_{\ast}) happens to be the empirical optimal score function (12).

To expand on this, we first set the initial data for (1) to \mathsf{p}_{\ast}, the empirical distribution. The theory in Section 2.2 still applies. In particular, the solution to (1), denoted by \mathsf{p}_{t}, and the solution to (2), denoted by \mathsf{q}_{t}, still have explicit forms in terms of the Green’s functions:

\mathsf{p}_{t}(x)=\mathsf{q}_{T-t}(x)=\int p_{t}(x|y)\mathsf{p}_{\ast}(y)dy=\frac{1}{N}\sum_{i=1}^{N}p_{t}(x|y_{i})=\frac{1}{N}\sum_{i=1}^{N}\mathcal{N}\left(x;\mu(t)y_{i},\sigma(t)^{2}I_{d\times d}\right)\,. (18)

For small t, \mu(t)\approx 1 and \sigma(t)\approx 0, so the PDE solution (18) strongly resembles a KDE of p_{\ast} with bandwidth \gamma=\sigma(t):

\mathsf{p}_{\ast}^{\gamma}(x):=\mathsf{p}_{\ast}\ast\mathcal{N}_{\gamma}=\frac{1}{N}\sum_{i=1}^{N}\mathcal{N}\left(x;y_{i},\gamma^{2}I_{d\times d}\right)\,,

where \ast denotes convolution. The resemblance can be characterized mathematically precisely:

Proposition 4.1.

Suppose the training samples \{y_{i}\}_{i=1}^{N} satisfy \|y_{i}\|_{2}\leq d. Then for \delta\geq 0, \mathrm{TV}(\mathsf{q}_{T-\delta},\mathsf{p}_{\ast}^{\gamma})\leq\frac{d\sqrt{\delta}}{2} with \gamma=\sigma(\delta), where \sigma(\cdot) is defined in (5).

This means that the forward and backward procedure described in (1)-(2), when initialized at the empirical distribution, approximately provides a simple KDE of the target distribution.

We now further claim that this forward and backward procedure is exactly realized by running SGM with the empirical optimal score s^{N}. To see this, we follow the computation in (8) and use (18) to obtain:

\nabla\ln\mathsf{p}_{t}(x)=\frac{\sum_{i=1}^{N}\nabla p_{t}(x|y_{i})}{\sum_{j=1}^{N}p_{t}(x|y_{j})}=\frac{\sum_{i=1}^{N}u(t,x|y_{i})p_{t}(x|y_{i})}{\sum_{j=1}^{N}p_{t}(x|y_{j})}\,.

This means that the exact score function of the KDE approximation \mathsf{p}_{t}=\mathsf{q}_{T-t} is exactly s^{N}, the empirical optimal score for p_{t}, and thus SGM with the empirical optimal score realizes the KDE approximation, as stated in the following proposition.

Proposition 4.2.

Under the same assumptions as in Proposition 4.1, on the time interval t\in[0,T], the total variation between the output distribution \widehat{\mathsf{q}}_{t} of the SGM algorithm (13) with the empirical optimal score function and the KDE approximation \mathsf{q}_{t} is bounded by \mathrm{TV}\left(\widehat{\mathsf{q}}_{t},\mathsf{q}_{t}\right)\leq\frac{d}{2}\exp(-T).

Combining Propositions 4.1 and 4.2 with the triangle inequality, we see that \widehat{\mathsf{q}}_{t} is essentially a kernel density estimate when t approaches T. Furthermore, taking the limit t=T\to+\infty, we obtain the finite-support result:

Theorem 4.3 (SGM with empirical optimal score function resembles KDE).

Under the same assumptions as Proposition 4.2, the SGM algorithm (13) with the empirical optimal score function s^{N} returns a simple Gaussian convolution with the empirical distribution in the form of (18), and it exhibits the following behavior:

  • (with early stopping) for any \varepsilon>0, setting T=\log\frac{d}{\varepsilon} and \delta=\frac{\varepsilon^{2}}{d}, we have

     \mathrm{TV}(\widehat{\mathsf{q}}_{T-\delta},\mathsf{p}_{\ast}^{\gamma})\leq\varepsilon\,,\quad\text{with}\quad\gamma=\sigma(\delta)\,,
  • (without early stopping) taking the limit T\rightarrow+\infty and \delta=0, we have \widehat{\mathsf{q}}_{\infty}=\mathsf{p}_{\ast}=\frac{1}{N}\sum_{i=1}^{N}\delta_{y_{i}}.

The theorem shows that DDPM with the empirical optimal score function s^{N} is, in the end, simply a KDE of the target p_{\ast}. However close the KDE \mathsf{p}_{\ast}^{\gamma} is to the target p_{\ast}, it is nevertheless an object carrying only a finite amount of information.

Unlike drawing from p_{\ast}, where one generates a completely new sample independent of the training samples, drawing from \mathsf{p}_{\ast}^{\gamma} can only provide replicas of the y_{i} (slightly shifted and polluted with Gaussian noise). In summary, an SGM run with the empirical optimal score function fails the task of generation.
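Combining the earlier sketches of empirical_optimal_score and ddpm_sample, the memorization predicted by Theorem 4.3 can be checked numerically in a few lines: samples produced by the DDPM (13) land within a distance on the order of \sigma(\delta) of some training point. The snippet below is illustrative and reuses the hypothetical helpers defined above; it is not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, T, delta = 2, 50, 5.0, 1e-2
train = rng.standard_normal((N, d))                  # toy training set {y_i}

# Score callable from the sketch after (12); sampler from the sketch after (11).
score = lambda t, x: empirical_optimal_score(t, x, train)
samples = np.stack([ddpm_sample(score, d, T=T, delta=delta, n_steps=2000, rng=rng)
                    for _ in range(200)])

dists = np.min(np.linalg.norm(samples[:, None, :] - train[None], axis=-1), axis=1)
gamma = np.sqrt(1.0 - np.exp(-2.0 * delta))          # bandwidth sigma(delta) from Theorem 4.3
print("median distance to nearest y_i: %.3f  (sigma(delta) = %.3f)" % (np.median(dists), gamma))
```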

Some mathematical comments are in order. We first note that Theorem 4.3 does not contradict Remark 3.2(b). Indeed, with high probability, \widehat{\mathsf{q}}_{T-\delta} approximates both p_{\delta} and the KDE \mathsf{p}_{\ast}^{\gamma}. The second bullet point (without early stopping) was also discussed in (Gu et al., 2023); our result generalizes theirs to any small time T-\delta.

5 Numerical Experiments

This section is dedicated to providing numerical evidence for Theorem 3.1 and Theorem 4.3. Throughout the experiments, we choose the target data distribution p_{\ast} to be a 2-dimensional isotropic Gaussian, denoted by p_{\ast}(x)=\mathcal{N}(x;\mu_{p_{\ast}},\sigma_{p_{\ast}}^{2}I_{2\times 2}). The implementation details are provided in Appendix E.

We first estimate the score approximation error of the empirical optimal score function, as defined in (16), for various sizes of the training sample N. Figure 2 shows that the error decreases at a rate of approximately O(\frac{1}{N}), confirming the theoretical finding in Theorem 3.1; see also Remark 3.2(c).
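For a Gaussian target the exact score of p_{t} is available in closed form, so the quantity in (16) can be estimated by plain Monte Carlo. The sketch below (illustrative; it reuses the hypothetical empirical_optimal_score helper from Section 3 and assumes a centered Gaussian target with standard deviation sigma_p) produces one point of the error-versus-N curve.

```python
import numpy as np

def score_error(N, d=2, sigma_p=2.0, delta=1e-2, T=5.0, n_mc=2000, rng=None):
    """Monte-Carlo estimate of E_{t ~ U[delta,T], x ~ p_t} || s^N(t,x) - u(t,x) ||^2
    for the target p_* = N(0, sigma_p^2 I), whose score along the OU flow is explicit."""
    rng = rng if rng is not None else np.random.default_rng(0)
    ys = sigma_p * rng.standard_normal((N, d))       # training samples from p_*
    err = 0.0
    for _ in range(n_mc):
        t = rng.uniform(delta, T)
        m, sig2 = np.exp(-t), 1.0 - np.exp(-2.0 * t)
        var_t = (m * sigma_p) ** 2 + sig2            # p_t = N(0, var_t I) for this target
        x = np.sqrt(var_t) * rng.standard_normal(d)  # x ~ p_t
        u_true = -x / var_t                          # exact score of the Gaussian p_t
        err += np.sum((empirical_optimal_score(t, x, ys) - u_true) ** 2)
    return err / n_mc

# Doubling N should roughly halve the returned error, matching the O(1/N) rate in Figure 2.
```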

Figure 2: Score approximation error of the empirical optimal score function, defined in (16), versus the number of training samples N. Both the x-axis and the y-axis are on logarithmic scales. The orange crosses represent the score approximation error for varying values of N, with a fitted blue trend line. Reference lines with a slope of -1 are depicted by the green dashed lines, illustrating that the slope of the blue line is also approximately -1. This observation corroborates the rate O(\frac{1}{N}) provided in Theorem 3.1.

Secondly, we showcase Theorem 4.3 and demonstrate that DDPM behaves as a KDE when equipped with the empirical optimal score function. As seen in Figure 3, samples produced by the DDPM run with s^{N} exhibit a high concentration around the training samples. Conversely, while the samples generated by the DDPM run with the true score function u(t,x) appear to be drawn from the same distribution as the training samples, they are not mere duplicates of the existing ones.

Figure 3: Left: Samples generated by DDPM with the empirical optimal score function s^{N}(t,x). Right: Samples generated by DDPM with the true score function u(t,x). In both plots, the blue crosses are the training samples, the green dots are the initialization positions, and the orange dots are the outputs of DDPM with early stopping at \delta=0.01.

6 Discussion and Conclusion

The classical theory measures the success of a score-based generative model by the distance between the learned distribution and the ground-truth distribution. Under this criterion, an SGM is successful if the score function is learned well.

In this paper, we provide a counter-example of an SGM that has a good score approximation yet produces meaningless samples. On one hand, Theorem 2.3 and Theorem 3.1 combined suggest that an SGM equipped with the empirical optimal score function learns a distribution close to the ground truth. On the other hand, Theorem 4.3 shows that this scenario resembles Gaussian kernel density estimation and can only reproduce existing training samples with Gaussian blurring.

This apparent paradox between sound theoretical convergence and poor generation of new samples indicates that current theoretical criteria may not be sufficient to fully evaluate the performance of generative models. They focus heavily on the “imitation” capability and lose out on quantifying “creativity”. Similar features appear in other generative models such as generative adversarial networks (Vardanyan et al., 2023), and different criteria have been proposed (Vardanyan et al., 2023; Yi et al., 2023), yet a comprehensive end-to-end convergence analysis under these criteria has not been carried out for SGMs. We leave this exploration to future research.

Broader Impact

Our results, while theoretical in nature, have potential positive impacts in motivating better frameworks that ensure generative models do not create unintended leakage of private information. We believe that there are no clear negative societal consequences of this theoretical work.

Acknowledgements

The three authors are supported in part by NSF-DMS 1750488, and NSF-DMS 2308440. S. Li is further supported by NSF-DMS 2236447.

References

  • Albergo et al. (2023) Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
  • Bakry et al. (2014) Bakry, D., Gentil, I., Ledoux, M., et al. Analysis and geometry of Markov diffusion operators, volume 103. Springer, 2014.
  • Benton et al. (2023a) Benton, J., De Bortoli, V., Doucet, A., and Deligiannidis, G. Linear convergence bounds for diffusion models via stochastic localization. arXiv preprint arXiv:2308.03686, 2023a.
  • Benton et al. (2023b) Benton, J., Deligiannidis, G., and Doucet, A. Error bounds for flow matching methods. arXiv preprint arXiv:2305.16860, 2023b.
  • Block et al. (2022) Block, A., Mroueh, Y., and Rakhlin, A. Generative modeling with denoising auto-encoders and langevin sampling, 2022.
  • Carlini et al. (2023) Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramèr, F., Balle, B., Ippolito, D., and Wallace, E. Extracting training data from diffusion models, 2023.
  • Chen et al. (2023a) Chen, H., Lee, H., and Lu, J. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pp.  4735–4763. PMLR, 2023a.
  • Chen et al. (2023b) Chen, M., Huang, K., Zhao, T., and Wang, M. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. arXiv preprint arXiv:2302.07194, 2023b.
  • Chen et al. (2022) Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. R. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022.
  • Chen et al. (2023c) Chen, S., Chewi, S., Lee, H., Li, Y., Lu, J., and Salim, A. The probability flow ode is provably fast. arXiv preprint arXiv:2305.11798, 2023c.
  • Chen et al. (2023d) Chen, S., Daras, G., and Dimakis, A. G. Restoration-degradation beyond linear diffusions: A non-asymptotic analysis for ddim-type samplers. arXiv preprint arXiv:2303.03384, 2023d.
  • Cui et al. (2023) Cui, H., Krzakala, F., Vanden-Eijnden, E., and Zdeborová, L. Analysis of learning a flow-based generative model from limited sample complexity, 2023.
  • De Bortoli (2022) De Bortoli, V. Convergence of denoising diffusion models under the manifold hypothesis. arXiv preprint arXiv:2208.05314, 2022.
  • De Bortoli et al. (2021) De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.
  • Donahue et al. (2018) Donahue, C., McAuley, J., and Puckette, M. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.
  • Gong et al. (2022) Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933, 2022.
  • Gu et al. (2023) Gu, X., Du, C., Pang, T., Li, C., Lin, M., and Wang, Y. On memorization in diffusion models. arXiv preprint arXiv:2310.02664, 2023.
  • Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models, 2020.
  • Huang et al. (2018) Huang, H., He, R., Sun, Z., Tan, T., et al. Introvae: Introspective variational autoencoders for photographic image synthesis. Advances in neural information processing systems, 31, 2018.
  • Huang et al. (2022) Huang, R., Lam, M. W., Wang, J., Su, D., Yu, D., Ren, Y., and Zhao, Z. Fastdiff: A fast conditional diffusion model for high-quality speech synthesis. arXiv preprint arXiv:2204.09934, 2022.
  • Kadkhodaie et al. (2023) Kadkhodaie, Z., Guth, F., Simoncelli, E. P., and Mallat, S. Generalization in diffusion models arises from geometry-adaptive harmonic representation, 2023.
  • Karras et al. (2022) Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
  • Kong et al. (2020a) Kong, J., Kim, J., and Bae, J. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33:17022–17033, 2020a.
  • Kong et al. (2020b) Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020b.
  • Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images, 2009.
  • Kwon et al. (2022) Kwon, D., Fan, Y., and Lee, K. Score-based generative modeling secretly minimizes the wasserstein distance. Advances in Neural Information Processing Systems, 35:20205–20217, 2022.
  • Lee et al. (2022) Lee, H., Lu, J., and Tan, Y. Convergence for score-based generative modeling with polynomial complexity. Advances in Neural Information Processing Systems, 35:22870–22882, 2022.
  • Li et al. (2023) Li, G., Wei, Y., Chen, Y., and Chi, Y. Towards faster non-asymptotic convergence for diffusion-based generative models. arXiv preprint arXiv:2306.09251, 2023.
  • Li et al. (2022) Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328–4343, 2022.
  • Lipman et al. (2022) Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
  • New York Times (2023) New York Times. The New York Times sued OpenAI and Microsoft for copyright infringement, 2023. URL https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf.
  • Oko et al. (2023) Oko, K., Akiyama, S., and Suzuki, T. Diffusion models are minimax optimal distribution estimators, 2023.
  • Rombach et al. (2022) Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  • Somepalli et al. (2022) Somepalli, G., Singla, V., Goldblum, M., Geiping, J., and Goldstein, T. Diffusion art or digital forgery? investigating data replication in diffusion models, 2022.
  • Somepalli et al. (2023) Somepalli, G., Singla, V., Goldblum, M., Geiping, J., and Goldstein, T. Understanding and mitigating copying in diffusion models. arXiv preprint arXiv:2305.20086, 2023.
  • Song et al. (2020) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • Terrell & Scott (1992) Terrell, G. R. and Scott, D. W. Variable kernel density estimation. The Annals of Statistics, pp.  1236–1265, 1992.
  • Vardanyan et al. (2023) Vardanyan, E., Minasyan, A., Hunanyan, S., Galstyan, T., and Dalalyan, A. Guaranteed optimal generative modeling with maximum deviation from the empirical distribution. arXiv preprint arXiv:2307.16422, 2023.
  • Wang et al. (2018) Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  8798–8807, 2018.
  • Watson et al. (2023) Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., Ahern, W., Borst, A. J., Ragotte, R. J., Milles, L. F., et al. De novo design of protein structure and function with rfdiffusion. Nature, 620(7976):1089–1100, 2023.
  • Wibisono & Yingxi Yang (2022) Wibisono, A. and Yingxi Yang, K. Convergence in kl divergence of the inexact langevin algorithm with application to score-based generative models. arXiv e-prints, pp.  arXiv–2211, 2022.
  • Yi et al. (2023) Yi, M., Sun, J., and Li, Z. On the generalization of diffusion model, 2023.
  • Yoon et al. (2023) Yoon, T., Choi, J. Y., Kwon, S., and Ryu, E. K. Diffusion probabilistic models generalize when they fail to memorize. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.

Appendix A Notations

Partial differential equations (PDEs). Let \mathbb{R}^{d} be the d-dimensional Euclidean space and T>0 the time horizon. Denote by x=(x_{1},x_{2},\dots,x_{d})^{\top}\in\mathbb{R}^{d} and t\in[0,T] the spatial and time variables respectively. The gradient of a real-valued function p with respect to the spatial variable and the time derivative of p are denoted by \nabla p=\left(\frac{\partial p}{\partial x_{1}},\frac{\partial p}{\partial x_{2}},\cdots,\frac{\partial p}{\partial x_{d}}\right) and \partial_{t}p respectively. The Laplacian of p is denoted by \Delta p=\nabla\cdot(\nabla p). Here, \nabla\cdot F=\sum_{i=1}^{d}\frac{\partial F_{i}}{\partial x_{i}} indicates the divergence of F=(F_{1},F_{2},\cdots,F_{d}) with respect to the spatial variable x.

Stochastic differential equations (SDEs) and their laws.

  • The target data distribution is p_{\ast}.

  • The forward process (3) initialized at the target distribution p_{\ast} is denoted (X_{t}^{\rightarrow})_{t\in[0,T]}, and p_{t}:=\mathrm{Law}(X_{t}^{\rightarrow}).

  • The backward process (4) is denoted (X_{t}^{\leftarrow})_{t\in[0,T]}, where \mathrm{Law}(X_{t}^{\leftarrow}):=q_{t}=p_{T-t}=\mathrm{Law}(X_{T-t}^{\rightarrow}).

  • The DDPM algorithm (11) with an arbitrary learned score function is denoted (\bar{X}_{t}^{\leftarrow})_{t\in[0,T]} and \bar{\mathsf{q}}_{t}:=\mathrm{Law}(\bar{X}_{t}^{\leftarrow}). We initialize the process at \bar{\mathsf{q}}_{0}=\pi^{d}, the standard Gaussian distribution.

  • The DDPM algorithm (13) with the empirical optimal score function s^{N} is denoted by (\widehat{X}_{t}^{\leftarrow})_{t\in[0,T]}. We indicate the law at time t by \widehat{\mathsf{q}}_{t}:=\mathrm{Law}(\widehat{X}_{t}^{\leftarrow}) and let \widehat{\mathsf{q}}_{0}=\pi^{d}.

  • The law of the forward process (3) initialized at the empirical distribution \mathsf{p}_{\ast} at time t\in[0,T] is indicated by \mathsf{p}_{t}. The law of the corresponding backward process at time t\in[0,T] is denoted by \mathsf{q}_{t}=\mathsf{p}_{T-t}.

Other notations. We denote by p_{\ast} the target data distribution supported on a subset of \mathbb{R}^{d}, and indicate the empirical distribution by \mathsf{p}_{\ast}. The Gaussian kernel with bandwidth \gamma is denoted by \mathcal{N}_{\gamma}:=\mathcal{N}(0,\gamma^{2}I_{d\times d}). For the special case \gamma=1, i.e. the standard Gaussian, we use the notation \pi^{d}:=\mathcal{N}(0,I_{d\times d}). We denote the Gaussian KDE with bandwidth \gamma by \mathsf{p}_{\ast}^{\gamma}:=\mathsf{p}_{\ast}*\mathcal{N}_{\gamma}. The early stopping time for running the SDEs is indicated by \delta\in[0,T). We use i\in[N] to denote i=1,2,\dots,N.

Appendix B Empirical optimal score function

Lemma B.1.

Assuming that p_{t}(x)>0 for all x\in\mathbb{R}^{d} and t\in[0,T], then up to a constant independent of the function s\in L^{2}([0,T]\times\mathbb{R}^{d}), \mathcal{L}_{\text{SM}}(s) and \mathcal{L}_{\text{CSM}}(s) are equal.

Proof.

We follow the proof of Theorem 2 in (Lipman et al., 2022). We assume that p(x)p_{\ast}(x) are decreasing to zero at a sufficient speed as x\|x\|\rightarrow\infty, and u(t,x),s(t,x)u(t,x),s(t,x) are bounded in both time and space variables. These assumptions ensure the existence of all integrals and allow the changing of integration order (by Fubini’s theorem).

To prove SM(s)\mathcal{L}_{\text{SM}}(s) and CSM(s)\mathcal{L}_{\text{CSM}}(s) are equal up to a constant independent of function ss, we only need to show that for any fixed t[0,T]t\in[0,T],

𝔼xpt[s(t,x)u(t,x)2]=𝔼yp,xpt(x|y)[s(t,x)u(t,x|y)2]+C,\mathbb{E}_{x\sim p_{t}}\left[\left\|s(t,x)-u(t,x)\right\|^{2}\right]=\mathbb{E}_{y\sim p_{\ast},x\sim p_{t}(x|y)}\left[\left\|s(t,x)-u(t,x|y)\right\|^{2}\right]+C,

where CC is a constant function that independent of function ss. We can compute that

\displaystyle\mathbb{E}_{x\sim p_{t}}\left[\left\|s(t,x)\right\|^{2}\right]=\int\left\|s(t,x)\right\|^{2}p_{t}(x)\,dx=\int\int\left\|s(t,x)\right\|^{2}p_{t}(x|y)p_{\ast}(y)\,dy\,dx=\mathbb{E}_{y\sim p_{\ast},x\sim p_{t}(x|y)}\left[\left\|s(t,x)\right\|^{2}\right],

where in the second equality we use the definition of $p_{t}(x)$, and in the third equality we change the order of integration. Similarly, for the cross term,

𝔼xpt[s(t,x),u(t,x)]\displaystyle\mathbb{E}_{x\sim p_{t}}\left[\langle s(t,x),u(t,x)\rangle\right] =s(t,x),u(t,x|y)pt(x|y)p(y)𝑑ypt(x)pt(x)𝑑x\displaystyle=\int\langle s(t,x),\frac{\int u(t,x|y)p_{t}(x|y)p_{\ast}(y)dy}{p_{t}(x)}\rangle p_{t}(x)dx
=s(t,x),u(t,x|y)pt(x|y)p(y)𝑑y𝑑x\displaystyle=\int\langle s(t,x),\int u(t,x|y)p_{t}(x|y)p_{\ast}(y)dy\rangle dx
=s(t,x),u(t,x|y)pt(x|y)p(y)𝑑y𝑑x\displaystyle=\int\langle s(t,x),u(t,x|y)\rangle p_{t}(x|y)p_{\ast}(y)dydx
=𝔼yp,xpt(x|y)[s(t,x),u(t,x|y)](by Fubini’s theorem)\displaystyle=\mathbb{E}_{y\sim p_{\ast},x\sim p_{t}(x|y)}\left[\langle s(t,x),u(t,x|y)\rangle\right]\qquad(\text{by Fubini's theorem})

Therefore, we have

𝔼xpt[s(t,x)u(t,x)2]\displaystyle\mathbb{E}_{x\sim p_{t}}\left[\left\|s(t,x)-u(t,x)\right\|^{2}\right] =𝔼xpt[s(t,x)2]2𝔼xpt[s(t,x),u(t,x)]+𝔼xpt[u(t,x)2]\displaystyle=\mathbb{E}_{x\sim p_{t}}\left[\left\|s(t,x)\right\|^{2}\right]-2\mathbb{E}_{x\sim p_{t}}\left[\langle s(t,x),u(t,x)\rangle\right]+\mathbb{E}_{x\sim p_{t}}\left[\left\|u(t,x)\right\|^{2}\right]
=𝔼yp,xpt(x|y)[s(t,x)2]2𝔼yp,xpt(x|y)[s(t,x),u(t,x|y)]+𝔼xpt[u(t,x)2]\displaystyle=\mathbb{E}_{y\sim p_{\ast},x\sim p_{t}(x|y)}\left[\left\|s(t,x)\right\|^{2}\right]-2\mathbb{E}_{y\sim p_{\ast},x\sim p_{t}(x|y)}\left[\langle s(t,x),u(t,x|y)\rangle\right]+\mathbb{E}_{x\sim p_{t}}\left[\left\|u(t,x)\right\|^{2}\right]
=𝔼yp,xpt(x|y)[s(t,x)u(t,x|y)2]+C,\displaystyle=\mathbb{E}_{y\sim p_{\ast},x\sim p_{t}(x|y)}\left[\left\|s(t,x)-u(t,x|y)\right\|^{2}\right]+C,

where the last equality follows from the fact that $u(t,x)$ and $u(t,x|y)$ do not depend on $s(t,x)$, so the remaining terms can be absorbed into the constant $C$. ∎

Lemma B.2.

The optimizer $s^{N}$ of the objective function

minsL2([0,1]×d)CSMN(s):=1Ni=1N𝔼tU[0,1],xpt(x|yi)[s(t,x)u(t,x|yi)2]\min_{s\in L^{2}([0,1]\times\mathbb{R}^{d})}\mathcal{L}^{N}_{\text{CSM}}(s):=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{t\sim U[0,1],x\sim p_{t}(x|y_{i})}\left[\left\|s(t,x)-u(t,x|y_{i})\right\|^{2}\right]

has the form

sN(t,x):=i=1Nu(t,x|yi)pt(x|yi)j=1Npt(x|yj),t[0,T],xds^{N}(t,x):=\frac{\sum_{i=1}^{N}u(t,x|y_{i})p_{t}(x|y_{i})}{\sum_{j=1}^{N}p_{t}(x|y_{j})},\qquad t\in[0,T],x\in\mathbb{R}^{d}
Proof.

Since the objective $\mathcal{L}^{N}_{\text{CSM}}(s)$ is a convex functional of $s$, by the first-order optimality condition the optimizer $s^{N}$ must satisfy

\frac{\delta\mathcal{L}^{N}_{\text{CSM}}(s)}{\delta s}\Bigg{|}_{s=s^{N}}=\frac{2}{N}\sum_{i=1}^{N}\left[s^{N}(t,x)-u(t,x|y_{i})\right]p_{t}(x|y_{i})=0,

which implies that for $t\in[0,T]$ and $x\in\mathbb{R}^{d}$,

sN(t,x)=i=1Nu(t,x|yi)pt(x|yi)j=1Npt(x|yj).s^{N}(t,x)=\frac{\sum_{i=1}^{N}u(t,x|y_{i})p_{t}(x|y_{i})}{\sum_{j=1}^{N}p_{t}(x|y_{j})}.
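For illustration, the closed form above can be evaluated directly. The following Python sketch (ours, not part of the original derivation) implements $s^{N}(t,x)$ under the OU schedule $\mu(t)=e^{-t}$, $\sigma(t)^{2}=1-e^{-2t}$ of (5), using $u(t,x|y)=(\mu(t)y-x)/\sigma(t)^{2}$ and $p_{t}(x|y)=\mathcal{N}(x;\mu(t)y,\sigma(t)^{2}I_{d\times d})$ as in (24)–(25); the posterior weights are computed with a log-sum-exp stabilization, and the array names and test values are illustrative.

```python
import numpy as np

def empirical_optimal_score(t, x, train_data):
    """Closed-form empirical optimal score s^N(t, x) of Lemma B.2.

    Assumes the OU schedule mu(t) = exp(-t), sigma(t)^2 = 1 - exp(-2t) and the
    Gaussian transition kernel p_t(x|y) = N(x; mu(t) y, sigma(t)^2 I), for which
    u(t, x|y) = (mu(t) y - x) / sigma(t)^2.

    t:          scalar time in (0, T]
    x:          (d,) query point
    train_data: (N, d) array of training points y_1, ..., y_N
    """
    mu = np.exp(-t)
    sigma2 = 1.0 - np.exp(-2.0 * t)
    # log p_t(x | y_i), up to a common additive constant
    log_w = -np.sum((x - mu * train_data) ** 2, axis=1) / (2.0 * sigma2)
    log_w -= log_w.max()            # log-sum-exp stabilization
    w = np.exp(log_w)
    w /= w.sum()                    # weights p_t(x|y_i) / sum_j p_t(x|y_j)
    y_bar = w @ train_data          # weighted average of the training points
    return (mu * y_bar - x) / sigma2

# Illustrative usage on synthetic 2-d "training data"
rng = np.random.default_rng(0)
ys = rng.normal(size=(50, 2))
print(empirical_optimal_score(t=0.5, x=np.array([0.3, -0.1]), train_data=ys))
```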

Appendix C Approximation error of empirical optimal score function

In this section, we provide the full proof of Theorem 3.1. For completeness, we restate the theorem below:

Theorem C.1 (Approximation error of empirical optimal score function).

Let $\{y_{i}\}_{i=1}^{N}$ be $N$ i.i.d. samples drawn from the target data distribution $p_{\ast}$. Denote by $u(t,x)$ and $s^{N}_{\{y_{i}\}}(t,x)$ the true and the empirical optimal score functions, respectively, as defined in (8) and (12). Then for any fixed $0<\delta<T<\infty$, $\varepsilon_{\text{score}}>0$ and $\tau>0$, we have

𝔼tU[δ,T],xpt[s{yi}N(t,x)u(t,x)2]εscore2,\mathbb{E}_{t\sim U[\delta,T],x\sim p_{t}}\left[\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2}\right]\leq\varepsilon_{\text{score}}^{2},

with probability at least $1-\tau$, provided that the number of training samples satisfies $N\geq N(\varepsilon_{\text{score}},\delta,\tau)$, where $N(\varepsilon_{\text{score}},\delta,\tau)$ depends on the nature of $p_{\ast}$:

  • Case 1: If pp_{\ast} is an isotropic Gaussian, i.e. p(y)=𝒩(y;μp,σp2Id×d)p_{\ast}(y)=\mathcal{N}(y;\mu_{p_{\ast}},\sigma_{p_{\ast}}^{2}I_{d\times d}), with second moment 𝔪22=O(d)\mathfrak{m}_{2}^{2}=O(d), then N(εscore,δ,τ)=1τεscore2O(d)(2δ)(d+4)/2N(\varepsilon_{\text{score}},\delta,\tau)=\frac{1}{\tau\varepsilon_{\text{score}}^{2}}\frac{O(d)}{(2\delta)^{(d+4)/2}};

  • Case 2: If pp_{\ast} is supported on the Euclidean ball of radius RR such that R2=O(d)R^{2}=O(d), then N(εscore,δ,τ)=1τεscore2exp(O(d)δ)N(\varepsilon_{\text{score}},\delta,\tau)=\frac{1}{\tau\varepsilon_{\text{score}}^{2}}\exp\left(\frac{O(d)}{\delta}\right).

Proof of Theorem 3.1.

Denote the error term

|E{yi}t|2=𝔼xpt[s{yi}N(t,x)u(t,x)2]\left|E^{t}_{\{y_{i}\}}\right|^{2}={\mathbb{E}_{x\sim p_{t}}\left[\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2}\right]} (19)

and

|E{yi}|2=𝔼tU[δ,T]|E{yi}t|2=1TδδT|E{yi}t|2𝑑t.\left|E_{\{y_{i}\}}\right|^{2}={\mathbb{E}_{t\sim U[\delta,T]}\left|E^{t}_{\{y_{i}\}}\right|^{2}}=\frac{1}{T-\delta}\int_{\delta}^{T}\left|E^{t}_{\{y_{i}\}}\right|^{2}dt\,.

$E_{\{y_{i}\}}$ defines a function that maps $\{y_{i}\}\in\mathbb{R}^{Nd}$ to $\mathbb{R}^{+}$, and is itself a random variable. By Markov's inequality:

(E{yi}>εscore)𝔼{yi}pN|E{yi}|2εscore2.\mathbb{P}\left(E_{\{y_{i}\}}>\varepsilon_{\text{score}}\right)\leq\frac{\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left|E_{\{y_{i}\}}\right|^{2}}{\varepsilon_{\text{score}}^{2}}\,. (20)

The final conclusions follow from upper-bounding the right-hand side of Markov's inequality above. We prove the results for Case 1 and Case 2 separately.

  • Case 1: Note that

    𝔼{yi}pN|E{yi}|2=𝔼tU[δ,T]𝔼{yi}pN,xpt[s{yi}N(t,x)u(t,x)2],\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left|E_{\{y_{i}\}}\right|^{2}=\mathbb{E}_{t\sim U[\delta,T]}\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N},x\sim p_{t}}\left[\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2}\right], (21)

    and for fixed $t\in[\delta,T]$, according to the definition (19), one can show (Lemma C.2)

    𝔼{yi}pN|E{yi}t|2O(d)N(1e2t)(d+6)/2.\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left|E^{t}_{\{y_{i}\}}\right|^{2}\lesssim\frac{O(d)}{N\left(1-e^{-2t}\right)^{(d+6)/2}}\,. (22)

    Taking the expectation with respect to $t\sim U[\delta,T]$, we have

    𝔼{yi}pN|E{yi}|21TδδTO(d)N(1e2t)(d+6)/2𝑑tO(d)NδT1(2t)(d+6)/2𝑑t1NO(d)(2δ)(d+4)/2.\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left|E_{\{y_{i}\}}\right|^{2}\lesssim\frac{1}{T-\delta}\int_{\delta}^{T}\frac{O(d)}{N\left(1-e^{-2t}\right)^{(d+6)/2}}dt\lesssim\frac{O(d)}{N}\int_{\delta}^{T}\frac{1}{(2t)^{(d+6)/2}}dt\lesssim\frac{1}{N}\frac{O(d)}{(2\delta)^{(d+4)/2}}\,.

    By the Markov’s inequality (20), we have

    𝔼tU[δ,T],xpt[s{yi}N(t,x)u(t,x)2]εscore2,\mathbb{E}_{t\sim U[\delta,T],x\sim p_{t}}\left[\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2}\right]\leq\varepsilon_{\text{score}}^{2},

    with probability at least $1-\frac{1}{N\varepsilon_{\text{score}}^{2}}\frac{O(d)}{(2\delta)^{(d+4)/2}}$. Setting $\frac{1}{N\varepsilon_{\text{score}}^{2}}\frac{O(d)}{(2\delta)^{(d+4)/2}}=\tau$, we obtain the sample complexity $N(\varepsilon_{\text{score}},\delta,\tau)=\frac{1}{\tau\varepsilon_{\text{score}}^{2}}\frac{O(d)}{(2\delta)^{(d+4)/2}}$.

  • Case 2: For fixed t[δ,T]t\in[\delta,T], according to the definition (19), one can show (Lemma C.2)

    𝔼{yi}pN|E{yi}t|21N1texp(O(d)t)\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left|E^{t}_{\{y_{i}\}}\right|^{2}\lesssim\frac{1}{N}\frac{1}{t}\exp\left(\frac{O(d)}{t}\right) (23)

    Taking expectation with respect to tt in [δ,T][\delta,T], we have (Lemma C.3)

    𝔼{yi}pN|E{yi}|21TδδT1N1texp(O(d)t)𝑑t1Nexp(O(d)δ)\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left|E_{\{y_{i}\}}\right|^{2}\lesssim\frac{1}{T-\delta}\int_{\delta}^{T}\frac{1}{N}\frac{1}{t}\exp\left(\frac{O(d)}{t}\right)dt\lesssim\frac{1}{N}\exp\left(\frac{O(d)}{\delta}\right)

    Again, by Markov's inequality (20) and computations similar to those in Case 1, we obtain the sample complexity $N(\varepsilon_{\text{score}},\delta,\tau)=\frac{1}{\tau\varepsilon_{\text{score}}^{2}}\exp\left(\frac{O(d)}{\delta}\right)$.

Lemma C.2.

Under the same assumptions as in Theorem C.1, for fixed t[δ,T]t\in[\delta,T], we have

  • Case 1: If pp_{\ast} is an isotropic Gaussian with second moment 𝔪22=O(d)\mathfrak{m}_{2}^{2}=O(d), then

    𝔼{yi}pN,xpt[s{yi}N(t,x)u(t,x)2]O(d)N(1e2t)(d+6)/2;\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N},x\sim p_{t}}\left[\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2}\right]\lesssim\frac{O(d)}{N\left(1-e^{-2t}\right)^{(d+6)/2}}\,;
  • Case 2: If pp_{\ast} is supported on the Euclidean ball with radius R>0R>0 such that R2=O(d)R^{2}=O(d), then

    𝔼{yi}pN,xpt[s{yi}N(t,x)u(t,x)2]1N1texp(O(d)t).\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N},x\sim p_{t}}\left[\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2}\right]\lesssim\frac{1}{N}\frac{1}{t}\exp\left(\frac{O(d)}{t}\right)\,.
Proof.
  • Case 1: By the definitions of $u(t,x)$ and $s^{N}_{\{y_{i}\}}(t,x)$ in (8) and (12), we can rewrite them as

    u(t,x)\displaystyle u(t,x) =u(t,x|y)pt(x|y)p(y)𝑑ypt(x|y)p(y)𝑑y=1σ(t)2x+μ(t)σ(t)2ypt(x|y)p(y)𝑑ypt(x|y)p(y)𝑑y:=atx+btvt(x)pt(x),\displaystyle=\frac{\int u(t,x|y)p_{t}(x|y)p_{\ast}(y)dy}{\int p_{t}(x|y)p_{\ast}(y)dy}=-\frac{1}{\sigma(t)^{2}}x+\frac{\mu(t)}{\sigma(t)^{2}}\frac{\int yp_{t}(x|y)p_{\ast}(y)dy}{\int p_{t}(x|y)p_{\ast}(y)dy}:=a_{t}x+b_{t}\frac{v_{t}(x)}{p_{t}(x)}, (24)
    \displaystyle s^{N}_{\{y_{i}\}}(t,x)=\frac{\sum_{i=1}^{N}u(t,x|y_{i})p_{t}(x|y_{i})}{\sum_{j=1}^{N}p_{t}(x|y_{j})}=a_{t}x+b_{t}\frac{\frac{1}{N}\sum_{i=1}^{N}y_{i}p_{t}(x|y_{i})}{\frac{1}{N}\sum_{j=1}^{N}p_{t}(x|y_{j})}:=a_{t}x+b_{t}\frac{v_{t}^{N}(x)}{p_{t}^{N}(x)},\qquad(25)

    where we denote at:=1σ(t)2a_{t}:=-\frac{1}{\sigma(t)^{2}} and bt:=μ(t)σ(t)2b_{t}:=\frac{\mu(t)}{\sigma(t)^{2}}. Then we can compute

    s{yi}N(t,x)u(t,x)2\displaystyle\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2} =(atx+btvtN(x)ptN(x))(atx+btvt(x)pt(x))2\displaystyle=\left\|\left(a_{t}x+b_{t}\frac{v_{t}^{N}(x)}{p_{t}^{N}(x)}\right)-\left(a_{t}x+b_{t}\frac{v_{t}(x)}{p_{t}(x)}\right)\right\|^{2}
    =bt21pt(x)(vtN(x)vt(x))+(1ptN(x)1pt(x))vtN(x)2\displaystyle=b_{t}^{2}\left\|\frac{1}{p_{t}(x)}\left(v_{t}^{N}(x)-v_{t}(x)\right)+\left(\frac{1}{p_{t}^{N}(x)}-\frac{1}{p_{t}(x)}\right)v_{t}^{N}(x)\right\|^{2}
    2bt2(1pt(x)2vtN(x)vt(x)2+(ptN(x)pt(x)pt(x))2vtN(x)2ptN(x)2),\displaystyle\leq 2b_{t}^{2}\left(\frac{1}{p_{t}(x)^{2}}\left\|v_{t}^{N}(x)-v_{t}(x)\right\|^{2}+\left(\frac{p_{t}^{N}(x)-p_{t}(x)}{p_{t}(x)}\right)^{2}\frac{\left\|v_{t}^{N}(x)\right\|^{2}}{p_{t}^{N}(x)^{2}}\right)\,,

    where the last inequality follows from Young's inequality. Then we have:

    𝔼{yi}pN,xpt[s{yi}N(t,x)u(t,x)2]\displaystyle\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N},x\sim p_{t}}\left[\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2}\right]
    2bt2𝔼xpt[𝔼{yi}pN[1pt(x)2vtN(x)vt(x)2+(ptN(x)pt(x)pt(x))2vtN(x)2ptN(x)2]]\displaystyle\leq 2b_{t}^{2}\mathbb{E}_{x\sim p_{t}}\left[\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\frac{1}{p_{t}(x)^{2}}\left\|v_{t}^{N}(x)-v_{t}(x)\right\|^{2}+\left(\frac{p_{t}^{N}(x)-p_{t}(x)}{p_{t}(x)}\right)^{2}\frac{\left\|v_{t}^{N}(x)\right\|^{2}}{p_{t}^{N}(x)^{2}}\right]\right]
    bt2𝔼xpt[x2+𝔪22Nμ(t)2exp(xμ(t)μp22((σ(t)2+μ(t)2σp2)(σ(t)2/2+μ(t)2σp2)μ(t)2σp2))](by Lemma C.7)\displaystyle\lesssim b_{t}^{2}\mathbb{E}_{x\sim p_{t}}\left[\frac{\|x\|^{2}+\mathfrak{m}_{2}^{2}}{N\mu(t)^{2}}\exp\left(\frac{\|x-\mu(t)\mu_{p_{\ast}}\|^{2}}{2\left(\frac{(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2})(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2})}{\mu(t)^{2}\sigma_{p_{\ast}}^{2}}\right)}\right)\right]\qquad(\text{by Lemma \ref{lemma: upper bound w.r.t y gaussian case}})
    1Nbt2μ(t)2(x2+𝔪22)exp(xμ(t)μp22((σ(t)2+μ(t)2σp2)(σ(t)2/2+μ(t)2σp2)μ(t)2σp2))exp(xμ(t)μp22(σ(t)2+μ(t)2σp2))𝑑x\displaystyle\lesssim\frac{1}{N}\frac{b_{t}^{2}}{\mu(t)^{2}}{\int}\left(\|x\|^{2}+\mathfrak{m}_{2}^{2}\right)\exp\left(\frac{\|x-\mu(t)\mu_{p_{\ast}}\|^{2}}{2\left(\frac{(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2})(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2})}{\mu(t)^{2}\sigma_{p_{\ast}}^{2}}\right)}\right)\exp\left(-\frac{\|x-\mu(t)\mu_{p_{\ast}}\|^{2}}{2(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2})}\right)dx
    =1Nbt2μ(t)2(x2+𝔪22)exp(xμ(t)μp22((σ(t)2+μ(t)2σp2)(σ(t)2+2μ(t)2σp2σ(t)2))𝑑x\displaystyle=\frac{1}{N}\frac{b_{t}^{2}}{\mu(t)^{2}}{\int}\left(\|x\|^{2}+\mathfrak{m}_{2}^{2}\right)\exp\left(-\frac{\|x-\mu(t)\mu_{p_{\ast}}\|^{2}}{2\left(\frac{(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2})(\sigma(t)^{2}+2\mu(t)^{2}\sigma_{p_{\ast}}^{2}}{\sigma(t)^{2}}\right)}\right)dx
    1Nbt2μ(t)21σ(t)d[(μ(t)μp2+d((σ(t)2+μ(t)2σp2)(σ(t)2+2μ(t)2σp2)σ(t)2))+𝔪22]\displaystyle\propto\frac{1}{N}\frac{b_{t}^{2}}{\mu(t)^{2}}\frac{1}{\sigma(t)^{d}}\left[\left(\|\mu(t)\mu_{p_{\ast}}\|^{2}+d\left(\frac{(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2})(\sigma(t)^{2}+2\mu(t)^{2}\sigma_{p_{\ast}}^{2})}{\sigma(t)^{2}}\right)\right)+\mathfrak{m}_{2}^{2}\right]
    1N(𝔪22σ(t)d+4+dσ(t)d+6)(by the definition of bt=μ(t)σ(t)2)\displaystyle\lesssim\frac{1}{N}\left(\frac{\mathfrak{m}_{2}^{2}}{\sigma(t)^{d+4}}+\frac{d}{\sigma(t)^{d+6}}\right)\qquad(\text{by the definition of $b_{t}=\frac{\mu(t)}{\sigma(t)^{2}}$})
    \displaystyle\lesssim\frac{1}{N}\frac{O(d)}{\sigma(t)^{d+6}}=\frac{O(d)}{N(1-e^{-2t})^{(d+6)/2}}\qquad(\text{by $\mathfrak{m}_{2}^{2}=O(d)$})\,.
  • Case 2: We use the same notations as in Case 1 and define

    A1=𝔼x,{yi}[1pt(x)2vtN(x)vt(x)2],andA2=𝔼x,{yi}[(ptN(x)pt(x)pt(x))2vtN(x)2ptN(x)2].A_{1}=\mathbb{E}_{x,\{y_{i}\}}\left[\frac{1}{p_{t}(x)^{2}}\left\|v_{t}^{N}(x)-v_{t}(x)\right\|^{2}\right]\,,\quad\text{and}\quad A_{2}=\mathbb{E}_{x,\{y_{i}\}}\left[\left(\frac{p_{t}^{N}(x)-p_{t}(x)}{p_{t}(x)}\right)^{2}\frac{\left\|v_{t}^{N}(x)\right\|^{2}}{p_{t}^{N}(x)^{2}}\right]\,.

    Then we have:

    𝔼{yi}pN,xpt[s{yi}N(t,x)u(t,x)2]\displaystyle\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N},x\sim p_{t}}\left[\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2}\right] 2bt2(A1+A2).\displaystyle\leq 2b_{t}^{2}\left(A_{1}+A_{2}\right)\,.

    We now bound terms A1A_{1} and A2A_{2} respectively. For term A1A_{1}, we have

    A1\displaystyle A_{1} =𝔼xpt,{yi}pN[1pt(x)2vtN(x)vt(x)2]\displaystyle=\mathbb{E}_{x\sim p_{t},\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\frac{1}{p_{t}(x)^{2}}\left\|v_{t}^{N}(x)-v_{t}(x)\right\|^{2}\right]
    =𝔼xpt[1pt(x)2𝔼{yi}pN[vtN(x)vt(x)2]]\displaystyle=\mathbb{E}_{x\sim p_{t}}\left[\frac{1}{p_{t}(x)^{2}}\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left\|v_{t}^{N}(x)-v_{t}(x)\right\|^{2}\right]\right]
    1N𝔼xpt[1pt(x)2𝔼ypypt(x|y)2](by Lemma C.4)\displaystyle\leq\frac{1}{N}\mathbb{E}_{x\sim p_{t}}\left[\frac{1}{p_{t}(x)^{2}}\mathbb{E}_{y\sim p_{\ast}}\|yp_{t}(x|y)\|^{2}\right]\qquad(\text{by Lemma \ref{lemma: variance of mean type result}})
    =1N1(2πσ(t)2)d1pt(x)y2exp(2xμ(t)y22σ(t)2)p(y)𝑑y𝑑x(by the definition of pt(x|y))\displaystyle=\frac{1}{N}\frac{1}{\left(2\pi\sigma(t)^{2}\right)^{d}}\int\int\frac{1}{p_{t}(x)}\|y\|^{2}\exp\left(-\frac{2\|x-\mu(t)y\|^{2}}{2\sigma(t)^{2}}\right)p_{\ast}(y)dydx\qquad(\text{by the definition of $p_{t}(x|y)$})
    1NKt1(2πσ(t)2)d/2y2(exp(1+λμ(t)2σ(t)2x2)exp(2xμ(t)y22σ(t)2)𝑑x)p(y)𝑑y(by Lemma C.12)\displaystyle\leq\frac{1}{N}\frac{K^{-1}_{t}}{\left(2\pi\sigma(t)^{2}\right)^{d/2}}\int\|y\|^{2}\left(\int\exp\left(\frac{1+\lambda\mu(t)}{2\sigma(t)^{2}}\|x\|^{2}\right)\exp\left(-\frac{2\|x-\mu(t)y\|^{2}}{2\sigma(t)^{2}}\right)dx\right)p_{\ast}(y)dy\qquad(\text{by Lemma \ref{lemma: lower bound of p_t(x)}})
    =1NKt1(2πσ(t)2)d/2y2exp(μ(t)2(1+λμ(t))σ(t)2(1λμ(t))y2)(exp(1λμ(t)2σ(t)2x2μ(t)1λμ(t)y2)𝑑x)p(y)𝑑y\displaystyle=\frac{1}{N}\frac{K^{-1}_{t}}{\left(2\pi\sigma(t)^{2}\right)^{d/2}}\int\|y\|^{2}\exp\left(\frac{\mu(t)^{2}(1+\lambda\mu(t))}{\sigma(t)^{2}(1-\lambda\mu(t))}\|y\|^{2}\right)\left(\int\exp\left(-\frac{1-\lambda\mu(t)}{2\sigma(t)^{2}}\left\|x-\frac{2\mu(t)}{1-\lambda\mu(t)}y\right\|^{2}\right)dx\right)p_{\ast}(y)dy
    =1NKt1(1λμ(t))d/2y2exp(μ(t)2(1+λμ(t))σ(t)2(1λμ(t))y2)p(y)𝑑y\displaystyle=\frac{1}{N}\frac{K^{-1}_{t}}{(1-\lambda\mu(t))^{d/2}}\int\|y\|^{2}\exp\left(\frac{\mu(t)^{2}(1+\lambda\mu(t))}{\sigma(t)^{2}(1-\lambda\mu(t))}\|y\|^{2}\right)p_{\ast}(y)dy
    =1N1(1λμ(t))d/2y2exp(μ(t)2(1+λμ(t))σ(t)2(1λμ(t))y2)p(y)𝑑yexp(μ(t)+λμ(t)22λσ(t)2y2)p(y)𝑑y(by the definition of Kt)\displaystyle=\frac{1}{N}\frac{1}{(1-\lambda\mu(t))^{d/2}}\frac{\int\|y\|^{2}\exp\left(\frac{\mu(t)^{2}(1+\lambda\mu(t))}{\sigma(t)^{2}(1-\lambda\mu(t))}\|y\|^{2}\right)p_{\ast}(y)dy}{\int\exp\left(-\frac{\mu(t)+\lambda\mu(t)^{2}}{2\lambda\sigma(t)^{2}}\|y\|^{2}\right)p_{\ast}(y)dy}\qquad(\text{by the definition of $K_{t}$})
    1N1(1λμ(t))d/2R2exp(μ(t)2(1+λμ(t))σ(t)2(1λμ(t))R2)exp(μ(t)+λμ(t)22λσ(t)2R2)(by supp(p)B(0,R))\displaystyle\leq\frac{1}{N}\frac{1}{(1-\lambda\mu(t))^{d/2}}R^{2}\exp\left(\frac{\mu(t)^{2}(1+\lambda\mu(t))}{\sigma(t)^{2}(1-\lambda\mu(t))}R^{2}\right)\exp\left(\frac{\mu(t)+\lambda\mu(t)^{2}}{2\lambda\sigma(t)^{2}}R^{2}\right)\qquad(\text{by $\mathrm{supp}(p_{\ast})\subseteq B(0,R)$})
    =1NR2(1λμ(t))d/2exp(μ(t)(1+λμ(t))22λσ(t)2(1λμ(t))R2)\displaystyle=\frac{1}{N}\frac{R^{2}}{(1-\lambda\mu(t))^{d/2}}\exp\left(\frac{\mu(t)(1+\lambda\mu(t))^{2}}{2\lambda\sigma(t)^{2}(1-\lambda\mu(t))}R^{2}\right)
    =1N2d/2R2exp(9μ(t)22σ(t)2R2)(by choosing λ=12μ(t))\displaystyle=\frac{1}{N}2^{d/2}R^{2}\exp\left(\frac{9\mu(t)^{2}}{2\sigma(t)^{2}}R^{2}\right)\qquad(\text{by choosing $\lambda=\frac{1}{2\mu(t)}$})
    =1Nexp(μ(t)2σ(t)2O(d)),\displaystyle=\frac{1}{N}\exp\left(\frac{\mu(t)^{2}}{\sigma(t)^{2}}O(d)\right),

    where we used the assumption $R^{2}=O(d)$. For the term $A_{2}$, we can calculate

    A2\displaystyle A_{2} =𝔼xpt,{yi}pN[(ptN(x)pt(x)pt(x))2vtN(x)2ptN(x)2]\displaystyle=\mathbb{E}_{x\sim p_{t},\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left(\frac{p_{t}^{N}(x)-p_{t}(x)}{p_{t}(x)}\right)^{2}\frac{\left\|v_{t}^{N}(x)\right\|^{2}}{p_{t}^{N}(x)^{2}}\right]
    R2𝔼xpt[1pt(x)2𝔼{yi}pN(ptN(x)pt(x))2](by Lemma C.6)\displaystyle\leq R^{2}\mathbb{E}_{x\sim p_{t}}\left[\frac{1}{p_{t}(x)^{2}}\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left(p_{t}^{N}(x)-p_{t}(x)\right)^{2}\right]\qquad(\text{by Lemma \ref{lemma: bound for empirical division}})
    R2N𝔼xpt[1pt(x)2𝔼{yi}pN[pt(x|y)2]](by Lemma C.4)\displaystyle\leq\frac{R^{2}}{N}\mathbb{E}_{x\sim p_{t}}\left[\frac{1}{p_{t}(x)^{2}}\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[p_{t}(x|y)^{2}\right]\right]\qquad(\text{by Lemma \ref{lemma: variance of mean type result}})
    1NKt1R2(2πσ(t)2)d/2(exp(1+λμ(t)2σ(t)2x2)exp(2xμ(t)y22σ(t)2)𝑑x)p(y)𝑑y(by Lemma C.12)\displaystyle\leq\frac{1}{N}\frac{K^{-1}_{t}R^{2}}{(2\pi\sigma(t)^{2})^{d/2}}\int\left(\int\exp\left(\frac{1+\lambda\mu(t)}{2\sigma(t)^{2}}\|x\|^{2}\right)\exp\left(-\frac{2\|x-\mu(t)y\|^{2}}{2\sigma(t)^{2}}\right)dx\right)p_{\ast}(y)dy\qquad(\text{by Lemma \ref{lemma: lower bound of p_t(x)}})
    1Nexp(μ(t)2σ(t)2O(d))(by the same computations as for term A1)\displaystyle\leq\frac{1}{N}\exp\left(\frac{\mu(t)^{2}}{\sigma(t)^{2}}O(d)\right)\qquad(\text{by the same computations as for term $A_{1}$})

    Combining the upper bounds for terms A1A_{1} and A2A_{2}, we obtain

    𝔼{yi}pN,xpt[s{yi}N(t,x)u(t,x)2]\displaystyle\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N},x\sim p_{t}}\left[\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2}\right] 1Nμ(t)2σ(t)4exp(μ(t)2σ(t)2O(d))\displaystyle\lesssim\frac{1}{N}\frac{\mu(t)^{2}}{\sigma(t)^{4}}\exp\left(\frac{\mu(t)^{2}}{\sigma(t)^{2}}O(d)\right)
    =1Nexp(2t)(1exp(2t))2exp(exp(2t)1exp(2t)O(d))\displaystyle=\frac{1}{N}\frac{\exp(-2t)}{(1-\exp(-2t))^{2}}\exp\left(\frac{\exp(-2t)}{1-\exp(-2t)}O(d)\right)
    (by the definitions of μ(t) and σ(t))\displaystyle\qquad\qquad\qquad\qquad(\text{by the definitions of $\mu(t)$ and $\sigma(t)$})
    1N1texp(O(d)t)\displaystyle\leq\frac{1}{N}\frac{1}{t}\exp\left(\frac{O(d)}{t}\right)

Lemma C.3.

1TδδT1N1texp(O(d)t)𝑑t1Nexp(O(d)δ)\frac{1}{T-\delta}\int_{\delta}^{T}\frac{1}{N}\frac{1}{t}\exp\left(\frac{O(d)}{t}\right)dt\lesssim\frac{1}{N}\exp\left(\frac{O(d)}{\delta}\right).

Proof.
1TδδT1N1texp(O(d)t)𝑑t\displaystyle\frac{1}{T-\delta}\int_{\delta}^{T}\frac{1}{N}\frac{1}{t}\exp\left(\frac{O(d)}{t}\right)dt =1N1Tδ1/T1/δexp(O(d)s)s𝑑s\displaystyle=\frac{1}{N}\frac{1}{T-\delta}\int_{1/T}^{1/\delta}\frac{\exp\left(O(d)s\right)}{s}ds
1NTTδ1/T1/δexp(O(d)s)𝑑s\displaystyle\leq\frac{1}{N}\frac{T}{T-\delta}\int_{1/T}^{1/\delta}\exp\left(O(d)s\right)ds
1N1O(d)exp(O(d)δ)1Nexp(O(d)δ).\displaystyle\leq\frac{1}{N}\frac{1}{O(d)}\exp\left(\frac{O(d)}{\delta}\right)\lesssim\frac{1}{N}\exp\left(\frac{O(d)}{\delta}\right).
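As a quick numerical illustration of this bound (our sanity check, with a concrete constant $c$ standing in for the unspecified $O(d)$ factor and $N=1$):

```python
import numpy as np

# Illustration of Lemma C.3: compare the time-averaged integral with exp(c/delta).
# The constant c is an arbitrary stand-in for the O(d) factor in the lemma.
c, delta, T = 10.0, 0.1, 1.0
t = np.linspace(delta, T, 200_001)
dt = t[1] - t[0]
lhs = np.sum(np.exp(c / t) / t) * dt / (T - delta)   # Riemann-sum approximation of the LHS
rhs = np.exp(c / delta)
print(f"LHS = {lhs:.3e}  <=  exp(c/delta) = {rhs:.3e}")
```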

Lemma C.4.

Suppose {yi}i=1N\{y_{i}\}_{i=1}^{N} are i.i.d samples drawn from the distribution pp_{\ast}. For vt(x),pt(x)v_{t}(x),p_{t}(x) and vtN(x),ptN(x)v_{t}^{N}(x),p_{t}^{N}(x) defined in (24) and (25) respectively, we have

𝔼{yi}pN[vtN(x)vt(x)2]1N𝔼yp[ypt(x|y)2]\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left\|v_{t}^{N}(x)-v_{t}(x)\right\|^{2}\right]\leq\frac{1}{N}\mathbb{E}_{y\sim p_{\ast}}\left[\left\|yp_{t}(x|y)\right\|^{2}\right]

and

𝔼{yi}pN[ptN(x)pt(x)2]1N𝔼yp[pt(x|y)2].\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left\|p_{t}^{N}(x)-p_{t}(x)\right\|^{2}\right]\leq\frac{1}{N}\mathbb{E}_{y\sim p_{\ast}}\left[p_{t}(x|y)^{2}\right].
Remark C.5.

Define $f_{t,x}(y):=yp_{t}(x|y)$. Due to the randomness of $y$, $f_{t,x}(y)$ is itself a random vector. According to the definition (24), $v_{t}(x)=\int yp_{t}(x|y)p_{\ast}(y)dy=\mathbb{E}_{p_{\ast}}[f_{t,x}(y)]$ is the mean of $f_{t,x}(y)$, and $v_{t}^{N}(x)=\frac{1}{N}\sum_{i}f_{t,x}(y_{i})$ is the sample average of $N$ i.i.d. realizations of $f_{t,x}$. The variance of the sample average is $\frac{1}{N}$ times the variance of the original random variable, so naturally:

𝔼{yi}pN[vtN(x)vt(x)2]=1NVarp[ft,x(y)]1N𝔼pft,x2.\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left\|v_{t}^{N}(x)-v_{t}(x)\right\|^{2}\right]=\frac{1}{N}\mathrm{Var}_{p_{\ast}}[f_{t,x}(y)]\leq\frac{1}{N}\mathbb{E}_{p_{\ast}}\|f_{t,x}\|^{2}\,.
Proof.

We denote ft,x(y):=ypt(x|y)f_{t,x}(y):=yp_{t}(x|y). By the definitions of vtN(x)v_{t}^{N}(x) and vt(x)v_{t}(x), we can compute

𝔼{yi}pN[vtN(x)vt(x)2]\displaystyle\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left\|v_{t}^{N}(x)-v_{t}(x)\right\|^{2}\right] =𝔼{yi}pN[1Ni=1N(ft,x(yi)𝔼yp[ft,x(y)])2]\displaystyle=\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left\|\frac{1}{N}\sum_{i=1}^{N}\left(f_{t,x}(y_{i})-\mathbb{E}_{y\sim p_{\ast}}[f_{t,x}(y)]\right)\right\|^{2}\right]
=1N𝔼yp[ft,x(y)𝔼yp[ft,x(y)]2]\displaystyle=\frac{1}{N}\mathbb{E}_{y\sim p_{\ast}}\left[\left\|f_{t,x}(y)-\mathbb{E}_{y\sim p_{\ast}}[f_{t,x}(y)]\right\|^{2}\right]
1N𝔼yp[ft,x(y)2]=1N𝔼yp[ypt(x|y)2].\displaystyle\leq\frac{1}{N}\mathbb{E}_{y\sim p_{\ast}}\left[\left\|f_{t,x}(y)\right\|^{2}\right]=\frac{1}{N}\mathbb{E}_{y\sim p_{\ast}}\left[\left\|yp_{t}(x|y)\right\|^{2}\right]\,.

With similar computations, one can show

𝔼{yi}pN[ptN(x)pt(x)2]1N𝔼yp[pt(x|y)2].\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left\|p_{t}^{N}(x)-p_{t}(x)\right\|^{2}\right]\leq\frac{1}{N}\mathbb{E}_{y\sim p_{\ast}}\left[p_{t}(x|y)^{2}\right].
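The variance-of-the-sample-mean argument can be checked numerically. The following sketch (ours; the statistic $f$ and all numbers are arbitrary stand-ins for $f_{t,x}$) verifies by Monte Carlo that $\mathbb{E}\big\|\frac{1}{N}\sum_{i}f(y_{i})-\mathbb{E}f\big\|^{2}$ is indeed dominated by $\frac{1}{N}\mathbb{E}\|f\|^{2}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(y):
    # An arbitrary bounded vector-valued statistic playing the role of f_{t,x}(y).
    return y * np.exp(-np.sum(y**2, axis=-1, keepdims=True))

d, N, trials = 3, 200, 5000
pop = f(rng.normal(size=(10**6, d)))                  # approximate E[f] and E||f||^2
mean_f = pop.mean(axis=0)
second_moment = np.mean(np.sum(pop**2, axis=1))

errs = []
for _ in range(trials):
    sample = f(rng.normal(size=(N, d)))
    errs.append(np.sum((sample.mean(axis=0) - mean_f) ** 2))

print(np.mean(errs), "<=", second_moment / N)         # observed MSE vs. the bound (1/N) E||f||^2
```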

Lemma C.6.

Given a collection of vectors {yi}i=1N\{y_{i}\}_{i=1}^{N}, for any fixed xdx\in\mathbb{R}^{d} and t[0,T]t\in[0,T], the following inequality holds

vtN(x)2ptN(x)2=i=1Npt(x|yi)j=1Npt(x|yj)yi2σ(t)2μ(t)2x2+1μ(t)21Ni=1Nxμ(t)yi2,\frac{\left\|v_{t}^{N}(x)\right\|^{2}}{p_{t}^{N}(x)^{2}}=\left\|\sum_{i=1}^{N}\frac{p_{t}(x|y_{i})}{\sum_{j=1}^{N}p_{t}(x|y_{j})}y_{i}\right\|^{2}\lesssim\frac{\sigma(t)^{2}}{\mu(t)^{2}}\|x\|^{2}+\frac{1}{\mu(t)^{2}}\frac{1}{N}\sum_{i=1}^{N}\left\|x-\mu(t)y_{i}\right\|^{2},

where vtNv_{t}^{N} and ptNp_{t}^{N} are defined in (25), pt(x|y)p_{t}(x|y) is the Green’s function defined in (6) and μ(t)=et,σ(t)2=1e2t\mu(t)=e^{-t},\sigma(t)^{2}=1-e^{-2t} as defined in (5). If we further assume that yi22R2\|y_{i}\|_{2}^{2}\leq R^{2} for all i[N]i\in[N], then we have

vtN(x)2ptN(x)2=i=1Npt(x|yi)j=1Npt(x|yj)yi2R2.\frac{\left\|v_{t}^{N}(x)\right\|^{2}}{p_{t}^{N}(x)^{2}}=\left\|\sum_{i=1}^{N}\frac{p_{t}(x|y_{i})}{\sum_{j=1}^{N}p_{t}(x|y_{j})}y_{i}\right\|^{2}\leq R^{2}\,.
Proof.

We can compute that

vtN(x)2ptN(x)2\displaystyle\frac{\left\|v_{t}^{N}(x)\right\|^{2}}{p_{t}^{N}(x)^{2}} =i=1Npt(x|yi)j=1Npt(x|yj)yi2\displaystyle=\left\|\sum_{i=1}^{N}\frac{p_{t}(x|y_{i})}{\sum_{j=1}^{N}p_{t}(x|y_{j})}y_{i}\right\|^{2}
=i=1Nexp(xμ(t)yi22σ(t)2)j=1Nexp(xμ(t)yj22σ(t)2)σ(t)μ(t)(μ(t)yix+x)σ(t)2\displaystyle=\left\|\sum_{i=1}^{N}\frac{\exp\left(-\frac{\|x-\mu(t)y_{i}\|^{2}}{2\sigma(t)^{2}}\right)}{\sum_{j=1}^{N}\exp\left(-\frac{\|x-\mu(t)y_{j}\|^{2}}{2\sigma(t)^{2}}\right)}\frac{\sigma(t)}{\mu(t)}\frac{\left(\mu(t)y_{i}-x+x\right)}{\sigma(t)}\right\|^{2}
σ(t)2μ(t)2(x2+i=1Nexp(xμ(t)yi22σ(t)2)j=1Nexp(xμ(t)yj22σ(t)2)(μ(t)yix)σ(t)2)\displaystyle\lesssim\frac{\sigma(t)^{2}}{\mu(t)^{2}}\left(\|x\|^{2}+\left\|\sum_{i=1}^{N}\frac{\exp\left(-\frac{\|x-\mu(t)y_{i}\|^{2}}{2\sigma(t)^{2}}\right)}{\sum_{j=1}^{N}\exp\left(-\frac{\|x-\mu(t)y_{j}\|^{2}}{2\sigma(t)^{2}}\right)}\frac{\left(\mu(t)y_{i}-x\right)}{\sigma(t)}\right\|^{2}\right)
σ(t)2μ(t)2x2+1μ(t)21Ni=1Nxμ(t)yi2(by Lemma C.11)\displaystyle\leq\frac{\sigma(t)^{2}}{\mu(t)^{2}}\|x\|^{2}+\frac{1}{\mu(t)^{2}}\frac{1}{N}\sum_{i=1}^{N}\left\|x-\mu(t)y_{i}\right\|^{2}\qquad(\text{by Lemma \ref{lemma: weight avg bounded by uniform avg}})

If we assume that yi22R2\|y_{i}\|_{2}^{2}\leq R^{2} for i[N]i\in[N], then we have

vtN(x)2ptN(x)2=i=1Npt(x|yi)j=1Npt(x|yj)yi2R2i=1Npt(x|yi)j=1Npt(x|yj)2=R2.\frac{\left\|v_{t}^{N}(x)\right\|^{2}}{p_{t}^{N}(x)^{2}}=\left\|\sum_{i=1}^{N}\frac{p_{t}(x|y_{i})}{\sum_{j=1}^{N}p_{t}(x|y_{j})}y_{i}\right\|^{2}\leq R^{2}\left\|\frac{\sum_{i=1}^{N}p_{t}(x|y_{i})}{\sum_{j=1}^{N}p_{t}(x|y_{j})}\right\|^{2}=R^{2}\,.

Lemma C.7.

Under the same assumptions as in Theorem C.1 Case 1, for fixed t[δ,T]t\in[\delta,T] and xdx\in\mathbb{R}^{d}, we have

𝔼{yi}pN\displaystyle\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}} [1pt(x)2vtN(x)vt(x)2+(ptN(x)pt(x)pt(x))2vtN(x)2ptN(x)2]\displaystyle\left[\frac{1}{p_{t}(x)^{2}}\left\|v_{t}^{N}(x)-v_{t}(x)\right\|^{2}+\left(\frac{p_{t}^{N}(x)-p_{t}(x)}{p_{t}(x)}\right)^{2}\frac{\left\|v_{t}^{N}(x)\right\|^{2}}{p_{t}^{N}(x)^{2}}\right]
x2+𝔪22Nμ(t)2exp(xμ(t)μp22((σ(t)2+μ(t)2σp2)(σ(t)2/2+μ(t)2σp2)μ(t)2σp2))\displaystyle\qquad\qquad\qquad\qquad\qquad\quad\;\;\leq\frac{\|x\|^{2}+\mathfrak{m}_{2}^{2}}{N\mu(t)^{2}}\exp\left(\frac{\|x-\mu(t)\mu_{p_{\ast}}\|^{2}}{2\left(\frac{(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2})(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2})}{\mu(t)^{2}\sigma_{p_{\ast}}^{2}}\right)}\right)

Here we denote 𝔪22:=𝔼yp[y2]=μp2+dσp2\mathfrak{m}_{2}^{2}:=\mathbb{E}_{y\sim p_{\ast}}\left[\|y\|^{2}\right]=\|\mu_{p_{\ast}}\|^{2}+d\sigma_{p_{\ast}}^{2}.

Proof.

Denote

A1:=𝔼{yi}pN[1pt(x)2vtN(x)vt(x)2]andA2:=𝔼{yi}pN[(ptN(x)pt(x)pt(x))2vtN(x)2ptN(x)2]A_{1}:=\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\frac{1}{p_{t}(x)^{2}}\left\|v_{t}^{N}(x)-v_{t}(x)\right\|^{2}\right]\qquad\text{and}\qquad A_{2}:=\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left(\frac{p_{t}^{N}(x)-p_{t}(x)}{p_{t}(x)}\right)^{2}\frac{\left\|v_{t}^{N}(x)\right\|^{2}}{p_{t}^{N}(x)^{2}}\right]

We now bound terms A1A_{1} and A2A_{2} respectively. For term A1A_{1}, we have

A1\displaystyle A_{1} =1pt(x)2𝔼{yi}pN[vtN(x)vt(x)2]\displaystyle=\frac{1}{p_{t}(x)^{2}}\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left\|v_{t}^{N}(x)-v_{t}(x)\right\|^{2}\right] (26)
1N1pt(x)2𝔼yp[ypt(x|y)2](by Lemma C.4)\displaystyle\leq\frac{1}{N}\frac{1}{p_{t}(x)^{2}}\mathbb{E}_{y\sim p_{\ast}}\left[\left\|yp_{t}(x|y)\right\|^{2}\right]\qquad(\text{by Lemma \ref{lemma: variance of mean type result}})
x2+𝔪22Nexp(xμ(t)μp22(σ(t)2+μ(t)2σp22))exp(xμ(t)μp22(σ(t)2/2+μ(t)2σp2))(by Lemma C.10)\displaystyle\lesssim\frac{\|x\|^{2}+\mathfrak{m}_{2}^{2}}{N}\exp\left(\frac{\|x-\mu(t)\mu_{p_{\ast}}\|^{2}}{2\left(\frac{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}}{2}\right)}\right)\exp\left(-\frac{\|x-\mu(t)\mu_{p_{\ast}}\|^{2}}{2\left(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)}\right)\qquad(\text{by Lemma \ref{lemma: useful quantities}})
=x2+𝔪22Nexp(xμ(t)μp22((σ(t)2+μ(t)2σp2)(σ(t)2/2+μ(t)2σp2)μ(t)2σp2))\displaystyle=\frac{\|x\|^{2}+\mathfrak{m}_{2}^{2}}{N}\exp\left(\frac{\|x-\mu(t)\mu_{p_{\ast}}\|^{2}}{2\left(\frac{(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2})(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2})}{\mu(t)^{2}\sigma_{p_{\ast}}^{2}}\right)}\right)

For the term $A_{2}$, by Lemma C.6 we obtain

A2\displaystyle A_{2} =𝔼{yi}pN[(ptN(x)pt(x)pt(x))2vtN(x)2ptN(x)2]\displaystyle=\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left(\frac{p_{t}^{N}(x)-p_{t}(x)}{p_{t}(x)}\right)^{2}\frac{\left\|v_{t}^{N}(x)\right\|^{2}}{p_{t}^{N}(x)^{2}}\right]
\displaystyle\lesssim 1μ(t)21pt(x)2(σ(t)2x2𝔼{yi}pN[(ptN(x)pt(x))2]+𝔼{yi}pN[1Ni=1Nxμ(t)yi2(ptN(x)pt(x))2])\displaystyle\frac{1}{\mu(t)^{2}}\frac{1}{p_{t}(x)^{2}}\left(\sigma(t)^{2}\|x\|^{2}\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left(p_{t}^{N}(x)-p_{t}(x)\right)^{2}\right]+\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\frac{1}{N}\sum_{i=1}^{N}\left\|x-\mu(t)y_{i}\right\|^{2}\left(p_{t}^{N}(x)-p_{t}(x)\right)^{2}\right]\right)
:=\displaystyle:= 1μ(t)21pt(x)2(A2,1+A2,2)\displaystyle\frac{1}{\mu(t)^{2}}\frac{1}{p_{t}(x)^{2}}\left(A_{2,1}+A_{2,2}\right)

By Lemma C.10, we have

A2,1=σ(t)2x2𝔼{yi}pN[(ptN(x)pt(x))2]σ(t)2x2Nexp(xμ(t)μp22(σ(t)2/2+μ(t)2σp2))A_{2,1}=\sigma(t)^{2}\|x\|^{2}\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\left(p_{t}^{N}(x)-p_{t}(x)\right)^{2}\right]\lesssim\frac{\sigma(t)^{2}\|x\|^{2}}{N}\exp\left(-\frac{\|x-\mu(t)\mu_{p_{\ast}}\|^{2}}{2\left(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)}\right)

By Lemma C.8, we know that

A2,2=𝔼{yi}pN[1Ni=1Nxμ(t)yi2(ptN(x)pt(x))2]x2+𝔪22Nexp(xμ(t)μp22(σ(t)2/2+μ(t)2σp2))A_{2,2}=\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\frac{1}{N}\sum_{i=1}^{N}\left\|x-\mu(t)y_{i}\right\|^{2}\left(p_{t}^{N}(x)-p_{t}(x)\right)^{2}\right]\lesssim\frac{\|x\|^{2}+\mathfrak{m}_{2}^{2}}{N}\exp\left(-\frac{\left\|x-\mu(t)\mu_{p_{\ast}}\right\|^{2}}{2\left(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)}\right)

Then we can obtain the upper bound for term A2A_{2}, i.e.

A2\displaystyle A_{2} 1μ(t)21pt(x)2(A2,1+A2,2)\displaystyle\lesssim\frac{1}{\mu(t)^{2}}\frac{1}{p_{t}(x)^{2}}\left(A_{2,1}+A_{2,2}\right) (27)
1μ(t)21pt(x)2x2+𝔪22Nexp(xμ(t)μp22(σ(t)2/2+μ(t)2σp2))\displaystyle\lesssim\frac{1}{\mu(t)^{2}}\frac{1}{p_{t}(x)^{2}}\frac{\|x\|^{2}+\mathfrak{m}_{2}^{2}}{N}\exp\left(-\frac{\left\|x-\mu(t)\mu_{p_{\ast}}\right\|^{2}}{2\left(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)}\right)
1μ(t)2x2+𝔪22Nexp(xμ(t)μp22((σ(t)2+μ(t)2σp2)(σ(t)2/2+μ(t)2σp2)μ(t)2σp2))\displaystyle\lesssim\frac{1}{\mu(t)^{2}}\frac{\|x\|^{2}+\mathfrak{m}_{2}^{2}}{N}\exp\left(\frac{\|x-\mu(t)\mu_{p_{\ast}}\|^{2}}{2\left(\frac{(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2})(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2})}{\mu(t)^{2}\sigma_{p_{\ast}}^{2}}\right)}\right)

We finish the proof by combining the upper bounds of terms A1A_{1} and A2A_{2} derived in (26) and (27). ∎

Lemma C.8.

Under the same assumptions as in Lemma C.7, we have

A2,2:=𝔼{yi}pN[1Ni=1Nxμ(t)yi2(ptN(x)pt(x))2]x2+𝔪22Nexp(xμ(t)μp22(σ(t)2/2+μ(t)2σp2))A_{2,2}:=\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\frac{1}{N}\sum_{i=1}^{N}\left\|x-\mu(t)y_{i}\right\|^{2}\left(p_{t}^{N}(x)-p_{t}(x)\right)^{2}\right]\lesssim\frac{\|x\|^{2}+\mathfrak{m}_{2}^{2}}{N}\exp\left(-\frac{\left\|x-\mu(t)\mu_{p_{\ast}}\right\|^{2}}{2\left(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)}\right)
Proof.

For notational simplicity, we denote $g_{t,x}(y):=p_{t}(x|y)$ and write $\mathbb{E}_{\{y_{i}\}}$ as shorthand for $\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}$ when the context is clear. Then we have

A2,2\displaystyle A_{2,2} =𝔼{yi}pN[1Ni=1Nxμ(t)yi2(ptN(x)pt(x))2]\displaystyle=\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left[\frac{1}{N}\sum_{i=1}^{N}\left\|x-\mu(t)y_{i}\right\|^{2}\left(p_{t}^{N}(x)-p_{t}(x)\right)^{2}\right]
=1Ni=1N𝔼{yi}[xμ(t)yi2(1Nj=1N(gt,x(yj)𝔼yj[gt,x(yj)]))2]\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\{y_{i}\}}\left[\left\|x-\mu(t)y_{i}\right\|^{2}\left(\frac{1}{N}\sum_{j=1}^{N}\left(g_{t,x}(y_{j})-\mathbb{E}_{y_{j}}[g_{t,x}(y_{j})]\right)\right)^{2}\right]

For every i[N]i\in[N], we can compute

𝔼{yk}k=1N[xμ(t)yi2(1Nj=1N(gt,x(yj)𝔼yj[gt,x(yj)]))2]\displaystyle\mathbb{E}_{\{y_{k}\}_{k=1}^{N}}\left[\left\|x-\mu(t)y_{i}\right\|^{2}\left(\frac{1}{N}\sum_{j=1}^{N}\left(g_{t,x}(y_{j})-\mathbb{E}_{y_{j}}[g_{t,x}(y_{j})]\right)\right)^{2}\right]
=𝔼yi[xμ(t)yi2𝔼{yj}jiN[1N2((gt,x(yi)𝔼yi[gt,x(yi)])+jiN(gt,x(yj)𝔼yj[gt,x(yj)]))2]]\displaystyle=\mathbb{E}_{y_{i}}\left[\left\|x-\mu(t)y_{i}\right\|^{2}\mathbb{E}_{\{y_{j}\}_{j\neq i}^{N}}\left[\frac{1}{N^{2}}\left(\left(g_{t,x}(y_{i})-\mathbb{E}_{y_{i}}[g_{t,x}(y_{i})]\right)+\sum_{j\neq i}^{N}\left(g_{t,x}(y_{j})-\mathbb{E}_{y_{j}}[g_{t,x}(y_{j})]\right)\right)^{2}\right]\right]
1N2𝔼yi[xμ(t)yi2𝔼{yj}jiN[(gt,x(yi)𝔼yi[gt,x(yi)])2+(jiN(gt,x(yj)𝔼yj[gt,x(yj)]))2]]\displaystyle\lesssim\frac{1}{N^{2}}\mathbb{E}_{y_{i}}\left[\left\|x-\mu(t)y_{i}\right\|^{2}\mathbb{E}_{\{y_{j}\}_{j\neq i}^{N}}\left[\left(g_{t,x}(y_{i})-\mathbb{E}_{y_{i}}[g_{t,x}(y_{i})]\right)^{2}+\left(\sum_{j\neq i}^{N}\left(g_{t,x}(y_{j})-\mathbb{E}_{y_{j}}[g_{t,x}(y_{j})]\right)\right)^{2}\right]\right]
=1N2(𝔼yi[xμ(t)yi2(gt,x(yi)𝔼yi[gt,x(yi)])2]\displaystyle=\frac{1}{N^{2}}\Bigg{(}\mathbb{E}_{y_{i}}\left[\left\|x-\mu(t)y_{i}\right\|^{2}\left(g_{t,x}(y_{i})-\mathbb{E}_{y_{i}}[g_{t,x}(y_{i})]\right)^{2}\right]
+𝔼yi[xμ(t)yi2]𝔼{yj}jiN[(jiN(gt,x(yj)𝔼yj[gt,x(yj)]))2])\displaystyle\qquad\qquad+\mathbb{E}_{y_{i}}\left[\left\|x-\mu(t)y_{i}\right\|^{2}\right]\mathbb{E}_{\{y_{j}\}_{j\neq i}^{N}}\left[\left(\sum_{j\neq i}^{N}\left(g_{t,x}(y_{j})-\mathbb{E}_{y_{j}}[g_{t,x}(y_{j})]\right)\right)^{2}\right]\Bigg{)}
1N2(𝔼yi[xμ(t)yi2(gt,x(yi)𝔼yi[gt,x(yi)])2]+𝔼yi[xμ(t)yi2](N1)𝔼y[(gt,x(y))2])\displaystyle\leq\frac{1}{N^{2}}\left(\mathbb{E}_{y_{i}}\left[\left\|x-\mu(t)y_{i}\right\|^{2}\left(g_{t,x}(y_{i})-\mathbb{E}_{y_{i}}[g_{t,x}(y_{i})]\right)^{2}\right]+\mathbb{E}_{y_{i}}\left[\left\|x-\mu(t)y_{i}\right\|^{2}\right](N-1)\mathbb{E}_{y}\left[\left(g_{t,x}(y)\right)^{2}\right]\right)

Therefore, we have

A2,2\displaystyle A_{2,2} =1Ni=1N𝔼{yi}[xμ(t)yi2(1Nj=1N(gt,x(yj)𝔼yj[gt,x(yj)]))2]\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\{y_{i}\}}\left[\left\|x-\mu(t)y_{i}\right\|^{2}\left(\frac{1}{N}\sum_{j=1}^{N}\left(g_{t,x}(y_{j})-\mathbb{E}_{y_{j}}[g_{t,x}(y_{j})]\right)\right)^{2}\right]
1Ni=1N1N2(𝔼yi[xμ(t)yi2(gt,x(yi)𝔼yi[gt,x(yi)])2]\displaystyle\lesssim\frac{1}{N}\sum_{i=1}^{N}\frac{1}{N^{2}}\bigg{(}\mathbb{E}_{y_{i}}\left[\left\|x-\mu(t)y_{i}\right\|^{2}\left(g_{t,x}(y_{i})-\mathbb{E}_{y_{i}}[g_{t,x}(y_{i})]\right)^{2}\right]
+(N1)𝔼yi[xμ(t)yi2]𝔼y[(gt,x(y))2])\displaystyle\qquad\qquad\qquad\;\;+(N-1)\mathbb{E}_{y_{i}}\left[\left\|x-\mu(t)y_{i}\right\|^{2}\right]\mathbb{E}_{y}\left[\left(g_{t,x}(y)\right)^{2}\right]\bigg{)}
=1N2𝔼y[xμ(t)y2(gt,x(y)𝔼y[gt,x(y)])2]+N1N2𝔼y[xμ(t)y2]𝔼y[(gt,x(y))2]\displaystyle=\frac{1}{N^{2}}\mathbb{E}_{y}\left[\left\|x-\mu(t)y\right\|^{2}\left(g_{t,x}(y)-\mathbb{E}_{y}[g_{t,x}(y)]\right)^{2}\right]+\frac{N-1}{N^{2}}\mathbb{E}_{y}\left[\left\|x-\mu(t)y\right\|^{2}\right]\mathbb{E}_{y}\left[\left(g_{t,x}(y)\right)^{2}\right]
:=A2,2,1+A2,2,2\displaystyle:=A_{2,2,1}+A_{2,2,2}

Note that

A2,2,1\displaystyle A_{2,2,1} =1N2𝔼y[xμ(t)y2(gt,x(y)𝔼y[gt,x(y)])2]\displaystyle=\frac{1}{N^{2}}\mathbb{E}_{y}\left[\left\|x-\mu(t)y\right\|^{2}\left(g_{t,x}(y)-\mathbb{E}_{y}[g_{t,x}(y)]\right)^{2}\right]
1N2(𝔼y[xμ(t)y2gt,x(y)2]+(𝔼y[gt,x(y)])2𝔼y[xμ(t)y2])\displaystyle\lesssim\frac{1}{N^{2}}\left(\mathbb{E}_{y}\left[\left\|x-\mu(t)y\right\|^{2}g_{t,x}(y)^{2}\right]+\left(\mathbb{E}_{y}\left[g_{t,x}(y)\right]\right)^{2}\mathbb{E}_{y}\left[\|x-\mu(t)y\|^{2}\right]\right)
:=1N2(B1+B2)\displaystyle:=\frac{1}{N^{2}}\left(B_{1}+B_{2}\right)

For term B1B_{1}, we have

B1\displaystyle B_{1} =𝔼y[xμ(t)y2gt,x(y)2]\displaystyle=\mathbb{E}_{y}\left[\left\|x-\mu(t)y\right\|^{2}g_{t,x}(y)^{2}\right]
=𝔼y[xμ(t)y2exp(xμ(t)y22(σ(t)2/2))]\displaystyle=\mathbb{E}_{y}\left[\|x-\mu(t)y\|^{2}\exp\left(-\frac{\|x-\mu(t)y\|^{2}}{2\left(\sigma(t)^{2}/2\right)}\right)\right]
=𝔼y~[y~2exp(y~22(σ(t)2/2))],wherey~:=xμ(t)y𝒩(y~;xμ(t)μp,μ(t)2σp2Id×d)\displaystyle=\mathbb{E}_{\widetilde{y}}\left[\|\widetilde{y}\|^{2}\exp\left(-\frac{\left\|\widetilde{y}\right\|^{2}}{2\left(\sigma(t)^{2}/2\right)}\right)\right],\qquad\text{where}\;\widetilde{y}:=x-\mu(t)y\sim\mathcal{N}(\widetilde{y};x-\mu(t)\mu_{p_{\ast}},\mu(t)^{2}\sigma_{p_{\ast}}^{2}I_{d\times d})
y~2exp(y~22(σ(t)2/2))exp(y~(xμ(t)μp)22μ(t)2σp2)𝑑y\displaystyle\lesssim\int\left\|\widetilde{y}\right\|^{2}\exp\left(-\frac{\left\|\widetilde{y}\right\|^{2}}{2\left(\sigma(t)^{2}/2\right)}\right)\exp\left(-\frac{\left\|\widetilde{y}-(x-\mu(t)\mu_{p_{\ast}})\right\|^{2}}{2\mu(t)^{2}\sigma_{p_{\ast}}^{2}}\right)dy
𝒩(x;μ(t)μp,(σ(t)22+μ(t)2σp2)Id×d)𝔼Y^[Y^2],(similar to the computations in (28))\displaystyle\lesssim\mathcal{N}\left(x;\mu(t)\mu_{p_{\ast}},\left(\frac{\sigma(t)^{2}}{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)I_{d\times d}\right)\mathbb{E}_{\widehat{Y}}\left[\left\|\widehat{Y}\right\|^{2}\right],\qquad(\text{similar to the computations in \eqref{eqn: comp conv of two Gaussians}})
whereY^𝒩(y^;σ(t)2(xμ(t)μp)σ(t)2+2μ(t)2σp2,σ(t)2μ(t)2σp2σ(t)2+2μ(t)2σp2Id×d)\displaystyle\qquad\qquad\text{where}\;\widehat{Y}\sim\mathcal{N}\left(\widehat{y};\frac{\sigma(t)^{2}(x-\mu(t)\mu_{p_{\ast}})}{\sigma(t)^{2}+2\mu(t)^{2}\sigma_{p_{\ast}}^{2}},\frac{\sigma(t)^{2}\mu(t)^{2}\sigma_{p_{\ast}}^{2}}{\sigma(t)^{2}+2\mu(t)^{2}\sigma_{p_{\ast}}^{2}}I_{d\times d}\right)
(xμ(t)μp2+dσp2)exp(xμ(t)μp22(σ(t)2/2+μ(t)2σp2))\displaystyle\lesssim\left(\left\|x-\mu(t)\mu_{p_{\ast}}\right\|^{2}+d\sigma_{p_{\ast}}^{2}\right)\exp\left(-\frac{\left\|x-\mu(t)\mu_{p_{\ast}}\right\|^{2}}{2\left(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)}\right)
(x2+μp2+dσp2)exp(xμ(t)μp22(σ(t)2/2+μ(t)2σp2))\displaystyle\lesssim\left(\|x\|^{2}+\|\mu_{p_{\ast}}\|^{2}+d\sigma_{p_{\ast}}^{2}\right)\exp\left(-\frac{\left\|x-\mu(t)\mu_{p_{\ast}}\right\|^{2}}{2\left(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)}\right)
=(x2+𝔪22)exp(xμ(t)μp22(σ(t)2/2+μ(t)2σp2))\displaystyle=\left(\|x\|^{2}+\mathfrak{m}_{2}^{2}\right)\exp\left(-\frac{\left\|x-\mu(t)\mu_{p_{\ast}}\right\|^{2}}{2\left(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)}\right)

For term B2B_{2}, we have

(𝔼y[gt,x(y)])2=(𝔼y[exp(xμ(t)y22σ(t)2)])2exp(xμ(t)μp22(σ(t)2+μ(t)2σp22))(by Lemma C.10)\left(\mathbb{E}_{y}\left[g_{t,x}(y)\right]\right)^{2}=\left(\mathbb{E}_{y}\left[\exp\left(-\frac{\|x-\mu(t)y\|^{2}}{2\sigma(t)^{2}}\right)\right]\right)^{2}\lesssim\exp\left(-\frac{\|x-\mu(t)\mu_{p_{\ast}}\|^{2}}{2\left(\frac{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}}{2}\right)}\right)\qquad(\text{by Lemma \ref{lemma: useful quantities}})

and

𝔼y[xμ(t)y2]\displaystyle\mathbb{E}_{y}\left[\left\|x-\mu(t)y\right\|^{2}\right] =𝔼y~[y~2],wherey~:=xμ(t)y𝒩(y~;xμ(t)μp,μ(t)2σp2Id×d)\displaystyle=\mathbb{E}_{\widetilde{y}}\left[\left\|\widetilde{y}\right\|^{2}\right],\qquad\text{where}\;\widetilde{y}:=x-\mu(t)y\sim\mathcal{N}(\widetilde{y};x-\mu(t)\mu_{p_{\ast}},\mu(t)^{2}\sigma_{p_{\ast}}^{2}I_{d\times d})
=xμ(t)μp2+dμ(t)2σp2\displaystyle=\left\|x-\mu(t)\mu_{p_{\ast}}\right\|^{2}+d\mu(t)^{2}\sigma_{p_{\ast}}^{2}
x2+𝔪22\displaystyle\lesssim\|x\|^{2}+\mathfrak{m}_{2}^{2}

Therefore, we know that

\displaystyle B_{2}=\left(\mathbb{E}_{y}\left[g_{t,x}(y)\right]\right)^{2}\mathbb{E}_{y}\left[\|x-\mu(t)y\|^{2}\right]\lesssim\left(\|x\|^{2}+\mathfrak{m}_{2}^{2}\right)\exp\left(-\frac{\|x-\mu(t)\mu_{p_{\ast}}\|^{2}}{2\left(\frac{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}}{2}\right)}\right)
(x2+𝔪22)exp(xμ(t)μp22(σ(t)2/2+μ(t)2σp2))\displaystyle\leq\left(\|x\|^{2}+\mathfrak{m}_{2}^{2}\right)\exp\left(-\frac{\left\|x-\mu(t)\mu_{p_{\ast}}\right\|^{2}}{2\left(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)}\right)

Then, we can have the upper bound for A2,2,1A_{2,2,1}, i.e.

A2,2,1=1N2(B1+B2)1N2(x2+𝔪22)exp(xμ(t)μp22(σ(t)2/2+μ(t)2σp2))A_{2,2,1}=\frac{1}{N^{2}}\left(B_{1}+B_{2}\right)\lesssim\frac{1}{N^{2}}\left(\|x\|^{2}+\mathfrak{m}_{2}^{2}\right)\exp\left(-\frac{\left\|x-\mu(t)\mu_{p_{\ast}}\right\|^{2}}{2\left(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)}\right)

Similarly to the computations for the term $A_{2,2,1}$, we can bound $A_{2,2,2}$ as follows:

A2,2,2=N1N2𝔼y[xμ(t)y2]𝔼y[(gt,x(y))2]N1N2(x2+𝔪22)exp(xμ(t)μp22(σ(t)2/2+μ(t)2σp2))\displaystyle A_{2,2,2}=\frac{N-1}{N^{2}}\mathbb{E}_{y}\left[\left\|x-\mu(t)y\right\|^{2}\right]\mathbb{E}_{y}\left[\left(g_{t,x}(y)\right)^{2}\right]\lesssim\frac{N-1}{N^{2}}\left(\|x\|^{2}+\mathfrak{m}_{2}^{2}\right)\exp\left(-\frac{\left\|x-\mu(t)\mu_{p_{\ast}}\right\|^{2}}{2\left(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)}\right)

Therefore, for term A2,2A_{2,2}, we have

A2,2A2,2,1+A2,2,2x2+𝔪22Nexp(xμ(t)μp22(σ(t)2/2+μ(t)2σp2))A_{2,2}\lesssim A_{2,2,1}+A_{2,2,2}\lesssim\frac{\|x\|^{2}+\mathfrak{m}_{2}^{2}}{N}\exp\left(-\frac{\left\|x-\mu(t)\mu_{p_{\ast}}\right\|^{2}}{2\left(\sigma(t)^{2}/2+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)}\right)

Lemma C.9 (Convolution of two Gaussian distributions).

Let fX(x)=𝒩(x;μX,σX2Id×d)f_{X}(x)=\mathcal{N}(x;\mu_{X},\sigma_{X}^{2}I_{d\times d}) and fY(y)=𝒩(y;μY,σY2Id×d)f_{Y}(y)=\mathcal{N}(y;\mu_{Y},\sigma_{Y}^{2}I_{d\times d}), then

fZ(z):=fX(zy)fY(y)𝑑y=𝒩(z;μX+μY,(σX2+σY2)Id×d)f_{Z}(z):=\int f_{X}(z-y)f_{Y}(y)dy=\mathcal{N}(z;\mu_{X}+\mu_{Y},(\sigma_{X}^{2}+\sigma_{Y}^{2})I_{d\times d})
Proof.

One can compute that

fZ(z)\displaystyle f_{Z}(z) =fX(zy)fY(y)𝑑y\displaystyle=\int f_{X}(z-y)f_{Y}(y)dy
exp(zyμx22σX2)exp(yμY22σY2)𝑑y\displaystyle\propto\int\exp\left(-\frac{\|z-y-\mu_{x}\|^{2}}{2\sigma_{X}^{2}}\right)\exp\left(-\frac{\|y-\mu_{Y}\|^{2}}{2\sigma_{Y}^{2}}\right)dy
=exp(12σX2σY2[σY2(z2+y2+μX22zTy2zTμX+2μXTy)\displaystyle=\int\exp\bigg{(}-\frac{1}{2\sigma_{X}^{2}\sigma_{Y}^{2}}\bigg{[}\sigma_{Y}^{2}\left(\|z\|^{2}+\|y\|^{2}+\|\mu_{X}\|^{2}-2z^{T}y-2z^{T}\mu_{X}+2\mu_{X}^{T}y\right)
+σX2(y22μYTy+μY2)])dy\displaystyle\qquad\qquad\qquad\qquad\qquad+\sigma_{X}^{2}\left(\|y\|^{2}-2\mu_{Y}^{T}y+\|\mu_{Y}\|^{2}\right)\bigg{]}\bigg{)}dy
exp(12σX2σY2[(σX2+σY2)y22(σY2(zμX)+σX2μY)Ty+σY2z2])\displaystyle\propto\int\exp\left(-\frac{1}{2\sigma_{X}^{2}\sigma_{Y}^{2}}\left[(\sigma_{X}^{2}+\sigma_{Y}^{2})\|y\|^{2}-2\left(\sigma_{Y}^{2}(z-\mu_{X})+\sigma_{X}^{2}\mu_{Y}\right)^{T}y+\sigma_{Y}^{2}\|z\|^{2}\right]\right)

Define $\sigma_{Z}:=\sqrt{\sigma_{X}^{2}+\sigma_{Y}^{2}}$ and complete the square:

fZ(z)\displaystyle f_{Z}(z) exp(z22σX2)exp(12(σXσYσZ)2(y22σZ2(σY2(zμX)+σX2μY)Ty))𝑑y\displaystyle\propto\exp\left(-\frac{\|z\|^{2}}{2\sigma_{X}^{2}}\right)\int\exp\left(-\frac{1}{2\left(\frac{\sigma_{X}\sigma_{Y}}{\sigma_{Z}}\right)^{2}}\left(\|y\|^{2}-\frac{2}{\sigma_{Z}^{2}}(\sigma_{Y}^{2}(z-\mu_{X})+\sigma_{X}^{2}\mu_{Y})^{T}y\right)\right)dy (28)
exp(z22σX2+σY2(zμX)+σX2μY22σZ2(σXσY)2)exp(12(σXσYσZ)2yσY2(zμX)+σX2μYσZ22)𝑑y\displaystyle\propto\exp\left(-\frac{\|z\|^{2}}{2\sigma_{X}^{2}}+\frac{\|\sigma_{Y}^{2}(z-\mu_{X})+\sigma_{X}^{2}\mu_{Y}\|^{2}}{2\sigma_{Z}^{2}(\sigma_{X}\sigma_{Y})^{2}}\right)\int\exp\left(-\frac{1}{2\left(\frac{\sigma_{X}\sigma_{Y}}{\sigma_{Z}}\right)^{2}}\left\|y-\frac{\sigma_{Y}^{2}(z-\mu_{X})+\sigma_{X}^{2}\mu_{Y}}{\sigma_{Z}^{2}}\right\|^{2}\right)dy
exp(z(μX+μY)22(σX2+σY2))EY^[𝕀{Y^+}],whereY^𝒩(y^;σY2(zμX)+σX2μYσZ2,σX2σY2σZ2Id×d)\displaystyle\propto\exp\left(-\frac{\|z-(\mu_{X}+\mu_{Y})\|^{2}}{2(\sigma_{X}^{2}+\sigma_{Y}^{2})}\right)E_{\widehat{Y}}[\mathbb{I}\{\widehat{Y}\leq+\infty\}],\;\text{where}\;\widehat{Y}\sim\mathcal{N}\left(\widehat{y};\frac{\sigma_{Y}^{2}(z-\mu_{X})+\sigma_{X}^{2}\mu_{Y}}{\sigma_{Z}^{2}},\frac{\sigma_{X}^{2}\sigma_{Y}^{2}}{\sigma_{Z}^{2}}I_{d\times d}\right)
\displaystyle\propto\mathcal{N}\left(z;\mu_{X}+\mu_{Y},(\sigma_{X}^{2}+\sigma_{Y}^{2})I_{d\times d}\right)
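The lemma admits an immediate Monte Carlo check (ours, purely illustrative): the sum of two independent isotropic Gaussians should have mean $\mu_{X}+\mu_{Y}$ and coordinate-wise variance $\sigma_{X}^{2}+\sigma_{Y}^{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 10**6
mu_X, sig_X = np.arange(d, dtype=float), 0.7
mu_Y, sig_Y = -np.ones(d), 1.3

# Z = X + Y with X ~ N(mu_X, sig_X^2 I) and Y ~ N(mu_Y, sig_Y^2 I) independent
Z = (mu_X + sig_X * rng.normal(size=(n, d))) + (mu_Y + sig_Y * rng.normal(size=(n, d)))
print(Z.mean(axis=0))   # approximately mu_X + mu_Y
print(Z.var(axis=0))    # approximately sig_X**2 + sig_Y**2 in every coordinate
```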

Lemma C.10.

Suppose yp=𝒩(y;μp,σp2Id×d)y\sim p_{\ast}=\mathcal{N}\left(y;\mu_{p_{\ast}},\sigma_{p_{\ast}}^{2}I_{d\times d}\right), then one can compute the following quantities:

  • 1.
    pt(x)=𝔼yp[pt(x|y)]=𝒩(x;μ(t)μp,(σ(t)2+μ(t)2σp2)Id×d):=h(x)p_{t}(x)=\mathbb{E}_{y\sim p_{\ast}}[p_{t}(x|y)]=\mathcal{N}\left(x;\mu(t)\mu_{p_{\ast}},\left(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)I_{d\times d}\right):=h(x)
  • 2.
    𝔼yp[yexp(xμ(t)y22σ(t)2)](μ(t)σp2x+σ(t)2μpσ(t)2+μ(t)2σp2)h(x)\mathbb{E}_{y\sim p_{\ast}}\left[y\exp\left(-\frac{\|x-\mu(t)y\|^{2}}{2\sigma(t)^{2}}\right)\right]\propto\left(\frac{\mu(t)\sigma_{p_{\ast}}^{2}x+\sigma(t)^{2}\mu_{p_{\ast}}}{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}}\right)h(x)
  • 3.
    𝔼yp[y2exp(xμ(t)y22σ(t)2)](x2+𝔪22)h(x),\mathbb{E}_{y\sim p_{\ast}}\left[\|y\|^{2}\exp\left(-\frac{\|x-\mu(t)y\|^{2}}{2\sigma(t)^{2}}\right)\right]\lesssim\left(\|x\|^{2}+\mathfrak{m}_{2}^{2}\right)h(x),

    where 𝔪22:=μp2+dσp2\mathfrak{m}_{2}^{2}:=\|\mu_{p_{\ast}}\|^{2}+d\sigma_{p_{\ast}}^{2}. Both \propto and \lesssim indicate ignoring the constants.

Proof.
  • 1.
    𝔼yp[pt(x|y)]\displaystyle\mathbb{E}_{y\sim p_{\ast}}[p_{t}(x|y)] 𝔼yp[exp(xμ(t)y22σ(t)2)]\displaystyle\propto\mathbb{E}_{y\sim p_{\ast}}\left[\exp\left(-\frac{\|x-\mu(t)y\|^{2}}{2\sigma(t)^{2}}\right)\right] (29)
    exp(xμ(t)y22σ(t)2)exp(yμp22σp2)𝑑y\displaystyle\propto\int\exp\left(-\frac{\|x-\mu(t)y\|^{2}}{2\sigma(t)^{2}}\right)\exp\left(-\frac{\|y-\mu_{p_{\ast}}\|^{2}}{2\sigma_{p_{\ast}}^{2}}\right)dy
    =exp(x/μ(t)y22σ(t)2/μ(t)2)exp(yμp22σp2)𝑑y\displaystyle=\int\exp\left(-\frac{\left\|x/\mu(t)-y\right\|^{2}}{2\sigma(t)^{2}/\mu(t)^{2}}\right)\exp\left(-\frac{\|y-\mu_{p_{\ast}}\|^{2}}{2\sigma_{p_{\ast}}^{2}}\right)dy
    =𝒩(xμ(t);0,σ(t)2μ(t)2Id×d)𝒩(y;μp,σp2Id×d)\displaystyle=\mathcal{N}\left(\frac{x}{\mu(t)};0,\frac{\sigma(t)^{2}}{\mu(t)^{2}}I_{d\times d}\right)*\mathcal{N}\left(y;\mu_{p_{\ast}},\sigma_{p_{\ast}}^{2}I_{d\times d}\right)
    =𝒩(x;μ(t)μp,(σ(t)2+μ(t)2σp2)Id×d)(by Lemma C.9)\displaystyle=\mathcal{N}\left(x;\mu(t)\mu_{p_{\ast}},\left(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)I_{d\times d}\right)\qquad(\text{by Lemma \ref{lemma: conv of two Gaussians}})
  • 2.

    Similar to the computations in (28) and (29), one can compute

    𝔼yp[yexp(xμ(t)y22σ(t)2)]\displaystyle\mathbb{E}_{y\sim p_{\ast}}\left[y\exp\left(-\frac{\|x-\mu(t)y\|^{2}}{2\sigma(t)^{2}}\right)\right] 𝒩(x;μ(t)μp,(σ(t)2+μ(t)2σp2)Id×d)𝔼Y^[Y^],\displaystyle\propto\mathcal{N}\left(x;\mu(t)\mu_{p_{\ast}},\left(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)I_{d\times d}\right)\mathbb{E}_{\widehat{Y}}\left[\widehat{Y}\right],
    (whereY^𝒩(y^;μ(t)σp2x+σ(t)2μpσ(t)2+μ(t)2σp2,σ(t)2σp2σ(t)2+μ(t)2σp2Id×d))\displaystyle\qquad\left(\text{where}\;\widehat{Y}\sim\mathcal{N}\left(\widehat{y};\frac{\mu(t)\sigma_{p_{\ast}}^{2}x+\sigma(t)^{2}\mu_{p_{\ast}}}{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}},\frac{\sigma(t)^{2}\sigma_{p_{\ast}}^{2}}{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}}I_{d\times d}\right)\right)
    =(μ(t)σp2x+σ(t)2μpσ(t)2+μ(t)2σp2)h(x)\displaystyle=\left(\frac{\mu(t)\sigma_{p_{\ast}}^{2}x+\sigma(t)^{2}\mu_{p_{\ast}}}{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}}\right)h(x)
  • 3.
    𝔼yp[y2exp(xμ(t)y22σ(t)2)]\displaystyle\mathbb{E}_{y\sim p_{\ast}}\left[\|y\|^{2}\exp\left(-\frac{\|x-\mu(t)y\|^{2}}{2\sigma(t)^{2}}\right)\right] 𝒩(x;μ(t)μp,(σ(t)2+μ(t)2σp2)Id×d)𝔼Y^[Y^2],\displaystyle\propto\mathcal{N}\left(x;\mu(t)\mu_{p_{\ast}},\left(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)I_{d\times d}\right)\mathbb{E}_{\widehat{Y}}\left[\|\widehat{Y}\|^{2}\right],
    (whereY^𝒩(y^;μ(t)σp2x+σ(t)2μpσ(t)2+μ(t)2σp2,σ(t)2σp2σ(t)2+μ(t)2σp2Id×d))\displaystyle\left(\text{where}\;\widehat{Y}\sim\mathcal{N}\left(\widehat{y};\frac{\mu(t)\sigma_{p_{\ast}}^{2}x+\sigma(t)^{2}\mu_{p_{\ast}}}{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}},\frac{\sigma(t)^{2}\sigma_{p_{\ast}}^{2}}{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}}I_{d\times d}\right)\right)
    =(μ(t)σp2x+σ(t)2μpσ(t)2+μ(t)2σp22+dσ(t)2σp2σ(t)2+μ(t)2σp2)h(x)\displaystyle=\left(\left\|\frac{\mu(t)\sigma_{p_{\ast}}^{2}x+\sigma(t)^{2}\mu_{p_{\ast}}}{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}}\right\|^{2}+d\frac{\sigma(t)^{2}\sigma_{p_{\ast}}^{2}}{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}}\right)h(x)
    (x2+μp2+dσp2)h(x)=(x2+𝔪22)h(x)\displaystyle\lesssim\left(\|x\|^{2}+\|\mu_{p_{\ast}}\|^{2}+d\sigma_{p_{\ast}}^{2}\right)h(x)=\left(\|x\|^{2}+\mathfrak{m}_{2}^{2}\right)h(x)
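In particular, item 1 makes the exact score explicit in the Gaussian case: since $p_{t}=\mathcal{N}\left(\mu(t)\mu_{p_{\ast}},(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2})I_{d\times d}\right)$, one has $u(t,x)=\nabla\log p_{t}(x)=-\frac{x-\mu(t)\mu_{p_{\ast}}}{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}}$. The following minimal Python sketch (ours) evaluates this closed form, which is also the one exploited for the synthetic experiment in Appendix E.1; the test values are illustrative.

```python
import numpy as np

def exact_score_gaussian_target(t, x, mu_p, sigma_p):
    """Exact score u(t, x) = grad_x log p_t(x) when p_* = N(mu_p, sigma_p^2 I).

    By Lemma C.10 (item 1), p_t(x) = N(x; mu(t) mu_p, (sigma(t)^2 + mu(t)^2 sigma_p^2) I)
    with mu(t) = exp(-t) and sigma(t)^2 = 1 - exp(-2t).
    """
    mu_t = np.exp(-t)
    var_t = (1.0 - np.exp(-2.0 * t)) + (mu_t * sigma_p) ** 2
    return -(x - mu_t * mu_p) / var_t

# Illustrative usage for a 2-d isotropic Gaussian target
print(exact_score_gaussian_target(t=0.5, x=np.array([0.3, -0.1]),
                                  mu_p=np.zeros(2), sigma_p=1.0))
```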

Lemma C.11.

Given a collection of d-dimensional vectors {yi}i=1N\{y_{i}\}_{i=1}^{N}, the following inequality holds

i=1Nexp(yi2)j=1Nexp(yj2)yi21Ni=1Nyi2\left\|\sum_{i=1}^{N}\frac{\exp\left(-\|y_{i}\|^{2}\right)}{\sum_{j=1}^{N}\exp\left(-\|y_{j}\|^{2}\right)}y_{i}\right\|^{2}\leq\frac{1}{N}\sum_{i=1}^{N}\|y_{i}\|^{2}
Proof.

Denote $w_{i}^{N}:=\frac{\exp\left(-\|y_{i}\|^{2}\right)}{\sum_{j=1}^{N}\exp\left(-\|y_{j}\|^{2}\right)}$ for $i=1,2,\dots,N$. Then we can compute

i=1Nexp(yi2)j=1Nexp(yj2)yi2\displaystyle\left\|\sum_{i=1}^{N}\frac{\exp\left(-\|y_{i}\|^{2}\right)}{\sum_{j=1}^{N}\exp\left(-\|y_{j}\|^{2}\right)}y_{i}\right\|^{2} =i=1NwiNyi2\displaystyle=\left\|\sum_{i=1}^{N}w_{i}^{N}y_{i}\right\|^{2}
=i=1Nj=1NwiNwjNyiTyj\displaystyle=\sum_{i=1}^{N}\sum_{j=1}^{N}w_{i}^{N}w_{j}^{N}y_{i}^{T}y_{j}
12i=1Nj=1NwiNwjN(yi2+yj2)\displaystyle\leq\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}w_{i}^{N}w_{j}^{N}\left(\|y_{i}\|^{2}+\|y_{j}\|^{2}\right)
=i=1NwiNyi2(by the fact that i=1NwiN=1)\displaystyle=\sum_{i=1}^{N}w_{i}^{N}\|y_{i}\|^{2}\qquad(\text{by the fact that $\sum_{i=1}^{N}w_{i}^{N}=1$})
\displaystyle\leq\frac{1}{N}\sum_{i=1}^{N}\|y_{i}\|^{2}\qquad(\text{by Chebyshev's sum inequality, since the weights $w_{i}^{N}$ and the norms $\|y_{i}\|^{2}$ are oppositely ordered}).
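The inequality can also be verified empirically; the following randomized check (ours, with arbitrary test vectors) exercises it for softmax weights of the exact form used above:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    ys = rng.normal(size=(20, 3)) * rng.uniform(0.1, 3.0)   # random test vectors y_1, ..., y_N
    w = np.exp(-np.sum(ys**2, axis=1))
    w /= w.sum()                                             # softmax weights w_i^N
    lhs = np.sum((w @ ys) ** 2)                              # ||sum_i w_i y_i||^2
    rhs = np.mean(np.sum(ys**2, axis=1))                     # (1/N) sum_i ||y_i||^2
    assert lhs <= rhs + 1e-12
print("Lemma C.11 inequality held on all random trials")
```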

Lemma C.12.

For any λ>0\lambda>0, with the Green’s function pt(x|y)p_{t}(x|y) defined in (6), pt(x):=pt(x|y)p(y)𝑑yp_{t}(x):=\int p_{t}(x|y)p_{\ast}(y)dy is lower bounded by

pt(x)\displaystyle p_{t}(x) 1(2πσ(t)2)d/2exp(1+λμ(t)2σ(t)2x2)exp(μ(t)+λμ(t)22λσ(t)2y2)p(y)𝑑y\displaystyle\geq\frac{1}{\left(2\pi\sigma(t)^{2}\right)^{d/2}}\exp\left(-\frac{1+\lambda\mu(t)}{2\sigma(t)^{2}}\|x\|^{2}\right)\int\exp\left(-\frac{\mu(t)+\lambda\mu(t)^{2}}{2\lambda\sigma(t)^{2}}\|y\|^{2}\right)p_{\ast}(y)dy
:=1(2πσ(t)2)d/2exp(1+λμ(t)2σ(t)2x2)Kt\displaystyle:=\frac{1}{\left(2\pi\sigma(t)^{2}\right)^{d/2}}\exp\left(-\frac{1+\lambda\mu(t)}{2\sigma(t)^{2}}\|x\|^{2}\right)K_{t}
Proof.

This follows from a direct computation:

pt(x)\displaystyle p_{t}(x) =pt(x|y)p(y)𝑑y\displaystyle=\int p_{t}(x|y)p_{\ast}(y)dy
=1(2πσ(t)2)d/2exp(xμ(t)y22σ(t)2)p(y)𝑑y\displaystyle=\int\frac{1}{\left(2\pi\sigma(t)^{2}\right)^{d/2}}\exp\left(-\frac{\|x-\mu(t)y\|^{2}}{2\sigma(t)^{2}}\right)p_{\ast}(y)dy
=1(2πσ(t)2)d/2exp(12σ(t)2(x22μ(t)xTy+μ(t)2y2))p(y)𝑑y\displaystyle=\frac{1}{\left(2\pi\sigma(t)^{2}\right)^{d/2}}\int\exp\left(-\frac{1}{2\sigma(t)^{2}}\left(\|x\|^{2}-2\mu(t)x^{T}y+\mu(t)^{2}\|y\|^{2}\right)\right)p_{\ast}(y)dy
1(2πσ(t)2)d/2exp(12σ(t)2(x2+λμ(t)x2+1λμ(t)y2+μ(t)2y2))p(y)𝑑y\displaystyle\geq\frac{1}{\left(2\pi\sigma(t)^{2}\right)^{d/2}}\int\exp\left(-\frac{1}{2\sigma(t)^{2}}\left(\|x\|^{2}+\lambda\mu(t)\|x\|^{2}+\frac{1}{\lambda}\mu(t)\|y\|^{2}+\mu(t)^{2}\|y\|^{2}\right)\right)p_{\ast}(y)dy
=1(2πσ(t)2)d/2exp(1+λμ(t)2σ(t)2x2)exp(μ(t)+λμ(t)22λσ(t)2y2)p(y)𝑑y,\displaystyle=\frac{1}{\left(2\pi\sigma(t)^{2}\right)^{d/2}}\exp\left(-\frac{1+\lambda\mu(t)}{2\sigma(t)^{2}}\|x\|^{2}\right)\int\exp\left(-\frac{\mu(t)+\lambda\mu(t)^{2}}{2\lambda\sigma(t)^{2}}\|y\|^{2}\right)p_{\ast}(y)dy,

where the inequality comes from Young’s inequality, i.e. 2aTbλa2+1λb22a^{T}b\leq\lambda\|a\|^{2}+\frac{1}{\lambda}\|b\|^{2} for any λ>0\lambda>0. ∎

Appendix D Memorization effects

In this section, we provide the proofs of Propositions 4.1 and 4.2 and of Theorem 4.3. For completeness, we restate the propositions and the theorem before their proofs.

Proposition D.1.

Suppose the training samples $\{y_{i}\}_{i=1}^{N}$ satisfy $\|y_{i}\|_{2}\leq d$. Then for any $\delta\geq 0$, $\mathrm{TV}({\mathsf{q}}_{T-\delta},\mathsf{p}_{\ast}^{\gamma})\leq\frac{d\sqrt{\delta}}{2}$ with $\gamma=\sigma(\delta)$, where $\sigma(\cdot)$ is defined in (5).

Proof of Proposition 4.1.

By the definition of total variation, we have

TV(𝗊Tδ,𝗉γ)\displaystyle\mathrm{TV}\left({\mathsf{q}}_{T-\delta},\mathsf{p}_{\ast}^{\gamma}\right) =12d|1Ni=1N𝒩(x;μ(δ)yi,σ(δ)2Id×d)1Ni=1N𝒩(x;yi,σ(δ)2Id×d)|𝑑x\displaystyle=\frac{1}{2}\int_{\mathbb{R}^{d}}\left|\frac{1}{N}\sum_{i=1}^{N}\mathcal{N}\left(x;\mu(\delta)y_{i},\sigma(\delta)^{2}I_{d\times d}\right)-\frac{1}{N}\sum_{i=1}^{N}\mathcal{N}\left(x;y_{i},\sigma(\delta)^{2}I_{d\times d}\right)\right|dx
1Ni=1N12d|𝒩(x;μ(δ)yi,σ(δ)2Id×d)𝒩(x;yi,σ(δ)2Id×d)|𝑑x\displaystyle\leq\frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}\int_{\mathbb{R}^{d}}\left|\mathcal{N}\left(x;\mu(\delta)y_{i},\sigma(\delta)^{2}I_{d\times d}\right)-\mathcal{N}\left(x;y_{i},\sigma(\delta)^{2}I_{d\times d}\right)\right|dx
=1Ni=1NTV(𝒩(x;μ(δ)yi,σ(δ)2Id×d),𝒩(x;yi,σ(δ)2Id×d))\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\mathrm{TV}\left(\mathcal{N}\left(x;\mu(\delta)y_{i},\sigma(\delta)^{2}I_{d\times d}\right),\mathcal{N}\left(x;y_{i},\sigma(\delta)^{2}I_{d\times d}\right)\right)
1Ni=1N12KL(𝒩(x;μ(δ)yi,σ(δ)2Id×d)||𝒩(x;yi,σ(δ)2Id×d))\displaystyle\leq\frac{1}{N}\sum_{i=1}^{N}\sqrt{\frac{1}{2}\mathrm{KL}\left(\mathcal{N}\left(x;\mu(\delta)y_{i},\sigma(\delta)^{2}I_{d\times d}\right)||\mathcal{N}\left(x;y_{i},\sigma(\delta)^{2}I_{d\times d}\right)\right)}
=1Ni=1Nyi21μ(δ)2σ(δ)(by Lemma D.4)\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\|y_{i}\|_{2}\frac{1-\mu(\delta)}{2\sigma(\delta)}\qquad(\text{by Lemma \ref{lemma: KL divergence between two gaussians}})
1exp(δ)21exp(2δ)d(by the definitions of μ(δ) and σ(δ), and yi2d)\displaystyle\leq\frac{1-\exp(-\delta)}{2\sqrt{1-\exp(-2\delta)}}d\qquad(\text{by the definitions of $\mu(\delta)$ and $\sigma(\delta)$, and $\|y_{i}\|_{2}\leq d$})
dδ2.\displaystyle\leq\frac{d\sqrt{\delta}}{2}\,.

Proposition D.2.

Under the same assumptions as in Proposition 4.1, for every $t\in[0,T]$, the total variation between the law $\widehat{\mathsf{q}}_{t}$ of the SGM algorithm (13) run with the empirical optimal score function and the KDE approximation ${\mathsf{q}}_{t}$ is bounded by $\mathrm{TV}\left(\widehat{\mathsf{q}}_{t},{\mathsf{q}}_{t}\right)\leq\frac{d}{2}\exp(-T)$.

Proof of Proposition 4.2.

By the data-processing inequality and Lemma D.5, we have

TV(𝗊t,𝗊^t)TV(𝗊0,𝗊^0)=TV(𝗉T,πd)d2exp(T).\displaystyle\mathrm{TV}({\mathsf{q}}_{t},\widehat{\mathsf{q}}_{t})\leq\mathrm{TV}({\mathsf{q}}_{0},\widehat{\mathsf{q}}_{0})=\mathrm{TV}({\mathsf{p}}_{T},\pi^{d})\leq\frac{d}{2}\exp(-T)\,.

Theorem D.3 (SGM with empirical optimal score function resembles KDE).

Under the same assumptions as in Proposition 4.2, the SGM algorithm (13) with the empirical optimal score function $s^{N}$ returns a Gaussian convolution of the empirical distribution, in the form of (18), and exhibits the following behavior:

  • (with early stopping) for any $\varepsilon>0$, setting $T=\log\frac{d}{\varepsilon}$ and $\delta=\frac{\varepsilon^{2}}{d}$, we have

    TV(𝗊^Tδ,𝗉γ)ε,withγ=σ(δ),\mathrm{TV}(\widehat{\mathsf{q}}_{T-\delta},\mathsf{p}_{\ast}^{\gamma})\leq\varepsilon\,,\quad\text{with}\quad\gamma=\sigma(\delta)\,,
  • (without early stopping) by taking the limit T+T\rightarrow+\infty and δ=0\delta=0, we have 𝗊^=𝗉=1Ni=1Nδyi\widehat{\mathsf{q}}_{\infty}={\mathsf{p}_{\ast}}=\frac{1}{N}\sum_{i=1}^{N}\delta_{y_{i}}.

Proof of Theorem 4.3.

(with early stopping) For $0\leq\delta<T$, combining Proposition 4.1 and Proposition 4.2 via the triangle inequality, we have

TV(𝗊^Tδ,𝗉γ)TV(𝗊Tδ,𝗉γ)+TV(𝗊Tδ,𝗊^Tδ)d2(δ+exp(T))\mathrm{TV}\left(\widehat{\mathsf{q}}_{T-\delta},\mathsf{p}_{\ast}^{\gamma}\right)\leq\mathrm{TV}({\mathsf{q}}_{T-\delta},\mathsf{p}_{\ast}^{\gamma})+\mathrm{TV}\left({\mathsf{q}}_{T-\delta},\widehat{\mathsf{q}}_{T-\delta}\right)\leq\frac{d}{2}\left(\sqrt{\delta}+\exp(-T)\right) (30)

For any ε>0\varepsilon>0, by choosing T=logdεT=\log\frac{d}{\varepsilon} and δ=ε2d\delta=\frac{\varepsilon^{2}}{d}, we obtain TV(𝗊^Tδ,𝗉γ)ε\mathrm{TV}(\widehat{\mathsf{q}}_{T-\delta},\mathsf{p}_{\ast}^{\gamma})\leq\varepsilon.

(without early stopping) By taking the limit T+T\rightarrow+\infty and δ=0\delta=0 in inequality (30), we have TV(𝗊^,𝗉)0\mathrm{TV}(\widehat{\mathsf{q}}_{\infty},{\mathsf{p}_{\ast}})\leq 0. Since the total variation distance is nonnegative, this implies that 𝗊^\widehat{\mathsf{q}}_{\infty} equals the empirical distribution 𝗉=1Ni=1Nδyi{\mathsf{p}_{\ast}}=\frac{1}{N}\sum_{i=1}^{N}\delta_{y_{i}}.

Lemma D.4 (KL divergence between two Gaussian distributions).

Let p=𝒩(μp,Σp)p=\mathcal{N}(\mu_{p},\Sigma_{p}) and q=𝒩(μq,Σq)q=\mathcal{N}(\mu_{q},\Sigma_{q}) be two Gaussian distributions on d\mathbb{R}^{d}. Then the KL divergence between pp and qq is

KL(p||q)=12[log|Σq||Σp|d+(μpμq)TΣq1(μpμq)+Tr{Σq1Σp}]\mathrm{KL}(p||q)=\frac{1}{2}\left[\log\frac{|\Sigma_{q}|}{|\Sigma_{p}|}-d+(\mu_{p}-\mu_{q})^{T}\Sigma_{q}^{-1}(\mu_{p}-\mu_{q})+\text{Tr}\left\{\Sigma_{q}^{-1}\Sigma_{p}\right\}\right]
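For instance, in the equal-covariance case \Sigma_{p}=\Sigma_{q}=\sigma^{2}I_{d\times d}, the log-determinant term vanishes and the trace term cancels the -d term, so the formula reduces to

\mathrm{KL}\left(\mathcal{N}(\mu_{p},\sigma^{2}I_{d\times d})\,||\,\mathcal{N}(\mu_{q},\sigma^{2}I_{d\times d})\right)=\frac{\|\mu_{p}-\mu_{q}\|_{2}^{2}}{2\sigma^{2}}\,,

which, with \mu_{p}=\mu(\delta)y_{i}, \mu_{q}=y_{i}, and \sigma=\sigma(\delta), recovers the expression used in the proof of Proposition 4.1.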
Lemma D.5 (Convergence of forward OU process).

Denote by 𝗉T{\mathsf{p}}_{T} the distribution of the forward OU process at time TT initialized at the empirical distribution 𝗉=1Ni=1Nδyi{\mathsf{p}_{\ast}}=\frac{1}{N}\sum_{i=1}^{N}\delta_{y_{i}}, where {yi}i=1N\{y_{i}\}_{i=1}^{N} are i.i.d. samples such that yi2d\|y_{i}\|_{2}\leq d. Then for T1T\geq 1,

TV(𝗉T,πd)d2exp(T).\mathrm{TV}\left({\mathsf{p}}_{T},\pi^{d}\right)\leq\frac{d}{2}\exp(-T)\,.
Proof.

Since pt(x|y)=𝒩(x;exp(t)y,σ(t)2Id×d)p_{t}(x|y)=\mathcal{N}\left(x;\exp(-t)y,\sigma(t)^{2}I_{d\times d}\right), by Lemma D.4 we have

KL(pt(x|y)||πd)=12[dlogσ(t)2d+dσ(t)2+exp(t)y2]\mathrm{KL}\left(p_{t}(x|y)||\pi^{d}\right)=\frac{1}{2}\left[-d\log\sigma(t)^{2}-d+d\sigma(t)^{2}+\left\|\exp(-t)y\right\|^{2}\right]

By the convexity of the KL divergence,

KL(𝗉T||πd)\displaystyle\mathrm{KL}\left({\mathsf{p}}_{T}||\pi^{d}\right) =KL(dpT(x|y)𝗉(y)𝑑yπd)\displaystyle=\mathrm{KL}\left(\int_{\mathbb{R}^{d}}p_{T}(x|y){\mathsf{p}_{\ast}}(y)dy\bigg{\|}\pi^{d}\right)
KL(pT(x|y)πd)𝗉(y)𝑑y\displaystyle\leq\int\mathrm{KL}\left(p_{T}(x|y)\big{\|}\pi^{d}\right){\mathsf{p}_{\ast}}(y)dy
=12[dlogσ(T)2d+dσ(T)2+exp(2T)𝔼y𝗉y2]\displaystyle=\frac{1}{2}\left[-d\log\sigma(T)^{2}-d+d\sigma(T)^{2}+\exp(-2T)\mathbb{E}_{y\sim{\mathsf{p}_{\ast}}}\left\|y\right\|^{2}\right]
=12[dlog(1exp(2T))d+d(1exp(2T))+exp(2T)𝔼y𝗉y2]\displaystyle=\frac{1}{2}\left[-d\log\left(1-\exp(-2T)\right)-d+d\left(1-\exp(-2T)\right)+\exp(-2T)\mathbb{E}_{y\sim{\mathsf{p}_{\ast}}}\left\|y\right\|^{2}\right]
12exp(2T)𝔼y𝗉[y2](by the fact log(1x)x for x0)\displaystyle\leq\frac{1}{2}\exp(-2T)\mathbb{E}_{y\sim{\mathsf{p}_{\ast}}}[\left\|y\right\|^{2}]\qquad(\text{by the fact $\log(1-x)\geq-x$ for $x\geq 0$})
=12exp(2T)1Ni=1Nyi22d22exp(2T)\displaystyle=\frac{1}{2}\exp(-2T)\frac{1}{N}\sum_{i=1}^{N}\|y_{i}\|_{2}^{2}\leq\frac{d^{2}}{2}\exp(-2T)

By Pinsker’s inequality, we have

TV(𝗉T,πd)12KL(𝗉T||πd)d2exp(T).\mathrm{TV}\left({\mathsf{p}}_{T},\pi^{d}\right)\leq\sqrt{\frac{1}{2}\mathrm{KL}\left({\mathsf{p}}_{T}||\pi^{d}\right)}\leq\frac{d}{2}\exp(-T)\,.
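As an illustration only, the following short Python snippet numerically checks the bound of Lemma D.5 in a one-dimensional toy setting (d=1, three training points with |y_i| ≤ d, and T=2), where the total variation can be evaluated by direct quadrature; the parameter choices and variable names below are ours and are not part of the released code.

import numpy as np
from scipy.stats import norm

d, T = 1, 2.0                              # toy dimension and terminal time
ys = np.array([-1.0, 0.3, 0.8])            # training points with |y_i| <= d
mu_T = np.exp(-T)                          # exp(-T)
sigma_T = np.sqrt(1.0 - np.exp(-2.0 * T))  # sigma(T)

def p_T(x):
    # density of p_T: a uniform mixture of N(exp(-T) y_i, sigma(T)^2)
    return np.mean([norm.pdf(x, loc=mu_T * y, scale=sigma_T) for y in ys], axis=0)

xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
tv = 0.5 * np.sum(np.abs(p_T(xs) - norm.pdf(xs))) * dx   # TV(p_T, pi^d) by quadrature
print(tv, 0.5 * d * np.exp(-T))                          # the bound d/2 * exp(-T)

In this configuration the quadrature value stays below the bound, as the lemma predicts.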

Appendix E Numerical experiments

In this section, we provide the details of the numerical experiments. The code is available at https://github.com/SixuLi/DDPM_and_KDE.111The implementation of KDE generation is based on the code at https://github.com/patrickphat/Generate-Handwritten-Digits-Kernel-Density-Estimation; the implementation of DDPM on the CIFAR10 dataset follows the code at https://github.com/sail-sg/DiffMemorize/tree/main.

E.1 Synthetic data distribution

We consider the case where the target data distribution pp_{\ast} is a 22-dimensional isotropic Gaussian, i.e. p(x):=𝒩(x;μp,σp2I2×2)p_{\ast}(x):=\mathcal{N}(x;\mu_{p_{\ast}},\sigma_{p_{\ast}}^{2}I_{2\times 2}). In this case, the law ptp_{t} of the forward OU process (3) and the exact score function u(t,x)u(t,x) defined in (8) have explicit forms. Specifically, by Lemma C.10, we obtain

pt(x)=pt(x|y)p(y)𝑑y=𝒩(x;μ(t)μp,(σ(t)2+μ(t)2σp2)I2×2)p_{t}(x)=\int p_{t}(x|y)p_{\ast}(y)dy=\mathcal{N}\left(x;\mu(t)\mu_{p_{\ast}},\left(\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}\right)I_{2\times 2}\right) (31)
u(t,x)=u(t,x|y)pt(x|y)p(y)𝑑ypt(x|y)p(y)𝑑y=1σ(t)2x+μ(t)σ(t)2ypt(x|y)p(y)𝑑ypt(x|y)p(y)𝑑y=μ(t)μpxσ(t)2+μ(t)2σp2,u(t,x)=\frac{\int u(t,x|y)p_{t}(x|y)p_{\ast}(y)dy}{\int p_{t}(x|y)p_{\ast}(y)dy}=-\frac{1}{\sigma(t)^{2}}x+\frac{\mu(t)}{\sigma(t)^{2}}\frac{\int yp_{t}(x|y)p_{\ast}(y)dy}{\int p_{t}(x|y)p_{\ast}(y)dy}=\frac{\mu(t)\mu_{p_{\ast}}-x}{\sigma(t)^{2}+\mu(t)^{2}\sigma_{p_{\ast}}^{2}}, (32)

where μ(t)=exp(t)\mu(t)=\exp(-t) and σ(t)2=1exp(2t)\sigma(t)^{2}=1-\exp(-2t) are as defined in (5). We choose μp=[5,5]\mu_{p_{\ast}}=[-5,5] and σp2=10\sigma_{p_{\ast}}^{2}=10 in our experiments.
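For reference, a minimal Python sketch of the closed forms (31) and (32), as we use them in this synthetic experiment, is given below; the function and variable names are our own and are not taken from the released code.

import numpy as np

mu_p = np.array([-5.0, 5.0])   # mean of the target Gaussian p_*
sigma_p2 = 10.0                # variance of the target Gaussian p_*

def mu(t):
    return np.exp(-t)               # mu(t) = exp(-t), see (5)

def sigma2(t):
    return 1.0 - np.exp(-2.0 * t)   # sigma(t)^2 = 1 - exp(-2t), see (5)

def p_t_params(t):
    # mean and isotropic variance of p_t in (31)
    return mu(t) * mu_p, sigma2(t) + mu(t) ** 2 * sigma_p2

def exact_score(t, x):
    # exact score u(t, x) in (32); x has shape (2,) or (n, 2)
    mean_t, var_t = p_t_params(t)
    return (mean_t - x) / var_t

Samples from ptp_{t} can then be drawn as mean_t + np.sqrt(var_t) * np.random.randn(n, 2) with mean_t, var_t = p_t_params(t).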

We first estimate the score approximation error of the empirical optimal score function s{yi}Ns^{N}_{\{y_{i}\}} across various training sample sizes NN. Setting early stopping time δ=0.02\delta=0.02, time interval length T=5T=5, and sample size NN ranging from N=100N=100 to N=2000N=2000, we numerically estimate

𝔼{yi}pN|E{yi}|2=𝔼tU[δ,T]𝔼{yi}pN,xpt[s{yi}N(t,x)u(t,x)2],\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N}}\left|E_{\{y_{i}\}}\right|^{2}=\mathbb{E}_{t\sim U[\delta,T]}\mathbb{E}_{\{y_{i}\}\sim p_{\ast}^{\otimes N},x\sim p_{t}}\left[\left\|s^{N}_{\{y_{i}\}}(t,x)-u(t,x)\right\|^{2}\right], (33)

using the empirical average

|E{yi}|^2:=1K1Mk=1Km=1Ms{yi}N(tk,xmtk)u(tk,xmtk)2\widehat{\left|E_{\{y_{i}\}}\right|}^{2}:=\frac{1}{K}\frac{1}{M}\sum_{k=1}^{K}\sum_{m=1}^{M}\left\|s_{\{y_{i}\}}^{N}(t_{k},x_{m}^{t_{k}})-u(t_{k},x_{m}^{t_{k}})\right\|^{2} (34)

This is achieved by repeating the following steps 1010 times and computing the average output (a code sketch of this procedure is given below):

  • 1.

    Randomly sample NN training data {yi}i=1N\{y_{i}\}_{i=1}^{N} from the target distribution pp_{\ast};

  • 2.

    Uniformly sample {tk}k=1K\{t_{k}\}_{k=1}^{K} from time interval [δ,T][\delta,T] with step size h=0.02h=0.02 and total number of steps K=ThK=\frac{T}{h};

  • 3.

    For each tkt_{k}, sample {xmtk}m=1M\{x_{m}^{t_{k}}\}_{m=1}^{M} (where M=1000M=1000) from the distribution ptkp_{t_{k}} as derived in (31);

  • 4.

    Compute the empirical average |E{yi}|^2\widehat{\left|E_{\{y_{i}\}}\right|}^{2} (34) using {yi}\{y_{i}\}, {tk}\{t_{k}\} and {xmtk}\{x_{m}^{t_{k}}\}.

The results (shown in Figure 2) align with the convergence rate O(1N)O(\frac{1}{N}) as provided in Theorem 3.1.
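A minimal sketch of this estimation procedure (steps 1-4 above) is given below. It reuses mu_p, sigma_p2, mu, sigma2, p_t_params, and exact_score from the previous sketch, assumes that the empirical optimal score function takes the Gaussian-mixture form consistent with (18), and omits the outer averaging over 10 repetitions; the function names, random seed, and vectorization are our own choices.

import numpy as np
from scipy.special import logsumexp

def empirical_score(t, x, ys):
    # assumed form of the empirical optimal score: score of (1/N) sum_i N(x; mu(t) y_i, sigma(t)^2 I), consistent with (18)
    diff = mu(t) * ys[None, :, :] - x[:, None, :]              # shape (M, N, 2)
    logw = -np.sum(diff ** 2, axis=-1) / (2.0 * sigma2(t))     # log of the unnormalized mixture weights
    w = np.exp(logw - logsumexp(logw, axis=1, keepdims=True))  # softmax over the N components
    return np.sum(w[:, :, None] * diff, axis=1) / sigma2(t)

def score_error(N, delta=0.02, T=5.0, h=0.02, M=1000, rng=np.random.default_rng(0)):
    ys = mu_p + np.sqrt(sigma_p2) * rng.standard_normal((N, 2))       # step 1: draw N training points from p_*
    errs = []
    for t in np.arange(delta, T, h):                                  # step 2: time grid on [delta, T] with step h
        mean_t, var_t = p_t_params(t)
        x = mean_t + np.sqrt(var_t) * rng.standard_normal((M, 2))     # step 3: M samples from p_t, see (31)
        errs.append(np.mean(np.sum((empirical_score(t, x, ys) - exact_score(t, x)) ** 2, axis=-1)))
    return np.mean(errs)                                              # step 4: the estimate (34)

print([score_error(N) for N in (100, 500, 2000)])   # expected to decay roughly like 1/N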

In the second part of our experiments, we generate samples from DDPM using either the exact score function u(t,x)u(t,x) or the empirical optimal score function s{yi}N(t,x)s_{\{y_{i}\}}^{N}(t,x). We discretize and simulate the SDEs (4) and (13) using the Euler-Maruyama method. The experiment parameters are: time interval length T=5T=5, discretization step h=0.0005h=0.0005, early stopping times δ=0or 0.01\delta=0\;\text{or}\;0.01, and number of training data points N=100N=100. We generate 10001000 new samples from DDPM with u(t,x)u(t,x) and s{yi}N(t,x)s^{N}_{\{y_{i}\}}(t,x), respectively. Visualization results for δ=0\delta=0 and δ=0.01\delta=0.01 are shown in Figure 4 and Figure 5. The samples generated by DDPM with the empirical optimal score function s{yi}N(t,x)s^{N}_{\{y_{i}\}}(t,x) exhibit strong memorization effects, while those from DDPM with the exact score function u(t,x)u(t,x) appear independent of the training data, yet follow the same distribution. This numerical observation corroborates our theoretical findings in Theorem 4.3.
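The sampling step itself can be sketched as follows, reusing mu, sigma2, exact_score, and empirical_score from the sketches above; this is a plain Euler-Maruyama discretization of the backward SDE (13) under our assumptions, not the released implementation.

import numpy as np

def ddpm_sample(score_fn, T=5.0, h=0.0005, delta=0.01, n_samples=1000, dim=2, rng=np.random.default_rng(1)):
    # score_fn(t, x) is either the exact score u(t, x) or the empirical optimal score s^N(t, x)
    x = rng.standard_normal((n_samples, dim))   # initialize from pi^d
    t = 0.0
    while t < T - delta:                        # early stopping at time T - delta
        x = x + h * (x + 2.0 * score_fn(T - t, x)) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
        t += h
    return x

# e.g., with training data ys: ddpm_sample(lambda t, x: empirical_score(t, x, ys)) versus ddpm_sample(exact_score)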

Figure 4: Left: Samples generated by DDPM with the empirical optimal score function sN(t,x)s^{N}(t,x). Right: Samples generated by DDPM with the true score function u(t,x)u(t,x). Both algorithms are run up to time T=5T=5, i.e. with early stopping time δ=0\delta=0. The blue crosses are the training samples, the green dots are the initialization positions, and the orange points are the generated samples.
Figure 5: Left: Samples generated by DDPM with the empirical optimal score function sN(t,x)s^{N}(t,x). Right: Samples generated by DDPM with the true score function u(t,x)u(t,x). Both algorithms are early stopped with δ=0.01\delta=0.01. The blue crosses are the training samples, the green dots are the initialization positions, and the orange points are the generated samples.

E.2 Real-world data distribution

We consider pp_{\ast} as the underlying distribution generating the CIFAR10 dataset images (Krizhevsky et al., 2009), comprising N=50000N=50000 training samples of dimension d=32×32×3d=32\times 32\times 3. We denote by {yi}i=1N\{y_{i}\}_{i=1}^{N} the 5000050000 images in the CIFAR10 dataset, and we use them to construct the following two generative models.

  • The first is a simple Gaussian Kernel Density Estimation (KDE), i.e. 𝗉γ(x)=1Ni=1N𝒩(x;yi,γ2Id×d)\mathsf{p}_{\ast}^{\gamma}(x)=\frac{1}{N}\sum_{i=1}^{N}\mathcal{N}(x;y_{i},\gamma^{2}I_{d\times d}), where γ\gamma is the Gaussian kernel’s bandwidth. To generate a sample, we first uniformly sample a data point y(j)y_{(j)} from {yi}i=1N\{y_{i}\}_{i=1}^{N}, then apply Gaussian blurring with bandwidth γ\gamma to y(j)y_{(j)} (a code sketch of this sampler is given at the end of this subsection). The bandwidth γ\gamma is set to 0.10.1 times the optimal bandwidth N1d+4σN^{-\frac{1}{d+4}}\sigma given by Scott’s rule (Terrell & Scott, 1992), where σ\sigma is the training data’s standard deviation. The sampling results are shown in the second row of Figure 1. Compared with the training data (the first row in Figure 1), we can clearly see that the generated samples depend strongly on the existing ones.

  • The second is DDPM equipped with the empirical optimal score function, as defined in (13). We follow the implementation and notation of (Gu et al., 2023). Recall the backward SDE (13)

    dX^t=(X^t+2s{yi}N(Tt,X^t))dt+2dBt.d\widehat{X}_{t}^{\leftarrow}=(\widehat{X}_{t}^{\leftarrow}+2s^{N}_{\{y_{i}\}}(T-t,\widehat{X}_{t}^{\leftarrow}))dt+\sqrt{2}dB_{t}\,.

    For sample generation, we discretize the time steps 0=t0<t1<<tK=T0=t_{0}<t_{1}<\cdots<t_{K}=T with T>0T>0 being the time interval length and K>0K>0 being the total number of steps, and apply the Euler-Maruyama solver. The update rule is as follows:

    Xtn=Xtn1+(tntn1)(Xtn1+2s{yi}N(Ttn1,Xtn1))+2(tntn1)Z,X_{t_{n}}=X_{t_{n-1}}+\left(t_{n}-t_{n-1}\right)\left(X_{t_{n-1}}+2s^{N}_{\{y_{i}\}}(T-t_{n-1},X_{t_{n-1}})\right)+\sqrt{2(t_{n}-t_{n-1})}Z, (35)

    where ZπdZ\sim\pi^{d}. We terminate this update rule (35) at tδt_{\delta}, where δ\delta is the early stopping index222Here we abuse the notation δ\delta and refer to it as the early stopping index. It is different from the δ\delta used in the main paper.. We set T=80T=80, K=18K=18, and vary δ\delta. The third row of Figure 1 shows the generated samples with δ=5\delta=5. We observe that the samples generated by DDPM equipped with the empirical optimal score function behave very similarly to the samples generated by the Gaussian KDE (the second row in Figure 1). This aligns with our theoretical findings in Theorem 4.3. Additionally, Figure 6 displays samples for δ=3\delta=3 and δ=5\delta=5, highlighting the strong memorization effect in DDPM with the empirical optimal score function, irrespective of the early stopping time.

    Figure 6: Images generated by DDPM equipped with the empirical optimal score function on the CIFAR10 dataset. The first row shows the original images from the CIFAR10 dataset. The second and third rows correspond to the results of setting the early stopping index δ=3\delta=3 and 55, respectively.
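As referenced in the first bullet above, a minimal sketch of the Gaussian KDE sampler is given below. It assumes the CIFAR10 training images have been loaded and flattened into an array of shape (N, d) with d=32×32×3; the loading step, function name, and random seed are our own and are not part of the released code.

import numpy as np

def kde_sample(train, n_samples, scale=0.1, rng=np.random.default_rng(0)):
    # train: array of shape (N, d) holding the flattened training images
    N, d = train.shape
    sigma = train.std()                               # the training data's standard deviation
    gamma = scale * N ** (-1.0 / (d + 4)) * sigma     # 0.1 times the Scott's-rule bandwidth
    idx = rng.integers(N, size=n_samples)             # uniformly pick training images y_(j)
    return train[idx] + gamma * rng.standard_normal((n_samples, d))   # Gaussian blurring with bandwidth gamma

Each generated image is thus a training image plus isotropic Gaussian noise of standard deviation γ\gamma, which is precisely the sampling mechanism of 𝗉γ\mathsf{p}_{\ast}^{\gamma}.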