
Soft Truncation: A Universal Training Technique of
Score-based Diffusion Model for High Precision Score Estimation

Dongjun Kim    Seungjae Shin    Kyungwoo Song    Wanmo Kang    Il-Chul Moon
Abstract

Recent advances in diffusion models bring state-of-the-art performance on image generation tasks. However, empirical results from previous research on diffusion models imply an inverse correlation between density estimation and sample generation performance. This paper provides empirical evidence that this inverse correlation arises because density estimation is dominated by small diffusion times, whereas sample generation depends mainly on large diffusion times. However, training a score network well across the entire diffusion time is demanding because the loss scale is significantly imbalanced across diffusion times. For successful training, therefore, we introduce Soft Truncation, a universally applicable training technique for diffusion models that softens the fixed and static truncation hyperparameter into a random variable. In experiments, Soft Truncation achieves state-of-the-art performance on the CIFAR-10, CelebA, CelebA-HQ $256\times 256$, and STL-10 datasets.

Machine Learning, ICML

1 Introduction

Recent advances in generative models enable the creation of highly realistic images. One direction of such modeling is likelihood-free models (Karras et al., 2019) based on minimax training. The other direction is likelihood-based models, including VAEs (Vahdat & Kautz, 2020), autoregressive models (Parmar et al., 2018), and flow models (Grcić et al., 2021). Diffusion models (Ho et al., 2020) are among the most successful likelihood-based models, in which the reverse diffusion serves as the generative process. Diffusion models achieve state-of-the-art performance in image generation (Dhariwal & Nichol, 2021).

Previously, models that emphasize Fréchet Inception Distance (FID), such as DDPM (Ho et al., 2020) and ADM (Dhariwal & Nichol, 2021), train the score network with the variance weighting, whereas models that emphasize Negative Log-Likelihood (NLL), such as ScoreFlow (Song et al., 2021a) and VDM (Kingma et al., 2021), train the score network with the likelihood weighting. Such models, however, exhibit a trade-off between NLL and FID: models that emphasize FID perform poorly on NLL, and vice versa. Instead of investigating the trade-off in depth, they train the score network separately in FID-favorable and NLL-favorable settings. This paper introduces Soft Truncation, which significantly resolves the trade-off, with the NLL-favorable setting as the default training configuration. Soft Truncation reports an FID comparable to FID-favorable diffusion models while keeping NLL at the level of NLL-favorable models.

To this end, we observe that the truncation hyperparameter is a significant hyperparameter that determines the overall scale of NLL and FID. This hyperparameter, $\epsilon$, is the smallest diffusion time at which the score function is estimated; the score function below $\epsilon$ is not estimated. A model with a small enough $\epsilon$ favors NLL at the expense of FID, and a model with a relatively large $\epsilon$ favors FID but has poor NLL. Therefore, we introduce Soft Truncation, which softens the fixed and static truncation hyperparameter ($\epsilon$) into a random variable ($\tau$) that randomly selects the smallest diffusion time at every optimization step. In every mini-batch update, we sample a new smallest diffusion time, $\tau$, and the batch optimization estimates the score function only on $[\tau,T]$, rather than $[\epsilon,T]$, ignoring diffusion times below $\tau$. As $\tau$ varies across mini-batch updates, the score network successfully estimates the score function on the entire range of diffusion time, $[\epsilon,T]$, which brings an improved FID.

There are two interesting properties of Soft Truncation. First, although Soft Truncation has nothing to do with the weighting function in its algorithmic design, it turns out to be equivalent, in expectation, to a diffusion model with a general weight (Eq. (10)). The distribution of the random variable $\tau$ determines the weight function (Theorem 1), and this partially explains why Soft Truncation is as successful in FID as the FID-favorable training (Table 4), even though Soft Truncation only considers the truncation threshold in its implementation (Section 4.2). Second, once $\tau$ is sampled in a mini-batch optimization, Soft Truncation optimizes the log-likelihood perturbed by $\tau$ (Lemma 1). Thus, Soft Truncation can be framed as Maximum Perturbed Likelihood Estimation (MPLE), a generalized concept of MLE that is specifically defined only in diffusion models (Section 4.4).

2 Preliminary

Throughout this paper, we focus on continuous-time diffusion models (Song et al., 2021b). A continuous diffusion model slowly and systematically perturbs a data random variable, $\mathbf{x}_{0}$, into a noise variable, $\mathbf{x}_{T}$, as time flows. The diffusion mechanism is represented as a Stochastic Differential Equation (SDE), written as

$$\mathrm{d}\mathbf{x}_{t}=\mathbf{f}(\mathbf{x}_{t},t)\,\mathrm{d}t+g(t)\,\mathrm{d}\mathbf{w}_{t}, \tag{1}$$

where $\mathbf{w}_{t}$ is a standard Wiener process. The drift ($\mathbf{f}$) and the diffusion ($g$) terms are fixed, so the data variable is diffused in a fixed manner. We denote by $\{\mathbf{x}_{t}\}_{t=0}^{T}$ the solution of the SDE of Eq. (1), and we omit the subscript and superscript, writing $\{\mathbf{x}_{t}\}$, if no confusion arises.
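As a minimal sketch of how the forward SDE of Eq. (1) can be simulated, the snippet below applies the Euler-Maruyama scheme to a scalar state. The function name `euler_maruyama`, the step count, and the VP-type coefficients (`BETA_MIN`, `BETA_MAX`) are illustrative assumptions, not values taken from this paper.

```python
import math
import random

def euler_maruyama(x0, drift, diffusion, T=1.0, n_steps=1000, rng=None):
    """Simulate dx = f(x, t) dt + g(t) dw forward in time for a scalar x."""
    rng = rng or random.Random(0)
    dt = T / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # One Euler-Maruyama step: deterministic drift plus scaled Gaussian increment.
        x = x + drift(x, t) * dt + diffusion(t) * math.sqrt(dt) * noise
        t += dt
    return x

# Illustrative VP-type coefficients (assumed values, not from the text).
BETA_MIN, BETA_MAX = 0.1, 20.0

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def vp_drift(x, t):
    return -0.5 * beta(t) * x

def vp_diffusion(t):
    return math.sqrt(beta(t))

x_T = euler_maruyama(1.0, vp_drift, vp_diffusion)
print(f"one forward sample of x_T starting from x_0 = 1: {x_T:.4f}")
```

Setting the diffusion term to zero turns the scheme into a plain ODE solver, which gives a quick sanity check against the closed-form decay of the mean.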

The theory of stochastic calculus indicates that there exists a corresponding reverse SDE given by

$$\mathrm{d}\mathbf{x}_{t}=\big[\mathbf{f}(\mathbf{x}_{t},t)-g^{2}(t)\nabla\log{p_{t}(\mathbf{x}_{t})}\big]\,\mathrm{d}\bar{t}+g(t)\,\mathrm{d}\mathbf{\bar{w}}_{t}, \tag{2}$$

where the solution of this reverse SDE coincides exactly with the solution of the forward SDE of Eq. (1). Here, $\mathrm{d}\bar{t}$ is the backward time differential; $\mathbf{\bar{w}}_{t}$ is a standard Wiener process flowing backward in time (Anderson, 1982); and $p_{t}(\mathbf{x}_{t})$ is the probability distribution of $\mathbf{x}_{t}$. Henceforth, we represent $\{\mathbf{x}_{t}\}$ as the solution of the SDEs of Eqs. (1) and (2).

The diffusion model's objective is to learn the stochastic process, $\{\mathbf{x}_{t}\}$, as a parametrized stochastic process, $\{\mathbf{x}_{t}^{\bm{\theta}}\}$. A diffusion model builds the parametrized stochastic process as a solution of a generative SDE,

$$\mathrm{d}\mathbf{x}_{t}^{\bm{\theta}}=\big[\mathbf{f}(\mathbf{x}_{t}^{\bm{\theta}},t)-g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t}^{\bm{\theta}},t)\big]\,\mathrm{d}\bar{t}+g(t)\,\mathrm{d}\mathbf{\bar{w}}_{t}. \tag{3}$$

We construct the parametrized stochastic process by solving the generative SDE of Eq. (3) backward in time with a starting variable of $\mathbf{x}_{T}^{\bm{\theta}}\sim\pi$, where $\pi$ is a noise distribution. Throughout the paper, we denote by $p_{t}^{\bm{\theta}}$ the probability distribution of $\mathbf{x}_{t}^{\bm{\theta}}$.
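To illustrate solving the generative SDE of Eq. (3) backward in time, the following sketch runs Euler-Maruyama from $t=T$ down to $t=0$ on a one-dimensional toy problem. Everything specific here is an assumption made for illustration: the data are Gaussian so that the exact score is available in closed form and stands in for a trained $\mathbf{s}_{\bm{\theta}}$, the drift and diffusion follow a VP-type schedule with hypothetical constants, and $\pi$ is a standard normal.

```python
import math
import random

BETA_MIN, BETA_MAX, T = 0.1, 20.0, 1.0   # assumed VP-type schedule (not from the text)
DATA_STD = 0.5                            # toy data distribution: x_0 ~ N(0, DATA_STD^2)

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def mu_sq(t):
    # Squared mean coefficient: exp(-int_0^t beta(s) ds).
    return math.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))

def exact_score(x, t):
    # For Gaussian data, p_t = N(0, mu^2 * DATA_STD^2 + sigma^2) with sigma^2 = 1 - mu^2.
    var_t = mu_sq(t) * DATA_STD ** 2 + (1.0 - mu_sq(t))
    return -x / var_t

def sample_reverse(score, rng, n_steps=300):
    """Euler-Maruyama on the generative SDE, integrated from t = T down to t = 0."""
    dt = T / n_steps
    x = rng.gauss(0.0, 1.0)                      # x_T ~ pi = N(0, 1)
    for i in range(n_steps):
        t = T - i * dt
        b = beta(t)                              # g^2(t) = beta(t)
        reverse_drift = -0.5 * b * x - b * score(x, t)   # f - g^2 * s_theta
        # Stepping backward in time flips the sign of the drift contribution.
        x = x - reverse_drift * dt + math.sqrt(b * dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [sample_reverse(exact_score, rng) for _ in range(1000)]
sample_var = sum(s * s for s in samples) / len(samples)
print(f"sample variance {sample_var:.3f} vs. data variance {DATA_STD ** 2:.3f}")
```

With an accurate score, the variance of the generated samples should land close to the data variance; replacing `exact_score` with an imprecise function is a simple way to see how score errors propagate into samples.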

A diffusion model learns the generative stochastic process by minimizing the score loss (Song et al., 2021a) of

$$\mathcal{L}(\bm{\theta};\lambda)=\frac{1}{2}\int_{0}^{T}\lambda(t)\,\mathbb{E}_{\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\,\mathrm{d}t,$$

where $\lambda(t)$ is a weighting function that determines the contribution of each diffusion time to the loss function. This score loss is infeasible to optimize because the data score, $\nabla\log{p_{t}(\mathbf{x}_{t})}$, is intractable in general. Fortunately, $\mathcal{L}(\bm{\theta};\lambda)$ is known to be equivalent to the (continuous) denoising NCSN loss (Song et al., 2021b; Song & Ermon, 2019),

$$\mathcal{L}_{NCSN}(\bm{\theta};\lambda)=\frac{1}{2}\int_{0}^{T}\lambda(t)\,\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big]\,\mathrm{d}t,$$

up to a constant that is irrelevant to the $\bm{\theta}$-optimization.

Two important SDEs are known to attain analytic transition probabilities, $\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}$: the Variance Exploding SDE (VESDE) and the Variance Preserving SDE (VPSDE) (Song et al., 2021b). First, VESDE assumes $\mathbf{f}(\mathbf{x}_{t},t)=0$ and $g(t)=\sigma_{min}(\frac{\sigma_{max}}{\sigma_{min}})^{t}\sqrt{2\log{\frac{\sigma_{max}}{\sigma_{min}}}}$. With these specific forms of $\mathbf{f}$ and $g$, the transition probability of VESDE is the Gaussian distribution $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\mu_{VE}(t)\mathbf{x}_{0},\sigma_{VE}^{2}(t)\mathbf{I})$ with $\mu_{VE}(t)\equiv 1$ and $\sigma_{VE}^{2}(t)=\sigma_{min}^{2}[(\frac{\sigma_{max}}{\sigma_{min}})^{2t}-1]$. Similarly, VPSDE takes $\mathbf{f}(\mathbf{x}_{t},t)=-\frac{1}{2}\beta(t)\mathbf{x}_{t}$ and $g(t)=\sqrt{\beta(t)}$, where $\beta(t)=\beta_{min}+t(\beta_{max}-\beta_{min})$; its transition probability is the Gaussian distribution $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\mu_{VP}(t)\mathbf{x}_{0},\sigma_{VP}^{2}(t)\mathbf{I})$ with $\mu_{VP}(t)=e^{-\frac{1}{2}\int_{0}^{t}\beta(s)\,\mathrm{d}s}$ and $\sigma_{VP}^{2}(t)=1-e^{-\int_{0}^{t}\beta(s)\,\mathrm{d}s}$.
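The closed-form transition kernels above translate directly into code. The sketch below evaluates $(\mu(t),\sigma(t))$ for both SDEs; the function names and the default constants (`beta_min`, `beta_max`, `sigma_min`, `sigma_max`) are hypothetical choices for illustration and are not specified in this section.

```python
import math

def vp_kernel(t, beta_min=0.1, beta_max=20.0):
    """Return (mu_VP(t), sigma_VP(t)) for beta(t) = beta_min + t * (beta_max - beta_min)."""
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t * t   # int_0^t beta(s) ds
    mean_coef = math.exp(-0.5 * integral)
    std = math.sqrt(-math.expm1(-integral))   # sqrt(1 - exp(-integral)), stable for small t
    return mean_coef, std

def ve_kernel(t, sigma_min=0.01, sigma_max=50.0):
    """Return (mu_VE(t), sigma_VE(t)); the VE mean coefficient is identically 1."""
    ratio = sigma_max / sigma_min
    std = sigma_min * math.sqrt(ratio ** (2.0 * t) - 1.0)
    return 1.0, std

for t in (1e-5, 1e-2, 0.5, 1.0):
    print(t, vp_kernel(t), ve_kernel(t))
```

For VPSDE, $\mu_{VP}^{2}(t)+\sigma_{VP}^{2}(t)=1$ at every $t$, which is the sense in which the variance is preserved for unit-variance data.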

Figure 1: The contribution of diffusion time to the variational bound, measured on CIFAR-10 with DDPM++ (VP, NLL) (Song et al., 2021a). (a) Integrand by Time: the integrand of the variational bound is extremely imbalanced on $[\epsilon,T]$. (b) Variational Bound Truncated at $\tau$: the truncated variational bound only changes near $\tau\approx 0$. (c) Test Performance by Log-Time: the truncation hyperparameter ($\epsilon$) is a significant factor for performance.

Recently, Kim et al. (2022) categorize VESDE and VPSDE as members of a family of linear diffusions that have the SDE of

$$\mathrm{d}\mathbf{x}_{t}=-\frac{1}{2}\beta(t)\mathbf{x}_{t}\,\mathrm{d}t+g(t)\,\mathrm{d}\mathbf{w}_{t}, \tag{4}$$

where $\beta(t)$ and $g(t)$ are generic functions of $t$. Under the linear diffusions, we derive that the transition probability follows a Gaussian distribution $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\mu(t)\mathbf{x}_{0},\sigma^{2}(t)\mathbf{I})$ for certain $\mu(t)$ and $\sigma(t)$ depending on $\beta(t)$ and $g(t)$ (see Eq. (16) of Appendix A.1). We emphasize that the suggested Soft Truncation is applicable to any SDE of Eq. (1), but we limit our focus to the family of linear SDEs of Eq. (4), particularly VESDE and VPSDE, for simplicity. With such a Gaussian transition probability, the denoising NCSN loss with a linear SDE is equivalent to

$$\frac{1}{2}\int_{0}^{T}\frac{\lambda(t)}{\sigma^{2}(t)}\,\mathbb{E}_{\mathbf{x}_{0},\bm{\epsilon}}\big[\|\bm{\epsilon}_{\bm{\theta}}(\mu(t)\mathbf{x}_{0}+\sigma(t)\bm{\epsilon},t)-\bm{\epsilon}\|_{2}^{2}\big]\,\mathrm{d}t,$$

if $\bm{\epsilon}_{\bm{\theta}}(\mu(t)\mathbf{x}_{0}+\sigma(t)\bm{\epsilon},t)=-\sigma(t)\mathbf{s}_{\bm{\theta}}(\mu(t)\mathbf{x}_{0}+\sigma(t)\bm{\epsilon},t)$, where $\bm{\epsilon}\sim\mathcal{N}(0,\mathbf{I})$ is a random perturbation, and $\bm{\epsilon}_{\bm{\theta}}$ is the neural network that predicts $\bm{\epsilon}$. This is the (continuous) DDPM loss (Song et al., 2021b), and the equivalence of the two losses provides a unified view of NCSN and DDPM. Hence, NCSN and DDPM are interchangeable, and we take the NCSN loss as the default form of a diffusion loss throughout the paper.
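The equivalence between the two losses can be checked pointwise in one line. The short derivation below is a sketch that uses only the Gaussian transition probability and the parametrization $\bm{\epsilon}_{\bm{\theta}}=-\sigma(t)\mathbf{s}_{\bm{\theta}}$ stated above, with $\mathbf{x}_{t}=\mu(t)\mathbf{x}_{0}+\sigma(t)\bm{\epsilon}$.

```latex
\begin{aligned}
\nabla\log p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})
  &= -\frac{\mathbf{x}_{t}-\mu(t)\mathbf{x}_{0}}{\sigma^{2}(t)}
   = -\frac{\bm{\epsilon}}{\sigma(t)}, \\
\lambda(t)\,\big\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})\big\|_{2}^{2}
  &= \lambda(t)\,\Big\|-\frac{\bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_{t},t)}{\sigma(t)}+\frac{\bm{\epsilon}}{\sigma(t)}\Big\|_{2}^{2}
   = \frac{\lambda(t)}{\sigma^{2}(t)}\,\big\|\bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_{t},t)-\bm{\epsilon}\big\|_{2}^{2}.
\end{aligned}
```

Taking the expectation over $\mathbf{x}_{0}$ and $\bm{\epsilon}$ and integrating over $t$ recovers the DDPM form of the loss.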

The NCSN loss training is connected to the likelihood training in Song et al. (2021a) by

$$\mathbb{E}_{\mathbf{x}_{0}}\big[-\log{p_{0}^{\bm{\theta}}(\mathbf{x}_{0})}\big]\leq\mathcal{L}_{NCSN}(\bm{\theta};g^{2}), \tag{5}$$

when the weighting function is the square of the diffusion term, $\lambda(t)=g^{2}(t)$, called the likelihood weighting.

3 Training and Evaluation of Diffusion Models in Practice

3.1 The Need for Truncation

In the family of linear SDEs, the gradient of the log transition probability satisfies $\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}=-\frac{\mathbf{x}_{t}-\mu(t)\mathbf{x}_{0}}{\sigma^{2}(t)}=-\frac{\mathbf{z}}{\sigma(t)}$, where $\mathbf{x}_{t}$ is given by $\mu(t)\mathbf{x}_{0}+\sigma(t)\mathbf{z}$ with $\mathbf{z}\sim\mathcal{N}(0,\mathbf{I})$. The denominator, $\sigma(t)$, converges to zero as $t\rightarrow 0$, which leads $\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}$ to diverge as $t\rightarrow 0$, as illustrated in Figure 1-(a); see Appendix A.2 for details. Therefore, the Monte-Carlo estimation of the NCSN loss has high variance, which prevents stable training of the score network. In practice, therefore, previous research truncates the diffusion time range to $[\tau,T]$, with a positive truncation hyperparameter, $\tau=\epsilon>0$.
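The scale of this divergence is easy to quantify for the regression target alone: since $\mathbf{z}\sim\mathcal{N}(0,\mathbf{I})$, the expected squared norm of $-\mathbf{z}/\sigma(t)$ is $d/\sigma^{2}(t)$ in dimension $d$. The sketch below evaluates it for a VP-type $\sigma(t)$; the schedule constants are assumed for illustration, and $d=3072$ corresponds to a $32\times 32\times 3$ image.

```python
import math

def vp_sigma(t, beta_min=0.1, beta_max=20.0):
    # sigma_VP(t) = sqrt(1 - exp(-int_0^t beta(s) ds)); the schedule constants are assumptions.
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t * t
    return math.sqrt(-math.expm1(-integral))

def target_sq_norm(t, dim):
    # E||z / sigma(t)||^2 = dim / sigma(t)^2 for z ~ N(0, I_dim).
    return dim / vp_sigma(t) ** 2

for t in (1.0, 1e-1, 1e-3, 1e-5):
    print(f"t = {t:g}: E||grad log p_0t||^2 = {target_sq_norm(t, dim=3072):.3e}")
```

The printed values grow by several orders of magnitude as $t$ shrinks, which is why a strictly positive truncation is needed for a finite-variance Monte-Carlo estimate.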

3.2 Variational Bound With Positive Truncation

For the analysis of density estimation in Section 3.3, this section derives the variational bound of the log-likelihood when a diffusion model has a positive truncation, because Inequality (5) holds only with zero truncation ($\tau=0$). Lemma 1 provides a generalization of Inequality (5), proved by applying the data processing inequality (Gerchinovitz et al., 2020) and the Girsanov theorem (Pavon & Wakolbinger, 1991; Vargas et al., 2021; Song et al., 2021a).

Lemma 1.

For any $\tau\in[0,T]$,

$$\mathbb{E}_{\mathbf{x}_{\tau}}\big[-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big]\leq\mathcal{L}(\bm{\theta};g^{2},\tau) \tag{6}$$

holds, where $\mathcal{L}(\bm{\theta};g^{2},\tau)=\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\,\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big]\,\mathrm{d}t$, up to a constant; see Eq. (17).

Lemma 1 is a generalization of Inequality (5) in that Inequality (6) collapses to Inequality (5) under zero truncation: $\mathcal{L}_{NCSN}(\bm{\theta};\lambda)=\mathcal{L}(\bm{\theta};\lambda,\tau=0)$. If the time range is truncated to $[\tau,T]$ for $\tau\in[0,T]$, then variational inference bounds the negative log-likelihood as

$$\mathbb{E}_{\mathbf{x}_{0}}\big[-\log{p_{0}^{\bm{\theta}}(\mathbf{x}_{0})}\big]\leq\mathbb{E}_{\mathbf{x}_{\tau}}\big[-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big]+R_{\tau}(\bm{\theta}), \tag{7}$$

where

$$R_{\tau}(\bm{\theta})=\mathbb{E}_{\mathbf{x}_{0}}\bigg[\int p_{0\tau}(\mathbf{x}_{\tau}|\mathbf{x}_{0})\log{\frac{p_{0\tau}(\mathbf{x}_{\tau}|\mathbf{x}_{0})}{p_{\bm{\theta}}(\mathbf{x}_{0}|\mathbf{x}_{\tau})}}\,\mathrm{d}\mathbf{x}_{\tau}\bigg],$$

with $p_{\bm{\theta}}(\mathbf{x}_{0}|\mathbf{x}_{\tau})$ being the probability distribution of $\mathbf{x}_{0}$ given $\mathbf{x}_{\tau}$ and the score estimation with $\mathbf{s}_{\bm{\theta}}$ at $\tau$. For any $\tau$, we apply Lemma 1 to the right-hand side of Inequality (7) to obtain the variational bound of the log-likelihood as

$$\mathbb{E}_{\mathbf{x}_{0}}\big[-\log{p_{0}^{\bm{\theta}}(\mathbf{x}_{0})}\big]\leq\mathcal{L}(\bm{\theta};g^{2},\tau)+R_{\tau}(\bm{\theta}). \tag{8}$$

Figure 2: The truncation time is key to enhancing the microscopic sample quality.

3.3 A Universal Phenomenon in Diffusion Training: Extremely Imbalanced Loss

To avoid the diverging issue introduced in Section 3.1, previous works in VPSDE (Song et al., 2021a; Vahdat et al., 2021) modify the loss by truncating the integration to $[\tau,T]$ with a fixed hyperparameter $\tau=\epsilon>0$ so that the score network does not estimate the score function on $[0,\epsilon)$. Analogously, previous works in VESDE (Song et al., 2021b; Chen et al., 2022) approximate $\sigma_{VE}^{2}(t)\approx\sigma_{min}^{2}(\frac{\sigma_{max}}{\sigma_{min}})^{2t}$ to truncate the minimum variance of the transition probability to $\sigma_{min}^{2}$. Truncating diffusion time at $\epsilon$ in VPSDE is equivalent to truncating diffusion variance ($\sigma_{min}^{2}$) in VESDE, so these two truncations on VE/VP SDEs have an identical effect on bounding the diffusion loss. Henceforth, this paper discusses truncating diffusion time (VPSDE) and truncating diffusion variance (VESDE) interchangeably.

Figure 1 illustrates the significance of truncation in the training of diffusion models. Even with a strictly positive truncation of $\epsilon=10^{-5}$, Figure 1-(a) shows that the integrand of $\mathcal{L}(\bm{\theta};g^{2},\tau)$ in the Bits-Per-Dimension (BPD) scale is still extremely imbalanced. Such extreme imbalance appears to be a universal phenomenon in training a diffusion model, and it lasts from the beginning to the end of training.

Figure 3: Illustration of the generative process trained on CelebA-HQ $256\times 256$ with NCSN++ (VE) (Song et al., 2021b). The score precision on large diffusion time is key to constructing realistic overall sample quality.

Figure 4: Norm of the reverse drift of the generative process, trained on CIFAR-10 with DDPM++ (VP, FID) (Song et al., 2021b).

The green line in Figure 1-(b) presents the variational bound of the log-likelihood (the right-hand side of Inequality (8)) on the $y$-axis, and it indicates that the variational bound decreases sharply near small diffusion time. Therefore, if $\epsilon$ is insufficiently small, the variational bound is not tight to the log-likelihood, and a diffusion model fails at MLE training. In addition, Figure 2 indicates that an insufficiently small $\epsilon$ (or $\sigma_{min}$) would also harm the microscopic sample quality. From these observations, $\epsilon$ becomes a significant hyperparameter that needs to be selected carefully.

3.4 Effect of Truncation on Model Evaluation

Figure 1-(c) reports test performance on density estimation. It illustrates that both the Negative Evidence Lower Bound (NELBO) and NLL monotonically decrease as $\epsilon$ is lowered, because small diffusion time contributes most of the NELBO at test time as well as at training time. Therefore, a common strategy would be to reduce $\epsilon$ as much as possible to reduce the test NELBO/NLL.

Figure 5: Regenerated samples synthesized by solving the probability flow ODE on $[\epsilon,\tau]$ backward with the initial point of $\mathbf{x}_{\tau}=\mu(\tau)\mathbf{x}_{0}+\sigma(\tau)\mathbf{z}$ for $\mathbf{z}\sim\mathcal{N}(0,\mathbf{I})$, trained on CelebA with DDPM++ (VP, FID) (Song et al., 2021b).

Table 1: Ablation on $\sigma_{min}$ (CIFAR-10).

| $\sigma_{min}$ | NLL ($\downarrow$) | FID-10k ($\downarrow$) |
|---|---|---|
| $10^{-2}$ | 4.95 | 6.95 |
| $10^{-3}$ | 3.04 | 7.04 |
| $10^{-4}$ | 2.99 | 8.17 |
| $10^{-5}$ | 2.97 | 8.29 |

In contrast, $\epsilon$ has the opposite effect on FID. Table 1, trained on CIFAR-10 (Krizhevsky et al., 2009) with NCSN++ (Song et al., 2021b), shows that FID worsens as we take a smaller hyperparameter $\sigma_{min}$ for the training. It is the range of small diffusion time that significantly contributes to the variational bound in the blue line of Figure 1-(b), so the score network with a small truncation hyperparameter, $\sigma_{min}$ or $\epsilon$, remains unoptimized on large diffusion time. In the lens of Figure 2, therefore, the inconsistent result of Table 1 is attributed to the inaccurate score on large diffusion time.

Table 2: FID-10k scores.

| $\sigma_{min}$ | $10^{-3}$ | $10^{-4}$ | $10^{-5}$ |
|---|---|---|---|
| $\sigma_{tr}=1$ | 6.84 | 8.04 | 8.29 |

We design an experiment to validate the above argument in Table 2. This experiment utilizes two types of score networks: 1) three alternative networks (type A) with $\sigma_{min}\in\{10^{-3},10^{-4},10^{-5}\}$, trained in the Table 1 experiment; 2) a network (B) with $\sigma_{min}=10^{-5}$ (the last row of Table 1). With these score networks, we first denoise from $\sigma_{max}$ down to a common and fixed $\sigma_{tr}$ ($=1$) using one of the type-A networks, and we then use B to denoise further from $\sigma_{tr}$ to $\sigma_{min}=10^{-5}$. This further denoising step with model B enables us to compare the score accuracy on large diffusion time for models with diverse truncation hyperparameters in a fair resolution setting. Table 2 shows that the model with $\sigma_{min}=10^{-3}$ has the best FID, implying that training with too small a truncation would harm the sample fidelity.

Figure 6: The experimental result trained on CIFAR-10 with DDPM++ (VP, NLL) (Song et al., 2021a). (a) Monte-Carlo Loss: the Monte-Carlo loss for each diffusion time, $\sigma^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}$. (b) Soft Truncation: the Monte-Carlo loss for each diffusion time under various truncation times. (c) Importance Distribution: the importance distribution for various truncation distributions.

Specifically, Figure 4 shows the Euclidean norm of $g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)$, where each dot represents a Monte-Carlo sample from $p_{t}(\mathbf{x}_{t})$. Here, $g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)$ is part of the reverse drift term of the generative process, $\mathrm{d}\mathbf{x}_{t}^{\bm{\theta}}=[\mathbf{f}(\mathbf{x}_{t}^{\bm{\theta}},t)-g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t}^{\bm{\theta}},t)]\,\mathrm{d}\bar{t}+g(t)\,\mathrm{d}\mathbf{\bar{w}}_{t}$. Figure 4 illustrates that it is the large diffusion time that dominates the sampling process. Therefore, a precise score network on large diffusion time is particularly important in sample generation.

The imprecise score mainly affects the global sample context, as the denoising on small diffusion time only crafts the microscopic details of the image, as illustrated in Figures 3 and 5. Figure 3 shows how the global fidelity is damaged: a man synthesized in the second row has unrealistic curly hair on his forehead, constructed on the large diffusion time. Figure 5 further underlines the importance of learning a good score estimation on large diffusion time. It shows the samples regenerated by solving the generative process backward in time, starting from $\mathbf{x}_{\tau}$ (Meng et al., 2021).

4 Soft Truncation: A Training Technique for a Diffusion Model

As shown in Section 3, the choice of $\epsilon$ is crucial for training and evaluation, but it is computationally infeasible to search for the optimal $\epsilon$. Therefore, we introduce a training technique that largely alleviates the need for an $\epsilon$-search by softening the fixed truncation hyperparameter into a truncation random variable so that the truncation time varies in every optimization step. Our approach successfully trains the score network on large diffusion time without sacrificing NLL. We explain the Monte-Carlo estimation of the variational bound in Section 4.1; it is the common practice of previous research, but we describe it to emphasize how simple (though effective) Soft Truncation is. We subsequently introduce Soft Truncation in Section 4.2.

Figure 7: Quartile of importance weighted Monte-Carlo time of VPSDE. Red dots represent Q1/Q2/Q3/Q4 quantiles when truncated at $\tau=\epsilon=10^{-5}$. About $25\%$ and $50\%$ of Monte-Carlo time are located in $[\epsilon,5\times 10^{-3}]$ and $[\epsilon,0.106]$, respectively. Green dots represent Q0-Q5 quantiles when truncated at $\tau=0.1$. Importance weighted Monte-Carlo time with $\tau=0.1$ is distributed much more evenly than with the truncation at $\tau=\epsilon$.

4.1 Monte-Carlo Estimation of Truncated Variational Bound with Importance Sampling

In this section, we fix the truncation hyperparameter to be $\tau=\epsilon$. For every batch $\{\mathbf{x}_{0}^{(b)}\}_{b=1}^{B}$, the Monte-Carlo estimation of the variational bound in Inequality (6) is $\mathcal{L}(\bm{\theta};g^{2},\epsilon)\approx\mathcal{\hat{L}}(\bm{\theta};g^{2},\epsilon)=\frac{1}{2B}\sum_{b=1}^{B}g^{2}(t^{(b)})\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t^{(b)}},t^{(b)})-\nabla\log{p_{0t^{(b)}}(\mathbf{x}_{t^{(b)}}|\mathbf{x}_{0}^{(b)})}\|_{2}^{2}$, up to a constant irrelevant to $\bm{\theta}$, where $\mathbf{x}_{t^{(b)}}=\mu(t^{(b)})\mathbf{x}_{0}^{(b)}+\sigma(t^{(b)})\bm{\epsilon}^{(b)}$, and $\{t^{(b)}\}_{b=1}^{B}$ and $\{\bm{\epsilon}^{(b)}\}_{b=1}^{B}$ are the corresponding Monte-Carlo samples, with $t^{(b)}$ drawn uniformly from $[\epsilon,T]$ and $\bm{\epsilon}^{(b)}\sim\mathcal{N}(0,\mathbf{I})$. Note that this Monte-Carlo estimation is tractably computed from the analytic form of the transition probability, since $\nabla\log{p_{0t^{(b)}}(\mathbf{x}_{t^{(b)}}|\mathbf{x}_{0}^{(b)})}=-\frac{\bm{\epsilon}^{(b)}}{\sigma(t^{(b)})}$ under linear SDEs, consistent with Section 3.1.

Previous works (Song et al., 2021a; Huang et al., 2021) apply importance sampling with the importance distribution $p_{iw}(t)=\frac{g^{2}(t)/\sigma^{2}(t)}{Z_{\epsilon}}1_{[\epsilon,T]}(t)$, where $Z_{\epsilon}=\int_{\epsilon}^{T}\frac{g^{2}(t)}{\sigma^{2}(t)}\,\mathrm{d}t$. It is well known (Goodfellow et al., 2016) that the Monte-Carlo variance of $\hat{\mathcal{L}}$ is minimized if the importance distribution is $p_{iw}^{*}(t)\propto g^{2}(t)L(t)$ with $L(t)=\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}]$, but sampling the Monte-Carlo diffusion time from $p_{iw}^{*}(t)$ at every training iteration would make training at least $2\times$ slower, because this importance sampling requires evaluating the score network. Therefore, previous research approximates $L(t)$ by $\hat{L}(t)=\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}[\|\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}]\propto 1/\sigma^{2}(t)$, and $p_{iw}(t)$ becomes the approximate importance distribution. This approximation, at the expense of bias, is cheap because the closed form of the inverse Cumulative Distribution Function (CDF) is known. Unless we train the variance directly as in Kingma et al. (2021), we believe $p_{iw}(t)$ is the maximally efficient sampler as long as the training speed matters. The importance weighted Monte-Carlo estimation becomes

$$\begin{aligned}
\mathcal{L}(\bm{\theta};g^{2},\epsilon)
&=\frac{Z_{\epsilon}}{2}\int_{\epsilon}^{T}p_{iw}(t)\,\sigma^{2}(t)\,\mathbb{E}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big]\,\mathrm{d}t \\
&\approx\frac{Z_{\epsilon}}{2B}\sum_{b=1}^{B}\sigma^{2}(t_{iw}^{(b)})\bigg\|\mathbf{s}_{\bm{\theta}}\big(\mathbf{x}_{t_{iw}^{(b)}},t_{iw}^{(b)}\big)+\frac{\bm{\epsilon}^{(b)}}{\sigma(t_{iw}^{(b)})}\bigg\|_{2}^{2} \\
&:=\mathcal{\hat{L}}_{iw}(\bm{\theta};g^{2},\epsilon),
\end{aligned} \tag{9}$$

where the sign of the second term follows from $\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}=-\bm{\epsilon}/\sigma(t)$, and $\{t_{iw}^{(b)}\}_{b=1}^{B}$ is the Monte-Carlo sample from the importance distribution, i.e., $t_{iw}^{(b)}\sim p_{iw}(t)\propto\frac{g^{2}(t)}{\sigma^{2}(t)}$.
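As a hedged sketch of how diffusion times can be drawn from $p_{iw}(t)\propto g^{2}(t)/\sigma^{2}(t)$ by inverse-CDF sampling, the code below works out the VPSDE case with the linear $\beta(t)$ from Section 2. With $B(t)=\int_{0}^{t}\beta(s)\,\mathrm{d}s$, one has $g^{2}/\sigma^{2}=B'(t)/(1-e^{-B(t)})$, whose antiderivative is $\log(e^{B(t)}-1)$; inverting it reduces to solving a quadratic in $t$. The function names and the schedule constants are assumptions for illustration, and the derivation is specific to this schedule.

```python
import math
import random

BETA_MIN, BETA_MAX, T = 0.1, 20.0, 1.0    # assumed VPSDE schedule (not from the text)

def beta_integral(t):
    # B(t) = int_0^t beta(s) ds for beta(s) = BETA_MIN + s * (BETA_MAX - BETA_MIN)
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t

def antiderivative(t):
    # d/dt log(exp(B(t)) - 1) = beta(t) / (1 - exp(-B(t))) = g^2(t) / sigma^2(t)
    return math.log(math.expm1(beta_integral(t)))

def normalizer(trunc):
    # Z_trunc = int_trunc^T g^2(t) / sigma^2(t) dt
    return antiderivative(T) - antiderivative(trunc)

def inverse_cdf(u, trunc):
    """Map u in [0, 1] to the time t in [trunc, T] whose CDF value under p_iw is u."""
    level = antiderivative(trunc) + u * normalizer(trunc)
    integral = math.log1p(math.exp(level))             # value of B(t) at the target time
    delta = BETA_MAX - BETA_MIN
    # Positive root of 0.5 * delta * t^2 + BETA_MIN * t - integral = 0.
    return (-BETA_MIN + math.sqrt(BETA_MIN ** 2 + 2.0 * delta * integral)) / delta

def sample_importance_times(batch_size, trunc, rng):
    return [inverse_cdf(rng.random(), trunc) for _ in range(batch_size)]

rng = random.Random(0)
times = sample_importance_times(8, trunc=1e-5, rng=rng)
print(sorted(times))
```

Evaluating `inverse_cdf` at a few quantile levels for different values of `trunc` gives a direct view of how strongly the sampled times concentrate near the truncation point.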

Importance sampling is advantageous in both NLL and FID (Song et al., 2021a) over uniform sampling, as it significantly reduces the estimation variance. Figure 6-(a) illustrates the sample-by-sample loss, and importance sampling significantly reduces the variation of the loss scale across diffusion time compared to the scale in Figure 1-(a). However, the importance distribution satisfies $p_{iw}(t)\rightarrow\infty$ as $t\rightarrow 0$ (blue line in Figure 6-(c)), and most of the importance weighted Monte-Carlo time is concentrated at $t\approx\epsilon$ in Figure 7. Hence, the use of importance sampling involves a trade-off between the reduced variance (Figure 6-(a)) and the over-sampled diffusion time near $t\approx\epsilon$ (Figure 7). Therefore, the inaccurate score estimation on large diffusion time appears regardless of the sampling strategy, whether or not importance sampling is used, and resolving this immature score estimation becomes a nontrivial task.

Instead of the likelihood weighting, previous works (Ho et al., 2020; Nichol & Dhariwal, 2021; Dhariwal & Nichol, 2021) train the denoising score loss with the variance weighting, $\lambda(t)=\sigma^{2}(t)$. With this weighting, the importance distribution becomes the uniform distribution, $p_{iw}(t)=\frac{\lambda(t)}{\sigma^{2}(t)}\equiv 1$, so it significantly alleviates the trade-off of using the likelihood weighting. However, the variance weighting favors FID at the expense of NLL because the loss is no longer the variational bound of the log-likelihood. In contrast, training with the likelihood weighting leans towards NLL rather than FID, so Soft Truncation aims at balancing NLL and FID while using the likelihood weighting.

4.2 Soft Truncation

Soft Truncation relaxes the truncation hyperparameter from a static variable to a random variable with a probability distribution $\mathbb{P}(\tau)$. In every mini-batch update, Soft Truncation optimizes the diffusion model with $\mathcal{\hat{L}}_{iw}(\bm{\theta};g^{2},\tau)$ in Eq. (9) for a sampled $\tau\sim\mathbb{P}(\tau)$. In other words, for every batch $\{\mathbf{x}_{0}^{(b)}\}_{b=1}^{B}$, Soft Truncation optimizes the Monte-Carlo loss

$$\mathcal{\hat{L}}_{iw}(\bm{\theta};g^{2},\tau)=\frac{Z_{\tau}}{2B}\sum_{b=1}^{B}\sigma^{2}(t_{iw}^{(b)})\bigg\|\mathbf{s}_{\bm{\theta}}\big(\mathbf{x}_{t_{iw}^{(b)}},t_{iw}^{(b)}\big)+\frac{\bm{\epsilon}^{(b)}}{\sigma(t_{iw}^{(b)})}\bigg\|_{2}^{2}$$

with the regression target $-\bm{\epsilon}^{(b)}/\sigma(t_{iw}^{(b)})$ being the conditional score from Section 3.1, and with $\{t_{iw}^{(b)}\}_{b=1}^{B}$ sampled from the importance distribution $p_{iw,\tau}(t)=\frac{g^{2}(t)/\sigma^{2}(t)}{Z_{\tau}}1_{[\tau,T]}(t)$, where $Z_{\tau}:=\int_{\tau}^{T}\frac{g^{2}(t)}{\sigma^{2}(t)}\,\mathrm{d}t$.
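The per-batch procedure can be sketched on a scalar toy problem as follows. This is an illustration under stated assumptions rather than the authors' implementation: the truncation prior `sample_truncation` (log-uniform on $[\epsilon,T]$) is a hypothetical choice because $\mathbb{P}(\tau)$ is not specified at this point, the VPSDE constants are assumed, `toy_score_model` is a stand-in for a neural network, and the regression target is the conditional score $-\bm{\epsilon}/\sigma(t)$ from Section 3.1.

```python
import math
import random

BETA_MIN, BETA_MAX, T, EPS = 0.1, 20.0, 1.0, 1e-5   # assumed VPSDE setup

def beta_integral(t):
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t

def mu(t):
    return math.exp(-0.5 * beta_integral(t))

def sigma(t):
    return math.sqrt(-math.expm1(-beta_integral(t)))

def antiderivative(t):
    # Antiderivative of g^2(t) / sigma^2(t) for this schedule.
    return math.log(math.expm1(beta_integral(t)))

def inverse_cdf(u, trunc):
    # Inverse CDF of p_{iw,trunc}(t), proportional to g^2 / sigma^2 on [trunc, T].
    level = antiderivative(trunc) + u * (antiderivative(T) - antiderivative(trunc))
    integral = math.log1p(math.exp(level))
    delta = BETA_MAX - BETA_MIN
    return (-BETA_MIN + math.sqrt(BETA_MIN ** 2 + 2.0 * delta * integral)) / delta

def sample_truncation(rng):
    # Hypothetical truncation prior P(tau): log-uniform on [EPS, T].
    return EPS * (T / EPS) ** rng.random()

def soft_truncation_loss(score_model, batch, rng):
    tau = sample_truncation(rng)                       # one tau per mini-batch
    z_tau = antiderivative(T) - antiderivative(tau)    # normalizer Z_tau
    total = 0.0
    for x0 in batch:
        t = inverse_cdf(rng.random(), tau)             # t ~ p_{iw,tau} on [tau, T]
        noise = rng.gauss(0.0, 1.0)
        x_t = mu(t) * x0 + sigma(t) * noise
        target = -noise / sigma(t)                     # conditional score of p_0t
        total += sigma(t) ** 2 * (score_model(x_t, t) - target) ** 2
    return tau, z_tau * total / (2.0 * len(batch))

def toy_score_model(x, t):
    return -x                                          # stand-in for s_theta

rng = random.Random(0)
batch = [rng.gauss(0.0, 1.0) for _ in range(16)]
for step in range(3):
    tau, loss = soft_truncation_loss(toy_score_model, batch, rng)
    print(f"step {step}: tau = {tau:.2e}, loss = {loss:.4f}")
```

The only change relative to the fixed-truncation loss of Section 4.1 is the first line of `soft_truncation_loss`: replacing the sampled `tau` with the constant `EPS` recovers the original training objective.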

Soft Truncation resolves the oversampling issue of diffusion time near $t\approx\epsilon$, meaning that Monte-Carlo time is no longer concentrated on $\epsilon$. Figure 7 illustrates the quantiles of importance-weighted Monte-Carlo time with Soft Truncation under $\tau=\epsilon$ and $\tau=0.1$. The score network is trained more evenly across diffusion time when $\tau=0.1$, and as a consequence, the loss imbalance in each training step is also alleviated, as shown by the purple dots in Figure 6-(b). This limited range of $[\tau,T]$ provides a chance to learn a score network that is more balanced across diffusion time. As $\tau$ is softened, the truncation level varies across mini-batch updates: see how the loss scales change across the blue, green, red, and purple dots for various values of $\tau$ in Figure 6-(b). Eventually, the softened $\tau$ provides a fair chance to learn the score network on small as well as large diffusion time.

4.3 Soft Truncation Equals to A Diffusion Model With A General Weight

In the original diffusion model, the loss estimate, $\mathcal{\hat{L}}(\bm{\theta};g^{2},\epsilon)$, is just a batch-wise approximation of a population loss, $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$. However, the target population loss of Soft Truncation, $\mathcal{L}(\bm{\theta};g^{2},\tau)$, depends on a random variable $\tau$, so the target population loss itself becomes a random variable. Therefore, we derive the expected Soft Truncation loss to reveal the connection to the original diffusion model:

$$\begin{aligned}
\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P})&:=\mathbb{E}_{\mathbb{P}(\tau)}\big[\mathcal{L}(\bm{\theta};g^{2},\tau)\big]\\
&=\frac{1}{2}\int_{\epsilon}^{T}\mathbb{P}(\tau)\int_{\tau}^{T}g^{2}(t)\mathbb{E}\big[\|\mathbf{s}_{\bm{\theta}}-\nabla\log{p_{0t}}\|_{2}^{2}\big]\,\mathrm{d}t\,\mathrm{d}\tau\\
&=\frac{1}{2}\int_{\epsilon}^{T}g^{2}_{\mathbb{P}}(t)\mathbb{E}\big[\|\mathbf{s}_{\bm{\theta}}-\nabla\log{p_{0t}}\|_{2}^{2}\big]\,\mathrm{d}t,
\end{aligned}$$

up to a constant, where $g^{2}_{\mathbb{P}}(t)=\big(\int_{0}^{t}\mathbb{P}(\tau)\,\mathrm{d}\tau\big)g^{2}(t)$, by exchanging the order of integration. Therefore, we conclude that Soft Truncation reduces to a diffusion model with a general weight of $g_{\mathbb{P}}^{2}(t)$ (see Appendix A.3):

$$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P})=\mathcal{L}(\bm{\theta};g^{2}_{\mathbb{P}},\epsilon). \tag{10}$$
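The exchange of integration order behind Eq. (10) can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the truncation density $\mathbb{P}_{1}(\tau)\propto 1/\tau$ of Section 4.5, an illustrative range $[\epsilon,T]=[10^{-3},1]$, and a hypothetical stand-in `h` for the integrand $g^{2}(t)\mathbb{E}[\|\mathbf{s}_{\bm{\theta}}-\nabla\log{p_{0t}}\|_{2}^{2}]$, since the true integrand depends on a trained network. The common factor $\frac{1}{2}$ is omitted on both sides.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule on a (possibly non-uniform) grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

eps, T = 1e-3, 1.0                      # illustrative truncation range
t = np.geomspace(eps, T, 20001)         # log-spaced grid resolves small diffusion times
h = 1.0 + np.exp(-t)                    # hypothetical stand-in for g^2(t) * E[||s - score||^2]

# Truncation density P_1(tau) = (1 / tau) / log(T / eps) and its CDF on [eps, T].
density = (1.0 / t) / np.log(T / eps)
cdf = np.log(t / eps) / np.log(T / eps)

# Left-hand side: expectation over tau of the truncated integral of h over [tau, T].
segments = 0.5 * (h[1:] + h[:-1]) * np.diff(t)
tail = np.concatenate([np.cumsum(segments[::-1])[::-1], [0.0]])  # tail[i] = integral over [t_i, T]
lhs = trapezoid(density * tail, t)

# Right-hand side: a single integral whose weight is the CDF of tau times h.
rhs = trapezoid(cdf * h, t)
print(lhs, rhs)
```

The two printed values should agree up to discretization error, which reflects that weighting each time $t$ by the probability that $\tau\leq t$ reproduces the expected truncated loss.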

4.4 Soft Truncation is Maximum Perturbed Likelihood Estimation

As explained in Section 4.3, Soft Truncation is a diffusion model with a general weight, in the expected sense. Conversely, this section analyzes a diffusion model with a general weight in view of Soft Truncation. Suppose we have a general weight $\lambda$. Theorem 1 implies that this general weighted diffusion loss, $\mathcal{L}(\bm{\theta};\lambda,\epsilon)$, is the variational bound of the perturbed KL divergence in expectation over $\mathbb{P}_{\lambda}(\tau)$. Theorem 1 collapses to Lemma 1 if $\lambda(t)=cg^{2}(t)$ for any $c>0$ (footnote 1: if $\lambda(t)=cg^{2}(t)$, the probability satisfies $\mathbb{P}([a,b])=1_{[a,b]}(\epsilon)$, which is a probability distribution with a single point mass at $\epsilon$). See Appendix B for the detailed statement and proof.

Theorem 1.

Suppose $\frac{\lambda(t)}{g^{2}(t)}$ is a nondecreasing and nonnegative absolutely continuous function on $[\epsilon,T]$ and zero on $[0,\epsilon)$. For the probability defined by

$$\mathbb{P}_{\lambda}([a,b])=\bigg[\int_{\max(a,\epsilon)}^{b}\Big(\frac{\lambda(s)}{g^{2}(s)}\Big)^{\prime}\,\mathrm{d}s+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}1_{[a,b]}(\epsilon)\bigg]\bigg/Z,$$

where $Z=\frac{\lambda(T)}{g^{2}(T)}$, the variational bound of the general weighted diffusion loss becomes, up to a constant,

$$\begin{aligned}
&\mathbb{E}_{\mathbb{P}_{\lambda}(\tau)}\big[D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})\big]\\
&\leq\frac{1}{2Z}\int_{\epsilon}^{T}\lambda(t)\mathbb{E}_{\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\,\mathrm{d}t\\
&=\frac{1}{Z}\mathcal{L}(\bm{\theta};\lambda,\epsilon)=\mathbb{E}_{\mathbb{P}_{\lambda}(\tau)}\big[\mathcal{L}(\bm{\theta};g^{2},\tau)\big].
\end{aligned}$$
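As a consistency check between Section 4.3 and Theorem 1, one can substitute the general weight induced by Soft Truncation into the definition of $\mathbb{P}_{\lambda}$. The worked step below assumes that $\mathbb{P}$ has a density supported on $[\epsilon,T]$ and writes $F$ for its cumulative distribution function; the symbol $F$ is introduced here only for illustration.

```latex
% Assumption: \mathbb{P} has a density on [\epsilon, T]; F(t) := \int_{\epsilon}^{t} \mathbb{P}(\tau) d\tau.
\begin{aligned}
\lambda(t) &= g_{\mathbb{P}}^{2}(t) = F(t)\, g^{2}(t)
  \quad\Longrightarrow\quad \frac{\lambda(t)}{g^{2}(t)} = F(t), \\
\Big(\frac{\lambda(s)}{g^{2}(s)}\Big)^{\prime} &= \mathbb{P}(s), \qquad
\frac{\lambda(\epsilon)}{g^{2}(\epsilon)} = F(\epsilon) = 0, \qquad
Z = \frac{\lambda(T)}{g^{2}(T)} = F(T) = 1, \\
\mathbb{P}_{\lambda}([a,b]) &= \int_{\max(a,\epsilon)}^{b} \mathbb{P}(s)\,\mathrm{d}s .
\end{aligned}
```

Under this assumption, $\mathbb{P}_{\lambda}$ recovers the truncation distribution $\mathbb{P}$ itself, so the two directions of the correspondence agree.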

The meaning of Soft Truncation becomes clearer in view of Theorem 1. Instead of training the general weighted diffusion loss, $\mathcal{L}(\bm{\theta};\lambda,\epsilon)$, we optimize the truncated variational bound, $\mathcal{L}(\bm{\theta};g^{2},\tau)$. This truncated loss upper bounds the perturbed KL divergence, $D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})$, by Lemma 1, and Figure 1-(c) indicates that Inequality (6) is nearly tight. Therefore, Soft Truncation could be interpreted as Maximum Perturbed Likelihood Estimation (MPLE), where the perturbation level is a random variable. Soft Truncation is not MLE training because Inequality (8) is not tight, as demonstrated in Figure 1-(b), unless $\tau$ is sufficiently small.

Conventional wisdom is to minimize the loss variance, where possible, for stable training. However, some optimization methods in the deep learning era (e.g., stochastic gradient descent) deliberately add noise to a loss function, which eventually helps escape from a local optimum. Soft Truncation belongs to this class of optimization methods: it inflates the loss variance by intentionally imposing auxiliary randomness on the loss estimation. This randomness is represented by the outermost expectation $\mathbb{E}_{\mathbb{P}_{\lambda}(\tau)}$, which controls the diffusion time range for each batch. Additionally, the loss with a sampled $\tau$ is a proxy of the KL divergence perturbed by $\tau$, so the auxiliary randomness on the loss estimation is theoretically tamed, meaning that it is not an arbitrary random perturbation.

4.5 Choice of Truncation Probability Distribution

We parametrize the probability distribution of $\tau$ by

$$\mathbb{P}_{k}(\tau)=\frac{1/\tau^{k}}{Z_{k}}1_{[\epsilon,T]}(\tau)\propto\frac{1}{\tau^{k}}, \tag{11}$$

where $Z_{k}=\int_{\epsilon}^{T}\frac{1}{\tau^{k}}\,\mathrm{d}\tau$ with a sufficiently small truncation hyperparameter $\epsilon$. Note that it is still beneficial to keep $\epsilon$ strictly positive because a batch update with $\tau\approx 0<\epsilon$ would drift the score network away from the optimal point. Figure 6-(c) illustrates the importance distribution of $\lambda_{\mathbb{P}_{k}}$ for varying $k$. From the definition in Eq. (11), $\mathbb{P}_{k}(\tau)\rightarrow\delta_{\epsilon}(\tau)$ as $k\rightarrow\infty$, and this limiting delta distribution corresponds to the original diffusion model with the likelihood weighting. Figure 6-(c) shows that the importance distribution of $\mathbb{P}_{k}$ with finite $k$ interpolates between the likelihood weighting and the variance weighting.

With the current simple form, we experimentally find that the sweet spot is $k\approx 1.0$ in VPSDE and $k=2.0$ in VESDE, with the emphasis on sample quality. For VPSDE, the importance distribution in Figure 6-(c) is nearly equal to that of the variance weighting if $k\approx 1.0$, so Soft Truncation with $k\approx 1.0$ improves the sample fidelity while maintaining low NLL. On the other hand, if $k$ is too small, no $\tau$ will be sampled near $\epsilon$, which hurts both sample generation and density estimation. We leave further study on searching for the optimal distribution of $\tau$ as future work.
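Because the density in Eq. (11) is a truncated power law, $\tau\sim\mathbb{P}_{k}$ can be drawn by inverting its CDF in closed form. The following is a minimal sketch under that observation; the function name `sample_tau` and the defaults $\epsilon=10^{-5}$, $T=1$ are illustrative assumptions rather than the authors' implementation, and the limit $k\rightarrow\infty$ (a point mass at $\epsilon$) is not handled.

```python
import numpy as np

def sample_tau(u, k, eps=1e-5, T=1.0):
    """Map uniforms u in [0, 1] to tau in [eps, T] with density proportional to 1 / tau^k."""
    u = np.asarray(u, dtype=float)
    if np.isclose(k, 1.0):
        # CDF is log(tau / eps) / log(T / eps), so tau is log-uniform on [eps, T].
        return eps * (T / eps) ** u
    a = 1.0 - k
    # CDF is (tau^a - eps^a) / (T^a - eps^a); invert it in closed form.
    return (eps ** a + u * (T ** a - eps ** a)) ** (1.0 / a)

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
for k in (0.0, 1.0, 2.0):
    # Larger k concentrates tau closer to eps; k = 0 is uniform on [eps, T].
    print(k, np.median(sample_tau(u, k)))
```

In a training loop, one such $\tau$ would be drawn per mini-batch and then used as the lower limit of the diffusion-time range for that update.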

Figure 8: Soft Truncation improves FID on CelebA trained with UNCSN++ (RVE).
Table 3: Ablation study of Soft Truncation for various weightings on CIFAR-10 and ImageNet32 with DDPM++ (VP).

| Dataset | Loss | Soft Truncation | NLL | NELBO | FID (ODE) |
|---|---|---|---|---|---|
| CIFAR-10 | $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | ✗ | 3.03 | 3.13 | 6.70 |
| | $\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$ | ✗ | 3.21 | 3.34 | 3.90 |
| | $\mathcal{L}(\bm{\theta};g_{\mathbb{P}_{1}}^{2},\epsilon)$ | ✗ | 3.06 | 3.18 | 6.11 |
| | $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | ✓ | 3.01 | 3.08 | 3.96 |
| | $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.9})$ | ✓ | 3.03 | 3.13 | 3.45 |
| ImageNet32 | $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | ✗ | 3.92 | 3.94 | 12.68 |
| | $\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$ | ✗ | 3.95 | 4.00 | 9.22 |
| | $\mathcal{L}(\bm{\theta};g_{\mathbb{P}_{1}}^{2},\epsilon)$ | ✗ | 3.93 | 3.97 | 11.89 |
| | $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.9})$ | ✓ | 3.90 | 3.91 | 8.42 |
Table 4: Ablation study of Soft Truncation for various model architectures and diffusion SDEs on CelebA.

| SDE | Model | Loss | NLL | NELBO | FID (PC) | FID (ODE) |
|---|---|---|---|---|---|---|
| VE | NCSN++ | $\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$ | 3.41 | 3.42 | 3.95 | - |
| | | $\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P}_{2})$ | 3.44 | 3.44 | 2.68 | - |
| RVE | UNCSN++ | $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | 2.01 | 2.01 | 3.36 | - |
| | | $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{2})$ | 1.97 | 2.02 | 1.92 | - |
| VP | DDPM++ | $\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$ | 2.14 | 2.21 | 3.03 | 2.32 |
| | | $\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P}_{1})$ | 2.17 | 2.29 | 2.88 | 1.90 |
| | UDDPM++ | $\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$ | 2.11 | 2.20 | 3.23 | 4.72 |
| | | $\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P}_{1})$ | 2.16 | 2.28 | 2.22 | 1.94 |
| | DDPM++ | $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | 2.00 | 2.09 | 5.31 | 3.95 |
| | | $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | 2.00 | 2.11 | 4.50 | 2.90 |
| | UDDPM++ | $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | 1.98 | 2.12 | 4.65 | 3.98 |
| | | $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | 2.00 | 2.10 | 4.45 | 2.97 |
Table 5: Ablation study of Soft Truncation for various $\epsilon$ on CIFAR-10 with DDPM++ (VP).

| Loss | $\epsilon$ | NLL | NELBO | FID (ODE) |
|---|---|---|---|---|
| $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | $10^{-2}$ | 4.64 | 4.69 | 38.82 |
| | $10^{-3}$ | 3.51 | 3.52 | 6.21 |
| | $10^{-4}$ | 3.05 | 3.08 | 6.33 |
| | $10^{-5}$ | 3.03 | 3.13 | 6.70 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | $10^{-2}$ | 4.65 | 4.69 | 39.83 |
| | $10^{-3}$ | 3.51 | 3.52 | 5.14 |
| | $10^{-4}$ | 3.05 | 3.08 | 4.16 |
| | $10^{-5}$ | 3.01 | 3.08 | 3.96 |
Table 6: Ablation study of Soft Truncation for various $\mathbb{P}_{k}$ on CIFAR-10 trained with DDPM++ (VP).

| Loss | NLL | NELBO | FID (ODE) |
|---|---|---|---|
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0})$ | 3.24 | 3.39 | 6.27 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.8})$ | 3.03 | 3.05 | 3.61 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.9})$ | 3.03 | 3.13 | 3.45 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | 3.01 | 3.08 | 3.96 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1.1})$ | 3.02 | 3.09 | 3.98 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1.2})$ | 3.03 | 3.09 | 3.98 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{2})$ | 3.01 | 3.10 | 6.31 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{3})$ | 3.02 | 3.09 | 6.54 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{\infty})=\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | 3.01 | 3.09 | 6.70 |
Table 7: Ablation study of Soft Truncation for CIFAR-10 trained with DDPM++ when a diffusion is combined with a normalizing flow in INDM (Kim et al., 2022).

| Loss | NLL | NELBO | FID (ODE) |
|---|---|---|---|
| INDM (VP, NLL) | 2.98 | 2.98 | 6.01 |
| INDM (VP, FID) | 3.17 | 3.23 | 3.61 |
| INDM (VP, NLL) + ST | 3.01 | 3.02 | 3.88 |
Table 8: Performance comparisons on benchmark datasets. The boldfaced numbers present the best performance, and the underlined numbers present the second-best performance. We report NLL of DDPM++ on CIFAR-10, ImageNet32, and CelebA with the variational dequantization (Song et al., 2021a) to compare with the baselines in a fair setting.

| Model | CIFAR10 $32\times 32$ NLL (↓) | CIFAR10 $32\times 32$ FID (↓) | CIFAR10 $32\times 32$ IS (↑) | ImageNet32 $32\times 32$ NLL | ImageNet32 $32\times 32$ FID | ImageNet32 $32\times 32$ IS | CelebA $64\times 64$ NLL | CelebA $64\times 64$ FID | CelebA-HQ $256\times 256$ FID | STL-10 $48\times 48$ FID | STL-10 $48\times 48$ IS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Likelihood-free Models | | | | | | | | | | | |
| StyleGAN2-ADA+Tuning (Karras et al., 2020) | - | 2.92 | 10.02 | - | - | - | - | - | - | - | - |
| Styleformer (Park & Kim, 2022) | - | 2.82 | 9.94 | - | - | - | - | 3.66 | - | 15.17 | 11.01 |
| Likelihood-based Models | | | | | | | | | | | |
| ARDM-Upscale 4 (Hoogeboom et al., 2021) | 2.64 | - | - | - | - | - | - | - | - | - | - |
| VDM (Kingma et al., 2021) | 2.65 | 7.41 | - | 3.72 | - | - | - | - | - | - | - |
| LSGM (FID) (Vahdat et al., 2021) | 3.43 | 2.10 | - | - | - | - | - | - | - | - | - |
| NCSN++ cont. (deep, VE) (Song et al., 2021b) | 3.45 | 2.20 | 9.89 | - | - | - | 2.39 | 3.95 | 7.23 | - | - |
| DDPM++ cont. (deep, sub-VP) (Song et al., 2021b) | 2.99 | 2.41 | 9.57 | - | - | - | - | - | - | - | - |
| DenseFlow-74-10 (Grcić et al., 2021) | 2.98 | 34.90 | - | 3.63 | - | - | 1.99 | - | - | - | - |
| ScoreFlow (VP, FID) (Song et al., 2021a) | 3.04 | 3.98 | - | 3.84 | 8.34 | - | - | - | - | - | - |
| Efficient-VDVAE (Hazami et al., 2022) | 2.87 | - | - | - | - | - | 1.83 | - | - | - | - |
| PNDM (Liu et al., 2022) | - | 3.26 | - | - | - | - | - | 2.71 | - | - | - |
| ScoreFlow (deep, sub-VP, NLL) (Song et al., 2021a) | 2.81 | 5.40 | - | 3.76 | 10.18 | - | - | - | - | - | - |
| Improved DDPM ($L_{simple}$) (Nichol & Dhariwal, 2021) | 3.37 | 2.90 | - | - | - | - | - | - | - | - | - |
| UNCSN++ (RVE) + ST | 3.04 | 2.33 | 10.11 | - | - | - | 1.97 | 1.92 | 7.16 | 7.71 | 13.43 |
| DDPM++ (VP, FID) + ST | 2.91 | 2.47 | 9.78 | - | - | - | 2.10 | 1.90 | - | - | - |
| DDPM++ (VP, NLL) + ST | 2.88 | 3.45 | 9.19 | 3.85 | 8.42 | 11.82 | 1.96 | 2.90 | - | - | - |

5 Experiments

This section empirically studies our suggestions on benchmark datasets, including CIFAR-10 (Krizhevsky et al., 2009), ImageNet $32\times 32$ (Van Oord et al., 2016), STL-10 (Coates et al., 2011) (footnote 2: we downsize the dataset from $96\times 96$ to $48\times 48$ following Jiang et al. (2021); Park & Kim (2022)), CelebA (Liu et al., 2015) $64\times 64$, and CelebA-HQ (Karras et al., 2018) $256\times 256$.

Soft Truncation is a universal training technique independent of model architectures and diffusion strategies. In the experiments, we test Soft Truncation on various architectures, including vanilla NCSN++, DDPM++, Unbounded NCSN++ (UNCSN++), and Unbounded DDPM++ (UDDPM++). Also, Soft Truncation is applied to various diffusion SDEs, such as VESDE, VPSDE, and Reverse VESDE (RVESDE). Although we use continuous SDEs for the diffusion strategies, applying Soft Truncation to a discrete model, such as DDPM (Ho et al., 2020), is a straightforward adaptation of the continuous case. Appendix D enumerates the specifications of score architectures and SDEs.

From Figure 1-(c), a sweet spot of the hard threshold is $\epsilon=10^{-5}$, below which NLL/NELBO no longer improve. As the diffusion model has no information on $[0,\epsilon)$, we follow Kim et al. (2022) and use Inequality (7) for NLL computation and Inequality (8) for NELBO computation. Following Kim et al. (2022), we compute $\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{\epsilon})}$, rather than $\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{0})}$. It is the common practice of continuous diffusion models (Song et al., 2021b, a; Dockhorn et al., 2022) to report their performances with $\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{0})}$, but Kim et al. (2022) show that $\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{\epsilon})}$ differs from $\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{0})}$ by 0.05 in BPD scale when $\epsilon=10^{-5}$, which is quite significant. We use the uniform dequantization (Theis et al., 2016) by default, unless otherwise noted. For sample generation, we use either a Predictor-Corrector (PC) sampler or an Ordinary Differential Equation (ODE) sampler (Song et al., 2021b). We denote by $\mathcal{L}(\bm{\theta};\lambda,\epsilon)$ the vanilla training with $\lambda$-weighting, and by $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P})$ the training by Soft Truncation with the truncation probability $\mathbb{P}$. We additionally write $\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P})$ for updating the network with the variance-weighted loss in each batch-wise update. We release our code at https://github.com/Kim-Dongjun/Soft-Truncation.

FID by Iteration Figure 8 illustrates the FID score (Heusel et al., 2017) on the $y$-axis against training steps on the $x$-axis. Figure 8 shows that Soft Truncation beats the vanilla training after 150k training iterations.

Ablation Studies Tables 3, 4, 5, and 6 show ablation studies on various weighting functions, model architectures and SDEs, values of $\epsilon$, and probability distributions of $\tau$, respectively. See Appendix E.2. Table 3 shows that Soft Truncation beats or equals the vanilla training in all performances. We highlight that Soft Truncation with $\mathbb{P}_{0.9}$ outperforms the FID-favorable model with the variance weighting with respect to FID on both CIFAR-10 and ImageNet32.

Beyond comparing with the pre-existing weighting functions, such as $\lambda=g^{2}$ or $\lambda=\sigma^{2}$, Table 3 additionally reports the experimental result of a general weighting function of $\lambda=g_{\mathbb{P}_{1}}^{2}$. From Eq. (10), Soft Truncation with $\mathbb{P}_{1}$ and the vanilla training with $\lambda=g_{\mathbb{P}_{1}}^{2}$ coincide in their loss functions on average, i.e., $\mathcal{L}(\bm{\theta};g_{\mathbb{P}_{1}}^{2},\epsilon)=\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$. Thus, when comparing the paired experiments, Soft Truncation could be considered as an alternative way of estimating the same loss, and Table 3 implies that Soft Truncation gives better optimization than the vanilla method. This strongly implies that Soft Truncation could be a default training method for a general weighted denoising diffusion loss.

Table 4 provides two implications. First, Soft Truncation particularly boosts FID while maintaining density estimation performances under the variation of score networks and diffusion strategies. Second, Table 4 shows that Soft Truncation is effective on CelebA even when we apply Soft Truncation on the variance weighting, i.e., $\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P})$, but we find that this does not hold on CIFAR-10 and ImageNet32. We leave further investigation of this point as future work.

Table 5 shows contrasting trends for the vanilla training and Soft Truncation. The inverse correlation appears between NLL and FID in the vanilla training, but Soft Truncation monotonically reduces both NLL and FID as $\epsilon$ decreases. This implies that Soft Truncation significantly reduces the effort of the $\epsilon$ search. Table 6 studies the effect of the probability distribution of $\tau$ in VPSDE. It shows that Soft Truncation significantly improves FID over the experiment of $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ on the range of $0.8\leq k\leq 1.2$. Finally, Table 7 shows that Soft Truncation also works with a nonlinear forward SDE (Kim et al., 2022), so the scope of Soft Truncation is not limited to a family of linear SDEs.

Quantitative Comparison to SOTA Table 8 compares Soft Truncation (ST) against the current best generative models. It shows that Soft Truncation achieves state-of-the-art sample generation performance on CIFAR-10, CelebA, CelebA-HQ, and STL-10, while keeping NLL intact. In particular, we have experimented thoroughly on the CelebA dataset, and we find that Soft Truncation improves on the previous best FID scores by a large margin. Soft Truncation with DDPM++ achieves an FID of 1.90, which improves on the previous best FID of 2.92 by DDGM. Also, Soft Truncation significantly improves FID on STL-10.

6 Conclusion

This paper proposes a generally applicable training method for diffusion models. The suggested training method, Soft Truncation, is motivated by the observation that density estimation mostly depends on small diffusion time, while sample generation mostly depends on large diffusion time. However, small diffusion time dominates the Monte-Carlo estimation of the loss function, so this imbalanced contribution prevents accurate score learning on large diffusion time. Soft Truncation softens the truncation level at each mini-batch update, and this simple modification is connected to the general weighted diffusion loss and the concept of Maximum Perturbed Likelihood Estimation.

Acknowledgements

This research was supported by AI Technology Development for Commonsense Extraction, Reasoning, and Inference from Heterogeneous Data(IITP) funded by the Ministry of Science and ICT(2022-0-00077). We thank Jaeyoung Byeon and Daehan Park for their fruitful mathematical advice, and Byeonghu Na for his support of the experiments.

References

  • Anderson (1982) Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
  • Chen et al. (2018) Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.
  • Chen et al. (2022) Chen, T., Liu, G.-H., and Theodorou, E. Likelihood training of schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations, 2022.
  • Coates et al. (2011) Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp.  215–223. JMLR Workshop and Conference Proceedings, 2011.
  • Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
  • Dockhorn et al. (2022) Dockhorn, T., Vahdat, A., and Kreis, K. Score-based generative modeling with critically-damped langevin diffusion. International Conference on Learning Representations, 2022.
  • Evans (1998) Evans, L. C. Partial differential equations. Graduate studies in mathematics, 19(2), 1998.
  • Gerchinovitz et al. (2020) Gerchinovitz, S., Ménard, P., and Stoltz, G. Fano’s inequality for random variables. Statistical Science, 35(2):178–201, 2020.
  • Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT press, 2016.
  • Grcić et al. (2021) Grcić, M., Grubišić, I., and Šegvić, S. Densely connected normalizing flows. Advances in Neural Information Processing Systems, 34, 2021.
  • Hazami et al. (2022) Hazami, L., Mama, R., and Thurairatnam, R. Efficient-vdvae: Less is more. arXiv preprint arXiv:2203.13751, 2022.
  • Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Ho et al. (2019) Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pp. 2722–2730. PMLR, 2019.
  • Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Hoogeboom et al. (2021) Hoogeboom, E., Gritsenko, A. A., Bastings, J., Poole, B., Berg, R. v. d., and Salimans, T. Autoregressive diffusion models. arXiv preprint arXiv:2110.02037, 2021.
  • Huang et al. (2021) Huang, C.-W., Lim, J. H., and Courville, A. C. A variational perspective on diffusion-based generative models and score matching. Advances in Neural Information Processing Systems, 34, 2021.
  • Jiang et al. (2021) Jiang, Y., Chang, S., and Wang, Z. Transgan: Two pure transformers can make one strong gan, and that can scale up. Advances in Neural Information Processing Systems, 34, 2021.
  • Karras et al. (2018) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
  • Karras et al. (2019) Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4401–4410, 2019.
  • Karras et al. (2020) Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33:12104–12114, 2020.
  • Kim et al. (2022) Kim, D., Na, B., Kwon, S. J., Lee, D., Kang, W., and Moon, I.-C. Maximum likelihood training of implicit nonlinear diffusion models. arXiv preprint arXiv:2205.13699, 2022.
  • Kingma et al. (2021) Kingma, D. P., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. In Advances in Neural Information Processing Systems, 2021.
  • Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
  • Liu et al. (2022) Liu, L., Ren, Y., Lin, Z., and Zhao, Z. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
  • Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp.  3730–3738, 2015.
  • Meng et al. (2021) Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.
  • Nichol & Dhariwal (2021) Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. PMLR, 2021.
  • Oksendal (2013) Oksendal, B. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013.
  • Park & Kim (2022) Park, J. and Kim, Y. Styleformer: Transformer based generative adversarial networks with style vector. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2022.
  • Parmar et al. (2022) Parmar, G., Zhang, R., and Zhu, J.-Y. On buggy resizing libraries and surprising subtleties in fid calculation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2022.
  • Parmar et al. (2018) Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D. Image transformer. In International Conference on Machine Learning, pp. 4055–4064. PMLR, 2018.
  • Pavon & Wakolbinger (1991) Pavon, M. and Wakolbinger, A. On free energy, stochastic control, and schrödinger processes. In Modeling, Estimation and Control of Systems with Uncertainty, pp.  334–348. Springer, 1991.
  • Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
  • Song & Ermon (2020) Song, Y. and Ermon, S. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
  • Song et al. (2021a) Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34, 2021a.
  • Song et al. (2021b) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b.
  • Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2818–2826, 2016.
  • Theis et al. (2016) Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR 2016), pp.  1–10, 2016.
  • Vahdat & Kautz (2020) Vahdat, A. and Kautz, J. Nvae: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems, 33:19667–19679, 2020.
  • Vahdat et al. (2021) Vahdat, A., Kreis, K., and Kautz, J. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34, 2021.
  • Van Oord et al. (2016) Van Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In International Conference on Machine Learning, pp. 1747–1756. PMLR, 2016.
  • Vargas et al. (2021) Vargas, F., Thodoroff, P., Lamacraft, A., and Lawrence, N. Solving schrödinger bridges via maximum likelihood. Entropy, 23(9):1134, 2021.
  • Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp.  681–688. Citeseer, 2011.

Appendix A Derivation

A.1 Transition Probability for Linear SDEs

Kim et al. (2022) classify linear SDEs as

$$\mathrm{d}\mathbf{x}_{t}=-\frac{1}{2}\beta(t)\mathbf{x}_{t}\,\mathrm{d}t+g(t)\,\mathrm{d}\mathbf{w}_{t}, \tag{12}$$

where $\beta:\mathbb{R}\rightarrow\mathbb{R}_{\geq 0}$ and $g:\mathbb{R}\rightarrow\mathbb{R}_{\geq 0}$ are real-valued functions. VESDE has $\beta(t)\equiv 0$ and $g(t)=\sqrt{\mathrm{d}\sigma^{2}(t)/\mathrm{d}t}=\sigma_{min}(\frac{\sigma_{max}}{\sigma_{min}})^{t}\sqrt{2\log{\frac{\sigma_{max}}{\sigma_{min}}}}$, where $\sigma_{min}$ and $\sigma_{max}$ are the minimum/maximum perturbation variances, respectively. It has the transition probability of

$$p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\mu_{VE}(t)\mathbf{x}_{0},\sigma_{VE}^{2}(t)\mathbf{I}),$$

where $\mu_{VE}(t)\equiv 1$ and $\sigma_{VE}^{2}(t):=\sigma_{min}^{2}[(\frac{\sigma_{max}}{\sigma_{min}})^{2t}-1]$. VPSDE has $\beta(t)=\beta_{min}+(\beta_{max}-\beta_{min})t$ and $g(t)=\sqrt{\beta(t)}$ with the transition probability of

$$p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\mu_{VP}(t)\mathbf{x}_{0},\sigma_{VP}^{2}(t)\mathbf{I}),$$

where $\mu_{VP}(t)=e^{-\frac{1}{2}\int_{0}^{t}\beta(s)\,\mathrm{d}s}$ and $\sigma_{VP}^{2}(t)=1-e^{-\int_{0}^{t}\beta(s)\,\mathrm{d}s}$.

Analogous to VE/VP SDEs, the transition probability of the generic linear SDE of Eq. (12) is a Gaussian distribution $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t}|\mu(t)\mathbf{x}_{0},\sigma^{2}(t)\mathbf{I})$, whose mean and covariance functions are characterized by the system of ODEs

$$\frac{\mathrm{d}\mu(t)}{\mathrm{d}t}=-\frac{1}{2}\beta(t)\mu(t), \tag{13}$$

$$\frac{\mathrm{d}\sigma^{2}(t)}{\mathrm{d}t}=-\beta(t)\sigma^{2}(t)+g^{2}(t), \tag{14}$$

with initial conditions $\mu(0)=1$ and $\sigma^{2}(0)=0$.

The solution of Eq. (13) is

$$\mu(t)=e^{-\frac{1}{2}\int_{0}^{t}\beta(s)\,\mathrm{d}s}.$$

If we multiply Eq. (14) by $e^{\int_{0}^{t}\beta(s)\,\mathrm{d}s}$, then Eq. (14) becomes

$$\begin{aligned}
&e^{\int_{0}^{t}\beta(s)\,\mathrm{d}s}\frac{\mathrm{d}\sigma^{2}(t)}{\mathrm{d}t}+e^{\int_{0}^{t}\beta(s)\,\mathrm{d}s}\beta(t)\sigma^{2}(t)=e^{\int_{0}^{t}\beta(s)\,\mathrm{d}s}g^{2}(t) \qquad (15)\\
&\iff\frac{\mathrm{d}\Big[e^{\int_{0}^{t}\beta(s)\,\mathrm{d}s}\sigma^{2}(t)\Big]}{\mathrm{d}t}=e^{\int_{0}^{t}\beta(s)\,\mathrm{d}s}g^{2}(t)\\
&\iff e^{\int_{0}^{t}\beta(s)\,\mathrm{d}s}\sigma^{2}(t)=\int_{0}^{t}e^{\int_{0}^{\tau}\beta(s)\,\mathrm{d}s}g^{2}(\tau)\,\mathrm{d}\tau+C\\
&\iff\sigma^{2}(t)=e^{-\int_{0}^{t}\beta(s)\,\mathrm{d}s}\int_{0}^{t}e^{\int_{0}^{\tau}\beta(s)\,\mathrm{d}s}g^{2}(\tau)\,\mathrm{d}\tau+Ce^{-\int_{0}^{t}\beta(s)\,\mathrm{d}s}.
\end{aligned}$$

Imposing $\sigma^{2}(0)=0$ on Eq. (15) gives $C=0$, and the variance formula becomes

$$\sigma^{2}(t)=e^{-\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\int_{0}^{t}e^{\int_{0}^{\tau}\beta(s)\mathop{}\!\mathrm{d}s}g^{2}(\tau)\mathop{}\!\mathrm{d}\tau.$$

To sum up, the family of linear SDEs $\mathop{}\!\mathrm{d}\mathbf{x}_{t}=-\frac{1}{2}\beta(t)\mathbf{x}_{t}\mathop{}\!\mathrm{d}t+g(t)\mathop{}\!\mathrm{d}\mathbf{w}_{t}$ has the transition probability

$$p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}\bigg(\mathbf{x}_{t}\Big|e^{-\frac{1}{2}\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\mathbf{x}_{0},e^{-\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\Big(\int_{0}^{t}e^{\int_{0}^{\tau}\beta(s)\mathop{}\!\mathrm{d}s}g^{2}(\tau)\mathop{}\!\mathrm{d}\tau\Big)\mathbf{I}\bigg). \tag{16}$$
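As a sanity check on the variance in Eq. (16), the following minimal sketch evaluates the formula by trapezoidal quadrature and compares it with the two closed forms it should reduce to: the VP case ($g^{2}=\beta$, giving $1-e^{-\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}$) and a VE-like case ($\beta=0$, giving $\int_{0}^{t}g^{2}(s)\mathop{}\!\mathrm{d}s$). The linear schedule, its endpoints, and the helper name `variance_eq16` are illustrative assumptions, not values taken from this appendix.

```python
import numpy as np

def variance_eq16(beta, g2, t, n=20001):
    """Evaluate sigma^2(t) of Eq. (16) with trapezoidal quadrature."""
    tau = np.linspace(0.0, t, n)
    b = beta(tau)
    # B(tau) = int_0^tau beta(s) ds as a cumulative trapezoid sum
    big_b = np.concatenate([[0.0], np.cumsum(0.5 * (b[1:] + b[:-1]) * np.diff(tau))])
    integrand = np.exp(big_b) * g2(tau)
    inner = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(tau))
    return float(np.exp(-big_b[-1]) * inner)

# Hypothetical linear schedule beta(s) = beta_min + (beta_max - beta_min) * s
beta_min, beta_max = 0.1, 20.0
beta = lambda s: beta_min + (beta_max - beta_min) * s
t = 0.5

# VP case: g^2 = beta, so Eq. (16) should reduce to 1 - exp(-B(t))
sigma2_vp = variance_eq16(beta, beta, t)
closed_form = 1.0 - np.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t**2))

# VE-like case: beta = 0 and constant g^2 = 4, so sigma^2(t) = 4 t
sigma2_ve = variance_eq16(lambda s: np.zeros_like(s), lambda s: 4.0 + 0.0 * s, t)

print(sigma2_vp, closed_form, sigma2_ve)
```

The quadrature and the closed forms should agree up to discretization error, which shrinks as `n` grows.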

A.2 Diverging Denoising Loss

The gradient of the log transition probability, $\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}=-\frac{\mathbf{x}_{t}-\mu(t)\mathbf{x}_{0}}{\sigma^{2}(t)}=-\frac{\mathbf{z}}{\sigma(t)}$ with $\mathbf{x}_{t}=\mu(t)\mathbf{x}_{0}+\sigma(t)\mathbf{z}$, diverges as $t\rightarrow 0$ because $\sigma(t)\rightarrow 0$. Lemma 2 below indicates that $\|\mathbf{s}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}\rightarrow\infty$ for any continuous score function $\mathbf{s}$. Consequently, the denoising score loss diverges as $t\rightarrow 0$, as illustrated in Figure 1-(a).

Lemma 2.

Let $\mathcal{H}_{[0,T]}=\{\mathbf{s}:\mathbb{R}^{d}\times[0,T]\rightarrow\mathbb{R}^{d},\ \mathbf{s}\text{ is locally Lipschitz}\}$. Suppose a continuous vector field $\mathbf{v}$ defined on a subset $U$ of a compact manifold $M$ (i.e., $\mathbf{v}:U\subset M\rightarrow\mathbb{R}^{d}$) is unbounded. Then there exists no $\mathbf{s}\in\mathcal{H}_{[0,T]}$ such that $\lim_{t\rightarrow 0}\mathbf{s}(\mathbf{x},t)=\mathbf{v}(\mathbf{x})$ a.e. on $U$.

Proof of Lemma 2.

Since $U$ is an open subset of a compact manifold $M$, $\|\mathbf{x}_{1}-\mathbf{x}_{2}\|\leq\text{diam}(M)$ for all $\mathbf{x}_{1},\mathbf{x}_{2}\in U$. Also, $|t_{1}-t_{2}|$ is bounded for $t_{1},t_{2}\in[0,T]$. Hence, the local Lipschitzness of $\mathbf{s}$ implies that there exists $K>0$ such that $\|\mathbf{s}(\mathbf{x}_{1},t_{1})-\mathbf{s}(\mathbf{x}_{2},t_{2})\|\leq K(\|\mathbf{x}_{1}-\mathbf{x}_{2}\|+|t_{1}-t_{2}|)$ for any $\mathbf{x}_{1},\mathbf{x}_{2}\in U$ and $t_{1},t_{2}\in[0,T]$. Therefore, for any $\mathbf{s}\in\mathcal{H}_{[0,T]}$, there exists $C>0$ such that $\|\mathbf{s}(\mathbf{x},t)\|<C$ for all $\mathbf{x}\in U$ and $t\in[0,T]$, so no such $\mathbf{s}$ satisfies $\mathbf{s}(\mathbf{x},t)\rightarrow\mathbf{v}(\mathbf{x})$ a.e. on $U$ as $t\rightarrow 0$. ∎
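To make the divergence discussed in this subsection concrete, the sketch below estimates $\mathbb{E}\|\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}=\mathbb{E}\|\mathbf{z}/\sigma(t)\|_{2}^{2}$ by Monte Carlo and compares it with $d/\sigma^{2}(t)$ for decreasing $t$. The VP schedule, the dimension $d=3072$, and the function names are assumptions chosen for illustration only.

```python
import numpy as np

def sigma2_vp(t, beta_min=0.1, beta_max=20.0):
    """VP variance 1 - exp(-B(t)) under a hypothetical linear beta schedule."""
    big_b = beta_min * t + 0.5 * (beta_max - beta_min) * t**2
    return -np.expm1(-big_b)

def mc_sq_score_norm(t, z):
    """Monte Carlo estimate of E||grad log p_0t||^2, using score = -z / sigma(t)."""
    score = -z / np.sqrt(sigma2_vp(t))
    return float(np.mean(np.sum(score**2, axis=1)))

d = 3072  # illustrative dimension (e.g. a 3 x 32 x 32 image)
z = np.random.default_rng(0).standard_normal((256, d))
times = [1e-1, 1e-3, 1e-5]
estimates = [mc_sq_score_norm(t, z) for t in times]
exact = [d / sigma2_vp(t) for t in times]

for t, est, ex in zip(times, estimates, exact):
    print(f"t={t:g}  MC={est:.4g}  d/sigma^2={ex:.4g}")
```

Under these assumptions the squared norm grows without bound as $t$ shrinks, which is the behaviour a bounded, locally Lipschitz network cannot follow.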

A.3 General Weighted Diffusion Loss

The denoising score loss is

$$\begin{aligned}\mathcal{L}(\bm{\theta};g^{2},\tau)=&\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t\\ &-\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big],\end{aligned}\tag{17}$$

for any $\tau\in[0,T]$. For an appropriate class of functions $A(t)$,

$$\begin{aligned}\int_{0}^{T}\mathbb{P}(\tau)\bigg(\int_{\tau}^{T}A(t)\mathop{}\!\mathrm{d}t\bigg)\mathop{}\!\mathrm{d}\tau=&\int_{0}^{T}\int_{0}^{T}\mathbb{P}(\tau)A(t)1_{[\tau,T]}(t)\mathop{}\!\mathrm{d}t\mathop{}\!\mathrm{d}\tau\\ =&\int_{0}^{T}\int_{0}^{T}\mathbb{P}(\tau)A(t)1_{[\tau,T]}(t)\mathop{}\!\mathrm{d}\tau\mathop{}\!\mathrm{d}t\\ =&\int_{0}^{T}\int_{0}^{t}\mathbb{P}(\tau)A(t)\mathop{}\!\mathrm{d}\tau\mathop{}\!\mathrm{d}t\\ =&\int_{0}^{T}\bigg(\int_{0}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\bigg)A(t)\mathop{}\!\mathrm{d}t\end{aligned}$$

holds by changing the order of integration. Therefore, we get

$$\begin{aligned}
\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}):=&\ \mathbb{E}_{\mathbb{P}(\tau)}\big[\mathcal{L}(\bm{\theta};g^{2},\tau)\big]\\
=&\int_{0}^{T}\mathbb{P}(\tau)\bigg[\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t\\
&\quad-\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big]\bigg]\mathop{}\!\mathrm{d}\tau\\
=&\int_{0}^{T}\Big(\int_{0}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\Big)\bigg[\frac{1}{2}g^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big]\\
&\quad-\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\bigg]\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big]\\
=&\frac{1}{2}\int_{0}^{T}g_{\mathbb{P}}^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t+C,
\end{aligned}$$

where $g_{\mathbb{P}}^{2}(t):=\big(\int_{0}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\big)g^{2}(t)$ and

$$C=-\frac{1}{2}\int_{0}^{T}g_{\mathbb{P}}^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big[\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t-\int_{0}^{T}\Big(\int_{0}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\Big)\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big].$$

If $\mathbf{f}(\mathbf{x}_{t},t)=-\frac{1}{2}\beta(t)\mathbf{x}_{t}$, then we have

$$C=-\frac{d}{2}\int_{0}^{T}\Big(\int_{0}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\Big)\frac{g^{2}(t)}{\sigma^{2}(t)}\mathop{}\!\mathrm{d}t+\frac{d}{2}\int_{0}^{T}\Big(\int_{0}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\Big)\beta(t)\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big].$$
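The change of integration order used in this subsection can be checked numerically. The sketch below assumes, purely for illustration, a truncation density $\mathbb{P}(\tau)\propto 1/\tau$ on $[\epsilon,T]$ and a stand-in integrand $A(t)=1/t$; neither choice is prescribed by this appendix. It compares a Monte Carlo average of $\int_{\tau}^{T}A(t)\mathop{}\!\mathrm{d}t$ over $\tau\sim\mathbb{P}$ with the reweighted integral $\int\big(\int_{\epsilon}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\big)A(t)\mathop{}\!\mathrm{d}t$.

```python
import numpy as np

eps, T = 1e-5, 1.0
L = np.log(T / eps)  # normalizer of the assumed density P(tau) = 1 / (tau * L)

def trapz(y, x):
    """Plain trapezoidal rule (kept local to avoid NumPy version differences)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def A(t):
    """Hypothetical stand-in for the per-time integrand."""
    return 1.0 / t

def cdf(t):
    """int_eps^t P(tau) dtau for the assumed P."""
    return np.log(t / eps) / L

# Right-hand side: CDF-weighted integral over t
grid = np.geomspace(eps, T, 200001)
rhs = trapz(cdf(grid) * A(grid), grid)

# Left-hand side: sample tau ~ P by inverse CDF; for A(t) = 1/t the inner
# integral int_tau^T A(t) dt equals log(T / tau) in closed form
rng = np.random.default_rng(0)
tau = eps * (T / eps) ** rng.random(200000)
lhs = float(np.mean(np.log(T / tau)))

print(lhs, rhs, L / 2)
```

For this particular pair both sides equal $\frac{1}{2}\log(T/\epsilon)$, so the two estimates should match up to Monte Carlo noise.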

Appendix B Theorems and Proofs

Lemma 1.

For any $\tau\in[0,T]$,

$$\begin{aligned}\mathbb{E}_{\mathbf{x}_{\tau}}\big[-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big]\leq\ &\mathcal{L}(\bm{\theta};g^{2},\tau)=\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\\ &-\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t-\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big[\textup{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big].\end{aligned}$$
Proof.

Suppose $\bm{\mu}$ is the path measure of the forward SDE, and $\bm{\nu}_{\bm{\theta}}$ is the path measure of the generative SDE. The restricted measure is defined by $\bm{\mu}|_{[\tau,T]}(\{F_{t}\}_{t=\tau}^{T}):=\bm{\mu}(\{F_{t}\}_{t=0}^{T})$, where $F_{t}=\mathbb{R}^{d}$ if $t\in[0,\tau)$ and $F_{t}$ is a measurable set in $\mathbb{R}^{d}$ otherwise. The restricted measure of $\bm{\nu}_{\bm{\theta}}$ is defined analogously. Then, by the data processing inequality, we get

$$D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})\leq D_{KL}(\bm{\mu}|_{[\tau,T]}\|\bm{\nu}_{\bm{\theta}}|_{[\tau,T]}). \tag{18}$$

Now, from the chain rule of KL divergences, we have

$$D_{KL}(\bm{\mu}|_{[\tau,T]}\|\bm{\nu}_{\bm{\theta}}|_{[\tau,T]})=D_{KL}(p_{T}\|\pi)+\mathbb{E}_{\mathbf{z}\sim p_{T}}\Big[D_{KL}\big(\bm{\mu}|_{[\tau,T]}(\cdot|\mathbf{x}_{T}=\mathbf{z})\|\bm{\nu}_{\bm{\theta}}|_{[\tau,T]}(\cdot|\mathbf{x}_{T}=\mathbf{z})\big)\Big]. \tag{19}$$

From the Girsanov theorem and the martingale property, we get

$$D_{KL}\big(\bm{\mu}|_{[\tau,T]}(\cdot|\mathbf{x}_{T}=\mathbf{z})\|\bm{\nu}_{\bm{\theta}}|_{[\tau,T]}(\cdot|\mathbf{x}_{T}=\mathbf{z})\big)=\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big[g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t, \tag{20}$$

and combining Eqs. (18), (19), and (20), we have

$$D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})\leq D_{KL}(p_{T}\|\pi)+\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big[g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t. \tag{21}$$

Now, from

$$\begin{aligned}
&\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big[g^{2}(t)[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}]\big]\mathop{}\!\mathrm{d}t\\
&=\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big[g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}-2g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\big]\mathop{}\!\mathrm{d}t\\
&=\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big[g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t-\int_{\tau}^{T}\int g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}p_{t}(\mathbf{x}_{t})\mathop{}\!\mathrm{d}\mathbf{x}_{t}\mathop{}\!\mathrm{d}t\\
&=\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big[g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t-\int_{\tau}^{T}\int g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}\int p_{r}(\mathbf{x}_{0})p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})\mathop{}\!\mathrm{d}\mathbf{x}_{0}\mathop{}\!\mathrm{d}\mathbf{x}_{t}\mathop{}\!\mathrm{d}t\\
&=\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big[g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t-\int_{\tau}^{T}\int g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\cdot\int p_{r}(\mathbf{x}_{0})\nabla_{\mathbf{x}_{t}}p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})\mathop{}\!\mathrm{d}\mathbf{x}_{0}\mathop{}\!\mathrm{d}\mathbf{x}_{t}\mathop{}\!\mathrm{d}t\\
&=\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{r}(\mathbf{x}_{0})p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\big[g^{2}(t)[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}]\big]\mathop{}\!\mathrm{d}t,
\end{aligned}$$

we can transform $\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}$ into $\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}$, and Eq. (21) is therefore equivalent to

$$\begin{aligned}
\mathbb{E}_{p_{\tau}(\mathbf{x}_{\tau})}\big[-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big]\leq\ &D_{KL}(p_{T}\|\pi)+\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big[g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t+\mathcal{H}(p_{\tau})\\
=\ &D_{KL}(p_{T}\|\pi)+\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big[g^{2}(t)\big(\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big)\big]\mathop{}\!\mathrm{d}t\\
&+\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big[g^{2}(t)\|\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t+\mathcal{H}(p_{\tau}).
\end{aligned}\tag{24}$$

Now, directly applying Theorem 4 of Song et al. (2021a), the entropy $\mathcal{H}(p_{\tau})$ becomes

$$\mathcal{H}(p_{\tau})=\mathcal{H}(p_{T})-\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big[2\,\text{div}\big(\mathbf{f}(\mathbf{x}_{t},t)\big)+g^{2}(t)\|\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t. \tag{25}$$

Therefore, from Eqs. (24) and (25), we get

$$\begin{aligned}\mathbb{E}_{p_{\tau}(\mathbf{x}_{\tau})}\big[-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big]\leq\ &\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big[g^{2}(t)\big(\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big)\big]\mathop{}\!\mathrm{d}t\\ &-\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big[\textup{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big].\end{aligned}$$

∎

Theorem 1.

Suppose $\lambda(t)$ is a weighting function of the NCSN loss. If $\frac{\lambda(t)}{g^{2}(t)}$ is a nondecreasing and nonnegative absolutely continuous function on $[\epsilon,T]$ and zero on $[0,\epsilon)$, then

$$\begin{aligned}\mathcal{L}(\bm{\theta};\lambda,\epsilon)\geq&\int_{\epsilon}^{T}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}\mathbb{E}_{\mathbf{x}_{\tau}}\big[-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big]\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\mathbb{E}_{\mathbf{x}_{\epsilon}}\big[-\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{\epsilon})}\big]\\ &+\int_{\epsilon}^{T}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}-1\Big)\mathbb{E}_{\mathbf{x}_{\tau}}\big[\textup{div}(\mathbf{f}(\mathbf{x}_{\tau},\tau))\big]\mathop{}\!\mathrm{d}\tau+\Big[\frac{\lambda(T)}{g^{2}(T)}-1\Big]\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big].\end{aligned}$$
Proof.

We prove the theorem by using

$$\begin{aligned}\int_{\epsilon}^{T}\lambda(t)A(t)\mathop{}\!\mathrm{d}t=&\int_{\epsilon}^{T}\bigg[\int_{\epsilon}^{t}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\bigg]g^{2}(t)A(t)\mathop{}\!\mathrm{d}t\\ =&\int_{\epsilon}^{T}\int_{\epsilon}^{T}1_{[\epsilon,t]}(\tau)\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}g^{2}(t)A(t)\mathop{}\!\mathrm{d}\tau\mathop{}\!\mathrm{d}t+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\int_{\epsilon}^{T}g^{2}(t)A(t)\mathop{}\!\mathrm{d}t\\ =&\int_{\epsilon}^{T}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}\int_{\tau}^{T}g^{2}(t)A(t)\mathop{}\!\mathrm{d}t\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\int_{\epsilon}^{T}g^{2}(t)A(t)\mathop{}\!\mathrm{d}t.\end{aligned}\tag{26}$$

By plugging $A(t)=\frac{1}{2}\mathbb{E}_{\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]$ into Eq. (26), we have

$$\begin{aligned}
\mathcal{L}(\bm{\theta};\lambda,\epsilon):=&\frac{1}{2}\int_{\epsilon}^{T}\lambda(t)\mathbb{E}_{\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t\\
&-\int_{\epsilon}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big]\\
=&\int_{\epsilon}^{T}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}\bigg[\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t\\
&\quad\quad-\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big]\bigg]\mathop{}\!\mathrm{d}\tau\\
&+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\bigg[\frac{1}{2}\int_{\epsilon}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t\\
&\quad\quad-\int_{\epsilon}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big]\bigg]\\
&+\int_{\epsilon}^{T}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t\mathop{}\!\mathrm{d}\tau+\Big(\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\Big)\int_{\epsilon}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t\\
&-\int_{\epsilon}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t+\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big]\bigg[\int_{\epsilon}^{T}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}-1\bigg].
\end{aligned}\tag{27}$$

Also, plugging $A(t)=\frac{1}{g^{2}(t)}\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}\big(\mathbf{f}(\mathbf{x}_{t},t)\big)\big]$ into Eq. (26), we have

$$\int_{\epsilon}^{T}\frac{\lambda(t)}{g^{2}(t)}\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}\big(\mathbf{f}(\mathbf{x}_{t},t)\big)\big]\mathop{}\!\mathrm{d}t=\int_{\epsilon}^{T}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t\mathop{}\!\mathrm{d}\tau+\Big(\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\Big)\int_{\epsilon}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t. \tag{28}$$

Using Eqs. (27) and (28), we get

$$\begin{aligned}\mathcal{L}(\bm{\theta};\lambda,\epsilon)=&\int_{\epsilon}^{T}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}\mathcal{L}(\bm{\theta};g^{2},\tau)\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\mathcal{L}(\bm{\theta};g^{2},\epsilon)\\ &+\int_{\epsilon}^{T}\Big(\frac{\lambda(t)}{g^{2}(t)}-1\Big)\mathbb{E}_{\mathbf{x}_{t}}\big[\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big]\mathop{}\!\mathrm{d}t+\Big[\frac{\lambda(T)}{g^{2}(T)}-1\Big]\mathbb{E}_{\mathbf{x}_{T}}\big[\log{\pi(\mathbf{x}_{T})}\big].\end{aligned}\tag{29}$$

Then, applying Lemma 1 to Eq. (29) yields the desired result. ∎
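The integration-by-parts identity of Eq. (26), on which this proof rests, can be verified numerically. The sketch below uses hypothetical smooth choices $\lambda(t)/g^{2}(t)=t^{2}$, $g^{2}(t)=1+t$, and $A(t)=\cos t$ on $[\epsilon,T]=[0.1,1]$; these functions are assumptions for illustration and do not correspond to any SDE in the paper.

```python
import numpy as np

def trapz(y, x):
    """Plain trapezoidal rule."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

eps, T = 0.1, 1.0
t = np.linspace(eps, T, 20001)

w = t**2          # hypothetical lambda(t) / g^2(t), nondecreasing and nonnegative
dw = 2.0 * t      # its derivative
g2 = 1.0 + t      # hypothetical g^2(t)
A = np.cos(t)     # hypothetical A(t)

# Left-hand side of Eq. (26): int_eps^T lambda(t) A(t) dt with lambda = w * g^2
lhs = trapz(w * g2 * A, t)

# tail[i] approximates int_{t_i}^T g^2(t) A(t) dt via a reversed cumulative sum
f = g2 * A
cells = 0.5 * (f[1:] + f[:-1]) * np.diff(t)
tail = np.concatenate([np.cumsum(cells[::-1])[::-1], [0.0]])

# Right-hand side of Eq. (26): derivative-weighted tails plus the boundary term
rhs = trapz(dw * tail, t) + w[0] * tail[0]

print(lhs, rhs)
```

Both sides should agree up to quadrature error; replacing `w` with any other nondecreasing, absolutely continuous function should leave the agreement intact.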

Corollary 1.

Suppose $\lambda(t)$ is a weighting function of the NCSN loss. If $\frac{\lambda(t)}{g^{2}(t)}$ is a nondecreasing and nonnegative continuous function on $[\epsilon,T]$ and zero on $[0,\epsilon)$, then

$$\begin{aligned}&\frac{1}{2}\int_{\epsilon}^{T}\lambda(t)\mathbb{E}_{\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t+\frac{\lambda(T)}{g^{2}(T)}D_{KL}(p_{T}\|\pi)\\ &\quad\quad\quad\geq\int_{\epsilon}^{T}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}D_{KL}(p_{\epsilon}\|p_{\epsilon}^{\bm{\theta}}).\end{aligned}$$
Remark 1.

A direct extension of the proof indicates that Theorem 1 still holds when $\frac{\lambda(t)}{g^{2}(t)}$ has finite jumps on $[0,T]$.

Remark 2.

The weight $\frac{\lambda(T)}{g^{2}(T)}$ is the normalizing constant of the unnormalized truncation probability $\mathbb{P}$.

Proof of Corollary 1.

By plugging $A(t)=\frac{1}{2}\mathbb{E}_{\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]$ into Eq. (26) and using Eq. (21) from the proof of Lemma 1, we have

$$\begin{aligned}
&\frac{1}{2}\int_{\epsilon}^{T}\lambda(t)\mathbb{E}_{\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t+\frac{\lambda(T)}{g^{2}(T)}D_{KL}(p_{T}\|\pi)\\
&=\int_{\epsilon}^{T}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t\mathop{}\!\mathrm{d}\tau\\
&\quad+\Big(\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\Big)\frac{1}{2}\int_{\epsilon}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{t}}\big[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big]\mathop{}\!\mathrm{d}t+\frac{\lambda(T)}{g^{2}(T)}D_{KL}(p_{T}\|\pi)\\
&\geq\int_{\epsilon}^{T}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}\big[D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})-D_{KL}(p_{T}\|\pi)\big]\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\big[D_{KL}(p_{\epsilon}\|p_{\epsilon}^{\bm{\theta}})-D_{KL}(p_{T}\|\pi)\big]+\frac{\lambda(T)}{g^{2}(T)}D_{KL}(p_{T}\|\pi)\\
&=\int_{\epsilon}^{T}\Big(\frac{\lambda(\tau)}{g^{2}(\tau)}\Big)^{\prime}D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}D_{KL}(p_{\epsilon}\|p_{\epsilon}^{\bm{\theta}}).
\end{aligned}$$

Appendix C Additional Score Architectures and SDEs

C.1 Additional Score Architectures: Unbounded Parametrization

In the released code of Song et al. (2021b), the NCSN++ network is modeled by $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\log{\sigma(t)})$, where the second argument is $\log{\sigma(t)}$ instead of $t$. Experiments with $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)$ or $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\sigma(t))$ were not as good as the parametrization $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\log{\sigma(t)})$, and we analyze these experimental results using Lemma 2 and Proposition 1.

Proposition 1.

Let $\mathcal{H}_{[1,\infty)}=\{\mathbf{s}:\mathbb{R}^{d}\times[1,\infty)\rightarrow\mathbb{R}^{d},\ \mathbf{s}\text{ is locally Lipschitz}\}$. Suppose a continuous vector field $\mathbf{v}$ defined on a $d$-dimensional open subset $U$ of a compact manifold $M$ is unbounded, and the projection of $\mathbf{v}$ on each axis is locally integrable. Then, there exists $\mathbf{s}\in\mathcal{H}_{[1,\infty)}$ such that $\lim_{\eta\rightarrow\infty}\mathbf{s}(\mathbf{x},\eta)=\mathbf{v}(\mathbf{x})$ a.e. on $U$.

The gradient of the log transition probability diverges at $t\approx 0$ theoretically (Section A.2) and empirically (Figure 9-(a)). Here, in high-dimensional space, the ratio $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})/p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0}^{\prime})$ with $\mathbf{x}_{0}\neq\mathbf{x}_{0}^{\prime}$ is either zero or infinity. Thus, the data score is nearly identical to the gradient of the log transition probability, $\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}=\|\nabla_{\mathbf{x}_{t}}\log{\int p_{r}(\mathbf{x}_{0})p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})\mathop{}\!\mathrm{d}\mathbf{x}_{0}}\|_{2}^{2}\approx\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}$, and the observation of Figure 9-(a) is valid for the exact data score as well.

Although Lemma 2 is stated for $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)$, the identical result also holds for the parametrization $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\sigma(t))$, so neither $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)$ nor $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\sigma(t))$ can estimate the data score as $t\rightarrow 0$. On the other hand, Proposition 1 implies that there exists a score function that estimates the unbounded data score asymptotically, which explains why the parametrization of Song et al. (2021b), i.e., $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\log{\sigma(t)})$, is successful at score estimation.

On top of that, we introduce another parametrization that particularly focuses on the score estimation near $t\approx 0$. We name Unbounded NCSN++ (UNCSN++) the network $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\eta(t))$ with

$$\eta(t)=\left\{\begin{array}{ll}\log{\sigma(t)}&\text{if }\sigma(t)\geq\sigma_{0}\\ -\frac{c_{1}}{\sigma(t)}+c_{2}&\text{if }\sigma(t)<\sigma_{0}\end{array}\right.$$

and Unbounded DDPM++ (UDDPM++) the network $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\eta(t))$ with $\eta(t):=\int\frac{g^{2}(t)}{\sigma^{2}(t)}\mathop{}\!\mathrm{d}t$.

In UNCSN++, $c_{1}$, $c_{2}$, and $\sigma_{0}$ are hyperparameters. Following the $\log{\sigma(t)}$ parametrization, we choose $\sigma_{0}=0.01$. Also, to make $\eta(t)$ continuously differentiable, the two hyperparameters $c_{1}$ and $c_{2}$ must satisfy a system of equations with degree 2, which fully determines $c_{1}$ and $c_{2}$.
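One plausible reading of the continuous-differentiability condition is that the two branches of $\eta$ agree in value and in first derivative with respect to $\sigma$ at $\sigma=\sigma_{0}$. Under that assumption (ours, not spelled out in the text), $\log\sigma_{0}=-c_{1}/\sigma_{0}+c_{2}$ and $1/\sigma_{0}=c_{1}/\sigma_{0}^{2}$, which gives $c_{1}=\sigma_{0}$ and $c_{2}=\log\sigma_{0}+1$. The sketch below encodes this reading; the function names are hypothetical.

```python
import math

def unbounded_eta_constants(sigma0):
    """c1, c2 from matching value and d/dsigma of both branches at sigma0.

    Assumption: 'continuously differentiable' is read as C^1 in sigma.
    """
    c1 = sigma0
    c2 = math.log(sigma0) + 1.0
    return c1, c2

def eta(sigma, sigma0=0.01):
    """Piecewise second input of UNCSN++ as a function of sigma."""
    c1, c2 = unbounded_eta_constants(sigma0)
    if sigma >= sigma0:
        return math.log(sigma)
    return -c1 / sigma + c2

c1, c2 = unbounded_eta_constants(0.01)
print(c1, c2)
# eta decreases much faster than log(sigma) once sigma drops below sigma0
print(eta(1e-2), eta(1e-3), eta(1e-4))
```

With this choice $\eta$ behaves like $-\sigma_{0}/\sigma$ below the threshold, so the second input is unbounded at a polynomial rather than logarithmic rate as $\sigma\rightarrow 0$.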

Figure 9: (a) Approximate data score diverges: the approximate data score, $\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}=\|\nabla_{\mathbf{x}_{t}}\log{\int p_{r}(\mathbf{x}_{0})p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})\mathop{}\!\mathrm{d}\mathbf{x}_{0}}\|_{2}^{2}\approx\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}$, diverges as $t\rightarrow 0$. (b) Cumulative density function of $t$ and $\eta$: comparison of DDPM++ and UDDPM++ in terms of the cumulative density function of the second input. (c) VESDE violates geometric progression: comparison of VESDE and RVESDE in terms of $\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}t}\log{\sigma^{2}}$.

The choice of such $\eta(t)$ for UDDPM++ is expected to enhance the score estimation near $t\approx 0$ because the input $\eta(t)$ is distributed uniformly when we draw samples from the importance weight. Concretely, when the sampling distribution on the diffusion time is given by $p_{iw}(t)\propto\frac{g^{2}(t)}{\sigma^{2}(t)}$, the $\eta$-distribution from the importance sampling becomes $p(\eta)\propto 1$, which is depicted in Figure 9-(b).
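The uniformity claim can be checked empirically. The sketch below assumes a VP SDE with $g^{2}(t)=\beta(t)$ and a hypothetical linear schedule, for which $\eta(t)=\log\big(e^{\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}-1\big)$ is one antiderivative of $g^{2}/\sigma^{2}$. It samples $t$ from $p_{iw}$ on $[\epsilon,T]$ by a grid-based inverse CDF that never uses $\eta$, then histograms the rescaled $\eta(t)$. The schedule endpoints and $\epsilon=10^{-5}$ are illustrative assumptions.

```python
import numpy as np

beta_min, beta_max, eps, T = 0.1, 20.0, 1e-5, 1.0

def big_b(t):
    """B(t) = int_0^t beta(s) ds for the assumed linear schedule."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t**2

def density(t):
    """Unnormalized p_iw(t) = g^2(t) / sigma^2(t) with g^2 = beta."""
    beta = beta_min + (beta_max - beta_min) * t
    return beta / (-np.expm1(-big_b(t)))

def eta(t):
    """Antiderivative of g^2 / sigma^2 in the VP case: log(exp(B(t)) - 1)."""
    return np.log(np.expm1(big_b(t)))

# Numerical CDF of p_iw on a geometric grid (the density behaves like 1/t near 0)
grid = np.geomspace(eps, T, 200001)
cells = 0.5 * (density(grid[1:]) + density(grid[:-1])) * np.diff(grid)
cdf = np.concatenate([[0.0], np.cumsum(cells)])
cdf /= cdf[-1]

# Inverse-CDF sampling of t, then map through eta and rescale to [0, 1]
rng = np.random.default_rng(0)
t = np.interp(rng.random(100000), cdf, grid)
u = (eta(t) - eta(eps)) / (eta(T) - eta(eps))
hist, _ = np.histogram(u, bins=10, range=(0.0, 1.0))
print(hist)
```

Each of the ten bins should hold roughly one tenth of the samples, consistent with $p(\eta)\propto 1$.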

Proof of Proposition 1.

Let $h$ be a standard mollifier function. If $h_{t}(\mathbf{x})=t^{-n}h(\mathbf{x}/t)$, then $v_{t}:=h_{t}*v$ converges to $v$ a.e. on $U$ as $t\rightarrow 0$ (Theorem 7-(ii) of Appendix C in Evans (1998)). Therefore, if we define $\mathbf{s}(\mathbf{x},\eta):=v_{1/\eta}(\mathbf{x})$ on the domain of $v_{1/\eta}(\mathbf{x})$ and $\mathbf{s}(\mathbf{x},\eta):=0$ elsewhere, then $\mathbf{s}(\mathbf{x},\eta)=v_{1/\eta}(\mathbf{x})\rightarrow v(\mathbf{x})$ a.e. on $U$ as $\eta\rightarrow\infty$.

Now, to show that $\mathbf{s}(\mathbf{x},\eta)$ is locally Lipschitz, let $\tilde{M}\times[\underline{\eta},\overline{\eta}]$ be a compact subset of $\mathbb{R}^{n}\times[1,\infty)$. From $\|\mathbf{s}(\mathbf{x}_{1},\eta_{1})-\mathbf{s}(\mathbf{x}_{2},\eta_{2})\|=\|v_{1/\eta_{1}}(\mathbf{x}_{1})-v_{1/\eta_{2}}(\mathbf{x}_{2})\|\leq\|v_{1/\eta_{1}}(\mathbf{x}_{1})-v_{1/\eta_{1}}(\mathbf{x}_{2})\|+\|v_{1/\eta_{1}}(\mathbf{x}_{2})-v_{1/\eta_{2}}(\mathbf{x}_{2})\|$, if there exist $K_{1},K_{2}>0$ such that $\|v_{1/\eta_{1}}(\mathbf{x}_{1})-v_{1/\eta_{1}}(\mathbf{x}_{2})\|\leq K_{1}\|\mathbf{x}_{1}-\mathbf{x}_{2}\|$ and $\|v_{1/\eta_{1}}(\mathbf{x}_{1})-v_{1/\eta_{2}}(\mathbf{x}_{1})\|\leq K_{2}|\eta_{1}-\eta_{2}|$ for all $\mathbf{x}_{1},\mathbf{x}_{2}\in\tilde{M}$ and $\eta_{1},\eta_{2}\in[\underline{\eta},\overline{\eta}]$, then $\mathbf{s}(\mathbf{x},\eta)=v_{1/\eta}(\mathbf{x})$ is Lipschitz on $\tilde{M}\times[\underline{\eta},\overline{\eta}]$.

First, since $v_{1/\eta}$ is infinitely differentiable on its domain (Theorem 7-(i) of Appendix C in (Evans, 1998)) and $\eta\in[\underline{\eta},\overline{\eta}]$, there exists $K_{1}>0$ such that $\|v_{1/\eta}(\mathbf{x}_{1})-v_{1/\eta}(\mathbf{x}_{2})\|\leq K_{1}\|\mathbf{x}_{1}-\mathbf{x}_{2}\|$. Second, the mollification converges uniformly on any compact subset of $U$ (Theorem 7-(iii) of Appendix C in (Evans, 1998)), which implies that $\|v_{1/\eta_{1}}(\mathbf{x})-v_{1/\eta_{2}}(\mathbf{x})\|\leq K_{2}|\frac{1}{\eta_{1}}-\frac{1}{\eta_{2}}|=K_{2}\frac{|\eta_{1}-\eta_{2}|}{\eta_{1}\eta_{2}}\leq K_{3}|\eta_{1}-\eta_{2}|$ for some $K_{2},K_{3}>0$. Therefore, $\mathbf{s}$ is an element of $\mathcal{H}_{[1,\infty)}$. ∎

C.2 Additional SDE: Reciprocal VESDE

VESDE assumes $g(t)=\sigma_{min}(\frac{\sigma_{max}}{\sigma_{min}})^{t}\sqrt{2\log{\frac{\sigma_{max}}{\sigma_{min}}}}$. Then, the variance of the transition probability $p_{0t}(\mathbf{x}_{t}|\mu_{VE}(t)\mathbf{x}_{0},\sigma_{VE}^{2}(t))$ becomes $\sigma_{VE}^{2}(t)=\int_{0}^{t}g^{2}(s)\,\mathrm{d}s=\sigma_{min}^{2}[(\frac{\sigma_{max}}{\sigma_{min}})^{2t}-1]$ if the diffusion starts from $t=0$ with the initial condition $\mathbf{x}_{0}\sim p_{r}$. VESDE was originally introduced in Song & Ermon (2020) to satisfy the geometric property, which gives a smooth transition of the distributional shift. Mathematically, the variance is geometric if $\frac{\mathrm{d}}{\mathrm{d}t}\log{\sigma_{VE}^{2}(t)}$ is a constant, but VESDE loses the geometric property, as illustrated in Figure 9-(c).

To attain the geometric property, VESDE approximates the variance by $\tilde{\sigma}_{VE}^{2}(t)=\sigma_{min}^{2}(\frac{\sigma_{max}}{\sigma_{min}})^{2t}$, omitting the $-1$ term from $\sigma_{VE}^{2}(t)$. However, under this approximation $\mathbf{x}_{t}$ does not converge to $\mathbf{x}_{0}$ in distribution because $\sigma_{min}^{2}(\frac{\sigma_{max}}{\sigma_{min}})^{2t}\rightarrow\sigma_{min}^{2}\neq 0$ as $t\rightarrow 0$. Indeed, a slightly stronger claim is possible:

Proposition 2.

There is no SDE that has the stochastic process $\{\mathbf{x}_{t}\}_{t\in[0,T]}$, defined by the transition probability $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\mathbf{x}_{0},\sigma_{min}^{2}(\frac{\sigma_{max}}{\sigma_{min}})^{2t}\mathbf{I})$, as its solution.

Proposition 2 indicates that if we approximate the variance by $\tilde{\sigma}_{VE}^{2}(t)$, then the reverse diffusion process cannot be modeled by a generative process.

Rigorously, however, if the diffusion process starts from $t=-\infty$, rather than $t=0$, then the variance of the transition probability becomes $\sigma_{VE,-\infty}^{2}(t)=\int_{-\infty}^{t}g^{2}(s)\,\mathrm{d}s=\sigma_{min}^{2}(\frac{\sigma_{max}}{\sigma_{min}})^{2t}$, which is exactly the variance $\tilde{\sigma}_{VE}^{2}(t)$. Therefore, VESDE can be considered as a diffusion process starting from $t=-\infty$.
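The two variance schedules above can be checked numerically. The following is a minimal sketch, not taken from the paper's code: the values of `SIGMA_MIN` and `SIGMA_MAX` and all helper names are illustrative assumptions. It integrates $g^{2}$ with a midpoint rule and estimates $\frac{\mathrm{d}}{\mathrm{d}t}\log\sigma^{2}(t)$ with a central difference.

```python
import math

# Illustrative values (assumptions, not prescribed by the text above).
SIGMA_MIN, SIGMA_MAX = 0.01, 50.0
LOG_RATIO = math.log(SIGMA_MAX / SIGMA_MIN)


def g_ve(t):
    """VESDE diffusion coefficient g(t)."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t * math.sqrt(2.0 * LOG_RATIO)


def var_ve(t):
    """Variance when the diffusion starts at t = 0."""
    return SIGMA_MIN ** 2 * ((SIGMA_MAX / SIGMA_MIN) ** (2.0 * t) - 1.0)


def var_ve_tilde(t):
    """Approximate variance with the -1 term omitted."""
    return SIGMA_MIN ** 2 * (SIGMA_MAX / SIGMA_MIN) ** (2.0 * t)


def integrate(f, a, b, n=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


def log_slope(var, t, h=1e-5):
    """Central-difference estimate of d/dt log var(t)."""
    return (math.log(var(t + h)) - math.log(var(t - h))) / (2.0 * h)


if __name__ == "__main__":
    g2 = lambda s: g_ve(s) ** 2
    print("int_0^0.5 g^2 :", integrate(g2, 0.0, 0.5), "vs", var_ve(0.5))
    print("int_-3^0.5 g^2:", integrate(g2, -3.0, 0.5), "vs", var_ve_tilde(0.5))
    for t in (0.01, 0.1, 0.9):
        print(t, log_slope(var_ve, t), log_slope(var_ve_tilde, t), 2.0 * LOG_RATIO)
```

Under these assumptions, the log-slope of $\tilde{\sigma}_{VE}^{2}$ equals $2\log\frac{\sigma_{max}}{\sigma_{min}}$ at every $t$, whereas the log-slope of $\sigma_{VE}^{2}$ grows without bound as $t\rightarrow 0$. Integrating $g^{2}$ from a sufficiently negative lower limit reproduces $\tilde{\sigma}_{VE}^{2}$.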

From this point of view, we introduce an SDE that satisfies the geometric progression property starting from $t=0$. We name this new SDE the Reciprocal VE SDE (RVESDE). RVESDE has the same form as VESDE, $\mathrm{d}\mathbf{x}_{t}=g_{RVE}(t)\,\mathrm{d}\mathbf{w}_{t}$, with

$$g_{RVE}(t):=\begin{cases}\sigma_{max}\big(\frac{\sigma_{min}}{\sigma_{max}}\big)^{\frac{\epsilon}{t}}\frac{\sqrt{2\epsilon\log{(\frac{\sigma_{max}}{\sigma_{min}})}}}{t}&\text{if }t>0,\\ 0&\text{if }t=0.\end{cases}$$

Then, the transition probability of RVESDE becomes

$$p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}\bigg(\mathbf{x}_{t};\mathbf{x}_{0},\sigma_{max}^{2}\Big(\frac{\sigma_{min}}{\sigma_{max}}\Big)^{\frac{2\epsilon}{t}}\mathbf{I}\bigg).$$

As illustrated in Figure 9-(c), RVESDE attains the geometric property at the expense of having reciprocated time, $1/t$. Also, RVESDE satisfies $\sigma_{RVE}^{2}(\epsilon)=\sigma_{min}^{2}$ and $\sigma_{RVE}^{2}(T)\approx\sigma_{max}^{2}$. The existence and uniqueness of a solution for RVESDE are guaranteed by Theorem 5.2.1 in (Oksendal, 2013).
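The stated properties of RVESDE can be verified with a short numerical sketch. The code below is illustrative only: $\sigma_{min}=10^{-3}$ and $\epsilon=10^{-5}$ follow the UNCSN++ setting reported in Appendix D.1, while $\sigma_{max}=50$, $T=1$, and the helper names are assumptions made for this example. It checks that $\frac{\mathrm{d}}{\mathrm{d}t}\sigma_{RVE}^{2}(t)=g_{RVE}^{2}(t)$, that the boundary values match, and that $\log\sigma_{RVE}^{2}$ is linear in $1/t$.

```python
import math

SIGMA_MIN, EPS = 1e-3, 1e-5   # UNCSN++ setting reported in Appendix D.1
SIGMA_MAX, T = 50.0, 1.0      # assumed values for illustration
LOG_RATIO = math.log(SIGMA_MAX / SIGMA_MIN)


def g_rve(t):
    """RVESDE diffusion coefficient, defined as 0 at t = 0."""
    if t == 0.0:
        return 0.0
    decay = (SIGMA_MIN / SIGMA_MAX) ** (EPS / t)
    return SIGMA_MAX * decay * math.sqrt(2.0 * EPS * LOG_RATIO) / t


def var_rve(t):
    """Variance of the RVESDE transition probability for t > 0."""
    return SIGMA_MAX ** 2 * (SIGMA_MIN / SIGMA_MAX) ** (2.0 * EPS / t)


def dvar_dt(t):
    """Central-difference estimate of d/dt var_rve(t) with a relative step."""
    h = 1e-4 * t
    return (var_rve(t + h) - var_rve(t - h)) / (2.0 * h)


def inverse_time_slope(t_a, t_b):
    """Slope of log var_rve with respect to 1/t between two times."""
    num = math.log(var_rve(t_a)) - math.log(var_rve(t_b))
    return num / (1.0 / t_a - 1.0 / t_b)


if __name__ == "__main__":
    print("var(eps) vs sigma_min^2:", var_rve(EPS), SIGMA_MIN ** 2)
    print("var(T)   vs sigma_max^2:", var_rve(T), SIGMA_MAX ** 2)
    for t in (1e-3, 1e-1, 1.0):
        print(t, dvar_dt(t), g_rve(t) ** 2)
    print("slope in 1/t:", inverse_time_slope(1.0, 0.5), -2.0 * EPS * LOG_RATIO)
```

In this sketch the slope of $\log\sigma_{RVE}^{2}$ with respect to $1/t$ is the constant $-2\epsilon\log\frac{\sigma_{max}}{\sigma_{min}}$, which is the geometric property in reciprocated time.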

Appendix D Implementation Details

D.1 Experimental Details

Training. Throughout the experiments, we train our models with a learning rate of 0.0002, a warmup of 5000 iterations, and gradient clipping at 1. For UNCSN++, we take $\sigma_{min}=10^{-3}$, and for NCSN++, we take $\sigma_{min}=10^{-2}$. For ImageNet32 training with the likelihood weighting and the variance weighting without Soft Truncation, we take $\epsilon=5\times 10^{-5}$, following the setting of Song et al. (2021a). Otherwise, we take $\epsilon=10^{-5}$. For the other hyperparameters, we follow Song et al. (2021b, a).
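As a rough illustration of the optimizer settings above, the sketch below encodes the learning rate, warmup length, and clipping threshold. The text states only the numerical values, so the linear warmup shape, the use of a global L2 norm for clipping, and the function names are assumptions of this example rather than a description of the training code.

```python
import math

BASE_LR = 2e-4        # learning rate stated above
WARMUP_STEPS = 5000   # warmup iterations stated above
CLIP_NORM = 1.0       # gradient clipping threshold stated above


def learning_rate(step):
    """Linear warmup to BASE_LR (the linear shape is an assumption)."""
    return BASE_LR * min(step / WARMUP_STEPS, 1.0)


def clip_by_global_norm(grads, max_norm=CLIP_NORM):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]


if __name__ == "__main__":
    for step in (0, 2500, 5000, 10000):
        print(step, learning_rate(step))
    print(clip_by_global_norm([3.0, 4.0]))
```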

On datasets of resolution $32\times 32$, we use a batch size of 128, which consumes about 48 GB of GPU memory. On STL-10 with resolution $48\times 48$, we use a batch size of 192, and on datasets of resolution $64\times 64$, we use a batch size of 128. The batch size for the datasets of resolution $256\times 256$ is 40, which takes nearly 120 GB of GPU memory. On the dataset of $1024\times 1024$ resolution, we use a batch size of 16, which takes around 120 GB of GPU memory. We use five NVIDIA RTX-3090 GPU machines to train the models that exceed 48 GB, and we use a pair of NVIDIA RTX-3090 GPU machines to train the models that consume less than 48 GB.

Evaluation. We apply EMA with a rate of 0.999 on NCSN++/UNCSN++ and 0.9999 on DDPM++/UDDPM++. For density estimation, we obtain the NLL by the Instantaneous Change of Variable formula (Song et al., 2021b; Chen et al., 2018). By default, we integrate the instantaneous change-of-variable formula of the probability flow over $[\epsilon=10^{-5},T=1]$, even for the ImageNet32 dataset. Although Song et al. (2021b, a) integrate the change-of-variable formula with $\mathbf{x}_{0}$ as the starting variable, Table 5 of Kim et al. (2022) shows that there is a significant difference between starting from $\mathbf{x}_{\epsilon}$ and starting from $\mathbf{x}_{0}$ if $\epsilon$ is not small enough. Therefore, we follow Kim et al. (2022) and compute $\mathbb{E}_{\mathbf{x}_{\epsilon}}\big[-\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{\epsilon})}\big]$. However, to compare with the baseline models, we also evaluate the NLL in the way Song et al. (2021b, a) and Vahdat et al. (2021) compute it. Throughout the appendix, we denote the approach of Kim et al. (2022) as after correction and that of Song et al. (2021a) as before correction. We dequantize the data variable by uniform dequantization (Ho et al., 2019) for both the after-correction and before-correction settings. In the main paper, we report only the after-correction performances.
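To make the instantaneous change-of-variable computation concrete, the following toy sketch evaluates $\log p_{\epsilon}(\mathbf{x}_{\epsilon})$ along the probability flow ODE of VESDE for one-dimensional Gaussian data, where the score and the divergence are known in closed form. It is not the evaluation code used in the experiments: the data standard deviation, $\sigma_{min}$, $\sigma_{max}$, the RK4 integrator, and all names are assumptions, and the exact score stands in for a trained network (in high dimensions the divergence would have to be estimated rather than computed exactly).

```python
import math

SIGMA_MIN, SIGMA_MAX, DATA_STD = 0.01, 50.0, 1.0  # illustrative assumptions
LOG_RATIO = math.log(SIGMA_MAX / SIGMA_MIN)


def sigma2(t):
    """VESDE variance for a diffusion started at t = 0."""
    return SIGMA_MIN ** 2 * math.expm1(2.0 * LOG_RATIO * t)


def g2(t):
    """Squared VESDE diffusion coefficient."""
    return SIGMA_MIN ** 2 * math.exp(2.0 * LOG_RATIO * t) * 2.0 * LOG_RATIO


def drift_and_div(x, t):
    """Probability-flow drift -0.5 * g^2 * score and its divergence.

    For data N(0, DATA_STD^2), p_t = N(0, DATA_STD^2 + sigma2(t)), so the
    exact score is -x / (DATA_STD^2 + sigma2(t)).
    """
    coeff = 0.5 * g2(t) / (DATA_STD ** 2 + sigma2(t))
    return coeff * x, coeff


def log_likelihood(x_eps, eps=1e-5, T=1.0, n_steps=2000):
    """log p_eps(x_eps) = log p_T(x_T) + integral of the divergence over [eps, T]."""
    h = (T - eps) / n_steps
    x, logdet, t = x_eps, 0.0, eps
    for _ in range(n_steps):
        # Classical RK4 on the augmented state (x, accumulated divergence).
        k1x, k1d = drift_and_div(x, t)
        k2x, k2d = drift_and_div(x + 0.5 * h * k1x, t + 0.5 * h)
        k3x, k3d = drift_and_div(x + 0.5 * h * k2x, t + 0.5 * h)
        k4x, k4d = drift_and_div(x + h * k3x, t + h)
        x += h * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        logdet += h * (k1d + 2.0 * k2d + 2.0 * k3d + k4d) / 6.0
        t += h
    prior_var = DATA_STD ** 2 + sigma2(T)
    log_prior = -0.5 * math.log(2.0 * math.pi * prior_var) - x * x / (2.0 * prior_var)
    return log_prior + logdet


if __name__ == "__main__":
    for x in (-1.5, 0.0, 0.7):
        v_eps = DATA_STD ** 2 + sigma2(1e-5)
        exact = -0.5 * math.log(2.0 * math.pi * v_eps) - x * x / (2.0 * v_eps)
        print(x, log_likelihood(x), exact)
```

Because the toy marginals are Gaussian, the ODE estimate can be compared directly against the exact log-density of $p_{\epsilon}$, which is what the printed pairs show.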

For sampling, we apply the Predictor-Corrector (PC) algorithm introduced in Song et al. (2021b). We set the signal-to-noise ratio to 0.16 on $32\times 32$ datasets, 0.17 on $48\times 48$ and $64\times 64$ datasets, 0.075 on $256\times 256$ datasets, and 0.15 on $1024\times 1024$. On datasets with resolution below $256\times 256$, we iterate 1,000 steps of the PC sampler, while we apply 2,000 steps on the other high-dimensional datasets. Throughout the experiments for VESDE, we use the reverse diffusion (Song et al., 2021b) as the predictor and the annealed Langevin dynamics (Welling & Teh, 2011) as the corrector. For VPSDE, we use Euler-Maruyama as the predictor, and we do not use any corrector.
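The structure of a PC sampler for VESDE can be sketched on a toy problem. The code below is an illustration under several assumptions, not the sampler used in the experiments: the data are one-dimensional standard Gaussian so that the exact score replaces the trained network, the noise schedule is geometric with assumed $\sigma_{min}=0.01$ and $\sigma_{max}=50$, the step counts are reduced for speed, and the Langevin step size is set from the signal-to-noise ratio by the rule $2(\text{snr}\cdot\|\mathbf{z}\|/\|\nabla\|)^{2}$ with norms taken over the whole batch.

```python
import math
import random


def pc_sample_toy(n_samples=500, n_steps=300, sigma_min=0.01, sigma_max=50.0,
                  snr=0.16, data_std=1.0, seed=0):
    """Predictor-corrector sampling for a toy 1-D VE diffusion with a known score."""
    rng = random.Random(seed)
    # Geometric noise schedule: sigmas[0] = sigma_min, sigmas[-1] = sigma_max.
    ratio = sigma_max / sigma_min
    sigmas = [sigma_min * ratio ** (i / (n_steps - 1)) for i in range(n_steps)]

    def score(x, sigma):
        # Exact score of N(0, data_std^2 + sigma^2); stands in for the network.
        return -x / (data_std ** 2 + sigma ** 2)

    # Start from the approximate prior N(0, sigma_max^2).
    xs = [rng.gauss(0.0, sigma_max) for _ in range(n_samples)]
    for i in range(n_steps - 1, 0, -1):
        # Predictor: one reverse-diffusion step from sigmas[i] to sigmas[i - 1].
        delta = sigmas[i] ** 2 - sigmas[i - 1] ** 2
        xs = [x + delta * score(x, sigmas[i]) + math.sqrt(delta) * rng.gauss(0.0, 1.0)
              for x in xs]
        # Corrector: one Langevin step at sigmas[i - 1], step size set from the SNR.
        grads = [score(x, sigmas[i - 1]) for x in xs]
        noise = [rng.gauss(0.0, 1.0) for _ in xs]
        grad_norm = math.sqrt(sum(g * g for g in grads))
        noise_norm = math.sqrt(sum(z * z for z in noise))
        step = 2.0 * (snr * noise_norm / grad_norm) ** 2
        xs = [x + step * g + math.sqrt(2.0 * step) * z
              for x, g, z in zip(xs, grads, noise)]
    return xs


if __name__ == "__main__":
    samples = pc_sample_toy()
    mean = sum(samples) / len(samples)
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))
    print("sample mean:", mean, "sample std:", std)
```

With the exact score, the samples should end up close to the data distribution perturbed by $\sigma_{min}$, so the printed mean and standard deviation are expected to be near 0 and 1 up to sampling and discretization error.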

We compute the FID score (Song et al., 2021b) based on the modified Inception V1 network (https://tfhub.dev/tensorflow/tfgan/eval/inception/1) using the tensorflow-gan package for the CIFAR-10 dataset, and we use clean-FID (Parmar et al., 2022) based on the Inception V3 network (Szegedy et al., 2016) for the remaining datasets. We note that the FID computed by clean-FID (Parmar et al., 2022) is higher than the FID from the original calculation (see https://github.com/GaParmar/clean-fid for the detailed experimental results).

Appendix E Additional Experimental Results

E.1 Ablation Study on Reconstruction Term

Table 9: Ablation study of Soft Truncation with/without the reconstruction term when training on CIFAR-10 with DDPM++ (VP). NLL before correction is $\mathbb{E}_{\mathbf{x}_{0}}[-\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{0})}]$ and NLL after correction is $\mathbb{E}_{\mathbf{x}_{\epsilon}}[-\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{\epsilon})}]+R_{\epsilon}(\bm{\theta})$. NELBO without residual is $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ and NELBO with residual is $\mathcal{L}(\bm{\theta};g^{2},\epsilon)+R_{\epsilon}(\bm{\theta})$.

| Loss | Soft Truncation | Reconstruction term for training | NLL (before correction) | NLL (after correction) | NELBO (without residual) | NELBO (with residual) | FID (ODE) |
|---|---|---|---|---|---|---|---|
| $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | ✗ | ✗ | 2.97 | 3.03 | 3.11 | 3.13 | 6.70 |
| $\mathcal{L}(\bm{\theta};g^{2},\epsilon)+\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{\epsilon}}\big[-\log{p(\mathbf{x}_{0}\mid\mathbf{x}_{\epsilon})}\big]$ | ✗ | ✓ | 3.01 | 2.99 | 3.07 | 3.09 | 6.93 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})=\mathbb{E}_{\mathbb{P}_{1}(\tau)}\big[\mathcal{L}(\bm{\theta};g^{2},\tau)\big]$ | ✓ | ✗ | 2.98 | 3.01 | 3.08 | 3.08 | 3.96 |
| $\mathbb{E}_{\mathbb{P}_{1}(\tau)}\big[\mathcal{L}(\bm{\theta};g^{2},\tau)+R_{\tau}(\bm{\theta})\big]$ | ✓ | ✓ | 2.95 | 2.98 | 3.04 | 3.04 | 4.23 |

Table 9 shows that training with the reconstruction term outperforms training without it on NLL/NELBO, at the expense of sample generation quality. If $\tau$ is fixed as $\epsilon$, then the bound

$$\mathbb{E}_{\mathbf{x}_{0}}\big[-\log{p_{0}^{\bm{\theta}}(\mathbf{x}_{0})}\big]\leq\mathcal{L}(\bm{\theta};g^{2},\tau)+\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{\tau}}\big[-\log{p(\mathbf{x}_{0}|\mathbf{x}_{\tau})}\big]$$

is tight enough to estimate the negative log-likelihood. However, if $\tau$ is a random variable, then the bound is not tight to the negative log-likelihood, as evidenced in Figure 1-(b). On the other hand, if we do not count the reconstruction term, then the bound becomes

$$\mathbb{E}_{\mathbf{x}_{0}}\big[-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big]\leq\mathcal{L}(\bm{\theta};g^{2},\tau),$$

up to a constant, and this bound is tight regardless of $\tau$, as evidenced in Figure 1-(c). This is why we refer to Soft Truncation as Maximum Perturbed Likelihood Estimation (MPLE).

Table 10: Ablation study of Soft Truncation for various weightings on CIFAR-10 and ImageNet32 with DDPM++ (VP).

| Dataset | Loss | Soft Truncation | NLL (after correction) | NLL (before correction) | NELBO (with residual) | NELBO (without residual) | FID (ODE) |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | ✗ | 3.03 | 2.97 | 3.13 | 3.11 | 6.70 |
| CIFAR-10 | $\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$ | ✗ | 3.21 | 3.16 | 3.34 | 3.32 | 3.90 |
| CIFAR-10 | $\mathcal{L}(\bm{\theta};g_{\mathbb{P}_{1}}^{2},\epsilon)$ | ✗ | 3.06 | 3.02 | 3.18 | 3.14 | 6.11 |
| CIFAR-10 | $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | ✓ | 3.01 | 2.98 | 3.08 | 3.08 | 3.96 |
| ImageNet32 | $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | ✗ | 3.92 | 3.90 | 3.94 | 3.95 | 12.68 |
| ImageNet32 | $\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$ | ✗ | 3.95 | 3.96 | 4.00 | 4.01 | 9.22 |
| ImageNet32 | $\mathcal{L}(\bm{\theta};g_{\mathbb{P}_{1}}^{2},\epsilon)$ | ✗ | 3.93 | 3.92 | 3.97 | 3.98 | 11.89 |
| ImageNet32 | $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | ✓ | 3.90 | 3.87 | 3.92 | 3.92 | 9.52 |
| ImageNet32 | $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.9})$ | ✓ | 3.90 | 3.88 | 3.91 | 3.91 | 8.42 |

Table 11: Ablation study of Soft Truncation for various model architectures and diffusion SDEs on CelebA.

| SDE | Model | Loss | NLL (after correction) | NLL (before correction) | NELBO (with residual) | NELBO (without residual) | FID (PC) | FID (ODE) |
|---|---|---|---|---|---|---|---|---|
| VE | NCSN++ | $\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$ | 3.41 | 2.37 | 3.42 | 3.96 | 3.95 | - |
| VE | NCSN++ | $\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P}_{2})$ | 3.44 | 2.42 | 3.44 | 3.97 | 2.68 | - |
| RVE | UNCSN++ | $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | 2.01 | 1.96 | 2.01 | 2.17 | 3.36 | - |
| RVE | UNCSN++ | $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{2})$ | 1.97 | 1.91 | 2.02 | 2.18 | 1.92 | - |
| VP | DDPM++ | $\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$ | 2.14 | 2.07 | 2.21 | 2.22 | 3.03 | 2.32 |
| VP | DDPM++ | $\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P}_{1})$ | 2.17 | 2.08 | 2.29 | 2.26 | 2.88 | 1.90 |
| VP | UDDPM++ | $\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$ | 2.11 | 2.07 | 2.20 | 2.21 | 3.23 | 4.72 |
| VP | UDDPM++ | $\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P}_{1})$ | 2.16 | 2.08 | 2.28 | 2.25 | 2.22 | 1.94 |
| VP | DDPM++ | $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | 2.00 | 1.93 | 2.09 | 2.09 | 5.31 | 3.95 |
| VP | DDPM++ | $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | 2.00 | 1.94 | 2.11 | 2.11 | 4.50 | 2.90 |
| VP | UDDPM++ | $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | 1.98 | 1.95 | 2.12 | 2.15 | 4.65 | 3.98 |
| VP | UDDPM++ | $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | 2.00 | 1.94 | 2.10 | 2.10 | 4.45 | 2.97 |

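Table 12: Ablation study of Soft Truncation for various $\sigma_{min}$ (equivalently, $\epsilon$) on CIFAR-10 with UNCSN++ (RVE).

| Loss | $\epsilon$ | NLL (after correction) | NLL (before correction) | NELBO (with residual) | NELBO (without residual) | FID (ODE) |
|---|---|---|---|---|---|---|
| $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | $10^{-2}$ | 4.64 | 4.02 | 4.69 | 5.20 | 38.82 |
| $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | $10^{-3}$ | 3.51 | 3.20 | 3.52 | 3.90 | 6.21 |
| $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | $10^{-4}$ | 3.05 | 2.98 | 3.08 | 3.24 | 6.33 |
| $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | $10^{-5}$ | 3.03 | 2.97 | 3.13 | 3.11 | 6.70 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | $10^{-2}$ | 4.65 | 4.03 | 4.69 | 5.20 | 39.83 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | $10^{-3}$ | 3.51 | 3.21 | 3.52 | 3.88 | 5.14 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | $10^{-4}$ | 3.05 | 2.98 | 3.08 | 3.24 | 4.16 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | $10^{-5}$ | 3.01 | 2.98 | 3.08 | 3.08 | 3.96 |
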
Table 13: Ablation study of Soft Truncation for various $\mathbb{P}_{k}$ on CIFAR-10 trained with DDPM++ (VP).

| Loss | NLL (after correction) | NLL (before correction) | NELBO (with residual) | NELBO (without residual) | FID (ODE) |
|---|---|---|---|---|---|
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0})$ | 3.24 | 3.16 | 3.39 | 3.34 | 6.27 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.8})$ | 3.03 | 3.00 | 3.05 | 3.05 | 3.61 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.9})$ | 3.03 | 2.99 | 3.13 | 3.13 | 3.45 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ | 3.01 | 2.98 | 3.08 | 3.08 | 3.96 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1.1})$ | 3.02 | 2.99 | 3.09 | 3.10 | 3.98 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1.2})$ | 3.03 | 2.99 | 3.09 | 3.09 | 3.98 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{2})$ | 3.01 | 2.97 | 3.10 | 3.09 | 6.31 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{3})$ | 3.02 | 2.96 | 3.09 | 3.09 | 6.54 |
| $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{\infty})=\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | 3.01 | 2.95 | 3.09 | 3.07 | 6.70 |

Table 14: Ablation study of Soft Truncation for CIFAR-10 trained with DDPM++ when a diffusion is combined with a normalizing flow (Kim et al., 2022). We use $\mathbb{P}([a,b])=\frac{1}{2}1_{[a,b]}(\epsilon)+\frac{1}{2}\mathbb{P}_{0.9}([a,b])$.

| Loss | NLL (after correction) | NLL (before correction) | NELBO (with residual) | NELBO (without residual) | FID (ODE) |
|---|---|---|---|---|---|
| $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ | 2.97 | 2.94 | 2.97 | 2.96 | 6.06 |
| $\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$ | 3.17 | 3.11 | 3.23 | 3.18 | 3.61 |
| $\mathcal{L}(\bm{\theta};g^{2},\mathbb{P})$ | 3.01 | 2.98 | 3.02 | 3.01 | 3.89 |

E.2 Full Tables

Tables 10, 11, 12, 13, and 14 present the full list of performances for Soft Truncation.

E.3 Generated Samples

Figure 10 shows how images are created from the trained model, and Figures 11 to 16 present non-cherry-picked samples generated by the trained model.

Figure 10: Image generation by denoising via predictor-corrector sampler.
Figure 11: Random samples of CIFAR-10.
Figure 12: Random samples on CelebA.
Figure 13: Random samples on LSUN Bedroom.
Figure 14: Random samples on LSUN Church.
Figure 15: Random samples on FFHQ 256.
Figure 16: Random samples on CelebA-HQ 256.