Soft Truncation: A Universal Training Technique of
Score-based Diffusion Model for High Precision Score Estimation


1 Introduction

2 Preliminary

3 Training and Evaluation of Diffusion Models in Practice

4 Soft Truncation: A Training Technique for a Diffusion Model

5 Experiments

6 Conclusion

Acknowledgements

References

Appendix A Derivation

Appendix B Theorems and Proofs

Appendix C Additional Score Architectures and SDEs

Appendix D Implementation Details

Appendix E Additional Experimental Results

3.1 The Need of Truncation

3.2 Variational Bound With Positive Truncation

3.3 A Universal Phenomenon in Diffusion Training: Extremely Imbalanced Loss

3.4 Effect of Truncation on Model Evaluation

4.1 Monte-Carlo Estimation of Truncated Variational Bound with Importance Sampling

4.2 Soft Truncation

4.3 Soft Truncation Equals to A Diffusion Model With A General Weight

4.4 Soft Truncation is Maximum Perturbed Likelihood Estimation

4.5 Choice of Truncation Probability Distribution

A.1 Transition Probability for Linear SDEs

A.2 Diverging Denoising Loss

A.3 General Weighted Diffusion Loss

C.1 Additional Score Architectures: Unbounded Parametrization

C.2 Additional SDE: Reciprocal VESDE

D.1 Experimental Details

E.1 Ablation Study on Reconstruction Term

E.2 Full Tables

E.3 Generated Samples

Abstract

Lemma 1.

Theorem 1.

Lemma 2.

Proof of Lemma 2.

Lemma 1.

Proof.

Theorem 1.

Proof.

Corollary 1.

Remark 1.

Remark 2.

Proof.

Proposition 1.

Proof of Proposition 1.

Proposition 2.

Dongjun Kim Seungjae Shin Kyungwoo Song Wanmo Kang Il-Chul Moon

Recent advances in diffusion models bring state-of-the-art performance on image generation tasks. However, empirical results from previous research in diffusion models imply an inverse correlation between density estimation and sample generation performances. This paper investigates, with sufficient empirical evidence, that such inverse correlation happens because small diffusion time contributes significantly to density estimation, whereas sample generation mainly depends on large diffusion time. However, training a score network well across the entire diffusion time is demanding because the loss scale is significantly imbalanced at each diffusion time. For successful training, therefore, we introduce Soft Truncation, a universally applicable training technique for diffusion models, that softens the fixed and static truncation hyperparameter into a random variable. In experiments, Soft Truncation achieves state-of-the-art performance on CIFAR-10, CelebA, CelebA-HQ $256\times 256$ , and STL-10 datasets.

Machine Learning, ICML

Recent advances in generative models enable the creation of highly realistic images. One direction of such modeling is likelihood-free models (Karras et al., 2019) based on minimax training. The other direction is likelihood-based models, including VAE (Vahdat & Kautz, 2020), autoregressive models (Parmar et al., 2018), and flow models (Grcić et al., 2021). Diffusion models (Ho et al., 2020) are one of the most successful likelihood-based models, where the reverse diffusion models the generative process. Diffusion models achieve state-of-the-art performance in image generation (Dhariwal & Nichol, 2021).

Previously, a model with the emphasis on Fréchet Inception Distance (FID), such as DDPM (Ho et al., 2020) and ADM (Dhariwal & Nichol, 2021), trains the score network with the variance weighting; whereas a model with the emphasis on Negative Log-Likelihood (NLL), such as ScoreFlow (Song et al., 2021a) and VDM (Kingma et al., 2021), trains the score network with the likelihood weighting. Such models, however, have the trade-off between NLL and FID: models with the emphasis on FID perform poorly on NLL, and vice versa. Instead of widely investigating the trade-off, they limit their work by separately training the score network on FID-favorable and NLL-favorable settings. This paper introduces Soft Truncation that significantly resolves the trade-off, with the NLL-favorable setting as the default training configuration. Soft Truncation reports a comparable FID against FID-favorable diffusion models while keeping NLL at the equivalent level of NLL-favorable models.

For that, we observe that the truncation hyperparameter significantly determines the overall scale of NLL and FID. This hyperparameter, $\epsilon$ , is the smallest diffusion time at which the score function is estimated; the score function below $\epsilon$ is not estimated. A model with a small enough $\epsilon$ favors NLL at the expense of FID, and a model with a relatively large $\epsilon$ favors FID but has poor NLL. Therefore, we introduce Soft Truncation, which softens the fixed and static truncation hyperparameter ( $\epsilon$ ) into a random variable ( $\tau$ ) that randomly selects the smallest diffusion time at every optimization step. In every mini-batch update, we randomly sample a new smallest diffusion time, $\tau$ , and the batch optimization estimates the score function only on $[\tau,T]$ , rather than $[\epsilon,T]$ , ignoring the range below $\tau$ . As $\tau$ varies across mini-batch updates, the score network successfully estimates the score function on the entire range of diffusion time, $[\epsilon,T]$ , which brings an improved FID.

There are two interesting properties of Soft Truncation. First, though Soft Truncation has nothing to do with the weighting function in its algorithmic design, it surprisingly turns out to be equivalent to a diffusion model with a general weight in the expectation sense (Eq. (10)). The random variable $\tau$ determines the weight function (Theorem 1), and this gives a partial reason why Soft Truncation is as successful in FID as the FID-favorable training (Table 4), even though Soft Truncation only considers the truncation threshold in its implementation (Section 4.2). Second, once $\tau$ is sampled in a mini-batch optimization, Soft Truncation optimizes the log-likelihood perturbed by $\tau$ (Lemma 1). Thus, Soft Truncation can be framed as Maximum Perturbed Likelihood Estimation (MPLE), a generalized concept of MLE that is specifically defined only in diffusion models (Section 4.4).

Throughout this paper, we focus on continuous-time diffusion models (Song et al., 2021b). A continuous diffusion model slowly and systematically perturbs a data random variable, $\mathbf{x}_{0}$ , into a noise variable, $\mathbf{x}_{T}$ , as time flows. The diffusion mechanism is represented as a Stochastic Differential Equation (SDE), written as

\displaystyle\mathop{}\!\mathrm{d}\mathbf{x}_{t}=\mathbf{f}(\mathbf{x}_{t},t)\mathop{}\!\mathrm{d}t+g(t)\mathop{}\!\mathrm{d}\mathbf{w}_{t},

(1)

where $\mathbf{w}_{t}$ is a standard Wiener process. The drift ( $\mathbf{f}$ ) and the diffusion ( $g$ ) terms are fixed, so the data variable is diffused in a fixed manner. We denote $\{\mathbf{x}_{t}\}_{t=0}^{T}$ as the solution of the given SDE of Eq. (1), and we omit the subscript and superscript to write $\{\mathbf{x}_{t}\}$ if no confusion arises.

The theory of stochastic calculus indicates that there exists a corresponding reverse SDE given by

\displaystyle\mathop{}\!\mathrm{d}\mathbf{x}_{t}=\big{[}\mathbf{f}(\mathbf{x}_{t},t)-g^{2}(t)\nabla\log{p_{t}(\mathbf{x}_{t})}\big{]}\mathop{}\!\mathrm{d}\bar{t}+g(t)\mathop{}\!\mathrm{d}\mathbf{\bar{w}}_{t},

(2)

where the solution of this reverse SDE coincides exactly with the solution of the forward SDE of Eq. (1). Here, $\mathop{}\!\mathrm{d}\bar{t}$ is the backward time differential; $\mathop{}\!\mathrm{d}\mathbf{\bar{w}}_{t}$ is a standard Wiener process flowing backward in time (Anderson, 1982); and $p_{t}(\mathbf{x}_{t})$ is the probability distribution of $\mathbf{x}_{t}$ . Henceforth, we represent $\{\mathbf{x}_{t}\}$ as the solution of SDEs of Eqs. (1) and (2).

The diffusion model’s objective is to learn the stochastic process, $\{\mathbf{x}_{t}\}$ , as a parametrized stochastic process, $\{\mathbf{x}_{t}^{\bm{\theta}}\}$ . A diffusion model builds the parametrized stochastic process as a solution of a generative SDE,

\displaystyle\mathop{}\!\mathrm{d}\mathbf{x}_{t}^{\bm{\theta}}=\big{[}\mathbf{f}(\mathbf{x}_{t}^{\bm{\theta}},t)-g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t}^{\bm{\theta}},t)\big{]}\mathop{}\!\mathrm{d}\bar{t}+g(t)\mathop{}\!\mathrm{d}\mathbf{\bar{w}}_{t}.

(3)

We construct the parametrized stochastic process by solving the generative SDE of Eq. (3) backward in time with a starting variable of $\mathbf{x}_{T}^{\bm{\theta}}\sim\pi$ , where $\pi$ is a noise distribution. Throughout the paper, we denote $p_{t}^{\bm{\theta}}$ as the probability distribution of $\mathbf{x}_{t}^{\bm{\theta}}$ .

A diffusion model learns the generative stochastic process by minimizing the score loss (Song et al., 2021a) of

\displaystyle\mathcal{L}(\bm{\theta};\lambda)=\frac{1}{2}\int_{0}^{T}\lambda(t)\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t,

where $\lambda(t)$ is a weighting function that counts the contribution of each diffusion time on the loss function. This score loss is infeasible to optimize because the data score, $\nabla\log{p_{t}(\mathbf{x}_{t})}$ , is intractable in general. Fortunately, $\mathcal{L}(\bm{\theta};\lambda)$ is known to be equivalent to the (continuous) denoising NCSN loss (Song et al., 2021b; Song & Ermon, 2019),

\displaystyle\mathcal{L}_{NCSN}(\bm{\theta};\lambda)=\frac{1}{2}\int_{0}^{T}\lambda(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t,

up to a constant that is irrelevant to $\bm{\theta}$ -optimization.

Two important SDEs are known to attain analytic transition probabilities, $\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}$ : Variance Exploding SDE (VESDE) and Variance Preserving SDE (VPSDE) (Song et al., 2021b). First, VESDE assumes $\mathbf{f}(\mathbf{x}_{t},t)=0$ and $g(t)=\sigma_{min}(\frac{\sigma_{max}}{\sigma_{min}})^{t}\sqrt{2\log{\frac{\sigma_{max}}{\sigma_{min}}}}$ . With such specific forms of $\mathbf{f}$ and $g$ , the transition probability of VESDE turns out to follow a Gaussian distribution of $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\mu_{VE}(t)\mathbf{x}_{0},\sigma_{VE}^{2}(t)\mathbf{I})$ with $\mu_{VE}(t)\equiv 1$ and $\sigma_{VE}^{2}(t)=\sigma_{min}^{2}[(\frac{\sigma_{max}}{\sigma_{min}})^{2t}-1]$ . Similarly, VPSDE takes $\mathbf{f}(\mathbf{x}_{t},t)=-\frac{1}{2}\beta(t)\mathbf{x}_{t}$ and $g(t)=\sqrt{\beta(t)}$ , where $\beta(t)=\beta_{min}+t(\beta_{max}-\beta_{min})$ ; and its transition probability is a Gaussian distribution of $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\mu_{VP}(t)\mathbf{x}_{0},\sigma_{VP}^{2}(t)\mathbf{I})$ with $\mu_{VP}(t)=e^{-\frac{1}{2}\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}$ and $\sigma_{VP}^{2}(t)=1-e^{-\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}$ .
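The two closed-form transition kernels can be evaluated directly. The following is a minimal Python sketch of $\mu(t)$ and $\sigma(t)$ for VESDE and VPSDE; the numerical values of `SIGMA_MIN`, `SIGMA_MAX`, `BETA_MIN`, and `BETA_MAX`, as well as all function names, are assumptions made for illustration and are not taken from the text.

```python
import math

# Hypothetical hyperparameter values, chosen only for illustration.
SIGMA_MIN, SIGMA_MAX = 0.01, 50.0
BETA_MIN, BETA_MAX = 0.1, 20.0


def ve_mean_std(t):
    """Return (mu_VE(t), sigma_VE(t)) of the VESDE transition kernel."""
    ratio = SIGMA_MAX / SIGMA_MIN
    var = SIGMA_MIN ** 2 * (ratio ** (2.0 * t) - 1.0)
    return 1.0, math.sqrt(var)


def vp_integrated_beta(t):
    """Closed form of int_0^t beta(s) ds for the linear schedule beta(t)."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2


def vp_mean_std(t):
    """Return (mu_VP(t), sigma_VP(t)) of the VPSDE transition kernel."""
    b = vp_integrated_beta(t)
    mu = math.exp(-0.5 * b)
    var = -math.expm1(-b)  # 1 - exp(-b), computed stably for small b
    return mu, math.sqrt(var)
```

Under these definitions, $\mu_{VP}^{2}(t)+\sigma_{VP}^{2}(t)=1$ for every $t$, which is the sense in which VPSDE preserves variance, while $\sigma_{VE}(t)$ grows from $0$ at $t=0$.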

Figure 1-(a): Integrand by Time

Recently, Kim et al. (2022) categorize VESDE and VPSDE as a family of linear diffusions that has the SDE of

\displaystyle\mathop{}\!\mathrm{d}\mathbf{x}_{t}=-\frac{1}{2}\beta(t)\mathbf{x}_{t}\mathop{}\!\mathrm{d}t+g(t)\mathop{}\!\mathrm{d}\mathbf{w}_{t},

(4)

where $\beta(t)$ and $g(t)$ are generic $t$ -functions. Under the linear diffusions, we derive the transition probability to follow a Gaussian distribution $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\mu(t)\mathbf{x}_{0},\sigma^{2}(t)\mathbf{I})$ for certain $\mu(t)$ and $\sigma(t)$ depending on $\beta(t)$ and $g(t)$ , respectively (see Eq. (16) of Appendix A.1). We emphasize that the suggested Soft Truncation is applicable to any SDE of Eq. (1), but we limit our focus to the family of linear SDEs of Eq. (4), particularly VESDE and VPSDE, for simplicity. With such a Gaussian transition probability, the denoising NCSN loss with a linear SDE is equivalent to

\displaystyle\frac{1}{2}\int_{0}^{T}\frac{\lambda(t)}{\sigma^{2}(t)}\mathbb{E}_{\mathbf{x}_{0},\bm{\epsilon}}\big{[}\|\bm{\epsilon}_{\bm{\theta}}(\mu(t)\mathbf{x}_{0}+\sigma(t)\bm{\epsilon},t)-\bm{\epsilon}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t,

if $\bm{\epsilon}_{\bm{\theta}}(\mu(t)\mathbf{x}_{0}+\sigma(t)\bm{\epsilon},t)=-\sigma(t)\mathbf{s}_{\bm{\theta}}(\mu(t)\mathbf{x}_{0}+\sigma(t)\bm{\epsilon},t)$ , where $\bm{\epsilon}\sim\mathcal{N}(0,\mathbf{I})$ is a random perturbation, and $\bm{\epsilon}_{\bm{\theta}}$ is the neural network that predicts $\bm{\epsilon}$ . This is the (continuous) DDPM loss (Song et al., 2021b), and the equivalence of the two losses provides a unified view of NCSN and DDPM. Hence, NCSN and DDPM are interchangeable, and we take the NCSN loss as the default form of a diffusion loss throughout the paper.
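The equivalence can be checked pointwise in $t$. The following worked step is a sketch that assumes the reparametrization $\mathbf{x}_{t}=\mu(t)\mathbf{x}_{0}+\sigma(t)\bm{\epsilon}$ , for which the Gaussian transition probability gives $\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}=-\bm{\epsilon}/\sigma(t)$ , together with $\bm{\epsilon}_{\bm{\theta}}=-\sigma(t)\mathbf{s}_{\bm{\theta}}$ :

```latex
\begin{aligned}
\lambda(t)\big\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})\big\|_{2}^{2}
&=\lambda(t)\Big\|-\frac{\bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_{t},t)}{\sigma(t)}+\frac{\bm{\epsilon}}{\sigma(t)}\Big\|_{2}^{2}\\
&=\frac{\lambda(t)}{\sigma^{2}(t)}\big\|\bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_{t},t)-\bm{\epsilon}\big\|_{2}^{2}.
\end{aligned}
```

Taking the expectation over $\mathbf{x}_{0}$ and $\bm{\epsilon}$ and integrating over $t\in[0,T]$ with the factor $\frac{1}{2}$ recovers the DDPM form from the NCSN form.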

The NCSN loss training is connected to the likelihood training in Song et al. (2021a) by

\displaystyle\mathbb{E}_{\mathbf{x}_{0}}[-\log{p_{0}^{\bm{\theta}}(\mathbf{x}_{0})}]\leq\mathcal{L}_{NCSN}(\bm{\theta};g^{2}),

(5)

when the weighting function is the square of the diffusion term as $\lambda(t)=g^{2}(t)$ , called the likelihood weighting.

In the family of linear SDEs, the gradient of the log transition probability satisfies $\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}=-\frac{\mathbf{x}_{t}-\mu(t)\mathbf{x}_{0}}{\sigma^{2}(t)}=-\frac{\mathbf{z}}{\sigma(t)}$ , where $\mathbf{x}_{t}$ is given by $\mu(t)\mathbf{x}_{0}+\sigma(t)\mathbf{z}$ with $\mathbf{z}\sim\mathcal{N}(0,\mathbf{I})$ . The denominator, $\sigma(t)$ , converges to zero as $t\rightarrow 0$ , which leads $\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}$ to diverge as $t\rightarrow 0$ , as illustrated in Figure 1-(a) (see Appendix A.2 for details). Therefore, the Monte-Carlo estimation of the NCSN loss suffers from high variance, which prevents stable training of the score network. In practice, therefore, previous research truncates the diffusion time range to $[\tau,T]$ , with a positive truncation hyperparameter, $\tau=\epsilon>0$ .

For the analysis for density estimation in Section 3.3, this section derives the variational bound of the log-likelihood when a diffusion model has a positive truncation because Inequality (5) holds only with zero truncation ( $\tau=0$ ). Lemma 1 provides a generalization of Inequality (5), proved by applying the data processing inequality (Gerchinovitz et al., 2020) and the Girsanov theorem (Pavon & Wakolbinger, 1991; Vargas et al., 2021; Song et al., 2021a).

Lemma 1. For any $\tau\in[0,T]$ ,

\displaystyle\mathbb{E}_{\mathbf{x}_{\tau}}\big{[}-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big{]}\leq\mathcal{L}(\bm{\theta};g^{2},\tau)

(6)

holds, where $\mathcal{L}(\bm{\theta};g^{2},\tau)=\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t$ , up to a constant, see Eq. (17).

Lemma 1 is a generalization of Inequality (5) in that Inequality (6) collapses to Inequality (5) under the zero truncation: $\mathcal{L}_{NCSN}(\bm{\theta};\lambda)=\mathcal{L}(\bm{\theta};\lambda,\tau=0)$ . If the time range is truncated to $[\tau,T]$ for $\tau\in[0,T]$ , then variational inference bounds the negative log-likelihood as

\displaystyle\mathbb{E}_{\mathbf{x}_{0}}\big{[}-\log{p_{0}^{\bm{\theta}}(\mathbf{x}_{0})}\big{]}\leq\mathbb{E}_{\mathbf{x}_{\tau}}\big{[}-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big{]}+R_{\tau}(\bm{\theta})

(7)

where

\displaystyle R_{\tau}(\bm{\theta})=\mathbb{E}_{\mathbf{x}_{0}}\bigg{[}\int p_{0\tau}(\mathbf{x}_{\tau}|\mathbf{x}_{0})\log{\frac{p_{0\tau}(\mathbf{x}_{\tau}|\mathbf{x}_{0})}{p_{\bm{\theta}}(\mathbf{x}_{0}|\mathbf{x}_{\tau})}}\mathop{}\!\mathrm{d}\mathbf{x}_{\tau}\bigg{]},

with $p_{\bm{\theta}}(\mathbf{x}_{0}|\mathbf{x}_{\tau})$ being the probability distribution of $\mathbf{x}_{0}$ given $\mathbf{x}_{\tau}$ and the score estimation with $\mathbf{s}_{\bm{\theta}}$ at $\tau$ . For any $\tau$ , we apply Lemma 1 to the right-hand-side of Inequality (7) to obtain the variational bound of the log-likelihood as

\displaystyle\mathbb{E}_{\mathbf{x}_{0}}\big{[}-\log{p_{0}^{\bm{\theta}}(\mathbf{x}_{0})}\big{]}\leq\mathcal{L}(\bm{\theta};g^{2},\tau)+R_{\tau}(\bm{\theta}).

(8)

To avoid the diverging issue introduced in Section 3.1, previous works in VPSDE (Song et al., 2021a; Vahdat et al., 2021) modify the loss by truncating the integration on $[\tau,T]$ with a fixed hyperparameter $\tau=\epsilon>0$ so that the score network does not estimate the score function on $[0,\epsilon)$ . Analogously, previous works in VESDE (Song et al., 2021b; Chen et al., 2022) approximate $\sigma_{VE}^{2}(t)\approx\sigma_{min}^{2}(\frac{\sigma_{max}}{\sigma_{min}})^{2t}$ to truncate the minimum variance of the transition probability to be $\sigma_{min}^{2}$ . Truncating diffusion time at $\epsilon$ in VPSDE is equivalent to truncating diffusion variance ( $\sigma_{min}^{2}$ ) in VESDE, so these two truncations on VE/VP SDEs have the identical effect on bounding the diffusion loss. Henceforth, this paper discusses truncating diffusion time (VPSDE) and truncating diffusion variance (VESDE) interchangeably.

Figure 1 illustrates the significance of truncation in the training of diffusion models. With the truncation of strictly positive $\epsilon=10^{-5}$ , Figure 1-(a) shows that the integrand of $\mathcal{L}(\bm{\theta};g^{2},\tau)$ in the Bits-Per-Dimension (BPD) scale is still extremely imbalanced. It turns out that such extreme imbalance appears to be a universal phenomenon in training a diffusion model, and this phenomenon lasts from the beginning to the end of training.

Figure 1-(b) with the green line presents the variational bound of the log-likelihood (right-hand-side of Inequality (8)) on the $y$ -axis, and it indicates that the variational bound is sharply decreasing near the small diffusion time. Therefore, if $\epsilon$ is insufficiently small, the variational bound is not tight to the log-likelihood, and a diffusion model fails at MLE training. In addition, Figure 2 indicates that insufficiently small $\epsilon$ (or $\sigma_{min}$ ) would also harm the microscopic sample quality. From these observations, $\epsilon$ becomes a significant hyperparameter that needs to be selected carefully.

Figure 1-(c) reports test performances on density estimation. It illustrates that both Negative Evidence Lower Bound (NELBO) and NLL monotonically decrease as $\epsilon$ is lowered, because small diffusion time contributes most of the NELBO at test time as well as at training time. Therefore, it could be a common strategy to reduce $\epsilon$ as much as possible to reduce test NELBO/NLL.

Table 1: Ablation on $\sigma_{min}$ (CIFAR-10).

$\sigma_{min}$	NLL ( $\downarrow$ )	FID-10k ( $\downarrow$ )
$10^{-2}$	4.95	6.95
$10^{-3}$	3.04	7.04
$10^{-4}$	2.99	8.17
$10^{-5}$	2.97	8.29

On the contrary, $\epsilon$ has a counter effect on FID. Table 1, trained on CIFAR-10 (Krizhevsky et al., 2009) with NCSN++ (Song et al., 2021b), shows that FID worsens as we take a smaller hyperparameter $\sigma_{min}$ for training. It is the range of small diffusion time that significantly contributes to the variational bound in the blue line of Figure 1-(b), so the score network with a small truncation hyperparameter, $\sigma_{min}$ or $\epsilon$ , remains unoptimized on large diffusion time. Through the lens of Figure 2, therefore, the inconsistent result of Table 1 is attributed to the inaccurate score on large diffusion time.

Table 2: FID-10k scores.

$\sigma_{min}$	$10^{-3}$	$10^{-4}$	$10^{-5}$
$\sigma_{tr}=1$	6.84	8.04	8.29

We design an experiment to validate the above argument in Table 2. This experiment utilizes two types of score networks: 1) three alternative networks (A) with diverse $\sigma_{min}\in\{10^{-3},10^{-4},10^{-5}\}$ trained in the Table 1 experiment; 2) a network (B) with $\sigma_{min}=10^{-5}$ (the last row of Table 1). With these score networks, we denoise the noise with one of the type-1 networks (A) from $\sigma_{max}$ to a common and fixed $\sigma_{tr}(=1)$ , and we then use B to further denoise from $\sigma_{tr}$ to $\sigma_{min}=10^{-5}$ . This further denoising step with model B enables us to compare the score accuracy on large diffusion time for models with diverse truncation hyperparameters in a fair resolution setting. Table 2 shows that the model with $\sigma_{min}=10^{-3}$ has the best FID, implying that training with too small a truncation harms the sample fidelity.

Specifically, Figure 4 shows the Euclidean norm of $g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)$ , where each dot represents a Monte-Carlo sample from $p_{t}(\mathbf{x}_{t})$ . Here, $g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)$ is in the reverse drift term of the generative process, $\mathop{}\!\mathrm{d}\mathbf{x}_{t}^{\bm{\theta}}=[\mathbf{f}(\mathbf{x}_{t}^{\bm{\theta}},t)-g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t}^{\bm{\theta}},t)]\mathop{}\!\mathrm{d}\bar{t}+g(t)\mathop{}\!\mathrm{d}\mathbf{\bar{w}}_{t}$ . Figure 4 illustrates that it is the large diffusion time that dominates the sampling process. Therefore, a precise score network on large diffusion time is particularly important in sample generation.

The imprecise score mainly affects the global sample context, as the denoising on small diffusion time only crafts the image in its microscopic details, as illustrated in Figures 3 and 5. Figure 3 shows how the global fidelity is damaged: a man synthesized in the second row has unrealistic curly hair on his forehead, constructed on the large diffusion time. Figure 5 further underlines the importance of learning a good score estimation on large diffusion time. It shows the samples regenerated by solving the generative process backward in time, starting from $\mathbf{x}_{\tau}$ (Meng et al., 2021).

As in Section 3, the choice of $\epsilon$ is crucial for training and evaluation, but it is computationally infeasible to search for the optimal $\epsilon$ . Therefore, we introduce a training technique that largely mitigates the need for $\epsilon$ -search by softening the fixed truncation hyperparameter into a truncation random variable so that the truncation time varies in every optimization step. Our approach successfully trains the score network on large diffusion time without sacrificing NLL. We explain the Monte-Carlo estimation of the variational bound in Section 4.1, which is the common practice of previous research but is explained here to emphasize how simple (though effective) Soft Truncation is, and we subsequently introduce Soft Truncation in Section 4.2.

In this section, we fix a truncation hyperparameter to be $\tau=\epsilon$ . For every batch $\{\mathbf{x}_{0}^{(b)}\}_{b=1}^{B}$ , the Monte-Carlo estimation of the variational bound in Inequality (6) is $\mathcal{L}(\bm{\theta};g^{2},\epsilon)\approx\mathcal{\hat{L}}(\bm{\theta};g^{2},\epsilon)=\frac{1}{2B}\sum_{b=1}^{B}g^{2}(t^{(b)})\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t^{(b)}},t^{(b)})-\nabla\log{p_{0t^{(b)}}(\mathbf{x}_{t^{(b)}}|\mathbf{x}_{0})}\|_{2}^{2}$ , up to a constant irrelevant to $\bm{\theta}$ , where $\mathbf{x}_{t^{(b)}}=\mu(t^{(b)})\mathbf{x}_{0}+\sigma(t^{(b)})\bm{\epsilon}^{(b)}$ with $\{t^{(b)}\}_{b=1}^{B}$ and $\{\bm{\epsilon}^{(b)}\}_{b=1}^{B}$ being the corresponding Monte-Carlo samples from $t^{(b)}\sim[\epsilon,T]$ and $\bm{\epsilon}^{(b)}\sim\mathcal{N}(0,\mathbf{I})$ , respectively. Note that this Monte-Carlo estimation is tractable thanks to the analytic form of the transition probability, $\nabla\log{p_{0t^{(b)}}(\mathbf{x}_{t^{(b)}}|\mathbf{x}_{0})}=\frac{\bm{\epsilon}^{(b)}}{\sigma(t^{(b)})}$ , under linear SDEs.

Previous works (Song et al., 2021a; Huang et al., 2021) apply the importance sampling with the importance distribution of $p_{iw}(t)=\frac{g^{2}(t)/\sigma^{2}(t)}{Z_{\epsilon}}1_{[\epsilon,T]}(t)$ , where $Z_{\epsilon}=\int_{\epsilon}^{T}\frac{g^{2}(t)}{\sigma^{2}(t)}\mathop{}\!\mathrm{d}t$ . It is well known (Goodfellow et al., 2016) that the Monte-Carlo variance of $\hat{\mathcal{L}}$ is minimum if the importance distribution is $p_{iw}^{*}(t)\propto g^{2}(t)L(t)$ with $L(t)=\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}]$ , but sampling of Monte-Carlo diffusion time from $p_{iw}^{*}(t)$ at every training iteration would incur $2\times$ slower training speed, at least, because the importance sampling requires the score evaluation. Therefore, previous research approximates $L(t)$ by $\hat{L}(t)=\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}[\|\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}]\propto 1/\sigma^{2}(t)$ , and $p_{iw}(t)$ becomes the approximate importance weight. This approximation, at the expense of bias, is cheap because the closed-form of the inverse Cumulative Distribution Function (CDF) is known. Unless we train the variance directly as in Kingma et al. (2021), we believe $p_{iw}(t)$ is the maximally efficient sampler as long as the training speed matters. The importance weighted Monte-Carlo estimation becomes

	$\displaystyle\mathcal{L}(\bm{\theta};g^{2},\epsilon)$
	$\displaystyle=\frac{Z_{\epsilon}}{2}\int_{\epsilon}^{T}p_{iw}(t)\sigma^{2}(t)\mathbb{E}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t$
	$\displaystyle\approx\frac{Z_{\epsilon}}{2B}\sum_{b=1}^{B}\sigma^{2}(t_{iw}^{(b)})\bigg{\|}\mathbf{s}_{\bm{\theta}}\big{(}\mathbf{x}_{t_{iw}^{(b)}},t_{iw}^{(b)}\big{)}-\frac{\bm{\epsilon}^{(b)}}{\sigma(t_{iw}^{(b)})}\bigg{\|}_{2}^{2}$
	$\displaystyle:=\mathcal{\hat{L}}_{iw}(\bm{\theta};g^{2},\epsilon),$		(9)

where $\{t_{iw}^{(b)}\}_{b=1}^{B}$ is the Monte-Carlo sample from the importance distribution, i.e., $t_{iw}^{(b)}\sim p_{iw}(t)\propto\frac{g^{2}(t)}{\sigma^{2}(t)}$ .
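For VPSDE with the linear schedule $\beta(t)$ , the importance distribution admits a closed-form inverse CDF because $g^{2}(t)/\sigma^{2}(t)=\beta(t)/(1-e^{-B(t)})$ with $B(t)=\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s$ has the antiderivative $\log(e^{B(t)}-1)$ . The Python sketch below samples $t_{iw}^{(b)}\sim p_{iw}(t)$ on $[\texttt{lo},T]$ under that assumption; the schedule values and all function names are hypothetical and chosen only for illustration.

```python
import math
import random

BETA_MIN, BETA_MAX = 0.1, 20.0  # hypothetical VPSDE schedule, illustration only


def integrated_beta(t):
    """B(t) = int_0^t beta(s) ds for beta(t) = BETA_MIN + t (BETA_MAX - BETA_MIN)."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2


def antiderivative(t):
    """F(t) = log(exp(B(t)) - 1), whose derivative is g^2(t) / sigma^2(t)."""
    return math.log(math.expm1(integrated_beta(t)))


def normalizer(lo, T=1.0):
    """Z_lo = int_lo^T g^2(t) / sigma^2(t) dt."""
    return antiderivative(T) - antiderivative(lo)


def inverse_cdf(u, lo, T=1.0):
    """Map u in [0, 1] to the time t in [lo, T] whose CDF value under p_iw is u."""
    c = antiderivative(lo) + u * normalizer(lo, T)
    b = math.log1p(math.exp(c))  # solve log(exp(B) - 1) = c for B
    delta = BETA_MAX - BETA_MIN
    # Positive root of 0.5 * delta * t^2 + BETA_MIN * t - b = 0.
    return (-BETA_MIN + math.sqrt(BETA_MIN ** 2 + 2.0 * delta * b)) / delta


def sample_times(batch_size, lo, T=1.0, rng=random):
    """Draw a batch of importance-weighted diffusion times on [lo, T]."""
    return [inverse_cdf(rng.random(), lo, T) for _ in range(batch_size)]
```

For instance, `sample_times(128, lo=1e-5)` draws one batch of diffusion times for a fixed truncation $\epsilon=10^{-5}$ ; passing a larger `lo` restricts the same sampler to $[\texttt{lo},T]$ .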

The importance sampling is advantageous in both NLL and FID (Song et al., 2021a) over the uniform sampling, as the importance sampling significantly reduces the estimation variance. Figure 6-(a) illustrates the sample-by-sample loss, and the importance sampling significantly mitigates the loss scale by diffusion time compared to the scale in Figure 1-(a). However, the importance distribution satisfies $p_{iw}(t)\rightarrow\infty$ as $t\rightarrow 0$ (blue line of Figure 6-(c)), and most of the importance weighted Monte-Carlo time is concentrated at $t\approx\epsilon$ in Figure 7. Hence, the use of the importance sampling has a trade-off between the reduced variance (Figure 6-(a)) and the over-sampled diffusion time near $t\approx\epsilon$ (Figure 7). Therefore, the inaccurate score estimation on large diffusion time appears regardless of the sampling strategy, and resolving this under-trained score estimation becomes a nontrivial task.

Instead of the likelihood weighting, previous works (Ho et al., 2020; Nichol & Dhariwal, 2021; Dhariwal & Nichol, 2021) train the denoising score loss with the variance weighting, $\lambda(t)=\sigma^{2}(t)$ . With this weighting, the importance distribution becomes the uniform distribution, $p_{iw}(t)=\frac{\lambda(t)}{\sigma^{2}(t)}\equiv 1$ , so it significantly alleviates the trade-off of using the likelihood weighting. However, the variance weighting favors FID at the expense of NLL because the loss is no longer the variational bound of the log-likelihood. In contrast, training with the likelihood weighting leans towards NLL rather than FID, so Soft Truncation aims at a balanced NLL and FID while using the likelihood weighting.

Soft Truncation relaxes the truncation hyperparameter from a static value to a random variable with a probability distribution of $\mathbb{P}(\tau)$ . In every mini-batch update, Soft Truncation optimizes the diffusion model with $\mathcal{\hat{L}}_{iw}(\bm{\theta};g^{2},\tau)$ in Eq. (9) for a sampled $\tau\sim\mathbb{P}(\tau)$ . In other words, for every batch $\{\mathbf{x}_{0}^{(b)}\}_{b=1}^{B}$ , Soft Truncation optimizes the Monte-Carlo loss

\displaystyle\mathcal{\hat{L}}_{iw}(\bm{\theta};g^{2},\tau)=\frac{Z_{\tau}}{2B}\sum_{b=1}^{B}\sigma^{2}(t_{iw}^{(b)})\bigg{\|}\mathbf{s}_{\bm{\theta}}\big{(}\mathbf{x}_{t_{iw}^{(b)}},t_{iw}^{(b)}\big{)}-\frac{\bm{\epsilon}^{(b)}}{\sigma(t_{iw}^{(b)})}\bigg{\|}_{2}^{2}

with $\{t_{iw}^{(b)}\}_{b=1}^{B}$ sampled from the importance distribution of $p_{iw,\tau}(t)=\frac{g^{2}(t)/\sigma^{2}(t)}{Z_{\tau}}1_{[\tau,T]}(t)$ , where $Z_{\tau}:=\int_{\tau}^{T}\frac{g^{2}(t)}{\sigma^{2}(t)}\mathop{}\!\mathrm{d}t$ .
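One Soft Truncation loss evaluation can be sketched as follows for scalar data under VPSDE. This is an illustrative sketch, not the authors' implementation: the log-uniform prior in `sample_tau`, the schedule values, and every function name are assumptions. The regression target uses the sign convention of Section 3.1, $\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}=-\mathbf{z}/\sigma(t)$ .

```python
import math
import random

BETA_MIN, BETA_MAX = 0.1, 20.0  # hypothetical VPSDE schedule, illustration only


def integrated_beta(t):
    """B(t) = int_0^t beta(s) ds for the linear schedule."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2


def mean_std(t):
    """(mu(t), sigma(t)) of the VPSDE transition kernel."""
    b = integrated_beta(t)
    return math.exp(-0.5 * b), math.sqrt(-math.expm1(-b))


def antiderivative(t):
    """F(t) with F'(t) = g^2(t) / sigma^2(t)."""
    return math.log(math.expm1(integrated_beta(t)))


def sample_time(u, lo, T):
    """Inverse-CDF draw from p_{iw,lo}(t) on [lo, T] for u in [0, 1)."""
    c = antiderivative(lo) + u * (antiderivative(T) - antiderivative(lo))
    b = math.log1p(math.exp(c))
    delta = BETA_MAX - BETA_MIN
    return (-BETA_MIN + math.sqrt(BETA_MIN ** 2 + 2.0 * delta * b)) / delta


def sample_tau(eps, T, rng):
    """Assumed truncation prior P(tau): log-uniform on [eps, T]."""
    return eps * (T / eps) ** rng.random()


def soft_truncation_loss(score_fn, x0_batch, eps=1e-5, T=1.0, rng=random):
    """Return (tau, loss): one tau per mini-batch, times drawn on [tau, T]."""
    tau = sample_tau(eps, T, rng)
    z_tau = antiderivative(T) - antiderivative(tau)
    total = 0.0
    for x0 in x0_batch:
        t = sample_time(rng.random(), tau, T)
        mu, sigma = mean_std(t)
        noise = rng.gauss(0.0, 1.0)
        x_t = mu * x0 + sigma * noise
        target = -noise / sigma  # score of the Gaussian transition kernel
        total += sigma ** 2 * (score_fn(x_t, t) - target) ** 2
    return tau, z_tau * total / (2.0 * len(x0_batch))
```

The only difference from the fixed-truncation estimator is the first line of `soft_truncation_loss`: replacing `sample_tau(eps, T, rng)` with the constant `eps` gives the importance weighted loss with a static truncation.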

Soft Truncation resolves the oversampling issue of diffusion time near $t\approx\epsilon$ , meaning that Monte-Carlo time is not concentrated on $\epsilon$ anymore. Figure 7 illustrates the quantiles of importance weighted Monte-Carlo time with Soft Truncation under $\tau=\epsilon$ and $\tau=0.1$ . The score network is trained more equally on diffusion time when $\tau=0.1$ , and as a consequence, the loss imbalance issue in each training step is also alleviated as in Figure 6-(b) with purple dots. This limited range of $[\tau,T]$ provides a chance to learn a score network more balanced on diffusion time. As $\tau$ is softened, such truncation level will vary by mini-batch updates: see the loss scales change by blue, green, red, and purple dots according to various $\tau$ s in Figure 6-(b). Eventually, the softened $\tau$ will provide a fair chance to learn the score network from small as well as large diffusion time.

In the original diffusion model, the loss estimation, $\mathcal{\hat{L}}(\bm{\theta};g^{2},\epsilon)$ , is just a batch-wise approximation of a population loss, $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ . However, the target population loss of Soft Truncation, $\mathcal{L}(\bm{\theta};g^{2},\tau)$ , depends on a random variable $\tau$ , so the target population loss itself becomes a random variable. Therefore, we derive the expected Soft Truncation loss to reveal the connection to the original diffusion model:

	$\displaystyle\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}):=\mathbb{E}_{\mathbb{P}(\tau)}\big{[}\mathcal{L}(\bm{\theta};g^{2},\tau)\big{]}$
	$\displaystyle\quad=\frac{1}{2}\int_{\epsilon}^{T}\mathbb{P}(\tau)\int_{\tau}^{T}g^{2}(t)\mathbb{E}\big{[}\|\mathbf{s}_{\bm{\theta}}-\nabla\log{p_{0t}}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t\mathop{}\!\mathrm{d}\tau$
	$\displaystyle\quad=\frac{1}{2}\int_{\epsilon}^{T}g^{2}_{\mathbb{P}}(t)\mathbb{E}\big{[}\|\mathbf{s}_{\bm{\theta}}-\nabla\log{p_{0t}}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t,$

up to a constant, where $g^{2}_{\mathbb{P}}(t)=\big{(}\int_{0}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\big{)}g^{2}(t)$ , by exchanging the order of integration. Therefore, we conclude that Soft Truncation reduces to a diffusion model with a general weight of $g_{\mathbb{P}}^{2}(t)$ (see Appendix A.3):

\displaystyle\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P})=\mathcal{L}(\bm{\theta};g^{2}_{\mathbb{P}},\epsilon).

(10)

As explained in Section 4.3, Soft Truncation is a diffusion model with a general weight, in the expected sense. Conversely, this section analyzes a diffusion model with a general weight from the viewpoint of Soft Truncation. Suppose we have a general weight $\lambda$ . Theorem 1 implies that this general weighted diffusion loss, $\mathcal{L}(\bm{\theta};\lambda,\epsilon)$ , is the variational bound of the perturbed KL divergence averaged over $\mathbb{P}_{\lambda}(\tau)$ . Theorem 1 collapses to Lemma 1 if $\lambda(t)=cg^{2}(t)$ for any $c>0$ : in that case, the probability satisfies $\mathbb{P}([a,b])=1_{[a,b]}(\epsilon)$ , which is a point mass at $\epsilon$ . See Appendix B for the detailed statement and proof.

Theorem 1. Suppose $\frac{\lambda(t)}{g^{2}(t)}$ is a nondecreasing and nonnegative absolutely continuous function on $[\epsilon,T]$ and zero on $[0,\epsilon)$ . For the probability defined by

\displaystyle\mathbb{P}_{\lambda}([a,b])=\bigg{[}\int_{\text{max}(a,\epsilon)}^{b}\Big{(}\frac{\lambda(s)}{g^{2}(s)}\Big{)}^{\prime}\mathop{}\!\mathrm{d}s+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}1_{[a,b]}(\epsilon)\bigg{]}\bigg{/}Z,

where $Z=\frac{\lambda(T)}{g^{2}(T)}$ ; up to a constant, the variational bound of the general weighted diffusion loss becomes

	$\displaystyle\quad\mathbb{E}_{\mathbb{P}_{\lambda}(\tau)}\big{[}D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})\big{]}$
			$\displaystyle\leq\frac{1}{2Z}\int_{\epsilon}^{T}\lambda(t)\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t$
			$\displaystyle=\frac{1}{Z}\mathcal{L}(\bm{\theta};\lambda,\epsilon)=\mathbb{E}_{\mathbb{P}_{\lambda}(\tau)}\big{[}\mathcal{L}(\bm{\theta};g^{2},\tau)\big{]}.$
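
As a consistency check of the statement above (a sketch, assuming additionally that $\mathbb{P}$ admits a density supported on $[\epsilon,T]$ ), take the weight $\lambda=g^{2}_{\mathbb{P}}$ from Section 4.3. Then $\frac{\lambda(t)}{g^{2}(t)}=\int_{\epsilon}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau$ is nondecreasing, nonnegative, and absolutely continuous, with $\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}=0$ and $Z=\frac{\lambda(T)}{g^{2}(T)}=1$ , so the point mass at $\epsilon$ vanishes and

\displaystyle\mathbb{P}_{\lambda}([a,b])=\int_{\max(a,\epsilon)}^{b}\Big{(}\int_{\epsilon}^{s}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\Big{)}^{\prime}\mathop{}\!\mathrm{d}s=\int_{\max(a,\epsilon)}^{b}\mathbb{P}(s)\mathop{}\!\mathrm{d}s=\mathbb{P}([a,b]).

In this case the construction recovers the truncation distribution that generated the weight, and the last equality of the bound reads $\mathcal{L}(\bm{\theta};g^{2}_{\mathbb{P}},\epsilon)=\mathbb{E}_{\mathbb{P}(\tau)}\big{[}\mathcal{L}(\bm{\theta};g^{2},\tau)\big{]}$ , in agreement with Eq. (10).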

The meaning of Soft Truncation becomes clearer in view of Theorem 1. Instead of training with the general weighted diffusion loss, $\mathcal{L}(\bm{\theta};\lambda,\epsilon)$ , we optimize the truncated variational bound, $\mathcal{L}(\bm{\theta};g^{2},\tau)$ . This truncated loss upper bounds the perturbed KL divergence, $D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})$ , by Lemma 1, and Figure 1-(c) indicates that Inequality (6) is nearly tight. Therefore, Soft Truncation can be interpreted as Maximum Perturbed Likelihood Estimation (MPLE), where the perturbation level is a random variable. Soft Truncation is not MLE training because Inequality (8) is not tight, as demonstrated in Figure 1-(b), unless $\tau$ is sufficiently small.

Conventional wisdom is to minimize the loss variance whenever possible for stable training. However, some optimization methods in the deep learning era (e.g., stochastic gradient descent) deliberately add noise to the loss function, which eventually helps escape from local optima. Soft Truncation belongs to this family of optimization methods: it inflates the loss variance by intentionally imposing auxiliary randomness on the loss estimation. This randomness is represented by the outermost expectation, $\mathbb{E}_{\mathbb{P}_{\lambda}(\tau)}$ , which controls the diffusion time range in each batch. Additionally, the loss with a sampled $\tau$ is a proxy of the KL divergence perturbed by $\tau$ , so the auxiliary randomness on the loss estimation is theoretically tamed, meaning that it is not an arbitrary perturbation.

We parametrize the probability distribution of $\tau$ by

\displaystyle\mathbb{P}_{k}(\tau)=\frac{1/\tau^{k}}{Z_{k}}1_{[\epsilon,T]}(\tau)\propto\frac{1}{\tau^{k}},

(11)

where $Z_{k}=\int_{\epsilon}^{T}\frac{1}{\tau^{k}}\mathop{}\!\mathrm{d}\tau$ with a sufficiently small truncation hyperparameter $\epsilon$ . Note that it is still beneficial to keep $\epsilon$ strictly positive because a batch update with $\tau\approx 0<\epsilon$ would drift the score network away from the optimal point. Figure 6-(c) illustrates the importance distribution of $\lambda_{\mathbb{P}_{k}}$ for varying $k$ . From the definition of Eq. (11), $\mathbb{P}_{k}(\tau)\rightarrow\delta_{\epsilon}(\tau)$ as $k\rightarrow\infty$ , and this limiting delta distribution corresponds to the original diffusion model with the likelihood weighting. Figure 6-(c) shows that the importance distribution of $\mathbb{P}_{k}$ with finite $k$ interpolates between the likelihood weighting and the variance weighting.
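
Sampling from $\mathbb{P}_{k}$ in Eq. (11) can be done in closed form by inverting its cumulative distribution function. The sketch below is one possible way to do this, not necessarily the one used in the released code; the function names `sample_tau` and `tau_cdf` are hypothetical. For $k\neq 1$ the CDF is $(\tau^{1-k}-\epsilon^{1-k})/(T^{1-k}-\epsilon^{1-k})$ , and for $k=1$ it is $\log(\tau/\epsilon)/\log(T/\epsilon)$ .

```python
import numpy as np


def sample_tau(k, eps, T, size, rng):
    """Draw tau from P_k(tau), proportional to tau^(-k) on [eps, T], by inverse CDF."""
    u = rng.uniform(size=size)
    if np.isclose(k, 1.0):
        # k = 1: the CDF is log(tau / eps) / log(T / eps).
        return eps * (T / eps) ** u
    a, b = eps ** (1.0 - k), T ** (1.0 - k)
    # k != 1: the CDF is (tau^(1-k) - a) / (b - a).
    return (a + u * (b - a)) ** (1.0 / (1.0 - k))


def tau_cdf(t, k, eps, T):
    """CDF of P_k at t; multiplying it by g^2(t) gives the weight g_P^2(t) of Section 4.3."""
    t = np.clip(t, eps, T)
    if np.isclose(k, 1.0):
        return np.log(t / eps) / np.log(T / eps)
    a, b = eps ** (1.0 - k), T ** (1.0 - k)
    return (t ** (1.0 - k) - a) / (b - a)
```

For example, `sample_tau(1.0, 1e-5, 1.0, 1, np.random.default_rng(0))` draws a single truncation level for one mini-batch. The closed form is only intended for moderate $k$ ; for very large $k$ the power $\epsilon^{1-k}$ overflows, and the samples collapse to $\epsilon$ anyway.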

With the current simple form, we experimentally find that the sweet spot is $k\approx 1.0$ in VPSDE and $k=2.0$ in VESDE with the emphasis on the sample quality. For VPSDE, the importance distribution in Figure 6-(c) is nearly equal to that of the variance weighting if $k\approx 1.0$ , so Soft Truncation with $k\approx 1.0$ improves the sample fidelity, while maintaining low NLL. On the other hand, if $k$ is too small, no $\tau$ will be sampled near $\epsilon$ , so it hurts both sample generation and density estimation. We leave further study on searching for the optimal distribution of $\tau$ as future work.

Figure 8: Soft Truncation improves FID on CelebA trained with UNCSN++ (RVE).

Dataset	Loss	Soft Truncation	NLL	NELBO	FID (ODE)
CIFAR-10	$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	✗	3.03	3.13	6.70
	$\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$	✗	3.21	3.34	3.90
	$\mathcal{L}(\bm{\theta};g_{\mathbb{P}_{1}}^{2},\epsilon)$	✗	3.06	3.18	6.11
	$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$	✓	3.01	3.08	3.96
	$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.9})$	✓	3.03	3.13	3.45
ImageNet32	$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	✗	3.92	3.94	12.68
	$\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$	✗	3.95	4.00	9.22
	$\mathcal{L}(\bm{\theta};g_{\mathbb{P}_{1}}^{2},\epsilon)$	✗	3.93	3.97	11.89
	$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.9})$	✓	3.90	3.91	8.42

SDE	Model	Loss	NLL	NELBO	FID
VE	NCSN++	$\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$	3.41	3.42	3.95	-
		$\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P}_{2})$	3.44	3.44	2.68	-
RVE	UNCSN++	$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	2.01	2.01	3.36	-
		$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{2})$	1.97	2.02	1.92	-
VP	DDPM++	$\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$	2.14	2.21	3.03	2.32
		$\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P}_{1})$	2.17	2.29	2.88	1.90
	UDDPM++	$\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$	2.11	2.20	3.23	4.72
		$\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P}_{1})$	2.16	2.28	2.22	1.94
	DDPM++	$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	2.00	2.09	5.31	3.95
		$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$	2.00	2.11	4.50	2.90
	UDDPM++	$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	1.98	2.12	4.65	3.98
		$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$	2.00	2.10	4.45	2.97

Table 5: Ablation study of Soft Truncation for various $\epsilon$ on CIFAR-10 with DDPM++ (VP).

Loss	$\epsilon$	NLL	NELBO	FID (ODE)
$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	$10^{-2}$	4.64	4.69	38.82
	$10^{-3}$	3.51	3.52	6.21
	$10^{-4}$	3.05	3.08	6.33
	$10^{-5}$	3.03	3.13	6.70
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$	$10^{-2}$	4.65	4.69	39.83
	$10^{-3}$	3.51	3.52	5.14
	$10^{-4}$	3.05	3.08	4.16
	$10^{-5}$	3.01	3.08	3.96

Table 6: Ablation study of Soft Truncation for various $\mathbb{P}_{k}$ on CIFAR-10 trained with DDPM++ (VP).

Loss	NLL	NELBO	FID (ODE)
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0})$	3.24	3.39	6.27
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.8})$	3.03	3.05	3.61
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.9})$	3.03	3.13	3.45
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$	3.01	3.08	3.96
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1.1})$	3.02	3.09	3.98
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1.2})$	3.03	3.09	3.98
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{2})$	3.01	3.10	6.31
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{3})$	3.02	3.09	6.54
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{\infty})=\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	3.01	3.09	6.70

Table 7: Ablation study of Soft Truncation for CIFAR-10 trained with DDPM++ when a diffusion is combined with a normalizing flow in INDM (Kim et al., 2022).

Loss	NLL	NELBO	FID (ODE)
INDM (VP, NLL)	2.98	2.98	6.01
INDM (VP, FID)	3.17	3.23	3.61
INDM (VP, NLL) + ST	3.01	3.02	3.88

Table 8: Performance comparisons on benchmark datasets. The boldfaced numbers present the best performance, and the underlined numbers present the second-best performance. We report NLL of DDPM++ on CIFAR-10, ImageNet32, and CelebA with the variational dequantization (Song et al., 2021a) to compare with the baselines in a fair setting.

Model	CIFAR10 ( $32\times 32$ )			ImageNet32 ( $32\times 32$ )			CelebA ( $64\times 64$ )		CelebA-HQ ( $256\times 256$ )	STL-10 ( $48\times 48$ )	
	NLL ( $\downarrow$ )	FID ( $\downarrow$ )	IS ( $\uparrow$ )	NLL	FID	IS	NLL	FID	FID	FID	IS
Likelihood-free Models
StyleGAN2-ADA+Tuning (Karras et al., 2020)	-	2.92	10.02	-	-	-	-	-	-	-	-
Styleformer (Park & Kim, 2022)	-	2.82	9.94	-	-	-	-	3.66	-	15.17	11.01
Likelihood-based Models
ARDM-Upscale 4 (Hoogeboom et al., 2021)	2.64	-	-	-	-	-	-	-	-	-	-
VDM (Kingma et al., 2021)	2.65	7.41	-	3.72	-	-	-	-	-	-	-
LSGM (FID) (Vahdat et al., 2021)	3.43	2.10	-	-	-	-	-	-	-	-	-
NCSN++ cont. (deep, VE) (Song et al., 2021b)	3.45	2.20	9.89	-	-	-	2.39	3.95	7.23	-	-
DDPM++ cont. (deep, sub-VP) (Song et al., 2021b)	2.99	2.41	9.57	-	-	-	-	-	-	-	-
DenseFlow-74-10 (Grcić et al., 2021)	2.98	34.90	-	3.63	-	-	1.99	-	-	-	-
ScoreFlow (VP, FID) (Song et al., 2021a)	3.04	3.98	-	3.84	8.34	-	-	-	-	-	-
Efficient-VDVAE (Hazami et al., 2022)	2.87	-	-	-	-	-	1.83	-	-	-	-
PNDM (Liu et al., 2022)	-	3.26	-	-	-	-	-	2.71	-	-	-
ScoreFlow (deep, sub-VP, NLL) (Song et al., 2021a)	2.81	5.40	-	3.76	10.18	-	-	-	-	-	-
Improved DDPM ( $L_{simple}$ ) (Nichol & Dhariwal, 2021)	3.37	2.90	-	-	-	-	-	-	-	-	-
UNCSN++ (RVE) + ST	3.04	2.33	10.11	-	-	-	1.97	1.92	7.16	7.71	13.43
DDPM++ (VP, FID) + ST	2.91	2.47	9.78	-	-	-	2.10	1.90	-	-	-
DDPM++ (VP, NLL) + ST	2.88	3.45	9.19	3.85	8.42	11.82	1.96	2.90	-	-	-

This section empirically studies our suggestions on benchmark datasets, including CIFAR-10 (Krizhevsky et al., 2009), ImageNet $32\times 32$ (Van Oord et al., 2016), STL-10 (Coates et al., 2011), CelebA (Liu et al., 2015) $64\times 64$ , and CelebA-HQ (Karras et al., 2018) $256\times 256$ . We downsize STL-10 from $96\times 96$ to $48\times 48$ following Jiang et al. (2021) and Park & Kim (2022).

Soft Truncation is a universal training technique independent of model architectures and diffusion strategies. In the experiments, we test Soft Truncation on various architectures, including vanilla NCSN++, DDPM++, Unbounded NCSN++ (UNCSN++), and Unbounded DDPM++ (UDDPM++). Also, Soft Truncation is applied to various diffusion SDEs, such as VESDE, VPSDE, and Reciprocal VESDE (RVESDE). Although we use continuous SDEs for the diffusion strategies, applying Soft Truncation to a discrete model, such as DDPM (Ho et al., 2020), is a straightforward extension of the continuous case. Appendix D enumerates the specifications of score architectures and SDEs.

From Figure 1-(c), a sweet spot of the hard threshold is $\epsilon=10^{-5}$ , below which NLL/NELBO no longer improve. As the diffusion model has no information on $[0,\epsilon)$ , we follow Kim et al. (2022) and use Inequality (7) for NLL computation and Inequality (8) for NELBO computation. Following Kim et al. (2022), we compute $\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{\epsilon})}$ , rather than $\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{0})}$ . It is common practice for continuous diffusion models (Song et al., 2021b, a; Dockhorn et al., 2022) to report their performance with $\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{0})}$ , but Kim et al. (2022) show that $\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{\epsilon})}$ differs from $\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{0})}$ by 0.05 in BPD scale when $\epsilon=10^{-5}$ , which is quite significant. We use the uniform dequantization (Theis et al., 2016) by default, unless otherwise noted. For sample generation, we use either the Predictor-Corrector (PC) sampler or the Ordinary Differential Equation (ODE) sampler (Song et al., 2021b). We denote $\mathcal{L}(\bm{\theta};\lambda,\epsilon)$ as the vanilla training with $\lambda$ -weighting, and $\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P})$ as the training by Soft Truncation with the truncation probability of $\mathbb{P}$ . We additionally denote $\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P})$ for updating the network by the variance weighted loss in each batch-wise update. We release our code at https://github.com/Kim-Dongjun/Soft-Truncation.

FID by Iteration Figure 8 plots the FID score (Heusel et al., 2017) on the $y$ -axis against training steps on the $x$ -axis. Figure 8 shows that Soft Truncation beats the vanilla training after 150k training iterations.

Ablation Studies Tables 3, 4, 5, and 6 show ablation studies on various weighting functions, model architectures and SDEs, $\epsilon$ s, and probability distributions of $\tau$ , respectively. See Appendix E.2. Table 3 shows that Soft Truncation matches or beats the vanilla training on all metrics. We highlight that Soft Truncation with $\mathbb{P}_{0.9}$ outperforms the FID-favorable model with the variance weighting with respect to FID on both CIFAR-10 and ImageNet32.

Beyond comparing with the pre-existing weighting functions, such as $\lambda=g^{2}$ or $\lambda=\sigma^{2}$ , Table 3 additionally reports the experimental result of a general weighting function, $\lambda=g_{\mathbb{P}_{1}}^{2}$ . From Eq. (10), Soft Truncation with $\mathbb{P}_{1}$ and the vanilla training with $\lambda=g_{\mathbb{P}_{1}}^{2}$ coincide in their loss functions on average, i.e., $\mathcal{L}(\bm{\theta};g_{\mathbb{P}_{1}}^{2},\epsilon)=\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$ . Thus, in the paired experiments, Soft Truncation can be considered an alternative way of estimating the same loss, and Table 3 implies that Soft Truncation gives better optimization than the vanilla method. This strongly suggests that Soft Truncation could be a default training method for a general weighted denoising diffusion loss.

Table 4 provides two implications. First, Soft Truncation particularly boosts FID while maintaining density estimation performance across score networks and diffusion strategies. Second, Table 4 shows that Soft Truncation is effective on CelebA even when we apply Soft Truncation to the variance weighting, i.e., $\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P})$ , but we find that this does not hold on CIFAR-10 and ImageNet32. We leave further investigation of this discrepancy to future work.

Table 5 shows contrasting trends for the vanilla training and Soft Truncation. An inverse correlation appears between NLL and FID in the vanilla training, but Soft Truncation monotonically reduces both NLL and FID as $\epsilon$ decreases. This implies that Soft Truncation significantly reduces the effort of the $\epsilon$ search. Table 6 studies the effect of the probability distribution of $\tau$ in VPSDE. It shows that Soft Truncation significantly improves FID over the experiment of $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ on the range of $0.8\leq k\leq 1.2$ . Finally, Table 7 shows that Soft Truncation also works with a nonlinear forward SDE (Kim et al., 2022), so the scope of Soft Truncation is not limited to a family of linear SDEs.

Quantitative Comparison to SOTA Table 8 compares Soft Truncation (ST) against the current best generative models. It shows that Soft Truncation achieves state-of-the-art sample generation performance on CIFAR-10, CelebA, CelebA-HQ, and STL-10, while keeping NLL intact. In particular, we have experimented thoroughly on the CelebA dataset, and we find that Soft Truncation improves on the previous best FID scores by a large margin. Soft Truncation with DDPM++ achieves an FID of 1.90, which improves on the previous best FID of 2.92 by DDGM. Also, Soft Truncation significantly improves FID on STL-10.

This paper proposes a generally applicable training method for diffusion models. The suggested training method, Soft Truncation, is motivated by the observation that density estimation depends mostly on small diffusion time, while sample generation depends mostly on large diffusion time. However, small diffusion time dominates the Monte-Carlo estimation of the loss function, so this imbalanced contribution prevents accurate score learning on large diffusion time. Soft Truncation softens the truncation level at each mini-batch update, and this simple modification is connected to the general weighted diffusion loss and the concept of Maximum Perturbed Likelihood Estimation.

This research was supported by AI Technology Development for Commonsense Extraction, Reasoning, and Inference from Heterogeneous Data(IITP) funded by the Ministry of Science and ICT(2022-0-00077). We thank Jaeyoung Byeon and Daehan Park for their fruitful mathematical advice, and Byeonghu Na for his support of the experiments.

Anderson (1982) Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
Chen et al. (2018) Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.
Chen et al. (2022) Chen, T., Liu, G.-H., and Theodorou, E. Likelihood training of schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations, 2022.
Coates et al. (2011) Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223. JMLR Workshop and Conference Proceedings, 2011.
Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
Dockhorn et al. (2022) Dockhorn, T., Vahdat, A., and Kreis, K. Score-based generative modeling with critically-damped langevin diffusion. International Conference on Learning Representations, 2022.
Evans (1998) Evans, L. C. Partial differential equations. Graduate studies in mathematics, 19(2), 1998.
Gerchinovitz et al. (2020) Gerchinovitz, S., Ménard, P., and Stoltz, G. Fano’s inequality for random variables. Statistical Science, 35(2):178–201, 2020.
Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT press, 2016.
Grcić et al. (2021) Grcić, M., Grubišić, I., and Šegvić, S. Densely connected normalizing flows. Advances in Neural Information Processing Systems, 34, 2021.
Hazami et al. (2022) Hazami, L., Mama, R., and Thurairatnam, R. Efficient-vdvae: Less is more. arXiv preprint arXiv:2203.13751, 2022.
Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Ho et al. (2019) Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pp. 2722–2730. PMLR, 2019.
Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
Hoogeboom et al. (2021) Hoogeboom, E., Gritsenko, A. A., Bastings, J., Poole, B., Berg, R. v. d., and Salimans, T. Autoregressive diffusion models. arXiv preprint arXiv:2110.02037, 2021.
Huang et al. (2021) Huang, C.-W., Lim, J. H., and Courville, A. C. A variational perspective on diffusion-based generative models and score matching. Advances in Neural Information Processing Systems, 34, 2021.
Jiang et al. (2021) Jiang, Y., Chang, S., and Wang, Z. Transgan: Two pure transformers can make one strong gan, and that can scale up. Advances in Neural Information Processing Systems, 34, 2021.
Karras et al. (2018) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
Karras et al. (2019) Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.
Karras et al. (2020) Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33:12104–12114, 2020.
Kim et al. (2022) Kim, D., Na, B., Kwon, S. J., Lee, D., Kang, W., and Moon, I.-C. Maximum likelihood training of implicit nonlinear diffusion models. arXiv preprint arXiv:2205.13699, 2022.
Kingma et al. (2021) Kingma, D. P., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. In Advances in Neural Information Processing Systems, 2021.
Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
Liu et al. (2022) Liu, L., Ren, Y., Lin, Z., and Zhao, Z. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738, 2015.
Meng et al. (2021) Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.
Nichol & Dhariwal (2021) Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. PMLR, 2021.
Oksendal (2013) Oksendal, B. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013.
Park & Kim (2022) Park, J. and Kim, Y. Styleformer: Transformer based generative adversarial networks with style vector. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2022.
Parmar et al. (2022) Parmar, G., Zhang, R., and Zhu, J.-Y. On buggy resizing libraries and surprising subtleties in fid calculation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2022.
Parmar et al. (2018) Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D. Image transformer. In International Conference on Machine Learning, pp. 4055–4064. PMLR, 2018.
Pavon & Wakolbinger (1991) Pavon, M. and Wakolbinger, A. On free energy, stochastic control, and schrödinger processes. In Modeling, Estimation and Control of Systems with Uncertainty, pp. 334–348. Springer, 1991.
Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
Song & Ermon (2020) Song, Y. and Ermon, S. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
Song et al. (2021a) Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34, 2021a.
Song et al. (2021b) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b.
Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
Theis et al. (2016) Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR 2016), pp. 1–10, 2016.
Vahdat & Kautz (2020) Vahdat, A. and Kautz, J. Nvae: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems, 33:19667–19679, 2020.
Vahdat et al. (2021) Vahdat, A., Kreis, K., and Kautz, J. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34, 2021.
Van Oord et al. (2016) Van Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In International Conference on Machine Learning, pp. 1747–1756. PMLR, 2016.
Vargas et al. (2021) Vargas, F., Thodoroff, P., Lamacraft, A., and Lawrence, N. Solving schrödinger bridges via maximum likelihood. Entropy, 23(9):1134, 2021.
Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 681–688. Citeseer, 2011.

Kim et al. (2022) have classified linear SDEs as

\displaystyle\mathop{}\!\mathrm{d}\mathbf{x}_{t}=-\frac{1}{2}\beta(t)\mathbf{x}_{t}\mathop{}\!\mathrm{d}t+g(t)\mathop{}\!\mathrm{d}\mathbf{w}_{t},

(12)

where $\beta:\mathbb{R}\rightarrow\mathbb{R}_{\geq 0}$ and $g:\mathbb{R}\rightarrow\mathbb{R}_{\geq 0}$ are real-valued functions. VESDE has $\beta(t)\equiv 0$ and $g(t)=\sqrt{\mathop{}\!\mathrm{d}\sigma^{2}(t)/\mathop{}\!\mathrm{d}t}=\sigma_{min}(\frac{\sigma_{max}}{\sigma_{min}})^{t}\sqrt{2\log{\frac{\sigma_{max}}{\sigma_{min}}}}$ , where $\sigma_{min}$ and $\sigma_{max}$ are the minimum/maximum perturbation variances, respectively. It has the transition probability of

\displaystyle p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\mu_{VE}(t)\mathbf{x}_{0},\sigma_{VE}^{2}(t)\mathbf{I}),

where $\mu_{VE}(t)\equiv 1$ and $\sigma_{VE}^{2}(t):=\sigma_{min}^{2}[(\frac{\sigma_{max}}{\sigma_{min}})^{2t}-1]$ . VPSDE has $\beta(t)=\beta_{min}+(\beta_{max}-\beta_{min})t$ and $g(t)=\sqrt{\beta(t)}$ with the transition probability of

\displaystyle p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\mu_{VP}(t)\mathbf{x}_{0},\sigma_{VP}^{2}(t)\mathbf{I}),

where $\mu_{VP}(t)=e^{-\frac{1}{2}\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}$ and $\sigma_{VP}^{2}(t)=1-e^{-\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}$ .

Analogous to VE/VP SDEs, the transition probability of the generic linear SDE of Eq. (12) is a Gaussian distribution of $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t}|\mu(t)\mathbf{x}_{0},\sigma^{2}(t)\mathbf{I})$ , where its mean and covariance functions are characterized as a system of ODEs of

	$\displaystyle\frac{\mathop{}\!\mathrm{d}\mu(t)}{\mathop{}\!\mathrm{d}t}=-\frac{1}{2}\beta(t)\mu(t),$		(13)
	$\displaystyle\frac{\mathop{}\!\mathrm{d}\sigma^{2}(t)}{\mathop{}\!\mathrm{d}t}=-\beta(t)\sigma^{2}(t)+g^{2}(t),$		(14)

with initial conditions to be $\mu(0)=1$ and $\sigma^{2}(0)=0$ .

Eq. (13) has its solution by

\displaystyle\mu(t)=e^{-\frac{1}{2}\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}.

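If we multiply Eq. (14) by the integrating factor $e^{\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}$ , then Eq. (14) becomes

\displaystyle\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}t}\Big{[}e^{\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\sigma^{2}(t)\Big{]}=e^{\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}g^{2}(t)\iff e^{\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\sigma^{2}(t)=\int_{0}^{t}e^{\int_{0}^{\tau}\beta(s)\mathop{}\!\mathrm{d}s}g^{2}(\tau)\mathop{}\!\mathrm{d}\tau+C.

(15)
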
If we impose $\sigma^{2}(0)=0$ to Eq. (15), then the constant $C$ satisfies $C=0$ , and the variance formula becomes

\displaystyle\sigma^{2}(t)=e^{-\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\int_{0}^{t}e^{\int_{0}^{\tau}\beta(s)\mathop{}\!\mathrm{d}s}g^{2}(\tau)\mathop{}\!\mathrm{d}\tau.
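
As a quick check of this variance formula (a worked example added for illustration), substituting the VPSDE choice $g^{2}(t)=\beta(t)$ gives

\displaystyle\sigma^{2}(t)=e^{-\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\int_{0}^{t}e^{\int_{0}^{\tau}\beta(s)\mathop{}\!\mathrm{d}s}\beta(\tau)\mathop{}\!\mathrm{d}\tau=e^{-\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\Big{(}e^{\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}-1\Big{)}=1-e^{-\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s},

because the integrand is the $\tau$ -derivative of $e^{\int_{0}^{\tau}\beta(s)\mathop{}\!\mathrm{d}s}$ . Substituting the VESDE choice $\beta(t)\equiv 0$ and $g^{2}(t)=2\sigma_{min}^{2}(\frac{\sigma_{max}}{\sigma_{min}})^{2t}\log{\frac{\sigma_{max}}{\sigma_{min}}}$ gives

\displaystyle\sigma^{2}(t)=\int_{0}^{t}g^{2}(\tau)\mathop{}\!\mathrm{d}\tau=\sigma_{min}^{2}\Big{[}\Big{(}\frac{\sigma_{max}}{\sigma_{min}}\Big{)}^{2t}-1\Big{]}.

Both results agree with the VP and VE transition probabilities stated at the beginning of this appendix.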

To sum up, the family of linear SDEs of $\mathop{}\!\mathrm{d}\mathbf{x}_{t}=-\frac{1}{2}\beta(t)\mathbf{x}_{t}\mathop{}\!\mathrm{d}t+g(t)\mathop{}\!\mathrm{d}\mathbf{w}_{t}$ gets the transition probability to be

\displaystyle p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}\bigg{(}\mathbf{x}_{t}\Big{|}e^{-\frac{1}{2}\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\mathbf{x}_{0},e^{-\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\Big{(}\int_{0}^{t}e^{\int_{0}^{\tau}\beta(s)\mathop{}\!\mathrm{d}s}g^{2}(\tau)\mathop{}\!\mathrm{d}\tau\Big{)}\mathbf{I}\bigg{)}.

(16)

The gradient of the log transition probability, $\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}=-\frac{\mathbf{x}_{t}-\mu(t)\mathbf{x}_{0}}{\sigma^{2}(t)}=-\frac{\mathbf{z}}{\sigma(t)}$ , where $\mathbf{x}_{t}=\mu(t)\mathbf{x}_{0}+\sigma(t)\mathbf{z}$ , diverges as $t\rightarrow 0$ because $\sigma(t)\rightarrow 0$ . Lemma 2 below indicates that $\|\mathbf{s}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}\rightarrow\infty$ for any continuous score function, $\mathbf{s}$ . It follows that the denoising score loss diverges as $t\rightarrow 0$ , as illustrated in Figure 1-(a).

Lemma 2. Let $\mathcal{H}_{[0,T]}=\{\mathbf{s}:\mathbb{R}^{d}\times[0,T]\rightarrow\mathbb{R}^{d},\text{ $\mathbf{s}$ is locally Lipschitz}\}$ . Suppose a continuous vector field $\mathbf{v}$ defined on a subset $U$ of a compact manifold $M$ (i.e., $\mathbf{v}:U\subset M\rightarrow\mathbb{R}^{d}$ ) is unbounded. Then there exists no $\mathbf{s}\in\mathcal{H}_{[0,T]}$ such that $\lim_{t\rightarrow 0}\mathbf{s}(\mathbf{x},t)=\mathbf{v}(\mathbf{x})$ a.e. on $U$ .

Proof. Since $U$ is an open subset of a compact manifold $M$ , $\|\mathbf{x}_{1}-\mathbf{x}_{2}\|\leq\text{diam}(M)$ for all $\mathbf{x}_{1},\mathbf{x}_{2}\in U$ . Also, if $t_{1},t_{2}\in[0,T]$ , $|t_{1}-t_{2}|$ is bounded. Hence, the local Lipschitzness of $\mathbf{s}$ implies that there exists $K>0$ such that $\|\mathbf{s}(\mathbf{x}_{1},t_{1})-\mathbf{s}(\mathbf{x}_{2},t_{2})\|\leq K(\|\mathbf{x}_{1}-\mathbf{x}_{2}\|+|t_{1}-t_{2}|)$ for any $\mathbf{x}_{1},\mathbf{x}_{2}\in U$ and $t_{1},t_{2}\in[0,T]$ . Therefore, for any $\mathbf{s}\in\mathcal{H}_{[0,T]}$ , there exists $C>0$ such that $\|\mathbf{s}(\mathbf{x},t)\|<C$ for all $\mathbf{x}\in U$ and $t\in[0,T]$ , so no $\mathbf{s}$ satisfies $\mathbf{s}(\mathbf{x},t)\rightarrow\mathbf{v}(\mathbf{x})$ a.e. on $U$ as $t\rightarrow 0$ . ∎

The denoising score loss is

\displaystyle\begin{split}\mathcal{L}(\bm{\theta};g^{2},\tau)=&\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t\\ &-\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]},\end{split}

(17)

for any $\tau\in[0,T]$ . For an appropriate class of function $A(t)$ ,

\displaystyle\begin{split}\int_{0}^{T}\mathbb{P}(\tau)\bigg{(}\int_{\tau}^{T}A(t)\mathop{}\!\mathrm{d}t\bigg{)}\mathop{}\!\mathrm{d}\tau=&\int_{0}^{T}\int_{0}^{T}\mathbb{P}(\tau)A(t)1_{[\tau,T]}(t)\mathop{}\!\mathrm{d}t\mathop{}\!\mathrm{d}\tau\\ =&\int_{0}^{T}\int_{0}^{T}\mathbb{P}(\tau)A(t)1_{[\tau,T]}(t)\mathop{}\!\mathrm{d}\tau\mathop{}\!\mathrm{d}t\\ =&\int_{0}^{T}\int_{0}^{t}\mathbb{P}(\tau)A(t)\mathop{}\!\mathrm{d}\tau\mathop{}\!\mathrm{d}t\\ =&\int_{0}^{T}\bigg{(}\int_{0}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\bigg{)}A(t)\mathop{}\!\mathrm{d}t\end{split}

holds by changing the order of integration. Therefore, we get

\displaystyle\begin{split}\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}):=&\,\mathbb{E}_{\mathbb{P}(\tau)}\big{[}\mathcal{L}(\bm{\theta};g^{2},\tau)\big{]}\\
=&\int_{0}^{T}\mathbb{P}(\tau)\bigg{[}\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t\\
&\quad-\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]}\bigg{]}\mathop{}\!\mathrm{d}\tau\\
=&\int_{0}^{T}\Big{(}\int_{0}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\Big{)}\bigg{[}\frac{1}{2}g^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big{]}\\
&\quad-\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\bigg{]}\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]}\\
=&\frac{1}{2}\int_{0}^{T}g_{\mathbb{P}}^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t+C,\end{split}

where

\displaystyle C=-\frac{1}{2}\int_{0}^{T}g_{\mathbb{P}}^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big{[}\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t-\int_{0}^{T}\Big{(}\int_{0}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\Big{)}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]}.

If $\mathbf{f}(\mathbf{x}_{t},t)=-\frac{1}{2}\beta(t)\mathbf{x}_{t}$ , then we have

\displaystyle C=-\frac{d}{2}\int_{0}^{T}\Big{(}\int_{0}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\Big{)}\frac{g^{2}(t)}{\sigma^{2}(t)}\mathop{}\!\mathrm{d}t+\frac{d}{2}\int_{0}^{T}\Big{(}\int_{0}^{t}\mathbb{P}(\tau)\mathop{}\!\mathrm{d}\tau\Big{)}\beta(t)\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]}.
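The change of the order of integration used above can be checked numerically. The following is a minimal sketch, not part of the original experiments: the integrand `A` and the truncation density `P` are toy choices assumed purely for illustration. It compares the expected truncated integral $\mathbb{E}_{\mathbb{P}(\tau)}\big[\int_{\tau}^{T}A(t)\,\mathrm{d}t\big]$ with the integral of $A(t)$ weighted by the cumulative distribution $\int_{0}^{t}\mathbb{P}(\tau)\,\mathrm{d}\tau$.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule, written out to avoid NumPy version differences."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def cumulative_trapezoid(y, x):
    """Running trapezoidal integral from x[0]; the first entry is 0."""
    return np.concatenate([[0.0], np.cumsum(0.5 * (y[1:] + y[:-1]) * np.diff(x))])

T = 1.0
t = np.linspace(0.0, T, 4001)

# Hypothetical toy choices (assumptions for illustration only).
A = np.cos(3.0 * t) + 2.0        # stand-in for the integrand A(t)
P = 2.0 * (1.0 - t / T) / T      # a density over tau on [0, T]; integrates to 1

# Left side: E_{P(tau)}[ int_tau^T A(t) dt ].
cum_A = cumulative_trapezoid(A, t)
tail = cum_A[-1] - cum_A         # tail[i] = int_{t[i]}^{T} A(s) ds
lhs = trapezoid(P * tail, t)

# Right side: int_0^T ( int_0^t P(tau) dtau ) A(t) dt, i.e. A weighted by the CDF of P.
cdf_P = cumulative_trapezoid(P, t)
rhs = trapezoid(cdf_P * A, t)

print(lhs, rhs)
```

The two printed values agree up to discretization error, which mirrors how the truncation distribution $\mathbb{P}$ turns into the weight $g_{\mathbb{P}}^{2}(t)$ in the general weighted loss.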

For any $\tau\in[0,T]$ ,

\displaystyle\begin{split}\mathbb{E}_{\mathbf{x}_{\tau}}\big{[}-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big{]}\leq\mathcal{L}(\bm{\theta};g^{2},\tau)=&\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t\\
&-\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\textup{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]}.\end{split}

Suppose $\bm{\mu}$ is the path measure of the forward SDE, and $\bm{\nu}_{\bm{\theta}}$ is the path measure of the generative SDE. The restricted measure is defined by $\bm{\mu}|_{[\tau,T]}(\{F_{t}\}_{t=\tau}^{T}):=\bm{\mu}(\{F_{t}\}_{t=0}^{T})$ , where $F_{t}=\mathbb{R}^{d}$ if $t\in[0,\tau)$ and $F_{t}$ is a measurable set in $\mathbb{R}^{d}$ otherwise. The restricted measure of $\bm{\nu}_{\bm{\theta}}$ is defined analogously. Then, by the data processing inequality, we get

\displaystyle D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})\leq D_{KL}(\bm{\mu}|_{[\tau,T]}\|\bm{\nu}_{\bm{\theta}}|_{[\tau,T]}).

(18)

Now, from the chain rule of KL divergences, we have

\displaystyle D_{KL}(\bm{\mu}|_{[\tau,T]}\|\bm{\nu}_{\bm{\theta}}|_{[\tau,T]})=D_{KL}(p_{T}\|\pi)+\mathbb{E}_{\mathbf{z}\sim p_{T}}\Big{[}D_{KL}\big{(}\bm{\mu}|_{[\tau,T]}(\cdot|\mathbf{x}_{T}=\mathbf{z})\|\bm{\nu}_{\bm{\theta}}|_{[\tau,T]}(\cdot|\mathbf{x}_{T}=\mathbf{z})\big{)}\Big{]}.

(19)

From the Girsanov theorem and the Martingale property, we get

\displaystyle D_{KL}\big{(}\bm{\mu}|_{[\tau,T]}(\cdot|\mathbf{x}_{T}=\mathbf{z})\|\bm{\nu}_{\bm{\theta}}|_{[\tau,T]}(\cdot|\mathbf{x}_{T}=\mathbf{z})\big{)}=\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t,

(20)

and combining Eq. (18), (19) and (20), we have

\displaystyle D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})\leq D_{KL}(p_{T}\|\pi)+\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t.

(21)

Now, from

\displaystyle\begin{split}&\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}]\big{]}\mathop{}\!\mathrm{d}t\\
=&\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}-2g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\big{]}\mathop{}\!\mathrm{d}t\\
=&\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t-\int_{\tau}^{T}\int g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}p_{t}(\mathbf{x}_{t})\mathop{}\!\mathrm{d}\mathbf{x}_{t}\mathop{}\!\mathrm{d}t\\
=&\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t-\int_{\tau}^{T}\int g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}\int p_{r}(\mathbf{x}_{0})p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})\mathop{}\!\mathrm{d}\mathbf{x}_{0}\mathop{}\!\mathrm{d}\mathbf{x}_{t}\mathop{}\!\mathrm{d}t\\
=&\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t-\int_{\tau}^{T}\int g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\cdot\int p_{r}(\mathbf{x}_{0})\nabla_{\mathbf{x}_{t}}p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})\mathop{}\!\mathrm{d}\mathbf{x}_{0}\mathop{}\!\mathrm{d}\mathbf{x}_{t}\mathop{}\!\mathrm{d}t\\
=&\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{r}(\mathbf{x}_{0})p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\big{[}g^{2}(t)[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}]\big{]}\mathop{}\!\mathrm{d}t,\end{split}

we can transform $\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}$ into $\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}$ , so Eq. (21) is equivalent to

\displaystyle\begin{split}\mathbb{E}_{p_{\tau}(\mathbf{x}_{\tau})}\big{[}-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big{]}\leq&\,D_{KL}(p_{T}\|\pi)+\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t+\mathcal{H}(p_{\tau})\\
=&\,D_{KL}(p_{T}\|\pi)+\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t\\
&+\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t+\mathcal{H}(p_{\tau}).\end{split}

(24)

Now, directly applying Theorem 4 of Song et al. (2021a), the entropy $\mathcal{H}(p_{\tau})$ becomes

\displaystyle\mathcal{H}(p_{\tau})=\mathcal{H}(p_{T})-\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}2\text{div}\big{(}\mathbf{f}(\mathbf{x}_{t},t)\big{)}+g^{2}(t)\|\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t.

(25)

Therefore, from Eq. (24) and (25), we get

\displaystyle\begin{split}\mathbb{E}_{p_{\tau}(\mathbf{x}_{\tau})}\big{[}-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big{]}\leq&\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t\\
&-\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\textup{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]}.\end{split}

∎
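The step that swaps the marginal score $\nabla\log p_{t}$ for the conditional score $\nabla\log p_{0t}$ can be illustrated on a toy problem. The sketch below is an assumption-laden check rather than anything from the paper: it uses a hypothetical one-dimensional Gaussian data distribution, a fixed noise level, a polynomial stand-in for $\mathbf{s}_{\bm{\theta}}$, and drops the common factor $g^{2}(t)$. Gauss-Hermite quadrature makes both expectations exact for polynomial integrands.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite nodes/weights for the standard normal (weights normalised to sum to 1).
nodes, weights = hermegauss(30)
weights = weights / np.sqrt(2.0 * np.pi)

# Hypothetical 1-D setup (assumptions): x0 ~ N(m, s0^2) and x_t = x0 + sigma * z.
m, s0, sigma = 0.3, 0.8, 0.5
v = s0**2 + sigma**2             # variance of the marginal p_t = N(m, v)

def s_theta(x):
    """Arbitrary polynomial stand-in for the score network."""
    return -0.7 * x + 0.2 + 0.1 * x**3

# Marginal form: E_{p_t}[ |s - grad log p_t|^2 - |grad log p_t|^2 ].
x = m + np.sqrt(v) * nodes
score_marginal = -(x - m) / v
marginal = np.sum(weights * ((s_theta(x) - score_marginal) ** 2 - score_marginal**2))

# Conditional form: E_{x0, x_t}[ |s - grad log p_0t|^2 - |grad log p_0t|^2 ].
u, z = np.meshgrid(nodes, nodes, indexing="ij")
w2 = np.outer(weights, weights)
x_t = m + s0 * u + sigma * z
score_conditional = -z / sigma   # equals -(x_t - x0) / sigma^2
conditional = np.sum(w2 * ((s_theta(x_t) - score_conditional) ** 2 - score_conditional**2))

print(marginal, conditional)
```

Both quantities coincide, although the two score targets differ pointwise; only the expectations of the subtracted forms match.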

Suppose $\lambda(t)$ is a weighting function of the NCSN loss. If $\frac{\lambda(t)}{g^{2}(t)}$ is a nondecreasing and nonnegative absolutely continuous function on $[\epsilon,T]$ and zero on $[0,\epsilon)$ , then

\displaystyle\begin{split}\mathcal{L}(\bm{\theta};\lambda,\epsilon)\geq&\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}\mathbb{E}_{\mathbf{x}_{\tau}}\big{[}-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big{]}\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\mathbb{E}_{\mathbf{x}_{\epsilon}}\big{[}-\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{\epsilon})}\big{]}\\
&+\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}-1\Big{)}\mathbb{E}_{\mathbf{x}_{\tau}}\big{[}\textup{div}(\mathbf{f}(\mathbf{x}_{\tau},\tau))\big{]}\mathop{}\!\mathrm{d}\tau+\Big{[}\frac{\lambda(T)}{g^{2}(T)}-1\Big{]}\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]}.\end{split}

We prove the theorem by using

\displaystyle\begin{split}\int_{\epsilon}^{T}\lambda(t)A(t)\mathop{}\!\mathrm{d}t=&\int_{\epsilon}^{T}\bigg{[}\int_{\epsilon}^{t}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\bigg{]}g^{2}(t)A(t)\mathop{}\!\mathrm{d}t\\ =&\int_{\epsilon}^{T}\int_{\epsilon}^{T}1_{[\epsilon,t]}(\tau)\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}g^{2}(t)A(t)\mathop{}\!\mathrm{d}\tau\mathop{}\!\mathrm{d}t+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\int_{\epsilon}^{T}g^{2}(t)A(t)\mathop{}\!\mathrm{d}t\\ =&\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}\int_{\tau}^{T}g^{2}(t)A(t)\mathop{}\!\mathrm{d}t\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\int_{\epsilon}^{T}g^{2}(t)A(t)\mathop{}\!\mathrm{d}t.\end{split}

(26)

By plugging $A(t)=\frac{1}{2}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}$ into Eq. (26), we have

\displaystyle\begin{split}\mathcal{L}(\bm{\theta};\lambda,\epsilon):=&\frac{1}{2}\int_{\epsilon}^{T}\lambda(t)\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t\\
&-\int_{\epsilon}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]}\\
=&\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}\bigg{[}\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t\\
&\quad\quad-\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]}\bigg{]}\mathop{}\!\mathrm{d}\tau\\
&+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\bigg{[}\frac{1}{2}\int_{\epsilon}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t\\
&\quad\quad-\int_{\epsilon}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t-\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]}\bigg{]}\\
&+\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t\mathop{}\!\mathrm{d}\tau+\Big{(}\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\Big{)}\int_{\epsilon}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t\\
&-\int_{\epsilon}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t+\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]}\bigg{[}\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}-1\bigg{]}.\end{split}

(27)

Also, plugging $A(t)=\frac{1}{g^{2}(t)}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}\big{(}\mathbf{f}(\mathbf{x}_{t},t)\big{)}\big{]}$ into Eq. (26), we have

\displaystyle\int_{\epsilon}^{T}\frac{\lambda(t)}{g^{2}(t)}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}\big{(}\mathbf{f}(\mathbf{x}_{t},t)\big{)}\big{]}\mathop{}\!\mathrm{d}t=\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}\int_{\tau}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t\mathop{}\!\mathrm{d}\tau+\Big{(}\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\Big{)}\int_{\epsilon}^{T}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t.

(28)

Using Eq. (27) and (28), we get

\displaystyle\begin{split}\mathcal{L}(\bm{\theta};\lambda,\epsilon)=&\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}\mathcal{L}(\bm{\theta};g^{2},\tau)\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\mathcal{L}(\bm{\theta};g^{2},\epsilon)\\ &+\int_{\epsilon}^{T}\Big{(}\frac{\lambda(t)}{g^{2}(t)}-1\Big{)}\mathbb{E}_{\mathbf{x}_{t}}\big{[}\text{div}(\mathbf{f}(\mathbf{x}_{t},t))\big{]}\mathop{}\!\mathrm{d}t+\Big{[}\frac{\lambda(T)}{g^{2}(T)}-1\Big{]}\mathbb{E}_{\mathbf{x}_{T}}\big{[}\log{\pi(\mathbf{x}_{T})}\big{]}.\end{split}

(29)

Then, applying Lemma 1 to Eq. (29) yields the desired result. ∎
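The decomposition in Eq. (26), which rewrites a $\lambda$-weighted integral as a mixture of truncated $g^{2}$-weighted integrals, can be sanity-checked numerically. The sketch below uses toy functions for $g^{2}$, $\lambda/g^{2}$ and $A$ that are assumptions chosen only so that $\lambda/g^{2}$ is nonnegative and nondecreasing; none of them come from the experiments.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule, written out to avoid NumPy version differences."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def cumulative_trapezoid(y, x):
    """Running trapezoidal integral from x[0]; the first entry is 0."""
    return np.concatenate([[0.0], np.cumsum(0.5 * (y[1:] + y[:-1]) * np.diff(x))])

eps, T = 0.01, 1.0
t = np.linspace(eps, T, 4001)

# Hypothetical toy choices (assumptions for illustration only).
g2 = 1.0 + 4.0 * t               # stand-in for g^2(t)
ratio = 0.5 + t**2               # lambda(t) / g^2(t): nonnegative and nondecreasing
d_ratio = 2.0 * t                # derivative of the ratio
lam = ratio * g2                 # lambda(t)
A = np.exp(-t) + 1.0             # stand-in for A(t)

# Left side of Eq. (26): int_eps^T lambda(t) A(t) dt.
lhs = trapezoid(lam * A, t)

# Right side: mixture over tau of truncated integrals, plus the boundary term at eps.
cum = cumulative_trapezoid(g2 * A, t)
inner = cum[-1] - cum            # inner[i] = int_{t[i]}^{T} g^2(s) A(s) ds
rhs = trapezoid(d_ratio * inner, t) + ratio[0] * inner[0]

print(lhs, rhs)
```

The derivative `d_ratio` plays the role of an unnormalized truncation density, and the boundary term corresponds to a point mass at $\tau=\epsilon$.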

Suppose $\lambda(t)$ is a weighting function of the NCSN loss. If $\frac{\lambda(t)}{g^{2}(t)}$ is a nondecreasing and nonnegative continuous function on $[\epsilon,T]$ and zero on $[0,\epsilon)$ , then

\displaystyle\begin{split}&\frac{1}{2}\int_{\epsilon}^{T}\lambda(t)\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t+\frac{\lambda(T)}{g^{2}(T)}D_{KL}(p_{T}\|\pi)\\
&\quad\quad\quad\geq\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}D_{KL}(p_{\epsilon}\|p_{\epsilon}^{\bm{\theta}}).\end{split}

A direct extension of the proof indicates that Theorem 1 still holds when $\frac{\lambda(t)}{g^{2}(t)}$ has finite jump on $[0,T]$ .

The weight of $\frac{\lambda(T)}{g^{2}(T)}$ is the normalizing constant of the unnormalized truncation probability, $\mathbb{P}$ .

\displaystyle\begin{split}&\frac{1}{2}\int_{\epsilon}^{T}\lambda(t)\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t+\frac{\lambda(T)}{g^{2}(T)}D_{KL}(p_{T}\|\pi)\\
=&\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t\mathop{}\!\mathrm{d}\tau\\
&\quad+\Big{(}\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\Big{)}\frac{1}{2}\int_{\epsilon}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t+\frac{\lambda(T)}{g^{2}(T)}D_{KL}(p_{T}\|\pi)\\
\geq&\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}\big{[}D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})-D_{KL}(p_{T}\|\pi)\big{]}\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\big{[}D_{KL}(p_{\epsilon}\|p_{\epsilon}^{\bm{\theta}})-D_{KL}(p_{T}\|\pi)\big{]}+\frac{\lambda(T)}{g^{2}(T)}D_{KL}(p_{T}\|\pi)\\
=&\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}D_{KL}(p_{\epsilon}\|p_{\epsilon}^{\bm{\theta}}).\end{split}

∎

In the released code of Song et al. (2021b), the NCSN++ network is modeled by $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\log{\sigma(t)})$ , where the second argument is $\log{\sigma(t)}$ instead of $t$ . Experiments with $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)$ or $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\sigma(t))$ performed worse than the parametrization $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\log{\sigma(t)})$ , and we analyze this experimental result through Lemma 2 and Proposition 1.

Let $\mathcal{H}_{[1,\infty)}=\{\mathbf{s}:\mathbb{R}^{d}\times[1,\infty)\rightarrow\mathbb{R}^{d},\text{ $\mathbf{s}$ is locally Lipschitz}\}$ . Suppose a continuous vector field $\mathbf{v}$ defined on a $d$ -dimensional open subset $U$ of a compact manifold $M$ is unbounded, and the projection of $\mathbf{v}$ on each axis is locally integrable. Then, there exists $\mathbf{s}\in\mathcal{H}_{[1,\infty)}$ such that $\lim_{\eta\rightarrow\infty}\mathbf{s}(\mathbf{x},\eta)=\mathbf{v}(\mathbf{x})$ a.e. on $U$ .

The gradient of the log transition probability diverges at $t\approx 0$ theoretically (Section A.2) and empirically (Figure 9-(a)). Here, in high-dimensional space, $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})/p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0}^{\prime})$ with $\mathbf{x}_{0}\neq\mathbf{x}_{0}^{\prime}$ is either zero or infinity. Thus, the data score is nearly identical to the gradient of the log transition probability, $\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}=\|\nabla_{\mathbf{x}_{t}}\log{\int p_{r}(\mathbf{x}_{0})p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})\mathop{}\!\mathrm{d}\mathbf{x}_{0}}\|_{2}^{2}\approx\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}$ , and the observation of Figure 9-(a) is valid for the exact data score as well.

Although Lemma 2 is based on $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)$ , the identical result also holds for the parametrization of $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\sigma(t))$ , so it indicates that both $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)$ and $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\sigma(t))$ cannot estimate the data score as $t\rightarrow 0$ . On the other hand, Proposition 1 implies that there exists a score function that estimates the unbounded data score asymptotically, and Proposition 1 explains the reason why the parametrization of Song et al. (2021b), i.e., $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\log{\sigma(t)})$ , is successful on score estimation.

On top of that, we introduce another parametrization that particularly focuses on the score estimation near $t\approx 0$ . We name the network $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\eta(t))$ with $\eta(t)=\left\{\begin{array}[]{ll}\log{\sigma(t)}&\text{if }\sigma(t)\geq\sigma_{0}\\ -\frac{c_{1}}{\sigma(t)}+c_{2}&\text{if }\sigma(t)<\sigma_{0}\end{array}\right.$ Unbounded NCSN++ (UNCSN++), and the network $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\eta(t))$ with $\eta(t):=\int\frac{g^{2}(t)}{\sigma^{2}(t)}\mathop{}\!\mathrm{d}t$ Unbounded DDPM++ (UDDPM++).

In UNCSN++, $c_{1},c_{2}$ and $\sigma_{0}$ are the hyperparameters. Following the parametrization of $\log{\sigma(t)}$ , we choose $\sigma_{0}$ as $0.01$ . Also, for $\eta(t)$ to be continuously differentiable, the two hyperparameters $c_{1}$ and $c_{2}$ must satisfy a system of two equations, so $c_{1}$ and $c_{2}$ are fully determined by this system.
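To make the determination of $c_{1}$ and $c_{2}$ concrete, the following minimal sketch solves the matching conditions under one assumed reading: $\eta$ is viewed as a function of $\sigma$, and both its value and its derivative with respect to $\sigma$ are matched at $\sigma=\sigma_{0}$. The function names are hypothetical and this is not the released implementation.

```python
import math

def unbounded_eta_constants(sigma0):
    """Solve the two matching conditions at sigma = sigma0 (assumed interpretation):
    value:      log(sigma0) = -c1 / sigma0 + c2
    derivative: 1 / sigma0  =  c1 / sigma0**2
    """
    c1 = sigma0
    c2 = math.log(sigma0) + 1.0
    return c1, c2

def eta(sigma, sigma0=0.01):
    """Piecewise time embedding input: log(sigma) above sigma0, -c1/sigma + c2 below."""
    c1, c2 = unbounded_eta_constants(sigma0)
    if sigma >= sigma0:
        return math.log(sigma)
    return -c1 / sigma + c2

c1, c2 = unbounded_eta_constants(0.01)
print(c1, c2)
print(eta(1e-6), math.log(1e-6))
```

Under this reading, $c_{1}=\sigma_{0}$ and $c_{2}=\log\sigma_{0}+1$. The second printed line shows that below $\sigma_{0}$ the input decreases much faster than $\log\sigma$, which is the intended unbounded behavior.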

The choice of such $\eta(t)$ for UDDPM++ is expected to enhance the score estimation near $t\approx 0$ because the input of $\eta(t)$ is distributed uniformly when we draw samples from the importance weight. Concretely, when the sampling distribution on the diffusion time is given by $p_{iw}(t)\propto\frac{g^{2}(t)}{\sigma^{2}(t)}$ , the $\eta$ -distribution from the importance sampling becomes $p(\eta)\propto 1$ , which is depicted in Figure 9-(b).
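The claim that $\eta$ is uniformly distributed under the importance sampling can be checked numerically. The sketch below assumes a VPSDE with a linear $\beta(t)$ schedule and illustrative values of $\beta_{min}$, $\beta_{max}$, $\epsilon$ and $T$; these are assumptions for the example. In that case $g^{2}(t)=\beta(t)$, $\sigma^{2}(t)=1-e^{-\int_{0}^{t}\beta}$, and one antiderivative of $g^{2}/\sigma^{2}$ is $\log(e^{\int_{0}^{t}\beta}-1)$. If $\eta$ is uniform under $p_{iw}$, the cumulative distribution of $p_{iw}$ in $t$ must be an affine function of $\eta(t)$.

```python
import numpy as np

# Assumed VPSDE schedule (illustrative): beta(t) = beta_min + t * (beta_max - beta_min).
beta_min, beta_max = 0.1, 20.0
eps, T = 1e-5, 1.0

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def B(t):
    """Integral of beta from 0 to t."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t**2

def sigma2(t):
    """Variance of the VP transition kernel, 1 - exp(-B(t))."""
    return -np.expm1(-B(t))

def eta(t):
    """An antiderivative of g^2 / sigma^2 = beta / (1 - exp(-B)): log(exp(B) - 1)."""
    return np.log(np.expm1(B(t)))

# Numerical CDF of p_iw(t), proportional to g^2(t) / sigma^2(t), on a log-spaced grid.
t = np.geomspace(eps, T, 20001)
w = beta(t) / sigma2(t)
cdf = np.concatenate([[0.0], np.cumsum(0.5 * (w[1:] + w[:-1]) * np.diff(t))])
cdf /= cdf[-1]

# Rescale eta to [0, 1] and compare with the CDF of p_iw.
eta_unit = (eta(t) - eta(eps)) / (eta(T) - eta(eps))
max_gap = float(np.max(np.abs(cdf - eta_unit)))
print(max_gap)
```

A negligible `max_gap` means that drawing $t\sim p_{iw}$ and mapping it through $\eta$ yields a uniform variable on $[\eta(\epsilon),\eta(T)]$, which is the change of variables $p(\eta)=p_{iw}(t)/|\eta^{\prime}(t)|\propto 1$.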

Let $h$ be a standard mollifier function. If $h_{t}(\mathbf{x})=t^{-n}h(\mathbf{x}/t)$ , then $v_{t}:=h_{t}*v$ converges to $v$ a.e. on $U$ as $t\rightarrow 0$ (Theorem 7-(ii) of Appendix C in (Evans, 1998)). Therefore, if we define $\mathbf{s}(\mathbf{x},\eta):=v_{1/\eta}(\mathbf{x})$ on the domain of $v_{1/\eta}(\mathbf{x})$ and $\mathbf{s}(\mathbf{x},\eta):=0$ elsewhere, then $\mathbf{s}(\mathbf{x},\eta)=v_{1/\eta}(\mathbf{x})\rightarrow v(\mathbf{x})$ a.e. on $U$ as $\eta\rightarrow\infty$ .

Now, to show that $\mathbf{s}(\mathbf{x},\eta)$ is locally Lipschitz, let $\tilde{M}\times[\underline{\eta},\overline{\eta}]$ be a compact subset of $\mathbb{R}^{n}\times[1,\infty)$ . From $\|\mathbf{s}(\mathbf{x}_{1},\eta_{1})-\mathbf{s}(\mathbf{x}_{2},\eta_{2})\|=\|v_{1/\eta_{1}}(\mathbf{x}_{1})-v_{1/\eta_{2}}(\mathbf{x}_{2})\|\leq\|v_{1/\eta_{1}}(\mathbf{x}_{1})-v_{1/\eta_{1}}(\mathbf{x}_{2})\|+\|v_{1/\eta_{1}}(\mathbf{x}_{2})-v_{1/\eta_{2}}(\mathbf{x}_{2})\|$ , if there exist $K_{1},K_{2}>0$ such that $\|v_{1/\eta_{1}}(\mathbf{x}_{1})-v_{1/\eta_{1}}(\mathbf{x}_{2})\|\leq K_{1}\|\mathbf{x}_{1}-\mathbf{x}_{2}\|$ and $\|v_{1/\eta_{1}}(\mathbf{x}_{2})-v_{1/\eta_{2}}(\mathbf{x}_{2})\|\leq K_{2}|\eta_{1}-\eta_{2}|$ for all $\mathbf{x}_{1},\mathbf{x}_{2}\in\tilde{M}$ and $\eta_{1},\eta_{2}\in[\underline{\eta},\overline{\eta}]$ , then $\mathbf{s}(\mathbf{x},\eta)=v_{1/\eta}(\mathbf{x})$ is Lipschitz on $\tilde{M}\times[\underline{\eta},\overline{\eta}]$ .

First, since $v_{1/\eta}$ is infinitely differentiable on its domain (Theorem 7-(i) of Appendix C in (Evans, 1998)) and $\eta\in[\underline{\eta},\overline{\eta}]$ , there exists $K_{1}>0$ such that $\|v_{1/\eta}(\mathbf{x}_{1})-v_{1/\eta}(\mathbf{x}_{2})\|\leq K_{1}\|\mathbf{x}_{1}-\mathbf{x}_{2}\|$ . Second, the mollifier satisfies the uniform convergence on any compact subset of $U$ (Theorem 7-(iii) of Appendix C in (Evans, 1998)), which implies that $\|v_{1/\eta_{1}}(\mathbf{x})-v_{1/\eta_{2}}(\mathbf{x})\|\leq K_{2}|\frac{1}{\eta_{1}}-\frac{1}{\eta_{2}}|=K_{2}\frac{|\eta_{1}-\eta_{2}|}{\eta_{1}\eta_{2}}\leq K_{3}|\eta_{1}-\eta_{2}|$ for some $K_{2},K_{3}>0$ . Therefore, $\mathbf{s}$ is an element of $\mathcal{H}_{[1,\infty)}$ . ∎

VESDE assumes $g(t)=\sigma_{min}(\frac{\sigma_{max}}{\sigma_{min}})^{t}\sqrt{2\log{\frac{\sigma_{max}}{\sigma_{min}}}}$ . Then, the variance of the transition probability $p_{0t}(\mathbf{x}_{t}|\mu_{VE}(t)\mathbf{x}_{0},\sigma_{VE}^{2}(t))$ becomes $\sigma_{VE}^{2}(t)=\int_{0}^{t}g^{2}(s)\mathop{}\!\mathrm{d}s=\sigma_{min}^{2}[(\frac{\sigma_{max}}{\sigma_{min}})^{2t}-1]$ if the diffusion starts from $t=0$ with the initial condition of $\mathbf{x}_{0}\sim p_{r}$ . VESDE was originally introduced in Song & Ermon (2020) to satisfy the geometric property, which gives a smooth transition of the distributional shift. Mathematically, the variance is geometric if $\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}t}\log{\sigma_{VE}^{2}(t)}$ is a constant, but VESDE loses the geometric property, as illustrated in Figure 9-(c).
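The loss of the geometric property can be seen from a short numerical check. The sketch below uses illustrative values $\sigma_{min}=0.01$ and $\sigma_{max}=50$, which are assumptions for this example. It verifies the closed-form variance against a trapezoidal integral of $g^{2}$ and evaluates $\frac{\mathrm{d}}{\mathrm{d}t}\log\sigma_{VE}^{2}(t)$ at several times.

```python
import numpy as np

sigma_min, sigma_max = 0.01, 50.0   # assumed values for illustration
r = sigma_max / sigma_min

def g_ve(t):
    """Diffusion coefficient of VESDE."""
    return sigma_min * r**t * np.sqrt(2.0 * np.log(r))

def sigma2_ve(t):
    """Closed-form variance when the diffusion starts at t = 0."""
    return sigma_min**2 * (r ** (2.0 * t) - 1.0)

def dlog_sigma2_ve(t):
    """d/dt log sigma_VE^2(t); a constant value would mean a geometric variance."""
    return 2.0 * np.log(r) * r ** (2.0 * t) / (r ** (2.0 * t) - 1.0)

# Check the closed form against a trapezoidal integral of g^2 on [0, 1].
s = np.linspace(0.0, 1.0, 100001)
f = g_ve(s) ** 2
numeric = float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(s)))

slopes = dlog_sigma2_ve(np.array([1e-3, 1e-2, 1e-1, 1.0]))
print(numeric, sigma2_ve(1.0))
print(slopes)
```

The log-derivative blows up as $t\rightarrow 0$ and only approaches the constant $2\log(\sigma_{max}/\sigma_{min})$ for larger $t$, so the variance is not geometric near the origin.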

To attain the geometric property, VESDE approximates the variance by $\tilde{\sigma}_{VE}^{2}(t)=\sigma_{min}^{2}(\frac{\sigma_{max}}{\sigma_{min}})^{2t}$ , omitting the $-1$ term of $\sigma_{VE}^{2}(t)$ . However, this approximation implies that $\mathbf{x}_{t}$ does not converge to $\mathbf{x}_{0}$ in distribution because $\sigma_{min}^{2}(\frac{\sigma_{max}}{\sigma_{min}})^{2t}\rightarrow\sigma_{min}^{2}\neq 0$ as $t\rightarrow 0$ . Indeed, a slightly stronger claim is possible:

There is no SDE that has the stochastic process $\{\mathbf{x}_{t}\}_{t\in[0,T]}$ , defined by a transition probability $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\mathbf{x}_{0},\sigma_{min}^{2}(\frac{\sigma_{max}}{\sigma_{min}})^{2t}\mathbf{I})$ , as the solution.

Proposition 2 indicates that if we approximate the variance by $\tilde{\sigma}_{VE}^{2}(t)$ , then the reverse diffusion process cannot be modeled by a generative process.

Rigorously, however, if the diffusion process starts from $t=-\infty$ , rather than $t=0$ , then the variance of the transition probability becomes $\sigma_{VE,-\infty}^{2}(t)=\int_{-\infty}^{t}g^{2}(s)\mathop{}\!\mathrm{d}s=\sigma_{min}^{2}(\frac{\sigma_{max}}{\sigma_{min}})^{2t}$ , which is exactly the variance $\tilde{\sigma}_{VE}^{2}(t)$ . Therefore, VESDE can be considered as a diffusion process starting from $t=-\infty$ .

From this point of view, we introduce an SDE that satisfies the geometric progression property starting from $t=0$ . We name this new SDE the Reciprocal VE SDE (RVESDE). RVESDE has the same form of SDE as VESDE, $\mathop{}\!\mathrm{d}\mathbf{x}_{t}=g_{RVE}(t)\mathop{}\!\mathrm{d}\mathbf{w}_{t}$ , with

\displaystyle g_{RVE}(t):=\left\{\begin{array}[]{ll}\sigma_{max}\big{(}\frac{\sigma_{min}}{\sigma_{max}}\big{)}^{\frac{\epsilon}{t}}\frac{\sqrt{2\epsilon\log{(\frac{\sigma_{max}}{\sigma_{min}})}}}{t}&\text{if }t>0,\\ 0&\text{if }t=0.\end{array}\right.

Then, the transition probability of RVESDE becomes

\displaystyle p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}\bigg{(}\mathbf{x}_{t};\mathbf{x}_{0},\sigma_{max}^{2}\Big{(}\frac{\sigma_{min}}{\sigma_{max}}\Big{)}^{\frac{2\epsilon}{t}}\mathbf{I}\bigg{)}.

As illustrated in Figure 9-(c), RVESDE attains the geometric property at the expense of having reciprocated time, $1/t$ . Also, RVESDE satisfies $\sigma_{RVE}^{2}(\epsilon)=\sigma_{min}^{2}$ and $\sigma_{RVE}^{2}(T)\approx\sigma_{max}^{2}$ . The existence and uniqueness of solution for RVESDE is guaranteed by Theorem 5.2.1 in (Oksendal, 2013).
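The stated properties of RVESDE can be verified with a few lines of code. The sketch below assumes illustrative values $\sigma_{min}=0.01$, $\sigma_{max}=50$ and $\epsilon=10^{-5}$. It checks that $g_{RVE}^{2}$ is the time derivative of the transition variance, that $\sigma_{RVE}^{2}(\epsilon)=\sigma_{min}^{2}$, and that $\log\sigma_{RVE}^{2}$ is linear in the reciprocated time $1/t$.

```python
import numpy as np

sigma_min, sigma_max, eps = 0.01, 50.0, 1e-5   # assumed values for illustration
log_ratio = np.log(sigma_max / sigma_min)

def sigma2_rve(t):
    """Variance of the RVESDE transition kernel: sigma_max^2 (sigma_min/sigma_max)^(2 eps / t)."""
    return sigma_max**2 * np.exp(-2.0 * eps * log_ratio / t)

def g_rve(t):
    """Diffusion coefficient of RVESDE for t > 0."""
    return sigma_max * np.exp(-eps * log_ratio / t) * np.sqrt(2.0 * eps * log_ratio) / t

# g^2 should equal the time derivative of the variance (central finite difference).
t0 = 10.0 * eps
h = 1e-9
finite_diff = (sigma2_rve(t0 + h) - sigma2_rve(t0 - h)) / (2.0 * h)

# log sigma^2 is linear in 1/t, i.e. geometric in the reciprocated time.
inv_t = np.linspace(1.0, 1.0 / eps, 5)
log_var = np.log(sigma2_rve(1.0 / inv_t))
slopes = np.diff(log_var) / np.diff(inv_t)

print(finite_diff, g_rve(t0) ** 2)
print(sigma2_rve(eps), sigma_min**2)
print(slopes)
```

All entries of `slopes` equal $-2\epsilon\log(\sigma_{max}/\sigma_{min})$, which is the constant log-derivative that defines the geometric property in the variable $1/t$.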

Training. Throughout the experiments, we train our model with a learning rate of 0.0002, a warmup of 5000 iterations, and gradient clipping at 1. For UNCSN++, we take $\sigma_{min}=10^{-3}$ , and for NCSN++, we take $\sigma_{min}=10^{-2}$ . When training on ImageNet32 with the likelihood weighting or the variance weighting without Soft Truncation, we take $\epsilon=5\times 10^{-5}$ , following the setting of Song et al. (2021a). Otherwise, we take $\epsilon=10^{-5}$ . For other hyperparameters, we run our experiments according to Song et al. (2021b, a).

On datasets of resolution $32\times 32$ , we use the batch size of 128, which consumes about 48GB of GPU memory. On STL-10 with resolution $48\times 48$ , we use the batch size of 192, and on datasets of resolution $64\times 64$ , we use the batch size of 128. The batch size for the datasets of resolution $256\times 256$ is 40, which takes nearly 120GB of GPU memory. On the dataset of $1024\times 1024$ resolution, we use the batch size of 16, which takes around 120GB of GPU memory. We use five NVIDIA RTX-3090 GPU machines to train the models exceeding 48GB, and we use a pair of NVIDIA RTX-3090 GPU machines to train the models that consume less than 48GB.

Evaluation. We apply the EMA with a rate of 0.999 on NCSN++/UNCSN++ and 0.9999 on DDPM++/UDDPM++. For the density estimation, we obtain the NLL performance by the Instantaneous Change of Variable (Song et al., 2021b; Chen et al., 2018). We choose $[\epsilon=10^{-5},T=1]$ to integrate the instantaneous change-of-variable of the probability flow as default, even for the ImageNet32 dataset. Although Song et al. (2021b, a) integrate the change-of-variable formula with $\mathbf{x}_{0}$ as the starting variable, Table 5 of Kim et al. (2022) shows that there is a significant difference between starting from $\mathbf{x}_{\epsilon}$ and from $\mathbf{x}_{0}$ if $\epsilon$ is not small enough. Therefore, we follow Kim et al. (2022) to compute $\mathbb{E}_{\mathbf{x}_{\epsilon}}\big{[}-\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{\epsilon})}\big{]}$ . However, to compare with the baseline models, we also evaluate NLL in the way Song et al. (2021b, a) and Vahdat et al. (2021) compute it. We denote the way of Kim et al. (2022) as after correction and that of Song et al. (2021a) as before correction throughout the appendix. We dequantize the data variable by the uniform dequantization (Ho et al., 2019) for both the after and before corrections. In the main paper, we only report the after correction performances.

For the sampling, we apply the Predictor-Corrector (PC) algorithm introduced in Song et al. (2021b). We set the signal-to-noise ratio as 0.16 on $32\times 32$ datasets, 0.17 on $48\times 48$ and $64\times 64$ datasets, 0.075 on $256\times 256$ datasets, and 0.15 on $1024\times 1024$ . On datasets with resolution lower than $256\times 256$ , we iterate 1,000 steps for the PC sampler, while we apply 2,000 steps on the other high-dimensional datasets. Throughout the experiments for VESDE, we use the reverse diffusion (Song et al., 2021b) for the predictor algorithm and the annealed Langevin dynamics (Welling & Teh, 2011) for the corrector algorithm. For VPSDE, we use the Euler-Maruyama for the predictor algorithm, and we do not use any corrector algorithm.

We compute the FID score (Song et al., 2021b) based on the modified Inception V1 network (https://tfhub.dev/tensorflow/tfgan/eval/inception/1) using the tensorflow-gan package for the CIFAR-10 dataset, and we use the clean-FID (Parmar et al., 2022) based on the Inception V3 network (Szegedy et al., 2016) for the remaining datasets. We note that the FID computed by (Parmar et al., 2022) is higher than the one from the original FID calculation (see https://github.com/GaParmar/clean-fid for the detailed experimental results).

Table 9: Ablation study of Soft Truncation with/without the reconstruction term on CIFAR-10 trained with DDPM++ (VP).

Loss	Soft Truncation	Reconstruction Term for Training	NLL: $\mathbb{E}_{\mathbf{x}_{0}}[-\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{0})}]$ (before correction)	NLL: $\mathbb{E}_{\mathbf{x}_{\epsilon}}[-\log{p_{\epsilon}^{\bm{\theta}}(\mathbf{x}_{\epsilon})}]+R_{\epsilon}(\bm{\theta})$ (after correction)	NELBO: $\mathcal{L}(\bm{\theta};g^{2},\epsilon)$ (without residual)	NELBO: $\mathcal{L}(\bm{\theta};g^{2},\epsilon)+R_{\epsilon}(\bm{\theta})$ (with residual)	FID (ODE)
$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	✗	✗	2.97	3.03	3.11	3.13	6.70
$\mathcal{L}(\bm{\theta};g^{2},\epsilon)+\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{\epsilon}}\big{[}-\log{p(\mathbf{x}_{0}|\mathbf{x}_{\epsilon})}\big{]}$	✗	✓	3.01	2.99	3.07	3.09	6.93
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})=\mathbb{E}_{\mathbb{P}_{1}(\tau)}\big{[}\mathcal{L}(\bm{\theta};g^{2},\tau)\big{]}$	✓	✗	2.98	3.01	3.08	3.08	3.96
$\mathbb{E}_{\mathbb{P}_{1}(\tau)}\big{[}\mathcal{L}(\bm{\theta};g^{2},\tau)+R_{\tau}(\bm{\theta})\big{]}$	✓	✓	2.95	2.98	3.04	3.04	4.23

Table 9 shows that training with the reconstruction term outperforms training without it on NLL/NELBO, at the cost of sample quality (FID). If $\tau$ is fixed as $\epsilon$, then the bound

$\displaystyle\mathbb{E}_{\mathbf{x}_{0}}\big{[}-\log{p_{0}^{\bm{\theta}}(\mathbf{x}_{0})}\big{]}\leq\mathcal{L}(\bm{\theta};g^{2},\tau)+\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{\tau}}\big{[}-\log{p(\mathbf{x}_{0}|\mathbf{x}_{\tau})}\big{]}$

is tight enough to estimate the negative log-likelihood. However, if $\tau$ is a random variable, then the bound is no longer tight to the negative log-likelihood, as evidenced in Figure 1-(b). On the other hand, if we drop the reconstruction term, then the bound becomes

$\displaystyle\mathbb{E}_{\mathbf{x}_{\tau}}\big{[}-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big{]}\leq\mathcal{L}(\bm{\theta};g^{2},\tau),$

up to a constant, and this bound is tight regardless of $\tau$, as evidenced in Figure 1-(c). This is why we call Soft Truncation Maximum Perturbed Likelihood Estimation (MPLE).

Table 10: Ablation study of Soft Truncation for various weightings on CIFAR-10 and ImageNet32 with DDPM++ (VP).

Dataset	Loss	Soft Truncation	NLL (after correction)	NLL (before correction)	NELBO (with residual)	NELBO (without residual)	FID (ODE)
CIFAR-10	$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	✗	3.03	2.97	3.13	3.11	6.70
	$\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$	✗	3.21	3.16	3.34	3.32	3.90
	$\mathcal{L}(\bm{\theta};g_{\mathbb{P}_{1}}^{2},\epsilon)$	✗	3.06	3.02	3.18	3.14	6.11
	$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$	✓	3.01	2.98	3.08	3.08	3.96
ImageNet32	$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	✗	3.92	3.90	3.94	3.95	12.68
	$\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$	✗	3.95	3.96	4.00	4.01	9.22
	$\mathcal{L}(\bm{\theta};g_{\mathbb{P}_{1}}^{2},\epsilon)$	✗	3.93	3.92	3.97	3.98	11.89
	$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$	✓	3.90	3.87	3.92	3.92	9.52
	$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.9})$	✓	3.90	3.88	3.91	3.91	8.42

Table 11: Ablation study of Soft Truncation for various model architectures and diffusion SDEs on CelebA.

SDE	Model	Loss	NLL (after correction)	NLL (before correction)	NELBO (with residual)	NELBO (without residual)	FID (PC)	FID (ODE)
VE	NCSN++	$\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$	3.41	2.37	3.42	3.96	3.95	-
VE	NCSN++	$\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P}_{2})$	3.44	2.42	3.44	3.97	2.68	-
RVE	UNCSN++	$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	2.01	1.96	2.01	2.17	3.36	-
RVE	UNCSN++	$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{2})$	1.97	1.91	2.02	2.18	1.92	-
VP	DDPM++	$\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$	2.14	2.07	2.21	2.22	3.03	2.32
	DDPM++	$\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P}_{1})$	2.17	2.08	2.29	2.26	2.88	1.90
	UDDPM++	$\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$	2.11	2.07	2.20	2.21	3.23	4.72
	UDDPM++	$\mathcal{L}_{ST}(\bm{\theta};\sigma^{2},\mathbb{P}_{1})$	2.16	2.08	2.28	2.25	2.22	1.94
	DDPM++	$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	2.00	1.93	2.09	2.09	5.31	3.95
	DDPM++	$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$	2.00	1.94	2.11	2.11	4.50	2.90
	UDDPM++	$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	1.98	1.95	2.12	2.15	4.65	3.98
	UDDPM++	$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$	2.00	1.94	2.10	2.10	4.45	2.97

Table 12: Ablation study of Soft Truncation for various $\sigma_{min}$ (equivalently, $\epsilon$) on CIFAR-10 with UNCSN++ (RVE).

Loss	$\epsilon$	NLL (after correction)	NLL (before correction)	NELBO (with residual)	NELBO (without residual)	FID (ODE)
$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	$10^{-2}$	4.64	4.02	4.69	5.20	38.82
	$10^{-3}$	3.51	3.20	3.52	3.90	6.21
	$10^{-4}$	3.05	2.98	3.08	3.24	6.33
	$10^{-5}$	3.03	2.97	3.13	3.11	6.70
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$	$10^{-2}$	4.65	4.03	4.69	5.20	39.83
	$10^{-3}$	3.51	3.21	3.52	3.88	5.14
	$10^{-4}$	3.05	2.98	3.08	3.24	4.16
	$10^{-5}$	3.01	2.98	3.08	3.08	3.96

Table 13: Ablation study of Soft Truncation for various $\mathbb{P}_{k}$ on CIFAR-10 trained with DDPM++ (VP).

Loss	NLL (after correction)	NLL (before correction)	NELBO (with residual)	NELBO (without residual)	FID (ODE)
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0})$	3.24	3.16	3.39	3.34	6.27
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.8})$	3.03	3.00	3.05	3.05	3.61
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{0.9})$	3.03	2.99	3.13	3.13	3.45
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1})$	3.01	2.98	3.08	3.08	3.96
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1.1})$	3.02	2.99	3.09	3.10	3.98
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{1.2})$	3.03	2.99	3.09	3.09	3.98
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{2})$	3.01	2.97	3.10	3.09	6.31
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{3})$	3.02	2.96	3.09	3.09	6.54
$\mathcal{L}_{ST}(\bm{\theta};g^{2},\mathbb{P}_{\infty})=\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	3.01	2.95	3.09	3.07	6.70
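
The family $\mathbb{P}_{k}$ compared in Table 13 can be sampled by inverse-transform sampling. The sketch below assumes the density $\mathbb{P}_{k}(\tau)\propto 1/\tau^{k}$ on $[\epsilon,T]$; this form is an assumption here, since this section does not restate the definition. Under that assumption, $k=0$ is uniform and $k\to\infty$ concentrates at $\epsilon$, which is consistent with the last row of the table. The function names and the defaults $\epsilon=10^{-5}$, $T=1$ are illustrative.

```python
import math
import random


def truncation_inverse_cdf(u: float, k: float, eps: float = 1e-5, T: float = 1.0) -> float:
    """Map u in [0, 1] to tau in [eps, T] under the ASSUMED density p_k(tau) ∝ tau**(-k)."""
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    if math.isinf(k):
        # Limit k -> infinity: all mass at eps, i.e. the usual hard truncation.
        return eps
    if math.isclose(k, 1.0):
        # CDF is log(tau / eps) / log(T / eps), so tau = eps * (T / eps) ** u.
        return eps * (T / eps) ** u
    # CDF is (tau**(1-k) - eps**(1-k)) / (T**(1-k) - eps**(1-k)); invert it for tau.
    # Note: eps ** (1 - k) can overflow for very large finite k; not handled in this sketch.
    a = eps ** (1.0 - k)
    b = T ** (1.0 - k)
    return (a + u * (b - a)) ** (1.0 / (1.0 - k))


def sample_truncation_time(k: float, eps: float = 1e-5, T: float = 1.0, rng=random) -> float:
    """Draw one softened truncation time tau ~ P_k using a uniform random number."""
    return truncation_inverse_cdf(rng.random(), k, eps, T)
```

In a training loop, one such $\tau$ would be drawn per mini-batch and the diffusion loss evaluated on $[\tau,T]$ rather than on $[\epsilon,T]$.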

Table 14: Ablation study of Soft Truncation for CIFAR-10 trained with DDPM++ when a diffusion is combined with a normalizing flow (Kim et al., 2022). We use $\mathbb{P}([a,b])=\frac{1}{2}1_{[a,b]}(\epsilon)+\frac{1}{2}\mathbb{P}_{0.9}([a,b])$.

Loss	NLL (after correction)	NLL (before correction)	NELBO (with residual)	NELBO (without residual)	FID (ODE)
$\mathcal{L}(\bm{\theta};g^{2},\epsilon)$	2.97	2.94	2.97	2.96	6.06
$\mathcal{L}(\bm{\theta};\sigma^{2},\epsilon)$	3.17	3.11	3.23	3.18	3.61
$\mathcal{L}(\bm{\theta};g^{2},\mathbb{P})$	3.01	2.98	3.02	3.01	3.89

Tables 10, 11, 12, 13, and 14 report the full performance results of Soft Truncation.

Figure 10 shows how images are created from the trained model, and Figures 11 to 16 present non-cherry-picked samples generated by the trained model.

$\displaystyle\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}]\big{]}\mathop{}\!\mathrm{d}t$
$\displaystyle=\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}-2g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\big{]}\mathop{}\!\mathrm{d}t$
$\displaystyle=\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t-\int_{\tau}^{T}\int g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}p_{t}(\mathbf{x}_{t})\mathop{}\!\mathrm{d}\mathbf{x}_{t}\mathop{}\!\mathrm{d}t$
$\displaystyle=\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t-\int_{\tau}^{T}\int g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}\int p_{r}(\mathbf{x}_{0})p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})\mathop{}\!\mathrm{d}\mathbf{x}_{0}\mathop{}\!\mathrm{d}\mathbf{x}_{t}\mathop{}\!\mathrm{d}t$
$\displaystyle=\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t-\int_{\tau}^{T}\int g^{2}(t)\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)\cdot\int p_{r}(\mathbf{x}_{0})\nabla_{\mathbf{x}_{t}}p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})\mathop{}\!\mathrm{d}\mathbf{x}_{0}\mathop{}\!\mathrm{d}\mathbf{x}_{t}\mathop{}\!\mathrm{d}t$
$\displaystyle=\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{r}(\mathbf{x}_{0})p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\big{[}g^{2}(t)[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla_{\mathbf{x}_{t}}\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}]\big{]}\mathop{}\!\mathrm{d}t,$

$\displaystyle\mathbb{E}_{p_{\tau}(\mathbf{x}_{\tau})}\big{[}-\log{p_{\tau}^{\bm{\theta}}(\mathbf{x}_{\tau})}\big{]}\leq D_{KL}(p_{T}\|\pi)+\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t+\mathcal{H}(p_{\tau})$		(24)
$\displaystyle=D_{KL}(p_{T}\|\pi)+\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{r}(\mathbf{x}_{0})p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\big{[}g^{2}(t)[\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}-\|\nabla\log{p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\|_{2}^{2}]\big{]}\mathop{}\!\mathrm{d}t$
$\displaystyle\quad+\frac{1}{2}\int_{\tau}^{T}\mathbb{E}_{p_{t}(\mathbf{x}_{t})}\big{[}g^{2}(t)\|\nabla\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t+\mathcal{H}(p_{\tau}).$

$\displaystyle\frac{1}{2}\int_{\epsilon}^{T}\lambda(t)\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t+\frac{\lambda(T)}{g^{2}(T)}D_{KL}(p_{T}\|\pi)$
$\displaystyle=\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}\frac{1}{2}\int_{\tau}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t\mathop{}\!\mathrm{d}\tau$
$\displaystyle\quad+\Big{(}\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\Big{)}\frac{1}{2}\int_{\epsilon}^{T}g^{2}(t)\mathbb{E}_{\mathbf{x}_{t}}\big{[}\|\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log{p_{t}(\mathbf{x}_{t})}\|_{2}^{2}\big{]}\mathop{}\!\mathrm{d}t+\frac{\lambda(T)}{g^{2}(T)}D_{KL}(p_{T}\|\pi)$
$\displaystyle\geq\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}\big{[}D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})-D_{KL}(p_{T}\|\pi)\big{]}\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}\big{[}D_{KL}(p_{\epsilon}\|p_{\epsilon}^{\bm{\theta}})-D_{KL}(p_{T}\|\pi)\big{]}+\frac{\lambda(T)}{g^{2}(T)}D_{KL}(p_{T}\|\pi)$
$\displaystyle=\int_{\epsilon}^{T}\Big{(}\frac{\lambda(\tau)}{g^{2}(\tau)}\Big{)}^{\prime}D_{KL}(p_{\tau}\|p_{\tau}^{\bm{\theta}})\mathop{}\!\mathrm{d}\tau+\frac{\lambda(\epsilon)}{g^{2}(\epsilon)}D_{KL}(p_{\epsilon}\|p_{\epsilon}^{\bm{\theta}}).$
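
The first equality in the derivation above is an integration-by-parts identity that holds for any integrand, so it can be checked numerically. The sketch below uses illustrative choices that are not taken from the paper: $g^{2}(t)=1+t$, $\lambda(t)=t^{2}g^{2}(t)$, a stand-in $e^{-t}$ for the expected squared score error, and a plain trapezoidal rule. The $D_{KL}(p_{T}\|\pi)$ term appears identically on both sides and is omitted.

```python
import math


def trapezoid(f, a, b, n=400):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    step = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * step)
    return total * step


# Illustrative (hypothetical) choices, not taken from the paper.
EPS, T = 0.1, 1.0


def g2(t):
    return 1.0 + t  # squared diffusion coefficient g^2(t)


def ratio(t):
    return t ** 2  # lambda(t) / g^2(t)


def ratio_prime(t):
    return 2.0 * t  # derivative of the ratio; nonnegative on [EPS, T]


def lam(t):
    return ratio(t) * g2(t)  # general weight lambda(t)


def score_error(t):
    return math.exp(-t)  # stand-in for E || s_theta - grad log p_t ||^2


def truncated_loss(tau):
    """(1/2) * integral over [tau, T] of g^2(t) * score_error(t): the loss truncated at tau."""
    return 0.5 * trapezoid(lambda t: g2(t) * score_error(t), tau, T)


# Left-hand side: the lambda-weighted loss on [EPS, T].
lhs = 0.5 * trapezoid(lambda t: lam(t) * score_error(t), EPS, T)
# Right-hand side: truncated losses averaged with weight (lambda / g^2)', plus the boundary term at EPS.
rhs = trapezoid(lambda tau: ratio_prime(tau) * truncated_loss(tau), EPS, T) + ratio(EPS) * truncated_loss(EPS)
```

The two values agree up to quadrature error. The inequality in the following step additionally requires $(\lambda/g^{2})^{\prime}\geq 0$, which this example satisfies.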

$\displaystyle e^{\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\frac{\mathop{}\!\mathrm{d}\sigma^{2}(t)}{\mathop{}\!\mathrm{d}t}+e^{\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\beta(t)\sigma^{2}(t)=e^{\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}g^{2}(t)$

$\displaystyle\iff\frac{\mathop{}\!\mathrm{d}\Big{[}e^{\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\sigma^{2}(t)\Big{]}}{\mathop{}\!\mathrm{d}t}=e^{\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}g^{2}(t)$

$\displaystyle\iff e^{\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\sigma^{2}(t)=\int_{0}^{t}e^{\int_{0}^{\tau}\beta(s)\mathop{}\!\mathrm{d}s}g^{2}(\tau)\mathop{}\!\mathrm{d}\tau+C$

$\displaystyle\iff\sigma^{2}(t)=e^{-\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}\int_{0}^{t}e^{\int_{0}^{\tau}\beta(s)\mathop{}\!\mathrm{d}s}g^{2}(\tau)\mathop{}\!\mathrm{d}\tau+Ce^{-\int_{0}^{t}\beta(s)\mathop{}\!\mathrm{d}s}.$
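
As a sanity check of the closed form above, take the VP case $g^{2}(t)=\beta(t)$ with $\sigma^{2}(0)=0$, so that $C=0$ and the solution reduces to $\sigma^{2}(t)=1-e^{-\int_{0}^{t}\beta(s)\mathrm{d}s}$. The sketch below compares this against a direct numerical integration of $\frac{\mathrm{d}\sigma^{2}(t)}{\mathrm{d}t}=-\beta(t)\sigma^{2}(t)+g^{2}(t)$. The linear schedule and its endpoints (0.1 and 20) are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear schedule endpoints (illustrative)


def beta(t):
    """Linear schedule beta(t) on [0, 1]."""
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)


def beta_integral(t):
    """Closed-form integral of beta(s) over [0, t] for the linear schedule."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2


def variance_closed_form(t):
    """sigma^2(t) from the solved ODE with g^2 = beta and C = 0."""
    return 1.0 - math.exp(-beta_integral(t))


def variance_rk4(t_end, n_steps=2000):
    """Integrate d(sigma^2)/dt = -beta(t) * sigma^2 + beta(t) from sigma^2(0) = 0 with classical RK4."""
    def rhs(t, v):
        return -beta(t) * v + beta(t)

    h = t_end / n_steps
    t, v = 0.0, 0.0
    for _ in range(n_steps):
        k1 = rhs(t, v)
        k2 = rhs(t + 0.5 * h, v + 0.5 * h * k1)
        k3 = rhs(t + 0.5 * h, v + 0.5 * h * k2)
        k4 = rhs(t + h, v + h * k3)
        v += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += h
    return v
```

Calling `variance_rk4(t)` and `variance_closed_form(t)` for the same `t` in $[0,1]$ should give matching values up to integration error.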