
DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics

Kaiwen Zheng∗†1, Cheng Lu∗1, Jianfei Chen1, Jun Zhu‡123
1Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, THBI Lab
1Tsinghua-Bosch Joint ML Center, Tsinghua University, Beijing, China
2Shengshu Technology, Beijing  3Pazhou Lab (Huangpu), Guangzhou, China
{zkwthu,lucheng.lc15}@gmail.com; {jianfeic, dcszj}@tsinghua.edu.cn
Abstract

Diffusion probabilistic models (DPMs) have exhibited excellent performance for high-fidelity image generation while suffering from inefficient sampling. Recent works accelerate the sampling procedure by proposing fast ODE solvers that leverage the specific ODE form of DPMs. However, they rely heavily on a specific parameterization during inference (such as noise/data prediction), which might not be the optimal choice. In this work, we propose a novel formulation towards the optimal parameterization during sampling that minimizes the first-order discretization error of the ODE solution. Based on such formulation, we propose DPM-Solver-v3, a new fast ODE solver for DPMs by introducing several coefficients efficiently computed on the pretrained model, which we call empirical model statistics. We further incorporate multistep methods and a predictor-corrector framework, and propose some techniques for improving sample quality at small numbers of function evaluations (NFE) or large guidance scales. Experiments show that DPM-Solver-v3 achieves consistently better or comparable performance in both unconditional and conditional sampling with both pixel-space and latent-space DPMs, especially in 5$\sim$10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable Diffusion, bringing a speed-up of 15%$\sim$30% compared to previous state-of-the-art training-free methods. Code is available at https://github.com/thu-ml/DPM-Solver-v3.

∗Equal contribution. †Work done during an internship at Shengshu Technology. ‡Corresponding author.

1 Introduction

Diffusion probabilistic models (DPMs) [47, 15, 51] are a class of state-of-the-art image generators. By training with a strong encoder, a large model, and massive data as well as leveraging techniques such as guided sampling, DPMs are capable of generating high-resolution photorealistic and artistic images on text-to-image tasks.

Refer to caption

(a) DPM-Solver++ [32] (MSE 0.60)

Refer to caption

(b) UniPC [58] (MSE 0.65)

Refer to caption

(c) DPM-Solver-v3 (Ours, MSE 0.55)

Figure 1: Random samples from Stable-Diffusion [43] with a classifier-free guidance scale of 7.5, using only 5 function evaluations (NFE) and the text prompt “A beautiful castle beside a waterfall in the woods, by Josef Thoma, matte painting, trending on artstation HQ”.

However, to generate high-quality visual content, DPMs usually require hundreds of model evaluations to gradually remove noise using a pretrained model [15], which is much more time-consuming compared to other deep generative models such as generative adversarial networks (GANs) [13]. The sampling overhead of DPMs emerges as a crucial obstacle hindering their integration into downstream tasks.

To accelerate the sampling process of DPMs, one can employ training-based methods [37, 53, 45] or training-free samplers [48, 51, 28, 3, 52, 56]. We focus on the latter approach since it requires no extra training and is more flexible. Recent advanced training-free samplers [56, 31, 32, 58] mainly rely on the ODE form of DPMs, since its absence of stochasticity is essential for high-quality samples in around 20 steps. Moreover, ODE solvers built on exponential integrators [18] converge faster. To change the original diffusion ODE into the form of exponential integrators, they need to cancel its linear term and obtain an ODE solution in which only the noise predictor needs to be approximated, while the other terms can be exactly computed. Besides, Lu et al. [32] find that it is better to use another ODE solution in which the data predictor, instead, needs to be approximated. How to choose such a model parameterization (e.g., noise/data prediction) in the sampling of DPMs remains largely unexplored.

In this work, we systematically study the problem of model parameterization and ODE formulation for fast sampling of DPMs. We first show that by introducing three types of coefficients, the original ODE solution can be reformulated to an equivalent one that contains a new model parameterization. Besides, inspired by exponential Rosenbrock-type methods [19] and first-order discretization error analysis, we also show that there exists an optimal set of coefficients efficiently computed on the pretrained model, which we call empirical model statistics (EMS). Building upon our novel ODE formulation, we further propose a new high-order solver for diffusion ODEs named DPM-Solver-v3, which includes a multistep predictor-corrector framework of any order, as well as some novel techniques such as a pseudo high-order method to boost the performance at extremely few steps or large guidance scales.

We conduct extensive experiments with both pixel-space and latent-space DPMs to verify the effectiveness of DPM-Solver-v3. Compared to previous fast samplers, DPM-Solver-v3 can consistently improve the sample quality in 5\sim20 steps, and make a significant advancement within 10 steps.

2 Background

2.1 Diffusion Probabilistic Models

Suppose we have a $D$-dimensional data distribution $q_0(\bm{x}_0)$. Diffusion probabilistic models (DPMs) [47, 15, 51] define a forward diffusion process $\{q_t\}_{t=0}^{T}$ to gradually degenerate the data $\bm{x}_0\sim q_0(\bm{x}_0)$ with Gaussian noise, which satisfies the transition kernel $q_{0t}(\bm{x}_t|\bm{x}_0)=\mathcal{N}(\alpha_t\bm{x}_0,\sigma_t^2\bm{I})$, such that $q_T(\bm{x}_T)$ is approximately pure Gaussian. $\alpha_t,\sigma_t$ are smooth scalar functions of $t$, which are called the noise schedule. The transition can be easily applied by $\bm{x}_t=\alpha_t\bm{x}_0+\sigma_t\bm{\epsilon}$, $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I})$. To train DPMs, a neural network $\bm{\epsilon}_\theta(\bm{x}_t,t)$ is usually parameterized to predict the noise $\bm{\epsilon}$ by minimizing $\mathbb{E}_{\bm{x}_0\sim q_0(\bm{x}_0),\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I}),t\sim\mathcal{U}(0,T)}\left[w(t)\|\bm{\epsilon}_\theta(\bm{x}_t,t)-\bm{\epsilon}\|_2^2\right]$, where $w(t)$ is a weighting function. Sampling of DPMs can be performed by solving the diffusion ODE [51] from time $T$ to time $0$:

\frac{\mathrm{d}\bm{x}_{t}}{\mathrm{d}t}=f(t)\bm{x}_{t}+\frac{g^{2}(t)}{2\sigma_{t}}\bm{\epsilon}_{\theta}(\bm{x}_{t},t), \qquad (1)

where $f(t)=\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}$, $g^2(t)=\frac{\mathrm{d}\sigma_t^2}{\mathrm{d}t}-2\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}\sigma_t^2$ [23]. In addition, conditional sampling by DPMs can be conducted by guided sampling [10, 16] with a conditional noise predictor $\bm{\epsilon}_\theta(\bm{x}_t,t,c)$, where $c$ is the condition. Specifically, classifier-free guidance [16] combines the unconditional/conditional model and obtains a new noise predictor $\bm{\epsilon}_\theta'(\bm{x}_t,t,c)\coloneqq s\bm{\epsilon}_\theta(\bm{x}_t,t,c)+(1-s)\bm{\epsilon}_\theta(\bm{x}_t,t,\emptyset)$, where $\emptyset$ is a special condition standing for the unconditional case and $s>0$ is the guidance scale that controls the trade-off between image-condition alignment and diversity; classifier guidance [10] instead uses an extra classifier $p_\phi(c|\bm{x}_t,t)$ to obtain a new noise predictor $\bm{\epsilon}_\theta'(\bm{x}_t,t,c)\coloneqq\bm{\epsilon}_\theta(\bm{x}_t,t)-s\sigma_t\nabla_{\bm{x}}\log p_\phi(c|\bm{x}_t,t)$.
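
For concreteness, the classifier-free guided predictor above amounts to a single weighted combination of two model calls; below is a minimal sketch (the callable `eps_model` and the `null_cond` token are illustrative placeholders, not the API of any particular codebase).

```python
def guided_noise_prediction(eps_model, x_t, t, cond, null_cond, s=7.5):
    """Classifier-free guidance: eps'(x, t, c) = s * eps(x, t, c) + (1 - s) * eps(x, t, null)."""
    eps_cond = eps_model(x_t, t, cond)         # conditional prediction
    eps_uncond = eps_model(x_t, t, null_cond)  # unconditional prediction
    return s * eps_cond + (1.0 - s) * eps_uncond
```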

In addition to noise prediction, DPMs can be parameterized as a score predictor $\bm{s}_\theta(\bm{x}_t,t)$ to predict $\nabla_{\bm{x}}\log q_t(\bm{x}_t)$, or a data predictor $\bm{x}_\theta(\bm{x}_t,t)$ to predict $\bm{x}_0$. Under the variance-preserving (VP) noise schedule, which satisfies $\alpha_t^2+\sigma_t^2=1$ [51], the “v” predictor $\bm{v}_\theta(\bm{x}_t,t)$ is also proposed to predict $\alpha_t\bm{\epsilon}-\sigma_t\bm{x}_0$ [45]. These different parameterizations are theoretically equivalent, but have an impact on the empirical performance when used in training [23, 59].
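
Since these parameterizations are related through the identity $\bm{x}_t=\alpha_t\bm{x}_0+\sigma_t\bm{\epsilon}$, converting between them is a one-line operation; a minimal sketch (argument names are illustrative):

```python
def eps_to_data(eps, x_t, alpha_t, sigma_t):
    """Data prediction x_0 recovered from the noise prediction."""
    return (x_t - sigma_t * eps) / alpha_t

def eps_to_score(eps, sigma_t):
    """Score prediction: the score of q_t equals -eps / sigma_t."""
    return -eps / sigma_t

def to_v(eps, x0, alpha_t, sigma_t):
    """"v" prediction under a VP schedule (alpha_t**2 + sigma_t**2 = 1)."""
    return alpha_t * eps - sigma_t * x0
```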

2.2 Fast Sampling of DPMs with Exponential Integrators

Among the methods for solving the diffusion ODE (1), recent works [56, 31, 32, 58] find that ODE solvers based on exponential integrators [18] are more efficient and robust at a small number of function evaluations (<50). Specifically, an insightful observation by Lu et al. [31] is that, by a change of variable from $t$ to $\lambda_t\coloneqq\log(\alpha_t/\sigma_t)$ (half of the log-SNR), the diffusion ODE is transformed to

\frac{\mathrm{d}\bm{x}_{\lambda}}{\mathrm{d}\lambda}=\frac{\dot{\alpha}_{\lambda}}{\alpha_{\lambda}}\bm{x}_{\lambda}-\sigma_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda), \qquad (2)

where $\dot{\alpha}_\lambda\coloneqq\frac{\mathrm{d}\alpha_\lambda}{\mathrm{d}\lambda}$. By utilizing the semi-linear structure of the diffusion ODE and exactly computing the linear term [56, 31], we can obtain the ODE solution as Eq. (3) (left). Such exact computation of the linear part reduces the discretization errors [31]. Moreover, by leveraging the equivalence of different parameterizations, DPM-Solver++ [32] rewrites Eq. (2) with the data predictor $\bm{x}_\theta(\bm{x}_\lambda,\lambda)\coloneqq(\bm{x}_\lambda-\sigma_\lambda\bm{\epsilon}_\theta(\bm{x}_\lambda,\lambda))/\alpha_\lambda$ and obtains another ODE solution as Eq. (3) (right). Such a solution does not need to change the pretrained noise prediction model $\bm{\epsilon}_\theta$ during the sampling process, and empirically outperforms previous samplers based on $\bm{\epsilon}_\theta$ [31].

\frac{\bm{x}_{t}}{\alpha_{t}}=\frac{\bm{x}_{s}}{\alpha_{s}}-\int_{\lambda_{s}}^{\lambda_{t}}e^{-\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\mathrm{d}\lambda,\quad\frac{\bm{x}_{t}}{\sigma_{t}}=\frac{\bm{x}_{s}}{\sigma_{s}}+\int_{\lambda_{s}}^{\lambda_{t}}e^{\lambda}\bm{x}_{\theta}(\bm{x}_{\lambda},\lambda)\mathrm{d}\lambda \qquad (3)
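
As an example of how Eq. (3) is used, freezing $\bm{\epsilon}_\theta$ at $\lambda_s$ in the left integral gives the familiar first-order (DDIM-type) update; a minimal sketch under this assumption (argument names are illustrative):

```python
import torch

def first_order_step(x_s, eps_s, alpha_s, alpha_t, sigma_t, h):
    """One first-order step of Eq. (3) (left), with eps_theta held constant at lambda_s.

    h = lambda_t - lambda_s is the log-SNR step; eps_s = eps_theta(x_s, lambda_s)
    is the single model evaluation of this step.
    """
    return (alpha_t / alpha_s) * x_s - sigma_t * torch.expm1(h) * eps_s
```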

However, to the best of our knowledge, the parameterizations for sampling are still manually selected and limited to noise/data prediction, and they have not been systematically studied.

3 Method

We now present our method. We start with a new formulation of the ODE solution with extra coefficients, followed by our high-order solver and some practical considerations. In the following discussions, we assume all products between vectors are element-wise, and $\bm{f}^{(k)}(\bm{x}_\lambda,\lambda)=\frac{\mathrm{d}^k\bm{f}(\bm{x}_\lambda,\lambda)}{\mathrm{d}\lambda^k}$ is the $k$-th order total derivative of any function $\bm{f}$ w.r.t. $\lambda$.

3.1 Improved Formulation of Exact Solutions of Diffusion ODEs

As mentioned in Sec. 2.2, it is promising to explore the semi-linear structure of diffusion ODEs for fast sampling [56, 31, 32]. Firstly, we reveal one key insight that we can choose the linear part according to Rosenbrock-type exponential integrators [19, 18]. To this end, we consider a general form of diffusion ODEs by rewriting Eq. (2) as

\frac{\mathrm{d}\bm{x}_{\lambda}}{\mathrm{d}\lambda}=\underbrace{\left(\frac{\dot{\alpha}_{\lambda}}{\alpha_{\lambda}}-\bm{l}_{\lambda}\right)\bm{x}_{\lambda}}_{\text{linear part}}-\underbrace{(\sigma_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{l}_{\lambda}\bm{x}_{\lambda})}_{\text{non-linear part}\,\coloneqq\,\bm{N}_{\theta}(\bm{x}_{\lambda},\lambda)}, \qquad (4)

where $\bm{l}_\lambda$ is a $D$-dimensional undetermined coefficient depending on $\lambda$. We choose $\bm{l}_\lambda$ to restrict the Frobenius norm of the gradient of the non-linear part w.r.t. $\bm{x}$:

\bm{l}^{*}_{\lambda}=\operatornamewithlimits{argmin}_{\bm{l}_{\lambda}}\mathbb{E}_{p^{\theta}_{\lambda}(\bm{x}_{\lambda})}\|\nabla_{\bm{x}}\bm{N}_{\theta}(\bm{x}_{\lambda},\lambda)\|_{F}^{2}, \qquad (5)

where $p^\theta_\lambda$ is the distribution of samples on the ground-truth ODE trajectories at $\lambda$ (i.e., the model distribution). Intuitively, this choice makes $\bm{N}_\theta$ insensitive to errors in $\bm{x}$ and cancels all the “linearity” of $\bm{N}_\theta$. With $\bm{l}_\lambda=\bm{l}_\lambda^*$, by the “variation-of-constants” formula [1], starting from $\bm{x}_{\lambda_s}$ at time $s$, the exact solution of Eq. (4) at time $t$ is

\bm{x}_{\lambda_{t}}=\alpha_{\lambda_{t}}e^{-\int_{\lambda_{s}}^{\lambda_{t}}\bm{l}_{\lambda}\mathrm{d}\lambda}\bigg(\frac{\bm{x}_{\lambda_{s}}}{\alpha_{\lambda_{s}}}-\int_{\lambda_{s}}^{\lambda_{t}}e^{\int_{\lambda_{s}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau}\underbrace{\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)}_{\text{approximated}}\mathrm{d}\lambda\bigg), \qquad (6)

where $\bm{f}_\theta(\bm{x}_\lambda,\lambda)\coloneqq\frac{\bm{N}_\theta(\bm{x}_\lambda,\lambda)}{\alpha_\lambda}$. To calculate the solution in Eq. (6), we need to approximate $\bm{f}_\theta$ for each $\lambda\in[\lambda_s,\lambda_t]$ by certain polynomials [31, 32].

Secondly, we reveal another key insight that we can choose different functions to be approximated instead of $\bm{f}_\theta$ and further reduce the discretization error, which is related to the total derivatives of the approximated function. To this end, we consider a scaled version of $\bm{f}_\theta$, i.e., $\bm{h}_\theta(\bm{x}_\lambda,\lambda)\coloneqq e^{-\int_{\lambda_s}^{\lambda}\bm{s}_\tau\mathrm{d}\tau}\bm{f}_\theta(\bm{x}_\lambda,\lambda)$, where $\bm{s}_\lambda$ is a $D$-dimensional coefficient dependent on $\lambda$, and then Eq. (6) becomes

\bm{x}_{\lambda_{t}}=\alpha_{\lambda_{t}}e^{-\int_{\lambda_{s}}^{\lambda_{t}}\bm{l}_{\lambda}\mathrm{d}\lambda}\bigg(\frac{\bm{x}_{\lambda_{s}}}{\alpha_{\lambda_{s}}}-\int_{\lambda_{s}}^{\lambda_{t}}e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\underbrace{\bm{h}_{\theta}(\bm{x}_{\lambda},\lambda)}_{\text{approximated}}\mathrm{d}\lambda\bigg). \qquad (7)

Comparing with Eq. (6), we change the approximated function from $\bm{f}_\theta$ to $\bm{h}_\theta$ by using an additional scaling term related to $\bm{s}_\lambda$. As we shall see, the first-order discretization error is positively related to the norm of the first-order derivative $\bm{h}_\theta^{(1)}=e^{-\int_{\lambda_s}^{\lambda}\bm{s}_\tau\mathrm{d}\tau}(\bm{f}^{(1)}_\theta-\bm{s}_\lambda\bm{f}_\theta)$. Thus, we aim to minimize $\|\bm{f}^{(1)}_\theta-\bm{s}_\lambda\bm{f}_\theta\|_2$, in order to reduce $\|\bm{h}_\theta^{(1)}\|_2$ and the discretization error. As $\bm{f}_\theta$ is a fixed function depending on the pretrained model, this minimization problem essentially finds a linear function of $\bm{f}_\theta$ to approximate $\bm{f}^{(1)}_\theta$. To achieve better linear approximation, we further introduce a bias term $\bm{b}_\lambda\in\mathbb{R}^D$ and construct a function $\bm{g}_\theta$ satisfying $\bm{g}_\theta^{(1)}=e^{-\int_{\lambda_s}^{\lambda}\bm{s}_\tau\mathrm{d}\tau}(\bm{f}^{(1)}_\theta-\bm{s}_\lambda\bm{f}_\theta-\bm{b}_\lambda)$, which gives

\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)\coloneqq e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)-\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r. \qquad (8)

With $\bm{g}_\theta$, Eq. (7) becomes

\bm{x}_{\lambda_{t}}=\alpha_{\lambda_{t}}\underbrace{e^{-\int_{\lambda_{s}}^{\lambda_{t}}\bm{l}_{\lambda}\mathrm{d}\lambda}}_{\text{linear coefficient}}\bigg(\frac{\bm{x}_{\lambda_{s}}}{\alpha_{\lambda_{s}}}-\int_{\lambda_{s}}^{\lambda_{t}}\underbrace{e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}}_{\text{scaling coefficient}}\Big(\underbrace{\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)}_{\text{approximated}}+\underbrace{\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r}_{\text{bias coefficient}}\Big)\mathrm{d}\lambda\bigg). \qquad (9)

Such formulation is equivalent to Eq. (3), but introduces three types of coefficients and a new parameterization $\bm{g}_\theta$. We show in Appendix I.1 that the generalized parameterization $\bm{g}_\theta$ in Eq. (8) can cover a wide range of parameterization families in the form of $\bm{\psi}_\theta(\bm{x}_\lambda,\lambda)=\bm{\alpha}(\lambda)\bm{\epsilon}_\theta(\bm{x}_\lambda,\lambda)+\bm{\beta}(\lambda)\bm{x}_\lambda+\bm{\gamma}(\lambda)$. We aim to reduce the discretization error by finding better coefficients than previous works [31, 32].
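
For example, within this family, noise prediction corresponds to $\bm{\alpha}(\lambda)=1$, $\bm{\beta}(\lambda)=\bm{\gamma}(\lambda)=0$, while data prediction follows directly from the definition $\bm{x}_\theta=(\bm{x}_\lambda-\sigma_\lambda\bm{\epsilon}_\theta)/\alpha_\lambda$ (the full derivation of the general case is in Appendix I.1):

\bm{x}_{\theta}(\bm{x}_{\lambda},\lambda)=\frac{1}{\alpha_{\lambda}}\bm{x}_{\lambda}-\frac{\sigma_{\lambda}}{\alpha_{\lambda}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda),\quad\text{i.e.,}\quad\bm{\alpha}(\lambda)=-\frac{\sigma_{\lambda}}{\alpha_{\lambda}},\ \bm{\beta}(\lambda)=\frac{1}{\alpha_{\lambda}},\ \bm{\gamma}(\lambda)=0.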

Now we derive the concrete formula for analyzing the first-order discretization error. By replacing $\bm{g}_\theta(\bm{x}_\lambda,\lambda)$ with $\bm{g}_\theta(\bm{x}_{\lambda_s},\lambda_s)$ in Eq. (9), we obtain the first-order approximation $\hat{\bm{x}}_{\lambda_t}=\alpha_{\lambda_t}e^{-\int_{\lambda_s}^{\lambda_t}\bm{l}_\lambda\mathrm{d}\lambda}\left(\frac{\bm{x}_{\lambda_s}}{\alpha_{\lambda_s}}-\int_{\lambda_s}^{\lambda_t}e^{\int_{\lambda_s}^{\lambda}(\bm{l}_\tau+\bm{s}_\tau)\mathrm{d}\tau}\left(\bm{g}_\theta(\bm{x}_{\lambda_s},\lambda_s)+\int_{\lambda_s}^{\lambda}e^{-\int_{\lambda_s}^{r}\bm{s}_\tau\mathrm{d}\tau}\bm{b}_r\mathrm{d}r\right)\mathrm{d}\lambda\right)$. As $\bm{g}_\theta(\bm{x}_{\lambda_s},\lambda_s)=\bm{g}_\theta(\bm{x}_\lambda,\lambda)+(\lambda_s-\lambda)\bm{g}_\theta^{(1)}(\bm{x}_\lambda,\lambda)+\mathcal{O}((\lambda-\lambda_s)^2)$ by Taylor expansion, it follows that the first-order discretization error can be expressed as

\hat{\bm{x}}_{\lambda_{t}}-\bm{x}_{\lambda_{t}}=\alpha_{\lambda_{t}}e^{-\int_{\lambda_{s}}^{\lambda_{t}}\bm{l}_{\lambda}\mathrm{d}\lambda}\bigg(\int_{\lambda_{s}}^{\lambda_{t}}e^{\int_{\lambda_{s}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau}(\lambda-\lambda_{s})\Big(\bm{f}^{(1)}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{s}_{\lambda}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{b}_{\lambda}\Big)\mathrm{d}\lambda\bigg)+\mathcal{O}(h^{3}), \qquad (10)

where $h=\lambda_t-\lambda_s$. Thus, given the optimal $\bm{l}_\lambda=\bm{l}_\lambda^*$ in Eq. (5), the discretization error $\hat{\bm{x}}_{\lambda_t}-\bm{x}_{\lambda_t}$ mainly depends on $\bm{f}^{(1)}_\theta-\bm{s}_\lambda\bm{f}_\theta-\bm{b}_\lambda$. Based on this insight, we choose the coefficients $\bm{s}_\lambda,\bm{b}_\lambda$ by solving

\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*}=\operatornamewithlimits{argmin}_{\bm{s}_{\lambda},\bm{b}_{\lambda}}\mathbb{E}_{p^{\theta}_{\lambda}(\bm{x}_{\lambda})}\left[\left\|\bm{f}^{(1)}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{s}_{\lambda}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{b}_{\lambda}\right\|^{2}_{2}\right]. \qquad (11)

For any $\lambda$, $\bm{l}_\lambda^*,\bm{s}_\lambda^*,\bm{b}_\lambda^*$ all have analytic solutions involving the Jacobian-vector-product of the pretrained model $\bm{\epsilon}_\theta$, and they can be unbiasedly evaluated on a few datapoints $\{\bm{x}_\lambda^{(n)}\}_K\sim p_\lambda^\theta(\bm{x}_\lambda)$ via Monte Carlo estimation (detailed in Section 3.4 and Appendix C.1.1). Therefore, we call $\bm{l}_\lambda,\bm{s}_\lambda,\bm{b}_\lambda$ empirical model statistics (EMS). In the following sections, we'll show that by approximating $\bm{g}_\theta$ with Taylor expansion, we can develop our high-order solver for diffusion ODEs.
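
Intuitively, Eq. (11) is a dimension-wise least-squares problem: given Monte Carlo samples of $\bm{f}_\theta$ and $\bm{f}^{(1)}_\theta$ (the latter requiring Jacobian-vector products, see Appendix C.1.1), $\bm{s}_\lambda^*$ and $\bm{b}_\lambda^*$ are the slope and intercept of a per-dimension linear regression. A sketch of that regression step only (function and argument names are illustrative):

```python
import torch

def fit_s_b(f_samples, f1_samples, eps=1e-8):
    """Dimension-wise least squares for Eq. (11).

    f_samples, f1_samples: tensors of shape (K, D) holding f_theta and its total
    derivative f_theta^(1) evaluated on K datapoints x_lambda ~ p_lambda^theta.
    Returns (s, b), each of shape (D,).
    """
    f_mean, f1_mean = f_samples.mean(0), f1_samples.mean(0)
    cov = ((f_samples - f_mean) * (f1_samples - f1_mean)).mean(0)
    var = ((f_samples - f_mean) ** 2).mean(0)
    s = cov / (var + eps)      # regression slope per dimension
    b = f1_mean - s * f_mean   # regression intercept per dimension
    return s, b
```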

3.2 Developing High-Order Solver

Refer to caption

(a) Local Approximation

Refer to caption

(b) Multistep Predictor-Corrector

Figure 2: Illustration of our high-order solver. (a) $(n+1)$-th order local approximation from time $s$ to time $t$, provided $n$ extra function values of $\bm{g}_\theta$. (b) Multistep predictor-corrector procedure as our global solver. A combination of second-order predictor and second-order corrector is shown. $a_{\cdot,i-1},b_{\cdot,i-1},c_{\cdot,i-1}$ are abbreviations of coefficients in Eq. (8).

In this section, we propose our high-order solver for diffusion ODEs with local accuracy and global convergence order guarantee by leveraging our proposed solution formulation in Eq. (9). The proposed solver and analyses are highly motivated by the methods of exponential integrators [17, 18] in the ODE literature and their previous applications in the field of diffusion models [56, 31, 32, 58]. Though the EMS are designed to minimize the first-order error, they can also help our high-order solver (see Appendix I.2).

For simplicity, denote $\bm{A}(\lambda_s,\lambda_t)\coloneqq e^{-\int_{\lambda_s}^{\lambda_t}\bm{l}_\tau\mathrm{d}\tau}$, $\bm{E}_{\lambda_s}(\lambda)\coloneqq e^{\int_{\lambda_s}^{\lambda}(\bm{l}_\tau+\bm{s}_\tau)\mathrm{d}\tau}$, $\bm{B}_{\lambda_s}(\lambda)\coloneqq\int_{\lambda_s}^{\lambda}e^{-\int_{\lambda_s}^{r}\bm{s}_\tau\mathrm{d}\tau}\bm{b}_r\mathrm{d}r$. Though high-order ODE solvers essentially share the same mathematical principles, since we utilize a more complicated parameterization $\bm{g}_\theta$ and ODE formulation in Eq. (9) than previous works [56, 31, 32, 58], we divide the development of the high-order solver into simplified local and global parts, which are not only easier to understand, but also neat and general for any order.

3.2.1 Local Approximation

Firstly, we derive formulas and properties of the local approximation, which describes how we transit locally from time $s$ to time $t$. It can be divided into two parts: discretization of the integral in Eq. (9) and estimating the high-order derivatives in the Taylor expansion.

Discretization. Denote $\bm{g}^{(k)}_s\coloneqq\bm{g}_\theta^{(k)}(\bm{x}_{\lambda_s},\lambda_s)$. For $n\geq 0$, to obtain the $(n+1)$-th order discretization of Eq. (9), we take the $n$-th order Taylor expansion of $\bm{g}_\theta(\bm{x}_\lambda,\lambda)$ w.r.t. $\lambda$ at $\lambda_s$: $\bm{g}_\theta(\bm{x}_\lambda,\lambda)=\sum_{k=0}^{n}\frac{(\lambda-\lambda_s)^k}{k!}\bm{g}^{(k)}_s+\mathcal{O}((\lambda-\lambda_s)^{n+1})$. Substituting it into Eq. (9) yields

\frac{\bm{x}_{t}}{\alpha_{t}}=\bm{A}(\lambda_{s},\lambda_{t})\left(\frac{\bm{x}_{s}}{\alpha_{s}}-\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda-\sum_{k=0}^{n}\bm{g}^{(k)}_{s}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda\right)+\mathcal{O}(h^{n+2}) \qquad (12)

Here we only need to estimate the $k$-th order total derivatives $\bm{g}^{(k)}_\theta(\bm{x}_{\lambda_s},\lambda_s)$ for $0\leq k\leq n$, since the other terms are determined once given $\lambda_s,\lambda_t$ and $\bm{l}_\lambda,\bm{s}_\lambda,\bm{b}_\lambda$, which we'll discuss next.

High-order derivative estimation. For the $(n+1)$-th order approximation, we use the finite differences of $\bm{g}_\theta(\bm{x}_\lambda,\lambda)$ at the previous $n+1$ steps $\lambda_{i_n},\dots,\lambda_{i_1},\lambda_s$ to estimate each $\bm{g}_\theta^{(k)}(\bm{x}_{\lambda_s},\lambda_s)$. Such a derivation matches the coefficients of the Taylor expansions. Concretely, denote $\delta_k\coloneqq\lambda_{i_k}-\lambda_s$ and $\bm{g}_{i_k}\coloneqq\bm{g}_\theta(\bm{x}_{\lambda_{i_k}},\lambda_{i_k})$; then the estimated high-order derivatives $\hat{\bm{g}}^{(k)}_s$ can be solved from the following linear system:

\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{n}&\delta_{n}^{2}&\cdots&\delta_{n}^{n}\end{pmatrix}\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}}{n!}\end{pmatrix}=\begin{pmatrix}\bm{g}_{i_{1}}-\bm{g}_{s}\\ \vdots\\ \bm{g}_{i_{n}}-\bm{g}_{s}\end{pmatrix} \qquad (13)

Then by substituting $\hat{\bm{g}}^{(k)}_s$ into Eq. (12) and dropping the $\mathcal{O}(h^{n+2})$ error terms, we obtain the $(n+1)$-th order local approximation:

\frac{\hat{\bm{x}}_{t}}{\alpha_{t}}=\bm{A}(\lambda_{s},\lambda_{t})\left(\frac{\bm{x}_{s}}{\alpha_{s}}-\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda-\sum_{k=0}^{n}\hat{\bm{g}}^{(k)}_{s}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda\right) \qquad (14)

where $\hat{\bm{g}}_\theta^{(0)}(\bm{x}_{\lambda_s},\lambda_s)=\bm{g}_s$. Eq. (13) and Eq. (14) provide an update rule to transit from time $s$ to time $t$ and get an approximated solution $\hat{\bm{x}}_t$, when we already have the solution $\bm{x}_s$. For the $(n+1)$-th order approximation, we need $n$ extra solutions $\bm{x}_{\lambda_{i_k}}$ and their corresponding function values $\bm{g}_{i_k}$. We illustrate the procedure in Fig. 2(a) and summarize it in Appendix C.2. In the following theorem, we show that under some assumptions, such local approximation has a guarantee of order of accuracy.
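
The system in Eq. (13) is a small Vandermonde-type system of size $n$ (independent of the data dimension $D$) and can be solved directly; a minimal sketch, assuming the cached values are stored as arrays:

```python
import math
import numpy as np

def estimate_derivatives(deltas, g_prev, g_s):
    """Solve Eq. (13) for the estimated derivatives ghat^(k)_s, k = 1, ..., n.

    deltas: array (n,) with delta_k = lambda_{i_k} - lambda_s.
    g_prev: array (n, D) with rows g_{i_k}; g_s: array (D,) with g at lambda_s.
    Returns an array (n, D) whose (k-1)-th row approximates g^(k)_s.
    """
    n = len(deltas)
    R = np.vander(deltas, N=n + 1, increasing=True)[:, 1:]   # rows (delta, delta^2, ..., delta^n)
    coeffs = np.linalg.solve(R, g_prev - g_s)                # rows are g^(k)_s / k!
    factorials = np.array([math.factorial(k) for k in range(1, n + 1)])
    return coeffs * factorials[:, None]
```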

Theorem 3.1 (Local order of accuracy, proof in Appendix B.2.1).

Suppose $\bm{x}_{\lambda_{i_k}}$ are exact (i.e., on the ground-truth ODE trajectory passing $\bm{x}_s$) for $k=1,\dots,n$; then under some regularity conditions detailed in Appendix B.1, the local truncation error $\hat{\bm{x}}_t-\bm{x}_t=\mathcal{O}(h^{n+2})$, which means the local approximation has $(n+1)$-th order of accuracy.

Besides, we have the following theorem showing that, whatever the order is, the local approximation is unbiased given our choice of $\bm{s}_\lambda,\bm{b}_\lambda$ in Eq. (11). In practice, the phenomenon of reduced bias can be empirically observed (Section 4.3).

Theorem 3.2 (Local unbiasedness, proof in Appendix B.4).

Given the optimal $\bm{s}_\lambda^*,\bm{b}_\lambda^*$ in Eq. (11), for the $(n+1)$-th order approximation, suppose $\bm{x}_{\lambda_{i_1}},\dots,\bm{x}_{\lambda_{i_n}}$ are on the ground-truth ODE trajectory passing $\bm{x}_{\lambda_s}$; then $\mathbb{E}_{p^\theta_{\lambda_s}(\bm{x}_s)}\left[\hat{\bm{x}}_t-\bm{x}_t\right]=0$.

3.2.2 Global Solver

Given $M+1$ time steps $\{t_i\}_{i=0}^{M}$, starting from some initial value, we can repeat the local approximation $M$ times to make consecutive transitions from each $t_{i-1}$ to $t_i$ until we reach an acceptable solution. At each step, we apply multistep methods [1] by caching and reusing the previous $n$ values at timesteps $t_{i-1-n},\dots,t_{i-2}$, which is proven to be more stable when NFE is small [32, 56]. Moreover, we also apply the predictor-corrector method [58] to refine the approximation at each step without introducing extra NFE. Specifically, the $(n+1)$-th order predictor is the case of the local approximation when we choose $(t_{i_n},\dots,t_{i_1},s,t)=(t_{i-1-n},\dots,t_{i-2},t_{i-1},t_i)$, and the $(n+1)$-th order corrector is the case when we choose $(t_{i_n},\dots,t_{i_1},s,t)=(t_{i-n},\dots,t_{i-2},t_i,t_{i-1},t_i)$. We present our $(n+1)$-th order multistep predictor-corrector procedure in Appendix C.2. We also illustrate a second-order case in Fig. 2(b). Note that, different from previous works, in the local transition from $t_{i-1}$ to $t_i$, the previous function values $\hat{\bm{g}}_{i_k}$ ($1\leq k\leq n$) used for derivative estimation are dependent on $i$ and are different during the sampling process, because $\bm{g}_\theta$ is dependent on the current $t_{i-1}$ (see Eq. (8)). Thus, we directly cache $\hat{\bm{x}}_i,\hat{\bm{\epsilon}}_i$ and reuse them to compute $\hat{\bm{g}}_i$ in the subsequent steps. Notably, our proposed solver also has a global convergence guarantee, as shown in the following theorem. For simplicity, we only consider the predictor case; the case with corrector can be proved by similar derivations in [58].
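
A schematic of the resulting sampling loop is sketched below; `local_step` stands for the local approximation in Eq. (14) built from cached $(\hat{\bm{x}},\hat{\bm{\epsilon}})$ pairs and is a hypothetical placeholder rather than our actual interface (the precise algorithm is given in Appendix C.2).

```python
def multistep_predictor_corrector(x_init, timesteps, eps_model, local_step, n_prev=1):
    """Skeleton of the multistep predictor-corrector loop (a sketch, not the exact algorithm).

    The model is evaluated once per step, on the predicted point; that evaluation is
    reused by the corrector and by later predictor steps, so the corrector adds no NFE.
    """
    cache = [(timesteps[0], x_init, eps_model(x_init, timesteps[0]))]  # (t, x_hat, eps_hat)
    for i in range(1, len(timesteps)):
        s, t = timesteps[i - 1], timesteps[i]
        prev = cache[-(n_prev + 1):]                     # current point plus up to n_prev earlier ones
        x_pred = local_step(prev, s, t)                  # predictor: transition s -> t
        eps_pred = eps_model(x_pred, t)                  # the only model call of this step
        x_corr = local_step(prev + [(t, x_pred, eps_pred)], s, t)  # corrector refines x_t
        cache.append((t, x_corr, eps_pred))              # cache x_hat, eps_hat for reuse
    return cache[-1][1]
```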

Theorem 3.3 (Global order of convergence, proof in Appendix B.2.2).

For the $(n+1)$-th order predictor, if we iteratively compute a sequence $\{\hat{\bm{x}}_i\}_{i=0}^{M}$ to approximate the true solutions $\{\bm{x}_i\}_{i=0}^{M}$ at $\{t_i\}_{i=0}^{M}$, then under both local and global assumptions detailed in Appendix B.1, the final error $|\hat{\bm{x}}_M-\bm{x}_M|=\mathcal{O}(h^{n+1})$, where $|\cdot|$ denotes the element-wise absolute value, and $h=\max_{1\leq i\leq M}(\lambda_i-\lambda_{i-1})$.

3.3 Practical Techniques

In this section, we introduce some practical techniques that further improve the sample quality in the case of small NFE or large guidance scales.

Pseudo-order solver for small NFE. When NFE is extremely small (e.g., 5$\sim$10), the error at each timestep becomes rather large, and incorporating too many previous values by the high-order solver at each step will cause instabilities. To alleviate this problem, we propose a technique called pseudo-order solver: when estimating the $k$-th order derivative, we only utilize the previous $k+1$ function values of $\bm{g}_\theta$, instead of all the $n$ previous values as in Eq. (13). For each $k$, we can obtain $\hat{\bm{g}}^{(k)}_s$ by solving a part of Eq. (13) and taking the last element:

\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{k}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{k}&\delta_{k}^{2}&\cdots&\delta_{k}^{k}\end{pmatrix}\begin{pmatrix}\cdot\\ \vdots\\ \frac{\hat{\bm{g}}^{(k)}_{s}}{k!}\end{pmatrix}=\begin{pmatrix}\bm{g}_{i_{1}}-\bm{g}_{s}\\ \vdots\\ \bm{g}_{i_{k}}-\bm{g}_{s}\end{pmatrix},\quad k=1,2,\dots,n \qquad (15)

In practice, we do not need to solve $n$ linear systems. Instead, the solutions for $\hat{\bm{g}}^{(k)}_s,k=1,\dots,n$ have a simpler recurrence relation similar to Neville's method [36] in Lagrange polynomial interpolation. Denoting $i_0\coloneqq s$ so that $\delta_0=\lambda_{i_0}-\lambda_s=0$, we have

Theorem 3.4 (Pseudo-order solver).

For each $k$, the solution in Eq. (15) is $\hat{\bm{g}}^{(k)}_s=k!D^{(k)}_0$, where

D^{(0)}_{l}\coloneqq\bm{g}_{i_{l}},\quad l=0,1,\dots,n \qquad (16)
D^{(k)}_{l}\coloneqq\frac{D^{(k-1)}_{l+1}-D^{(k-1)}_{l}}{\delta_{l+k}-\delta_{l}},\quad l=0,1,\dots,n-k

Proof in Appendix B.3. Note that the pseudo-order solver of order $n>2$ no longer has the guarantee of $n$-th order of accuracy, which is not so important when NFE is small. In our experiments, we mainly rely on two combinations: when we use an $n$-th order predictor, we combine it with either an $n$-th order corrector or an $(n+1)$-th pseudo-order corrector.
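
The recurrence in Theorem 3.4 is essentially a divided-difference table, so all pseudo-order derivatives can be computed with $\mathcal{O}(n^2)$ vector operations; a minimal sketch (array shapes are illustrative):

```python
import math
import numpy as np

def pseudo_order_derivatives(deltas, g_values):
    """Compute ghat^(k)_s = k! * D_0^(k) for k = 1, ..., n via Eq. (16).

    deltas: array (n+1,) with deltas[0] = 0 and deltas[l] = lambda_{i_l} - lambda_s.
    g_values: array (n+1, D) with g_values[0] = g_s and g_values[l] = g_{i_l}.
    """
    n = len(deltas) - 1
    table = [np.asarray(g) for g in g_values]          # D_l^(0) = g_{i_l}
    derivatives = []
    for k in range(1, n + 1):
        table = [(table[l + 1] - table[l]) / (deltas[l + k] - deltas[l])
                 for l in range(n - k + 1)]            # D_l^(k)
        derivatives.append(math.factorial(k) * table[0])
    return derivatives
```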

Half-corrector for large guidance scales. When the guidance scale is large in guided sampling, we find that the corrector may have negative effects on the sample quality. We propose a useful technique called half-corrector, which uses the corrector only in the time region $t\leq 0.5$. Correspondingly, the strategy that uses the corrector at every step is called full-corrector.

3.4 Implementation Details

In this section, we give some implementation details about how to compute and integrate the EMS in our solver and how to adapt them to guided sampling.

Estimating EMS. For a specific $\lambda$, the EMS $\bm{l}_\lambda^*,\bm{s}_\lambda^*,\bm{b}_\lambda^*$ can be estimated by firstly drawing $K$ (1024$\sim$4096) datapoints $\bm{x}_\lambda\sim p^\theta_\lambda(\bm{x}_\lambda)$ with 200-step DPM-Solver++ [32] and then analytically computing some terms related to $\bm{\epsilon}_\theta$ (detailed in Appendix C.1.1). In practice, we find it both convenient and effective to choose the distribution of the dataset $q_0$ to approximate $p_0^\theta$. Thus, without further specifications, we directly use samples from $q_0$.

Estimating integrals of EMS. We estimate the EMS on $N$ (120$\sim$1200) timesteps $\lambda_{j_0},\lambda_{j_1},\dots,\lambda_{j_N}$ and use the trapezoidal rule to estimate the integrals in Eq. (12) (see Appendix I.3 for the estimation error analysis). We also apply some pre-computation for the integrals to avoid extra computation costs during sampling, detailed in Appendix C.1.2.
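
As a sketch of this step, the running integral of an EMS quantity over the grid can be accumulated once with the trapezoidal rule and looked up during sampling (the actual pre-computation in Appendix C.1.2 may differ in details):

```python
import numpy as np

def cumulative_integral(values, lambdas):
    """Running trapezoidal estimate of int_{lambda_{j_0}}^{lambda} v(tau) d tau.

    values: array (N+1, D) of one EMS quantity (e.g. l_lambda) on the grid lambdas (N+1,).
    Returns an array (N+1, D) with the integral from lambda_{j_0} to each grid point.
    """
    widths = np.diff(lambdas)[:, None]                       # (N, 1)
    areas = 0.5 * (values[1:] + values[:-1]) * widths        # (N, D) trapezoids
    return np.concatenate([np.zeros_like(values[:1]), np.cumsum(areas, axis=0)], axis=0)
```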

Adaptation to guided sampling. Empirically, we find that within a common range of guidance scales, we can simply compute the EMS on the model without guidance, and it can work for both unconditional sampling and guided sampling cases. See Appendix J for more discussions.

3.5 Comparison with Existing Methods

By comparing with existing diffusion ODE solvers that are based on exponential integrators [56, 31, 32, 58], we can conclude that (1) previous ODE formulations with noise/data prediction are special cases of ours, obtained by setting $\bm{l}_\lambda,\bm{s}_\lambda,\bm{b}_\lambda$ to specific values; (2) our first-order discretization can be seen as improved DDIM. See more details in Appendix A.

4 Experiments

Refer to caption

(a) CIFAR10
(ScoreSDE, Pixel DPM)

Refer to caption

(b) CIFAR10
(EDM, Pixel DPM)

Refer to caption

(c) LSUN-Bedroom
(Latent-Diffusion, Latent DPM)

Figure 3: Unconditional sampling results. We report the FID$\downarrow$ of the methods with different numbers of function evaluations (NFE), evaluated on 50k samples. The results of the baselines are borrowed directly from their original papers.
Refer to caption

(a) ImageNet-256
(Guided-Diffusion, Pixel DPM)
(Classifier Guidance, $s=2.0$)

Refer to caption

(b) MS-COCO2014
(Stable-Diffusion, Latent DPM)
(Classifier-Free Guidance, $s=1.5$)

Refer to caption

(c) MS-COCO2014
(Stable-Diffusion, Latent DPM)
(Classifier-Free Guidance, $s=7.5$)

Figure 4: Conditional sampling results. We report the FID$\downarrow$ or MSE$\downarrow$ of the methods with different numbers of function evaluations (NFE), evaluated on 10k samples.
Refer to caption

(a) CIFAR10 (ScoreSDE)

Refer to caption

(b) CIFAR10 (EDM)

Refer to caption

(c) MS-COCO2014 (Stable-Diffusion, $s=7.5$)

Figure 5: Visualization of the EMS $\bm{l}_\lambda,\bm{s}_\lambda,\bm{b}_\lambda$ w.r.t. $\lambda$, estimated on different models.
Refer to caption

(a) DPM-Solver++ [32] (FID 18.59)

Refer to caption

(b) UniPC [58] (FID 12.24)

Refer to caption

(c) DPM-Solver-v3 (Ours) (FID 7.54)

Figure 6: Random samples of Latent-Diffusion [43] on LSUN-Bedroom [55] with only NFE = 5.

In this section, we show that DPM-Solver-v3 can achieve consistent and notable speed-up for both unconditional and conditional sampling with both pixel-space and latent-space DPMs. We conduct extensive experiments on diverse image datasets, where the resolution ranges from 32 to 256. First, we present the main results of sample quality comparison with previous state-of-the-art training-free methods. Then we illustrate the effectiveness of our method by visualizing the EMS and samples. Additional ablation studies are provided in Appendix G. On each dataset, we choose a sufficient number of timesteps $N$ and datapoints $K$ for computing the EMS to reduce the estimation error, while the EMS can still be computed within hours. After we obtain the EMS and precompute the integrals involving them, there is negligible extra overhead in the sampling process. We provide the runtime comparison in Appendix E. We refer to Appendix D for more detailed experiment settings.

4.1 Main Results

Table 1: Quantitative results on CIFAR10 [24]. We report the FID$\downarrow$ of the methods with different numbers of function evaluations (NFE), evaluated on 50k samples. The results of the baselines are borrowed directly from their original papers; “\” denotes a result not reported.
Method              Model           NFE=5    6        8       10      12      15      20      25
DEIS [56]           ScoreSDE [51]   15.37    \        \       4.17    \       3.37    2.86    \
DPM-Solver++ [32]   ScoreSDE [51]   28.53    13.48    5.34    4.01    4.04    3.32    2.90    2.76
UniPC [58]          ScoreSDE [51]   23.71    10.41    5.16    3.93    3.88    3.05    2.73    2.65
DPM-Solver-v3       ScoreSDE [51]   12.76    7.40     3.94    3.40    3.24    2.91    2.71    2.64
Heun's 2nd [21]     EDM [21]        320.80   103.86   39.66   16.57   7.59    4.76    2.51    2.12
DPM-Solver++ [32]   EDM [21]        24.54    11.85    4.36    2.91    2.45    2.17    2.05    2.02
UniPC [58]          EDM [21]        23.52    11.10    3.86    2.85    2.38    2.08    2.01    2.00
DPM-Solver-v3       EDM [21]        12.21    8.56     3.50    2.51    2.24    2.10    2.02    2.00

We present the results with 5$\sim$20 function evaluations (NFE), covering both the few-step cases and the almost-converged cases, as shown in Fig. 3 and Fig. 4. For the sake of clarity, we mainly compare DPM-Solver-v3 to DPM-Solver++ [32] and UniPC [58], which are the current state-of-the-art diffusion ODE solvers. We also include the results for DEIS [56] and Heun's 2nd order method in EDM [21], but only on the datasets they originally reported. We do not show the results for other methods such as DDIM [48] and PNDM [28], since they have already been compared in previous works and have inferior performance. The quantitative results on CIFAR10 [24] are listed in Table 1, and more detailed quantitative results are presented in Appendix F.

Unconditional sampling. We first evaluate the unconditional sampling performance of different methods on CIFAR10 [24] and LSUN-Bedroom [55]. For CIFAR10, we use two pixel-space DPMs: one based on ScoreSDE [51], which is a widely adopted model by previous samplers, and another based on EDM [21], which achieves the best sample quality. For LSUN-Bedroom, we use the latent-space Latent-Diffusion model [43]. We apply the multistep 3rd-order version of DPM-Solver++, UniPC and DPM-Solver-v3 by default, which performs best in the unconditional setting. For UniPC, we report the better result of their two choices $B_1(h)=h$ and $B_2(h)=e^h-1$ at each NFE. For our DPM-Solver-v3, we tune the strategies of whether to use the pseudo-order predictor/corrector at each NFE on CIFAR10, and use the pseudo-order corrector on LSUN-Bedroom. As shown in Fig. 3, we find that DPM-Solver-v3 achieves consistently better FID, which is especially notable when NFE is 5$\sim$10. For example, we improve the FID on CIFAR10 with 5 NFE from $23$ to $12$ with ScoreSDE, and achieve an FID of $2.51$ with only $10$ NFE with the advanced DPM provided by EDM. On LSUN-Bedroom, with around 12 minutes of computing the EMS, DPM-Solver-v3 converges to an FID of 3.06 with 12 NFE, which is approximately 60% of the sampling cost of the previous best training-free method (20 NFE by UniPC).

Conditional sampling. We then evaluate the conditional sampling performance, which is more widely used since it allows for controllable generation with user-customized conditions. We choose two conditional settings: one is classifier guidance on the pixel-space Guided-Diffusion [10] model trained on the ImageNet-256 dataset [9] with 1000 class labels as conditions; the other is classifier-free guidance on the latent-space Stable-Diffusion model [43] trained on the LAION-5B dataset [46] with CLIP [41] embedded text prompts as conditions. We evaluate the former at a guidance scale of $2.0$, following the best choice in [10], and the latter at guidance scales of $1.5$ (following the original paper) or $7.5$ (following the official code) with prompts randomly selected from the MS-COCO2014 validation set [26]. Note that the FID of Stable-Diffusion samples saturates to 15.0$\sim$16.0 even within 10 steps when the latent codes are far from convergence, possibly due to the powerful image decoder (see Appendix H). Thus, following [32], we measure the mean squared error (MSE) between the generated latent code $\hat{\bm{x}}$ and the ground-truth solution $\bm{x}^*$ (i.e., $\|\hat{\bm{x}}-\bm{x}^*\|_2^2/D$) to evaluate convergence, starting from the same Gaussian noise. We obtain $\bm{x}^*$ by 200-step DPM-Solver++, which is enough to ensure convergence.

We apply the multistep 2nd-order version of DPM-Solver++, UniPC and DPM-Solver-v3, which performs best in the conditional setting. For UniPC, we only apply the choice $B_2(h)=e^h-1$, which performs better than $B_1(h)$. For our DPM-Solver-v3, we use the pseudo-order corrector by default, and report the best results between using the half-corrector and the full-corrector on Stable-Diffusion ($s=7.5$). As shown in Fig. 4, DPM-Solver-v3 can achieve better sample quality or convergence at most NFEs, which indicates the effectiveness of our method and techniques under the conditional setting. It is worth noting that UniPC, which adopts an extra corrector, performs even worse than DPM-Solver++ when NFE<10 on Stable-Diffusion ($s=7.5$). With the combined effect of the EMS and the half-corrector technique, we successfully outperform DPM-Solver++ in such a case. Detailed comparisons can be found in the ablations in Appendix G.

4.2 Visualizations of Estimated EMS

We further visualize the estimated EMS in Fig. 5. Since $\bm{l}_\lambda,\bm{s}_\lambda,\bm{b}_\lambda$ are $D$-dimensional vectors, we average them over all dimensions to obtain a scalar. From Fig. 5, we find that $\bm{l}_\lambda$ gradually changes from $1$ to near $0$ as the sampling proceeds, which suggests we should gradually slide from data prediction to noise prediction. As for $\bm{s}_\lambda,\bm{b}_\lambda$, they are more model-specific and display many fluctuations, especially for the ScoreSDE model [51] on CIFAR10. Apart from the estimation error of the EMS, we suspect that these fluctuations come from $\bm{\epsilon}_\theta^{(1)}$ and are caused by the periodicity of the trigonometric functions in the positional embeddings of the network input. It is worth noting that the fluctuation of $\bm{s}_\lambda,\bm{b}_\lambda$ will not cause instability in our sampler (see Appendix J).

4.3 Visual Quality

We present some qualitative comparisons in Fig. 6 and Fig. 1. We find that previous methods tend to produce samples with reduced color contrast at small NFE, while our method is less biased and produces more visual details. In Fig. 1(b), we can observe that previous methods with the corrector may cause distortion at large guidance scales (in the top-left image, a part of the castle becomes a hill; in the bottom-left image, the hill is translucent and the castle is hanging in the air), while ours does not. Additional samples are provided in Appendix K.

5 Conclusion

We study the ODE parameterization problem for fast sampling of DPMs. Through theoretical analysis, we find a novel ODE formulation with empirical model statistics, which is towards the optimal one that minimizes the first-order discretization error. Based on such improved ODE formulation, we propose a new fast solver named DPM-Solver-v3, which involves a multistep predictor-corrector framework of any order and several techniques for improved sampling with few steps or a large guidance scale. Experiments demonstrate the effectiveness of DPM-Solver-v3 in both unconditional and conditional sampling with both pixel-space and latent-space pretrained DPMs, and the significant advancement of sample quality in 5$\sim$10 steps.

Limitations and broader impact. Despite the great speedup at small numbers of steps, DPM-Solver-v3 still lags behind training-based methods and is not fast enough for real-time applications. Besides, we conducted theoretical analyses of the local error, but did not explore the global design space, such as the design of timestep schedules during sampling. Additionally, as with other accelerated samplers, DPM-Solver-v3 may potentially be abused to speed up the generation of fake and malicious content.

Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2021ZD0110502), NSFC Projects (Nos. 62061136001, 62106123, 62076147, U19A2081, 61972224, 62106120), Tsinghua Institute for Guo Qiang, and the High Performance Computing Center, Tsinghua University. J.Z is also supported by the XPlorer Prize.

References

  • [1] Kendall Atkinson, Weimin Han, and David E Stewart. Numerical solution of ordinary differential equations, volume 108. John Wiley & Sons, 2011.
  • [2] Fan Bao, Chongxuan Li, Jiacheng Sun, Jun Zhu, and Bo Zhang. Estimating the optimal covariance with imperfect mean in diffusion probabilistic models. In International Conference on Machine Learning, pages 1555–1584. PMLR, 2022.
  • [3] Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: An analytic estimate of the optimal reverse variance in diffusion probabilistic models. In International Conference on Learning Representations, 2022.
  • [4] Costas Bekas, Effrosyni Kokiopoulou, and Yousef Saad. An estimator for the diagonal of a matrix. Applied numerical mathematics, 57(11-12):1214–1229, 2007.
  • [5] Mari Paz Calvo and César Palencia. A class of explicit multistep exponential integrators for semilinear problems. Numerische Mathematik, 102:367–381, 2006.
  • [6] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021.
  • [7] Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2022.
  • [8] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. In The Eleventh International Conference on Learning Representations, 2022.
  • [9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • [10] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, volume 34, pages 8780–8794, 2021.
  • [11] Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Genie: Higher-order denoising diffusion solvers. In Advances in Neural Information Processing Systems, 2022.
  • [12] Alfredo Eisinberg and Giuseppe Fedele. On the inversion of the vandermonde matrix. Applied mathematics and computation, 174(2):1384–1397, 2006.
  • [13] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27, pages 2672–2680, 2014.
  • [14] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  • [15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851, 2020.
  • [16] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • [17] Marlis Hochbruck and Alexander Ostermann. Explicit exponential Runge-Kutta methods for semilinear parabolic problems. SIAM Journal on Numerical Analysis, 43(3):1069–1090, 2005.
  • [18] Marlis Hochbruck and Alexander Ostermann. Exponential integrators. Acta Numerica, 19:209–286, 2010.
  • [19] Marlis Hochbruck, Alexander Ostermann, and Julia Schweitzer. Exponential rosenbrock-type methods. SIAM Journal on Numerical Analysis, 47(1):786–803, 2009.
  • [20] Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–1076, 1989.
  • [21] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, 2022.
  • [22] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.
  • [23] Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Advances in Neural Information Processing Systems, 2021.
  • [24] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [25] Shengmeng Li, Luping Liu, Zenghao Chai, Runnan Li, and Xu Tan. Era-solver: Error-robust adams solver for fast sampling of diffusion probabilistic models. arXiv preprint arXiv:2301.12935, 2023.
  • [26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [27] Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, and Zhou Zhao. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11020–11028, 2022.
  • [28] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In International Conference on Learning Representations, 2021.
  • [29] Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2022.
  • [30] Cheng Lu, Kaiwen Zheng, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Maximum likelihood training for score-based diffusion odes by high order denoising score matching. In International Conference on Machine Learning, pages 14429–14460. PMLR, 2022.
  • [31] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In Advances in Neural Information Processing Systems, 2022.
  • [32] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.
  • [33] Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.
  • [34] Chenlin Meng, Ruiqi Gao, Diederik P Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.
  • [35] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
  • [36] Eric Harold Neville. Iterative interpolation. St. Joseph’s IS Press, 1934.
  • [37] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
  • [38] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pages 16784–16804. PMLR, 2022.
  • [39] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [40] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov, and Jiansheng Wei. Diffusion-based voice conversion with fast maximum likelihood sampling scheme. In International Conference on Learning Representations, 2022.
  • [41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [42] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
  • [43] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • [44] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022.
  • [45] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022.
  • [46] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • [47] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • [48] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
  • [49] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
  • [50] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In Advances in Neural Information Processing Systems, volume 34, pages 1415–1428, 2021.
  • [51] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
  • [52] Hideyuki Tachibana, Mocho Go, Muneyoshi Inahara, Yotaro Katayama, and Yotaro Watanabe. Itô-Taylor sampling scheme for denoising diffusion probabilistic models using ideal derivatives. arXiv e-prints, 2021.
  • [53] Daniel Watson, William Chan, Jonathan Ho, and Mohammad Norouzi. Learning fast samplers for diffusion models by differentiating through sample quality. In International Conference on Learning Representations, 2022.
  • [54] Suttisak Wizadwongsa and Supasorn Suwajanakorn. Accelerating guided diffusion sampling with splitting numerical methods. In The Eleventh International Conference on Learning Representations, 2022.
  • [55] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  • [56] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2022.
  • [57] Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. In Advances in Neural Information Processing Systems, 2022.
  • [58] Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. arXiv preprint arXiv:2302.04867, 2023.
  • [59] Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Improved techniques for maximum likelihood estimation for diffusion odes. arXiv preprint arXiv:2305.03935, 2023.

Appendix A Related Work

Diffusion probabilistic models (DPMs) [47, 15, 51], also known as score-based generative models (SGMs), have achieved remarkable generation ability on image domain [10, 21], yielding extensive applications such as speech, singing and video synthesis [6, 27, 14], controllable image generation, translation and editing [38, 42, 43, 35, 57, 8], likelihood estimation [50, 23, 30, 59], data compression [23] and inverse problem solving [7, 22].

A.1 Fast Sampling Methods for DPMs

Fast sampling methods based on extra training or optimization include learning variances in the reverse process [37, 2], learning the sampling schedule [53], learning high-order model derivatives [11], model refinement [29] and model distillation [45, 33, 49]. Though distillation-based methods can generate high-quality samples in fewer than 5 steps, they bring onerous additional training costs. Moreover, the distillation process inevitably loses part of the information of the original model, and is hard to adapt to pretrained large DPMs [43, 44, 42] and conditional sampling [34]. Some distillation-based methods also lack the ability to make flexible trade-offs between sampling speed and sample quality.

In contrast, training-free samplers are more lightweight and flexible. Among them, samplers based on diffusion ODEs generally require fewer steps than those based on diffusion SDEs [51, 40, 3, 48, 32], since SDEs introduce more randomness and make the denoising harder in the sampling process. Previous samplers handle the diffusion ODEs with different methods, such as Heun’s methods [21], splitting numerical methods [54], pseudo numerical methods [28], Adams methods [25] or exponential integrators [56, 31, 32, 58].

A.2 Comparison with Existing Solvers Based on Exponential Integrators

Table 2: Comparison between DPM-Solver-v3 and other high-order diffusion ODE solvers based on exponential integrators.

|                                | DEIS [56] | DPM-Solver [31] | DPM-Solver++ [32] | UniPC [58] | DPM-Solver-v3 (Ours) |
| First-Order                    | DDIM      | DDIM            | DDIM              | DDIM       | Improved DDIM        |
| Taylor Expanded Predictor      | ϵθ for t  | ϵθ for λ        | 𝒙θ for λ          | 𝒙θ for λ   | 𝒈θ for λ             |
| Solver Type (High-Order)       | Multistep | Singlestep      | Multistep         | Multistep  | Multistep            |
| Applicable for Guided Sampling | ✗         | ✗               | ✓                 | ✓          | ✓                    |
| Corrector Supported            | ✗         | ✗               | ✗                 | ✓          | ✓                    |
| Model-Specific                 | ✗         | ✗               | ✗                 | ✗          | ✓                    |

In this section, we make some theoretical comparisons between DPM-Solver-v3 and existing diffusion ODE solvers that are based on exponential integrators [56, 31, 32, 58]. We summarize the results in Table 2, and provide some analysis below.

Previous ODE formulation as special cases of ours. First, we compare the formulation of the ODE solution. Our reformulated ODE solution in Eq. (9) involves extra coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda}, and it corresponds to a new predictor 𝒈θ\bm{g}_{\theta} to be approximated by Taylor expansion. By comparing ours with previous ODE formulations in Eq. (3) and their corresponding noise/data prediction, we can easily figure out that they are special cases of ours by setting 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} to specific values:

  • Noise prediction: 𝒍λ=𝟎,𝒔λ=𝟏,𝒃λ=𝟎\bm{l}_{\lambda}=\bm{0},\bm{s}_{\lambda}=-\bm{1},\bm{b}_{\lambda}=\bm{0}

  • Data prediction: 𝒍λ=𝟏,𝒔λ=𝟎,𝒃λ=𝟎\bm{l}_{\lambda}=\bm{1},\bm{s}_{\lambda}=\bm{0},\bm{b}_{\lambda}=\bm{0}

It is worth noting that, though our ODE formulation can degenerate to previous ones, it may still result in different solvers, since previous works conduct equivalent substitutions of the same order in the local approximation (for example, the choice of B1(h)=hB_{1}(h)=h or B2(h)=eh1B_{2}(h)=e^{h}-1 in UniPC [58]). We never conduct such substitutions, thus saving the effort of tuning them.

Moreover, under our framework, we find that DPM-Solver++ is a model-agnostic approximation of DPM-Solver-v3, under the Gaussian assumption. Specifically, according to Eq. (5), we have

𝒍λ=argmin𝒍λ𝔼pλθ(𝒙λ)σλ𝒙ϵθ(𝒙λ,λ)𝒍λF2,\bm{l}^{*}_{\lambda}=\operatornamewithlimits{argmin}_{\bm{l}_{\lambda}}\mathbb{E}_{p^{\theta}_{\lambda}(\bm{x}_{\lambda})}\|\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{l}_{\lambda}\|_{F}^{2}, (17)

If we assume qλ(𝒙λ)𝒩(𝒙λ|αλ𝒙0,σλ2𝑰)q_{\lambda}(\bm{x}_{\lambda})\approx\mathcal{N}(\bm{x}_{\lambda}|\alpha_{\lambda}\bm{x}_{0},\sigma_{\lambda}^{2}\bm{I}) for some fixed 𝒙0\bm{x}_{0}, then the optimal noise predictor is

ϵθ(𝒙λ,λ)σλ𝒙logqλ(𝒙λ)=𝒙λαλ𝒙0σλ.\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\approx-\sigma_{\lambda}\nabla_{\bm{x}}\log q_{\lambda}(\bm{x}_{\lambda})=\frac{\bm{x}_{\lambda}-\alpha_{\lambda}\bm{x}_{0}}{\sigma_{\lambda}}. (18)

It follows that σλ𝒙ϵθ(𝒙λ,λ)𝑰\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\approx\bm{I}, thus 𝒍λ𝟏\bm{l}_{\lambda}^{*}\approx\bm{1} by Eq. (5), which corresponds to the data prediction model used in DPM-Solver++. Moreover, for small enough λ\lambda (i.e., tt near TT), the Gaussian assumption is almost true (see Section 4.2), so the data-prediction DPM-Solver++ approximately computes all the linear terms at the initial stage. To the best of our knowledge, this is the first explanation of why the data-prediction DPM-Solver++ outperforms the noise-prediction DPM-Solver.
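As a quick sanity check of this argument (a toy sketch, not part of the original derivation; the Gaussian datapoint is synthetic), for a single Gaussian qλ=𝒩(αλ𝒙0,σλ2𝑰) the exact noise predictor indeed satisfies σλ∇𝒙ϵ=𝑰, so the optimal 𝒍λ equals 𝟏:

```python
import torch

# Toy check: for q_lambda = N(alpha * x0, sigma^2 I), the exact noise predictor is
# eps(x) = (x - alpha * x0) / sigma, hence sigma * d eps / dx = I and l*_lambda = 1.
alpha, sigma = 0.8, 0.6
x0 = torch.randn(4)
x = alpha * x0 + sigma * torch.randn(4)
eps = lambda x: (x - alpha * x0) / sigma
jac = torch.autograd.functional.jacobian(eps, x)  # 4x4 Jacobian of eps w.r.t. x
assert torch.allclose(sigma * jac, torch.eye(4))
```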

First-order discretization as improved DDIM. Previous methods merely use the noise/data parameterization, whether or not they change the time domain from tt to λ\lambda. While they differ in the high-order cases, they provably coincide in the first-order case, which is DDIM [48] (deterministic case, η=0\eta=0):

𝒙^t=αtαs𝒙^sαt(σsαsσtαt)ϵθ(𝒙^s,λs)\displaystyle\hat{\bm{x}}_{t}=\frac{\alpha_{t}}{\alpha_{s}}\hat{\bm{x}}_{s}-\alpha_{t}\left(\frac{\sigma_{s}}{\alpha_{s}}-\frac{\sigma_{t}}{\alpha_{t}}\right)\bm{\epsilon}_{\theta}(\hat{\bm{x}}_{s},\lambda_{s}) (19)

However, the first-order case of our method is

𝒙^t\displaystyle\hat{\bm{x}}_{t} =αtαs𝑨(λs,λt)((1+𝒍λsλsλt𝑬λs(λ)dλ)𝒙^s(σsλsλt𝑬λs(λ)dλ)ϵθ(𝒙^s,λs))\displaystyle=\frac{\alpha_{t}}{\alpha_{s}}\bm{A}(\lambda_{s},\lambda_{t})\left(\left(1+\bm{l}_{\lambda_{s}}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda\right)\hat{\bm{x}}_{s}-\left(\sigma_{s}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda\right)\bm{\epsilon}_{\theta}(\hat{\bm{x}}_{s},\lambda_{s})\right) (20)
αt𝑨(λs,λt)λsλt𝑬λs(λ)𝑩λs(λ)dλ\displaystyle-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda

which is not DDIM since we choose a better parameterization by the estimated EMS. Empirically, our first-order solver performs better than DDIM, as detailed in Appendix G.
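For concreteness, the two first-order updates can be written as the following minimal sketch (elementwise operations on tensors/arrays). The quantities A, E_int and EB_int stand for 𝑨(λs,λt), ∫𝑬dλ and ∫𝑬𝑩dλ and are assumed to be precomputed as in Appendix C.1.2; all names are illustrative.

```python
def ddim_step(x_s, eps_s, alpha_s, alpha_t, sigma_s, sigma_t):
    # First-order DDIM update of Eq. (19), given the noise prediction eps_s at x_s.
    return (alpha_t / alpha_s) * x_s - alpha_t * (sigma_s / alpha_s - sigma_t / alpha_t) * eps_s

def improved_first_order_step(x_s, eps_s, alpha_s, alpha_t, sigma_s, l_s, A, E_int, EB_int):
    # First-order update of Eq. (20), using the (assumed precomputed) EMS integrals
    # A = A(lambda_s, lambda_t), E_int = int E dlambda, EB_int = int E * B dlambda.
    linear = (1.0 + l_s * E_int) * x_s - (sigma_s * E_int) * eps_s
    return (alpha_t / alpha_s) * A * linear - alpha_t * A * EB_int
```

With the noise-prediction choice 𝒍λ=𝟎, 𝒔λ=−𝟏, 𝒃λ=𝟎 listed above, one can check that the second update reduces exactly to the DDIM step.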

Appendix B Proofs

B.1 Assumptions

In this section, we will give some mild conditions under which the local order of accuracy of Algorithm 1 and the global order of convergence of Algorithm 2 (predictor) are guaranteed.

B.1.1 Local

First, we will give the assumptions to bound the local truncation error.

Assumption B.1.

The total derivatives of the noise prediction model dkϵθ(𝒙λ,λ)dλk,k=1,,n\frac{\mathrm{d}^{k}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)}{\mathrm{d}\lambda^{k}},k=1,\dots,n exist and are continuous.

Assumption B.2.

The coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} are continuous and bounded. dk𝒍λdλk,dk𝒔λdλk,dk𝒃λdλk,k=1,,n\frac{\mathrm{d}^{k}\bm{l}_{\lambda}}{\mathrm{d}\lambda^{k}},\frac{\mathrm{d}^{k}\bm{s}_{\lambda}}{\mathrm{d}\lambda^{k}},\frac{\mathrm{d}^{k}\bm{b}_{\lambda}}{\mathrm{d}\lambda^{k}},k=1,\dots,n exist and are continuous.

Assumption B.3.

δk=Θ(λtλs),k=1,,n\delta_{k}=\Theta(\lambda_{t}-\lambda_{s}),k=1,\dots,n

Assumption B.1 is required for the Taylor expansion and is standard in high-order numerical methods. Assumption B.2 requires the boundedness of the coefficients and regularizes their smoothness to enable the Taylor expansion of 𝒈θ(𝒙λ,λ)\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda), which holds in practice given the smoothness of ϵθ(𝒙λ,λ)\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda) and pλθ(𝒙λ)p_{\lambda}^{\theta}(\bm{x}_{\lambda}). Assumption B.3 makes sure that δk\delta_{k} and λtλs\lambda_{t}-\lambda_{s} are of the same order, i.e., there exist constants rk=𝒪(1)r_{k}=\mathcal{O}(1) such that δk=rk(λtλs)\delta_{k}=r_{k}(\lambda_{t}-\lambda_{s}), which is satisfied in regular multistep methods.

B.1.2 Global

Then we will give the assumptions to bound the global error.

Assumption B.4.

The noise prediction model ϵθ(𝒙,t)\bm{\epsilon}_{\theta}(\bm{x},t) is Lipschitz w.r.t. 𝒙\bm{x}.

Assumption B.5.

h=max1iM(λiλi1)=𝒪(1/M)h=\max_{1\leq i\leq M}(\lambda_{i}-\lambda_{i-1})=\mathcal{O}(1/M).

Assumption B.6.

The starting values 𝒙^i,1in\hat{\bm{x}}_{i},1\leq i\leq n satisfy 𝒙^i𝒙i=𝒪(hn+1)\hat{\bm{x}}_{i}-\bm{x}_{i}=\mathcal{O}(h^{n+1}).

Assumption B.4 is common in the analysis of ODEs, which assures ϵθ(𝒙^t,t)ϵθ(𝒙t,t)=𝒪(𝒙^t𝒙t)\bm{\epsilon}_{\theta}(\hat{\bm{x}}_{t},t)-\bm{\epsilon}_{\theta}(\bm{x}_{t},t)=\mathcal{O}(\hat{\bm{x}}_{t}-\bm{x}_{t}). Assumption B.5 implies that the step sizes are rather uniform. Assumption B.6 is common in the convergence analysis of multistep methods [5].

B.2 Order of Accuracy and Convergence

In this section, we prove the local and global order guarantees detailed in Theorem 3.1 and Theorem 3.3.

B.2.1 Local

Proof.

(Proof of Theorem 3.1) Denote hλtλsh\coloneqq\lambda_{t}-\lambda_{s}. Subtracting the Taylor-expanded exact solution in Eq. (12) from the local approximation in Eq. (14), we have

𝒙^t𝒙t=αt𝑨(λs,λt)k=1n𝒈^θ(k)(𝒙λs,λs)𝒈θ(k)(𝒙λs,λs)k!λsλt𝑬λs(λ)(λλs)kdλ+𝒪(hn+2)\hat{\bm{x}}_{t}-\bm{x}_{t}=-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\sum_{k=1}^{n}\frac{\hat{\bm{g}}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})-\bm{g}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})}{k!}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda+\mathcal{O}(h^{n+2}) (21)

First, we examine the order of 𝑨(λs,λt)\bm{A}(\lambda_{s},\lambda_{t}) and λsλt𝑬λs(λ)(λλs)kdλ\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda. Under Assumption B.2, there exist constants C1,C2C_{1},C_{2} such that 𝒍λ<C1,𝒍λ+𝒔λ<C2-\bm{l}_{\lambda}<C_{1},\bm{l}_{\lambda}+\bm{s}_{\lambda}<C_{2}. So

𝑨(λs,λt)\displaystyle\bm{A}(\lambda_{s},\lambda_{t}) =eλsλt𝒍τdτ\displaystyle=e^{-\int_{\lambda_{s}}^{\lambda_{t}}\bm{l}_{\tau}\mathrm{d}\tau} (22)
<eC1h\displaystyle<e^{C_{1}h}
=𝒪(1)\displaystyle=\mathcal{O}(1)
λsλt𝑬λs(λ)(λλs)kdλ\displaystyle\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda =λsλteλsλ(𝒍τ+𝒔τ)dτ(λλs)kdλ\displaystyle=\int_{\lambda_{s}}^{\lambda_{t}}e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda (23)
<λsλteC2(λλs)(λλs)kdλ\displaystyle<\int_{\lambda_{s}}^{\lambda_{t}}e^{C_{2}(\lambda-\lambda_{s})}(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda
=𝒪(hk+1)\displaystyle=\mathcal{O}(h^{k+1})

Next, we examine the order of 𝒈^θ(k)(𝒙λs,λs)𝒈θ(k)(𝒙λs,λs)k!\frac{\hat{\bm{g}}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})-\bm{g}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})}{k!}. Under Assumption B.1 and Assumption B.2, since 𝒈θ\bm{g}_{\theta} is an elementary function of ϵθ\bm{\epsilon}_{\theta} and 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda}, we know 𝒈θ(k)(𝒙λs,λs),k=1,,n\bm{g}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s}),k=1,\dots,n exist and are continuous. Adopting the notations in Eq. (13), by Taylor expansion, we have

𝒈i1\displaystyle\bm{g}_{i_{1}} =𝒈s+δ1𝒈s(1)+δ12𝒈s(2)++δ1n𝒈s(n)+𝒪(δ1n+1)\displaystyle=\bm{g}_{s}+\delta_{1}\bm{g}_{s}^{(1)}+\delta_{1}^{2}\bm{g}_{s}^{(2)}+\dots+\delta_{1}^{n}\bm{g}_{s}^{(n)}+\mathcal{O}(\delta_{1}^{n+1}) (24)
𝒈i2\displaystyle\bm{g}_{i_{2}} =𝒈s+δ2𝒈s(1)+δ22𝒈s(2)++δ2n𝒈s(n)+𝒪(δ2n+1)\displaystyle=\bm{g}_{s}+\delta_{2}\bm{g}_{s}^{(1)}+\delta_{2}^{2}\bm{g}_{s}^{(2)}+\dots+\delta_{2}^{n}\bm{g}_{s}^{(n)}+\mathcal{O}(\delta_{2}^{n+1})
𝒈in\displaystyle\dots{\\ }\bm{g}_{i_{n}} =𝒈s+δn𝒈s(1)+δn2𝒈s(2)++δnn𝒈s(n)+𝒪(δnn+1)\displaystyle=\bm{g}_{s}+\delta_{n}\bm{g}_{s}^{(1)}+\delta_{n}^{2}\bm{g}_{s}^{(2)}+\dots+\delta_{n}^{n}\bm{g}_{s}^{(n)}+\mathcal{O}(\delta_{n}^{n+1})

Comparing it with Eq. (13), we have

(δ1δ12δ1nδ2δ22δ2nδnδn2δnn)(𝒈^s(1)𝒈s(1)𝒈^s(2)𝒈s(2)2!𝒈^s(n)𝒈s(n)n!)=(𝒪(δ1n+1)𝒪(δ2n+1)𝒪(δnn+1))\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{n}\\ \delta_{2}&\delta_{2}^{2}&\cdots&\delta_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{n}&\delta_{n}^{2}&\cdots&\delta_{n}^{n}\end{pmatrix}\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}-\bm{g}^{(1)}_{s}\\ \frac{\hat{\bm{g}}^{(2)}_{s}-\bm{g}^{(2)}_{s}}{2!}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}-\bm{g}^{(n)}_{s}}{n!}\end{pmatrix}=\begin{pmatrix}\mathcal{O}(\delta_{1}^{n+1})\\ \mathcal{O}(\delta_{2}^{n+1})\\ \vdots\\ \mathcal{O}(\delta_{n}^{n+1})\end{pmatrix} (25)

From Assumption B.3, we know there exist constants rkr_{k} such that δk=rkh,k=1,,n\delta_{k}=r_{k}h,k=1,\dots,n. Thus

(δ1δ12δ1nδ2δ22δ2nδnδn2δnn)=(r1r12r1nr2r22r2nrnrn2rnn)(hh2hn),(𝒪(δ1n+1)𝒪(δ2n+1)𝒪(δnn+1))=(𝒪(hn+1)𝒪(hn+1)𝒪(hn+1))\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{n}\\ \delta_{2}&\delta_{2}^{2}&\cdots&\delta_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{n}&\delta_{n}^{2}&\cdots&\delta_{n}^{n}\end{pmatrix}=\begin{pmatrix}r_{1}&r_{1}^{2}&\cdots&r_{1}^{n}\\ r_{2}&r_{2}^{2}&\cdots&r_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ r_{n}&r_{n}^{2}&\cdots&r_{n}^{n}\end{pmatrix}\begin{pmatrix}h&&&\\ &h^{2}&&\\ &&\ddots&\\ &&&h^{n}\\ \end{pmatrix},\begin{pmatrix}\mathcal{O}(\delta_{1}^{n+1})\\ \mathcal{O}(\delta_{2}^{n+1})\\ \vdots\\ \mathcal{O}(\delta_{n}^{n+1})\end{pmatrix}=\begin{pmatrix}\mathcal{O}(h^{n+1})\\ \mathcal{O}(h^{n+1})\\ \vdots\\ \mathcal{O}(h^{n+1})\end{pmatrix} (26)

And finally, we have

(𝒈^s(1)𝒈s(1)𝒈^s(2)𝒈s(2)2!𝒈^s(n)𝒈s(n)n!)\displaystyle\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}-\bm{g}^{(1)}_{s}\\ \frac{\hat{\bm{g}}^{(2)}_{s}-\bm{g}^{(2)}_{s}}{2!}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}-\bm{g}^{(n)}_{s}}{n!}\end{pmatrix} =(h1h2hn)(r1r12r1nr2r22r2nrnrn2rnn)1(𝒪(hn+1)𝒪(hn+1)𝒪(hn+1))\displaystyle=\begin{pmatrix}h^{-1}&&&\\ &h^{-2}&&\\ &&\ddots&\\ &&&h^{-n}\\ \end{pmatrix}\begin{pmatrix}r_{1}&r_{1}^{2}&\cdots&r_{1}^{n}\\ r_{2}&r_{2}^{2}&\cdots&r_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ r_{n}&r_{n}^{2}&\cdots&r_{n}^{n}\end{pmatrix}^{-1}\begin{pmatrix}\mathcal{O}(h^{n+1})\\ \mathcal{O}(h^{n+1})\\ \vdots\\ \mathcal{O}(h^{n+1})\end{pmatrix} (27)
=(𝒪(hn)𝒪(hn1)𝒪(h1))\displaystyle=\begin{pmatrix}\mathcal{O}(h^{n})\\ \mathcal{O}(h^{n-1})\\ \vdots\\ \mathcal{O}(h^{1})\end{pmatrix}

Substituting Eq. (22), Eq. (23) and Eq. (27) into Eq. (21), we conclude that 𝒙^t𝒙t=𝒪(hn+2)\hat{\bm{x}}_{t}-\bm{x}_{t}=\mathcal{O}(h^{n+2}). ∎

B.2.2 Global

First, we provide a lemma that gives the local truncation error given inexact previous values when estimating the high-order derivatives.

Lemma B.7.

(Local truncation error with inexact previous values) Suppose inexact values 𝐱^λik,k=1,,n\hat{\bm{x}}_{\lambda_{i_{k}}},k=1,\dots,n and 𝐱^s\hat{\bm{x}}_{s} are used in Eq. (13) to estimate the high-order derivatives, then the local truncation error of the local approximation Eq. (14) satisfies

Δt=αt𝑨(λs,λt)αsΔs+𝒪(h)(𝒪(Δs)+k=1n𝒪(Δλik)+𝒪(hn+1))\Delta_{t}=\frac{\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})}{\alpha_{s}}\Delta_{s}+\mathcal{O}(h)\left(\mathcal{O}(\Delta_{s})+\sum_{k=1}^{n}\mathcal{O}(\Delta_{\lambda_{i_{k}}})+\mathcal{O}(h^{n+1})\right) (28)

where Δ𝐱^𝐱,hλtλs\Delta_{\cdot}\coloneqq\hat{\bm{x}}_{\cdot}-\bm{x}_{\cdot},h\coloneqq\lambda_{t}-\lambda_{s}.

Proof.

By replacing 𝒙\bm{x}_{\cdot} with 𝒙^\hat{\bm{x}}_{\cdot} in Eq. (13) and subtracting Eq. (12) from Eq. (14), the expression for the local truncation error becomes

Δt\displaystyle\Delta_{t} =αt𝑨(λs,λt)αsΔsαt𝑨(λs,λt)(𝒈θ(𝒙^λs,λs)𝒈θ(𝒙λs,λs))λsλt𝑬λs(λ)dλ\displaystyle=\frac{\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})}{\alpha_{s}}\Delta_{s}-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\left(\bm{g}_{\theta}(\hat{\bm{x}}_{\lambda_{s}},\lambda_{s})-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\right)\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda (29)
αt𝑨(λs,λt)k=1n𝒈^θ(k)(𝒙λs,λs)𝒈θ(k)(𝒙λs,λs)k!λsλt𝑬λs(λ)(λλs)kdλ+𝒪(hn+2)\displaystyle-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\sum_{k=1}^{n}\frac{\hat{\bm{g}}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})-\bm{g}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})}{k!}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda+\mathcal{O}(h^{n+2})

And the linear system for solving 𝒈θ(k)(𝒙λs,λs),k=1,,n\bm{g}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s}),k=1,\dots,n becomes

(δ1δ12δ1nδ2δ22δ2nδnδn2δnn)(𝒈^s(1)𝒈^s(2)2!𝒈^s(n)n!)=(𝒈^i1𝒈^s𝒈^i2𝒈^s𝒈^in𝒈^s)\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{n}\\ \delta_{2}&\delta_{2}^{2}&\cdots&\delta_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{n}&\delta_{n}^{2}&\cdots&\delta_{n}^{n}\end{pmatrix}\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}\\ \frac{\hat{\bm{g}}^{(2)}_{s}}{2!}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}}{n!}\end{pmatrix}=\begin{pmatrix}\hat{\bm{g}}_{i_{1}}-\hat{\bm{g}}_{s}\\ \hat{\bm{g}}_{i_{2}}-\hat{\bm{g}}_{s}\\ \vdots\\ \hat{\bm{g}}_{i_{n}}-\hat{\bm{g}}_{s}\end{pmatrix} (30)

where 𝒈^=𝒈θ(𝒙^λ,λ)\hat{\bm{g}}_{\cdot}=\bm{g}_{\theta}(\hat{\bm{x}}_{\lambda_{\cdot}},\lambda_{\cdot}). Under Assumption B.4, we know 𝒈^𝒈=𝒪(Δλ)\hat{\bm{g}}_{\cdot}-\bm{g}_{\cdot}=\mathcal{O}(\Delta_{\lambda_{\cdot}}). Thus, under Assumption B.1, Assumption B.2 and Assumption B.3, similarly to the derivation in the previous section, we have

(𝒈^s(1)𝒈s(1)𝒈^s(2)𝒈s(2)2!𝒈^s(n)𝒈s(n)n!)=(h1h2hn)(r1r12r1nr2r22r2nrnrn2rnn)1(𝒪(hn+1+Δs+Δλi1)𝒪(hn+1+Δs+Δλi2)𝒪(hn+1+Δs+Δλin))\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}-\bm{g}^{(1)}_{s}\\ \frac{\hat{\bm{g}}^{(2)}_{s}-\bm{g}^{(2)}_{s}}{2!}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}-\bm{g}^{(n)}_{s}}{n!}\end{pmatrix}=\begin{pmatrix}h^{-1}&&&\\ &h^{-2}&&\\ &&\ddots&\\ &&&h^{-n}\\ \end{pmatrix}\begin{pmatrix}r_{1}&r_{1}^{2}&\cdots&r_{1}^{n}\\ r_{2}&r_{2}^{2}&\cdots&r_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ r_{n}&r_{n}^{2}&\cdots&r_{n}^{n}\end{pmatrix}^{-1}\begin{pmatrix}\mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{1}}})\\ \mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{2}}})\\ \vdots\\ \mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{n}}})\end{pmatrix} (31)

Besides, under Assumption B.2, the orders of the other coefficients are the same as those obtained in the previous section:

𝑨(λs,λt)=𝒪(1),λsλt𝑬λs(λ)(λλs)kdλ=𝒪(hk+1)\bm{A}(\lambda_{s},\lambda_{t})=\mathcal{O}(1),\quad\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda=\mathcal{O}(h^{k+1}) (32)

Thus

k=1n𝒈^θ(k)(𝒙λs,λs)𝒈θ(k)(𝒙λs,λs)k!λsλt𝑬λs(λ)(λλs)kdλ\displaystyle\sum_{k=1}^{n}\frac{\hat{\bm{g}}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})-\bm{g}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})}{k!}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda (33)
=\displaystyle= (λsλt𝑬λs(λ)(λλs)1dλλsλt𝑬λs(λ)(λλs)2dλλsλt𝑬λs(λ)(λλs)ndλ)(𝒈^s(1)𝒈s(1)𝒈^s(2)𝒈s(2)2!𝒈^s(n)𝒈s(n)n!)\displaystyle\begin{pmatrix}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{1}\mathrm{d}\lambda\\ \int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{2}\mathrm{d}\lambda\\ \vdots\\ \int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{n}\mathrm{d}\lambda\end{pmatrix}^{\top}\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}-\bm{g}^{(1)}_{s}\\ \frac{\hat{\bm{g}}^{(2)}_{s}-\bm{g}^{(2)}_{s}}{2!}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}-\bm{g}^{(n)}_{s}}{n!}\end{pmatrix}
=\displaystyle= (𝒪(h2)𝒪(h3)𝒪(hn+1))(h1h2hn)(r1r12r1nr2r22r2nrnrn2rnn)1(𝒪(hn+1+Δs+Δλi1)𝒪(hn+1+Δs+Δλi2)𝒪(hn+1+Δs+Δλin))\displaystyle\begin{pmatrix}\mathcal{O}(h^{2})\\ \mathcal{O}(h^{3})\\ \vdots\\ \mathcal{O}(h^{n+1})\end{pmatrix}^{\top}\begin{pmatrix}h^{-1}&&&\\ &h^{-2}&&\\ &&\ddots&\\ &&&h^{-n}\\ \end{pmatrix}\begin{pmatrix}r_{1}&r_{1}^{2}&\cdots&r_{1}^{n}\\ r_{2}&r_{2}^{2}&\cdots&r_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ r_{n}&r_{n}^{2}&\cdots&r_{n}^{n}\end{pmatrix}^{-1}\begin{pmatrix}\mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{1}}})\\ \mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{2}}})\\ \vdots\\ \mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{n}}})\\ \end{pmatrix}
=\displaystyle= (𝒪(h)𝒪(h)𝒪(h))(r1r12r1nr2r22r2nrnrn2rnn)1(𝒪(hn+1+Δs+Δλi1)𝒪(hn+1+Δs+Δλi2)𝒪(hn+1+Δs+Δλin))\displaystyle\begin{pmatrix}\mathcal{O}(h)\\ \mathcal{O}(h)\\ \vdots\\ \mathcal{O}(h)\end{pmatrix}^{\top}\begin{pmatrix}r_{1}&r_{1}^{2}&\cdots&r_{1}^{n}\\ r_{2}&r_{2}^{2}&\cdots&r_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ r_{n}&r_{n}^{2}&\cdots&r_{n}^{n}\end{pmatrix}^{-1}\begin{pmatrix}\mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{1}}})\\ \mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{2}}})\\ \vdots\\ \mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{n}}})\\ \end{pmatrix}
=\displaystyle= k=1n𝒪(h)𝒪(hn+1+Δs+Δλik)\displaystyle\sum_{k=1}^{n}\mathcal{O}(h)\mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{k}}})

Combining Eq. (29), Eq. (32) and Eq. (33), we can obtain the conclusion in Eq. (28). ∎

Then we prove Theorem 3.3 below.

Proof.

(Proof of Theorem 3.3)

As we have discussed, the predictor step from tm1t_{m-1} to tmt_{m} is a special case of the local approximation Eq. (14) with (tin,,ti1,s,t)=(tmn1,,tm2,tm1,tm)(t_{i_{n}},\dots,t_{i_{1}},s,t)=(t_{m-n-1},\dots,t_{m-2},t_{m-1},t_{m}). By Lemma B.7 we have

Δm=αtm𝑨(λtm1,λtm)αtm1Δm1+𝒪(h)(k=0n𝒪(Δmk1)+𝒪(hn+1))\Delta_{m}=\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}\Delta_{m-1}+\mathcal{O}(h)\left(\sum_{k=0}^{n}\mathcal{O}(\Delta_{m-k-1})+\mathcal{O}(h^{n+1})\right) (34)

It follows that there exist constants C,C0C,C_{0} independent of hh, such that

|Δm|(αtm𝑨(λtm1,λtm)αtm1+Ch)|Δm1|+Chk=0n|Δmk1|+C0hn+2|\Delta_{m}|\leq\left(\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}+Ch\right)|\Delta_{m-1}|+Ch\sum_{k=0}^{n}|\Delta_{m-k-1}|+C_{0}h^{n+2} (35)

Denote fmmax0im|Δi|f_{m}\coloneqq\max_{0\leq i\leq m}|\Delta_{i}|, we then have

|Δm|(αtm𝑨(λtm1,λtm)αtm1+C1h)fm1+C0hn+2|\Delta_{m}|\leq\left(\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}+C_{1}h\right)f_{m-1}+C_{0}h^{n+2} (36)

Since αtm𝑨(λtm1,λtm)αtm11\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}\rightarrow 1 when h0h\rightarrow 0 and it has bounded first-order derivative due to Assumption B.2, there exists a constant C2C_{2}, so that for any CC2C\geq C_{2}, αtm𝑨(λtm1,λtm)αtm1+Ch>1\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}+Ch>1 for sufficiently small hh. Thus, by taking C3=max{C1,C2}C_{3}=\max\{C_{1},C_{2}\}, we have

fm(αtm𝑨(λtm1,λtm)αtm1+C3h)fm1+C0hn+2f_{m}\leq\left(\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}+C_{3}h\right)f_{m-1}+C_{0}h^{n+2} (37)

Denote Am1αtm𝑨(λtm1,λtm)αtm1+C3hA_{m-1}\coloneqq\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}+C_{3}h, by repeating Eq. (37), we have

fM(i=nM1Ai)fn+(i=n+1Mj=iM1Aj)C0hn+2f_{M}\leq\left(\prod_{i=n}^{M-1}A_{i}\right)f_{n}+\left(\sum_{i=n+1}^{M}\prod_{j=i}^{M-1}A_{j}\right)C_{0}h^{n+2} (38)

By Assumption B.5, h=𝒪(1/M)h=\mathcal{O}(1/M), so we have

i=nM1Ai\displaystyle\prod_{i=n}^{M-1}A_{i} =αtM𝑨(λtn,λtM)αtni=nM1(1+αti1C3hαti𝑨(λti1,λti))\displaystyle=\frac{\alpha_{t_{M}}\bm{A}(\lambda_{t_{n}},\lambda_{t_{M}})}{\alpha_{t_{n}}}\prod_{i=n}^{M-1}\left(1+\frac{\alpha_{t_{i-1}}C_{3}h}{\alpha_{t_{i}}\bm{A}(\lambda_{t_{i-1}},\lambda_{t_{i}})}\right) (39)
αtM𝑨(λtn,λtM)αtni=nM1(1+αti1C4αti𝑨(λti1,λti)M)\displaystyle\leq\frac{\alpha_{t_{M}}\bm{A}(\lambda_{t_{n}},\lambda_{t_{M}})}{\alpha_{t_{n}}}\prod_{i=n}^{M-1}\left(1+\frac{\alpha_{t_{i-1}}C_{4}}{\alpha_{t_{i}}\bm{A}(\lambda_{t_{i-1}},\lambda_{t_{i}})M}\right)
αtM𝑨(λtn,λtM)αtn(1+σM)Mn\displaystyle\leq\frac{\alpha_{t_{M}}\bm{A}(\lambda_{t_{n}},\lambda_{t_{M}})}{\alpha_{t_{n}}}\left(1+\frac{\sigma}{M}\right)^{M-n}
C5eσ\displaystyle\leq C_{5}e^{\sigma}

where σ=maxniM1αti1C4αti𝑨(λti1,λti)\sigma=\max_{n\leq i\leq M-1}\frac{\alpha_{t_{i-1}}C_{4}}{\alpha_{t_{i}}\bm{A}(\lambda_{t_{i-1}},\lambda_{t_{i}})}. Then denote βmaxn+1iMαtM𝑨(λti,λtM)αti\beta\coloneqq\max_{n+1\leq i\leq M}\frac{\alpha_{t_{M}}\bm{A}(\lambda_{t_{i}},\lambda_{t_{M}})}{\alpha_{t_{i}}}, we have

i=n+1Mj=iM1Aj\displaystyle\sum_{i=n+1}^{M}\prod_{j=i}^{M-1}A_{j} i=n+1MαtM𝑨(λti,λtM)αti(1+σM)Mi\displaystyle\leq\sum_{i=n+1}^{M}\frac{\alpha_{t_{M}}\bm{A}(\lambda_{t_{i}},\lambda_{t_{M}})}{\alpha_{t_{i}}}\left(1+\frac{\sigma}{M}\right)^{M-i} (40)
βi=0Mn1(1+σM)i\displaystyle\leq\beta\sum_{i=0}^{M-n-1}\left(1+\frac{\sigma}{M}\right)^{i}
=βMσ[(1+σM)Mn1]\displaystyle=\frac{\beta M}{\sigma}\left[\left(1+\frac{\sigma}{M}\right)^{M-n}-1\right]
C6(eσ1)M\displaystyle\leq C_{6}\left(e^{\sigma}-1\right)M

Then we substitute Eq. (39) and Eq. (40) into Eq. (38). Note that M=𝒪(1/h)M=\mathcal{O}(1/h) by Assumption B.5, and fn=𝒪(hn+1)f_{n}=\mathcal{O}(h^{n+1}) by Assumption B.6, finally we conclude that |ΔM|fM=𝒪(hn+1)|\Delta_{M}|\leq f_{M}=\mathcal{O}(h^{n+1}). ∎

B.3 Pseudo-Order Solver

First, we provide a lemma that gives the explicit solution to Eq. (15).

Lemma B.8.

The solution to Eq. (15) is

𝒈^s(k)k!=p=1k𝒈ip𝒈i0q=0,qpk(δpδq)\frac{\hat{\bm{g}}_{s}^{(k)}}{k!}=\sum_{p=1}^{k}\frac{\bm{g}_{i_{p}}-\bm{g}_{i_{0}}}{\prod_{q=0,q\neq p}^{k}(\delta_{p}-\delta_{q})} (41)
Proof.

Denote

𝑹k=(δ1δ12δ1kδ2δ22δ2kδkδk2δkk)\bm{R}_{k}=\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{k}\\ \delta_{2}&\delta_{2}^{2}&\cdots&\delta_{2}^{k}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{k}&\delta_{k}^{2}&\cdots&\delta_{k}^{k}\end{pmatrix} (42)

Then the solution to Eq. (15) can be expressed as

𝒈^s(k)k!=p=1k(𝑹k1)kp(𝒈ip𝒈i0)\frac{\hat{\bm{g}}_{s}^{(k)}}{k!}=\sum_{p=1}^{k}(\bm{R}_{k}^{-1})_{kp}(\bm{g}_{i_{p}}-\bm{g}_{i_{0}}) (43)

where (𝑹k1)kp(\bm{R}_{k}^{-1})_{kp} is the element of 𝑹k1\bm{R}_{k}^{-1} at the kk-th row and the pp-th column. From previous studies of the inversion of the Vandermonde matrix [12], we know

(𝑹k1)kp=1δpq=1,qpk(δpδq)=1(δpδ0)q=1,qpk(δpδq)(\bm{R}_{k}^{-1})_{kp}=\frac{1}{\delta_{p}\prod_{q=1,q\neq p}^{k}(\delta_{p}-\delta_{q})}=\frac{1}{(\delta_{p}-\delta_{0})\prod_{q=1,q\neq p}^{k}(\delta_{p}-\delta_{q})} (44)

Substituting Eq. (44) into Eq. (43), we finish the proof. ∎
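As a quick numerical sanity check (a sketch with synthetic values, not part of the proof), the closed form of Eq. (41) can be verified against a direct solve of the Vandermonde system:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
delta = rng.uniform(0.1, 1.0, size=k + 1)   # delta_0, ..., delta_k (distinct)
delta[0] = 0.0                              # delta_0 = 0 corresponds to lambda_s itself
g = rng.normal(size=k + 1)                  # g_{i_0}, ..., g_{i_k}

# Direct solve of the system in Eq. (42)-(43); the last entry of c equals g^(k)/k!
R = np.stack([delta[1:] ** j for j in range(1, k + 1)], axis=1)
c = np.linalg.solve(R, g[1:] - g[0])

# Closed form of Eq. (41)
closed = sum(
    (g[p] - g[0]) / np.prod([delta[p] - delta[q] for q in range(k + 1) if q != p])
    for p in range(1, k + 1)
)
assert np.allclose(c[-1], closed)
```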

Then we prove Theorem 3.4 below:

Proof.

(Proof of Theorem 3.4) First, we use mathematical induction to prove that

Dl(k)=D~l(k)p=1k𝒈il+p𝒈ilq=0,qpk(δl+pδl+q),1kn,0lnkD_{l}^{(k)}=\tilde{D}_{l}^{(k)}\coloneqq\sum_{p=1}^{k}\frac{\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}}}{\prod_{q=0,q\neq p}^{k}(\delta_{l+p}-\delta_{l+q})},\quad 1\leq k\leq n,0\leq l\leq n-k (45)

For k=1k=1, Eq. (45) holds by the definition of Dl(k)D_{l}^{(k)}. Suppose the equation holds for kk; we then prove that it holds for k+1k+1.

Define the Lagrange polynomial which passes (δl+p,𝒈il+p𝒈il)(\delta_{l+p},\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}}) for 0pk0\leq p\leq k:

Pl(k)(x)p=1k(𝒈il+p𝒈il)q=0,qpkxδl+qδl+pδl+q,1kn,0lnkP_{l}^{(k)}(x)\coloneqq\sum_{p=1}^{k}\left(\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}}\right)\prod_{q=0,q\neq p}^{k}\frac{x-\delta_{l+q}}{\delta_{l+p}-\delta_{l+q}},\quad 1\leq k\leq n,0\leq l\leq n-k (46)

Then D~l(k)=Pl(k)(x)[xk]\tilde{D}_{l}^{(k)}=P_{l}^{(k)}(x)[x^{k}] is the coefficient of the highest-order term xkx^{k} in Pl(k)(x)P_{l}^{(k)}(x). We then prove that Pl(k)(x)P_{l}^{(k)}(x) satisfies the following recurrence relation:

Pl(k)(x)=P~l(k)(x)(xδl)Pl+1(k1)(x)(xδl+k)Pl(k1)(x)δl+kδlP_{l}^{(k)}(x)=\tilde{P}_{l}^{(k)}(x)\coloneqq\frac{(x-\delta_{l})P_{l+1}^{(k-1)}(x)-(x-\delta_{l+k})P_{l}^{(k-1)}(x)}{\delta_{l+k}-\delta_{l}} (47)

By definition, Pl+1(k1)(x)P_{l+1}^{(k-1)}(x) is the (k1)(k-1)-th order polynomial which passes (δl+p,𝒈il+p𝒈il)(\delta_{l+p},\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}}) for 1pk1\leq p\leq k, and Pl(k1)(x)P_{l}^{(k-1)}(x) is the (k1)(k-1)-th order polynomial which passes (δl+p,𝒈il+p𝒈il)(\delta_{l+p},\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}}) for 0pk10\leq p\leq k-1.

Thus, for 1pk11\leq p\leq k-1, we have

P~l(k)(δl+p)=(δl+pδl)Pl+1(k1)(δl+p)(δl+pδl+k)Pl(k1)(δl+p)δl+kδl=𝒈il+p𝒈il\tilde{P}_{l}^{(k)}(\delta_{l+p})=\frac{(\delta_{l+p}-\delta_{l})P_{l+1}^{(k-1)}(\delta_{l+p})-(\delta_{l+p}-\delta_{l+k})P_{l}^{(k-1)}(\delta_{l+p})}{\delta_{l+k}-\delta_{l}}=\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}} (48)

For p=0p=0, we have

P~l(k)(δl)=(δlδl)Pl+1(k1)(δl)(δlδl+k)Pl(k1)(δl)δl+kδl=𝒈il𝒈il\tilde{P}_{l}^{(k)}(\delta_{l})=\frac{(\delta_{l}-\delta_{l})P_{l+1}^{(k-1)}(\delta_{l})-(\delta_{l}-\delta_{l+k})P_{l}^{(k-1)}(\delta_{l})}{\delta_{l+k}-\delta_{l}}=\bm{g}_{i_{l}}-\bm{g}_{i_{l}} (49)

for p=kp=k, we have

P~l(k)(δl+k)=(δl+kδl)Pl+1(k1)(δl+k)(δl+kδl+k)Pl(k1)(δl+k)δl+kδl=𝒈il+k𝒈il\tilde{P}_{l}^{(k)}(\delta_{l+k})=\frac{(\delta_{l+k}-\delta_{l})P_{l+1}^{(k-1)}(\delta_{l+k})-(\delta_{l+k}-\delta_{l+k})P_{l}^{(k-1)}(\delta_{l+k})}{\delta_{l+k}-\delta_{l}}=\bm{g}_{i_{l+k}}-\bm{g}_{i_{l}} (50)

Therefore, P~l(k)(x)\tilde{P}_{l}^{(k)}(x) is the kk-th order polynomial which passes k+1k+1 distinct points (δl+p,𝒈il+p𝒈il)(\delta_{l+p},\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}}) for 0pk0\leq p\leq k. Due to the uniqueness of the Lagrange polynomial, we can conclude that Pl(k)(x)=P~l(k)(x)P_{l}^{(k)}(x)=\tilde{P}_{l}^{(k)}(x). By taking the coefficient of the highest-order term, we obtain

D~l(k)=D~l+1(k1)D~l(k1)δl+kδl\tilde{D}_{l}^{(k)}=\frac{\tilde{D}_{l+1}^{(k-1)}-\tilde{D}_{l}^{(k-1)}}{\delta_{l+k}-\delta_{l}} (51)

where by the induction hypothesis we have Dl+1(k1)=D~l+1(k1),Dl(k1)=D~l(k1)D_{l+1}^{(k-1)}=\tilde{D}_{l+1}^{(k-1)},D_{l}^{(k-1)}=\tilde{D}_{l}^{(k-1)}. Comparing Eq. (51) with the recurrence relation of Dl(k)D_{l}^{(k)} in Eq. (16), it follows that Dl(k)=D~l(k)D_{l}^{(k)}=\tilde{D}_{l}^{(k)}, which completes the mathematical induction.

Finally, by comparing the expression for D~l(k)\tilde{D}_{l}^{(k)} in Eq. (45) and the expression for 𝒈^s(k)\hat{\bm{g}}_{s}^{(k)} in Lemma B.8, we can conclude that 𝒈^s(k)=k!D0(k)\hat{\bm{g}}_{s}^{(k)}=k!D_{0}^{(k)}. ∎
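The recurrence in Eq. (16) is a divided-difference construction; the following minimal sketch (synthetic values, checked only for l=0, not part of the proof) verifies numerically that it reproduces the closed form of Eq. (45):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
delta = np.sort(rng.uniform(-1.0, 1.0, size=n + 1))   # delta_{i_0}, ..., delta_{i_n}
g = rng.normal(size=n + 1)                             # g_{i_0}, ..., g_{i_n}

# First-order differences D^(1)_l = (g_{i_{l+1}} - g_{i_l}) / (delta_{l+1} - delta_l)
tables = {1: [(g[l + 1] - g[l]) / (delta[l + 1] - delta[l]) for l in range(n)]}
# Recurrence of Eq. (16): D^(k)_l = (D^(k-1)_{l+1} - D^(k-1)_l) / (delta_{l+k} - delta_l)
for k in range(2, n + 1):
    prev = tables[k - 1]
    tables[k] = [(prev[l + 1] - prev[l]) / (delta[l + k] - delta[l]) for l in range(n - k + 1)]

def closed_form(k):
    # Closed form of Eq. (45) for l = 0
    return sum(
        (g[p] - g[0]) / np.prod([delta[p] - delta[q] for q in range(k + 1) if q != p])
        for p in range(1, k + 1)
    )

assert all(np.isclose(tables[k][0], closed_form(k)) for k in range(1, n + 1))
```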

B.4 Local Unbiasedness

Proof.

(Proof of Theorem 3.2) Subtracting the local exact solution in Eq. (9) from the (n+1)(n+1)-th order local approximation in Eq. (14), we have the local truncation error

𝒙^t𝒙t\displaystyle\hat{\bm{x}}_{t}-\bm{x}_{t} =αt𝑨(λs,λt)(λsλt𝑬λs(λ)𝒈θ(𝒙λ,λ)dλk=0n𝒈^θ(k)(𝒙λs,λs)λsλt𝑬λs(λ)(λλs)kk!dλ)\displaystyle=\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\left(\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)\mathrm{d}\lambda-\sum_{k=0}^{n}\hat{\bm{g}}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda\right) (52)
=αt𝑨(λs,λt)λsλt𝑬λs(λ)(𝒈θ(𝒙λ,λ)𝒈θ(𝒙λs,λs))dλ\displaystyle=\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\left(\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\right)\mathrm{d}\lambda
αt𝑨(λs,λt)k=1n𝒈^θ(k)(𝒙λs,λs)λsλt𝑬λs(λ)(λλs)kk!dλ\displaystyle-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\sum_{k=1}^{n}\hat{\bm{g}}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda
=αt𝑨(λs,λt)λsλt𝑬λs(λ)(𝒈θ(𝒙λ,λ)𝒈θ(𝒙λs,λs))dλ\displaystyle=\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\left(\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\right)\mathrm{d}\lambda
αt𝑨(λs,λt)k=1n(l=1n(𝑹n1)kl(𝒈θ(𝒙λil,λil)𝒈θ(𝒙λs,λs)))λsλt𝑬λs(λ)(λλs)kk!dλ\displaystyle-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\sum_{k=1}^{n}\left(\sum_{l=1}^{n}(\bm{R}_{n}^{-1})_{kl}(\bm{g}_{\theta}(\bm{x}_{\lambda_{i_{l}}},\lambda_{i_{l}})-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s}))\right)\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda

where 𝒙λ\bm{x}_{\lambda} is on the ground-truth ODE trajectory passing 𝒙λs\bm{x}_{\lambda_{s}}, and (𝑹n1)kl(\bm{R}_{n}^{-1})_{kl} is the element of the inverse matrix 𝑹n1\bm{R}_{n}^{-1} at the kk-th row and the ll-th column, as discussed in the proof of Lemma B.8. By Newton-Leibniz theorem, we have

𝒈θ(𝒙λ,λ)𝒈θ(𝒙λs,λs)=λsλ𝒈θ(1)(𝒙τ,τ)dτ\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})=\int_{\lambda_{s}}^{\lambda}\bm{g}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)\mathrm{d}\tau (53)

Also, since 𝒙λil,l=1,,n\bm{x}_{\lambda_{i_{l}}},l=1,\dots,n are on the ground-truth ODE trajectory passing 𝒙λs\bm{x}_{\lambda_{s}}, we have

𝒈θ(𝒙λil,λil)𝒈θ(𝒙λs,λs)=λsλil𝒈θ(1)(𝒙τ,τ)dτ\bm{g}_{\theta}(\bm{x}_{\lambda_{i_{l}}},\lambda_{i_{l}})-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})=\int_{\lambda_{s}}^{\lambda_{i_{l}}}\bm{g}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)\mathrm{d}\tau (54)

where

𝒈θ(1)(𝒙τ,τ)=eλsτ𝒔rdr(𝒇θ(1)(𝒙τ,τ)𝒔τ𝒇θ(𝒙τ,τ)𝒃τ)\bm{g}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)=e^{-\int_{\lambda_{s}}^{\tau}\bm{s}_{r}\mathrm{d}r}\left(\bm{f}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)-\bm{s}_{\tau}\bm{f}_{\theta}(\bm{x}_{\tau},\tau)-\bm{b}_{\tau}\right) (55)

Note that 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} are the solution to the least squares problem in Eq. (11), which ensures 𝔼pτθ(𝒙τ)[𝒇θ(1)(𝒙τ,τ)𝒔τ𝒇θ(𝒙τ,τ)𝒃τ]=0\mathbb{E}_{p_{\tau}^{\theta}(\bm{x}_{\tau})}\left[\bm{f}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)-\bm{s}_{\tau}\bm{f}_{\theta}(\bm{x}_{\tau},\tau)-\bm{b}_{\tau}\right]=0. It follows that 𝔼pλsθ(𝒙λs)[𝒇θ(1)(𝒙τ,τ)𝒔τ𝒇θ(𝒙τ,τ)𝒃τ]=0\mathbb{E}_{p_{\lambda_{s}}^{\theta}(\bm{x}_{\lambda_{s}})}\left[\bm{f}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)-\bm{s}_{\tau}\bm{f}_{\theta}(\bm{x}_{\tau},\tau)-\bm{b}_{\tau}\right]=0, since 𝒙τ\bm{x}_{\tau} is on the ground-truth ODE trajectory passing 𝒙λs\bm{x}_{\lambda_{s}}. Therefore, we have 𝔼pλsθ(𝒙λs)[𝒈θ(𝒙λ,λ)𝒈θ(𝒙λs,λs)]=0\mathbb{E}_{p_{\lambda_{s}}^{\theta}(\bm{x}_{\lambda_{s}})}\left[\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\right]=0 and 𝔼pλsθ(𝒙λs)[𝒈θ(𝒙λil,λil)𝒈θ(𝒙λs,λs)]=0\mathbb{E}_{p_{\lambda_{s}}^{\theta}(\bm{x}_{\lambda_{s}})}\left[\bm{g}_{\theta}(\bm{x}_{\lambda_{i_{l}}},\lambda_{i_{l}})-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\right]=0. Substituting them into Eq. (52), we conclude that 𝔼pλsθ(𝒙λs)[𝒙^t𝒙t]=0\mathbb{E}_{p_{\lambda_{s}}^{\theta}(\bm{x}_{\lambda_{s}})}\left[\hat{\bm{x}}_{t}-\bm{x}_{t}\right]=0. ∎

Appendix C Implementation Details

C.1 Computing the EMS and Related Integrals in the ODE Formulation

The ODE formulation and local approximation require computing some complex integrals involving 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda}. In this section, we give details on how to estimate 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda}^{*},\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*} from a few datapoints, and on how to use them to compute the integrals efficiently.

C.1.1 Computing the EMS

First, for the computation of 𝒍λ\bm{l}_{\lambda}^{*} in Eq. (5), note that

𝒙𝑵θ(𝒙λ,λ)=σλ𝒙ϵ(𝒙λ,λ)diag(𝒍λ)\nabla_{\bm{x}}\bm{N}_{\theta}(\bm{x}_{\lambda},\lambda)=\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}(\bm{x}_{\lambda},\lambda)-\mbox{diag}(\bm{l}_{\lambda}) (56)

Since diag(𝒍λ)\mbox{diag}(\bm{l}_{\lambda}) is a diagonal matrix, minimizing 𝔼pλθ(𝒙λ)[𝒙𝑵θ(𝒙λ,λ)F2]\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\|\nabla_{\bm{x}}\bm{N}_{\theta}(\bm{x}_{\lambda},\lambda)\|_{F}^{2}\right] is equivalent to minimizing 𝔼pλθ(𝒙λ)[diag1(𝒙𝑵θ(𝒙λ,λ))22]=𝔼pλθ(𝒙λ)[diag1(σλ𝒙ϵ(𝒙λ,λ))𝒍λ22]\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\|\mbox{diag}^{-1}(\nabla_{\bm{x}}\bm{N}_{\theta}(\bm{x}_{\lambda},\lambda))\|_{2}^{2}\right]=\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\|\mbox{diag}^{-1}(\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}(\bm{x}_{\lambda},\lambda))-\bm{l}_{\lambda}\|_{2}^{2}\right], where diag1\mbox{diag}^{-1} denotes the operator that takes the diagonal of a matrix as a vector. Thus we have 𝒍λ=𝔼pλθ(𝒙λ)[diag1(σλ𝒙ϵ(𝒙λ,λ))]\bm{l}_{\lambda}^{*}=\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\mbox{diag}^{-1}(\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}(\bm{x}_{\lambda},\lambda))\right].

However, this formula for 𝒍λ\bm{l}_{\lambda}^{*} requires computing the diagonal of the full Jacobian of the noise prediction model, which typically has 𝒪(d2)\mathcal{O}(d^{2}) time complexity for dd-dimensional data and is unacceptable when dd is large. Fortunately, the cost can be reduced to 𝒪(d)\mathcal{O}(d) by utilizing stochastic diagonal estimators and employing the efficient Jacobian-vector-product operator provided by forward-mode automatic differentiation in deep learning frameworks.

For a dd-by-dd matrix 𝑫\bm{D}, its diagonal can be unbiasedly estimated by [4]

diag1(𝑫)=[k=1s(𝑫𝒗k)𝒗k][k=1s𝒗k𝒗k]\mbox{diag}^{-1}(\bm{D})=\left[\sum_{k=1}^{s}(\bm{D}\bm{v}_{k})\odot\bm{v}_{k}\right]\oslash\left[\sum_{k=1}^{s}\bm{v}_{k}\odot\bm{v}_{k}\right] (57)

where 𝒗kp(𝒗)\bm{v}_{k}\sim p(\bm{v}) are dd-dimensional i.i.d. samples with zero mean, \odot is the element-wise multiplication i.e., Hadamard product, and \oslash is the element-wise division. The stochastic diagonal estimator is analogous to the famous Hutchinson’s trace estimator [20]. By taking p(𝒗)p(\bm{v}) as the Rademacher distribution, we have 𝒗k𝒗k=𝟏\bm{v}_{k}\odot\bm{v}_{k}=\bm{1}, and the denominator can be omitted. For simplicity, we use regular multiplication and division symbols, assuming they are element-wise between vectors. Then 𝒍λ\bm{l}_{\lambda}^{*} can be expressed as:

𝒍λ=𝔼pλθ(𝒙λ)p(𝒗)[(σλ𝒙ϵθ(𝒙λ,λ)𝒗)𝒗]\bm{l}_{\lambda}^{*}=\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})p(\bm{v})}\left[(\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\bm{v})\bm{v}\right] (58)

which is an unbiased estimate when we replace the expectation with the mean over finite samples 𝒙λpλθ(𝒙λ),𝒗p(𝒗)\bm{x}_{\lambda}\sim p_{\lambda}^{\theta}(\bm{x}_{\lambda}),\bm{v}\sim p(\bm{v}). The process for estimating 𝒍λ\bm{l}_{\lambda}^{*} can easily be parallelized across multiple devices by computing (σλ𝒙ϵθ(𝒙λ,λ)𝒗)𝒗\sum(\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\bm{v})\bm{v} on separate datapoints and gathering the results at the end.
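A minimal sketch of this estimator (Eq. (58)) is given below; eps_model, x, lam and sigma are placeholders for the pretrained noise predictor, a batch of samples 𝒙λ, the half-logSNR and σλ, and torch.autograd.functional.jvp is used instead of forward-mode AD for brevity:

```python
import torch

def estimate_l(eps_model, x, lam, sigma):
    # Rademacher probe vectors: v ⊙ v = 1, so the denominator of Eq. (57) drops out.
    v = (torch.rand_like(x) < 0.5).to(x.dtype) * 2 - 1
    # JVP computes (d eps / d x) v without materializing the full Jacobian
    # (assuming the model treats batch elements independently).
    _, jvp = torch.autograd.functional.jvp(lambda inp: eps_model(inp, lam), (x,), (v,))
    # Monte Carlo average of (sigma * (d eps / d x) v) ⊙ v over the batch (Eq. (58)).
    return (sigma * jvp * v).mean(dim=0)
```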

Next, for the computation of 𝒔λ,𝒃λ\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*} in Eq. (11), note that it is a simple least squares problem. By taking partial derivatives w.r.t. 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} and setting them to 0, we have

{𝔼pλθ(𝒙λ)[(𝒇θ(1)(𝒙λ,λ)𝒔λ𝒇θ(𝒙λ,λ)𝒃λ)𝒇θ(𝒙λ,λ)]=0𝔼pλθ(𝒙λ)[𝒇θ(1)(𝒙λ,λ)𝒔λ𝒇θ(𝒙λ,λ)𝒃λ]=0\begin{cases}\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\left(\bm{f}^{(1)}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{s}_{\lambda}^{*}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{b}_{\lambda}^{*}\right)\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)\right]=0\\ \mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\bm{f}^{(1)}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{s}_{\lambda}^{*}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{b}_{\lambda}^{*}\right]=0\end{cases} (59)

And we obtain the explicit formula for 𝒔λ,𝒃λ\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*}

𝒔λ\displaystyle\bm{s}_{\lambda}^{*} =𝔼pλθ(𝒙λ)[𝒇θ(𝒙λ,λ)𝒇θ(1)(𝒙λ,λ)]𝔼pλθ(𝒙λ)[𝒇θ(𝒙λ,λ)]𝔼pλθ(𝒙λ)[𝒇θ(1)(𝒙λ,λ)]𝔼pλθ(𝒙λ)[𝒇θ(𝒙λ,λ)𝒇θ(𝒙λ,λ)]𝔼pλθ(𝒙λ)[𝒇θ(𝒙λ,λ)]𝔼pλθ(𝒙λ)[𝒇θ(𝒙λ,λ)]\displaystyle=\frac{\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)\bm{f}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda)\right]-\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)\right]\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\bm{f}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda)\right]}{\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}[\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)]-\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}[\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)]\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}[\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)]} (60)
𝒃λ\displaystyle\bm{b}_{\lambda}^{*} =𝔼pλθ(𝒙λ)[𝒇(1)(𝒙λ,λ)]𝒔λ𝔼pλθ(𝒙λ)[𝒇θ(𝒙λ,λ)]\displaystyle=\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}[\bm{f}^{(1)}(\bm{x}_{\lambda},\lambda)]-\bm{s}_{\lambda}^{*}\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}[\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)] (61)

which are unbiased least squares estimators when we replace the expectation with the mean over finite samples 𝒙λpλθ(𝒙λ)\bm{x}_{\lambda}\sim p_{\lambda}^{\theta}(\bm{x}_{\lambda}). Also, the process for estimating 𝒔λ,𝒃λ\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*} can be parallelized across multiple devices by computing 𝒇θ,𝒇θ(1),𝒇θ𝒇θ,𝒇θ𝒇θ(1)\sum\bm{f}_{\theta},\sum\bm{f}^{(1)}_{\theta},\sum\bm{f}_{\theta}\bm{f}_{\theta},\sum\bm{f}_{\theta}\bm{f}^{(1)}_{\theta} on separate datapoints and gathering the results at the end. Thus, the estimation of 𝒔λ,𝒃λ\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*} involves evaluating 𝒇θ\bm{f}_{\theta} and 𝒇θ(1)\bm{f}_{\theta}^{(1)} at 𝒙λ\bm{x}_{\lambda}. 𝒇θ\bm{f}_{\theta} is a direct transformation of ϵθ\bm{\epsilon}_{\theta} and requires a single forward pass. For 𝒇θ(1)\bm{f}_{\theta}^{(1)}, we have

𝒇θ(1)(𝒙λ,λ)\displaystyle\bm{f}^{(1)}_{\theta}(\bm{x}_{\lambda},\lambda) =𝒇θ(𝒙λ,λ)λ+𝒙𝒇θ(𝒙λ,λ)d𝒙λdλ\displaystyle=\frac{\partial\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)}{\partial\lambda}+\nabla_{\bm{x}}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)\frac{\mathrm{d}\bm{x}_{\lambda}}{\mathrm{d}\lambda} (62)
=eλ(ϵθ(1)(𝒙λ,λ)ϵθ(𝒙λ,λ))𝒍˙λαλα˙λ𝒍λαλ2𝒙λ𝒍λαλ(α˙λαλ𝒙λσλϵθ(𝒙λ,λ))\displaystyle=e^{-\lambda}\left(\bm{\epsilon}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda)-\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\right)-\frac{\dot{\bm{l}}_{\lambda}\alpha_{\lambda}-\dot{\alpha}_{\lambda}\bm{l}_{\lambda}}{\alpha_{\lambda}^{2}}\bm{x}_{\lambda}-\frac{\bm{l}_{\lambda}}{\alpha_{\lambda}}\left(\frac{\dot{\alpha}_{\lambda}}{\ \alpha_{\lambda}}\bm{x}_{\lambda}-\sigma_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\right)
=eλ((𝒍λ1)ϵθ(𝒙λ,λ)+ϵθ(1)(𝒙λ,λ))𝒍˙λ𝒙λαλ\displaystyle=e^{-\lambda}\left((\bm{l}_{\lambda}-1)\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)+\bm{\epsilon}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda)\right)-\frac{\dot{\bm{l}}_{\lambda}\bm{x}_{\lambda}}{\alpha_{\lambda}}

After we obtain 𝒍λ\bm{l}_{\lambda}, 𝒍˙λ\dot{\bm{l}}_{\lambda} can be estimated by finite difference. To compute ϵθ(1)(𝒙λ,λ)\bm{\epsilon}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda), we have

ϵθ(1)(𝒙λ,λ)\displaystyle\bm{\epsilon}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda) =λϵθ(𝒙λ,λ)+𝒙ϵθ(𝒙λ,λ)d𝒙λdλ\displaystyle=\partial_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)+\nabla_{\bm{x}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\frac{\mathrm{d}\bm{x}_{\lambda}}{\mathrm{d}\lambda} (63)
=λϵθ(𝒙λ,λ)+𝒙ϵθ(𝒙λ,λ)(α˙λαλ𝒙λσλϵθ(𝒙λ,λ))\displaystyle=\partial_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)+\nabla_{\bm{x}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\left(\frac{\dot{\alpha}_{\lambda}}{\ \alpha_{\lambda}}\bm{x}_{\lambda}-\sigma_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\right)

which can also be computed with the Jacobian-vector-product operator.

In conclusion, for any λ\lambda, 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda}^{*},\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*} can be efficiently and unbiasedly estimated by sampling a few datapoints 𝒙λpλθ(𝒙λ)\bm{x}_{\lambda}\sim p_{\lambda}^{\theta}(\bm{x}_{\lambda}) and using the Jacobian-vector-product.
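As a complement, a minimal sketch of the closed-form least squares solution in Eqs. (60)-(61) is given below; f and f1 are assumed to hold 𝒇θ and 𝒇θ(1) evaluated at K datapoints (shape [K, D]), e.g., obtained with the JVPs described above:

```python
import torch

def estimate_s_b(f, f1):
    # Per-dimension least squares of Eqs. (60)-(61) over a batch of K samples.
    Ef, Ef1 = f.mean(0), f1.mean(0)
    cov = (f * f1).mean(0) - Ef * Ef1   # E[f f^(1)] - E[f] E[f^(1)]
    var = (f * f).mean(0) - Ef * Ef     # E[f f] - E[f]^2
    s = cov / var
    b = Ef1 - s * Ef
    return s, b
```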

C.1.2 Integral Precomputing

In the local approximation in Eq. (14), there are three integrals involving the EMS, which are 𝑨(λs,λt),λsλt𝑬λs(λ)𝑩λs(λ)dλ,λsλt𝑬λs(λ)(λλs)kk!dλ\bm{A}(\lambda_{s},\lambda_{t}),\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda,\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda. Define the following terms, which are also evaluated at λj0,λj1,,λjN\lambda_{j_{0}},\lambda_{j_{1}},\dots,\lambda_{j_{N}} and can be precomputed in 𝒪(N)\mathcal{O}(N) time:

𝑳λ\displaystyle\bm{L}_{\lambda} =λTλ𝒍τdτ\displaystyle=\int_{\lambda_{T}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau (64)
𝑺λ\displaystyle\bm{S}_{\lambda} =λTλ𝒔τdτ\displaystyle=\int_{\lambda_{T}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau
𝑩λ\displaystyle\bm{B}_{\lambda} =λTλeλTr𝒔τdτ𝒃rdr=λTλe𝑺r𝒃rdr\displaystyle=\int_{\lambda_{T}}^{\lambda}e^{-\int_{\lambda_{T}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r=\int_{\lambda_{T}}^{\lambda}e^{-\bm{S}_{r}}\bm{b}_{r}\mathrm{d}r
𝑪λ\displaystyle\bm{C}_{\lambda} =λTλ(eλTu(𝒍τ+𝒔τ)dτλTueλTr𝒔τdτ𝒃rdr)du=λTλe𝑳u+𝑺u𝑩udu\displaystyle=\int_{\lambda_{T}}^{\lambda}\left(e^{\int_{\lambda_{T}}^{u}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\int_{\lambda_{T}}^{u}e^{-\int_{\lambda_{T}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r\right)\mathrm{d}u=\int_{\lambda_{T}}^{\lambda}e^{\bm{L}_{u}+\bm{S}_{u}}\bm{B}_{u}\mathrm{d}u
𝑰λ\displaystyle\bm{I}_{\lambda} =λTλeλTr(𝒍τ+𝒔τ)dτdr=λTλe𝑳r+𝑺rdr\displaystyle=\int_{\lambda_{T}}^{\lambda}e^{\int_{\lambda_{T}}^{r}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\mathrm{d}r=\int_{\lambda_{T}}^{\lambda}e^{\bm{L}_{r}+\bm{S}_{r}}\mathrm{d}r

Then for any λs,λt\lambda_{s},\lambda_{t}, we can verify that the first two integrals can be expressed as

𝑨(λs,λt)\displaystyle\bm{A}(\lambda_{s},\lambda_{t}) =e𝑳λs𝑳λt\displaystyle=e^{\bm{L}_{\lambda_{s}}-\bm{L}_{\lambda_{t}}} (65)
λsλt𝑬λs(λ)𝑩λs(λ)dλ\displaystyle\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda =e𝑳λs(𝑪λt𝑪λs𝑩λs(𝑰λt𝑰λs))\displaystyle=e^{-\bm{L}_{\lambda_{s}}}(\bm{C}_{\lambda_{t}}-\bm{C}_{\lambda_{s}}-\bm{B}_{\lambda_{s}}(\bm{I}_{\lambda_{t}}-\bm{I}_{\lambda_{s}}))

which can be computed in 𝒪(1)\mathcal{O}(1) time. For the third and last integral, denote it as 𝑬λs,λt(k)\bm{E}_{\lambda_{s},\lambda_{t}}^{(k)}, i.e.,

𝑬λs,λt(k)=λsλt𝑬λs(λ)(λλs)kk!dλ\bm{E}_{\lambda_{s},\lambda_{t}}^{(k)}=\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda (66)

We need to compute it for 0kn0\leq k\leq n and for every local transition time pair (λs,λt)(\lambda_{s},\lambda_{t}) in the sampling process. For k=0k=0, we have

𝑬λs,λt(0)=e𝑳λs𝑺λs(𝑰λt𝑰λs)\bm{E}_{\lambda_{s},\lambda_{t}}^{(0)}=e^{-\bm{L}_{\lambda_{s}}-\bm{S}_{\lambda_{s}}}(\bm{I}_{\lambda_{t}}-\bm{I}_{\lambda_{s}}) (67)

which can also be computed in 𝒪(1)\mathcal{O}(1) time. But for k>0k>0, we no longer have such a simplification. Still, for any fixed timestep schedule {λi}i=0M\{\lambda_{i}\}_{i=0}^{M} used during sampling, we can use a lazy precomputing strategy: compute 𝑬λi1,λi(k),1iM\bm{E}_{\lambda_{i-1},\lambda_{i}}^{(k)},1\leq i\leq M when generating the first sample, store each value with a unique key (k,i)(k,i), and retrieve it in 𝒪(1)\mathcal{O}(1) time in subsequent sampling runs.
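A minimal NumPy sketch of this precomputation is given below, assuming the EMS 𝒍,𝒔,𝒃 are given on a grid of N+1 half-logSNRs (arrays of shape [N+1, D]); the trapezoidal rule is used as a simple stand-in for the actual quadrature, and the indices idx_s, idx_t are assumed to lie on the grid (in practice one would interpolate):

```python
import numpy as np

def cumtrapz(y, x):
    # Cumulative trapezoidal integral of y (shape [N+1, D]) over the grid x (shape [N+1]).
    dx = np.diff(x)[:, None]
    out = np.zeros_like(y)
    out[1:] = np.cumsum(0.5 * (y[1:] + y[:-1]) * dx, axis=0)
    return out

def precompute(lambdas, l, s, b):
    # The cumulative quantities L, S, B, C, I of Eq. (64).
    L = cumtrapz(l, lambdas)
    S = cumtrapz(s, lambdas)
    B = cumtrapz(np.exp(-S) * b, lambdas)
    C = cumtrapz(np.exp(L + S) * B, lambdas)
    I = cumtrapz(np.exp(L + S), lambdas)
    return L, S, B, C, I

def transition_integrals(idx_s, idx_t, L, S, B, C, I):
    # Eq. (65) and Eq. (67), each in O(1) time given the precomputed arrays.
    A = np.exp(L[idx_s] - L[idx_t])
    EB = np.exp(-L[idx_s]) * (C[idx_t] - C[idx_s] - B[idx_s] * (I[idx_t] - I[idx_s]))
    E0 = np.exp(-L[idx_s] - S[idx_s]) * (I[idx_t] - I[idx_s])
    return A, EB, E0
```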

C.2 Algorithm

We provide the pseudocode of the local approximation and the global solver in Algorithm 1 and Algorithm 2, which concisely describe how we implement DPM-Solver-v3.

Algorithm 1 (n+1)(n+1)-th order local approximation: LUpdaten+1

Require: noise schedule αt,σt\alpha_{t},\sigma_{t}, coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda}

Input: transition time pair (s,t)(s,t), 𝒙s\bm{x}_{s}, nn extra timesteps {tik}k=1n\{t_{i_{k}}\}_{k=1}^{n}, 𝒈θ\bm{g}_{\theta} values (𝒈in,,𝒈i1,𝒈s)(\bm{g}_{i_{n}},\dots,\bm{g}_{i_{1}},\bm{g}_{s}) at {(𝒙λik,tik)}k=1n\{(\bm{x}_{\lambda_{i_{k}}},t_{i_{k}})\}_{k=1}^{n} and (𝒙s,s)(\bm{x}_{s},s)

Input Format: {tin,𝒈in},,{ti1,𝒈i1},{s,𝒙s,𝒈s},t\{t_{i_{n}},\bm{g}_{i_{n}}\},\dots,\{t_{i_{1}},\bm{g}_{i_{1}}\},\{s,\bm{x}_{s},\bm{g}_{s}\},t

1:  Compute 𝑨(λs,λt),λsλt𝑬λs(λ)𝑩λs(λ)dλ,λsλt𝑬λs(λ)(λλs)kk!dλ\bm{A}(\lambda_{s},\lambda_{t}),\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda,\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda (Appendix C.1.2)
2:  δk=λikλs,k=1,,n\delta_{k}=\lambda_{i_{k}}-\lambda_{s},\quad k=1,\dots,n
3:  (𝒈^s(1)𝒈^s(2)2!𝒈^s(n)n!)(δ1δ12δ1nδ2δ22δ2nδnδn2δnn)1(𝒈i1𝒈s𝒈i2𝒈s𝒈in𝒈s)\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}\\ \frac{\hat{\bm{g}}^{(2)}_{s}}{2!}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}}{n!}\end{pmatrix}\leftarrow\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{n}\\ \delta_{2}&\delta_{2}^{2}&\cdots&\delta_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{n}&\delta_{n}^{2}&\cdots&\delta_{n}^{n}\end{pmatrix}^{-1}\begin{pmatrix}\bm{g}_{i_{1}}-\bm{g}_{s}\\ \bm{g}_{i_{2}}-\bm{g}_{s}\\ \vdots\\ \bm{g}_{i_{n}}-\bm{g}_{s}\end{pmatrix} (Eq. (13))
4:  𝒙^tαt𝑨(λs,λt)(𝒙sαsλsλt𝑬λs(λ)𝑩λs(λ)dλk=0n𝒈^s(k)λsλt𝑬λs(λ)(λλs)kk!dλ)\displaystyle\hat{\bm{x}}_{t}\leftarrow\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\left(\frac{\bm{x}_{s}}{\alpha_{s}}-\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda-\sum_{k=0}^{n}\hat{\bm{g}}^{(k)}_{s}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda\right) (Eq. (14))

Output: 𝒙^t\hat{\bm{x}}_{t}
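For concreteness, a minimal NumPy sketch of Algorithm 1 is given below, assuming the EMS integrals have already been precomputed (e.g., as in Appendix C.1.2). Here Ek[k] stands for ∫𝑬λs(λ)(λ−λs)^k dλ without the 1/k! factor, which is absorbed into the solved coefficients; all names are illustrative.

```python
import numpy as np

def lupdate(x_s, g_s, g_extra, lam_s, lam_extra, alpha_s, alpha_t, A, EB, Ek):
    # g_extra / lam_extra: g_theta values and half-logSNRs at the n extra timesteps.
    n = len(g_extra)
    delta = np.array(lam_extra) - lam_s                           # step 2
    R = np.stack([delta ** j for j in range(1, n + 1)], axis=1)   # Vandermonde-like matrix
    rhs = np.stack([g - g_s for g in g_extra])
    coeffs = np.linalg.solve(R, rhs)                              # step 3: [g^(k)/k!], k=1..n
    # step 4: Taylor-series term of Eq. (14); Ek[k] = int E(lambda)(lambda-lambda_s)^k dlambda
    series = g_s * Ek[0] + sum(coeffs[k] * Ek[k + 1] for k in range(n))
    return alpha_t * A * (x_s / alpha_s - EB - series)
```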

Algorithm 2 (n+1)(n+1)-th order multistep predictor-corrector algorithm

Require: noise prediction model ϵθ\bm{\epsilon}_{\theta}, noise schedule αt,σt\alpha_{t},\sigma_{t}, coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda}, cache Q1,Q2Q_{1},Q_{2}

Input: timesteps {ti}i=0M\{t_{i}\}_{i=0}^{M}, initial value 𝒙0\bm{x}_{0}

1:  Q1cache𝒙0Q_{1}\overset{cache}{\leftarrow}\bm{x}_{0}
2:  Q2cacheϵθ(𝒙0,t0)Q_{2}\overset{cache}{\leftarrow}\bm{\epsilon}_{\theta}(\bm{x}_{0},t_{0})
3:  for m=1m=1 to MM do
4:     nmmin{n+1,m}n_{m}\leftarrow\min\{n+1,m\}
5:     𝒙^mnm,,𝒙^m1fetchQ1\hat{\bm{x}}_{m-n_{m}},\dots,\hat{\bm{x}}_{m-1}\overset{fetch}{\leftarrow}Q_{1}
6:     ϵ^mnm,,ϵ^m1fetchQ2\hat{\bm{\epsilon}}_{m-n_{m}},\dots,\hat{\bm{\epsilon}}_{m-1}\overset{fetch}{\leftarrow}Q_{2}
7:     𝒈^leλm1λl𝒔τdτσλlϵ^l𝒍λl𝒙^lαλlλm1λleλm1r𝒔τdτ𝒃rdr,l=mnm,,m1\displaystyle\hat{\bm{g}}_{l}\leftarrow e^{-\int_{\lambda_{m-1}}^{\lambda_{l}}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\sigma_{\lambda_{l}}\hat{\bm{\epsilon}}_{l}-\bm{l}_{\lambda_{l}}\hat{\bm{x}}_{l}}{\alpha_{\lambda_{l}}}-\int_{\lambda_{m-1}}^{\lambda_{l}}e^{-\int_{\lambda_{m-1}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r,\quad l=m-n_{m},\dots,m-1 (Eq. (8))
8:     𝒙^mLUpdatenm({tmnm,𝒈^mnm},,{tm2,𝒈^m2},{tm1,𝒙^m1,𝒈^m1},tm)\hat{\bm{x}}_{m}\leftarrow\text{LUpdate}_{n_{m}}(\{t_{m-n_{m}},\hat{\bm{g}}_{m-n_{m}}\},\dots,\{t_{m-2},\hat{\bm{g}}_{m-2}\},\{t_{m-1},\hat{\bm{x}}_{m-1},\hat{\bm{g}}_{m-1}\},t_{m})
9:     if mMm\neq M then
10:        ϵ^mϵθ(𝒙^m,tm)\hat{\bm{\epsilon}}_{m}\leftarrow\bm{\epsilon}_{\theta}(\hat{\bm{x}}_{m},t_{m})
11:        𝒈^meλm1λm𝒔τdτσλmϵ^m𝒍λm𝒙^mαλmλm1λmeλm1r𝒔τdτ𝒃rdr\displaystyle\hat{\bm{g}}_{m}\leftarrow e^{-\int_{\lambda_{m-1}}^{\lambda_{m}}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\sigma_{\lambda_{m}}\hat{\bm{\epsilon}}_{m}-\bm{l}_{\lambda_{m}}\hat{\bm{x}}_{m}}{\alpha_{\lambda_{m}}}-\int_{\lambda_{m-1}}^{\lambda_{m}}e^{-\int_{\lambda_{m-1}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r (Eq. (8))
12:        𝒙^mcLUpdatenm({tmnm+1,𝒈^mnm+1},,{tm2,𝒈^m2},{tm,𝒈^m},\hat{\bm{x}}_{m}^{c}\leftarrow\text{LUpdate}_{n_{m}}(\{t_{m-n_{m}+1},\hat{\bm{g}}_{m-n_{m}+1}\},\dots,\{t_{m-2},\hat{\bm{g}}_{m-2}\},\{t_{m},\hat{\bm{g}}_{m}\},
                                                            {tm1,𝒙^m1,𝒈^m1},tm)\{t_{m-1},\hat{\bm{x}}_{m-1},\hat{\bm{g}}_{m-1}\},t_{m})
13:        ϵ^mcϵ^m+𝒍λm(𝒙^mc𝒙^m)/σλm\hat{\bm{\epsilon}}_{m}^{c}\leftarrow\hat{\bm{\epsilon}}_{m}+\bm{l}_{\lambda_{m}}(\hat{\bm{x}}_{m}^{c}-\hat{\bm{x}}_{m})/\sigma_{\lambda_{m}} (to ensure 𝒈^mc=𝒈^m\hat{\bm{g}}_{m}^{c}=\hat{\bm{g}}_{m})
14:        Q1cache𝒙^mcQ_{1}\overset{cache}{\leftarrow}\hat{\bm{x}}_{m}^{c}
15:        Q2cacheϵ^mcQ_{2}\overset{cache}{\leftarrow}\hat{\bm{\epsilon}}_{m}^{c}
16:     end if
17:  end for

Output: 𝒙^M\hat{\bm{x}}_{M}

Appendix D Experiment Details

In this section, we provide more experiment details for each setting, including the codebases and the configurations for evaluation, EMS computing and sampling. Unless otherwise stated, we utilize the forward-mode automatic differentiation (torch.autograd.forward_ad) provided by PyTorch [39] to compute the Jacobian-vector-products (JVPs). Also, as stated in Section 3.4, we draw datapoints 𝒙λ\bm{x}_{\lambda} from the marginal distribution qλq_{\lambda} defined by the forward diffusion process starting from some data distribution q0q_{0}, instead of the model distribution pλθp_{\lambda}^{\theta}.
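For reference, a minimal sketch of a forward-mode JVP with torch.autograd.forward_ad is shown below; this is the primitive used to form products like ∇𝒙ϵθ(𝒙λ,λ)𝒗 without materializing the Jacobian, and eps_model, x, lam, v are placeholders:

```python
import torch
import torch.autograd.forward_ad as fwAD

def jvp_forward(eps_model, x, lam, v):
    with fwAD.dual_level():
        x_dual = fwAD.make_dual(x, v)        # attach the tangent v to the input x
        out = eps_model(x_dual, lam)
        _, tangent = fwAD.unpack_dual(out)   # tangent = (d eps / d x) v
    return tangent
```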

D.1 ScoreSDE on CIFAR10

Codebase and evaluation For unconditional sampling on CIFAR10 [24], one experiment setting is based on the pretrained pixel-space diffusion model provided by ScoreSDE [51]. We use their official PyTorch codebase and their checkpoint checkpoint_8.pth under the vp/cifar10_ddpmpp_deep_continuous config. We adopt their statistics file and code for computing FID.

EMS computing We estimate the EMS at N=1200N=1200 uniform timesteps λj0,λj1,,λjN\lambda_{j_{0}},\lambda_{j_{1}},\dots,\lambda_{j_{N}} by drawing K=4096K=4096 datapoints 𝒙λ0q0\bm{x}_{\lambda_{0}}\sim q_{0}, where q0q_{0} is the distribution of the training set. We compute two sets of EMS, corresponding to start time ϵ=103\epsilon=10^{-3} (NFE\leq 10) and ϵ=104\epsilon=10^{-4} (NFE>10) in the sampling process respectively. The total time for EMS computing is \sim7h on 8 NVIDIA A40 GPUs.

Sampling Following previous works [31, 32, 58], we use start time ϵ=103\epsilon=10^{-3} (NFE\leq 10) and ϵ=104\epsilon=10^{-4} (NFE>10), end time T=1T=1 and adopt the uniform logSNR timestep schedule. For DPM-Solver-v3, we use the 3rd-order predictor with the 3rd-order corrector by default. In particular, we change to the pseudo 3rd-order predictor at 5 NFE to further boost the performance.

D.2 EDM on CIFAR10

Codebase and evaluation For unconditional sampling on CIFAR10 [24], another experiment setting is based on the pretrained pixel-space diffusion model provided by EDM [21]. We use the PyTorch version of their official codebase, and their checkpoint edm-cifar10-32x32-uncond-vp.pkl. For consistency, we borrow the statistics file and code from ScoreSDE [51] for computing FID.

EMS computing Since the pretrained models of EDM are stored as pickled objects, we cannot use torch.autograd.forward_ad for computing JVPs. Instead, we use torch.autograd.functional.jvp, which is much slower since it employs the double-backward trick. We estimate two sets of EMS. One corresponds to N=1200N=1200 uniform timesteps λj0,λj1,,λjN\lambda_{j_{0}},\lambda_{j_{1}},\dots,\lambda_{j_{N}} and K=1024K=1024 datapoints 𝒙λ0q0\bm{x}_{\lambda_{0}}\sim q_{0}, where q0q_{0} is the distribution of the training set. The other corresponds to N=120,K=4096N=120,K=4096. They are used when NFE<10 and NFE\geq10, respectively. The total time for EMS computing is \sim3.5h on 8 NVIDIA A40 GPUs.
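For reference, a minimal sketch of this slower fallback is shown below; eps_theta, x, t, and the tangent v are placeholder names.

    import torch

    # Minimal sketch: JVP via torch.autograd.functional.jvp (double-backward trick),
    # used when forward-mode AD cannot be applied to the pretrained model.
    def jvp_fallback(eps_theta, x, t, v):
        out, jvp = torch.autograd.functional.jvp(lambda x_: eps_theta(x_, t), (x,), (v,))
        return out, jvp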

Sampling Following EDM, we use start time tmin=0.002t_{\min}=0.002 and end time tmax=80.0t_{\max}=80.0, but adopt the uniform logSNR timestep schedule, which performs better in practice. For DPM-Solver-v3, we use the 3rd-order predictor and additionally employ the 3rd-order corrector when NFE\leq 6. In particular, we switch to the pseudo 3rd-order predictor at 5 NFE to further boost the performance.

D.3 Latent-Diffusion on LSUN-Bedroom

Codebase and evaluation The unconditional sampling on LSUN-Bedroom [55] is based on the pretrained latent-space diffusion model provided by Latent-Diffusion [43]. We use the PyTorch version of their official codebase and their default checkpoint. We borrow the statistics file and code from Guided-Diffusion [10] for computing FID.

EMS computing We estimate the EMS at N=120N=120 uniform timesteps λj0,λj1,,λjN\lambda_{j_{0}},\lambda_{j_{1}},\dots,\lambda_{j_{N}} by drawing K=1024K=1024 datapoints 𝒙λ0q0\bm{x}_{\lambda_{0}}\sim q_{0}, where q0q_{0} is the distribution of the latents of the training set. The total time for EMS computing is \sim12 min on 8 NVIDIA A40 GPUs.

Sampling Following previous works [58], we use start time ϵ=103\epsilon=10^{-3}, end time T=1T=1 and adopt the uniform tt timestep schedule. For DPM-Solver-v3, we use the 3rd-order predictor with the pseudo 4th-order corrector.

D.4 Guided-Diffusion on ImageNet-256

Codebase and evaluation The conditional sampling on ImageNet-256 [9] is based on the pretrained pixel-space diffusion model provided by Guided-Diffusion [10]. We use the PyTorch version of their official codebase and their two checkpoints: the conditional diffusion model 256x256_diffusion.pt and the classifier 256x256_classifier.pt. We adopt their statistics file and code for computing FID.

EMS computing We estimate the EMS at N=500N=500 uniform timesteps λj0,λj1,,λjN\lambda_{j_{0}},\lambda_{j_{1}},\dots,\lambda_{j_{N}} by drawing K=1024K=1024 datapoints 𝒙λ0q0\bm{x}_{\lambda_{0}}\sim q_{0}, where q0q_{0} is the distribution of the training set. Also, we find that the FID metric behaves atypically on the ImageNet-256 dataset, and the degenerate choice of 𝒍λ\bm{l}_{\lambda} (𝒍λ=𝟏\bm{l}_{\lambda}=\bm{1}) performs better. The total time for EMS computing is \sim9.5h on 8 NVIDIA A40 GPUs.

Sampling Following previous works [31, 32, 58], we use start time ϵ=103\epsilon=10^{-3}, end time T=1T=1 and adopt the uniform tt timestep schedule. For DPM-Solver-v3, we use the 2nd-order predictor with the pseudo 3rd-order corrector.

D.5 Stable-Diffusion on MS-COCO2014 prompts

Codebase and evaluation The text-to-image sampling on MS-COCO2014 [26] prompts is based on the pretrained latent-space diffusion model provided by Stable-Diffusion [43]. We use the PyTorch version of their official codebase and their checkpoint sd-v1-4.ckpt. We compute the MSE on randomly selected captions from the MS-COCO2014 validation dataset, as detailed in Section 4.1.

EMS computing We estimate the EMS at N=250N=250 uniform timesteps λj0,λj1,,λjN\lambda_{j_{0}},\lambda_{j_{1}},\dots,\lambda_{j_{N}} by drawing K=1024K=1024 datapoints 𝒙λ0q0\bm{x}_{\lambda_{0}}\sim q_{0}. Since Stable-Diffusion is trained on the LAION-5B dataset [46], there is a gap between the images in the MS-COCO2014 validation dataset and the images generated by Stable-Diffusion at a given guidance scale. Thus, we choose q0q_{0} to be the distribution of the latents generated by Stable-Diffusion with the corresponding guidance scale, using 200-step DPM-Solver++ [32]. We generate these latents with random captions and Gaussian noise different from those we use to compute the MSE. The total time for EMS computing is \sim11h on 8 NVIDIA A40 GPUs for each guidance scale.

Sampling Following previous works [32, 58], we use start time ϵ=103\epsilon=10^{-3}, end time T=1T=1 and adopt the uniform tt timestep schedule. For DPM-Solver-v3, we use the 2nd-order predictor with the pseudo 3rd-order corrector.

D.6 License

Table 3: The datasets and codebases used in this work, and their licenses.
Name URL Citation License
CIFAR10 https://www.cs.toronto.edu/~kriz/cifar.html [24] N/A
LSUN-Bedroom https://www.yf.io/p/lsun [55] N/A
ImageNet-256 https://www.image-net.org [9] N/A
MS-COCO2014 https://cocodataset.org [26] CC BY 4.0
ScoreSDE https://github.com/yang-song/score_sde_pytorch [51] Apache-2.0
EDM https://github.com/NVlabs/edm [21] CC BY-NC-SA 4.0
Guided-Diffusion https://github.com/openai/guided-diffusion [10] MIT
Latent-Diffusion https://github.com/CompVis/latent-diffusion [43] MIT
Stable-Diffusion https://github.com/CompVis/stable-diffusion [43] CreativeML Open RAIL-M
DPM-Solver++ https://github.com/LuChengTHU/dpm-solver [32] MIT
UniPC https://github.com/wl-zhao/UniPC [58] N/A

We list the datasets and codebases used in this work, together with their licenses, in Table 3.

Appendix E Runtime Comparison

Table 4: Runtime of different methods to generate a single batch (seconds / batch, ±\pmstd) on a single NVIDIA A40, varying the number of function evaluations (NFE). We do not include the runtime of the decoding stage for latent-space DPMs.
Method NFE
5 10 15 20
CIFAR10 [24], ScoreSDE [51] (batch size = 128)
DPM-Solver++ [32] 1.253(±\pm0.0014) 2.503(±\pm0.0017) 3.754(±\pm0.0042) 5.010(±\pm0.0048)
UniPC [58] 1.268(±\pm0.0012) 2.532(±\pm0.0018) 3.803(±\pm0.0037) 5.080(±\pm0.0049)
DPM-Solver-v3 1.273(±\pm0.0005) 2.540(±\pm0.0023) 3.826(±\pm0.0039) 5.108(±\pm0.0055)
CIFAR10 [24], EDM [21] (batch size = 128)
DPM-Solver++ [32] 1.137(±\pm0.0011) 2.278(±\pm0.0015) 3.426(±\pm0.0024) 4.569(±\pm0.0031)
UniPC [58] 1.142(±\pm0.0016) 2.289(±\pm0.0019) 3.441(±\pm0.0035) 4.590(±\pm0.0021)
DPM-Solver-v3 1.146(±\pm0.0010) 2.293(±\pm0.0015) 3.448(±\pm0.0018) 4.600(±\pm0.0027)
LSUN-Bedroom [55], Latent-Diffusion [43] (batch size = 32)
DPM-Solver++ [32] 1.302(±\pm0.0009) 2.608(±\pm0.0010) 3.921(±\pm0.0023) 5.236(±\pm0.0045)
UniPC [58] 1.305(±\pm0.0005) 2.616(±\pm0.0019) 3.934(±\pm0.0033) 5.244(±\pm0.0043)
DPM-Solver-v3 1.302(±\pm0.0010) 2.620(±\pm0.0027) 3.932(±\pm0.0028) 5.290(±\pm0.0030)
ImageNet256 [9], Guided-Diffusion [10] (batch size = 4)
DPM-Solver++ [32] 1.594(±\pm0.0011) 3.194(±\pm0.0018) 4.792(±\pm0.0031) 6.391(±\pm0.0045)
UniPC [58] 1.606(±\pm0.0026) 3.205(±\pm0.0025) 4.814(±\pm0.0049) 6.427(±\pm0.0060)
DPM-Solver-v3 1.601(±\pm0.0059) 3.229(±\pm0.0031) 4.807(±\pm0.0068) 6.458(±\pm0.0257)
MS-COCO2014 [26], Stable-Diffusion [43] (batch size = 4)
DPM-Solver++ [32] 1.732(±\pm0.0012) 3.464(±\pm0.0020) 5.229(±\pm0.0027) 6.974(±\pm0.0013)
UniPC [58] 1.735(±\pm0.0012) 3.484(±\pm0.0364) 5.212(±\pm0.0015) 6.988(±\pm0.0035)
DPM-Solver-v3 1.731(±\pm0.0008) 3.471(±\pm0.0011) 5.211(±\pm0.0030) 6.945(±\pm0.0022)

As we have mentioned in Section 4, the runtime of DPM-Solver-v3 is almost the same as that of other solvers (DDIM [48], DPM-Solver [31], DPM-Solver++ [32], UniPC [58], etc.) as long as they use the same NFE. This is because the main computational cost is the sequential evaluation of the large neural network ϵθ\bm{\epsilon}_{\theta}, while the other coefficients are either computed analytically [48, 31, 32, 58] or precomputed (DPM-Solver-v3), and thus have negligible cost.

Table 4 shows the runtime of DPM-Solver-v3 and some other solvers on a single NVIDIA A40 under different settings. We use torch.cuda.Event and torch.cuda.synchronize to accurately measure the runtime. We evaluate the runtime on 8 batches (dropping the first batch, since it contains extra initializations) and report the mean and std. We can see that the runtime is proportional to NFE and differs by only about ±1\pm 1% across solvers, which confirms our statement. Therefore, a speedup in NFE translates almost directly into a speedup in runtime.
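For reference, the timing protocol can be sketched as follows (sampler and batch are placeholder names for one of the evaluated solvers and its inputs):

    import torch

    # Minimal sketch of the timing protocol: CUDA events around one full sampling
    # run, with explicit synchronization before and after.
    def time_one_batch(sampler, batch):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()                   # make sure previous work is done
        start.record()
        sampler(batch)                             # one sampling run with a fixed NFE
        end.record()
        torch.cuda.synchronize()                   # wait for the recorded events
        return start.elapsed_time(end) / 1000.0    # seconds per batch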

Appendix F Quantitative Results

Table 5: Quantitative results on LSUN-Bedroom [55]. We report the FID\downarrow of the methods with different numbers of function evaluations (NFE), evaluated on 50k samples.
Method Model NFE
5 6 8 10 12 15 20
DPM-Solver++ [32] Latent-Diffusion [43] 18.59 8.50 4.19 3.63 3.43 3.29 3.16
UniPC [58] 12.24 6.19 4.00 3.56 3.34 3.18 3.07
DPM-Solver-v3 7.54 4.79 3.53 3.16 3.06 3.05 3.05
Table 6: Quantitative results on ImageNet-256 [9]. We report the FID\downarrow of the methods with different numbers of function evaluations (NFE), evaluated on 10k samples.
Method Model NFE
5 6 8 10 12 15 20
DPM-Solver++ [32] Guided-Diffusion [10] (s=2.0s=2.0) 16.87 13.09 9.95 8.72 8.13 7.73 7.48
UniPC [58] 15.62 11.91 9.29 8.35 7.95 7.64 7.44
DPM-Solver-v3 15.10 11.39 8.96 8.27 7.94 7.62 7.39
Table 7: Quantitative results on MS-COCO2014 [26] prompts. We report the MSE\downarrow of the methods with different numbers of function evaluations (NFE), evaluated on 10k samples.
Method Model NFE
5 6 8 10 12 15 20
DPM-Solver++ [32] Stable-Diffusion [43] (s=1.5s=1.5) 0.076 0.056 0.028 0.016 0.012 0.009 0.006
UniPC [58] 0.055 0.039 0.024 0.012 0.007 0.005 0.002
DPM-Solver-v3 0.037 0.027 0.024 0.007 0.005 0.001 0.002
DPM-Solver++ [32] Stable-Diffusion [43] (s=7.5s=7.5) 0.60 0.65 0.50 0.46 0.42 0.38 0.30
UniPC [58] 0.65 0.71 0.56 0.46 0.43 0.35 0.31
DPM-Solver-v3 0.55 0.64 0.49 0.40 0.45 0.34 0.29

We present the detailed quantitative results of Section 4.1 for different datasets in Table 1, Table 5, Table 6 and Table 7 respectively. They clearly verify that DPM-Solver-v3 achieves consistently better or comparable performance under various settings, especially in 5\sim10 NFEs.

Appendix G Ablations

In this section, we conduct some ablations to further evaluate and analyze the effectiveness of DPM-Solver-v3.

G.1 Varying the Number of Timesteps and Datapoints for the EMS

Table 8: Ablation of the number of timesteps NN and datapoints KK for the EMS, experimented with ScoreSDE [51] on CIFAR10 [24]. We report the FID\downarrow with different numbers of function evaluations (NFE), evaluated on 50k samples.
NN KK NFE
5 6 8 10 12 15 20
1200 512 18.84 7.90 4.49 3.74 3.88 3.52 3.12
1200 1024 15.52 7.55 4.17 3.56 3.37 3.03 2.78
120 4096 13.67 7.60 4.09 3.49 3.24 2.90 2.70
250 4096 13.28 7.56 4.00 3.45 3.22 2.92 2.70
1200 4096 12.76 7.40 3.94 3.40 3.24 2.91 2.71

First, we would like to investigate how the number of timesteps NN and the number of datapoints KK for computing the EMS affect the performance. We conduct experiments with the ScoreSDE model [51] on CIFAR10 [24], decreasing NN and KK from our default choice of N=1200,K=4096N=1200,K=4096.

We list the FID results using the EMS computed with different NN and KK in Table 8. We can observe that the number of datapoints KK is crucial to the performance, while the number of timesteps NN is less significant and mainly affects the performance at 5\sim10 NFEs. When NFE>10, we can decrease NN to as little as 50, which gives even better FIDs. Note that the time cost for computing the EMS is proportional to NKNK, so how to choose NN and KK to balance efficiency and accuracy is worth further study.

G.2 First-Order Comparison

Table 9: Quantitative comparison of first-order solvers (DPM-Solver-v3-1 and DDIM [48]), experimented with ScoreSDE [51] on CIFAR10 [24]. We report the FID\downarrow with different numbers of function evaluations (NFE), evaluated on 50k samples.
Method NFE
5 6 8 10 12 15 20
DDIM [48] 54.56 41.92 27.51 20.11 15.64 12.05 9.00
DPM-Solver-v3-1 39.18 29.82 20.03 14.98 11.86 9.34 7.19
Method NFE = 5 NFE = 10 NFE = 20
DDIM [48] FID 54.56 FID 20.11 FID 9.00
DPM-Solver-v3-1 (Ours) FID 39.18 FID 14.98 FID 7.19
Figure 7: Random samples by first-order solvers (DPM-Solver-v3-1 and DDIM [48]) of ScoreSDE [51] on CIFAR10 dataset [24], using 5, 10 and 20 NFE.

As stated in Appendix A, the first-order case of DPM-Solver-v3 (DPM-Solver-v3-1) is different from DDIM [48], previously the best first-order solver for DPMs. Note that DPM-Solver-v3-1 uses no corrector, since any corrector has an order of at least 2.

In Table 9 and Figure 7, we compare DPM-Solver-v3-1 with DDIM both quantitatively and qualitatively, using the DPM ScoreSDE [51] on CIFAR10 [24]. The results verify our statement that DPM-Solver-v3-1 performs better than DDIM.

G.3 Effects of Pseudo-Order Solver

We now demonstrate the effectiveness of the pseudo-order solver, including the pseudo-order predictor and the pseudo-order corrector.

Pseudo-order predictor The pseudo-order predictor is applied only in one setting (at 5 NFE on CIFAR10 [24]) to achieve the maximum performance improvement. In this setting, without the pseudo-order predictor, the FID degenerates from 12.76 to 15.91 for ScoreSDE [51], and from 12.21 to 12.72 for EDM [21]. While these results are still better than those of previous methods, the pseudo-order predictor further boosts the performance at NFEs as small as 5.

Table 10: Effects of pseudo-order corrector under different settings. We report the FID\downarrow with different numbers of function evaluations (NFE).
Method NFE
5 6 8 10 12 15 20
LSUN-Bedroom [55], Latent-Diffusion [43]
4th-order corrector 8.83 5.28 3.65 3.27 3.17 3.14 3.13
\rightarrowpseudo (default) 7.54 4.79 3.53 3.16 3.06 3.05 3.05
ImageNet-256 [9], Guided-Diffusion [10] (s=2.0s=2.0)
3rd-order corrector 15.87 11.91 9.27 8.37 7.97 7.62 7.47
\rightarrowpseudo (default) 15.10 11.39 8.96 8.27 7.94 7.62 7.39
MS-COCO2014 [26], Stable-Diffusion [43] (s=1.5s=1.5)
3rd-order corrector 0.037 0.028 0.028 0.014 0.0078 0.0024 0.0011
\rightarrowpseudo (default) 0.037 0.027 0.024 0.0065 0.0048 0.0014 0.0022

Pseudo-order corrector We show the comparison between the true and the pseudo-order corrector in Table 10. We can observe a consistent improvement when switching to the pseudo-order corrector. This suggests that when using an nn-th order predictor, it is better to combine it with the pseudo (n+1)(n+1)-th order corrector rather than the true (n+1)(n+1)-th order corrector.

G.4 Effects of Half-Corrector

Table 11: Ablation of half-corrector/full-corrector on MS-COCO2014 [26] prompts with Stable-Diffusion model [43] and guidance scale 7.5. We report the MSE\downarrow of the methods with different numbers of function evaluations (NFE), evaluated on 10k samples.
Method Corrector Usage NFE
5 6 8 10 12 15 20
DPM-Solver++ [32] no corrector 0.60 0.65 0.50 0.46 0.42 0.38 0.30
UniPC [58] full-corrector 0.65 0.71 0.56 0.46 0.43 0.35 0.31
\rightarrowhalf-corrector 0.59 0.66 0.50 0.46 0.41 0.38 0.30
DPM-Solver-v3 full-corrector 0.65 0.67 0.49 0.40 0.47 0.34 0.30
\rightarrowhalf-corrector 0.55 0.64 0.51 0.44 0.45 0.36 0.29

We demonstrate the effects of the half-corrector in Table 11, using the popular Stable-Diffusion model [43]. We can observe that under the relatively large guidance scale of 7.5, which is necessary for producing high-quality samples, the corrector adopted by UniPC [58] has a negative effect on the convergence to the ground-truth samples, making UniPC even worse than DPM-Solver++ [32]. When we employ the half-corrector technique, the problem is partially alleviated. Still, it lags behind our DPM-Solver-v3, since we further incorporate the EMS.

G.5 Singlestep vs. Multistep

Table 12: Quantitative comparison of single-step methods (S) vs. multi-step methods (M), experimented with ScoreSDE [51] on CIFAR10 [24]. We report the FID\downarrow with different numbers of function evaluations (NFE), evaluated on 50k samples.
Method NFE
5 6 8 10 12 15 20 25
DPM-Solver (S) [31] 290.51 23.78 23.51 4.67 4.97 3.34 2.85 2.70
DPM-Solver (M) [32] 27.40 17.85 9.04 6.41 5.31 4.10 3.30 2.98
DPM-Solver++ (S) [32] 51.80 38.54 12.13 6.52 6.36 4.56 3.52 3.09
DPM-Solver++ (M) [32] 28.53 13.48 5.34 4.01 4.04 3.32 2.90 2.76
DPM-Solver-v3 (S) 21.83 16.81 7.93 5.76 5.17 3.99 3.22 2.96
DPM-Solver-v3 (M) 12.76 7.40 3.94 3.40 3.24 2.91 2.71 2.64

As we stated in Section 3.2.2, multistep methods perform better than singlestep methods. To study the relationship between parameterization and these solver types, we develop a singlestep version of DPM-Solver-v3 in Algorithm 3. Note that since the (n+1)(n+1)-th order singlestep solver divides each step into n+1n+1 substeps, the total number of timesteps must be a multiple of n+1n+1. For flexibility, we follow the adaptive third-order strategy of DPM-Solver [31], which first takes third-order steps and then finishes with first-order or second-order steps; one possible split is sketched below.
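The following sketch shows one possible realization of this order-splitting strategy; the exact split used by DPM-Solver may differ slightly, so treat it as an illustrative assumption.

    # Illustrative sketch (assumption): split an NFE budget into singlestep orders,
    # taking third-order steps first and finishing with lower-order steps.
    def split_orders(nfe):
        if nfe % 3 == 0:
            return [3] * (nfe // 3 - 1) + [2, 1]
        elif nfe % 3 == 1:
            return [3] * (nfe // 3) + [1]
        else:
            return [3] * (nfe // 3) + [2]

    # e.g., split_orders(10) -> [3, 3, 3, 1]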

Algorithm 3 (n+1)(n+1)-th order singlestep solver

Require: noise prediction model ϵθ\bm{\epsilon}_{\theta}, noise schedule αt,σt\alpha_{t},\sigma_{t}, coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda}, cache Q1,Q2Q_{1},Q_{2}

Input: timesteps {ti}i=0(n+1)M\{t_{i}\}_{i=0}^{(n+1)M}, initial value 𝒙0\bm{x}_{0}

1:  Q1cache𝒙0Q_{1}\overset{cache}{\leftarrow}\bm{x}_{0}
2:  Q2cacheϵθ(𝒙0,t0)Q_{2}\overset{cache}{\leftarrow}\bm{\epsilon}_{\theta}(\bm{x}_{0},t_{0})
3:  for m=1m=1 to MM do
4:     i0(n+1)(m1)i_{0}\leftarrow(n+1)(m-1)
5:     for i=0i=0 to nn do
6:        𝒙^i0,,𝒙^i0+ifetchQ1\hat{\bm{x}}_{i_{0}},\dots,\hat{\bm{x}}_{i_{0}+i}\overset{fetch}{\leftarrow}Q_{1}
7:        ϵ^i0,,ϵ^i0+ifetchQ2\hat{\bm{\epsilon}}_{i_{0}},\dots,\hat{\bm{\epsilon}}_{i_{0}+i}\overset{fetch}{\leftarrow}Q_{2}
8:        𝒈^leλi0λl𝒔τdτσλlϵ^l𝒍λl𝒙^lαλlλi0λleλi0r𝒔τdτ𝒃rdr,l=i0,,i0+i\displaystyle\hat{\bm{g}}_{l}\leftarrow e^{-\int_{\lambda_{i_{0}}}^{\lambda_{l}}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\sigma_{\lambda_{l}}\hat{\bm{\epsilon}}_{l}-\bm{l}_{\lambda_{l}}\hat{\bm{x}}_{l}}{\alpha_{\lambda_{l}}}-\int_{\lambda_{i_{0}}}^{\lambda_{l}}e^{-\int_{\lambda_{i_{0}}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r,\quad l=i_{0},\dots,i_{0}+i (Eq. (8))
9:        𝒙^i0+i+1LUpdatei+1({ti0+1,𝒈^i0+1},,{ti0+i,𝒈^i0+i},{ti0,𝒙^i0,𝒈^i0},ti0+i+1)\hat{\bm{x}}_{i_{0}+i+1}\leftarrow\text{LUpdate}_{i+1}(\{t_{i_{0}+1},\hat{\bm{g}}_{i_{0}+1}\},\dots,\{t_{i_{0}+i},\hat{\bm{g}}_{i_{0}+i}\},\{t_{i_{0}},\hat{\bm{x}}_{i_{0}},\hat{\bm{g}}_{i_{0}}\},t_{i_{0}+i+1})
10:        if (i+1)m(n+1)M(i+1)m\neq(n+1)M then
11:           ϵ^i0+i+1ϵθ(𝒙^i0+i+1,ti0+i+1)\hat{\bm{\epsilon}}_{i_{0}+i+1}\leftarrow\bm{\epsilon}_{\theta}(\hat{\bm{x}}_{i_{0}+i+1},t_{i_{0}+i+1})
12:           Q1cache𝒙^i0+i+1Q_{1}\overset{cache}{\leftarrow}\hat{\bm{x}}_{i_{0}+i+1}
13:           Q2cacheϵ^i0+i+1Q_{2}\overset{cache}{\leftarrow}\hat{\bm{\epsilon}}_{i_{0}+i+1}
14:        end if
15:     end for
16:  end for

Output: 𝒙^(n+1)M\hat{\bm{x}}_{(n+1)M}

Since UniPC [58] is based on a multistep predictor-corrector framework and has no singlestep version, we only compare with DPM-Solver [31] and DPM-Solver++ [32], which use noise prediction and data prediction, respectively. The results of the ScoreSDE model [51] on CIFAR10 [24] are shown in Table 12, and they demonstrate that the relative performance of different parameterizations differs between singlestep and multistep methods.

Specifically, for singlestep methods, DPM-Solver-v3 (S) outperforms DPM-Solver++ (S) across NFEs, but DPM-Solver (S) is even better than DPM-Solver-v3 (S) when NFE\geq10 (though DPM-Solver (S) has the worst performance when NFE<10), which suggests that noise prediction is the best strategy in such scenarios; for multistep methods, we have DPM-Solver-v3 (M) > DPM-Solver++ (M) > DPM-Solver (M) across NFEs. Moreover, DPM-Solver (S) even outperforms DPM-Solver++ (M) when NFE\geq20; the properties of different parameterizations in singlestep methods are left for future study. Overall, DPM-Solver-v3 (M) achieves the best results among all these methods and performs the most stably across different NFEs.

Appendix H FID/CLIP Score on Stable-Diffusion

Table 13: Sample quality and text-image alignment performance on MS-COCO2014 [26] prompts with Stable-Diffusion model [43] and guidance scale 7.5. We report the FID\downarrow and CLIP score\uparrow of the methods with different numbers of function evaluations (NFE), evaluated on 10k samples.
Method Metric NFE
5 6 8 10 12 15 20
DPM-Solver++ [32] FID 18.87 17.44 16.40 15.93 15.78 15.84 15.72
CLIP score 0.263 0.265 0.265 0.265 0.266 0.265 0.265
UniPC [58] FID 18.77 17.32 16.20 16.15 16.09 16.06 15.94
CLIP score 0.262 0.263 0.265 0.265 0.265 0.265 0.265
DPM-Solver-v3 FID 18.83 16.41 15.41 15.32 15.13 15.30 15.23
CLIP score 0.260 0.262 0.264 0.265 0.265 0.265 0.265

In text-to-image generation, the sample quality is affected not only by the discretization error of the sampling process, but also by the estimation error of the neural network incurred during training, so a low MSE (faster convergence) does not necessarily imply better sample quality. Therefore, we choose the MS-COCO2014 [26] validation set as the reference, and additionally evaluate DPM-Solver-v3 on the Stable-Diffusion model [43] with the standard metrics FID and CLIP score [41], which measure the sample quality and the text-image alignment, respectively. For DPM-Solver-v3, we use the full-corrector strategy when NFE<10, and no corrector when NFE\geq10.

The results in Table 13 show that DPM-Solver-v3 achieves consistently better FID and similar CLIP scores. Notably, we achieve an FID of 15.4 in 8 NFE, close to the reported FID of Stable-Diffusion v1.4.

Still, we claim that FID is not a proper metric for evaluating the convergence of latent-space diffusion models. As stated in DPM-Solver++ and Section 4.1, the FID quickly reaches 15.0\sim16.0 within 10 steps, even before the latent code converges, because of the strong image decoder. Instead, the MSE in the latent space is a direct way to measure the convergence. In terms of MSE, our sampler does converge faster to the ground-truth samples of Stable Diffusion itself.
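As a minimal illustration of this convergence measure (an assumption about one reasonable implementation, not the exact evaluation code), the latent-space MSE can be computed as follows:

    import torch

    # Minimal sketch: latent-space MSE between samples from a fast solver and
    # reference latents produced by a near-converged sampler from the same noise
    # and prompts. Both inputs are tensors of shape (batch, C, H, W).
    def latent_mse(latents_fast, latents_reference):
        return torch.mean((latents_fast - latents_reference) ** 2).item()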

Appendix I More Theoretical Analyses

I.1 Expressive Power of Our Generalized Parameterization

Though the introduced coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} may seem restricted to guaranteeing the optimality of the parameterization formulation itself, we claim that the generalized parameterization 𝒈θ\bm{g}_{\theta} in Eq. (8) can actually cover a wide range of parameterization families of the form 𝝍θ(𝒙λ,λ)=𝜶(λ)ϵθ(𝒙λ,λ)+𝜷(λ)𝒙λ+𝜸(λ)\bm{\psi}_{\theta}(\bm{x}_{\lambda},\lambda)=\bm{\alpha}(\lambda)\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)+\bm{\beta}(\lambda)\bm{x}_{\lambda}+\bm{\gamma}(\lambda). Considering the parameterization on [λs,λt][\lambda_{s},\lambda_{t}] and rearranging the terms, Eq. (8) can be written as

𝒈θ(𝒙λ,λ)=eλsλ𝒔τdτσλαλϵθ(𝒙λ,λ)eλsλ𝒔τdτ𝒍λαλ𝒙λλsλeλsr𝒔τdτ𝒃rdr\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)=e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\sigma_{\lambda}}{\alpha_{\lambda}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)-e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\bm{l}_{\lambda}}{\alpha_{\lambda}}\bm{x}_{\lambda}-\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r (68)

We can compare the coefficients of ϵθ\bm{\epsilon}_{\theta} and 𝒙λ\bm{x}_{\lambda} in 𝒈θ\bm{g}_{\theta} and 𝝍θ\bm{\psi}_{\theta} to figure out how 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} correspond to 𝜶(λ),𝜷(λ),𝜸(λ)\bm{\alpha}(\lambda),\bm{\beta}(\lambda),\bm{\gamma}(\lambda). In fact, we cannot directly let 𝒈θ\bm{g}_{\theta} equal 𝝍θ\bm{\psi}_{\theta}, since when λ=λs\lambda=\lambda_{s}, we have λsλ()=0\int_{\lambda_{s}}^{\lambda}(\cdot)=0, and the coefficient of ϵθ\bm{\epsilon}_{\theta} in 𝒈θ\bm{g}_{\theta} is fixed. Still, 𝝍θ\bm{\psi}_{\theta} can be related to 𝒈θ\bm{g}_{\theta} by a linear transformation, which depends only on λs\lambda_{s} and does not affect our analyses of the discretization error and the solver.

Specifically, assuming 𝝍θ=𝝎λs𝒈θ+𝝃λs\bm{\psi}_{\theta}=\bm{\omega}_{\lambda_{s}}\bm{g}_{\theta}+\bm{\xi}_{\lambda_{s}}, by corresponding the coefficients we have

{𝜶(λ)=𝝎λseλsλ𝒔τdτσλαλ𝜷(λ)=𝝎λseλsλ𝒔τdτ𝒍λαλ𝜸(λ)=𝝎λsλsλeλsr𝒔τdτ𝒃rdr+𝝃λs{𝝎λs=eλs𝜶(λs)𝝃λs=𝜸(λs)𝒍λ=σλ𝜷(λ)𝜶(λ)𝒔λ=1𝜶(λ)𝜶(λ)𝒃λ=eλ𝜸(λ)𝜶(λ)\begin{cases}\bm{\alpha}(\lambda)=\bm{\omega}_{\lambda_{s}}e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\sigma_{\lambda}}{\alpha_{\lambda}}\\ \bm{\beta}(\lambda)=-\bm{\omega}_{\lambda_{s}}e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\bm{l}_{\lambda}}{\alpha_{\lambda}}\\ \bm{\gamma}(\lambda)=-\bm{\omega}_{\lambda_{s}}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r+\bm{\xi}_{\lambda_{s}}\end{cases}\Rightarrow\begin{cases}\bm{\omega}_{\lambda_{s}}=e^{\lambda_{s}}\bm{\alpha}(\lambda_{s})\\ \bm{\xi}_{\lambda_{s}}=\bm{\gamma}(\lambda_{s})\\ \bm{l}_{\lambda}=-\sigma_{\lambda}\frac{\bm{\beta}(\lambda)}{\bm{\alpha}(\lambda)}\\ \bm{s}_{\lambda}=-1-\frac{\bm{\alpha}^{\prime}(\lambda)}{\bm{\alpha}(\lambda)}\\ \bm{b}_{\lambda}=-e^{-\lambda}\frac{\bm{\gamma}^{\prime}(\lambda)}{\bm{\alpha}(\lambda)}\end{cases} (69)

Therefore, as long as 𝜶(λ)0\bm{\alpha}(\lambda)\neq 0 and 𝜶(λ),𝜸(λ)\bm{\alpha}(\lambda),\bm{\gamma}(\lambda) have first-order derivatives, our proposed parameterization 𝒈θ\bm{g}_{\theta} holds the same expressive power as 𝝍θ\bm{\psi}_{\theta}, while at the same time enabling the neat optimality criteria of 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} in Eq. (5) and Eq. (11).

I.2 Justification of Why Minimizing First-order Discretization Error Can Help Higher-order Solver

The EMS 𝒔λ,𝒃λ\bm{s}^{*}_{\lambda},\bm{b}^{*}_{\lambda} in Eq. (11) are designed to minimize the first-order discretization error in Eq. (10). However, higher-order solvers are more frequently adopted in practice (specifically, third-order for unconditional sampling and second-order for conditional sampling), since they incur lower sampling errors by taking higher-order Taylor expansions to approximate the predictor 𝒈θ\bm{g}_{\theta}.

In the following, we show that the EMS can also help higher-order solvers. By Eq. (52) in Appendix B.4, the (n+1)(n+1)-th order local error can be expressed as

𝒙^t𝒙t\displaystyle\hat{\bm{x}}_{t}-\bm{x}_{t} =αt𝑨(λs,λt)λsλt𝑬λs(λ)(𝒈θ(𝒙λ,λ)𝒈θ(𝒙λs,λs))dλ\displaystyle=\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\left(\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\right)\mathrm{d}\lambda (70)
αt𝑨(λs,λt)k=1n(l=1n(𝑹n1)kl(𝒈θ(𝒙λil,λil)𝒈θ(𝒙λs,λs)))λsλt𝑬λs(λ)(λλs)kk!dλ\displaystyle-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\sum_{k=1}^{n}\left(\sum_{l=1}^{n}(\bm{R}_{n}^{-1})_{kl}(\bm{g}_{\theta}(\bm{x}_{\lambda_{i_{l}}},\lambda_{i_{l}})-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s}))\right)\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda

By the Newton-Leibniz formula, this is equivalent to

𝒙^t𝒙t\displaystyle\hat{\bm{x}}_{t}-\bm{x}_{t} =αt𝑨(λs,λt)λsλt𝑬λs(λ)(λsλgθ(1)(xτ,τ)dτ)dλ\displaystyle=\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\left(\int_{\lambda_{s}}^{\lambda}g^{(1)}_{\theta}(x_{\tau},\tau)\mathrm{d}\tau\right)\mathrm{d}\lambda (71)
αt𝑨(λs,λt)k=1n(l=1n(𝑹n1)klλsλilgθ(1)(xλ,λ)dλ)λsλt𝑬λs(λ)(λλs)kk!dλ\displaystyle-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\sum_{k=1}^{n}\left(\sum_{l=1}^{n}(\bm{R}_{n}^{-1})_{kl}\int_{\lambda_{s}}^{\lambda_{i_{l}}}g^{(1)}_{\theta}(x_{\lambda},\lambda)\mathrm{d}\lambda\right)\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda

We assume that the estimated EMS are bounded (in the order of 𝒪(1)\mathcal{O}(1), Assumption B.2 in Appendix B.1), which is empirically confirmed in Section 4.2. By the definition of 𝒈θ\bm{g}_{\theta} in Eq. (8), we have 𝒈θ(1)(𝒙τ,τ)=eλsτ𝒔rdr(𝒇θ(1)(𝒙τ,τ)𝒔τ𝒇θ(𝒙τ,τ)𝒃τ)\bm{g}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)=e^{-\int_{\lambda_{s}}^{\tau}\bm{s}_{r}\mathrm{d}r}\left(\bm{f}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)-\bm{s}_{\tau}\bm{f}_{\theta}(\bm{x}_{\tau},\tau)-\bm{b}_{\tau}\right). Therefore, Eq. (11) controls 𝒈θ(1)2\|\bm{g}_{\theta}^{(1)}\|_{2} and further controls 𝒙^t𝒙t2\|\hat{\bm{x}}_{t}-\bm{x}_{t}\|_{2}, since the other terms depend only on the EMS and are bounded.
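This identity can be verified by directly differentiating the explicit form of 𝒈θ\bm{g}_{\theta} in Eq. (68) with respect to λ\lambda, writing 𝒇θ(𝒙λ,λ)=(σλϵθ(𝒙λ,λ)𝒍λ𝒙λ)/αλ\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)=(\sigma_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{l}_{\lambda}\bm{x}_{\lambda})/\alpha_{\lambda} for the factor shared by the first two terms of Eq. (68):

    \frac{\mathrm{d}\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)}{\mathrm{d}\lambda}=-\bm{s}_{\lambda}e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)+e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\bm{f}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda)-e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{\lambda}=e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\left(\bm{f}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda)-\bm{s}_{\lambda}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{b}_{\lambda}\right),

where the last term before the final equality comes from differentiating the integral in Eq. (68) with respect to its upper limit.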

I.3 The Extra Error of EMS Estimation and Integral Estimation

Analysis of EMS estimation error

In practice, the EMS in Eq. (5) and Eq. (11) are estimated on a finite number of datapoints by the explicit expressions in Eq. (58) and Eq. (63), and may therefore differ from the true 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda}^{*},\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*}. Theoretically, on the one hand, the order and convergence theorems in Section 3.2 are unaffected by the EMS estimation error: the ODE solution in Eq. (9) is correct whatever 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} are, and we only need the assumption that these coefficients are bounded (Assumption B.2 in Appendix B.1) to prove the local and global order; on the other hand, the first-order discretization error in Eq. (10) is sensitive to the EMS estimation error, which affects the performance at a small number of steps. Empirically, to enable fast sampling, we need a sufficient number of datapoints for estimating the EMS (see the ablations in Table 8), and we find that our method is robust to the EMS estimation error given only 1024 datapoints in most cases (see the EMS computing configurations in Appendix D).

Analysis of integral estimation error

Another source of error is the estimation of the integrals (λsλt𝑬λs(λ)𝑩λs(λ)dλ\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda and λsλt𝑬λs(λ)(λλs)kk!dλ\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda in Eq. (14)) by the trapezoidal rule. We can analyze this error with the error bound of the trapezoidal rule: suppose we use a uniform discretization of [a,b][a,b] with spacing hh to estimate abf(x)dx\int_{a}^{b}f(x)\mathrm{d}x; then the error EE satisfies

|E|(ba)h212max|f′′(x)||E|\leq\frac{(b-a)h^{2}}{12}\max|f^{\prime\prime}(x)| (72)

Under Assumption B.2, the EMS and their first-order derivative are bounded. Denote 𝒇1(λ)=𝑬λs(λ)𝑩λs(λ),𝒇2(λ)=𝑬λs(λ)(λλs)kk!\bm{f}_{1}(\lambda)=\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda),\bm{f}_{2}(\lambda)=\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}, then

𝒇1(λ)\displaystyle\bm{f}_{1}(\lambda) =eλsλ(𝒍τ+𝒔τ)dτλsλeλsr𝒔τdτ𝒃rdr\displaystyle=e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r (73)
𝒇1(λ)\displaystyle\bm{f}_{1}^{\prime}(\lambda) =(𝒍λ+𝒔λ)eλsλ(𝒍τ+𝒔τ)dτλsλeλsr𝒔τdτ𝒃rdr+𝒃λeλsλ𝒍τdτλsλeλsr𝒔τdτ𝒃rdr\displaystyle=(\bm{l}_{\lambda}+\bm{s}_{\lambda})e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r+\bm{b}_{\lambda}e^{\int_{\lambda_{s}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r
𝒇1′′(λ)\displaystyle\bm{f}_{1}^{\prime\prime}(\lambda) =(𝒍λ+𝒔λ)eλsλ(𝒍τ+𝒔τ)dτλsλeλsr𝒔τdτ𝒃rdr+(𝒍λ+𝒔λ)2eλsλ(𝒍τ+𝒔τ)dτλsλeλsr𝒔τdτ𝒃rdr\displaystyle=(\bm{l}_{\lambda}^{\prime}+\bm{s}_{\lambda}^{\prime})e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r+(\bm{l}_{\lambda}+\bm{s}_{\lambda})^{2}e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r
+(𝒍λ+𝒔λ)𝒃λeλsλ𝒍τdτλsλeλsr𝒔τdτ𝒃rdr+𝒃λeλsλ𝒍τdτλsλeλsr𝒔τdτ𝒃rdr\displaystyle+(\bm{l}_{\lambda}+\bm{s}_{\lambda})\bm{b}_{\lambda}e^{\int_{\lambda_{s}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r+\bm{b}_{\lambda}^{\prime}e^{\int_{\lambda_{s}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r
+𝒃λ𝒍λeλsλ𝒍τdτλsλeλsr𝒔τdτ𝒃rdr+𝒃λ2eλsλ(𝒍τ𝒔τ)dτλsλeλsr𝒔τdτ𝒃rdr\displaystyle+\bm{b}_{\lambda}\bm{l}_{\lambda}e^{\int_{\lambda_{s}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r+\bm{b}_{\lambda}^{2}e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}-\bm{s}_{\tau})\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r
𝒇2(λ)\displaystyle\bm{f}_{2}(\lambda) =eλsλ(𝒍τ+𝒔τ)dτ(λλs)kk!\displaystyle=e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k}}{k!} (74)
𝒇2(λ)\displaystyle\bm{f}_{2}^{\prime}(\lambda) =(𝒍λ+𝒔λ)eλsλ(𝒍τ+𝒔τ)dτ(λλs)kk!+eλsλ(𝒍τ+𝒔τ)dτ(λλs)k1(k1)!\displaystyle=(\bm{l}_{\lambda}+\bm{s}_{\lambda})e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k}}{k!}+e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k-1}}{(k-1)!}
𝒇2′′(λ)\displaystyle\bm{f}_{2}^{\prime\prime}(\lambda) =(𝒍λ+𝒔λ)eλsλ(𝒍τ+𝒔τ)dτ(λλs)kk!+(𝒍λ+𝒔λ)2eλsλ(𝒍τ+𝒔τ)dτ(λλs)kk!\displaystyle=(\bm{l}_{\lambda}^{\prime}+\bm{s}_{\lambda}^{\prime})e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k}}{k!}+(\bm{l}_{\lambda}+\bm{s}_{\lambda})^{2}e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k}}{k!}
+(𝒍λ+𝒔λ)eλsλ(𝒍τ+𝒔τ)dτ(λλs)k1k1!+(𝒍λ+𝒔λ)eλsλ(𝒍τ+𝒔τ)dτ(λλs)k1(k1)!\displaystyle+(\bm{l}_{\lambda}+\bm{s}_{\lambda})e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k-1}}{{k-1}!}+(\bm{l}_{\lambda}+\bm{s}_{\lambda})e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k-1}}{(k-1)!}
+eλsλ(𝒍τ+𝒔τ)dτ(λλs)k2(k2)!\displaystyle+e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k-2}}{(k-2)!}

Since 𝒍λ,𝒔λ,𝒃λ,𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda},\bm{l}_{\lambda}^{\prime},\bm{s}_{\lambda}^{\prime},\bm{b}_{\lambda}^{\prime} are all 𝒪(1)\mathcal{O}(1), we can conclude that 𝒇1′′(λ)=𝒪(h),𝒇2′′(λ)=𝒪(hk2)\bm{f}_{1}^{\prime\prime}(\lambda)=\mathcal{O}(h),\bm{f}_{2}^{\prime\prime}(\lambda)=\mathcal{O}(h^{k-2}), and the errors of λsλt𝑬λs(λ)𝑩λs(λ)dλ,λsλt𝑬λs(λ)(λλs)kk!dλ\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda,\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda are O(h02h2),O(h02hk1)O(h_{0}^{2}h^{2}),O(h_{0}^{2}h^{k-1}), respectively, where h0h_{0} is the stepsize of the EMS discretization (h0=λtλsnh_{0}=\frac{\lambda_{t}-\lambda_{s}}{n}, nn corresponds to 120\sim1200 timesteps for our EMS computing), and h=λtλsh=\lambda_{t}-\lambda_{s}. Therefore, the extra error of the integral estimation is of higher order and thus negligible.
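To make this concrete, the following minimal sketch (names and array layout are assumptions; the exact precomputation in Appendix C.1.2 may differ) estimates the two integrals above by the trapezoidal rule on the fine EMS grid within one step:

    import torch
    from math import factorial

    # Minimal sketch: trapezoidal estimation of int E_{lambda_s}(lambda) B_{lambda_s}(lambda) dlambda
    # and int E_{lambda_s}(lambda) (lambda - lambda_s)^k / k! dlambda on a fine grid.
    # `lambdas` (sorted, starting at lambda_s) and `l`, `s`, `b` are the discretized EMS.
    def integral_estimates(lambdas, l, s, b, k):
        zero = torch.zeros(1, dtype=lambdas.dtype)
        ls_int = torch.cat([zero, torch.cumulative_trapezoid(l + s, lambdas)])  # int (l+s) dtau
        s_int = torch.cat([zero, torch.cumulative_trapezoid(s, lambdas)])       # int s dtau
        E = torch.exp(ls_int)                                                   # E_{lambda_s}(lambda)
        B = torch.cat([zero, torch.cumulative_trapezoid(torch.exp(-s_int) * b, lambdas)])
        int_EB = torch.trapezoid(E * B, lambdas)
        int_Ek = torch.trapezoid(E * (lambdas - lambdas[0]) ** k / factorial(k), lambdas)
        return int_EB, int_Ek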

Appendix J More Discussions

J.1 Extra Computational and Memory Costs

The extra memory cost of DPM-Solver-v3 is rather small. The extra coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} are discretized and computed at NN timesteps, each with the same dimension DD as the diffused data. The extra memory cost is 𝒪(ND)\mathcal{O}(ND), including the precomputed terms in Appendix C.1.2, and is rather small compared to the pretrained model (e.g., only \sim125M in total for Stable-Diffusion, compared to \sim4G for the model itself).

The pre-computation time for estimating EMS is rather short. The EMS introduced by our method can be effectively estimated on around 1k datapoints within hours (Appendix D), which is rather short compared to the long training/distillation time of other methods. Moreover, the integrals of these extra coefficients are just some vector constants that can be pre-computed within seconds, as shown in Appendix C.1.2. The precomputing is done only once before sampling.

The extra computational overhead of DPM-Solver-v3 during sampling is negligible. Once we obtain the estimated EMS and their integrals at discrete timesteps, they can be regarded as constants. Thus, during the subsequent sampling process, the computational overhead is the same as previous training-free methods (such as DPM-Solver++) with negligible differences (Appendix E).

J.2 Flexibility

The pre-computed EMS can be applied to any timestep schedule during sampling without re-computing the EMS. Besides, we compute the EMS on unconditional models, and it can be used for a wide range of guidance scales (such as cfg=7.5 in Stable Diffusion). In short, the EMS is flexible and easy to adopt in downstream applications.

  • Time schedule: The choice of N,KN,K for the EMS is disentangled from the timestep schedule used in sampling. Once we have estimated the EMS at NN (e.g., 1200) timesteps, they can be flexibly adapted to any schedule (uniform λ\lambda/uniform tt…) during sampling, by mapping the actual sampling timesteps to the NN bins (see the sketch after this list). For a different timestep schedule, we only need to re-precompute Eλs,λt(k)E_{\lambda_{s},\lambda_{t}}^{(k)} in Appendix C.1.2, and the time cost is within seconds.

  • Guided sampling: We compute the EMS on the unconditional model for all guided cases. Empirically, the EMS computed on the model without guidance (the unconditional part) performs more stably than that computed on the model with guidance, and accelerates the sampling procedure across a wide range of guidance scales (including the common guidance scales used in pretrained models). We think that the unconditional model contains common information (image priors) shared by all conditions, such as color, sketch, and other image patterns. Extracting this information helps correct some common biases such as shallow color, low saturation and lack of detail. In contrast, the conditional model depends on the condition and has a large variance.
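The mapping from sampling timesteps to EMS bins mentioned above can be sketched as follows (an assumed nearest-bin implementation; names are placeholders):

    import torch

    # Minimal sketch: map each sampling half-logSNR to its nearest EMS bin, so the
    # precomputed l, s, b (and their integrals) can be reused for any schedule.
    def map_to_ems_bins(lambda_grid, lambda_sampling):
        # lambda_grid: (N,) sorted grid where the EMS were estimated
        # lambda_sampling: (M,) half-logSNRs of the actual sampling timesteps
        idx = torch.searchsorted(lambda_grid, lambda_sampling).clamp(1, len(lambda_grid) - 1)
        left, right = lambda_grid[idx - 1], lambda_grid[idx]
        idx = torch.where(lambda_sampling - left < right - lambda_sampling, idx - 1, idx)
        return idx   # indices into the stored EMS arrays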

Remark J.1.

EMS computed on models without guidance cannot work at extremely large guidance scales (e.g., a cfg scale of 15 for Stable Diffusion), since in this case the condition has a large impact on the denoising process. Note that at these extremely large scales, the sampling quality is very low (compared to a cfg scale of 7.5) and such scales are rarely used in practice. Therefore, our proposed EMS without guidance is sufficient for common applications with common guidance scales.

J.3 Stability

Figure 8: Visualization of the EMS 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} and their integrals w.r.t. λ\lambda, estimated with ScoreSDE [51] on CIFAR10 [24]. 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} fluctuate considerably, but their integrals are smooth enough to ensure the stability of DPM-Solver-v3.

As shown in Section 4.2, the estimated EMS 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} appear highly fluctuating, especially for ScoreSDE on CIFAR10. We would like to clarify that the fluctuating 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} are not an issue, and our sampler is stable:

  • The fluctuation of 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} on ScoreSDE is intrinsic and not due to the estimation error. As we increase the number of samples to decrease the estimation error, the fluctuation is not reduced. We attribute it to the periodicity of trigonometric functions in the positional timestep embedding as stated in Section 4.2.

  • Moreover, we only need to consider the integrals of 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} in the ODE solution Eq. (9). As shown in Figure 8, the integrals of 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} are rather smooth, which ensures the stability of our method. Therefore, there is no need for extra smoothing.

J.4 Practical Value

When NFE is around 20, our improvement in sample quality is small, because all fast samplers based on diffusion ODEs have almost converged. What matters, therefore, is that our method converges faster to good sample quality. As we observe, the less diverse the domain, the more evident the speed-up. For example, on LSUN-Bedroom, our method achieves up to 40% faster convergence.

To sum up, the practical value of DPM-Solver-v3 lies in the following aspects:

  1. 15\sim30% speed-up can save substantial costs for online text-to-image applications. Our speed-ups are applicable to large text-to-image diffusion models, which are an important part of today’s AIGC community. As recent models become larger, a single NFE requires more computational resources, and a 15\sim30% speed-up can considerably reduce the expenses of commercial usage.

  2. Improvement at 5\sim10 NFEs benefits image previewing. Since the samples are controlled by the random seed, coarse samples with 5\sim10 NFEs can be used to preview thousands of samples at low cost and to guide the choice of the random seed, which can then be used to generate fine samples with best-quality sampling strategies. This is especially useful for text-to-image generation. Since our method achieves better quality and converges faster to the ground-truth sample, it provides better guidance when used for previewing.

Appendix K Additional Samples

We provide more visual samples in Figure 9, Figure 10, Figure 11 and Table 14 to demonstrate the qualitative effectiveness of DPM-Solver-v3. It can be seen that the samples of DPM-Solver-v3 have higher visual quality than those of previous state-of-the-art solvers. Our method generates images with reduced bias (less “shallow” in color), a higher saturation level and more visual details, as mentioned in Section 4.3.

Method NFE = 5 NFE = 10
DPM-Solver++ [32] FID 28.53 FID 4.01
UniPC [58] FID 23.71 FID 3.93
DPM-Solver-v3 (Ours) FID 12.76 FID 3.40
Figure 9: Random samples of ScoreSDE [51] on CIFAR10 dataset [24] with only 5 and 10 NFE.
Method NFE = 5 NFE = 10
Heun’s 2nd [21] FID 320.80 FID 16.57
DPM-Solver++ [32] FID 24.54 FID 2.91
UniPC [58] FID 23.52 FID 2.85
DPM-Solver-v3 (Ours) FID 12.21 FID 2.51
Figure 10: Random samples of EDM [21] on CIFAR10 dataset [24] with only 5 and 10 NFE.
DPM-Solver++ [32] (FID 11.02)
UniPC [58] (FID 10.19)
DPM-Solver-v3 (Ours) (FID 9.70)
Figure 11: Random samples of Guided-Diffusion [10] on ImageNet-256 dataset [9] with a classifier guidance scale 2.0, using only 7 NFE. We manually remove the potentially disturbing images such as those containing snakes or insects.
Table 14: Additional samples of Stable-Diffusion [43] with a classifier-free guidance scale 7.5, using only 5 NFE and selected text prompts. Some displayed prompts are truncated.
Text Prompts DPM-Solver++ [32] (MSE 0.60) UniPC [58] (MSE 0.65) DPM-Solver-v3 (Ours) (MSE 0.55)
“pixar movie still portrait photo of madison beer, jessica alba, woman, as hero catgirl cyborg woman by pixar, by greg rutkowski, wlop, rossdraws, artgerm, weta, marvel, rave girl, leeloo, unreal engine, glossy skin, pearlescent, wet, bright morning, anime, sci-fi, maxim magazine cover” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“oil painting with heavy impasto of a pirate ship and its captain, cosmic horror painting, elegant intricate artstation concept art by craig mullins detailed” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“environment living room interior, mid century modern, indoor garden with fountain, retro, m vintage, designer furniture made of wood and plastic, concrete table, wood walls, indoor potted tree, large window, outdoor forest landscape, beautiful sunset, cinematic, concept art, sunstainable architecture, octane render, utopia, ethereal, cinematic light” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“the living room of a cozy wooden house with a fireplace, at night, interior design, concept art, wallpaper, warm, digital art. art by james gurney and larry elmore.” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“Full page concept design how to craft life Poison, intricate details, infographic of alchemical, diagram of how to make potions, captions, directions, ingredients, drawing, magic, wuxia” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“Fantasy art, octane render, 16k, 8k, cinema 4d, back-lit, caustics, clean environment, Wood pavilion architecture, warm led lighting, dusk, Landscape, snow, arctic, with aqua water, silver Guggenheim museum spire, with rays of sunshine, white fabric landscape, tall building, zaha hadid and Santiago calatrava, smooth landscape, cracked ice, igloo, warm lighting, aurora borialis, 3d cgi, high definition, natural lighting, realistic, hyper realism” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“tree house in the forest, atmospheric, hyper realistic, epic composition, cinematic, landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild, 4k detailed post processing, artstation, unreal engine” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“A trail through the unknown, atmospheric, hyper realistic, 8k, epic composition, cinematic, octane render, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution, Landscape veduta photo by Dustin Lefevre & tdraw, 8k resolution, detailed landscape painting by Ivan Shishkin, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild, 4k detailed post processing, artstation, rendering by octane, unreal engine” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“postapocalyptic city turned to fractal glass, ctane render, 8 k, exploration, cinematic, trending on artstation, by beeple, realistic, 3 5 mm camera, unreal engine, hyper detailed, photo–realistic maximum detai, volumetric light, moody cinematic epic concept art, realistic matte painting, hyper photorealistic, concept art, cinematic epic, octane render, 8k, corona render, movie concept art, octane render, 8 k, corona render, trending on artstation, cinematic composition, ultra–detailed, hyper–realistic, volumetric lighting” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
““WORLDS”: zoological fantasy ecosystem infographics, magazine layout with typography, annotations, in the style of Elena Masci, Studio Ghibli, Caspar David Friedrich, Daniel Merriam, Doug Chiang, Ivan Aivazovsky, Herbert Bauer, Edward Tufte, David McCandless” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]