
DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics

Kaiwen Zheng∗†1, Cheng Lu∗1, Jianfei Chen1, Jun Zhu‡123
1Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, THBI Lab
1Tsinghua-Bosch Joint ML Center, Tsinghua University, Beijing, China
2Shengshu Technology, Beijing  3Pazhou Lab (Huangpu), Guangzhou, China
{zkwthu,lucheng.lc15}@gmail.com; {jianfeic, dcszj}@tsinghua.edu.cn
Abstract

Diffusion probabilistic models (DPMs) have exhibited excellent performance for high-fidelity image generation while suffering from inefficient sampling. Recent works accelerate the sampling procedure by proposing fast ODE solvers that leverage the specific ODE form of DPMs. However, they rely heavily on a specific parameterization during inference (such as noise/data prediction), which might not be the optimal choice. In this work, we propose a novel formulation towards the optimal parameterization during sampling that minimizes the first-order discretization error of the ODE solution. Based on such formulation, we propose DPM-Solver-v3, a new fast ODE solver for DPMs by introducing several coefficients efficiently computed on the pretrained model, which we call empirical model statistics. We further incorporate multistep methods and a predictor-corrector framework, and propose some techniques for improving sample quality at small numbers of function evaluations (NFE) or large guidance scales. Experiments show that DPM-Solver-v3 achieves consistently better or comparable performance in both unconditional and conditional sampling with both pixel-space and latent-space DPMs, especially in 5$\sim$10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable Diffusion, bringing a speed-up of 15%$\sim$30% compared to previous state-of-the-art training-free methods. Code is available at https://github.com/thu-ml/DPM-Solver-v3.

∗Equal contribution. †Work done during an internship at Shengshu Technology. ‡Corresponding author.

1 Introduction

Diffusion probabilistic models (DPMs) [47, 15, 51] are a class of state-of-the-art image generators. By training with a strong encoder, a large model, and massive data as well as leveraging techniques such as guided sampling, DPMs are capable of generating high-resolution photorealistic and artistic images on text-to-image tasks.

Refer to caption

(a) DPM-Solver++ [32] (MSE 0.60)

Refer to caption

(b) UniPC [58] (MSE 0.65)

Refer to caption

(c) DPM-Solver-v3 (Ours, MSE 0.55)

Figure 1: Random samples from Stable-Diffusion [43] with a classifier-free guidance scale of 7.5, using only 5 function evaluations (NFE) and the text prompt “A beautiful castle beside a waterfall in the woods, by Josef Thoma, matte painting, trending on artstation HQ”.

However, to generate high-quality visual content, DPMs usually require hundreds of model evaluations to gradually remove noise using a pretrained model [15], which is much more time-consuming compared to other deep generative models such as generative adversarial networks (GANs) [13]. The sampling overhead of DPMs emerges as a crucial obstacle hindering their integration into downstream tasks.

To accelerate the sampling process of DPMs, one can employ training-based methods [37, 53, 45] or training-free samplers [48, 51, 28, 3, 52, 56]. We focus on the latter approach since it requires no extra training and is more flexible. Recent advanced training-free samplers [56, 31, 32, 58] mainly rely on the ODE form of DPMs, since its absence of stochasticity is essential for high-quality samples in around 20 steps. Moreover, ODE solvers built on exponential integrators [18] converge faster. To change the original diffusion ODE into the form of exponential integrators, they need to cancel its linear term and obtain an ODE solution in which only the noise predictor needs to be approximated, while the other terms can be exactly computed. Besides, Lu et al. [32] find that it is better to use another ODE solution in which the data predictor, instead, needs to be approximated. How to choose such a model parameterization (e.g., noise/data prediction) in the sampling of DPMs remains largely unexplored.

In this work, we systematically study the problem of model parameterization and ODE formulation for fast sampling of DPMs. We first show that by introducing three types of coefficients, the original ODE solution can be reformulated to an equivalent one that contains a new model parameterization. Besides, inspired by exponential Rosenbrock-type methods [19] and first-order discretization error analysis, we also show that there exists an optimal set of coefficients efficiently computed on the pretrained model, which we call empirical model statistics (EMS). Building upon our novel ODE formulation, we further propose a new high-order solver for diffusion ODEs named DPM-Solver-v3, which includes a multistep predictor-corrector framework of any order, as well as some novel techniques such as a pseudo high-order method to boost the performance at extremely few steps or large guidance scales.

We conduct extensive experiments with both pixel-space and latent-space DPMs to verify the effectiveness of DPM-Solver-v3. Compared to previous fast samplers, DPM-Solver-v3 can consistently improve the sample quality in 5\sim20 steps, and make a significant advancement within 10 steps.

2 Background

2.1 Diffusion Probabilistic Models

Suppose we have a $D$-dimensional data distribution $q_0(\bm{x}_0)$. Diffusion probabilistic models (DPMs) [47, 15, 51] define a forward diffusion process $\{q_t\}_{t=0}^{T}$ to gradually degenerate the data $\bm{x}_0\sim q_0(\bm{x}_0)$ with Gaussian noise, which satisfies the transition kernel $q_{0t}(\bm{x}_t|\bm{x}_0)=\mathcal{N}(\alpha_t\bm{x}_0,\sigma_t^2\bm{I})$, such that $q_T(\bm{x}_T)$ is approximately pure Gaussian. $\alpha_t,\sigma_t$ are smooth scalar functions of $t$, which are called the noise schedule. The transition can be easily applied by $\bm{x}_t=\alpha_t\bm{x}_0+\sigma_t\bm{\epsilon}$, $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I})$. To train DPMs, a neural network $\bm{\epsilon}_\theta(\bm{x}_t,t)$ is usually parameterized to predict the noise $\bm{\epsilon}$ by minimizing $\mathbb{E}_{\bm{x}_0\sim q_0(\bm{x}_0),\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I}),t\sim\mathcal{U}(0,T)}\left[w(t)\|\bm{\epsilon}_\theta(\bm{x}_t,t)-\bm{\epsilon}\|_2^2\right]$, where $w(t)$ is a weighting function. Sampling of DPMs can be performed by solving the diffusion ODE [51] from time $T$ to time $0$:

\frac{\mathrm{d}\bm{x}_{t}}{\mathrm{d}t}=f(t)\bm{x}_{t}+\frac{g^{2}(t)}{2\sigma_{t}}\bm{\epsilon}_{\theta}(\bm{x}_{t},t), \qquad (1)

where $f(t)=\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}$, $g^2(t)=\frac{\mathrm{d}\sigma_t^2}{\mathrm{d}t}-2\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}\sigma_t^2$ [23]. In addition, conditional sampling by DPMs can be conducted by guided sampling [10, 16] with a conditional noise predictor $\bm{\epsilon}_\theta(\bm{x}_t,t,c)$, where $c$ is the condition. Specifically, classifier-free guidance [16] combines the unconditional/conditional model and obtains a new noise predictor $\bm{\epsilon}_\theta'(\bm{x}_t,t,c)\coloneqq s\bm{\epsilon}_\theta(\bm{x}_t,t,c)+(1-s)\bm{\epsilon}_\theta(\bm{x}_t,t,\emptyset)$, where $\emptyset$ is a special condition standing for the unconditional case and $s>0$ is the guidance scale that controls the trade-off between image-condition alignment and diversity; classifier guidance [10] instead uses an extra classifier $p_\phi(c|\bm{x}_t,t)$ to obtain a new noise predictor $\bm{\epsilon}_\theta'(\bm{x}_t,t,c)\coloneqq\bm{\epsilon}_\theta(\bm{x}_t,t)-s\sigma_t\nabla_{\bm{x}}\log p_\phi(c|\bm{x}_t,t)$.
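
For concreteness, the classifier-free guided predictor above amounts to a single weighted combination of two model calls; below is a minimal sketch (the callable `eps_model` and the `null_cond` token are illustrative placeholders, not the API of any particular codebase).

```python
def guided_noise_prediction(eps_model, x_t, t, cond, null_cond, s=7.5):
    """Classifier-free guidance: eps'(x, t, c) = s * eps(x, t, c) + (1 - s) * eps(x, t, null)."""
    eps_cond = eps_model(x_t, t, cond)         # conditional prediction
    eps_uncond = eps_model(x_t, t, null_cond)  # unconditional prediction
    return s * eps_cond + (1.0 - s) * eps_uncond
```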

In addition to noise prediction, DPMs can be parameterized as a score predictor $\bm{s}_\theta(\bm{x}_t,t)$ to predict $\nabla_{\bm{x}}\log q_t(\bm{x}_t)$, or a data predictor $\bm{x}_\theta(\bm{x}_t,t)$ to predict $\bm{x}_0$. Under the variance-preserving (VP) noise schedule, which satisfies $\alpha_t^2+\sigma_t^2=1$ [51], the “v” predictor $\bm{v}_\theta(\bm{x}_t,t)$ is also proposed to predict $\alpha_t\bm{\epsilon}-\sigma_t\bm{x}_0$ [45]. These different parameterizations are theoretically equivalent, but have an impact on the empirical performance when used in training [23, 59].
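
Since these parameterizations are related through the identity $\bm{x}_t=\alpha_t\bm{x}_0+\sigma_t\bm{\epsilon}$, converting between them is a one-line operation; a minimal sketch (argument names are illustrative):

```python
def eps_to_data(eps, x_t, alpha_t, sigma_t):
    """Data prediction x_0 recovered from the noise prediction."""
    return (x_t - sigma_t * eps) / alpha_t

def eps_to_score(eps, sigma_t):
    """Score prediction: the score of q_t equals -eps / sigma_t."""
    return -eps / sigma_t

def to_v(eps, x0, alpha_t, sigma_t):
    """"v" prediction under a VP schedule (alpha_t**2 + sigma_t**2 = 1)."""
    return alpha_t * eps - sigma_t * x0
```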

2.2 Fast Sampling of DPMs with Exponential Integrators

Among the methods for solving the diffusion ODE (1), recent works [56, 31, 32, 58] find that ODE solvers based on exponential integrators [18] are more efficient and robust at a small number of function evaluations (<50). Specifically, an insightful observation by Lu et al. [31] is that, by a change of variable from $t$ to $\lambda_t\coloneqq\log(\alpha_t/\sigma_t)$ (half of the log-SNR), the diffusion ODE is transformed to

\frac{\mathrm{d}\bm{x}_{\lambda}}{\mathrm{d}\lambda}=\frac{\dot{\alpha}_{\lambda}}{\alpha_{\lambda}}\bm{x}_{\lambda}-\sigma_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda), \qquad (2)

where $\dot{\alpha}_\lambda\coloneqq\frac{\mathrm{d}\alpha_\lambda}{\mathrm{d}\lambda}$. By utilizing the semi-linear structure of the diffusion ODE and exactly computing the linear term [56, 31], we can obtain the ODE solution as Eq. (3) (left). Such exact computation of the linear part reduces the discretization errors [31]. Moreover, by leveraging the equivalence of different parameterizations, DPM-Solver++ [32] rewrites Eq. (2) with the data predictor $\bm{x}_\theta(\bm{x}_\lambda,\lambda)\coloneqq(\bm{x}_\lambda-\sigma_\lambda\bm{\epsilon}_\theta(\bm{x}_\lambda,\lambda))/\alpha_\lambda$ and obtains another ODE solution as Eq. (3) (right). Such a solution does not need to change the pretrained noise prediction model $\bm{\epsilon}_\theta$ during the sampling process, and empirically outperforms previous samplers based on $\bm{\epsilon}_\theta$ [31].

\frac{\bm{x}_{t}}{\alpha_{t}}=\frac{\bm{x}_{s}}{\alpha_{s}}-\int_{\lambda_{s}}^{\lambda_{t}}e^{-\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\mathrm{d}\lambda,\quad\frac{\bm{x}_{t}}{\sigma_{t}}=\frac{\bm{x}_{s}}{\sigma_{s}}+\int_{\lambda_{s}}^{\lambda_{t}}e^{\lambda}\bm{x}_{\theta}(\bm{x}_{\lambda},\lambda)\mathrm{d}\lambda \qquad (3)
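
As an example of how Eq. (3) is used, freezing $\bm{\epsilon}_\theta$ at $\lambda_s$ in the left integral gives the familiar first-order (DDIM-type) update; a minimal sketch under this assumption (argument names are illustrative):

```python
import torch

def first_order_step(x_s, eps_s, alpha_s, alpha_t, sigma_t, h):
    """One first-order step of Eq. (3) (left), with eps_theta held constant at lambda_s.

    h = lambda_t - lambda_s is the log-SNR step; eps_s = eps_theta(x_s, lambda_s)
    is the single model evaluation of this step.
    """
    return (alpha_t / alpha_s) * x_s - sigma_t * torch.expm1(h) * eps_s
```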

However, to the best of our knowledge, the parameterizations for sampling are still manually selected and limited to noise/data prediction, and they have not been systematically studied.

3 Method

We now present our method. We start with a new formulation of the ODE solution with extra coefficients, followed by our high-order solver and some practical considerations. In the following discussions, we assume all products between vectors are element-wise, and $\bm{f}^{(k)}(\bm{x}_\lambda,\lambda)=\frac{\mathrm{d}^k\bm{f}(\bm{x}_\lambda,\lambda)}{\mathrm{d}\lambda^k}$ is the $k$-th order total derivative of any function $\bm{f}$ w.r.t. $\lambda$.

3.1 Improved Formulation of Exact Solutions of Diffusion ODEs

As mentioned in Sec. 2.2, it is promising to explore the semi-linear structure of diffusion ODEs for fast sampling [56, 31, 32]. Firstly, we reveal one key insight that we can choose the linear part according to Rosenbrock-type exponential integrators [19, 18]. To this end, we consider a general form of diffusion ODEs by rewriting Eq. (2) as

\frac{\mathrm{d}\bm{x}_{\lambda}}{\mathrm{d}\lambda}=\underbrace{\left(\frac{\dot{\alpha}_{\lambda}}{\alpha_{\lambda}}-\bm{l}_{\lambda}\right)\bm{x}_{\lambda}}_{\text{linear part}}-\underbrace{(\sigma_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{l}_{\lambda}\bm{x}_{\lambda})}_{\text{non-linear part}\,\coloneqq\,\bm{N}_{\theta}(\bm{x}_{\lambda},\lambda)}, \qquad (4)

where $\bm{l}_\lambda$ is a $D$-dimensional undetermined coefficient depending on $\lambda$. We choose $\bm{l}_\lambda$ to restrict the Frobenius norm of the gradient of the non-linear part w.r.t. $\bm{x}$:

\bm{l}^{*}_{\lambda}=\operatornamewithlimits{argmin}_{\bm{l}_{\lambda}}\mathbb{E}_{p^{\theta}_{\lambda}(\bm{x}_{\lambda})}\|\nabla_{\bm{x}}\bm{N}_{\theta}(\bm{x}_{\lambda},\lambda)\|_{F}^{2}, \qquad (5)

where $p^\theta_\lambda$ is the distribution of samples on the ground-truth ODE trajectories at $\lambda$ (i.e., the model distribution). Intuitively, this choice makes $\bm{N}_\theta$ insensitive to errors in $\bm{x}$ and cancels all the “linearity” of $\bm{N}_\theta$. With $\bm{l}_\lambda=\bm{l}_\lambda^*$, by the “variation-of-constants” formula [1], starting from $\bm{x}_{\lambda_s}$ at time $s$, the exact solution of Eq. (4) at time $t$ is

\bm{x}_{\lambda_{t}}=\alpha_{\lambda_{t}}e^{-\int_{\lambda_{s}}^{\lambda_{t}}\bm{l}_{\lambda}\mathrm{d}\lambda}\bigg(\frac{\bm{x}_{\lambda_{s}}}{\alpha_{\lambda_{s}}}-\int_{\lambda_{s}}^{\lambda_{t}}e^{\int_{\lambda_{s}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau}\underbrace{\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)}_{\text{approximated}}\mathrm{d}\lambda\bigg), \qquad (6)

where $\bm{f}_\theta(\bm{x}_\lambda,\lambda)\coloneqq\frac{\bm{N}_\theta(\bm{x}_\lambda,\lambda)}{\alpha_\lambda}$. To calculate the solution in Eq. (6), we need to approximate $\bm{f}_\theta$ for each $\lambda\in[\lambda_s,\lambda_t]$ by certain polynomials [31, 32].

Secondly, we reveal another key insight that we can choose different functions to be approximated instead of $\bm{f}_\theta$ and further reduce the discretization error, which is related to the total derivatives of the approximated function. To this end, we consider a scaled version of $\bm{f}_\theta$, i.e., $\bm{h}_\theta(\bm{x}_\lambda,\lambda)\coloneqq e^{-\int_{\lambda_s}^{\lambda}\bm{s}_\tau\mathrm{d}\tau}\bm{f}_\theta(\bm{x}_\lambda,\lambda)$, where $\bm{s}_\lambda$ is a $D$-dimensional coefficient dependent on $\lambda$, and then Eq. (6) becomes

\bm{x}_{\lambda_{t}}=\alpha_{\lambda_{t}}e^{-\int_{\lambda_{s}}^{\lambda_{t}}\bm{l}_{\lambda}\mathrm{d}\lambda}\bigg(\frac{\bm{x}_{\lambda_{s}}}{\alpha_{\lambda_{s}}}-\int_{\lambda_{s}}^{\lambda_{t}}e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\underbrace{\bm{h}_{\theta}(\bm{x}_{\lambda},\lambda)}_{\text{approximated}}\mathrm{d}\lambda\bigg). \qquad (7)

Comparing with Eq. (6), we change the approximated function from $\bm{f}_\theta$ to $\bm{h}_\theta$ by using an additional scaling term related to $\bm{s}_\lambda$. As we shall see, the first-order discretization error is positively related to the norm of the first-order derivative $\bm{h}_\theta^{(1)}=e^{-\int_{\lambda_s}^{\lambda}\bm{s}_\tau\mathrm{d}\tau}(\bm{f}^{(1)}_\theta-\bm{s}_\lambda\bm{f}_\theta)$. Thus, we aim to minimize $\|\bm{f}^{(1)}_\theta-\bm{s}_\lambda\bm{f}_\theta\|_2$, in order to reduce $\|\bm{h}_\theta^{(1)}\|_2$ and the discretization error. As $\bm{f}_\theta$ is a fixed function depending on the pretrained model, this minimization problem essentially finds a linear function of $\bm{f}_\theta$ to approximate $\bm{f}^{(1)}_\theta$. To achieve better linear approximation, we further introduce a bias term $\bm{b}_\lambda\in\mathbb{R}^D$ and construct a function $\bm{g}_\theta$ satisfying $\bm{g}_\theta^{(1)}=e^{-\int_{\lambda_s}^{\lambda}\bm{s}_\tau\mathrm{d}\tau}(\bm{f}^{(1)}_\theta-\bm{s}_\lambda\bm{f}_\theta-\bm{b}_\lambda)$, which gives

\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)\coloneqq e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)-\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r. \qquad (8)

With $\bm{g}_\theta$, Eq. (7) becomes

\bm{x}_{\lambda_{t}}=\alpha_{\lambda_{t}}\underbrace{e^{-\int_{\lambda_{s}}^{\lambda_{t}}\bm{l}_{\lambda}\mathrm{d}\lambda}}_{\text{linear coefficient}}\bigg(\frac{\bm{x}_{\lambda_{s}}}{\alpha_{\lambda_{s}}}-\int_{\lambda_{s}}^{\lambda_{t}}\underbrace{e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}}_{\text{scaling coefficient}}\Big(\underbrace{\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)}_{\text{approximated}}+\underbrace{\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r}_{\text{bias coefficient}}\Big)\mathrm{d}\lambda\bigg). \qquad (9)

Such formulation is equivalent to Eq. (3), but introduces three types of coefficients and a new parameterization $\bm{g}_\theta$. We show in Appendix I.1 that the generalized parameterization $\bm{g}_\theta$ in Eq. (8) can cover a wide range of parameterization families in the form of $\bm{\psi}_\theta(\bm{x}_\lambda,\lambda)=\bm{\alpha}(\lambda)\bm{\epsilon}_\theta(\bm{x}_\lambda,\lambda)+\bm{\beta}(\lambda)\bm{x}_\lambda+\bm{\gamma}(\lambda)$. We aim to reduce the discretization error by finding better coefficients than previous works [31, 32].
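
For example, within this family, noise prediction corresponds to $\bm{\alpha}(\lambda)=1$, $\bm{\beta}(\lambda)=\bm{\gamma}(\lambda)=0$, while data prediction follows directly from the definition $\bm{x}_\theta=(\bm{x}_\lambda-\sigma_\lambda\bm{\epsilon}_\theta)/\alpha_\lambda$ (the full derivation of the general case is in Appendix I.1):

\bm{x}_{\theta}(\bm{x}_{\lambda},\lambda)=\frac{1}{\alpha_{\lambda}}\bm{x}_{\lambda}-\frac{\sigma_{\lambda}}{\alpha_{\lambda}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda),\quad\text{i.e.,}\quad\bm{\alpha}(\lambda)=-\frac{\sigma_{\lambda}}{\alpha_{\lambda}},\ \bm{\beta}(\lambda)=\frac{1}{\alpha_{\lambda}},\ \bm{\gamma}(\lambda)=0.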

Now we derive the concrete formula for analyzing the first-order discretization error. By replacing $\bm{g}_\theta(\bm{x}_\lambda,\lambda)$ with $\bm{g}_\theta(\bm{x}_{\lambda_s},\lambda_s)$ in Eq. (9), we obtain the first-order approximation $\hat{\bm{x}}_{\lambda_t}=\alpha_{\lambda_t}e^{-\int_{\lambda_s}^{\lambda_t}\bm{l}_\lambda\mathrm{d}\lambda}\left(\frac{\bm{x}_{\lambda_s}}{\alpha_{\lambda_s}}-\int_{\lambda_s}^{\lambda_t}e^{\int_{\lambda_s}^{\lambda}(\bm{l}_\tau+\bm{s}_\tau)\mathrm{d}\tau}\left(\bm{g}_\theta(\bm{x}_{\lambda_s},\lambda_s)+\int_{\lambda_s}^{\lambda}e^{-\int_{\lambda_s}^{r}\bm{s}_\tau\mathrm{d}\tau}\bm{b}_r\mathrm{d}r\right)\mathrm{d}\lambda\right)$. As $\bm{g}_\theta(\bm{x}_{\lambda_s},\lambda_s)=\bm{g}_\theta(\bm{x}_\lambda,\lambda)+(\lambda_s-\lambda)\bm{g}_\theta^{(1)}(\bm{x}_\lambda,\lambda)+\mathcal{O}((\lambda-\lambda_s)^2)$ by Taylor expansion, it follows that the first-order discretization error can be expressed as

\hat{\bm{x}}_{\lambda_{t}}-\bm{x}_{\lambda_{t}}=\alpha_{\lambda_{t}}e^{-\int_{\lambda_{s}}^{\lambda_{t}}\bm{l}_{\lambda}\mathrm{d}\lambda}\bigg(\int_{\lambda_{s}}^{\lambda_{t}}e^{\int_{\lambda_{s}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau}(\lambda-\lambda_{s})\Big(\bm{f}^{(1)}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{s}_{\lambda}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{b}_{\lambda}\Big)\mathrm{d}\lambda\bigg)+\mathcal{O}(h^{3}), \qquad (10)

where $h=\lambda_t-\lambda_s$. Thus, given the optimal $\bm{l}_\lambda=\bm{l}_\lambda^*$ in Eq. (5), the discretization error $\hat{\bm{x}}_{\lambda_t}-\bm{x}_{\lambda_t}$ mainly depends on $\bm{f}^{(1)}_\theta-\bm{s}_\lambda\bm{f}_\theta-\bm{b}_\lambda$. Based on this insight, we choose the coefficients $\bm{s}_\lambda,\bm{b}_\lambda$ by solving

\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*}=\operatornamewithlimits{argmin}_{\bm{s}_{\lambda},\bm{b}_{\lambda}}\mathbb{E}_{p^{\theta}_{\lambda}(\bm{x}_{\lambda})}\left[\left\|\bm{f}^{(1)}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{s}_{\lambda}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{b}_{\lambda}\right\|^{2}_{2}\right]. \qquad (11)

For any $\lambda$, $\bm{l}_\lambda^*,\bm{s}_\lambda^*,\bm{b}_\lambda^*$ all have analytic solutions involving the Jacobian-vector-product of the pretrained model $\bm{\epsilon}_\theta$, and they can be unbiasedly evaluated on a few datapoints $\{\bm{x}_\lambda^{(n)}\}_K\sim p_\lambda^\theta(\bm{x}_\lambda)$ via Monte Carlo estimation (detailed in Section 3.4 and Appendix C.1.1). Therefore, we call $\bm{l}_\lambda,\bm{s}_\lambda,\bm{b}_\lambda$ empirical model statistics (EMS). In the following sections, we'll show that by approximating $\bm{g}_\theta$ with Taylor expansion, we can develop our high-order solver for diffusion ODEs.
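
Intuitively, Eq. (11) is a dimension-wise least-squares problem: given Monte Carlo samples of $\bm{f}_\theta$ and $\bm{f}^{(1)}_\theta$ (the latter requiring Jacobian-vector products, see Appendix C.1.1), $\bm{s}_\lambda^*$ and $\bm{b}_\lambda^*$ are the slope and intercept of a per-dimension linear regression. A sketch of that regression step only (function and argument names are illustrative):

```python
import torch

def fit_s_b(f_samples, f1_samples, eps=1e-8):
    """Dimension-wise least squares for Eq. (11).

    f_samples, f1_samples: tensors of shape (K, D) holding f_theta and its total
    derivative f_theta^(1) evaluated on K datapoints x_lambda ~ p_lambda^theta.
    Returns (s, b), each of shape (D,).
    """
    f_mean, f1_mean = f_samples.mean(0), f1_samples.mean(0)
    cov = ((f_samples - f_mean) * (f1_samples - f1_mean)).mean(0)
    var = ((f_samples - f_mean) ** 2).mean(0)
    s = cov / (var + eps)      # regression slope per dimension
    b = f1_mean - s * f_mean   # regression intercept per dimension
    return s, b
```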

3.2 Developing High-Order Solver

Refer to caption

(a) Local Approximation

Refer to caption

(b) Multistep Predictor-Corrector

Figure 2: Illustration of our high-order solver. (a) $(n+1)$-th order local approximation from time $s$ to time $t$, provided $n$ extra function values of $\bm{g}_\theta$. (b) Multistep predictor-corrector procedure as our global solver. A combination of second-order predictor and second-order corrector is shown. $a_{\cdot,i-1},b_{\cdot,i-1},c_{\cdot,i-1}$ are abbreviations of coefficients in Eq. (8).

In this section, we propose our high-order solver for diffusion ODEs with local accuracy and global convergence order guarantee by leveraging our proposed solution formulation in Eq. (9). The proposed solver and analyses are highly motivated by the methods of exponential integrators [17, 18] in the ODE literature and their previous applications in the field of diffusion models [56, 31, 32, 58]. Though the EMS are designed to minimize the first-order error, they can also help our high-order solver (see Appendix I.2).

For simplicity, denote $\bm{A}(\lambda_s,\lambda_t)\coloneqq e^{-\int_{\lambda_s}^{\lambda_t}\bm{l}_\tau\mathrm{d}\tau}$, $\bm{E}_{\lambda_s}(\lambda)\coloneqq e^{\int_{\lambda_s}^{\lambda}(\bm{l}_\tau+\bm{s}_\tau)\mathrm{d}\tau}$, $\bm{B}_{\lambda_s}(\lambda)\coloneqq\int_{\lambda_s}^{\lambda}e^{-\int_{\lambda_s}^{r}\bm{s}_\tau\mathrm{d}\tau}\bm{b}_r\mathrm{d}r$. Though high-order ODE solvers essentially share the same mathematical principles, since we utilize a more complicated parameterization $\bm{g}_\theta$ and ODE formulation in Eq. (9) than previous works [56, 31, 32, 58], we divide the development of the high-order solver into simplified local and global parts, which are not only easier to understand, but also neat and general for any order.

3.2.1 Local Approximation

Firstly, we derive formulas and properties of the local approximation, which describes how we transit locally from time $s$ to time $t$. It can be divided into two parts: discretization of the integral in Eq. (9) and estimating the high-order derivatives in the Taylor expansion.

Discretization. Denote $\bm{g}^{(k)}_s\coloneqq\bm{g}_\theta^{(k)}(\bm{x}_{\lambda_s},\lambda_s)$. For $n\geq 0$, to obtain the $(n+1)$-th order discretization of Eq. (9), we take the $n$-th order Taylor expansion of $\bm{g}_\theta(\bm{x}_\lambda,\lambda)$ w.r.t. $\lambda$ at $\lambda_s$: $\bm{g}_\theta(\bm{x}_\lambda,\lambda)=\sum_{k=0}^{n}\frac{(\lambda-\lambda_s)^k}{k!}\bm{g}^{(k)}_s+\mathcal{O}((\lambda-\lambda_s)^{n+1})$. Substituting it into Eq. (9) yields

\frac{\bm{x}_{t}}{\alpha_{t}}=\bm{A}(\lambda_{s},\lambda_{t})\left(\frac{\bm{x}_{s}}{\alpha_{s}}-\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda-\sum_{k=0}^{n}\bm{g}^{(k)}_{s}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda\right)+\mathcal{O}(h^{n+2}) \qquad (12)

Here we only need to estimate the $k$-th order total derivatives $\bm{g}^{(k)}_\theta(\bm{x}_{\lambda_s},\lambda_s)$ for $0\leq k\leq n$, since the other terms are determined once given $\lambda_s,\lambda_t$ and $\bm{l}_\lambda,\bm{s}_\lambda,\bm{b}_\lambda$, which we'll discuss next.

High-order derivative estimation. For the $(n+1)$-th order approximation, we use the finite differences of $\bm{g}_\theta(\bm{x}_\lambda,\lambda)$ at the previous $n+1$ steps $\lambda_{i_n},\dots,\lambda_{i_1},\lambda_s$ to estimate each $\bm{g}_\theta^{(k)}(\bm{x}_{\lambda_s},\lambda_s)$. Such a derivation matches the coefficients of the Taylor expansions. Concretely, denote $\delta_k\coloneqq\lambda_{i_k}-\lambda_s$ and $\bm{g}_{i_k}\coloneqq\bm{g}_\theta(\bm{x}_{\lambda_{i_k}},\lambda_{i_k})$; then the estimated high-order derivatives $\hat{\bm{g}}^{(k)}_s$ can be solved from the following linear system:

\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{n}&\delta_{n}^{2}&\cdots&\delta_{n}^{n}\end{pmatrix}\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}}{n!}\end{pmatrix}=\begin{pmatrix}\bm{g}_{i_{1}}-\bm{g}_{s}\\ \vdots\\ \bm{g}_{i_{n}}-\bm{g}_{s}\end{pmatrix} \qquad (13)

Then by substituting $\hat{\bm{g}}^{(k)}_s$ into Eq. (12) and dropping the $\mathcal{O}(h^{n+2})$ error terms, we obtain the $(n+1)$-th order local approximation:

\frac{\hat{\bm{x}}_{t}}{\alpha_{t}}=\bm{A}(\lambda_{s},\lambda_{t})\left(\frac{\bm{x}_{s}}{\alpha_{s}}-\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda-\sum_{k=0}^{n}\hat{\bm{g}}^{(k)}_{s}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda\right) \qquad (14)

where $\hat{\bm{g}}_\theta^{(0)}(\bm{x}_{\lambda_s},\lambda_s)=\bm{g}_s$. Eq. (13) and Eq. (14) provide an update rule to transit from time $s$ to time $t$ and get an approximated solution $\hat{\bm{x}}_t$, when we already have the solution $\bm{x}_s$. For the $(n+1)$-th order approximation, we need $n$ extra solutions $\bm{x}_{\lambda_{i_k}}$ and their corresponding function values $\bm{g}_{i_k}$. We illustrate the procedure in Fig. 2(a) and summarize it in Appendix C.2. In the following theorem, we show that under some assumptions, such local approximation has a guarantee of order of accuracy.
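
The system in Eq. (13) is a small Vandermonde-type system of size $n$ (independent of the data dimension $D$) and can be solved directly; a minimal sketch, assuming the cached values are stored as arrays:

```python
import math
import numpy as np

def estimate_derivatives(deltas, g_prev, g_s):
    """Solve Eq. (13) for the estimated derivatives ghat^(k)_s, k = 1, ..., n.

    deltas: array (n,) with delta_k = lambda_{i_k} - lambda_s.
    g_prev: array (n, D) with rows g_{i_k}; g_s: array (D,) with g at lambda_s.
    Returns an array (n, D) whose (k-1)-th row approximates g^(k)_s.
    """
    n = len(deltas)
    R = np.vander(deltas, N=n + 1, increasing=True)[:, 1:]   # rows (delta, delta^2, ..., delta^n)
    coeffs = np.linalg.solve(R, g_prev - g_s)                # rows are g^(k)_s / k!
    factorials = np.array([math.factorial(k) for k in range(1, n + 1)])
    return coeffs * factorials[:, None]
```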

Theorem 3.1 (Local order of accuracy, proof in Appendix B.2.1).

Suppose $\bm{x}_{\lambda_{i_k}}$ are exact (i.e., on the ground-truth ODE trajectory passing $\bm{x}_s$) for $k=1,\dots,n$; then under some regularity conditions detailed in Appendix B.1, the local truncation error $\hat{\bm{x}}_t-\bm{x}_t=\mathcal{O}(h^{n+2})$, which means the local approximation has $(n+1)$-th order of accuracy.

Besides, we have the following theorem showing that, whatever the order is, the local approximation is unbiased given our choice of $\bm{s}_\lambda,\bm{b}_\lambda$ in Eq. (11). In practice, the phenomenon of reduced bias can be empirically observed (Section 4.3).

Theorem 3.2 (Local unbiasedness, proof in Appendix B.4).

Given the optimal $\bm{s}_\lambda^*,\bm{b}_\lambda^*$ in Eq. (11), for the $(n+1)$-th order approximation, suppose $\bm{x}_{\lambda_{i_1}},\dots,\bm{x}_{\lambda_{i_n}}$ are on the ground-truth ODE trajectory passing $\bm{x}_{\lambda_s}$; then $\mathbb{E}_{p^\theta_{\lambda_s}(\bm{x}_s)}\left[\hat{\bm{x}}_t-\bm{x}_t\right]=0$.

3.2.2 Global Solver

Given $M+1$ time steps $\{t_i\}_{i=0}^{M}$, starting from some initial value, we can repeat the local approximation $M$ times to make consecutive transitions from each $t_{i-1}$ to $t_i$ until we reach an acceptable solution. At each step, we apply multistep methods [1] by caching and reusing the previous $n$ values at timesteps $t_{i-1-n},\dots,t_{i-2}$, which is proven to be more stable when NFE is small [32, 56]. Moreover, we also apply the predictor-corrector method [58] to refine the approximation at each step without introducing extra NFE. Specifically, the $(n+1)$-th order predictor is the case of the local approximation when we choose $(t_{i_n},\dots,t_{i_1},s,t)=(t_{i-1-n},\dots,t_{i-2},t_{i-1},t_i)$, and the $(n+1)$-th order corrector is the case when we choose $(t_{i_n},\dots,t_{i_1},s,t)=(t_{i-n},\dots,t_{i-2},t_i,t_{i-1},t_i)$. We present our $(n+1)$-th order multistep predictor-corrector procedure in Appendix C.2. We also illustrate a second-order case in Fig. 2(b). Note that, different from previous works, in the local transition from $t_{i-1}$ to $t_i$, the previous function values $\hat{\bm{g}}_{i_k}$ ($1\leq k\leq n$) used for derivative estimation are dependent on $i$ and are different during the sampling process, because $\bm{g}_\theta$ is dependent on the current $t_{i-1}$ (see Eq. (8)). Thus, we directly cache $\hat{\bm{x}}_i,\hat{\bm{\epsilon}}_i$ and reuse them to compute $\hat{\bm{g}}_i$ in the subsequent steps. Notably, our proposed solver also has a global convergence guarantee, as shown in the following theorem. For simplicity, we only consider the predictor case; the case with corrector can be proved by similar derivations in [58].
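
A schematic of the resulting sampling loop is sketched below; `local_step` stands for the local approximation in Eq. (14) built from cached $(\hat{\bm{x}},\hat{\bm{\epsilon}})$ pairs and is a hypothetical placeholder rather than our actual interface (the precise algorithm is given in Appendix C.2).

```python
def multistep_predictor_corrector(x_init, timesteps, eps_model, local_step, n_prev=1):
    """Skeleton of the multistep predictor-corrector loop (a sketch, not the exact algorithm).

    The model is evaluated once per step, on the predicted point; that evaluation is
    reused by the corrector and by later predictor steps, so the corrector adds no NFE.
    """
    cache = [(timesteps[0], x_init, eps_model(x_init, timesteps[0]))]  # (t, x_hat, eps_hat)
    for i in range(1, len(timesteps)):
        s, t = timesteps[i - 1], timesteps[i]
        prev = cache[-(n_prev + 1):]                     # current point plus up to n_prev earlier ones
        x_pred = local_step(prev, s, t)                  # predictor: transition s -> t
        eps_pred = eps_model(x_pred, t)                  # the only model call of this step
        x_corr = local_step(prev + [(t, x_pred, eps_pred)], s, t)  # corrector refines x_t
        cache.append((t, x_corr, eps_pred))              # cache x_hat, eps_hat for reuse
    return cache[-1][1]
```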

Theorem 3.3 (Global order of convergence, proof in Appendix B.2.2).

For the $(n+1)$-th order predictor, if we iteratively compute a sequence $\{\hat{\bm{x}}_i\}_{i=0}^{M}$ to approximate the true solutions $\{\bm{x}_i\}_{i=0}^{M}$ at $\{t_i\}_{i=0}^{M}$, then under both local and global assumptions detailed in Appendix B.1, the final error $|\hat{\bm{x}}_M-\bm{x}_M|=\mathcal{O}(h^{n+1})$, where $|\cdot|$ denotes the element-wise absolute value, and $h=\max_{1\leq i\leq M}(\lambda_i-\lambda_{i-1})$.

3.3 Practical Techniques

In this section, we introduce some practical techniques that further improve the sample quality in the case of small NFE or large guidance scales.

Pseudo-order solver for small NFE. When NFE is extremely small (e.g., 5$\sim$10), the error at each timestep becomes rather large, and incorporating too many previous values by the high-order solver at each step will cause instabilities. To alleviate this problem, we propose a technique called pseudo-order solver: when estimating the $k$-th order derivative, we only utilize the previous $k+1$ function values of $\bm{g}_\theta$, instead of all the $n$ previous values as in Eq. (13). For each $k$, we can obtain $\hat{\bm{g}}^{(k)}_s$ by solving a part of Eq. (13) and taking the last element:

\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{k}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{k}&\delta_{k}^{2}&\cdots&\delta_{k}^{k}\end{pmatrix}\begin{pmatrix}\cdot\\ \vdots\\ \frac{\hat{\bm{g}}^{(k)}_{s}}{k!}\end{pmatrix}=\begin{pmatrix}\bm{g}_{i_{1}}-\bm{g}_{s}\\ \vdots\\ \bm{g}_{i_{k}}-\bm{g}_{s}\end{pmatrix},\quad k=1,2,\dots,n \qquad (15)

In practice, we do not need to solve $n$ linear systems. Instead, the solutions for $\hat{\bm{g}}^{(k)}_s,k=1,\dots,n$ have a simpler recurrence relation similar to Neville's method [36] in Lagrange polynomial interpolation. Denoting $i_0\coloneqq s$ so that $\delta_0=\lambda_{i_0}-\lambda_s=0$, we have

Theorem 3.4 (Pseudo-order solver).

For each $k$, the solution in Eq. (15) is $\hat{\bm{g}}^{(k)}_s=k!D^{(k)}_0$, where

D^{(0)}_{l}\coloneqq\bm{g}_{i_{l}},\quad l=0,1,\dots,n \qquad (16)
D^{(k)}_{l}\coloneqq\frac{D^{(k-1)}_{l+1}-D^{(k-1)}_{l}}{\delta_{l+k}-\delta_{l}},\quad l=0,1,\dots,n-k

Proof in Appendix B.3. Note that the pseudo-order solver of order $n>2$ no longer has the guarantee of $n$-th order of accuracy, which is not so important when NFE is small. In our experiments, we mainly rely on two combinations: when we use an $n$-th order predictor, we combine it with either an $n$-th order corrector or an $(n+1)$-th pseudo-order corrector.
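
The recurrence in Theorem 3.4 is essentially a divided-difference table, so all pseudo-order derivatives can be computed with $\mathcal{O}(n^2)$ vector operations; a minimal sketch (array shapes are illustrative):

```python
import math
import numpy as np

def pseudo_order_derivatives(deltas, g_values):
    """Compute ghat^(k)_s = k! * D_0^(k) for k = 1, ..., n via Eq. (16).

    deltas: array (n+1,) with deltas[0] = 0 and deltas[l] = lambda_{i_l} - lambda_s.
    g_values: array (n+1, D) with g_values[0] = g_s and g_values[l] = g_{i_l}.
    """
    n = len(deltas) - 1
    table = [np.asarray(g) for g in g_values]          # D_l^(0) = g_{i_l}
    derivatives = []
    for k in range(1, n + 1):
        table = [(table[l + 1] - table[l]) / (deltas[l + k] - deltas[l])
                 for l in range(n - k + 1)]            # D_l^(k)
        derivatives.append(math.factorial(k) * table[0])
    return derivatives
```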

Half-corrector for large guidance scales. When the guidance scale is large in guided sampling, we find that the corrector may have negative effects on the sample quality. We propose a useful technique called half-corrector, which uses the corrector only in the time region $t\leq 0.5$. Correspondingly, the strategy that uses the corrector at every step is called full-corrector.

3.4 Implementation Details

In this section, we give some implementation details about how to compute and integrate the EMS in our solver and how to adapt them to guided sampling.

Estimating EMS. For a specific $\lambda$, the EMS $\bm{l}_\lambda^*,\bm{s}_\lambda^*,\bm{b}_\lambda^*$ can be estimated by firstly drawing $K$ (1024$\sim$4096) datapoints $\bm{x}_\lambda\sim p^\theta_\lambda(\bm{x}_\lambda)$ with 200-step DPM-Solver++ [32] and then analytically computing some terms related to $\bm{\epsilon}_\theta$ (detailed in Appendix C.1.1). In practice, we find it both convenient and effective to choose the distribution of the dataset $q_0$ to approximate $p_0^\theta$. Thus, without further specifications, we directly use samples from $q_0$.

Estimating integrals of EMS. We estimate the EMS on $N$ (120$\sim$1200) timesteps $\lambda_{j_0},\lambda_{j_1},\dots,\lambda_{j_N}$ and use the trapezoidal rule to estimate the integrals in Eq. (12) (see Appendix I.3 for the estimation error analysis). We also apply some pre-computation for the integrals to avoid extra computation costs during sampling, detailed in Appendix C.1.2.
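
As a sketch of this step, the running integral of an EMS quantity over the grid can be accumulated once with the trapezoidal rule and looked up during sampling (the actual pre-computation in Appendix C.1.2 may differ in details):

```python
import numpy as np

def cumulative_integral(values, lambdas):
    """Running trapezoidal estimate of int_{lambda_{j_0}}^{lambda} v(tau) d tau.

    values: array (N+1, D) of one EMS quantity (e.g. l_lambda) on the grid lambdas (N+1,).
    Returns an array (N+1, D) with the integral from lambda_{j_0} to each grid point.
    """
    widths = np.diff(lambdas)[:, None]                       # (N, 1)
    areas = 0.5 * (values[1:] + values[:-1]) * widths        # (N, D) trapezoids
    return np.concatenate([np.zeros_like(values[:1]), np.cumsum(areas, axis=0)], axis=0)
```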

Adaptation to guided sampling. Empirically, we find that within a common range of guidance scales, we can simply compute the EMS on the model without guidance, and it can work for both unconditional sampling and guided sampling cases. See Appendix J for more discussions.

3.5 Comparison with Existing Methods

By comparing with existing diffusion ODE solvers that are based on exponential integrators [56, 31, 32, 58], we can conclude that (1) previous ODE formulations with noise/data prediction are special cases of ours, obtained by setting $\bm{l}_\lambda,\bm{s}_\lambda,\bm{b}_\lambda$ to specific values; (2) our first-order discretization can be seen as improved DDIM. See more details in Appendix A.

4 Experiments

Refer to caption

(a) CIFAR10
(ScoreSDE, Pixel DPM)

Refer to caption

(b) CIFAR10
(EDM, Pixel DPM)

Refer to caption

(c) LSUN-Bedroom
(Latent-Diffusion, Latent DPM)

Figure 3: Unconditional sampling results. We report the FID$\downarrow$ of the methods with different numbers of function evaluations (NFE), evaluated on 50k samples. The results of the baselines are borrowed directly from their original papers.
Refer to caption

(a) ImageNet-256
(Guided-Diffusion, Pixel DPM)
(Classifier Guidance, $s=2.0$)

Refer to caption

(b) MS-COCO2014
(Stable-Diffusion, Latent DPM)
(Classifier-Free Guidance, $s=1.5$)

Refer to caption

(c) MS-COCO2014
(Stable-Diffusion, Latent DPM)
(Classifier-Free Guidance, $s=7.5$)

Figure 4: Conditional sampling results. We report the FID$\downarrow$ or MSE$\downarrow$ of the methods with different numbers of function evaluations (NFE), evaluated on 10k samples.
Refer to caption

(a) CIFAR10 (ScoreSDE)

Refer to caption

(b) CIFAR10 (EDM)

Refer to caption

(c) MS-COCO2014 (Stable-Diffusion, $s=7.5$)

Figure 5: Visualization of the EMS $\bm{l}_\lambda,\bm{s}_\lambda,\bm{b}_\lambda$ w.r.t. $\lambda$, estimated on different models.
Refer to caption

(a) DPM-Solver++ [32] (FID 18.59)

Refer to caption

(b) UniPC [58] (FID 12.24)

Refer to caption

(c) DPM-Solver-v3 (Ours) (FID 7.54)

Figure 6: Random samples of Latent-Diffusion [43] on LSUN-Bedroom [55] with only NFE = 5.

In this section, we show that DPM-Solver-v3 can achieve consistent and notable speed-up for both unconditional and conditional sampling with both pixel-space and latent-space DPMs. We conduct extensive experiments on diverse image datasets, where the resolution ranges from 32 to 256. First, we present the main results of sample quality comparison with previous state-of-the-art training-free methods. Then we illustrate the effectiveness of our method by visualizing the EMS and samples. Additional ablation studies are provided in Appendix G. On each dataset, we choose a sufficient number of timesteps $N$ and datapoints $K$ for computing the EMS to reduce the estimation error, while the EMS can still be computed within hours. After we obtain the EMS and precompute the integrals involving them, there is negligible extra overhead in the sampling process. We provide the runtime comparison in Appendix E. We refer to Appendix D for more detailed experiment settings.

4.1 Main Results

Table 1: Quantitative results on CIFAR10 [24]. We report the FID$\downarrow$ of the methods with different numbers of function evaluations (NFE), evaluated on 50k samples. The results of the baselines are borrowed directly from their original papers; “\” denotes a result not reported.
Method              Model           NFE=5    6        8       10      12      15      20      25
DEIS [56]           ScoreSDE [51]   15.37    \        \       4.17    \       3.37    2.86    \
DPM-Solver++ [32]   ScoreSDE [51]   28.53    13.48    5.34    4.01    4.04    3.32    2.90    2.76
UniPC [58]          ScoreSDE [51]   23.71    10.41    5.16    3.93    3.88    3.05    2.73    2.65
DPM-Solver-v3       ScoreSDE [51]   12.76    7.40     3.94    3.40    3.24    2.91    2.71    2.64
Heun's 2nd [21]     EDM [21]        320.80   103.86   39.66   16.57   7.59    4.76    2.51    2.12
DPM-Solver++ [32]   EDM [21]        24.54    11.85    4.36    2.91    2.45    2.17    2.05    2.02
UniPC [58]          EDM [21]        23.52    11.10    3.86    2.85    2.38    2.08    2.01    2.00
DPM-Solver-v3       EDM [21]        12.21    8.56     3.50    2.51    2.24    2.10    2.02    2.00

We present the results with 5$\sim$20 function evaluations (NFE), covering both the few-step cases and the almost-converged cases, as shown in Fig. 3 and Fig. 4. For the sake of clarity, we mainly compare DPM-Solver-v3 to DPM-Solver++ [32] and UniPC [58], which are the current state-of-the-art diffusion ODE solvers. We also include the results for DEIS [56] and Heun's 2nd order method in EDM [21], but only on the datasets they originally reported. We do not show the results for other methods such as DDIM [48] and PNDM [28], since they have already been compared in previous works and have inferior performance. The quantitative results on CIFAR10 [24] are listed in Table 1, and more detailed quantitative results are presented in Appendix F.

Unconditional sampling. We first evaluate the unconditional sampling performance of different methods on CIFAR10 [24] and LSUN-Bedroom [55]. For CIFAR10, we use two pixel-space DPMs: one based on ScoreSDE [51], which is a widely adopted model by previous samplers, and another based on EDM [21], which achieves the best sample quality. For LSUN-Bedroom, we use the latent-space Latent-Diffusion model [43]. We apply the multistep 3rd-order version of DPM-Solver++, UniPC and DPM-Solver-v3 by default, which performs best in the unconditional setting. For UniPC, we report the better result of their two choices $B_1(h)=h$ and $B_2(h)=e^h-1$ at each NFE. For our DPM-Solver-v3, we tune the strategies of whether to use the pseudo-order predictor/corrector at each NFE on CIFAR10, and use the pseudo-order corrector on LSUN-Bedroom. As shown in Fig. 3, we find that DPM-Solver-v3 achieves consistently better FID, which is especially notable when NFE is 5$\sim$10. For example, we improve the FID on CIFAR10 with 5 NFE from $23$ to $12$ with ScoreSDE, and achieve an FID of $2.51$ with only $10$ NFE with the advanced DPM provided by EDM. On LSUN-Bedroom, with around 12 minutes of computing the EMS, DPM-Solver-v3 converges to an FID of 3.06 with 12 NFE, which is approximately 60% of the sampling cost of the previous best training-free method (20 NFE by UniPC).

Conditional sampling. We then evaluate the conditional sampling performance, which is more widely used since it allows for controllable generation with user-customized conditions. We choose two conditional settings: one is classifier guidance on the pixel-space Guided-Diffusion [10] model trained on the ImageNet-256 dataset [9] with 1000 class labels as conditions; the other is classifier-free guidance on the latent-space Stable-Diffusion model [43] trained on the LAION-5B dataset [46] with CLIP [41] embedded text prompts as conditions. We evaluate the former at a guidance scale of $2.0$, following the best choice in [10], and the latter at guidance scales of $1.5$ (following the original paper) or $7.5$ (following the official code) with prompts randomly selected from the MS-COCO2014 validation set [26]. Note that the FID of Stable-Diffusion samples saturates to 15.0$\sim$16.0 even within 10 steps when the latent codes are far from convergence, possibly due to the powerful image decoder (see Appendix H). Thus, following [32], we measure the mean squared error (MSE) between the generated latent code $\hat{\bm{x}}$ and the ground-truth solution $\bm{x}^*$ (i.e., $\|\hat{\bm{x}}-\bm{x}^*\|_2^2/D$) to evaluate convergence, starting from the same Gaussian noise. We obtain $\bm{x}^*$ by 200-step DPM-Solver++, which is enough to ensure convergence.

We apply the multistep 2nd-order version of DPM-Solver++, UniPC and DPM-Solver-v3, which performs best in the conditional setting. For UniPC, we only apply the choice $B_2(h)=e^h-1$, which performs better than $B_1(h)$. For our DPM-Solver-v3, we use the pseudo-order corrector by default, and report the best results between using the half-corrector and the full-corrector on Stable-Diffusion ($s=7.5$). As shown in Fig. 4, DPM-Solver-v3 can achieve better sample quality or convergence at most NFEs, which indicates the effectiveness of our method and techniques under the conditional setting. It is worth noting that UniPC, which adopts an extra corrector, performs even worse than DPM-Solver++ when NFE<10 on Stable-Diffusion ($s=7.5$). With the combined effect of the EMS and the half-corrector technique, we successfully outperform DPM-Solver++ in such a case. Detailed comparisons can be found in the ablations in Appendix G.

4.2 Visualizations of Estimated EMS

We further visualize the estimated EMS in Fig. 5. Since $\bm{l}_\lambda,\bm{s}_\lambda,\bm{b}_\lambda$ are $D$-dimensional vectors, we average them over all dimensions to obtain a scalar. From Fig. 5, we find that $\bm{l}_\lambda$ gradually changes from $1$ to near $0$ as the sampling proceeds, which suggests we should gradually slide from data prediction to noise prediction. As for $\bm{s}_\lambda,\bm{b}_\lambda$, they are more model-specific and display many fluctuations, especially for the ScoreSDE model [51] on CIFAR10. Apart from the estimation error of the EMS, we suspect that these fluctuations come from $\bm{\epsilon}_\theta^{(1)}$ and are caused by the periodicity of the trigonometric functions in the positional embeddings of the network input. It is worth noting that the fluctuation of $\bm{s}_\lambda,\bm{b}_\lambda$ will not cause instability in our sampler (see Appendix J).

4.3 Visual Quality

We present some qualitative comparisons in Fig. 6 and Fig. 1. We find that previous methods tend to produce samples with reduced color contrast at small NFE, while our method is less biased and produces more visual details. In Fig. 1(b), we can observe that previous methods with the corrector may cause distortion at large guidance scales (in the top-left image, a part of the castle becomes a hill; in the bottom-left image, the hill is translucent and the castle is hanging in the air), while ours does not. Additional samples are provided in Appendix K.

5 Conclusion

We study the ODE parameterization problem for fast sampling of DPMs. Through theoretical analysis, we find a novel ODE formulation with empirical model statistics, which is towards the optimal one that minimizes the first-order discretization error. Based on such improved ODE formulation, we propose a new fast solver named DPM-Solver-v3, which involves a multistep predictor-corrector framework of any order and several techniques for improved sampling with few steps or a large guidance scale. Experiments demonstrate the effectiveness of DPM-Solver-v3 in both unconditional and conditional sampling with both pixel-space and latent-space pretrained DPMs, and the significant advancement of sample quality in 5$\sim$10 steps.

Limitations and broader impact. Despite the great speedup at small numbers of steps, DPM-Solver-v3 still lags behind training-based methods and is not fast enough for real-time applications. Besides, we conducted theoretical analyses of the local error, but did not explore the global design space, such as the design of timestep schedules during sampling. Additionally, as with other accelerated samplers, DPM-Solver-v3 may potentially be abused to speed up the generation of fake and malicious content.

Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2021ZD0110502), NSFC Projects (Nos. 62061136001, 62106123, 62076147, U19A2081, 61972224, 62106120), Tsinghua Institute for Guo Qiang, and the High Performance Computing Center, Tsinghua University. J.Z is also supported by the XPlorer Prize.

References

  • [1] Kendall Atkinson, Weimin Han, and David E Stewart. Numerical solution of ordinary differential equations, volume 108. John Wiley & Sons, 2011.
  • [2] Fan Bao, Chongxuan Li, Jiacheng Sun, Jun Zhu, and Bo Zhang. Estimating the optimal covariance with imperfect mean in diffusion probabilistic models. In International Conference on Machine Learning, pages 1555–1584. PMLR, 2022.
  • [3] Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: An analytic estimate of the optimal reverse variance in diffusion probabilistic models. In International Conference on Learning Representations, 2022.
  • [4] Costas Bekas, Effrosyni Kokiopoulou, and Yousef Saad. An estimator for the diagonal of a matrix. Applied numerical mathematics, 57(11-12):1214–1229, 2007.
  • [5] Mari Paz Calvo and César Palencia. A class of explicit multistep exponential integrators for semilinear problems. Numerische Mathematik, 102:367–381, 2006.
  • [6] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021.
  • [7] Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2022.
  • [8] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. In The Eleventh International Conference on Learning Representations, 2022.
  • [9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • [10] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, volume 34, pages 8780–8794, 2021.
  • [11] Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Genie: Higher-order denoising diffusion solvers. In Advances in Neural Information Processing Systems, 2022.
  • [12] Alfredo Eisinberg and Giuseppe Fedele. On the inversion of the vandermonde matrix. Applied mathematics and computation, 174(2):1384–1397, 2006.
  • [13] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27, pages 2672–2680, 2014.
  • [14] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  • [15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851, 2020.
  • [16] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • [17] Marlis Hochbruck and Alexander Ostermann. Explicit exponential Runge-Kutta methods for semilinear parabolic problems. SIAM Journal on Numerical Analysis, 43(3):1069–1090, 2005.
  • [18] Marlis Hochbruck and Alexander Ostermann. Exponential integrators. Acta Numerica, 19:209–286, 2010.
  • [19] Marlis Hochbruck, Alexander Ostermann, and Julia Schweitzer. Exponential rosenbrock-type methods. SIAM Journal on Numerical Analysis, 47(1):786–803, 2009.
  • [20] Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–1076, 1989.
  • [21] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, 2022.
  • [22] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.
  • [23] Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Advances in Neural Information Processing Systems, 2021.
  • [24] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [25] Shengmeng Li, Luping Liu, Zenghao Chai, Runnan Li, and Xu Tan. Era-solver: Error-robust adams solver for fast sampling of diffusion probabilistic models. arXiv preprint arXiv:2301.12935, 2023.
  • [26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [27] Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, and Zhou Zhao. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11020–11028, 2022.
  • [28] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In International Conference on Learning Representations, 2021.
  • [29] Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2022.
  • [30] Cheng Lu, Kaiwen Zheng, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Maximum likelihood training for score-based diffusion odes by high order denoising score matching. In International Conference on Machine Learning, pages 14429–14460. PMLR, 2022.
  • [31] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In Advances in Neural Information Processing Systems, 2022.
  • [32] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.
  • [33] Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.
  • [34] Chenlin Meng, Ruiqi Gao, Diederik P Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.
  • [35] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
  • [36] Eric Harold Neville. Iterative interpolation. St. Joseph’s IS Press, 1934.
  • [37] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
  • [38] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pages 16784–16804. PMLR, 2022.
  • [39] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [40] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov, and Jiansheng Wei. Diffusion-based voice conversion with fast maximum likelihood sampling scheme. In International Conference on Learning Representations, 2022.
  • [41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [42] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
  • [43] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • [44] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022.
  • [45] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022.
  • [46] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • [47] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • [48] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
  • [49] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
  • [50] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In Advances in Neural Information Processing Systems, volume 34, pages 1415–1428, 2021.
  • [51] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
  • [52] Hideyuki Tachibana, Mocho Go, Muneyoshi Inahara, Yotaro Katayama, and Yotaro Watanabe. Itô-Taylor sampling scheme for denoising diffusion probabilistic models using ideal derivatives. arXiv e-prints, 2021.
  • [53] Daniel Watson, William Chan, Jonathan Ho, and Mohammad Norouzi. Learning fast samplers for diffusion models by differentiating through sample quality. In International Conference on Learning Representations, 2022.
  • [54] Suttisak Wizadwongsa and Supasorn Suwajanakorn. Accelerating guided diffusion sampling with splitting numerical methods. In The Eleventh International Conference on Learning Representations, 2022.
  • [55] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  • [56] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2022.
  • [57] Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. In Advances in Neural Information Processing Systems, 2022.
  • [58] Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. arXiv preprint arXiv:2302.04867, 2023.
  • [59] Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Improved techniques for maximum likelihood estimation for diffusion odes. arXiv preprint arXiv:2305.03935, 2023.

Appendix A Related Work

Diffusion probabilistic models (DPMs) [47, 15, 51], also known as score-based generative models (SGMs), have achieved remarkable generation ability on image domain [10, 21], yielding extensive applications such as speech, singing and video synthesis [6, 27, 14], controllable image generation, translation and editing [38, 42, 43, 35, 57, 8], likelihood estimation [50, 23, 30, 59], data compression [23] and inverse problem solving [7, 22].

A.1 Fast Sampling Methods for DPMs

Fast sampling methods based on extra training or optimization include learning variances in the reverse process [37, 2], learning the sampling schedule [53], learning high-order model derivatives [11], model refinement [29] and model distillation [45, 33, 49]. Though distillation-based methods can generate high-quality samples in fewer than 5 steps, they bring onerous additional training costs. Moreover, the distillation process inevitably loses part of the information of the original model, and is hard to adapt to pretrained large DPMs [43, 44, 42] and conditional sampling [34]. Some distillation-based methods also lack the ability to make flexible trade-offs between sampling speed and sample quality.

In contrast, training-free samplers are more lightweight and flexible. Among them, samplers based on diffusion ODEs generally require fewer steps than those based on diffusion SDEs [51, 40, 3, 48, 32], since SDEs introduce more randomness and make the denoising harder in the sampling process. Previous samplers handle the diffusion ODEs with different methods, such as Heun’s methods [21], splitting numerical methods [54], pseudo numerical methods [28], Adams methods [25] or exponential integrators [56, 31, 32, 58].

A.2 Comparison with Existing Solvers Based on Exponential Integrators

Table 2: Comparison between DPM-Solver-v3 and other high-order diffusion ODE solvers based on exponential integrators.

|                                | DEIS [56] | DPM-Solver [31] | DPM-Solver++ [32] | UniPC [58] | DPM-Solver-v3 (Ours) |
| First-Order                    | DDIM      | DDIM            | DDIM              | DDIM       | Improved DDIM        |
| Taylor Expanded Predictor      | ϵθ for t  | ϵθ for λ        | 𝒙θ for λ          | 𝒙θ for λ   | 𝒈θ for λ             |
| Solver Type (High-Order)       | Multistep | Singlestep      | Multistep         | Multistep  | Multistep            |
| Applicable for Guided Sampling | ✗         | ✗               | ✓                 | ✓          | ✓                    |
| Corrector Supported            | ✗         | ✗               | ✗                 | ✓          | ✓                    |
| Model-Specific                 | ✗         | ✗               | ✗                 | ✗          | ✓                    |

In this section, we make some theoretical comparisons between DPM-Solver-v3 and existing diffusion ODE solvers that are based on exponential integrators [56, 31, 32, 58]. We summarize the results in Table 2, and provide some analysis below.

Previous ODE formulation as special cases of ours. First, we compare the formulation of the ODE solution. Our reformulated ODE solution in Eq. (9) involves extra coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda}, and it corresponds to a new predictor 𝒈θ\bm{g}_{\theta} to be approximated by Taylor expansion. By comparing ours with previous ODE formulations in Eq. (3) and their corresponding noise/data prediction, we can easily figure out that they are special cases of ours by setting 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} to specific values:

  • Noise prediction: 𝒍λ=𝟎,𝒔λ=𝟏,𝒃λ=𝟎\bm{l}_{\lambda}=\bm{0},\bm{s}_{\lambda}=-\bm{1},\bm{b}_{\lambda}=\bm{0}

  • Data prediction: 𝒍λ=𝟏,𝒔λ=𝟎,𝒃λ=𝟎\bm{l}_{\lambda}=\bm{1},\bm{s}_{\lambda}=\bm{0},\bm{b}_{\lambda}=\bm{0}

It is worth noting that, though our ODE formulation can degenerate to previous ones, it may still result in different solvers, since previous works conduct equivalent substitutions of the same order in the local approximation (for example, the choice of B1(h)=hB_{1}(h)=h or B2(h)=eh1B_{2}(h)=e^{h}-1 in UniPC [58]). We never conduct such substitutions, thus saving the effort of tuning them.

Moreover, under our framework, we find that DPM-Solver++ is a model-agnostic approximation of DPM-Solver-v3, under the Gaussian assumption. Specifically, according to Eq. (5), we have

𝒍λ=argmin𝒍λ𝔼pλθ(𝒙λ)σλ𝒙ϵθ(𝒙λ,λ)𝒍λF2,\bm{l}^{*}_{\lambda}=\operatornamewithlimits{argmin}_{\bm{l}_{\lambda}}\mathbb{E}_{p^{\theta}_{\lambda}(\bm{x}_{\lambda})}\|\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{l}_{\lambda}\|_{F}^{2}, (17)

If we assume qλ(𝒙λ)𝒩(𝒙λ|αλ𝒙0,σλ2𝑰)q_{\lambda}(\bm{x}_{\lambda})\approx\mathcal{N}(\bm{x}_{\lambda}|\alpha_{\lambda}\bm{x}_{0},\sigma_{\lambda}^{2}\bm{I}) for some fixed 𝒙0\bm{x}_{0}, then the optimal noise predictor is

ϵθ(𝒙λ,λ)σλ𝒙logqλ(𝒙λ)=𝒙λαλ𝒙0σλ.\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\approx-\sigma_{\lambda}\nabla_{\bm{x}}\log q_{\lambda}(\bm{x}_{\lambda})=\frac{\bm{x}_{\lambda}-\alpha_{\lambda}\bm{x}_{0}}{\sigma_{\lambda}}. (18)

It follows that σλ𝒙ϵθ(𝒙λ,λ)𝑰\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\approx\bm{I}, thus 𝒍λ𝟏\bm{l}_{\lambda}^{*}\approx\bm{1} by Eq. (5), which corresponds to the data prediction model used in DPM-Solver++. Moreover, for small enough λ\lambda (i.e., tt near TT), the Gaussian assumption is almost true (see Section 4.2), so the data-prediction DPM-Solver++ approximately computes all the linear terms at the initial stage. To the best of our knowledge, this is the first explanation of why the data-prediction DPM-Solver++ outperforms the noise-prediction DPM-Solver.
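As a quick sanity check of this argument (a toy sketch, not part of the original derivation; the Gaussian datapoint is synthetic), for a single Gaussian qλ=𝒩(αλ𝒙0,σλ2𝑰) the exact noise predictor indeed satisfies σλ∇𝒙ϵ=𝑰, so the optimal 𝒍λ equals 𝟏:

```python
import torch

# Toy check: for q_lambda = N(alpha * x0, sigma^2 I), the exact noise predictor is
# eps(x) = (x - alpha * x0) / sigma, hence sigma * d eps / dx = I and l*_lambda = 1.
alpha, sigma = 0.8, 0.6
x0 = torch.randn(4)
x = alpha * x0 + sigma * torch.randn(4)
eps = lambda x: (x - alpha * x0) / sigma
jac = torch.autograd.functional.jacobian(eps, x)  # 4x4 Jacobian of eps w.r.t. x
assert torch.allclose(sigma * jac, torch.eye(4))
```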

First-order discretization as improved DDIM. Previous methods merely use the noise/data parameterization, whether or not they change the time domain from tt to λ\lambda. While they differ in the high-order cases, they provably coincide in the first-order case, which is DDIM [48] (deterministic case, η=0\eta=0):

𝒙^t=αtαs𝒙^sαt(σsαsσtαt)ϵθ(𝒙^s,λs)\displaystyle\hat{\bm{x}}_{t}=\frac{\alpha_{t}}{\alpha_{s}}\hat{\bm{x}}_{s}-\alpha_{t}\left(\frac{\sigma_{s}}{\alpha_{s}}-\frac{\sigma_{t}}{\alpha_{t}}\right)\bm{\epsilon}_{\theta}(\hat{\bm{x}}_{s},\lambda_{s}) (19)

However, the first-order case of our method is

𝒙^t\displaystyle\hat{\bm{x}}_{t} =αtαs𝑨(λs,λt)((1+𝒍λsλsλt𝑬λs(λ)dλ)𝒙^s(σsλsλt𝑬λs(λ)dλ)ϵθ(𝒙^s,λs))\displaystyle=\frac{\alpha_{t}}{\alpha_{s}}\bm{A}(\lambda_{s},\lambda_{t})\left(\left(1+\bm{l}_{\lambda_{s}}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda\right)\hat{\bm{x}}_{s}-\left(\sigma_{s}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda\right)\bm{\epsilon}_{\theta}(\hat{\bm{x}}_{s},\lambda_{s})\right) (20)
αt𝑨(λs,λt)λsλt𝑬λs(λ)𝑩λs(λ)dλ\displaystyle-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda

which is not DDIM since we choose a better parameterization by the estimated EMS. Empirically, our first-order solver performs better than DDIM, as detailed in Appendix G.
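For concreteness, the two first-order updates can be written as the following minimal sketch (elementwise operations on tensors/arrays). The quantities A, E_int and EB_int stand for 𝑨(λs,λt), ∫𝑬dλ and ∫𝑬𝑩dλ and are assumed to be precomputed as in Appendix C.1.2; all names are illustrative.

```python
def ddim_step(x_s, eps_s, alpha_s, alpha_t, sigma_s, sigma_t):
    # First-order DDIM update of Eq. (19), given the noise prediction eps_s at x_s.
    return (alpha_t / alpha_s) * x_s - alpha_t * (sigma_s / alpha_s - sigma_t / alpha_t) * eps_s

def improved_first_order_step(x_s, eps_s, alpha_s, alpha_t, sigma_s, l_s, A, E_int, EB_int):
    # First-order update of Eq. (20), using the (assumed precomputed) EMS integrals
    # A = A(lambda_s, lambda_t), E_int = int E dlambda, EB_int = int E * B dlambda.
    linear = (1.0 + l_s * E_int) * x_s - (sigma_s * E_int) * eps_s
    return (alpha_t / alpha_s) * A * linear - alpha_t * A * EB_int
```

With the noise-prediction choice 𝒍λ=𝟎, 𝒔λ=−𝟏, 𝒃λ=𝟎 listed above, one can check that the second update reduces exactly to the DDIM step.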

Appendix B Proofs

B.1 Assumptions

In this section, we will give some mild conditions under which the local order of accuracy of Algorithm 1 and the global order of convergence of Algorithm 2 (predictor) are guaranteed.

B.1.1 Local

First, we will give the assumptions to bound the local truncation error.

Assumption B.1.

The total derivatives of the noise prediction model dkϵθ(𝒙λ,λ)dλk,k=1,,n\frac{\mathrm{d}^{k}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)}{\mathrm{d}\lambda^{k}},k=1,\dots,n exist and are continuous.

Assumption B.2.

The coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} are continuous and bounded. dk𝒍λdλk,dk𝒔λdλk,dk𝒃λdλk,k=1,,n\frac{\mathrm{d}^{k}\bm{l}_{\lambda}}{\mathrm{d}\lambda^{k}},\frac{\mathrm{d}^{k}\bm{s}_{\lambda}}{\mathrm{d}\lambda^{k}},\frac{\mathrm{d}^{k}\bm{b}_{\lambda}}{\mathrm{d}\lambda^{k}},k=1,\dots,n exist and are continuous.

Assumption B.3.

δk=Θ(λtλs),k=1,,n\delta_{k}=\Theta(\lambda_{t}-\lambda_{s}),k=1,\dots,n

Assumption B.1 is required for the Taylor expansion and is standard in high-order numerical methods. Assumption B.2 requires the boundedness of the coefficients and regularizes their smoothness to enable the Taylor expansion of 𝒈θ(𝒙λ,λ)\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda), which holds in practice given the smoothness of ϵθ(𝒙λ,λ)\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda) and pλθ(𝒙λ)p_{\lambda}^{\theta}(\bm{x}_{\lambda}). Assumption B.3 makes sure that δk\delta_{k} and λtλs\lambda_{t}-\lambda_{s} are of the same order, i.e., there exist constants rk=𝒪(1)r_{k}=\mathcal{O}(1) such that δk=rk(λtλs)\delta_{k}=r_{k}(\lambda_{t}-\lambda_{s}), which is satisfied in regular multistep methods.

B.1.2 Global

Then we will give the assumptions to bound the global error.

Assumption B.4.

The noise prediction model ϵθ(𝒙,t)\bm{\epsilon}_{\theta}(\bm{x},t) is Lipschitz w.r.t. 𝒙\bm{x}.

Assumption B.5.

h=max1iM(λiλi1)=𝒪(1/M)h=\max_{1\leq i\leq M}(\lambda_{i}-\lambda_{i-1})=\mathcal{O}(1/M).

Assumption B.6.

The starting values 𝒙^i,1in\hat{\bm{x}}_{i},1\leq i\leq n satisfy 𝒙^i𝒙i=𝒪(hn+1)\hat{\bm{x}}_{i}-\bm{x}_{i}=\mathcal{O}(h^{n+1}).

Assumption B.4 is common in the analysis of ODEs, which assures ϵθ(𝒙^t,t)ϵθ(𝒙t,t)=𝒪(𝒙^t𝒙t)\bm{\epsilon}_{\theta}(\hat{\bm{x}}_{t},t)-\bm{\epsilon}_{\theta}(\bm{x}_{t},t)=\mathcal{O}(\hat{\bm{x}}_{t}-\bm{x}_{t}). Assumption B.5 implies that the step sizes are rather uniform. Assumption B.6 is common in the convergence analysis of multistep methods [5].

B.2 Order of Accuracy and Convergence

In this section, we prove the local and global order guarantees detailed in Theorem 3.1 and Theorem 3.3.

B.2.1 Local

Proof.

(Proof of Theorem 3.1) Denote hλtλsh\coloneqq\lambda_{t}-\lambda_{s}. Subtracting the Taylor-expanded exact solution in Eq. (12) from the local approximation in Eq. (14), we have

𝒙^t𝒙t=αt𝑨(λs,λt)k=1n𝒈^θ(k)(𝒙λs,λs)𝒈θ(k)(𝒙λs,λs)k!λsλt𝑬λs(λ)(λλs)kdλ+𝒪(hn+2)\hat{\bm{x}}_{t}-\bm{x}_{t}=-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\sum_{k=1}^{n}\frac{\hat{\bm{g}}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})-\bm{g}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})}{k!}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda+\mathcal{O}(h^{n+2}) (21)

First, we examine the order of 𝑨(λs,λt)\bm{A}(\lambda_{s},\lambda_{t}) and λsλt𝑬λs(λ)(λλs)kdλ\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda. Under Assumption B.2, there exist constants C1,C2C_{1},C_{2} such that 𝒍λ<C1,𝒍λ+𝒔λ<C2-\bm{l}_{\lambda}<C_{1},\bm{l}_{\lambda}+\bm{s}_{\lambda}<C_{2}. So

𝑨(λs,λt)\displaystyle\bm{A}(\lambda_{s},\lambda_{t}) =eλsλt𝒍τdτ\displaystyle=e^{-\int_{\lambda_{s}}^{\lambda_{t}}\bm{l}_{\tau}\mathrm{d}\tau} (22)
<eC1h\displaystyle<e^{C_{1}h}
=𝒪(1)\displaystyle=\mathcal{O}(1)
λsλt𝑬λs(λ)(λλs)kdλ\displaystyle\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda =λsλteλsλ(𝒍τ+𝒔τ)dτ(λλs)kdλ\displaystyle=\int_{\lambda_{s}}^{\lambda_{t}}e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda (23)
<λsλteC2(λλs)(λλs)kdλ\displaystyle<\int_{\lambda_{s}}^{\lambda_{t}}e^{C_{2}(\lambda-\lambda_{s})}(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda
=𝒪(hk+1)\displaystyle=\mathcal{O}(h^{k+1})

Next, we examine the order of 𝒈^θ(k)(𝒙λs,λs)𝒈θ(k)(𝒙λs,λs)k!\frac{\hat{\bm{g}}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})-\bm{g}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})}{k!}. Under Assumption B.1 and Assumption B.2, since 𝒈θ\bm{g}_{\theta} is an elementary function of ϵθ\bm{\epsilon}_{\theta} and 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda}, we know 𝒈θ(k)(𝒙λs,λs),k=1,,n\bm{g}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s}),k=1,\dots,n exist and are continuous. Adopting the notations in Eq. (13), by Taylor expansion, we have

𝒈i1\displaystyle\bm{g}_{i_{1}} =𝒈s+δ1𝒈s(1)+δ12𝒈s(2)++δ1n𝒈s(n)+𝒪(δ1n+1)\displaystyle=\bm{g}_{s}+\delta_{1}\bm{g}_{s}^{(1)}+\delta_{1}^{2}\bm{g}_{s}^{(2)}+\dots+\delta_{1}^{n}\bm{g}_{s}^{(n)}+\mathcal{O}(\delta_{1}^{n+1}) (24)
𝒈i2\displaystyle\bm{g}_{i_{2}} =𝒈s+δ2𝒈s(1)+δ22𝒈s(2)++δ2n𝒈s(n)+𝒪(δ2n+1)\displaystyle=\bm{g}_{s}+\delta_{2}\bm{g}_{s}^{(1)}+\delta_{2}^{2}\bm{g}_{s}^{(2)}+\dots+\delta_{2}^{n}\bm{g}_{s}^{(n)}+\mathcal{O}(\delta_{2}^{n+1})
𝒈in\displaystyle\dots{\\ }\bm{g}_{i_{n}} =𝒈s+δn𝒈s(1)+δn2𝒈s(2)++δnn𝒈s(n)+𝒪(δnn+1)\displaystyle=\bm{g}_{s}+\delta_{n}\bm{g}_{s}^{(1)}+\delta_{n}^{2}\bm{g}_{s}^{(2)}+\dots+\delta_{n}^{n}\bm{g}_{s}^{(n)}+\mathcal{O}(\delta_{n}^{n+1})

Comparing it with Eq. (13), we have

(δ1δ12δ1nδ2δ22δ2nδnδn2δnn)(𝒈^s(1)𝒈s(1)𝒈^s(2)𝒈s(2)2!𝒈^s(n)𝒈s(n)n!)=(𝒪(δ1n+1)𝒪(δ2n+1)𝒪(δnn+1))\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{n}\\ \delta_{2}&\delta_{2}^{2}&\cdots&\delta_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{n}&\delta_{n}^{2}&\cdots&\delta_{n}^{n}\end{pmatrix}\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}-\bm{g}^{(1)}_{s}\\ \frac{\hat{\bm{g}}^{(2)}_{s}-\bm{g}^{(2)}_{s}}{2!}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}-\bm{g}^{(n)}_{s}}{n!}\end{pmatrix}=\begin{pmatrix}\mathcal{O}(\delta_{1}^{n+1})\\ \mathcal{O}(\delta_{2}^{n+1})\\ \vdots\\ \mathcal{O}(\delta_{n}^{n+1})\end{pmatrix} (25)

From Assumption B.3, we know there exist constants rkr_{k} such that δk=rkh,k=1,,n\delta_{k}=r_{k}h,k=1,\dots,n. Thus

(δ1δ12δ1nδ2δ22δ2nδnδn2δnn)=(r1r12r1nr2r22r2nrnrn2rnn)(hh2hn),(𝒪(δ1n+1)𝒪(δ2n+1)𝒪(δnn+1))=(𝒪(hn+1)𝒪(hn+1)𝒪(hn+1))\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{n}\\ \delta_{2}&\delta_{2}^{2}&\cdots&\delta_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{n}&\delta_{n}^{2}&\cdots&\delta_{n}^{n}\end{pmatrix}=\begin{pmatrix}r_{1}&r_{1}^{2}&\cdots&r_{1}^{n}\\ r_{2}&r_{2}^{2}&\cdots&r_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ r_{n}&r_{n}^{2}&\cdots&r_{n}^{n}\end{pmatrix}\begin{pmatrix}h&&&\\ &h^{2}&&\\ &&\ddots&\\ &&&h^{n}\\ \end{pmatrix},\begin{pmatrix}\mathcal{O}(\delta_{1}^{n+1})\\ \mathcal{O}(\delta_{2}^{n+1})\\ \vdots\\ \mathcal{O}(\delta_{n}^{n+1})\end{pmatrix}=\begin{pmatrix}\mathcal{O}(h^{n+1})\\ \mathcal{O}(h^{n+1})\\ \vdots\\ \mathcal{O}(h^{n+1})\end{pmatrix} (26)

And finally, we have

(𝒈^s(1)𝒈s(1)𝒈^s(2)𝒈s(2)2!𝒈^s(n)𝒈s(n)n!)\displaystyle\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}-\bm{g}^{(1)}_{s}\\ \frac{\hat{\bm{g}}^{(2)}_{s}-\bm{g}^{(2)}_{s}}{2!}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}-\bm{g}^{(n)}_{s}}{n!}\end{pmatrix} =(h1h2hn)(r1r12r1nr2r22r2nrnrn2rnn)1(𝒪(hn+1)𝒪(hn+1)𝒪(hn+1))\displaystyle=\begin{pmatrix}h^{-1}&&&\\ &h^{-2}&&\\ &&\ddots&\\ &&&h^{-n}\\ \end{pmatrix}\begin{pmatrix}r_{1}&r_{1}^{2}&\cdots&r_{1}^{n}\\ r_{2}&r_{2}^{2}&\cdots&r_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ r_{n}&r_{n}^{2}&\cdots&r_{n}^{n}\end{pmatrix}^{-1}\begin{pmatrix}\mathcal{O}(h^{n+1})\\ \mathcal{O}(h^{n+1})\\ \vdots\\ \mathcal{O}(h^{n+1})\end{pmatrix} (27)
=(𝒪(hn)𝒪(hn1)𝒪(h1))\displaystyle=\begin{pmatrix}\mathcal{O}(h^{n})\\ \mathcal{O}(h^{n-1})\\ \vdots\\ \mathcal{O}(h^{1})\end{pmatrix}

Substituting Eq. (22), Eq. (23) and Eq. (27) into Eq. (21), we conclude that 𝒙^t𝒙t=𝒪(hn+2)\hat{\bm{x}}_{t}-\bm{x}_{t}=\mathcal{O}(h^{n+2}). ∎

B.2.2 Global

First, we provide a lemma that gives the local truncation error given inexact previous values when estimating the high-order derivatives.

Lemma B.7.

(Local truncation error with inexact previous values) Suppose inexact values 𝐱^λik,k=1,,n\hat{\bm{x}}_{\lambda_{i_{k}}},k=1,\dots,n and 𝐱^s\hat{\bm{x}}_{s} are used in Eq. (13) to estimate the high-order derivatives, then the local truncation error of the local approximation Eq. (14) satisfies

Δt=αt𝑨(λs,λt)αsΔs+𝒪(h)(𝒪(Δs)+k=1n𝒪(Δλik)+𝒪(hn+1))\Delta_{t}=\frac{\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})}{\alpha_{s}}\Delta_{s}+\mathcal{O}(h)\left(\mathcal{O}(\Delta_{s})+\sum_{k=1}^{n}\mathcal{O}(\Delta_{\lambda_{i_{k}}})+\mathcal{O}(h^{n+1})\right) (28)

where Δ𝐱^𝐱,hλtλs\Delta_{\cdot}\coloneqq\hat{\bm{x}}_{\cdot}-\bm{x}_{\cdot},h\coloneqq\lambda_{t}-\lambda_{s}.

Proof.

By replacing 𝒙\bm{x}_{\cdot} with 𝒙^\hat{\bm{x}}_{\cdot} in Eq. (13) and subtracting Eq. (12) from Eq. (14), the expression for the local truncation error becomes

Δt\displaystyle\Delta_{t} =αt𝑨(λs,λt)αsΔsαt𝑨(λs,λt)(𝒈θ(𝒙^λs,λs)𝒈θ(𝒙λs,λs))λsλt𝑬λs(λ)dλ\displaystyle=\frac{\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})}{\alpha_{s}}\Delta_{s}-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\left(\bm{g}_{\theta}(\hat{\bm{x}}_{\lambda_{s}},\lambda_{s})-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\right)\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda (29)
αt𝑨(λs,λt)k=1n𝒈^θ(k)(𝒙λs,λs)𝒈θ(k)(𝒙λs,λs)k!λsλt𝑬λs(λ)(λλs)kdλ+𝒪(hn+2)\displaystyle-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\sum_{k=1}^{n}\frac{\hat{\bm{g}}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})-\bm{g}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})}{k!}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda+\mathcal{O}(h^{n+2})

And the linear system for solving 𝒈θ(k)(𝒙λs,λs),k=1,,n\bm{g}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s}),k=1,\dots,n becomes

(δ1δ12δ1nδ2δ22δ2nδnδn2δnn)(𝒈^s(1)𝒈^s(2)2!𝒈^s(n)n!)=(𝒈^i1𝒈^s𝒈^i2𝒈^s𝒈^in𝒈^s)\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{n}\\ \delta_{2}&\delta_{2}^{2}&\cdots&\delta_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{n}&\delta_{n}^{2}&\cdots&\delta_{n}^{n}\end{pmatrix}\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}\\ \frac{\hat{\bm{g}}^{(2)}_{s}}{2!}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}}{n!}\end{pmatrix}=\begin{pmatrix}\hat{\bm{g}}_{i_{1}}-\hat{\bm{g}}_{s}\\ \hat{\bm{g}}_{i_{2}}-\hat{\bm{g}}_{s}\\ \vdots\\ \hat{\bm{g}}_{i_{n}}-\hat{\bm{g}}_{s}\end{pmatrix} (30)

where 𝒈^=𝒈θ(𝒙^λ,λ)\hat{\bm{g}}_{\cdot}=\bm{g}_{\theta}(\hat{\bm{x}}_{\lambda_{\cdot}},\lambda_{\cdot}). Under Assumption B.4, we know 𝒈^𝒈=𝒪(Δλ)\hat{\bm{g}}_{\cdot}-\bm{g}_{\cdot}=\mathcal{O}(\Delta_{\lambda_{\cdot}}). Thus, under Assumption B.1, Assumption B.2 and Assumption B.3, similarly to the derivation in the previous section, we have

(𝒈^s(1)𝒈s(1)𝒈^s(2)𝒈s(2)2!𝒈^s(n)𝒈s(n)n!)=(h1h2hn)(r1r12r1nr2r22r2nrnrn2rnn)1(𝒪(hn+1+Δs+Δλi1)𝒪(hn+1+Δs+Δλi2)𝒪(hn+1+Δs+Δλin))\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}-\bm{g}^{(1)}_{s}\\ \frac{\hat{\bm{g}}^{(2)}_{s}-\bm{g}^{(2)}_{s}}{2!}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}-\bm{g}^{(n)}_{s}}{n!}\end{pmatrix}=\begin{pmatrix}h^{-1}&&&\\ &h^{-2}&&\\ &&\ddots&\\ &&&h^{-n}\\ \end{pmatrix}\begin{pmatrix}r_{1}&r_{1}^{2}&\cdots&r_{1}^{n}\\ r_{2}&r_{2}^{2}&\cdots&r_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ r_{n}&r_{n}^{2}&\cdots&r_{n}^{n}\end{pmatrix}^{-1}\begin{pmatrix}\mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{1}}})\\ \mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{2}}})\\ \vdots\\ \mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{n}}})\end{pmatrix} (31)

Besides, under Assumption B.2, the orders of the other coefficients are the same as those obtained in the previous section:

𝑨(λs,λt)=𝒪(1),λsλt𝑬λs(λ)(λλs)kdλ=𝒪(hk+1)\bm{A}(\lambda_{s},\lambda_{t})=\mathcal{O}(1),\quad\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda=\mathcal{O}(h^{k+1}) (32)

Thus

k=1n𝒈^θ(k)(𝒙λs,λs)𝒈θ(k)(𝒙λs,λs)k!λsλt𝑬λs(λ)(λλs)kdλ\displaystyle\sum_{k=1}^{n}\frac{\hat{\bm{g}}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})-\bm{g}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})}{k!}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{k}\mathrm{d}\lambda (33)
=\displaystyle= (λsλt𝑬λs(λ)(λλs)1dλλsλt𝑬λs(λ)(λλs)2dλλsλt𝑬λs(λ)(λλs)ndλ)(𝒈^s(1)𝒈s(1)𝒈^s(2)𝒈s(2)2!𝒈^s(n)𝒈s(n)n!)\displaystyle\begin{pmatrix}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{1}\mathrm{d}\lambda\\ \int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{2}\mathrm{d}\lambda\\ \vdots\\ \int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)(\lambda-\lambda_{s})^{n}\mathrm{d}\lambda\end{pmatrix}^{\top}\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}-\bm{g}^{(1)}_{s}\\ \frac{\hat{\bm{g}}^{(2)}_{s}-\bm{g}^{(2)}_{s}}{2!}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}-\bm{g}^{(n)}_{s}}{n!}\end{pmatrix}
=\displaystyle= (𝒪(h2)𝒪(h3)𝒪(hn+1))(h1h2hn)(r1r12r1nr2r22r2nrnrn2rnn)1(𝒪(hn+1+Δs+Δλi1)𝒪(hn+1+Δs+Δλi2)𝒪(hn+1+Δs+Δλin))\displaystyle\begin{pmatrix}\mathcal{O}(h^{2})\\ \mathcal{O}(h^{3})\\ \vdots\\ \mathcal{O}(h^{n+1})\end{pmatrix}^{\top}\begin{pmatrix}h^{-1}&&&\\ &h^{-2}&&\\ &&\ddots&\\ &&&h^{-n}\\ \end{pmatrix}\begin{pmatrix}r_{1}&r_{1}^{2}&\cdots&r_{1}^{n}\\ r_{2}&r_{2}^{2}&\cdots&r_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ r_{n}&r_{n}^{2}&\cdots&r_{n}^{n}\end{pmatrix}^{-1}\begin{pmatrix}\mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{1}}})\\ \mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{2}}})\\ \vdots\\ \mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{n}}})\\ \end{pmatrix}
=\displaystyle= (𝒪(h)𝒪(h)𝒪(h))(r1r12r1nr2r22r2nrnrn2rnn)1(𝒪(hn+1+Δs+Δλi1)𝒪(hn+1+Δs+Δλi2)𝒪(hn+1+Δs+Δλin))\displaystyle\begin{pmatrix}\mathcal{O}(h)\\ \mathcal{O}(h)\\ \vdots\\ \mathcal{O}(h)\end{pmatrix}^{\top}\begin{pmatrix}r_{1}&r_{1}^{2}&\cdots&r_{1}^{n}\\ r_{2}&r_{2}^{2}&\cdots&r_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ r_{n}&r_{n}^{2}&\cdots&r_{n}^{n}\end{pmatrix}^{-1}\begin{pmatrix}\mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{1}}})\\ \mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{2}}})\\ \vdots\\ \mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{n}}})\\ \end{pmatrix}
=\displaystyle= k=1n𝒪(h)𝒪(hn+1+Δs+Δλik)\displaystyle\sum_{k=1}^{n}\mathcal{O}(h)\mathcal{O}(h^{n+1}+\Delta_{s}+\Delta_{\lambda_{i_{k}}})

Combining Eq. (29), Eq. (32) and Eq. (33), we can obtain the conclusion in Eq. (28). ∎

Then we prove Theorem 3.3 below.

Proof.

(Proof of Theorem 3.3)

As we have discussed, the predictor step from tm1t_{m-1} to tmt_{m} is a special case of the local approximation Eq. (14) with (tin,,ti1,s,t)=(tmn1,,tm2,tm1,tm)(t_{i_{n}},\dots,t_{i_{1}},s,t)=(t_{m-n-1},\dots,t_{m-2},t_{m-1},t_{m}). By Lemma B.7 we have

Δm=αtm𝑨(λtm1,λtm)αtm1Δm1+𝒪(h)(k=0n𝒪(Δmk1)+𝒪(hn+1))\Delta_{m}=\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}\Delta_{m-1}+\mathcal{O}(h)\left(\sum_{k=0}^{n}\mathcal{O}(\Delta_{m-k-1})+\mathcal{O}(h^{n+1})\right) (34)

It follows that there exist constants C,C0C,C_{0} independent of hh, such that

|Δm|(αtm𝑨(λtm1,λtm)αtm1+Ch)|Δm1|+Chk=0n|Δmk1|+C0hn+2|\Delta_{m}|\leq\left(\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}+Ch\right)|\Delta_{m-1}|+Ch\sum_{k=0}^{n}|\Delta_{m-k-1}|+C_{0}h^{n+2} (35)

Denote fmmax0im|Δi|f_{m}\coloneqq\max_{0\leq i\leq m}|\Delta_{i}|, we then have

|Δm|(αtm𝑨(λtm1,λtm)αtm1+C1h)fm1+C0hn+2|\Delta_{m}|\leq\left(\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}+C_{1}h\right)f_{m-1}+C_{0}h^{n+2} (36)

Since αtm𝑨(λtm1,λtm)αtm11\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}\rightarrow 1 when h0h\rightarrow 0 and it has bounded first-order derivative due to Assumption B.2, there exists a constant C2C_{2}, so that for any CC2C\geq C_{2}, αtm𝑨(λtm1,λtm)αtm1+Ch>1\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}+Ch>1 for sufficiently small hh. Thus, by taking C3=max{C1,C2}C_{3}=\max\{C_{1},C_{2}\}, we have

fm(αtm𝑨(λtm1,λtm)αtm1+C3h)fm1+C0hn+2f_{m}\leq\left(\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}+C_{3}h\right)f_{m-1}+C_{0}h^{n+2} (37)

Denote Am1αtm𝑨(λtm1,λtm)αtm1+C3hA_{m-1}\coloneqq\frac{\alpha_{t_{m}}\bm{A}(\lambda_{t_{m-1}},\lambda_{t_{m}})}{\alpha_{t_{m-1}}}+C_{3}h, by repeating Eq. (37), we have

fM(i=nM1Ai)fn+(i=n+1Mj=iM1Aj)C0hn+2f_{M}\leq\left(\prod_{i=n}^{M-1}A_{i}\right)f_{n}+\left(\sum_{i=n+1}^{M}\prod_{j=i}^{M-1}A_{j}\right)C_{0}h^{n+2} (38)

By Assumption B.5, h=𝒪(1/M)h=\mathcal{O}(1/M), so we have

i=nM1Ai\displaystyle\prod_{i=n}^{M-1}A_{i} =αtM𝑨(λtn,λtM)αtni=nM1(1+αti1C3hαti𝑨(λti1,λti))\displaystyle=\frac{\alpha_{t_{M}}\bm{A}(\lambda_{t_{n}},\lambda_{t_{M}})}{\alpha_{t_{n}}}\prod_{i=n}^{M-1}\left(1+\frac{\alpha_{t_{i-1}}C_{3}h}{\alpha_{t_{i}}\bm{A}(\lambda_{t_{i-1}},\lambda_{t_{i}})}\right) (39)
αtM𝑨(λtn,λtM)αtni=nM1(1+αti1C4αti𝑨(λti1,λti)M)\displaystyle\leq\frac{\alpha_{t_{M}}\bm{A}(\lambda_{t_{n}},\lambda_{t_{M}})}{\alpha_{t_{n}}}\prod_{i=n}^{M-1}\left(1+\frac{\alpha_{t_{i-1}}C_{4}}{\alpha_{t_{i}}\bm{A}(\lambda_{t_{i-1}},\lambda_{t_{i}})M}\right)
αtM𝑨(λtn,λtM)αtn(1+σM)Mn\displaystyle\leq\frac{\alpha_{t_{M}}\bm{A}(\lambda_{t_{n}},\lambda_{t_{M}})}{\alpha_{t_{n}}}\left(1+\frac{\sigma}{M}\right)^{M-n}
C5eσ\displaystyle\leq C_{5}e^{\sigma}

where σ=maxniM1αti1C4αti𝑨(λti1,λti)\sigma=\max_{n\leq i\leq M-1}\frac{\alpha_{t_{i-1}}C_{4}}{\alpha_{t_{i}}\bm{A}(\lambda_{t_{i-1}},\lambda_{t_{i}})}. Then denote βmaxn+1iMαtM𝑨(λti,λtM)αti\beta\coloneqq\max_{n+1\leq i\leq M}\frac{\alpha_{t_{M}}\bm{A}(\lambda_{t_{i}},\lambda_{t_{M}})}{\alpha_{t_{i}}}, we have

i=n+1Mj=iM1Aj\displaystyle\sum_{i=n+1}^{M}\prod_{j=i}^{M-1}A_{j} i=n+1MαtM𝑨(λti,λtM)αti(1+σM)Mi\displaystyle\leq\sum_{i=n+1}^{M}\frac{\alpha_{t_{M}}\bm{A}(\lambda_{t_{i}},\lambda_{t_{M}})}{\alpha_{t_{i}}}\left(1+\frac{\sigma}{M}\right)^{M-i} (40)
βi=0Mn1(1+σM)i\displaystyle\leq\beta\sum_{i=0}^{M-n-1}\left(1+\frac{\sigma}{M}\right)^{i}
=βMσ[(1+σM)Mn1]\displaystyle=\frac{\beta M}{\sigma}\left[\left(1+\frac{\sigma}{M}\right)^{M-n}-1\right]
C6(eσ1)M\displaystyle\leq C_{6}\left(e^{\sigma}-1\right)M

Then we substitute Eq. (39) and Eq. (40) into Eq. (38). Note that M=𝒪(1/h)M=\mathcal{O}(1/h) by Assumption B.5, and fn=𝒪(hn+1)f_{n}=\mathcal{O}(h^{n+1}) by Assumption B.6, finally we conclude that |ΔM|fM=𝒪(hn+1)|\Delta_{M}|\leq f_{M}=\mathcal{O}(h^{n+1}). ∎

B.3 Pseudo-Order Solver

First, we provide a lemma that gives the explicit solution to Eq. (15).

Lemma B.8.

The solution to Eq. (15) is

𝒈^s(k)k!=p=1k𝒈ip𝒈i0q=0,qpk(δpδq)\frac{\hat{\bm{g}}_{s}^{(k)}}{k!}=\sum_{p=1}^{k}\frac{\bm{g}_{i_{p}}-\bm{g}_{i_{0}}}{\prod_{q=0,q\neq p}^{k}(\delta_{p}-\delta_{q})} (41)
Proof.

Denote

𝑹k=(δ1δ12δ1kδ2δ22δ2kδkδk2δkk)\bm{R}_{k}=\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{k}\\ \delta_{2}&\delta_{2}^{2}&\cdots&\delta_{2}^{k}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{k}&\delta_{k}^{2}&\cdots&\delta_{k}^{k}\end{pmatrix} (42)

Then the solution to Eq. (15) can be expressed as

𝒈^s(k)k!=p=1k(𝑹k1)kp(𝒈ip𝒈i0)\frac{\hat{\bm{g}}_{s}^{(k)}}{k!}=\sum_{p=1}^{k}(\bm{R}_{k}^{-1})_{kp}(\bm{g}_{i_{p}}-\bm{g}_{i_{0}}) (43)

where (𝑹k1)kp(\bm{R}_{k}^{-1})_{kp} is the element of 𝑹k1\bm{R}_{k}^{-1} at the kk-th row and the pp-th column. From previous studies of the inversion of the Vandermonde matrix [12], we know

(𝑹k1)kp=1δpq=1,qpk(δpδq)=1(δpδ0)q=1,qpk(δpδq)(\bm{R}_{k}^{-1})_{kp}=\frac{1}{\delta_{p}\prod_{q=1,q\neq p}^{k}(\delta_{p}-\delta_{q})}=\frac{1}{(\delta_{p}-\delta_{0})\prod_{q=1,q\neq p}^{k}(\delta_{p}-\delta_{q})} (44)

Substituting Eq. (44) into Eq. (43), we finish the proof. ∎
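As a quick numerical sanity check (a sketch with synthetic values, not part of the proof), the closed form of Eq. (41) can be verified against a direct solve of the Vandermonde system:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
delta = rng.uniform(0.1, 1.0, size=k + 1)   # delta_0, ..., delta_k (distinct)
delta[0] = 0.0                              # delta_0 = 0 corresponds to lambda_s itself
g = rng.normal(size=k + 1)                  # g_{i_0}, ..., g_{i_k}

# Direct solve of the system in Eq. (42)-(43); the last entry of c equals g^(k)/k!
R = np.stack([delta[1:] ** j for j in range(1, k + 1)], axis=1)
c = np.linalg.solve(R, g[1:] - g[0])

# Closed form of Eq. (41)
closed = sum(
    (g[p] - g[0]) / np.prod([delta[p] - delta[q] for q in range(k + 1) if q != p])
    for p in range(1, k + 1)
)
assert np.allclose(c[-1], closed)
```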

Then we prove Theorem 3.4 below:

Proof.

(Proof of Theorem 3.4) First, we use mathematical induction to prove that

Dl(k)=D~l(k)p=1k𝒈il+p𝒈ilq=0,qpk(δl+pδl+q),1kn,0lnkD_{l}^{(k)}=\tilde{D}_{l}^{(k)}\coloneqq\sum_{p=1}^{k}\frac{\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}}}{\prod_{q=0,q\neq p}^{k}(\delta_{l+p}-\delta_{l+q})},\quad 1\leq k\leq n,0\leq l\leq n-k (45)

For k=1k=1, Eq. (45) holds by the definition of Dl(k)D_{l}^{(k)}. Suppose the equation holds for kk; we then prove that it holds for k+1k+1.

Define the Lagrange polynomial which passes (δl+p,𝒈il+p𝒈il)(\delta_{l+p},\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}}) for 0pk0\leq p\leq k:

Pl(k)(x)p=1k(𝒈il+p𝒈il)q=0,qpkxδl+qδl+pδl+q,1kn,0lnkP_{l}^{(k)}(x)\coloneqq\sum_{p=1}^{k}\left(\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}}\right)\prod_{q=0,q\neq p}^{k}\frac{x-\delta_{l+q}}{\delta_{l+p}-\delta_{l+q}},\quad 1\leq k\leq n,0\leq l\leq n-k (46)

Then D~l(k)=Pl(k)(x)[xk]\tilde{D}_{l}^{(k)}=P_{l}^{(k)}(x)[x^{k}] is the coefficient of the highest-order term xkx^{k} in Pl(k)(x)P_{l}^{(k)}(x). We then prove that Pl(k)(x)P_{l}^{(k)}(x) satisfies the following recurrence relation:

Pl(k)(x)=P~l(k)(x)(xδl)Pl+1(k1)(x)(xδl+k)Pl(k1)(x)δl+kδlP_{l}^{(k)}(x)=\tilde{P}_{l}^{(k)}(x)\coloneqq\frac{(x-\delta_{l})P_{l+1}^{(k-1)}(x)-(x-\delta_{l+k})P_{l}^{(k-1)}(x)}{\delta_{l+k}-\delta_{l}} (47)

By definition, Pl+1(k1)(x)P_{l+1}^{(k-1)}(x) is the (k1)(k-1)-th order polynomial which passes (δl+p,𝒈il+p𝒈il)(\delta_{l+p},\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}}) for 1pk1\leq p\leq k, and Pl(k1)(x)P_{l}^{(k-1)}(x) is the (k1)(k-1)-th order polynomial which passes (δl+p,𝒈il+p𝒈il)(\delta_{l+p},\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}}) for 0pk10\leq p\leq k-1.

Thus, for 1pk11\leq p\leq k-1, we have

P~l(k)(δl+p)=(δl+pδl)Pl+1(k1)(δl+p)(δl+pδl+k)Pl(k1)(δl+p)δl+kδl=𝒈il+p𝒈il\tilde{P}_{l}^{(k)}(\delta_{l+p})=\frac{(\delta_{l+p}-\delta_{l})P_{l+1}^{(k-1)}(\delta_{l+p})-(\delta_{l+p}-\delta_{l+k})P_{l}^{(k-1)}(\delta_{l+p})}{\delta_{l+k}-\delta_{l}}=\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}} (48)

For p=0p=0, we have

P~l(k)(δl)=(δlδl)Pl+1(k1)(δl)(δlδl+k)Pl(k1)(δl)δl+kδl=𝒈il𝒈il\tilde{P}_{l}^{(k)}(\delta_{l})=\frac{(\delta_{l}-\delta_{l})P_{l+1}^{(k-1)}(\delta_{l})-(\delta_{l}-\delta_{l+k})P_{l}^{(k-1)}(\delta_{l})}{\delta_{l+k}-\delta_{l}}=\bm{g}_{i_{l}}-\bm{g}_{i_{l}} (49)

for p=kp=k, we have

P~l(k)(δl+k)=(δl+kδl)Pl+1(k1)(δl+k)(δl+kδl+k)Pl(k1)(δl+k)δl+kδl=𝒈il+k𝒈il\tilde{P}_{l}^{(k)}(\delta_{l+k})=\frac{(\delta_{l+k}-\delta_{l})P_{l+1}^{(k-1)}(\delta_{l+k})-(\delta_{l+k}-\delta_{l+k})P_{l}^{(k-1)}(\delta_{l+k})}{\delta_{l+k}-\delta_{l}}=\bm{g}_{i_{l+k}}-\bm{g}_{i_{l}} (50)

Therefore, P~l(k)(x)\tilde{P}_{l}^{(k)}(x) is the kk-th order polynomial which passes k+1k+1 distinct points (δl+p,𝒈il+p𝒈il)(\delta_{l+p},\bm{g}_{i_{l+p}}-\bm{g}_{i_{l}}) for 0pk0\leq p\leq k. Due to the uniqueness of the Lagrange polynomial, we can conclude that Pl(k)(x)=P~l(k)(x)P_{l}^{(k)}(x)=\tilde{P}_{l}^{(k)}(x). By taking the coefficient of the highest-order term, we obtain

D~l(k)=D~l+1(k1)D~l(k1)δl+kδl\tilde{D}_{l}^{(k)}=\frac{\tilde{D}_{l+1}^{(k-1)}-\tilde{D}_{l}^{(k-1)}}{\delta_{l+k}-\delta_{l}} (51)

where by the induction hypothesis we have Dl+1(k1)=D~l+1(k1),Dl(k1)=D~l(k1)D_{l+1}^{(k-1)}=\tilde{D}_{l+1}^{(k-1)},D_{l}^{(k-1)}=\tilde{D}_{l}^{(k-1)}. Comparing Eq. (51) with the recurrence relation of Dl(k)D_{l}^{(k)} in Eq. (16), it follows that Dl(k)=D~l(k)D_{l}^{(k)}=\tilde{D}_{l}^{(k)}, which completes the mathematical induction.

Finally, by comparing the expression for D~l(k)\tilde{D}_{l}^{(k)} in Eq. (45) and the expression for 𝒈^s(k)\hat{\bm{g}}_{s}^{(k)} in Lemma B.8, we can conclude that 𝒈^s(k)=k!D0(k)\hat{\bm{g}}_{s}^{(k)}=k!D_{0}^{(k)}. ∎
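The recurrence in Eq. (16) is a divided-difference construction; the following minimal sketch (synthetic values, checked only for l=0, not part of the proof) verifies numerically that it reproduces the closed form of Eq. (45):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
delta = np.sort(rng.uniform(-1.0, 1.0, size=n + 1))   # delta_{i_0}, ..., delta_{i_n}
g = rng.normal(size=n + 1)                             # g_{i_0}, ..., g_{i_n}

# First-order differences D^(1)_l = (g_{i_{l+1}} - g_{i_l}) / (delta_{l+1} - delta_l)
tables = {1: [(g[l + 1] - g[l]) / (delta[l + 1] - delta[l]) for l in range(n)]}
# Recurrence of Eq. (16): D^(k)_l = (D^(k-1)_{l+1} - D^(k-1)_l) / (delta_{l+k} - delta_l)
for k in range(2, n + 1):
    prev = tables[k - 1]
    tables[k] = [(prev[l + 1] - prev[l]) / (delta[l + k] - delta[l]) for l in range(n - k + 1)]

def closed_form(k):
    # Closed form of Eq. (45) for l = 0
    return sum(
        (g[p] - g[0]) / np.prod([delta[p] - delta[q] for q in range(k + 1) if q != p])
        for p in range(1, k + 1)
    )

assert all(np.isclose(tables[k][0], closed_form(k)) for k in range(1, n + 1))
```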

B.4 Local Unbiasedness

Proof.

(Proof of Theorem 3.2) Subtracting the local exact solution in Eq. (9) from the (n+1)(n+1)-th order local approximation in Eq. (14), we have the local truncation error

𝒙^t𝒙t\displaystyle\hat{\bm{x}}_{t}-\bm{x}_{t} =αt𝑨(λs,λt)(λsλt𝑬λs(λ)𝒈θ(𝒙λ,λ)dλk=0n𝒈^θ(k)(𝒙λs,λs)λsλt𝑬λs(λ)(λλs)kk!dλ)\displaystyle=\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\left(\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)\mathrm{d}\lambda-\sum_{k=0}^{n}\hat{\bm{g}}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda\right) (52)
=αt𝑨(λs,λt)λsλt𝑬λs(λ)(𝒈θ(𝒙λ,λ)𝒈θ(𝒙λs,λs))dλ\displaystyle=\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\left(\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\right)\mathrm{d}\lambda
αt𝑨(λs,λt)k=1n𝒈^θ(k)(𝒙λs,λs)λsλt𝑬λs(λ)(λλs)kk!dλ\displaystyle-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\sum_{k=1}^{n}\hat{\bm{g}}^{(k)}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda
=αt𝑨(λs,λt)λsλt𝑬λs(λ)(𝒈θ(𝒙λ,λ)𝒈θ(𝒙λs,λs))dλ\displaystyle=\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\left(\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\right)\mathrm{d}\lambda
αt𝑨(λs,λt)k=1n(l=1n(𝑹n1)kl(𝒈θ(𝒙λil,λil)𝒈θ(𝒙λs,λs)))λsλt𝑬λs(λ)(λλs)kk!dλ\displaystyle-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\sum_{k=1}^{n}\left(\sum_{l=1}^{n}(\bm{R}_{n}^{-1})_{kl}(\bm{g}_{\theta}(\bm{x}_{\lambda_{i_{l}}},\lambda_{i_{l}})-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s}))\right)\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda

where 𝒙λ\bm{x}_{\lambda} is on the ground-truth ODE trajectory passing 𝒙λs\bm{x}_{\lambda_{s}}, and (𝑹n1)kl(\bm{R}_{n}^{-1})_{kl} is the element of the inverse matrix 𝑹n1\bm{R}_{n}^{-1} at the kk-th row and the ll-th column, as discussed in the proof of Lemma B.8. By Newton-Leibniz theorem, we have

𝒈θ(𝒙λ,λ)𝒈θ(𝒙λs,λs)=λsλ𝒈θ(1)(𝒙τ,τ)dτ\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})=\int_{\lambda_{s}}^{\lambda}\bm{g}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)\mathrm{d}\tau (53)

Also, since 𝒙λil,l=1,,n\bm{x}_{\lambda_{i_{l}}},l=1,\dots,n are on the ground-truth ODE trajectory passing 𝒙λs\bm{x}_{\lambda_{s}}, we have

𝒈θ(𝒙λil,λil)𝒈θ(𝒙λs,λs)=λsλil𝒈θ(1)(𝒙τ,τ)dτ\bm{g}_{\theta}(\bm{x}_{\lambda_{i_{l}}},\lambda_{i_{l}})-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})=\int_{\lambda_{s}}^{\lambda_{i_{l}}}\bm{g}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)\mathrm{d}\tau (54)

where

𝒈θ(1)(𝒙τ,τ)=eλsτ𝒔rdr(𝒇θ(1)(𝒙τ,τ)𝒔τ𝒇θ(𝒙τ,τ)𝒃τ)\bm{g}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)=e^{-\int_{\lambda_{s}}^{\tau}\bm{s}_{r}\mathrm{d}r}\left(\bm{f}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)-\bm{s}_{\tau}\bm{f}_{\theta}(\bm{x}_{\tau},\tau)-\bm{b}_{\tau}\right) (55)

Note that 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} are the solution to the least squares problem in Eq. (11), which ensures 𝔼pτθ(𝒙τ)[𝒇θ(1)(𝒙τ,τ)𝒔τ𝒇θ(𝒙τ,τ)𝒃τ]=0\mathbb{E}_{p_{\tau}^{\theta}(\bm{x}_{\tau})}\left[\bm{f}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)-\bm{s}_{\tau}\bm{f}_{\theta}(\bm{x}_{\tau},\tau)-\bm{b}_{\tau}\right]=0. It follows that 𝔼pλsθ(𝒙λs)[𝒇θ(1)(𝒙τ,τ)𝒔τ𝒇θ(𝒙τ,τ)𝒃τ]=0\mathbb{E}_{p_{\lambda_{s}}^{\theta}(\bm{x}_{\lambda_{s}})}\left[\bm{f}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)-\bm{s}_{\tau}\bm{f}_{\theta}(\bm{x}_{\tau},\tau)-\bm{b}_{\tau}\right]=0, since 𝒙τ\bm{x}_{\tau} is on the ground-truth ODE trajectory passing 𝒙λs\bm{x}_{\lambda_{s}}. Therefore, we have 𝔼pλsθ(𝒙λs)[𝒈θ(𝒙λ,λ)𝒈θ(𝒙λs,λs)]=0\mathbb{E}_{p_{\lambda_{s}}^{\theta}(\bm{x}_{\lambda_{s}})}\left[\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\right]=0 and 𝔼pλsθ(𝒙λs)[𝒈θ(𝒙λil,λil)𝒈θ(𝒙λs,λs)]=0\mathbb{E}_{p_{\lambda_{s}}^{\theta}(\bm{x}_{\lambda_{s}})}\left[\bm{g}_{\theta}(\bm{x}_{\lambda_{i_{l}}},\lambda_{i_{l}})-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\right]=0. Substituting them into Eq. (52), we conclude that 𝔼pλsθ(𝒙λs)[𝒙^t𝒙t]=0\mathbb{E}_{p_{\lambda_{s}}^{\theta}(\bm{x}_{\lambda_{s}})}\left[\hat{\bm{x}}_{t}-\bm{x}_{t}\right]=0. ∎

Appendix C Implementation Details

C.1 Computing the EMS and Related Integrals in the ODE Formulation

The ODE formulation and local approximation require computing some complex integrals involving 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda}. In this section, we give details on how to estimate 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda}^{*},\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*} from a few datapoints, and on how to use them to compute the integrals efficiently.

C.1.1 Computing the EMS

First, for the computation of 𝒍λ\bm{l}_{\lambda}^{*} in Eq. (5), note that

𝒙𝑵θ(𝒙λ,λ)=σλ𝒙ϵ(𝒙λ,λ)diag(𝒍λ)\nabla_{\bm{x}}\bm{N}_{\theta}(\bm{x}_{\lambda},\lambda)=\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}(\bm{x}_{\lambda},\lambda)-\mbox{diag}(\bm{l}_{\lambda}) (56)

Since diag(𝒍λ)\mbox{diag}(\bm{l}_{\lambda}) is a diagonal matrix, minimizing 𝔼pλθ(𝒙λ)[𝒙𝑵θ(𝒙λ,λ)F2]\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\|\nabla_{\bm{x}}\bm{N}_{\theta}(\bm{x}_{\lambda},\lambda)\|_{F}^{2}\right] is equivalent to minimizing 𝔼pλθ(𝒙λ)[diag1(𝒙𝑵θ(𝒙λ,λ))22]=𝔼pλθ(𝒙λ)[diag1(σλ𝒙ϵ(𝒙λ,λ))𝒍λ22]\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\|\mbox{diag}^{-1}(\nabla_{\bm{x}}\bm{N}_{\theta}(\bm{x}_{\lambda},\lambda))\|_{2}^{2}\right]=\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\|\mbox{diag}^{-1}(\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}(\bm{x}_{\lambda},\lambda))-\bm{l}_{\lambda}\|_{2}^{2}\right], where diag1\mbox{diag}^{-1} denotes the operator that takes the diagonal of a matrix as a vector. Thus we have 𝒍λ=𝔼pλθ(𝒙λ)[diag1(σλ𝒙ϵ(𝒙λ,λ))]\bm{l}_{\lambda}^{*}=\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\mbox{diag}^{-1}(\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}(\bm{x}_{\lambda},\lambda))\right].

However, this formula for 𝒍λ\bm{l}_{\lambda}^{*} requires computing the diagonal of the full Jacobian of the noise prediction model, which typically has 𝒪(d2)\mathcal{O}(d^{2}) time complexity for dd-dimensional data and is unacceptable when dd is large. Fortunately, the cost can be reduced to 𝒪(d)\mathcal{O}(d) by utilizing stochastic diagonal estimators and employing the efficient Jacobian-vector-product operator provided by forward-mode automatic differentiation in deep learning frameworks.

For a dd-by-dd matrix 𝑫\bm{D}, its diagonal can be unbiasedly estimated by [4]

diag1(𝑫)=[k=1s(𝑫𝒗k)𝒗k][k=1s𝒗k𝒗k]\mbox{diag}^{-1}(\bm{D})=\left[\sum_{k=1}^{s}(\bm{D}\bm{v}_{k})\odot\bm{v}_{k}\right]\oslash\left[\sum_{k=1}^{s}\bm{v}_{k}\odot\bm{v}_{k}\right] (57)

where 𝒗kp(𝒗)\bm{v}_{k}\sim p(\bm{v}) are dd-dimensional i.i.d. samples with zero mean, \odot is the element-wise multiplication i.e., Hadamard product, and \oslash is the element-wise division. The stochastic diagonal estimator is analogous to the famous Hutchinson’s trace estimator [20]. By taking p(𝒗)p(\bm{v}) as the Rademacher distribution, we have 𝒗k𝒗k=𝟏\bm{v}_{k}\odot\bm{v}_{k}=\bm{1}, and the denominator can be omitted. For simplicity, we use regular multiplication and division symbols, assuming they are element-wise between vectors. Then 𝒍λ\bm{l}_{\lambda}^{*} can be expressed as:

𝒍λ=𝔼pλθ(𝒙λ)p(𝒗)[(σλ𝒙ϵθ(𝒙λ,λ)𝒗)𝒗]\bm{l}_{\lambda}^{*}=\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})p(\bm{v})}\left[(\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\bm{v})\bm{v}\right] (58)

which is an unbiased estimate when we replace the expectation with the mean over finite samples 𝒙λpλθ(𝒙λ),𝒗p(𝒗)\bm{x}_{\lambda}\sim p_{\lambda}^{\theta}(\bm{x}_{\lambda}),\bm{v}\sim p(\bm{v}). The process for estimating 𝒍λ\bm{l}_{\lambda}^{*} can easily be parallelized across multiple devices by computing (σλ𝒙ϵθ(𝒙λ,λ)𝒗)𝒗\sum(\sigma_{\lambda}\nabla_{\bm{x}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\bm{v})\bm{v} on separate datapoints and gathering the results at the end.
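A minimal sketch of this estimator (Eq. (58)) is given below; eps_model, x, lam and sigma are placeholders for the pretrained noise predictor, a batch of samples 𝒙λ, the half-logSNR and σλ, and torch.autograd.functional.jvp is used instead of forward-mode AD for brevity:

```python
import torch

def estimate_l(eps_model, x, lam, sigma):
    # Rademacher probe vectors: v ⊙ v = 1, so the denominator of Eq. (57) drops out.
    v = (torch.rand_like(x) < 0.5).to(x.dtype) * 2 - 1
    # JVP computes (d eps / d x) v without materializing the full Jacobian
    # (assuming the model treats batch elements independently).
    _, jvp = torch.autograd.functional.jvp(lambda inp: eps_model(inp, lam), (x,), (v,))
    # Monte Carlo average of (sigma * (d eps / d x) v) ⊙ v over the batch (Eq. (58)).
    return (sigma * jvp * v).mean(dim=0)
```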

Next, for the computation of 𝒔λ,𝒃λ\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*} in Eq. (11), note that it is a simple least squares problem. By taking partial derivatives w.r.t. 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} and setting them to 0, we have

{𝔼pλθ(𝒙λ)[(𝒇θ(1)(𝒙λ,λ)𝒔λ𝒇θ(𝒙λ,λ)𝒃λ)𝒇θ(𝒙λ,λ)]=0𝔼pλθ(𝒙λ)[𝒇θ(1)(𝒙λ,λ)𝒔λ𝒇θ(𝒙λ,λ)𝒃λ]=0\begin{cases}\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\left(\bm{f}^{(1)}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{s}_{\lambda}^{*}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{b}_{\lambda}^{*}\right)\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)\right]=0\\ \mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\bm{f}^{(1)}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{s}_{\lambda}^{*}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{b}_{\lambda}^{*}\right]=0\end{cases} (59)

And we obtain the explicit formula for 𝒔λ,𝒃λ\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*}

𝒔λ\displaystyle\bm{s}_{\lambda}^{*} =𝔼pλθ(𝒙λ)[𝒇θ(𝒙λ,λ)𝒇θ(1)(𝒙λ,λ)]𝔼pλθ(𝒙λ)[𝒇θ(𝒙λ,λ)]𝔼pλθ(𝒙λ)[𝒇θ(1)(𝒙λ,λ)]𝔼pλθ(𝒙λ)[𝒇θ(𝒙λ,λ)𝒇θ(𝒙λ,λ)]𝔼pλθ(𝒙λ)[𝒇θ(𝒙λ,λ)]𝔼pλθ(𝒙λ)[𝒇θ(𝒙λ,λ)]\displaystyle=\frac{\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)\bm{f}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda)\right]-\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)\right]\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}\left[\bm{f}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda)\right]}{\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}[\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)]-\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}[\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)]\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}[\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)]} (60)
𝒃λ\displaystyle\bm{b}_{\lambda}^{*} =𝔼pλθ(𝒙λ)[𝒇(1)(𝒙λ,λ)]𝒔λ𝔼pλθ(𝒙λ)[𝒇θ(𝒙λ,λ)]\displaystyle=\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}[\bm{f}^{(1)}(\bm{x}_{\lambda},\lambda)]-\bm{s}_{\lambda}^{*}\mathbb{E}_{p_{\lambda}^{\theta}(\bm{x}_{\lambda})}[\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)] (61)

which are unbiased least squares estimators when we replace the expectation with the mean over finite samples 𝒙λpλθ(𝒙λ)\bm{x}_{\lambda}\sim p_{\lambda}^{\theta}(\bm{x}_{\lambda}). Also, the process for estimating 𝒔λ,𝒃λ\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*} can be parallelized across multiple devices by computing 𝒇θ,𝒇θ(1),𝒇θ𝒇θ,𝒇θ𝒇θ(1)\sum\bm{f}_{\theta},\sum\bm{f}^{(1)}_{\theta},\sum\bm{f}_{\theta}\bm{f}_{\theta},\sum\bm{f}_{\theta}\bm{f}^{(1)}_{\theta} on separate datapoints and gathering the results at the end. Thus, the estimation of 𝒔λ,𝒃λ\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*} involves evaluating 𝒇θ\bm{f}_{\theta} and 𝒇θ(1)\bm{f}_{\theta}^{(1)} at 𝒙λ\bm{x}_{\lambda}. 𝒇θ\bm{f}_{\theta} is a direct transformation of ϵθ\bm{\epsilon}_{\theta} and requires a single forward pass. For 𝒇θ(1)\bm{f}_{\theta}^{(1)}, we have

𝒇θ(1)(𝒙λ,λ)\displaystyle\bm{f}^{(1)}_{\theta}(\bm{x}_{\lambda},\lambda) =𝒇θ(𝒙λ,λ)λ+𝒙𝒇θ(𝒙λ,λ)d𝒙λdλ\displaystyle=\frac{\partial\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)}{\partial\lambda}+\nabla_{\bm{x}}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)\frac{\mathrm{d}\bm{x}_{\lambda}}{\mathrm{d}\lambda} (62)
=eλ(ϵθ(1)(𝒙λ,λ)ϵθ(𝒙λ,λ))𝒍˙λαλα˙λ𝒍λαλ2𝒙λ𝒍λαλ(α˙λαλ𝒙λσλϵθ(𝒙λ,λ))\displaystyle=e^{-\lambda}\left(\bm{\epsilon}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda)-\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\right)-\frac{\dot{\bm{l}}_{\lambda}\alpha_{\lambda}-\dot{\alpha}_{\lambda}\bm{l}_{\lambda}}{\alpha_{\lambda}^{2}}\bm{x}_{\lambda}-\frac{\bm{l}_{\lambda}}{\alpha_{\lambda}}\left(\frac{\dot{\alpha}_{\lambda}}{\ \alpha_{\lambda}}\bm{x}_{\lambda}-\sigma_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\right)
=eλ((𝒍λ1)ϵθ(𝒙λ,λ)+ϵθ(1)(𝒙λ,λ))𝒍˙λ𝒙λαλ\displaystyle=e^{-\lambda}\left((\bm{l}_{\lambda}-1)\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)+\bm{\epsilon}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda)\right)-\frac{\dot{\bm{l}}_{\lambda}\bm{x}_{\lambda}}{\alpha_{\lambda}}

After we obtain 𝒍λ\bm{l}_{\lambda}, 𝒍˙λ\dot{\bm{l}}_{\lambda} can be estimated by finite difference. To compute ϵθ(1)(𝒙λ,λ)\bm{\epsilon}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda), we have

ϵθ(1)(𝒙λ,λ)\displaystyle\bm{\epsilon}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda) =λϵθ(𝒙λ,λ)+𝒙ϵθ(𝒙λ,λ)d𝒙λdλ\displaystyle=\partial_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)+\nabla_{\bm{x}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\frac{\mathrm{d}\bm{x}_{\lambda}}{\mathrm{d}\lambda} (63)
=λϵθ(𝒙λ,λ)+𝒙ϵθ(𝒙λ,λ)(α˙λαλ𝒙λσλϵθ(𝒙λ,λ))\displaystyle=\partial_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)+\nabla_{\bm{x}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\left(\frac{\dot{\alpha}_{\lambda}}{\ \alpha_{\lambda}}\bm{x}_{\lambda}-\sigma_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)\right)

which can also be computed with the Jacobian-vector-product operator.

In conclusion, for any λ\lambda, 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda}^{*},\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*} can be efficiently and unbiasedly estimated by sampling a few datapoints 𝒙λpλθ(𝒙λ)\bm{x}_{\lambda}\sim p_{\lambda}^{\theta}(\bm{x}_{\lambda}) and using the Jacobian-vector-product.
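As a complement, a minimal sketch of the closed-form least squares solution in Eqs. (60)-(61) is given below; f and f1 are assumed to hold 𝒇θ and 𝒇θ(1) evaluated at K datapoints (shape [K, D]), e.g., obtained with the JVPs described above:

```python
import torch

def estimate_s_b(f, f1):
    # Per-dimension least squares of Eqs. (60)-(61) over a batch of K samples.
    Ef, Ef1 = f.mean(0), f1.mean(0)
    cov = (f * f1).mean(0) - Ef * Ef1   # E[f f^(1)] - E[f] E[f^(1)]
    var = (f * f).mean(0) - Ef * Ef     # E[f f] - E[f]^2
    s = cov / var
    b = Ef1 - s * Ef
    return s, b
```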

C.1.2 Integral Precomputing

In the local approximation in Eq. (14), there are three integrals involving the EMS, which are 𝑨(λs,λt),λsλt𝑬λs(λ)𝑩λs(λ)dλ,λsλt𝑬λs(λ)(λλs)kk!dλ\bm{A}(\lambda_{s},\lambda_{t}),\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda,\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda. Define the following terms, which are also evaluated at λj0,λj1,,λjN\lambda_{j_{0}},\lambda_{j_{1}},\dots,\lambda_{j_{N}} and can be precomputed in 𝒪(N)\mathcal{O}(N) time:

𝑳λ\displaystyle\bm{L}_{\lambda} =λTλ𝒍τdτ\displaystyle=\int_{\lambda_{T}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau (64)
𝑺λ\displaystyle\bm{S}_{\lambda} =λTλ𝒔τdτ\displaystyle=\int_{\lambda_{T}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau
𝑩λ\displaystyle\bm{B}_{\lambda} =λTλeλTr𝒔τdτ𝒃rdr=λTλe𝑺r𝒃rdr\displaystyle=\int_{\lambda_{T}}^{\lambda}e^{-\int_{\lambda_{T}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r=\int_{\lambda_{T}}^{\lambda}e^{-\bm{S}_{r}}\bm{b}_{r}\mathrm{d}r
𝑪λ\displaystyle\bm{C}_{\lambda} =λTλ(eλTu(𝒍τ+𝒔τ)dτλTueλTr𝒔τdτ𝒃rdr)du=λTλe𝑳u+𝑺u𝑩udu\displaystyle=\int_{\lambda_{T}}^{\lambda}\left(e^{\int_{\lambda_{T}}^{u}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\int_{\lambda_{T}}^{u}e^{-\int_{\lambda_{T}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r\right)\mathrm{d}u=\int_{\lambda_{T}}^{\lambda}e^{\bm{L}_{u}+\bm{S}_{u}}\bm{B}_{u}\mathrm{d}u
𝑰λ\displaystyle\bm{I}_{\lambda} =λTλeλTr(𝒍τ+𝒔τ)dτdr=λTλe𝑳r+𝑺rdr\displaystyle=\int_{\lambda_{T}}^{\lambda}e^{\int_{\lambda_{T}}^{r}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\mathrm{d}r=\int_{\lambda_{T}}^{\lambda}e^{\bm{L}_{r}+\bm{S}_{r}}\mathrm{d}r

Then for any λs,λt\lambda_{s},\lambda_{t}, we can verify that the first two integrals can be expressed as

𝑨(λs,λt)\displaystyle\bm{A}(\lambda_{s},\lambda_{t}) =e𝑳λs𝑳λt\displaystyle=e^{\bm{L}_{\lambda_{s}}-\bm{L}_{\lambda_{t}}} (65)
λsλt𝑬λs(λ)𝑩λs(λ)dλ\displaystyle\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda =e𝑳λs(𝑪λt𝑪λs𝑩λs(𝑰λt𝑰λs))\displaystyle=e^{-\bm{L}_{\lambda_{s}}}(\bm{C}_{\lambda_{t}}-\bm{C}_{\lambda_{s}}-\bm{B}_{\lambda_{s}}(\bm{I}_{\lambda_{t}}-\bm{I}_{\lambda_{s}}))

which can be computed in 𝒪(1)\mathcal{O}(1) time. For the third and last integral, denote it as 𝑬λs,λt(k)\bm{E}_{\lambda_{s},\lambda_{t}}^{(k)}, i.e.,

𝑬λs,λt(k)=λsλt𝑬λs(λ)(λλs)kk!dλ\bm{E}_{\lambda_{s},\lambda_{t}}^{(k)}=\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda (66)

We need to compute it for 0kn0\leq k\leq n and for every local transition time pair (λs,λt)(\lambda_{s},\lambda_{t}) in the sampling process. For k=0k=0, we have

𝑬λs,λt(0)=e𝑳λs𝑺λs(𝑰λt𝑰λs)\bm{E}_{\lambda_{s},\lambda_{t}}^{(0)}=e^{-\bm{L}_{\lambda_{s}}-\bm{S}_{\lambda_{s}}}(\bm{I}_{\lambda_{t}}-\bm{I}_{\lambda_{s}}) (67)

which can also be computed in 𝒪(1)\mathcal{O}(1) time. But for k>0k>0, we no longer have such a simplification. Still, for any fixed timestep schedule {λi}i=0M\{\lambda_{i}\}_{i=0}^{M} used during sampling, we can use a lazy precomputing strategy: compute 𝑬λi1,λi(k),1iM\bm{E}_{\lambda_{i-1},\lambda_{i}}^{(k)},1\leq i\leq M when generating the first sample, store each value with a unique key (k,i)(k,i), and retrieve it in 𝒪(1)\mathcal{O}(1) time in subsequent sampling runs.
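A minimal NumPy sketch of this precomputation is given below, assuming the EMS 𝒍,𝒔,𝒃 are given on a grid of N+1 half-logSNRs (arrays of shape [N+1, D]); the trapezoidal rule is used as a simple stand-in for the actual quadrature, and the indices idx_s, idx_t are assumed to lie on the grid (in practice one would interpolate):

```python
import numpy as np

def cumtrapz(y, x):
    # Cumulative trapezoidal integral of y (shape [N+1, D]) over the grid x (shape [N+1]).
    dx = np.diff(x)[:, None]
    out = np.zeros_like(y)
    out[1:] = np.cumsum(0.5 * (y[1:] + y[:-1]) * dx, axis=0)
    return out

def precompute(lambdas, l, s, b):
    # The cumulative quantities L, S, B, C, I of Eq. (64).
    L = cumtrapz(l, lambdas)
    S = cumtrapz(s, lambdas)
    B = cumtrapz(np.exp(-S) * b, lambdas)
    C = cumtrapz(np.exp(L + S) * B, lambdas)
    I = cumtrapz(np.exp(L + S), lambdas)
    return L, S, B, C, I

def transition_integrals(idx_s, idx_t, L, S, B, C, I):
    # Eq. (65) and Eq. (67), each in O(1) time given the precomputed arrays.
    A = np.exp(L[idx_s] - L[idx_t])
    EB = np.exp(-L[idx_s]) * (C[idx_t] - C[idx_s] - B[idx_s] * (I[idx_t] - I[idx_s]))
    E0 = np.exp(-L[idx_s] - S[idx_s]) * (I[idx_t] - I[idx_s])
    return A, EB, E0
```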

C.2 Algorithm

We provide the pseudocode of the local approximation and the global solver in Algorithm 1 and Algorithm 2, which concisely describe how we implement DPM-Solver-v3.

Algorithm 1 (n+1)(n+1)-th order local approximation: LUpdaten+1

Require: noise schedule αt,σt\alpha_{t},\sigma_{t}, coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda}

Input: transition time pair (s,t)(s,t), 𝒙s\bm{x}_{s}, nn extra timesteps {tik}k=1n\{t_{i_{k}}\}_{k=1}^{n}, 𝒈θ\bm{g}_{\theta} values (𝒈in,,𝒈i1,𝒈s)(\bm{g}_{i_{n}},\dots,\bm{g}_{i_{1}},\bm{g}_{s}) at {(𝒙λik,tik)}k=1n\{(\bm{x}_{\lambda_{i_{k}}},t_{i_{k}})\}_{k=1}^{n} and (𝒙s,s)(\bm{x}_{s},s)

Input Format: {tin,𝒈in},,{ti1,𝒈i1},{s,𝒙s,𝒈s},t\{t_{i_{n}},\bm{g}_{i_{n}}\},\dots,\{t_{i_{1}},\bm{g}_{i_{1}}\},\{s,\bm{x}_{s},\bm{g}_{s}\},t

1:  Compute 𝑨(λs,λt),λsλt𝑬λs(λ)𝑩λs(λ)dλ,λsλt𝑬λs(λ)(λλs)kk!dλ\bm{A}(\lambda_{s},\lambda_{t}),\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda,\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda (Appendix C.1.2)
2:  δk=λikλs,k=1,,n\delta_{k}=\lambda_{i_{k}}-\lambda_{s},\quad k=1,\dots,n
3:  (𝒈^s(1)𝒈^s(2)2!𝒈^s(n)n!)(δ1δ12δ1nδ2δ22δ2nδnδn2δnn)1(𝒈i1𝒈s𝒈i2𝒈s𝒈in𝒈s)\begin{pmatrix}\hat{\bm{g}}^{(1)}_{s}\\ \frac{\hat{\bm{g}}^{(2)}_{s}}{2!}\\ \vdots\\ \frac{\hat{\bm{g}}^{(n)}_{s}}{n!}\end{pmatrix}\leftarrow\begin{pmatrix}\delta_{1}&\delta_{1}^{2}&\cdots&\delta_{1}^{n}\\ \delta_{2}&\delta_{2}^{2}&\cdots&\delta_{2}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{n}&\delta_{n}^{2}&\cdots&\delta_{n}^{n}\end{pmatrix}^{-1}\begin{pmatrix}\bm{g}_{i_{1}}-\bm{g}_{s}\\ \bm{g}_{i_{2}}-\bm{g}_{s}\\ \vdots\\ \bm{g}_{i_{n}}-\bm{g}_{s}\end{pmatrix} (Eq. (13))
4:  𝒙^tαt𝑨(λs,λt)(𝒙sαsλsλt𝑬λs(λ)𝑩λs(λ)dλk=0n𝒈^s(k)λsλt𝑬λs(λ)(λλs)kk!dλ)\displaystyle\hat{\bm{x}}_{t}\leftarrow\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\left(\frac{\bm{x}_{s}}{\alpha_{s}}-\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda-\sum_{k=0}^{n}\hat{\bm{g}}^{(k)}_{s}\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda\right) (Eq. (14))

Output: 𝒙^t\hat{\bm{x}}_{t}
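For concreteness, a minimal NumPy sketch of Algorithm 1 is given below, assuming the EMS integrals have already been precomputed (e.g., as in Appendix C.1.2). Here Ek[k] stands for ∫𝑬λs(λ)(λ−λs)^k dλ without the 1/k! factor, which is absorbed into the solved coefficients; all names are illustrative.

```python
import numpy as np

def lupdate(x_s, g_s, g_extra, lam_s, lam_extra, alpha_s, alpha_t, A, EB, Ek):
    # g_extra / lam_extra: g_theta values and half-logSNRs at the n extra timesteps.
    n = len(g_extra)
    delta = np.array(lam_extra) - lam_s                           # step 2
    R = np.stack([delta ** j for j in range(1, n + 1)], axis=1)   # Vandermonde-like matrix
    rhs = np.stack([g - g_s for g in g_extra])
    coeffs = np.linalg.solve(R, rhs)                              # step 3: [g^(k)/k!], k=1..n
    # step 4: Taylor-series term of Eq. (14); Ek[k] = int E(lambda)(lambda-lambda_s)^k dlambda
    series = g_s * Ek[0] + sum(coeffs[k] * Ek[k + 1] for k in range(n))
    return alpha_t * A * (x_s / alpha_s - EB - series)
```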

Algorithm 2 (n+1)(n+1)-th order multistep predictor-corrector algorithm

Require: noise prediction model ϵθ\bm{\epsilon}_{\theta}, noise schedule αt,σt\alpha_{t},\sigma_{t}, coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda}, cache Q1,Q2Q_{1},Q_{2}

Input: timesteps {ti}i=0M\{t_{i}\}_{i=0}^{M}, initial value 𝒙0\bm{x}_{0}

1:  Q1cache𝒙0Q_{1}\overset{cache}{\leftarrow}\bm{x}_{0}
2:  Q2cacheϵθ(𝒙0,t0)Q_{2}\overset{cache}{\leftarrow}\bm{\epsilon}_{\theta}(\bm{x}_{0},t_{0})
3:  for m=1m=1 to MM do
4:     nmmin{n+1,m}n_{m}\leftarrow\min\{n+1,m\}
5:     𝒙^mnm,,𝒙^m1fetchQ1\hat{\bm{x}}_{m-n_{m}},\dots,\hat{\bm{x}}_{m-1}\overset{fetch}{\leftarrow}Q_{1}
6:     ϵ^mnm,,ϵ^m1fetchQ2\hat{\bm{\epsilon}}_{m-n_{m}},\dots,\hat{\bm{\epsilon}}_{m-1}\overset{fetch}{\leftarrow}Q_{2}
7:     𝒈^leλm1λl𝒔τdτσλlϵ^l𝒍λl𝒙^lαλlλm1λleλm1r𝒔τdτ𝒃rdr,l=mnm,,m1\displaystyle\hat{\bm{g}}_{l}\leftarrow e^{-\int_{\lambda_{m-1}}^{\lambda_{l}}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\sigma_{\lambda_{l}}\hat{\bm{\epsilon}}_{l}-\bm{l}_{\lambda_{l}}\hat{\bm{x}}_{l}}{\alpha_{\lambda_{l}}}-\int_{\lambda_{m-1}}^{\lambda_{l}}e^{-\int_{\lambda_{m-1}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r,\quad l=m-n_{m},\dots,m-1 (Eq. (8))
8:     𝒙^mLUpdatenm({tmnm,𝒈^mnm},,{tm2,𝒈^m2},{tm1,𝒙^m1,𝒈^m1},tm)\hat{\bm{x}}_{m}\leftarrow\text{LUpdate}_{n_{m}}(\{t_{m-n_{m}},\hat{\bm{g}}_{m-n_{m}}\},\dots,\{t_{m-2},\hat{\bm{g}}_{m-2}\},\{t_{m-1},\hat{\bm{x}}_{m-1},\hat{\bm{g}}_{m-1}\},t_{m})
9:     if mMm\neq M then
10:        ϵ^mϵθ(𝒙^m,tm)\hat{\bm{\epsilon}}_{m}\leftarrow\bm{\epsilon}_{\theta}(\hat{\bm{x}}_{m},t_{m})
11:        𝒈^meλm1λm𝒔τdτσλmϵ^m𝒍λm𝒙^mαλmλm1λmeλm1r𝒔τdτ𝒃rdr\displaystyle\hat{\bm{g}}_{m}\leftarrow e^{-\int_{\lambda_{m-1}}^{\lambda_{m}}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\sigma_{\lambda_{m}}\hat{\bm{\epsilon}}_{m}-\bm{l}_{\lambda_{m}}\hat{\bm{x}}_{m}}{\alpha_{\lambda_{m}}}-\int_{\lambda_{m-1}}^{\lambda_{m}}e^{-\int_{\lambda_{m-1}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r (Eq. (8))
12:        𝒙^mcLUpdatenm({tmnm+1,𝒈^mnm+1},,{tm2,𝒈^m2},{tm,𝒈^m},\hat{\bm{x}}_{m}^{c}\leftarrow\text{LUpdate}_{n_{m}}(\{t_{m-n_{m}+1},\hat{\bm{g}}_{m-n_{m}+1}\},\dots,\{t_{m-2},\hat{\bm{g}}_{m-2}\},\{t_{m},\hat{\bm{g}}_{m}\},
                                                            {tm1,𝒙^m1,𝒈^m1},tm)\{t_{m-1},\hat{\bm{x}}_{m-1},\hat{\bm{g}}_{m-1}\},t_{m})
13:        ϵ^mcϵ^m+𝒍λm(𝒙^mc𝒙^m)/σλm\hat{\bm{\epsilon}}_{m}^{c}\leftarrow\hat{\bm{\epsilon}}_{m}+\bm{l}_{\lambda_{m}}(\hat{\bm{x}}_{m}^{c}-\hat{\bm{x}}_{m})/\sigma_{\lambda_{m}} (to ensure 𝒈^mc=𝒈^m\hat{\bm{g}}_{m}^{c}=\hat{\bm{g}}_{m})
14:        Q1cache𝒙^mcQ_{1}\overset{cache}{\leftarrow}\hat{\bm{x}}_{m}^{c}
15:        Q2cacheϵ^mcQ_{2}\overset{cache}{\leftarrow}\hat{\bm{\epsilon}}_{m}^{c}
16:     end if
17:  end for

Output: 𝒙^M\hat{\bm{x}}_{M}

Appendix D Experiment Details

In this section, we provide more experiment details for each setting, including the codebases and the configurations for evaluation, EMS computing and sampling. Unless otherwise stated, we utilize the forward-mode automatic differentiation (torch.autograd.forward_ad) provided by PyTorch [39] to compute the Jacobian-vector-products (JVPs). Also, as stated in Section 3.4, we draw datapoints 𝒙λ\bm{x}_{\lambda} from the marginal distribution qλq_{\lambda} defined by the forward diffusion process starting from some data distribution q0q_{0}, instead of the model distribution pλθp_{\lambda}^{\theta}.
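For reference, a minimal sketch of a forward-mode JVP with torch.autograd.forward_ad is shown below; this is the primitive used to form products like ∇𝒙ϵθ(𝒙λ,λ)𝒗 without materializing the Jacobian, and eps_model, x, lam, v are placeholders:

```python
import torch
import torch.autograd.forward_ad as fwAD

def jvp_forward(eps_model, x, lam, v):
    with fwAD.dual_level():
        x_dual = fwAD.make_dual(x, v)        # attach the tangent v to the input x
        out = eps_model(x_dual, lam)
        _, tangent = fwAD.unpack_dual(out)   # tangent = (d eps / d x) v
    return tangent
```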

D.1 ScoreSDE on CIFAR10

Codebase and evaluation For unconditional sampling on CIFAR10 [24], one experiment setting is based on the pretrained pixel-space diffusion model provided by ScoreSDE [51]. We use their official PyTorch codebase and their checkpoint checkpoint_8.pth under the vp/cifar10_ddpmpp_deep_continuous config. We adopt their statistics file and code for computing FID.

EMS computing We estimate the EMS at N=1200N=1200 uniform timesteps λj0,λj1,,λjN\lambda_{j_{0}},\lambda_{j_{1}},\dots,\lambda_{j_{N}} by drawing K=4096K=4096 datapoints 𝒙λ0q0\bm{x}_{\lambda_{0}}\sim q_{0}, where q0q_{0} is the distribution of the training set. We compute two sets of EMS, corresponding to start time ϵ=103\epsilon=10^{-3} (NFE\leq 10) and ϵ=104\epsilon=10^{-4} (NFE>10) in the sampling process respectively. The total time for EMS computing is \sim7h on 8 NVIDIA A40 GPUs.

Sampling Following previous works [31, 32, 58], we use start time ϵ=103\epsilon=10^{-3} (NFE\leq 10) and ϵ=104\epsilon=10^{-4} (NFE>10), end time T=1T=1 and adopt the uniform logSNR timestep schedule. For DPM-Solver-v3, we use the 3rd-order predictor with the 3rd-order corrector by default. In particular, we change to the pseudo 3rd-order predictor at 5 NFE to further boost the performance.

D.2 EDM on CIFAR10

Codebase and evaluation For unconditional sampling on CIFAR10 [24], another experiment setting is based on the pretrained pixel-space diffusion model provided by EDM [21]. We use the PyTorch version of their official codebase, and their checkpoint edm-cifar10-32x32-uncond-vp.pkl. For consistency, we borrow the statistics file and code from ScoreSDE [51] for computing FID.

EMS computing Since the pretrained models of EDM are stored as pickled objects, we cannot use torch.autograd.forward_ad for computing JVPs. Instead, we use torch.autograd.functional.jvp, which is much slower since it employs the double-backward trick. We estimate two sets of EMS. One corresponds to N=1200N=1200 uniform timesteps λj0,λj1,,λjN\lambda_{j_{0}},\lambda_{j_{1}},\dots,\lambda_{j_{N}} and K=1024K=1024 datapoints 𝒙λ0q0\bm{x}_{\lambda_{0}}\sim q_{0}, where q0q_{0} is the distribution of the training set. The other corresponds to N=120,K=4096N=120,K=4096. They are used when NFE<10 and NFE\geq10, respectively. The total time for EMS computing is \sim3.5h on 8 NVIDIA A40 GPUs.
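For reference, a minimal sketch of this slower fallback is shown below; eps_theta, x, t, and the tangent v are placeholder names.

    import torch

    # Minimal sketch: JVP via torch.autograd.functional.jvp (double-backward trick),
    # used when forward-mode AD cannot be applied to the pretrained model.
    def jvp_fallback(eps_theta, x, t, v):
        out, jvp = torch.autograd.functional.jvp(lambda x_: eps_theta(x_, t), (x,), (v,))
        return out, jvp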

Sampling Following EDM, we use start time tmin=0.002t_{\min}=0.002 and end time tmax=80.0t_{\max}=80.0, but adopt the uniform logSNR timestep schedule, which performs better in practice. For DPM-Solver-v3, we use the 3rd-order predictor and additionally employ the 3rd-order corrector when NFE\leq 6. In particular, we switch to the pseudo 3rd-order predictor at 5 NFE to further boost the performance.

D.3 Latent-Diffusion on LSUN-Bedroom

Codebase and evaluation The unconditional sampling on LSUN-Bedroom [55] is based on the pretrained latent-space diffusion model provided by Latent-Diffusion [43]. We use the PyTorch version of their official codebase and their default checkpoint. We borrow the statistics file and code from Guided-Diffusion [10] for computing FID.

EMS computing We estimate the EMS at N=120N=120 uniform timesteps λj0,λj1,,λjN\lambda_{j_{0}},\lambda_{j_{1}},\dots,\lambda_{j_{N}} by drawing K=1024K=1024 datapoints 𝒙λ0q0\bm{x}_{\lambda_{0}}\sim q_{0}, where q0q_{0} is the distribution of the latents of the training set. The total time for EMS computing is \sim12 min on 8 NVIDIA A40 GPUs.

Sampling Following previous works [58], we use start time ϵ=103\epsilon=10^{-3}, end time T=1T=1 and adopt the uniform tt timestep schedule. For DPM-Solver-v3, we use the 3rd-order predictor with the pseudo 4th-order corrector.

D.4 Guided-Diffusion on ImageNet-256

Codebase and evaluation The conditional sampling on ImageNet-256 [9] is based on the pretrained pixel-space diffusion model provided by Guided-Diffusion [10]. We use the PyTorch version of their official codebase and their two checkpoints: the conditional diffusion model 256x256_diffusion.pt and the classifier 256x256_classifier.pt. We adopt their statistics file and code for computing FID.

EMS computing We estimate the EMS at N=500N=500 uniform timesteps λj0,λj1,,λjN\lambda_{j_{0}},\lambda_{j_{1}},\dots,\lambda_{j_{N}} by drawing K=1024K=1024 datapoints 𝒙λ0q0\bm{x}_{\lambda_{0}}\sim q_{0}, where q0q_{0} is the distribution of the training set. Also, we find that the FID metric behaves atypically on the ImageNet-256 dataset, and the degenerate choice of 𝒍λ\bm{l}_{\lambda} (𝒍λ=𝟏\bm{l}_{\lambda}=\bm{1}) performs better. The total time for EMS computing is \sim9.5h on 8 NVIDIA A40 GPUs.

Sampling Following previous works [31, 32, 58], we use start time ϵ=103\epsilon=10^{-3}, end time T=1T=1 and adopt the uniform tt timestep schedule. For DPM-Solver-v3, we use the 2nd-order predictor with the pseudo 3rd-order corrector.

D.5 Stable-Diffusion on MS-COCO2014 prompts

Codebase and evaluation The text-to-image sampling on MS-COCO2014 [26] prompts is based on the pretrained latent-space diffusion model provided by Stable-Diffusion [43]. We use the PyTorch version of their official codebase and their checkpoint sd-v1-4.ckpt. We compute the MSE on randomly selected captions from the MS-COCO2014 validation dataset, as detailed in Section 4.1.

EMS computing We estimate the EMS at N=250N=250 uniform timesteps λj0,λj1,,λjN\lambda_{j_{0}},\lambda_{j_{1}},\dots,\lambda_{j_{N}} by drawing K=1024K=1024 datapoints 𝒙λ0q0\bm{x}_{\lambda_{0}}\sim q_{0}. Since Stable-Diffusion is trained on the LAION-5B dataset [46], there is a gap between the images in the MS-COCO2014 validation dataset and the images generated by Stable-Diffusion at a given guidance scale. Thus, we choose q0q_{0} to be the distribution of the latents generated by Stable-Diffusion with the corresponding guidance scale, using 200-step DPM-Solver++ [32]. We generate these latents with random captions and Gaussian noise different from those we use to compute the MSE. The total time for EMS computing is \sim11h on 8 NVIDIA A40 GPUs for each guidance scale.

Sampling Following previous works [32, 58], we use start time ϵ=103\epsilon=10^{-3}, end time T=1T=1 and adopt the uniform tt timestep schedule. For DPM-Solver-v3, we use the 2nd-order predictor with the pseudo 3rd-order corrector.

D.6 License

Table 3: The datasets and codebases used in this work, and their licenses.
Name URL Citation License
CIFAR10 https://www.cs.toronto.edu/~kriz/cifar.html [24] N/A
LSUN-Bedroom https://www.yf.io/p/lsun [55] N/A
ImageNet-256 https://www.image-net.org [9] N/A
MS-COCO2014 https://cocodataset.org [26] CC BY 4.0
ScoreSDE https://github.com/yang-song/score_sde_pytorch [51] Apache-2.0
EDM https://github.com/NVlabs/edm [21] CC BY-NC-SA 4.0
Guided-Diffusion https://github.com/openai/guided-diffusion [10] MIT
Latent-Diffusion https://github.com/CompVis/latent-diffusion [43] MIT
Stable-Diffusion https://github.com/CompVis/stable-diffusion [43] CreativeML Open RAIL-M
DPM-Solver++ https://github.com/LuChengTHU/dpm-solver [32] MIT
UniPC https://github.com/wl-zhao/UniPC [58] N/A

We list the datasets and codebases used in this work, together with their licenses, in Table 3.

Appendix E Runtime Comparison

Table 4: Runtime of different methods to generate a single batch (seconds / batch, ±\pmstd) on a single NVIDIA A40, varying the number of function evaluations (NFE). We do not include the runtime of the decoding stage for latent-space DPMs.
Method NFE
5 10 15 20
CIFAR10 [24], ScoreSDE [51] (batch size = 128)
DPM-Solver++ [32] 1.253(±\pm0.0014) 2.503(±\pm0.0017) 3.754(±\pm0.0042) 5.010(±\pm0.0048)
UniPC [58] 1.268(±\pm0.0012) 2.532(±\pm0.0018) 3.803(±\pm0.0037) 5.080(±\pm0.0049)
DPM-Solver-v3 1.273(±\pm0.0005) 2.540(±\pm0.0023) 3.826(±\pm0.0039) 5.108(±\pm0.0055)
CIFAR10 [24], EDM [21] (batch size = 128)
DPM-Solver++ [32] 1.137(±\pm0.0011) 2.278(±\pm0.0015) 3.426(±\pm0.0024) 4.569(±\pm0.0031)
UniPC [58] 1.142(±\pm0.0016) 2.289(±\pm0.0019) 3.441(±\pm0.0035) 4.590(±\pm0.0021)
DPM-Solver-v3 1.146(±\pm0.0010) 2.293(±\pm0.0015) 3.448(±\pm0.0018) 4.600(±\pm0.0027)
LSUN-Bedroom [55], Latent-Diffusion [43] (batch size = 32)
DPM-Solver++ [32] 1.302(±\pm0.0009) 2.608(±\pm0.0010) 3.921(±\pm0.0023) 5.236(±\pm0.0045)
UniPC [58] 1.305(±\pm0.0005) 2.616(±\pm0.0019) 3.934(±\pm0.0033) 5.244(±\pm0.0043)
DPM-Solver-v3 1.302(±\pm0.0010) 2.620(±\pm0.0027) 3.932(±\pm0.0028) 5.290(±\pm0.0030)
ImageNet256 [9], Guided-Diffusion [10] (batch size = 4)
DPM-Solver++ [32] 1.594(±\pm0.0011) 3.194(±\pm0.0018) 4.792(±\pm0.0031) 6.391(±\pm0.0045)
UniPC [58] 1.606(±\pm0.0026) 3.205(±\pm0.0025) 4.814(±\pm0.0049) 6.427(±\pm0.0060)
DPM-Solver-v3 1.601(±\pm0.0059) 3.229(±\pm0.0031) 4.807(±\pm0.0068) 6.458(±\pm0.0257)
MS-COCO2014 [26], Stable-Diffusion [43] (batch size = 4)
DPM-Solver++ [32] 1.732(±\pm0.0012) 3.464(±\pm0.0020) 5.229(±\pm0.0027) 6.974(±\pm0.0013)
UniPC [58] 1.735(±\pm0.0012) 3.484(±\pm0.0364) 5.212(±\pm0.0015) 6.988(±\pm0.0035)
DPM-Solver-v3 1.731(±\pm0.0008) 3.471(±\pm0.0011) 5.211(±\pm0.0030) 6.945(±\pm0.0022)

As we have mentioned in Section 4, the runtime of DPM-Solver-v3 is almost the same as that of other solvers (DDIM [48], DPM-Solver [31], DPM-Solver++ [32], UniPC [58], etc.) as long as they use the same NFE. This is because the main computational cost is the sequential evaluation of the large neural network ϵθ\bm{\epsilon}_{\theta}, while the other coefficients are either computed analytically [48, 31, 32, 58] or precomputed (DPM-Solver-v3), and thus have negligible cost.

Table 4 shows the runtime of DPM-Solver-v3 and some other solvers on a single NVIDIA A40 under different settings. We use torch.cuda.Event and torch.cuda.synchronize to accurately measure the runtime. We evaluate the runtime on 8 batches (dropping the first batch, since it contains extra initializations) and report the mean and std. We can see that the runtime is proportional to NFE and differs by only about ±1\pm 1% across solvers, which confirms our statement. Therefore, a speedup in NFE translates almost directly into a speedup in runtime.
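For reference, the timing protocol can be sketched as follows (sampler and batch are placeholder names for one of the evaluated solvers and its inputs):

    import torch

    # Minimal sketch of the timing protocol: CUDA events around one full sampling
    # run, with explicit synchronization before and after.
    def time_one_batch(sampler, batch):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()                   # make sure previous work is done
        start.record()
        sampler(batch)                             # one sampling run with a fixed NFE
        end.record()
        torch.cuda.synchronize()                   # wait for the recorded events
        return start.elapsed_time(end) / 1000.0    # seconds per batch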

Appendix F Quantitative Results

Table 5: Quantitative results on LSUN-Bedroom [55]. We report the FID\downarrow of the methods with different numbers of function evaluations (NFE), evaluated on 50k samples.
Method Model NFE
5 6 8 10 12 15 20
DPM-Solver++ [32] Latent-Diffusion [43] 18.59 8.50 4.19 3.63 3.43 3.29 3.16
UniPC [58] 12.24 6.19 4.00 3.56 3.34 3.18 3.07
DPM-Solver-v3 7.54 4.79 3.53 3.16 3.06 3.05 3.05
Table 6: Quantitative results on ImageNet-256 [9]. We report the FID\downarrow of the methods with different numbers of function evaluations (NFE), evaluated on 10k samples.
Method Model NFE
5 6 8 10 12 15 20
DPM-Solver++ [32] Guided-Diffusion [10] (s=2.0s=2.0) 16.87 13.09 9.95 8.72 8.13 7.73 7.48
UniPC [58] 15.62 11.91 9.29 8.35 7.95 7.64 7.44
DPM-Solver-v3 15.10 11.39 8.96 8.27 7.94 7.62 7.39
Table 7: Quantitative results on MS-COCO2014 [26] prompts. We report the MSE\downarrow of the methods with different numbers of function evaluations (NFE), evaluated on 10k samples.
Method Model NFE
5 6 8 10 12 15 20
DPM-Solver++ [32] Stable-Diffusion [43] (s=1.5s=1.5) 0.076 0.056 0.028 0.016 0.012 0.009 0.006
UniPC [58] 0.055 0.039 0.024 0.012 0.007 0.005 0.002
DPM-Solver-v3 0.037 0.027 0.024 0.007 0.005 0.001 0.002
DPM-Solver++ [32] Stable-Diffusion [43] (s=7.5s=7.5) 0.60 0.65 0.50 0.46 0.42 0.38 0.30
UniPC [58] 0.65 0.71 0.56 0.46 0.43 0.35 0.31
DPM-Solver-v3 0.55 0.64 0.49 0.40 0.45 0.34 0.29

We present the detailed quantitative results of Section 4.1 for different datasets in Table 1, Table 5, Table 6 and Table 7 respectively. They clearly verify that DPM-Solver-v3 achieves consistently better or comparable performance under various settings, especially in 5\sim10 NFEs.

Appendix G Ablations

In this section, we conduct some ablations to further evaluate and analyze the effectiveness of DPM-Solver-v3.

G.1 Varying the Number of Timesteps and Datapoints for the EMS

Table 8: Ablation of the number of timesteps NN and datapoints KK for the EMS, experimented with ScoreSDE [51] on CIFAR10 [24]. We report the FID\downarrow with different numbers of function evaluations (NFE), evaluated on 50k samples.
NN KK NFE
5 6 8 10 12 15 20
1200 512 18.84 7.90 4.49 3.74 3.88 3.52 3.12
1200 1024 15.52 7.55 4.17 3.56 3.37 3.03 2.78
120 4096 13.67 7.60 4.09 3.49 3.24 2.90 2.70
250 4096 13.28 7.56 4.00 3.45 3.22 2.92 2.70
1200 4096 12.76 7.40 3.94 3.40 3.24 2.91 2.71

First, we would like to investigate how the number of timesteps NN and the number of datapoints KK for computing the EMS affect the performance. We conduct experiments with the ScoreSDE model [51] on CIFAR10 [24], decreasing NN and KK from our default choice of N=1200,K=4096N=1200,K=4096.

We list the FID results using the EMS computed with different NN and KK in Table 8. We can observe that the number of datapoints KK is crucial to the performance, while the number of timesteps NN is less significant and mainly affects the performance at 5\sim10 NFEs. When NFE>10, we can decrease NN to as little as 50, which gives even better FIDs. Note that the time cost for computing the EMS is proportional to NKNK, so how to choose NN and KK to balance efficiency and accuracy is worth further study.

G.2 First-Order Comparison

Table 9: Quantitative comparison of first-order solvers (DPM-Solver-v3-1 and DDIM [48]), experimented with ScoreSDE [51] on CIFAR10 [24]. We report the FID\downarrow with different numbers of function evaluations (NFE), evaluated on 50k samples.
Method NFE
5 6 8 10 12 15 20
DDIM [48] 54.56 41.92 27.51 20.11 15.64 12.05 9.00
DPM-Solver-v3-1 39.18 29.82 20.03 14.98 11.86 9.34 7.19
Method NFE = 5 NFE = 10 NFE = 20
DDIM [48] FID 54.56 FID 20.11 FID 9.00
DPM-Solver-v3-1 (Ours) FID 39.18 FID 14.98 FID 7.19
Figure 7: Random samples by first-order solvers (DPM-Solver-v3-1 and DDIM [48]) of ScoreSDE [51] on CIFAR10 dataset [24], using 5, 10 and 20 NFE.

As stated in Appendix A, the first-order case of DPM-Solver-v3 (DPM-Solver-v3-1) is different from DDIM [48], previously the best first-order solver for DPMs. Note that DPM-Solver-v3-1 uses no corrector, since any corrector has an order of at least 2.

In Table 9 and Figure 7, we compare DPM-Solver-v3-1 with DDIM both quantitatively and qualitatively, using the DPM ScoreSDE [51] on CIFAR10 [24]. The results verify our statement that DPM-Solver-v3-1 performs better than DDIM.

G.3 Effects of Pseudo-Order Solver

We now demonstrate the effectiveness of the pseudo-order solver, including the pseudo-order predictor and the pseudo-order corrector.

Pseudo-order predictor The pseudo-order predictor is applied only in one setting (at 5 NFE on CIFAR10 [24]) to achieve the maximum performance improvement. In this setting, without the pseudo-order predictor, the FID degenerates from 12.76 to 15.91 for ScoreSDE [51], and from 12.21 to 12.72 for EDM [21]. While these results are still better than those of previous methods, the pseudo-order predictor further boosts the performance at NFEs as small as 5.

Table 10: Effects of pseudo-order corrector under different settings. We report the FID\downarrow with different numbers of function evaluations (NFE).
Method NFE
5 6 8 10 12 15 20
LSUN-Bedroom [55], Latent-Diffusion [43]
4th-order corrector 8.83 5.28 3.65 3.27 3.17 3.14 3.13
\rightarrowpseudo (default) 7.54 4.79 3.53 3.16 3.06 3.05 3.05
ImageNet-256 [9], Guided-Diffusion [10] (s=2.0s=2.0)
3rd-order corrector 15.87 11.91 9.27 8.37 7.97 7.62 7.47
\rightarrowpseudo (default) 15.10 11.39 8.96 8.27 7.94 7.62 7.39
MS-COCO2014 [26], Stable-Diffusion [43] (s=1.5s=1.5)
3rd-order corrector 0.037 0.028 0.028 0.014 0.0078 0.0024 0.0011
\rightarrowpseudo (default) 0.037 0.027 0.024 0.0065 0.0048 0.0014 0.0022

Pseudo-order corrector We show the comparison between the true and the pseudo-order corrector in Table 10. We can observe a consistent improvement when switching to the pseudo-order corrector. This suggests that when using an nn-th order predictor, it is better to combine it with the pseudo (n+1)(n+1)-th order corrector rather than the true (n+1)(n+1)-th order corrector.

G.4 Effects of Half-Corrector

Table 11: Ablation of half-corrector/full-corrector on MS-COCO2014 [26] prompts with Stable-Diffusion model [43] and guidance scale 7.5. We report the MSE\downarrow of the methods with different numbers of function evaluations (NFE), evaluated on 10k samples.
Method Corrector Usage NFE
5 6 8 10 12 15 20
DPM-Solver++ [32] no corrector 0.60 0.65 0.50 0.46 0.42 0.38 0.30
UniPC [58] full-corrector 0.65 0.71 0.56 0.46 0.43 0.35 0.31
\rightarrowhalf-corrector 0.59 0.66 0.50 0.46 0.41 0.38 0.30
DPM-Solver-v3 full-corrector 0.65 0.67 0.49 0.40 0.47 0.34 0.30
\rightarrowhalf-corrector 0.55 0.64 0.51 0.44 0.45 0.36 0.29

We demonstrate the effects of the half-corrector in Table 11, using the popular Stable-Diffusion model [43]. We can observe that under the relatively large guidance scale of 7.5, which is necessary for producing high-quality samples, the corrector adopted by UniPC [58] has a negative effect on the convergence to the ground-truth samples, making UniPC even worse than DPM-Solver++ [32]. When we employ the half-corrector technique, the problem is partially alleviated. Still, it lags behind our DPM-Solver-v3, since we further incorporate the EMS.

G.5 Singlestep vs. Multistep

Table 12: Quantitative comparison of single-step methods (S) vs. multi-step methods (M), experimented with ScoreSDE [51] on CIFAR10 [24]. We report the FID\downarrow with different numbers of function evaluations (NFE), evaluated on 50k samples.
Method NFE
5 6 8 10 12 15 20 25
DPM-Solver (S) [31] 290.51 23.78 23.51 4.67 4.97 3.34 2.85 2.70
DPM-Solver (M) [32] 27.40 17.85 9.04 6.41 5.31 4.10 3.30 2.98
DPM-Solver++ (S) [32] 51.80 38.54 12.13 6.52 6.36 4.56 3.52 3.09
DPM-Solver++ (M) [32] 28.53 13.48 5.34 4.01 4.04 3.32 2.90 2.76
DPM-Solver-v3 (S) 21.83 16.81 7.93 5.76 5.17 3.99 3.22 2.96
DPM-Solver-v3 (M) 12.76 7.40 3.94 3.40 3.24 2.91 2.71 2.64

As we stated in Section 3.2.2, multistep methods perform better than singlestep methods. To study the relationship between parameterization and these solver types, we develop a singlestep version of DPM-Solver-v3 in Algorithm 3. Note that since the (n+1)(n+1)-th order singlestep solver divides each step into n+1n+1 substeps, the total number of timesteps must be a multiple of n+1n+1. For flexibility, we follow the adaptive third-order strategy of DPM-Solver [31], which first takes third-order steps and then finishes with first-order or second-order steps; one possible split is sketched below.
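The following sketch shows one possible realization of this order-splitting strategy; the exact split used by DPM-Solver may differ slightly, so treat it as an illustrative assumption.

    # Illustrative sketch (assumption): split an NFE budget into singlestep orders,
    # taking third-order steps first and finishing with lower-order steps.
    def split_orders(nfe):
        if nfe % 3 == 0:
            return [3] * (nfe // 3 - 1) + [2, 1]
        elif nfe % 3 == 1:
            return [3] * (nfe // 3) + [1]
        else:
            return [3] * (nfe // 3) + [2]

    # e.g., split_orders(10) -> [3, 3, 3, 1]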

Algorithm 3 (n+1)(n+1)-th order singlestep solver

Require: noise prediction model ϵθ\bm{\epsilon}_{\theta}, noise schedule αt,σt\alpha_{t},\sigma_{t}, coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda}, cache Q1,Q2Q_{1},Q_{2}

Input: timesteps {ti}i=0(n+1)M\{t_{i}\}_{i=0}^{(n+1)M}, initial value 𝒙0\bm{x}_{0}

1:  Q1cache𝒙0Q_{1}\overset{cache}{\leftarrow}\bm{x}_{0}
2:  Q2cacheϵθ(𝒙0,t0)Q_{2}\overset{cache}{\leftarrow}\bm{\epsilon}_{\theta}(\bm{x}_{0},t_{0})
3:  for m=1m=1 to MM do
4:     i0(n+1)(m1)i_{0}\leftarrow(n+1)(m-1)
5:     for i=0i=0 to nn do
6:        𝒙^i0,,𝒙^i0+ifetchQ1\hat{\bm{x}}_{i_{0}},\dots,\hat{\bm{x}}_{i_{0}+i}\overset{fetch}{\leftarrow}Q_{1}
7:        ϵ^i0,,ϵ^i0+ifetchQ2\hat{\bm{\epsilon}}_{i_{0}},\dots,\hat{\bm{\epsilon}}_{i_{0}+i}\overset{fetch}{\leftarrow}Q_{2}
8:        𝒈^leλi0λl𝒔τdτσλlϵ^l𝒍λl𝒙^lαλlλi0λleλi0r𝒔τdτ𝒃rdr,l=i0,,i0+i\displaystyle\hat{\bm{g}}_{l}\leftarrow e^{-\int_{\lambda_{i_{0}}}^{\lambda_{l}}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\sigma_{\lambda_{l}}\hat{\bm{\epsilon}}_{l}-\bm{l}_{\lambda_{l}}\hat{\bm{x}}_{l}}{\alpha_{\lambda_{l}}}-\int_{\lambda_{i_{0}}}^{\lambda_{l}}e^{-\int_{\lambda_{i_{0}}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r,\quad l=i_{0},\dots,i_{0}+i (Eq. (8))
9:        𝒙^i0+i+1LUpdatei+1({ti0+1,𝒈^i0+1},,{ti0+i,𝒈^i0+i},{ti0,𝒙^i0,𝒈^i0},ti0+i+1)\hat{\bm{x}}_{i_{0}+i+1}\leftarrow\text{LUpdate}_{i+1}(\{t_{i_{0}+1},\hat{\bm{g}}_{i_{0}+1}\},\dots,\{t_{i_{0}+i},\hat{\bm{g}}_{i_{0}+i}\},\{t_{i_{0}},\hat{\bm{x}}_{i_{0}},\hat{\bm{g}}_{i_{0}}\},t_{i_{0}+i+1})
10:        if (i+1)m(n+1)M(i+1)m\neq(n+1)M then
11:           ϵ^i0+i+1ϵθ(𝒙^i0+i+1,ti0+i+1)\hat{\bm{\epsilon}}_{i_{0}+i+1}\leftarrow\bm{\epsilon}_{\theta}(\hat{\bm{x}}_{i_{0}+i+1},t_{i_{0}+i+1})
12:           Q1cache𝒙^i0+i+1Q_{1}\overset{cache}{\leftarrow}\hat{\bm{x}}_{i_{0}+i+1}
13:           Q2cacheϵ^i0+i+1Q_{2}\overset{cache}{\leftarrow}\hat{\bm{\epsilon}}_{i_{0}+i+1}
14:        end if
15:     end for
16:  end for

Output: 𝒙^(n+1)M\hat{\bm{x}}_{(n+1)M}

Since UniPC [58] is based on a multistep predictor-corrector framework and has no singlestep version, we only compare with DPM-Solver [31] and DPM-Solver++ [32], which use noise prediction and data prediction, respectively. The results of the ScoreSDE model [51] on CIFAR10 [24] are shown in Table 12, and they demonstrate that the relative performance of different parameterizations differs between singlestep and multistep methods.

Specifically, for singlestep methods, DPM-Solver-v3 (S) outperforms DPM-Solver++ (S) across NFEs, but DPM-Solver (S) is even better than DPM-Solver-v3 (S) when NFE\geq10 (though DPM-Solver (S) has the worst performance when NFE<10), which suggests that noise prediction is the best strategy in such scenarios; for multistep methods, we have DPM-Solver-v3 (M) > DPM-Solver++ (M) > DPM-Solver (M) across NFEs. Moreover, DPM-Solver (S) even outperforms DPM-Solver++ (M) when NFE\geq20; the properties of different parameterizations in singlestep methods are left for future study. Overall, DPM-Solver-v3 (M) achieves the best results among all these methods and performs the most stably across different NFEs.

Appendix H FID/CLIP Score on Stable-Diffusion

Table 13: Sample quality and text-image alignment performance on MS-COCO2014 [26] prompts with Stable-Diffusion model [43] and guidance scale 7.5. We report the FID\downarrow and CLIP score\uparrow of the methods with different numbers of function evaluations (NFE), evaluated on 10k samples.
Method Metric NFE
5 6 8 10 12 15 20
DPM-Solver++ [32] FID 18.87 17.44 16.40 15.93 15.78 15.84 15.72
CLIP score 0.263 0.265 0.265 0.265 0.266 0.265 0.265
UniPC [58] FID 18.77 17.32 16.20 16.15 16.09 16.06 15.94
CLIP score 0.262 0.263 0.265 0.265 0.265 0.265 0.265
DPM-Solver-v3 FID 18.83 16.41 15.41 15.32 15.13 15.30 15.23
CLIP score 0.260 0.262 0.264 0.265 0.265 0.265 0.265

In text-to-image generation, the sample quality is affected not only by the discretization error of the sampling process, but also by the estimation error of the neural network incurred during training, so a low MSE (faster convergence) does not necessarily imply better sample quality. Therefore, we choose the MS-COCO2014 [26] validation set as the reference, and additionally evaluate DPM-Solver-v3 on the Stable-Diffusion model [43] with the standard metrics FID and CLIP score [41], which measure the sample quality and the text-image alignment, respectively. For DPM-Solver-v3, we use the full-corrector strategy when NFE<10, and no corrector when NFE\geq10.

The results in Table 13 show that DPM-Solver-v3 achieves consistently better FID and similar CLIP scores. Notably, we achieve an FID of 15.4 in 8 NFE, close to the reported FID of Stable-Diffusion v1.4.

Still, we claim that FID is not a proper metric for evaluating the convergence of latent-space diffusion models. As stated in DPM-Solver++ and Section 4.1, the FID quickly reaches 15.0\sim16.0 within 10 steps, even before the latent code converges, because of the strong image decoder. Instead, the MSE in the latent space is a direct way to measure the convergence. In terms of MSE, our sampler does converge faster to the ground-truth samples of Stable Diffusion itself.
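As a minimal illustration of this convergence measure (an assumption about one reasonable implementation, not the exact evaluation code), the latent-space MSE can be computed as follows:

    import torch

    # Minimal sketch: latent-space MSE between samples from a fast solver and
    # reference latents produced by a near-converged sampler from the same noise
    # and prompts. Both inputs are tensors of shape (batch, C, H, W).
    def latent_mse(latents_fast, latents_reference):
        return torch.mean((latents_fast - latents_reference) ** 2).item()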

Appendix I More Theoretical Analyses

I.1 Expressive Power of Our Generalized Parameterization

Though the introduced coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} may seem restricted to guaranteeing the optimality of the parameterization formulation itself, we claim that the generalized parameterization 𝒈θ\bm{g}_{\theta} in Eq. (8) can actually cover a wide range of parameterization families of the form 𝝍θ(𝒙λ,λ)=𝜶(λ)ϵθ(𝒙λ,λ)+𝜷(λ)𝒙λ+𝜸(λ)\bm{\psi}_{\theta}(\bm{x}_{\lambda},\lambda)=\bm{\alpha}(\lambda)\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)+\bm{\beta}(\lambda)\bm{x}_{\lambda}+\bm{\gamma}(\lambda). Considering the parameterization on [λs,λt][\lambda_{s},\lambda_{t}] and rearranging the terms, Eq. (8) can be written as

𝒈θ(𝒙λ,λ)=eλsλ𝒔τdτσλαλϵθ(𝒙λ,λ)eλsλ𝒔τdτ𝒍λαλ𝒙λλsλeλsr𝒔τdτ𝒃rdr\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)=e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\sigma_{\lambda}}{\alpha_{\lambda}}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)-e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\bm{l}_{\lambda}}{\alpha_{\lambda}}\bm{x}_{\lambda}-\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r (68)

We can compare the coefficients of ϵθ\bm{\epsilon}_{\theta} and 𝒙λ\bm{x}_{\lambda} in 𝒈θ\bm{g}_{\theta} and 𝝍θ\bm{\psi}_{\theta} to figure out how 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} correspond to 𝜶(λ),𝜷(λ),𝜸(λ)\bm{\alpha}(\lambda),\bm{\beta}(\lambda),\bm{\gamma}(\lambda). In fact, we cannot directly let 𝒈θ\bm{g}_{\theta} equal 𝝍θ\bm{\psi}_{\theta}, since when λ=λs\lambda=\lambda_{s}, we have λsλ()=0\int_{\lambda_{s}}^{\lambda}(\cdot)=0, and the coefficient of ϵθ\bm{\epsilon}_{\theta} in 𝒈θ\bm{g}_{\theta} is fixed. Still, 𝝍θ\bm{\psi}_{\theta} can be related to 𝒈θ\bm{g}_{\theta} by a linear transformation, which depends only on λs\lambda_{s} and does not affect our analyses of the discretization error and the solver.

Specifically, assuming 𝝍θ=𝝎λs𝒈θ+𝝃λs\bm{\psi}_{\theta}=\bm{\omega}_{\lambda_{s}}\bm{g}_{\theta}+\bm{\xi}_{\lambda_{s}}, by corresponding the coefficients we have

{𝜶(λ)=𝝎λseλsλ𝒔τdτσλαλ𝜷(λ)=𝝎λseλsλ𝒔τdτ𝒍λαλ𝜸(λ)=𝝎λsλsλeλsr𝒔τdτ𝒃rdr+𝝃λs{𝝎λs=eλs𝜶(λs)𝝃λs=𝜸(λs)𝒍λ=σλ𝜷(λ)𝜶(λ)𝒔λ=1𝜶(λ)𝜶(λ)𝒃λ=eλ𝜸(λ)𝜶(λ)\begin{cases}\bm{\alpha}(\lambda)=\bm{\omega}_{\lambda_{s}}e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\sigma_{\lambda}}{\alpha_{\lambda}}\\ \bm{\beta}(\lambda)=-\bm{\omega}_{\lambda_{s}}e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\frac{\bm{l}_{\lambda}}{\alpha_{\lambda}}\\ \bm{\gamma}(\lambda)=-\bm{\omega}_{\lambda_{s}}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r+\bm{\xi}_{\lambda_{s}}\end{cases}\Rightarrow\begin{cases}\bm{\omega}_{\lambda_{s}}=e^{\lambda_{s}}\bm{\alpha}(\lambda_{s})\\ \bm{\xi}_{\lambda_{s}}=\bm{\gamma}(\lambda_{s})\\ \bm{l}_{\lambda}=-\sigma_{\lambda}\frac{\bm{\beta}(\lambda)}{\bm{\alpha}(\lambda)}\\ \bm{s}_{\lambda}=-1-\frac{\bm{\alpha}^{\prime}(\lambda)}{\bm{\alpha}(\lambda)}\\ \bm{b}_{\lambda}=-e^{-\lambda}\frac{\bm{\gamma}^{\prime}(\lambda)}{\bm{\alpha}(\lambda)}\end{cases} (69)

Therefore, as long as 𝜶(λ)0\bm{\alpha}(\lambda)\neq 0 and 𝜶(λ),𝜸(λ)\bm{\alpha}(\lambda),\bm{\gamma}(\lambda) have first-order derivatives, our proposed parameterization 𝒈θ\bm{g}_{\theta} holds the same expressive power as 𝝍θ\bm{\psi}_{\theta}, while at the same time enabling the neat optimality criteria of 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} in Eq. (5) and Eq. (11).

I.2 Justification of Why Minimizing First-order Discretization Error Can Help Higher-order Solver

The EMS 𝒔λ,𝒃λ\bm{s}^{*}_{\lambda},\bm{b}^{*}_{\lambda} in Eq. (11) are designed to minimize the first-order discretization error in Eq. (10). However, higher-order solvers are more frequently adopted in practice (specifically, third-order for unconditional sampling and second-order for conditional sampling), since they incur lower sampling errors by taking higher-order Taylor expansions to approximate the predictor 𝒈θ\bm{g}_{\theta}.

In the following, we show that the EMS can also help higher-order solvers. By Eq. (52) in Appendix B.4, the (n+1)(n+1)-th order local error can be expressed as

𝒙^t𝒙t\displaystyle\hat{\bm{x}}_{t}-\bm{x}_{t} =αt𝑨(λs,λt)λsλt𝑬λs(λ)(𝒈θ(𝒙λ,λ)𝒈θ(𝒙λs,λs))dλ\displaystyle=\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\left(\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s})\right)\mathrm{d}\lambda (70)
αt𝑨(λs,λt)k=1n(l=1n(𝑹n1)kl(𝒈θ(𝒙λil,λil)𝒈θ(𝒙λs,λs)))λsλt𝑬λs(λ)(λλs)kk!dλ\displaystyle-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\sum_{k=1}^{n}\left(\sum_{l=1}^{n}(\bm{R}_{n}^{-1})_{kl}(\bm{g}_{\theta}(\bm{x}_{\lambda_{i_{l}}},\lambda_{i_{l}})-\bm{g}_{\theta}(\bm{x}_{\lambda_{s}},\lambda_{s}))\right)\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda

By the Newton-Leibniz formula, this is equivalent to

𝒙^t𝒙t\displaystyle\hat{\bm{x}}_{t}-\bm{x}_{t} =αt𝑨(λs,λt)λsλt𝑬λs(λ)(λsλgθ(1)(xτ,τ)dτ)dλ\displaystyle=\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\left(\int_{\lambda_{s}}^{\lambda}g^{(1)}_{\theta}(x_{\tau},\tau)\mathrm{d}\tau\right)\mathrm{d}\lambda (71)
αt𝑨(λs,λt)k=1n(l=1n(𝑹n1)klλsλilgθ(1)(xλ,λ)dλ)λsλt𝑬λs(λ)(λλs)kk!dλ\displaystyle-\alpha_{t}\bm{A}(\lambda_{s},\lambda_{t})\sum_{k=1}^{n}\left(\sum_{l=1}^{n}(\bm{R}_{n}^{-1})_{kl}\int_{\lambda_{s}}^{\lambda_{i_{l}}}g^{(1)}_{\theta}(x_{\lambda},\lambda)\mathrm{d}\lambda\right)\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda

We assume that the estimated EMS are bounded (in the order of 𝒪(1)\mathcal{O}(1), Assumption B.2 in Appendix B.1), which is empirically confirmed in Section 4.2. By the definition of 𝒈θ\bm{g}_{\theta} in Eq. (8), we have 𝒈θ(1)(𝒙τ,τ)=eλsτ𝒔rdr(𝒇θ(1)(𝒙τ,τ)𝒔τ𝒇θ(𝒙τ,τ)𝒃τ)\bm{g}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)=e^{-\int_{\lambda_{s}}^{\tau}\bm{s}_{r}\mathrm{d}r}\left(\bm{f}_{\theta}^{(1)}(\bm{x}_{\tau},\tau)-\bm{s}_{\tau}\bm{f}_{\theta}(\bm{x}_{\tau},\tau)-\bm{b}_{\tau}\right). Therefore, Eq. (11) controls 𝒈θ(1)2\|\bm{g}_{\theta}^{(1)}\|_{2} and further controls 𝒙^t𝒙t2\|\hat{\bm{x}}_{t}-\bm{x}_{t}\|_{2}, since the other terms depend only on the EMS and are bounded.
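This identity can be verified by directly differentiating the explicit form of 𝒈θ\bm{g}_{\theta} in Eq. (68) with respect to λ\lambda, writing 𝒇θ(𝒙λ,λ)=(σλϵθ(𝒙λ,λ)𝒍λ𝒙λ)/αλ\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)=(\sigma_{\lambda}\bm{\epsilon}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{l}_{\lambda}\bm{x}_{\lambda})/\alpha_{\lambda} for the factor shared by the first two terms of Eq. (68):

    \frac{\mathrm{d}\bm{g}_{\theta}(\bm{x}_{\lambda},\lambda)}{\mathrm{d}\lambda}=-\bm{s}_{\lambda}e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)+e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\bm{f}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda)-e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{\lambda}=e^{-\int_{\lambda_{s}}^{\lambda}\bm{s}_{\tau}\mathrm{d}\tau}\left(\bm{f}_{\theta}^{(1)}(\bm{x}_{\lambda},\lambda)-\bm{s}_{\lambda}\bm{f}_{\theta}(\bm{x}_{\lambda},\lambda)-\bm{b}_{\lambda}\right),

where the last term before the final equality comes from differentiating the integral in Eq. (68) with respect to its upper limit.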

I.3 The Extra Error of EMS Estimation and Integral Estimation

Analysis of EMS estimation error

In practice, the EMS in Eq. (5) and Eq. (11) are estimated on a finite number of datapoints by the explicit expressions in Eq. (58) and Eq. (63), and may therefore differ from the true 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda}^{*},\bm{s}_{\lambda}^{*},\bm{b}_{\lambda}^{*}. Theoretically, on the one hand, the order and convergence theorems in Section 3.2 are unaffected by the EMS estimation error: the ODE solution in Eq. (9) is correct whatever 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} are, and we only need the assumption that these coefficients are bounded (Assumption B.2 in Appendix B.1) to prove the local and global order; on the other hand, the first-order discretization error in Eq. (10) is sensitive to the EMS estimation error, which affects the performance at a small number of steps. Empirically, to enable fast sampling, we need a sufficient number of datapoints for estimating the EMS (see the ablations in Table 8), and we find that our method is robust to the EMS estimation error given only 1024 datapoints in most cases (see the EMS computing configurations in Appendix D).

Analysis of integral estimation error

Another source of error is the estimation of the integrals (λsλt𝑬λs(λ)𝑩λs(λ)dλ\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda and λsλt𝑬λs(λ)(λλs)kk!dλ\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda in Eq. (14)) by the trapezoidal rule. We can analyze this error with the error bound of the trapezoidal rule: suppose we use a uniform discretization of [a,b][a,b] with spacing hh to estimate abf(x)dx\int_{a}^{b}f(x)\mathrm{d}x; then the error EE satisfies

|E|(ba)h212max|f′′(x)||E|\leq\frac{(b-a)h^{2}}{12}\max|f^{\prime\prime}(x)| (72)

Under Assumption B.2, the EMS and their first-order derivative are bounded. Denote 𝒇1(λ)=𝑬λs(λ)𝑩λs(λ),𝒇2(λ)=𝑬λs(λ)(λλs)kk!\bm{f}_{1}(\lambda)=\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda),\bm{f}_{2}(\lambda)=\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}, then

𝒇1(λ)\displaystyle\bm{f}_{1}(\lambda) =eλsλ(𝒍τ+𝒔τ)dτλsλeλsr𝒔τdτ𝒃rdr\displaystyle=e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r (73)
𝒇1(λ)\displaystyle\bm{f}_{1}^{\prime}(\lambda) =(𝒍λ+𝒔λ)eλsλ(𝒍τ+𝒔τ)dτλsλeλsr𝒔τdτ𝒃rdr+𝒃λeλsλ𝒍τdτλsλeλsr𝒔τdτ𝒃rdr\displaystyle=(\bm{l}_{\lambda}+\bm{s}_{\lambda})e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r+\bm{b}_{\lambda}e^{\int_{\lambda_{s}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r
𝒇1′′(λ)\displaystyle\bm{f}_{1}^{\prime\prime}(\lambda) =(𝒍λ+𝒔λ)eλsλ(𝒍τ+𝒔τ)dτλsλeλsr𝒔τdτ𝒃rdr+(𝒍λ+𝒔λ)2eλsλ(𝒍τ+𝒔τ)dτλsλeλsr𝒔τdτ𝒃rdr\displaystyle=(\bm{l}_{\lambda}^{\prime}+\bm{s}_{\lambda}^{\prime})e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r+(\bm{l}_{\lambda}+\bm{s}_{\lambda})^{2}e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r
+(𝒍λ+𝒔λ)𝒃λeλsλ𝒍τdτλsλeλsr𝒔τdτ𝒃rdr+𝒃λeλsλ𝒍τdτλsλeλsr𝒔τdτ𝒃rdr\displaystyle+(\bm{l}_{\lambda}+\bm{s}_{\lambda})\bm{b}_{\lambda}e^{\int_{\lambda_{s}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r+\bm{b}_{\lambda}^{\prime}e^{\int_{\lambda_{s}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r
+𝒃λ𝒍λeλsλ𝒍τdτλsλeλsr𝒔τdτ𝒃rdr+𝒃λ2eλsλ(𝒍τ𝒔τ)dτλsλeλsr𝒔τdτ𝒃rdr\displaystyle+\bm{b}_{\lambda}\bm{l}_{\lambda}e^{\int_{\lambda_{s}}^{\lambda}\bm{l}_{\tau}\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r+\bm{b}_{\lambda}^{2}e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}-\bm{s}_{\tau})\mathrm{d}\tau}\int_{\lambda_{s}}^{\lambda}e^{-\int_{\lambda_{s}}^{r}\bm{s}_{\tau}\mathrm{d}\tau}\bm{b}_{r}\mathrm{d}r
𝒇2(λ)\displaystyle\bm{f}_{2}(\lambda) =eλsλ(𝒍τ+𝒔τ)dτ(λλs)kk!\displaystyle=e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k}}{k!} (74)
𝒇2(λ)\displaystyle\bm{f}_{2}^{\prime}(\lambda) =(𝒍λ+𝒔λ)eλsλ(𝒍τ+𝒔τ)dτ(λλs)kk!+eλsλ(𝒍τ+𝒔τ)dτ(λλs)k1(k1)!\displaystyle=(\bm{l}_{\lambda}+\bm{s}_{\lambda})e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k}}{k!}+e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k-1}}{(k-1)!}
𝒇2′′(λ)\displaystyle\bm{f}_{2}^{\prime\prime}(\lambda) =(𝒍λ+𝒔λ)eλsλ(𝒍τ+𝒔τ)dτ(λλs)kk!+(𝒍λ+𝒔λ)2eλsλ(𝒍τ+𝒔τ)dτ(λλs)kk!\displaystyle=(\bm{l}_{\lambda}^{\prime}+\bm{s}_{\lambda}^{\prime})e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k}}{k!}+(\bm{l}_{\lambda}+\bm{s}_{\lambda})^{2}e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k}}{k!}
+(𝒍λ+𝒔λ)eλsλ(𝒍τ+𝒔τ)dτ(λλs)k1k1!+(𝒍λ+𝒔λ)eλsλ(𝒍τ+𝒔τ)dτ(λλs)k1(k1)!\displaystyle+(\bm{l}_{\lambda}+\bm{s}_{\lambda})e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k-1}}{{k-1}!}+(\bm{l}_{\lambda}+\bm{s}_{\lambda})e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k-1}}{(k-1)!}
+eλsλ(𝒍τ+𝒔τ)dτ(λλs)k2(k2)!\displaystyle+e^{\int_{\lambda_{s}}^{\lambda}(\bm{l}_{\tau}+\bm{s}_{\tau})\mathrm{d}\tau}\frac{(\lambda-\lambda_{s})^{k-2}}{(k-2)!}

Since 𝒍λ,𝒔λ,𝒃λ,𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda},\bm{l}_{\lambda}^{\prime},\bm{s}_{\lambda}^{\prime},\bm{b}_{\lambda}^{\prime} are all 𝒪(1)\mathcal{O}(1), we can conclude that 𝒇1′′(λ)=𝒪(h),𝒇2′′(λ)=𝒪(hk2)\bm{f}_{1}^{\prime\prime}(\lambda)=\mathcal{O}(h),\bm{f}_{2}^{\prime\prime}(\lambda)=\mathcal{O}(h^{k-2}), and the errors of λsλt𝑬λs(λ)𝑩λs(λ)dλ,λsλt𝑬λs(λ)(λλs)kk!dλ\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\bm{B}_{\lambda_{s}}(\lambda)\mathrm{d}\lambda,\int_{\lambda_{s}}^{\lambda_{t}}\bm{E}_{\lambda_{s}}(\lambda)\frac{(\lambda-\lambda_{s})^{k}}{k!}\mathrm{d}\lambda are O(h02h2),O(h02hk1)O(h_{0}^{2}h^{2}),O(h_{0}^{2}h^{k-1}), respectively, where h0h_{0} is the stepsize of the EMS discretization (h0=λtλsnh_{0}=\frac{\lambda_{t}-\lambda_{s}}{n}, nn corresponds to 120\sim1200 timesteps for our EMS computing), and h=λtλsh=\lambda_{t}-\lambda_{s}. Therefore, the extra error of the integral estimation is of higher order and thus negligible.
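To make this concrete, the following minimal sketch (names and array layout are assumptions; the exact precomputation in Appendix C.1.2 may differ) estimates the two integrals above by the trapezoidal rule on the fine EMS grid within one step:

    import torch
    from math import factorial

    # Minimal sketch: trapezoidal estimation of int E_{lambda_s}(lambda) B_{lambda_s}(lambda) dlambda
    # and int E_{lambda_s}(lambda) (lambda - lambda_s)^k / k! dlambda on a fine grid.
    # `lambdas` (sorted, starting at lambda_s) and `l`, `s`, `b` are the discretized EMS.
    def integral_estimates(lambdas, l, s, b, k):
        zero = torch.zeros(1, dtype=lambdas.dtype)
        ls_int = torch.cat([zero, torch.cumulative_trapezoid(l + s, lambdas)])  # int (l+s) dtau
        s_int = torch.cat([zero, torch.cumulative_trapezoid(s, lambdas)])       # int s dtau
        E = torch.exp(ls_int)                                                   # E_{lambda_s}(lambda)
        B = torch.cat([zero, torch.cumulative_trapezoid(torch.exp(-s_int) * b, lambdas)])
        int_EB = torch.trapezoid(E * B, lambdas)
        int_Ek = torch.trapezoid(E * (lambdas - lambdas[0]) ** k / factorial(k), lambdas)
        return int_EB, int_Ek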

Appendix J More Discussions

J.1 Extra Computational and Memory Costs

The extra memory cost of DPM-Solver-v3 is rather small. The extra coefficients 𝒍λ,𝒔λ,𝒃λ\bm{l}_{\lambda},\bm{s}_{\lambda},\bm{b}_{\lambda} are discretized and computed at NN timesteps, each with the same dimension DD as the diffused data. The extra memory cost is 𝒪(ND)\mathcal{O}(ND), including the precomputed terms in Appendix C.1.2, and is rather small compared to the pretrained model (e.g., only \sim125M in total for Stable-Diffusion, compared to \sim4G for the model itself).

The pre-computation time for estimating EMS is rather short. The EMS introduced by our method can be effectively estimated on around 1k datapoints within hours (Appendix D), which is rather short compared to the long training/distillation time of other methods. Moreover, the integrals of these extra coefficients are just some vector constants that can be pre-computed within seconds, as shown in Appendix C.1.2. The precomputing is done only once before sampling.

The extra computational overhead of DPM-Solver-v3 during sampling is negligible. Once we obtain the estimated EMS and their integrals at discrete timesteps, they can be regarded as constants. Thus, during the subsequent sampling process, the computational overhead is the same as previous training-free methods (such as DPM-Solver++) with negligible differences (Appendix E).

J.2 Flexibility

The pre-computed EMS can be applied to any timestep schedule during sampling without re-computing the EMS. Besides, we compute the EMS on unconditional models, and it can be used for a wide range of guidance scales (such as cfg=7.5 in Stable Diffusion). In short, the EMS is flexible and easy to adopt in downstream applications.

  • Time schedule: The choice of N,KN,K for the EMS is disentangled from the timestep schedule used in sampling. Once we have estimated the EMS at NN (e.g., 1200) timesteps, they can be flexibly adapted to any schedule (uniform λ\lambda/uniform tt…) during sampling, by mapping the actual sampling timesteps to the NN bins (see the sketch after this list). For a different timestep schedule, we only need to re-precompute Eλs,λt(k)E_{\lambda_{s},\lambda_{t}}^{(k)} in Appendix C.1.2, and the time cost is within seconds.

  • Guided sampling: We compute the EMS on the unconditional model for all guided cases. Empirically, the EMS computed on the model without guidance (the unconditional part) performs more stably than that computed on the model with guidance, and accelerates the sampling procedure across a wide range of guidance scales (including the common guidance scales used in pretrained models). We think that the unconditional model contains common information (image priors) shared by all conditions, such as color, sketch, and other image patterns. Extracting this information helps correct some common biases such as shallow color, low saturation and lack of detail. In contrast, the conditional model depends on the condition and has a large variance.
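The mapping from sampling timesteps to EMS bins mentioned above can be sketched as follows (an assumed nearest-bin implementation; names are placeholders):

    import torch

    # Minimal sketch: map each sampling half-logSNR to its nearest EMS bin, so the
    # precomputed l, s, b (and their integrals) can be reused for any schedule.
    def map_to_ems_bins(lambda_grid, lambda_sampling):
        # lambda_grid: (N,) sorted grid where the EMS were estimated
        # lambda_sampling: (M,) half-logSNRs of the actual sampling timesteps
        idx = torch.searchsorted(lambda_grid, lambda_sampling).clamp(1, len(lambda_grid) - 1)
        left, right = lambda_grid[idx - 1], lambda_grid[idx]
        idx = torch.where(lambda_sampling - left < right - lambda_sampling, idx - 1, idx)
        return idx   # indices into the stored EMS arrays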

Remark J.1.

EMS computed on models without guidance cannot work at extremely large guidance scales (e.g., a cfg scale of 15 for Stable Diffusion), since in this case the condition has a large impact on the denoising process. Note that at these extremely large scales, the sampling quality is very low (compared to a cfg scale of 7.5) and such scales are rarely used in practice. Therefore, our proposed EMS without guidance is sufficient for common applications with common guidance scales.

J.3 Stability

Figure 8: Visualization of the EMS 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} and their integrals w.r.t. λ\lambda, estimated with ScoreSDE [51] on CIFAR10 [24]. 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} fluctuate considerably, but their integrals are smooth enough to ensure the stability of DPM-Solver-v3.

As shown in Section 4.2, the estimated EMS 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} appear highly fluctuating, especially for ScoreSDE on CIFAR10. We would like to clarify that the fluctuating 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} are not an issue, and our sampler is stable:

  • The fluctuation of 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} on ScoreSDE is intrinsic and not due to the estimation error. As we increase the number of samples to decrease the estimation error, the fluctuation is not reduced. We attribute it to the periodicity of trigonometric functions in the positional timestep embedding as stated in Section 4.2.

  • Moreover, we only need to consider the integrals of 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} in the ODE solution Eq. (9). As shown in Figure 8, the integrals of 𝒔λ,𝒃λ\bm{s}_{\lambda},\bm{b}_{\lambda} are rather smooth, which ensures the stability of our method. Therefore, there is no need for extra smoothing.

J.4 Practical Value

When NFE is around 20, our improvement in sample quality is small, because all fast samplers based on diffusion ODEs have almost converged. What matters, therefore, is that our method converges faster to good sample quality. As we observe, the less diverse the domain, the more evident the speed-up. For example, on LSUN-Bedroom, our method achieves up to 40% faster convergence.

To sum up, the practical value of DPM-Solver-v3 lies in the following aspects:

  1. 15\sim30% speed-up can save substantial costs for online text-to-image applications. Our speed-ups are applicable to large text-to-image diffusion models, which are an important part of today’s AIGC community. As recent models become larger, a single NFE requires more computational resources, and a 15\sim30% speed-up can considerably reduce the expenses of commercial usage.

  2. Improvement at 5\sim10 NFEs benefits image previewing. Since the samples are controlled by the random seed, coarse samples with 5\sim10 NFEs can be used to preview thousands of samples at low cost and to guide the choice of the random seed, which can then be used to generate fine samples with best-quality sampling strategies. This is especially useful for text-to-image generation. Since our method achieves better quality and converges faster to the ground-truth sample, it provides better guidance when used for previewing.

Appendix K Additional Samples

We provide more visual samples in Figure 9, Figure 10, Figure 11 and Table 14 to demonstrate the qualitative effectiveness of DPM-Solver-v3. It can be seen that the samples of DPM-Solver-v3 have higher visual quality than those of previous state-of-the-art solvers. Our method generates images with reduced bias (less “shallow” in color), a higher saturation level and more visual details, as mentioned in Section 4.3.

Method NFE = 5 NFE = 10
DPM-Solver++ [32] FID 28.53 FID 4.01
UniPC [58] FID 23.71 FID 3.93
DPM-Solver-v3 (Ours) FID 12.76 FID 3.40
Figure 9: Random samples of ScoreSDE [51] on CIFAR10 dataset [24] with only 5 and 10 NFE.
Method NFE = 5 NFE = 10
Heun’s 2nd [21] FID 320.80 FID 16.57
DPM-Solver++ [32] FID 24.54 FID 2.91
UniPC [58] FID 23.52 FID 2.85
DPM-Solver-v3 (Ours) FID 12.21 FID 2.51
Figure 10: Random samples of EDM [21] on CIFAR10 dataset [24] with only 5 and 10 NFE.
DPM-Solver++ [32] (FID 11.02)
UniPC [58] (FID 10.19)
DPM-Solver-v3 (Ours) (FID 9.70)
Figure 11: Random samples of Guided-Diffusion [10] on ImageNet-256 dataset [9] with a classifier guidance scale 2.0, using only 7 NFE. We manually remove the potentially disturbing images such as those containing snakes or insects.
Table 14: Additional samples of Stable-Diffusion [43] with a classifier-free guidance scale 7.5, using only 5 NFE and selected text prompts. Some displayed prompts are truncated.
Text Prompts DPM-Solver++ [32] (MSE 0.60) UniPC [58] (MSE 0.65) DPM-Solver-v3 (Ours) (MSE 0.55)
“pixar movie still portrait photo of madison beer, jessica alba, woman, as hero catgirl cyborg woman by pixar, by greg rutkowski, wlop, rossdraws, artgerm, weta, marvel, rave girl, leeloo, unreal engine, glossy skin, pearlescent, wet, bright morning, anime, sci-fi, maxim magazine cover” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“oil painting with heavy impasto of a pirate ship and its captain, cosmic horror painting, elegant intricate artstation concept art by craig mullins detailed” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“environment living room interior, mid century modern, indoor garden with fountain, retro, m vintage, designer furniture made of wood and plastic, concrete table, wood walls, indoor potted tree, large window, outdoor forest landscape, beautiful sunset, cinematic, concept art, sunstainable architecture, octane render, utopia, ethereal, cinematic light” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“the living room of a cozy wooden house with a fireplace, at night, interior design, concept art, wallpaper, warm, digital art. art by james gurney and larry elmore.” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“Full page concept design how to craft life Poison, intricate details, infographic of alchemical, diagram of how to make potions, captions, directions, ingredients, drawing, magic, wuxia” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“Fantasy art, octane render, 16k, 8k, cinema 4d, back-lit, caustics, clean environment, Wood pavilion architecture, warm led lighting, dusk, Landscape, snow, arctic, with aqua water, silver Guggenheim museum spire, with rays of sunshine, white fabric landscape, tall building, zaha hadid and Santiago calatrava, smooth landscape, cracked ice, igloo, warm lighting, aurora borialis, 3d cgi, high definition, natural lighting, realistic, hyper realism” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“tree house in the forest, atmospheric, hyper realistic, epic composition, cinematic, landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild, 4k detailed post processing, artstation, unreal engine” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“A trail through the unknown, atmospheric, hyper realistic, 8k, epic composition, cinematic, octane render, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution, Landscape veduta photo by Dustin Lefevre & tdraw, 8k resolution, detailed landscape painting by Ivan Shishkin, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild, 4k detailed post processing, artstation, rendering by octane, unreal engine” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
“postapocalyptic city turned to fractal glass, ctane render, 8 k, exploration, cinematic, trending on artstation, by beeple, realistic, 3 5 mm camera, unreal engine, hyper detailed, photo–realistic maximum detai, volumetric light, moody cinematic epic concept art, realistic matte painting, hyper photorealistic, concept art, cinematic epic, octane render, 8k, corona render, movie concept art, octane render, 8 k, corona render, trending on artstation, cinematic composition, ultra–detailed, hyper–realistic, volumetric lighting” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
““WORLDS”: zoological fantasy ecosystem infographics, magazine layout with typography, annotations, in the style of Elena Masci, Studio Ghibli, Caspar David Friedrich, Daniel Merriam, Doug Chiang, Ivan Aivazovsky, Herbert Bauer, Edward Tufte, David McCandless” [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]