
Image Restoration Through Generalized Ornstein-Uhlenbeck Bridge

Conghan Yue    Zhengwei Peng    Junlong Ma    Shiyan Du    Pengxu Wei    Dongyu Zhang
Abstract

Diffusion models exhibit powerful generative capabilities, enabling the mapping of noise to data via reverse stochastic differential equations. In image restoration, however, the focus is on the mapping from low-quality to high-quality images. To address this, we introduce the Generalized Ornstein-Uhlenbeck Bridge (GOUB) model. By leveraging the natural mean-reverting property of the generalized OU process and further eliminating the variance of its steady-state distribution through Doob's h-transform, we achieve a point-to-point diffusion mapping that enables the recovery of high-quality images from low-quality ones. Moreover, we unravel the fundamental mathematical essence shared by various bridge models, all of which are special cases of GOUB, and empirically demonstrate the optimality of our proposed models. Additionally, we present the corresponding Mean-ODE model, which is adept at capturing both pixel-level details and structural perceptions. Experimental results showcase the state-of-the-art performance achieved by both models across diverse tasks, including inpainting, deraining, and super-resolution. Code is available at https://github.com/Hammour-steak/GOUB.

Diffusion Model, Diffusion Bridge, Image Restoration

1 Introduction

Image restoration involves recovering high-quality (HQ) images from their low-quality (LQ) versions (Banham & Katsaggelos, 1997; Zhou et al., 1988; Liang et al., 2021; Luo et al., 2023b), and is often characterized as an ill-posed inverse problem due to the loss of crucial information during the degradation from high-quality to low-quality images. It encompasses a suite of classical tasks, including image deraining (Zhang & Patel, 2017; Yang et al., 2020; Xiao et al., 2022), denoising (Zhang et al., 2018a; Li et al., 2022; Soh & Cho, 2022; Zhang et al., 2023a), deblurring (Yuan et al., 2007; Kong et al., 2023), inpainting (Jain et al., 2023; Zhang et al., 2023b), and super-resolution (Dong et al., 2015; Zamfir et al., 2023; Wei et al., 2023), among others.

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song & Ermon, 2019; Song et al., 2021b; Karras et al., 2022) have also been applied to image restoration, yielding favorable results (Ho & Salimans, 2021; Wang et al., 2023; Su et al., 2022; Shi et al., 2024). They mainly follow the standard forward process, diffusing images into pure noise and using low-quality images as conditions to guide the generation of high-quality images (Dhariwal & Nichol, 2021; Ho & Salimans, 2021; Kawar et al., 2021; Saharia et al., 2022; Kawar et al., 2022; Chung et al., 2022b, a; Wang et al., 2023). However, these approaches require integrating substantial prior knowledge specific to each task, such as degradation matrices, which limits their universality.

Furthermore, some studies have attempted to establish a point-to-point mapping from low-quality to high-quality images, learning the general degradation and restoration process and thus circumventing the need for additional prior information to model specific tasks (Chen et al., 2022; Cui et al., 2023; Lee et al., 2024). In terms of diffusion models, this mapping can be realized through a bridge (Liu et al., 2022; Su et al., 2022; Liu et al., 2023a), a stochastic process with fixed starting and ending points. By assigning high-quality and low-quality images to the starting and ending points and initiating from the low-quality images, high-quality images can be obtained by applying the reverse diffusion process, thereby enabling image restoration. However, some bridge models face challenges in learning likelihoods (Liu et al., 2022), necessitating reliance on cumbersome iterative approximation methods (De Bortoli et al., 2021; Su et al., 2022; Shi et al., 2024), which pose significant constraints in practical applications; others do not consider the selection of the diffusion process and ignore its optimality (Liu et al., 2023a; Li et al., 2023; Zhou et al., 2024), which may introduce unnecessary costs and limit model performance.

This paper proposes a novel image restoration bridge model, the Generalized Ornstein-Uhlenbeck Bridge (GOUB), depicted in Figure 1. Owing to the mean-reverting property of the Generalized Ornstein-Uhlenbeck (GOU) process, it gradually diffuses the HQ image into a noisy LQ state (denoted as $\mathbf{x}_{T}+\lambda\epsilon$ in Figure 1). By applying Doob's h-transform to the GOU process, we modify the diffusion process to eliminate the noise on $\mathbf{x}_{T}$ and directly bridge the HQ image and its LQ counterpart. The model defines a point-to-point forward diffusion process and learns its reverse through maximum likelihood estimation, ensuring that it can restore a low-quality image to the corresponding high-quality image while avoiding limited generality and costly iterative approximation. Our main contributions can be summarized as follows:

  • We introduce GOUB, a novel image restoration bridge model that eliminates the variance at the ending point of the GOU process, directly connecting high- and low-quality images; it is particularly expressive in deep visual features and diversity.

  • Benefiting from the distinctive features of the parameterization mechanism, we introduce the corresponding Mean-ODE model, demonstrating a strong ability to capture pixel-level details and structural perceptions.

  • We uncover the mathematical essence of several bridge models, all of which are special cases of the GOUB, and empirically demonstrate the optimality of our proposed models.

  • Our model has achieved state-of-the-art results on numerous image restoration tasks, such as inpainting, deraining, and super-resolution.

Figure 1: Overview of the proposed GOUB for image restoration. The GOU process is capable of transferring an HQ image into a noisy LQ image. Additionally, through the application of h-transform, we can eliminate the noise on LQ, enabling the GOUB model to precisely bridge the gap between HQ and LQ.

2 Preliminaries

2.1 Score-based Diffusion Model

The score-based diffusion model (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021b) is a category of generative model that seamlessly transitions data into noise via a diffusion process and generates samples by learning and applying the reverse process (Anderson, 1982). Assume a dataset consists of $n$-dimensional independent and identically distributed (i.i.d.) samples following an unknown distribution denoted by $p(\mathbf{x}_{0})$. The time-dependent forward process of the diffusion model can be described by the following SDE:

\mathrm{d}\mathbf{x}_{t}=\mathbf{f}\left(\mathbf{x}_{t},t\right)\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}, (1)

where $\mathbf{f}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ is the drift coefficient, $g_{t}:\mathbb{R}\rightarrow\mathbb{R}$ is the scalar diffusion coefficient and $\mathbf{w}_{t}$ denotes the standard Brownian motion. Typically, $p(\mathbf{x}_{0})$ evolves over time $t$ from $0$ to a sufficiently large $T$ into $p(\mathbf{x}_{T})$ through the SDE, such that $p(\mathbf{x}_{T})$ will approximate a standard Gaussian distribution $p_{\text{prior}}(\mathbf{x})$. Meanwhile, the forward SDE has a corresponding reverse-time SDE (Anderson, 1982) whose closed form is given by:

\mathrm{d}\mathbf{x}_{t}=\left[\mathbf{f}\left(\mathbf{x}_{t},t\right)-g^{2}_{t}\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t})\right]\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}. (2)

Starting from time $T$, $p(\mathbf{x}_{T})$ can progressively transform to $p(\mathbf{x}_{0})$ by traversing the trajectory of the reverse SDE. The score $\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t})$ can generally be parameterized as $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)$, and conditional score matching (Vincent, 2011) can be employed as the loss function for training:

\begin{aligned}
\mathcal{L}&=\frac{1}{2}\int_{0}^{T}\mathbb{E}_{\mathbf{x}_{t}}\Big[\lambda\left(t\right)\left\|\nabla_{\mathbf{x}_{t}}\log p\left(\mathbf{x}_{t}\right)-\mathbf{s}_{\bm{\theta}}\left(\mathbf{x}_{t},t\right)\right\|^{2}\Big]\mathrm{d}t\\
&\propto\frac{1}{2}\int_{0}^{T}\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}\Big[\lambda\left(t\right)\left\|\nabla_{\mathbf{x}_{t}}\log p\left(\mathbf{x}_{t}\mid\mathbf{x}_{0}\right)-\mathbf{s}_{\bm{\theta}}\left(\mathbf{x}_{t},t\right)\right\|^{2}\Big]\mathrm{d}t,
\end{aligned} (3)

where $\lambda(t)$ serves as a weighting function; choosing $\lambda(t)=g^{2}_{t}$ yields a tighter upper bound on the negative log-likelihood (Song et al., 2021a). The second line is the form most commonly used in practice, as the conditional probability $p(\mathbf{x}_{t}\mid\mathbf{x}_{0})$ is generally accessible. Ultimately, one can sample $\mathbf{x}_{T}$ from the prior distribution $p(\mathbf{x}_{T})\approx p_{\text{prior}}(\mathbf{x})$ and obtain $\mathbf{x}_{0}$ by numerically solving Equation (2) through iterative steps, thereby completing the generation process.
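To make the second line of Equation (3) concrete, the following is a minimal sketch of one denoising score matching training step, assuming a forward kernel of the form $p(\mathbf{x}_{t}\mid\mathbf{x}_{0})=N(\alpha_{t}\mathbf{x}_{0},\sigma_{t}^{2}\bm{I})$ with known $\alpha_{t},\sigma_{t}$ and the weighting $\lambda(t)=\sigma_{t}^{2}$; the `ScoreNet` architecture and toy data are placeholders, not the networks used in this paper.

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):                       # placeholder for s_theta(x_t, t)
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))

def dsm_loss(score_net, x0, alpha_t, sigma_t, t):
    """Conditional score matching with lambda(t) = sigma_t^2 (second line of Eq. 3)."""
    eps = torch.randn_like(x0)
    x_t = alpha_t[:, None] * x0 + sigma_t[:, None] * eps
    # grad_{x_t} log p(x_t | x_0) = -(x_t - alpha_t * x_0) / sigma_t^2 = -eps / sigma_t
    target = -eps / sigma_t[:, None]
    s = score_net(x_t, t)
    return 0.5 * (sigma_t[:, None] ** 2 * (s - target) ** 2).sum(-1).mean()

x0 = torch.randn(16, 8)                          # toy batch of 8-dimensional samples
t = 0.01 + 0.99 * torch.rand(16)                 # avoid t = 0, where sigma_t vanishes
alpha_t, sigma_t = torch.exp(-t), torch.sqrt(1 - torch.exp(-2 * t))   # illustrative OU-type kernel
loss = dsm_loss(ScoreNet(8), x0, alpha_t, sigma_t, t)
```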

2.2 Generalized Ornstein-Uhlenbeck process

The Generalized Ornstein-Uhlenbeck (GOU) process is a time-varying extension of the OU process (Ahmad, 1988). It is a stationary Gaussian-Markov process whose marginal distribution gradually tends towards a stable mean and variance over time. The GOU process is generally defined as follows:

\mathrm{d}\mathbf{x}_{t}=\theta_{t}\left(\bm{\mu}-\mathbf{x}_{t}\right)\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}, (4)

where $\bm{\mu}$ is a given state vector, $\theta_{t}$ denotes a scalar drift coefficient and $g_{t}$ represents the diffusion coefficient. At the same time, we require $\theta_{t},g_{t}$ to satisfy the specified relationship $2\lambda^{2}=g^{2}_{t}/\theta_{t}$, where $\lambda^{2}$ is a given constant scalar. As a result, its transition probability possesses a closed-form analytical solution:

\begin{gathered}
p\left(\mathbf{x}_{t}\mid\mathbf{x}_{s}\right)=N(\mathbf{\bar{m}}_{s:t},\bar{\sigma}_{s:t}^{2}\bm{I})=N\left(\bm{\mu}+\left(\mathbf{x}_{s}-\bm{\mu}\right)e^{-\bar{\theta}_{s:t}},\,\frac{g^{2}_{t}}{2\theta_{t}}\left(1-e^{-2\bar{\theta}_{s:t}}\right)\bm{I}\right),\\
\bar{\theta}_{s:t}=\int_{s}^{t}\theta_{z}\,\mathrm{d}z.
\end{gathered} (5)

A simple proof is provided in Appendix C. For the sake of simplicity in subsequent representations, we denote $\bar{\theta}_{0:t}$ and $\bar{\sigma}_{0:t}$ as $\bar{\theta}_{t}$ and $\bar{\sigma}_{t}$, respectively. Consequently, $p(\mathbf{x}_{t})$ will steadily converge towards a Gaussian distribution with mean $\bm{\mu}$ and variance $\lambda^{2}$ as time $t$ progresses, meaning that it exhibits the mean-reverting property.
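Because the transition (5) is Gaussian with a closed-form mean and variance, $\mathbf{x}_{t}$ can be sampled from $\mathbf{x}_{s}$ in one step. Below is a minimal sketch, assuming a constant $\theta_{t}=\theta$ (so $\bar{\theta}_{s:t}=\theta(t-s)$) and using $g_{t}^{2}/(2\theta_{t})=\lambda^{2}$; the values of $\theta$ and $\lambda$ are illustrative only.

```python
import torch

def gou_sample(x_s, mu, s, t, theta=1.0, lam=0.5):
    """One-step sample from p(x_t | x_s) in Eq. (5), with constant theta_t = theta."""
    theta_bar = theta * (t - s)                                 # \bar{theta}_{s:t}
    mean = mu + (x_s - mu) * torch.exp(-theta_bar)
    std = lam * torch.sqrt(1.0 - torch.exp(-2.0 * theta_bar))   # g_t^2 / (2 theta_t) = lam^2
    return mean + std * torch.randn_like(x_s)

x0 = torch.zeros(3, 8)                                          # toy HQ state
mu = torch.ones(3, 8)                                           # LQ state the process reverts to
xT = gou_sample(x0, mu, torch.tensor(0.0), torch.tensor(5.0))   # approx N(mu, lam^2 I) for large t
```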

2.3 Doob’s h-transform

Doob’s h-transform (Särkkä & Solin, 2019) is a mathematical technique applied to stochastic processes. It involves transforming the original process by incorporating a specific h-function into the drift term of the SDE, modifying the process to pass through a predetermined terminal point. More precisely, given the SDE (1), if it is desired to pass through the given fixed point $\mathbf{x}_{T}$ at $t=T$, an additional drift term must be incorporated into the original SDE:

\mathrm{d}\mathbf{x}_{t}=\left[\mathbf{f}(\mathbf{x}_{t},t)+g^{2}_{t}\mathbf{h}(\mathbf{x}_{t},t,\mathbf{x}_{T},T)\right]\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}, (6)

where $\mathbf{h}(\mathbf{x}_{t},t,\mathbf{x}_{T},T)=\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{T}\mid\mathbf{x}_{t})$ and $\mathbf{x}_{0}$ starts from $p\left(\mathbf{x}_{0}\mid\mathbf{x}_{T}\right)$. A simple proof can be found in Appendix D. In comparison to (1), the marginal distribution of (6) is conditioned on $\mathbf{x}_{T}$, with its forward conditional probability density given by $p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})$ satisfying the forward Kolmogorov equation that is defined by (6). Intuitively, $p(\mathbf{x}_{T}\mid\mathbf{x}_{0},\mathbf{x}_{T})=1$ at $t=T$, ensuring that the SDE invariably passes through the specified point $\mathbf{x}_{T}$ for any initial state $\mathbf{x}_{0}$.
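As a standard illustration (a textbook special case, not specific to this paper's method), take $\mathbf{f}=\mathbf{0}$ and a constant $g$ in (1). Then $p(\mathbf{x}_{T}\mid\mathbf{x}_{t})=N(\mathbf{x}_{t},g^{2}(T-t)\bm{I})$, and the h-function and the pinned SDE (6) reduce to the familiar Brownian bridge:

\mathbf{h}(\mathbf{x}_{t},t,\mathbf{x}_{T},T)=\frac{\mathbf{x}_{T}-\mathbf{x}_{t}}{g^{2}(T-t)},\qquad \mathrm{d}\mathbf{x}_{t}=\frac{\mathbf{x}_{T}-\mathbf{x}_{t}}{T-t}\,\mathrm{d}t+g\,\mathrm{d}\mathbf{w}_{t}.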

3 GOUB

The GOU process (4) is characterized by its mean-reverting property: if we take the initial state $\mathbf{x}_{0}$ to represent a high-quality image and the corresponding low-quality image $\mathbf{x}_{T}=\bm{\mu}$ as the final condition, then the high-quality image gradually converges to a Gaussian distribution with the low-quality image as its mean and a stable variance $\lambda^{2}$. This naturally connects information between high- and low-quality images, offering an inherent advantage in image restoration. However, the initial state of the reverse process necessitates artificially adding noise to the low-quality image, resulting in some information loss and thus affecting performance (Luo et al., 2023a).

In actuality, image restoration is more concerned with connections between points (Liu et al., 2022; De Bortoli et al., 2021; Su et al., 2022; Li et al., 2023; Zhou et al., 2024). Coincidentally, Doob's h-transform can modify an SDE such that it passes through a specified $\mathbf{x}_{T}$ at terminal time $T$. Accordingly, it is crucial to note that applying the h-transform to the GOU process effectively eliminates the impact of terminal noise, directly bridging a point-to-point relationship between high-quality and low-quality images.

3.1 Forward and backward process

Applying the h-transform, we can readily derive the forward process of the GOUB, leading to the following proposition:

Proposition 3.1.

Let $\mathbf{x}_{t}$ be a finite random variable described by the given generalized Ornstein-Uhlenbeck process (4) and suppose $\mathbf{x}_{T}=\bm{\mu}$; then the evolution of its marginal distribution $p(\mathbf{x}_{t}\mid\mathbf{x}_{T})$ satisfies the following SDE:

\mathrm{d}\mathbf{x}_{t}=\left(\theta_{t}+g^{2}_{t}\frac{e^{-2\bar{\theta}_{t:T}}}{\bar{\sigma}_{t:T}^{2}}\right)(\mathbf{x}_{T}-\mathbf{x}_{t})\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}. (7)

Additionally, the forward transition $p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})$ is given by:

\begin{gathered}
p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})=N(\mathbf{\bar{m}}^{\prime}_{t},\bar{\sigma}^{\prime 2}_{t}\bm{I}),\\
\mathbf{\bar{m}}^{\prime}_{t}=e^{-\bar{\theta}_{t}}\frac{\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}\mathbf{x}_{0}+\left[\left(1-e^{-\bar{\theta}_{t}}\right)\frac{\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}+e^{-2\bar{\theta}_{t:T}}\frac{\bar{\sigma}_{t}^{2}}{\bar{\sigma}_{T}^{2}}\right]\mathbf{x}_{T},\\
\bar{\sigma}^{\prime 2}_{t}=\frac{\bar{\sigma}_{t}^{2}\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}.
\end{gathered} (8)

The derivation of the proposition is provided in Appendix A.1. With Proposition 3.1, there is no need to perform multi-step forward iteration using the SDE; instead, we can directly use its closed-form solution for one-step forward sampling.
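Concretely, the one-step forward sampling implied by Equation (8) can be sketched as follows, assuming a constant $\theta_{t}=\theta$ and the constraint $2\lambda^{2}=g_{t}^{2}/\theta_{t}$ (so $\bar{\sigma}_{s:t}^{2}=\lambda^{2}(1-e^{-2\theta(t-s)})$); the helper names and parameter values are illustrative, not those of the released code.

```python
import torch

def sigma2(s, t, theta, lam):                                # \bar{sigma}^2_{s:t}
    return lam ** 2 * (1.0 - torch.exp(-2.0 * theta * (t - s)))

def goub_forward_sample(x0, xT, t, T, theta=1.0, lam=0.5):
    """One-step sample from p(x_t | x_0, x_T) in Eq. (8)."""
    s2_t, s2_tT, s2_T = sigma2(0.0, t, theta, lam), sigma2(t, T, theta, lam), sigma2(0.0, T, theta, lam)
    e_t, e_tT = torch.exp(-theta * t), torch.exp(-theta * (T - t))
    mean = e_t * (s2_tT / s2_T) * x0 + ((1 - e_t) * (s2_tT / s2_T) + e_tT ** 2 * (s2_t / s2_T)) * xT
    std = torch.sqrt(s2_t * s2_tT / s2_T)                    # \bar{sigma}'_t
    return mean + std * torch.randn_like(x0)

x0, xT = torch.zeros(3, 8), torch.ones(3, 8)                 # toy HQ / LQ pair
x_mid = goub_forward_sample(x0, xT, t=torch.tensor(0.5), T=torch.tensor(1.0))
```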

Similarly, applying the previous SDE theory enables us to easily derive the reverse process, which leads to the following Proposition 3.2:

Proposition 3.2.

The reverse SDE of Equation (7) has the marginal distribution $p(\mathbf{x}_{t}\mid\mathbf{x}_{T})$ and is given by:

\mathrm{d}\mathbf{x}_{t}=\left[\left(\theta_{t}+g^{2}_{t}\frac{e^{-2\bar{\theta}_{t:T}}}{\bar{\sigma}_{t:T}^{2}}\right)(\mathbf{x}_{T}-\mathbf{x}_{t})-g^{2}_{t}\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t}\mid\mathbf{x}_{T})\right]\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}, (9)

and there exists a probability flow ODE:

\mathrm{d}\mathbf{x}_{t}=\left[\left(\theta_{t}+g^{2}_{t}\frac{e^{-2\bar{\theta}_{t:T}}}{\bar{\sigma}_{t:T}^{2}}\right)(\mathbf{x}_{T}-\mathbf{x}_{t})-\frac{1}{2}g^{2}_{t}\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t}\mid\mathbf{x}_{T})\right]\mathrm{d}t. (10)

We can initiate from a low-quality image $\mathbf{x}_{T}$ and apply Euler sampling to solve the reverse SDE or ODE for restoration.
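For concreteness, a minimal Euler discretization of the reverse SDE (9) might look as follows, assuming an already-trained score network `score_net(x, xT, t)`, a constant $\theta_{t}=\theta$, and illustrative hyperparameters; this is a sketch rather than the released implementation.

```python
import math
import torch

@torch.no_grad()
def goub_reverse_sde(score_net, xT, T=1.0, N=100, theta=1.0, lam=0.5):
    """Euler sampling of the reverse SDE (9), starting from the LQ image x_T."""
    dt = T / N
    g2 = 2.0 * lam ** 2 * theta                    # g_t^2 = 2 * lam^2 * theta_t
    x = xT.clone()
    for i in range(N, 0, -1):
        t = i * dt
        if i == N:
            coef = theta                           # at t = T the h-term is 0/0, but x = x_T, so drop it
        else:
            s2_tT = lam ** 2 * (1.0 - math.exp(-2.0 * theta * (T - t)))
            coef = theta + g2 * math.exp(-2.0 * theta * (T - t)) / s2_tT
        drift = coef * (xT - x) - g2 * score_net(x, xT, t)
        x = x - drift * dt + (g2 * dt) ** 0.5 * torch.randn_like(x)
    return x

dummy_score = lambda x, xT, t: torch.zeros_like(x)   # stand-in for the trained s_theta
x0_hat = goub_reverse_sde(dummy_score, torch.ones(3, 8))
```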

3.2 Training objective

The score term $\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t}\mid\mathbf{x}_{T})$ can be parameterized by a neural network $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\mathbf{x}_{T},t)$ and can be estimated using the loss function (3). Unfortunately, training the score function for SDEs generally presents a significant challenge. Nevertheless, since the analytical form of GOUB is directly obtainable, we will introduce the use of maximum likelihood for training, which yields a more stable loss function.

We first discretize the continuous time interval $[0,T]$ into $N$ sufficiently fine-grained intervals in a reasonable manner, denoted as $\{\mathbf{x}_{t}\}_{t\in[0,N]}$ with $\mathbf{x}_{N}=\mathbf{x}_{T}$. We are concerned with maximizing the log-likelihood, which leads us to the following proposition:

Proposition 3.3.

Let $\mathbf{x}_{t}$ be a finite random variable described by the given generalized Ornstein-Uhlenbeck process (4). For a fixed $\mathbf{x}_{T}$, the expectation of the log-likelihood $\mathbb{E}_{p(\mathbf{x}_{0})}[\log p_{\bm{\theta}}(\mathbf{x}_{0}\mid\mathbf{x}_{T})]$ possesses an Evidence Lower Bound (ELBO):

\begin{aligned}
ELBO=\mathbb{E}_{p(\mathbf{x}_{0})}\Bigg[&\,\mathbb{E}_{p\left(\mathbf{x}_{1}\mid\mathbf{x}_{0}\right)}\left[\log p_{\bm{\theta}}\left(\mathbf{x}_{0}\mid\mathbf{x}_{1},\mathbf{x}_{T}\right)\right]\\
&-\sum_{t=2}^{T}\mathbb{E}_{p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}\Big[KL\left(p\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}\right)\,\|\,p_{\bm{\theta}}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_{T}\right)\right)\Big]\Bigg]
\end{aligned} (11)

Assuming $p_{\bm{\theta}}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_{T}\right)$ is a Gaussian distribution with a constant variance, $N(\bm{\mu}_{\bm{\theta},t-1},\sigma_{\bm{\theta},t-1}^{2}\bm{I})$, maximizing the ELBO is equivalent to minimizing:

\mathcal{L}=\mathbb{E}_{t,\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}}\left[\frac{1}{2\sigma_{\bm{\theta},t-1}^{2}}\|\bm{\mu}_{t-1}-\bm{\mu}_{\bm{\theta},t-1}\|^{2}\right], (12)

where $\bm{\mu}_{t-1}$ represents the mean of $p\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}\right)$:

\bm{\mu}_{t-1}=\frac{1}{\bar{\sigma}^{\prime 2}_{t}}\left[\bar{\sigma}^{\prime 2}_{t-1}(\mathbf{x}_{t}-b\mathbf{x}_{T})a+(\bar{\sigma}^{\prime 2}_{t}-\bar{\sigma}^{\prime 2}_{t-1}a^{2})\mathbf{\bar{m}}^{\prime}_{t}\right], (13)

where,

\begin{aligned}
a&=\frac{e^{-\bar{\theta}_{t-1:t}}\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{t-1:T}^{2}},\\
b&=\frac{1}{\bar{\sigma}_{T}^{2}}\left\{(1-e^{-\bar{\theta}_{t}})\bar{\sigma}^{2}_{t:T}+e^{-2\bar{\theta}_{t:T}}\bar{\sigma}_{t}^{2}-\left[(1-e^{-\bar{\theta}_{t-1}})\bar{\sigma}^{2}_{t-1:T}+e^{-2\bar{\theta}_{t-1:T}}\bar{\sigma}_{t-1}^{2}\right]a\right\}.
\end{aligned}

The derivation of the proposition is provided in Appendix A.2. With Proposition 3.3, we can easily construct the training objective. In this work, we parameterize $\bm{\mu}_{\bm{\theta},t-1}$ from the differential form of the SDE, which can be derived from Equation (9):

\mathbf{x}_{t-1}=\mathbf{x}_{t}-\left(\theta_{t}+g^{2}_{t}\frac{e^{-2\bar{\theta}_{t:T}}}{\bar{\sigma}_{t:T}^{2}}\right)(\mathbf{x}_{T}-\mathbf{x}_{t})+g^{2}_{t}\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t}\mid\mathbf{x}_{T})-g_{t}\bm{\epsilon}_{t}, (14)

where $\bm{\epsilon}_{t}\sim N(\mathbf{0},\mathrm{d}t\,\bm{I})$; therefore:

\begin{aligned}
\bm{\mu}_{\bm{\theta},t-1}&=\mathbf{x}_{t}-\left(\theta_{t}+g^{2}_{t}\frac{e^{-2\bar{\theta}_{t:T}}}{\bar{\sigma}_{t:T}^{2}}\right)(\mathbf{x}_{T}-\mathbf{x}_{t})+g^{2}_{t}\nabla_{\mathbf{x}_{t}}\log p_{\bm{\theta}}(\mathbf{x}_{t}\mid\mathbf{x}_{T}),\\
\sigma_{\bm{\theta},t-1}&=g_{t}.
\end{aligned} (15)

Inspired by conditional score matching, we can parameterize the noise as $\bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_{t},\mathbf{x}_{T},t)$, so the score $\nabla_{\mathbf{x}_{t}}\log p_{\bm{\theta}}(\mathbf{x}_{t}\mid\mathbf{x}_{T})$ can be represented as $-\bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_{t},\mathbf{x}_{T},t)/\bar{\sigma}^{\prime}_{t}$. In addition, during our empirical research, we found that utilizing an L1 loss yields enhanced image reconstruction outcomes (Boyd & Vandenberghe, 2004; Hastie et al., 2009). This approach enables the model to learn pixel-level details more easily, resulting in markedly improved visual quality. Therefore, the final training objective is:

\mathcal{L}=\mathbb{E}_{t,\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}}\Bigg[\frac{1}{2g_{t}^{2}}\bigg\|\frac{1}{\bar{\sigma}^{\prime 2}_{t}}\left[\bar{\sigma}^{\prime 2}_{t-1}(\mathbf{x}_{t}-b\mathbf{x}_{T})a+(\bar{\sigma}^{\prime 2}_{t}-\bar{\sigma}^{\prime 2}_{t-1}a^{2})\mathbf{\bar{m}}^{\prime}_{t}\right]-\mathbf{x}_{t}+\left(\theta_{t}+g^{2}_{t}\frac{e^{-2\bar{\theta}_{t:T}}}{\bar{\sigma}_{t:T}^{2}}\right)(\mathbf{x}_{T}-\mathbf{x}_{t})+\frac{g^{2}_{t}}{\bar{\sigma}^{\prime}_{t}}\bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_{t},\mathbf{x}_{T},t)\bigg\|\Bigg] (16)

Consequently, once we obtain the optimal $\bm{\epsilon}_{\bm{\theta}}^{*}(\mathbf{x}_{t},\mathbf{x}_{T},t)$, we can compute the score $\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t}\mid\mathbf{x}_{T})\approx-\bm{\epsilon}_{\bm{\theta}}^{*}(\mathbf{x}_{t},\mathbf{x}_{T},t)/\bar{\sigma}^{\prime}_{t}$ for the reverse process. Starting from a low-quality image $\mathbf{x}_{T}$, we can recover $\mathbf{x}_{0}$ by using Equation (9) to perform reverse iteration.
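To tie Equations (8), (13), (15), and (16) together, the following is a minimal, self-contained sketch of the L1 training objective, assuming a constant $\theta_{t}=\theta$, a unit-spaced grid with $t\in\{1,\dots,N-1\}$ (the terminal step is deterministic and is skipped here), and a placeholder noise network `eps_net`; the hyperparameter values are illustrative and the released code may organize this differently.

```python
import math
import torch

def sig2(s, t, theta, lam):                                      # \bar{sigma}^2_{s:t}
    return lam ** 2 * (1 - torch.exp(-2 * theta * (t - s)))

def goub_l1_loss(eps_net, x0, xT, N=100, theta=0.01, lam=0.5):
    B = x0.shape[0]
    t = torch.randint(1, N, (B, 1)).float()                      # t in {1, ..., N-1}
    g2 = 2 * lam ** 2 * theta
    s2_N = lam ** 2 * (1 - math.exp(-2 * theta * N))
    s2_t, s2_tm1 = sig2(0, t, theta, lam), sig2(0, t - 1, theta, lam)
    s2_tN, s2_tm1N = sig2(t, N, theta, lam), sig2(t - 1, N, theta, lam)
    # one-step forward sample x_t ~ p(x_t | x_0, x_T), Eq. (8)
    m_t = torch.exp(-theta * t) * (s2_tN / s2_N) * x0 \
        + ((1 - torch.exp(-theta * t)) * (s2_tN / s2_N) + torch.exp(-2 * theta * (N - t)) * (s2_t / s2_N)) * xT
    var_t, var_tm1 = s2_t * s2_tN / s2_N, s2_tm1 * s2_tm1N / s2_N
    x_t = m_t + var_t.sqrt() * torch.randn_like(x0)
    # coefficients a, b and the posterior mean mu_{t-1}, Eq. (13)
    a = math.exp(-theta) * s2_tN / s2_tm1N
    b = ((1 - torch.exp(-theta * t)) * s2_tN + torch.exp(-2 * theta * (N - t)) * s2_t
         - ((1 - torch.exp(-theta * (t - 1))) * s2_tm1N + torch.exp(-2 * theta * (N - t + 1)) * s2_tm1) * a) / s2_N
    mu_true = (var_tm1 * (x_t - b * xT) * a + (var_t - var_tm1 * a ** 2) * m_t) / var_t
    # parameterized mean, Eq. (15), with score = -eps_theta / sigma'_t
    score = -eps_net(x_t, xT, t) / var_t.sqrt()
    mu_pred = x_t - (theta + g2 * torch.exp(-2 * theta * (N - t)) / s2_tN) * (xT - x_t) + g2 * score
    return ((mu_true - mu_pred).abs() / (2 * g2)).mean()         # L1 form of the objective (16)

dummy_eps = lambda x_t, xT, t: torch.zeros_like(x_t)             # stand-in for eps_theta(x_t, x_T, t)
loss = goub_l1_loss(dummy_eps, torch.zeros(4, 8), torch.ones(4, 8))
```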

3.3 Mean-ODE

Unlike standard diffusion models, our parameterization of the mean $\bm{\mu}_{\bm{\theta},t-1}$ is derived from the differential form of the SDE, which effectively combines the characteristics of discrete diffusion models and continuous score-based generative models. In the reverse process, the value of each sampling step is driven towards the true mean during training. Therefore, we propose a Mean-ODE model, which omits the Brownian noise term:

\mathrm{d}\mathbf{x}_{t}=\left[\left(\theta_{t}+g^{2}_{t}\frac{e^{-2\bar{\theta}_{t:T}}}{\bar{\sigma}_{t:T}^{2}}\right)(\mathbf{x}_{T}-\mathbf{x}_{t})-g^{2}_{t}\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t}\mid\mathbf{x}_{T})\right]\mathrm{d}t. (17)

To simplify the exposition, we use GOUB to denote the GOUB (SDE) sampling model and Mean-ODE to denote the GOUB (Mean-ODE) sampling model. Our experiments below demonstrate that the Mean-ODE is more effective than the corresponding Score-ODE at capturing the pixel-level details and structural perceptions of images, playing a pivotal role in image restoration tasks. Concurrently, the SDE model (9) is more focused on deep visual features and diversity.
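Relative to the reverse-SDE sampler sketched in Section 3.1, a Mean-ODE step simply drops the Brownian increment while keeping the full $g_{t}^{2}$ coefficient on the score (in contrast to the probability flow ODE (10), which halves it). A minimal sketch under the same assumptions (constant $\theta_{t}=\theta$, $t<T$, illustrative parameters):

```python
import math

def goub_mean_ode_step(x, xT, t, dt, score_net, T=1.0, theta=1.0, lam=0.5):
    """One Euler step of the Mean-ODE (17): same drift as the reverse SDE, no noise term."""
    g2 = 2.0 * lam ** 2 * theta
    s2_tT = lam ** 2 * (1.0 - math.exp(-2.0 * theta * (T - t)))
    coef = theta + g2 * math.exp(-2.0 * theta * (T - t)) / s2_tT
    drift = coef * (xT - x) - g2 * score_net(x, xT, t)
    return x - drift * dt                                        # g_t * dw term omitted
```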

4 Experiments

We conduct experiments on three popular image restoration tasks: image inpainting, image deraining, and image super-resolution. Four metrics are employed for model evaluation: Peak Signal-to-Noise Ratio (PSNR) for assessing reconstruction quality, Structural Similarity Index (SSIM) (Wang et al., 2004) for gauging structural perception, Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018b) for evaluating the depth and quality of features, and Fréchet Inception Distance (FID) (Heusel et al., 2017) for measuring the diversity of generated images. More experimental details are presented in Appendix E.

Image Inpainting.

Image inpainting involves filling in missing or damaged parts of an image to restore or enhance its overall visual effect. We select the CelebA-HQ 256×256 dataset (Karras et al., 2018) for both training and testing, with 100 thin masks. We compare our models with several current baseline inpainting approaches, such as PromptIR (Potlapalli et al., 2023), DDRM (Kawar et al., 2022) and IR-SDE (Luo et al., 2023a). The relevant experimental results are shown in Table 1 and Figure 2. It is observed that the two proposed models achieve state-of-the-art results in their respective areas of strength and also deliver highly competitive outcomes on the other metrics. From a visual perspective, our model excels in capturing details such as eyebrows, eyes, and image backgrounds.

Table 1: Image Inpainting. Quantitative comparison with the relevant baselines on CelebA-HQ.

METHOD PSNR↑ SSIM↑ LPIPS↓ FID↓
PromptIR 30.22 0.9180 0.068 32.69
DDRM 27.16 0.8993 0.089 37.02
IR-SDE 28.37 0.9166 0.046 25.13
GOUB 28.98 0.9067 0.037 4.30
Mean-ODE 31.39 0.9392 0.052 12.24
Figure 2: Qualitative comparison of the visual results of different inpainting methods on the CelebA-HQ dataset with thin mask.

Image Deraining.

We select the Rain100H dataset (Yang et al., 2017) for training and testing, which includes 1800 pairs of training images and 100 images for testing. It is important to note that in this task, as with other deraining models, we report the PSNR and SSIM scores on the Y channel (YCbCr space). We compare against state-of-the-art approaches: MPRNet (Zamir et al., 2021), M3SNet-32 (Gao et al., 2023), MAXIM (Tu et al., 2022), MHNet (Gao & Dang, 2023), and IR-SDE (Luo et al., 2023a). The relevant experimental results are shown in Table 2 and Figure 3. Similarly, both models achieve SOTA results in their respective areas of strength on the deraining task. Visually, it can also be observed that our model excels in capturing details such as the moon, the sun, and tree branches.

Table 2: Image Deraining. Quantitative comparison with the relevant baselines on Rain100H.

METHOD PSNR↑ SSIM↑ LPIPS↓ FID↓
MPRNet 30.41 0.8906 0.158 61.59
M3SNet-32 30.64 0.8920 0.154 60.26
MAXIM 30.81 0.9027 0.133 58.72
MHNet 31.08 0.8990 0.126 57.93
IR-SDE 31.65 0.9041 0.047 18.64
GOUB 31.96 0.9028 0.046 18.14
Mean-ODE 34.56 0.9414 0.077 32.83
Figure 3: Qualitative comparison of the visual results of different deraining methods on the Rain100H dataset.

Image Super-Resolution.

Single image super-resolution aims to recover a higher-resolution and clearer version of a low-resolution image. We conducted training and evaluation on the DIV2K validation set for 4× upscaling (Agustsson & Timofte, 2017), and all low-resolution images were bicubically rescaled to the same size as their corresponding high-resolution images. To show that our models are in line with the state of the art, we compare with DDRM (Kawar et al., 2022) and IR-SDE (Luo et al., 2023a). The relevant experimental results are provided in Table 3 and Figure 4. As can be seen, our GOUB is superior to the benchmarks on various indicators and handles visual details, such as edges and hair, better.

Table 3: Image 4× Super-Resolution. Quantitative comparison with the relevant baselines on DIV2K.

METHOD PSNR↑ SSIM↑ LPIPS↓ FID↓
DDRM 24.35 0.5927 0.364 78.71
IR-SDE 25.90 0.6570 0.231 45.36
GOUB 26.89 0.7478 0.220 20.85
Mean-ODE 28.50 0.8070 0.328 22.14
Figure 4: Qualitative comparison of the visual results of different 4x super-resolution methods on the DIV2K dataset.

Superiority of Mean-ODE.

Additionally, we conduct ablation experiments using the corresponding Score-ODE (10) model to demonstrate the superiority of our proposed Mean-ODE model in image restoration. From Table 4, it is evident that the performance of the Mean-ODE is significantly superior to that of the corresponding Score-ODE. This is because each sampling step of the Mean-ODE directly approximates the true mean learned during training, as opposed to parameterized approaches such as DDPM, which rely on expectations. Consequently, our proposed Mean-ODE demonstrates better reconstruction and is more suitable for image restoration tasks.

Table 4: Quantitative comparison with the corresponding Score-ODE on various tasks.

METHOD      Image Inpainting                   Image Deraining                    Image 4× Super-Resolution
            PSNR↑   SSIM↑    LPIPS↓   FID↓     PSNR↑   SSIM↑    LPIPS↓   FID↓     PSNR↑   SSIM↑    LPIPS↓   FID↓
Score-ODE   18.23   0.6266   0.389    161.54   13.64   0.7404   0.338    191.15   28.14   0.7993   0.344    25.51
Mean-ODE    31.39   0.9392   0.052    12.24    34.56   0.9414   0.077    32.83    28.50   0.8070   0.328    22.14

5 Analysis

Table 5: Quantitative comparison with the different bridge models on the CelebA-HQ, Rain100H, and DIV2K datasets.

METHOD   Image Inpainting                   Image Deraining                    Image 4× Super-Resolution
         PSNR↑   SSIM↑    LPIPS↓   FID↓     PSNR↑   SSIM↑    LPIPS↓   FID↓     PSNR↑   SSIM↑    LPIPS↓   FID↓
VEB      27.75   0.8943   0.056    13.70    30.39   0.8975   0.059    28.54    24.21   0.5808   0.384    36.55
VPB      27.32   0.8841   0.049    11.87    30.89   0.8847   0.051    23.36    25.40   0.6041   0.342    29.17
GOUB     28.98   0.9067   0.037    4.30     31.96   0.9028   0.046    18.14    26.89   0.7478   0.220    20.85
Figure 5: Qualitative comparison with the different bridge models in many tasks.

Doob's h-transform of the generalized Ornstein-Uhlenbeck process, also known as the conditional GOU process, has been an intriguing topic in previous applied mathematical research (Salminen, 1984; Cheridito et al., 2003; Heng et al., 2021). On account of the mean-reverting property of the GOU process, applying the h-transform is the most straightforward way to eliminate the variance and drive the process towards a Dirac distribution in its steady state, which is highly advantageous for image restoration. In previous research on diffusion models, there has been limited focus on the choice of $\mathbf{f}$ or $g$; works generally used the VE process (Song et al., 2021b), represented by NCSN (Song & Ermon, 2019), or the VP process (Song et al., 2021b), represented by DDPM (Ho et al., 2020).

In this section, we demonstrate that the mathematical essence of several recent meaningful diffusion bridge models is the same (Li et al., 2023; Zhou et al., 2024; Liu et al., 2023a): they all represent Brownian bridge (Chow, 2009) models; details are provided in Appendix B.1. We also find that the VE and VP processes are special cases of GOU, leading to the following proposition:

Proposition 5.1.

For a given GOU process (4), the following relationships hold:

\lim_{\theta_{t}\rightarrow 0}\text{GOU}=\text{VE},\qquad \lim_{\bm{\mu}\rightarrow 0,\,\lambda\rightarrow 1}\text{GOU}=\text{VP}. (18)
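For intuition only (the full argument is in Appendix B.2): with the constraint $g_{t}^{2}=2\lambda^{2}\theta_{t}$ substituted into (4), letting $\theta_{t}\rightarrow 0$ removes the mean-reverting drift so that only the diffusion term remains (VE), while setting $\bm{\mu}=\mathbf{0}$, $\lambda\rightarrow 1$ and writing $\beta_{t}:=2\theta_{t}$ recovers the VP SDE:

\mathrm{d}\mathbf{x}_{t}=g_{t}\mathrm{d}\mathbf{w}_{t}\quad(\theta_{t}\rightarrow 0),\qquad \mathrm{d}\mathbf{x}_{t}=-\tfrac{1}{2}\beta_{t}\mathbf{x}_{t}\,\mathrm{d}t+\sqrt{\beta_{t}}\,\mathrm{d}\mathbf{w}_{t}\quad(\bm{\mu}=\mathbf{0},\ \lambda\rightarrow 1).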

Details are provided in Appendix B.2. Therefore, we conduct experiments on the VE Bridge (VEB) (Li et al., 2023; Zhou et al., 2024; Liu et al., 2023a) and the VP Bridge (VPB) (Zhou et al., 2024) to demonstrate the optimality of our proposed GOUB model in image restoration. We keep all model hyperparameters consistent, and the results are shown in Table 5 and Figure 5.

It can be seen that under the same configuration of model hyperparameters, the performance of the GOUB is notably superior to the other two types of bridge models, which demonstrates the optimality of GOUB and also highlights the importance of the choice of diffusion process in diffusion models.

6 Related Works

Conditional Generation.

As previously highlighted, in work on image restoration using diffusion models, the focus of some research has predominantly been on using low-quality images as conditional inputs $\mathbf{y}$ to guide the generation process. These methods (Kawar et al., 2021; Saharia et al., 2022; Kawar et al., 2022; Chung et al., 2022a, b, 2023; Zhao et al., 2023; Murata et al., 2023; Feng et al., 2023) all endeavor to solve or approximate the classifier gradient $\nabla_{\mathbf{x}_{t}}\log p(\mathbf{y}\mid\mathbf{x}_{t})$, necessitating the incorporation of additional prior knowledge to model specific degradation processes, which is both complex and lacking in universality.

Diffusion Bridge.

This segment of work obviates the need for prior knowledge by constructing a diffusion bridge model from high-quality to low-quality images, thereby learning the degradation process. The previously mentioned approaches (Liu et al., 2022; De Bortoli et al., 2021; Su et al., 2022; Liu et al., 2023a; Shi et al., 2024; Li et al., 2023; Zhou et al., 2024; Albergo et al., 2023) fall into this class and are characterized by significant computational expense in obtaining solutions and by model frameworks that are not optimal. Additionally, some flow-based models (Lipman et al., 2023; Liu et al., 2023b; Tong et al., 2023; Albergo & Vanden-Eijnden, 2023; Delbracio & Milanfar, 2023) also belong to the diffusion bridge family and face similar issues.

7 Conclusion

In this paper, we introduced the Generalized Ornstein-Uhlenbeck Bridge (GOUB) model, a diffusion bridge model that applies Doob's h-transform to the GOU process. This model can address general image restoration tasks without the need for task-specific prior knowledge. Furthermore, we have uncovered the mathematical essence of several bridge models and empirically demonstrated the optimality of our proposed model. In addition, based on our unique mean parameterization mechanism, we proposed the Mean-ODE model. Experimental results indicate that both models achieve state-of-the-art results in their respective areas of strength on various tasks, including inpainting, deraining, and super-resolution. We believe that the exploration of diffusion processes and bridge models holds significant importance not only in the field of image restoration but also in advancing the study of generative diffusion models.

Impact Statements

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Agustsson & Timofte (2017) Agustsson, E. and Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp.  126–135, 2017.
  • Ahmad (1988) Ahmad, R. Introduction to stochastic differential equations, 1988.
  • Albergo & Vanden-Eijnden (2023) Albergo, M. and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. In Proceedings of the International Conference on Learning Representations (ICLR), 2023.
  • Albergo et al. (2023) Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
  • Anderson (1982) Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
  • Banham & Katsaggelos (1997) Banham, M. R. and Katsaggelos, A. K. Digital image restoration. IEEE signal processing magazine, 14(2):24–41, 1997.
  • Boyd & Vandenberghe (2004) Boyd, S. P. and Vandenberghe, L. Convex optimization. Cambridge university press, 2004.
  • Chen et al. (2022) Chen, L., Chu, X., Zhang, X., and Sun, J. Simple baselines for image restoration. In European Conference on Computer Vision, pp.  17–33. Springer, 2022.
  • Cheridito et al. (2003) Cheridito, P., Kawaguchi, H., and Maejima, M. Fractional ornstein-uhlenbeck processes. 2003.
  • Chow (2009) Chow, W. C. Brownian bridge. Wiley interdisciplinary reviews: computational statistics, 1(3):325–332, 2009.
  • Chung et al. (2022a) Chung, H., Kim, J., Mccann, M. T., Klasky, M. L., and Ye, J. C. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2022a.
  • Chung et al. (2022b) Chung, H., Sim, B., Ryu, D., and Ye, J. C. Improving diffusion models for inverse problems using manifold constraints. Advances in Neural Information Processing Systems, 35:25683–25696, 2022b.
  • Chung et al. (2023) Chung, H., Kim, J., Kim, S., and Ye, J. C. Parallel diffusion models of operator and image for blind inverse problems. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  6059–6069, 2023.
  • Cui et al. (2023) Cui, Y., Ren, W., Cao, X., and Knoll, A. Focal network for image restoration. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  13001–13011, 2023.
  • De Bortoli et al. (2021) De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.
  • Delbracio & Milanfar (2023) Delbracio, M. and Milanfar, P. Inversion by direct iteration: An alternative to denoising diffusion for image restoration. Transactions on Machine Learning Research, 2023.
  • Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Dong et al. (2015) Dong, C., Loy, C. C., He, K., and Tang, X. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015.
  • Feng et al. (2023) Feng, B. T., Smith, J., Rubinstein, M., Chang, H., Bouman, K. L., and Freeman, W. T. Score-based diffusion models as principled priors for inverse imaging. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  10520–10531, 2023.
  • Gao & Dang (2023) Gao, H. and Dang, D. Mixed hierarchy network for image restoration. arXiv preprint arXiv:2302.09554, 2023.
  • Gao et al. (2023) Gao, H., Yang, J., Zhang, Y., Wang, N., Yang, J., and Dang, D. A mountain-shaped single-stage network for accurate image restoration. arXiv preprint arXiv:2305.05146, 2023.
  • Hastie et al. (2009) Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  • Heng et al. (2021) Heng, J., De Bortoli, V., Doucet, A., and Thornton, J. Simulating diffusion bridges with score matching. arXiv preprint arXiv:2111.07243, 2021.
  • Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Ho & Salimans (2021) Ho, J. and Salimans, T. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Jain et al. (2023) Jain, J., Zhou, Y., Yu, N., and Shi, H. Keys to better image inpainting: Structure and texture go hand in hand. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  208–217, 2023.
  • Karras et al. (2018) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • Karras et al. (2022) Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
  • Kawar et al. (2021) Kawar, B., Vaksman, G., and Elad, M. Snips: Solving noisy inverse problems stochastically. Advances in Neural Information Processing Systems, 34:21757–21769, 2021.
  • Kawar et al. (2022) Kawar, B., Elad, M., Ermon, S., and Song, J. Denoising diffusion restoration models. Advances in Neural Information Processing Systems, 35:23593–23606, 2022.
  • Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • Kong et al. (2023) Kong, L., Dong, J., Ge, J., Li, M., and Pan, J. Efficient frequency domain-based transformers for high-quality image deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5886–5895, 2023.
  • Lee et al. (2024) Lee, H., Kang, K., Lee, H., Baek, S.-H., and Cho, S. Ugpnet: Universal generative prior for image restoration. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  1598–1608, 2024.
  • Li et al. (2022) Li, B., Liu, X., Hu, P., Wu, Z., Lv, J., and Peng, X. All-in-one image restoration for unknown corruption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  17452–17462, 2022.
  • Li et al. (2023) Li, B., Xue, K., Liu, B., and Lai, Y.-K. Bbdm: Image-to-image translation with brownian bridge diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1952–1961, 2023.
  • Liang et al. (2021) Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., and Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  1833–1844, 2021.
  • Lipman et al. (2023) Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In Proceedings of the International Conference on Learning Representations (ICLR), 2023.
  • Liu et al. (2022) Liu, G.-H., Chen, T., So, O., and Theodorou, E. Deep generalized schrödinger bridge. Advances in Neural Information Processing Systems, 35:9374–9388, 2022.
  • Liu et al. (2023a) Liu, G.-H., Vahdat, A., Huang, D.-A., Theodorou, E. A., Nie, W., and Anandkumar, A. I2sb: image-to-image schrödinger bridge. In Proceedings of the 40th International Conference on Machine Learning, pp.  22042–22062, 2023a.
  • Liu et al. (2023b) Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. In Proceedings of the International Conference on Learning Representations (ICLR), 2023b.
  • Luo et al. (2023a) Luo, Z., Gustafsson, F. K., Zhao, Z., Sjölund, J., and Schön, T. B. Image restoration with mean-reverting stochastic differential equations. In International Conference on Machine Learning, pp.  23045–23066. PMLR, 2023a.
  • Luo et al. (2023b) Luo, Z., Gustafsson, F. K., Zhao, Z., Sjölund, J., and Schön, T. B. Refusion: Enabling large-size realistic image restoration with latent-space diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1680–1691, 2023b.
  • Murata et al. (2023) Murata, N., Saito, K., Lai, C.-H., Takida, Y., Uesaka, T., Mitsufuji, Y., and Ermon, S. Gibbsddrm: a partially collapsed gibbs sampler for solving blind inverse problems with denoising diffusion restoration. In Proceedings of the 40th International Conference on Machine Learning, pp.  25501–25522, 2023.
  • Nichol & Dhariwal (2021) Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp.  8162–8171. PMLR, 2021.
  • Potlapalli et al. (2023) Potlapalli, V., Zamir, S. W., Khan, S., and Khan, F. S. Promptir: Prompting for all-in-one blind image restoration. arXiv preprint arXiv:2306.13090, 2023.
  • Risken & Risken (1996) Risken, H. and Risken, H. Fokker-planck equation. Springer, 1996.
  • Saharia et al. (2022) Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713–4726, 2022.
  • Salminen (1984) Salminen, P. On conditional ornstein-uhlenbeck processes. Advances in Applied Probability, 16(4):920–922, 1984. ISSN 00018678. URL http://www.jstor.org/stable/1427347.
  • Särkkä & Solin (2019) Särkkä, S. and Solin, A. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019.
  • Shi et al. (2024) Shi, Y., De Bortoli, V., Campbell, A., and Doucet, A. Diffusion schrödinger bridge matching. Advances in Neural Information Processing Systems, 36, 2024.
  • Soh & Cho (2022) Soh, J. W. and Cho, N. I. Variational deep image restoration. IEEE Transactions on Image Processing, 31:4363–4376, 2022.
  • Sohl-Dickstein et al. (2015) Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp.  2256–2265. PMLR, 2015.
  • Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • Song et al. (2021a) Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428, 2021a.
  • Song et al. (2021b) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In Proceedings of the International Conference on Learning Representations (ICLR), 2021b.
  • Su et al. (2022) Su, X., Song, J., Meng, C., and Ermon, S. Dual diffusion implicit bridges for image-to-image translation. In The Eleventh International Conference on Learning Representations, 2022.
  • Tong et al. (2023) Tong, A., Malkin, N., FATRAS, K., Atanackovic, L., Zhang, Y., Huguet, G., Wolf, G., and Bengio, Y. Simulation-free schrödinger bridges via score and flow matching. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023.
  • Tu et al. (2022) Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., and Li, Y. Maxim: Multi-axis mlp for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5769–5780, 2022.
  • Vincent (2011) Vincent, P. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • Wang et al. (2023) Wang, Y., Yu, J., Yu, R., and Zhang, J. Unlimited-size diffusion restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1160–1167, 2023.
  • Wang et al. (2004) Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Wei et al. (2023) Wei, P., Xie, Z., Li, G., and Lin, L. Taylor neural network for real-world image super-resolution. IEEE Transactions on Image Processing, 32:1942–1951, 2023.
  • Xiao et al. (2022) Xiao, J., Fu, X., Liu, A., Wu, F., and Zha, Z.-J. Image de-raining transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • Yang et al. (2017) Yang, W., Tan, R. T., Feng, J., Liu, J., Guo, Z., and Yan, S. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1357–1366, 2017.
  • Yang et al. (2020) Yang, W., Tan, R. T., Wang, S., Fang, Y., and Liu, J. Single image deraining: From model-based to data-driven and beyond. IEEE Transactions on pattern analysis and machine intelligence, 43(11):4059–4077, 2020.
  • Yuan et al. (2007) Yuan, L., Sun, J., Quan, L., and Shum, H.-Y. Image deblurring with blurred/noisy image pairs. In ACM SIGGRAPH 2007 papers, pp.  1–es. 2007.
  • Zamfir et al. (2023) Zamfir, E., Conde, M. V., and Timofte, R. Towards real-time 4k image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1522–1532, 2023.
  • Zamir et al. (2021) Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S., Yang, M.-H., and Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  14821–14831, 2021.
  • Zhang et al. (2023a) Zhang, D., Zhou, F., Jiang, Y., and Fu, Z. Mm-bsn: Self-supervised image denoising for real-world with multi-mask based on blind-spot network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4188–4197, 2023a.
  • Zhang et al. (2023b) Zhang, G., Ji, J., Zhang, Y., Yu, M., Jaakkola, T. S., and Chang, S. Towards coherent image inpainting using denoising diffusion implicit models. 2023b.
  • Zhang & Patel (2017) Zhang, H. and Patel, V. M. Convolutional sparse and low-rank coding-based rain streak removal. In 2017 IEEE Winter conference on applications of computer vision (WACV), pp.  1259–1267. IEEE, 2017.
  • Zhang et al. (2018a) Zhang, K., Zuo, W., and Zhang, L. Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, 2018a.
  • Zhang et al. (2018b) Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  586–595, 2018b.
  • Zhao et al. (2023) Zhao, Z., Bai, H., Zhu, Y., Zhang, J., Xu, S., Zhang, Y., Zhang, K., Meng, D., Timofte, R., and Van Gool, L. Ddfm: denoising diffusion model for multi-modality image fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  8082–8093, 2023.
  • Zhou et al. (2024) Zhou, L., Lou, A., Khanna, S., and Ermon, S. Denoising diffusion bridge models. In Proceedings of the International Conference on Learning Representations (ICLR), 2024.
  • Zhou et al. (1988) Zhou, Y.-T., Chellappa, R., Vaid, A., and Jenkins, B. K. Image restoration using a neural network. IEEE transactions on acoustics, speech, and signal processing, 36(7):1141–1151, 1988.

Appendix A Proof

A.1 Proof of Proposition 3.1

Proposition 3.1. Let $\mathbf{x}_{t}$ be a finite random variable described by the given generalized Ornstein-Uhlenbeck process (4) and suppose $\mathbf{x}_{T}=\bm{\mu}$; then the evolution of its marginal distribution $p(\mathbf{x}_{t}\mid\mathbf{x}_{T})$ satisfies the following SDE:

\mathrm{d}\mathbf{x}_{t}=\left(\theta_{t}+g^{2}_{t}\frac{e^{-2\bar{\theta}_{t:T}}}{\bar{\sigma}_{t:T}^{2}}\right)(\mathbf{x}_{T}-\mathbf{x}_{t})\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}, (7)

additionally, the forward transition $p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})$ is given by:

\begin{aligned}
p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})&=N(\mathbf{\bar{m}}^{\prime}_{t},\bar{\sigma}^{\prime 2}_{t}\bm{I})\\
&=N\left(e^{-\bar{\theta}_{t}}\frac{\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}\mathbf{x}_{0}+\left[\left(1-e^{-\bar{\theta}_{t}}\right)\frac{\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}+e^{-2\bar{\theta}_{t:T}}\frac{\bar{\sigma}_{t}^{2}}{\bar{\sigma}_{T}^{2}}\right]\mathbf{x}_{T},\,\frac{\bar{\sigma}_{t}^{2}\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}\bm{I}\right)
\end{aligned} (8)

Proof: Based on (5), we have:

p\left(\mathbf{x}_{t}\mid\mathbf{x}_{0}\right)=N\left(\mathbf{x}_{T}+\left(\mathbf{x}_{0}-\mathbf{x}_{T}\right)e^{-\bar{\theta}_{t}},\,\bar{\sigma}_{t}^{2}\bm{I}\right) (19)
p\left(\mathbf{x}_{T}\mid\mathbf{x}_{t}\right)=N\left(\mathbf{x}_{T}+\left(\mathbf{x}_{t}-\mathbf{x}_{T}\right)e^{-\bar{\theta}_{t:T}},\,\bar{\sigma}_{t:T}^{2}\bm{I}\right) (20)
p\left(\mathbf{x}_{T}\mid\mathbf{x}_{0}\right)=N\left(\mathbf{x}_{T}+\left(\mathbf{x}_{0}-\mathbf{x}_{T}\right)e^{-\bar{\theta}_{T}},\,\bar{\sigma}_{T}^{2}\bm{I}\right) (21)

Firstly, the h-function can be computed directly:

\begin{aligned}
\mathbf{h}(\mathbf{x}_{t},t,\mathbf{x}_{T},T)&=\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\\
&=-\nabla_{\mathbf{x}_{t}}\frac{\left(\mathbf{x}_{t}-\mathbf{x}_{T}\right)^{2}e^{-2\bar{\theta}_{t:T}}}{2\bar{\sigma}_{t:T}^{2}}\\
&=(\mathbf{x}_{T}-\mathbf{x}_{t})\frac{e^{-2\bar{\theta}_{t:T}}}{\bar{\sigma}_{t:T}^{2}}
\end{aligned} (22)

Therefore, following Doob's h-transform (6), the SDE satisfied by the marginal distribution $p(\mathbf{x}_{t}\mid\mathbf{x}_{T})$ is:

\begin{aligned}
\mathrm{d}\mathbf{x}_{t}&=\left[\mathbf{f}(\mathbf{x}_{t},t)+g^{2}_{t}\mathbf{h}(\mathbf{x}_{t},t,\mathbf{x}_{T},T)\right]\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}\\
&=\left(\theta_{t}+g^{2}_{t}\frac{e^{-2\bar{\theta}_{t:T}}}{\bar{\sigma}_{t:T}^{2}}\right)(\mathbf{x}_{T}-\mathbf{x}_{t})\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}
\end{aligned} (23)

Furthermore, we can derive the following transition probability of $\mathbf{x}_{t}$ using Bayes' formula:

\begin{aligned}
p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})&=\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t},\mathbf{x}_{0})p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\\
&=\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t})p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}
\end{aligned} (24)

Since each component is independently and identically distributed (i.i.d), by considering a single dimension, we have:

\begin{aligned}
p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})&\propto\frac{1}{\sqrt{2\pi}\,\bar{\sigma}_{t}\bar{\sigma}_{t:T}/\bar{\sigma}_{T}}\exp\left\{-\left[\frac{(\mathbf{x}_{t}-[\mathbf{x}_{T}+\left(\mathbf{x}_{0}-\mathbf{x}_{T}\right)e^{-\bar{\theta}_{t}}])^{2}}{2\bar{\sigma}_{t}^{2}}+\frac{(\mathbf{x}_{T}-[\mathbf{x}_{T}+\left(\mathbf{x}_{t}-\mathbf{x}_{T}\right)e^{-\bar{\theta}_{t:T}}])^{2}}{2\bar{\sigma}_{t:T}^{2}}\right]\right\}\\
&=\frac{1}{\sqrt{2\pi}\,\bar{\sigma}_{t}\bar{\sigma}_{t:T}/\bar{\sigma}_{T}}\exp\left\{-\left[\frac{(\mathbf{x}_{t}-[\mathbf{x}_{T}+\left(\mathbf{x}_{0}-\mathbf{x}_{T}\right)e^{-\bar{\theta}_{t}}])^{2}}{2\bar{\sigma}_{t}^{2}}+\frac{\left(\mathbf{x}_{t}-\mathbf{x}_{T}\right)^{2}e^{-2\bar{\theta}_{t:T}}}{2\bar{\sigma}_{t:T}^{2}}\right]\right\}\\
&\propto\frac{1}{\sqrt{2\pi}\,\bar{\sigma}_{t}\bar{\sigma}_{t:T}/\bar{\sigma}_{T}}\exp\left\{-\left[\left(\frac{1}{2\bar{\sigma}_{t}^{2}}+\frac{e^{-2\bar{\theta}_{t:T}}}{2\bar{\sigma}_{t:T}^{2}}\right)\mathbf{x}_{t}^{2}-\left(\frac{\mathbf{x}_{T}+\left(\mathbf{x}_{0}-\mathbf{x}_{T}\right)e^{-\bar{\theta}_{t}}}{\bar{\sigma}_{t}^{2}}+\frac{\mathbf{x}_{T}e^{-2\bar{\theta}_{t:T}}}{\bar{\sigma}_{t:T}^{2}}\right)\mathbf{x}_{t}\right]\right\}
\end{aligned} (25)

Notice that:

12σ¯t2+e2θ¯t:T2σ¯t:T2\displaystyle\frac{1}{2\bar{\sigma}_{t}^{2}}+\frac{e^{-2\bar{\theta}_{t:T}}}{2\bar{\sigma}_{t:T}^{2}} =σt:T2+σ¯t2e2θ¯t:T2σ¯t2σ¯t:T2\displaystyle=\frac{\sigma_{t:T}^{2}+\bar{\sigma}_{t}^{2}e^{-2\bar{\theta}_{t:T}}}{2\bar{\sigma}_{t}^{2}\bar{\sigma}_{t:T}^{2}} (26)
=λ2[(1e2θ¯t:T)+(1e2θ¯t)e2θ¯t:T]2σ¯t2σ¯t:T2\displaystyle=\frac{\lambda^{2}\left[(1-e^{-2\bar{\theta}_{t:T}})+(1-e^{-2\bar{\theta}_{t}})e^{-2\bar{\theta}_{t:T}}\right]}{2\bar{\sigma}_{t}^{2}\bar{\sigma}_{t:T}^{2}}
=λ2[(1e2θ¯t:T)+(e2θ¯t:Te2θ¯T)]2σ¯t2σ¯t:T2\displaystyle=\frac{\lambda^{2}\left[(1-e^{-2\bar{\theta}_{t:T}})+(e^{-2\bar{\theta}_{t:T}}-e^{-2\bar{\theta}_{T}})\right]}{2\bar{\sigma}_{t}^{2}\bar{\sigma}_{t:T}^{2}}
=σ¯T22σ¯t2σ¯t:T2\displaystyle=\frac{\bar{\sigma}_{T}^{2}}{2\bar{\sigma}_{t}^{2}\bar{\sigma}_{t:T}^{2}}
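This identity can also be checked numerically; the following sketch uses arbitrarily chosen values of \lambda, \bar{\theta}_{t} and \bar{\theta}_{t:T} (illustrative only, not part of the derivation):

import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(0.1, 2.0)
theta_bar_t, theta_bar_tT = rng.uniform(0.1, 3.0, size=2)     # \bar{theta}_t and \bar{theta}_{t:T}
theta_bar_T = theta_bar_t + theta_bar_tT                      # \bar{theta}_T

sig2 = lambda th: lam ** 2 * (1.0 - np.exp(-2.0 * th))        # \bar{sigma}^2 as a function of \bar{theta}
lhs = 1.0 / (2 * sig2(theta_bar_t)) + np.exp(-2 * theta_bar_tT) / (2 * sig2(theta_bar_tT))
rhs = sig2(theta_bar_T) / (2 * sig2(theta_bar_t) * sig2(theta_bar_tT))
assert np.isclose(lhs, rhs)                                   # the identity (26) holds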

Substituting this back into (25), completing the square and reorganizing, we obtain:

\displaystyle p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\displaystyle\propto\frac{1}{\sqrt{2\pi}\bar{\sigma}_{t}\bar{\sigma}_{t:T}/\bar{\sigma}_{T}}\exp{-\left\{\frac{\bar{\sigma}_{T}^{2}}{2\bar{\sigma}_{t}^{2}\bar{\sigma}_{t:T}^{2}}\mathbf{x}_{t}^{2}-\left(\frac{\mathbf{x}_{T}+\left(\mathbf{x}_{0}-\mathbf{x}_{T}\right)e^{-\bar{\theta}_{t}}}{\bar{\sigma}_{t}^{2}}+\frac{\mathbf{x}_{T}e^{-2\bar{\theta}_{t:T}}}{\bar{\sigma}_{t:T}^{2}}\right)\mathbf{x}_{t}\right\}} (27)
\displaystyle=\frac{1}{\sqrt{2\pi}\bar{\sigma}_{t}\bar{\sigma}_{t:T}/\bar{\sigma}_{T}}\exp{-\left\{\frac{\mathbf{x}_{t}^{2}-\left(\left[\mathbf{x}_{T}+\left(\mathbf{x}_{0}-\mathbf{x}_{T}\right)e^{-\bar{\theta}_{t}}\right]\frac{2\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}+e^{-2\bar{\theta}_{t:T}}\frac{2\bar{\sigma}_{t}^{2}}{\bar{\sigma}_{T}^{2}}\mathbf{x}_{T}\right)\mathbf{x}_{t}}{2(\bar{\sigma}_{t}\bar{\sigma}_{t:T}/\bar{\sigma}_{T})^{2}}\right\}}
12πσ¯tσ¯t:T/σ¯Texp{𝐱teθ¯tσ¯t:T2σ¯T2𝐱0[(1eθ¯t)σ¯t:T2σ¯T2+e2θ¯t:Tσ¯t2σ¯T2]𝐱T}22(σ¯tσ¯t:T/σ¯T)2\displaystyle\propto\frac{1}{\sqrt{2\pi}\bar{\sigma}_{t}\bar{\sigma}_{t:T}/\bar{\sigma}_{T}}\exp-\frac{\left\{\mathbf{x}_{t}-e^{-\bar{\theta}_{t}}\frac{\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}\mathbf{x}_{0}-\left[\left(1-e^{-\bar{\theta}_{t}}\right)\frac{\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}+e^{-2\bar{\theta}_{t:T}}\frac{\bar{\sigma}_{t}^{2}}{\bar{\sigma}_{T}^{2}}\right]\mathbf{x}_{T}\right\}^{2}}{2(\bar{\sigma}_{t}\bar{\sigma}_{t:T}/\bar{\sigma}_{T})^{2}}
=N(eθ¯tσ¯t:T2σ¯T2𝐱0+[(1eθ¯t)σ¯t:T2σ¯T2+e2θ¯t:Tσ¯t2σ¯T2]𝐱T,σ¯t2σ¯t:T2σ¯T2𝑰)\displaystyle=N\left(e^{-\bar{\theta}_{t}}\frac{\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}\mathbf{x}_{0}+\left[\left(1-e^{-\bar{\theta}_{t}}\right)\frac{\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}+e^{-2\bar{\theta}_{t:T}}\frac{\bar{\sigma}_{t}^{2}}{\bar{\sigma}_{T}^{2}}\right]\mathbf{x}_{T},\frac{\bar{\sigma}_{t}^{2}\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}\bm{I}\right)

This concludes the proof of the Proposition 3.1.
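In practice, Proposition 3.1 is what allows \mathbf{x}_{t} to be sampled in closed form during training. The following is a minimal sketch (not the released implementation), assuming a constant \theta_{t}=\theta and g_{t}^{2}=2\lambda^{2}\theta_{t}, so that \bar{\sigma}_{s:t}^{2}=\lambda^{2}(1-e^{-2\theta(t-s)}):

import numpy as np

def goub_marginal(x0, xT, t, T=1.0, theta=1.0, lam=0.1):
    """Mean and std of p(x_t | x_0, x_T) in (27), per pixel, under constant theta and lam."""
    sig2 = lambda dt: lam ** 2 * (1.0 - np.exp(-2.0 * theta * dt))
    s2_t, s2_tT, s2_T = sig2(t), sig2(T - t), sig2(T)
    m = np.exp(-theta * t) * s2_tT / s2_T                                   # coefficient of x_0
    n = (1.0 - np.exp(-theta * t)) * s2_tT / s2_T \
        + np.exp(-2.0 * theta * (T - t)) * s2_t / s2_T                      # coefficient of x_T
    return m * x0 + n * xT, np.sqrt(s2_t * s2_tT / s2_T)

def sample_xt(x0, xT, t, rng=np.random.default_rng(0), **kw):
    mean, std = goub_marginal(x0, xT, t, **kw)
    return mean + std * rng.standard_normal(np.shape(x0))

x0, xT = np.zeros(4), np.ones(4)
print(sample_xt(x0, xT, t=0.5))           # an intermediate state between x0 and xT
print(goub_marginal(x0, xT, t=1.0)[1])    # std = 0 at t = T: the endpoint is pinned

At t=0 and t=T the standard deviation vanishes, reflecting that the bridge is pinned at \mathbf{x}_{0} and \mathbf{x}_{T}.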

A.2 Proof of Proposition 3.3

Proposition 3.3. Let \mathbf{x}_{t} be a finite random variable described by the given generalized Ornstein-Uhlenbeck process (4). For a fixed \mathbf{x}_{T}, the expected log-likelihood \mathbb{E}_{p(\mathbf{x}_{0})}[\log p_{\bm{\theta}}(\mathbf{x}_{0}\mid\mathbf{x}_{T})] possesses an Evidence Lower Bound (ELBO):

ELBO=𝔼p(𝐱0)[𝔼p(𝐱1𝐱0)[logp𝜽(𝐱0𝐱1,𝐱T)]t=2TKL(p(𝐱t1𝐱0,𝐱t,𝐱T)||p𝜽(𝐱t1𝐱t,𝐱T))]ELBO=\mathbb{E}_{p(\mathbf{x}_{0})}\left[\mathbb{E}_{p\left(\mathbf{x}_{1}\mid\mathbf{x}_{0}\right)}\left[\log p_{\bm{\theta}}\left(\mathbf{x}_{0}\mid\mathbf{x}_{1},\mathbf{x}_{T}\right)\right]-\sum_{t=2}^{T}{KL\left(p\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}\right)||p_{\bm{\theta}}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_{T}\right)\right)}\right] (11)

Assuming p_{\bm{\theta}}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_{T}\right) is a Gaussian distribution N(\bm{\mu}_{\bm{\theta},t-1},\sigma_{\bm{\theta},t-1}^{2}\bm{I}) with a constant variance, maximizing the ELBO is equivalent to minimizing:

=𝔼t,𝐱0,𝐱t,𝐱T[12σ𝜽,t12𝝁t1𝝁𝜽,t12],\mathcal{L}=\mathbb{E}_{t,\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}}\left[\frac{1}{2\sigma_{\bm{\theta},t-1}^{2}}\|\bm{\mu}_{t-1}-\bm{\mu}_{\bm{\theta},t-1}\|^{2}\right], (12)

where 𝛍t1\bm{\mu}_{t-1} represents the mean of p(𝐱t1𝐱0,𝐱t,𝐱T)p\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}\right):

𝝁t1=1σ¯t2[σ¯t12(𝐱tb𝐱T)a+(σ¯t2σ¯t12a2)𝐦¯t],\bm{\mu}_{t-1}=\frac{1}{\bar{\sigma}^{\prime 2}_{t}}\left[\bar{\sigma}^{\prime 2}_{t-1}(\mathbf{x}_{t}-b\mathbf{x}_{T})a+(\bar{\sigma}^{\prime 2}_{t}-\bar{\sigma}^{\prime 2}_{t-1}a^{2})\mathbf{\bar{m}^{\prime}}_{t}\right], (13)

where,

a=eθ¯t1:tσ¯t:T2σ¯t1:T2,\displaystyle a=\frac{e^{-\bar{\theta}_{t-1:t}}\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{t-1:T}^{2}},
b=1σ¯T2{(1eθ¯t)σ¯t:T2+e2θ¯t:Tσ¯t2[(1eθ¯t1)σ¯t1:T2+e2θ¯t1:Tσ¯t12]a}\displaystyle b=\frac{1}{\bar{\sigma}_{T}^{2}}\left\{(1-e^{-\bar{\theta}_{t}})\bar{\sigma}^{2}_{t:T}+e^{-2\bar{\theta}_{t:T}}\bar{\sigma}_{t}^{2}-\left[(1-e^{-\bar{\theta}_{t-1}})\bar{\sigma}^{2}_{t-1:T}+e^{-2\bar{\theta}_{t-1:T}}\bar{\sigma}_{t-1}^{2}\right]a\right\}

Proof: First, following the ELBO decomposition in DDPM (Ho et al., 2020):

𝔼p(𝐱0)[logp𝜽(𝐱0)]\displaystyle\mathbb{E}_{p(\mathbf{x}_{0})}\left[\log p_{\bm{\theta}}(\mathbf{x}_{0})\right]\geq 𝔼p(𝐱0)[KL(p(𝐱T𝐱0)||p(𝐱T))+𝔼p(𝐱1𝐱0)[logp𝜽(𝐱0𝐱1)].\displaystyle\mathbb{E}_{p(\mathbf{x}_{0})}\Bigg{[}-KL(p(\mathbf{x}_{T}\mid\mathbf{x}_{0})||p(\mathbf{x}_{T}))+\mathbb{E}_{p\left(\mathbf{x}_{1}\mid\mathbf{x}_{0}\right)}\left[\log p_{\bm{\theta}}\left(\mathbf{x}_{0}\mid\mathbf{x}_{1}\right)\right]\Bigg{.} (28)
.t=2T𝔼p(xtx0)[KL(p(𝐱t1𝐱0,𝐱t)||p𝜽(𝐱t1𝐱t))]]\displaystyle\Bigg{.}-\sum_{t=2}^{T}\mathbb{E}_{p(x_{t}\mid x_{0})}[{KL\left(p\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{t}\right)||p_{\bm{\theta}}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t}\right)\right)}]\Bigg{]}

Similarly, we have:

𝔼p(𝐱0)[logp𝜽(𝐱0𝐱T)]\displaystyle\mathbb{E}_{p(\mathbf{x}_{0})}[\log p_{\bm{\theta}}(\mathbf{x}_{0}\mid\mathbf{x}_{T})] 𝔼p(𝐱0)[KL(p(𝐱T𝐱0,𝐱T)||p(𝐱T𝐱T))+𝔼p(𝐱1𝐱0)[logp𝜽(𝐱0𝐱1,𝐱T)].\displaystyle\geq\mathbb{E}_{p(\mathbf{x}_{0})}\Bigg{[}-KL(p(\mathbf{x}_{T}\mid\mathbf{x}_{0},\mathbf{x}_{T})||p(\mathbf{x}_{T}\mid\mathbf{x}_{T}))+\mathbb{E}_{p\left(\mathbf{x}_{1}\mid\mathbf{x}_{0}\right)}\left[\log p_{\bm{\theta}}\left(\mathbf{x}_{0}\mid\mathbf{x}_{1},\mathbf{x}_{T}\right)\right]\Bigg{.} (29)
.t=2T𝔼p(xtx0)[KL(p(𝐱t1𝐱0,𝐱t,𝐱T)||p𝜽(𝐱t1𝐱t,𝐱T))]]\displaystyle\Bigg{.}\quad-\sum_{t=2}^{T}\mathbb{E}_{p(x_{t}\mid x_{0})}[{KL\left(p\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}\right)||p_{\bm{\theta}}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_{T}\right)\right)}]\Bigg{]}
=𝔼p(𝐱0)[𝔼p(𝐱1𝐱0)[logp𝜽(𝐱0𝐱1,𝐱T)].\displaystyle=\mathbb{E}_{p(\mathbf{x}_{0})}\Bigg{[}\mathbb{E}_{p\left(\mathbf{x}_{1}\mid\mathbf{x}_{0}\right)}\left[\log p_{\bm{\theta}}\left(\mathbf{x}_{0}\mid\mathbf{x}_{1},\mathbf{x}_{T}\right)\right]\Bigg{.}
.t=2T𝔼p(xtx0)[KL(p(𝐱t1𝐱0,𝐱t,𝐱T)||p𝜽(𝐱t1𝐱t,𝐱T))]]\displaystyle\Bigg{.}\quad-\sum_{t=2}^{T}\mathbb{E}_{p(x_{t}\mid x_{0})}[{KL\left(p\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}\right)||p_{\bm{\theta}}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_{T}\right)\right)}]\Bigg{]}
=ELBO\displaystyle=ELBO

From Bayes’ formula, we can infer that:

p(𝐱t1𝐱0,𝐱t,𝐱T)\displaystyle p\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}\right) =p(𝐱t𝐱0,𝐱t1,𝐱T)p(𝐱t1𝐱0,𝐱T)p(𝐱t𝐱0,𝐱T)\displaystyle=\frac{p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{t-1},\mathbf{x}_{T})p(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{T})}{p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})} (30)
=p(𝐱t𝐱t1,𝐱T)p(𝐱t1𝐱0,𝐱T)p(𝐱t𝐱0,𝐱T)\displaystyle=\frac{p(\mathbf{x}_{t}\mid\mathbf{x}_{t-1},\mathbf{x}_{T})p(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{T})}{p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})}

Since p(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{T}) and p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T}) are Gaussian distributions (8), applying the reparameterization technique gives:

𝐱t1\displaystyle\mathbf{x}_{t-1} =eθ¯t1σ¯t1:T2σ¯T2𝐱0+[(1eθ¯t1)σ¯t1:T2σ¯T2+e2θ¯t1:Tσ¯t12σ¯T2]𝐱T+σ¯t1ϵt1\displaystyle=e^{-\bar{\theta}_{t-1}}\frac{\bar{\sigma}_{t-1:T}^{2}}{\bar{\sigma}_{T}^{2}}\mathbf{x}_{0}+\left[\left(1-e^{-\bar{\theta}_{t-1}}\right)\frac{\bar{\sigma}_{t-1:T}^{2}}{\bar{\sigma}_{T}^{2}}+e^{-2\bar{\theta}_{t-1:T}}\frac{\bar{\sigma}_{t-1}^{2}}{\bar{\sigma}_{T}^{2}}\right]\mathbf{x}_{T}+\bar{\sigma}^{\prime}_{t-1}\bm{\epsilon}_{t-1} (31)
=m(t1)𝐱0+n(t1)𝐱T+σ¯t1ϵt1\displaystyle=m(t-1)\mathbf{x}_{0}+n(t-1)\mathbf{x}_{T}+\bar{\sigma}^{\prime}_{t-1}\bm{\epsilon}_{t-1}
𝐱t\displaystyle\mathbf{x}_{t} =eθ¯tσ¯t:T2σ¯T2𝐱0+[(1eθ¯t)σ¯t:T2σ¯T2+e2θ¯t:Tσ¯t2σ¯T2]𝐱T+σ¯tϵt\displaystyle=e^{-\bar{\theta}_{t}}\frac{\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}\mathbf{x}_{0}+\left[\left(1-e^{-\bar{\theta}_{t}}\right)\frac{\bar{\sigma}_{t:T}^{2}}{\bar{\sigma}_{T}^{2}}+e^{-2\bar{\theta}_{t:T}}\frac{\bar{\sigma}_{t}^{2}}{\bar{\sigma}_{T}^{2}}\right]\mathbf{x}_{T}+\bar{\sigma}^{\prime}_{t}\bm{\epsilon}_{t}
=m(t)𝐱0+n(t)𝐱T+σ¯tϵt\displaystyle=m(t)\mathbf{x}_{0}+n(t)\mathbf{x}_{T}+\bar{\sigma}^{\prime}_{t}\bm{\epsilon}_{t}

Therefore,

𝐱t\displaystyle\mathbf{x}_{t} =m(t)m(t1)𝐱t1+[n(t)m(t)m(t1)n(t1)]𝐱T+σ¯t2m(t)2m(t1)2σ¯t12ϵ\displaystyle=\frac{m(t)}{m(t-1)}\mathbf{x}_{t-1}+\left[n(t)-\frac{m(t)}{m(t-1)}n(t-1)\right]\mathbf{x}_{T}+\sqrt{\bar{\sigma}^{\prime 2}_{t}-\frac{m(t)^{2}}{m(t-1)^{2}}\bar{\sigma}^{\prime 2}_{t-1}}\bm{\epsilon} (32)
=a𝐱t1+[n(t)an(t1)]𝐱T+σ¯t2a2σ¯t12ϵ\displaystyle=a\mathbf{x}_{t-1}+\left[n(t)-an(t-1)\right]\mathbf{x}_{T}+\sqrt{\bar{\sigma}^{\prime 2}_{t}-a^{2}\bar{\sigma}^{\prime 2}_{t-1}}\bm{\epsilon}
=a𝐱t1+b𝐱T+σ¯t2a2σ¯t12ϵ\displaystyle=a\mathbf{x}_{t-1}+b\mathbf{x}_{T}+\sqrt{\bar{\sigma}^{\prime 2}_{t}-a^{2}\bar{\sigma}^{\prime 2}_{t-1}}\bm{\epsilon}

Thus, p(\mathbf{x}_{t}\mid\mathbf{x}_{t-1},\mathbf{x}_{T})=N(a\mathbf{x}_{t-1}+b\mathbf{x}_{T},\left(\bar{\sigma}^{\prime 2}_{t}-a^{2}\bar{\sigma}^{\prime 2}_{t-1}\right)\bm{I}) is also a Gaussian distribution. Substituting it back into equation (30), we obtain:

𝝁t1=1σ¯t2[σ¯t12(𝐱tb𝐱T)a+(σ¯t2σ¯t12a2)𝐦¯t],\bm{\mu}_{t-1}=\frac{1}{\bar{\sigma}^{\prime 2}_{t}}\left[\bar{\sigma}^{\prime 2}_{t-1}(\mathbf{x}_{t}-b\mathbf{x}_{T})a+(\bar{\sigma}^{\prime 2}_{t}-\bar{\sigma}^{\prime 2}_{t-1}a^{2})\mathbf{\bar{m}^{\prime}}_{t}\right], (13)

Accordingly,

KL(p(𝐱t1𝐱0,𝐱t,𝐱T)||p𝜽(𝐱t1𝐱t,𝐱T))\displaystyle KL\left(p\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}\right)||p_{\bm{\theta}}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_{T}\right)\right) (33)
=\displaystyle= 𝔼p(𝐱t1𝐱0,𝐱t,𝐱T)[log12πσt1e(𝐱t1𝝁t1)2/2σt1212πσ𝜽,t1e(𝐱t1𝝁𝜽,t1)2/2σ𝜽,t12]\displaystyle\mathbb{E}_{p\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}\right)}\left[\log\frac{\frac{1}{\sqrt{2\pi}\sigma_{t-1}}e^{-(\mathbf{x}_{t-1}-\bm{\mu}_{t-1})^{2}/{2\sigma_{t-1}^{2}}}}{\frac{1}{\sqrt{2\pi}\sigma_{\bm{\theta},t-1}}e^{-(\mathbf{x}_{t-1}-\bm{\mu}_{\bm{\theta},t-1})^{2}/{2\sigma_{\bm{\theta},t-1}^{2}}}}\right]
=\displaystyle= 𝔼p(𝐱t1𝐱0,𝐱t,𝐱T)[logσ𝜽,t1logσt1(𝐱t1𝝁t1)2/2σt12+(𝐱t1𝝁𝜽,t1)2/2σ𝜽,t12]\displaystyle\mathbb{E}_{p\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}\right)}\left[\log\sigma_{\bm{\theta},t-1}-\log\sigma_{t-1}-(\mathbf{x}_{t-1}-\bm{\mu}_{t-1})^{2}/{2\sigma^{2}_{t-1}}+(\mathbf{x}_{t-1}-\bm{\mu}_{\bm{\theta},t-1})^{2}/{2\sigma_{\bm{\theta},t-1}^{2}}\right]
=\displaystyle= logσ𝜽,t1logσt112+σt122σ𝜽,t12+(𝝁t1𝝁𝜽,t1)22σ𝜽,t12\displaystyle\log\sigma_{\bm{\theta},t-1}-\log\sigma_{t-1}-\frac{1}{2}+\frac{\sigma_{t-1}^{2}}{2\sigma_{\bm{\theta},t-1}^{2}}+\frac{(\bm{\mu}_{t-1}-\bm{\mu}_{\bm{\theta},t-1})^{2}}{2\sigma_{\bm{\theta},t-1}^{2}}
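This closed form can be checked against a Monte-Carlo estimate; a quick sketch for scalar Gaussians with arbitrarily chosen parameters (illustrative only, not part of the proof):

import numpy as np

rng = np.random.default_rng(0)
mu_p, sig_p = 0.3, 0.7        # plays the role of (mu_{t-1}, sigma_{t-1})
mu_q, sig_q = -0.1, 1.2       # plays the role of (mu_theta, sigma_theta)

closed = (np.log(sig_q) - np.log(sig_p) - 0.5
          + sig_p ** 2 / (2 * sig_q ** 2) + (mu_p - mu_q) ** 2 / (2 * sig_q ** 2))

x = rng.normal(mu_p, sig_p, size=1_000_000)                    # samples from p
log_p = -0.5 * ((x - mu_p) / sig_p) ** 2 - np.log(sig_p) - 0.5 * np.log(2 * np.pi)
log_q = -0.5 * ((x - mu_q) / sig_q) ** 2 - np.log(sig_q) - 0.5 * np.log(2 * np.pi)
print(closed, np.mean(log_p - log_q))                          # the two estimates agree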

Ignoring the unlearnable constants, the training objective obtained by minimizing the negative ELBO is:

=𝔼t,𝐱0,𝐱t,𝐱T[12σ𝜽,t12𝝁t1𝝁𝜽,t12],\mathcal{L}=\mathbb{E}_{t,\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}}\left[\frac{1}{2\sigma_{\bm{\theta},t-1}^{2}}\|\bm{\mu}_{t-1}-\bm{\mu}_{\bm{\theta},t-1}\|^{2}\right], (34)

This concludes the proof of the Proposition 3.3.
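To make the reverse transition concrete, the sketch below (a constant \theta_{t}=\theta and g_{t}^{2}=2\lambda^{2}\theta_{t} are assumptions, and the function names are illustrative) computes the coefficients m(t), n(t) and the variance of (31), the closed-form a and b of Proposition 3.3, and the posterior mean \bm{\mu}_{t-1} of p(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{t},\mathbf{x}_{T}) via standard Gaussian conjugacy, written in terms of the mean of p(\mathbf{x}_{t-1}\mid\mathbf{x}_{0},\mathbf{x}_{T}):

import numpy as np

def coeffs(t, T=1.0, theta=1.0, lam=0.1):
    """m(t), n(t) and the variance of equation (31), under constant theta and lam."""
    sig2 = lambda dt: lam ** 2 * (1.0 - np.exp(-2.0 * theta * dt))
    s2_t, s2_tT, s2_T = sig2(t), sig2(T - t), sig2(T)
    m = np.exp(-theta * t) * s2_tT / s2_T
    n = (1.0 - np.exp(-theta * t)) * s2_tT / s2_T + np.exp(-2.0 * theta * (T - t)) * s2_t / s2_T
    return m, n, s2_t * s2_tT / s2_T

def posterior_mean(x0, xt, xT, t, dt, **kw):
    """Mean mu_{t-1} of p(x_{t-1} | x_0, x_t, x_T), obtained by Gaussian conjugacy."""
    m_t, n_t, v_t = coeffs(t, **kw)
    m_s, n_s, v_s = coeffs(t - dt, **kw)        # quantities at the previous time t - dt
    a = m_t / m_s                               # matches the closed form of a in Proposition 3.3
    b = n_t - a * n_s                           # matches the closed form of b in Proposition 3.3
    prior_mean = m_s * x0 + n_s * xT            # mean of p(x_{t-1} | x_0, x_T)
    return (v_s * a * (xt - b * xT) + (v_t - a ** 2 * v_s) * prior_mean) / v_t

x0, xT = np.zeros(4), np.ones(4)
xt = 0.5 * (x0 + xT)
print(posterior_mean(x0, xt, xT, t=0.6, dt=0.01))

When sampling with the learned reverse model p_{\bm{\theta}}(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_{T}), this mean is replaced by the network output \bm{\mu}_{\bm{\theta},t-1}.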

Appendix B Theoretical Results

B.1 Brownian Bridge

In this section, we reveal the mathematical essence of several other bridge models and show that they are all equivalent to the Brownian bridge.

Proposition B.1.

BBDM (Li et al., 2023), DDBM (VE) (Zhou et al., 2024) and I^{2}SB (Liu et al., 2023a) are all mathematically equivalent to the Brownian bridge.

Proof: First, BBDM directly adopts the Brownian bridge as its underlying diffusion process.

The DDBM (VE) model is derived as the Doob’s h–transform of the VE-SDE; we begin by specifying the SDE:

d𝐱t=d𝐰t\mathrm{d}\mathbf{x}_{t}=\mathrm{d}\mathbf{w}_{t} (35)

Its transition probability is given by:

p(𝐱t𝐱s)=N(𝐱s,ts)p\left(\mathbf{x}_{t}\mid\mathbf{x}_{s}\right)=N(\mathbf{x}_{s},t-s) (36)

Hence, the h–function of SDE (35) is:

𝐡(𝐱t,t,𝐱T,T)\displaystyle\mathbf{h}(\mathbf{x}_{t},t,\mathbf{x}_{T},T) =𝐱tlogp(𝐱T𝐱t)\displaystyle=\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{T}\mid\mathbf{x}_{t}) (37)
=𝐱T𝐱tTt\displaystyle=\frac{\mathbf{x}_{T}-\mathbf{x}_{t}}{T-t}

Therefore, the Doob’s h–transform of (35) is:

d𝐱t=𝐱T𝐱tTtdt+d𝐰t\displaystyle\mathrm{d}\mathbf{x}_{t}=\frac{\mathbf{x}_{T}-\mathbf{x}_{t}}{T-t}\mathrm{d}t+\mathrm{d}\mathbf{w}_{t} (38)

This is exactly the definition of the Brownian bridge. Hence, DDBM (VE) is a Brownian bridge model.

Furthermore, the transition kernel of (38) is:

p(𝐱t𝐱0,𝐱T)\displaystyle p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T}) =p(𝐱T𝐱t,𝐱0)p(𝐱t𝐱0)p(𝐱T𝐱0)\displaystyle=\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t},\mathbf{x}_{0})p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})} (39)
=p(𝐱T𝐱t)p(𝐱t𝐱0)p(𝐱T𝐱0)\displaystyle=\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t})p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}
=N(𝐱t,Tt)N(𝐱0,t)N(𝐱0,T)\displaystyle=\frac{N(\mathbf{x}_{t},T-t)N(\mathbf{x}_{0},t)}{N(\mathbf{x}_{0},T)}
=N((1tT)𝐱0+tT𝐱T,t(Tt)T𝑰)\displaystyle=N\left(\left(1-\frac{t}{T}\right)\mathbf{x}_{0}+\frac{t}{T}\mathbf{x}_{T},\frac{t(T-t)}{T}\bm{I}\right)

This precisely corresponds to the sampling process of \mathrm{I}^{2}SB, thus confirming that \mathrm{I}^{2}SB also represents a Brownian bridge.

This concludes the proof of the Proposition B.1.
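A small simulation (a sketch, not from the paper) confirms that the SDE (38) reproduces the marginal (39):

import numpy as np

def brownian_bridge_marginal_check(x0=0.0, xT=2.0, T=1.0, t_check=0.3,
                                   n_steps=1000, n_paths=20000, seed=0):
    """Euler-Maruyama simulation of dx = (xT - x) / (T - t) dt + dw, i.e. the SDE (38)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(n_paths, float(x0))
    for k in range(int(t_check / dt)):
        t = k * dt
        x += (xT - x) / (T - t) * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
    # empirical moments versus the closed-form marginal (39) at time t_check
    print(x.mean(), (1 - t_check / T) * x0 + (t_check / T) * xT)   # ~0.6 vs 0.6
    print(x.var(), t_check * (T - t_check) / T)                    # ~0.21 vs 0.21

brownian_bridge_marginal_check()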

B.2 Connections Between GOU, VE and VP

The following proposition shows that both the VE and VP processes are special cases of the GOU process:

Proposition 5.1. For a given GOU process (4), the following relationships hold:

limθt0GOU=VE\displaystyle\lim_{\theta_{t}\rightarrow 0}\text{GOU}=\text{VE} (18)
lim𝝁0,λ1GOU=VP\displaystyle\lim_{\bm{\mu}\rightarrow 0,\lambda\rightarrow 1}\text{GOU}=\text{VP}

Proof: It is straightforward to see that:

limθt0GOU\displaystyle\lim_{\theta_{t}\rightarrow 0}\text{GOU} =limθt0{d𝐱t=θt(𝝁𝐱t)dt+gtd𝐰t}\displaystyle=\lim_{\theta_{t}\rightarrow 0}\left\{\mathrm{d}\mathbf{x}_{t}=\theta_{t}\left(\bm{\mu}-\mathbf{x}_{t}\right)\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}\right\} (40)
=limθt0{d𝐱t=gtd𝐰t}\displaystyle=\lim_{\theta_{t}\rightarrow 0}\left\{\mathrm{d}\mathbf{x}_{t}=g_{t}\mathrm{d}\mathbf{w}_{t}\right\}
=VE,\displaystyle=\text{VE},

where g_{t} will be controlled by \lambda^{2}.

Besides, we have:

lim𝝁0,λ1GOU\displaystyle\lim_{\bm{\mu}\rightarrow 0,\lambda\rightarrow 1}\text{GOU} =lim𝝁0,λ1{d𝐱t=θt(𝝁𝐱t)dt+gtd𝐰t}\displaystyle=\lim_{\bm{\mu}\rightarrow 0,\lambda\rightarrow 1}\left\{\mathrm{d}\mathbf{x}_{t}=\theta_{t}\left(\bm{\mu}-\mathbf{x}_{t}\right)\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}\right\} (41)
=lim𝝁0,λ1{d𝐱t=θt𝝁dtθt𝐱tdt+gtd𝐰t}\displaystyle=\lim_{\bm{\mu}\rightarrow 0,\lambda\rightarrow 1}\left\{\mathrm{d}\mathbf{x}_{t}=\theta_{t}\bm{\mu}\mathrm{d}t-\theta_{t}\mathbf{x}_{t}\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}\right\}
=lim𝝁0,λ1{d𝐱t=12gt2𝐱tdt+gtd𝐰t}\displaystyle=\lim_{\bm{\mu}\rightarrow 0,\lambda\rightarrow 1}\left\{\mathrm{d}\mathbf{x}_{t}=-\frac{1}{2}g_{t}^{2}\mathbf{x}_{t}\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}\right\}
=VP,\displaystyle=\text{VP},

where g_{t} will be controlled by \theta_{t}.

This concludes the proof of the Proposition 5.1.
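The VE limit can also be seen at the level of the transition kernel (5): holding g fixed, its variance \frac{g^{2}}{2\theta}(1-e^{-2\theta(t-s)}) tends to g^{2}(t-s) as \theta\rightarrow 0, i.e., the variance of a driftless VE-type diffusion. A tiny numerical sketch (constant coefficients assumed, illustrative only):

import numpy as np

g, interval = 0.5, 0.8                            # fixed diffusion coefficient and interval t - s
for theta in [1.0, 1e-1, 1e-2, 1e-4]:
    var_gou = g ** 2 / (2 * theta) * (1 - np.exp(-2 * theta * interval))   # variance in (5)
    print(theta, var_gou, g ** 2 * interval)      # converges to the VE variance g^2 (t - s) = 0.2

The VP case is visible directly from the drift: with \lambda=1 we have g_{t}^{2}=2\theta_{t}, so \theta_{t}(\bm{\mu}-\mathbf{x}_{t})\rightarrow-\frac{1}{2}g_{t}^{2}\mathbf{x}_{t} as \bm{\mu}\rightarrow 0, exactly as used in (41).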

Appendix C GOU Process

Theorem C.1.

For a given GOU process:

d𝐱t=θt(𝝁𝐱t)dt+gtd𝐰t\mathrm{d}\mathbf{x}_{t}=\theta_{t}\left(\bm{\mu}-\mathbf{x}_{t}\right)\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t} (4)

where \bm{\mu} is a given state vector, \theta_{t} denotes a scalar drift coefficient and g_{t} represents the diffusion coefficient. Its transition kernel admits a closed-form expression:

p(𝐱t𝐱s)=N(𝝁+(𝐱s𝝁)eθ¯s:t,gt22θt(1e2θ¯s:t)𝑰),θ¯s:t=stθz𝑑zp\left(\mathbf{x}_{t}\mid\mathbf{x}_{s}\right)=N\left(\bm{\mu}+\left(\mathbf{x}_{s}-\bm{\mu}\right)e^{-\bar{\theta}_{s:t}},\frac{g^{2}_{t}}{2\theta_{t}}\left(1-e^{-2\bar{\theta}_{s:t}}\right)\bm{I}\right),\qquad\bar{\theta}_{s:t}=\int_{s}^{t}{\theta_{z}dz} (5)

Proof: Define the auxiliary function:

𝐟(𝐱t,t)=𝐱teθ¯t\mathbf{f}(\mathbf{x}_{t},t)=\mathbf{x}_{t}e^{\bar{\theta}_{t}} (42)

Applying Itô’s formula, we get:

d𝐟(𝐱t,t)\displaystyle\mathrm{d}\mathbf{f}(\mathbf{x}_{t},t) =𝐱tθteθ¯tdt+eθ¯td𝐱t\displaystyle=\mathbf{x}_{t}\theta_{t}e^{\bar{\theta}_{t}}\mathrm{d}t+e^{\bar{\theta}_{t}}\mathrm{d}\mathbf{x}_{t} (43)
=𝐱tθteθ¯tdt+eθ¯t[θt(𝝁𝐱t)dt+gtd𝐰t]\displaystyle=\mathbf{x}_{t}\theta_{t}e^{\bar{\theta}_{t}}\mathrm{d}t+e^{\bar{\theta}_{t}}\left[\theta_{t}\left(\bm{\mu}-\mathbf{x}_{t}\right)\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t}\right]
\displaystyle=e^{\bar{\theta}_{t}}\theta_{t}\bm{\mu}\mathrm{d}t+e^{\bar{\theta}_{t}}g_{t}\mathrm{d}\mathbf{w}_{t}

Integrating from s to t, we get:

𝐱teθ¯t𝐱seθ¯s\displaystyle\mathbf{x}_{t}e^{\bar{\theta}_{t}}-\mathbf{x}_{s}e^{\bar{\theta}_{s}} =steθ¯zθz𝝁dz+steθ¯zgzd𝐰z\displaystyle=\int_{s}^{t}e^{\bar{\theta}_{z}}\theta_{z}\bm{\mu}\mathrm{d}z+\int_{s}^{t}e^{\bar{\theta}_{z}}g_{z}\mathrm{d}\mathbf{w}_{z} (44)
=(eθ¯teθ¯s)𝝁+steθ¯zgzd𝐰z\displaystyle=\left(e^{\bar{\theta}_{t}}-e^{\bar{\theta}_{s}}\right)\bm{\mu}+\int_{s}^{t}e^{\bar{\theta}_{z}}g_{z}\mathrm{d}\mathbf{w}_{z}

The transition kernel is clearly Gaussian. Since \mathrm{d}\mathbf{w}_{z}\sim N(\mathbf{0},\mathrm{d}z\bm{I}) and g_{z}^{2}=2\lambda^{2}\theta_{z} (the constraint that fixes the steady-state variance of (4) to \lambda^{2}), we have:

steθ¯zgzd𝐰z\displaystyle\int_{s}^{t}e^{\bar{\theta}_{z}}g_{z}\mathrm{d}\mathbf{w}_{z} =N(𝟎,ste2θ¯zgz2dz𝑰)\displaystyle=N\left(\mathbf{0},\int_{s}^{t}e^{2\bar{\theta}_{z}}g^{2}_{z}\mathrm{d}z\bm{I}\right) (45)
\displaystyle=N\left(\mathbf{0},\lambda^{2}\int_{s}^{t}e^{2\bar{\theta}_{z}}2\theta_{z}\mathrm{d}z\bm{I}\right)
=N(𝟎,λ2(e2θ¯te2θ¯s)𝑰)\displaystyle=N\left(\mathbf{0},\lambda^{2}\left(e^{2\bar{\theta}_{t}}-e^{2\bar{\theta}_{s}}\right)\bm{I}\right)

Therefore:

𝐱teθ¯t𝐱seθ¯s=(eθ¯teθ¯s)𝝁+N(𝟎,λ2(e2θ¯te2θ¯s)𝑰)\displaystyle\mathbf{x}_{t}e^{\bar{\theta}_{t}}-\mathbf{x}_{s}e^{\bar{\theta}_{s}}=\left(e^{\bar{\theta}_{t}}-e^{\bar{\theta}_{s}}\right)\bm{\mu}+N\left(\mathbf{0},\lambda^{2}\left(e^{2\bar{\theta}_{t}}-e^{2\bar{\theta}_{s}}\right)\bm{I}\right) (46)
𝐱t=𝝁+(𝐱s𝝁)eθ¯s:t+N(𝟎,gt22θt(1e2θ¯s:t)𝑰)\displaystyle\mathbf{x}_{t}=\bm{\mu}+\left(\mathbf{x}_{s}-\bm{\mu}\right)e^{-\bar{\theta}_{s:t}}+N\left(\mathbf{0},\frac{g^{2}_{t}}{2\theta_{t}}\left(1-e^{-2\bar{\theta}_{s:t}}\right)\bm{I}\right)

This concludes the proof of the Theorem C.1.
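Theorem C.1 can also be verified by simulation: integrate the GOU SDE (4) with Euler–Maruyama and compare the empirical moments of \mathbf{x}_{t} with the closed-form kernel (5). A sketch (not from the released code) with constant \theta_{t}=\theta and g_{t}^{2}=2\lambda^{2}\theta_{t}, so the variance in (5) equals \lambda^{2}(1-e^{-2\bar{\theta}_{s:t}}):

import numpy as np

def gou_moment_check(xs=3.0, mu=1.0, theta=2.0, lam=0.5, t=0.7,
                     n_steps=1000, n_paths=20000, seed=0):
    rng = np.random.default_rng(seed)
    g = np.sqrt(2 * lam ** 2 * theta)             # g^2 = 2 lam^2 theta
    dt = t / n_steps
    x = np.full(n_paths, float(xs))
    for _ in range(n_steps):                      # Euler-Maruyama for dx = theta (mu - x) dt + g dw
        x += theta * (mu - x) * dt + g * np.sqrt(dt) * rng.standard_normal(n_paths)
    print(x.mean(), mu + (xs - mu) * np.exp(-theta * t))                    # mean in (5)
    print(x.var(), g ** 2 / (2 * theta) * (1 - np.exp(-2 * theta * t)))     # variance in (5)

gou_moment_check()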

Appendix D Doob’s h–transform

Theorem D.1.

For a given SDE:

d𝐱t=𝐟(𝐱t,t)dt+gtd𝐰t,𝐱0p(𝐱0),\mathrm{d}\mathbf{x}_{t}=\mathbf{f}\left(\mathbf{x}_{t},t\right)\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t},\qquad\mathbf{x}_{0}\sim p\left(\mathbf{x}_{0}\right), (1)

For a fixed 𝐱T\mathbf{x}_{T}, the evolution of conditional probability p(𝐱t𝐱T)p(\mathbf{x}_{t}\mid\mathbf{x}_{T}) follows:

d𝐱t=[𝐟(𝐱t,t)+gt2𝐡(𝐱t,t,𝐱T,T)]dt+gtd𝐰t,𝐱0p(𝐱0𝐱T),\mathrm{d}\mathbf{x}_{t}=\left[\mathbf{f}(\mathbf{x}_{t},t)+g^{2}_{t}\mathbf{h}(\mathbf{x}_{t},t,\mathbf{x}_{T},T)\right]\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t},\qquad\mathbf{x}_{0}\sim p\left(\mathbf{x}_{0}\mid\mathbf{x}_{T}\right), (6)

where 𝐡(𝐱t,t,𝐱T,T)=𝐱tlogp(𝐱T𝐱t)\mathbf{h}(\mathbf{x}_{t},t,\mathbf{x}_{T},T)=\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{T}\mid\mathbf{x}_{t}).

Proof: p(\mathbf{x}_{t}\mid\mathbf{x}_{0}) satisfies the Kolmogorov Forward Equation (KFE), also known as the Fokker-Planck equation (Risken & Risken, 1996):

tp(𝐱t𝐱0)=𝐱t[𝐟(𝐱t,t)p(𝐱t𝐱0)]+12gt2𝐱t𝐱tp(𝐱t𝐱0)\frac{\partial}{\partial t}p(\mathbf{x}_{t}\mid\mathbf{x}_{0})=-\nabla_{\mathbf{x}_{t}}\cdot\left[\mathbf{f}(\mathbf{x}_{t},t)p(\mathbf{x}_{t}\mid\mathbf{x}_{0})\right]+\frac{1}{2}g^{2}_{t}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0}) (47)

Similarly, p(\mathbf{x}_{T}\mid\mathbf{x}_{t}) satisfies the Kolmogorov Backward Equation (KBE) (Risken & Risken, 1996):

tp(𝐱T𝐱t)=𝐟(𝐱t,t)𝐱tp(𝐱T𝐱t)+12gt2𝐱t𝐱tp(𝐱T𝐱t)-\frac{\partial}{\partial t}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})=\mathbf{f}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})+\frac{1}{2}g^{2}_{t}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t}) (48)

Using Bayes’ rule, we have:

p(𝐱t𝐱0,𝐱T)\displaystyle p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T}) =p(𝐱T𝐱t,𝐱0)p(𝐱t𝐱0)p(𝐱T𝐱0)\displaystyle=\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t},\mathbf{x}_{0})p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})} (49)
=p(𝐱T𝐱t)p(𝐱t𝐱0)p(𝐱T𝐱0)\displaystyle=\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t})p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}

Therefore, the time derivative of the conditional transition probability p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T}) satisfies:

tp(𝐱t𝐱0,𝐱T)\displaystyle\frac{\partial}{\partial t}p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T}) =p(𝐱t𝐱0)p(𝐱T𝐱0)tp(𝐱T𝐱t)+p(𝐱T𝐱t)p(𝐱T𝐱0)tp(𝐱t𝐱0)\displaystyle=\frac{p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\frac{\partial}{\partial t}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})+\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\frac{\partial}{\partial t}p(\mathbf{x}_{t}\mid\mathbf{x}_{0}) (50)
=p(𝐱t𝐱0)p(𝐱T𝐱0)[𝐟(𝐱t,t)𝐱tp(𝐱T𝐱t)12gt2𝐱t𝐱tp(𝐱T𝐱t)]\displaystyle=\frac{p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\left[-\mathbf{f}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})-\frac{1}{2}g^{2}_{t}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]
+p(𝐱T𝐱t)p(𝐱T𝐱0){𝐱t[𝐟(𝐱t,t)p(𝐱t𝐱0)]+12gt2𝐱t𝐱tp(𝐱t𝐱0)}\displaystyle\quad+\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\left\{-\nabla_{\mathbf{x}_{t}}\cdot\left[\mathbf{f}(\mathbf{x}_{t},t)p(\mathbf{x}_{t}\mid\mathbf{x}_{0})\right]+\frac{1}{2}g^{2}_{t}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0})\right\}
=[p(𝐱t𝐱0)p(𝐱T𝐱0)𝐟(𝐱t,t)𝐱tp(𝐱T𝐱t)+p(𝐱T𝐱t)p(𝐱T𝐱0)𝐟(𝐱t,t)𝐱tp(𝐱t𝐱0)\displaystyle=-\left[\frac{p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\mathbf{f}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})+\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\mathbf{f}(\mathbf{x}_{t},t)\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0})\right.
+p(𝐱T𝐱t)p(𝐱T𝐱0)p(𝐱t𝐱0)𝐱t𝐟(𝐱t,t)]\displaystyle\quad\left.+\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}p(\mathbf{x}_{t}\mid\mathbf{x}_{0})\nabla_{\mathbf{x}_{t}}\cdot\mathbf{f}(\mathbf{x}_{t},t)\right]
+12gt2[p(𝐱T𝐱t)p(𝐱T𝐱0)𝐱t𝐱tp(𝐱t𝐱0)p(𝐱t𝐱0)p(𝐱T𝐱0)𝐱t𝐱tp(𝐱T𝐱t)]\displaystyle\quad+\frac{1}{2}g_{t}^{2}\left[\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0})-\frac{p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]
=[𝐟(𝐱t,t)𝐱tp(𝐱t𝐱0,𝐱T)+p(𝐱t𝐱0,𝐱T)𝐱t𝐟(𝐱t,t)]\displaystyle=-\left[\mathbf{f}(\mathbf{x}_{t},t)\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})+p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\cdot\nabla_{\mathbf{x}_{t}}\mathbf{f}(\mathbf{x}_{t},t)\right]
+12gt2[p(𝐱T𝐱t)p(𝐱T𝐱0)𝐱t𝐱tp(𝐱t𝐱0)p(𝐱t𝐱0)p(𝐱T𝐱0)𝐱t𝐱tp(𝐱T𝐱t)]\displaystyle\quad+\frac{1}{2}g_{t}^{2}\left[\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0})-\frac{p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]
=𝐱t[𝐟(𝐱t,t)p(𝐱t𝐱0,𝐱T)]\displaystyle=-\nabla_{\mathbf{x}_{t}}\cdot\left[\mathbf{f}(\mathbf{x}_{t},t)p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\right]
+12gt2[p(𝐱T𝐱t)p(𝐱T𝐱0)𝐱t𝐱tp(𝐱t𝐱0)p(𝐱t𝐱0)p(𝐱T𝐱0)𝐱t𝐱tp(𝐱T𝐱t)]\displaystyle\quad+\frac{1}{2}g_{t}^{2}\left[\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0})-\frac{p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]

For the second term, we have:

12gt2[p(𝐱T𝐱t)p(𝐱T𝐱0)𝐱t𝐱tp(𝐱t𝐱0)p(𝐱t𝐱0)p(𝐱T𝐱0)𝐱t𝐱tp(𝐱T𝐱t)]\displaystyle\frac{1}{2}g_{t}^{2}\left[\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0})-\frac{p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right] (51)
=\displaystyle= 12gt2[p(𝐱T𝐱t)p(𝐱T𝐱0)𝐱t𝐱tp(𝐱t𝐱0)+1p(𝐱T𝐱0)𝐱tp(𝐱T𝐱t)𝐱tp(𝐱t𝐱0)\displaystyle\frac{1}{2}g_{t}^{2}\left[\frac{p(\mathbf{x}_{T}\mid\mathbf{x}_{t})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0})+\frac{1}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\cdot\nabla_{\mathbf{x}_{t}}\ p(\mathbf{x}_{t}\mid\mathbf{x}_{0})\right.
+1p(𝐱T𝐱0)𝐱tp(𝐱T𝐱t)𝐱tp(𝐱t𝐱0)+p(𝐱t𝐱0)p(𝐱T𝐱0)𝐱t𝐱tp(𝐱T𝐱t)]\displaystyle\left.+\frac{1}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\cdot\nabla_{\mathbf{x}_{t}}\ p(\mathbf{x}_{t}\mid\mathbf{x}_{0})+\frac{p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]
gt2[1p(𝐱T𝐱0)𝐱tp(𝐱T𝐱t)𝐱tp(𝐱t𝐱0)+p(𝐱t𝐱0)p(𝐱T𝐱0)𝐱t𝐱tp(𝐱T𝐱t)]\displaystyle-g_{t}^{2}\left[\frac{1}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\cdot\nabla_{\mathbf{x}_{t}}\ p(\mathbf{x}_{t}\mid\mathbf{x}_{0})+\frac{p(\mathbf{x}_{t}\mid\mathbf{x}_{0})}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]
=\displaystyle= 12gt2[1p(𝐱T𝐱0)𝐱t[p(𝐱T𝐱t)𝐱tp(𝐱t𝐱0)]+1p(𝐱T𝐱0)𝐱t[p(𝐱t𝐱0)𝐱tp(𝐱T𝐱t)]]\displaystyle\frac{1}{2}g_{t}^{2}\left[\frac{1}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\left[p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0})\right]+\frac{1}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\left[p(\mathbf{x}_{t}\mid\mathbf{x}_{0})\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]\right]
gt21p(𝐱T𝐱0)𝐱t[p(𝐱t𝐱0)𝐱tp(𝐱T𝐱t)]\displaystyle-g_{t}^{2}\frac{1}{p(\mathbf{x}_{T}\mid\mathbf{x}_{0})}\nabla_{\mathbf{x}_{t}}\cdot\left[p(\mathbf{x}_{t}\mid\mathbf{x}_{0})\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]
=\displaystyle= 12gt2[𝐱t[p(𝐱t𝐱0,𝐱T)𝐱tlogp(𝐱t𝐱0)]+𝐱t[p(𝐱t𝐱0,𝐱T)𝐱tlogp(𝐱T𝐱t)]]\displaystyle\frac{1}{2}g_{t}^{2}\left[\nabla_{\mathbf{x}_{t}}\cdot\left[p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t}\mid\mathbf{x}_{0})\right]+\nabla_{\mathbf{x}_{t}}\cdot\left[p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]\right]
gt2𝐱t[p(𝐱t𝐱0,𝐱T)𝐱tlogp(𝐱T𝐱t)]\displaystyle-g_{t}^{2}\nabla_{\mathbf{x}_{t}}\cdot\left[p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]
=\displaystyle= 12gt2[𝐱t[p(𝐱t𝐱0,𝐱T)𝐱tlogp(𝐱t𝐱0,𝐱T)]]gt2𝐱t[p(𝐱t𝐱0,𝐱T)𝐱tlogp(𝐱T𝐱t)]\displaystyle\frac{1}{2}g_{t}^{2}\left[\nabla_{\mathbf{x}_{t}}\cdot\left[p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\right]\right]-g_{t}^{2}\nabla_{\mathbf{x}_{t}}\cdot\left[p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]
=\displaystyle= 12gt2𝐱t𝐱tp(𝐱t𝐱0,𝐱T)gt2𝐱t[p(𝐱t𝐱0,𝐱T)𝐱tlogp(𝐱T𝐱t)]\displaystyle\frac{1}{2}g_{t}^{2}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})-g_{t}^{2}\nabla_{\mathbf{x}_{t}}\cdot\left[p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]

Substituting this back into (50):

tp(𝐱t𝐱0,𝐱T)\displaystyle\frac{\partial}{\partial t}p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T}) =𝐱t[𝐟(𝐱t,t)p(𝐱t𝐱0,𝐱T)]+12gt2𝐱t𝐱tp(𝐱t𝐱0,𝐱T)\displaystyle=-\nabla_{\mathbf{x}_{t}}\cdot\left[\mathbf{f}(\mathbf{x}_{t},t)p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\right]+\frac{1}{2}g_{t}^{2}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T}) (52)
gt2𝐱t[p(𝐱t𝐱0,𝐱T)𝐱tlogp(𝐱T𝐱t)]\displaystyle\quad-g_{t}^{2}\nabla_{\mathbf{x}_{t}}\cdot\left[p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]
=𝐱t[[𝐟(𝐱t,t)+gt2𝐱tlogp(𝐱T𝐱t)]p(𝐱t𝐱0,𝐱T)]+12gt2𝐱t𝐱tp(𝐱t𝐱0,𝐱T)\displaystyle=-\nabla_{\mathbf{x}_{t}}\cdot\left[[\mathbf{f}(\mathbf{x}_{t},t)+g_{t}^{2}\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{T}\mid\mathbf{x}_{t})]p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})\right]+\frac{1}{2}g_{t}^{2}\nabla_{\mathbf{x}_{t}}\cdot\nabla_{\mathbf{x}_{t}}p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T})

This is precisely the Fokker-Planck equation of the conditional transition probability p(\mathbf{x}_{t}\mid\mathbf{x}_{0},\mathbf{x}_{T}), which corresponds to evolution under the SDE:

d𝐱t=[𝐟(𝐱t,t)+gt2𝐱tlogp(𝐱T𝐱t)]dt+gtd𝐰t\mathrm{d}\mathbf{x}_{t}=\left[\mathbf{f}(\mathbf{x}_{t},t)+g^{2}_{t}\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{T}\mid\mathbf{x}_{t})\right]\mathrm{d}t+g_{t}\mathrm{d}\mathbf{w}_{t} (53)

This concludes the proof of the Theorem D.1.
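Operationally, Theorem D.1 says that conditioning on \mathbf{x}_{T} amounts to adding g_{t}^{2}\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{T}\mid\mathbf{x}_{t}) to the original drift. A generic sketch (the helper name and the Brownian-motion example are illustrative, not from the paper):

import numpy as np

def doob_h_drift(f, g, grad_log_pT):
    """Drift of the h-transformed SDE (6): f(x, t) + g(t)^2 * grad_x log p(x_T | x_t)."""
    return lambda x, t: f(x, t) + g(t) ** 2 * grad_log_pT(x, t)

# Example: standard Brownian motion dx = dw has p(x_T | x_t) = N(x_t, T - t), so the
# h-function is (x_T - x_t) / (T - t) and the transformed drift is the Brownian bridge drift.
T, xT = 1.0, 2.0
bridge_drift = doob_h_drift(f=lambda x, t: 0.0,
                            g=lambda t: 1.0,
                            grad_log_pT=lambda x, t: (xT - x) / (T - t))
print(bridge_drift(0.5, 0.25))   # (2.0 - 0.5) / (1.0 - 0.25) = 2.0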

Appendix E Experimental Details

For all experiments, we use the same noise network, with the network architecture and main training parameters consistent with the paper (Luo et al., 2023a). This network is similar to a U-Net structure but without group normalization layers and self-attention layers. The steady-state variance level \lambda^{2} was set to 30 (over 255), and the number of sampling steps T was set to 100. During training, we set the patch size to 128 with a batch size of 8 and use the Adam (Kingma & Ba, 2015) optimizer with parameters \beta_{1}=0.9 and \beta_{2}=0.99. The total number of training steps is 900 thousand, with the initial learning rate set to 10^{-4}; it decays by half at iterations 300, 500, 600, and 700 thousand. For the setting of \theta_{t}, we employ a flipped version of the cosine noise schedule (Nichol & Dhariwal, 2021), enabling \theta_{t} to change from 0 to 1 over time. Notably, to address the issue of \theta_{t} being too smooth when t is close to 1, we let the coefficient e^{-\bar{\theta}_{T}} be a sufficiently small value \delta=0.005 instead of zero, which corresponds to \bar{\theta}_{T}\approx\sum_{i=0}^{T}\theta_{i}\mathrm{d}t=-\log\delta, i.e., \mathrm{d}t=-\log\delta/\sum_{i=0}^{T}\theta_{i}. Our models are trained on a single 3090 GPU with 24GB memory for about 2.5 days.
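For concreteness, the \theta_{t} schedule described above can be constructed as in the sketch below. The exact flipped-cosine form is our assumption (a cosine schedule mirrored so that \theta_{t} increases from roughly 0 to 1), whereas the normalization of \mathrm{d}t so that e^{-\bar{\theta}_{T}}=\delta follows directly from the description:

import numpy as np

def flipped_cosine_theta_schedule(T=100, s=0.008, delta=0.005):
    """theta_i rising from ~0 to ~1 (assumed mirrored-cosine form), with dt chosen so that
    sum_i theta_i * dt = -log(delta), i.e. exp(-theta_bar_T) = delta."""
    i = np.arange(T + 1)
    theta = 1.0 - np.cos((i / T + s) / (1 + s) * np.pi / 2) ** 2   # assumption: flipped cosine
    dt = -np.log(delta) / theta.sum()                              # normalization from the text
    theta_bar = np.cumsum(theta) * dt                              # discretized \bar{theta}_t
    return theta, dt, theta_bar

theta, dt, theta_bar = flipped_cosine_theta_schedule()
print(theta[0], theta[-1])        # ~0 at t = 0, ~1 at t = T
print(np.exp(-theta_bar[-1]))     # equals delta = 0.005 by construction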

Appendix F Additional Experiments

Table 6: Image Inpainting. Quantitative comparison with the relevant baselines on CelebA-HQ with the thick mask.

METHOD PSNR\uparrow SSIM\uparrow LPIPS\downarrow FID\downarrow
DDRM 19.48 0.8154 0.1487 26.24
IRSDE 21.12 0.8499 0.1046 11.12
GOUB 22.27 0.8754 0.0914 5.64
Table 7: Image Deraining. Quantitative comparison with the relevant baselines on Rain100L.

METHOD PSNR\uparrow SSIM\uparrow LPIPS\downarrow FID\downarrow
PRENET 37.48 0.9792 0.020 10.9
MAXIM 38.06 0.9770 0.048 19.0
IRSDE 38.30 0.9805 0.014 7.94
GOUB 39.79 0.9830 0.009 5.18
Table 8: Image 8\times Super-Resolution. Quantitative comparison with the relevant baselines on DIV2K.

METHOD PSNR\uparrow SSIM\uparrow LPIPS\downarrow Training Datasets
SRFlow 23.05 0.57 0.272 DIV2K + Flickr2K
IRSDE 22.34 0.55 0.331 DIV2K
GOUB 23.17 0.60 0.310 DIV2K

Appendix G Additional Visual Results

Figure 6: Additional visual results on deraining with the Rain100H dataset.
Figure 7: Additional visual results on thin mask inpainting with the CelebA-HQ dataset.
Figure 8: Additional visual results on 4\times super-resolution with the DIV2K dataset.