This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\titlehead

Review

\subject

Computational statistics

\corres

Zheng Zhao

Conditional sampling within generative diffusion models

Zheng Zhao    Ziwei Luo    Jens Sjölund    and Thomas B. Schön Division of Systems and Control, Uppsala University [email protected]
Abstract

Generative diffusions are a powerful class of Monte Carlo samplers that leverage bridging Markov processes to approximate complex, high-dimensional distributions, such as those found in image processing and language models. Despite their success in these domains, an important open challenge remains: extending these techniques to sample from conditional distributions, as required in, for example, Bayesian inverse problems. In this paper, we present a comprehensive review of existing computational approaches to conditional sampling within generative diffusion models. Specifically, we highlight key methodologies that either utilise the joint distribution, or rely on (pre-trained) marginal distributions with explicit likelihoods, to construct conditional generative samplers.

This article is part of the theme issue “Bayesian inverse problems with generative models”.

keywords:
Generative diffusions, stochastic differential equations, conditional sampling, Bayesian inference
{fmtext}

1 Introduction

Consider the conditional probability distribution π(|y)\pi(\cdot{\;|\;}y) of a random variable XdX\in\mathbb{R}^{d} with condition ydyy\in\mathbb{R}^{d_{y}}. Sampling the distribution is the fundamental question in computational statistics, and there are a plethora of developed sampling schemes to use depending on what we know about π(|y)\pi(\cdot{\;|\;}y). As an example, when the density function of π(|y)\pi(\cdot{\;|\;}y) is available (up to a constant), Markov chain Monte Carlo (MCMC, Meyn and Tweedie, 2009) methods are popular and generic algorithms widely used. The MCMC algorithms simulate a Markov chain that leaves the target distribution invariant. The drawback is that this often makes the algorithms computationally and statistically inefficient for high-dimensional problems.

In this article, we discuss an emerging class of samplers that leverage generative diffusions (see, e.g., Benton et al., 2024; Song et al., 2021), which have empirically worked well for many Bayesian inverse problems. At the heart, the generative diffusions aim to find a continuos-time Markov process (e.g., stochastic differential equation) that bridges the target distribution and a reference measure, so that sampling the target simplifies to sample the reference and the Markov process. In contrast to traditional samplers such as MCMC which use the target’s density function to build statistically exact samplers, the generative diffusions use the data to approximate a sampler akin to normalising flow (Chen et al., 2018; Papamakarios et al., 2021) and flow matching (Lipman et al., 2023). This comes with at least three benefits compared to MCMC: 1) scalability of the problem dimension (after the training time), 2) no need to explicitly know the target density function, 3) and the resulting samplers are embarrassingly differentiable (see a use case in Watson et al., 2022).

However, the generative diffusion framework (for unconditional sampling) is not immediately applicable to conditional sampling, since we do not have the conditional data samples from π(|y)\pi(\cdot{\;|\;}y) required to train the generative samplers. This article thus presents a class of generative samplers that utilise data samples from the joint πX,Y\pi_{X,Y}, or the marginal πX\pi_{X} with explicit likelihood, which are typically more accessible than the conditional samples. In particular, we focus on describing generic methodologies and show how they can inspire new ones, although plenty of generative conditional samplers have been specifically developed in their respective applications, for instance, the guided diffusions in computer vision (Luo et al., 2023; Kawar et al., 2022; Lugmayr et al., 2022).

The article is organised as follows. In the following section, we briefly explain the core of generative diffusions to set-up the preliminaries of generative conditional samplers. Then, in Section 3 we show how to approximate the samplers based on the data sampled from the joint distribution, called the joint bridging methods. In Section 4 we show another important approach using Feynman–Kac models that leverages the data from the marginal when the likelihood model is accessible, followed by a pedagogical example.

Notation

If {X(t)}t=0T\{X(t)\}_{t=0}^{T} is a forward process, then we abbreviate pt|s(|x)p_{t{\;|\;}s}(\cdot{\;|\;}x) as the distribution of X(t)X(t) conditioned on X(s)=xX(s)=x for t>st>s. If the process is further correlated with another random variable YY, then we use pt,Yp_{t,Y} as the joint distribution of (X(t),Y)(X(t),Y), and pt|Y(|y)p_{t{\;|\;}Y}(\cdot{\;|\;}y) as the conditional one. We apply this notation analogously for the reverse process {U(t)}t=0T\{U(t)\}_{t=0}^{T}, but exchange pp to qq, e.g. qt|s(|u)q_{t{\;|\;}s}(\cdot{\;|\;}u) for U(t)U(t) conditioned on U(s)=uU(s)=u. The path measures of {X(t)}t=0T\{X(t)\}_{t=0}^{T} and {U(t)}t=0T\{U(t)\}_{t=0}^{T} are denoted by \mathbb{P} and 𝔹\mathbb{B}, respectively, with interchangeable marginal notation tpt\mathbb{P}_{t}\equiv p_{t}.

2 Generative diffusion sampling

We begin by explaining how generative diffusions are applied to solve unconditional sampling. Let π\pi be the distribution of a random variable XdX\in\mathbb{R}^{d} that we aim to sample from. The generative diffusions start with a (forward-time) stochastic differential equation (SDE)

dX(t)=a(X(t),t)dt+b(t)dW(t),X(0)π,X(T)pT,\mathop{}\!\mathrm{d}X(t)=a(X(t),t)\mathop{}\!\mathrm{d}t+b(t)\mathop{}\!\mathrm{d}W(t),\quad X(0)\sim\pi,\quad X(T)\sim p_{T}, (1)

that continuously transports π\pi to another reference distribution pTp_{T} at time TT, where a:d×[0,T]da\colon\mathbb{R}^{d}\times[0,T]\to\mathbb{R}^{d} and b:[0,T]×d×wb\colon[0,T]\times\mathbb{R}^{d\times w} are regular drift and dispersion functions, and WwW\in\mathbb{R}^{w} is a Brownian motion. Then, the gist is to find a reverse correspondence of Equation (1) such that if the reversal initialises at distribution pTp_{T} then it ends up with π\pi at TT. Namely, we look for a reversal

dU(t)=f(U(t),t)dt+g(t)dW(t),U(0)pT,U(T)π,\mathop{}\!\mathrm{d}U(t)=f(U(t),t)\mathop{}\!\mathrm{d}t+g(t)\mathop{}\!\mathrm{d}W(t),\quad U(0)\sim p_{T},\quad U(T)\sim\pi, (2)

and from now on we use qtq_{t} to denote the marginal law of U(t)U(t). If the forward SDE in Equation (1) is designed in such a way that pTp_{T} is easy to sample (e.g., a Gaussian), then we can sample the target π\pi—which is hard—by simulating the reversal in Equation (2) which is simpler.111The Brownian motions in Equations (1) and (2) are not necessarily the same. To avoid clutter, we throughout the paper apply the same notation WW for all Brownian motions when we are not concerned with distinguishing them.

There are infinitely many such reversals if we only require that the forward and reversal match their marginals at t=0t=0 and t=Tt=T. But if we also constrain the time evolution of ptp_{t} to be equal to qTtq_{T-t} for all t[0,T]t\in[0,T], then we arrive at the classical result by Anderson (1982) who explicitly writes the reversal as

f(u,t)=a(u,Tt)+Γ(Tt)logpTt(u),g(t)=b(Tt),f(u,t)=-a(u,T-t)+\Gamma(T-t)\operatorname{\nabla\!}\log p_{T-t}(u),\quad g(t)=b(T-t), (3)

where we shorten Γ(t)b(t)b(t)𝖳\Gamma(t)\coloneqq b(t)\,b(t)^{\mkern-1.5mu\mathsf{T}}, and ptp_{t} here stands for the probability density function of X(t)X(t). In other words, Anderson’s reversal here means that it solves the same Kolmogorov forward equation (in reverse time) as with Equation (1).

However, it is known that Anderson’s construction comes with two computational challenges. The first challenge lies in the intractability of the score function logpt\operatorname{\nabla\!}\log p_{t} (since that of π\pi is unknown). To handle this, the community has developed means of numerical techniques to approximate the score. A notable example in this context is denoising score matching (Song et al., 2021). It works by parametrising the score function (with, e.g., a neural network) and then estimating the parameters by optimising a loss function over the expectation on Equation (1(see also the convergence analysis in De Bortoli et al., 2021, Thm. 1). The second challenge is that it is non-trivial to design a forward SDE that exactly terminates at an easy-to-sample pTp_{T}. In practice, we often specify pTp_{T} as a unit Gaussian and then choose the associated Langevin dynamic as the forward equation, at the cost of assuming large enough TT.

The so-called dynamic Schrödinger bridge (DSB) is gaining traction as an alternative to Anderson’s construction (De Bortoli et al., 2021). In this way, Equation (1) serves as a reference process, instead of the forward process. Suppose that we aim to bridge between π\pi and any specified reference measure πref\pi_{\mathrm{ref}}, the DSB constructs Equation (1) (and its reversal) via the unique solution

=argmin0=π,T=πrefKL(),\mathbb{P}^{\star}=\operatorname*{arg\,min\,}_{\begin{subarray}{c}\mathbb{P}_{0}=\pi,\mathbb{P}_{T}=\pi_{\mathrm{ref}}\\ \mathbb{P}\in\mathcal{M}\end{subarray}}\mathrm{KL}(\mathbb{P}\,\|\,\mathbb{Q}), (4)

where \mathbb{Q} is the path measure of the reference process, and \mathcal{M} stands for the collection of all SDE solutions. Namely, DSB searches for diffusion processes that exactly bridge π\pi and the given πref\pi_{\mathrm{ref}}, and then DSB chooses the unique one that is closest to the reference process in the sense of Kullback–Leibler divergence. The solution of Equation (4) is typically not in closed form, and it has to be computed numerically. Compared to Anderson’s construction, DSB usually comes with smaller sampling error, since it does not make asymptotic assumptions on TT. However, the current approaches (see, e.g., methods in De Bortoli et al., 2021; Shi et al., 2023; Chen et al., 2022) for solving Equation (4) are computationally more demanding than commonly used denoising score matching, leading to potentially higher training cost.

3 Conditional sampling with joint samples

In the previous section we have seen the gist of generative sampling as well as two constructions of the generative samplers. However, they cannot be directly used to sample the conditional distribution π(|y)\pi(\cdot{\;|\;}y), since they need samples of π(|y)\pi(\cdot{\;|\;}y) to estimate Equations (2) and (4). Recently, Vargas et al. (2023) and Phillips et al. (2024) have developed generative samplers without using samples of the target, but when applied to the conditional sampling, we would have to re-train their models every time the condition yy changes. In this section we show how to apply the two constructions (i.e., Anderson and DSB) to train conditional samplers by samples from the joint distribution πX,Y\pi_{X,Y} which is a reasonable assumption for many applications (e.g., inpainting). We start by a heuristic idea, called Doob’s bridging, and then we show how to extend it to a generic framework that covers many other conditional generative samplers.

3.1 Doob’s bridging

Recall that we aim to sample π(|y)\pi(\cdot{\;|\;}y), and assume that we can sample the joint πX,Y\pi_{X,Y}. Denote by qt|s(|v)q_{t{\;|\;}s}(\cdot{\;|\;}v) the conditional distribution of U(t)U(t) conditioned on U(s)=vU(s)=v for t>st>s. The idea is now to construct the reversal in such a way that

qT| 0(x|y)=π(x|y)q_{T{\;|\;}0}(x{\;|\;}y)=\pi(x{\;|\;}y) (5)

for all xdx\in\mathbb{R}^{d}. If we in addition assume that the dimensions of XX and YY match, then an immediate construction that satisfies the requirement above is

dU(t)=f(U(t),t)dt+g(t)dW(t),(U(T),U(0))πX,Y,\mathop{}\!\mathrm{d}U(t)=f(U(t),t)\mathop{}\!\mathrm{d}t+g(t)\mathop{}\!\mathrm{d}W(t),\quad\bigl{(}U(T),U(0)\bigr{)}\sim\pi_{X,Y}, (6)

where the terminal and initial jointly follow πX,Y\pi_{X,Y} by marginalising out the intermediate path. Thus, sampling π(|y)\pi(\cdot{\;|\;}y) simplifies to simulating the SDE in Equation (6) with initial value U(0)=yU(0)=y. Directly finding such an SDE is hard, but we can leverage Anderson’s construction to form it as the reversal of a forward equation

dX(t)=a(X(t),t)dt+b(t)dW(t),(X(0),X(T))πX,Y.\mathop{}\!\mathrm{d}X(t)=a(X(t),t)\mathop{}\!\mathrm{d}t+b(t)\mathop{}\!\mathrm{d}W(t),\quad\bigl{(}X(0),X(T)\bigr{)}\sim\pi_{X,Y}. (7)

Now we have a few handy methods to explicitly construct the forward equation above. One notable example is via Doob’s hh-transform (Rogers and Williams, 2000) by choosing X(0)=XX(0)=X and

a(x,Y,t)=μ(x)+Γ(t)xloghT(Y,x,t),a(x,Y,t)=\mu(x)+\Gamma(t)\operatorname{\nabla\!}_{x}\log h_{T}(Y,x,t), (8)

where μ\mu is the drift function of another (arbitrary) reference SDE

dZ(t)=μ(Z(t))dt+b(t)dW(t),\mathop{}\!\mathrm{d}Z(t)=\mu(Z(t))\mathop{}\!\mathrm{d}t+b(t)\mathop{}\!\mathrm{d}W(t),

and the hh-function hT(y,x,t)pZ(T)|Z(t)(y|x)h_{T}(y,x,t)\coloneqq p_{Z(T){\;|\;}Z(t)}(y{\;|\;}x) is the conditional density function of Z(T)Z(T) with condition at tt. With this choice, we immediately have X(T)=YX(T)=Y, and thus the desired (X(0),X(T))πX,Y\bigl{(}X(0),X(T)\bigr{)}\sim\pi_{X,Y} is satisfied. However, we have to remark a caveat that the process tX(t)t\mapsto X(t) in Equation (7) marginally is not a Markov process, since its SDE now invokes YY which is correlated with the process across time. But on the other hand, this problem can be solved by an extension dY(t)=0\mathop{}\!\mathrm{d}Y(t)=0, Y(0)=YY(0)=Y, such that (X(t),Y(t))(X(t),Y(t)) is jointly Markov. We can then indeed learn the reversal in Equation (6) by Anderson’s construction. This approach was exploited by Somnath et al. (2023) who coined the term “aligned Schrödinger bridge”. But we note that the approach does not solve the DSB problem in Equation (4): In DSB we aim to find the optimal coupling under a path constraint, but in this approach, the coupling is already fixed by the given πX,Y\pi_{X,Y}.

The generative conditional sampling with Doob’s bridging is summarised as follows.

Algorithm 3.1 (Doob’s bridging conditional sampling).

The Doob’s bridging conditional sampling consists in training and sampling a generative diffusion as follows.

  1. Training.

    Draw {(Xi,Yi)}i=1nπX,Y\{(X^{i},Y^{i})\}_{i=1}^{n}\sim\pi_{X,Y}, and then simulate paths of Equation (7) with these samples as their respective initials. Based on these paths, learn the reversal in Equation (6) by, for instance, score matching. Note that since the forward drift depends on YY, the reversal depends on YY too.

  2. Sampling.

    After the reversal is estimated. For any given condition yy, simulate Equation (6) with initial value U(0)=yU(0)=y. Then it follows that U(T)π(|y)U(T)\sim\pi(\cdot{\;|\;}y).

From the algorithm above we see that the approach is as straightforward as the standard generative sampling for unconditional distributions, with an additional involvement of YY in the forward and reverse equations; It adds almost no computational costs. Moreover, the algorithm does not assume TT\to\infty unlike many other generative samplers. However, the algorithm suffers from a few non-trivial problems due to the use of Doob’s transform to exactly pin two points. First, Doob’s hh-function is mostly not available in closed form when the reference process is nonlinear. Although it is still possible to simulate Equation (7) without knowing hTh_{T} explicitly (Schauer et al., 2017), and to approximate hTh_{T} (Bågmark et al., 2022; Baker et al., 2024). This hinders the use of denoising score matching which is a computationally efficient estimator, as the conditional density pX(t)|X(s)p_{X(t){\;|\;}X(s)} of the forward equation is consequently intractable. Second, the function hT(,,T)h_{T}(\cdot,\cdot,T) at time TT is not defined, and moreover, the forward drift is rather stiff around that time. This induces numerical errors and instabilities that are hard to eliminate. Finally, the approach assumes that the dimensions of XX and YY are the same, which limits the range of applications. Recently, Denker et al. (2024) generalise the method by exploiting the likelihood πY|X\pi_{Y{\;|\;}X} in such a way that the two points do not have to be exactly pinned. This can potentially solve the problems mentioned above.

3.2 Generalised joint bridging

Recall again that we aim to find a pair of forward and reversal SDEs such that the reversal satisfies Equation (5). In the previous section, we have exploited Doob’s transform for the purpose, but the approach comes with several problems due to the use of Doob’s hh-transform for exactly pinning the data random variables XX and YY. In this section, we present a generalised forward-reverse SDE that does not require exact pinning.

The core of the idea is to introduce an auxiliary variable VdyV\in\mathbb{R}^{d_{y}} in the joint (U(0),U(T),V)\bigl{(}U(0),U(T),V\bigr{)}, such that it facilities sampling qT,0|Y(|y)q_{T,0{\;|\;}Y}(\cdot{\;|\;}y), and that the marginal at TT satisfies qT|V(|y)=π(|y)q_{T{\;|\;}V}(\cdot{\;|\;}y)=\pi(\cdot{\;|\;}y) (cf. Equation (5)). More precisely, the reversal that we desire is

dU(t)=f(U(t),V,t)dt+g(t)dW(t),(U(0),V)πref,(U(T),V)πX,Y,\mathop{}\!\mathrm{d}U(t)=f(U(t),V,t)\mathop{}\!\mathrm{d}t+g(t)\mathop{}\!\mathrm{d}W(t),\quad(U(0),V)\sim\pi_{\mathrm{ref}},\quad(U(T),V)\sim\pi_{X,Y}, (9)

starting at any reference πref\pi_{\mathrm{ref}}, so that qT| 0,V(|u,y)πref(du|y)=qT|V(|y)=π(|y)\int q_{T{\;|\;}0,V}(\cdot{\;|\;}u,y)\pi_{\mathrm{ref}}(\mathop{}\!\mathrm{d}u{\;|\;}y)=q_{T{\;|\;}V}(\cdot{\;|\;}y)=\pi(\cdot{\;|\;}y). With this reversal, if we can sample U(0)πref(|y)U(0)\sim\pi_{\mathrm{ref}}(\cdot{\;|\;}y), then consequently U(T)π(|y)U(T)\sim\pi(\cdot{\;|\;}y) throughout the SDE. Define the forward equation that is paired with the reversal by

dX(t)=a(X(t),Y,t)dt+b(t)dW(t),(X(0),Y)πX,Y,(X(T),Y)πref,\mathop{}\!\mathrm{d}X(t)=a(X(t),Y,t)\mathop{}\!\mathrm{d}t+b(t)\mathop{}\!\mathrm{d}W(t),\quad(X(0),Y)\sim\pi_{X,Y},\quad(X(T),Y)\sim\pi_{\mathrm{ref}}, (10)

where the reference measure πref=pT,Y\pi_{\mathrm{ref}}=p_{T,Y} is the joint distribution of (X(T),Y)(X(T),Y). The Doob’s bridging is a special case, in the way that we have coerced X(T)=YX(T)=Y as the reference. Also note that we can extend Equations (9) and (10) to Markov processes with dY(t)=0\mathop{}\!\mathrm{d}Y(t)=0 in the same way as in Doob’s bridging. To ease later discussions, let us first summarise the conditional sampling by the joint bridging method as follows.

Algorithm 3.2 (Joint bridging conditional sampling).

Let the forward and reversal in Equations (10) and (9) be given (i.e., training the generative sampler is complete). Sample U(0)πref(|y)U(0)\sim\pi_{\mathrm{ref}}(\cdot{\;|\;}y) and simulate the reversal at TT, then U(T)π(|y)U(T)\sim\pi(\cdot{\;|\;}y).

The remaining question is how to identify the generative sampler to use in Algorithm 3.2. Now looking back at Equations (1) and (2), we see that the algorithm is essentially an unconditional generative sampler that operates on the joint space of XX and YY. Hence, we can straightforwardly choose either Anderson’s construction or DSB to obtain the required reversal in Equation (9). With Anderson’s approach, the reversal’s drift function becomes

f(u,v,t)=a(u,v,Tt)+Γ(Tt)ulogpTt|Y(u|v),f(u,v,t)=-a(u,v,T-t)+\Gamma(T-t)\operatorname{\nabla\!}_{u}\log p_{T-t{\;|\;}Y}(u{\;|\;}v), (11)

where we see that the marginal score in Equation (3) now turns into a conditional one. This conditional score function is the major hurdle of the construction, as the conditional score is often intractable. Song et al. (2021) utilise the decomposition ulogpt|Y(u|v)=ulogpt(u)+ulogpY|t(v|u)\operatorname{\nabla\!}_{u}\log p_{t{\;|\;}Y}(u{\;|\;}v)=\operatorname{\nabla\!}_{u}\log p_{t}(u)+\operatorname{\nabla\!}_{u}\log p_{Y{\;|\;}t}(v{\;|\;}u), where the two terms on the right-hand side can be more accessible. Take image classification for example, where XX and YY stand for the image and label, respectively. The generative score ulogpt(u)\operatorname{\nabla\!}_{u}\log p_{t}(u) is learnt via standard score matching, while the likelihood score ulogpY|t(v|u)\operatorname{\nabla\!}_{u}\log p_{Y{\;|\;}t}(v{\;|\;}u) can be estimated by training an image classifier with categorical pY|tp_{Y{\;|\;}t} for all tt. However, this is highly application dependent, and developing a generic and efficient estimator for the conditional score remains an active research question (see, e.g., approximations in Chung et al., 2023; Song et al., 2023).

Another problem of using Anderson’s construction is again the common assumption TT\to\infty, as we need to sample the conditional reference pT|V=πref(|y)p_{T{\;|\;}V}=\pi_{\mathrm{ref}}(\cdot{\;|\;}y). For example, Luo et al. (2023) choose a(x,Y,t)=θ(Yx)a(x,Y,t)=\theta\,(Y-x), so that πref(|y)N(|y,σ2)\pi_{\mathrm{ref}}(\cdot{\;|\;}y)\approx\mathrm{N}(\cdot{\;|\;}y,\sigma^{2}) is approximately a stationary Gaussian. Under this choice, the resulting forward equation is akin to that of Doob’s bridging but instead pins between XX and a Gaussian-noised YY.

As an alternative to Anderson’s construction, we can employ DSB to search for a pair of Equations (9) and (10) that exactly bridges between πX,Y\pi_{X,Y} and a given πref\pi_{\mathrm{ref}}, within a finite horizon TT. Formally, we are looking for a process

dX(t)=a(X(t),Y(t),t)dt+b(t)dW(t),dY(t)=0,(X(0),Y(0))πX,Y,(X(T),Y(0))πref,\begin{split}\mathop{}\!\mathrm{d}X(t)&=a(X(t),Y(t),t)\mathop{}\!\mathrm{d}t+b(t)\mathop{}\!\mathrm{d}W(t),\quad\mathop{}\!\mathrm{d}Y(t)=0,\\ \bigl{(}X(0),Y(0)\bigr{)}&\sim\pi_{X,Y},\quad\bigl{(}X(T),Y(0)\bigr{)}\sim\pi_{\mathrm{ref}},\end{split} (12)

whose path measure \mathbb{P} minimises KL()\mathrm{KL}(\cdot\,\|\,\mathbb{Q}), where \mathbb{Q} is the measure of a reference process

dZ(t)=μ(Z(t),R(t),t)dt+σ(t)dW(t),dR(t)=0,(Z(0),R(0))πX,Y.\begin{split}\mathop{}\!\mathrm{d}Z(t)&=\mu(Z(t),R(t),t)\mathop{}\!\mathrm{d}t+\sigma(t)\mathop{}\!\mathrm{d}W(t),\quad\mathop{}\!\mathrm{d}R(t)=0,\\ \bigl{(}Z(0),R(0)\bigr{)}&\sim\pi_{X,Y}.\end{split} (13)

Due to the fact that πref\pi_{\mathrm{ref}} is now allowed to arbitrary, we can choose a πref\pi_{\mathrm{ref}} such that πref(|y)\pi_{\mathrm{ref}}(\cdot{\;|\;}y) is a reasonable approximation to π(|y)\pi(\cdot{\;|\;}y) that is easy to sample from (e.g., Laplace approximation).

Currently there are two (asymptotic) numerical approaches to solve for the (conditional) DSB in the two equations above. The approach developed by De Bortoli et al. (2021) aims to find a sequence i\mathbb{P}^{i} that converges to \mathbb{P} as ii\to\infty via half-bridges

2i1\displaystyle\mathbb{P}^{2\,i-1} =argminKL(2i2),T=πref,\displaystyle=\operatorname*{arg\,min\,}_{\mathbb{P}}\mathrm{KL}\bigl{(}\mathbb{P}\,\|\,\mathbb{P}^{2\,i-2}\bigr{)},\quad\mathbb{P}_{T}=\pi_{\mathrm{ref}}, (14)
2i\displaystyle\mathbb{P}^{2\,i} =argminKL(2i1),0=πX,Y,\displaystyle=\operatorname*{arg\,min\,}_{\mathbb{P}}\mathrm{KL}\bigl{(}\mathbb{P}\,\|\,\mathbb{P}^{2\,i-1}\bigr{)},\quad\mathbb{P}_{0}=\pi_{X,Y}, (15)

starting at 0\mathbb{P}^{0}\coloneqq\mathbb{Q}. The solution of each half-bridge can be approximated by conventional parameter estimation methods in SDEs. As an example, in Equation (15), suppose that 2i1\mathbb{P}^{2\,i-1} is obtained and that we can sample from it, then we can approximate 2i\mathbb{P}^{2\,i} by parametrising the SDE in Equation (12) and then estimate its parameters via drift matching between 2i\mathbb{P}^{2\,i} and 2i1\mathbb{P}^{2\,i-1}. However, a downside of this approach is that the solution does not exactly bridge πX,Y\pi_{X,Y} and πref\pi_{\mathrm{ref}} at any ii albeit the asymptoticity: it is either πX,Y\pi_{X,Y} or πref\pi_{\mathrm{ref}} that is preserved, not the both. This problem is solved by Shi et al. (2023). They utilise the decomposition =π| 0,T\mathbb{P}=\pi^{\star}\,\mathbb{Q}_{{\;|\;}0,T}, where π\pi^{\star} solves the optimal transport minν{KL(ν0,T):ν0=πX,Y,νT=πref}\min_{\nu}\{\mathrm{KL}(\nu\,\|\,\mathbb{Q}_{0,T})\colon\nu_{0}=\pi_{X,Y},\nu_{T}=\pi_{\mathrm{ref}}\}, and | 0,T\mathbb{Q}_{{\;|\;}0,T} stands for the measure of the reference process conditioning at times 0 and TT (e.g., Doob’s hh-transform). The resulting algorithm is then alternating between

2i1=proj(2i2),2i=0,T2i1| 0,T,\begin{split}\mathbb{P}^{2\,i-1}&=\mathrm{proj}_{\mathcal{M}}\bigl{(}\mathbb{P}^{2\,i-2}\bigr{)},\\ \mathbb{P}^{2\,i}&=\mathbb{P}^{2\,i-1}_{0,T}\,\mathbb{Q}_{{\;|\;}0,T},\end{split} (16)

starting at 0π~| 0,T\mathbb{P}^{0}\coloneqq\widetilde{\pi}\,\mathbb{Q}_{{\;|\;}0,T}, where π~\widetilde{\pi} is any coupling of πX,Y\pi_{X,Y} and πref\pi_{\mathrm{ref}}, for instance, the trivial product measure πX,Y×πref\pi_{X,Y}\times\pi_{\mathrm{ref}}. In the iteration above, proj(2i2)\mathrm{proj}_{\mathcal{M}}\bigl{(}\mathbb{P}^{2\,i-2}\bigr{)} approximates 2i2\mathbb{P}^{2\,i-2} as a Markov process, since the previous 2i2\mathbb{P}^{2\,i-2} is not necessarily Markovian. Moreover, Shi et al. (2023) choose a particular projection proj(2i2)=argminKL(2i2)\mathrm{proj}_{\mathcal{M}}\bigl{(}\mathbb{P}^{2\,i-2}\bigr{)}=\operatorname*{arg\,min\,}_{\mathbb{P}\in\mathcal{M}}\mathrm{KL}\bigl{(}\mathbb{P}^{2\,i-2}\,\|\,\mathbb{P}\bigr{)}, such that their time-marginal distributions match, and therefore preserving the marginals 0i=πX,Y\mathbb{P}^{i}_{0}=\pi_{X,Y} and Ti=πref\mathbb{P}^{i}_{T}=\pi_{\mathrm{ref}} at any ii. In essence, this algorithm builds a sequence on couplings 0,Ti\mathbb{P}^{i}_{0,T} that converges to π\pi^{\star}, in conjunction with Doob’s bridging. Although the method has to simulate Doob’s hh-transform | 0,T\mathbb{Q}_{{\;|\;}0,T} which can be hard for complex reference processes, the cost is marginal in the context.

Compared to the iteration in Equation (15), this projection-bridging method is particularly useful in the generative diffusion scenario, since we mainly care about the marginal distribution and not so much about the actual interpolation path. That said, at any iteration ii, the solution, is already a valid bridge between πX,Y\pi_{X,Y} and πref\pi_{\mathrm{ref}} at disposal for generative sampling, even though it is not yet a solution to the Schrödinger bridge.

3.3 Forward-backward path bridging

The previous joint bridging method (including Doob’s bridging as a special case) aims to find a reversal such that the conditional terminal is qT|V(|y)=π(|y)q_{T{\;|\;}V}(\cdot{\;|\;}y)=\pi(\cdot{\;|\;}y). In essence, it thinks of sampling the target as simulating a reverse SDE. In this section, we show another useful insight, by viewing sampling the target as simulating a stochastic filtering distribution (Corenflos et al., 2024; Dou and Song, 2024; Trippe et al., 2023). Formally, this approach sets base on

dX(t)=a1(X(t),Y(t),t)dt+b1(t)dW1(t),dY(t)=a2(X(t),Y(t),t)dt+b2(t)dW2(t),(X(0),Y(0))πX,Y,\begin{split}\mathop{}\!\mathrm{d}X(t)&=a_{1}(X(t),Y(t),t)\mathop{}\!\mathrm{d}t+b_{1}(t)\mathop{}\!\mathrm{d}W_{1}(t),\\ \mathop{}\!\mathrm{d}Y(t)&=a_{2}(X(t),Y(t),t)\mathop{}\!\mathrm{d}t+b_{2}(t)\mathop{}\!\mathrm{d}W_{2}(t),\\ \bigl{(}X(0),Y(0)\bigr{)}&\sim\pi_{X,Y},\end{split} (17)

and the core is the identity

pX(0)|Y(0),Y(0,T],X(T)(|y,y(0,T],xT)pX(T),Y(0,T]|Y(0)(dxT,dy(0,T]|y)=pX(0)|Y(0)(|y)=π(|y),\begin{split}&\int p_{X(0){\;|\;}Y(0),Y_{(0,T]},X(T)}(\cdot{\;|\;}y,y_{(0,T]},x_{T})\,p_{X(T),Y_{(0,T]}{\;|\;}Y(0)}(\mathop{}\!\mathrm{d}x_{T},\mathop{}\!\mathrm{d}y_{(0,T]}{\;|\;}y)\\ &=p_{X(0){\;|\;}Y(0)}(\cdot{\;|\;}y)=\pi(\cdot{\;|\;}y),\end{split} (18)

where the integrand pX(0)|Y(0),Y(0,T],X(T)(|y,y(0,T],xT)p_{X(0){\;|\;}Y(0),Y_{(0,T]},X(T)}(\cdot{\;|\;}y,y_{(0,T]},x_{T}) is the filtering distribution (in the reverse-time direction) with initial value xTx_{T} and measurement y[0,T][y,y(0,T]]y_{[0,T]}\coloneqq[y,y_{(0,T]}]. The identity describes an ancestral sampling that, for any given condition yy, we first sample (X(T),Y(0,T])pX(T),Y(0,T]|Y(0)(|y)\bigl{(}X(T),Y_{(0,T]}\bigr{)}\sim p_{X(T),Y_{(0,T]}{\;|\;}Y(0)}(\cdot{\;|\;}y) and then use it to solve the filtering problem of the reversal, hence the name “forward-backward path bridging”. However, since these two steps cannot be computed exactly, we show how to approximate them in practice in the following.

The distribution pX(T),Y(0,T]|Y(0)(|y)p_{X(T),Y_{(0,T]}{\;|\;}Y(0)}(\cdot{\;|\;}y) is in general not tractable. In Dou and Song (2024) and Trippe et al. (2023) they make assumptions so that they can facilitate an approximation

pX(T),Y(0,T]|Y(0)=pX(T)|Y[0,T]pY(0,T]|Y(0)pX(T)|Y(T)pY(0,T]|Y(0).p_{X(T),Y_{(0,T]}{\;|\;}Y(0)}=p_{X(T){\;|\;}Y_{[0,T]}}\,p_{Y_{(0,T]}{\;|\;}Y(0)}\approx p_{X(T){\;|\;}Y(T)}\,p_{Y_{(0,T]}{\;|\;}Y(0)}. (19)

More specifically, they assume that 1) the SDE in Equation (17) is separable (i.e., a1a_{1} does not depend on YY while a2a_{2} does not depend on XX), so that they can independently simulate pY(0,T]|Y(0)(|y)p_{Y_{(0,T]}{\;|\;}Y(0)}(\cdot{\;|\;}y) leaving away the part that depends on XX, and that 2) the SDE is stationary enough for large TT such that pX(T)|Y[0,T]pX(T)|Y(T)p_{X(T){\;|\;}Y_{[0,T]}}\approx p_{X(T){\;|\;}Y(T)}. Furthermore, the (continuous-time) filtering distribution pX(0)|Y(0),Y(0,T],X(T)p_{X(0){\;|\;}Y(0),Y_{(0,T]},X(T)} is intractable too. In practice we apply sequential Monte Carlo (SMC, Chopin and Papaspiliopoulos, 2020) to approximately sample from it. This induces error due to 3) using a finite number of particles. Since continuous-time particle filtering (Bain and Crisan, 2009) is particularly hard in the generative context, we often have to 4) discretise the SDE beforehand, which again introduce errors, though simulating SDEs without discretisation is possible (Beskos and Roberts, 2005; Blanchet and Zhang, 2020). These four points constitute the main sources of errors of this method.

The work in Corenflos et al. (2024) eliminates all the errors except for that of the SDE discretisation. The idea is based on extending the identity in Equation (18) as an invariant Markov chain Monte Carlo (MCMC) kernel 𝒦\mathcal{K} by

(𝒦h)()\displaystyle(\mathcal{K}h)(\cdot) (20)
=pX(0)|Y(0),Y(0,T],X(T)(|y,y(0,T],xT)pX(T),Y(0,T]|X(0),Y(0)(dxT,dy(0,T]|z,y)h(dz),\displaystyle=\int p_{X(0){\;|\;}Y(0),Y_{(0,T]},X(T)}(\cdot{\;|\;}y,y_{(0,T]},x_{T})\int p_{X(T),Y_{(0,T]}{\;|\;}X(0),Y(0)}(\mathop{}\!\mathrm{d}x_{T},\mathop{}\!\mathrm{d}y_{(0,T]}{\;|\;}z,y)\,h(\mathop{}\!\mathrm{d}z),

where we see that the kernel is indeed π(|y)\pi(\cdot{\;|\;}y)-variant: 𝒦(pX(0)|Y(0))=pX(0)|Y(0)=πX|Y\mathcal{K}(p_{X(0){\;|\;}Y(0)})=p_{X(0){\;|\;}Y(0)}=\pi_{X{\;|\;}Y}. Moreover, we can replace the filtering distribution with another MCMC kernel by, for instance, gradient Metropolis–Hasting (Titsias and Papaspiliopoulos, 2018), or more suitable in this context, the conditional sequential Monte Carlo (CSMC, Andrieu et al., 2010). The algorithm finally works as a Metropolis-within-Gibbs sampler as follows.

Algorithm 3.3 (Gibbs filtering of conditional generative diffusion).

Let X0π(|y)X^{0}\sim\pi(\cdot{\;|\;}y), and qq be the law of the reversal of Equation (17). For i=1,2,i=1,2,\ldots, do

  1. 1.

    simulate forward (XTi,Y(0,T]i)pX(T),Y(0,T]|Y(0),X(0)(|y,Xi1)\bigl{(}X_{T}^{i},Y^{i}_{(0,T]}\bigr{)}\sim p_{X(T),Y_{(0,T]}{\;|\;}Y(0),X(0)}(\cdot{\;|\;}y,X^{i-1}),

  2. 2.

    set the reverse path V[0,T]i=Y[T,0]iV_{[0,T]}^{i}=Y_{[T,0]}^{i},

  3. 3.

    simulate XiqU(T)|V[0,T],U(0)(|V[0,T]i,XTi)X^{i}\sim q_{U(T){\;|\;}V_{[0,T]},U(0)}(\cdot{\;|\;}V^{i}_{[0,T]},X_{T}^{i}) using CSMC.

Then Xiπ(|y)X^{i}\sim\pi(\cdot{\;|\;}y) for any ii.

The Gibbs filtering-based conditional sampler above has significantly less errors compared to Trippe et al. (2023) and Dou and Song (2024) as empirically shown in Corenflos et al. (2024). In a sense, the Gibbs sampler has transformed the two major errors due to TT\to\infty and using a finite number of particles, to the increased autocorrelation in Algorithm 3.3. When TT is small, the samples are more correlated, but still statistically exact nonetheless. Moreover, the sampler does not assume the separability of the diffusion model, meaning that in the algorithm we are free to use the reversal construction of any kind (e.g., Anderson or DSB). On the other hand, Trippe et al. (2023) and Dou and Song (2024)’s approach does not instantly apply on DSBs which are rarely separable. The cost we pay for using the Gibbs sampler is mainly the implementation and statistical efficiency of the filtering MCMC in 3 of Algorithm 3.3. It is also an open question for simulating the filtering MCMC in continuous-time without discretising the SDE.

Comparing the filtering-based approaches to the joint bridging in Section 3.2.

These two methods mainly differ in the utilisation of their reversals, and they shine in different applications. More precisely, the joint bridging method aims to specifically construct a pair of forward and backward models that serves the conditional sampling, whereas the filtering-based methods work on any given forward-backward pair. This grants the filtering methods an upper hand for some applications as training-free samplers. Take image inpainting for example, where XX and YY stand for the observed and to-paint parts, respectively. The joint bridging method has to specifically train for a pair of forward and backward models for this inpainting problem. On the other hand, the filtering-based methods can take on any pre-trained unconditional diffusion models for the images (see Section 2), and plug-and-play without any training (see, e.g., Corenflos et al., 2024, Sec. 4.3). This is because in the inpainting application XX and YY completely factorises the unconditional diffusion model, and Dou and Song (2024) recently show that it can further generalise to linear inverse problems.

However, for applications when the training-free feature cannot be utilised, the joint bridging method is easier to implement and usually runs with less computational costs. The joint bridging method keeps not a path of Y[0,Y]Y_{[0,Y]}, and thus it does not need to solve a (reverse) filtering problem which is significantly harder than simulating a reverse SDE. Furthermore, the joint bridging method immediately allows for discrete condition YY, while this is yet an interesting open question for the filtering-based methods. The discrete extension of generative diffusions in Campbell et al. (2022); Benton et al. (2024) may provide insights that enable the discrete extension.

4 Conditional sampling with explicit likelihood

In Section 3 we have investigated a class of methods that leverage the information from the joint distribution πX,Y\pi_{X,Y} to build up conditional samplers for π(|y)\pi(\cdot{\;|\;}y). However, for some applications we may only be able to sample the marginal πX\pi_{X}. For instance, in image classification we usually have vast amount of images at hand, but pairing them with labels requires extensively more efforts. Thus, in this section we show how to construct conditional samplers within prior generative diffusions, based on exploiting the additional information from the likelihood π(y|)\pi(y{\;|\;}\cdot). In particular, we focus the methods based on Feynman–Kac models with sequential Monte Carlo.

4.1 Conditional generative Feynman–Kac models

Let us assume that we are given a pair of forward-backward diffusion models (using arbitrary reversal construction) that targets πX\pi_{X} as in Equations (1) and (2). For simplicity, let us change to work with the models at discrete times k=0,1,2,,Nk=0,1,2,\ldots,N via

p0:N(x0:N)=p0(x0)k=1Npk|k1(xk|xk1),q0:N(u0:N)=q0(u0)k=1Nqk|k1(uk|uk1),p_{0:N}(x_{0:N})=p_{0}(x_{0})\prod_{k=1}^{N}p_{k{\;|\;}k-1}(x_{k}{\;|\;}x_{k-1}),\quad q_{0:N}(u_{0:N})=q_{0}(u_{0})\prod_{k=1}^{N}q_{k{\;|\;}k-1}(u_{k}{\;|\;}u_{k-1}),

where p0()=πX()=qN()p_{0}(\cdot)=\pi_{X}(\cdot)=q_{N}(\cdot), and q0()=πref()=pN()q_{0}(\cdot)=\pi_{\mathrm{ref}}(\cdot)=p_{N}(\cdot). Since we can sample πX\pi_{X} and evaluate the likelihood π(y|)\pi(y{\;|\;}\cdot), it is natural to think of using importance sampling to sample π(|y)\pi(\cdot{\;|\;}y). That is, if {Xj}j=1JπX\{X^{j}\}_{j=1}^{J}\sim\pi_{X} then {(wj,Xj)}j=1Jπ(|y)\{(w_{j},X^{j})\}_{j=1}^{J}\sim\pi(\cdot{\;|\;}y) with weight wj=π(y|Xj)w_{j}=\pi(y{\;|\;}X^{j}). However, the weights will barely be informative, as the prior samples are very unlikely samples from the posterior distribution in generative applications. Hence, in practice we generalise importance sampling within Feynman–Kac models (Chopin and Papaspiliopoulos, 2020) allowing us to effectively sample the target with sequential Monte Carlo (SMC) samplers. Formally, we define the generative Feynman–Kac model by

Q0:N(u0:N)1NM0(u0)G0(u0)k=1NMk|k1(uk|uk1)Gk(uk,uk1),s.t.QNQ0:N(u0:N)du0:N1=π(|y),M0=q0,\begin{split}Q_{0:N}(u_{0:N})&\coloneqq\frac{1}{\ell_{N}}\,M_{0}(u_{0})\,G_{0}(u_{0})\prod_{k=1}^{N}M_{k{\;|\;}k-1}(u_{k}{\;|\;}u_{k-1})\,G_{k}(u_{k},u_{k-1}),\\ \text{s.t.}\quad Q_{N}&\coloneqq\int Q_{0:N}(u_{0:N})\mathop{}\!\mathrm{d}u_{0:N-1}=\pi(\cdot{\;|\;}y),\quad M_{0}=q_{0},\end{split} (21)

where {Mk|k1}k=0N\{M_{k{\;|\;}k-1}\}_{k=0}^{N} and {Gk}k=0N\{G_{k}\}_{k=0}^{N} are Markov kernels and potential functions, respectively, and N\ell_{N} is the marginal likelihood. Instead of applying naive importance sampling, we apply SMC to simulate the model, crucially, the target QNQ_{N}, as in the following algorithm.

Algorithm 4.4 (Generative SMC sampler).

The sequential Monte Carlo samples the Feynman–Kac model in Equation (21) as follows. First, draw U01,U02,,U0JM0U^{1}_{0},U^{2}_{0},\ldots,U^{J}_{0}\sim M_{0} and set w0j=G0(U0j)/i=1JG0(U0i)w_{0}^{j}=G_{0}(U_{0}^{j})\,/\,\sum_{i=1}^{J}G_{0}(U_{0}^{i}) for j=1,2,,Jj=1,2,\ldots,J. Then, for k=1,2,,Nk=1,2,\ldots,N do

  1. 1.

    Resample {(wk1j,Uk1j)}j=1J\bigl{\{}(w_{k-1}^{j},U_{k-1}^{j})\bigr{\}}_{j=1}^{J} if needed.

  2. 2.

    Draw UkjMk|k1(|Uk1j)U_{k}^{j}\sim M_{k{\;|\;}k-1}(\cdot{\;|\;}U_{k-1}^{j}) and compute w¯kj=wk1jGk(Ukj,Uk1j)\overline{w}_{k}^{j}=w_{k-1}^{j}\,G_{k}(U_{k}^{j},U_{k-1}^{j}) for j=1,2,,Jj=1,2,\ldots,J.

  3. 3.

    Normalise wkj=w¯kj/i=1Jw¯kiw_{k}^{j}=\overline{w}^{j}_{k}\,/\,\sum_{i=1}^{J}\overline{w}_{k}^{i} for j=1,2,,Jj=1,2,\ldots,J.

At the final step NN, the tuples {(wNj,UNj)}j=1J\bigl{\{}(w_{N}^{j},U_{N}^{j})\bigr{\}}_{j=1}^{J} are weighted samples of π(|y)\pi(\cdot{\;|\;}y).

Given only the marginal constraints (e.g., at QNQ_{N} and Q0Q_{0}), the generative Feynman–Kac model is not unique. A trivial example is by letting M0=q0M_{0}=q_{0}, Mk|k1=qk|k1M_{k{\;|\;}k-1}=q_{k{\;|\;}k-1}, G0=G1==GN11G_{0}=G_{1}=\cdots=G_{N-1}\equiv 1, and GN(uk,uk1)=π(y|uk)G_{N}(u_{k},u_{k-1})=\pi(y{\;|\;}u_{k}), where Algorithm 4.4 recovers naive importance sampling which is not useful in reality. The purpose is thus to design MM and GG so that QkQ_{k} continuously anneals to π(|y)\pi(\cdot{\;|\;}y) for k=0,1,,Nk=0,1,\ldots,N while keeping the importance samples in Algorithm 4.4 effective. Wu et al. (2023) and Janati et al. (2024) show a particularly useful system of Markov kernels and potentials by setting

M0=q0,G0=l0,Mk|k1(uk|uk1)Gk(uk,uk1)=lk(y,uk)qk|k1(uk|uk1)lk1(y,uk1),M_{0}=q_{0},\quad G_{0}=l_{0},\quad M_{k{\;|\;}k-1}(u_{k}{\;|\;}u_{k-1})\,G_{k}(u_{k},u_{k-1})=\frac{l_{k}(y,u_{k})\,q_{k{\;|\;}k-1}(u_{k}{\;|\;}u_{k-1})}{l_{k-1}(y,u_{k-1})}, (22)

where lk(y,uk)π(y|uN)qN|k(uN|uk)duNl_{k}(y,u_{k})\coloneqq\int\pi(y{\;|\;}u_{N})\,q_{N{\;|\;}k}(u_{N}{\;|\;}u_{k})\mathop{}\!\mathrm{d}u_{N}, and define lN(y,uN)π(y|uN)l_{N}(y,u_{N})\coloneqq\pi(y{\;|\;}u_{N}) that reduces to the target’s likelihood. Indeed, if we substitute Equation (22) back into Equation (21), the marginal constraint QNQ_{N} is satisfied. This choice has at least two advantages that makes it valuable in the generative diffusion context. First, this construction is locally optimal, in the sense that if we also fix Gk(uk,uk1)1G_{k}(u_{k},u_{k-1})\equiv 1 then Mk|k1(uk|uk1)=qk|k1,Y(uk|uk1,y)lk(y,uk)qk|k1(uk|uk1)M_{k{\;|\;}k-1}(u_{k}{\;|\;}u_{k-1})=q_{k{\;|\;}k-1,Y}(u_{k}{\;|\;}u_{k-1},y)\propto l_{k}(y,u_{k})\,q_{k{\;|\;}k-1}(u_{k}{\;|\;}u_{k-1}) which is exactly the Markov transition of the conditional reversal in Equation (11). Consequently, all samples in Algorithm 4.4 are equally weighted, and hence the SMC sampler is perfect. However, simulating the optimal Markov kernel is hard, as lk(y,uk)l_{k}(y,u_{k}) is intractable (except when k=Nk=N). This gives rise to another advantage: even if we replace the exact lkl_{k} by any approximation l^k\widehat{l}_{k} for k=0,1,,N1k=0,1,\ldots,N-1, the Feynman–Kac model remains valid. Namely, the weighted samples of Algorithm 4.4 converge to π(|y)\pi(\cdot{\;|\;}y) in distribution as JJ\to\infty even if {lk}k=0N1\{l_{k}\}_{k=0}^{N-1} is approximate. However, the quality of the approximation l^k\widehat{l}_{k} significantly impacts the effective sample size of the algorithm, and recently there have been numerous approximations proposed.

Following Equation (22), Wu et al. (2023) choose to approximate lkl_{k} by linearisation, and let the Markov kernel be that of a Langevin dynamic that approximately leaves qk|k1,Yq_{k{\;|\;}k-1,Y} invariant. More specifically, they choose

Mk|k1(uk|uk1)=q^k(uk|uk1,y),Gk(uk,uk1)=l^k(y,uk)qk|k1(uk|uk1)l^k1(y,uk1)q^k(uk|uk1,y),l^k(y,uk)π(y|mN(uk)),q^k(uk|uk1,y)N(uk|uk1+δkuklog(l^k(y,uk)qk|k1(uk|uk1)),2δk),\begin{split}M_{k{\;|\;}k-1}(u_{k}{\;|\;}u_{k-1})&=\widehat{q}_{k}(u_{k}{\;|\;}u_{k-1},y),\quad G_{k}(u_{k},u_{k-1})=\frac{\widehat{l}_{k}(y,u_{k})\,q_{k{\;|\;}k-1}(u_{k}{\;|\;}u_{k-1})}{\widehat{l}_{k-1}(y,u_{k-1})\,\widehat{q}_{k}(u_{k}{\;|\;}u_{k-1},y)},\\ \widehat{l}_{k}(y,u_{k})&\coloneqq\pi\bigl{(}y{\;|\;}m_{N}(u_{k})\bigr{)},\\ \widehat{q}_{k}(u_{k}{\;|\;}u_{k-1},y)&\coloneqq\mathrm{N}\Bigl{(}u_{k}{\;|\;}u_{k-1}+\delta_{k}\operatorname{\nabla\!}_{u_{k}}\log\bigl{(}\widehat{l}_{k}(y,u_{k})\,q_{k{\;|\;}k-1}(u_{k}{\;|\;}u_{k-1})\bigr{)},\sqrt{2}\,\delta_{k}\Bigr{)},\end{split} (23)

where mN(uk)m_{N}(u_{k}) stands for any approximate sample from qN|k(|uk)q_{N{\;|\;}k}(\cdot{\;|\;}u_{k}), and δk\delta_{k} is the Langevin step size. Originally, Wu et al. (2023) use mN(uk)m_{N}(u_{k}) as a sample of the reversal at NN starting at uku_{k}, which often experiences high variance, but it can be improved by using mN(uk)m_{N}(u_{k}) as the conditional mean of qN|k(|uk)q_{N{\;|\;}k}(\cdot{\;|\;}u_{k}) via Tweedie’s formula (Song et al., 2023). See also Cardoso et al. (2024) for an alternative construction of the proposal. While Wu et al. (2023)’s construction of the Feynman–Kac model is empirically successful in a number of applications, its main problem lies in its demanding computation. In particular, the proposal asks to differentiate through l^k\widehat{l}_{k} usually containing a denoising neural network.

Janati et al. (2024) follow up the construction in Equation (22) and propose an efficient sampler based on partitioning the Feynman–Kac model into several contiguous sub-models, where each sub-model targets its next. As an example, when we use two blocks, Q0:NQ_{0:N} is divided into Q0:j(1)Q^{(1)}_{0:j} and Qj:N(2)Q^{(2)}_{j:N} for some 0jN0\leq j\leq N, where Q0:j(1)Q^{(1)}_{0:j} starts at q0q_{0} and targets QjQ_{j}, whereas Qj:N(2)Q^{(2)}_{j:N} starts at QjQ_{j} and targets qNq_{N}. Although the division looks trivial at a first glance, the catch is that within each sub-model we can form more efficient approximations to the Feynman–Kac Markov transition. More precisely, recall that in Equation (23), l^k(y,uk)\widehat{l}_{k}(y,u_{k}) aims to approximate π(y|uN)qN|k(uN|uk)duN\int\pi(y{\;|\;}u_{N})\,q_{N{\;|\;}k}(u_{N}{\;|\;}u_{k})\mathop{}\!\mathrm{d}u_{N}, the accuracy of which depends on the distance NkN-k. Now inside the sub-model Q0:j(1)Q^{(1)}_{0:j}, the terminal is jj instead of NN, where jkj-k is mostly smaller than NkN-k for 0kj0\leq k\leq j. In addition, instead of running the SMC in Algorithm 4.4, they also propose a variational approximation to the Markov transition of the Feynman–Kac model in order to directly sample the model. Note that although Janati et al. (2024) set roots on linear Gaussian likelihood π(y|)\pi(y{\;|\;}\cdot), their method can straightforwardly generalise to other likelihoods as well, with suitable Gaussian quadratures applied.

A computationally simpler bootstrap construction of the generative Feynman–Kac model is

Mk|k1(uk|uk1)=qk|k1(uk|uk1),Gk(uk,uk1)=l^k(y,uk)l^k1(y,uk1),l^k(y,uk)π(λky|uk),\begin{split}M_{k{\;|\;}k-1}(u_{k}{\;|\;}u_{k-1})&=q_{k{\;|\;}k-1}(u_{k}{\;|\;}u_{k-1}),\quad G_{k}(u_{k},u_{k-1})=\frac{\widehat{l}_{k}(y,u_{k})}{\widehat{l}_{k-1}(y,u_{k-1})},\\ \widehat{l}_{k}(y,u_{k})&\coloneqq\pi(\lambda_{k}\,y{\;|\;}u_{k}),\end{split} (24)

where we simplify the Markov kernel by that of the unconditional reversal which is easier compared to q^k(uk|uk1,y)\widehat{q}_{k}(u_{k}{\;|\;}u_{k-1},y) in Equation (23). Additionally, the approximate l^k\widehat{l}_{k} is also simplified as the target likelihood evaluated at uku_{k} and a suitably scaled λky\lambda_{k}\,y with λN=1\lambda_{N}=1 justified by the heuristics in Janati et al. (2024, Sec. 4.1). However, since the Markov proposal is farther from the optimal one, the resulting algorithm is statistically less efficient. When the problem of interests is not difficult, this bootstrap model may indeed be useful in practice (see our example in Section 5).

Comparing the Feynman–Kac-based approaches to the joint bridging in Section 3.2.

As we have already pointed out, the Feynman–Kac model using Equation (22) indeed recovers the Anderson’s conditional SDE in Equation (11). The difference is that the Feynman–Kac model describes the evolution of the probability distribution (using an ensemble of weighted samples) between the reference and target, while the conditional SDE formulates on the individual path. Essentially, they resemble the Lagrangian and Eulerian specifications, respectively, of a flow field (Santambrogio, 2015). This gives the Feynman–Kac formalism an advantage in the way that it does not need to precisely compute the conditional score in Equation (11) which is the main blocker of the conditional SDE approach. In other words, Algorithm 4.4 transforms the errors in approximating the conditional score, to the effective sample size of the weighted samples. However, consequently the Feynman–Kac model pays additional computational costs, as it needs to store and process an ensemble of samples to describe the distribution. Another advantage of Feynman–Kac model is that it does not need to train a dedicated conditional model, leveraging an evaluable likelihood and a pre-trained reversal of πX\pi_{X}. Although the filtering-based method in Section 3.3 can also work on pre-trained generative models, it’s range of applications is more limited than Feynman–Kac. It would be an interesting future question to ask whether the method can extend to unknown likelihood, for instance, by applying the methods in Sisson et al. (2007); Beaumont et al. (2009); Papamakarios et al. (2019); Middleton et al. (2019).

5 Pedagogical example

Refer to caption
Figure 1: Illustration of conditional samples (in blue scatters) with three conditions on yy, where the contour plots the true conditional density function. Note that for each method we draw 10,000 samples, but we downsample it to 2,000 for visibility. We see that all the three methods recover the true conditional distribution but with small biases. For instance, CDSB with y=5y=5 has biased samples between the two modes.
Refer to caption
Figure 2: The marginal histograms with condition y=5y=5 (corresponding to the last row in Figure 1). We see that although the CDSB and Feynman–Kac methods recover the shape of the distribution, they are biased (e.g., note CDSB at around x1=0x_{1}=0 and Feynman–Kac at around x2=2x_{2}=-2).

In this section we present an example to illustrate 1) the joint bridging method with the Schrödinger bridge construction in Equation (12) and 2) the Feynman–Kac method using the construction in Equation (24). Importantly, we release our implementations, written with pedagogical intent.222See the code at https://github.com/spdes/gdcs implemented in JAX.

The conditional sampling task we test here is a two-dimensional distribution given as follows. Let π(x)=0.5N(x;0,v0)+0.5N(x;0,v1)\pi(x)=0.5\,\mathrm{N}(x;0,v_{0})+0.5\,\mathrm{N}(x;0,v_{1}) be a two-dimensional Gaussian mixture, with covariances v0=[10.80.81]v_{0}=\begin{bmatrix}1&0.8\\ 0.8&1\end{bmatrix} and v1=[10.80.81]v_{1}=\begin{bmatrix}1&-0.8\\ -0.8&1\end{bmatrix}, and define the likelihood as π(y|x)N(y|x2+0.5(x12+1),0.5)\pi(y{\;|\;}x)\sim\mathrm{N}\bigl{(}y{\;|\;}x_{2}+0.5\,(x_{1}^{2}+1),0.5\bigr{)}. We brute-force compute the posterior distribution π(|y)\pi(\cdot{\;|\;}y) via trapezoidal integration for any given yy to benchmark the conditional samplers.

To apply the joint bridging method with the Schrödinger bridge construction (which we refer to as CDSB), the first step is to identify the forward Equation (12) and its reversal, which constitutes the training part of the algorithm. We choose a Brownian motion as the reference process, and apply the numerical method in Equation (15) to solve for the forward and reversal with 15 iterations. After training, the conditional sampling amounts to running Algorithm 3.2. As for the Feynman–Kac method using the heuristic Equation (24), we choose λk1\lambda_{k}\equiv 1, stratified resampling, and train a DSB for πX\pi_{X} with settings identical to those in CDSB. The same neural network is used for both methods with time span T=1T=1, N=1 000N=1\,000 Euler discretisation steps, and reference distribution πref=N(0,1)\pi_{\mathrm{ref}}=\mathrm{N}(0,1). We also use Hamiltonian Monte Carlo (HMC) as a baseline to sample the posterior distribution with step size 0.35, unit mass, and 100 leap-frog integrations. Finally, we draw 10,000 Monte Carlo samples for the test.

The results are shown in Figures 1 and 2 tested with conditions y=1y=-1, 22, and 55 leading to challenging distribution shapes. We see that in general both the CDSB and Feynman–Kac methods recover the true distribution to good extents, and that they are comparable to the MCMC method HMC. However, we note that the two generative methods are biased, particularly evidenced from their histograms. This makes sense, as the training for generative models can hardly be done exactly. Moreover, we also see that CDSB with y=5y=5 is particularly erroneous between the two modes, as the training of CDSB uses the joint samples of πX,Y\pi_{X,Y}, while the likelihood of y=5y=5 is relatively low. This reflects that CDSB (or any other joint bridging method) may fail under extreme conditions. On the other hand, the Feynman–Kac method with SMC may also fail under extreme conditions, but it can be improved by using more particles (perhaps inefficiently).

6 Conclusion

In this article we have reviewed a class of conditional sampling schemes built upon generative diffusion models. We started by two foundational constructions for unconditional diffusion models: Anderson and Schrödinger bridge. From there we explored two practical scenarios (i.e., whether the joint πX,Y\pi_{X,Y}, or πX\pi_{X} and πY|X\pi_{Y{\;|\;}X} of the data distribution is accessible) to formulate the generative conditional samplers. In summary we have mainly described three methodologies in details. The first one, called joint bridging, involves training a dedicated generative diffusion that directly targets the conditional distribution π(|y)\pi(\cdot{\;|\;}y), framing the problem of conditional sampling as simulating a generative reverse diffusion. The second one, including diffusion Gibbs, also employs a generative diffusion but targets the joint πX,Y\pi_{X,Y}, and it casts conditional sampling as solving a stochastic filtering problem. The last one, leveraging the likelihood πY|X\pi_{Y{\;|\;}X} and a (pre-trained) generative diffusion that targets πX\pi_{X}, represents the conditional sampling as simulating a Feynman–Kac model. We have demonstrated the connections among these methodologies and highlighted their respective strengths in various applications. In the final section, we have provided a pedagogical example illustrating the joint bridging and Feynman–Kac methods.

Finally, we close this paper with a few remarks and ideas for future work. 1) Akin to MCMC, the generative diffusion samplers would also need convergence diagnosis. The difference is that for MCMC we need to check if the sampler converges to the stationary phase, while for generative diffusions we are instead interested in measuring the biases. For instance, in our experiment the training of the generative samplers are empirically assessed by an expert, while MCMC are backed by ergodic theory. Currently, to the best of our knowledge, it is not clear how to gauge if the trained generative sampler can be trusted in a principled way, in particular when it comes to high-dimensional problems. 2) As illustrated in our experiment, sampling with extreme conditions can be problematic for generative diffusions. Addressing outliers is an important area for improving generative models. 3) Apart from the base methodologies of generative diffusions, the deep learning techniques (e.g., U-net, time embedding, see Song and Ermon, 2020, for a longer list) are in our opinion the core that made them successful. Improving the structure and training of the involved neural networks are essential for making generative diffusion models functional in complex tasks, and we believe that this will remain a central focus in future developments.

\ack

This work was partially supported by the Kjell och Märta Beijer Foundation, the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation and by the project Deep probabilistic regression – new models and learning algorithms (contract number: 2021-04301), by the Swedish Research Council.

References

  • Anderson (1982) Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
  • Andrieu et al. (2010) Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010. With discussion.
  • Bågmark et al. (2022) Kasper Bågmark, Adam Andersson, and Stig Larsson. An energy-based deep splitting method for the nonlinear filtering problem. Partial Differential Equations and Applications, 4(2):14, 2022.
  • Bain and Crisan (2009) Alan Bain and Dan Crisan. Fundamentals of stochastic filtering. Springer-Verlag New York, 2009.
  • Baker et al. (2024) Elizabeth L. Baker, Moritz Schauer, and Stefan Sommer. Score matching for bridges without time-reversals. arXiv preprint arXiv:2407.15455, 2024.
  • Beaumont et al. (2009) Mark A. Beaumont, Jean-Marie Cornuet, Jean-Michel Marin, and Christian P. Robert. Adaptive approximate Bayesian computation. Biometrika, 96(4):983–990, 2009.
  • Benton et al. (2024) Joe Benton, Yuyang Shi, Valentin De Bortoli, George Deligiannidis, and Arnaud Doucet. From denoising diffusions to denoising Markov models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(2):286–301, 2024.
  • Beskos and Roberts (2005) Alexandros Beskos and Gareth O. Roberts. Exact simulation of diffusions. The Annals of Applied Probability, 15(4):2422–2444, 2005.
  • Blanchet and Zhang (2020) Jose Blanchet and Fan Zhang. Exact simulation for multivariate Itô diffusions. Advances in Applied Probability, 52(4):1003–1034, 2020.
  • Campbell et al. (2022) Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. In Advances in Neural Information Processing Systems, volume 35, pages 28266–28279. Curran Associates, Inc., 2022.
  • Cardoso et al. (2024) Gabriel Cardoso, Yazid Janati, Sylvain Le Corff, and Eric Moulines. Monte Carlo guided denoising diffusion models for Bayesian linear inverse problems. In The 12th International Conference on Learning Representations, 2024.
  • Chen et al. (2018) Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • Chen et al. (2022) Tianrong Chen, Guan-Horng Liu, and Evangelos Theodorou. Likelihood training of Schrödinger bridge using forward-backward SDEs theory. In Proceedings of the 10th International Conference on Learning Representations, 2022.
  • Chopin and Papaspiliopoulos (2020) Nicolas Chopin and Omiros Papaspiliopoulos. An introduction to sequential Monte Carlo. Springer Series in Statistics. Springer, 2020.
  • Chung et al. (2023) Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In The 11th International Conference on Learning Representations, 2023.
  • Corenflos et al. (2024) Adrien Corenflos, Zheng Zhao, Simo Särkkä, Jens Sjölund, and Thomas B Schön. Conditioning diffusion models by explicit forward-backward bridging. arXiv preprint arXiv:2405.13794, 2024.
  • De Bortoli et al. (2021) Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling. In Advances in Neural Information Processing Systems, volume 34, pages 17695–17709, 2021.
  • Denker et al. (2024) Alexander Denker, Francisco Vargas, Shreyas Padhy, Kieran Didi, Simon Mathis, Vincent Dutordoir, Riccardo Barbano, Emile Mathieu, Urszula Julia Komorowska, and Pietro Lio. DEFT: Efficient finetuning of conditional diffusion models by learning the generalised hh-transform. arXiv preprint arXiv:2406.01781, 2024.
  • Dou and Song (2024) Zehao Dou and Yang Song. Diffusion posterior sampling for linear inverse problem solving: A filtering perspective. In Proceedings of the 12th International Conference on Learning Representations, 2024.
  • Janati et al. (2024) Yazid Janati, Alain Durmus, Eric Moulines, and Jimmy Olsson. Divide-and-conquer posterior sampling for denoising diffusion priors. arXiv preprint arXiv:2403.11407, 2024.
  • Kawar et al. (2022) Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, volume 35, pages 23593–23606. Curran Associates, Inc., 2022.
  • Lipman et al. (2023) Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In Proceedings of the 11th International Conference on Learning Representations, 2023.
  • Lugmayr et al. (2022) Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11461–11471, 2022.
  • Luo et al. (2023) Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B. Schön. Image restoration with mean-reverting stochastic differential equations. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 23045–23066. PMLR, 2023.
  • Meyn and Tweedie (2009) Sean P. Meyn and Richard L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, 2nd edition, 2009.
  • Middleton et al. (2019) Lawrece Middleton, George Deligiannidis, Arnaud Doucet, and Pierre E. Jacob. Unbiased smoothing using particle independent Metropolis-Hastings. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
  • Papamakarios et al. (2019) George Papamakarios, David Sterratt, and Iain Murray. Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89, pages 837–848. PMLR, 2019.
  • Papamakarios et al. (2021) George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021.
  • Phillips et al. (2024) Angus Phillips, Hai-Dang Dau, Michael John Hutchinson, Valentin De Bortoli, George Deligiannidis, and Arnaud Doucet. Particle denoising diffusion sampler. arXiv preprint arXiv:2402.06320, 2024.
  • Rogers and Williams (2000) Chris Rogers and David Williams. Diffusions, Markov Processes, and martingales. Cambridge University Press, 2nd edition, 2000.
  • Santambrogio (2015) Filippo Santambrogio. Optimal transport for applied mathematicians, volume 87 of Progress in Nonlinear Differential Equations and Their Applications. Birkhäuser Cham, 2015.
  • Schauer et al. (2017) Moritz Schauer, Frank van der Meulen, and Harry van Zanten. Guided proposals for simulating multi-dimensional diffusion bridges. Bernoulli, 23(4A):2917–2950, 2017.
  • Shi et al. (2023) Yuyang Shi, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. Diffusion Schrödinger bridge matching. In Advances in Neural Information Processing Systems, volume 36, 2023.
  • Sisson et al. (2007) Scott A. Sisson, Y. Fan, and Mark M. Tanaka. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 104(6):1760–1765, 2007.
  • Somnath et al. (2023) Vignesh Ram Somnath, Matteo Pariset, Ya-Ping Hsieh, Maria Rodriguez Martinez, Andreas Krause, and Charlotte Bunne. Aligned diffusion Schrödinger bridges. In Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, volume 216, pages 1985–1995. PMLR, 2023.
  • Song et al. (2023) Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. In Proceedings of the 11th International Conference on Learning Representations, 2023.
  • Song and Ermon (2020) Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems, volume 33, pages 12438–12448. Curran Associates, Inc., 2020.
  • Song et al. (2021) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Proceedings of the 9th International Conference on Learning Representations, 2021.
  • Titsias and Papaspiliopoulos (2018) Michalis K. Titsias and Omiros Papaspiliopoulos. Auxiliary gradient-based sampling algorithms. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(4):749–767, 2018.
  • Trippe et al. (2023) Brian L. Trippe, Jason Yim, Doug Tischer, David Baker, Tamara Broderick, Regina Barzilay, and Tommi S. Jaakkola. Diffusion probabilistic modeling of protein backbones in 3D for the motif scaffolding problem. In Proceedings of The 11th International Conference on Learning Representations, 2023.
  • Vargas et al. (2023) Francisco Vargas, Will Sussman Grathwohl, and Arnaud Doucet. Denoising diffusion samplers. In Proceedings of the 11th International Conference on Learning Representations, 2023.
  • Watson et al. (2022) Daniel Watson, William Chan, Jonathan Ho, and Mohammad Norouzi. Learning fast samplers for diffusion models by differentiating through sample quality. In Proceedings of the 10th International Conference on Learning Representations, 2022.
  • Wu et al. (2023) Luhuan Wu, Brian L. Trippe, Christian A. Naesseth, David Blei, and John Patrick Cunningham. Practical and asymptotically exact conditional sampling in diffusion models. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.