
Soft-constrained Schrödinger Bridge: a Stochastic Control Approach

Jhanvi Garg, Xianyang Zhang, Quan Zhou∗,†
Abstract.

Schrödinger bridge can be viewed as a continuous-time stochastic control problem where the goal is to find an optimally controlled diffusion process whose terminal distribution coincides with a pre-specified target distribution. We propose to generalize this problem by allowing the terminal distribution to differ from the target but penalizing the Kullback-Leibler divergence between the two distributions. We call this new control problem soft-constrained Schrödinger bridge (SSB). The main contribution of this work is a theoretical derivation of the solution to SSB, which shows that the terminal distribution of the optimally controlled process is a geometric mixture of the target and some other distribution. This result is further extended to a time series setting. One application is the development of robust generative diffusion models. We propose a score matching-based algorithm for sampling from geometric mixtures and showcase its use via a numerical example for the MNIST data set.

Department of Statistics, Texas A&M University, College Station, TX 77843, USA
Corresponding author (email: [email protected])

1. Introduction

1.1. Schrödinger Bridge and Its Applications

Let $X=(X_{t})_{0\leq t\leq T}$ be a diffusion process over the finite time interval $[0,T]$ with initial distribution $\mu_{0}$. Schrödinger bridge seeks an optimal steering of $X$ towards a pre-specified terminal distribution $\mu_{T}$ such that the resulting controlled process is closest to $X$ in terms of Kullback-Leibler (KL) divergence (Schrödinger, 1931, 1932). Under certain regularity conditions, the optimally controlled process is another diffusion with the same diffusion coefficients as $X$ but an additional drift term. This result has been obtained via different approaches and at varying levels of generality, and among the seminal works are Fortet (1940); Beurling (1960); Jamison (1975); Föllmer (1988); Dai Pra (1991). For comprehensive reviews detailing the historical development, we refer readers to Léonard (2013) and Chen et al. (2021c).

The recent generative modeling literature has seen a surge in the use of Schrödinger bridge. In these applications, $\mu_{0}$ is typically some distribution that is easy to sample from, and $\mu_{T}$ is the unknown distribution of a given data set. By numerically approximating the solution to the Schrödinger bridge problem, one can generate samples from $\mu_{T}$ (i.e., synthetic data points that resemble the original data set). One such algorithm is presented by De Bortoli et al. (2021) and Vargas et al. (2021), who proposed to calculate the Schrödinger bridge by approximating the iterative proportional fitting procedure (Deming and Stephan, 1940); the method of De Bortoli et al. (2021) estimates the drift using score matching and neural networks, while that of Vargas et al. (2021) uses maximum likelihood and Gaussian processes. Concurrently, Wang et al. (2021) developed a two-stage method where an auxiliary Schrödinger bridge is run first to generate samples from a smoothed version of $\mu_{T}$, and a second Schrödinger bridge transports these samples towards $\mu_{T}$. Both approaches generalize the denoising diffusion model methods of Ho et al. (2020) and Song et al. (2021). Some other recent developments in this area include Chen et al. (2021a); Song (2022); Peluchetti (2023); Richter et al. (2023); Winkler et al. (2023); Hamdouche et al. (2023); Vargas et al. (2024).

Though not the focus of this work, Schrödinger bridge sampling methods can also be used when samples from $\mu_{T}$ are not available but $\mu_{T}$ is known up to a normalizing constant; see, e.g., Huang et al. (2021); Zhang and Chen (2021); Vargas et al. (2022), and see Heng et al. (2024) for a recent review. For the connections between Schrödinger bridge, optimal transport and variational inference, see, e.g., Chen et al. (2016b, 2021b); Tzen and Raginsky (2019).

1.2. Overview of This Work

The main contribution of this paper is the theoretical development of a generalized Schrödinger bridge problem, which we call soft-constrained Schrödinger bridge (SSB). We take the stochastic control approach that was employed by Mikami (1990); Dai Pra (1991); Pra and Pavon (1990) for studying the original Schrödinger bridge problem (see Problem 1). In SSB, the terminal distribution of the controlled process does not need to precisely match $\mu_{T}$ but needs to be close to $\mu_{T}$ in terms of KL divergence. Formally, SSB differs from the original problem in that we replace the hard constraint on the terminal distribution with an additional cost term, parameterized by $\beta$, in the objective function to be minimized (see Problem 2). A larger $\beta$ forces the terminal distribution of the controlled process to be closer to $\mu_{T}$. We rigorously find the solution to SSB and the expression for the drift term of the optimally controlled process. We show that SSB generalizes Schrödinger bridge in the sense that as $\beta\to\infty$, the solution to the former coincides with the solution to the latter. An important implication of our results is that the terminal distribution of the controlled process should be a geometric mixture of $\mu_{T}$ and some other distribution; when $\mu_{0}$ is a Dirac measure, the other distribution is $\mathcal{L}aw(X_{T})$ (i.e., the terminal distribution of the uncontrolled process). We further extend our results to a time series generalization of SSB, where we are interested in modifying the joint distribution of $(X_{t_{1}},\dots,X_{t_{N}})$ for $0<t_{1}<\cdots<t_{N}=T$.

SSB can be used as a theoretical foundation for developing more flexible and robust sampling methods. First, when the KL divergence between $\mu_{T}$ and $\mathcal{L}aw(X_{T})$ is infinite, Schrödinger bridge does not admit a solution, while SSB always does. A toy example illustrating the consequences of this result is given in Example 1. More importantly, $\beta$ acts as a regularization parameter preventing the algorithm from overfitting to $\mu_{T}$, which is crucial for some generative modeling tasks such as fine-tuning with limited data (Moon et al., 2022). In such applications, $\mu_{T}$ contains information from a small or noisy data set, and one wants to improve the sample quality by harnessing knowledge from a large high-quality reference data set. To achieve this, we can train the uncontrolled process $X$ in SSB using the reference data set and then tune the value of $\beta$. We present a simple normal mixture example illustrating the effect of $\beta$ (see Example 4). For a more realistic example in generative modeling of images, we use the MNIST data set and consider the task of generating new images of digit 8. We assume that the training data set only has 50 noisy images of digit 8, but we can use the data set of all the other digits as reference. As suggested by our theoretical findings, we can train a Schrödinger bridge targeting a geometric mixture of the distributions of the two data sets. Such a Schrödinger bridge cannot be learned by existing methods, and to address this, we propose a new score matching algorithm that utilizes importance sampling. We show that this approach yields high-quality images of digit 8 when $\beta$ is properly chosen.

The paper is structured as follows. In Section 2, we present the stochastic control formulation of the SSB problem, and we derive its solution when $\mu_{0}$ is a Dirac measure. The solution to SSB for general initial conditions is obtained in Section 3, which involves solving a generalized Schrödinger system. Section 4 extends the results to the time series setting. In Section 5, we present a new algorithm for robust generative modeling and demonstrate its use via the MNIST data set. Proofs and auxiliary results are deferred to Appendix.

1.3. Related Literature

Our development of SSB builds upon the work of Dai Pra (1991), which formulates Schrödinger bridge as a stochastic control problem and derives the solution using the logarithmic transformation technique pioneered by Fleming (Fleming, 1977, 2005; Fleming and Rishel, 2012) and the result of Jamison (1975). The time series SSB problem is a generalization of the work of Hamdouche et al. (2023), who extended the original Schrödinger bridge problem to the time series setting but only considered the special case where $\mu_{0}$ is a Dirac measure. Pavon and Wakolbinger (1991); Blaquiere (1992) adopted an alternative stochastic control approach to studying Schrödinger bridge, which was rooted in the same logarithmic transformation and also considered in Tzen and Raginsky (2019); Berner et al. (2022). This approach can be applied to the SSB problem as well, but it requires the use of a verification theorem.

Motivated by robust network routing, a discrete version of the SSB problem was proposed and solved in Chen et al. (2019), where $X$ is a discrete-time non-homogeneous Markov chain with finite state space. The techniques used in this paper are very different, and to our knowledge, the continuous-time SSB problem has not been addressed in the literature.

2. Problem Formulation

Let $\mu_{0},\mu_{T}$ be two probability distributions on ${\mathbb{R}}^{d}$ such that $\int\|x\|^{2}\mu_{0}(\mathrm{d}x)<\infty$ and $\mu_{T}\ll\lambda$, where $\lambda$ denotes the Lebesgue measure. Denote the density of $\mu_{T}$ by $f_{T}=\mathrm{d}\mu_{T}/\mathrm{d}\lambda$. Let $(\Omega,\mathcal{F},{\mathsf{P}})$ be a probability space, on which we define a standard $d$-dimensional Brownian motion $W=(W_{t})_{t\geq 0}$ and a random vector $\xi$ that is independent of $W$ and has distribution $\mu_{0}$. We will always use $X=(X_{t})_{0\leq t\leq T}$ to denote a weak solution to the following stochastic differential equation (SDE)

(2.1) $X_{0}=\xi, \text{ and } \mathrm{d}X_{t}=b(X_{t},t)\mathrm{d}t+\sigma\mathrm{d}W_{t}$

for $t\in[0,T]$, where $b\colon{\mathbb{R}}^{d}\times[0,T]\rightarrow{\mathbb{R}}^{d}$ and $\sigma\in(0,\infty)$. Given a control $u=(u_{t})_{0\leq t\leq T}$, define the controlled diffusion process by $X_{0}^{u}=\xi$ and

(2.2) $\mathrm{d}X^{u}_{t}=\left[b(X^{u}_{t},t)+u_{t}\right]\mathrm{d}t+\sigma\mathrm{d}W_{t}.$

We say a control $u$ is admissible if (i) $u_{t}$ is measurable with respect to $\sigma((X^{u}_{s})_{0\leq s\leq t})$, (ii) the SDE (2.2) admits a weak solution, and (iii) ${\mathsf{E}}\int_{0}^{T}\|u_{t}\|^{2}\mathrm{d}t<\infty$, where $\|\cdot\|$ denotes the $L^{2}$ norm. Denote the set of all admissible controls by $\mathcal{U}$. Note that the initial distributions of both $X$ and $X^{u}$ are always fixed to be $\mu_{0}$. For ease of presentation, throughout the paper we adopt the following regularity assumption on $b$, which was also used by Jamison (1975); Dai Pra (1991):

Assumption 1.

For each $1\leq i\leq d$, $b_{i}$ is bounded and continuous in ${\mathbb{R}}^{d}\times[0,T]$ and is Hölder continuous in $x$, uniformly with respect to $(x,t)\in{\mathbb{R}}^{d}\times[0,T]$.

Under Assumption 1, $X$ has a transition density function $p(x,t\,|\,y,s)$; that is, for any $0\leq s<t\leq T$, $y\in{\mathbb{R}}^{d}$, and Borel set $A$ in ${\mathbb{R}}^{d}$,

(2.3) ${\mathsf{P}}(X_{t}\in A\,|\,X_{s}=y)=\int_{A}p(x,t\,|\,y,s)\lambda(\mathrm{d}x).$

We will use $\mathrm{d}x$ as a shorthand for $\lambda(\mathrm{d}x)$. Moreover, by Girsanov's theorem, Assumption 1 implies that the probability measures induced by $(X_{t})_{0\leq t\leq T}$ and $(B_{t})_{0\leq t\leq T}$ are equivalent, where $B_{t}=\xi+\sigma W_{t}$. Hence, for any $t>s\geq 0$, $p(x,t\,|\,y,s)$ is strictly positive and $\mathcal{L}aw(X_{t})\approx\lambda$ (i.e., the two measures are equivalent). The role of Assumption 1 in our theoretical results will be further discussed in Remark 4.

Schrödinger bridge aims to find a minimum-energy modification of the dynamics of $X$ so that its terminal distribution coincides with a pre-specified distribution $\mu_{T}$, where “energy” is measured by KL divergence. Given $\sigma$-finite measures $\nu$ and $\mu$ such that $\nu\ll\mu$, we use $\mathcal{D}_{\mathrm{KL}}(\nu,\mu)=\int\log(\frac{\mathrm{d}\nu}{\mathrm{d}\mu})\mathrm{d}\nu$ to denote the KL divergence; if $\nu\not\ll\mu$, define $\mathcal{D}_{\mathrm{KL}}(\nu,\mu)=\infty$. Dai Pra (1991) considered the following stochastic control formulation of Schrödinger bridge.

Problem 1 (Schrödinger bridge).

Let $\mathcal{U}_{0}=\{u\in\mathcal{U}\colon\mathcal{L}aw(X^{u}_{T})=\mu_{T}\}$, where $(X^{u}_{t})_{0\leq t\leq T}$ is defined in (2.2). Find $V=\inf_{u\in\mathcal{U}_{0}}J(u)$, where

(2.4) $J(u)={\mathsf{E}}\int_{0}^{T}\frac{\|u_{t}\|^{2}}{2\sigma^{2}}\mathrm{d}t,$

and find the optimal control $u^{*}$ such that $J(u^{*})=V$.

Remark 1.

Let ${\mathsf{P}}_{X}$ (resp. ${\mathsf{P}}_{X}^{u}$) denote the probability measure induced by $X$ (resp. $X^{u}$) on the space of continuous functions on $[0,T]$. For any admissible control $u\in\mathcal{U}$, Girsanov's theorem implies that $J(u)=\mathcal{D}_{\mathrm{KL}}({\mathsf{P}}_{X}^{u},{\mathsf{P}}_{X})$.

Problem 1 has been well studied in the literature. In the special case where $\mu_{0}$ is a Dirac measure, the solution can be succinctly described, and $V$ is just the KL divergence between two probability distributions; we recall this in Theorem 1 below. In this paper, $\nabla$ always denotes differentiation with respect to $x$.

Theorem 1 (Dai Pra (1991)).

Let $\mu_{0}$ be the Dirac measure such that $\mu_{0}(\{x_{0}\})=1$ for some $x_{0}\in{\mathbb{R}}^{d}$, and let $X$ be a weak solution to (2.1). Assume $\mathcal{D}_{\mathrm{KL}}(\mu_{T},\mathcal{L}aw(X_{T}))<\infty$. For Problem 1, the optimal control is given by $u^{*}_{t}=\sigma^{2}\nabla\log h(X_{t}^{u^{*}},t)$, where

(2.5) $h(x,t)=\int p(z,T\,|\,x,t)\frac{f_{T}(z)}{p(z,T\,|\,x_{0},0)}\mathrm{d}z={\mathsf{E}}\left[\frac{f_{T}(X_{T})}{p(X_{T},T\,|\,x_{0},0)}\,\Big{|}\,X_{t}=x\right].$

Moreover, $J(u^{*})=\mathcal{D}_{\mathrm{KL}}(\mu_{T},\mathcal{L}aw(X_{T}))$.

Remark 2.

If $\mathcal{D}_{\mathrm{KL}}(\mu_{T},\mathcal{L}aw(X_{T}))=\infty$, then Problem 1 does not admit a solution in the sense that no admissible control $u$ can yield $\mathcal{L}aw(X_{T}^{u})=\mu_{T}$.

We propose a relaxed stochastic control formulation of the Schrödinger bridge problem by allowing the distribution of $X^{u}_{T}$ to be different from $\mu_{T}$.

Problem 2 (Soft-constrained Schrödinger bridge).

For $\beta>0$, find $V=\inf_{u\in\mathcal{U}}J_{\beta}(u)$, where

(2.6) $J_{\beta}(u)=\beta\,\mathcal{D}_{\mathrm{KL}}(\mathcal{L}aw(X^{u}_{T}),\mu_{T})+{\mathsf{E}}\int_{0}^{T}\frac{\|u_{t}\|^{2}}{2\sigma^{2}}\mathrm{d}t,$

and find the optimal control $u^{*}$ such that $J_{\beta}(u^{*})=V$.

Problem 2 (i.e., the SSB problem) replaces the hard constraint $\mathcal{L}aw(X^{u}_{T})=\mu_{T}$ in Problem 1 with a soft constraint parameterized by $\beta$. When $\beta=0$, it is clear that the optimal control $u^{*}$ for Problem 2 is $u^{*}_{t}\equiv 0$. As $\beta\rightarrow\infty$, the law of $X_{T}^{u}$ is forced to agree with $\mu_{T}$, and we will see in Theorem 2 that the optimal control for Problem 2 converges to that for Problem 1.

Before we try to solve Problem 2 in full generality, we make a remark on how Problem 1 can be simplified. In the literature, Problem 1 is often called the dynamic Schrödinger bridge problem. Since the objective function (2.4) is the KL divergence between the laws of the controlled and uncontrolled processes (recall Remark 1), we can use an additive property of KL divergence to reduce Problem 1 to a static version (Léonard, 2013), where one only needs to find a joint distribution $\pi$ with marginals $\mu_{0}$ and $\mu_{T}$ that minimizes $\mathcal{D}_{\mathrm{KL}}(\pi,\mathcal{L}aw(X_{0},X_{T}))$. Although this property is not directly used in this paper, the insight from this observation underpins our stochastic control analysis of SSB. In particular, when $\mu_{0}$ is a Dirac measure, the solution to Problem 2 can be obtained by a simple argument which reduces the problem to optimizing over the distribution of $X_{T}^{u}$ instead of over the distribution of the whole process $(X_{t}^{u})_{0\leq t\leq T}$.

Theorem 2.

Let $\mu_{0}$ be as given in Theorem 1. For Problem 2, the optimal control is given by $u^{*}_{t}=\sigma^{2}\nabla\log h(X_{t}^{u^{*}},t)$, where

$h(x,t)=h(x,t;\beta)=C^{-1}\int p(z,T\,|\,x,t)\cdot\left(\frac{f_{T}(z)}{p(z,T\,|\,x_{0},0)}\right)^{\beta/(1+\beta)}\mathrm{d}z,$

and $C=\int f_{T}(x)^{\beta/(1+\beta)}p(x,T\,|\,x_{0},0)^{1/(1+\beta)}\mathrm{d}x$. Moreover, $J_{\beta}(u^{*})=-(1+\beta)\log C\in[0,\infty)$, and

(2.7) $\lim_{\beta\rightarrow\infty}h(x,t;\beta)=\int p(z,T\,|\,x,t)\frac{f_{T}(z)}{p(z,T\,|\,x_{0},0)}\mathrm{d}z.$
Proof.

If $\mathcal{L}aw(X^{u}_{T})\not\ll\mu_{T}$, then $u$ cannot be optimal since $J_{\beta}(u)=\infty$. Now fix an arbitrary $u\in\mathcal{U}$ such that $\mathcal{L}aw(X^{u}_{T})=\mu\ll\mu_{T}$. Letting $J(u)$ be as given in (2.4), we have

$J_{\beta}(u)=\beta\,\mathcal{D}_{\mathrm{KL}}(\mu,\mu_{T})+J(u)\geq\beta\,\mathcal{D}_{\mathrm{KL}}(\mu,\mu_{T})+\mathcal{D}_{\mathrm{KL}}(\mu,\mathcal{L}aw(X_{T}))\eqqcolon\tilde{J}_{\beta}(\mu),$

where the inequality follows from Theorem 1. Since $\mu_{T}$ has density $f_{T}$ and $\mathcal{L}aw(X_{T})$ has density $p(x,T\,|\,x_{0},0)$, we can apply Lemma B2 in Appendix to get $\inf_{\mu}\tilde{J}_{\beta}(\mu)=-(1+\beta)\log C\in[0,\infty)$, where the infimum is taken over all probability measures on ${\mathbb{R}}^{d}$ and is attained at $\mu^{*}$ such that

$\frac{\mathrm{d}\mu^{*}}{\mathrm{d}\lambda}(x)=C^{-1}f_{T}(x)^{\beta/(1+\beta)}p(x,T\,|\,x_{0},0)^{1/(1+\beta)}.$

The convergence of $h(x,t;\beta)$ as $\beta\rightarrow\infty$ also follows from Lemma B2.

It only remains to prove that $J_{\beta}(u^{*})=-(1+\beta)\log C$. Observe that we can rewrite $h$ as

$h(x,t)=\int p(z,T\,|\,x,t)\frac{(\mathrm{d}\mu^{*}/\mathrm{d}\lambda)(z)}{p(z,T\,|\,x_{0},0)}\mathrm{d}z.$

Hence, Theorem 1 implies that $u^{*}$ is also the solution to Problem 1 where the terminal constraint is given by $\mu^{*}$. So Theorem 1 yields that $J_{\beta}(u^{*})=\tilde{J}_{\beta}(\mu^{*})$, which proves the claim. ∎
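To see why the geometric mixture arises (this is the content of Lemma B2), note the following short computation, ignoring integrability issues: for any probability measure $\mu$ with density $f$ with respect to $\lambda$,

$\tilde{J}_{\beta}(\mu)=\int\log\frac{f(x)^{1+\beta}}{f_{T}(x)^{\beta}\,p(x,T\,|\,x_{0},0)}\,\mu(\mathrm{d}x)=(1+\beta)\int\log\frac{f(x)}{(\mathrm{d}\mu^{*}/\mathrm{d}\lambda)(x)}\,\mu(\mathrm{d}x)-(1+\beta)\log C=(1+\beta)\,\mathcal{D}_{\mathrm{KL}}(\mu,\mu^{*})-(1+\beta)\log C,$

which is minimized at $\mu=\mu^{*}$ with minimum value $-(1+\beta)\log C$.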

Remark 3.

When $b$ is constant, the transition density $p(x,t\,|\,x_{0},0)$ is easy to evaluate. If $f_{T}$ is known up to a normalizing constant, one can then use the Monte Carlo sampling scheme proposed by Huang et al. (2021) to approximate the drift $b+\sigma^{2}\nabla\log h(x,t)$ and simulate the controlled diffusion process (2.2). We describe this method and generalize it using importance sampling techniques in Appendix A. More sophisticated score-based sampling schemes can also be applied (Heng et al., 2024).
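To make the scheme concrete, below is a minimal Python sketch for the special case $b\equiv 0$ and $\mu_{0}=\delta_{x_{0}}$ with $x_{0}=0$; the function names and all numerical settings are our own illustrative choices. It uses the identity $\sigma^{2}\nabla\log h(x,t)={\mathsf{E}}[(Z-x)g(Z)]/\{(T-t)\,{\mathsf{E}}[g(Z)]\}$ with $Z\sim\mathcal{N}(x,\sigma^{2}(T-t)I)$, where for SSB one would take $g(z)=(f_{T}(z)/\phi_{\sigma\sqrt{T}}(z))^{\beta/(1+\beta)}$ (up to a constant, which does not affect the drift).

import numpy as np

def drift_mc(x, t, T, sigma, g, n_mc=1000, rng=None):
    # Monte Carlo estimate of sigma^2 * grad_x log h(x, t) for b == 0, where
    # h(x, t) = E[g(X_T) | X_t = x] and X_T | X_t = x ~ N(x, sigma^2 (T - t) I).
    rng = np.random.default_rng() if rng is None else rng
    z = x + sigma * np.sqrt(T - t) * rng.standard_normal((n_mc, x.shape[0]))
    w = g(z)                                   # g evaluated at the Gaussian samples
    num = (w[:, None] * (z - x)).mean(axis=0)
    return num / ((T - t) * w.mean())

def euler_maruyama(x0, T, sigma, g, n_steps=100, rng=None):
    # Simulate the controlled SDE dX = sigma^2 grad log h dt + sigma dW by Euler-Maruyama.
    rng = np.random.default_rng() if rng is None else rng
    x, dt = np.array(x0, dtype=float), T / n_steps
    for k in range(n_steps - 1):               # stop one step before T, where the drift may blow up
        x = x + drift_mc(x, k * dt, T, sigma, g, rng=rng) * dt \
              + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x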

One difference between Theorems 1 and 2 is that the condition $\mathcal{D}_{\mathrm{KL}}(\mu_{T},\mathcal{L}aw(X_{T}))<\infty$ is not required for solving Problem 2. We give a toy example illustrating the importance of this difference.

Example 1.

Consider $b\equiv 0$, $T=1$, $x_{0}=0$ and $\mu_{T}$ being the Cauchy distribution. Then, $\mathcal{L}aw(X_{T})$ is just the normal distribution with mean zero and covariance $\sigma^{2}I$, and we have $\mathcal{D}_{\mathrm{KL}}(\mu_{T},\mathcal{L}aw(X_{T}))=\infty$. Problem 1 does not admit a solution in this case, but Problem 2 has a solution for any $\beta\in[0,\infty)$, and the associated optimal control has finite energy cost. In Appendix A.2, we simulate the solution to Problem 2 with $\mu_{T}$ being the Cauchy distribution. We find that when $\beta=\infty$, the numerical scheme is unstable and fails to capture the heavy tails of the Cauchy distribution. In contrast, using a finite value of $\beta$ significantly stabilizes the algorithm.

3. Solution to Soft-constrained Schrödinger Bridge

When $\mu_{0}$ is not a Dirac measure, the solution to the Schrödinger bridge problem is more difficult to describe and is characterized by the so-called Schrödinger system (Léonard, 2013, Theorem 2.8). In this section, we prove that the solution to Problem 2 can be obtained in a similar way, but the Schrödinger system for Problem 2 now depends on $\beta$.

The main idea behind our approach is to first show that the optimal control must belong to a small class parameterized by a function $g$ and then use an argument similar to the proof of Theorem 2 to determine the choice of $g$. To introduce this class of controls, for each measurable function $g\colon{\mathbb{R}}^{d}\rightarrow[0,\infty)$, let $\mathcal{T}g$ denote the function on ${\mathbb{R}}^{d}\times[0,T)$ given by

(3.1) $(\mathcal{T}g)(x,t)=\sigma^{2}\nabla\log h(x,t),$

where

(3.2) $h(x,t)={\mathsf{E}}[g(X_{T})\,|\,X_{t}=x].$

Let $\mathcal{U}_{1}=\{u\in\mathcal{U}\colon u_{t}=(\mathcal{T}g)(X^{u}_{t},t)\text{ for some }g\geq 0\}$ denote the set of all controls that are constructed by this logarithmic transformation. We present in Theorem B3 in Appendix some well-known results about the controlled SDE (2.2) with $u\in\mathcal{U}_{1}$; in particular, part (iv) of the theorem shows that such a process is a Doob $h$-path process (Doob, 1959). Theorem B3 is largely adapted from Theorem 2.1 of Dai Pra (1991), and similar results are extensively documented in the literature (Jamison, 1975; Fleming and Sheu, 1985; Föllmer, 1988; Doob, 1984; Fleming, 2005). We can now prove a key lemma.

Lemma 3.

Let $u$ be an admissible control. Let $h$ be as given in (3.2) for some measurable $g\geq 0$ such that ${\mathsf{E}}g(X_{T})<\infty$ and $h>0$ on ${\mathbb{R}}^{d}\times[0,T)$. We have

$J_{\beta}(u)\geq\beta\,\mathcal{D}_{\mathrm{KL}}(\mathcal{L}aw(X^{u}_{T}),\mu_{T})+{\mathsf{E}}[\log g(X_{T}^{u})]-\int\log h(x,0)\mu_{0}(\mathrm{d}x),$

where $J_{\beta}$ is defined in (2.6). The equality holds when $u_{t}=(\mathcal{T}g)(X^{u}_{t},t)$.

Proof.

See Appendix C. ∎

Remark 4.

Assumption 1 guarantees that the SDE (2.1) admits a unique (in law) weak solution. More importantly, in the proof of Theorem B3 (which is used to derive Lemma 3), Assumption 1 is used to ensure that SDE (2.1) has a transition density function such that the function $h$ defined in (3.2) is sufficiently smooth and satisfies $\frac{\partial h}{\partial t}+\mathcal{L}h=0$, where $\mathcal{L}$ denotes the generator of $X$ (see Theorem B3). This condition can be relaxed; see Friedman (1975) and Karatzas and Shreve (2012, Chap. 5.7) for details.

Observe that in the bound given in Lemma 3, the term $\int\log h(x,0)\mu_{0}(\mathrm{d}x)$ is independent of the control $u$, and the other terms depend on $u$ only through the distribution of $X^{u}_{T}$. This implies that among all the admissible controls that result in the same distribution of $X^{u}_{T}$, the cost $J_{\beta}$ is minimized by some $u\in\mathcal{U}_{1}$. We can now prove the main theoretical result of this work in Theorem 4. The existence of the solution will be considered later in Theorem 5.

Theorem 4.

Suppose there exist $\sigma$-finite measures $\nu_{0},\nu_{T}$ such that $\nu_{0}\approx\mu_{0}$, $\nu_{T}\approx\mu_{T}$ and

(3.3) $\frac{\mathrm{d}\mu_{0}}{\mathrm{d}\nu_{0}}(y)=\int p(x,T\,|\,y,0)\nu_{T}(\mathrm{d}x),$
(3.4) $\frac{\mathrm{d}\mu_{T}}{\mathrm{d}\lambda}(x)=\rho_{T}(x)^{\frac{1+\beta}{\beta}}\int p(x,T\,|\,y,0)\nu_{0}(\mathrm{d}y),$

where $\rho_{T}=\mathrm{d}\nu_{T}/\mathrm{d}\lambda$ and the transition density $p$ is defined in (2.3). Assume $\int(\mathrm{d}\mu_{0}/\mathrm{d}\nu_{0})\mathrm{d}\mu_{0}<\infty$. Then $u^{*}_{t}=\sigma^{2}\nabla\log h(X_{t}^{u^{*}},t)$ solves Problem 2, where $h(x,t)={\mathsf{E}}[\rho_{T}(X_{T})\,|\,X_{t}=x]$. Moreover, $J_{\beta}(u^{*})=-\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0})\in[0,\infty)$.

Proof.

See Appendix C. ∎

Remark 5.

As we derive in the proof, the terminal distribution of the optimally controlled process is still a geometric mixture of two distributions. Explicitly, its density is proportional to

$f_{T}(x)^{\beta/(1+\beta)}\left(\int p(x,T\,|\,y,0)\nu_{0}(\mathrm{d}y)\right)^{1/(1+\beta)},$

where we recall $f_{T}$ is the density of $\mu_{T}$.

Remark 6.

The assumption $\int(\mathrm{d}\mu_{0}/\mathrm{d}\nu_{0})\mathrm{d}\mu_{0}<\infty$ guarantees that we can use Lemma 3 and Theorem B3 and that $J_{\beta}(u^{*})<\infty$; it is also used in Dai Pra (1991). Observe that $u^{*}_{t}=\sigma^{2}\nabla\log h(X_{t}^{u^{*}},t)$ is invariant to the scaling of the function $\rho_{T}$ and thus also to the scaling of $\nu_{0}$ and $\nu_{T}$. This suggests that the system defined by (3.3) and (3.4) can be generalized as follows. Let $a>0$, and let the $\sigma$-finite measures $\nu_{0}=\nu_{0}(a)$, $\nu_{T}=\nu_{T}(a)$ be the solution to the following system

(3.5) $\frac{\mathrm{d}\mu_{0}}{\mathrm{d}\nu_{0}}(y)=\int p(x,T\,|\,y,0)\nu_{T}(\mathrm{d}x),$
(3.6) $\frac{\mathrm{d}\mu_{T}}{\mathrm{d}\lambda}(x)=(a\rho_{T}(x))^{\frac{1+\beta}{\beta}}\int p(x,T\,|\,y,0)\nu_{0}(\mathrm{d}y),$

where $\rho_{T}=\mathrm{d}\nu_{T}/\mathrm{d}\lambda$. We can use essentially the same argument to show that the choice $h(x,t)={\mathsf{E}}[\rho_{T}(X_{T})\,|\,X_{t}=x]$ is optimal, but now we have

$J_{\beta}(u^{*})=-(1+\beta)\log a-\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0}(a)).$

This is the same as that given in Theorem 4. Indeed, if $(\nu_{0}^{*},\nu_{T}^{*})$ is a solution to (3.3) and (3.4), then the solution to (3.5) and (3.6) is given by $\mathrm{d}\nu_{0}(a)=a^{1+\beta}\mathrm{d}\nu_{0}^{*}$ and $\mathrm{d}\nu_{T}(a)=a^{-(1+\beta)}\mathrm{d}\nu_{T}^{*}$.
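For completeness, a quick check of the last claim, writing $\rho_{T}^{*}=\mathrm{d}\nu_{T}^{*}/\mathrm{d}\lambda$ and $\rho_{T}(a)=\mathrm{d}\nu_{T}(a)/\mathrm{d}\lambda=a^{-(1+\beta)}\rho_{T}^{*}$:

$\frac{\mathrm{d}\mu_{0}}{\mathrm{d}\nu_{0}(a)}(y)=a^{-(1+\beta)}\int p(x,T\,|\,y,0)\nu_{T}^{*}(\mathrm{d}x)=\int p(x,T\,|\,y,0)\nu_{T}(a)(\mathrm{d}x),$

$(a\rho_{T}(a)(x))^{\frac{1+\beta}{\beta}}\int p(x,T\,|\,y,0)\nu_{0}(a)(\mathrm{d}y)=a^{-(1+\beta)}\rho_{T}^{*}(x)^{\frac{1+\beta}{\beta}}\cdot a^{1+\beta}\int p(x,T\,|\,y,0)\nu_{0}^{*}(\mathrm{d}y)=\frac{\mathrm{d}\mu_{T}}{\mathrm{d}\lambda}(x),$

and $\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0}(a))=\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0}^{*})-(1+\beta)\log a$, so that $-(1+\beta)\log a-\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0}(a))=-\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0}^{*})$, matching Theorem 4.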

Example 2.

Theorem 2 can be obtained from Theorem 4 as a special case. If $\mu_{0}$ is a Dirac measure with $\mu_{0}(\{x_{0}\})=1$, one can check that, up to the scaling freedom described in Remark 6, the solution to (3.3) and (3.4) is given by $\nu_{0}=\mu_{0}$ and

$\rho_{T}(x)=\left(\frac{f_{T}(x)}{p(x,T\,|\,x_{0},0)}\right)^{\beta/(1+\beta)}.$
Example 3.

Suppose $b\equiv 0$, and let $\phi_{\sigma}$ denote the density function of the normal distribution with mean $0$ and covariance matrix $\sigma^{2}I$. We have $p(x,T\,|\,y,0)=\phi_{\sigma\sqrt{T}}(x-y)$. Assume $\mu_{0}\ll\lambda$ has density $f_{0}$, and suppose that $f_{0},f_{T}$ satisfy

$f_{0}(y)=c^{-1}\int\phi_{\sigma\sqrt{T}}(x-y)f_{T}(x)^{\frac{\beta}{1+\beta}}\mathrm{d}x,$

where $c=\int f_{T}(x)^{\beta/(1+\beta)}\mathrm{d}x$ is the normalizing constant assumed to be finite. A routine calculation using $\int\phi_{\sigma\sqrt{T}}(x-y)\mathrm{d}y=1$ can verify that the solution to (3.3) and (3.4) is given by

$\frac{\mathrm{d}\nu_{0}}{\mathrm{d}\lambda}=c^{-(1+\beta)},\qquad\rho_{T}(x)=c^{\beta}f_{T}(x)^{\beta/(1+\beta)}.$
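For completeness, the routine calculation reads

$\int\phi_{\sigma\sqrt{T}}(x-y)\rho_{T}(x)\mathrm{d}x=c^{\beta}\int\phi_{\sigma\sqrt{T}}(x-y)f_{T}(x)^{\beta/(1+\beta)}\mathrm{d}x=c^{1+\beta}f_{0}(y)=\frac{\mathrm{d}\mu_{0}}{\mathrm{d}\nu_{0}}(y),$

$\rho_{T}(x)^{(1+\beta)/\beta}\int\phi_{\sigma\sqrt{T}}(x-y)\,c^{-(1+\beta)}\mathrm{d}y=c^{1+\beta}f_{T}(x)\,c^{-(1+\beta)}=f_{T}(x)=\frac{\mathrm{d}\mu_{T}}{\mathrm{d}\lambda}(x),$

which are exactly (3.3) and (3.4).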

According to Remark 6, by choosing $a=c$ in (3.6), we can also replace $\nu_{0},\nu_{T}$ by $\nu_{0}(a),\nu_{T}(a)$, where $\nu_{0}(a)$ coincides with $\lambda$ and $\nu_{T}(a)$ is a probability distribution with density $c^{-1}f_{T}^{\beta/(1+\beta)}$. For the original Schrödinger bridge problem (i.e., $\beta=\infty$), this solution has been used in developing efficient generative sampling methods (Wang et al., 2021; Berner et al., 2022).

Chen et al. (2019) studied a matrix optimal transport problem which can be seen as the discrete analogue to Problem 2, and they proved that the corresponding Schrödinger system admits a unique solution. The main idea is to show that the solution can be characterized as the fixed point of an operator that is contractive with respect to the Hilbert metric (Bushell, 1973), a technique that has been widely used in the literature on Schrödinger systems (Fortet, 1940; Georgiou and Pavon, 2015; Chen et al., 2016a; Essid and Pavon, 2019; Deligiannidis et al., 2024). For the Schrödinger system defined by (3.3) and (3.4), an argument based on the Hilbert metric can also be applied to prove the existence and uniqueness of the solution when $\mu_{0},\mu_{T}$ are absolutely continuous and have compact support.

Theorem 5.

Let $K_{0}$ (resp. $K_{T}$) denote the support of $\mu_{0}$ (resp. $\mu_{T}$). Assume that

(i) $K_{0},K_{T}\subset\mathbb{R}^{d}$ are compact;

(ii) $f_{0}=\mathrm{d}\mu_{0}/\mathrm{d}\lambda$ and $f_{T}=\mathrm{d}\mu_{T}/\mathrm{d}\lambda$ exist, where $\lambda$ is the Lebesgue measure;

(iii) $(y,x)\mapsto p(x,T\,|\,y,0)$ is continuous and strictly positive on $K_{0}\times K_{T}$.

For any $\beta\in(0,\infty)$, there exists a unique pair of non-negative, integrable functions $(\rho_{0},\rho_{T})$ such that

(3.7) $f_{0}(y)=\rho_{0}(y)\int_{K_{T}}p(x,T\,|\,y,0)\rho_{T}(x)\mathrm{d}x,$
(3.8) $f_{T}(x)=\rho_{T}(x)^{(1+\beta)/\beta}\int_{K_{0}}p(x,T\,|\,y,0)\rho_{0}(y)\mathrm{d}y.$
Proof.

See Appendix D. ∎

Remark 7.

The proof of Theorem 5 is adapted from that for Proposition 1 in Chen et al. (2016a). Observe that the Schrödinger system naturally yields an iterative algorithm for computing $\rho_{0},\rho_{T}$. Given an estimate for $\rho_{0}$, denoted by $\hat{\rho}_{0}$, we can estimate $\rho_{T}$ by

$\hat{\rho}_{T}(x)=\left(\frac{f_{T}(x)}{\int_{K_{0}}p(x,T\,|\,y,0)\hat{\rho}_{0}(y)\mathrm{d}y}\right)^{\beta/(1+\beta)},$

which then can be used to update $\hat{\rho}_{0}$ by

$\hat{\rho}_{0}^{\mathrm{new}}(y)=\frac{f_{0}(y)}{\int_{K_{T}}p(x,T\,|\,y,0)\hat{\rho}_{T}(x)\mathrm{d}x}.$

Chen et al. (2016a) considered the original Schrödinger bridge problem (i.e., $\beta=\infty$) and showed that this updating scheme yields a strict contraction with respect to the Hilbert metric. We note that when $\beta<\infty$, this argument can potentially be made easier, since the mapping $\psi\mapsto\psi^{\beta/(1+\beta)}$ for a suitable function $\psi$ can easily be shown to be a strict contraction, and thus one only needs to verify that the other steps in the updating are contractions (not necessarily strict); see Lemma D5 in Appendix. The full scope of consequences of this observation and the existence proof for the general case are left to future study.
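As an illustration, the following Python sketch runs this fixed-point iteration on discretized supports; the function name, grid representation, and quadrature weights are our own choices and are not part of the theory.

import numpy as np

def solve_schrodinger_system(f0, fT, P, wy, wx, beta, n_iter=500, tol=1e-10):
    # f0, fT: densities of mu_0 and mu_T evaluated on grids {y_i} (size m) and {x_j} (size n).
    # P: (n, m) matrix with P[j, i] = p(x_j, T | y_i, 0); wy, wx: quadrature weights on the two grids.
    rho0 = np.ones_like(f0)
    for _ in range(n_iter):
        rhoT = (fT / (P @ (wy * rho0))) ** (beta / (1.0 + beta))   # update rho_T given rho_0
        rho0_new = f0 / (P.T @ (wx * rhoT))                        # update rho_0 given rho_T
        if np.max(np.abs(rho0_new - rho0)) < tol:
            return rho0_new, rhoT
        rho0 = rho0_new
    return rho0, rhoT

Each sweep mirrors the two displayed updates: the $\beta/(1+\beta)$ power produces the new $\hat{\rho}_{T}$, and a marginalization over the other grid produces the new $\hat{\rho}_{0}$.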

4. Extension to Time Series

Recently a time series version of Problem 1 was studied in Hamdouche et al. (2023), where the goal is to generate time series samples from a joint probability distribution on ${\mathbb{R}}^{d}\times\cdots\times{\mathbb{R}}^{d}$. We can generalize our Problem 2 to time series data analogously.

Problem 3.

Consider $N$ fixed time points $0<t_{1}<\cdots<t_{N}=T$. Let $\mu_{N}$ be a probability distribution on ${\mathbb{R}}^{d\times N}$ such that $\mu_{N}\ll\lambda$. For $\beta>0$, find $V=\inf_{u\in\mathcal{U}}J^{N}_{\beta}(u)$, where

(4.1) $J^{N}_{\beta}(u)=\beta\,\mathcal{D}_{\mathrm{KL}}(\mathcal{L}aw((X^{u}_{t_{i}})_{1\leq i\leq N}),\mu_{N})+{\mathsf{E}}\int_{0}^{T}\frac{\|u_{t}\|^{2}}{2\sigma^{2}}\mathrm{d}t,$

and find the optimal control $u^{*}$ such that $J^{N}_{\beta}(u^{*})=V$.

Recall that in Section 3, we started by considering functions $h$ such that $h(X_{t},t)={\mathsf{E}}[g(X_{T})\,|\,\mathcal{F}_{t}]$ for some function $g$, where $\mathcal{F}_{t}=\sigma((X_{s})_{0\leq s\leq t})$. It turns out that this technique can be used to solve Problem 3 as well, but now we need to consider conditional expectations of the form ${\mathsf{E}}[g(X_{t_{1}},\dots,X_{t_{N}})\,|\,\mathcal{F}_{t}]$. To simplify the notation, we will write $\bm{x}_{j}=(x_{1},\dots,x_{j})$, $\bm{X}_{j}=(X_{t_{1}},\dots,X_{t_{j}})$, and $\bm{X}^{u}_{j}=(X_{t_{1}}^{u},\dots,X_{t_{j}}^{u})$; when $j=0$, $\bm{x}_{j}$, $\bm{X}_{j}$, $\bm{X}^{u}_{j}$ all denote the empty vector.

Given a measurable function $g\colon{\mathbb{R}}^{d\times N}\rightarrow[0,\infty)$, the Markovian property of $X$ enables us to express the conditional expectation of $g(\bm{X}_{N})$ as

(4.2) ${\mathsf{E}}[g(\bm{X}_{N})\,|\,\mathcal{F}_{t}]=\sum_{j=0}^{N-1}\mathds{1}_{[t_{j},t_{j+1})}(t)\cdot h_{j}(X_{t},t;\bm{X}_{j}),$

where we set $t_{0}=0$, and the function $h_{j}$ is defined by

(4.3) $h_{j}(x,t;\bm{x}_{j})={\mathsf{E}}\left[g(\bm{x}_{j},X_{t_{j+1}},\dots,X_{t_{N}})\,\Big{|}\,X_{t}=x\right],$

for $(x,t)\in{\mathbb{R}}^{d}\times[t_{j},t_{j+1})$. Let $\mathcal{T}_{N}g$ denote the function on ${\mathbb{R}}^{d}\times[0,T)\times{\mathbb{R}}^{d\times N}$ given by

(4.4) $(\mathcal{T}_{N}g)(x,t,\bm{x}_{N})=\sum_{j=0}^{N-1}\mathds{1}_{[t_{j},t_{j+1})}(t)\cdot\sigma^{2}\nabla\log h_{j}(x,t;\bm{x}_{j}).$

We prove in Lemma C4 in Appendix that it suffices to consider controls in the set

$\mathcal{U}_{N}=\{u\in\mathcal{U}\colon u_{t}=(\mathcal{T}_{N}g)(X^{u}_{t},t,\bm{X}^{u}_{N})\text{ for some }g\geq 0\}.$

This result is a generalization of Lemma 3 and is obtained by applying Theorem B3 to each time interval $[t_{j},t_{j+1})$ separately. Note that although we express $u\in\mathcal{U}_{N}$ as a function of $\bm{X}^{u}_{N}$, by (4.4), $u_{t}$ is measurable with respect to $\sigma((X^{u}_{s})_{0\leq s\leq t})$. To formulate the Schrödinger system for the time series SSB problem, let $p_{N}(\bm{x}_{N}\,|\,y)$ denote the transition density from $X_{0}=y$ to $\bm{X}_{N}=\bm{x}_{N}$, which is given by

(4.5) $p_{N}(\bm{x}_{N}\,|\,y)=p(x_{1},t_{1}\,|\,y,0)\prod_{j=1}^{N-1}p(x_{j+1},t_{j+1}\,|\,x_{j},t_{j}).$
Theorem 6.

Consider Problem 3. Suppose there exist $\sigma$-finite measures $\nu_{0}$ on ${\mathbb{R}}^{d}$ and $\nu_{N}$ on ${\mathbb{R}}^{d\times N}$ such that $\nu_{0}\approx\mu_{0}$, $\nu_{N}\approx\mu_{N}$ and

(4.6) $\frac{\mathrm{d}\mu_{0}}{\mathrm{d}\nu_{0}}(y)=\int p_{N}(\bm{x}_{N}\,|\,y)\nu_{N}(\mathrm{d}\bm{x}_{N}),$
(4.7) $\frac{\mathrm{d}\mu_{N}}{\mathrm{d}\lambda}(\bm{x}_{N})=\rho_{N}(\bm{x}_{N})^{\frac{1+\beta}{\beta}}\int p_{N}(\bm{x}_{N}\,|\,y)\nu_{0}(\mathrm{d}y),$

where $\rho_{N}=\mathrm{d}\nu_{N}/\mathrm{d}\lambda$. Assume $\int(\mathrm{d}\mu_{0}/\mathrm{d}\nu_{0})\mathrm{d}\mu_{0}<\infty$. Then $u_{t}^{*}=(\mathcal{T}_{N}\rho_{N})(X^{u^{*}}_{t},t,\bm{X}^{u^{*}}_{N})$ solves Problem 3, where $\mathcal{T}_{N}$ is defined by (4.4). Moreover, $J_{\beta}^{N}(u^{*})=-\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0})\in[0,\infty)$.

Proof.

See Appendix C. ∎

Comparing the Schrödinger system in Theorem 4 and that in Theorem 6, we see that the solution to Problem 3 has essentially the same structure as that to Problem 2. The only difference is that in Theorem 4 the Schrödinger system is constructed by using the joint distribution of $(X_{0},X_{T})$, while in Theorem 6 it is replaced by the joint distribution of $(X_{0},\bm{X}_{N})$. We also note that Hamdouche et al. (2023) only considered the time series Schrödinger bridge problem with $\mu_{0}$ being a Dirac measure, and by letting $\beta\rightarrow\infty$, Theorem 6 gives the solution to their problem in the general case.

5. Experiments

5.1. Problem Setup

We consider an application of SSB to robust generative modeling in the following scenario. Let $\mathcal{D}_{\mathrm{ref}}$ denote a large collection of high-quality samples with distribution $\mu_{\mathrm{ref}}$, and let $\mathcal{D}_{\mathrm{obj}}$ be a small set of noisy samples with distribution $\mu_{\mathrm{obj}}$. Our objective is to generate realistic samples resembling those in $\mathcal{D}_{\mathrm{obj}}$, but due to the limited availability of training samples, we want to leverage information from $\mathcal{D}_{\mathrm{ref}}$ to enhance the sample quality.

A natural idea is to use SSB as a regularization method to mitigate overfitting to the noisy samples in $\mathcal{D}_{\mathrm{obj}}$. This can be implemented in two steps. For simplicity, we assume in this section that the uncontrolled process $X$ is given by $X_{t}=\sigma W_{t}$; that is, we assume $X_{0}=0$ and $b\equiv 0$ in (2.1). Then $X_{T}$ has density $\phi_{\sigma\sqrt{T}}$ (recall this is the density of the normal distribution with mean $0$ and covariance matrix $(\sigma^{2}T)I$). Let $f_{\mathrm{ref}}$ denote the density of $\mu_{\mathrm{ref}}$ with respect to the Lebesgue measure. Let $X^{\mathrm{ref}}=(X^{\mathrm{ref}}_{t})_{0\leq t\leq T}$ be the Schrödinger bridge targeting $\mu_{\mathrm{ref}}$ evolving by

(5.1) $\mathrm{d}X_{t}^{\mathrm{ref}}=b^{\mathrm{ref}}(X_{t}^{\mathrm{ref}},t)\mathrm{d}t+\sigma\mathrm{d}W_{t},$

for $t\in[0,T]$, where

$b^{\mathrm{ref}}(x,t)=\sigma^{2}\nabla\log{\mathsf{E}}\left[\frac{f_{\mathrm{ref}}(X_{T})}{\phi_{\sigma\sqrt{T}}(X_{T})}\,\Big{|}\,X_{t}=x\right].$

Theorem 1 implies that $X^{\mathrm{ref}}_{T}$ has distribution $\mu_{\mathrm{ref}}$. Next, we solve Problem 2 using $X^{\mathrm{ref}}$ as the reference process and $\mu_{\mathrm{obj}}$ as the target distribution. This yields the process $X^{\mathrm{obj}}$ with dynamics given by

(5.2) $\mathrm{d}X^{\mathrm{obj}}_{t}=b^{\mathrm{obj}}(X_{t}^{\mathrm{obj}},t)\mathrm{d}t+\sigma\mathrm{d}W_{t},$

for $t\in[0,T]$, where

(5.3) $b^{\mathrm{obj}}(x,t)=b^{\mathrm{ref}}(x,t)+\sigma^{2}\nabla\log{\mathsf{E}}\left[\left(\frac{f_{\mathrm{obj}}(X_{T}^{\mathrm{ref}})}{f_{\mathrm{ref}}(X_{T}^{\mathrm{ref}})}\right)^{\beta/(1+\beta)}\,\Big{|}\,X_{t}^{\mathrm{ref}}=x\right].$

By Theorem 2, the distribution of $X_{T}^{\mathrm{obj}}$ will be close to $\mu_{\mathrm{obj}}$ if $\beta$ is relatively large. It turns out that there is no need to train $X^{\mathrm{ref}}$ and $X^{\mathrm{obj}}$ separately. The following lemma shows that we can directly train a Schrödinger bridge targeting a geometric mixture of $\mu_{\mathrm{ref}}$ and $\mu_{\mathrm{obj}}$.

Lemma 7.

Let $X_{t}=\sigma W_{t}$ and let $X^{\mathrm{ref}}$ and $X^{\mathrm{obj}}$ be as given in (5.1) and (5.2). Equivalently, we can express the drift of $X^{\mathrm{obj}}$ by

(5.4) $b^{\mathrm{obj}}(x,t)=\sigma^{2}\nabla\log{\mathsf{E}}\left[\frac{f_{\mathrm{ref}}(X_{T})^{\frac{1}{1+\beta}}f_{\mathrm{obj}}(X_{T})^{\frac{\beta}{1+\beta}}}{\phi_{\sigma\sqrt{T}}(X_{T})}\,\Big{|}\,X_{t}=x\right].$
Proof.

See Appendix E. ∎

Remark 8.

Lemma 7 follows from a change-of-measure argument. It still holds if $X$ is not a Brownian motion but $X$ solves the SDE (2.1) with $X_{0}=x_{0}\in{\mathbb{R}}^{d}$ almost surely (see Appendix E).

The assumption that $\mu_{0}$ (the initial distribution of $X$) is a Dirac measure greatly simplifies the calculations and enables us to directly target the unnormalized density function $f_{\mathrm{ref}}^{1/(1+\beta)}f_{\mathrm{obj}}^{\beta/(1+\beta)}$. We will propose in the next subsection a score matching algorithm for learning this geometric mixture distribution. For applications where a general initial distribution is desirable, one may need to build iterative algorithms by borrowing ideas from the iterative proportional fitting procedure (De Bortoli et al., 2021).

Example 4.

For an illustrative toy example, let $\mu_{\mathrm{ref}}$ be a mixture of four bivariate normal distributions with means $(1,1),(1,-1),(-1,1),(-1,-1)$ respectively and the same covariance matrix $0.05^{2}I$. Let $\mu_{\mathrm{obj}}$ be a mixture of two normal distributions with means $(1.2,0.8)$, $(-1.5,-0.5)$ respectively and the same covariance matrix $0.5^{2}I$. So $\mu_{\mathrm{obj}}$ essentially contains two components of $\mu_{\mathrm{ref}}$ but with small bias and much larger noise. Since the density functions are known, we can directly simulate a Schrödinger bridge process targeting $f_{\mathrm{ref}}^{1/(1+\beta)}f_{\mathrm{obj}}^{\beta/(1+\beta)}$ using the method described in Remark 3; a sketch of this simulation is given below. We provide the results in Appendix A.3, which show that by targeting a geometric mixture with a moderate value of $\beta$, we can effectively compel the terminal distribution of the controlled process to acquire a covariance structure similar to that of $\mu_{\mathrm{ref}}$.
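A minimal Python sketch of this simulation, reusing the drift_mc and euler_maruyama routines from the sketch in Remark 3; the mixture weights are assumed to be equal, and the constants below (e.g., $\sigma=T=1$, $\beta=2$) are illustrative choices of ours rather than the settings used in Appendix A.3.

import numpy as np
from scipy.stats import multivariate_normal as mvn

sigma, T, beta = 1.0, 1.0, 2.0

def mixture_pdf(z, means, sd):
    # equally weighted bivariate normal mixture with common covariance sd^2 * I
    return np.mean([mvn.pdf(z, mean=m, cov=sd ** 2 * np.eye(2)) for m in means], axis=0)

def f_ref(z):   # four tight reference components
    return mixture_pdf(z, [(1, 1), (1, -1), (-1, 1), (-1, -1)], 0.05)

def f_obj(z):   # two biased, noisier components
    return mixture_pdf(z, [(1.2, 0.8), (-1.5, -0.5)], 0.5)

def g(z):       # unnormalized geometric mixture divided by the Brownian terminal density (x0 = 0)
    return (f_ref(z) ** (1 / (1 + beta)) * f_obj(z) ** (beta / (1 + beta))
            / mvn.pdf(z, mean=np.zeros(2), cov=sigma ** 2 * T * np.eye(2)))

# one controlled trajectory started at x0 = 0
sample = euler_maruyama(np.zeros(2), T, sigma, g)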

5.2. A Score Matching Algorithm for Learning Geometric Mixtures

Let $\mu^{*}$ denote a probability distribution with density function $f^{*}(x)=C^{-1}f_{\mathrm{ref}}(x)^{1/(1+\beta)}f_{\mathrm{obj}}(x)^{\beta/(1+\beta)}$, where $C$ is the normalizing constant. To generate samples from $\mu^{*}$, one can use existing score-based diffusion model methods, but, as we will see shortly, one major challenge is how to train the score functions without samples from the distribution $\mu^{*}$.

Let $\mu^{*}_{\sigma}$ be the probability distribution with density

(5.5) $f^{*}_{\sigma}(x)=\int f^{*}(y)\phi_{\sigma}(x-y)\mathrm{d}y,$

which can be thought of as a smoothed version of $f^{*}$, and suppose that we can generate samples from the distribution $\mu^{*}_{\sigma}$. Solving the Schrödinger bridge problem with initial distribution $\mu^{*}_{\sigma}$ and terminal distribution $\mu^{*}$, we obtain the controlled process $X^{*}$ with $\mathcal{L}aw(X^{*}_{0})=\mu^{*}_{\sigma}$ and dynamics given by

(5.6) $\mathrm{d}X^{*}_{t}=b^{*}(X_{t}^{*},t)\mathrm{d}t+\sigma\mathrm{d}W_{t},\quad\text{where }b^{*}(x,t)=\sigma^{2}\nabla\log f^{*}_{\sigma\sqrt{T-t}}(x).$

The process $X^{*}$ satisfies $\mathcal{L}aw(X^{*}_{T})=\mu^{*}$ (note that this result is a special case of Example 3 with $\beta=\infty$).

We now describe how to simulate the dynamics given in (5.6) and generate samples from $\mu^{*}_{\sigma}$. First, to learn the drift function in (5.6), we propose to combine the score matching technique with importance sampling. Let $s_{\theta}(x,\tilde{\sigma})$ denote our approximation to $\nabla\log f^{*}_{\tilde{\sigma}}(x)$, where $\tilde{\sigma}\in[0,\sigma\sqrt{T}]$ and the unknown parameter $\theta$ typically denotes a neural network. According to the well-known score matching technique (Hyvärinen, 2005; Vincent, 2011), we can estimate $\theta$ for a given $\tilde{\sigma}$ by minimizing the objective function ${\mathsf{E}}_{x\sim\mu^{*}}[L(x,\theta;\tilde{\sigma})]$, where

$L(x,\theta;\tilde{\sigma})={\mathsf{E}}_{\tilde{x}\sim\mathcal{N}(x,\tilde{\sigma}^{2}I)}\left[\left\|s_{\theta}(\tilde{x},\tilde{\sigma})-\nabla_{\tilde{x}}\log f^{*}_{\tilde{\sigma}}(\tilde{x}\,|\,x)\right\|^{2}\right]={\mathsf{E}}_{\tilde{x}\sim\mathcal{N}(x,\tilde{\sigma}^{2}I)}\left[\left\|s_{\theta}(\tilde{x},\tilde{\sigma})+\frac{\tilde{x}-x}{\tilde{\sigma}^{2}}\right\|^{2}\right].$

Unfortunately, existing score matching methods estimate ${\mathsf{E}}_{x\sim\mu^{*}}[L(x,\theta;\tilde{\sigma})]$ by using samples from $\mu^{*}$, which is not feasible in our problem since we only have access to samples from $\mu_{\mathrm{ref}}$ and $\mu_{\mathrm{obj}}$ but not from $\mu^{*}$. We propose to use importance sampling to tackle this issue. Let $\tilde{\mu}$ denote an auxiliary distribution with density $q=\mathrm{d}\tilde{\mu}/\mathrm{d}\lambda$, and assume that we can generate samples from $\tilde{\mu}$ (e.g., we can let $\tilde{\mu}=\mu_{\mathrm{ref}},\mu_{\mathrm{obj}}$, or smoothed versions of them). By a change of measure, we can express ${\mathsf{E}}_{x\sim\mu^{*}}[L(x,\theta;\tilde{\sigma})]$ as

(5.7) $C^{-1}\,{\mathsf{E}}_{x\sim\tilde{\mu}}\left[L(x,\theta;\tilde{\sigma})\left(\frac{f_{\mathrm{ref}}(x)}{q(x)}\right)^{\frac{1}{1+\beta}}\left(\frac{f_{\mathrm{obj}}(x)}{q(x)}\right)^{\frac{\beta}{1+\beta}}\right].$

The density ratios $f_{\mathrm{ref}}/q$ and $f_{\mathrm{obj}}/q$ can be learned by using samples from $\mu_{\mathrm{ref}},\mu_{\mathrm{obj}},\tilde{\mu}$ and minimizing the logistic regression loss as in Wang et al. (2021). We do not need to know $C$ since it does not affect the drift term in (5.6). In particular, when $\tilde{\mu}=\mu_{\mathrm{ref}}$, we get

${\mathsf{E}}_{x\sim\mu^{*}}[L(x,\theta;\tilde{\sigma})]=C^{-1}\,{\mathsf{E}}_{x\sim\mu_{\mathrm{ref}}}\left[L(x,\theta;\tilde{\sigma})\left(\frac{f_{\mathrm{obj}}(x)}{f_{\mathrm{ref}}(x)}\right)^{\frac{\beta}{1+\beta}}\right],$

where the ratio $f_{\mathrm{obj}}/f_{\mathrm{ref}}$ can be learned by using samples from $\mathcal{D}_{\mathrm{obj}}$, $\mathcal{D}_{\mathrm{ref}}$. A similar expression can be easily derived for $\tilde{\mu}=\mu_{\mathrm{obj}}$. This importance sampling method enables us to estimate the expectation ${\mathsf{E}}_{x\sim\mu^{*}}[L(x,\theta;\tilde{\sigma})]$ by using samples from $\mathcal{D}_{\mathrm{obj}},\mathcal{D}_{\mathrm{ref}}$. By averaging over $\tilde{\sigma}$ randomly drawn from the interval $[0,\sigma\sqrt{T}]$, we obtain an estimated loss for the parameter $\theta$. Minimizing this loss we get the estimate $\hat{\theta}$, and we can approximate $\nabla\log f^{*}_{\tilde{\sigma}}(x)$ using $s_{\hat{\theta}}(x,\tilde{\sigma})$.
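To illustrate, here is a minimal Python (PyTorch) sketch of the importance-weighted denoising score matching loss for the case $\tilde{\mu}=\mu_{\mathrm{ref}}$. The names score_net and log_ratio are hypothetical: score_net stands for the network $s_{\theta}$, and log_ratio for an estimate of $\log(f_{\mathrm{obj}}/f_{\mathrm{ref}})$ obtained, e.g., from the logistic regression step mentioned above.

import torch

def iw_dsm_loss(score_net, x_ref, log_ratio, beta, sigma_max):
    # x_ref: minibatch from D_ref of shape (B, d); log_ratio: log(f_obj/f_ref) at x_ref, shape (B,).
    w = torch.exp(log_ratio * beta / (1.0 + beta))                  # weights (f_obj/f_ref)^{beta/(1+beta)}
    sig = sigma_max * torch.rand(x_ref.shape[0], 1).clamp_min(1e-3)  # random noise level sigma_tilde
    noise = torch.randn_like(x_ref)
    x_tilde = x_ref + sig * noise                                    # x_tilde ~ N(x, sigma_tilde^2 I)
    target = -noise / sig                                            # = -(x_tilde - x) / sigma_tilde^2
    pred = score_net(x_tilde, sig)                                   # approximates grad log f*_{sigma_tilde}
    per_sample = ((pred - target) ** 2).sum(dim=1)
    return (w * per_sample).mean()                                   # unknown C does not affect the minimizer

Averaging this loss over minibatches and noise levels and minimizing over the network parameters gives the estimate $\hat{\theta}$ described above.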

Simulating the SDE (5.6) requires us to generate samples from $\mu^{*}_{\sigma}$. There are a few possible approaches. First, if $\sigma$ is chosen to be sufficiently large, one may argue that we can simply approximate $f^{*}_{\sigma}$ using the normal density $\phi_{\sigma}$, and thus we only need to draw $X_{0}$ from the normal distribution. This is the approach taken in the score-based generative models based on backward SDEs (Song and Ermon, 2019). Second, one can run an additional Schrödinger bridge process with terminal distribution $\mu^{*}_{\sigma}$, as proposed in the two-stage Schrödinger bridge algorithm of Wang et al. (2021). The dynamics they considered is simply the solution to Problem 1 given in Theorem 1, where the uncontrolled process is a Brownian motion started at $0$. The drift term is approximated by a Monte Carlo scheme (see Remark 3 and Appendix A.1), which also requires an estimate of the density ratio $f^{*}_{\sigma}/\phi_{\sigma}$ and the score of $f^{*}_{\sigma}$. Third, one can also run the Langevin diffusion $\mathrm{d}\tilde{X}_{t}=\frac{\sigma^{2}}{2}\nabla\log f^{*}_{\sigma}(\tilde{X}_{t})\mathrm{d}t+\sigma\mathrm{d}W_{t}$, which has stationary distribution $\mu^{*}_{\sigma}$, for a sufficiently long duration. We found in our experiments that this approach yields more robust results, probably because it does not require Monte Carlo sampling, which may yield drift estimates with large variances.
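A minimal Python sketch of the third approach (an unadjusted Langevin algorithm); the step size, number of steps, and Gaussian initialization are arbitrary illustrative choices, and score_sigma is assumed to approximate $\nabla\log f^{*}_{\sigma}$, e.g., $s_{\hat{\theta}}(\cdot,\sigma)$.

import torch

def langevin_samples(score_sigma, n_samples, dim, sigma, n_steps=2000, step=1e-3):
    # Euler discretization of dX = (sigma^2 / 2) grad log f*_sigma(X) dt + sigma dW,
    # whose stationary distribution is mu*_sigma.
    x = sigma * torch.randn(n_samples, dim)              # rough Gaussian initialization
    for _ in range(n_steps):
        x = x + 0.5 * step * sigma ** 2 * score_sigma(x) \
              + sigma * (step ** 0.5) * torch.randn_like(x)
    return x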

5.3. An Example Using MNIST

We present a numerical example using the MNIST data set (Deng, 2012), which consists of images of handwritten digits from 0 to 9. All images have size $28\times 28$ and pixels are rescaled to $[0,1]$. To construct $\mathcal{D}_{\mathrm{obj}}$, we randomly select 50 images labeled as eight and reduce their quality by adding Gaussian noise with mean 0 and variance $0.4$; see Fig. F4 in Appendix F. The reference data set $\mathcal{D}_{\mathrm{ref}}$ includes all images that are not labeled as eight, a total of 54,149 samples.

We first run the algorithm of Wang et al. (2021) using only $\mathcal{D}_{\mathrm{obj}}$, and the generated samples are noisy as expected; see Fig. F5 in Appendix F. Next, we run our algorithm described in Section 5.2 with different choices of $\beta$. The density ratio and scores are trained using one GPU (RTX 6000). Generated samples are shown in Fig. 1, and we report in Table 1 the Fréchet Inception Distance (FID) score (Heusel et al., 2017) assessing the disparity between our generated images (sample size = 40K) and the collection of clean digit 8 images from the MNIST data set (sample size $\approx$ 6K). When $\beta$ is too small, the generated images do not resemble those in $\mathcal{D}_{\mathrm{obj}}$, and we frequently observe the influence of other digits; if $\beta=0$, the images are essentially generated from $\mu_{\mathrm{ref}}$. When $\beta$ is too large, the influence from the reference data set becomes negligible, but the algorithm tends to overfit to the noisy data in $\mathcal{D}_{\mathrm{obj}}$. For moderate values of $\beta$, we observe a blend of characteristics from both data sets, and when $\beta=1.5$, the FID is minimized (among all tried values) and we get high-quality images of digit 8. This experiment illustrates that the information from $\mathcal{D}_{\mathrm{obj}}$ can help capture the structural features associated with the digit 8, while the samples from $\mathcal{D}_{\mathrm{ref}}$ can guide the algorithm towards effective noise removal. In Appendix F, we further analyze the generated images using t-SNE plots and inception scores (Salimans et al., 2016).

Figure 1. SSB Samples for MNIST Experiment
Table 1. FID Scores for MNIST Experiment
$\beta$ 0 0.25 0.7 1.5 4 100
FID 67.4 66.4 61.0 56.3 110.9 182.4

6. Concluding Remarks

We propose the soft-constrained Schrödinger bridge (SSB) problem and find its solution. Our theory encompasses the existing stochastic control results for Schrödinger bridge in Dai Pra (1991) and Hamdouche et al. (2023) as special cases. The paper focuses on the theory of SSB, and the numerical examples are designed to be uncomplicated but illustrative. More advanced algorithms for solving Problems 2 and 3 in full generality need to be developed. It will also be interesting to study the applications of SSB to other generative modeling tasks, such as conditional generation, style transfer (Shi et al., 2022; Shi and Wu, 2023; Su et al., 2022) and time series data generation (Hamdouche et al., 2023). Some further generalization of the objective function may be considered as well; for example, one can add a time-dependent cost as in Pra and Pavon (1990) or consider a more general form of the terminal cost.

7. Acknowledgements

The authors would like to thank Tiziano De Angelis for valuable discussion on the problem formulation, Yun Yang for the helpful conversation on the numerics, and anonymous reviewers for their comments and suggestions. JG and QZ were supported in part by NSF grants DMS-2311307 and DMS-2245591. XZ acknowledges the support from NSF DMS-2113359. The numerical experiments were conducted with the computing resources provided by Texas A&M High Performance Research Computing.

References

  • Berner et al. [2022] Julius Berner, Lorenz Richter, and Karen Ullrich. An optimal control perspective on diffusion-based generative modeling. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.
  • Beurling [1960] Arne Beurling. An automorphism of product measures. Annals of Mathematics, pages 189–200, 1960.
  • Blaquiere [1992] A Blaquiere. Controllability of a Fokker-Planck equation, the Schrödinger system, and a related stochastic optimal control (revised version). Dynamics and Control, 2(3):235–253, 1992.
  • Bushell [1973] Peter J Bushell. Hilbert’s metric and positive contraction mappings in a Banach space. Archive for Rational Mechanics and Analysis, 52:330–338, 1973.
  • Chen et al. [2021a] Tianrong Chen, Guan-Horng Liu, and Evangelos Theodorou. Likelihood training of Schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations, 2021a.
  • Chen et al. [2016a] Yongxin Chen, Tryphon Georgiou, and Michele Pavon. Entropic and displacement interpolation: a computational approach using the Hilbert metric. SIAM Journal on Applied Mathematics, 76(6):2375–2396, 2016a.
  • Chen et al. [2016b] Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. On the relation between optimal transport and Schrödinger bridges: A stochastic control viewpoint. Journal of Optimization Theory and Applications, 169:671–691, 2016b.
  • Chen et al. [2019] Yongxin Chen, Tryphon T Georgiou, Michele Pavon, and Allen Tannenbaum. Relaxed Schrödinger bridges and robust network routing. IEEE Transactions on Control of Network Systems, 7(2):923–931, 2019.
  • Chen et al. [2021b] Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. Optimal transport in systems and control. Annual Review of Control, Robotics, and Autonomous Systems, 4:89–113, 2021b.
  • Chen et al. [2021c] Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. Stochastic control liaisons: Richard Sinkhorn meets Gaspard Monge on a Schrodinger bridge. SIAM Review, 63(2):249–313, 2021c.
  • Dai Pra [1991] Paolo Dai Pra. A stochastic control approach to reciprocal diffusion processes. Applied Mathematics and Optimization, 23(1):313–329, 1991.
  • De Bortoli et al. [2021] Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.
  • Deligiannidis et al. [2024] George Deligiannidis, Valentin De Bortoli, and Arnaud Doucet. Quantitative uniform stability of the iterative proportional fitting procedure. The Annals of Applied Probability, 34(1A):501–516, 2024.
  • Deming and Stephan [1940] W Edwards Deming and Frederick F Stephan. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. The Annals of Mathematical Statistics, 11(4):427–444, 1940.
  • Deng [2012] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • Doob [1959] J. L. Doob. A Markov chain theorem. Probability and Statistics.(H. Cramer memorial volume) ed. U. Grenander. Almquist and Wiksel, Stockholm & New York, pages 50–57, 1959.
  • Doob [1984] J. L. Doob. Classical potential theory and its probabilistic counterpart, volume 262 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, New York, 1984. ISBN 0-387-90881-1. doi: 10.1007/978-1-4612-5208-5.
  • Essid and Pavon [2019] Montacer Essid and Michele Pavon. Traversing the Schrödinger bridge strait: Robert Fortet’s marvelous proof redux. Journal of Optimization Theory and Applications, 181(1):23–60, 2019.
  • Fleming [1977] Wendell H Fleming. Exit probabilities and optimal stochastic control. Applied Mathematics and Optimization, 4:329–346, 1977.
  • Fleming [2005] Wendell H Fleming. Logarithmic transformations and stochastic control. In Advances in Filtering and Optimal Stochastic Control: Proceedings of the IFIP-WG 7/1 Working Conference Cocoyoc, Mexico, February 1–6, 1982, pages 131–141. Springer, 2005.
  • Fleming and Rishel [2012] Wendell H Fleming and Raymond W Rishel. Deterministic and stochastic optimal control, volume 1. Springer Science & Business Media, 2012.
  • Fleming and Sheu [1985] Wendell H Fleming and Sheunn-Jyi Sheu. Stochastic variational formula for fundamental solutions of parabolic pde. Applied Mathematics and Optimization, 13:193–204, 1985.
  • Föllmer [1988] H Föllmer. Random fields and diffusion processes. Ecole d’Ete de Probabilites de Saint-Flour XV-XVII, 1985-87, 1988.
  • Fortet [1940] Robert Fortet. Résolution d’un système d’équations de m. Schrödinger. Journal de Mathématiques Pures et Appliquées, 19(1-4):83–105, 1940.
  • Friedman [1975] Avner Friedman. Stochastic differential equations and applications. In Stochastic differential equations, pages 75–148. Springer, 1975.
  • Georgiou and Pavon [2015] Tryphon T Georgiou and Michele Pavon. Positive contraction mappings for classical and quantum Schrödinger systems. Journal of Mathematical Physics, 56(3), 2015.
  • Hamdouche et al. [2023] Mohamed Hamdouche, Pierre Henry-Labordere, and Huyên Pham. Generative modeling for time series via Schrödinger bridge. arXiv preprint arXiv:2304.05093, 04 2023. doi: 10.13140/RG.2.2.25758.00324.
  • Heng et al. [2024] Jeremy Heng, Valentin De Bortoli, and Arnaud Doucet. Diffusion Schrödinger bridges for Bayesian computation. Statistical Science, 39(1):90–99, 2024.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Huang et al. [2021] Jian Huang, Yuling Jiao, Lican Kang, Xu Liao, Jin Liu, and Yanyan Liu. Schrödinger-Föllmer sampler: sampling without ergodicity. arXiv preprint arXiv:2106.10880, 2021.
  • Hyvärinen [2005] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4):695–709, 2005.
  • Jamison [1975] Benton Jamison. The Markov processes of Schrödinger. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 32(4):323–331, 1975.
  • Karatzas and Shreve [2012] Ioannis Karatzas and Steven Shreve. Brownian motion and stochastic calculus, volume 113. Springer Science & Business Media, 2012.
  • Léonard [2013] Christian Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport, 2013.
  • Mikami [1990] Toshio Mikami. Variational processes from the weak forward equation. Communications in Mathematical Physics, 135:19–40, 1990.
  • Moon et al. [2022] Taehong Moon, Moonseok Choi, Gayoung Lee, Jung-Woo Ha, and Juho Lee. Fine-tuning diffusion models with limited data. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.
  • Pavon and Wakolbinger [1991] Michele Pavon and Anton Wakolbinger. On free energy, stochastic control, and Schrödinger processes. In Modeling, Estimation and Control of Systems with Uncertainty: Proceedings of a Conference held in Sopron, Hungary, September 1990, pages 334–348. Springer, 1991.
  • Peluchetti [2023] Stefano Peluchetti. Diffusion bridge mixture transports, Schrödinger bridge problems and generative modeling. arXiv preprint arXiv:2304.00917, 2023.
  • Pra and Pavon [1990] Paolo Dai Pra and Michele Pavon. On the Markov processes of Schrödinger, the Feynman-Kac formula and stochastic control. In Realization and Modelling in System Theory: Proceedings of the International Symposium MTNS-89, Volume I, pages 497–504. Springer, 1990.
  • Richter et al. [2023] Lorenz Richter, Julius Berner, and Guan-Horng Liu. Improved sampling via learned diffusions. arXiv preprint arXiv:2307.01198, 2023.
  • Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.
  • Schrödinger [1931] Erwin Schrödinger. Über die Umkehrung der Naturgesetze. Verlag der Akademie der Wissenschaften in Kommission bei Walter De Gruyter u. Company, 1931.
  • Schrödinger [1932] Erwin Schrödinger. Sur la théorie relativiste de l’électron et l’interprétation de la mécanique quantique. In Annales de l’institut Henri Poincaré, volume 2, pages 269–310, 1932.
  • Shi et al. [2022] Yuyang Shi, Valentin De Bortoli, George Deligiannidis, and Arnaud Doucet. Conditional simulation using diffusion Schrödinger bridges. In Uncertainty in Artificial Intelligence, pages 1792–1802. PMLR, 2022.
  • Shi and Wu [2023] Ziqiang Shi and Shoule Wu. Schröwave: Realistic voice generation by solving two-stage conditional Schrödinger bridge problems. Digital Signal Processing, 141:104175, 2023.
  • Song [2022] Ki-Ung Song. Applying regularized Schrödinger-bridge-based stochastic process in generative modeling. arXiv preprint arXiv:2208.07131, 2022.
  • Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11895–11907, 2019.
  • Song et al. [2021] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021.
  • Su et al. [2022] Xuan Su, Jiaming Song, Chenlin Meng, and Stefano Ermon. Dual diffusion implicit bridges for image-to-image translation. arXiv preprint arXiv:2203.08382, 2022.
  • Tzen and Raginsky [2019] Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference on Learning Theory, pages 3084–3114. PMLR, 2019.
  • Vargas et al. [2021] Francisco Vargas, Pierre Thodoroff, Austen Lamacraft, and Neil Lawrence. Solving Schrödinger bridges via maximum likelihood. Entropy, 23(9):1134, 2021.
  • Vargas et al. [2022] Francisco Vargas, Will Sussman Grathwohl, and Arnaud Doucet. Denoising diffusion samplers. In The Eleventh International Conference on Learning Representations, 2022.
  • Vargas et al. [2024] Francisco Vargas, Shreyas Padhy, Denis Blessing, and Nikolas Nüsken. Transport meets variational inference: Controlled Monte Carlo diffusions. In The Twelfth International Conference on Learning Representations, 2024.
  • Vincent [2011] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
  • Wang et al. [2021] Gefei Wang, Yuling Jiao, Qian Xu, Yang Wang, and Can Yang. Deep generative learning via Schrödinger bridge. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10794–10804. PMLR, 18–24 Jul 2021.
  • Winkler et al. [2023] Ludwig Winkler, Cesar Ojeda, and Manfred Opper. A score-based approach for training Schrödinger bridges for data modelling. Entropy, 25(2):316, 2023.
  • Zhang and Chen [2021] Qinsheng Zhang and Yongxin Chen. Path integral sampler: A stochastic control approach for sampling. In International Conference on Learning Representations, 2021.

APPENDIX

The code for all experiments (Cauchy simulation in Appendix A.2, normal mixture simulation in Appendix A.3, and MNIST example in Section 5.3) is at the GitHub repository https://github.com/gargjhanvi/Soft-constrained-Schrodinger-Bridge-a-Stochastic-Control-Approach.

A. Simulation with Known Density Functions

A.1. Monte Carlo Simulation with Densities Known up to a Normalization Constant

Let the uncontrolled process XX be a Brownian motion with b0b\equiv 0 and X0=0X_{0}=0, and set T=1T=1. Let ϕσ\phi_{\sigma} denote the density of the normal distribution with mean zero and covariance matrix σ2I\sigma^{2}I. Given a target distribution with density fTf_{T} (which we simply denote by ff in this section), the solution to SSB is given by

(A.1) dXt=utdt+σdWt,\mathrm{d}X^{*}_{t}=u^{*}_{t}\mathrm{d}t+\sigma\mathrm{d}W_{t},

where the drift uu^{*} is determined by u(x,t)=logh(x,t)u^{*}(x,t)=\nabla\log h(x,t) and

h(x,t)(1t)d/2ϕσ(zx1t)(f(z)ϕσ(z))β/(1+β)dz.\displaystyle h(x,t)\propto(1-t)^{-d/2}\int\phi_{\sigma}\left(\frac{z-x}{\sqrt{1-t}}\right)\cdot\left(\frac{f(z)}{\phi_{\sigma}(z)}\right)^{\beta/(1+\beta)}\mathrm{d}z.

Note that to determine uu^{*}, we only need to know ff up to a normalization constant. Since logh=h1h\nabla\log h=h^{-1}\nabla h, we can rewrite uu^{*} as

(A.2) u(x,t)=𝖤zϕσ[r(x+1tz)logr(x+1tz)]𝖤zϕσ[r(x+1tz)]u^{*}(x,t)=\frac{{\mathsf{E}}_{z\sim\phi_{\sigma}}[r({x}+\sqrt{1-t}{z})\nabla\log r({x}+\sqrt{1-t}{z})]}{{\mathsf{E}}_{{z}\sim\phi_{\sigma}}[r({x}+\sqrt{1-t}{z})]}

where we set

r(x)=(f(x)ϕσ(x))β/(1+β).r(x)=\left(\frac{f(x)}{\phi_{\sigma}(x)}\right)^{\beta/(1+\beta)}.

We can then approximate the numerator and denominator in (A.2) separately using Monte Carlo samples.
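
To make this concrete, the following is a minimal NumPy sketch of the plain Monte Carlo approximation of (A.2) in one dimension. It is illustrative only (the function names and defaults are ours, not those of the released code), and it assumes the target log-density and its gradient are available up to an additive constant.

import numpy as np

def drift_mc(x, t, log_f, grad_log_f, beta, sigma=1.0, n_mc=200, rng=None):
    # Plain Monte Carlo estimate of the SSB drift u*(x, t) in (A.2), for d = 1.
    # log_f, grad_log_f: target log-density (up to an additive constant) and its gradient.
    rng = np.random.default_rng() if rng is None else rng
    z = sigma * rng.standard_normal(n_mc)          # z drawn from phi_sigma
    y = x + np.sqrt(1.0 - t) * z
    kappa = 1.0 if np.isinf(beta) else beta / (1.0 + beta)
    # log r(y) = kappa * (log f(y) - log phi_sigma(y)), up to a constant that cancels below
    log_r = kappa * (log_f(y) + 0.5 * y ** 2 / sigma ** 2)
    grad_log_r = kappa * (grad_log_f(y) + y / sigma ** 2)
    w = np.exp(log_r - log_r.max())                # self-normalized weights
    return np.sum(w * grad_log_r) / np.sum(w)

For instance, for the standard Cauchy target of Appendix A.2 one could pass log_f = lambda y: -np.log1p(y ** 2) and grad_log_f = lambda y: -2 * y / (1 + y ** 2).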

For some target distributions, this approach can be made more efficient by using importance sampling. When ff has a heavy tail (e.g., the Cauchy distribution), r(x)r(x) may grow super-exponentially with xx, and Monte Carlo estimates for the numerator and denominator in (A.2) with zz drawn from the normal distribution may have large variances. Observe that the numerator can be written as

𝖤zϕσ[r(x+1tz)logr(x+1tz)]\displaystyle{\mathsf{E}}_{z\sim\phi_{\sigma}}[r({x}+\sqrt{1-t}{z})\nabla\log r({x}+\sqrt{1-t}{z})]
=\displaystyle=\; ϕσ(z)r(x+1tz)logr(x+1tz)dz.\displaystyle\int\phi_{\sigma}(z)r({x}+\sqrt{1-t}{z})\nabla\log r({x}+\sqrt{1-t}{z})\mathrm{d}z.

The term logr(x+1tz)\nabla\log r({x}+\sqrt{1-t}{z}) is typically polynomial in zz (that is, it does not grow too fast). Hence, intuitively, the integral is likely to be well approximated by a Monte Carlo estimate with zz drawn from a density proportional to ϕσ(z)r(x+1tz)\phi_{\sigma}(z)r({x}+\sqrt{1-t}{z}). Such a density may not be easily accessible, but if one knows the tail decay rate of ff, one can try to find a proposal distribution for zz whose tails are not lighter than those of ϕσ(z)r(x+1tz)\phi_{\sigma}(z)r({x}+\sqrt{1-t}{z}). In the experiment given below, we propose zz from a distribution with the same tail decay rate as fβ/(1+β)ϕσ1/(1+β)f^{\beta/(1+\beta)}\phi_{\sigma}^{1/(1+\beta)}. Letting ww denote our proposal density, we express the numerator in (A.2) as

(A.3) 𝖤zϕσ[r(x+1tz)logr(x+1tz)]\displaystyle{\mathsf{E}}_{z\sim\phi_{\sigma}}[r({x}+\sqrt{1-t}{z})\nabla\log r({x}+\sqrt{1-t}{z})]
=\displaystyle= 𝖤zw[ϕσ(z)r(x+1tz)w(z)logr(x+1tz)],\displaystyle{\mathsf{E}}_{z\sim w}\left[\frac{\phi_{\sigma}(z)r({x}+\sqrt{1-t}{z})}{w(z)}\nabla\log r({x}+\sqrt{1-t}{z})\right],

and estimate the right-hand side using a Monte Carlo average. Similarly, the denominator can be expressed by

(A.4) 𝖤zϕσ[r(x+1tz)]=𝖤zw[ϕσ(z)r(x+1tz)w(z)]{\mathsf{E}}_{{z}\sim\phi_{\sigma}}[r({x}+\sqrt{1-t}{z})]={\mathsf{E}}_{z\sim w}\left[\frac{\phi_{\sigma}(z)r({x}+\sqrt{1-t}{z})}{w(z)}\right]

and estimated by a Monte Carlo average.
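
Continuing the sketch above, the importance-sampled estimator (A.3)–(A.4) can be written as follows; sample_w and log_w (the proposal sampler and its log-density, up to a constant) are hypothetical user-supplied inputs, and the snippet is illustrative only.

def drift_is(x, t, log_f, grad_log_f, beta, sample_w, log_w,
             sigma=1.0, n_mc=200, rng=None):
    # Importance-sampled version of the drift estimate, following (A.3) and (A.4), d = 1.
    # sample_w(n, rng) draws n proposals z ~ w; log_w(z) is log w(z) up to a constant.
    rng = np.random.default_rng() if rng is None else rng
    z = sample_w(n_mc, rng)
    y = x + np.sqrt(1.0 - t) * z
    kappa = 1.0 if np.isinf(beta) else beta / (1.0 + beta)
    log_r = kappa * (log_f(y) + 0.5 * y ** 2 / sigma ** 2)
    grad_log_r = kappa * (grad_log_f(y) + y / sigma ** 2)
    # importance weights phi_sigma(z) r(y) / w(z), kept in log-space for stability
    log_wt = -0.5 * z ** 2 / sigma ** 2 + log_r - log_w(z)
    wt = np.exp(log_wt - log_wt.max())
    return np.sum(wt * grad_log_r) / np.sum(wt)

For finite β, the proposal used in Appendix A.2 below corresponds to sample_w = lambda n, rng: np.sqrt(1 + beta) * rng.standard_normal(n) and log_w = lambda z: -z ** 2 / (2 * (1 + beta)).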

A.2. Simulating the Cauchy Distribution

Fix b0b\equiv 0, σ=1\sigma=1, X0=0X_{0}=0 and T=1T=1. Let ff be the density of the standard Cauchy distribution, i.e., f(x)=π1(1+x2)1f(x)=\pi^{-1}(1+x^{2})^{-1}. Theorem 2 shows that the solution to SSB is a Schrödinger bridge such that the terminal distribution has density

(A.5) fβ(x)=Cβ1ϕ(x)1/(1+β)f(x)β/(1+β),f_{\beta}^{*}(x)=C_{\beta}^{-1}\phi(x)^{1/(1+\beta)}f(x)^{\beta/(1+\beta)},

where CβC_{\beta} is the normalization constant and ϕ\phi is the density of the standard normal distribution. We plot fβf_{\beta}^{*} and logfβ\log f_{\beta}^{*} in Figure A1 for β=,100,50,20\beta=\infty,100,50,20. For xx close to zero, the density fβ(x)f_{\beta}^{*}(x) is approximately the same for all four choices of β\beta. When β=\beta=\infty, fβf_{\beta}^{*} is the Cauchy density, which has heavy tails. But for any β<\beta<\infty, the tail decay rate of fβf_{\beta}^{*} is dominated by the Gaussian component.

Figure A1. Densities of Geometric Mixtures of the Normal and Cauchy Distributions.

We simulate the solution to SSB given in (A.1) over the time interval [0,1][0,1] using the Euler-Maruyama method with 200200 time steps. The drift is approximated by the importance sampling scheme described in Appendix A.1. When β=\beta=\infty (i.e., the terminal distribution is Cauchy), we let the proposal distribution ww in (A.3) and (A.4) be the tt-distribution with 22 degrees of freedom (we have also tried directly proposing zz from the standard Cauchy distribution and obtained very similar results). When β<\beta<\infty, we let ww be the normal distribution with mean 0 and variance 1+β1+\beta. We simulate the SSB process 10,000 times, and in Table A1 we report the number of failed runs; these failures happen because Monte Carlo estimates for the numerator and denominator in (A.2) become unstable when |x||x| is large, resulting in numerical overflow. When β=\beta=\infty, we observe that numerical overflow is still common even if we use 1,0001,000 Monte Carlo samples for each estimate. In contrast, when we use β100\beta\leq 100 and only 200200 Monte Carlo samples, the algorithm becomes very stable. In Figure A2, we compare the distribution of generated samples (i.e., the distribution of XTX^{*}_{T} with T=1T=1) with their corresponding target distributions. The first panel compares the distribution of the samples generated with β=\beta=\infty (failed runs ignored) with the Cauchy distribution, and the second compares the distribution of the samples generated with β=100\beta=100 with the geometric mixture given in (A.5). It is clear that simulating the Schrödinger bridge process (i.e., using β=\beta=\infty) cannot recover the heavy tails of the Cauchy distribution, but the numerical simulation of SSB with β=100\beta=100 accurately yields samples from the geometric mixture distribution. Recall that the KL divergence between the standard Cauchy and normal distributions is infinite, which, by Theorem 1, means that there is no control with finite energy cost that can steer a standard Brownian motion towards the Cauchy distribution at time T=1T=1. Our experiment partially illustrates the practical consequences of this fact in numerically simulating the Schrödinger bridge, and it also suggests that SSB may be a numerically more robust alternative.

Table A1. Number of Failed Attempts in 10,000 Trials. NmcN_{\rm{mc}} denotes the number of Monte Carlo samples used to approximate the expectations defined in (A.3) and (A.4).
NmcN_{\rm{mc}} 20 50 100 200 500 1000
β=\beta=\infty 1149 933 714 610 444 308
β=100\beta=100 495 29 6 1 0 0
β=50\beta=50 124 18 6 0 0 0
β=20\beta=20 48 4 2 0 1 0
Figure A2. Q-Q plots for the SSB Samples with Nmc=1,000N_{\rm{mc}}=1,000.
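
For reference, the Euler-Maruyama discretization described above can be sketched as follows (again an illustrative NumPy snippet, not the released implementation); drift_fn(x, t) stands for either of the drift estimators sketched in Appendix A.1.

def simulate_ssb(drift_fn, n_steps=200, sigma=1.0, rng=None):
    # Euler-Maruyama discretization of the SSB SDE (A.1) on [0, 1], started at X_0 = 0.
    rng = np.random.default_rng() if rng is None else rng
    dt = 1.0 / n_steps
    x = 0.0
    for k in range(n_steps):
        x += drift_fn(x, k * dt) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return x    # approximate draw from the terminal distribution

Each call returns one sample; repeating the call 10,000 times with drift_fn wrapping drift_is reproduces the setup behind Table A1, with failed runs corresponding to numerical overflow inside the drift estimate.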

A.3. Simulating Mixtures of Normal Distributions

In Example 4, we let μref\mu_{\mathrm{ref}} be a mixture of four bivariate normal distributions,

\displaystyle\mu_{\mathrm{ref}}=0.1\,\mathcal{N}((1,1),\,0.05^{2}I)+0.2\,\mathcal{N}((-1,1),\,0.05^{2}I)+0.3\,\mathcal{N}((1,-1),\,0.05^{2}I)+0.4\,\mathcal{N}((-1,-1),\,0.05^{2}I),

where the weights and the mean vectors of the four component distributions are all different. Let μobj\mu_{\mathrm{obj}} be a mixture of two equally weighted bivariate normal distributions,

\displaystyle\mu_{\mathrm{obj}}=0.5\,\mathcal{N}((1.2,0.8),\,0.5^{2}I)+0.5\,\mathcal{N}((-1.5,-0.5),\,0.5^{2}I).

The first component of μobj\mu_{\mathrm{obj}} has mean close to (1,1)(1,1) (which is the mean vector of the first component of μref\mu_{\mathrm{ref}}), and the second component of μobj\mu_{\mathrm{obj}} has mean close to (−1,−1)(-1,-1) (which is the mean vector of the last component of μref\mu_{\mathrm{ref}}). Hence, we can interpret μref\mu_{\mathrm{ref}} as a distribution of high-quality samples from four different classes, and interpret μobj\mu_{\mathrm{obj}} as a distribution of noisy samples from two of the four classes. Let fβ(x)=Cβ1fref(x)1/(1+β)fobj(x)β/(1+β)f_{\beta}^{*}(x)=C_{\beta}^{-1}f_{\mathrm{ref}}(x)^{1/(1+\beta)}f_{\mathrm{obj}}(x)^{\beta/(1+\beta)} be the density of our target distribution.

We generate 1,0001,000 samples from fβf_{\beta}^{*} by simulating the SDE (A.1) over the time interval [0,1][0,1] with f=fβf=f_{\beta}^{*} and σ=1\sigma=1. We use 200200 time steps for discretization and generate 200200 Monte Carlo samples at each step for estimating the drift. The trajectories of the simulated processes are shown in Figure A3 for β=0,2,10,\beta=0,2,10,\infty, and the samples we generate correspond to t=1t=1. It can be seen that when β=2\beta=2 or 1010, the majority of the generated samples form two clusters, one with mean close to (1,1)(1,1) and the other with mean close to (−1,−1)(-1,-1). Further, the two clusters both exhibit very small within-cluster variation, which indicates that the noise from μobj\mu_{\mathrm{obj}} has been effectively reduced.

Figure A3. SSB Trajectories for Normal Mixture Targets.
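
For this example, the un-normalized log-density of the geometric-mixture target can be assembled directly from the two Gaussian mixtures. Below is a minimal sketch assuming SciPy is available; make_log_density and log_f_beta are our names and not part of the released code, and the gradient needed for the drift can then be obtained, e.g., by automatic differentiation or finite differences.

import numpy as np
from scipy.stats import multivariate_normal

def make_log_density(weights, means, cov):
    # Log-density of a bivariate Gaussian mixture, evaluated with the log-sum-exp trick.
    comps = [multivariate_normal(mean=m, cov=cov) for m in means]
    def log_p(x):
        logs = np.array([np.log(w) + c.logpdf(x) for w, c in zip(weights, comps)])
        m = logs.max(axis=0)
        return m + np.log(np.sum(np.exp(logs - m), axis=0))
    return log_p

log_f_ref = make_log_density([0.1, 0.2, 0.3, 0.4],
                             [(1, 1), (-1, 1), (1, -1), (-1, -1)],
                             0.05 ** 2 * np.eye(2))
log_f_obj = make_log_density([0.5, 0.5],
                             [(1.2, 0.8), (-1.5, -0.5)],
                             0.5 ** 2 * np.eye(2))

def log_f_beta(x, beta):
    # Un-normalized log-density of f_beta^* = C^{-1} f_ref^{1/(1+beta)} f_obj^{beta/(1+beta)}.
    return (log_f_ref(x) + beta * log_f_obj(x)) / (1.0 + beta)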

B. Auxiliary Results

We first present two lemmas about the minimization of Kullback-Leibler divergence.

Lemma B1.

Let (Ω,,π)(\Omega,\mathcal{F},\pi) be a σ\sigma-finite measure space, and let νπ\nu\ll\pi be a measure with density f=dν/dπf=\mathrm{d}\nu/\mathrm{d}\pi such that C=ν(Ω)(0,)C=\nu(\Omega)\in(0,\infty). Let 𝒫π\mathcal{P}_{\pi} denote the set of all probability measures absolutely continuous with respect to π\pi. Then,

infμ𝒫π𝒟KL(μ,ν)=logC.\displaystyle\inf_{\mu\in\mathcal{P}_{\pi}}\mathcal{D}_{\mathrm{KL}}(\mu,\nu)=-\log C.

The infimum is attained by the probability measure μ\mu^{*} such that dμ/dπ=C1f(x)\mathrm{d}\mu^{*}/\mathrm{d}\pi=C^{-1}f(x).

Proof of Lemma B1.

Since 𝒟KL(μ,ν)=\mathcal{D}_{\mathrm{KL}}(\mu,\nu)=\infty if μ≪̸ν\mu\not\ll\nu, it suffices to consider μ𝒫π\mu\in\mathcal{P}_{\pi} such that μν\mu\ll\nu. It is straightforward to show that 𝒟KL(μ,ν)=𝒟KL(μ,μ)logC\mathcal{D}_{\mathrm{KL}}(\mu,\nu)=\mathcal{D}_{\mathrm{KL}}(\mu,\mu^{*})-\log C. Since dμ=C1fdπ=C1dν=1\int\mathrm{d}\mu^{*}=C^{-1}\int f\mathrm{d}\pi=C^{-1}\int\mathrm{d}\nu=1, μ\mu^{*} is a probability measure. The claim then follows from the fact that the KL divergence between any two probability measures is non-negative. ∎

Lemma B2.

Let (Ω,,π)(\Omega,\mathcal{F},\pi) and 𝒫π\mathcal{P}_{\pi} be as given in Lemma B1. For i=0,1i=0,1, let νi\nu_{i} be a finite measure (i.e., νi(Ω)<\nu_{i}(\Omega)<\infty) with density fi=dνi/dπf_{i}=\mathrm{d}\nu_{i}/\mathrm{d}\pi. Assume that ν1ν0\nu_{1}\ll\nu_{0}. For β0\beta\geq 0,

infμ𝒫π𝒟KL(μ,ν0)+β𝒟KL(μ,ν1)=(1+β)logCβ,\displaystyle\inf_{\mu\in\mathcal{P}_{\pi}}\mathcal{D}_{\mathrm{KL}}(\mu,\nu_{0})+\beta\mathcal{D}_{\mathrm{KL}}(\mu,\nu_{1})=-(1+\beta)\log C_{\beta},

where Cβ=f0(x)1/(1+β)f1(x)β/(1+β)π(dx)(0,)C_{\beta}=\int f_{0}(x)^{1/(1+\beta)}f_{1}(x)^{\beta/(1+\beta)}\pi(\mathrm{d}x)\in(0,\infty). The infimum is attained by the probability measure μβ\mu^{*}_{\beta} such that

fβ=dμβdπ=Cβ1f01/(1+β)f1β/(1+β).f^{*}_{\beta}=\frac{\mathrm{d}\mu^{*}_{\beta}}{\mathrm{d}\pi}=C_{\beta}^{-1}f_{0}^{1/(1+\beta)}f_{1}^{\beta/(1+\beta)}.

When ν0,ν1𝒫π\nu_{0},\nu_{1}\in\mathcal{P}_{\pi}, we have Cβ(0,1]C_{\beta}\in(0,1]. Further,

limβCβ=ν1(Ω),limβ𝒟KL(μβ,ν0)=𝒟KL(ν1,ν0).\displaystyle\lim_{\beta\rightarrow\infty}C_{\beta}=\nu_{1}(\Omega),\quad\lim_{\beta\rightarrow\infty}\mathcal{D}_{\mathrm{KL}}(\mu^{*}_{\beta},\nu_{0})=\mathcal{D}_{\mathrm{KL}}(\nu_{1},\nu_{0}).
Proof of Lemma B2.

Since ν1ν0\nu_{1}\ll\nu_{0}, we have Cβ>0C_{\beta}>0. By Hölder’s inequality,

Cβ(f0dπ)1/(1+β)(f1dπ)β/(1+β)=ν0(Ω)1/(1+β)ν1(Ω)β/(1+β)<.C_{\beta}\leq\left(\int f_{0}\mathrm{d}\pi\right)^{1/(1+\beta)}\left(\int f_{1}\mathrm{d}\pi\right)^{\beta/(1+\beta)}=\nu_{0}(\Omega)^{1/(1+\beta)}\nu_{1}(\Omega)^{\beta/(1+\beta)}<\infty.

Clearly, if ν0,ν1𝒫π\nu_{0},\nu_{1}\in\mathcal{P}_{\pi}, the above inequality implies Cβ1C_{\beta}\leq 1. Observe that

𝒟KL(μ,ν0)+β𝒟KL(μ,ν1)=(1+β)𝒟KL(μ,ν)\displaystyle\mathcal{D}_{\mathrm{KL}}(\mu,\nu_{0})+\beta\mathcal{D}_{\mathrm{KL}}(\mu,\nu_{1})=(1+\beta)\mathcal{D}_{\mathrm{KL}}(\mu,\nu)

where ν\nu is the measure with density dν/dπ=f01/(1+β)f1β/(1+β)=Cβfβ\mathrm{d}\nu/\mathrm{d}\pi=f_{0}^{1/(1+\beta)}f_{1}^{\beta/(1+\beta)}=C_{\beta}f^{*}_{\beta}. Hence, we can apply Lemma B1 to prove that μβ\mu^{*}_{\beta} is the minimizer.

To prove the convergence as β\beta\rightarrow\infty, let A={x:f0(x)f1(x)}A=\{x\colon f_{0}(x)\geq f_{1}(x)\} and write Cβ=Cβ,0+Cβ,1C_{\beta}=C_{\beta,0}+C_{\beta,1}, where

Cβ,0=Af0(x)1/(1+β)f1(x)β/(1+β)π(dx),Cβ,1=ΩAf0(x)1/(1+β)f1(x)β/(1+β)π(dx).\displaystyle C_{\beta,0}=\int_{A}f_{0}(x)^{1/(1+\beta)}f_{1}(x)^{\beta/(1+\beta)}\pi(\mathrm{d}x),\quad\quad C_{\beta,1}=\int_{\Omega\setminus A}f_{0}(x)^{1/(1+\beta)}f_{1}(x)^{\beta/(1+\beta)}\pi(\mathrm{d}x).

The integrands of both Cβ,0C_{\beta,0} and Cβ,1C_{\beta,1} are monotone in β\beta. Hence, using C0<C_{0}<\infty and the monotone convergence theorem, we find that

limβCβ=limβf0(x)1/(1+β)f1(x)β/(1+β)π(dx)=f1(x)π(dx)=ν1(Ω),\lim_{\beta\rightarrow\infty}C_{\beta}=\int\lim_{\beta\rightarrow\infty}f_{0}(x)^{1/(1+\beta)}f_{1}(x)^{\beta/(1+\beta)}\pi(\mathrm{d}x)=\int f_{1}(x)\pi(\mathrm{d}x)=\nu_{1}(\Omega),

which also implies fβf1/ν1(Ω)f^{*}_{\beta}\rightarrow f_{1}/\nu_{1}(\Omega) pointwise. An analogous argument using the monotone convergence theorem proves that limβ𝒟KL(μβ,ν0)=𝒟KL(ν1,ν0)\lim_{\beta\rightarrow\infty}\mathcal{D}_{\mathrm{KL}}(\mu^{*}_{\beta},\nu_{0})=\mathcal{D}_{\mathrm{KL}}(\nu_{1},\nu_{0}). ∎
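
As a simple illustration of Lemma B2 (a sanity check added here; it is not used in the sequel), take \pi to be the Lebesgue measure on \mathbb{R} and \nu_{i}=\mathcal{N}(m_{i},s^{2}) for i=0,1. Completing the square gives

f_{0}(x)^{1/(1+\beta)}f_{1}(x)^{\beta/(1+\beta)}=C_{\beta}\,\phi_{s}(x-\bar{m}),\qquad\bar{m}=\frac{m_{0}+\beta m_{1}}{1+\beta},\qquad C_{\beta}=\exp\left(-\frac{\beta(m_{1}-m_{0})^{2}}{2(1+\beta)^{2}s^{2}}\right),

where \phi_{s} denotes the \mathcal{N}(0,s^{2}) density. Hence \mu^{*}_{\beta}=\mathcal{N}(\bar{m},s^{2}), and C_{\beta}\rightarrow 1=\nu_{1}(\mathbb{R}) as \beta\rightarrow\infty, in agreement with the last display of the lemma.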

The next result is about the controlled SDE (2.2) with u=σ2loghu=\sigma^{2}\nabla\log h. It is adapted from Theorem 2.1 of Dai Pra [1991]. Our proof is provided for completeness.

Theorem B3.

Suppose Assumption 1 holds. Let X=(Xt)0tTX=(X_{t})_{0\leq t\leq T} be a weak solution to (2.1) with initial distribution μ0\mu_{0} and transition density p(x,t|y,s)p(x,t\,|\,y,s). Define

h(x,t)=g(z)p(z,T|x,t)dz,for (x,t)d×[0,T),h(x,t)=\int g(z)p(z,T\,|\,x,t)\mathrm{d}z,\quad\text{for }(x,t)\in{\mathbb{R}}^{d}\times[0,T),

for some measurable g0g\geq 0 such that 𝖤g(XT)<{\mathsf{E}}g(X_{T})<\infty. Assume h>0h>0 on d×[0,T){\mathbb{R}}^{d}\times[0,T). Let Xh=(Xth)0tTX^{h}=(X^{h}_{t})_{0\leq t\leq T} be a weak solution with X0h=X0X^{h}_{0}=X_{0} to the SDE

(B.1) dXth=[b(Xth,t)+σ2logh(Xth,t)]dt+σdWt, for t[0,T].\mathrm{d}X_{t}^{h}=\left[b(X_{t}^{h},t)+\sigma^{2}\nabla\log h(X_{t}^{h},t)\right]\mathrm{d}t+\sigma\mathrm{d}W_{t},\text{ for }t\in[0,T].

Then, we have the following results.

  1. (i)

    h2,1(d×[0,T))h\in{\mathbb{C}}^{2,1}({\mathbb{R}}^{d}\times[0,T)) and

    (B.2) ht+i=1dbihxi+σ22i=1d2hxi2=0, for (x,t)d×[0,T).\frac{\partial h}{\partial t}+\sum_{i=1}^{d}b_{i}\frac{\partial h}{\partial x_{i}}+\frac{\sigma^{2}}{2}\sum_{i=1}^{d}\frac{\partial^{2}h}{\partial x_{i}^{2}}=0,\text{ for }(x,t)\in{\mathbb{R}}^{d}\times[0,T).
  2. (ii)

The weak solution XhX^{h} to the SDE (B.1) exists. Indeed, we can define a probability measure 𝖰{\mathsf{Q}} by d𝖰/d𝖯=g(XT)/h(X0,0)\mathrm{d}{\mathsf{Q}}/\mathrm{d}{\mathsf{P}}=g(X_{T})/h(X_{0},0) such that the law of XX under 𝖰{\mathsf{Q}} is the same as the law of XhX^{h} under 𝖯{\mathsf{P}}.

  3. (iii)

    The process XhX^{h} satisfies

    (B.3) 𝖤[logg(XTh)h(X0h,0)]=𝖤[0Tσ22logh(Xsh,s)2ds].{\mathsf{E}}\left[\log\frac{g(X^{h}_{T})}{h(X_{0}^{h},0)}\right]={\mathsf{E}}\left[\int_{0}^{T}\frac{\sigma^{2}}{2}\|\nabla\log h(X_{s}^{h},s)\|^{2}\mathrm{d}s\right].
  4. (iv)

    The transition density of XhX^{h} is given by

    (B.4) ph(x,t|y,s)=p(x,t|y,s)h(x,t)h(y,s), for 0s<tT.p_{h}(x,t\,|\,y,s)=\frac{p(x,t\,|\,y,s)h(x,t)}{h(y,s)},\quad\text{ for }0\leq s<t\leq T.

    Hence, XhX^{h} is a Doob’s hh-path process, and the density of the distribution of XThX^{h}_{T} is

    (B.5) daw(XTh)dλ(x)=g(x)p(x,T|y,0)h(y,0)μ0(dy).\frac{\mathrm{d}\mathcal{L}aw(X^{h}_{T})}{\mathrm{d}\lambda}(x)=g(x)\int\frac{p(x,T\,|\,y,0)}{h(y,0)}\mu_{0}(\mathrm{d}y).
Proof of Theorem B3.

We follow the arguments of Jamison [1975], Dai Pra [1991]. Under Assumption 1, Proposition 2.1 of Dai Pra [1991] (which is adapted from the result of Jamison [1975]) implies that h2,1(d×[0,T))h\in{\mathbb{C}}^{2,1}({\mathbb{R}}^{d}\times[0,T)) and h+ht=0\mathcal{L}h+\frac{\partial h}{\partial t}=0 on d×[0,T){\mathbb{R}}^{d}\times[0,T), where

(B.6) =i=1dbixi+σ22i=1d2xi2\mathcal{L}=\sum_{i=1}^{d}b_{i}\frac{\partial}{\partial x_{i}}+\frac{\sigma^{2}}{2}\sum_{i=1}^{d}\frac{\partial^{2}}{\partial x_{i}^{2}}

denotes the generator of XX. This proves part (i).

To prove part (ii), we first apply Itô’s lemma to get

logh(Xt,t)=\displaystyle\log h(X_{t},t)=\; logh(X0,0)+0tb~(Xs,s)ds+0tσlogh(Xs,s)dWs,\displaystyle\log h(X_{0},0)+\int_{0}^{t}\tilde{b}(X_{s},s)\mathrm{d}s+\int_{0}^{t}\sigma\nabla\log h(X_{s},s)\mathrm{d}W_{s},

for any t[0,T)t\in[0,T), where

b~(x,t)=1h(x,t)((h)(x,t)+ht(x,t))σ22logh(x,t)2=σ22logh(x,t)2.\displaystyle\tilde{b}(x,t)=\frac{1}{h(x,t)}\left((\mathcal{L}h)(x,t)+\frac{\partial h}{\partial t}(x,t)\right)-\frac{\sigma^{2}}{2}\|\nabla\log h(x,t)\|^{2}=-\frac{\sigma^{2}}{2}\|\nabla\log h(x,t)\|^{2}.

Since g(XT)g(X_{T}) is integrable, h(Xt,t)h(X_{t},t) is a uniformly integrable martingale on [0,T)[0,T) and converges to g(XT)g(X_{T}) both a.s. and in L1L^{1}. Letting tTt\uparrow T, we obtain that

(B.7) logg(XT)=logh(X0,0)0Tσ22logh(Xs,s)2ds+0Tσlogh(Xs,s)dWs.\log g(X_{T})=\log h(X_{0},0)-\int_{0}^{T}\frac{\sigma^{2}}{2}\|\nabla\log h(X_{s},s)\|^{2}\mathrm{d}s+\int_{0}^{T}\sigma\nabla\log h(X_{s},s)\mathrm{d}W_{s}.

Write h(x,T)=g(x)h(x,T)=g(x), Yt=h(Xt,t)Y_{t}=h(X_{t},t) and Zt=Yt/Y0Z_{t}=Y_{t}/Y_{0}. We have shown that YtY_{t} and ZtZ_{t} are martingales on [0,T][0,T]. Since 𝖤[ZT]=1{\mathsf{E}}[Z_{T}]=1, we can define a probability measure 𝖰{\mathsf{Q}} by d𝖰/d𝖯=ZT\mathrm{d}{\mathsf{Q}}/\mathrm{d}{\mathsf{P}}=Z_{T}. By Girsanov theorem and the expression for ZTZ_{T} given in (B.7), the law of XX under 𝖰{\mathsf{Q}} is the same as the law of XhX^{h} under 𝖯{\mathsf{P}}; in other words, d𝖯Xh=(g(xT)/h(x0,0))d𝖯X\mathrm{d}{\mathsf{P}}^{h}_{X}=(g(x_{T})/h(x_{0},0))\mathrm{d}{\mathsf{P}}_{X}, where 𝖯X{\mathsf{P}}_{X} (resp. 𝖯Xh{\mathsf{P}}_{X}^{h}) is the probability measure induced by XX (resp. XhX^{h}) on the space of continuous functions.

For part (iii), choose tn=Tτnt_{n}=T\wedge\tau_{n}, where τn=inf{t:|Xth|n}\tau_{n}=\inf\{t\colon|X^{h}_{t}|\geq n\}. Analogously to (B.7), we can apply Itô’s lemma to get

logh(Xtnh,tn)=logh(X0h,0)+0tnσ22logh(Xsh,s)2ds+0tnσlogh(Xsh,s)dWs.\displaystyle\log h(X^{h}_{t_{n}},{t_{n}})=\log h(X^{h}_{0},0)+\int_{0}^{t_{n}}\frac{\sigma^{2}}{2}\|\nabla\log h(X^{h}_{s},s)\|^{2}\mathrm{d}s+\int_{0}^{t_{n}}\sigma\nabla\log h(X^{h}_{s},s)\mathrm{d}W_{s}.

Since hh is smooth, |logh(Xsh,s)||\nabla\log h(X^{h}_{s},s)| is bounded on [0,tn][0,t_{n}]. Taking expectations on both sides, we find that

(B.8) 𝖤[logh(Xtnh,tn)]=𝖤[logh(X0h,0)+0tnσ22logh(Xsh,s)2ds].{\mathsf{E}}\left[\log h(X^{h}_{t_{n}},{t_{n}})\right]={\mathsf{E}}\left[\log h(X^{h}_{0},0)+\int_{0}^{t_{n}}\frac{\sigma^{2}}{2}\|\nabla\log h(X^{h}_{s},s)\|^{2}\mathrm{d}s\right].

Letting nn\rightarrow\infty and applying the monotone convergence theorem, we get

(B.9) lim infn𝖤[logh(Xtnh,tn)]=𝖤[logh(X0h,0)+0Tσ22logh(Xsh,s)2ds].\liminf_{n\rightarrow\infty}{\mathsf{E}}\left[\log h(X^{h}_{t_{n}},{t_{n}})\right]={\mathsf{E}}\left[\log h(X^{h}_{0},0)+\int_{0}^{T}\frac{\sigma^{2}}{2}\|\nabla\log h(X^{h}_{s},s)\|^{2}\mathrm{d}s\right].

It remains to argue that the left-hand side converges to 𝖤[logh(XTh,T)]{\mathsf{E}}\left[\log h(X^{h}_{T},T)\right]. Write Yth=h(Xth,t)Y^{h}_{t}=h(X^{h}_{t},t). Using the change of measure and h(x,T)=g(x)h(x,T)=g(x), we get

𝖤[logYth]=𝖤[YtY0logYt].{\mathsf{E}}\left[\log Y^{h}_{t}\right]={\mathsf{E}}\left[\frac{Y_{t}}{Y_{0}}\log Y_{t}\right].

The function f(y)=ylogyf(y)=y\log y is bounded below and convex. If (tn)(t_{n}) is chosen such that tnTt_{n}\uparrow T, we have

𝖤[limnYtnlogYtn|0]lim infn𝖤[YtnlogYtn|0]𝖤[YTlogYT|0],\displaystyle{\mathsf{E}}\left[\lim_{n\rightarrow\infty}Y_{t_{n}}\log Y_{t_{n}}\,|\,\mathcal{F}_{0}\right]\leq\liminf_{n\rightarrow\infty}{\mathsf{E}}\left[Y_{t_{n}}\log Y_{t_{n}}\,|\,\mathcal{F}_{0}\right]\leq{\mathsf{E}}\left[Y_{T}\log Y_{T}\,|\,\mathcal{F}_{0}\right],

where Fatou’s lemma is applied to obtain the first inequality, and the second inequality follows from the fact that YtlogYtY_{t}\log Y_{t} is a submartingale. Since YtY_{t} converges a.s. to YT=g(XT)Y_{T}=g(X_{T}), we get

(B.10) lim infn𝖤[YtnlogYtn|0]=𝖤[YTlogYT|0].\liminf_{n\rightarrow\infty}{\mathsf{E}}\left[Y_{t_{n}}\log Y_{t_{n}}\,|\,\mathcal{F}_{0}\right]={\mathsf{E}}\left[Y_{T}\log Y_{T}\,|\,\mathcal{F}_{0}\right].

Taking expectations on both sides, we get

(B.11) lim infn𝖤[logYtnh]=𝖤[logYTh].\liminf_{n\rightarrow\infty}{\mathsf{E}}\left[\log Y^{h}_{t_{n}}\right]={\mathsf{E}}\left[\log Y^{h}_{T}\right].

Combining it with (B.9) proves part (iii).

Consider part (iv). For any bounded and measurable function ff, we apply the change of measure to the conditional expectation to get

𝖤𝖰[f(Xt)|s]=𝖤[f(Xt)h(Xt,t)|s]h(Xs,s).\displaystyle{\mathsf{E}}_{{\mathsf{Q}}}[f(X_{t})\,|\,\mathcal{F}_{s}]=\frac{{\mathsf{E}}[f(X_{t})h(X_{t},t)\,|\,\mathcal{F}_{s}]}{h(X_{s},s)}.

The claim then follows from Fubini’s theorem. ∎

C. Proofs for the Main Results

Recall that XX denotes the solution to the SDE

(C.1) dXt=b(Xt,t)dt+σdWt\mathrm{d}X_{t}=b(X_{t},t)\mathrm{d}t+\sigma\mathrm{d}W_{t}

over the time interval [0,T][0,T] with initial distribution aw(X0)=μ0\mathcal{L}aw(X_{0})=\mu_{0}, and XuX^{u} denotes the solution to the controlled SDE

(C.2) dXtu=[b(Xtu,t)+ut]dt+σdWt,\mathrm{d}X^{u}_{t}=\left[b(X^{u}_{t},t)+u_{t}\right]\mathrm{d}t+\sigma\mathrm{d}W_{t},

with X0u=X0X_{0}^{u}=X_{0}. The transition density of the uncontrolled process XX is denoted by p(x,t|y,s)p(x,t\,|\,y,s).

Proof of Lemma 3.

Since uu is admissible, it must satisfy 𝖤0Tut2dt<{\mathsf{E}}\int_{0}^{T}\|u_{t}\|^{2}\mathrm{d}t<\infty. Therefore, Novikov’s condition is satisfied, and we can apply Girsanov theorem to get

𝖤[g(XT)h(X0,0)]=\displaystyle{\mathsf{E}}\left[\frac{g(X_{T})}{h(X_{0},0)}\right]=\; 𝖤[g(XTu)h(X0u,0)exp{0TutσdWt0Tut22σ2dt}]\displaystyle{\mathsf{E}}\biggr{[}\frac{g(X_{T}^{u})}{h(X_{0}^{u},0)}\exp\biggr{\{}-\int_{0}^{T}\frac{u_{t}}{\sigma}\mathrm{d}W_{t}-\int_{0}^{T}\frac{\|u_{t}\|^{2}}{2\sigma^{2}}\mathrm{d}t\biggr{\}}\biggr{]}
\displaystyle\geq\; exp{𝖤[logg(XTu)h(X0u,0)0TutσdWt0Tut22σ2dt]}.\displaystyle\exp\biggr{\{}{\mathsf{E}}\biggr{[}\log\frac{g(X_{T}^{u})}{h(X_{0}^{u},0)}-\int_{0}^{T}\frac{u_{t}}{\sigma}\mathrm{d}W_{t}-\int_{0}^{T}\frac{\|u_{t}\|^{2}}{2\sigma^{2}}\mathrm{d}t\biggr{]}\biggr{\}}.

By part (ii) of Theorem B3, the left-hand side of the above inequality is equal to 11. Hence,

0𝖤[logg(XTu)h(X0u,0)0TutσdWt0Tut22σ2dt]=𝖤[logg(XTu)h(X0u,0)0Tut22σ2dt],\displaystyle 0\geq{\mathsf{E}}\left[\log\frac{g(X_{T}^{u})}{h(X_{0}^{u},0)}-\int_{0}^{T}\frac{u_{t}}{\sigma}\mathrm{d}W_{t}-\int_{0}^{T}\frac{\|u_{t}\|^{2}}{2\sigma^{2}}\mathrm{d}t\right]={\mathsf{E}}\left[\log\frac{g(X_{T}^{u})}{h(X_{0}^{u},0)}-\int_{0}^{T}\frac{\|u_{t}\|^{2}}{2\sigma^{2}}\mathrm{d}t\right],

where we have used 𝖤0Tut2dt<{\mathsf{E}}\int_{0}^{T}\|u_{t}\|^{2}\mathrm{d}t<\infty again to obtain 𝖤[0TutdWt]=0{\mathsf{E}}[\int_{0}^{T}u_{t}\mathrm{d}W_{t}]=0. By the definition of JβJ_{\beta}, we have

Jβ(u)=\displaystyle J_{\beta}(u)=\; β𝒟KL(aw(XTu),μT)+𝖤[0Tut22σ2dt]\displaystyle\beta\,\mathcal{D}_{\mathrm{KL}}(\mathcal{L}aw(X^{u}_{T}),\mu_{T})+{\mathsf{E}}\left[\int_{0}^{T}\frac{\|u_{t}\|^{2}}{2\sigma^{2}}\mathrm{d}t\right]
\displaystyle\geq\; β𝒟KL(aw(XTu),μT)+𝖤[logg(XTu)h(X0u,0)].\displaystyle\beta\,\mathcal{D}_{\mathrm{KL}}(\mathcal{L}aw(X^{u}_{T}),\mu_{T})+{\mathsf{E}}\left[\log\frac{g(X_{T}^{u})}{h(X_{0}^{u},0)}\right].

Since aw(X0u)=μ0\mathcal{L}aw(X_{0}^{u})=\mu_{0}, we have 𝖤[logh(X0u,0)]=logh(x,0)μ0(dx){\mathsf{E}}[\log h(X_{0}^{u},0)]=\int\log h(x,0)\mu_{0}(\mathrm{d}x). By part (iii) of Theorem B3, the equality is attained when ut=(𝒯g)(Xtu,t)u_{t}=(\mathcal{T}g)(X^{u}_{t},t). ∎

Proof of Theorem 4.

First, we find using (3.3) that

𝖤ρT(XT)=dμ0dν0dμ0,{\mathsf{E}}\rho_{T}(X_{T})=\int\frac{\mathrm{d}\mu_{0}}{\mathrm{d}\nu_{0}}\mathrm{d}\mu_{0},

which is finite by assumption. Since p(z,T|x,t)>0p(z,T\,|\,x,t)>0 whenever t<Tt<T, we have h>0h>0 on d×[0,T){\mathbb{R}}^{d}\times[0,T). Hence, Theorem B3 and Lemma 3 can be applied with g=ρTg=\rho_{T}. In addition to setting ρT=dνT/dλ\rho_{T}=\mathrm{d}\nu_{T}/\mathrm{d}\lambda and fT=dμT/dλf_{T}=\mathrm{d}\mu_{T}/\mathrm{d}\lambda, we define

(C.3) kT(x)=p(x,T|y,0)ν0(dy),qTu(x)=daw(XTu)dλ(x).\displaystyle k_{T}(x)=\int p(x,T\,|\,y,0)\nu_{0}(\mathrm{d}y),\quad q_{T}^{u}(x)=\frac{\mathrm{d}\mathcal{L}aw(X^{u}_{T})}{\mathrm{d}\lambda}(x).

There is no loss of generality in assuming qTuq_{T}^{u} exists, since μTλ\mu_{T}\ll\lambda and 𝒟KL(aw(XTu),μT)=\mathcal{D}_{\mathrm{KL}}(\mathcal{L}aw(X^{u}_{T}),\mu_{T})=\infty if aw(XTu)≪̸μT\mathcal{L}aw(X_{T}^{u})\not\ll\mu_{T}. For later use, we note that by (3.3) and (3.4),

(C.4) h(y,0)=dμ0dν0(y),\displaystyle h(y,0)=\frac{\mathrm{d}\mu_{0}}{\mathrm{d}\nu_{0}}(y),
(C.5) ρT(x)=(fT(x)kT(x))β/(1+β).\displaystyle\rho_{T}(x)=\left(\frac{f_{T}(x)}{k_{T}(x)}\right)^{\beta/(1+\beta)}.

Let η\eta be the σ\sigma-finite measure on d{\mathbb{R}}^{d} with density fTβ/(1+β)kT1/(1+β)f_{T}^{\beta/(1+\beta)}k_{T}^{1/(1+\beta)}. By Lemma 3, for any admissible control uu,

Jβ(u)\displaystyle J_{\beta}(u)\geq\; β𝒟KL(aw(XTu),μT)+𝖤[logρT(XTu)]logh(x,0)μ0(dx)\displaystyle\beta\,\mathcal{D}_{\mathrm{KL}}(\mathcal{L}aw(X^{u}_{T}),\mu_{T})+{\mathsf{E}}\left[\log\rho_{T}(X_{T}^{u})\right]-\int\log h(x,0)\mu_{0}(\mathrm{d}x)
=\displaystyle=\; β𝒟KL(aw(XTu),μT)+𝖤[logρT(XTu)]𝒟KL(μ0,ν0)\displaystyle\beta\,\mathcal{D}_{\mathrm{KL}}(\mathcal{L}aw(X^{u}_{T}),\mu_{T})+{\mathsf{E}}\left[\log\rho_{T}(X_{T}^{u})\right]-\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0})
=\displaystyle=\; 𝖤[βlogqTu(XTu)fT(XTu)+β1+βlogfT(XTu)kT(XTu)]𝒟KL(μ0,ν0)\displaystyle{\mathsf{E}}\left[\beta\log\frac{q_{T}^{u}(X^{u}_{T})}{f_{T}(X^{u}_{T})}+\frac{\beta}{1+\beta}\log\frac{f_{T}(X^{u}_{T})}{k_{T}(X^{u}_{T})}\right]-\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0})
=\displaystyle=\; β𝒟KL(aw(XTu),η)𝒟KL(μ0,ν0),\displaystyle\beta\mathcal{D}_{\mathrm{KL}}(\mathcal{L}aw(X^{u}_{T}),\eta)-\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0}),

where the second line follows from (C.4) and the third from (C.5). The measures μ0,ν0,η\mu_{0},\nu_{0},\eta do not depend on uu. By Lemma B1,

(C.6) 𝒟KL(aw(XTu),η)logC,where C=fT(x)β/(1+β)kT(x)1/(1+β)dx.\displaystyle\mathcal{D}_{\mathrm{KL}}(\mathcal{L}aw(X^{u}_{T}),\eta)\geq-\log C,\quad\text{where }C=\int f_{T}(x)^{\beta/(1+\beta)}k_{T}(x)^{1/(1+\beta)}\mathrm{d}x.

We will later prove that C=1C=1. Combining the above two displayed inequalities, we get

(C.7) Jβ(u)\displaystyle J_{\beta}(u)\geq\; β𝒟KL(aw(XTu),η)𝒟KL(μ0,ν0)\displaystyle\beta\mathcal{D}_{\mathrm{KL}}(\mathcal{L}aw(X^{u}_{T}),\eta)-\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0})
(C.8) \displaystyle\geq\; logC𝒟KL(μ0,ν0).\displaystyle-\log C-\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0}).

For ut=σ2logh(Xtu,t)u^{*}_{t}=\sigma^{2}\nabla\log h(X_{t}^{u^{*}},t), we know by Lemma 3 that the equality in (C.7) is attained. Hence, it is optimal if we can show that the equality in (C.8) is also attained. By Lemma B1, this is equivalent to showing that

(C.9) qT(x)=C1fT(x)β/(1+β)kT(x)1/(1+β),q_{T}^{*}(x)=C^{-1}f_{T}(x)^{\beta/(1+\beta)}k_{T}(x)^{1/(1+\beta)},

where we write qT=qTuq_{T}^{*}=q_{T}^{u^{*}}. By part (iv) of Theorem B3, we have

qT(x)=\displaystyle q_{T}^{*}(x)=\; ρT(x)p(x,T|y,0)h(y,0)μ0(dy)\displaystyle\rho_{T}(x)\int\frac{p(x,T\,|\,y,0)}{h(y,0)}\mu_{0}(\mathrm{d}y)
=(i)\displaystyle\overset{(i)}{=}\; ρT(x)p(x,T|y,0)ν0(dy)\displaystyle\rho_{T}(x)\int p(x,T\,|\,y,0)\nu_{0}(\mathrm{d}y)
=\displaystyle=\; ρT(x)kT(x)\displaystyle\rho_{T}(x)k_{T}(x)
=(ii)\displaystyle\overset{(ii)}{=}\; fT(x)β/(1+β)kT(x)1/(1+β),\displaystyle f_{T}(x)^{\beta/(1+\beta)}k_{T}(x)^{1/(1+\beta)},

where step (i)(i) follows from (C.4) and step (ii) follows from (C.5). So uu^{*} is optimal, and the normalizing constant CC in (C.9) equals 11, from which it follows that Jβ(u)=𝒟KL(μ0,ν0)J_{\beta}(u^{*})=-\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0}). Finally, one can apply Jensen’s inequality and the assumption (dμ0/dν0)dμ0<\int(\mathrm{d}\mu_{0}/\mathrm{d}\nu_{0})\mathrm{d}\mu_{0}<\infty to show that |𝒟KL(μ0,ν0)|<|\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0})|<\infty, which concludes the proof. ∎

Proof of Theorem 6.

Recall that ν0,νN\nu_{0},\nu_{N} are defined by

(C.10) dμ0dν0(y)=\displaystyle\frac{\mathrm{d}\mu_{0}}{\mathrm{d}\nu_{0}}(y)=\; pN(𝒙N|y)νN(d𝒙N),\displaystyle\int p_{N}(\bm{x}_{N}\,|\,y)\nu_{N}(\mathrm{d}\bm{x}_{N}),
(C.11) dμNdλ(𝒙N)=\displaystyle\frac{\mathrm{d}\mu_{N}}{\mathrm{d}\lambda}(\bm{x}_{N})=\; ρN(𝒙N)1+ββpN(𝒙N|y)ν0(dy).\displaystyle\rho_{N}(\bm{x}_{N})^{\frac{1+\beta}{\beta}}\int p_{N}(\bm{x}_{N}\,|\,y)\nu_{0}(\mathrm{d}y).

The proof is essentially the same as that of Theorem 4. First, we need to derive a result analogous to Lemma 3, which we give in Lemma C4 below. We will apply Lemma C4 with g=ρNg=\rho_{N}. Recall that hjh_{j} is defined by

(C.12) hj(x,t;𝒙j)=𝖤[ρN(𝒙j,Xtj+1,,XtN)|Xt=x], for (x,t)d×[tj,tj+1).\displaystyle h_{j}(x,t;\bm{x}_{j})={\mathsf{E}}\left[\rho_{N}(\bm{x}_{j},X_{t_{j+1}},\dots,X_{t_{N}})\,\Big{|}\,X_{t}=x\right],\text{ for }(x,t)\in{\mathbb{R}}^{d}\times[t_{j},t_{j+1}).

Note that the conditions 𝖤ρN(𝑿N)<{\mathsf{E}}\rho_{N}(\bm{X}_{N})<\infty and hj>0h_{j}>0 can be verified by the same argument as that used in the proof of Theorem 4. By (C.10), we have

(C.13) h0(y,0)=𝖤[ρN(𝑿N)|X0=y]=dμ0dν0(y).h_{0}(y,0)={\mathsf{E}}[\rho_{N}(\bm{X}_{N})\,|\,X_{0}=y]=\frac{\mathrm{d}\mu_{0}}{\mathrm{d}\nu_{0}}(y).

Define

(C.14) fN(𝒙N)=dμNdλ(𝒙N),kN(𝒙N)=pN(𝒙N|y)ν0(dy),qNu(𝒙N)=daw(𝑿Nu)dλ(𝒙N).\displaystyle f_{N}(\bm{x}_{N})=\frac{\mathrm{d}\mu_{N}}{\mathrm{d}\lambda}(\bm{x}_{N}),\quad k_{N}(\bm{x}_{N})=\int p_{N}(\bm{x}_{N}\,|\,y)\nu_{0}(\mathrm{d}y),\quad q_{N}^{u}(\bm{x}_{N})=\frac{\mathrm{d}\mathcal{L}aw(\bm{X}^{u}_{N})}{\mathrm{d}\lambda}(\bm{x}_{N}).

We can rewrite (C.11) as ρN=(fN/kN)β/(1+β)\rho_{N}=(f_{N}/k_{N})^{\beta/(1+\beta)}.

By Lemma C4 and Lemma B1, for any admissible control uu,

JβN(u)logC𝒟KL(μ0,ν0),where C=fN(𝒙N)β/(1+β)kN(𝒙N)1/(1+β)d𝒙N.\displaystyle J_{\beta}^{N}(u)\geq-\log C-\mathcal{D}_{\mathrm{KL}}(\mu_{0},\nu_{0}),\quad\text{where }C=\int f_{N}(\bm{x}_{N})^{\beta/(1+\beta)}k_{N}(\bm{x}_{N})^{1/(1+\beta)}\mathrm{d}\bm{x}_{N}.

To prove that the equality is attained by the control ut=(𝒯NρN)(Xtu,t,𝑿Nu)u_{t}^{*}=(\mathcal{T}_{N}\rho_{N})(X^{u^{*}}_{t},t,\bm{X}^{u^{*}}_{N}), it remains to show that

(C.15) qN(𝒙N)=C1fN(𝒙N)β/(1+β)kN(𝒙N)1/(1+β),q_{N}^{*}(\bm{x}_{N})=C^{-1}f_{N}(\bm{x}_{N})^{\beta/(1+\beta)}k_{N}(\bm{x}_{N})^{1/(1+\beta)},

where qN=qNuq_{N}^{*}=q_{N}^{u^{*}}. To find qNq_{N}^{*}, we can mimic the proof of Theorem B3. It is not difficult to verify that the law of XX^{*} under 𝖯{\mathsf{P}} is the same as the law of XX under 𝖰{\mathsf{Q}}, where 𝖰{\mathsf{Q}} is defined by

(C.16) d𝖰d𝖯=𝖤[ρN(𝑿N)|T]𝖤[ρN(𝑿N)|0]=ρN(𝑿N)h0(X0,0).\displaystyle\frac{\mathrm{d}{\mathsf{Q}}}{\mathrm{d}{\mathsf{P}}}=\frac{{\mathsf{E}}[\rho_{N}(\bm{X}_{N})\,|\,\mathcal{F}_{T}]}{{\mathsf{E}}[\rho_{N}(\bm{X}_{N})\,|\,\mathcal{F}_{0}]}=\frac{\rho_{N}(\bm{X}_{N})}{h_{0}(X_{0},0)}.

So for any bounded and measurable function \ell, we have

𝖤𝖰[(𝑿N)]=𝖤[(𝑿N)ρN(𝑿N)h0(X0,0)]=(𝒙N)ρN(𝒙N){pN(𝒙N|y)h0(y,0)μ0(dy)}d𝒙N.\displaystyle{\mathsf{E}}_{{\mathsf{Q}}}[\ell(\bm{X}_{N})]={\mathsf{E}}\left[\ell(\bm{X}_{N})\frac{\rho_{N}(\bm{X}_{N})}{h_{0}(X_{0},0)}\right]=\int\ell(\bm{x}_{N})\rho_{N}(\bm{x}_{N})\left\{\int\frac{p_{N}(\bm{x}_{N}\,|\,y)}{h_{0}(y,0)}\mu_{0}(\mathrm{d}y)\right\}\mathrm{d}\bm{x}_{N}.

It thus follows from (C.13) that

(C.17) qN(𝒙N)=ρN(𝒙N)pN(𝒙N|y)h0(y,0)μ0(dy)=fN(𝒙N)β/(1+β)kN(𝒙N)1/(1+β).q_{N}^{*}(\bm{x}_{N})=\rho_{N}(\bm{x}_{N})\int\frac{p_{N}(\bm{x}_{N}\,|\,y)}{h_{0}(y,0)}\mu_{0}(\mathrm{d}y)=f_{N}(\bm{x}_{N})^{\beta/(1+\beta)}k_{N}(\bm{x}_{N})^{1/(1+\beta)}.

The rest of the proof is identical to that of Theorem 4. ∎

Lemma C4.

Let uu be an admissible control and g:d×N[0,)g\colon{\mathbb{R}}^{d\times N}\rightarrow[0,\infty) be a measurable function such that 𝖤[g(𝐗N)]<{\mathsf{E}}[g(\bm{X}_{N})]<\infty. Let hjh_{j} be as given in (4.3), and assume hj>0h_{j}>0 for each jj. For the cost JβNJ_{\beta}^{N} defined in (4.1), we have

JβN(u)β𝒟KL(aw(𝑿Nu),μN)+𝖤[logg(𝑿Nu)]logh0(x,0)μ0(dx).\displaystyle J_{\beta}^{N}(u)\geq\beta\,\mathcal{D}_{\mathrm{KL}}(\mathcal{L}aw(\bm{X}^{u}_{N}),\mu_{N})+{\mathsf{E}}[\log g(\bm{X}^{u}_{N})]-\int\log h_{0}(x,0)\mu_{0}(\mathrm{d}x).

The equality is attained when ut=(𝒯Ng)(Xtu,t,𝐗Nu)u_{t}=(\mathcal{T}_{N}g)(X^{u}_{t},t,\bm{X}^{u}_{N}).

Proof of Lemma C4.

The SDE (2.2) with control ut=(𝒯Ng)(Xtu,t,𝑿Nu)u_{t}^{*}=(\mathcal{T}_{N}g)(X^{u^{*}}_{t},t,\bm{X}^{u^{*}}_{N}) can be expressed as

(C.18) X0=X0,\displaystyle X_{0}^{*}=X_{0},
dXt=[b(Xt,t)+σ2loghj(Xt,t;𝑿j)]dt+σdWt, for t[tj,tj+1),\displaystyle\mathrm{d}X_{t}^{*}=\left[b(X_{t}^{*},t)+\sigma^{2}\nabla\log h_{j}(X_{t}^{*},t;\bm{X}_{j}^{*})\right]\mathrm{d}t+\sigma\mathrm{d}W_{t},\text{ for }t\in[t_{j},t_{j+1}),

where we write X=XuX^{*}=X^{u^{*}} and hjh_{j} is as given in (4.3). By the tower property, we have

(C.19) hj(x,t;𝒙j)=𝖤[hj+1(Xtj+1,tj+1;𝒙j,Xtj+1)|Xt=x],h_{j}(x,t;\bm{x}_{j})={\mathsf{E}}\left[h_{j+1}(X_{t_{j+1}},t_{j+1};\bm{x}_{j},X_{t_{j+1}})\,|\,X_{t}=x\right],

where hNh_{N} is defined by hN(x,tN;𝒙N1,x)=g(𝒙N1,x)h_{N}(x,t_{N};\bm{x}_{N-1},x)=g(\bm{x}_{N-1},x), and XX is the solution to the uncontrolled process (2.1). The assumption 𝖤[g(𝑿N)]<{\mathsf{E}}[g(\bm{X}_{N})]<\infty implies that 𝖤[hj(Xtj,tj;𝒙j1,Xtj)]<{\mathsf{E}}[h_{j}(X_{t_{j}},t_{j};\bm{x}_{j-1},X_{t_{j}})]<\infty for almost every 𝒙j1\bm{x}_{j-1}. Hence, for each jj and xjdx_{j}\in{\mathbb{R}}^{d}, Theorem B3 implies that there exists a weak solution to the SDE (C.18) on [tj,tj+1][t_{j},t_{j+1}] with Xtj=xjX_{t_{j}}^{*}=x_{j}. Moreover,

(C.20) 𝖤[loghj+1(Xtj+1,tj+1;𝑿j+1)hj(Xtj,tj;𝑿j)]=𝖤[tjtj+1σ22loghj(Xs,s;𝑿j)2ds].\displaystyle{\mathsf{E}}\left[\log\frac{h_{j+1}(X_{t_{j+1}}^{*},t_{j+1};\bm{X}_{j+1}^{*})}{h_{j}(X_{t_{j}}^{*},t_{j};\bm{X}_{j}^{*})}\right]={\mathsf{E}}\left[\int_{t_{j}}^{t_{j+1}}\frac{\sigma^{2}}{2}\|\nabla\log h_{j}(X_{s}^{*},s;\bm{X}_{j}^{*})\|^{2}\mathrm{d}s\right].

Summing over all the NN time intervals, we get

𝖤[logg(𝑿N)h0(X0,0)]=𝖤[j=0N1loghj+1(Xtj+1,tj+1;𝑿j+1)hj(Xtj,tj;𝑿j)]=𝖤[0Tut22σ2dt].\displaystyle{\mathsf{E}}\left[\log\frac{g(\bm{X}^{*}_{N})}{h_{0}(X_{0}^{*},0)}\right]={\mathsf{E}}\left[\sum_{j=0}^{N-1}\log\frac{h_{j+1}(X_{t_{j+1}}^{*},t_{j+1};\bm{X}_{j+1}^{*})}{h_{j}(X_{t_{j}}^{*},t_{j};\bm{X}_{j}^{*})}\right]={\mathsf{E}}\left[\int_{0}^{T}\frac{\|u_{t}^{*}\|^{2}}{2\sigma^{2}}\mathrm{d}t\right].

Now consider an arbitrary admissible control uu. As in the proof of Lemma 3, by Girsanov theorem, we have

1=\displaystyle 1=\; 𝖤[g(𝑿N)h0(X0,0)]\displaystyle{\mathsf{E}}\left[\frac{g(\bm{X}_{N})}{h_{0}(X_{0},0)}\right]
=\displaystyle=\; 𝖤[g(𝑿Nu)h0(X0u,0)exp{0TutσdWt0Tut22σ2dt}]\displaystyle{\mathsf{E}}\biggr{[}\frac{g(\bm{X}_{N}^{u})}{h_{0}(X_{0}^{u},0)}\exp\biggr{\{}-\int_{0}^{T}\frac{u_{t}}{\sigma}\mathrm{d}W_{t}-\int_{0}^{T}\frac{\|u_{t}\|^{2}}{2\sigma^{2}}\mathrm{d}t\biggr{\}}\biggr{]}
\displaystyle\geq\; exp{𝖤[logg(𝑿Nu)h0(X0u,0)0TutσdWt0Tut22σ2dt]},\displaystyle\exp\biggr{\{}{\mathsf{E}}\biggr{[}\log\frac{g(\bm{X}_{N}^{u})}{h_{0}(X_{0}^{u},0)}-\int_{0}^{T}\frac{u_{t}}{\sigma}\mathrm{d}W_{t}-\int_{0}^{T}\frac{\|u_{t}\|^{2}}{2\sigma^{2}}\mathrm{d}t\biggr{]}\biggr{\}},

which yields that

𝖤[0Tut22σ2dt]𝖤[logg(𝑿Nu)h0(X0u,0)]=𝖤[logg(𝑿Nu)]logh0(x,0)μ0(dx).\displaystyle{\mathsf{E}}\left[\int_{0}^{T}\frac{\|u_{t}\|^{2}}{2\sigma^{2}}\mathrm{d}t\right]\geq{\mathsf{E}}\left[\log\frac{g(\bm{X}_{N}^{u})}{h_{0}(X_{0}^{u},0)}\right]={\mathsf{E}}[\log g(\bm{X}^{u}_{N})]-\int\log h_{0}(x,0)\mu_{0}(\mathrm{d}x).

The asserted result thus follows. ∎

D. Proof of the Existence of Solution to SSB

We first recall the definition of the Hilbert metric [Chen et al., 2016a]. For KdK\subset\mathbb{R}^{d} and 1p1\leq p\leq\infty, let p(K)\mathcal{L}^{p}(K) denote the LpL^{p} space of functions defined on KK. Define

(D.1) +p(K)={fp(K):infxKf(x)>0},0p(K)={fp(K):infxKf(x)0}.\mathcal{L}_{+}^{p}(K)=\{f\in\mathcal{L}^{p}(K)\colon\inf_{x\in K}f(x)>0\},\quad\mathcal{L}_{0}^{p}(K)=\{f\in\mathcal{L}^{p}(K)\colon\inf_{x\in K}f(x)\geq 0\}.

Since 0p(K)\mathcal{L}_{0}^{p}(K) is a closed solid cone in the Banach space p(K)\mathcal{L}^{p}(K), we can define a Hilbert metric on it. For any x,y0p(K){0}x,y\in\mathcal{L}_{0}^{p}(K)\setminus\{0\} (where 0 denotes the constant function equal to 0), define

(D.2) M(x,y)=inf{c:xcy},m(x,y)=sup{c:cyx},\displaystyle M(x,y)=\inf\{c\colon x\preceq cy\},\quad m(x,y)=\sup\{c\colon cy\preceq x\},

where xyx\preceq y means yx0p(K)y-x\in\mathcal{L}_{0}^{p}(K) and we use the convention inf=\inf\emptyset=\infty. The Hilbert metric dHd_{H} on 0p(K){0}\mathcal{L}_{0}^{p}(K)\setminus\{0\} is defined by

(D.3) dH(x,y)=logM(x,y)m(x,y).d_{H}(x,y)=\log\frac{M(x,y)}{m(x,y)}.

Note that dHd_{H} is only a pseudometric on 0p(K){0}\mathcal{L}_{0}^{p}(K)\setminus\{0\}, but it is a metric on the corresponding space of rays.

Proof of Theorem 5.

The proof is adapted from Chen et al. [2016a, Proposition 1]. We will show the existence of strictly positive and integrable functions ρ0,ρT\rho_{0},\rho_{T} such that

(D.4) f0(y)\displaystyle f_{0}(y) =ρ0(y)KTp(x,T|y,0)ρT(x)dx,\displaystyle=\rho_{0}(y)\int_{K_{T}}p(x,T\,|\,y,0)\rho_{T}(x)\mathrm{d}x,
(D.5) fT(x)\displaystyle f_{T}(x) =ρT(x)(1+β)/βK0p(x,T|y,0)ρ0(y)dy.\displaystyle=\rho_{T}(x)^{(1+\beta)/\beta}\int_{K_{0}}p(x,T\,|\,y,0)\rho_{0}(y)\mathrm{d}y.

Let ψ^0+(K0)\hat{\psi}_{0}\in\mathcal{L}_{+}^{\infty}(K_{0}) be our guess for f0/ρ0f_{0}/\rho_{0}. We can update ψ^0\hat{\psi}_{0} as follows.

  1. (1)

    Set ρ^0(y)=f0(y)/ψ^0(y)\hat{\rho}_{0}(y)=f_{0}(y)/\hat{\psi}_{0}(y).

  2. (2)

    Set ψ^T(x)=(K0p(x,T|y,0)ρ^0(y)dy)β/(1+β)\hat{\psi}_{T}(x)=\left(\int_{K_{0}}p(x,T\,|\,y,0)\hat{\rho}_{0}(y)\mathrm{d}y\right)^{\beta/(1+\beta)}, which is an estimate for fTβ/(1+β)/ρTf_{T}^{\beta/(1+\beta)}/\rho_{T} by (D.5).

  3. (3)

    Set ρ^T(x)=fTβ/(1+β)/ψ^T(x)\hat{\rho}_{T}(x)=f_{T}^{\beta/(1+\beta)}/\hat{\psi}_{T}(x).

  4. (4)

    By (D.4), update the estimate for ψ0\psi_{0} by ψ^0new(y)=KTp(x,T|y,0)ρ^T(x)dx\hat{\psi}_{0}^{\rm{new}}(y)=\int_{K_{T}}p(x,T\,|\,y,0)\hat{\rho}_{T}(x)\mathrm{d}x.

Denote this updating scheme by ψ^0new=𝒪(ψ^0)\hat{\psi}_{0}^{\rm{new}}=\mathcal{O}(\hat{\psi}_{0}). Note that for any c(0,)c\in(0,\infty) and ψ+(K0)\psi\in\mathcal{L}_{+}^{\infty}(K_{0}),

(D.6) 𝒪(cψ)=cβ/(1+β)𝒪(ψ).\mathcal{O}(c\psi)=c^{\beta/(1+\beta)}\mathcal{O}(\psi).

In Lemma D5 below, we prove that 𝒪\mathcal{O} is a strict contraction mapping from +(K0)\mathcal{L}_{+}^{\infty}(K_{0}) to +(K0)\mathcal{L}_{+}^{\infty}(K_{0}) with respect to the Hilbert metric. To prove the existence of ρ0,ρT\rho_{0},\rho_{T}, it is sufficient to show that 𝒪\mathcal{O} has a fixed point ψ0+(K0)\psi_{0}\in\mathcal{L}_{+}^{\infty}(K_{0}), since we can set

ρ0(y)=f0(y)ψ0(y),ρT(x)=(fT(x)K0p(x,T|y,0)ρ0(y)dy)β1+β,\displaystyle\rho_{0}(y)=\frac{f_{0}(y)}{\psi_{0}(y)},\quad\rho_{T}(x)=\left(\frac{f_{T}(x)}{\int_{K_{0}}p(x,T\,|\,y,0)\rho_{0}(y)\,\mathrm{d}y}\right)^{\frac{\beta}{1+\beta}},

which must satisfy (D.4) and (D.5) and ρ001(K0),ρT01(KT)\rho_{0}\in\mathcal{L}_{0}^{1}(K_{0}),\rho_{T}\in\mathcal{L}_{0}^{1}(K_{T}) (see the proof of Lemma D5 for why ρ0,ρT\rho_{0},\rho_{T} are integrable). But note that we cannot directly apply the Banach fixed-point theorem to 𝒪\mathcal{O}.

To find the fixed point of 𝒪\mathcal{O}, we first consider its normalized version 𝒪~\tilde{\mathcal{O}}, defined by 𝒪~(ψ)=𝒪(ψ)/𝒪(ψ)2\tilde{\mathcal{O}}(\psi)=\mathcal{O}(\psi)/\|\mathcal{O}(\psi)\|_{2}. Let

(D.7) E={g+(K0):g2=1},E=\{g\in\mathcal{L}_{+}^{\infty}(K_{0})\colon\|g\|_{2}=1\},

denote the domain and range of 𝒪~\tilde{\mathcal{O}}. Since 𝒪\mathcal{O} is a strict contraction mapping on +(K0)\mathcal{L}_{+}^{\infty}(K_{0}) with respect to the Hilbert metric (which is invariant under scaling), 𝒪~\tilde{\mathcal{O}} is also a strict contraction mapping and thus continuous (with respect to the Hilbert metric) on EE. If g+(K0)g\in\mathcal{L}_{+}^{\infty}(K_{0}) is a fixed point of O~\tilde{O}, then 𝒪(g)21+βg+(K0)\|\mathcal{O}(g)\|_{2}^{1+\beta}g\in\mathcal{L}_{+}^{\infty}(K_{0}) is a fixed point of 𝒪\mathcal{O}, since

𝒪(𝒪(g)21+βg)=𝒪(g)2β𝒪(g)=𝒪(g)21+β𝒪~(g)=𝒪(g)21+βg,\displaystyle\mathcal{O}\left(\|\mathcal{O}(g)\|_{2}^{1+\beta}g\right)=\|\mathcal{O}(g)\|_{2}^{\beta}\mathcal{O}(g)=\|\mathcal{O}(g)\|_{2}^{1+\beta}\tilde{\mathcal{O}}(g)=\|\mathcal{O}(g)\|_{2}^{1+\beta}g,

where the first equality follows from (D.6). Moreover, since dHd_{H} is a metric on the rays, any other fixed point of 𝒪\mathcal{O} must have the form cgcg for some constant c>0c>0. But the same argument shows that cc must equal 𝒪(g)21+β\|\mathcal{O}(g)\|_{2}^{1+\beta}, and thus the fixed point of 𝒪\mathcal{O} is unique. This further yields the uniqueness of (ρ0,ρT)(\rho_{0},\rho_{T}).

So it only remains to prove that 𝒪~\tilde{\mathcal{O}} has a fixed point in +(K0)\mathcal{L}_{+}^{\infty}(K_{0}). The proof of this claim is essentially the same as that in Chen et al. [2016a]. Pick an arbitrary ψ(0)+(K0)\psi^{(0)}\in\mathcal{L}_{+}^{\infty}(K_{0}) and let g0=ψ(0)/ψ(0)2g_{0}=\psi^{(0)}/\|\psi^{(0)}\|_{2}. For k=1,2,k=1,2,\dots, define

(D.8) gk𝒪k(ψ(0))𝒪k(ψ(0))2=𝒪~(gk1),g_{k}\coloneqq\frac{\mathcal{O}^{k}(\psi^{(0)})}{\left\|\mathcal{O}^{k}(\psi^{(0)})\right\|_{2}}=\tilde{\mathcal{O}}(g_{k-1}),

where the second equality follows from (D.6). Each gkg_{k} is in EE and well-defined as 𝒪k(ψ(0))+(K0)2(K0)\mathcal{O}^{k}(\psi^{(0)})\in\mathcal{L}_{+}^{\infty}(K_{0})\subset\mathcal{L}_{2}(K_{0}). Since 𝒪~\tilde{\mathcal{O}} is a strict contraction mapping with respect to dHd_{H}, {gk}k0\{g_{k}\}_{k\geq 0} is a Cauchy sequence with respect to dHd_{H}. Using the inequality gkgl2edH(gk,gl)1\|g_{k}-g_{l}\|_{2}\leq e^{d_{H}(g_{k},g_{l})}-1 (see Chen et al. [2016a]), we find that {gk}k0\{g_{k}\}_{k\geq 0} is also Cauchy with respect to the L2L^{2}-norm, and thus there exists g02(K0)g\in\mathcal{L}^{2}_{0}(K_{0}) such that limkgkg2=0\lim_{k\rightarrow\infty}\|g_{k}-g\|_{2}=0 and g2=1\|g\|_{2}=1. Next, we argue that {gk}k0\{g_{k}\}_{k\geq 0} is uniformly bounded from below and above and also uniformly equicontinuous. To show the uniform boundedness, we first observe that gk2=1\|g_{k}\|_{2}=1 implies

(D.9) supxgk(x)1λ(K0)infxgk(x).\sup_{x}g_{k}(x)\geq\frac{1}{\sqrt{\lambda(K_{0})}}\geq\inf_{x}g_{k}(x).

Further, since p(x,T|y,0)p(x,T\,|\,y,0) is bounded above and bounded away from zero in (x,y)(x,y), recalling the last step in the construction of 𝒪\mathcal{O}, we have

(D.10) supx(𝒪ψ)(x)infx(𝒪ψ)(x)ϵ1\frac{\sup_{x}(\mathcal{O}\psi)(x)}{\inf_{x}(\mathcal{O}\psi)(x)}\leq\epsilon^{-1}

for any ψ+(K0)\psi\in\mathcal{L}_{+}^{\infty}(K_{0}), where ϵ(0,1]\epsilon\in(0,1] is some constant independent of ψ\psi. Combining (D.9) and (D.10), we get

(D.11) ϵλ(K0)infxgk(x)supxgk(x)1ϵλ(K0).\frac{\epsilon}{\sqrt{\lambda(K_{0})}}\leqslant\inf_{x}g_{k}(x)\leqslant\sup_{x}g_{k}(x)\leqslant\frac{1}{\epsilon\sqrt{\lambda(K_{0})}}.

Note that the uniform boundedness of {gk}\{g_{k}\} also implies g+(K0)g\in\mathcal{L}^{\infty}_{+}(K_{0}). The uniform equicontinuity of {gk}\{g_{k}\} can be proved by using the uniform continuity of the transition density pp. Finally, by the Arzelà–Ascoli theorem, there is a subsequence {gkn}\{g_{k_{n}}\} that converges uniformly; since gkgg_{k}\rightarrow g in the L2L^{2}-norm, the uniform limit must be gg, and gg is also uniformly continuous. This implies dH(gkn,g)0d_{H}(g_{k_{n}},g)\rightarrow 0 and thus dH(gk,g)0d_{H}(g_{k},g)\rightarrow 0. By the continuity (with respect to dHd_{H}) of 𝒪~\tilde{\mathcal{O}}, we can interchange the limit operation with 𝒪~\tilde{\mathcal{O}}, thereby establishing gg as the fixed point of 𝒪~\tilde{\mathcal{O}}. ∎
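
Although the proof above is purely analytical, the update 𝒪 described in steps (1)–(4) also suggests a simple numerical scheme when the dimension is small. The following is a minimal NumPy sketch on a grid, assuming b0b\equiv 0 and d=1d=1 so that the transition kernel is Gaussian; it is illustrative only and not part of the released code. By (D.6), the scale of the iterates contracts with factor β/(1+β)<1\beta/(1+\beta)<1, so iterating 𝒪 directly converges to its unique fixed point.

import numpy as np

def solve_schroedinger_system(f0, fT, y_grid, x_grid, beta, sigma=1.0, T=1.0, n_iter=500):
    # Grid-based sketch of the update psi_0 -> O(psi_0) in steps (1)-(4), for d = 1
    # and b = 0, so that p(x, T | y, 0) is the N(y, sigma^2 T) density.
    # f0, fT: values of the marginal densities on y_grid (K_0) and x_grid (K_T).
    dy = y_grid[1] - y_grid[0]
    dx = x_grid[1] - x_grid[0]
    p = np.exp(-(x_grid[:, None] - y_grid[None, :]) ** 2 / (2 * sigma ** 2 * T))
    p /= np.sqrt(2 * np.pi * sigma ** 2 * T)      # p[i, j] = p(x_i, T | y_j, 0)
    kappa = beta / (1.0 + beta)
    psi0 = np.ones_like(y_grid)
    for _ in range(n_iter):
        rho0 = f0 / psi0                          # step (1)
        psiT = (p @ rho0 * dy) ** kappa           # step (2)
        rhoT = fT ** kappa / psiT                 # step (3)
        psi0 = p.T @ rhoT * dx                    # step (4): psi0 <- O(psi0)
    rho0 = f0 / psi0
    rhoT = fT ** kappa / (p @ rho0 * dy) ** kappa
    return rho0, rhoT                             # approximate solution of (D.4)-(D.5)

One can check the output by verifying (D.4) numerically, i.e., that f0 is approximately rho0 * (p.T @ rhoT * dx) on the grid.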

Lemma D5.

For ψ+(K0)\psi\in\mathcal{L}_{+}^{\infty}(K_{0}), define

(D.12) (𝒪ψ)(x)=KTp(z,T|x,0)(fT(z)K0p(z,T|y,0)f0(y)ψ(y)1dy)β/(1+β)dz.(\mathcal{O}\psi)(x)=\int_{K_{T}}p(z,T\,|\,x,0)\left(\frac{f_{T}(z)}{\int_{K_{0}}p(z,T\,|\,y,0)f_{0}(y)\psi(y)^{-1}\mathrm{d}y}\right)^{\beta/(1+\beta)}\mathrm{d}z.

The operator 𝒪\mathcal{O} is a strict contraction mapping from +(K0)\mathcal{L}_{+}^{\infty}(K_{0}) to +(K0)\mathcal{L}_{+}^{\infty}(K_{0}) with respect to the Hilbert metric.

Proof.

We can express the operator 𝒪\mathcal{O} by 𝒪=T𝒫0\mathcal{O}=\mathcal{E}_{T}\circ\mathcal{P}\circ\mathcal{I}\circ\mathcal{E}_{0}\circ\mathcal{I}, and define ,0,T,𝒫\mathcal{I},\mathcal{E}_{0},\mathcal{E}_{T},\mathcal{P} by

(D.13) :+(K)+(K),(φ)(x)=φ(x)1for K=K0,KT,\displaystyle\mathcal{I}\colon\mathcal{L}_{+}^{\infty}(K)\longrightarrow\mathcal{L}_{+}^{\infty}(K),\quad\quad\;(\mathcal{I}\varphi)(x)=\varphi(x)^{-1}\quad\text{for }K=K_{0},K_{T},
0:+(K0)+(KT),(0φ)(x)=K0p(x,T|y,0)f0(y)φ(y)dy,\displaystyle\mathcal{E}_{0}\colon\mathcal{L}_{+}^{\infty}(K_{0})\longrightarrow\mathcal{L}_{+}^{\infty}(K_{T}),\quad(\mathcal{E}_{0}\varphi)(x)=\int_{K_{0}}p(x,T\,|\,y,0)f_{0}(y)\varphi(y)\mathrm{d}y,
T:+(KT)+(K0),(Tφ)(x)=KTp(z,T|x,0)fT(z)β/(1+β)φ(z)dz,\displaystyle\mathcal{E}_{T}\colon\mathcal{L}_{+}^{\infty}(K_{T})\longrightarrow\mathcal{L}_{+}^{\infty}(K_{0}),\quad(\mathcal{E}_{T}\varphi)(x)=\int_{K_{T}}p(z,T\,|\,x,0)f_{T}(z)^{\beta/(1+\beta)}\varphi(z)\mathrm{d}z,
𝒫:+(KT)+(KT),(𝒫φ)(x)=φ(x)β/(1+β).\displaystyle\mathcal{P}\colon\mathcal{L}_{+}^{\infty}(K_{T})\longrightarrow\mathcal{L}_{+}^{\infty}(K_{T}),\quad(\mathcal{P}\varphi)(x)=\varphi(x)^{\beta/(1+\beta)}.

It is worth explaining how the ranges of these operators are determined. First, if φ+(K)\varphi\in\mathcal{L}_{+}^{\infty}(K), it is clear that φ+(K)\mathcal{I}\varphi\in\mathcal{L}_{+}^{\infty}(K) and 𝒫φ+(K)\mathcal{P}\varphi\in\mathcal{L}_{+}^{\infty}(K). For 0\mathcal{E}_{0}, since we assume K0K_{0} is compact and p(x,T|y,0)p(x,T\,|\,y,0) is continuous in (x,y)(x,y), for any φ+(K0)\varphi\in\mathcal{L}_{+}^{\infty}(K_{0}) there exists ϵ>0\epsilon>0 such that

ϵ=ϵK0f0(y)dy(0φ)(x)ϵ1K0f0(y)dy=ϵ1.\displaystyle\epsilon=\epsilon\int_{K_{0}}f_{0}(y)\mathrm{d}y\leq(\mathcal{E}_{0}\varphi)(x)\leq\epsilon^{-1}\int_{K_{0}}f_{0}(y)\mathrm{d}y=\epsilon^{-1}.

The argument for T\mathcal{E}_{T} is similar, and note that by Hölder’s inequality,

KTfT(z)β/(1+β)dzλ(KT)1/(1+β)<.\displaystyle\int_{K_{T}}f_{T}(z)^{\beta/(1+\beta)}\mathrm{d}z\leq\lambda(K_{T})^{1/(1+\beta)}<\infty.

Now we prove that 𝒫\mathcal{P} is a strict contraction. Let ψ1,ψ2+(KT)\psi_{1},\psi_{2}\in\mathcal{L}_{+}^{\infty}(K_{T}). By the definition given in (D.2),

M\left(\mathcal{P}\left(\psi_{1}\right),\mathcal{P}\left(\psi_{2}\right)\right)=M\left(\psi_{1},\psi_{2}\right)^{\beta/(1+\beta)}\quad\text{and}\quad m\left(\mathcal{P}\left(\psi_{1}\right),\mathcal{P}\left(\psi_{2}\right)\right)=m\left(\psi_{1},\psi_{2}\right)^{\beta/(1+\beta)},

which implies

(D.14) d_{H}\left(\mathcal{P}\left(\psi_{1}\right),\mathcal{P}\left(\psi_{2}\right)\right)=\log\left(\frac{M\left(\mathcal{P}\left(\psi_{1}\right),\mathcal{P}\left(\psi_{2}\right)\right)}{m\left(\mathcal{P}\left(\psi_{1}\right),\mathcal{P}\left(\psi_{2}\right)\right)}\right)=\frac{\beta}{1+\beta}d_{H}\left(\psi_{1},\psi_{2}\right)<d_{H}\left(\psi_{1},\psi_{2}\right),

since β<\beta<\infty. Chen et al. [2016a] showed that the operators 0,T\mathcal{E}_{0},\mathcal{E}_{T} are strict contractions using Birkhoff’s theorem, and that the operator \mathcal{I} is an isometry (all with respect to the Hilbert metric). Hence, 𝒪\mathcal{O} is a strict contraction. Note that for our problem, since 𝒫\mathcal{P} is a strict contraction, we actually only need 0,T\mathcal{E}_{0},\mathcal{E}_{T} to be contractions (not necessarily strict). ∎

E. Proof of Lemma 7

We consider a more general setting. Assume that XtX_{t} is given by

(E.1) X0=x0, and dXt=b(Xt,t)dt+σdWt,X_{0}=x_{0},\text{ and }\mathrm{d}X_{t}=b(X_{t},t)\mathrm{d}t+\sigma\mathrm{d}W_{t},

where $b$ satisfies Assumption 1. The density function of $\mathcal{L}aw(X_{T})$ is $p(x,T\,|\,x_{0},0)$. Let $X^{\mathrm{ref}}$ be given by

(E.2) \mathrm{d}X_{t}^{\mathrm{ref}}=b^{\mathrm{ref}}(X_{t}^{\mathrm{ref}},t)\,\mathrm{d}t+\sigma\,\mathrm{d}W_{t},
(E.3) \text{where }b^{\mathrm{ref}}(x,t)=b(x,t)+\sigma^{2}\nabla\log h^{\mathrm{ref}}(x,t),
(E.4) h^{\mathrm{ref}}(x,t)={\mathsf{E}}\left[\frac{f_{\mathrm{ref}}(X_{T})}{p(X_{T},T\,|\,x_{0},0)}\,\Big{|}\,X_{t}=x\right].

Theorem 1 implies that $\mathcal{L}aw(X_{T}^{\mathrm{ref}})=\mu_{\mathrm{ref}}$. Let $X^{\mathrm{obj}}$ be given by

(E.5) \mathrm{d}X^{\mathrm{obj}}_{t}=b^{\mathrm{obj}}(X_{t}^{\mathrm{obj}},t)\,\mathrm{d}t+\sigma\,\mathrm{d}W_{t},
(E.6) \text{where }b^{\mathrm{obj}}(x,t)=b^{\mathrm{ref}}(x,t)+\sigma^{2}\nabla\log h^{\mathrm{obj}}(x,t),
(E.7) h^{\mathrm{obj}}(x,t)={\mathsf{E}}\left[\left(\frac{f_{\mathrm{obj}}(X_{T}^{\mathrm{ref}})}{f_{\mathrm{ref}}(X_{T}^{\mathrm{ref}})}\right)^{\beta/(1+\beta)}\,\Big{|}\,X_{t}^{\mathrm{ref}}=x\right],

which is the solution to Problem 2 with $X^{\mathrm{ref}}$ being the reference process and $\mu_{\mathrm{obj}}$ being the target distribution. We now prove that

(E.8) b^{\mathrm{obj}}(x,t)=b(x,t)+\sigma^{2}\nabla\log{\mathsf{E}}\left[\frac{f_{\mathrm{ref}}(X_{T})^{\frac{1}{1+\beta}}f_{\mathrm{obj}}(X_{T})^{\frac{\beta}{1+\beta}}}{p(X_{T},T\,|\,x_{0},0)}\,\Big{|}\,X_{t}=x\right].

That is, $X^{\mathrm{obj}}$ is also the solution to Problem 1 with $X$ being the reference process and $\mu^{*}$ being the target distribution, where $\mu^{*}$ has un-normalized density $f_{\mathrm{ref}}^{1/(1+\beta)}f_{\mathrm{obj}}^{\beta/(1+\beta)}$. Once this is proved, Lemma 7 follows as a special case with $b\equiv 0$ and $x_{0}=0$.

Proof.

To prove (E.8), we use part (ii) of Theorem B3. Define the martingale

Z^{\mathrm{ref}}_{t}=\frac{h^{\mathrm{ref}}(X_{t},t)}{h^{\mathrm{ref}}(X_{0},0)}.

Let ${\mathsf{Q}}^{\mathrm{ref}}$ be the probability measure given by $\mathrm{d}{\mathsf{Q}}^{\mathrm{ref}}=Z^{\mathrm{ref}}_{T}\,\mathrm{d}{\mathsf{P}}$. By part (ii) of Theorem B3, the law of $X$ under ${\mathsf{Q}}^{\mathrm{ref}}$ is the same as the law of $X^{\mathrm{ref}}$ under ${\mathsf{P}}$. Applying this change of measure to (E.7) yields

\begin{aligned}
h^{\mathrm{obj}}(x,t)&={\mathsf{E}}\left[\left(\frac{f_{\mathrm{obj}}(X_{T})}{f_{\mathrm{ref}}(X_{T})}\right)^{\beta/(1+\beta)}\frac{Z_{T}^{\mathrm{ref}}}{Z_{t}^{\mathrm{ref}}}\,\Big{|}\,X_{t}=x\right]\\
&=h^{\mathrm{ref}}(x,t)^{-1}\,{\mathsf{E}}\left[\left(\frac{f_{\mathrm{obj}}(X_{T})}{f_{\mathrm{ref}}(X_{T})}\right)^{\beta/(1+\beta)}h^{\mathrm{ref}}(X_{T},T)\,\Big{|}\,X_{t}=x\right]\\
&=h^{\mathrm{ref}}(x,t)^{-1}\,{\mathsf{E}}\left[\frac{f_{\mathrm{ref}}(X_{T})^{\frac{1}{1+\beta}}f_{\mathrm{obj}}(X_{T})^{\frac{\beta}{1+\beta}}}{p(X_{T},T\,|\,x_{0},0)}\,\Big{|}\,X_{t}=x\right],
\end{aligned}
where the last equality uses $h^{\mathrm{ref}}(z,T)=f_{\mathrm{ref}}(z)/p(z,T\,|\,x_{0},0)$, which follows from (E.4).

Since

\begin{aligned}
b^{\mathrm{obj}}(x,t)&=b^{\mathrm{ref}}(x,t)+\sigma^{2}\nabla\log h^{\mathrm{obj}}(x,t)\\
&=b(x,t)+\sigma^{2}\nabla\log h^{\mathrm{ref}}(x,t)+\sigma^{2}\nabla\log h^{\mathrm{obj}}(x,t)\\
&=b(x,t)+\sigma^{2}\nabla\log\left(h^{\mathrm{ref}}(x,t)\,h^{\mathrm{obj}}(x,t)\right),
\end{aligned}

we obtain (E.8). ∎
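For intuition, the following Python sketch checks the identity (E.8) numerically in a one-dimensional special case with $b\equiv 0$, $\sigma=1$, $x_{0}=0$ and Gaussian choices of $f_{\mathrm{ref}},f_{\mathrm{obj}}$ (all illustrative assumptions). The transition kernel of $X^{\mathrm{ref}}$ is written in its $h$-transform form, and the drift obtained by composing the two $h$-transforms is compared with the single $h$-transform toward the geometric mixture.

```python
import numpy as np
from scipy.stats import norm

T, sigma, beta = 1.0, 1.0, 2.0
t, x = 0.4, 0.3                      # an arbitrary space-time point for the check
z = np.linspace(-8.0, 8.0, 4001)     # integration grid for the terminal state
dz = z[1] - z[0]

# Illustrative Gaussian densities for mu_ref and mu_obj (assumptions).
f_ref = norm.pdf(z, loc=-1.0, scale=0.8)
f_obj = norm.pdf(z, loc=1.5, scale=0.6)

def p(zz, s, y, r):                  # Brownian transition density p(z, s | y, r), b = 0
    return norm.pdf(zz, loc=y, scale=sigma * np.sqrt(s - r))

def h_ref(y, s):                     # cf. (E.4) with x0 = 0
    return np.sum(f_ref / p(z, T, 0.0, 0.0) * p(z, T, y, s)) * dz

def h_obj(y, s):                     # cf. (E.7), via the h-transform kernel of X^ref
    ratio = (f_obj / f_ref) ** (beta / (1 + beta))
    kernel_ref = p(z, T, y, s) * f_ref / p(z, T, 0.0, 0.0) / h_ref(y, s)
    return np.sum(ratio * kernel_ref) * dz

def h_star(y, s):                    # h-function toward the geometric mixture in (E.8)
    g = f_ref ** (1 / (1 + beta)) * f_obj ** (beta / (1 + beta))
    return np.sum(g / p(z, T, 0.0, 0.0) * p(z, T, y, s)) * dz

def dlog(h, y, s, eps=1e-4):         # numerical spatial gradient of log h
    return (np.log(h(y + eps, s)) - np.log(h(y - eps, s))) / (2 * eps)

# Drift of X^obj computed two ways; they should agree up to discretization error.
b_obj_composed = sigma**2 * (dlog(h_ref, x, t) + dlog(h_obj, x, t))
b_obj_direct = sigma**2 * dlog(h_star, x, t)
print(b_obj_composed, b_obj_direct)
```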

F. MNIST Example

Figure F4 visualizes the 50 images in the data set $\mathcal{D}_{\mathrm{obj}}$, which are obtained by adding Gaussian noise to the original images in MNIST. Figure F5 shows the new images generated by the two-stage Schrödinger bridge algorithm of Wang et al. [2021] using only $\mathcal{D}_{\mathrm{obj}}$ as the input.

Table F2 shows the inception scores [Salimans et al., 2016] for our generated images and for the images of digit 8 in MNIST. The score of our samples for $\beta=1.5$ is slightly higher than that of the clean digit-8 images in MNIST, suggesting that our generated images of digit 8 exhibit a greater degree of variability than those in MNIST. Additionally, when $\beta=100$, our score aligns closely with that of $\mathcal{D}_{\mathrm{obj}}$ (i.e., the noisy digit-8 images from MNIST), indicating that our method can recover the images in the target data set by using a large $\beta$. For small values of $\beta$, the scores of our generated images are higher than those of the digit-8 images in MNIST, primarily because the reference data set (consisting of the other digits) has greater variability and complexity. However, as shown in Figure 1, when $\beta$ is small, we do not necessarily obtain images of digit 8.
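For completeness, the inception score of Salimans et al. [2016] is $\exp\{{\mathsf{E}}_{x}\,\mathrm{KL}(p(y\,|\,x)\,\|\,p(y))\}$, computed from a classifier's predicted class probabilities. The sketch below shows this computation with the usual split-based mean and standard deviation; the classifier producing the probabilities (for MNIST, typically a pretrained digit classifier rather than Inception-v3) is left unspecified and is an assumption of the example.

```python
import numpy as np

def inception_score(probs, n_splits=10, eps=1e-12):
    """Compute exp(E_x KL(p(y|x) || p(y))) from an (n_samples, n_classes)
    array of predicted class probabilities; mean and sd are taken over splits."""
    scores = []
    for chunk in np.array_split(probs, n_splits):
        p_y = chunk.mean(axis=0, keepdims=True)                           # marginal p(y)
        kl = np.sum(chunk * (np.log(chunk + eps) - np.log(p_y + eps)), axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))

# Usage: probs = classifier predictions on ~5K generated images, shape (5000, 10).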

We also use t-SNE plots to visually characterize the distribution of our generated images. Figure F6 illustrates that our samples come from the geometric mixture distribution interpolating between the noisy images of digit 8 and the clean images of the other digits. Figure F7 shows that the SSB samples with $\beta=1.5$ are positioned closer to the clean images of digit 8 than the samples obtained with $\beta=100$.
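A minimal sketch of how such a t-SNE visualization can be produced with scikit-learn is given below; the placeholder arrays, grouping, and plotting details are illustrative and are not the exact pipeline used for Figures F6 and F7.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Placeholder arrays standing in for flattened 28x28 images (replace with real data):
# generated SSB samples, the noisy digit-8 images in D_obj, and clean images of other digits.
groups = {
    "SSB samples": rng.normal(size=(500, 784)),
    "noisy 8s (D_obj)": rng.normal(size=(50, 784)),
    "clean other digits": rng.normal(size=(500, 784)),
}

X = np.vstack(list(groups.values()))
labels = np.concatenate([[name] * len(g) for name, g in groups.items()])

# Joint two-dimensional t-SNE embedding of all images.
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)

for name in groups:
    mask = labels == name
    plt.scatter(emb[mask, 0], emb[mask, 1], s=4, label=name)
plt.legend()
plt.title("t-SNE of SSB samples vs. MNIST images (illustrative)")
plt.show()
```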

In our code, we use the neural network model of Song and Ermon [2019] for training the score functions and that of Wang et al. [2021] for training the density ratio function.

Figure F4. Samples in $\mathcal{D}_{\mathrm{obj}}$.
Figure F5. Samples Generated by the Algorithm of Wang et al. [2021] Using Only $\mathcal{D}_{\mathrm{obj}}$.
Figure F6. t-SNE Plot Illustrating the Geometric Mixture Distribution with $\beta=1.5$.
Figure F7. t-SNE Plot Comparing SSB Samples with Clean Images in MNIST.
Table F2. Inception Scores (mean $\pm$ sd) for SSB Samples and the Images in MNIST.

Data sets (sample size $\approx 5$K) | Inception score
SSB with $\beta=0$ | 6.70 $\pm$ 0.20
SSB with $\beta=0.25$ | 6.59 $\pm$ 0.15
SSB with $\beta=0.7$ | 5.12 $\pm$ 0.11
SSB with $\beta=1.5$ | 3.51 $\pm$ 0.08
SSB with $\beta=4$ | 3.65 $\pm$ 0.04
SSB with $\beta=100$ | 2.87 $\pm$ 0.04
Digit 8 in MNIST (clean) | 3.29 $\pm$ 0.04
Digit 8 in MNIST (noisy) | 2.96 $\pm$ 0.04