Soft-constrained Schrödinger Bridge: a Stochastic Control Approach
Abstract.
Schrödinger bridge can be viewed as a continuous-time stochastic control problem where the goal is to find an optimally controlled diffusion process whose terminal distribution coincides with a pre-specified target distribution. We propose to generalize this problem by allowing the terminal distribution to differ from the target but penalizing the Kullback-Leibler divergence between the two distributions. We call this new control problem soft-constrained Schrödinger bridge (SSB). The main contribution of this work is a theoretical derivation of the solution to SSB, which shows that the terminal distribution of the optimally controlled process is a geometric mixture of the target and some other distribution. This result is further extended to a time series setting. One application is the development of robust generative diffusion models. We propose a score matching-based algorithm for sampling from geometric mixtures and showcase its use via a numerical example for the MNIST data set.
1. Introduction
1.1. Schrödinger Bridge and Its Applications
Let be a diffusion process over the finite time interval with initial distribution . Schrödinger bridge seeks an optimal steering of towards a pre-specified terminal distribution such that the resulting controlled process is closest to in terms of Kullback-Leibler (KL) divergence (Schrödinger, 1931, 1932). Under certain regularity conditions, the optimally controlled process is another diffusion with the same diffusion coefficients as but an additional drift term. This result has been obtained via different approaches and at varying levels of generality, and among the seminal works are Fortet (1940); Beurling (1960); Jamison (1975); Föllmer (1988); Dai Pra (1991). For comprehensive reviews detailing the historical development, we refer readers to Léonard (2013) and Chen et al. (2021c).
The recent generative modeling literature has seen a surge in the use of Schrödinger bridge. In these applications, is typically some distribution that is easy to sample from, and is the unknown distribution of a given data set. By numerically approximating the solution to the Schrödinger bridge problem, one can generate samples from (i.e., synthetic data points that resemble the original data set). One such algorithm is presented by De Bortoli et al. (2021) and Vargas et al. (2021), who proposed to calculate the Schrödinger bridge by approximating the iterative proportional fitting procedure (Deming and Stephan, 1940) (the method of De Bortoli et al. (2021) estimates the drift using score matching and neural networks while that of Vargas et al. (2021) uses maximum likelihood and Gaussian processes). Concurrently, Wang et al. (2021) developed a two-stage method where an auxiliary Schrödinger bridge is run first to generate samples from a smoothed version of , and the second Schrödinger bridge transports these samples towards . Both approaches generalize the denoising diffusion model methods of Ho et al. (2020) and Song et al. (2021). Some other recent developments in this area include Chen et al. (2021a); Song (2022); Peluchetti (2023); Richter et al. (2023); Winkler et al. (2023); Hamdouche et al. (2023); Vargas et al. (2024).
Though not the focus of this work, Schrödinger bridge sampling methods can also be used when samples from are not available but is known up to a normalizing constant; see, e.g., Huang et al. (2021); Zhang and Chen (2021); Vargas et al. (2022), and see Heng et al. (2024) for a recent review. For the connections between Schrödinger bridge, optimal transport and variational inference, see, e.g., Chen et al. (2016b, 2021b); Tzen and Raginsky (2019).
1.2. Overview of This Work
The main contribution of this paper is the theoretical development of a generalized Schrödinger bridge problem, which we call soft-constrained Schrödinger bridge (SSB). We take the stochastic control approach that was employed by Mikami (1990); Dai Pra (1991); Pra and Pavon (1990) for studying the original Schrödinger bridge problem (see Problem 1). In SSB, the terminal distribution of the controlled process does not need to precisely match but needs to be close to in terms of KL divergence. Formally, SSB differs from the original problem in that we replace the hard constraint on the terminal distribution with an additional cost term, parameterized by , in the objective function to be minimized (see Problem 2). A larger forces the terminal distribution of the controlled process to be closer to . We rigorously find the solution to SSB and the expression for the drift term of the optimally controlled process. We show that SSB generalizes Schrödinger bridge in the sense that as , the solution to the former coincides with the solution to the latter. An important implication of our results is that the terminal distribution of the controlled process should be a geometric mixture of and some other distribution; when is a Dirac measure, the other distribution is (i.e., the terminal distribution of the uncontrolled process). We further extend our results to a time series generalization of SSB, where we are interested in modifying the joint distribution of for .
SSB can be used as a theoretical foundation for developing more flexible and robust sampling methods. First, when the KL divergence between and is infinite, Schrödinger bridge does not admit a solution, while SSB always does. A toy example illustrating the consequences of this result is given in Example 1. More importantly, acts as a regularization parameter preventing the algorithm from overfitting to , which is crucial for some generative modeling tasks such as fine-tuning with limited data (Moon et al., 2022). In such applications, contains information from a small or noisy data set, and one wants to improve the sample quality by harnessing knowledge from a large high-quality reference data set. To achieve this, we can train the uncontrolled process in SSB using the reference data set and then tune the value of . We present a simple normal mixture example illustrating the effect of (see Example 4). For a more realistic example in generative modeling of images, we use the MNIST data set and consider the task of generating new images of digit 8. We assume that the training data set only has 50 noisy images of digit 8, but we can use the data set of all the other digits as reference. As suggested by our theoretical findings, we can train a Schrödinger bridge targeting a geometric mixture of the distributions of the two data sets. Such a Schrödinger bridge cannot be learned by existing methods, and to address this, we propose a new score matching algorithm that utilizes importance sampling. We show that this approach yields high-quality images of digit 8 when is properly chosen.
The paper is structured as follows. In Section 2, we present the stochastic control formulation of the SSB problem, and we derive its solution when is a Dirac measure. The solution to SSB for general initial conditions is obtained in Section 3, which involves solving a generalized Schrödinger system. Section 4 extends the results to the time series setting. In Section 5, we present a new algorithm for robust generative modeling and demonstrate its use via the MNIST data set. Proofs and auxiliary results are deferred to the Appendix.
1.3. Related Literature
Our development of SSB builds upon the work of Dai Pra (1991), which formulates Schrödinger bridge as a stochastic control problem and derives the solution using the logarithmic transformation technique pioneered by Fleming (Fleming, 1977, 2005; Fleming and Rishel, 2012) and the result of Jamison (1975). The time series SSB problem is a generalization of the work of Hamdouche et al. (2023), who extended the original Schrödinger bridge problem to the time series setting but only considered the special case where is a Dirac measure. Pavon and Wakolbinger (1991) and Blaquiere (1992) adopted an alternative stochastic control approach to studying Schrödinger bridge, which is rooted in the same logarithmic transformation and was also considered in Tzen and Raginsky (2019) and Berner et al. (2022). This approach can be applied to the SSB problem as well, but it requires the use of a verification theorem.
Motivated by robust network routing, a discrete version of the SSB problem was proposed and solved in Chen et al. (2019), where is a discrete-time non-homogeneous Markov chain with finite state space. The techniques used in that work are very different from ours, and to our knowledge, the continuous-time SSB problem has not been addressed in the literature.
2. Problem Formulation
Let be two probability distributions on such that and , where denotes the Lebesgue measure. Denote the density of by . Let be a probability space, on which we define a standard -dimensional Brownian motion and a random vector that is independent of and has distribution . We will always use to denote a weak solution to the following stochastic differential equation (SDE)
(2.1) |
for ,where and . Given a control , define the controlled diffusion process by and
(2.2) |
We say a control is admissible if (i) is measurable with respect to , (ii) the SDE (2.2) admits a weak solution, and (iii) , where denotes the norm. Denote the set of all admissible controls by . Note that the initial distributions of both and are always fixed to be . For ease of presentation, throughout the paper we adopt the following regularity assumption on , which was also used by Jamison (1975); Dai Pra (1991):
Assumption 1.
For each , is bounded and continuous in and is Hölder continuous in , uniformly with respect to .
Under Assumption 1, has a transition density function ; that is, for any , , and Borel set in ,
(2.3) |
We will use as a shorthand for . Moreover, by Girsanov's theorem, Assumption 1 implies that the probability measures induced by and are equivalent, where . Hence, for any , is strictly positive and (i.e., the two measures are equivalent). The role of Assumption 1 in our theoretical results will be further discussed in Remark 4.
Schrödinger bridge aims to find a minimum-energy modification of the dynamics of so that its terminal distribution coincides with a pre-specified distribution , where “energy” is measured by KL divergence. Given -finite measures and such that , we use to denote the KL divergence; if , define . Dai Pra (1991) considered the following stochastic control formulation of Schrödinger bridge.
Problem 1 (Schrödinger bridge).
Remark 1.
Let (resp. ) denote the probability measure induced by (resp. ) on the space of continuous functions on . For any admissible control , Girsanov's theorem implies that .
Problem 1 has been well studied in the literature. In the special case where is a Dirac measure, the solution can be succinctly described, and is just the KL divergence between two probability distributions; we recall this in Theorem 1 below. In this paper, always denotes differentiation with respect to .
Theorem 1 (Dai Pra (1991)).
Remark 2.
If , then Problem 1 does not admit a solution in the sense that no admissible control can yield .
We propose a relaxed stochastic control formulation of the Schrödinger bridge problem by allowing the distribution of to be different from .
Problem 2 (Soft-constrained Schrödinger bridge).
For , find , where
(2.6) |
and find the optimal control such that .
Problem 2 (i.e., the SSB problem) replaces the hard constraint in Problem 1 with a soft constraint parameterized by . When , it is clear that the optimal control for Problem 2 is . As , the law of is forced to agree with , and we will see in Theorem 2 that the optimal control for Problem 2 converges to that for Problem 1.
Before we try to solve Problem 2 in full generality, we make a remark on how Problem 1 can be simplified. In the literature, Problem 1 is often called the dynamic Schrödinger bridge problem. Since the objective function (2.4) is the KL divergence between the laws of the controlled and uncontrolled processes (recall Remark 1), we can use an additive property of KL divergence to reduce Problem 1 to a static version (Léonard, 2013), where one only needs to find a joint distribution with marginals and that minimizes . Although this property is not directly used in this paper, the insight from this observation underpins our stochastic control analysis of SSB. In particular, when is a Dirac measure, the solution to Problem 2 can be obtained by a simple argument which reduces the problem to optimizing over the distribution of instead of over the distribution of the whole process .
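To make the additive property concrete, the following display records the chain rule for KL divergence on path space in self-contained notation; the symbols below (the path laws of the controlled and uncontrolled processes and the joint laws of their endpoints) are our own labels for the quantities discussed above, not the paper's original notation.

```latex
% Chain rule (disintegration) of the KL divergence on path space, conditioning
% both laws on the pair of endpoints (X_0, X_T):
D_{\mathrm{KL}}\bigl(\mathbb{P}^u \,\|\, \mathbb{P}\bigr)
  = D_{\mathrm{KL}}\bigl(\pi^u \,\|\, \pi\bigr)
  + \int D_{\mathrm{KL}}\bigl(\mathbb{P}^u(\,\cdot \mid X_0 = x,\, X_T = y)
        \,\|\, \mathbb{P}(\,\cdot \mid X_0 = x,\, X_T = y)\bigr)\, \pi^u(\mathrm{d}x,\mathrm{d}y),
% where \pi^u and \pi denote the joint laws of (X_0, X_T) under the controlled
% and uncontrolled processes, respectively.
```

The second term is driven to zero by copying the bridges of the uncontrolled process, which is why the dynamic problem collapses to a static optimization over the joint endpoint distribution.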
Theorem 2.
Proof.
If , then cannot be optimal since . Now fix an arbitrary such that . Letting be as given in (2.4), we have
where the inequality follows from Theorem 1. Since has density and has density , we can apply Lemma B2 in the Appendix to get where the infimum is taken over all probability measures on and is attained at such that
The convergence of as also follows from Lemma B2.
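Since the displayed formulas above are not reproduced here, the following self-contained computation (with our own symbols: densities p and q and penalty weight β > 0) records the finite-dimensional KL minimization that Lemma B2 resolves and that produces the geometric mixture in Theorem 2.

```latex
% Minimizing a weighted sum of two KL divergences over probability densities \nu:
D_{\mathrm{KL}}(\nu \,\|\, p) + \beta\, D_{\mathrm{KL}}(\nu \,\|\, q)
  = (1+\beta)\, D_{\mathrm{KL}}(\nu \,\|\, r) + \mathrm{const},
\qquad
r(x) \;\propto\; p(x)^{\frac{1}{1+\beta}}\, q(x)^{\frac{\beta}{1+\beta}},
% so the unique minimizer is the geometric mixture r; it tends to q as
% \beta \to \infty and to p as \beta \to 0, and it remains well defined even
% when D_KL(q || p) is infinite, provided the product above is integrable.
```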
Remark 3.
When is constant, the transition density is easy to evaluate. If is known up to a normalizing constant, one can then use the Monte Carlo sampling scheme proposed by Huang et al. (2021) to approximate the drift and simulate the controlled diffusion process (2.2). We describe this method and generalize it using importance sampling techniques in Appendix A. More sophisticated score-based sampling schemes can also be applied (Heng et al., 2024).
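For illustration, here is a minimal numpy sketch of such a Monte Carlo scheme, under the simplifying assumptions that the uncontrolled process is a standard Brownian motion started at the origin on [0, T] and that the desired terminal density is supplied through `log_g`, the log of its unnormalized ratio to the N(0, T I) density of the uncontrolled endpoint (for SSB this would be the geometric mixture of Theorem 2). The names and interface are ours, not the exact scheme of Huang et al. (2021).

```python
import numpy as np

def drift_estimate(x, t, log_g, T=1.0, M=1000, rng=None):
    """Monte Carlo estimate of the h-transform drift
        u(x, t) = grad_x log E[g(x + sqrt(T - t) Z)],  Z ~ N(0, I),
    where g = exp(log_g) is the (unnormalized) ratio between the desired
    terminal density and the N(0, T I) density of the uncontrolled endpoint.
    Uses grad_x E[g(x + s Z)] = E[Z g(x + s Z)] / s (Gaussian integration by parts).
    log_g must accept a batch of shape (M, d) and return M values.
    """
    rng = np.random.default_rng() if rng is None else rng
    s = np.sqrt(T - t)
    Z = rng.standard_normal((M, x.shape[0]))
    w = np.exp(log_g(x + s * Z))               # g evaluated at perturbed points
    grad = (Z * w[:, None]).mean(axis=0) / s   # estimates grad_x E[g(x + s Z)]
    return grad / w.mean()

def simulate_ssb(log_g, d=2, T=1.0, n_steps=200, rng=None):
    """Euler-Maruyama simulation of the controlled SDE with the estimated drift."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x = np.zeros(d)                            # Dirac initial condition at the origin
    for k in range(n_steps):
        t = k * dt
        x = x + drift_estimate(x, t, log_g, T=T, rng=rng) * dt \
              + np.sqrt(dt) * rng.standard_normal(d)
    return x                                   # approximate draw from the terminal law
```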
One difference between Theorems 1 and 2 is that the condition is not required for solving Problem 2. We give a toy example illustrating the importance of this difference.
Example 1.
Consider , , and being the Cauchy distribution. Then, is just the normal distribution with mean zero and covariance , and we have . Problem 1 does not admit a solution in this case, but Problem 2 has a solution for any and the associated optimal control has finite energy cost. In Appendix A.2, we simulate the solution to Problem 2 with being the Cauchy distribution. We find that when , the numerical scheme is unstable and fails to capture the heavy tails of the Cauchy distribution. In contrast, using a finite value of significantly stabilizes the algorithm.
3. Solution to Soft-constrained Schrödinger Bridge
When is not a Dirac measure, the solution to the Schrödinger bridge problem is more difficult to describe and is characterized by the so-called Schrödinger system (Léonard, 2013, Theorem 2.8). In this section, we prove that the solution to Problem 2 can be obtained in a similar way, but the Schrödinger system for Problem 2 now depends on .
The main idea behind our approach is to first show that the optimal control must belong to a small class parameterized by a function and then use an argument similar to the proof of Theorem 2 to determine the choice of . To introduce this class of controls, for each measurable function , let denote the function on given by
(3.1) |
where
(3.2) |
Let denote the set of all controls that are constructed by this logarithmic transformation. We present in Theorem B3 in the Appendix some well-known results about the controlled SDE (2.2) with ; in particular, part (iv) of the theorem shows that such a process is a Doob's -path process (Doob, 1959). Theorem B3 is largely adapted from Theorem 2.1 of Dai Pra (1991), and similar results are extensively documented in the literature (Jamison, 1975; Fleming and Sheu, 1985; Föllmer, 1988; Doob, 1984; Fleming, 2005). We can now prove a key lemma.
Lemma 3.
Proof.
See Appendix C. ∎
Remark 4.
Assumption 1 guarantees that the SDE (2.1) admits a unique (in law) weak solution. More importantly, in the proof of Theorem B3 (which is used to derive Lemma 3), Assumption 1 is used to ensure that SDE (2.1) has a transition density function such that the function defined in (3.2) is sufficiently smooth and satisfies , where denotes the generator of (see Theorem B3). This condition can be relaxed; see Friedman (1975) and Karatzas and Shreve (2012, Chap. 5.7) for details.
Observe that in the bound given in Lemma 3, the term is independent of the control , and the other terms depend on only through the distribution of . This implies that among all the admissible controls that result in the same distribution of , the cost is minimized by some . We can now prove the main theoretical result of this work in Theorem 4. The existence of the solution will be considered later in Theorem 5.
Theorem 4.
Proof.
See Appendix C. ∎
Remark 5.
As we derive in the proof, the terminal distribution of the optimally controlled process is still a geometric mixture of two distributions. Explicitly, its density is proportional to
where we recall is the density of .
Remark 6.
The assumption guarantees that we can use Lemma 3 and Theorem B3 and that ; it is also used in Dai Pra (1991). Observe that is invariant to the scaling of the function and thus also the scaling of and . This suggests that the system defined by (3.3) and (3.4) can be generalized as follows. Let and -finite measures be the solution to the following system
(3.5) | ||||
(3.6) |
where . We can use essentially the same argument to show that the choice is optimal, but now we have
This is the same as that given in Theorem 4. Indeed, if is a solution to (3.3) and (3.4), then the solution to (3.5) and (3.6) is given by and .
Example 2.
Example 3.
Suppose , and let denote the density function of the normal distribution with mean and covariance matrix . We have . Assume has density , and suppose that satisfy
where is the normalizing constant assumed to be finite. A routine calculation using can verify that the solution to (3.3) and (3.4) is given by
According to Remark 6, by choosing in (3.6), we can also replace by , where coincides with and is a probability distribution with density . For the original Schrödinger bridge problem (i.e., ), this solution has been used in developing efficient generative sampling methods (Wang et al., 2021; Berner et al., 2022).
Chen et al. (2019) studied a matrix optimal transport problem, which can be seen as a discrete analogue of Problem 2, and they proved that the corresponding Schrödinger system admits a unique solution. The main idea is to show that the solution can be characterized as the fixed point of some operator with respect to the Hilbert metric (Bushell, 1973), a technique that has been widely used in the literature on Schrödinger systems (Fortet, 1940; Georgiou and Pavon, 2015; Chen et al., 2016a; Essid and Pavon, 2019; Deligiannidis et al., 2024). For the Schrödinger system defined by (3.3) and (3.4), an argument based on the Hilbert metric can also be applied to prove the existence and uniqueness of the solution when are absolutely continuous and have compact support.
Theorem 5.
Let (resp. ) denote the support of (resp. ). Assume that
-
(i)
are compact;
-
(ii)
, exist, where is the Lebesgue measure;
-
(iii)
is continuous and strictly positive on .
For any , there exists a unique pair of non-negative, integrable functions such that
(3.7) | ||||
(3.8) |
Proof.
See Appendix D. ∎
Remark 7.
The proof of Theorem 5 is adapted from that for Proposition 1 in Chen et al. (2016a). Observe that the Schrödinger system naturally yields an iterative algorithm for computing . Given an estimate for , denoted by , we can estimate by
which then can be used to update by
Chen et al. (2016a) considered the original Schrödinger bridge problem (i.e., ) and showed that this updating scheme yields a strict contraction with respect to the Hilbert metric. We note that when , this argument can potentially be made simpler, since the mapping for a suitable function can be easily shown to be a strict contraction, and thus one only needs to verify that the other steps in the update are contractions (not necessarily strict); see Lemma D5 in the Appendix. The full scope of consequences of this observation and the existence proof for the general case are left to future study.
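As an illustration of the kind of iteration just described, the following numpy sketch runs the two updates on a discretized state space. Because the displays (3.3)-(3.4) are not reproduced above, the update for the terminal potential below simply encodes our reading of Remark 5 (the terminal law is a geometric mixture of the target and the freely propagated mass) with exponent β/(1+β); the names (`K`, `rho0`, `q`, `beta`) and the grid discretization are ours, so this should be read as a sketch of the idea rather than the exact system.

```python
import numpy as np

def ssb_schrodinger_iteration(K, rho0, q, beta, n_iter=500, tol=1e-10):
    """Fixed-point iteration for a discretized soft-constrained Schrodinger system.

    Assumed setup (ours, for illustration): the state space is a grid of n
    points, K[i, j] approximates the reference transition probability from
    grid point i at time 0 to grid point j at time T (quadrature weights
    absorbed), rho0 is the initial probability vector, and q is the target
    probability vector, assumed strictly positive on the grid. The updates
        phi = rho0 / (K @ psi)                       # initial marginal is exact
        psi = (q / (K.T @ phi)) ** (beta / (1 + beta))
    make the terminal vector (K.T @ phi) * psi a geometric mixture of q and
    the freely propagated mass, consistent with Remark 5; the power
    beta/(1+beta) < 1 matches the strict-contraction observation above.
    """
    psi = np.ones(len(rho0))
    for _ in range(n_iter):
        phi = rho0 / (K @ psi)
        psi_new = (q / (K.T @ phi)) ** (beta / (1.0 + beta))
        if np.max(np.abs(np.log(psi_new / psi))) < tol:
            return phi, psi_new
        psi = psi_new
    return phi, psi
```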
4. Extension to Time Series
Recently, a time series version of Problem 1 was studied in Hamdouche et al. (2023), where the goal is to generate time series samples from a joint probability distribution on . We can generalize our Problem 2 to time series data analogously.
Problem 3.
Consider fixed time points . Let be a probability distribution on such that . For , find , where
(4.1) |
and find the optimal control such that .
Recall that in Section 3, we started by considering functions such that for some function , where . It turns out that this technique can be used to solve Problem 3 as well, but now we need to consider conditional expectations of the form . To simplify the notation, we will write , , and ; when , , , all denote the empty vector.
Given a measurable function , the Markovian property of enables us to express the conditional expectation by
(4.2) |
where we set , and the function is defined by
(4.3) |
for Let denote the function on given by
(4.4) |
We prove in Lemma C4 in the Appendix that it suffices to consider controls in the set
This result is a generalization of Lemma 3 and is obtained by applying Theorem B3 to each time interval separately. Note that although we express as a function of , by (4.4), is measurable with respect to . To formulate the Schrödinger system for the time series SSB problem, let denote the transition density from to , which is given by
(4.5) |
Theorem 6.
Proof.
See Appendix C. ∎
Comparing the Schrödinger system in Theorem 4 and that in Theorem 6, we see that the solution to Problem 3 has essentially the same structure as that to Problem 2. The only difference is that in Theorem 4 the Schrödinger system is constructed by using the joint distribution of , while in Theorem 6 it is replaced by the joint distribution of . We also note that Hamdouche et al. (2023) only considered the time series Schrödinger bridge problem with being a Dirac measure, and by letting , Theorem 6 gives the solution to their problem in the general case.
5. Experiments
5.1. Problem Setup
We consider an application of SSB to robust generative modeling in the following scenario. Let denote a large collection of high-quality samples with distribution , and let be a small set of noisy samples with distribution . Our objective is to generate realistic samples resembling those in , but due to the limited availability of training samples, we want to leverage information from to enhance the sample quality.
A natural idea is to use SSB as a regularization method to mitigate overfitting to the noisy samples in . This can be implemented in two steps. For simplicity, we assume in this section that the uncontrolled process is given by ; that is, we assume and in (2.1). Then has density (recall this is the density of the normal distribution with mean and covariance matrix ). Let denote the density of with respect to the Lebesgue measure. Let be the Schrödinger bridge targeting evolving by
(5.1) |
for , where
Theorem 1 implies that has distribution . Next, we solve Problem 2 using as the reference process and as the target distribution. This yields the process with dynamics given by
(5.2) |
for , where
(5.3) |
By Theorem 2, the distribution of will be close to if is relatively large. It turns out that there is no need to train and separately. The following lemma shows that we can directly train a Schrödinger bridge targeting a geometric mixture of and .
Lemma 7.
Proof.
See Appendix E. ∎
Remark 8.
The assumption that (the initial distribution of ) is a Dirac measure greatly simplifies the calculations and enables us to directly target the unnormalized density function . We will propose in the next subsection a score matching algorithm for learning this geometric mixture distribution. For applications where a general initial distribution is desirable, one may need to build iterative algorithms by borrowing ideas from the iterative proportional fitting procedure (De Bortoli et al., 2021).
Example 4.
For an illustrative toy example, let be a mixture of four bivariate normal distributions with means , respectively, and the same covariance matrix . Let be a mixture of two normal distributions with means , respectively, and the same covariance matrix . So essentially contains two components of but with small bias and much larger noise. Since the density functions are known, we can directly simulate a Schrödinger bridge process targeting using the method described in Remark 3. We provide the results in Appendix A.3, which show that by targeting a geometric mixture with a moderate value of , we can effectively compel the terminal distribution of the controlled process to acquire a covariance structure similar to .
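For readers who want to reproduce a variant of this example, the sketch below assembles an unnormalized log density of the geometric mixture that can be fed to a sampler such as the one sketched after Remark 3. The component means and covariances are placeholders (the actual values of Example 4 are not recoverable from this text), and the weight β/(1+β) on the noisy distribution is the convention assumed elsewhere in our illustrations.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Placeholder parameters: the actual means/covariances of Example 4 are not
# recoverable from this text.
REF_MEANS = [np.array([5.0, 5.0]), np.array([-5.0, 5.0]),
             np.array([-5.0, -5.0]), np.array([5.0, -5.0])]   # 4 high-quality components
TGT_MEANS = [np.array([5.5, 5.5]), np.array([4.5, -5.5])]      # 2 noisy components
REF_COV, TGT_COV = 0.5 * np.eye(2), 4.0 * np.eye(2)

def log_mixture(x, means, cov):
    """Log density of an equally weighted Gaussian mixture, evaluated row-wise."""
    comps = np.stack([mvn.logpdf(x, mean=m, cov=cov) for m in means])
    return np.logaddexp.reduce(comps, axis=0) - np.log(len(means))

def log_geometric_mixture(x, beta):
    """Unnormalized log density of a geometric mixture, with (assumed) weight
    beta/(1+beta) on the noisy target and 1/(1+beta) on the reference."""
    a = beta / (1.0 + beta)
    return a * log_mixture(x, TGT_MEANS, TGT_COV) + (1.0 - a) * log_mixture(x, REF_MEANS, REF_COV)
```

To combine this with the `drift_estimate` sketch from Remark 3, one would pass the difference between this log density and the log of the N(0, T I) reference density as `log_g`.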
5.2. A Score Matching Algorithm for Learning Geometric Mixtures
Let denote a probability distribution with density function where is the normalizing constant. To generate samples from , one can use existing score-based diffusion model methods, but, as we will see shortly, one major challenge is how to train the score functions without samples from the distribution .
Let be the probability distribution with density
(5.5) |
which can be thought of as a smoothed version of , and suppose that we can generate samples from the distribution . Solving the Schrödinger bridge problem with initial distribution and terminal distribution , we obtain the controlled process with and dynamics given by
(5.6) | ||||
The process satisfies (note that this result is a special case of Example 3 with ).
We now describe how to simulate the dynamics given in (5.6) and generate samples from . First, to learn the drift function in (5.6), we propose to combine the score matching technique with importance sampling. Let denote our approximation to , where and the unknown parameter is typically the parameter vector of a neural network. According to the well-known score matching technique (Hyvärinen, 2005; Vincent, 2011), we can estimate for a given by minimizing the objective function , where
Unfortunately, existing score matching methods estimate by using samples from , and hence they are not applicable to our problem since we only have access to samples from and but not from . We propose to use importance sampling to tackle this issue. Let denote an auxiliary distribution with density , and assume that we can generate samples from (e.g., we can let or smoothed versions of them). By a change of measure, we can express as
(5.7) |
The density ratios and can be learned by using samples from and minimizing the logistic regression loss as in Wang et al. (2021). We do not need to know since it does not affect the drift term in (5.6). In particular, when , we get
where the ratio can be learned by using samples from , . A similar expression can be easily derived for . This importance sampling method enables us to estimate the expectation by using samples from . By averaging over randomly drawn from the interval , we obtain an estimated loss for the parameter . Minimizing this loss we get the estimate , and we can approximate using .
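The PyTorch sketch below illustrates the two ingredients of this subsection: a classifier-based estimate of a log density ratio trained with the logistic (binary cross-entropy) loss, and a denoising score matching loss in which the expectation over the geometric mixture is replaced by an importance-weighted expectation over the auxiliary distribution. The network architecture, the Gaussian perturbation kernel, and all function names are illustrative choices; this is not the objective (5.7) nor the networks used in the experiments.

```python
import torch
import torch.nn as nn

def train_log_ratio(x_num, x_den, hidden=128, n_steps=2000, lr=1e-3):
    """Classifier-based estimate of log(p_num / p_den) from samples of the two
    distributions: the Bayes-optimal logit of a {num=1, den=0} classifier equals
    log(p_num / p_den) plus the log of the sample-size ratio."""
    d = x_num.shape[1]
    net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                        nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    x = torch.cat([x_num, x_den])
    y = torch.cat([torch.ones(len(x_num)), torch.zeros(len(x_den))])
    for _ in range(n_steps):
        opt.zero_grad()
        loss_fn(net(x).squeeze(-1), y).backward()
        opt.step()
    shift = torch.log(torch.tensor(len(x_den) / len(x_num)))  # sample-size correction
    return lambda z: net(z).squeeze(-1).detach() + shift      # approx. log(p_num/p_den)(z)

def is_dsm_loss(score_net, x_nu, t, sigma_t, log_weight):
    """Denoising score matching with importance sampling: x_nu ~ nu, and
    log_weight(x) is the log of the unnormalized ratio between the geometric
    mixture density and nu's density (built, e.g., from learned log-ratios).
    An unnormalized weight only rescales the loss, not its minimizer."""
    noise = torch.randn_like(x_nu)
    x_tilde = x_nu + sigma_t * noise
    target = -noise / sigma_t                   # score of N(x_tilde; x_nu, sigma_t^2 I)
    logw = log_weight(x_nu)
    w = torch.exp(logw - logw.max()).detach()   # stabilize before exponentiating
    sq_err = ((score_net(x_tilde, t) - target) ** 2).flatten(1).sum(dim=1)
    return (w * sq_err).mean()
```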
Simulating the SDE (5.6) requires us to generate samples from . There are a few possible approaches. First, if is chosen to be sufficiently large, one may argue that we can simply approximate using the normal density , and thus we only need to draw from the normal distribution. This is the approach taken in the score-based generative models based on backward SDEs (Song and Ermon, 2019). Second, one can run an additional Schrödinger bridge process with terminal distribution , as proposed in the two-stage Schrödinger bridge algorithm of Wang et al. (2021). The dynamics they considered is simply the solution to Problem 1 given in Theorem 1, where the uncontrolled process is a Brownian motion started at . The drift term is approximated by a Monte Carlo scheme (see Remark 3 and Appendix A.1), which also requires an estimate of the density ratio and the score of . Third, one can run a Langevin diffusion with stationary distribution for a sufficiently long duration. We found in our experiments that this approach yields more robust results, probably because it does not require Monte Carlo sampling, which may yield drift estimates with large variances.
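A minimal sketch of the third option (Langevin initialization) followed by a generic Euler-Maruyama integration with unit diffusion coefficient; `score_fn` and `drift_fn` are stand-ins for the learned score of the smoothed geometric mixture and the drift in (5.6), whose exact forms are not reproduced in this text.

```python
import torch

def langevin_initialize(score_fn, n, d, step_size=1e-3, n_steps=2000):
    """Unadjusted Langevin algorithm: if score_fn approximates the score of the
    smoothed geometric mixture, the iterates approximately target that density
    for a small step size and a long run."""
    x = torch.randn(n, d)
    for _ in range(n_steps):
        x = x + step_size * score_fn(x) \
              + (2.0 * step_size) ** 0.5 * torch.randn_like(x)
    return x

def euler_maruyama(x0, drift_fn, T=1.0, n_steps=500):
    """Integrate dX_t = drift_fn(X_t, t) dt + dW_t on [0, T] starting from x0;
    a generic stand-in for simulating the controlled dynamics (5.6)."""
    x, dt = x0.clone(), T / n_steps
    for k in range(n_steps):
        x = x + drift_fn(x, k * dt) * dt + dt ** 0.5 * torch.randn_like(x)
    return x
```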
5.3. An Example Using MNIST
We present a numerical example using the MNIST data set (Deng, 2012), which consists of images of handwritten digits from to . All images have size and pixels are rescaled to . To construct , we randomly select 50 images labeled as eight and reduce their quality by adding Gaussian noise with mean 0 and variance ; see Fig. F4 in Appendix F. The reference data set includes all images that are not labeled as eight, a total of 54,149 samples.
We first run the algorithm of Wang et al. (2021) using only , and the generated samples are noisy as expected; see Fig. F5 in Appendix F. Next, we run our algorithm described in Section 5.2 with different choices of . The density ratio and scores are trained by using one GPU (RTX 6000). Generated samples are shown in Fig. 1, and we report in Table 1 the Fréchet Inception Distance (FID) score (Heusel et al., 2017), which assesses the disparity between our generated images (sample size = 40K) and the collection of clean digit 8 images from the MNIST data set (sample size 6K). When is too small, the generated images do not resemble those in and we frequently observe the influence of other digits; if , the images are essentially generated from . When is too large, the influence from the reference data set becomes negligible, but the algorithm tends to overfit to the noisy data in . For moderate values of , we observe a blend of characteristics from both data sets, and when , FID is minimized (among all values tried) and we get high-quality images of digit 8. This experiment illustrates that the information from can help capture the structural features associated with the digit 8, while the samples from can guide the algorithm towards effective noise removal. In Appendix F, we further analyze the generated images using t-SNE plots and inception scores (Salimans et al., 2016).

|  | 0 | 0.25 | 0.7 | 1.5 | 4 | 100 |
| --- | --- | --- | --- | --- | --- | --- |
| FID | 67.4 | 66.4 | 61.0 | 56.3 | 110.9 | 182.4 |
6. Concluding Remarks
We propose the soft-constrained Schrödinger bridge (SSB) problem and find its solution. Our theory encompasses the existing stochastic control results for Schrödinger bridge in Dai Pra (1991) and Hamdouche et al. (2023) as special cases. The paper focuses on the theory of SSB, and the numerical examples are designed to be uncomplicated but illustrative. More advanced algorithms for solving Problems 2 and 3 in full generality need to be developed. It will also be interesting to study the applications of SSB to other generative modeling tasks, such as conditional generation, style transfer (Shi et al., 2022; Shi and Wu, 2023; Su et al., 2022) and time series data generation (Hamdouche et al., 2023). Some further generalization of the objective function may be considered as well; for example, one can add a time-dependent cost as in Pra and Pavon (1990) or consider a more general form of the terminal cost.
7. Acknowledgements
The authors would like to thank Tiziano De Angelis for valuable discussion on the problem formulation, Yun Yang for the helpful conversation on the numerics, and anonymous reviewers for their comments and suggestions. JG and QZ were supported in part by NSF grants DMS-2311307 and DMS-2245591. XZ acknowledges the support from NSF DMS-2113359. The numerical experiments were conducted with the computing resources provided by Texas A&M High Performance Research Computing.
References
- Berner et al. [2022] Julius Berner, Lorenz Richter, and Karen Ullrich. An optimal control perspective on diffusion-based generative modeling. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.
- Beurling [1960] Arne Beurling. An automorphism of product measures. Annals of Mathematics, pages 189–200, 1960.
- Blaquiere [1992] A Blaquiere. Controllability of a Fokker-Planck equation, the Schrödinger system, and a related stochastic optimal control (revised version). Dynamics and Control, 2(3):235–253, 1992.
- Bushell [1973] Peter J Bushell. Hilbert’s metric and positive contraction mappings in a Banach space. Archive for Rational Mechanics and Analysis, 52:330–338, 1973.
- Chen et al. [2021a] Tianrong Chen, Guan-Horng Liu, and Evangelos Theodorou. Likelihood training of Schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations, 2021a.
- Chen et al. [2016a] Yongxin Chen, Tryphon Georgiou, and Michele Pavon. Entropic and displacement interpolation: a computational approach using the Hilbert metric. SIAM Journal on Applied Mathematics, 76(6):2375–2396, 2016a.
- Chen et al. [2016b] Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. On the relation between optimal transport and Schrödinger bridges: A stochastic control viewpoint. Journal of Optimization Theory and Applications, 169:671–691, 2016b.
- Chen et al. [2019] Yongxin Chen, Tryphon T Georgiou, Michele Pavon, and Allen Tannenbaum. Relaxed Schrödinger bridges and robust network routing. IEEE Transactions on Control of Network Systems, 7(2):923–931, 2019.
- Chen et al. [2021b] Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. Optimal transport in systems and control. Annual Review of Control, Robotics, and Autonomous Systems, 4:89–113, 2021b.
- Chen et al. [2021c] Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. Stochastic control liaisons: Richard Sinkhorn meets Gaspard Monge on a Schrodinger bridge. SIAM Review, 63(2):249–313, 2021c.
- Dai Pra [1991] Paolo Dai Pra. A stochastic control approach to reciprocal diffusion processes. Applied Mathematics and Optimization, 23(1):313–329, 1991.
- De Bortoli et al. [2021] Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.
- Deligiannidis et al. [2024] George Deligiannidis, Valentin De Bortoli, and Arnaud Doucet. Quantitative uniform stability of the iterative proportional fitting procedure. The Annals of Applied Probability, 34(1A):501–516, 2024.
- Deming and Stephan [1940] W Edwards Deming and Frederick F Stephan. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. The Annals of Mathematical Statistics, 11(4):427–444, 1940.
- Deng [2012] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
- Doob [1959] J. L. Doob. A Markov chain theorem. Probability and Statistics.(H. Cramer memorial volume) ed. U. Grenander. Almquist and Wiksel, Stockholm & New York, pages 50–57, 1959.
- Doob [1984] J. L. Doob. Classical potential theory and its probabilistic counterpart, volume 262 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, New York, 1984. ISBN 0-387-90881-1. doi: 10.1007/978-1-4612-5208-5.
- Essid and Pavon [2019] Montacer Essid and Michele Pavon. Traversing the Schrödinger bridge strait: Robert Fortet’s marvelous proof redux. Journal of Optimization Theory and Applications, 181(1):23–60, 2019.
- Fleming [1977] Wendell H Fleming. Exit probabilities and optimal stochastic control. Applied Mathematics and Optimization, 4:329–346, 1977.
- Fleming [2005] Wendell H Fleming. Logarithmic transformations and stochastic control. In Advances in Filtering and Optimal Stochastic Control: Proceedings of the IFIP-WG 7/1 Working Conference Cocoyoc, Mexico, February 1–6, 1982, pages 131–141. Springer, 2005.
- Fleming and Rishel [2012] Wendell H Fleming and Raymond W Rishel. Deterministic and stochastic optimal control, volume 1. Springer Science & Business Media, 2012.
- Fleming and Sheu [1985] Wendell H Fleming and Sheunn-Jyi Sheu. Stochastic variational formula for fundamental solutions of parabolic pde. Applied Mathematics and Optimization, 13:193–204, 1985.
- Föllmer [1988] H Föllmer. Random fields and diffusion processes. Ecole d’Ete de Probabilites de Saint-Flour XV-XVII, 1985-87, 1988.
- Fortet [1940] Robert Fortet. Résolution d’un système d’équations de m. Schrödinger. Journal de Mathématiques Pures et Appliquées, 19(1-4):83–105, 1940.
- Friedman [1975] Avner Friedman. Stochastic differential equations and applications. In Stochastic differential equations, pages 75–148. Springer, 1975.
- Georgiou and Pavon [2015] Tryphon T Georgiou and Michele Pavon. Positive contraction mappings for classical and quantum Schrödinger systems. Journal of Mathematical Physics, 56(3), 2015.
- Hamdouche et al. [2023] Mohamed Hamdouche, Pierre Henry-Labordere, and Huyên Pham. Generative modeling for time series via Schrödinger bridge. arXiv preprint arXiv:2304.05093, 04 2023. doi: 10.13140/RG.2.2.25758.00324.
- Heng et al. [2024] Jeremy Heng, Valentin De Bortoli, and Arnaud Doucet. Diffusion Schrödinger bridges for Bayesian computation. Statistical Science, 39(1):90–99, 2024.
- Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- Huang et al. [2021] Jian Huang, Yuling Jiao, Lican Kang, Xu Liao, Jin Liu, and Yanyan Liu. Schrödinger-Föllmer sampler: sampling without ergodicity. arXiv preprint arXiv:2106.10880, 2021.
- Hyvärinen [2005] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4):695–709, 2005.
- Jamison [1975] Benton Jamison. The Markov processes of Schrödinger. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 32(4):323–331, 1975.
- Karatzas and Shreve [2012] Ioannis Karatzas and Steven Shreve. Brownian motion and stochastic calculus, volume 113. Springer Science & Business Media, 2012.
- Léonard [2013] Christian Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport, 2013.
- Mikami [1990] Toshio Mikami. Variational processes from the weak forward equation. Communications in Mathematical Physics, 135:19–40, 1990.
- Moon et al. [2022] Taehong Moon, Moonseok Choi, Gayoung Lee, Jung-Woo Ha, and Juho Lee. Fine-tuning diffusion models with limited data. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.
- Pavon and Wakolbinger [1991] Michele Pavon and Anton Wakolbinger. On free energy, stochastic control, and Schrödinger processes. In Modeling, Estimation and Control of Systems with Uncertainty: Proceedings of a Conference held in Sopron, Hungary, September 1990, pages 334–348. Springer, 1991.
- Peluchetti [2023] Stefano Peluchetti. Diffusion bridge mixture transports, Schrödinger bridge problems and generative modeling. arXiv preprint arXiv:2304.00917, 2023.
- Pra and Pavon [1990] Paolo Dai Pra and Michele Pavon. On the Markov processes of Schrödinger, the Feynman-Kac formula and stochastic control. In Realization and Modelling in System Theory: Proceedings of the International Symposium MTNS-89, Volume I, pages 497–504. Springer, 1990.
- Richter et al. [2023] Lorenz Richter, Julius Berner, and Guan-Horng Liu. Improved sampling via learned diffusions. arXiv preprint arXiv:2307.01198, 2023.
- Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.
- Schrödinger [1931] Erwin Schrödinger. Über die Umkehrung der Naturgesetze. Verlag der Akademie der Wissenschaften in Kommission bei Walter De Gruyter u. Company, 1931.
- Schrödinger [1932] Erwin Schrödinger. Sur la théorie relativiste de l’électron et l’interprétation de la mécanique quantique. In Annales de l’institut Henri Poincaré, volume 2, pages 269–310, 1932.
- Shi et al. [2022] Yuyang Shi, Valentin De Bortoli, George Deligiannidis, and Arnaud Doucet. Conditional simulation using diffusion Schrödinger bridges. In Uncertainty in Artificial Intelligence, pages 1792–1802. PMLR, 2022.
- Shi and Wu [2023] Ziqiang Shi and Shoule Wu. Schröwave: Realistic voice generation by solving two-stage conditional Schrödinger bridge problems. Digital Signal Processing, 141:104175, 2023.
- Song [2022] Ki-Ung Song. Applying regularized Schrödinger-bridge-based stochastic process in generative modeling. arXiv preprint arXiv:2208.07131, 2022.
- Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11895–11907, 2019.
- Song et al. [2021] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021.
- Su et al. [2022] Xuan Su, Jiaming Song, Chenlin Meng, and Stefano Ermon. Dual diffusion implicit bridges for image-to-image translation. arXiv preprint arXiv:2203.08382, 2022.
- Tzen and Raginsky [2019] Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference on Learning Theory, pages 3084–3114. PMLR, 2019.
- Vargas et al. [2021] Francisco Vargas, Pierre Thodoroff, Austen Lamacraft, and Neil Lawrence. Solving Schrödinger bridges via maximum likelihood. Entropy, 23(9):1134, 2021.
- Vargas et al. [2022] Francisco Vargas, Will Sussman Grathwohl, and Arnaud Doucet. Denoising diffusion samplers. In The Eleventh International Conference on Learning Representations, 2022.
- Vargas et al. [2024] Francisco Vargas, Shreyas Padhy, Denis Blessing, and Nikolas Nüsken. Transport meets variational inference: Controlled Monte Carlo diffusions. In The Twelfth International Conference on Learning Representations, 2024.
- Vincent [2011] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
- Wang et al. [2021] Gefei Wang, Yuling Jiao, Qian Xu, Yang Wang, and Can Yang. Deep generative learning via Schrödinger bridge. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10794–10804. PMLR, 18–24 Jul 2021.
- Winkler et al. [2023] Ludwig Winkler, Cesar Ojeda, and Manfred Opper. A score-based approach for training Schrödinger bridges for data modelling. Entropy, 25(2):316, 2023.
- Zhang and Chen [2021] Qinsheng Zhang and Yongxin Chen. Path integral sampler: A stochastic control approach for sampling. In International Conference on Learning Representations, 2021.
APPENDIX
The code for all experiments (Cauchy simulation in Appendix A.2, normal mixture simulation in Appendix A.3 and MNIST example in Section 5.3) is at the GitHub repository https://github.com/gargjhanvi/Soft-constrained-Schrodinger-Bridge-a-Stochastic-Control-Approach.
A. Simulation with Known Density Functions
A.1. Monte Carlo Simulation with Densities Known up to a Normalization Constant
Let the uncontrolled process be the Brownian motion with and , and set . Let denote the density of the normal distribution with mean zero and covariance matrix . Given a target distribution with density (which we simply denote by in this section), the solution to SSB is given by
(A.1) |
where the drift is determined by and
Note that to determine , we only need to know up to a normalization constant. Since , we can rewrite as
(A.2) |
where we set
We can then approximate the numerator and denominator in (A.2) separately using Monte Carlo samples.
For some target distributions, this approach can be made more efficient by using importance sampling. When has a heavy tail (e.g., the Cauchy distribution), may grow super-exponentially with , and Monte Carlo estimates for the numerator and denominator in (A.2) with drawn from the normal distribution may have large variances. Observe that the numerator can be written as
The term is often polynomial in (that is, it does not grow too fast). Hence, intuitively, the integral is likely to be well approximated by a Monte Carlo estimate with drawn from a density proportional to . Such a density may not be easily accessible, but if one knows the tail decay rate of , one can try to find a proposal distribution for with tails no lighter than . In the experiment given below, we propose from a distribution whose tail decay rate is the same as that of . Letting denote our proposal density, we express the numerator in (A.2) as
(A.3) | ||||
and estimate the right-hand side using a Monte Carlo average. Similarly, the denominator can be expressed by
(A.4) |
and estimated by a Monte Carlo average.
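A one-dimensional numpy sketch of this importance-sampled estimator, assuming the reference process is a standard Brownian motion started at 0 on [0, T] (so the N(0, T) density plays the role of the reference terminal density); `log_f`, `log_prop`, and `sample_prop` are our names for the unnormalized log target terminal density, the log proposal density, and the proposal sampler.

```python
import numpy as np
from scipy.stats import norm

def drift_estimate_is(x, t, log_f, log_prop, sample_prop, T=1.0, M=2000, rng=None):
    """Importance-sampled Monte Carlo estimate of the drift, in the spirit of
    (A.3)-(A.4). The drift equals grad_x log h(x, t) with
        h(x, t) = E[ f(Y) / phi_T(Y) ],   Y ~ N(x, T - t),
    and both h and its gradient are rewritten as integrals against a
    (heavy-tailed) proposal density before Monte Carlo averaging.
    One-dimensional for simplicity."""
    rng = np.random.default_rng() if rng is None else rng
    s2 = T - t                                       # conditional variance of Y given X_t = x
    y = sample_prop(M, rng)                          # proposal draws for the terminal point
    log_w = (log_f(y) - norm.logpdf(y, loc=0.0, scale=np.sqrt(T))
             + norm.logpdf(y, loc=x, scale=np.sqrt(s2)) - log_prop(y))
    w = np.exp(log_w - log_w.max())                  # stabilized unnormalized weights
    num = np.sum(w * (y - x) / s2)                   # estimates the numerator (A.3)
    den = np.sum(w)                                  # estimates the denominator (A.4)
    return num / den
```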
A.2. Simulating the Cauchy Distribution
Fix , , and . Let be the density of the standard Cauchy distribution, i.e., . Theorem 2 shows that the solution to SSB is a Schrödinger bridge such that the terminal distribution has density
(A.5) |
where is the normalization constant and is the density of the standard normal distribution. We plot and in Figure A1 for . For close to zero, the density of remains approximately the same for the four choices of . When , is the Cauchy distribution, which has a heavy tail. But for any , the tail decay rate of is dominated by the Gaussian component.
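For concreteness, an unnormalized log density of a geometric mixture between the standard Cauchy target and the standard normal reference can be coded as below and passed to the importance-sampled drift estimator sketched in Appendix A.1; the exponent β/(1+β) on the Cauchy component is the convention we assume throughout these sketches, and the exact form of (A.5) should be read from the original display.

```python
import numpy as np
from scipy.stats import cauchy, norm, t as student_t

def log_f_cauchy_mix(y, beta):
    """Unnormalized log density of a geometric mixture between the standard
    Cauchy target and the standard normal reference terminal density, with
    (assumed) weight beta/(1+beta) on the Cauchy component; cf. (A.5)."""
    a = beta / (1.0 + beta)
    return a * cauchy.logpdf(y) + (1.0 - a) * norm.logpdf(y)

# Example usage with drift_estimate_is from Appendix A.1, proposing the
# terminal point from a t-distribution with 2 degrees of freedom:
# drift = drift_estimate_is(
#     x=0.3, t=0.5,
#     log_f=lambda y: log_f_cauchy_mix(y, beta=1.0),
#     log_prop=lambda y: student_t.logpdf(y, df=2),
#     sample_prop=lambda M, rng: rng.standard_t(df=2, size=M),
# )
```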


We simulate the solution to SSB given in (A.1) over the time interval using the Euler-Maruyama method with time steps. The drift is approximated by the importance sampling scheme. When (i.e., the terminal distribution is Cauchy), we let the proposal distribution in (A.3) and (A.4) be the -distribution with degrees of freedom (we have also tried directly proposing from the standard Cauchy distribution and obtained very similar results). When , we let be the normal distribution with mean and variance . We simulate the SSB process 10,000 times, and in Table A1 we report the number of failed runs; these failures happen because Monte Carlo estimates for the numerator and denominator in (A.2) become unstable when is large, resulting in numerical overflow. When , we observe that numerical overflow is still common even if we use Monte Carlo samples for each estimate. In contrast, when we use and only Monte Carlo samples, the algorithm becomes very stable. In Figure A2, we compare the distribution of generated samples (i.e., the distribution of with ) with their corresponding target distributions. The first panel compares the distribution of the samples generated with (failed runs ignored) with the Cauchy distribution, and the second compares the distribution of the samples generated with with the geometric mixture given in (A.5). It is clear that simulating the Schrödinger bridge process (i.e., using ) cannot recover the heavy tails of the Cauchy distribution, but the numerical simulation of SSB with accurately yields samples from the geometric mixture distribution. Recall that the KL divergence between the standard Cauchy and normal distributions is infinite, which, by Theorem 1, means that there is no control with finite energy cost that can steer a standard Brownian motion towards the Cauchy distribution at time . Our experiment partially illustrates the practical consequences of this fact in numerically simulating the Schrödinger bridge, and it also suggests that SSB may be a numerically more robust alternative.
|  | 20 | 50 | 100 | 200 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
|  | 1149 | 933 | 714 | 610 | 444 | 308 |
|  | 495 | 29 | 6 | 1 | 0 | 0 |
|  | 124 | 18 | 6 | 0 | 0 | 0 |
|  | 48 | 4 | 2 | 0 | 1 | 0 |


A.3. Simulating Mixtures of Normal Distributions
In Example 4, we let be a mixture of four bivariate normal distributions,
where the weights and the mean vectors of the four component distributions are different. Let be a mixture of two equally weighted bivariate normal distributions
The first component of has mean close to (which is the mean vector of the first component of ), and the second component of has mean close to (which is the mean vector of the last component of ). Hence, we can interpret as a distribution of high-quality samples from four different classes, and interpret as a distribution of noisy samples from two of the four classes. Let be the density of our target distribution.
We generate samples from by simulating the SDE (A.1) over the time interval with and . We use time steps for discretization and generate Monte Carlo samples at each step for estimating the drift. The trajectories of the simulated processes are shown in Figure A3 for , and the samples we generate correspond to . It can be seen that when or , the majority of the generated samples form two clusters, one with mean close to and the other with mean close to . Further, the two clusters both exhibit very small within-cluster variation, which indicates that the noise from has been effectively reduced.




B. Auxiliary Results
We first present two lemmas about the minimization of Kullback-Leibler divergence.
Lemma B1.
Let be a -finite measure space, and let be a measure with density such that . Let denote the set of all probability measures absolutely continuous with respect to . Then,
The infimum is attained by the probability measure such that .
Proof of Lemma B1.
Since if , it suffices to consider such that . It is straightforward to show that . Since , is a probability measure. The claim then follows from the fact that the KL divergence between any two probability measures is non-negative. ∎
Lemma B2.
Let and be as given in Lemma B1. For , let be a finite measure (i.e., ) with density . Assume that . For ,
where . The infimum is attained by the probability measure such that
When , we have . Further,
Proof of Lemma B2.
Since , we have . By Hölder’s inequality,
Clearly, if , the above inequality implies . Observe that
where is the measure with density . Hence, we can apply Lemma B1 to prove that is the minimizer.
To prove the convergence as , let and write , where
The integrands of both and are monotone in . Hence, using and the monotone convergence theorem, we find that
which also implies pointwise. An analogous argument using the monotone convergence theorem proves that . ∎
The next result is about the controlled SDE (2.2) with . It is adapted from Theorem 2.1 of Dai Pra [1991]. Our proof is provided for completeness.
Theorem B3.
Suppose Assumption 1 holds. Let be a weak solution to (2.1) with initial distribution and transition density . Define
for some measurable such that . Assume on . Let be a weak solution with to the SDE
(B.1) |
Then, we have the following results.
-
(i)
and
(B.2) -
(ii)
The weak solution to the SDE (B.1) exists. Indeed, we can define a probability measure by such that the law of under is the same as the law of under .
-
(iii)
The process satisfies
(B.3) -
(iv)
The transition density of is given by
(B.4) Hence, is a Doob’s -path process, and the density of the distribution of is
(B.5)
Proof of Theorem B3.
We follow the arguments of Jamison [1975] and Dai Pra [1991]. Under Assumption 1, Proposition 2.1 of Dai Pra [1991] (which is adapted from the result of Jamison [1975]) implies that and on , where
(B.6) |
denotes the generator of . This proves part (i).
To prove part (ii), we first apply Itô’s lemma to get
for any , where
Since is integrable, is a uniformly integrable martingale on and converges to both a.s. and in . Letting , we obtain that
(B.7) |
Write , and . We have shown that and are martingales on . Since , we can define a probability measure by . By Girsanov's theorem and the expression for given in (B.7), the law of under is the same as the law of under ; in other words, , where (resp. ) is the probability measure induced by (resp. ) on the space of continuous functions.
For part (iii), choose , where . Analogously to (B.7), we can apply Itô’s lemma to get
Since is smooth, is bounded on . Taking expectations on both sides, we find that
(B.8) |
Letting and applying the monotone convergence theorem, we get
(B.9) |
It remains to argue that the left-hand side converges to . Write . Using the change of measure and , we get
The function is bounded below and convex. If is chosen such that , we have
where Fatou’s lemma is applied to obtain the first inequality, and the second inequality follows from the fact that is a submartingale. Since converges a.s. to , we get
(B.10) |
Taking expectations on both sides, we get
(B.11) |
Combining it with (B.9) proves part (iii).
Consider part (iv). For any bounded and measurable function , we apply the change of measure to the conditional expectation to get
The claim then follows from Fubini’s theorem. ∎
C. Proofs for the Main Results
Recall that denotes the solution to the SDE
(C.1) |
over the time interval with initial distribution , and denotes the solution to the controlled SDE
(C.2) |
with . The transition density of the uncontrolled process is denoted by .
Proof of Lemma 3.
Since is admissible, it must satisfy . Therefore, Novikov's condition is satisfied, and we can apply Girsanov's theorem to get
By part (ii) of Theorem B3, the left-hand side of the above inequality is equal to . Hence,
where we have used again to obtain . By the definition of , we have
Since , we have . By part (iii) of Theorem B3, the equality is attained when . ∎
Proof of Theorem 4.
First, we find using (3.3) that
which is finite by assumption. Since whenever , we have on . Hence, Theorem B3 and Lemma 3 can be applied with . In addition to setting and , we define
(C.3) |
There is no loss of generality in assuming exists, since and if . For later use, we note that by (3.3) and (3.4),
(C.4) | |||
(C.5) |
Let be the -finite measure on with density . By Lemma 3, for any admissible control ,
where the second line follows from (C.4) and the third from (C.5). The measures do not depend on . By Lemma B1,
(C.6) |
We will later prove that . Combining the above two displayed inequalities, we get
(C.7) | ||||
(C.8) |
For , we know by Lemma 3 that the equality in (C.7) is attained. Hence, it is optimal if we can show that the equality in (C.8) is also attained. By Lemma B1, this is equivalent to showing that
(C.9) |
where we write . By part (iv) of Theorem B3, we have
where step (i) follows from (C.4) and step (ii) follows from (C.5). So is optimal, and the normalizing constant in (C.9) equals , from which it follows that . Finally, one can apply Jensen's inequality and the assumption to show that , which concludes the proof. ∎
Proof of Theorem 6.
Recall that are defined by
(C.10) | ||||
(C.11) |
The proof is essentially the same as that of Theorem 4. First, we need to derive a result analogous to Lemma 3, which we give in Lemma C4 below. We will apply Lemma C4 with . Recall that is defined by
(C.12) |
Note that the conditions and can be verified by the same argument as that used in the proof of Theorem 4. By (C.10), we have
(C.13) |
Define
(C.14) |
We can rewrite (C.11) as .
By Lemma C4 and Lemma B1, for any admissible control ,
To prove that the equality is attained by the control , it remains to show that
(C.15) |
where . To find , we can mimic the proof of Theorem B3. It is not difficult to verify that the law of under is the same as the law of under , where is defined by
(C.16) |
So for any bounded and measurable function , we have
It thus follows from (C.13) that
(C.17) |
The rest of the proof is identical to that of Theorem 4. ∎
Lemma C4.
Proof of Lemma C4.
The SDE (2.2) with control can be expressed as
(C.18) | ||||
where we write and is as given in (4.3). By the tower property, we have
(C.19) |
where is defined by , and is the solution to the uncontrolled process (2.1). The assumption implies that for almost every . Hence, for each and , Theorem B3 implies that there exists a weak solution to the SDE (C.18) on with . Moreover,
(C.20) |
Summing over all the time intervals, we get
Now consider an arbitrary admissible control . As in the proof of Lemma 3, by Girsanov's theorem, we have
which yields that
The asserted result thus follows. ∎
D. Proof of the Existence of Solution to SSB
We first recall the definition of the Hilbert metric [Chen et al., 2016a]. For and , let denote the space of functions defined on . Define
(D.1) |
Since is a closed solid cone in the Banach space , we can define a Hilbert metric on it. For any (where denotes the constant function equal to 0), define
(D.2) |
where means and we use the convention . The Hilbert metric on is defined by
(D.3) |
Note that is only a pseudometric on , but it is a metric on the space of rays of .
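For intuition, the Hilbert metric between two strictly positive functions on a finite grid can be computed as follows; the formula is standard, and the discretization is only for illustration.

```python
import numpy as np

def hilbert_metric(f, g):
    """Hilbert projective (pseudo)metric between strictly positive vectors,
    viewed as positive functions on a finite grid:
        d_H(f, g) = log sup_i (f_i / g_i) - log inf_i (f_i / g_i).
    It is invariant under rescaling f -> c f or g -> c g for any c > 0, which
    is why it is a genuine metric only on rays of the cone."""
    r = np.asarray(f, dtype=float) / np.asarray(g, dtype=float)
    return float(np.log(r.max()) - np.log(r.min()))
```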
Proof of Theorem 5.
The proof is adapted from Chen et al. [2016a, Proposition 1]. We will show the existence of strictly positive and integrable functions such that
(D.4) | ||||
(D.5) |
Let be our guess for . We can update as follows.
-
(1)
Set .
-
(2)
Set , which is an estimate for by (D.5).
-
(3)
Set .
-
(4)
By (D.4), update the estimate for by .
Denote this updating scheme by . Note that for any and ,
(D.6) |
In Lemma D5 below, we prove that is a strict contraction mapping from to with respect to the Hilbert metric. To prove the existence of , it is sufficient to show that has a fixed point , since we can set
which must satisfy (D.4) and (D.5) and (see the proof of Lemma D5 for why are integrable). But note that we cannot apply the Banach fixed-point theorem to .
To find the fixed point of , we first consider its normalized version , defined by . Let
(D.7) |
denote the domain and range of . Since is a strict contraction mapping on with respect to the Hilbert metric (which is invariant under scaling), is also a strict contraction mapping and thus continuous (with respect to the Hilbert metric) on . If is a fixed point of , then is a fixed point of , since
where the first equality follows from (D.6). Moreover, since is a metric on the rays, any other fixed point of must have the form for some constant . But the same argument shows that must equal , and thus the fixed point of is unique. This further yields the uniqueness of .
So it only remains to prove that has a fixed point in . The proof for this claim is essentially the same as that in Chen et al. [2016a]. Pick arbitrarily and let . For , define
(D.8) |
where the second equality follows from (D.6). Each is in and well-defined as . Since is a strict contraction mapping with respect to , is a Cauchy sequence with respect to . Using the inequality (see Chen et al. [2016a]), we find that is also Cauchy with respect to the -norm, and thus there exists such that and . Next, we argue that is uniformly bounded from below and above and also uniformly equicontinuous. To show the uniform boundedness, we first observe that implies
(D.9) |
Further, since is bounded in and recalling the last step in the construction of , we have
(D.10) |
for any , where is some constant independent of . Combining (D.9) and (D.10), we get
(D.11) |
Note that the uniform boundedness of also implies . The uniform equicontinuity of can be proved by using the uniform continuity of the transition density . Finally, by the Arzelà–Ascoli theorem, there is a subsequence such that converges to uniformly with respect to the -norm and is also uniformly continuous. This implies and thus . By the continuity (with respect to ) of , we can interchange the limit operation with , thereby establishing as the fixed point of . ∎
Lemma D5.
For , define
(D.12) |
The operator is a strict contraction mapping from to with respect to the Hilbert metric.
Proof.
We can express the operator by , and define by
(D.13) | ||||
It is worth explaining how the ranges of these operators are determined. First, if , it is clear that and . For , since we assume is compact and is continuous in , for any there exists such that
The argument for is similar, and note that by Hölder’s inequality,
Now we prove that is a strict contraction. Let . By the definition given in (D.2),
which implies
(D.14) |
since . Chen et al. [2016a] showed that the operators are strict contractions using Birkhoff’s theorem, and that the operator is an isometry (all with respect to the Hilbert metric). Hence, is a strict contraction. Note that for our problem, since is a strict contraction, we actually only need to be contractions (not necessarily strict). ∎
E. Proof of Lemma 7
We consider a more general setting. Assume that is given by
(E.1) |
where satisfies Assumption 1. The density function of is . Let be given by
(E.2) | ||||
(E.3) | ||||
(E.4) |
Theorem 1 implies that . Let be given by
(E.5) | ||||
(E.6) | ||||
(E.7) |
which is the solution to Problem 2 with being the reference process and being the target distribution. We now prove that
(E.8) |
That is, is also the solution to Problem 1 with being the reference process and being the target distribution, where has un-normalized density . Once this is proved, Lemma 7 follows as a special case with and .
F. MNIST Example
Figure F4 visualizes the 50 images in the data set , which are obtained by adding Gaussian noise to the original images in MNIST. Figure F5 shows the new images generated by the two-stage Schrödinger bridge algorithm of Wang et al. [2021] using only as the input.
Table F2 shows the inception scores [Salimans et al., 2016] for our generated images and the images of digit 8 in MNIST. The score of our samples for is slightly higher than that of the digit 8 images in the MNIST data set, suggesting that our generated images of digit 8 exhibit a greater degree of variability than those in MNIST. Additionally, when , our score aligns closely with that of (i.e., the noisy digit 8 images from MNIST), indicating that our method can recover the images in the target data set by using a large . For small values of , the scores of our generated images are higher than those of the digit 8 images in MNIST, primarily because the reference data set (consisting of the other digits) has greater variability and complexity. However, as shown in Figure 1, when is small, we do not necessarily obtain images of digit 8.
We also utilize t-SNE plots to visually characterize the distribution of our generated images. Figure F6 illustrates that our samples come from the geometric mixture distribution interpolating between the noisy images of digit 8 and the clean images of other digits. Figure F7 demonstrates that SSB samples with are positioned closer to the clean images of digit 8 compared to the samples obtained with .
In our code, we use the neural network model of Song and Ermon [2019] for training the score functions and use the neural network model of Wang et al. [2021] for training the density ratio function.




| Datasets (sample size K) | Inception score |
| --- | --- |
| SSB with | 6.70 ± 0.20 |
| SSB with | 6.59 ± 0.15 |
| SSB with | 5.12 ± 0.11 |
| SSB with | 3.51 ± 0.08 |
| SSB with | 3.65 ± 0.04 |
| SSB with | 2.87 ± 0.04 |
| Digit 8 in MNIST (clean) | 3.29 ± 0.04 |
| Digit 8 in MNIST (noisy) | 2.96 ± 0.04 |