
Gaussian Interpolation Flows

Yuan Gao ([email protected])
Department of Applied Mathematics
The Hong Kong Polytechnic University
Hong Kong SAR, China

Jian Huang ([email protected])
Departments of Data Science and AI, and Applied Mathematics
The Hong Kong Polytechnic University
Hong Kong SAR, China

Yuling Jiao ([email protected])
School of Mathematics and Statistics
and Hubei Key Laboratory of Computational Science
Wuhan University, Wuhan, China
Abstract

Gaussian denoising has emerged as a powerful method for constructing simulation-free continuous normalizing flows for generative modeling. Despite their empirical successes, theoretical properties of these flows and the regularizing effect of Gaussian denoising have remained largely unexplored. In this work, we aim to address this gap by investigating the well-posedness of simulation-free continuous normalizing flows built on Gaussian denoising. Through a unified framework termed Gaussian interpolation flow, we establish the Lipschitz regularity of the flow velocity field, the existence and uniqueness of the flow, and the Lipschitz continuity of the flow map and the time-reversed flow map for several rich classes of target distributions. This analysis also sheds light on the auto-encoding and cycle consistency properties of Gaussian interpolation flows. Additionally, we study the stability of these flows in source distributions and perturbations of the velocity field, using the quadratic Wasserstein distance as a metric. Our findings offer valuable insights into the learning techniques employed in Gaussian interpolation flows for generative modeling, providing a solid theoretical foundation for end-to-end error analyses of learning Gaussian interpolation flows with empirical observations.

Keywords: Continuous normalizing flows, Gaussian denoising, generative modeling, Lipschitz transport maps, stochastic interpolation.

1 Introduction

Generative modeling, which aims to learn the underlying data generating distribution from a finite sample, is a fundamental task in the field of machine learning and statistics (Salakhutdinov, 2015). Deep generative models (DGMs) find wide-ranging applications across diverse domains such as computer vision, natural language processing, drug discovery, and recommendation systems. The core objective of DGMs is to learn a nonlinear mapping, either deterministic or stochastic (with outsourcing randomness), which transforms latent samples drawn from a simple reference distribution into samples that closely resemble the target distribution.

Generative adversarial networks (GANs) have emerged as a prominent class of DGMs (Goodfellow et al., 2014; Arjovsky et al., 2017; Goodfellow et al., 2020). Through an adversarial training process, GANs learn to approximately generate samples from the data distribution. Variational auto-encoders (VAEs) are another category of DGMs (Kingma and Welling, 2014; Rezende et al., 2014; Kingma and Welling, 2019). In VAEs, the encoding and decoding procedures produce a compressed and structured latent representation, enabling efficient sampling and interpolation. Score-based diffusion models are a promising approach to deep generative modeling that has evolved rapidly since its emergence (Song and Ermon, 2019, 2020; Ho et al., 2020; Song et al., 2021b, a). The basis of score-based diffusion models lies in the notion of the score function, which characterizes the gradient of the log-density function of a given distribution.

In addition, normalizing flows have gained attention as another powerful class of DGMs (Tabak and Vanden-Eijnden, 2010; Tabak and Turner, 2013; Kobyzev et al., 2020; Papamakarios et al., 2021). In normalizing flows, an invertible mapping is learned to transform a simple source distribution into a more complex target distribution by composing a series of parameterized, invertible, and differentiable intermediate transformations. This framework allows for efficient sampling and training by maximum likelihood estimation (Dinh et al., 2014; Rezende and Mohamed, 2015). Continuous normalizing flows (CNFs) pursue this idea further by performing the transformation over continuous time, enabling fine-grained modeling of dynamic systems from the source distribution to the target distribution. The essence of CNFs lies in defining ordinary differential equations (ODEs) that govern their evolution in terms of continuous trajectories. Inspired by the Gaussian denoising approach, which learns a target distribution by denoising its Gaussian-smoothed counterpart, many authors have considered simulation-free estimation methods that have shown great potential in large-scale applications (Song et al., 2021a; Liu et al., 2023; Albergo and Vanden-Eijnden, 2023; Lipman et al., 2023; Neklyudov et al., 2023; Tong et al., 2023; Chen and Lipman, 2023; Albergo et al., 2023b; Shaul et al., 2023; Pooladian et al., 2023). However, despite the empirical success of simulation-free CNFs based on Gaussian denoising, rigorous theoretical analysis of these CNFs has received limited attention thus far.

In this work, we explore an ODE flow-based approach for generative modeling, which we refer to as Gaussian Interpolation Flows (GIFs). This method is derived from the Gaussian stochastic interpolation detailed in Section 3. GIFs represent a straightforward extension of the stochastic interpolation method (Albergo and Vanden-Eijnden, 2023; Liu et al., 2023; Lipman et al., 2023). They can be considered a class of CNFs and encompass various ODE flows as special cases. According to the classical Cauchy-Lipschitz theorem, also known as the Picard-Lindelöf theorem (Hartman, 2002b, Theorem 1.1), a unique solution to the initial value problem for an ODE flow exists if the velocity field is continuous in the time variable and uniformly Lipschitz continuous in the space variable. In the case of GIFs, the velocity field depends on the score function of the push-forward measure. Therefore, it remains to be shown that this velocity field satisfies the regularity conditions stipulated by the Cauchy-Lipschitz theorem. These regularity conditions are commonly assumed in the literature when analyzing the convergence properties of CNFs or general neural ODEs (Chen et al., 2018; Biloš et al., 2021; Marion et al., 2023; Marion, 2023; Marzouk et al., 2023). However, there is a theoretical gap in understanding how to translate these regularity conditions on velocity fields into conditions on target distributions.

The main focus of this work is to study and establish the theoretical properties of Gaussian interpolation flow and its corresponding flow map. We show that the regularity conditions of the Cauchy-Lipschitz theorem are satisfied for several rich classes of probability distributions using variance inequalities. Based on the obtained regularity results, we further expose the well-posedness of GIFs, the Lipschitz continuity of flow mappings, and applications to generative modeling. The well-posedness results are crucial for studying the approximation and convergence properties of GIFs learned with the flow or score matching method. When applied to generative modeling, our results further elucidate the auto-encoding and cycle consistency properties exhibited by GIFs.


[Figure 1: roadmap diagram connecting geometric regularity (Assumption 2), Lipschitz velocity fields (Proposition 20), well-posedness of Gaussian interpolation flows (Theorem 25), Lipschitz flow maps (Propositions 30, 31), auto-encoding and cycle consistency, and stability in source distributions and in velocity fields (Propositions 37, 39), with the connecting results Lemmas 18, 29, 38, 40, 42, 44 and Corollaries 27, 36, 41.]
Figure 1: Roadmap of the main results.

1.1 Our main contributions

We provide an overview of the main results in Figure 1, in which we indicate the assumptions used in our analysis and the relationship between the results. We also summarize our main contributions below.

  • In Section 3, we extend the framework of stochastic interpolation proposed in Albergo and Vanden-Eijnden (2023). Various ODE flows can be considered special cases of the extended framework. We prove that the marginal distributions of GIFs satisfy the continuity equation converging to the target distribution in the weak sense. Several explicit formulas of the velocity field and its derivatives are derived, which can facilitate computation and regularity estimation.

  • In Sections 4 and 5, we establish the spatial Lipschitz regularity of the velocity field for a range of target measures with rich structures, which is sufficient to guarantee the well-posedness of GIFs. Additionally, we deduce the Lipschitz regularity of both the flow map and its time-reversed counterpart. The well-posedness of GIFs is an essential attribute, serving as a foundational requirement for investigating numerical solutions of GIFs. It is important to note that while the flow maps are demonstrated to be Lipschitz continuous transport maps for generative modeling, the Lipschitz regularity for optimal transport maps has only been partially established to date.

  • In Section 6, we show that the auto-encoding and cycle consistency properties of GIFs are inherently satisfied when the flow maps exhibit Lipschitz continuity with respect to the spatial variable. This demonstrates that exact auto-encoding and cycle consistency are intrinsic characteristics of GIFs. Our results lend theoretical support to the empirical observations of Su et al. (2023), as illustrated in Figures 3 and 4.

  • In Section 6, we conduct the stability analysis of GIFs, examining how they respond to changes in source distributions and to perturbations in the velocity field. This analysis, conducted in terms of the quadratic Wasserstein distance, provides valuable insights that justify the use of learning techniques such as Gaussian initialization and flow or score matching.

2 Preliminaries

In this section, we set up the notation, state the basic assumptions, and recall several useful variance inequalities.

Notation. Here we summarize the notation. The space {\mathbb{R}}^{d} is endowed with the Euclidean metric and we denote by \|\cdot\| and \langle\cdot,\cdot\rangle the corresponding norm and inner product. Let \mathbb{S}^{d-1}:=\{x\in{\mathbb{R}}^{d}:\|x\|=1\}. For a matrix A\in{\mathbb{R}}^{k\times d}, we use A^{\top} for the transpose, and the spectral norm is denoted by \|A\|_{2,2}:=\sup_{x\in\mathbb{S}^{d-1}}\|Ax\|. For a square matrix A\in{\mathbb{R}}^{d\times d}, we use \det(A) for the determinant and \operatorname{Tr}(A) for the trace. We use {\mathbf{I}}_{d} to denote the d\times d identity matrix. For two symmetric matrices A,B\in{\mathbb{R}}^{d\times d}, we denote A\succeq B or B\preceq A if A-B is positive semi-definite. For two vectors x,y\in{\mathbb{R}}^{d}, we denote x\otimes y:=xy^{\top}. For \Omega_{1}\subset{\mathbb{R}}^{k},\Omega_{2}\subset{\mathbb{R}}^{d},n\geq 1, we denote by C^{n}(\Omega_{1};\Omega_{2}) the space of continuous functions f:\Omega_{1}\to\Omega_{2} that are n times differentiable and whose partial derivatives of order n are continuous. If \Omega_{2}\subset{\mathbb{R}}, we simply write C^{n}(\Omega_{1}). For any f(x)\in C^{2}({\mathbb{R}}^{d}), let \nabla_{x}f,\nabla^{2}_{x}f,\nabla_{x}\cdot f, and \Delta_{x}f denote its gradient, Hessian, divergence, and Laplacian, respectively. We use X\lesssim Y to denote X\leq CY for some constant C>0. The function composition operation is written as (g\circ f)(x):=g(f(x)) for functions f and g.

The Borel \sigma-algebra of {\mathbb{R}}^{d} is denoted by \mathcal{B}({\mathbb{R}}^{d}). The space of probability measures defined on ({\mathbb{R}}^{d},\mathcal{B}({\mathbb{R}}^{d})) is denoted as \mathcal{P}({\mathbb{R}}^{d}). For any {\mathbb{R}}^{d}-valued random variable \mathsf{X}, we use \mathbb{E}[\mathsf{X}] and \mathrm{Cov}(\mathsf{X}) to denote its expectation and covariance matrix, respectively. We use \mu*\nu to denote the convolution of any two probability measures \mu and \nu, and we use \overset{d}{=} to indicate that two random variables have the same probability distribution. For a random variable \mathsf{X}, let \mathrm{Law}(\mathsf{X}) denote its probability distribution. Let g:{\mathbb{R}}^{k}\to{\mathbb{R}}^{d} be a measurable mapping and \mu be a probability measure on {\mathbb{R}}^{k}. The push-forward measure g_{\#}\mu is defined by g_{\#}\mu(A):=\mu(g^{-1}(A)) for any measurable set A. Let N(m,\Sigma) denote the d-dimensional Gaussian distribution with mean vector m\in{\mathbb{R}}^{d} and covariance matrix \Sigma\in{\mathbb{R}}^{d\times d}. For simplicity, let \gamma_{d,\sigma^{2}}:=N(0,\sigma^{2}{\mathbf{I}}_{d}), and let \varphi_{m,\sigma^{2}}(x) denote the probability density function of N(m,\sigma^{2}{\mathbf{I}}_{d}) with respect to the Lebesgue measure. If m=0,\sigma=1, we abbreviate these as \gamma_{d} and \varphi(x). Let L^{p}({\mathbb{R}}^{d};{\mathbb{R}}^{\ell},\mu) denote the L^{p} space with the L^{p} norm for p\in[1,\infty] w.r.t. a measure \mu. To simplify the notation, we write L^{p}({\mathbb{R}}^{d},\mu) if \ell=1, L^{p}({\mathbb{R}}^{d};{\mathbb{R}}^{\ell}) if the Lebesgue measure is used, and L^{p}({\mathbb{R}}^{d}) if both hold.

2.1 Assumptions

We focus on the probability distributions satisfying several types of assumptions of weak convexity, which offer a geometric notion of regularity that is dimension-free in the study of high-dimensional distributions (Klartag, 2010). On one hand, weak-convexity regularity conditions are useful in deriving dimension-free guarantees for generative modeling and sampling from high-dimensional distributions. On the other hand, they accommodate distributions with complex shapes, including those with multiple modes.

Definition 1 (Cattiaux and Guillin, 2014)

A probability measure \mu(\mathrm{d}x)=\exp(-U)\mathrm{d}x is \kappa-semi-log-concave for some \kappa\in{\mathbb{R}} if its support \Omega\subseteq{\mathbb{R}}^{d} is convex and its potential function U\in C^{2}(\Omega) satisfies

\nabla^{2}_{x}U(x)\succeq\kappa\mathbf{I}_{d},\quad\forall x\in\Omega.

The \kappa-semi-log-concavity condition is a relaxed notion of log-concavity, since here \kappa<0 is allowed. When \kappa\geq 0, we are considering a log-concave probability measure, which is known to be unimodal (Saumard and Wellner, 2014). However, when \kappa<0, a \kappa-semi-log-concave probability measure can be multimodal.

Definition 2 (Eldan and Lee, 2018)

A probability measure \mu(\mathrm{d}x)=\exp(-U)\mathrm{d}x is \beta-semi-log-convex for some \beta>0 if its support \Omega\subseteq{\mathbb{R}}^{d} is convex and its potential function U\in C^{2}(\Omega) satisfies

\nabla^{2}_{x}U(x)\preceq\beta\mathbf{I}_{d},\quad\forall x\in\Omega.

The following definition of L-log-Lipschitz continuity is a variant of L-Lipschitz continuity. It characterizes a first-order condition on the target function rather than a second-order condition such as the \kappa-semi-log-concavity and \beta-semi-log-convexity in Definitions 1 and 2.

Definition 3

A function f:{\mathbb{R}}^{d}\to{\mathbb{R}}_{+} is L-log-Lipschitz continuous if its logarithm is L-Lipschitz continuous for some L\geq 0.

Based on the definitions, we present two assumptions on the target distribution. Assumption 1 concerns the absolute continuity and the moment condition. Assumption 2 imposes geometric regularity conditions.

Assumption 1

The probability measure \nu is absolutely continuous with respect to the Lebesgue measure and has a finite second moment.

Assumption 2

Let D:=(1/\sqrt{2})\mathrm{diam}(\mathrm{supp}(\nu)). The probability measure \nu satisfies one or more of the following conditions:

  • (i)

    \nu is \beta-semi-log-convex for some \beta>0 and \kappa-semi-log-concave for some \kappa>0 with \mathrm{supp}(\nu)={\mathbb{R}}^{d};

  • (ii)

    \nu is \kappa-semi-log-concave for some \kappa\in{\mathbb{R}} with D\in(0,\infty);

  • (iii)

    \nu=\gamma_{d,\sigma^{2}}*\rho where \rho is a probability measure supported on a Euclidean ball of radius R in {\mathbb{R}}^{d};

  • (iv)

    \nu is \beta-semi-log-convex for some \beta>0, \kappa-semi-log-concave for some \kappa\leq 0, and \frac{\mathrm{d}\nu}{\mathrm{d}\gamma_{d}}(x) is L-log-Lipschitz in x for some L\geq 0 with \mathrm{supp}(\nu)={\mathbb{R}}^{d}.

Multimodal distributions. Assumption 2 enumerates scenarios where probability distributions are endowed with geometric regularity. We examine the scenarios and clarify whether they cover multimodal distributions. Scenario (i) is referred to as the classical strong log-concavity case (κ>0\kappa>0), and thus, describes unimodal distributions. Scenario (ii) allows κ0\kappa\leq 0 and requires that the support is bounded. Mixtures of Gaussian distributions are considered in Scenario (iii), and typically are multimodal distributions. Scenario (iv) also allows κ0\kappa\leq 0 when considering a log-Lipschitz perturbation of the standard Gaussian distribution. Both Scenario (ii) and Scenario (iv) incorporate multimodal distributions due to the potential negative lower bound κ\kappa.

Lipschitz score. Lipschitz continuity of the score function is a basic regularity assumption on target distributions in the study of sampling algorithms based on Langevin and Hamiltonian dynamics. Even for high-dimensional distributions, this assumption provides a great deal of regularity. For an L-Lipschitz score function with L\geq 0, the corresponding distribution is both L-semi-log-convex and (-L)-semi-log-concave.

2.2 Variance inequalities

Variance inequalities like the Brascamp-Lieb inequality and the Cramér-Rao inequality are fundamental inequalities for explaining the regularizing effect of Gaussian denoising. Combined with κ\kappa-semi-log-concavity and β\beta-semi-log-convexity, these inequalities are crucial for deducing the Lipschitz regularity of the velocity fields of GIFs in Proposition 20-(b) and (c).

Lemma 4 (Brascamp-Lieb inequality)

Let \mu(\mathrm{d}x)=\exp(-U(x))\mathrm{d}x be a probability measure on a convex set \Omega\subseteq{\mathbb{R}}^{d} whose potential function U:\Omega\to{\mathbb{R}} is of class C^{2} and strictly convex. Then for every locally Lipschitz function f\in L^{2}(\Omega,\mu),

\mathrm{Var}_{\mu}(f)\leq\mathbb{E}_{\mu}\left[\langle\nabla_{x}f,(\nabla^{2}_{x}U)^{-1}\nabla_{x}f\rangle\right]. (2.1)

When applied to functions of the form f:x\mapsto\langle x,e\rangle for any e\in\mathbb{S}^{d-1}, the Brascamp-Lieb inequality yields an upper bound on the covariance matrix

\mathrm{Cov}_{\mu}(\mathsf{X})\preceq\mathbb{E}_{\mu}\left[(\nabla^{2}_{x}U(x))^{-1}\right] (2.2)

with equality if \mathsf{X}\sim N(m,\Sigma) with \Sigma positive definite.

Under the strong log-concavity condition, that is, μ\mu is κ\kappa-semi-log-concave with κ>0\kappa>0 and the Euclidean Bakry-Émery criterion is satisfied (Bakry and Émery, 1985), the Brascamp-Lieb inequality instantly recovers the Poincaré inequality (see Definition 48).

The Brascamp-Lieb inequality originally appears in (Brascamp and Lieb, 1976, Theorem 4.1). Alternative proofs are provided in Bobkov and Ledoux (2000); Bakry et al. (2014); Cordero-Erausquin (2017). The dimension-free inequality (2.1) can be further strengthened to obtain several variants with dimensional improvement.

Lemma 5 (Cramér-Rao inequality)

Let \mu(\mathrm{d}x)=\exp(-U(x))\mathrm{d}x be a probability measure on {\mathbb{R}}^{d} whose potential function U:{\mathbb{R}}^{d}\to{\mathbb{R}} is of class C^{2}. Then for every f\in C^{1}({\mathbb{R}}^{d}),

\mathrm{Var}_{\mu}(f)\geq\langle\mathbb{E}_{\mu}[\nabla_{x}f],\left(\mathbb{E}_{\mu}[\nabla^{2}_{x}U]\right)^{-1}\mathbb{E}_{\mu}[\nabla_{x}f]\rangle. (2.3)

When applied to functions of the form f:x\mapsto\langle x,e\rangle for any e\in\mathbb{S}^{d-1}, the Cramér-Rao inequality yields a lower bound on the covariance matrix

\mathrm{Cov}_{\mu}(\mathsf{X})\succeq\left(\mathbb{E}_{\mu}[\nabla^{2}_{x}U(x)]\right)^{-1} (2.4)

with equality as well if \mathsf{X}\sim N(m,\Sigma) with \Sigma positive definite.

The Cramér-Rao inequality plays a central role in asymptotic statistics as well as in information theory. The inequality (2.4) has an alternative derivation from the Cramér-Rao bound for the location parameter. For detailed proofs of the Cramér-Rao inequality, readers are referred to Chewi and Pooladian (2022); Dai et al. (2023), and the references therein.
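To make the two bounds concrete, the following minimal sketch (not taken from the paper) numerically checks the sandwich given by (2.2) and (2.4) for the coordinate function in one dimension; the strictly convex potential U(x)=x^{4}/4+x^{2}/2 and the integration grid are illustrative choices.

```python
# Numerical check of the Cramér-Rao lower bound (2.4) and the Brascamp-Lieb
# upper bound (2.2) for f(x) = x in one dimension, with an illustrative
# strictly convex potential U(x) = x^4/4 + x^2/2 (so Lemma 4 applies).
import numpy as np

x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
U = 0.25 * x**4 + 0.5 * x**2
d2U = 3.0 * x**2 + 1.0                   # U''(x) > 0

w = np.exp(-U)
w /= w.sum() * dx                        # normalized density of mu

mean = (x * w).sum() * dx
var = ((x - mean) ** 2 * w).sum() * dx   # Var_mu(f) for f(x) = x

bl_upper = (w / d2U).sum() * dx          # E_mu[(U'')^{-1}]   (Brascamp-Lieb)
cr_lower = 1.0 / ((d2U * w).sum() * dx)  # (E_mu[U''])^{-1}   (Cramér-Rao)

print(cr_lower, var, bl_upper)
assert cr_lower <= var <= bl_upper
```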

3 Gaussian interpolation flows

Simulation-free CNFs represent a potent class of generative models based on ODE flows. Albergo and Vanden-Eijnden (2023) and Albergo et al. (2023b) introduce an innovative CNF that is constructed using stochastic interpolation techniques, such as Gaussian denoising. They conduct a thorough investigation of this flow, particularly examining its applications and effectiveness in generative modeling.

We study the ODE flow and its associated flow map as defined by the Gaussian denoising process. This process has been explored from various perspectives, including diffusion models and stochastic interpolants. Building upon the work of Albergo and Vanden-Eijnden (2023) and Albergo et al. (2023b), we expand the stochastic interpolant framework by relaxing certain conditions on the functions ata_{t} and btb_{t}, offering a more comprehensive perspective on the Gaussian denoising process.

In our generalization, we introduce an adaptive starting point to the stochastic interpolation framework, which allows for greater flexibility in the modeling process. By examining this modified framework, we aim to demonstrate that the Gaussian denoising principle is effectively implemented within the context of stochastic interpolation.

Definition 6 (Vector interpolation)

Let z\in{\mathbb{R}}^{d} and x_{1}\in{\mathbb{R}}^{d} be two vectors in the Euclidean space and let x_{0}:=a_{0}z+b_{0}x_{1} with a_{0}>0,b_{0}\geq 0. Then we construct an interpolant between x_{0} and x_{1} over time t\in[0,1] through I_{t}(x_{0},x_{1}), defined by

I_{t}(x_{0},x_{1})=a_{t}z+b_{t}x_{1}, (3.1)

where a_{t},b_{t} satisfy

\dot{a}_{t}\leq 0,\quad\dot{b}_{t}\geq 0,\quad a_{0}>0,\quad b_{0}\geq 0,\quad a_{1}=0,\quad b_{1}=1,
a_{t}>0\ \text{for any } t\in(0,1),\quad b_{t}>0\ \text{for any } t\in(0,1),
a_{t},b_{t}\in C^{2}([0,1)),\quad a_{t}^{2}\in C^{1}([0,1]),\quad b_{t}\in C^{1}([0,1]). (3.2)
Remark 7

Compared with the vector interpolant defined by Albergo and Vanden-Eijnden (2023) (a.k.a. the one-sided interpolant in Albergo et al. (2023b)), we extend its definition by relaxing the requirements a_{0}=1, b_{0}=0 to a_{0}>0, b_{0}\geq 0. This consideration is largely motivated by analyzing the probability flow ODEs of the variance-exploding (VE) SDE and the variance-preserving (VP) SDE (Song et al., 2021b). We illustrate examples of interpolants covered by Definition 6 in Table 1.

Remark 8

We have eased the smoothness conditions for the functions ata_{t} and btb_{t} required in Albergo and Vanden-Eijnden (2023). Specifically, we consider the case where at,btC2([0,1))a_{t},b_{t}\in C^{2}([0,1)), at2C1([0,1])a_{t}^{2}\in C^{1}([0,1]), and btC1([0,1])b_{t}\in C^{1}([0,1]). This relaxation enables us to include the Föllmer flow into our framework, characterized by at=1t2a_{t}=\sqrt{1-t^{2}} and bt=tb_{t}=t. It is evident that at=1t2a_{t}=\sqrt{1-t^{2}} does not fulfill the condition atC2([0,1])a_{t}\in C^{2}([0,1]), but it does meet the requirements atC2([0,1))a_{t}\in C^{2}([0,1)) and at2C1([0,1])a_{t}^{2}\in C^{1}([0,1]).

Remark 9

The C2C^{2} regularity of at,bta_{t},b_{t} is necessary to derive the regularity of the velocity field v(t,x)v(t,x) in Eq. (3.5) concerning the time variable tt. In addition, the C1C^{1} regularity of at2,bta_{t}^{2},b_{t} is sufficient to ensure the Lipschitz regularity of the velocity field v(t,x)v(t,x) in Eq. (3.5) concerning the space variable xx.

A natural generalization of the vector interpolant (3.1) is to construct a set interpolant between two convex sets through the Minkowski sum, which is common in convex geometry. A set interpolant in turn motivates the construction of a measure interpolant between a structured source measure and a target measure.

As noted, we can construct a measure interpolation using a Gaussian convolution path. The measure interpolation is particularly relevant to Gaussian denoising and Gaussian channels in information theory as elucidated in Remark 16. Because of this connection with Gaussian denoising, we call the measure interpolation a Gaussian stochastic interpolation. The Gaussian stochastic interpolation can be understood as a collection of linear combinations of a standard Gaussian random variable and the target random variable. The coefficients of the linear combinations vary with time t[0,1]t\in[0,1] as shown in Definition 6. Later in this section, we will show this Gaussian stochastic interpolation can be transformed into a deterministic ODE flow.

Gaussian stochastic interpolation has been investigated from several perspectives in the literature. The rectified flow has been proposed in Liu et al. (2023), and its theoretical connection with optimal transport has been investigated in Liu (2022). The formulation of the rectified flow is to learn the ODE flow defined by stochastic interpolation with linear time coefficients. In Section 2.3 of Liu et al. (2023), there is a nonlinear extension of the rectified flow in which the linear coefficients are replaced by general nonlinear coefficients. Albergo et al. (2023b) extends the stochastic interpolant framework proposed in (Albergo and Vanden-Eijnden, 2023) by considering a linear combination among three random variables. In Section 3 of Albergo et al. (2023b), the original stochastic interpolant framework is recovered as a one-sided interpolant between the Gaussian distribution and the target distribution. Moreover, Lipman et al. (2023) propose a flow matching method which directly learns a Gaussian conditional probability path with a neural ODE. In Section 4.1 of (Lipman et al., 2023), the velocity fields of the variance exploding and variance preserving probability flows are shown as special instances of the flow matching framework. We summarize these formulations as Gaussian stochastic interpolation by slightly extending the original stochastic interpolant framework.

Type      VE             VP                         Linear       Föllmer           Trigonometric
a_t       \alpha_{t}     \alpha_{t}                 1-t          \sqrt{1-t^{2}}    \cos(\tfrac{\pi}{2}t)
b_t       1              \sqrt{1-\alpha_{t}^{2}}    t            t                 \sin(\tfrac{\pi}{2}t)
a_0       \alpha_{0}     \alpha_{0}                 1            1                 1
b_0       1              \sqrt{1-\alpha_{0}^{2}}    0            0                 0
Source    Convolution    Convolution                \gamma_{d}   \gamma_{d}        \gamma_{d}
Table 1: Summary of various measure interpolants, including the VE interpolant (Song et al., 2021b), the VP interpolant (Song et al., 2021b), the linear interpolant (Liu et al., 2023), the Föllmer interpolant (Dai et al., 2023), and the trigonometric interpolant (Albergo and Vanden-Eijnden, 2023). There are two types of source measures: the standard Gaussian distribution \gamma_{d} and a convolved distribution consisting of the target distribution and \gamma_{d}.
Definition 10 (Measure interpolation)

Let \mu=\mathrm{Law}(\mathsf{X}_{0}) and \nu=\mathrm{Law}(\mathsf{X}_{1}) be two probability measures satisfying \mathsf{X}_{0}=a_{0}\mathsf{Z}+b_{0}\mathsf{X}_{1}, where \mathsf{Z}\sim\gamma_{d}:=N(0,{\mathbf{I}}_{d}) is independent of \mathsf{X}_{1}. We call (\mathsf{X}_{t})_{t\in[0,1]} a Gaussian stochastic interpolation from the source measure \mu to the target measure \nu, which is defined through I_{t} over the time interval [0,1] as follows:

\mathsf{X}_{t}=I_{t}(\mathsf{X}_{0},\mathsf{X}_{1}),\quad\mathsf{X}_{0}=a_{0}\mathsf{Z}+b_{0}\mathsf{X}_{1},\quad\mathsf{Z}\sim\gamma_{d},\quad\mathsf{X}_{1}\sim\nu. (3.3)
Remark 11

It is obvious that the marginal distribution of \mathsf{X}_{t} satisfies \mathsf{X}_{t}\overset{d}{=}a_{t}\mathsf{Z}+b_{t}\mathsf{X}_{1} with \mathsf{Z}\sim\gamma_{d},\mathsf{X}_{1}\sim\nu.
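The following short sketch (not from the paper) simulates Definition 10 for the Föllmer schedule a_{t}=\sqrt{1-t^{2}}, b_{t}=t from Table 1, so that a_{0}=1, b_{0}=0 and the source is \gamma_{d}; the two-component Gaussian mixture target is purely illustrative.

```python
# Sampling the Gaussian stochastic interpolation X_t = a_t Z + b_t X_1
# (Definition 10 and Remark 11) for the Föllmer schedule from Table 1.
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 5000

def a(t):                      # Föllmer schedule: a_t = sqrt(1 - t^2)
    return np.sqrt(1.0 - t**2)

def b(t):                      # b_t = t
    return t

# illustrative target nu: mixture of N((+3,0), 0.25 I) and N((-3,0), 0.25 I)
centers = np.array([[3.0, 0.0], [-3.0, 0.0]])
labels = rng.integers(0, 2, size=n)
x1 = centers[labels] + 0.5 * rng.standard_normal((n, d))   # X_1 ~ nu
z = rng.standard_normal((n, d))                            # Z ~ gamma_d, independent of X_1

for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    xt = a(t) * z + b(t) * x1          # marginal law of X_t (Remark 11)
    print(f"t = {t:4.2f}  E||X_t||^2 ≈ {np.mean(np.sum(xt**2, axis=1)):.3f}")
```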

Motivated by the time-varying properties of the Gaussian stochastic interpolation, we derive that its marginal flow satisfies the continuity equation. This result characterizes the dynamics of the marginal density flow of the Gaussian stochastic interpolation.

Theorem 12

Suppose that Assumption 1 holds. Then the marginal flow (p_{t})_{t\in[0,1]} of the Gaussian stochastic interpolation (\mathsf{X}_{t})_{t\in[0,1]} between \mu and \nu satisfies the continuity equation

\partial_{t}p_{t}+\nabla_{x}\cdot(p_{t}v(t,x))=0,\quad(t,x)\in[0,1]\times{\mathbb{R}}^{d},\quad p_{0}(x)=\tfrac{\mathrm{d}\mu}{\mathrm{d}x}(x),\quad p_{1}(x)=\tfrac{\mathrm{d}\nu}{\mathrm{d}x}(x) (3.4)

in the weak sense with the velocity field

v(t,x):=\mathbb{E}[\dot{a}_{t}\mathsf{Z}+\dot{b}_{t}\mathsf{X}_{1}|\mathsf{X}_{t}=x],\quad t\in(0,1), (3.5)
v(0,x):=\lim_{t\downarrow 0}v(t,x),\qquad v(1,x):=\lim_{t\uparrow 1}v(t,x). (3.6)
Remark 13

We notice that x=a_{t}\mathbb{E}[\mathsf{Z}|\mathsf{X}_{t}=x]+b_{t}\mathbb{E}[\mathsf{X}_{1}|\mathsf{X}_{t}=x] due to Eq. (3.3). Then it holds that

v(t,x)=\tfrac{\dot{a}_{t}}{a_{t}}x+\left(\dot{b}_{t}-\tfrac{\dot{a}_{t}}{a_{t}}b_{t}\right)\mathbb{E}[\mathsf{X}_{1}|\mathsf{X}_{t}=x],\quad t\in(0,1). (3.7)

We also notice that, according to Tweedie's formula (cf. Lemma 49 in the Appendix), it holds that

s(t,x)=\tfrac{b_{t}}{a_{t}^{2}}\mathbb{E}\left[\mathsf{X}_{1}|\mathsf{X}_{t}=x\right]-\tfrac{1}{a_{t}^{2}}x,\quad t\in(0,1), (3.8)

where s(t,x) is the score function of the marginal distribution of \mathsf{X}_{t}\sim p_{t}.

Combining (3.7) and (3.8), it follows that the velocity field is a gradient field and its nonlinear term is the score function s(t,x), namely, for any t\in(0,1),

v(t,x)=\tfrac{\dot{b}_{t}}{b_{t}}x+\left(\tfrac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)s(t,x). (3.9)
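As an illustration of the identity (3.9), the hedged sketch below converts a score function into the corresponding velocity field; the closed-form score is exact only for the illustrative Gaussian target \nu=N(0,{\mathbf{I}}_{d}), whereas in practice s(t,x) would be a learned estimate.

```python
# Converting a score s(t, x) into the GIF velocity v(t, x) via Eq. (3.9)
# for the Föllmer schedule; the score below is exact for nu = N(0, I_d).
import numpy as np

def follmer_schedule(t):
    # a_t = sqrt(1 - t^2), b_t = t and their time derivatives
    a, b = np.sqrt(1.0 - t**2), t
    da, db = -t / np.sqrt(1.0 - t**2), 1.0
    return a, b, da, db

def score_gaussian_target(t, x):
    # for nu = N(0, I_d): X_t ~ N(0, (a_t^2 + b_t^2) I_d), so s(t, x) = -x / (a_t^2 + b_t^2)
    a, b, _, _ = follmer_schedule(t)
    return -x / (a**2 + b**2)

def velocity(t, x, score):
    # Eq. (3.9): v = (b'_t / b_t) x + (b'_t a_t^2 / b_t - a'_t a_t) s(t, x)
    a, b, da, db = follmer_schedule(t)
    return (db / b) * x + (db * a**2 / b - da * a) * score(t, x)

x = np.array([1.0, -2.0, 0.5])
print(velocity(0.5, x, score_gaussian_target))
```

For this degenerate choice, in which the source coincides with the target, the computed velocity vanishes up to floating-point error, which serves as a quick sanity check of the conversion.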
Remark 14

A relevant result has been provided in the proof of (Albergo and Vanden-Eijnden, 2023, Proposition 4) in the restricted case a_{0}=1, b_{0}=0. In this case, if \dot{a}_{0},\dot{a}_{1},\dot{b}_{0},\dot{b}_{1} are well-defined, the velocity field reads

v(0,x)=\dot{a}_{0}x+\dot{b}_{0}\mathbb{E}_{\nu}[\mathsf{X}_{1}],\quad v(1,x)=\dot{b}_{1}x+\dot{a}_{1}\mathbb{E}_{\gamma_{d}}[\mathsf{Z}]

at times 0 and 1. Otherwise, if any one of \dot{a}_{0},\dot{a}_{1},\dot{b}_{0},\dot{b}_{1} is not well-defined, the velocity field v(0,x) or v(1,x) should be considered on a case-by-case basis. In addition, we provide an alternative viewpoint of the relationship between the velocity field associated with stochastic interpolation and the score function of its marginal flow using Tweedie's formula in Lemma 49.

Remark 15 (Diffusion process)

The marginal flow of the Gaussian stochastic interpolation (3.3) coincides with the time-reversed marginal flow of a diffusion process (\overline{X}_{t})_{t\in[0,1)} (Albergo et al., 2023b, Theorem 3.5) defined by

\mathrm{d}\overline{X}_{t}=-\tfrac{\dot{b}_{1-t}}{b_{1-t}}\overline{X}_{t}\,\mathrm{d}t+\sqrt{2\left(\tfrac{\dot{b}_{1-t}}{b_{1-t}}a_{1-t}^{2}-\dot{a}_{1-t}a_{1-t}\right)}\,\mathrm{d}\overline{W}_{t}.
Remark 16 (Gaussian denoising)

The Gaussian stochastic interpolation has an information-theoretic interpretation as a time-varying Gaussian channel. Here at2a_{t}^{2} and bt2/at2b_{t}^{2}/a_{t}^{2} stand for the noise level and signal-to-noise ratio (SNR) for time t[0,1]t\in[0,1], respectively. As time t1t\to 1, we are approaching the high-SNR regime, that is, the SNR bt2/at2b_{t}^{2}/a_{t}^{2} grows to \infty. Moreover, the SNR bt2/at2b_{t}^{2}/a_{t}^{2} is monotonically increasing in time tt over [0,1][0,1]. The Gaussian noise level gets reduced through this Gaussian denoising process.

Figure 2: Snapshots of a Gaussian interpolation flow based on the Föllmer interpolant. The source distribution is the standard two-dimensional Gaussian distribution \gamma_{2}, and the target distribution is a mixture of six two-dimensional Gaussian distributions whose modes are arranged in a circle. The image panels are placed sequentially from time t=0 to time t=1.

We are now ready to define Gaussian interpolation flows by representing the continuity equation (3.4) with Lagrangian coordinates (Ambrosio and Crippa, 2014). A basic observation is that GIFs share the same marginal density flow with Gaussian stochastic interpolations. The continuity equation (3.4) plays a central role in the derandomization procedure from Gaussian stochastic interpolations to GIFs. We additionally illustrate GIFs using a two-dimensional example as in Figure 2.

Definition 17 (Gaussian interpolation flow)

Suppose that the probability measure \nu satisfies Assumption 1. If (X_{t})_{t\in[0,1]} solves the initial value problem (IVP)

\frac{\mathrm{d}X_{t}}{\mathrm{d}t}(x)=v(t,X_{t}(x)),\quad X_{0}(x)\sim\mu,\quad t\in[0,1], (3.10)

where \mu is defined in Definition 10 and the velocity field v is given by Eqs. (3.5) and (3.6), we call (X_{t})_{t\in[0,1]} a Gaussian interpolation flow associated with the target measure \nu.
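To make the definition concrete, the following self-contained sketch (not the implementation behind Figure 2) integrates the IVP (3.10) with forward Euler for the Föllmer interpolant and a two-dimensional Gaussian mixture target, for which the posterior mean \mathbb{E}[\mathsf{X}_{1}|\mathsf{X}_{t}=x] and hence the velocity (3.7) are available in closed form; the mixture parameters and the early-stopping time 1-10^{-3} are illustrative choices.

```python
# Forward-Euler integration of the GIF ODE (3.10) for the Föllmer schedule
# a_t = sqrt(1-t^2), b_t = t and a 2D Gaussian-mixture target on a circle,
# in the spirit of Figure 2. The velocity uses Eq. (3.7) with the exact
# posterior mean for a mixture nu = sum_k w_k N(m_k, sigma^2 I).
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.3**2
K = 6
angles = 2.0 * np.pi * np.arange(K) / K
means = 4.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)   # circle of modes (illustrative)
weights = np.full(K, 1.0 / K)

def posterior_mean_x1(t, x):
    """E[X_1 | X_t = x] in closed form for the Gaussian mixture target."""
    a2, b = 1.0 - t**2, t
    c = a2 + b**2 * sigma2                         # within-component variance of X_t
    diff = x[:, None, :] - b * means[None, :, :]   # (n, K, d)
    logr = -0.5 * np.sum(diff**2, axis=2) / c + np.log(weights)
    logr -= logr.max(axis=1, keepdims=True)
    r = np.exp(logr)
    r /= r.sum(axis=1, keepdims=True)              # responsibilities r_k(x)
    cond = means[None, :, :] + (b * sigma2 / c) * diff   # E[X_1 | X_t = x, component k]
    return np.sum(r[:, :, None] * cond, axis=1)

def velocity(t, x):
    """Velocity field (3.7) for the Föllmer schedule."""
    a, b = np.sqrt(1.0 - t**2), t
    da, db = -t / a, 1.0
    return (da / a) * x + (db - (da / a) * b) * posterior_mean_x1(t, x)

n, dt = 2000, 1e-3
x = rng.standard_normal((n, 2))                    # X_0 ~ gamma_2
t = 0.0
while t < 1.0 - 1e-3:                              # early stopping before t = 1
    x = x + dt * velocity(t, x)
    t += dt

dist = np.min(np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2), axis=1)
print("mean radius:", np.linalg.norm(x, axis=1).mean(),
      " mean distance to nearest mode:", dist.mean())
```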

4 Spatial Lipschitz estimates for the velocity field

We have explicated the idea of Gaussian denoising with the procedure of Gaussian stochastic interpolation or a Gaussian channel with increasing SNR w.r.t. time. By interpreting the process as an ODE flow, we derive the framework of Gaussian interpolation flows. First and foremost, an intuition is that the regularizing effect of Gaussian denoising would ensure the Lipschitz smoothness of the velocity field. Since the standard Gaussian distribution is both 11-semi-log-concave and 11-semi-log-convex, its convolution with a target distribution will maintain its high regularity as long as the target distribution satisfies the regularity conditions. We rigorously justify this intuition by establishing spatial Lipschitz estimates for the velocity field. These estimates are established based on the upper bounds and lower bounds regarding the Jacobian matrix of the velocity field v(t,x)v(t,x) according to the Cauchy-Lipschitz theorem, which are given in Proposition 20 below. To deal with the Jacobian matrix xv(t,x)\nabla_{x}v(t,x), we introduce a covariance expression of it and present the associated upper bounds and lower bounds.

The velocity field v(t,x) decomposes into a linear term and a nonlinear term, the score function s(t,x). To analyze the Jacobian \nabla_{x}v(t,x), we therefore only need to focus on \nabla_{x}s(t,x), that is, \nabla^{2}_{x}\log p_{t}(x). To ease the notation, we henceforth write \mathsf{Y} for \mathsf{X}_{1} and, correspondingly, p_{1}(y) for the density function of \mathsf{Y}.

According to Bayes' theorem, the marginal density p_{t} of \mathsf{X}_{t} satisfies

p_{t}(x)=\int p(t,x|y)p_{1}(y)\mathrm{d}y,

where \mathsf{Y}\sim p_{1}(y) and p(t,x|y)=\varphi_{b_{t}y,a_{t}^{2}}(x) is the conditional density induced by the Gaussian noise. Due to the factorization p_{t}(x)p(y|t,x)=p(t,x|y)p_{1}(y), the score function s(t,x) and its derivative \nabla_{x}s(t,x) admit the expressions

s(t,x)=-\nabla_{x}\log p(y|t,x)-\tfrac{x-b_{t}y}{a_{t}^{2}},\quad\nabla_{x}s(t,x)=-\nabla^{2}_{x}\log p(y|t,x)-\tfrac{1}{a_{t}^{2}}{\mathbf{I}}_{d}.

Thanks to these expressions, a covariance matrix expression of \nabla_{x}s(t,x) follows from the exponential family property of p(y|t,x).

Lemma 18

The conditional distribution p(y|t,x) is an exponential family distribution, and a covariance matrix expression of the log-Hessian matrix \nabla^{2}_{x}\log p(y|t,x) for any t\in(0,1) is given by

\nabla^{2}_{x}\log p(y|t,x)=-\tfrac{b_{t}^{2}}{a_{t}^{4}}\mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x), (4.1)

where \mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x) is the covariance matrix of \mathsf{Y}|\mathsf{X}_{t}=x\sim p(y|t,x). Moreover, for any t\in(0,1), it holds that

\nabla_{x}s(t,x)=\tfrac{b_{t}^{2}}{a_{t}^{4}}\mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x)-\tfrac{1}{a_{t}^{2}}{\mathbf{I}}_{d}, (4.2)

and that

\nabla_{x}v(t,x)=\tfrac{b_{t}^{2}}{a_{t}^{2}}\left(\tfrac{\dot{b}_{t}}{b_{t}}-\tfrac{\dot{a}_{t}}{a_{t}}\right)\mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x)+\tfrac{\dot{a}_{t}}{a_{t}}{\mathbf{I}}_{d}. (4.3)
Remark 19

Since \partial_{t}\left(\tfrac{b_{t}^{2}}{a_{t}^{2}}\right)=\tfrac{2b_{t}^{2}}{a_{t}^{2}}\left(\tfrac{\dot{b}_{t}}{b_{t}}-\tfrac{\dot{a}_{t}}{a_{t}}\right), it follows from (4.3) that the derivative of the SNR with respect to time t controls the dependence of \nabla_{x}v(t,x) on \mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x).

The representation (4.3) can be used to upper bound and lower bound xv(t,x)\nabla_{x}v(t,x). This technique has been widely used to deduce the regularity of the score function concerning the space variable (Mikulincer and Shenfeld, 2021, 2023; Chen et al., 2023b; Lee et al., 2023; Chen et al., 2023a). The covariance matrix expression (4.2) of the score function has a close connection with the Hatsell-Nolte identity in information theory (Hatsell and Nolte, 1971; Palomar and Verdú, 2005; Wu and Verdú, 2011; Cai and Wu, 2014; Wibisono et al., 2017; Wibisono and Jog, 2018a, b; Dytso et al., 2023a, b).
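The covariance representation (4.3) also suggests a simple Monte Carlo diagnostic: given samples from \nu, the posterior p(y|t,x)\propto\varphi_{b_{t}y,a_{t}^{2}}(x)p_{1}(y) can be approximated by self-normalized importance weights, yielding an estimate of \mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x) and thus of \nabla_{x}v(t,x). The sketch below (not from the paper) does this for the linear schedule of Table 1 and an illustrative Gaussian target.

```python
# Monte Carlo estimate of Cov(Y | X_t = x) and of the Jacobian nabla_x v(t, x)
# via Eq. (4.3), using self-normalized weights w_i ∝ phi_{b_t y_i, a_t^2}(x).
import numpy as np

rng = np.random.default_rng(2)
d, n = 2, 200_000
y = rng.standard_normal((n, d)) * 0.5 + np.array([2.0, -1.0])   # samples from an illustrative nu

def jacobian_velocity(t, x):
    a, b, da, db = 1.0 - t, t, -1.0, 1.0                 # linear schedule (Table 1)
    logw = -0.5 * np.sum((x - b * y) ** 2, axis=1) / a**2
    logw -= logw.max()
    w = np.exp(logw)
    w /= w.sum()
    mean = w @ y
    cov = (y - mean).T @ ((y - mean) * w[:, None])       # estimated Cov(Y | X_t = x)
    return (b**2 / a**2) * (db / b - da / a) * cov + (da / a) * np.eye(d)   # Eq. (4.3)

J = jacobian_velocity(0.5, np.array([1.0, 0.0]))
print("max eigenvalue of the Jacobian estimate:", np.linalg.eigvalsh(J).max())
```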

Employing the covariance expression in Lemma 18, we establish several bounds on \nabla_{x}v(t,x) in the following proposition.

Proposition 20

Let \nu(\mathrm{d}y)=p_{1}(y)\mathrm{d}y be a probability measure on {\mathbb{R}}^{d} with D:=(1/\sqrt{2})\mathrm{diam}(\mathrm{supp}(\nu)).

  • (a)

    For any t\in(0,1),

    \frac{\dot{a}_{t}}{a_{t}}{\mathbf{I}}_{d}\preceq\nabla_{x}v(t,x)\preceq\left\{\frac{b_{t}(a_{t}\dot{b}_{t}-\dot{a}_{t}b_{t})}{a_{t}^{3}}D^{2}+\frac{\dot{a}_{t}}{a_{t}}\right\}{\mathbf{I}}_{d}.

  • (b)

    Suppose that p_{1} is \beta-semi-log-convex with \beta>0 and \mathrm{supp}(p_{1})={\mathbb{R}}^{d}. Then for any t\in(0,1],

    \nabla_{x}v(t,x)\succeq\frac{\beta a_{t}\dot{a}_{t}+b_{t}\dot{b}_{t}}{\beta a_{t}^{2}+b_{t}^{2}}{\mathbf{I}}_{d}.

  • (c)

    Suppose that p_{1} is \kappa-semi-log-concave with \kappa\in{\mathbb{R}}. Then for any t\in(t_{0},1],

    \nabla_{x}v(t,x)\preceq\frac{\kappa a_{t}\dot{a}_{t}+b_{t}\dot{b}_{t}}{\kappa a_{t}^{2}+b_{t}^{2}}{\mathbf{I}}_{d},

    where t_{0} is the root of the equation \kappa+\frac{b_{t}^{2}}{a_{t}^{2}}=0 over t\in(0,1) if \kappa<0 and t_{0}=0 if \kappa\geq 0.

  • (d)

    Fix a probability measure \rho on {\mathbb{R}}^{d} supported on a Euclidean ball of radius R, and let \nu:=\gamma_{d,\sigma^{2}}*\rho with \sigma>0. Then for any t\in(0,1),

    \frac{\dot{a}_{t}a_{t}+\sigma^{2}\dot{b}_{t}b_{t}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}{\mathbf{I}}_{d}\preceq\nabla_{x}v(t,x)\preceq\left\{\frac{a_{t}b_{t}(a_{t}\dot{b}_{t}-\dot{a}_{t}b_{t})}{(a_{t}^{2}+\sigma^{2}b_{t}^{2})^{2}}R^{2}+\frac{\dot{a}_{t}a_{t}+\sigma^{2}\dot{b}_{t}b_{t}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}\right\}{\mathbf{I}}_{d}.

  • (e)

    Suppose that \frac{\mathrm{d}\nu}{\mathrm{d}\gamma_{d}}(x) is L-log-Lipschitz for some L\geq 0. Then for any t\in(0,1),

    \left\{\left(\tfrac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\left(-B_{t}-L^{2}\left(\tfrac{b_{t}}{a_{t}^{2}+b_{t}^{2}}\right)^{2}\right)+\tfrac{\dot{a}_{t}a_{t}+\dot{b}_{t}b_{t}}{a_{t}^{2}+b_{t}^{2}}\right\}{\mathbf{I}}_{d}
    \preceq\nabla_{x}v(t,x)\preceq\left\{\left(\tfrac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)B_{t}+\tfrac{\dot{a}_{t}a_{t}+\dot{b}_{t}b_{t}}{a_{t}^{2}+b_{t}^{2}}\right\}{\mathbf{I}}_{d},

    where B_{t}:=5Lb_{t}(a_{t}^{2}+b_{t}^{2})^{-\frac{3}{2}}(L+(\log(\sqrt{a_{t}^{2}+b_{t}^{2}}/b_{t}))^{-\frac{1}{2}}).

Comparing part (a) with part (d) in Proposition 20, we can see that the bounds in (a) are consistent with those in (d) in the sense that (a) is a limiting case of part (d) as σ0\sigma\to 0. The lower bound in part (a) blows up at time t=1t=1 owing to a1=0,a_{1}=0, while in part (d) it behaves well since the lower bound in part (d) coincides with a lower bound indicated by the 1σ2\tfrac{1}{\sigma^{2}}-semi-log-convex property. It reveals that the regularity of the velocity field v(t,x)v(t,x) with respect to the space variable xx improves when the target random variable is bounded and is subject to Gaussian perturbation.

The lower bound in part (b) and the upper bound in part (c) are tight in the sense that both of them are attainable for a Gaussian target distribution, that is,

\nabla_{x}v(t,x)=\frac{\beta a_{t}\dot{a}_{t}+b_{t}\dot{b}_{t}}{\beta a_{t}^{2}+b_{t}^{2}}{\mathbf{I}}_{d}\quad\text{if } \nu=\gamma_{d,1/\beta}.

The upper and lower bounds in Proposition 20-(a) and (e) become vacuous as they both blow up at time t=1. The intuition behind this is that the Jacobian matrix of the velocity field can be both lower and upper bounded at time t=1 only if the score function of the target measure is Lipschitz continuous in the space variable x. Under an additional Lipschitz score assumption (equivalently, \beta-semi-log-convexity and \kappa-semi-log-concavity for some \beta=-\kappa\geq 0), the upper and lower bounds in parts (a) and (e) can be strengthened at time t=1 based on the lower bound in part (b) and the upper bound in part (c).

According to Proposition 20-(a) and (c), there are two upper bounds available that shall be compared with each other. One is the D^{2}-based bound in part (a), and the other is the \kappa-based bound in part (c). According to the proof of Proposition 20 given in the Appendix, these two upper bounds are equal if and only if the corresponding upper bounds on \mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x) are equal, that is,

D^{2}=\left(\kappa+\frac{b_{t}^{2}}{a_{t}^{2}}\right)^{-1}. (4.4)

Then the critical case is \kappa D^{2}=1 since simplifying Eq. (4.4) reveals that

D^{-2}-\kappa=\frac{b_{t}^{2}}{a_{t}^{2}}. (4.5)

We note that b_{t}^{2}/a_{t}^{2}, ranging over (0,\infty), is monotonically increasing w.r.t. t\in(0,1). Suppose that \kappa D^{2}>1. Then (4.5) has no root over t\in(0,1), which implies that the \kappa-based bound is tighter over [0,1), i.e.,

D^{2}>\left(\kappa+\frac{b_{t}^{2}}{a_{t}^{2}}\right)^{-1},\quad\forall t\in[0,1).

Otherwise, suppose that \kappa D^{2}<1. Then (4.5) has a root t_{1}\in(0,1), which implies that the D^{2}-based bound is tighter over [0,t_{1}), i.e.,

D^{2}<\left(\kappa+\frac{b_{t}^{2}}{a_{t}^{2}}\right)^{-1},\quad\forall t\in[0,t_{1}),

and that the \kappa-based bound is tighter over [t_{1},1), i.e.,

D^{2}\geq\left(\kappa+\frac{b_{t}^{2}}{a_{t}^{2}}\right)^{-1},\quad\forall t\in[t_{1},1).
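The following small sketch (not from the paper) illustrates this comparison for the Föllmer schedule with the illustrative values \kappa=-1 and D=1/2, locating the crossing time t_{1} from (4.5) and evaluating the two upper bounds on \mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x) on either side of it.

```python
# Comparing the D^2-based and kappa-based upper bounds on Cov(Y | X_t = x)
# for a_t = sqrt(1-t^2), b_t = t; t_1 solves b_t^2/a_t^2 = D^{-2} - kappa (Eq. 4.5).
import numpy as np

kappa, D = -1.0, 0.5                       # illustrative values with kappa D^2 < 1

def snr(t):                                # b_t^2 / a_t^2 for the Föllmer schedule
    return t**2 / (1.0 - t**2)

target = 1.0 / D**2 - kappa                # right-hand side of (4.5)
t1 = np.sqrt(target / (1.0 + target))      # closed-form root for this schedule
print(f"t_1 = {t1:.4f}")

for t in [0.2, 0.5, t1, 0.95]:
    kappa_bound = 1.0 / (kappa + snr(t)) if kappa + snr(t) > 0 else np.inf   # only valid for t > t_0
    print(f"t = {t:.3f}:  D^2-bound = {D**2:.3f},  kappa-bound = {kappa_bound:.3f}")
```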

Next, we present several upper bounds on the maximum eigenvalue of the Jacobian matrix of the velocity field λmax(xv(t,x))\lambda_{\max}(\nabla_{x}v(t,x)) and its exponential estimates for studying the Lipschitz regularity of the flow maps as noted in Lemma 29.

Corollary 21

Let \nu be a probability measure on {\mathbb{R}}^{d} with D:=(1/\sqrt{2})\mathrm{diam}(\mathrm{supp}(\nu)) and suppose that \nu is \kappa-semi-log-concave with \kappa\geq 0.

  • (a)

    If \kappa D^{2}\geq 1, then

    \lambda_{\max}(\nabla_{x}v(t,x))\leq\theta_{t}:=\frac{\kappa a_{t}\dot{a}_{t}+b_{t}\dot{b}_{t}}{\kappa a_{t}^{2}+b_{t}^{2}},\quad t\in[0,1]. (4.6)

  • (b)

    If \kappa D^{2}<1, then

    \lambda_{\max}(\nabla_{x}v(t,x))\leq\theta_{t}:=\begin{cases}\frac{b_{t}^{2}}{a_{t}^{2}}\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)D^{2}+\frac{\dot{a}_{t}}{a_{t}},&t\in[0,t_{1}),\\ \frac{\kappa a_{t}\dot{a}_{t}+b_{t}\dot{b}_{t}}{\kappa a_{t}^{2}+b_{t}^{2}},&t\in[t_{1},1],\end{cases} (4.7)

    where t_{1} solves (4.5).

Corollary 22

Let \nu be a probability measure on {\mathbb{R}}^{d} with D:=(1/\sqrt{2})\mathrm{diam}(\mathrm{supp}(\nu))<\infty and suppose that \nu is \kappa-semi-log-concave with \kappa<0. Then

\lambda_{\max}(\nabla_{x}v(t,x))\leq\theta_{t}:=\begin{cases}\frac{b_{t}^{2}}{a_{t}^{2}}\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)D^{2}+\frac{\dot{a}_{t}}{a_{t}},&t\in[0,t_{1}),\\ \frac{\kappa a_{t}\dot{a}_{t}+b_{t}\dot{b}_{t}}{\kappa a_{t}^{2}+b_{t}^{2}},&t\in[t_{1},1],\end{cases} (4.8)

where t_{1} solves (4.5).

Corollary 23

Fix a probability measure \rho on {\mathbb{R}}^{d} supported on a Euclidean ball of radius R and let \nu:=\gamma_{d,\sigma^{2}}*\rho with \sigma>0. Then

\lambda_{\max}(\nabla_{x}v(t,x))\leq\theta_{t}:=\frac{\dot{a}_{t}a_{t}+\sigma^{2}\dot{b}_{t}b_{t}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}+\frac{a_{t}b_{t}(a_{t}\dot{b}_{t}-\dot{a}_{t}b_{t})}{(a_{t}^{2}+\sigma^{2}b_{t}^{2})^{2}}R^{2}. (4.9)
Corollary 24

Suppose that \nu is \kappa-semi-log-concave for some \kappa\leq 0, and \frac{\mathrm{d}\nu}{\mathrm{d}\gamma_{d}}(x) is L-log-Lipschitz for some L\geq 0. Then

\lambda_{\max}(\nabla_{x}v(t,x))\leq\theta_{t}:=\begin{cases}\left(\tfrac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)B_{t}+\tfrac{\dot{a}_{t}a_{t}+\dot{b}_{t}b_{t}}{a_{t}^{2}+b_{t}^{2}},&t\in[0,t_{2}),\\ \frac{\kappa a_{t}\dot{a}_{t}+b_{t}\dot{b}_{t}}{\kappa a_{t}^{2}+b_{t}^{2}},&t\in[t_{2},1],\end{cases} (4.10)

where B_{t}:=5Lb_{t}(a_{t}^{2}+b_{t}^{2})^{-\frac{3}{2}}(L+(\log(\sqrt{a_{t}^{2}+b_{t}^{2}}/b_{t}))^{-\frac{1}{2}}) and t_{2}\in(t_{0},1).

5 Well-posedness and Lipschitz flow maps

In this section, we study the well-posedness of GIFs and the Lipschitz properties of their flow maps. We also show that the marginal distributions of GIFs satisfy the log-Sobolev inequality and the Poincaré inequality if Assumptions 1 and 2 are satisfied.

Theorem 25 (Well-posedness)

Suppose Assumptions 1 and 2-(i), (iii), or (iv) are satisfied. Then there exists a unique solution (X_{t})_{t\in[0,1]} to the IVP (3.10). Moreover, the push-forward measure satisfies {X_{t}}_{\#}\mu=\mathrm{Law}(a_{t}\mathsf{Z}+b_{t}\mathsf{X}_{1}) with \mathsf{Z}\sim\gamma_{d},\mathsf{X}_{1}\sim\nu.

Theorem 26

Suppose Assumptions 1 and 2-(ii) are satisfied. For any \underline{t}\in(0,1), there exists a unique solution (X_{t})_{t\in[0,1-\underline{t}]} to the IVP (3.10). Moreover, the push-forward measure satisfies {X_{t}}_{\#}\mu=\mathrm{Law}(a_{t}\mathsf{Z}+b_{t}\mathsf{X}_{1}) with \mathsf{Z}\sim\gamma_{d},\mathsf{X}_{1}\sim\nu.

Corollary 27 (Time-reversed flow)

Suppose Assumptions 1 and 2-(i), (iii), or (iv) are satisfied. Then the time-reversed flow (X^{*}_{t})_{t\in[0,1]} associated with \nu is a unique solution to the IVP:

\frac{\mathrm{d}X^{*}_{t}}{\mathrm{d}t}(x)=-v(1-t,X^{*}_{t}(x)),\quad X^{*}_{0}(x)\sim\nu,\quad t\in[0,1]. (5.1)

The push-forward measure satisfies {X^{*}_{t}}_{\#}\nu=\mathrm{Law}(a_{1-t}\mathsf{Z}+b_{1-t}\mathsf{X}_{1}) where \mathsf{Z}\sim\gamma_{d},\mathsf{X}_{1}\sim\nu. Moreover, the flow map satisfies X^{*}_{t}(x)=X_{t}^{-1}(x).

Corollary 28

Suppose Assumptions 1 and 2-(ii) are satisfied. For any \underline{t}\in(0,1), the time-reversed flow (X^{*}_{t})_{t\in[\underline{t},1]} associated with \nu is a unique solution to the IVP:

\frac{\mathrm{d}X^{*}_{t}}{\mathrm{d}t}(x)=-v(1-t,X^{*}_{t}(x)),\quad X^{*}_{\underline{t}}(x)\sim\mathrm{Law}(a_{1-\underline{t}}\mathsf{Z}+b_{1-\underline{t}}\mathsf{X}_{1}),\quad t\in[\underline{t},1], (5.2)

where \mathsf{Z}\sim\gamma_{d},\mathsf{X}_{1}\sim\nu. The push-forward measure satisfies {X^{*}_{t}}_{\#}\nu=\mathrm{Law}(a_{1-t}\mathsf{Z}+b_{1-t}\mathsf{X}_{1}). Moreover, the flow map satisfies X^{*}_{t}(x)=X_{t}^{-1}(x).

Based on the well-posedness of the flow, we can provide an upper bound on the Lipschitz constant of the induced flow map.

Lemma 29

Suppose that a flow (X_{t})_{t\in[0,1]} is well-posed with a velocity field v(t,x):[0,1]\times{\mathbb{R}}^{d}\to{\mathbb{R}}^{d} of class C^{1} in x, and that for any (t,x)\in[0,1]\times{\mathbb{R}}^{d}, it holds that \nabla_{x}v(t,x)\preceq\theta_{t}{\mathbf{I}}_{d}. Let the flow map X_{s,t}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{d} be of class C^{1} in x for any 0\leq s\leq t\leq 1. Then the flow map X_{s,t} is Lipschitz continuous with an upper bound on its Lipschitz constant given by

\|\nabla_{x}X_{s,t}(x)\|_{2,2}\leq\exp\left(\int_{s}^{t}\theta_{u}\mathrm{d}u\right). (5.3)
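As a quick numerical illustration of the bound (5.3), the sketch below (not from the paper) integrates the upper bound \theta_{t} from Proposition 20-(c) for the Föllmer schedule and an illustrative value \kappa=4, and recovers the constant 1/\sqrt{\kappa a_{0}^{2}+b_{0}^{2}}=1/\sqrt{\kappa} that appears in Proposition 30-(i) below.

```python
# exp(int_0^1 theta_u du) with theta_t = (kappa a_t a'_t + b_t b'_t) / (kappa a_t^2 + b_t^2)
# (Proposition 20-(c), kappa > 0), evaluated for the Föllmer schedule.
import numpy as np

kappa = 4.0                                    # illustrative strong log-concavity constant

def theta(t):
    a, b = np.sqrt(1.0 - t**2), t
    da, db = -t / np.sqrt(1.0 - t**2), 1.0
    return (kappa * a * da + b * db) / (kappa * a**2 + b**2)

t = np.linspace(1e-6, 1.0 - 1e-6, 2_000_001)   # avoid the endpoints where a'_t is singular
integral = np.mean(theta(t)) * (t[-1] - t[0])  # simple Riemann approximation of the integral
print("exp(int theta) ≈", np.exp(integral))
print("1/sqrt(kappa)   =", 1.0 / np.sqrt(kappa))
```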

Using Lemma 29, we show that the flow map of a GIF is Lipschitz continuous in the space variable x.

Proposition 30 (Lipschitz mappings)

Suppose that Assumptions 1 and 2-(i) hold.

  • (i)

    If \nu is \kappa-semi-log-concave for some \kappa>0, then the flow map X_{1}(x) is a Lipschitz mapping, that is,

    \|\nabla_{x}X_{1}(x)\|_{2,2}\leq\frac{1}{\sqrt{\kappa a_{0}^{2}+b_{0}^{2}}},\quad\forall x\in{\mathbb{R}}^{d}.

    In particular, if a_{0}=1 and b_{0}=0, then

    \|\nabla_{x}X_{1}(x)\|_{2,2}\leq\frac{1}{\sqrt{\kappa}},\quad\forall x\in{\mathbb{R}}^{d}.

  • (ii)

    If \nu is \beta-semi-log-convex for some \beta>0, then the time-reversed flow map X^{*}_{1}(x) is a Lipschitz mapping, that is,

    \|\nabla_{x}X^{*}_{1}(x)\|_{2,2}\leq\sqrt{\beta a_{0}^{2}+b_{0}^{2}},\quad\forall x\in\mathrm{supp}(\nu).

    In particular, if a_{0}=1 and b_{0}=0, then

    \|\nabla_{x}X^{*}_{1}(x)\|_{2,2}\leq\sqrt{\beta},\quad\forall x\in\mathrm{supp}(\nu).
Proposition 31 (Gaussian mixtures)

Suppose that Assumptions 1 and 2-(iii) hold. Then the flow map X_{1}(x) is a Lipschitz mapping, that is,

\|\nabla_{x}X_{1}(x)\|_{2,2}\leq\frac{\sigma}{\sqrt{a_{0}^{2}+\sigma^{2}b_{0}^{2}}}\exp\left(\frac{a_{0}^{2}}{a_{0}^{2}+\sigma^{2}b_{0}^{2}}\cdot\frac{R^{2}}{2\sigma^{2}}\right),\quad\forall x\in{\mathbb{R}}^{d}.

In particular, if a_{0}=1 and b_{0}=0, then

\|\nabla_{x}X_{1}(x)\|_{2,2}\leq\sigma\exp\left(\frac{R^{2}}{2\sigma^{2}}\right),\quad\forall x\in{\mathbb{R}}^{d}.

Moreover, the time-reversed flow map X^{*}_{1}(x) is a Lipschitz mapping, that is,

\|\nabla_{x}X^{*}_{1}(x)\|_{2,2}\leq\sqrt{\sigma^{-2}a_{0}^{2}+b_{0}^{2}},\quad\forall x\in\mathrm{supp}(\nu).

In particular, if a_{0}=1 and b_{0}=0, then

\|\nabla_{x}X^{*}_{1}(x)\|_{2,2}\leq\frac{1}{\sigma},\quad\forall x\in\mathrm{supp}(\nu).
Remark 32

Well-posed GIFs produce diffeomorphisms that transport the source measure onto the target measure. The diffeomorphism property of the transport maps is relevant to the auto-encoding and cycle consistency properties of their generative modeling applications. We defer a detailed discussion to Section 6.

Early stopping implicitly mollifies the target measure with a small Gaussian noise. For image generation tasks (with bounded pixel values), the mollified target measure is indeed a Gaussian mixture distribution of the type considered in Proposition 31. The regularity of the target measure is largely enhanced through such mollification, especially when the target measure is supported on a low-dimensional manifold in accordance with the data manifold hypothesis. Therefore, although such a diffeomorphism X_{1}(x) may not be well-defined for general bounded target measures, an off-the-shelf solution would be to perturb the target measure with a small Gaussian noise or to employ the early stopping technique. Both approaches smooth the landscape of the target measure.

Proposition 33

Suppose the target measure \nu satisfies the log-Sobolev inequality with constant C_{\mathrm{LS}}(\nu). Then the marginal distribution of the GIF (p_{t})_{t\in[0,1]} satisfies the log-Sobolev inequality, and its log-Sobolev constant C_{\mathrm{LS}}(p_{t}) is bounded as

C_{\mathrm{LS}}(p_{t})\leq a_{t}^{2}+b_{t}^{2}C_{\mathrm{LS}}(\nu).

Moreover, suppose the target measure \nu satisfies the Poincaré inequality with constant C_{\mathrm{P}}(\nu). Then the marginal distribution of the GIF (p_{t})_{t\in[0,1]} satisfies the Poincaré inequality, and its Poincaré constant C_{\mathrm{P}}(p_{t}) is bounded as

C_{\mathrm{P}}(p_{t})\leq a_{t}^{2}+b_{t}^{2}C_{\mathrm{P}}(\nu).
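As a simple sanity check of Proposition 33 (under the usual normalization in which C_{\mathrm{LS}}(N(0,\sigma^{2}{\mathbf{I}}_{d}))=\sigma^{2} and likewise for the Poincaré constant), consider the Gaussian target \nu=\gamma_{d,\sigma^{2}}. Then p_{t}=N(0,(a_{t}^{2}+b_{t}^{2}\sigma^{2}){\mathbf{I}}_{d}) and

C_{\mathrm{LS}}(p_{t})=a_{t}^{2}+b_{t}^{2}\sigma^{2}=a_{t}^{2}+b_{t}^{2}C_{\mathrm{LS}}(\nu),

so the bound in Proposition 33 is attained with equality for Gaussian targets; the same computation applies verbatim to the Poincaré constant.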

The log-Sobolev and Poincaré inequalities (see Definitions 47 and 48) are fundamental tools for establishing convergence guarantees for Langevin Monte Carlo algorithms. From an algorithmic viewpoint, the predictor-corrector algorithm in score-based diffusion models and the corresponding probability flow ODEs essentially combine the ODE numerical solver (performing as the predictor) and the overdamped Langevin diffusion (performing as the corrector) to simulate samples from the marginal distributions (Song et al., 2021b). Proposition 33 shows that the marginal distributions all satisfy the log-Sobolev and Poincaré inequalities under mild assumptions on the target distribution. This conclusion suggests that Langevin Monte Carlo algorithms are certified to have convergence guarantees for sampling from the marginal distributions of GIFs. Furthermore, the target distributions covered in Assumption 2 are shown to satisfy the log-Sobolev and Poincaré inequalities (Mikulincer and Shenfeld, 2021; Dai et al., 2023; Fathi et al., 2023), which suggests that the assumptions of Proposition 33 generally hold.

6 Applications to generative modeling

Auto-encoding is a primary principle in learning a latent representation with generative models (Goodfellow et al., 2016, Chapter 14). Meanwhile, the concept of cycle consistency is important to unpaired image-to-image translation between the source and target domains (Zhu et al., 2017). Su et al. (2023) propose dual diffusion implicit bridges (DDIBs) for image-to-image translation, which exhibit a strong pattern of near-exact auto-encoding and image-to-image translation. DDIBs are built upon denoising diffusion implicit models (DDIMs), which share the same probability flow ODE as the VESDE (considered as the VE interpolant in Table 1), as pointed out by (Song et al., 2021a, Proposition 1). First, DDIBs obtain latent embeddings of source images encoded with one DDIM operating in the source domain. These embeddings are then decoded using another DDIM trained in the target domain to construct target images. The whole process, consisting of two DDIMs, appears to be cycle consistent up to numerical errors. Several phenomena of auto-encoding and cycle consistency are observed in the unpaired data generation procedure with DDIBs.

We replicate the 2D experiments of Su et al. (2023) in Figures 3 and 4 to show the phenomena of approximate auto-encoding and cycle consistency of GIFs (the implementation is based on the GitHub repository at https://github.com/suxuann/ddib). To elucidate the empirical auto-encoding and cycle consistency for measure transport, we derive Corollaries 34 and 35 below and analyze the transport maps defined by GIFs (covering the probability flow ODE of the VESDE used by DDIBs). We work in the continuous-time framework at the population level, which excludes learning errors such as time discretization errors and velocity field estimation errors, and show that the transport maps possess the exact auto-encoding and cycle consistency properties at the population level.

Figure 3: An illustration of auto-encoding using DDIBs. The Concentric Rings data in the source domain (the first panel) are encoded into the latent domain (the second panel) and then decoded back into the source domain (the third panel). The consistent color pattern and pointwise correspondences across the domains indicate that both the learned encoder map and the learned decoder map are approximately Lipschitz continuous in the space variable. A justification of this auto-encoding observation is given in Corollary 34, where we prove that the composition of the encoder map and the decoder map is the identity map.
Corollary 34 (Auto-encoding)

Suppose Assumptions 1 and 2-(i), (iii), or (iv) hold for a target measure ν\nu. The Gaussian interpolation flow (Xt)t[0,1](X_{t})_{t\in[0,1]} and its time-reversed flow (Xt)t[0,1](X_{t}^{*})_{t\in[0,1]} form an auto-encoder with a Lipschitz encoder X1(x)X_{1}^{*}(x) and a Lipschitz decoder X1(x)X_{1}(x). The auto-encoding property holds in the sense that

X1X1=𝐈d.X_{1}\circ X_{1}^{*}={\mathbf{I}}_{d}. (6.1)
Corollary 35 (Cycle consistency)

Suppose Assumptions 1 and 2-(i), (iii), or (iv) hold for the target measures ν1\nu_{1} and ν2\nu_{2}. For the target measure ν1\nu_{1}, we define the Gaussian interpolation flow (X1,t)t[0,1](X_{1,t})_{t\in[0,1]} and its time-reversed flow (X1,t)t[0,1](X_{1,t}^{*})_{t\in[0,1]}. We also define the Gaussian interpolation flow (X2,t)t[0,1](X_{2,t})_{t\in[0,1]} and its time-reversed flow (X2,t)t[0,1](X_{2,t}^{*})_{t\in[0,1]} for the target measure ν2\nu_{2} using the same ata_{t} and btb_{t}. Then the transport maps X1,1(x)X_{1,1}(x), X1,1(x)X_{1,1}^{*}(x), X2,1(x)X_{2,1}(x), and X2,1(x)X_{2,1}^{*}(x) are Lipschitz continuous in the space variable xx. Furthermore, the cycle consistency property holds in the sense that

X1,1X2,1X2,1X1,1=𝐈d.X_{1,1}\circ X_{2,1}^{*}\circ X_{2,1}\circ X_{1,1}^{*}={\mathbf{I}}_{d}. (6.2)

Corollaries 34 and 35 show that the auto-encoding and cycle consistency properties hold exactly for the flows at the population level. These results provide insight into the approximate auto-encoding and cycle consistency properties observed at the sample level.

Figure 4: An illustration of cycle consistency using DDIBs. The cycle consistency property is manifested through the consistency of color patterns across the transformations. We transform the Moons data in the source domain onto the Concentric Squares data in the target domain, and then complete the cycle by mapping the target data back to the source domain. The latent spaces play a central role in the bidirectional translation. We provide a proof in Corollary 35 accounting for the cycle consistency property.

Several types of errors are introduced in the training of GIFs. On the one hand, the approximation made in specifying the source measure influences the learned distribution. On the other hand, the approximation of the velocity field also contributes to the distribution learning error. We use stability analysis from the theory of differential equations to quantify the potential effects of these errors.

Corollary 36

Suppose Assumptions 1 and 2-(i), (iii), or (iv) hold. It holds that

C1:=supxdxX1(x)2,2<,C2:=sup(t,x)[0,1]×dxv(t,x)2,2<.\displaystyle C_{1}:=\sup_{x\in{\mathbb{R}}^{d}}\|\nabla_{x}X_{1}(x)\|_{2,2}<\infty,\quad C_{2}:=\sup_{(t,x)\in[0,1]\times{\mathbb{R}}^{d}}\|\nabla_{x}v(t,x)\|_{2,2}<\infty.
Proposition 37 (Stability in the source distribution)

Suppose Assumptions 1 and 2-(i), (iii), or (iv) hold. If the source measure \mu=\mathrm{Law}(a_{0}\mathsf{Z}+b_{0}\mathsf{X}_{1}) is replaced with the Gaussian measure \gamma_{d,a_{0}^{2}}, then the stability of the transport map X_{1} is quantified by the following bound on the W_{2} distance between the push-forward measure {X_{1}}_{\#}\gamma_{d,a_{0}^{2}} and the target measure \nu=\mathrm{Law}(\mathsf{X}_{1}):

W2(X1#γd,a02,ν)C1b0𝔼ν[𝖷𝟣2]exp(C2d).W_{2}({X_{1}}_{\#}\gamma_{d,a_{0}^{2}},\nu)\leq C_{1}b_{0}\sqrt{\mathbb{E}_{\nu}[\|\mathsf{X_{1}}\|^{2}]}\exp(C_{2}d). (6.3)

The stability analysis in Proposition 37 provides insights into the selection of source measures for learning probability flow ODEs and GIFs. The error bound (6.3) demonstrates that when the signal intensity is reasonably small in the source measure, that is, b01b_{0}\ll 1, the distribution estimation error, induced by the approximation with a Gaussian source measure, is small as well in the sense of the quadratic Wasserstein distance. Using a Gaussian source measure to replace the true convolution source measure is a common approximation method for learning probability flow ODEs and GIFs. Our analysis shows this replacement is reasonable for the purpose of distribution estimation.

The Alekseev-Gröbner formula and its stochastic variants (Del Moral and Singh, 2022) have been shown to be effective in quantifying the stability of well-posed ODE and SDE flows against perturbations of their velocity fields or drifts (Bortoli, 2022; Benton et al., 2023). We state these results below for convenience.

Lemma 38

(Hairer et al., 1993, Theorem 14.5) Let (Xt)t[0,1](X_{t})_{t\in[0,1]} and (Yt)t[0,1](Y_{t})_{t\in[0,1]} solve the following IVPs, respectively

dXtdt\displaystyle\frac{\mathrm{d}X_{t}}{\mathrm{d}t} =v(t,Xt),X0=x0,t[0,1],\displaystyle=v(t,X_{t}),\quad X_{0}=x_{0},\quad t\in[0,1],
dYtdt\displaystyle\frac{\mathrm{d}Y_{t}}{\mathrm{d}t} =v~(t,Yt),Y0=x0,t[0,1],\displaystyle=\tilde{v}(t,Y_{t}),\quad~{}Y_{0}=x_{0},\quad t\in[0,1],

where v(t,x):[0,1]×ddv(t,x):[0,1]\times{\mathbb{R}}^{d}\to{\mathbb{R}}^{d} and v~(t,x):[0,1]×dd\tilde{v}(t,x):[0,1]\times{\mathbb{R}}^{d}\to{\mathbb{R}}^{d} are the velocity fields.

  • (i)

    Suppose that vv is of class C1C^{1} in xx. Then the Alekseev-Gröbner formula for the difference Xt(x0)Yt(x0)X_{t}(x_{0})-Y_{t}(x_{0}) is given by

    Xt(x0)Yt(x0)=0t(xXs,t)(Ys(x0))(v(s,Ys(x0))v~(s,Ys(x0)))dsX_{t}(x_{0})-Y_{t}(x_{0})=\int_{0}^{t}(\nabla_{x}X_{s,t})(Y_{s}(x_{0}))^{\top}\left(v(s,Y_{s}(x_{0}))-\tilde{v}(s,Y_{s}(x_{0}))\right)\mathrm{d}s (6.4)

    where xXs,t(x)\nabla_{x}X_{s,t}(x) satisfies the variational equation

    t(xXs,t(x))=(xv)(t,Xs,t(x))xXs,t(x),xXs,s(x)=𝐈d.\partial_{t}(\nabla_{x}X_{s,t}(x))=(\nabla_{x}v)(t,X_{s,t}(x))\nabla_{x}X_{s,t}(x),\quad\nabla_{x}X_{s,s}(x)={\mathbf{I}}_{d}. (6.5)
  • (ii)

    Suppose that v~\tilde{v} is of class C1C^{1} in xx. Then the Alekseev-Gröbner formula for the difference Yt(x0)Xt(x0)Y_{t}(x_{0})-X_{t}(x_{0}) is given by

    Yt(x0)Xt(x0)=0t(xYs,t)(Xs(x0))(v~(s,Xs(x0))v(s,Xs(x0)))dsY_{t}(x_{0})-X_{t}(x_{0})=\int_{0}^{t}(\nabla_{x}Y_{s,t})(X_{s}(x_{0}))^{\top}\left(\tilde{v}(s,X_{s}(x_{0}))-v(s,X_{s}(x_{0}))\right)\mathrm{d}s (6.6)

    where xYs,t(x)\nabla_{x}Y_{s,t}(x) satisfies the variational equation

    t(xYs,t(x))=(xv~)(t,Ys,t(x))xYs,t(x),xYs,s(x)=𝐈d.\partial_{t}(\nabla_{x}Y_{s,t}(x))=(\nabla_{x}\tilde{v})(t,Y_{s,t}(x))\nabla_{x}Y_{s,t}(x),\quad\nabla_{x}Y_{s,s}(x)={\mathbf{I}}_{d}. (6.7)
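
As a quick numerical sanity check of formula (6.4) (an illustration under assumed toy dynamics, not part of the lemma), consider the one-dimensional fields v(t,x)=-x and \tilde{v}(t,x)=-x+\delta, for which X_{t}(x_{0})=x_{0}e^{-t}, \nabla_{x}X_{s,t}=e^{-(t-s)}, and both sides of (6.4) equal -\delta(1-e^{-t}):

```python
import numpy as np

# Toy check of the Alekseev-Groebner formula (6.4) with v(t, x) = -x and
# v_tilde(t, x) = -x + delta in one dimension.
delta, t, x0 = 0.3, 1.0, 2.0

# Exact solutions: X_t = x0 * exp(-t); Y solves dY/dt = -Y + delta with Y_0 = x0.
X_t = x0 * np.exp(-t)
Y = lambda s: delta + (x0 - delta) * np.exp(-s)

# Right-hand side of (6.4): integral of grad_x X_{s,t}(Y_s) * (v - v_tilde)(s, Y_s) ds,
# where grad_x X_{s,t} = exp(-(t - s)) and (v - v_tilde) = -delta.
s_grid = np.linspace(0.0, t, 100001)
integrand = np.exp(-(t - s_grid)) * (-delta)
rhs = np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(s_grid))   # trapezoidal rule

print(X_t - Y(t), rhs)   # both are approximately -delta * (1 - exp(-t))
```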

Exploiting the Alekseev-Gröbner formulas in Lemma 38 and uniform Lipschitz properties of the velocity field, we deduce two error bounds in terms of the quadratic Wasserstein (W2W_{2}) distance to show the stability of the ODE flow when the velocity field is not accurate.

Proposition 39 (Stability in the velocity field)

Suppose Assumptions 1 and 2 hold. Let q~t\tilde{q}_{t} denote the density function of Yt#μ{Y_{t}}_{\#}\mu.

  • (i)

    Suppose that

    01dv(t,x)v~(t,x)2q~t(x)dxdtε.\int_{0}^{1}\int_{{\mathbb{R}}^{d}}\|v(t,x)-\tilde{v}(t,x)\|^{2}\tilde{q}_{t}(x)\mathrm{d}x\mathrm{d}t\leq\varepsilon. (6.8)

    Then

    W22(Y1#μ,ν)ε01exp(2s1θudu)ds.W_{2}^{2}({Y_{1}}_{\#}\mu,\nu)\leq\varepsilon\int_{0}^{1}\exp\left(2\int_{s}^{1}\theta_{u}\mathrm{d}u\right)\mathrm{d}s. (6.9)
  • (ii)

    Suppose that

    sup(t,x)[0,1]×dxv~(t,x)2,2C3.\sup_{(t,x)\in[0,1]\times{\mathbb{R}}^{d}}\|\nabla_{x}\tilde{v}(t,x)\|_{2,2}\leq C_{3}.

    Then

    W22(Y1#μ,ν)exp(2C3)12C301dv(t,x)v~(t,x)2pt(x)dxdt.W_{2}^{2}({Y_{1}}_{\#}\mu,\nu)\leq\frac{\exp(2C_{3})-1}{2C_{3}}\int_{0}^{1}\int_{{\mathbb{R}}^{d}}\|v(t,x)-\tilde{v}(t,x)\|^{2}p_{t}(x)\mathrm{d}x\mathrm{d}t. (6.10)

Proposition 39 provides a stability analysis with respect to the estimation error of the velocity field in the W_{2} distance. The estimation error originates from the flow matching or score matching procedures and from the approximation error arising from using deep neural networks to estimate the velocity field or the score function. The two W_{2} bounds imply that the distribution estimation error is controlled by the L_{2} estimation error of flow matching and score matching, which justifies the soundness of approximating the velocity field through flow matching or score matching. The first W_{2} bound (6.9) relies on the L_{2} control (6.8) of the perturbation error of the velocity field. The second W_{2} bound (6.10) is slightly better than that provided in (Albergo and Vanden-Eijnden, 2023, Proposition 3) but still has exponential dependence on the Lipschitz constant of \tilde{v}(t,x).

Figure 5: An approximately linear relation between b0b_{0} and the Wasserstein-2 distance.

To demonstrate the bounds presented in Propositions 37 and 39, we conduct further experiments with a mixture of eight two-dimensional Gaussian distributions. These propositions bound the stability of the flow under perturbations of either the source distribution or the velocity field. Let the target distribution be the following two-dimensional Gaussian mixture

\displaystyle p(x)=\frac{1}{8}\sum_{j=1}^{8}\phi(x;\mu_{j},\Sigma_{j}),

where ϕ(x;μj,Σj)\phi(x;\mu_{j},\Sigma_{j}) is the probability density function for the Gaussian distribution with mean μj=12(sin(2(j1)π/8),cos(2(j1)π/8))\mu_{j}=12(\sin(2(j-1)\pi/8),\cos(2(j-1)\pi/8))^{\top} and covariance matrix Σj=0.032𝐈2\Sigma_{j}=0.03^{2}\mathbf{I}_{2} for j=1,,8j=1,\cdots,8. For Gaussian mixtures, the velocity field has an explicit formula, which facilitates the perturbation analysis.
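
For concreteness, the following sketch (our own illustration; all function and variable names are assumptions we introduce) evaluates this closed-form velocity field by combining Eq. (A.3) with exact Gaussian conditioning. It assumes equal mixture weights and isotropic component covariance \sigma^{2}{\mathbf{I}}_{2}, matching the experiment above, and is valid for t\in(0,1) where a_{t}>0.

```python
import numpy as np

def gif_velocity(t, x, mus, sigma2, a, b, da, db):
    """Closed-form GIF velocity v(t, x) = (a'/a) x + (b' - (a'/a) b) E[X1 | Xt = x]
    (Eq. (A.3)) for an equal-weight Gaussian mixture target with covariance sigma2 * I.

    x: (n, d) evaluation points; mus: (K, d) component means;
    a, b, da, db: callables returning a_t, b_t, and their time derivatives.
    """
    at, bt, dat, dbt = a(t), b(t), da(t), db(t)
    ct2 = at**2 + bt**2 * sigma2                      # per-component variance of X_t
    diff = x[:, None, :] - bt * mus[None, :, :]       # (n, K, d)
    logw = -0.5 * np.sum(diff**2, axis=-1) / ct2      # unnormalized log posterior weights
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    cond_mean = mus[None, :, :] + (bt * sigma2 / ct2) * diff   # E[X1 | Xt = x, component k]
    ex1 = np.einsum("nk,nkd->nd", w, cond_mean)       # denoiser E[X1 | Xt = x]
    return (dat / at) * x + (dbt - (dat / at) * bt) * ex1

# Example with the linear schedule a_t = 1 - t, b_t = t and the eight-component target.
mus = 12.0 * np.stack([np.sin(2 * np.arange(8) * np.pi / 8),
                       np.cos(2 * np.arange(8) * np.pi / 8)], axis=1)
v = gif_velocity(0.5, np.zeros((1, 2)), mus, 0.03**2,
                 a=lambda t: 1 - t, b=lambda t: t, da=lambda t: -1.0, db=lambda t: 1.0)
print(v.shape)   # (1, 2)
```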

To illustrate the bound in Proposition 37, we consider a perturbation of the source distribution for the following model:

𝖷t=at𝖹+bt𝖷 with at=1t+ζ1+ζ,bt=t+ζ1+ζ,\displaystyle\mathsf{X}_{t}=a_{t}\mathsf{Z}+b_{t}\mathsf{X}\text{\quad with \quad}a_{t}=1-\frac{t+\zeta}{1+\zeta},\quad b_{t}=\frac{t+\zeta}{1+\zeta},

where ζ[0,0.3]\zeta\in[0,0.3] is a value controlling the perturbation level. It is easy to see a0=1/(1+ζ),b0=ζ/(1+ζ)a_{0}={1}/{(1+\zeta)},b_{0}={\zeta}/{(1+\zeta)}. Thus, the source distribution Law(a0𝖹+b0𝖷)\mathrm{Law}(a_{0}\mathsf{Z}+b_{0}\mathsf{X}) is a mixture of Gaussian distributions. Practically, we can use a Gaussian distribution γ2,a02\gamma_{2,a_{0}^{2}} to replace this source distribution. In Proposition 37, we bound the error between the distributions of generated samples due to the replacement, that is,

W2(X1#γd,a02,ν)Cb0,\displaystyle W_{2}({X_{1}}_{\#}\gamma_{d,a_{0}^{2}},\nu)\leq Cb_{0},

where C is a constant. We illustrate this theoretical bound using the Gaussian mixture and the Gaussian interpolation flow given above. We consider a mesh for the variable \zeta and plot b_{0} against W_{2}({X_{1}}_{\#}\gamma_{d,a_{0}^{2}},\nu) in Figure 5. An approximately linear relation between b_{0} and W_{2}({X_{1}}_{\#}\gamma_{d,a_{0}^{2}},\nu) is observed, which supports Proposition 37.
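
The curve in Figure 5 can be reproduced in spirit with a short sample-based computation. The sketch below (our own illustration) pushes Gaussian source samples through a forward Euler discretization of the flow and estimates W_{2} by optimal matching of equal-size point clouds; the helper gif_velocity from the previous sketch, the step count, and the early stopping at t=1-10^{-3} (to avoid a_{1}=0) are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_w2(xs, ys):
    """Empirical W2 between two equal-size point clouds via optimal assignment."""
    cost = np.sum((xs[:, None, :] - ys[None, :, :])**2, axis=-1)
    row, col = linear_sum_assignment(cost)
    return np.sqrt(cost[row, col].mean())

def push_forward(x, velocity, t0=0.0, t1=1.0 - 1e-3, n_steps=400):
    """Forward Euler integration of dx/dt = v(t, x), stopping just before t = 1."""
    ts = np.linspace(t0, t1, n_steps + 1)
    for lo, hi in zip(ts[:-1], ts[1:]):
        x = x + (hi - lo) * velocity(lo, x)
    return x

# Sketch of one point on the curve: for a given zeta, start from the Gaussian
# gamma_{2, a_0^2} instead of Law(a_0 Z + b_0 X_1) and compare with target samples.
# (velocity(t, x) is assumed to wrap gif_velocity with the zeta-dependent schedule;
#  target_samples is assumed to be an array of draws from the Gaussian mixture nu.)
# generated = push_forward(a0 * rng.standard_normal((1000, 2)), velocity)
# print(b0, empirical_w2(generated, target_samples))
```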

Figure 6: A linear relation between Δvt\Delta v_{t} and the squared Wasserstein-2 distance.

We now consider perturbing the velocity field v_{t} by adding random noise. Let \epsilon\in[0.5,5.5]. The random noise is added to each coordinate independently and takes the values \{-\epsilon,\epsilon\} with equal probability (a scaled Bernoulli random variable). Let \tilde{v}_{t} denote the perturbed velocity field. Then we can compute

Δvt:=vtv~t2=2ϵ2.\displaystyle\Delta v_{t}:=\|v_{t}-\tilde{v}_{t}\|^{2}=2\epsilon^{2}.

We use the velocity field v_{t} and the perturbed velocity field \tilde{v}_{t} to generate samples and compute the squared Wasserstein-2 distance between the two sample distributions. According to Proposition 39, the squared Wasserstein-2 distance should be linearly upper bounded as \mathcal{O}(\Delta v_{t}), that is,

W22(Y1#μ,ν)C~012ϵ2pt(x)dxdt=C~ϵ2,\displaystyle W_{2}^{2}({Y_{1}}_{\#}\mu,\nu)\leq\tilde{C}\int_{0}^{1}\int_{\mathbb{R}^{2}}\epsilon^{2}p_{t}(x)\mathrm{d}x\mathrm{d}t=\tilde{C}\epsilon^{2},

where C~\tilde{C} is a constant. This theoretical insight is illustrated in Figure 6, where a linear relationship between these two variables is observed.
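
A sketch of this velocity perturbation is given below (again an illustration with assumed helper names: push_forward and empirical_w2 come from the previous sketch, and velocity wraps the explicit mixture formula).

```python
import numpy as np

def perturb_velocity(velocity, eps, rng):
    """Adds independent +/- eps (scaled Bernoulli) noise to each coordinate of v(t, x),
    so that ||v - v_tilde||^2 = 2 * eps^2 in two dimensions."""
    def v_tilde(t, x):
        signs = rng.choice([-1.0, 1.0], size=x.shape)
        return velocity(t, x) + eps * signs
    return v_tilde

# For each eps, push the same source samples through v and v_tilde and compare:
# w2_sq = empirical_w2(push_forward(x0, velocity),
#                      push_forward(x0.copy(), perturb_velocity(velocity, eps, rng)))**2
# Plotting w2_sq against 2 * eps**2 should exhibit the linear trend of Figure 6.
```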

7 Related work

GIFs and the induced transport maps are related to CNFs and score-based diffusion models. Mathematically, they interrelate with the literature on Lipschitz mass transport and Wasserstein gradient flows. A central question in developing the ODE flow or transport map method for generative modeling is how to construct an ODE flow or transport map that is sufficiently smooth and enables efficient computation. Various approaches have been proposed to address this question.

CNFs construct invertible mappings between an isotropic Gaussian distribution and a complex target distribution (Chen et al., 2018; Grathwohl et al., 2019). They fall within the broader framework of neural ODEs (Chen et al., 2018; Ruiz-Balet and Zuazua, 2023). A major challenge for CNFs is designing a time-dependent ODE flow whose marginal distribution converges to the target distribution while allowing for efficient estimation of its velocity field. Previous work has explored several principles to construct such flows, including optimal transport, Wasserstein gradient flows, and diffusion processes. Additionally, Gaussian denoising has emerged as an effective principle for constructing simulation-free CNFs in generative modeling.

Liu et al. (2023) propose the rectified flow, which is based on a linear interpolation between a standard Gaussian distribution and the target distribution, mimicking the Gaussian denoising procedure. Albergo and Vanden-Eijnden (2023) study a similar formulation called stochastic interpolation, defining a trigonometric interpolant between a standard Gaussian distribution and the target distribution. Albergo et al. (2023b) extend this idea by proposing a stochastic bridge interpolant between two arbitrary distributions. Under a few regularity assumptions, the velocity field of the ODE flow modeling the stochastic bridge interpolant is proven to be continuous in the time variable and smooth in the space variable.

Lipman et al. (2023) introduce a nonlinear least squares method called flow matching to directly estimate the velocity field of probability flow ODEs. All of these models are encompassed within the framework of simulation-free CNFs, which have been the focus of numerous ongoing research efforts (Neklyudov et al., 2023; Tong et al., 2023; Chen and Lipman, 2023; Albergo et al., 2023b; Shaul et al., 2023; Pooladian et al., 2023; Albergo et al., 2023a, c). Furthermore, Marzouk et al. (2023) provide the first statistical convergence rate for the simulation-based method by placing neural ODEs within the nonparametric estimation framework.

Score-based diffusion models integrate the time reversal of stochastic differential equations (SDEs) with the score matching technique (Sohl-Dickstein et al., 2015; Song and Ermon, 2019; Ho et al., 2020; Song and Ermon, 2020; Song et al., 2021b, a; De Bortoli et al., 2021). These models are capable of modeling highly complex probability distributions and have achieved state-of-the-art performance in image synthesis tasks (Dhariwal and Nichol, 2021; Rombach et al., 2022). The probability flow ODEs of diffusion models can be considered as CNFs, whose velocity field incorporates the nonlinear score function (Song et al., 2021b; Karras et al., 2022; Lu et al., 2022b, a; Zheng et al., 2023). In addition to the score matching method, Lu et al. (2022a) and Zheng et al. (2023) explore maximum likelihood estimation for probability flow ODEs. However, the regularity of these probability flow ODEs has not been studied and their well-posedness properties remain to be established.

A key concept in defining measure transport is Lipschitz mass transport, where the transport maps are required to be Lipschitz continuous. This ensures the smoothness and stability of the measure transport. There is a substantial body of research on the Lipschitz properties of transport maps. The celebrated Caffarelli’s contraction theorem (Caffarelli, 2000, Theorem 2) establishes the Lipschitz continuity of optimal transport maps that push the standard Gaussian measure onto a log-concave measure. Colombo et al. (2017) study a Lipschitz transport map between perturbations of log-concave measures using optimal transport theory.

Mikulincer and Shenfeld (2021) demonstrate that the Brownian transport map, defined by the Föllmer process, is Lipschitz continuous when it pushes forward the Wiener measure on the Wiener space to the target measure on the Euclidean space. Additionally, Neeman (2022) and Mikulincer and Shenfeld (2023) prove that the transport map along the reverse heat flow of certain target measures is Lipschitz continuous.

Beyond studying Lipschitz transport maps, significant effort has been devoted to applying optimal transport theory in generative modeling. Zhang et al. (2018) propose the Monge-Ampère flow for generative modeling by solving the linearized Monge-Ampère equation. Optimal transport theory has been utilized as a general principle to regularize the training of continuous normalizing flows or generators for generative modeling (Finlay et al., 2020; Yang and Karniadakis, 2020; Onken et al., 2021; Makkuva et al., 2020). Liang (2021) leverages the regularity theory of optimal transport to formalize the generator-discriminator-pair regularization of GANs under a minimax rate framework.

In our work, we study the Lipschitz transport maps defined by GIFs, which differ from optimal transport maps. GIFs naturally fit within the framework of continuous normalizing flows, and their flow maps are examined from the perspective of Lipschitz mass transport.

Wasserstein gradient flows offer another principled approach to constructing ODE flows for generative modeling. A Wasserstein gradient flow is derived from the gradient descent minimization of a certain energy functional over probability measures endowed with the quadratic Wasserstein metric (Ambrosio et al., 2008). The Eulerian formulation of Wasserstein gradient flows produces the continuity equations that govern the evolution of the marginal distributions. Once transferred into a Lagrangian formulation, Wasserstein gradient flows define ODE flows that have been widely explored for generative modeling (Johnson and Zhang, 2018; Gao et al., 2019; Liutkus et al., 2019; Johnson and Zhang, 2019; Arbel et al., 2019; Mroueh et al., 2019; Ansari et al., 2021; Mroueh and Nguyen, 2021; Fan et al., 2022; Gao et al., 2022; Duncan et al., 2023; Xu et al., 2022). Wasserstein gradient flows are also connected with the forward process of diffusion models. The variance preserving SDE of diffusion models is equivalent to the Langevin dynamics towards the standard Gaussian distribution, which can be interpreted as a Wasserstein gradient flow of the Kullback–Leibler divergence relative to the standard Gaussian distribution (Song et al., 2021b). In the meantime, the probability flow ODE of the variance preserving SDE conforms to the Eulerian formulation of this Wasserstein gradient flow. However, when the reference is a general distribution rather than the standard Gaussian, it remains unclear whether the ODE formulation of Wasserstein gradient flows is well-posed.

The main contribution of our work lies in establishing the theoretical properties of GIFs and their associated flow maps in a unified way. Our theoretical results encompass the Lipschitz continuity of both the flow’s velocity field and the flow map, addressing the existence, uniqueness, and stability of the flow. We also demonstrate that both the flow map and its inverse possess Lipschitz properties.

Our proposed framework for Gaussian interpolation flow builds upon previous research on probability flow methods in diffusion models (Song et al., 2021b, a) and stochastic interpolation methods for generative modeling (Liu et al., 2023; Albergo and Vanden-Eijnden, 2023; Lipman et al., 2023). Rather than adopting a methodological perspective, we focus on elucidating the theoretical aspects of these flows from a unified standpoint, thereby enhancing the understanding of various methodological approaches. Our theoretical results are derived from geometric considerations of the target distribution and from analytic calculations that exploit the Gaussian denoising property.

8 Conclusions and discussion

Gaussian denoising as a framework for constructing continuous normalizing flows holds great promise in generative modeling. Through a unified framework and rigorous analysis, we have established the well-posedness of these flows, shedding light on their capabilities and limitations. We have examined the Lipschitz regularity of the corresponding flow maps for several rich classes of probability measures. When applied to generative modeling based on Gaussian denoising, we have shown that GIFs possess auto-encoding and cycle consistency properties at the population level. Additionally, we have established stability error bounds for the errors accumulated during the process of learning GIFs.

The regularity properties of the velocity field established in this paper provide a solid theoretical basis for end-to-end error analyses of learning GIFs using deep neural networks with empirical data. Another potential application is to perform rigorous analyses of consistency models, a nascent family of ODE-based deep generative models designed for one-step generation (Song et al., 2023; Kim et al., 2023; Song and Dhariwal, 2023). We intend to investigate these intriguing problems in our subsequent work. We expect that our analytical results will facilitate further studies and advancements in applying simulation-free CNFs, including GIFs, to a diverse range of generative modeling tasks.

Appendix

In the appendices, we prove the results stated in the paper and provide necessary technical details and discussions.

Appendix A Proofs of Theorem 12 and Lemma 18

Dynamical properties of the Gaussian interpolation flow (\mathsf{X}_{t})_{t\in[0,1]} form the cornerstone of the measure interpolation method. Following Albergo and Vanden-Eijnden (2023) and Albergo et al. (2023b), we use a characteristic function argument to quantify the dynamics of its marginal flow and, as a result, to prove Theorem 12.

Proof  [Proof of Theorem 12] Let ωd\omega\in{\mathbb{R}}^{d}. For the Gaussian stochastic interpolation (𝖷t)t[0,1](\mathsf{X}_{t})_{t\in[0,1]}, we define the characteristic function of 𝖷t\mathsf{X}_{t} by

Ψ(t,ω):=𝔼[exp(iω,𝖷t)]=𝔼[exp(iω,at𝖹+bt𝖷1)]=𝔼[exp(iatω,𝖹)]𝔼[exp(ibtω,𝖷1)],\Psi(t,\omega):=\mathbb{E}[\exp(i\langle\omega,\mathsf{X}_{t}\rangle)]=\mathbb{E}[\exp(i\langle\omega,a_{t}\mathsf{Z}+b_{t}\mathsf{X}_{1}\rangle)]=\mathbb{E}[\exp(ia_{t}\langle\omega,\mathsf{Z}\rangle)]\mathbb{E}[\exp(ib_{t}\langle\omega,\mathsf{X}_{1}\rangle)],

where the last equality is due to the independence between \mathsf{Z}\sim\gamma_{d} and \mathsf{X}_{1}\sim\nu. Taking the time derivative of \Psi(t,\omega) for t\in(0,1), we derive that

\partial_{t}\Psi(t,\omega)=i\langle\omega,\psi(t,\omega)\rangle,

where

ψ(t,ω):=𝔼[exp(iω,𝖷t)(a˙t𝖹+b˙t𝖷1)].\psi(t,\omega):=\mathbb{E}[\exp(i\langle\omega,\mathsf{X}_{t}\rangle)(\dot{a}_{t}\mathsf{Z}+\dot{b}_{t}\mathsf{X}_{1})].

We first define

v(t,𝖷t):=𝔼[a˙t𝖹+b˙t𝖷1|𝖷t].\displaystyle v(t,\mathsf{X}_{t}):=\mathbb{E}[\dot{a}_{t}\mathsf{Z}+\dot{b}_{t}\mathsf{X}_{1}|\mathsf{X}_{t}]. (A.1)

Using the double expectation formula, we deduce that

ψ(t,ω)=𝔼[exp(iω,𝖷t)𝔼[a˙t𝖹+b˙t𝖷1|𝖷t]]=𝔼[exp(iω,𝖷t)v(t,𝖷t)].\displaystyle\psi(t,\omega)=\mathbb{E}[\exp(i\langle\omega,\mathsf{X}_{t}\rangle)\mathbb{E}[\dot{a}_{t}\mathsf{Z}+\dot{b}_{t}\mathsf{X}_{1}|\mathsf{X}_{t}]]=\mathbb{E}[\exp(i\langle\omega,\mathsf{X}_{t}\rangle)v(t,\mathsf{X}_{t})].

Applying the inverse Fourier transform to ψ(t,ω)\psi(t,\omega), it holds that

j(t,x):=(2π)ddexp(iω,x)ψ(t,ω)dω=pt(x)v(t,x),j(t,x):=(2\pi)^{-d}\int_{\mathbb{R}^{d}}\exp(-i\langle\omega,x\rangle)\psi(t,\omega)\mathrm{d}\omega=p_{t}(x)v(t,x),

where v(t,x):=𝔼[a˙t𝖹+b˙t𝖷1|𝖷t=x]v(t,x):=\mathbb{E}[\dot{a}_{t}\mathsf{Z}+\dot{b}_{t}\mathsf{X}_{1}|\mathsf{X}_{t}=x]. Then it further yields that

tpt+xj(t,x)=0,\partial_{t}p_{t}+\nabla_{x}\cdot j(t,x)=0,

that is,

tpt+x(ptv(t,x))=0.\partial_{t}p_{t}+\nabla_{x}\cdot(p_{t}v(t,x))=0.

Next, we study the property of v(t,x)v(t,x) at t=0t=0 and t=1t=1. Notice that

x=at𝔼[𝖹|𝖷t=x]+bt𝔼[𝖷1|𝖷t=x].\displaystyle x=a_{t}\mathbb{E}[\mathsf{Z}|\mathsf{X}_{t}=x]+b_{t}\mathbb{E}[\mathsf{X}_{1}|\mathsf{X}_{t}=x]. (A.2)

Combining Eqs. (A.1) and (A.2), we obtain

v(t,x)=a˙tatx+(b˙ta˙tatbt)𝔼[𝖷1|𝖷t=x],t(0,1).v(t,x)=\tfrac{\dot{a}_{t}}{a_{t}}x+\left(\dot{b}_{t}-\tfrac{\dot{a}_{t}}{a_{t}}b_{t}\right)\mathbb{E}[\mathsf{X}_{1}|\mathsf{X}_{t}=x],\quad t\in(0,1). (A.3)

According to Tweedie’s formula in Lemma 49, it holds that

s(t,x)=btat2𝔼[𝖷1|𝖷t=x]1at2x,t(0,1),s(t,x)=\tfrac{b_{t}}{a_{t}^{2}}\mathbb{E}\left[\mathsf{X}_{1}|\mathsf{X}_{t}=x\right]-\tfrac{1}{a_{t}^{2}}x,\quad t\in(0,1), (A.4)

where s(t,x)s(t,x) is the score function of the marginal distribution of 𝖷tpt\mathsf{X}_{t}\sim p_{t}.

Combining Eqs. (A.3) and (A.4), it holds that the velocity field is a gradient field whose nonlinear part is the score function s(t,x), namely, for any t\in(0,1),

v(t,x)=b˙tbtx+(b˙tbtat2a˙tat)s(t,x).\displaystyle v(t,x)=\tfrac{\dot{b}_{t}}{b_{t}}x+\left(\tfrac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)s(t,x). (A.5)

By the regularity properties that at,btC2([0,1)),at2C1([0,1]),btC1([0,1])a_{t},b_{t}\in C^{2}([0,1)),a_{t}^{2}\in C^{1}([0,1]),b_{t}\in C^{1}([0,1]), we have that a˙0,b˙0,a˙1a1\dot{a}_{0},\dot{b}_{0},\dot{a}_{1}a_{1}, and b˙1\dot{b}_{1} are well-defined. Then by Eq. (A.3), we define that

v(0,x):=limt0v(t,x)=a˙0a0x+(b˙0a˙0a0b0)𝔼[𝖷1|𝖷0=x]\displaystyle v(0,x):=\lim_{t\downarrow 0}v(t,x)=\tfrac{\dot{a}_{0}}{a_{0}}x+\left(\dot{b}_{0}-\tfrac{\dot{a}_{0}}{a_{0}}b_{0}\right)\mathbb{E}[\mathsf{X}_{1}|\mathsf{X}_{0}=x]

Using Eq. (A.5) yields that

v(1,x):=limt1v(t,x)=b˙1b1xa˙1a1s(1,x).\displaystyle v(1,x):=\lim_{t\uparrow 1}v(t,x)=\tfrac{\dot{b}_{1}}{b_{1}}x-\dot{a}_{1}a_{1}s(1,x). (A.6)

This completes the proof.  
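
As an aside (our own numerical illustration, not part of the proof), the two representations (A.3) and (A.5) of the velocity field can be cross-checked for a standard Gaussian target \nu=\gamma_{d}, where \mathbb{E}[\mathsf{X}_{1}|\mathsf{X}_{t}=x]=b_{t}x/(a_{t}^{2}+b_{t}^{2}) and s(t,x)=-x/(a_{t}^{2}+b_{t}^{2}) are explicit; the linear schedule below is an illustrative assumption.

```python
import numpy as np

# Cross-check of (A.3) and (A.5) for nu = N(0, I): both reduce to
# v(t, x) = (a_t a_t' + b_t b_t') / (a_t^2 + b_t^2) * x.
a, da = lambda t: 1 - t, lambda t: -1.0          # illustrative interpolation schedule
b, db = lambda t: t, lambda t: 1.0

t, x = 0.4, np.array([0.7, -1.2])
at, bt, dat, dbt = a(t), b(t), da(t), db(t)
ct2 = at**2 + bt**2

denoiser = bt * x / ct2                           # E[X1 | Xt = x] for a Gaussian target
score = -x / ct2                                  # score of p_t = N(0, c_t^2 I)
v_a3 = (dat / at) * x + (dbt - (dat / at) * bt) * denoiser
v_a5 = (dbt / bt) * x + ((dbt / bt) * at**2 - dat * at) * score

print(np.allclose(v_a3, v_a5))                    # True
```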

Lemma 18 presents several standard properties of Gaussian channels in information theory (Wibisono and Jog, 2018a, b; Dytso et al., 2023b) that will facilitate our proof.

Proof  [Proof of Lemma 18] By Bayes’ rule, Law(𝖸|𝖷t=x)=p(y|t,x)\mathrm{Law}(\mathsf{Y}|\mathsf{X}_{t}=x)=p(y|t,x) can be represented as

p(y|t,x)\displaystyle p(y|t,x) =φbty,at2(x)p1(y)/pt(x)\displaystyle=\varphi_{b_{t}y,a_{t}^{2}}(x)p_{1}(y)/p_{t}(x)
=(2π)d/2atdexp(xbty22at2)p1(y)/pt(x)\displaystyle=(2\pi)^{-d/2}a_{t}^{-d}\exp\left(-\frac{\|x-b_{t}y\|^{2}}{2a_{t}^{2}}\right)p_{1}(y)/p_{t}(x)
=(2π)d/2atdexp(x22at2+btx,yat2bt2y22at2)p1(y)/pt(x)\displaystyle=(2\pi)^{-d/2}a_{t}^{-d}\exp\left(-\frac{\|x\|^{2}}{2a_{t}^{2}}+\frac{b_{t}\langle x,y\rangle}{a_{t}^{2}}-\frac{b_{t}^{2}\|y\|^{2}}{2a_{t}^{2}}\right)p_{1}(y)/p_{t}(x)
={exp(btx,yat2bt2y22at2)p1(y)}/{(2π)d/2atdexp(x22at2)pt(x)}.\displaystyle=\left\{\exp\left(\frac{b_{t}\langle x,y\rangle}{a_{t}^{2}}-\frac{b_{t}^{2}\|y\|^{2}}{2a_{t}^{2}}\right)p_{1}(y)\right\}/\left\{(2\pi)^{d/2}a_{t}^{d}\exp\left(\frac{\|x\|^{2}}{2a_{t}^{2}}\right)p_{t}(x)\right\}.

Let θ=btxat2,h(y)=p1(y)exp(bt2y22at2)\theta=\frac{b_{t}x}{a_{t}^{2}},h(y)=p_{1}(y)\exp(-\frac{b_{t}^{2}\|y\|^{2}}{2a_{t}^{2}}), and the logarithmic partition function

A(θ)=logdh(y)exp(y,θ)dy,A(\theta)=\log\int_{{\mathbb{R}}^{d}}h(y)\exp(\langle y,\theta\rangle)\mathrm{d}y,

then by the definition of exponential family distributions, we conclude that

p(y|t,x)=h(y)exp(y,θA(θ))\displaystyle p(y|t,x)=h(y)\exp(\langle y,\theta\rangle-A(\theta))

is an exponential family distribution of yy. By simple calculation, it follows that

x2logp(y|t,x)=bt2at4θ2A(θ).\nabla^{2}_{x}\log p(y|t,x)=-\frac{b_{t}^{2}}{a_{t}^{4}}\nabla^{2}_{\theta}A(\theta).

For an exponential family distribution, a basic equality shows that

θ2A(θ)=Cov(𝖸|𝖷t=x),\nabla^{2}_{\theta}A(\theta)=\mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x),

which further yields that x2logp(y|t,x)=bt2at4Cov(𝖸|𝖷t=x)\nabla^{2}_{x}\log p(y|t,x)=-\frac{b_{t}^{2}}{a_{t}^{4}}\mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x).  

Appendix B Auxiliary lemmas for Lipschitz flow maps

The following lemma, which goes back to G. Peano (Hartman, 2002a, Theorem 3.1), describes several differential equations associated with well-posed flows and supports the derivation of the Lipschitz continuity of their flow maps.

Lemma 40

(Ambrosio et al., 2023, Lemma 3.4) Suppose that a flow (Xt)t[0,1](X_{t})_{t\in[0,1]} is well-posed and its velocity field v(t,x):[0,1]×ddv(t,x):[0,1]\times{\mathbb{R}}^{d}\to{\mathbb{R}}^{d} is of class C1C^{1}. Then the flow map Xs,t:ddX_{s,t}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{d} is of class C1C^{1} for any 0st10\leq s\leq t\leq 1. Fix (s,x)[0,1]×d(s,x)\in[0,1]\times{\mathbb{R}}^{d} and set the following functions defined with t[s,1]t\in[s,1]

y(t)\displaystyle y(t) :=xXs,t(x),\displaystyle:=\nabla_{x}X_{s,t}(x), J(t):=(xv)(t,Xs,t(x)),\displaystyle J(t):=(\nabla_{x}v)(t,X_{s,t}(x)),
w(t)\displaystyle w(t) :=det(xXs,t(x)),\displaystyle:=\det(\nabla_{x}X_{s,t}(x)), b(t):=(xv)(t,Xs,t(x))=Tr(J(t)).\displaystyle b(t):=(\nabla_{x}\cdot v)(t,X_{s,t}(x))=\operatorname{Tr}(J(t)).

Then y(t)y(t) and w(t)w(t) are the unique C1C^{1} solutions of the following IVPs

y˙(t)\displaystyle\dot{y}(t) =J(t)y(t),y(s)=𝐈d,\displaystyle=J(t)y(t),\quad y(s)={\mathbf{I}}_{d}, (B.1)
w˙(t)\displaystyle\dot{w}(t) =b(t)w(t),w(s)=1.\displaystyle=b(t)w(t),\quad w(s)=1. (B.2)

In Lemma 29, we present an upper bound on the Lipschitz constant of the flow map X_{s,t}(x). Similar upper bounds have been derived in Mikulincer and Shenfeld (2023); Ambrosio et al. (2023); Dai et al. (2023). For completeness, we derive the bound as a direct implication of Eq. (B.1) in Lemma 40 together with an upper bound on the Jacobian matrix of the velocity field.

Proof [Proof of Lemma 29] Let y(u)=xXs,u(x)y(u)=\nabla_{x}X_{s,u}(x), J(u)=(xv)(u,Xs,u(x))J(u)=(\nabla_{x}v)(u,X_{s,u}(x)). Owing to Lemma 40, y(u)y(u) is of class C1C^{1}, and the function uy(u)2,2u\mapsto\|y(u)\|_{2,2} is absolutely continuous over [s,t][s,t]. By Lemma 40, it follows that

uy(u)2,22=2y(u),y˙(u)=2y(u),J(u)y(u)2θuy(u)2,22.\partial_{u}\|y(u)\|_{2,2}^{2}=2\langle y(u),\dot{y}(u)\rangle=2\langle y(u),J(u)y(u)\rangle\leq 2\theta_{u}\|y(u)\|_{2,2}^{2}.

Applying Grönwall’s inequality yields that y(t)2,2exp(stθudu)\|y(t)\|_{2,2}\leq\exp(\int_{s}^{t}\theta_{u}\mathrm{d}u) which concludes the proof.  

Another result concerns the instantaneous change of variables formula that is widely used in the study of neural ODEs (Chen et al., 2018, Theorem 1). We also exploit the instantaneous change of variables to prove Proposition 37. To make the proof self-contained, we show that the instantaneous change of variables follows directly from Eq. (B.2) in Lemma 40. Compared with the original proof in (Chen et al., 2018, Theorem 1), we show that the well-posedness of a flow is sufficient to ensure the instantaneous change of variables property, without a boundedness condition on the flow.

Corollary 41 (Instantaneous change of variables)

Suppose that a flow (X_{t})_{t\in[0,1]} is well-posed with a velocity field v(t,x):[0,1]\times{\mathbb{R}}^{d}\to{\mathbb{R}}^{d} of class C^{1} in x. Let the initial value X_{0}(x) be distributed according to \pi_{0}. Then the log-density of X_{t}(x) along the flow satisfies the following differential equation

tlogπt(Xt(x))=Tr((xv)(t,Xt(x))).\partial_{t}\log\pi_{t}(X_{t}(x))=-\operatorname{Tr}((\nabla_{x}v)(t,X_{t}(x))).

Proof  Let δ(t):=det(xXt(x))\delta(t):=\det(\nabla_{x}X_{t}(x)). Thanks to Eq. (B.2) in Lemma 40, it holds that

δ˙(t)=Tr((xv)(t,Xt(x)))δ(t),δ(0)=1,\dot{\delta}(t)=\operatorname{Tr}((\nabla_{x}v)(t,X_{t}(x)))\delta(t),\quad\delta(0)=1,

which implies δ(t)>0\delta(t)>0 for t[0,1]t\in[0,1]. Notice that logπt(Xt(x))=logπ0(X0(x))log|δ(t)|\log\pi_{t}(X_{t}(x))=\log\pi_{0}(X_{0}(x))-\log|\delta(t)| by change of variables. Then it follows that tlogπt(Xt(x))=Tr((xv)(t,Xt(x)))\partial_{t}\log\pi_{t}(X_{t}(x))=-\operatorname{Tr}((\nabla_{x}v)(t,X_{t}(x))).  
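
A numerical illustration of Corollary 41 (our own toy check, not part of the proof): for a linear velocity field v(t,x)=Ax, the flow map is X_{t}(x)=e^{tA}x, \nabla_{x}v=A, and integrating the corollary from 0 to t predicts \log\pi_{t}(X_{t}(x))-\log\pi_{0}(x)=-t\operatorname{Tr}(A). With X_{0}\sim N(0,{\mathbf{I}}_{d}), the law of X_{t} is N(0,e^{tA}e^{tA^{\top}}) and the identity can be checked directly; the matrix A and the point x_{0} below are arbitrary choices.

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.3, 0.1],
              [0.0, -0.2]])
d, t = A.shape[0], 0.7
x0 = np.array([1.0, -0.5])

xt = expm(t * A) @ x0                      # flow map X_t(x0) = exp(tA) x0
cov_t = expm(t * A) @ expm(t * A).T        # covariance of X_t when X_0 ~ N(0, I)

log_p0 = -0.5 * x0 @ x0 - 0.5 * d * np.log(2 * np.pi)
log_pt = (-0.5 * xt @ np.linalg.solve(cov_t, xt)
          - 0.5 * np.log(np.linalg.det(2 * np.pi * cov_t)))

print(log_pt - log_p0, -t * np.trace(A))   # the two values coincide
```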

Appendix C Proofs of spatial Lipschitz estimates for the velocity field

The main results in Section 4 are proved in this appendix. We first present some ancillary lemmas before proceeding to give the proofs.

Lemma 42 (Fathi et al., 2023)

Suppose that f:d+f:{\mathbb{R}}^{d}\to{\mathbb{R}}_{+} is LL-log-Lipschitz for some L0L\geq 0. Let 𝒫t\mathcal{P}_{t} be the Ornstein–Uhlenbeck semigroup defined by 𝒫th(x):=𝔼𝖹γd[h(etx+1e2t𝖹)]\mathcal{P}_{t}h(x):=\mathbb{E}_{\mathsf{Z}\sim\gamma_{d}}[h(e^{-t}x+\sqrt{1-e^{-2t}}\mathsf{Z})] for any hC(d)h\in C({\mathbb{R}}^{d}) and t0t\geq 0. Then it holds that

{5Let(L+t12)L2e2t}𝐈dx2log𝒫tf(x){5Let(L+t12)}𝐈d.\left\{-5Le^{-t}(L+t^{-\frac{1}{2}})-L^{2}e^{-2t}\right\}{\mathbf{I}}_{d}\preceq\nabla^{2}_{x}\log\mathcal{P}_{t}f(x)\preceq\left\{5Le^{-t}(L+t^{-\frac{1}{2}})\right\}{\mathbf{I}}_{d}.

Proof  This is a restatement of known results. See Proposition 2, Proposition 6, Theorem 6, and their proofs in Fathi et al. (2023).  

Corollary 43

Suppose that f:d+f:{\mathbb{R}}^{d}\to{\mathbb{R}}_{+} is LL-log-Lipschitz for some L0L\geq 0. Let 𝒬t\mathcal{Q}_{t} be an operator defined by

𝒬th(x):=𝔼𝖹γd[h(βtx+αt𝖹)]\mathcal{Q}_{t}h(x):=\mathbb{E}_{\mathsf{Z}\sim\gamma_{d}}[h(\beta_{t}x+\alpha_{t}\mathsf{Z})] (C.1)

for any hC(d)h\in C({\mathbb{R}}^{d}) and t[0,1]t\in[0,1] where 0αt1,βt00\leq\alpha_{t}\leq 1,\beta_{t}\geq 0 for any t[0,1]t\in[0,1]. Then it holds that

(AtL2βt2)𝐈dx2log𝒬tf(x)At𝐈d,\displaystyle\left(-A_{t}-L^{2}\beta_{t}^{2}\right){\mathbf{I}}_{d}\preceq\nabla^{2}_{x}\log\mathcal{Q}_{t}f(x)\preceq A_{t}{\mathbf{I}}_{d},

where At:=5Lβt2(1αt2)12(L+(12log(1αt2))12)A_{t}:=5L\beta_{t}^{2}(1-\alpha_{t}^{2})^{-\frac{1}{2}}(L+(-\frac{1}{2}\log(1-\alpha_{t}^{2}))^{-\frac{1}{2}}).

Proof  It is easy to notice that 𝒬tf(x)=𝒫sf(βtesx)\mathcal{Q}_{t}f(x)=\mathcal{P}_{s}f(\beta_{t}e^{s}x) where s=12log(1αt2)s=-\frac{1}{2}\log(1-\alpha_{t}^{2}). Then it follows that x2log𝒬tf(x)=(βtes)2(x2log𝒫sf)(βtesx)\nabla^{2}_{x}\log\mathcal{Q}_{t}f(x)=(\beta_{t}e^{s})^{2}(\nabla^{2}_{x}\log\mathcal{P}_{s}f)(\beta_{t}e^{s}x) which yields

(AtL2βt2)𝐈dx2log𝒬tf(x)At𝐈d,\displaystyle\left(-A_{t}-L^{2}\beta_{t}^{2}\right){\mathbf{I}}_{d}\preceq\nabla^{2}_{x}\log\mathcal{Q}_{t}f(x)\preceq A_{t}{\mathbf{I}}_{d},

where At:=5Lβt2(1αt2)12(L+(12log(1αt2))12)A_{t}:=5L\beta_{t}^{2}(1-\alpha_{t}^{2})^{-\frac{1}{2}}(L+(-\frac{1}{2}\log(1-\alpha_{t}^{2}))^{-\frac{1}{2}}).  

Lemma 44

The Jacobian matrix of the velocity field (3.5) has an alternative expression over time t(0,1)t\in(0,1), that is,

xv(t,x)=(b˙tbtat2a˙tat)(x2log𝒬~tf(x)1at2+bt2𝐈d)+b˙tbt𝐈d,\nabla_{x}v(t,x)=\left(\tfrac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\left(\nabla^{2}_{x}\log\widetilde{\mathcal{Q}}_{t}f(x)-\tfrac{1}{a_{t}^{2}+b_{t}^{2}}{\mathbf{I}}_{d}\right)+\tfrac{\dot{b}_{t}}{b_{t}}{\mathbf{I}}_{d},

where f(x):=dνdγd(x)f(x):=\frac{\mathrm{d}\nu}{\mathrm{d}\gamma_{d}}(x) and 𝒬~tf(x):=𝔼𝖹γd[f(btat2+bt2x+atat2+bt2𝖹)]\widetilde{\mathcal{Q}}_{t}f(x):=\mathbb{E}_{\mathsf{Z}\sim\gamma_{d}}[f(\frac{b_{t}}{a_{t}^{2}+b_{t}^{2}}x+\frac{a_{t}}{\sqrt{a_{t}^{2}+b_{t}^{2}}}\mathsf{Z})].

Proof  By direct calculations, it holds that

pt(x)\displaystyle p_{t}(x) =atddp1(y)φ(xbtyat)dy=atddf(y)φ(y)φ(xbtyat)dy\displaystyle=a_{t}^{-d}\int_{{\mathbb{R}}^{d}}p_{1}(y)\varphi\left(\frac{x-b_{t}y}{a_{t}}\right)\mathrm{d}y=a_{t}^{-d}\int_{{\mathbb{R}}^{d}}f(y)\varphi(y)\varphi\left(\frac{x-b_{t}y}{a_{t}}\right)\mathrm{d}y
=atdφ((at2+bt2)12x)df(y)φ((atat2+bt2)1(ybtat2+bt2x))dy\displaystyle=a_{t}^{-d}\varphi\left((a_{t}^{2}+b_{t}^{2})^{-\frac{1}{2}}x\right)\int_{{\mathbb{R}}^{d}}f(y)\varphi\left(\left(\frac{a_{t}}{\sqrt{a_{t}^{2}+b_{t}^{2}}}\right)^{-1}\left(y-\frac{b_{t}}{a_{t}^{2}+b_{t}^{2}}x\right)\right)\mathrm{d}y
=atdφ((at2+bt2)12x)(atat2+bt2)ddf(btat2+bt2x+atat2+bt2z)dγd(z)\displaystyle=a_{t}^{-d}\varphi\left((a_{t}^{2}+b_{t}^{2})^{-\frac{1}{2}}x\right)\left(\frac{a_{t}}{\sqrt{a_{t}^{2}+b_{t}^{2}}}\right)^{d}\int_{{\mathbb{R}}^{d}}f\left(\frac{b_{t}}{a_{t}^{2}+b_{t}^{2}}x+\frac{a_{t}}{\sqrt{a_{t}^{2}+b_{t}^{2}}}z\right)\mathrm{d}\gamma_{d}(z)
=(at2+bt2)d/2φ((at2+bt2)12x)𝒬~tf(x).\displaystyle=(a_{t}^{2}+b_{t}^{2})^{-d/2}\varphi\left((a_{t}^{2}+b_{t}^{2})^{-\frac{1}{2}}x\right)\widetilde{\mathcal{Q}}_{t}f(x).

Taking the logarithm and then the second-order derivative of the equation above, it yields

xs(t,x)=x2log𝒬~tf(x)1at2+bt2𝐈d.\nabla_{x}s(t,x)=\nabla^{2}_{x}\log\widetilde{\mathcal{Q}}_{t}f(x)-\tfrac{1}{a_{t}^{2}+b_{t}^{2}}{\mathbf{I}}_{d}.

Recalling that xv(t,x)=(b˙tbtat2a˙tat)xs(t,x)+b˙tbt𝐈d\nabla_{x}v(t,x)=\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\nabla_{x}s(t,x)+\frac{\dot{b}_{t}}{b_{t}}{\mathbf{I}}_{d}, it further yields that

xv(t,x)=(b˙tbtat2a˙tat)x2log𝒬~tf(x)+a˙tat+b˙tbtat2+bt2𝐈d,\nabla_{x}v(t,x)=\left(\tfrac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\nabla^{2}_{x}\log\widetilde{\mathcal{Q}}_{t}f(x)+\tfrac{\dot{a}_{t}a_{t}+\dot{b}_{t}b_{t}}{a_{t}^{2}+b_{t}^{2}}{\mathbf{I}}_{d},

which completes the proof.  

Corollary 45

Suppose that f(x):=dνdγd(x)f(x):=\frac{\mathrm{d}\nu}{\mathrm{d}\gamma_{d}}(x) is LL-log-Lipschitz for some L0L\geq 0. Then for t(0,1)t\in(0,1), it holds that

{(b˙tbtat2a˙tat)(BtL2(btat2+bt2)2)+a˙tat+b˙tbtat2+bt2}𝐈d\displaystyle\left\{\left(\tfrac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\left(-B_{t}-L^{2}\left(\tfrac{b_{t}}{a_{t}^{2}+b_{t}^{2}}\right)^{2}\right)+\tfrac{\dot{a}_{t}a_{t}+\dot{b}_{t}b_{t}}{a_{t}^{2}+b_{t}^{2}}\right\}{\mathbf{I}}_{d}
xv(t,x){(b˙tbtat2a˙tat)Bt+a˙tat+b˙tbtat2+bt2}𝐈d,\displaystyle\preceq\nabla_{x}v(t,x)\preceq\left\{\left(\tfrac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)B_{t}+\tfrac{\dot{a}_{t}a_{t}+\dot{b}_{t}b_{t}}{a_{t}^{2}+b_{t}^{2}}\right\}{\mathbf{I}}_{d},

where Bt:=5Lbt(at2+bt2)32(L+(log(at2+bt2/bt))12)B_{t}:=5Lb_{t}(a_{t}^{2}+b_{t}^{2})^{-\frac{3}{2}}(L+(\log(\sqrt{a_{t}^{2}+b_{t}^{2}}/b_{t}))^{-\frac{1}{2}}).

Proof  Let αt=atat2+bt2\alpha_{t}=\frac{a_{t}}{\sqrt{a_{t}^{2}+b_{t}^{2}}} and βt=btat2+bt2\beta_{t}=\frac{b_{t}}{a_{t}^{2}+b_{t}^{2}}. Then these bounds hold according to Corollary 43 and Lemma 44.  

We are now prepared to prove Proposition 20. The proof is mainly based on techniques for bounding conditional covariance matrices developed in a series of works (Wibisono and Jog, 2018a, b; Mikulincer and Shenfeld, 2021, 2023; Chewi and Pooladian, 2022; Dai et al., 2023).

Proof  [Proof of Proposition 20]

  • (a)

    By Jung’s theorem (Danzer et al., 1963, Theorem 2.6), there exists a closed Euclidean ball with radius less than D:=(1/2)diam(supp(ν))D:=(1/\sqrt{2})\mathrm{diam}(\mathrm{supp}(\nu)) that contains supp(ν)\mathrm{supp}(\nu) in d{\mathbb{R}}^{d}. Then the desired bounds hold due to 0𝐈dCov(𝖸|𝖷t=x)D2𝐈d0{\mathbf{I}}_{d}\preceq\mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x)\preceq D^{2}{\mathbf{I}}_{d} and Eq. (4.3).

  • (b)

    Let p1p_{1} be β\beta-semi-log-convex for some β>0\beta>0 on d{\mathbb{R}}^{d}. Then for any t[0,1)t\in[0,1), the conditional distribution p(y|t,x)p(y|t,x) is (β+bt2at2)\left(\beta+\frac{b_{t}^{2}}{a_{t}^{2}}\right)-semi-log-convex because

    y2logp(y|t,x)=y2logp1(y)y2logp(t,x|y)(β+bt2at2)𝐈d.-\nabla^{2}_{y}\log p(y|t,x)=-\nabla^{2}_{y}\log p_{1}(y)-\nabla^{2}_{y}\log p(t,x|y)\preceq\left(\beta+\frac{b_{t}^{2}}{a_{t}^{2}}\right){\mathbf{I}}_{d}.

    By the Cramér-Rao inequality (2.4), we obtain

    Cov(𝖸|𝖷t=x)(β+bt2at2)1𝐈d.\mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x)\succeq\left(\beta+\frac{b_{t}^{2}}{a_{t}^{2}}\right)^{-1}{\mathbf{I}}_{d}.

    Therefore, by Eq. (4.3), we obtain

    xv(t,x){(b˙tbta˙tat)bt2βat2+bt2+a˙tat}𝐈d,\nabla_{x}v(t,x)\succeq\left\{\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)\frac{b_{t}^{2}}{\beta a_{t}^{2}+b_{t}^{2}}+\frac{\dot{a}_{t}}{a_{t}}\right\}{\mathbf{I}}_{d},

    which implies

    xv(t,x)βata˙t+btb˙tβat2+bt2𝐈d.\nabla_{x}v(t,x)\succeq\frac{\beta a_{t}\dot{a}_{t}+b_{t}\dot{b}_{t}}{\beta a_{t}^{2}+b_{t}^{2}}{\mathbf{I}}_{d}.

    In addition, the bound above can be verified at time t=1t=1 by the definition (A.6).

  • (c)

    Let p1p_{1} be κ\kappa-semi-log-concave for some κ\kappa\in{\mathbb{R}}. Then for any t[0,1)t\in[0,1), the conditional distribution p(y|t,x)p(y|t,x) is (κ+bt2at2)\left(\kappa+\frac{b_{t}^{2}}{a_{t}^{2}}\right)-semi-log-concave because

    y2logp(y|t,x)=y2logp1(y)y2logp(t,x|y)(κ+bt2at2)𝐈d.-\nabla^{2}_{y}\log p(y|t,x)=-\nabla^{2}_{y}\log p_{1}(y)-\nabla^{2}_{y}\log p(t,x|y)\succeq\left(\kappa+\frac{b_{t}^{2}}{a_{t}^{2}}\right){\mathbf{I}}_{d}.

    When t{t:κ+bt2at2>0,t(0,1)}t\in\left\{t:\kappa+\frac{b_{t}^{2}}{a_{t}^{2}}>0,t\in(0,1)\right\}, by the Brascamp-Lieb inequality (2.2), we obtain

    Cov(𝖸|𝖷t=x)(κ+bt2at2)1𝐈d.\mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x)\preceq\left(\kappa+\frac{b_{t}^{2}}{a_{t}^{2}}\right)^{-1}{\mathbf{I}}_{d}.

    Therefore, by Eq. (4.3), we obtain

    xv(t,x){(b˙tbta˙tat)bt2κat2+bt2+a˙tat}𝐈d,\nabla_{x}v(t,x)\preceq\left\{\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)\frac{b_{t}^{2}}{\kappa a_{t}^{2}+b_{t}^{2}}+\frac{\dot{a}_{t}}{a_{t}}\right\}{\mathbf{I}}_{d},

    which implies

    xv(t,x)κata˙t+btb˙tκat2+bt2𝐈d.\nabla_{x}v(t,x)\preceq\frac{\kappa a_{t}\dot{a}_{t}+b_{t}\dot{b}_{t}}{\kappa a_{t}^{2}+b_{t}^{2}}{\mathbf{I}}_{d}.

    Moreover, the bound above can be verified at time t=1t=1 by the definition (A.6).

  • (d)

    Notice that

    p(y|t,x)\displaystyle p(y|t,x) =p(t,x|y)pt(x)d(γd,σ2ρ)dy\displaystyle=\frac{p(t,x|y)}{p_{t}(x)}\frac{\mathrm{d}(\gamma_{d,\sigma^{2}}*\rho)}{\mathrm{d}y}
    =Ax,tdφz,σ2(y)φxbt,at2bt2(y)ρ(dz),\displaystyle=A_{x,t}\int_{{\mathbb{R}}^{d}}\varphi_{z,\sigma^{2}}(y)\varphi_{\tfrac{x}{b_{t}},\tfrac{a_{t}^{2}}{b_{t}^{2}}}(y)\rho(\mathrm{d}z),

    where the prefactor Ax,tA_{x,t} only depends on xx and tt. Then it follows that

    p(y|t,x)=dφat2z+σ2btxat2+σ2bt2,σ2at2at2+σ2bt2(y)ρ~(dz)p(y|t,x)=\int_{{\mathbb{R}}^{d}}\varphi_{\frac{a_{t}^{2}z+\sigma^{2}b_{t}x}{a_{t}^{2}+\sigma^{2}b_{t}^{2}},\frac{\sigma^{2}a_{t}^{2}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}}(y)\tilde{\rho}(\mathrm{d}z)

    where ρ~\tilde{\rho} is a probability measure on d{\mathbb{R}}^{d} whose density function is a multiple of ρ\rho by a positive function. It also indicates that ρ~\tilde{\rho} is supported on the same Euclidean ball as ρ\rho. To further illustrate p(y|t,x)p(y|t,x), let 𝖰ρ~\mathsf{Q}\sim\tilde{\rho} and 𝖹γd\mathsf{Z}\sim\gamma_{d} be independent. Then it holds that

    at2at2+σ2bt2𝖰+σ2at2at2+σ2bt2𝖹+σ2btat2+σ2bt2xp(y|t,x).\frac{a_{t}^{2}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}\mathsf{Q}+\sqrt{\frac{\sigma^{2}a_{t}^{2}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}}\mathsf{Z}+\frac{\sigma^{2}b_{t}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}x\sim p(y|t,x).

    Thus, it holds that

    Cov(𝖸|𝖷t=x)\displaystyle\mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x) =(at2at2+σ2bt2)2Cov(𝖰)+σ2at2at2+σ2bt2𝐈d\displaystyle=\left(\frac{a_{t}^{2}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}\right)^{2}\mathrm{Cov}(\mathsf{Q})+\frac{\sigma^{2}a_{t}^{2}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}{\mathbf{I}}_{d}
    {(at2at2+σ2bt2)2R2+σ2at2at2+σ2bt2}𝐈d.\displaystyle\preceq\left\{\left(\frac{a_{t}^{2}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}\right)^{2}R^{2}+\frac{\sigma^{2}a_{t}^{2}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}\right\}{\mathbf{I}}_{d}.

    By Eq. (4.3), it holds that

    xv(t,x)bt2at2(b˙tbta˙tat)((at2at2+σ2bt2)2R2+σ2at2at2+σ2bt2)𝐈d+a˙tat𝐈d,\nabla_{x}v(t,x)\preceq\frac{b_{t}^{2}}{a_{t}^{2}}\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)\left(\left(\frac{a_{t}^{2}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}\right)^{2}R^{2}+\frac{\sigma^{2}a_{t}^{2}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}\right){\mathbf{I}}_{d}+\frac{\dot{a}_{t}}{a_{t}}{\mathbf{I}}_{d},

    which implies

    xv(t,x){atbt(atb˙ta˙tbt)(at2+σ2bt2)2R2+a˙tat+σ2b˙tbtat2+σ2bt2}𝐈d.\nabla_{x}v(t,x)\preceq\left\{\frac{a_{t}b_{t}(a_{t}\dot{b}_{t}-\dot{a}_{t}b_{t})}{(a_{t}^{2}+\sigma^{2}b_{t}^{2})^{2}}R^{2}+\frac{\dot{a}_{t}a_{t}+\sigma^{2}\dot{b}_{t}b_{t}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}\right\}{\mathbf{I}}_{d}.

    Analogously, due to Cov(𝖰)0𝐈d\mathrm{Cov}(\mathsf{Q})\succeq 0{\mathbf{I}}_{d}, a lower bound would be yielded as follows

    xv(t,x)a˙tat+σ2b˙tbtat2+σ2bt2𝐈d.\nabla_{x}v(t,x)\succeq\frac{\dot{a}_{t}a_{t}+\sigma^{2}\dot{b}_{t}b_{t}}{a_{t}^{2}+\sigma^{2}b_{t}^{2}}{\mathbf{I}}_{d}.

    Then the results follow by combining the upper and lower bounds.

  • (e)

    The result follows from Corollary 45.

We complete the proof.  

Proof  [Proof of Corollary 21] Consider \kappa>0 and distinguish two cases: \kappa D^{2}\geq 1 and \kappa D^{2}<1. First, suppose that \kappa D^{2}\geq 1 holds. By Proposition 20, the \kappa-based upper bound is tighter, that is,

λmax(xv(t,x))θt:=κata˙t+btb˙tκat2+bt2.\displaystyle\lambda_{\max}(\nabla_{x}v(t,x))\leq\theta_{t}:=\frac{\kappa a_{t}\dot{a}_{t}+b_{t}\dot{b}_{t}}{\kappa a_{t}^{2}+b_{t}^{2}}.

Next, suppose that \kappa D^{2}<1 holds. Let t_{1} be defined in Eq. (4.5). Again, by Proposition 20, the D^{2}-based upper bound is tighter over [0,t_{1}) and the \kappa-based upper bound is tighter over [t_{1},1], which yields

λmax(xv(t,x))θt:={bt2at2(b˙tbta˙tat)D2+a˙tat,t[0,t1),κata˙t+btb˙tκat2+bt2,t[t1,1].\displaystyle\lambda_{\max}(\nabla_{x}v(t,x))\leq\theta_{t}:=\begin{cases}\frac{b_{t}^{2}}{a_{t}^{2}}\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)D^{2}+\frac{\dot{a}_{t}}{a_{t}},\ &t\in[0,t_{1}),\\ \frac{\kappa a_{t}\dot{a}_{t}+b_{t}\dot{b}_{t}}{\kappa a_{t}^{2}+b_{t}^{2}},\ &t\in[t_{1},1].\end{cases}

This completes the proof.  

Proof  [Proof of Corollary 22] Let κ<0,D<\kappa<0,D<\infty such that κD2<1\kappa D^{2}<1 is fulfilled. Then an argument similar to the proof of Corollary 21 yields the desired bounds.  

Proof  [Proof of Corollary 23] The result follows from Proposition 20-(d).  

Proof  [Proof of Corollary 24] The L-based upper and lower bounds in Proposition 20-(e) would blow up at time t=1 because the term (\log(\sqrt{a_{t}^{2}+b_{t}^{2}}/b_{t}))^{-\frac{1}{2}} in B_{t} goes to \infty as t\to 1. To ensure that the spatial derivative of the velocity field v(t,x) is upper bounded at time t=1, we additionally require that the target measure be \kappa-semi-log-concave with \kappa\leq 0. Hence, a \kappa-based upper bound is available for t\in(t_{0},1], as shown in Proposition 20-(c). The two upper bounds are then combined by first choosing any t_{2}\in(t_{0},1); we exploit the L-based bound over [0,t_{2}) and the \kappa-based bound over [t_{2},1]. This completes the proof.

Appendix D Proofs of well-posedness and Lipschitz flow maps

The proofs of main results in Section 5 are offered in the following. Before proceeding, let us introduce some definitions and notations about function spaces that are collected in (Evans, 2010, Chapter 5). Let Lloc1(d;):={locally integrable function u:d}L_{\mathrm{loc}}^{1}({\mathbb{R}}^{d};{\mathbb{R}}^{\ell}):=\{\textrm{locally integrable function }u:{\mathbb{R}}^{d}\to{\mathbb{R}}^{\ell}\}. For integers k0k\geq 0 and 1p1\leq p\leq\infty, we define the Sobolev space Wk,p(d):={uLloc1(d)|Dαu exists and DαuLp(d) for |α|k}W^{k,p}({\mathbb{R}}^{d}):=\{u\in L^{1}_{\mathrm{loc}}({\mathbb{R}}^{d})|D^{\alpha}u\textrm{ exists and }D^{\alpha}u\in L^{p}({\mathbb{R}}^{d})\textrm{ for }|\alpha|\leq k\}, where DαuD^{\alpha}u is the weak derivative of uu. Then the local Sobolev space Wlock,p(d)W_{\mathrm{loc}}^{k,p}({\mathbb{R}}^{d}) is defined as the function space such that for any uWlock,p(d)u\in W_{\mathrm{loc}}^{k,p}({\mathbb{R}}^{d}) and any compact set Ωd\Omega\subset{\mathbb{R}}^{d}, uWk,p(Ω)u\in W^{k,p}(\Omega). As a result, we denote the vector-valued local Sobolev space by Wlock,p(d;)W_{\mathrm{loc}}^{k,p}({\mathbb{R}}^{d};{\mathbb{R}}^{\ell}). Provided that v(t,x):[0,1]×ddv(t,x):[0,1]\times{\mathbb{R}}^{d}\to{\mathbb{R}}^{d}, we use vL1([0,1];Wloc1,(d;d))v\in L^{1}([0,1];W^{1,\infty}_{\mathrm{loc}}({\mathbb{R}}^{d};{\mathbb{R}}^{d})) to indicate that vv has a finite L1L^{1} norm over (t,x)[0,1]×d(t,x)\in[0,1]\times{\mathbb{R}}^{d} and v(t,)Wloc1,(d;d)v(t,\cdot)\in W^{1,\infty}_{\mathrm{loc}}({\mathbb{R}}^{d};{\mathbb{R}}^{d}) for any t[0,1]t\in[0,1]. Similarly, we say vL1([0,1];L(d;d))v\in L^{1}([0,1];L^{\infty}({\mathbb{R}}^{d};{\mathbb{R}}^{d})) when vv has a finite L1L^{1} norm over (t,x)[0,1]×d(t,x)\in[0,1]\times{\mathbb{R}}^{d} and v(t,)L(d;d)v(t,\cdot)\in L^{\infty}({\mathbb{R}}^{d};{\mathbb{R}}^{d}) for every t[0,1]t\in[0,1]. We will use the definitions and notations in the following proof.

Proof  [Proof of Theorem 25] Under Assumptions 1 and 2, we claim that the velocity field v(t,x)v(t,x) satisfies

vL1([0,1];Wloc1,(d;d)),v21+x2L1([0,1];L(d;d)).\displaystyle v\in L^{1}([0,1];W^{1,\infty}_{\mathrm{loc}}({\mathbb{R}}^{d};{\mathbb{R}}^{d})),\quad\frac{\|v\|_{2}}{1+\|x\|_{2}}\in L^{1}([0,1];L^{\infty}({\mathbb{R}}^{d};{\mathbb{R}}^{d})).

where the first condition indicates that the velocity field v is locally bounded and locally Lipschitz continuous in x, and the second condition is a growth condition on v. According to the Cauchy-Lipschitz theorem (Ambrosio and Crippa, 2014, Remark 2.4), we have the representation formulae for solutions of the continuity equation. As a result, there exists a flow (X_{t})_{t\in[0,1]} that uniquely solves the IVP (3.10). Furthermore, the marginal flow of (X_{t})_{t\in[0,1]} satisfies the continuity equation (3.4) in the weak sense. It then remains to show that the velocity field v is locally bounded, locally Lipschitz continuous in x, and satisfies the growth condition. By the lower and upper bounds given in Proposition 20, we know that v is globally Lipschitz continuous in x under Assumptions 1 and 2. Indeed, the global Lipschitz continuity leads to the local boundedness and linear growth properties by simple arguments. More concretely, for any t\in(0,1), it holds that

v(t,0)\displaystyle v(t,0) =(b˙ta˙tatbt)𝔼[𝖷1|𝖷t=0]=(b˙ta˙tatbt)dyp(y|t,0)dy\displaystyle=\left(\dot{b}_{t}-\frac{\dot{a}_{t}}{a_{t}}b_{t}\right)\mathbb{E}[\mathsf{X}_{1}|\mathsf{X}_{t}=0]=\left(\dot{b}_{t}-\frac{\dot{a}_{t}}{a_{t}}b_{t}\right)\int_{\mathbb{R}^{d}}yp(y|t,0)\mathrm{d}y
(b˙ta˙tatbt)dyp1(y)atdexp(bt2y222at2)dy,\displaystyle\lesssim\left(\dot{b}_{t}-\frac{\dot{a}_{t}}{a_{t}}b_{t}\right)\int_{{\mathbb{R}}^{d}}yp_{1}(y)a_{t}^{-d}\exp\left(-\frac{b_{t}^{2}\|y\|^{2}_{2}}{2a_{t}^{2}}\right)\mathrm{d}y,

which implies \|v(t,0)\|_{2}<\infty due to the fast decay of the Gaussian factor. Besides, it holds that v(0,0)=(\dot{b}_{0}-\tfrac{\dot{a}_{0}}{a_{0}}b_{0})\mathbb{E}[\mathsf{X}_{1}|\mathsf{X}_{0}=0] and v(1,0)=-\dot{a}_{1}a_{1}s(1,0) are both finite. Then by the boundedness of \|v(t,0)\|_{2} and the global Lipschitz continuity in x over t\in[0,1], we bound v(t,x) as follows

v(t,x)2\displaystyle\|v(t,x)\|_{2} v(t,0)2+v(t,x)v(t,0)2\displaystyle\leq\|v(t,0)\|_{2}+\|v(t,x)-v(t,0)\|_{2}
v(t,0)2+{sup(t,y)[0,1]×dyv(t,y)2,2}x2\displaystyle\leq\|v(t,0)\|_{2}+\left\{\sup_{(t,y)\in[0,1]\times{\mathbb{R}}^{d}}\|\nabla_{y}v(t,y)\|_{2,2}\right\}\|x\|_{2}
max{x2,1}.\displaystyle\lesssim\max\{\|x\|_{2},1\}.

Hence, the local boundedness and linear growth properties of vv are proved. This completes the proof.  

Proof  [Proof of Theorem 26] The proof is similar to that of Theorem 25.  

Proof  [Proof of Corollary 27] A well-posed ODE flow has the time-reversal symmetry (Lamb and Roberts, 1998). By Theorem 25, the desired results are proved.  

Proof  [Proof of Corollary 28] The proof is similar to that of Corollary 27.  

Proof  [Proof of Proposition 30] Combining Proposition 20-(b), (c), and Lemma 29, we complete the proof.  

Proof  [Proof of Proposition 31] Combining Proposition 20-(d) and Lemma 29, we complete the proof.  

Proof  [Proof of Corollary 34] By Theorem 25 and Corollary 27, it holds that

X1X1=X1X11=𝐈d.\displaystyle X_{1}\circ X_{1}^{*}=X_{1}\circ X_{1}^{-1}={\mathbf{I}}_{d}.

This completes the proof.  

Proof  [Proof of Corollary 35] By Theorem 25 and Corollary 27, it holds that

X1,1X2,1X2,1X1,1=X1,1X2,11X2,1X1,11=𝐈d.\displaystyle X_{1,1}\circ X_{2,1}^{*}\circ X_{2,1}\circ X_{1,1}^{*}=X_{1,1}\circ X_{2,1}^{-1}\circ X_{2,1}\circ X_{1,1}^{-1}={\mathbf{I}}_{d}.

This completes the proof.  

Proof  [Proof of Corollary 36] Let Assumptions 1 and 2 hold. According to Propositions 30 and 31, xX1(x)2,2\|\nabla_{x}X_{1}(x)\|_{2,2} is uniformly bounded for Case (i)-(iii) in Assumption 2. For Case (iv), the boundedness of xX1(x)2,2\|\nabla_{x}X_{1}(x)\|_{2,2} holds by combining Corollary 24 and Lemma 29. Using Proposition 20, we know that xv(t,x)2,2\|\nabla_{x}v(t,x)\|_{2,2} is uniformly bounded.  

Proof  [Proof of Proposition 33] The proof idea is similar to those of (Ball et al., 2003, Proposition 1) and (Cattiaux and Guillin, 2014, Proposition 18). Let f:Ωf:\Omega\to{\mathbb{R}} be of class C1C^{1} and 𝖷tpt\mathsf{X}_{t}\sim p_{t}. First, we consider the case of log-Sobolev inequalities. Using that 𝖹γd\mathsf{Z}\sim\gamma_{d} and 𝖷1ν\mathsf{X}_{1}\sim\nu both satisfy the log-Sobolev inequalities in Definition 47, we have

\mathbb{E}[(f^{2}\log f^{2})(\mathsf{X}_{t})]=\mathbb{E}[(f^{2}\log f^{2})(a_{t}\mathsf{Z}+b_{t}\mathsf{X}_{1})]
\leq\int\left(\int f^{2}(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\right)\log\left(\int f^{2}(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\right)\mathrm{d}\nu(x)
\quad+\int\left(2C_{\mathrm{LS}}(\gamma_{d})\int a_{t}^{2}(\|\nabla f\|_{2}^{2})(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\right)\mathrm{d}\nu(x)
\leq\left(\int\int f^{2}(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\mathrm{d}\nu(x)\right)\log\left(\int\int f^{2}(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\mathrm{d}\nu(x)\right)
\quad+2C_{\mathrm{LS}}(\nu)\int\Big\|\nabla_{x}\Big(\int f^{2}(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\Big)^{\frac{1}{2}}\Big\|_{2}^{2}\mathrm{d}\nu(x)
\quad+2a_{t}^{2}C_{\mathrm{LS}}(\gamma_{d})\int\int(\|\nabla f\|_{2}^{2})(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\mathrm{d}\nu(x)
\leq\mathbb{E}[f^{2}(\mathsf{X}_{t})]\log\left(\mathbb{E}[f^{2}(\mathsf{X}_{t})]\right)+2a_{t}^{2}C_{\mathrm{LS}}(\gamma_{d})\mathbb{E}[\|\nabla f(\mathsf{X}_{t})\|_{2}^{2}]
\quad+2C_{\mathrm{LS}}(\nu)\int\Big\|\nabla_{x}\Big(\int f^{2}(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\Big)^{\frac{1}{2}}\Big\|_{2}^{2}\mathrm{d}\nu(x).

By Jensen’s inequality and the Cauchy–Schwarz inequality, it holds that

\int\Big\|\nabla_{x}\Big(\int f^{2}(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\Big)^{\frac{1}{2}}\Big\|_{2}^{2}\mathrm{d}\nu(x)
\leq b_{t}^{2}\frac{\int\left(\int(\|f\nabla f\|_{2})(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\right)^{2}\mathrm{d}\nu(x)}{\int\int f^{2}(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\mathrm{d}\nu(x)}
\leq b_{t}^{2}\int\int(\|\nabla f\|_{2}^{2})(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\mathrm{d}\nu(x)
\leq b_{t}^{2}\mathbb{E}[\|\nabla f(\mathsf{X}_{t})\|_{2}^{2}].

Hence, combining the equations above with the fact that C_{\mathrm{LS}}(\gamma_{d})\leq 1 (Gross, 1975), we obtain

\mathbb{E}[(f^{2}\log f^{2})(\mathsf{X}_{t})]-\mathbb{E}[f^{2}(\mathsf{X}_{t})]\log\left(\mathbb{E}[f^{2}(\mathsf{X}_{t})]\right)\leq 2\left[a_{t}^{2}+b_{t}^{2}C_{\mathrm{LS}}(\nu)\right]\mathbb{E}[\|\nabla f(\mathsf{X}_{t})\|_{2}^{2}],

that is, C_{\mathrm{LS}}(p_{t})\leq a_{t}^{2}+b_{t}^{2}C_{\mathrm{LS}}(\nu).

Next, we tackle the case of Poincaré inequalities by similar calculations. Using that \mathsf{Z}\sim\gamma_{d} and \mathsf{X}_{1}\sim\nu both satisfy the Poincaré inequalities in Definition 48, we have

\mathbb{E}[f^{2}(\mathsf{X}_{t})]=\mathbb{E}[f^{2}(a_{t}\mathsf{Z}+b_{t}\mathsf{X}_{1})]
\leq\int\left(\int f(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\right)^{2}\mathrm{d}\nu(x)
\quad+\int\left(C_{\mathrm{P}}(\gamma_{d})\int a_{t}^{2}(\|\nabla f\|_{2}^{2})(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\right)\mathrm{d}\nu(x)
\leq\left(\int\int f(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\mathrm{d}\nu(x)\right)^{2}
\quad+C_{\mathrm{P}}(\nu)\int\Big\|\nabla_{x}\Big(\int f(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\Big)\Big\|_{2}^{2}\mathrm{d}\nu(x)
\quad+a_{t}^{2}C_{\mathrm{P}}(\gamma_{d})\int\int(\|\nabla f\|_{2}^{2})(a_{t}z+b_{t}x)\mathrm{d}\gamma_{d}(z)\mathrm{d}\nu(x)
\leq\left(\mathbb{E}[f(\mathsf{X}_{t})]\right)^{2}+\left[a_{t}^{2}C_{\mathrm{P}}(\gamma_{d})+b_{t}^{2}C_{\mathrm{P}}(\nu)\right]\mathbb{E}[\|\nabla f(\mathsf{X}_{t})\|_{2}^{2}].

Combining the expression above with C_{\mathrm{P}}(\gamma_{d})\leq 1, we obtain

\mathbb{E}[f^{2}(\mathsf{X}_{t})]-\left(\mathbb{E}[f(\mathsf{X}_{t})]\right)^{2}\leq\left[a_{t}^{2}+b_{t}^{2}C_{\mathrm{P}}(\nu)\right]\mathbb{E}[\|\nabla f(\mathsf{X}_{t})\|_{2}^{2}],

that is, C_{\mathrm{P}}(p_{t})\leq a_{t}^{2}+b_{t}^{2}C_{\mathrm{P}}(\nu). This completes the proof.  
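As a quick sanity check of this bound (under the additional assumption, not required by Proposition 33, that the target itself is Gaussian), take \nu=\gamma_{d,\sigma^{2}}. Then p_{t}=\gamma_{d,a_{t}^{2}+b_{t}^{2}\sigma^{2}}, and since the Poincaré constant of an isotropic Gaussian equals its variance,

C_{\mathrm{P}}(p_{t})=a_{t}^{2}+b_{t}^{2}\sigma^{2}=a_{t}^{2}+b_{t}^{2}C_{\mathrm{P}}(\nu),

so the bound is attained with equality; the same computation applies verbatim to the log-Sobolev constant.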

Appendix E Proofs of the stability results

We provide the proofs of the stability results in Section 6.

Proof  [Proof of Proposition 37] Let x_{0}=a_{0}z+b_{0}x_{1} and suppose X_{0}(x_{0})\sim\mu, X_{0}(a_{0}z)\sim\gamma_{d,a_{0}^{2}}. According to Corollary 36, the Lipschitz property of X_{1}(x) implies that \|X_{1}(x_{0})-X_{1}(a_{0}z)\|\leq C_{1}\|x_{0}-a_{0}z\|. We consider an integral defined by

I_{t}:=\int\|x_{0}-a_{0}z\|^{2}\mathrm{d}\pi_{t}(X_{t}(x_{0}),X_{t}(a_{0}z)),

where \pi_{t} is the coupling given by the joint distribution of (X_{t}(x_{0}),X_{t}(a_{0}z)). In particular, the initial value I_{0} is computed by

I_{0}=\int\|x_{0}-a_{0}z\|^{2}p_{0}(x_{0})\varphi(z)\mathrm{d}x_{0}\mathrm{d}z=\int\|b_{0}x_{1}\|^{2}p_{1}(x_{1})\mathrm{d}x_{1}=b_{0}^{2}\mathbb{E}_{\nu}[\|\mathsf{X}_{1}\|^{2}].

Since (X_{t})_{t\in[0,1]} is well-posed with X_{0}(x_{0})\sim\mu or X_{0}(a_{0}z)\sim\gamma_{d,a_{0}^{2}}, according to Corollary 41, the coupling \pi_{t} satisfies the following differential equation

\partial_{t}\log\pi_{t}(X_{t}(x_{0}),X_{t}(a_{0}z))=-\operatorname{Tr}((\nabla_{x}v)(t,X_{t}(x_{0})))-\operatorname{Tr}((\nabla_{x}v)(t,X_{t}(a_{0}z))). (E.1)

Taking the derivative of I_{t} and using Eq. (E.1), we obtain

\frac{\mathrm{d}I_{t}}{\mathrm{d}t}\leq 2\left(\sup_{(s,x)\in[0,1]\times\mathbb{R}^{d}}\|\operatorname{Tr}(\nabla_{x}v(s,x))\|\right)I_{t}.

Thanks to \|\operatorname{Tr}(\nabla_{x}v(s,x))\|\leq d\|\nabla_{x}v(s,x)\|_{2,2}, it follows that

\frac{\mathrm{d}I_{t}}{\mathrm{d}t}\leq 2C_{2}dI_{t},\quad I_{0}=b_{0}^{2}\mathbb{E}_{\nu}[\|\mathsf{X}_{1}\|^{2}].

By Grönwall’s inequality, it holds that I_{t}\leq b_{0}^{2}\mathbb{E}_{\nu}[\|\mathsf{X}_{1}\|^{2}]\exp(2C_{2}dt). Therefore, we obtain the following W_{2} bound

W_{2}({X_{1}}_{\#}\gamma_{d,a_{0}^{2}},\nu)=W_{2}({X_{1}}_{\#}\gamma_{d,a_{0}^{2}},{X_{1}}_{\#}\mu)\leq C_{1}\sqrt{I_{1}}\leq C_{1}b_{0}\sqrt{\mathbb{E}_{\nu}[\|\mathsf{X}_{1}\|^{2}]}\exp(C_{2}d),

which completes the proof.  
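The Grönwall step above can be illustrated numerically: for any rate \alpha(t)\leq 2C_{2}d, the solution of \mathrm{d}I_{t}/\mathrm{d}t=\alpha(t)I_{t} stays below I_{0}\exp(2C_{2}dt). The sketch below uses a hypothetical rate and hypothetical constants C_{2}, d, and I_{0}, chosen only for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Gronwall illustration: if dI/dt <= 2*C2*d * I with I(0) = I0,
# then I(t) <= I0 * exp(2*C2*d*t).  alpha(t) below is a made-up rate
# satisfying alpha(t) <= 2*C2*d on [0, 1].
C2, d, I0 = 0.3, 2, 0.5
alpha = lambda t: 2.0 * C2 * d * np.sin(np.pi * t) ** 2

sol = solve_ivp(lambda t, I: alpha(t) * I, (0.0, 1.0), [I0], dense_output=True)
for t in np.linspace(0.0, 1.0, 6):
    print(f"t={t:.1f}  I(t)={sol.sol(t)[0]:.4f}  Gronwall bound={I0 * np.exp(2 * C2 * d * t):.4f}")
```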

Proof  [Proof of Proposition 39]

  • (i)

On the one hand, by Corollary 36, v(t,x) is Lipschitz continuous in x uniformly over (t,x)\in[0,1]\times\mathbb{R}^{d} with Lipschitz constant C_{2}. By the variational equation (6.5) and Lemma 29, it follows that

\|\nabla_{x}X_{s,t}(x)\|_{2,2}^{2}\leq\exp\left(2\int_{s}^{t}\theta_{u}\mathrm{d}u\right).

    Due to the equality (6.4), we deduce that

\|X_{1}(x_{0})-Y_{1}(x_{0})\|^{2}
\leq\left(\int_{0}^{1}\|(\nabla_{x}X_{s,1})(Y_{s}(x_{0}))\|_{2,2}\|v(s,Y_{s}(x_{0}))-\tilde{v}(s,Y_{s}(x_{0}))\|\mathrm{d}s\right)^{2}
\leq\left(\int_{0}^{1}\|(\nabla_{x}X_{s,1})(Y_{s}(x_{0}))\|_{2,2}^{2}\mathrm{d}s\right)\left(\int_{0}^{1}\|v(s,Y_{s}(x_{0}))-\tilde{v}(s,Y_{s}(x_{0}))\|^{2}\mathrm{d}s\right)
\leq\int_{0}^{1}\exp\left(2\int_{s}^{1}\theta_{u}\mathrm{d}u\right)\mathrm{d}s\int_{0}^{1}\|v(s,Y_{s}(x_{0}))-\tilde{v}(s,Y_{s}(x_{0}))\|^{2}\mathrm{d}s.

Taking expectations, it follows that

W_{2}^{2}({Y_{1}}_{\#}\mu,\nu)\leq\mathbb{E}_{x_{0}\sim\mu}\left[\|Y_{1}(x_{0})-X_{1}(x_{0})\|^{2}\right]
\leq\int_{0}^{1}\exp\left(2\int_{s}^{1}\theta_{u}\mathrm{d}u\right)\mathrm{d}s\int_{0}^{1}\int_{\mathbb{R}^{d}}\|v(t,x)-\tilde{v}(t,x)\|^{2}\tilde{q}_{t}(x)\mathrm{d}x\mathrm{d}t
\leq\varepsilon\int_{0}^{1}\exp\left(2\int_{s}^{1}\theta_{u}\mathrm{d}u\right)\mathrm{d}s,

where \tilde{q}_{t} denotes the density function of {Y_{t}}_{\#}\mu, and we use the assumption that

\int_{0}^{1}\int_{\mathbb{R}^{d}}\|v(t,x)-\tilde{v}(t,x)\|^{2}\tilde{q}_{t}(x)\mathrm{d}x\mathrm{d}t\leq\varepsilon

    in the last inequality.

  • (ii)

On the other hand, suppose that \tilde{v}(t,x) is Lipschitz continuous in x uniformly over (t,x)\in[0,1]\times\mathbb{R}^{d} with Lipschitz constant C_{3}. Applying Grönwall’s inequality to the variational equation (6.7), it follows that

\|\nabla_{x}Y_{s,t}(x)\|_{2,2}^{2}\leq\exp(2C_{3}(t-s)).

    By the equality (6.6), it holds that

\|Y_{1}(x_{0})-X_{1}(x_{0})\|^{2}
\leq\left(\int_{0}^{1}\|(\nabla_{x}Y_{s,1})(X_{s}(x_{0}))\|_{2,2}\|v(s,X_{s}(x_{0}))-\tilde{v}(s,X_{s}(x_{0}))\|\mathrm{d}s\right)^{2}
\leq\left(\int_{0}^{1}\|(\nabla_{x}Y_{s,1})(X_{s}(x_{0}))\|_{2,2}^{2}\mathrm{d}s\right)\left(\int_{0}^{1}\|v(s,X_{s}(x_{0}))-\tilde{v}(s,X_{s}(x_{0}))\|^{2}\mathrm{d}s\right)
\leq\frac{\exp(2C_{3})-1}{2C_{3}}\int_{0}^{1}\|v(s,X_{s}(x_{0}))-\tilde{v}(s,X_{s}(x_{0}))\|^{2}\mathrm{d}s.

Taking expectations further yields

W_{2}^{2}({Y_{1}}_{\#}\mu,\nu)\leq\mathbb{E}_{x_{0}\sim\mu}\left[\|Y_{1}(x_{0})-X_{1}(x_{0})\|^{2}\right]
\leq\frac{\exp(2C_{3})-1}{2C_{3}}\int_{0}^{1}\int_{\mathbb{R}^{d}}\|v(t,x)-\tilde{v}(t,x)\|^{2}p_{t}(x)\mathrm{d}x\mathrm{d}t,

where X_{t}(x_{0})\sim p_{t}. This completes the proof.
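The bound in part (ii) can be probed numerically with a minimal Monte Carlo sketch. Below we reuse the hypothetical one-dimensional Gaussian toy flow (target N(0,\sigma^{2}), schedule a_{t}=1-t, b_{t}=t, so v(t,x)=\lambda(t)x in closed form) and perturb the velocity by a hand-picked bounded term \varepsilon\sin(x); the constants \sigma, \varepsilon, the perturbation, and the sample size are illustrative assumptions rather than quantities from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Monte Carlo sketch of part (ii): the endpoint discrepancy of the perturbed
# flow is controlled by (exp(2*C3) - 1) / (2*C3) times the integrated squared
# velocity error, which here is at most eps^2 since |v - v_tilde| <= eps.
rng = np.random.default_rng(0)
sigma, eps = 2.0, 0.05
lam = lambda t: (t * sigma**2 - (1.0 - t)) / ((1.0 - t)**2 + (t * sigma)**2)
v = lambda t, x: lam(t) * x
v_tilde = lambda t, x: lam(t) * x + eps * np.sin(x)    # perturbed velocity field

def flow(field, x0):
    return solve_ivp(field, (0.0, 1.0), x0, rtol=1e-8, atol=1e-10).y[:, -1]

x0 = rng.standard_normal(2000)                         # samples from the source p_0 = N(0, 1)
mse = np.mean((flow(v, x0) - flow(v_tilde, x0)) ** 2)

C3 = max(abs(lam(t)) for t in np.linspace(0.0, 1.0, 1001)) + eps   # Lipschitz constant of v_tilde
bound = (np.exp(2 * C3) - 1.0) / (2 * C3) * eps**2
print(f"E|Y_1 - X_1|^2 = {mse:.2e}  <=  bound = {bound:.2e}")
```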

 

Appendix F Time derivative of the velocity field

In this appendix, we represent the time derivative of the velocity field via the moments of \mathsf{Y}|\mathsf{X}_{t}=x. The result is useful for controlling the time derivative with moment estimates, although the computation is somewhat tedious.

Proposition 46

The time derivative of the velocity field v(t,x) can be expressed in terms of the moments of \mathsf{X}_{1}|\mathsf{X}_{t} for any t\in(0,1) as follows

\partial_{t}v(t,x)=\left(\frac{\ddot{a}_{t}}{a_{t}}-\frac{\dot{a}_{t}^{2}}{a_{t}^{2}}\right)x+\left(a_{t}^{2}\frac{\ddot{b}_{t}}{b_{t}}-\dot{a}_{t}a_{t}\frac{\dot{b}_{t}}{b_{t}}-\ddot{a}_{t}a_{t}+\dot{a}^{2}_{t}\right)\frac{b_{t}}{a_{t}^{2}}M_{1}
\quad+\frac{b_{t}^{2}}{a_{t}^{2}}\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)\left(\frac{\dot{b}_{t}}{b_{t}}-2\frac{\dot{a}_{t}}{a_{t}}\right)M^{c}_{2}x-\frac{b_{t}^{3}}{a_{t}^{2}}\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)^{2}\left(M_{3}-M_{2}M_{1}\right),

where M_{1}:=\mathbb{E}[\mathsf{X}_{1}|\mathsf{X}_{t}=x], M_{2}:=\mathbb{E}[\mathsf{X}_{1}^{\top}\mathsf{X}_{1}|\mathsf{X}_{t}=x], M^{c}_{2}:=\mathrm{Cov}(\mathsf{X}_{1}|\mathsf{X}_{t}=x), and M_{3}:=\mathbb{E}[\mathsf{X}_{1}\mathsf{X}_{1}^{\top}\mathsf{X}_{1}|\mathsf{X}_{t}=x].

Proof  By direct differentiation, we obtain

\partial_{t}v(t,x)=\partial_{t}\left(\frac{\dot{b}_{t}}{b_{t}}\right)x+\partial_{t}\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)s(t,x)+\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\partial_{t}s(t,x)
=\frac{\ddot{b}_{t}b_{t}-\dot{b}_{t}^{2}}{b_{t}^{2}}x+\left(\frac{\ddot{b}_{t}b_{t}-\dot{b}_{t}^{2}}{b_{t}^{2}}a_{t}^{2}+\frac{\dot{b}_{t}}{b_{t}}2\dot{a}_{t}a_{t}-\ddot{a}_{t}a_{t}-\dot{a}_{t}^{2}\right)s(t,x)+\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\partial_{t}s(t,x).

We first focus on \partial_{t}s(t,x). Since p_{t} satisfies the continuity equation (3.4), it holds that

\partial_{t}s(t,x)=\nabla_{x}(\partial_{t}\log p_{t}(x))
=-\nabla_{x}\left(\frac{\nabla_{x}\cdot(p_{t}(x)v(t,x))}{p_{t}(x)}\right)
=-\nabla_{x}\left(\frac{(\nabla_{x}p_{t}(x))^{\top}v(t,x)+p_{t}(x)(\nabla_{x}\cdot v(t,x))}{p_{t}(x)}\right)
=-\nabla_{x}\left(s(t,x)^{\top}v(t,x)+\nabla_{x}\cdot v(t,x)\right)
=-\left((\nabla_{x}s(t,x))^{\top}v(t,x)+(\nabla_{x}v(t,x))^{\top}s(t,x)+\nabla_{x}(\nabla_{x}\cdot v(t,x))\right)
=-\left(\nabla_{x}s(t,x)v(t,x)+\nabla_{x}v(t,x)s(t,x)+\nabla_{x}\operatorname{Tr}(\nabla_{x}v(t,x))\right).

By direct computation, it holds that

\nabla_{x}s(t,x)v(t,x)+\nabla_{x}v(t,x)s(t,x)
=\nabla_{x}s(t,x)\left(\frac{\dot{b}_{t}}{b_{t}}x+\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)s(t,x)\right)+\nabla_{x}\left(\frac{\dot{b}_{t}}{b_{t}}x+\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)s(t,x)\right)s(t,x)
=\frac{\dot{b}_{t}}{b_{t}}\nabla_{x}s(t,x)x+\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\nabla_{x}s(t,x)s(t,x)+\frac{\dot{b}_{t}}{b_{t}}s(t,x)+\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\nabla_{x}s(t,x)s(t,x)
=\frac{\dot{b}_{t}}{b_{t}}s(t,x)+\frac{\dot{b}_{t}}{b_{t}}\nabla_{x}s(t,x)x+2\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\nabla_{x}s(t,x)s(t,x).

Then we focus on the trace term

\nabla_{x}\operatorname{Tr}(\nabla_{x}v(t,x))
=\nabla_{x}\operatorname{Tr}\left(\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)\frac{b_{t}^{2}}{a_{t}^{2}}\mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x)+\frac{\dot{a}_{t}}{a_{t}}\mathbf{I}_{d}\right)
=\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)\frac{b_{t}^{2}}{a_{t}^{2}}\nabla_{x}\operatorname{Tr}(\mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x))
=\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)\frac{b_{t}^{2}}{a_{t}^{2}}\nabla_{x}\left(\int\|y\|^{2}p(y|t,x)\mathrm{d}y-\left\|\int yp(y|t,x)\mathrm{d}y\right\|^{2}\right)
=\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)\frac{b_{t}^{2}}{a_{t}^{2}}\left(\int\|y\|^{2}\nabla_{x}p(y|t,x)\mathrm{d}y-2\left(\int\nabla_{x}p(y|t,x)\otimes y\mathrm{d}y\right)\left(\int yp(y|t,x)\mathrm{d}y\right)\right),

where we notice that

\nabla_{x}p(y|t,x)=\nabla_{x}\left(\frac{p(t,x|y)p_{1}(y)}{p_{t}(x)}\right)
=\frac{\nabla_{x}p(t,x|y)p_{1}(y)}{p_{t}(x)}-\frac{p(t,x|y)p_{1}(y)}{p_{t}(x)}s(t,x)
=p(y|t,x)\left(\frac{b_{t}y-x}{a_{t}^{2}}-s(t,x)\right).

For ease of presentation, we introduce the following notation for several moments of \mathsf{Y}|\mathsf{X}_{t}=x:

M_{1}:=\mathbb{E}[\mathsf{Y}|\mathsf{X}_{t}=x],\qquad M_{2}:=\mathbb{E}[\mathsf{Y}^{\top}\mathsf{Y}|\mathsf{X}_{t}=x],
M^{c}_{2}:=\mathrm{Cov}(\mathsf{Y}|\mathsf{X}_{t}=x),\qquad M_{3}:=\mathbb{E}[\mathsf{Y}\mathsf{Y}^{\top}\mathsf{Y}|\mathsf{X}_{t}=x].

By Tweedie’s formula in Lemma 49, we have s(t,x)=\frac{b_{t}}{a_{t}^{2}}M_{1}-\frac{1}{a_{t}^{2}}x. Using this expression of s(t,x), we obtain

\nabla_{x}s(t,x)v(t,x)+\nabla_{x}v(t,x)s(t,x)
=\frac{\dot{b}_{t}}{b_{t}}s(t,x)+\frac{\dot{b}_{t}}{b_{t}}\nabla_{x}s(t,x)x+2\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\nabla_{x}s(t,x)s(t,x)
=\frac{\dot{b}_{t}}{b_{t}}\left(\frac{b_{t}}{a_{t}^{2}}M_{1}-\frac{1}{a_{t}^{2}}x\right)+\frac{\dot{b}_{t}}{b_{t}}\left(\frac{b_{t}^{2}}{a_{t}^{4}}M^{c}_{2}-\frac{1}{a_{t}^{2}}\mathbf{I}_{d}\right)x
\quad+2\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\left(\frac{b_{t}^{2}}{a_{t}^{4}}M^{c}_{2}-\frac{1}{a_{t}^{2}}\mathbf{I}_{d}\right)\left(\frac{b_{t}}{a_{t}^{2}}M_{1}-\frac{1}{a_{t}^{2}}x\right)
=-2\frac{\dot{a}_{t}}{a_{t}^{3}}x+\frac{b_{t}}{a_{t}^{2}}\left(2\frac{\dot{a}_{t}}{a_{t}}-\frac{\dot{b}_{t}}{b_{t}}\right)M_{1}+\frac{b_{t}^{2}}{a_{t}^{4}}\left(2\frac{\dot{a}_{t}}{a_{t}}-\frac{\dot{b}_{t}}{b_{t}}\right)M^{c}_{2}x+2\frac{b_{t}^{3}}{a_{t}^{4}}\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)M^{c}_{2}M_{1}

and \nabla_{x}p(y|t,x)=\frac{b_{t}}{a_{t}^{2}}\left(y-M_{1}\right)p(y|t,x). Therefore, we obtain

\int\|y\|^{2}\nabla_{x}p(y|t,x)\mathrm{d}y-2\left(\int\nabla_{x}p(y|t,x)\otimes y\mathrm{d}y\right)\left(\int yp(y|t,x)\mathrm{d}y\right)
=\int\|y\|^{2}\frac{b_{t}}{a_{t}^{2}}\left(y-M_{1}\right)p(y|t,x)\mathrm{d}y-2\left(\int\frac{b_{t}}{a_{t}^{2}}\left(y-M_{1}\right)\otimes yp(y|t,x)\mathrm{d}y\right)\left(\int yp(y|t,x)\mathrm{d}y\right)
=\frac{b_{t}}{a_{t}^{2}}\left[\int\|y\|^{2}yp(y|t,x)\mathrm{d}y-\left(\int\|y\|^{2}p(y|t,x)\mathrm{d}y\right)M_{1}\right.
\quad\left.-2\left(\int y\otimes yp(y|t,x)\mathrm{d}y-M_{1}\otimes\int yp(y|t,x)\mathrm{d}y\right)\left(\int yp(y|t,x)\mathrm{d}y\right)\right]
=\frac{b_{t}}{a_{t}^{2}}\left(M_{3}-M_{2}M_{1}-2M^{c}_{2}M_{1}\right).

Combining the equations above, we obtain

\partial_{t}v(t,x)=\frac{\ddot{b}_{t}b_{t}-\dot{b}_{t}^{2}}{b_{t}^{2}}x+\left(\frac{\ddot{b}_{t}b_{t}-\dot{b}_{t}^{2}}{b_{t}^{2}}a_{t}^{2}+\frac{\dot{b}_{t}}{b_{t}}2\dot{a}_{t}a_{t}-\ddot{a}_{t}a_{t}-\dot{a}^{2}_{t}\right)\left(\frac{b_{t}}{a_{t}^{2}}M_{1}-\frac{1}{a_{t}^{2}}x\right)
\quad-\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\left[-2\frac{\dot{a}_{t}}{a_{t}^{3}}x+\frac{b_{t}}{a_{t}^{2}}\left(2\frac{\dot{a}_{t}}{a_{t}}-\frac{\dot{b}_{t}}{b_{t}}\right)M_{1}+\frac{b_{t}^{2}}{a_{t}^{4}}\left(2\frac{\dot{a}_{t}}{a_{t}}-\frac{\dot{b}_{t}}{b_{t}}\right)M^{c}_{2}x+2\frac{b_{t}^{3}}{a_{t}^{4}}\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)M^{c}_{2}M_{1}\right]
\quad-\left(\frac{\dot{b}_{t}}{b_{t}}a_{t}^{2}-\dot{a}_{t}a_{t}\right)\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)\frac{b_{t}^{3}}{a_{t}^{4}}(M_{3}-M_{2}M_{1}-2M^{c}_{2}M_{1})
=\left(\frac{\ddot{a}_{t}}{a_{t}}-\frac{\dot{a}_{t}^{2}}{a_{t}^{2}}\right)x+\left(a_{t}^{2}\frac{\ddot{b}_{t}}{b_{t}}-\dot{a}_{t}a_{t}\frac{\dot{b}_{t}}{b_{t}}-\ddot{a}_{t}a_{t}+\dot{a}^{2}_{t}\right)\frac{b_{t}}{a_{t}^{2}}M_{1}
\quad+\frac{b_{t}^{2}}{a_{t}^{2}}\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)\left(\frac{\dot{b}_{t}}{b_{t}}-2\frac{\dot{a}_{t}}{a_{t}}\right)M^{c}_{2}x-\frac{b_{t}^{3}}{a_{t}^{2}}\left(\frac{\dot{b}_{t}}{b_{t}}-\frac{\dot{a}_{t}}{a_{t}}\right)^{2}\left(M_{3}-M_{2}M_{1}\right).

This completes the proof.  
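Proposition 46 can be verified numerically in a toy setting where all conditional moments are available in closed form. The sketch below assumes a one-dimensional Gaussian target \nu=N(0,\sigma^{2}) and the schedule a_{t}=1-t, b_{t}=t (our illustrative choices), for which the velocity is v(t,x)=(\dot{a}_{t}a_{t}+\dot{b}_{t}b_{t}\sigma^{2})x/(a_{t}^{2}+b_{t}^{2}\sigma^{2}), and compares the right-hand side of Proposition 46 with a finite-difference approximation of \partial_{t}v(t,x).

```python
import numpy as np

# Check of Proposition 46 for a 1-D Gaussian target nu = N(0, sigma^2)
# with the (assumed) schedule a_t = 1 - t, b_t = t.
sigma, t, x = 2.0, 0.3, 0.7
a, da, dda = 1.0 - t, -1.0, 0.0
b, db, ddb = t, 1.0, 0.0
V = a**2 + b**2 * sigma**2                       # variance of X_t

# Conditional moments of X_1 | X_t = x (Gaussian posterior).
M1 = b * sigma**2 * x / V
M2c = a**2 * sigma**2 / V
M2 = M1**2 + M2c
M3 = M1**3 + 3.0 * M1 * M2c

# Right-hand side of Proposition 46.
dtv = ((dda / a - da**2 / a**2) * x
       + (a**2 * ddb / b - da * a * db / b - dda * a + da**2) * (b / a**2) * M1
       + (b**2 / a**2) * (db / b - da / a) * (db / b - 2.0 * da / a) * M2c * x
       - (b**3 / a**2) * (db / b - da / a)**2 * (M3 - M2 * M1))

# Finite-difference reference for d/dt v(t, x).
lam = lambda s: (-(1.0 - s) + s * sigma**2) / ((1.0 - s)**2 + (s * sigma)**2)
h = 1e-5
dtv_fd = (lam(t + h) - lam(t - h)) / (2.0 * h) * x

print(f"Proposition 46: {dtv:.6f}   finite difference: {dtv_fd:.6f}")
```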

Appendix G Functional inequalities and Tweedie’s formula

This appendix provides an exposition of the functional inequalities and Tweedie’s formula used in our proofs.

For a probability measure \mu on a compact set \Omega\subset\mathbb{R}^{d}, we define the variance of a function f\in L^{2}(\Omega,\mu) as

\mathrm{Var}_{\mu}(f):=\int_{\Omega}f^{2}\mathrm{d}\mu-\left(\int_{\Omega}f\mathrm{d}\mu\right)^{2}.

Moreover, for a probability measure \mu on a compact set \Omega\subset\mathbb{R}^{d} and any positive integrable function f:\Omega\to\mathbb{R} such that \int_{\Omega}f|\log f|\mathrm{d}\mu<\infty, we define the entropy of f as

\mathrm{Ent}_{\mu}(f):=\int_{\Omega}f\log f\mathrm{d}\mu-\int_{\Omega}f\mathrm{d}\mu\log\left(\int_{\Omega}f\mathrm{d}\mu\right).
Definition 47 (Log-Sobolev inequality)

A probability measure \mu\in\mathcal{P}(\Omega) is said to satisfy a log-Sobolev inequality with constant C>0 if, for all functions f:\Omega\to\mathbb{R}, it holds that

\mathrm{Ent}_{\mu}(f^{2})\leq 2C\int_{\Omega}\|\nabla f\|_{2}^{2}\mathrm{d}\mu.

The best constant C>0 for which such an inequality holds is referred to as the log-Sobolev constant C_{\mathrm{LS}}(\mu).

Definition 48 (Poincaré inequality)

A probability measure \mu\in\mathcal{P}(\Omega) is said to satisfy a Poincaré inequality with constant C>0 if, for all functions f:\Omega\to\mathbb{R}, it holds that

\mathrm{Var}_{\mu}(f)\leq C\int_{\Omega}\|\nabla f\|_{2}^{2}\mathrm{d}\mu.

The best constant C>0 for which such an inequality holds is referred to as the Poincaré constant C_{\mathrm{P}}(\mu).
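As a standard example (recorded here only because it underlies the constants used in the proof of Proposition 33), take \mu=\gamma_{d} and f(x)=x_{1}. Then

\mathrm{Var}_{\gamma_{d}}(f)=1=\int\|\nabla f\|_{2}^{2}\mathrm{d}\gamma_{d},

so C_{\mathrm{P}}(\gamma_{d})\geq 1; combined with the Gaussian Poincaré inequality this yields C_{\mathrm{P}}(\gamma_{d})=1, and by scaling C_{\mathrm{P}}(\gamma_{d,\sigma^{2}})=\sigma^{2}.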

Finally, for ease of reference, we present Tweedie’s formula, which was first reported in Robbins (1956) and later used as a simple empirical Bayes approach for correcting selection bias (Efron, 2011). Here, we use Tweedie’s formula to link the score function with the conditional expectation given an observation corrupted by Gaussian noise.

Lemma 49 (Tweedie’s formula)

Suppose that \mathsf{X}\sim\mu and \epsilon\sim\gamma_{d,\sigma^{2}}. Let \mathsf{Y}=\mathsf{X}+\epsilon and let p(y) be the marginal density of \mathsf{Y}. Then \mathbb{E}[\mathsf{X}|\mathsf{Y}=y]=y+\sigma^{2}\nabla_{y}\log p(y).
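Lemma 49 can be checked directly in a small example. The sketch below uses a hypothetical two-point prior \mathsf{X}\in\{-1,+1\} with equal weights and Gaussian noise, for which the posterior mean is \tanh(y/\sigma^{2}) in closed form, and compares it with the right-hand side of Tweedie’s formula evaluated via a finite-difference score; the prior and \sigma are our illustrative choices.

```python
import numpy as np

# Sanity check of Tweedie's formula for a two-point prior X in {-1, +1}
# (equal weights) and Gaussian noise eps ~ N(0, sigma^2).
sigma = 0.8

def log_p(y):
    # log marginal density of Y = X + eps, up to an additive constant
    return np.logaddexp(-(y - 1.0)**2 / (2 * sigma**2), -(y + 1.0)**2 / (2 * sigma**2))

h = 1e-5
for y in (-1.5, -0.3, 0.0, 0.7, 2.0):
    score = (log_p(y + h) - log_p(y - h)) / (2 * h)    # numerical d/dy log p(y)
    tweedie = y + sigma**2 * score                     # right-hand side of Lemma 49
    exact = np.tanh(y / sigma**2)                      # E[X | Y = y] for this prior
    print(f"y={y:+.2f}   Tweedie: {tweedie:+.6f}   exact posterior mean: {exact:+.6f}")
```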


References

  • Albergo and Vanden-Eijnden (2023) Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2023.
  • Albergo et al. (2023a) Michael S. Albergo, Nicholas M. Boffi, Michael Lindsey, and Eric Vanden-Eijnden. Multimarginal generative modeling with stochastic interpolants. arXiv preprint arXiv:2310.03695, 2023a.
  • Albergo et al. (2023b) Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023b.
  • Albergo et al. (2023c) Michael S. Albergo, Mark Goldstein, Nicholas M. Boffi, Rajesh Ranganath, and Eric Vanden-Eijnden. Stochastic interpolants with data-dependent couplings. arXiv preprint arXiv:2310.03725, 2023c.
  • Ambrosio and Crippa (2014) Luigi Ambrosio and Gianluca Crippa. Continuity equations and ODE flows with non-smooth velocity. Proceedings of the Royal Society of Edinburgh Section A: Mathematics, 144(6):1191–1244, 2014.
  • Ambrosio et al. (2008) Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: In metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • Ambrosio et al. (2023) Luigi Ambrosio, Sebastiano N. Golo, and Francesco S. Cassano. Classical flows of vector fields with exponential or sub-exponential summability. Journal of Differential Equations, 372:458–504, 2023.
  • Ansari et al. (2021) Abdul Fatir Ansari, Ming Liang Ang, and Harold Soh. Refining deep generative models via discriminator gradient flow. In International Conference on Learning Representations, 2021.
  • Arbel et al. (2019) Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient flow. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 214–223. PMLR, 2017.
  • Bakry and Émery (1985) Dominique Bakry and Michel Émery. Diffusions hypercontractives. In Seminaire de probabilités XIX 1983/84, pages 177–206. Springer, 1985.
  • Bakry et al. (2014) Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of Markov diffusion operators, volume 103. Springer, 2014.
  • Ball et al. (2003) Keith Ball, Franck Barthe, and Assaf Naor. Entropy jumps in the presence of a spectral gap. Duke Mathematical Journal, 119(1):41 – 63, 2003.
  • Benton et al. (2023) Joe Benton, George Deligiannidis, and Arnaud Doucet. Error bounds for flow matching methods. arXiv preprint arXiv:2305.16860, 2023.
  • Biloš et al. (2021) Marin Biloš, Johanna Sommer, Syama Sundar Rangapuram, Tim Januschowski, and Stephan Günnemann. Neural flows: Efficient alternative to neural ODEs. In Advances in Neural Information Processing Systems, volume 34, pages 21325–21337. Curran Associates, Inc., 2021.
  • Bobkov and Ledoux (2000) Sergey G. Bobkov and Michel Ledoux. From Brunn-Minkowski to Brascamp-Lieb and to logarithmic Sobolev inequalities. Geometric and Functional Analysis, 10(5):1028–1052, 2000.
  • Bortoli (2022) Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022.
  • Brascamp and Lieb (1976) Herm J. Brascamp and Elliott H. Lieb. On extensions of the Brunn-Minkowski and Prékopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. Journal of Functional Analysis, 22(4):366–389, 1976.
  • Caffarelli (2000) Luis A. Caffarelli. Monotonicity properties of optimal transportation and the FKG and related inequalities. Communications in Mathematical Physics, 214(3):547–563, 2000.
  • Cai and Wu (2014) Tony T. Cai and Yihong Wu. Optimal detection of sparse mixtures against a given null distribution. IEEE Transactions on Information Theory, 60(4):2217–2232, 2014.
  • Cattiaux and Guillin (2014) Patrick Cattiaux and Arnaud Guillin. Semi log-concave Markov diffusions. In Catherine Donati-Martin, Antoine Lejay, and Alain Rouault, editors, Séminaire de probabilités XLVI, pages 231–292. Springer International Publishing, Cham, 2014.
  • Chen et al. (2023a) Hongrui Chen, Holden Lee, and Jianfeng Lu. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 4735–4763. PMLR, 2023a.
  • Chen and Lipman (2023) Ricky T.Q. Chen and Yaron Lipman. Riemannian flow matching on general geometries. arXiv preprint arXiv:2302.03660, 2023.
  • Chen et al. (2018) Ricky T.Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • Chen et al. (2023b) Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: Theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations, 2023b.
  • Chewi and Pooladian (2022) Sinho Chewi and Aram-Alexandre Pooladian. An entropic generalization of Caffarelli’s contraction theorem via covariance inequalities. arXiv preprint arXiv:2203.04954, 2022.
  • Colombo et al. (2017) Maria Colombo, Alessio Figalli, and Yash Jhaveri. Lipschitz changes of variables between perturbations of log-concave measures. Annali della Scuola Normale Superiore di Pisa. Classe di scienze, 17(4):1491–1519, 2017.
  • Cordero-Erausquin (2017) Dario Cordero-Erausquin. Transport inequalities for log-concave measures, quantitative forms, and applications. Canadian Journal of Mathematics, 69(3):481–501, 2017.
  • Dai et al. (2023) Yin Dai, Yuan Gao, Jian Huang, Yuling Jiao, Lican Kang, and Jin Liu. Lipschitz transport maps via the Föllmer flow. arXiv preprint arXiv:2309.03490, 2023.
  • Danzer et al. (1963) Ludwig Danzer, Branko Grünbaum, and Victor Klee. Helly’s theorem and its relatives. In Proceedings of Symposia in Pure Mathematics: Convexity, volume VII, pages 101–180, Providence, RI, 1963. American Mathematical Society.
  • De Bortoli et al. (2021) Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling. In Advances in Neural Information Processing Systems, volume 34, pages 17695–17709. Curran Associates, Inc., 2021.
  • Del Moral and Singh (2022) Pierre Del Moral and Sumeetpal S. Singh. Backward Itô–Ventzell and stochastic interpolation formulae. Stochastic Processes and their Applications, 154:197–250, 2022.
  • Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, volume 34, pages 8780–8794, 2021.
  • Dinh et al. (2014) Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
  • Duncan et al. (2023) Andrew Duncan, Nikolas Nüsken, and Lukasz Szpruch. On the geometry of Stein variational gradient descent. Journal of Machine Learning Research, 24(56):1–39, 2023.
  • Dytso et al. (2023a) Alex Dytso, Martina Cardone, and Ian Zieder. Meta derivative identity for the conditional expectation. IEEE Transactions on Information Theory, 69(7):4284–4302, 2023a.
  • Dytso et al. (2023b) Alex Dytso, H. Vincent Poor, and Shlomo Shamai Shitz. Conditional mean estimation in Gaussian noise: A meta derivative identity with applications. IEEE Transactions on Information Theory, 69(3):1883–1898, 2023b.
  • Efron (2011) Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
  • Eldan and Lee (2018) Ronen Eldan and James R. Lee. Regularization under diffusion and anticoncentration of the information content. Duke Mathematical Journal, 167(5):969–993, 2018.
  • Evans (2010) Lawrence C. Evans. Partial differential equations, volume 19 of Graduate studies in mathematics. American Mathematical Society, Providence, Rhode Island, second edition edition, 2010.
  • Fan et al. (2022) Jiaojiao Fan, Qinsheng Zhang, Amirhossein Taghvaei, and Yongxin Chen. Variational Wasserstein gradient flow. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 6185–6215. PMLR, 2022.
  • Fathi et al. (2023) Max Fathi, Dan Mikulincer, and Yair Shenfeld. Transportation onto log-Lipschitz perturbations. arXiv preprint arXiv:2305.03786, 2023.
  • Finlay et al. (2020) Chris Finlay, Jörn-Henrik Jacobsen, Levon Nurbekyan, and Adam Oberman. How to train your neural ODE: The world of Jacobian and kinetic regularization. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 3154–3164. PMLR, 2020.
  • Gao et al. (2019) Yuan Gao, Yuling Jiao, Yang Wang, Yao Wang, Can Yang, and Shunkang Zhang. Deep generative learning via variational gradient flow. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 2093–2101. PMLR, 2019.
  • Gao et al. (2022) Yuan Gao, Jian Huang, Yuling Jiao, Jin Liu, Xiliang Lu, and Zhijian Yang. Deep generative learning via Euler particle transport. In Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, volume 145, pages 336–368. PMLR, 2022.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • Goodfellow et al. (2020) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • Grathwohl et al. (2019) Will Grathwohl, Ricky T.Q. Chen, Jesse Bettencourt, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.
  • Gross (1975) Leonard Gross. Logarithmic Sobolev inequalities. American Journal of Mathematics, 97(4):1061–1083, 1975.
  • Hairer et al. (1993) Ernst Hairer, Gerhard Wanner, and Syvert P. Nørsett. Classical Mathematical Theory, chapter I, pages 1–128. Springer Berlin Heidelberg, Berlin, Heidelberg, 1993.
  • Hartman (2002a) Philip Hartman. Dependence on Initial Conditions and Parameters, chapter V, pages 93–116. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2002a.
  • Hartman (2002b) Philip Hartman. Existence, chapter II, pages 8–23. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2002b.
  • Hatsell and Nolte (1971) Charles P. Hatsell and Loren W. Nolte. Some geometric properties of the likelihood ratio (corresp.). IEEE Transactions on Information Theory, 17(5):616–618, 1971.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851, 2020.
  • Johnson and Zhang (2018) Rie Johnson and Tong Zhang. Composite functional gradient learning of generative adversarial models. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 2371–2379. PMLR, 2018.
  • Johnson and Zhang (2019) Rie Johnson and Tong Zhang. A framework of composite functional gradient methods for generative adversarial models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):17–32, 2019.
  • Karras et al. (2022) Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, volume 35, pages 26565–26577. Curran Associates, Inc., 2022.
  • Kim et al. (2023) Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. arXiv preprint arXiv:2310.02279, 2023.
  • Kingma and Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
  • Kingma and Welling (2019) Diederik P. Kingma and Max Welling. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019.
  • Klartag (2010) Bo’az Klartag. High-dimensional distributions with convexity properties. In Proceedings of the Fifth European Congress of Mathematics, pages 401–417, Amsterdam, 14 July–18 July 2010. European Mathematical Society Publishing House.
  • Kobyzev et al. (2020) Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964–3979, 2020.
  • Lamb and Roberts (1998) Jeroen S.W. Lamb and John A.G. Roberts. Time-reversal symmetry in dynamical systems: A survey. Physica D: Nonlinear Phenomena, 112(1-2):1–39, 1998.
  • Lee et al. (2023) Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. In Proceedings of The 34th International Conference on Algorithmic Learning Theory, volume 201, pages 946–985. PMLR, 2023.
  • Liang (2021) Tengyuan Liang. How well generative adversarial networks learn distributions. Journal of Machine Learning Research, 22(228):1–41, 2021.
  • Lipman et al. (2023) Yaron Lipman, Ricky T.Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.
  • Liu (2022) Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577, 2022.
  • Liu et al. (2023) Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023.
  • Liutkus et al. (2019) Antoine Liutkus, Umut Simsekli, Szymon Majewski, Alain Durmus, and Fabian-Robert Stöter. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 4104–4113. PMLR, 2019.
  • Lu et al. (2022a) Cheng Lu, Kaiwen Zheng, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Maximum likelihood training for score-based diffusion ODEs by high order denoising score matching. In Proceedings of the 39th International Conference on Machine Learning, pages 14429–14460. PMLR, 2022a.
  • Lu et al. (2022b) Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Advances in Neural Information Processing Systems, volume 35, pages 5775–5787. Curran Associates, Inc., 2022b.
  • Makkuva et al. (2020) Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport mapping via input convex neural networks. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 6672–6681. PMLR, 2020.
  • Marion (2023) Pierre Marion. Generalization bounds for neural ordinary differential equations and deep residual networks. arXiv preprint arXiv:2305.06648, 2023.
  • Marion et al. (2023) Pierre Marion, Yu-Han Wu, Michael E Sander, and Gérard Biau. Implicit regularization of deep residual networks towards neural ODEs. arXiv preprint arXiv:2309.01213, 2023.
  • Marzouk et al. (2023) Youssef Marzouk, Zhi Ren, Sven Wang, and Jakob Zech. Distribution learning via neural differential equations: A nonparametric statistical perspective. arXiv preprint arXiv:2309.01043, 2023.
  • Mikulincer and Shenfeld (2021) Dan Mikulincer and Yair Shenfeld. The Brownian transport map. arXiv preprint arXiv:2111.11521, 2021.
  • Mikulincer and Shenfeld (2023) Dan Mikulincer and Yair Shenfeld. On the Lipschitz properties of transportation along heat flows. In Geometric Aspects of Functional Analysis: Israel Seminar (GAFA) 2020-2022, pages 269–290. Springer, 2023.
  • Mroueh and Nguyen (2021) Youssef Mroueh and Truyen Nguyen. On the convergence of gradient descent in GANs: MMD GAN as a gradient flow. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130, pages 1720–1728. PMLR, 2021.
  • Mroueh et al. (2019) Youssef Mroueh, Tom Sercu, and Anant Raj. Sobolev descent. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89, pages 2976–2985. PMLR, 2019.
  • Neeman (2022) Joe Neeman. Lipschitz changes of variables via heat flow. arXiv preprint arXiv:2201.03403, 2022.
  • Neklyudov et al. (2023) Kirill Neklyudov, Rob Brekelmans, Daniel Severo, and Alireza Makhzani. Action matching: Learning stochastic dynamics from samples. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 25858–25889. PMLR, 2023.
  • Onken et al. (2021) Derek Onken, Samy Wu Fung, Xingjian Li, and Lars Ruthotto. OT-flow: Fast and accurate continuous normalizing flows via optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9223–9232, 2021.
  • Palomar and Verdú (2005) Daniel P. Palomar and Sergio Verdú. Gradient of mutual information in linear vector Gaussian channels. IEEE Transactions on Information Theory, 52(1):141–154, 2005.
  • Papamakarios et al. (2021) George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021.
  • Pooladian et al. (2023) Aram-Alexandre Pooladian, Heli Ben-Hamu, Carles Domingo-Enrich, Brandon Amos, Yaron Lipman, and Ricky T.Q. Chen. Multisample flow matching: Straightening flows with minibatch couplings. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 28100–28127. PMLR, 2023.
  • Rezende and Mohamed (2015) Danilo J. Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1530–1538. PMLR, 2015.
  • Rezende et al. (2014) Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286. PMLR, 2014.
  • Robbins (1956) Herbert E. Robbins. An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, volume I, page 157–163, Berkeley and Los Angeles, 1956. University of California Press.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Ruiz-Balet and Zuazua (2023) Domenec Ruiz-Balet and Enrique Zuazua. Neural ODE control for classification, approximation, and transport. SIAM Review, 65(3):735–773, 2023.
  • Salakhutdinov (2015) Ruslan Salakhutdinov. Learning deep generative models. Annual Review of Statistics and Its Application, 2(1):361–385, 2015.
  • Saumard and Wellner (2014) Adrien Saumard and Jon A. Wellner. Log-concavity and strong log-concavity: A review. Statistics Surveys, 8:45 – 114, 2014.
  • Shaul et al. (2023) Neta Shaul, Ricky T.Q. Chen, Maximilian Nickel, Matthew Le, and Yaron Lipman. On kinetic optimal probability paths for generative models. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 30883–30907. PMLR, 2023.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 2256–2265. PMLR, 2015.
  • Song et al. (2021a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a.
  • Song and Dhariwal (2023) Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.
  • Song and Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, volume 32, pages 11895–11907. Curran Associates, Inc., 2019.
  • Song and Ermon (2020) Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems, volume 33, pages 12438–12448. Curran Associates, Inc., 2020.
  • Song et al. (2021b) Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b.
  • Song et al. (2023) Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32211–32252. PMLR, 23–29 Jul 2023.
  • Su et al. (2023) Xuan Su, Jiaming Song, Chenlin Meng, and Stefano Ermon. Dual diffusion implicit bridges for image-to-image translation. In The Eleventh International Conference on Learning Representations, 2023.
  • Tabak and Turner (2013) Esteban G. Tabak and Cristina V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.
  • Tabak and Vanden-Eijnden (2010) Esteban G. Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.
  • Tong et al. (2023) Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. Conditional flow matching: Simulation-free dynamic optimal transport. arXiv preprint arXiv:2302.00482, 2023.
  • Wibisono and Jog (2018a) Andre Wibisono and Varun Jog. Convexity of mutual information along the heat flow. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1615–1619. IEEE, 2018a.
  • Wibisono and Jog (2018b) Andre Wibisono and Varun Jog. Convexity of mutual information along the Ornstein-Uhlenbeck flow. In 2018 International Symposium on Information Theory and Its Applications (ISITA), pages 55–59. IEEE, 2018b.
  • Wibisono et al. (2017) Andre Wibisono, Varun Jog, and Po-Ling Loh. Information and estimation in Fokker-Planck channels. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 2673–2677. IEEE, 2017.
  • Wu and Verdú (2011) Yihong Wu and Sergio Verdú. Functional properties of minimum mean-square error and mutual information. IEEE Transactions on Information Theory, 58(3):1289–1301, 2011.
  • Xu et al. (2022) Chen Xu, Xiuyuan Cheng, and Yao Xie. Invertible normalizing flow neural networks by JKO scheme. arXiv preprint arXiv:2212.14424, 2022.
  • Yang and Karniadakis (2020) Liu Yang and George E. Karniadakis. Potential flow generator with L2{L}_{2} optimal transport regularity for generative models. IEEE Transactions on Neural Networks and Learning Systems, 33(2):528–538, 2020.
  • Zhang et al. (2018) Linfeng Zhang, Weinan E, and Lei Wang. Monge-Ampère flow for generative modeling. arXiv preprint arXiv:1809.10188, 2018.
  • Zheng et al. (2023) Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Improved techniques for maximum likelihood estimation for diffusion ODEs. arXiv preprint arXiv:2305.03935, 2023.
  • Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.