
HJ-sampler: A Bayesian sampler for inverse problems of a stochastic process by leveraging Hamilton–Jacobi PDEs and score-based generative models

Tingwei Meng and Zongren Zou contributed equally to this work.

Tingwei Meng (Department of Mathematics, UCLA, Los Angeles, CA 90025 USA, [email protected]), Zongren Zou (Division of Applied Mathematics, Brown University, Providence, RI 02912 USA, [email protected]), Jérôme Darbon (corresponding author; Division of Applied Mathematics, Brown University, Providence, RI 02912 USA, [email protected]), George Em Karniadakis (Division of Applied Mathematics, Brown University, Providence, RI 02912 USA, and Pacific Northwest National Laboratory, Richland, WA 99354 USA, [email protected])
Abstract

The interplay between stochastic processes and optimal control has been extensively explored in the literature. With the recent surge in the use of diffusion models, stochastic processes have increasingly been applied to sample generation. This paper builds on the log transform, known as the Cole-Hopf transform in Brownian motion contexts, and extends it within a more abstract framework that includes a linear operator. Within this framework, we find that the well-known relationship between the Cole-Hopf transform and optimal transport is a particular instance where the linear operator acts as the infinitesimal generator of a stochastic process. We also introduce a novel scenario where the linear operator is the adjoint of the generator, linking to Bayesian inference under specific initial and terminal conditions. Leveraging this theoretical foundation, we develop a new algorithm, named the HJ-sampler, for Bayesian inference for the inverse problem of a stochastic differential equation with given terminal observations. The HJ-sampler involves two stages: (1) solving the viscous Hamilton-Jacobi partial differential equations, and (2) sampling from the associated stochastic optimal control problem. Our proposed algorithm naturally allows for flexibility in selecting the numerical solver for viscous HJ PDEs. We introduce two variants of the solver: the Riccati-HJ-sampler, based on the Riccati method, and the SGM-HJ-sampler, which utilizes diffusion models. We demonstrate the effectiveness and flexibility of the proposed methods by applying them to solve Bayesian inverse problems involving various stochastic processes and prior distributions, including applications addressing model misspecification and the quantification of model uncertainty.

Keywords: Bayesian inference, Hamilton–Jacobi PDEs, uncertainty quantification, inverse problems

1 Introduction

Uncertainty Quantification (UQ) plays a vital role in scientific computing, helping to quantify and manage the inherent uncertainties in complex models and simulations [82, 79]. Within UQ, two significant areas of active research are Bayesian inference and sampling from data distributions. Bayesian inference has garnered considerable interest within the scientific computing community due to its ability to rigorously combine prior information with observational data, addressing model uncertainties and enhancing predictive capabilities [96, 64, 11, 116, 114, 113]. Meanwhile, the second area—sampling from data distributions—has gained popularity in the machine learning and AI communities. This surge in interest is largely driven by the success of generative models, particularly diffusion models, which excel at generating high-quality samples from complex, high-dimensional distributions [92, 26, 108, 107, 28, 103]. Diffusion models, and more generally score-based methods, have been employed not only for data generation but also for solving a wide range of scientific computing problems, including inverse problems [91], sampling [86, 110, 108, 7], distribution modification [98], mean-field problems [66, 111], control problems [99, 100], forward and inverse partial differential equations (PDEs) [102], and Schrödinger bridge problems [19, 85, 87, 39]. This broad applicability makes diffusion models a powerful tool in both AI and scientific computing.

In this work, we establish connections between these two subfields of UQ. Specifically, we observe that the Bayesian posterior distribution for certain inverse problems involving stochastic processes can be represented as a PDE, which is further linked to a stochastic optimal control problem. Based on this connection, we design an algorithm that solves certain Bayesian sampling problems by addressing the associated stochastic optimal control problem. This algorithm bears similarities to diffusion models, particularly a class known as Score-based Generative Models (SGMs). By highlighting this connection, our work offers a novel link between Bayesian inference, generative models, and traditional stochastic optimal control, providing new directions for exploration and development.

Theoretically, this connection arises as a special case of the log transform, where the initial condition incorporates the observation, and the terminal condition reflects the prior distribution in the Bayesian inference problem. The log transform has been extensively studied in the literature [34, 32, 37, 71] and is often referred to as the Cole-Hopf transform [30, 59, 60] when applied to Brownian motion. This transform connects the nonlinear system, consisting of Hamilton-Jacobi (HJ) equations coupled with Fokker-Planck equations, to its linear counterpart, the Kolmogorov forward and backward equations. Traditionally, the Cole-Hopf transform is used to simplify the solution of nonlinear PDEs by converting them into linear PDEs, which are generally easier to solve. In probability theory, it is also employed as a tool for sampling, known as Doob’s $h$-transform. Beyond these established uses, we introduce a novel application of the log transform within Bayesian inference, a context that presents some open questions and opportunities for further research (see the discussion in the summary).

From a more detailed and practical perspective, we consider the Bayesian inference problem where the likelihood is determined by $Y_{T}|Y_{0}$, with $Y_{t}$ governed by a Stochastic Differential Equation (SDE). The goal is to solve the inverse problem of the stochastic process, specifically inferring the position of $Y_{t}$ given an observation of its terminal position $Y_{T}$. For this class of problems, we propose an algorithm called the HJ-sampler, which generates sample paths from the posterior distribution by solving the corresponding stochastic optimal control problem. The algorithm consists of two steps: first, computing the control, and second, sampling the optimal trajectories. Different versions of the algorithm arise based on the method used to compute the control. In this paper, we present two such versions: the Riccati-HJ-sampler, which applies the Riccati method, and the SGM-HJ-sampler, which leverages the SGM method. A key advantage of the HJ-sampler is its flexibility. The first step, which computes the control, is independent of the second step, which samples the trajectories. This allows for flexibility in adjusting the observation position $Y_{T}$, the time $T$, and the discretization size without requiring a recomputation of the control. Beyond the proposed algorithm itself, this connection between Bayesian inference and control theory allows us to harness techniques from both control algorithms and diffusion models for efficient Bayesian sampling. Additionally, it offers a potential Bayesian interpretation of diffusion models and opens up avenues for their generalization.

While the log transform has been employed in various sampling algorithms in the literature [5, 109, 106, 42, 21], our approach differs in its application. These existing methods, and also other sampling algorithms such as Hamiltonian Monte Carlo [74], variational inference [6, 44, 80], and Langevin-type Monte Carlo [101, 38], typically include the target distribution as part of the terminal condition, requiring an explicit analytical formula for the density function. In contrast, our approach assumes that only the prior density or prior samples are available, with the likelihood being provided by a stochastic process, which may not have an analytical form. In other words, the terminal condition in our method encodes only the prior information, while the likelihood is embedded within the stochastic process. This decoupling provides the flexibility discussed earlier and ensures that the sampling process directly corresponds to the original stochastic process, allowing us to generate entire sample paths rather than finite-dimensional sample points.

The remainder of the paper is organized as follows. In Section 2, we detail the log transform at an abstract level, followed by specific cases in Sections 2.1 and 2.2. Section 2.1 connects the log transform to stochastic optimal transport (SOT) and Schrödinger bridge problems, while Section 2.2 focuses on the special case related to Bayesian inference setups, which is the primary contribution of this paper. Based on the content in Section 2.2, an algorithm called the HJ-sampler is presented in Section 3, and numerical results are provided in Section 4. Finally, Section 5 summarizes the findings, discusses limitations, and outlines future research directions. Additional technical details are provided in the appendix. In Figure 1, we present the roadmap of this paper, highlighting the relationships between the main concepts and methodologies discussed.

Figure 1: Roadmap of this paper. The black sections represent well-known concepts in the literature, while the red sections indicate our contributions.

2 The log transform: bridging linear and non-linear systems in the context of stochastic processes

This section introduces the log transform, a mathematical technique that establishes a connection between linear and nonlinear systems. We investigate this transform in the context of a versatile linear operator, $\mathcal{A}_{\epsilon,t}$, which may represent differential, difference, integral operators, or combinations thereof. In Sections 2.1 and 2.2, examples are provided where $\mathcal{A}_{\epsilon,t}$ functions as either the infinitesimal generator of a stochastic process or its adjoint. The subscript $\epsilon$ indicates the inclusion of a positive parameter in the operator, reflecting the degree of stochasticity in these examples. The operator may also depend on the spatial variable $x$ and the time variable $t$. We put $t$ in the subscript because different equations may involve the operator at different times, while we omit the dependence on $x$ for simplicity of notation.

We introduce two functions, $\mu$ and $\nu$, that map from $\mathbb{R}^{n}\times[0,T]$ to $\mathbb{R}$ and comply with the linear system specified below:

\begin{dcases}\partial_{t}\mu=\mathcal{A}_{\epsilon,T-t}\mu,\\ \partial_{t}\nu=\mathcal{A}_{\epsilon,t}^{*}\nu,\end{dcases} \qquad (1)

where each operator is applied at the point $(x,t)$ for any $x\in\mathbb{R}^{n}$ and $t\in[0,T]$. The notation $\mathcal{A}_{\epsilon,t}^{*}$ denotes the adjoint of $\mathcal{A}_{\epsilon,t}$ in $L^{2}(\mathbb{R}^{n})$ with respect to the spatial variable $x$. The log transform establishes a nonlinear relationship between the pairs $(\mu,\nu)$ and $(\rho,S)$ as follows:

\rho(x,t)=\mu(x,T-t)\nu(x,t),\quad S(x,t)=\epsilon\log\mu(x,T-t), \qquad (2)

where the constant $\epsilon$ in the second equation is the same as the hyperparameter in the operator $\mathcal{A}_{\epsilon,t}$. The corresponding inverse transformation reads

\mu(x,t)=e^{\frac{S(x,T-t)}{\epsilon}},\quad\nu(x,t)=\rho(x,t)e^{-\frac{S(x,t)}{\epsilon}}. \qquad (3)
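As a quick numerical sanity check, the maps (2) and (3) should invert each other for any positive fields $\mu$ and $\nu$, independently of the dynamics (1). A minimal sketch in Python (the grid size and field values below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.5
nx, nt = 8, 5  # small space-time grid; time index k stands for t_k, reversal gives T - t

# Arbitrary positive fields mu(x, t) and nu(x, t) standing in for solutions of (1)
mu = rng.uniform(0.5, 2.0, size=(nx, nt))
nu = rng.uniform(0.5, 2.0, size=(nx, nt))

# Forward log transform (2): rho(x,t) = mu(x,T-t) * nu(x,t), S(x,t) = eps * log mu(x,T-t)
mu_rev = mu[:, ::-1]              # mu evaluated at time T - t
rho = mu_rev * nu
S = eps * np.log(mu_rev)

# Inverse transform (3): mu(x,t) = exp(S(x,T-t)/eps), nu(x,t) = rho(x,t) * exp(-S(x,t)/eps)
mu_back = np.exp(S[:, ::-1] / eps)
nu_back = rho * np.exp(-S / eps)

assert np.allclose(mu_back, mu) and np.allclose(nu_back, nu)
print("transform (2) and inverse (3) round-trip correctly")
```

The round-trip holds pointwise, which is why initial and terminal conditions can be translated freely between the two systems (see Remark 2.1).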

This leads to a coupled nonlinear system for $\rho$ and $S$:

\begin{dcases}\partial_{t}\rho+\rho e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}-e^{\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}^{*}(\rho e^{-\frac{S}{\epsilon}})=0,\\ \partial_{t}S+\epsilon e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}=0,\end{dcases} \qquad (4)

where each operator is applied at the point $(x,t)$ for any $x\in\mathbb{R}^{n}$ and $t\in[0,T]$. We refer to the first equation in (4) as the (generalized) Fokker-Planck equation and the second as the (generalized) HJ equation. As shown in Example 2.1 and further discussed in Appendices A and C, when $\mathcal{A}_{\epsilon,t}$ corresponds to an SDE, the equations in (4) reduce to the Fokker-Planck equation and the viscous HJ PDE. In such instances, we refer to the second equation in (4) as the viscous HJ PDE, but we refrain from using the term “viscous” for general cases due to the potential absence of a Laplacian term in the equation in such contexts.

This transformation illustrates the connection between linear and nonlinear systems. The log transform, particularly when $\mathcal{A}_{\epsilon,t}$ acts as the infinitesimal generator of a stochastic process, is recognized in the literature. For context, we briefly touch upon this aspect in Section 2.1, offering it as part of the broader narrative. Our novel contribution emerges in Section 2.2, where we establish a new link to the Bayesian framework by considering $\mathcal{A}_{\epsilon,t}$ as the adjoint of the infinitesimal generator. This insight is crucial for the algorithm we develop later in the paper. By integrating these elements into a comprehensive theoretical framework, we highlight the interconnectedness of diverse applied mathematics fields, such as stochastic processes, control theory, neural networks, and PDEs. This interdisciplinary approach fosters the potential for leveraging algorithms developed in one field to address problems in another, encouraging innovative cross-disciplinary applications.

2.1 The connection between the log transform and stochastic optimal control

Figure 2: This figure illustrates the log transform (2) applied when $\mathcal{A}_{\epsilon,t}$ acts as the infinitesimal generator of a stochastic process $X_{t}$. On the left, a general process is depicted, while on the right, the specific instance of a scaled Brownian motion (see Example 2.1) is presented. The time orientation selected here is consistent with that used in stochastic optimal control or stochastic optimal transport problems, aligning with the reversal of the viscous HJ PDE (see Remark 2.2 for details).

In this section, we choose the linear operator $\mathcal{A}_{\epsilon,t}$ in (1) and (4) to be the infinitesimal generator of a stochastic (Feller) process $X_{t}$ in $\mathbb{R}^{n}$ from $t=0$ to $t=T$. This operator incorporates $\epsilon$ to denote the level of stochasticity. Its adjoint is represented by $\mathcal{A}_{\epsilon,t}^{*}$. For further mathematical details on the concepts discussed here, we refer readers to [63, 77].

The linear system in equation (1) encompasses the Kolmogorov backward equation (KBE) for $\mu$ and the Kolmogorov forward equation (KFE) for $\nu$. The nonlinear system (4) constitutes a system for stochastic optimal control, with the source of stochasticity being intrinsically linked to the underlying process $X_{t}$. Since the connection works for a general stochastic process, the stochastic nature of the controlled system may extend beyond Brownian motion, enabling consideration of jump processes as well. When $X_{t}$ adheres to an SDE, the two systems (1) and (4) become two PDE systems (see Example 2.1 and Appendix A). Furthermore, general stochastic processes could lead to systems characterized by discretized PDEs, integral equations, or integro-differential equations. An illustrative example of the scaled Poisson process is provided in Appendix B. An illustration of a general process, as well as a specific case of a scaled Brownian motion, is presented in Figure 2.

Under this setup, the log transform elucidates the relationship between the coupled forward-backward Kolmogorov equations and controlled stochastic systems. This transformation has been extensively studied in the literature (see, for example, [34, 32, 37, 71]). It is also referred to as the Cole-Hopf transform [30] in the context of Brownian motion. The efficacy of the log transform in numerical methodologies stems from its ability to linearize the nonlinear system (4), simplifying the complexity inherent in non-linear dynamics and associated stochastic optimal control challenges.

Remark 2.1 (Initial or terminal conditions)

The equivalence that the log transform establishes between equations (1) and (4) holds regardless of the initial or terminal conditions applied, provided these conditions are consistently adapted under the transformation. This principle allows for varied applications, as demonstrated with subsequent examples linking the system to certain mean-field games (MFG), stochastic optimal control, stochastic optimal transport (SOT), and stochastic Wasserstein proximal operators, among other areas. For a general process, it is also related to the Schrödinger bridge problem. See the following example for more details.

Example 2.1 (Brownian motion)

We consider the scenario where the underlying process is a scaled Brownian motion, denoted by $X_{t}=\sqrt{\epsilon}W_{t}$, where $W_{t}$ represents a Brownian motion in $\mathbb{R}^{n}$. The infinitesimal generator $\mathcal{A}_{\epsilon,t}$ and its adjoint operator $\mathcal{A}_{\epsilon,t}^{*}$ are characterized by $\mathcal{A}_{\epsilon,t}=\mathcal{A}_{\epsilon,t}^{*}=\frac{\epsilon}{2}\Delta_{x}$, resulting in both the KBE and KFE in (1) being heat equations:

\partial_{t}\mu=\frac{\epsilon}{2}\Delta_{x}\mu,\quad\quad\partial_{t}\nu=\frac{\epsilon}{2}\Delta_{x}\nu. \qquad (5)

The nonlinear system (4), in turn, manifests as the Fokker-Planck PDE and the viscous HJ PDE (with a distinct sign from the usual case; see Remark 2.2):

\partial_{t}\rho+\nabla_{x}\cdot(\rho\nabla_{x}S)=\frac{\epsilon}{2}\Delta_{x}\rho,\quad\quad\partial_{t}S+\frac{1}{2}\|\nabla_{x}S\|^{2}+\frac{\epsilon}{2}\Delta_{x}S=0. \qquad (6)

In this scenario (and generally when the underlying process $X_{t}$ is described by an SDE, as detailed in Appendix A), the log transform, known as the Cole-Hopf transform, has been extensively examined in the literature [30, 59, 60]. For more applications of the Cole-Hopf transform, see [22, 75, 43, 14, 108].
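To make the Cole-Hopf relation concrete, the following 1D sketch takes $\mu$ to be the closed-form heat kernel (a point-source solution of (5); our choice purely for illustration), defines $S(x,t)=\epsilon\log\mu(x,T-t)$ as in (2), and verifies by finite differences that $S$ satisfies the viscous HJ PDE in (6):

```python
import numpy as np

eps, T = 0.4, 2.0

def mu(x, t):
    # 1D heat kernel: solves d_t mu = (eps/2) * Laplacian(mu) with mu(., 0) = delta_0
    return np.exp(-x**2 / (2 * eps * t)) / np.sqrt(2 * np.pi * eps * t)

def S(x, t):
    # Log transform (2): S(x, t) = eps * log mu(x, T - t)
    return eps * np.log(mu(x, T - t))

# Finite-difference residual of the viscous HJ PDE in (6):
#   d_t S + (1/2) |grad_x S|^2 + (eps/2) * Laplacian(S) = 0
h = 1e-4
for x, t in [(0.3, 0.5), (-1.0, 1.2), (0.7, 0.9)]:
    St = (S(x, t + h) - S(x, t - h)) / (2 * h)
    Sx = (S(x + h, t) - S(x - h, t)) / (2 * h)
    Sxx = (S(x + h, t) - 2 * S(x, t) + S(x - h, t)) / h**2
    residual = St + 0.5 * Sx**2 + 0.5 * eps * Sxx
    assert abs(residual) < 1e-5, residual
print("viscous HJ PDE residual ~ 0 at all test points")
```

Note the plus sign on the diffusion term here, consistent with the time orientation discussed in Remark 2.2.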

By introducing distinct sets of initial or terminal conditions, the coupled PDE system (6) can be linked to first-order optimality conditions of various problems, including specific MFG, stochastic optimal control, SOT, or stochastic Wasserstein proximal operators. For instance, if we assign the initial condition $\rho_{0}$ to $\rho$ and the terminal condition $-J$ to $S$, the coupled PDE system (6) is linked to the following MFG:

\min\left\{\int_{0}^{T}\int_{\mathbb{R}^{n}}\frac{1}{2}\|v(x,s)\|^{2}\rho(x,s)\,dx\,ds+\int_{\mathbb{R}^{n}}J(x)\rho(x,T)\,dx\colon\frac{\partial\rho}{\partial t}+\nabla_{x}\cdot(v\rho)=\frac{\epsilon}{2}\Delta_{x}\rho,\ \rho(x,0)=\rho_{0}(x)\right\}, \qquad (7)

which is also associated with the stochastic Wasserstein proximal point of $\mathcal{F}(\mu)=\int_{\mathbb{R}^{n}}J(x)\mu(x)\,dx$ at $\rho_{0}$ (see [93, 62, 40]). Specifically, if $\rho_{0}$ represents a Dirac mass centered at a point $z_{0}$, this MFG problem reduces to the following stochastic optimal control problem:

\min\left\{\mathbb{E}\left[\int_{0}^{T}\frac{1}{2}\|v_{s}\|^{2}\,ds+J(Z_{T})\right]\colon dZ_{s}=v_{s}\,ds+\sqrt{\epsilon}\,dW_{s},\ Z_{0}=z_{0}\right\}, \qquad (8)

whose value equals $-S(z_{0},0)$. Alternatively, if we specify both initial and terminal conditions for $\rho$ (denoted by $\rho_{0}$ and $\rho_{T}$), the PDE system (6) becomes associated with the following SOT problem:

\min\left\{\int_{0}^{T}\int_{\mathbb{R}^{n}}\frac{1}{2}\|v(x,s)\|^{2}\rho(x,s)\,dx\,ds\colon\frac{\partial\rho}{\partial t}+\nabla_{x}\cdot(v\rho)=\frac{\epsilon}{2}\Delta_{x}\rho,\ \rho(x,0)=\rho_{0}(x),\ \rho(x,T)=\rho_{T}(x)\right\}. \qquad (9)

Similar results hold for a general SDE. More details are provided in Appendix A.
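The Cole-Hopf transform also gives a sampling-friendly formula for the value of (8): with the terminal condition $S(x,T)=-J(x)$, the inverse transform (3) yields $\mu(x,0)=e^{-J(x)/\epsilon}$, which evolves under the heat equation, so by Feynman-Kac the value $-S(z_{0},0)$ equals $-\epsilon\log\mathbb{E}\left[\exp\left(-J(z_{0}+\sqrt{\epsilon}W_{T})/\epsilon\right)\right]$. The sketch below checks this by Monte Carlo against the known linear-quadratic value for the illustrative choice $J(x)=x^{2}/2$:

```python
import numpy as np

# Monte Carlo check of the Cole-Hopf representation of the control value in (8):
#   -S(z0, 0) = -eps * log E[exp(-J(z0 + sqrt(eps) * W_T) / eps)].
# J(x) = x^2 / 2 is our illustrative choice, for which the linear-quadratic value
# z0^2 / (2(1 + T)) + (eps/2) * log(1 + T) is available in closed form.
rng = np.random.default_rng(1)
eps, T, z0 = 0.5, 1.0, 1.3

J = lambda x: 0.5 * x**2
W_T = np.sqrt(T) * rng.standard_normal(2_000_000)   # samples of W_T ~ N(0, T)
mc_value = -eps * np.log(np.mean(np.exp(-J(z0 + np.sqrt(eps) * W_T) / eps)))

exact = z0**2 / (2 * (1 + T)) + 0.5 * eps * np.log(1 + T)
assert abs(mc_value - exact) < 1e-2
print(f"MC value {mc_value:.4f} vs closed form {exact:.4f}")
```

This replaces the minimization over controls by a single expectation over the uncontrolled process, which is precisely the linearization that makes the log transform useful numerically.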

The exploration of these connections can be broadened to encompass scenarios characterized by a wider range of stochastic behaviors. This expansion necessitates the consideration of controlled stochastic processes that incorporate stochastic elements beyond the realm of Brownian motion. Furthermore, these connections are intricately linked to the large deviation principle governing the underlying stochastic processes. While the literature has thoroughly examined cases involving Brownian motion from the perspective of the large deviation principle, as detailed in [8, 10, 34, 27, 20], more complex stochastic processes have been explored in [9, 36, 84, 37, 12, 61, 67, 25, 47]. Additional research has been directed towards elucidating the connections between these models and gradient flows within certain probabilistic spaces, as seen in [1, 31, 81, 78, 49]. However, extending the gradient flow concept to encompass jump processes demands a novel approach to defining geometry within probability spaces, diverging from traditional interpretations based on Wasserstein space, as discussed in [70, 72, 71, 13, 29]. Consequently, establishing a direct correlation between the general logarithmic transformation and gradient flow remains elusive.

Remark 2.2 (Regarding the time direction)

The viscous HJ PDE in (6) exhibits a different sign in front of the diffusion term compared to the traditional viscous HJ PDE. This discrepancy arises from the direction of time. To maintain consistency with the controlled stochastic process in MFG or SOT, the time direction is reversed. In essence, upon applying time reversal to $S$ and subsequently taking the negative sign, the function $\tilde{S}(x,t)=-S(x,T-t)$ satisfies the traditional viscous HJ PDE:

\partial_{t}\tilde{S}(x,t)+\frac{1}{2}\|\nabla_{x}\tilde{S}(x,t)\|^{2}=\frac{\epsilon}{2}\Delta_{x}\tilde{S}(x,t).

With this discrepancy in signs, the log transform becomes $\mu(x,t)=\exp\left(\frac{S(x,T-t)}{\epsilon}\right)=\exp\left(-\frac{\tilde{S}(x,t)}{\epsilon}\right)$, thereby restoring the correct sign in the traditional Cole-Hopf transform applied to $\tilde{S}$.

2.2 The connection between log transform and Bayesian inference

Figure 3: Depiction of the log transform (2) when the linear operator $\mathcal{A}_{\epsilon,T-t}$ acts as the adjoint of the infinitesimal generator for the stochastic process $Y_{t}$, illustrating its application in Bayesian inference. With specific initial and terminal conditions, the function $\mu$ evolves from the prior distribution to the data distribution, while $\rho$ evolves from a Dirac delta centered at the observation $y_{obs}$ of $Y_{T}$ to the corresponding posterior distribution. The first line shows the evolution of $\mu$ from right to left, while the second and third lines display the evolutions of $\nu$ and $\rho$ from left to right. The figures depict the graphs of the respective density functions, and the relationships among the three lines represent the first part of the log transform (2).

In this section, we delve into the scenario where the linear operator $\mathcal{A}_{\epsilon,T-t}$ is the adjoint operator to the infinitesimal generator of an underlying stochastic process $Y_{t}$, and we illustrate its relevance to Bayesian inference. To our knowledge, this connection, along with related algorithms for a general process, has yet to be documented in the existing literature. The relationship between the Cole-Hopf formula and Bayesian inference has been previously examined in [22], albeit with a focus on scaled Brownian motion as the underlying process and on computing the posterior mean rather than conducting posterior sampling.

Consider an $n$-dimensional stochastic (Feller) process $Y_{t}$, with $\mathcal{A}_{\epsilon,T-t}$ representing the adjoint of its infinitesimal generator. In contrast to the previous section, the first equation in (1) describes the KFE, while the second corresponds to the KBE. The initial condition for $\mu$ is defined as the marginal density function of $Y_{0}$, and for $\nu$, it is set as the scaled Dirac delta function $\frac{\delta_{z}(\cdot)}{P(Y_{T}=z)}$ at a fixed point $z\in\mathbb{R}^{n}$. According to the properties of the KFE, the density function $\mu(\cdot,t)$ evolves to match the marginal density of $Y_{t}$. For $\nu$, the KBE ensures that $\nu(x,t)=\mathbb{E}[\nu(Y_{T},0)|Y_{T-t}=x]=\int_{\mathbb{R}^{n}}\frac{\delta_{z}(y)}{P(Y_{T}=z)}P(Y_{T}=y|Y_{T-t}=x)\,dy=\frac{P(Y_{T}=z|Y_{T-t}=x)}{P(Y_{T}=z)}$. Through the log transform (2), the function $\rho$ is given by

\begin{split}\rho(x,0)&=\mu(x,T)\nu(x,0)=P(Y_{T}=x)\frac{\delta_{z}(x)}{P(Y_{T}=z)}=\delta_{z}(x),\\ \rho(x,t)&=\mu(x,T-t)\nu(x,t)=P(Y_{T-t}=x)\frac{P(Y_{T}=z|Y_{T-t}=x)}{P(Y_{T}=z)}=P(Y_{T-t}=x|Y_{T}=z),\end{split} \qquad (10)

indicating that $\rho(\cdot,t)$ represents the conditional density of $Y_{T-t}$ given $Y_{T}=z$. If $z$ is the observed value of $Y_{T}$, then $\rho(\cdot,t)$ provides the Bayesian posterior density for $Y_{T-t}|Y_{T}=z$.

From a Bayesian perspective, given the initial and terminal conditions, $\mu(\cdot,t)$ evolves from the prior density $P(Y_{0})$ at $t=0$ to the data density $P(Y_{T})$ at $t=T$, while $\rho(\cdot,t)$ evolves from the Dirac delta $\delta_{z}$ at $Y_{T}$ for $t=0$ to the posterior density $P(Y_{0}|Y_{T}=z)$ at $t=T$. In terms of the nonlinear system for $\rho$ and $S$, the function $\rho$ satisfies (10) when the following initial and terminal conditions are applied for $\rho$ and $S$:

\rho(x,0)=\delta_{z}(x),\quad S(x,T)=\epsilon\log P(Y_{0}=x). \qquad (11)

An illustration is provided in Figure 3.
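For intuition, (10) and (11) reduce to ordinary conjugate Bayesian updating in the simplest setting. The sketch below takes $Y_{t}=Y_{0}+\sqrt{\epsilon}W_{t}$ with a Gaussian prior on $Y_{0}$ (our illustrative choice), so that $\rho(\cdot,T)=P(Y_{0}|Y_{T}=z)$ has a conjugate-Gaussian closed form, and checks it against a brute-force grid evaluation of prior times likelihood:

```python
import numpy as np

# Y_t = Y_0 + sqrt(eps) * W_t with prior Y_0 ~ N(m0, s0^2); then Y_T | Y_0 = x is
# N(x, eps * T), and rho(., T) = P(Y_0 | Y_T = z) is Gaussian by conjugacy.
eps, T, m0, s0, z = 0.5, 1.0, 0.0, 1.0, 1.2

# Grid Bayes: posterior density proportional to prior(x) * P(Y_T = z | Y_0 = x)
x = np.linspace(-8, 8, 20001)
dx = x[1] - x[0]
prior = np.exp(-(x - m0) ** 2 / (2 * s0**2))
lik = np.exp(-(z - x) ** 2 / (2 * eps * T))
post = prior * lik
post /= post.sum() * dx
grid_mean = (x * post).sum() * dx
grid_var = ((x - grid_mean) ** 2 * post).sum() * dx

# Conjugate-Gaussian closed form
prec = 1 / s0**2 + 1 / (eps * T)
mean = (m0 / s0**2 + z / (eps * T)) / prec
assert abs(grid_mean - mean) < 1e-6 and abs(grid_var - 1 / prec) < 1e-6
print(f"posterior is N({mean:.4f}, {1 / prec:.4f}); grid Bayes agrees")
```

Of course, the point of the HJ-sampler is precisely the settings where no such closed form or tractable likelihood is available.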

Throughout this paper, we assume that the distribution is either continuous or a mixture of continuous and discrete components, allowing us to represent it either as a density function or as a finite combination of Dirac masses. The terms “distribution” and “density function” will be used interchangeably as appropriate in the context.

Remark 2.3 (Partial observation)

A significant consideration in Bayesian inference involves scenarios of partial observation, relevant in applications such as image inpainting. The computations above remain valid even with only a partial observation of $Y_{T}$. For example, if $Y_{t}$ is a concatenation of $Y_{t,1}\in\mathbb{R}^{m}$ and $Y_{t,2}\in\mathbb{R}^{n-m}$, and only an observation $y_{obs}\in\mathbb{R}^{m}$ for $Y_{T,1}$ is available, the analysis adapts accordingly. In this remark, for any generic variable $x$, the notation $x_{1}$ with a subscript ‘1’ refers to the vector comprising the first $m$ elements of $x$, while $x_{2}$, denoted with a subscript ‘2’, encompasses the remaining $n-m$ elements. The formulation of the function $\mu$ remains as previously described, but the initial condition for $\nu$ is modified to $\nu(x,0)=\frac{\delta_{y_{obs}}(x_{1})}{P(Y_{T,1}=y_{obs})}$ for any vector $x=(x_{1},x_{2})\in\mathbb{R}^{n}$. Following a computation akin to the earlier one, we derive that $\nu(x,t)=\mathbb{E}[\nu(Y_{T},0)|Y_{T-t}=x]=\int\frac{\delta_{y_{obs}}(y_{1})}{P(Y_{T,1}=y_{obs})}P(Y_{T}=y|Y_{T-t}=x)\,dy=\frac{P(Y_{T,1}=y_{obs}|Y_{T-t}=x)}{P(Y_{T,1}=y_{obs})}$. Consequently, the function $\rho$ satisfies

\begin{split}\rho(x,0)&=\mu(x,T)\nu(x,0)=P(Y_{T}=x)\frac{\delta_{y_{obs}}(x_{1})}{P(Y_{T,1}=y_{obs})}=\delta_{y_{obs}}(x_{1})P(Y_{T,2}=x_{2}|Y_{T,1}=y_{obs}),\\ \rho(x,t)&=\mu(x,T-t)\nu(x,t)=P(Y_{T-t}=x)\frac{P(Y_{T,1}=y_{obs}|Y_{T-t}=x)}{P(Y_{T,1}=y_{obs})}=P(Y_{T-t}=x|Y_{T,1}=y_{obs}),\end{split} \qquad (12)

for any $t\in(0,T)$ and $x=(x_{1},x_{2})\in\mathbb{R}^{n}$. While the computation is feasible in this scenario, the primary challenge in implementing the proposed algorithm in Section 3 lies in sampling from $\rho(\cdot,0)$, that is, determining how to draw samples from the conditional distribution $P(Y_{T,2}=x_{2}|Y_{T,1}=y_{obs})$. Advancing the proposed algorithm to address this issue necessitates additional research.

This section, along with Section 2.1, presents two distinct examples of the log transform using different instances of the operator $\mathcal{A}_{\epsilon,t}$. In Section 2.1, we investigated the log transform’s role when $\mathcal{A}_{\epsilon,t}$ acts as the infinitesimal generator of a stochastic process $X_{t}$, linking it to certain MFG, SOT, stochastic optimal control, and the stochastic Wasserstein proximal operator, fields that are at the forefront of current research. In contrast, this section explores the application of the log transform when $\mathcal{A}_{\epsilon,T-t}$ is the adjoint operator of the infinitesimal generator of $Y_{t}$, and its relation to Bayesian inference. In scenarios where the infinitesimal generator is self-adjoint (such as in a Brownian motion process), these two situations are the same. When considering SDEs, these two cases are intuitively related in a reversed manner, highlighting a compelling research path into this duality and its potential to connect Bayesian inference with MFGs and related areas.

Until now, our discussion has focused on theoretical connections. In the next section, we will utilize these insights to develop a Bayesian sampling algorithm, named the HJ-sampler, which is designed to solve the inverse problem related to the process $Y_{t}$ within a Bayesian framework.

3 A Bayesian sampling method for inverse problems: HJ-sampler

As explored in Section 2.2, the log transform establishes a connection to the Bayesian framework under specific initial and terminal conditions. This section aims to harness this connection by introducing an algorithm, the HJ-sampler, designed for a particular subset of inverse problems.

Initially, we outline the category of problems addressed. Considering an underlying stochastic process $Y_{t}$, with a prior distribution $P_{prior}$ on $Y_{0}$, our objective is to infer the solution $Y_{t}$ for $t\in[0,T)$ based on the terminal observation $Y_{T}=y_{obs}$. Essentially, we aim to tackle the inverse problem associated with $Y_{t}$ within the Bayesian framework. Setting the marginal distribution of $Y_{0}$ as the prior distribution $P_{prior}$, our task becomes to obtain samples from the posterior distribution $P(Y_{t}|Y_{T}=y_{obs})$ for $t\in[0,T)$. Starting from the observation $y_{obs}$, the HJ-sampler produces a sequence of posterior samples for $Y_{t}$, moving backwards in time from $t=T$ to $t=0$, through the resolution of the associated stochastic optimal control problem.

Subsequently, the discussion will progress in two phases. Firstly, in Section 3.1, we introduce the HJ-sampler for general stochastic processes, entailing two primary steps: solving for $S$ and thereafter sampling from $\rho$ as defined in equation (4). Following this, in Section 3.2, attention shifts to SDE scenarios, delving into the numerical details of each step.
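Before the general description, the two stages can be previewed on a case where everything is explicit. For $Y_{t}=Y_{0}+\sqrt{\epsilon}W_{t}$ with a Gaussian prior $Y_{0}\sim N(m_{0},s_{0}^{2})$ (our illustrative choice), the marginal $\mu(\cdot,t)$ is $N(m_{0},s_{0}^{2}+\epsilon t)$, so the control $\nabla_{x}S(x,t)=\epsilon\nabla_{x}\log\mu(x,T-t)$ is available in closed form; in this Brownian-motion case the first equation in (6) says $\rho$ is the law of the controlled SDE $dZ_{t}=\nabla_{x}S(Z_{t},t)\,dt+\sqrt{\epsilon}\,dW_{t}$ started at $Z_{0}=y_{obs}$. A sketch (in general, stage one would instead use a Riccati solver or a trained score model):

```python
import numpy as np

rng = np.random.default_rng(2)
eps, T, m0, s0, y_obs = 0.5, 1.0, 0.0, 1.0, 1.2

# Stage 1: the control, explicit here because mu(., t) = N(m0, s0^2 + eps*t):
#   grad S(x, t) = eps * grad log mu(x, T - t) = -eps * (x - m0) / (s0^2 + eps*(T - t))
grad_S = lambda x, t: -eps * (x - m0) / (s0**2 + eps * (T - t))

# Stage 2: Euler-Maruyama on dZ = grad_S(Z, t) dt + sqrt(eps) dW, Z_0 = y_obs
n_paths, n_steps = 100_000, 500
dt = T / n_steps
Z = np.full(n_paths, y_obs)
for k in range(n_steps):
    Z = Z + grad_S(Z, k * dt) * dt + np.sqrt(eps * dt) * rng.standard_normal(n_paths)

# At t = T, Z should be distributed as P(Y_0 | Y_T = y_obs), the conjugate posterior.
prec = 1 / s0**2 + 1 / (eps * T)
post_mean, post_var = (m0 / s0**2 + y_obs / (eps * T)) / prec, 1 / prec
assert abs(Z.mean() - post_mean) < 0.02 and abs(Z.var() - post_var) < 0.02
print(f"samples: mean {Z.mean():.3f} (exact {post_mean:.3f}), "
      f"var {Z.var():.3f} (exact {post_var:.3f})")
```

The intermediate iterates $Z$ at time $t$ are (approximate) samples of $Y_{T-t}\mid Y_{T}=y_{obs}$, so a single run yields the whole posterior path, not just the endpoint.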

3.1 HJ-sampler for general stochastic processes

Figure 4: The figure illustrates the SGM-HJ-sampler algorithm, consisting of two steps. The first step, shown in the top panel, corresponds to the training phase, where training data is generated by sampling YtY_{t} from μt\mu_{t}, and a neural network is trained to approximate the scaled control or score function. The heatmap in the middle represents the evolution of the density function μt\mu_{t} from right to left, with time on the horizontal axis and space on the vertical axis. The black curves display the sample paths, demonstrating the training data. The second step, in the bottom panel, represents the inference phase, where posterior samples of YtYT=yobsY_{t}\mid Y_{T}=y_{obs} (with density ρτ\rho_{\tau}) are generated by sampling the controlled paths ZτZ_{\tau}. The heatmap shows the evolution of ρτ\rho_{\tau} from left to right, with the black curves representing the generated sample paths, and the white curve depicting the sample mean of the posterior distribution. The graphs of the initial and terminal densities are displayed on the sides of each panel.

In this section, we present the HJ-sampler, an algorithm designed for a broad class of stochastic processes YtY_{t}, where the underlying stochasticity is not limited to traditional Brownian motion dynamics. As delineated in Section 2.2, by setting the terminal condition for SS to ϵlogPprior\epsilon\log P_{prior} and the initial condition for ρ\rho to δyobs\delta_{y_{obs}}, we derive that ρ(,t)\rho(\cdot,t) represents the conditional distribution of YTtY_{T-t} given YT=yobsY_{T}=y_{obs}. This enables the acquisition of posterior samples for YTtY_{T-t} given YT=yobsY_{T}=y_{obs}, utilizing ρ(,t)\rho(\cdot,t). By sampling from ρ(,t)\rho(\cdot,t) over the interval from t=0t=0 to t=Tt=T, we construct a continuous flow of posterior samples that traces the evolution from YTY_{T} back to Y0Y_{0}, conditional on YT=yobsY_{T}=y_{obs}.

Building upon this foundational understanding, we introduce an algorithm named the HJ-sampler, which comprises two primary stages:

  1.

    Initially, we solve the HJ equation as specified in the second line of (4), which is the following HJ equation with terminal condition:

    {tS+ϵeSϵ𝒜ϵ,teSϵ=0,S(x,T)=ϵlogPprior(x).\begin{dcases}\partial_{t}S+\epsilon e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}=0,\\ S(x,T)=\epsilon\log P_{prior}(x).\end{dcases} (13)
  2.

    Subsequently, we obtain samples from ρ\rho as delineated in the first line of (4), which is expressed as:

    {tρ+ρeSϵ𝒜ϵ,teSϵeSϵ𝒜ϵ,t(ρeSϵ)=0,ρ(x,0)=δyobs(x).\begin{dcases}\partial_{t}\rho+\rho e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}-e^{\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}^{*}(\rho e^{-\frac{S}{\epsilon}})=0,\\ \rho(x,0)=\delta_{y_{obs}}(x).\end{dcases} (14)
Remark 3.1 (Flexibility of the observation time TT)

The HJ-sampler is designed to sample the posterior distribution of YtY_{t} given YTY_{T} for t[0,T)t\in[0,T). However, it can be generalized to sample the distribution of YtY_{t} given YsY_{s} as long as 0t<sT0\leq t<s\leq T. This flexibility means that the observation time does not need to be fixed at the start. If the observation time ss is initially unknown, the algorithm can be modified by first solving the HJ equation for a sufficiently large TT that is guaranteed to exceed ss. In the second step, instead of using the original SS, the time-shifted version (x,t)S(x,t+Ts)(x,t)\mapsto S(x,t+T-s) is applied to the PDE governing ρ\rho. Sampling from ρ\rho then provides posterior samples for YtYs=yobsY_{t}\mid Y_{s}=y_{obs}. Consequently, if the underlying process and prior distribution remain unchanged, adjusting the observation time ss and location yobsy_{obs} does not require re-solving the HJ equation in the first step.
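The time shift described in this remark amounts to a simple reindexing of the precomputed control. Below is a minimal Python sketch, where grad_S is a hypothetical callable returning xS(x,t)\nabla_{x}S(x,t) for t[0,T]t\in[0,T]:

```python
def grad_S_shifted(grad_S, T, s):
    """Remark 3.1: reuse S solved on [0, T] to sample Y_t | Y_s = y_obs
    for 0 <= t < s <= T, via the time shift (x, t) -> S(x, t + T - s).
    grad_S(x, t) is assumed to return grad_x S(x, t) on [0, T]."""
    return lambda x, tau: grad_S(x, tau + T - s)
```

The shifted control is then used in the second step exactly as before, with the controlled process run on τ[0,s]\tau\in[0,s]; no re-solve of the HJ equation is needed when ss or yobsy_{obs} changes.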

The HJ-sampler operates through a sequential two-step process. In the first step, prior information is used to set up the terminal condition for SS, after which the function SS is solved to compute the control for the second step. Since this step does not depend on observational data yobsy_{obs}, it can be precomputed offline using traditional numerical solvers [76, 48, 45, 2, 104, 68, 24, 51, 50, 15] or scientific machine learning techniques [23, 73, 41, 3, 97, 17, 16, 113]. The second step, dependent on the outcomes of the first, uses the observational data to set the initial condition for ρ\rho. This structured separation between prior information and observational data allows for flexibility and precomputation, with the precomputed control applied to generate Bayesian samples once the data is available. In Section 3.2.2, the application of SGMs for SDEs further illustrates this process, with the first step corresponding to the training phase and the second step to the inference stage.

The main challenge in implementing the HJ-sampler lies in efficiently sampling from ρ(,t)\rho(\cdot,t), which evolves according to the dynamics in (14). These dynamics relate to stochastic optimal control problems with a terminal cost of ϵlogPprior-\epsilon\log P_{prior}. Optimal control strategies, denoted by uu^{*}, are derived by applying operators to SS, allowing posterior samples to be generated by sampling optimally controlled stochastic trajectories. The type of stochasticity in YtY_{t} directly influences the nature of the control problem; for example, if YtY_{t} follows jump process dynamics, the control problem will involve a controlled jump process rather than a traditional controlled SDE. In the remainder of this paper, we focus specifically on cases where the underlying process is governed by an SDE, reducing the HJ equation (13) to a traditional viscous HJ PDE, corresponding to a stochastic optimal control problem. Section 3.2 introduces the details of the HJ-sampler algorithm and its variants for this case, while Section 4 presents the numerical results. Extensions of the HJ-sampler to more general processes are left for future exploration.

3.2 HJ-sampler for SDEs

In this section, and throughout the remainder of this paper, we concentrate on stochastic processes described by an nn-dimensional SDE dYt=b(Yt,t)dt+ϵdWtdY_{t}=b(Y_{t},t)dt+\sqrt{\epsilon}dW_{t}, where b:n×[0,T]nb:\mathbb{R}^{n}\times[0,T]\rightarrow\mathbb{R}^{n} is the drift, and WtW_{t} is a Brownian motion in n\mathbb{R}^{n}. Our results also apply to more general SDEs; the details are provided in Appendix C. Here, the operator 𝒜ϵ,Tt\mathcal{A}_{\epsilon,T-t} is defined as 𝒜ϵ,Ttf=x(b(x,t)f)+ϵ2Δxf\mathcal{A}_{\epsilon,T-t}f=-\nabla_{x}\cdot(b(x,t)f)+\frac{\epsilon}{2}\Delta_{x}f, and the HJ equation (13) becomes:

{tSb(x,Tt)xS(x,t)+12xS2+ϵ2ΔxSϵxb(x,Tt)=0,S(x,T)=ϵlogPprior(x).\begin{dcases}\partial_{t}S-b(x,T-t)\cdot\nabla_{x}S(x,t)+\frac{1}{2}\|\nabla_{x}S\|^{2}+\frac{\epsilon}{2}\Delta_{x}S-\epsilon\nabla_{x}\cdot b(x,T-t)=0,\\ S(x,T)=\epsilon\log P_{prior}(x).\end{dcases} (15)

The corresponding Fokker-Planck equation (14) evolves into:

{tρ+x(ρ(xSb(x,Tt)))ϵ2Δxρ=0,ρ(x,0)=δyobs(x),\begin{dcases}\partial_{t}\rho+\nabla_{x}\cdot(\rho(\nabla_{x}S-b(x,T-t)))-\frac{\epsilon}{2}\Delta_{x}\rho=0,\\ \rho(x,0)=\delta_{y_{obs}}(x),\end{dcases} (16)

where the solution ρ(,t)\rho(\cdot,t) is the same as the marginal distribution of an underlying SDE. Consequently, we obtain posterior samples by sampling from this underlying SDE.

In this scenario, our HJ-sampler algorithm comprises two primary steps:

  1.

    Numerically solve the HJ PDE (15) to determine SS;

  2.

    Generate samples from ρ\rho that satisfies (16) by simulating the controlled SDE:

    dZτ=(xS(Zτ,τ)b(Zτ,Tτ))dτ+ϵdWτ,Z0=yobs.dZ_{\tau}=(\nabla_{x}S(Z_{\tau},\tau)-b(Z_{\tau},T-\tau))d\tau+\sqrt{\epsilon}dW_{\tau},\quad Z_{0}=y_{obs}. (17)

    For practical implementation, the Euler–Maruyama method [56] discretizes the SDE as follows:

    Zk+1=Zk+(xS(Zk,τk)b(Zk,Tτk))Δτ+ϵΔτξk,Z0=yobs,Z_{k+1}=Z_{k}+(\nabla_{x}S(Z_{k},\tau_{k})-b(Z_{k},T-\tau_{k}))\Delta\tau+\sqrt{\epsilon\Delta\tau}\xi_{k},\quad Z_{0}=y_{obs}, (18)

    where ξ0,,ξNt1\xi_{0},\dots,\xi_{N_{t}-1} are independent and identically distributed (i.i.d.) samples from the nn-dimensional standard normal distribution.

Thus, the samples obtained from the discretized SDE ZkZ_{k} serve as approximations of the posterior samples for YTkΔτY_{T-k\Delta\tau} given YT=yobsY_{T}=y_{obs}. An informal error estimation for the HJ-sampler is provided in Appendix C.3.
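The second step above can be sketched in a few lines. The following is a minimal Python sketch, assuming grad_S (the control from step 1) and b (the drift of the original SDE) are given as callables; the test instantiates grad_S with the analytic control for 1D scaled Brownian motion with a standard Gaussian prior (ϵ=1\epsilon=1, T=1T=1), in the spirit of the analytic-HJ-sampler used for validation in Section 4, for which the posterior P(Y0Y1=yobs)P(Y_{0}\mid Y_{1}=y_{obs}) is 𝒩(yobs/2,1/2)\mathcal{N}(y_{obs}/2,1/2).

```python
import numpy as np

def hj_sampler_step2(grad_S, b, y_obs, T, eps, N_z, n_samples, rng=None):
    """Euler-Maruyama discretization (18) of the controlled SDE (17).

    grad_S(x, tau): control from step 1 (assumed callable).
    b(x, t): drift of the original SDE for Y_t.
    Returns Z of shape (N_z + 1, n_samples, dim), where Z[k] approximates
    posterior samples of Y_{T - k*dtau} given Y_T = y_obs.
    """
    rng = np.random.default_rng() if rng is None else rng
    y_obs = np.atleast_1d(np.asarray(y_obs, dtype=float))
    dim = y_obs.shape[0]
    dtau = T / N_z
    Z = np.empty((N_z + 1, n_samples, dim))
    Z[0] = y_obs  # every controlled path starts at the observation
    for k in range(N_z):
        tau_k = k * dtau
        drift = grad_S(Z[k], tau_k) - b(Z[k], T - tau_k)
        xi = rng.standard_normal((n_samples, dim))
        Z[k + 1] = Z[k] + drift * dtau + np.sqrt(eps * dtau) * xi
    return Z
```

With Δτ=0.01\Delta\tau=0.01 and 2×1042\times 10^{4} paths, the empirical mean and variance of ZZ at τ=T\tau=T closely match yobs/2y_{obs}/2 and 1/21/2 in this Brownian-motion example.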

In the HJ-sampler algorithm, the choice of the numerical solver in each step can be adapted based on the practical requirements of the application. For instance, in the second step, the Euler–Maruyama method can be replaced with higher-order schemes to achieve higher accuracy. However, solving the viscous HJ PDE (15) in the first step is more challenging, particularly in high-dimensional settings, making the selection of numerical solvers for this step critical. In lower-dimensional cases, grid-based numerical solvers such as Essentially Non-Oscillatory (ENO) schemes [76], Weighted ENO schemes [48], or the Discontinuous Galerkin method [45] can be employed.

In addition to these grid-based methods, this paper introduces two grid-free numerical solvers that have the potential for application to high-dimensional problems:

  • If the Hamiltonian is quadratic and the prior distribution is a Gaussian mixture, the HJ PDEs can be solved using Riccati ODEs (see Section 3.2.1).

  • For more complex cases, we propose neural network-based methods, such as SGMs, as a solution (see Section 3.2.2).

Since the sampling process only requires xS\nabla_{x}S, our focus is on determining xS\nabla_{x}S rather than solving for SS itself.

3.2.1 Riccati-HJ-sampler

In scenarios where the prior distribution is Gaussian and the drift term b(Yt,t)b(Y_{t},t) of the SDE is linear in YtY_{t}, expressed as A(t)Yt+β(t)A(t)Y_{t}+\beta(t), where A:[0,T]n×nA\colon[0,T]\to\mathbb{R}^{n\times n} (throughout this paper, m×n\mathbb{R}^{m\times n} denotes the space of matrices with mm rows and nn columns) and β:[0,T]n\beta\colon[0,T]\to\mathbb{R}^{n} are continuous functions, both the Hamiltonian and the terminal condition in (15) adopt quadratic forms. Consequently, the HJ PDE (15) can be efficiently solved using Riccati ODEs [105, 35]. We apply a numerical ODE solver to the corresponding Riccati ODEs and call this version of the HJ-sampler the Riccati-HJ-sampler.

Specifically, if we assume the prior density PpriorP_{prior} follows Pprior(θ)=1(2π)ndet(Σ)exp(12(θθP)TΣ1(θθP))P_{prior}(\theta)=\frac{1}{\sqrt{(2\pi)^{n}\det(\Sigma)}}\exp\left(-\frac{1}{2}(\theta-\theta^{P})^{T}\Sigma^{-1}(\theta-\theta^{P})\right), with θPn\theta^{P}\in\mathbb{R}^{n} and ΣS++n\Sigma\in S_{++}^{n} (in this paper, S++nS_{++}^{n} denotes the set containing all n×nn\times n symmetric positive definite matrices) representing the mean and covariance matrix, respectively, the solution to the viscous HJ PDE (15) is expressed as:

S(x,t)=12(xq(Tt))TQ(Tt)1(xq(Tt))r(Tt),S(x,t)=-\frac{1}{2}(x-q(T-t))^{T}Q(T-t)^{-1}(x-q(T-t))-r(T-t),

where Q:[0,T]S++nQ\colon[0,T]\to S_{++}^{n}, q:[0,T]nq\colon[0,T]\to\mathbb{R}^{n}, and r:[0,T]r\colon[0,T]\to\mathbb{R} satisfy the Riccati ODE system outlined below:

{Q˙(t)=I+Q(t)A(t)T+A(t)Q(t),q˙(t)=A(t)q(t)+β(t),r˙(t)=ϵ2Tr(2A(t)+Q(t)1),{Q(0)=1ϵΣ,q(0)=θP,r(0)=nϵ2log(2π)+ϵ2logdet(Σ).\begin{dcases}\dot{Q}(t)=I+Q(t)A(t)^{T}+A(t)Q(t),\\ \dot{q}(t)=A(t)q(t)+\beta(t),\\ \dot{r}(t)=\frac{\epsilon}{2}\operatorname{Tr}(2A(t)+Q(t)^{-1}),\end{dcases}\quad\quad\quad\quad\begin{dcases}Q(0)=\frac{1}{\epsilon}\Sigma,\\ q(0)=\theta^{P},\\ r(0)=\frac{n\epsilon}{2}\log(2\pi)+\frac{\epsilon}{2}\log\det(\Sigma).\end{dcases} (19)

To facilitate sampling, it is sufficient to compute QQ and qq, from which xS(x,t)\nabla_{x}S(x,t) can be derived as follows:

xS(x,t)=Q(Tt)1(xq(Tt)).\nabla_{x}S(x,t)=-Q(T-t)^{-1}(x-q(T-t)). (20)
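For illustration, the Riccati system (19) for QQ and qq (rr is not needed for the control (20)) can be integrated with a simple forward-Euler scheme, as in the Python sketch below; a higher-order ODE solver could be substituted without changing the structure.

```python
import numpy as np

def solve_riccati(A, beta, Sigma, theta_P, eps, T, n_steps=1000):
    """Forward-Euler integration of the Riccati system (19) for Q and q.

    A(t) and beta(t) are callables; r is omitted since the control (20)
    does not require it.  Returns the time grid and Q, q on that grid.
    """
    n = Sigma.shape[0]
    dt = T / n_steps
    ts = np.linspace(0.0, T, n_steps + 1)
    Qs = np.empty((n_steps + 1, n, n))
    qs = np.empty((n_steps + 1, n))
    Qs[0] = Sigma / eps          # Q(0) = Sigma / eps
    qs[0] = theta_P              # q(0) = theta^P
    I = np.eye(n)
    for i in range(n_steps):
        At = A(ts[i])
        Qs[i + 1] = Qs[i] + dt * (I + Qs[i] @ At.T + At @ Qs[i])
        qs[i + 1] = qs[i] + dt * (At @ qs[i] + beta(ts[i]))
    return ts, Qs, qs

def riccati_grad_S(x, t, ts, Qs, qs, T):
    """Control (20): grad_x S(x, t) = -Q(T - t)^{-1} (x - q(T - t))."""
    i = int(round((T - t) / (ts[1] - ts[0])))  # nearest grid index for T - t
    return -np.linalg.solve(Qs[i], x - qs[i])
```

As a sanity check, for scaled Brownian motion (A=0A=0, β=0\beta=0) with a standard Gaussian prior and ϵ=1\epsilon=1, this recovers xS(x,t)=x/(1+Tt)\nabla_{x}S(x,t)=-x/(1+T-t).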

For Gaussian mixture priors, assume the prior distribution is a convex combination of Gaussian distributions:

Pprior=j=1MwjPpriorj, where Ppriorj(x)=1(2π)ndet(Σj)exp(12(xθjP)TΣj1(xθjP)).P_{prior}=\sum_{j=1}^{M}w_{j}P_{prior}^{j},\text{ where }P_{prior}^{j}(x)=\frac{1}{\sqrt{(2\pi)^{n}\det(\Sigma_{j})}}\exp\left(-\frac{1}{2}(x-\theta^{P}_{j})^{T}\Sigma_{j}^{-1}(x-\theta^{P}_{j})\right). (21)

Here, wjw_{j} serves as the mixture coefficient, determining the relative contribution of each Gaussian component in the mixture and satisfying j=1Mwj=1\sum_{j=1}^{M}w_{j}=1. Additionally, θjPn\theta^{P}_{j}\in\mathbb{R}^{n} and ΣjS++n\Sigma_{j}\in S_{++}^{n} are the mean vectors and covariance matrices of PpriorjP_{prior}^{j}, respectively. Utilizing the log transform, solving the viscous HJ PDE (15) becomes equivalent to solving the KFE with the initial condition set as PpriorP_{prior}, which results in a linear equation. Consequently, the solution μ\mu is expressed as μ=j=1Mwjμj\mu=\sum_{j=1}^{M}w_{j}\mu_{j}, where μj\mu_{j} satisfies the KFE for the initial condition PpriorjP_{prior}^{j}. Therefore, the solution to the viscous HJ PDE (15) is formulated as:

S(x,t)=ϵlogμ(x,Tt)=ϵlog(j=1Mwjμj(x,Tt))=ϵlog(j=1Mwjexp(Sj(x,t)ϵ))=ϵlog(j=1Mwjexp(12ϵ(xqj(Tt))TQj(Tt)1(xqj(Tt))rj(Tt)ϵ)),\begin{split}S(x,t)&=\epsilon\log\mu(x,T-t)=\epsilon\log\left(\sum_{j=1}^{M}w_{j}\mu_{j}(x,T-t)\right)=\epsilon\log\left(\sum_{j=1}^{M}w_{j}\exp\left(\frac{S_{j}(x,t)}{\epsilon}\right)\right)\\ &=\epsilon\log\left(\sum_{j=1}^{M}w_{j}\exp\left(-\frac{1}{2\epsilon}(x-q_{j}(T-t))^{T}Q_{j}(T-t)^{-1}(x-q_{j}(T-t))-\frac{r_{j}(T-t)}{\epsilon}\right)\right),\end{split} (22)

where SjS_{j} solves the viscous HJ PDE (15) with the terminal condition ϵlogμj(,0)\epsilon\log\mu_{j}(\cdot,0) and can be computed using QjQ_{j}, qjq_{j}, and rjr_{j}, which solve the Riccati ODEs (19) for the jj-th Gaussian component PpriorjP_{prior}^{j}. The optimal control is then given by:

xS(x,t)=j=1MwjQj(Tt)1(xqj(Tt))exp(12ϵ(xqj(Tt))TQj(Tt)1(xqj(Tt))1ϵrj(Tt))j=1Mwjexp(12ϵ(xqj(Tt))TQj(Tt)1(xqj(Tt))1ϵrj(Tt)).\begin{split}\nabla_{x}S(x,t)=-\frac{\sum_{j=1}^{M}w_{j}Q_{j}(T-t)^{-1}(x-q_{j}(T-t))\exp\left(-\frac{1}{2\epsilon}(x-q_{j}(T-t))^{T}Q_{j}(T-t)^{-1}(x-q_{j}(T-t))-\frac{1}{\epsilon}r_{j}(T-t)\right)}{\sum_{j=1}^{M}w_{j}\exp\left(-\frac{1}{2\epsilon}(x-q_{j}(T-t))^{T}Q_{j}(T-t)^{-1}(x-q_{j}(T-t))-\frac{1}{\epsilon}r_{j}(T-t)\right)}.\end{split} (23)

In summary, to handle a Gaussian mixture prior PpriorP_{prior}, the Riccati-HJ-sampler first solves for QjQ_{j}, qjq_{j}, and rjr_{j} using the Riccati ODEs (19) for each jj-th Gaussian component PpriorjP_{prior}^{j}. The control xS\nabla_{x}S is then given by (23).
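As a sketch, the mixture control (23) can be evaluated in a numerically stable way by working with the log-weights of the components (a softmax), assuming Qj(Tt)Q_{j}(T-t), qj(Tt)q_{j}(T-t), and rj(Tt)r_{j}(T-t) have already been obtained from the per-component Riccati ODEs (19):

```python
import numpy as np

def mixture_grad_S(x, w, Qs, qs, rs, eps):
    """Control (23) for a Gaussian-mixture prior, evaluated at one point x.

    w: (M,) mixture weights; Qs: (M, n, n); qs: (M, n); rs: (M,) are
    Q_j(T-t), q_j(T-t), r_j(T-t), already evaluated at time T - t.
    The exponentials in (23) are combined via a shifted softmax to avoid
    overflow/underflow for small eps.
    """
    M = len(w)
    grads = np.empty((M, x.shape[0]))  # rows: Q_j^{-1} (x - q_j)
    log_w = np.empty(M)
    for j in range(M):
        d = np.linalg.solve(Qs[j], x - qs[j])
        grads[j] = d
        log_w[j] = np.log(w[j]) - 0.5 * (x - qs[j]) @ d / eps - rs[j] / eps
    log_w -= log_w.max()                   # log-sum-exp shift
    lam = np.exp(log_w); lam /= lam.sum()  # softmax weights over components
    return -lam @ grads
```

For M=1M=1 this reduces to the single-Gaussian control (20), as expected.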

3.2.2 SGM-HJ-sampler

In many complex scenarios, traditional numerical solvers struggle to efficiently and flexibly solve the viscous HJ PDE (15), especially in higher dimensions. To overcome these limitations, neural network-based methods can be employed. These approaches allow for approximating either the entire solution SS or its gradient xS\nabla_{x}S, which is essential for determining the control dynamics in the second step of the HJ-sampler. In this section, we propose an algorithm called the SGM-HJ-sampler, which integrates a diffusion model known as SGM [92, 26, 108, 107, 28] within the HJ-sampler framework. In the literature, SGMs are used to approximate the score function logP(Yt=x)\nabla\log P(Y_{t}=x) using a neural network trained on samples of YtY_{t}. Within our framework, this function corresponds to 1ϵxS(x,Tt)\frac{1}{\epsilon}\nabla_{x}S(x,T-t) according to the log transform. The neural network is then utilized as a scaled control to generate posterior samples. For more insights into diffusion models and their recent applications, we refer to the survey in [103]. Additionally, other neural network approaches for approximating SS or xS\nabla_{x}S can also be explored in future research.

The SGM approach is structured into two phases: training and inference. During the training phase, the model learns the score function by minimizing a score-matching loss function using data sampled from a stochastic process. In the inference phase, this learned score function is used to reverse the process, starting from an initial “noise” state to generate samples that reflect the data distribution. These two phases closely correspond to the two steps of the HJ-sampler. The first step of the HJ-sampler is similar to the training stage of the SGM, where the goal is to train a neural network to approximate the score function logP(Yt=x)=1ϵxS(x,Tt)\nabla\log P(Y_{t}=x)=\frac{1}{\epsilon}\nabla_{x}S(x,T-t), using data derived from the process YtY_{t}. Similarly, the second step of the SGM-HJ-sampler aligns with the inference phase of the SGM, where samples are generated from a controlled stochastic process utilizing the learned score function. This alignment makes the SGM method well-suited for integration into the HJ-sampler framework.

While our method shares foundational similarities with SGMs, it introduces distinct variations that set it apart. Standard SGMs and other diffusion models typically involve a forward process that transitions from data to noise, followed by a reverse process that reconstructs the data from the noise. In contrast, our model follows a different structure. We utilize two distinct processes: the first moves from the prior distribution P(Y0)=PpriorP(Y_{0})=P_{prior} to the data distribution P(YT)P(Y_{T}), while the second transitions from an observed datum Z0=yobsZ_{0}=y_{obs} to the posterior distribution P(Y0|YT=yobs)P(Y_{0}|Y_{T}=y_{obs}). As a result, unlike conventional diffusion models that use data points for training and prior samples for inference, our model uses prior samples for training and observation points for inference. These similarities and differences may provide a new perspective on diffusion models, enriching both the theoretical and practical understanding of these methods.

The SGM-HJ-sampler utilizing the SGM approach comprises two distinct steps:

  1.

    Training stage: Data is generated by sampling YtY_{t} using the Euler–Maruyama method as follows:

    Yk+1,j=Yk,j+b(Yk,j,tk)Δt+ϵΔtξk,j,Y_{k+1,j}=Y_{k,j}+b(Y_{k,j},t_{k})\Delta t+\sqrt{\epsilon\Delta t}\xi_{k,j}, (24)

where ξk,j\xi_{k,j} are i.i.d. standard Gaussian, and Y0,jY_{0,j} are prior samples, for k=0,,Nt1k=0,\dots,N_{t}-1 and j=1,,Nyj=1,\dots,N_{y}. Here, Yk,jY_{k,j} denotes the jj-th sample for YtkY_{t_{k}}. We use NtN_{t} to denote the number of time discretizations and NyN_{y} to denote the number of sample paths in training. A neural network sWs_{W} (where WW denotes the trainable parameters) is trained to fit logP(Yt)\nabla\log P(Y_{t}) by minimizing the implicit score-matching loss:

    k=1Ntj=1Nyλk(12sW(Yk,j,tk)2+xsW(Yk,j,tk)),\sum_{k=1}^{N_{t}}\sum_{j=1}^{N_{y}}\lambda_{k}\left(\frac{1}{2}\|s_{W}(Y_{k,j},t_{k})\|^{2}+\nabla_{x}\cdot s_{W}(Y_{k,j},t_{k})\right), (25)

    where λk>0\lambda_{k}>0 are tunable weighting hyperparameters. For high-dimensional problems, it is standard practice to enhance scalability by employing the sliced version [90, 92]:

    k=1Ntj=1Nyλk(12sW(Yk,j,tk)2+=1NvvTx(sW(Yk,j,tk)Tv)),\sum_{k=1}^{N_{t}}\sum_{j=1}^{N_{y}}\lambda_{k}\left(\frac{1}{2}\|s_{W}(Y_{k,j},t_{k})\|^{2}+\sum_{\ell=1}^{N_{v}}v_{\ell}^{T}\nabla_{x}(s_{W}(Y_{k,j},t_{k})^{T}v_{\ell})\right), (26)

    where vnv_{\ell}\in\mathbb{R}^{n} are i.i.d. samples from a standard Gaussian distribution. For more details and discussion about the loss function, see Appendix C.2.2.

  2.

    Inference stage: The discretized process ZkZ_{k} is sampled according to the equation:

    Zk+1=Zk+(ϵsW(Zk,Tτk)b(Zk,Tτk))Δτ+ϵΔτηk,Z0=yobs,Z_{k+1}=Z_{k}+(\epsilon s_{W}(Z_{k},T-\tau_{k})-b(Z_{k},T-\tau_{k}))\Delta\tau+\sqrt{\epsilon\Delta\tau}\eta_{k},\quad Z_{0}=y_{obs}, (27)

where ηk\eta_{k} are i.i.d. standard Gaussian samples for k=0,,Nz1k=0,\dots,N_{z}-1, distinct from those used in the training stage. This is the Euler–Maruyama discretization for ZτZ_{\tau}, and ZkZ_{k} is a sample for ZτkZ_{\tau_{k}}. We use different notations for the sample size and time grid size in training (NyN_{y} and Δt\Delta t) and inference (NzN_{z} and Δτ\Delta\tau) to emphasize that these two discretizations need not coincide. Note also that the inference stage of the SGM-HJ-sampler differs from that of the standard SGM: rather than starting from an initial “noise” state, the process begins at the observation point yobsy_{obs}.

This two-stage structure mirrors established SGM methodology while leaving room for future extensions. Note that the method requires neither an explicit prior probability density function nor yields the posterior density directly: it takes prior samples as input and produces posterior samples as output.
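As an illustration of the training stage, the path generation (24) and a Monte-Carlo estimate of the sliced loss (26) can be sketched as below (1D, uniform weights λk=1\lambda_{k}=1, Rademacher slicing vectors vv_{\ell}, and a finite-difference stand-in for the directional derivative). In practice sWs_{W} is a trainable neural network and the divergence term is computed by automatic differentiation; the fixed score functions here are illustrative only.

```python
import numpy as np

def generate_training_paths(b, eps, T, N_t, N_y, prior_sampler, rng):
    """Euler-Maruyama simulation (24) of Y_t starting from prior samples (1D)."""
    dt = T / N_t
    Y = np.empty((N_t + 1, N_y))
    Y[0] = prior_sampler(N_y, rng)
    for k in range(N_t):
        Y[k + 1] = (Y[k] + b(Y[k], k * dt) * dt
                    + np.sqrt(eps * dt) * rng.standard_normal(N_y))
    return Y

def sliced_score_matching_loss(s, Y, T, n_slices=1, h=1e-4, rng=None):
    """Monte-Carlo estimate of the sliced loss (26) in 1D, with the term
    v^T d/dx (s^T v) approximated by a finite difference along v."""
    rng = np.random.default_rng() if rng is None else rng
    N_t = Y.shape[0] - 1
    dt = T / N_t
    total = 0.0
    for k in range(1, N_t + 1):          # uniform weights lambda_k = 1
        x, t = Y[k], k * dt
        sx = s(x, t)
        vs = rng.choice([-1.0, 1.0], size=(n_slices, x.shape[0]))
        div = np.mean([v * (s(x + h * v, t) - sx) / h for v in vs], axis=0)
        total += np.mean(0.5 * sx ** 2 + div)
    return total / N_t
```

For 1D Brownian motion with a standard Gaussian prior and ϵ=1\epsilon=1, the true score is x/(1+t)-x/(1+t), and the loss evaluated at the true score is strictly smaller than at the zero function, consistent with score matching recovering logP(Yt)\nabla\log P(Y_{t}) at the minimizer.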

μ(,0)\mu(\cdot,0) μ(,T)\mu(\cdot,T) ρ(,0)\rho(\cdot,0) ρ(,T)\rho(\cdot,T)
HJ-sampler prior distribution data distribution Dirac mass on observation posterior distribution
Diffusion models data distribution prior distribution ρ(,0)=μ(,T)\rho(\cdot,0)=\mu(\cdot,T) ρ(,T)=μ(,0)\rho(\cdot,T)=\mu(\cdot,0)
Table 1: Difference between HJ-sampler and diffusion models: the initial and terminal conditions for μ\mu and ρ\rho have different meanings in the corresponding problems, i.e., posterior sampling for HJ-sampler and data generation for diffusion models.
Remark 3.2 (Difference between SGM-HJ-sampler and SGM)

Although our training and inference stages share similar formulas with diffusion models, the key differences lie in the practical interpretation of the initial and terminal conditions, the specific problems each method is designed to address, the nature of the challenges inherent in these problems, and the directions for future improvements.

Diffusion models or SGMs are designed to generate samples from the underlying distribution of the data. Therefore, their data density function (which corresponds to our μ(,0)\mu(\cdot,0)) is unknown. A stochastic process is chosen to transform the data distribution into a simpler distribution (which corresponds to our μ(,T)\mu(\cdot,T)). This process is manually selected to balance complexity and implementation difficulty, ensuring that μ(,T)\mu(\cdot,T) is easy to sample. In this setup, μ(,T)\mu(\cdot,T) functions like a prior. A large number of samples are drawn from this prior and then transformed using the reverse process to generate samples in the original data distribution. The training and inference steps correspond to the forward and reverse processes, enabling the use of ODEs or their solution operators, rather than the reverse SDE, to enhance the efficiency of the sampling stage.

However, in our method, we aim at Bayesian inference rather than sample generation from a distribution described by data. Moreover, the stochastic processes of YtY_{t} and ZtZ_{t} are not in a forward and reverse relationship; they differ by a factor of ν\nu. Thus, we have four different initial and terminal densities: μ(,0)\mu(\cdot,0), μ(,T)\mu(\cdot,T), ρ(,0)\rho(\cdot,0), and ρ(,T)\rho(\cdot,T), instead of just two. Our interpretation of these four densities also differs. See Table 1 for more details. In our case, the process YtY_{t} is governed by the model of the underlying dynamics, which may be complicated, but we assume that samples from YtY_{t} can still be obtained in the first step. In terms of the second step, we cannot apply the techniques designed for SGMs to accelerate the inference process, as YtY_{t} and ZτZ_{\tau} are not reverse processes of each other, and the control xS\nabla_{x}S differs from ϵxlogρ\epsilon\nabla_{x}\log\rho. According to the theory connecting SDEs and ODEs, the marginal distribution for ZτZ_{\tau} is the same as the marginal distribution of the following ODE:

Z~˙τ=ϵsW(Z~τ,Tτ)b(Z~τ,Tτ)ϵ2xlogρ(Z~τ,τ),\dot{\tilde{Z}}_{\tau}=\epsilon s_{W}(\tilde{Z}_{\tau},T-\tau)-b(\tilde{Z}_{\tau},T-\tau)-\frac{\epsilon}{2}\nabla_{x}\log\rho(\tilde{Z}_{\tau},\tau), (28)

provided that the initial distributions of Z0Z_{0} and Z~0\tilde{Z}_{0} are the same. However, we cannot use the ODE for Z~τ\tilde{Z}_{\tau} to accelerate the inference process due to the lack of information on xlogρ\nabla_{x}\log\rho. Even if we can generate samples from this ODE, the sample paths of Z~τ\tilde{Z}_{\tau} differ from those of ZτZ_{\tau}, and only the marginal distributions match. Consequently, the advantage of trajectory sampling is lost.

While the previous two paragraphs outline the differences between the SGM and SGM-HJ-sampler methods, these distinctions also point to divergent directions for future development. In the literature [89, 88, 52, 53], key directions in the development of diffusion models include finding effective ways to add noise (i.e., SDE design), improving the training process by adjusting loss weights and data generation methods, and accelerating inference processes. The future improvements for the SGM-HJ-sampler are different. Since the underlying SDE process is determined by the problem setup and is usually complicated, we cannot design the SDE or the training data distribution. Moreover, as discussed earlier, we cannot utilize ODE-based sampling techniques like those in diffusion models. Thus, future work may focus on enhancing the loss design by integrating more model-specific knowledge into the loss function (see [46] for example). In the context of Bayesian inference problems, a simpler form of the prior distribution μ(,0)\mu(\cdot,0) could allow for more efficient incorporation of prior information, providing a basis for further algorithmic extensions.

Remark 3.3 (Theoretical unification of SGM-HJ-sampler and SGM)

These two methods can be understood within the same theoretical framework. Through the log transform, the diffusion model is a special case where the function ν(,0)\nu(\cdot,0) is constantly equal to 1, and μ(,0)\mu(\cdot,0) represents the data distribution. In contrast, the SGM-HJ-sampler is a special case where ν(,0)\nu(\cdot,0) is a scaled Dirac delta function δyobs()P(YT=yobs)\frac{\delta_{y_{obs}}(\cdot)}{P(Y_{T}=y_{obs})} at the observation point yobsy_{obs}, and μ(,0)\mu(\cdot,0) corresponds to the prior distribution. From this perspective, both methods can be viewed as applications of the log transform, albeit with different interpretations of the initial conditions. A related perspective on SGMs is also explored in [4], which interprets diffusion models through the lens of optimal control.

From the perspective of diffusion models, the SGM-HJ-sampler can also be viewed as a conditional sampling variant of SGM. According to diffusion model theory, the following SDE

dY~τ=(ϵsW(Y~τ,Tτ)b(Y~τ,Tτ))dτ+ϵdWτ,Y~0=dYT,d\tilde{Y}_{\tau}=(\epsilon s_{W}(\tilde{Y}_{\tau},T-\tau)-b(\tilde{Y}_{\tau},T-\tau))d\tau+\sqrt{\epsilon}dW_{\tau},\quad\tilde{Y}_{0}\stackrel{{\scriptstyle d}}{{=}}Y_{T}, (29)

where =d\stackrel{{\scriptstyle d}}{{=}} denotes equality in distribution, provides the reverse process of YtY_{t}, meaning (Y~0,Y~τ)(\tilde{Y}_{0},\tilde{Y}_{\tau}) is distributionally equivalent to (YT,YTτ)(Y_{T},Y_{T-\tau}). If we change the distribution of Y~0\tilde{Y}_{0} from that of YTY_{T} to a Dirac mass δyobs\delta_{y_{obs}} while keeping the SDE unchanged, we obtain the process ZτZ_{\tau} in (17). Since the SDE does not change, the conditional distribution of ZτZ_{\tau} given Z0=yobsZ_{0}=y_{obs} is the same as the distribution of Y~τ\tilde{Y}_{\tau} given Y~0=yobs\tilde{Y}_{0}=y_{obs}, which equals the posterior distribution of YTτY_{T-\tau} given YT=yobsY_{T}=y_{obs}. This reasoning underpins the SGM-HJ-sampler from the perspective of diffusion model theory.

4 Numerical examples

In this section, we demonstrate the accuracy, flexibility, and generalizability of our proposed HJ-sampler by applying it to four test problems. In Section 4.1, we begin with a verification test, using 1D and 2D scaled Brownian motions to present quantitative results that showcase the HJ-sampler’s performance. In Section 4.2, we consider the underlying process YtY_{t} to be an Ornstein–Uhlenbeck (OU) process, demonstrating how the HJ-sampler can handle model uncertainty and misspecification in ODEs [115]. Specifically, in the model misspecification part of this example, a second-order ODE is incorrectly modeled as a linear ODE. We introduce uncertainty by adding white noise to this misspecified ODE, transforming it into an OU process, and then solve the inference problem using the HJ-sampler. The third example, detailed in Section 4.3, illustrates the method’s potential in addressing model misspecification in nonlinear ODEs. Finally, Section 4.4 demonstrates the method’s scalability by solving a 100-dimensional problem.

For the first example and the 1D OU process case, we have analytical formulas for the posterior density functions, which serve as ground truths for quantitative comparison. These cases also have analytical solutions to the viscous HJ PDEs (15). We refer to the version of the HJ-sampler that utilizes this analytical solution as the analytic-HJ-sampler, which serves as a baseline to isolate and distinguish errors from the first and second steps of the HJ-sampler. The analytical formulas and ground truths are provided in Appendix E. The algorithm’s flexibility in sampling from P(Yt|Ys=yobs)P(Y_{t}|Y_{s}=y_{obs}) for 0t<sT0\leq t<s\leq T and yobsny_{obs}\in\mathbb{R}^{n} (see Remark 3.1) is demonstrated by selecting various tt and ss in several examples. Single precision is used in all numerical experiments for illustration purposes. Additional details on the numerical implementations can be found in Appendix D, with further results in Appendix F. The code for these examples will be made publicly available upon acceptance of this paper.

4.1 Brownian motion

In this section, we consider scaled 1D and 2D Brownian motions dYt=ϵdWtdY_{t}=\sqrt{\epsilon}dW_{t}, where WtW_{t} is a standard Brownian motion and ϵ>0\epsilon>0 is the hyperparameter indicating the level of stochasticity, as the underlying process. We solve the Bayesian inverse problem by sampling from P(YtYs=yobs)P(Y_{t}\mid Y_{s}=y_{obs}), where 0t<sT0\leq t<s\leq T, using the proposed HJ-sampler. The objective is to infer the value of YtY_{t} from the underlying stochastic model, given the observation Ys=yobsY_{s}=y_{obs} and the prior on Y0Y_{0}. The posterior distribution for this problem has an analytical solution, serving as the ground truth for error computation and quantitative analysis of the HJ-sampler’s performance. Additionally, since the viscous HJ PDE (15) also has an analytical solution, we can compare the performance of the SGM-HJ-sampler and the analytic-HJ-sampler to isolate the errors arising from the first and second steps.

4.1.1 1D cases

                     y_obs = -2   y_obs = -1   y_obs = 0   y_obs = 1.5   y_obs = 3   y_obs ~ P(Y_T)
analytic-HJ-sampler  0.0024       0.0023       0.0037      0.0018        0.0018      0.0024 ± 0.0006
SGM-HJ-sampler       0.0054       0.0069       0.0103      0.0175        0.0208      0.0104 ± 0.0051
(a) W1W_{1} errors of posterior samples for Y0YT=yobsY_{0}\mid Y_{T}=y_{obs}, computed for different values of yobsy_{obs} (Δτ=0.01\Delta\tau=0.01).
Metric                              Δτ = 0.5   Δτ = 0.1   Δτ = 0.01   Δτ = 0.001
analytic-HJ-sampler  W1             0.1141     0.0217     0.0022      0.0008
analytic-HJ-sampler  Wall time (s)  0.0        0.2        2.0         28.6
SGM-HJ-sampler       W1             0.1140     0.0224     0.0053      0.0040
SGM-HJ-sampler       Wall time (s)  0.3        1.1        10.5        106.8
(b) W1W_{1} errors and the wall time for sampling from P(Y0YT=3)P(Y_{0}\mid Y_{T}=-3), with different values of Δτ\Delta\tau in (27).
Table 2: The quantitative results for the scaled 1D Brownian motion with Gaussian prior. In (a), we compare the performance of the analytic-HJ-sampler and SGM-HJ-sampler on specific values of yobsy_{obs} (columns 2–6) and on 1,000 random samples of yobsy_{obs} drawn from P(YT)P(Y_{T}) (rightmost column). In (b), we examine the performance and computational wall time of both the analytic-HJ-sampler and SGM-HJ-sampler for different Δτ\Delta\tau values in (27). The performance is measured by Wasserstein-1 distances (W1W_{1}) between the samples from the exact posterior distribution P(Y0YT=yobs)P(Y_{0}\mid Y_{T}=y_{obs}) (Gaussian) and the posterior samples generated by the proposed algorithm. The W1W_{1} distances are computed using 1×1061\times 10^{6} samples, and the neural network in the SGM-HJ-sampler is trained on snapshots taken every Δt=0.01\Delta t=0.01.

In this section, we validate the proposed HJ-sampler through numerical experiments on 1D problems with different prior distributions, including a Gaussian, a Gaussian mixture, and a mixture of uniform distributions.

We first assume a standard Gaussian prior for $Y_0$, i.e., $Y_0 \sim \mathcal{N}(0, 1^2)$, with $\epsilon = 1$ and $T = 1$. For the SGM-HJ-sampler, the interval $[0, T]$ is uniformly discretized with a step size of $\Delta t = 0.01$ in (24) to generate training data. In the inference stage, we set $\Delta\tau = 0.01$ in (27) to obtain posterior samples by solving the controlled SDE. To quantitatively evaluate the posterior samples, we compute the Wasserstein-1 distance ($W_1$) between the exact posterior samples and those obtained using both the analytic-HJ-sampler and the SGM-HJ-sampler. The results for specific values of the observation $y_{obs}$ for $Y_T$ are presented in Table 2(a) and Figure 12, which demonstrate that both samplers produce high-quality posterior samples, though the analytic-HJ-sampler consistently outperforms the SGM-HJ-sampler. This difference highlights the errors arising from the two steps of the HJ-sampler: the analytic-HJ-sampler only incurs error from discretizing the controlled SDE in the second step, whereas the SGM-HJ-sampler also introduces error from the neural network approximation in the first step. We also assess their performance over 1,000 random samples of $y_{obs}$ drawn from $P(Y_T)$, reporting the mean and standard deviation of $W_1$ in the rightmost column of Table 2(a). Next, we analyze the performance and computational wall time of both HJ-samplers for varying $\Delta\tau$. Table 2(b) shows the trade-off between accuracy and efficiency in both samplers. Notably, for the SGM-HJ-sampler, the $W_1$ errors improve at a slower rate than for the analytic-HJ-sampler when $\Delta\tau$ is reduced from $0.01$ to $0.001$. This indicates that the error from the neural network, trained with $\Delta t = 0.01$, becomes the dominant factor, limiting further error reduction despite a smaller $\Delta\tau$.
Wall times are measured on a standard laptop CPU (13th Gen Intel(R) Core(TM) i9-13900HX, 2.20 GHz, 16 GB RAM).
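The second step of both samplers is an Euler–Maruyama discretization of the controlled SDE with step size $\Delta\tau$, which is what (27) refers to. A minimal sketch, assuming the drift is supplied as a callable (here a hypothetical placeholder; in the HJ-sampler it would be built from the HJ PDE solution or the trained network $s_W$):

```python
import numpy as np

def euler_maruyama(z0, drift, eps, dtau, n_steps, rng):
    """Simulate dZ = drift(Z, tau) dtau + sqrt(eps) dW via Euler-Maruyama.

    z0:    (n_samples,) initial samples
    drift: callable (z, tau) -> drift evaluated at each sample
    """
    z = np.array(z0, dtype=float)
    tau = 0.0
    for _ in range(n_steps):
        noise = rng.standard_normal(z.shape)
        z = z + drift(z, tau) * dtau + np.sqrt(eps * dtau) * noise
        tau += dtau
    return z

rng = np.random.default_rng(1)
# Hypothetical placeholder drift pulling samples toward the origin;
# in the HJ-sampler the drift comes from the (learned) score/value function.
drift = lambda z, tau: -z
samples = euler_maruyama(rng.standard_normal(10_000), drift, eps=1.0,
                         dtau=0.01, n_steps=100, rng=rng)
```

With this placeholder drift the scheme reproduces an Ornstein–Uhlenbeck marginal, which gives a cheap correctness check on the integrator before plugging in the actual control.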

(a) $P(Y_0 \mid Y_s = y_{obs})$ for different $y_{obs} \in \mathbb{R}$ and $s \in (0, T]$.
(b) $P(Y_t \mid Y_s = y_{obs})$ for different $t \in [0, T)$ with fixed observation $y_{obs} = 2$ and time $s = T$.
(c) $P(Y_t \mid Y_s = y_{obs})$ for different $y_{obs} \in \mathbb{R}$ and $0 \leq t < s \leq T$.
Figure 5: Histograms depicting the distribution of posterior samples for the scaled 1D Brownian motion case with a Gaussian mixture prior, across different observation times $s$ and data values $y_{obs}$. In all cases, the posterior samples, obtained from the SGM-HJ-sampler, use the same pretrained neural network, trained on $t \in [0, T]$ with $T = 1$. The black dashed lines represent the exact posterior density functions (Gaussian mixture). Each histogram is generated from $1\times 10^6$ samples.
                      $Y_{0.01}\mid Y_{0.8}=-4$   $Y_{0.02}\mid Y_{0.5}=-2$   $Y_{0.05}\mid Y_{0.6}=0.5$   $Y_{0.45}\mid Y_{0.95}=1$   $Y_{0.03}\mid Y_{0.4}=3$
analytic-HJ-sampler   0.0019                      0.0029                      0.0034                       0.0026                      0.0027
SGM-HJ-sampler        0.0098                      0.0072                      0.0069                       0.0110                      0.0079
Table 3: $W_1$ errors for the scaled 1D Brownian motion case with a Gaussian mixture prior. The $W_1$ distances are calculated between $1\times 10^6$ posterior samples of $Y_t \mid Y_s = y_{obs}$ obtained using two versions of the HJ-sampler (with $\Delta\tau = 0.001$) and $1\times 10^6$ samples from the exact posterior distribution (Gaussian mixture).
(a) The analytic-HJ-sampler.
(b) The SGM-HJ-sampler.
Figure 6: Histograms depicting the distribution of posterior samples of $Y_0 \mid Y_s = y_{obs}$ for the scaled 1D Brownian motion case with a prior consisting of a mixture of uniform distributions, across different observation times $s$ and data values $y_{obs}$. In all cases, the posterior samples obtained using the SGM-HJ-sampler use the same pretrained neural network, trained on $t \in [0, T]$ with $T = 1$. The black dashed lines represent the exact posterior density functions. Each histogram is generated from $1\times 10^6$ samples. The prior is a mixture of two uniform distributions, $\mathcal{U}[-0.75, -0.25)$ and $\mathcal{U}[0.25, 0.75)$, with equal weights.

We then apply the HJ-sampler to a 1D Gaussian mixture prior. Specifically, the prior of $Y_0$ is a mixture of three Gaussians, $\mathcal{N}(0, 0.5^2)$, $\mathcal{N}(-2, 0.8^2)$, and $\mathcal{N}(2, 0.6^2)$, with equal weights, and we set $\epsilon = 1$, $T = 1$, and $\Delta\tau = 0.001$. We demonstrate the flexibility of the SGM-HJ-sampler by generating posterior samples for the following cases:

  (a) $P(Y_0 \mid Y_s = y_{obs})$ for $y_{obs} \in \mathbb{R}$ and $s \in (0, T]$,
  (b) $P(Y_t \mid Y_s = y_{obs})$ for $t \in [0, T)$, $s = T$, and $y_{obs} = 2$,
  (c) $P(Y_t \mid Y_s = y_{obs})$ for $y_{obs} \in \mathbb{R}$ and $0 \leq t < s \leq T$.

The results, shown in Figure 5(a), (b), and (c), indicate that the posterior samples from the SGM-HJ-sampler agree with the exact posterior density functions. Table 3 provides quantitative results for case (c). Notably, in all cases, the neural network is trained once, before the observation $y_{obs}$ and time $s$ are known; after the data are received, the second step of the HJ-sampler generates posterior samples without any retraining. This flexibility is discussed in Remark 3.1.

Finally, we test the method on a more challenging case where the prior distribution is compactly supported. We consider a prior consisting of a mixture of two uniform distributions, $\mathcal{U}[-0.75, -0.25)$ and $\mathcal{U}[0.25, 0.75)$, with equal weights, and set $\epsilon = 0.05$, $T = 1$, and $\Delta\tau = 0.001$. Using both the analytic-HJ-sampler and the SGM-HJ-sampler, we generate posterior samples of $Y_0$ given different observation points $y_{obs}$ at various observation times $s \in (0, T]$. Results are presented in Figure 6, showing that both HJ-samplers produce reliable posterior samples that align closely with the exact posterior density functions. However, the analytic-HJ-sampler outperforms the SGM-HJ-sampler due to the latter's neural network approximation error.

4.1.2 A 2D case

(a) The exact posterior distribution.
(b) Posterior samples obtained from the SGM-HJ-sampler.
Figure 7: Heatmaps for the scaled 2D Brownian motion case with a Gaussian mixture prior. Posterior samples of $Y_t$ given the value of $Y_s$, for $0 \leq t < s \leq T$, are compared to the exact posterior distribution (Gaussian mixture). In (a), the exact posterior density functions are displayed for specific values of $t$, $s$, and $Y_s$. In (b), the corresponding posterior samples obtained from the SGM-HJ-sampler are presented as histograms. The neural network in the SGM-HJ-sampler was trained on $t \in [0, T]$, with $T = 1$.
                      $Y_{0.1}\mid Y_{0.9}=[-0.9,0.9]^T$   $Y_{0.2}\mid Y_{0.7}=[0.7,0.3]^T$   $Y_{0}\mid Y_{0.3}=[0.3,-0.4]^T$   $Y_{0.3}\mid Y_{0.8}=[-0.5,0.3]^T$
analytic-HJ-sampler   0.0007                               0.0007                              0.0008                             0.0007
SGM-HJ-sampler        0.0126                               0.0056                              0.0025                             0.0031
Table 4: Sliced $W_1$ errors for the scaled 2D Brownian motion case with a Gaussian mixture prior. For each value of $t$, $s$, and $y_{obs}$, the sliced $W_1$ error is computed between $1\times 10^6$ samples from the exact posterior distribution $P(Y_t \mid Y_s = y_{obs})$ (Gaussian mixture) and the posterior samples generated using the proposed methods.

We now consider a 2D case with the prior for $Y_0$ given as a mixture of two Gaussian distributions with means $\mu_1 = [0.5, 0.5]^T$ and $\mu_2 = [-0.5, -0.5]^T$, and covariance matrices $\Sigma_1 = \begin{bmatrix} 0.25 & 0.05 \\ 0.05 & 1/9 \end{bmatrix}$ and $\Sigma_2 = \begin{bmatrix} 0.0625 & -0.05 \\ -0.05 & 0.25 \end{bmatrix}$, respectively. We set $\epsilon = 0.5$ and $T = 1$, and uniformly discretize $[0, T]$ with a step size $\Delta t = 0.01$ to generate training data. For the inference, we set $\Delta\tau = 0.001$ when solving the controlled SDE. Posterior samples of $Y_t$ for different values of $Y_s$ (where $0 \leq t < s \leq T$) are generated using both the analytic-HJ-sampler and the SGM-HJ-sampler. The results from the SGM-HJ-sampler are displayed in Figure 7, showing strong agreement between the posterior samples and the exact density functions. To quantitatively assess the sample quality, we compute the sliced $W_1$ distance between the posterior samples and the exact distributions. The results, shown in Table 4, indicate that while both methods perform well, the analytic-HJ-sampler generally yields better results. The sliced $W_1$ distance between two $n$-dimensional distributions $\mu$ and $\nu$ is defined as:

\mathbb{E}_{d\sim\mathcal{U}(\mathbb{S}^{n-1})}\left[W_{1}\left((\mathcal{P}_{d})_{\#}\mu,(\mathcal{P}_{d})_{\#}\nu\right)\right],

where $\mathcal{U}(\mathbb{S}^{n-1})$ is the uniform distribution on the unit sphere, $\mathcal{P}_d$ is the projection along direction $d$ (i.e., $\mathcal{P}_d(x) = \langle d, x \rangle$), and $(\mathcal{P}_d)_{\#}\mu$ and $(\mathcal{P}_d)_{\#}\nu$ denote the push-forwards of the distributions $\mu$ and $\nu$, respectively. In practice, each one-dimensional distance $W_1((\mathcal{P}_d)_{\#}\mu, (\mathcal{P}_d)_{\#}\nu)$ is computed between samples of the projected distributions, and the expectation over random directions $d \sim \mathcal{U}(\mathbb{S}^{n-1})$ is approximated by a Monte Carlo average over 50 sampled directions, using the Python Optimal Transport library [33].
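The Monte Carlo computation described above can be sketched as a loop over random projection directions; here we use SciPy's 1D $W_1$ routine in place of the Python Optimal Transport library that the experiments actually use:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_w1(x, y, n_dirs=50, rng=None):
    """Monte Carlo estimate of the sliced W1 distance between two
    empirical distributions x, y of shape (n_samples, dim)."""
    rng = rng or np.random.default_rng(0)
    dim = x.shape[1]
    total = 0.0
    for _ in range(n_dirs):
        d = rng.standard_normal(dim)
        d /= np.linalg.norm(d)            # uniform direction on the unit sphere
        total += wasserstein_distance(x @ d, y @ d)  # 1D W1 of projections
    return total / n_dirs

rng = np.random.default_rng(2)
# Illustrative check: two independent samples of the same 2D Gaussian
# should give a small sliced W1.
x = rng.standard_normal((50_000, 2))
y = rng.standard_normal((50_000, 2))
dist = sliced_w1(x, y, rng=rng)
```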

4.2 Ornstein–Uhlenbeck process

In this section, we consider the OU process

dY_{t}=-BY_{t}\,dt+\sqrt{\epsilon}\,dW_{t}, (30)

where $B \in \mathbb{R}^{n\times n}$ is a constant matrix whose eigenvalues have positive real parts, and $W_t$ is a Brownian motion in $\mathbb{R}^n$. The prior distribution is assumed to be either Gaussian or a Gaussian mixture. This setup is a specific instance of the process discussed in Section 3.2.1, where $A(t) = -B$ and $\beta(t) = 0$. A more general version of the OU process, such as $dY_t = (c - BY_t)\,dt + \sqrt{\epsilon}\,\sigma\,dW_t$, involving an additional constant vector $c \in \mathbb{R}^n$ and a constant matrix $\sigma \in \mathbb{R}^{n\times n}$, can also be solved in a similar manner (as a special case discussed in Appendix C.2). However, we focus on the simpler case here for illustrative purposes. Both the Riccati-HJ-sampler and the SGM-HJ-sampler can be applied to solve this problem. Additionally, in the 1D case, the viscous HJ PDE (15) has an analytical solution, enabling the use of the analytic-HJ-sampler for comparison as a sanity check.
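Training data for the SGM-HJ-sampler comes from forward simulation of (30). For a 1D instance the simulated marginal can be checked against the exact Gaussian law of the OU process; the sketch below uses illustrative parameters $\alpha = 3$, $\epsilon = 1.5$ with a standard Gaussian prior (step size and sample count are also illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, eps, T, dt = 3.0, 1.5, 1.0, 0.001
n_steps = int(T / dt)

# Prior samples Y_0 ~ N(0, 1), evolved with Euler-Maruyama:
# dY = -alpha * Y dt + sqrt(eps) dW.
y = rng.standard_normal(100_000)
for _ in range(n_steps):
    y = y - alpha * y * dt + np.sqrt(eps * dt) * rng.standard_normal(y.shape)

# Exact marginal of the OU process started from N(0, 1):
# mean 0, variance e^{-2 alpha T} + eps (1 - e^{-2 alpha T}) / (2 alpha).
var_exact = np.exp(-2 * alpha * T) + eps * (1 - np.exp(-2 * alpha * T)) / (2 * alpha)
```

Snapshots of `y` taken along the loop would form the training set for the score network.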

In this scenario, the functions $q$ and $r$ in the Riccati ODE (19) have analytical solutions, given by $q(t) = e^{-tB}\theta^P$ and $r(t) = \frac{\epsilon}{2}\log((2\pi\epsilon)^n \det(Q(t)))$. The other function, $Q$, can be solved using an ODE solver. Although the Riccati ODE method for solving the marginal density function of the OU process is known in the literature (e.g., [57]), we extend this by connecting it to the viscous HJ PDE and control problems via the log transform.

analytic-HJ-sampler   Riccati-HJ-sampler   SGM-HJ-sampler
0.0085 ± 0.0014       0.0086 ± 0.0013      0.0103 ± 0.0024
Table 5: Verification of the proposed methods for a 1D OU process with a Gaussian prior. $W_1$ distances are computed between the posterior samples of $Y_0 \mid Y_T = y_{obs}$ obtained from three HJ-samplers and the samples from the exact posterior distribution (Gaussian). We randomly draw 1,000 samples of $y_{obs}$ from $P(Y_T)$ (Gaussian) and present the mean and standard deviation of the $W_1$ distances across these values of $y_{obs}$. The $W_1$ distance is evaluated based on $1\times 10^6$ samples.

We first perform a sanity check for the presented algorithm by inferring the value of $Y_0$ given $Y_T = y_{obs}$ for a 1D OU process, $dY_t = -\alpha Y_t\,dt + \sqrt{\epsilon}\,dW_t$, with $\alpha = 3$ and $\epsilon = 1.5$. The prior distribution is assumed to be the standard Gaussian distribution $\mathcal{N}(0, 1^2)$, and we set $T = 1$. We apply the analytic-HJ-sampler, Riccati-HJ-sampler, and SGM-HJ-sampler to obtain posterior samples of $Y_0$. The forward Euler scheme with a step size of $1\times 10^{-4}$ is used to solve the Riccati ODE in the Riccati-HJ-sampler, while $\Delta\tau = 0.01$ is employed in the inference stage of all three HJ-samplers to solve the controlled SDE. We sample 1,000 values of $y_{obs}$ randomly from $P(Y_T)$ and compute the $W_1$ distance between the posterior samples generated by the HJ-samplers and the exact posterior distribution of $Y_0 \mid Y_T = y_{obs}$ (Gaussian). The mean and standard deviation of the $W_1$ distances across different values of $y_{obs}$ are presented in Table 5, showing that all three HJ-samplers perform well in producing posterior samples.
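The exact posterior used as ground truth in this sanity check follows from Gaussian conjugacy: for the linear SDE, $Y_T \mid Y_0 = y_0 \sim \mathcal{N}\big(e^{-\alpha T} y_0,\ \epsilon(1 - e^{-2\alpha T})/(2\alpha)\big)$, so combining with the $\mathcal{N}(0, 1)$ prior yields a Gaussian posterior. A sketch of this reference computation (our own derivation, not code from the experiments):

```python
import numpy as np

alpha, eps, T = 3.0, 1.5, 1.0
prior_var = 1.0

# Linear-Gaussian transition: Y_T | Y_0 = y0 ~ N(m * y0, v).
m = np.exp(-alpha * T)
v = eps * (1.0 - np.exp(-2 * alpha * T)) / (2 * alpha)

def exact_posterior(y_obs):
    """Mean and variance of Y_0 | Y_T = y_obs by Gaussian conjugacy:
    posterior precision = prior precision + likelihood precision."""
    post_var = 1.0 / (1.0 / prior_var + m**2 / v)
    post_mean = post_var * (m * y_obs / v)
    return post_mean, post_var

mean0, var0 = exact_posterior(y_obs=1.0)
```

Samples from this Gaussian are what the $W_1$ distances in Table 5 are computed against.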

After this sanity check, in the following two sections, we demonstrate the ability of the proposed HJ-samplers to handle model uncertainty and misspecification problems. In the first case (Section 4.2.1), we examine a first-order linear ODE system with model uncertainty present in both equations. In the second case (Section 4.2.2), we address a model misspecification problem where a second-order ODE is incorrectly modeled as a second-order linear ODE. This misspecification is captured by the corresponding linear SDE. By reformulating the SDE as a first-order linear system, only the second equation contains the white noise term, illustrating the algorithm’s capability to handle partial uncertainty within a system. In both cases, the SDE models yield an OU process, allowing us to apply the HJ-samplers developed for OU processes to effectively solve these Bayesian inference problems.

4.2.1 The first case: model uncertainty in a first-order linear ODE system

(a) The Riccati-HJ-sampler.
(b) The SGM-HJ-sampler.
Figure 8: Histograms of posterior samples for the 2D OU process with a Gaussian mixture prior. We show the posterior samples of $Y_t \mid Y_s = y_{obs}$ computed using the Riccati-HJ-sampler in (a) and the SGM-HJ-sampler in (b) for several values of $y_{obs} \in \mathbb{R}^2$ and $0 \leq t < s \leq T$.

In this section, we demonstrate how to apply the proposed method to handle model uncertainty. We consider the first-order ODE system:

\frac{dy_{1}}{dt}=y_{2},\quad \frac{dy_{2}}{dt}=-y_{1}-y_{2}. (31)

Model uncertainty is introduced by adding white noise to the right-hand sides of the equations, leading to the OU process (30) with $n = 2$ and $B = \begin{bmatrix} 0 & -1 \\ 1 & 1 \end{bmatrix} \in \mathbb{R}^{2\times 2}$. We assume the prior distribution of $Y_0$ is a mixture of two Gaussians with means $\mu_1 = [-0.7, 0]^T$ and $\mu_2 = [0.7, 0]^T$ and covariance matrices $\Sigma_1 = \begin{bmatrix} 0.25 & 0.1 \\ 0.1 & 0.16 \end{bmatrix}$ and $\Sigma_2 = \begin{bmatrix} 0.25 & -0.1 \\ -0.1 & 0.16 \end{bmatrix}$, respectively, with equal weights. We set $\epsilon = 5$ and $T = 1$, and discretize $[0, T]$ uniformly with $\Delta t = 0.01$ to generate the training data for the SGM-HJ-sampler. We apply both the Riccati-HJ-sampler and the SGM-HJ-sampler to solve the Bayesian inverse problem, i.e., sampling from $P(Y_t \mid Y_s = y_{obs})$, where $0 \leq t < s \leq T$, for specific values of $y_{obs}$.

Since $B$ is non-diagonal, the exact posterior cannot be derived analytically, and the analytic-HJ-sampler is not available in this case. Therefore, the Riccati-HJ-sampler is used as the reference method to obtain posterior samples of $Y_t \mid Y_s = y_{obs}$. We set $\Delta\tau = 0.001$ to solve the controlled SDE in both HJ-samplers, and the forward Euler scheme with a step size of $1\times 10^{-5}$ is employed to solve the Riccati ODE in the Riccati-HJ-sampler.

Results are presented in Figure 8, demonstrating that the SGM-HJ-sampler produces results that closely align with those from the Riccati-HJ-sampler, validating the effectiveness of the SGM-HJ-sampler.

4.2.2 The second case: model misspecification of a second-order ODE

Figure 9: Quantifying model uncertainty in inferring the values of $Y_1(t)$ and $Y_2(t)$ for $t \in [0, T)$, based on $Y_1(T)$ and $Y_2(T)$ at $T = 5$, under different values of $\epsilon$ in (34). The observations $Y_1(T)$ and $Y_2(T)$, along with the reference solutions (solid lines in each figure), are generated by the exact ODE system (32). The parameter $\epsilon$ represents the uncertainty level in the misspecified model. The SGM-HJ-sampler is employed to generate posterior samples, which are then used to compute the posterior means (dashed lines) and 2-standard-deviation intervals (color-shaded regions) in the right three figures. Model uncertainty refers to the uncertainty in the differential equation due to misspecification of a term on the right-hand side of (32b). The leftmost figure illustrates the reference trajectories of $y_1$ and $y_2$ (solid lines, from the exact ODE system (32)) and the inferred $y_1$ and $y_2$ (dashed lines, from the misspecified ODE system (33)). The discrepancy between the solid and dashed lines highlights the impact of model misspecification, emphasizing the need to account for the associated uncertainty.

In this section, we illustrate how the proposed method addresses model misspecification. Consider an exact but unknown ODE given by $u''(t) = -u(t) + u(t)^2 - u'(t)$, which is misspecified as $\tilde{u}''(t) = -\tilde{u}(t) - \tilde{u}'(t)$. This discrepancy necessitates modeling the error and quantifying the uncertainty induced by the incorrect model, as discussed in [115]. By defining $y_1(t) = u(t)$ and introducing the auxiliary variable $y_2(t) = u'(t)$, the exact ODE transforms into:

\frac{dy_{1}}{dt} = y_{2}, (32a)
\frac{dy_{2}}{dt} = -y_{1} + y_{1}^{2} - y_{2}. (32b)

Similarly, letting $\tilde{y}_1(t) = \tilde{u}(t)$ and $\tilde{y}_2(t) = \tilde{u}'(t)$, the misspecified model becomes:

\frac{d\tilde{y}_{1}}{dt} = \tilde{y}_{2}, (33a)
\frac{d\tilde{y}_{2}}{dt} = -\tilde{y}_{1} - \tilde{y}_{2}. (33b)

To account for the model uncertainty, we reformulate the misspecified ODE system (33) into the following SDE:

dY_{1}(t) = Y_{2}(t)\,dt, \quad dY_{2}(t) = (-Y_{1}(t) - Y_{2}(t))\,dt + \sqrt{\epsilon}\,dW_{t}, (34)

where the capital letters $Y_1(t)$ and $Y_2(t)$ denote random variables, the stochastic counterparts of the deterministic values $y_1(t)$ and $y_2(t)$ from the exact ODE model (32), reflecting the uncertainty in the misspecified model. Here, $\epsilon$ is a hyperparameter controlling the level of uncertainty in the system: a larger $\epsilon$ corresponds to greater uncertainty in the model equations, while a smaller $\epsilon$ indicates higher confidence in the misspecified ODE.

Our goal is to infer the values of $Y_1(t)$ and $Y_2(t)$ for $t \in [0, T)$ based on the values of $Y_1(T)$ and $Y_2(T)$, with $T = 5$, while accounting for the model uncertainty in (34) using the SGM-HJ-sampler. Specifically, the data for $Y_1(T)$ and $Y_2(T)$, which are based on the exact ODE solutions $y_1(T)$ and $y_2(T)$, as well as the reference trajectories $y_1(t)$ and $y_2(t)$ for $t \in [0, T)$, are generated by numerically solving the exact nonlinear ODE (32) with the initial conditions $y_1(0) = 0.2$ and $y_2(0) = 0.1$ using the SciPy odeint function [95]. We consider three levels of uncertainty, $\epsilon = 1\times 10^{-3}, 1\times 10^{-4}, 1\times 10^{-5}$, and assume the prior distributions of $Y_1(0)$ and $Y_2(0)$ are independent log-normal distributions, specifically $\text{LogNormal}(-2, 0.5^2)$. Since the prior distribution is neither Gaussian nor a Gaussian mixture, the Riccati-HJ-sampler is not applicable. Therefore, we use the SGM-HJ-sampler with $\Delta\tau = 0.001$ to solve this problem.
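Simulating training trajectories from (34) injects noise into the second component only. A minimal Euler–Maruyama sketch (the log-normal prior parameters follow the text; the path count and step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
eps, T, dt = 1e-3, 5.0, 0.01
n_steps, n_paths = int(T / dt), 10_000

# Prior: independent LogNormal(-2, 0.5^2) for Y1(0) and Y2(0).
y1 = rng.lognormal(mean=-2.0, sigma=0.5, size=n_paths)
y2 = rng.lognormal(mean=-2.0, sigma=0.5, size=n_paths)

for _ in range(n_steps):
    dW = np.sqrt(dt) * rng.standard_normal(n_paths)
    y1, y2 = (y1 + y2 * dt,                              # no noise in eq. 1
              y2 + (-y1 - y2) * dt + np.sqrt(eps) * dW)  # noise only here
```

The simultaneous tuple assignment keeps the update explicit in both components, so the drift of each equation is evaluated at the current state.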

We present the results in Figure 9, where each panel displays the reference trajectories $y_1(t)$ and $y_2(t)$ as solid lines. In the leftmost panel, we show the inference of $y_1(t)$ and $y_2(t)$ for $t \in [0, T)$, based solely on $y_1(T)$ and $y_2(T)$, obtained by solving the misspecified linear ODE (33) backward in time. The gap between the reference trajectories $(y_1(t), y_2(t))$ and the inferred trajectories $(\tilde{y}_1(t), \tilde{y}_2(t))$ highlights the necessity of incorporating model uncertainty when addressing model misspecification.

The remaining panels in Figure 9 show the results from the SGM-HJ-sampler, with each panel corresponding to a different uncertainty level $\epsilon$. Specifically, after obtaining $1\times 10^3$ posterior samples using the SGM-HJ-sampler, we compute the posterior means and standard deviations. The posterior means are represented by dashed lines, while the 2-standard-deviation regions are shown in color, visualizing the uncertainty in the inferred solutions. From these panels, we observe that the uncertainty in the inferred values of $Y_1(t)$ and $Y_2(t)$ increases as $\epsilon$ increases. In practice, $\epsilon$ can be interpreted as a confidence hyperparameter for the model (33b), where higher confidence corresponds to a smaller $\epsilon$.

In this example, three distinct neural networks ($s_W$) were trained, one for each value of $\epsilon$, leading to three separate SGM-HJ-samplers. To enhance flexibility with respect to the confidence hyperparameter $\epsilon$, one could include $\epsilon$ as an input to the neural network, allowing it to adapt to different confidence levels without retraining. This example also highlights the capability of our method to handle partial uncertainty, where only certain equations within the system are subject to uncertainty.

4.3 Model misspecification for a nonlinear ODE

(a) The exact model is $g(y) = 3y^2$, misspecified as $\tilde{g}(y) = y^2$.
(b) The exact model is $g(y) = 1.5y(1-y)$, misspecified as $\tilde{g}(y) = y^2$.
Figure 10: Quantifying model uncertainty in inferring $y(t)$, $t \in [0, T)$, from $y(T)$ ($T = 1$) under different cases of model misspecification. The leftmost panel shows the reference solution (exact ODE) and the inference obtained by solving the misspecified ODE backward in time. The difference highlights the consequence of model misspecification, emphasizing the need to quantify uncertainty. The right three panels display the references (solid), posterior means (dashed), and 2-standard-deviation regions (colored) computed using the SGM-HJ-sampler with varying uncertainty levels ($\epsilon$).

In this section, we demonstrate the effectiveness of the proposed approach in addressing model misspecification in a nonlinear 1D ODE through a Bayesian inverse problem. Specifically, we consider the following ODE:

\frac{dy(t)}{dt} = f(t) + g(y(t)), \quad t \in [0, T], \qquad y(0) = y_{0}, (35)

where $y_0$ is the initial condition, $f(t) = \sin(4\pi t)$ is the known source term, and $g(y(t))$ is the term that is misspecified as $\tilde{g}(y(t))$. Our objective is to infer the values of $y(t)$ for $t \in [0, T)$, given the observed value $y_{obs} = y(T)$, while accounting for uncertainty due to the misspecification of $g(y)$. To model this uncertainty, we reformulate the ODE as the following SDE:

dY_{t} = (f(t) + \tilde{g}(Y_{t}))\,dt + \sqrt{\epsilon}\,dW_{t}, (36)

where $\epsilon$ is a positive constant that controls the level of uncertainty. The parameter $\epsilon$ can be interpreted as a confidence level: smaller values of $\epsilon$ indicate higher confidence in the model. The prior distribution of $Y_0$ is assumed to be $\mathcal{N}(0, 0.1^2)$. We discretize the time domain $[0, T]$ uniformly with $\Delta t = 0.01$ to generate the training data.
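Generating these training snapshots amounts to forward simulation of (36). A minimal sketch with the known source $f(t) = \sin(4\pi t)$ and, for illustration, the misspecified term $\tilde{g}(y) = y^2$ used in the cases below (the path count and $\epsilon$ value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
eps, T, dt = 1e-3, 1.0, 0.01
n_paths = 10_000

f = lambda t: np.sin(4 * np.pi * t)   # known source term
g_tilde = lambda y: y**2              # misspecified model term

# Prior Y_0 ~ N(0, 0.1^2); store a snapshot at every time step.
y = 0.1 * rng.standard_normal(n_paths)
snapshots = [y.copy()]
t = 0.0
for _ in range(int(T / dt)):
    y = y + (f(t) + g_tilde(y)) * dt + np.sqrt(eps * dt) * rng.standard_normal(n_paths)
    t += dt
    snapshots.append(y.copy())
```

The stacked snapshots (one array per time step) would then serve as training data for the score network.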

We assume $T = 1$ and consider two distinct cases of model misspecification:

  (a) The exact model is $g(y) = 3y^2$, but it is misspecified as $\tilde{g}(y) = y^2$.
  (b) The exact model is $g(y) = 1.5y(1-y)$, but it is misspecified as $\tilde{g}(y) = y^2$.

In Case (a), the model form is correct but the coefficient is wrong, while in Case (b), the model $g$ itself is misspecified. The values of $y(T)$ and the reference solutions for $y(t)$, $t \in [0, T)$, are obtained by numerically solving the correctly specified ODE with initial conditions $y_0 = 0.05$ in (a) and $y_0 = -0.1$ in (b).

The results are presented in Figure 10, following a presentation style similar to that of Figure 9. The leftmost panel shows the consequences of model misspecification, where the inferred solution is computed by solving the misspecified ODE backward in time. The remaining three panels show the posterior mean (dashed lines) and 2-standard-deviation regions (colored areas) computed from the $1\times 10^5$ posterior samples generated by the SGM-HJ-sampler, with uncertainty levels $\epsilon = 1\times 10^{-3}$, $5\times 10^{-3}$, and $1\times 10^{-2}$. We set $\Delta\tau = 0.001$ in solving the controlled SDE. The solid lines represent the reference solutions for $y(t)$ from the correctly specified ODE. These results show that as $\epsilon$ decreases, the uncertainty in the inferred values of $y(t)$ decreases accordingly. Overall, this example demonstrates the SGM-HJ-sampler's capability to account for model misspecification in nonlinear systems.

4.4 A high-dimensional example

(a) Posterior samples of $Y_t$, $t \in [0, T)$, given $Y_T = y_1$.
(b) Posterior samples of $Y_t$, $t \in [0, T)$, given $Y_T = y_2$.
Figure 11: Results from the SGM-HJ-sampler used to infer $Y_t$ for $t \in [0, T)$ given $Y_T = y_1$ or $y_2$. Each figure shows a snapshot at a different time $t$. The horizontal axis represents the spatial domain of the underlying function $f_t$, while the vertical axis represents the function's value. Since $Y_t \in \mathbb{R}^n$ with $n = 100$ denotes the discretization of $f_t$, we plot these values at the grid points in each figure. Panels (a) and (b) correspond to the cases with observations $Y_T = y_1$ and $Y_T = y_2$, respectively, where $y_1$ and $y_2$ are vectors randomly selected from the testing data of $Y_T$. The component values of $y_1$ and $y_2$ are marked as blue circles. The red crosses represent the posterior means of the inferred $Y_t$ components, connected by red dots to illustrate the inferred underlying function $f_t$. The color-shaded region indicates the uncertainty, computed as the mean ± two standard deviations. In the rightmost figures, the black lines and crosses show the reference values of $f_0$, the underlying unknown function used to generate the observation data $Y_T$.

This section presents an example in which the proposed algorithm is applied to a Bayesian inference problem for a function $f_t$ defined on $[0, 1]$. The domain is discretized using a uniform grid, and we focus on the grid points. The values of $f_t(x_i)$ at the grid points $x_i$ can be represented as an $n$-dimensional vector, where $n$ is the number of grid points. We denote this vector, with components $f_t(x_1), \dots, f_t(x_n)$, by $Y_t$. The underlying stochastic process for $Y_t$ is modeled as a scaled Brownian motion, $dY_t = \sqrt{\epsilon}\,dW_t$, $t \in [0, T]$. The objective is to solve the inverse problem of inferring the value of $Y_t$, $t \in [0, T)$, given an observation of $Y_T$ and prior knowledge of $Y_0$. This example also evaluates the algorithm's capability to handle high-dimensional problems. The prior information assumes that $Y_0$ corresponds to the grid values of the function $f_0$ [69, 112]:

f_{0}(x \mid \xi_{1}, \dots, \xi_{8}) = \frac{1}{16}\sum_{j=1}^{8}\xi_{j}\sin(j\pi x), \quad x \in [0, 1], (37)

where $\xi_j$, $j = 1, \dots, 8$, are i.i.d. random variables uniformly distributed on $[1, 3)$.
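Prior samples of $Y_0$ are obtained by evaluating random draws of (37) at the grid points. A minimal sketch with $n = 100$ (the convention of a grid including both endpoints is our assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_samples = 100, 10_000
x = np.linspace(0.0, 1.0, n)                # uniform grid on [0, 1]
j = np.arange(1, 9)                         # modes j = 1, ..., 8

# xi_j ~ Uniform[1, 3), i.i.d. across modes and samples.
xi = rng.uniform(1.0, 3.0, size=(n_samples, 8))

# f_0(x) = (1/16) * sum_j xi_j sin(j pi x), evaluated on the grid.
basis = np.sin(np.pi * np.outer(j, x))      # shape (8, n)
Y0 = (xi @ basis) / 16.0                    # shape (n_samples, n)
```

Each row of `Y0` is one prior draw of the 100-dimensional vector $Y_0$; by construction every draw vanishes at both endpoints, since all sine modes do.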

We set $n = 100$, $\epsilon = 0.01$, and $T = 1$. The SGM-HJ-sampler is used to solve the problem, utilizing the sliced score matching loss function (26), which enhances scalability when direct computation of the divergence in the loss (25) becomes inefficient. The training data for $s_W$ are generated by sampling $Y_0$ from (37) and numerically discretizing the Brownian motion with $\Delta t = 0.02$. The inference step is carried out with $\Delta\tau = 0.01$. We generate posterior samples of $Y_t$, $t \in [0, T)$, given two specific values of $Y_T$ randomly selected from the test set. The results are shown in Figure 11, where we depict the posterior mean (red crosses) and uncertainty (two-standard-deviation intervals, colored areas) of $Y_t$ for selected $t \in [0, T)$, based on 1,000 posterior samples from the SGM-HJ-sampler. Unlike Figures 9 and 10, the x-axis in these figures represents the spatial domain $[0, 1]$ of the underlying function $f_t$, rather than the temporal domain $t \in [0, T]$. Each figure is a snapshot at a specific $t$, and since $Y_t$ contains the grid values of an underlying 1D function, we visualize the results by connecting the component values with lines to represent the reference, posterior mean, and uncertainties.

The results indicate that when $t$ is close to $T$, such as $t = 0.9$, the uncertainty is small, and the posterior mean of $Y_t$ is close to the observed value $Y_T$. As $t$ decreases, the posterior mean becomes smoother. In both cases, at $t = 0$, the reference values (black lines and crosses) lie within the two-standard-deviation regions around the posterior mean, demonstrating the effectiveness of the SGM-HJ-sampler in quantifying uncertainty for this high-dimensional problem.

5 Summary

In this paper, we leverage the log transform to establish connections between certain Bayesian inference problems, stochastic optimal control, and diffusion models. Building on this connection, we propose the HJ-sampler, an algorithm designed to sample from the posterior distribution of inverse problems involving SDE processes. We have developed three specific HJ-samplers: the analytic-HJ-sampler, Riccati-HJ-sampler, and SGM-HJ-sampler, applying them to various SDEs and different prior distributions of the initial condition. Notably, we have demonstrated the potential of these algorithms in addressing (1) uncertainty induced by model misspecification [115] in nonlinear ODEs, (2) mixtures of certainty and uncertainty in ODE systems, and (3) high dimensionality. The results showcase the accuracy, flexibility, and generalizability of our approach, highlighting new avenues for solving such inverse problems by utilizing techniques originally developed for control problems and diffusion models. Despite these advancements, several open problems and extensions remain to be explored.

There are multiple ways to generalize the current method. First, although the method is presented for a single observational data point, it can be extended sequentially to handle multiple observations: each time a new observation becomes available, the posterior samples obtained from the previous observations define an updated prior, and the HJ-sampler is then reapplied with this prior. When employing the machine learning-based version of the HJ-sampler, this process can be integrated with operator learning to potentially eliminate the need for retraining neural networks, enabling more efficient updates. Second, while the HJ-sampler is tailored for cases where YtY_{t} is governed by an SDE, the underlying log transform framework is applicable to any process with a well-defined infinitesimal generator. The SDE case represents a specific instance, and future research will focus on extending the numerical implementation of the HJ-sampler to more general processes. This broader perspective also hints at potential extensions of diffusion models to handle non-continuous distributions or processes driven by more complex noise structures.

Beyond generalization, several improvements to the current HJ-sampler method are possible. For instance, alternative numerical methods for solving viscous HJ PDEs or stochastic optimal control problems could be incorporated, leading to new variants of the HJ-sampler. Additionally, various machine learning techniques could be integrated with the HJ-sampler to enhance efficiency and accuracy. This could involve the development of novel loss functions or improved strategies for generating training data. Moreover, in cases where certain parts of the model are poorly understood or lack sufficient information, neural network surrogate models could be used to shift from a model-driven to a data-driven approach. For example, if the underlying process is not well-characterized, NeuralODE [18], NeuralSDE [54, 65], or reinforcement learning methods could be employed to approximate the dynamics.

Acknowledgement

We would like to express our gratitude to Dr. Molei Tao for providing valuable references on Sequential Monte Carlo (SMC) methods, and to Dr. Arnaud Doucet for contributing the idea presented in the second paragraph of Remark 3.3. T.M. is supported by ONR MURI N00014-20-1-2787. Z.Z., J.D., and G.E.K. are supported by the MURI/AFOSR FA9550-20-1-0358 project and the DOE-MMICS SEA-CROGS DE-SC0023191 award.

References

  • [1] S. Adams, N. Dirr, M. A. Peletier, and J. Zimmer, From a large-deviations principle to the Wasserstein gradient flow: a new micro-macro passage, Communications in Mathematical Physics, 307 (2011), pp. 791–815.
  • [2] M. Akian, S. Gaubert, and A. Lakhoua, The max-plus finite element method for solving deterministic optimal control problems: basic properties and convergence analysis, SIAM Journal on Control and Optimization, 47 (2008), pp. 817–848.
  • [3] A. Bachouch, C. Huré, N. Langrené, and H. Pham, Deep neural networks algorithms for stochastic control problems on finite horizon: numerical applications, Methodology and Computing in Applied Probability, 24 (2022), pp. 143–178.
  • [4] J. Berner, L. Richter, and K. Ullrich, An optimal control perspective on diffusion-based generative modeling, arXiv preprint arXiv:2211.01364, (2022).
  • [5] E. Bernton, J. Heng, A. Doucet, and P. E. Jacob, Schrödinger bridge samplers, arXiv preprint arXiv:1912.13170, (2019).
  • [6] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American statistical Association, 112 (2017), pp. 859–877.
  • [7] J. Bruna and J. Han, Posterior sampling with denoising oracles via tilted transport, arXiv preprint arXiv:2407.00745, (2024).
  • [8] A. Budhiraja and P. Dupuis, A variational representation for positive functionals of infinite dimensional Brownian motion, Probability and mathematical statistics-Wroclaw University, 20 (2000), pp. 39–61.
  • [9]  , Analysis and approximation of rare events: Representations and weak convergence methods, Probability Theory and Stochastic Modelling, vol. 94, Springer, 2019.
  • [10] A. Budhiraja, P. Dupuis, and V. Maroulas, Variational representations for continuous time processes, in Annales de l’IHP Probabilités et statistiques, vol. 47, 2011, pp. 725–747.
  • [11] D. Calvetti and E. Somersalo, An introduction to Bayesian scientific computing: ten lectures on subjective computing, vol. 2, Springer Science & Business Media, 2007.
  • [12] P. Catuogno and A. d. O. Gomes, Large deviations for Lévy diffusions in small regime, arXiv preprint arXiv:2207.07081, (2022).
  • [13] F. A. Chalub, L. Monsaingeon, A. M. Ribeiro, and M. O. Souza, Gradient flow formulations of discrete and continuous evolutionary models: a unifying perspective, Acta Applicandae Mathematicae, 171 (2021), p. 24.
  • [14] P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier, Deep relaxation: partial differential equations for optimizing deep neural networks, Research in the Mathematical Sciences, 5 (2018), pp. 1–30.
  • [15] P. Chen, J. Darbon, and T. Meng, Hopf-type representation formulas and efficient algorithms for certain high-dimensional optimal control problems, Computers & Mathematics with Applications, 161 (2024), pp. 90–120.
  • [16] P. Chen, T. Meng, Z. Zou, J. Darbon, and G. E. Karniadakis, Leveraging Hamilton-Jacobi PDEs with time-dependent Hamiltonians for continual scientific machine learning, in 6th Annual Learning for Dynamics & Control Conference, PMLR, 2024, pp. 1–12.
  • [17]  , Leveraging Multitime Hamilton–Jacobi PDEs for Certain Scientific Machine Learning Problems, SIAM Journal on Scientific Computing, 46 (2024), pp. C216–C248.
  • [18] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, Neural ordinary differential equations, Advances in neural information processing systems, 31 (2018).
  • [19] T. Chen, G.-H. Liu, and E. A. Theodorou, Likelihood training of Schrödinger bridge using forward-backward SDEs theory, arXiv preprint arXiv:2110.11291, (2021).
  • [20] R. Chetrite and H. Touchette, Variational and optimal control representations of conditioned and driven processes, Journal of Statistical Mechanics: Theory and Experiment, 2015 (2015), p. P12001.
  • [21] N. Chopin, O. Papaspiliopoulos, et al., An introduction to sequential Monte Carlo, vol. 4, Springer, 2020.
  • [22] J. Darbon and G. P. Langlois, On Bayesian posterior mean estimators in imaging sciences and Hamilton–Jacobi partial differential equations, Journal of Mathematical Imaging and Vision, 63 (2021), pp. 821–854.
  • [23] J. Darbon, G. P. Langlois, and T. Meng, Overcoming the curse of dimensionality for some Hamilton–Jacobi partial differential equations via neural network architectures, Research in the Mathematical Sciences, 7 (2020), p. 20.
  • [24] J. Darbon and S. Osher, Algorithms for overcoming the curse of dimensionality for certain Hamilton–Jacobi equations arising in control theory and elsewhere, Research in the Mathematical Sciences, 3 (2016), p. 19.
  • [25] A. de Acosta, Large deviations for vector-valued Lévy processes, Stochastic Processes and their Applications, 51 (1994), pp. 75–115.
  • [26] V. De Bortoli, J. Thornton, J. Heng, and A. Doucet, Diffusion Schrödinger bridge with applications to score-based generative modeling, Advances in Neural Information Processing Systems, 34 (2021), pp. 17695–17709.
  • [27] F. Delarue, D. Lacker, and K. Ramanan, From the master equation to mean field game limit theory: Large deviations and concentration of measure, (2020).
  • [28] T. Deveney, J. Stanczuk, L. M. Kreusser, C. Budd, and C.-B. Schönlieb, Closing the ODE-SDE gap in score-based diffusion models through the Fokker-Planck equation, arXiv preprint arXiv:2311.15996, (2023).
  • [29] M. Erbar, Gradient flows of the entropy for jump processes, in Annales de l’IHP Probabilités et statistiques, vol. 50, 2014, pp. 920–945.
  • [30] L. C. Evans, Partial differential equations, vol. 19, American Mathematical Society, 2022.
  • [31] M. Fathi, A gradient flow approach to large deviations for diffusion processes, Journal de Mathématiques Pures et Appliquées, 106 (2016), pp. 957–993.
  • [32] J. Feng and T. G. Kurtz, Large deviations for stochastic processes, no. 131, American Mathematical Soc., 2006.
  • [33] R. Flamary, N. Courty, and A. Gramfort et al., POT: Python optimal transport, Journal of Machine Learning Research, 22 (2021), pp. 1–8.
  • [34] W. H. Fleming, A stochastic control approach to some large deviations problems, in Recent Mathematical Methods in Dynamic Programming: Proceedings of the Conference held in Rome, Italy, March 26–28, 1984, Springer, 1985, pp. 52–66.
  • [35] W. H. Fleming and R. W. Rishel, Deterministic and stochastic optimal control, vol. 1, Springer Science & Business Media, 2012.
  • [36] W. H. Fleming and H. M. Soner, Asymptotic expansions for Markov processes with Levy generators, Applied Mathematics and Optimization, 19 (1989), pp. 203–223.
  • [37] Y. Gao and J.-G. Liu, Revisit of macroscopic dynamics for some non-equilibrium chemical reactions from a Hamiltonian viewpoint, Journal of Statistical Physics, 189 (2022), p. 22.
  • [38] M. Girolami and B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, Journal of the Royal Statistical Society Series B: Statistical Methodology, 73 (2011), pp. 123–214.
  • [39] M. Hamdouche, P. Henry-Labordere, and H. Pham, Generative modeling for time series via Schrödinger bridge, arXiv preprint arXiv:2304.05093, (2023).
  • [40] F. Han, S. Osher, and W. Li, Tensor train based sampling algorithms for approximating regularized Wasserstein proximal operators, arXiv preprint arXiv:2401.13125, (2024).
  • [41] J. Han, A. Jentzen, and W. E, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505–8510.
  • [42] C. Hartmann, L. Richter, C. Schütte, and W. Zhang, Variational characterization of free energy: Theory and algorithms, Entropy, 19 (2017), p. 626.
  • [43] H. Heaton, S. Wu Fung, and S. Osher, Global solutions to nonconvex problems by evolution of Hamilton-Jacobi PDEs, Communications on Applied Mathematics and Computation, (2023), pp. 1–21.
  • [44] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, Stochastic variational inference, Journal of Machine Learning Research, (2013).
  • [45] C. Hu and C. Shu, A discontinuous Galerkin finite element method for Hamilton–Jacobi equations, SIAM Journal on Scientific Computing, 21 (1999), pp. 666–690.
  • [46] Z. Hu, Z. Zhang, G. E. Karniadakis, and K. Kawaguchi, Score-based physics-informed neural networks for high-dimensional Fokker-Planck equations, arXiv preprint arXiv:2402.07465, (2024).
  • [47] E. R. Jakobsen and A. Rutkowski, The master equation for mean field game systems with fractional and nonlocal diffusions, arXiv preprint arXiv:2305.18867, (2023).
  • [48] G. Jiang and D. Peng, Weighted ENO schemes for Hamilton–Jacobi equations, SIAM Journal on Scientific Computing, 21 (2000), pp. 2126–2143.
  • [49] R. Jordan, D. Kinderlehrer, and F. Otto, The variational formulation of the Fokker–Planck equation, SIAM journal on mathematical analysis, 29 (1998), pp. 1–17.
  • [50] D. Kalise and K. Kunisch, Polynomial approximation of high-dimensional Hamilton–Jacobi–Bellman equations and applications to feedback control of semilinear parabolic PDEs, SIAM Journal on Scientific Computing, 40 (2018), pp. A629–A652.
  • [51] W. Kang and L. C. Wilcox, Mitigating the curse of dimensionality: sparse grid characteristics method for optimal feedback control and HJB equations, Computational Optimization and Applications, 68 (2017), pp. 289–315.
  • [52] T. Karras, M. Aittala, T. Aila, and S. Laine, Elucidating the design space of diffusion-based generative models, Advances in neural information processing systems, 35 (2022), pp. 26565–26577.
  • [53] T. Karras, M. Aittala, J. Lehtinen, J. Hellsten, T. Aila, and S. Laine, Analyzing and improving the training dynamics of diffusion models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 24174–24184.
  • [54] P. Kidger, J. Foster, X. Li, and T. J. Lyons, Neural SDEs as infinite-dimensional GANs, in International conference on machine learning, PMLR, 2021, pp. 5453–5463.
  • [55] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, (2014).
  • [56] P. E. Kloeden and E. Platen, Numerical solution of stochastic differential equations, vol. 23 of Applications of Mathematics (New York), Springer-Verlag, Berlin, 1992.
  • [57] V. N. Kolokoltsov, Nonlinear Markov processes and kinetic equations, vol. 182, Cambridge University Press, 2010.
  • [58] D. Kwon, Y. Fan, and K. Lee, Score-based generative modeling secretly minimizes the Wasserstein distance, Advances in Neural Information Processing Systems, 35 (2022), pp. 20205–20217.
  • [59] F. Léger, A geometric perspective on regularized optimal transport, Journal of Dynamics and Differential Equations, 31 (2019), pp. 1777–1791.
  • [60] F. Léger and W. Li, Hopf–Cole transformation via generalized Schrödinger bridge problem, Journal of Differential Equations, 274 (2021), pp. 788–827.
  • [61] C. Léonard, Large deviations for Poisson random measures and processes with independent increments, Stochastic processes and their applications, 85 (2000), pp. 93–121.
  • [62] W. Li, S. Liu, and S. Osher, A kernel formula for regularized Wasserstein proximal operators, arXiv preprint arXiv:2301.10301, (2023).
  • [63] T. M. Liggett, Continuous time Markov processes: an introduction, vol. 113, American Mathematical Soc., 2010.
  • [64] J. S. Liu, Monte Carlo strategies in scientific computing, vol. 10, Springer, 2001.
  • [65] X. Liu, T. Xiao, S. Si, Q. Cao, S. Kumar, and C.-J. Hsieh, Neural SDE: Stabilizing neural ODE networks with stochastic noise, arXiv preprint arXiv:1906.02355, (2019).
  • [66] J. Lu, Y. Wu, and Y. Xiang, Score-based transport modeling for mean-field Fokker-Planck equations, Journal of Computational Physics, 503 (2024), p. 112859.
  • [67] J. Lynch and J. Sethuraman, Large deviations for processes with independent increments, The annals of probability, 15 (1987), pp. 610–627.
  • [68] W. M. McEneaney, Max-plus methods for nonlinear control and estimation, vol. 2, Springer Science & Business Media, 2006.
  • [69] X. Meng, L. Yang, Z. Mao, J. del Águila Ferrandis, and G. E. Karniadakis, Learning functional priors and posteriors from data and physics, Journal of Computational Physics, 457 (2022), p. 111073.
  • [70] A. Mielke, A gradient structure for reaction–diffusion systems and for energy-drift-diffusion systems, Nonlinearity, 24 (2011), p. 1329.
  • [71] A. Mielke, M. A. Peletier, and D. M. Renger, On the relation between gradient flows and the large-deviation principle, with applications to Markov chains and diffusion, Potential Analysis, 41 (2014), pp. 1293–1327.
  • [72] A. Mielke, D. M. Renger, and M. A. Peletier, A generalization of Onsager’s reciprocity relations to gradient flows with nonlinear mobility, Journal of Non-Equilibrium Thermodynamics, 41 (2016), pp. 141–149.
  • [73] T. Nakamura-Zimmerer, Q. Gong, and W. Kang, QRnet: Optimal regulator design with LQR-augmented neural networks, IEEE Control Systems Letters, 5 (2021), pp. 1303–1308.
  • [74] R. M. Neal, MCMC using Hamiltonian dynamics, arXiv preprint arXiv:1206.1901, (2012).
  • [75] S. Osher, H. Heaton, and S. Wu Fung, A Hamilton–Jacobi-based proximal operator, Proceedings of the National Academy of Sciences, 120 (2023), p. e2220469120.
  • [76] S. Osher and C. Shu, High-order essentially nonoscillatory schemes for Hamilton-Jacobi equations, SIAM Journal on Numerical Analysis, 28 (1991), pp. 907–922.
  • [77] G. A. Pavliotis, Stochastic processes and applications, Texts in Applied Mathematics, 60 (2014).
  • [78] M. A. Peletier, Variational modelling: Energies, gradient flows, and large deviations, arXiv preprint arXiv:1402.1990, (2014).
  • [79] A. F. Psaros, X. Meng, Z. Zou, L. Guo, and G. E. Karniadakis, Uncertainty quantification in scientific machine learning: Methods, metrics, and comparisons, Journal of Computational Physics, 477 (2023), p. 111902.
  • [80] R. Ranganath, S. Gerrish, and D. Blei, Black box variational inference, in Artificial intelligence and statistics, PMLR, 2014, pp. 814–822.
  • [81] D. Renger, Microscopic interpretation of Wasserstein gradient flows, (2013).
  • [82] C. J. Roy and W. L. Oberkampf, A comprehensive framework for verification, validation, and uncertainty quantification in scientific computing, Computer methods in applied mechanics and engineering, 200 (2011), pp. 2131–2144.
  • [83] F. Santambrogio, Optimal transport for applied mathematicians, Birkäuser, NY, 55 (2015), p. 94.
  • [84] S.-J. Sheu, Stochastic control and exit probabilities of jump processes, SIAM journal on control and optimization, 23 (1985), pp. 306–328.
  • [85] Y. Shi, V. De Bortoli, A. Campbell, and A. Doucet, Diffusion Schrödinger bridge matching, Advances in Neural Information Processing Systems, 36 (2024).
  • [86] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, Deep unsupervised learning using nonequilibrium thermodynamics, in International conference on machine learning, PMLR, 2015, pp. 2256–2265.
  • [87] V. R. Somnath, M. Pariset, Y.-P. Hsieh, M. R. Martinez, A. Krause, and C. Bunne, Aligned diffusion Schrödinger bridges, in Uncertainty in Artificial Intelligence, PMLR, 2023, pp. 1985–1995.
  • [88] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever, Consistency models, arXiv preprint arXiv:2303.01469, (2023).
  • [89] Y. Song and S. Ermon, Improved techniques for training score-based generative models, Advances in neural information processing systems, 33 (2020), pp. 12438–12448.
  • [90] Y. Song, S. Garg, J. Shi, and S. Ermon, Sliced score matching: A scalable approach to density and score estimation, in Uncertainty in Artificial Intelligence, PMLR, 2020, pp. 574–584.
  • [91] Y. Song, L. Shen, L. Xing, and S. Ermon, Solving inverse problems in medical imaging with score-based generative models, arXiv preprint arXiv:2111.08005, (2021).
  • [92] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, Score-based generative modeling through stochastic differential equations, arXiv preprint arXiv:2011.13456, (2020).
  • [93] H. Y. Tan, S. Osher, and W. Li, Noise-free sampling algorithms via regularized Wasserstein proximals, arXiv preprint arXiv:2308.14945, (2023).
  • [94] P. Vincent, A connection between score matching and denoising autoencoders, Neural computation, 23 (2011), pp. 1661–1674.
  • [95] P. Virtanen, R. Gommers, and T. E. Oliphant et al., SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python, Nature Methods, 17 (2020), pp. 261–272.
  • [96] U. Von Toussaint, Bayesian inference in physics, Reviews of Modern Physics, 83 (2011), pp. 943–999.
  • [97] K. P. Wabersich, A. J. Taylor, J. J. Choi, K. Sreenath, C. J. Tomlin, A. D. Ames, and M. N. Zeilinger, Data-driven safety filters: Hamilton-Jacobi reachability, control barrier functions, and predictive methods for uncertain systems, IEEE Control Systems Magazine, 43 (2023), pp. 137–177.
  • [98] Y. Wang, L. Wang, Y. Shen, Y. Wang, H. Yuan, Y. Wu, and Q. Gu, Protein conformation generation via force-guided SE(3) diffusion models, arXiv preprint arXiv:2403.14088, (2024).
  • [99] L. Wei, H. Feng, P. Hu, T. Zhang, Y. Yang, X. Zheng, R. Feng, D. Fan, and T. Wu, Closed-loop diffusion control of complex physical systems, arXiv preprint arXiv:2408.03124, (2024).
  • [100] L. Wei, P. Hu, R. Feng, H. Feng, Y. Du, T. Zhang, R. Wang, Y. Wang, Z.-M. Ma, and T. Wu, A generative approach to control complex physical systems, arXiv preprint arXiv:2407.06494, (2024).
  • [101] M. Welling and Y. W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, in Proceedings of the 28th international conference on machine learning (ICML-11), Citeseer, 2011, pp. 681–688.
  • [102] T. Wu, W. Neiswanger, H. Zheng, S. Ermon, and J. Leskovec, Uncertainty quantification for forward and inverse problems of PDEs via latent global evolution, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 320–328.
  • [103] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, Diffusion models: A comprehensive survey of methods and applications, ACM Computing Surveys, 56 (2023), pp. 1–39.
  • [104] I. Yegorov and P. M. Dower, Perspectives on characteristics based curse-of-dimensionality-free numerical approaches for solving Hamilton–Jacobi equations, Applied Mathematics & Optimization, 83 (2021), pp. 1–49.
  • [105] J. Yong and X. Y. Zhou, Stochastic controls: Hamiltonian systems and HJB equations, vol. 43, Springer Science & Business Media, 2012.
  • [106] B. Zhang, T. Sahai, and Y. Marzouk, Sampling via controlled stochastic dynamical systems, in I (Still) Can’t Believe It’s Not Better! NeurIPS 2021 Workshop, 2021.
  • [107] B. J. Zhang and M. A. Katsoulakis, A mean-field games laboratory for generative modeling, arXiv preprint arXiv:2304.13534, (2023).
  • [108] B. J. Zhang, S. Liu, W. Li, M. A. Katsoulakis, and S. J. Osher, Wasserstein proximal operators describe score-based generative models and resolve memorization, arXiv preprint arXiv:2402.06162, (2024).
  • [109] B. J. Zhang, T. Sahai, and Y. M. Marzouk, A Koopman framework for rare event simulation in stochastic differential equations, Journal of Computational Physics, 456 (2022), p. 111025.
  • [110] D. Zhang, R. T. Q. Chen, C.-H. Liu, A. Courville, and Y. Bengio, Diffusion generative flow samplers: Improving learning signals through partial trajectory optimization, arXiv preprint arXiv:2310.02679, (2023).
  • [111] M. Zhou, S. Osher, and W. Li, A deep learning algorithm for computing mean field control problems via forward-backward score dynamics, arXiv preprint arXiv:2401.09547, (2024).
  • [112] Z. Zou and G. E. Karniadakis, L-HYDRA: Multi-head physics-informed neural networks, arXiv preprint arXiv:2301.02152, (2023).
  • [113] Z. Zou, T. Meng, P. Chen, J. Darbon, and G. E. Karniadakis, Leveraging viscous Hamilton-Jacobi PDEs for uncertainty quantification in scientific machine learning, arXiv preprint arXiv:2404.08809, (2024).
  • [114] Z. Zou, X. Meng, and G. E. Karniadakis, Uncertainty quantification for noisy inputs-outputs in physics-informed neural networks and neural operators, arXiv preprint arXiv:2311.11262, (2023).
  • [115]  , Correcting model misspecification in physics-informed neural networks (PINNs), Journal of Computational Physics, 505 (2024), p. 112918.
  • [116] Z. Zou, X. Meng, A. F. Psaros, and G. E. Karniadakis, NeuralUQ: A comprehensive library for uncertainty quantification in neural differential equations and operators, SIAM Review, 66 (2024), pp. 161–190.

Appendix A Log transform applied to a general SDE

We consider the stochastic process XtX_{t}, represented as an SDE in n\mathbb{R}^{n}, defined by dXt=b(Xt,t)dt+ϵσ(Xt,t)dWtdX_{t}=b(X_{t},t)dt+\sqrt{\epsilon}\sigma(X_{t},t)dW_{t}. Here, WtW_{t} denotes a Brownian motion in m\mathbb{R}^{m}. The functions b:n×[0,T]nb:\mathbb{R}^{n}\times[0,T]\to\mathbb{R}^{n} and σ:n×[0,T]n×m\sigma:\mathbb{R}^{n}\times[0,T]\to\mathbb{R}^{n\times m} are assumed to be sufficiently regular that the SDE is well-posed. We introduce D:n×[0,T]n×nD:\mathbb{R}^{n}\times[0,T]\to\mathbb{R}^{n\times n} defined by D(x,t)=σ(x,t)σ(x,t)TD(x,t)=\sigma(x,t)\sigma(x,t)^{T} for any xnx\in\mathbb{R}^{n} and t[0,T]t\in[0,T]. The constant ϵ\epsilon in the SDE is a positive value that indicates the level of stochasticity and corresponds to the hyperparameter ϵ\epsilon in the general definition of the operator 𝒜ϵ,t\mathcal{A}_{\epsilon,t} in Section 2. The specific formulation of 𝒜ϵ,t\mathcal{A}_{\epsilon,t} in this SDE context and its adjoint 𝒜ϵ,t\mathcal{A}_{\epsilon,t}^{*} are described by the equations:

𝒜ϵ,tf=b(x,t)xf+ϵ2Tr(D(x,t)x2f),𝒜ϵ,tf=x(b(x,t)f)+ϵ2i,j=1n2(Dij(x,t)f)xixj,\mathcal{A}_{\epsilon,t}f=b(x,t)\cdot\nabla_{x}f+\frac{\epsilon}{2}\operatorname{Tr}(D(x,t)\nabla_{x}^{2}f),\quad\mathcal{A}_{\epsilon,t}^{*}f=-\nabla_{x}\cdot(b(x,t)f)+\frac{\epsilon}{2}\sum_{i,j=1}^{n}\frac{\partial^{2}(D_{ij}(x,t)f)}{\partial x_{i}\partial x_{j}}, (38)

with Dij(x,t)D_{ij}(x,t) representing the (i,j)(i,j)-th component of the matrix D(x,t)D(x,t). In this appendix, if the variables in a function or a formula are not explicitly specified, they default to (x,t)(x,t).
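For concreteness, a sample path of such an SDE can be generated with the Euler–Maruyama scheme. The sketch below is illustrative: the drift and diffusion callables are placeholders for a user-supplied model, and the step size is not adaptive.

```python
import numpy as np

def euler_maruyama(b, sigma, z0, T=1.0, dt=1e-2, eps=0.01, rng=None):
    """Simulate dX_t = b(X_t, t) dt + sqrt(eps) sigma(X_t, t) dW_t.

    b:     callable (x, t) -> array of shape (n,)
    sigma: callable (x, t) -> array of shape (n, m)
    """
    rng = np.random.default_rng(rng)
    n_steps = int(round(T / dt))
    x = np.asarray(z0, dtype=float)
    path = [x.copy()]
    for k in range(n_steps):
        t = k * dt
        sig = sigma(x, t)
        dW = np.sqrt(dt) * rng.standard_normal(sig.shape[1])  # m-dim Brownian increment
        x = x + b(x, t) * dt + np.sqrt(eps) * sig @ dW
        path.append(x.copy())
    return np.stack(path)  # shape (n_steps + 1, n)
```

For instance, `b = lambda x, t: -x` with `sigma = lambda x, t: np.eye(2)` simulates a two-dimensional Ornstein–Uhlenbeck process.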

The KBE and KFE in (1) are given by:

tμ=b(x,Tt)xμ+ϵ2Tr(D(x,Tt)x2μ),tν=x(b(x,t)ν)+ϵ2i,j=1n2(Dij(x,t)ν)xixj,\partial_{t}\mu=b(x,T-t)\cdot\nabla_{x}\mu+\frac{\epsilon}{2}\operatorname{Tr}(D(x,T-t)\nabla_{x}^{2}\mu),\quad\partial_{t}\nu=-\nabla_{x}\cdot(b(x,t)\nu)+\frac{\epsilon}{2}\sum_{i,j=1}^{n}\frac{\partial^{2}(D_{ij}(x,t)\nu)}{\partial x_{i}\partial x_{j}}, (39)

with all calculations based on the variables (x,t)(x,t). The coupled nonlinear differential equations in (4) are:

{tρ+x((b(x,t)+D(x,t)xS)ρ)=ϵ2i,j=1nxixj(ρDij(x,t)),tS+b(x,t)xS+12(xS)TD(x,t)xS+ϵ2Tr(D(x,t)x2S)=0,\begin{dcases}\partial_{t}\rho+\nabla_{x}\cdot((b(x,t)+D(x,t)\nabla_{x}S)\rho)=\frac{\epsilon}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}(\rho D_{ij}(x,t)),\\ \partial_{t}S+b(x,t)\cdot\nabla_{x}S+\frac{1}{2}(\nabla_{x}S)^{T}D(x,t)\nabla_{x}S+\frac{\epsilon}{2}\operatorname{Tr}(D(x,t)\nabla_{x}^{2}S)=0,\end{dcases} (40)

where calculations and functions are evaluated at (x,t)(x,t).

With terminal condition J-J on SS and initial condition δz0\delta_{z_{0}} on ρ\rho, these equations form the first-order optimality conditions for the following stochastic optimal control problem:

min{𝔼[0T12vsTD(Zs,s)1vs𝑑s+J(ZT)]:dZs=(b(Zs,s)+vs)ds+ϵσ(Zs,s)dWs,Z0=z0},\min\left\{\mathbb{E}\left[\int_{0}^{T}\frac{1}{2}v_{s}^{T}D(Z_{s},s)^{-1}v_{s}ds+J(Z_{T})\right]\colon dZ_{s}=(b(Z_{s},s)+v_{s})ds+\sqrt{\epsilon}\sigma(Z_{s},s)dW_{s},Z_{0}=z_{0}\right\}, (41)

whose value equals S(z0,0)-S(z_{0},0). Alterations in initial or terminal conditions for ρ\rho and SS relate these differential equations to specific MFGs or SOT problems, as discussed previously in Example 2.1. Details of the computations are provided in the following section.
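Numerically, the sampling stage then amounts to discretizing the controlled dynamics in (41) with the optimal feedback control vs=D(Zs,s)xS(Zs,s)v_{s}=D(Z_{s},s)\nabla_{x}S(Z_{s},s). A hedged sketch of one such step follows; `grad_S` is a placeholder for whatever approximation of xS\nabla_{x}S is available (e.g., from a Riccati solver or a learned score), and the explicit Euler–Maruyama discretization is one choice among several.

```python
import numpy as np

def controlled_em_step(z, t, b, sigma, grad_S, dt, eps, rng):
    # One Euler-Maruyama step of the controlled SDE
    #   dZ_s = (b(Z_s, s) + D(Z_s, s) grad_x S(Z_s, s)) ds + sqrt(eps) sigma(Z_s, s) dW_s,
    # where D = sigma sigma^T and v_s = D grad_x S is the optimal feedback control.
    sig = sigma(z, t)
    D = sig @ sig.T
    dW = np.sqrt(dt) * rng.standard_normal(sig.shape[1])
    return z + (b(z, t) + D @ grad_S(z, t)) * dt + np.sqrt(eps) * sig @ dW
```

Iterating this step from Z0=z0Z_{0}=z_{0} up to time TT produces one sample of the controlled trajectory.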

A.1 Computational details

By the definition of 𝒜ϵ,t\mathcal{A}_{\epsilon,t}, we have

eSϵ𝒜ϵ,teSϵ=eSϵ(b(x,t)xeSϵ+ϵ2i,j=1nDij(x,t)xixjeSϵ)=eSϵ(eSϵϵb(x,t)xS+ϵ2i,j=1nDij(x,t)eSϵ((xiS)(xjS)ϵ2+xixjSϵ))=b(x,t)xSϵ+12ϵi,j=1nDij(x,t)(xiS)(xjS)+12i,j=1nDij(x,t)(xixjS)=b(x,t)xSϵ+12ϵ(xS)TD(x,t)xS+12Tr(D(x,t)x2S).\begin{split}e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}&=e^{-\frac{S}{\epsilon}}\left(b(x,t)\cdot\nabla_{x}e^{\frac{S}{\epsilon}}+\frac{\epsilon}{2}\sum_{i,j=1}^{n}D_{ij}(x,t)\partial_{x_{i}}\partial_{x_{j}}e^{\frac{S}{\epsilon}}\right)\\ &=e^{-\frac{S}{\epsilon}}\left(\frac{e^{\frac{S}{\epsilon}}}{\epsilon}b(x,t)\cdot\nabla_{x}S+\frac{\epsilon}{2}\sum_{i,j=1}^{n}D_{ij}(x,t)e^{\frac{S}{\epsilon}}\left(\frac{(\partial_{x_{i}}S)(\partial_{x_{j}}S)}{\epsilon^{2}}+\frac{\partial_{x_{i}}\partial_{x_{j}}S}{\epsilon}\right)\right)\\ &=\frac{b(x,t)\cdot\nabla_{x}S}{\epsilon}+\frac{1}{2\epsilon}\sum_{i,j=1}^{n}D_{ij}(x,t)(\partial_{x_{i}}S)(\partial_{x_{j}}S)+\frac{1}{2}\sum_{i,j=1}^{n}D_{ij}(x,t)(\partial_{x_{i}}\partial_{x_{j}}S)\\ &=\frac{b(x,t)\cdot\nabla_{x}S}{\epsilon}+\frac{1}{2\epsilon}(\nabla_{x}S)^{T}D(x,t)\nabla_{x}S+\frac{1}{2}\operatorname{Tr}(D(x,t)\nabla_{x}^{2}S).\end{split} (42)

Therefore, the HJ equation in (4) becomes

0=tS+ϵeSϵ𝒜ϵ,teSϵ=tS+b(x,t)xS+12(xS)TD(x,t)xS+ϵ2Tr(D(x,t)x2S).0=\partial_{t}S+\epsilon e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}=\partial_{t}S+b(x,t)\cdot\nabla_{x}S+\frac{1}{2}(\nabla_{x}S)^{T}D(x,t)\nabla_{x}S+\frac{\epsilon}{2}\operatorname{Tr}(D(x,t)\nabla_{x}^{2}S). (43)

Similarly, according to the formula for 𝒜ϵ,t\mathcal{A}_{\epsilon,t}^{*}, we get

eSϵ𝒜ϵ,t(ρeSϵ)=eSϵ(x(b(x,t)ρeSϵ)+ϵ2i,j=1n2(Dij(x,t)ρeSϵ)xixj)=eSϵ(eSϵx(b(x,t)ρ)+ρeSϵb(x,t)xSϵ+ϵ2i,j=1n(eSϵxixj(ρDij(x,t))2ϵeSϵ(xiS)(xj(ρDij(x,t)))+ρDij(x,t)eSϵ((xiS)(xjS)ϵ2xixjSϵ)))=x(b(x,t)ρ)+ρb(x,t)xSϵ+ϵ2i,j=1n(xixj(ρDij(x,t))2ϵ(xiS)(xj(ρDij(x,t)))+ρDij(x,t)((xiS)(xjS)ϵ2xixjSϵ)).\begin{split}e^{\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}^{*}(\rho e^{-\frac{S}{\epsilon}})&=e^{\frac{S}{\epsilon}}\left(-\nabla_{x}\cdot(b(x,t)\rho e^{-\frac{S}{\epsilon}})+\frac{\epsilon}{2}\sum_{i,j=1}^{n}\frac{\partial^{2}(D_{ij}(x,t)\rho e^{-\frac{S}{\epsilon}})}{\partial x_{i}\partial x_{j}}\right)\\ &=e^{\frac{S}{\epsilon}}\Bigg{(}-e^{-\frac{S}{\epsilon}}\nabla_{x}\cdot(b(x,t)\rho)+\rho e^{-\frac{S}{\epsilon}}b(x,t)\cdot\frac{\nabla_{x}S}{\epsilon}+\frac{\epsilon}{2}\sum_{i,j=1}^{n}\Big{(}e^{-\frac{S}{\epsilon}}\partial_{x_{i}}\partial_{x_{j}}(\rho D_{ij}(x,t))\\ &\quad\quad\quad\quad-\frac{2}{\epsilon}e^{-\frac{S}{\epsilon}}(\partial_{x_{i}}S)(\partial_{x_{j}}(\rho D_{ij}(x,t)))+\rho D_{ij}(x,t)e^{-\frac{S}{\epsilon}}\Big{(}\frac{(\partial_{x_{i}}S)(\partial_{x_{j}}S)}{\epsilon^{2}}-\frac{\partial_{x_{i}}\partial_{x_{j}}S}{\epsilon}\Big{)}\Big{)}\Bigg{)}\\ &=-\nabla_{x}\cdot(b(x,t)\rho)+\rho b(x,t)\cdot\frac{\nabla_{x}S}{\epsilon}+\frac{\epsilon}{2}\sum_{i,j=1}^{n}\Big{(}\partial_{x_{i}}\partial_{x_{j}}(\rho D_{ij}(x,t))\\ &\quad\quad\quad\quad-\frac{2}{\epsilon}(\partial_{x_{i}}S)(\partial_{x_{j}}(\rho D_{ij}(x,t)))+\rho D_{ij}(x,t)\Big{(}\frac{(\partial_{x_{i}}S)(\partial_{x_{j}}S)}{\epsilon^{2}}-\frac{\partial_{x_{i}}\partial_{x_{j}}S}{\epsilon}\Big{)}\Big{)}.\end{split} (44)

Then, we have

\begin{split}\rho e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}-e^{\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}^{*}(\rho e^{-\frac{S}{\epsilon}})&=\frac{\rho b(x,t)\cdot\nabla_{x}S}{\epsilon}+\frac{\rho}{2\epsilon}(\nabla_{x}S)^{T}D(x,t)\nabla_{x}S+\frac{\rho}{2}\operatorname{Tr}(D(x,t)\nabla_{x}^{2}S)+\nabla_{x}\cdot(b(x,t)\rho)\\ &\quad\quad-\rho b(x,t)\cdot\frac{\nabla_{x}S}{\epsilon}-\frac{\epsilon}{2}\sum_{i,j=1}^{n}\Big{(}\partial_{x_{i}}\partial_{x_{j}}(\rho D_{ij}(x,t))-\frac{2}{\epsilon}(\partial_{x_{i}}S)(\partial_{x_{j}}(\rho D_{ij}(x,t)))\\ &\quad\quad+\rho D_{ij}(x,t)\Big{(}\frac{(\partial_{x_{i}}S)(\partial_{x_{j}}S)}{\epsilon^{2}}-\frac{\partial_{x_{i}}\partial_{x_{j}}S}{\epsilon}\Big{)}\Big{)}\\ &=\rho\operatorname{Tr}(D(x,t)\nabla_{x}^{2}S)+\nabla_{x}\cdot(b(x,t)\rho)-\frac{\epsilon}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}(\rho D_{ij}(x,t))+\sum_{i,j=1}^{n}(\partial_{x_{i}}S)(\partial_{x_{j}}(\rho D_{ij}(x,t)))\\ &=\nabla_{x}\cdot((b(x,t)+D(x,t)\nabla_{x}S)\rho)-\frac{\epsilon}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}(\rho D_{ij}(x,t)).\end{split} (45)

Therefore, the Fokker-Planck equation in (4) becomes

0=\partial_{t}\rho+\rho e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}-e^{\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}^{*}(\rho e^{-\frac{S}{\epsilon}})=\partial_{t}\rho+\nabla_{x}\cdot((b(x,t)+D(x,t)\nabla_{x}S)\rho)-\frac{\epsilon}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}(\rho D_{ij}(x,t)). (46)

Appendix B The log transform for a one-dimensional scaled Poisson process

In this section, we consider a stochastic process defined as a one-dimensional scaled Poisson process, namely $X_{t}=\epsilon N(t)$, where $N(t)$ denotes a Poisson process with rate $\lambda=\frac{1}{\epsilon}$. Similar to the approach in Section 2.1, we define the linear operator $\mathcal{A}_{\epsilon,t}$ as the infinitesimal generator of $X_{t}$. Consequently, the operator $\mathcal{A}_{\epsilon,t}$ and its adjoint, $\mathcal{A}_{\epsilon,t}^{*}$, are represented by:

\mathcal{A}_{\epsilon,t}f=\frac{f(x+\epsilon,t)-f(x,t)}{\epsilon},\quad\quad\mathcal{A}_{\epsilon,t}^{*}f=\frac{f(x-\epsilon,t)-f(x,t)}{\epsilon}.

This results in specific adaptations of the general coupled linear system (1) and the general coupled nonlinear system (4) to the case of the scaled Poisson process, represented by the following equations:

\partial_{t}\mu(x,t)=\frac{\mu(x+\epsilon,t)-\mu(x,t)}{\epsilon},\quad\quad\partial_{t}\nu(x,t)=\frac{\nu(x-\epsilon,t)-\nu(x,t)}{\epsilon}, (47)

for the linear system, and

\begin{dcases}0=\partial_{t}\rho+\rho e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}-e^{\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}^{*}(\rho e^{-\frac{S}{\epsilon}})=\partial_{t}\rho(x,t)+\frac{1}{\epsilon}\left(e^{\frac{S(x+\epsilon,t)-S(x,t)}{\epsilon}}\rho(x,t)-\rho(x-\epsilon,t)e^{\frac{S(x,t)-S(x-\epsilon,t)}{\epsilon}}\right),\\ 0=\partial_{t}S+\epsilon e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}=\partial_{t}S(x,t)+\exp\left(\frac{S(x+\epsilon,t)-S(x,t)}{\epsilon}\right)-1,\end{dcases} (48)

for the nonlinear system.

With the terminal condition $-J$ on $S$ and the initial condition $\delta_{z_{0}}$ on $\rho$, these PDEs serve as the first-order optimality conditions for the subsequent stochastic optimal control problem:

\begin{split}\min\Bigg{\{}\int_{0}^{T}\int_{\mathbb{R}}\left(g(x,s)\log g(x,s)-g(x,s)+1\right)\rho(x,s)dxds+\int_{\mathbb{R}}J(x)\rho(x,T)dx\colon\\ \partial_{t}\rho(x,t)=\frac{g(x-\epsilon,t)\rho(x-\epsilon,t)-g(x,t)\rho(x,t)}{\epsilon},\ \rho(x,0)=\delta_{z_{0}}(x)\Bigg{\}}.\end{split} (49)

The optimal control function $g$ satisfies the relation $g(x,t)=\exp\left(\frac{S(x+\epsilon,t)-S(x,t)}{\epsilon}\right)$. This represents a stochastic optimal control scenario where the control influences the jump rate of a process through the function $g$, and the running loss is derived from the entropy function of the rate $g$. As in the SDE context, varying the initial and terminal conditions on $\rho$ and $S$ associates these two PDEs with specific MFGs and Schrödinger bridge problems whose underlying stochastic process is a controlled jump process, given that $X_{t}$ is a jump process. We defer exploration of more complicated jump processes to future research.
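As a quick numerical sanity check of the uncontrolled dynamics above, the following sketch simulates the scaled Poisson process $X_{t}=\epsilon N(t)$ with rate $\lambda=1/\epsilon$ and verifies its first two moments, $\mathbb{E}[X_{t}]=t$ and $\operatorname{Var}[X_{t}]=\epsilon t$. The parameter values are illustrative only.

```python
import numpy as np

def simulate_scaled_poisson(eps, t, n_paths, rng):
    """Simulate X_t = eps * N(t), where N is a Poisson process of rate 1/eps."""
    # N(t) ~ Poisson(t / eps), so X_t = eps * N(t) has mean t and variance eps * t.
    counts = rng.poisson(lam=t / eps, size=n_paths)
    return eps * counts

rng = np.random.default_rng(0)
eps, t = 0.1, 2.0
x = simulate_scaled_poisson(eps, t, n_paths=200_000, rng=rng)
print(x.mean(), x.var())  # ~2.0 and ~0.2
```

In the controlled problem (49), the base rate $1/\epsilon$ would instead be modulated by the factor $g(x,t)$, which this sketch does not attempt.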

Appendix C HJ-sampler for SDE cases

In this section, we focus on the scenario where the stochastic process $Y_{t}$ is described by the general SDE $dY_{t}=b(Y_{t},t)\,dt+\sqrt{\epsilon}\sigma(Y_{t},t)\,dW_{t}$, where $W_{t}$ is an $m$-dimensional Brownian motion. Here, $b\colon\mathbb{R}^{n}\times[0,T]\to\mathbb{R}^{n}$ and $\sigma\colon\mathbb{R}^{n}\times[0,T]\to\mathbb{R}^{n\times m}$ are functions ensuring that the SDE is well-defined. We define $D(x,t)=\sigma(x,t)\sigma(x,t)^{T}$. Unless specified otherwise, the input variables in this appendix are assumed to be $(x,t)$. The constant $\epsilon$ in the SDE corresponds to the hyperparameter $\epsilon$ in the general definition of the operator $\mathcal{A}_{\epsilon,T-t}$ in Section 2. In this case, the infinitesimal generator of $Y_{t}$ is defined as $f\mapsto b(x,t)\cdot\nabla_{x}f+\frac{\epsilon}{2}\operatorname{Tr}(D(x,t)\nabla^{2}f)$. Consequently, the linear operator $\mathcal{A}_{\epsilon,T-t}$ and its adjoint $\mathcal{A}_{\epsilon,T-t}^{*}$ are defined by

\mathcal{A}_{\epsilon,T-t}f=-\nabla_{x}\cdot(b(x,t)f)+\frac{\epsilon}{2}\sum_{i,j=1}^{n}\frac{\partial^{2}(D_{ij}(x,t)f)}{\partial x_{i}\partial x_{j}},\quad\mathcal{A}_{\epsilon,T-t}^{*}f=b(x,t)\cdot\nabla_{x}f+\frac{\epsilon}{2}\operatorname{Tr}(D(x,t)\nabla_{x}^{2}f), (50)

where $D_{ij}(x,t)$ denotes the $(i,j)$-th element of the matrix $D(x,t)$. The HJ equation (13) becomes

\begin{dcases}\partial_{t}S-b(x,T-t)\cdot\nabla_{x}S+\epsilon\sum_{i,j=1}^{n}(\partial_{x_{i}}S)(\partial_{x_{j}}D_{ij}(x,T-t))+\frac{1}{2}(\nabla_{x}S)^{T}D(x,T-t)\nabla_{x}S\\ \quad\quad\quad\quad+\frac{\epsilon}{2}\operatorname{Tr}(D(x,T-t)\nabla_{x}^{2}S)+\frac{\epsilon^{2}}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}D_{ij}(x,T-t)-\epsilon\nabla_{x}\cdot b(x,T-t)=0,\\ S(x,T)=\epsilon\log P_{prior}(x),\end{dcases} (51)

which is a traditional viscous HJ PDE. The Fokker-Planck equation in (14) becomes

\begin{dcases}\partial_{t}\rho-\nabla_{x}\cdot(b(x,T-t)\rho)+\sum_{i,j=1}^{n}\partial_{x_{i}}\left(\rho(D_{ij}(x,T-t)\partial_{x_{j}}S+\epsilon\partial_{x_{j}}D_{ij}(x,T-t))\right)\\ \quad\quad\quad\quad-\frac{\epsilon}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}(D_{ij}(x,T-t)\rho)=0,\\ \rho(x,0)=\delta_{y_{obs}}(x).\end{dcases} (52)

Similar to the case in Appendix A, these two PDEs are related to the following stochastic optimal control problem:

\begin{split}\min\Bigg{\{}\mathbb{E}\Bigg{[}\int_{0}^{T}\Bigg{(}\frac{1}{2}v_{s}^{T}D_{T-s}(Z_{s})^{-1}v_{s}+\epsilon\nabla_{x}\cdot b_{T-s}(Z_{s})-\frac{\epsilon^{2}}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}D_{ij,T-s}(Z_{s})\Bigg{)}ds-\epsilon\log P_{prior}(Z_{T})\Bigg{]}\colon\\ dZ_{s}=(-b(Z_{s},T-s)+\epsilon\nabla_{x}\cdot D(Z_{s},T-s)+v_{s})ds+\sqrt{\epsilon}\sigma(Z_{s},T-s)dW_{s},Z_{0}=y_{obs}\Bigg{\}},\end{split} (53)

where $\nabla_{x}\cdot D(x,t)$ represents a vector-valued function whose $i$-th component is $\nabla_{x}\cdot D_{i}(x,t)$, with $D_{i}(x,t)$ being the $i$-th column of $D(x,t)$. To simplify notation, we denote $b(x,t)$ and $D_{ij}(x,t)$ by $b_{t}(x)$ and $D_{ij,t}(x)$, respectively. In the scenario where $D$ is non-invertible, the following optimal control problem arises:

\begin{split}\min\Bigg{\{}\mathbb{E}\Bigg{[}\int_{0}^{T}\Bigg{(}\frac{1}{2}u_{s}^{T}D_{T-s}(Z_{s})u_{s}+\epsilon\nabla_{x}\cdot b_{T-s}(Z_{s})-\frac{\epsilon^{2}}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}D_{ij,T-s}(Z_{s})\Bigg{)}ds-\epsilon\log P_{prior}(Z_{T})\Bigg{]}\colon\\ dZ_{s}=(-b_{T-s}(Z_{s})+\epsilon\nabla_{x}\cdot D(Z_{s},T-s)+D_{T-s}(Z_{s})u_{s})ds+\sqrt{\epsilon}\sigma_{T-s}(Z_{s})dW_{s},Z_{0}=y_{obs}\Bigg{\}}.\end{split} (54)

The proposed HJ-sampler algorithm consists of two main steps:

  1. Numerically solve the viscous HJ PDE (51) to obtain $S$;

  2. Generate samples from $\rho$, which satisfies (52), by sampling from the controlled SDE given below:

    \begin{split}dZ_{t}&=\left(D(Z_{t},T-t)\nabla_{x}S(Z_{t},t)-b(Z_{t},T-t)+\epsilon\sum_{j=1}^{n}\partial_{x_{j}}D_{j}(Z_{t},T-t)\right)dt+\sqrt{\epsilon}\sigma(Z_{t},T-t)dW_{t},\\ Z_{0}&=y_{obs},\end{split} (55)

    where $D_{j}(x,t)$ denotes the $j$-th column of the matrix $D(x,t)$. In practice, the Euler–Maruyama method is employed to discretize this SDE as follows:

    \begin{split}Z_{k+1}&=Z_{k}+\left(D(Z_{k},T-t_{k})\nabla_{x}S(Z_{k},t_{k})-b(Z_{k},T-t_{k})+\epsilon\sum_{j=1}^{n}\partial_{x_{j}}D_{j}(Z_{k},T-t_{k})\right)\Delta t+\sqrt{\epsilon\Delta t}\sigma(Z_{k},T-t_{k})\xi_{k},\\ Z_{0}&=y_{obs},\end{split} (56)

    where $\xi_{0},\dots,\xi_{n_{t}-1}$ are i.i.d. samples drawn from the $m$-dimensional standard normal distribution.

The obtained samples of $Z_{k}$ provide an approximation to the posterior samples of $Y_{T-k\Delta t}$ given $Y_{T}=y_{obs}$. Computational details are presented in the subsequent section.
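The two-step procedure above can be sketched in a simple one-dimensional case. The sketch below assumes $b=0$ and $\sigma=1$ (so $Y_{t}$ is scaled Brownian motion) with a Gaussian prior $N(\theta^{0},\sigma_{0}^{2})$, for which $\nabla_{x}S$ is available in closed form via the Riccati method of Appendix C.2.1, and the posterior of $Y_{0}$ given $Y_{T}=y_{obs}$ is Gaussian by conjugacy. All names and parameter values are illustrative.

```python
import numpy as np

def hj_sampler_step2(grad_S, y_obs, T, eps, n_steps, n_samples, rng):
    """Euler-Maruyama discretization (56), specialized to b = 0, sigma = 1 (scalar)."""
    dt = T / n_steps
    Z = np.full(n_samples, y_obs, dtype=float)
    for k in range(n_steps):
        t_k = k * dt
        xi = rng.standard_normal(n_samples)
        Z = Z + grad_S(Z, t_k) * dt + np.sqrt(eps * dt) * xi
    return Z  # approximates posterior samples of Y_0 given Y_T = y_obs

# Scaled Brownian motion dY = sqrt(eps) dW with prior N(theta0, sigma0^2):
# the Riccati solution gives grad_x S(x, t) = -(x - theta0) / (sigma0^2/eps + (T - t)).
eps, T, theta0, sigma0, y_obs = 0.5, 1.0, 0.0, 1.0, 1.0
grad_S = lambda x, t: -(x - theta0) / (sigma0**2 / eps + (T - t))

rng = np.random.default_rng(1)
Z = hj_sampler_step2(grad_S, y_obs, T, eps, n_steps=400, n_samples=100_000, rng=rng)

# Analytic posterior of Y_0 | Y_T = y_obs (Gaussian-Gaussian conjugacy).
post_mean = (eps * T * theta0 + sigma0**2 * y_obs) / (eps * T + sigma0**2)
post_var = sigma0**2 * eps * T / (sigma0**2 + eps * T)
print(Z.mean(), post_mean)  # both ~0.667
print(Z.var(), post_var)    # both ~0.333
```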

C.1 Computational details

In this section, we provide the computational details for the results presented in Section 3.2. By the definition of $\mathcal{A}_{\epsilon,T-t}$, we have

\begin{split}e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}&=e^{-\frac{S}{\epsilon}}\left(-\nabla_{x}\cdot(b(x,T-t)e^{\frac{S}{\epsilon}})+\frac{\epsilon}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}(D_{ij}(x,T-t)e^{\frac{S}{\epsilon}})\right)\\ &=e^{-\frac{S}{\epsilon}}\Bigg{(}-e^{\frac{S}{\epsilon}}\nabla_{x}\cdot b(x,T-t)-e^{\frac{S}{\epsilon}}b(x,T-t)\cdot\frac{\nabla_{x}S}{\epsilon}+\frac{\epsilon}{2}\sum_{i,j=1}^{n}\Big{(}e^{\frac{S}{\epsilon}}(\partial_{x_{i}}\partial_{x_{j}}D_{ij}(x,T-t))\\ &\quad\quad\quad\quad+\frac{2}{\epsilon}e^{\frac{S}{\epsilon}}(\partial_{x_{i}}S)(\partial_{x_{j}}D_{ij}(x,T-t))+D_{ij}(x,T-t)e^{\frac{S}{\epsilon}}\Big{(}\frac{(\partial_{x_{i}}S)(\partial_{x_{j}}S)}{\epsilon^{2}}+\frac{\partial_{x_{i}}\partial_{x_{j}}S}{\epsilon}\Big{)}\Big{)}\Bigg{)}\\ &=-\nabla_{x}\cdot b(x,T-t)-b(x,T-t)\cdot\frac{\nabla_{x}S}{\epsilon}+\frac{\epsilon}{2}\sum_{i,j=1}^{n}\Big{(}\partial_{x_{i}}\partial_{x_{j}}D_{ij}(x,T-t)\\ &\quad\quad\quad\quad+\frac{2}{\epsilon}(\partial_{x_{i}}S)(\partial_{x_{j}}D_{ij}(x,T-t))+D_{ij}(x,T-t)\Big{(}\frac{(\partial_{x_{i}}S)(\partial_{x_{j}}S)}{\epsilon^{2}}+\frac{\partial_{x_{i}}\partial_{x_{j}}S}{\epsilon}\Big{)}\Big{)}\\ &=-\nabla_{x}\cdot b(x,T-t)-b(x,T-t)\cdot\frac{\nabla_{x}S}{\epsilon}+\frac{\epsilon}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}D_{ij}(x,T-t)\\ &\quad\quad\quad\quad+\sum_{i,j=1}^{n}(\partial_{x_{i}}S)(\partial_{x_{j}}D_{ij}(x,T-t))+\frac{1}{2\epsilon}(\nabla_{x}S)^{T}D(x,T-t)\nabla_{x}S+\frac{1}{2}\operatorname{Tr}(D(x,T-t)\nabla_{x}^{2}S).\end{split} (57)

Therefore, the HJ equation in (4) becomes

\begin{split}0&=\partial_{t}S+\epsilon e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}\\ &=\partial_{t}S-b(x,T-t)\cdot\nabla_{x}S+\epsilon\sum_{i,j=1}^{n}(\partial_{x_{i}}S)(\partial_{x_{j}}D_{ij}(x,T-t))+\frac{1}{2}(\nabla_{x}S)^{T}D(x,T-t)\nabla_{x}S\\ &\quad\quad\quad\quad+\frac{\epsilon}{2}\operatorname{Tr}(D(x,T-t)\nabla_{x}^{2}S)+\frac{\epsilon^{2}}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}D_{ij}(x,T-t)-\epsilon\nabla_{x}\cdot b(x,T-t).\end{split} (58)

Similarly, according to the formula for $\mathcal{A}_{\epsilon,t}^{*}$, we obtain

\begin{split}e^{\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}^{*}(\rho e^{-\frac{S}{\epsilon}})&=e^{\frac{S}{\epsilon}}\left(b(x,T-t)\cdot\nabla_{x}(\rho e^{-\frac{S}{\epsilon}})+\frac{\epsilon}{2}\sum_{i,j=1}^{n}D_{ij}(x,T-t)\partial_{x_{i}}\partial_{x_{j}}(\rho e^{-\frac{S}{\epsilon}})\right)\\ &=e^{\frac{S}{\epsilon}}\Bigg{(}-e^{-\frac{S}{\epsilon}}\rho b(x,T-t)\cdot\frac{\nabla_{x}S}{\epsilon}+e^{-\frac{S}{\epsilon}}b(x,T-t)\cdot\nabla_{x}\rho+\frac{\epsilon}{2}\sum_{i,j=1}^{n}\Big{(}e^{-\frac{S}{\epsilon}}D_{ij}(x,T-t)\partial_{x_{i}}\partial_{x_{j}}\rho\\ &\quad\quad\quad\quad-\frac{2}{\epsilon}e^{-\frac{S}{\epsilon}}D_{ij}(x,T-t)(\partial_{x_{i}}S)(\partial_{x_{j}}\rho)+\rho D_{ij}(x,T-t)e^{-\frac{S}{\epsilon}}\Big{(}\frac{(\partial_{x_{i}}S)(\partial_{x_{j}}S)}{\epsilon^{2}}-\frac{\partial_{x_{i}}\partial_{x_{j}}S}{\epsilon}\Big{)}\Big{)}\Bigg{)}\\ &=-\rho b(x,T-t)\cdot\frac{\nabla_{x}S}{\epsilon}+b(x,T-t)\cdot\nabla_{x}\rho+\frac{\epsilon}{2}\sum_{i,j=1}^{n}\Big{(}D_{ij}(x,T-t)\partial_{x_{i}}\partial_{x_{j}}\rho\\ &\quad\quad\quad\quad-\frac{2}{\epsilon}D_{ij}(x,T-t)(\partial_{x_{i}}S)(\partial_{x_{j}}\rho)+\rho D_{ij}(x,T-t)\Big{(}\frac{(\partial_{x_{i}}S)(\partial_{x_{j}}S)}{\epsilon^{2}}-\frac{\partial_{x_{i}}\partial_{x_{j}}S}{\epsilon}\Big{)}\Big{)}.\end{split} (59)

Then, we have

\begin{split}\rho e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}-e^{\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}^{*}(\rho e^{-\frac{S}{\epsilon}})&=-\rho\nabla_{x}\cdot b(x,T-t)-\rho b(x,T-t)\cdot\frac{\nabla_{x}S}{\epsilon}+\frac{\epsilon\rho}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}D_{ij}(x,T-t)\\ &\quad\quad\quad\quad+\rho\sum_{i,j=1}^{n}(\partial_{x_{i}}S)(\partial_{x_{j}}D_{ij}(x,T-t))+\frac{\rho}{2\epsilon}(\nabla_{x}S)^{T}D(x,T-t)\nabla_{x}S+\frac{\rho}{2}\operatorname{Tr}(D(x,T-t)\nabla_{x}^{2}S)\\ &\quad\quad\quad\quad+\rho b(x,T-t)\cdot\frac{\nabla_{x}S}{\epsilon}-b(x,T-t)\cdot\nabla_{x}\rho-\frac{\epsilon}{2}\sum_{i,j=1}^{n}\Big{(}D_{ij}(x,T-t)\partial_{x_{i}}\partial_{x_{j}}\rho\\ &\quad\quad\quad\quad-\frac{2}{\epsilon}D_{ij}(x,T-t)(\partial_{x_{i}}S)(\partial_{x_{j}}\rho)+\rho D_{ij}(x,T-t)\Big{(}\frac{(\partial_{x_{i}}S)(\partial_{x_{j}}S)}{\epsilon^{2}}-\frac{\partial_{x_{i}}\partial_{x_{j}}S}{\epsilon}\Big{)}\Big{)}\\ &=-\rho\nabla_{x}\cdot b(x,T-t)-b(x,T-t)\cdot\nabla_{x}\rho+\frac{\epsilon\rho}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}D_{ij}(x,T-t)-\frac{\epsilon}{2}\sum_{i,j=1}^{n}D_{ij}(x,T-t)\partial_{x_{i}}\partial_{x_{j}}\rho\\ &\quad\quad\quad\quad+\rho\sum_{i,j=1}^{n}(\partial_{x_{i}}S)(\partial_{x_{j}}D_{ij}(x,T-t))+\rho\operatorname{Tr}(D(x,T-t)\nabla_{x}^{2}S)+\sum_{i,j=1}^{n}D_{ij}(x,T-t)(\partial_{x_{i}}S)(\partial_{x_{j}}\rho)\\ &=-\nabla_{x}\cdot(b(x,T-t)\rho)+\frac{\epsilon\rho}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}D_{ij}(x,T-t)-\frac{\epsilon}{2}\sum_{i,j=1}^{n}D_{ij}(x,T-t)\partial_{x_{i}}\partial_{x_{j}}\rho\\ &\quad\quad\quad\quad+\nabla_{x}\cdot(\rho D(x,T-t)\nabla_{x}S)\\ &=-\nabla_{x}\cdot((b(x,T-t)-D(x,T-t)\nabla_{x}S)\rho)-\frac{\epsilon}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}(D_{ij}(x,T-t)\rho)+\epsilon\rho\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}D_{ij}(x,T-t)\\ &\quad\quad\quad\quad+\epsilon\sum_{i,j=1}^{n}(\partial_{x_{i}}D_{ij}(x,T-t))(\partial_{x_{j}}\rho)\\ &=-\nabla_{x}\cdot(b(x,T-t)\rho)+\sum_{i,j=1}^{n}\partial_{x_{i}}\left(\rho(D_{ij}(x,T-t)\partial_{x_{j}}S+\epsilon\partial_{x_{j}}D_{ij}(x,T-t))\right)-\frac{\epsilon}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}(D_{ij}(x,T-t)\rho).\end{split} (60)

Therefore, the Fokker-Planck equation in (4) becomes

\begin{split}0&=\partial_{t}\rho+\rho e^{-\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}e^{\frac{S}{\epsilon}}-e^{\frac{S}{\epsilon}}\mathcal{A}_{\epsilon,t}^{*}(\rho e^{-\frac{S}{\epsilon}})\\ &=\partial_{t}\rho-\nabla_{x}\cdot(b(x,T-t)\rho)+\sum_{i,j=1}^{n}\partial_{x_{i}}\left(\rho(D_{ij}(x,T-t)\partial_{x_{j}}S+\epsilon\partial_{x_{j}}D_{ij}(x,T-t))\right)-\frac{\epsilon}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}(D_{ij}(x,T-t)\rho).\end{split} (61)

After specifying the terminal condition $-J$ for $S$ and the initial condition $\delta_{z_{0}}$ for $\rho$, the two PDEs relate to the following stochastic optimal control problem:

\begin{split}&\min\Bigg{\{}\mathbb{E}\Bigg{[}\int_{0}^{T}\frac{1}{2}v_{s}^{T}D(Z_{s},T-s)^{-1}v_{s}-\frac{\epsilon^{2}}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}D_{ij}(Z_{s},T-s)+\epsilon\nabla_{x}\cdot b(Z_{s},T-s)ds+J(Z_{T})\Bigg{]}\colon\\ &\quad\quad\quad\quad dZ_{s}=\Big{(}v_{s}-b(Z_{s},T-s)+\epsilon\sum_{j=1}^{n}\partial_{x_{j}}D_{j}(Z_{s},T-s)\Big{)}ds+\sqrt{\epsilon}\sigma(Z_{s},T-s)dW_{s},Z_{0}=z_{0}\Bigg{\}},\end{split} (62)

whose value is equal to $-S(z_{0},0)$. With different initial and terminal conditions on $S$ and $\rho$, these two PDEs relate to certain MFGs or SOT problems, similar to what was discussed in Example 2.1 and Appendix A.

C.2 Numerical solvers for the viscous HJ PDE (51)

In the proposed HJ-sampler algorithm, a numerical solver is required for the viscous HJ PDE (51) in this general case. Similar to the approaches discussed in Sections 3.2.1 and 3.2.2, we describe the Riccati method and the SGM method in the following sections.

C.2.1 Riccati method

The Riccati method can be applied in cases where the function $b(x,t)$ depends linearly on $x$, $\sigma(x,t)$ depends only on $t$, and the prior distribution is Gaussian. This version of the HJ-sampler is called the Riccati-HJ-sampler. Specifically, we assume that $b$ takes the form $b(x,t)=A_{t}x+\beta_{t}$, where $A_{t}\in\mathbb{R}^{n\times n}$ and $\beta_{t}\in\mathbb{R}^{n}$ for any $t\in[0,T]$. We also assume that the matrix $D(x,t)=\sigma(x,t)\sigma(x,t)^{T}$ depends only on $t$. Additionally, the prior distribution is assumed to be $P_{prior}(\theta)\propto\exp\left(-\frac{1}{2}(\theta-\theta^{0})^{T}\Sigma^{-1}(\theta-\theta^{0})\right)$. Under these assumptions, the viscous HJ PDE (51) in the first step of the Riccati-HJ-sampler can be solved using Riccati ODEs, following a similar approach to that in Section 3.2.1. For simplicity, in this section we will interchangeably use both $D_{t}$ and $D(x,t)$, as well as $D_{t,ij}$ and $D_{ij}(x,t)$, as needed.

Let $\tilde{S}(x,t)=-S(x,T-t)$, where $S$ is the solution to (51). Then, $\tilde{S}$ satisfies

\begin{dcases}\partial_{t}\tilde{S}+b(x,t)\cdot\nabla_{x}\tilde{S}-\epsilon\sum_{i,j=1}^{n}(\partial_{x_{i}}\tilde{S})(\partial_{x_{j}}D_{ij}(x,t))+\frac{1}{2}(\nabla_{x}\tilde{S})^{T}D(x,t)\nabla_{x}\tilde{S}\\ \quad\quad\quad\quad-\frac{\epsilon}{2}\operatorname{Tr}(D(x,t)\nabla_{x}^{2}\tilde{S})+\frac{\epsilon^{2}}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}D_{ij}(x,t)-\epsilon\nabla_{x}\cdot b(x,t)=0,\\ \tilde{S}(x,0)=-\epsilon\log P_{prior}(x).\end{dcases} (63)

Under the assumptions on $b$, $D$, and $P_{prior}$, this PDE simplifies to

\begin{dcases}\partial_{t}\tilde{S}+(A_{t}x+\beta_{t})^{T}\nabla_{x}\tilde{S}+\frac{1}{2}(\nabla_{x}\tilde{S})^{T}D_{t}\nabla_{x}\tilde{S}-\frac{\epsilon}{2}\operatorname{Tr}(D_{t}\nabla_{x}^{2}\tilde{S})-\epsilon\operatorname{Tr}(A_{t})=0,\\ \tilde{S}(x,0)=\frac{\epsilon}{2}(x-\theta^{0})^{T}\Sigma^{-1}(x-\theta^{0})+\frac{n\epsilon}{2}\log(2\pi)+\frac{\epsilon}{2}\log\det(\Sigma).\end{dcases} (64)

In this viscous HJ PDE, both the Hamiltonian and the initial condition are quadratic, allowing the equation to be solved using Riccati ODEs. Specifically, the solution is given by $\tilde{S}(x,t)=\frac{1}{2}(x-q(t))^{T}Q(t)^{-1}(x-q(t))+r(t)$, where $Q\colon[0,T]\to S_{++}^{n}$, $q\colon[0,T]\to\mathbb{R}^{n}$, and $r\colon[0,T]\to\mathbb{R}$ satisfy the following Riccati ODE system:

\begin{dcases}\dot{Q}(t)=D_{t}+Q(t)A_{t}^{T}+A_{t}Q(t),\\ \dot{q}(t)=A_{t}q(t)+\beta_{t},\\ \dot{r}(t)=\frac{\epsilon}{2}\operatorname{Tr}(2A_{t}+D_{t}Q(t)^{-1}),\end{dcases}\quad\quad\quad\quad\begin{dcases}Q(0)=\frac{1}{\epsilon}\Sigma,\\ q(0)=\theta^{0},\\ r(0)=\frac{n\epsilon}{2}\log(2\pi)+\frac{\epsilon}{2}\log\det(\Sigma).\end{dcases} (65)

Since sampling only requires $\nabla_{x}S(x,t)$, it suffices to solve for $Q$ and $q$, and then obtain

\nabla_{x}S(x,t)=-\nabla_{x}\tilde{S}(x,T-t)=-Q(T-t)^{-1}(x-q(T-t)). (66)

Beyond the Gaussian prior, we can also handle Gaussian mixture prior distributions. Similar to the approach in Section 3.2.1, we first solve the Riccati ODEs for each Gaussian component and then compute $\nabla_{x}S$ using (23).
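The first step of the Riccati-HJ-sampler can be sketched in one dimension with constant coefficients $A_{t}=a$, $\beta_{t}=\beta$, $D_{t}=d$, for which (65) has closed-form solutions to compare against. The forward-Euler integrator and all values below are illustrative, not the paper's implementation.

```python
import numpy as np

def solve_riccati_1d(a, beta, d, Sigma, theta0, eps, T, n_steps):
    """Forward-Euler integration of the Riccati ODEs (65) in 1D:
    Q' = d + 2 a Q,  q' = a q + beta,  with Q(0) = Sigma / eps, q(0) = theta0."""
    dt = T / n_steps
    Q, q = Sigma / eps, theta0
    for _ in range(n_steps):
        Q += dt * (d + 2 * a * Q)
        q += dt * (a * q + beta)
    return Q, q

a, beta, d, Sigma, theta0, eps, T = -0.5, 0.3, 1.0, 1.0, 0.2, 0.1, 2.0
Q_num, q_num = solve_riccati_1d(a, beta, d, Sigma, theta0, eps, T, n_steps=20_000)

# Constant-coefficient closed forms for comparison.
Q0 = Sigma / eps
Q_exact = (Q0 + d / (2 * a)) * np.exp(2 * a * T) - d / (2 * a)
q_exact = (theta0 + beta / a) * np.exp(a * T) - beta / a
print(Q_num, Q_exact)
print(q_num, q_exact)
```

Once $Q$ and $q$ are available on a time grid, (66) gives the control directly: $\nabla_{x}S(x,t)=-(x-q(T-t))/Q(T-t)$ in this scalar setting.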

C.2.2 SGM method

For more general cases, we can approximate the score function using a neural network trained on samples of $Y_{t}$ via the SGM method, similar to the approach in Section 3.2.2. This approach is called the SGM-HJ-sampler. In this case, the training data are sampled from the discretized SDE as follows:

Y_{k+1,j}=Y_{k,j}+b(Y_{k,j},t_{k})\Delta t+\sqrt{\epsilon\Delta t}\sigma(Y_{k,j},t_{k})\xi_{k,j}. (67)

The neural network can then be trained using the loss functions (25) and (26). After training, the term $\nabla_{x}S(Z_{k},t_{k})$ in the sampling step is approximated by the pretrained model $\epsilon s_{W}(Z_{k},T-t_{k})$.
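Generating the training data via (67) can be sketched as follows, in one dimension with the illustrative choices $b=0$ and $\sigma=1$, for which the marginal variance of $Y_{t_{k}}$ grows exactly as $\operatorname{Var}(Y_{0})+\epsilon t_{k}$ and can be checked.

```python
import numpy as np

def generate_training_data(b, sigma, Y0, eps, T, n_t, rng):
    """Simulate N trajectories of the discretized SDE (67).
    Returns an array of shape (n_t + 1, N) with Y[k, j] = Y_{k, j}."""
    N = Y0.shape[0]
    dt = T / n_t
    Y = np.empty((n_t + 1, N))
    Y[0] = Y0
    for k in range(n_t):
        t_k = k * dt
        xi = rng.standard_normal(N)
        Y[k + 1] = Y[k] + b(Y[k], t_k) * dt + np.sqrt(eps * dt) * sigma(Y[k], t_k) * xi
    return Y

rng = np.random.default_rng(2)
eps, T, n_t, N = 0.4, 1.0, 100, 50_000
Y0 = rng.standard_normal(N)  # samples from a standard normal prior
Y = generate_training_data(lambda y, t: 0.0 * y, lambda y, t: 1.0, Y0, eps, T, n_t, rng)

# With b = 0 and sigma = 1, Var(Y_{t_k}) = Var(Y_0) + eps * t_k exactly.
print(Y.shape)      # (101, 50000)
print(Y[-1].var())  # ~ 1 + eps * T = 1.4
```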

There are alternative choices for the loss function in the training of diffusion models. A popular choice is:

\sum_{k=1}^{n_{t}}\sum_{j=1}^{N}\lambda_{k}\|s_{W}(Y_{k,j},t_{k})-\nabla_{y_{k}}\log P(Y_{t_{k}}=Y_{k,j}|Y_{0}=Y_{0,j})\|^{2}, (68)

where $\lambda_{k}$ are positive weighting terms, and various methods for selecting them in diffusion models are discussed in [92, 52].

In our case, when the model for $Y_{t}$ is complicated, we do not have an analytical formula for $\log P(Y_{t_{k}}|Y_{0})$. However, using a similar derivation, we can employ the following loss instead:

\begin{split}&\sum_{k=1}^{n_{t}}\sum_{j=1}^{N}\lambda_{k}\|s_{W}(Y_{k,j},t_{k})-\nabla_{y_{k}}\log P(Y_{t_{k}}=Y_{k,j}|Y_{t_{k-1}}=Y_{k-1,j})\|^{2}\\ \approx&\sum_{k=1}^{n_{t}}\sum_{j=1}^{N}\lambda_{k}\left\|s_{W}(Y_{k,j},t_{k})+\frac{1}{\epsilon\Delta t}D(Y_{k-1,j},t_{k-1})^{-1}(Y_{k,j}-Y_{k-1,j}-b(Y_{k-1,j},t_{k-1})\Delta t)\right\|^{2}.\end{split} (69)

In this scenario, the conditional distribution $P(Y_{t_{k}}|Y_{t_{k-1}})$ does not have an analytical formula, so we approximate it using the conditional distribution of the discretized process in (67). Note that this loss function requires the matrix $D(y,t)$ to be invertible for all $y=Y_{k,j}$ and $t=t_{k}$. The equivalence of these loss functions, including the original score-matching loss $\sum_{k=1}^{n_{t}}\sum_{j=1}^{N}\lambda_{k}\|s_{W}(Y_{k,j},t_{k})-\nabla_{y_{k}}\log P(Y_{t_{k}}=Y_{k,j})\|^{2}$, is discussed in [94].
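The regression target in (69) is the gradient of the log-density of the Euler–Maruyama transition $N(Y_{k-1}+b\,\Delta t,\ \epsilon\Delta t\,D)$. The sketch below computes this target and checks it against a finite-difference gradient of that Gaussian log-density; all names and values are illustrative.

```python
import numpy as np

def target_score(y_k, y_prev, b_prev, D_inv, eps, dt):
    """Regression target in the loss (69):
    -(1 / (eps * dt)) * D^{-1} (y_k - y_prev - b(y_prev) * dt)."""
    return -D_inv @ (y_k - y_prev - b_prev * dt) / (eps * dt)

def log_gauss(y, mean, cov):
    """Gaussian log-density up to an additive constant in y."""
    d = y - mean
    return -0.5 * d @ np.linalg.solve(cov, d)

rng = np.random.default_rng(3)
n, eps, dt = 3, 0.2, 0.05
y_prev = rng.standard_normal(n)
b_prev = rng.standard_normal(n)
L = rng.standard_normal((n, n))
D = L @ L.T + n * np.eye(n)  # a synthetic SPD diffusion matrix
y_k = y_prev + b_prev * dt + 0.1 * rng.standard_normal(n)

target = target_score(y_k, y_prev, b_prev, np.linalg.inv(D), eps, dt)

# Finite-difference check of grad_{y_k} log N(y_k; y_prev + b*dt, eps*dt*D).
h, grad_fd = 1e-6, np.zeros(n)
mean, cov = y_prev + b_prev * dt, eps * dt * D
for i in range(n):
    e = np.zeros(n); e[i] = h
    grad_fd[i] = (log_gauss(y_k + e, mean, cov) - log_gauss(y_k - e, mean, cov)) / (2 * h)

print(np.max(np.abs(target - grad_fd)))  # ~0 (agreement up to FD error)
```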

C.3 Error estimation for HJ-sampler

In this section, we present a simple error estimation for the HJ-sampler. The total error in the HJ-sampler arises from two main sources: the numerical error in solving the viscous HJ PDE during the first step, and the sampling error in the second step. Here, we focus on analyzing the impact of the first error.

Let $\rho(x,t|y_{obs})$ or $\rho_{y_{obs}}(x,t)$ denote the density of the stochastic process described in (55), and let $\tilde{\rho}(x,t|y_{obs})$ or $\tilde{\rho}_{y_{obs}}(x,t)$ represent the density of the following stochastic process:

\begin{split}dZ_{\tau}&=\left(D(Z_{\tau},T-\tau)\alpha(Z_{\tau},\tau)-b(Z_{\tau},T-\tau)+\epsilon\sum_{j=1}^{n}\partial_{x_{j}}D_{j}(Z_{\tau},T-\tau)\right)d\tau+\sqrt{\epsilon}\sigma(Z_{\tau},T-\tau)dW_{\tau},\\ Z_{0}&=y_{obs},\end{split} (70)

where $\alpha$ is an approximation of the control $\nabla_{x}S$. We compare the error between the true distribution $\rho_{y_{obs}}(\cdot,t)$ and its numerical approximation $\tilde{\rho}_{y_{obs}}(\cdot,t)$, measured by the Wasserstein-2 distance, denoted as $W_{2}(\rho_{y_{obs}}(\cdot,t),\tilde{\rho}_{y_{obs}}(\cdot,t))$.

In the case of the Riccati-HJ-sampler, $\alpha$ can be considered as an interpolation of the spatial gradient of the Riccati solution at the temporal grid points $t_{k}$. For the SGM-HJ-sampler, the function $\alpha(x,t)$ is given by the pretrained neural network $\epsilon s_{W}(x,T-t)$.

Following the proof in [58], we adapt the methodology to our setup. We provide an informal error estimate, focusing on the essential ideas while simplifying the analysis by omitting some technical details. We assume sufficient regularity of the functions so that the Fokker-Planck equations have smooth solutions $\rho_{y_{obs}}(x,t)$ and $\tilde{\rho}_{y_{obs}}(x,t)$ for any $y_{obs}$, $x\in\mathbb{R}^{n}$, and $t\in(0,T]$. Additionally, we assume that the functions $b$ and $\alpha$ are Lipschitz continuous with Lipschitz constants $L_{b}$ and $L_{\alpha}$, respectively. We further assume that $D(x,t)$ depends only on $t$ and is bounded by $C_{D}(t)I$, meaning that $C_{D}(t)I-D(x,t)$ is positive semi-definite. We assume that $C_{D}(t)$ and $\exp\left(-2L_{b}t-2L_{\alpha}\int_{0}^{t}C_{D}(T-\tau)d\tau\right)C_{D}(T-t)^{2}$ are both integrable over $t\in[0,T]$. Moreover, we assume that the optimal transport map from $\rho_{y_{obs}}(\cdot,t)$ to $\tilde{\rho}_{y_{obs}}(\cdot,t)$ is given by $\nabla\phi$, the gradient of a convex, twice-differentiable function $\phi$ with invertible Hessian matrices. Note that the function $\phi$ may vary with $t$, but we omit this time dependence for simplicity of notation.

Under these assumptions, the function $\rho_{y_{obs}}$ satisfies the following equation:

\begin{split}0&=\partial_{t}\rho_{y_{obs}}-\nabla_{x}\cdot(b(x,T-t)\rho_{y_{obs}})+\sum_{i,j=1}^{n}\partial_{x_{i}}\left(\rho_{y_{obs}}D_{ij}(T-t)\partial_{x_{j}}S\right)-\frac{\epsilon}{2}\sum_{i,j=1}^{n}\partial_{x_{i}}\partial_{x_{j}}(D_{ij}(T-t)\rho_{y_{obs}})\\ &=\partial_{t}\rho_{y_{obs}}+\sum_{i=1}^{n}\partial_{x_{i}}\left(\rho_{y_{obs}}\left(-b_{i}(x,T-t)+\sum_{j=1}^{n}\left(D_{ij}(T-t)\partial_{x_{j}}S-\frac{\epsilon D_{ij}(T-t)\partial_{x_{j}}\rho_{y_{obs}}}{2\rho_{y_{obs}}}\right)\right)\right).\end{split} (71)

Thus, $\rho_{y_{obs}}(\cdot,t)$ is also the marginal density of the ODE $\dot{x}_{t}=v(x_{t},t|\rho_{y_{obs}})$, where $v$ is given by:

\begin{split}v(x,t|\rho_{y_{obs}})&=-b(x,T-t)+D(T-t)\nabla_{x}S(x,t)-\frac{\epsilon D(T-t)\nabla_{x}\rho_{y_{obs}}}{2\rho_{y_{obs}}}.\end{split} (72)

Similarly, the function $\tilde{\rho}_{y_{obs}}(\cdot,t)$ is the marginal density of the ODE $\dot{x}_{t}=\tilde{v}(x_{t},t|\tilde{\rho}_{y_{obs}})$, where $\tilde{v}$ is given by:

\tilde{v}(x,t|\tilde{\rho}_{y_{obs}})=-b(x,T-t)+D(T-t)\alpha(x,t)-\frac{\epsilon D(T-t)\nabla_{x}\tilde{\rho}_{y_{obs}}}{2\tilde{\rho}_{y_{obs}}}. (73)

According to [83, Cor. 5.25], the Wasserstein distance between $\rho_{y_{obs}}(\cdot,t)$ and $\tilde{\rho}_{y_{obs}}(\cdot,t)$, denoted by $t\mapsto W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})(t)=W_{2}(\rho_{y_{obs}}(\cdot,t),\tilde{\rho}_{y_{obs}}(\cdot,t))$, satisfies

\frac{1}{2}\frac{d}{dt}W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}=\mathbb{E}_{\pi(x,y)}[(x-y)^{T}(v(x,t|\rho_{y_{obs}})-\tilde{v}(y,t|\tilde{\rho}_{y_{obs}}))], (74)

where $\pi$ is the optimal transport plan between $\rho_{y_{obs}}$ and $\tilde{\rho}_{y_{obs}}$. For simplicity, we omit the variable $t$ in the density functions.

Next, we will bound the right-hand side of (74). Through straightforward computation, we obtain

\begin{split}&\mathbb{E}_{\pi(x,y)}[(x-y)^{T}(v(x,t|\rho_{y_{obs}})-\tilde{v}(y,t|\tilde{\rho}_{y_{obs}}))]=\mathbb{E}_{\pi(x,y)}[(x-y)^{T}(-b(x,T-t)+b(y,T-t))]\\ &\quad\quad\quad\quad\quad\quad\quad\quad+\mathbb{E}_{\pi(x,y)}\left[(x-y)^{T}D(T-t)(\nabla_{x}S(x,t)-\alpha(y,t))\right]\\ &\quad\quad\quad\quad\quad\quad\quad\quad-\frac{\epsilon}{2}\mathbb{E}_{\pi(x,y)}\left[(x-y)^{T}D(T-t)\left(\frac{\nabla_{x}\rho_{y_{obs}}(x,t)}{\rho_{y_{obs}}(x,t)}-\frac{\nabla_{y}\tilde{\rho}_{y_{obs}}(y,t)}{\tilde{\rho}_{y_{obs}}(y,t)}\right)\right].\end{split} (75)

Since the function $b$ is Lipschitz continuous with respect to $x$, with uniform Lipschitz constant $L_{b}$, the first term on the right-hand side of (75) is bounded above by $L_{b}W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}$. Furthermore, since the matrix $D(t)$ is bounded above by $C_{D}(t)I$ and the function $\alpha$ is Lipschitz with respect to $x$ with constant $L_{\alpha}$, the second term on the right-hand side of (75) is estimated by

\begin{split}&\mathbb{E}_{\pi(x,y)}\left[(x-y)^{T}D(T-t)(\nabla_{x}S(x,t)-\alpha(y,t))\right]\\ \leq\,&C_{D}(T-t)\mathbb{E}_{\pi(x,y)}\left[(x-y)^{T}(\nabla_{x}S(x,t)-\alpha(y,t))\right]\\ =\,&C_{D}(T-t)\mathbb{E}_{\pi(x,y)}\left[(x-y)^{T}(\nabla_{x}S(x,t)-\alpha(x,t))\right]+C_{D}(T-t)\mathbb{E}_{\pi(x,y)}\left[(x-y)^{T}(\alpha(x,t)-\alpha(y,t))\right]\\ \leq\,&C_{D}(T-t)W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})\left(\mathbb{E}_{\pi(x,y)}\left[\|\nabla_{x}S(x,t)-\alpha(x,t)\|^{2}\right]\right)^{1/2}+C_{D}(T-t)L_{\alpha}W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}\\ =\,&C_{D}(T-t)W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})\left(\mathbb{E}_{\rho_{y_{obs}}}\left[\|\nabla_{x}S(\cdot,t)-\alpha(\cdot,t)\|^{2}\right]\right)^{1/2}+C_{D}(T-t)L_{\alpha}W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}.\end{split} (76)

According to optimal transport theory, there exists a convex function ϕ\phi such that the optimal transport plan π\pi is given by (Id,ϕ)#ρyobs(\text{Id},\nabla\phi)_{\#}\rho_{y_{obs}}, which denotes the push-forward of ρyobs\rho_{y_{obs}} by (Id,ϕ)(\text{Id},\nabla\phi) (where Id denotes the identity map xxx\mapsto x). Denote by ϕ\phi^{*} the Fenchel-Legendre transform of ϕ\phi. Then, π\pi can also be represented by (ϕ,Id)#ρ~yobs(\nabla\phi^{*},\text{Id})_{\#}\tilde{\rho}_{y_{obs}}. The expectation in the last term on the right-hand side of (75) simplifies as follows:

𝔼π(x,y)[(xy)TD(Tt)(xρyobs(x,t))ρyobs(x,t)yρ~yobs(y,t))ρ~yobs(y,t))]=𝔼ρyobs[(xϕ(x))TD(Tt)xρyobs(x,t)ρyobs(x,t)]𝔼ρ~yobs[(ϕ(y)y)TD(Tt)yρ~yobs(y,t)ρ~yobs(y,t)]=(xϕ(x))TD(Tt)xρyobs(x,t)𝑑x(ϕ(y)y)TD(Tt)yρ~yobs(y,t)𝑑y=Tr((I2ϕ(x))D(Tt))ρyobs(x,t)𝑑x+Tr((2ϕ(y)I)D(Tt))ρ~yobs(y,t)𝑑y=𝔼ρyobs[Tr(D(Tt)(I2ϕ))]+𝔼ρ~yobs[Tr(D(Tt)(2ϕI))]=𝔼ρyobs[Tr(D(Tt)(I2ϕ))]+𝔼ρyobs[Tr(D(Tt)((2ϕ)1I))]=𝔼ρyobs[Tr(D(Tt)(2ϕ(x)+(2ϕ(x))12I))]=𝔼ρyobs[Tr(D(Tt)(2ϕ(x))1(2ϕ(x)I)2)]0,\begin{split}&\mathbb{E}_{\pi(x,y)}\left[(x-y)^{T}D(T-t)\left(\frac{\nabla_{x}\rho_{y_{obs}}(x,t))}{\rho_{y_{obs}}(x,t)}-\frac{\nabla_{y}\tilde{\rho}_{y_{obs}}(y,t))}{\tilde{\rho}_{y_{obs}}(y,t)}\right)\right]\\ =\,&\mathbb{E}_{\rho_{y_{obs}}}\left[\frac{(x-\nabla\phi(x))^{T}D(T-t)\nabla_{x}\rho_{y_{obs}}(x,t)}{\rho_{y_{obs}}(x,t)}\right]-\mathbb{E}_{\tilde{\rho}_{y_{obs}}}\left[\frac{(\nabla\phi^{*}(y)-y)^{T}D(T-t)\nabla_{y}\tilde{\rho}_{y_{obs}}(y,t)}{\tilde{\rho}_{y_{obs}}(y,t)}\right]\\ =\,&\int(x-\nabla\phi(x))^{T}D(T-t)\nabla_{x}\rho_{y_{obs}}(x,t)dx-\int(\nabla\phi^{*}(y)-y)^{T}D(T-t)\nabla_{y}\tilde{\rho}_{y_{obs}}(y,t)dy\\ =\,&-\int\operatorname{Tr}\left((I-\nabla^{2}\phi(x))D(T-t)\right)\rho_{y_{obs}}(x,t)dx+\int\operatorname{Tr}\left((\nabla^{2}\phi^{*}(y)-I)D(T-t)\right)\tilde{\rho}_{y_{obs}}(y,t)dy\\ =\,&-\mathbb{E}_{\rho_{y_{obs}}}\left[\operatorname{Tr}(D(T-t)(I-\nabla^{2}\phi))\right]+\mathbb{E}_{\tilde{\rho}_{y_{obs}}}\left[\operatorname{Tr}(D(T-t)(\nabla^{2}\phi^{*}-I))\right]\\ =\,&-\mathbb{E}_{\rho_{y_{obs}}}\left[\operatorname{Tr}(D(T-t)(I-\nabla^{2}\phi))\right]+\mathbb{E}_{\rho_{y_{obs}}}\left[\operatorname{Tr}(D(T-t)((\nabla^{2}\phi)^{-1}-I))\right]\\ =\,&\mathbb{E}_{\rho_{y_{obs}}}\left[\operatorname{Tr}\left(D(T-t)(\nabla^{2}\phi(x)+(\nabla^{2}\phi(x))^{-1}-2I)\right)\right]\\ =\,&\mathbb{E}_{\rho_{y_{obs}}}\left[\operatorname{Tr}\left(D(T-t)(\nabla^{2}\phi(x))^{-1}(\nabla^{2}\phi(x)-I)^{2}\right)\right]\geq 0,\end{split} (77)

where the last expression is non-negative for any t[0,T]t\in[0,T]: the matrix (2ϕ(x))1(2ϕ(x)I)2(\nabla^{2}\phi(x))^{-1}(\nabla^{2}\phi(x)-I)^{2} is symmetric positive semi-definite, since its two factors are commuting functions of the symmetric positive definite matrix 2ϕ(x)\nabla^{2}\phi(x), and the trace of the product of two symmetric positive semi-definite matrices, here with D(Tt)D(T-t), is non-negative.

Combining all the estimates, we have

12ddtW2(ρyobs,ρ~yobs)2C1(t)W2(ρyobs,ρ~yobs)2+CD(Tt)W2(ρyobs,ρ~yobs)(𝔼ρyobs[xS(,t)α(,t)2])1/2,\begin{split}\frac{1}{2}\frac{d}{dt}W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}\leq C_{1}(t)W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}+C_{D}(T-t)W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})\left(\mathbb{E}_{\rho_{y_{obs}}}\left[\|\nabla_{x}S(\cdot,t)-\alpha(\cdot,t)\|^{2}\right]\right)^{1/2},\end{split}

(78)

where C1(t)=Lb+CD(Tt)LαC_{1}(t)=L_{b}+C_{D}(T-t)L_{\alpha} is a function of tt. Next, we take the expectation with respect to yobsP(yobs)=μ(yobs,T)y_{obs}\sim P(y_{obs})=\mu(y_{obs},T) and obtain

12ddt𝔼P(yobs)[W2(ρyobs,ρ~yobs)2]C1(t)𝔼P(yobs)[W2(ρyobs,ρ~yobs)2]+CD(Tt)𝔼P(yobs)[W2(ρyobs,ρ~yobs)(𝔼ρyobs[xS(,t)α(,t)2])1/2]C1(t)𝔼P(yobs)[W2(ρyobs,ρ~yobs)2]+CD(Tt)(𝔼P(yobs)[W2(ρyobs,ρ~yobs)2])1/2(𝔼P(yobs)𝔼ρyobs[xS(,t)α(,t)2])1/2C1(t)𝔼P(yobs)[W2(ρyobs,ρ~yobs)2]+CD(Tt)(𝔼P(yobs)[W2(ρyobs,ρ~yobs)2])1/2(𝔼[xS(YTt,t)α(YTt,t)2])1/2,\begin{split}&\frac{1}{2}\frac{d}{dt}\mathbb{E}_{P(y_{obs})}\left[W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}\right]\\ \leq&C_{1}(t)\mathbb{E}_{P(y_{obs})}\left[W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}\right]+C_{D}(T-t)\mathbb{E}_{P(y_{obs})}\left[W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})\left(\mathbb{E}_{\rho_{y_{obs}}}\left[\|\nabla_{x}S(\cdot,t)-\alpha(\cdot,t)\|^{2}\right]\right)^{1/2}\right]\\ \leq&C_{1}(t)\mathbb{E}_{P(y_{obs})}\left[W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}\right]+C_{D}(T-t)\left(\mathbb{E}_{P(y_{obs})}\left[W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}\right]\right)^{1/2}\left(\mathbb{E}_{P(y_{obs})}\mathbb{E}_{\rho_{y_{obs}}}\left[\|\nabla_{x}S(\cdot,t)-\alpha(\cdot,t)\|^{2}\right]\right)^{1/2}\\ \leq&C_{1}(t)\mathbb{E}_{P(y_{obs})}\left[W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}\right]+C_{D}(T-t)\left(\mathbb{E}_{P(y_{obs})}\left[W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}\right]\right)^{1/2}\left(\mathbb{E}\left[\|\nabla_{x}S(Y_{T-t},t)-\alpha(Y_{T-t},t)\|^{2}\right]\right)^{1/2},\end{split}

(79)

where the second inequality follows from the Cauchy-Schwarz inequality, and the last one holds because P(yobs)ρyobs(x,t)=P(YT,1=yobs)P(YTt=x|YT,1=yobs)=P(YTt=x,YT,1=yobs)P(y_{obs})\rho_{y_{obs}}(x,t)=P(Y_{T,1}=y_{obs})P(Y_{T-t}=x|Y_{T,1}=y_{obs})=P(Y_{T-t}=x,Y_{T,1}=y_{obs}) following (12). Let (t)=(𝔼P(yobs)[W2(ρyobs,ρ~yobs)2])1/2\mathcal{L}(t)=\left(\mathbb{E}_{P(y_{obs})}\left[W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}\right]\right)^{1/2}. We obtain

d(t)dt=12ddt𝔼P(yobs)[W2(ρyobs,ρ~yobs)2](t)C1(t)(t)+CD(Tt)(𝔼[xS(YTt,t)α(YTt,t)2])1/2.\frac{d\mathcal{L}(t)}{dt}=\frac{\frac{1}{2}\frac{d}{dt}\mathbb{E}_{P(y_{obs})}\left[W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}\right]}{\mathcal{L}(t)}\leq C_{1}(t)\mathcal{L}(t)+C_{D}(T-t)\left(\mathbb{E}\left[\|\nabla_{x}S(Y_{T-t},t)-\alpha(Y_{T-t},t)\|^{2}\right]\right)^{1/2}. (80)

Moreover, we have (0)=(𝔼P(yobs)[W2(ρyobs(,0),ρ~yobs(,0))2])1/2=(𝔼P(yobs)[W2(δyobs,δyobs)2])1/2=0\mathcal{L}(0)=\left(\mathbb{E}_{P(y_{obs})}\left[W_{2}(\rho_{y_{obs}}(\cdot,0),\tilde{\rho}_{y_{obs}}(\cdot,0))^{2}\right]\right)^{1/2}=\left(\mathbb{E}_{P(y_{obs})}\left[W_{2}(\delta_{y_{obs}},\delta_{y_{obs}})^{2}\right]\right)^{1/2}=0. Then, by a similar argument as in Appendix A1 of [58], we conclude that

(t)1I(t)0tI(τ)CD(Tτ)(𝔼[xS(YTτ,τ)α(YTτ,τ)2])1/2𝑑τ,\mathcal{L}(t)\leq\frac{1}{I(t)}\int_{0}^{t}I(\tau)C_{D}(T-\tau)\left(\mathbb{E}\left[\|\nabla_{x}S(Y_{T-\tau},\tau)-\alpha(Y_{T-\tau},\tau)\|^{2}\right]\right)^{1/2}d\tau, (81)

where the function II is defined by I(t)=exp(0tC1(τ)𝑑τ)I(t)=\exp(-\int_{0}^{t}C_{1}(\tau)d\tau).
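The passage from (80) to (81) is a standard integrating-factor (Grönwall-type) computation, following the argument in Appendix A1 of [58]; for completeness, a sketch:

```latex
% Abbreviate g(t) = C_D(T-t)\bigl(\mathbb{E}[\|\nabla_x S(Y_{T-t},t)-\alpha(Y_{T-t},t)\|^2]\bigr)^{1/2},
% so that (80) reads \mathcal{L}'(t) \le C_1(t)\mathcal{L}(t) + g(t).
% With the integrating factor I(t) = \exp(-\int_0^t C_1(\tau)\,d\tau), we obtain
\frac{d}{dt}\bigl(I(t)\,\mathcal{L}(t)\bigr)
  = I(t)\bigl(\mathcal{L}'(t) - C_1(t)\,\mathcal{L}(t)\bigr)
  \le I(t)\,g(t).
% Integrating from 0 to t, using \mathcal{L}(0) = 0 and I(0) = 1, gives
I(t)\,\mathcal{L}(t) \le \int_0^t I(\tau)\,g(\tau)\,d\tau ,
% and dividing by I(t) > 0 recovers (81).
```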

Note that (t)2\mathcal{L}(t)^{2} is the expected squared Wasserstein-2 error of the posterior distribution with respect to the observation yobsy_{obs}. According to Markov’s inequality, the probability of the error exceeding ee can be bounded as follows:

P(W2(ρyobs(,t),ρ~yobs(,t))e)𝔼P(yobs)[W2(ρyobs,ρ~yobs)2]e2=(t)2e21e2(1I(t)0tI(τ)CD(Tτ)(𝔼[xS(YTτ,τ)α(YTτ,τ)2])1/2𝑑τ)2.\begin{split}&P(W_{2}(\rho_{y_{obs}}(\cdot,t),\tilde{\rho}_{y_{obs}}(\cdot,t))\geq e)\leq\frac{\mathbb{E}_{P(y_{obs})}\left[W_{2}(\rho_{y_{obs}},\tilde{\rho}_{y_{obs}})^{2}\right]}{e^{2}}=\frac{\mathcal{L}(t)^{2}}{e^{2}}\\ \leq\,&\frac{1}{e^{2}}\left(\frac{1}{I(t)}\int_{0}^{t}I(\tau)C_{D}(T-\tau)\left(\mathbb{E}\left[\|\nabla_{x}S(Y_{T-\tau},\tau)-\alpha(Y_{T-\tau},\tau)\|^{2}\right]\right)^{1/2}d\tau\right)^{2}.\end{split} (82)
Remark C.1

In the error estimation above, beyond the regularity assumptions, one restrictive assumption is that the function DD depends only on tt. This excludes SDEs where the diffusion coefficient has xx dependence. This assumption is primarily utilized in the estimation in (77). Other parts of the proof can be straightforwardly generalized to cases where DD depends on xx. However, if DD does depend on xx, the term estimated in (77) may become negative. If this term can be appropriately bounded, the proof could be extended to more general cases.

Remark C.2

In the error analysis above, we examined how the numerical error from the first step of the HJ-sampler affects the continuous sampling process. The key term in the upper bound is the expectation 𝔼[xS(YTτ,τ)α(YTτ,τ)2]\mathbb{E}\left[\|\nabla_{x}S(Y_{T-\tau},\tau)-\alpha(Y_{T-\tau},\tau)\|^{2}\right]. For the Riccati-HJ-sampler, this term can be calculated based on the errors in the solutions QQ and qq to the Riccati ODE system. For the SGM-HJ-sampler, this term is related to the loss function and can be bounded by quantities derived from the loss value.

The overall error of the sampling algorithm consists of two parts: the error from the first step and the discretization error from the second step. The error in the first step, as estimated in our analysis, can guide the choice of the numerical solver for the viscous HJ PDE, including decisions on the temporal discretization size for traditional methods or the data size and neural network size for AI-based methods. The second step’s error, resulting from the SDE discretization scheme, can guide the selection of the sampling method and its temporal discretization size. Balancing these two sources of error, along with considering the computational efficiency of different methods, is crucial for achieving optimal performance.

The error analysis for the second step follows directly from the established analysis of SDE discretization schemes, so we omit the details here. In this paper, we use the Euler–Maruyama discretization in the second step, which has strong order 0.50.5 and weak order 11. If higher accuracy is needed, higher-order schemes such as stochastic Runge–Kutta methods could be considered.
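As a concrete illustration of this second stage, the sampling update (18) is a plain Euler–Maruyama step, Z_{k+1} = Z_k + drift(Z_k, t_k) Δt + √(ϵΔt) ξ_k. A minimal sketch in Python; the function name `euler_maruyama` and its interface are our own, not taken from the paper's implementation:

```python
import numpy as np

def euler_maruyama(z0, drift, epsilon, T, n_steps, rng=None):
    """Simulate Z_{k+1} = Z_k + drift(Z_k, t_k) dt + sqrt(epsilon * dt) * xi_k."""
    rng = np.random.default_rng(rng)
    z = np.asarray(z0, dtype=float).copy()
    dt = T / n_steps
    for k in range(n_steps):
        xi = rng.standard_normal(z.shape)
        z = z + drift(z, k * dt) * dt + np.sqrt(epsilon * dt) * xi
    return z
```

Setting epsilon = 0 reduces the scheme to explicit Euler for the drift ODE, which gives a convenient deterministic sanity check.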

Appendix D Neural network training details for numerical examples

This section provides the details of the neural network sWs_{W} used in the SGM-HJ-sampler for obtaining posterior samples, as well as the specifics of the training procedure. Across all numerical examples, sWs_{W} is implemented with the tanh\tanh activation function for the nonlinear hidden layers, and the Adam optimizer [55] is employed with a learning rate of 1×1041\times 10^{-4}, unless stated otherwise.
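For reference, a minimal NumPy sketch of such a fully connected tanh network mapping a state xx and time tt to an estimated score is given below. The function names (`init_fnn`, `s_W`), the Xavier-style initialization, and the convention of appending the time to the state input are illustrative assumptions, not the authors' code; the Adam training loop is omitted.

```python
import numpy as np

def init_fnn(sizes, rng):
    # Xavier-style initialization for a fully connected tanh network
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        W = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)
        b = np.zeros(fan_out)
        params.append((W, b))
    return params

def s_W(params, x, t):
    # Input: state x concatenated with time t; output: estimated score
    h = np.concatenate([x, t], axis=-1)
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b
```

For a 1D problem with two hidden layers of 50 neurons, the layer sizes would be [2, 50, 50, 1] (state dimension 1 plus the time input).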

In Section 4.1, we explore three 1D cases with varying prior distributions and one 2D case. For the 1D Gaussian prior case, a fully-connected neural network (FNN) with two hidden layers of 50 neurons each is utilized. The training of sWs_{W} involves 1,000,0001,000,000 sample paths of YtY_{t} over t[0,T]t\in[0,T], using mini-batch training with a batch size of 1,0001,000 for 3,0003,000 epochs. For the 1D Gaussian mixture case, the 1D mixture of uniform distributions case, and the 2D Gaussian mixture case, an FNN with three hidden layers and 50 neurons per layer is employed. The network is trained on 1,000,0001,000,000 sample paths of YtY_{t} over t[0,T]t\in[0,T] for 5,0005,000 epochs with a batch size of 1,0001,000.

In Section 4.2, the examples include one 1D case and two 2D cases. For the 1D case, an FNN with two hidden layers of 50 neurons each is trained on 1,000,0001,000,000 sample paths of YtY_{t} over t[0,T]t\in[0,T] for 3,0003,000 epochs, using a batch size of 1,0001,000 and a learning rate of 1×1041\times 10^{-4}. For the 2D Gaussian mixture prior case, an FNN with three hidden layers of 50 neurons each is trained on 1,000,0001,000,000 sample paths of YtY_{t} over t[0,T]t\in[0,T] for 5,0005,000 epochs with a batch size of 1,0001,000. For the 2D case with a LogNormal prior, which considers model misspecification, an FNN with three hidden layers of 50 neurons each is trained separately for each value of ϵ\epsilon. Each network is trained on 100,000100,000 sample paths of YtY_{t} over t[0,T]t\in[0,T] for 5,0005,000 epochs, using a batch size of 1,0001,000 and a learning rate of 1×1031\times 10^{-3}.

In Section 4.3, for each value of ϵ\epsilon, an FNN with three hidden layers and 50 neurons per layer is trained on 100,000100,000 sample paths of YtY_{t} over t[0,T]t\in[0,T] for 5,0005,000 epochs with a batch size of 1,0001,000. In Section 4.4, an FNN with three hidden layers and 200 neurons per layer is used, trained on 100,000100,000 sample paths of YtY_{t} over t[0,T]t\in[0,T] for 3,0003,000 epochs with a batch size of 1,0001,000 and a learning rate of 1×1031\times 10^{-3}.

For the examples in Sections 4.1-4.3, the loss function (25) is used with a weight of λk=1\lambda_{k}=1 for training. For Section 4.4, the sliced version (26) is used with a weight of λk=1\lambda_{k}=1 and a sample size Nv=1N_{v}=1.

Appendix E Details of analytical formulas used in numerical results

E.1 Brownian motion

In this section, we focus on cases involving Brownian motion. Specifically, we assume the process YtY_{t} is governed by the SDE dYt=ϵdWtdY_{t}=\sqrt{\epsilon}dW_{t}, where ϵ>0\epsilon>0 is a hyperparameter, and WtW_{t} is standard Brownian motion, with different prior distributions for Y0Y_{0}.

E.1.1 One-dimensional uniform prior

Here, we provide the computational details for a one-dimensional uniform prior and a mixture of uniform priors. The generalization to higher-dimensional box-shaped domains is straightforward and therefore omitted.

First, assume the prior is a uniform distribution on [a,b][a,b]. The marginal distribution for YtY_{t} is given by:

P(Yt=y)=P(Yt=y|Y0=x)P(Y0=x)𝑑x=1baab12πϵtexp(12ϵt|yx|2)𝑑x=1baybϵtyaϵt12πexp(12x2)𝑑x=1ba(Φ(yaϵt)Φ(ybϵt)),\begin{split}P(Y_{t}=y)&=\int P(Y_{t}=y|Y_{0}=x)P(Y_{0}=x)dx=\frac{1}{b-a}\int_{a}^{b}\frac{1}{\sqrt{2\pi\epsilon t}}\exp\left(-\frac{1}{2\epsilon t}|y-x|^{2}\right)dx\\ &=\frac{1}{b-a}\int_{\frac{y-b}{\sqrt{\epsilon t}}}^{\frac{y-a}{\sqrt{\epsilon t}}}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^{2}\right)dx=\frac{1}{b-a}\left(\Phi\left(\frac{y-a}{\sqrt{\epsilon t}}\right)-\Phi\left(\frac{y-b}{\sqrt{\epsilon t}}\right)\right),\end{split} (83)

where Φ\Phi is the cumulative distribution function of the standard one-dimensional Gaussian distribution. The solution to the viscous HJ PDE (15) is:

S(x,Tt)=ϵlogP(Yt=x)=ϵlog(Φ(xaϵt)Φ(xbϵt))ϵlog(ba).S(x,T-t)=\epsilon\log P(Y_{t}=x)=\epsilon\log\left(\Phi\left(\frac{x-a}{\sqrt{\epsilon t}}\right)-\Phi\left(\frac{x-b}{\sqrt{\epsilon t}}\right)\right)-\epsilon\log(b-a). (84)

The inference process, as described in (18), becomes:

Zk+1=Zk+xS(Zk,tk)Δt+ϵΔtξk=Zk+ϵ1ϵ(Ttk)(Φ(Zkaϵ(Ttk))Φ(Zkbϵ(Ttk)))Φ(Zkaϵ(Ttk))Φ(Zkbϵ(Ttk))Δt+ϵΔtξk=Zk+ϵ2π(Ttk)exp(12ϵ(Ttk)|Zka|2)exp(12ϵ(Ttk)|Zkb|2)Φ(Zkaϵ(Ttk))Φ(Zkbϵ(Ttk))Δt+ϵΔtξk.\begin{split}Z_{k+1}&=Z_{k}+\partial_{x}S(Z_{k},t_{k})\Delta t+\sqrt{\epsilon\Delta t}\xi_{k}\\ &=Z_{k}+\epsilon\frac{\frac{1}{\sqrt{\epsilon(T-t_{k})}}\left(\Phi^{\prime}\left(\frac{Z_{k}-a}{\sqrt{\epsilon(T-t_{k})}}\right)-\Phi^{\prime}\left(\frac{Z_{k}-b}{\sqrt{\epsilon(T-t_{k})}}\right)\right)}{\Phi\left(\frac{Z_{k}-a}{\sqrt{\epsilon(T-t_{k})}}\right)-\Phi\left(\frac{Z_{k}-b}{\sqrt{\epsilon(T-t_{k})}}\right)}\Delta t+\sqrt{\epsilon\Delta t}\xi_{k}\\ &=Z_{k}+\sqrt{\frac{\epsilon}{2\pi(T-t_{k})}}\frac{\exp\left(-\frac{1}{2\epsilon(T-t_{k})}|Z_{k}-a|^{2}\right)-\exp\left(-\frac{1}{2\epsilon(T-t_{k})}|Z_{k}-b|^{2}\right)}{\Phi\left(\frac{Z_{k}-a}{\sqrt{\epsilon(T-t_{k})}}\right)-\Phi\left(\frac{Z_{k}-b}{\sqrt{\epsilon(T-t_{k})}}\right)}\Delta t+\sqrt{\epsilon\Delta t}\xi_{k}.\end{split} (85)

The posterior distribution of YtY_{t} given YT=yobsY_{T}=y_{obs} for t(0,T)t\in(0,T) is:

P(Yt=θ|YT=yobs)=P(Yt=θ)P(YT=yobs|Yt=θ)P(YT=yobs)=(Φ(θaϵt)Φ(θbϵt))12πϵ(Tt)exp(12ϵ(Tt)|θyobs|2)Φ(yobsaϵT)Φ(yobsbϵT).\begin{split}P(Y_{t}=\theta|Y_{T}=y_{obs})&=\frac{P(Y_{t}=\theta)P(Y_{T}=y_{obs}|Y_{t}=\theta)}{P(Y_{T}=y_{obs})}=\frac{\left(\Phi\left(\frac{\theta-a}{\sqrt{\epsilon t}}\right)-\Phi\left(\frac{\theta-b}{\sqrt{\epsilon t}}\right)\right)\frac{1}{\sqrt{2\pi\epsilon(T-t)}}\exp\left(-\frac{1}{2\epsilon(T-t)}|\theta-y_{obs}|^{2}\right)}{\Phi\left(\frac{y_{obs}-a}{\sqrt{\epsilon T}}\right)-\Phi\left(\frac{y_{obs}-b}{\sqrt{\epsilon T}}\right)}.\end{split}

(86)

When t=0t=0, the posterior distribution is:

P(Y0=θ|YT=yobs)=P(Y0=θ)P(YT=yobs|Y0=θ)P(YT=yobs)=12πϵTexp(12ϵT|θyobs|2)Φ(yobsaϵT)Φ(yobsbϵT)χ[a,b](θ),\begin{split}P(Y_{0}=\theta|Y_{T}=y_{obs})&=\frac{P(Y_{0}=\theta)P(Y_{T}=y_{obs}|Y_{0}=\theta)}{P(Y_{T}=y_{obs})}=\frac{\frac{1}{\sqrt{2\pi\epsilon T}}\exp\left(-\frac{1}{2\epsilon T}|\theta-y_{obs}|^{2}\right)}{\Phi\left(\frac{y_{obs}-a}{\sqrt{\epsilon T}}\right)-\Phi\left(\frac{y_{obs}-b}{\sqrt{\epsilon T}}\right)}\chi_{[a,b]}(\theta),\end{split} (87)

where χ[a,b]\chi_{[a,b]} is the indicator function that takes the value 11 on [a,b][a,b] and 0 otherwise.
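The drift in (85) is exactly ϵ times the spatial derivative of the log-marginal (83), evaluated at the remaining time τ = T − t_k, which can be verified numerically with a finite-difference check. A minimal sketch, with helper names (`Phi`, `marginal`, `drift`) that are our own:

```python
import numpy as np
from math import erf

def Phi(z):
    # Standard normal CDF
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def marginal(y, t, a, b, eps):
    # Eq. (83): marginal density of Y_t for a uniform prior on [a, b]
    s = np.sqrt(eps * t)
    return (Phi((y - a) / s) - Phi((y - b) / s)) / (b - a)

def drift(z, tau, a, b, eps):
    # Closed-form drift from Eq. (85), with tau = T - t_k the remaining time
    s = np.sqrt(eps * tau)
    num = np.exp(-(z - a) ** 2 / (2 * eps * tau)) - np.exp(-(z - b) ** 2 / (2 * eps * tau))
    den = Phi((z - a) / s) - Phi((z - b) / s)
    return np.sqrt(eps / (2 * np.pi * tau)) * num / den
```

A central finite difference of eps * log(marginal) in its first argument reproduces drift to high accuracy, confirming the algebra in (85).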

Next, consider the prior to be a mixture of uniform distributions: Pprior(x)=j=1Mwjbjajχ[aj,bj](x)P_{prior}(x)=\sum_{j=1}^{M}\frac{w_{j}}{b_{j}-a_{j}}\chi_{[a_{j},b_{j}]}(x), where the weights wjw_{j} satisfy j=1Mwj=1\sum_{j=1}^{M}w_{j}=1. Following the same process as above, the marginal distribution for YtY_{t} is:

P(Yt=y)=j=1Mwjbjaj(Φ(yajϵt)Φ(ybjϵt)).\begin{split}P(Y_{t}=y)&=\sum_{j=1}^{M}\frac{w_{j}}{b_{j}-a_{j}}\left(\Phi\left(\frac{y-a_{j}}{\sqrt{\epsilon t}}\right)-\Phi\left(\frac{y-b_{j}}{\sqrt{\epsilon t}}\right)\right).\end{split} (88)

The solution to the viscous HJ PDE (15) is:

S(x,Tt)=ϵlogP(Yt=x)=ϵlog(j=1Mwjbjaj(Φ(xajϵt)Φ(xbjϵt))).S(x,T-t)=\epsilon\log P(Y_{t}=x)=\epsilon\log\left(\sum_{j=1}^{M}\frac{w_{j}}{b_{j}-a_{j}}\left(\Phi\left(\frac{x-a_{j}}{\sqrt{\epsilon t}}\right)-\Phi\left(\frac{x-b_{j}}{\sqrt{\epsilon t}}\right)\right)\right). (89)

The inference process, as described in (18), becomes:

Zk+1=Zk+xS(Zk,tk)Δt+ϵΔtξk=Zk+ϵ1ϵ(Ttk)(j=1Mwjbjaj(Φ(Zkajϵ(Ttk))Φ(Zkbjϵ(Ttk))))j=1Mwjbjaj(Φ(Zkajϵ(Ttk))Φ(Zkbjϵ(Ttk)))Δt+ϵΔtξk=Zk+ϵ2π(Ttk)j=1Mwjbjaj(exp(12ϵ(Ttk)|Zkaj|2)exp(12ϵ(Ttk)|Zkbj|2))j=1Mwjbjaj(Φ(Zkajϵ(Ttk))Φ(Zkbjϵ(Ttk)))Δt+ϵΔtξk.\begin{split}&Z_{k+1}=Z_{k}+\partial_{x}S(Z_{k},t_{k})\Delta t+\sqrt{\epsilon\Delta t}\xi_{k}\\ &=Z_{k}+\epsilon\frac{\frac{1}{\sqrt{\epsilon(T-t_{k})}}\left(\sum_{j=1}^{M}\frac{w_{j}}{b_{j}-a_{j}}\left(\Phi^{\prime}\left(\frac{Z_{k}-a_{j}}{\sqrt{\epsilon(T-t_{k})}}\right)-\Phi^{\prime}\left(\frac{Z_{k}-b_{j}}{\sqrt{\epsilon(T-t_{k})}}\right)\right)\right)}{\sum_{j=1}^{M}\frac{w_{j}}{b_{j}-a_{j}}\left(\Phi\left(\frac{Z_{k}-a_{j}}{\sqrt{\epsilon(T-t_{k})}}\right)-\Phi\left(\frac{Z_{k}-b_{j}}{\sqrt{\epsilon(T-t_{k})}}\right)\right)}\Delta t+\sqrt{\epsilon\Delta t}\xi_{k}\\ &=Z_{k}+\sqrt{\frac{\epsilon}{2\pi(T-t_{k})}}\frac{\sum_{j=1}^{M}\frac{w_{j}}{b_{j}-a_{j}}\left(\exp\left(-\frac{1}{2\epsilon(T-t_{k})}|Z_{k}-a_{j}|^{2}\right)-\exp\left(-\frac{1}{2\epsilon(T-t_{k})}|Z_{k}-b_{j}|^{2}\right)\right)}{\sum_{j=1}^{M}\frac{w_{j}}{b_{j}-a_{j}}\left(\Phi\left(\frac{Z_{k}-a_{j}}{\sqrt{\epsilon(T-t_{k})}}\right)-\Phi\left(\frac{Z_{k}-b_{j}}{\sqrt{\epsilon(T-t_{k})}}\right)\right)}\Delta t+\sqrt{\epsilon\Delta t}\xi_{k}.\end{split} (90)

The posterior distribution of YtY_{t} given YT=yobsY_{T}=y_{obs} for t(0,T)t\in(0,T) is:

P(Yt=θ|YT=yobs)=P(Yt=θ)P(YT=yobs|Yt=θ)P(YT=yobs)=(j=1Mwjbjaj(Φ(θajϵt)Φ(θbjϵt)))12πϵ(Tt)exp(12ϵ(Tt)|θyobs|2)j=1Mwjbjaj(Φ(yobsajϵT)Φ(yobsbjϵT)).\begin{split}P(Y_{t}=\theta|Y_{T}=y_{obs})&=\frac{P(Y_{t}=\theta)P(Y_{T}=y_{obs}|Y_{t}=\theta)}{P(Y_{T}=y_{obs})}=\frac{\left(\sum_{j=1}^{M}\frac{w_{j}}{b_{j}-a_{j}}\left(\Phi\left(\frac{\theta-a_{j}}{\sqrt{\epsilon t}}\right)-\Phi\left(\frac{\theta-b_{j}}{\sqrt{\epsilon t}}\right)\right)\right)\frac{1}{\sqrt{2\pi\epsilon(T-t)}}\exp\left(-\frac{1}{2\epsilon(T-t)}|\theta-y_{obs}|^{2}\right)}{\sum_{j=1}^{M}\frac{w_{j}}{b_{j}-a_{j}}\left(\Phi\left(\frac{y_{obs}-a_{j}}{\sqrt{\epsilon T}}\right)-\Phi\left(\frac{y_{obs}-b_{j}}{\sqrt{\epsilon T}}\right)\right)}.\end{split}

(91)

The posterior distribution of Y0Y_{0} given YT=yobsY_{T}=y_{obs} is:

P(Y0=θ|YT=yobs)=P(Y0=θ)P(YT=yobs|Y0=θ)P(YT=yobs)=(j=1Mwjbjajχ[aj,bj](θ))12πϵTexp(12ϵT|θyobs|2)j=1Mwjbjaj(Φ(yobsajϵT)Φ(yobsbjϵT)).\begin{split}P(Y_{0}=\theta|Y_{T}=y_{obs})&=\frac{P(Y_{0}=\theta)P(Y_{T}=y_{obs}|Y_{0}=\theta)}{P(Y_{T}=y_{obs})}=\frac{\left(\sum_{j=1}^{M}\frac{w_{j}}{b_{j}-a_{j}}\chi_{[a_{j},b_{j}]}(\theta)\right)\frac{1}{\sqrt{2\pi\epsilon T}}\exp(-\frac{1}{2\epsilon T}|\theta-y_{obs}|^{2})}{\sum_{j=1}^{M}\frac{w_{j}}{b_{j}-a_{j}}\left(\Phi\left(\frac{y_{obs}-a_{j}}{\sqrt{\epsilon T}}\right)-\Phi\left(\frac{y_{obs}-b_{j}}{\sqrt{\epsilon T}}\right)\right)}.\end{split}

(92)
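As a quick sanity check, the mixture marginal (88) can be evaluated numerically and verified to integrate to one. A minimal sketch, with function names of our own choosing:

```python
import numpy as np
from math import erf

def Phi(z):
    # Standard normal CDF
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def mixture_marginal(y, t, a, b, w, eps):
    # Eq. (88): marginal of Y_t under a mixture-of-uniforms prior
    s = np.sqrt(eps * t)
    return sum(wj / (bj - aj) * (Phi((y - aj) / s) - Phi((y - bj) / s))
               for aj, bj, wj in zip(a, b, w))
```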

E.1.2 nn-dimensional Gaussian mixture prior

We consider the nn-dimensional Brownian motion problem with a Gaussian mixture prior. Let the prior be PpriorP_{prior} as given in (21), and let the process be governed by dYt=ϵdWtdY_{t}=\sqrt{\epsilon}dW_{t}. The marginal distribution is

P(Yt=x)=i=1MwiCi(t)exp(12(xθi)T(Σi+ϵtI)1(xθi)),P(Y_{t}=x)=\sum_{i=1}^{M}\frac{w_{i}}{C_{i}(t)}\exp\left(-\frac{1}{2}(x-\theta_{i})^{T}(\Sigma_{i}+\epsilon tI)^{-1}(x-\theta_{i})\right), (93)

where Ci(t)=(2π)ndet(Σi+ϵtI)C_{i}(t)=\sqrt{(2\pi)^{n}\det(\Sigma_{i}+\epsilon tI)} is the normalization constant.

The solution to the viscous HJ PDE (15) is

S(x,Tt)=ϵlog(i=1MwiCi(t)exp(12(xθi)T(Σi+ϵtI)1(xθi))).S(x,T-t)=\epsilon\log\left(\sum_{i=1}^{M}\frac{w_{i}}{C_{i}(t)}\exp\left(-\frac{1}{2}(x-\theta_{i})^{T}(\Sigma_{i}+\epsilon tI)^{-1}(x-\theta_{i})\right)\right). (94)

The inference process, as described in (18), becomes

Zk+1=Zk+(xS(Zk,tk))Δt+ϵΔtξk=Zkϵi=1MwiCi(Ttk)(Σi+ϵ(Ttk)I)1(Zkθi)exp(12(Zkθi)T(Σi+ϵ(Ttk)I)1(Zkθi))i=1MwiCi(Ttk)exp(12(Zkθi)T(Σi+ϵ(Ttk)I)1(Zkθi))Δt+ϵΔtξk.\begin{split}Z_{k+1}&=Z_{k}+(\nabla_{x}S(Z_{k},t_{k}))\Delta t+\sqrt{\epsilon\Delta t}\xi_{k}\\ &=Z_{k}-\epsilon\frac{\sum_{i=1}^{M}\frac{w_{i}}{C_{i}(T-t_{k})}(\Sigma_{i}+\epsilon(T-t_{k})I)^{-1}(Z_{k}-\theta_{i})\exp\left(-\frac{1}{2}(Z_{k}-\theta_{i})^{T}(\Sigma_{i}+\epsilon(T-t_{k})I)^{-1}(Z_{k}-\theta_{i})\right)}{\sum_{i=1}^{M}\frac{w_{i}}{C_{i}(T-t_{k})}\exp\left(-\frac{1}{2}(Z_{k}-\theta_{i})^{T}(\Sigma_{i}+\epsilon(T-t_{k})I)^{-1}(Z_{k}-\theta_{i})\right)}\Delta t+\sqrt{\epsilon\Delta t}\xi_{k}.\end{split}

(95)

The posterior distribution of YtY_{t} given YT=yobsY_{T}=y_{obs} for t[0,T)t\in[0,T) is

P(Yt=θ|YT=yobs)=P(Yt=θ)P(YT=yobs|Yt=θ)P(YT=yobs)=(i=1MwiCi(t)exp(12(θθi)T(Σi+ϵtI)1(θθi)))1(2πϵ(Tt))nexp(12ϵ(Tt)yobsθ2)i=1MwiCi(T)exp(12(yobsθi)T(Σi+ϵTI)1(yobsθi))=i=1Mw~i(2π)ndet(Mi)exp(12(θvi)TMi1(θvi)),\begin{split}P(Y_{t}=\theta|Y_{T}=y_{obs})&=\frac{P(Y_{t}=\theta)P(Y_{T}=y_{obs}|Y_{t}=\theta)}{P(Y_{T}=y_{obs})}\\ &=\frac{\left(\sum_{i=1}^{M}\frac{w_{i}}{C_{i}(t)}\exp\left(-\frac{1}{2}(\theta-\theta_{i})^{T}(\Sigma_{i}+\epsilon tI)^{-1}(\theta-\theta_{i})\right)\right)\frac{1}{\sqrt{(2\pi\epsilon(T-t))^{n}}}\exp\left(-\frac{1}{2\epsilon(T-t)}\|y_{obs}-\theta\|^{2}\right)}{\sum_{i=1}^{M}\frac{w_{i}}{C_{i}(T)}\exp\left(-\frac{1}{2}(y_{obs}-\theta_{i})^{T}(\Sigma_{i}+\epsilon TI)^{-1}(y_{obs}-\theta_{i})\right)}\\ &=\sum_{i=1}^{M}\frac{\tilde{w}_{i}}{\sqrt{(2\pi)^{n}\det(M_{i})}}\exp\left(-\frac{1}{2}(\theta-v_{i})^{T}M_{i}^{-1}(\theta-v_{i})\right),\end{split}

(96)

which is a Gaussian mixture, where the ii-th Gaussian has covariance matrix Mi=((Σi+ϵtI)1+Iϵ(Tt))1M_{i}=\left((\Sigma_{i}+\epsilon tI)^{-1}+\frac{I}{\epsilon(T-t)}\right)^{-1} and mean vi=Mi((Σi+ϵtI)1θi+yobsϵ(Tt))v_{i}=M_{i}\left((\Sigma_{i}+\epsilon tI)^{-1}\theta_{i}+\frac{y_{obs}}{\epsilon(T-t)}\right). The weight w~i\tilde{w}_{i} is

w~i=wiCi(T)exp(12(yobsθi)T(Σi+ϵTI)1(yobsθi))j=1MwjCj(T)exp(12(yobsθj)T(Σj+ϵTI)1(yobsθj)).\tilde{w}_{i}=\frac{\frac{w_{i}}{C_{i}(T)}\exp\left(-\frac{1}{2}(y_{obs}-\theta_{i})^{T}(\Sigma_{i}+\epsilon TI)^{-1}(y_{obs}-\theta_{i})\right)}{\sum_{j=1}^{M}\frac{w_{j}}{C_{j}(T)}\exp\left(-\frac{1}{2}(y_{obs}-\theta_{j})^{T}(\Sigma_{j}+\epsilon TI)^{-1}(y_{obs}-\theta_{j})\right)}. (97)
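The posterior components M_i, v_i, and the reweighted weights in (96)-(97) are straightforward to assemble numerically. The sketch below is our own illustration (function name and log-domain weight normalization are our choices; constants common to all components cancel in the normalization); for a single component it reduces to the usual Gaussian conjugate update:

```python
import numpy as np

def posterior_mixture(yobs, t, T, thetas, Sigmas, ws, eps):
    # Eq. (96)-(97): posterior of Y_t | Y_T = yobs is a Gaussian mixture
    # with covariances M_i, means v_i, and reweighted weights w_tilde.
    n = len(yobs)
    Ms, vs, logw = [], [], []
    for theta_i, Sigma_i, w_i in zip(thetas, Sigmas, ws):
        A = np.linalg.inv(Sigma_i + eps * t * np.eye(n))
        M_i = np.linalg.inv(A + np.eye(n) / (eps * (T - t)))
        v_i = M_i @ (A @ theta_i + yobs / (eps * (T - t)))
        B = np.linalg.inv(Sigma_i + eps * T * np.eye(n))
        r = yobs - theta_i
        # Log of w_i / C_i(T) * exp(-r^T B r / 2), up to constants that
        # cancel after normalization
        logw.append(np.log(w_i)
                    - 0.5 * np.log(np.linalg.det(Sigma_i + eps * T * np.eye(n)))
                    - 0.5 * r @ B @ r)
        Ms.append(M_i)
        vs.append(v_i)
    logw = np.array(logw)
    w_tilde = np.exp(logw - logw.max())
    return Ms, vs, w_tilde / w_tilde.sum()
```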

E.2 OU process

In general, if the matrix BB in the OU process is not diagonal, neither the posterior density function nor the sampling SDE (18) has an analytical solution. However, analytical formulas can be derived when BB is a diagonal matrix and the prior is a Gaussian mixture in which each component has a diagonal covariance matrix.

Specifically, for one-dimensional cases, the analytical solutions can be obtained for the OU process with a Gaussian prior 𝒩(θP,σ2)\mathcal{N}(\theta^{P},\sigma^{2}). Note that in this case, the matrix BB reduces to a scalar. The solution to the Riccati ODE system is given by Q(t)=e2Btσ2ϵ+1e2Bt2BQ(t)=\frac{e^{-2Bt}\sigma^{2}}{\epsilon}+\frac{1-e^{-2Bt}}{2B}, q(t)=eBtθPq(t)=e^{-Bt}\theta^{P}, and r(t)=ϵ2log(2πe2Btσ2+πϵ(1e2Bt)B)r(t)=\frac{\epsilon}{2}\log(2\pi e^{-2Bt}\sigma^{2}+\frac{\pi\epsilon(1-e^{-2Bt})}{B}). The solution to the viscous HJ PDE (15) is

S(x,Tt)=12(xq(t))TQ(t)1(xq(t))r(t)=|xeBtθP|22e2Btσ2ϵ+1e2BtBϵ2log(2πe2Btσ2+πϵ(1e2Bt)B).S(x,T-t)=-\frac{1}{2}(x-q(t))^{T}Q(t)^{-1}(x-q(t))-r(t)=-\frac{\left|x-e^{-Bt}\theta^{P}\right|^{2}}{\frac{2e^{-2Bt}\sigma^{2}}{\epsilon}+\frac{1-e^{-2Bt}}{B}}-\frac{\epsilon}{2}\log\left(2\pi e^{-2Bt}\sigma^{2}+\frac{\pi\epsilon(1-e^{-2Bt})}{B}\right).

(98)

The inference process, as described in (18), becomes

Zk+1=Zk+(xS(Zk,tk))Δt+ϵΔtξk=ZkZkeB(Ttk)θPe2B(Ttk)σ2ϵ+1e2B(Ttk)2BΔt+ϵΔtξk.\begin{split}Z_{k+1}&=Z_{k}+(\partial_{x}S(Z_{k},t_{k}))\Delta t+\sqrt{\epsilon\Delta t}\xi_{k}=Z_{k}-\frac{Z_{k}-e^{-B(T-t_{k})}\theta^{P}}{\frac{e^{-2B(T-t_{k})}\sigma^{2}}{\epsilon}+\frac{1-e^{-2B(T-t_{k})}}{2B}}\Delta t+\sqrt{\epsilon\Delta t}\xi_{k}.\end{split} (99)

For any 0<t<s<T0<t<s<T, the process satisfies Ys=eB(st)Yt+ϵ(1e2B(st))2BξY_{s}=e^{-B(s-t)}Y_{t}+\sqrt{\frac{\epsilon(1-e^{-2B(s-t)})}{2B}}\xi, where ξ\xi is a standard Gaussian random variable independent of YtY_{t}. The marginal distribution of YtY_{t} is

P(Yt=x)=12πe2Btσ2+πϵ(1e2Bt)Bexp(12e2Btσ2+ϵ(1e2Bt)B|xeBtθP|2).P(Y_{t}=x)=\frac{1}{\sqrt{2\pi e^{-2Bt}\sigma^{2}+\frac{\pi\epsilon(1-e^{-2Bt})}{B}}}\exp\left(-\frac{1}{2e^{-2Bt}\sigma^{2}+\frac{\epsilon(1-e^{-2Bt})}{B}}|x-e^{-Bt}\theta^{P}|^{2}\right). (100)

The conditional distribution of YTY_{T} given YtY_{t} is

P(YT=y|Yt=θ)=Bϵπ(1e2B(Tt))exp(Bϵ(1e2B(Tt))|yeB(Tt)θ|2).P(Y_{T}=y|Y_{t}=\theta)=\sqrt{\frac{B}{\epsilon\pi(1-e^{-2B(T-t)})}}\exp\left(-\frac{B}{\epsilon(1-e^{-2B(T-t)})}\left|y-e^{-B(T-t)}\theta\right|^{2}\right). (101)

The posterior distribution of YtY_{t} given YT=yobsY_{T}=y_{obs} for any t[0,T)t\in[0,T) is

P(Yt=θ|YT=yobs)=P(Yt=θ)P(YT=yobs|Yt=θ)P(YT=yobs)=12πe2Btσ2+πϵ(1e2Bt)Bexp(12e2Btσ2+ϵ(1e2Bt)B|θeBtθP|2)Bϵπ(1e2B(Tt))exp(Bϵ(1e2B(Tt))|yobseB(Tt)θ|2)12πe2BTσ2+πϵ(1e2BT)Bexp(12e2BTσ2+ϵ(1e2BT)B|yobseBTθP|2).\begin{split}&P(Y_{t}=\theta|Y_{T}=y_{obs})=\frac{P(Y_{t}=\theta)P(Y_{T}=y_{obs}|Y_{t}=\theta)}{P(Y_{T}=y_{obs})}\\ &=\frac{\frac{1}{\sqrt{2\pi e^{-2Bt}\sigma^{2}+\frac{\pi\epsilon(1-e^{-2Bt})}{B}}}\exp\left(-\frac{1}{2e^{-2Bt}\sigma^{2}+\frac{\epsilon(1-e^{-2Bt})}{B}}|\theta-e^{-Bt}\theta^{P}|^{2}\right)\sqrt{\frac{B}{\epsilon\pi(1-e^{-2B(T-t)})}}\exp\left(-\frac{B}{\epsilon(1-e^{-2B(T-t)})}|y_{obs}-e^{-B(T-t)}\theta|^{2}\right)}{\frac{1}{\sqrt{2\pi e^{-2BT}\sigma^{2}+\frac{\pi\epsilon(1-e^{-2BT})}{B}}}\exp\left(-\frac{1}{2e^{-2BT}\sigma^{2}+\frac{\epsilon(1-e^{-2BT})}{B}}|y_{obs}-e^{-BT}\theta^{P}|^{2}\right)}.\end{split}

(102)

After simplification, this posterior distribution is Gaussian, with mean and variance given by:

𝔼[Yt|YT=yobs]=ϵ(eBteB(t2T))θP+2Bσ2eB(T+t)yobs+ϵ(eB(Tt)eB(T+t))yobsϵ(1e2BT)+2Bσ2e2BT,Var[Yt|YT=yobs]=ϵ(σ2e2Bt+ϵ(1e2Bt)2B)(1e2B(Tt))ϵ(1e2BT)+2Bσ2e2BT.\begin{split}\mathbb{E}[Y_{t}|Y_{T}=y_{obs}]&=\frac{\epsilon(e^{-Bt}-e^{B(t-2T)})\theta^{P}+2B\sigma^{2}e^{-B(T+t)}y_{obs}+\epsilon(e^{-B(T-t)}-e^{-B(T+t)})y_{obs}}{\epsilon(1-e^{-2BT})+2B\sigma^{2}e^{-2BT}},\\ \text{Var}[Y_{t}|Y_{T}=y_{obs}]&=\frac{\epsilon\left(\sigma^{2}e^{-2Bt}+\frac{\epsilon(1-e^{-2Bt})}{2B}\right)(1-e^{-2B(T-t)})}{\epsilon(1-e^{-2BT})+2B\sigma^{2}e^{-2BT}}.\end{split} (103)
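At t = 0, the variance in (103) can be cross-checked against the direct Gaussian conjugate computation for the prior N(θ^P, σ²) and the likelihood Y_T | Y_0 ~ N(e^{-BT} Y_0, ϵ(1 − e^{-2BT})/(2B)). A minimal sketch, with function names of our own:

```python
import numpy as np

def ou_posterior_var_t0(B, sigma2, eps, T):
    # Eq. (103) evaluated at t = 0
    num = eps * sigma2 * (1.0 - np.exp(-2 * B * T))
    den = eps * (1.0 - np.exp(-2 * B * T)) + 2 * B * sigma2 * np.exp(-2 * B * T)
    return num / den

def conjugate_var_t0(B, sigma2, eps, T):
    # Direct Gaussian conjugacy: Y_T | Y_0 ~ N(e^{-BT} Y_0, s2) with
    # s2 = eps * (1 - e^{-2BT}) / (2B); posterior precision adds up
    s2 = eps * (1.0 - np.exp(-2 * B * T)) / (2 * B)
    return 1.0 / (1.0 / sigma2 + np.exp(-2 * B * T) / s2)
```

The two expressions agree identically, which confirms the simplification leading to (103).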

Appendix F Supplementary results for the 1D Brownian motion example

(a) The analytic-HJ-sampler.
(b) The SGM-HJ-sampler.
Figure 12: Histograms depicting the distribution of posterior samples for the scaled 1D Brownian motion case with a Gaussian prior, across different observation values yobsy_{obs}. The SGM-HJ-sampler employs a neural network trained on t[0,T]t\in[0,T] with T=1T=1. The black dashed lines represent the exact posterior density functions (Gaussian). Each histogram is generated from 1×1061\times 10^{6} samples.

In this section, we provide supplementary results for the 1D Brownian motion example in Section 4.1.1, illustrating the inference of Y0Y_{0} given the observation of YTY_{T} using both the analytic-HJ-sampler and the SGM-HJ-sampler. Figure 12 displays histograms of the posterior samples. For a quantitative comparison, see Table 2 in the main text.