This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Generative Ensemble Regression: Learning Particle Dynamics from Observations of Ensembles with Physics-Informed Deep Generative Models

Liu Yang Division of Applied Mathematics, Brown University, Providence, RI 02912, USA Constantinos Daskalakis Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA George Em Karniadakis Division of Applied Mathematics, Brown University, Providence, RI 02912, USA Pacific Northwest National Laboratory, Richland, WA 99354, USA Correspondence author, [email protected]
Abstract

We propose a new method for inferring the governing stochastic ordinary differential equations (SODEs) by observing particle ensembles at discrete and sparse time instants, i.e., multiple “snapshots”. Particle coordinates at a single time instant, possibly noisy or truncated, are recorded in each snapshot but are unpaired across the snapshots. By training a physics-informed generative model that generates “fake” sample paths, we aim to fit the observed particle ensemble distributions with a curve in the probability measure space, which is induced from the inferred particle dynamics. We employ different metrics to quantify the differences between distributions, e.g., the sliced Wasserstein distances and the adversarial losses in generative adversarial networks (GANs). We refer to this method as generative “ensemble-regression” (GER), in analogy to the classic “point-regression”, where we infer the dynamics by performing regression in the Euclidean space. We illustrate the GER by learning the drift and diffusion terms of particle ensembles governed by SODEs with Brownian motions and Lévy processes up to 100 dimensions. We also discuss how to treat cases with noisy or truncated observations. Apart from systems consisting of independent particles, we also tackle nonlocal interacting particle systems with unknown interaction potential parameters by constructing a physics-informed loss function. Finally, we investigate scenarios of paired observations and discuss how to reduce the dimensionality in such cases by proving a convergence theorem that provides theoretical support.

Introduction

Classic methods for inferring the ordinary differential equation (ODE) dynamics from data usually require observations of a point or particle governed by the ODE at different time instants. We refer this learning paradigm as “point-regression”. More specifically, as illustrated in Figure 1, in point-regression problems we aim to infer the governing ODE of a point and perhaps also the initial condition, given the (possibly noisy) observations of its coordinates at different time instants. Typically, we optimize the dynamics and the initial coordinate so that the inferred curve matches the data in the Euclidean distance. Let us consider, for simplicity, the one-dimensional linear regression with quadratic loss as an example. Given observations of xx at multiple tt, we want to optimize the parameter aa in the ODE dxt/dt=adx_{t}/dt=a as well as the initial point x0=bx_{0}=b so that the mean squared L2L_{2} distance between predictions and data points is minimized, where aa and bb are the slope and the intercept of the linear function. Other examples in this category include logistic regression, recurrent neural networks [1] and the neural ODE [2] for time series, etc.

For systems consisting of an ensemble of particles, point-regression may fail to apply. For example, we want to infer the governing stochastic ordinary differential equations (SODE) from observations of particle ensembles at discrete and sparse time instants, but the data of an individual particle are not sufficiently informative for dynamic inference. Another example is a system consisting of a large number of interacting particles, even close to the mean field limit, e.g., we may want to infer how the fish interact with each other from discrete snapshots of the fish school. In such scenarios, instead of learning from individual particles, we need to learn from the particle ensembles. Specifically, we wish to infer the governing dynamics and perhaps also the initial condition, using observations of an ensemble of particle at discrete time instants. We call an observation at a single time instant a “snapshot”, where part or all of the particle coordinates in the ensemble are recorded. Since the particles could be indistinguishable in observations, especially in a large system, we thus consider the case where the data are not labeled with particle indices, in other words, we cannot pair data across snapshots.

We call this paradigm “ensemble-regression” in analogy to the “point-regression”. As illustrated in Figure 1, the initial condition and dynamics for particles would induce a curve tρtt\rightarrow\rho_{t} in the probability measure space, where ρt\rho_{t} denotes the particle distributions at time tt. Such ρt\rho_{t} will be governed by a corresponding partial differential equation (PDE), e.g., the Fokker-Planck equation if the particles are governed by SODEs of diffusion processes. We aim to optimize the dynamics and the initial distribution so that the inferred curve matches the distributions from the data, and the differences can be quantified with certain metrics.

Refer to caption
Refer to caption
Figure 1: Schematic showing the two paradigms of point-regression and ensemble-regression for dynamic inference. In point-regression problems (black labels), we aim to fit the point coordinates (blue dots) from data with the inferred curve (orange curve), determined by the initial coordinate 𝒙0\bm{x}_{0} as well as the ODE d𝒙t/dt=fθ(𝒙t,t)d\bm{x}_{t}/dt=f_{\theta}(\bm{x}_{t},t), where fθf_{\theta} is a function. As an analogue, in ensemble-regression problems (red labels), we aim to fit the distributions of the ensemble in the snapshots (blue dots) with the inferred curve (orange curve), determined by the initial distribution ρ0\rho_{0} as well as the PDE ρt/t=Lθ(ρt,t)\partial\rho_{t}/\partial t=L_{\theta}(\rho_{t},t), where LθL_{\theta} is an operator.

Herein we propose a new method to perform ensemble-regression. We use a generative model with deep neural networks as build blocks, which will generate “fake” particle systems, to represent the inferred curve in the probability measure space, and then perform regression in the probability measure space with the inferred curve. We thus name our method as generative ensemble-regression. In this paper we test the sliced Wasserstein (SW) distance [3] and the loss in generative adversarial networks (GANs) [4, 5] as two examples of metrics, the latter proved to be very effective when analyzing high dimensional data [6] in our problems.

The deep generative model is physics-informed in that our partial physical knowledge of the dynamics will be encoded into the architecture or the loss function. Such physical knowledge is sometimes essential for a correct dynamic inference, since the particle dynamic can be not unique even if the curve tρtt\rightarrow\rho_{t} and its governing PDE are fully given. For example, the following two particle dynamics with 𝒩(0,1)\mathcal{N}(0,1) as the initial distribution will lead to the same curve ρt=𝒩(0,t+1)\rho_{t}=\mathcal{N}(0,t+1):

  • Standard Brownian motion with no drift.

  • dxt/dt=xt/(2t+2)dx_{t}/dt=x_{t}/(2t+2) with no diffusion, i.e., xt=x0t+1x_{t}=x_{0}\sqrt{t+1}.

However, if we know that the particle dynamic is in the form of d𝒙t/dt=𝒗t(𝒙,t)d\bm{x}_{t}/dt=\bm{v}_{t}(\bm{x},t), so that ρt\rho_{t} is governed by the continuity equation dρt/dt=(𝒗tρt)d\rho_{t}/dt=-\nabla\cdot(\bm{v}_{t}\rho_{t}), and 𝒗t\bm{v}_{t} is limited to the L2(ρt;d)L^{2}(\rho_{t};\mathbb{R}^{d}) closure of {φ:φCc(d)}\{\nabla\varphi:\varphi\in C_{c}^{\infty}(\mathbb{R}^{d})\}, then the solution of 𝒗t\bm{v}_{t} is unique, for any curve tρtt\rightarrow\rho_{t} absolutely continuous from [a,b][a,b] to 𝒫2(d)\mathcal{P}_{2}(\mathbb{R}^{d}), where 𝒫2(d)\mathcal{P}_{2}(\mathbb{R}^{d}) is the Wasserstein-2 space of probability measures with finite quadratic moments in d\mathbb{R}^{d} [7, 8].

In the generative model, we use discretized ODE or SODE with unknown terms parameterized as neural networks. This is referred as the “neural ODE” and “neural SDE” in the literature [2, 9, 10, 11, 12, 13], but in applications the observations are mainly a time series, i.e., observations of a single particle. The idea of employing a neural networks as velocity surrogates with physics-informed loss functions in particle systems was also used for solving mean field game/control problems [14]. There are other works using GANs to solve inverse problems, including [15], where GANs were applied to learn parameters in time-independent stochastic differential equations, and [16] where GANs were applied to learn the random parameters from the (paired) observations of independent ODEs.

In this paper, we tackle two typical types of particle systems: (1) independent particle systems governed by SODEs, and (2) interacting particle systems governed by nonlocal flocking dynamics. For inferring dynamics governed by SODE, most algorithms are based on observations of a sample path, and perform the inference by calculating or approximating the probability of observations conditioned on the system parameters, using Euler-Maruyama discretization [17, 18], Kalman filtering [19], variational Gaussian process-based approximation [20, 21], etc. There are other works inferring the stochastic dynamics with the (estimated) densities from the perspective of Fokker-Planck equations [22], but this approach is hard to scale to high dimensional problems. For flocking dynamic systems, other researchers infer the key parameters in the influence function by fitting the velocity field via Bayesian optimization [23], or fitting the density field by solving a system of transformed PDEs and optimizing the parameters [24]. But these methods require the knowledge of the initial condition, including the distribution and velocity field.

Problem Setup

Refer to caption
Figure 2: Schematic of the generative model for ensemble-regression. We first use a feed-forward neural network to map the input noise to the output 𝑿~0d\tilde{\bm{X}}_{0}\in\mathbb{R}^{d}, whose distribution ρ~0\tilde{\rho}_{0} is intended to approximate the initial distribution ρ0\rho_{0}. Subsequently, we apply the discretized ODE or SODE with trainable parameters to generate particle trajectories 𝑿~t\tilde{\bm{X}}_{t} for t>0t>0 with 𝑿~0\tilde{\bm{X}}_{0} as the initial condition (brown curves). Differences between the distributions of 𝑿~t\tilde{\bm{X}}_{t} and the snapshots from data are quantified as (a part of) our loss function. For interacting particle systems, a neural network is employed as the velocity surrogate to generate the particle trajectories. Another loss function term, namely the Newton loss LNewtonL_{\text{Newton}}, is defined to quantify the consistency between the particle accelerations derived from the material derivatives, and the forces calculated from the particle interactions. With LNewtonL_{\text{Newton}}, we enforce the inferred velocity to be consistent with our partial knowledge of the dynamics.

We start from a system consisting of an ensemble of particles, where the dynamics of the particles is independent of other particles. The most commonly used stochastic processes in physics and biology are diffusion processes and Lévy processes; in particular, the particle dynamics is governed by the stochastic differential equation:

d𝑿t=𝝁tdt\displaystyle d\bm{X}_{t}=\bm{\mu}_{t}dt +𝝈td𝑩t,t0,\displaystyle+\bm{\sigma}_{t}d\bm{B}_{t},\quad t\geq 0, (1)

for diffusion processes, and

d𝑿t=𝝁tdt\displaystyle d\bm{X}_{t}=\bm{\mu}_{t}dt +𝝈td𝑳tα,t0,\displaystyle+\bm{\sigma}_{t}d\bm{L}_{t}^{\alpha},\quad t\geq 0, (2)

for Lévy processes, where 𝑿td\bm{X}_{t}\in\mathbb{R}^{d} is the position of a particle at time tt with 𝑿0\bm{X}_{0} randomly drawn from the initial distribution ρ0\rho_{0}, 𝝁td\bm{\mu}_{t}\in\mathbb{R}^{d} is the deterministic drift, 𝑩t\bm{B}_{t} and 𝑳tα\bm{L}_{t}^{\alpha} are the dd-dimensional standard Brownian motion and the α\alpha-stable symmetric Lévy process, respectively, and 𝝈td×d\bm{\sigma}_{t}\in\mathbb{R}^{d\times d} is the diffusion coefficient. In the mean field limit, the density of the particles would be governed by Fokker-Planck equations or fractional Fokker-Planck equations. For simplicity, in this work we assume that 𝝁t\bm{\mu}_{t} is a function of 𝑿t\bm{X}_{t} while 𝝈t\bm{\sigma}_{t} is constant, but, in principle, our proposed method can also tackle the time-dependent case.

Apart from systems consisting of independent particles, we further consider systems where particles interact with each other and the particle distributions exhibit more complicated behavior. As an example, we consider the Cucker-Smale particle model [25], which describes individuals in flocks with nonlocal interactions. The individual or particle motion is characterized by the following governing equations:

d𝑿t(i)/dt\displaystyle d\bm{X}_{t}^{(i)}/dt =𝝁t(i),\displaystyle=\bm{\mu}_{t}^{(i)}, (3)
d𝝁t(i)/dt=1N1ji\displaystyle d\bm{\mu}_{t}^{(i)}/dt=\frac{1}{N-1}\sum_{j\neq i} ϕ(𝑿t(i)𝑿t(j))(𝝁t(j)𝝁t(i)),\displaystyle\phi(\|\bm{X}_{t}^{(i)}-\bm{X}_{t}^{(j)}\|)(\bm{\mu}_{t}^{(j)}-\bm{\mu}_{t}^{(i)}),

where the superscripts denote the index of the particles, N1N\gg 1 is the number of particles in the system, and ϕ\phi is the influence function. Here, we set

ϕ(r)=cd,α|r|(d+α),cd,α=αΓ(d+α2)2πα+d/2Γ(1α/2),\displaystyle\phi(r)=c_{d,\alpha}|r|^{-(d+\alpha)},c_{d,\alpha}=\frac{\alpha\Gamma(\frac{d+\alpha}{2})}{2\pi^{\alpha+d/2}\Gamma(1-\alpha/2)}, (4)

where Γ\Gamma is the gamma function, dd is the dimension of 𝑿t(i)\bm{X}_{t}^{(i)}, and α\alpha is the factor that characterizes the decay rate of particle interactions as the distance rr grows. In the mean field limit, as NN grows the density of the particles is governed by the following fractional PDE:

ρt+(ρ𝒖)\displaystyle\frac{\partial\rho}{\partial t}+\nabla\cdot(\rho\bm{u}) =0,\displaystyle=0, (5)
𝒖t+𝒖𝒖\displaystyle\frac{\partial\bm{u}}{\partial t}+\bm{u}\cdot\nabla\bm{u} =[,𝒖](ρ),\displaystyle=[\mathcal{L},\bm{u}](\rho),
[,𝒖](ρ)(𝒙)=p.v. cd,αd\displaystyle[\mathcal{L},\bm{u}](\rho)(\bm{x})=\text{p.v. }c_{d,\alpha}\int_{\mathbb{R}^{d}} 𝒖(𝒚)𝒖(𝒙)𝒙𝒚d+αρ(𝒚)d𝒚,\displaystyle\frac{\bm{u}(\bm{y})-\bm{u}(\bm{x})}{\|\bm{x}-\bm{y}\|^{d+\alpha}}\rho(\bm{y})d\bm{y},

where ρ\rho is the density, 𝒖\bm{u} is the velocity field, and p.v. means the principle value. Note that α\alpha determines the fractional order.

We consider the scenario where the data available are the observations of the particles coordinates at different time instants {ti}i=1n\{t_{i}\}_{i=1}^{n} with 0t1<t2<tn0\leq t_{1}<t_{2}...<t_{n}, namely “snapshots”. In other words, the data 𝒟i\mathcal{D}_{i} for time tit_{i} will be a set of samples drawn from ρti\rho_{t_{i}}, the particle distributions at tit_{i}. If {𝒟i}i=1n\{\mathcal{D}_{i}\}_{i=1}^{n} are observations of the same set of particles and we can distinguish these particles, we refer to these cases as the “paired” observations since we can pair the particles from different snapshots. In other cases, {Di}i=1n\{D_{i}\}_{i=1}^{n} are observations of different sets of particles, or we cannot distinguish the particles. We refer to these cases as the “unpaired” observations. In this paper we mainly focus on the unpaired cases, but we also present some work on the paired cases in Paired Observations section, where we introduce how to reduce the effective dimensionality and we prove a theorem to support the introduced method. For independent particle systems, we assume that we are unaware of 𝝁t\bm{\mu}_{t} and 𝝈t\bm{\sigma}_{t}, or we may only know the parametrized forms of these terms, and we aim to infer these terms directly or through a proper parametization. For interacting particle systems, we assume we know the forms of the dynamics and the influence function, i.e., Equation 3 and 4, but need to infer the velocity field and the key parameter α\alpha.

Physics-informed Generative Model

To perform ensemble regression, we will use a generative model with deep neural networks to represent a curve in the probability measure space. Note that the curve is determined by the initial condition ρ0\rho_{0} and the governing equation, hence, the generative model consists of two parts. In the first part, we employ a feed-forward neural network GG to represent ρ0\rho_{0}. In particular, GG takes samples from random noise 𝒩\mathcal{N}, e.g., Gaussian noise, as input and the generated distribution G#𝒩{G}_{\#}\mathcal{N} is intended to approximate ρ0\rho_{0}, where # denotes the push forward operator. The second part of the generative model will generate “fake” particle trajectories with initial coordinates generated by GG, and the marginal distribution ρ~t\tilde{\rho}_{t} at time t0t\geq 0 will be used to represent ρt\rho_{t} in the curve.

Our knowledge of the physics will be incorporated into the generative model in two ways. For non-interacting particle systems, our knowledge including the form of the drift and diffusion as well as the type of stochastic processes will be directly embedded into the architecture of the generative model in the second part. For interacting particle systems, while we have no direct knowledge of the velocity field, we will enforce the inferred velocity to be consistent with our knowledge of the dynamics with a physics-based soft penalty. In the following, we introduce the details of the learning algorithm for both types of systems separately. A schematic overview of the method is shown in Figure 2.

Non-interacting Particle Systems

For non-interacting particle systems, generating particle trajectories is relatively straightforward by directly applying the discretization of governing SODE or ODE. For example, if the particle trajectories are diffusion processes or Lévy processes, we can use the following forward Euler scheme:

𝑿~0\displaystyle\tilde{\bm{X}}_{0} =G(𝒛),𝒛𝒩\displaystyle=G(\bm{z}),\quad\bm{z}\sim\mathcal{N} (6)
𝑿~(i+1)Δt\displaystyle\tilde{\bm{X}}_{(i+1)\Delta t} =𝑿~iΔt+𝝁tΔt+𝝈tΔt𝝃i,i0,\displaystyle=\tilde{\bm{X}}_{i\Delta t}+\bm{\mu}_{t}\Delta t+\bm{\sigma}_{t}\sqrt{\Delta t}\bm{\xi}_{i},\quad i\geq 0,
or 𝑿~(i+1)Δt\displaystyle\text{or }\tilde{\bm{X}}_{(i+1)\Delta t} =𝑿~iΔt+𝝁tΔt+𝝈tΔt1/α𝜻α,i,i0,\displaystyle=\tilde{\bm{X}}_{i\Delta t}+\bm{\mu}_{t}\Delta t+\bm{\sigma}_{t}\Delta t^{1/\alpha}\bm{\zeta}_{\alpha,i},\quad i\geq 0,

where Δt\Delta t is the time step, 𝝃i\bm{\xi}_{i} and 𝜻α,i\bm{\zeta}_{\alpha,i} are i.i.d. standard Gaussian random variables and α\alpha-stable random variables, respectively. We could represent 𝝁t\bm{\mu}_{t} and 𝝈t\bm{\sigma}_{t} with neural networks if they are unknown, or represent the unknown parameters with trainable variables if we know their parameterized form.

Our target is to tune the trainable variables in the generative model, including the parameters in GG and those for parameterizing 𝝁t\bm{\mu}_{t} and 𝝈t\bm{\sigma}_{t}, so that the generated marginal distribution ρ~ti\tilde{\rho}_{t_{i}} fits the data 𝒟i\mathcal{D}_{i} for each ii. We thus need to define a distance function 𝖽(,)\mathsf{d}(\cdot,\cdot) to measure the difference between the two input distributions, which can be estimated from samples drawn from the two distributions. Consequently the loss function in non-interacting particle systems is defined as:

Ldistribution\displaystyle L_{\text{distribution}} =i=1n𝖽(ρ~ti,ρ^𝒟i),\displaystyle=\sum_{i=1}^{n}\mathsf{d}(\tilde{\rho}_{t_{i}},\hat{\rho}_{\mathcal{D}_{i}}), (7)

where ρ^𝒟i\hat{\rho}_{\mathcal{D}_{i}} is the empirical distribution induced from the sample set 𝒟i\mathcal{D}_{i}. We will refer to it as the distribution loss.

There could be many ways to define 𝖽\mathsf{d}, including Wasserstein distances, maximum mean discrepancy, etc. In this paper we use two approaches to define 𝖽\mathsf{d}.

First we choose the squared sliced Wasserstein-2 (SW) distance [26, 3] as the function 𝖽\mathsf{d}:

𝖽(μ,ν)=SW22(μ,ν):=𝕊d1W22(π𝒆#μ,π𝒆#ν)𝑑d1(𝒆),\mathsf{d}(\mu,\nu)=SW_{2}^{2}(\mu,\nu):=\int_{\mathbb{S}^{d-1}}W_{2}^{2}({\pi_{\bm{e}}}_{\#}\mu,{\pi_{\bm{e}}}_{\#}\nu)d\mathcal{H}^{d-1}({\bm{e}}), (8)

where W2W_{2} is the Wasserstein-2 distance, and π𝒆#μ{\pi_{\bm{e}}}_{\#}\mu is the one dimensional distribution induced by projecting μ\mu onto the direction 𝒆\bm{e}, defined by

(π𝒆#μ)(A)=μ({𝒙d:𝒆𝒙A}),A(),({\pi_{\bm{e}}}_{\#}\mu)(A)=\mu(\{\bm{x}\in\mathbb{R}^{d}:{\bm{e}}\cdot\bm{x}\in A\}),\forall A\in\mathcal{B}(\mathbb{R}), (9)

similarly for π𝒆#ν{\pi_{\bm{e}}}_{\#}\nu. d1\mathcal{H}^{d-1} is the uniform Hausdorff measure on the sphere 𝕊d1\mathbb{S}^{d-1}. In short, the squared sliced Wasserstein-2 distance is the expectation of the squared Wasserstein-2 distance between the two input measures projected onto uniformly random directions. The sliced Wasserstein-2 distance is exactly the Wasserstein-2 distance for one-dimensional distributions, but is easier to calculate for higher dimensional distributions. We present the details of the estimation in the Supplementary Information section S1.

We also use GANs to obtain 𝖽\mathsf{d}. The generative model we introduced above can generate “fake” samples 𝑿~ti\tilde{\bm{X}}_{t_{i}} at {ti}i=1n\{t_{i}\}_{i=1}^{n}, and for each ii, we use a discriminator DiD_{i} to discriminate generated samples 𝑿~ti\tilde{\bm{X}}_{t_{i}} and real samples 𝑿ti\bm{X}_{t_{i}} from 𝒟i\mathcal{D}_{i}. The adversarial loss given by DiD_{i} can act as a metric of the difference between ρ~ti\tilde{\rho}_{t_{i}} and ρ^𝒟i\hat{\rho}_{\mathcal{D}_{i}}. In particular, we use WGAN-GP [5] as our version of GANs in our paper, with

𝖽(ρ~ti,ρ^𝒟i)\displaystyle\mathsf{d}(\tilde{\rho}_{t_{i}},\hat{\rho}_{\mathcal{D}_{i}}) =𝔼𝑿~tiρ~ti[Di(𝑿~ti)]+𝔼𝑿tiρ^𝒟i[Di(𝑿ti)],\displaystyle=-\mathbb{E}_{\tilde{\bm{X}}_{t_{i}}\sim\tilde{\rho}_{t_{i}}}[D_{i}(\tilde{\bm{X}}_{t_{i}})]+\mathbb{E}_{\bm{X}_{t_{i}}\sim\hat{\rho}_{\mathcal{D}_{i}}}[D_{i}(\bm{X}_{t_{i}})], (10)

and the loss function for each discriminator DiD_{i} is defined as

LDi=\displaystyle L_{D_{i}}= 𝔼𝑿~tiρ~ti[Di(𝑿~ti)]𝔼𝑿tiρ^𝒟i[Di(𝑿ti)]\displaystyle\mathbb{E}_{\tilde{\bm{X}}_{t_{i}}\sim\tilde{\rho}_{t_{i}}}[D_{i}(\tilde{\bm{X}}_{t_{i}})]-\mathbb{E}_{\bm{X}_{t_{i}}\sim\hat{\rho}_{\mathcal{D}_{i}}}[D_{i}(\bm{X}_{t_{i}})] (11)
+λ𝔼𝒙^iρ𝒙^i[(𝒙^iDi(𝒙^i)21)2], for i=1,2n,\displaystyle+\lambda\mathbb{E}_{\hat{\bm{x}}_{i}\sim\rho_{\hat{\bm{x}}_{i}}}[(\|\nabla_{\hat{\bm{x}}_{i}}D_{i}(\hat{\bm{x}}_{i})\|_{2}-1)^{2}],\text{ for }i=1,2...n,

where ρ𝒙^i\rho_{\hat{\bm{x}}_{i}} is the distribution generated by uniform sampling on straight lines between pairs of points sampled from ρ~ti\tilde{\rho}_{t_{i}} and ρ^𝒟i\hat{\rho}_{\mathcal{D}_{i}}, and λ=0.1\lambda=0.1 is the gradient penalty coefficient. Here, 𝖽(ρ~ti,ρ^𝒟i)\mathsf{d}(\tilde{\rho}_{t_{i}},\hat{\rho}_{\mathcal{D}_{i}}) can be mathematically interpreted as the Wasserstein-1 distance between the two input distributions. WGAN-GP is computationally more expensive than the sliced Wasserstein distance since we need to train the generative model and the discriminators iteratively, but it is more scalable to high dimensional problems, for which we made a comparison in the supplementary information section S1.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
(a)
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
(b)
Figure 3: Results for 1D problems with (a) Brownian Motion and (b) Lévy process as the stochastic term. The two rows of (b) show the results with (top row) and without (bottom row) bounded map pre-processing, respectively. The first column visualizes densities estimated from training data. The second column shows the inferred drift functions in the end of training. The third column shows the inferred coefficient for the stochastic term during training. The fourth and fifth columns show the inferred densities in the two cases of using a cubic polynomial and a neural network to parameterize the drift, respectively. The solid lines and shaded areas refer to the mean and two standard deviations of three runs with different data.

Interacting Particle Systems

The straightforward discretization of the governing Equations 3 cannot be directly applied to generate “fake” particle trajectories in interacting particle systems, especially for those with strong nonlocal interactions, since the computational cost for one time step would be O(N2)O(N^{2}), where NN is the number of particles required so that the system is close to the mean field limit, which makes the learning almost intractable. Instead, we propose to first employ a neural network 𝝁~\tilde{\bm{\mu}} as a surrogate model of the velocity field 𝝁𝒕\bm{\mu_{t}} in the spatial-temporal domain to generate trajectories, and then apply an additional penalty in the loss function to enforce the velocity field 𝝁~\tilde{\bm{\mu}} to be consistent with the Equations 3. In this paper we use the forward Euler scheme to generate trajectories:

𝑿~0\displaystyle\tilde{\bm{X}}_{0} =G(𝒛),𝒛𝒩\displaystyle=G(\bm{z}),\quad\bm{z}\sim\mathcal{N} (12)
𝑿~(i+1)Δt\displaystyle\tilde{\bm{X}}_{(i+1)\Delta t} =𝑿~iΔt+𝝁~(𝑿~iΔt,iΔt)Δt,i0,\displaystyle=\tilde{\bm{X}}_{i\Delta t}+\tilde{\bm{\mu}}(\tilde{\bm{X}}_{i\Delta t},i\Delta t)\Delta t,\quad i\geq 0,

where Δt\Delta t is the time step. By employing the surrogate model for velocity, the computational cost for generating the particle trajectories is linear, instead of quadratic, with the number of particles.

To enforce 𝝁~\tilde{\bm{\mu}} to be consistent with our knowledge of the dynamics of Equations 3, we first randomly generate KK particle trajectories with Equation 12, denoted as {{𝑿~iΔt(k)}i=0I}k=1K\{\{\tilde{\bm{X}}^{(k)}_{i\Delta t}\}_{i=0}^{I}\}_{k=1}^{K}, and then calculate the “forces” applied on these kk particles by another MM randomly generated particles {{𝒀~iΔt(m)}i=0I}m=1M\{\{\tilde{\bm{Y}}^{(m)}_{i\Delta t}\}_{i=0}^{I}\}_{m=1}^{M}, using the formula of interactions in Equations 3, where we replace the unknown parameter α\alpha with a trainable variable. Meanwhile, from the velocity field 𝝁~\tilde{\bm{\mu}} we can directly calculate the accelerations for these kk particles using a material derivative. The forces and accelerations should be consistent with each other. We can thus define an L2L_{2} loss:

LNewton=1I+1i=0I1Kk=1K(𝑭i(k)𝒂i(k))2,\displaystyle L_{\text{Newton}}=\frac{1}{I+1}\sum_{i=0}^{I}\frac{1}{K}\sum_{k=1}^{K}(\bm{F}_{i}^{(k)}-\bm{a}_{i}^{(k)})^{2}, (13)
𝑭i(k)=\displaystyle\bm{F}_{i}^{(k)}= 1Mm=1Mϕ(𝑿~iΔt(k)𝒀~iΔt(m))(𝝁~(𝒀~iΔt(m),iΔt)𝝁~(𝑿~iΔt(k),iΔt)),\displaystyle\frac{1}{M}\sum_{m=1}^{M}\phi(\|\tilde{\bm{X}}^{(k)}_{i\Delta t}-\tilde{\bm{Y}}^{(m)}_{i\Delta t}\|)(\tilde{\bm{\mu}}(\tilde{\bm{Y}}^{(m)}_{i\Delta t},i\Delta t)-\tilde{\bm{\mu}}(\tilde{\bm{X}}^{(k)}_{i\Delta t},i\Delta t)),
𝒂i(k)=\displaystyle\bm{a}_{i}^{(k)}= D𝝁~(𝑿~iΔt(k),iΔt))Dt\displaystyle\frac{D\tilde{\bm{\mu}}(\tilde{\bm{X}}^{(k)}_{i\Delta t},i\Delta t))}{Dt}
=\displaystyle= 𝝁~(𝑿~iΔt(k),iΔt))t+𝝁~(𝑿~iΔt(k),iΔt))𝝁~(𝑿~iΔt(k),iΔt)),\displaystyle\frac{\partial\tilde{\bm{\mu}}(\tilde{\bm{X}}^{(k)}_{i\Delta t},i\Delta t))}{\partial t}+\tilde{\bm{\mu}}(\tilde{\bm{X}}^{(k)}_{i\Delta t},i\Delta t))\cdot\nabla\tilde{\bm{\mu}}(\tilde{\bm{X}}^{(k)}_{i\Delta t},i\Delta t)),

and we name it as the Newton loss. The time span [0,IΔt][0,I\Delta t] should cover the time for the latest snapshot tnt_{n}, and in this paper we simply set II as the ceiling of tn/Δtt_{n}/\Delta t. To reduce computational cost, the average over I+1I+1 time steps in LNewtonL_{\text{Newton}} can also be approximated by mini-batch, i.e., taking average over BB random time steps in each training step.

For one time step, the computational cost is O(KM)O(KM) for calculating the force terms and O(K)O(K) for the acceleration terms. Note that MM should be larger than or equal to NN, but KK can be much less than NN, thus the total computational cost for LNewtonL_{\text{Newton}} can be much less than O(N2)O(N^{2}), and this makes the learning tractable.

In the end, the loss for interacting particle systems will be a combination of the Newton loss and the distribution loss:

L=ηLNewton+1ni=1n𝖽(ρ~ti,ρ^𝒟i),\displaystyle L=\eta L_{\text{Newton}}+\frac{1}{n}\sum_{i=1}^{n}\mathsf{d}(\tilde{\rho}_{t_{i}},\hat{\rho}_{\mathcal{D}_{i}}), (14)

where the weight η\eta is set as 1 in this paper.

Modifications to the Distributions

In order to provide flexibility and adapt the method to different problems, we can also modify the generated distribution and real data before feeding them to the distance function d and calculate the distribution loss. We present some examples here.

In some cases ρt\rho_{t} may have heavy tails, e.g., when the particle trajectories correspond to a Lévy process. The heavy tails could spoil the training, since the rare outlier samples could dominate the loss function. We can choose a suitable bounded map h:ddh:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} to preprocess both the generated samples and the real data so that the heavy tails are removed. If the observations of the particle coordinates are noisy, we can also perturb the generated samples to add artificial noise. By scaling the artificial noise with trainable variables, the size of the noise in observations can also be learned. If we can only make observations of particles in a specific domain, i.e., the observations are truncated, we will also filter the generated samples using a corresponding mask so that the effective domain for the generated samples and observations are the same.

Computational Results

All the neural networks in the main text are feed-forward neural networks with three hidden layers, each of width 128, except the discriminator for high dimensional problems which is a ResNet with 5 hidden layers, each of width 256, and shortcut connections skipping one hidden layer. We use the leaky ReLu [27] activation function with α=0.2\alpha=0.2 for the discriminator neural networks in WGAN-GP, while using the hyperbolic tangent activation function for other neural networks. The neural network weights are initialized with the uniform Xavier initializer, and the biases are initialized as 0. The variables for parameterizing the diffusion and noise size are initialized as 0 (before been activated by the softplus function). The drift parameters are initialized as 0 in 1D problems, 0.5-0.5 in high dimensional problems, and randomly initialized with standard Gaussian distribution in 2D problems. The batch size for the distribution loss is 1000 for non-interacting particle systems, except in the 2D problem with truncated observations, where we generate 10000 samples to compensate the loss of samples due to filtering. We use the Adam optimizer [28] with lr=0.0001,β1=0.9,β2=0.999lr=0.0001,\beta_{1}=0.9,\beta_{2}=0.999 for the cases using the sliced Wasserstein distance, while lr=0.0001,β1=0.5,β2=0.9lr=0.0001,\beta_{1}=0.5,\beta_{2}=0.9 for the cases with WGAN-GP. The time step in the generative model is set as Δt=0.01\Delta t=0.01.

1D Problems: Brownian Motion and Lévy Process

In this section, we test our method on the 1D problems using the sliced Wasserstein distance as 𝖽\mathsf{d}. We first consider the SODE with Brownian motion and then with the α\alpha-stable symmetric Lévy process as the stochastic term:

dXt\displaystyle dX_{t} =(XtXt3)dt+dBt,\displaystyle=(X_{t}-{X_{t}}^{3})dt+dB_{t}, (15)
ordXt\displaystyle\text{or}\quad dX_{t} =(XtXt3)dt+dLtα,t0,\displaystyle=(X_{t}-{X_{t}}^{3})dt+dL_{t}^{\alpha},\quad t\geq 0,

where α=1.5\alpha=1.5, with ρ0=𝒩(0,0.04)\rho_{0}=\mathcal{N}(0,0.04).

Note that we have no knowledge of ρ0\rho_{0} in learning. Also, we assume we know that the stochastic term has a constant coefficient but we need to infer it; here the ground truth is 1.01.0. To represent the unknown coefficient, we use a trainable variable rectified by a softplus function softplus(x)=ln(1+ex)\text{softplus}(x)=\ln(1+e^{x}), which ensures positivity. As for the drift μ(x)=xx3\mu(x)=x-x^{3}, we consider the following two cases for both SODE problems. In case 1, we know that the drift μ(x)\mu(x) is a cubic polynomial of xx. In this case, we use a cubic polynomial a0+a1x+a2x2+a3x3a_{0}+a_{1}x+a_{2}x^{2}+a_{3}x^{3} to parameterize μ(x)\mu(x), and want to infer the four coefficients a0,a1,a2a_{0},a_{1},a_{2} and a3a_{3}. In case 2, we only know that the drift μ(x)\mu(x) is a function of xx and hence we use a neural network to parameterize μ(x)\mu(x).

For the SODE problem with Brownian motion, we first prepare a pool of 10510^{5} sample paths, then independently draw 10,00010,000 samples at t=0.2,0.5,1.0t=0.2,0.5,1.0 from the pool as our training data. The results for both cases of drift parameterization are illustrated in Figure 3(a). In the figure, all the densities are estimated using Gaussian kernel density estimation with Scott’s rule, and the inferred densities and the reference densities come from 10610^{6} samples produced by the generative model or simulation.

Both cases of drift parameterization provide a good inference of the diffusion coefficient, with an error less than 7%7\% after 2×1052\times 10^{5} training steps, in all the runs. When using the cubic polynomial parameterization, the inferred drift fits well with the ground truth, with the relative L2L_{2} error about 3%3\% in [3,3][-3,3] averaged over three runs. The inferred drift using the neural network only fits the ground truth in the region between -1.5 and 1.5. This is reasonable since the particles mainly concentrate in this region, and we can hardly learn the drift outside this region, where the training data are sparse. However, we note that such an inference of drift is sufficiently good for an accurate time extrapolation of the particle distribution at t=5t=5.

One interesting observation is that in case 1 the inferred densities are more accurate than those estimated from the training data. This is because our knowledge of the governing SODE bridges the limited samples in three snapshots, and as we infer the density at, e.g., t=0.2t=0.2, we are not only utilizing the data at t=0.2t=0.2 but also implicitly leveraging the data at t=0.5t=0.5 and 1.01.0. We can make an analogy in the context of linear regression: multiple noisy data are helpful for inferring the hidden ground truth, since they are bridged by the regressed linear function. We present a more detailed discussion on this topic in the supplementary information in section S2.

For the SODE problem with Lévy process, we prepare 1.5×1051.5\times 10^{5} sample paths, then independently draw 10,00010,000 samples within the region [1000,1000][-1000,1000] at t=0.2,0.5,1.0t=0.2,0.5,1.0 from the pool as our training data. To prevent instability during the training, we apply double precision and clip the generated α\alpha-stable random variable in Equation 2 between 100-100 and 100100.

We first test our method without pre-processing the samples as in the problem with Brownian motion. As we can see in the second row of Figure 3(b), the inferences are far away from the ground truth. This is due to the heavy tail of ρt\rho_{t} in the Lévy process: some samples far away from 0, although rare, could dominate the loss function and spoil the training. To deal with this problem, we then apply a bounded map h(x)=2tanh(x/2)h(x)=2\tanh(x/2) to all the generated and real samples before feeding them to 𝖽\mathsf{d}. The results are shown in the first row of Figure 3(b) where we can see the inferences are much improved. In case 2 where the drift is parameterized by neural networks, the inferred drift outside [1.5,1.5][-1.5,1.5] is better than that in the problem with Brownian motion. This is because the samples are more scattered in the Lévy case.

2D Problems: Various Scenarios of Observations

In this section, we test our method on 2D problems using the sliced Wasserstein distance as 𝖽\mathsf{d}, with various scenarios of observations. We consider the following 2D SODE:

d𝑿t\displaystyle d\bm{X}_{t} =𝝁(𝑿𝒕)dt+[s00s1s2]d𝑩t\displaystyle=\bm{\mu}(\bm{X_{t}})dt+\begin{bmatrix}s_{0}&0\\ s_{1}&s_{2}\end{bmatrix}d\bm{B}_{t} (16)

where

𝝁(𝒙)=𝒙φ(𝒙),\displaystyle\bm{\mu}(\bm{x})=\nabla_{\bm{x}}\varphi(\bm{x}), (17)

and

φ(𝒙)=\displaystyle\varphi(\bm{x})= (x1+a0)2(x2+a1)2\displaystyle-(x_{1}+a_{0})^{2}(x_{2}+a_{1})^{2} (18)
(x1+a2)2(x2+a3)2for 𝒙=(x1,x2).\displaystyle-(x_{1}+a_{2})^{2}(x_{2}+a_{3})^{2}\quad\text{for }\bm{x}=(x_{1},x_{2}).

The parameters are set as a0=a1=s0=s1=s2=1.0a_{0}=a_{1}=s_{0}=s_{1}=s_{2}=1.0, a2=a3=0.5a_{2}=a_{3}=-0.5. We set the initial distribution as ρ0=𝒩(0,𝑰2)\rho_{0}=\mathcal{N}(0,\bm{I}_{2}). We assume we know that the diffusion coefficient is a constant lower triangular matrix but we need to infer the three unknown parameters s0,s1,s2s_{0},s_{1},s_{2}. In particular, we use (softplus(s~0),s~1,softplus(s~2))(\text{softplus}(\tilde{s}_{0}),\tilde{s}_{1},\text{softplus}(\tilde{s}_{2})) to approximate (s0,s1,s2)(s_{0},s_{1},s_{2}), where s~0,s~1,s~2\tilde{s}_{0},\tilde{s}_{1},\tilde{s}_{2} are three trainable variables. Similar as in the 1D problems, we consider the following two cases for the drift. In case 1, we know the form of 𝝁\bm{\mu} and φ\varphi in Equation 17 and 18 but we need to infer a0,a1,a2a_{0},a_{1},a_{2} and a3a_{3} ((a0,a1)(a_{0},a_{1}) and (a2,a3)(a_{2},a_{3}) are exchangeable). In case 2, we only know that 𝝁\bm{\mu} is a gradient of φ\varphi in Equation 17, but we have no knowledge of φ\varphi. In this case, we use a neural network to parameterize φ\varphi.

For the training data, we prepare 10510^{5} sample paths, and consider the following various scenarios of observations at t=0,0.1,0.2,0.3,0.5,0.7,1.0t=0,0.1,0.2,0.3,0.5,0.7,1.0. The data are visualized in the supplementary information section S3.

  • Scenario 1: we assume that our observations are ideal, i.e., we make accurate observations of all the particle coordinates as the training data.

  • Scenario 2: we assume that our observations are noisy. Specifically, we make observations of all the particle coordinates, but each coordinate is perturbed by an i.i.d. random noise 𝒩(𝟎,e2𝑰2)\mathcal{N}(\bm{0},e^{2}\bm{I}_{2}), where ee is set as 0.20.2 but also need to be inferred during learning.

  • Scenario 3: we assume that our observations are truncated. Specifically, we make observations of the particle coordinates in Ω=(,0.5)×\Omega=(\infty,0.5)\times\mathbb{R}, with the particles outside of Ω\Omega dropped.

For the first scenario, the inferred drift and diffusion are illustrated in Figure 4(a). In case 1, all the inferred drift and diffusion parameters approach the ground truth during the training, with relative error less than 0.3%0.3\% for each aia_{i} and less than 4%4\% for each σi\sigma_{i} after 2×1052\times 10^{5} training steps. In case 2, the inferred diffusion parameters are also close to the ground truth, with error comparable with that in case 1. In the region where the training data are dense (i.e., the density estimated from all the training data is no less than 0.05), the inferred drift field has a relative L2L_{2} error about 13%13\%. The inference is much worse in other regions, and the reason is the same as in the 1D problem: the neural network can hardly learn the drift field where the training data are sparse.

For the second and third scenarios, we apply the technique of perturbing and filtering the generated samples, respectively. In Figure 4(b) and Figure 4(c) we show the results of case 1 with the drift parameterized by four parameters. After 2×1052\times 10^{5} training steps, all the parameters converge to ground truth. For scenario 2, the error is less than 0.5%0.5\% for each drift parameter aia_{i}, less than 3%3\% for each diffusion parameter sis_{i}, and about 5%5\% for the noise parameter ee. For scenario 3, the error is less than 0.2%0.2\% for each drift parameter aia_{i}, less than 1%1\% for each diffusion parameter sis_{i}. We remark that we failed to learn the drift with neural network parameterization in the second and third scenario, suggesting a room for further improvements.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
(a)
Refer to caption
Refer to caption
(b)
Refer to caption
Refer to caption
(c)
Figure 4: Inferred drift and diffusion in 2D problems, with (a) ideal observations, (b) noisy observations, and (c) truncated observations. In (a), the left column shows the results in case 1 where the drift is parameterized by four parameters, while the right column shows the results in case 2 where the drift is parameterized by a neural network. In the top right figure of (a), we visualize the drift field where the training data are dense. The red and black arrows represent the inferred drift and the exact drift, respectively. For each arrow, the length represents the norm of the drift, scaled by 0.10.1. The green dots are samples from the merged training data.

Higher Dimensional Problems

In this section, we test our method on higher dimensional non-interacting particle systems. Note that the sliced Wasserstein distance does not perform well in high dimensional problems, and we thus switch to the WGAN-GP to provide 𝖽\mathsf{d}. The comparison between the sliced Wasserstein distance and WGAN-GP in high dimensional problems is presented in the supplementary information section S1.

We consider a dd-dimensional SODE:

d𝑿t\displaystyle d\bm{X}_{t} =𝝁(𝑿𝒕)dt+𝝈d𝑩t\displaystyle=\bm{\mu}(\bm{X_{t}})dt+\bm{\sigma}d\bm{B}_{t} (19)

where

μ(i)(𝑿𝒕)=Xt(i)(Xt(i))3,i=1,2,d,\displaystyle\mu^{(i)}(\bm{X_{t}})=X_{t}^{(i)}-(X_{t}^{(i)})^{3},i=1,2...,d, (20)

for μ(i)\mu^{(i)} as the ii-th component of 𝝁d\bm{\mu}\in\mathbb{R}^{d}, and Xt(i)X_{t}^{(i)} is the ii-th component of 𝑿td\bm{X}_{t}\in\mathbb{R}^{d}. We set the diffusion coefficient matrix as

𝝈=[s10000s2s20000s3s30000s4s40000sdsd],\displaystyle\bm{\sigma}=\begin{bmatrix}s_{1}&0&0&0&\cdots&0\\ s^{\prime}_{2}&s_{2}&0&0&\cdots&0\\ 0&s^{\prime}_{3}&s_{3}&0&\cdots&0\\ 0&0&s^{\prime}_{4}&s_{4}&\cdots&0\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&s^{\prime}_{d}&s_{d}\\ \end{bmatrix}, (21)

where the 2d12d-1 nonzero entries {si}i=1d\{s_{i}\}_{i=1}^{d} and {si}i=2d\{s^{\prime}_{i}\}_{i=2}^{d} are set as 1. The initial distribution is ρ0=𝒩(𝟎,0.04𝑰d)\rho_{0}=\mathcal{N}(\bm{0},0.04\bm{I}_{d}). Due to the non-diagonal diffusion coefficient matrix, the motion in different dimensions are coupled.

Refer to caption
Refer to caption
Figure 5: Inferred drift in the end of training and diffusion parameters during training, in 5D, 10D, 50D and 100D problems. The solid lines and shaded areas are the mean and two standard deviations over all ii and three runs with different random seeds.

We prepare 10510^{5} sample paths and observe all the particle positions at t=0.2,0.5,1.0t=0.2,0.5,1.0 as our training data. We use a cubic polynomial with four variables to parameterize μ(i)\mu^{(i)} as a function of Xt(i)X_{t}^{(i)} for each ii, while using 2d12d-1 trainable variables to learn the nonzero entries in the diffusion coefficient matrix, with the diagonal entries rectified by a softplus function for positivity. In other words, we use 6d16d-1 variables to parameterize the drift and diffusion for the dd-dimensional SODE. The results are shown in Figure 5. Even for the 100-dimensional problem, after 6×1056\times 10^{5} training steps, the average error of the diffusion coefficients is about 0.040.04, and the average relative L2L_{2} error of the drift in the interval [3,3][-3,3] in each dimension is about 14%14\%.

It may appear surprising that we can solve a 100-dimensional problem with only 10510^{5} samples since usually an exponentially large number of samples is required to describe the distribution. However, we remark that the difficulty of the learning task is significantly reduced since our partial knowledge, i.e., the form of the SODE, is encoded into the generator.

Interacting Particle Systems

In this section we consider the 1D and 2D cases of the interacting particle system with governing equations 3. To prepare the training data, we follow [23] and set the initial condition, including the density and velocity as

ρ0(x)\displaystyle\rho_{0}(x) =𝟙[0.75,0.75]π3cos(3πx2),\displaystyle=\mathbbm{1}_{[-0.75,0.75]}\frac{\pi}{3}\cos\left(\frac{3\pi x}{2}\right), (22)
μ0(x)\displaystyle\mu_{0}(x) =12sin(3πx2),\displaystyle=-\frac{1}{2}\sin\left(\frac{3\pi x}{2}\right),

for the 1D case and

ρ0(x1,x2)\displaystyle\rho_{0}(x_{1},x_{2}) =𝟙[0.75,0.75]2(π3)2cos(3πx12)cos(3πx22),\displaystyle=\mathbbm{1}_{[-0.75,0.75]^{2}}\left(\frac{\pi}{3}\right)^{2}\cos\left(\frac{3\pi x_{1}}{2}\right)\cos\left(\frac{3\pi x_{2}}{2}\right), (23)
𝝁0(x1,x2)\displaystyle\bm{\mu}_{0}(x_{1},x_{2}) =(122sin(3πx12),122sin(3πx22))\displaystyle=\left(-\frac{1}{2\sqrt{2}}\sin(\frac{3\pi x_{1}}{2}),-\frac{1}{2\sqrt{2}}\sin(\frac{3\pi x_{2}}{2})\right)

for the 2D case, where 𝟙\mathbbm{1} is the indicator function. Using the Velocity-Verlet method with time step 0.01, we perform simulations with 1024 and 9976 particles for the 1D and 2D cases, respectively, and generate data at t=0.5,0.6,,2.0t=0.5,0.6,...,2.0, i.e., 16 snapshots in total. The input radius rr of the influence function ϕ(r)\phi(r) is clipped to be at least rmin=0.01r_{\text{min}}=0.01, both in simulation and learning, to avoid the singularity. While using the same number of particles and make observations at the same time instants as in [23], our method does not require knowledge of the initial condition or the velocity field in the data snapshots, as opposed to their method based on Bayesian optimization to infer α\alpha.

We apply the sliced Wasserstein distance for the distribution loss, with the batch size equal to the number of particles in training data. When calculating the distance at each time instant, the generated and real distributions are normalized with the mean and standard deviation of the real distribution. For the Newton loss, we set K=16K=16, M=10000M=10000 and B=10B=10. We use 2×sigmoid(β)2\times\text{sigmoid}(\beta) to represent the inferred α\alpha so that the inference is bounded by 0 and 2, and the variable β\beta is initialized as 0.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
(c)
Figure 6: Results for the 1D and 2D problems of interacting particle systems. (a): The inferred alpha in 1D and 2D problems. (b): The inferred velocity field against the reference velocity from simulation in the 1D problem. The dots also show the distribution of training data. (c): The inferred velocity field (red arrows) against the reference velocity from simulation (black arrows) in the 2D problem. For each arrow, the length represents the norm of the velocity scaled by 0.30.3. The green dots are random samples from training data.

The results are visualized in Figure 6. For the 1D case, the inferred α\alpha is 0.492 with ground truth 0.5 in the end of training. As a comparison, the inference is 0.480 for the 1D case in [23]. Our inferred velocity field also matches well with the reference ground truth at different time instants, even at t=0t=0 and 0.250.25 when we have no observations at all. This is because we incorporated the knowledge of dynamics in the learning system. For the 2D case, while the inferred velocity field also matches well with the ground truth, the inferred α\alpha is 0.469 with ground truth 0.5 in the end of training, i.e., about 6% error, much larger than in the 1D case. We attribute the error to the fact that the singularity problem for ϕ(r)\phi(r), where the order for rr is dimension-dependent, is more severe in the 2D case. In supplementary information section S4, we show that as we increase rminr_{\min} from 0.01 to 0.1, the error is reduced from 6% to 2%. Alternatively, in the 2D case we also try to directly use the data at t=0.5t=0.5 as the starting coordinates to generate trajectories for t0.5t\geq 0.5 (MM is also reduced to the data size 99769976), so that the initial generator GG is removed from the generative model. By doing so, the number of trainable variables is reduced and the learning becomes easier. The inferred α\alpha is 0.493 in the end of training. As a comparison, the inference is 0.513 for the 2D case in [23]. However, we remark that while this strategy helps to infer α\alpha without extra data, we cannot infer the velocity or density for 0t<0.50\leq t<0.5.

Paired Observations

We should note that the convergence of the marginal distribution in each snapshot does not necessarily lead to the convergence of the joint distribution of coordinate tuples (𝑿t1,𝑿t2,,𝑿tn)(\bm{X}_{t_{1}},\bm{X}_{t_{2}},...,\bm{X}_{t_{n}}). While the joint distribution is not available for unpaired observations that we are focused on so far, in other cases where the observed particle coordinates can be paired across snapshots, it is possible to improve the inference by fitting the joint distribution of coordinate tuples with the corresponding generated joint distribution.

In [29], the authors pointed out that Wasserstein convergence of the distribution of (𝑿t1,𝑿t2,,𝑿tn)(\bm{X}_{t_{1}},\bm{X}_{t_{2}},...,\bm{X}_{t_{n}}) is equivalent to Wasserstein convergence of the distributions of (𝑿ti,𝑿ti+1)(\bm{X}_{t_{i}},\bm{X}_{t_{i+1}}) for all i=1,2,,n1i=1,2,...,n-1, if (𝑿t1,𝑿t2,,𝑿tn)(\bm{X}_{t_{1}},\bm{X}_{t_{2}},...,\bm{X}_{t_{n}}) is a Markov chain. However, the sample spaces for {𝑿ti}i=1n\{\bm{X}_{t_{i}}\}_{i=1}^{n} are limited to be finite and discrete in [29], thus the result doesn’t apply to dynamic systems in continuous spaces. Here, we present a new theorem for continuous sample spaces:

Theorem 1.

Let (X1,X2,XT)(X_{1},X_{2},...X_{T}) be a Markov chain of length T3T\geq 3 and we use Xi:jX_{i:j} to denote the nodes (Xi,Xi+1Xj)(X_{i},X_{i+1}...X_{j}), for iji\leq j. Suppose the domain DtD_{t} for XtX_{t} is a compact subset of dt\mathbb{R}^{d_{t}} for t=1,2Tt=1,2...T. We use the lql_{q} (q1q\geq 1) Euclidean metric for all the Euclidean spaces with different dimensions.

Let {PnXi:j}n=1\{P_{n}^{X_{i:j}}\}_{n=1}^{\infty} and PXi:jP^{X_{i:j}} be probability measures of Xi:jX_{i:j} for iji\leq j, PnXi|XjP_{n}^{X_{i}|X_{j}} and PXi|XjP^{X_{i}|X_{j}} be the corresponding probability transition kernels. If PnXt:t+1P_{n}^{X_{t:t+1}} converges to PXt:t+1P^{X_{t:t+1}} in Wasserstein-pp (p1p\geq 1) metric for all t=1,2T1t=1,2...T-1, PnXt|Xt+1P_{n}^{X_{t}|X_{t+1}} and PXt+2|Xt+1P^{X_{t+2}|X_{t+1}} as functions of Xt+1X_{t+1} are CC-Lipschitz continuous in Wasserstein-pp metric for all t=1,2T2t=1,2...T-2 and nn, where CC is a constant, then PnX1:TP_{n}^{X_{1:T}} converges to PX1:TP^{X_{1:T}} in Wasserstein-pp metric.

We present the proof for Theorem 1 in the supplementary information section S5. The assumption on the continuity of the probability transition kernels is not required for finite discrete sample spaces in [29], but the theorem does not hold without it for continuous sample space. We provide a counter-example in the supplementary information section S6.

The theorem states that under certain conditions, we can set our goal as fitting the distributions of (𝑿ti,𝑿ti+1)(\bm{X}_{t_{i}},\bm{X}_{t_{i+1}}), i.e., coordinate pairs from adjacent snapshots. This should be easier compared with directly fitting the distribution of (𝑿t1,𝑿t2,,𝑿tn)(\bm{X}_{t_{1}},\bm{X}_{t_{2}},...,\bm{X}_{t_{n}}), since the effective dimensionality is reduced. We still can view this approach as “ensemble-regression”, except that instead of a curve, we try to fit the data with a 2D surface (ti,tj)ρti,tj(t_{i},t_{j})\rightarrow\rho_{t_{i},t_{j}} in the probability measure space, where ρti,tj\rho_{t_{i},t_{j}} denotes the joint distributions of (𝑿ti,𝑿tj)(\bm{X}_{t_{i}},\bm{X}_{t_{j}}).

Refer to caption
Refer to caption
Figure 7: Inferred drift function in the end of training and diffusion coefficient during training for the OU process problem, using a linear function or a neural network to parameterize the drift function.

As an illustration, we study the 1D Ornstein–Uhlenbeck (OU) process:

dXt=Xtdt+2dWt,\displaystyle dX_{t}=-X_{t}dt+\sqrt{2}dW_{t}, (24)

with ρ0=𝒩(0,1)\rho_{0}=\mathcal{N}(0,1), so that ρt=𝒩(0,1)\rho_{t}=\mathcal{N}(0,1) for any t>0t>0. This is a special example as the governing SODE is not unique given ρt\rho_{t}. We make observations of 100 particles at t=0.1,0.2,0.3,0.4,0.5,0.75,1.0t=0.1,0.2,0.3,0.4,0.5,0.75,1.0 as data, and compare the inferences by fitting the marginal distributions of individual coordinates or the joint distributions of adjacent coordinate pairs using the SW distance. The drift function is parameterized by a linear function y(x)=axy(x)=ax or a neural network, while the diffusion coefficient is represented by a trainable variable rectified by a softplus function. We present the results in Figure 7, where we can clearly see the failure in the cases of fitting the marginal distributions, while fitting the joint distributions works very well.

Summary and Discussion

We have proposed a new method for inferring the governing dynamics of particles from unpaired observations of their coordinates at multiple time instants, namely “snapshots”. We fitted the observed particle ensemble distribution with a physics-informed generative model, which can be viewed as performing regression in the probability measure space. We refer to this approach as generative “ensemble-regression”, in analogy to the classic “point-regression”, where we infer the dynamics by performing regression in the Euclidean space.

We first applied the method to particle systems governed by independent stochastic ordinary differential equations (SODE) with Brownian or Lévy noises, where we inferred the drift and the diffusion terms from a small number of snapshots. In the Lévy noise case, we demonstrated that the heavy tails in the distributions could spoil the training, but we addressed this issue by applying a preprocessing map to both the generated and target distributions. In scenarios with noisy or truncated training data, we modified the generated distributions accordingly by perturbing or filtering the generated samples. We then addressed high-dimensional SODE problems using the adversarial loss in GANs. In the end, we managed to learn the parameters for particle interactions in a nonlocal flocking systems.

It is possible to apply our method to learn the interaction parameters for particle-based simulation methods. In particular, we will fit the target mass and velocity distributions coming from the analytical solution or other simulation methods that are accurate but expensive. We leave this promising research direction for future work.

Acknowledgement

This work was supported by the PhILMS grant DE-SC0019453 and by the OSD/AFOSR MURI Grant FA9550-20-1-0358. We would like to thank Prof. Hui Wang and Ms. Tingwei Meng for carefully checking the proof of our theorem. We also want to thank Dr. Zhongqiang Zhang for helpful discussions.

Supplementary Information

S1. Comparison between Sliced Wasserstein Distance and WGAN-GP

The squared sliced Wasserstein-2 distance is the expectation of the squared Wasserstein-2 distance between the two input measures projected onto uniformly random directions. To estimate SW2(μ,ν)SW_{2}(\mu,\nu) from samples of μ\mu and ν\nu, we use the following process introduced in [3].

  • Draw samples independently from μ\mu and ν\nu with batch size bb, denoted as 𝒰\mathcal{U} and 𝒱\mathcal{V}.

  • Uniformly sample mm projection directions {𝒆j}j=1m\{\bm{e}_{j}\}_{j=1}^{m} in d\mathbb{R}^{d}. In this paper we set m=1000m=1000.

  • For each random direction 𝒆j\bm{e}_{j}, project and sort the samples in 𝒰\mathcal{U} and 𝒱\mathcal{V} in the direction of 𝒆j\bm{e}_{j}, getting {ui,j}i=1b\{u_{i,j}\}_{i=1}^{b} and {vi,j}i=1b\{v_{i,j}\}_{i=1}^{b}, where ui,jui+1,ju_{i,j}\leq u_{i+1,j} and vi,jvi+1,jv_{i,j}\leq v_{i+1,j} for i=1,2b1i=1,2...b-1. Calculate Lj=i=1b(ui,jvi,j)2/bL_{j}=\sum_{i=1}^{b}(u_{i,j}-v_{i,j})^{2}/b.

  • Calculate L=j=1mLj/mL=\sum_{j=1}^{m}L_{j}/m as the estimation of squared SW2(μ,ν)SW_{2}(\mu,\nu).

Compared with GANs, the sliced Wasserstein distance does not need discriminators, and is more robust than WGAN-GP in low dimensional problems. Take the following 1D problem as an example. We consider the SODE:

dXt=adt\displaystyle dX_{t}=adt +bdBt,t0,\displaystyle+bdB_{t},\quad t\geq 0, (25)

with ρ0=𝒩(0.5,0.5)\rho_{0}=\mathcal{N}(-0.5,0.5). We set a=b=1a=b=1 so that the exact solution is

ρt=𝒩(t0.5,t+0.5).\displaystyle\rho_{t}=\mathcal{N}(t-0.5,t+0.5). (26)

We have 10,00010,000 samples at t=0.5t=0.5 and 1.51.5, respectively, as the training data, and wish to infer the constant drift and diffusion coefficients. The input noise to the generator is uniform noise from -1 to 1. The results are shown in Figure 8.

Refer to caption
(a)
Refer to caption
(b)
Figure 8: Inferred drift and diffusion coefficient during the training, using (a) the squared sliced Wasserstein-2 distance or (b) WGAN-GP for d. The black dashed lines represent the ground truth, while the multiple colored lines represent the results in three independent runs.

We can clearly see oscillations of the inferred drift coefficient for WGAN-GP. This can be attributed to the two-player game between the generator and discriminator, which was also reported in [30].

Refer to caption
(a)
Refer to caption
(b)
Figure 9: Results of 2D, 5D, 10D and 20D problems, using (a) the squared sliced Wasserstein-2 distance or (b) WGAN-GP for d. The solid lines and shaded areas represent the mean and two standard deviations of the errors over all dimensions.

We did not observe significant oscillations using WGAN-GP in higher dimensional problems. Indeed, WGAN-GP outperforms the sliced Wasserstein distance in high dimensional problems. Let us consider the following simple SODE problem as an example, where the motions are uncoupled between dimensions:

dXt(i)=(Xt(i)(Xt(i))3)dt+dBt(i),i=1,2,d\displaystyle dX_{t}^{(i)}=(X_{t}^{(i)}-(X_{t}^{(i)})^{3})dt+dB_{t}^{(i)},i=1,2,...d (27)

where Xt(i)X_{t}^{(i)} is the ii-th component of 𝑿td\bm{X}_{t}\in\mathbb{R}^{d}, with ρ0=𝒩(𝟎,0.04𝑰d)\rho_{0}=\mathcal{N}(\bm{0},0.04\bm{I}_{d}). We prepare 10510^{5} sample paths and observe all the particle positions at t=0.2,0.5,1.0t=0.2,0.5,1.0 as our training data. We use a cubic polynomial a0(i)+a1(i)x+a2(i)x2+a3(i)x3a^{(i)}_{0}+a^{(i)}_{1}x+a^{(i)}_{2}x^{2}+a^{(i)}_{3}x^{3} to parameterize the ii-th component of the drift. We use another trainable variable rectified by a softplus function to represent the diffusion coefficient s(i)=1s^{(i)}=1 in each direction. In total, 5d5d variables are used to parameterize the dd-dimensional SODE. The results are shown in Figure 9. While the SW distance works for 2D problems, it does not scale well to high dimensional problems for which WGAN-GP gives much better inferences.

S2. A Study on the Density Estimation

In this section, we wish to show that our method can make use of the SODE and multiple snapshots to reduce the error of density estimation from limited samples. We consider the one-dimensional SODE:

dXt=(4XtXt3)dt\displaystyle dX_{t}=(4X_{t}-{X_{t}}^{3})dt +0.4dBt,t0,\displaystyle+0.4dB_{t},\quad t\geq 0, (28)

with ρ0=0.5𝒩(0.5,0.32)+0.5𝒩(0.5,0.32)\rho_{0}=0.5\mathcal{N}(-0.5,0.3^{2})+0.5\mathcal{N}(0.5,0.3^{2}).

We test the following two cases of training data sets. In both cases we have 1000 samples at t=0.05,0.1,0.15,0.2,0.25t=0.05,0.1,0.15,0.2,0.25, but the particle trajectories are different.

  • Case 1: We make observations from different sets of trajectories at different time instants.

  • Case 2: We make observations from the same set of trajectories at different time instants.

We assume that we know Equation 28, but have no knowledge of ρ0\rho_{0}. We could make the analogy of performing linear regression, where we know the slope but do not know the intercept. In Figure 10 we show the comparison between the inferred densities and the densities estimated directly from data. All the densities are estimated using Gaussian kernel density estimation. The inferred densities and the ground truth come from 10510^{5} samples produced by the generative model or simulation, with bandwidth decided by the Scott’s rule. To remove the effect of bandwidth selection in the density estimation directly from data, we perform a grid search of the optimal bandwidth factor via the L2L_{2} error against ground truth, from 0.01 to 0.5 with grid size 0.01 (Scott’s rule suggests about 0.25).

Refer to caption
(a)
Refer to caption
(b)
Figure 10: Integrated squared errors (ISE) of the inferred density functions at different time instants in (a) case 1 and (b) case 2. The squared errors are integrated from -4 to 4. The colored lines show the ISE averaged over three independent runs, while the markers with the corresponding colors show the ISE in each run.

The grid search strategy is actually infeasible in practice since we don’t know the ground truth, but it should perform better than any bandwidth selector. Despite that, in Figure 10(a) we can clearly see that in case 1 our inferred densities with naive Scott’s rule significantly outperform the ones estimated directly from training data with grid search. This cannot be attributed to the number of samples, since in case 2 the inferred densities with our method cannot perform better, as shown in Figure 10(b).

The different performances in two cases are reasonable. In case 1, our method can utilize the SODE and the observations at multiple time instants to reduce the error from limited samples. As we infer the density at t=0.05t=0.05, we are not only utilizing the data at t=0.05t=0.05 but also implicitly taking advantage of the data at later time instants. In case 2, since the observations come from the same sample paths, the observations at later time instants cannot provide additional information considering that the SODE is already known. Let us again make an analogy in the context of linear regression: if the observations have independent noise, multiple observations will be more helpful than a single observation. However, if the observations have the same noise, multiple observations cannot help us more than a single observation, if we already know the slope.

S3. Training Data in 2D Problems with Various Scenarios of Observations

In Figure 11 we visualize the training data in 2D problems with various scenarios of observations.

Refer to caption
Figure 11: Samples from the training data in 2D problems with ideal, noisy, or truncated observations.

S4. Effect of Clipping Radius in the Interacting Particle System

Refer to caption
Figure 12: Inferred α\alpha in the 2D interacting particle system during training, with different clipping radius rminr_{\text{min}}.

In this section we study the effect of various clipping radius rminr_{\text{min}} from 0.01 to 0.1, both in the simulation for data generation and the learning algorithm. We remark that changing rminr_{\text{min}} in this range would not make a huge difference to the system behavior, since the dynamics are dominated by non-local interactions. In Figure 12 we show the inferred α\alpha during the training. As rminr_{\text{min}} increase from 0.01 to 0.1, the inferred α\alpha in the end of training increases from 0.469 to 0.490, with the ground truth 0.5. This suggests that the error mainly comes from the singularity problem of ϕ(r)\phi(r) when rr is small.

S5. Proof for Theorem 1

The proof is based on the weak convergence: convergence of a sequence of probability measures {Pn}n=1\{P_{n}\}_{n=1}^{\infty} to a probability measure PP in the Wasserstein-pp metric is equivalent to the weak convergence plus the convergence of pp-th moments of {Pn}n=1\{P_{n}\}_{n=1}^{\infty} to PP [31]. Here, the weak convergence mean EPn[f(x)]E_{P_{n}}[f(x)] converges to EP[f(x)]E_{P}[f(x)] for each ff\in\mathcal{F}, where ={all bounded and continuous functions}\mathcal{F}=\{\text{all bounded and continuous functions}\} or ={all bounded and Lipschitz functions}\mathcal{F}=\{\text{all bounded and Lipschitz functions}\}, which are equivalent[32]. Note that in Theorem 1 we assume that the domain for XtX_{t} is a compact subset of Euclidean space for each tt, thus X1:Tp\|X_{1:T}\|^{p} is a bounded and continuous function of X1:TX_{1:T}. The convergence of pp-th moment then directly comes from the weak convergence, i.e., we only need to show that {Pn}n=1\{P_{n}\}_{n=1}^{\infty} converges to PP weakly. This is proved in Lemma 4 below.

The notations are inherited from Theorem 1. For simplicity, we will sometimes use X,Y,ZX,Y,Z to represent the nodes in sequence, and use PXYP^{XY} to represent the probability measure of (X,Y)(X,Y), etc.

Lemma 1.

Let PyP_{y} mapping from the Euclidean space to the probability measure space be CC-Lipschitz in Wasserstein-pp metric (p1p\geq 1). For any f(x,y,z):dx×dy×dzf(x,y,z):\mathbb{R}^{d_{x}}\times\mathbb{R}^{d_{y}}\times\mathbb{R}^{d_{z}}\rightarrow\mathbb{R} KK-Lipschitz,

g(x,y)=f(x,y,z)𝑑Py(z)g(x,y)=\int f(x,y,z)dP_{y}(z)

is (C+1)K(C+1)K-Lipschitz.

Proof.

Note that

|g(x+ϵ,y+ξ)g(x,y)|\displaystyle\left|g(x+\epsilon,y+\xi)-g(x,y)\right| (29)
=\displaystyle= |f(x+ϵ,y+ξ,z)𝑑Py+ξ(z)f(x,y,z)𝑑Py(z)|\displaystyle\left|\int f(x+\epsilon,y+\xi,z)dP_{y+\xi}(z)-\int f(x,y,z)dP_{y}(z)\right|
\displaystyle\leq |f(x+ϵ,y+ξ,z)𝑑Py+ξ(z)f(x+ϵ,y+ξ,z)𝑑Py(z)|+\displaystyle\left|\int f(x+\epsilon,y+\xi,z)dP_{y+\xi}(z)-\int f(x+\epsilon,y+\xi,z)dP_{y}(z)\right|+
|f(x+ϵ,y+ξ,z)𝑑Py(z)f(x,y,z)𝑑Py(z)|.\displaystyle\left|\int f(x+\epsilon,y+\xi,z)dP_{y}(z)-\int f(x,y,z)dP_{y}(z)\right|.

Firstly

|f(x+ϵ,y+ξ,z)𝑑Py+ξ(z)f(x+ϵ,y+ξ,z)𝑑Py(z)|\displaystyle\left|\int f(x+\epsilon,y+\xi,z)dP_{y+\xi}(z)-\int f(x+\epsilon,y+\xi,z)dP_{y}(z)\right| (30)
\displaystyle\leq KW1(Py,Py+ξ)\displaystyle KW_{1}(P_{y},P_{y+\xi})
\displaystyle\leq KWp(Py,Py+ξ)\displaystyle KW_{p}(P_{y},P_{y+\xi})
\displaystyle\leq CKϵ,\displaystyle CK\|\epsilon\|,

where the first inequality comes from the Kantorovich–Rubinstein formula [31] and that f(x+ϵ,y+ξ,z)f(x+\epsilon,y+\xi,z) is KK-Lipschitz. The last inequality comes from that PyP_{y} is CC-Lipschitz in Wasserstein-pp sense.

Secondly,

|f(x+ϵ,y+ξ,z)𝑑Py(z)f(x,y,z)𝑑Py(z)|\displaystyle\left|\int f(x+\epsilon,y+\xi,z)dP_{y}(z)-\int f(x,y,z)dP_{y}(z)\right| (31)
\displaystyle\leq |f(x+ϵ,y+ξ,z)f(x,y,z)|𝑑Py(z)\displaystyle\int\left|f(x+\epsilon,y+\xi,z)-f(x,y,z)\right|dP_{y}(z)
\displaystyle\leq K(ϵ,ξ)𝑑Py(z)\displaystyle\int K\|(\epsilon,\xi)\|dP_{y}(z)
=\displaystyle= K(ϵ,ξ).\displaystyle K\|(\epsilon,\xi)\|.

We conclude that

|g(x+ϵ,y+ξ)g(x,y)|(C+1)K(ϵ,ξ),\displaystyle\left|g(x+\epsilon,y+\xi)-g(x,y)\right|\leq(C+1)K\|(\epsilon,\xi)\|, (32)

i.e. g(x,y)g(x,y) is (C+1)K(C+1)K-Lipschitz ∎

Lemma 2.

If PnXYP_{n}^{XY} converge to PXYP^{XY} weakly, then PnXP_{n}^{X} converge to PXP^{X} weakly.

Proof.

For any bounded and continuous function f(x)f(x)

f(x)PnX(x)f(x)PX(x)\displaystyle\int f(x)P_{n}^{X}(x)-\int f(x)P^{X}(x) (33)
=\displaystyle= f(x)PnXY(x,y)f(x)PXY(x,y)0.\displaystyle\iint f(x)P_{n}^{XY}(x,y)-\iint f(x)P^{XY}(x,y)\rightarrow 0.

The convergence comes from the weak convergence of PnXYP_{n}^{XY} and that f(x)f(x) is bounded and continuous as a function of xx and yy. ∎

Lemma 3.

Suppose PnYZP_{n}^{YZ} converge to PYZP^{YZ} weakly, PZ|Y=yP^{Z|Y=y} is Lipschitz in Wasserstein-pp metric. For any g(y,z):dy×dzg(y,z):\mathbb{R}^{d_{y}}\times\mathbb{R}^{d_{z}}\rightarrow\mathbb{R} bounded and Lipschitz, we have

g(y,z)𝑑PnZ|Y=y(z)𝑑PnY(y)\displaystyle\iint g(y,z)dP_{n}^{Z|Y=y}(z)dP_{n}^{Y}(y) (34)
\displaystyle- g(y,z)𝑑PZ|Y=y(z)𝑑PnY(y)0.\displaystyle\iint g(y,z)dP^{Z|Y=y}(z)dP_{n}^{Y}(y)\rightarrow 0.
Proof.

Since g(y,z)g(y,z) is bounded and continuous, and PnYZP_{n}^{YZ} converge to PYZP^{YZ} weakly,

g(y,z)𝑑PnYZ(y,z)g(y,z)𝑑PYZ(y,z)0,\displaystyle\iint g(y,z)dP_{n}^{YZ}(y,z)-\iint g(y,z)dP^{YZ}(y,z)\rightarrow 0, (35)

i.e.

g(y,z)𝑑PnZ|Y=y(z)𝑑PnY(y)\displaystyle\iint g(y,z)dP_{n}^{Z|Y=y}(z)dP_{n}^{Y}(y) (36)
\displaystyle- g(y,z)𝑑PZ|Y=y(z)𝑑PY(y)0.\displaystyle\iint g(y,z)dP^{Z|Y=y}(z)dP^{Y}(y)\rightarrow 0.

Since g(y,z)g(y,z) is bounded and Lipschitz, PZ|Y=yP^{Z|Y=y} is Lipschitz, we have

g(y,z)𝑑PZ|Y=y(z)\displaystyle\int g(y,z)dP^{Z|Y=y}(z) (37)

is Lipschitz. Further since PnYP_{n}^{Y} converge to PYP^{Y} weakly (from Lemma 2):

g(y,z)𝑑PZ|Y=y(z)𝑑PnY(y)\displaystyle\iint g(y,z)dP^{Z|Y=y}(z)dP_{n}^{Y}(y) (38)
\displaystyle- g(y,z)𝑑PZ|Y=y(z)𝑑PY(y)0.\displaystyle\iint g(y,z)dP^{Z|Y=y}(z)dP^{Y}(y)\rightarrow 0.

Combining 36 and 38, we have

g(y,z)𝑑PnZ|Y=y(z)𝑑PnY(y)\displaystyle\iint g(y,z)dP_{n}^{Z|Y=y}(z)dP_{n}^{Y}(y) (39)
\displaystyle- g(y,z)𝑑PZ|Y=y(z)𝑑PnY(y)0.\displaystyle\iint g(y,z)dP^{Z|Y=y}(z)dP_{n}^{Y}(y)\rightarrow 0.

Lemma 4.

Let (X1,X2,XT)(X_{1},X_{2},...X_{T}) be a Markov chain of length T3T\geq 3 and we use Xi:jX_{i:j} to denote the nodes (Xi,Xi+1Xj)(X_{i},X_{i+1}...X_{j}), for iji\leq j. Suppose the domain DtD_{t} for XtX_{t} is a compact subset of dt\mathbb{R}^{d_{t}} for t=1,2Tt=1,2...T. We use the lql_{q} (q1q\geq 1) Euclidean metric for all the Euclidean spaces with different dimensions.

Let {PnXi:j}n=1\{P_{n}^{X_{i:j}}\}_{n=1}^{\infty} and PXi:jP^{X_{i:j}} be probability measures of Xi:jX_{i:j} for iji\leq j, PnXi|XjP_{n}^{X_{i}|X_{j}} and PXi|XjP^{X_{i}|X_{j}} be the corresponding probability transition kernels. If PnXt:t+1P_{n}^{X_{t:t+1}} converges to PXt:t+1P^{X_{t:t+1}} weakly for all t=1,2T1t=1,2...T-1, PnXt|Xt+1P_{n}^{X_{t}|X_{t+1}} and PXt+2|Xt+1P^{X_{t+2}|X_{t+1}} are CC-Lipschitz continuous in Wasserstein-pp metric for all t=1,2T2t=1,2...T-2 and nn, where CC is a constant, then PnX1:TP_{n}^{X_{1:T}} converges to PX1:TP^{X_{1:T}} weakly.

Proof.

We start from the case T=3T=3. For simplicity, we use X,Y,ZX,Y,Z to denote the nodes in sequence.

For any f(x,y,z)f(x,y,z) KK-Lipschitz and bounded, we have

f(x,y,z)𝑑PnXYZ(x,y,z)\displaystyle\int f(x,y,z)dP_{n}^{XYZ}(x,y,z) (40)
=\displaystyle= (f(x,y,z)𝑑PnZ|Y=y(z))𝑑PnXY(x,y)\displaystyle\int\left(\int f(x,y,z)dP_{n}^{Z|Y=y}(z)\right)dP_{n}^{XY}(x,y)
=\displaystyle= (gn(x,y)g(x,y))𝑑PnXY(x,y)\displaystyle\int(g_{n}(x,y)-g(x,y))dP_{n}^{XY}(x,y)
+\displaystyle+ g(x,y)𝑑PnXY(x,y),\displaystyle\int g(x,y)dP_{n}^{XY}(x,y),

where

g(x,y)\displaystyle g(x,y) =f(x,y,z)𝑑PZ|Y=y(z)\displaystyle=\int f(x,y,z)dP^{Z|Y=y}(z) (41)
gn(x,y)\displaystyle g_{n}(x,y) =f(x,y,z)𝑑PnZ|Y=y(z).\displaystyle=\int f(x,y,z)dP_{n}^{Z|Y=y}(z).

From Lemma 1, since f(x,y,z)f(x,y,z) is Lipschitz, PZ|Y=yP^{Z|Y=y} is Lipschitz in Wasserstein-pp sense, we have g(x,y)g(x,y) is Lipschitz. g(x,y)g(x,y) is also bounded since f(x,y,z)f(x,y,z) is bounded. So we have

g(x,y)𝑑PnXY(x,y)g(x,y)𝑑PXY(x,y),\displaystyle\int g(x,y)dP_{n}^{XY}(x,y)\rightarrow\int g(x,y)dP^{XY}(x,y), (42)

since PnXYP_{n}^{XY} converge to PXYP^{XY} weakly.

We then need to show (gn(x,y)g(x,y))𝑑PnXY(x,y)\int(g_{n}(x,y)-g(x,y))dP_{n}^{XY}(x,y) converges to 0. We prove by contradiction. Suppose it does not converge, then there exists ϵ>0\epsilon>0 and a subsequence of nn (denote as ii) such that

|(gi(x,y)g(x,y))𝑑PiXY(x,y)|ϵ,i.\displaystyle\left|\int(g_{i}(x,y)-g(x,y))dP_{i}^{XY}(x,y)\right|\geq\epsilon,\forall i. (43)

Without loss of generality, we assume that

(gi(x,y)g(x,y))𝑑PiXY(x,y)ϵ>0,i,\displaystyle\int(g_{i}(x,y)-g(x,y))dP_{i}^{XY}(x,y)\geq\epsilon>0,\forall i, (44)

i.e.

f(x,y,z)𝑑PiZ|Y=y(z)𝑑PiX|Y=y(x)𝑑PiY(y)\displaystyle\iiint f(x,y,z)dP_{i}^{Z|Y=y}(z)dP_{i}^{X|Y=y}(x)dP_{i}^{Y}(y) (45)
\displaystyle- f(x,y,z)𝑑PZ|Y=y(z)𝑑PiX|Y=y(x)𝑑PiY(y)ϵ>0,i.\displaystyle\iiint f(x,y,z)dP^{Z|Y=y}(z)dP_{i}^{X|Y=y}(x)dP_{i}^{Y}(y)\geq\epsilon>0,\forall i.

Let

hi(y,z)=f(x,y,z)𝑑PiX|Y=y(x),\displaystyle h_{i}(y,z)=\int f(x,y,z)dP_{i}^{X|Y=y}(x), (46)

then

hi(y,z)𝑑PiZ|Y=y(z)𝑑PiY(y)\displaystyle\iint h_{i}(y,z)dP_{i}^{Z|Y=y}(z)dP_{i}^{Y}(y) (47)
\displaystyle- hi(y,z)𝑑PZ|Y=y(z)𝑑PiY(y)ϵ>0,i.\displaystyle\iint h_{i}(y,z)dP^{Z|Y=y}(z)dP_{i}^{Y}(y)\geq\epsilon>0,\forall i.

Note that ff is KK-Lipschitz, PiX|Y=yP_{i}^{X|Y=y} is KK-Lipschitz, we have hih_{i} are all K(C+1)K(C+1)-Lipschitz, thus uniformly equicontinuous. Also hi(y,z)h_{i}(y,z) are uniformly bounded by the bound of ff, and the domain for (Y,Z)(Y,Z) is compact (since the domain DtD_{t} for XtX_{t} is compact for all tt). By the Arzela-Ascoli theorem, there exists a subsequence jj such that hjhh_{j}\rightarrow h uniformly, where hh is K(C+1)K(C+1)-Lipschitz and bounded. Therefore,

(hj(y,z)h(y,z))𝑑PjZ|Y=y(z)𝑑PjY(y)\displaystyle\iint(h_{j}(y,z)-h(y,z))dP_{j}^{Z|Y=y}(z)dP_{j}^{Y}(y) (48)
\displaystyle- (hj(y,z)h(y,z))𝑑PZ|Y=y(z)𝑑PjY(y)0.\displaystyle\iint(h_{j}(y,z)-h(y,z))dP^{Z|Y=y}(z)dP_{j}^{Y}(y)\rightarrow 0.

From Lemma 3

h(y,z)𝑑PjZ|Y=y(z)𝑑PjY(y)\displaystyle\iint h(y,z)dP_{j}^{Z|Y=y}(z)dP_{j}^{Y}(y) (49)
\displaystyle- h(y,z)𝑑PZ|Y=y(z)𝑑PjY(y)0.\displaystyle\iint h(y,z)dP^{Z|Y=y}(z)dP_{j}^{Y}(y)\rightarrow 0.

We then have

hj(y,z)𝑑PjZ|Y=y(z)𝑑PjY(y)\displaystyle\iint h_{j}(y,z)dP_{j}^{Z|Y=y}(z)dP_{j}^{Y}(y) (50)
\displaystyle- hj(y,z)𝑑PZ|Y=y(z)𝑑PjY(y)0.\displaystyle\iint h_{j}(y,z)dP^{Z|Y=y}(z)dP_{j}^{Y}(y)\rightarrow 0.

We have a contradiction between Equation 47 and 50. We finish the proof for T=3T=3.

We then prove the case for general TT by induction. Suppose it holds for T=NT=N, we now prove it for T=N+1T=N+1.

From the conditions for the case T=N+1T=N+1 and that the theorem holds for T=NT=N, we have PnX1:NP_{n}^{X_{1:N}} converges to PX1:NP^{X_{1:N}} weakly and PnX2:N+1P_{n}^{X_{2:N+1}} converges to PX2:N+1P^{X_{2:N+1}} weakly. We now view X1X_{1} as XX, view the Cartesian product of X2:NX_{2:N} as YY, and view XN+1X_{N+1} as ZZ, so we have PnXYP_{n}^{XY} converges to PXYP^{XY} weakly and PnYZP_{n}^{YZ} converges to PYZP^{YZ} weakly. Also PnX|YP_{n}^{X|Y} and PZ|YP^{Z|Y} are CC-Lipschitz continuious and Wasserstein-pp metric for all nn, since PnX|Y=PnX1|X2P_{n}^{X|Y}=P_{n}^{X_{1}|X_{2}} and PZ|Y=PXN+1|XNP^{Z|Y}=P^{X_{N+1}|X_{N}} from the Markovian property, and that X2X2:N\|X_{2}\|\leq\|X_{2:N}\|, XNX2:N\|X_{N}\|\leq\|X_{2:N}\|. Therefore, from the theorem for T=3T=3 case we have that PnXYZP_{n}^{XYZ} converges to PXYZP^{XYZ} weakly, i.e., PnX1:N+1P_{n}^{X_{1:N+1}} converges to PX1:N+1P^{X_{1:N+1}} weakly.

S6. A Counterexample in Continuous Sample Space

As a direct implementation of Corollary 7 in [29], for the Markov chain X1:TX_{1:T} in a finite discrete sample space, PnXt:t+1P_{n}^{X_{t:t+1}} converges to PXt:t+1P^{X_{t:t+1}} in Wasserstein-pp sense for each t=1,2T1t=1,2...T-1 implies that PnX1:TP_{n}^{X_{1:T}} converges to PX1:TP^{X_{1:T}} in Wasserstein-pp sense. However, if the Markov chains are defined in the continuous sample space, the implementation is, in general, not correct without further assumptions, e.g., the assumption of continuity for probability transition kernels. In this section, we provide a counterexample to show that.

We consider the Markov chain with T=3T=3 and use X,Y,ZX,Y,Z to denote the nodes in sequence. We define PnXYZP_{n}^{XYZ} as follows:

PnXYZ(0,0,0)\displaystyle P_{n}^{XYZ}(0,0,0) =12,\displaystyle=\frac{1}{2}, (51)
PnXYZ(1,1n,1)\displaystyle P_{n}^{XYZ}(1,\frac{1}{n},1) =12,\displaystyle=\frac{1}{2},

and PXYZP^{XYZ} as following:

PXYZ(0,0,0)\displaystyle P^{XYZ}(0,0,0) =14,\displaystyle=\frac{1}{4}, (52)
PXYZ(0,0,1)\displaystyle P^{XYZ}(0,0,1) =14,\displaystyle=\frac{1}{4},
PXYZ(1,0,0)\displaystyle P^{XYZ}(1,0,0) =14,\displaystyle=\frac{1}{4},
PXYZ(1,0,1)\displaystyle P^{XYZ}(1,0,1) =14.\displaystyle=\frac{1}{4}.

We can easily check that PnXYP_{n}^{XY} converges to PXYP^{XY} in Wasserstein-pp metric, since

PnXY(0,0)\displaystyle P_{n}^{XY}(0,0) =12,\displaystyle=\frac{1}{2}, (53)
PnXY(1,1n)\displaystyle P_{n}^{XY}(1,\frac{1}{n}) =12,\displaystyle=\frac{1}{2},

and

PXY(0,0)\displaystyle P^{XY}(0,0) =12,\displaystyle=\frac{1}{2}, (54)
PXY(1,0)\displaystyle P^{XY}(1,0) =12,\displaystyle=\frac{1}{2},

Similarly PnYZP_{n}^{YZ} converges to PYZP^{YZ} in Wasserstein-pp metric. However, PnXYZP_{n}^{XYZ} does not converge to PXYZP^{XYZ} in Wasserstein-pp metric: the support of PnXYZP_{n}^{XYZ} and (0,0,1)(0,0,1) always have a distance larger than 1.

References

  • [1] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [2] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
  • [3] Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3483–3491, 2018.
  • [4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • [5] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
  • [6] David L Donoho et al. High-dimensional data analysis: The curses and blessings of dimensionality. AMS math challenges lecture, 1(2000):32, 2000.
  • [7] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • [8] Luigi Ambrosio and Wilfrid Gangbo. Hamiltonian ODEs in the Wasserstein space of probability measures. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 61(1):18–53, 2008.
  • [9] Xuechen Li, Ting-Kam Leonard Wong, Ricky TQ Chen, and David Duvenaud. Scalable gradients for stochastic differential equations. arXiv preprint arXiv:2001.01328, 2020.
  • [10] Junteng Jia and Austin R Benson. Neural jump stochastic differential equations. In Advances in Neural Information Processing Systems, pages 9843–9854, 2019.
  • [11] Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. Neural SDE: Stabilizing neural ODE networks with stochastic noise. arXiv preprint arXiv:1906.02355, 2019.
  • [12] Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019.
  • [13] Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. arXiv preprint arXiv:1903.01608, 2019.
  • [14] Lars Ruthotto, Stanley J Osher, Wuchen Li, Levon Nurbekyan, and Samy Wu Fung. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proceedings of the National Academy of Sciences, 117(17):9183–9193, 2020.
  • [15] Liu Yang, Dongkun Zhang, and George Em Karniadakis. Physics-informed generative adversarial networks for stochastic differential equations. SIAM Journal on Scientific Computing, 42(1):A292–A317, 2020.
  • [16] Junyu Liu, Zichao Long, Ranran Wang, Jie Sun, and Bin Dong. RODE-Net: Learning ordinary differential equations with randomness from data. arXiv preprint arXiv:2006.02377, 2020.
  • [17] Ola Elerian, Siddhartha Chib, and Neil Shephard. Likelihood inference for discretely observed nonlinear diffusions. Econometrica, 69(4):959–993, 2001.
  • [18] Bjørn Eraker. MCMC analysis of diffusion models with application to finance. Journal of Business & Economic Statistics, 19(2):177–191, 2001.
  • [19] Simo Särkkä, Jouni Hartikainen, Isambi Sailon Mbalawata, and Heikki Haario. Posterior inference on parameters of stochastic differential equations via non-linear Gaussian filtering and adaptive MCMC. Statistics and Computing, 25(2):427–437, 2015.
  • [20] Cedric Archambeau, Dan Cornford, Manfred Opper, and John Shawe-Taylor. Gaussian process approximations of stochastic differential equations. Journal of Machine Learning Research, 1:1–16, 2007.
  • [21] Michail D Vrettas, Manfred Opper, and Dan Cornford. Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations. Physical Review E, 91(1):012148, 2015.
  • [22] Joseph Bakarji and Daniel M Tartakovsky. Data-driven discovery of coarse-grained equations. Journal of Computational Physics, page 110219, 2021.
  • [23] Zhiping Mao, Zhen Li, and George Em Karniadakis. Nonlocal flocking dynamics: Learning the fractional order of pdes from particle simulations. Communications on Applied Mathematics and Computation, 1(4):597–619, 2019.
  • [24] Christos N Mavridis, Amoolya Tirumalai, and John S Baras. Learning interaction dynamics from particle trajectories and density evolution. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 1014–1019. IEEE, 2020.
  • [25] Felipe Cucker and Steve Smale. Emergent behavior in flocks. IEEE Transactions on automatic control, 52(5):852–862, 2007.
  • [26] Filippo Santambrogio. {\{Euclidean, metric, and Wasserstein}\} gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154, 2017.
  • [27] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the International Conference on Machine Learning, volume 30, page 3, 2013.
  • [28] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [29] Mucong Ding, Constantinos Daskalakis, and Soheil Feizi. Subadditivity of probability divergences on Bayes-Nets with applications to time series GANs. CoRR, abs/2003.00652, 2020.
  • [30] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. In International Conference on Learning Representations, 2018.
  • [31] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • [32] Achim Klenke. Probability theory: a comprehensive course. Springer Science & Business Media, 2013.