Generative Ensemble Regression: Learning Particle Dynamics from Observations of Ensembles with Physics-Informed Deep Generative Models

Liu Yang Division of Applied Mathematics, Brown University, Providence, RI 02912, USA Constantinos Daskalakis Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA George Em Karniadakis Division of Applied Mathematics, Brown University, Providence, RI 02912, USA Pacific Northwest National Laboratory, Richland, WA 99354, USA Correspondence author, [email protected]

Abstract

We propose a new method for inferring the governing stochastic ordinary differential equations (SODEs) by observing particle ensembles at discrete and sparse time instants, i.e., multiple “snapshots”. Particle coordinates at a single time instant, possibly noisy or truncated, are recorded in each snapshot but are unpaired across the snapshots. By training a physics-informed generative model that generates “fake” sample paths, we aim to fit the observed particle ensemble distributions with a curve in the probability measure space, which is induced from the inferred particle dynamics. We employ different metrics to quantify the differences between distributions, e.g., the sliced Wasserstein distances and the adversarial losses in generative adversarial networks (GANs). We refer to this method as generative “ensemble-regression” (GER), in analogy to the classic “point-regression”, where we infer the dynamics by performing regression in the Euclidean space. We illustrate the GER by learning the drift and diffusion terms of particle ensembles governed by SODEs with Brownian motions and Lévy processes up to 100 dimensions. We also discuss how to treat cases with noisy or truncated observations. Apart from systems consisting of independent particles, we also tackle nonlocal interacting particle systems with unknown interaction potential parameters by constructing a physics-informed loss function. Finally, we investigate scenarios of paired observations and discuss how to reduce the dimensionality in such cases by proving a convergence theorem that provides theoretical support.

Introduction

Classic methods for inferring the ordinary differential equation (ODE) dynamics from data usually require observations of a point or particle governed by the ODE at different time instants. We refer this learning paradigm as “point-regression”. More specifically, as illustrated in Figure 1, in point-regression problems we aim to infer the governing ODE of a point and perhaps also the initial condition, given the (possibly noisy) observations of its coordinates at different time instants. Typically, we optimize the dynamics and the initial coordinate so that the inferred curve matches the data in the Euclidean distance. Let us consider, for simplicity, the one-dimensional linear regression with quadratic loss as an example. Given observations of $x$ at multiple $t$ , we want to optimize the parameter $a$ in the ODE $dx_{t}/dt=a$ as well as the initial point $x_{0}=b$ so that the mean squared $L_{2}$ distance between predictions and data points is minimized, where $a$ and $b$ are the slope and the intercept of the linear function. Other examples in this category include logistic regression, recurrent neural networks [1] and the neural ODE [2] for time series, etc.

For systems consisting of an ensemble of particles, point-regression may fail to apply. For example, we want to infer the governing stochastic ordinary differential equations (SODE) from observations of particle ensembles at discrete and sparse time instants, but the data of an individual particle are not sufficiently informative for dynamic inference. Another example is a system consisting of a large number of interacting particles, even close to the mean field limit, e.g., we may want to infer how the fish interact with each other from discrete snapshots of the fish school. In such scenarios, instead of learning from individual particles, we need to learn from the particle ensembles. Specifically, we wish to infer the governing dynamics and perhaps also the initial condition, using observations of an ensemble of particle at discrete time instants. We call an observation at a single time instant a “snapshot”, where part or all of the particle coordinates in the ensemble are recorded. Since the particles could be indistinguishable in observations, especially in a large system, we thus consider the case where the data are not labeled with particle indices, in other words, we cannot pair data across snapshots.

We call this paradigm “ensemble-regression” in analogy to the “point-regression”. As illustrated in Figure 1, the initial condition and dynamics for particles would induce a curve $t\rightarrow\rho_{t}$ in the probability measure space, where $\rho_{t}$ denotes the particle distributions at time $t$ . Such $\rho_{t}$ will be governed by a corresponding partial differential equation (PDE), e.g., the Fokker-Planck equation if the particles are governed by SODEs of diffusion processes. We aim to optimize the dynamics and the initial distribution so that the inferred curve matches the distributions from the data, and the differences can be quantified with certain metrics.

Refer to caption — Figure 1: Schematic showing the two paradigms of point-regression and ensemble-regression for dynamic inference. In point-regression problems (black labels), we aim to fit the point coordinates (blue dots) from data with the inferred curve (orange curve), determined by the initial coordinate $\bm{x}_{0}$ as well as the ODE $d\bm{x}_{t}/dt=f_{\theta}(\bm{x}_{t},t)$ , where $f_{\theta}$ is a function. As an analogue, in ensemble-regression problems (red labels), we aim to fit the distributions of the ensemble in the snapshots (blue dots) with the inferred curve (orange curve), determined by the initial distribution $\rho_{0}$ as well as the PDE $\partial\rho_{t}/\partial t=L_{\theta}(\rho_{t},t)$ , where $L_{\theta}$ is an operator.

Herein we propose a new method to perform ensemble-regression. We use a generative model with deep neural networks as build blocks, which will generate “fake” particle systems, to represent the inferred curve in the probability measure space, and then perform regression in the probability measure space with the inferred curve. We thus name our method as generative ensemble-regression. In this paper we test the sliced Wasserstein (SW) distance [3] and the loss in generative adversarial networks (GANs) [4, 5] as two examples of metrics, the latter proved to be very effective when analyzing high dimensional data [6] in our problems.

The deep generative model is physics-informed in that our partial physical knowledge of the dynamics will be encoded into the architecture or the loss function. Such physical knowledge is sometimes essential for a correct dynamic inference, since the particle dynamic can be not unique even if the curve $t\rightarrow\rho_{t}$ and its governing PDE are fully given. For example, the following two particle dynamics with $\mathcal{N}(0,1)$ as the initial distribution will lead to the same curve $\rho_{t}=\mathcal{N}(0,t+1)$ :

•

Standard Brownian motion with no drift.
•

$dx_{t}/dt=x_{t}/(2t+2)$ with no diffusion, i.e., $x_{t}=x_{0}\sqrt{t+1}$ .

However, if we know that the particle dynamic is in the form of $d\bm{x}_{t}/dt=\bm{v}_{t}(\bm{x},t)$ , so that $\rho_{t}$ is governed by the continuity equation $d\rho_{t}/dt=-\nabla\cdot(\bm{v}_{t}\rho_{t})$ , and $\bm{v}_{t}$ is limited to the $L^{2}(\rho_{t};\mathbb{R}^{d})$ closure of $\{\nabla\varphi:\varphi\in C_{c}^{\infty}(\mathbb{R}^{d})\}$ , then the solution of $\bm{v}_{t}$ is unique, for any curve $t\rightarrow\rho_{t}$ absolutely continuous from $[a,b]$ to $\mathcal{P}_{2}(\mathbb{R}^{d})$ , where $\mathcal{P}_{2}(\mathbb{R}^{d})$ is the Wasserstein-2 space of probability measures with finite quadratic moments in $\mathbb{R}^{d}$ [7, 8].

In the generative model, we use discretized ODE or SODE with unknown terms parameterized as neural networks. This is referred as the “neural ODE” and “neural SDE” in the literature [2, 9, 10, 11, 12, 13], but in applications the observations are mainly a time series, i.e., observations of a single particle. The idea of employing a neural networks as velocity surrogates with physics-informed loss functions in particle systems was also used for solving mean field game/control problems [14]. There are other works using GANs to solve inverse problems, including [15], where GANs were applied to learn parameters in time-independent stochastic differential equations, and [16] where GANs were applied to learn the random parameters from the (paired) observations of independent ODEs.

In this paper, we tackle two typical types of particle systems: (1) independent particle systems governed by SODEs, and (2) interacting particle systems governed by nonlocal flocking dynamics. For inferring dynamics governed by SODE, most algorithms are based on observations of a sample path, and perform the inference by calculating or approximating the probability of observations conditioned on the system parameters, using Euler-Maruyama discretization [17, 18], Kalman filtering [19], variational Gaussian process-based approximation [20, 21], etc. There are other works inferring the stochastic dynamics with the (estimated) densities from the perspective of Fokker-Planck equations [22], but this approach is hard to scale to high dimensional problems. For flocking dynamic systems, other researchers infer the key parameters in the influence function by fitting the velocity field via Bayesian optimization [23], or fitting the density field by solving a system of transformed PDEs and optimizing the parameters [24]. But these methods require the knowledge of the initial condition, including the distribution and velocity field.

Problem Setup

We start from a system consisting of an ensemble of particles, where the dynamics of the particles is independent of other particles. The most commonly used stochastic processes in physics and biology are diffusion processes and Lévy processes; in particular, the particle dynamics is governed by the stochastic differential equation:

\displaystyle d\bm{X}_{t}=\bm{\mu}_{t}dt

\displaystyle+\bm{\sigma}_{t}d\bm{B}_{t},\quad t\geq 0,

(1)

for diffusion processes, and

\displaystyle d\bm{X}_{t}=\bm{\mu}_{t}dt

\displaystyle+\bm{\sigma}_{t}d\bm{L}_{t}^{\alpha},\quad t\geq 0,

(2)

for Lévy processes, where $\bm{X}_{t}\in\mathbb{R}^{d}$ is the position of a particle at time $t$ with $\bm{X}_{0}$ randomly drawn from the initial distribution $\rho_{0}$ , $\bm{\mu}_{t}\in\mathbb{R}^{d}$ is the deterministic drift, $\bm{B}_{t}$ and $\bm{L}_{t}^{\alpha}$ are the $d$ -dimensional standard Brownian motion and the $\alpha$ -stable symmetric Lévy process, respectively, and $\bm{\sigma}_{t}\in\mathbb{R}^{d\times d}$ is the diffusion coefficient. In the mean field limit, the density of the particles would be governed by Fokker-Planck equations or fractional Fokker-Planck equations. For simplicity, in this work we assume that $\bm{\mu}_{t}$ is a function of $\bm{X}_{t}$ while $\bm{\sigma}_{t}$ is constant, but, in principle, our proposed method can also tackle the time-dependent case.

Apart from systems consisting of independent particles, we further consider systems where particles interact with each other and the particle distributions exhibit more complicated behavior. As an example, we consider the Cucker-Smale particle model [25], which describes individuals in flocks with nonlocal interactions. The individual or particle motion is characterized by the following governing equations:

	$\displaystyle d\bm{X}_{t}^{(i)}/dt$	$\displaystyle=\bm{\mu}_{t}^{(i)},$		(3)
	$\displaystyle d\bm{\mu}_{t}^{(i)}/dt=\frac{1}{N-1}\sum_{j\neq i}$	$\displaystyle\phi(\\|\bm{X}_{t}^{(i)}-\bm{X}_{t}^{(j)}\\|)(\bm{\mu}_{t}^{(j)}-\bm{\mu}_{t}^{(i)}),$		(3)

where the superscripts denote the index of the particles, $N\gg 1$ is the number of particles in the system, and $\phi$ is the influence function. Here, we set

\displaystyle\phi(r)=c_{d,\alpha}|r|^{-(d+\alpha)},c_{d,\alpha}=\frac{\alpha\Gamma(\frac{d+\alpha}{2})}{2\pi^{\alpha+d/2}\Gamma(1-\alpha/2)},

(4)

where $\Gamma$ is the gamma function, $d$ is the dimension of $\bm{X}_{t}^{(i)}$ , and $\alpha$ is the factor that characterizes the decay rate of particle interactions as the distance $r$ grows. In the mean field limit, as $N$ grows the density of the particles is governed by the following fractional PDE:

$\displaystyle\frac{\partial\rho}{\partial t}+\nabla\cdot(\rho\bm{u})$	$\displaystyle=0,$	(5)
$\displaystyle\frac{\partial\bm{u}}{\partial t}+\bm{u}\cdot\nabla\bm{u}$	$\displaystyle=[\mathcal{L},\bm{u}](\rho),$
$\displaystyle[\mathcal{L},\bm{u}](\rho)(\bm{x})=\text{p.v. }c_{d,\alpha}\int_{\mathbb{R}^{d}}$	$\displaystyle\frac{\bm{u}(\bm{y})-\bm{u}(\bm{x})}{\\|\bm{x}-\bm{y}\\|^{d+\alpha}}\rho(\bm{y})d\bm{y},$

where $\rho$ is the density, $\bm{u}$ is the velocity field, and p.v. means the principle value. Note that $\alpha$ determines the fractional order.

We consider the scenario where the data available are the observations of the particles coordinates at different time instants $\{t_{i}\}_{i=1}^{n}$ with $0\leq t_{1}<t_{2}...<t_{n}$ , namely “snapshots”. In other words, the data $\mathcal{D}_{i}$ for time $t_{i}$ will be a set of samples drawn from $\rho_{t_{i}}$ , the particle distributions at $t_{i}$ . If $\{\mathcal{D}_{i}\}_{i=1}^{n}$ are observations of the same set of particles and we can distinguish these particles, we refer to these cases as the “paired” observations since we can pair the particles from different snapshots. In other cases, $\{D_{i}\}_{i=1}^{n}$ are observations of different sets of particles, or we cannot distinguish the particles. We refer to these cases as the “unpaired” observations. In this paper we mainly focus on the unpaired cases, but we also present some work on the paired cases in Paired Observations section, where we introduce how to reduce the effective dimensionality and we prove a theorem to support the introduced method. For independent particle systems, we assume that we are unaware of $\bm{\mu}_{t}$ and $\bm{\sigma}_{t}$ , or we may only know the parametrized forms of these terms, and we aim to infer these terms directly or through a proper parametization. For interacting particle systems, we assume we know the forms of the dynamics and the influence function, i.e., Equation 3 and 4, but need to infer the velocity field and the key parameter $\alpha$ .

Physics-informed Generative Model

To perform ensemble regression, we will use a generative model with deep neural networks to represent a curve in the probability measure space. Note that the curve is determined by the initial condition $\rho_{0}$ and the governing equation, hence, the generative model consists of two parts. In the first part, we employ a feed-forward neural network $G$ to represent $\rho_{0}$ . In particular, $G$ takes samples from random noise $\mathcal{N}$ , e.g., Gaussian noise, as input and the generated distribution ${G}_{\#}\mathcal{N}$ is intended to approximate $\rho_{0}$ , where _# denotes the push forward operator. The second part of the generative model will generate “fake” particle trajectories with initial coordinates generated by $G$ , and the marginal distribution $\tilde{\rho}_{t}$ at time $t\geq 0$ will be used to represent $\rho_{t}$ in the curve.

Our knowledge of the physics will be incorporated into the generative model in two ways. For non-interacting particle systems, our knowledge including the form of the drift and diffusion as well as the type of stochastic processes will be directly embedded into the architecture of the generative model in the second part. For interacting particle systems, while we have no direct knowledge of the velocity field, we will enforce the inferred velocity to be consistent with our knowledge of the dynamics with a physics-based soft penalty. In the following, we introduce the details of the learning algorithm for both types of systems separately. A schematic overview of the method is shown in Figure 2.

Non-interacting Particle Systems

For non-interacting particle systems, generating particle trajectories is relatively straightforward by directly applying the discretization of governing SODE or ODE. For example, if the particle trajectories are diffusion processes or Lévy processes, we can use the following forward Euler scheme:

$\displaystyle\tilde{\bm{X}}_{0}$	$\displaystyle=G(\bm{z}),\quad\bm{z}\sim\mathcal{N}$	(6)
$\displaystyle\tilde{\bm{X}}_{(i+1)\Delta t}$	$\displaystyle=\tilde{\bm{X}}_{i\Delta t}+\bm{\mu}_{t}\Delta t+\bm{\sigma}_{t}\sqrt{\Delta t}\bm{\xi}_{i},\quad i\geq 0,$
$\displaystyle\text{or }\tilde{\bm{X}}_{(i+1)\Delta t}$	$\displaystyle=\tilde{\bm{X}}_{i\Delta t}+\bm{\mu}_{t}\Delta t+\bm{\sigma}_{t}\Delta t^{1/\alpha}\bm{\zeta}_{\alpha,i},\quad i\geq 0,$

where $\Delta t$ is the time step, $\bm{\xi}_{i}$ and $\bm{\zeta}_{\alpha,i}$ are i.i.d. standard Gaussian random variables and $\alpha$ -stable random variables, respectively. We could represent $\bm{\mu}_{t}$ and $\bm{\sigma}_{t}$ with neural networks if they are unknown, or represent the unknown parameters with trainable variables if we know their parameterized form.

Our target is to tune the trainable variables in the generative model, including the parameters in $G$ and those for parameterizing $\bm{\mu}_{t}$ and $\bm{\sigma}_{t}$ , so that the generated marginal distribution $\tilde{\rho}_{t_{i}}$ fits the data $\mathcal{D}_{i}$ for each $i$ . We thus need to define a distance function $\mathsf{d}(\cdot,\cdot)$ to measure the difference between the two input distributions, which can be estimated from samples drawn from the two distributions. Consequently the loss function in non-interacting particle systems is defined as:

\displaystyle L_{\text{distribution}}

\displaystyle=\sum_{i=1}^{n}\mathsf{d}(\tilde{\rho}_{t_{i}},\hat{\rho}_{\mathcal{D}_{i}}),

(7)

where $\hat{\rho}_{\mathcal{D}_{i}}$ is the empirical distribution induced from the sample set $\mathcal{D}_{i}$ . We will refer to it as the distribution loss.

There could be many ways to define $\mathsf{d}$ , including Wasserstein distances, maximum mean discrepancy, etc. In this paper we use two approaches to define $\mathsf{d}$ .

First we choose the squared sliced Wasserstein-2 (SW) distance [26, 3] as the function $\mathsf{d}$ :

\mathsf{d}(\mu,\nu)=SW_{2}^{2}(\mu,\nu):=\int_{\mathbb{S}^{d-1}}W_{2}^{2}({\pi_{\bm{e}}}_{\#}\mu,{\pi_{\bm{e}}}_{\#}\nu)d\mathcal{H}^{d-1}({\bm{e}}),

(8)

where $W_{2}$ is the Wasserstein-2 distance, and ${\pi_{\bm{e}}}_{\#}\mu$ is the one dimensional distribution induced by projecting $\mu$ onto the direction $\bm{e}$ , defined by

({\pi_{\bm{e}}}_{\#}\mu)(A)=\mu(\{\bm{x}\in\mathbb{R}^{d}:{\bm{e}}\cdot\bm{x}\in A\}),\forall A\in\mathcal{B}(\mathbb{R}),

(9)

similarly for ${\pi_{\bm{e}}}_{\#}\nu$ . $\mathcal{H}^{d-1}$ is the uniform Hausdorff measure on the sphere $\mathbb{S}^{d-1}$ . In short, the squared sliced Wasserstein-2 distance is the expectation of the squared Wasserstein-2 distance between the two input measures projected onto uniformly random directions. The sliced Wasserstein-2 distance is exactly the Wasserstein-2 distance for one-dimensional distributions, but is easier to calculate for higher dimensional distributions. We present the details of the estimation in the Supplementary Information section S1.

We also use GANs to obtain $\mathsf{d}$ . The generative model we introduced above can generate “fake” samples $\tilde{\bm{X}}_{t_{i}}$ at $\{t_{i}\}_{i=1}^{n}$ , and for each $i$ , we use a discriminator $D_{i}$ to discriminate generated samples $\tilde{\bm{X}}_{t_{i}}$ and real samples $\bm{X}_{t_{i}}$ from $\mathcal{D}_{i}$ . The adversarial loss given by $D_{i}$ can act as a metric of the difference between $\tilde{\rho}_{t_{i}}$ and $\hat{\rho}_{\mathcal{D}_{i}}$ . In particular, we use WGAN-GP [5] as our version of GANs in our paper, with

\displaystyle\mathsf{d}(\tilde{\rho}_{t_{i}},\hat{\rho}_{\mathcal{D}_{i}})

\displaystyle=-\mathbb{E}_{\tilde{\bm{X}}_{t_{i}}\sim\tilde{\rho}_{t_{i}}}[D_{i}(\tilde{\bm{X}}_{t_{i}})]+\mathbb{E}_{\bm{X}_{t_{i}}\sim\hat{\rho}_{\mathcal{D}_{i}}}[D_{i}(\bm{X}_{t_{i}})],

(10)

and the loss function for each discriminator $D_{i}$ is defined as

	$\displaystyle L_{D_{i}}=$	$\displaystyle\mathbb{E}_{\tilde{\bm{X}}_{t_{i}}\sim\tilde{\rho}_{t_{i}}}[D_{i}(\tilde{\bm{X}}_{t_{i}})]-\mathbb{E}_{\bm{X}_{t_{i}}\sim\hat{\rho}_{\mathcal{D}_{i}}}[D_{i}(\bm{X}_{t_{i}})]$		(11)
		$\displaystyle+\lambda\mathbb{E}_{\hat{\bm{x}}_{i}\sim\rho_{\hat{\bm{x}}_{i}}}[(\\|\nabla_{\hat{\bm{x}}_{i}}D_{i}(\hat{\bm{x}}_{i})\\|_{2}-1)^{2}],\text{ for }i=1,2...n,$		(11)

where $\rho_{\hat{\bm{x}}_{i}}$ is the distribution generated by uniform sampling on straight lines between pairs of points sampled from $\tilde{\rho}_{t_{i}}$ and $\hat{\rho}_{\mathcal{D}_{i}}$ , and $\lambda=0.1$ is the gradient penalty coefficient. Here, $\mathsf{d}(\tilde{\rho}_{t_{i}},\hat{\rho}_{\mathcal{D}_{i}})$ can be mathematically interpreted as the Wasserstein-1 distance between the two input distributions. WGAN-GP is computationally more expensive than the sliced Wasserstein distance since we need to train the generative model and the discriminators iteratively, but it is more scalable to high dimensional problems, for which we made a comparison in the supplementary information section S1.

Interacting Particle Systems

The straightforward discretization of the governing Equations 3 cannot be directly applied to generate “fake” particle trajectories in interacting particle systems, especially for those with strong nonlocal interactions, since the computational cost for one time step would be $O(N^{2})$ , where $N$ is the number of particles required so that the system is close to the mean field limit, which makes the learning almost intractable. Instead, we propose to first employ a neural network $\tilde{\bm{\mu}}$ as a surrogate model of the velocity field $\bm{\mu_{t}}$ in the spatial-temporal domain to generate trajectories, and then apply an additional penalty in the loss function to enforce the velocity field $\tilde{\bm{\mu}}$ to be consistent with the Equations 3. In this paper we use the forward Euler scheme to generate trajectories:

	$\displaystyle\tilde{\bm{X}}_{0}$	$\displaystyle=G(\bm{z}),\quad\bm{z}\sim\mathcal{N}$		(12)
	$\displaystyle\tilde{\bm{X}}_{(i+1)\Delta t}$	$\displaystyle=\tilde{\bm{X}}_{i\Delta t}+\tilde{\bm{\mu}}(\tilde{\bm{X}}_{i\Delta t},i\Delta t)\Delta t,\quad i\geq 0,$		(12)

where $\Delta t$ is the time step. By employing the surrogate model for velocity, the computational cost for generating the particle trajectories is linear, instead of quadratic, with the number of particles.

To enforce $\tilde{\bm{\mu}}$ to be consistent with our knowledge of the dynamics of Equations 3, we first randomly generate $K$ particle trajectories with Equation 12, denoted as $\{\{\tilde{\bm{X}}^{(k)}_{i\Delta t}\}_{i=0}^{I}\}_{k=1}^{K}$ , and then calculate the “forces” applied on these $k$ particles by another $M$ randomly generated particles $\{\{\tilde{\bm{Y}}^{(m)}_{i\Delta t}\}_{i=0}^{I}\}_{m=1}^{M}$ , using the formula of interactions in Equations 3, where we replace the unknown parameter $\alpha$ with a trainable variable. Meanwhile, from the velocity field $\tilde{\bm{\mu}}$ we can directly calculate the accelerations for these $k$ particles using a material derivative. The forces and accelerations should be consistent with each other. We can thus define an $L_{2}$ loss:

	$\displaystyle L_{\text{Newton}}=\frac{1}{I+1}\sum_{i=0}^{I}\frac{1}{K}\sum_{k=1}^{K}(\bm{F}_{i}^{(k)}-\bm{a}_{i}^{(k)})^{2},$	(13)
$\displaystyle\bm{F}_{i}^{(k)}=$	$\displaystyle\frac{1}{M}\sum_{m=1}^{M}\phi(\\|\tilde{\bm{X}}^{(k)}_{i\Delta t}-\tilde{\bm{Y}}^{(m)}_{i\Delta t}\\|)(\tilde{\bm{\mu}}(\tilde{\bm{Y}}^{(m)}_{i\Delta t},i\Delta t)-\tilde{\bm{\mu}}(\tilde{\bm{X}}^{(k)}_{i\Delta t},i\Delta t)),$
$\displaystyle\bm{a}_{i}^{(k)}=$	$\displaystyle\frac{D\tilde{\bm{\mu}}(\tilde{\bm{X}}^{(k)}_{i\Delta t},i\Delta t))}{Dt}$
$\displaystyle=$	$\displaystyle\frac{\partial\tilde{\bm{\mu}}(\tilde{\bm{X}}^{(k)}_{i\Delta t},i\Delta t))}{\partial t}+\tilde{\bm{\mu}}(\tilde{\bm{X}}^{(k)}_{i\Delta t},i\Delta t))\cdot\nabla\tilde{\bm{\mu}}(\tilde{\bm{X}}^{(k)}_{i\Delta t},i\Delta t)),$

and we name it as the Newton loss. The time span $[0,I\Delta t]$ should cover the time for the latest snapshot $t_{n}$ , and in this paper we simply set $I$ as the ceiling of $t_{n}/\Delta t$ . To reduce computational cost, the average over $I+1$ time steps in $L_{\text{Newton}}$ can also be approximated by mini-batch, i.e., taking average over $B$ random time steps in each training step.

For one time step, the computational cost is $O(KM)$ for calculating the force terms and $O(K)$ for the acceleration terms. Note that $M$ should be larger than or equal to $N$ , but $K$ can be much less than $N$ , thus the total computational cost for $L_{\text{Newton}}$ can be much less than $O(N^{2})$ , and this makes the learning tractable.

In the end, the loss for interacting particle systems will be a combination of the Newton loss and the distribution loss:

\displaystyle L=\eta L_{\text{Newton}}+\frac{1}{n}\sum_{i=1}^{n}\mathsf{d}(\tilde{\rho}_{t_{i}},\hat{\rho}_{\mathcal{D}_{i}}),

(14)

where the weight $\eta$ is set as 1 in this paper.

Modifications to the Distributions

In order to provide flexibility and adapt the method to different problems, we can also modify the generated distribution and real data before feeding them to the distance function d and calculate the distribution loss. We present some examples here.

In some cases $\rho_{t}$ may have heavy tails, e.g., when the particle trajectories correspond to a Lévy process. The heavy tails could spoil the training, since the rare outlier samples could dominate the loss function. We can choose a suitable bounded map $h:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ to preprocess both the generated samples and the real data so that the heavy tails are removed. If the observations of the particle coordinates are noisy, we can also perturb the generated samples to add artificial noise. By scaling the artificial noise with trainable variables, the size of the noise in observations can also be learned. If we can only make observations of particles in a specific domain, i.e., the observations are truncated, we will also filter the generated samples using a corresponding mask so that the effective domain for the generated samples and observations are the same.

Computational Results

All the neural networks in the main text are feed-forward neural networks with three hidden layers, each of width 128, except the discriminator for high dimensional problems which is a ResNet with 5 hidden layers, each of width 256, and shortcut connections skipping one hidden layer. We use the leaky ReLu [27] activation function with $\alpha=0.2$ for the discriminator neural networks in WGAN-GP, while using the hyperbolic tangent activation function for other neural networks. The neural network weights are initialized with the uniform Xavier initializer, and the biases are initialized as 0. The variables for parameterizing the diffusion and noise size are initialized as 0 (before been activated by the softplus function). The drift parameters are initialized as $0$ in 1D problems, $-0.5$ in high dimensional problems, and randomly initialized with standard Gaussian distribution in 2D problems. The batch size for the distribution loss is 1000 for non-interacting particle systems, except in the 2D problem with truncated observations, where we generate 10000 samples to compensate the loss of samples due to filtering. We use the Adam optimizer [28] with $lr=0.0001,\beta_{1}=0.9,\beta_{2}=0.999$ for the cases using the sliced Wasserstein distance, while $lr=0.0001,\beta_{1}=0.5,\beta_{2}=0.9$ for the cases with WGAN-GP. The time step in the generative model is set as $\Delta t=0.01$ .

1D Problems: Brownian Motion and Lévy Process

In this section, we test our method on the 1D problems using the sliced Wasserstein distance as $\mathsf{d}$ . We first consider the SODE with Brownian motion and then with the $\alpha$ -stable symmetric Lévy process as the stochastic term:

	$\displaystyle dX_{t}$	$\displaystyle=(X_{t}-{X_{t}}^{3})dt+dB_{t},$		(15)
	$\displaystyle\text{or}\quad dX_{t}$	$\displaystyle=(X_{t}-{X_{t}}^{3})dt+dL_{t}^{\alpha},\quad t\geq 0,$		(15)

where $\alpha=1.5$ , with $\rho_{0}=\mathcal{N}(0,0.04)$ .

Note that we have no knowledge of $\rho_{0}$ in learning. Also, we assume we know that the stochastic term has a constant coefficient but we need to infer it; here the ground truth is $1.0$ . To represent the unknown coefficient, we use a trainable variable rectified by a softplus function $\text{softplus}(x)=\ln(1+e^{x})$ , which ensures positivity. As for the drift $\mu(x)=x-x^{3}$ , we consider the following two cases for both SODE problems. In case 1, we know that the drift $\mu(x)$ is a cubic polynomial of $x$ . In this case, we use a cubic polynomial $a_{0}+a_{1}x+a_{2}x^{2}+a_{3}x^{3}$ to parameterize $\mu(x)$ , and want to infer the four coefficients $a_{0},a_{1},a_{2}$ and $a_{3}$ . In case 2, we only know that the drift $\mu(x)$ is a function of $x$ and hence we use a neural network to parameterize $\mu(x)$ .

For the SODE problem with Brownian motion, we first prepare a pool of $10^{5}$ sample paths, then independently draw $10,000$ samples at $t=0.2,0.5,1.0$ from the pool as our training data. The results for both cases of drift parameterization are illustrated in Figure 3(a). In the figure, all the densities are estimated using Gaussian kernel density estimation with Scott’s rule, and the inferred densities and the reference densities come from $10^{6}$ samples produced by the generative model or simulation.

Both cases of drift parameterization provide a good inference of the diffusion coefficient, with an error less than $7\%$ after $2\times 10^{5}$ training steps, in all the runs. When using the cubic polynomial parameterization, the inferred drift fits well with the ground truth, with the relative $L_{2}$ error about $3\%$ in $[-3,3]$ averaged over three runs. The inferred drift using the neural network only fits the ground truth in the region between -1.5 and 1.5. This is reasonable since the particles mainly concentrate in this region, and we can hardly learn the drift outside this region, where the training data are sparse. However, we note that such an inference of drift is sufficiently good for an accurate time extrapolation of the particle distribution at $t=5$ .

One interesting observation is that in case 1 the inferred densities are more accurate than those estimated from the training data. This is because our knowledge of the governing SODE bridges the limited samples in three snapshots, and as we infer the density at, e.g., $t=0.2$ , we are not only utilizing the data at $t=0.2$ but also implicitly leveraging the data at $t=0.5$ and $1.0$ . We can make an analogy in the context of linear regression: multiple noisy data are helpful for inferring the hidden ground truth, since they are bridged by the regressed linear function. We present a more detailed discussion on this topic in the supplementary information in section S2.

For the SODE problem with Lévy process, we prepare $1.5\times 10^{5}$ sample paths, then independently draw $10,000$ samples within the region $[-1000,1000]$ at $t=0.2,0.5,1.0$ from the pool as our training data. To prevent instability during the training, we apply double precision and clip the generated $\alpha$ -stable random variable in Equation 2 between $-100$ and $100$ .

We first test our method without pre-processing the samples as in the problem with Brownian motion. As we can see in the second row of Figure 3(b), the inferences are far away from the ground truth. This is due to the heavy tail of $\rho_{t}$ in the Lévy process: some samples far away from $0$ , although rare, could dominate the loss function and spoil the training. To deal with this problem, we then apply a bounded map $h(x)=2\tanh(x/2)$ to all the generated and real samples before feeding them to $\mathsf{d}$ . The results are shown in the first row of Figure 3(b) where we can see the inferences are much improved. In case 2 where the drift is parameterized by neural networks, the inferred drift outside $[-1.5,1.5]$ is better than that in the problem with Brownian motion. This is because the samples are more scattered in the Lévy case.

2D Problems: Various Scenarios of Observations

In this section, we test our method on 2D problems using the sliced Wasserstein distance as $\mathsf{d}$ , with various scenarios of observations. We consider the following 2D SODE:

\displaystyle d\bm{X}_{t}

\displaystyle=\bm{\mu}(\bm{X_{t}})dt+\begin{bmatrix}s_{0}&0\\ s_{1}&s_{2}\end{bmatrix}d\bm{B}_{t}

(16)

where

\displaystyle\bm{\mu}(\bm{x})=\nabla_{\bm{x}}\varphi(\bm{x}),

(17)

and

	$\displaystyle\varphi(\bm{x})=$	$\displaystyle-(x_{1}+a_{0})^{2}(x_{2}+a_{1})^{2}$		(18)
		$\displaystyle-(x_{1}+a_{2})^{2}(x_{2}+a_{3})^{2}\quad\text{for }\bm{x}=(x_{1},x_{2}).$		(18)

The parameters are set as $a_{0}=a_{1}=s_{0}=s_{1}=s_{2}=1.0$ , $a_{2}=a_{3}=-0.5$ . We set the initial distribution as $\rho_{0}=\mathcal{N}(0,\bm{I}_{2})$ . We assume we know that the diffusion coefficient is a constant lower triangular matrix but we need to infer the three unknown parameters $s_{0},s_{1},s_{2}$ . In particular, we use $(\text{softplus}(\tilde{s}_{0}),\tilde{s}_{1},\text{softplus}(\tilde{s}_{2}))$ to approximate $(s_{0},s_{1},s_{2})$ , where $\tilde{s}_{0},\tilde{s}_{1},\tilde{s}_{2}$ are three trainable variables. Similar as in the 1D problems, we consider the following two cases for the drift. In case 1, we know the form of $\bm{\mu}$ and $\varphi$ in Equation 17 and 18 but we need to infer $a_{0},a_{1},a_{2}$ and $a_{3}$ ( $(a_{0},a_{1})$ and $(a_{2},a_{3})$ are exchangeable). In case 2, we only know that $\bm{\mu}$ is a gradient of $\varphi$ in Equation 17, but we have no knowledge of $\varphi$ . In this case, we use a neural network to parameterize $\varphi$ .

For the training data, we prepare $10^{5}$ sample paths, and consider the following various scenarios of observations at $t=0,0.1,0.2,0.3,0.5,0.7,1.0$ . The data are visualized in the supplementary information section S3.

•

Scenario 1: we assume that our observations are ideal, i.e., we make accurate observations of all the particle coordinates as the training data.
•

Scenario 2: we assume that our observations are noisy. Specifically, we make observations of all the particle coordinates, but each coordinate is perturbed by an i.i.d. random noise $\mathcal{N}(\bm{0},e^{2}\bm{I}_{2})$ , where $e$ is set as $0.2$ but also need to be inferred during learning.
•

Scenario 3: we assume that our observations are truncated. Specifically, we make observations of the particle coordinates in $\Omega=(\infty,0.5)\times\mathbb{R}$ , with the particles outside of $\Omega$ dropped.

For the first scenario, the inferred drift and diffusion are illustrated in Figure 4(a). In case 1, all the inferred drift and diffusion parameters approach the ground truth during the training, with relative error less than $0.3\%$ for each $a_{i}$ and less than $4\%$ for each $\sigma_{i}$ after $2\times 10^{5}$ training steps. In case 2, the inferred diffusion parameters are also close to the ground truth, with error comparable with that in case 1. In the region where the training data are dense (i.e., the density estimated from all the training data is no less than 0.05), the inferred drift field has a relative $L_{2}$ error about $13\%$ . The inference is much worse in other regions, and the reason is the same as in the 1D problem: the neural network can hardly learn the drift field where the training data are sparse.

For the second and third scenarios, we apply the technique of perturbing and filtering the generated samples, respectively. In Figure 4(b) and Figure 4(c) we show the results of case 1 with the drift parameterized by four parameters. After $2\times 10^{5}$ training steps, all the parameters converge to ground truth. For scenario 2, the error is less than $0.5\%$ for each drift parameter $a_{i}$ , less than $3\%$ for each diffusion parameter $s_{i}$ , and about $5\%$ for the noise parameter $e$ . For scenario 3, the error is less than $0.2\%$ for each drift parameter $a_{i}$ , less than $1\%$ for each diffusion parameter $s_{i}$ . We remark that we failed to learn the drift with neural network parameterization in the second and third scenario, suggesting a room for further improvements.

Higher Dimensional Problems

In this section, we test our method on higher dimensional non-interacting particle systems. Note that the sliced Wasserstein distance does not perform well in high dimensional problems, and we thus switch to the WGAN-GP to provide $\mathsf{d}$ . The comparison between the sliced Wasserstein distance and WGAN-GP in high dimensional problems is presented in the supplementary information section S1.

We consider a $d$ -dimensional SODE:

\displaystyle d\bm{X}_{t}

\displaystyle=\bm{\mu}(\bm{X_{t}})dt+\bm{\sigma}d\bm{B}_{t}

(19)

where

\displaystyle\mu^{(i)}(\bm{X_{t}})=X_{t}^{(i)}-(X_{t}^{(i)})^{3},i=1,2...,d,

(20)

for $\mu^{(i)}$ as the $i$ -th component of $\bm{\mu}\in\mathbb{R}^{d}$ , and $X_{t}^{(i)}$ is the $i$ -th component of $\bm{X}_{t}\in\mathbb{R}^{d}$ . We set the diffusion coefficient matrix as

\displaystyle\bm{\sigma}=\begin{bmatrix}s_{1}&0&0&0&\cdots&0\\ s^{\prime}_{2}&s_{2}&0&0&\cdots&0\\ 0&s^{\prime}_{3}&s_{3}&0&\cdots&0\\ 0&0&s^{\prime}_{4}&s_{4}&\cdots&0\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&s^{\prime}_{d}&s_{d}\\ \end{bmatrix},

(21)

where the $2d-1$ nonzero entries $\{s_{i}\}_{i=1}^{d}$ and $\{s^{\prime}_{i}\}_{i=2}^{d}$ are set as 1. The initial distribution is $\rho_{0}=\mathcal{N}(\bm{0},0.04\bm{I}_{d})$ . Due to the non-diagonal diffusion coefficient matrix, the motion in different dimensions are coupled.

We prepare $10^{5}$ sample paths and observe all the particle positions at $t=0.2,0.5,1.0$ as our training data. We use a cubic polynomial with four variables to parameterize $\mu^{(i)}$ as a function of $X_{t}^{(i)}$ for each $i$ , while using $2d-1$ trainable variables to learn the nonzero entries in the diffusion coefficient matrix, with the diagonal entries rectified by a softplus function for positivity. In other words, we use $6d-1$ variables to parameterize the drift and diffusion for the $d$ -dimensional SODE. The results are shown in Figure 5. Even for the 100-dimensional problem, after $6\times 10^{5}$ training steps, the average error of the diffusion coefficients is about $0.04$ , and the average relative $L_{2}$ error of the drift in the interval $[-3,3]$ in each dimension is about $14\%$ .

It may appear surprising that we can solve a 100-dimensional problem with only $10^{5}$ samples since usually an exponentially large number of samples is required to describe the distribution. However, we remark that the difficulty of the learning task is significantly reduced since our partial knowledge, i.e., the form of the SODE, is encoded into the generator.

Interacting Particle Systems

In this section we consider the 1D and 2D cases of the interacting particle system with governing equations 3. To prepare the training data, we follow [23] and set the initial condition, including the density and velocity as

	$\displaystyle\rho_{0}(x)$	$\displaystyle=\mathbbm{1}_{[-0.75,0.75]}\frac{\pi}{3}\cos\left(\frac{3\pi x}{2}\right),$		(22)
	$\displaystyle\mu_{0}(x)$	$\displaystyle=-\frac{1}{2}\sin\left(\frac{3\pi x}{2}\right),$		(22)

for the 1D case and

	$\displaystyle\rho_{0}(x_{1},x_{2})$	$\displaystyle=\mathbbm{1}_{[-0.75,0.75]^{2}}\left(\frac{\pi}{3}\right)^{2}\cos\left(\frac{3\pi x_{1}}{2}\right)\cos\left(\frac{3\pi x_{2}}{2}\right),$		(23)
	$\displaystyle\bm{\mu}_{0}(x_{1},x_{2})$	$\displaystyle=\left(-\frac{1}{2\sqrt{2}}\sin(\frac{3\pi x_{1}}{2}),-\frac{1}{2\sqrt{2}}\sin(\frac{3\pi x_{2}}{2})\right)$		(23)

for the 2D case, where $\mathbbm{1}$ is the indicator function. Using the Velocity-Verlet method with time step 0.01, we perform simulations with 1024 and 9976 particles for the 1D and 2D cases, respectively, and generate data at $t=0.5,0.6,...,2.0$ , i.e., 16 snapshots in total. The input radius $r$ of the influence function $\phi(r)$ is clipped to be at least $r_{\text{min}}=0.01$ , both in simulation and learning, to avoid the singularity. While using the same number of particles and make observations at the same time instants as in [23], our method does not require knowledge of the initial condition or the velocity field in the data snapshots, as opposed to their method based on Bayesian optimization to infer $\alpha$ .

We apply the sliced Wasserstein distance for the distribution loss, with the batch size equal to the number of particles in training data. When calculating the distance at each time instant, the generated and real distributions are normalized with the mean and standard deviation of the real distribution. For the Newton loss, we set $K=16$ , $M=10000$ and $B=10$ . We use $2\times\text{sigmoid}(\beta)$ to represent the inferred $\alpha$ so that the inference is bounded by 0 and 2, and the variable $\beta$ is initialized as 0.

The results are visualized in Figure 6. For the 1D case, the inferred $\alpha$ is 0.492 with ground truth 0.5 in the end of training. As a comparison, the inference is 0.480 for the 1D case in [23]. Our inferred velocity field also matches well with the reference ground truth at different time instants, even at $t=0$ and $0.25$ when we have no observations at all. This is because we incorporated the knowledge of dynamics in the learning system. For the 2D case, while the inferred velocity field also matches well with the ground truth, the inferred $\alpha$ is 0.469 with ground truth 0.5 in the end of training, i.e., about 6% error, much larger than in the 1D case. We attribute the error to the fact that the singularity problem for $\phi(r)$ , where the order for $r$ is dimension-dependent, is more severe in the 2D case. In supplementary information section S4, we show that as we increase $r_{\min}$ from 0.01 to 0.1, the error is reduced from 6% to 2%. Alternatively, in the 2D case we also try to directly use the data at $t=0.5$ as the starting coordinates to generate trajectories for $t\geq 0.5$ ( $M$ is also reduced to the data size $9976$ ), so that the initial generator $G$ is removed from the generative model. By doing so, the number of trainable variables is reduced and the learning becomes easier. The inferred $\alpha$ is 0.493 in the end of training. As a comparison, the inference is 0.513 for the 2D case in [23]. However, we remark that while this strategy helps to infer $\alpha$ without extra data, we cannot infer the velocity or density for $0\leq t<0.5$ .

Paired Observations

We should note that the convergence of the marginal distribution in each snapshot does not necessarily lead to the convergence of the joint distribution of coordinate tuples $(\bm{X}_{t_{1}},\bm{X}_{t_{2}},...,\bm{X}_{t_{n}})$ . While the joint distribution is not available for unpaired observations that we are focused on so far, in other cases where the observed particle coordinates can be paired across snapshots, it is possible to improve the inference by fitting the joint distribution of coordinate tuples with the corresponding generated joint distribution.

In [29], the authors pointed out that Wasserstein convergence of the distribution of $(\bm{X}_{t_{1}},\bm{X}_{t_{2}},...,\bm{X}_{t_{n}})$ is equivalent to Wasserstein convergence of the distributions of $(\bm{X}_{t_{i}},\bm{X}_{t_{i+1}})$ for all $i=1,2,...,n-1$ , if $(\bm{X}_{t_{1}},\bm{X}_{t_{2}},...,\bm{X}_{t_{n}})$ is a Markov chain. However, the sample spaces for $\{\bm{X}_{t_{i}}\}_{i=1}^{n}$ are limited to be finite and discrete in [29], thus the result doesn’t apply to dynamic systems in continuous spaces. Here, we present a new theorem for continuous sample spaces:

Theorem 1.

Let $(X_{1},X_{2},...X_{T})$ be a Markov chain of length $T\geq 3$ and we use $X_{i:j}$ to denote the nodes $(X_{i},X_{i+1}...X_{j})$ , for $i\leq j$ . Suppose the domain $D_{t}$ for $X_{t}$ is a compact subset of $\mathbb{R}^{d_{t}}$ for $t=1,2...T$ . We use the $l_{q}$ ( $q\geq 1$ ) Euclidean metric for all the Euclidean spaces with different dimensions.

Let $\{P_{n}^{X_{i:j}}\}_{n=1}^{\infty}$ and $P^{X_{i:j}}$ be probability measures of $X_{i:j}$ for $i\leq j$ , $P_{n}^{X_{i}|X_{j}}$ and $P^{X_{i}|X_{j}}$ be the corresponding probability transition kernels. If $P_{n}^{X_{t:t+1}}$ converges to $P^{X_{t:t+1}}$ in Wasserstein- $p$ ( $p\geq 1$ ) metric for all $t=1,2...T-1$ , $P_{n}^{X_{t}|X_{t+1}}$ and $P^{X_{t+2}|X_{t+1}}$ as functions of $X_{t+1}$ are $C$ -Lipschitz continuous in Wasserstein- $p$ metric for all $t=1,2...T-2$ and $n$ , where $C$ is a constant, then $P_{n}^{X_{1:T}}$ converges to $P^{X_{1:T}}$ in Wasserstein- $p$ metric.

We present the proof for Theorem 1 in the supplementary information section S5. The assumption on the continuity of the probability transition kernels is not required for finite discrete sample spaces in [29], but the theorem does not hold without it for continuous sample space. We provide a counter-example in the supplementary information section S6.

The theorem states that under certain conditions, we can set our goal as fitting the distributions of $(\bm{X}_{t_{i}},\bm{X}_{t_{i+1}})$ , i.e., coordinate pairs from adjacent snapshots. This should be easier compared with directly fitting the distribution of $(\bm{X}_{t_{1}},\bm{X}_{t_{2}},...,\bm{X}_{t_{n}})$ , since the effective dimensionality is reduced. We still can view this approach as “ensemble-regression”, except that instead of a curve, we try to fit the data with a 2D surface $(t_{i},t_{j})\rightarrow\rho_{t_{i},t_{j}}$ in the probability measure space, where $\rho_{t_{i},t_{j}}$ denotes the joint distributions of $(\bm{X}_{t_{i}},\bm{X}_{t_{j}})$ .

As an illustration, we study the 1D Ornstein–Uhlenbeck (OU) process:

\displaystyle dX_{t}=-X_{t}dt+\sqrt{2}dW_{t},

(24)

with $\rho_{0}=\mathcal{N}(0,1)$ , so that $\rho_{t}=\mathcal{N}(0,1)$ for any $t>0$ . This is a special example as the governing SODE is not unique given $\rho_{t}$ . We make observations of 100 particles at $t=0.1,0.2,0.3,0.4,0.5,0.75,1.0$ as data, and compare the inferences by fitting the marginal distributions of individual coordinates or the joint distributions of adjacent coordinate pairs using the SW distance. The drift function is parameterized by a linear function $y(x)=ax$ or a neural network, while the diffusion coefficient is represented by a trainable variable rectified by a softplus function. We present the results in Figure 7, where we can clearly see the failure in the cases of fitting the marginal distributions, while fitting the joint distributions works very well.

Summary and Discussion

We have proposed a new method for inferring the governing dynamics of particles from unpaired observations of their coordinates at multiple time instants, namely “snapshots”. We fitted the observed particle ensemble distribution with a physics-informed generative model, which can be viewed as performing regression in the probability measure space. We refer to this approach as generative “ensemble-regression”, in analogy to the classic “point-regression”, where we infer the dynamics by performing regression in the Euclidean space.

We first applied the method to particle systems governed by independent stochastic ordinary differential equations (SODE) with Brownian or Lévy noises, where we inferred the drift and the diffusion terms from a small number of snapshots. In the Lévy noise case, we demonstrated that the heavy tails in the distributions could spoil the training, but we addressed this issue by applying a preprocessing map to both the generated and target distributions. In scenarios with noisy or truncated training data, we modified the generated distributions accordingly by perturbing or filtering the generated samples. We then addressed high-dimensional SODE problems using the adversarial loss in GANs. In the end, we managed to learn the parameters for particle interactions in a nonlocal flocking systems.

It is possible to apply our method to learn the interaction parameters for particle-based simulation methods. In particular, we will fit the target mass and velocity distributions coming from the analytical solution or other simulation methods that are accurate but expensive. We leave this promising research direction for future work.

Acknowledgement

This work was supported by the PhILMS grant DE-SC0019453 and by the OSD/AFOSR MURI Grant FA9550-20-1-0358. We would like to thank Prof. Hui Wang and Ms. Tingwei Meng for carefully checking the proof of our theorem. We also want to thank Dr. Zhongqiang Zhang for helpful discussions.

Supplementary Information

S1. Comparison between Sliced Wasserstein Distance and WGAN-GP

The squared sliced Wasserstein-2 distance is the expectation of the squared Wasserstein-2 distance between the two input measures projected onto uniformly random directions. To estimate $SW_{2}(\mu,\nu)$ from samples of $\mu$ and $\nu$ , we use the following process introduced in [3].

•

Draw samples independently from $\mu$ and $\nu$ with batch size $b$ , denoted as $\mathcal{U}$ and $\mathcal{V}$ .
•

Uniformly sample $m$ projection directions $\{\bm{e}_{j}\}_{j=1}^{m}$ in $\mathbb{R}^{d}$ . In this paper we set $m=1000$ .
•

For each random direction $\bm{e}_{j}$ , project and sort the samples in $\mathcal{U}$ and $\mathcal{V}$ in the direction of $\bm{e}_{j}$ , getting $\{u_{i,j}\}_{i=1}^{b}$ and $\{v_{i,j}\}_{i=1}^{b}$ , where $u_{i,j}\leq u_{i+1,j}$ and $v_{i,j}\leq v_{i+1,j}$ for $i=1,2...b-1$ . Calculate $L_{j}=\sum_{i=1}^{b}(u_{i,j}-v_{i,j})^{2}/b$ .
•

Calculate $L=\sum_{j=1}^{m}L_{j}/m$ as the estimation of squared $SW_{2}(\mu,\nu)$ .

Compared with GANs, the sliced Wasserstein distance does not need discriminators, and is more robust than WGAN-GP in low dimensional problems. Take the following 1D problem as an example. We consider the SODE:

\displaystyle dX_{t}=adt

\displaystyle+bdB_{t},\quad t\geq 0,

(25)

with $\rho_{0}=\mathcal{N}(-0.5,0.5)$ . We set $a=b=1$ so that the exact solution is

\displaystyle\rho_{t}=\mathcal{N}(t-0.5,t+0.5).

(26)

We have $10,000$ samples at $t=0.5$ and $1.5$ , respectively, as the training data, and wish to infer the constant drift and diffusion coefficients. The input noise to the generator is uniform noise from -1 to 1. The results are shown in Figure 8.

We can clearly see oscillations of the inferred drift coefficient for WGAN-GP. This can be attributed to the two-player game between the generator and discriminator, which was also reported in [30].

We did not observe significant oscillations using WGAN-GP in higher dimensional problems. Indeed, WGAN-GP outperforms the sliced Wasserstein distance in high dimensional problems. Let us consider the following simple SODE problem as an example, where the motions are uncoupled between dimensions:

\displaystyle dX_{t}^{(i)}=(X_{t}^{(i)}-(X_{t}^{(i)})^{3})dt+dB_{t}^{(i)},i=1,2,...d

(27)

where $X_{t}^{(i)}$ is the $i$ -th component of $\bm{X}_{t}\in\mathbb{R}^{d}$ , with $\rho_{0}=\mathcal{N}(\bm{0},0.04\bm{I}_{d})$ . We prepare $10^{5}$ sample paths and observe all the particle positions at $t=0.2,0.5,1.0$ as our training data. We use a cubic polynomial $a^{(i)}_{0}+a^{(i)}_{1}x+a^{(i)}_{2}x^{2}+a^{(i)}_{3}x^{3}$ to parameterize the $i$ -th component of the drift. We use another trainable variable rectified by a softplus function to represent the diffusion coefficient $s^{(i)}=1$ in each direction. In total, $5d$ variables are used to parameterize the $d$ -dimensional SODE. The results are shown in Figure 9. While the SW distance works for 2D problems, it does not scale well to high dimensional problems for which WGAN-GP gives much better inferences.

S2. A Study on the Density Estimation

In this section, we wish to show that our method can make use of the SODE and multiple snapshots to reduce the error of density estimation from limited samples. We consider the one-dimensional SODE:

\displaystyle dX_{t}=(4X_{t}-{X_{t}}^{3})dt

\displaystyle+0.4dB_{t},\quad t\geq 0,

(28)

with $\rho_{0}=0.5\mathcal{N}(-0.5,0.3^{2})+0.5\mathcal{N}(0.5,0.3^{2})$ .

We test the following two cases of training data sets. In both cases we have 1000 samples at $t=0.05,0.1,0.15,0.2,0.25$ , but the particle trajectories are different.

•

Case 1: We make observations from different sets of trajectories at different time instants.
•

Case 2: We make observations from the same set of trajectories at different time instants.

We assume that we know Equation 28, but have no knowledge of $\rho_{0}$ . We could make the analogy of performing linear regression, where we know the slope but do not know the intercept. In Figure 10 we show the comparison between the inferred densities and the densities estimated directly from data. All the densities are estimated using Gaussian kernel density estimation. The inferred densities and the ground truth come from $10^{5}$ samples produced by the generative model or simulation, with bandwidth decided by the Scott’s rule. To remove the effect of bandwidth selection in the density estimation directly from data, we perform a grid search of the optimal bandwidth factor via the $L_{2}$ error against ground truth, from 0.01 to 0.5 with grid size 0.01 (Scott’s rule suggests about 0.25).

The grid search strategy is actually infeasible in practice since we don’t know the ground truth, but it should perform better than any bandwidth selector. Despite that, in Figure 10(a) we can clearly see that in case 1 our inferred densities with naive Scott’s rule significantly outperform the ones estimated directly from training data with grid search. This cannot be attributed to the number of samples, since in case 2 the inferred densities with our method cannot perform better, as shown in Figure 10(b).

The different performances in two cases are reasonable. In case 1, our method can utilize the SODE and the observations at multiple time instants to reduce the error from limited samples. As we infer the density at $t=0.05$ , we are not only utilizing the data at $t=0.05$ but also implicitly taking advantage of the data at later time instants. In case 2, since the observations come from the same sample paths, the observations at later time instants cannot provide additional information considering that the SODE is already known. Let us again make an analogy in the context of linear regression: if the observations have independent noise, multiple observations will be more helpful than a single observation. However, if the observations have the same noise, multiple observations cannot help us more than a single observation, if we already know the slope.

S3. Training Data in 2D Problems with Various Scenarios of Observations

In Figure 11 we visualize the training data in 2D problems with various scenarios of observations.

S4. Effect of Clipping Radius in the Interacting Particle System

In this section we study the effect of various clipping radius $r_{\text{min}}$ from 0.01 to 0.1, both in the simulation for data generation and the learning algorithm. We remark that changing $r_{\text{min}}$ in this range would not make a huge difference to the system behavior, since the dynamics are dominated by non-local interactions. In Figure 12 we show the inferred $\alpha$ during the training. As $r_{\text{min}}$ increase from 0.01 to 0.1, the inferred $\alpha$ in the end of training increases from 0.469 to 0.490, with the ground truth 0.5. This suggests that the error mainly comes from the singularity problem of $\phi(r)$ when $r$ is small.

S5. Proof for Theorem 1

The proof is based on the weak convergence: convergence of a sequence of probability measures $\{P_{n}\}_{n=1}^{\infty}$ to a probability measure $P$ in the Wasserstein- $p$ metric is equivalent to the weak convergence plus the convergence of $p$ -th moments of $\{P_{n}\}_{n=1}^{\infty}$ to $P$ [31]. Here, the weak convergence mean $E_{P_{n}}[f(x)]$ converges to $E_{P}[f(x)]$ for each $f\in\mathcal{F}$ , where $\mathcal{F}=\{\text{all bounded and continuous functions}\}$ or $\mathcal{F}=\{\text{all bounded and Lipschitz functions}\}$ , which are equivalent[32]. Note that in Theorem 1 we assume that the domain for $X_{t}$ is a compact subset of Euclidean space for each $t$ , thus $\|X_{1:T}\|^{p}$ is a bounded and continuous function of $X_{1:T}$ . The convergence of $p$ -th moment then directly comes from the weak convergence, i.e., we only need to show that $\{P_{n}\}_{n=1}^{\infty}$ converges to $P$ weakly. This is proved in Lemma 4 below.

The notations are inherited from Theorem 1. For simplicity, we will sometimes use $X,Y,Z$ to represent the nodes in sequence, and use $P^{XY}$ to represent the probability measure of $(X,Y)$ , etc.

Lemma 1.

Let $P_{y}$ mapping from the Euclidean space to the probability measure space be $C$ -Lipschitz in Wasserstein- $p$ metric ( $p\geq 1$ ). For any $f(x,y,z):\mathbb{R}^{d_{x}}\times\mathbb{R}^{d_{y}}\times\mathbb{R}^{d_{z}}\rightarrow\mathbb{R}$ $K$ -Lipschitz,

g(x,y)=\int f(x,y,z)dP_{y}(z)

is $(C+1)K$ -Lipschitz.

Proof.

Note that

	$\displaystyle\left\|g(x+\epsilon,y+\xi)-g(x,y)\right\|$	(29)
$\displaystyle=$	$\displaystyle\left\|\int f(x+\epsilon,y+\xi,z)dP_{y+\xi}(z)-\int f(x,y,z)dP_{y}(z)\right\|$
$\displaystyle\leq$	$\displaystyle\left\|\int f(x+\epsilon,y+\xi,z)dP_{y+\xi}(z)-\int f(x+\epsilon,y+\xi,z)dP_{y}(z)\right\|+$
	$\displaystyle\left\|\int f(x+\epsilon,y+\xi,z)dP_{y}(z)-\int f(x,y,z)dP_{y}(z)\right\|.$

Firstly

	$\displaystyle\left\|\int f(x+\epsilon,y+\xi,z)dP_{y+\xi}(z)-\int f(x+\epsilon,y+\xi,z)dP_{y}(z)\right\|$	(30)
$\displaystyle\leq$	$\displaystyle KW_{1}(P_{y},P_{y+\xi})$
$\displaystyle\leq$	$\displaystyle KW_{p}(P_{y},P_{y+\xi})$
$\displaystyle\leq$	$\displaystyle CK\\|\epsilon\\|,$

where the first inequality comes from the Kantorovich–Rubinstein formula [31] and that $f(x+\epsilon,y+\xi,z)$ is $K$ -Lipschitz. The last inequality comes from that $P_{y}$ is $C$ -Lipschitz in Wasserstein- $p$ sense.

Secondly,

	$\displaystyle\left\|\int f(x+\epsilon,y+\xi,z)dP_{y}(z)-\int f(x,y,z)dP_{y}(z)\right\|$	(31)
$\displaystyle\leq$	$\displaystyle\int\left\|f(x+\epsilon,y+\xi,z)-f(x,y,z)\right\|dP_{y}(z)$
$\displaystyle\leq$	$\displaystyle\int K\\|(\epsilon,\xi)\\|dP_{y}(z)$
$\displaystyle=$	$\displaystyle K\\|(\epsilon,\xi)\\|.$

We conclude that

\displaystyle\left|g(x+\epsilon,y+\xi)-g(x,y)\right|\leq(C+1)K\|(\epsilon,\xi)\|,

(32)

i.e. $g(x,y)$ is $(C+1)K$ -Lipschitz ∎

Lemma 2.

If $P_{n}^{XY}$ converge to $P^{XY}$ weakly, then $P_{n}^{X}$ converge to $P^{X}$ weakly.

Proof.

For any bounded and continuous function $f(x)$

		$\displaystyle\int f(x)P_{n}^{X}(x)-\int f(x)P^{X}(x)$		(33)
	$\displaystyle=$	$\displaystyle\iint f(x)P_{n}^{XY}(x,y)-\iint f(x)P^{XY}(x,y)\rightarrow 0.$		(33)

The convergence comes from the weak convergence of $P_{n}^{XY}$ and that $f(x)$ is bounded and continuous as a function of $x$ and $y$ . ∎

Lemma 3.

Suppose $P_{n}^{YZ}$ converge to $P^{YZ}$ weakly, $P^{Z|Y=y}$ is Lipschitz in Wasserstein- $p$ metric. For any $g(y,z):\mathbb{R}^{d_{y}}\times\mathbb{R}^{d_{z}}\rightarrow\mathbb{R}$ bounded and Lipschitz, we have

		$\displaystyle\iint g(y,z)dP_{n}^{Z\|Y=y}(z)dP_{n}^{Y}(y)$		(34)
	$\displaystyle-$	$\displaystyle\iint g(y,z)dP^{Z\|Y=y}(z)dP_{n}^{Y}(y)\rightarrow 0.$		(34)

Proof.

Since $g(y,z)$ is bounded and continuous, and $P_{n}^{YZ}$ converge to $P^{YZ}$ weakly,

\displaystyle\iint g(y,z)dP_{n}^{YZ}(y,z)-\iint g(y,z)dP^{YZ}(y,z)\rightarrow 0,

(35)

i.e.

		$\displaystyle\iint g(y,z)dP_{n}^{Z\|Y=y}(z)dP_{n}^{Y}(y)$		(36)
	$\displaystyle-$	$\displaystyle\iint g(y,z)dP^{Z\|Y=y}(z)dP^{Y}(y)\rightarrow 0.$		(36)

Since $g(y,z)$ is bounded and Lipschitz, $P^{Z|Y=y}$ is Lipschitz, we have

\displaystyle\int g(y,z)dP^{Z|Y=y}(z)

(37)

is Lipschitz. Further since $P_{n}^{Y}$ converge to $P^{Y}$ weakly (from Lemma 2):

		$\displaystyle\iint g(y,z)dP^{Z\|Y=y}(z)dP_{n}^{Y}(y)$		(38)
	$\displaystyle-$	$\displaystyle\iint g(y,z)dP^{Z\|Y=y}(z)dP^{Y}(y)\rightarrow 0.$		(38)

Combining 36 and 38, we have

		$\displaystyle\iint g(y,z)dP_{n}^{Z\|Y=y}(z)dP_{n}^{Y}(y)$		(39)
	$\displaystyle-$	$\displaystyle\iint g(y,z)dP^{Z\|Y=y}(z)dP_{n}^{Y}(y)\rightarrow 0.$		(39)

∎

Lemma 4.

Let $\{P_{n}^{X_{i:j}}\}_{n=1}^{\infty}$ and $P^{X_{i:j}}$ be probability measures of $X_{i:j}$ for $i\leq j$ , $P_{n}^{X_{i}|X_{j}}$ and $P^{X_{i}|X_{j}}$ be the corresponding probability transition kernels. If $P_{n}^{X_{t:t+1}}$ converges to $P^{X_{t:t+1}}$ weakly for all $t=1,2...T-1$ , $P_{n}^{X_{t}|X_{t+1}}$ and $P^{X_{t+2}|X_{t+1}}$ are $C$ -Lipschitz continuous in Wasserstein- $p$ metric for all $t=1,2...T-2$ and $n$ , where $C$ is a constant, then $P_{n}^{X_{1:T}}$ converges to $P^{X_{1:T}}$ weakly.

Proof.

We start from the case $T=3$ . For simplicity, we use $X,Y,Z$ to denote the nodes in sequence.

For any $f(x,y,z)$ $K$ -Lipschitz and bounded, we have

	$\displaystyle\int f(x,y,z)dP_{n}^{XYZ}(x,y,z)$	(40)
$\displaystyle=$	$\displaystyle\int\left(\int f(x,y,z)dP_{n}^{Z\|Y=y}(z)\right)dP_{n}^{XY}(x,y)$
$\displaystyle=$	$\displaystyle\int(g_{n}(x,y)-g(x,y))dP_{n}^{XY}(x,y)$
$\displaystyle+$	$\displaystyle\int g(x,y)dP_{n}^{XY}(x,y),$

where

	$\displaystyle g(x,y)$	$\displaystyle=\int f(x,y,z)dP^{Z\|Y=y}(z)$		(41)
	$\displaystyle g_{n}(x,y)$	$\displaystyle=\int f(x,y,z)dP_{n}^{Z\|Y=y}(z).$		(41)

From Lemma 1, since $f(x,y,z)$ is Lipschitz, $P^{Z|Y=y}$ is Lipschitz in Wasserstein- $p$ sense, we have $g(x,y)$ is Lipschitz. $g(x,y)$ is also bounded since $f(x,y,z)$ is bounded. So we have

\displaystyle\int g(x,y)dP_{n}^{XY}(x,y)\rightarrow\int g(x,y)dP^{XY}(x,y),

(42)

since $P_{n}^{XY}$ converge to $P^{XY}$ weakly.

We then need to show $\int(g_{n}(x,y)-g(x,y))dP_{n}^{XY}(x,y)$ converges to 0. We prove by contradiction. Suppose it does not converge, then there exists $\epsilon>0$ and a subsequence of $n$ (denote as $i$ ) such that

\displaystyle\left|\int(g_{i}(x,y)-g(x,y))dP_{i}^{XY}(x,y)\right|\geq\epsilon,\forall i.

(43)

Without loss of generality, we assume that

\displaystyle\int(g_{i}(x,y)-g(x,y))dP_{i}^{XY}(x,y)\geq\epsilon>0,\forall i,

(44)

i.e.

		$\displaystyle\iiint f(x,y,z)dP_{i}^{Z\|Y=y}(z)dP_{i}^{X\|Y=y}(x)dP_{i}^{Y}(y)$		(45)
	$\displaystyle-$	$\displaystyle\iiint f(x,y,z)dP^{Z\|Y=y}(z)dP_{i}^{X\|Y=y}(x)dP_{i}^{Y}(y)\geq\epsilon>0,\forall i.$		(45)

Let

\displaystyle h_{i}(y,z)=\int f(x,y,z)dP_{i}^{X|Y=y}(x),

(46)

then

		$\displaystyle\iint h_{i}(y,z)dP_{i}^{Z\|Y=y}(z)dP_{i}^{Y}(y)$		(47)
	$\displaystyle-$	$\displaystyle\iint h_{i}(y,z)dP^{Z\|Y=y}(z)dP_{i}^{Y}(y)\geq\epsilon>0,\forall i.$		(47)

Note that $f$ is $K$ -Lipschitz, $P_{i}^{X|Y=y}$ is $K$ -Lipschitz, we have $h_{i}$ are all $K(C+1)$ -Lipschitz, thus uniformly equicontinuous. Also $h_{i}(y,z)$ are uniformly bounded by the bound of $f$ , and the domain for $(Y,Z)$ is compact (since the domain $D_{t}$ for $X_{t}$ is compact for all $t$ ). By the Arzela-Ascoli theorem, there exists a subsequence $j$ such that $h_{j}\rightarrow h$ uniformly, where $h$ is $K(C+1)$ -Lipschitz and bounded. Therefore,

		$\displaystyle\iint(h_{j}(y,z)-h(y,z))dP_{j}^{Z\|Y=y}(z)dP_{j}^{Y}(y)$		(48)
	$\displaystyle-$	$\displaystyle\iint(h_{j}(y,z)-h(y,z))dP^{Z\|Y=y}(z)dP_{j}^{Y}(y)\rightarrow 0.$		(48)

From Lemma 3

		$\displaystyle\iint h(y,z)dP_{j}^{Z\|Y=y}(z)dP_{j}^{Y}(y)$		(49)
	$\displaystyle-$	$\displaystyle\iint h(y,z)dP^{Z\|Y=y}(z)dP_{j}^{Y}(y)\rightarrow 0.$		(49)

We then have

		$\displaystyle\iint h_{j}(y,z)dP_{j}^{Z\|Y=y}(z)dP_{j}^{Y}(y)$		(50)
	$\displaystyle-$	$\displaystyle\iint h_{j}(y,z)dP^{Z\|Y=y}(z)dP_{j}^{Y}(y)\rightarrow 0.$		(50)

We have a contradiction between Equation 47 and 50. We finish the proof for $T=3$ .

We then prove the case for general $T$ by induction. Suppose it holds for $T=N$ , we now prove it for $T=N+1$ .

From the conditions for the case $T=N+1$ and that the theorem holds for $T=N$ , we have $P_{n}^{X_{1:N}}$ converges to $P^{X_{1:N}}$ weakly and $P_{n}^{X_{2:N+1}}$ converges to $P^{X_{2:N+1}}$ weakly. We now view $X_{1}$ as $X$ , view the Cartesian product of $X_{2:N}$ as $Y$ , and view $X_{N+1}$ as $Z$ , so we have $P_{n}^{XY}$ converges to $P^{XY}$ weakly and $P_{n}^{YZ}$ converges to $P^{YZ}$ weakly. Also $P_{n}^{X|Y}$ and $P^{Z|Y}$ are $C$ -Lipschitz continuious and Wasserstein- $p$ metric for all $n$ , since $P_{n}^{X|Y}=P_{n}^{X_{1}|X_{2}}$ and $P^{Z|Y}=P^{X_{N+1}|X_{N}}$ from the Markovian property, and that $\|X_{2}\|\leq\|X_{2:N}\|$ , $\|X_{N}\|\leq\|X_{2:N}\|$ . Therefore, from the theorem for $T=3$ case we have that $P_{n}^{XYZ}$ converges to $P^{XYZ}$ weakly, i.e., $P_{n}^{X_{1:N+1}}$ converges to $P^{X_{1:N+1}}$ weakly.

∎

S6. A Counterexample in Continuous Sample Space

As a direct implementation of Corollary 7 in [29], for the Markov chain $X_{1:T}$ in a finite discrete sample space, $P_{n}^{X_{t:t+1}}$ converges to $P^{X_{t:t+1}}$ in Wasserstein- $p$ sense for each $t=1,2...T-1$ implies that $P_{n}^{X_{1:T}}$ converges to $P^{X_{1:T}}$ in Wasserstein- $p$ sense. However, if the Markov chains are defined in the continuous sample space, the implementation is, in general, not correct without further assumptions, e.g., the assumption of continuity for probability transition kernels. In this section, we provide a counterexample to show that.

We consider the Markov chain with $T=3$ and use $X,Y,Z$ to denote the nodes in sequence. We define $P_{n}^{XYZ}$ as follows:

	$\displaystyle P_{n}^{XYZ}(0,0,0)$	$\displaystyle=\frac{1}{2},$		(51)
	$\displaystyle P_{n}^{XYZ}(1,\frac{1}{n},1)$	$\displaystyle=\frac{1}{2},$		(51)

and $P^{XYZ}$ as following:

$\displaystyle P^{XYZ}(0,0,0)$	$\displaystyle=\frac{1}{4},$	(52)
$\displaystyle P^{XYZ}(0,0,1)$	$\displaystyle=\frac{1}{4},$
$\displaystyle P^{XYZ}(1,0,0)$	$\displaystyle=\frac{1}{4},$
$\displaystyle P^{XYZ}(1,0,1)$	$\displaystyle=\frac{1}{4}.$

We can easily check that $P_{n}^{XY}$ converges to $P^{XY}$ in Wasserstein- $p$ metric, since

	$\displaystyle P_{n}^{XY}(0,0)$	$\displaystyle=\frac{1}{2},$		(53)
	$\displaystyle P_{n}^{XY}(1,\frac{1}{n})$	$\displaystyle=\frac{1}{2},$		(53)

and

	$\displaystyle P^{XY}(0,0)$	$\displaystyle=\frac{1}{2},$		(54)
	$\displaystyle P^{XY}(1,0)$	$\displaystyle=\frac{1}{2},$		(54)

Similarly $P_{n}^{YZ}$ converges to $P^{YZ}$ in Wasserstein- $p$ metric. However, $P_{n}^{XYZ}$ does not converge to $P^{XYZ}$ in Wasserstein- $p$ metric: the support of $P_{n}^{XYZ}$ and $(0,0,1)$ always have a distance larger than 1.

References

[1] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[2] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
[3] Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3483–3491, 2018.
[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[5] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
[6] David L Donoho et al. High-dimensional data analysis: The curses and blessings of dimensionality. AMS math challenges lecture, 1(2000):32, 2000.
[7] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
[8] Luigi Ambrosio and Wilfrid Gangbo. Hamiltonian ODEs in the Wasserstein space of probability measures. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 61(1):18–53, 2008.
[9] Xuechen Li, Ting-Kam Leonard Wong, Ricky TQ Chen, and David Duvenaud. Scalable gradients for stochastic differential equations. arXiv preprint arXiv:2001.01328, 2020.
[10] Junteng Jia and Austin R Benson. Neural jump stochastic differential equations. In Advances in Neural Information Processing Systems, pages 9843–9854, 2019.
[11] Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. Neural SDE: Stabilizing neural ODE networks with stochastic noise. arXiv preprint arXiv:1906.02355, 2019.
[12] Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019.
[13] Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. arXiv preprint arXiv:1903.01608, 2019.
[14] Lars Ruthotto, Stanley J Osher, Wuchen Li, Levon Nurbekyan, and Samy Wu Fung. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proceedings of the National Academy of Sciences, 117(17):9183–9193, 2020.
[15] Liu Yang, Dongkun Zhang, and George Em Karniadakis. Physics-informed generative adversarial networks for stochastic differential equations. SIAM Journal on Scientific Computing, 42(1):A292–A317, 2020.
[16] Junyu Liu, Zichao Long, Ranran Wang, Jie Sun, and Bin Dong. RODE-Net: Learning ordinary differential equations with randomness from data. arXiv preprint arXiv:2006.02377, 2020.
[17] Ola Elerian, Siddhartha Chib, and Neil Shephard. Likelihood inference for discretely observed nonlinear diffusions. Econometrica, 69(4):959–993, 2001.
[18] Bjørn Eraker. MCMC analysis of diffusion models with application to finance. Journal of Business & Economic Statistics, 19(2):177–191, 2001.
[19] Simo Särkkä, Jouni Hartikainen, Isambi Sailon Mbalawata, and Heikki Haario. Posterior inference on parameters of stochastic differential equations via non-linear Gaussian filtering and adaptive MCMC. Statistics and Computing, 25(2):427–437, 2015.
[20] Cedric Archambeau, Dan Cornford, Manfred Opper, and John Shawe-Taylor. Gaussian process approximations of stochastic differential equations. Journal of Machine Learning Research, 1:1–16, 2007.
[21] Michail D Vrettas, Manfred Opper, and Dan Cornford. Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations. Physical Review E, 91(1):012148, 2015.
[22] Joseph Bakarji and Daniel M Tartakovsky. Data-driven discovery of coarse-grained equations. Journal of Computational Physics, page 110219, 2021.
[23] Zhiping Mao, Zhen Li, and George Em Karniadakis. Nonlocal flocking dynamics: Learning the fractional order of pdes from particle simulations. Communications on Applied Mathematics and Computation, 1(4):597–619, 2019.
[24] Christos N Mavridis, Amoolya Tirumalai, and John S Baras. Learning interaction dynamics from particle trajectories and density evolution. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 1014–1019. IEEE, 2020.
[25] Felipe Cucker and Steve Smale. Emergent behavior in flocks. IEEE Transactions on automatic control, 52(5):852–862, 2007.
[26] Filippo Santambrogio. $\{$ Euclidean, metric, and Wasserstein $\}$ gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154, 2017.
[27] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the International Conference on Machine Learning, volume 30, page 3, 2013.
[28] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[29] Mucong Ding, Constantinos Daskalakis, and Soheil Feizi. Subadditivity of probability divergences on Bayes-Nets with applications to time series GANs. CoRR, abs/2003.00652, 2020.
[30] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. In International Conference on Learning Representations, 2018.
[31] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
[32] Achim Klenke. Probability theory: a comprehensive course. Springer Science & Business Media, 2013.

	$\displaystyle\left\|g(x+\epsilon,y+\xi)-g(x,y)\right\|$	(29)
$\displaystyle=$	$\displaystyle\left\|\int f(x+\epsilon,y+\xi,z)dP_{y+\xi}(z)-\int f(x,y,z)dP_{y}(z)\right\|$
$\displaystyle\leq$	$\displaystyle\left\|\int f(x+\epsilon,y+\xi,z)dP_{y+\xi}(z)-\int f(x+\epsilon,y+\xi,z)dP_{y}(z)\right\|+$
	$\displaystyle\left\|\int f(x+\epsilon,y+\xi,z)dP_{y}(z)-\int f(x,y,z)dP_{y}(z)\right\|.$