This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Diffusion Methods for Generating Transition Paths thanks: This work is supported in part by National Science Foundation via award NSF DMS-2012286 and DMS-2309378.

Luke Triplett Duke University Jianfeng Lu Duke University
(September 18, 2023)
Abstract

In this work, we seek to simulate rare transitions between metastable states using score-based generative models. An efficient method for generating high-quality transition paths is valuable for the study of molecular systems since data is often difficult to obtain. We develop two novel methods for path generation in this paper: a chain-based approach and a midpoint-based approach. The first biases the original dynamics to facilitate transitions, while the second mirrors splitting techniques and breaks down the original transition into smaller transitions. Numerical results of generated transition paths for the Müller potential and for Alanine dipeptide demonstrate the effectiveness of these approaches in both the data-rich and data-scarce regimes.

1 Introduction

A challenge arises in the study of molecular dynamics when the behavior of a system is characterized by rare transitions between metastable states. Practically, these rare transitions mean that Monte Carlo simulations take a prohibitively long time even if enhanced sampling techniques are used.

A common way to understand the transition process is to sample transition paths from one metastable state to the other and use this data to estimate macroscopic properties of interest. Sampling through direct simulation is inefficient due to the high energy barrier, which has led to the exploration of alternative methods. Some notable methods include transition path sampling [BCDG02], biased sampling approaches [PS10], and milestoning [FE04]. Across different applications of rare event simulation, importance sampling and splitting methods are notable. The former biases the dynamics to reduce the variance of the sampling [Fle77], while the latter splits rare transitions into a series of higher probability steps [CGR19].

Broadly, generative models such as Generative Adversarial Models (GANs) and Variational Autoencoders (VAEs) have been developed to learn the underlying distribution of a dataset and can generate new samples that resemble the training data, with impressive performance. Among other tasks, generative models have been successfully used for image, text, and audio generation. Recently, VAEs have been used for transition path generation [LRS+23]. For this approach, the network learns a map from the transition path space to a smaller latent space (encoder) and the inverse map from the latent space back to the original space (decoder). The latent space is easier to sample from and can be used to generate low-cost samples by feeding latent space samples through the decoder.

Another promising method from machine learning frequently used in image-based applications is diffusion models or score-based generative models [SSDK+20] [CHIS23]. This paper will provide methods to generate transition paths under overdamped Langevin dynamics using score-based generative modeling. As with VAEs, diffusion models rely on a pair of forward and backward processes. The forward process of a diffusion model maps the initial data points to an easy-to-sample distribution through a series of noise-adding steps. The challenge is to recover the reverse process from the noisy distribution to the original distribution, which can be achieved by using score matching [HD05] [Vin11]. Having estimated the reverse process, we can generate samples from the noisy distribution at a low cost and transform them into samples from the target distribution.

The naive application of score-based generative modeling is not effective because of the high dimensionality of discretized paths. We introduce two methods to lower the dimensionality of the problem, motivated by techniques that have been used previously for simulating transition paths. The chain method, which is introduced in Section 4.1, updates the entire path together and relies on a decomposition of the score in which each path point only depends on adjacent path points. The midpoint method, which we outline in Section 4.2, generates path points separately across multiple iterations. It also uses a decomposition which gives the probability of the midpoint of the path conditioned on its two endpoints.

In Section 2, we give an overview of diffusion models and the reverse SDE sampling algorithm for general data distributions. In Section 3, we give background on transition paths and transition path theory, before discussing the two methods for generating transition paths in Section 4. Numerical results and a more detailed description of our algorithm are included in Section 5. The main contributions of this paper are to establish that diffusion-based methods are effective for generating transition paths and to propose a new construction for decomposing the probability of a transition path for dimension reduction.

2 Reverse SDE Diffusion Model

There are three broad categories commonly used for diffusion models. Denoising Diffusion Probabilistic Models (DDPM) [HJA20] noise the data via a discrete-time Markov process with the transition kernel given by P(xt|xt1)=𝒩(xt;1βtxt1,βtI)P(x_{t}|x_{t-1})=\mathcal{N}(x_{t};\sqrt{1-\beta_{t}}x_{t-1},\beta_{t}I), where the hyperparameters {βt}\{\beta_{t}\} determine the rate of noising. The reverse kernel of the process is then approximated with a neural network. The likelihood P(xt|xt1)P(x_{t}|x_{t-1}) can’t be calculated, so a variational lower bound for the negative log-likelihood is used. NCSM [SE19] uses a different forward process, which adds mean-0 Gaussian noise. NCSM uses annealed Langevin sampling for sample generation, where the potential function is an approximation of the time-dependent score function log(pt(x))\nabla\log(p_{t}(x)). The third approach models the forward and reverse processes as an SDE [SSDK+20]. The model used in this paper can be described in either the DDPM or the reverse SDE framework, but we will describe it in the context of reverse SDEs.

Let us first review the definitions of the forward and backward processes, in addition to describing the algorithms for score-matching and sampling from the reverse SDE. Reverse SDE diffusion models are a generalization of NCSM and DDPM. As for all generative modeling, we seek to draw samples from the unknown distribution pdatap_{data}, which we have samples from. In the case of transition paths, the data is in the form of a time series. Since diffusion models are commonly used in computer vision, this will often be image data. The forward process of a reverse SDE diffusion model maps samples from pdatap_{data} to a noisy distribution using a stochastic process xtx_{t} such that

dxt=f(xt,t)dt+g(t)dWt,t[0,T],dx_{t}=f(x_{t},t)dt+g(t)dW_{t},t\in[0,T], (1)

where WtW_{t} is standard Brownian motion. We denote the probability density of xtx_{t} as pt(x)p_{t}(x). In this paper, we will use an Ornstein-Uhlenbeck forward process with f(xt,t)=βxt,g(t)=1f(x_{t},t)=-\beta x_{t},g(t)=1. Then, xtx0N(x0eβt,12β(1e2βt)I)x_{t}\mid x_{0}\sim N(x_{0}e^{-\beta t},\frac{1}{2\beta}(1-e^{-2\beta t})I) and xtx0𝑑N(0,12βI)x_{t}\mid x_{0}\overset{d}{\longrightarrow}N(0,\frac{1}{2\beta}I) as tt\to\infty. This means that as long as we choose a large enough TT, we can start the reverse process at a standard normal distribution. The corresponding reverse process is described as follows:

dxt=(f(xt,t)g(t)2logpt(xt))dt~+g(t)dW~t,t[0,T],dx_{t}=(f(x_{t},t)-g(t)^{2}\nabla\log{p_{t}(x_{t})})d\tilde{t}+g(t)d\tilde{W}_{t},t\in[0,T], (2)

where time starts at TT and flows backward to 0, i.e., "dt~d\tilde{t}" is a negative time differential. Reversing the direction of time we can get the equivalent expression

dxt=(f(xt,Tt)+logpTt(xt)g(xt,Tt)2)dt+g(xt,Tt)dWt.dx_{t}=(-f(x_{t},T-t)+\nabla\log{p_{T-t}(x_{t})}g(x_{t},T-t)^{2})dt+g(x_{t},T-t)dW_{t}. (3)

logpt(x)\nabla\log{p_{t}(x)} is the score function of the noised distribution, which can’t be retrieved analytically. So, the computational task shifts from approximating the posterior p(x)p(x) directly, which is the target of energy-based generative models, to approximating the noised score logpt(x)\nabla\log{p_{t}(x)}. Posterior estimation of complex distributions is a well-studied and challenging problem in statistics. Modeling the score is often more tractable and does not require calculating the normalizing constant.

2.1 Approximating the Score Function

In reverse SDE denoising, we use a neural network to parameterize the time-dependent score function as sθ(x,t)s_{\theta}(x,t). Since the reverse SDE involves a time-dependent score function, the loss function is obtained by taking a time average of the distance between the true score and sθs_{\theta}. Discretizing the SDE with time steps 0=t0<t1<<tN=T0=t_{0}<t_{1}<...<t_{N}=T, we get the following loss function:

L(θ)=1Ni=1Nhti𝔼xtilogpti(xti)sθ(xti,ti)22.L(\theta)=\frac{1}{N}\sum_{i=1}^{N}h_{t_{i}}\mathbb{E}_{x_{t_{i}}}||\nabla\log p_{t_{i}}(x_{t_{i}})-s_{\theta}(x_{t_{i}},t_{i})||_{2}^{2}. (4)

Using the step size hti=titi1h_{t_{i}}=t_{i}-t_{i-1} is a natural choice for weighting, although it is possible to use different weights. The loss function can be expressed as an expectation over the joint probability of xti,x0x_{t_{i}},x_{0} as [Vin11]

L(θ)=1Ni=1Nhti𝔼x0𝔼xtix0||logpti(xtix0)sθ(xti,ti)||22.L(\theta)=\frac{1}{N}\sum_{i=1}^{N}h_{t_{i}}\mathbb{E}_{x_{0}}\mathbb{E}_{x_{t_{i}}\mid x_{0}}||\nabla\log p_{t_{i}}(x_{t_{i}}\mid x_{0})-s_{\theta}(x_{t_{i}},t_{i})||_{2}^{2}. (5)

We can calculate logpti(xtix0)\nabla\log p_{t_{i}}(x_{t_{i}}\mid x_{0}) based on the forward process (in the case of Ornstein-Uhlenbeck, pti(x0)p_{t_{i}}(\cdot\mid x_{0}) is a Gaussian centered at x0eβtx_{0}e^{-\beta t}). Remarkably, this depends only on sθs_{\theta}, tt, and the choice of forward SDE. A similar loss function is used for NCSM with a different noise-adding procedure. The training procedure follows from the above expression for the loss.

Data: {xi}i=1M,{(ti,hi)}i=1N,nb\{x_{i}\}_{i=1}^{M},\{(t_{i},h_{i})\}_{i=1}^{N},nb ;
  /* nb = num batches, hih_{i} = step size */
Result: sθ(x)s_{\theta}(x)
i0i\leftarrow 0;
j0j\leftarrow 0;
for i<nbi<nb do
       xbx_{b}\leftarrow random.choice(x,M/nbx,M/nb) ;
       for j<Nj<N do
             noiserandn_like(xb)noise\leftarrow randn\_like(x_{b}) ;
             σj12β(1e2β)\sigma_{j}\leftarrow\sqrt{\frac{1}{2\beta}(1-e^{-2\beta})} ;
             x~bxbeβt+σjnoise\tilde{x}_{b}\leftarrow x_{b}e^{-\beta t}+\sigma_{j}\cdot noise ;
             lossloss+hjsθ(x~b,σj)+xbeβtx~bσj222loss\leftarrow loss+h_{j}\cdot||s_{\theta}(\tilde{x}_{b},\sigma_{j})+\frac{x_{b}e^{-\beta t}-\tilde{x}_{b}}{\sigma_{j}^{2}}||_{2}^{2} ;
             jj+1j\leftarrow j+1;
            
       end for
      lossloss.backwards() ;
       ii+1i\leftarrow i+1;
      
end for
Algorithm 1 Learning the Score Function

2.2 Sampling from Reverse SDE

Once we have learned the time-dependent score, we can use it to sample from the original distribution using the reverse SDE. Substituting our parameterized score function sθs_{\theta} into (2), we get an approximation of the reverse process,

dx~t=(βxtsθ(xt,t))dt~+dW~t,t[0,T].d\tilde{x}_{t}=(-\beta x_{t}-s_{\theta}(x_{t},t))d\tilde{t}+d\tilde{W}_{t},t\in[0,T]. (6)

We want to evaluate sθs_{\theta} at the times at which it was trained, so we discretize the SDE from equation (2) using the sequence of times 0=t~0t~1t~N=T0=\tilde{t}_{0}\leq\tilde{t}_{1}\leq...\leq\tilde{t}_{N}=T, where t~k=TtNk\tilde{t}_{k}=T-t_{N-k}. In forward time (flowing from 0 to T), this discretization gives

dx~t=(βxt+sθ(xt~k,Tt~k))dt+dWt,t[t~k,t~k+1].d\tilde{x}_{t}=(\beta x_{t}+s_{\theta}(x_{\tilde{t}_{k}},T-\tilde{t}_{k}))dt+dW_{t},t\in[\tilde{t}_{k},\tilde{t}_{k+1}]. (7)

We solve for xtx_{t} using Itô’s formula to retain the continuous dynamics of the first term. Let z~t=eβtx~t\tilde{z}_{t}=e^{-\beta t}\tilde{x}_{t} and ηk\eta_{k} be a standard Gaussian random variable, then

dz~t=(βeβtx~t+(βeβtx~t+eβtsθ(x~t~k,Tt~k))dt+eβtdWt.d\tilde{z}_{t}=(-\beta e^{-\beta t}\tilde{x}_{t}+(\beta e^{-\beta t}\tilde{x}_{t}+e^{-\beta t}s_{\theta}(\tilde{x}_{\tilde{t}_{k}},T-\tilde{t}_{k}))dt+e^{-\beta t}dW_{t}. (8)

After solving for z~t~k\tilde{z}_{\tilde{t}_{k}}, we can get an explicit expression for x~t~k\tilde{x}_{\tilde{t}_{k}}, which is known as the exponential integrator scheme [ZC22].

x~t~k+1=e12(t~k+1t~k)x~t~k+2(e12(t~k+1t~k)1)sθ(x~t~k,Tt~k)+et~k+1t~k1ηk,\tilde{x}_{\tilde{t}_{k+1}}=e^{\frac{1}{2}(\tilde{t}_{k+1}-\tilde{t}_{k})}\tilde{x}_{\tilde{t}_{k}}+2(e^{\frac{1}{2}(\tilde{t}_{k+1}-\tilde{t}_{k})}-1)s_{\theta}(\tilde{x}_{\tilde{t}_{k}},T-\tilde{t}_{k})+\sqrt{e^{\tilde{t}_{k+1}-\tilde{t}_{k}}-1}\cdot\eta_{k}, (9)

where ηkN(0,Id)\eta_{k}\sim N(0,I_{d}). We denote the distribution of xtkx_{t_{k}} as qtkq_{t_{k}}. According to the literature [CLL23], using an exponentially decaying step size for the discretization points provides strong theoretical guarantees for the distance between qtNq_{t_{N}} and the true data distribution, which we will explore further in Section 4.3. For the forward process, this means that tktk1tk+1tk=M\frac{t_{k}-t_{k-1}}{t_{k+1}-t_{k}}=M. It is important to choose tmin,Mt_{min},M, and NN carefully, as they significantly affect the performance. More details about our implementation can be found in Section 5. Having estimated the score function, we can generate samples from a distribution that is close to pdatap_{data} by setting x~0pTN(0,12βI)\tilde{x}_{0}\sim p_{T}\approx N(0,\frac{1}{2\beta}I), then repeatedly applying (9) at the discretization points. A visualization of the distribution of x~\tilde{x} at different t~k\tilde{t}_{k} values under this procedure is shown in Figure 1.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Refer to caption
(d)
Figure 1: Time evolution of the distribution of samples using exponential integrator scheme from (9). In the leftmost figure at t=0t=0, the samples are from a Gaussian distribution centered at (0,0). In the rightmost figure at t=Tt=T, the samples are approximately from the distribution p(x)=eV(x)p(x)=e^{-V(x)}, where VV is the Müller potential.

3 Transition Paths

3.1 Overdamped Langevin Dynamics

Transitions between metastable states are crucial to understanding behavior in many chemical systems. Metastable states exist in potential energy basins and so transitions will often occur on a longer time scale than the random fluctuations within the system. Due to their infrequency, it is challenging to effectively describe these transitions. In the study of transition paths, it is often useful to model the governing system as an SDE. In this section, we will look at the overdamped Langevin equation, given by

dXt=V(Xt)dt+2B1dWt,X0=xd,dX_{t}=-\nabla V(X_{t})dt+\sqrt{2B^{-1}}dW_{t},\quad X_{0}=x\in\mathbb{R}^{d}, (10)

where V(x)V(x) is the potential of the system and WtW_{t} is d-dimensional Brownian motion. Let t=σ(Xs,0st)\mathcal{F}_{t}=\sigma({X_{s},0\leq s\leq t}) be the filtration generated by XtX_{t}. In chemical applications, B1B^{-1} is the temperature times the Boltzmann constant. For reasonable VV, the invariant probability distribution exists and is given by

p(x)=eβV(x)/Z,Z=ΩeβV(x)𝑑x.p(x)=e^{-\beta\nabla V(x)}/Z,\qquad Z=\int_{\Omega}e^{-\beta\nabla V(x)}dx. (11)

We can see an example of samples generated from the invariant distribution when VV is the Müller potential in the above figure.

Consider two metastable states represented by closed regions A,BnA,B\subset\mathbb{R}^{n}. Their respective boundaries are A,B\partial A,\partial B. The paths that go from the boundary of AA to the boundary of BB without returning to AA are called transition paths [EVE10]. This means that a realization {Xs}s=0T\{X_{s}\}_{s=0}^{T} is a transition path if X0A,XTBX_{0}\in\partial A,X_{T}\in\partial B, and XsAB,s(0,T)X_{s}\notin A\cup B,\ \forall\ s\in(0,T). The distribution of transition paths is the target distribution of the generative procedure in our paper. A similar problem involves trajectories with a fixed time interval and is known as a bridge process. Much of the analysis in this paper can be extended to the fixed time case by removing the conditioning on path time during training and generation.

An important function for the study of transition paths is the committor function, which is defined by the following boundary value problem [EVE10], [LN15]

{LPq(x)=0if xABq(x)=0if xAq(x)=1if xB,\begin{cases}L^{P}q(x)=0&\text{if }x\notin A\cup B\\ q(x)=0&\text{if }x\in A\\ q(x)=1&\text{if }x\in B\end{cases}, (12)

where q(x)q(x) is the committor and LPL^{P} is the generator for (10), defined as LPf=B1ΔfVfL^{P}f=B^{-1}\Delta f-\nabla V\nabla f.

3.2 Distribution of Transition Paths

We will now examine the distribution of transition paths as in [LN15]. Consider stopping times of the process XtX_{t} with respect to t\mathcal{F}_{t}:

τA(X)=inf{st:sA},\displaystyle\tau^{(X)}_{A}=\inf\{s\geq t:s\in A\}, (13)
τB(X)=inf{st:sB}.\displaystyle\tau^{(X)}_{B}=\inf\{s\geq t:s\in B\}. (14)

Let EE be the event that τB(X)<τA(X)\tau^{(X)}_{B}<\tau^{(X)}_{A}. We are only looking at choices of V(x)V(x) such that P(τA(X)<)=1P(\tau^{(X)}_{A}<\infty)=1 and P(τB(X)<)=1P(\tau^{(X)}_{B}<\infty)=1. It follows from Itô’s formula that q(x)=P(EX0=x)q(x)=P(E\mid X_{0}=x) is a solution to (12). Thus, the committor gives the probability that a path starting from a particular point reaches region BB before region AA.

Consider the process Zt=XtτAτBZ_{t}=X_{t\wedge\tau_{A}\wedge\tau_{B}}, the corresponding measure x\mathbb{P}_{x}, and the stopping times τA(Z),τB(Z)\tau_{A}^{(Z)},\tau_{B}^{(Z)}. Since the paths we are generating terminate after reaching BB, we will work with the ZZ process, though it is possible to use the original XX process as well. We define EE^{*} as the event that τB(Z)<\tau_{B}^{(Z)}<\infty. The function defined as q(x)=P(EZ0=x)q(x)=P(E^{*}\mid Z_{0}=x) is equivalent to the committor for the XX process. We are interested in paths drawn from the measure x=x(E)\mathbb{Q}_{x}=\mathbb{P}_{x}(\cdot\mid E^{*}) on transition paths. The Radon-Nikodym derivative is given by

dxdx=𝕀Eq(x).\frac{d\mathbb{Q}_{x}}{d\mathbb{P}_{x}}=\frac{\mathbb{I}_{E}^{*}}{q(x)}. (15)

Suppose that we have a path {Zs}s=0\{Z_{s}\}_{s=0}^{\infty} starting at xx and ending in BB. Eq. (15) states that the relative likelihood of {Zs}\{Z_{s}\} under x\mathbb{Q}_{x} compared to x\mathbb{P}_{x} increases as q(x)q(x) decreases. This follows our intuition, since as q(x)q(x) decreases, a higher proportion of paths starting from xx from the original measure will end in AA rather than BB.

Using Doob’s h-Transform [Day92], we have that

P(Zt+s=yZt=x,E)=P(Zt+s=yZt=x)q(y)q(x)\displaystyle P(Z_{t+s}=y\mid Z_{t}=x,E^{*})=P(Z_{t+s}=y\mid Z_{t}=x)\frac{q(y)}{q(x)}
\displaystyle\implies LQf=1qLP(qf)\displaystyle L^{Q}f=\frac{1}{q}L^{P}(qf)
\displaystyle\implies LQf=LPf+2B1qqf,\displaystyle L^{Q}f=L^{P}f+\frac{2B^{-1}\nabla q}{q}\cdot\nabla f,

and the law of YtY_{t} given by the SDE

dYt=(V(Yt)+2B1q(Yt)q(Yt))dt+2B1dwtdY_{t}=(-\nabla V(Y_{t})+2B^{-1}\frac{\nabla q(Y_{t})}{q(Y_{t})})dt+\sqrt{2B^{-1}}dw_{t} (16)

is equivalent to the law of transition paths. Thus, conditioning on transition paths is equivalent to adding a drift term to (10).

4 Generating Transition Paths with Diffusion Models

We can generate transition paths by taking sections of an Euler-Maruyama simulation such that the first point is in basin A, the last point is in basin B, and all other points are in neither basin. This discretizes the definition of transition paths discussed earlier. We will represent transition paths as x={xi}i=1m,xidx=\{x_{i}\}_{i=1}^{m},x_{i}\in\mathbb{R}^{d}, where mm is the number of points in the path. We will denote the duration of a particular path as TT^{*}. This should not be confused with TT, which represents the duration of the diffusion process for sample generation. It is convenient to standardize the paths so that they contain the same number of points, and there is an equal time between all subsequent points in a single path.

4.1 Chain Reverse SDE Denoising

In this section, we introduce chain reverse SDE denoising, which learns the gradient for each point in the path separately by conditioning on the previous point. Specifically, we will use xnlogpt(x(t))\frac{\partial}{\partial x_{n}}\log p_{t}(x(t)) to represent the component of score function which corresponds to xnx_{n} and s(x(t),t)(n)s^{*}(x(t),t)^{(n)} as the corresponding neural network approximation. Unlike in the next section, here we are approximating the joint probability of the entire path (that is, sθ(x,t)pt(x(t))s^{*}_{\theta}(x,t)\approx p_{t}(x(t))) and split it into a product of conditional distributions involving neighboring points as described in Figure 2. This is similar to the approach used in [HDBDT21], but without fixing time. Our new loss function significantly reduces the dimension of the neural network optimization problem:

Refer to caption
Figure 2: Chain procedure for generating paths. Starting with a path that is pure noise (colored yellow), path points are updated together in each step and move towards a higher probability path (colored red). The movement of each point in a single step depends only on the position of its neighbors.
L(θ)=1Ni=1Nhti𝔼x(0)𝔼x(ti)x(0)j=1mxjlogpti(x(ti)x(0),T)sθ(x(ti),ti,T)(j)22,\begin{split}L(\theta)=&\frac{1}{N}\sum_{i=1}^{N}h_{t_{i}}\mathbb{E}_{x(0)}\mathbb{E}_{x(t_{i})\mid x(0)}\Bigl{\lVert}\sum_{j=1}^{m}\frac{\partial}{\partial x_{j}}\log p_{t_{i}}(x(t_{i})\mid x(0),T^{*})-s^{*}_{\theta}(x(t_{i}),t_{i},T^{*})^{(j)}\Bigr{\rVert}_{2}^{2},\end{split} (17)

where xjx_{j} now denotes the jjth point in the path. This is a slight change in notation from the previous section, where the subscript was the time in the generation process. The time flow of generation is now represented by x(t)x(t). We can take advantage of the Markov property of transition paths to get a simplified expression for the sub-score for interior points,

p(x)=p(x1)ip(xixi1)logp(x)=logp(xi)+ilogp(xixi1)xnp(x)=xnlogp(xnxn1)+xnlogp(xn+1xn).\begin{split}&p(x)=p(x_{1})\prod_{i}p(x_{i}\mid x_{i-1})\\ \implies&\log p(x)=\log p(x_{i})+\sum_{i}\log p(x_{i}\mid x_{i-1})\\ \implies&\frac{\partial}{\partial x_{n}}p(x)=\frac{\partial}{\partial x_{n}}\log p(x_{n}\mid x_{n-1})+\frac{\partial}{\partial x_{n}}\log p(x_{n+1}\mid x_{n}).\end{split} (18)

It follows that we need two networks, sθ1(xn(t),xn1(t),n,t,T)s_{\theta_{1}}(x_{n}(t),x_{n-1}(t),n,t,T^{*}) and sθ2(xn(t),xn+1(t),n,t,T)s_{\theta_{2}}(x_{n}(t),x_{n+1}(t),n,t,T^{*}). The first to approximate xnlogpt(xnxn1,T)\frac{\partial}{\partial x_{n}}\log p_{t}(x_{n}\mid x_{n-1},T^{*}) and the second for xnlogpt(xn+1xn,T)\frac{\partial}{\partial x_{n}}\log p_{t}(x_{n+1}\mid x_{n},T^{*}). The first and last points of the path require slightly different treatment. From the same decomposition of the joint distribution, we have that

x1p(x)\displaystyle\frac{\partial}{\partial x_{1}}p(x) =logp(x1)+x1logp(x2x1),\displaystyle=\nabla\log p(x_{1})+\frac{\partial}{\partial x_{1}}\log p(x_{2}\mid x_{1}), (19)
xmp(x)\displaystyle\frac{\partial}{\partial x_{m}}p(x) =xmlogp(xmxm1).\displaystyle=\frac{\partial}{\partial x_{m}}\log p(x_{m}\mid x_{m-1}). (20)

Then, the entire description of the sub-score functions is given by

sθ(x,t)(n)={sθ3(x1,t,T)+sθ2(x1,x2,1,t,T)if n=1sθ1(xm,xm1,m,t,T)if n=msθ1(xn,xn1,n,t,T)+sθ2(xn,xn+1,n,t,T)otherwise,s_{\theta}^{*}(x,t)^{(n)}=\begin{cases}s_{\theta_{3}}(x_{1},t,T^{*})+s_{\theta_{2}}(x_{1},x_{2},1,t,T^{*})&\text{if }n=1\\ s_{\theta_{1}}(x_{m},x_{m-1},m,t,T^{*})&\text{if }n=m\\ s_{\theta_{1}}(x_{n},x_{n-1},n,t,T^{*})+s_{\theta_{2}}(x_{n},x_{n+1},n,t,T^{*})&\text{otherwise}\end{cases}, (21)

where sθ3(x1,t,T)s_{\theta_{3}}(x_{1},t,T^{*}) is a third network to learn the distribution of initial points. As outlined in Section 2.2, we use a discretized version of the reverse SDE to generate transition paths.

4.2 Midpoint Reverse SDE Denoising

While the previous approach can generate paths well, there are drawbacks. It requires learning O(2m)O(2m) score functions since the score for a single point is affected by its index along the path. There is a large degree of correlation for parts of the path that are close together, but it is still not an easy computational task. Additionally, all points are updated simultaneously during generation, so the samples will start in a lower data density region. This motivates using a midpoint approach, which is outlined in Figure 3.

Refer to caption
Figure 3: Midpoint procedure for generating paths. The two endpoints are generated independently at the start of the procedure. Then, the path’s interior is constructed incrementally via splitting. In each iteration, a new point is generated between every adjacent pair of existing points, making the path more refined.

We train a score model that depends on both endpoints to learn the distribution of the point equidistant in time to each. This corresponds to the following decomposition of the joint density. Let us say that there are 2k+12^{k}+1 points in each path. We start with the case in which the path is represented by a discrete Markov process x={xj}j=12k+1x=\{x_{j}\}_{j=1}^{2^{k}+1}.

Lemma 4.1.

Define f(i)=max{k:(i1)mod2k=0}f(i)=\max\{k\in\mathbb{N}:(i-1)\mod 2^{k}=0\}, then

p(x)=p(x1)p(x2k+1x1)i=22kp(xixif(i),xi+f(i)).p(x)=p(x_{1})p(x_{2^{k}+1}\mid x_{1})\prod_{i=2}^{2^{k}}p(x_{i}\mid x_{i-f(i)},x_{i+f(i)}).
Proof.

First, we want to show that p(x2,,x2kx1,x2k+1)=i=22kp(xixif(i),xi+f(i))p(x_{2},...,x_{2^{k}}\mid x_{1},x_{2^{k}+1})=\prod_{i=2}^{2^{k}}p(x_{i}\mid x_{i-f(i)},x_{i+f(i)}). We proceed by induction. The inductive hypothesis is that p(x2,,x2k1x1,x2k1+1)=i=22k1p(xixif(i),xi+f(i))p(x_{2},...,x_{2^{k-1}}\mid x_{1},x_{2^{k-1}+1})=\prod_{i=2}^{2^{k-1}}p(x_{i}\mid x_{i-f(i)},x_{i+f(i)}).

p(x2,,x2kx1,x2k+1)=p(x2,,x2k1+1x1,x2k+1)p(x2k1+1,,x2kx1,x2,,x2k1+1,x2k+1)=p(x2,,x2k1+1x1,x2k+1)p(x2k1+1,,x2kx2k1+1,x2k+1)=i=22kp(xixif(i),xi+f(i)).\begin{split}p(x_{2},...,x_{2^{k}}\mid x_{1},x_{2^{k}+1})&=p(x_{2},...,x_{2^{k-1}+1}\mid x_{1},x_{2^{k}+1})p(x_{2^{k-1}+1},...,x_{2^{k}}\mid x_{1},x_{2},...,x_{2^{k-1}+1},x_{2^{k}+1})\\ &=p(x_{2},...,x_{2^{k-1}+1}\mid x_{1},x_{2^{k}+1})p(x_{2^{k-1}+1},...,x_{2^{k}}\mid x_{2^{k-1}+1},x_{2^{k}+1})\\ &=\prod_{i=2}^{2^{k}}p(x_{i}\mid x_{i-f(i)},x_{i+f(i)}).\qed\end{split}

Based on this, we can parameterize the score function by

sθ(xn,xn1,xn2,ndmT,t)logpt(xnxn1,xn2),s_{\theta}(x_{n^{*}},x_{n_{1}},x_{n_{2}},\frac{n_{d}}{m}T^{*},t)\approx\nabla\log p_{t}(x_{n^{*}}\mid x_{n_{1}},x_{n_{2}}), (22)

where ns=n1+n22,nd=n2n1n_{s}=\frac{n_{1}+n_{2}}{2},n_{d}=n_{2}-n_{1}. It is worth mentioning that without further knowledge of the transition path process, we would need to include n1n_{1} as a parameter for sθs_{\theta}. However, we can see from (16) that transition paths are a time-homogeneous process, which allows this simplification. Then,

p(xnsxn1,xn2)\displaystyle p(x_{n_{s}}\mid x_{n_{1}},x_{n_{2}}) =p(xnsxn1)p(xn2xns)p(xn2xn1)\displaystyle=\frac{p^{*}(x_{n_{s}}\mid x_{n_{1}})p^{*}(x_{n_{2}}\mid x_{n_{s}})}{p^{*}(x_{n_{2}}\mid x_{n_{1}})} (23)
=f(xns,xn1,nd2)f(xns,xn1,nd2)f(xn2,xn1,nd)\displaystyle=\frac{f(x_{n_{s}},x_{n_{1}},\frac{n_{d}}{2})f(x_{n_{s}},x_{n_{1}},\frac{n_{d}}{2})}{f(x_{n_{2}},x_{n_{1}},n_{d})} (24)
=g(xn1,xns,xn2,nd),\displaystyle=g(x_{n_{1}},x_{n_{s}},x_{n_{2}},n_{d}), (25)

where pp^{*} is the transition kernel for transition paths and f,gf,g are general functions designed to show the parameter dependence of p(xnxn1,xn2)p(x_{n^{*}}\mid x_{n_{1}},x_{n_{2}}). The training stage is similar to that described in the previous section. For an interior point xix_{i}, the corresponding score matching term is sθ(xi,xif(i),xi+f(i),2idmT,t)s_{\theta}(x_{i},x_{i-f(i)},x_{i+f(i)},\frac{2i_{d}}{m}T^{*},t). For the endpoints, we have sθ2(x0T)s_{\theta_{2}}(x_{0}\mid T^{*}) and sθ3(x2kx0,T)s_{\theta_{3}}(x_{2_{k}}\mid x_{0},T^{*}). It is algorithmically easier to use 2k+12^{k}+1 discretization points for the paths, but we can generalize to kk points. Starting with the same approach as before, we will eventually end up with a term that can not be split by a midpoint if we want the correct number of points. Specifically, this occurs when we want to generate nn interior points for some even nn. In this case, we can use

p(xm1+1,xm1+2,xm21xm1,xm2)=p(xm)p(xm1+1,,xm1xm1,xm)p(xm+1,,xm21xm,xm2),p(x_{m_{1}+1},x_{m_{1}+2},...x_{m_{2}-1}\mid x_{m_{1}},x_{m_{2}})\\ =p(x_{m^{*}})p(x_{m_{1}+1},...,x_{m^{*}-1}\mid x_{m_{1}},x_{m^{*}})p(x_{m^{*}+1},...,x_{m_{2}-1}\mid x_{m^{*}},x_{m_{2}}), (26)

where m=m1+m2+12m^{*}=\frac{m_{1}+m_{2}+1}{2} or m1+m212\frac{m_{1}+m_{2}-1}{2}. The score function will require an additional time parameter now that the midpoint is not exactly in between the endpoints. A natural choice is to use

sθ(xi,xii,xi+i,2imT,tshift,t),s_{\theta}(x_{i},x_{i-i^{*}},x_{i+i^{*}},\frac{2i^{*}}{m}T^{*},t_{\text{shift}},t),

where

tshift={0if midpoint is centeredT2motherwise.t_{\text{shift}}=\begin{cases}0&\text{if midpoint is centered}\\ \frac{T^{*}}{2m}&\text{otherwise}\end{cases}. (27)

A similar adaptation can be made when the points along the path are not evenly spaced. It remains to learn P(T)P(T^{*}), which is a simple, one-dimensional problem. We can then use times drawn from the learned distribution as the seed for generating paths.

We can extend the result from Lemma (4.1) to the continuous case. Consider a stochastic process x(t)x(t) as in (1) and arbitrary times t1,,tnt_{1},...,t_{n}.

Corollary 4.2.

Define f(i)=max{k:(i1)mod2k=0}f(i)=\max\{k\in\mathbb{N}:(i-1)\mod 2^{k}=0\}

p(x(t1),,x(tn))=p(x(t1))p(x(tn)x(t1))i=22kp(x(ti)x(ti+f(i)),x(ti+f(i))).p(x(t_{1}),...,x(t_{n}))=p(x(t_{1}))p(x(t_{n})\mid x(t_{1}))\prod_{i=2}^{2^{k}}p(x(t_{i})\mid x(t_{i+f(i)}),x(t_{i+f(i)})).

The proof follows by the same induction as in the previous case, since the Markov property still holds. This technique can be extended to problems similar to transition path generation, such as other types of conditional trajectories. It is also possible to apply the approach of non-simultaneous generation to the chain method that we described in the previous section.

4.3 Convergence Guarantees for Score-Matching

We seek to obtain a convergence guarantee for generating paths from the reverse SDE using the approach outlined in Section 2.2. There has been previous work on convergence guarantees for general data distributions by [LLT23], [CLL23], [DB22], [CCL+22]. In particular, we can make guarantees for the KL divergence or TV distance between pdatap_{data} and the distribution of generated samples given an L2L_{2} error bound on the score estimation. Recent results [BDBDD23] show that a bound with linear dd-dependence is possible for the number of discretization steps required to achieve KL(pdata||qtN)=O~(ϵ02)\mathrm{KL}(p_{data}||q_{t_{N}})=\tilde{O}(\epsilon_{0}^{2}).

Assumption 1.

The error in the score estimate at the selected discretization points t1,t2,,tkt_{1},t_{2},...,t_{k} is bounded:

1Tk=1Nhk𝔼ptksθ(x,tk)logptk2ϵ02.\frac{1}{T}\sum_{k=1}^{N}h_{k}\mathbb{E}_{p_{t_{k}}}||s_{\theta}(x,t_{k})-\nabla\log p_{t_{k}}||^{2}\leq\epsilon_{0}^{2}. (28)
Assumption 2.

The data distribution has a bounded second moment

𝔼pdatax2<.\mathbb{E}_{p_{data}}||x||^{2}<\infty. (29)

With these assumptions on the accuracy of the score network and the second moment of pdatap_{data}, we can establish bounds for the KL-divergence between pdatap_{data} and qtNq_{t_{N}}. Using the exponential integrator scheme from Section 2.2, the following theorem from [BDBDD23] holds:

Theorem 4.3.

Suppose that Assumptions 1 and 2 hold. If we choose T=12logdϵ02T=\frac{1}{2}\log\frac{d}{\epsilon_{0}^{2}} and N=Θ(d(T+log(1δ))2ϵ02)N=\Theta(\frac{d(T+\log(\frac{1}{\delta}))^{2}}{\epsilon_{0}^{2}}), then there exists a choice of MM from 2.2 such that KL(pdata||qtN)=O~(ϵ02)\mathrm{KL}(p_{data}||q_{t_{N}})=\tilde{O}(\epsilon_{0}^{2}).

It is important to keep in mind that the KL-divergence between pdatap_{data} and qtNq_{t_{N}} does not entirely reflect the quality of the generated samples. In a practical setting, there also must be sufficient differences between the initial samples and the output. Otherwise, the model is performing the trivial task of generating data that is the same or very similar to the input data.

5 Results

5.1 Müller Potential

To exhibit the effectiveness of our algorithms, we look at the overdamped Langevin equation from (10) and choose VV as the two-welled Müller potential in 2\mathbb{R}^{2} defined by

V(x)=i=14Di exp(ai(Xix1)2+bi(Xix1)(Yix2)+ci(Yix2)2).V(x)=\sum_{i=1}^{4}D_{i}\text{ exp}(a_{i}(X_{i}-x_{1})^{2}+b_{i}(X_{i}-x_{1})(Y_{i}-x_{2})+c_{i}(Y_{i}-x_{2})^{2}). (30)

We used the following parameters, as used in [KLY19] [LLR19]

a=[1,1,6.5,0.7],\displaystyle a=[-1,-1,-6.5,0.7], (5.2)
b=[0,0,11,0.6],\displaystyle b=[0,0,11,0.6],
c=[10,10,6.5,0.7],\displaystyle c=[-10,-10,-6.5,0.7],
d=[200,100,170,15].\displaystyle d=[-200,-100,-170,15].

We define regions A and B as circles with a radius of 0.1 centered around the minima at (0.62, 0.03) and (-0.56, 1.44) respectively. This creates a landscape with two major wells at A and B, a smaller minimum between them, and a potential that quickly goes to infinity outside the region Ω:=[1.5,1.5]×[0.5,2]\Omega:=[-1.5,1.5]\times[-0.5,2]. We used B1=102B^{-1}=10\sqrt{2} for this experiment, which is considered a moderate temperature.

For each of the score functions learned, we used a network architecture consisting of 6 fully connected layers with 20, 400, 400, 200, 20, and 2 neurons, excluding the input layer. The input size of the first layer was 7 (2 data points, n,Tn,T^{*} and tt) for the chain method and 8 for the midpoint method (3 data points, TT^{*} and tt). We used a hyperbolic tangent (tanh) for the first layer and leaky ReLU for the rest of the layers. The weights were initialized using the Xavier uniform initialization method. All models were trained for 200 epochs with an initial learning rate of 0.015 and a batch size of 600. For longer paths, we used a smaller batch size and learning rate because of memory limitations. The training set was about 20,000 paths, so this example is in the data-rich regime. All noise levels were trained for every batch. We used the (9) to discretize the reverse process. We found that tmin=0.005t_{min}=0.005 and T=7T=7 with 100 discretization points gave strong empirical results. For numerical reasons, we used a maximum step size of 1. In particular, this restriction prevents the initial step from being disproportionately large, which can lead to overshooting.

Refer to caption
Figure 4: Sample paths generated using midpoint method. The paths display consistent qualitative behavior characterized by random fluctuations that remain within low energy levels. The region for minimum A is shown with a red circle and the region for minimum B is shown with a blue circle.

The paths generated using either method (sample shown in Figure 4) closely resemble the training paths. To evaluate the generated samples numerically, we can calculate the relative entropy of the generated paths compared to the training paths or the out-of-sample paths (Tables 5a, 5b). We converted collections of paths into distributions using the scipy gaussian_kde function.

# Discretization Points Output and Training Output and OOS Training and OOS
9 0.018 0.068 0.031
17 .020 0.078 0.029
33 0.032 0.109 0.035
(a) Relative entropy comparison of NN generated transition paths with training and out of sample paths using midpoint method
# Discretization Points Output and Training Output and OOS Training and OOS
9 0.012 0.056 0.031
17 0.016 0.065 0.029
33 0.025 0.035 0.081
(b) Relative entropy comparison of NN generated transition paths with training and out of sample paths using chain method

We can also test the quality of our samples by looking at statistics of the generated paths. For example, we can look at the distribution of points on the path. From transition path theory [LN15], we know that p(x)p(x)q(x)(1q(x))p^{*}(x)\propto p(x)q(x)(1-q(x)), where pp is the original probability density of the system, pp^{*} is the density of points along transition paths, and qq is the committor. This gives a 2-dimensional distribution as opposed to a 2m2m dimensional one since each of the path points is treated independently. Here, the three relevant distributions are the training, output, and ground truth distributions.

Refer to caption
(c) Difference Plot
Refer to caption
(d) Training Density
Figure 5: Comparison of the density path points between training and output samples. The density of path points for the training sample is shown on the right, while the difference between the densities is shown on the left. The maximum relative difference is around 20%20\%.

We include the average absolute difference between the three relevant distributions at 2500 grid points in Tables 6a, 6b.

# Discretization Points Output and Training Output and True Training and True
9 0.027 0.055 0.034
17 0.031 0.060 0.034
33 0.039 0.072 0.037
(a) Average absolute difference comparison of the distribution of points on transition paths using midpoint method
# Discretization Points Output and Training True and Training Output and True
9 0.022 0.051 0.034
17 0.027 0.056 0.034
33 0.034 0.062 0.037
(b) Average absolute difference comparison of the distribution of points on transition paths using chain method

The chain method slightly outperforms the midpoint method according to this metric. Qualitatively, the paths generated using the two methods look similar. There is a strong correlation between the error and the training time, but a limited correlation with the size of the training dataset after reaching a certain size. This is promising for potential applications, as the data requirements are not immediately prohibitive.

Refer to caption
Figure 6: Comparison of average probability density difference between the 20,000 path dataset and the generated paths for different sizes of the training subset. When the training scheme is kept constant, but the size of the training set is smaller, the model’s accuracy drops drastically. On the other hand, when the number of epochs was adjusted accordingly, the accuracy did not vary significantly until a sample size of around 500.

The strong dependence on training time, which is shown in Figure 6, supports the notion that the numerical results of this paper could be improved with longer training or a larger model. We further explore performance in the data-scarce case in the following section.

5.2 Alanine Dipeptide

For our second numerical experiment, we generate transitions between stable conformers of Alanine dipeptide. The transition pathways [JW06] and free-energy profile [VV10] of Alanine dipeptide have been studied in the molecular dynamics literature and are shown in Figure 7. This system is commonly used to model dihedral angles in proteins. We use the dihedral angles, ϕ\phi and ψ\psi (shifted for visualization purposes), as collective variables.

The two stable states that we consider are the lower density state around ψ=0.6 rad,ϕ=0.9 rad\psi=0.6\text{ rad},\phi=0.9\text{ rad} (minimum A), and the higher density state around ψ=2.7 rad,ϕ=1.4 rad\psi=2.7\text{ rad},\phi=-1.4\text{ rad} (minimum B). The paths are less regular than with Müller potential because the reactive paths "jump" from values close to π\pi and π-\pi. We use 22 reactive trajectories as training data for the score network. The data include 21 samples that follow reactive pathway 1 and one sample that follows reactive pathway 2. We use the same network architecture as used for the midpoint approach in the previous section. We train the model for 2000 epochs with an initial learning rate of 0.0015.

Because the data are angles, the problem moves from \mathbb{R} to the surface of a torus. We wrap around points outside of the interval [π,π][-\pi,\pi] during the generation process to adjust for the new topology. This approach was more effective than encoding periodicity into the neural network. For this problem, the drift term for the forward process, 12x-\frac{1}{2}x, is no longer continuous, though this did not cause numerical issues. It may be useful to use the more well-behaved process dΩt=sin(Ωt)dt+2dWtd\Omega_{t}=\sin(-\Omega_{t})dt+\sqrt{2}dW_{t} applied to problems with this geometry.

Refer to caption
Figure 7: Alanine Dipeptide probability density and reactive pathways. The trajectories are highly concentrated around the bottom minimum. Transitions through the first reactive pathway (RP1) occur much more frequently than transitions through the second reactive pathway (RP2).
Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c) Training Paths
Refer to caption
(d) Generated Paths
Figure 8: Comparison of original paths with generated paths using dihedral angles as reactive coordinates. Pictured on the top are a few paths from the training set (left) and generated using the midpoint method (right). On the bottom, the densities of points on the paths are shown for both cases.

We can generate paths across both reactive channels. In some trials, paths are generated that follow neither channel, instead going straight from the bottom left basin to minimum A. This occurred with similar frequency to paths across RP2. It is more challenging to determine the quality of the generated paths when there is minimal initial data, but it is encouraging that the qualitative shape of the paths is similar (Figure 8). Specifically, the trajectories tend to stay around the minima for a long time, and the transition time between them is short. For this experiment, the paths generated using the midpoint method were qualitatively closer to the original distribution. In particular, disproportionately large jumps between adjacent points occurred more frequently using the chain method, though both methods were able to generate representative paths.

References

  • [BCDG02] Peter G Bolhuis, David Chandler, Christoph Dellago, and Phillip L Geissler. Transition path sampling: Throwing ropes over rough mountain passes, in the dark. Annual review of physical chemistry, 53(1):291–318, 2002.
  • [BDBDD23] Joe Benton, Valentin De Bortoli, Arnaud Doucet, and George Deligiannidis. Linear Convergence Bounds for Diffusion Models via Stochastic Localization. arXiv preprint arXiv:2308.03686, 2023.
  • [CCL+22] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022.
  • [CGR19] Frédéric Cérou, Arnaud Guyader, and Mathias Rousset. Adaptive multilevel splitting: Historical perspective and recent results. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(4), 2019.
  • [CHIS23] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [CLL23] Hongrui Chen, Holden Lee, and Jianfeng Lu. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pages 4735–4763. PMLR, 2023.
  • [Day92] Martin V Day. Conditional exits for small noise diffusions with characteristic boundary. The Annals of Probability, pages 1385–1419, 1992.
  • [DB22] Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. arXiv preprint arXiv:2208.05314, 2022.
  • [EVE10] Weinan E and Eric Vanden-Eijnden. Transition-path theory and path-finding algorithms for the study of rare events. Annual review of physical chemistry, 61:391–420, 2010.
  • [FE04] Anton K Faradjian and Ron Elber. Computing time scales from reaction coordinates by milestoning. The Journal of chemical physics, 120(23):10880–10889, 2004.
  • [Fle77] Wendell H Fleming. Exit probabilities and optimal stochastic control. Applied Mathematics and Optimization, 4:329–346, 1977.
  • [HD05] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
  • [HDBDT21] Jeremy Heng, Valentin De Bortoli, Arnaud Doucet, and James Thornton. Simulating diffusion bridges with score matching. arXiv preprint arXiv:2111.07243, 2021.
  • [HJA20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [JW06] Hyunbum Jang and Thomas B Woolf. Multiple pathways in conformational transitions of the alanine dipeptide: an application of dynamic importance sampling. Journal of computational chemistry, 27(11):1136–1141, 2006.
  • [KLY19] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving for high-dimensional committor functions using artificial neural networks. Research in the Mathematical Sciences, 6:1–13, 2019.
  • [LLR19] Qianxiao Li, Bo Lin, and Weiqing Ren. Computing committor functions for the study of rare events using deep learning. The Journal of Chemical Physics, 151(5), 2019.
  • [LLT23] Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. In International Conference on Algorithmic Learning Theory, pages 946–985. PMLR, 2023.
  • [LN15] Jianfeng Lu and James Nolen. Reactive trajectories and the transition path process. Probability Theory and Related Fields, 161(1-2):195–244, 2015.
  • [LRS+23] Tony Lelièvre, Geneviève Robin, Innas Sekkat, Gabriel Stoltz, and Gabriel Victorino Cardoso. Generative methods for sampling transition paths in molecular dynamics. ESAIM: Proceedings and Surveys, 73:238–256, 2023.
  • [PS10] F Pinski and A Stuart. Transition paths in molecules: gradient descent in pathspace. J. Chem. Phys, 132:184104, 2010.
  • [SE19] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • [SSDK+20] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • [Vin11] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • [VV10] Jiri Vymetal and Jiri Vondrasek. Metadynamics As a Tool for Mapping the Conformational and Free-Energy Space of Peptides: The Alanine Dipeptide Case Study. The Journal of Physical Chemistry B, 114(16):5632–5642, 2010.
  • [ZC22] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902, 2022.