
Energetic Variational Neural Network Discretizations of Gradient Flows

Ziqing Hu (Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, 102G Crowley Hall, Notre Dame, IN 46556 USA, [email protected])    Chun Liu (Department of Applied Mathematics, Illinois Institute of Technology, Chicago, IL 60616, USA, [email protected])    Yiwei Wang (Corresponding author. Department of Mathematics, University of California, Riverside, Riverside, CA 92521, USA, [email protected])    Zhiliang Xu (Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, 102G Crowley Hall, Notre Dame, IN 46556 USA, [email protected])
Abstract

We present a structure-preserving Eulerian algorithm for solving $L^{2}$-gradient flows and a structure-preserving Lagrangian algorithm for solving generalized diffusions. Both algorithms employ neural networks as tools for spatial discretization. Unlike most existing methods that construct numerical discretizations based on the strong or weak form of the underlying PDE, the proposed schemes are constructed based on the energy-dissipation law directly. This guarantees the monotonic decay of the system’s free energy, which avoids unphysical states of solutions and is crucial for the long-term stability of numerical computations. To address challenges arising from nonlinear neural network discretization, we perform temporal discretizations on these variational systems before spatial discretizations. This approach is computationally memory-efficient when implementing neural network-based algorithms. The proposed neural network-based schemes are mesh-free, allowing us to solve gradient flows in high dimensions. Various numerical experiments are presented to demonstrate the accuracy and energy stability of the proposed numerical schemes.

1 Introduction

Evolution equations with variational structures, often termed gradient flows, have a wide range of applications in physics, materials science, biology, and machine learning [14, 60, 78]. These systems not only possess but are also determined by an energy-dissipation law, which consists of an energy of the state variables that describes the energetic coupling and competition, and a dissipative mechanism that drives the system to an equilibrium.

More precisely, beginning with a prescribed energy-dissipation law

(1) \frac{\mathrm{d}}{\mathrm{d}t}E^{\rm total}=-\triangle\leq 0,

where $E^{\rm total}$ is the sum of the Helmholtz free energy $\mathcal{F}$ and the kinetic energy $\mathcal{K}$, and $\triangle$ is the rate of energy dissipation, one can derive the dynamics of the system, i.e., the corresponding partial differential equation (PDE), by combining the Least Action Principle (LAP) and the Maximum Dissipation Principle (MDP). The LAP states that the equation of motion of a Hamiltonian system can be derived from the variation of the action functional $\mathcal{A}=\int_{0}^{T}(\mathcal{K}-\mathcal{F})\mathrm{d}t$ with respect to the state variable $\bm{x}$. This gives a unique procedure to derive the “conservative force” in the system. The MDP, on the other hand, derives the “dissipative force” by taking the variation of the dissipation potential $\mathcal{D}$, which equals $\frac{1}{2}\triangle$ in the linear response regime (near equilibrium), with respect to $\bm{x}_{t}$. In turn, the force balance condition leads to the PDE of the system:

(2) \frac{\delta\mathcal{D}}{\delta\bm{x}_{t}}=\frac{\delta\mathcal{A}}{\delta\bm{x}}.

This procedure is known as the energetic variational approach (EnVarA) [24, 48, 74], developed based on the celebrated work of Onsager [57, 58] and Rayleigh [63] in non-equilibrium thermodynamics. During the past decades, the framework of EnVarA has been shown to be a powerful tool for developing thermodynamically consistent models for various complex fluid systems, including two-phase flows [80, 81, 83], liquid crystals [48], ionic solutions [22], and reactive fluids [75, 76]. In a certain sense, the energy-dissipation law (1), which comes from the first and second laws of thermodynamics [23], provides a more intrinsic description of the underlying physics than the derived PDE (2), particularly for systems that possess multiple structures and involve multiscale and multiphysics coupling.

From a numerical standpoint, the energy-dissipation law (1) also serves as a valuable guideline for developing structure-preserving numerical schemes for these variational systems, as many straightforward PDE-based discretizations may fail to preserve the continuous variational structures, as well as the physical constraints, such as the conservation of mass or momentum, the positivity of mass density, and the dissipation of the energy. As a recent development in this field [50], the authors proposed a numerical framework, called a discrete energetic variational approach, to address these challenges. Similar methods are also used in [72, 82]. The key idea of this approach is to construct numerical discretizations directly based on the continuous energy-dissipation laws without using the underlying PDE (see Section 3.1 for details). The approach has advantages in preserving a discrete counterpart of the continuous energy dissipation law, which is crucial for avoiding unphysical solutions and the long-term stability of numerical computations. Within the framework of the discrete energetic variational approach, Eulerian [55], Lagrangian [50, 51], and particle methods [73] have been developed for various problems.

However, tackling high-dimensional variational problems remains a challenge. Traditional numerical discretizations, such as finite difference, finite element, and finite volume methods, suffer from the well-known curse of dimensionality (CoD) [27]. Namely, the computational complexity of the numerical algorithm increases exponentially as a function of the dimensionality of the problem [4, 27]. Although particle methods hold promise for addressing high-dimensional problems, standard particle methods often lack accuracy and are not suitable for problems involving derivatives.

In recent years, neural networks (NNs) have demonstrated remarkable success across a wide spectrum of scientific disciplines [12, 31, 41, 43, 47]. Leveraging the potent expressive power of neural networks [3, 10, 20], particularly deep neural network (DNN) architectures, there is growing interest in developing neural network-based algorithms for PDEs, especially for those in high dimensions [13, 21, 37, 38, 42, 62, 68, 70, 77]. Examples include the physics-informed neural network (PINN) [62], deep Ritz method (DRM) [21], deep Galerkin method (DGM) [68], variational PINN [37], and weak adversarial network (WAN) [84], to name a few. A key component of these approaches is to represent solutions of PDEs via NNs. All of these approaches determine the optimal parameters of the NN by minimizing a loss function, which is often derived from either the strong or weak form of the PDE (see Section 2.2 for more details). By employing NN approximations, the approximate solution belongs to a space of nonlinear functions, which may lead to a robust estimation by sparser representation and cheaper computation [37].

The goal of this paper is to combine neural network-based spatial discretizations with the framework of the discrete energetic variational approach [50] to develop efficient and robust numerical schemes, termed energetic variational neural network (EVNN) methods. To clarify the idea, we consider the following two types of gradient flows modeled by EnVarA:

  • $L^{2}$-gradient flow, which satisfies an energy-dissipation law

    (3) \frac{\mathrm{d}}{\mathrm{d}t}\mathcal{F}[\varphi]=-\int_{\Omega}\eta(\varphi)|\varphi_{t}|^{2}\mathrm{d}\bm{x}.
  • Generalized diffusion, which satisfies an energy-dissipation law

    (4) \frac{\mathrm{d}}{\mathrm{d}t}\mathcal{F}[\rho]=-\int_{\Omega}\eta(\rho)|\mathbf{u}|^{2}\mathrm{d}\bm{x},

    where $\rho$ satisfies the mass conservation law $\rho_{t}+\nabla\cdot(\rho\mathbf{u})=0$.

Derivation of the underlying PDEs for these types of systems is described in Sections 2.1.1 and 2.1.2 of the paper.

Our primary aim is to develop structure-preserving Eulerian algorithms to solve $L^{2}$-gradient flows and structure-preserving Lagrangian algorithms to solve generalized diffusions based on their energy-dissipation laws, by utilizing neural networks as a new tool of spatial discretization. To overcome difficulties arising from neural network discretization, we develop a discretization approach that performs temporal discretizations before spatial discretizations. This approach leads to a computer-memory-efficient implementation of neural network-based algorithms. Since neural networks are advantageous due to their ability to serve as parametric approximations for unknown functions even in high dimensions, the proposed neural-network-based algorithms are capable of solving gradient flows in high dimensions, such as those appearing in machine learning [19, 32, 69, 73].

The rest of the paper is organized as follows. Section 2 reviews the EnVarA and some existing neural network-based numerical approaches for solving PDEs. Section 3 of the paper is devoted to the development of the proposed EVNN schemes for L2L^{2}-gradient flows and generalized diffusions. Various numerical experiments are presented in Section 4 to demonstrate the accuracy and energy stability of the proposed numerical methods. Conclusions are drawn in Section 5.

2 Preliminary

2.1 Energetic Variational Approach

In this subsection, we provide a detailed derivation of underlying PDEs for L2L^{2}-gradient flows and generalized diffusions by the general framework of EnVarA. We refer interested readers to [24, 74] for a more comprehensive review of the EnVarA. In both systems, the kinetic energy 𝒦=0\mathcal{K}=0.

2.1.1 L2L^{2}-gradient flow

L2L^{2}-gradient flows are those systems satisfying the energy-dissipation law:

(5) \frac{\mathrm{d}}{\mathrm{d}t}\mathcal{F}[{\varphi}]=-\int_{\Omega}\eta|\varphi_{t}|^{2}\mathrm{d}\bm{x},

where φ{\varphi} is the state variable, [φ]\mathcal{F}[\varphi] is the Helmholtz free energy, and η>0\eta>0 is the dissipation rate. One can also view φ{\varphi} as the generalized coordinates of the system [14]. By treating φ\varphi as 𝒙\bm{x}, the variational procedure (2) leads to the following L2L^{2}-gradient flow equation:

\frac{\delta(\frac{1}{2}\int\eta|\varphi_{t}|^{2}\mathrm{d}\bm{x})}{\delta\varphi_{t}}=-\frac{\delta\mathcal{F}}{\delta\varphi}\quad\Rightarrow\quad\eta\varphi_{t}=-\frac{\delta\mathcal{F}}{\delta\varphi}.

Many problems in soft matter physics, material science, and machine learning can be modeled as L2L^{2}-gradient flows. Examples include Allen–Cahn equation [15], Oseen–Frank and Landau–de Gennes models of liquid crystals [11, 45], phase field crystal models of quasicrystal [34, 44], and generalized Ohta–Kawasaki model of diblock and triblock copolymers [56]. For example, by taking

\mathcal{F}[\varphi]=\int_{\Omega}\frac{1}{2}|\nabla\varphi|^{2}+\frac{1}{4\epsilon^{2}}(\varphi^{2}-1)^{2}\mathrm{d}\bm{x},

which is the classical Ginzburg–Landau free energy, and letting η(φ)=1\eta(\varphi)=1, one gets the Allen–Cahn equation

\varphi_{t}=\Delta\varphi-\frac{1}{\epsilon^{2}}(\varphi^{2}-1)\varphi\,.
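To make the step from the energy to the equation explicit, the intermediate computation is the variational derivative of the Ginzburg–Landau energy (assuming natural or periodic boundary conditions so that the boundary term from integration by parts vanishes):

\frac{\delta\mathcal{F}}{\delta\varphi}=-\Delta\varphi+\frac{1}{\epsilon^{2}}(\varphi^{2}-1)\varphi\,,

so that $\eta\varphi_{t}=-\frac{\delta\mathcal{F}}{\delta\varphi}$ with $\eta=1$ reduces to the Allen–Cahn equation above.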

2.1.2 Generalized diffusion

Generalized diffusion describes the space-time evolution of a conserved quantity ρ(𝒙,t)\rho(\bm{x},t). Due to the physics law of mass conservation, ρ(𝒙,t)\rho(\bm{x},t) satisfies the kinematics

(6) \partial_{t}\rho+\nabla\cdot(\rho\mathbf{u})=0\,,

where 𝐮\mathbf{u} is an averaged velocity in the system.

To derive the generalized diffusion equation by the EnVarA, one should introduce a Lagrangian description of the system. Given a velocity field 𝐮(𝒙,t)\mathbf{u}(\bm{x},t), one can define a flow map 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) through

\begin{cases}\frac{\mathrm{d}}{\mathrm{d}t}\bm{x}(\mathbf{X},t)=\mathbf{u}(\bm{x}(\mathbf{X},t),t),\\ \bm{x}(\mathbf{X},0)=\mathbf{X},\end{cases}

where 𝐗Ω0\mathbf{X}\in\Omega_{0} is the Lagrangian coordinates and 𝒙Ωt\bm{x}\in\Omega_{t} is the Eulerian coordinates. Here Ω0\Omega_{0} is the initial configuration of the system, and Ωt\Omega_{t} is the configuration of the system at time tt. For a fixed 𝐗\mathbf{X}, 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) describes the trajectory of a particle (or a material point) labeled by 𝐗\mathbf{X}; while for a fixed tt, 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) is a diffeomorphism from Ω0\Omega_{0} to Ωt\Omega_{t} (See Fig. 1 for the illustration.) The existence of the flow map 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) requires a certain regularity of 𝐮(𝒙,t)\mathbf{u}(\bm{x},t), for instance, being Lipschitz in 𝒙\bm{x}.

Figure 1: Schematic illustration of a flow map $\bm{x}(\mathbf{X},t)$.

In a Lagrangian picture, the evolution of the density function ρ(𝒙,t)\rho(\bm{x},t) is determined by the evolution of the 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) through the kinematics relation (6) (written in the Lagrangian frame of reference). More precisely, one can define the deformation tensor associated with the flow map 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) by

(7) {\sf\tilde{F}}(\bm{x}(\mathbf{X},t),t)={\sf F}(\mathbf{X},t)=\nabla_{\mathbf{X}}\bm{x}(\mathbf{X},t).

Without ambiguity, we do not distinguish 𝖥{\sf F} and 𝖥~{\sf\tilde{F}} in the rest of the paper. Let ρ0(𝐗)\rho_{0}(\mathbf{X}) be the initial density, then the mass conservation indicates that

(8) \rho(\bm{x}(\mathbf{X},t),t)=\rho_{0}(\mathbf{X})/\det{\sf F}(\mathbf{X},t),\quad\forall\mathbf{X}\in\Omega_{0}

in the Lagrangian frame of reference. This is equivalent to ρt+(ρ𝐮)=0\rho_{t}+\nabla\cdot(\rho\mathbf{u})=0 in the Eulerian frame.
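For completeness, (8) is the change-of-variables form of mass conservation (assuming $\det{\sf F}>0$): for any material subdomain $\omega_{0}\subset\Omega_{0}$,

\int_{\bm{x}(\omega_{0},t)}\rho(\bm{x},t)\,\mathrm{d}\bm{x}=\int_{\omega_{0}}\rho(\bm{x}(\mathbf{X},t),t)\det{\sf F}(\mathbf{X},t)\,\mathrm{d}\mathbf{X}=\int_{\omega_{0}}\rho_{0}(\mathbf{X})\,\mathrm{d}\mathbf{X}\,,

and since $\omega_{0}$ is arbitrary, $\rho(\bm{x}(\mathbf{X},t),t)\det{\sf F}(\mathbf{X},t)=\rho_{0}(\mathbf{X})$.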

Within (8), one can rewrite the energy-dissipation law (4) in terms of the flow map 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) and its velocity 𝒙t(𝐗,t)\bm{x}_{t}(\mathbf{X},t) in the reference domain Ω0\Omega_{0}. A typical form of the free energy [ρ]\mathcal{F}[\rho] is given by

(9) \mathcal{F}[\rho]=\int_{\Omega}\omega(\rho)+\rho V(\bm{x})+\frac{1}{2}\left(\int_{\Omega}K(\bm{x},\bm{y})\rho(\bm{x})\rho(\bm{y})\mathrm{d}\bm{y}\right)\mathrm{d}\bm{x},

where V(𝒙)V(\bm{x}) is a potential field, and K(𝒙,𝒚)K(\bm{x},\bm{y}) is a symmetric non-local interaction kernel. In Lagrangian coordinates, the free energy (9) and the dissipation in (4) become

\mathcal{F}[\bm{x}(\mathbf{X},t)]=\int_{\Omega_{0}}\omega\left(\tfrac{\rho_{0}}{\det{\sf F}}\right)\det{\sf F}+V(\bm{x}(\mathbf{X},t))\rho_{0}(\mathbf{X})+\frac{\rho_{0}(\mathbf{X})}{2}\left(\int_{\Omega_{0}}K(\bm{x}(\mathbf{X},t),\bm{x}(\mathbf{Y},t))\rho_{0}(\mathbf{Y})\mathrm{d}\mathbf{Y}\right)\mathrm{d}\mathbf{X}

and

\mathcal{D}(\bm{x},\bm{x}_{t})=\frac{1}{2}\int_{\Omega_{0}}\eta(\rho_{0}/\det{\sf F})|\bm{x}_{t}|^{2}\det{\sf F}\,\mathrm{d}\mathbf{X}\,.

By a direct computation, the force balance equation (2) can be written as (see [24] for a detailed computation)

(10) \eta(\rho)\mathbf{u}=-\rho\nabla\mu,\quad\mu=\omega^{\prime}(\rho)+V(\bm{x})+K*\rho.

In Eulerian coordinates, combining (10) with the kinematics equation (6), one obtains a generalized diffusion equation

(11) \rho_{t}=\nabla\cdot\left(m(\rho)\nabla\mu\right),\quad m(\rho)=\rho^{2}/\eta(\rho),

where m(ρ)m(\rho) is known as the mobility in physics. Many PDEs in a wide range of applications, including the porous medium equations (PME) [71], nonlinear Fokker–Planck equations [35], Cahn–Hilliard equations [50], Keller–Segel equations [36], and Poisson–Nernst–Planck (PNP) equations [22], can be obtained by choosing [ρ]\mathcal{F}[\rho] and η(ρ)\eta(\rho) differently.

The velocity equation (10) can be viewed as the equation of the flow map 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) in Lagrangian coordinates, i.e.,

(12) \eta\left(\frac{\rho_{0}(\mathbf{X})}{\det{\sf F}(\mathbf{X})}\right)\frac{\mathrm{d}}{\mathrm{d}t}\bm{x}(\mathbf{X},t)=-\frac{\rho_{0}(\mathbf{X})}{\det{\sf F}(\mathbf{X})}{\sf F}^{-\rm T}\nabla_{\mathbf{X}}\mu\left(\frac{\rho_{0}(\mathbf{X})}{\det{\sf F}(\mathbf{X})}\right),

where 𝖥(𝐗)=𝐗𝒙(𝐗,t){\sf F}(\mathbf{X})=\nabla_{\mathbf{X}}\bm{x}(\mathbf{X},t). The flow map equation (12) is a highly nonlinear equation of 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) that involves both 𝖥{\sf F} and det𝖥\det{\sf F}. It is rather unclear how to introduce a suitable spatial discretization to Eq. (12) by looking at the equation. However, a generalized diffusion can be viewed as an L2L^{2}-gradient flow of the flow map in the space of diffeomorphism. This perspective gives rise to a natural discretization of generalized diffusions [50].

Remark 2.1.

In the case that $\eta(\rho)=\rho$, a generalized diffusion can be viewed as a Wasserstein gradient flow in the space of all probability densities having finite second moments, $\mathcal{P}_{2}(\Omega)$ [35]. Formally, the Wasserstein gradient flow can be defined as a continuous-time limit ($\tau\rightarrow 0$) of the semi-discrete JKO scheme,

(13) \rho^{n+1}=\mathop{\arg\min}_{\rho\in\mathcal{P}_{2}(\Omega)}\frac{1}{2\tau}W_{2}(\rho,\rho^{n})^{2}+\mathcal{F}[\rho],\quad n=0,1,2,\ldots,

where $\mathcal{P}_{2}(\Omega)=\left\{\rho:\Omega\rightarrow[0,\infty)~{}|~{}\int_{\Omega}\rho\,\mathrm{d}\bm{x}=1,~\int_{\Omega}|\bm{x}|^{2}\rho(\bm{x})\mathrm{d}\bm{x}<\infty\right\}$ and $W_{2}(\rho,\rho^{n})$ is the Wasserstein distance between $\rho$ and $\rho^{n}$. The Wasserstein gradient flow is an Eulerian description of these systems [6]. Other choices of dissipation can define other metrics in the space of probability measures [1, 46].

2.2 Neural-network-based numerical schemes for PDEs

In this subsection, we briefly review some existing neural network-based algorithms for solving PDEs. We refer interested readers to [18, 52] for detailed reviews.

Considering a PDE subject to a certain boundary condition

(14) \mathcal{L}\varphi(\bm{x})=f(\bm{x}),\ \bm{x}\in\Omega\subset\mathbb{R}^{d},\qquad\mathcal{B}\varphi(\bm{x})=g(\bm{x}),\ \bm{x}\in\partial\Omega,

and assuming the solution can be approximated by a neural network φNN(𝒙;𝚯)\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}), where 𝚯{\bm{\Theta}} is the set of parameters in the neural network, the Physics-Informed Neural Network (PINN) [62] and similar methods [13, 68, 77] find the optimal parameter 𝚯{\bm{\Theta}}^{*} by minimizing a loss function defined as:

(15) L({\bm{\Theta}})=\frac{1}{N_{in}}\sum_{i=1}^{N_{in}}\left(\mathcal{L}\varphi_{\rm NN}(\bm{x}_{i};{\bm{\Theta}})-f(\bm{x}_{i})\right)^{2}+\frac{\lambda}{N_{b}}\sum_{j=1}^{N_{b}}\left(\mathcal{B}\varphi_{\rm NN}(\bm{s}_{j};{\bm{\Theta}})-g(\bm{s}_{j})\right)^{2}.

Here {𝒙i}i=1Nin\{\bm{x}_{i}\}_{i=1}^{N_{in}} and {𝒔j}j=1Nb\{\bm{s}_{j}\}_{j=1}^{N_{b}} are sets of samples in Ω\Omega and Ω\partial\Omega respectively, which can be drawn uniformly or by following some prescribed distributions. The parameter λ\lambda is used to weight the sum in the loss function. Minimizers of the loss function (15) can be obtained by using some optimization methods, such as AdaGrad and Adam. Of course, the objective function of this minimization problem in general is not convex even when the initial problem is. Obtaining the global minimum of (15) is highly non-trivial.
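As an illustration of the loss (15), the following is a minimal PyTorch sketch (our own illustration, not taken from any of the cited works) of the PINN loss for the model problem $-\Delta u=f$ in $\Omega$ with $u=g$ on $\partial\Omega$; the network `u_net`, the data functions `f` and `g`, and the sample tensors are placeholders supplied by the user.

```python
import torch

def pinn_loss(u_net, f, g, x_in, x_bd, lam=1.0):
    """Loss (15) for -Delta u = f in Omega, u = g on the boundary.

    x_in: (N_in, d) interior samples; x_bd: (N_b, d) boundary samples.
    u_net maps (N, d) -> (N, 1); f and g return tensors of shape (N, 1).
    """
    x_in = x_in.clone().requires_grad_(True)
    u = u_net(x_in)
    grad_u = torch.autograd.grad(u.sum(), x_in, create_graph=True)[0]
    # Laplacian as the sum of the diagonal second derivatives
    lap = torch.zeros_like(u)
    for k in range(x_in.shape[1]):
        d2u_k = torch.autograd.grad(grad_u[:, k].sum(), x_in,
                                    create_graph=True)[0][:, k:k + 1]
        lap = lap + d2u_k
    residual = ((-lap - f(x_in)) ** 2).mean()          # strong-form residual
    boundary = ((u_net(x_bd) - g(x_bd)) ** 2).mean()   # boundary penalty
    return residual + lam * boundary
```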

In contrast to the PINN, which is based on the strong form of a PDE, the deep Ritz method (DRM) [21] is designed for solving PDEs using their variational formulations. If Eq. (14) is the Euler–Lagrange equation of some energy functional

(16) I[\varphi,\nabla\varphi]=\int_{\Omega}W(\varphi(\bm{x}),\nabla_{\bm{x}}\varphi(\bm{x}))\mathrm{d}\bm{x}+\lambda\int_{\partial\Omega}|\mathcal{B}\varphi-g(\bm{x})|^{2}\mathrm{d}S,

then the loss function can be defined directly by

(17) L({\bm{\Theta}})=\frac{1}{N_{in}}\sum_{i=1}^{N_{in}}W(\varphi_{\rm NN}(\bm{x}_{i}),\nabla_{\bm{x}}\varphi_{\rm NN}(\bm{x}_{i}))+\frac{\lambda}{N_{b}}\sum_{j=1}^{N_{b}}(\mathcal{B}\varphi_{\rm NN}(\bm{s}_{j})-g(\bm{s}_{j}))^{2}.

The last term in (16) is a penalty term for the boundary condition. The original Dirichlet boundary condition can be recovered as $\lambda\rightarrow\infty$. Again, the samples $\{\bm{x}_{i}\}_{i=1}^{N_{in}}$ and $\{\bm{s}_{j}\}_{j=1}^{N_{b}}$ can be sampled uniformly from $\Omega$ and $\partial\Omega$, respectively, or drawn from some other prescribed distributions. Additionally, both the variational PINN [37] and the WAN [84] utilize the Galerkin formulation to solve PDEs. The variational PINN [37] stems from the Petrov–Galerkin method; it represents the solution via a DNN and keeps the test functions in linear function spaces. The WAN employs primal and adversarial networks to parameterize the weak solution and the test functions, respectively [84].

The neural network-based algorithms mentioned above focus on elliptic equations. For an evolution equation of the form:

(18) \left\{\begin{aligned}&\partial_{t}\varphi(\bm{x},t)=F(t,\bm{x},\varphi),\quad(\bm{x},t)\in\Omega\times(0,\infty),\\ &\varphi(\bm{x},0)=\varphi_{0}(\bm{x}),\quad\bm{x}\in\Omega\end{aligned}\right.

with a suitable boundary condition, the predominant approach is to treat the time variable as an additional dimension, and the PINN type techniques can be applied. However, these approaches are often expensive and may fail to capture the dynamics of these systems. Unlike spatial variables, there exists an inherent order in time, where the solution at time TT is determined by the solutions in t<Tt<T. It is difficult to generate samples in time to maintain this causality. Very recently, new efforts have been made in using NNs to solve evolution PDEs [8, 16]. The idea of these approaches is to use NNs with time-dependent parameters to represent the solutions. For instance, Neural Galerkin method, proposed in [8], parameterizes the solution as φh(𝒙;𝚯(t))\varphi_{h}(\bm{x};{\bm{\Theta}}(t)) with 𝚯(t){\bm{\Theta}}(t) being the NN parameters, and defines a loss function in terms of 𝚯{\bm{\Theta}} and 𝚯˙\dot{\bm{\Theta}} through a residual function, i.e.,

(19) J({\bm{\Theta}},{\bm{\eta}})=\frac{1}{2}\int|\nabla_{\bm{\Theta}}\varphi_{h}\cdot{\bm{\eta}}-F(\bm{x},t,\varphi_{h}(\bm{x};{\bm{\Theta}}))|^{2}\mathrm{d}\nu_{\bm{\Theta}}(\bm{x}),

where νΘ(𝒙)\nu_{\Theta}(\bm{x}) is a suitable measure which might depend on 𝚯{\bm{\Theta}}. By taking variation of J(𝚯,𝜼)J({\bm{\Theta}},{\bm{\eta}}) with respect to 𝜼{\bm{\eta}}, the neural Galerkin method arrives at an ODE of 𝚯(t){\bm{\Theta}}(t):

(20) {\sf M}({\bm{\Theta}})\dot{\bm{\Theta}}=F(t,{\bm{\Theta}}),\quad{\bm{\Theta}}(0)={\bm{\Theta}}_{0},

where ${\sf M}({\bm{\Theta}})=\int\nabla_{\bm{\Theta}}\varphi_{h}\otimes\nabla_{\bm{\Theta}}\varphi_{h}\,\mathrm{d}\nu_{\bm{\Theta}}(\bm{x})$ and $F(t,{\bm{\Theta}})=\int\nabla_{\bm{\Theta}}\varphi_{h}\,F(\bm{x},t,\varphi_{h})\,\mathrm{d}\nu_{\bm{\Theta}}(\bm{x})$. The ODE (20) can be solved by standard explicit or implicit time-marching schemes.

It is important to note that, with the exception of the DRM, all existing neural-network-based methods are developed based on either the strong or weak forms of PDEs. While the DRM utilizes the free energy functional, it is only suitable for solving static problems, i.e., finding the equilibrium of the system. Additionally, most of these existing approaches are Eulerian methods. For certain types of PDEs, like generalized diffusions, these methods may fail to preserve physical constraints, such as the positivity and mass conservation of a probability density.

3 Energetic Variational Neural Network

In this section, we present the structure-preserving EVNN discretizations for solving both $L^{2}$-gradient flows and generalized diffusions. As mentioned earlier, the goal is to construct a neural-network discretization based on the energy-dissipation law, without working on the underlying PDE. One can view our method as a generalization of the DRM to evolution equations.

3.1 EVNN scheme for L2L^{2}-gradient flow

Before we discuss the neural network discretization, we first briefly review the discrete energetic variational approach proposed in [50]. Given a continuous energy-dissipation law (1), the discrete energetic variational approach first constructs a finite-dimensional approximation to the continuous energy-dissipation law by introducing a spatial discretization of the state variable $\varphi$, denoted by $\varphi_{h}(\bm{x};{\bm{\Theta}}(t))$, where ${\bm{\Theta}}(t)\in\mathbb{R}^{K}$ is the parameter to be determined. By replacing $\varphi$ by $\varphi_{h}$, one obtains a semi-discrete energy-dissipation law (here we assume $\mathcal{K}=0$ in the continuous model for simplicity) in terms of ${\bm{\Theta}}$:

(21) \frac{\mathrm{d}}{\mathrm{d}t}\mathcal{F}_{h}\left[{\bm{\Theta}}(t)\right]=-\triangle_{h}[{\bm{\Theta}}(t),{\bm{\Theta}}^{\prime}(t)],

where $\mathcal{F}_{h}\left[{\bm{\Theta}}(t)\right]=\mathcal{F}[\varphi_{h}(\bm{x};{\bm{\Theta}})]$ and $\triangle_{h}[{\bm{\Theta}}(t),{\bm{\Theta}}^{\prime}(t)]=\int\eta(\varphi_{h})|\nabla_{\bm{\Theta}}\varphi_{h}\cdot{\bm{\Theta}}^{\prime}(t)|^{2}\mathrm{d}\bm{x}$. For example, in Galerkin methods, one can let $\varphi_{h}(\bm{x},t)=\sum_{i=1}^{K}\gamma_{i}(t)\psi_{i}(\bm{x})$ with $\{\psi_{i}\}_{i=1}^{K}$ being the set of basis functions. Then ${\bm{\Theta}}(t)$ is the vector ${\bm{\Theta}}(t)=(\gamma_{1}(t),\gamma_{2}(t),\ldots,\gamma_{K}(t))^{\rm T}\in\mathbb{R}^{K}$.

By employing the energetic variational approach in the semi-discrete level (21), one can obtain an ODE system of 𝚯(t){\bm{\Theta}}(t). Particularly, in the linear response regime, 𝒟h(𝚯(t),𝚯(t))=12η(φh)|𝚯φh𝚯(t)|2d𝒙\mathcal{D}_{h}({\bm{\Theta}}(t),{\bm{\Theta}}^{\prime}(t))=\frac{1}{2}\int\eta(\varphi_{h})|\nabla_{\bm{\Theta}}\varphi_{h}\cdot{\bm{\Theta}}^{\prime}(t)|^{2}\mathrm{d}\bm{x} is a quadratic function of 𝚯{\bm{\Theta}}^{\prime}. The ODE system of 𝚯(t){\bm{\Theta}}(t) can then be written as

(22) {\sf D}\left({\bm{\Theta}}\right){\bm{\Theta}}^{\prime}(t)=-\frac{\delta\mathcal{F}_{h}}{\delta{\bm{\Theta}}},

where

(23) \frac{\delta\mathcal{F}_{h}}{\delta{\bm{\Theta}}}=\frac{\delta\mathcal{F}}{\delta\varphi}\nabla_{\bm{\Theta}}\varphi_{h},\quad{\sf D}\left({\bm{\Theta}}\right)=\int\eta(\varphi_{h})(\nabla_{\bm{\Theta}}\varphi_{h}\otimes\nabla_{\bm{\Theta}}\varphi_{h})\mathrm{d}\bm{x}.

The ODE (22) is the same as the ODE (20) in the neural Galerkin method [8] for L2L^{2} gradient flows, although the derivation is different.

Since (22) is a finite-dimensional gradient flow, one can then construct a minimizing movement scheme for 𝚯{\bm{\Theta}} [25]: finding 𝚯n+1{\bm{\Theta}}^{n+1} such that

(24) {\bm{\Theta}}^{n+1}=\mathop{\arg\min}_{{\bm{\Theta}}\in\mathcal{S}_{ad}^{h}}J_{n}({\bm{\Theta}}),\quad J_{n}({\bm{\Theta}})=\frac{({\sf D}^{n}_{*}({\bm{\Theta}}-{\bm{\Theta}}^{n}))\cdot({\bm{\Theta}}-{\bm{\Theta}}^{n})}{2\tau}+\mathcal{F}_{h}({\bm{\Theta}}).

Here $\mathcal{S}_{ad}^{h}$ is the admissible set of ${\bm{\Theta}}$ inherited from the admissible set of $\varphi$, denoted by $\mathcal{S}$, and ${\sf D}^{n}_{*}$ is a constant matrix. A typical choice is ${\sf D}^{n}_{*}={\sf D}({\bm{\Theta}}^{n})$. An advantage of this scheme is that

(25) \mathcal{F}_{h}({\bm{\Theta}}^{n+1})\leq J_{h}^{n}({\bm{\Theta}}^{n+1})\leq J_{h}^{n}({\bm{\Theta}}^{n})=\mathcal{F}_{h}({\bm{\Theta}}^{n}),

if 𝖣n{\sf D}^{n}_{*} is positive definite, which guarantees the energy stability for the discrete free energy h(𝚯)\mathcal{F}_{h}({\bm{\Theta}}). Moreover, by choosing a proper optimization method, we can assure that φhn+1\varphi^{n+1}_{h} stays in the admissible set 𝒮\mathcal{S}.

Although neural networks can be used to construct φh(𝒙;𝚯(t))\varphi_{h}(\bm{x};{\bm{\Theta}}(t)), it might be expensive to compute 𝚯φh𝚯φh\nabla_{\bm{\Theta}}\varphi_{h}\otimes\nabla_{\bm{\Theta}}\varphi_{h} in 𝖣(𝚯){\sf D}({\bm{\Theta}}). Moreover, 𝖣(𝚯){\sf D}({\bm{\Theta}}) is not a sparse matrix and requires a lot of computer memory to store when a deep neural network is used.

To overcome these difficulties, we propose an alternative approach by introducing temporal discretization before spatial discretization. Let τ\tau be the time step size. For the L2L^{2}-gradient flow (5), given φn\varphi^{n}, which represents the numerical solution at tn=nτt^{n}=n\tau, one can obtain φn+1\varphi^{n+1} by solving the following optimization problem: finding φn+1\varphi^{n+1} in some admissible set 𝒮\mathcal{S} such that

(26) \varphi^{n+1}=\mathop{\arg\min}_{\varphi\in\mathcal{S}}J^{n}(\varphi),\quad J^{n}(\varphi)=\frac{1}{2\tau}\int\eta(\varphi^{n})|\varphi-\varphi^{n}|^{2}\mathrm{d}\bm{x}+\mathcal{F}[\varphi].

Let φh(𝒙;𝚯)\varphi_{h}(\bm{x};{\bm{\Theta}}) be a finite-dimensional approximation to φ\varphi with 𝚯K{\bm{\Theta}}\in\mathbb{R}^{K} being the parameter of the spatial discretization (e.g., weights of linear combination in a Galerkin approximation) yet to be determined, then the minimizing movement scheme (26) can be written in terms of 𝚯{\bf\Theta}: finding 𝚯n+1{\bm{\Theta}}^{n+1} such that

(27) {\bm{\Theta}}^{n+1}=\mathop{\arg\min}_{{\bm{\Theta}}\in\mathcal{S}_{h}}J_{h}^{n}({\bm{\Theta}}),\quad J_{h}^{n}({\bm{\Theta}})=\frac{1}{2\tau}\int\eta^{n}|\varphi_{h}(\bm{x};{\bm{\Theta}})-\varphi_{h}(\bm{x};{\bm{\Theta}}^{n})|^{2}\mathrm{d}\bm{x}+\mathcal{F}_{h}[{\bm{\Theta}}].

Here 𝚯n{\bm{\Theta}}^{n} is the value of 𝚯{\bm{\Theta}} at time tnt^{n}. h[𝚯]=[φh(𝒙;𝚯)]\mathcal{F}_{h}[{\bm{\Theta}}]=\mathcal{F}[\varphi_{h}(\bm{x};{\bm{\Theta}})], and ηn=η(φh(𝒙;𝚯n))\eta^{n}=\eta(\varphi_{h}(\bm{x};{\bm{\Theta}}^{n})).

Remark 3.1.

The connection between the minimizing movement scheme (27) derived by a temporal-then-spatial approach, and the minimizing movement scheme (24) derived by a spatial-then-temporal approach, can be shown with a direct calculation. Indeed, according to the first-order necessary condition for optimality, an optimal solution to the minimization problem (27) 𝚯n+1{\bm{\Theta}}^{n+1} satisfies

(28) \frac{\delta J^{n}_{h}({\bm{\Theta}})}{\delta{\bm{\Theta}}}\Big{|}_{{\bm{\Theta}}^{n+1}}=\frac{1}{\tau}\int\eta^{n}(\varphi_{h}(\bm{x};{\bm{\Theta}}^{n+1})-\varphi_{h}(\bm{x};{\bm{\Theta}}^{n}))\nabla_{\bm{\Theta}}\varphi_{h}\Big{|}_{{\bm{\Theta}}^{n+1}}\mathrm{d}\bm{x}+\frac{\delta\mathcal{F}_{h}}{\delta{\bm{\Theta}}}\Big{|}_{{\bm{\Theta}}^{n+1}}
=\left(\int\eta^{n}\nabla_{\bm{\Theta}}\varphi_{h}\Big{|}_{{\bm{\Theta}}^{*}}\otimes\nabla_{\bm{\Theta}}\varphi_{h}\Big{|}_{{\bm{\Theta}}^{n+1}}\mathrm{d}\bm{x}\right)\frac{{\bm{\Theta}}^{n+1}-{\bm{\Theta}}^{n}}{\tau}+\frac{\delta\mathcal{F}_{h}}{\delta{\bm{\Theta}}}\Big{|}_{{\bm{\Theta}}^{n+1}}=0

for some ${\bm{\Theta}}^{*}$, where the second equality follows from the mean value theorem. In the case of Galerkin methods, since $\nabla_{\bm{\Theta}}\varphi_{h}$ is independent of ${\bm{\Theta}}$, ${\bm{\Theta}}^{n+1}$ is a solution to an implicit Euler scheme for the ODE (22) in which $\eta(\varphi_{h})$ is treated explicitly.

By choosing a certain neural network to approximate φ\varphi, denoted by φNN(𝒙;𝚯)\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}) (𝚯{\bm{\Theta}} is used to denote all parameters in the neural network), we can summarize the EVNN scheme for solving a L2L^{2}-gradient flow in Alg. 1.

For a given initial condition φ0(𝒙)\varphi_{0}(\bm{x}), compute 𝚯0{\bm{\Theta}}^{0} by solving

(29) {\bm{\Theta}}^{0}=\mathop{\arg\min}_{\bm{\Theta}}\int_{\Omega}|\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})-\varphi_{0}(\bm{x})|^{2}\mathrm{d}\bm{x};

At each step, update 𝚯n+1{\bm{\Theta}}^{n+1} by solving the optimization problem

(30) {\bm{\Theta}}^{n+1}=\mathop{\arg\min}_{\bm{\Theta}}\left(\frac{1}{2\tau}\int_{\Omega}\eta^{n}|\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})-\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}^{n})|^{2}\mathrm{d}\bm{x}+\mathcal{F}[\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})]\right).

We have φNN(𝒙;𝚯n)\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}^{n}) as a numerical solution at time tn=nτt^{n}=n\tau.

Algorithm 1 Numerical Algorithm for solving the L2L^{2}-gradient flow

It can be noticed that both Eq. (29) and Eq. (30) involve integration in the computational domain Ω\Omega. This integration is often computed by using a grid-based numerical quadrature or Monte–Carlo/Quasi–Monte–Carlo algorithms [65]. It is worth mentioning that due to the non-convex nature of the optimization problem and the error in estimating the integration, it might be difficult to find an optimal 𝚯n+1{\bm{\Theta}}^{n+1} at each step. But since we start the optimization procedure with 𝚯n{\bm{\Theta}}^{n}, we’ll always be able to get a 𝚯n+1{\bm{\Theta}}^{n+1} that lowers the discrete free energy at least on the training set. In practice, the optimization problem can be solved by either deterministic optimization algorithms, such as L-BFGS and gradient descent with Barzilai–Borwein step-size, or stochastic gradient descent algorithms, such as AdaGrad and Adam.
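To make Alg. 1 concrete, the following is a minimal PyTorch sketch (our own illustration, not the implementation used for the experiments below) of one update (30) for the Allen–Cahn energy with $\eta=1$, using Monte Carlo quadrature on a fixed batch of interior samples and L-BFGS as the inner optimizer; the network `phi_net` and the sample tensor are placeholders. Batch means approximate $\frac{1}{|\Omega|}\int_{\Omega}$, so the objective differs from (30) by a constant factor $|\Omega|$ that does not change the minimizer.

```python
import copy
import torch

def evnn_step(phi_net, x, tau=1e-2, eps=0.1, n_lbfgs=50):
    """One EVNN step (30) for the Allen--Cahn L2-gradient flow (eta = 1).

    phi_net : network representing the current solution (updated in place).
    x       : (N, d) Monte Carlo samples in Omega used for quadrature.
    """
    phi_old = copy.deepcopy(phi_net)             # phi_NN(.; Theta^n), frozen
    for p in phi_old.parameters():
        p.requires_grad_(False)

    opt = torch.optim.LBFGS(phi_net.parameters(), max_iter=n_lbfgs)

    def closure():
        opt.zero_grad()
        xg = x.clone().requires_grad_(True)
        phi = phi_net(xg)
        grad_phi = torch.autograd.grad(phi.sum(), xg, create_graph=True)[0]
        # Ginzburg--Landau free energy density, averaged over the batch
        energy = (0.5 * (grad_phi ** 2).sum(dim=1, keepdim=True)
                  + (phi ** 2 - 1.0) ** 2 / (4.0 * eps ** 2)).mean()
        # movement limiter (1/2 tau) * || phi - phi^n ||^2
        movement = ((phi_net(x) - phi_old(x)) ** 2).mean() / (2.0 * tau)
        loss = movement + energy
        loss.backward()
        return loss

    opt.step(closure)
    return phi_net
```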

Remark 3.2.

It is straightforward to incorporate other variational high-order temporal discretizations to solve the L2L^{2}-gradient flows. For example, a second-order accurate BDF2 scheme can be reformulated as an optimization problem

(31) {\bm{\Theta}}^{n}=\mathop{\arg\min}_{\bm{\Theta}}\left(\frac{\eta}{\tau}\int_{\Omega}|\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})-\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}^{n-1})|^{2}\mathrm{d}\bm{x}-\frac{\eta}{4\tau}\int_{\Omega}|\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})-\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}^{n-2})|^{2}\mathrm{d}\bm{x}+\mathcal{F}[\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})]\right).

A modified Crank–Nicolson time-marching scheme can be reformulated as

(32) {\bm{\Theta}}^{n}=\mathop{\arg\min}_{\bm{\Theta}}\left(\frac{\eta}{\tau}\int_{\Omega}|\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})-\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}^{n-1})|^{2}\mathrm{d}\bm{x}+\mathcal{F}[\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})]+\langle\nabla_{\bm{\Theta}}\mathcal{F}[\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}^{n-1})],{\bm{\Theta}}-{\bm{\Theta}}^{n-1}\rangle\right).

Here we assume that η\eta is a constant for simplicity.

3.2 Lagrangian EVNN scheme for generalized diffusions

In this subsection, we show how to formulate an EVNN scheme in the Lagrangian frame of reference for generalized diffusions.

As discussed previously, a generalized diffusion can be viewed as an L2L^{2}-gradient flow of the flow map 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) in the space of diffeomorphisms. Hence, the EVNN scheme for the generalized diffusion can be formulated in terms of a minimizing movement scheme of the flow map given by

(33) \Phi^{n+1}=\mathop{\arg\min}_{\Phi\in{\rm Diff}}\frac{1}{2\tau}\int|\Phi(\mathbf{X})-\Phi^{n}(\mathbf{X})|^{2}\rho_{0}(\mathbf{X})\mathrm{d}\mathbf{X}+\mathcal{F}\left[\Phi_{\#}\rho_{0}\right],

where $\Phi^{n}(\mathbf{X})$ is a numerical approximation of the flow map $\bm{x}(\mathbf{X},t)$ at $t^{n}=n\tau$, $\rho_{0}(\mathbf{X})$ is the initial density, $\mathcal{F}[\rho]$ is the free energy for the generalized diffusion defined in Eq. (9), and

(\Phi_{\#}\rho_{0})(\bm{x}):=\frac{\rho_{0}(\Phi^{-1}(\bm{x}))}{\det{\sf F}(\Phi^{-1}(\bm{x}))},\quad{\rm Diff}=\{\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}~{}|~{}\Phi~\text{is a diffeomorphism}\}.

One can parameterize Φ:dd\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} by a suitable neural network. The remaining procedure is almost the same as that of the last subsection. However, it is often difficult to solve this optimization problem directly, and one might need to build a large neural network to approximate Φn+1\Phi^{n+1} when nn is large.

To overcome this difficulty, we propose an alternative approach. Instead of seeking an optimal map $\Phi^{n+1}$ between $\rho^{0}$ and $\rho^{n+1}$, we seek an optimal map $\Psi^{n+1}$ between $\rho^{n}$ and $\rho^{n+1}$. More precisely, we assume that

\Phi^{n+1}=\Psi^{n+1}\circ\Psi^{n}\circ\Psi^{n-1}\circ\ldots\circ\Psi^{1}.

Given ρn\rho^{n}, one can compute Ψn+1\Psi^{n+1} by solving the following optimization problem

(34) \Psi^{n+1}=\mathop{\arg\min}_{\Psi\in{\rm Diff}}\frac{1}{2\tau}\int|\Psi(\bm{x})-\bm{x}|^{2}\rho^{n}(\bm{x})\mathrm{d}\bm{x}+\mathcal{F}[\Psi_{\#}\rho^{n}].

The corresponding $\rho^{n+1}$ can then be computed through $\rho^{n+1}(\bm{x})=(\Psi^{n+1}_{\#}\rho^{n})(\bm{x})$. An advantage of this approach is that only a small neural network is needed to approximate $\Psi^{n+1}$ at each time step when $\tau$ is small.

Remark 3.3.

The scheme (34) can be viewed as a Lagrangian realization of the JKO scheme (13) for the Wasserstein gradient flow, although it is developed based on the L2L^{2}-gradient flow structure in the space of diffeomorphism. According to the Benamou–Brenier formulation [5], the Wasserstein distance between two probability densities ρ1\rho_{1} and ρ2\rho_{2} can be computed by solving the optimization problem

(35) W_{2}(\rho_{1},\rho_{2})^{2}=\mathop{\min}_{(\rho,\mathbf{u})\in\mathcal{S}}\int_{0}^{1}\int\rho|\mathbf{u}|^{2}\mathrm{d}\bm{x}\mathrm{d}t,

where the admissible set of (ρ,𝐮)(\rho,\mathbf{u}) is given by

(36) \mathcal{S}=\{(\rho,\mathbf{u})~{}|~{}\rho_{t}+\nabla\cdot(\rho\mathbf{u})=0,\quad\rho(\bm{x},0)=\rho_{1},\quad\rho(\bm{x},1)=\rho_{2}\}.

Hence, one can solve the JKO scheme by solving

(37) (\mathbf{u}^{*},\hat{\rho}^{*})=\mathop{\arg\min}_{(\mathbf{u},\hat{\rho})}\frac{1}{2}\int_{0}^{\tau}\int_{\Omega}\hat{\rho}|\mathbf{u}|^{2}\mathrm{d}\bm{x}\mathrm{d}t+\mathcal{F}[\hat{\rho}(\tau)]\quad\text{s.t.}\quad\partial_{t}\hat{\rho}+\nabla\cdot(\hat{\rho}\mathbf{u})=0,\quad\hat{\rho}(0)=\rho^{n},

and letting ρn+1=ρ^(τ)\rho^{n+1}=\hat{\rho}^{*}(\tau). In Lagrangian coordinates, (37) is equivalent to

(38) \bm{x}^{*}(\mathbf{X},t)=\mathop{\arg\min}_{\bm{x}(\mathbf{X},t)}\frac{1}{2}\int_{0}^{\tau}\int_{\Omega}\rho_{0}(\mathbf{X})|\bm{x}_{t}(\mathbf{X},t)|^{2}\mathrm{d}\mathbf{X}\mathrm{d}t+\mathcal{F}(\hat{\rho}(\bm{x},\tau)),

where $\rho_{0}=\rho^{n}$ and $\hat{\rho}(\bm{x}(\mathbf{X},\tau),\tau)=\rho_{0}(\mathbf{X})/\det{\sf F}(\mathbf{X},\tau)$. By taking the variation of (38) with respect to $\bm{x}(\mathbf{X},t)$, one can show that the optimality condition is $\bm{x}_{tt}(\mathbf{X},t)=0$ for $t\in(0,\tau)$, which indicates that $\bm{x}(\mathbf{X},t)=t(\Psi(\mathbf{X})-\mathbf{X})/\tau+\mathbf{X}$ if $\bm{x}(\mathbf{X},\tau)=\Psi(\mathbf{X})$. Hence, if $\Psi^{*}$ is the optimal solution of (34), then $\bm{x}^{*}(\mathbf{X},t)=t(\Psi^{*}(\mathbf{X})-\mathbf{X})/\tau+\mathbf{X}$ is the optimal solution of (38).

Remark 3.4.

If η(ρ)ρ\eta(\rho)\neq\rho, we can formulate the optimization problem (34) as

(39) \Psi^{n+1}=\mathop{\arg\min}_{\Psi\in{\rm Diff}}\frac{1}{2\tau}\int|\Psi(\bm{x})-\bm{x}|^{2}\eta(\rho^{n}(\bm{x}))\mathrm{d}\bm{x}+\mathcal{F}[\Psi_{\#}\rho^{n}].

by treating $\eta$ explicitly. A subtle fact is that $\det{\sf F}^{n}=1$, since we always start from an identity map at each step.

The numerical algorithm for solving the generalized diffusion is summarized in Alg. 2.

  • Given $\{\bm{x}_{i}^{n}\}_{i=1}^{N}$ and the densities $\rho^{n}_{i}$ at $t=n\tau$ and $\bm{x}_{i}^{n}$.

  • Find $\Psi^{n+1}(\bm{x}):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ by solving the optimization problem (34). To guarantee energy stability, $\Psi$ should be initialized as (an approximation to) the identity map when solving the optimization problem (34).

  • After obtaining $\Psi^{n+1}$, update $\{\bm{x}_{i}^{n+1}\}_{i=1}^{N}$ and $\rho^{n+1}$ by

    (40) \bm{x}_{i}^{n+1}=\Psi^{n+1}(\bm{x}_{i}^{n}),\quad\rho^{n+1}_{i}=\frac{\rho^{n}_{i}}{\det(\nabla\Psi^{n+1}(\bm{x}_{i}^{n}))}.
Algorithm 2 Numerical Algorithm for solving the generalized diffusion

The next question is how to accurately evaluate the numerical integration in (34). Let $\bm{x}_{i}^{n}=\Phi^{n}(\bm{x}_{i}^{0})$. For the general free energy (9), one way to evaluate the integrals in Eq. (34) is

(41) J(\Psi)=\frac{1}{2\tau}\sum_{i=1}^{N}\rho^{n}_{i}\|\Psi(\bm{x}_{i}^{n})-\bm{x}_{i}^{n}\|^{2}|\Omega_{i}^{n}|+\sum_{i=1}^{N}\omega\left(\frac{\rho_{i}^{n}}{\det(\nabla\Psi(\bm{x}_{i}^{n}))}\right)\det(\nabla\Psi(\bm{x}_{i}^{n}))|\Omega_{i}^{n}|+\sum_{i=1}^{N}V(\Psi(\bm{x}_{i}^{n}))\rho_{i}^{n}|\Omega_{i}^{n}|+\frac{1}{2}\sum_{i,j=1}^{N}K(\Psi(\bm{x}_{i}^{n}),\Psi(\bm{x}_{j}^{n}))\rho_{i}^{n}\rho_{j}^{n}|\Omega_{j}^{n}||\Omega_{i}^{n}|,

where $|\Omega_{i}^{n}|$ is the volume of the Voronoi cell associated with the point $\bm{x}_{i}^{n}$ in the set $\{\bm{x}_{i}^{n}\}$, and $\rho^{n}_{i}$ stands for $\rho^{n}(\bm{x}_{i}^{n})$. Here the numerical integration is computed through a piecewise-constant reconstruction of $\rho^{n}$ based on its values at $\{\Phi^{n}(\bm{x}_{i}^{0})\}_{i=1}^{N}$. However, it is not straightforward to compute $|\Omega_{i}^{n}|$, particularly in high-dimensional cases. In the current study, since the initial samples are drawn from $\rho_{0}$, one can roughly assume that $\{\bm{x}_{i}^{n}\}$ follows the distribution $\rho^{n}$; then, following the Monte Carlo approach, the numerical integration can be evaluated as

(42) J(\Psi)=\frac{1}{2\tau}\left(\frac{1}{N}\sum_{i=1}^{N}\|\Psi(\bm{x}_{i}^{n})-\bm{x}_{i}^{n}\|^{2}\right)+\frac{1}{N}\sum_{i=1}^{N}\left(f_{\omega}\left(\frac{\rho_{i}^{n}}{\det(\nabla\Psi(\bm{x}_{i}^{n}))}\right)+V(\Psi(\bm{x}_{i}^{n}))\right)+\frac{1}{2N^{2}}\sum_{i,j=1}^{N}K\left(\Psi(\bm{x}_{i}^{n}),\Psi(\bm{x}_{j}^{n})\right),

where $f_{\omega}(\rho)=\omega(\rho)/\rho$. The proposed numerical method can be further improved if one can evaluate the integration more accurately, i.e., have an efficient way to estimate $|\Omega_{i}^{n}|$. Alternatively, as an advantage of the neural network-based algorithm, we have access to $\rho^{n}(\bm{x})=(\Phi^{n}_{\#}\rho_{0})(\bm{x})$ for all $\bm{x}$, which enables us to estimate $J(\Psi)$ by resampling. More precisely, assume $\mu$ is a distribution that is easy to sample; then

(43) J(\Psi)=\frac{1}{2\tau}\left(\frac{1}{N}\sum_{i=1}^{N}\|\Psi(\bm{x}_{i}^{n})-\bm{x}_{i}^{n}\|^{2}\frac{\rho_{i}^{n}}{\mu_{i}^{n}}\right)+\frac{1}{N}\sum_{i=1}^{N}\left(f_{\omega}\left(\frac{\rho_{i}^{n}}{\det(\nabla\Psi(\bm{x}_{i}^{n}))}\right)+V(\Psi(\bm{x}_{i}^{n}))\right)\frac{\rho_{i}^{n}}{\mu_{i}^{n}}+\frac{1}{2N^{2}}\sum_{i,j=1}^{N}K\left(\Psi(\bm{x}_{i}^{n}),\Psi(\bm{x}_{j}^{n})\right)\frac{\rho_{i}^{n}\rho_{j}^{n}}{\mu_{i}^{n}\mu_{j}^{n}},

where $\bm{x}_{i}\sim\mu$, $\rho_{i}^{n}=\rho^{n}(\bm{x}_{i})$, and $\mu_{i}^{n}=\mu(\bm{x}_{i})$. We will explore this resampling approach in future work.
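For illustration, the following is a minimal PyTorch sketch (an assumption of ours, not the paper's implementation) of the Monte Carlo objective (42) for a free energy with $\omega(\rho)=\rho\log\rho$ (so $f_{\omega}(\rho)=\log\rho$) and an external potential $V$, with the interaction kernel $K$ omitted; `Psi` can be any candidate map implemented as an `nn.Module`, and the per-sample Jacobian determinants are computed with `torch.autograd.functional.jacobian`.

```python
import torch
from torch.autograd.functional import jacobian

def lagrangian_objective(Psi, x, rho, V, tau=1e-2):
    """Monte Carlo objective (42) with omega(rho) = rho*log(rho), no kernel K.

    Psi : candidate map R^d -> R^d (an nn.Module acting on (N, d) batches)
    x   : (N, d) current particle positions x_i^n
    rho : (N,)   current densities rho_i^n at the particles
    V   : callable mapping (N, d) positions to (N,) potential values
    """
    y = Psi(x)                                              # Psi(x_i^n)
    movement = ((y - x) ** 2).sum(dim=1).mean() / (2.0 * tau)

    # det(grad Psi(x_i^n)) for each sample; create_graph=True keeps the
    # determinants differentiable with respect to the parameters of Psi
    dets = torch.stack([
        torch.det(jacobian(lambda z: Psi(z.unsqueeze(0)).squeeze(0),
                           xi, create_graph=True))
        for xi in x
    ])
    f_omega = torch.log(rho / dets)                         # f_omega(rho_i^n / det)
    energy = (f_omega + V(y)).mean()
    return movement + energy
```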

The remaining question is how to parameterize a diffeomorphism using neural networks. This is discussed in detail in the next subsection.

3.3 Neural network architectures

In principle, the proposed numerical framework is independent of the choice of neural network architectures. However, different neural network architectures may lead to different numerical performances, arising from a balance of approximation (representation power), optimization, and generalization. In this subsection, we briefly discuss several neural network architectures that we use in the numerical experiments.

3.3.1 Neural network architectures for Eulerian methods

For Eulerian methods, one can construct a neural network to approximate the unknown function f:df:\mathbb{R}^{d}\rightarrow\mathbb{R}. Shallow neural networks (two-layer neural networks) approximate ff by functions of the form

(44) f(\bm{x};{\bm{\Theta}})=\sum_{i=1}^{N}\alpha_{i}\sigma(\bm{\omega}_{i}\cdot\bm{x}+b_{i})+\alpha_{0}=\bm{\alpha}\cdot\sigma({\bm{W}}\bm{x}+\bm{b})+\alpha_{0},

where $\sigma(\cdot):\mathbb{R}\rightarrow\mathbb{R}$ is a fixed nonlinear activation function, $N$ is the number of hidden nodes (neurons), and ${\bm{\Theta}}=\{\alpha_{i},{\bm{\omega}}_{i},b_{i}\}$ are the NN parameters to be identified. Typical choices of activation functions include the ReLU $\sigma(x)=\max(x,0)$, the sigmoid $\sigma(x)=1/(1+e^{-2x})$, and the hyperbolic tangent function $\tanh(x)$. A DNN can be viewed as a network composed of many hidden layers. More precisely, a DNN with $L$ hidden layers represents a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ by [67]

(45) f(\bm{x};{\bm{\Theta}})=g\circ T^{(L)}\circ T^{(L-1)}\circ\ldots\circ T^{(1)}(\bm{x}),

where $g(\bm{z})=\sum_{i=1}^{N_{L}}\gamma_{i}z_{i}+\gamma_{0}$ is a linear map from $\mathbb{R}^{N_{L}}$ to $\mathbb{R}$, $T^{(l)}$ is a nonlinear map from $\mathbb{R}^{N_{l-1}}$ to $\mathbb{R}^{N_{l}}$, and $N_{0}=d$. The nonlinear map $T^{(l)}$ takes the form

(46) T^{(l)}(\bm{x}_{l-1})=\sigma({\bm{W}}_{l}\bm{x}_{l-1}+\bm{b}_{l}),

where ${\bm{W}}_{l}\in\mathbb{R}^{N_{l}\times N_{l-1}}$, $\bm{b}_{l}\in\mathbb{R}^{N_{l}}$, and $\sigma(\cdot)$ is a nonlinear activation function that acts component-wise on vector-valued inputs.

Another widely used class of DNN model is residual neural network (ResNet). A typical KK-block ResNet approximates an unknown function f(𝒙):df(\bm{x}):\mathbb{R}^{d}\rightarrow\mathbb{R} by

(47) f_{K}(\bm{x};{\bm{\Theta}})=g(\bm{z}_{K}(\bm{x})),

where g(𝒛)=i=1Nγizi+γ0g(\bm{z})=\sum_{i=1}^{N}\gamma_{i}z_{i}+\gamma_{0} is a linear map from N\mathbb{R}^{N}\rightarrow\mathbb{R} and zK(𝒙):dNz_{K}(\bm{x}):\mathbb{R}^{d}\rightarrow\mathbb{R}^{N} is a nonlinear map defined through

(48) \bm{z}_{0}={\bm{V}}\bm{x},\quad\bm{z}_{k}=\sigma_{2}({\bm{\alpha}}_{k}T^{(L_{k})}_{k}\circ T^{(L_{k}-1)}_{k}\circ\ldots\circ T^{(1)}_{k}(\bm{z}_{k-1})+\bm{z}_{k-1}),\quad k=1,2,\ldots,K.

Here $L_{k}$ is the number of fully connected layers in the $k$-th block, $T^{(l)}_{k}$ is the same nonlinear map defined in (46) with ${\bm{W}}_{l}^{k}\in\mathbb{R}^{M\times N}$, ${\bm{b}}_{l}^{k}\in\mathbb{R}^{M}$, ${\bm{V}}\in\mathbb{R}^{N\times d}$, ${\bm{\alpha}}_{k}\in\mathbb{R}^{N\times M}$, and $\sigma_{2}(\cdot)$ is an element-wise activation function. The model parameters are ${\bm{\Theta}}=\{{\bm{\gamma}},{\bm{\alpha}}_{k},{\bm{W}}_{l}^{k},{\bm{b}}_{l}^{k},{\bm{V}}\}$. The original ResNet [29] takes $\sigma_{2}$ as a nonlinear activation function such as ReLU. Later studies indicate that one can also take $\sigma_{2}$ as the identity function [18, 30]. Then, at infinite depth, i.e., $L\rightarrow\infty$, (48) corresponds to the ODE

(49) \frac{\mathrm{d}\bm{z}}{\mathrm{d}t}=f(\bm{z}),\quad\bm{z}_{0}=\bm{x}.

Compared with fully connected deep neural network models which may suffer from numerical instabilities in the form of exploding or vanishing gradients [19, 28, 40], very deep ResNet can be constructed to avoid these issues.

Given that the proposed numerical scheme employs neural networks with time-dependent parameters to approximate the solution of a gradient flow, there is no need to employ a deep neural network. In all numerical experiments for L2L^{2}-gradient flows, we utilize ResNet with only a few blocks. The detailed neural network settings for each numerical experiment will be described in the next section. We’ll compare the numerical performance of different neural network architectures in future work.
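As a concrete example, the following is a minimal PyTorch sketch (our own, with illustrative layer sizes) of the small ResNet (47)–(48) with one fully connected layer per block and $\sigma_{2}$ taken as the identity.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block of (48): z -> alpha_k * sigma(W z + b) + z, sigma_2 = id."""

    def __init__(self, width, act):
        super().__init__()
        self.lin = nn.Linear(width, width)                 # T_k^{(1)}
        self.alpha = nn.Linear(width, width, bias=False)   # alpha_k
        self.act = act

    def forward(self, z):
        return self.alpha(self.act(self.lin(z))) + z


class ResNet(nn.Module):
    """Small ResNet f(x) = g(z_K(x)) of (47), used here with a few blocks."""

    def __init__(self, dim_in=2, width=20, n_blocks=1):
        super().__init__()
        act = nn.Tanh()
        self.lift = nn.Linear(dim_in, width, bias=False)   # z_0 = V x
        self.blocks = nn.ModuleList(
            [ResBlock(width, act) for _ in range(n_blocks)])
        self.out = nn.Linear(width, 1)                     # g(z) = gamma . z + gamma_0

    def forward(self, x):
        z = self.lift(x)
        for block in self.blocks:
            z = block(z)
        return self.out(z)
```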

3.3.2 Neural network architectures for Lagrangian methods

The proposed Lagrangian method seeks a neural network to approximate a diffeomorphism from $\mathbb{R}^{d}$ to $\mathbb{R}^{d}$. This task involves two main challenges: one is ensuring that the map is a diffeomorphism, and the other is efficiently and robustly computing the deformation tensor ${\sf F}$ and its determinant $\det{\sf F}$. Fortunately, various neural network architectures have been proposed to approximate a transport map. Examples include planar flows [64], auto-regressive flows [39], continuous-time diffusive flows [69], neural spline flows [17], and convex potential flows [32].

One way to construct a neural network dd\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} for approximating a flow map is to use a planar flow. A KK-layer planar flow is defined by T=TKT1T0T=T_{K}\circ\ldots T_{1}\circ T_{0}, where Tk:ddT_{k}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is given by

(50) \bm{x}^{k+1}=T_{k}(\bm{x}^{k})=\bm{x}^{k}+{\bm{u}}_{k}h({\bm{w}}^{\rm T}_{k}\bm{x}^{k}+b_{k}).

Here, 𝒘k,𝒖kd{\bm{w}}_{k},{\bm{u}}_{k}\in\mathbb{R}^{d}, bkb_{k}\in\mathbb{R}, and hh is a smooth, element-wise, non-linear activation function such as tanh\tanh. Direct computation shows that

(51) J_{k}=\det(\nabla T_{k})=1+h^{\prime}({\bm{w}}^{\rm T}_{k}\bm{x}^{k}+b_{k}){\bm{u}}^{\rm T}_{k}{\bm{w}}_{k}.

Clearly, $T_{k}$ is a diffeomorphism if $h^{\prime}({\bm{w}}^{\rm T}_{k}\bm{x}^{k}+b_{k}){\bm{u}}^{\rm T}_{k}{\bm{w}}_{k}>-1$ for all $\bm{x}^{k}$, so that $J_{k}>0$ everywhere. The determinant of the transport map can be computed as $\det(\nabla T)=J_{K}J_{K-1}\ldots J_{0}$, and we have $\varphi(T(\bm{x}))=\frac{\varphi^{n}(\bm{x})}{\det(\nabla T)}$.
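A minimal PyTorch sketch (our own illustration) of one planar-flow layer (50) together with its Jacobian determinant (51), with $h=\tanh$; no invertibility constraint on $\bm{u}_{k}$, $\bm{w}_{k}$ is enforced in this sketch.

```python
import torch
import torch.nn as nn

class PlanarLayer(nn.Module):
    """One planar-flow layer T_k(x) = x + u * tanh(w^T x + b), Eq. (50)."""

    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(0.01 * torch.randn(dim))
        self.w = nn.Parameter(0.01 * torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (N, d); returns mapped points and det(grad T_k) from Eq. (51)
        s = x @ self.w + self.b                    # (N,)
        y = x + torch.tanh(s).unsqueeze(1) * self.u
        h_prime = 1.0 - torch.tanh(s) ** 2         # derivative of tanh
        det = 1.0 + h_prime * (self.u @ self.w)    # (N,)
        return y, det
```

Stacking such layers gives $\det(\nabla T)=J_{K}\cdots J_{0}$ by multiplying the returned determinants.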

One limitation of planar flow is its potential lack of expressive power. Another commonly employed neural network architecture for approximating a flow map is the convex potential flow [32], which defines a diffeomorphism via the gradient of a strictly convex function that is parameterized by an Input Convex Neural Network (ICNN) [2]. A fully connected, KK-layer ICNN can be written as

(52) \bm{z}_{l+1}=\sigma_{l}({\bm{W}}_{l}\bm{z}_{l}+\bm{A}_{l}\mathbf{X}+\bm{b}_{l}),\quad l=0,1,\ldots,K-1,

where $\bm{z}_{0}=0$, $\bm{W}_{0}={\sf 0}$, ${\bm{\Theta}}=\{\bm{W}_{l},\bm{A}_{l},\bm{b}_{l}\}$ are the parameters to be determined, and $\sigma_{l}$ is the nonlinear activation function. As proved in [2], if all entries of $\bm{W}_{l}$ are non-negative, and all $\sigma_{l}$ are convex and non-decreasing, then the output $f(\mathbf{X};{\bm{\Theta}})=\bm{z}_{K}$ is a convex function with respect to $\mathbf{X}$. Hence $\nabla_{\mathbf{X}}f(\mathbf{X};{\bm{\Theta}})$ provides a parametric approximation to a flow map. In the current study, we adopt the concept of the convex potential flow to develop the Lagrangian EVNN method. We will explore other types of neural network architectures in future work.
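A minimal PyTorch sketch (our own, not the construction of [2, 32]) of an ICNN of the form (52) and of the induced map $\nabla_{\mathbf{X}}f$; here the non-negativity of $\bm{W}_{l}$ is enforced by a softplus reparameterization, which is one possible choice among several.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Input convex neural network, Eq. (52): z_{l+1} = sigma(W_l z_l + A_l X + b_l),
    with W_0 = 0, non-negative W_l, and convex non-decreasing sigma (softplus),
    so that f(X; Theta) = z_K is convex in X."""

    def __init__(self, dim, width=32, depth=3):
        super().__init__()
        widths = [width] * (depth - 1) + [1]            # final layer is scalar
        self.A = nn.ModuleList([nn.Linear(dim, w) for w in widths])    # A_l X + b_l
        self.W_raw = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(widths[l], widths[l - 1]))
             for l in range(1, depth)])                 # W_l for l >= 1

    def forward(self, x):
        z = F.softplus(self.A[0](x))                    # first layer, W_0 = 0
        for i, (A_l, W_raw) in enumerate(zip(self.A[1:], self.W_raw)):
            pre = z @ F.softplus(W_raw).t() + A_l(x)    # non-negative W_l
            z = pre if i == len(self.W_raw) - 1 else F.softplus(pre)
        return z                                        # (N, 1) convex potential

def transport_map(icnn, x):
    """Psi(x) = grad_x f(x; Theta): the convex-potential parameterization of a map."""
    x = x.clone().requires_grad_(True)
    f = icnn(x).sum()
    return torch.autograd.grad(f, x, create_graph=True)[0]
```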

Remark 3.5.

It is worth mentioning that the gradient of a convex function $f$ only parameterizes a subset of diffeomorphisms, and it is unclear whether the optimal solution of (34) belongs to this subset.

4 Numerical Experiments

In this section, we test the proposed EVNN methods for various $L^{2}$-gradient flows and generalized diffusions. To evaluate the accuracy of different methods, we define the $l^{2}$-error

\left(\frac{1}{N}\sum_{i=1}^{N}|\varphi_{\rm NN}(\bm{x}_{i})-\varphi_{\rm ref}(\bm{x}_{i})|^{2}\right)^{1/2}

and the relative $l^{2}$-error

\left(\sum_{i=1}^{N}|\varphi_{\rm NN}(\bm{x}_{i})-\varphi_{\rm ref}(\bm{x}_{i})|^{2}\Big{/}\sum_{i=1}^{N}|\varphi_{\rm ref}(\bm{x}_{i})|^{2}\right)^{1/2}

between the NN solution φNN\varphi_{\rm NN} and the corresponding reference solution φref\varphi_{\rm ref} on a set of test samples {𝒙i}i=1N\{\bm{x}_{i}\}_{i=1}^{N}. The reference solution φref(𝒙)\varphi_{\rm ref}(\bm{x}) is either an analytic solution or a numerical solution obtained by a traditional numerical method. In some numerical examples, we also plot point-wise absolute errors |φrefφNN||\varphi_{\rm ref}-\varphi_{\rm NN}| at the grid points.
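For reference, a small NumPy helper (ours) that evaluates both metrics from sampled values:

```python
import numpy as np

def l2_errors(phi_nn, phi_ref):
    """Return the (absolute) l2-error and the relative l2-error on test samples."""
    phi_nn, phi_ref = np.asarray(phi_nn), np.asarray(phi_ref)
    abs_err = np.sqrt(np.mean((phi_nn - phi_ref) ** 2))
    rel_err = np.sqrt(np.sum((phi_nn - phi_ref) ** 2) / np.sum(phi_ref ** 2))
    return abs_err, rel_err
```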

4.1 Poisson equations

Although the proposed method is designed for evolutionary equations, it can also be used to compute equilibrium solutions for elliptic problems. In this subsection, we compare the performance of the EVNN method with two classical neural network-based algorithms, PINN and DRM, in the context of solving Poisson equations.

We first consider a 2D Poisson equation with a Dirichlet boundary condition

(53) \begin{cases}-\Delta u(\bm{x})=f(\bm{x}),\quad\bm{x}\in\Omega\subset\mathbb{R}^{d},\\ u(\bm{x})=g(\bm{x}),\quad\bm{x}\in\partial\Omega.\end{cases}

Since EVNN is developed for evolution equations, we solve the following L2L^{2}-gradient flow

(54) \frac{\mathrm{d}}{\mathrm{d}t}\left(\int_{\Omega}\frac{1}{2}|\nabla u|^{2}-f(\bm{x})u(\bm{x})\mathrm{d}\bm{x}+\lambda\int_{\partial\Omega}|u(\bm{x})-g(\bm{x})|^{2}\mathrm{d}S\right)=-\int_{\Omega}|u_{t}|^{2}\mathrm{d}\bm{x},

to get a solution of the Poisson equation (53). Here, λΩ|ug(𝒙)|2dS\lambda\int_{\partial\Omega}|u-g(\bm{x})|^{2}\mathrm{d}S is the surface energy that enforces the Dirichlet boundary condition. The corresponding gradient flow equation is given by

(55) u_{t}=\Delta u+f(\bm{x}),

subject to a Robin boundary condition u(𝒙)g(𝒙)=12λu𝒏u(\bm{x})-g(\bm{x})=\frac{1}{2\lambda}\frac{\partial u}{\partial{\bm{n}}}, where 𝒏{\bm{n}} is the outer normal of Ω\partial\Omega. One can recover the original Dirichlet boundary condition by letting λ\lambda\rightarrow\infty. Such a penalty approach is also used in PINN and DRM, and we take λ=500\lambda=500 in all numerical experiments below.

We consider the following two cases:

  • Case 1: Ω=(0,π)×(π/2,π/2)\Omega=(0,\pi)\times(-\pi/2,\pi/2), f(𝒙)=2sinxcosyf(\bm{x})=2\sin x\cos y and g(𝒙)=0g(\bm{x})=0. The exact solution is u(𝒙)=sinxcosyu(\bm{x})=\sin x\cos y.

  • Case 2: Ω={(x,y)||𝒙|1}\Omega=\{(x,y)~{}|~{}|\bm{x}|\leq 1\}, f(𝒙)=π24sin(π2(1|𝒙|))+π2|𝒙|cos(π2(1|𝒙|))f(\bm{x})=\frac{\pi^{2}}{4}\sin(\frac{\pi}{2}(1-|\bm{x}|))+\frac{\pi}{2|\bm{x}|}\cos(\frac{\pi}{2}(1-|\bm{x}|)) and g(𝒙)=0g(\bm{x})=0. The exact solution is u(𝒙)=sin(π2(1|𝒙|))u(\bm{x})=\sin(\frac{\pi}{2}(1-|\bm{x}|)).

We employ a 1-block ResNet with 20 hidden nodes and one hidden layer for all cases. The total number of parameters is 501. We apply Xavier initialization [26] to initialize the weights of the neural networks in all cases. To evaluate the integrals in all methods, for case 1 we generate 2500 samples in Ω\Omega using Latin hypercube sampling (LHS) and 50 samples on each edge of Ω\partial\Omega using a 1D uniform sampling. For case 2, we generate 2500 samples in (1,1)2(-1,1)^{2} using LHS, but only use the samples that satisfy x2+y2<1x^{2}+y^{2}<1 as training samples. Additionally, we generate 200 training samples (cosθi,sinθi)i=1200(\cos\theta_{i},\sin\theta_{i})_{i=1}^{200} on the boundary, with {θi}i=1200\{\theta_{i}\}_{i=1}^{200} drawn from a uniform distribution on (0,2π)(0,2\pi). For PINN and DRM, we use Adam to minimize the loss function and use a different set of samples at each iteration. For EVNN, we use a different set of samples at each time step and employ L-BFGS to solve the optimization problem (30).
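To make the workflow concrete, the following sketch performs one EVNN update with fixed samples and L-BFGS by minimizing a Monte Carlo estimate of the implicit-Euler (minimizing-movement) functional built from the penalized energy (54). All names, the quadrature weights, and the optimizer settings are illustrative choices of ours, not the exact implementation behind the reported results.

import torch

def evnn_step(model, u_old, x_in, x_bd, f, g, tau, lam, w_in, w_bd, n_iter=50):
    """One EVNN time step for the penalized energy (54), a minimal sketch.

    model : neural network u_Theta(x); u_old : values of the previous-step
    solution at the interior samples x_in (precomputed and detached);
    w_in, w_bd : quadrature weights, e.g. |Omega|/N and |boundary|/M.
    """
    optimizer = torch.optim.LBFGS(model.parameters(), max_iter=n_iter)

    def closure():
        optimizer.zero_grad()
        x = x_in.detach().requires_grad_(True)
        u = model(x).squeeze()
        grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
        # Free energy: 1/2 |grad u|^2 - f u in Omega, plus the boundary penalty.
        energy = w_in * (0.5 * (grad_u ** 2).sum(dim=1) - f(x).squeeze() * u).sum()
        energy = energy + lam * w_bd * ((model(x_bd).squeeze() - g(x_bd).squeeze()) ** 2).sum()
        # Implicit-Euler (minimizing movement) term 1/(2*tau) ||u - u^n||^2.
        movement = w_in * ((u - u_old) ** 2).sum() / (2.0 * tau)
        loss = movement + energy
        loss.backward()
        return loss

    optimizer.step(closure)
    return model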

Refer to caption
Figure 2: Relative l2l^{2} error with respect to CPU time for different methods for 2D Poisson equations.

Due to the randomness arising from the stochastic initialization of the neural network and the sampling process, we repeated each method 10 times and plotted the mean and standard error of the relative l2l_{2}-errors. Fig. 2 shows the relative l2l_{2}-errors of the different methods with respect to the CPU time for both cases. The training was executed on a MacBook Pro equipped with an M2 chip. In Case 1, the relative l2l_{2} error of each method is evaluated on a test set comprising a uniform grid of 101×101 points within the domain (0,1)2(0,1)^{2}. For Case 2, the test set is defined as the collection of points {(xi,yj)|xi2+yj2<1}\{(x_{i},y_{j})|x_{i}^{2}+y_{j}^{2}<1\}, where (xi,yj)(x_{i},y_{j}) forms a uniform 201×201 grid on (1,1)2(-1,1)^{2}. The CPU time is computed by averaging the time of the 10 trials at the nn-th iteration. The numerical results clearly show that the proposed method is more efficient than both PINN and DRM. Through the introduction of temporal relaxation, it significantly improves the efficiency and test accuracy of DRM. While PINN may achieve better test accuracy in Case 2, it exhibits a larger standard error in both cases.

Next, we consider a high-dimensional Poisson equation

(56) Δu=f(𝒙),𝒙Ω=(1,1)d,-\Delta u=f(\bm{x})~{},\quad\bm{x}\in\Omega=(-1,1)^{d}~{},\\

with a homogeneous Neumann boundary condition un|Ω=0\frac{\partial u}{\partial n}|_{\partial\Omega}=0. We take f(𝒙)=π2k=1dcos(πxk)f(\bm{x})=\pi^{2}\sum_{k=1}^{d}\cos(\pi x_{k}). Similar numerical examples are tested in [21, 53]. The exact solution to this high-dimensional Poisson equation is u(𝒙)=k=1dcos(πxk)u(\bm{x})=\sum_{k=1}^{d}\cos(\pi x_{k}). Following [53], we solve an L2L^{2}-gradient flow associated with the free energy

(57) [u]=(1,1)d12|u|2fud𝒙+λ((1,1)dud𝒙)2,\mathcal{F}[u]=\textstyle\int_{(-1,1)^{d}}\frac{1}{2}|\nabla u|^{2}-fu~{}\mathrm{d}\bm{x}+\lambda\left(\int_{(-1,1)^{d}}u~{}\mathrm{d}\bm{x}\right)^{2},

where the last term enforces [1,1]dud𝒙=0\int_{[-1,1]^{d}}u~{}\mathrm{d}\bm{x}=0. We take λ=0.5\lambda=0.5 as in [53] in all numerical tests.
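Under the assumption of uniform sampling on (-1,1)^d, a Monte Carlo estimate of the free energy (57), including the mean-zero penalty, can be written as in the following sketch (ours; the function and variable names are illustrative):

import torch

def free_energy_57(model, x, d, lam=0.5):
    """Monte Carlo estimate of (57) from uniform samples x in (-1,1)^d.

    model : neural network u(x); x : tensor of shape (N, d).
    """
    vol = 2.0 ** d                           # volume of (-1,1)^d
    x = x.requires_grad_(True)
    u = model(x).squeeze()
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    f = (torch.pi ** 2) * torch.cos(torch.pi * x).sum(dim=1)   # f(x) of this test problem
    bulk = vol * (0.5 * (grad_u ** 2).sum(dim=1) - f * u).mean()
    penalty = lam * (vol * u.mean()) ** 2    # enforces zero mean of u
    return bulk + penalty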

Table 1: Relative l2l^{2}-error of high-dimensional Poisson equation with different settings in different dimensions. The first column shows the number of residual blocks, the number of nodes in each fully connected layer, and the number of samples.
Setting d=4d=4 d=8d=8 d=16d=16 d=32d=32
(2, 10, 1000) 0.024 0.046 0.623 0.867
(3, 60, 1000) 0.042 0.048 0.077 0.117
(3, 60, 10000) 0.018 0.022 0.036 0.077

Since it is difficult to obtain an accurate estimate of high-dimensional integrals, we choose Adam to minimize (30) at each time step and use a different set of samples at each Adam step. This approach allows us to explore the high-dimensional space with a low computational cost. The samples are drawn using LHS. Table 1 shows the final relative l2l^{2}-error with different neural network settings in different spatial dimensions. The relative l2l^{2}-error is evaluated on a test set that comprises 40000 samples generated by LHS. The model is trained with 200 outer iterations (τ=0.01\tau=0.01) and at most 200 iterations for the inner optimization (30). An early stopping criterion is applied if the l2l_{2}-norm of the change in model parameters between two consecutive epochs is less than 10^{-6}. The first column of Table 1 specifies the numerical setting as follows: the entry (2, 10) means that a neural network with 2 residual blocks, each containing two fully connected layers with 10 nodes, is used, and 1000 represents the number of samples drawn in each epoch. As we can see, the proposed method achieves results comparable to those reported for similar examples in previous work on DRM [21, 53]. It can be observed that increasing the width of the neural network improves test accuracy in high dimensions, as it enhances the network’s expressive power. Furthermore, increasing the number of training samples also significantly improves test accuracy in high dimensions. We will explore the effects of τ\tau, the number of samples, and the neural network architecture for the high-dimensional Poisson equation in future work.

4.2 L2L^{2}-gradient flow

In this subsection, we apply the EVNN scheme to two L2L^{2}-gradient flows to demonstrate its energy stability and numerical accuracy.

4.2.1 Heat equation

We first consider an initial-boundary value problem of a heat equation:

(58) {ut=Δu(𝐱),𝐱Ω=(0,2)2,t(0,T],u(𝒙,t)=0,𝒙Ω,t(0,T],u(𝒙,0)=sin(π2x1)sin(π2x2),𝒙Ω=(0,2)2,\begin{cases}&u_{t}=\Delta u({\bf x}),\quad{\bf x}\in\Omega=(0,2)^{2}~{},\quad t\in(0,T]~{},\\ &u(\bm{x},t)=0,\quad\bm{x}\in\partial\Omega~{},\quad t\in(0,T]~{},\\ &u(\bm{x},0)=\sin\left(\frac{\pi}{2}x_{1}\right)\sin\left(\frac{\pi}{2}x_{2}\right)~{},\quad\bm{x}\in\Omega=(0,2)^{2}~{},\\ \end{cases}

which can be interpreted as an L2L^{2}-gradient flow satisfying the energy-dissipation law

(59) ddt(Ω12|u|2d𝒙+λΩ|u|2dS)=Ω|ut|2d𝒙.\textstyle\frac{\mathrm{d}}{\mathrm{d}t}\left(\int_{\Omega}\frac{1}{2}\left|\nabla u\right|^{2}\mathrm{d}\bm{x}+\lambda\int_{\partial\Omega}|u|^{2}\mathrm{d}S\right)=-\int_{\Omega}\left|u_{t}\right|^{2}\mathrm{d}\bm{x}~{}.

Again, a surface energy term is added to enforce the Dirichlet boundary condition.

Refer to caption
Figure 3: Numerical results for the heat equation. (a)-(d) are the neural network solutions at t=0.01, 0.2, 0.4, and 0.6. (e)-(h) are the absolute differences between the EVNN solutions and the FDM solutions. (i) The evolution of the discrete free energy with respect to time for both the EVNN and FDM solutions.

In the numerical simulation, we take τ=0.01\tau=0.01 and use a 1-block ResNet with 20 nodes in each layer. The nonlinear activation function is chosen to be tanh\tanh. The total number of parameters is 501. To achieve energy stability, we fix the training samples to eliminate the numerical fluctuation arising from estimating the integrals in (30). The set of training samples comprises a 301×301 uniform grid on (0,2)2(0,2)^{2}, and an additional 1000 uniformly spaced grid points on each edge of the boundary. To test the accuracy of the solution over time, we compare the neural network solution with a numerical solution obtained by a finite difference method (FDM). We apply the standard central finite difference and an implicit Euler method to obtain the FDM solution. The numerical scheme can be reformulated as an optimization problem:

(60) uhn+1=argminuh𝒞12τuhuhnh2+h(uh),h(uh)=12huh,huh.u_{h}^{n+1}=\mathop{\arg\min}_{u_{h}\in\mathcal{C}}\frac{1}{2\tau}\|u_{h}-u_{h}^{n}\|^{2}_{h}+\mathcal{F}_{h}(u_{h}),\quad\mathcal{F}_{h}(u_{h})=\frac{1}{2}\langle\nabla_{h}u_{h},\nabla_{h}u_{h}\rangle_{*}~{}.

Here, 𝒞={ui,j|0<i<I,0<j<J}\mathcal{C}=\{u_{i,j}~{}|~{}0<i<I,\quad 0<j<J\} is the set of grid functions defined on a uniform mesh, h\nabla_{h} is the discrete gradient, h\mathcal{F}_{h} is the discrete free energy, h\|\cdot\|_{h} is the discrete L2L_{2}-norm induced by the discrete L2L_{2}-inner product f,gh=h2i,j=1N1fi,jgi,j\langle f,g\rangle_{h}=h^{2}\sum_{i,j=1}^{N-1}f_{i,j}g_{i,j}, and ,\langle\cdot,\cdot\rangle_{*} is the discrete inner product defined on the staggered mesh points [49, 66]. We use a 101×101 uniform grid and take τ=0.01\tau=0.01 in the FDM simulation. Figure 3(a)-(d) shows the NN solution at t=0.01, 0.2, 0.4, and 0.6. The corresponding point-wise absolute errors are shown in Fig. 3(e)-(h). The evolution of the discrete free energy (evaluated on the 101×101 uniform grid) for both methods is shown in Fig. 3(i). Clearly, the neural-network-based algorithm achieves energy stability and the result is consistent with the FDM. It is worth mentioning that the size of the optimization problem in the neural network-based algorithm is much smaller than that in the FDM, although (60) is a quadratic optimization problem and can be solved directly by solving a linear system for ui,ju_{i,j}.
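Indeed, since the objective in (60) is quadratic, each FDM step is equivalent to solving the linear system (I - \tau\Delta_{h})u^{n+1} = u^{n} for the interior grid values. A small sparse-matrix sketch (ours; homogeneous Dirichlet data and illustrative names) is:

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def implicit_euler_heat_step(u_n, h, tau):
    """One implicit Euler step for u_t = Laplace(u) with homogeneous Dirichlet BCs.

    u_n : (m, m) array of interior grid values; h : mesh size; tau : time step.
    Solving (I - tau * Laplace_h) u^{n+1} = u^n is equivalent to the quadratic
    minimization (60).
    """
    m = u_n.shape[0]
    # 1D second-difference matrix with Dirichlet boundary conditions.
    lap_1d = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(m, m)) / h**2
    eye = sp.identity(m)
    lap_2d = sp.kron(eye, lap_1d) + sp.kron(lap_1d, eye)   # 2D discrete Laplacian
    A = sp.identity(m * m) - tau * lap_2d
    u_next = spla.spsolve(A.tocsr(), u_n.ravel())
    return u_next.reshape(m, m)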

4.2.2 Allen-Cahn equation

Next, we consider an Allen-Cahn type equation, which is extensively utilized in phase-field modeling, a versatile technique for solving interface problems arising in different disciplines [15]. In particular, we focus on the following Allen-Cahn equation on Ω=(1,1)2\Omega=(-1,1)^{2}:

(61) {φt(𝒙,t)=(F(φ)ε2φ(𝒙,t)+2W(φd𝒙A)),𝒙Ω,t>0,φ(𝒙,t)=1,𝒙Ω,t>0,φ(𝒙,0)=tanh(10(4x12+x220.5)),𝒙Ω.\left\{\begin{aligned} &\tfrac{\partial\varphi}{\partial t}(\bm{x},t)=-\left(\tfrac{F^{\prime}(\varphi)}{\varepsilon^{2}}-\triangle\varphi(\bm{x},t)+2W\left(\textstyle\int\varphi\mathrm{d}\bm{x}-A\right)\right)~{},~{}~{}~{}~{}\bm{x}\in\Omega~{},~{}~{}~{}~{}t>0~{},\\ &\varphi(\bm{x},t)=-1~{},\quad\bm{x}\in\partial\Omega~{},~{}~{}~{}~{}t>0~{},\\ &\varphi(\bm{x},0)=-\tanh(10(\sqrt{4x_{1}^{2}+x_{2}^{2}}-0.5))~{},~{}~{}~{}~{}\bm{x}\in\Omega~{}.\\ \end{aligned}\right.

where φ(𝒙,t)\varphi(\bm{x},t) is the phase-field variable, and the constants ϵ\epsilon, WW, and AA are model parameters. The Allen–Cahn equation (61) can be regarded as an L2L^{2}-gradient flow, derived from an energy-dissipation law

(62) ddt[φ]=Ω|φt|2d𝒙,[φ]=Ω12|φ|2+14ε2(φ21)2d𝒙+W(Ωφd𝒙A)2.\textstyle\frac{\mathrm{d}}{\mathrm{d}t}\mathcal{F}[\varphi]=-\int_{\Omega}|\varphi_{t}|^{2}\mathrm{d}\bm{x},\quad\mathcal{F}[\varphi]=\int_{\Omega}\frac{1}{2}\left|\nabla\varphi\right|^{2}+\frac{1}{4\varepsilon^{2}}\left(\varphi^{2}-1\right)^{2}\mathrm{d}\bm{x}+W\left(\textstyle\int_{\Omega}\varphi\mathrm{d}\bm{x}-A\right)^{2}.

Here, the last term in the energy is a penalty term for the volume constraint φd𝒙=A\int\varphi\mathrm{d}\bm{x}=A, as the standard Allen-Cahn equation does not preserve the volume fraction φd𝒙\int\varphi\mathrm{d}\bm{x} over time. In the numerical simulation, we take A=(4πr2)+πr2A=-(4-\pi r^{2})+\pi r^{2} with r=0.5r=0.5, 1ϵ2=100\frac{1}{\epsilon^{2}}=100, and W=1000W=1000.

Refer to caption
Figure 4: Numerical results for the Allen–Cahn equation with a volume constraint. (a)-(d) NN solutions at t=0, 0.05, 0.1, and 0.3, respectively. (e)-(h) Absolute differences between the EVNN solutions and the FEM solutions at t=0, 0.05, 0.1, and 0.3, respectively. (i) Evolution of the numerical free energies with respect to time for both the FEM and EVNN solutions.

Due to the complexity of the initial condition, we utilize a larger neural network than the one employed in the previous subsection. Our choice is a 1-block ResNet, which has two fully connected layers, each containing 20 nodes. The total number of parameters is 921. The nonlinear activation function is chosen to be tanh\tanh. The set of training samples comprises a 301×301 uniform grid on (1,1)2(-1,1)^{2}, and an additional 1000 uniformly spaced grid points on each edge of the boundary. To test the numerical accuracy of the EVNN scheme for this problem, we also solve Eq. (61) by a finite element method, which approximates the phase-field variable φ\varphi by a piece-wise linear function φh(𝒙,t)=i=1Nγi(t)ψi(𝒙),\varphi_{h}(\bm{x},t)=\sum_{i=1}^{N}\gamma_{i}(t)\psi_{i}(\bm{x})~{}, where ψi(𝒙)\psi_{i}(\bm{x}) are hat functions supported on the computational mesh. Inserting φh\varphi_{h} into the continuous energy–dissipation law (62), we get a discrete energy–dissipation law with the discrete energy and dissipation given by

hFEM(𝜸)=e=1Neτe12|i=1Nγiψi(𝒙)|2+1ϵ2i=1N(γi21)2ψi(𝒙)d𝒙,\mathcal{F}_{h}^{\rm{FEM}}({\bm{\gamma}})=\textstyle\sum_{e=1}^{N_{e}}\int_{\tau_{e}}\frac{1}{2}\Big{|}\sum_{i=1}^{N}\gamma_{i}\nabla\psi_{i}(\bm{x})\Big{|}^{2}+\frac{1}{\epsilon^{2}}\sum_{i=1}^{N}(\gamma_{i}^{2}-1)^{2}\psi_{i}(\bm{x})\mathrm{d}\bm{x}~{},

and 𝒟hFEM(𝜸,𝜸)=e=1Neτe|i=1Nγi(t)ψi(𝒙)|2d𝒙\mathcal{D}_{h}^{\rm{FEM}}({\bm{\gamma}},{\bm{\gamma}}^{\prime})=\sum_{e=1}^{N_{e}}\int_{\tau_{e}}\Big{|}\sum_{i=1}^{N}\gamma_{i}^{\prime}(t)\psi_{i}(\bm{x})\Big{|}^{2}\mathrm{d}\bm{x} respectively. Here τe\tau_{e} is used to denote a finite element cell, and NeN_{e} is the number of cells. This form of discretization was used in [79], which constructs a piece-wise linear approximation to the nonlinear term in the discrete energy. We can update γi\gamma_{i} at each time step by solving the optimization problem (27), i.e.,

(63) 𝜸n+1=argmin𝜸N12τ𝖣(𝜸𝜸n)(𝜸𝜸n)+h(𝜸),{\bm{\gamma}}^{n+1}=\mathop{\arg\min}_{{\bm{\gamma}}\in\mathbb{R}^{N}}\frac{1}{2\tau}{\sf D}({\bm{\gamma}}-{\bm{\gamma}}^{n})\cdot({\bm{\gamma}}-{\bm{\gamma}}^{n})+\mathcal{F}_{h}({\bm{\gamma}})\ ,

where 𝖣ij=Ωψiψjd𝒙{\sf D}_{ij}=\int_{\Omega}\psi_{i}\psi_{j}\mathrm{d}\bm{x} is the mass matrix. The optimization problem (63) is solved by L-BFGS in our numerical implementation.
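For reference, the FEM update (63) can be carried out with an off-the-shelf quasi-Newton solver once the mass matrix 𝖣 and the discrete energy (together with its gradient) have been assembled. The sketch below treats the assembly as given callables; it only illustrates the structure of the solve and is not the implementation used for the reported results.

import numpy as np
from scipy.optimize import minimize

def fem_step(gamma_n, D, F_h, grad_F_h, tau):
    """One step of (63): minimize 1/(2*tau) (g - g^n)^T D (g - g^n) + F_h(g).

    D : symmetric FEM mass matrix (dense or sparse); F_h, grad_F_h : assumed
    callables returning the discrete free energy and its gradient with respect
    to the nodal values gamma.
    """
    def objective(gamma):
        diff = gamma - gamma_n
        return 0.5 / tau * diff @ (D @ diff) + F_h(gamma)

    def gradient(gamma):
        # Gradient of the quadratic term is D (gamma - gamma_n)/tau for symmetric D.
        return (D @ (gamma - gamma_n)) / tau + grad_F_h(gamma)

    res = minimize(objective, gamma_n, jac=gradient, method="L-BFGS-B")
    return res.x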

The simulation results are summarized in Fig. 4, which clearly shows that our method achieves results comparable to those of the FEM. Numerical simulation of phase-field type models is often challenging: to capture the dynamics of the evolution of the thin diffuse interface, the mesh size should be much smaller than ϵ\epsilon, the width of the diffuse interface. Traditional numerical methods often use an adaptive or moving mesh approach to overcome this difficulty. In contrast, a neural network-based numerical scheme is mesh-free. The number of parameters of the neural network can be much smaller than the number of samples needed to resolve the diffuse interface. Consequently, the dimension of the optimization problem in the NN-based scheme is much smaller than in the FEM scheme without using adaptive or moving meshes.

4.3 Generalized diffusions

In this subsection, we apply the proposed Lagrangian EVNN scheme to solving a Fokker-Planck equation and a porous medium equation.

4.3.1 Fokker-Planck equation

We first consider a Fokker-Planck equation

(64) {ρt=(ρ+ρV),𝒙d,t(0,T],ρ(𝒙,0)=ρ0(𝒙),𝒙d,\begin{cases}&\rho_{t}=\nabla\cdot(\nabla\rho+\rho\nabla V)~{},\quad\bm{x}\in\mathbb{R}^{d}~{},\quad t\in(0,T]~{},\\ &\rho(\bm{x},0)=\rho_{0}(\bm{x})~{},\quad\quad\quad\quad\bm{x}\in\mathbb{R}^{d}~{},\\ \end{cases}

where V(𝒙)V(\bm{x}) is a prescribed potential energy. The Fokker-Planck equation can be viewed as a generalized diffusion satisfying the energy-dissipation law

(65) ddtdρlnρ+ρV(𝒙)d𝒙=dρ|𝐮|2d𝒙,\frac{\mathrm{d}}{\mathrm{d}t}\int_{\mathbb{R}^{d}}\rho\ln\rho+\rho V(\bm{x})\mathrm{d}\bm{x}=-\int_{\mathbb{R}^{d}}\rho|\mathbf{u}|^{2}\mathrm{d}\bm{x}~{},

where the probability density ρ\rho satisfies the kinematics ρt+(ρ𝐮)=0\rho_{t}+\nabla\cdot(\rho\mathbf{u})=0.

We test our numerical scheme for solving the Fokker-Planck equation in 2D and 4D, respectively. For both numerical experiments, we adopt the augmented-ICNN proposed in [32]. We use 1 ICNN block of 6 fully connected layers with 32 hidden nodes in each layer. The activation function is chosen as the Gaussian-Softplus function σ(x)=2π(2x0x/2et2𝑑t+ex22+π2x)\sigma(x)=\textstyle\sqrt{\tfrac{2}{\pi}}(\sqrt{2}x\int_{0}^{x/\sqrt{2}}e^{-t^{2}}dt+e^{-\frac{x^{2}}{2}}+\sqrt{\tfrac{\pi}{2}}x). In addition, we use L-BFGS to solve the optimization problem at each time step.
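Written in terms of the error function, this activation equals \sigma(x)=x\,\mathrm{erf}(x/\sqrt{2})+\sqrt{2/\pi}\,e^{-x^{2}/2}+x, whose derivative 1+\mathrm{erf}(x/\sqrt{2}) is non-negative and non-decreasing, so σ is convex and non-decreasing, as required for ICNN activations. A minimal PyTorch implementation (ours, for illustration):

import math
import torch

def gaussian_softplus(x):
    """sigma(x) = x*erf(x/sqrt(2)) + sqrt(2/pi)*exp(-x^2/2) + x.

    Its derivative is 1 + erf(x/sqrt(2)), which is non-negative and
    non-decreasing, so sigma is convex and non-decreasing.
    """
    return x * torch.erf(x / math.sqrt(2.0)) \
        + math.sqrt(2.0 / math.pi) * torch.exp(-0.5 * x ** 2) + x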

Refer to caption
Figure 5: Numerical results for the 2D Fokker–Planck equation. (a)-(d) Predicted solution on a 301×301 uniform mesh on (3,3)2(-3,3)^{2} at t=0, 0.1, 0.5, and 1, respectively. (e)-(f) The absolute and relative l2l_{2}-errors of the solution over time, evaluated on the uniform mesh.

In the 2D case, we consider V(𝒙)V(\bm{x}) as follows:

V=12(𝒙μtarget)TΣtarget1(𝒙μtarget)withμtarget=(13,13),\textstyle V=\frac{1}{2}(\bm{x}-\mu_{target})^{T}\Sigma^{-1}_{target}(\bm{x}-\mu_{target})~{}~{}{\rm with}~{}~{}\mu_{target}=\left(\frac{1}{3},\frac{1}{3}\right)~{},

and Σtarget=[58383858]\Sigma_{target}=\begin{bmatrix}\frac{5}{8}&-\frac{3}{8}\\ -\frac{3}{8}&\frac{5}{8}\end{bmatrix}. The initial condition ρ0(𝒙)\rho_{0}(\bm{x}) is set to be the 2D standard Gaussian 𝒩(𝟎,𝐈)\mathcal{N}({\bf 0},{\bf I}). The exact solution of Eq. (64) takes the following analytical form [33]

(66) ρ(𝒙,t)𝒩(μ(t),Σ(t)),\rho(\bm{x},t)\sim\mathcal{N}(\mu(t),\Sigma(t))~{},

where μ(t)=(1𝐞4t)μtarget\mu(t)=(1-\mathbf{e}^{-4t})\mu_{target} and Σ(t)=[58+38×e8t38+38×e8t38+38×e8t58+38×e8t]\Sigma(t)=\begin{bmatrix}\frac{5}{8}+\frac{3}{8}\times e^{-8t}&-\frac{3}{8}+\frac{3}{8}\times e^{-8t}\\ -\frac{3}{8}+\frac{3}{8}\times e^{-8t}&\frac{5}{8}+\frac{3}{8}\times e^{-8t}\end{bmatrix}. We draw 10000 samples from ρ0(𝒙)\rho_{0}(\bm{x}) as the training set. As an advantage of the neural network-based algorithm, we can compute ρn(𝒙)\rho^{n}(\bm{x}) pointwise through a function composition ρn(𝒙)=ρ0~(Φn)1(𝒙)\rho^{n}(\bm{x})=\rho_{0}\tilde{\circ}(\Phi^{n})^{-1}(\bm{x}), where (Φn)1(𝒙)(\Phi^{n})^{-1}(\bm{x}) can be computed by solving a convex optimization problem, i.e., Φ1(𝒙)=argmin𝒚Φ(𝒚)𝒙T𝒚\Phi^{-1}(\bm{x})=\mathop{\arg\min}_{{\bm{y}}}\Phi({\bm{y}})-\bm{x}^{\rm T}{\bm{y}}. Fig. 5(a)-(d) shows the predicted density on a 301×301 uniform grid on (3,3)2(-3,3)^{2}. The absolute and relative l2l_{2}-errors on the grid are shown in Figs. 5(e)-(f). It can be noticed that the relative l2l^{2}-error for the predicted solution is around 10210^{-2}, which is quite accurate given the limited number of samples used.
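To make the inversion step concrete, the following per-point sketch solves argmin_y Φ(y) - x^T y by gradient-based optimization, assuming Φ is the trained convex potential. The plain Adam loop and all names are illustrative; in practice a Newton-type or L-BFGS solver may be preferable for this convex problem.

import torch

def invert_potential_map(phi, x, n_iter=200, lr=0.1):
    """Solve argmin_y phi(y) - <x, y> for each row of x (a convex problem).

    phi : trained convex potential Phi(y) (e.g. an ICNN), returning shape (N, 1);
    x   : tensor of target points, shape (N, d).
    The minimizer y satisfies grad Phi(y) = x, i.e. y is the preimage of x
    under the gradient map.
    """
    y = x.clone().detach().requires_grad_(True)   # warm start at x itself
    optimizer = torch.optim.Adam([y], lr=lr)
    for _ in range(n_iter):
        optimizer.zero_grad()
        loss = (phi(y).squeeze() - (x * y).sum(dim=1)).sum()
        loss.backward()
        optimizer.step()
    return y.detach()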

Refer to caption
Figure 6: Numerical results for the 4D Fokker–Planck equation. (a)-(d) Distribution of samples, projected onto the x1x_{1}-x2x_{2} plane, together with their weights, at t=0, 0.1, 0.5, and 1, respectively. (e)-(h) Distribution of samples, projected onto the x3x_{3}-x4x_{4} plane, together with their weights, at t=0, 0.1, 0.5, and 1, respectively. (i)-(j) The l2l_{2}-errors and the relative l2l^{2}-errors of the solution over time, evaluated on the training set.

In the 4D case, we take V(𝒙)V(\bm{x}) to be

V=12(𝒙μtarget)TΣtarget1(𝒙μtarget)withμtarget=(13,13,0,0),V=\tfrac{1}{2}(\bm{x}-\mu_{target})^{T}\Sigma^{-1}_{target}(\bm{x}-\mu_{target})\quad{\rm with}\quad\mu_{target}=\left(\tfrac{1}{3},\tfrac{1}{3},0,0\right)~{},

and Σtarget=[58383858][1001]\Sigma_{target}=\begin{bmatrix}\frac{5}{8}&-\frac{3}{8}\\ -\frac{3}{8}&\frac{5}{8}\end{bmatrix}\bigoplus\begin{bmatrix}1&0\\ 0&1\end{bmatrix}. The initial condition ρ0(𝒙)\rho_{0}(\bm{x}) is set to be a 4D standard normal distribution 𝒩(𝟎,𝐈)\mathcal{N}({\bf 0},{\bf I}). The exact solution of Eq. (64) follows the normal distribution with μ(t)=((1𝐞4t)13,(1𝐞4t)13,0,0),\mu(t)=\left((1-\mathbf{e}^{-4t})\tfrac{1}{3},(1-\mathbf{e}^{-4t})\tfrac{1}{3},0,0\right), and Σ(t)=[58+38×e8t38+38×e8t38+38×e8t58+38×e8t][1001].\Sigma(t)=\begin{bmatrix}\frac{5}{8}+\frac{3}{8}\times e^{-8t}&-\frac{3}{8}+\frac{3}{8}\times e^{-8t}\\ -\frac{3}{8}+\frac{3}{8}\times e^{-8t}&\frac{5}{8}+\frac{3}{8}\times e^{-8t}\end{bmatrix}\bigoplus\begin{bmatrix}1&0\\ 0&1\end{bmatrix}. Here \bigoplus denotes the direct sum. As in the 2D case, we draw 10000 initial samples from ρ0(𝒙)\rho_{0}(\bm{x}). Fig. 6 shows the evolution of the samples as well as the values of the numerical solution on individual samples over space and time. Interestingly, the relative l2l^{2}-error in the 4D case is similar to that in the 2D case. This dimension-insensitive accuracy suggests the potential of the current method for handling higher-dimensional problems.

Fokker–Planck type equations have a wide range of applications in machine learning. One of the fundamental tasks in modern statistics and machine learning is to estimate or generate samples from a target distribution ρ(𝒙)\rho^{*}(\bm{x}), which might be completely known, partially known up to the normalizing constant, or given empirically by samples. Examples include Bayesian inference [7], numerical integration [54], space-filling design [61], density estimation [69], and generative learning [59]. These problems can be cast as an optimization problem: one seeks ρopt𝒬\rho^{\rm opt}\in\mathcal{Q} by solving

(67) ρopt=argminρ𝒬D(ρ||ρ),\rho^{\rm opt}=\mathop{\arg\min}_{\rho\in\mathcal{Q}}D(\rho||\rho^{*})~{},

where 𝒬\mathcal{Q} is the admissible set and D(p||q)D(p||q) is a dissimilarity function that measures the difference between two probability measures pp and qq. Classical dissimilarities include the Kullback–Leibler (KL) divergence and the Maximum Mean Discrepancy (MMD). The optimal solution of the optimization problem can be obtained by solving a Fokker–Planck type equation, given by

(68) ρt=(ρμ),μ=δD(ρ||ρ)δρ.\rho_{t}=\nabla\cdot(\rho\nabla\mu),\quad\mu=\frac{\delta D(\rho||\rho^{*})}{\delta\rho}.

The developed numerical approach has potential applications in these machine learning problems.

Refer to caption
Figure 7: The distribution of samples and their weights at different times for the Fokker–Planck equation with a complicated V(𝒙)V(\bm{x}), corresponding to an eight-component Gaussian mixture distribution. The initial 10000 samples are drawn from a standard normal distribution 𝒩(𝟎,𝐈)\mathcal{N}({\bf 0},{\bf I}).

To illustrate this point, we consider a toy problem that is widely used in the machine learning literature [32]. The goal is to sample from a target distribution ρ\rho^{*}, which is known up to a normalizing constant. Taking the dissimilarity function to be the KL divergence KL(ρ||ρ)=Ωρ(𝒙)ln(ρρ)d𝒙,{\rm KL}(\rho||\rho^{*})=\int_{\Omega}\rho(\bm{x})\ln\left(\frac{\rho}{\rho^{*}}\right)\mathrm{d}\bm{x}, Eq. (68) reduces to the Fokker–Planck equation (64) with V(𝒙)=lnρV(\bm{x})=-\ln\rho^{*}. In the numerical experiment, we take the target distribution ρ(𝒙)=18i=18N(𝒙|𝝁i,𝚺)\rho^{*}(\bm{x})=\frac{1}{8}\sum_{i=1}^{8}N(\bm{x}|\bm{\mu}_{i},\bm{\Sigma}), an eight-component Gaussian mixture distribution, where 𝝁1=(0,4)\bm{\mu}_{1}=(0,4), 𝝁2=(2.8,2.8)\bm{\mu}_{2}=(2.8,2.8), 𝝁3=(4,0)\bm{\mu}_{3}=(4,0), 𝝁4=(2.8,2.8)\bm{\mu}_{4}=(-2.8,2.8), 𝝁5=(4,0)\bm{\mu}_{5}=(-4,0), 𝝁6=(2.8,2.8)\bm{\mu}_{6}=(-2.8,-2.8), 𝝁7=(0,4)\bm{\mu}_{7}=(0,-4), 𝝁8=(2.8,2.8)\bm{\mu}_{8}=(2.8,-2.8), and 𝚺=diag(0.2,0.2)\bm{\Sigma}=\mathrm{diag}(0.2,0.2). The simulation results are summarized in Fig. 7, in which we first draw 10000 samples from a standard normal distribution 𝒩(𝟎,𝐈)\mathcal{N}({\bf 0},{\bf I}) and show the distribution of samples as well as their weights at different times. It can be noticed that the proposed neural network-based algorithm can generate weighted samples for a complicated target distribution.
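In this example the drift potential is V(𝒙) = -ln ρ*(𝒙), and since the dynamics only needs V up to an additive constant, the unnormalized log-density of the mixture suffices. A short sketch (ours, for illustration) using a log-sum-exp evaluation:

import torch

# The eight mixture means and the shared isotropic covariance diag(0.2, 0.2).
MU = torch.tensor([[0.0, 4.0], [2.8, 2.8], [4.0, 0.0], [-2.8, 2.8],
                   [-4.0, 0.0], [-2.8, -2.8], [0.0, -4.0], [2.8, -2.8]])
SIGMA2 = 0.2

def potential_V(x):
    """V(x) = -log rho*(x) up to an additive constant, rho* the 8-mode mixture."""
    # Squared distances to each mean: shape (N, 8).
    sq = ((x.unsqueeze(1) - MU.unsqueeze(0)) ** 2).sum(dim=-1)
    # Equal weights and covariances, so normalizing constants can be dropped.
    log_rho = torch.logsumexp(-0.5 * sq / SIGMA2, dim=1)
    return -log_rho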

4.3.2 Porous medium equation

Next, we consider a porous medium equation (PME), ρt=Δρα,\rho_{t}=\Delta\rho^{\alpha}~{}, where α>1\alpha>1 is a constant. The PME is a typical example of a nonlinear diffusion equation. One important feature of the PME is that its solution has compact support at any time t>0t>0 if the initial data has compact support. The free boundary of the compact support moves outward with a finite speed, known as the property of finite-speed propagation [71]. As a consequence, numerical simulations of the PME are often difficult for Eulerian methods, which may fail to capture the movement of the free boundary and suffer from numerical oscillations [50]. In a recent work [50], the authors developed a variational Lagrangian scheme using a finite element method. Here we show the ability of the Lagrangian EVNN scheme to solve the PME with a free boundary. Following [50], we employ the energy-dissipation law

(69) ddtdα(α1)(α2)ρα1d𝒙=d|𝐮|2d𝒙\frac{\mathrm{d}}{\mathrm{d}t}\int_{\mathbb{R}^{d}}\frac{\alpha}{(\alpha-1)(\alpha-2)}\rho^{\alpha-1}\mathrm{d}\bm{x}=-\int_{\mathbb{R}^{d}}|\mathbf{u}|^{2}\mathrm{d}\bm{x}~{}

to develop the EVNN scheme.

Refer to caption
Figure 8: Numerical results for the porous medium equation. (a)-(d) The numerical solution at t=0, 0.05, 0.25, and 0.5, respectively. (e)-(f) The L2L_{2}-errors of the solution.

We take α=4\alpha=4 in the simulation. To test the numerical accuracy of the EVNN scheme for solving the PME, we consider a 2D Barenblatt–Pattle solution to the PME. The Barenblatt–Pattle solution in dd-dimensional space is given by

(70) Bα(𝒙,t)=tk[(C0k(α1)2dα|𝒙|2t2k/d)+]1/(α1),𝒙d,B_{\alpha}(\bm{x},t)=\textstyle t^{-k}\bigl{[}\bigl{(}C_{0}-\frac{k(\alpha-1)}{2d\alpha}\frac{|\bm{x}|^{2}}{t^{2k/d}}\bigr{)}_{+}\bigr{]}^{1/(\alpha-1)}~{},\quad\bm{x}\in\mathbb{R}^{d}~{},

where k=(α1+2/d)1k=(\alpha-1+2/d)^{-1}, u+:=max(u,0)u_{+}:=\max(u,0), and C0C_{0} is a constant that is related to the initial mass. This solution is radially symmetric, self-similar, and has compact support |𝒙|ξα(t)|\bm{x}|\leq\xi_{\alpha}(t) for any finite time, where ξα(t)=2dαC0k(α1)tk/d\xi_{\alpha}(t)=\textstyle\sqrt{\frac{2d\alpha C_{0}}{k(\alpha-1)}}t^{k/d}. We take the Barenblatt solution (70) with C0=0.1C_{0}=0.1 at t=0.1t=0.1 as the initial condition, and compare the numerical solution at TT with the exact solution Bα(𝒙,T+0.1)B_{\alpha}(\bm{x},T+0.1). We choose {xi0}i=1N\{x_{i}^{0}\}_{i=1}^{N} as a set of uniform grid points inside a disk {(x,y)|x2+y2ξα(0.1)}\{(x,y)~{}|~{}\sqrt{x^{2}+y^{2}}\leq\xi_{\alpha}(0.1)\}. To visualize the numerical results, we also draw 500 points distributed with equal arc length on the initial free boundary and evolve them using the neural network. The numerical results with τ=0.005\tau=0.005 are shown in Fig. 8. It can be noticed that the proposed neural network-based algorithm approximates the Barenblatt–Pattle solution well and captures the movement of the free boundary.
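For completeness, the Barenblatt–Pattle profile (70) and the free-boundary radius ξ_α(t) used in the error evaluation can be coded directly; the following sketch (ours) is vectorized over sample points.

import numpy as np

def barenblatt(x, t, alpha=4, d=2, C0=0.1):
    """Barenblatt-Pattle solution (70) of rho_t = Laplace(rho^alpha) in R^d."""
    k = 1.0 / (alpha - 1.0 + 2.0 / d)
    r2 = np.sum(x ** 2, axis=-1)
    core = C0 - k * (alpha - 1.0) / (2.0 * d * alpha) * r2 / t ** (2.0 * k / d)
    return t ** (-k) * np.maximum(core, 0.0) ** (1.0 / (alpha - 1.0))

def free_boundary_radius(t, alpha=4, d=2, C0=0.1):
    """Support radius xi_alpha(t) of the Barenblatt-Pattle solution."""
    k = 1.0 / (alpha - 1.0 + 2.0 / d)
    return np.sqrt(2.0 * d * alpha * C0 / (k * (alpha - 1.0))) * t ** (k / d)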

We note that numerical methods for generalized diffusions (4) can also be formulated in the Eulerian frame of reference based on the notion of Wasserstein gradient flows [35]. However, it is often challenging to compute the Wasserstein distance between two probability densities efficiently and with high accuracy, which typically requires solving an additional min-max or minimization problem [5, 6, 9, 33]. In a recent study [33], the authors first employ a fully connected neural network to approximate ρ\rho and subsequently utilize an additional ICNN to calculate the Wasserstein distance between two probability densities. Compared with their approach, our method is more efficient and accurate.

5 Conclusion

In this paper, we develop structure-preserving EVNN schemes for simulating the L2L^{2}-gradient flows and the generalized diffusions by utilizing a neural network as a tool for spatial discretization, within the framework of the discrete energetic variational approach. These numerical schemes are directly constructed based on a prescribed continuous energy-dissipation law for the system, without the need for the underlying PDE. The incorporation of mesh-free neural network discretization opens up exciting possibilities for tackling high-dimensional gradient flows arising in different applications in the future.

Various numerical experiments are presented to demonstrate the accuracy and energy stability of the proposed numerical scheme. In our future work, we will explore the effects of different neural network architectures, sampling strategies, and optimization methods, followed by a detailed numerical analysis. Additionally, we intend to employ the EVNN scheme to investigate other complex fluid models, including the Cahn–Hilliard equation and Cahn–Hilliard–Navier–Stokes equations, as well as solve machine learning problems such as generative modeling and density estimation.

Acknowledgments

C. L. and Y. W. were partially supported by NSF DMS-1950868 and DMS-2153029. Z. X. was partially supported by NSF CDS&E-MSS 1854779.

References

  • [1] S. Adams, N. Dirr, M. Peletier, and J. Zimmer, Large deviations and gradient flows, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371 (2013), p. 20120341.
  • [2] B. Amos, L. Xu, and J. Z. Kolter, Input convex neural networks, in International Conference on Machine Learning, PMLR, 2017, pp. 146–155.
  • [3] A. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory, 39 (1993), pp. 930–945.
  • [4] R. Bellman, Dynamic Programming, Dover Publications, 1957.
  • [5] J.-D. Benamou and Y. Brenier, A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem, Numerische Mathematik, 84 (2000), pp. 375–393.
  • [6] J.-D. Benamou, G. Carlier, and M. Laborde, An augmented lagrangian approach to wasserstein gradient flows and applications, ESAIM: Proceedings and surveys, 54 (2016), pp. 1–17.
  • [7] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, Variational inference: A review for statisticians, J. Am. Stat. Assoc., 112 (2017), pp. 859–877.
  • [8] J. Bruna, B. Peherstorfer, and E. Vanden-Eijnden, Neural galerkin scheme with active learning for high-dimensional evolution equations, arXiv preprint arXiv:2203.01360, (2022).
  • [9] J. A. Carrillo, K. Craig, L. Wang, and C. Wei, Primal dual methods for wasserstein gradient flows, Foundations of Computational Mathematics, 22 (2022), pp. 389–443.
  • [10] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems, 2 (1989), pp. 303–314.
  • [11] P.-G. De Gennes and J. Prost, The physics of liquid crystals, no. 83, Oxford university press, 1993.
  • [12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, (2018).
  • [13] T. Dockhorn, A discussion on solving partial differential equations using neural networks, arXiv preprint arXiv:1904.07200, (2019).
  • [14] M. Doi, Onsager’s variational principle in soft matter, J. Phys.: Condens. Matter, 23 (2011), p. 284118.
  • [15] Q. Du and X. Feng, The phase field method for geometric moving interfaces and their numerical approximations, Handbook of Numerical Analysis, 21 (2020), pp. 425–508.
  • [16] Y. Du and T. A. Zaki, Evolutional deep neural network, Phys. Rev. E, 104 (2021), p. 045303.
  • [17] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios, Neural spline flows, Advances in neural information processing systems, 32 (2019).
  • [18] W. E, J. Han, and A. Jentzen, Algorithms for solving high dimensional pdes: From nonlinear Monte Carlo to machine learning, Nonlinearity, 35 (2021), p. 278.
  • [19] W. E, C. Ma, and L. Wu, Machine learning from a continuous viewpoint, i, Science China Mathematics, 63 (2020), pp. 2233–2266.
  • [20] W. E, C. Ma, L. Wu, and S. Wojtowytsch, Towards a mathematical understanding of neural network-based machine learning: What we know and what we don’t, CSIAM Transactions on Applied Mathematics, 1 (2020), pp. 561–615.
  • [21] W. E and B. Yu, The deep ritz method: A deep learning-based numerical algorithm for solving variational problems, Communications in Mathematics and Statistics, 6 (2018).
  • [22] B. Eisenberg, Y. Hyon, and C. Liu, Energy variational analysis of ions in water and channels: Field theory for primitive models of complex ionic fluids, The Journal of Chemical Physics, 133 (2010), p. 104104.
  • [23] J. L. Ericksen, Introduction to the Thermodynamics of Solids, vol. 275, Springer, 1998.
  • [24] M.-H. Giga, A. Kirshtein, and C. Liu, Variational modeling and complex fluids, Handbook of mathematical analysis in mechanics of viscous fluids, (2017), pp. 1–41.
  • [25] E. D. Giorgi, Movimenti minimizzanti, in Proceedings of the Conference on Aspetti e problemi della Matematica oggi, Lecce, 1992.
  • [26] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings, 2010, pp. 249–256.
  • [27] J. Han, A. Jentzen, and W. E, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505–8510.
  • [28] B. Hanin, Which neural net architectures give rise to exploding and vanishing gradients?, in NeurIPS, 2018.
  • [29] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [30]  , Identity mappings in deep residual networks, in European conference on computer vision, Springer, 2016, pp. 630–645.
  • [31] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal processing magazine, 29 (2012), pp. 82–97.
  • [32] C.-W. Huang, R. T. Chen, C. Tsirigotis, and A. Courville, Convex potential flows: Universal probability distributions with optimal transport and convex optimization, arXiv preprint arXiv:2012.05942, (2020).
  • [33] H. J. Hwang, C. Kim, M. S. Park, and H. Son, The deep minimizing movement scheme, arXiv preprint arXiv:2109.14851, (2021).
  • [34] K. Jiang and P. Zhang, Numerical methods for quasicrystals, J. Comput. Phys., 256 (2014), pp. 428–440.
  • [35] R. Jordan, D. Kinderlehrer, and F. Otto, The variational formulation of the Fokker–Planck equation, SIAM J. Math. Anal., 29 (1998), pp. 1–17.
  • [36] E. F. Keller and L. A. Segel, Initiation of slime mold aggregation viewed as an instability, Journal of theoretical biology, 26 (1970), pp. 399–415.
  • [37] E. Kharazmi, Z. Zhang, and G. E. Karniadakis, hp-vpinns: Variational physics-informed neural networks with domain decomposition, Computer Methods in Applied Mechanics and Engineering, 374 (2021), p. 113547.
  • [38] Y. Khoo, J. Lu, and L. Ying, Solving parametric PDE problems with artificial neural networks, European Journal of Applied Mathematics, 32 (2021), pp. 421–435.
  • [39] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, Improved variational inference with inverse autoregressive flow, Advances in neural information processing systems, 29 (2016).
  • [40] J. F. Kolen and S. C. Kremer, Gradient Flow in Recurrent Nets: The Difficulty of Learning LongTerm Dependencies, 2001, pp. 237–243.
  • [41] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [42] I. E. Lagaris, A. Likas, and D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE transactions on neural networks, 9 (1998), pp. 987–1000.
  • [43] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324.
  • [44] R. Lifshitz and H. Diamant, Soft quasicrystals–why are they stable?, Philosophical Magazine, 87 (2007), pp. 3021–3030.
  • [45] F. H. Lin and C. Liu, Static and dynamic theories of liquid crystals, J. Partial Differential Equations, 14 (2001), pp. 289–330.
  • [46] S. Lisini, D. Matthes, and G. Savaré, Cahn–Hilliard and thin film equations with nonlinear mobility as gradient flows in weighted-wasserstein metrics, Journal of differential equations, 253 (2012), pp. 814–850.
  • [47] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, A survey on deep learning in medical image analysis, Medical image analysis, 42 (2017), pp. 60–88.
  • [48] C. Liu and H. Sun, On energetic variational approaches in modeling the nematic liquid crystal flows, Discrete & Continuous Dynamical Systems, 23 (2009), p. 455.
  • [49] C. Liu, C. Wang, and Y. Wang, A structure-preserving, operator splitting scheme for reaction-diffusion equations with detailed balance, Journal of Computational Physics, 436 (2021), p. 110253.
  • [50] C. Liu and Y. Wang, On lagrangian schemes for porous medium type generalized diffusion equations: a discrete energetic variational approach, J. Comput. Phys., 417 (2020), p. 109566.
  • [51]  , A variational lagrangian scheme for a phase-field model: A discrete energetic variational approach, SIAM J. Sci. Comput., 42 (2020), pp. B1541–B1569.
  • [52] L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis, Deepxde: A deep learning library for solving differential equations, SIAM Review, 63 (2021), pp. 208–228.
  • [53] Y. Lu, J. Lu, and M. Wang, A priori generalization analysis of the deep ritz method for solving high dimensional elliptic partial differential equations, in Conference on Learning Theory, PMLR, 2021, pp. 3196–3241.
  • [54] T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák, Neural importance sampling, ACM Transactions on Graphics (TOG), 38 (2019), pp. 1–19.
  • [55] J. Noh, Y. Wang, H.-L. Liang, V. S. R. Jampani, A. Majumdar, and J. P. Lagerwall, Dynamic tuning of the director field in liquid crystal shells using block copolymers, Phys. Rev. Res., 2 (2020), p. 033160.
  • [56] T. Ohta and K. Kawasaki, Equilibrium morphology of block copolymer melts, Macromolecules, 19 (1986), pp. 2621–2632.
  • [57] L. Onsager, Reciprocal relations in irreversible processes. I., Phys. Rev., 37 (1931), p. 405.
  • [58]  , Reciprocal relations in irreversible processes. II., Phys. Rev., 38 (1931), p. 2265.
  • [59] G. Papamakarios, E. T. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan, Normalizing flows for probabilistic modeling and inference., J. Mach. Learn. Res., 22 (2021), pp. 1–64.
  • [60] M. A. Peletier, Variational modelling: Energies, gradient flows, and large deviations, arXiv preprint arXiv:1402.1990, (2014).
  • [61] L. Pronzato and W. G. Müller, Design of computer experiments: space filling and beyond, Statistics and Computing, 22 (2012), pp. 681–701.
  • [62] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics informed deep learning (part i): Data-driven solutions of nonlinear partial differential equations, arXiv preprint arXiv:1711.10561, (2017).
  • [63] L. Rayleigh, Some general theorems relating to vibrations, Proceedings of the London Mathematical Society, 4 (1873), pp. 357–368.
  • [64] D. Rezende and S. Mohamed, Variational inference with normalizing flows, in International conference on machine learning, PMLR, 2015, pp. 1530–1538.
  • [65] G. M. Rotskoff, A. R. Mitchell, and E. Vanden-Eijnden, Active importance sampling for variational objectives dominated by rare events: Consequences for optimization and generalization, arXiv preprint arXiv:2008.06334, (2020).
  • [66] A. J. Salgado and S. M. Wise, Classical Numerical Analysis: A Comprehensive Course, Cambridge University Press, 2022.
  • [67] Z. Shen, H. Yang, and S. Zhang, Nonlinear approximation via compositions, arXiv preprint arXiv:1902.10170, (2019).
  • [68] J. Sirignano and K. Spiliopoulos, Dgm: A deep learning algorithm for solving partial differential equations, J. Comput. Phys., 375 (2018), pp. 1339–1364.
  • [69] E. G. Tabak, E. Vanden-Eijnden, et al., Density estimation by dual ascent of the log-likelihood, Communications in Mathematical Sciences, 8 (2010), pp. 217–233.
  • [70] B. P. van Milligen, V. Tribaldos, and J. Jiménez, Neural network differential equation and plasma equilibrium solver, Phys. Rev. Lett., 75 (1995), p. 3594.
  • [71] J. L. Vázquez, The porous medium equation: mathematical theory, Oxford University Press, 2007.
  • [72] H. Wang, T. Qian, and X. Xu, Onsager’s variational principle in active soft matter, Soft Matter, 17 (2021), pp. 3634–3653.
  • [73] Y. Wang, J. Chen, C. Liu, and L. Kang, Particle-based energetic variational inference, Stat. Comput., 31 (2021), pp. 1–17.
  • [74] Y. Wang and C. Liu, Some recent advances in energetic variational approaches, Entropy, 24 (2022).
  • [75] Y. Wang, C. Liu, P. Liu, and B. Eisenberg, Field theory of reaction-diffusion: Law of mass action with an energetic variational approach, Phys. Rev. E, 102 (2020), p. 062147.
  • [76] Y. Wang, T.-F. Zhang, and C. Liu, A two species micro–macro model of wormlike micellar solutions and its maximum entropy closure approximations: An energetic variational approach, Journal of Non-Newtonian Fluid Mechanics, 293 (2021), p. 104559.
  • [77] Q. Wei, Y. Jiang, and J. Z. Y. Chen, Machine-learning solver for modified diffusion equations, Phys. Rev. E, 98 (2018), p. 053304.
  • [78] E. Weinan, Machine learning and computational mathematics, arXiv preprint arXiv:2009.14596, (2020).
  • [79] J. Xu, Y. Li, S. Wu, and A. Bousquet, On the stability and accuracy of partially and fully implicit schemes for phase field modeling, Computer Methods in Applied Mechanics and Engineering, 345 (2019), pp. 826–853.
  • [80] S. Xu, P. Sheng, and C. Liu, An energetic variational approach for ion transport, Communications in Mathematical Sciences, 12 (2014), pp. 779–789.
  • [81] S. Xu, Z. Xu, O. V. Kim, R. I. Litvinov, J. W. Weisel, and M. S. Alber, Model predictions of deformation, embolization and permeability of partially obstructive blood clots under variable shear flow, Journal of The Royal Society Interface, 14 (2017).
  • [82] X. Xu, Y. Di, and M. Doi, Variational method for liquids moving on a substrate, Physics of Fluids, 28 (2016), p. 087101.
  • [83] P. Yue, J. J. Feng, C. Liu, and J. Shen, A diffuse-interface method for simulating two-phase flows of complex fluids, J. Fluid Mech., 515 (2004), pp. 293–317.
  • [84] Y. Zang, G. Bao, X. Ye, and H. Zhou, Weak adversarial networks for high-dimensional partial differential equations, J. Comput. Phys., 411 (2020), p. 109409.