
Energetic Variational Neural Network Discretizations of Gradient Flows

Ziqing Hu (Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, 102G Crowley Hall, Notre Dame, IN 46556 USA, [email protected])    Chun Liu (Department of Applied Mathematics, Illinois Institute of Technology, Chicago, IL 60616, USA, [email protected])    Yiwei Wang (Corresponding author. Department of Mathematics, University of California, Riverside, Riverside, CA 92521, USA, [email protected])    Zhiliang Xu (Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, 102G Crowley Hall, Notre Dame, IN 46556 USA, [email protected])
Abstract

We present a structure-preserving Eulerian algorithm for solving $L^{2}$-gradient flows and a structure-preserving Lagrangian algorithm for solving generalized diffusions. Both algorithms employ neural networks as tools for spatial discretization. Unlike most existing methods that construct numerical discretizations based on the strong or weak form of the underlying PDE, the proposed schemes are constructed based on the energy-dissipation law directly. This guarantees the monotonic decay of the system’s free energy, which avoids unphysical states of solutions and is crucial for the long-term stability of numerical computations. To address challenges arising from nonlinear neural network discretization, we perform temporal discretizations on these variational systems before spatial discretizations. This approach is computationally memory-efficient when implementing neural network-based algorithms. The proposed neural network-based schemes are mesh-free, allowing us to solve gradient flows in high dimensions. Various numerical experiments are presented to demonstrate the accuracy and energy stability of the proposed numerical schemes.

1 Introduction

Evolution equations with variational structures, often termed gradient flows, have a wide range of applications in physics, materials science, biology, and machine learning [14, 60, 78]. These systems not only possess but are also determined by an energy-dissipation law, which consists of an energy of the state variables that describes the energetic coupling and competition, and a dissipative mechanism that drives the system to an equilibrium.

More precisely, beginning with a prescribed energy-dissipation law

(1) \frac{\mathrm{d}}{\mathrm{d}t}E^{\rm total}=-\triangle\leq 0,

where $E^{\rm total}$ is the sum of the Helmholtz free energy $\mathcal{F}$ and the kinetic energy $\mathcal{K}$, and $\triangle$ is the rate of energy dissipation, one can derive the dynamics of the system, i.e., the corresponding partial differential equation (PDE), by combining the Least Action Principle (LAP) and the Maximum Dissipation Principle (MDP). The LAP states that the equation of motion of a Hamiltonian system can be derived from the variation of the action functional $\mathcal{A}=\int_{0}^{T}(\mathcal{K}-\mathcal{F})\mathrm{d}t$ with respect to the state variable $\bm{x}$. This gives a unique procedure to derive the “conservative force” in the system. The MDP, on the other hand, derives the “dissipative force” by taking the variation of the dissipation potential $\mathcal{D}$, which equals $\frac{1}{2}\triangle$ in the linear response regime (near equilibrium), with respect to $\bm{x}_{t}$. In turn, the force balance condition leads to the PDE of the system:

(2) \frac{\delta\mathcal{D}}{\delta\bm{x}_{t}}=\frac{\delta\mathcal{A}}{\delta\bm{x}}.

This procedure is known as the energetic variational approach (EnVarA) [24, 48, 74], developed based on the celebrated work of Onsager [57, 58] and Rayleigh [63] in non-equilibrium thermodynamics. During the past decades, the framework of EnVarA has been shown to be a powerful tool for developing thermodynamically consistent models for various complex fluid systems, including two-phase flows [80, 81, 83], liquid crystals [48], ionic solutions [22], and reactive fluids [75, 76]. In a certain sense, the energy-dissipation law (1), which comes from the first and second laws of thermodynamics [23], provides a more intrinsic description of the underlying physics than the derived PDE (2), particularly for systems that possess multiple structures and involve multiscale and multiphysics coupling.

From a numerical standpoint, the energy-dissipation law (1) also serves as a valuable guideline for developing structure-preserving numerical schemes for these variational systems, as many straightforward PDE-based discretizations may fail to preserve the continuous variational structures, as well as the physical constraints, such as the conservation of mass or momentum, the positivity of mass density, and the dissipation of the energy. As a recent development in this field [50], the authors proposed a numerical framework, called a discrete energetic variational approach, to address these challenges. Similar methods are also used in [72, 82]. The key idea of this approach is to construct numerical discretizations directly based on the continuous energy-dissipation laws without using the underlying PDE (see Section 3.1 for details). The approach has advantages in preserving a discrete counterpart of the continuous energy dissipation law, which is crucial for avoiding unphysical solutions and the long-term stability of numerical computations. Within the framework of the discrete energetic variational approach, Eulerian [55], Lagrangian [50, 51], and particle methods [73] have been developed for various problems.

However, tackling high-dimensional variational problems remains a challenge. Traditional numerical discretizations, such as finite difference, finite element, and finite volume methods, suffer from the well-known curse of dimensionality (CoD) [27]. Namely, the computational complexity of the numerical algorithm increases exponentially as a function of the dimensionality of the problem [4, 27]. Although particle methods hold promise for addressing high-dimensional problems, standard particle methods often lack accuracy and are not suitable for problems involving derivatives.

In recent years, neural networks (NNs) have demonstrated remarkable success across a wide spectrum of scientific disciplines [12, 31, 41, 43, 47]. Leveraging the potent expressive power of neural networks [3, 10, 20], particularly deep neural network (DNN) architectures, there is growing interest in developing neural network-based algorithms for PDEs, especially for those in high dimensions [13, 21, 37, 38, 42, 62, 68, 70, 77]. Examples include the physics-informed neural network (PINN) [62], deep Ritz method (DRM) [21], deep Galerkin method (DGM) [68], variational PINN [37], and weak adversarial network (WAN) [84], to name a few. A key component of these approaches is to represent solutions of PDEs via NNs. All of these approaches determine the optimal parameters of the NN by minimizing a loss function, which is often derived from either the strong or weak form of the PDE (see Section 2.2 for more details). By employing NN approximations, the approximate solution belongs to a space of nonlinear functions, which may lead to a robust estimation by sparser representation and cheaper computation [37].

The goal of this paper is to combine neural network-based spatial discretizations with the framework of the discrete energetic variational approach [50] to develop efficient and robust numerical schemes, termed energetic variational neural network (EVNN) methods. To clarify the idea, we consider the following two types of gradient flows modeled by EnVarA:

  • $L^{2}$-gradient flow, which satisfies an energy-dissipation law

    (3) \frac{\mathrm{d}}{\mathrm{d}t}\mathcal{F}[\varphi]=-\int_{\Omega}\eta(\varphi)|\varphi_{t}|^{2}\mathrm{d}\bm{x}.
  • Generalized diffusion, which satisfies an energy-dissipation law

    (4) \frac{\mathrm{d}}{\mathrm{d}t}\mathcal{F}[\rho]=-\int_{\Omega}\eta(\rho)|\mathbf{u}|^{2}\mathrm{d}\bm{x},

    where $\rho$ satisfies the mass conservation law $\rho_{t}+\nabla\cdot(\rho\mathbf{u})=0$.

Derivation of the underlying PDEs for these types of systems is described in Sections 2.1.1 and 2.1.2 of the paper.

Our primary aim is to develop structure-preserving Eulerian algorithms to solve $L^{2}$-gradient flows and structure-preserving Lagrangian algorithms to solve generalized diffusions based on their energy-dissipation laws, by utilizing neural networks as a new tool of spatial discretization. To overcome difficulties arising from neural network discretization, we develop a discretization approach that performs temporal discretizations before spatial discretizations. This approach leads to a computer-memory-efficient implementation of neural network-based algorithms. Since neural networks are advantageous due to their ability to serve as parametric approximations for unknown functions even in high dimensions, the proposed neural-network-based algorithms are capable of solving gradient flows in high dimensions, such as those appearing in machine learning [19, 32, 69, 73].

The rest of the paper is organized as follows. Section 2 reviews the EnVarA and some existing neural network-based numerical approaches for solving PDEs. Section 3 of the paper is devoted to the development of the proposed EVNN schemes for L2L^{2}-gradient flows and generalized diffusions. Various numerical experiments are presented in Section 4 to demonstrate the accuracy and energy stability of the proposed numerical methods. Conclusions are drawn in Section 5.

2 Preliminary

2.1 Energetic Variational Approach

In this subsection, we provide a detailed derivation of underlying PDEs for L2L^{2}-gradient flows and generalized diffusions by the general framework of EnVarA. We refer interested readers to [24, 74] for a more comprehensive review of the EnVarA. In both systems, the kinetic energy 𝒦=0\mathcal{K}=0.

2.1.1 L2L^{2}-gradient flow

L2L^{2}-gradient flows are those systems satisfying the energy-dissipation law:

(5) \frac{\mathrm{d}}{\mathrm{d}t}\mathcal{F}[{\varphi}]=-\int_{\Omega}\eta|\varphi_{t}|^{2}\mathrm{d}\bm{x},

where φ{\varphi} is the state variable, [φ]\mathcal{F}[\varphi] is the Helmholtz free energy, and η>0\eta>0 is the dissipation rate. One can also view φ{\varphi} as the generalized coordinates of the system [14]. By treating φ\varphi as 𝒙\bm{x}, the variational procedure (2) leads to the following L2L^{2}-gradient flow equation:

\frac{\delta(\frac{1}{2}\int\eta|\varphi_{t}|^{2}\mathrm{d}\bm{x})}{\delta\varphi_{t}}=-\frac{\delta\mathcal{F}}{\delta\varphi}\quad\Rightarrow\quad\eta\varphi_{t}=-\frac{\delta\mathcal{F}}{\delta\varphi}.

Many problems in soft matter physics, material science, and machine learning can be modeled as L2L^{2}-gradient flows. Examples include Allen–Cahn equation [15], Oseen–Frank and Landau–de Gennes models of liquid crystals [11, 45], phase field crystal models of quasicrystal [34, 44], and generalized Ohta–Kawasaki model of diblock and triblock copolymers [56]. For example, by taking

\mathcal{F}[\varphi]=\int_{\Omega}\frac{1}{2}|\nabla\varphi|^{2}+\frac{1}{4\epsilon^{2}}(\varphi^{2}-1)^{2}\mathrm{d}\bm{x},

which is the classical Ginzburg–Landau free energy, and letting η(φ)=1\eta(\varphi)=1, one gets the Allen–Cahn equation

\varphi_{t}=\Delta\varphi-\frac{1}{\epsilon^{2}}(\varphi^{2}-1)\varphi\,.
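To make the step from the energy to the equation explicit, the intermediate computation is the variational derivative of the Ginzburg–Landau energy (assuming natural or periodic boundary conditions so that the boundary term from integration by parts vanishes):

\frac{\delta\mathcal{F}}{\delta\varphi}=-\Delta\varphi+\frac{1}{\epsilon^{2}}(\varphi^{2}-1)\varphi\,,

so that $\eta\varphi_{t}=-\frac{\delta\mathcal{F}}{\delta\varphi}$ with $\eta=1$ reduces to the Allen–Cahn equation above.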

2.1.2 Generalized diffusion

Generalized diffusion describes the space-time evolution of a conserved quantity ρ(𝒙,t)\rho(\bm{x},t). Due to the physics law of mass conservation, ρ(𝒙,t)\rho(\bm{x},t) satisfies the kinematics

(6) \partial_{t}\rho+\nabla\cdot(\rho\mathbf{u})=0\,,

where 𝐮\mathbf{u} is an averaged velocity in the system.

To derive the generalized diffusion equation by the EnVarA, one should introduce a Lagrangian description of the system. Given a velocity field 𝐮(𝒙,t)\mathbf{u}(\bm{x},t), one can define a flow map 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) through

\begin{cases}\frac{\mathrm{d}}{\mathrm{d}t}\bm{x}(\mathbf{X},t)=\mathbf{u}(\bm{x}(\mathbf{X},t),t),\\ \bm{x}(\mathbf{X},0)=\mathbf{X},\end{cases}

where 𝐗Ω0\mathbf{X}\in\Omega_{0} is the Lagrangian coordinates and 𝒙Ωt\bm{x}\in\Omega_{t} is the Eulerian coordinates. Here Ω0\Omega_{0} is the initial configuration of the system, and Ωt\Omega_{t} is the configuration of the system at time tt. For a fixed 𝐗\mathbf{X}, 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) describes the trajectory of a particle (or a material point) labeled by 𝐗\mathbf{X}; while for a fixed tt, 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) is a diffeomorphism from Ω0\Omega_{0} to Ωt\Omega_{t} (See Fig. 1 for the illustration.) The existence of the flow map 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) requires a certain regularity of 𝐮(𝒙,t)\mathbf{u}(\bm{x},t), for instance, being Lipschitz in 𝒙\bm{x}.

Figure 1: Schematic illustration of a flow map $\bm{x}(\mathbf{X},t)$.

In a Lagrangian picture, the evolution of the density function ρ(𝒙,t)\rho(\bm{x},t) is determined by the evolution of the 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) through the kinematics relation (6) (written in the Lagrangian frame of reference). More precisely, one can define the deformation tensor associated with the flow map 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) by

(7) {\sf\tilde{F}}(\bm{x}(\mathbf{X},t),t)={\sf F}(\mathbf{X},t)=\nabla_{\mathbf{X}}\bm{x}(\mathbf{X},t).

Without ambiguity, we do not distinguish 𝖥{\sf F} and 𝖥~{\sf\tilde{F}} in the rest of the paper. Let ρ0(𝐗)\rho_{0}(\mathbf{X}) be the initial density, then the mass conservation indicates that

(8) \rho(\bm{x}(\mathbf{X},t),t)=\rho_{0}(\mathbf{X})/\det{\sf F}(\mathbf{X},t),\quad\forall\mathbf{X}\in\Omega_{0}

in the Lagrangian frame of reference. This is equivalent to ρt+(ρ𝐮)=0\rho_{t}+\nabla\cdot(\rho\mathbf{u})=0 in the Eulerian frame.
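For completeness, (8) is the change-of-variables form of mass conservation (assuming $\det{\sf F}>0$): for any material subdomain $\omega_{0}\subset\Omega_{0}$,

\int_{\bm{x}(\omega_{0},t)}\rho(\bm{x},t)\,\mathrm{d}\bm{x}=\int_{\omega_{0}}\rho(\bm{x}(\mathbf{X},t),t)\det{\sf F}(\mathbf{X},t)\,\mathrm{d}\mathbf{X}=\int_{\omega_{0}}\rho_{0}(\mathbf{X})\,\mathrm{d}\mathbf{X}\,,

and since $\omega_{0}$ is arbitrary, $\rho(\bm{x}(\mathbf{X},t),t)\det{\sf F}(\mathbf{X},t)=\rho_{0}(\mathbf{X})$.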

Within (8), one can rewrite the energy-dissipation law (4) in terms of the flow map 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) and its velocity 𝒙t(𝐗,t)\bm{x}_{t}(\mathbf{X},t) in the reference domain Ω0\Omega_{0}. A typical form of the free energy [ρ]\mathcal{F}[\rho] is given by

(9) \mathcal{F}[\rho]=\int_{\Omega}\omega(\rho)+\rho V(\bm{x})+\frac{1}{2}\left(\int_{\Omega}K(\bm{x},\bm{y})\rho(\bm{x})\rho(\bm{y})\mathrm{d}\bm{y}\right)\mathrm{d}\bm{x},

where V(𝒙)V(\bm{x}) is a potential field, and K(𝒙,𝒚)K(\bm{x},\bm{y}) is a symmetric non-local interaction kernel. In Lagrangian coordinates, the free energy (9) and the dissipation in (4) become

\mathcal{F}[\bm{x}(\mathbf{X},t)]=\int_{\Omega_{0}}\omega\left(\tfrac{\rho_{0}}{\det{\sf F}}\right)\det{\sf F}+V(\bm{x}(\mathbf{X},t))\rho_{0}(\mathbf{X})+\frac{\rho_{0}(\mathbf{X})}{2}\left(\int_{\Omega_{0}}K(\bm{x}(\mathbf{X},t),\bm{x}(\mathbf{Y},t))\rho_{0}(\mathbf{Y})\mathrm{d}\mathbf{Y}\right)\mathrm{d}\mathbf{X}

and

\mathcal{D}(\bm{x},\bm{x}_{t})=\frac{1}{2}\int_{\Omega_{0}}\eta(\rho_{0}/\det{\sf F})|\bm{x}_{t}|^{2}\det{\sf F}\,\mathrm{d}\mathbf{X}\,.

By a direct computation, the force balance equation (2) can be written as (see [24] for a detailed computation)

(10) \eta(\rho)\mathbf{u}=-\rho\nabla\mu,\quad\mu=\omega^{\prime}(\rho)+V(\bm{x})+K*\rho.

In Eulerian coordinates, combining (10) with the kinematics equation (6), one obtains a generalized diffusion equation

(11) \rho_{t}=\nabla\cdot\left(m(\rho)\nabla\mu\right),\quad m(\rho)=\rho^{2}/\eta(\rho),

where m(ρ)m(\rho) is known as the mobility in physics. Many PDEs in a wide range of applications, including the porous medium equations (PME) [71], nonlinear Fokker–Planck equations [35], Cahn–Hilliard equations [50], Keller–Segel equations [36], and Poisson–Nernst–Planck (PNP) equations [22], can be obtained by choosing [ρ]\mathcal{F}[\rho] and η(ρ)\eta(\rho) differently.

The velocity equation (10) can be viewed as the equation of the flow map 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) in Lagrangian coordinates, i.e.,

(12) \eta\left(\frac{\rho_{0}(\mathbf{X})}{\det{\sf F}(\mathbf{X})}\right)\frac{\mathrm{d}}{\mathrm{d}t}\bm{x}(\mathbf{X},t)=-\frac{\rho_{0}(\mathbf{X})}{\det{\sf F}(\mathbf{X})}{\sf F}^{-\rm T}\nabla_{\mathbf{X}}\mu\left(\frac{\rho_{0}(\mathbf{X})}{\det{\sf F}(\mathbf{X})}\right),

where 𝖥(𝐗)=𝐗𝒙(𝐗,t){\sf F}(\mathbf{X})=\nabla_{\mathbf{X}}\bm{x}(\mathbf{X},t). The flow map equation (12) is a highly nonlinear equation of 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) that involves both 𝖥{\sf F} and det𝖥\det{\sf F}. It is rather unclear how to introduce a suitable spatial discretization to Eq. (12) by looking at the equation. However, a generalized diffusion can be viewed as an L2L^{2}-gradient flow of the flow map in the space of diffeomorphism. This perspective gives rise to a natural discretization of generalized diffusions [50].

Remark 2.1.

In the case that $\eta(\rho)=\rho$, a generalized diffusion can be viewed as a Wasserstein gradient flow in the space of all probability densities having finite second moments, $\mathcal{P}_{2}(\Omega)$ [35]. Formally, the Wasserstein gradient flow can be defined as a continuous-time limit ($\tau\rightarrow 0$) of the semi-discrete JKO scheme,

(13) \rho^{n+1}=\mathop{\arg\min}_{\rho\in\mathcal{P}_{2}(\Omega)}\frac{1}{2\tau}W_{2}(\rho,\rho^{n})^{2}+\mathcal{F}[\rho],\quad n=0,1,2,\ldots,

where $\mathcal{P}_{2}(\Omega)=\left\{\rho:\Omega\rightarrow[0,\infty)~{}|~{}\int_{\Omega}\rho\,\mathrm{d}\bm{x}=1,~\int_{\Omega}|\bm{x}|^{2}\rho(\bm{x})\mathrm{d}\bm{x}<\infty\right\}$ and $W_{2}(\rho,\rho^{n})$ is the Wasserstein distance between $\rho$ and $\rho^{n}$. The Wasserstein gradient flow is an Eulerian description of these systems [6]. Other choices of dissipation can define other metrics in the space of probability measures [1, 46].

2.2 Neural-network-based numerical schemes for PDEs

In this subsection, we briefly review some existing neural network-based algorithms for solving PDEs. We refer interested readers to [18, 52] for detailed reviews.

Considering a PDE subject to a certain boundary condition

(14) \mathcal{L}\varphi(\bm{x})=f(\bm{x}),\ \bm{x}\in\Omega\subset\mathbb{R}^{d},\qquad\mathcal{B}\varphi(\bm{x})=g(\bm{x}),\ \bm{x}\in\partial\Omega,

and assuming the solution can be approximated by a neural network φNN(𝒙;𝚯)\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}), where 𝚯{\bm{\Theta}} is the set of parameters in the neural network, the Physics-Informed Neural Network (PINN) [62] and similar methods [13, 68, 77] find the optimal parameter 𝚯{\bm{\Theta}}^{*} by minimizing a loss function defined as:

(15) L({\bm{\Theta}})=\frac{1}{N_{in}}\sum_{i=1}^{N_{in}}\left(\mathcal{L}\varphi_{\rm NN}(\bm{x}_{i};{\bm{\Theta}})-f(\bm{x}_{i})\right)^{2}+\frac{\lambda}{N_{b}}\sum_{j=1}^{N_{b}}\left(\mathcal{B}\varphi_{\rm NN}(\bm{s}_{j};{\bm{\Theta}})-g(\bm{s}_{j})\right)^{2}.

Here {𝒙i}i=1Nin\{\bm{x}_{i}\}_{i=1}^{N_{in}} and {𝒔j}j=1Nb\{\bm{s}_{j}\}_{j=1}^{N_{b}} are sets of samples in Ω\Omega and Ω\partial\Omega respectively, which can be drawn uniformly or by following some prescribed distributions. The parameter λ\lambda is used to weight the sum in the loss function. Minimizers of the loss function (15) can be obtained by using some optimization methods, such as AdaGrad and Adam. Of course, the objective function of this minimization problem in general is not convex even when the initial problem is. Obtaining the global minimum of (15) is highly non-trivial.
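As an illustration of the loss (15), the following is a minimal PyTorch sketch (our own illustration, not taken from any of the cited works) of the PINN loss for the model problem $-\Delta u=f$ in $\Omega$ with $u=g$ on $\partial\Omega$; the network `u_net`, the data functions `f` and `g`, and the sample tensors are placeholders supplied by the user.

```python
import torch

def pinn_loss(u_net, f, g, x_in, x_bd, lam=1.0):
    """Loss (15) for -Delta u = f in Omega, u = g on the boundary.

    x_in: (N_in, d) interior samples; x_bd: (N_b, d) boundary samples.
    u_net maps (N, d) -> (N, 1); f and g return tensors of shape (N, 1).
    """
    x_in = x_in.clone().requires_grad_(True)
    u = u_net(x_in)
    grad_u = torch.autograd.grad(u.sum(), x_in, create_graph=True)[0]
    # Laplacian as the sum of the diagonal second derivatives
    lap = torch.zeros_like(u)
    for k in range(x_in.shape[1]):
        d2u_k = torch.autograd.grad(grad_u[:, k].sum(), x_in,
                                    create_graph=True)[0][:, k:k + 1]
        lap = lap + d2u_k
    residual = ((-lap - f(x_in)) ** 2).mean()          # strong-form residual
    boundary = ((u_net(x_bd) - g(x_bd)) ** 2).mean()   # boundary penalty
    return residual + lam * boundary
```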

In contrast to the PINN, which is based on the strong form of a PDE, the deep Ritz method (DRM) [21] is designed for solving PDEs using their variational formulations. If Eq. (14) is the Euler–Lagrange equation of some energy functional

(16) I[\varphi,\nabla\varphi]=\int_{\Omega}W(\varphi(\bm{x}),\nabla_{\bm{x}}\varphi(\bm{x}))\mathrm{d}\bm{x}+\lambda\int_{\partial\Omega}|\mathcal{B}\varphi-g(\bm{x})|^{2}\mathrm{d}S,

then the loss function can be defined directly by

(17) L({\bm{\Theta}})=\frac{1}{N_{in}}\sum_{i=1}^{N_{in}}W(\varphi_{\rm NN}(\bm{x}_{i}),\nabla_{\bm{x}}\varphi_{\rm NN}(\bm{x}_{i}))+\frac{\lambda}{N_{b}}\sum_{j=1}^{N_{b}}(\mathcal{B}\varphi_{\rm NN}(\bm{s}_{j})-g(\bm{s}_{j}))^{2}.

The last term in (16) is a penalty term for the boundary condition. The original Dirichlet boundary condition can be recovered as $\lambda\rightarrow\infty$. Again, the samples $\{\bm{x}_{i}\}_{i=1}^{N_{in}}$ and $\{\bm{s}_{j}\}_{j=1}^{N_{b}}$ can be sampled uniformly from $\Omega$ and $\partial\Omega$, respectively, or drawn from some other prescribed distributions. Additionally, both the variational PINN [37] and the WAN [84] utilize the Galerkin formulation to solve PDEs. The variational PINN [37] stems from the Petrov–Galerkin method; it represents the solution via a DNN and keeps the test functions in linear function spaces. The WAN employs primal and adversarial networks to parameterize the weak solution and the test functions, respectively [84].

The neural network-based algorithms mentioned above focus on elliptic equations. For an evolution equation of the form:

(18) \left\{\begin{aligned}&\partial_{t}\varphi(\bm{x},t)=F(t,\bm{x},\varphi),\quad(\bm{x},t)\in\Omega\times(0,\infty),\\ &\varphi(\bm{x},0)=\varphi_{0}(\bm{x}),\quad\bm{x}\in\Omega\end{aligned}\right.

with a suitable boundary condition, the predominant approach is to treat the time variable as an additional dimension, and the PINN type techniques can be applied. However, these approaches are often expensive and may fail to capture the dynamics of these systems. Unlike spatial variables, there exists an inherent order in time, where the solution at time TT is determined by the solutions in t<Tt<T. It is difficult to generate samples in time to maintain this causality. Very recently, new efforts have been made in using NNs to solve evolution PDEs [8, 16]. The idea of these approaches is to use NNs with time-dependent parameters to represent the solutions. For instance, Neural Galerkin method, proposed in [8], parameterizes the solution as φh(𝒙;𝚯(t))\varphi_{h}(\bm{x};{\bm{\Theta}}(t)) with 𝚯(t){\bm{\Theta}}(t) being the NN parameters, and defines a loss function in terms of 𝚯{\bm{\Theta}} and 𝚯˙\dot{\bm{\Theta}} through a residual function, i.e.,

(19) J({\bm{\Theta}},{\bm{\eta}})=\frac{1}{2}\int|\nabla_{\bm{\Theta}}\varphi_{h}\cdot{\bm{\eta}}-F(\bm{x},t,\varphi_{h}(\bm{x};{\bm{\Theta}}))|^{2}\mathrm{d}\nu_{\bm{\Theta}}(\bm{x}),

where νΘ(𝒙)\nu_{\Theta}(\bm{x}) is a suitable measure which might depend on 𝚯{\bm{\Theta}}. By taking variation of J(𝚯,𝜼)J({\bm{\Theta}},{\bm{\eta}}) with respect to 𝜼{\bm{\eta}}, the neural Galerkin method arrives at an ODE of 𝚯(t){\bm{\Theta}}(t):

(20) {\sf M}({\bm{\Theta}})\dot{\bm{\Theta}}=F(t,{\bm{\Theta}}),\quad{\bm{\Theta}}(0)={\bm{\Theta}}_{0},

where ${\sf M}({\bm{\Theta}})=\int\nabla_{\bm{\Theta}}\varphi_{h}\otimes\nabla_{\bm{\Theta}}\varphi_{h}\,\mathrm{d}\nu_{\bm{\Theta}}(\bm{x})$ and $F(t,{\bm{\Theta}})=\int\nabla_{\bm{\Theta}}\varphi_{h}\,F(\bm{x},t,\varphi_{h})\,\mathrm{d}\nu_{\bm{\Theta}}(\bm{x})$. The ODE (20) can be solved by standard explicit or implicit time-marching schemes.

It is important to note that, with the exception of the DRM, all existing neural-network-based methods are developed based on either the strong or weak forms of PDEs. While the DRM utilizes the free energy functional, it is only suitable for solving static problems, i.e., finding the equilibrium of the system. Additionally, most of these existing approaches are Eulerian methods. For certain types of PDEs, like generalized diffusions, these methods may fail to preserve physical constraints, such as the positivity and mass conservation of a probability density.

3 Energetic Variational Neural Network

In this section, we present the structure-preserving EVNN discretizations for solving both $L^{2}$-gradient flows and generalized diffusions. As mentioned earlier, the goal is to construct a neural-network discretization based on the energy-dissipation law, without working on the underlying PDE. One can view our method as a generalization of the DRM to evolution equations.

3.1 EVNN scheme for L2L^{2}-gradient flow

Before we discuss the neural network discretization, we first briefly review the discrete energetic variational approach proposed in [50]. Given a continuous energy-dissipation law (1), the discrete energetic variational approach first constructs a finite-dimensional approximation to the continuous energy-dissipation law by introducing a spatial discretization of the state variable $\varphi$, denoted by $\varphi_{h}(\bm{x};{\bm{\Theta}}(t))$, where ${\bm{\Theta}}(t)\in\mathbb{R}^{K}$ is the parameter to be determined. By replacing $\varphi$ by $\varphi_{h}$, one obtains a semi-discrete energy-dissipation law (here we assume $\mathcal{K}=0$ in the continuous model for simplicity) in terms of ${\bm{\Theta}}$:

(21) \frac{\mathrm{d}}{\mathrm{d}t}\mathcal{F}_{h}\left[{\bm{\Theta}}(t)\right]=-\triangle_{h}[{\bm{\Theta}}(t),{\bm{\Theta}}^{\prime}(t)],

where $\mathcal{F}_{h}\left[{\bm{\Theta}}(t)\right]=\mathcal{F}[\varphi_{h}(\bm{x};{\bm{\Theta}})]$ and $\triangle_{h}[{\bm{\Theta}}(t),{\bm{\Theta}}^{\prime}(t)]=\int\eta(\varphi_{h})|\nabla_{\bm{\Theta}}\varphi_{h}\cdot{\bm{\Theta}}^{\prime}(t)|^{2}\mathrm{d}\bm{x}$. For example, in Galerkin methods, one can let $\varphi_{h}(\bm{x},t)=\sum_{i=1}^{K}\gamma_{i}(t)\psi_{i}(\bm{x})$ with $\{\psi_{i}\}_{i=1}^{K}$ being the set of basis functions. Then ${\bm{\Theta}}(t)$ is the vector ${\bm{\Theta}}(t)=(\gamma_{1}(t),\gamma_{2}(t),\ldots,\gamma_{K}(t))^{\rm T}\in\mathbb{R}^{K}$.

By employing the energetic variational approach in the semi-discrete level (21), one can obtain an ODE system of 𝚯(t){\bm{\Theta}}(t). Particularly, in the linear response regime, 𝒟h(𝚯(t),𝚯(t))=12η(φh)|𝚯φh𝚯(t)|2d𝒙\mathcal{D}_{h}({\bm{\Theta}}(t),{\bm{\Theta}}^{\prime}(t))=\frac{1}{2}\int\eta(\varphi_{h})|\nabla_{\bm{\Theta}}\varphi_{h}\cdot{\bm{\Theta}}^{\prime}(t)|^{2}\mathrm{d}\bm{x} is a quadratic function of 𝚯{\bm{\Theta}}^{\prime}. The ODE system of 𝚯(t){\bm{\Theta}}(t) can then be written as

(22) {\sf D}\left({\bm{\Theta}}\right){\bm{\Theta}}^{\prime}(t)=-\frac{\delta\mathcal{F}_{h}}{\delta{\bm{\Theta}}},

where

(23) \frac{\delta\mathcal{F}_{h}}{\delta{\bm{\Theta}}}=\frac{\delta\mathcal{F}}{\delta\varphi}\nabla_{\bm{\Theta}}\varphi_{h},\quad{\sf D}\left({\bm{\Theta}}\right)=\int\eta(\varphi_{h})(\nabla_{\bm{\Theta}}\varphi_{h}\otimes\nabla_{\bm{\Theta}}\varphi_{h})\mathrm{d}\bm{x}.

The ODE (22) is the same as the ODE (20) in the neural Galerkin method [8] for L2L^{2} gradient flows, although the derivation is different.

Since (22) is a finite-dimensional gradient flow, one can then construct a minimizing movement scheme for 𝚯{\bm{\Theta}} [25]: finding 𝚯n+1{\bm{\Theta}}^{n+1} such that

(24) {\bm{\Theta}}^{n+1}=\mathop{\arg\min}_{{\bm{\Theta}}\in\mathcal{S}_{ad}^{h}}J_{n}({\bm{\Theta}}),\quad J_{n}({\bm{\Theta}})=\frac{({\sf D}^{n}_{*}({\bm{\Theta}}-{\bm{\Theta}}^{n}))\cdot({\bm{\Theta}}-{\bm{\Theta}}^{n})}{2\tau}+\mathcal{F}_{h}({\bm{\Theta}}).

Here $\mathcal{S}_{ad}^{h}$ is the admissible set of ${\bm{\Theta}}$ inherited from the admissible set of $\varphi$, denoted by $\mathcal{S}$, and ${\sf D}^{n}_{*}$ is a constant matrix. A typical choice is ${\sf D}^{n}_{*}={\sf D}({\bm{\Theta}}^{n})$. An advantage of this scheme is that

(25) \mathcal{F}_{h}({\bm{\Theta}}^{n+1})\leq J_{h}^{n}({\bm{\Theta}}^{n+1})\leq J_{h}^{n}({\bm{\Theta}}^{n})=\mathcal{F}_{h}({\bm{\Theta}}^{n}),

if 𝖣n{\sf D}^{n}_{*} is positive definite, which guarantees the energy stability for the discrete free energy h(𝚯)\mathcal{F}_{h}({\bm{\Theta}}). Moreover, by choosing a proper optimization method, we can assure that φhn+1\varphi^{n+1}_{h} stays in the admissible set 𝒮\mathcal{S}.

Although neural networks can be used to construct φh(𝒙;𝚯(t))\varphi_{h}(\bm{x};{\bm{\Theta}}(t)), it might be expensive to compute 𝚯φh𝚯φh\nabla_{\bm{\Theta}}\varphi_{h}\otimes\nabla_{\bm{\Theta}}\varphi_{h} in 𝖣(𝚯){\sf D}({\bm{\Theta}}). Moreover, 𝖣(𝚯){\sf D}({\bm{\Theta}}) is not a sparse matrix and requires a lot of computer memory to store when a deep neural network is used.

To overcome these difficulties, we propose an alternative approach by introducing temporal discretization before spatial discretization. Let τ\tau be the time step size. For the L2L^{2}-gradient flow (5), given φn\varphi^{n}, which represents the numerical solution at tn=nτt^{n}=n\tau, one can obtain φn+1\varphi^{n+1} by solving the following optimization problem: finding φn+1\varphi^{n+1} in some admissible set 𝒮\mathcal{S} such that

(26) \varphi^{n+1}=\mathop{\arg\min}_{\varphi\in\mathcal{S}}J^{n}(\varphi),\quad J^{n}(\varphi)=\frac{1}{2\tau}\int\eta(\varphi^{n})|\varphi-\varphi^{n}|^{2}\mathrm{d}\bm{x}+\mathcal{F}[\varphi].

Let φh(𝒙;𝚯)\varphi_{h}(\bm{x};{\bm{\Theta}}) be a finite-dimensional approximation to φ\varphi with 𝚯K{\bm{\Theta}}\in\mathbb{R}^{K} being the parameter of the spatial discretization (e.g., weights of linear combination in a Galerkin approximation) yet to be determined, then the minimizing movement scheme (26) can be written in terms of 𝚯{\bf\Theta}: finding 𝚯n+1{\bm{\Theta}}^{n+1} such that

(27) {\bm{\Theta}}^{n+1}=\mathop{\arg\min}_{{\bm{\Theta}}\in\mathcal{S}_{h}}J_{h}^{n}({\bm{\Theta}}),\quad J_{h}^{n}({\bm{\Theta}})=\frac{1}{2\tau}\int\eta^{n}|\varphi_{h}(\bm{x};{\bm{\Theta}})-\varphi_{h}(\bm{x};{\bm{\Theta}}^{n})|^{2}\mathrm{d}\bm{x}+\mathcal{F}_{h}[{\bm{\Theta}}].

Here 𝚯n{\bm{\Theta}}^{n} is the value of 𝚯{\bm{\Theta}} at time tnt^{n}. h[𝚯]=[φh(𝒙;𝚯)]\mathcal{F}_{h}[{\bm{\Theta}}]=\mathcal{F}[\varphi_{h}(\bm{x};{\bm{\Theta}})], and ηn=η(φh(𝒙;𝚯n))\eta^{n}=\eta(\varphi_{h}(\bm{x};{\bm{\Theta}}^{n})).

Remark 3.1.

The connection between the minimizing movement scheme (27) derived by a temporal-then-spatial approach, and the minimizing movement scheme (24) derived by a spatial-then-temporal approach, can be shown with a direct calculation. Indeed, according to the first-order necessary condition for optimality, an optimal solution to the minimization problem (27) 𝚯n+1{\bm{\Theta}}^{n+1} satisfies

(28) \frac{\delta J^{n}_{h}({\bm{\Theta}})}{\delta{\bm{\Theta}}}\Big{|}_{{\bm{\Theta}}^{n+1}}=\frac{1}{\tau}\int\eta^{n}(\varphi_{h}(\bm{x};{\bm{\Theta}}^{n+1})-\varphi_{h}(\bm{x};{\bm{\Theta}}^{n}))\nabla_{\bm{\Theta}}\varphi_{h}\Big{|}_{{\bm{\Theta}}^{n+1}}\mathrm{d}\bm{x}+\frac{\delta\mathcal{F}_{h}}{\delta{\bm{\Theta}}}\Big{|}_{{\bm{\Theta}}^{n+1}}
=\left(\int\eta^{n}\nabla_{\bm{\Theta}}\varphi_{h}\Big{|}_{{\bm{\Theta}}^{*}}\otimes\nabla_{\bm{\Theta}}\varphi_{h}\Big{|}_{{\bm{\Theta}}^{n+1}}\mathrm{d}\bm{x}\right)\frac{{\bm{\Theta}}^{n+1}-{\bm{\Theta}}^{n}}{\tau}+\frac{\delta\mathcal{F}_{h}}{\delta{\bm{\Theta}}}\Big{|}_{{\bm{\Theta}}^{n+1}}=0

for some ${\bm{\Theta}}^{*}$, where the second equality follows from the mean value theorem. In the case of Galerkin methods, since $\nabla_{\bm{\Theta}}\varphi_{h}$ is independent of ${\bm{\Theta}}$, ${\bm{\Theta}}^{n+1}$ is a solution to an implicit Euler scheme for the ODE (22) in which $\eta(\varphi_{h})$ is treated explicitly.

By choosing a certain neural network to approximate φ\varphi, denoted by φNN(𝒙;𝚯)\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}) (𝚯{\bm{\Theta}} is used to denote all parameters in the neural network), we can summarize the EVNN scheme for solving a L2L^{2}-gradient flow in Alg. 1.

For a given initial condition φ0(𝒙)\varphi_{0}(\bm{x}), compute 𝚯0{\bm{\Theta}}^{0} by solving

(29) {\bm{\Theta}}^{0}=\mathop{\arg\min}_{\bm{\Theta}}\int_{\Omega}|\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})-\varphi_{0}(\bm{x})|^{2}\mathrm{d}\bm{x};

At each step, update 𝚯n+1{\bm{\Theta}}^{n+1} by solving the optimization problem

(30) {\bm{\Theta}}^{n+1}=\mathop{\arg\min}_{\bm{\Theta}}\left(\frac{1}{2\tau}\int_{\Omega}\eta^{n}|\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})-\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}^{n})|^{2}\mathrm{d}\bm{x}+\mathcal{F}[\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})]\right).

We have φNN(𝒙;𝚯n)\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}^{n}) as a numerical solution at time tn=nτt^{n}=n\tau.

Algorithm 1 Numerical Algorithm for solving the L2L^{2}-gradient flow

It can be noticed that both Eq. (29) and Eq. (30) involve integration in the computational domain Ω\Omega. This integration is often computed by using a grid-based numerical quadrature or Monte–Carlo/Quasi–Monte–Carlo algorithms [65]. It is worth mentioning that due to the non-convex nature of the optimization problem and the error in estimating the integration, it might be difficult to find an optimal 𝚯n+1{\bm{\Theta}}^{n+1} at each step. But since we start the optimization procedure with 𝚯n{\bm{\Theta}}^{n}, we’ll always be able to get a 𝚯n+1{\bm{\Theta}}^{n+1} that lowers the discrete free energy at least on the training set. In practice, the optimization problem can be solved by either deterministic optimization algorithms, such as L-BFGS and gradient descent with Barzilai–Borwein step-size, or stochastic gradient descent algorithms, such as AdaGrad and Adam.
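To make Alg. 1 concrete, the following is a minimal PyTorch sketch (our own illustration, not the implementation used for the experiments below) of one update (30) for the Allen–Cahn energy with $\eta=1$, using Monte Carlo quadrature on a fixed batch of interior samples and L-BFGS as the inner optimizer; the network `phi_net` and the sample tensor are placeholders. Batch means approximate $\frac{1}{|\Omega|}\int_{\Omega}$, so the objective differs from (30) by a constant factor $|\Omega|$ that does not change the minimizer.

```python
import copy
import torch

def evnn_step(phi_net, x, tau=1e-2, eps=0.1, n_lbfgs=50):
    """One EVNN step (30) for the Allen--Cahn L2-gradient flow (eta = 1).

    phi_net : network representing the current solution (updated in place).
    x       : (N, d) Monte Carlo samples in Omega used for quadrature.
    """
    phi_old = copy.deepcopy(phi_net)             # phi_NN(.; Theta^n), frozen
    for p in phi_old.parameters():
        p.requires_grad_(False)

    opt = torch.optim.LBFGS(phi_net.parameters(), max_iter=n_lbfgs)

    def closure():
        opt.zero_grad()
        xg = x.clone().requires_grad_(True)
        phi = phi_net(xg)
        grad_phi = torch.autograd.grad(phi.sum(), xg, create_graph=True)[0]
        # Ginzburg--Landau free energy density, averaged over the batch
        energy = (0.5 * (grad_phi ** 2).sum(dim=1, keepdim=True)
                  + (phi ** 2 - 1.0) ** 2 / (4.0 * eps ** 2)).mean()
        # movement limiter (1/2 tau) * || phi - phi^n ||^2
        movement = ((phi_net(x) - phi_old(x)) ** 2).mean() / (2.0 * tau)
        loss = movement + energy
        loss.backward()
        return loss

    opt.step(closure)
    return phi_net
```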

Remark 3.2.

It is straightforward to incorporate other variational high-order temporal discretizations to solve the L2L^{2}-gradient flows. For example, a second-order accurate BDF2 scheme can be reformulated as an optimization problem

(31) {\bm{\Theta}}^{n}=\mathop{\arg\min}_{\bm{\Theta}}\left(\frac{\eta}{\tau}\int_{\Omega}|\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})-\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}^{n-1})|^{2}\mathrm{d}\bm{x}-\frac{\eta}{4\tau}\int_{\Omega}|\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})-\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}^{n-2})|^{2}\mathrm{d}\bm{x}+\mathcal{F}[\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})]\right).

A modified Crank–Nicolson time-marching scheme can be reformulated as

(32) {\bm{\Theta}}^{n}=\mathop{\arg\min}_{\bm{\Theta}}\left(\frac{\eta}{\tau}\int_{\Omega}|\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})-\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}^{n-1})|^{2}\mathrm{d}\bm{x}+\mathcal{F}[\varphi_{\rm NN}(\bm{x};{\bm{\Theta}})]+\langle\nabla_{\bm{\Theta}}\mathcal{F}[\varphi_{\rm NN}(\bm{x};{\bm{\Theta}}^{n-1})],{\bm{\Theta}}-{\bm{\Theta}}^{n-1}\rangle\right).

Here we assume that η\eta is a constant for simplicity.

3.2 Lagrangian EVNN scheme for generalized diffusions

In this subsection, we show how to formulate an EVNN scheme in the Lagrangian frame of reference for generalized diffusions.

As discussed previously, a generalized diffusion can be viewed as an L2L^{2}-gradient flow of the flow map 𝒙(𝐗,t)\bm{x}(\mathbf{X},t) in the space of diffeomorphisms. Hence, the EVNN scheme for the generalized diffusion can be formulated in terms of a minimizing movement scheme of the flow map given by

(33) \Phi^{n+1}=\mathop{\arg\min}_{\Phi\in{\rm Diff}}\frac{1}{2\tau}\int|\Phi(\mathbf{X})-\Phi^{n}(\mathbf{X})|^{2}\rho_{0}(\mathbf{X})\mathrm{d}\mathbf{X}+\mathcal{F}\left[\Phi_{\#}\rho_{0}\right],

where $\Phi^{n}(\mathbf{X})$ is a numerical approximation of the flow map $\bm{x}(\mathbf{X},t)$ at $t^{n}=n\tau$, $\rho_{0}(\mathbf{X})$ is the initial density, $\mathcal{F}[\rho]$ is the free energy for the generalized diffusion defined in Eq. (9), and

(\Phi_{\#}\rho_{0})(\bm{x}):=\frac{\rho_{0}(\Phi^{-1}(\bm{x}))}{\det{\sf F}(\Phi^{-1}(\bm{x}))},\quad{\rm Diff}=\{\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}~{}|~{}\Phi~\text{is a diffeomorphism}\}.

One can parameterize Φ:dd\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} by a suitable neural network. The remaining procedure is almost the same as that of the last subsection. However, it is often difficult to solve this optimization problem directly, and one might need to build a large neural network to approximate Φn+1\Phi^{n+1} when nn is large.

To overcome this difficulty, we propose an alternative approach. Instead of seeking an optimal map $\Phi^{n+1}$ between $\rho^{0}$ and $\rho^{n+1}$, we seek an optimal map $\Psi^{n+1}$ between $\rho^{n}$ and $\rho^{n+1}$. More precisely, we assume that

\Phi^{n+1}=\Psi^{n+1}\circ\Psi^{n}\circ\Psi^{n-1}\circ\ldots\circ\Psi^{1}.

Given ρn\rho^{n}, one can compute Ψn+1\Psi^{n+1} by solving the following optimization problem

(34) \Psi^{n+1}=\mathop{\arg\min}_{\Psi\in{\rm Diff}}\frac{1}{2\tau}\int|\Psi(\bm{x})-\bm{x}|^{2}\rho^{n}(\bm{x})\mathrm{d}\bm{x}+\mathcal{F}[\Psi_{\#}\rho^{n}].

The corresponding $\rho^{n+1}$ can then be computed through $\rho^{n+1}(\bm{x})=(\Psi^{n+1}_{\#}\rho^{n})(\bm{x})$. An advantage of this approach is that only a small neural network is needed to approximate $\Psi^{n+1}$ at each time step when $\tau$ is small.

Remark 3.3.

The scheme (34) can be viewed as a Lagrangian realization of the JKO scheme (13) for the Wasserstein gradient flow, although it is developed based on the L2L^{2}-gradient flow structure in the space of diffeomorphism. According to the Benamou–Brenier formulation [5], the Wasserstein distance between two probability densities ρ1\rho_{1} and ρ2\rho_{2} can be computed by solving the optimization problem

(35) W_{2}(\rho_{1},\rho_{2})^{2}=\mathop{\min}_{(\rho,\mathbf{u})\in\mathcal{S}}\int_{0}^{1}\int\rho|\mathbf{u}|^{2}\mathrm{d}\bm{x}\mathrm{d}t,

where the admissible set of (ρ,𝐮)(\rho,\mathbf{u}) is given by

(36) \mathcal{S}=\{(\rho,\mathbf{u})~{}|~{}\rho_{t}+\nabla\cdot(\rho\mathbf{u})=0,\quad\rho(\bm{x},0)=\rho_{1},\quad\rho(\bm{x},1)=\rho_{2}\}.

Hence, one can solve the JKO scheme by solving

(37) (\mathbf{u}^{*},\hat{\rho}^{*})=\mathop{\arg\min}_{(\mathbf{u},\hat{\rho})}\frac{1}{2}\int_{0}^{\tau}\int_{\Omega}\hat{\rho}|\mathbf{u}|^{2}\mathrm{d}\bm{x}\mathrm{d}t+\mathcal{F}[\hat{\rho}(\tau)]\quad\text{s.t.}\quad\partial_{t}\hat{\rho}+\nabla\cdot(\hat{\rho}\mathbf{u})=0,\quad\hat{\rho}(0)=\rho^{n},

and letting ρn+1=ρ^(τ)\rho^{n+1}=\hat{\rho}^{*}(\tau). In Lagrangian coordinates, (37) is equivalent to

(38) \bm{x}^{*}(\mathbf{X},t)=\mathop{\arg\min}_{\bm{x}(\mathbf{X},t)}\frac{1}{2}\int_{0}^{\tau}\int_{\Omega}\rho_{0}(\mathbf{X})|\bm{x}_{t}(\mathbf{X},t)|^{2}\mathrm{d}\mathbf{X}\mathrm{d}t+\mathcal{F}(\hat{\rho}(\bm{x},\tau)),

where $\rho_{0}=\rho^{n}$ and $\hat{\rho}(\bm{x}(\mathbf{X},\tau),\tau)=\rho_{0}(\mathbf{X})/\det{\sf F}(\mathbf{X},\tau)$. By taking the variation of (38) with respect to $\bm{x}(\mathbf{X},t)$, one can show that the optimality condition is $\bm{x}_{tt}(\mathbf{X},t)=0$ for $t\in(0,\tau)$, which indicates that $\bm{x}(\mathbf{X},t)=t(\Psi(\mathbf{X})-\mathbf{X})/\tau+\mathbf{X}$ if $\bm{x}(\mathbf{X},\tau)=\Psi(\mathbf{X})$. Hence, if $\Psi^{*}$ is the optimal solution of (34), then $\bm{x}^{*}(\mathbf{X},t)=t(\Psi^{*}(\mathbf{X})-\mathbf{X})/\tau+\mathbf{X}$ is the optimal solution of (38).

Remark 3.4.

If η(ρ)ρ\eta(\rho)\neq\rho, we can formulate the optimization problem (34) as

(39) \Psi^{n+1}=\mathop{\arg\min}_{\Psi\in{\rm Diff}}\frac{1}{2\tau}\int|\Psi(\bm{x})-\bm{x}|^{2}\eta(\rho^{n}(\bm{x}))\mathrm{d}\bm{x}+\mathcal{F}[\Psi_{\#}\rho^{n}].

by treating $\eta$ explicitly. A subtle fact is that $\det{\sf F}^{n}=1$, since we always start from an identity map at each step.

The numerical algorithm for solving the generalized diffusion is summarized in Alg. 2.

  • Given $\{\bm{x}_{i}^{n}\}_{i=1}^{N}$ and the densities $\rho^{n}_{i}$ at $t=n\tau$ and $\bm{x}_{i}^{n}$.

  • Find $\Psi^{n+1}(\bm{x}):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ by solving the optimization problem (34). To guarantee energy stability, $\Psi$ should be initialized as (an approximation to) the identity map when solving the optimization problem (34).

  • After obtaining $\Psi^{n+1}$, update $\{\bm{x}_{i}^{n+1}\}_{i=1}^{N}$ and $\rho^{n+1}$ by

    (40) \bm{x}_{i}^{n+1}=\Psi^{n+1}(\bm{x}_{i}^{n}),\quad\rho^{n+1}_{i}=\frac{\rho^{n}_{i}}{\det(\nabla\Psi^{n+1}(\bm{x}_{i}^{n}))}.
Algorithm 2 Numerical Algorithm for solving the generalized diffusion

The next question is how to accurately evaluate the numerical integration in (34). Let $\bm{x}_{i}^{n}=\Phi^{n}(\bm{x}_{i}^{0})$. For the general free energy (9), one way to evaluate the integrals in Eq. (34) is

(41) J(\Psi)=\frac{1}{2\tau}\sum_{i=1}^{N}\rho^{n}_{i}\|\Psi(\bm{x}_{i}^{n})-\bm{x}_{i}^{n}\|^{2}|\Omega_{i}^{n}|+\sum_{i=1}^{N}\omega\left(\frac{\rho_{i}^{n}}{\det(\nabla\Psi(\bm{x}_{i}^{n}))}\right)\det(\nabla\Psi(\bm{x}_{i}^{n}))|\Omega_{i}^{n}|+\sum_{i=1}^{N}V(\Psi(\bm{x}_{i}^{n}))\rho_{i}^{n}|\Omega_{i}^{n}|+\frac{1}{2}\sum_{i,j=1}^{N}K(\Psi(\bm{x}_{i}^{n}),\Psi(\bm{x}_{j}^{n}))\rho_{i}^{n}\rho_{j}^{n}|\Omega_{j}^{n}||\Omega_{i}^{n}|,

where $|\Omega_{i}^{n}|$ is the volume of the Voronoi cell associated with the point $\bm{x}_{i}^{n}$ in the set $\{\bm{x}_{i}^{n}\}$, and $\rho^{n}_{i}$ stands for $\rho^{n}(\bm{x}_{i}^{n})$. Here the numerical integration is computed through a piecewise-constant reconstruction of $\rho^{n}$ based on its values at $\{\Phi^{n}(\bm{x}_{i}^{0})\}_{i=1}^{N}$. However, it is not straightforward to compute $|\Omega_{i}^{n}|$, particularly in high-dimensional cases. In the current study, since the initial samples are drawn from $\rho_{0}$, one can roughly assume that $\{\bm{x}_{i}^{n}\}$ follows the distribution $\rho^{n}$; then, following the Monte Carlo approach, the numerical integration can be evaluated as

(42) J(\Psi)=\frac{1}{2\tau}\left(\frac{1}{N}\sum_{i=1}^{N}\|\Psi(\bm{x}_{i}^{n})-\bm{x}_{i}^{n}\|^{2}\right)+\frac{1}{N}\sum_{i=1}^{N}\left(f_{\omega}\left(\frac{\rho_{i}^{n}}{\det(\nabla\Psi(\bm{x}_{i}^{n}))}\right)+V(\Psi(\bm{x}_{i}^{n}))\right)+\frac{1}{2N^{2}}\sum_{i,j=1}^{N}K\left(\Psi(\bm{x}_{i}^{n}),\Psi(\bm{x}_{j}^{n})\right),

where $f_{\omega}(\rho)=\omega(\rho)/\rho$. The proposed numerical method can be further improved if one can evaluate the integration more accurately, i.e., have an efficient way to estimate $|\Omega_{i}^{n}|$. Alternatively, as an advantage of the neural network-based algorithm, we have access to $\rho^{n}(\bm{x})=(\Phi^{n}_{\#}\rho_{0})(\bm{x})$ for all $\bm{x}$, which enables us to estimate $J(\Psi)$ by resampling. More precisely, assume $\mu$ is a distribution that is easy to sample; then

(43) J(\Psi)=\frac{1}{2\tau}\left(\frac{1}{N}\sum_{i=1}^{N}\|\Psi(\bm{x}_{i}^{n})-\bm{x}_{i}^{n}\|^{2}\frac{\rho_{i}^{n}}{\mu_{i}^{n}}\right)+\frac{1}{N}\sum_{i=1}^{N}\left(f_{\omega}\left(\frac{\rho_{i}^{n}}{\det(\nabla\Psi(\bm{x}_{i}^{n}))}\right)+V(\Psi(\bm{x}_{i}^{n}))\right)\frac{\rho_{i}^{n}}{\mu_{i}^{n}}+\frac{1}{2N^{2}}\sum_{i,j=1}^{N}K\left(\Psi(\bm{x}_{i}^{n}),\Psi(\bm{x}_{j}^{n})\right)\frac{\rho_{i}^{n}\rho_{j}^{n}}{\mu_{i}^{n}\mu_{j}^{n}},

where $\bm{x}_{i}\sim\mu$, $\rho_{i}^{n}=\rho^{n}(\bm{x}_{i})$, and $\mu_{i}^{n}=\mu(\bm{x}_{i})$. We will explore this resampling approach in future work.
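For illustration, the following is a minimal PyTorch sketch (an assumption of ours, not the paper's implementation) of the Monte Carlo objective (42) for a free energy with $\omega(\rho)=\rho\log\rho$ (so $f_{\omega}(\rho)=\log\rho$) and an external potential $V$, with the interaction kernel $K$ omitted; `Psi` can be any candidate map implemented as an `nn.Module`, and the per-sample Jacobian determinants are computed with `torch.autograd.functional.jacobian`.

```python
import torch
from torch.autograd.functional import jacobian

def lagrangian_objective(Psi, x, rho, V, tau=1e-2):
    """Monte Carlo objective (42) with omega(rho) = rho*log(rho), no kernel K.

    Psi : candidate map R^d -> R^d (an nn.Module acting on (N, d) batches)
    x   : (N, d) current particle positions x_i^n
    rho : (N,)   current densities rho_i^n at the particles
    V   : callable mapping (N, d) positions to (N,) potential values
    """
    y = Psi(x)                                              # Psi(x_i^n)
    movement = ((y - x) ** 2).sum(dim=1).mean() / (2.0 * tau)

    # det(grad Psi(x_i^n)) for each sample; create_graph=True keeps the
    # determinants differentiable with respect to the parameters of Psi
    dets = torch.stack([
        torch.det(jacobian(lambda z: Psi(z.unsqueeze(0)).squeeze(0),
                           xi, create_graph=True))
        for xi in x
    ])
    f_omega = torch.log(rho / dets)                         # f_omega(rho_i^n / det)
    energy = (f_omega + V(y)).mean()
    return movement + energy
```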

The remaining question is how to parameterize a diffeomorphism using neural networks. This is discussed in detail in the next subsection.

3.3 Neural network architectures

In principle, the proposed numerical framework is independent of the choice of neural network architectures. However, different neural network architectures may lead to different numerical performances, arising from a balance of approximation (representation power), optimization, and generalization. In this subsection, we briefly discuss several neural network architectures that we use in the numerical experiments.

3.3.1 Neural network architectures for Eulerian methods

For Eulerian methods, one can construct a neural network to approximate the unknown function f:df:\mathbb{R}^{d}\rightarrow\mathbb{R}. Shallow neural networks (two-layer neural networks) approximate ff by functions of the form

(44) f(\bm{x};{\bm{\Theta}})=\sum_{i=1}^{N}\alpha_{i}\sigma(\bm{\omega}_{i}\cdot\bm{x}+b_{i})+\alpha_{0}=\bm{\alpha}\cdot\sigma({\bm{W}}\bm{x}+\bm{b})+\alpha_{0},

where $\sigma(\cdot):\mathbb{R}\rightarrow\mathbb{R}$ is a fixed nonlinear activation function, $N$ is the number of hidden nodes (neurons), and ${\bm{\Theta}}=\{\alpha_{i},{\bm{\omega}}_{i},b_{i}\}$ are the NN parameters to be identified. Typical choices of activation functions include the ReLU $\sigma(x)=\max(x,0)$, the sigmoid $\sigma(x)=1/(1+e^{-2x})$, and the hyperbolic tangent function $\tanh(x)$. A DNN can be viewed as a network composed of many hidden layers. More precisely, a DNN with $L$ hidden layers represents a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ by [67]

(45) f(\bm{x};{\bm{\Theta}})=g\circ T^{(L)}\circ T^{(L-1)}\circ\ldots\circ T^{(1)}(\bm{x}),

where $g(\bm{z})=\sum_{i=1}^{N_{L}}\gamma_{i}z_{i}+\gamma_{0}$ is a linear map from $\mathbb{R}^{N_{L}}$ to $\mathbb{R}$, $T^{(l)}$ is a nonlinear map from $\mathbb{R}^{N_{l-1}}$ to $\mathbb{R}^{N_{l}}$, and $N_{0}=d$. The nonlinear map $T^{(l)}$ takes the form

(46) T^{(l)}(\bm{x}_{l-1})=\sigma({\bm{W}}_{l}\bm{x}_{l-1}+\bm{b}_{l}),

where ${\bm{W}}_{l}\in\mathbb{R}^{N_{l}\times N_{l-1}}$, $\bm{b}_{l}\in\mathbb{R}^{N_{l}}$, and $\sigma(\cdot)$ is a nonlinear activation function that acts component-wise on vector-valued inputs.

Another widely used class of DNN model is residual neural network (ResNet). A typical KK-block ResNet approximates an unknown function f(𝒙):df(\bm{x}):\mathbb{R}^{d}\rightarrow\mathbb{R} by

(47) f_{K}(\bm{x};{\bm{\Theta}})=g(\bm{z}_{K}(\bm{x})),

where g(𝒛)=i=1Nγizi+γ0g(\bm{z})=\sum_{i=1}^{N}\gamma_{i}z_{i}+\gamma_{0} is a linear map from N\mathbb{R}^{N}\rightarrow\mathbb{R} and zK(𝒙):dNz_{K}(\bm{x}):\mathbb{R}^{d}\rightarrow\mathbb{R}^{N} is a nonlinear map defined through

(48) \bm{z}_{0}={\bm{V}}\bm{x},\quad\bm{z}_{k}=\sigma_{2}({\bm{\alpha}}_{k}T^{(L_{k})}_{k}\circ T^{(L_{k}-1)}_{k}\circ\ldots\circ T^{(1)}_{k}(\bm{z}_{k-1})+\bm{z}_{k-1}),\quad k=1,2,\ldots,K.

Here $L_{k}$ is the number of fully connected layers in the $k$-th block, $T^{(l)}_{k}$ is the same nonlinear map defined in (46) with ${\bm{W}}_{l}^{k}\in\mathbb{R}^{M\times N}$, ${\bm{b}}_{l}^{k}\in\mathbb{R}^{M}$, ${\bm{V}}\in\mathbb{R}^{N\times d}$, ${\bm{\alpha}}_{k}\in\mathbb{R}^{N\times M}$, and $\sigma_{2}(\cdot)$ is an element-wise activation function. The model parameters are ${\bm{\Theta}}=\{{\bm{\gamma}},{\bm{\alpha}}_{k},{\bm{W}}_{l}^{k},{\bm{b}}_{l}^{k},{\bm{V}}\}$. The original ResNet [29] takes $\sigma_{2}$ as a nonlinear activation function such as ReLU. Later studies indicate that one can also take $\sigma_{2}$ as the identity function [18, 30]. Then, at infinite depth, i.e., $L\rightarrow\infty$, (48) corresponds to the ODE

(49) \frac{\mathrm{d}\bm{z}}{\mathrm{d}t}=f(\bm{z}),\quad\bm{z}_{0}=\bm{x}.

Compared with fully connected deep neural network models which may suffer from numerical instabilities in the form of exploding or vanishing gradients [19, 28, 40], very deep ResNet can be constructed to avoid these issues.

Given that the proposed numerical scheme employs neural networks with time-dependent parameters to approximate the solution of a gradient flow, there is no need to employ a deep neural network. In all numerical experiments for L2L^{2}-gradient flows, we utilize ResNet with only a few blocks. The detailed neural network settings for each numerical experiment will be described in the next section. We’ll compare the numerical performance of different neural network architectures in future work.
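As a concrete example, the following is a minimal PyTorch sketch (our own, with illustrative layer sizes) of the small ResNet (47)–(48) with one fully connected layer per block and $\sigma_{2}$ taken as the identity.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block of (48): z -> alpha_k * sigma(W z + b) + z, sigma_2 = id."""

    def __init__(self, width, act):
        super().__init__()
        self.lin = nn.Linear(width, width)                 # T_k^{(1)}
        self.alpha = nn.Linear(width, width, bias=False)   # alpha_k
        self.act = act

    def forward(self, z):
        return self.alpha(self.act(self.lin(z))) + z


class ResNet(nn.Module):
    """Small ResNet f(x) = g(z_K(x)) of (47), used here with a few blocks."""

    def __init__(self, dim_in=2, width=20, n_blocks=1):
        super().__init__()
        act = nn.Tanh()
        self.lift = nn.Linear(dim_in, width, bias=False)   # z_0 = V x
        self.blocks = nn.ModuleList(
            [ResBlock(width, act) for _ in range(n_blocks)])
        self.out = nn.Linear(width, 1)                     # g(z) = gamma . z + gamma_0

    def forward(self, x):
        z = self.lift(x)
        for block in self.blocks:
            z = block(z)
        return self.out(z)
```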

3.3.2 Neural network architectures for Lagrangian methods

The proposed Lagrangian method seeks a neural network to approximate a diffeomorphism from $\mathbb{R}^{d}$ to $\mathbb{R}^{d}$. This task involves two main challenges: one is ensuring that the map is a diffeomorphism, and the other is efficiently and robustly computing the deformation tensor ${\sf F}$ and its determinant $\det{\sf F}$. Fortunately, various neural network architectures have been proposed to approximate a transport map. Examples include planar flows [64], auto-regressive flows [39], continuous-time diffusive flows [69], neural spline flows [17], and convex potential flows [32].

One way to construct a neural network dd\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} for approximating a flow map is to use a planar flow. A KK-layer planar flow is defined by T=TKT1T0T=T_{K}\circ\ldots T_{1}\circ T_{0}, where Tk:ddT_{k}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is given by

(50) \bm{x}^{k+1}=T_{k}(\bm{x}^{k})=\bm{x}^{k}+{\bm{u}}_{k}h({\bm{w}}^{\rm T}_{k}\bm{x}^{k}+b_{k}).

Here, 𝒘k,𝒖kd{\bm{w}}_{k},{\bm{u}}_{k}\in\mathbb{R}^{d}, bkb_{k}\in\mathbb{R}, and hh is a smooth, element-wise, non-linear activation function such as tanh\tanh. Direct computation shows that

(51) J_{k}=\det(\nabla T_{k})=1+h^{\prime}({\bm{w}}^{\rm T}_{k}\bm{x}^{k}+b_{k}){\bm{u}}^{\rm T}_{k}{\bm{w}}_{k}.

Clearly, $T_{k}$ is a diffeomorphism if $h^{\prime}({\bm{w}}^{\rm T}_{k}\bm{x}^{k}+b_{k}){\bm{u}}^{\rm T}_{k}{\bm{w}}_{k}>-1$ for all $\bm{x}^{k}$, so that $J_{k}>0$ everywhere. The determinant of the transport map can be computed as $\det(\nabla T)=J_{K}J_{K-1}\ldots J_{0}$, and we have $\varphi(T(\bm{x}))=\frac{\varphi^{n}(\bm{x})}{\det(\nabla T)}$.
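A minimal PyTorch sketch (our own illustration) of one planar-flow layer (50) together with its Jacobian determinant (51), with $h=\tanh$; no invertibility constraint on $\bm{u}_{k}$, $\bm{w}_{k}$ is enforced in this sketch.

```python
import torch
import torch.nn as nn

class PlanarLayer(nn.Module):
    """One planar-flow layer T_k(x) = x + u * tanh(w^T x + b), Eq. (50)."""

    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(0.01 * torch.randn(dim))
        self.w = nn.Parameter(0.01 * torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (N, d); returns mapped points and det(grad T_k) from Eq. (51)
        s = x @ self.w + self.b                    # (N,)
        y = x + torch.tanh(s).unsqueeze(1) * self.u
        h_prime = 1.0 - torch.tanh(s) ** 2         # derivative of tanh
        det = 1.0 + h_prime * (self.u @ self.w)    # (N,)
        return y, det
```

Stacking such layers gives $\det(\nabla T)=J_{K}\cdots J_{0}$ by multiplying the returned determinants.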

One limitation of planar flow is its potential lack of expressive power. Another commonly employed neural network architecture for approximating a flow map is the convex potential flow [32], which defines a diffeomorphism via the gradient of a strictly convex function that is parameterized by an Input Convex Neural Network (ICNN) [2]. A fully connected, KK-layer ICNN can be written as

(52) \bm{z}_{l+1}=\sigma_{l}({\bm{W}}_{l}\bm{z}_{l}+\bm{A}_{l}\mathbf{X}+\bm{b}_{l}),\quad l=0,1,\ldots,K-1,

where $\bm{z}_{0}=0$, $\bm{W}_{0}={\sf 0}$, ${\bm{\Theta}}=\{\bm{W}_{l},\bm{A}_{l},\bm{b}_{l}\}$ are the parameters to be determined, and $\sigma_{l}$ is the nonlinear activation function. As proved in [2], if all entries of $\bm{W}_{l}$ are non-negative, and all $\sigma_{l}$ are convex and non-decreasing, then the output $f(\mathbf{X};{\bm{\Theta}})=\bm{z}_{K}$ is a convex function with respect to $\mathbf{X}$. Hence $\nabla_{\mathbf{X}}f(\mathbf{X};{\bm{\Theta}})$ provides a parametric approximation to a flow map. In the current study, we adopt the concept of the convex potential flow to develop the Lagrangian EVNN method. We will explore other types of neural network architectures in future work.
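A minimal PyTorch sketch (our own, not the construction of [2, 32]) of an ICNN of the form (52) and of the induced map $\nabla_{\mathbf{X}}f$; here the non-negativity of $\bm{W}_{l}$ is enforced by a softplus reparameterization, which is one possible choice among several.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Input convex neural network, Eq. (52): z_{l+1} = sigma(W_l z_l + A_l X + b_l),
    with W_0 = 0, non-negative W_l, and convex non-decreasing sigma (softplus),
    so that f(X; Theta) = z_K is convex in X."""

    def __init__(self, dim, width=32, depth=3):
        super().__init__()
        widths = [width] * (depth - 1) + [1]            # final layer is scalar
        self.A = nn.ModuleList([nn.Linear(dim, w) for w in widths])    # A_l X + b_l
        self.W_raw = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(widths[l], widths[l - 1]))
             for l in range(1, depth)])                 # W_l for l >= 1

    def forward(self, x):
        z = F.softplus(self.A[0](x))                    # first layer, W_0 = 0
        for i, (A_l, W_raw) in enumerate(zip(self.A[1:], self.W_raw)):
            pre = z @ F.softplus(W_raw).t() + A_l(x)    # non-negative W_l
            z = pre if i == len(self.W_raw) - 1 else F.softplus(pre)
        return z                                        # (N, 1) convex potential

def transport_map(icnn, x):
    """Psi(x) = grad_x f(x; Theta): the convex-potential parameterization of a map."""
    x = x.clone().requires_grad_(True)
    f = icnn(x).sum()
    return torch.autograd.grad(f, x, create_graph=True)[0]
```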

Remark 3.5.

It is worth mentioning that the gradient of a convex function $f$ only parameterizes a subset of diffeomorphisms, and it is unclear whether the optimal solution of (34) belongs to this subset.

4 Numerical Experiments

In this section, we test the proposed EVNN methods for various $L^{2}$-gradient flows and generalized diffusions. To evaluate the accuracy of different methods, we define the $l^{2}$-error

\left(\frac{1}{N}\sum_{i=1}^{N}|\varphi_{\rm NN}(\bm{x}_{i})-\varphi_{\rm ref}(\bm{x}_{i})|^{2}\right)^{1/2}

and the relative $l^{2}$-error

\left(\sum_{i=1}^{N}|\varphi_{\rm NN}(\bm{x}_{i})-\varphi_{\rm ref}(\bm{x}_{i})|^{2}\Big{/}\sum_{i=1}^{N}|\varphi_{\rm ref}(\bm{x}_{i})|^{2}\right)^{1/2}

between the NN solution φNN\varphi_{\rm NN} and the corresponding reference solution φref\varphi_{\rm ref} on a set of test samples {𝒙i}i=1N\{\bm{x}_{i}\}_{i=1}^{N}. The reference solution φref(𝒙)\varphi_{\rm ref}(\bm{x}) is either an analytic solution or a numerical solution obtained by a traditional numerical method. In some numerical examples, we also plot point-wise absolute errors |φrefφNN||\varphi_{\rm ref}-\varphi_{\rm NN}| at the grid points.
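For reference, a small NumPy helper (ours) that evaluates both metrics from sampled values:

```python
import numpy as np

def l2_errors(phi_nn, phi_ref):
    """Return the (absolute) l2-error and the relative l2-error on test samples."""
    phi_nn, phi_ref = np.asarray(phi_nn), np.asarray(phi_ref)
    abs_err = np.sqrt(np.mean((phi_nn - phi_ref) ** 2))
    rel_err = np.sqrt(np.sum((phi_nn - phi_ref) ** 2) / np.sum(phi_ref ** 2))
    return abs_err, rel_err
```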

4.1 Poisson equations

Although the proposed method is designed for evolutionary equations, it can also be used to compute equilibrium solutions for elliptic problems. In this subsection, we compare the performance of the EVNN method with two classical neural network-based algorithms, PINN and DRM, in the context of solving Poisson equations.

We first consider a 2D Poisson equation with a Dirichlet boundary condition

(53) \begin{cases}-\Delta u(\bm{x})=f(\bm{x}),\quad\bm{x}\in\Omega\subset\mathbb{R}^{d},\\ u(\bm{x})=g(\bm{x}),\quad\bm{x}\in\partial\Omega.\end{cases}

Since EVNN is developed for evolution equations, we solve the following L2L^{2}-gradient flow

(54) \frac{\mathrm{d}}{\mathrm{d}t}\left(\int_{\Omega}\frac{1}{2}|\nabla u|^{2}-f(\bm{x})u(\bm{x})\mathrm{d}\bm{x}+\lambda\int_{\partial\Omega}|u(\bm{x})-g(\bm{x})|^{2}\mathrm{d}S\right)=-\int_{\Omega}|u_{t}|^{2}\mathrm{d}\bm{x},

to get a solution of the Poisson equation (53). Here, λΩ|ug(𝒙)|2dS\lambda\int_{\partial\Omega}|u-g(\bm{x})|^{2}\mathrm{d}S is the surface energy that enforces the Dirichlet boundary condition. The corresponding gradient flow equation is given by

(55) u_{t}=\Delta u+f(\bm{x}),

subject to a Robin boundary condition u(𝒙)g(𝒙)=12λu𝒏u(\bm{x})-g(\bm{x})=\frac{1}{2\lambda}\frac{\partial u}{\partial{\bm{n}}}, where 𝒏{\bm{n}} is the outer normal of Ω\partial\Omega. One can recover the original Dirichlet boundary condition by letting λ\lambda\rightarrow\infty. Such a penalty approach is also used in PINN and DRM, and we take λ=500\lambda=500 in all numerical experiments below.

We consider the following two cases:

  • Case 1: Ω=(0,π)×(π/2,π/2)\Omega=(0,\pi)\times(-\pi/2,\pi/2), f(𝒙)=2sinxcosyf(\bm{x})=2\sin x\cos y and g(𝒙)=0g(\bm{x})=0. The exact solution is u(𝒙)=sinxcosyu(\bm{x})=\sin x\cos y.

  • Case 2: Ω={(x,y)||𝒙|1}\Omega=\{(x,y)~{}|~{}|\bm{x}|\leq 1\}, f(𝒙)=π24sin(π2(1|𝒙|))+π2|𝒙|cos(π2(1|𝒙|))f(\bm{x})=\frac{\pi^{2}}{4}\sin(\frac{\pi}{2}(1-|\bm{x}|))+\frac{\pi}{2|\bm{x}|}\cos(\frac{\pi}{2}(1-|\bm{x}|)) and g(𝒙)=0g(\bm{x})=0. The exact solution is u(𝒙)=sin(π2(1|𝒙|))u(\bm{x})=\sin(\frac{\pi}{2}(1-|\bm{x}|)).

We employ a 1-block ResNet with 20 hidden nodes and one hidden layer for all cases. The total number of parameters is 501. We apply Xavier initialization [26] to initialize the weights of the neural networks in all cases. To evaluate the integrals in all methods, for case 1 we generate 2500 samples in Ω\Omega using Latin hypercube sampling (LHS) and 50 samples on each edge of Ω\partial\Omega using a 1D uniform sampling. For case 2, we generate 2500 samples in (1,1)2(-1,1)^{2} using LHS, but only use the samples that satisfy x2+y2<1x^{2}+y^{2}<1 as training samples. Additionally, we generate 200 training samples (cosθi,sinθi)i=1200(\cos\theta_{i},\sin\theta_{i})_{i=1}^{200} on the boundary, with {θi}i=1200\{\theta_{i}\}_{i=1}^{200} drawn from a uniform distribution on (0,2π)(0,2\pi). For PINN and DRM, we use Adam to minimize the loss function and use a different set of samples at each iteration. For EVNN, we use a different set of samples at each time step and employ L-BFGS to solve the optimization problem (30).
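To make the workflow concrete, the following sketch performs one EVNN update with fixed samples and L-BFGS by minimizing a Monte Carlo estimate of the implicit-Euler (minimizing-movement) functional built from the penalized energy (54). All names, the quadrature weights, and the optimizer settings are illustrative choices of ours, not the exact implementation behind the reported results.

import torch

def evnn_step(model, u_old, x_in, x_bd, f, g, tau, lam, w_in, w_bd, n_iter=50):
    """One EVNN time step for the penalized energy (54), a minimal sketch.

    model : neural network u_Theta(x); u_old : values of the previous-step
    solution at the interior samples x_in (precomputed and detached);
    w_in, w_bd : quadrature weights, e.g. |Omega|/N and |boundary|/M.
    """
    optimizer = torch.optim.LBFGS(model.parameters(), max_iter=n_iter)

    def closure():
        optimizer.zero_grad()
        x = x_in.detach().requires_grad_(True)
        u = model(x).squeeze()
        grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
        # Free energy: 1/2 |grad u|^2 - f u in Omega, plus the boundary penalty.
        energy = w_in * (0.5 * (grad_u ** 2).sum(dim=1) - f(x).squeeze() * u).sum()
        energy = energy + lam * w_bd * ((model(x_bd).squeeze() - g(x_bd).squeeze()) ** 2).sum()
        # Implicit-Euler (minimizing movement) term 1/(2*tau) ||u - u^n||^2.
        movement = w_in * ((u - u_old) ** 2).sum() / (2.0 * tau)
        loss = movement + energy
        loss.backward()
        return loss

    optimizer.step(closure)
    return model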

Refer to caption
Figure 2: Relative l2l^{2} error with respect to CPU time for different methods for 2D Poisson equations.

Due to the randomness arising from the stochastic initialization of the neural network and the sampling process, we repeated each method 10 times and plotted the mean and standard error of the relative l2l_{2}-errors. Fig. 2 shows the relative l2l_{2}-errors of the different methods with respect to the CPU time for both cases. The training was executed on a MacBook Pro equipped with an M2 chip. In Case 1, the relative l2l_{2} error of each method is evaluated on a test set comprising a uniform grid of 101×101 points within the domain (0,1)2(0,1)^{2}. For Case 2, the test set is defined as the collection of points {(xi,yj)|xi2+yj2<1}\{(x_{i},y_{j})|x_{i}^{2}+y_{j}^{2}<1\}, where (xi,yj)(x_{i},y_{j}) forms a uniform 201×201 grid on (1,1)2(-1,1)^{2}. The CPU time is computed by averaging the time of the 10 trials at the nn-th iteration. The numerical results clearly show that the proposed method is more efficient than both PINN and DRM. Through the introduction of temporal relaxation, it significantly improves the efficiency and test accuracy of DRM. While PINN may achieve better test accuracy in Case 2, it exhibits a larger standard error in both cases.

Next, we consider a high-dimensional Poisson equation

(56) Δu=f(𝒙),𝒙Ω=(1,1)d,-\Delta u=f(\bm{x})~{},\quad\bm{x}\in\Omega=(-1,1)^{d}~{},\\

with a homogeneous Neumann boundary condition un|Ω=0\frac{\partial u}{\partial n}|_{\partial\Omega}=0. We take f(𝒙)=π2k=1dcos(πxk)f(\bm{x})=\pi^{2}\sum_{k=1}^{d}\cos(\pi x_{k}). Similar numerical examples are tested in [21, 53]. The exact solution to this high-dimensional Poisson equation is u(𝒙)=k=1dcos(πxk)u(\bm{x})=\sum_{k=1}^{d}\cos(\pi x_{k}). Following [53], we solve an L2L^{2}-gradient flow associated with the free energy

(57) [u]=(1,1)d12|u|2fud𝒙+λ((1,1)dud𝒙)2,\mathcal{F}[u]=\textstyle\int_{(-1,1)^{d}}\frac{1}{2}|\nabla u|^{2}-fu~{}\mathrm{d}\bm{x}+\lambda\left(\int_{(-1,1)^{d}}u~{}\mathrm{d}\bm{x}\right)^{2},

where the last term enforces [1,1]dud𝒙=0\int_{[-1,1]^{d}}u~{}\mathrm{d}\bm{x}=0. We take λ=0.5\lambda=0.5 as in [53] in all numerical tests.
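Under the assumption of uniform sampling on (-1,1)^d, a Monte Carlo estimate of the free energy (57), including the mean-zero penalty, can be written as in the following sketch (ours; the function and variable names are illustrative):

import torch

def free_energy_57(model, x, d, lam=0.5):
    """Monte Carlo estimate of (57) from uniform samples x in (-1,1)^d.

    model : neural network u(x); x : tensor of shape (N, d).
    """
    vol = 2.0 ** d                           # volume of (-1,1)^d
    x = x.requires_grad_(True)
    u = model(x).squeeze()
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    f = (torch.pi ** 2) * torch.cos(torch.pi * x).sum(dim=1)   # f(x) of this test problem
    bulk = vol * (0.5 * (grad_u ** 2).sum(dim=1) - f * u).mean()
    penalty = lam * (vol * u.mean()) ** 2    # enforces zero mean of u
    return bulk + penalty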

Table 1: Relative l2l^{2}-error of high-dimensional Poisson equation with different settings in different dimensions. The first column shows the number of residual blocks, the number of nodes in each fully connected layer, and the number of samples.
Setting d=4d=4 d=8d=8 d=16d=16 d=32d=32
(2, 10, 1000) 0.024 0.046 0.623 0.867
(3, 60, 1000) 0.042 0.048 0.077 0.117
(3, 60, 10000) 0.018 0.022 0.036 0.077

Since it is difficult to obtain an accurate estimate of high-dimensional integrals, we choose Adam to minimize (30) at each time step and use a different set of samples at each Adam step. This approach allows us to explore the high-dimensional space with a low computational cost. The samples are drawn using LHS. Table 1 shows the final relative l2l^{2}-error with different neural network settings in different spatial dimensions. The relative l2l^{2}-error is evaluated on a test set that comprises 40000 samples generated by LHS. The model is trained with 200 outer iterations (τ=0.01\tau=0.01) and at most 200 iterations for the inner optimization (30). An early stopping criterion is applied if the l2l_{2}-norm of the change in model parameters between two consecutive epochs is less than 10^{-6}. The first column of Table 1 specifies the numerical setting as follows: the entry (2, 10) means that a neural network with 2 residual blocks, each containing two fully connected layers with 10 nodes, is used, and 1000 represents the number of samples drawn in each epoch. As we can see, the proposed method achieves results comparable to those reported for similar examples in previous work on DRM [21, 53]. It can be observed that increasing the width of the neural network improves test accuracy in high dimensions, as it enhances the network’s expressive power. Furthermore, increasing the number of training samples also significantly improves test accuracy in high dimensions. We will explore the effects of τ\tau, the number of samples, and the neural network architecture for the high-dimensional Poisson equation in future work.

4.2 L2L^{2}-gradient flow

In this subsection, we apply the EVNN scheme to two L2L^{2}-gradient flows to demonstrate its energy stability and numerical accuracy.

4.2.1 Heat equation

We first consider an initial-boundary value problem of a heat equation:

(58) {ut=Δu(𝐱),𝐱Ω=(0,2)2,t(0,T],u(𝒙,t)=0,𝒙Ω,t(0,T],u(𝒙,0)=sin(π2x1)sin(π2x2),𝒙Ω=(0,2)2,\begin{cases}&u_{t}=\Delta u({\bf x}),\quad{\bf x}\in\Omega=(0,2)^{2}~{},\quad t\in(0,T]~{},\\ &u(\bm{x},t)=0,\quad\bm{x}\in\partial\Omega~{},\quad t\in(0,T]~{},\\ &u(\bm{x},0)=\sin\left(\frac{\pi}{2}x_{1}\right)\sin\left(\frac{\pi}{2}x_{2}\right)~{},\quad\bm{x}\in\Omega=(0,2)^{2}~{},\\ \end{cases}

which can be interpreted as an L2L^{2}-gradient flow satisfying the energy-dissipation law

(59) ddt(Ω12|u|2d𝒙+λΩ|u|2dS)=Ω|ut|2d𝒙.\textstyle\frac{\mathrm{d}}{\mathrm{d}t}\left(\int_{\Omega}\frac{1}{2}\left|\nabla u\right|^{2}\mathrm{d}\bm{x}+\lambda\int_{\partial\Omega}|u|^{2}\mathrm{d}S\right)=-\int_{\Omega}\left|u_{t}\right|^{2}\mathrm{d}\bm{x}~{}.

Again, a surface energy term is added to enforce the Dirichlet boundary condition.

Refer to caption
Figure 3: Numerical results for the heat equation. (a)-(d) are the neural network solutions at t=0.01, 0.2, 0.4, and 0.6. (e)-(h) are the absolute differences between the EVNN solutions and the FDM solutions. (i) The evolution of the discrete free energy with respect to time for both the EVNN and FDM solutions.

In the numerical simulation, we take τ=0.01\tau=0.01 and use a 1-block ResNet with 20 nodes in each layer. The nonlinear activation function is chosen to be tanh\tanh. The total number of parameters is 501. To achieve energy stability, we fix the training samples to eliminate the numerical fluctuation arising from estimating the integrals in (30). The set of training samples comprises a 301×301 uniform grid on (0,2)2(0,2)^{2}, and an additional 1000 uniformly spaced grid points on each edge of the boundary. To test the accuracy of the solution over time, we compare the neural network solution with a numerical solution obtained by a finite difference method (FDM). We apply the standard central finite difference and an implicit Euler method to obtain the FDM solution. The numerical scheme can be reformulated as an optimization problem:

(60) uhn+1=argminuh𝒞12τuhuhnh2+h(uh),h(uh)=12huh,huh.u_{h}^{n+1}=\mathop{\arg\min}_{u_{h}\in\mathcal{C}}\frac{1}{2\tau}\|u_{h}-u_{h}^{n}\|^{2}_{h}+\mathcal{F}_{h}(u_{h}),\quad\mathcal{F}_{h}(u_{h})=\frac{1}{2}\langle\nabla_{h}u_{h},\nabla_{h}u_{h}\rangle_{*}~{}.

Here, 𝒞={ui,j|0<i<I,0<j<J}\mathcal{C}=\{u_{i,j}~{}|~{}0<i<I,\quad 0<j<J\} is the set of grid functions defined on a uniform mesh, h\nabla_{h} is the discrete gradient, h\mathcal{F}_{h} is the discrete free energy, h\|\cdot\|_{h} is the discrete L2L_{2}-norm induced by the discrete L2L_{2}-inner product f,gh=h2i,j=1N1fi,jgi,j\langle f,g\rangle_{h}=h^{2}\sum_{i,j=1}^{N-1}f_{i,j}g_{i,j}, and ,\langle\cdot,\cdot\rangle_{*} is the discrete inner product defined on the staggered mesh points [49, 66]. We use a 101×101 uniform grid and take τ=0.01\tau=0.01 in the FDM simulation. Figure 3(a)-(d) shows the NN solution at t=0.01, 0.2, 0.4, and 0.6. The corresponding point-wise absolute errors are shown in Fig. 3(e)-(h). The evolution of the discrete free energy (evaluated on the 101×101 uniform grid) for both methods is shown in Fig. 3(i). Clearly, the neural-network-based algorithm achieves energy stability and the result is consistent with the FDM. It is worth mentioning that the size of the optimization problem in the neural network-based algorithm is much smaller than that in the FDM, although (60) is a quadratic optimization problem and can be solved directly by solving a linear system for ui,ju_{i,j}.
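Indeed, since the objective in (60) is quadratic, each FDM step is equivalent to solving the linear system (I - \tau\Delta_{h})u^{n+1} = u^{n} for the interior grid values. A small sparse-matrix sketch (ours; homogeneous Dirichlet data and illustrative names) is:

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def implicit_euler_heat_step(u_n, h, tau):
    """One implicit Euler step for u_t = Laplace(u) with homogeneous Dirichlet BCs.

    u_n : (m, m) array of interior grid values; h : mesh size; tau : time step.
    Solving (I - tau * Laplace_h) u^{n+1} = u^n is equivalent to the quadratic
    minimization (60).
    """
    m = u_n.shape[0]
    # 1D second-difference matrix with Dirichlet boundary conditions.
    lap_1d = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(m, m)) / h**2
    eye = sp.identity(m)
    lap_2d = sp.kron(eye, lap_1d) + sp.kron(lap_1d, eye)   # 2D discrete Laplacian
    A = sp.identity(m * m) - tau * lap_2d
    u_next = spla.spsolve(A.tocsr(), u_n.ravel())
    return u_next.reshape(m, m)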

4.2.2 Allen-Cahn equation

Next, we consider an Allen-Cahn type equation, which is extensively utilized in phase-field modeling, a versatile technique for solving interface problems arising in different disciplines [15]. In particular, we focus on the following Allen-Cahn equation on Ω=(1,1)2\Omega=(-1,1)^{2}:

(61) {φt(𝒙,t)=(F(φ)ε2φ(𝒙,t)+2W(φd𝒙A)),𝒙Ω,t>0,φ(𝒙,t)=1,𝒙Ω,t>0,φ(𝒙,0)=tanh(10(4x12+x220.5)),𝒙Ω.\left\{\begin{aligned} &\tfrac{\partial\varphi}{\partial t}(\bm{x},t)=-\left(\tfrac{F^{\prime}(\varphi)}{\varepsilon^{2}}-\triangle\varphi(\bm{x},t)+2W\left(\textstyle\int\varphi\mathrm{d}\bm{x}-A\right)\right)~{},~{}~{}~{}~{}\bm{x}\in\Omega~{},~{}~{}~{}~{}t>0~{},\\ &\varphi(\bm{x},t)=-1~{},\quad\bm{x}\in\partial\Omega~{},~{}~{}~{}~{}t>0~{},\\ &\varphi(\bm{x},0)=-\tanh(10(\sqrt{4x_{1}^{2}+x_{2}^{2}}-0.5))~{},~{}~{}~{}~{}\bm{x}\in\Omega~{}.\\ \end{aligned}\right.

where φ(𝒙,t)\varphi(\bm{x},t) is the phase-field variable, and the constants ϵ\epsilon, WW, and AA are model parameters. The Allen–Cahn equation (61) can be regarded as an L2L^{2}-gradient flow, derived from an energy-dissipation law

(62) ddt[φ]=Ω|φt|2d𝒙,[φ]=Ω12|φ|2+14ε2(φ21)2d𝒙+W(Ωφd𝒙A)2.\textstyle\frac{\mathrm{d}}{\mathrm{d}t}\mathcal{F}[\varphi]=-\int_{\Omega}|\varphi_{t}|^{2}\mathrm{d}\bm{x},\quad\mathcal{F}[\varphi]=\int_{\Omega}\frac{1}{2}\left|\nabla\varphi\right|^{2}+\frac{1}{4\varepsilon^{2}}\left(\varphi^{2}-1\right)^{2}\mathrm{d}\bm{x}+W\left(\textstyle\int_{\Omega}\varphi\mathrm{d}\bm{x}-A\right)^{2}.

Here, the last term in the energy is a penalty term for the volume constraint φd𝒙=A\int\varphi\mathrm{d}\bm{x}=A, as the standard Allen-Cahn equation does not preserve the volume fraction φd𝒙\int\varphi\mathrm{d}\bm{x} over time. In the numerical simulation, we take A=(4πr2)+πr2A=-(4-\pi r^{2})+\pi r^{2} with r=0.5r=0.5, 1ϵ2=100\frac{1}{\epsilon^{2}}=100, and W=1000W=1000.

Refer to caption
Figure 4: Numerical results for the Allen–Cahn equation with a volume constraint. (a)-(d) NN solutions at t=0, 0.05, 0.1, and 0.3, respectively. (e)-(h) Absolute differences between the EVNN solutions and the FEM solutions at t=0, 0.05, 0.1, and 0.3, respectively. (i) Evolution of the numerical free energies with respect to time for both the FEM and EVNN solutions.

Due to the complexity of the initial condition, we utilize a larger neural network than the one employed in the previous subsection. Our choice is a 1-block ResNet, which has two fully connected layers, each containing 20 nodes. The total number of parameters is 921. The nonlinear activation function is chosen to be tanh\tanh. The set of training samples comprises a 301×301 uniform grid on (1,1)2(-1,1)^{2}, and an additional 1000 uniformly spaced grid points on each edge of the boundary. To test the numerical accuracy of the EVNN scheme for this problem, we also solve Eq. (61) by a finite element method, which approximates the phase-field variable φ\varphi by a piece-wise linear function φh(𝒙,t)=i=1Nγi(t)ψi(𝒙),\varphi_{h}(\bm{x},t)=\sum_{i=1}^{N}\gamma_{i}(t)\psi_{i}(\bm{x})~{}, where ψi(𝒙)\psi_{i}(\bm{x}) are hat functions supported on the computational mesh. Inserting φh\varphi_{h} into the continuous energy–dissipation law (62), we get a discrete energy–dissipation law with the discrete energy and dissipation given by

hFEM(𝜸)=e=1Neτe12|i=1Nγiψi(𝒙)|2+1ϵ2i=1N(γi21)2ψi(𝒙)d𝒙,\mathcal{F}_{h}^{\rm{FEM}}({\bm{\gamma}})=\textstyle\sum_{e=1}^{N_{e}}\int_{\tau_{e}}\frac{1}{2}\Big{|}\sum_{i=1}^{N}\gamma_{i}\nabla\psi_{i}(\bm{x})\Big{|}^{2}+\frac{1}{\epsilon^{2}}\sum_{i=1}^{N}(\gamma_{i}^{2}-1)^{2}\psi_{i}(\bm{x})\mathrm{d}\bm{x}~{},

and 𝒟hFEM(𝜸,𝜸)=e=1Neτe|i=1Nγi(t)ψi(𝒙)|2d𝒙\mathcal{D}_{h}^{\rm{FEM}}({\bm{\gamma}},{\bm{\gamma}}^{\prime})=\sum_{e=1}^{N_{e}}\int_{\tau_{e}}\Big{|}\sum_{i=1}^{N}\gamma_{i}^{\prime}(t)\psi_{i}(\bm{x})\Big{|}^{2}\mathrm{d}\bm{x} respectively. Here τe\tau_{e} is used to denote a finite element cell, and NeN_{e} is the number of cells. This form of discretization was used in [79], which constructs a piece-wise linear approximation to the nonlinear term in the discrete energy. We can update γi\gamma_{i} at each time step by solving the optimization problem (27), i.e.,

(63) 𝜸n+1=argmin𝜸N12τ𝖣(𝜸𝜸n)(𝜸𝜸n)+h(𝜸),{\bm{\gamma}}^{n+1}=\mathop{\arg\min}_{{\bm{\gamma}}\in\mathbb{R}^{N}}\frac{1}{2\tau}{\sf D}({\bm{\gamma}}-{\bm{\gamma}}^{n})\cdot({\bm{\gamma}}-{\bm{\gamma}}^{n})+\mathcal{F}_{h}({\bm{\gamma}})\ ,

where 𝖣ij=Ωψiψjd𝒙{\sf D}_{ij}=\int_{\Omega}\psi_{i}\psi_{j}\mathrm{d}\bm{x} is the mass matrix. The optimization problem (63) is solved by L-BFGS in our numerical implementation.
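For reference, the FEM update (63) can be carried out with an off-the-shelf quasi-Newton solver once the mass matrix 𝖣 and the discrete energy (together with its gradient) have been assembled. The sketch below treats the assembly as given callables; it only illustrates the structure of the solve and is not the implementation used for the reported results.

import numpy as np
from scipy.optimize import minimize

def fem_step(gamma_n, D, F_h, grad_F_h, tau):
    """One step of (63): minimize 1/(2*tau) (g - g^n)^T D (g - g^n) + F_h(g).

    D : symmetric FEM mass matrix (dense or sparse); F_h, grad_F_h : assumed
    callables returning the discrete free energy and its gradient with respect
    to the nodal values gamma.
    """
    def objective(gamma):
        diff = gamma - gamma_n
        return 0.5 / tau * diff @ (D @ diff) + F_h(gamma)

    def gradient(gamma):
        # Gradient of the quadratic term is D (gamma - gamma_n)/tau for symmetric D.
        return (D @ (gamma - gamma_n)) / tau + grad_F_h(gamma)

    res = minimize(objective, gamma_n, jac=gradient, method="L-BFGS-B")
    return res.x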

The simulation results are summarized in Fig. 4, which clearly shows that our method achieves results comparable to those of the FEM. Numerical simulation of phase-field type models is often challenging: to capture the dynamics of the evolution of the thin diffuse interface, the mesh size should be much smaller than ϵ\epsilon, the width of the diffuse interface. Traditional numerical methods often use an adaptive or moving mesh approach to overcome this difficulty. In contrast, a neural network-based numerical scheme is mesh-free. The number of parameters of the neural network can be much smaller than the number of samples needed to resolve the diffuse interface. Consequently, the dimension of the optimization problem in the NN-based scheme is much smaller than in the FEM scheme without using adaptive or moving meshes.

4.3 Generalized diffusions

In this subsection, we apply the proposed Lagrangian EVNN scheme to solving a Fokker-Planck equation and a porous medium equation.

4.3.1 Fokker-Planck equation

We first consider a Fokker-Planck equation

(64) {ρt=(ρ+ρV),𝒙d,t(0,T],ρ(𝒙,0)=ρ0(𝒙),𝒙d,\begin{cases}&\rho_{t}=\nabla\cdot(\nabla\rho+\rho\nabla V)~{},\quad\bm{x}\in\mathbb{R}^{d}~{},\quad t\in(0,T]~{},\\ &\rho(\bm{x},0)=\rho_{0}(\bm{x})~{},\quad\quad\quad\quad\bm{x}\in\mathbb{R}^{d}~{},\\ \end{cases}

where V(𝒙)V(\bm{x}) is a prescribed potential energy. The Fokker-Planck equation can be viewed as a generalized diffusion satisfying the energy-dissipation law

(65) ddtdρlnρ+ρV(𝒙)d𝒙=dρ|𝐮|2d𝒙,\frac{\mathrm{d}}{\mathrm{d}t}\int_{\mathbb{R}^{d}}\rho\ln\rho+\rho V(\bm{x})\mathrm{d}\bm{x}=-\int_{\mathbb{R}^{d}}\rho|\mathbf{u}|^{2}\mathrm{d}\bm{x}~{},

where the probability density ρ\rho satisfies the kinematics ρt+(ρ𝐮)=0\rho_{t}+\nabla\cdot(\rho\mathbf{u})=0.

We test our numerical scheme for solving the Fokker-Planck equation in 2D and 4D, respectively. For both numerical experiments, we adopt the augmented-ICNN proposed in [32]. We use 1 ICNN block of 6 fully connected layers with 32 hidden nodes in each layer. The activation function is chosen as the Gaussian-Softplus function σ(x)=2π(2x0x/2et2𝑑t+ex22+π2x)\sigma(x)=\textstyle\sqrt{\tfrac{2}{\pi}}(\sqrt{2}x\int_{0}^{x/\sqrt{2}}e^{-t^{2}}dt+e^{-\frac{x^{2}}{2}}+\sqrt{\tfrac{\pi}{2}}x). In addition, we use L-BFGS to solve the optimization problem at each time step.
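Written in terms of the error function, this activation equals \sigma(x)=x\,\mathrm{erf}(x/\sqrt{2})+\sqrt{2/\pi}\,e^{-x^{2}/2}+x, whose derivative 1+\mathrm{erf}(x/\sqrt{2}) is non-negative and non-decreasing, so σ is convex and non-decreasing, as required for ICNN activations. A minimal PyTorch implementation (ours, for illustration):

import math
import torch

def gaussian_softplus(x):
    """sigma(x) = x*erf(x/sqrt(2)) + sqrt(2/pi)*exp(-x^2/2) + x.

    Its derivative is 1 + erf(x/sqrt(2)), which is non-negative and
    non-decreasing, so sigma is convex and non-decreasing.
    """
    return x * torch.erf(x / math.sqrt(2.0)) \
        + math.sqrt(2.0 / math.pi) * torch.exp(-0.5 * x ** 2) + x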

Refer to caption
Figure 5: Numerical results for the 2D Fokker–Planck equation. (a)-(d) Predicted solution on a 301×301 uniform mesh on (3,3)2(-3,3)^{2} at t=0, 0.1, 0.5, and 1, respectively. (e)-(f) The absolute and relative l2l_{2}-errors of the solution over time, evaluated on the uniform mesh.

In the 2D case, we consider V(𝒙)V(\bm{x}) as follows:

V=12(𝒙μtarget)TΣtarget1(𝒙μtarget)withμtarget=(13,13),\textstyle V=\frac{1}{2}(\bm{x}-\mu_{target})^{T}\Sigma^{-1}_{target}(\bm{x}-\mu_{target})~{}~{}{\rm with}~{}~{}\mu_{target}=\left(\frac{1}{3},\frac{1}{3}\right)~{},

and Σtarget=[58383858]\Sigma_{target}=\begin{bmatrix}\frac{5}{8}&-\frac{3}{8}\\ -\frac{3}{8}&\frac{5}{8}\end{bmatrix}. The initial condition ρ0(𝒙)\rho_{0}(\bm{x}) is set to be the 2D standard Gaussian 𝒩(𝟎,𝐈)\mathcal{N}({\bf 0},{\bf I}). The exact solution of Eq. (64) takes the following analytical form [33]

(66) ρ(𝒙,t)𝒩(μ(t),Σ(t)),\rho(\bm{x},t)\sim\mathcal{N}(\mu(t),\Sigma(t))~{},

where μ(t)=(1𝐞4t)μtarget\mu(t)=(1-\mathbf{e}^{-4t})\mu_{target} and Σ(t)=[58+38×e8t38+38×e8t38+38×e8t58+38×e8t]\Sigma(t)=\begin{bmatrix}\frac{5}{8}+\frac{3}{8}\times e^{-8t}&-\frac{3}{8}+\frac{3}{8}\times e^{-8t}\\ -\frac{3}{8}+\frac{3}{8}\times e^{-8t}&\frac{5}{8}+\frac{3}{8}\times e^{-8t}\end{bmatrix}. We draw 10000 samples from ρ0(𝒙)\rho_{0}(\bm{x}) as the training set. As an advantage of the neural network-based algorithm, we can compute ρn(𝒙)\rho^{n}(\bm{x}) pointwise through a function composition ρn(𝒙)=ρ0~(Φn)1(𝒙)\rho^{n}(\bm{x})=\rho_{0}\tilde{\circ}(\Phi^{n})^{-1}(\bm{x}), where (Φn)1(𝒙)(\Phi^{n})^{-1}(\bm{x}) can be computed by solving a convex optimization problem, i.e., Φ1(𝒙)=argmin𝒚Φ(𝒚)𝒙T𝒚\Phi^{-1}(\bm{x})=\mathop{\arg\min}_{{\bm{y}}}\Phi({\bm{y}})-\bm{x}^{\rm T}{\bm{y}}. Fig. 5(a)-(d) shows the predicted density on a 301×301 uniform grid on (3,3)2(-3,3)^{2}. The absolute and relative l2l_{2}-errors on the grid are shown in Figs. 5(e)-(f). It can be noticed that the relative l2l^{2}-error for the predicted solution is around 10210^{-2}, which is quite accurate given the limited number of samples used.
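To make the inversion step concrete, the following per-point sketch solves argmin_y Φ(y) - x^T y by gradient-based optimization, assuming Φ is the trained convex potential. The plain Adam loop and all names are illustrative; in practice a Newton-type or L-BFGS solver may be preferable for this convex problem.

import torch

def invert_potential_map(phi, x, n_iter=200, lr=0.1):
    """Solve argmin_y phi(y) - <x, y> for each row of x (a convex problem).

    phi : trained convex potential Phi(y) (e.g. an ICNN), returning shape (N, 1);
    x   : tensor of target points, shape (N, d).
    The minimizer y satisfies grad Phi(y) = x, i.e. y is the preimage of x
    under the gradient map.
    """
    y = x.clone().detach().requires_grad_(True)   # warm start at x itself
    optimizer = torch.optim.Adam([y], lr=lr)
    for _ in range(n_iter):
        optimizer.zero_grad()
        loss = (phi(y).squeeze() - (x * y).sum(dim=1)).sum()
        loss.backward()
        optimizer.step()
    return y.detach()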

Refer to caption
Figure 6: Numerical results for the 4D Fokker–Planck equation. (a)-(d) Distribution of samples, projected onto the x1x_{1}-x2x_{2} plane, together with their weights, at t=0, 0.1, 0.5, and 1, respectively. (e)-(h) Distribution of samples, projected onto the x3x_{3}-x4x_{4} plane, together with their weights, at t=0, 0.1, 0.5, and 1, respectively. (i)-(j) The l2l_{2}-errors and the relative l2l^{2}-errors of the solution over time, evaluated on the training set.

In the 4D case, we take V(𝒙)V(\bm{x}) to be

V=12(𝒙μtarget)TΣtarget1(𝒙μtarget)withμtarget=(13,13,0,0),V=\tfrac{1}{2}(\bm{x}-\mu_{target})^{T}\Sigma^{-1}_{target}(\bm{x}-\mu_{target})\quad{\rm with}\quad\mu_{target}=\left(\tfrac{1}{3},\tfrac{1}{3},0,0\right)~{},

and Σtarget=[58383858][1001]\Sigma_{target}=\begin{bmatrix}\frac{5}{8}&-\frac{3}{8}\\ -\frac{3}{8}&\frac{5}{8}\end{bmatrix}\bigoplus\begin{bmatrix}1&0\\ 0&1\end{bmatrix}. The initial condition ρ0(𝒙)\rho_{0}(\bm{x}) is set to be a 4D standard normal distribution 𝒩(𝟎,𝐈)\mathcal{N}({\bf 0},{\bf I}). The exact solution of Eq. (64) follows the normal distribution with μ(t)=((1𝐞4t)13,(1𝐞4t)13,0,0),\mu(t)=\left((1-\mathbf{e}^{-4t})\tfrac{1}{3},(1-\mathbf{e}^{-4t})\tfrac{1}{3},0,0\right), and Σ(t)=[58+38×e8t38+38×e8t38+38×e8t58+38×e8t][1001].\Sigma(t)=\begin{bmatrix}\frac{5}{8}+\frac{3}{8}\times e^{-8t}&-\frac{3}{8}+\frac{3}{8}\times e^{-8t}\\ -\frac{3}{8}+\frac{3}{8}\times e^{-8t}&\frac{5}{8}+\frac{3}{8}\times e^{-8t}\end{bmatrix}\bigoplus\begin{bmatrix}1&0\\ 0&1\end{bmatrix}. Here \bigoplus denotes the direct sum. As in the 2D case, we draw 10000 initial samples from ρ0(𝒙)\rho_{0}(\bm{x}). Fig. 6 shows the evolution of the samples as well as the values of the numerical solution on individual samples over space and time. Interestingly, the relative l2l^{2}-error in the 4D case is similar to that in the 2D case. This dimension-insensitive accuracy suggests the potential of the current method for handling higher-dimensional problems.

Fokker–Planck type equations have a wide range of applications in machine learning. One of the fundamental tasks in modern statistics and machine learning is to estimate or generate samples from a target distribution ρ(𝒙)\rho^{*}(\bm{x}), which might be completely known, partially known up to the normalizing constant, or given empirically by samples. Examples include Bayesian inference [7], numerical integration [54], space-filling design [61], density estimation [69], and generative learning [59]. These problems can be cast as an optimization problem: one seeks ρopt𝒬\rho^{\rm opt}\in\mathcal{Q} by solving

(67) ρopt=argminρ𝒬D(ρ||ρ),\rho^{\rm opt}=\mathop{\arg\min}_{\rho\in\mathcal{Q}}D(\rho||\rho^{*})~{},

where 𝒬\mathcal{Q} is the admissible set and D(p||q)D(p||q) is a dissimilarity function that measures the difference between two probability measures pp and qq. Classical dissimilarities include the Kullback–Leibler (KL) divergence and the Maximum Mean Discrepancy (MMD). The optimal solution of the optimization problem can be obtained by solving a Fokker–Planck type equation, given by

(68) ρt=(ρμ),μ=δD(ρ||ρ)δρ.\rho_{t}=\nabla\cdot(\rho\nabla\mu),\quad\mu=\frac{\delta D(\rho||\rho^{*})}{\delta\rho}.

The developed numerical approach has potential applications in these machine learning problems.

Refer to caption
Figure 7: The distribution of samples and their weights at different times for the Fokker–Planck equation with a complicated V(𝒙)V(\bm{x}), corresponding to an eight-component Gaussian mixture distribution. The initial 10000 samples are drawn from a standard normal distribution 𝒩(𝟎,𝐈)\mathcal{N}({\bf 0},{\bf I}).

To illustrate this point, we consider a toy problem that is widely used in the machine learning literature [32]. The goal is to sample from a target distribution ρ\rho^{*}, which is known up to a normalizing constant. Taking the dissimilarity function to be the KL divergence KL(ρ||ρ)=Ωρ(𝒙)ln(ρρ)d𝒙,{\rm KL}(\rho||\rho^{*})=\int_{\Omega}\rho(\bm{x})\ln\left(\frac{\rho}{\rho^{*}}\right)\mathrm{d}\bm{x}, Eq. (68) reduces to the Fokker–Planck equation (64) with V(𝒙)=lnρV(\bm{x})=-\ln\rho^{*}. In the numerical experiment, we take the target distribution ρ(𝒙)=18i=18N(𝒙|𝝁i,𝚺)\rho^{*}(\bm{x})=\frac{1}{8}\sum_{i=1}^{8}N(\bm{x}|\bm{\mu}_{i},\bm{\Sigma}), an eight-component Gaussian mixture distribution, where 𝝁1=(0,4)\bm{\mu}_{1}=(0,4), 𝝁2=(2.8,2.8)\bm{\mu}_{2}=(2.8,2.8), 𝝁3=(4,0)\bm{\mu}_{3}=(4,0), 𝝁4=(2.8,2.8)\bm{\mu}_{4}=(-2.8,2.8), 𝝁5=(4,0)\bm{\mu}_{5}=(-4,0), 𝝁6=(2.8,2.8)\bm{\mu}_{6}=(-2.8,-2.8), 𝝁7=(0,4)\bm{\mu}_{7}=(0,-4), 𝝁8=(2.8,2.8)\bm{\mu}_{8}=(2.8,-2.8), and 𝚺=diag(0.2,0.2)\bm{\Sigma}=\mathrm{diag}(0.2,0.2). The simulation results are summarized in Fig. 7, in which we first draw 10000 samples from a standard normal distribution 𝒩(𝟎,𝐈)\mathcal{N}({\bf 0},{\bf I}) and show the distribution of samples as well as their weights at different times. It can be noticed that the proposed neural network-based algorithm can generate weighted samples for a complicated target distribution.
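In this example the drift potential is V(𝒙) = -ln ρ*(𝒙), and since the dynamics only needs V up to an additive constant, the unnormalized log-density of the mixture suffices. A short sketch (ours, for illustration) using a log-sum-exp evaluation:

import torch

# The eight mixture means and the shared isotropic covariance diag(0.2, 0.2).
MU = torch.tensor([[0.0, 4.0], [2.8, 2.8], [4.0, 0.0], [-2.8, 2.8],
                   [-4.0, 0.0], [-2.8, -2.8], [0.0, -4.0], [2.8, -2.8]])
SIGMA2 = 0.2

def potential_V(x):
    """V(x) = -log rho*(x) up to an additive constant, rho* the 8-mode mixture."""
    # Squared distances to each mean: shape (N, 8).
    sq = ((x.unsqueeze(1) - MU.unsqueeze(0)) ** 2).sum(dim=-1)
    # Equal weights and covariances, so normalizing constants can be dropped.
    log_rho = torch.logsumexp(-0.5 * sq / SIGMA2, dim=1)
    return -log_rho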

4.3.2 Porous medium equation

Next, we consider a porous medium equation (PME), ρt=Δρα,\rho_{t}=\Delta\rho^{\alpha}~{}, where α>1\alpha>1 is a constant. The PME is a typical example of a nonlinear diffusion equation. One important feature of the PME is that its solution has compact support at any time t>0t>0 if the initial data has compact support. The free boundary of the compact support moves outward with a finite speed, known as the property of finite-speed propagation [71]. As a consequence, numerical simulations of the PME are often difficult for Eulerian methods, which may fail to capture the movement of the free boundary and suffer from numerical oscillations [50]. In a recent work [50], the authors developed a variational Lagrangian scheme using a finite element method. Here we show the ability of the Lagrangian EVNN scheme to solve the PME with a free boundary. Following [50], we employ the energy-dissipation law

(69) ddtdα(α1)(α2)ρα1d𝒙=d|𝐮|2d𝒙\frac{\mathrm{d}}{\mathrm{d}t}\int_{\mathbb{R}^{d}}\frac{\alpha}{(\alpha-1)(\alpha-2)}\rho^{\alpha-1}\mathrm{d}\bm{x}=-\int_{\mathbb{R}^{d}}|\mathbf{u}|^{2}\mathrm{d}\bm{x}~{}

to develop the EVNN scheme.

Refer to caption
Figure 8: Numerical results for the porous medium equation. (a)-(d) The numerical solution at t=0, 0.05, 0.25, and 0.5, respectively. (e)-(f) The L2L_{2}-errors of the solution.

We take α=4\alpha=4 in the simulation. To test the numerical accuracy of the EVNN scheme for solving the PME, we consider a 2D Barenblatt–Pattle solution to the PME. The Barenblatt–Pattle solution in dd-dimensional space is given by

(70) Bα(𝒙,t)=tk[(C0k(α1)2dα|𝒙|2t2k/d)+]1/(α1),𝒙d,B_{\alpha}(\bm{x},t)=\textstyle t^{-k}\bigl{[}\bigl{(}C_{0}-\frac{k(\alpha-1)}{2d\alpha}\frac{|\bm{x}|^{2}}{t^{2k/d}}\bigr{)}_{+}\bigr{]}^{1/(\alpha-1)}~{},\quad\bm{x}\in\mathbb{R}^{d}~{},

where k=(α1+2/d)1k=(\alpha-1+2/d)^{-1}, u+:=max(u,0)u_{+}:=\max(u,0), and C0C_{0} is a constant that is related to the initial mass. This solution is radially symmetric, self-similar, and has compact support |𝒙|ξα(t)|\bm{x}|\leq\xi_{\alpha}(t) for any finite time, where ξα(t)=2dαC0k(α1)tk/d\xi_{\alpha}(t)=\textstyle\sqrt{\frac{2d\alpha C_{0}}{k(\alpha-1)}}t^{k/d}. We take the Barenblatt solution (70) with C0=0.1C_{0}=0.1 at t=0.1t=0.1 as the initial condition, and compare the numerical solution at TT with the exact solution Bα(𝒙,T+0.1)B_{\alpha}(\bm{x},T+0.1). We choose {xi0}i=1N\{x_{i}^{0}\}_{i=1}^{N} as a set of uniform grid points inside a disk {(x,y)|x2+y2ξα(0.1)}\{(x,y)~{}|~{}\sqrt{x^{2}+y^{2}}\leq\xi_{\alpha}(0.1)\}. To visualize the numerical results, we also draw 500 points distributed with equal arc length on the initial free boundary and evolve them using the neural network. The numerical results with τ=0.005\tau=0.005 are shown in Fig. 8. It can be noticed that the proposed neural network-based algorithm approximates the Barenblatt–Pattle solution well and captures the movement of the free boundary.
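For completeness, the Barenblatt–Pattle profile (70) and the free-boundary radius ξ_α(t) used in the error evaluation can be coded directly; the following sketch (ours) is vectorized over sample points.

import numpy as np

def barenblatt(x, t, alpha=4, d=2, C0=0.1):
    """Barenblatt-Pattle solution (70) of rho_t = Laplace(rho^alpha) in R^d."""
    k = 1.0 / (alpha - 1.0 + 2.0 / d)
    r2 = np.sum(x ** 2, axis=-1)
    core = C0 - k * (alpha - 1.0) / (2.0 * d * alpha) * r2 / t ** (2.0 * k / d)
    return t ** (-k) * np.maximum(core, 0.0) ** (1.0 / (alpha - 1.0))

def free_boundary_radius(t, alpha=4, d=2, C0=0.1):
    """Support radius xi_alpha(t) of the Barenblatt-Pattle solution."""
    k = 1.0 / (alpha - 1.0 + 2.0 / d)
    return np.sqrt(2.0 * d * alpha * C0 / (k * (alpha - 1.0))) * t ** (k / d)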

We note that numerical methods for generalized diffusions (4) can also be formulated in the Eulerian frame of reference based on the notion of Wasserstein gradient flows [35]. However, it is often challenging to compute the Wasserstein distance between two probability densities efficiently and with high accuracy, which typically requires solving an additional min-max or minimization problem [5, 6, 9, 33]. In a recent study [33], the authors first employ a fully connected neural network to approximate ρ\rho and subsequently utilize an additional ICNN to calculate the Wasserstein distance between two probability densities. Compared with their approach, our method is more efficient and accurate.

5 Conclusion

In this paper, we develop structure-preserving EVNN schemes for simulating the L2L^{2}-gradient flows and the generalized diffusions by utilizing a neural network as a tool for spatial discretization, within the framework of the discrete energetic variational approach. These numerical schemes are directly constructed based on a prescribed continuous energy-dissipation law for the system, without the need for the underlying PDE. The incorporation of mesh-free neural network discretization opens up exciting possibilities for tackling high-dimensional gradient flows arising in different applications in the future.

Various numerical experiments are presented to demonstrate the accuracy and energy stability of the proposed numerical scheme. In our future work, we will explore the effects of different neural network architectures, sampling strategies, and optimization methods, followed by a detailed numerical analysis. Additionally, we intend to employ the EVNN scheme to investigate other complex fluid models, including the Cahn–Hilliard equation and Cahn–Hilliard–Navier–Stokes equations, as well as solve machine learning problems such as generative modeling and density estimation.

Acknowledgments

C. L. and Y. W. were partially supported by NSF DMS-1950868 and DMS-2153029. Z. X. was partially supported by NSF CDS&E-MSS 1854779.

References

  • [1] S. Adams, N. Dirr, M. Peletier, and J. Zimmer, Large deviations and gradient flows, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371 (2013), p. 20120341.
  • [2] B. Amos, L. Xu, and J. Z. Kolter, Input convex neural networks, in International Conference on Machine Learning, PMLR, 2017, pp. 146–155.
  • [3] A. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory, 39 (1993), pp. 930–945.
  • [4] R. Bellman, Dynamic Programming, Dover Publications, 1957.
  • [5] J.-D. Benamou and Y. Brenier, A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem, Numerische Mathematik, 84 (2000), pp. 375–393.
  • [6] J.-D. Benamou, G. Carlier, and M. Laborde, An augmented lagrangian approach to wasserstein gradient flows and applications, ESAIM: Proceedings and surveys, 54 (2016), pp. 1–17.
  • [7] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, Variational inference: A review for statisticians, J. Am. Stat. Assoc., 112 (2017), pp. 859–877.
  • [8] J. Bruna, B. Peherstorfer, and E. Vanden-Eijnden, Neural galerkin scheme with active learning for high-dimensional evolution equations, arXiv preprint arXiv:2203.01360, (2022).
  • [9] J. A. Carrillo, K. Craig, L. Wang, and C. Wei, Primal dual methods for wasserstein gradient flows, Foundations of Computational Mathematics, 22 (2022), pp. 389–443.
  • [10] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems, 2 (1989), pp. 303–314.
  • [11] P.-G. De Gennes and J. Prost, The physics of liquid crystals, no. 83, Oxford university press, 1993.
  • [12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, (2018).
  • [13] T. Dockhorn, A discussion on solving partial differential equations using neural networks, arXiv preprint arXiv:1904.07200, (2019).
  • [14] M. Doi, Onsager’s variational principle in soft matter, J. Phys.: Condens. Matter, 23 (2011), p. 284118.
  • [15] Q. Du and X. Feng, The phase field method for geometric moving interfaces and their numerical approximations, Handbook of Numerical Analysis, 21 (2020), pp. 425–508.
  • [16] Y. Du and T. A. Zaki, Evolutional deep neural network, Phys. Rev. E, 104 (2021), p. 045303.
  • [17] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios, Neural spline flows, Advances in neural information processing systems, 32 (2019).
  • [18] W. E, J. Han, and A. Jentzen, Algorithms for solving high dimensional pdes: From nonlinear Monte Carlo to machine learning, Nonlinearity, 35 (2021), p. 278.
  • [19] W. E, C. Ma, and L. Wu, Machine learning from a continuous viewpoint, i, Science China Mathematics, 63 (2020), pp. 2233–2266.
  • [20] W. E, C. Ma, L. Wu, and S. Wojtowytsch, Towards a mathematical understanding of neural network-based machine learning: What we know and what we don’t, CSIAM Transactions on Applied Mathematics, 1 (2020), pp. 561–615.
  • [21] W. E and B. Yu, The deep ritz method: A deep learning-based numerical algorithm for solving variational problems, Communications in Mathematics and Statistics, 6 (2018).
  • [22] B. Eisenberg, Y. Hyon, and C. Liu, Energy variational analysis of ions in water and channels: Field theory for primitive models of complex ionic fluids, The Journal of Chemical Physics, 133 (2010), p. 104104.
  • [23] J. L. Ericksen, Introduction to the Thermodynamics of Solids, vol. 275, Springer, 1998.
  • [24] M.-H. Giga, A. Kirshtein, and C. Liu, Variational modeling and complex fluids, Handbook of mathematical analysis in mechanics of viscous fluids, (2017), pp. 1–41.
  • [25] E. D. Giorgi, Movimenti minimizzanti, in Proceedings of the Conference on Aspetti e problemi della Matematica oggi, Lecce, 1992.
  • [26] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings, 2010, pp. 249–256.
  • [27] J. Han, A. Jentzen, and W. E, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505–8510.
  • [28] B. Hanin, Which neural net architectures give rise to exploding and vanishing gradients?, in NeurIPS, 2018.
  • [29] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [30]  , Identity mappings in deep residual networks, in European conference on computer vision, Springer, 2016, pp. 630–645.
  • [31] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal processing magazine, 29 (2012), pp. 82–97.
  • [32] C.-W. Huang, R. T. Chen, C. Tsirigotis, and A. Courville, Convex potential flows: Universal probability distributions with optimal transport and convex optimization, arXiv preprint arXiv:2012.05942, (2020).
  • [33] H. J. Hwang, C. Kim, M. S. Park, and H. Son, The deep minimizing movement scheme, arXiv preprint arXiv:2109.14851, (2021).
  • [34] K. Jiang and P. Zhang, Numerical methods for quasicrystals, J. Comput. Phys., 256 (2014), pp. 428–440.
  • [35] R. Jordan, D. Kinderlehrer, and F. Otto, The variational formulation of the Fokker–Planck equation, SIAM J. Math. Anal., 29 (1998), pp. 1–17.
  • [36] E. F. Keller and L. A. Segel, Initiation of slime mold aggregation viewed as an instability, Journal of theoretical biology, 26 (1970), pp. 399–415.
  • [37] E. Kharazmi, Z. Zhang, and G. E. Karniadakis, hp-vpinns: Variational physics-informed neural networks with domain decomposition, Computer Methods in Applied Mechanics and Engineering, 374 (2021), p. 113547.
  • [38] Y. Khoo, J. Lu, and L. Ying, Solving parametric PDE problems with artificial neural networks, European Journal of Applied Mathematics, 32 (2021), pp. 421–435.
  • [39] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, Improved variational inference with inverse autoregressive flow, Advances in neural information processing systems, 29 (2016).
  • [40] J. F. Kolen and S. C. Kremer, Gradient Flow in Recurrent Nets: The Difficulty of Learning LongTerm Dependencies, 2001, pp. 237–243.
  • [41] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [42] I. E. Lagaris, A. Likas, and D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE transactions on neural networks, 9 (1998), pp. 987–1000.
  • [43] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324.
  • [44] R. Lifshitz and H. Diamant, Soft quasicrystals–why are they stable?, Philosophical Magazine, 87 (2007), pp. 3021–3030.
  • [45] F. H. Lin and C. Liu, Static and dynamic theories of liquid crystals, J. Partial Differential Equations, 14 (2001), pp. 289–330.
  • [46] S. Lisini, D. Matthes, and G. Savaré, Cahn–Hilliard and thin film equations with nonlinear mobility as gradient flows in weighted-wasserstein metrics, Journal of differential equations, 253 (2012), pp. 814–850.
  • [47] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, A survey on deep learning in medical image analysis, Medical image analysis, 42 (2017), pp. 60–88.
  • [48] C. Liu and H. Sun, On energetic variational approaches in modeling the nematic liquid crystal flows, Discrete & Continuous Dynamical Systems, 23 (2009), p. 455.
  • [49] C. Liu, C. Wang, and Y. Wang, A structure-preserving, operator splitting scheme for reaction-diffusion equations with detailed balance, Journal of Computational Physics, 436 (2021), p. 110253.
  • [50] C. Liu and Y. Wang, On lagrangian schemes for porous medium type generalized diffusion equations: a discrete energetic variational approach, J. Comput. Phys., 417 (2020), p. 109566.
  • [51]  , A variational lagrangian scheme for a phase-field model: A discrete energetic variational approach, SIAM J. Sci. Comput., 42 (2020), pp. B1541–B1569.
  • [52] L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis, Deepxde: A deep learning library for solving differential equations, SIAM Review, 63 (2021), pp. 208–228.
  • [53] Y. Lu, J. Lu, and M. Wang, A priori generalization analysis of the deep ritz method for solving high dimensional elliptic partial differential equations, in Conference on Learning Theory, PMLR, 2021, pp. 3196–3241.
  • [54] T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák, Neural importance sampling, ACM Transactions on Graphics (TOG), 38 (2019), pp. 1–19.
  • [55] J. Noh, Y. Wang, H.-L. Liang, V. S. R. Jampani, A. Majumdar, and J. P. Lagerwall, Dynamic tuning of the director field in liquid crystal shells using block copolymers, Phys. Rev. Res., 2 (2020), p. 033160.
  • [56] T. Ohta and K. Kawasaki, Equilibrium morphology of block copolymer melts, Macromolecules, 19 (1986), pp. 2621–2632.
  • [57] L. Onsager, Reciprocal relations in irreversible processes. I., Phys. Rev., 37 (1931), p. 405.
  • [58]  , Reciprocal relations in irreversible processes. II., Phys. Rev., 38 (1931), p. 2265.
  • [59] G. Papamakarios, E. T. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan, Normalizing flows for probabilistic modeling and inference., J. Mach. Learn. Res., 22 (2021), pp. 1–64.
  • [60] M. A. Peletier, Variational modelling: Energies, gradient flows, and large deviations, arXiv preprint arXiv:1402.1990, (2014).
  • [61] L. Pronzato and W. G. Müller, Design of computer experiments: space filling and beyond, Statistics and Computing, 22 (2012), pp. 681–701.
  • [62] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics informed deep learning (part i): Data-driven solutions of nonlinear partial differential equations, arXiv preprint arXiv:1711.10561, (2017).
  • [63] L. Rayleigh, Some general theorems relating to vibrations, Proceedings of the London Mathematical Society, 4 (1873), pp. 357–368.
  • [64] D. Rezende and S. Mohamed, Variational inference with normalizing flows, in International conference on machine learning, PMLR, 2015, pp. 1530–1538.
  • [65] G. M. Rotskoff, A. R. Mitchell, and E. Vanden-Eijnden, Active importance sampling for variational objectives dominated by rare events: Consequences for optimization and generalization, arXiv preprint arXiv:2008.06334, (2020).
  • [66] A. J. Salgado and S. M. Wise, Classical Numerical Analysis: A Comprehensive Course, Cambridge University Press, 2022.
  • [67] Z. Shen, H. Yang, and S. Zhang, Nonlinear approximation via compositions, arXiv preprint arXiv:1902.10170, (2019).
  • [68] J. Sirignano and K. Spiliopoulos, Dgm: A deep learning algorithm for solving partial differential equations, J. Comput. Phys., 375 (2018), pp. 1339–1364.
  • [69] E. G. Tabak, E. Vanden-Eijnden, et al., Density estimation by dual ascent of the log-likelihood, Communications in Mathematical Sciences, 8 (2010), pp. 217–233.
  • [70] B. P. van Milligen, V. Tribaldos, and J. Jiménez, Neural network differential equation and plasma equilibrium solver, Phys. Rev. Lett., 75 (1995), p. 3594.
  • [71] J. L. Vázquez, The porous medium equation: mathematical theory, Oxford University Press, 2007.
  • [72] H. Wang, T. Qian, and X. Xu, Onsager’s variational principle in active soft matter, Soft Matter, 17 (2021), pp. 3634–3653.
  • [73] Y. Wang, J. Chen, C. Liu, and L. Kang, Particle-based energetic variational inference, Stat. Comput., 31 (2021), pp. 1–17.
  • [74] Y. Wang and C. Liu, Some recent advances in energetic variational approaches, Entropy, 24 (2022).
  • [75] Y. Wang, C. Liu, P. Liu, and B. Eisenberg, Field theory of reaction-diffusion: Law of mass action with an energetic variational approach, Phys. Rev. E, 102 (2020), p. 062147.
  • [76] Y. Wang, T.-F. Zhang, and C. Liu, A two species micro–macro model of wormlike micellar solutions and its maximum entropy closure approximations: An energetic variational approach, Journal of Non-Newtonian Fluid Mechanics, 293 (2021), p. 104559.
  • [77] Q. Wei, Y. Jiang, and J. Z. Y. Chen, Machine-learning solver for modified diffusion equations, Phys. Rev. E, 98 (2018), p. 053304.
  • [78] E. Weinan, Machine learning and computational mathematics, arXiv preprint arXiv:2009.14596, (2020).
  • [79] J. Xu, Y. Li, S. Wu, and A. Bousquet, On the stability and accuracy of partially and fully implicit schemes for phase field modeling, Computer Methods in Applied Mechanics and Engineering, 345 (2019), pp. 826–853.
  • [80] S. Xu, P. Sheng, and C. Liu, An energetic variational approach for ion transport, Communications in Mathematical Sciences, 12 (2014), pp. 779–789.
  • [81] S. Xu, Z. Xu, O. V. Kim, R. I. Litvinov, J. W. Weisel, and M. S. Alber, Model predictions of deformation, embolization and permeability of partially obstructive blood clots under variable shear flow, Journal of The Royal Society Interface, 14 (2017).
  • [82] X. Xu, Y. Di, and M. Doi, Variational method for liquids moving on a substrate, Physics of Fluids, 28 (2016), p. 087101.
  • [83] P. Yue, J. J. Feng, C. Liu, and J. Shen, A diffuse-interface method for simulating two-phase flows of complex fluids, J. Fluid Mech., 515 (2004), pp. 293–317.
  • [84] Y. Zang, G. Bao, X. Ye, and H. Zhou, Weak adversarial networks for high-dimensional partial differential equations, J. Comput. Phys., 411 (2020), p. 109409.