
Finite Expression Method for Solving High-Dimensional Partial Differential Equations

Senwei Liang (Department of Mathematics, Purdue University) and Haizhao Yang (Department of Mathematics and Department of Computer Science, University of Maryland, College Park)
Abstract

Designing efficient and accurate numerical solvers for high-dimensional partial differential equations (PDEs) remains a challenging and important topic in computational science and engineering, mainly due to the “curse of dimensionality” in designing numerical schemes that scale in dimension. This paper introduces a new methodology that seeks an approximate PDE solution in the space of functions with finitely many analytic expressions and, hence, this methodology is named the finite expression method (FEX). It is proved in approximation theory that FEX can avoid the curse of dimensionality. As a proof of concept, a deep reinforcement learning method is proposed to implement FEX for various high-dimensional PDEs in different dimensions, achieving high and even machine accuracy with a memory complexity polynomial in dimension and an amenable time complexity. An approximate solution with finite analytic expressions also provides interpretable insights into the ground truth PDE solution, which can further help to advance the understanding of physical systems and design postprocessing techniques for a refined solution.

keywords:
High-dimensional PDEs, Deep neural networks, Mathematical expressions, Finite expression method, Reinforcement learning.

1 Introduction

Partial differential equations (PDEs) play a fundamental role in scientific fields for modeling diverse physical phenomena, including diffusion [37, 27], fluid dynamics [2, 46] and quantum mechanics [16, 30]. Developing efficient and accurate solvers for numerical solutions to high-dimensional PDEs remains an important and challenging topic [49]. Many traditional solvers, such as finite element method (FEM) [40] and finite difference [18], are usually limited to low-dimensional domains since the computational cost increases exponentially in the dimension [49, 31]. Recently, neural networks (NNs) as mesh-free parameterization are widely employed in solving high-dimensional PDEs [13, 19, 25, 38, 47, 14] and control problems [21]. In theory, NNs have the capability of approximating various functions well and lessening the curse of dimensionality [45, 43, 51, 44, 24]. Yet the highly non-convex objective function and the spectral bias towards fitting a smooth function in NN optimization make it difficult to achieve high accuracy [50, 7, 41]. In practice, NN-based solvers can hardly achieve a highly accurate solution even when the true solution is a simple function, especially for high-dimensional problems [14, 33]. Besides, NN parametrization may still require large memory and high computation cost for high-dimensional problems [6]. Finally, numerical solutions from both traditional solvers and NN-based solvers lack interpretability, e.g., the dependence of the solution on variables is not readily apparent from the numerical solutions.

Fig. 1: Overview of various numerical solvers for PDEs. In our finite expression method, we aim to find a PDE solution as a mathematical expression with finitely many operators. The resulting solution can reproduce the true solution and achieve high, even machine-level, accuracy. Furthermore, the mathematical expression has memory complexity polynomial in dimension and provides interpretable insights to advance understanding of physical systems and aid in designing postprocessing techniques for solution refinement.

In this paper, we propose the finite expression method (FEX), a methodology that aims to find a solution in the function space of mathematical expressions with finitely many operators. Compared with the NN and FEM methods, our FEX enjoys the following advantages (summarized in Fig. 1): (1) The expression can reproduce the true solution and achieve high, even machine-level, accuracy. (2) The expression requires low memory (a line of string) for solution storage and low computational cost for evaluation on new points. (3) The expression has good interpretability with an explicit, readable form. Moreover, from an approximation theory perspective detailed in Sec. 2.3, the expression in FEX is capable of avoiding the curse of dimensionality in theory.

In FEX, we formulate the search for mathematical expressions as a combinatorial optimization (CO) involving both discrete and continuous variables. While many techniques can be used to solve the CO in FEX, we provide a numerically effective implementation based on reinforcement learning (RL). Traditional algorithms (e.g. genetic programming and simulated annealing [35]) address CO problems by employing hand-crafted heuristics that heavily rely on specific problem formulations and domain knowledge [4, 10, 34]. However, RL has emerged as a popular and versatile tool for learning how to construct CO solutions based on reward feedback, without the need for extensive heuristic design. The success of RL applications in CO, such as automatic algorithm design [39, 5, 11] and symbolic regression [36, 29], has inspired us to seek mathematical expression solutions with RL. Specifically, in our implementation, the mathematical expression is represented by a binary tree, where each node is an operator along with parameters (scaling and bias) as shown in Fig. 3 and further explained in Sec. 3.1. The objective function we aim to minimize is a functional, and its minimization leads to the solution of the PDE, as described in Sec. 2.2. Consequently, our problem involves the minimization of both discrete operators and continuous parameters embedded within the tree structure. Optimizing both discrete and continuous variables simultaneously is inherently difficult. We propose a search loop for this CO as depicted in Fig. 2. Our idea is to first identify good operator selections that have a high possibility of recovering the true solution structure, and then optimize the parameters. The proposals of operator selections are drawn from a controller which will be updated via policy gradient [36] iteratively. In Sec. 4, we numerically demonstrate the ability of this RL-based implementation to find mathematical expressions that solve high-dimensional PDEs with high, even machine-level, accuracy. Furthermore, FEX provides interpretable insights into the ground truth PDE solutions, which can further advance understanding of physical systems and aid in designing postprocessing techniques to refine solutions.

Fig. 2: Representation of the components of our FEX implementation. (a) The searching loop for the finite expression solution consists of expression generation, score computation, controller update, and candidate optimization. (b) Depiction of the expression generation with a binary tree and a controller $\bm{\chi}$. The controller outputs probability mass functions for each node in the tree, and these functions are used to sample node values. The final expression, which incorporates learnable scaling and bias parameters, is constructed based on the predefined tree structure and the sampled node values.

2 An Abstract Methodology of Finite Expression Method

The goal of FEX is to find a mathematical expression to solve a PDE. This section will formally define the function space of mathematical expressions and formulate the problem of FEX as a CO.

2.1 Function Space with Finite Expressions in FEX

Mathematical expressions will be introduced to form the function space of FEX.

Definition 1 (Mathematical expression).

A mathematical expression is a combination of symbols that is well-formed according to syntactic rules and defines a valid function. The symbols include operands (variables and numbers), operators (e.g., "$+$", "$\sin$", integral, derivative), brackets, and punctuation.

In the definition of mathematical expressions, we only consider expressions that form a valid function. In our context, "$\sin(x\times y)+1$" is a mathematical expression, but, for example, "$5>x$" and "$\sin(x\times y)+$" are not mathematical expressions since they do not form valid functions. The operands and operators comprise the structure of an expression, and parentheses clarify the order of operations.

Definition 2 ($k$-finite expression).

A mathematical expression is called a $k$-finite expression if the number of operators in this expression is $k$.

"$\sin(x\times y)+1$" is a $3$-finite expression since there are three operators in it ("$\times$", "$\sin$", and "$+$"). A series such as "$1+x^{1}+\frac{x^{2}}{2}+\frac{x^{3}}{6}+\cdots$" is a mathematical expression, but it is not a finite expression since the number of operators is infinite. Formally, with the concept of finite expressions, we can define FEX as follows,

Definition 3 (Finite expression method).

The finite expression method is a methodology to solve a PDE numerically by seeking a finite expression such that the resulting function solves the PDE approximately.

We denote by $\mathbb{S}_{k}$ the set of functions that are formed by finite expressions with the number of operators not larger than $k$. $\mathbb{S}_{k}$ forms the function space of FEX. Clearly, $\mathbb{S}_{1}\subset\mathbb{S}_{2}\subset\mathbb{S}_{3}\subset\cdots$.

2.2 Identifying PDE Solutions in FEX

We denote by $\mathcal{L}:\mathbb{S}\to\mathbb{R}$ a functional associated with a given PDE, where $\mathbb{S}$ is a function space and the minimizer of $\mathcal{L}$ is the best solution of the PDE in $\mathbb{S}$. In FEX, given the number of operators $k$, the problem of seeking a finite expression solution is formulated as a CO over $\mathbb{S}_{k}$ via

(1) $\min\{\mathcal{L}(u)\,|\,u\in\mathbb{S}_{k}\}.$

The choice of the functional $\mathcal{L}$ is problem-dependent, and one may conceive a better functional for a specific PDE with a specific constraint or domain. Some popular choices include least-square methods [28, 47, 12] and variational formulations [14, 52, 8, 53].

2.2.1 Least Square Method

Suppose that the PDE is given by

(2) $\mathcal{D}u(\bm{x})=f(u(\bm{x}),\bm{x}),\ \bm{x}\in\Omega,\qquad \mathcal{B}u(\bm{x})=g(\bm{x}),\ \bm{x}\in\partial\Omega,$

where $\mathcal{D}$ is a differential operator, $f(u(\bm{x}),\bm{x})$ can be a nonlinear function in $u$, $\Omega$ is a bounded domain in $\mathbb{R}^{d}$, and $\mathcal{B}u=g$ characterizes the boundary condition (e.g., Dirichlet, Neumann and Robin [15]). The least square method [28, 47, 12] defines a straightforward functional to characterize the error of the estimated solution by

(3) $\mathcal{L}(u):=\|\mathcal{D}u(\bm{x})-f(u,\bm{x})\|_{L^{2}(\Omega)}^{2}+\lambda\|\mathcal{B}u(\bm{x})-g(\bm{x})\|_{L^{2}(\partial\Omega)}^{2},$

where $\lambda$ is a positive coefficient to enforce the boundary constraint.
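For concreteness, the following is a minimal numpy sketch (not part of the original method description) of how the functional (3) can be estimated by Monte Carlo sampling. The samplers, the residual callables, and the quadratic Poisson example of Sec. 4.2.1 are illustrative assumptions; in a full FEX implementation the residuals would be evaluated by differentiating the candidate expression.

```python
import numpy as np

def least_squares_loss(residual_int, residual_bdy, sample_interior, sample_boundary,
                       lam=100.0, n_int=5000, n_bdy=1000):
    """Monte Carlo estimate of the least-squares functional (3)."""
    x_int = sample_interior(n_int)
    x_bdy = sample_boundary(n_bdy)
    return np.mean(residual_int(x_int) ** 2) + lam * np.mean(residual_bdy(x_bdy) ** 2)

# Illustration: -Lap(u) = f on Omega = [-1, 1]^d with u = g on the boundary, and the
# quadratic candidate u(x) = 0.5 * sum_i x_i^2 (so -Lap(u) = -d and the residuals vanish).
d = 10
sample_interior = lambda n: np.random.uniform(-1, 1, size=(n, d))
def sample_boundary(n):
    x = np.random.uniform(-1, 1, size=(n, d))
    x[np.arange(n), np.random.randint(d, size=n)] = np.random.choice([-1.0, 1.0], size=n)
    return x
u_cand = lambda x: 0.5 * np.sum(x ** 2, axis=1)                      # candidate expression
g = lambda x: 0.5 * np.sum(x ** 2, axis=1)                           # Dirichlet boundary data
residual_int = lambda x: -np.full(len(x), float(d)) - (-float(d))    # D u - f with f = -d
residual_bdy = lambda x: u_cand(x) - g(x)                            # B u - g
print(least_squares_loss(residual_int, residual_bdy, sample_interior, sample_boundary))  # ~0
```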

2.2.2 Variational Formulation

We next introduce the variational formulation, which is commonly used to identify numerical PDE solutions [14, 52]. As an example, consider an elliptic PDE with homogeneous Dirichlet boundary conditions:

(4) $-\Delta u(\bm{x})+c(\bm{x})u(\bm{x})=f(\bm{x}),\ \bm{x}\in\Omega,\qquad u(\bm{x})=0,\ \bm{x}\in\partial\Omega,$

where $c$ is a bounded function and $f\in L^{2}$. The solution $u$ to PDE (4) minimizes the variational energy $\frac{1}{2}\int_{\Omega}\|\nabla u\|^{2}+cu^{2}\,\text{d}\bm{x}-\int_{\Omega}fu\,\text{d}\bm{x}$. By incorporating the boundary condition as a penalty in this variational formulation, we obtain the functional:

(5) $\mathcal{L}(u):=\frac{1}{2}\int_{\Omega}\|\nabla u\|^{2}+cu^{2}\,\text{d}\bm{x}-\int_{\Omega}fu\,\text{d}\bm{x}+\lambda\int_{\partial\Omega}u^{2}\,\text{d}\bm{x}.$

An alternative variational formulation can be defined using test functions. Let $v\in H_{0}^{1}(\Omega)$ be a test function, where $H_{0}^{1}(\Omega)$ denotes the Sobolev space of functions with $L^{2}$-integrable weak derivatives and zero boundary values. The weak solution $u$ of Eqn. (4) is defined as the function that satisfies the bilinear equations:

(6) $a(u,v):=\int_{\Omega}\nabla u\cdot\nabla v+cuv-fv\,\text{d}\bm{x}=0,\quad\forall v\in H_{0}^{1}(\Omega),\quad u(\bm{x})=0,\ \bm{x}\in\partial\Omega,$

where $a(u,v)$ is obtained by multiplying (4) by $v$ and integrating by parts. All derivatives of the solution function can be transferred to the test function through repeated integration by parts, yielding other bilinear forms [8]. The weak solution can be reformulated as the solution to a saddle-point problem [53]:

(7) $\min_{u\in H_{0}^{1}(\Omega)}\max_{v\in H_{0}^{1}(\Omega)}|a(u,v)|^{2}/\|v\|_{L^{2}(\Omega)}^{2}.$

Then, the functional $\mathcal{L}$ identifying the PDE solution is:

(8) $\mathcal{L}(u):=\max_{v\in H_{0}^{1}(\Omega)}|a(u,v)|^{2}/\|v\|_{L^{2}(\Omega)}^{2}+\lambda\int_{\partial\Omega}u^{2}\,\text{d}\bm{x}.$

2.3 Approximation Theory of Elementary Expressions in FEX

The most important theoretical question in solving high-dimensional problems is whether or not a solver suffers from the curse of dimensionality. It will be shown that the function space of $k$-finite expressions, i.e., $\mathbb{S}_{k}$ in (1), is a powerful function space that avoids the curse of dimensionality in approximating high-dimensional continuous functions, leveraging the recent development of advanced approximation theory of deep neural networks [44, 24]. First of all, it can be proved that $\mathbb{S}_{k}$ is dense in $C([0,1]^{d})$ for an arbitrary $d\in\mathbb{N}$ in the following theorem.

Theorem 4.

The function space $\mathbb{S}_{k}$, generated with operators including "$+$", "$-$", "$\times$", "$/$", "$|\cdot|$", "$\mathrm{sign}(\cdot)$", and "$\lfloor\cdot\rfloor$", is dense in $C([a,b]^{d})$ for arbitrary $a,b\in\mathbb{R}$ and $d\in\mathbb{N}$ if $k\geq\mathcal{O}(d^{4})$.

Here "$|\cdot|$", "$\mathrm{sign}(\cdot)$", and "$\lfloor\cdot\rfloor$" denote the absolute value, sign, and floor functions [44], respectively. The proof of Thm. 4 can be found in Appx. 6.1. The denseness of $\mathbb{S}_{k}$ means that the function space of $k$-finite expressions can approximate any $d$-dimensional continuous function to any accuracy, while $k$ is only required to be $\mathcal{O}(d^{4})$, independent of the approximation accuracy. The proof of Thm. 4 takes advantage of the operators "$\mathrm{sign}(\cdot)$" and "$\lfloor\cdot\rfloor$", which might not be frequently used in mathematical expressions. If it is more practical to restrict the operator list to regular operators like "$+$", "$\times$", "$\sin(\cdot)$", exponential functions, and the rectified linear unit (ReLU), then it can be proved that $\mathbb{S}_{k}$ can approximate Hölder functions without the curse of dimensionality, as stated in the following theorem.

Theorem 5.

Suppose the function space $\mathbb{S}_{k}$ is generated with operators including "$+$", "$-$", "$\times$", "$\div$", "$\max\{0,x\}$", "$\sin(x)$", and "$2^{x}$". Let $p\in[1,+\infty)$. For any $f$ in the Hölder function class $\mathcal{H}^{\alpha}_{\mu}([0,1]^{d})$ and $\varepsilon>0$, there exists a $k$-finite expression $\phi$ in $\mathbb{S}_{k}$ such that $\|f-\phi\|_{L^{p}}\leq\varepsilon$, if $k\geq\mathcal{O}(d^{2}(\log d+\log\frac{1}{\varepsilon})^{2})$.

The proof of Thm. 5 can be found in Appx. 6.1. Although finite expressions have a powerful approximation capacity for high-dimensional functions, it is challenging to theoretically prove that our FEX solver to be introduced in Sec. 3 can identify the desired finite expressions with this power. However, the numerical experiments in Sec. 4 demonstrate that our FEX solver can identify the desired finite expressions to machine accuracy for several classes of high-dimensional PDEs.

3 An Implementation of FEX

Following the introduction of the abstract FEX methodology in Sec. 2, this section proposes a numerical implementation. First, binary trees are applied to construct finite expressions in Sec. 3.1. Next, our CO problem (1) is formulated in terms of parameter and operator selection to find expressions that approximate PDE solutions in Sec. 3.2. To solve this CO, we propose a search loop to identify effective operators that have the potential to recover the true solution when selected for the expression. In Appendix 6.2, we provide pseudo-code for the FEX algorithm, which employs expanding trees to search for a solution. For the reader's convenience, the key notations used in this section are summarized in Table 1.

Notation | Explanation
$\mathcal{T}$ | A binary tree
$\bm{e}$ | Operator sequence
$\bm{\theta}$ | Trainable scaling and bias parameters
$\mathcal{L}$ | The functional associated with the PDE solution
$S$ | The scoring function that maps an operator sequence to $[0,1]$
$\bm{\chi}_{\Phi}$ | The controller parameterized by $\Phi$
$\mathcal{J}$ | The objective function for the policy-gradient approach

Table 1: A summary of notations in the FEX implementation.
Fig. 3: Computational rule of a binary tree. In a binary tree, each node holds either a unary or a binary operator. We begin by defining the computational process for a depth-1 tree (i.e., $L=1$), which contains only a single operator. For binary trees of depth greater than one (i.e., $L>1$), the computation is performed recursively.

3.1 Finite Expressions with Binary Trees

A finite expression can be represented as a binary tree $\mathcal{T}$, as depicted in Fig. 3. Each node in the tree is assigned a value from a set of operators, and all these node values collectively form an operator sequence $\bm{e}$, following a predefined order of traversing the tree (e.g., inorder traversal).

Within each node holding a unary operator, we incorporate two additional parameters, a scaling parameter $\alpha$ and a bias parameter $\beta$, to enhance expressiveness. All these parameters are denoted by $\bm{\theta}$. Therefore, a finite expression can be denoted as $u(\bm{x};\mathcal{T},\mathbf{e},\bm{\theta})$, a function of $\bm{x}$. For a fixed $\mathcal{T}$, the maximal number of operators is upper bounded by a constant denoted $k_{\mathcal{T}}$. In this implementation, $\{u(\bm{x};\mathcal{T},\mathbf{e},\bm{\theta})\,|\,\mathbf{e},\bm{\theta}\}$ forms the function space used in the CO to solve a PDE; it is a subset of the functions expressible as finite expressions with at most $k_{\mathcal{T}}$ operators.

The configuration of trees of various depths can be designed as in Fig. 3. Each tree node holds either a binary operator or a unary operator that takes a value from the corresponding binary or unary set. The binary set can be $\mathbb{B}:=\{+,-,\times,\div,\cdots\}$. The unary set can be $\mathbb{U}:=\{\sin,\exp,\log,\text{Id},(\cdot)^{2},\int\cdot\,\text{d}x_{i},\frac{\partial\cdot}{\partial x_{i}},\cdots\}$, which contains elementary functions (e.g., polynomial and trigonometric functions), antiderivative and differentiation operators. Here "Id" denotes the identity map. Notice that if an integration or a derivative is used in the expression, the operator can be applied numerically. Each entry of the operator sequence $\bm{e}$ corresponds one-to-one with a tree node in the traversal order. The length of $\bm{e}$ equals the total number of tree nodes. For example, when inorder traversal is used, Fig. 2b depicts a tree with 4 nodes and $\bm{e}=(\text{Id},\times,\exp,\sin)$.

The computation flow of the binary tree is conducted recursively. The operator of a node is granted higher precedence than that of its parent node. First, as in Fig. 3, we present the computation flow of the basic trees (a tree of depth 1 with only 1 operator). For a basic tree with a unary operator $u_{1}$ and input $i_{0}$, the output is $o_{1}=\alpha u_{1}(i_{0})+\beta$, where $\alpha$ and $\beta$ are the scaling and bias parameters, respectively. For a basic tree with a binary operator $b_{1}$ and inputs $i_{01}$ and $i_{02}$, the output is $o_{1}=b_{1}(i_{01},i_{02})$. With these two basic trees, we are ready to define the computation for arbitrary depth by recursion, as in the examples shown in Fig. 3. Specifically, the input of a parent node is the output of its child node(s). When the tree input at the bottom layer is a $d$-dimensional variable $\bm{x}$ (Fig. 2b), the unary operator directly linked to $\bm{x}$ is applied element-wise to each entry of $\bm{x}$, and the scaling $\alpha$ becomes a $d$-dimensional vector that performs a linear transformation from $\mathbb{R}^{d}$ to $\mathbb{R}^{1}$.
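To make the recursion concrete, the following is a minimal Python sketch of the computational rule in Fig. 3, assuming a simple (hypothetical) node representation; a full FEX implementation would additionally handle differentiation and antiderivative operators as well as arbitrary tree shapes.

```python
import numpy as np

# Hypothetical node types for illustration: a unary node carries an operator with
# scaling alpha and bias beta; a binary node carries an operator and two children.
class Unary:
    def __init__(self, op, child, alpha, beta):
        self.op, self.child, self.alpha, self.beta = op, child, alpha, beta

class Binary:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

class Leaf:      # the d-dimensional input x
    pass

UNARY = {"Id": lambda z: z, "sin": np.sin, "exp": np.exp, "square": lambda z: z ** 2}
BINARY = {"+": np.add, "-": np.subtract, "*": np.multiply}

def evaluate(node, x):
    """Recursively evaluate the expression tree at points x of shape (n, d)."""
    if isinstance(node, Leaf):
        return x                                   # (n, d)
    if isinstance(node, Binary):
        return BINARY[node.op](evaluate(node.left, x), evaluate(node.right, x))
    out = UNARY[node.op](evaluate(node.child, x))  # unary node
    # alpha is a d-vector at the bottom layer (linear map R^d -> R), a scalar otherwise
    return out @ node.alpha + node.beta if out.ndim == 2 else node.alpha * out + node.beta

# Example: alpha^T sin(x) + beta, a depth-1 unary tree applied element-wise to x.
d = 3
tree = Unary("sin", Leaf(), alpha=np.ones(d), beta=0.5)
print(evaluate(tree, np.random.rand(5, d)).shape)   # (5,)
```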

3.2 Solving a CO in FEX

Given a tree $\mathcal{T}$, we aim to seek the PDE solution in the function space $\{u(\bm{x};\mathcal{T},\mathbf{e},\bm{\theta})\,|\,\mathbf{e},\bm{\theta}\}\subset\mathbb{S}_{k_{\mathcal{T}}}$. The mathematical expression can be identified via the minimization of the functional $\mathcal{L}$ associated with a PDE, i.e.,

(9) $\min\{\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}))\,|\,\bm{e},\bm{\theta}\}.$

We introduce the framework for implementing FEX, as displayed in Fig. 2a, to seek a minimizer of (9). The basic idea is to first find a good operator sequence $\bm{e}$ that may uncover the structure of the true solution, and then optimize the parameters $\bm{\theta}$ to minimize the functional in (9). In our framework, the searching loop consists of four parts: 1) Score computation (i.e., rewards in RL). A mixed-order optimization algorithm is proposed to efficiently assess the score of an operator sequence $\bm{e}$ for uncovering the true structure. A higher score suggests a higher possibility of identifying the true solution. 2) Operator sequence generation (i.e., taking actions in RL). A controller is proposed to generate operator sequences with high scores (see Fig. 2b). 3) Controller update (i.e., policy optimization in RL). The controller is updated to increase the probability of producing a good operator sequence via the score feedback of the generated ones. While the controller can be modeled in many ways (e.g., with a heuristic algorithm), we introduce the policy gradient in RL to optimize the controller. 4) Candidate optimization (i.e., a non-greedy strategy). During the search, we maintain a candidate pool to store operator sequences with high scores. After the search, the parameters $\bm{\theta}$ of the high-score operator sequences are optimized to approximate the PDE solution.

3.2.1 Score Computation

The score of an operator sequence $\bm{e}$ is critical to guide the controller toward generating good operator sequences and to help maintain a candidate pool of high scores. Intuitively, the score of $\bm{e}$, denoted $S(\bm{e})$, is defined in the range $[0,1]$ by

(10) $S(\bm{e}):=\big(1+L(\bm{e})\big)^{-1},$

where $L(\bm{e}):=\min\{\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}))\,|\,\bm{\theta}\}$. When $L(\bm{e})$ tends to 0, the expression represented by $\bm{e}$ is close to the true solution, and the score $S(\bm{e})$ goes to 1. Otherwise, $S(\bm{e})$ goes to 0. The global minimizer of $\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}))$ over $\bm{\theta}$ is difficult and expensive to obtain. Instead of exhaustively searching for a global minimizer, a first-order optimization algorithm and a second-order one are combined to accelerate the evaluation of $S(\bm{e})$.

First-order algorithms (e.g., stochastic gradient descent [42] and Adam [26]) that use gradients to update parameters are popular in machine learning. Typically, they demand a small learning rate and a substantial number of iterations to converge effectively, so optimizing $L(\bm{e})$ with a first-order algorithm alone can be time-consuming. Alternatively, second-order algorithms (e.g., the Newton method [3] and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [17]) use the (approximated) Hessian matrix for faster convergence, but obtaining a good minimizer requires a good initial guess. To expedite the optimization of $L(\bm{e})$ in our implementation, we employ a two-step approach: a first-order algorithm is run for $T_{1}$ steps to obtain a well-informed initial guess, and a second-order algorithm is then applied for an additional $T_{2}$ steps to further refine the solution. Let $\bm{\theta}_{0}^{\bm{e}}$ be an initialization and $\bm{\theta}_{T_{1}+T_{2}}^{\bm{e}}$ be the parameter set after $T_{1}+T_{2}$ steps of this two-step optimization. Then $\bm{\theta}_{T_{1}+T_{2}}^{\bm{e}}$ serves as an approximation of $\arg\min_{\bm{\theta}}\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}))$. Finally, $S(\bm{e})$ is estimated by

(11) $S(\bm{e})\approx\big(1+\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}_{T_{1}+T_{2}}^{\bm{e}}))\big)^{-1}.$

Remark that the approximation may exhibit significant variation due to the randomness associated with the initialization of $\bm{\theta}_{0}^{\bm{e}}$.
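As an illustration of this two-stage procedure, here is a hedged PyTorch sketch; the function and argument names are ours, and `expr_loss` stands for a Monte Carlo estimate of $\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}))$. The hyperparameters mirror the choices reported in Sec. 4.1 (Adam warm-up followed by a BFGS-type refinement).

```python
import torch

def score(expr_loss, theta, T1=20, T2=20):
    """Two-stage estimate of S(e) in (11): Adam warm-up followed by (L-)BFGS refinement.

    expr_loss(theta) -> scalar tensor, the functional value L(u(.; T, e, theta));
    theta is a list of tensors with requires_grad=True (the scaling/bias parameters).
    """
    adam = torch.optim.Adam(theta, lr=1e-3)
    for _ in range(T1):                       # first-order warm-up
        adam.zero_grad()
        loss = expr_loss(theta)
        loss.backward()
        adam.step()

    lbfgs = torch.optim.LBFGS(theta, lr=1.0, max_iter=T2)
    def closure():                            # second-order refinement from the warm start
        lbfgs.zero_grad()
        loss = expr_loss(theta)
        loss.backward()
        return loss
    lbfgs.step(closure)

    with torch.no_grad():
        return 1.0 / (1.0 + expr_loss(theta).item())

# Toy usage: L(theta) = (theta - 2)^2, whose minimizer yields a score close to 1.
theta = [torch.zeros(1, requires_grad=True)]
print(score(lambda th: (th[0] - 2.0).pow(2).sum(), theta))
```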

3.2.2 Operator Sequence Generation

The role of the controller is to generate operator sequences with high scores during the searching loop. Let $\bm{\chi}_{\Phi}$ be a controller with model parameters $\Phi$; $\Phi$ is updated during the searching loop to increase the probability of generating good operator sequences. We use $\bm{e}\sim\bm{\chi}_{\Phi}$ to denote the process of sampling an $\bm{e}$ according to the controller $\bm{\chi}_{\Phi}$.

Treating the tree node values of $\mathcal{T}$ as random variables, the controller $\bm{\chi}_{\Phi}$ outputs probability mass functions $\bm{p}_{\Phi}^{1},\bm{p}_{\Phi}^{2},\cdots,\bm{p}_{\Phi}^{s}$ to characterize their distributions, where $s$ is the total number of nodes. Each tree node value $e_{j}$ is sampled from $\bm{p}_{\Phi}^{j}$ to obtain an operator. Then $\bm{e}:=(e_{1},e_{2},\cdots,e_{s})$ is the operator sequence sampled from $\bm{\chi}_{\Phi}$. See Fig. 2b for an illustration. Besides, we adopt the $\epsilon$-greedy strategy [48] to enhance the exploration of potentially high-score $\bm{e}$: with probability $\epsilon<1$, $e_{i}$ is sampled from a uniform distribution over the operator set, and with probability $1-\epsilon$, $e_{i}\sim\bm{p}_{\Phi}^{i}$. A larger $\epsilon$ leads to a higher probability of exploring new sequences.
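A minimal numpy sketch of this $\epsilon$-greedy sampling step follows; the function name, the toy operator sets, and the uniform controller outputs are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def sample_sequence(pmfs, operator_sets, eps=0.1, rng=None):
    """Sample an operator sequence e = (e_1, ..., e_s) with the epsilon-greedy rule.

    pmfs[j]          : probability mass function output by the controller for node j
    operator_sets[j] : list of admissible operators for node j (binary or unary set)
    """
    rng = np.random.default_rng() if rng is None else rng
    e = []
    for p, ops in zip(pmfs, operator_sets):
        if rng.random() < eps:                 # explore: uniform over the operator set
            e.append(rng.choice(ops))
        else:                                  # exploit: sample from the controller's pmf
            e.append(rng.choice(ops, p=p))
    return e

# Toy usage with the depth-3 tree of Fig. 2b: three unary nodes and one binary node.
B = ["+", "-", "*"]
U = ["Id", "sin", "exp", "square"]
pmfs = [np.full(len(U), 1 / len(U))] * 3 + [np.full(len(B), 1 / len(B))]
print(sample_sequence(pmfs, [U, U, U, B]))
```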

3.2.3 Controller Update

The goal of the controller update is to guide the controller toward generating high-score operator sequences $\bm{e}$. The updating rule of a controller can be designed based on heuristics (e.g., genetic and simulated annealing algorithms) or gradient-based methods (e.g., policy gradient and DARTS [32]). As a proof of concept, we introduce a policy-gradient-based updating rule from RL. The policy gradient method aims to maximize the return by optimizing a parameterized policy, and the controller in our problem plays the role of a policy.

In this paper, the controller $\bm{\chi}_{\Phi}$ is modeled as a neural network parameterized by $\Phi$. The training objective of the controller is to maximize the expected score of a sampled $\bm{e}$, i.e.,

(12) $\mathcal{J}(\Phi):=\mathbb{E}_{\bm{e}\sim\bm{\chi}_{\Phi}}S(\bm{e}).$

Taking the derivative of (12) with respect to $\Phi$, we have

(13) $\nabla_{\Phi}\mathcal{J}(\Phi)=\mathbb{E}_{\bm{e}\sim\bm{\chi}_{\Phi}}\big\{S(\bm{e})\sum_{i=1}^{s}\nabla_{\Phi}\log(\bm{p}_{\Phi}^{i}(e_{i}))\big\},$

where $\bm{p}_{\Phi}^{i}(e_{i})$ is the probability corresponding to the sampled $e_{i}$. When the batch size is $N$ and $\{\bm{e}^{(1)},\bm{e}^{(2)},\cdots,\bm{e}^{(N)}\}$ are sampled under $\bm{\chi}_{\Phi}$ each time, the expectation can be approximated by

(14) $\nabla_{\Phi}\mathcal{J}(\Phi)\approx\frac{1}{N}\sum_{k=1}^{N}\big\{S(\bm{e}^{(k)})\sum_{i=1}^{s}\nabla_{\Phi}\log(\bm{p}_{\Phi}^{i}(e_{i}^{(k)}))\big\}.$

Next, the model parameter $\Phi$ is updated via gradient ascent with a learning rate $\eta$, i.e.,

(15) $\Phi\leftarrow\Phi+\eta\nabla_{\Phi}\mathcal{J}(\Phi).$

The objective in (12) helps to improve the average score of generated sequences. In our problem, however, the goal is to find the $\bm{e}$ with the best score. To increase the probability of obtaining the best case, the objective function proposed in [36] is applied to seek the optimal solution via

(16) $\mathcal{J}(\Phi)=\mathbb{E}_{\bm{e}\sim\bm{\chi}_{\Phi}}\{S(\bm{e})\,|\,S(\bm{e})\geq S_{\nu,\Phi}\},$

where $S_{\nu,\Phi}$ represents the $(1-\nu)\times 100\%$-quantile of the score distribution generated by $\bm{\chi}_{\Phi}$. In a discrete form, the gradient computation becomes

(17) $\nabla_{\Phi}\mathcal{J}(\Phi)\approx\frac{1}{N}\sum_{k=1}^{N}\big\{(S(\bm{e}^{(k)})-\hat{S}_{\nu,\Phi})\mathbbm{1}_{\{S(\bm{e}^{(k)})\geq\hat{S}_{\nu,\Phi}\}}\sum_{i=1}^{s}\nabla_{\Phi}\log(\bm{p}_{\Phi}^{i}(e_{i}^{(k)}))\big\},$

where $\mathbbm{1}$ is an indicator function that takes the value $1$ if the condition is true and $0$ otherwise, and $\hat{S}_{\nu,\Phi}$ is the $(1-\nu)$-quantile of the scores $\{S(\bm{e}^{(i)})\}_{i=1}^{N}$.
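A hedged PyTorch sketch of one update step based on (17) follows. It assumes the controller's parameters $\Phi$ are wrapped in `optimizer`, that `log_probs[k, i]` holds $\log\bm{p}_{\Phi}^{i}(e_{i}^{(k)})$ with gradients attached, and that `scores[k]` holds $S(\bm{e}^{(k)})$ as constants; the function name is ours.

```python
import torch

def risk_seeking_update(optimizer, log_probs, scores, nu=0.5):
    """One policy-gradient step of Eqn. (17) (risk-seeking variant of REINFORCE).

    log_probs : tensor (N, s) of log-probabilities of the sampled node values (differentiable)
    scores    : tensor (N,) of scores S(e^(k)), treated as constants
    nu        : only sequences above the (1 - nu)-quantile contribute to the update
    """
    threshold = torch.quantile(scores, 1.0 - nu)
    advantage = (scores - threshold).clamp(min=0.0)           # zero below the quantile
    # Maximize J, i.e., minimize its negation with a standard optimizer.
    loss = -(advantage.detach() * log_probs.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return threshold.item()
```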

3.2.4 Candidate Optimization

As introduced in Sec. 3.2.1, the score of $\bm{e}$ is based on the optimization of a nonconvex function from a random initialization. The optimization may therefore get stuck at poor local minimizers, and the score sometimes may not reflect whether $\bm{e}$ reveals the structure of the true solution; the operator sequence $\bm{e}$ corresponding (exactly or approximately) to the true solution may not have the best score. To avoid missing good operator sequences, a candidate pool $\mathbb{P}$ with capacity $K$ is maintained to store several high-score $\bm{e}$'s.

During the search loop, if the size of $\mathbb{P}$ is less than the capacity $K$, $\bm{e}$ is added to $\mathbb{P}$. If the size of $\mathbb{P}$ has reached $K$ and $S(\bm{e})$ is larger than the smallest score in $\mathbb{P}$, then $\bm{e}$ is appended to $\mathbb{P}$ and the element with the smallest score is removed. After the search loop, for every $\bm{e}\in\mathbb{P}$, the objective function $\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}))$ is optimized over $\bm{\theta}$ using a first-order algorithm with a small learning rate for $T_{3}$ iterations.
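The pool logic above amounts to keeping the $K$ best-scoring sequences seen so far. A minimal Python sketch using a min-heap (the class name and structure are illustrative, not the paper's code) is:

```python
import heapq
import random

class CandidatePool:
    """Fixed-capacity pool keeping the K highest-scoring operator sequences."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.heap = []               # entries (score, counter, e); smallest score on top
        self._counter = 0            # tie-breaker so sequences are never compared directly

    def update(self, score, e):
        entry = (score, self._counter, e)
        self._counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
        elif score > self.heap[0][0]:        # better than the worst stored candidate
            heapq.heapreplace(self.heap, entry)

    def candidates(self):
        return [e for _, _, e in sorted(self.heap, reverse=True)]

# Toy usage: keep the 3 best of 10 randomly scored sequences.
pool = CandidatePool(capacity=3)
for i in range(10):
    pool.update(random.random(), f"e_{i}")
print(pool.candidates())
```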

4 Numerical Examples

Numerical results will be provided to demonstrate the effectiveness of our FEX implementation introduced in Sec. 3.2 using two classical PDE problems: high-dimensional PDEs with constraints (such as Dirichlet boundary conditions and integration constraints) and eigenvalue problems. The computational tools for high-dimensional problems are very limited and NNs are probably the most popular ones. Therefore, FEX will be compared with NN-based solvers. Through our examples, the goal is to numerically demonstrate that:

  • Accuracy. FEX can achieve high and even machine accuracy for high-dimensional problems, while NN-based solvers can only achieve accuracy on the order of $\mathcal{O}(10^{-4})$ to $\mathcal{O}(10^{-2})$. Furthermore, we demonstrate the effectiveness of our RL-based approach by comparing its performance to a solution developed using genetic programming (GP) [35].

  • Scalability. FEX is scalable in the problem dimension with almost constant accuracy and a low memory requirement, i.e., the accuracy of FEX remains essentially the same as the dimension grows, while the accuracy of NN-based solvers deteriorates as the dimension becomes larger.

  • Interpretability. FEX provides interpretable insights of the ground truth PDE solution and helps to design postprocessing techniques for a refined solution.

In particular, to show the benefit of interpretability, we will provide examples showing that the explicit formulas of FEX solutions help to design better NN parametrizations in NN-based solvers to achieve higher accuracy. The FEX-aided NN-based solvers are referred to as FEX-NN in this paper. Finally, we will show the convergence of FEX with the growth of the tree size when the true solution cannot be exactly reproduced by finite expressions using the available operators and a binary tree. All results in this section are obtained from 6 independent experiments to compute their statistics.

4.1 Experimental Setting

This part provides the settings of FEX and the NN-based solvers. The accuracy of a numerical solution $\tilde{u}$ compared with the true solution $u$ is measured by the relative $L^{2}$ error, i.e., $\|\tilde{u}-u\|_{L^{2}(\Omega)}/\|u\|_{L^{2}(\Omega)}$. The integral in the $L^{2}$ norm is estimated by Monte Carlo integration for high-dimensional problems.
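For reference, a minimal numpy sketch of this Monte Carlo estimate of the relative $L^{2}$ error follows; the function name, the sample size, and the toy candidate are illustrative assumptions. Note that the volume factor of $\Omega$ cancels in the ratio, so sample means suffice.

```python
import numpy as np

def relative_l2_error(u_hat, u_true, sample_domain, n=1_000_000, seed=0):
    """Monte Carlo estimate of ||u_hat - u_true||_{L^2} / ||u_true||_{L^2} over Omega."""
    rng = np.random.default_rng(seed)
    x = sample_domain(n, rng)                              # (n, d) points in Omega
    diff2 = (u_hat(x) - u_true(x)) ** 2
    return np.sqrt(np.mean(diff2) / np.mean(u_true(x) ** 2))

# Toy usage on Omega = [-1, 1]^10 with the quadratic solution of Sec. 4.2.1.
d = 10
sample = lambda n, rng: rng.uniform(-1, 1, size=(n, d))
u_true = lambda x: 0.5 * np.sum(x ** 2, axis=1)
u_hat = lambda x: 0.5 * np.sum(x ** 2, axis=1) + 1e-4      # a slightly perturbed candidate
print(relative_l2_error(u_hat, u_true, sample))
```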

Implementation of FEX. The depth-3 binary tree (Fig. 2b) with 3 unary operators and 1 binary operator is used to generate mathematical expressions. The binary set is $\mathbb{B}=\{+,-,\times\}$ and the unary set is $\mathbb{U}=\{0,1,\text{Id},(\cdot)^{2},(\cdot)^{3},(\cdot)^{4},\exp,\sin,\cos\}$. A fully connected NN is used as the controller $\bm{\chi}_{\Phi}$ with constant input (see Appendix 6.3 for more details). The output size of the controller NN is $n_{1}|\mathbb{B}|+n_{2}|\mathbb{U}|$, where $n_{1}=1$ and $n_{2}=3$ represent the numbers of binary and unary operators, respectively, and $|\cdot|$ denotes the cardinality of a set.

There are four main parts in the implementation of FEX as introduced in Sec. 3.2. We briefly describe the key numerical choices here. (1) Score computation. The parameters are optimized first by Adam with a learning rate of 0.001 for $T_{1}=20$ iterations and then by BFGS with a learning rate of 1 for at most $T_{2}=20$ iterations. (2) Operator sequence generation. The depth-3 binary tree (Fig. 2b), the operator sets $\mathbb{B}$ and $\mathbb{U}$, and the fully connected controller NN described above are used to generate mathematical expressions. (3) Controller update. The batch size for the policy gradient update is $N=10$ and the controller is trained for 1000 iterations using Adam with a fixed learning rate of 0.002. In the "Numerical Convergence" experiments, where deeper trees are used, the controller is updated for 5000 iterations for trees of different depths. We adopt the $\epsilon$-greedy strategy to increase the exploration of new $\bm{e}$; the probability $\epsilon$ of sampling an $e_{i}$ at random is 0.1. (4) Candidate optimization. The candidate pool capacity is set to $K=10$. For any $\bm{e}\in\mathbb{P}$, the parameters $\bm{\theta}$ are optimized using Adam with an initial learning rate of 0.01 for $T_{3}=20{,}000$ iterations. The learning rate decays according to the cosine decay schedule [23].

Implementation of NN-based Solvers. Residual networks (ResNets) [22, 14] $u(\bm{x};\bm{\Theta})$ parameterized by $\bm{\Theta}$ are used to approximate a solution, and the minimization problem $\min_{\bm{\Theta}}\mathcal{L}(u(\cdot;\bm{\Theta}))$ is solved to identify a numerical solution [47, 49, 14].

The ResNet maps from $\mathbb{R}^{d}$ to $\mathbb{R}^{1}$ and consists of seven fully connected layers with three skip connections. Each hidden layer contains 50 neurons. The neural network is optimized using the Adam optimizer with an initial learning rate of 0.001 for 15,000 iterations. The learning rate is decayed following a cosine decay schedule.

Poisson equation. The coefficient $\lambda$ in the functional (3) is 100. The batch size for the interior and boundary is 5,000 and 1,000, respectively. In the NN method, we use the ReLU$^{2}$ activation, i.e., $(\max\{x,0\})^{2}$, in the ResNet to approximate the solution.

Linear conservation law. The coefficient $\lambda$ in the functional (20) is 100. In the NN method, we use the ReLU activation $\max\{x,0\}$, and the batch size for the interior and boundary is 5,000 and 1,000, respectively. In FEX, we use the same batch sizes as the NN method, except that we increase the batch size to 20,000 for the interior and 4,000 for the boundary when the dimension is at least 36.

Schrödinger equation. The coefficient $\lambda$ in the functional (23) is 1. The batch sizes for estimating the first and second terms of (23) are 2,000 and 10,000, respectively. Besides, ReLU$^{2}$ is used in the ResNet.

Implementation of GP. GP is an evolutionary algorithm used for automated symbolic regression and the evolution of computer programs to solve complex problems. Through a process of selection, crossover, and mutation, GP evolves and refines programs over generations to optimize a specific fitness or objective function. We utilize GP to optimize the CO (1) directly using the Python package GPlearn [1].

4.2 High-dimensional PDEs

Several numerical examples of high-dimensional PDEs, including linear and nonlinear cases, are provided here. In these tests, the true solutions of the PDEs have explicit formulas that can be reproduced by the binary tree defined in Sec. 3.1 and the sets $\mathbb{B},\mathbb{U}$ defined in Sec. 4.1.

Fig. 4: Distribution of the node values and the error comparison. In FEX, we search for the optimal sequence of node values and show the frequency of the node values of the binary tree in the candidate pool, consisting of the root (Root), the middle node (Middle), and two leaves (Leaf 1 and Leaf 2). Based on the observed distribution, we readily design the new NN parameterization (FEX-NN) to estimate the solution. The last column displays the comparison of the relative $L^{2}$ error as a function of the dimension over 6 independent trials between FEX, NN, FEX-NN and genetic programming (GP) for various high-dimensional PDE problems. Rows (a), (b) and (c) represent the results for the Poisson equation (18), the linear conservation law (19) and the nonlinear Schrödinger equation (21), respectively. For various dimensions, FEX identifies the true solution, yielding approximate solutions with nearly machine accuracy.

4.2.1 Poisson Equation

We consider a Poisson equation [14] with a Dirichlet boundary condition on a $d$-dimensional domain $\Omega=[-1,1]^{d}$,

(18) $-\Delta u=f \text{ for }\bm{x}\in\Omega,\quad u=g \text{ for }\bm{x}\in\partial\Omega.$

Let the true solution be $\frac{1}{2}\sum_{i=1}^{d}x_{i}^{2}$; then $f$ becomes the constant function $-d$. The functional $\mathcal{L}$ defined by the least-square error (3) is applied in the NN-based solver and FEX to seek the PDE solution for various dimensions ($d=10$, $20$, $30$, $40$ and $50$).

Fig. 5: Relative $L^{2}$ error of solutions estimated by trees of increasing depth for problems of various dimensions. We exclude the square operator $(\cdot)^{2}$ from the unary set $\mathbb{U}$, so the binary tree defined in Sec. 3.1 cannot reproduce the true solution (the sum of squares of the coordinates) exactly in the Poisson equation example. We found that smaller errors can be obtained with larger tree sizes.
Fig. 6: Eigenvalue problems and postprocessing algorithm design with FEX. Bottom left: We observed that the exponential operator "$\exp(\cdot)$" dominates the tree root in the FEX searching loop. Based on this observation, we assume the solution is of the form $\exp(v(\bm{x}))$ and simplify the original PDE to a new PDE that avoids the trivial solution. Bottom right: The NN-based method produces a large error in the eigenvalue estimate, especially when the dimension is high ($d=10$). With the FEX-based postprocessing algorithm, we can identify the eigenvalue with an error close to zero.

4.2.2 Linear Conservation Law

The linear conservation law [9] is considered here on the domain $T\times\Omega=[0,1]\times[-1,1]^{d}$ with the initial value problem

(19) $\frac{\pi d}{4}u_{t}-\sum_{i=1}^{d}u_{x_{i}}=0 \text{ for }\bm{x}=(x_{1},\cdots,x_{d})\in\Omega,\ t\in[0,1],\quad u(0,\bm{x})=\sin(\frac{\pi}{4}\sum_{i=1}^{d}x_{i}) \text{ for }\bm{x}\in\Omega,$

where the true solution is $u(t,\bm{x})=\sin(t+\frac{\pi}{4}\sum_{i=1}^{d}x_{i})$, and $d=5$, $11$, $17$, $23$, $29$, $35$, $41$, $47$ and $53$ in our tests. The functional $\mathcal{L}$ used in the NN-based solver and FEX to identify the solution is defined by

(20) $\mathcal{L}(u):=\|\frac{\pi d}{4}u_{t}-\sum_{i=1}^{d}u_{x_{i}}\|_{L^{2}(T\times\Omega)}^{2}+\lambda\|u(0,\bm{x})-\sin(\frac{\pi}{4}\sum_{i=1}^{d}x_{i})\|_{L^{2}(\Omega)}^{2}.$

4.2.3 Nonlinear Schrödinger Equation

We consider a nonlinear Schrödinger equation [20] with a cubic term on a $d$-dimensional domain $\Omega=[-1,1]^{d}$,

(21) $-\Delta u+u^{3}+Vu=0 \text{ for }\bm{x}\in\Omega,$

where $V(\bm{x})=-\frac{1}{9}\exp(\frac{2}{d}\sum_{i=1}^{d}\cos x_{i})+\sum_{i=1}^{d}(\frac{\sin^{2}x_{i}}{d^{2}}-\frac{\cos x_{i}}{d})$ for $\bm{x}=(x_{1},\cdots,x_{d})$. We let $\hat{u}(\bm{x})=\exp(\frac{1}{d}\sum_{j=1}^{d}\cos(x_{j}))/3$ be the solution of the PDE (21). To avoid the trivial zero solution, we apply different strategies during the score computation and candidate optimization phases. During the score computation phase, the norm of the candidate function $u$, i.e., $\|u\|_{L^{2}(\Omega)}$, is used to penalize functions that are close to zero via

(22) $\mathcal{L}_{1}(u):=\|-\Delta u+u^{3}+Vu\|_{L^{2}(\Omega)}^{2}/\|u\|_{L^{2}(\Omega)}^{3}.$

During the candidate optimization phase, an integration constraint is imposed on (21), i.e., $\int_{\Omega}u(\bm{x})d\bm{x}=\int_{\Omega}\hat{u}(\bm{x})d\bm{x}$. The functional $\mathcal{L}$ used in FEX to fine-tune the identified operator sequence is defined by

(23) $\mathcal{L}_{2}(u):=\|-\Delta u+u^{3}+Vu\|_{L^{2}(\Omega)}^{2}+\lambda\big(\int_{\Omega}u(\bm{x})d\bm{x}-\int_{\Omega}\hat{u}(\bm{x})d\bm{x}\big)^{2},$

where the second term imposes the integration constraint. The functional (23) is also used in the NN-based solver to approximate the PDE solution. Various dimensions are tested in the numerical results, e.g., $d=6$, $12$, $18$, $24$, $30$, $36$, $42$ and $48$. Note that we avoid using (23) in the score computation because the Monte Carlo error in the second term of (23) tends to be significant when the batch size is small, while using a large batch size leads to inefficient computation. Hence, for computational efficiency, we use (22) instead of (23) in the score computation.

4.3 Results

Three main sets of numerical results for the PDE problems above will be presented. First, the errors of the numerical solutions by NN-based solvers and FEX are compared. Second, a convergence test is analyzed when the tree size of FEX increases. Finally, FEX is applied to design special NN parametrization to solve PDEs in NN-based solvers.

Estimated Solution Error. The depth-3 binary tree (Fig. 3) is used in FEX with four nodes (a root node (R), a middle node (M), and two leaf nodes (L1 and L2)). Fig. 4 shows the operator distribution obtained by FEX and the error comparison between the NN method and our FEX. The results show that NN solutions have numerical errors between $\mathcal{O}(10^{-4})$ and $\mathcal{O}(10^{-2})$ and that the errors grow with the problem dimension $d$, agreeing with numerical observations in the literature. Meanwhile, FEX identifies the true solution structure for the Poisson equation and the linear conservation law with errors of order $10^{-7}$, reaching machine accuracy since single-precision floats are used. For the nonlinear Schrödinger equation, FEX identifies solutions of the form $\exp(\cos(\cdot))$ but achieves errors of order $10^{-5}$. Note that $\int_{\Omega}\hat{u}(\bm{x})d\bm{x}$ in (23) is estimated by Monte Carlo integration with millions of points as an accurate, precomputed constant, but $\int_{\Omega}u(\bm{x})d\bm{x}$ can only be estimated with a fixed and small batch size, typically less than 10,000, in the optimization iterations. As the dimension grows, the estimation error of $\int_{\Omega}u(\bm{x})d\bm{x}$ increases, and hence even the ground truth solution incurs an increasingly large error according to (23). Therefore, the optimization solver may return an approximate solution without machine accuracy. Designing a functional $\mathcal{L}$ free of the Monte Carlo error (e.g., Eqns. (3) and (20)) for the nonlinear Schrödinger equation could ensure machine accuracy. Furthermore, as shown in Fig. 4, GP tends to identify solutions of lower quality that fail to achieve a high level of accuracy.

Numerical Convergence. The numerical convergence analysis is performed using the Poisson equation as an example. Binary trees of depths 2, 3, 4, 6 and 8 are used (see Fig. 3). The square operator $(\cdot)^{2}$ is excluded from $\mathbb{U}$ so that the binary tree defined in Sec. 3.1 cannot reproduce the true solution (the sum of squares of the coordinates) exactly. This setting mimics the case of a complicated solution approximated with a relatively small binary tree. Fig. 5 shows the error distribution of FEX as the dimension grows and the tree depth changes. FEX obtains smaller errors with increasing tree size. Notice that, compared with the errors of NN-based solvers reported in Fig. 4, FEX achieves higher accuracy when a larger tree is used.

FEX-NN. FEX provides interpretable insights into the ground truth PDE solution through the operator distribution obtained from the searching loop. These insights can be used to design NN models with a special structure that increases the accuracy of NN-based solvers. In the results of the Poisson equation, we observe that the square operator has a high probability of appearing at the leaf nodes, which suggests that the true solution may contain the structure $\bm{x}^{2}:=(x_{1}^{2},\cdots,x_{d}^{2})$ at the input $\bm{x}$. As a result, we define the FEX-NN by $v(\bm{x}^{2};\bm{\Theta})$ for the Poisson equation. Similarly, we use the FEX-NNs $\sin(v(\bm{x};\bm{\Theta}))$ for the linear conservation law and $\exp(v(\bm{x}^{2};\bm{\Theta}))$ for the nonlinear Schrödinger equation. Fig. 4 shows the errors of FEX-NN as the dimension grows, and it is clear that FEX-NN outperforms the vanilla NN-based method by a significant margin.
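A hedged PyTorch sketch of the FEX-NN idea follows: a generic network $v(\cdot;\bm{\Theta})$ is wrapped with the structure suggested by the FEX operator distribution (here $\exp(v(\bm{x}^{2}))$ as for the nonlinear Schrödinger example). The class name is ours, and the small MLP backbone is a stand-in for the ResNet described in Sec. 4.1.

```python
import torch
import torch.nn as nn

class FEXNN(nn.Module):
    """Wrap a generic network v(.; Theta) with a FEX-suggested outer structure."""

    def __init__(self, d, width=50, square_input=True, outer=torch.exp):
        super().__init__()
        self.square_input = square_input
        self.outer = outer                      # e.g. torch.exp, torch.sin, or identity
        self.v = nn.Sequential(nn.Linear(d, width), nn.ReLU(),
                               nn.Linear(width, width), nn.ReLU(),
                               nn.Linear(width, 1))

    def forward(self, x):
        z = x ** 2 if self.square_input else x  # feed x^2 = (x_1^2, ..., x_d^2) when suggested
        return self.outer(self.v(z))

model = FEXNN(d=10)
print(model(torch.rand(4, 10)).shape)           # torch.Size([4, 1])
```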

4.4 Eigenvalue Problem

Consider identifying the smallest eigenvalue $\gamma$ and the associated eigenfunction $u$ of the eigenvalue problem [14],

(24) $-\Delta u+wu=\gamma u,\quad\bm{x}\in\Omega,\quad\text{with}\ u=0\ \text{on}\ \partial\Omega.$

The minimization of the Rayleigh quotient $\mathcal{I}(u)=\frac{\int_{\Omega}\|\nabla u\|_{2}^{2}dx+\int_{\Omega}wu^{2}dx}{\int_{\Omega}u^{2}dx}$, subject to $u|_{\partial\Omega}=0$, gives the smallest eigenvalue and the corresponding eigenfunction. In the NN-based solver [14], the following functional is defined

(25) $\mathcal{L}(u):=\mathcal{I}(u)+\lambda_{1}\int_{\partial\Omega}u^{2}dx+\lambda_{2}\big(\int_{\Omega}u^{2}dx-1\big)^{2}$

to seek an NN solution. Considering the example of $w=\|\bm{x}\|_{2}^{2}$ and $\Omega=\mathbb{R}^{d}$, the smallest eigenvalue of (24) is $d$ and the associated eigenfunction is $\exp(-\frac{\|\bm{x}\|_{2}^{2}}{0.5})$. The domain $\Omega$ is truncated from $\mathbb{R}^{d}$ to $[-3,3]^{d}$ for simplification, as done in [14].

In FEX, the functional (25) is also used to estimate a solution. FEX discovers a high probability of the "$\exp$" operator appearing at the tree root ($100\%$ for $d=2$, $4$, $6$, $8$, and $93.3\%$ for $d=10$, as shown in Fig. 6). Therefore, it is reasonable to assume that the eigenfunction is of the form $\exp(v(\bm{x}))$. Letting $u(\bm{x})=\exp(v(\bm{x}))$, Eqn. (24) simplifies to

(26) $-\Delta v-\|\nabla v\|_{2}^{2}+\|\bm{x}\|_{2}^{2}=\gamma.$

Eqn. (26) does not admit the trivial zero solution, so we can avoid the integration constraints used in Eqns. (23) and (25), which introduce Monte Carlo errors. Using Eqn. (26) and the Rayleigh quotient $\mathcal{I}$, the values of $v$ and $\gamma$ are alternately updated until convergence. The details of this iterative algorithm are presented in Appx. 6.4. Fig. 6 shows the relative absolute error of the estimated eigenvalues as the dimension grows. Directly optimizing (25) with the NN method produces a large error in the eigenvalue estimate, especially in high dimensions (e.g., the relative error reaches 20% when $d=10$). With the FEX-based postprocessing algorithm, we can identify the eigenvalue with an error close to zero.
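As a rough illustration only (the actual algorithm is given in Appx. 6.4), the alternating update described above could be organized as in the following hedged PyTorch sketch: with $\gamma$ fixed, the parameters of $v$ are fit to the residual of (26), and $\gamma$ is then refreshed from the Rayleigh quotient of $\exp(v)$. The function names and the two callables are assumptions.

```python
import torch

def alternate_eigen_solve(residual_loss, rayleigh_quotient, theta, gamma0,
                          outer_steps=50, inner_steps=200, lr=1e-3):
    """Alternating sketch: fit v to the residual of (26), then refresh gamma.

    residual_loss(theta, gamma) : Monte Carlo estimate of ||-Lap(v) - |grad v|^2 + |x|^2 - gamma||^2
    rayleigh_quotient(theta)    : estimate of I(exp(v(.; theta)))
    theta                       : parameters of v (list of tensors with requires_grad=True)
    """
    gamma = gamma0
    opt = torch.optim.Adam(theta, lr=lr)
    for _ in range(outer_steps):
        for _ in range(inner_steps):          # update v with gamma held fixed
            opt.zero_grad()
            loss = residual_loss(theta, gamma)
            loss.backward()
            opt.step()
        with torch.no_grad():                 # refresh gamma from the Rayleigh quotient
            gamma = float(rayleigh_quotient(theta))
    return gamma, theta
```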

5 Discussion and perspectives

In this paper, we proposed the finite expression method, a methodology for finding PDE solutions as simple mathematical expressions. Our theory showed that mathematical expressions can overcome the curse of dimensionality. We provided one implementation that represents mathematical expressions with trees and solves the resulting combinatorial optimization in FEX using reinforcement learning. Our results demonstrated the effectiveness of FEX in achieving high, even machine-level, accuracy on various high-dimensional PDEs, while existing solvers suffered from low accuracy in comparison.

While CO is inherently a complex challenge, continued exploration can be highly advantageous in achieving improved performance, particularly when dealing with more intricate PDE problems within the framework of FEX.

  • Efficient computation of operator sequence score. We employed Formulation (11) as a proxy for assessing the quality of an operator sequence. However, it’s important to note that this formulation necessitates optimization over multiple steps, resulting in non-negligible computational costs that can impact the overall expense of solving the CO problem in FEX. Therefore, it is crucial to define a more efficient scoring method that reduces computational expenses and simplifies the identification of favorable operator sequences.

  • Design of the controller. As an illustrative example of CO solving in this work, we employed a straightforward fully connected network to model the distribution responsible for proposing operator selections. It’s worth noting that more sophisticated techniques can also be applied, such as recurrent neural networks [36], which take prior decisions as input and generate new decisions as output. These advanced methods can offer enhanced modeling capabilities and adaptability to the context of the problem.

Data and Code Availability

No data is generated in this work. Source codes for reproducing the results in this paper are available online at: https://github.com/LeungSamWai/Finite-expression-method. The source codes are released under the MIT license.

Acknowledgement

H. Yang was partially supported by the US National Science Foundation under awards DMS-2244988 and DMS-2206333, and the Office of Naval Research Young Investigator Award N00014-23-1-2007.

References

  • [1] Python package gplearn. https://gplearn.readthedocs.io/en/stable/.
  • [2] David J Acheson. Elementary fluid dynamics, 1991.
  • [3] Mordecai Avriel. Nonlinear programming: analysis and methods. Courier Corporation, 2003.
  • [4] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
  • [5] Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V Le. Neural optimizer search with reinforcement learning. In International Conference on Machine Learning, pages 459–468. PMLR, 2017.
  • [6] Simone Bianco, Remi Cadene, Luigi Celona, and Paolo Napoletano. Benchmark analysis of representative deep neural network architectures. IEEE Access, 6:64270–64277, 2018.
  • [7] Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198, 2019.
  • [8] Fan Chen, Jianguo Huang, Chunmei Wang, and Haizhao Yang. Friedrichs learning: Weak solutions of partial differential equations via deep learning. arXiv preprint arXiv:2012.08023, 2020.
  • [9] Jingrun Chen, Shi Jin, and Liyao Lyu. A deep learning based discontinuous galerkin method for hyperbolic equations with discontinuous solutions and random uncertainties. arXiv preprint arXiv:2107.01127, 2021.
  • [10] Wang Chi Cheung, Vincent Tan, and Zixin Zhong. A thompson sampling algorithm for cascading bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 438–447. PMLR, 2019.
  • [11] John D Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Quoc V Le, Sergey Levine, Honglak Lee, and Aleksandra Faust. Evolving reinforcement learning algorithms. In International Conference on Learning Representations, 2021.
  • [12] M. W. M. G. Dissanayake and N. Phan-Thien. Neural-network-based Approximations for Solving Partial Differential Equations. Comm. Numer. Methods Engrg., 10:195–201, 1994.
  • [13] Weinan E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.
  • [14] Weinan E and Bing Yu. The deep ritz method: a deep learning-based numerical algorithm for solving variational problems. Commun. Math. Stat., 6:1–12, 2018.
  • [15] L.C. Evans. Partial Differential Equations. Graduate studies in mathematics. American Mathematical Society, 2010.
  • [16] Richard P Feynman, Robert B Leighton, and Matthew Sands. The feynman lectures on physics; vol. i. American Journal of Physics, 33(9):750–752, 1965.
  • [17] Roger Fletcher. Practical methods of optimization. John Wiley & Sons, 2013.
  • [18] Christian Grossmann, Hans-Görg Roos, and Martin Stynes. Numerical treatment of partial differential equations, volume 154. Springer, 2007.
  • [19] J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci., 115(34):8505–8510, 2018.
  • [20] Jiequn Han, Jianfeng Lu, and Mo Zhou. Solving high-dimensional eigenvalue problems using deep neural networks: A diffusion monte carlo like approach. Journal of Computational Physics, 423:109792, 2020.
  • [21] Jiequn Han and E Weinan. Deep learning approximation for stochastic control problems. ArXiv, abs/1611.07422, 2016.
  • [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [23] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 558–567, 2019.
  • [24] Yuling Jiao, Yanming Lai, Xiliang Lu, and Zhijian Yang. Deep neural networks with relu-sine-exponential activations break curse of dimensionality on hölder class. 2021.
  • [25] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric pde problems with artificial neural networks. arXiv: Numerical Analysis, 2017.
  • [26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [27] John G Kirkwood, Robert L Baldwin, Peter J Dunlop, Louis J Gosting, and Gerson Kegeles. Flow equations and frames of reference for isothermal diffusion in liquids. The Journal of Chemical Physics, 33(5):1505–1513, 1960.
  • [28] Isaac E Lagaris, Aristidis Likas, and Dimitrios I Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE transactions on neural networks, 9(5):987–1000, 1998.
  • [29] Mikel Landajuela, Brenden K Petersen, Sookyung Kim, Claudio P Santiago, Ruben Glatt, Nathan Mundhenk, Jacob F Pettit, and Daniel Faissol. Discovering symbolic policies with deep reinforcement learning. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5979–5989. PMLR, 18–24 Jul 2021.
  • [30] Lev Davidovich Landau and Evgenii Mikhailovich Lifshitz. Quantum mechanics: non-relativistic theory, volume 3. Elsevier, 2013.
  • [31] Fu Li, Umberto Villa, Seonyeong Park, and Mark A. Anastasio. 3-d stochastic numerical breast phantoms for enabling virtual imaging trials of ultrasound computed tomography. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 69(1):135–146, 2022.
  • [32] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.
  • [33] Ziqi Liu, Wei Cai, and Zhi-Qin John Xu. Multi-scale deep neural network (mscalednn) for solving poisson-boltzmann equation in complex domains. Communications in Computational Physics, 28(5):1970–2001, Jun 2020.
  • [34] Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.
  • [35] David J Murray-Smith. Modelling and simulation of integrated systems in engineering: issues of methodology, quality, testing and application. Elsevier, 2012.
  • [36] Brenden K Petersen, Mikel Landajuela Larma, Terrell N. Mundhenk, Claudio Prata Santiago, Soo Kyung Kim, and Joanne Taery Kim. Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. In International Conference on Learning Representations, 2021.
  • [37] Jean Philibert. One and a half century of diffusion: Fick, einstein. Diffusion Fundamentals: Leipzig 2005, 1:8, 2005.
  • [38] M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686 – 707, 2019.
  • [39] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, 2018.
  • [40] JN Reddy. An introduction to the finite element method, volume 1221. McGraw-Hill New York, 2004.
  • [41] Basri Ronen, David Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. Advances in Neural Information Processing Systems, 32, 2019.
  • [42] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986.
  • [43] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Neural network approximation: Three hidden layers are enough. arXiv:2010.14075, 2020.
  • [44] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation: Achieving arbitrary accuracy with fixed number of neurons. ArXiv, abs/2107.02397, 2021.
  • [45] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network with approximation error being reciprocal of width to power of square root of depth. Neural Computation, 2021.
  • [46] Marvin Shinbrot. Lectures on fluid mechanics. Courier Corporation, 2012.
  • [47] Justin Sirignano and Konstantinos Spiliopoulos. Dgm: A deep learning algorithm for solving partial differential equations. Journal of computational physics, 375:1339–1364, 2018.
  • [48] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [49] E Weinan, Jiequn Han, and Arnulf Jentzen. Algorithms for solving high dimensional pdes: From nonlinear monte carlo to machine learning. Nonlinearity, 35(1):278, 2021.
  • [50] Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. Communications in Computational Physics, 28(5):1746–1767, 2020.
  • [51] Dmitry Yarotsky. Elementary superexpressive activations. ArXiv, abs/2102.10911, 2021.
  • [52] Yulei Liao and Pingbing Ming. Deep nitsche method: Deep ritz method with essential boundary conditions. Communications in Computational Physics, 29(5):1365–1384, 2021.
  • [53] Yaohua Zang, Gang Bao, Xiaojing Ye, and Haomin Zhou. Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics, 411:109409, 2020.

6 Appendix

6.1 Proofs of Theorems

Proof of Thm. 4.

By Thm. 1.1 of Shen et al. (2021) [44], for any continuous function $f\in C([a,b]^{d})$, there exists a fully connected neural network (FNN) $\phi$ with width $N=36d(2d+1)$ and depth $L=11$ (i.e., $11$ hidden layers) such that, for an arbitrary $\varepsilon>0$, $\|\phi-f\|_{L^{\infty}([a,b]^{d})}<\varepsilon$. This FNN is constructed via an activation function with the explicit formula $\sigma(x)=\sigma_{1}(x):=\big|x-2\lfloor\tfrac{x+1}{2}\rfloor\big|$ for $x\in[0,\infty)$ and $\sigma(x)=\sigma_{2}(x):=\frac{x}{|x|+1}$ for $x\in(-\infty,0)$; equivalently, $\sigma(x)=\frac{\operatorname{sign}(x)+1}{2}\sigma_{1}(x)-\frac{\operatorname{sign}(x)-1}{2}\sigma_{2}(x)$, so evaluating $\sigma(x)$ requires at most $18$ operators. For an FNN of width $N$ and depth $L$, evaluating one output requires $N(d+1)+(L-1)N^{2}$ operators "$\times$", $Nd-1+(L-1)N(N-1)$ operators "$+$", and $NL$ evaluations of $\sigma(x)$. Hence, the FNN $\phi$ is a mathematical expression with at most $k_{d}:=103680d^{4}+103824d^{3}+39600d^{2}+6804d-1=\mathcal{O}(d^{4})$ operators. Consequently, for any $\varepsilon>0$ and any continuous function $f$ on $[a,b]^{d}$, there is a $k_{d}$-finite expression that approximates $f$ uniformly on $[a,b]^{d}$ within $\varepsilon$ accuracy. Since $k_{d}$ is independent of $\varepsilon$, the function space of $k_{d}$-finite expressions is dense in $C([a,b]^{d})$. ∎
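For the reader's convenience, we note how the constant $k_{d}$ arises (a short verification added here, not part of the original argument): substituting $N=36d(2d+1)=72d^{2}+36d$ and $L=11$ into the counts above, and charging at most $18$ operators per evaluation of $\sigma$, gives

\begin{align*}
k_{d} &= \big[N(d+1)+(L-1)N^{2}\big]+\big[Nd-1+(L-1)N(N-1)\big]+18\,NL\\
      &= 20N^{2}+N(2d+1)+188N-1\\
      &= 103680d^{4}+103824d^{3}+39600d^{2}+6804d-1=\mathcal{O}(d^{4}).
\end{align*}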

Proof of Thm. 5.

By Cor. 3.8 of Jiao et al. (2021) [24], for $p\in[1,+\infty)$, any $f\in\mathcal{H}^{\alpha}_{\mu}([0,1]^{d})$, and any $\varepsilon>0$, there exists an FNN $\phi$ with width

$$N=\max\left\{2d\left\lceil\log_{2}\left(\sqrt{d}\left(\frac{3\mu}{\varepsilon}\right)^{1/\alpha}\right)\right\rceil,\ 2\left\lceil\log_{2}\frac{3\mu d^{\alpha/2}}{2\varepsilon}\right\rceil+2\right\}$$

and depth $L=6$ such that $\|\phi-f\|_{L^{p}([0,1]^{d})}<\varepsilon$. This FNN is constructed with activation functions chosen from the set $\{\sin(x),\max\{0,x\},2^{x}\}$. As in the proof of Thm. 4, evaluating one output of the FNN requires $N(d+1)+(L-1)N^{2}$ operators "$\times$", $Nd-1+(L-1)N(N-1)$ operators "$+$", and $NL$ evaluations of activation functions. Therefore, the total number of operators in $\phi$ as a mathematical expression is $\mathcal{O}\big(d^{2}(\log d+\log\frac{1}{\varepsilon})^{2}\big)$, which completes the proof. ∎
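As a brief sanity check (added here for completeness): the width above satisfies $N=\mathcal{O}\big(d(\log d+\log\frac{1}{\varepsilon})\big)$ while the depth $L=6$ is a constant, so the operator count is dominated by the $(L-1)N^{2}$ term,

$$N(d+1)+(L-1)N^{2}+Nd-1+(L-1)N(N-1)+NL=\mathcal{O}(N^{2}+Nd)=\mathcal{O}\Big(d^{2}\big(\log d+\log\tfrac{1}{\varepsilon}\big)^{2}\Big).$$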

6.2 Algorithm to solve CO with reinforcement learning

We include pseudo-code for the proposed FEX implementation in Alg. 1 and Alg. 2. Alg. 1 outlines the procedure for solving the CO problem with a fixed, predefined tree, while Alg. 2 progressively expands the tree to increase its expressiveness, invoking Alg. 1 at each stage in search of an improved solution. A minimal code sketch of the search loop in Alg. 1 is given after Alg. 2.

Algorithm 1 FEX with a fixed tree

Input: PDE and the associated functional $\mathcal{L}$; a tree $\mathcal{T}$; search loop iterations $T$; coarse-tune iterations $T_{1}$ with Adam; coarse-tune iterations $T_{2}$ with BFGS; fine-tune iterations $T_{3}$ with Adam; pool size $K$; batch size $N$.

Output: The solution $u(\bm{x};\mathcal{T},\hat{\bm{e}},\hat{\bm{\theta}})$.

1:Initialize the agent $\bm{\chi}$ for the tree $\mathcal{T}$
2:$\mathbb{P}\leftarrow\{\}$
3:for $\_$ from $1$ to $T$ do
4:     Sample $N$ sequences $\{\bm{e}^{(1)},\bm{e}^{(2)},\cdots,\bm{e}^{(N)}\}$ from $\bm{\chi}$
5:     for $n$ from $1$ to $N$ do
6:         Optimize $\mathcal{L}(u(\bm{x};\mathcal{T},\bm{e}^{(n)},\bm{\theta}))$ through coarse-tuning over $T_{1}+T_{2}$ iterations to get the score $S(\bm{e}^{(n)})$
7:         if $\bm{e}^{(n)}$ belongs to the top-$K$ of all scores in $\mathbb{P}$ then
8:              $\mathbb{P}$.append($\bm{e}^{(n)}$)
9:              $\mathbb{P}$ pops the $\bm{e}$ with the smallest score when $|\mathbb{P}|>K$
10:         end if
11:     end for
12:     Update $\bm{\chi}$ using (17)
13:end for
14:for $\bm{e}$ in $\mathbb{P}$ do
15:     Fine-tune $\mathcal{L}(u(\bm{x};\mathcal{T},\bm{e},\bm{\theta}))$ with $T_{3}$ iterations.
16:end for
17:return the expression with the smallest fine-tune error.
Algorithm 2 FEX with progressively expanding trees

Input: Tree set $\{\mathcal{T}_{1},\mathcal{T}_{2},\cdots\}$; error tolerance $\epsilon$.

Output: The solution $u(\bm{x};\hat{\mathcal{T}},\tilde{\bm{e}},\tilde{\bm{\theta}})$.

1:for $\mathcal{T}$ in $\{\mathcal{T}_{1},\mathcal{T}_{2},\cdots\}$ do
2:     Initialize the agent $\bm{\chi}$ for the tree $\mathcal{T}$
3:     Obtain $u(\bm{x};\mathcal{T},\hat{\bm{e}},\hat{\bm{\theta}})$ from Algorithm 1
4:     if $\mathcal{L}(u(\cdot;\mathcal{T},\hat{\bm{e}},\hat{\bm{\theta}}))\leq\epsilon$ then
5:         Break
6:     end if
7:end for
8:return the expression with the smallest functional value.
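
To make the control flow of Alg. 1 concrete, the following minimal Python sketch mirrors its search loop. It is an illustration only: controller, coarse_tune, and fine_tune are hypothetical placeholders for the agent $\bm{\chi}$, the scoring by coarse-tuning with Adam/BFGS, and the fine-tuning stage described in the main text, and the update rule (17) is abstracted into controller.update.

import heapq, itertools

def fex_fixed_tree(tree, functional, T, T1, T2, T3, K, N, controller):
    # Minimal sketch of Alg. 1; controller, coarse_tune, and fine_tune are
    # hypothetical placeholders for components described in the text.
    pool, tie = [], itertools.count()            # candidate pool P, kept at size <= K
    for _ in range(T):
        sequences = controller.sample(N)         # step 4: sample e^(1), ..., e^(N) from chi
        scores = []
        for e in sequences:
            # step 6: coarse-tune theta (T1 Adam + T2 BFGS iterations) and record the score S(e)
            score, theta = coarse_tune(functional, tree, e, T1, T2)
            scores.append(score)
            heapq.heappush(pool, (score, next(tie), e, theta))   # steps 7-9: keep the top-K scores
            if len(pool) > K:
                heapq.heappop(pool)              # discard the candidate with the smallest score
        controller.update(sequences, scores)     # step 12: policy-gradient update, cf. (17)
    # steps 14-17: fine-tune every pooled candidate and return the smallest fine-tuned error
    finals = [(fine_tune(functional, tree, e, theta, T3), e) for _, _, e, theta in pool]
    return min(finals, key=lambda r: r[0])

Here a min-heap keyed by the score is one simple way to keep the pool $\mathbb{P}$ at size at most $K$; any top-$K$ data structure would serve equally well.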

6.3 Using a fully connected network to model a controller

We illustrate in Figure 7 the use of a fully connected neural network to model the controller, which outputs a probability mass function for each node of a tree. As an example, suppose the tree contains three nodes: $n_{1}=2$ nodes for the binary set with $|\mathbbm{B}|=2$ operators and $n_{2}=1$ node for the unary set with $|\mathbbm{U}|=3$ operators. Consequently, the output size of the neural network is $n_{1}|\mathbbm{B}|+n_{2}|\mathbbm{U}|=7$. This output is divided into $n_{1}+n_{2}$ parts, each corresponding to the probability mass function of its respective node. The input to the neural network is simply a constant vector. A minimal code sketch follows Fig. 7.

Refer to caption
Fig. 7: Illustration of using a fully connected neural network to model the controller that outputs the probability mass functions for each node within a tree. As an example, the tree contains three nodes: $n_{1}=2$ nodes for the binary set with $|\mathbbm{B}|=2$ operators and $n_{2}=1$ node for the unary set with $|\mathbbm{U}|=3$ operators.
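
The following PyTorch sketch shows one way such a controller could be realized; it is a minimal illustration under stated assumptions (the hidden width, the ReLU nonlinearity, the softmax normalization of each chunk, and the name Controller are illustrative choices, not the paper's released implementation).

import torch
import torch.nn as nn

class Controller(nn.Module):
    # Sketch of the controller chi described above; sizes follow the example
    # (n1 = 2 binary nodes with |B| = 2 operators, n2 = 1 unary node with |U| = 3).
    def __init__(self, n_binary=2, n_unary=1, n_bin_ops=2, n_un_ops=3, hidden=64):
        super().__init__()
        self.sizes = [n_bin_ops] * n_binary + [n_un_ops] * n_unary   # one chunk per tree node
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, sum(self.sizes)),   # output size n1*|B| + n2*|U| = 7 here
        )

    def forward(self):
        logits = self.net(torch.ones(1, 1))       # the input is simply a constant vector
        # split the output into one chunk per node and normalize each chunk into a pmf
        chunks = torch.split(logits, self.sizes, dim=-1)
        return [torch.softmax(c, dim=-1) for c in chunks]

# Example usage: sample one operator index per node from the controller's pmfs.
controller = Controller()
ops = [torch.multinomial(pmf.squeeze(0), 1).item() for pmf in controller()]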

6.4 Iterative method for eigenpairs

First, by solving Eqn. (25) with our FEX, we obtain an estimated eigenfunction $u(\bm{x};\mathcal{T},\hat{\bm{e}},\hat{\bm{\theta}})$ and an initial estimate of the eigenvalue through the Rayleigh quotient, $\gamma_{0}=\mathcal{I}(u(\cdot;\mathcal{T},\hat{\bm{e}},\hat{\bm{\theta}}))$. We then use Eqn. (26) to find the eigenpair iteratively. We define the functional for Eqn. (26) by

(27)    $\displaystyle\mathcal{L}_{2}(v,\gamma):=\big\|-\Delta v-\|\nabla v\|_{2}^{2}+\|\bm{x}\|_{2}^{2}-\gamma\big\|_{L_{2}(\Omega)}^{2}.$

Given $\gamma_{i}$, we aim to find a $v$, expressed as a mathematical expression, that minimizes $\mathcal{L}_{2}(v,\gamma_{i})$. Assuming $v$ is represented by a binary tree, i.e., $v:=v(\cdot;\mathcal{T},\bm{e},\bm{\theta})$, we search for the solution with our FEX via the optimization

(28)    $\displaystyle\bm{e}_{i}^{*},\bm{\theta}_{i}^{*}\approx\operatorname*{arg\,min}_{\bm{e},\bm{\theta}}\mathcal{L}_{2}(v(\cdot;\mathcal{T},\bm{e},\bm{\theta}),\gamma_{i}).$

Next, we compute the updated eigenvalue estimate $\gamma_{i+1}=\mathcal{I}(\exp(v(\cdot;\mathcal{T},\bm{e}_{i}^{*},\bm{\theta}_{i}^{*})))$.

Repeating this loop for $G$ iterations yields the eigenpair $\gamma_{G}$ and $\exp(v(\cdot;\mathcal{T},\bm{e}_{G}^{*},\bm{\theta}_{G}^{*}))$.
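
The outer loop above can be summarized by the short Python sketch below; fex_solve (solving (28) with FEX) and rayleigh_quotient (the estimator $\mathcal{I}$) are hypothetical placeholders for the routines described in the text.

import math

def eigenpair_iteration(gamma0, G=10):
    # Sketch of the iterative eigenpair refinement; fex_solve and rayleigh_quotient
    # are hypothetical placeholders, not an actual released implementation.
    gamma = gamma0                          # initial estimate gamma_0 from the Rayleigh quotient
    for _ in range(G):
        v = fex_solve(gamma)                # solve (28): returns the expression v(.; T, e_i*, theta_i*)
        u = lambda x, v=v: math.exp(v(x))   # candidate eigenfunction exp(v)
        gamma = rayleigh_quotient(u)        # gamma_{i+1} = I(exp(v))
    return gamma, u                         # approximate eigenpair after G iterations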

Implementation. In our iterative method, the number of outer iterations is $G=10$, and $\lambda_{1}=\lambda_{2}=500$ in (25). The batch size for estimating the first and third terms of (25) is 10,000, while that for the second (boundary) term is 2,000. ReLU2 is used in the ResNet, following [14].