
Finite Expression Method for Solving High-Dimensional Partial Differential Equations

Senwei Liang (Department of Mathematics, Purdue University) and Haizhao Yang (Department of Mathematics and Department of Computer Science, University of Maryland, College Park)
Abstract

Designing efficient and accurate numerical solvers for high-dimensional partial differential equations (PDEs) remains a challenging and important topic in computational science and engineering, mainly due to the “curse of dimensionality” in designing numerical schemes that scale in dimension. This paper introduces a new methodology that seeks an approximate PDE solution in the space of functions with finitely many analytic expressions and, hence, this methodology is named the finite expression method (FEX). It is proved in approximation theory that FEX can avoid the curse of dimensionality. As a proof of concept, a deep reinforcement learning method is proposed to implement FEX for various high-dimensional PDEs in different dimensions, achieving high and even machine accuracy with a memory complexity polynomial in dimension and an amenable time complexity. An approximate solution with finite analytic expressions also provides interpretable insights into the ground truth PDE solution, which can further help to advance the understanding of physical systems and design postprocessing techniques for a refined solution.

keywords:
High-dimensional PDEs, Deep neural networks, Mathematical expressions, Finite expression method, Reinforcement learning.

1 Introduction

Partial differential equations (PDEs) play a fundamental role in scientific fields for modeling diverse physical phenomena, including diffusion [37, 27], fluid dynamics [2, 46] and quantum mechanics [16, 30]. Developing efficient and accurate solvers for numerical solutions to high-dimensional PDEs remains an important and challenging topic [49]. Many traditional solvers, such as finite element method (FEM) [40] and finite difference [18], are usually limited to low-dimensional domains since the computational cost increases exponentially in the dimension [49, 31]. Recently, neural networks (NNs) as mesh-free parameterization are widely employed in solving high-dimensional PDEs [13, 19, 25, 38, 47, 14] and control problems [21]. In theory, NNs have the capability of approximating various functions well and lessening the curse of dimensionality [45, 43, 51, 44, 24]. Yet the highly non-convex objective function and the spectral bias towards fitting a smooth function in NN optimization make it difficult to achieve high accuracy [50, 7, 41]. In practice, NN-based solvers can hardly achieve a highly accurate solution even when the true solution is a simple function, especially for high-dimensional problems [14, 33]. Besides, NN parametrization may still require large memory and high computation cost for high-dimensional problems [6]. Finally, numerical solutions from both traditional solvers and NN-based solvers lack interpretability, e.g., the dependence of the solution on variables is not readily apparent from the numerical solutions.

Fig. 1: Overview of various numerical solvers for PDEs. In our finite expression method, we aim to find a PDE solution as a mathematical expression with finitely many operators. The resulting solution can reproduce the true solution and achieve high, even machine-level, accuracy. Furthermore, the mathematical expression has memory complexity polynomial in dimension and provides interpretable insights to advance understanding of physical systems and aid in designing postprocessing techniques for solution refinement.

In this paper, we propose the finite expression method (FEX), a methodology that aims to find a solution in the function space of mathematical expressions with finitely many operators. Compared with the NN and FEM methods, our FEX enjoys the following advantages (summarized in Fig. 1): (1) The expression can reproduce the true solution and achieve high, even machine-level, accuracy. (2) The expression requires low memory (a line of string) for solution storage and low computational cost for evaluation on new points. (3) The expression has good interpretability with an explicit, readable form. Moreover, from an approximation theory perspective detailed in Sec. 2.3, the expression in FEX is capable of avoiding the curse of dimensionality in theory.

In FEX, we formulate the search for mathematical expressions as a combinatorial optimization (CO) involving both discrete and continuous variables. While many techniques can be used to solve the CO in FEX, we provide a numerically effective implementation based on reinforcement learning (RL). Traditional algorithms (e.g. genetic programming and simulated annealing [35]) address CO problems by employing hand-crafted heuristics that heavily rely on specific problem formulations and domain knowledge [4, 10, 34]. However, RL has emerged as a popular and versatile tool for learning how to construct CO solutions based on reward feedback, without the need for extensive heuristic design. The success of RL applications in CO, such as automatic algorithm design [39, 5, 11] and symbolic regression [36, 29], has inspired us to seek mathematical expression solutions with RL. Specifically, in our implementation, the mathematical expression is represented by a binary tree, where each node is an operator along with parameters (scaling and bias) as shown in Fig. 3 and further explained in Sec. 3.1. The objective function we aim to minimize is a functional, and its minimization leads to the solution of the PDE, as described in Sec. 2.2. Consequently, our problem involves the minimization of both discrete operators and continuous parameters embedded within the tree structure. Optimizing both discrete and continuous variables simultaneously is inherently difficult. We propose a search loop for this CO as depicted in Fig. 2. Our idea is to first identify good operator selections that have a high possibility of recovering the true solution structure, and then optimize the parameters. The proposals of operator selections are drawn from a controller which will be updated via policy gradient [36] iteratively. In Sec. 4, we numerically demonstrate the ability of this RL-based implementation to find mathematical expressions that solve high-dimensional PDEs with high, even machine-level, accuracy. Furthermore, FEX provides interpretable insights into the ground truth PDE solutions, which can further advance understanding of physical systems and aid in designing postprocessing techniques to refine solutions.

Fig. 2: Representation of the components of our FEX implementation. (a) The searching loop for the finite expression solution consists of expression generation, score computation, controller update, and candidate optimization. (b) Depiction of the expression generation with a binary tree and a controller $\bm{\chi}$. The controller outputs probability mass functions for each node in the tree, and these functions are used to sample node values. The final expression, which incorporates learnable scaling and bias parameters, is constructed based on the predefined tree structure and the sampled node values.

2 An Abstract Methodology of Finite Expression Method

The goal of FEX is to find a mathematical expression to solve a PDE. This section will formally define the function space of mathematical expressions and formulate the problem of FEX as a CO.

2.1 Function Space with Finite Expressions in FEX

Mathematical expressions will be introduced to form the function space of FEX.

Definition 1 (Mathematical expression).

A mathematical expression is a combination of symbols that is well-formed according to syntactic rules and defines a valid function. The symbols include operands (variables and numbers), operators (e.g., "$+$", "$\sin$", integral, derivative), brackets, and punctuation.

In the definition of mathematical expressions, we only consider expressions that form a valid function. In our context, "$\sin(x\times y)+1$" is a mathematical expression, but, for example, "$5>x$" and "$\sin(x\times y)+$" are not mathematical expressions since they do not form valid functions. The operands and operators comprise the structure of an expression, and parentheses clarify the order of operations.

Definition 2 ($k$-finite expression).

A mathematical expression is called a $k$-finite expression if the number of operators in this expression is $k$.

"$\sin(x\times y)+1$" is a $3$-finite expression since there are three operators in it ("$\times$", "$\sin$", and "$+$"). A series such as "$1+x^{1}+\frac{x^{2}}{2}+\frac{x^{3}}{6}+\cdots$" is a mathematical expression, but it is not a finite expression since the number of operators is infinite. Formally, with the concept of finite expressions, we can define FEX as follows,

Definition 3 (Finite expression method).

The finite expression method is a methodology to solve a PDE numerically by seeking a finite expression such that the resulting function solves the PDE approximately.

We denote by $\mathbb{S}_{k}$ the set of functions that are formed by finite expressions with the number of operators not larger than $k$. $\mathbb{S}_{k}$ forms the function space of FEX. Clearly, $\mathbb{S}_{1}\subset\mathbb{S}_{2}\subset\mathbb{S}_{3}\subset\cdots$.

2.2 Identifying PDE Solutions in FEX

We denote by $\mathcal{L}:\mathbb{S}\to\mathbb{R}$ a functional associated with a given PDE, where $\mathbb{S}$ is a function space and the minimizer of $\mathcal{L}$ is the best solution of the PDE in $\mathbb{S}$. In FEX, given the number of operators $k$, the problem of seeking a finite expression solution is formulated as a CO over $\mathbb{S}_{k}$ via

(1) $\min\{\mathcal{L}(u)\,|\,u\in\mathbb{S}_{k}\}.$

The choice of the functional $\mathcal{L}$ is problem-dependent, and one may conceive a better functional for a specific PDE with a specific constraint or domain. Some popular choices include least-square methods [28, 47, 12] and variational formulations [14, 52, 8, 53].

2.2.1 Least Square Method

Suppose that the PDE is given by

(2) $\mathcal{D}u(\bm{x})=f(u(\bm{x}),\bm{x}),\ \bm{x}\in\Omega,\qquad \mathcal{B}u(\bm{x})=g(\bm{x}),\ \bm{x}\in\partial\Omega,$

where $\mathcal{D}$ is a differential operator, $f(u(\bm{x}),\bm{x})$ can be a nonlinear function in $u$, $\Omega$ is a bounded domain in $\mathbb{R}^{d}$, and $\mathcal{B}u=g$ characterizes the boundary condition (e.g., Dirichlet, Neumann and Robin [15]). The least square method [28, 47, 12] defines a straightforward functional to characterize the error of the estimated solution by

(3) $\mathcal{L}(u):=\|\mathcal{D}u(\bm{x})-f(u,\bm{x})\|_{L^{2}(\Omega)}^{2}+\lambda\|\mathcal{B}u(\bm{x})-g(\bm{x})\|_{L^{2}(\partial\Omega)}^{2},$

where $\lambda$ is a positive coefficient to enforce the boundary constraint.
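For concreteness, the following is a minimal numpy sketch (not part of the original method description) of how the functional (3) can be estimated by Monte Carlo sampling. The samplers, the residual callables, and the quadratic Poisson example of Sec. 4.2.1 are illustrative assumptions; in a full FEX implementation the residuals would be evaluated by differentiating the candidate expression.

```python
import numpy as np

def least_squares_loss(residual_int, residual_bdy, sample_interior, sample_boundary,
                       lam=100.0, n_int=5000, n_bdy=1000):
    """Monte Carlo estimate of the least-squares functional (3)."""
    x_int = sample_interior(n_int)
    x_bdy = sample_boundary(n_bdy)
    return np.mean(residual_int(x_int) ** 2) + lam * np.mean(residual_bdy(x_bdy) ** 2)

# Illustration: -Lap(u) = f on Omega = [-1, 1]^d with u = g on the boundary, and the
# quadratic candidate u(x) = 0.5 * sum_i x_i^2 (so -Lap(u) = -d and the residuals vanish).
d = 10
sample_interior = lambda n: np.random.uniform(-1, 1, size=(n, d))
def sample_boundary(n):
    x = np.random.uniform(-1, 1, size=(n, d))
    x[np.arange(n), np.random.randint(d, size=n)] = np.random.choice([-1.0, 1.0], size=n)
    return x
u_cand = lambda x: 0.5 * np.sum(x ** 2, axis=1)                      # candidate expression
g = lambda x: 0.5 * np.sum(x ** 2, axis=1)                           # Dirichlet boundary data
residual_int = lambda x: -np.full(len(x), float(d)) - (-float(d))    # D u - f with f = -d
residual_bdy = lambda x: u_cand(x) - g(x)                            # B u - g
print(least_squares_loss(residual_int, residual_bdy, sample_interior, sample_boundary))  # ~0
```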

2.2.2 Variational Formulation

We next introduce the variational formulation, which is commonly used to identify numerical PDE solutions [14, 52]. As an example, consider an elliptic PDE with homogeneous Dirichlet boundary conditions:

(4) $-\Delta u(\bm{x})+c(\bm{x})u(\bm{x})=f(\bm{x}),\ \bm{x}\in\Omega,\qquad u(\bm{x})=0,\ \bm{x}\in\partial\Omega,$

where $c$ is a bounded function and $f\in L^{2}$. The solution $u$ to PDE (4) minimizes the variational energy $\frac{1}{2}\int_{\Omega}\|\nabla u\|^{2}+cu^{2}\,\text{d}\bm{x}-\int_{\Omega}fu\,\text{d}\bm{x}$. By incorporating the boundary condition as a penalty in this variational formulation, we obtain the functional:

(5) $\mathcal{L}(u):=\frac{1}{2}\int_{\Omega}\|\nabla u\|^{2}+cu^{2}\,\text{d}\bm{x}-\int_{\Omega}fu\,\text{d}\bm{x}+\lambda\int_{\partial\Omega}u^{2}\,\text{d}\bm{x}.$

An alternative variational formulation can be defined using test functions. Let $v\in H_{0}^{1}(\Omega)$ be a test function, where $H_{0}^{1}(\Omega)$ denotes the Sobolev space of functions with $L^{2}$-integrable weak derivatives and zero boundary values. The weak solution $u$ of Eqn. (4) is defined as the function that satisfies the bilinear equations:

(6) $a(u,v):=\int_{\Omega}\nabla u\cdot\nabla v+cuv-fv\,\text{d}\bm{x}=0,\quad\forall v\in H_{0}^{1}(\Omega),\quad u(\bm{x})=0,\ \bm{x}\in\partial\Omega,$

where $a(u,v)$ is obtained by multiplying (4) by $v$ and integrating by parts. All derivatives of the solution function can be transferred to the test function through repeated integration by parts, yielding other bilinear forms [8]. The weak solution can be reformulated as the solution to a saddle-point problem [53]:

(7) $\min_{u\in H_{0}^{1}(\Omega)}\max_{v\in H_{0}^{1}(\Omega)}|a(u,v)|^{2}/\|v\|_{L^{2}(\Omega)}^{2}.$

Then, the functional $\mathcal{L}$ identifying the PDE solution is:

(8) $\mathcal{L}(u):=\max_{v\in H_{0}^{1}(\Omega)}|a(u,v)|^{2}/\|v\|_{L^{2}(\Omega)}^{2}+\lambda\int_{\partial\Omega}u^{2}\,\text{d}\bm{x}.$

2.3 Approximation Theory of Elementary Expressions in FEX

The most important theoretical question in solving high-dimensional problems is whether or not a solver suffers from the curse of dimensionality. It will be shown that the function space of $k$-finite expressions, i.e., $\mathbb{S}_{k}$ in (1), is a powerful function space that avoids the curse of dimensionality in approximating high-dimensional continuous functions, leveraging the recent development of advanced approximation theory of deep neural networks [44, 24]. First of all, it can be proved that $\mathbb{S}_{k}$ is dense in $C([0,1]^{d})$ for an arbitrary $d\in\mathbb{N}$ in the following theorem.

Theorem 4.

The function space $\mathbb{S}_{k}$, generated with operators including "$+$", "$-$", "$\times$", "$/$", "$|\cdot|$", "$\mathrm{sign}(\cdot)$", and "$\lfloor\cdot\rfloor$", is dense in $C([a,b]^{d})$ for arbitrary $a,b\in\mathbb{R}$ and $d\in\mathbb{N}$ if $k\geq\mathcal{O}(d^{4})$.

Here "$|\cdot|$", "$\mathrm{sign}(\cdot)$", and "$\lfloor\cdot\rfloor$" denote the absolute value, sign, and floor functions [44], respectively. The proof of Thm. 4 can be found in Appx. 6.1. The denseness of $\mathbb{S}_{k}$ means that the function space of $k$-finite expressions can approximate any $d$-dimensional continuous function to any accuracy, while $k$ is only required to be $\mathcal{O}(d^{4})$, independent of the approximation accuracy. The proof of Thm. 4 takes advantage of the operators "$\mathrm{sign}(\cdot)$" and "$\lfloor\cdot\rfloor$", which might not be frequently used in mathematical expressions. If it is more practical to restrict the operator list to regular operators like "$+$", "$\times$", "$\sin(\cdot)$", exponential functions, and the rectified linear unit (ReLU), then it can be proved that $\mathbb{S}_{k}$ can approximate Hölder functions without the curse of dimensionality, as stated in the following theorem.

Theorem 5.

Suppose the function space $\mathbb{S}_{k}$ is generated with operators including "$+$", "$-$", "$\times$", "$\div$", "$\max\{0,x\}$", "$\sin(x)$", and "$2^{x}$". Let $p\in[1,+\infty)$. For any $f$ in the Hölder function class $\mathcal{H}^{\alpha}_{\mu}([0,1]^{d})$ and $\varepsilon>0$, there exists a $k$-finite expression $\phi$ in $\mathbb{S}_{k}$ such that $\|f-\phi\|_{L^{p}}\leq\varepsilon$, if $k\geq\mathcal{O}(d^{2}(\log d+\log\frac{1}{\varepsilon})^{2})$.

The proof of Thm. 5 can be found in Appx. 6.1. Although finite expressions have a powerful approximation capacity for high-dimensional functions, it is challenging to theoretically prove that our FEX solver to be introduced in Sec. 3 can identify the desired finite expressions with this power. However, the numerical experiments in Sec. 4 demonstrate that our FEX solver can identify the desired finite expressions to machine accuracy for several classes of high-dimensional PDEs.

3 An Implementation of FEX

Following the introduction of the abstract FEX methodology in Sec. 2, this section proposes a numerical implementation. First, binary trees are applied to construct finite expressions in Sec. 3.1. Next, our CO problem (1) is formulated in terms of parameter and operator selection to find expressions that approximate PDE solutions in Sec. 3.2. To solve this CO, we propose a search loop to identify effective operators that have the potential to recover the true solution when selected for the expression. In Appendix 6.2, we provide pseudo-code for the FEX algorithm, which employs expanding trees to search for a solution. For the reader's convenience, the key notations used in this section are summarized in Table 1.

Notation | Explanation
$\mathcal{T}$ | A binary tree
$\bm{e}$ | Operator sequence
$\bm{\theta}$ | Trainable scaling and bias parameters
$\mathcal{L}$ | The functional associated with the PDE solution
$S$ | The scoring function that maps an operator sequence to $[0,1]$
$\bm{\chi}_{\Phi}$ | The controller parameterized by $\Phi$
$\mathcal{J}$ | The objective function for the policy-gradient approach

Table 1: A summary of notations in the FEX implementation.
Fig. 3: Computational rule of a binary tree. In a binary tree, each node holds either a unary or a binary operator. We begin by defining the computational process for a depth-1 tree (i.e., $L=1$), which contains only a single operator. For binary trees of depth greater than one (i.e., $L>1$), the computation is performed recursively.

3.1 Finite Expressions with Binary Trees

A finite expression can be represented as a binary tree $\mathcal{T}$, as depicted in Fig. 3. Each node in the tree is assigned a value from a set of operators, and all these node values collectively form an operator sequence $\bm{e}$, following a predefined order of traversing the tree (e.g., inorder traversal).

Within each node holding a unary operator, we incorporate two additional parameters, a scaling parameter $\alpha$ and a bias parameter $\beta$, to enhance expressiveness. All these parameters are denoted by $\bm{\theta}$. Therefore, a finite expression can be denoted as $u(\bm{x};\mathcal{T},\mathbf{e},\bm{\theta})$, a function of $\bm{x}$. For a fixed $\mathcal{T}$, the maximal number of operators is upper bounded by a constant denoted $k_{\mathcal{T}}$. In this implementation, $\{u(\bm{x};\mathcal{T},\mathbf{e},\bm{\theta})\,|\,\mathbf{e},\bm{\theta}\}$ forms the function space used in the CO to solve a PDE; it is a subset of the functions expressible as finite expressions with at most $k_{\mathcal{T}}$ operators.

The configuration of trees of various depths can be designed as in Fig. 3. Each tree node holds either a binary operator or a unary operator that takes a value from the corresponding binary or unary set. The binary set can be $\mathbb{B}:=\{+,-,\times,\div,\cdots\}$. The unary set can be $\mathbb{U}:=\{\sin,\exp,\log,\text{Id},(\cdot)^{2},\int\cdot\,\text{d}x_{i},\frac{\partial\cdot}{\partial x_{i}},\cdots\}$, which contains elementary functions (e.g., polynomial and trigonometric functions), antiderivative and differentiation operators. Here "Id" denotes the identity map. Notice that if an integration or a derivative is used in the expression, the operator can be applied numerically. Each entry of the operator sequence $\bm{e}$ corresponds one-to-one with a tree node in the traversal order. The length of $\bm{e}$ equals the total number of tree nodes. For example, when inorder traversal is used, Fig. 2b depicts a tree with 4 nodes and $\bm{e}=(\text{Id},\times,\exp,\sin)$.

The computation flow of the binary tree is conducted recursively. The operator of a node is granted higher precedence than that of its parent node. First, as in Fig. 3, we present the computation flow of the basic trees (a tree of depth 1 with only 1 operator). For a basic tree with a unary operator $u_{1}$ and input $i_{0}$, the output is $o_{1}=\alpha u_{1}(i_{0})+\beta$, where $\alpha$ and $\beta$ are the scaling and bias parameters, respectively. For a basic tree with a binary operator $b_{1}$ and inputs $i_{01}$ and $i_{02}$, the output is $o_{1}=b_{1}(i_{01},i_{02})$. With these two basic trees, we are ready to define the computation for arbitrary depth by recursion, as in the examples shown in Fig. 3. Specifically, the input of a parent node is the output of its child node(s). When the tree input at the bottom layer is a $d$-dimensional variable $\bm{x}$ (Fig. 2b), the unary operator directly linked to $\bm{x}$ is applied element-wise to each entry of $\bm{x}$, and the scaling $\alpha$ becomes a $d$-dimensional vector that performs a linear transformation from $\mathbb{R}^{d}$ to $\mathbb{R}^{1}$.
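To make the recursion concrete, the following is a minimal Python sketch of the computational rule in Fig. 3, assuming a simple (hypothetical) node representation; a full FEX implementation would additionally handle differentiation and antiderivative operators as well as arbitrary tree shapes.

```python
import numpy as np

# Hypothetical node types for illustration: a unary node carries an operator with
# scaling alpha and bias beta; a binary node carries an operator and two children.
class Unary:
    def __init__(self, op, child, alpha, beta):
        self.op, self.child, self.alpha, self.beta = op, child, alpha, beta

class Binary:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

class Leaf:      # the d-dimensional input x
    pass

UNARY = {"Id": lambda z: z, "sin": np.sin, "exp": np.exp, "square": lambda z: z ** 2}
BINARY = {"+": np.add, "-": np.subtract, "*": np.multiply}

def evaluate(node, x):
    """Recursively evaluate the expression tree at points x of shape (n, d)."""
    if isinstance(node, Leaf):
        return x                                   # (n, d)
    if isinstance(node, Binary):
        return BINARY[node.op](evaluate(node.left, x), evaluate(node.right, x))
    out = UNARY[node.op](evaluate(node.child, x))  # unary node
    # alpha is a d-vector at the bottom layer (linear map R^d -> R), a scalar otherwise
    return out @ node.alpha + node.beta if out.ndim == 2 else node.alpha * out + node.beta

# Example: alpha^T sin(x) + beta, a depth-1 unary tree applied element-wise to x.
d = 3
tree = Unary("sin", Leaf(), alpha=np.ones(d), beta=0.5)
print(evaluate(tree, np.random.rand(5, d)).shape)   # (5,)
```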

3.2 Solving a CO in FEX

Given a tree $\mathcal{T}$, we aim to seek the PDE solution in the function space $\{u(\bm{x};\mathcal{T},\mathbf{e},\bm{\theta})\,|\,\mathbf{e},\bm{\theta}\}\subset\mathbb{S}_{k_{\mathcal{T}}}$. The mathematical expression can be identified via the minimization of the functional $\mathcal{L}$ associated with a PDE, i.e.,

(9) $\min\{\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}))\,|\,\bm{e},\bm{\theta}\}.$

We introduce the framework for implementing FEX, as displayed in Fig. 2a, to seek a minimizer of (9). The basic idea is to first find a good operator sequence $\bm{e}$ that may uncover the structure of the true solution, and then optimize the parameters $\bm{\theta}$ to minimize the functional in (9). In our framework, the searching loop consists of four parts: 1) Score computation (i.e., rewards in RL). A mixed-order optimization algorithm is proposed to efficiently assess the score of an operator sequence $\bm{e}$ for uncovering the true structure. A higher score suggests a higher possibility of identifying the true solution. 2) Operator sequence generation (i.e., taking actions in RL). A controller is proposed to generate operator sequences with high scores (see Fig. 2b). 3) Controller update (i.e., policy optimization in RL). The controller is updated to increase the probability of producing a good operator sequence via the score feedback of the generated ones. While the controller can be modeled in many ways (e.g., with a heuristic algorithm), we introduce the policy gradient in RL to optimize the controller. 4) Candidate optimization (i.e., a non-greedy strategy). During the search, we maintain a candidate pool to store operator sequences with high scores. After the search, the parameters $\bm{\theta}$ of the high-score operator sequences are optimized to approximate the PDE solution.

3.2.1 Score Computation

The score of an operator sequence $\bm{e}$ is critical to guide the controller toward generating good operator sequences and to help maintain a candidate pool of high scores. Intuitively, the score of $\bm{e}$, denoted $S(\bm{e})$, is defined in the range $[0,1]$ by

(10) $S(\bm{e}):=\big(1+L(\bm{e})\big)^{-1},$

where $L(\bm{e}):=\min\{\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}))\,|\,\bm{\theta}\}$. When $L(\bm{e})$ tends to 0, the expression represented by $\bm{e}$ is close to the true solution, and the score $S(\bm{e})$ goes to 1. Otherwise, $S(\bm{e})$ goes to 0. The global minimizer of $\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}))$ over $\bm{\theta}$ is difficult and expensive to obtain. Instead of exhaustively searching for a global minimizer, a first-order optimization algorithm and a second-order one are combined to accelerate the evaluation of $S(\bm{e})$.

First-order algorithms (e.g., stochastic gradient descent [42] and Adam [26]) that use gradients to update parameters are popular in machine learning. Typically, they demand a small learning rate and a substantial number of iterations to converge effectively, so optimizing $L(\bm{e})$ with a first-order algorithm alone can be time-consuming. Alternatively, second-order algorithms (e.g., the Newton method [3] and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [17]) use the (approximated) Hessian matrix for faster convergence, but obtaining a good minimizer requires a good initial guess. To expedite the optimization of $L(\bm{e})$ in our implementation, we employ a two-step approach: a first-order algorithm is run for $T_{1}$ steps to obtain a well-informed initial guess, and a second-order algorithm is then applied for an additional $T_{2}$ steps to further refine the solution. Let $\bm{\theta}_{0}^{\bm{e}}$ be an initialization and $\bm{\theta}_{T_{1}+T_{2}}^{\bm{e}}$ be the parameter set after $T_{1}+T_{2}$ steps of this two-step optimization. Then $\bm{\theta}_{T_{1}+T_{2}}^{\bm{e}}$ serves as an approximation of $\arg\min_{\bm{\theta}}\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}))$. Finally, $S(\bm{e})$ is estimated by

(11) $S(\bm{e})\approx\big(1+\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}_{T_{1}+T_{2}}^{\bm{e}}))\big)^{-1}.$

Remark that the approximation may exhibit significant variation due to the randomness associated with the initialization of $\bm{\theta}_{0}^{\bm{e}}$.
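As an illustration of this two-stage procedure, here is a hedged PyTorch sketch; the function and argument names are ours, and `expr_loss` stands for a Monte Carlo estimate of $\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}))$. The hyperparameters mirror the choices reported in Sec. 4.1 (Adam warm-up followed by a BFGS-type refinement).

```python
import torch

def score(expr_loss, theta, T1=20, T2=20):
    """Two-stage estimate of S(e) in (11): Adam warm-up followed by (L-)BFGS refinement.

    expr_loss(theta) -> scalar tensor, the functional value L(u(.; T, e, theta));
    theta is a list of tensors with requires_grad=True (the scaling/bias parameters).
    """
    adam = torch.optim.Adam(theta, lr=1e-3)
    for _ in range(T1):                       # first-order warm-up
        adam.zero_grad()
        loss = expr_loss(theta)
        loss.backward()
        adam.step()

    lbfgs = torch.optim.LBFGS(theta, lr=1.0, max_iter=T2)
    def closure():                            # second-order refinement from the warm start
        lbfgs.zero_grad()
        loss = expr_loss(theta)
        loss.backward()
        return loss
    lbfgs.step(closure)

    with torch.no_grad():
        return 1.0 / (1.0 + expr_loss(theta).item())

# Toy usage: L(theta) = (theta - 2)^2, whose minimizer yields a score close to 1.
theta = [torch.zeros(1, requires_grad=True)]
print(score(lambda th: (th[0] - 2.0).pow(2).sum(), theta))
```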

3.2.2 Operator Sequence Generation

The role of the controller is to generate operator sequences with high scores during the searching loop. Let $\bm{\chi}_{\Phi}$ be a controller with model parameters $\Phi$; $\Phi$ is updated during the searching loop to increase the probability of generating good operator sequences. We use $\bm{e}\sim\bm{\chi}_{\Phi}$ to denote the process of sampling an $\bm{e}$ according to the controller $\bm{\chi}_{\Phi}$.

Treating the tree node values of $\mathcal{T}$ as random variables, the controller $\bm{\chi}_{\Phi}$ outputs probability mass functions $\bm{p}_{\Phi}^{1},\bm{p}_{\Phi}^{2},\cdots,\bm{p}_{\Phi}^{s}$ to characterize their distributions, where $s$ is the total number of nodes. Each tree node value $e_{j}$ is sampled from $\bm{p}_{\Phi}^{j}$ to obtain an operator. Then $\bm{e}:=(e_{1},e_{2},\cdots,e_{s})$ is the operator sequence sampled from $\bm{\chi}_{\Phi}$. See Fig. 2b for an illustration. Besides, we adopt the $\epsilon$-greedy strategy [48] to enhance the exploration of potentially high-score $\bm{e}$: with probability $\epsilon<1$, $e_{i}$ is sampled from a uniform distribution over the operator set, and with probability $1-\epsilon$, $e_{i}\sim\bm{p}_{\Phi}^{i}$. A larger $\epsilon$ leads to a higher probability of exploring new sequences.
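A minimal numpy sketch of this $\epsilon$-greedy sampling step follows; the function name, the toy operator sets, and the uniform controller outputs are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def sample_sequence(pmfs, operator_sets, eps=0.1, rng=None):
    """Sample an operator sequence e = (e_1, ..., e_s) with the epsilon-greedy rule.

    pmfs[j]          : probability mass function output by the controller for node j
    operator_sets[j] : list of admissible operators for node j (binary or unary set)
    """
    rng = np.random.default_rng() if rng is None else rng
    e = []
    for p, ops in zip(pmfs, operator_sets):
        if rng.random() < eps:                 # explore: uniform over the operator set
            e.append(rng.choice(ops))
        else:                                  # exploit: sample from the controller's pmf
            e.append(rng.choice(ops, p=p))
    return e

# Toy usage with the depth-3 tree of Fig. 2b: three unary nodes and one binary node.
B = ["+", "-", "*"]
U = ["Id", "sin", "exp", "square"]
pmfs = [np.full(len(U), 1 / len(U))] * 3 + [np.full(len(B), 1 / len(B))]
print(sample_sequence(pmfs, [U, U, U, B]))
```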

3.2.3 Controller Update

The goal of the controller update is to guide the controller toward generating high-score operator sequences $\bm{e}$. The updating rule of a controller can be designed based on heuristics (e.g., genetic and simulated annealing algorithms) or gradient-based methods (e.g., policy gradient and DARTS [32]). As a proof of concept, we introduce a policy-gradient-based updating rule from RL. The policy gradient method aims to maximize the return by optimizing a parameterized policy, and the controller in our problem plays the role of a policy.

In this paper, the controller $\bm{\chi}_{\Phi}$ is modeled as a neural network parameterized by $\Phi$. The training objective of the controller is to maximize the expected score of a sampled $\bm{e}$, i.e.,

(12) $\mathcal{J}(\Phi):=\mathbb{E}_{\bm{e}\sim\bm{\chi}_{\Phi}}S(\bm{e}).$

Taking the derivative of (12) with respect to $\Phi$, we have

(13) $\nabla_{\Phi}\mathcal{J}(\Phi)=\mathbb{E}_{\bm{e}\sim\bm{\chi}_{\Phi}}\big\{S(\bm{e})\sum_{i=1}^{s}\nabla_{\Phi}\log(\bm{p}_{\Phi}^{i}(e_{i}))\big\},$

where $\bm{p}_{\Phi}^{i}(e_{i})$ is the probability corresponding to the sampled $e_{i}$. When the batch size is $N$ and $\{\bm{e}^{(1)},\bm{e}^{(2)},\cdots,\bm{e}^{(N)}\}$ are sampled under $\bm{\chi}_{\Phi}$ each time, the expectation can be approximated by

(14) $\nabla_{\Phi}\mathcal{J}(\Phi)\approx\frac{1}{N}\sum_{k=1}^{N}\big\{S(\bm{e}^{(k)})\sum_{i=1}^{s}\nabla_{\Phi}\log(\bm{p}_{\Phi}^{i}(e_{i}^{(k)}))\big\}.$

Next, the model parameter $\Phi$ is updated via gradient ascent with a learning rate $\eta$, i.e.,

(15) $\Phi\leftarrow\Phi+\eta\nabla_{\Phi}\mathcal{J}(\Phi).$

The objective in (12) helps to improve the average score of generated sequences. In our problem, however, the goal is to find the $\bm{e}$ with the best score. To increase the probability of obtaining the best case, the objective function proposed in [36] is applied to seek the optimal solution via

(16) $\mathcal{J}(\Phi)=\mathbb{E}_{\bm{e}\sim\bm{\chi}_{\Phi}}\{S(\bm{e})\,|\,S(\bm{e})\geq S_{\nu,\Phi}\},$

where $S_{\nu,\Phi}$ represents the $(1-\nu)\times 100\%$-quantile of the score distribution generated by $\bm{\chi}_{\Phi}$. In a discrete form, the gradient computation becomes

(17) $\nabla_{\Phi}\mathcal{J}(\Phi)\approx\frac{1}{N}\sum_{k=1}^{N}\big\{(S(\bm{e}^{(k)})-\hat{S}_{\nu,\Phi})\mathbbm{1}_{\{S(\bm{e}^{(k)})\geq\hat{S}_{\nu,\Phi}\}}\sum_{i=1}^{s}\nabla_{\Phi}\log(\bm{p}_{\Phi}^{i}(e_{i}^{(k)}))\big\},$

where $\mathbbm{1}$ is an indicator function that takes the value $1$ if the condition is true and $0$ otherwise, and $\hat{S}_{\nu,\Phi}$ is the $(1-\nu)$-quantile of the scores $\{S(\bm{e}^{(i)})\}_{i=1}^{N}$.
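A hedged PyTorch sketch of one update step based on (17) follows. It assumes the controller's parameters $\Phi$ are wrapped in `optimizer`, that `log_probs[k, i]` holds $\log\bm{p}_{\Phi}^{i}(e_{i}^{(k)})$ with gradients attached, and that `scores[k]` holds $S(\bm{e}^{(k)})$ as constants; the function name is ours.

```python
import torch

def risk_seeking_update(optimizer, log_probs, scores, nu=0.5):
    """One policy-gradient step of Eqn. (17) (risk-seeking variant of REINFORCE).

    log_probs : tensor (N, s) of log-probabilities of the sampled node values (differentiable)
    scores    : tensor (N,) of scores S(e^(k)), treated as constants
    nu        : only sequences above the (1 - nu)-quantile contribute to the update
    """
    threshold = torch.quantile(scores, 1.0 - nu)
    advantage = (scores - threshold).clamp(min=0.0)           # zero below the quantile
    # Maximize J, i.e., minimize its negation with a standard optimizer.
    loss = -(advantage.detach() * log_probs.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return threshold.item()
```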

3.2.4 Candidate Optimization

As introduced in Sec. 3.2.1, the score of $\bm{e}$ is based on the optimization of a nonconvex function from a random initialization. The optimization may therefore get stuck at poor local minimizers, and the score sometimes may not reflect whether $\bm{e}$ reveals the structure of the true solution; the operator sequence $\bm{e}$ corresponding (exactly or approximately) to the true solution may not have the best score. To avoid missing good operator sequences, a candidate pool $\mathbb{P}$ with capacity $K$ is maintained to store several high-score $\bm{e}$'s.

During the search loop, if the size of $\mathbb{P}$ is less than the capacity $K$, $\bm{e}$ is added to $\mathbb{P}$. If the size of $\mathbb{P}$ has reached $K$ and $S(\bm{e})$ is larger than the smallest score in $\mathbb{P}$, then $\bm{e}$ is appended to $\mathbb{P}$ and the element with the smallest score is removed. After the search loop, for every $\bm{e}\in\mathbb{P}$, the objective function $\mathcal{L}(u(\cdot;\mathcal{T},\bm{e},\bm{\theta}))$ is optimized over $\bm{\theta}$ using a first-order algorithm with a small learning rate for $T_{3}$ iterations.
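The pool logic above amounts to keeping the $K$ best-scoring sequences seen so far. A minimal Python sketch using a min-heap (the class name and structure are illustrative, not the paper's code) is:

```python
import heapq
import random

class CandidatePool:
    """Fixed-capacity pool keeping the K highest-scoring operator sequences."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.heap = []               # entries (score, counter, e); smallest score on top
        self._counter = 0            # tie-breaker so sequences are never compared directly

    def update(self, score, e):
        entry = (score, self._counter, e)
        self._counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
        elif score > self.heap[0][0]:        # better than the worst stored candidate
            heapq.heapreplace(self.heap, entry)

    def candidates(self):
        return [e for _, _, e in sorted(self.heap, reverse=True)]

# Toy usage: keep the 3 best of 10 randomly scored sequences.
pool = CandidatePool(capacity=3)
for i in range(10):
    pool.update(random.random(), f"e_{i}")
print(pool.candidates())
```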

4 Numerical Examples

Numerical results will be provided to demonstrate the effectiveness of our FEX implementation introduced in Sec. 3.2 using two classical PDE problems: high-dimensional PDEs with constraints (such as Dirichlet boundary conditions and integration constraints) and eigenvalue problems. The computational tools for high-dimensional problems are very limited and NNs are probably the most popular ones. Therefore, FEX will be compared with NN-based solvers. Through our examples, the goal is to numerically demonstrate that:

  • Accuracy. FEX can achieve high and even machine accuracy for high-dimensional problems, while NN-based solvers can only achieve accuracy on the order of $\mathcal{O}(10^{-4})$ to $\mathcal{O}(10^{-2})$. Furthermore, we demonstrate the effectiveness of our RL-based approach by comparing its performance to a solution developed using genetic programming (GP) [35].

  • Scalability. FEX is scalable in the problem dimension with almost constant accuracy and a low memory requirement, i.e., the accuracy of FEX remains essentially the same as the dimension grows, while the accuracy of NN-based solvers deteriorates as the dimension becomes larger.

  • Interpretability. FEX provides interpretable insights of the ground truth PDE solution and helps to design postprocessing techniques for a refined solution.

In particular, to show the benefit of interpretability, we will provide examples showing that the explicit formulas of FEX solutions help to design better NN parametrizations in NN-based solvers to achieve higher accuracy. The FEX-aided NN-based solvers are referred to as FEX-NN in this paper. Finally, we will show the convergence of FEX with the growth of the tree size when the true solution cannot be exactly reproduced by finite expressions using the available operators and a binary tree. All results in this section are obtained from 6 independent experiments to compute their statistics.

4.1 Experimental Setting

This part provides the settings of FEX and the NN-based solvers. The accuracy of a numerical solution $\tilde{u}$ compared with the true solution $u$ is measured by the relative $L^{2}$ error, i.e., $\|\tilde{u}-u\|_{L^{2}(\Omega)}/\|u\|_{L^{2}(\Omega)}$. The integral in the $L^{2}$ norm is estimated by Monte Carlo integration for high-dimensional problems.
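For reference, a minimal numpy sketch of this Monte Carlo estimate of the relative $L^{2}$ error follows; the function name, the sample size, and the toy candidate are illustrative assumptions. Note that the volume factor of $\Omega$ cancels in the ratio, so sample means suffice.

```python
import numpy as np

def relative_l2_error(u_hat, u_true, sample_domain, n=1_000_000, seed=0):
    """Monte Carlo estimate of ||u_hat - u_true||_{L^2} / ||u_true||_{L^2} over Omega."""
    rng = np.random.default_rng(seed)
    x = sample_domain(n, rng)                              # (n, d) points in Omega
    diff2 = (u_hat(x) - u_true(x)) ** 2
    return np.sqrt(np.mean(diff2) / np.mean(u_true(x) ** 2))

# Toy usage on Omega = [-1, 1]^10 with the quadratic solution of Sec. 4.2.1.
d = 10
sample = lambda n, rng: rng.uniform(-1, 1, size=(n, d))
u_true = lambda x: 0.5 * np.sum(x ** 2, axis=1)
u_hat = lambda x: 0.5 * np.sum(x ** 2, axis=1) + 1e-4      # a slightly perturbed candidate
print(relative_l2_error(u_hat, u_true, sample))
```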

Implementation of FEX. The depth-3 binary tree (Fig. 2b) with 3 unary operators and 1 binary operator is used to generate mathematical expressions. The binary set is $\mathbb{B}=\{+,-,\times\}$ and the unary set is $\mathbb{U}=\{0,1,\text{Id},(\cdot)^{2},(\cdot)^{3},(\cdot)^{4},\exp,\sin,\cos\}$. A fully connected NN is used as the controller $\bm{\chi}_{\Phi}$ with constant input (see Appendix 6.3 for more details). The output size of the controller NN is $n_{1}|\mathbb{B}|+n_{2}|\mathbb{U}|$, where $n_{1}=1$ and $n_{2}=3$ represent the numbers of binary and unary operators, respectively, and $|\cdot|$ denotes the cardinality of a set.

There are four main parts in the implementation of FEX as introduced in Sec. 3.2. We briefly describe the key numerical choices here. (1) Score computation. The parameters are optimized first by Adam with a learning rate of 0.001 for $T_{1}=20$ iterations and then by BFGS with a learning rate of 1 for at most $T_{2}=20$ iterations. (2) Operator sequence generation. The depth-3 binary tree (Fig. 2b), the operator sets $\mathbb{B}$ and $\mathbb{U}$, and the fully connected controller NN described above are used to generate mathematical expressions. (3) Controller update. The batch size for the policy gradient update is $N=10$ and the controller is trained for 1000 iterations using Adam with a fixed learning rate of 0.002. In the "Numerical Convergence" experiments, where deeper trees are used, the controller is updated for 5000 iterations for trees of different depths. We adopt the $\epsilon$-greedy strategy to increase the exploration of new $\bm{e}$; the probability $\epsilon$ of sampling an $e_{i}$ at random is 0.1. (4) Candidate optimization. The candidate pool capacity is set to $K=10$. For any $\bm{e}\in\mathbb{P}$, the parameters $\bm{\theta}$ are optimized using Adam with an initial learning rate of 0.01 for $T_{3}=20{,}000$ iterations. The learning rate decays according to the cosine decay schedule [23].

Implementation of NN-based Solvers. Residual networks (ResNets) [22, 14] $u(\bm{x};\bm{\Theta})$ parameterized by $\bm{\Theta}$ are used to approximate a solution, and the minimization problem $\min_{\bm{\Theta}}\mathcal{L}(u(\cdot;\bm{\Theta}))$ is solved to identify a numerical solution [47, 49, 14].

The ResNet maps from $\mathbb{R}^{d}$ to $\mathbb{R}^{1}$ and consists of seven fully connected layers with three skip connections. Each hidden layer contains 50 neurons. The neural network is optimized using the Adam optimizer with an initial learning rate of 0.001 for 15,000 iterations. The learning rate is decayed following a cosine decay schedule.

Poisson equation. The coefficient $\lambda$ in the functional (3) is 100. The batch size for the interior and boundary is 5,000 and 1,000, respectively. In the NN method, we use the ReLU$^{2}$ activation, i.e., $(\max\{x,0\})^{2}$, in the ResNet to approximate the solution.

Linear conservation law. The coefficient $\lambda$ in the functional (20) is 100. In the NN method, we use the ReLU activation $\max\{x,0\}$, and the batch size for the interior and boundary is 5,000 and 1,000, respectively. In FEX, we use the same batch sizes as the NN method, except that we increase the batch size to 20,000 for the interior and 4,000 for the boundary when the dimension is at least 36.

Schrödinger equation. The coefficient $\lambda$ in the functional (23) is 1. The batch sizes for estimating the first and second terms of (23) are 2,000 and 10,000, respectively. Besides, ReLU$^{2}$ is used in the ResNet.

Implementation of GP. GP is an evolutionary algorithm used for automated symbolic regression and the evolution of computer programs to solve complex problems. Through a process of selection, crossover, and mutation, GP evolves and refines programs over generations to optimize a specific fitness or objective function. We utilize GP to optimize the CO (1) directly using the Python package GPlearn [1].

4.2 High-dimensional PDEs

Several numerical examples of high-dimensional PDEs, including linear and nonlinear cases, are provided here. In these tests, the true solutions of the PDEs have explicit formulas that can be reproduced by the binary tree defined in Sec. 3.1 and the sets $\mathbb{B},\mathbb{U}$ defined in Sec. 4.1.

Fig. 4: Distribution of the node values and the error comparison. In FEX, we search for the optimal sequence of node values and show the frequency of the node values of the binary tree in the candidate pool, consisting of the root (Root), the middle node (Middle), and two leaves (Leaf 1 and Leaf 2). Based on the observed distribution, we readily design the new NN parameterization (FEX-NN) to estimate the solution. The last column displays the comparison of the relative $L^{2}$ error as a function of the dimension over 6 independent trials between FEX, NN, FEX-NN and genetic programming (GP) for various high-dimensional PDE problems. Rows (a), (b) and (c) represent the results for the Poisson equation (18), the linear conservation law (19) and the nonlinear Schrödinger equation (21), respectively. For various dimensions, FEX identifies the true solution, yielding approximate solutions with nearly machine accuracy.

4.2.1 Poisson Equation

We consider a Poisson equation [14] with a Dirichlet boundary condition on a $d$-dimensional domain $\Omega=[-1,1]^{d}$,

(18) $-\Delta u=f \text{ for }\bm{x}\in\Omega,\quad u=g \text{ for }\bm{x}\in\partial\Omega.$

Let the true solution be $\frac{1}{2}\sum_{i=1}^{d}x_{i}^{2}$; then $f$ becomes the constant function $-d$. The functional $\mathcal{L}$ defined by the least-square error (3) is applied in the NN-based solver and FEX to seek the PDE solution for various dimensions ($d=10$, $20$, $30$, $40$ and $50$).

Fig. 5: Relative $L^{2}$ error of solutions estimated by trees of increasing depth for problems of various dimensions. We exclude the square operator $(\cdot)^{2}$ from the unary set $\mathbb{U}$, so the binary tree defined in Sec. 3.1 cannot reproduce the true solution (the sum of squares of the coordinates) exactly in the Poisson equation example. We found that smaller errors can be obtained with larger tree sizes.
Fig. 6: Eigenvalue problems and postprocessing algorithm design with FEX. Bottom left: We observed that the exponential operator "$\exp(\cdot)$" dominates the tree root in the FEX searching loop. Based on this observation, we assume the solution is of the form $\exp(v(\bm{x}))$ and simplify the original PDE to a new PDE that avoids the trivial solution. Bottom right: The NN-based method produces a large error in the eigenvalue estimate, especially when the dimension is high ($d=10$). With the FEX-based postprocessing algorithm, we can identify the eigenvalue with an error close to zero.

4.2.2 Linear Conservation Law

The linear conservation law [9] is considered here on the domain $T\times\Omega=[0,1]\times[-1,1]^{d}$ with the initial value problem

(19) $\frac{\pi d}{4}u_{t}-\sum_{i=1}^{d}u_{x_{i}}=0 \text{ for }\bm{x}=(x_{1},\cdots,x_{d})\in\Omega,\ t\in[0,1],\quad u(0,\bm{x})=\sin(\frac{\pi}{4}\sum_{i=1}^{d}x_{i}) \text{ for }\bm{x}\in\Omega,$

where the true solution is $u(t,\bm{x})=\sin(t+\frac{\pi}{4}\sum_{i=1}^{d}x_{i})$, and $d=5$, $11$, $17$, $23$, $29$, $35$, $41$, $47$ and $53$ in our tests. The functional $\mathcal{L}$ used in the NN-based solver and FEX to identify the solution is defined by

(20) $\mathcal{L}(u):=\|\frac{\pi d}{4}u_{t}-\sum_{i=1}^{d}u_{x_{i}}\|_{L^{2}(T\times\Omega)}^{2}+\lambda\|u(0,\bm{x})-\sin(\frac{\pi}{4}\sum_{i=1}^{d}x_{i})\|_{L^{2}(\Omega)}^{2}.$

4.2.3 Nonlinear Schrödinger Equation

We consider a nonlinear Schrödinger equation [20] with a cubic term on a $d$-dimensional domain $\Omega=[-1,1]^{d}$,

(21) $-\Delta u+u^{3}+Vu=0 \text{ for }\bm{x}\in\Omega,$

where $V(\bm{x})=-\frac{1}{9}\exp(\frac{2}{d}\sum_{i=1}^{d}\cos x_{i})+\sum_{i=1}^{d}(\frac{\sin^{2}x_{i}}{d^{2}}-\frac{\cos x_{i}}{d})$ for $\bm{x}=(x_{1},\cdots,x_{d})$. We let $\hat{u}(\bm{x})=\exp(\frac{1}{d}\sum_{j=1}^{d}\cos(x_{j}))/3$ be the solution of the PDE (21). To avoid the trivial zero solution, we apply different strategies during the score computation and candidate optimization phases. During the score computation phase, the norm of the candidate function $u$, i.e., $\|u\|_{L^{2}(\Omega)}$, is used to penalize functions that are close to zero via

(22) $\mathcal{L}_{1}(u):=\|-\Delta u+u^{3}+Vu\|_{L^{2}(\Omega)}^{2}/\|u\|_{L^{2}(\Omega)}^{3}.$

During the candidate optimization phase, an integration constraint is imposed on (21), i.e., $\int_{\Omega}u(\bm{x})d\bm{x}=\int_{\Omega}\hat{u}(\bm{x})d\bm{x}$. The functional $\mathcal{L}$ used in FEX to fine-tune the identified operator sequence is defined by

(23) $\mathcal{L}_{2}(u):=\|-\Delta u+u^{3}+Vu\|_{L^{2}(\Omega)}^{2}+\lambda\big(\int_{\Omega}u(\bm{x})d\bm{x}-\int_{\Omega}\hat{u}(\bm{x})d\bm{x}\big)^{2},$

where the second term imposes the integration constraint. The functional (23) is also used in the NN-based solver to approximate the PDE solution. Various dimensions are tested in the numerical results, e.g., $d=6$, $12$, $18$, $24$, $30$, $36$, $42$ and $48$. Note that we avoid using (23) in the score computation because the Monte Carlo error in the second term of (23) tends to be significant when the batch size is small, while using a large batch size leads to inefficient computation. Hence, for computational efficiency, we use (22) instead of (23) in the score computation.

4.3 Results

Three main sets of numerical results for the PDE problems above will be presented. First, the errors of the numerical solutions by NN-based solvers and FEX are compared. Second, a convergence test is analyzed when the tree size of FEX increases. Finally, FEX is applied to design special NN parametrization to solve PDEs in NN-based solvers.

Estimated Solution Error. The depth-3 binary tree (Fig. 3) is used in FEX with four nodes (a root node (R), a middle node (M), and two leaf nodes (L1 and L2)). Fig. 4 shows the operator distribution obtained by FEX and the error comparison between the NN method and our FEX. The results show that NN solutions have numerical errors between $\mathcal{O}(10^{-4})$ and $\mathcal{O}(10^{-2})$ and that the errors grow with the problem dimension $d$, agreeing with numerical observations in the literature. Meanwhile, FEX identifies the true solution structure for the Poisson equation and the linear conservation law with errors of order $10^{-7}$, reaching machine accuracy since single-precision floats are used. For the nonlinear Schrödinger equation, FEX identifies solutions of the form $\exp(\cos(\cdot))$ but achieves errors of order $10^{-5}$. Note that $\int_{\Omega}\hat{u}(\bm{x})d\bm{x}$ in (23) is estimated by Monte Carlo integration with millions of points as an accurate, precomputed constant, but $\int_{\Omega}u(\bm{x})d\bm{x}$ can only be estimated with a fixed and small batch size, typically less than 10,000, in the optimization iterations. As the dimension grows, the estimation error of $\int_{\Omega}u(\bm{x})d\bm{x}$ increases, and hence even the ground truth solution incurs an increasingly large error according to (23). Therefore, the optimization solver may return an approximate solution without machine accuracy. Designing a functional $\mathcal{L}$ free of the Monte Carlo error (e.g., Eqns. (3) and (20)) for the nonlinear Schrödinger equation could ensure machine accuracy. Furthermore, as shown in Fig. 4, GP tends to identify solutions of lower quality that fail to achieve a high level of accuracy.

Numerical Convergence. The numerical convergence analysis is performed using the Poisson equation as an example. Binary trees of depths 2, 3, 4, 6 and 8 are used (see Fig. 3). The square operator $(\cdot)^{2}$ is excluded from $\mathbb{U}$ so that the binary tree defined in Sec. 3.1 cannot reproduce the true solution (the sum of squares of the coordinates) exactly. This setting mimics the case of a complicated solution approximated with a relatively small binary tree. Fig. 5 shows the error distribution of FEX as the dimension grows and the tree depth changes. FEX obtains smaller errors with increasing tree size. Notice that, compared with the errors of NN-based solvers reported in Fig. 4, FEX achieves higher accuracy when a larger tree is used.

FEX-NN. FEX provides interpretable insights into the ground truth PDE solution through the operator distribution obtained from the searching loop. These insights can be used to design NN models with a special structure that increases the accuracy of NN-based solvers. In the results of the Poisson equation, we observe that the square operator has a high probability of appearing at the leaf nodes, which suggests that the true solution may contain the structure $\bm{x}^{2}:=(x_{1}^{2},\cdots,x_{d}^{2})$ at the input $\bm{x}$. As a result, we define the FEX-NN by $v(\bm{x}^{2};\bm{\Theta})$ for the Poisson equation. Similarly, we use the FEX-NNs $\sin(v(\bm{x};\bm{\Theta}))$ for the linear conservation law and $\exp(v(\bm{x}^{2};\bm{\Theta}))$ for the nonlinear Schrödinger equation. Fig. 4 shows the errors of FEX-NN as the dimension grows, and it is clear that FEX-NN outperforms the vanilla NN-based method by a significant margin.
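A hedged PyTorch sketch of the FEX-NN idea follows: a generic network $v(\cdot;\bm{\Theta})$ is wrapped with the structure suggested by the FEX operator distribution (here $\exp(v(\bm{x}^{2}))$ as for the nonlinear Schrödinger example). The class name is ours, and the small MLP backbone is a stand-in for the ResNet described in Sec. 4.1.

```python
import torch
import torch.nn as nn

class FEXNN(nn.Module):
    """Wrap a generic network v(.; Theta) with a FEX-suggested outer structure."""

    def __init__(self, d, width=50, square_input=True, outer=torch.exp):
        super().__init__()
        self.square_input = square_input
        self.outer = outer                      # e.g. torch.exp, torch.sin, or identity
        self.v = nn.Sequential(nn.Linear(d, width), nn.ReLU(),
                               nn.Linear(width, width), nn.ReLU(),
                               nn.Linear(width, 1))

    def forward(self, x):
        z = x ** 2 if self.square_input else x  # feed x^2 = (x_1^2, ..., x_d^2) when suggested
        return self.outer(self.v(z))

model = FEXNN(d=10)
print(model(torch.rand(4, 10)).shape)           # torch.Size([4, 1])
```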

4.4 Eigenvalue Problem

Consider identifying the smallest eigenvalue $\gamma$ and the associated eigenfunction $u$ of the eigenvalue problem [14],

(24) $-\Delta u+wu=\gamma u,\quad\bm{x}\in\Omega,\quad\text{with}\ u=0\ \text{on}\ \partial\Omega.$

The minimization of the Rayleigh quotient $\mathcal{I}(u)=\frac{\int_{\Omega}\|\nabla u\|_{2}^{2}dx+\int_{\Omega}wu^{2}dx}{\int_{\Omega}u^{2}dx}$, subject to $u|_{\partial\Omega}=0$, gives the smallest eigenvalue and the corresponding eigenfunction. In the NN-based solver [14], the following functional is defined

(25) $\mathcal{L}(u):=\mathcal{I}(u)+\lambda_{1}\int_{\partial\Omega}u^{2}dx+\lambda_{2}\big(\int_{\Omega}u^{2}dx-1\big)^{2}$

to seek an NN solution. Considering the example of $w=\|\bm{x}\|_{2}^{2}$ and $\Omega=\mathbb{R}^{d}$, the smallest eigenvalue of (24) is $d$ and the associated eigenfunction is $\exp(-\frac{\|\bm{x}\|_{2}^{2}}{0.5})$. The domain $\Omega$ is truncated from $\mathbb{R}^{d}$ to $[-3,3]^{d}$ for simplification, as done in [14].

In FEX, the functional (25) is also used to estimate a solution. FEX discovers a high probability of the "$\exp$" operator appearing at the tree root ($100\%$ for $d=2$, $4$, $6$, $8$, and $93.3\%$ for $d=10$, as shown in Fig. 6). Therefore, it is reasonable to assume that the eigenfunction is of the form $\exp(v(\bm{x}))$. Letting $u(\bm{x})=\exp(v(\bm{x}))$, Eqn. (24) simplifies to

(26) $-\Delta v-\|\nabla v\|_{2}^{2}+\|\bm{x}\|_{2}^{2}=\gamma.$

Eqn. (26) does not admit the trivial zero solution, so we can avoid the integration constraints used in Eqns. (23) and (25), which introduce Monte Carlo errors. Using Eqn. (26) and the Rayleigh quotient $\mathcal{I}$, the values of $v$ and $\gamma$ are alternately updated until convergence. The details of this iterative algorithm are presented in Appx. 6.4. Fig. 6 shows the relative absolute error of the estimated eigenvalues as the dimension grows. Directly optimizing (25) with the NN method produces a large error in the eigenvalue estimate, especially in high dimensions (e.g., the relative error reaches 20% when $d=10$). With the FEX-based postprocessing algorithm, we can identify the eigenvalue with an error close to zero.
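As a rough illustration only (the actual algorithm is given in Appx. 6.4), the alternating update described above could be organized as in the following hedged PyTorch sketch: with $\gamma$ fixed, the parameters of $v$ are fit to the residual of (26), and $\gamma$ is then refreshed from the Rayleigh quotient of $\exp(v)$. The function names and the two callables are assumptions.

```python
import torch

def alternate_eigen_solve(residual_loss, rayleigh_quotient, theta, gamma0,
                          outer_steps=50, inner_steps=200, lr=1e-3):
    """Alternating sketch: fit v to the residual of (26), then refresh gamma.

    residual_loss(theta, gamma) : Monte Carlo estimate of ||-Lap(v) - |grad v|^2 + |x|^2 - gamma||^2
    rayleigh_quotient(theta)    : estimate of I(exp(v(.; theta)))
    theta                       : parameters of v (list of tensors with requires_grad=True)
    """
    gamma = gamma0
    opt = torch.optim.Adam(theta, lr=lr)
    for _ in range(outer_steps):
        for _ in range(inner_steps):          # update v with gamma held fixed
            opt.zero_grad()
            loss = residual_loss(theta, gamma)
            loss.backward()
            opt.step()
        with torch.no_grad():                 # refresh gamma from the Rayleigh quotient
            gamma = float(rayleigh_quotient(theta))
    return gamma, theta
```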

5 Discussion and perspectives

In this paper, we proposed the finite expression method, a methodology for finding PDE solutions as simple mathematical expressions. Our theory showed that mathematical expressions can overcome the curse of dimensionality. We provided one implementation that represents mathematical expressions with trees and solves the resulting combinatorial optimization in FEX using reinforcement learning. Our results demonstrated the effectiveness of FEX in achieving high, even machine-level, accuracy on various high-dimensional PDEs, while existing solvers suffered from low accuracy in comparison.

While CO is inherently a complex challenge, continued exploration can be highly advantageous in achieving improved performance, particularly when dealing with more intricate PDE problems within the framework of FEX.

  • Efficient computation of operator sequence score. We employed Formulation (11) as a proxy for assessing the quality of an operator sequence. However, it’s important to note that this formulation necessitates optimization over multiple steps, resulting in non-negligible computational costs that can impact the overall expense of solving the CO problem in FEX. Therefore, it is crucial to define a more efficient scoring method that reduces computational expenses and simplifies the identification of favorable operator sequences.

  • Design of the controller. As an illustrative example of CO solving in this work, we employed a straightforward fully connected network to model the distribution responsible for proposing operator selections. It’s worth noting that more sophisticated techniques can also be applied, such as recurrent neural networks [36], which take prior decisions as input and generate new decisions as output. These advanced methods can offer enhanced modeling capabilities and adaptability to the context of the problem.

Data and Code Availability

No data is generated in this work. Source codes for reproducing the results in this paper are available online at: https://github.com/LeungSamWai/Finite-expression-method. The source codes are released under the MIT license.

Acknowledgement

H. Yang was partially supported by the US National Science Foundation under awards DMS-2244988 and DMS-2206333, and the Office of Naval Research Young Investigator Award N00014-23-1-2007.

References

  • [1] Python package gplearn. https://gplearn.readthedocs.io/en/stable/.
  • [2] David J Acheson. Elementary fluid dynamics, 1991.
  • [3] Mordecai Avriel. Nonlinear programming: analysis and methods. Courier Corporation, 2003.
  • [4] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
  • [5] Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V Le. Neural optimizer search with reinforcement learning. In International Conference on Machine Learning, pages 459–468. PMLR, 2017.
  • [6] Simone Bianco, Remi Cadene, Luigi Celona, and Paolo Napoletano. Benchmark analysis of representative deep neural network architectures. IEEE Access, 6:64270–64277, 2018.
  • [7] Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198, 2019.
  • [8] Fan Chen, Jianguo Huang, Chunmei Wang, and Haizhao Yang. Friedrichs learning: Weak solutions of partial differential equations via deep learning. arXiv preprint arXiv:2012.08023, 2020.
  • [9] Jingrun Chen, Shi Jin, and Liyao Lyu. A deep learning based discontinuous galerkin method for hyperbolic equations with discontinuous solutions and random uncertainties. arXiv preprint arXiv:2107.01127, 2021.
  • [10] Wang Chi Cheung, Vincent Tan, and Zixin Zhong. A thompson sampling algorithm for cascading bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 438–447. PMLR, 2019.
  • [11] John D Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Quoc V Le, Sergey Levine, Honglak Lee, and Aleksandra Faust. Evolving reinforcement learning algorithms. In International Conference on Learning Representations, 2021.
  • [12] M. W. M. G. Dissanayake and N. Phan-Thien. Neural-network-based Approximations for Solving Partial Differential Equations. Comm. Numer. Methods Engrg., 10:195–201, 1994.
  • [13] Weinan E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.
  • [14] Weinan E and Bing Yu. The deep ritz method: a deep learning-based numerical algorithm for solving variational problems. Commun. Math. Stat., 6:1–12, 2018.
  • [15] L.C. Evans. Partial Differential Equations. Graduate studies in mathematics. American Mathematical Society, 2010.
  • [16] Richard P Feynman, Robert B Leighton, and Matthew Sands. The feynman lectures on physics; vol. i. American Journal of Physics, 33(9):750–752, 1965.
  • [17] Roger Fletcher. Practical methods of optimization. John Wiley & Sons, 2013.
  • [18] Christian Grossmann, Hans-Görg Roos, and Martin Stynes. Numerical treatment of partial differential equations, volume 154. Springer, 2007.
  • [19] J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci., 115(34):8505–8510, 2018.
  • [20] Jiequn Han, Jianfeng Lu, and Mo Zhou. Solving high-dimensional eigenvalue problems using deep neural networks: A diffusion monte carlo like approach. Journal of Computational Physics, 423:109792, 2020.
  • [21] Jiequn Han and E Weinan. Deep learning approximation for stochastic control problems. ArXiv, abs/1611.07422, 2016.
  • [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [23] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 558–567, 2019.
  • [24] Yuling Jiao, Yanming Lai, Xiliang Lu, and Zhijian Yang. Deep neural networks with relu-sine-exponential activations break curse of dimensionality on hölder class. 2021.
  • [25] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric pde problems with artificial neural networks. arXiv: Numerical Analysis, 2017.
  • [26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [27] John G Kirkwood, Robert L Baldwin, Peter J Dunlop, Louis J Gosting, and Gerson Kegeles. Flow equations and frames of reference for isothermal diffusion in liquids. The Journal of Chemical Physics, 33(5):1505–1513, 1960.
  • [28] Isaac E Lagaris, Aristidis Likas, and Dimitrios I Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE transactions on neural networks, 9(5):987–1000, 1998.
  • [29] Mikel Landajuela, Brenden K Petersen, Sookyung Kim, Claudio P Santiago, Ruben Glatt, Nathan Mundhenk, Jacob F Pettit, and Daniel Faissol. Discovering symbolic policies with deep reinforcement learning. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5979–5989. PMLR, 18–24 Jul 2021.
  • [30] Lev Davidovich Landau and Evgenii Mikhailovich Lifshitz. Quantum mechanics: non-relativistic theory, volume 3. Elsevier, 2013.
  • [31] Fu Li, Umberto Villa, Seonyeong Park, and Mark A. Anastasio. 3-d stochastic numerical breast phantoms for enabling virtual imaging trials of ultrasound computed tomography. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 69(1):135–146, 2022.
  • [32] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.
  • [33] Ziqi Liu, Wei Cai, and Zhi-Qin John Xu. Multi-scale deep neural network (mscalednn) for solving poisson-boltzmann equation in complex domains. Communications in Computational Physics, 28(5):1970–2001, Jun 2020.
  • [34] Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.
  • [35] David J Murray-Smith. Modelling and simulation of integrated systems in engineering: issues of methodology, quality, testing and application. Elsevier, 2012.
  • [36] Brenden K Petersen, Mikel Landajuela Larma, Terrell N. Mundhenk, Claudio Prata Santiago, Soo Kyung Kim, and Joanne Taery Kim. Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. In International Conference on Learning Representations, 2021.
  • [37] Jean Philibert. One and a half century of diffusion: Fick, einstein. Diffusion Fundamentals: Leipzig 2005, 1:8, 2005.
  • [38] M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686 – 707, 2019.
  • [39] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, 2018.
  • [40] JN Reddy. An introduction to the finite element method, volume 1221. McGraw-Hill New York, 2004.
  • [41] Basri Ronen, David Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. Advances in Neural Information Processing Systems, 32, 2019.
  • [42] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986.
  • [43] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Neural network approximation: Three hidden layers are enough. arXiv:2010.14075, 2020.
  • [44] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation: Achieving arbitrary accuracy with fixed number of neurons. ArXiv, abs/2107.02397, 2021.
  • [45] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network with approximation error being reciprocal of width to power of square root of depth. Neural Computation, 2021.
  • [46] Marvin Shinbrot. Lectures on fluid mechanics. Courier Corporation, 2012.
  • [47] Justin Sirignano and Konstantinos Spiliopoulos. Dgm: A deep learning algorithm for solving partial differential equations. Journal of computational physics, 375:1339–1364, 2018.
  • [48] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [49] E Weinan, Jiequn Han, and Arnulf Jentzen. Algorithms for solving high dimensional pdes: From nonlinear monte carlo to machine learning. Nonlinearity, 35(1):278, 2021.
  • [50] Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. Communications in Computational Physics, 28(5):1746–1767, 2020.
  • [51] Dmitry Yarotsky. Elementary superexpressive activations. ArXiv, abs/2102.10911, 2021.
  • [52] Yulei Liao and Pingbing Ming. Deep nitsche method: Deep ritz method with essential boundary conditions. Communications in Computational Physics, 29(5):1365–1384, 2021.
  • [53] Yaohua Zang, Gang Bao, Xiaojing Ye, and Haomin Zhou. Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics, 411:109409, 2020.

6 Appendix

6.1 Proofs of Theorems

Proof of Thm. 4.

By Thm. 1.1 of Shen et al. (2021) [44], for any continuous function $f\in C([a,b]^{d})$, there exists a fully connected neural network (FNN) $\phi$ with width $N=36d(2d+1)$ and depth $L=11$ (i.e., $11$ hidden layers) such that, for an arbitrary $\varepsilon>0$, $\|\phi-f\|_{L^{\infty}([a,b]^{d})}<\varepsilon$. This FNN is constructed via an activation function with the explicit formula $\sigma(x)=\sigma_{1}(x):=\big|x-2\lfloor\tfrac{x+1}{2}\rfloor\big|$ for $x\in[0,\infty)$ and $\sigma(x)=\sigma_{2}(x):=\frac{x}{|x|+1}$ for $x\in(-\infty,0)$; equivalently, $\sigma(x)=\frac{\operatorname{sign}(x)+1}{2}\sigma_{1}(x)-\frac{\operatorname{sign}(x)-1}{2}\sigma_{2}(x)$, so evaluating $\sigma(x)$ requires at most $18$ operators. For an FNN of width $N$ and depth $L$, evaluating one output requires $N(d+1)+(L-1)N^{2}$ operators "$\times$", $Nd-1+(L-1)N(N-1)$ operators "$+$", and $NL$ evaluations of $\sigma(x)$. Hence, the FNN $\phi$ is a mathematical expression with at most $k_{d}:=103680d^{4}+103824d^{3}+39600d^{2}+6804d-1=\mathcal{O}(d^{4})$ operators. Consequently, for any $\varepsilon>0$ and any continuous function $f$ on $[a,b]^{d}$, there is a $k_{d}$-finite expression that approximates $f$ uniformly on $[a,b]^{d}$ within $\varepsilon$ accuracy. Since $k_{d}$ is independent of $\varepsilon$, the function space of $k_{d}$-finite expressions is dense in $C([a,b]^{d})$. ∎
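For the reader's convenience, we note how the constant $k_{d}$ arises (a short verification added here, not part of the original argument): substituting $N=36d(2d+1)=72d^{2}+36d$ and $L=11$ into the counts above, and charging at most $18$ operators per evaluation of $\sigma$, gives

\begin{align*}
k_{d} &= \big[N(d+1)+(L-1)N^{2}\big]+\big[Nd-1+(L-1)N(N-1)\big]+18\,NL\\
      &= 20N^{2}+N(2d+1)+188N-1\\
      &= 103680d^{4}+103824d^{3}+39600d^{2}+6804d-1=\mathcal{O}(d^{4}).
\end{align*}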

Proof of Thm. 5.

By Cor. 3.8 of Jiao et al. (2021) [24], for $p\in[1,+\infty)$, any $f\in\mathcal{H}^{\alpha}_{\mu}([0,1]^{d})$, and any $\varepsilon>0$, there exists an FNN $\phi$ with width

$$N=\max\left\{2d\left\lceil\log_{2}\left(\sqrt{d}\left(\frac{3\mu}{\varepsilon}\right)^{1/\alpha}\right)\right\rceil,\ 2\left\lceil\log_{2}\frac{3\mu d^{\alpha/2}}{2\varepsilon}\right\rceil+2\right\}$$

and depth $L=6$ such that $\|\phi-f\|_{L^{p}([0,1]^{d})}<\varepsilon$. This FNN is constructed with activation functions chosen from the set $\{\sin(x),\max\{0,x\},2^{x}\}$. As in the proof of Thm. 4, evaluating one output of the FNN requires $N(d+1)+(L-1)N^{2}$ operators "$\times$", $Nd-1+(L-1)N(N-1)$ operators "$+$", and $NL$ evaluations of activation functions. Therefore, the total number of operators in $\phi$ as a mathematical expression is $\mathcal{O}\big(d^{2}(\log d+\log\frac{1}{\varepsilon})^{2}\big)$, which completes the proof. ∎
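As a brief sanity check (added here for completeness): the width above satisfies $N=\mathcal{O}\big(d(\log d+\log\frac{1}{\varepsilon})\big)$ while the depth $L=6$ is a constant, so the operator count is dominated by the $(L-1)N^{2}$ term,

$$N(d+1)+(L-1)N^{2}+Nd-1+(L-1)N(N-1)+NL=\mathcal{O}(N^{2}+Nd)=\mathcal{O}\Big(d^{2}\big(\log d+\log\tfrac{1}{\varepsilon}\big)^{2}\Big).$$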

6.2 Algorithm to solve CO with reinforcement learning

We include pseudo-code for the proposed FEX implementation in Alg. 1 and Alg. 2. Alg. 1 outlines the procedure for solving the CO problem with a fixed, predefined tree, while Alg. 2 progressively expands the tree to increase its expressiveness, invoking Alg. 1 at each stage in search of an improved solution. A minimal code sketch of the search loop in Alg. 1 is given after Alg. 2.

Algorithm 1 FEX with a fixed tree

Input: PDE and the associated functional $\mathcal{L}$; a tree $\mathcal{T}$; search loop iterations $T$; coarse-tune iterations $T_{1}$ with Adam; coarse-tune iterations $T_{2}$ with BFGS; fine-tune iterations $T_{3}$ with Adam; pool size $K$; batch size $N$.

Output: The solution $u(\bm{x};\mathcal{T},\hat{\bm{e}},\hat{\bm{\theta}})$.

1:Initialize the agent $\bm{\chi}$ for the tree $\mathcal{T}$
2:$\mathbb{P}\leftarrow\{\}$
3:for $\_$ from $1$ to $T$ do
4:     Sample $N$ sequences $\{\bm{e}^{(1)},\bm{e}^{(2)},\cdots,\bm{e}^{(N)}\}$ from $\bm{\chi}$
5:     for $n$ from $1$ to $N$ do
6:         Optimize $\mathcal{L}(u(\bm{x};\mathcal{T},\bm{e}^{(n)},\bm{\theta}))$ through coarse-tuning over $T_{1}+T_{2}$ iterations to get the score $S(\bm{e}^{(n)})$
7:         if $\bm{e}^{(n)}$ belongs to the top-$K$ of all scores in $\mathbb{P}$ then
8:              $\mathbb{P}$.append($\bm{e}^{(n)}$)
9:              $\mathbb{P}$ pops the $\bm{e}$ with the smallest score when $|\mathbb{P}|>K$
10:         end if
11:     end for
12:     Update $\bm{\chi}$ using (17)
13:end for
14:for $\bm{e}$ in $\mathbb{P}$ do
15:     Fine-tune $\mathcal{L}(u(\bm{x};\mathcal{T},\bm{e},\bm{\theta}))$ with $T_{3}$ iterations.
16:end for
17:return the expression with the smallest fine-tune error.
Algorithm 2 FEX with progressively expanding trees

Input: Tree set $\{\mathcal{T}_{1},\mathcal{T}_{2},\cdots\}$; error tolerance $\epsilon$.

Output: The solution $u(\bm{x};\hat{\mathcal{T}},\tilde{\bm{e}},\tilde{\bm{\theta}})$.

1:for $\mathcal{T}$ in $\{\mathcal{T}_{1},\mathcal{T}_{2},\cdots\}$ do
2:     Initialize the agent $\bm{\chi}$ for the tree $\mathcal{T}$
3:     Obtain $u(\bm{x};\mathcal{T},\hat{\bm{e}},\hat{\bm{\theta}})$ from Algorithm 1
4:     if $\mathcal{L}(u(\cdot;\mathcal{T},\hat{\bm{e}},\hat{\bm{\theta}}))\leq\epsilon$ then
5:         Break
6:     end if
7:end for
8:return the expression with the smallest functional value.
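
To make the control flow of Alg. 1 concrete, the following minimal Python sketch mirrors its search loop. It is an illustration only: controller, coarse_tune, and fine_tune are hypothetical placeholders for the agent $\bm{\chi}$, the scoring by coarse-tuning with Adam/BFGS, and the fine-tuning stage described in the main text, and the update rule (17) is abstracted into controller.update.

import heapq, itertools

def fex_fixed_tree(tree, functional, T, T1, T2, T3, K, N, controller):
    # Minimal sketch of Alg. 1; controller, coarse_tune, and fine_tune are
    # hypothetical placeholders for components described in the text.
    pool, tie = [], itertools.count()            # candidate pool P, kept at size <= K
    for _ in range(T):
        sequences = controller.sample(N)         # step 4: sample e^(1), ..., e^(N) from chi
        scores = []
        for e in sequences:
            # step 6: coarse-tune theta (T1 Adam + T2 BFGS iterations) and record the score S(e)
            score, theta = coarse_tune(functional, tree, e, T1, T2)
            scores.append(score)
            heapq.heappush(pool, (score, next(tie), e, theta))   # steps 7-9: keep the top-K scores
            if len(pool) > K:
                heapq.heappop(pool)              # discard the candidate with the smallest score
        controller.update(sequences, scores)     # step 12: policy-gradient update, cf. (17)
    # steps 14-17: fine-tune every pooled candidate and return the smallest fine-tuned error
    finals = [(fine_tune(functional, tree, e, theta, T3), e) for _, _, e, theta in pool]
    return min(finals, key=lambda r: r[0])

Here a min-heap keyed by the score is one simple way to keep the pool $\mathbb{P}$ at size at most $K$; any top-$K$ data structure would serve equally well.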

6.3 Using a fully connected network to model a controller

We illustrate in Figure 7 the use of a fully connected neural network to model the controller, which outputs a probability mass function for each node of a tree. As an example, suppose the tree contains three nodes: $n_{1}=2$ nodes for the binary set with $|\mathbbm{B}|=2$ operators and $n_{2}=1$ node for the unary set with $|\mathbbm{U}|=3$ operators. Consequently, the output size of the neural network is $n_{1}|\mathbbm{B}|+n_{2}|\mathbbm{U}|=7$. This output is divided into $n_{1}+n_{2}$ parts, each corresponding to the probability mass function of its respective node. The input to the neural network is simply a constant vector. A minimal code sketch follows Fig. 7.

Refer to caption
Fig. 7: Illustration of using a fully connected neural network to model the controller that outputs the probability mass functions for each node within a tree. As an example, the tree contains three nodes: $n_{1}=2$ nodes for the binary set with $|\mathbbm{B}|=2$ operators and $n_{2}=1$ node for the unary set with $|\mathbbm{U}|=3$ operators.
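
The following PyTorch sketch shows one way such a controller could be realized; it is a minimal illustration under stated assumptions (the hidden width, the ReLU nonlinearity, the softmax normalization of each chunk, and the name Controller are illustrative choices, not the paper's released implementation).

import torch
import torch.nn as nn

class Controller(nn.Module):
    # Sketch of the controller chi described above; sizes follow the example
    # (n1 = 2 binary nodes with |B| = 2 operators, n2 = 1 unary node with |U| = 3).
    def __init__(self, n_binary=2, n_unary=1, n_bin_ops=2, n_un_ops=3, hidden=64):
        super().__init__()
        self.sizes = [n_bin_ops] * n_binary + [n_un_ops] * n_unary   # one chunk per tree node
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, sum(self.sizes)),   # output size n1*|B| + n2*|U| = 7 here
        )

    def forward(self):
        logits = self.net(torch.ones(1, 1))       # the input is simply a constant vector
        # split the output into one chunk per node and normalize each chunk into a pmf
        chunks = torch.split(logits, self.sizes, dim=-1)
        return [torch.softmax(c, dim=-1) for c in chunks]

# Example usage: sample one operator index per node from the controller's pmfs.
controller = Controller()
ops = [torch.multinomial(pmf.squeeze(0), 1).item() for pmf in controller()]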

6.4 Iterative method for eigenpairs

First, by solving Eqn. (25) with our FEX, we obtain an estimated eigenfunction $u(\bm{x};\mathcal{T},\hat{\bm{e}},\hat{\bm{\theta}})$ and an initial estimate of the eigenvalue through the Rayleigh quotient, $\gamma_{0}=\mathcal{I}(u(\cdot;\mathcal{T},\hat{\bm{e}},\hat{\bm{\theta}}))$. We then use Eqn. (26) to find the eigenpair iteratively. We define the functional for Eqn. (26) by

(27)    $\displaystyle\mathcal{L}_{2}(v,\gamma):=\big\|-\Delta v-\|\nabla v\|_{2}^{2}+\|\bm{x}\|_{2}^{2}-\gamma\big\|_{L_{2}(\Omega)}^{2}.$

Given $\gamma_{i}$, we aim to find a $v$, expressed as a mathematical expression, that minimizes $\mathcal{L}_{2}(v,\gamma_{i})$. Assuming $v$ is represented by a binary tree, i.e., $v:=v(\cdot;\mathcal{T},\bm{e},\bm{\theta})$, we search for the solution with our FEX via the optimization

(28)    $\displaystyle\bm{e}_{i}^{*},\bm{\theta}_{i}^{*}\approx\operatorname*{arg\,min}_{\bm{e},\bm{\theta}}\mathcal{L}_{2}(v(\cdot;\mathcal{T},\bm{e},\bm{\theta}),\gamma_{i}).$

Next, we compute the updated eigenvalue estimate $\gamma_{i+1}=\mathcal{I}(\exp(v(\cdot;\mathcal{T},\bm{e}_{i}^{*},\bm{\theta}_{i}^{*})))$.

Repeating this loop for $G$ iterations yields the eigenpair $\gamma_{G}$ and $\exp(v(\cdot;\mathcal{T},\bm{e}_{G}^{*},\bm{\theta}_{G}^{*}))$.
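
The outer loop above can be summarized by the short Python sketch below; fex_solve (solving (28) with FEX) and rayleigh_quotient (the estimator $\mathcal{I}$) are hypothetical placeholders for the routines described in the text.

import math

def eigenpair_iteration(gamma0, G=10):
    # Sketch of the iterative eigenpair refinement; fex_solve and rayleigh_quotient
    # are hypothetical placeholders, not an actual released implementation.
    gamma = gamma0                          # initial estimate gamma_0 from the Rayleigh quotient
    for _ in range(G):
        v = fex_solve(gamma)                # solve (28): returns the expression v(.; T, e_i*, theta_i*)
        u = lambda x, v=v: math.exp(v(x))   # candidate eigenfunction exp(v)
        gamma = rayleigh_quotient(u)        # gamma_{i+1} = I(exp(v))
    return gamma, u                         # approximate eigenpair after G iterations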

Implementation. In our iterative method, the number of outer iterations is $G=10$, and $\lambda_{1}=\lambda_{2}=500$ in (25). The batch size for estimating the first and third terms of (25) is 10,000, while that for the second (boundary) term is 2,000. ReLU2 is used in the ResNet, following [14].