
Newton Informed Neural Operator for Computing Multiple Solutions of Nonlinear Partial Differential Equations

Wenrui Hao
Department of Mathematics
The Pennsylvania State University, University Park,
State College, PA, USA
[email protected]
Xinliang Liu
Computer, Electrical and Mathematical Science and Engineering Division,
King Abdullah University of Science and Technology,
Thuwal, Saudi Arabia
[email protected]

Yahong Yang
Department of Mathematics
The Pennsylvania State University, University Park,
State College, PA, USA
[email protected]
Corresponding Authors
Abstract

Solving nonlinear partial differential equations (PDEs) with multiple solutions is essential in various fields, including physics, biology, and engineering. However, classical neural network methods for solving nonlinear PDEs, such as Physics-Informed Neural Networks (PINNs), the Deep Ritz method, and DeepONet, often face ill-posedness when the underlying problem admits multiple solutions. In this paper, we propose the Newton Informed Neural Operator, a novel approach that builds on existing neural network techniques to handle such nonlinearities. Our method combines neural operator learning with the classical Newton method, which is locally well-posed, and it learns multiple solutions in a single training process while requiring fewer supervised data points than existing neural network methods.

1 Introduction

Neural networks have been extensively applied to solve partial differential equations (PDEs) in various fields, including biology, physics, and materials science [17, 9]. While much of the existing work focuses on PDEs with a single solution, nonlinear PDEs with multiple solutions pose a significant challenge and are widely encountered in applications [1, 3, 27]. In this paper, we aim to solve the following nonlinear PDEs, which may admit multiple solutions:

\begin{cases}\mathcal{L}u(\bm{x})=f(u),&\bm{x}\in\Omega\\ u(\bm{x})=0,&\bm{x}\in\partial\Omega\end{cases} (1)

Here, $\Omega$ is the domain of the equation, $f(u)$ is a nonlinear function on $\mathbb{R}$, $u:\mathbb{R}^{d}\to\mathbb{R}$, and $\mathcal{L}$ is a second-order elliptic operator given by

\mathcal{L}u=-\sum_{i,j=1}^{d}a^{ij}(\bm{x})u_{x_{i}x_{j}}+\sum_{i=1}^{d}b^{i}(\bm{x})u_{x_{i}}+c(\bm{x})u,

for given coefficient functions $a^{ij},b^{i},c$ $(i,j=1,\ldots,d)$ with $\sum_{i,j=1}^{d}a^{ij}(\bm{x})\xi_{i}\xi_{j}\geq\lambda|\bm{\xi}|^{2}$ for a constant $\lambda>0$.

Various neural network methods have been developed to tackle partial differential equations (PDEs), including PINN [29], the Deep Ritz method [37], DeepONet [24], FNO [20], MgNO [11], and OL-DeepONet [22]. These methods can be broadly categorized into two types: function learning and operator learning approaches. In function learning, the goal is to directly learn the solution. However, these methods often encounter the limitation of only being able to learn one solution in each learning process. Furthermore, the problem becomes ill-posed when there are multiple solutions. On the other hand, operator learning aims to approximate the map between parameter functions in PDEs and the unique solution. This approach cannot address the issue of multiple solutions or find them in a single training session. We will discuss this in more detail in the next section.

In this paper, we present a novel neural network approach for solving nonlinear PDEs with multiple solutions. Our method is grounded in operator learning, allowing us to capture multiple solutions within a single training process and thus overcoming this limitation of function learning methods. Moreover, we enhance our network architecture by incorporating the traditional Newton method [31, 1], as discussed in the next section. This integration makes the operator learning problem well-defined, since the Newton method is locally well-posed and therefore yields a well-defined operator to learn. This approach addresses the challenges associated with directly applying operator networks to such problems. Additionally, we leverage Newton information during training, enabling our method to perform effectively even with limited supervised data points. We call our network the Newton Informed Neural Operator.

As mentioned earlier, our approach builds on the classical Newton method, which translates a nonlinear PDE into an iterative process that solves a linear equation at each iteration. One key advantage of our method is that, once the operator is effectively learned, there is no need to solve a linear equation in every iteration. This significantly reduces the time cost, especially in complex systems encountered in fields such as materials science, biology, and chemistry. Above all, our network, termed the Newton Informed Neural Operator, can efficiently handle nonlinear PDEs with multiple solutions. It tackles a well-posed problem, learns multiple solutions in a single learning process, requires fewer supervised data points than existing neural network methods, and saves time by eliminating the need to repeatedly assemble and solve the linearized systems as in traditional Newton methods.

The rest of the paper is organized as follows. In Section 2, we discuss nonlinear PDEs with multiple solutions and related work on solving PDEs with neural networks. In Section 3, we review the classical Newton method for solving PDEs and introduce the Newton Informed Neural Operator, which combines neural operators with the Newton method to address nonlinear PDEs with multiple solutions; we also analyze its approximation and generalization errors. Finally, in Section 4, we present numerical results of our neural networks for solving nonlinear PDEs.

2 Background and Related Works

2.1 Nonlinear PDEs with multiple solutions

Significant mathematical models depicting natural phenomena in biology, physics, and materials science are rooted in nonlinear partial differential equations (PDEs) [5]. These models, characterized by their inherent nonlinearity, present complex multi-solution challenges. Illustrative examples include string theory in physics, reaction-diffusion systems in chemistry, and pattern formation in biology [16, 10]. However, experimental techniques like synchrotronic and laser methods can only observe a subset of these multiple solutions. Thus, there is an urgent need to develop computational methods to unravel these nonlinear models, offering deeper insights into the underlying physics and biology [12]. Consequently, efficient numerical techniques for identifying these solutions are pivotal in understanding these intricate systems. Despite recent advancements in numerical methods for solving nonlinear PDEs, significant computational challenges persist for large-scale systems. Specifically, the computational costs of employing Newton and Newton-like approaches are often prohibitive for the large-scale systems encountered in real-world applications. In response to these challenges [15], we propose an operator learning approach based on Newton’s method to efficiently compute multiple solutions of nonlinear PDEs.

2.2 Related works

Indeed, there are numerous approaches to solving partial differential equations (PDEs) using neural networks. Broadly speaking, these methods can be categorized into two main types: function learning and operator learning.

In function learning, neural networks are used to directly approximate the solutions to PDEs. Function learning approaches aim to directly learn the solution function itself. On the other hand, in operator learning, the focus is on learning the operator that maps input parameters to the solution of the PDE. Instead of directly approximating the solution function, the neural network learns the underlying operator that governs the behavior of the system.

Function learning methods

In function learning, commonly employed methods for this problem include Physics-Informed Neural Network (PINN)-based approaches, introduced by Raissi et al. [29], and the Deep Ritz method [37]. However, for these methods the task becomes particularly challenging due to the ill-posed nature of the problem arising from multiple solutions. Despite employing various initial data and training methods, attaining high accuracy remains a complex endeavor. Even when a highly accurate solution is achieved, each learning process typically discovers only one solution. The specific solution learned by the neural network is heavily influenced by the initial conditions and training methods employed, and discerning the relationship between these factors and the learned solution remains a daunting task. In [14], the authors introduce a method called HomPINNs for learning PDEs with multiple solutions. In their approach, the number of solutions that can be learned depends on the availability of supervised data; for solutions with insufficient supervised data, whether HomPINNs can successfully learn them remains unclear.

Operator learning methods

Various approaches have been developed for operator learning to solve PDEs, including DeepONet [24], which integrates physical information [7, 22], as well as techniques like FNO [20] inspired by spectral methods, and MgNO [11], HANO [23], and WNO [21] based on multilevel methods, and transformer-based neural operators [4, 8]. These methods focus on approximating the operator between the parameters and the solutions. Firstly, they require the solutions of PDEs to be unique; otherwise, the operator is not well-defined. Secondly, they focus on the relationship between the parameter functions and the solution, rather than the initial data and multiple solutions.

3 Newton Informed Neural Operator

3.1 Review of Newton Methods to Solve Nonlinear Partial Differential Equations

To tackle problem (1), we employ Newton's method by linearizing the equation. Given an initial guess $u_{0}(\bm{x})$, for $i=1,2,\ldots,M$, the $i$-th iteration of Newton's method computes $\tilde{u}(\bm{x})=u+\delta u(\bm{x})$ by solving

\begin{cases}(\mathcal{L}-f^{\prime}(u))\delta u(\bm{x})=\mathcal{L}u-f(u),&\bm{x}\in\Omega\\ \delta u(\bm{x})=0,&\bm{x}\in\partial\Omega.\end{cases} (2)

Solving Eq. (2) and repeating this $M$ times yields one of the solutions of the nonlinear equation (1). Different initial data converge to different solutions of Eq. (1), and each linearized problem (2) is well-posed. However, applying the solver to Eq. (1) multiple times can be time-consuming, especially for high-dimensional structures or when the number of discrete points in the solver is large. In this paper, we employ neural networks to address these challenges.
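For concreteness, the following is a minimal Python sketch of this classical Newton loop for the convex problem of Section 4.2 ($\mathcal{L}=-\Delta$, $f(u)=-u^{2}+\sin 5\pi(x+y)$), discretized with a standard 5-point finite difference stencil; the grid size, tolerance, and use of SciPy's sparse solver are illustrative assumptions rather than the exact solver used in the experiments. The residual convention $F(u)=-\Delta u+u^{2}-g$ follows Appendix A.1.1.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

N = 63                                    # interior grid points per direction (assumption)
h = 1.0 / (N + 1)
x = np.linspace(h, 1 - h, N)
X, Y = np.meshgrid(x, x, indexing="ij")

# Discrete -Laplacian with homogeneous Dirichlet boundary conditions (5-point stencil)
I = sp.identity(N)
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(N, N))
A = (sp.kron(I, T) + sp.kron(T, I)) / h**2

g = np.sin(5 * np.pi * (X + Y)).ravel()   # source term from f(u) = -u^2 + sin 5*pi*(x+y)

def F(u):
    # Residual of the nonlinear system: -Laplace(u) + u^2 - g = 0
    return A @ u + u**2 - g

def newton(u0, max_iter=50, tol=1e-10):
    u = u0.copy()
    for _ in range(max_iter):
        J = A + sp.diags(2.0 * u)                 # Jacobian: -Laplace + 2 diag(u)
        delta = spla.spsolve(J.tocsc(), -F(u))    # one linearized solve per iteration (cf. Eq. (2))
        u = u + delta
        if np.linalg.norm(delta) < tol:
            break
    return u

# Different initial guesses may converge to different solutions of Eq. (1).
u_star = newton(np.zeros(N * N))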

3.2 Neural Operator Structures

In this section, we introduce the structure of the DeepONet [24, 18] used to approximate the local operator of the Newton method in Eq. (2), i.e., $\mathcal{G}(u):=u_{*}$, where $u_{*}$ is the solution of Eq. (2), which depends on $u$. If we can learn the operator $\mathcal{G}(u)$ well using the neural operator $\mathcal{O}(u;\bm{\theta})$, then for an initial function $u_{0}$, assuming the $n$-th iteration approximates one solution, i.e., $(\mathcal{G}+\text{Id})^{(n)}(u_{0})\approx u^{*}$, we have

(\mathcal{O}+\text{Id})^{(n)}(u_{0})\approx(\mathcal{G}+\text{Id})^{(n)}(u_{0})\approx u^{*}.

For new initial data, we can then evaluate our neural operator and find the corresponding solution directly.
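As a minimal sketch (not the released implementation), the iteration $(\mathcal{O}+\text{Id})^{(n)}(u_{0})$ can be realized by repeatedly adding the network's predicted Newton update to the current iterate; here neural_op is a hypothetical trained model mapping $u$ to $\delta u$ on a grid, and the stopping tolerance is an assumption.

import torch

def newton_operator_iterate(neural_op, u0, n_steps=10, tol=1e-6):
    # Apply (O + Id) repeatedly: u_{k+1} = u_k + O(u_k; theta)
    u = u0
    for _ in range(n_steps):
        with torch.no_grad():
            delta = neural_op(u)              # predicted Newton update delta_u
        u = u + delta
        if delta.norm() < tol * max(u.norm().item(), 1.0):
            break
    return u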

We now discuss how to train such an operator. To begin, we define the following shallow neural operator with $p$ neurons for operators from $\mathcal{X}$ to $\mathcal{Y}$:

\mathcal{O}(a;\bm{\theta})=\sum_{i=1}^{p}\mathcal{A}_{i}\sigma\left(\mathcal{W}_{i}a+\mathcal{B}_{i}\right)\quad\forall a\in\mathcal{X} (3)

where $\mathcal{W}_{i}\in\mathcal{L}(\mathcal{X},\mathcal{Y})$, $\mathcal{B}_{i}\in\mathcal{Y}$, $\mathcal{A}_{i}\in\mathcal{L}(\mathcal{Y},\mathcal{Y})$, and $\bm{\theta}$ denotes all the parameters in $\{\mathcal{W}_{i},\mathcal{A}_{i},\mathcal{B}_{i}\}_{i=1}^{p}$. Here, $\mathcal{L}(\mathcal{X},\mathcal{Y})$ denotes the set of all bounded (continuous) linear operators between $\mathcal{X}$ and $\mathcal{Y}$, and $\sigma:\mathbb{R}\to\mathbb{R}$ is the nonlinear point-wise activation.

In this paper, we use a shallow DeepONet to approximate the Newton operator. More precisely, in the shallow neural network, $\mathcal{W}_{i}$ represents an interpolation of operators. Under proper and reasonable assumptions, we can present the following theorem ensuring that DeepONet can effectively approximate the Newton operator; the proof is provided in the appendix. Furthermore, in MgNO, $\mathcal{W}_{i}$ is replaced by a multigrid operator [33], and in FNO it is a kernel operator; our analysis can be generalized to such cases.

Before stating the theorem, we establish some assumptions regarding the input space $\mathcal{X}\subset H^{2}(\Omega)$ of the operator and $f(u)$ in Eq. (1).

Assumption 1.

(i): For any $u\in\mathcal{X}$, we have that

f^{\prime}(u)\cap\{\text{eigenvalues of }\mathcal{L}\}=\emptyset.

(ii): There exists a constant $F$ such that $\|f\|_{W^{2,\infty}(\mathbb{R})}\leq F$.

(iii): All coefficients in $\mathcal{L}$ are $C^{1}$ and $\partial\Omega\in C^{2}$.

(iv): There exists a linear bounded operator $\mathcal{P}$ on $\mathcal{X}$, a small constant $\epsilon>0$, and a constant $n$ such that

\|u-\mathcal{P}u\|_{H^{2}(\Omega)}\leq\epsilon,\quad\text{for all }u\in\mathcal{X}.

Furthermore, $\mathcal{P}u$ is an $n$-dimensional term, i.e., it can be denoted by the $n$-dimensional vector $\bar{\mathcal{P}}u\in\mathbb{R}^{n}$.

Remark 1.

We want to emphasize the reasonableness of our assumptions. For condition (i), we are essentially restricting our approximation efforts to local regions. This limitation is necessary because attempting to approximate the neural operator across the entire domain could lead to issues, particularly in cases where multiple solutions exist. Consider a scenario where the input function $u$ lies between two distinct solutions. Even a small perturbation of $u$ could result in the system converging to a completely different solution. Condition (i) ensures that Equation (2) has a unique solution, allowing us to focus our approximation efforts within localized domains.

Conditions (ii) and (iii) serve to regularize the problem and ensure its tractability. These conditions are indeed straightforward to fulfill, contributing to the feasibility of the overall approach.

For the embedding operator $\mathcal{P}$ in (iv), there are many choices in DeepONet, such as finite element methods like the Argyris element [2] or the embedding methods in [19, 13]; we discuss this further in the appendix. The approximation order would be $h^{-(l-2)}$, and different methods may achieve different orders; we omit the detailed discussion in this paper. Furthermore, for different neural networks this embedding may differ; for example, we can use a Fourier expansion [35] or multigrid methods [11] to achieve this task.

Theorem 1.

Suppose $\mathcal{X}=\mathcal{Y}\subset H^{2}(\Omega)$ and Assumption 1 holds. Then, there exists a neural network $\mathcal{O}(u;\bm{\theta})\in\Xi_{p}$ defined as

\Xi_{p}:=\left\{\sum_{i=1}^{p}\bm{A}_{i}\sigma\left(\mathcal{W}_{i}u+\bm{b}_{i}\right)\sigma\left(\bm{w}_{i}\cdot\bm{x}+\zeta_{i}\right)\,\middle|\,\mathcal{W}_{i}\in\mathcal{L}(\mathcal{X},\mathbb{R}^{m}),\bm{b}_{i}\in\mathbb{R}^{m},\bm{A}_{i}\in\mathcal{L}(\mathbb{R}^{m},\mathbb{R})\right\} (4)

such that

\left\|\sum_{i=1}^{p}\bm{A}_{i}\sigma\left(\mathcal{W}_{i}u+\bm{b}_{i}\right)\sigma\left(\bm{w}_{i}\cdot\bm{x}+\zeta_{i}\right)-\mathcal{G}(u)\right\|_{L^{2}(\Omega)}\leq C_{1}m^{-\frac{1}{n}}+C_{2}(\epsilon+p^{-\frac{2}{d}}), (5)

where $\sigma$ is a smooth non-polynomial activation function, $C_{1}$ is a constant independent of $m$, $\epsilon$, and $p$, $C_{2}$ is a constant depending on $p$, and $n$ is the dimension associated with $\mathcal{P}$ in Assumption 1; $\epsilon$ depends on $n$, and both are defined in Assumption 1.

3.3 Loss Functions of Newton Informed Neural Operator

3.3.1 Mean Square Error

The Mean Square Error loss function is defined as:

\mathcal{E}_{S}(\bm{\theta}):=\frac{1}{M_{u}\cdot M_{x}}\sum_{j=1}^{M_{u}}\sum_{k=1}^{M_{x}}\left|\mathcal{G}\left(u_{j}\right)\left(\bm{x}_{k}\right)-\mathcal{O}\left(u_{j};\bm{\theta}\right)\left(\bm{x}_{k}\right)\right|^{2} (6)

where $u_{1},u_{2},\ldots,u_{M_{u}}\sim\mu$ are independently and identically distributed (i.i.d.) samples in $\mathcal{X}$, and $\bm{x}_{1},\bm{x}_{2},\ldots,\bm{x}_{M_{x}}$ are uniformly i.i.d. samples in $\Omega$.
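A minimal sketch of evaluating this supervised loss when the sampled functions and the target Newton updates are given on a fixed grid (so the points $\bm{x}_{k}$ are simply the grid nodes); the tensor shapes and the model interface are illustrative assumptions.

import torch

def mse_operator_loss(model, u_batch, target_batch):
    # u_batch, target_batch: tensors of shape (M_u, N, N) on a common grid
    pred = model(u_batch)                       # O(u_j; theta) evaluated at the grid points x_k
    return ((pred - target_batch) ** 2).mean()  # average over j = 1..M_u and k = 1..M_x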

However, using only the Mean Squared Error loss function is not sufficient for learning the Newton method, especially since in most cases we do not have enough data $\{u_{j},\mathcal{G}(u_{j})\}_{j=1}^{M_{u}}$. Furthermore, there are situations where we do not know how many solutions the nonlinear equation (1) has. If the data is sparse around one of the solutions, it becomes impossible to effectively learn the Newton method around that solution.

Note that $\mathcal{E}_{S}(\bm{\theta})$ can be viewed as the finite-data version of $\mathcal{E}_{Sc}(\bm{\theta})$, where

\mathcal{E}_{Sc}(\bm{\theta})=\lim_{M_{u},M_{x}\to\infty}\mathcal{E}_{S}(\bm{\theta}).

The smallness of $\mathcal{E}_{Sc}$ can be inferred from Theorem 1. To understand the gap between $\mathcal{E}_{Sc}(\bm{\theta})$ and $\mathcal{E}_{S}(\bm{\theta})$, we can rely on the following theorem. Before stating it, we need some assumptions about the data in $\mathcal{E}_{S}(\bm{\theta})$:

Assumption 2.

(i) Boundedness: For any neural network with bounded parameters, characterized by a bound $B$ and dimension $d_{\bm{\theta}}$, there exists a function $\Psi:L^{2}(\Omega)\rightarrow[0,\infty)$ such that

|\mathcal{G}(u)(\bm{x})|\leqslant\Psi(u),\quad\sup_{\bm{\theta}\in[-B,B]^{d_{\bm{\theta}}}}\left|\mathcal{N}(u;\bm{\theta})(\bm{x})\right|\leqslant\Psi(u),

for all $u\in\mathcal{X},\bm{x}\in\Omega$, and there exist constants $C,\kappa>0$ such that

\Psi(u)\leqslant C\left(1+\|u\|_{H^{2}}\right)^{\kappa}. (7)

(ii) Lipschitz continuity: There exists a function $\Phi:L^{2}(\Omega)\rightarrow[0,\infty)$ such that

\left|\mathcal{N}(u;\bm{\theta})(\bm{x})-\mathcal{N}(u;\bm{\theta}^{\prime})(\bm{x})\right|\leqslant\Phi(u)\left\|\bm{\theta}-\bm{\theta}^{\prime}\right\|_{\ell^{\infty}} (8)

for all $u\in\mathcal{X},\bm{x}\in\Omega$, and

\Phi(u)\leqslant C\left(1+\|u\|_{H^{2}(\Omega)}\right)^{\kappa},

for the same constants $C,\kappa>0$ as in Eq. (7).

(iii) Finite measure: There exists $\alpha>0$ such that

\int_{H^{2}(\Omega)}e^{\alpha\|u\|_{H^{2}(\Omega)}^{2}}\mathrm{d}\mu(u)<\infty.
Theorem 2.

If Assumption 2 holds, then the generalization error is bounded by

|\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sc}(\bm{\theta}))|\leqslant C\left[\frac{1}{\sqrt{M_{u}}}\left(1+Cd_{\bm{\theta}}\log(CB\sqrt{M_{u}})^{2\kappa+1/2}\right)+\frac{d_{\bm{\theta}}\log M_{x}}{M_{x}}\right],

where $C$ is a constant independent of $B$, $d_{\bm{\theta}}$, $M_{x}$, and $M_{u}$. The parameter $\kappa$ is specified in (7). Here, $B$ represents the bound of the parameters and $d_{\bm{\theta}}$ is the number of parameters.

The proof of Theorem 2 is presented in Appendix B.3.

Remark 2.

Assumption 2 is easy to satisfy if we consider $\mathcal{X}$ to be a local set of functions around the solution, which is typically the case in Newton methods. This aligns with our approach and working region in the approximation part (see Remark 1).

3.3.2 Newton Loss

As we have mentioned, relying solely on the MSE loss function can require a significant amount of data to achieve the task. However, obtaining enough data can be challenging, especially when the equation is complex and the dimension of the input space is large. Hence, we consider another loss function to aid learning, namely a physics-informed loss function [29, 7, 14, 21], referred to here as the Newton loss function.

The Newton loss function is defined as:

\mathcal{E}_{N}(\bm{\theta}):=\frac{1}{N_{u}\cdot N_{x}}\sum_{j=1}^{N_{u}}\sum_{k=1}^{N_{x}}\left|(\mathcal{L}-f^{\prime}(u_{j}\left(\bm{x}_{k}\right)))\mathcal{O}\left(u_{j};\bm{\theta}\right)\left(\bm{x}_{k}\right)-\mathcal{L}u_{j}\left(\bm{x}_{k}\right)-f(u_{j}\left(\bm{x}_{k}\right))\right|^{2} (9)

where $u_{1},u_{2},\ldots,u_{N_{u}}\sim\nu$ are independently and identically distributed (i.i.d.) samples in $\mathcal{X}$, and $\bm{x}_{1},\bm{x}_{2},\ldots,\bm{x}_{N_{x}}$ are uniformly i.i.d. samples in $\Omega$.
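A minimal sketch of this loss for the convex problem of Section 4.2, assuming the residual convention $F(u)=-\Delta u+u^{2}-g$, $J(u)=-\Delta+2\,\mathrm{diag}(u)$ of Appendix A.1.1, with the Laplacian applied via the convolutional stencil of Appendix A.2 and zero (Dirichlet) padding; the exact sign convention and discretization in the released code may differ.

import torch
import torch.nn.functional as F

STENCIL = torch.tensor([[0., -1., 0.],
                        [-1., 4., -1.],
                        [0., -1., 0.]]).view(1, 1, 3, 3)

def neg_laplacian(v, h):
    # Discrete -Laplacian of v with zero-Dirichlet boundary conditions (zero padding)
    return F.conv2d(v.unsqueeze(1), STENCIL, padding=1).squeeze(1) / h**2

def newton_loss(model, u_batch, g, h):
    # u_batch: (N_u, N, N) unsupervised inputs; g: source term on the same grid
    delta = model(u_batch)                                       # O(u_j; theta), approximate Newton update
    J_delta = neg_laplacian(delta, h) + 2.0 * u_batch * delta    # J(u) delta = (-Laplace + 2u) delta
    F_u = neg_laplacian(u_batch, h) + u_batch**2 - g             # F(u) = -Laplace(u) + u^2 - g
    return ((J_delta + F_u) ** 2).mean()                         # residual of J(u) delta = -F(u)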

Note that $\mathcal{E}_{N}(\bm{\theta})$ can be viewed as the finite-data version of $\mathcal{E}_{Nc}(\bm{\theta})$, where

\mathcal{E}_{Nc}(\bm{\theta})=\lim_{N_{u},N_{x}\to\infty}\mathcal{E}_{N}(\bm{\theta}).

To understand the gap between $\mathcal{E}_{Nc}(\bm{\theta})$ and $\mathcal{E}_{N}(\bm{\theta})$, we can rely on the following theorem:

Corollary 1.

If Assumption 2 holds, then the generalization error is bounded by

|\mathbf{E}(\mathcal{E}_{N}(\bm{\theta})-\mathcal{E}_{Nc}(\bm{\theta}))|\leqslant C\left(\frac{d_{\bm{\theta}}\log N_{u}}{\sqrt{N_{u}}}+\frac{d_{\bm{\theta}}\log N_{x}}{N_{x}}\right),

where $C$ is a constant independent of $B$, $d_{\bm{\theta}}$, $N_{x}$, and $N_{u}$. The parameter $\kappa$ is specified in (7). Here, $B$ represents the bound of the parameters and $d_{\bm{\theta}}$ is the number of parameters.

The proof of Corollary 1 is similar to that of Theorem 2; therefore, it will be omitted from the paper.

Remark 3.

If we only utilize $\mathcal{E}_{S}(\bm{\theta})$ as our loss function, then as demonstrated in Theorem 2, we require both $M_{u}$ and $M_{x}$ to be large, which poses a significant challenge when dealing with complex nonlinear equations, since obtaining sufficient data becomes a critical issue. In this paper, we integrate Newton information into the loss function, defining it as follows:

\mathcal{E}(\bm{\theta}):=\lambda\mathcal{E}_{S}(\bm{\theta})+\mathcal{E}_{N}(\bm{\theta}), (10)

where $\mathcal{E}_{N}(\bm{\theta})$ is the loss on the data used for unsupervised learning. If we lack sufficient data for $\mathcal{E}_{S}(\bm{\theta})$, we can adjust the parameters by selecting a small $\lambda$ and increasing $N_{x}$ and $N_{u}$. This strategy enables effective learning even when data for $\mathcal{E}_{S}(\bm{\theta})$ is limited.

In the following experiments, we use the neural operator established in Eq. (3) and the loss function in Eq. (10) to learn one step of the Newton method locally, i.e., the map between the input $u$ and the solution $\delta u$ in Eq. (2). If we have a large dataset, we choose a large $\lambda$ in $\mathcal{E}(\bm{\theta})$ (10); if we have a small dataset, we use a small $\lambda$ to control the generalization error of the operator. After learning one step of the Newton method with the operator network, we can easily and quickly obtain a solution from an initial guess for the nonlinear PDE (1) and find new solutions not present in the datasets.
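A minimal sketch of how the two terms can be combined in training, reusing the mse_operator_loss and newton_loss sketches above; the weighting and batch handling are illustrative assumptions.

def total_loss(model, sup_u, sup_target, unsup_u, g, h, lam=0.01):
    # lambda * E_S(theta) on the small supervised set + E_N(theta) on the larger unsupervised set, as in Eq. (10)
    return lam * mse_operator_loss(model, sup_u, sup_target) + newton_loss(model, unsup_u, g, h)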

4 Experiments

4.1 Experimental Settings

We introduce two distinct training methodologies. The first approach employs exclusively supervised data, using the Mean Squared Error loss (6) as the optimization criterion. The second method combines supervised and unsupervised learning, using the hybrid loss function (10) that integrates the Mean Squared Error loss (6) for the small proportion of data with ground truth (the supervised training dataset) and the Newton loss (9) for the large proportion of data without ground truth (the unsupervised training dataset). We refer to these two methods as Method 1 and Method 2. The approaches are designed to simulate a practical scenario with limited data availability, facilitating a comparison between these training strategies to evaluate their efficacy in small supervised data regimes. We choose the same configuration of the neural operator (DeepONet) in both cases, which aligns with our theoretical analysis. Detailed experimental settings and the datasets for each example below can be found in Appendix A.

4.2 Case 1: Convex problem

We consider the 2D convex problem $\mathcal{L}(u)-f(u)=0$, where $\mathcal{L}(u):=-\Delta u$, $f(u):=-u^{2}+\sin 5\pi(x+y)$, and $u=0$ on $\partial\Omega$. We investigate the training dynamics and testing performance of a neural operator (DeepONet) trained with different loss functions and dataset sizes, focusing on the Mean Squared Error (MSE) and Newton's loss functions.

(a) The training and testing errors using Method 1
(b) Comparison of models trained using Method 1 and Method 2, the latter employing the regularized loss function defined in Eq. (10) with $\lambda=0.01$.
Figure 1: Training and testing performance of DeepONet under various conditions. For the detailed experimental settings, please refer to Appendix A.

MSE Loss Training (Fig. 1(a)): In Method 1, effective training is observed, but generalization is poor, suggesting that the MSE loss alone is not enough. Performance Comparison (Fig. 1(b)): The model trained with Newton's loss (Method 2) exhibits superior performance in both the $L_{2}$ and $H_{1}$ error metrics, highlighting its enhanced generalization accuracy. This study shows the advantage of using Newton's loss for training DeepONet models, which requires fewer supervised data points than Method 1.

4.3 Case 2: Non-convex problem with multiple solutions

We consider a 2D non-convex problem,

\begin{cases}-\Delta u(x,y)-u^{2}(x,y)=-s\sin(\pi x)\sin(\pi y)\quad\text{in}\quad\Omega,\\ u(x,y)=0,\quad\text{on}\quad\partial\Omega\end{cases} (11)

where $\Omega=(0,1)\times(0,1)$ [3]. In this case, $\mathcal{L}(u):=-\Delta u-u^{2}$ is non-convex, and the problem has multiple solutions (see Figure 2).

(a) Solutions of the 2D non-convex problem (11)
(b) Method 1 vs. Method 2
Figure 2: Solutions of the 2D non-convex problem (11)

In the experiment, one of the multiple ground-truth solutions is rarely represented in the supervised training dataset, so the neural operator trained via Method 1 saturates in test error because it relies on the ground truth to recover all the patterns in the multiple-solution case (as shown by the curves in Figure 2). On the other hand, the model trained via Method 2 is less affected by the limited supervised data thanks to the use of the Newton loss. One can refer to Appendix A for the detailed experimental settings.

Efficiency

We utilize this problem as a case study to demonstrate the superior efficiency of our neural operator-based method as a surrogate for Newton's method. By benchmarking both approaches, we highlight the significant performance advantages of our neural operator model. Performance was assessed in terms of total execution time, which includes the setup of matrices and vectors, computation on the GPU, and synchronization of CUDA streams. Both methods leverage the parallel processing capabilities of the GPU. Specifically, the Newton solver explicitly uses 10 streams and CuPy with CUDA to parallelize the computation and fully utilize the GPU, aiming to optimize execution efficiency. In contrast, the neural operator method is inherently parallelized, taking full advantage of the GPU architecture without the explicit use of multiple streams, as indicated in the table. The computational times of both methods were evaluated under a common hardware configuration. One can find a detailed description of the experiments in Appendix A.5.

Parameter                                           Newton's Method   Neural Operator
Number of Streams                                   10                -
Data Type                                           float32           float32
Execution Time for 500 linear Newton systems (s)    31.52             1.1E-4
Execution Time for 5000 linear Newton systems (s)   321.15            1.4E-4
Table 1: Benchmarking the efficiency of the neural operator-based method

The data presented in the table illustrate that the neural operator is significantly more efficient than the classical Newton solver on the GPU. This efficiency gain is likely due to the neural operator's ability to leverage the parallel processing capabilities of GPUs more effectively than the Newton solver, even though the Newton solver is also parallelized. This enhancement is crucial for efficiently computing the vast number of unknown patterns in complex nonlinear systems such as the Gray-Scott model.

4.4 Gray-Scott Model

The Gray-Scott model [27] describes the reaction and diffusion of two chemical species, $A$ and $S$, governed by the following equations:

\frac{\partial A}{\partial t}=D_{A}\Delta A-SA^{2}+(\mu+\rho)A,
\frac{\partial S}{\partial t}=D_{S}\Delta S+SA^{2}-\rho(1-S),

where $D_{A}$ and $D_{S}$ are the diffusion coefficients, and $\mu$ and $\rho$ are rate constants.

Newton's Method for Steady-State Solutions

Newton's method is employed to find steady-state solutions ($\frac{\partial A}{\partial t}=0$ and $\frac{\partial S}{\partial t}=0$) by solving the nonlinear system:

0=D_{A}\Delta A-SA^{2}+(\mu+\rho)A, (12)
0=D_{S}\Delta S+SA^{2}-\rho(1-S).

The Gray-Scott model is highly sensitive to initial conditions: even minor perturbations can lead to vastly different emergent patterns. Please refer to Figure 4 for some examples of the patterns. This sensitivity reflects the model's complex, non-linear dynamics, which can evolve into a multitude of possible steady states depending on the initial setup. Consequently, training a neural operator to map initial conditions directly to their respective steady states presents significant challenges. Such a model must learn from a vast function space, capturing the underlying dynamics that dictate the transition from any given initial state to its final pattern. This complexity and diversity of potential outcomes is the inherent difficulty in training neural operators effectively for systems as complex as the Gray-Scott model. One can refer to Appendix A.1.2 for a detailed discussion of the Gray-Scott model. We employ a neural operator as a substitute for the Newton solver in the Gray-Scott model, recurrently mapping the initial state to the steady state.

(a) An example demonstrating how the neural operator maps the initial state to the steady state in a step-by-step manner
(b) Average convergence rate of $\|u-u_{i}\|$ across the test dataset
(c) Training via Method 2
Figure 3: The convergence behavior of the neural operator-based solver.

In subfigure (a), we use a ring-like pattern as the initial state to test our learned neural operator. This particular pattern does not appear in the supervised training dataset and lacks corresponding ground-truth data. Despite this, our neural operator, trained using the Newton loss, is able to map the initial state to its correct steady state effectively. In subfigure (b), we further test our neural operator, utilizing it as a surrogate for Newton's method to address nonlinear problems with initial states drawn from the test dataset. The curve shows the average convergence rate of $\|u-u_{i}\|$ across the test dataset, where $u_{i}$ represents the prediction at the $i$-th step of the neural operator. In subfigure (c), both the training loss curves and the test error curves are displayed, illustrating their progression throughout training.

5 Conclusion

In this paper, we consider using neural operators to solve nonlinear PDEs (Eq. (1)) with multiple solutions. To make the problem well-posed and to learn the multiple solutions in one learning process, we combine neural operator learning with the classical Newton method, resulting in the Newton Informed Neural Operator. We provide a theoretical analysis of our neural operator, demonstrating that it can effectively learn the Newton operator, reduce the amount of required supervised data, and learn solutions not present in the supervised data thanks to the addition of the Newton loss (9) to the loss function. Our experiments are consistent with our theoretical analysis, showcasing the aforementioned advantages of our network: it requires fewer supervised data points, learns solutions not present in the supervised data, and costs less time than classical Newton methods to solve the problem.

Reproducibility Statement

Code Availability: The code used in our experiments can be accessed via https://github.com/xll2024/newton/tree/main and also in the supplementary material. Datasets can be downloaded via the URLs in the repository. This encompasses all scripts, functions, and auxiliary files necessary to reproduce our results. Configuration Transparency: All configurations, including hyperparameters, model architectures, and optimization settings, are explicitly provided in the Appendix.

Acknowledgments and Disclosure of Funding

Y.Y. and W.H. were supported by the National Institute of General Medical Sciences through grant 1R35GM146894. The work of X.L. was partially supported by the KAUST Baseline Research Fund.

References

  • [1] H. Amann and P. Hess. A multiplicity result for a class of elliptic boundary value problems. Proceedings of the Royal Society of Edinburgh Section A: Mathematics, 84(1-2):145–151, 1979.
  • [2] S. Brenner. The mathematical theory of finite element methods. Springer, 2008.
  • [3] B. Breuer, P. McKenna, and M. Plum. Multiple solutions for a semilinear boundary value problem: a computational multiplicity proof. Journal of Differential Equations, 195(1):243–269, 2003.
  • [4] Shuhao Cao. Choose a transformer: Fourier or galerkin. Advances in Neural Information Processing Systems, 34, 2021.
  • [5] Mark C Cross and Pierre C Hohenberg. Pattern formation outside of equilibrium. Reviews of modern physics, 65(3):851, 1993.
  • [6] L. Evans. Partial differential equations, volume 19. American Mathematical Society, 2022.
  • [7] Somdatta Goswami, Aniruddha Bora, Yue Yu, and George Em Karniadakis. Physics-informed neural operators. arXiv preprint arXiv:2207.05748, 2022.
  • [8] Ruchi Guo, Shuhao Cao, and Long Chen. Transformer meets boundary value inverse problems. In The Eleventh International Conference on Learning Representations, 2022.
  • [9] J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
  • [10] W. Hao and C. Xue. Spatial pattern formation in reaction–diffusion models: a computational approach. Journal of mathematical biology, 80:521–543, 2020.
  • [11] J. He, X. Liu, and J. Xu. Mgno: Efficient parameterization of linear operators via multigrid. In The Twelfth International Conference on Learning Representations, 2023.
  • [12] Rebecca B Hoyle. Pattern formation: an introduction to methods. Cambridge University Press, 2006.
  • [13] J. Hu and S. Zhang. The minimal conforming Hk{H}^{k} finite element spaces on Rn{R}^{n} rectangular grids. Mathematics of Computation, 84(292):563–579, 2015.
  • [14] Y. Huang, W. Hao, and G. Lin. Hompinns: Homotopy physics-informed neural networks for learning multiple solutions of nonlinear elliptic differential equations. Computers & Mathematics with Applications, 121:62–73, 2022.
  • [15] Carl T Kelley. Solving nonlinear equations with Newton’s method. SIAM, 2003.
  • [16] Shigeru Kondo and Takashi Miura. Reaction-diffusion model as a framework for understanding biological pattern formation. science, 329(5999):1616–1620, 2010.
  • [17] I. Lagaris, A. Likas, and D. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE transactions on neural networks, 9(5):987–1000, 1998.
  • [18] Samuel Lanthaler, Siddhartha Mishra, and George E Karniadakis. Error estimates for deeponets: A deep learning framework in infinite dimensions. Transactions of Mathematics and Its Applications, 6(1):tnac001, 2022.
  • [19] Z. Li and N. Yan. New error estimates of bi-cubic hermite finite element methods for biharmonic equations. Journal of computational and applied mathematics, 142(2):251–285, 2002.
  • [20] Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier Neural Operator for Parametric Partial Differential Equations. In International Conference on Learning Representations, 2020.
  • [21] Zongyi Li, Hongkai Zheng, Nikola B Kovachki, David Jin, Haoxuan Chen, Burigede Liu, Kamyar Azizzadenesheli, and Anima Anandkumar. Physics-informed neural operator for learning partial differential equations. CoRR, abs/2111.03794, 2021.
  • [22] B. Lin, Z. Mao, Z. Wang, and G. Karniadakis. Operator learning enhanced physics-informed neural networks for solving partial differential equations characterized by sharp solutions. arXiv preprint arXiv:2310.19590, 2023.
  • [23] Xinliang Liu, Bo Xu, Shuhao Cao, and Lei Zhang. Mitigating spectral bias for the multiscale operator learning. Journal of Computational Physics, 506:112944, 2024.
  • [24] Lu Lu, Pengzhan Jin, and George Em Karniadakis. Deeponet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv preprint arXiv:1910.03193, 2019.
  • [25] Lu Lu, Xuhui Meng, Shengze Cai, Zhiping Mao, Somdatta Goswami, Zhongqiang Zhang, and George Em Karniadakis. A comprehensive and fair comparison of two neural operators (with practical extensions) based on fair data. Computer Methods in Applied Mechanics and Engineering, 393:114778, 2022.
  • [26] H. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1):164–177, 1996.
  • [27] J. Pearson. Complex patterns in a simple system. Science, 261(5118):189–192, 1993.
  • [28] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5):503–519, 2017.
  • [29] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
  • [30] A. Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function. Annals of statistics, 48(4):1875–1897, 2020.
  • [31] M. Ulbrich. Semismooth newton methods for operator equations in function spaces. SIAM Journal on Optimization, 13(3):805–841, 2002.
  • [32] T. Welti. High-dimensional stochastic approximation: algorithms and convergence rates. PhD thesis, ETH Zurich, 2020.
  • [33] J. Xu. Two-grid discretization techniques for linear and nonlinear pdes. SIAM journal on numerical analysis, 33(5):1759–1777, 1996.
  • [34] Y. Yang and J. He. Deeper or wider: A perspective from optimal generalization error with sobolev loss. International Conference on Machine Learning, 2024.
  • [35] Y. Yang and Y. Xiang. Approximation of functionals by neural network without curse of dimensionality. arXiv preprint arXiv:2205.14421, 2022.
  • [36] Y. Yang, H. Yang, and Y. Xiang. Nearly optimal VC-dimension and pseudo-dimension bounds for deep neural network derivatives. NeurIPS 2023, 2023.
  • [37] B. Yu and W. E. The deep ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.

Appendix A Experimental settings

A.1 Background on the PDEs and generation of datasets

A.1.1 Case 1: convex problem

Function and Jacobian

The function $F(u)$ might typically be defined as:

F(u)=-\Delta u+u^{2}

The Jacobian $J(u)$ for the given function $F(u)$ involves the derivative of $F$ with respect to $u$, which includes the Laplace operator and the derivative of the nonlinear term:

J(u)=-\Delta+2\cdot u.

The dataset is generated by sampling the initial state $u_{0}\sim\mathcal{N}(0,\Delta^{-3})$ and then computing the convergent sequence $\{u_{0},u_{1},\ldots,u_{n}\}$ by Newton's method. Each convergent sequence $\{u_{0},u_{1},\ldots,u_{n}\}$ is one data entry in the dataset.
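A minimal sketch of drawing such initial states on the unit square, interpreting $\mathcal{N}(0,\Delta^{-3})$ as a Gaussian random field with covariance $(-\Delta)^{-3}$ under Dirichlet boundary conditions, sampled in the sine (DST) eigenbasis of the Laplacian; the normalization and grid size are illustrative assumptions.

import numpy as np
from scipy.fft import idstn

def sample_initial_state(N=63, power=3, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, N + 1)
    lam = (np.pi * k) ** 2                       # 1D Dirichlet Laplacian eigenvalues
    lam2d = lam[:, None] + lam[None, :]          # eigenvalues of -Laplacian on the unit square
    # spectral coefficients with variance lam^{-power}, i.e. covariance (-Laplacian)^{-power}
    coeffs = rng.standard_normal((N, N)) * lam2d ** (-power / 2.0)
    return idstn(coeffs, type=1, norm="ortho")   # back to grid values via the sine transform

u0 = sample_initial_state()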

The function and Jacobian for the non-convex problem (Case 2) are analogous to the convex problem, except that its Jacobian is $J(u)=-\Delta-2u$, so Newton's system is not positive definite.

A.1.2 Gray Scott model

Jacobian Matrix

The Jacobian matrix $J$ of the system is crucial for applying Newton's method:

J=\begin{bmatrix}J_{AA}&J_{AS}\\ J_{SA}&J_{SS}\end{bmatrix}

with components:

J_{AA}=-D_{A}\Delta+\text{diag}(-2SA+\mu+\rho),
J_{AS}=\text{diag}(-A^{2}),
J_{SA}=\text{diag}(2SA),
J_{SS}=-D_{S}\Delta+\text{diag}(A^{2}+\rho).

The numerical simulation of the Gray-Scott model was configured with the following parameters:

  • Grid Size: The simulation grid is square with $N=63$ points on each side, leading to a total of $N^{2}$ grid points. This resolution was chosen to balance computational efficiency with spatial resolution sufficient to capture detailed patterns. The spacing between grid points, $h$, is computed as $h=\frac{1.0}{N-1}$. This ensures that the domain is normalized to a unit square, which simplifies the analysis and scaling of diffusion rates.

  • Diffusion Coefficients: The diffusion coefficients for species $A$ and $S$ are set to $D_{A}=2.5\times 10^{-4}$ and $D_{S}=5.0\times 10^{-4}$, respectively. These values determine the rate at which each species diffuses through the spatial domain.

  • Reaction Rates: The reaction rate $\mu$ and feed rate $\rho$ are crucial parameters that govern the dynamics of the system. For this simulation, $\mu$ is set to 0.065 and $\rho$ to 0.04, influencing the production and removal rates of the chemical species.

Simulations

The simulation utilizes a finite difference method for spatial discretization and Newton's method to solve for the steady state given the initial state. The algorithm is detailed in Algorithm 1.

Figure 4: Examples of steady states of the Gray-Scott model
Figure 5: Three examples depicting the evolution from the initial state to the steady state via Newton's method.

A.2 Implementations of loss functions

Discrete Newton’s Loss

In solving partial differential equations (PDEs) numerically on a regular grid, the Laplace operator and other differential terms can be efficiently computed using convolution. Here, we detail the method for calculating $J(u)\delta u-F(u)$, where $J(u)$ is the Jacobian matrix, $\delta u$ is the Newton step, and $F(u)$ is the function defining the PDE.

Discretization

Consider a discrete representation of a function $u$ on an $N\times N$ grid. The function $u$ and its perturbation $\delta u$ are represented as matrices:

u,\delta u\in\mathbb{R}^{N\times N}

The function $F(u)$, which involves both linear and nonlinear terms, is similarly represented as $F(u)\in\mathbb{R}^{N\times N}$.

Laplace Operator

To represent $J(u)$ acting on the $N\times N$ grid function $u$, the discretized Laplace operator obtained from a finite difference method can be expressed as a convolution:

-\Delta u=\begin{bmatrix}0&-1&0\\ -1&4&-1\\ 0&-1&0\end{bmatrix}\ast u

This convolution computes the result of the Laplace operator applied to the grid function $u$. Boundary conditions can be incorporated into the convolution through the padding mode: a Dirichlet boundary condition corresponds to zero padding, while a Neumann boundary condition corresponds to replicate padding.
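A minimal PyTorch sketch of this convolutional Laplacian; the interface is an illustrative assumption, but the stencil and padding modes follow the description above (zero padding for Dirichlet, replicate padding for Neumann).

import torch
import torch.nn.functional as F

KERNEL = torch.tensor([[0., -1., 0.],
                       [-1., 4., -1.],
                       [0., -1., 0.]]).view(1, 1, 3, 3)

def neg_laplace(u, h, bc="dirichlet"):
    # u: tensor of shape (batch, N, N); returns the discrete -Laplacian of u scaled by 1/h^2
    v = u.unsqueeze(1)
    if bc == "dirichlet":
        v = F.pad(v, (1, 1, 1, 1), mode="constant", value=0.0)   # zero padding
    else:
        v = F.pad(v, (1, 1, 1, 1), mode="replicate")             # Neumann: replicate padding
    return F.conv2d(v, KERNEL).squeeze(1) / h**2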

A.3 Architecture of DeepONet

A variant of DeepONet is used in our Newton informed neural operator. In the DeepONet, we introduce a hybrid architecture that combines convolutional layers with a static trunk basis, optimized for grid-based data inputs common in computational applications like computational biology and materials science.

Branch Network

The branch network is designed to effectively downsample and process the spatial features through a series of convolutional layers:

  • A Conv2D layer with 128 filters (7x7, stride 2) initiates the feature extraction, reducing the input dimensionality while capturing coarse spatial features.

  • This is followed by additional Conv2D layers (128 filters, 5x5 kernel, stride 2 and subsequently 3x3 with padding, 1x1) which further refine and compact the feature representation.

  • The convolutional output is flattened and processed through two fully connected layers (256 units then down to branch features), using GELU activation.

Trunk Network

The trunk utilizes a static basis represented by the tensor V, incorporated as a non-trainable component: The tensor V is precomputed, using Proper Orthogonal Decomposition (POD) as in [25], and is dimensionally compatible with the output of the branch network.

Forward Pass

During the forward computation, the branch network outputs are projected onto the trunk’s static basis via matrix multiplication, resulting in a feature matrix that is reshaped into the grid dimensionality for output.
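A minimal sketch of this forward pass: a small convolutional branch network produces coefficients that are multiplied with a fixed POD trunk basis V and reshaped back to the grid. The layer sizes here are illustrative and do not match the exact hyperparameters in Table 2.

import torch
import torch.nn as nn

class PODDeepONet(nn.Module):
    def __init__(self, V):
        # V: precomputed, non-trainable POD trunk basis of shape (n_basis, N*N)
        super().__init__()
        self.register_buffer("V", V)
        self.branch = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3), nn.GELU(),
            nn.Conv2d(64, 64, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.GELU(),
            nn.Linear(256, V.shape[0]),
        )

    def forward(self, u):
        # u: (batch, N, N) -> branch coefficients (batch, n_basis)
        coeffs = self.branch(u.unsqueeze(1))
        out = coeffs @ self.V                   # project onto the static trunk basis
        n = int(self.V.shape[1] ** 0.5)
        return out.view(-1, n, n)               # reshape to grid dimensionality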

Hyperparameters

Table 2 below summarizes the key hyperparameters used in the DeepONet architecture:

Parameter                       Value
Number of Conv2D layers         5
Filters in Conv2D layers        128, 128, 128, 128, 256
Kernel sizes in Conv2D layers   7x7, 5x5, 3x3, 1x1, 5x5
Strides in Conv2D layers        2, 2, 1, 1, 2
Fully Connected Layer Sizes     256, branch features
Activation Function             GELU
Table 2: Hyperparameters of the DeepONet architecture

A.4 Training settings

Below we summarize the key configurations and parameters employed in the training for three cases:

Dataset
  • Case 1: For Method 1, we use 500 supervised data samples (with ground truth), while for Method 2, we use 5000 unsupervised data samples (only with the initial state) and 500 supervised data samples.

  • Case 2: For Method 1, we use 5000 supervised data samples, while for Method 2, we use 5000 unsupervised data samples and 5000 supervised data samples.

  • Case 3 (Gray-Scott model): We only perform Method 2, with 10000 supervised data samples and 50000 unsupervised data samples.

Optimization Technique
  • Optimizer: Adam, with a learning rate of 1×1041\times 10^{-4} and a weight decay of 1×1061\times 10^{-6}.

  • Training Epochs: The model was trained for 1000 epochs to ensure convergence, with a batch size of 50.

These settings underscore our commitment to precision and detailed examination of neural operator efficiency in computational tasks. Our architecture and optimization choices are particularly tailored to explore and exploit the capabilities of neural networks in processing complex systems simulations.

A.5 Benchmarking Newton’s Method and neural operator based method

Experimental Setup

The benchmark study was conducted to evaluate the performance of a GPU-accelerated implementation of Newton's method, designed to solve systems of linear equations derived from discretizing partial differential equations. The implementation utilized CuPy with CUDA to leverage the parallel processing capabilities of the NVIDIA A100 GPU. The hardware comprises an Intel Cascade Lake 2.5 GHz CPU and an NVIDIA A100 GPU.

Software Environment:

Ubuntu 20.04 LTS. Python Version: 3.8. CUDA Version: 11.4.

Newton's method was implemented to solve the Laplacian equation over a discretized domain. Multiple system solutions were computed in parallel using CUDA streams. Key parameters of the experiment are as follows. Data Type (dtype): single-precision floating point (float32). Number of Streams: 10 CUDA streams to process data in parallel. Number of Repeated Calculations: the Newton method was executed multiple times for 500/5000 Newton linear systems, respectively, distributed evenly across the streams. Function to Solve Systems: CuPy's spsolve was used for solving the sparse matrix systems. Algorithm 1 below summarizes the procedure used to benchmark the time for solving Newton's systems for 5000 different initial states.

Algorithm 1 Solve Newton's Systems on GPU
1: procedure SolveSystemsGPU(A, u, rhs_f, N)   ▷ Precompute RHS and diagonal for all systems
2:     rhs_list ← rhs_f + u² − A × u
3:     diag_list ← −2 × u   ▷ Initialize solution storage
4:     delta_u ← zero matrix with the shape of uᵀ   ▷ Solve each system
5:     for i = 0 to num_sys − 1 do
6:         rhs ← rhs_list[i]
7:         diag ← diag_list[i]
8:         J ← A + diagonal matrix with diag on the main diagonal   ▷ Solve the linear system
9:         delta_u[i] ← spsolve(J, rhs)
10:     end for
11:     return transpose(delta_u)
12: end procedure
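For reference, the following is a minimal CPU sketch of Algorithm 1 using SciPy's sparse solver in place of the CuPy/CUDA-stream implementation; A is the assembled sparse matrix, u holds one initial state per column, and rhs_f is the source term. This is an illustrative reimplementation under those assumptions, not the benchmarked GPU code.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_newton_systems(A, u, rhs_f):
    # A: (M, M) sparse matrix; u: (M, num_sys) initial states; rhs_f: (M,) source term
    num_sys = u.shape[1]
    rhs_list = rhs_f[:, None] + u**2 - A @ u          # precompute right-hand sides for all systems
    diag_list = -2.0 * u                              # precompute diagonal corrections
    delta_u = np.zeros_like(u)
    for i in range(num_sys):                          # solve each linearized Newton system
        J = A + sp.diags(diag_list[:, i])
        delta_u[:, i] = spla.spsolve(J.tocsc(), rhs_list[:, i])
    return delta_u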

Appendix B Supplemental material for proof

B.1 Preliminaries

Definition 1 (Sobolev Spaces [6]).

Let $\Omega$ be $[0,1]^{d}$, let $D$ be the weak derivative operator for a single-variable function, and let $D^{\bm{\alpha}}=D^{\alpha_{1}}_{1}D^{\alpha_{2}}_{2}\ldots D^{\alpha_{d}}_{d}$ be the partial derivative, where $\bm{\alpha}=[\alpha_{1},\alpha_{2},\ldots,\alpha_{d}]^{T}$ and $D_{i}$ is the derivative in the $i$-th variable. Let $n\in\mathbb{N}$ and $1\leq p\leq\infty$. Then we define the Sobolev spaces

W^{n,p}(\Omega):=\left\{f\in L^{p}(\Omega):D^{\bm{\alpha}}f\in L^{p}(\Omega)\text{ for all }\bm{\alpha}\in\mathbb{N}^{d}\text{ with }|\bm{\alpha}|\leq n\right\}

with a norm

\|f\|_{W^{n,p}(\Omega)}:=\left(\sum_{0\leq|\alpha|\leq n}\left\|D^{\alpha}f\right\|_{L^{p}(\Omega)}^{p}\right)^{1/p}

if $p<\infty$, and $\|f\|_{W^{n,\infty}(\Omega)}:=\max_{0\leq|\alpha|\leq n}\left\|D^{\alpha}f\right\|_{L^{\infty}(\Omega)}$.

Furthermore, for $\bm{f}=(f_{1},\ldots,f_{d})$, $\bm{f}\in W^{1,\infty}(\Omega,\mathbb{R}^{d})$ if and only if $f_{i}\in W^{1,\infty}(\Omega)$ for each $i=1,2,\ldots,d$, and

\|\bm{f}\|_{W^{1,\infty}(\Omega,\mathbb{R}^{d})}:=\max_{i=1,\ldots,d}\{\|f_{i}\|_{W^{1,\infty}(\Omega)}\}.

When $p=2$, denote $W^{n,2}(\Omega)$ as $H^{n}(\Omega)$ for $n\in\mathbb{N}_{+}$.

Proposition 1 ([26]).

Suppose $\sigma$ is a continuous non-polynomial function and $K$ is a compact set in $\mathbb{R}^{d}$. Then there exist a positive integer $p$, constants $w_{k},\zeta_{k}$ for $k=1,\ldots,p$, and bounded linear functionals $c_{k}:H^{r}(K)\to\mathbb{R}$ such that for any $v\in H^{r}(K)$,

\left\|v-\sum_{k=1}^{p}c_{k}(v)\sigma\left(\bm{w}_{k}\cdot\bm{x}+\zeta_{k}\right)\right\|_{L^{2}(K)}\leq cp^{-r/d}\|v\|_{H^{r}(K)}. (13)
Proposition 2 ([28, 36]).

Suppose $\sigma$ is a continuous non-polynomial function and $\Omega$ is a compact subset of $\mathbb{R}^{d}$. For any Lipschitz-continuous function $f$, there exists a shallow neural network such that

\left\|f-\sum_{j=1}^{m}a_{j}\sigma\left(\bm{\omega}_{j}\cdot\bm{x}+b_{j}\right)\right\|_{\infty}\leq Cm^{-1/d}, (14)

where $C$ depends on the Lipschitz constant but is independent of $m$.

Lemma 1 ([18]).

The $\epsilon$-covering number of $[-B,B]^{d}$, $K(\epsilon)$, satisfies

K(\epsilon)\leqslant\left(\frac{CB}{\epsilon}\right)^{d},

for some constant $C>0$ independent of $\epsilon$, $B$, and $d$.


B.2 Proof of Theorem 1

In this subsection, we present the proof of Theorem 1, which describes the approximation ability of DeepONet. The sketch of the proof is illustrated in Fig. 6.

Figure 6: The sketch of the proof of Theorem 1. The details of each step are provided in the following.
Proof of Theorem 1.

Step 1: First, we verify that the target operator $\mathcal{G}(u)$ is well-defined.

Due to Assumption 1 (i) and [6, Theorem 6 in Section 6.2], we know that for $u\in\mathcal{X}\subset H^{2}(\Omega)$, Equation (2) has a unique solution. This means that $\mathcal{G}(u)$ is a well-defined operator on the input space $\mathcal{X}$.

Step 2: Second, we verify that $\mathcal{G}(u)$ is a Lipschitz-continuous operator in $H^{2}(\Omega)$ for $u\in\mathcal{X}$.

Consider the following:

f^{\prime}(u+v)=f^{\prime}(u)+vf^{\prime\prime}(\xi_{1})
f(u+v)=f(u)+vf^{\prime}(\xi_{2})
\delta u_{v}(\bm{x})=\delta u(\bm{x})+\epsilon(\bm{x}) (15)

where $\delta u(\bm{x})$ is the solution of Eq. (2) for the input $u$, and $\delta_{v}u(\bm{x})$ is the solution of Eq. (2) for the input $u+v$. Denote

\delta_{v}u(\bm{x})-\delta u(\bm{x})=:\epsilon(\bm{x}).

Therefore, we have:

\begin{cases}(\mathcal{L}-f^{\prime}(u+v))\epsilon(\bm{x})=\Delta v-v(f^{\prime}(\xi_{2})+f^{\prime\prime}(\xi_{1})\delta u),&\bm{x}\in\Omega\\ \epsilon(\bm{x})=0,&\bm{x}\in\partial\Omega.\end{cases} (16)

Since $u$ and $v$ are in $H^{2}$ and $\partial\Omega$ is $C^{2}$ (Assumption 1 (iii)), according to [6, Theorem 4 in Section 6.3], there exist constants $C$ (in this paper, we consistently employ the symbol $C$ for a constant that may vary from line to line) and $\bar{C}$ such that:

\|\epsilon(\bm{x})\|_{H^{2}(\Omega)}\leq C\|\mathcal{L}v-v(f^{\prime}(\xi_{2})+f^{\prime\prime}(\xi_{1})\delta u)\|_{L^{2}(\Omega)}\leq\bar{C}\|v(\bm{x})\|_{H^{2}(\Omega)}. (17)

The last inequality is due to the boundedness of $f^{\prime}(\xi_{2})+f^{\prime\prime}(\xi_{1})\delta u$ (Assumption 1 (ii)).

Step 3: In the approximation, we first reduce the operator learning to functional learning.

When the input function u(𝒙)u(\bm{x}) belongs to 𝒳H2\mathcal{X}\subset H^{2}, the output function δu\delta u also belongs to H2H^{2}, provided that Ω\partial\Omega is of class C2C^{2}. The function 𝒢(u)=δu\mathcal{G}(u)=\delta u can be approximated by a two-layer network architected by the activation function σ(x)\sigma(x), which is not a polynomial, in the following form by Proposition 1 [26] (given in Subsection 16):

𝒢(u)(𝒙)k=1pck[𝒢(u)]σ(𝒘k𝒙+ζk)L2(Ω)C1p2d𝒢(u)H2(Ω),\left\|\mathcal{G}(u)(\bm{x})-\sum_{k=1}^{p}c_{k}[\mathcal{G}(u)]\sigma\left(\bm{w}_{k}\cdot\bm{x}+\zeta_{k}\right)\right\|_{L^{2}(\Omega)}\leq C_{1}p^{-\frac{2}{d}}\|\mathcal{G}(u)\|_{H^{2}(\Omega)}, (18)

where 𝒘kd\bm{w}_{k}\in\mathbb{R}^{d}, ζk\zeta_{k}\in\mathbb{R} for k=1,,pk=1,\ldots,p, ckc_{k} is a continuous functional, and C1C_{1} is a constant independent of the parameters.

Denote ϕk(u)=ck[𝒢(u)]\phi_{k}(u)=c_{k}[\mathcal{G}(u)], which is a Lipschitz-continuous functional from H2(Ω)H^{2}(\Omega) to \mathbb{R}, since 𝒢\mathcal{G} is a Lipschitz-continuous operator and ckc_{k} is a continuous linear functional. The remaining task in the approximation is to approximate these functionals by neural networks.

Step 4: In this step, we reduce the functional learning to function learning by applying the operator 𝒫\mathcal{P} as in Assumption 1 (iv).

Based on ϕk(u)\phi_{k}(u) being a Lipschitz-continuous functional in H2(Ω)H^{2}(\Omega), we have

|ϕk(u)ϕk(𝒫u)|Lku𝒫uH2(Ω)Lkϵ,|\phi_{k}(u)-\phi_{k}(\mathcal{P}u)|\leq L_{k}\|u-\mathcal{P}u\|_{H^{2}(\Omega)}\leq L_{k}\epsilon,

where LkL_{k} is the Lipschitz constant of ϕk(u)\phi_{k}(u) for u𝒳u\in\mathcal{X}.

Furthermore, since 𝒫u\mathcal{P}u lies in an nn-dimensional space, i.e., it can be represented by the nn-dimensional vector 𝒫¯un\bar{\mathcal{P}}u\in\mathbb{R}^{n}, we can rewrite ϕk(𝒫u)\phi_{k}(\mathcal{P}u) as ψk(𝒫¯u)\psi_{k}(\bar{\mathcal{P}}u), where ψk:n\psi_{k}:\mathbb{R}^{n}\to\mathbb{R} for k=1,,pk=1,\ldots,p. Moreover, ψk\psi_{k} is a Lipschitz-continuous function since ϕk\phi_{k} is Lipschitz-continuous and 𝒫\mathcal{P} is a continuous linear operator.

Step 5: In this step, we will approximate ψk\psi_{k} using shallow neural networks.

Due to Proposition 2, we have that there is a shallow neural network such that

\left\|\psi_{k}(\bar{\mathcal{P}}u)-\bm{A}_{k}\sigma\left(\bm{M}_{k}\cdot\bar{\mathcal{P}}u+\bm{b}_{k}\right)\right\|_{\infty}\leq Cm^{-1/n}, (19)

where 𝑨k1×m\bm{A}_{k}\in\mathbb{R}^{1\times m}, 𝑴km×n\bm{M}_{k}\in\mathbb{R}^{m\times n}, and 𝒃km\bm{b}_{k}\in\mathbb{R}^{m}. For notational simplicity, we replace 𝑴k𝒫¯\bm{M}_{k}\cdot\bar{\mathcal{P}} by an operator 𝒲k\mathcal{W}_{k}.
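More precisely, since \psi_{k}(\bar{\mathcal{P}}u)=\phi_{k}(\mathcal{P}u) and \phi_{k}(u)=c_{k}[\mathcal{G}(u)], the error in approximating each coefficient functional splits into the two contributions from Steps 4 and 5:

\left|\bm{A}_{k}\sigma\left(\mathcal{W}_{k}u+\bm{b}_{k}\right)-c_{k}[\mathcal{G}(u)]\right|\leq\left|\bm{A}_{k}\sigma\left(\mathcal{W}_{k}u+\bm{b}_{k}\right)-\psi_{k}(\bar{\mathcal{P}}u)\right|+\left|\phi_{k}(\mathcal{P}u)-\phi_{k}(u)\right|\leq Cm^{-1/n}+L_{k}\epsilon.

Inserting this into (18), and using that \sigma(\bm{w}_{k}\cdot\bm{x}+\zeta_{k}) is bounded on the bounded domain \Omega, controls the total error.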

Combining the estimates above, there is a neural network in Ξp\Xi_{p} such that

k=1p𝑨kσ(𝒲ku+𝒃k)σ(𝒘k𝒙+ζk)𝒢(u)L2(Ω)C1m1n+C2(ϵ+p2d),\left\|\sum_{k=1}^{p}\bm{A}_{k}\sigma\left(\mathcal{W}_{k}u+\bm{b}_{k}\right)\sigma\left(\bm{w}_{k}\cdot\bm{x}+\zeta_{k}\right)-\mathcal{G}(u)\right\|_{L^{2}(\Omega)}\leq C_{1}m^{-\frac{1}{n}}+C_{2}(\epsilon+p^{-\frac{2}{d}}), (20)

where C1C_{1} is a constant independent of mm, ϵ\epsilon, and pp, and C2C_{2} is a constant depending on pp.
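To make the structure of the class \Xi_{p} concrete, the following minimal NumPy sketch (illustrative only; the parameters are random placeholders and the sizes are hypothetical) evaluates a network of the form appearing in (20), with a branch factor \bm{A}_{k}\sigma(\bm{M}_{k}\bar{\mathcal{P}}u+\bm{b}_{k}) and a trunk factor \sigma(\bm{w}_{k}\cdot\bm{x}+\zeta_{k}) for each k:

import numpy as np

# Minimal sketch of a network in the class Xi_p from (20); parameters are random placeholders.
rng = np.random.default_rng(0)
p, m, n, d = 4, 8, 16, 2             # hypothetical sizes: p terms, width m, n sensors, dimension d
sigma = np.tanh                       # any non-polynomial activation

A = rng.normal(size=(p, m))           # A_k (row vectors), k = 1, ..., p
M = rng.normal(size=(p, m, n))        # M_k, acting on the discretized input P-bar u
b = rng.normal(size=(p, m))           # b_k
w = rng.normal(size=(p, d))           # w_k
zeta = rng.normal(size=(p,))          # zeta_k

def operator_net(Pu, x):
    """Evaluate sum_k [A_k sigma(M_k Pu + b_k)] * sigma(w_k . x + zeta_k)."""
    branch = np.array([A[k] @ sigma(M[k] @ Pu + b[k]) for k in range(p)])  # shape (p,)
    trunk = sigma(w @ x + zeta)                                            # shape (p,)
    return float(branch @ trunk)

print(operator_net(rng.normal(size=n), np.array([0.3, 0.7])))

This is precisely the branch-trunk (DeepONet-type) form used in the approximation above.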

Here, we discuss the embedding operator 𝒫\mathcal{P} in more detail. One approach is to use the Argyris element [2]. This method involves considering the 2121 degrees of freedom shown in Fig. 7. In this figure, each \bullet denotes evaluation at a point, the inner circle represents evaluation of the gradient at the center, and the outer circle denotes evaluation of the three second derivatives at the center. The arrows indicate the evaluation of the normal derivatives at the three midpoints.

Figure 7: Argyris method
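For concreteness, the 21 degrees of freedom of the Argyris triangle can be tallied as follows (a standard count):

3\times\underbrace{(1+2+3)}_{\text{value, gradient, second derivatives at each vertex}}+\underbrace{3}_{\text{normal derivatives at the edge midpoints}}=21.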

An alternative approach to discretizing the input space is to use the bi-cubic Hermite finite element method [19, 13]; a simplified one-dimensional illustration is sketched below.
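The sketch below (illustrative only; the actual constructions are the two-dimensional Argyris and bi-cubic Hermite elements above, and the function names here are hypothetical) shows how a Hermite-type discretization turns an input function into the finite vector \bar{\mathcal{P}}u fed to the functionals \psi_{k}:

import numpy as np

# 1D illustration: a cubic Hermite interpolant on a grid is determined by the nodal values
# u(x_i) and nodal derivatives u'(x_i), so P-bar u stacks these numbers into one vector.
def hermite_dofs(u, du, nodes):
    """Return the coefficient vector P-bar u = (u(x_0), ..., u(x_N), u'(x_0), ..., u'(x_N))."""
    return np.concatenate([u(nodes), du(nodes)])

nodes = np.linspace(0.0, 1.0, 5)                  # 5 nodes -> n = 10 degrees of freedom
u = lambda x: np.sin(np.pi * x)                   # a sample input function
du = lambda x: np.pi * np.cos(np.pi * x)          # its derivative
Pu = hermite_dofs(u, du, nodes)                   # the n-dimensional input to the branch part
print(Pu.shape)                                   # (10,)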

B.3 Proof of Theorem 2

The proof of Theorem 2 is inspired by that in [18].

Proof of Theorem 2.

Step 1: To begin with, we introduce an intermediate quantity Sm(𝜽)\mathcal{E}_{Sm}(\bm{\theta}), defined as follows:

Sm(𝜽):=1Muj=1MuΩ|𝒢(uj)(𝒙)𝒪(uj;𝜽)(𝒙)|2d𝒙.\mathcal{E}_{Sm}(\bm{\theta}):=\frac{1}{M_{u}}\sum_{j=1}^{M_{u}}\int_{\Omega}\left|\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x})\right|^{2}\mathrm{d}\bm{x}.

This term represents the limiting case of Sc(𝜽)\mathcal{E}_{Sc}(\bm{\theta}) as the number of sample points in the domain of the output space tends to infinity (MxM_{x}\to\infty).
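As a toy illustration of the quantities being compared (a one-dimensional domain and simple stand-ins for \mathcal{G} and \mathcal{O}; not the paper's setup, and all names and values below are placeholders), the fully discrete loss averages the squared error over M_u inputs and M_x points, whereas \mathcal{E}_{Sm} replaces the M_x-point average by the integral over \Omega:

import numpy as np

rng = np.random.default_rng(0)
M_u, M_x = 20, 50

def G(a, x):   # hypothetical target operator, with a scalar parameter a standing in for u_j
    return np.sin(np.pi * a * x)

def O(a, x):   # hypothetical operator-network output (stand-in)
    return np.sin(np.pi * a * x) + 0.01 * x

a_samples = rng.uniform(0.5, 1.5, size=M_u)

# Fully discrete loss: average over M_u inputs and M_x random points in Omega = (0, 1).
x_pts = rng.uniform(0.0, 1.0, size=(M_u, M_x))
loss_discrete = np.mean([(G(a, x) - O(a, x)) ** 2 for a, x in zip(a_samples, x_pts)])

# Intermediate loss E_Sm: the M_x -> infinity limit, realized here by a fine quadrature grid.
x_quad = np.linspace(0.0, 1.0, 2001)
loss_Sm = np.mean([np.trapz((G(a, x_quad) - O(a, x_quad)) ** 2, x_quad) for a in a_samples])

print(loss_discrete, loss_Sm)   # their gap is the M_x-sampling error bounded in Step 2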

Then the error can be divided into two parts:

|\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sc}(\bm{\theta}))|\leq|\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sm}(\bm{\theta}))|+|\mathbf{E}(\mathcal{E}_{Sm}(\bm{\theta})-\mathcal{E}_{Sc}(\bm{\theta}))|. (21)

Step 2: For |\mathbf{E}(\mathcal{E}_{Sm}(\bm{\theta})-\mathcal{E}_{Sc}(\bm{\theta}))|, this is a classical generalization-error analysis, and the result can be obtained from [30, 34]. We omit the details of this part; the bound can be expressed as

|𝐄(Sm(𝜽)Sc(𝜽))|Cd𝜽logMxMx,|\mathbf{E}(\mathcal{E}_{Sm}(\bm{\theta})-\mathcal{E}_{Sc}(\bm{\theta}))|\leq\frac{Cd_{\bm{\theta}}\log M_{x}}{M_{x}}, (22)

where CC is independent of the number of parameters d𝜽d_{\bm{\theta}} and the sample size MxM_{x}. In the following steps, we are going to estimate |𝐄(S(𝜽)Sm(𝜽))||\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sm}(\bm{\theta}))|, which is the error that comes from the sampling of the input space of the operator.

Step 3: Denote

S𝜽M:=1Mj=1MΩ|𝒢(uj)(𝒙)𝒪(uj;𝜽)(𝒙)|2d𝒙.S_{\bm{\theta}}^{M}:=\frac{1}{M}\sum_{j=1}^{M}\int_{\Omega}\left|\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x})\right|^{2}\mathrm{d}\bm{x}.

We first estimate the gap between S𝜽MS_{\bm{\theta}}^{M} and S𝜽MS_{\bm{\theta}^{\prime}}^{M} for any bounded parameters 𝜽,𝜽\bm{\theta},\bm{\theta}^{\prime}. Due to Assumption 2 (i) and (ii), we have that

|S𝜽MS𝜽M|\displaystyle|S_{\bm{\theta}}^{M}-S_{\bm{\theta}^{\prime}}^{M}|
\displaystyle\leq 1Mj=1M|Ω|𝒢(uj)(𝒙)𝒪(uj;𝜽)(𝒙)|2|𝒢(uj)(𝒙)𝒪(uj;𝜽)(𝒙)|2d𝒙|\displaystyle\frac{1}{M}\sum_{j=1}^{M}\left|\int_{\Omega}\left|\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x})\right|^{2}-\left|\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta}^{\prime})(\bm{x})\right|^{2}\mathrm{d}\bm{x}\right|
\displaystyle\leq\frac{1}{M}\sum_{j=1}^{M}\left|\int_{\Omega}\left|2\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta}^{\prime})(\bm{x})\right|\cdot\left|\mathcal{O}(u_{j};\bm{\theta})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta}^{\prime})(\bm{x})\right|\mathrm{d}\bm{x}\right|
\displaystyle\leq 4Mj=1MΨ(uj)Φ(uj)𝜽𝜽.\displaystyle\frac{4}{M}\sum_{j=1}^{M}\Psi(u_{j})\Phi(u_{j})\cdot\left\|\bm{\theta}-\bm{\theta}^{\prime}\right\|_{\ell^{\infty}}. (23)
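For clarity, the middle step in (23) uses the pointwise factorization of a difference of squares with a=\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x}) and b=\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta}^{\prime})(\bm{x}):

a^{2}-b^{2}=(a+b)(a-b),\qquad a+b=2\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta}^{\prime})(\bm{x}),\qquad a-b=\mathcal{O}(u_{j};\bm{\theta}^{\prime})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x}),

while the last step is where Assumption 2 (i) and (ii) enter, via a pointwise bound \Psi(u_{j}) on the magnitudes of \mathcal{G}(u_{j}) and \mathcal{O}(u_{j};\cdot) and the Lipschitz-in-\bm{\theta} factor \Phi(u_{j}).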

Step 4: Based on Step 3, we are going to estimate

𝐄[sup𝜽[B,B]d𝜽|S𝜽M𝐄S𝜽M|p]1p\mathbf{E}\left[\sup_{\bm{\theta}\in[-B,B]^{d_{\bm{\theta}}}}\left|S_{\bm{\theta}}^{M}-\mathbf{E}S_{\bm{\theta}}^{M}\right|^{p}\right]^{\frac{1}{p}}

by a covering argument over the parameter space.

Let {𝜽1,,𝜽K}\{\bm{\theta}_{1},\ldots,\bm{\theta}_{K}\} be an ϵ\epsilon-covering of [B,B]d𝜽[-B,B]^{d_{\bm{\theta}}}, i.e., for any 𝜽\bm{\theta} in [B,B]d𝜽[-B,B]^{d_{\bm{\theta}}}, there exists an index jj with 𝜽𝜽jϵ\left\|\bm{\theta}-\bm{\theta}_{j}\right\|_{\ell_{\infty}}\leqslant\epsilon; we write j(𝜽)j(\bm{\theta}) for such an index. Then we have

\displaystyle\mathbf{E}\left[\sup_{\bm{\theta}\in[-B,B]^{d_{\bm{\theta}}}}\left|S_{\bm{\theta}}^{M}-\mathbf{E}\left[S_{\bm{\theta}}^{M}\right]\right|^{p}\right]^{1/p}
\displaystyle\leq\mathbf{E}\left[\sup_{\bm{\theta}\in[-B,B]^{d_{\bm{\theta}}}}\left(\left|S_{\bm{\theta}}^{M}-S_{\bm{\theta}_{j(\bm{\theta})}}^{M}\right|+\left|S_{\bm{\theta}_{j(\bm{\theta})}}^{M}-\mathbf{E}\left[S_{\bm{\theta}_{j(\bm{\theta})}}^{M}\right]\right|+\left|\mathbf{E}\left[S_{\bm{\theta}_{j(\bm{\theta})}}^{M}\right]-\mathbf{E}\left[S_{\bm{\theta}}^{M}\right]\right|\right)^{p}\right]^{1/p}
\displaystyle\leq\mathbf{E}\left[\left(\max_{j=1,\ldots,K}\left|S_{\bm{\theta}_{j}}^{M}-\mathbf{E}\left[S_{\bm{\theta}_{j}}^{M}\right]\right|+\frac{8\epsilon}{M}\sum_{i=1}^{M}\Psi(u_{i})\Phi(u_{i})\right)^{p}\right]^{1/p}
\displaystyle\leq 8\epsilon\,\mathbf{E}\left[|\Psi\Phi|^{p}\right]^{1/p}+\mathbf{E}\left[\max_{j=1,\ldots,K}\left|S_{\bm{\theta}_{j}}^{M}-\mathbf{E}\left[S_{\bm{\theta}_{j}}^{M}\right]\right|^{p}\right]^{1/p}. (24)
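The last step in (24) uses the triangle (Minkowski) inequality in L^{p} together with the fact that the samples u_{i} are identically distributed:

\mathbf{E}\left[\left(\frac{1}{M}\sum_{i=1}^{M}\Psi(u_{i})\Phi(u_{i})\right)^{p}\right]^{1/p}\leq\frac{1}{M}\sum_{i=1}^{M}\mathbf{E}\left[\left(\Psi(u_{i})\Phi(u_{i})\right)^{p}\right]^{1/p}=\mathbf{E}\left[|\Psi\Phi|^{p}\right]^{1/p}.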

For the term 8ϵ𝐄[|ΨΦ|p]1/p8\epsilon\mathbf{E}\left[|\Psi\Phi|^{p}\right]^{1/p}, the Cauchy-Schwarz inequality gives

8ϵ𝐄[|ΨΦ|p]1/p8ϵ𝐄[|Ψ|2p]1/2p𝐄[|Φ|2p]1/2p=8ϵΨL2pΦL2p.8\epsilon\mathbf{E}\left[|\Psi\Phi|^{p}\right]^{1/p}\leqslant 8\epsilon\mathbf{E}\left[|\Psi|^{2p}\right]^{1/2p}\mathbf{E}\left[|\Phi|^{2p}\right]^{1/2p}=8\epsilon\|\Psi\|_{L^{2p}}\|\Phi\|_{L^{2p}}.

For 𝐄[maxj=1,,K|S𝜽jM𝐄[S𝜽jM]|p]1/p\mathbf{E}\left[\max_{j=1,\ldots,K}\left|S_{\bm{\theta}_{j}}^{M}-\mathbf{E}\left[S_{\bm{\theta}_{j}}^{M}\right]\right|^{p}\right]^{1/p}, by applying the result in [32, 18], we have

𝐄[maxj=1,,K|S𝜽jM𝐄[S𝜽jM]|p]1/p16K1/ppΨL2p2M.\mathbf{E}\left[\max_{j=1,\ldots,K}\left|S_{{\bm{\theta}}_{j}}^{M}-\mathbf{E}\left[S_{{\bm{\theta}}_{j}}^{M}\right]\right|^{p}\right]^{1/p}\leq\frac{16K^{1/p}\sqrt{p}\|\Psi\|_{L^{2p}}^{2}}{\sqrt{M}}.

Step 5: Now we estimate |𝐄(S(𝜽)Sm(𝜽)||\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sm}(\bm{\theta})|.

Due to Assumption 2 and a direct calculation, we have that

ΨL2p,ΦL2pC(1+γκp)κ,\|\Psi\|_{L^{2p}},\|\Phi\|_{L^{2p}}\leqslant C(1+\gamma\kappa p)^{\kappa},

for constants C,γ>0C,\gamma>0, depending only on the measure μ\mu and the constant appearing in the upper bound (7).

Based on Lemma 1, we have that

𝐄[sup𝜽[B,B]d𝜽|S𝜽Mu𝐄[S𝜽Mu]|p]1/p16C2(1+γκp)2κ(ϵ+(CBϵ)d𝜽/ppMu),\mathbf{E}\left[\sup_{{\bm{\theta}}\in[-B,B]^{d_{\bm{\theta}}}}\left|S_{\bm{\theta}}^{M_{u}}-\mathbf{E}\left[S_{\bm{\theta}}^{M_{u}}\right]\right|^{p}\right]^{1/p}\leqslant 16C^{2}(1+\gamma\kappa p)^{2\kappa}\left(\epsilon+\left(\frac{CB}{\epsilon}\right)^{d_{\bm{\theta}}/p}\frac{\sqrt{p}}{\sqrt{M_{u}}}\right),

for some constants C,γ>0C,\gamma>0, independent of κ,μ,B,d𝜽,N,ϵ>0\kappa,\mu,B,d_{\bm{\theta}},N,\epsilon>0 and p2p\geqslant 2. We now choose ϵ=1Mu\epsilon=\frac{1}{\sqrt{M_{u}}}, so that

ϵ+(CBϵ)d𝜽/ppMu=1Mu(1+(CBMu)d𝜽/pp).\epsilon+\left(\frac{CB}{\epsilon}\right)^{d_{\bm{\theta}}/p}\frac{\sqrt{p}}{\sqrt{M_{u}}}=\frac{1}{\sqrt{M_{u}}}\left(1+(CB\sqrt{M_{u}})^{d_{\bm{\theta}}/p}\sqrt{p}\right).

Next, let p=d𝜽log(CBMu)p=d_{\bm{\theta}}\log(CB\sqrt{M_{u}}). Then,

(CBMu)d𝜽/pp=exp(log(CBMu)d𝜽p)p=ed𝜽log(CBMu),(CB\sqrt{M_{u}})^{d_{\bm{\theta}}/p}\sqrt{p}=\exp\left(\frac{\log(CB\sqrt{M_{u}})d_{\bm{\theta}}}{p}\right)\sqrt{p}=e\sqrt{d_{\bm{\theta}}\log(CB\sqrt{M_{u}})},

and thus we conclude that

\epsilon+\left(\frac{CB}{\epsilon}\right)^{d_{\bm{\theta}}/p}\frac{\sqrt{p}}{\sqrt{M_{u}}}\leqslant\frac{1}{\sqrt{M_{u}}}\left(1+e\sqrt{d_{\bm{\theta}}\log(CB\sqrt{M_{u}})}\right).

On the other hand, we have

(1+γκp)2κ=(1+γκd𝜽log(CBMu))2κ.(1+\gamma\kappa p)^{2\kappa}=\left(1+\gamma\kappa d_{\bm{\theta}}\log(CB\sqrt{M_{u}})\right)^{2\kappa}.

Increasing the constant C>0C>0, if necessary, we can further estimate

\left(1+\gamma\kappa d_{\bm{\theta}}\log(CB\sqrt{M_{u}})\right)^{2\kappa}\left(1+e\sqrt{d_{\bm{\theta}}\log(CB\sqrt{M_{u}})}\right)\leqslant C\left(1+d_{\bm{\theta}}\log(CB\sqrt{M_{u}})\right)^{2\kappa+1/2},

where C>0C>0 depends on κ,γ,μ\kappa,\gamma,\mu and the constant appearing in (7), but is independent of d𝜽,Bd_{\bm{\theta}},B and NN. We can express this dependence in the form C=C(μ,Ψ,Φ)>0C=C(\mu,\Psi,\Phi)>0, as the constants κ\kappa and γ\gamma depend on the Gaussian tail of μ\mu and the upper bound on Ψ,Φ\Psi,\Phi.

Therefore,

\displaystyle|\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sm}(\bm{\theta}))| \leq\mathbf{E}\left[\sup_{\bm{\theta}\in[-B,B]^{d_{\bm{\theta}}}}\left|S_{\bm{\theta}}^{M_{u}}-\mathbf{E}\left[S_{\bm{\theta}}^{M_{u}}\right]\right|\right]
\displaystyle\leq\frac{C}{\sqrt{M_{u}}}\left(1+d_{\bm{\theta}}\log(CB\sqrt{M_{u}})\right)^{2\kappa+1/2}. (25)
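For completeness, combining (21) with (22) and (25) yields, under the above assumptions and up to the values of the constants, a bound of the form

|\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sc}(\bm{\theta}))|\leq\frac{Cd_{\bm{\theta}}\log M_{x}}{M_{x}}+\frac{C}{\sqrt{M_{u}}}\left(1+d_{\bm{\theta}}\log(CB\sqrt{M_{u}})\right)^{2\kappa+1/2},

which is the type of estimate asserted in Theorem 2.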